
I would presume the crawlers have a queue-based architecture with thousands of workers. It’s an amplification attack.

When a worker pulls a honeypot page off the queue, it crawls it, scrapes it, and finds X links on the page, where X is greater than 1. Those links get put on the crawler queue. Because there’s more than one link per page, each honeypot page scraped adds more links to the queue than it removed.
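A toy sketch of that dynamic in Python (a simplified single-worker frontier; the names and the 12-links-per-page figure are illustrative, not anything from an actual crawler):

  # Each dequeued honeypot page yields LINKS_PER_PAGE new URLs, so the
  # frontier grows by LINKS_PER_PAGE - 1 on every scrape.
  from collections import deque

  LINKS_PER_PAGE = 12

  def fetch_links(url):
      # Stand-in for fetching and parsing a honeypot page: every page
      # links to LINKS_PER_PAGE more unique, freshly generated pages.
      return [f"{url}/{i}" for i in range(LINKS_PER_PAGE)]

  frontier = deque(["https://honeypot.example/"])
  for step in range(5):
      url = frontier.popleft()           # 1 page removed...
      frontier.extend(fetch_links(url))  # ...a dozen added
      print(step, len(frontier))         # 12, 23, 34, 45, 56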

Other sites eventually leave the queue: they have a finite number of pages, so the crawler eventually has nothing new to enqueue from them.

Not so for the honeypot. It has a virtually infinite number of pages, so scraping a page almost deterministically grows the queue (1 page removed, a dozen added per scrape). Because the other sites eventually drain out, the queue ends up being nothing but the honeypot.
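You can watch that takeover in a toy simulation (made-up numbers, a FIFO frontier, one honeypot URL seeded among 100k ordinary pages that link to nothing new):

  import random
  from collections import deque

  random.seed(0)
  seed_urls = ["normal"] * 100_000 + ["honeypot"]
  random.shuffle(seed_urls)
  frontier = deque(seed_urls)

  for scrapes in range(1, 300_001):
      page = frontier.popleft()
      if page == "honeypot":
          # The honeypot always serves a dozen fresh links.
          frontier.extend(["honeypot"] * 12)
      # Ordinary pages add nothing once their sites are exhausted,
      # so the finite part of the frontier just drains away.
      if scrapes % 100_000 == 0:
          share = frontier.count("honeypot") / len(frontier)
          print(scrapes, len(frontier), f"{share:.0%}")

Once the last finite page is scraped, every remaining and future entry in the frontier points at the honeypot, and the queue only keeps growing.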

OpenAI is big enough that this probably wasn’t their entire queue, but I wouldn’t be surprised if it was a whole-digit percentage of it. The author said 1.8M requests; I don’t know the duration, but that’s equivalent to roughly 20 QPS sustained for an entire day. Not a crazy amount, but not insignificant either; it’s within the QPS Googlebot would send to a fairly large site like LinkedIn.
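Back-of-the-envelope, assuming those 1.8M requests landed over exactly one day:

  1,800,000 requests / 86,400 seconds/day ≈ 20.8 requests/second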



