
There is always IP filtering, DNS blocking, and HTTP agent screening. Just sayin'.



Based on the page footer ("IECC ChurnWare") I believe this is a site designed to waste the time of web crawlers and of tools that try to get root access on every domain. The robots.txt looks like this: https://ulysses-antoine-kurtis.web.sp.am/robots.txt

I don't see how this does much to keep bad actors away from other domains, but I can see why they wouldn't want to give the game away just to get OpenAI to stop crawling.


I think he's saying that it's not a problem for him, but for OpenAI?


Yup, that's my impression as well. He's just being nice by letting OpenAI know they have a problem. Usually this should be rewarded with a nice "hey, you guys have a bug" bounty, because not long ago some VP from OpenAI was lamenting that training their AI comes at, and this is a direct quote, an "eye watering" cost (on the order of millions of dollars per second).


I would be a little sceptical about that figure. 3 million dollars per second, kept up for a year, is around the entire world's annual GDP.

I get it: AI training is expensive, but I don't believe it's that expensive.


Thank you for that perspective. I always appreciate it when people put numbers like these in context.

Also, 1 million per second is 60 million per minute, 3.6 billion per hour, and 86.4 billion per day. That's roughly 2.6 trillion per month, about the whole value of a FAANG company...


Sam Altman has said in a few interviews that it was around $100 million for GPT-3, and higher for GPT-4.

But yes, this is a one-time cost, and far lower than the "millions of dollars per second" in the GP comment.

https://fortune.com/2024/04/04/ai-training-costs-how-much-is...

https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...


> Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page.


The reason that bit is relevant is that robots.txt is only applicable to the current domain. Because each "page" is a different subdomain, the crawler needs to fetch the robots.txt for every single page request.
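
A rough sketch of what that implies, using Python's urllib.robotparser purely for illustration (this is not OpenAI's actual crawler code): a well-behaved crawler has to fetch a separate robots.txt for each host before it may request anything else from it.

  # Each subdomain is its own host, so it gets its own robots.txt fetch.
  from urllib.robotparser import RobotFileParser

  hosts = [
      "ulysses-antoine-kurtis.web.sp.am",  # one of the content-farm subdomains
      "www.web.sp.am",                     # a different host = a different robots.txt
  ]

  for host in hosts:
      rp = RobotFileParser(f"https://{host}/robots.txt")
      rp.read()  # one extra HTTP request per subdomain, before any page fetch
      print(host, rp.can_fetch("GPTBot", f"https://{host}/"))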

What the poster was suggesting is blocking them at a higher level - e.g. a user-agent block in .htaccess or an IP block in iptables or similar. That would be a one-stop fix. It would also defeat the purpose of the website, however, which is to waste the time of crawlers.
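
For illustration, the same idea sketched in Python rather than .htaccess or iptables; the user-agent substring is the real mechanism, but the IP range below is a documentation placeholder, not GPTBot's actual published range.

  import ipaddress

  BLOCKED_AGENTS = ("GPTBot",)
  BLOCKED_NETS = [ipaddress.ip_network("192.0.2.0/24")]  # placeholder range, not GPTBot's

  def should_block(user_agent: str, remote_ip: str) -> bool:
      # Reject the request once, at the edge, before it reaches any of the sites.
      if any(agent in user_agent for agent in BLOCKED_AGENTS):
          return True
      addr = ipaddress.ip_address(remote_ip)
      return any(addr in net for net in BLOCKED_NETS)

  print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.9"))  # True (agent match)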


The real question is: how is GPTBot finding all the other subdomains? Currently the sites have GPTBot disallowed: https://www.web.sp.am/robots.txt

If GPTBot is compliant with the robots.txt specification then it can't read the URL containing the HTML to find the other subdomains.
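
That behaviour is easy to check with Python's standard-library robot parser; the rules below are a hypothetical stand-in with a blanket GPTBot disallow, as described above, not the exact contents of the real file.

  from urllib.robotparser import RobotFileParser

  # Hypothetical rules modelled on what the comment describes.
  rules = [
      "User-agent: GPTBot",
      "Disallow: /",
  ]

  rp = RobotFileParser()
  rp.parse(rules)

  # A compliant GPTBot never fetches the page body at all, so it cannot
  # discover links to the other subdomains embedded in the HTML.
  print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))        # False
  print(rp.can_fetch("SomeOtherBot", "https://www.web.sp.am/"))  # True (no rule for it)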

Either:

  1. GPTBot treats a disallow as a noindex but still requests the page itself. Note that Google doesn't treat a disallow as a noindex. They will still show your page in search results if they discover the link from other pages but they show it with a "No information is available for this page." disclaimer.
  2. The site didn't have a GPTBot disallow until they noticed the traffic spike, and the bot had already discovered a couple million links that need to be crawled.
  3. There is some other page out there on the internet that GPTBot discovered that links to millions of these subdomains. This seems possible and the subdomains really don't have any way to prevent a bot from requesting millions of robots.txt files. The only prevention here is to firewall the bot's IP range or work with the bot owners to implement better subdomain handling.



