
There is always IP filtering, DNS blocking, and HTTP agent screening. Just sayin'.



Based on the page footer ("IECC ChurnWare") I believe this is a site designed to waste the time of web crawlers and of tools that try to get root access on every domain. The robots.txt looks like this: https://ulysses-antoine-kurtis.web.sp.am/robots.txt

I don't see how this does much to keep bad actors away from other domains, but I can see why they wouldn't want to give the game away just to get OpenAI to stop crawling.


I think he's saying that it's not a problem for him, but for OpenAI?


Yup, that's my impression as well. He's just being nice by letting OpenAI know they have a problem. Usually this should be rewarded with a nice "hey, you guys have a bug" bounty, because not long ago some VP from OpenAI was lamenting that training their AI comes at, and this is a direct quote, an "eye watering" cost (on the order of millions of dollars per second).


I would be a little sceptical about that figure. 3 million dollars per second, kept up for a year, is around the entire world's annual GDP.

I get it: AI training is expensive, but I don't believe it's that expensive.


Thank you for that perspective. I always appreciate it when people put numbers like these in context.

Also, 1 million per second is 60 million per minute, 3.6 billion per hour, and 86.4 billion per day. That's roughly 2.6 trillion per month, about the whole value of a FAANG company...


Sam Altman has said in a few interviews that it was around $100 million for GPT-3, and higher for GPT-4.

But yes, this is a one-time cost, and far lower than the "millions of dollars per second" in the GP comment.

https://fortune.com/2024/04/04/ai-training-costs-how-much-is...

https://www.wired.com/story/openai-ceo-sam-altman-the-age-of...


> Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page.


The reason that bit is relevant is that robots.txt is only applicable to the current domain. Because each "page" is a different subdomain, the crawler needs to fetch the robots.txt for every single page request.
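
A rough sketch of what that implies, using Python's urllib.robotparser purely for illustration (this is not OpenAI's actual crawler code): a well-behaved crawler has to fetch a separate robots.txt for each host before it may request anything else from it.

  # Each subdomain is its own host, so it gets its own robots.txt fetch.
  from urllib.robotparser import RobotFileParser

  hosts = [
      "ulysses-antoine-kurtis.web.sp.am",  # one of the content-farm subdomains
      "www.web.sp.am",                     # a different host = a different robots.txt
  ]

  for host in hosts:
      rp = RobotFileParser(f"https://{host}/robots.txt")
      rp.read()  # one extra HTTP request per subdomain, before any page fetch
      print(host, rp.can_fetch("GPTBot", f"https://{host}/"))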

What the poster was suggesting is blocking them at a higher level - e.g. a user-agent block in .htaccess or an IP block in iptables or similar. That would be a one-stop fix. It would also defeat the purpose of the website, however, which is to waste the time of crawlers.
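
For illustration, the same idea sketched in Python rather than .htaccess or iptables; the user-agent substring is the real mechanism, but the IP range below is a documentation placeholder, not GPTBot's actual published range.

  import ipaddress

  BLOCKED_AGENTS = ("GPTBot",)
  BLOCKED_NETS = [ipaddress.ip_network("192.0.2.0/24")]  # placeholder range, not GPTBot's

  def should_block(user_agent: str, remote_ip: str) -> bool:
      # Reject the request once, at the edge, before it reaches any of the sites.
      if any(agent in user_agent for agent in BLOCKED_AGENTS):
          return True
      addr = ipaddress.ip_address(remote_ip)
      return any(addr in net for net in BLOCKED_NETS)

  print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.9"))  # True (agent match)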


The real question is: how is GPTBot finding all the other subdomains? Currently the sites have GPTBot disallowed: https://www.web.sp.am/robots.txt

If GPTBot is compliant with the robots.txt specification then it can't read the URL containing the HTML to find the other subdomains.
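
That behaviour is easy to check with Python's standard-library robot parser; the rules below are a hypothetical stand-in with a blanket GPTBot disallow, as described above, not the exact contents of the real file.

  from urllib.robotparser import RobotFileParser

  # Hypothetical rules modelled on what the comment describes.
  rules = [
      "User-agent: GPTBot",
      "Disallow: /",
  ]

  rp = RobotFileParser()
  rp.parse(rules)

  # A compliant GPTBot never fetches the page body at all, so it cannot
  # discover links to the other subdomains embedded in the HTML.
  print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))        # False
  print(rp.can_fetch("SomeOtherBot", "https://www.web.sp.am/"))  # True (no rule for it)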

Either:

  1. GPTBot treats a disallow as a noindex but still requests the page itself. Note that Google doesn't treat a disallow as a noindex. They will still show your page in search results if they discover the link from other pages but they show it with a "No information is available for this page." disclaimer.
  2. The site didn't have a GPTBot disallow until they noticed the traffic spike, and the bot had already discovered a couple million links that need to be crawled.
  3. There is some other page out there on the internet that GPTBot discovered that links to millions of these subdomains. This seems possible and the subdomains really don't have any way to prevent a bot from requesting millions of robots.txt files. The only prevention here is to firewall the bot's IP range or work with the bot owners to implement better subdomain handling.



