From reading Drew DeVault's angry post from earlier this week, my take is that it's not only poorly implemented crawlers; it's also that it's cheaper to scrape than to keep copies on hand. Effectively these companies are outsourcing the storage of "their" training data to everyone on the internet.
Ideally a site would get scraped once, and then the scraper would check whether content has changed (e.g. via ETag), while also learning how frequently the content changes. So rather than just hammering some poor personal git repo over and over, it would learn that Monday is a good time to check whether something changed and then back off for a week.
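A rough sketch of the "learn the change frequency" part, just to show how little it takes; the record fields and thresholds here are made up for illustration, not anything any crawler actually does:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CrawlRecord:
    """Hypothetical per-URL bookkeeping a polite crawler could keep."""
    url: str
    interval: timedelta = timedelta(days=1)   # current guess at how often the page changes
    next_check: datetime = datetime.min

def schedule_next(record: CrawlRecord, changed: bool, now: datetime) -> None:
    """Adapt the revisit interval: back off when nothing changed, tighten when it did."""
    if changed:
        # Content changed: check a bit more often, but no more than hourly.
        record.interval = max(record.interval / 2, timedelta(hours=1))
    else:
        # Nothing changed: back off, up to a weekly check.
        record.interval = min(record.interval * 2, timedelta(weeks=1))
    record.next_check = now + record.interval
```

After a few unchanged fetches of that personal git repo, the interval caps out at a week instead of hammering it daily.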
That seems crazy - millions of $ on GPUs but they can't afford some cheap storage? And direct network scraping seems super high latency. Although I guess a massive pretraining run might cycle through the corpus very slowly. Dunno, sounds fishy.
Or could it be, just possibly (gasp), that some of the devs at these "hotshot" AI companies are ignorant, lazy, or pressured enough that they skip such normal checks? Wouldn't be surprised if so.
You think they do cache the data but don't use it?
For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.
>You think they do cache the data but don't use it?
that's not what I meant.
and it's not "they", it's "it".
i.e. the web server, not the bots or devs on the other end of the connection, is what tells you the needed info. all you have to do is check it and act accordingly, i.e. download the changed resource or don't download the unchanged one.
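Concretely, this is just conditional requests: the server hands you an ETag and/or Last-Modified, you echo them back, and a 304 means there's nothing to download. A rough stdlib sketch (where you store the validators between runs is up to you):

```python
import urllib.request
import urllib.error

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the server answers 304."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged: keep the old copy, nothing to download
            return None, etag, last_modified
        raise
```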
It's that not doing so means they can increase their profit numbers just a skosh more.
And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.
You are correct that it's poor and sloppy, but it's not "just" that. It's a lack of concern over the effects of their poor/sloppy crawler implementation.
The poor implementation is not really relevant; it's companies deciding they own the internet and can take whatever they want, and letting everyone else deal with the consequences. The companies do not care what the impact of their AI nonsense is.
It’s people that don’t care if they ruin things for everyone else.
Crawlers have existed forever in the search engine space and mostly behave.
This sort of no-rate-limit, fake-user-agent, hundreds-of-IPs approach used by AI teams is obviously a deliberate choice not to care who it fucks over. More malicious than sloppy implementation.
It is an ecosystem of social roles, not just "people". Casting the decision as individual choices is not the right filter for understanding this situation.
I'm not sure I'm following what you mean by 'social roles'. Which roles are you referring to here?
I'll disagree that there's no individual malicious choice in it, though. Someone decided that they needed to fake/change user agents (as one example), and implemented it. Most likely it was more than one person: some manager(s) or teams probably also either suggested or agreed to this choice.
I would like to think that at some point in this decision-making process, someone would have considered 'is it ethical to change user agents to get around bans? Is it ethical to ignore robots.txt?' and decided not to proceed, but apparently that's not happening here...
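For what it's worth, honoring robots.txt is a handful of lines with the Python standard library, which is part of why skipping it reads as a choice rather than an oversight. The URL and user agent below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; honest bots send a stable, truthful UA string.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/repo/commit/abc123"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url, "- skipping")
```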
Yet in the cases mentioned in the article, if they served a static HTTP cache of each page and used git hooks to refresh it, the bots' load would be negligible to their services. That is assuming the bots use HTTP on 80/443 instead of the git protocol on 9418.
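Roughly what that could look like: a post-receive hook (any executable works, here Python) that re-renders pages into a directory nginx or similar serves statically, so bot traffic never touches git at all. The render command and paths are placeholders, not a real tool:

```python
#!/usr/bin/env python3
"""Hypothetical post-receive hook: refresh the static HTML cache after each push.

Install as hooks/post-receive in the bare repo and make it executable.
"render-repo-pages" and CACHE_DIR stand in for whatever actually generates
and hosts the public pages.
"""
import subprocess
import sys

CACHE_DIR = "/var/www/repo-cache"   # served directly by the web server, no git involved

def main() -> int:
    # git feeds the hook one "<old-sha> <new-sha> <ref>" line per updated ref
    for line in sys.stdin:
        old, new, ref = line.split()
        # Placeholder: regenerate whatever pages depend on this ref.
        subprocess.run(
            ["render-repo-pages", "--ref", ref, "--out", CACHE_DIR],
            check=True,
        )
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```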
Sounds like it to me. Why build a crawler that fetches one HTML page per commit in a repository instead of doing a bare clone and just getting the data from there? Surely that would contain even more data, too, compared to the HTML pages.
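i.e. something like this: one network operation instead of one HTTP request per commit, and everything after that is local (the repo URL is a placeholder):

```python
import subprocess
import tempfile

REPO_URL = "https://example.com/some/project.git"   # placeholder

with tempfile.TemporaryDirectory() as workdir:
    # One clone fetches the full history...
    subprocess.run(["git", "clone", "--bare", REPO_URL, workdir], check=True)
    # ...and every commit message, diff, etc. is then readable locally.
    log = subprocess.run(
        ["git", "--git-dir", workdir, "log", "--stat", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    print(log[:2000])
```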
And poor, sloppy website implementation. If your professional website can't handle 20k hits, it's ... well, poor. My website, hosted on my desktop PC over my home connection, tanked 20k hits from the Alibaba bot yesterday (on top of a few thousand more hits of normal traffic) without missing a beat.
It is literally the point of public websites to answer HTTP requests. If yours can't, you're doing something wrong.