From reading Drew DeVault's angry post from earlier this week, my take is that it's not only poorly implemented crawlers; it's also that it's cheaper to scrape than to keep copies on hand. Effectively these companies are outsourcing the storage of "their" training data to everyone on the internet.
Ideally a site would get scraped once, and then the scraper would check whether content has changed (e.g. via ETag), while also learning how frequently the content changes. So rather than just hammering some poor personal git repo over and over, it would learn that Monday is a good time to check whether something changed and then back off for a week.
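A rough sketch of the "learn the change frequency" part, just to show how little it takes; the record fields and thresholds here are made up for illustration, not anything any crawler actually does:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CrawlRecord:
    """Hypothetical per-URL bookkeeping a polite crawler could keep."""
    url: str
    interval: timedelta = timedelta(days=1)   # current guess at how often the page changes
    next_check: datetime = datetime.min

def schedule_next(record: CrawlRecord, changed: bool, now: datetime) -> None:
    """Adapt the revisit interval: back off when nothing changed, tighten when it did."""
    if changed:
        # Content changed: check a bit more often, but no more than hourly.
        record.interval = max(record.interval / 2, timedelta(hours=1))
    else:
        # Nothing changed: back off, up to a weekly check.
        record.interval = min(record.interval * 2, timedelta(weeks=1))
    record.next_check = now + record.interval
```

After a few unchanged fetches of that personal git repo, the interval caps out at a week instead of hammering it daily.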
That seems crazy - millions of $ on GPUs but they can't afford some cheap storage? And direct network scraping seems super high latency. Although I guess a massive pretraining run might cycle through the corpus very slowly. Dunno, sounds fishy.
Or could it be, just possibly (gasp), that some of the devs at these "hotshot" AI companies are ignorant, lazy, or pressured enough that they skip such normal checks? Wouldn't be surprised if so.
You think they do cache the data but don't use it?
For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.
>You think they do cache the data but don't use it?
that's not what I meant.
and it's not "they", it's "it".
i.e. the web server, not the bots or devs on the other end of the connection, is what tells you the needed info. all you have to do is check it and act accordingly, i.e. download the changed resource or don't download the unchanged one.
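Concretely, this is just conditional requests: the server hands you an ETag and/or Last-Modified, you echo them back, and a 304 means there's nothing to download. A rough stdlib sketch (where you store the validators between runs is up to you):

```python
import urllib.request
import urllib.error

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the server answers 304."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged: keep the old copy, nothing to download
            return None, etag, last_modified
        raise
```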
It's that not doing so means they can increase their profit numbers just a skosh more.
And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.
You are correct that it's poor and sloppy, but it's not "just" that. It's a lack of concern over the effects of their poor/sloppy crawler implementation.
The poor implementation is not really relevant; it's companies deciding they own the internet and can take whatever they want, and letting everyone else deal with the consequences. The companies do not care what the impact of their AI nonsense is.
It’s people that don’t care if they ruin things for everyone else.
Crawlers have existed forever in the search engine space and mostly behave.
This sort of no-rate-limit, fake-user-agent, hundreds-of-IPs approach used by AI teams is obviously a deliberate choice not to care who it fucks over. More malicious than sloppy implementation.
It is an ecosystem of social roles, not just "people". Casting the decision as individual choices is not the right filter for understanding this situation.
I'm not sure I'm following what you mean by 'social roles'. Which roles are you referring to here?
I'll disagree that there's no individual malicious choice in it, though. Someone decided that they needed to fake/change user agents (as one example), and implemented it. Most likely it was more than one person: some manager(s) or teams probably also either suggested or agreed to this choice.
I would like to think that at some point in this decision-making process, someone would have considered 'is it ethical to change user agents to get around bans? Is it ethical to ignore robots.txt?' and decided not to proceed, but apparently that's not happening here...
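For what it's worth, honoring robots.txt is a handful of lines with the Python standard library, which is part of why skipping it reads as a choice rather than an oversight. The URL and user agent below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; honest bots send a stable, truthful UA string.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/repo/commit/abc123"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url, "- skipping")
```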
Yet in the cases mentioned in the article, if they served a static HTTP cache of each page and used git hooks to refresh it, the bots' load would be negligible to their services. That is assuming the bots use HTTP on 80/443 instead of the git protocol on 9418.
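Roughly what that could look like: a post-receive hook (any executable works, here Python) that re-renders pages into a directory nginx or similar serves statically, so bot traffic never touches git at all. The render command and paths are placeholders, not a real tool:

```python
#!/usr/bin/env python3
"""Hypothetical post-receive hook: refresh the static HTML cache after each push.

Install as hooks/post-receive in the bare repo and make it executable.
"render-repo-pages" and CACHE_DIR stand in for whatever actually generates
and hosts the public pages.
"""
import subprocess
import sys

CACHE_DIR = "/var/www/repo-cache"   # served directly by the web server, no git involved

def main() -> int:
    # git feeds the hook one "<old-sha> <new-sha> <ref>" line per updated ref
    for line in sys.stdin:
        old, new, ref = line.split()
        # Placeholder: regenerate whatever pages depend on this ref.
        subprocess.run(
            ["render-repo-pages", "--ref", ref, "--out", CACHE_DIR],
            check=True,
        )
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```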
Sounds like it to me. Why build a crawler that fetches one HTML page per commit in a repository instead of doing a bare clone and just getting the data from there? Surely that would contain even more data, too, compared to the HTML pages.
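i.e. something like this: one network operation instead of one HTTP request per commit, and everything after that is local (the repo URL is a placeholder):

```python
import subprocess
import tempfile

REPO_URL = "https://example.com/some/project.git"   # placeholder

with tempfile.TemporaryDirectory() as workdir:
    # One clone fetches the full history...
    subprocess.run(["git", "clone", "--bare", REPO_URL, workdir], check=True)
    # ...and every commit message, diff, etc. is then readable locally.
    log = subprocess.run(
        ["git", "--git-dir", workdir, "log", "--stat", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    print(log[:2000])
```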
And poor, sloppy website implementation. If your professional website can't handle 20k hits, it's ... well, poor. My website, hosted on my desktop PC over my home connection, tanked 20k hits from the Alibaba bot yesterday (on top of a few thousand more hits of normal traffic) without missing a beat.
It is literally the point of public websites to answer HTTP requests. If yours can't, you're doing something wrong.