
Yes and no.

The distributed nature of git is fine until you want to serve a repository to the world; then you're back to dealing with bad actors. They're looking for commits because it's nicely chunked, I'm taking a guess.




> They're looking for commits because it's nicely chunked, I'm taking a guess.

They're not looking for anything specific, from what I can tell. If that were the case, they would just clone the git repository, as that would be the easiest way to ingest such information. Instead, they just want to guzzle every single URL they can get hold of, and a web frontend for git generates thousands of those. Every file in a repository results in dozens, if not hundreds, of unique links for file revisions, blame, etc., and many of those are expensive to serve. That is why those paths are often disallowed in robots.txt, and everything was fine until the LLM crawlers came along and ignored robots.txt.
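To make the robots.txt point concrete, here is a minimal sketch of the check a well-behaved crawler would run before fetching a URL, using Python's standard `urllib.robotparser`. The host `git.example.org`, the `/blame/` and `/log/` paths, and the `example-crawler` user agent are all made up for illustration; real git frontends lay out their URLs differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a git web frontend: the expensive,
# per-revision views are disallowed for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blame/
Disallow: /log/
"""


def allowed(url: str, agent: str = "example-crawler") -> bool:
    """Return True if the rules in ROBOTS_TXT permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://git.example.org/tree/README.md",
        "https://git.example.org/blame/README.md",
    ):
        print(url, allowed(url))
```

The check is a prefix match on the URL path and costs the crawler next to nothing, which is the point: ignoring it is a choice, not a technical limitation.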


The distributed nature of git lets me be independent of any central instance (you may decide that the master copy resides on GitHub, but with the advent of mesh VPNs like the ones ZeroTier and Tailscale offer, you could sidestep it and push to or pull from your colleagues directly). It also lets me dictate who gets access.

What the article describes, though, is possibly the worst way for a machine to access a git repository: scraping a web UI instead of cloning the repository and adding all the commits to its training set. I feel like they simply don't give a shit. They got such a huge capital injection that they feel they can afford not to give a shit about their own cost efficiency, so they resort to scorched-earth tactics. After all, even their own LLMs can produce a naive scraper that wreaks havoc on internet infrastructure, and they just let it loose. Got mine, fuck you all the way!

But then they will release some DeepSeek R(xyz), and yay, all the Hacker News commenters who were roasting them for such methods will be applauding them for a new version of an "open source" stochastic parrot. Yay indeed.
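For contrast with scraping the web UI, here is a rough sketch of what the cheap route looks like: one mirror clone, then read the history locally. The helper names and the example URL are hypothetical, and it assumes a `git` binary on the PATH; the options used (`clone --mirror`, `--git-dir=`, `log --all --format=%H`) are standard git options.

```python
import subprocess


def clone_command(url: str, dest: str) -> list[str]:
    """Build the command for a bare mirror clone (all refs, no working tree)."""
    return ["git", "clone", "--mirror", url, dest]


def log_command(git_dir: str) -> list[str]:
    """Build the command that prints every commit hash reachable from any ref."""
    return ["git", f"--git-dir={git_dir}", "log", "--all", "--format=%H"]


def list_commits(url: str, dest: str) -> list[str]:
    """Clone once, then enumerate commits without touching the server again."""
    subprocess.run(clone_command(url, dest), check=True)
    result = subprocess.run(
        log_command(dest), check=True, capture_output=True, text=True
    )
    return result.stdout.split()


if __name__ == "__main__":
    # Hypothetical repository URL; replace with a real one to try it out.
    print(clone_command("https://git.example.org/repo.git", "repo.git"))
    print(log_command("repo.git"))
```

After the single clone, everything else happens on the local copy, so the server sees one transfer instead of thousands of page requests.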



