
If it's for training data, why are they straining FOSS so much? Are there thousands of actors repeatedly making training data all the time? I thought it was a sort of one-off thing w/ the big tech players.


Git forges are some of the worst cases for this. The scrapers click on every link on every page. If you do this to a git forge, it gets very O(scary) very fast, because you end up serving data that is rarely looked at and will NOT be cached. Most of how git forges stay fast is caching.
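
To make the scale concrete, here's a rough back-of-the-envelope sketch in Go. All the numbers are made-up assumptions, but they show why "follow every link" on a forge means generating per-commit snapshot pages that no cache will ever hold:

    package main

    import "fmt"

    // Back-of-the-envelope sketch (all numbers are assumed, not measured) of how
    // many distinct URLs a crawler that follows every link can reach on a git
    // forge. Each commit exposes its own snapshot of the tree, so per-file views
    // multiply by the number of commits -- and almost none of those pages are
    // ever in cache.
    func main() {
        commits := 100_000 // e.g. a large, old project
        files := 10_000    // files tracked in the repo
        viewsPerFile := 4  // tree, blob, raw, blame -- assumed per-file page types

        commitPages := commits                          // one page per commit diff
        snapshotPages := commits * files * viewsPerFile // per-commit file views

        fmt.Printf("commit pages:   %d\n", commitPages)
        fmt.Printf("snapshot pages: %d\n", snapshotPages) // on the order of billions
    }

Even at these modest assumed numbers you get billions of distinct URLs, almost all of them cold.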

The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.

Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.
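
For context on what Anubis does: it makes the browser pay a small CPU cost before the forge serves anything expensive. This is not Anubis's actual code, just a minimal sketch of the proof-of-work idea, again in Go: the server hands out a challenge, the client grinds for a nonce whose SHA-256 hash has enough leading zero bits, and the server verifies the result with a single cheap hash.

    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
    )

    // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
    func leadingZeroBits(sum [32]byte) int {
        total := 0
        for _, b := range sum {
            if b == 0 {
                total += 8
                continue
            }
            total += bits.LeadingZeros8(b)
            break
        }
        return total
    }

    // solve is the client's side: brute-force a nonce so that
    // sha256(challenge:nonce) has at least `difficulty` leading zero bits.
    func solve(challenge string, difficulty int) uint64 {
        for nonce := uint64(0); ; nonce++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
            if leadingZeroBits(sum) >= difficulty {
                return nonce
            }
        }
    }

    // verify is the server's side: one cheap hash per submitted solution.
    func verify(challenge string, nonce uint64, difficulty int) bool {
        sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
        return leadingZeroBits(sum) >= difficulty
    }

    func main() {
        challenge := "per-session-random-value" // would be random and signed in practice
        difficulty := 16                        // ~65k hashes expected per solve

        nonce := solve(challenge, difficulty)
        fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, difficulty))
    }

The point is the asymmetry: a real user's browser pays this once and keeps the resulting token, while a scraper hammering millions of cold URLs without keeping state has to pay it over and over.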


It's very strange to me that they do it every day. I thought training runs took months. Do they throw away the vast majority of their training attempts (e.g. one had suboptimal hyperparameters, etc.)?


Training attempts != training data



