
If it's for training data, why are they straining FOSS so much? Are there thousands of actors repeatedly making training data all the time? I thought it was a sort of one-off thing w/ the big tech players.


Git forges are some of the worst cases for this. The scrapers click on every link on every page. If you do this to a git forge, it gets very O(scary) very fast, because you end up serving data that is rarely looked at and will NOT be cached. Most of how git forges stay fast is caching.
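
To make the scale concrete, here's a rough back-of-the-envelope sketch in Go. All the numbers are made-up assumptions, but they show why "follow every link" on a forge means generating per-commit snapshot pages that no cache will ever hold:

    package main

    import "fmt"

    // Back-of-the-envelope sketch (all numbers are assumed, not measured) of how
    // many distinct URLs a crawler that follows every link can reach on a git
    // forge. Each commit exposes its own snapshot of the tree, so per-file views
    // multiply by the number of commits -- and almost none of those pages are
    // ever in cache.
    func main() {
        commits := 100_000 // e.g. a large, old project
        files := 10_000    // files tracked in the repo
        viewsPerFile := 4  // tree, blob, raw, blame -- assumed per-file page types

        commitPages := commits                          // one page per commit diff
        snapshotPages := commits * files * viewsPerFile // per-commit file views

        fmt.Printf("commit pages:   %d\n", commitPages)
        fmt.Printf("snapshot pages: %d\n", snapshotPages) // on the order of billions
    }

Even at these modest assumed numbers you get billions of distinct URLs, almost all of them cold.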

The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.

Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.
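
For context on what Anubis does: it makes the browser pay a small CPU cost before the forge serves anything expensive. This is not Anubis's actual code, just a minimal sketch of the proof-of-work idea, again in Go: the server hands out a challenge, the client grinds for a nonce whose SHA-256 hash has enough leading zero bits, and the server verifies the result with a single cheap hash.

    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
    )

    // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
    func leadingZeroBits(sum [32]byte) int {
        total := 0
        for _, b := range sum {
            if b == 0 {
                total += 8
                continue
            }
            total += bits.LeadingZeros8(b)
            break
        }
        return total
    }

    // solve is the client's side: brute-force a nonce so that
    // sha256(challenge:nonce) has at least `difficulty` leading zero bits.
    func solve(challenge string, difficulty int) uint64 {
        for nonce := uint64(0); ; nonce++ {
            sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
            if leadingZeroBits(sum) >= difficulty {
                return nonce
            }
        }
    }

    // verify is the server's side: one cheap hash per submitted solution.
    func verify(challenge string, nonce uint64, difficulty int) bool {
        sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
        return leadingZeroBits(sum) >= difficulty
    }

    func main() {
        challenge := "per-session-random-value" // would be random and signed in practice
        difficulty := 16                        // ~65k hashes expected per solve

        nonce := solve(challenge, difficulty)
        fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, difficulty))
    }

The point is the asymmetry: a real user's browser pays this once and keeps the resulting token, while a scraper hammering millions of cold URLs without keeping state has to pay it over and over.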


It's very strange to me that they do it every day. I thought training runs took months. Do they throw away the vast majority of their training attempts (e.g. one had suboptimal hyperparameters, etc.)?


Training attempts != training data



