Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

That web domain dataset they get from the internet archive is interesting in light of the current discussion, in that I am supposing it probably has .uk content that has been removed from the actual internet archive by robots.txt changes.

I think if I were running a national or internationally mandated archiving initiative I would basically want to take in content from Internet Archive, and not remove things, and probably it would be less expensive that way than having my own crawler.



Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: