
> (Because the Wayback Machine starts from scratch each time?)

I don't know what "start from scratch" would mean – the point is that each site is sampled many times throughout history. That said, it is very odd that a current change in robots.txt would prevent looking at old samples. And that's indeed what it looks like [1]:

> Page cannot be displayed due to robots.txt.

[1] https://web.archive.org/web/*/blog.reidreport.com



I'd imagine they're in a dodgy copyright situation and so guard against it by being conservative wrt robots.txt.

A robots.txt is a positive assertion that parts of a site should be excluded from use by automated systems.
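To make the "positive assertion" concrete, here is a rough sketch of how a crawler might read such a file, using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the `example.com` URLs are all made up for illustration, and using `ia_archiver` as the token the archive honors is an assumption on my part, not something confirmed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
# The "ia_archiver" token is an assumption about which crawler name
# the archive respects.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        for url in ("https://example.com/post", "https://example.com/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Note this only shows how the file is interpreted at crawl time. Whether an archive also applies the current file to snapshots it already holds is a policy choice on its side, which is the odd part being discussed here.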

In most cases I imagine WBM does not have the owner's permission to keep a duplicate of the site; that's certainly tortious in UK law.

Sites that don't change their robots.txt are probably highly correlated with sites that don't sue for the infringement.



