
Adding a robots.txt file to a domain doesn't cause the Internet Archive to delete its archive of it, only to hide it.


Which is one and the same to the public.

IMO, they need to stop applying robots.txt retroactively if they want to be considered a valid archive.


The problem is that the Internet Archive exists on legally shaky ground. Neither they nor anyone else has a right to archive copyrighted web content and display it to the public. They manage to continue doing so in part because they're clearly non-commercial. They also manage to continue doing so because they voluntarily respond to robots.txt, even retroactively.

Libraries/archives have no special exemption from copyright law, which is actually a good thing, because otherwise libraries would presumably need to be licensed in some way by the government to qualify for special treatment.


Why not look at WHOIS information when getting an update, and then class a site as 'different' based on whether that changes? A new domain owner usually means the site isn't the same as the earlier versions.

You'd then just have to stop the archive indexing/showing content after the WHOIS information changed, while leaving the stuff before it intact (roughly as sketched below). Maybe you'd then have a nice form to report pages you want removed/hidden (for the edge cases), or even a separate robots.txt/meta declaration you can make confirming you're the same person that owns the site. After all, most of the reasons why sites go missing aren't deliberate attempts to rewrite history, but domain squatters not wanting holding pages indexed.

Feels like it'd be so easy to implement robots.txt in a more logical way on the Internet Archive.
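
A minimal sketch of that boundary rule in Python, purely illustrative: the Snapshot record and registrant fields are hypothetical, not the Wayback Machine's actual data model, and real WHOIS data is far messier than a string comparison.

    # Sketch: apply a newly-seen restrictive robots.txt only to snapshots
    # captured under the current registrant; earlier owners' pages stay visible.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class Snapshot:                        # hypothetical archive record
        captured_at: datetime
        registrant: Optional[str]          # WHOIS registrant seen at capture time

    def snapshots_to_hide(snapshots: List[Snapshot],
                          current_registrant: Optional[str]) -> List[Snapshot]:
        hidden = []
        for snap in snapshots:
            if snap.registrant is None or current_registrant is None:
                # Anonymized/missing WHOIS: can't prove an ownership change,
                # so fall back to the conservative behaviour and hide it.
                hidden.append(snap)
            elif snap.registrant == current_registrant:
                # Same owner as today: their robots.txt applies retroactively.
                hidden.append(snap)
            # Different registrant at capture time: leave the snapshot visible.
        return hidden

Note that the anonymized-WHOIS branch just collapses back to today's behaviour, which is exactly the weakness the reply below points out.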


It's been suggested, but there's no way to automatically do it correctly. The whois info might be anonymized, in which case a change means nothing at all. It might just be someone's name and address, with no way of verifying who that someone works for. Also meaningless. Better just to default to something safe, and spend your manpower on something more important.


    > the Internet Archive exists on legally shaky ground
Not least because of the EU's 'right to be forgotten'.


That doesn't seem likely to apply.


Key point given the current climate: if the Trump presidency adds a restrictive robots.txt to all .gov domains, it will prevent the Internet Archive's Wayback Machine from showing history on any of those domains.

Not only would this obliterate public access to the Obama-, Bush-, and Clinton-era government websites in the archive, it'd also prevent the use of the Wayback Machine for keeping track of Trump's shifting agendas, as demonstrated on his web domain recently.


Government documents, including, I assume, web pages, are in the public domain.


That's irrelevant; the issue isn't copyright, but the implementation of the Internet Archive, which applies robots.txt retroactively to already-archived versions of pages as well as respecting it for current crawls.


But the reason the Internet Archive applies robots.txt retroactively is copyright, as explained in a sibling to the grandparent.



