The robots.txt exlusion loophole has been known for quite a long time.

jaclaz · on April 25, 2018

>The robots.txt exlusion loophole has been known for quite a long time.

Yes, but it seemed like they had changed their mind, exactly because there is a huge issue with "expired" domains, see:

https://blog.archive.org/2016/12/17/robots-txt-gov-mil-websi...

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

They experimentally ignored robots.txt on .mil and .gov domains, and I thought they were going to extend this new policy for all archived sites.

The situation/status is not clear, though the retroactive validity of robots.txt remains (at least to me) absurd.

It is IMHO only fair to respect a robots.txt since the date it has been put online, it is the retroactivity that is perplexing, as a matter of fact I see it as violating the decisions of the Author, that - at the time some contents was made available - by not posting a robots.txt expressed the intention to have the contents archived and accessible, while there is no guarantee whatsoever that the robots.txt posted years later is still an expression of the same Author.

Most probably a middle way would be - if possible technically - that the robots.txt is respected only for the period in which the site has the same owner/registrar, but for the large amount of sites with anonymous or "by proxy" ownership that could not possibly work.

ghaff · on April 25, 2018

.gov and .mil sites are presumably public domain anyway because they're US government. Therefore, it makes sense to ignore instructions not to archive them.

In pretty much all other cases--except where they were public domain or CC0--it's probably not strictly legal to archive them at all. Therefore, it makes sense to bend over backwards to remove any material if asked to programatically or otherwise.

>I see it as violating the decisions of the Author

Maybe in some cases. But, for better or worse, preventing crawling is opt-in rather than opt-out, and defaults are very powerful. You didn't explicitly tell me that you didn't want me to repurpose your copyrighted material isn't a very strong legal argument.

skywhopper · on April 25, 2018

I'm guessing retroactively respecting robots.txt is a political decision to protect the integrity of their archives. "Here's an easy way to automatically remove your stuff from the publicly accessible archives" prevents a lot of lawsuits and potentially bad press. It's annoying if it blocks a few sites, but better to quickly and quietly block a few than to generate so much noise that a bunch more get blocked. As a pragmatic strategy it's probably the best one for preserving the widest array of publicly accessible archives.

gmueckl · on April 25, 2018

Is it reasonably feasible to extend the syntax of robots.txt to include date ranges when the entries are specific to the IA bot? That way, specific content from a certain time span could be retroactively suppressed if desired.

This would also solve situations where a new owner blocks robot access for a domain where the former owner is OK with the existence of the archived site.

rocqua · on April 25, 2018

Why allow retroactive suppression at all?

It seems to make the most sense to only have a robots.txt affect pages archived when that specific version of robots.txt is in effect.

cheschire · on April 25, 2018

Perhaps this article[0] may provide you with insight into the motivations of those who may prefer to suppress historical data.

0: https://en.wikipedia.org/wiki/Right_to_be_forgotten

jaclaz · on April 25, 2018

There is no need to extend the syntax of robots.txt.

At the time of crawling the robots.txt is parsed anyway.

If it excludes part or the whole site from crawling, it should IMHO be respected, and the crawl of that day should be stopped (and pages NOT even archived) if it doesn't then crawling and archiving them is "fair".

The point here is that by adding a "new" robots.txt the "previously archived" pages (that remain archived) are not anymore displayed by the Wayback Machine.

It is only a political/legal (and unilateral) decision by the good people at the archive.org, it could be changed any time, at their discretion, without the need of any "new" syntax for robots.txt.

gmueckl · on April 25, 2018

I think that enabling the user to selectively suppress parts of the site for certain archived time spans is a better solution. Sometimes, a page might have been in temporary violation of a law or contract and that version needs to be suppressed. But that particular does not mean that any other version needs to be hidden as well.

jaclaz · on April 25, 2018

>I think that enabling the user to selectively suppress parts of the site for certain archived time spans is a better solution.

But that is easily achieved by politely asking the good people at archive.org, they won't normally decline a "reasonable" request to suppress this or that page access.

As a side note, there is something that (when it comes to the internet) really escapes me, in the "real" world, before everything was digital, you had lawful means to get a retraction in case - say - of libel but you weren't allowed to retroactively change the history, destroying all written memories and attempts like burning books on public squares weren't much appreciated, I don't really see how going digital should be so much different.

I guess that the meaning of "publish" in the sense of "making public by printing it" has been altered by the common presence of the "undo" button.

Another sign I am getting old (and grumpy), I know.

gmueckl · on April 25, 2018

Well, you are missing the part where producing and selling new copies of e.g. libelous works can be forbidden in the real world. So the old copies will still be around, but they have to be passed on privately. Effectively, this takes affected works out of circulation.

jaclaz · on April 25, 2018

No, actually it was exactly the example I made, in the case of libel someone with a recognized authority (a Court) can seize/impound the libelous material and prohibit further publication (and of course not destroy each and every copy in the wild), but the procedure is very different from someone (remember not necessarily the actual Author, actually only the owner of the domain/site at a given moment) being able to prevent access to archived material published in the past (material that does not represent a libel and is not violating any Law) only because he/she can.

C4K3 · on April 25, 2018

A middle way could be to only observe robots.txt for crawling, and not for displaying pages. So once a page is grabbed, it's available forever. But if a page is covered by a robots.txt exclusion, it won't be crawled.