
If that's the case, you could buy up defunct domains, exclude everything via robots.txt, and selectively purge sites from archive.org.

That seems like a pretty glaring flaw in something designed to create an enduring record.



For reference, I've submitted a post on the Danish net archives (https://news.ycombinator.com/item?id=16919264), which by law will archive all of any 'Danish' site and ignore robots.txt exclusions.

From the FAQ: http://netarkivet.dk/in-english/faq/#anchor8

> 8. Do you respect robots.txt? No, we do not. When we collect the Danish part of the internet we ignore the so-called robots.txt directives. Studies from 2003-2004 showed that many of the truly important web sites (e.g. news media, political parties) had very stringent robots.txt directives. If we follow these directives, very little or nothing at all will be archived from those websites. Therefore, ignoring robots.txt is explicitly mentioned in the commentary to the law as being necessary in order to collect all relevant material.

I wonder if there are any other national archives of the internet that do the same.


The relatively clear mandate for collecting all published materials, given to the Danish royal library by the Danish constitution, is kind of rare in its scope and status.

https://www.bl.uk/collection-guides/uk-web-archive describes the much more limited approach taken by the British Library, much later in time, but it might extend to a similar scope.


That web domain dataset they get from the Internet Archive is interesting in light of the current discussion, in that I suspect it includes .uk content that has since been removed from the actual Internet Archive by robots.txt changes.

I think if I were running a national or internationally mandated archiving initiative, I would basically want to take in content from the Internet Archive and never remove it; that would probably also be less expensive than running my own crawler.


It's actually very clever.

The key is that it works both ways. Because the Archive respects the live robots.txt, and only the live one, data hiding must be an active process requested on an ongoing basis by a live entity. As soon as the entity goes defunct, any previously scraped data is automatically republished. Thus archive.org is protected from lawsuits by any extant organisation, yet in the long run it still archives everything it reasonably can.
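
Roughly, the replay decision boils down to something like the sketch below. This is only a hedged Python illustration of the idea, not the Wayback Machine's actual code; the "ia_archiver" user-agent token and the unreachable-domain handling are my assumptions:

    # Sketch of "respect only the live robots.txt" (illustrative, not IA's code).
    import urllib.error
    import urllib.robotparser
    from urllib.parse import urlsplit

    ARCHIVE_UA = "ia_archiver"  # assumed user-agent token, for illustration only

    def may_replay(url: str) -> bool:
        """Return True if archived snapshots of `url` may be shown today."""
        host = urlsplit(url).netloc
        rp = urllib.robotparser.RobotFileParser(f"http://{host}/robots.txt")
        try:
            rp.read()  # fetch the *current* robots.txt, never a historical one
        except (urllib.error.URLError, OSError):
            # Domain dead or unreachable: no live entity is asking to hide
            # anything, so previously archived material is shown again.
            return True
        return rp.can_fetch(ARCHIVE_UA, url)

Which is also exactly why a squatter's blanket disallow on a lapsed domain can hide the previous owner's site, as pointed out elsewhere in the thread.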


They seem to have had this problem in the past and decided to skirt around it by ignoring robots.txt a year ago [0]. Does anybody know what happened to make them revert this decision?

[0]https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


As per that post, they only ignored robots.txt for .gov and .mil sites.


A disallow aimed at IA in robots.txt will still block archive.org; the blog post was about ignoring rules that were meant for search engines.
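
For illustration, here's a rough Python sketch of that distinction (the sample robots.txt and the user-agent tokens are assumptions, not the Archive's real configuration): a section addressed to a search engine's crawler doesn't affect the archive's crawler, but a section addressed to it directly still does.

    # Illustrative only: user-agent-specific robots.txt sections.
    import urllib.robotparser

    SAMPLE_ROBOTS_TXT = """\
    User-agent: Googlebot
    Disallow: /

    User-agent: ia_archiver
    Disallow: /private/
    """

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(SAMPLE_ROBOTS_TXT.splitlines())

    # A rule aimed only at a search engine doesn't block the archive's crawler...
    print(rp.can_fetch("ia_archiver", "http://example.com/page"))       # True
    # ...but a rule addressed to the archive's own user agent still does.
    print(rp.can_fetch("ia_archiver", "http://example.com/private/x"))  # False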


Yes, but it also says

> We are now looking to do this more broadly.

That's the part I'm asking about.


Right, but that doesn't mean they reverted it; they are probably still looking into it.


I don't think they purge the archives; I think they just don't serve them on the Wayback Machine.


Yes. Instead of deleting anything, I think the Archive tends to mark stuff as "do not show this for a few decades."


Do you have a source for this? I didn’t know that but it’s very interesting – a good compromise between the interests of current website owners[1] and future historians.

[1]: Sure, some people just want to hide embarrassing or incriminating content, but there are also cases where someone is being stalked or harassed based on things they shared online, and hiding those things from Archive users may mitigate that.


Generally when items are "taken down" from the Internet Archive, they just stop being published, and are not deleted.

I don't think it's mentioned in an official document, but it's usually referred to as "darking".

It's probably safe to assume that the same concept applies to the Wayback Machine as to the rest of IA.

Edit: Here's a page that indirectly conveys some information about it: https://archive.org/details/IA_books_QA_codes
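
In database terms, "darking" is essentially a soft delete. A purely hypothetical sketch of the idea (none of these names come from IA's actual schema):

    # Hypothetical sketch of "darking" as a soft delete: nothing is removed,
    # items are only flagged as not publicly visible.
    from dataclasses import dataclass, field

    @dataclass
    class ArchivedItem:
        identifier: str
        is_dark: bool = False                        # hidden from public access
        audit_log: list = field(default_factory=list)

        def dark(self, reason: str) -> None:
            """Stop serving the item publicly; keep the bits on disk."""
            self.is_dark = True
            self.audit_log.append(("darked", reason))

        def undark(self, reason: str) -> None:
            """Make the item publicly accessible again."""
            self.is_dark = False
            self.audit_log.append(("undarked", reason))

    item = ArchivedItem("example-snapshot")
    item.dark("takedown request from current domain owner")
    print(item.is_dark)  # True, but the underlying data still exists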


I thought I'd read it in a blog post by Jason Scott at textfiles.com, but I couldn't find a reference quickly. It could have come from conversation, as I've visited the Internet Archive a few times.


Which won't be legal, it seems, if they serve EU-based users and keep PII on any EU users.


Yup, that already happened in the past. For some reason, this is apparently a feature, not a bug.


From what I understand, this is the deal they make to avoid getting sued by everyone for copyright violation.


It is a glaring flaw. It's meant that a lot of sites ended up wiped out of the archive (or at least made inaccessible) simply because their domain expired and domain squatters blocked the now-empty domain from being indexed.

The solution (which the Internet Archive really needs to implement) is to look at the domain registration data or something, and only remove content if the same owner updated the robots.txt file. If the owner has changed, just stop archiving new content, since the new domain owner usually has no right to decide what happens to the old site's content.
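
Roughly (a hypothetical sketch of that policy; the registrant values are stand-ins for whatever registration data the Archive could realistically get, which is the hard part):

    # Hypothetical policy sketch: honour robots.txt exclusions retroactively
    # only when the domain appears to have the same owner as when the content
    # was archived.
    def apply_robots_exclusion(registrant_at_crawl_time: str,
                               current_registrant: str) -> str:
        if current_registrant == registrant_at_crawl_time:
            # Same owner: respect the exclusion for existing snapshots too.
            return "hide existing snapshots and stop crawling"
        # New owner (e.g. a squatter): they control future content only.
        return "stop archiving new content, keep old snapshots visible"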



