For reference, I've submitted a post on the Danish net archive (https://news.ycombinator.com/item?id=16919264), which by law archives every 'Danish' site in full and ignores robots.txt exclusions.
8. Do you respect robots.txt?
No, we do not. When we collect the Danish part of the internet we ignore the so-called robots.txt directives. Studies from 2003-2004 showed that many of the truly important web sites (e.g. news media, political parties) had very stringent robots.txt directives. If we follow these directives, very little or nothing at all will be archived from those websites. Therefore, ignoring robots.txt is explicitly mentioned in the commentary to the law as being necessary in order to collect all relevant material.
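To make the FAQ's point concrete, here is a minimal sketch of how a stringent robots.txt shuts out a compliant crawler, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-archive-bot` user agent, the URL, and the `may_fetch` helper are all made up for illustration; this is not how Netarkivet's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a site that disallows every crawler everywhere.
STRICT_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str,
              respect_robots: bool = True) -> bool:
    """Return True if the crawler would fetch `url` under the given policy.

    `respect_robots=False` models an archive that ignores robots.txt
    altogether (an assumption made for this sketch).
    """
    if not respect_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.dk/news/"  # placeholder URL
    print(may_fetch(STRICT_ROBOTS_TXT, "example-archive-bot", url))
    print(may_fetch(STRICT_ROBOTS_TXT, "example-archive-bot", url,
                    respect_robots=False))
```

With a blanket `Disallow: /`, a crawler that honours the file gets nothing from the site, which is the situation the FAQ describes for news media and political parties.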
I wonder if there are any other national archives of the internet that do the same.
The relatively clear mandate for collecting all published materials given to the Danish Royal Library by the Danish constitution is kind of rare in its scope and status.
The web domain dataset they get from the Internet Archive is interesting in light of the current discussion: I suppose it probably contains .uk content that has since been removed from the Internet Archive itself because of robots.txt changes.
I think if I were running a nationally or internationally mandated archiving initiative, I would basically want to take in content from the Internet Archive and not remove anything; that would probably also be less expensive than running my own crawler.
Source for the FAQ quote above: http://netarkivet.dk/in-english/faq/#anchor8