That web domain dataset they get from the Internet Archive is interesting in light of the current discussion: I'd guess it contains .uk content that has since been removed from the public Wayback Machine by robots.txt changes.
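For illustration, here's a minimal Python sketch of how one might check whether a URL still has a publicly retrievable snapshot, using the Wayback Machine availability API (archive.org/wayback/available). The example URL is hypothetical, and a missing snapshot only suggests an exclusion, it doesn't prove one:

    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    def wayback_snapshot(url):
        # Query the Wayback Machine availability API for the most
        # recent publicly retrievable snapshot of `url`.
        api = "https://archive.org/wayback/available?url=" + quote(url, safe="")
        with urlopen(api) as resp:
            data = json.load(resp)
        # Returns None when no snapshot is retrievable, e.g. when a
        # site's robots.txt exclusion has been applied retroactively.
        return data.get("archived_snapshots", {}).get("closest")

    snap = wayback_snapshot("example.co.uk/some-page")  # hypothetical URL
    print(snap["url"] if snap else "no public snapshot")

Comparing the output of something like this against a dataset of known-crawled .uk URLs would be one way to estimate how much has quietly disappeared from public view.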
I think if I were running a national or internationally mandated archiving initiative, I'd basically want to ingest content from the Internet Archive and never remove anything; that would probably also be cheaper than operating my own crawler.