The problem with that, though, is that people will think you've manipulated your copy. I've had people accuse me of this when I save pages with screenshots. You need a trusted third party to make and store the copies.
Sometimes I use archive.is; they don't automatically delete pages because of robots.txt, but it's not entirely clear to me when they do delete things.
One method for making a copy or crawl very difficult to tamper with is to publish a hash somewhere difficult to forge (e.g. in a national newspaper, or via OpenTimestamps). That won't prove the copy wasn't manipulated before it was archived, though. For that, we would need multiple, independent archives.
This is effectively what libraries have been doing for many years with their archives of newspapers.
The hashing needs to be done by a trusted third party. It would be a cheaper service to operate than the Wayback Machine, but it would let you check individual pieces of content for manipulation.
You have to incentivize people to run the service and store the hashes.
You could put it in the Bitcoin blockchain. Or, if you don't need that level of complexity and cost, you could put it on Twitter, which doesn't allow editing tweets (but does allow deleting them).
This would be entirely overkill, but in theory, if you're accessing the site via HTTPS, couldn't you record the conversation from your end and later prove it?
Edit: thought about this some more; I don't think this would work, since in SSL, IIRC, you agree on a symmetric encryption key which is then used to encrypt the rest of the request/response cycle. Since you hold that key too, you could have forged the recorded traffic yourself.
You'd need a proven timestamp on that conversation, or else the site could just switch certs and then leak their keys. Then they could claim that you forged the traffic using the leaked keys.
I can't believe I'm the one to propose this, but being able to verifiably timestamp some data is one of the only actual use cases for blockchains. Archive your data, compute a checksum, and store that in a Bitcoin block, and you can prove later on that you actually had the data at that point in time.
Of course there are other ways to achieve that, such as publishing your checksum to a vast number of neutral third parties, e.g. on a mailing list, via BitTorrent, or even in a newspaper. You could also rely on a trusted third party that has little incentive to manipulate the data (or would risk a lot if caught cheating), such as a bank, an insurance company, or a notary.
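As a minimal illustration of the first step, here is what computing such a checksum could look like in Python; the file name is made up, and publishing the digest widely is the part that actually matters:

    import hashlib

    def checksum(path: str) -> str:
        """Return the hex SHA-256 digest of the file at `path`."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    # e.g. checksum("saved_page.warc") -> a 64-character hex string to publish
    # somewhere hard to forge (mailing list, newspaper, OpenTimestamps, ...)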
I think archive.org could potentially do something like that by building a Merkle tree containing the checksums of the various pages they've archived during the day and publishing the top hash every day for anybody to archive (or publishing it on a blockchain or whatever, as said above). If somebody later accuses the archive of manipulation, they can publish the Merkle tree for the day the site was archived, which contains the checksum of the page, and anybody holding that day's top hash can vouch that it is indeed correct.
It doesn't stop the archive from storing bogus data, but it makes it impossible to "change the past" after the fact, so in this particular situation the journalist could only complain that the archive stored bogus data back in the day, not that it was recently tampered with.
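Here's a toy sketch of that daily scheme, just to make it concrete; the pairing rule, function names, and sample pages are all my own assumptions, not anything archive.org actually does:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves: list[bytes]) -> bytes:
        """Leaves are the SHA-256 hashes of the day's archived pages."""
        level = leaves[:]
        while len(level) > 1:
            if len(level) % 2:  # duplicate the last node on odd-sized levels
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    pages = [b"<html>page one...</html>", b"<html>page two...</html>"]
    root = merkle_root([h(p) for p in pages])
    print("publish this:", root.hex())  # the only value anyone has to mirror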
I immediately had the same idea to use a 3rd party to host checksums; I'm surprised they haven't done this. Blockchain makes a lot of sense from the immutability standpoint, but how would you incentivize people to maintain it? Maybe you can get people to do it for the common good, à la Wikipedia? Not sure about that. Maybe you get Apache to bake it into their web server and ask people to opt in to dedicating 0.1% of their resources to the cause?
I was thinking about using an existing blockchain such as Bitcoin. Of course the inconvenience then is that archive.org would have to pay a fee every time they submit a new hash. A comment above pointed out that the scheme I described (unsurprisingly) already exists: https://petertodd.org/2016/opentimestamps-announcement
Realistically it might be overkill, though; simply setting up a mailing list where anybody can subscribe and be sent the checksum every day, or even just publishing it at some URL and letting users scrape it if they want, might be sufficient. If we're talking about one checksum per day, it's only on the order of ten kilobytes a year; it shouldn't be too difficult to convince a few hundred people and organizations around the world to mirror it.
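For what it's worth, the size estimate holds up, assuming SHA-256-sized digests:

    # One SHA-256 checksum per day, for a year:
    raw_bytes = 365 * 32   # ~11.4 KB as raw digests
    hex_bytes = 365 * 64   # ~22.8 KB as hex strings
    print(raw_bytes, hex_bytes)  # trivial for anyone to mirror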
Merkle trees.
Whoever wants to store a timestamp for some message stores the path from that message to the root of the Merkle tree. Only the root of the Merkle tree for each day needs to be published.
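To flesh that out, a hypothetical verifier for such a stored path might look like this; the ("left"|"right", sibling) encoding of the path is just one possible convention:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def verify(leaf: bytes, path: list[tuple[str, bytes]], root: bytes) -> bool:
        """path = [("left"|"right", sibling_hash), ...], ordered leaf-to-root."""
        node = leaf
        for side, sibling in path:
            node = h(sibling + node) if side == "left" else h(node + sibling)
        return node == root

    # If verify(h(page_bytes), stored_path, published_root) returns True, the
    # page was in the tree whose root was published that day; no trust in the
    # archive is needed beyond that one widely mirrored hash.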
Yup! This is the problem. There’s a blog post out there by a security guy titled “nothing is real” where he covers how nobody can truly trust the data they get from computers.
As an aside, I’ve been noticing Google has been getting worse and worse at finding something I’m sure is out there. I’m not saying that they’re necessarily getting worse, but maybe it’s getting harder to deal with the sheer scale of the web these days.
It's good enough to combat link rot if you need or want to refer back to something or read it again. Nothing on the internet is permanent, and even if it is still out there, that's no guarantee that Google or any other search engine still has it in their index. At least that way the information isn't lost to you.