The problem with that, though, is that people will think you've manipulated your copy. I've had people accuse me of this when I save pages with screenshots. You need a trusted third party to make and store the copies.
Sometimes I use archive.is; they don't automatically delete pages because of robots.txt, but it's not entirely clear to me when they do delete things.
One method for making a copy or crawl very difficult to tamper with is to publish a hash somewhere difficult to forge (e.g. in a national newspaper, or via OpenTimestamps). That won't prove the copy wasn't manipulated before it was archived, though. For that, we would need multiple, independent archives.
This is effectively what libraries have been doing for many years with their archives of newspapers.
The hashing needs to be done by a trusted third party. It would be a cheaper service to operate than the Wayback Machine, but it would let you check individual pieces of content for manipulation.
You have to incentivize people to run the service and store the hashes.
You could put it in the Bitcoin blockchain. Or, if you don't need that level of complexity and cost, you could put it on Twitter, which doesn't allow editing tweets (but does allow deleting them).
This would be entirely overkill, but in theory, if you're accessing the site via HTTPS, couldn't you record the conversation from your end and later prove it?
Edit: thought about this some more; I don't think this would work, since in SSL, IIRC, you agree on a symmetric encryption key which is then used to encrypt the rest of the request/response cycle. Since you hold that key too, you could have forged the recorded traffic yourself.
You'd need a proven timestamp on that conversation, or else the site could just switch certs and then leak their keys. Then they could claim that you forged the traffic using the leaked keys.
I can't believe I'm the one to propose this, but being able to verifiably timestamp some data is one of the only actual use cases for blockchains. Archive your data, compute a checksum, and store that in a Bitcoin block, and you can prove later on that you actually had the data at that point in time.
Of course there are other ways to achieve that, such as publishing your checksum to a vast number of neutral third parties, e.g. on a mailing list, via BitTorrent, or even in a newspaper. You could also rely on a trusted third party that has little incentive to manipulate the data (or would risk a lot if caught cheating), such as a bank, an insurance company, or a notary.
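As a minimal illustration of the first step, here is what computing such a checksum could look like in Python; the file name is made up, and publishing the digest widely is the part that actually matters:

    import hashlib

    def checksum(path: str) -> str:
        """Return the hex SHA-256 digest of the file at `path`."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    # e.g. checksum("saved_page.warc") -> a 64-character hex string to publish
    # somewhere hard to forge (mailing list, newspaper, OpenTimestamps, ...)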
I think archive.org could potentially do something like that by building a Merkle tree containing the checksums of the various pages they've archived during the day and publishing the top hash every day for anybody to archive (or publishing it on a blockchain or whatever, as said above). If somebody later accuses the archive of manipulation, they can publish the Merkle tree for the day the site was archived, which contains the checksum of the page, and anybody holding that day's top hash can vouch that it is indeed correct.
It doesn't stop the archive from storing bogus data, but it makes it impossible to "change the past" after the fact, so in this particular situation the journalist could only complain that the archive stored bogus data back in the day, not that it was recently tampered with.
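Here's a toy sketch of that daily scheme, just to make it concrete; the pairing rule, function names, and sample pages are all my own assumptions, not anything archive.org actually does:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves: list[bytes]) -> bytes:
        """Leaves are the SHA-256 hashes of the day's archived pages."""
        level = leaves[:]
        while len(level) > 1:
            if len(level) % 2:  # duplicate the last node on odd-sized levels
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    pages = [b"<html>page one...</html>", b"<html>page two...</html>"]
    root = merkle_root([h(p) for p in pages])
    print("publish this:", root.hex())  # the only value anyone has to mirror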
I immediately had the same idea to use a 3rd party to host checksums; I'm surprised they haven't done this. Blockchain makes a lot of sense from the immutability standpoint, but how would you incentivize people to maintain it? Maybe you can get people to do it for the common good, à la Wikipedia? Not sure about that. Maybe you get Apache to bake it into their web server and ask people to opt in to dedicating 0.1% of their resources to the cause?
I was thinking about using an existing blockchain such as Bitcoin. Of course the inconvenience then is that archive.org would have to pay a fee every time they submit a new hash. A comment above pointed out that the scheme I described (unsurprisingly) already exists: https://petertodd.org/2016/opentimestamps-announcement
Realistically it might be overkill, though; simply setting up a mailing list where anybody can subscribe and be sent the checksum every day, or even just publishing it at some URL and letting users scrape it if they want, might be sufficient. If we're talking about one checksum per day, it's only on the order of ten kilobytes a year; it shouldn't be too difficult to convince a few hundred people and organizations around the world to mirror it.
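For what it's worth, the size estimate holds up, assuming SHA-256-sized digests:

    # One SHA-256 checksum per day, for a year:
    raw_bytes = 365 * 32   # ~11.4 KB as raw digests
    hex_bytes = 365 * 64   # ~22.8 KB as hex strings
    print(raw_bytes, hex_bytes)  # trivial for anyone to mirror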
Merkle trees.
Whoever wants to store a timestamp for some message stores the path from that message to the root of the Merkle tree. Only the root of the Merkle tree for each day needs to be published.
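To flesh that out, a hypothetical verifier for such a stored path might look like this; the ("left"|"right", sibling) encoding of the path is just one possible convention:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def verify(leaf: bytes, path: list[tuple[str, bytes]], root: bytes) -> bool:
        """path = [("left"|"right", sibling_hash), ...], ordered leaf-to-root."""
        node = leaf
        for side, sibling in path:
            node = h(sibling + node) if side == "left" else h(node + sibling)
        return node == root

    # If verify(h(page_bytes), stored_path, published_root) returns True, the
    # page was in the tree whose root was published that day; no trust in the
    # archive is needed beyond that one widely mirrored hash.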
Yup! This is the problem. There’s a blog post out there by a security guy titled “nothing is real” where he covers how nobody can truly trust the data they get from computers.
As an aside, I’ve been noticing Google has been getting worse and worse at finding something I’m sure is out there. I’m not saying that they’re necessarily getting worse, but maybe it’s getting harder to deal with the sheer scale of the web these days.
It's good enough to combat link rot if you need or want to refer back to something or read it again. Nothing on the internet is permanent, and even if it is still out there, that's no guarantee that Google or any other search engine still has it in their index. At least that way the information isn't lost to you.