
So if you find something you'd better make a copy of it yourself because it might be going dark. Doesn't that kind of defeat the whole purpose of the Wayback Machine?


So I try to make a copy of any interesting web pages on archive.is these days.

http://archive.is/faq#Why_does_archive_is_not_obey_robots_tx...


This one is funny, because conservatives have used archive.is for some time to archive and mock left-leaning websites, and some of those sites blocked archive.is in the past and still block it today.

VOX, for example, returns a 0-sized page to archive.is. In the past, VICE returned 404s to archive.is: https://i.imgur.com/OnFdVpS.jpg

What I mean to say is that these services are useful but they are not faultless.


VICE didn't just block archive.is, they blocked the Internet Archive too by returning the exact same 404 page. They really didn't want any archived copies of their posts hanging around anywhere outside their control.


Why are so many irrelevant political left-vs-right "he said, she said" type comments popping up on HN just lately?


I personally think this is simply a result of how much harder the media is pushing that divide (for all of their various purposes). I actually spent some time last year researching this, because I thought I might have just become an old man thinking how great things used to be. I started reading old news stories fairly randomly, from the present time all the way back to the Vietnam era (and a few rabbit holes to earlier times).

The first thing that surprised me was the amount of link rot that exists. I always knew intellectually that it was a problem, but wow. It's bad.

The second thing I found was that indeed, the media hammers on the "us-versus-them" political divide of American politics much, much harder nowadays than even ten years ago. I think Fox News was really the turning point. It opened the floodgates. I remember thinking how "extreme" Fox News was, but I challenge anyone to look up a few of their older stories from the middle of the last decade. It's child's play compared to what pretty much every media outlet is doing today. You can hardly read a recent news story from just about anywhere without being told how it's supposed to fit into our political worldview, how we should feel about it, and why it's good/bad/stupid/amazing/"terrifying". And so, of course, people are just responding to the programming, creating the world they're led to believe they live in. I think it really is that straightforward.


Did you just look at print? Talk radio has been hammering this since the late eighties. Hell, you can probably draw a straight line from the "Moral Majority" shit in the seventies to where we find ourselves now. I suspect this has always been a big part of American culture, but it's being magnified now either by new tech or malicious actors or both.


Oh, you know, that's interesting. I hadn't even thought about talk radio, but you're absolutely right.


Do you have a selection of those old stories? It would be interesting.


Ah, I apologize, I didn't keep notes or save links or anything of the sort, and I keep kicking myself for it. I'm usually pretty good about taking notes just out of my regular habit of doing research, but it was so casual, and I didn't think it would end up taking as much of my time as it did. It's a pretty easy formula to replicate, though. I picked current events that I could remember -- intervention in Kosovo, Bill Clinton's sex scandal, Berlin Wall, first election of Putin, Enron scandal, those kinds of things -- and just started looking up stories, and asked my parents and older friends to help me with events I wasn't old enough to remember before the 80s. I made sure to hit a "good" cross-section of the media outlets of the day.


I'd love to see the Bush/Gore 2000 election play out on social media. I was only a kid but the news coverage seemed pretty mild compared to how I imagine it would be if that happened in 2016.


Thanks - it seems like a half-day project to build a spider for this ... one for the list :-)


Everything is partisan political now. What books you read, what films and TV series you watch, where you live, the definition of "political" itself, and to some extent even what internet archiving service you use. (In reality, left-leaning folks have used archive.is to save and mock conservative sites for some time too, but even though this happens across the board it's still normal and expected to think of this as a partisan political activity because everything is now.)


The people that used to hang around /r/incel are now spending their time elsewhere on the net.

https://twitter.com/ArshyMann/status/988818797086871558


Mentioning that certain publications block archiving is not irrelevant.


I think having written short-sighted things and then regretting them is a somewhat universal experience. I also don't have a problem with either side using previous writings, as long as they are reproduced accurately and faithfully.

In short, before you publish a blog post that is sexist/racist/homophobic/whatever, consider that even if you delete it, others may have a copy and will use it against you.


How you maintain the cognitive dissonance of defending such personal blog sites masquerading as news outlets, while admitting that a word-for-word reproduction of their words constitutes mockery, is beyond me.


Well, in this case it is a left-leaning activist who is taking this other left-leaning activist down. It is not always the politics we expect, but we can guarantee that if it's political, it will be nasty.

The issue I have is that we should not be able to block access to archived content just because it's embarrassing.


Yes, but it's also always been true that if you want to keep something for reference, you make your own copy.


The problem with that though is that people will think you've manipulated your copy. I've had people accuse me of this when I save pages with screenshots. You need to have a trusted third party make and store the copies.

Sometimes I use archive.is; they don't automatically delete because of robots.txt, but it's not fully clear to me when they do delete things.


One method for making a copy or crawl very difficult to tamper with is to publish a hash somewhere difficult to forge (e.g. in a national newspaper, or via OpenTimestamps). That won't prove the copy wasn't manipulated before it was archived, though. For that, we would need multiple, independent archives.
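The first step of that scheme, hashing a saved copy so the digest can be published somewhere hard to forge, is simple to sketch. This is a minimal illustration; the snapshot bytes and function name are made up for the example:

```python
import hashlib

def page_fingerprint(html_bytes: bytes) -> str:
    """SHA-256 hex digest of an archived page's raw bytes."""
    return hashlib.sha256(html_bytes).hexdigest()

# Hypothetical saved copy of a page.
snapshot = b"<html><body>Example archived article</body></html>"
digest = page_fingerprint(snapshot)

# Publish `digest` somewhere difficult to forge (a newspaper ad,
# OpenTimestamps, etc.); later, anyone holding the same bytes can
# recompute the digest and compare.
print(digest)
```

Anyone with a byte-identical copy gets the same digest, so a published digest pins the content without revealing it.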

This is effectively what libraries have been doing for many years with their archives of newspapers.


The hashing needs to be done by a trusted third party. It would be a cheaper service to operate than the Wayback Machine, but it would let you check individual pieces of content against manipulation.

You have to incentivize people running and storing the hashes.


You could put it in the bitcoin blockchain. Or if you don't need that level of complexity and cost, you could put it on twitter, which doesn't allow editing tweets (but does allow deleting).


This would be entirely overkill, but in theory if you're accessing the site via https couldn't you record the conversation from your end and later prove it?


Edit: I thought about this some more, and I don't think this would work, since in SSL, IIRC, you agree on a symmetric encryption key which is then used to encrypt the rest of the request/response cycle.


You'd need a proven time stamp on that conversation, or else the site could just switch certs and then leak their keys. Then, they can claim that you forged the traffic using the leaked keys.


I can't believe I'm the one to propose this, but being able to verifiably timestamp some data is one of the only actual use cases of blockchains. Archive your data, compute a checksum, and store that in a Bitcoin block, and you can prove later on that you actually held the data at that point.

Of course, there are other ways to achieve that, such as publishing your checksum to a vast number of neutral third parties (a mailing list, BitTorrent, or even a newspaper). You could also rely on a trusted third party who has low incentives to manipulate the data (or would risk a lot if they were caught cheating), such as a bank, insurer, or notary, for instance.

I think archive.org could potentially do something like that by building a Merkle tree containing the checksums of the various pages they've archived during the day, and publishing the top hash every day for anybody to archive (or publishing it on a blockchain or whatever, as said above). If later on somebody accuses the archive of manipulation, they can publish the Merkle tree for the day the site was archived, which contains the checksum of the page, and anybody holding that day's top hash can vouch that it is indeed correct.

It doesn't stop the archive from storing bogus data, but it makes it impossible to "change the past" after the fact, so in this particular situation the journalist could only complain that the archive stored bogus data back in the day, not that it was recently tampered with.
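The daily Merkle root described above can be sketched in a few lines. This is an illustrative toy, not anything archive.org actually runs; the page contents are invented, and odd levels are handled here by duplicating the last node (one common convention among several):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold a list of leaf hashes up to a single 32-byte root."""
    if not leaves:
        return h(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                 # odd level: duplicate last node
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical checksums of pages archived during one day.
pages = [b"page-a contents", b"page-b contents", b"page-c contents"]
leaves = [h(p) for p in pages]
root = merkle_root(leaves)
# The archive need only publish `root` (32 bytes) to commit to the
# whole day's snapshots.
```

Changing any one page changes its leaf hash, which changes every hash on the path up to the root, so the published root commits to every page at once.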


I immediately had the same idea to use a third party to host checksums; I'm surprised they haven't done this. Blockchain makes a lot of sense from the immutability standpoint, but how would you incentivize people to maintain it? Maybe you could get people to do it for the common good, à la Wikipedia? Not sure about that. Maybe you get Apache to bake it into their web server and ask people to opt in to dedicating 0.1% of resources to the cause?


I was thinking about using an existing blockchain such as Bitcoin. Of course, the drawback is that archive.org would have to pay a fee every time they submit a new hash. A comment above pointed out that the scheme I described (unsurprisingly) already exists: https://petertodd.org/2016/opentimestamps-announcement

Realistically it might be overkill, though. Simply setting up a mailing list that anybody can subscribe to and be sent the checksum every day, or even just publishing it at some URL and letting users scrape it, might be sufficient. If we're talking about one checksum per day, it's only a few kilobytes per year; it shouldn't be too difficult to convince a few hundred people and organizations around the world to mirror it.


I think you need a checksum for every page, not every day. How would you independently verify the checksum for an entire day?


Merkle trees. Whoever wants to store a timestamp for some message stores the path from that message to the root of the Merkle tree. Only the root of each day's Merkle tree needs to be published.
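Verifying such a path is cheap: you recompute hashes from your leaf up to the root using the stored sibling hashes, and compare against the published root. A minimal sketch with a two-leaf tree (the helper name and proof format are invented for the example):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_path(leaf: bytes, path: list[tuple[bytes, str]],
                root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path.

    `path` lists (sibling_hash, side) pairs from the leaf level upward,
    where side is "L" if the sibling sits to the left of the current node.
    """
    node = leaf
    for sibling, side in path:
        node = h(sibling + node) if side == "L" else h(node + sibling)
    return node == root

# Tiny two-leaf tree: root = H(H(a) || H(b)).
a, b = h(b"page-a"), h(b"page-b")
root = h(a + b)

assert verify_path(a, [(b, "R")], root)   # valid proof for leaf a
assert verify_path(b, [(a, "L")], root)   # valid proof for leaf b
```

For a tree with n leaves the path is only about log2(n) hashes, which is why each party needs to keep so little data to prove inclusion against a single published root.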


No. HTTPS doesn't provide non-repudiation.

HTTPS does not sign the content. It MACs the content.
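The reason a MAC gives no non-repudiation is that it is symmetric: both TLS endpoints derive the same MAC key during the handshake, so either side can produce a valid tag over any bytes. A minimal sketch (the key and record bytes are invented for illustration):

```python
import hashlib
import hmac

# In TLS, client and server derive the *same* MAC key from the handshake.
shared_key = b"session-mac-key-derived-in-handshake"
record = b"HTTP/1.1 200 OK\r\n\r\nhello"

# Tag the server would attach to this record.
server_tag = hmac.new(shared_key, record, hashlib.sha256).digest()

# The client holds the identical key, so it can produce an equally
# valid tag over fabricated bytes. A saved transcript therefore proves
# nothing to a third party about which side authored the data.
client_forgery = hmac.new(shared_key, b"fabricated response",
                          hashlib.sha256).digest()
```

This is exactly why the earlier "record the HTTPS conversation and prove it later" idea fails: a signature would bind the content to the server's private key, but a MAC only binds it to a key both parties hold.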


Yup! This is the problem. There’s a blog post out there by a security guy titled “nothing is real” where he covers how nobody can truly trust the data they get from computers.


Do you think you can find it? Sounds interesting.


I found my tweet about it:

https://mobile.twitter.com/zachaysan/status/9594855218430976...

As an aside, I’ve been noticing Google has been getting worse and worse at finding something I’m sure is out there. I’m not saying that they’re necessarily getting worse, but maybe it’s getting harder to deal with the sheer scale of the web these days.


It's good enough to combat link rot if you need to or want to refer back to or read something again. Nothing on the internet is permanent and even if it is still out there, that's no guarantee that Google or any other search engine still has it in their index. At least that way the information isn't lost to you.



