
I think they might just try all the user agents listed in the robots.txt. I've included a picture showing an example. [1] In the second image, [2] I receive the paywall when the user agent is left as the default. There might also be a common archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
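
For anyone curious, a rough sketch of that guess is below. The site URL, article path, and the "paywall" check are all made-up placeholders, not anything the sites in the screenshots actually use:

    # Sketch: pull the user agents named in robots.txt and retry the page
    # with each one until something other than the paywall comes back.
    import re
    import urllib.request

    SITE = "https://example.com"              # hypothetical paywalled site
    ARTICLE = SITE + "/2024/some-article"     # hypothetical article path

    def fetch(url, user_agent):
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")

    # Collect every "User-agent:" value the site bothered to list.
    robots = fetch(SITE + "/robots.txt", "Mozilla/5.0")
    agents = re.findall(r"(?im)^user-agent:\s*(\S+)", robots)

    for agent in agents:
        if agent == "*":
            continue
        html = fetch(ARTICLE, agent)
        if "paywall" not in html.lower():     # naive stand-in for a real check
            print("got the full page as", agent)
            break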



That user-agent seems to be in the robots.txt as _disallowed_, but somehow it gets through the paywall? That seems counter-intuitive.


It's just blocking the root. Look up the robots.txt specification for more information. One purpose is to reduce load on parts of the website that they do not want indexed.


Definitely incorrect: the paths in robots.txt are prefixes, so `/` means anything starting with `/`, that is, everything. Look up the robots.txt specification for more information! (Or, for instance, look up how you'd block an entire site in robots.txt if you wanted to!)


No, `/` means the entire site: the root and anything below it.
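
For what it's worth, the robots.txt parser in Python's standard library agrees; the rules here are just an illustration:

    # "Disallow: /" matches every path, because rules are prefix matches.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /",
    ]
    rp = RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("GoogleBot", "https://example.com/"))             # False
    print(rp.can_fetch("GoogleBot", "https://example.com/any/article"))  # False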


That's an interesting idea, but is it true?


Websites usually want their pages indexed by search engines, since it increases the traffic they receive, and they often try to allow archival use as well. The robots.txt usually names the user agents of known search engine crawlers, since one purpose of the file is to reduce load on the website by keeping crawlers away from pages that do not need to be indexed.

It might not be what is actually happening, since there are other ways around a paywall, but it is a real possibility for how it could be done (at least until the websites that allow other user agents decide they want to try to stop archive.is, etc.).

edit: I think it's likely they have multiple methods for archiving a website. Many people in this post are saying that they've previously stated they just convert the link to an AMP link and archive that. I doubt that's all they do, but it could be part of it.
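
(If the AMP theory is right, the rewrite could be as simple as pointing at the Google AMP Cache copy of the page. This is just a guess at the mechanism; the article URL is made up, and the URL form shown is the older cdn.ampproject.org style.)

    # Guess at the AMP approach: rewrite the link to the AMP cache's copy
    # and archive that instead of the paywalled original.
    from urllib.parse import urlparse

    def amp_cache_url(url):
        p = urlparse(url)
        prefix = "c/s" if p.scheme == "https" else "c"  # "s" marks https origins
        return f"https://cdn.ampproject.org/{prefix}/{p.netloc}{p.path}"

    print(amp_cache_url("https://example.com/2024/some-article"))
    # https://cdn.ampproject.org/c/s/example.com/2024/some-article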

Using the robots.txt file in this way might not be how the authors of the website intended it to be used. I could see that maybe being used against them in court if someone ever tried to stop them. In the past, I've seen websites tell people writing bots to deliberately set their user agent to one the site defined, but using it for a non-allowed purpose is what I was getting at. Then again, there are multiple ways they could be archiving a website, so this is not necessarily how it is being done.



