Reading the web offline and distraction-free

ivansavz · on Oct 10, 2021

Very cool.

The take-any-webpage-offline need is also common in the education space (teachers want to save a webpage and send it to their students as part of a lesson and don't want to worry about availability or ads etc).

I used to work on tools for this https://github.com/learningequality/ricecooker/blob/develop/... and https://github.com/learningequality/BasicCrawler/blob/master... which worked quite well for most sites, but still very far from a general-purpose solution.

There is also more powerful/general-purpose scraper that generates a ZIM file here: https://github.com/openzim/zimit

It would be really nice to a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)

genewitch · on Oct 11, 2021

Firefox headless can save webpages as images or "as a complete page" can't it?

I use the image functional and only sites like Twitter fail to work correctly, although it's probably my CGI gateway timing out waiting for JavaScript or whatever.

tasogare · on Oct 10, 2021

That's a solved problem: click right, save the page as MHTML. It can be opened later and online with a browser again.

If I recall correctly the only gotcha is that the option to save in this format needs to be enable using the flags settings.

captn3m0 · on Oct 10, 2021

I do a lot of this work[3] (web to documents) and it's interesting to see other approaches. The medium image problem is something I've faced as well, but never got around to fixing. I'm planning to get a Remarkable soon, so will definitely be trying this out.

My personal solution has been https://github.com/captn3m0/url-to-epub/ (Node/readability), which I've tested against the entirety of Tor's original fiction collection[0] where it performs well enough (I'm biased). Another tool that does this beautifully well is percollate[1], but it doesn't give enough control of the metadata to the user - something I really care about.

I've also started to use rdrview[2], which is a C-port of the current Firefox implementation of "reader view". It is very unix-y, so it is easy to pipe content to it (I usually run it through tidy first). Quite helpful in building web-archiving or web-to-pdf or web-to-kindle pipelines easily.

[0]: https://www.tor.com/category/all-fiction/original-fiction/

[1]: https://github.com/danburzo/percollate

[2]: https://github.com/eafer/rdrview

[3]: https://captnemo.in/ebooks/

atsaloli · on Oct 10, 2021

I run "lynx --dump $URL | vim -" to read the text in Vim when the web page gets too cluttered (I use Vim as a pager because I know "Vim" better than "less").

genewitch · on Oct 11, 2021

Links (I don't know if lynx has it) has a mode where it can put images in too, and render to a desktop window, and I think an image file. It makes websites look completely different, but is useful for a wikipedia article or a news article where the important bits are the text and photo/graphs.

owulveryck · on Oct 11, 2021

Nice trick, thanks for sharing it.

Syonyk · on Oct 10, 2021

How is this different from the Wallabag project, which, as I understand it (it's on my list of "Things to mess with at some point") does exactly the same thing - website to epub for offline reading?

bredren · on Oct 10, 2021

Newspaper3k is a Python package I’m using to extract content from articles across the web.

But it has not been maintained, since the author joined Facebook.

It works alright, but it has many issues.

If I understand correctly, a full on replacement for newspaper is in the wings, seeking to offer a sustainable content extraction tool in Python.

But it isn’t ready yet. And some of the problems in this area mirror those faced by web scrapers.

phkx · on Oct 10, 2021

I‘ve been using pandoc to extract texts next to my notes (both in Markdown) in order to add links between them. I haven’t extracted too many pages yet, but the results were reasonable so far, although sometimes lots of html tags remain. Also, none of them contained any math so far.

thm · on Oct 10, 2021

Needless to say that extractability hasn't gotten easier in recent years but I'm even more concerned about archive.org's quality/capabilities — They really need to step up their game to remain useful in this area.

gcr · on Oct 11, 2021

Calibre supports getpocket via a plugin that you can add from the app. Then, you can click the "Get News" button to download all the articles from your Pocket feed into your eBook reader at once.

owulveryck · on Oct 10, 2021

This is a post about a tool I am building to generate an epub from a website

spikej · on Oct 10, 2021

Good luck! Those HTML issues you're coming across are tough and so varied across the web!

I was working with Mercury Parser (pluggable parsing for different sites) in the past.

https://github.com/postlight/mercury-parser

haroldtreen · on Oct 10, 2021

I built a Chrome Extension that does this exact thing :). There's also a WebAPI.

https://epub.press/

alexmcc81 · on Oct 10, 2021

On a related note, does anyone know of any open source project that could parse the website text and generate a list of tags?

thm · on Oct 10, 2021

Textblob/NLTK. I use Google (paid) though these days.

mro_name · on Oct 10, 2021

occasionally I use https://github.com/gildas-lormeau/SingleFile

notionparallax · on Oct 11, 2021

I've been making something for this for a couple of years now, with http://waldenpond.press/

It connects to the Pocket API to get the parsed articles, pushes them through quite a lot of BS4 clean up, then renders them using paged.js. The resulting PDFs are then printed by Lulu.com, and they come once a month as a printed book to read completely offline.

I solved the Medium image issue with CSS as far as I remember. `.medium\.com svg:first-of-type` and then set it to `display: none`.