(1) Takes a URL and optional comment as input
(2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)
(3) Inserts that URL, title of the page pointed-to by the URL and the optional comment into an org-mode file that lives in the root of the repo
The org-mode file is a highly-searchable and context-preserving database (I can add tags, create hierarchies, add links to and from other relevant (org-mode or not) files) in the most portable format ever: plain text.
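The three steps above can be sketched quickly. This is a minimal illustration, not a finished tool: it uses a naive regex for the title (a real script should handle encodings and malformed HTML), and the `bookmarks.org` filename is an assumption.

```python
import re

def extract_title(html: str) -> str:
    """Pull the <title> out of raw HTML; naive, but enough for a sketch."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else "(untitled)"

def org_entry(url: str, title: str, comment: str = "") -> str:
    """Format one bookmark as an org-mode heading using org link syntax."""
    entry = f"* [[{url}][{title}]]\n"
    if comment:
        entry += f"  {comment}\n"
    return entry

# Appending to the database file in the repo root (filename is an assumption):
# with open("bookmarks.org", "a") as f:
#     f.write(org_entry(url, extract_title(html), comment))
```

Fetching the page (step 2) and committing both the snapshot and the updated org file would wrap around this.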
I really don't need a web interface. Actually, if I later decide that I need one, I can build one easily on top of this basic system.
I really want to be able to use this across multiple devices: mainly my two computers, and an Android phone. Using git gives me a reliable protocol for syncing between multiple devices. I want it to be a smooth experience on my phone, which would probably require some sort of git-aware app. Something similar to the Android client for the pass password manager would be ideal.
I hear that git repos can be GPG-encrypted. Ideally, I'm able to serve all this off of a repo hosted on a VPS. I don't want to rely on Dropbox (I'm trying to transition away from it) for syncing.
> (2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)
FWIW I've done something similar and lots of sites that use a lot of JS (and pretty much every single page webpage like twitter and FB) will not re-render correctly just because you have the files. It actually takes a lot of work to clone a webpage, the best solution I've found so far is to print a PDF from a headless chrome (but this has its own problems, like now you have to deal with a PDF).
Even generating the PDF is harder than it seems, at least if you've never done it before, because there are a lot of gotchas. For example, did you know that most websites provide a second stylesheet to be used while printing, which can make the page look only slightly off but still clearly broken? I didn't either.
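For reference, the headless-Chrome approach boils down to one invocation. A sketch of building it, assuming the binary is called `chromium` (the name varies by platform: `google-chrome`, `chrome`, etc.):

```python
def chrome_pdf_cmd(url: str, out_pdf: str, binary: str = "chromium") -> list:
    """Build the argv for printing a page to PDF with headless Chrome.

    The binary name is an assumption; adjust it for your platform.
    """
    return [
        binary,
        "--headless",                    # run without a visible window
        "--disable-gpu",                 # often needed on headless servers
        f"--print-to-pdf={out_pdf}",     # write the rendered page as a PDF
        url,
    ]

# e.g. subprocess.run(chrome_pdf_cmd("https://example.com", "page.pdf"))
```

Note this renders with the print stylesheet mentioned above, which is exactly where the "barely messed up but clearly broken" results come from.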
If the PDF format is not mandatory for you, you might be interested in SingleFile [1] (I'm the author) which you can run from the command line. It will interpret scripts and faithfully save a snapshot of a page in a single HTML file.
For many "modern" sites, it's really better to just take a screenshot and save the PNG.
Though there are still many sites that render just fine without JS. I've been trying out Brave Browser with JS disabled for some weeks now, and I was surprised how many sites are readable with JS disabled. And so much faster and less jumpy too.
Hmm... this wouldn't be too hard to write, and it sounds like an interesting weekend project.
Would you be interested in using WARC for the webpage though? This way, everything is captured in a single file and you aren't littering your repo with random files and images.
Aren't we in luck that the weekend is just coming up!
> Would you be interested in using WARC for the webpage though? This way, everything is captured in a single file and you aren't littering your repo with random files and images.
I didn't know about this. I've looked into it a bit, and it seems perfect.
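To show what WARC looks like on the inside: a single response record is just a version line, some named headers, a blank line, and the payload, separated by CRLFs. A hand-rolled sketch (real archivers such as `wget --warc-file` produce these for you; the record ID here is a freshly generated placeholder):

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri: str, payload: bytes) -> bytes:
    """Build one minimal WARC/1.0 response record by hand.

    This only illustrates the shape of the format; a real archive
    also needs warcinfo/request records and proper HTTP payloads.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + target_uri.encode(),
        b"WARC-Date: " + now.encode(),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"Content-Type: application/http;msgtype=response",
        b"Content-Length: " + str(len(payload)).encode(),
    ]
    # Record = headers, blank line, payload block, two trailing CRLFs.
    return b"\r\n".join(headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"
```

Since the whole record is one byte stream, it drops into a git repo as a single file, which is the appeal here.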
I'm not too concerned about saving webpages, I'm much more concerned about actually having a populated database of links. I only expect to need to use the saved page if the link breaks.
I can work on writing a simple elisp script (incidentally, I don't know very much elisp either, but that's something I'm willing to take time out to learn, because I expect to be using it a lot in the future), but I would need someone else to write the Android app.
> (2) Saves the webpage it points to into a git repo (a simple curl should suffice for most websites)
> (3) Inserts that URL, title of the page pointed-to by the URL and the optional comment into an org-mode file that lives in the root of the repo
If you're willing to change "git" to "version control", it should be pretty easy to implement that in Fossil. It doesn't require much to add an extension written in your language of choice if you're going to run it on your desktop. Plus you'd get the web interface for free if you decided to put it on a web server.
I just wrote a script that covers the first two points (though it creates a PDF rather than using a simple curl) and allows searching the database. Org-mode stuff could be added later.
github.com/websalt/bmark
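Searching a flat org-mode bookmark file doesn't really need a database layer at all. A sketch, assuming the one-bookmark-per-top-level-heading layout described earlier in the thread:

```python
def search_bookmarks(org_text: str, query: str) -> list:
    """Return org headings (one bookmark each) whose entry mentions the query.

    An entry is a top-level heading ("* ...") plus the indented lines
    under it; the match is a simple case-insensitive substring test.
    """
    hits, current = [], None
    for line in org_text.splitlines():
        if line.startswith("* "):
            current = line  # track which bookmark we're inside
        if current and query.lower() in line.lower():
            if current not in hits:
                hits.append(current)
    return hits
```

Since the whole file is plain text, `grep` would do much the same job from the command line; the function form just makes it easy to build on later.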