I would like to store ALL actual text I ingest from the world, it wouldn't be that big. It would be immune to the unreliable nature of the web. I'd trigram index it as it went in to enable better search. (I read somewhere yesterday that's the way to do it)
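A trigram index can be sketched minimally in pure Python (all names here are mine, not from any particular library): every 3-character window of a document maps to the set of documents containing it, and a substring query is answered by intersecting the posting sets for the query's trigrams, then verifying the candidates.

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character windows of the (lowercased) text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

class TrigramIndex:
    """Minimal in-memory trigram index: trigram -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for g in trigrams(text):
            self.postings[g].add(doc_id)

    def search(self, query):
        """Docs containing every trigram of the query (candidates for
        a substring match), then an actual substring check."""
        grams = trigrams(query)
        if not grams:
            return set(self.docs)
        candidates = set.intersection(*(self.postings[g] for g in grams))
        return {d for d in candidates if query.lower() in self.docs[d].lower()}

idx = TrigramIndex()
idx.add("a", "Store all the text you ingest")
idx.add("b", "Video is huge and needs managing")
print(idx.search("ingest"))  # {'a'}
```

The trigram step only narrows the candidate set; the final substring check is what makes the result exact.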
I'd also like all the audio and video I hear/see to have a transcript of any spoken words, with a recording, and timestamps. Video is huge, so that would need to be managed a bit.
Given these requirements, I can see a need to start riding along with Moore's law again. Text is well within our capabilities; we can't read that fast, so it should be quite feasible to store it all. The need for more storage and processing power than my laptop has comes into play with audio and video, especially transcription and storage.
Having stored this content, I want to be able to search it. I want some form of content/sentiment analysis to allow search by concept and association.
I also want to be able to rate it, not just in a single dimension, but in an arbitrary number of them. Something can be funny, insightful, literally false and metaphorically true, a bit racist, somewhat political, and in English. Thumbs up/down, or a single 1-10 scale, works well for forcing a rating into a single database field, but not for actual real-world use.
Every single piece of content fits into multiple orthogonal hierarchies; you can't store that information in any single ranking system without information loss.
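One way to sketch arbitrary-dimension ratings (schema and names are my own, just an illustration): instead of a single score column, store one row per (item, dimension), so the set of dimensions stays open-ended.

```python
import sqlite3

# One row per (item, dimension) instead of a single score column,
# so new dimensions need no schema change.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE rating (
        item      TEXT NOT NULL,
        dimension TEXT NOT NULL,
        score     REAL NOT NULL,
        PRIMARY KEY (item, dimension)
    )
""")

def rate(item, **dimensions):
    """Upsert scores along any number of named dimensions."""
    db.executemany(
        "INSERT OR REPLACE INTO rating VALUES (?, ?, ?)",
        [(item, d, s) for d, s in dimensions.items()],
    )

rate("essay-42", funny=0.8, insightful=0.9,
     literally_false=1.0, metaphorically_true=1.0)

# Query across dimensions: everything rated both funny and insightful.
rows = db.execute("""
    SELECT a.item FROM rating a JOIN rating b USING (item)
    WHERE a.dimension = 'funny' AND a.score > 0.5
      AND b.dimension = 'insightful' AND b.score > 0.5
""").fetchall()
print(rows)  # [('essay-42',)]
```

The trade-off is that multi-dimension queries become self-joins, but no information is thrown away at write time.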
As for sharing, it has to be something I pay for, or host myself, with possible federation. Ads corrupt.
---
Implementation - The first step is to simply tap the stream of web traffic I see in the browser, and train a classifier to recognize text/not text. It is important to link it back to the source.
Once I have a reliable stream of text, I think the rest starts to align.
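Before training a real classifier, a crude heuristic can stand in for the text/not-text step: prose tends to be longish, mostly letters and spaces, and punctuated. This is only a sketch (the URL and thresholds are mine), but it shows the shape of the pipeline, including linking each kept block back to its source.

```python
import re

def looks_like_prose(block: str) -> bool:
    """Crude text/not-text heuristic, a stand-in for a trained
    classifier: real prose is longish, mostly letters and spaces,
    and contains sentence punctuation."""
    words = block.split()
    if len(words) < 5:
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in block) / len(block)
    return alpha > 0.7 and bool(re.search(r"[.!?]", block))

# Keep (source, text) pairs so every block links back to where it came from.
record = [
    ("https://example.com/post", b)
    for b in ["Home | About | Contact",
              "Storing everything you read turns out to be cheap.",
              "function f(x){return x*2;}"]
    if looks_like_prose(b)
]
print(record)
```

A trained classifier would replace `looks_like_prose` without changing anything downstream.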
This would have been easier years ago, when everything you surfed on the web was stored as files in a cache directory accessible to the user. You'd just have to copy all the text from that directory into another, persistent directory and use something like ripgrep over it.
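That "copy into a persistent directory, then search it" idea can be sketched in a few lines of Python; this is a naive stand-in for ripgrep (which would be far faster), just to show the flow.

```python
import pathlib
import tempfile

def grep_dir(root, needle):
    """Naive ripgrep stand-in: yield (file, line_no, line) for every
    line under root containing needle."""
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for n, line in enumerate(lines, 1):
            if needle in line:
                yield (str(path), n, line)

# Demo against a throwaway "cache" directory.
with tempfile.TemporaryDirectory() as d:
    pathlib.Path(d, "page1.txt").write_text(
        "first line\ntrigram indexes are neat\n")
    hits = list(grep_dir(d, "trigram"))
print([(n, line) for _, n, line in hits])  # [(2, 'trigram indexes are neat')]
```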
I've tried building similar things and this is the issue I've run into. I want to log every bit of text I read, but either I'm breaking HTTPS to sniff it off the wire (and then mangling that data to try to get to the actual content I read, not the formatting of the site) or I'm going to have to write an app for every platform I use and switch to reading exclusively through that.
You don't have to break HTTPS. One (browser-specific) method is to build your own extension (like a Chrome extension) that loads on every page and sends that data to your persistence layer.
For example, on page load have your extension grab the body.innerText or innerHTML, send that as a POST request to your localhost server at some endpoint, and have it write to your persistence DB.
Then build a query layer over it. Since your DB can be stored anywhere, like a cloud, you can just build an app/mechanism to query that central hub from any platform or webapp.
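The receiving side of that flow can be sketched with nothing but the standard library. The endpoint path, port, and payload shape below are my inventions; the extension's content script would send something like `fetch("http://localhost:PORT/ingest", {method: "POST", body: JSON.stringify({url: location.href, text: document.body.innerText})})`, and the demo below simulates that request.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = []  # stand-in for a real persistence DB (SQLite, cloud, ...)

class Ingest(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect {"url": ..., "text": ...} from the extension.
        size = int(self.headers.get("Content-Length", 0))
        STORE.append(json.loads(self.rfile.read(size)))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Ingest)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate what the content script's fetch() would send.
payload = json.dumps({"url": "https://example.com", "text": "hello"}).encode()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/ingest", data=payload,
    headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
server.shutdown()
print(STORE)
```

In practice the append to `STORE` would become an insert into whatever DB the query layer reads from.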
Copyright issues aren't realistically a concern when only you are using it. If I'm going to save a picture from the internet and use it as my laptop wallpaper, I'm not going through the arduous process of getting copyright licenses to do that. Hopefully, you were referring to something else.
You do if you want to scrape what you're reading on your mobile devices too, or you need to write an app for every platform (like I mentioned). Safari on iOS won't let you do that.
Copyright is an issue if you're planning on selling this as a service or distributing it as open source (like youtube-dl).
> You do if you want to scrape what you're reading on your mobile devices too, or you need to write an app for every platform (like I mentioned). Safari on iOS won't let you do that.
It's possible you're asking for too much. This is one of those cases where you should consider diminishing returns.
That said, unless you want to aim for 100% coverage (which I'd advise against), just use Chrome or another browser on your iDevice; the history is automatically synced to your non-mobile device, where you can use the previous trick to scrape those sites. No need for an external app.
> Copyright is an issue if you're planning on selling this as a service or distributing it as open source (like youtube-dl).
Sure, but that's a completely different tangent from your current scenario. However, it is still very much possible with slightly different strategies.