I would like to store ALL actual text I ingest from the world, it wouldn't be that big. It would be immune to the unreliable nature of the web. I'd trigram index it as it went in to enable better search. (I read somewhere yesterday that's the way to do it)
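A trigram index can be sketched minimally in pure Python (all names here are mine, not from any particular library): every 3-character window of a document maps to the set of documents containing it, and a substring query is answered by intersecting the posting sets for the query's trigrams, then verifying the candidates.

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character windows of the (lowercased) text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

class TrigramIndex:
    """Minimal in-memory trigram index: trigram -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for g in trigrams(text):
            self.postings[g].add(doc_id)

    def search(self, query):
        """Docs containing every trigram of the query (candidates for
        a substring match), then an actual substring check."""
        grams = trigrams(query)
        if not grams:
            return set(self.docs)
        candidates = set.intersection(*(self.postings[g] for g in grams))
        return {d for d in candidates if query.lower() in self.docs[d].lower()}

idx = TrigramIndex()
idx.add("a", "Store all the text you ingest")
idx.add("b", "Video is huge and needs managing")
print(idx.search("ingest"))  # {'a'}
```

The trigram step only narrows the candidate set; the final substring check is what makes the result exact.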
I'd also like all the audio and video I hear/see to have a transcript of any spoken words, with a recording, and timestamps. Video is huge, so that would need to be managed a bit.
Given these requirements, I can see a need to start riding along with Moore's law again. Text is well within our capabilities; we can't read that fast, so it should be quite feasible to store it all. The need for more storage and processing power than my laptop has comes into play with audio and video, especially transcription and storage.
Having stored this content, I want to be able to search it. I want some form of content/sentiment analysis to allow search by concept and association.
I also want to be able to rate it, not just in a single dimension, but in an arbitrary number of them. Something can be funny, insightful, literally false and metaphorically true, a bit racist, somewhat political, and in English. Thumbs up/down, or a single 1-10 scale, works well for forcing a rating into a single database field, but not for actual real-world use.
Every single piece of content fits into multiple orthogonal hierarchies; you can't store that information in any single ranking system without information loss.
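One way to sketch arbitrary-dimension ratings (schema and names are my own, just an illustration): instead of a single score column, store one row per (item, dimension), so the set of dimensions stays open-ended.

```python
import sqlite3

# One row per (item, dimension) instead of a single score column,
# so new dimensions need no schema change.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE rating (
        item      TEXT NOT NULL,
        dimension TEXT NOT NULL,
        score     REAL NOT NULL,
        PRIMARY KEY (item, dimension)
    )
""")

def rate(item, **dimensions):
    """Upsert scores along any number of named dimensions."""
    db.executemany(
        "INSERT OR REPLACE INTO rating VALUES (?, ?, ?)",
        [(item, d, s) for d, s in dimensions.items()],
    )

rate("essay-42", funny=0.8, insightful=0.9,
     literally_false=1.0, metaphorically_true=1.0)

# Query across dimensions: everything rated both funny and insightful.
rows = db.execute("""
    SELECT a.item FROM rating a JOIN rating b USING (item)
    WHERE a.dimension = 'funny' AND a.score > 0.5
      AND b.dimension = 'insightful' AND b.score > 0.5
""").fetchall()
print(rows)  # [('essay-42',)]
```

The trade-off is that multi-dimension queries become self-joins, but no information is thrown away at write time.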
As for sharing, it has to be something I pay for, or host myself, with possible federation. Ads corrupt.
---
Implementation - The first step is to simply tap the stream of web traffic I see in the browser, and train a classifier to recognize text/not text. It is important to link it back to the source.
Once I have a reliable stream of text, I think the rest starts to align.
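Before training a real classifier, a crude heuristic can stand in for the text/not-text step: prose tends to be longish, mostly letters and spaces, and punctuated. This is only a sketch (the URL and thresholds are mine), but it shows the shape of the pipeline, including linking each kept block back to its source.

```python
import re

def looks_like_prose(block: str) -> bool:
    """Crude text/not-text heuristic, a stand-in for a trained
    classifier: real prose is longish, mostly letters and spaces,
    and contains sentence punctuation."""
    words = block.split()
    if len(words) < 5:
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in block) / len(block)
    return alpha > 0.7 and bool(re.search(r"[.!?]", block))

# Keep (source, text) pairs so every block links back to where it came from.
record = [
    ("https://example.com/post", b)
    for b in ["Home | About | Contact",
              "Storing everything you read turns out to be cheap.",
              "function f(x){return x*2;}"]
    if looks_like_prose(b)
]
print(record)
```

A trained classifier would replace `looks_like_prose` without changing anything downstream.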
This would have been easier years ago, when everything you surfed on the web was stored as files in a cache directory accessible to the user. You'd just have to copy all the text from that directory into another, persistent directory and use something like ripgrep over it.
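That "copy into a persistent directory, then search it" idea can be sketched in a few lines of Python; this is a naive stand-in for ripgrep (which would be far faster), just to show the flow.

```python
import pathlib
import tempfile

def grep_dir(root, needle):
    """Naive ripgrep stand-in: yield (file, line_no, line) for every
    line under root containing needle."""
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for n, line in enumerate(lines, 1):
            if needle in line:
                yield (str(path), n, line)

# Demo against a throwaway "cache" directory.
with tempfile.TemporaryDirectory() as d:
    pathlib.Path(d, "page1.txt").write_text(
        "first line\ntrigram indexes are neat\n")
    hits = list(grep_dir(d, "trigram"))
print([(n, line) for _, n, line in hits])  # [(2, 'trigram indexes are neat')]
```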
I've tried building similar things and this is the issue I've run into. I want to log every bit of text I read, but either I'm breaking HTTPS to sniff it off the wire (and then mangling that data to try to get to the actual content I read, not the formatting of the site) or I'm going to have to write an app for every platform I use and switch to reading exclusively through that.
You don't have to break HTTPS. One (browser-specific) method is to build your own extension (like a Chrome extension) that loads on every page and sends that data to your persistence layer.
For example, on page load have your extension grab the body.innerText or innerHTML, send that as a POST request to your localhost server at some endpoint, and have it write to your persistence DB.
Then build a query layer over it. Since your DB can be stored anywhere, like a cloud, you can just build an app/mechanism to query that central hub from any platform or webapp.
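The receiving side of that flow can be sketched with nothing but the standard library. The endpoint path, port, and payload shape below are my inventions; the extension's content script would send something like `fetch("http://localhost:PORT/ingest", {method: "POST", body: JSON.stringify({url: location.href, text: document.body.innerText})})`, and the demo below simulates that request.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = []  # stand-in for a real persistence DB (SQLite, cloud, ...)

class Ingest(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect {"url": ..., "text": ...} from the extension.
        size = int(self.headers.get("Content-Length", 0))
        STORE.append(json.loads(self.rfile.read(size)))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Ingest)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate what the content script's fetch() would send.
payload = json.dumps({"url": "https://example.com", "text": "hello"}).encode()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/ingest", data=payload,
    headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)
server.shutdown()
print(STORE)
```

In practice the append to `STORE` would become an insert into whatever DB the query layer reads from.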
Copyright issues aren't realistically a concern when only you are using it. If I'm going to save a picture from the internet and use it as my laptop wallpaper, I'm not going through the arduous process of getting copyright licenses to do that. Hopefully, you were referring to something else.
You do if you want to scrape what you're reading on your mobile devices too, or you need to write an app for every platform (like I mentioned). Safari on iOS won't let you do that.
Copyright is an issue if you're planning on selling this as a service or distributing it as open source (like youtube-dl).
> You do if you want to scrape what you're reading on your mobile devices too, or you need to write an app for every platform (like I mentioned). Safari on iOS won't let you do that.
It's possible you're asking for too much. This is one of those cases where you should consider diminishing returns.
That said, unless you want to aim for 100% coverage (which I'd advise against), just use Chrome or another browser on your iDevice; the history is automatically synced to your non-mobile device, where you can use the previous trick to scrape those sites. No need for an external app.
> Copyright is an issue if you're planning on selling this as a service or distributing it as open source (like youtube-dl).
Sure, but that's a completely different tangent from your current scenario. However, it is still very much possible with slightly different strategies.