
I tried the demo, but full-content search does not work. I don't know if the demo data is randomly generated; anyway, this is the test I did.

Text to search in the top search bar: RRP

Page that contains that term: https://www.da.vidbuchanan.co.uk/blog/r1-jailbreak.html

Results found: 0

Does this search the content of the archived pages?



Could be related to this as well -> https://github.com/linkwarden/linkwarden/issues/586, where content seems to be imported but its indexing is stuck in the queue. A blocker for me.


The trap of using indexing and a database when plain grep over disk-based storage would do.


Exactly... so much hype around complexity when simplicity wins. That's also why systems like Wallabag, Linkwarden, Omnivore etc. all disappointed me. In the end, with a simple system made of static files and tools available out of the box on most distributions, I could build my own alternative to most archiving/bookmarking systems, and it just works. No DB, no framework, no fancy UI. Yet powerful. I should blog about it.


Do you mind sharing your approach? I've been looking for a system where I can store all kinds of data (webpages, PDFs, images, docx, xlsx...) and run fast full-text search over them. Oddly enough, that's what a filesystem should do, but sadly that's not going to happen.

Can grep and the like search inside images or docx files?

I know there is FileLocator Pro, but I'm looking for a cross-platform tool.


I mostly focus on text-based content, so PDFs and webpages are easily supported. For PDFs I thought about using https://github.com/phiresky/ripgrep-all or pdfgrep (https://pdfgrep.org/).
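
For reference, both can simply be pointed at a directory; a minimal sketch, with the search term and paths as placeholders:

  # recursive, case-insensitive search inside PDFs
  pdfgrep -ri "some term" ~/archive/
  # ripgrep-all wraps ripgrep and also handles PDFs, EPUBs, office docs, archives...
  rga "some term" ~/archive/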

For images, what do you want to grep for? For EXIF data -> https://exiftool.org/. If you want to search the image content itself, you need something smarter. I think this is where tools such as https://github.com/ultralytics/yolov5 can shine for me: simple enough to work with most of my images and tag them according to some preferences, and I would save those tags in a txt file.
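
For the EXIF route, exiftool can filter on tag contents by itself, no grep needed; a rough sketch, assuming the images actually carry a Keywords tag (tag name, pattern, and path are just examples):

  # recursively print the path of every image whose Keywords tag matches "sunset"
  exiftool -r -if '$Keywords =~ /sunset/i' -p '$Directory/$FileName' ~/images/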

Anyway, all the metadata I store about images, links etc. is persisted in txt files: summaries, tags, incoming/outgoing links, each in its own file. There is one folder per link/content item, and under each folder, one file per type of metadata. So it is very easy to know if some metadata is missing: no index needed, it is as simple as checking for the presence of a file. Everything stays compatible with grep.
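
If I read that right, the layout is something like the following (file names are my guess, not necessarily the ones used above):

  archive/
    some-article/
      url.txt
      content.txt
      summary.txt
      tags.txt
      links.txt

  # full-text search across the whole archive
  grep -ril "search term" archive/
  # list items missing a summary: just check for the file's absence
  for d in archive/*/; do [ -f "${d}summary.txt" ] || echo "$d"; done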

For docx and xlsx, that's off my plate at this time; I haven't experimented enough to judge what works well. I hate those formats.


As docx / xlsx are zip files, I normally unzip them and then use some sort of XML-aware grep. But these formats are a rabbit hole of their own ;)
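
For a quick-and-dirty version without a real XML-aware tool, piping the main XML part through a tag-stripping sed often gets you far enough; the part names below are the standard OOXML ones, the file names are placeholders:

  # docx: the body text lives in word/document.xml
  unzip -p report.docx word/document.xml | sed 's/<[^>]*>/ /g' | grep -i "revenue"
  # xlsx: most cell text is deduplicated into xl/sharedStrings.xml
  unzip -p data.xlsx xl/sharedStrings.xml | sed 's/<[^>]*>/ /g' | grep -i "revenue"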



