More

wolfgarbe · 2025-02-16T11:14:12 1739704452

The most widely used DHT is Kademlia from Petar Maymounkov and David Mazières. It is used in Ethereum, IPFS, I2P, Gnutella DHT, and many other applications.

https://en.wikipedia.org/wiki/Kademlia

https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia...

https://web.archive.org/web/20120128120732/http://www.cs.ric...

wolfgarbe · 2024-12-09T18:56:22 1733770582

SeekStorm comes with an http interface.

The SeekStorm server features an REST API via http: https://seekstorm.apidocumentation.com

It also comes with an embedded Web UI: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#bu...

Or did you mean a Web based interface to create and manage indices, define index schemas, add documents etc?

wolfgarbe · 2024-12-09T18:47:34 1733770054

>> The documentation seems a bit sparse.

We just released a new OpenAPI based documentation for the SeekStorm server REST API: https://seekstorm.apidocumentation.com

For the library we have the standard rust doc: https://docs.rs/seekstorm/latest/seekstorm/

wolfgarbe · 2024-12-03T18:32:17 1733250737

For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.

For SimilarityType::Bm25fProximity which takes into account the proximity between query term matches within the document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.

Systematic relevancy benchmarks like BeIR, MS MARCO are planned.

ghita_ · 2024-12-11T00:46:41 1733878001

got it - i think the anecdotal evidence is what intrigued me a little bit looking forward to seeing the systematic relevancy benchmarks

wolfgarbe · 2024-12-03T12:03:40 1733227420

The Seekstorm library is 9 MB, and the Seekstorm server executable is 8 MB, depending on the features selected in cargo.

You add the library via 'cargo add seekstorm' to your project which you anyway have to compile.

As for the server, we may add binaries for the main OS in the future.

WASM and Python bindings are on our roadmap.

wolfgarbe · 2024-12-03T07:22:29 1733210549

In SeekStorm you can choose per index whether to use Mmap or let SeekStorm fully control Ram access. There is a slight performance advantage to the latter, at the cost of higher index load time of the former. https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...

nextaccountic · 2024-12-06T23:59:50 1733529590

Does seekstorm use io_uring? Could io_uring lower load time here?

Or at least lazy loading of index in RAM (emulating what mmap would do anyway)

wolfgarbe · 2024-12-14T14:12:19 1734185539

SeekStorm does currently not use io_uring, but it is on our roadmap. Challenges are the cross-platform compatibility. Linux (io_uring) and Windows (IoRing) use different implementations, and other OS don't support it. There is no abstraction layer over those implementations in Rust, so we are on our own.

It would increase concurrent read and write speed (index loading, searching) by removing the need to lock seek and read/write.

But I would expect that the mmap implementations do already use io_uring / IoRing.

Yes, lazy loading would be possible, but pure RAM access does not offer enough benefits to justify the effort to replicate much of the memory mapping.

wolfgarbe · 2024-12-03T07:15:36 1733210136

The benchmark should be fairly fair, as it was developed by Tantivy themselves (and Jason Wolfe). So, the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark it on your machine with your data and your queries.

Yes, WASM and Python bindings are on our roadmap.

wolfgarbe · 2024-12-03T06:54:15 1733208855

The code for the distributed search cluster is not yet stable enough to be published, but it will be released as open-source as well.

As for shared storage, do you mean something like NAS or, rather Amazon S3? Cloud-native support of object storage and separating storage and compute is on our roadmap. Challenges will be maintaining latency and the need for more sophisticated caching.

jszymborski · 2024-12-04T15:17:00 1733325420

S3 support would be absolutely killer.

wolfgarbe · 2024-12-02T23:44:13 1733183053

Its not just about speed. Speed reflects efficiency. Efficiency is needed to serve more queries in parallel, to search within exponentially growing data, with less expensive hardware, and fewer servers, consuming less energy. Therefore the pursuit for efficiency never gets outdated and has no limit.

sdesol · 2024-12-03T00:44:40 1733186680

In addition to what you said, faster searches can also provide different search options. For example, if you can execute five similar searches in the same time that it would take to execute one. You now have the option to ask "Can I leverage five similar searches to produce better results" and if the answer is yes, you can now provide better answers and still keep the same user experience.

Where I really think faster searches will come into play is with AI. There is nothing energy efficient about how LLM work and I really think Enterprise will focus on using LLM to generate as many Q and A pairs during off peak energy hours and using a hybrid search that can bridge semantic (vector) and text. I think for Enterprise the risk of hallucinations (even with RAG) will be too great and fall back to traditional search, but with a better user experience.

Based on the README, it looks like vector search is not supported or planned, but it would be interesting to see if SeekStorm can do this more efficiently than Lucene/OpenSearch and others. I only dabbled in the search space, so I don't know how complex this would be, but I think SeekStorm can become a killer search solution if it can support both.

Edit: My bad, it looks like vector search is PoC.

wolfgarbe · 2024-12-02T22:10:21 1733177421

PostgreSQL is an SQL database that also offers full-text search (FTS), with extensions like pg_search it also supports BM25 scoring which is essential for lexical search. SeekStorm is centered around full-text search only, it doesn't offer SQL.

Performance-wise it would be indeed interesting to run a benchmark. The third-party open-source benchmark we are currently using (search_benchmark_game) does not yet support PostgreSQL. So yes, that comparison is still pending.

yellow_lead · 2024-12-03T03:11:34 1733195494

When I tried to use FTS in Postgres, I got terrible performance, but maybe I was doing something wrong. I'm using Meili now.

anonzzzies · 2024-12-03T08:00:42 1733212842

Same here, this would easily beat it as far as I have seen, but maybe I did something wrong.

philippemnoel · 2024-12-03T16:26:45 1733243205

ParadeDB (paradedb.com) is similar to this HN, but baked into Postgres to solve this very problem you are describing