It is fast, but nowhere close to accurate or useful for this specific example. I could not find a way to force the plural form; neither quotes nor a plus sign worked.
Is there distributed server support? I see it on the list of new features with (currently PoC) next to it, but is the code for the PoC available anywhere?
Also, would there be any potential issues if the index was mounted on shared storage between multiple instances?
The code for the distributed search cluster is not yet stable enough to be published, but it will be released as open-source as well.
As for shared storage, do you mean something like NAS, or rather Amazon S3?
Cloud-native support of object storage and separating storage and compute is on our roadmap. Challenges will be maintaining latency and the need for more sophisticated caching.
I really like your approach. Impressed by your care for performance and your fast pace of adding what appears to be pretty complex stuff, while making sure it stays performant.
In SeekStorm you can choose per index whether to use mmap or to let SeekStorm fully control RAM access. The latter has a slight performance advantage, at the cost of a higher index load time compared to the former.
https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...
SeekStorm does not currently use io_uring, but it is on our roadmap.
The challenge is cross-platform compatibility. Linux (io_uring) and Windows (IoRing) use different implementations, and other OSes don't support it. There is no abstraction layer over those implementations in Rust, so we are on our own.
It would increase concurrent read and write speed (index loading, searching) by removing the need to lock seek and read/write.
But I would expect that the mmap implementations do already use io_uring / IoRing.
Yes, lazy loading would be possible, but pure RAM access does not offer enough benefits to justify the effort to replicate much of the memory mapping.
What is the story for a multi-language corpus? Do I have to do my own stop-word pruning, tokenizing, lemmatizing, etc.? This is usually the case with full-text search solutions, and it is a pain.
Re: stemming and lemmatizing, I just want to plug the most impressive NLP stack I ever used, ChatScript. It's really for building dialog trees: it walks down a branch of conversation using what are effectively switch statements, but with really rich conceptual pattern matching and capturing. So somewhere in the middle of the stack it does an excellent job of abstracting from word input to general concepts (in WordNet), performing all the spell correction (according to your dictionary), stemming, lemmatization, and disambiguation.
I've had it in mind for a while to build a fuzzy search tool based on parsing each phrase into concepts, parsing the search query into concepts, and finding nearest match based on that. It's a C library and very fast.
Looks like it hasn't been committed to in some time, I'll have to check out their blog and see what's up. I guess with the advent of LLMs, dialog trees are passé.
We started by making the core search technology faster. Then we added a Unicode character folding/normalization tokenizer (diacritics, accents, umlauts, bold, italic, full-width chars...). Last week we added a tokenizer that supports Chinese word segmentation. Currently, we are working on a multi-language tokenizer that segments Chinese, Japanese, and Korean without switching the tokenizer.
I hope the folding and normalization is configurable by language. I really hate it when some search decides that a and ä are the same letter. In Finnish they really aren't; "saari" is an island, "sääri" is the lower leg or shin.
Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.
How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like: decade old sharepoint sites, PDFs with partial text layers, excel, email.msg files, etc...
Yes, integration in complex legacy systems is always challenging.
As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure.
As SeekStorm is open-source, system integrators can take it from there.
How did you demo? Did you spin up your own instance and index the wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already running instance.
Never mind, found that someone posted a link already.
On that topic, can anybody chime in on state of the art PDF OCR? Even if that's a multimodal LLM, I've used ChatGPT to extract tabular data from images but need something I can self host for proprietary data.
Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.
What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.
When search is cheap and quick, it's possible to improve search by postprocessing search results and running more queries when necessary.
I use Tantivy, and add refinements like: if the top result is objectively a low-quality one, it's usually a query with a typo finding a document with the same typo, so I run the query again with fuzzy spelling. If all the top results have the same tag (that isn't in the query), then I mix in results from another search with the most common tag excluded. If the query is a word that has multiple meanings, I can ensure that each meaning is represented in the top results.
When using SeekStorm as a server, keeping the latency per query low increases the throughput and the number of parallel queries a server can handle on given hardware. An efficient search server can reduce the required investment in server hardware.
In other cases, only the local search performance matters, e.g., for data mining or RAG.
Also, it's not only about averages but also about tail latencies. While network latencies dominate the average search time, that is not the case for tail latencies, which in turn heavily influence user satisfaction and revenue in online shopping.
How is it different from Meilisearch[1]? I'm running search for my small multi-tenant SaaS, and self-hosted Meilisearch gives me grief like any relatively new tech, so I'm shopping for new solutions.
Full version: I run it on a dedicated 2 vCPU / 2 GB machine on DigitalOcean. Every tenant has an index, and I have around 30k searches per week across all tenants. Each tenant has from 1 to 150k documents in their index. Sentry catches a MeilisearchTimeoutException a couple of times every day with the message that Meilisearch could not finish adding a document to the index. I don't care too much about that, because a background worker is responsible for updating the index, so the task gets rescheduled. I like to keep my Sentry clean, so it's more of an inconvenience than an issue.
Meilisearch setup is very straightforward: they provide client libraries for almost all languages (maybe even for esoteric and marginal ones, idk; I only need Python), have pretty decent documentation covering the basics, and don't really require operations at my scale. I really liked the feature of issuing limited-access tokens that can set a precondition. That's how I limit searches for a particular user on a tenant to see only their data.
Very impressive results. I'm curious how you benchmarked against bm25 in terms of accuracy? I couldn't find metrics around that, just one search example. I think there are use cases where latency is king, but when it comes to vector search / hybrid search accuracy is probably more important.
For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.
For SimilarityType::Bm25fProximity, which takes into account the proximity between query term matches within a document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.
Systematic relevance benchmarks like BEIR and MS MARCO are planned.
I don't know how fair the benchmark is, but beating Tantivy by that margin is impressive to say the least.
Any plan to make it run on WASM? I wanted to add this feature to Tantivy a few years ago but they weren't interested, and I had to fall back to a JavaScript search engine that was much slower.
Developer of tantivy chiming in! (I hope that's ok) Database performance is a space where there are a lot of lies and bullshit, so you are 100% right to be suspicious.
I don't know SeekStorm's team and I did not dig much into the details, but my impression so far is that their benchmark's results are fair. At least I see no reason not to trust them.
The benchmark should be fairly fair, as it was developed by Tantivy themselves (and Jason Wolfe). So, the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark it on your machine with your data and your queries.
It feels like everyone re-implements the same application; searching text in language x.y.z has been done a million times, and search speed is not a problem. So what differentiates this solution from the dozen-plus mature ones?
The speed looks great but isn't everything else already fast enough?
It's not just about speed. Speed reflects efficiency. Efficiency is needed to serve more queries in parallel and to search within exponentially growing data, with less expensive hardware, fewer servers, and less energy. Therefore the pursuit of efficiency never gets outdated and has no limit.
In addition to what you said, faster searches can also enable different search options. For example, if you can execute five similar searches in the time it would take to execute one, you now have the option to ask "Can I leverage five similar searches to produce better results?" If the answer is yes, you can provide better answers and still keep the same user experience.
Where I really think faster searches will come into play is with AI. There is nothing energy efficient about how LLM work and I really think Enterprise will focus on using LLM to generate as many Q and A pairs during off peak energy hours and using a hybrid search that can bridge semantic (vector) and text. I think for Enterprise the risk of hallucinations (even with RAG) will be too great and fall back to traditional search, but with a better user experience.
Based on the README, it looks like vector search is not supported or planned, but it would be interesting to see if SeekStorm can do this more efficiently than Lucene/OpenSearch and others. I only dabbled in the search space, so I don't know how complex this would be, but I think SeekStorm can become a killer search solution if it can support both.
Software is currently extremely inefficient, driven by years of increasingly powerful cheap hardware. Once that starts to slow it makes sense that we start squeezing efficiency out of software again. We’ve also seen in the last 20 years the rise of languages that make writing performant, higher-level software a lot easier.
We’re also at a point where cloud compute is consuming a significant amount of energy globally.
I find the note unfortunate. They state a 2-4x performance improvement. I'm sure looking at the implementation with a profiler and tactically optimizing critical paths would have yielded them 2-3x as is. They could have also reached out to the .NET JIT team via issues or discussions on GitHub for guidance, especially since .NET has a rich set of SIMD APIs very well suited for implementing SOTA text search algorithms (and also comes with many out of the box; seriously, look at e.g. https://devblogs.microsoft.com/dotnet/performance-improvemen...)
The note also states "No framework dependencies (CLR or JVM virtual machines)" which isn't true either - 'dotnet publish /p:PublishSingleFile=true /p:PublishTrimmed=true' gives the same "dependency-less" experience. "Ahead-of-time instead of just-in-time compilation" is similarly wrong - replace previous args with '/p:PublishAot=true' and you get a native binary.
The 2-4x speed ratio was not meant to disparage C#, which is a great language I loved programming in for over two decades, coming from Delphi.
Unfortunately, C# does not have complete SIMD support.
See our request to support the SSE4.2 _mm_cmpistrm instruction https://github.com/dotnet/runtime/discussions/63332, which we required for a vectorized intersection between two sorted 16-bit arrays.
We did not make the switch from C# to Rust lightly, as porting a fairly large codebase is time-consuming.
We just wanted to share our experience for our specific task, not as a general statement.
Thank you. It is indeed true that .NET has some gaps in its SIMD API, which might require either writing a specific routine in C and pinvoking it or implementing the algorithm differently.
Were there any other factors that contributed to the decision?
FWIW, I forwarded the issue the discussion links to the dotnetevolution Discord server.
Yes. We waited a long time for AOT compilation to become mature, to remove the need for the user to install the .NET runtime. But two years ago, when we decided to switch, we still couldn't get AOT compilation of our codebase to work without changes (perhaps it was somehow possible, but the available documentation was not very verbose about this). Also, there is still a performance gap. Of course, this doesn't matter for most applications, where the completeness and consistency of the framework and the number of programmers fluent in the language might matter more. But for a search server, we needed to squeeze out every bit of performance we could get. And other benchmarks seemed to echo our experience: https://programming-language-benchmarks.vercel.app/rust-vs-c...
That specific suite is...not the best. https://benchmarksgame-team.pages.debian.net/benchmarksgame/... is more focused on optimized implementations and showcases where the performance of .NET places given submissions someone cared to spend some time optimizing.
It is true that two years ago Native AOT was in its infancy; it has improved substantially since then. Self-contained trimmed binaries already worked back then, however.
I guess it is more about unfortunate timing than anything - even the compiler itself moves fast and in some areas the difference in codegen quality is very significant between 7, 8 and 9.
PostgreSQL is an SQL database that also offers full-text search (FTS); with extensions like pg_search it also supports BM25 scoring, which is essential for lexical search. SeekStorm is centered around full-text search only; it doesn't offer SQL.
Performance-wise it would be indeed interesting to run a benchmark. The third-party open-source benchmark we are currently using (search_benchmark_game) does not yet support PostgreSQL. So yes, that comparison is still pending.
Appreciate the demo: https://deephn.org/?q=apple+silicon