It is fast, but nowhere close to accurate or useful for this specific example. I could not find a way to force the plural form; neither quotes nor a plus sign worked.
Is there distributed server support? I see it on the list of new features with (currently PoC) next to it, but is the code for the PoC available anywhere?
Also, would there be any potential issues if the index was mounted on shared storage between multiple instances?
The code for the distributed search cluster is not yet stable enough to be published, but it will be released as open-source as well.
As for shared storage, do you mean something like NAS, or rather Amazon S3?
Cloud-native support of object storage and separating storage and compute is on our roadmap. Challenges will be maintaining latency and the need for more sophisticated caching.
I really like your approach. Impressed by your care for performance and your fast pace of adding what appears to be pretty complex stuff, while making sure it stays performant.
In SeekStorm you can choose per index whether to use mmap or to let SeekStorm fully control RAM access. The latter has a slight performance advantage, at the cost of a higher index load time compared to the former.
https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...
SeekStorm does not currently use io_uring, but it is on our roadmap.
The challenge is cross-platform compatibility. Linux (io_uring) and Windows (IoRing) use different implementations, and other OSes don't support it. There is no abstraction layer over those implementations in Rust, so we are on our own.
It would increase concurrent read and write speed (index loading, searching) by removing the need to lock seek and read/write.
But I would expect that the mmap implementations do already use io_uring / IoRing.
Yes, lazy loading would be possible, but pure RAM access does not offer enough benefits to justify the effort to replicate much of the memory mapping.
What is the story for a multi-language corpus? Do I have to do my own stop-word pruning, tokenizing, lemmatizing, etc.? This is usually the case with full-text search solutions, and it is a pain.
Re: stemming and lemmatizing, I just want to plug the most impressive NLP stack I ever used, ChatScript. It's really for building dialog trees: it walks down a branch of conversation using what are effectively switch statements, but with really rich conceptual pattern matching and capturing. So somewhere in the middle of the stack it does an excellent job of abstracting from word input to general concepts (in WordNet), performing all the spell correction (according to your dictionary), stemming, lemmatization, and disambiguation.
I've had it in mind for a while to build a fuzzy search tool based on parsing each phrase into concepts, parsing the search query into concepts, and finding nearest match based on that. It's a C library and very fast.
Looks like it hasn't been committed to in some time, I'll have to check out their blog and see what's up. I guess with the advent of LLMs, dialog trees are passé.
We started by making the core search technology faster. Then we added a Unicode character folding/normalization tokenizer (diacritics, accents, umlauts, bold, italic, full-width chars...). Last week we added a tokenizer that supports Chinese word segmentation. Currently, we are working on a multi-language tokenizer that segments Chinese, Japanese, and Korean without switching the tokenizer.
I hope the folding and normalization is configurable by language. I really hate it when some search decides that a and ä are the same letter. In Finnish they really aren't; "saari" is an island, "sääri" is the lower leg or shin.
Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.
How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like: decade old sharepoint sites, PDFs with partial text layers, excel, email.msg files, etc...
Yes, integration in complex legacy systems is always challenging.
As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure.
As SeekStorm is open-source, system integrators can take it from there.
How did you demo? Did you spin up your own instance and index the wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already running instance.
Never mind, found that someone posted a link already.
On that topic, can anybody chime in on state of the art PDF OCR? Even if that's a multimodal LLM, I've used ChatGPT to extract tabular data from images but need something I can self host for proprietary data.
Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.
What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.
When search is cheap and quick, it's possible to improve search by postprocessing search results and running more queries when necessary.
I use Tantivy, and add refinements like: if the top result is objectively a low-quality one, it's usually a query with a typo finding a document with the same typo, so I run the query again with fuzzy spelling. If all the top results have the same tag (that isn't in the query), then I mix in results from another search with the most common tag excluded. If the query is a word that has multiple meanings, I can ensure that each meaning is represented in the top results.
When using SeekStorm as a server, keeping the latency per query low increases the throughput and the number of parallel queries a server can handle on given hardware. An efficient search server can reduce the required investment in server hardware.
In other cases, only the local search performance matters, e.g., for data mining or RAG.
Also, it's not only about averages but also about tail latencies. While network latencies dominate the average search time, that is not the case for tail latencies, which in turn heavily influence user satisfaction and revenue in online shopping.
How is it different from Meilisearch[1]? I'm running search for my small multi-tenant SaaS, and self-hosted Meilisearch gives me grief like any relatively new tech, so I'm shopping for new solutions.
Full version: I run it on a dedicated 2 vCPU / 2 GB machine on DigitalOcean. Every tenant has an index, and I have around 30k searches per week across all tenants. Each tenant has from 1 to 150k documents in their index. Sentry catches a MeilisearchTimeoutException a couple of times every day with the message that Meilisearch could not finish adding a document to the index. I don't care too much about that, because a background worker is responsible for updating the index, so the task gets rescheduled. I like to keep my Sentry clean, so it's more of an inconvenience than an issue.
Meilisearch setup is very straightforward: they provide client libraries for almost all languages (maybe even for esoteric and marginal ones, idk; I only need Python), have pretty decent documentation covering the basics, and don't really require operations at my scale. I really liked the feature of issuing limited-access tokens that can set a precondition. That's how I limit searches for a particular user on a tenant to see only their data.
Very impressive results. I'm curious how you benchmarked against bm25 in terms of accuracy? I couldn't find metrics around that, just one search example. I think there are use cases where latency is king, but when it comes to vector search / hybrid search accuracy is probably more important.
For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.
For SimilarityType::Bm25fProximity, which takes into account the proximity between query term matches within a document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.
Systematic relevance benchmarks like BEIR and MS MARCO are planned.
I don't know how fair the benchmark is, but beating Tantivy by that margin is impressive to say the least.
Any plan to make it run on WASM? I wanted to add this feature to Tantivy a few years ago but they weren't interested, and I had to fall back to a JavaScript search engine that was much slower.
Developer of tantivy chiming in! (I hope that's ok) Database performance is a space where there are a lot of lies and bullshit, so you are 100% right to be suspicious.
I don't know SeekStorm's team and I did not dig much into the details, but my impression so far is that their benchmark's results are fair. At least I see no reason not to trust them.
The benchmark should be fairly fair, as it was developed by Tantivy themselves (and Jason Wolfe). So, the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark it on your machine with your data and your queries.
It feels like everyone re-implements the same application; searching text in language x.y.z has been done a million times, and search speed is not a problem. So what differentiates this solution from the dozen-plus mature ones?
The speed looks great but isn't everything else already fast enough?
It's not just about speed. Speed reflects efficiency. Efficiency is needed to serve more queries in parallel and to search within exponentially growing data, with less expensive hardware, fewer servers, and less energy. Therefore the pursuit of efficiency never gets outdated and has no limit.
In addition to what you said, faster searches can also enable different search options. For example, if you can execute five similar searches in the time it would take to execute one, you now have the option to ask "Can I leverage five similar searches to produce better results?" If the answer is yes, you can provide better answers and still keep the same user experience.
Where I really think faster searches will come into play is with AI. There is nothing energy efficient about how LLM work and I really think Enterprise will focus on using LLM to generate as many Q and A pairs during off peak energy hours and using a hybrid search that can bridge semantic (vector) and text. I think for Enterprise the risk of hallucinations (even with RAG) will be too great and fall back to traditional search, but with a better user experience.
Based on the README, it looks like vector search is not supported or planned, but it would be interesting to see if SeekStorm can do this more efficiently than Lucene/OpenSearch and others. I only dabbled in the search space, so I don't know how complex this would be, but I think SeekStorm can become a killer search solution if it can support both.
Software is currently extremely inefficient, driven by years of increasingly powerful cheap hardware. Once that starts to slow it makes sense that we start squeezing efficiency out of software again. We’ve also seen in the last 20 years the rise of languages that make writing performant, higher-level software a lot easier.
We’re also at a point where cloud compute is consuming a significant amount of energy globally.
I find the note unfortunate. They state a 2-4x performance improvement. I'm sure looking at the implementation with a profiler and tactically optimizing critical paths would have yielded them 2-3x as is. They could have also reached out to the .NET JIT team via issues or discussions on GitHub for guidance, especially since .NET has a rich set of SIMD APIs very well suited for implementing SOTA text search algorithms (and also comes with many out of the box; seriously, look at e.g. https://devblogs.microsoft.com/dotnet/performance-improvemen...)
The note also states "No framework dependencies (CLR or JVM virtual machines)" which isn't true either - 'dotnet publish /p:PublishSingleFile=true /p:PublishTrimmed=true' gives the same "dependency-less" experience. "Ahead-of-time instead of just-in-time compilation" is similarly wrong - replace previous args with '/p:PublishAot=true' and you get a native binary.
The 2-4x speed ratio was not meant to disparage C#, which is a great language I loved programming in for over two decades, coming from Delphi.
Unfortunately, C# does not have complete SIMD support.
See our request to support the SSE4.2 _mm_cmpistrm instruction https://github.com/dotnet/runtime/discussions/63332, which we required for a vectorized intersection between two sorted 16-bit arrays.
We did not make the switch from C# to Rust lightly, as porting a fairly large codebase is time-consuming.
We just wanted to share our experience for our specific task, not as a general statement.
Thank you. It is indeed true that .NET has some gaps in its SIMD API, which might require either writing a specific routine in C and pinvoking it or implementing the algorithm differently.
Were there any other factors that contributed to the decision?
FWIW, I forwarded the issue the discussion links to the dotnetevolution Discord server.
Yes. We waited a long time for AOT compilation to become mature, to remove the need for the user to install the .NET runtime. But two years ago, when we decided to switch, we still couldn't get AOT compilation of our codebase to work without changes (perhaps it was somehow possible, but the available documentation was not very verbose about this). Also, there is still a performance gap. Of course, this doesn't matter for most applications, where the completeness and consistency of the framework and the number of programmers fluent in the language might matter more. But for a search server, we needed to squeeze out every bit of performance we could get. And other benchmarks seemed to echo our experience: https://programming-language-benchmarks.vercel.app/rust-vs-c...
That specific suite is...not the best. https://benchmarksgame-team.pages.debian.net/benchmarksgame/... is more focused on optimized implementations and showcases where the performance of .NET places given submissions someone cared to spend some time optimizing.
It is true that two years ago Native AOT was in its infancy; it has improved substantially since then. Self-contained trimmed binaries already worked back then, however.
I guess it is more about unfortunate timing than anything - even the compiler itself moves fast and in some areas the difference in codegen quality is very significant between 7, 8 and 9.
PostgreSQL is an SQL database that also offers full-text search (FTS); with extensions like pg_search it also supports BM25 scoring, which is essential for lexical search. SeekStorm is centered around full-text search only; it doesn't offer SQL.
Performance-wise it would be indeed interesting to run a benchmark. The third-party open-source benchmark we are currently using (search_benchmark_game) does not yet support PostgreSQL. So yes, that comparison is still pending.
Appreciate the demo: https://deephn.org/?q=apple+silicon