The focus on the top 10 in vector search is a product of wanting to prove value over keyword search. Keyword search is going to miss some conceptual matches. You can try to work around that with tokenization and complex queries covering all the variations, but it's not easy.
Vector search isn't all that new a concept. For example, the annoy library (https://github.com/spotify/annoy) has been around since 2014. It was one of the first open source approximate nearest neighbor libraries. Recommendations have always been a good use case for vector similarity.
Recommendations are a natural extension of search, and transformer models made building the vectors for natural language possible. To prove the worth of vector search over keyword search, the focus was always on showing how the top N matches include results not possible with keyword search.
In 2023, there has been a shift towards acknowledging keyword search also has value and that a combination of vector + keyword search (aka hybrid search) operates in the sweet spot. Once again this is validated through the same benchmarks which focus on the top 10.
On top of all this, there is also the reality that the vector database space is very crowded and some want to use their performance benchmarks for marketing.
Amusingly, vector search is conceptually nearly as old as keyword search.
Most older IR textbooks promote the idea of viewing keyword search as a maximization problem of the weighted bit vector product between the query and the document, where the weight matrix is something like BM-25.
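For what it's worth, here is a minimal sketch of that textbook view, with made-up weights standing in for BM25/TF-IDF values:

```python
import numpy as np

# Toy vocabulary and two "documents" as term-weight vectors. In the textbook view,
# the weights would come from something like BM25 or TF-IDF; these are made up.
vocab = ["vector", "search", "keyword", "index", "recall"]
doc_weights = np.array([
    [2.1, 1.7, 0.0, 0.4, 0.0],  # doc 0
    [0.0, 1.2, 2.5, 1.1, 0.3],  # doc 1
])

# The query is a bit vector over the vocabulary: 1 if the term appears, else 0.
query = np.array([1, 1, 0, 0, 0])  # "vector search"

# Keyword retrieval as maximization: score each document by the weighted
# bit-vector product and take the argmax (or the top-k).
scores = doc_weights @ query
print(scores, "-> best doc:", int(np.argmax(scores)))
```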
I was interested in dense vector search in 2004 and was a bit depressed at how the indexing algorithms weren't that good. Around 2013 I was involved with an actual product, a search engine for patents.
The thing about vector search today is that it is an absolute gold rush. Pinecone had the right idea but the wrong product (sorry, SaaS is toxic), and many of the latecomers are… late, and will find the whole startup and VC process costs precious months when customers want to build products right now.
Note that for that patent search engine we had quite a different idea than is fashionable today: like very old TREC, we cared about the quality of the full 1000 or 2000 results, on the assumption that a patent researcher wanted to be comprehensive. When I first got into IR, it was in the context of "how is Google so much better than other search engines?", and one part of it was that, based on the way search engines were being evaluated at TREC, Google was not a better search engine, because it was so focused on the first page.
It has come full circle now: I have talked to people recently who complain that current TREC evaluations are too focused on the first few results.
> The focus on the top 10 in vector search is a product of wanting to prove value over keyword search
I'm surprised to see less mention of the use of vector search in RAG. The reason the top 10 (or fewer) is so important in RAG is that you only have so much context to work with (and even with large-context models, you would still prefer to minimize the total amount of context you're sending the model).
Everyone I know working seriously on vector search is using it to plug into a prompt for an LLM. For this use case top 10 is a fairly important metric.
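As a rough illustration of that constraint: a RAG pipeline typically stuffs ranked chunks into the prompt until a token budget runs out, so only the first handful of hits ever reach the model. The token counts below are crude word counts, purely for illustration:

```python
def build_context(ranked_chunks, budget_tokens=2000):
    # Greedily take the highest-ranked chunks until the budget is exhausted.
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude token estimate
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)

chunks = ["first retrieved passage ...", "second retrieved passage ...", "third ..."]
print(build_context(chunks))
```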
I agree. My experience is that hybrid search does provide better results in many cases, and it is honestly not as easy to implement as it may seem at first. In general, getting search right can be complicated today, and the common thinking of "hey, I'm going to put up a vector DB and use that" is simplistic.
Disclaimer: I'm with Vectara (https://vectara.com), we provide an end-to-end platform for building GenAI products.
Isn't one big problem of vector search that it is bounded by the complexity of the kNN problem in many dimensions, whereas keyword search like BM25 is very fast and scalable because of inverted indices? Annoy and others need to sacrifice accuracy for speed, I believe. Hybrid search can be used to first filter a smaller number of candidates and then rerank them using vector search. Performance-wise, it is naturally pretty slow and hard to scale compared to keyword search.
The elephant in the room is the Curse of Dimensionality. As the number of vector dimensions increases, "nearest neighbor" search behaves in a surprising manner. In a high-dimensional cube, randomly sampled vectors are mostly near the edge of the cube, not at all what you would expect based on two or three dimensions. There are other hitches, too.
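A quick numpy experiment (purely illustrative) shows both effects: the distance from a random point to the nearest face of the unit cube shrinks as dimensions grow, and pairwise distances concentrate around their mean:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 3, 32, 256, 1024):
    x = rng.random((2000, d))
    # Distance from each point to the nearest face of the unit cube.
    edge_dist = np.minimum(x, 1 - x).min(axis=1)
    # Pairwise distances between random pairs of points.
    a, b = rng.random((500, d)), rng.random((500, d))
    dists = np.linalg.norm(a - b, axis=1)
    spread = dists.std() / dists.mean()  # relative spread shrinks as d grows
    print(f"d={d:5d}  median distance to nearest face={np.median(edge_dist):.4f}  "
          f"relative spread of pairwise distances={spread:.3f}")
```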
When using vector search, you're making two leaps - that your document can really be summarized in a vector (keeping its important distinguishing semantic features), and that it's possible to retrieve similar documents using high-dimensional vectors. Both of these are lossy and have all sorts of caveats.
What would be great is if these new vector DB companies showed examples where the top-ten "search engine results page" based on semantic embedding / vector search is clearly better than a conventionally-tuned, keyword-based system like Elasticsearch. I haven't seen that, but I have seen tons of numbers about precision/recall/etc. What's the point of taking 10ns to retrieve good vector results if it's unclear whether they're solving a real problem?
These vectors are lower-dimensional than traditional vectors though, aren't they? Vector embeddings are in the hundreds to low thousands of dimensions (roughly 128-1024), whereas TF-IDF has the same dimensionality as your vocabulary. It's also not just about being flat-out better, but about increasing the recall of queries, since you're grabbing content that doesn't contain the keywords directly but is still relevant. You are also free to mix the two approaches together in one result set, which gives the best of both.
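To make that comparison concrete, here is a small sketch using scikit-learn's TfidfVectorizer; the dense dimensionality shown is just a typical constant (e.g., 384 for all-MiniLM-L6-v2), not something computed here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "vector search finds conceptually similar documents",
    "keyword search relies on exact lexical matches",
    "hybrid search combines both scoring signals",
]

tfidf = TfidfVectorizer().fit(docs)
print("TF-IDF dimensionality (== vocabulary size):", len(tfidf.vocabulary_))

# Dense embedding models output a fixed, much smaller dimensionality,
# e.g. 384 for all-MiniLM-L6-v2 or 768 for many BERT-based encoders.
print("Typical dense embedding dimensionality:", 384)
```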
The problems with dimensionality certainly show up even with 256 dimensions. PCA-ing down to a few hundred dimensions is still a problem, and then you have to deal with PCA lossiness too!
You can try the three methods (keyword, kNN, and hybrid) on a collection in the AI domain here: https://search.zeta-alpha.com/. YMMV: sometimes kNN has more interesting results, sometimes keyword is closer to what you wanted. Hybrid is sometimes also a good middle ground, but it fails when one of the retrievers has very bad results (because it will interleave them).
If you only have to sacrifice a tiny bit of accuracy, it's not really a problem is it? You can quantify these things experimentally to see whether it's acceptable for your use case.
Given that vector and hybrid search are used in many situations at scale, I think to make the kind of claims you're trying to make, it might help to quantify exactly when you think it will be too slow or what experiences you have had of struggling to make it fast enough.
We (at Weaviate) support separate BM25, vector, and combined hybrid search, so I can talk about the performance aspect of all of these a lot. The tl;dr is: It's complicated. What you say, may be true in some cases, but not in others. One does not always scale better than the other.
For example, ANN indexing with an index like HNSW scales pretty much logarithmically. A bit oversimplified, but roughly we see search times double with every 10x increase in the search space. In addition, calculations like Euclidean distance and dot product (and therefore cosine similarity) can be parallelized very well at a hardware level, e.g. with AVX2/AVX-512/NEON and similar SIMD techniques.
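For illustration, even a brute-force (non-HNSW) cosine search reduces to a single matrix-vector product, which is exactly the kind of operation BLAS/SIMD hardware accelerates; the sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 384)).astype(np.float32)
query = rng.standard_normal(384).astype(np.float32)

# Normalize once so cosine similarity reduces to a matrix-vector product,
# which BLAS executes with SIMD (AVX2/AVX-512/NEON) under the hood.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = corpus @ query                    # cosine similarities for all documents
top10 = np.argpartition(-scores, 10)[:10]  # brute-force top-10, no ANN index
print(top10, scores[top10])
```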
With BM25, this isn't quite as simple. We have algorithms such as WAND and Block-Max-WAND, which help avoid _scoring_ elements that cannot reach a top-k spot. However, the distribution of terms plays a big role here. As a rule of thumb, the rarer a word is, the fewer documents require scoring. Let's say you have 1B documents and a 2-term query, but each term matches only 100k documents. If you AND-combine those queries, you will have at most 100k matches; if you OR-combine them, you will have at most 200k. The fact that there were 1B documents indexed played no role. But now think of two terms that each match 500M objects. Even with the aforementioned algorithms – which rely on the relative impact of each term – there is a risk we would now have to score every single document in the database. This is, again, over-simplified, but my point is the following:
ANN latency is fairly predictable. BM25 latency depends a lot on the input query. Our monitoring shows that in production cases, when running a hybrid search, sometimes the vector query is the bottleneck, sometimes the BM25 query is.
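To make the arithmetic above concrete, here is a toy illustration (the numbers simply mirror the example) of why the candidate set that BM25 must score depends on the query's postings lists rather than the corpus size:

```python
corpus_size = 1_000_000_000

rare_a, rare_b = 100_000, 100_000              # two rare terms
common_a, common_b = 500_000_000, 500_000_000  # two very common terms

print("rare AND  <=", min(rare_a, rare_b))                    # at most 100k to score
print("rare OR   <=", rare_a + rare_b)                        # at most 200k to score
print("common OR <=", min(common_a + common_b, corpus_size))  # potentially everything
```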
> Hybrid search can be used to first filter a smaller number of candidates and then rerank them using vector search
My post is already getting quite long, but I want to quickly comment on this two-step approach. For Weaviate, that's not the case. BM25 and vector searches happen independently, then the scores are aggregated to produce a single result set.
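As one illustration of that independent-then-aggregate pattern, here is a minimal sketch using reciprocal rank fusion (RRF); it's a common fusion choice, not necessarily Weaviate's exact scoring:

```python
def rrf_fuse(bm25_ranked, vector_ranked, k=60):
    # Each result list contributes 1 / (k + rank) to a document's fused score.
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]    # ids ordered by BM25 score
vector_results = ["doc1", "doc9", "doc3"]  # ids ordered by cosine similarity
print(rrf_fuse(bm25_results, vector_results))
```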
Yes, you could also use BM25 as a filter set first, then re-rank the results with embeddings, however you would lose the BM25 scores. If you want to use keywords just as filters, you can do that much, much cheaper than through BM25 scoring. In Weaviate, matching keywords to create a filter set would only require AND-ing or OR-ing a few roaring bitmaps, which is orders of magnitude more efficient than BM25 scoring.
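A rough sketch of that filter-only path, using the pyroaring library as a stand-in for whatever bitmap implementation a database uses internally:

```python
from pyroaring import BitMap

# Per-term postings as roaring bitmaps of document ids.
postings = {
    "vector": BitMap([1, 4, 7, 9, 12]),
    "search": BitMap([2, 4, 9, 15]),
}

and_filter = postings["vector"] & postings["search"]  # docs containing both terms
or_filter = postings["vector"] | postings["search"]   # docs containing either term
print(list(and_filter), list(or_filter))
```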
> In 2023, there has been a shift towards acknowledging keyword search also has value and that a combination of vector + keyword search (aka hybrid search) operates in the sweet spot.
Do you have evidence that this hybrid approach is better than a vector search? That would be surprising.
Imagine "day 2" search tasks like the user wanting to filter results for a specific time range
You could force the data team to jump through hoops like multimodal embeddings, multiple indexes, post-filtering... or figure it out.
Near-term, it makes sense to do one thing well, so many new vector DBs are really mutable vector indexes with lackluster search. But eventually, if you are not a narrowly-scoped kernel (faiss) but the main search platform needing $$$-level market share for growing payrolls, you're incentivized to figure out search beyond ANN, and your competitors are advertising that they are doing the same. So you add post-filtering, and then eventually, maybe even push-down/dynamic filtering.
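A rough sketch of the post-filtering step (where ann_index and its search() method are hypothetical placeholders): over-fetch from the ANN index, then drop hits outside the requested time range.

```python
def filtered_search(ann_index, query_vector, start, end, k=10, overfetch=5):
    # Over-fetch candidates, since some will fall outside the time range.
    candidates = ann_index.search(query_vector, k * overfetch)
    hits = [c for c in candidates if start <= c["timestamp"] <= end]
    return hits[:k]

# Usage sketch, assuming each candidate is a dict with a "timestamp" field:
# results = filtered_search(index, vec, start=1672531200, end=1688169600)
```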
Source: We have been happily using on-the-fly GPU kNN embeddings for years in pygraphistry, but when we added billion-scale vector indexing to power louie.ai RAG search for real-time news & database querying, we outgrew pure ANN on our first user, and I'm thankful we did our diligence on our DB selection:
1. We would have paid more for storing a billion+ rows in memory than our gov & enterprise customers were paying for a full product. Since then, IVFPQ etc. have gotten more popular for doing responsive searches on bigger-than-memory datasets. If we had only needed small in-memory workloads, we could have done faiss+pandas+fastapi and moved on.
2. Vendors pitched themselves as managed databases, yet their users complain about them going down & losing data.
3. Many did not support basic search operators: time filtering, ...
4. Few/none had real modern security, e.g., row-level multi-tenant ABAC.
I haven't been tracking on a per-vendor basis as each publicizes making piecemeal progress on these. We found one that did #1-3 without failing us yet, and we're biting the bullet to implement #4. A scalable OSS answer to #4 might be the one that causes us to switch.
I agree that BM25 retrieval + vector re-ranking can work. But vector search does bring results to the table that vanilla BM25 can't, even with a large retrieval window. So I do think there is a place for both with the usual "it depends on your data/requirements" caveat.
The opposite is also true. BM25 prefers lexical matches and brings back candidates that vector search often doesn't.
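For reference, a minimal sketch of the BM25-retrieve-then-vector-rerank pattern mentioned above, assuming the rank_bm25 package is available; embed() is a hypothetical placeholder for whatever embedding model you use:

```python
import numpy as np
from rank_bm25 import BM25Okapi

corpus = [
    "vector search finds related concepts",
    "keyword search needs exact terms",
    "hybrid search mixes both signals",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def rerank(query, candidates, embed, k=10):
    # Re-order the BM25 candidate window by embedding similarity to the query.
    q = embed(query)
    scored = sorted(candidates, key=lambda d: float(np.dot(embed(d), q)), reverse=True)
    return scored[:k]

query = "semantically similar documents"
candidates = bm25.get_top_n(query.split(), corpus, n=100)  # wide retrieval window
# results = rerank(query, candidates, embed=my_embedding_fn)  # plug in your model
print(candidates)
```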
I am not disagreeing that vectors are useful, but I think benchmark-based evidence is not the same as deploying a solution that must scale, be constantly updated, and serve the many use cases customers want, like filtering and search syntax.
Plus, I think there's a real danger of herding to vector retrieval (and even then, one particular view of it), which cuts off exploration of diverse solutions.
I'm with you 100% on not herding to any one way for any problem. I still remember the pre-2023 world where you weren't pressured to work LLMs into everything.
Disclaimer: I am the author of txtai (https://github.com/neuml/txtai), an open source embeddings database.