A gentle introduction to vector databases

splatcollision · on Feb 23, 2022

I've done vector indexes on CouchDB before, because it supports arrays as keys in it's map-reduce implementation. Worked great for fast document similarity search.

Brief writeup: http://splatcollision.com/page/fast-vector-similarity-querie...

joexner · on Feb 23, 2022

Vector indices are the novel part of vector databases. Let's hear more about them. The rest is just BLOB CRUD.

gk1 · on Feb 23, 2022

Here you go: https://www.pinecone.io/learn/vector-indexes/

xibalba · on Feb 23, 2022

Hi, could you please stop marketing here? You've posted a bunch of replies in this thread linking to your employer. It's nice that you're (sort of) disclosing your biases, but we really don't want HN to turn into marketing spam.

gk1 · on Feb 23, 2022

Yep, Dang already pointed out how my comments might come across as spammy, so I acknowledge that and apologize! No spam intended, just hoping to be helpful. I'll tone things down.

dontreact · on Feb 23, 2022

The way that vector indices work typically can make doing CRUD with them a real challenge. There is definitely novelty in being able to do both ANN indexing and fast high throughput CRUD.

In addition, the R of crud is hard to combine with vector indices. Case in point I am still waiting for elastic search to support both ANN and regular, structured filtering together well.

heipei · on Feb 23, 2022

If anyone wants to build a scalable ANN-index with single-stage filtering (i.e. not with the builtin vector index which does post-filtering), I suggest people try binarising and splitting their feature vectors, or using something like Product Quantization (PQ). Both approaches will return a list of fixed terms which can be indexed in Elasticsearch as keyword and then searched with a simple "terms" query.

Big fan of what Pinecone is doing, but I have too much invested into Elasticsearch/Lucene at this point in time to be considering anything else really, and with Elasticsearch I get everything in one box, including things like n-gram accelerated wildcard searches.

lmeyerov · on Feb 23, 2022

I saw elastic has vector indexes now -- what does 'well' mean?

My guess is for most popular uses, vector indexes are indeed more of a feature than a product, so will smooth out over time in big DBs, and curious what that should look like. Seperately, I do think there is a small niche for opinionated vector-first DBs, and not sure what that is either :)

anentropic · on Feb 23, 2022

> I saw elastic has vector indexes now -- what does 'well' mean?

Funny enough I was just reading about this today as a result of the OP article

A commenter here, who apparently works for the pinecone.io vector db platform, states that:

"Pre-filtering in most solutions (eg, Elasticsearch on AWS) requires using an inefficient brute-force search (kNN) through the remaining vectors after they've been filtered, because the original index was built on the unfiltered list and would no longer be useful. This causes sky-high search latencies."

https://www.reddit.com/r/MachineLearning/comments/po3pos/com...

gk1 · on Feb 23, 2022

That was me.

In all honesty if you are already using ES and you just want nearest-neighbor search for less than 10M documents, just stay with ES.

Things get less obvious when you grow past 10M documents and still want low latency. Or if you need live index updates without downtime, or if you want to apply metadata filters to nearest-neighbor searches.

If you have 100M documents -- not a difficult threshold if you're an enterprise software company or a popular consumer app -- then ES gets ruled out fairly early in the process. We get a lot of those exasperated teams coming to Pinecone after trying their best with ES/OpenSearch.

lmeyerov · on Feb 23, 2022

Interesting, thank you!

Why does 100M vectors not work in ES?

- Is this a configuration issue -- common for ES users -- or something fundamental?

- It sounds like latency is the main thing. Any numbers intuition here, and any other dimensions of concern?

AFAICT ES is using the same OSS vector libraries as pinecone, weaviate, etc. ES in general is used for > 100M documents, e.g., logging, so this is surprising.

We are seeing growing interest by our ES/Splunk users in combining our viz tech with vector indexes, so I've been wondering about these, thanks! We currently go out-of-band at the compute tier or dump in our own indexes, but are thinking through managed flows, where fundamental limits gets interesting.

jkb79 · on Feb 25, 2022

>Why does 100M vectors not work in ES?

I think that it's important to mention Elasticsearch version in this context :=)

Full disclosure, I work on Vespa.ai, but I have a pretty good understanding of how NN and ANN works in ES.

Elasticsearch 7x only had support for brute-force exact nearest neighbor search (1) which you can surely run with 100M documents, you just need to have a lot of nodes to bring latency down to service latency SLA. Typically, a single threaded/single index segment query with 1M documents with 128 dims using float is 300ms with Elasticsearch 7x (3). Two shards/segments instead would bring it down to 150 ms and so on. But 100M becomes costly, but perfectly doable.

With Elasticsearch 8.0, there is support for doing approximate nearest neighbor search (2) using the HNSW support from Apache Lucene 9. This would bring latency down to low single digit ms (single threaded, single shard) with 1M documents. The biggest disappointment was probably that they don't integrate the approximate nearest neighbor search with regular query filters/terms. This will likely be coming in Elasticsearch 8.1 or so, once it's implemented in Apache Lucene 9.x.

(1) https://www.elastic.co/blog/text-similarity-search-with-vect... (2) https://www.elastic.co/blog/introducing-approximate-nearest-... (3) https://github.com/jobergum/dense-vector-ranking-performance

zmjjmz · on Feb 23, 2022

We've had good success with Vespa [1] for this usecase, both using it to rank by vector similarity and apply business logic oriented filters.

Search latencies are fairly low, even with constantly updating documents.

[1] https://tech.okcupid.com/vespa-vs-elasticsearch-for-matching...

fzliu · on Feb 23, 2022

I've heard great things about Vespa as well (I used to work at Yahoo). I unfortunately never got the chance to play around with it too much myself.

Looking forward to doing some at some point in the near future though.

gk1 · on Feb 23, 2022

Pinecone does this, FYI: https://www.pinecone.io/docs/metadata-filtering/

And you're not alone... Lots of people want this, and lots of people can't get it with ES.

dontreact · on Feb 23, 2022

Yes but my sense it is coming pretty soon for ES, and switching to Pinecone has a huge rewrite cost. As NNs keep getting better, vector search keeps becoming more important, and it becomes more and more existential for elasticsearch to do this well. I wonder if there actually some fundamental reason they won't be able to do so.

gk1 · on Feb 23, 2022

We've had people kicking down our doors (figuratively) for the filtering feature. I have no doubt ES is getting the same feedback and is working on it.

> I wonder if there actually some fundamental reason they won't be able to do so.

This is where it gets interesting. I don't think there's anything a company with ES's resources can't do if they really want to. However...

ES's vector search is based on Lucene 9.0, which in turn uses the HNSW vector index. HNSW is shaky when it comes to CRUD updates (namely deletions), and flat out does not support metadata filtering: https://issues.apache.org/jira/browse/LUCENE-10040

Even with ES resources, it's a fundamentally hard problem. In the best case it's going to take a while to implement.

peterstjohn · on Feb 23, 2022

If you control the HNSW implementation, it can definitely do pre-filtering. Vespa does it, and you can modify open source HNSW libs easily. I added pre-filtering support to an internal fork of HNSWLIB last week, for example…

fzliu · on Feb 23, 2022

Thanks for your feedback. I'm writing a post on vector indices and will throw it up this week.

tabtab · on Feb 23, 2022

This reminds me of "Factor Tables": https://github.com/RowColz/AI

fzliu · on Feb 23, 2022

As mentioned in the article, I recommend Milvus (https://milvus.io) - it's open source and cloud native with standalone versions available. Alternatively, if you're looking for an open-source solution for generating embeddings, I recommend (https://github.com/towhee-io/towhee).

dtjohnnyb · on Feb 23, 2022

One downside for milvus is that version 1 doesn't do filtering (necessary for most search applications) and version 2 is significantly slower. Google's vector nearest neighbors offering, weaviate, and Vespa are much better options if you're expecting to extend to more realistic workloads

phenkdo · on Feb 23, 2022

Nice writeup. Have you looked at qdrant [1] for your comparison? I found it better than Milvus.

[1] https://github.com/qdrant/qdrant

occupant · on Feb 23, 2022

What did you find better about it?

phenkdo · on Feb 25, 2022

For us the winning feature for qdrant it has the concept of both positive and negative query vectors. Overall they are both very good! It's a matter of individual preference.

cbsmith · on Feb 23, 2022

Everything old is new again. ;-)

gk1 · on Feb 23, 2022

This is a great writeup, and awesome to see vector databases come up more and more often.

For anyone interested in going down this rabbit hole, we have an entire learning center about vector databases and vector search (https://www.pinecone.io/learn/) including the obligatory "What is a Vector Database" intro with example notebooks: https://www.pinecone.io/learn/vector-database/

dang · on Feb 23, 2022

You've posted several comments in this thread alone linking to your product, and it seems that the majority of your posts have been doing this for quite a while now. I'm sure it's excellent work, but can you please stop doing this?

It's fine to link to your own work occasionally, when it's particularly relevant, as part of a diverse mix of posts on unrelated things*. It's not ok to use HN primarily for promotion. See https://news.ycombinator.com/newsguidelines.html: "Please don't use HN primarily for promotion. It's ok to post your own stuff occasionally, but the primary use of the site should be for curiosity."

When people do that we eventually start penalizing their accounts and sites, or in egregious cases, banning them. You're a good HN user, but this is still excessive. You're crossing the line at which the community starts to think of the word 'spam', and we inevitably start getting emails about it.

* I do get that your work is particularly relevant in a thread like this. What's missing is the 'diverse mix of posts on unrelated things'. In such a context, posting repeatedly about your own stuff starts to come across the wrong way.

pathsjs · on Feb 23, 2022

For what is worth, I do appreciate their links, and their resources are pretty informative (I am not by any means affiliated with Pinecone)

gk1 · on Feb 23, 2022

Dully noted!

In part it's a reflection of what I think about -- which these days is mainly what I work on. Totally get what you're saying though, and how it might look from the outside.

starkd · on Feb 23, 2022

Thank you for this. One approach I find missing in your blog is that of distance-based indexing. It's an approach that indexes vectors according to distances from chosen vantage points from within the data set. I've done some preliminary work on creating a system for images: phash.dev

liminal · on Feb 23, 2022

Pinecone looks great. Any plans to have a non-hosted option?

krishnakatyal · on Feb 23, 2022

Very well written