Can anyone please suggest a good stack for the following:
- calculating text embeddings using open-source/local methods (not OpenAI)
- storing them in a vector database. I'm confused by the myriad of options like Chromadb, Pinecone, etc.
- running vector similarity search using open-source/local methods.
Also, how granular should the text chunks be? Too short and we'll end up with a huge database, too long and we'll probably miss some relevant information in some chunks.
Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.
The `ankane/pgvector` Docker image is a drop-in replacement for the standard `postgres` image, so you can fire it up with Docker very quickly.
It's a normal postgres db with a vector datatype. It can index the vectors and allows efficient retrieval. Both AWS RDS and Google Cloud now support this in their managed Postgres offerings, so postgres+pgvector is a viable managed production vectordb solution.
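As a rough sketch of the pgvector workflow (table and column names here are made up, and the vector dimension must match whatever embedding model you use; a toy 3-dim vector stands in for a real 384- or 768-dim embedding):

```sql
-- Enable the extension (ships with the ankane/pgvector image).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)   -- toy dimension; use your model's output size
);

INSERT INTO items (content, embedding)
VALUES ('hello world', '[0.1, 0.2, 0.3]');

-- Optional approximate index for larger tables.
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Nearest neighbours by cosine distance (the <=> operator).
SELECT content FROM items
ORDER BY embedding <=> '[0.1, 0.1, 0.2]'
LIMIT 5;
```

Without the index it falls back to an exact scan, which is fine for small tables.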
> Also, how granular should the text chunks be?
That depends on the use case, the size of your corpus, the context window of the model you are using, and how much money you are willing to spend.
> Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.
I think I’d need to fine tune the model to see better results with some domain-specific terms, but I couldn’t find much information about how to actually do that - what sort of input data you need, how much of it, etc.
Would be interested to hear if anyone had more to share about fine tuning these models for semantic search.
It's open source, and the way it works is that you give it an instruction describing the type of task or even the domain you want the embedding tailored to, and the embeddings change depending on that instruction.
It has to do with the architecture of the network used to create the embeddings. The embedding is actually the output of the final layer of the model, so its dimensionality is a function of the number of units in that layer.
Different models/architectures will produce different dimension embeddings.
Calculating the embeddings is probably going to be an application-specific thing. Either your application has reasonable pre-trained encoders or you train one off a mountain of matching pairs of data.
Once you have the embeddings in some space, for a PoC I've mostly seen people shove them into faiss, which handles most of the rest very well for small/medium datasets:
https://github.com/facebookresearch/faiss
You don’t need to train anything if you just need embeddings. The data is text. You apply the pretrained model to your text and it returns the embedding. You save it in a vector database if you’re fancy, or a big numpy array if you’re like me. Then run your similarity search (cosine, Euclidean, etc).
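The similarity metrics themselves are a few lines of plain Python (with numpy they become one-liners over the whole array):

```python
import math

def cosine_sim(a, b):
    # cos(a, b) = a.b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `cosine_sim([1, 0], [2, 0])` is 1.0 (same direction regardless of magnitude), while `euclidean([0, 0], [3, 4])` is 5.0.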
A lot of embedding models have poor performance on domain-specific data, which is only mitigated by fine-tuning. Alternatively, the Instructor series mitigates this by fine-tuning the model on instructions, so you can give it instructions targeted at specific domains.
You might want to give Haystack a try (disclaimer: I work at deepset, the company behind Haystack).
Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.
Next, you can query your documents using a retrieval pipeline.
Regarding document store selection: replacing your document store is easy, so I would start with the simplest one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you'll want to tailor your selection to your use case. Here are a few things I've observed:
- You don't want to manage anything and are fine with SaaS -> Pinecone
- You have a very large dataset (500M+ vectors) and want something you can run locally -> maybe Qdrant
- You have metadata you want to incorporate into retrieval, or you want to do hybrid search -> Opensearch/Elasticsearch
Regarding model selection:
We've seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well as a good semantic search baseline with fast indexing times. If you feel the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process: retrieve ~100 documents with BM25 first, then use a cross-encoder to rank them by semantic relevance. Very fast indexing times are a benefit, and you don't need a beefy vector db to store your documents. Latency at query time will be slightly higher, though, and you might need a GPU machine to run your query pipeline.
This is what we use: BERT sentence transformers to generate the embeddings (we used Universal Sentence Encoder before that and it was good too), and ElasticSearch for storage, which has a dense vector data type. It also has a cosineSimilarity function to run searches.
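For reference, a sketch of the Elasticsearch side (the field name `embedding` and the query vector are made up; `dense_vector` mappings and the `cosineSimilarity` script function are the relevant pieces):

```json
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.12, -0.34, 0.56] }
      }
    }
  }
}
```

The `+ 1.0` shifts the score into the non-negative range, since Elasticsearch rejects negative scores from `script_score`.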
I would not use Elastic for vector search due to its architectural limitations and poor performance when conducting vector search. https://zilliz.com/benchmark
I should’ve said that we were already using it for actual search where embeddings similarity is just one component of the overall score. For pure vector stuff a dedicated solution will be faster.
Hi, if you're looking into vector storage and querying, there are several things to consider. For lightweight usage, you can directly use FAISS without any database overhead. For heavy usage, Milvus/Zilliz is the most production-ready solution.
Also, here's a benchmark that allows you to easily test their performance differences through a user-friendly interface. This includes both cloud solutions and open-source options. If you prefer to view pre-tested results, there are standard ones available as well. Check it out here: VectorDBBench. https://github.com/zilliztech/VectorDBBench
Depending on your use case (particularly if it is research-oriented), `scipy.spatial.distance.cdist` and `scipy.spatial.distance.pdist` are your friends. If you are doing something in production, the Postgres extension (pgvector) seems like a good bet.
One way to potentially answer your question about text-chunk-granularity is to take a random sample of 500 pieces of chunked text and look at several "most similar pairs." Do this for a few different chunk-lengths and you'll see how much information is lost...
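To run that experiment you only need a parameterised chunker. A sketch (word-based windows with overlap; real setups often split on sentences or model tokens instead):

```python
def chunk_words(text, size=100, overlap=20):
    """Split text into windows of `size` words, overlapping by `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Run it for a few values of `size`, embed the sampled chunks, and eyeball the most-similar pairs as described above.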
And when you query, you generate an embedding for your query and run a k-NN vector similarity search.
It uses some embeddings I generated with openai. You could use something like easybert or one of the many OSS embeddings models instead. Basically you need some code that converts your text/images/whatever into lists of numbers using such a model.
So:
1) use some magical tool that given a thing returns embeddings. You use this to extract embeddings at index time from your content and at query time for your queries.
2) put your embeddings along with your things in an Elasticsearch index (or a vector db of your choice; Opensearch works similarly to Elasticsearch for this)
3) when querying, create embeddings for your queries and find the nearest match.
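Those three steps, sketched end to end with a deliberately dumb bag-of-words "model" standing in for a real embedding model (everything here is made up for illustration; swap `toy_embed` for a real encoder and the list for your vector db):

```python
import math

def toy_embed(text, vocab):
    # Stand-in for a real embedding model: normalised bag-of-words counts.
    tokens = text.lower().split()
    vec = [float(tokens.count(w)) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

docs = [
    "postgres supports vector search",
    "faiss is a similarity library",
    "bread recipe with yeast",
]
vocab = sorted({w for d in docs for w in d.split()})

# 1) embed at index time, 2) store vectors alongside the documents
index = [(doc, toy_embed(doc, vocab)) for doc in docs]

# 3) embed the query and take the nearest match (unit vectors, so dot = cosine)
query = toy_embed("vector similarity search", vocab)
best_doc, _ = max(index, key=lambda pair: sum(q * v for q, v in zip(query, pair[1])))
```

The plumbing is identical with a real model; only the quality of the vectors changes.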
I built this tutorial as a quick POC to figure out how easy it is with my own library. I'm not an expert. Mission accomplished: it only took me a few hours. The results are not impressive, as this model is probably not very appropriate for the demo content, but it vaguely works. There are a bunch of people smarter than me who suggest that most OSS models struggle to outperform BM25, which just does simple text matching.
Btw, the embeddings are the hard part; the rest is just plumbing. And of course world + dog just glosses over that. There's an interesting article I came across recently that goes a bit more in depth on this: https://blog.metarank.ai/from-zero-to-semantic-search-embedd...
If you're just starting out, I'd use sentence-transformers for calculating embeddings. You'll want a bi-encoder model since they produce embeddings. As the author of the blog, I'm partial towards Milvus (https://github.com/milvus-io/milvus) due to its enterprise readiness and scalability, but FAISS is a great option too if you're looking for something more local and contained.
Milvus will perform vector similarity search for you - all you need to do is give it a query vector.
You could use Marqo, it is a vector search engine that includes the text chunking, inference for calculating embeddings, vector storage, and vector search. You can pick from a heap of open-source models or bring your own fine-tuned ones. It all runs locally in docker https://github.com/marqo-ai/marqo
This is not always good advice: many people are required to avoid off-premise models due to data-ownership issues.
I would therefore suggest a better local default, such as BERT+Qdrant.
It would be wonderful if there were a simpler (single file, SQLite or DuckDB like) database for vectors than the complex (and in some cases, unfortunately cloud-based) ones available now.
> It would be wonderful if there were a simpler (single file, SQLite or DuckDB like) database for vectors than the complex (and in some cases, unfortunately cloud-based) ones available now.
Admittedly, I don't know much about Chroma, but it seems similar to Qdrant to me. Perhaps I'm missing something.
It doesn't appear to store everything in a single db file, but rather in a plethora of files in some directory. It does appear to run locally, though, which is a huge plus.