Can anyone please suggest a good stack for the following:
- calculating text embeddings using open-source/local methods (not OpenAI)
- storing them in a vector database. I'm confused by the myriad of options like Chromadb, Pinecone, etc.
- running vector similarity search using open-source/local methods.
Also, how granular should the text chunks be? Too short and we'll end up with a huge database, too long and we'll probably miss some relevant information in some chunks.
Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.
The `ankane/pgvector` Docker image is a drop-in replacement for the standard `postgres` image, so you can fire it up with Docker very quickly.
It's a normal postgres db with a vector datatype. It can index the vectors and allows efficient retrieval. Both AWS RDS and Google Cloud now support this in their managed Postgres offerings, so postgres+pgvector is a viable managed production vectordb solution.
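As a rough sketch of the pgvector workflow (table and column names here are made up, and the vector dimension must match whatever embedding model you use; a toy 3-dim vector stands in for a real 384- or 768-dim embedding):

```sql
-- Enable the extension (ships with the ankane/pgvector image).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)   -- toy dimension; use your model's output size
);

INSERT INTO items (content, embedding)
VALUES ('hello world', '[0.1, 0.2, 0.3]');

-- Optional approximate index for larger tables.
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Nearest neighbours by cosine distance (the <=> operator).
SELECT content FROM items
ORDER BY embedding <=> '[0.1, 0.1, 0.2]'
LIMIT 5;
```

Without the index it falls back to an exact scan, which is fine for small tables.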
> Also, how granular should the text chunks be?
That depends on the use case, the size of your corpus, the context window of the model you are using, and how much money you are willing to spend.
> Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.
I think I’d need to fine tune the model to see better results with some domain-specific terms, but I couldn’t find much information about how to actually do that - what sort of input data you need, how much of it, etc.
Would be interested to hear if anyone had more to share about fine tuning these models for semantic search.
It's open source, and the way it works is that you give it an instruction describing the type of task or even the domain you want the embedding tailored to, and the embeddings change depending on that instruction.
It has to do with the architecture of the network used to create the embeddings. The embedding is actually the output of the final layer of the model, so its dimensionality is a function of the number of units in that layer.
Different models/architectures will produce different dimension embeddings.
Calculating the embeddings is probably going to be an application-specific thing. Either your application has reasonable pre-trained encoders or you train one off a mountain of matching pairs of data.
Once you have the embeddings in some space, for a PoC I've mostly seen people shove them into faiss, which handles most of the rest very well for small/medium datasets:
https://github.com/facebookresearch/faiss
You don’t need to train anything if you just need embeddings. The data is text. You apply the pretrained model to your text and it returns the embedding. You save it in a vector database if you’re fancy, or a big numpy array if you’re like me. Then run your similarity search (cosine, Euclidean, etc).
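The similarity metrics themselves are a few lines of plain Python (with numpy they become one-liners over the whole array):

```python
import math

def cosine_sim(a, b):
    # cos(a, b) = a.b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `cosine_sim([1, 0], [2, 0])` is 1.0 (same direction regardless of magnitude), while `euclidean([0, 0], [3, 4])` is 5.0.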
A lot of embedding models have poor performance on domain-specific data, which is only mitigated by fine-tuning. Alternatively, the Instructor series mitigates this by fine-tuning the model on instructions, so you can give it instructions targeted at specific domains.
You might want to give Haystack a try (disclaimer: I work at deepset, the company behind Haystack).
Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.
Next, you can query your documents using a retrieval pipeline.
Regarding document store selection: replacing your document store is easy, so I would start with the simplest one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you'll want to tailor your selection to your use case. Here are a few things I've observed:
- You don't want to manage anything and are fine with SaaS -> Pinecone
- You have a very large dataset (500M+ vectors) and want something you can run locally -> maybe Qdrant
- You have metadata you want to incorporate into retrieval, or you want to do hybrid search -> Opensearch/Elasticsearch
Regarding model selection:
We've seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well as a good semantic search baseline with fast indexing times. If you feel the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process: retrieve ~100 documents with BM25 first, then use a cross-encoder to rank them by semantic relevance. Very fast indexing times are a benefit, and you don't need a beefy vector db to store your documents. Latency at query time will be slightly higher, though, and you might need a GPU machine to run your query pipeline.
This is what we use: BERT sentence transformers to generate the embeddings (we used Universal Sentence Encoder before that and it was good too), and ElasticSearch for storage, which has a dense vector data type. It also has a cosineSimilarity function to run searches.
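For reference, a sketch of the Elasticsearch side (the field name `embedding` and the query vector are made up; `dense_vector` mappings and the `cosineSimilarity` script function are the relevant pieces):

```json
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.12, -0.34, 0.56] }
      }
    }
  }
}
```

The `+ 1.0` shifts the score into the non-negative range, since Elasticsearch rejects negative scores from `script_score`.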
I would not use Elastic for vector search due to its architectural limitations and poor performance when conducting vector search. https://zilliz.com/benchmark
I should’ve said that we were already using it for actual search where embeddings similarity is just one component of the overall score. For pure vector stuff a dedicated solution will be faster.
Hi, if you're looking into vector storage and querying, there are several things to consider. For lightweight usage, you can directly use FAISS without any database overhead. For heavy usage, Milvus/Zilliz is the most production-ready solution.
Also, here's a benchmark that allows you to easily test their performance differences through a user-friendly interface. This includes both cloud solutions and open-source options. If you prefer to view pre-tested results, there are standard ones available as well. Check it out here: VectorDBBench. https://github.com/zilliztech/VectorDBBench
Depending on your use case (particularly if it is research-oriented), `scipy.spatial.distance.cdist` and `scipy.spatial.distance.pdist` are your friends. If you are doing something in production, the Postgres extension (pgvector) seems like a good bet.
One way to potentially answer your question about text-chunk-granularity is to take a random sample of 500 pieces of chunked text and look at several "most similar pairs." Do this for a few different chunk-lengths and you'll see how much information is lost...
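To run that experiment you only need a parameterised chunker. A sketch (word-based windows with overlap; real setups often split on sentences or model tokens instead):

```python
def chunk_words(text, size=100, overlap=20):
    """Split text into windows of `size` words, overlapping by `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Run it for a few values of `size`, embed the sampled chunks, and eyeball the most-similar pairs as described above.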
And when you query, you generate an embedding for your query and run a k-NN vector similarity search.
It uses some embeddings I generated with openai. You could use something like easybert or one of the many OSS embeddings models instead. Basically you need some code that converts your text/images/whatever into lists of numbers using such a model.
So:
1) use some magical tool that given a thing returns embeddings. You use this to extract embeddings at index time from your content and at query time for your queries.
2) put your embeddings along with your things in an Elasticsearch index (or a vector db of your choice; Opensearch works similarly to Elasticsearch for this)
3) when querying, create embeddings for your queries and find the nearest match.
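Those three steps, sketched end to end with a deliberately dumb bag-of-words "model" standing in for a real embedding model (everything here is made up for illustration; swap `toy_embed` for a real encoder and the list for your vector db):

```python
import math

def toy_embed(text, vocab):
    # Stand-in for a real embedding model: normalised bag-of-words counts.
    tokens = text.lower().split()
    vec = [float(tokens.count(w)) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

docs = [
    "postgres supports vector search",
    "faiss is a similarity library",
    "bread recipe with yeast",
]
vocab = sorted({w for d in docs for w in d.split()})

# 1) embed at index time, 2) store vectors alongside the documents
index = [(doc, toy_embed(doc, vocab)) for doc in docs]

# 3) embed the query and take the nearest match (unit vectors, so dot = cosine)
query = toy_embed("vector similarity search", vocab)
best_doc, _ = max(index, key=lambda pair: sum(q * v for q, v in zip(query, pair[1])))
```

The plumbing is identical with a real model; only the quality of the vectors changes.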
I built this tutorial as a quick POC to figure out how easy it is with my own library. I'm not an expert. Mission accomplished: it only took me a few hours. The results are not impressive, as this model is probably not very appropriate for the demo content, but it vaguely works. There are a bunch of people smarter than me who suggest that most OSS models struggle to outperform BM25, which just does simple text matching.
Btw, the embeddings are the hard part; the rest is just plumbing. And of course world + dog just glosses over that. There's an interesting article I came across recently that goes a bit more in depth on this: https://blog.metarank.ai/from-zero-to-semantic-search-embedd...
If you're just starting out, I'd use sentence-transformers for calculating embeddings. You'll want a bi-encoder model since they produce embeddings. As the author of the blog, I'm partial towards Milvus (https://github.com/milvus-io/milvus) due to its enterprise readiness and scalability, but FAISS is a great option too if you're looking for something more local and contained.
Milvus will perform vector similarity search for you - all you need to do is give it a query vector.
You could use Marqo, it is a vector search engine that includes the text chunking, inference for calculating embeddings, vector storage, and vector search. You can pick from a heap of open-source models or bring your own fine-tuned ones. It all runs locally in docker https://github.com/marqo-ai/marqo
This is not always good advice: many people are required to avoid off-premise models due to data-ownership issues.
I would therefore suggest a better local default, such as BERT+Qdrant.
It would be wonderful if there were a simpler (single file, SQLite or DuckDB like) database for vectors than the complex (and in some cases, unfortunately cloud-based) ones available now.
> It would be wonderful if there were a simpler (single file, SQLite or DuckDB like) database for vectors than the complex (and in some cases, unfortunately cloud-based) ones available now.
Admittedly, I don't know much about Chroma, but it seems similar to Qdrant to me. Perhaps I'm missing something.
It doesn't appear to store everything in a single db file, but rather in a plethora of files in some directory. It does appear to run locally, though, which is a huge plus.