Serverless Semantic Search, Free tier only (qdrant.tech)
91 points by todsacerdoti on July 12, 2023 | hide | past | favorite | 61 comments


This tutorial is very complex. Here's how to get free semantic search with much less complexity:

  1. Install sentence-transformers [1]
  2. Initialize the MiniLM model - `model = SentenceTransformer('all-MiniLM-L6-v2')`
  3. Embed your corpus [2]
  4. Embed your queries, then search the corpus
This runs on CPU (~750 sentences per second) or GPU (18k sentences per second). You can use paragraphs instead of sentences if you need longer chunks of text. The embeddings are accurate [3] and only 384 dimensions, so they're space-efficient [4].
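Roughly, steps 2-4 look like this (the corpus and query strings are just placeholders):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer('all-MiniLM-L6-v2')

  # Embed your corpus once (use paragraphs instead of sentences if you need longer chunks)
  corpus = ["How do I install Python?", "What is a vector database?", "A recipe for sourdough bread"]
  corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

  # Embed a query, then search the corpus
  query_embedding = model.encode("setting up a python environment", convert_to_tensor=True)
  hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
  for hit in hits[0]:
      print(corpus[hit['corpus_id']], hit['score'])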

Here's how to handle persistence. I recommend starting with the simplest strategy, and only getting more complex if you need higher performance:

  - Just save the embedding tensors to disk, and load them if you need them later.
  - Use Faiss to store the embeddings (it will use an index to retrieve them faster) [5]
  - Use pgvector, an extension for postgres that stores embeddings
  - If you really need it, use something like qdrant/weaviate/pinecone, etc.
This setup is much simpler and cheaper than using a ton of cloud services to do embeddings. I don't know why people make semantic search so complex.
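For the first two persistence options above, a minimal sketch (file names are illustrative; `model` and `corpus_embeddings` are from the snippet earlier):

  import torch
  import faiss
  import numpy as np

  # Option 1: just save/load the embedding tensor
  torch.save(corpus_embeddings, 'embeddings.pt')
  corpus_embeddings = torch.load('embeddings.pt')

  # Option 2: a flat Faiss index over the same 384-dim vectors
  vectors = corpus_embeddings.cpu().numpy().astype(np.float32)
  faiss.normalize_L2(vectors)          # normalize so inner product ~= cosine similarity
  index = faiss.IndexFlatIP(384)
  index.add(vectors)
  faiss.write_index(index, 'corpus.faiss')

  query = model.encode(["my query"], convert_to_numpy=True).astype(np.float32)
  faiss.normalize_L2(query)
  scores, ids = index.search(query, 3)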

I've used it for https://www.endless.academy, and https://www.dataquest.io and it's worked well in production.

[1] https://www.sbert.net/

[2] https://www.sbert.net/examples/applications/semantic-search/...

[3] https://huggingface.co/blog/mteb

[4] https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

[5] https://github.com/facebookresearch/faiss


I also recommend this approach when you want to understand every step. I recently did a presentation about this topic: https://www.youtube.com/watch?v=hGRNcftpqAk

It covers the process end-to-end, including ClickHouse as a vector database.


TBH, it does not look "less complex", not at all. :) install, install, ... but where to install and run all of this? The topic is "serverless". This means you do not need to run anything yourself; you just need two cloud APIs and a Lambda script.


If you are going to shill for your project, the proper thing to do is disclose that.

https://github.com/azayarni is a contributor (andre-z on twitter).


+1, disappointing to see.


15 minutes of install, install, install beats getting into the vicious SAAS vendor cycle of pay, pay, pay with heavy lock-in.


I'd rather run ~10 lines of code locally than set up 3 cloud services and a lambda function, but to each their own...


How would you host a sentence-transformers model for free? You need it to vectorize each query, so it has to be hosted somewhere. Is there any way to do that for free?


Just run it on CPU, on your own machine. That's the cheapest way. You could also rent a free/cheap VPS, and even parallelize across multiple machines/cores if you need it.


Maybe I'm grumpy today but I am shocked at how many responses you are getting where people think this is a novel idea. Has the engineering mindset really shifted into a default of "buy" even when build could take less than a week?


I was surprised, too, but then I realized they all work at Qdrant.

But the general dialogue around AI-related tools is surprising to me. The production parts of langchain, embedding pipelines, and similar tools can usually be built in a few hours with better observability, performance, and maintainability.


How does sentence-transformers compare to OpenAI embeddings? How long does it take to generate an embedding on a CPU?


Am I being unreasonable to find it bizarre that the tutorial begins with subscribing to 3 different SAAS vendors?

Especially seeing as, these days, you can run a vector store on-disk if you have fewer than 10 million records, pull any free embedding model straight from HuggingFace, and run it on consumer hardware (your laptop).


Doesn't match how things are run in production these days. As a vendor, you need to target the customer's environment as closely as possible. Even if it's theoretically feasible to serve off a single machine, you should have a cloud-native setup ready to go.

In principle you could totally run this on a single bare-metal node, but most will not be doing that in practice.


> you should have a cloud-native setup ready to go

Why is storing the file as a FAISS/LanceDB on-disk vector store not "cloud native"? I am running this setup in production across dozens of nodes; we migrated all of our infrastructure off Pinecone towards this solution and have seen a 10x drop in latency, and the cost improvements have been dramatic (from paid to totally free).

I have a bit of an axe to grind in the vector DB space; it feels like the industry has gaslit developers over the last year or so into thinking SAAS is necessary for vector retrieval, when low-latency on-disk KNN across vectors is a solved problem.
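To make it concrete, the kind of on-disk setup I mean looks roughly like this (table name and vectors are placeholders, and the lancedb API may have shifted a bit):

  import lancedb

  db = lancedb.connect("./vector-data")
  table = db.create_table("docs", data=[
      {"vector": [0.1, 0.2, 0.3], "text": "first document"},
      {"vector": [0.2, 0.1, 0.0], "text": "second document"},
  ])
  # KNN over the on-disk store
  results = table.search([0.1, 0.2, 0.3]).limit(3).to_pandas()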


I totally agree that the latency of this solution leaves a lot of room for improvement. But that's totally beside the point of the article, which is that people can get no-cost semantic search for their personal website using those services. They can also use other solutions, of course.

Also I'm experimenting in further integrating things to reduce latency and most likely will publish another article within the month. Stay tuned.

Finally I somewhat agree that many of the players in the vector DB space try to push their cloud offerings. Which is fine, how else should they make money? And if latency matters that much to you, Qdrant offers custom deployments, too. I believe running Qdrant locally will handily beat your LanceDB solution perf-wise unless you're talking about less than 100k entries. We have both docker containers and release binaries for all major OSes, why not give it a try?
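A rough sketch of what trying it locally looks like with the Python client, once the server is up via docker or a release binary (collection name and vectors here are just placeholders):

  from qdrant_client import QdrantClient
  from qdrant_client.models import Distance, VectorParams, PointStruct

  client = QdrantClient("localhost", port=6333)
  client.recreate_collection(
      collection_name="docs",
      vectors_config=VectorParams(size=384, distance=Distance.COSINE),
  )
  client.upsert(collection_name="docs", points=[
      # placeholder vector; in practice use your embedding model's output
      PointStruct(id=0, vector=[0.0] * 384, payload={"text": "example"}),
  ])
  hits = client.search(collection_name="docs", query_vector=[0.0] * 384, limit=3)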


That's fantastic! Not all organizations (arguably most) are running their tech/infrastructure so well and competently. For a lot of organizations, it makes sense to externalize anything that's not a core competency directly related to their business. For them, less infra and less code is "better". Depending on how the accounting is done it might also be better to have a "vendor" expense rather than "internal team" expense which requires staffing.

All that is to say, maybe there's a lot of money in the SAAS/big cloud space, and customers willing to run their own setup that requires tuning might not be willing to hand them large sums of money? Just theorizing here!

Oh also "cloud native" is like a marketing term vaguely saying "you can hook this into other cloud stuff" and it works with K8s/whatever cloud thingy.


If you need semantic search locally then it's fine, but serving an embedding model might still be challenging. And if you want to expose it, your laptop might not be enough.


I've hosted embedding models on AWS Lambda (fair that this is a vendor, but 1 vs. 3). If you try an LLM with 1B+ parameters you will struggle, but if the difference between a lightweight BERT-like transformer and an LLM is only a few % of loss, why bother getting your credit card out?

Edit: another thought: skip Lambda entirely, run the embedding job on the server as a background process, and use an on-disk vector store (lancedb)
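For reference, a bare-bones sketch of what an embedding Lambda handler might look like in Python (the event/response shapes are illustrative, not from any particular setup):

  import json
  from sentence_transformers import SentenceTransformer

  # Load the model once per container so warm invocations skip the load
  model = SentenceTransformer('all-MiniLM-L6-v2')

  def handler(event, context):
      body = json.loads(event.get('body') or '{}')
      texts = body.get('texts', [])
      embeddings = model.encode(texts).tolist()
      return {'statusCode': 200, 'body': json.dumps({'embeddings': embeddings})}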


Shameless plug: I built Mighty Inference Server to solve this problem. Fast embeddings with minimal footprint. Better BEIR and MTEB scores using the lightning-fast and small E5 V2 models. Scales linearly on CPU; no GPU needed.

https://max.io


The initial version of this actually used Mighty, but I didn't find any free tier available, so I switched to Cohere to keep the $0 price tag.


Mighty is free if you're not making money from it. You could have used Mighty and I would have been glad to help you set it up :)


There's a bit of a difference between what you see following the 'purchase' link and what you see if you scroll down to 'pricing' on your site. It confused me at first too - I'm just so used to seeing a 'pricing' link in the top bar, I pretty much always go there first to see if there's a reasonable free tier for me to play with something.


Thanks for the feedback! I'll do my best to make things more clear.


You serve the embedding model in a lambda and then run something like FAISS in the backend.


Author here. This was a fun exercise to produce a semantic search using only free-tier services (which in the case of Cohere means non-commercial use only and in the case of AWS Lambda is limited to one year).

It also marks my first foray into using cloud services for a project. I've long been a cloud sceptic, and doing this confirmed some of my suspicions regarding complexity (mostly the administrative part around roles and URLs), while the coding part itself was a blast.


Cool, but I suggest looking into terraform for "infrastructure as code": creating all sorts of AWS services/infrastructure and maintaining state.

It seems complex at first, but it is a lot more maintainable and portable than creating AWS infrastructure manually in the console. Once you leave your service to run for 6 months you will forget where stuff is; then, at the worst possible moment, when it goes down and you need to make some change, you'll be frantically looking through AWS docs... "can I create a synthetic canary and use the lambda I already have, or do I have to delete it and create it from the CloudWatch interface?" These kinds of questions are the bane of the AWS ops experience... And once you learn everything, they "bring a new console experience"... So I prefer to learn terraform once and that's it.

Why terraform and not python with boto, cdk, cloudformation, ansible, or something else? Because terraform is easy to port between providers (sort of); people who are not that good at python find terraform easier, so you don't need "senior" people to maintain your code; and finally, it's pretty "opinionated" about how stuff should be done, so it's unlikely you'll open your project in a year and think "why in the world did I do that!?", because all your tf projects will most likely look very similar. Also, tf is mainly for infrastructure as code; there is no configuration management like in ansible... It does one thing and does it relatively well. (I have no relation to TF beyond being a happy user.)


Thank you for the suggestion! I actually thought about using terraform, but I wanted to keep the experiment somewhat minimal regarding technologies and as I had already added AWS Lambda and Rust to the tech stack, I wanted to stay as close to the metal as possible. Besides, this is not for commercial applications, so I don't think high availability is in scope.


I find serverless to be needlessly complex. I'd rather write an HTTP server and serve it off of a t3.micro instance (also free-tier eligible). So much simpler for side projects.


I find "serverless" is indeed more complex, because it's a higher abstraction layer. Often, I see people deploying containers lambdas or pods that are full unix environments, with permissions, network routing, filesystems etc. And then because it's "serverless" they use permissions (IAM), networking stuff (VPC, etc), filesystems (S3 etc), and other capabilities that they already have in a lower abstraction level (unix) and are sort of also using. So the complexity of a unix server is a unix server, but the complexity of "serverless" is a unix server plus all the capabilities you duplicated at a higher abstraction level.

Many other commenters replying to https://news.ycombinator.com/item?id=36693471 are interpreting "complex" as "hard for me to set up." I think that's neither here nor there -- no matter what's underneath, you can always rig something to deploy it with the press of a button. The question is: how many layers of stuff did you just deploy? How big of a can of worms did you just dump on future maintainers?


Serverless is too broad a category to say things like "it's too complex". For example, if you already know docker, you can use google cloud run and just deploy the container to it. You then just say "I want to allow this many simultaneous connections, a minimum of N instances, a maximum of M instances, and each instance should have X vcpus and Y gb of ram".


When starting this project I thought the same thing, but having done it I honestly cannot tell that much of a difference. Yes, there are two more steps in setting up the Lambda function, but in the end you still write an HTTP server and have AWS serve it.


Using a decent IaC framework such as Serverless Framework or the CDK instead of the AWS CLI would make the deployment pretty easy.


I also found, while writing the article but after I had already done my research, that cargo-lambda has grown some additional functionality that could have removed the need for the AWS CLI. But I wanted to get the article out, so I didn't test-drive that.


When using an EC2 instance, testing, deployment, and adding new endpoints are all simpler.


Easier for you.* I've done both for years now and I find developing, deploying, and testing lambdas much simpler.


I agree on testing and dev, but for deployment I think stuff like elastic beanstalk or app engine strike a good balance. Almost never run pure EC2.


“Serverless” often has some upfront complexity but I greatly prefer it because once I have it running I’ve never had scaling issues or even had to think about them. To each their own and I’m sure that serverless isn’t the answer for everyone but for my projects (which are very bursty, with long periods of inactivity) is a dream.


It's a bit easier in Python if you use tools like https://www.serverless.com/. I'm not sure if Rust has something similar yet.


At the cost of being very specific to Rust, Shuttle is pretty damn simple. https://www.shuttle.rs/


It's kind of unclear to me, can I use shuttle without using shuttle.rs (the platform) to actually run it?

Not that I am against paying for a service, but the idea of writing my app with a specific library against a specific platform makes me uneasy.

They have a github project but I think that is just the CLI + rust libs?


From what I've read you can, but I haven't tried myself or looked into it too deeply.


First login failed with: "Callback handler failed. CAUSE: Missing state cookie from login request (check login URL, callback URL and cookie config)." but after retrying it went to the Projects list. The API Key copy button doesn't do anything.


Yeah, it seems the premise of serverless is that your code always restarts, which is exactly the same as the cloud. The only difference is whose job it is to stand in front of the trillion explosive gotchas in the giant 200GB of free middleware called GNU/Linux: theirs with serverless, versus yours with the cloud.

UNIX is close to turning 50, and people are fundamentally paying, as well as getting paid, to make a program loop back to the beginning instead of exiting. I think this is kind of wrong.


It depends what you’re doing. I’ve run many side projects off a single Lambda function with the “public URL” config enabled. I pay $0 because of the free tier and updating the code is as simple as pushing a ZIP file. No SSH, no OS updates, nothing else to worry about. You start to get into trouble when you try to break your app into tons of microservices without using some kind of framework or deployment tooling to keep it straight.
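The "push a ZIP file" part is roughly this with boto3 (the function name is made up):

  import boto3

  with open("app.zip", "rb") as f:
      boto3.client("lambda").update_function_code(
          FunctionName="my-side-project",  # hypothetical function name
          ZipFile=f.read(),
      )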


What about serverless do you find to be “needlessly complex”?


There are just too many required parameters to create a single handler. And then you need to repeat that for each of your N handlers. Take a look at a complete Terraform example for a lambda: https://github.com/terraform-aws-modules/terraform-aws-lambd...

For a personal project it's just a bit much in my experience, especially since most personal projects can easily be served by a t3.micro.


Thanks for clarifying. That’s a fair critique.


to be fair it is (mostly) a Rube Goldberg machine designed to keep backend engineers employed.



Made with Qdrant Vector Database https://github.com/qdrant/qdrant

Cohere embed API https://docs.cohere.com/reference/embed

and AWS Lambda


Those Cohere docs are awesome.


Can't agree more. It is some UX done right.


Agree.

Does anyone know what was used to generate the docs?


Looks like it's this site: https://readme.com/documentation

EDIT: I looked at the source to get to this link.


Owner/founder is a regular on here too:

https://news.ycombinator.com/user?id=gkoberger


I believe they are using https://readme.com/


Looks like it's https://redocly.com

One of the engineers at work showed this to me a few weeks ago and I've been thinking of porting my Swagger docs over to it. I made a mental note to see if a free self-hosted open-source version was available but have not gone back to check on that. Please let me know if you find suitable alternatives.

Edit: This is open source. Looks like the paid offerings are mostly to cover CDN costs.


I have proof-of-concepted[1] an entirely static website offering client-side semantic search using transformers.js[2] and Semantra*[3] with MiniLM-L12[4]. It's usable on fast connections and decent CPUs but is not delightfully fast and can't handle huge document loads. Still, it could be useful for some embedded documents! I hope to incorporate an auto-builder for these experiences in the next big release of Semantra.

[1] https://twitter.com/dylfreed/status/1651572024488218627

[2] https://xenova.github.io/transformers.js

[3] https://github.com/freedmand/semantra *(author of this tool)

[4] https://huggingface.co/sentence-transformers/all-MiniLM-L12-...


You absolutely NEED to make it a plugin for Obsidian!!!!


I built out a Bible semantic search engine for myself last year using qdrant and Rust. It was a really fun experience, qdrant is solid, and Andrey was a nice guy to work with.



