This would change if there weren't a culture of giving 5 stars to every driver. It started because Uber unfairly punished good drivers over honest 4/5 reviews, and now every Uber driver who uses their phone while driving or has an interior smelling of cigarette smoke gets 5 stars out of obligation.
> - Cursor (can't remember which model, the default)
> - Google's Jules
> - OpenAI Codex with o4
Cursor's "default model" rarely works for me. You have to choose one of the models yourself. Sonnet 4, Gemini 2.5 Pro, and for tricky problems, o3.
There is no public release of o4; you used o4-mini, a model that performs worse than any of the frontier models (Sonnet 4, Gemini 2.5 Pro, o3).
Jules and Codex, if they're like Claude Code, do not work well with "Build me a Facebook clone"-type instructions. You have to break everything down and make your own tech-stack decisions, even if you use these tools to do so. Yes, they're not perfect: they introduce regressions, forget to run linters, or skip checking their work with the compiler. But they work extremely well if you learn to use them, just like any other tool. They are not yet magic that works without you putting in any effort to learn them.
What is your preferred static text embedding model?
For someone looking to build a large embedding search, fast static embeddings seem like a good deal, but almost too good to be true. What quality tradeoff are you seeing with these models versus embedding models with attention mechanisms?
It depends a bit on the task and language, but my go-to is usually minishlab/potion-base-8M for every task except retrieval (classification, clustering, etc.). For retrieval, minishlab/potion-retrieval-32M works best. If performance is critical, minishlab/potion-base-32M is best, although it's a bit bigger (~100 MB).
There's definitely a quality trade-off. We have extensive benchmarks here: https://github.com/MinishLab/model2vec/blob/main/results/REA.... potion-base-32M reaches ~92% of the performance of MiniLM while being much faster (about 70x faster on CPU). It depends a bit on your constraints: if you have limited hardware and very high throughput, these models will still let you make decent-quality embeddings; an attention-based model will of course be better, but also more expensive.
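For anyone who wants to try it, usage is just a couple of lines (a minimal sketch based on model2vec's documented API; the sample sentences are placeholders):

```python
# Minimal sketch: load a potion model and embed some text with model2vec.
from model2vec import StaticModel

# Distilled static model; swap in potion-retrieval-32M for retrieval tasks.
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# encode() returns a numpy array of shape (n_sentences, embedding_dim).
embeddings = model.encode(["example query", "another example sentence"])
print(embeddings.shape)
```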
Thanks, man, this is incredible work; I really appreciate the detail you went into.
I've been wondering whether some miracle could make embeddings 10x faster for my search app, which uses minilmv3, and it sounds like there is one :) I never would have dreamed it. I'll definitely be trying potion-base in my Flutter x ONNX library.
EDIT: I was thanking you for the thorough benchmarking; then it dawned on me that you're on the team that built the model. Fantastic work, I can't wait to try this. And you already have ONNX!
EDIT2: Craziest demo I've seen in a while. I'm seeing a 23x speedup after 10 minutes of work.
ONNX Runtime supports CoreML, though if the ORT maintainers' experience is anything like mine converting an embedding model with Apple's CoreML conversion tool, I can see why that backend would go unmaintained.
It took multiple tries to get the model to convert to the mlpackage format at all, and then a lot of experimenting to get it to run on the ANE instead of the GPU, only to discover that constant reshaping was killing any performance benefit (either you have a fixed multiplication size or you don't bother). Even at a fixed size, and using the attention mask, its operations were slower than saturating the GPU with large batches.
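For anyone curious, the conversion flow I'm describing looks roughly like this (a hedged sketch, not my actual code; the toy encoder, names, and shapes are stand-ins):

```python
# Sketch of converting a traced PyTorch encoder to an .mlpackage with a
# fixed input shape, requesting the ANE. Toy model for illustration only.
import numpy as np
import torch
import torch.nn as nn
import coremltools as ct

class TinyEncoder(nn.Module):
    # Stand-in for a real embedding model.
    def __init__(self, vocab=30522, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, input_ids):
        return self.emb(input_ids).mean(dim=1)  # mean-pool to one vector

example = torch.randint(0, 30522, (1, 128))
traced = torch.jit.trace(TinyEncoder().eval(), example)

mlmodel = ct.convert(
    traced,
    # Fixed shape: as noted above, variable shapes kill any ANE benefit.
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # ask for the ANE
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("encoder.mlpackage")
```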
I discovered an issue where using the newer iOS 18 standard would cause the model conversion to break, and filed an issue on their GitHub, including an example repository for easy reproduction. I got a response quickly, but almost a year later the bug is still unfixed.
Even George Hotz, when he tried hacking around Apple's really bad and unmaintained CoreML library to use the ANE directly, gave up because it was impossible without breaking some pretty core OS features (certificate signing, IIRC).
Apple is just not serious about making the ANE usable through CoreML. Even Apple's internal MLX team can't crack that nut.
ONNX is horrible for anything with variable input shapes, which is why nobody uses it for LLMs. It is fundamentally poorly designed for anything that doesn't take a fixed-size image.
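To make the complaint concrete: you can declare dynamic axes at export time, but every runtime then has to plan around shapes it only learns at inference (a sketch with a toy stand-in model; the names are illustrative):

```python
# Sketch: exporting with dynamic batch/sequence axes. The export works,
# but runtimes often re-plan or fall back when shapes actually vary.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    # Toy stand-in for a real text model.
    def __init__(self, vocab=30522, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, input_ids):
        return self.emb(input_ids).mean(dim=1)

dummy = torch.randint(0, 30522, (1, 128))  # (batch, seq_len)
torch.onnx.export(
    TinyEncoder().eval(),
    (dummy,),
    "encoder.onnx",
    input_names=["input_ids"],
    output_names=["embeddings"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"}},
    opset_version=17,
)
```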
Note: I searched "Protein bars", and it treated all protein bars equally. The 1st through 20th cheapest had <15g of protein per bar. I had to scroll down to the 50th-60th to find bars with 20g of protein, which, to my surprise, were cheaper than Kirkland Signature's protein bars.
Not Dune exactly, but having to run 'eval $(opam env)' in the terminal every time you open an OCaml project, rather than the npm-like default where you can just open the directory and use the package-manager command without thinking about it.
The only issues I've had with OCaml's build system were from using "ocamlopt", "ocamlbuild", and "ocamlfind" manually, but that was solved by OASIS and now Dune. I don't need to think about it: it automatically compiles when I save a file in Emacs, and the one-time setup is very easy.
Guatemala is the only place outside the US where very pushy taxi drivers did not quote me 3-5x the Uber price.
Taxi drivers are scam artists and thieves. There’s no reputational damage either, as you will never see them again.
Uber solves the reputation problem: every driver is rated, and poorly rated and badly behaved drivers do not get to work for them.