> I don’t know. After the model has been created (trained), I’m pretty sure that generating embeddings is much less computationally intensive than generating text.
An embedding is generated in a single pass through the model, so functionally it's the equivalent of generating a single token from a text generation model.
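For intuition, here's a rough sketch of the asymmetry using Hugging Face transformers. The model names and the pooling choice are just illustrative assumptions, not a recipe:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

text = "Embeddings take one forward pass."

# Embedding: a single forward pass, then pool the hidden states.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = enc(**inputs).last_hidden_state  # one pass over the sequence
embedding = hidden.mean(dim=1)                # mean pooling -> one vector

# Generation: one forward pass *per new token* (KV caching reduces,
# but does not remove, the sequential per-token cost).
tok2 = AutoTokenizer.from_pretrained("gpt2")
gen = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok2(text, return_tensors="pt").input_ids
with torch.no_grad():
    out = gen.generate(ids, max_new_tokens=50)  # ~50 sequential passes
```

So generating 50 tokens costs roughly 50x what the embedding does, modulo caching and model size differences.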
It depends on the architecture (you can very well convert a decoder-only causal model into an embedding model, e.g. Qwen/Mistral), but it is true that traditional embedding models such as BERT-based ones are bidirectional, although it's unclear how much more compute that inherently requires.
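To make the decoder-only case concrete, here's a minimal sketch of last-token pooling, which is roughly the idea behind the Mistral/Qwen-based embedding models. The model name is just an example, and in practice these models are also fine-tuned (e.g. contrastively) before they produce good embeddings:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # assumption: any causal LM works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torch_dtype=torch.float16)

inputs = tok("query: how do embeddings work?", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)

# Under causal attention only the last position has attended to the
# whole sequence, so its hidden state serves as the text embedding.
emb = hidden[0, -1]
emb = emb / emb.norm()  # normalize for cosine similarity
```

A bidirectional model lets every position attend to every other, but either way it's still a single forward pass, which is why the extra compute cost of bidirectionality isn't obvious.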