It would be good to have a BM25 baseline entry on this leaderboard: https://hugg...

softwaredoug · on Sept 5, 2023

The opposite is also true. BM25 prefers lexical matches and surfaces candidates that vector search often misses.

I'm not disputing that vectors are useful, but I think benchmark-based evidence is not the same as deploying a solution that must scale, be constantly updated, and serve the many use cases customers want, such as filtering and search syntax.

Plus, I think there’s a real danger of herding toward vector retrieval (and even then, toward just one view of it), which cuts off exploration of diverse solutions.

dmezzetti · on Sept 5, 2023

Hybrid retrieval can possibly be the best of both worlds.
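
To make "hybrid" concrete, here is a minimal sketch of one common way to combine a keyword ranking with a vector ranking: reciprocal rank fusion. This is not necessarily how txtai does it; the function name, the document IDs, and the constant k=60 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    k=60 is a commonly used default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a BM25 index and a vector index.
bm25_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc5", "doc3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because it only looks at ranks, this kind of fusion sidesteps the problem that BM25 scores and vector similarities live on different scales.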

I invested a lot of time in the last txtai release adding a minimal-dependency, Python-based BM25 component (https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...). Keyword-only indexes are also supported if desired.

I'm with you 100% on not herding to any one way for any problem. I still remember the pre-2023 world where you weren't pressured to work LLMs into everything.