
It would be good to have a BM25 baseline entry on this leaderboard: https://huggingface.co/spaces/mteb/leaderboard

And I'm sure you're aware of the BEIR paper: https://arxiv.org/abs/2306.07471. Elastic references that in this blogpost: https://www.elastic.co/blog/improving-information-retrieval-...

I agree that BM25 retrieval + vector re-ranking can work. But vector search does bring results to the table that vanilla BM25 can't, even with a large retrieval window. So I do think there is a place for both with the usual "it depends on your data/requirements" caveat.
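To make the lexical side of this concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the Lucene-style IDF variant; the function names `bm25_scores` and `top_k` are made up for illustration and are not taken from any library mentioned in this thread.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    Assumptions: lowercase whitespace tokens, Lucene-style IDF.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical match, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(query, docs, k=2):
    """Return the indices of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "bm25 is a ranking function",
]
first_stage = top_k("bm25 ranking", docs)
synonym_scores = bm25_scores("feline", docs)
```

A query like "feline" scores zero on every document here because no token matches, which is the kind of result a vector model can surface and vanilla BM25 can't, no matter how large the retrieval window is.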



The opposite is also true. BM25 prefers lexical matches and brings back candidates that vector search often doesn’t.

I am not disagreeing that vectors are useful, but I think benchmark-based evidence is not the same as deploying a solution that must scale, be constantly updated, and serve the many use cases customers want, such as filtering and search syntax.

Plus, I think there’s a real danger of herding toward vector retrieval (and even then, only one view of it), which cuts off exploration of diverse solutions.


Hybrid retrieval can possibly be the best of both worlds.
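As a rough illustration of one way to combine the two, here is a sketch of reciprocal rank fusion (RRF), a common technique for merging a keyword ranking with a vector ranking. This is not how txtai does it; the function name, the example document ids, and the constant k=60 (a commonly used default) are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs from a BM25 index and a vector index.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF only looks at rank positions, so it sidesteps having to put BM25 scores and cosine similarities on the same scale.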

I invested a lot of time in the last txtai release adding a minimal-dependency, Python-based BM25 component (https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...). And keyword-only indexes are supported if one desires.

I'm with you 100% on not herding to any one way for any problem. I still remember the pre-2023 world where you weren't pressured to work LLMs into everything.



