Embeddings are still underrated—even in RAG.

Legal text is deeply hierarchical and full of pointers (“Art. 5 CF”, “see Art. 34”). One vector per article leaves too much on the table.

Things that moved the needle for us:

– *Multi-layer embeds:* vectors for every paragraph and for every structural level above it (chapter → book); the retriever picks the right granularity [1] (sketch 1 below).

– *Propositional queries:* strip speech-act fluff (“could you please…”) from the query before embedding; both similarity scores and top-k recall jump [2] (sketch 2 below).

– *Poly-vector retrieval:* two vectors per norm, one for its content and one for its label/nickname; handles “what does the CDC say?” as well as internal cross-references [3] (sketch 3 below).
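
Sketch 1, the multi-layer idea in ~20 lines of Python. Everything here is illustrative, assuming sentence-transformers, a toy corpus, and my own level labels; it's not the paper's pipeline:

  # Index every structural level (paragraph -> article -> chapter)
  # so retrieval can pick the right granularity. Toy data only.
  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

  units = [
      {"level": "paragraph", "ref": "Art. 5 p.1", "text": "All persons are equal before the law."},
      {"level": "paragraph", "ref": "Art. 5 p.2", "text": "No one shall be deprived of liberty without due process."},
      {"level": "article",   "ref": "Art. 5",     "text": "All persons are equal before the law. No one shall be deprived of liberty without due process."},
      {"level": "chapter",   "ref": "Ch. I",      "text": "Fundamental rights and duties: equality, liberty, due process."},
  ]
  matrix = np.array(model.encode([u["text"] for u in units], normalize_embeddings=True))

  def search(query, k=2):
      q = model.encode(query, normalize_embeddings=True)
      scores = matrix @ q  # cosine similarity, vectors are unit-norm
      for i in np.argsort(-scores)[:k]:
          print(f"{scores[i]:.3f}  {units[i]['level']:<9}  {units[i]['ref']}")

  search("right to due process")  # a broad query may land on article/chapter level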
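
Sketch 2, the query-stripping idea. The paper presumably rewrites queries with an LLM; this regex version (pattern list is mine) only shows what "embed the bare proposition" means:

  import re

  # Peel politeness/speech-act wrappers off the query before it
  # hits the embedding model. Pattern list is illustrative only.
  FLUFF = re.compile(
      r"^(could you( please)?|can you|would you|please|tell me|"
      r"i('d| would) like to know)[\s,]*",
      re.IGNORECASE,
  )

  def to_proposition(query: str) -> str:
      prev, q = None, query.strip()
      while q != prev:               # strip nested wrappers repeatedly
          prev, q = q, FLUFF.sub("", q)
      return q.rstrip("?.!").strip()

  print(to_proposition("Could you please tell me what Art. 34 says?"))
  # -> "what Art. 34 says"

Embed to_proposition(q) instead of q and the cosine scores stop being dragged down by the wrapper.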
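
Sketch 3, poly-vector: a content vector and a label/alias vector that both resolve to the same norm, scored with a max. Texts and aliases are illustrative:

  import numpy as np
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")

  # One norm, two routes to it: what it says, and what people call it.
  doc_id = "lei-8078"
  content = "Supplier liability for defects in products and services ..."
  label = "CDC; Código de Defesa do Consumidor; Consumer Defense Code"

  vecs = np.array(model.encode([content, label], normalize_embeddings=True))

  def score(query: str) -> float:
      q = model.encode(query, normalize_embeddings=True)
      return float((vecs @ q).max())  # either vector can win the match

  # "CDC" alone barely matches the content text; the label vector catches it.
  print(score("what does the CDC say about supplier liability?"))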

*TL;DR:* If your corpus has hierarchy or aliases, stop thinking “one doc = one embedding.” There's plenty of juice to squeeze before reaching for heavier tricks.

[1] https://arxiv.org/abs/2411.07739 [2] https://arxiv.org/abs/2503.10654 [3] https://arxiv.org/abs/2504.10508


