Legal text is deeply hierarchical and full of pointers (“Art. 5 CF”, “see Art 34”). One vector per article leaves too much on the table.
Things that moved the needle for us:
– *Multi-layer embeds*: vectors for every paragraph and every structural level (chapter → book); the retriever picks the right granularity. (arXiv:2411.07739)
– *Propositional queries*: strip speech-act fluff (“could you please…”) before embedding. Similarity scores and top-k recall both jump. (arXiv:2503.10654)
– *Poly-vector retrieval*: two vectors per norm, one for the content and one for the label/nickname. Handles “what does the CDC say?” and internal cross-refs. (arXiv:2504.10508)
*TL;DR* If your corpus has hierarchy or aliases, stop thinking “one doc = one embedding.” Plenty of juice to squeeze before heavier tricks.
[1] https://arxiv.org/abs/2411.07739 [2] https://arxiv.org/abs/2503.10654 [3] https://arxiv.org/abs/2504.10508