Semantic distance doesn't define relationships between semantically dissimilar entities the way a structured knowledge graph would, where you can add a new learned relationship directly. Similarly, you can't necessarily do entity resolution with purely embeddings, since you're just comparing similarity under whatever embedding model you're using rather than under your domain or task, and those can diverge a lot depending on how generalized the embedding model is versus what you're actually doing.
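To make that concrete, a minimal sketch (toy data, hypothetical names) of the difference: the similarity score is whatever the embedding model says it is, whereas a graph lets you assert the relationship you actually learned:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Whatever the embedding model produced; "acme_corp" and "ACME-2024-invoice"
    # may land far apart even though, for *your* task, they're the same entity.
    emb = {"acme_corp": np.array([0.9, 0.1, 0.0]),
           "ACME-2024-invoice": np.array([0.1, 0.8, 0.3])}
    print(cosine(emb["acme_corp"], emb["ACME-2024-invoice"]))  # low, and you can't change it

    # A graph lets you record the resolution / relationship you actually learned:
    graph_edges = []
    graph_edges.append(("ACME-2024-invoice", "ISSUED_BY", "acme_corp"))  # explicit, typed, queryable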
AFAICT, most of the "graph" RAG implementations being discussed mean, instead of fancy graph queries or a structured knowledge graph:
1. Primary: Inverted index on keywords (= entities). At ingest time, extract entities and reverse index on them. At query time, extract entities, find the documents they point to, and include those alongside the vector results as part of the reranking set, or maybe do something fancier like a second search based on them (rough sketch after the list).
2. Secondary: Bidirectionally linked summaries. At index time, recursively summarize large documents and embed + link the nested results. At retrieval time, retrieve whatever matches directly, and maybe walk up the hierarchy for more context (second sketch after the list).
3. Secondary: Throw everything into the DB - queries, answers, text, chunks - and link it all together. As with the others, the retrieval strategies that actually get good results generally don't leverage this heterogeneous structure and end up being pretty simple & direct, i.e., anything a KV store could serve.
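For (1), a minimal sketch of the usual shape, with a toy extract_entities() standing in for a real NER model or LLM call, and vector_hits standing in for whatever your vector store returns:

    from collections import defaultdict

    # Toy "entity extraction": in practice an NER model or an LLM call.
    def extract_entities(text):
        known = {"acme", "postgres", "kafka"}
        return {tok.strip(".,?!").lower() for tok in text.split()} & known

    docs = {
        "d1": "Acme migrated their billing pipeline to Kafka.",
        "d2": "Postgres is used for the ledger at Acme.",
        "d3": "General notes about stream processing.",
    }

    # Ingest: inverted index, entity -> doc ids
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for ent in extract_entities(text):
            inverted[ent].add(doc_id)

    def retrieve(query, vector_hits):
        # vector_hits: doc ids from your vector store (stubbed here)
        entity_hits = set()
        for ent in extract_entities(query):
            entity_hits |= inverted[ent]
        # Merge both sides into one rerank pool; a real system reranks by score.
        return set(vector_hits) | entity_hits

    print(retrieve("What does Acme run on Kafka?", vector_hits=["d3"]))
    # {'d1', 'd2', 'd3'}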
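For (2), the recursive-summary hierarchy is roughly this shape; summarize() is a placeholder for an LLM call, and embedding/indexing of each node's text is omitted:

    # Recursive summarization with bidirectional parent/child links.
    def summarize(texts, max_len=200):
        return " ".join(texts)[:max_len]  # stand-in for a real LLM summary

    def build_hierarchy(chunks, fanout=4):
        nodes = {i: {"text": c, "parent": None, "children": []} for i, c in enumerate(chunks)}
        level, next_id = list(nodes), len(nodes)
        while len(level) > 1:
            new_level = []
            for i in range(0, len(level), fanout):
                group = level[i:i + fanout]
                nodes[next_id] = {"text": summarize([nodes[g]["text"] for g in group]),
                                  "parent": None, "children": group}
                for g in group:
                    nodes[g]["parent"] = next_id  # bidirectional link
                new_level.append(next_id)
                next_id += 1
            level = new_level
        # Embed every node's text and index it; at query time you can hit a leaf
        # and walk node["parent"] upward for broader context.
        return nodes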
AFAICT, KV stores are really what's being used here to augment the vector search. Scalable keyword reverse indexing over text has historically been done on a KV/document store like OpenSearch/Elasticsearch, since it doesn't really stress most of the power of a graph engine. Recursive summaries work fine that way too.
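That combination can even live in one request. A hedged sketch of hybrid keyword + vector retrieval, assuming an Elasticsearch 8.x-style client and an index with a dense_vector "embedding" field (names and parameters here are illustrative, adjust per version):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint
    query_vec = [0.1] * 768                      # from your embedding model

    resp = es.search(
        index="docs",
        query={"match": {"text": "acme kafka billing"}},         # BM25 keyword side
        knn={"field": "embedding", "query_vector": query_vec,    # vector side
             "k": 10, "num_candidates": 100},
        size=10,
    )
    hits = [h["_id"] for h in resp["hits"]["hits"]]  # merged candidates to rerank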
Multihop queries and large-graph reasoning are cool but aren't really what these are about. Typed knowledge graphs and even fancier reasoning engines (RDF, ...) even less so.
These retrieval tasks are so simple that almost any DB can work on them in theory: SQL, KV, graph, log, etc. However, as the size grows, their cost/maintenance/perf differences show. We do a lot of graph DB + AI work for our dayjob, so I'm more bullish on graph long-term, but agreed with others: it's good to be intellectually honest so we make real progress on these.