I recently quit my job to build specialized tooling in this space. We're focusing on evals broadly, but starting with high-quality question-and-answer generation for testing these kinds of RAG pipelines. It's surprisingly hard!
Sounds very interesting. I'm building an open-source platform for building LLM apps (agenta.ai) and am looking for eval approaches to integrate for our users. Do you already have a product/API we could use?