That was very much the point of my talk - and the reason I had so many live demos (as opposed to pre-recorded demos). I wanted to have at least a few instances of demos going wildly wrong to help emphasize how unreliable this stuff is.
Being unreliable doesn't mean it isn't useful. Journalists handle unreliable sources all the time - fact-checking and comparing multiple sources is built into the profession. As such, I think journalists may be better equipped to make use of LLMs than most other professions!
I found the talk very interesting because it shows both the issues and some potential solutions.
One of the demos (extracting text from a PDF turned PNG) makes me wonder how you're ever going to fact check whether something in there is a hallucination. Innocent doctors won't always turn out to be Michael Jackson's sister after all :)
But then in one of the last demos you're showing how the fact checking can be "engineered" right into the prompt: "What were the themes of this meeting and for each theme give me an illustrative quote". Now you can search for the quote.
This is kind of eye-opening for me, because you could build this sort of deterministic verifiability into all kinds of prompts. It certainly doesn't work for all applications, but where it does, it basically allows you to swap false positives for false negatives, which is extremely valuable in many cases.
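As a rough sketch of what that could look like (the `ask_llm` helper and the JSON-output prompt are my own assumptions, not something from the talk):

```python
import json
import re

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to whichever LLM you use
    and return its raw text response (e.g. via a provider SDK)."""
    raise NotImplementedError

def normalise(text: str) -> str:
    # Collapse whitespace so minor formatting differences don't break the match.
    return re.sub(r"\s+", " ", text).strip().lower()

def themes_with_verified_quotes(transcript: str) -> list[dict]:
    prompt = (
        "What were the themes of this meeting? For each theme, give one "
        "illustrative quote copied verbatim from the transcript. "
        'Respond as JSON: [{"theme": ..., "quote": ...}, ...]\n\n' + transcript
    )
    items = json.loads(ask_llm(prompt))
    haystack = normalise(transcript)
    for item in items:
        # The "Ctrl-F" step: a quote that isn't literally in the source
        # gets flagged as a possible hallucination.
        item["verified"] = normalise(item["quote"]) in haystack
    return items
```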
I think of AI as a “hint generator” that will give you some good guesses, but you still have to verify the guesses yourself. One thing it can help with is coming up with search terms that you might not have thought of.
>What would be the equivalent of searching for quotes in your first (PNG) example?
I don't have a general answer to that. It depends on the specifics of the application. In many cases the documents I'm interested in will have some overlap with structured data I have stored in a database. In the concrete example there could be a register of practicing physicians that could be used for cross-referencing. But in other cases I think it's an unsolved problem that may never be solved completely.
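For the physician example, the cross-reference could be as simple as this (the `physicians` table and `full_name` column are made up for illustration):

```python
import sqlite3

def cross_reference_names(extracted_names: list[str], db_path: str) -> dict[str, bool]:
    """Check each extracted name against a register table.
    Names with no match are candidates for manual review, not automatically wrong."""
    conn = sqlite3.connect(db_path)
    found = {}
    for name in extracted_names:
        row = conn.execute(
            "SELECT 1 FROM physicians WHERE lower(full_name) = lower(?)", (name,)
        ).fetchone()
        found[name] = row is not None
    conn.close()
    return found
```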
>Switching to a text source, what would you do if, say, 30% of the quotes do not match with Ctrl-F?
That's what I meant by swapping false positives for false negatives. You could simply throw out all the items for which you can't find the quote (which can obviously be done automatically). The remaining items are now "fact checked" to some degree. But the number of false negatives will probably have increased because not all the quotes without matches will be hallucinations.
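Reusing the `normalise()` helper from the sketch above, the automatic filter itself is essentially one line:

```python
def keep_only_verified(items: list[dict], transcript: str) -> list[dict]:
    # Drop every item whose quote can't be found verbatim in the source.
    # This trades false positives (hallucinated quotes slipping through)
    # for false negatives (real quotes discarded over trivial mismatches).
    haystack = normalise(transcript)
    return [item for item in items if normalise(item["quote"]) in haystack]
```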
Another approach would be to send the query separately to multiple different models or to ask one model to check another model's claims.
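A rough sketch of both variants, again with placeholder callables standing in for whatever provider SDKs you'd actually use (and reusing `normalise()` from above):

```python
def agree_across_models(prompt: str, ask_models: list) -> str | None:
    """Send the same prompt to several models; each entry in `ask_models`
    is a placeholder callable wrapping one provider's SDK."""
    answers = {normalise(fn(prompt)) for fn in ask_models}
    # Only accept the result if every model gave (essentially) the same answer.
    return answers.pop() if len(answers) == 1 else None

def verify_with_second_model(claim: str, source: str, ask_checker) -> bool:
    """Ask an independent model whether another model's claim is
    actually supported by the source document."""
    verdict = ask_checker(
        "Answer strictly YES or NO: is the following claim supported by the "
        f"document below?\n\nClaim: {claim}\n\nDocument:\n{source}"
    )
    return verdict.strip().upper().startswith("YES")
```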
I think what works and what is good enough is highly application specific.
The baseline is to do the work yourself and compare - the equivalent of a "brute force" solution.
This of course defeats the purpose of the entire exercise. You propose an approach to reduce the validation cost by crafting the prompt in such a way that the validation can be partially automated. This may reduce quality because of false negatives and the like.
The underlying assumption is that this process is cheaper than "brute force" and that the quality is "good enough". It would be interesting to see a write-up of some specific examples.
It is useful, and perhaps very useful, for journalists and other people who use it for one-off tasks. It is very ill-suited for massive automation at the moment, and that's a real problem everyone struggles with.
Using embedding vectors on their own, without the rest of the LLM, can presently deliver a lot of sustainable innovation - at least compared to present-day SOTA models (imho, of course).
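For example, plain embedding retrieval - no generation step, so nothing to hallucinate - using the sentence-transformers library (the model name here is just an arbitrary small example):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Pure embedding retrieval: rank documents by similarity to a query.
model = SentenceTransformer("all-MiniLM-L6-v2")

def top_matches(query: str, documents: list[str], k: int = 5) -> list[tuple[float, str]]:
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity, since vectors are unit-length
    best = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), documents[i]) for i in best]
```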