I've been tracking prompt injection for 2.5 years now, and this is the first proposed mitigation that feels genuinely credible to me. Unlike most of the others, it doesn't rely on using other AI models to try to spot injection attacks, which is a flawed approach: if you only catch 99% of attacks, motivated adversarial attackers will break your system.
(Imagine if we protected against SQL injection or XSS using statistical methods that only caught 99% of attacks!)
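To put numbers on that, here's a quick back-of-the-envelope sketch, assuming each attempt is caught independently with probability 0.99:

```python
# Back-of-the-envelope: if a filter independently catches 99% of
# injection attempts, a persistent attacker just keeps trying.
detect_rate = 0.99

for attempts in (1, 10, 100, 500):
    p_at_least_one_success = 1 - detect_rate ** attempts
    print(f"{attempts:>3} attempts -> {p_at_least_one_success:.1%} chance one gets through")

# Roughly: 100 attempts gives ~63%, 500 gives ~99%. A probabilistic
# defence only delays a motivated adversary, it doesn't stop them.
```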
I read your (excellent) blog post just now. This reminds me very much of the Apple "Do you want to share your location" feature.
Do you think that this practically limits the usefulness of an LLM "agent"?
In your email example it's all well and good for me to check that it is indeed sending to bob@mycompany.com and confirm it as trusted from now on, but what if my agent is doing something with lots of code, or a lengthy legal document, etc.? Am I right in thinking I'd have to meticulously check these and confirm they are correct (as the end user)?
If that's the case, even in the email example many users probably wouldn't notice bob@mycumpany.com. Equally, this feels like it would be a non-starter for cron-like, webhook-like, or long-running flows (basically anywhere the human isn't already naturally in the loop).
P.S. They must have called it CaMeL for the two LLMs/humps, otherwise it is the most awful backronym I've ever seen!
> Do you think that this practically limits the usefulness of an LLM "agent"?
Yes, I do. I think it limits the usefulness a lot. Sadly it's the best option we've seen in 2.5 years for building AI "agents" that don't instantly leak your private data to anyone who asks them for it.
I'd love it if someone could come up with something better!
> (Imagine if we protected against SQL injection or XSS using statistical methods that only caught 99% of attacks!)
For what it's worth, we do that all the time: WAFs (web app firewalls). I can't begin to tell you the number of applications whose protections against XSS and SQLi were a combination of "hope we got it right" and "hope the WAF covered us where we didn't".
I once consulted on an M&A vetting gig; they pulled me after a day because the sheer number of critical findings meant there was no way they would move forward. They used the WAF+prayers method.
They're actually a pretty good comparison to most of the other proposed mitigations to prompt injection: slap a bunch of leaky heuristics over the top of your system (likely implemented by a vendor who promises you the world), then cross your fingers and hope.
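To make the leaky-heuristics point concrete, here's a toy filter in the spirit of a naive WAF signature rule (not any real vendor's ruleset): the obvious payload gets blocked, the trivially obfuscated variants sail through.

```python
import re

# Toy signature list, purely for illustration of why pattern matching
# over attacker-controlled input is leaky. Not a real WAF ruleset.
SIGNATURES = [
    re.compile(r"<script\b", re.IGNORECASE),
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),
]

def waf_allows(payload: str) -> bool:
    return not any(sig.search(payload) for sig in SIGNATURES)

print(waf_allows("<script>alert(1)</script>"))   # False -- the obvious payload is caught
print(waf_allows("<svg onload=alert(1)>"))       # True  -- an equivalent XSS sails through
print(waf_allows("1 UNION/**/SELECT password"))  # True  -- comment obfuscation beats the regex
```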
I wrote up my own extensive thoughts on this paper last week: https://simonwillison.net/2025/Apr/11/camel/
Admittedly I have a bias towards it because it builds on a proposal I made a couple of years ago using dual quarantined and privileged LLMs: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/
I'm particularly tickled that a DeepMind academic paper now exists with a section titled "Is Dual LLM of Willison enough?" (Spoiler: it is not.)
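For anyone who hasn't read that post, here's a rough sketch of the dual LLM idea, leaving out everything CaMeL adds on top (the custom interpreter, capabilities, data-flow tracking). call_llm() is a hypothetical stand-in for whatever model API you use:

```python
# Illustrative sketch of the dual-LLM pattern, not CaMeL itself.
# call_llm() is a hypothetical helper standing in for any chat API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model of choice here")

quarantined_results: dict[str, str] = {}   # untrusted outputs, stored by reference

def quarantined(task: str, untrusted_text: str) -> str:
    """Runs the model over untrusted content; has no tool access.
    Returns an opaque variable name, never the text itself."""
    var = f"$VAR{len(quarantined_results) + 1}"
    quarantined_results[var] = call_llm(f"{task}\n\n{untrusted_text}")
    return var

def privileged(user_request: str) -> str:
    """Plans and drives tools, but only ever sees trusted input plus
    symbolic references like $VAR1, never the untrusted text."""
    return call_llm(
        "You may reference untrusted values only by their $VAR names.\n"
        f"User request: {user_request}"
    )

# Tool calls substitute the real values at the last moment, outside
# either model's context, so an injected instruction in an email can
# influence what $VAR1 *contains* but never the privileged model's plan.
```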