We saw this exact failure mode at AgenticQA. Our screening agent was 'obedient' to a fault—under basic social engineering pressure (e.g., 'URGENT AUDIT'), it would override its system prompt and leak PII logs.
The issue isn't the prompt; it's the lack of a runtime guardrail. An LLM cannot be trusted to police itself when the context window gets messy.
I built a middleware API to act as an external circuit breaker for this. It runs adversarial simulations (PII extraction, infinite loops) against the agent logic before deployment. It catches the drift that unit tests miss.
I remember reading another comment a while ago about being able to only trust an llm with sensitive info only if you can guarantee that the output will only be viewed by people who already had access to the sensitive info already, or cannot control any of the inputs to the llm.
HN, OP here.
I built this because I recently watched my LangChain agent burn through ~$50 of OpenAI credits overnight. It got stuck in a semantic infinite loop (repeating "I am checking..." over and over) which my basic max_iterations check didn't catch because the phrasing was slightly different each time.
Realizing that "Pre-Flight" testing for agents is surprisingly hard, I built a small middleware API (FastAPI + LangChain) to automate this.
What it does:
It acts as an adversarial simulator. You send it your agent's system prompt, and it spins up a 'Red Team' LLM to attack it.
Currently checks for:
Infinite Loops: Semantic repetition detection.
PII Leaks: Attempts social engineering ('URGENT AUDIT') to force the agent to leak fake PII, then checks if it gets blocked.
Prompt Injection: Basic resistance checks.
Tech Stack: Python, FastAPI, Supabase (for logs).
It's open-source and I hosted a live instance on Render if you want to try curl it without installing:
https://agentic-qa-api.onrender.com/docs
Would love feedback on what other failure modes you've seen your agents fall into!
The issue isn't the prompt; it's the lack of a runtime guardrail. An LLM cannot be trusted to police itself when the context window gets messy.
I built a middleware API to act as an external circuit breaker for this. It runs adversarial simulations (PII extraction, infinite loops) against the agent logic before deployment. It catches the drift that unit tests miss.
Open sourced the core logic here: [https://github.com/Saurabh0377/agentic-qa-api] Live demo of it blocking a PII leak: [https://agentic-qa-api.onrender.com/docs]"