I have been vibe coding (90% Cursor, 10% Claude Code) for an entire month now (I know how to code, but I really want to explore this space and push the boundaries).
I found that LLM agents are notoriously bad at two things:
1. Database migrations
2. Remembering they are supposed to write tests and keep ALL of them green (just like our human juniors...)
Database migrations
I have been unable to make the coding agent follow industry best practices. E.g. when a new field is needed in the DB during development, most web frameworks / ORMs offer up and down migrations that change the schema without touching the existing data. I do not want to reset my DB even if I am developing locally.
So far the agent has been doing weird stuff, almost always ending with a DB that needed a reset to get back to work. Oftentimes the agent would ignore my instructions to NEVER reset the DB and NEVER run migrations.
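To make "up and down without a reset" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table and `nickname` column are hypothetical, and a real framework or ORM would generate and track these steps for you; this only illustrates that existing rows survive both directions.

```python
import sqlite3

def up(conn: sqlite3.Connection) -> None:
    """Apply the migration: add the new column, leaving existing rows alone."""
    conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")
    conn.commit()

def down(conn: sqlite3.Connection) -> None:
    """Revert the migration by rebuilding the table without the new column.

    The rebuild approach is used so the sketch does not depend on the
    installed SQLite version supporting DROP COLUMN.
    """
    conn.executescript(
        """
        CREATE TABLE users_rollback (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO users_rollback (id, name) SELECT id, name FROM users;
        DROP TABLE users;
        ALTER TABLE users_rollback RENAME TO users;
        """
    )
    conn.commit()
```

Neither direction drops the database; only the schema changes, and the data already in `users` is carried along.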
By extrapolating this misbehavior to production, I can imagine how badly this could end.
Actually, as long as LLM providers offer no STRICT guarantees for preventing the LLM from doing something, this issue will never get solved. The only way I found is to block the agent from running certain commands (requiring my consent), but that can only take me so far, since there are infinite command line tools the agent can run.
Tests
This one is equally bad in terms of LLMs ignoring instructions, possibly with less potential for disaster, yet still completely weird behavior.
Of all the instructions / prompts I give to LLMs, the part about testing gets ignored the most. By far. E.g. I have in my custom prompts an instruction for always updating the CHANGELOG.md file - which the agent ALWAYS follows even for the tiniest changes.
But when it comes to testing - the agent will almost never write new tests or run the test suite as part of a larger change. I almost always have to tell it explicitly to run the tests and fix the failing ones. And even then it will fix 8/10 tests and celebrate big success (despite the clear instruction that ALL tests must pass, no excuses).
Happy to exchange thoughts and ideas with someone with similar struggles - meet me on X (@cogito_matt). I am working on an LLM-powered agentic AI tool for data analysis / BI and so far the experience has been fantastic - but LLMs really require you to think differently about programming and execution.
> 2. Remembering they are supposed to write tests and keep ALL of them green (just like our human juniors...)
I think the core principle that everyone is forgetting is that your evaluation metric must be kept separate from your optimization metric.
In most setups I've seen, there isn't much emphasis on adding scripting that's external to the LLM, but in my experience having that verification outside of the LLM loop is critical to avoid it cheating. It won't intend to cheat, insofar as it has any intent at all, but you're giving it a boatload of optimization functions to balance and it's prone to randomly dropping one at the worst time. And to be fair, falling flat on its face to win the race [1] is often the implicit conclusion of what we told it to do without realizing the consequences.
If you need something to happen every time, particularly as part of the validation, it is better to have an automated script as part of the process, rather than trying to pile on one more instruction.
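As a rough sketch of what such a script could look like: the gate below runs the test suite in its own process and only looks at the exit code, so the agent's own summary of the run never enters the decision. The function name `accept_change` is made up for illustration, and the test command is whatever runner your project uses.

```python
import subprocess

def accept_change(test_command: list[str]) -> bool:
    """Run the test suite outside the agent and accept only a clean exit.

    The verdict comes from the process exit code, not from anything the
    agent reports about the run.
    """
    result = subprocess.run(test_command, capture_output=True, text=True)
    return result.returncode == 0
```

Wired into a pre-commit hook or CI step, something like `accept_change(["pytest", "-q"])` would reject the "8/10 fixed" case, since pytest exits non-zero whenever any test fails.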
> The only way I found is to block the agent running certain commands (requiring my consent) but that can only take me so far, since there are infinite command line tools the agent can run.
You're doing this the wrong way around. You need to default to blocking and have an allowlist for the exceptions, not default to allowing and a blocklist for the exceptions.
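A minimal sketch of the default-deny idea, assuming you have some hook where proposed commands can be checked before they run. The allowlist entries and the `is_allowed` name are hypothetical; the point is only that anything not explicitly listed is refused.

```python
import shlex

# Hypothetical allowlist: command prefixes the agent may run without asking.
# Anything that does not match one of these is denied.
ALLOWED_PREFIXES = [
    ("git", "status"),
    ("git", "diff"),
    ("pytest",),
    ("ls",),
]

# Characters that can chain, substitute, or redirect commands in a shell.
SHELL_METACHARACTERS = set(";|&<>`$()\n")

def is_allowed(command: str) -> bool:
    """Return True only if the command starts with an allowlisted prefix."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return any(tuple(tokens[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES)
```

With this shape, a migration or reset command is blocked without anyone having to anticipate it, because it was never on the list in the first place.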