I always get the feeling that fundamentally our software should be built on a foundation of sound logic and reasoning. That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there are no issues injected by the LLM tools that inherently lack logic and reasoning, or at least such validation must be on par with human-authored code + review. Because of this, the validation cannot be done by an LLM, as it would just compound the problem.
Unless we get a drastic change in the level of error detection and self-validation that can be done by an LLM, this remains a problem for the foreseeable future.
How is it then that people build tooling where the LLM validates the code they write? Or claim 2x speedups for code written by LLMs? Is there some kind of false positive/negative tradeoff I'm missing that allows people to extract robust software from an inherently not-robust generation process?
I'm not talking about search and documentation, where I'm already seeing a lot of benefit from LLMs today, because between the LLM output and the code is me, sanity checking and filtering everything. What I'm asking about is the: "LLM take the wheel!" type engineering.
It's a common idea, going all the way back to Hoare logic. There was a time when people believed that, in the future, we would write specifications instead of code.
The problem is that it takes several times more effort to verify code than to write it. This makes intuitive sense if you consider that the search space for the properties of code is much larger than the space of the code itself. Rice's theorem states that all non-trivial semantic properties of a program are undecidable.
No, Rice's theorem states that there is no general procedure to take an arbitrary program and decide nontrivial properties of its behaviour. As software engineers, though, we write specific programs which have properties which can be decided, perhaps by reasoning specific to the program. (That's, like, the whole point of software engineering: you can't claim to have solved a problem if you wrote a program such that it's undecidable whether it solved the problem.)
The "several times more effort to verify code" thing: I'm hoping the next few generations of LLMs will be able to do this properly! Imagine if you were writing in a dependently typed language, and you wrote your test as simply a theorem, and used a very competent LLM (perhaps with other program search techniques; who knows) to fill in the proof, which nobody will never read. Seems like a natural end state of the OP: more compute may relax the constraints on writing software whose behaviour is formally verifiable.
> That doesn't mean that we cannot use LLMs to build that software, but it does mean that in the end every line of code must be validated to make sure there are no issues injected by the LLM tools that inherently (...)
The problem with your assertion is that it overlooks the fact that today's software, where every single line of code was typed in by real flesh-and-bone humans, already fails to have adequate test coverage, let alone be validated.
The main problem with output from LLMs is that they were trained on code written by humans, and thus they accurately reflect the quality of the code that's found in the wild. Consequently, your line of reasoning actually criticizes LLMs for outputting the same unreliable code that people write.
Counterintuitively, LLMs end up generating a better output because at least they are designed to simplify the task of automatically generating tests.
From my testing, the robots seem to 'understand' the code, rather than just having learned how to do thing X in code from reading code that does X. I've thrown research papers at them and they just 'get' what needs to be done to take the idea and implement it as a library or whatever. Or, what has become my favorite activity of late, give them some code and ask them how they would make it better -- then take that and split it up into simpler tasks, because they get confused if you ask them to do too much at one time.
As for debugging, they're not so good at that. Some debugging they can figure out, but if they need to do something simple, like counting how far away item A is from item B, I've found you pretty much have to do that for them. Don't get me wrong, they've found some pretty deep bugs I would have spent a bunch of time tracking down in gdb, so they aren't completely worthless, but I have definitely given up on the idea that I can just tell them the problem and they'll get to work fixing it.
And, yeah, they're good at writing tests. I usually work on python C modules and my typical testing is playing with it in the repl but my current project is getting fully tested at the C level before I have gotten around to the python wrapper code.
Overall it's been pretty productive using the robots: code is being written that I wouldn't have spent the time on, unit testing is being used to make sure they don't break anything as the project progresses, and the codebase is being kept pretty sound because I know enough to see when they're going off the rails, as they often do.
Right but by your reasoning it would make sense to use LLMs only to augment an incomplete but rigorous testing process, or to otherwise elevate below average code.
My issue is not necessarily with the quality of the code, but rather with the intention of the code, which is much more important: a good design without tests is more durable than a bad design with tests.
> Right but by your reasoning it would make sense to use LLMs only to augment an incomplete but rigorous testing process, or to otherwise elevate below average code.
No. It makes sense to use LLMs to generate tests. Even if their output matches the worst output the average human can write by hand, having any coverage whatsoever already raises the bar from where the average human output is.
> My issue is not necessarily with the quality of the code, but rather with the intention of the code (...)
That's not the LLM's responsibility. Humans specify what they want and LLMs fill in the blanks. If today's LLMs output bad results, that's a reflection of the prompts. Garbage in, garbage out.
> No. It makes sense to use LLMs to generate tests. Even if their output matches the worst output the average human can write by hand, having any coverage whatsoever already raises the bar from where the average human output is.
Although this is true, it disregards the fact that prompting for tests takes time which may also be spent writing tests, and it's not clear whether poor-quality tests are free, in the sense that further development may cause these tests to fail for the wrong reasons, causing time spent debugging. This is why I used the word "augment": these tests are clearly not the same quality as manual tests, and should be considered separately from manual tests. In other words, they may serve to elevate below-average code or augment manual tests, but not more than that. Again, I'm not saying it makes no sense to do this.
> That's not the LLM's responsibility. Humans specify what they want and LLMs fill in the blanks. If today's LLMs output bad results, that's a reflection of the prompts. Garbage in, garbage out.
This is unlikely to be true, for a couple reasons:
1. Ambiguity makes it impossible to define "garbage", see prompt engineering. In fact, all human natural language output is garbage in the context of programming.
2. As the LLM fills in blanks, it must do so respecting the intention of the code, otherwise the intention of the code erodes, and its design is lost.
3. This would imply that LLMs have reached their peak and only improve by requiring less prompting by a user; this is simply not true, as it is trivial to find problems that current LLMs cannot solve, regardless of the amount of prompting.
> Although this is true, it disregards the fact that prompting for tests takes time which may also be spent writing tests (...)
No, not today at least. Some services like Copilot provide plugins that implement actions to automatically generate unit tests. This means that the unit test coverage you're describing is a right-click away.
> (...) and it's not clear whether poor-quality tests are free, in the sense that further development may cause these tests to fail for the wrong reasons, causing time spent debugging.
That's not how automated tests work. If you have a green test that turns red when you touch some part of the code, this is the test working as expected, because your code change just introduced unexpected changes that violated an invariant.
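To illustrate with a minimal, made-up pytest example (the function and the test are invented, not from any real project): the assertions pin down an invariant, and the test flipping from green to red on a later change is exactly the signal you want.

    # Made-up example: a tiny function plus a test that pins down its invariants.
    import re

    def slugify(title: str) -> str:
        """Turn a title into a URL slug: lowercase, hyphen-separated."""
        return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

    def test_slug_is_lowercase_and_hyphenated():
        slug = slugify("Hello, World!")
        assert slug == "hello-world"
        assert slug == slug.lower()
        assert " " not in slug

If a later "optimization" makes slugify emit spaces or uppercase, the red test isn't noise, it's the invariant being enforced.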
Also, today's LLMs are able to recreate all your unit tests from scratch.
> This is unlikely to be true, for a couple reasons: 1. Ambiguity makes it impossible to define "garbage", see prompt engineering.
"Ambiguity" is garbage in this context.
> 2. As the LLM fills in blanks, it must do so respecting the intention of the code, otherwise the intention of the code erodes, and its design is lost.
That's the responsibility of the developer, not the LLM. Garbage in, garbage out.
> 3. This would imply that LLMs have reached their peak and only improve by requiring less prompting by a user; this is simply not true, as it is trivial to find problems that current LLMs cannot solve, regardless of the amount of prompting.
I don't think that point is relevant. The goal of a developer is still to meet the definition of done, not to tie their hands behind their back and expect working code to just fall into their lap. Currently the main approach to vibe coding is to set the architecture, and lean on the LLM to progressively go from high-level to low-level details. Speaking from personal experience in vibecoding, LLMs are quite capable of delivering fully working apps with a single, detailed prompt. However, you get far more satisfactory results (i.e., the app reflects the same errors in judgement you'd make) if you just draft a skeleton and progressively fill in the blanks.
> That's not how automated tests work
> today's LLMs are able to recreate all your unit tests from scratch.
> That's the responsibility of the developer
> LLMs are quite capable of delivering fully working apps with a single, detailed prompt
You seem to be very resolute in positing generalizations; I think those are rarely true. I don't see a lot of benefit coming out of a discussion like this. Try reading my replies as if you agree with them; it will help you better understand my point of view, which will make your criticism more targeted, so you can avoid generalizations.
This particular person seems to be using LLMs for code review, not generation. I agree that the problem is compounded if you use an LLM (esp. the same model) on both sides. However, it seems reasonable and useful to use it as an adjunct to other forms of testing, though not necessarily a replacement for them. Though again, the degree to which it can be a replacement is a function of the level of the technology, and it is currently at the level where it can probably replace some traditional testing methods, though it's hard to know which, ex-ante.
edit: of course, maybe that means we need a meta-suite, that uses a different LLM to tell you which tests you should write yourself and which tests you can safely leave to LLM review.
Indeed, the idea of a meta-LLM, or some sort of clear distinction between manual and automated-but-questionable tests, makes sense. What bothers me is that this does not seem to be the approach most people take: code produced by the LLM is treated the same as code produced by human authors.
LLM-based coding only really works when wrapped in structured prompts, constrained outputs, external checks etc. The systems that work well aren’t just 'LLM take the wheel' architecture, they’re carefully engineered pipelines. Most success stories are more about that scaffolding than the model itself.
A breakdown would be interesting. I can’t give you hard numbers, but in our case scaffolding was most of the work. Getting the model to act reliably meant building structured abstractions, retries, output validation, context tracking, etc. Once that’s in place you start saving time per task, but there’s a cost up front.
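For a feel of what that scaffolding looks like, here is a stripped-down sketch of just the retry-plus-validation part. The call_llm function is a hypothetical stand-in for whatever model client you actually use, and the JSON schema is only an example:

    # Sketch of output-constraining scaffolding: ask for JSON, validate it,
    # and retry with the failure fed back in. call_llm is a hypothetical stub.
    import json

    MAX_RETRIES = 3

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("replace with your model client")

    def generate_structured(task: str) -> dict:
        prompt = (
            "Return ONLY a JSON object with keys 'summary' (string) and "
            f"'steps' (list of strings) for this task:\n{task}"
        )
        last_error = "no attempts made"
        for _ in range(MAX_RETRIES):
            raw = call_llm(prompt)
            try:
                data = json.loads(raw)
            except json.JSONDecodeError as exc:
                last_error = f"invalid JSON: {exc}"
            else:
                if (isinstance(data, dict)
                        and isinstance(data.get("summary"), str)
                        and isinstance(data.get("steps"), list)):
                    return data
                last_error = "missing or mistyped keys"
            # feed the failure back so the next attempt can correct itself
            prompt += f"\n\nPrevious attempt was rejected ({last_error}). Try again."
        raise RuntimeError(f"no valid output after {MAX_RETRIES} attempts: {last_error}")

The context tracking and task-specific abstractions sit on top of something like this, which is why the scaffolding ends up being most of the work.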
If you are working with natural language, it is by definition 'fuzzy' unless you reduce it to simple templates. So to evaluate whether an output is, semantically, a reasonable answer to an input where non-templated natural verbalization is needed, you need something that 'tests' the output, and that something is not going to be purely 'logical'.
Will that test be perfect? No. But what is the alternative?
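One imperfect but practical option is a similarity check against a reference answer. In this sketch, embed is a hypothetical stand-in for any sentence-embedding model, and the threshold is a tunable guess rather than a guarantee:

    # Fuzzy "test" for free-form output: embedding similarity to a reference.
    # embed() is a hypothetical stub; plug in any sentence-embedding model.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("replace with your embedding model")

    def looks_reasonable(answer: str, reference: str, threshold: float = 0.8) -> bool:
        a, r = embed(answer), embed(reference)
        cosine = float(np.dot(a, r) / (np.linalg.norm(a) * np.linalg.norm(r)))
        return cosine >= threshold

It will misjudge some outputs in both directions, which is exactly the false positive/negative tradeoff the OP is asking about.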
Are you referring to the process of requirements engineering? Because although I agree it's a fuzzy natural-language interface, behind the interface should be (heavy "should") a rigorously defined & designed system, where fuzziness is eliminated. The LLMs need to work primarily with the rigorous definition, not the fuzziness.
It depends on the use case. e.g. Music generation like Suno. How do you rigorously and logically check the output? Or an automated copy-writing service?
The tests should match the rigidity of the case. A mismatch in modality will lead to bad outcomes.
Aha, like that. Yes, that's interesting: the only alternative would be manual classification of novel data, which is extremely labour intensive. If an LLM is able to do the same classification automatically, it opens up use cases that are otherwise indeed impossible.