It depends a lot on the language. I recently tried this with Aider, Claude, and Rust, and after writing one function and its tests the model couldn't even get the code compiling, much less the tests passing. After 6-8 rounds with no progress I gave up.
Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python, where it only has to handle the edge cases it wrote tests for, not all the ones the compiler finds for you.
I've found something similar: when you keep telling the LLM what the compiler says, it keeps adding more and more complexity to try to fix the error, and it either works by chance (leaving you with way overengineered code) or it just never works.
I've very rarely seen it simplify things to get the code to work.
I have the same observation: LLMs seem heavily biased toward adding complexity to solve problems. For example, they'll add explicit handling of the edge cases I pointed out rather than rework the algorithm to eliminate the edge cases altogether. Almost every time it starts with something that's 80% correct, then iterates into something that's 90% correct while being super complex, unmaintainable, and with no chance of ever covering the last 10%.
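To make that concrete, here's a toy sketch of the pattern I mean (the `average` function is a made-up example, not something a model actually produced for me): the first version bolts a branch onto every edge case that gets reported, the second reworks the input handling so those cases stop being special.

```python
def average(values):
    # "fix the edge cases" version: every reported case gets its own branch
    if values is None:
        return 0.0
    if len(values) == 0:
        return 0.0
    if len(values) == 1:
        return float(values[0])
    total = 0
    for v in values:
        total += v
    return total / len(values)


def average_reworked(values):
    # reworked version: normalising the input makes the branches unnecessary
    values = list(values or [])
    return sum(values) / len(values) if values else 0.0
```

In my experience the models almost always produce something shaped like the first function and keep growing it, rather than stepping back to the second.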
Unfortunately this is my experience as well, to the point where I can't trust it with any technology that I'm not intimately familiar with and can thoroughly review.
Hmm, I worked with students in an “intro to programming” type course for a couple years. As far as I’m concerned, “I added complexity until it compiled and now it works but I don’t understand it” is pretty close to passing the Turing test, hahaha.