From reading the PDF it seems that this ‘merely’ generates tests that will repeatedly pass, i.e. that are not flaky. The main purpose is to create a regression test suite by having tests that pin the behaviour of existing code. This isn’t a replacement for developer-written tests, which one would hope come with knowledge of what the functional requirement actually is.
Almost 20 years ago the company I worked for trialled AgitarOne. Its promise was automagically generating test cases for Java code to help explore its behaviour, but Agitar could also create passing tests more or less automatically, which you could then use as a regression suite. Personally I never liked it: it just produced too much stuff, and it was something management didn’t really understand - to them, if test coverage had gone up then quality must have too. I wonder how much better the LLM approach FB talk about here really is compared to that though…
A lot of unit tests generated that way will simply be change detectors (fail when the code changes) rather than regression tests (fail when a bug is reintroduced). That's a pretty big distinction, and I don't see LLMs getting there until they can ascertain test correctness without just assuming that passing tests are good tests, or without depending on an oracle (the prompt will have to include behavior expectations somehow).
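To make the distinction concrete, here's a minimal JUnit sketch against a made-up DiscountCalculator (the class, names and business rule are mine, not from the paper): the first test pins an unspecified default and breaks on any refactor of it, the second encodes the stated requirement and only breaks when that requirement does.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    // Hypothetical code under test, only to illustrate the two kinds of test.
    class DiscountCalculator {
        // Requirement: tier 1 customers get 5% off. The default branch was never specified.
        int discountedCents(int priceCents, int tier) {
            if (tier == 1) {
                return priceCents - priceCents / 20;  // 5% off, in integer cents
            }
            return priceCents;                        // accidental default nobody asked for
        }
    }

    class DiscountCalculatorTest {

        // Change detector: pins whatever the code happens to do today for an
        // unspecified input, so it fails on any refactor, not only on real bugs.
        @Test
        void unknownTierCurrentlyPaysFullPrice() {
            assertEquals(10_000, new DiscountCalculator().discountedCents(10_000, 42));
        }

        // Regression test: encodes the stated requirement, so it only fails
        // when that requirement is actually broken.
        @Test
        void tierOneGetsFivePercentOff() {
            assertEquals(9_500, new DiscountCalculator().discountedCents(10_000, 1));
        }
    }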
This articulates the problem I’m having right now in an interesting way. I’m fine writing unit tests that validate business logic requirements or bug fixes, but writing tests that validate implementations to the point that they reimplement the same logic is a bit much.
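The tautological version of this looks roughly like the sketch below (hypothetical TaxCalculator, made-up 19% rule): the first test derives its expected value with the same formula as production, so a wrong rate would be faithfully mirrored; the second takes its expected value from the spec, or from a worked example someone signed off on.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class TaxCalculator {
        double vat(double net) {
            return net * 0.19;  // hypothetical business rule: 19% VAT
        }
    }

    class TaxCalculatorTest {

        // Reimplementation: the expected value is computed with the same formula,
        // so the test can never disagree with the code.
        @Test
        void vatMatchesItself() {
            double net = 200.0;
            assertEquals(net * 0.19, new TaxCalculator().vat(net), 1e-9);
        }

        // Requirement-driven: the expected value is stated independently of the
        // implementation.
        @Test
        void vatOnTwoHundredIsThirtyEight() {
            assertEquals(38.0, new TaxCalculator().vat(200.0), 1e-9);
        }
    }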
I want to figure out how to count the number of times a test has had to change because of updated requirements vs how many defects it has prevented (vs how much wall-clock time / compute it has consumed in running).
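A rough sketch of the bookkeeping half of that (the "defects prevented" half would need CI failure history, which this ignores): count how often a given test file has changed and how many of those commits look like bug fixes, via a naive keyword match on the commit subject. The path and keywords are placeholders to adapt.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.List;

    public class TestChurnCounter {
        public static void main(String[] args) throws Exception {
            // Hypothetical default path; pass your own test file as the first argument.
            String testFile = args.length > 0 ? args[0] : "src/test/java/OrderServiceTest.java";

            // Ask git for the subject line of every commit that touched the file.
            Process git = new ProcessBuilder(
                    List.of("git", "log", "--follow", "--format=%s", "--", testFile))
                    .redirectErrorStream(true)
                    .start();

            int total = 0;
            int bugFixLooking = 0;
            try (BufferedReader out = new BufferedReader(new InputStreamReader(git.getInputStream()))) {
                String subject;
                while ((subject = out.readLine()) != null) {
                    total++;
                    String s = subject.toLowerCase();
                    // Crude heuristic: commit messages mentioning fixes count as "defect-related".
                    if (s.contains("fix") || s.contains("bug") || s.contains("regression")) {
                        bugFixLooking++;
                    }
                }
            }
            git.waitFor();

            System.out.printf("%s changed in %d commits, %d bug-fix-looking, %d churn-only%n",
                    testFile, total, bugFixLooking, total - bugFixLooking);
        }
    }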
Brilliant distillation of this insight - I've never heard it put in those words before, but it's perfect. It cuts both ways too: if you have lots of tests but most of them aren't really exercising the external API, then you're worse off.
> I want to figure out how to count the number of times a test has had to change with updated requirements vs how many defects they’ve prevented
I did the same some years back on a project that had both a unit test suite with pretty high code coverage and an end-to-end suite.
The results for the unit test suite were abysmal. The number of times it caught an actual regression over a couple of months was close to zero. However, the number of times tests failed simply because code was changed to meet new business requirements was huge. In other words: they provided close to zero value while carrying high maintenance costs.
The end-to-end suite did catch a regression now and then; its drawback was the usual one: it was very slow to run and maintaining it could be quite painful.
The moral of the story could have been to drastically cut down on writing unit tests. Or maybe to write them while implementing a new ticket or fixing a bug, but throw them away after the change went live. Of course this didn't happen. It goes against human nature to throw away something you just put a lot of effort into.
That’s what I believe Facebook have created here, so you’re right that ‘regression’ is a big word - the tests are more likely change detectors, e.g. asserting the existing behaviour of conditionals that were previously never executed.
And it will lock the system into behaviour that might just be accidental. The value of tests is to make sure that you don't break anything that anyone cares about, not that every little never-used edge-case behaviour, which might just be an artefact of a specific implementation, is locked in forever.
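For example (made-up class, not from the paper), a generated suite will happily pin the exact exception an unspecified input happens to throw today, so even a harmless internal change such as adding argument validation now breaks the build, alongside the pins people actually want:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertThrows;

    import org.junit.jupiter.api.Test;

    // Hypothetical code under test.
    class UserNameFormatter {
        String initials(String fullName) {
            String[] parts = fullName.trim().split("\\s+");
            return parts[0].substring(0, 1) + parts[parts.length - 1].substring(0, 1);
        }
    }

    class UserNameFormatterGeneratedTest {

        // Pins an artefact of the implementation: empty input was never specified,
        // and the StringIndexOutOfBoundsException is just what substring() does today.
        @Test
        void emptyNameThrowsStringIndexOutOfBounds() {
            assertThrows(StringIndexOutOfBoundsException.class,
                    () -> new UserNameFormatter().initials(""));
        }

        // Pins behaviour someone actually cares about: the documented contract.
        @Test
        void firstAndLastInitialForTwoPartName() {
            assertEquals("AL", new UserNameFormatter().initials("Ada Lovelace"));
        }
    }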
This is my experience as well. The problem is that pinning down "but what _shall_ it do?" at a low level is seen as redundant as long as everything works, and the typically forgotten edge cases get detected elsewhere anyway. The metric _that_ you ran past those lines of code says nothing about whether you got there for the right reason.
http://www.agitar.com/solutions/products/agitarone.html