I find it amazing that the same ideas pop up around the same time. For example, I work on test generation and went down the same path. I tried to find bugs by prompting "Find bugs in this code and implement tests to show it.", but that didn't get me far. Then I switched to property (invariant) testing, like you, but in my case I ask the AI: "Based on the whole codebase, make the property tests.", and then I fuzz random actions on the stateful objects and run the property tests over and over again.
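For anyone curious, here's a minimal sketch of that loop using Hypothesis's stateful testing; the BoundedCounter class and the never_negative invariant are just illustrative stand-ins, not from the parent's project:

    # Sketch: fuzz random actions on a stateful object and re-check an invariant.
    # Requires `pip install hypothesis`; BoundedCounter is a toy stand-in.
    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

    class BoundedCounter:
        """Toy stateful object: a counter that must never go negative."""
        def __init__(self):
            self.value = 0

        def add(self, n):
            self.value += n

        def reset(self):
            self.value = 0

    class CounterMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.counter = BoundedCounter()

        @rule(n=st.integers(min_value=0, max_value=100))
        def add(self, n):
            self.counter.add(n)

        @rule()
        def reset(self):
            self.counter.reset()

        @invariant()
        def never_negative(self):
            # The property that gets re-checked after every random action.
            assert self.counter.value >= 0

    # Collected by pytest; Hypothesis drives random sequences of rules.
    TestCounter = CounterMachine.TestCase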
At first I also wanted to automate everything, but over time I realized that the best split is about 10% human work to 90% AI.
I'd have much more confidence in an AI codebase where the human has chosen the property tests, than a human codebase where the AI has chosen the property tests.
Tests are executable specs. That is the last thing you should offload to an LLM.
Also, a poorly designed test suite makes your codebase extremely painful to change. A well-designed test suite with good abstractions makes it easy to change code, and on top of that, it makes new tests extremely fast to write.
I think the whole idea of getting LLMs to write the tests comes from a pandemic of under-abstracted, labour-intensive test suites. And that just makes the problem worse.
Perhaps it comes from the viewpoint that tests are a chore or grunt work: something you have to do but don't really view as interesting or important.
(like how I describe what git should do and I get the LLM to give me the magic commands with all the confusing nouns and verbs and dashes in the right place).
Yeah, I like writing elegant test abstractions much more than I like writing clumsy, verbose unit tests, and there's an inverse relationship between the two. Maybe people just never want to bother refactoring a test suite, so early shortcuts turn into walls of boilerplate.
While I agree in theory, the problem I have is that the humans I've worked with are much worse at writing tests than they are at writing the implementation. Maybe it's motivation or experience, but test quality is generally much worse than implementation quality -- at least in my experience.
An under-explored approach is to collect data on human usage of the app (from production and from internal testers) and feed that into your generative test inputs.
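A rough sketch of how that could plug into property tests (the recorded_payloads.jsonl file, the field names, and handle_search are all hypothetical):

    # Sketch: seed property-test inputs with recorded production / internal-tester
    # usage, mixed with purely random generation. File name and schema are made up.
    import json
    from hypothesis import given, strategies as st

    with open("recorded_payloads.jsonl") as f:
        RECORDED = [json.loads(line) for line in f]

    random_payloads = st.fixed_dictionaries({
        "user_id": st.integers(min_value=1),
        "query": st.text(max_size=50),
    })

    # Mix of replayed real inputs and freshly generated ones.
    payloads = st.one_of(st.sampled_from(RECORDED), random_payloads)

    @given(payloads)
    def test_search_never_crashes(payload):
        handle_search(payload)  # placeholder for the system under test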
Another idea I'm exploring is AI + mutation testing (https://en.wikipedia.org/wiki/Mutation_testing). It should help the AI generate tests with genuinely full coverage.
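For anyone who hasn't seen it: mutation testing deliberately plants small bugs (mutants) and checks that the test suite kills them. A hand-rolled toy version of the idea, with a clamp function as a made-up example (real tools like mutmut or Cosmic Ray automate the mutation step across a codebase):

    # Toy mutation test: flip an operator in the function under test and
    # verify the suite notices.

    def clamp(x, lo, hi):            # original implementation
        return max(lo, min(x, hi))

    def clamp_mutant(x, lo, hi):     # mutant: min/max swapped
        return min(lo, max(x, hi))

    def run_suite(clamp_impl):
        """Tiny 'test suite'; returns True if all assertions pass."""
        try:
            assert clamp_impl(5, 0, 10) == 5
            assert clamp_impl(-3, 0, 10) == 0
            assert clamp_impl(42, 0, 10) == 10
            return True
        except AssertionError:
            return False

    assert run_suite(clamp) is True          # suite passes on the real code
    assert run_suite(clamp_mutant) is False  # and kills the mutant -> good coverage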