Just a couple of days ago, I attended a Python meetup where an experienced developer with a couple of years' experience using LLM-assist tools estimated that they gave him 10x the productivity.
The main thing to take away from studies like this is not whether LLM-based tools are useful (clearly they at least sometimes are). It's that the breathless estimates (like "10x") cannot be correct; if they were, even a flawed study (as long as it wasn't fraudulent) would find some improvement.
If you had run this kind of study when we first started using higher-level languages instead of assembler, we would not have had to worry about some subtle (and common) statistical error; the effect would have been too large to miss. LLM tools clearly are useful, at least sometimes, but we haven't seen Microsoft (an early adopter of such tools and an early investor in OpenAI) produce some big burst of productivity. So they cannot be the unprecedented game-changer that their boosters allege.
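To put a number on that intuition, here is a toy simulation of my own (not anything from the METR paper): assume a genuine 10x speedup and fairly noisy per-task time measurements, and the gap is still unmissable.

    import random, statistics

    # Hypothetical sketch: task times with lognormal noise
    # (roughly a factor-of-2 spread either way).
    random.seed(0)
    without_tool = [10 * random.lognormvariate(0, 0.5) for _ in range(20)]  # hours per task
    with_tool = [1 * random.lognormvariate(0, 0.5) for _ in range(20)]      # "10x faster"
    print(statistics.mean(without_tool) / statistics.mean(with_tool))
    # Prints roughly 10: an order-of-magnitude effect swamps this much noise,
    # so even a sloppy (non-fraudulent) study would detect *some* improvement.

The assembler-to-high-level-language jump would have looked like this: no careful statistics required to see it.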
What I would like to see is a study of how well these tools help new developers, or experienced developers using a new language or working in a new problem space. Those seem like the cases where LLM tools would yield the biggest improvement.
What is your theory on why Microsoft (which, as an early investor in OpenAI, has had access to these tools the longest) has shown no signs of making better or more software? Are they using the tools wrong? Is the help going to something that isn't the rate-limiting step? Something else?
This article was one of the clearest analyses I've read of the METR study; I found it really helpful for understanding the merits and (mostly) the challenges of the study format.