One mediocre paper/study (it should not even be called that, given all the bias and sample-size issues), and now we have to put up with stories rehashing and dissecting it. I really hope these don't keep getting upvoted in the future.
16 devs. And they weren't allowed to pick which tasks they used the AI on. Ridiculous. Also using it on "old and >1 million line" codebases and then extrapolating that to software engineering in general.
Writers like this then theorize about why AI isn't helpful, then those "theories" get repeated until it feels less like a theory and more like a fact, and it all proliferates into an echo chamber of "AI isn't a useful tool." There are too many anecdotes, plus my own personal experience, for me to accept that it isn't useful.
It is a tool and you have to learn it to be successful with it.
That is not true; AI usage was decided randomly. From the paper:
"To directly measure the impact of AI tools on developer productivity, we conduct a randomized
controlled trial by having 16 developers complete 246 tasks (2.0 hours on average) on well-known
open-source repositories (23,000 stars on average) they regularly contribute to. Each task is randomly assigned to allow or disallow AI usage, and we measure how long it takes developers to
complete tasks in each condition."
> If AI is allowed, developers can use any AI tools or models they choose, including no AI tooling if they expect it to not be helpful. If AI is not allowed, no generative AI tooling can be used.
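For what it's worth, here's a minimal sketch of the per-task randomization the quoted passage describes (not the study's actual code; the task names and seed are made up): each task gets assigned to an "AI allowed" or "AI disallowed" condition, and only in the former can the developer choose whether to actually use AI.

    import random

    random.seed(42)

    # Hypothetical issue list supplied by a maintainer.
    tasks = ["fix-issue-101", "refactor-parser", "update-docs", "perf-tuning"]

    # Each task is randomly assigned a condition, independent of the developer's preference.
    assignments = {t: random.choice(["AI allowed", "AI disallowed"]) for t in tasks}

    for task, condition in assignments.items():
        # Under "AI allowed" the developer may still skip AI tooling if they expect
        # it not to help; under "AI disallowed" no generative AI tooling at all.
        print(f"{task}: {condition}")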
My bad, I didn't read you correctly; what you said is true.
I do think it's important to emphasize that they didn't get to choose in general, though, which your wording (even though it is correct) does not make evident.
Yes, and on the other half they had the option to use AI. That's why I said they were allowed to pick whether or not to use AI on a subset of tasks. On the other subset they were not allowed to use AI.
> and then extrapolating that to software engineering in general.
To the credit of the paper's authors, they were very clear that they were not making a claim about software engineering in general. But everyone wants to reinforce their biases, so...
Great for the authors. But everyone else seems to be extrapolating. Authors have a responsibility and should recognize how their work will be used.
METR may have an OK mission overall, but their motivation is questionable. They published something like this to get attention. Mission accomplished on that front, but they had to have known how this would be twisted.
It was just published. It's too new for anyone to have conducted a direct study critiquing it, and journals don't publish standalone critiques anyway; it would have to be a study that disputes the results.
They used 16 developers. The confidence intervals are wide, and a few atypical issues per dev could swing the headline figure (see the quick sketch at the end of this list).
Veteran maintainers on projects they know inside-out. This is a bias.
Devs supplied the issue list (then randomized) which still leads to subtle self-selection bias. Maintainers may pick tasks they enjoy or that showcase deep repo knowledge—exactly where AI probably has least marginal value.
Time was not independently logged and was self-reported.
No direct quality metric is possible. Could the AI-assisted code have been better?
The Hawthorne effect. Knowing they are observed and paid may make devs over-document, over-prompt, or simply take their time.
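To make the first point concrete, here's a rough back-of-the-envelope sketch (purely hypothetical numbers, not the study's data) of how wide a bootstrap confidence interval is with n = 16, and how much swapping in a couple of atypical developers moves the mean:

    import random
    import statistics

    random.seed(0)

    # Hypothetical per-developer time ratios (AI time / no-AI time; >1 means slower with AI).
    ratios = [1.05, 0.95, 1.10, 1.00, 1.20, 0.90, 1.15, 1.05,
              0.98, 1.02, 1.25, 0.92, 1.08, 1.30, 0.97, 1.03]

    def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
        """Percentile bootstrap CI for the mean of a small sample."""
        means = sorted(
            statistics.mean(random.choices(data, k=len(data)))
            for _ in range(n_resamples)
        )
        return (means[int(alpha / 2 * n_resamples)],
                means[int((1 - alpha / 2) * n_resamples) - 1])

    print("mean ratio:", round(statistics.mean(ratios), 3))
    print("95% bootstrap CI:", tuple(round(x, 3) for x in bootstrap_ci(ratios)))

    # Replace just two developers with atypical values and the headline figure moves.
    shifted = ratios[:-2] + [1.8, 2.0]
    print("mean with 2 outliers:", round(statistics.mean(shifted), 3))
    print("95% bootstrap CI:", tuple(round(x, 3) for x in bootstrap_ci(shifted)))

With only 16 data points the interval stays wide, and two unusual developers shift the point estimate noticeably.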
>They used 16 developers. The confidence intervals are wide and a few atypical issues per dev could swing the headline figure
This is reasonable, but there has been enough anecdotal evidence from developers over the last 3 years for me to believe the data is measuring something real.
>Veteran maintainers on projects they know inside-out. This is a bias
I think this is complete BS. The study was trying to measure the real-world impact of these tools with experienced developers. Having them try the tools out on greenfield work, or on a codebase they are not familiar with, would make this harder to measure.
Also, let's be honest: if the study showed that LLMs DID increase productivity on greenfield work, would that even matter? How many developers out there are starting greenfield projects on a weekly basis? I'd argue very few. So if the study is suggesting that experienced developers working on code they're already familiar with are better off without the assistance of an LLM, then the vast majority of software development work could be better off without LLMs.
>Devs supplied the issue list (then randomized) which still leads to subtle self-selection bias. Maintainers may pick tasks they enjoy or that showcase deep repo knowledge—exactly where AI probably has least marginal value
Again, MANY developers are going to have deep repo knowledge. If they're not faster with LLMs, despite that knowledge, why use them? You're trying to prop this up as bias against the study, but IMO you're missing the point.