One mediocre paper/study (it should not even be called that, given all the bias and sample-size issues), and now we have to put up with stories rehashing and dissecting it. I really hope these don't keep getting upvoted in the future.
16 devs. And they weren't allowed to pick which tasks they used the AI on. Ridiculous. Also using it on "old and >1 million line" codebases and then extrapolating that to software engineering in general.
Writers like this then theorize about why AI isn't helpful, then those "theories" get repeated until it feels less like a theory and more like a fact, and it all proliferates into an echo chamber of "AI isn't a useful tool." There are too many anecdotes, plus my own personal experience, for me to accept that it isn't useful.
It is a tool and you have to learn it to be successful with it.
That is not true; AI usage was decided randomly. From the paper:
"To directly measure the impact of AI tools on developer productivity, we conduct a randomized
controlled trial by having 16 developers complete 246 tasks (2.0 hours on average) on well-known
open-source repositories (23,000 stars on average) they regularly contribute to. Each task is randomly assigned to allow or disallow AI usage, and we measure how long it takes developers to
complete tasks in each condition."
> If AI is allowed, developers can use any AI tools or models they choose, including no AI tooling if they expect it to not be helpful. If AI is not allowed, no generative AI tooling can be used.
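For what it's worth, here's a minimal sketch of the per-task randomization the quoted passage describes (not the study's actual code; the task names and seed are made up): each task gets assigned to an "AI allowed" or "AI disallowed" condition, and only in the former can the developer choose whether to actually use AI.

    import random

    random.seed(42)

    # Hypothetical issue list supplied by a maintainer.
    tasks = ["fix-issue-101", "refactor-parser", "update-docs", "perf-tuning"]

    # Each task is randomly assigned a condition, independent of the developer's preference.
    assignments = {t: random.choice(["AI allowed", "AI disallowed"]) for t in tasks}

    for task, condition in assignments.items():
        # Under "AI allowed" the developer may still skip AI tooling if they expect
        # it not to help; under "AI disallowed" no generative AI tooling at all.
        print(f"{task}: {condition}")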
My bad, I didn't read you correctly; what you said is true.
I do think it's important to emphasize that they didn't get to choose in general, though, which your wording (even though it is correct) does not make evident.
Yes, and on the other half they had the option to use AI. That's why I said they were allowed to pick whether or not to use AI on a subset of tasks. On the other subset they were not allowed to use AI.
> and then extrapolating that to software engineering in general.
To the credit of the paper's authors, they were very clear that they were not making a claim about software engineering in general. But everyone wants to reinforce their biases, so...
Great for the authors. But everyone else seems to be extrapolating. Authors have a responsibility and should recognize how their work will be used.
METR may have an OK mission overall, but their motivation is questionable. They published something like this to get attention. Mission accomplished on that front, but they had to have known how this would be twisted.
It was just published. It's too new for anyone to have conducted a direct study critiquing it, and journals don't publish standalone critiques anyway; it would have to be a study that disputes the results.
They used 16 developers. The confidence intervals are wide, and a few atypical issues per dev could swing the headline figure (see the quick sketch at the end of this list).
Veteran maintainers on projects they know inside-out. This is a bias.
Devs supplied the issue list (then randomized) which still leads to subtle self-selection bias. Maintainers may pick tasks they enjoy or that showcase deep repo knowledge—exactly where AI probably has least marginal value.
Time was not independently logged and was self-reported.
No direct quality metric is possible. Could the AI-assisted code have been better?
The Hawthorne effect. Knowing they are observed and paid may make devs over-document, over-prompt, or simply take their time.
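To make the first point concrete, here's a rough back-of-the-envelope sketch (purely hypothetical numbers, not the study's data) of how wide a bootstrap confidence interval is with n = 16, and how much swapping in a couple of atypical developers moves the mean:

    import random
    import statistics

    random.seed(0)

    # Hypothetical per-developer time ratios (AI time / no-AI time; >1 means slower with AI).
    ratios = [1.05, 0.95, 1.10, 1.00, 1.20, 0.90, 1.15, 1.05,
              0.98, 1.02, 1.25, 0.92, 1.08, 1.30, 0.97, 1.03]

    def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
        """Percentile bootstrap CI for the mean of a small sample."""
        means = sorted(
            statistics.mean(random.choices(data, k=len(data)))
            for _ in range(n_resamples)
        )
        return (means[int(alpha / 2 * n_resamples)],
                means[int((1 - alpha / 2) * n_resamples) - 1])

    print("mean ratio:", round(statistics.mean(ratios), 3))
    print("95% bootstrap CI:", tuple(round(x, 3) for x in bootstrap_ci(ratios)))

    # Replace just two developers with atypical values and the headline figure moves.
    shifted = ratios[:-2] + [1.8, 2.0]
    print("mean with 2 outliers:", round(statistics.mean(shifted), 3))
    print("95% bootstrap CI:", tuple(round(x, 3) for x in bootstrap_ci(shifted)))

With only 16 data points the interval stays wide, and two unusual developers shift the point estimate noticeably.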
>They used 16 developers. The confidence intervals are wide and a few atypical issues per dev could swing the headline figure
This is reasonable, but there has been enough anecdotal evidence from developers over the last 3 years for me to believe the data is measuring something real.
>Veteran maintainers on projects they know inside-out. This is a bias
I think this is complete BS. The study was trying to measure the real-world impact of these tools with experienced developers. Having them try the tools out on greenfield work, or on a codebase they are not familiar with, would make this harder to measure.
Also, let's be honest: if the study showed that LLMs DID increase productivity on greenfield work, would that even matter? How many developers out there are starting greenfield projects on a weekly basis? I'd argue very few. So if the study is suggesting that experienced developers working on code they're already familiar with are better off without the assistance of an LLM, then the vast majority of software development work could be better off without LLMs.
>Devs supplied the issue list (then randomized) which still leads to subtle self-selection bias. Maintainers may pick tasks they enjoy or that showcase deep repo knowledge—exactly where AI probably has least marginal value
Again, MANY developers are going to have deep repo knowledge. If they're not faster with LLMs, despite that knowledge, why use them? You're trying to prop this up as bias against the study, but IMO you're missing the point.