
Most code out there is badly written, and models reproduce what most of their dataset does.

I remember, fresh out of college, being shocked by the number of bugs in open source.



More recent models produce much higher quality code than models from 6/12/18 months ago. I believe a lot of this is because the AI labs have figured out how to feed them better examples during training - filtering for higher-quality open source libraries, or loading up on code that passes automated tests.
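
A minimal sketch of what that second kind of filter might look like - pytest-based, and assuming each sample comes bundled with its own tests (the file names and corpus layout are made up for illustration):

    import os
    import subprocess
    import tempfile

    def passes_tests(code: str, tests: str, timeout: int = 30) -> bool:
        """Run a sample's bundled tests in isolation; keep it only if they pass."""
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "sample.py"), "w") as f:
                f.write(code)
            with open(os.path.join(tmp, "test_sample.py"), "w") as f:
                f.write(tests)
            try:
                result = subprocess.run(
                    ["pytest", "-q", "test_sample.py"],
                    cwd=tmp, capture_output=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False  # a hanging test suite counts as a failure
            return result.returncode == 0

    # corpus is a hypothetical list of (code, tests) pairs scraped from repos
    # filtered = [(c, t) for c, t in corpus if passes_tests(c, t)]

Anything that hangs or fails just gets dropped from the training set.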

A lot of model training these days uses synthetic data. Generating good synthetic code data is a whole lot easier than for any other category, since you can at least verify that the code you're generating is grammatically valid and runs without errors.
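
For instance, a cheap syntactic gate over synthetic Python samples might look like this (the sample strings are illustrative, and a real pipeline would sandbox the exec call rather than run untrusted code in-process):

    import ast

    def is_valid_sample(code: str) -> bool:
        """Grammatical check: does the sample even parse?"""
        try:
            ast.parse(code)
        except SyntaxError:
            return False
        return True

    def executes_cleanly(code: str) -> bool:
        """Stricter check: does the sample also run without raising?"""
        try:
            exec(compile(code, "<sample>", "exec"), {"__name__": "__sample__"})
        except Exception:
            return False
        return True

    samples = ["print('hello')", "def f(:", "1 / 0"]
    kept = [s for s in samples if is_valid_sample(s) and executes_cleanly(s)]
    # kept == ["print('hello')"]: "def f(:" fails to parse, "1 / 0" raises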


The dataset isn't making up fake dependencies.



