
>Seems like it could easily be training data set size as well.

I'm convinced that's the case. With any major LLM I can carpet-bomb Java/Python boilerplate without issue. For Rust, at least last time I checked, it comes up with non-existent traits, hallucinates more often, and generally struggles to use the context effectively. In agent mode it turns into a fist fight with the compiler, often ending in credit-destroying loops.
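
To give a made-up illustration of that failure mode (a sketch, not output from any particular model): the model reaches for a convenience method that doesn't exist on `Iterator`, and the working fix is just the ordinary map/collect machinery.

    fn main() {
        let words = ["alpha", "beta", "gamma"];
        // A model will happily invent something like `words.iter().to_upper_vec()`,
        // which is not a real method. The actual API is map + collect:
        let upper: Vec<String> = words.iter().map(|w| w.to_uppercase()).collect();
        println!("{:?}", upper);
    }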

And don't get me started when using it for Nix...

So I'm not surprised about a language with an orders-of-magnitude smaller public corpus.



I realized this too, and it led me to the conclusion that LLMs really can't program. I did some experiments to find out what a programming language would look like if it were designed to be written and edited by an LLM, instead of e.g. Python. It turns out that it's extremely verbose, especially in variable names, function names, class names, etc. Classes, in fact, turned out to be largely redundant. But the real insight was that LLMs are great at naming things, and at performing small operations on the little things they named. They're really not good at any logic they can't copy-paste from something they found on the web.


Is this really a surprise? I'd hazard a guess that the ability to program, and beyond that to create new programming languages, requires more than just probabilistic text prediction. LLMs work for programming languages where they have a large enough existing corpus to basically ape a programmer who has seen similar enough text. A real programmer can take the concepts of one programming language and express them in another without having to have digested gigabytes of raw text.

There may be emergent abilities that arise in these models purely due to how much information they contain, but I'm unconvinced that their architecture allows them to crystallize actual understanding. E.g. I'm sceptical that there'd be an area in the LLM weights that encodes the logic behind arithmetic and gives rise to the model actually modelling arithmetic, as opposed to just probabilistically saying that the text `1+1=` tended to be followed by the character `2`.


> I did some experiments to find what a programming language would look like, instead of e.g. python, if it were designed to be written and edited by an LLM.

Did your experiment consist of asking an LLM to design a programming language for itself?


Yes. ChatGPT 4 and Claude 3.7. Both led me to similar conclusions, but they produced very different syntax, which made me believe they were not just regurgitating from a common source.


Great, so your experiment just consisted of having an LLM hallucinate

That's not really an experiment, is it? You basically just used them to create a hypothesis, but you never actually proved anything

They're great at writing text and code, so the fact that the other LLM was able to use that syntax to presumably write code that worked (which you had no way of proving, since you can't actually run that code) doesn't really mean anything

It would be similar to having it respond in a certain JSON format; they are great at that too. It doesn't really translate to a real-world codebase


> That's not really an experiment, is it? You basically just used them to create a hypothesis, but you never actually proved anything

The experiment was checking how well another, unrelated LLM could write code using the syntax, and then doing the same in the reverse direction in new sessions.

> They're great at writing text and code, so the fact that the other LLM was able to use that syntax to presumably write code that worked (which you had no way of proving, since you can't actually run that code) doesn't really mean anything

Of course I could check the code. I had no compiler for it, but "running" code in one's head without a compiler is something first-year students get very good at in their Introduction to C course. I also checked how they edit and modify the code.

This isn't a published study; it was an experiment. And it influenced how I use LLMs for work, for the better. I'd even call that a successful experiment, now that I better understand the strengths and limitations of LLMs in this field.


> And it influenced how I use LLMs for work, for the better

How so?


I let the LLM come up with all the boilerplate classes, functions, modules, etc. that it wants. I let it name things. I let it design the API. But what I don't let it do is design the flow of operations. I come up with a flow chart as a flow of operations and explain that to the LLM. Almost every if statement is the result of something I specifically mentioned.
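
As a rough sketch of that division of labor, in Rust with hypothetical names (not my actual codebase): the flow of operations is spelled out by the human as comments, and the LLM is left to name and fill in the pieces around it.

    // Human-specified flow of operations (the part I don't delegate):
    // 1. load the raw records
    // 2. drop records that fail validation
    // 3. aggregate the rest into a summary
    // The naming and the boilerplate below are the LLM's job.

    #[derive(Debug)]
    struct RecordSummary {
        accepted_record_count: usize,
        rejected_record_count: usize,
    }

    fn record_passes_validation(raw_record: &str) -> bool {
        !raw_record.trim().is_empty()
    }

    fn summarize_raw_records(raw_records: &[&str]) -> RecordSummary {
        let mut summary = RecordSummary { accepted_record_count: 0, rejected_record_count: 0 };
        for raw_record in raw_records {
            // This branch exists because step 2 of the flow explicitly calls for it.
            if record_passes_validation(raw_record) {
                summary.accepted_record_count += 1;
            } else {
                summary.rejected_record_count += 1;
            }
        }
        summary
    }

    fn main() {
        let raw_records = ["a", "", "c"];
        println!("{:?}", summarize_raw_records(&raw_records));
    }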


Is there a reason you believe the models can accurately predict this sort of thing?


There wasn't, but after taking the syntax that I developed with one model to another model, and having it write some code in that syntax, it did very well. Same in the other direction.

LLMs need all their context within easy reach. An LLM-first (for editing) language still has code comments and docstrings. Identifier names are long, and functions don't really need optional parameters. Strict typing is a must.
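
A sketch of that style, written in Rust rather than a new language (the domain and names are invented for illustration): long descriptive identifiers, doc comments on everything, explicit types, and no optional or defaulted parameters.

    /// Calculates the total price of an order in cents, including sales tax.
    /// Both inputs are explicit; nothing is optional or defaulted.
    fn calculate_order_total_price_in_cents(
        order_subtotal_in_cents: u64,
        sales_tax_rate_in_basis_points: u64,
    ) -> u64 {
        let sales_tax_in_cents: u64 =
            order_subtotal_in_cents * sales_tax_rate_in_basis_points / 10_000;
        order_subtotal_in_cents + sales_tax_in_cents
    }

    fn main() {
        let order_subtotal_in_cents: u64 = 12_500;
        let sales_tax_rate_in_basis_points: u64 = 875; // 8.75%
        println!(
            "total: {} cents",
            calculate_order_total_price_in_cents(
                order_subtotal_in_cents,
                sales_tax_rate_in_basis_points,
            )
        );
    }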


In my experience, Claude works well at writing Rust, and Gemini is terrible. Gemini writes Rust as if it were a C++ programmer who has spent one day learning the basics of Rust.
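
A made-up contrast (not actual model output) of what that tends to look like: manual index loops and needlessly concrete reference types versus the idiomatic iterator version.

    // The "C++ programmer on day one of Rust" style: manual indexing,
    // takes &Vec<i64> instead of a slice.
    fn sum_of_squares_cpp_flavored(values: &Vec<i64>) -> i64 {
        let mut total: i64 = 0;
        let mut i: usize = 0;
        while i < values.len() {
            total += values[i] * values[i];
            i += 1;
        }
        total
    }

    // The idiomatic version: borrow a slice, use iterators.
    fn sum_of_squares_idiomatic(values: &[i64]) -> i64 {
        values.iter().map(|v| v * v).sum()
    }

    fn main() {
        let values = vec![1, 2, 3];
        assert_eq!(sum_of_squares_cpp_flavored(&values), sum_of_squares_idiomatic(&values));
    }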


I tried Gemini, OpenAI, Copilot, and Claude on a reasonably big Rust project. Claude worked well for fixing `use` statements, Clippy lints, renames, refactorings, and CI. I used the highest-cost Claude with custom context per crate, but I was never able to get it to write new code well.

For Nix, it is a nice template engine to get started with or to search. I did not try big Nix changes.


Yep. I had similar issues asking Gemini for help with F#; I assume lack of training data is the cause.



