AI advancement may be unstoppable, but how it is used is up to the government. A lot of the benefits of AI are not actually necessary to stay competitive as a country. My hope is that it ends up getting regulated in some way.
In the meantime, I am planning for the worst. I've cut my spending and I am using the money to invest in things that I think will provide income in a world where jobs are hard to come by.
Just as industrial jobs went to places with lower costs (part of which was lower safety and environmental standards), so anything that can be done by AI will go to places that have lower restrictions on how AI can be used.
> A lot of the benefits of AI are not actually necessary to stay competitive as a country.
I would if I didn't think people on the selling side were paying to get into my shopping basket. In that case, it feels like it would be too easy to get ripped off. Knowing how the world works, that is exactly what will happen.
I do use it, but pretty much just the chat interface. Even that I am a bit wary of because too many times it will suggest doing things that are insanely complex for no reason. I've tried a few different flavors of AI coding such as Cursor, but I didn't see the benefit if I couldn't really trust what it was doing.
Mainly I use it like a coworker to bounce ideas off of, and that seems to work pretty well. Sometimes it will reveal some information, or do something in a way I hadn't thought of. The only problem is that it tends to veer into unnecessary territory if I'm not careful about what enters the context.
I also still just use Google because I find advice written by humans to be much more valuable.
I always see these reports about how much better AI is than humans now, but I can't even get it to help me with pretty mundane problem solving. Yesterday I gave Claude a file with a few hundred lines of code, what the input should be, and told it where the problem was. I tried until I ran out of credits and it still could not work backwards to tell me where things were going wrong. In the end I just did it myself and it turned out to be a pretty obvious problem.
The strange part with these LLMs is that they get weirdly hung up on things. I try to direct them away from a certain type of output and somehow they keep going back to it. It's like the same problem I have with Google where if I try to modify my search to be more specific, it just ignores what it doesn't like about my query and gives me the same output.
Some people say they find LLMs very helpful for coding, some people say they are incredibly bad.
I often see people wondering if some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to diffs it was trained on.
"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."
... which explains why some models are better at code than others. The best coding models (like Claude 3.7 Sonnet) are likely that good because Anthropic spent an extraordinary amount of effort cultivating a really good training set for them.
I get the impression one of the most effective tricks is to load your training set up with as much code as possible that has comprehensive automated tests that pass already.
I've often experienced that I had what I thought was an obscure and very intellectually challenging coding problem, and after prompting the LLM, it basically one-shotted it.
I've been profoundly humbled by the experience, but then it occurred to me that what I thought to be a unique problem had been solved by quite a few people before, and the model had plenty of references to pull from.
Yeah, for the positive example, I described the syntax of a domain-specific language, and the AI basically one-shotted the parsing rules, which only needed minor fixes.
For a counterexample, working on any part of a codebase that's 100% application-specific business logic, with our custom abstractions, the AI is usually so lost that it's basically not even worth using, as the chances of it writing correct and usable code are next to zero.
> ... which explains why some models are better at code than others.
No. It explains why models seem better at code in given situations. When your prompt maps to diffs in the training data that are useful to you, they seem great.
I've been writing code with LLM assistance for over two years now and I've had plenty of situations where I am 100% confident the thing I am doing has never been done by anyone else before.
I've tried things like searching all of the public code on GitHub for every possible keyword relevant to my problem.
... or I'm writing code against libraries which didn't exist when the models were trained.
The idea that models can only write code if they've seen code that does the exact same thing in the past is uninformed in my opinion.
This seems to be very hard for people to accept, per the other comments here.
Until recently I was willing to accept an argument that perhaps LLMs had mostly learned the patterns; e.g. to maybe believe 'well there aren't that many really different leetcode questions'.
But with recent models (eg sonnet-3.7-thinking) they are operating well on such large and novel chunks of code that the idea they've seen everything in the training set, or even, like, a close structural match, is becoming ridiculous.
All due respect to Simon but I would love to see some of that groundbreaking code that the LLMs are coming up with.
I am sure that the functionality implemented is novel, but do you really think the training data cannot possibly have contained the patterns used to deliver these features? How is it that in the past few months or years people suddenly found the opportunity and motivation to write code that cannot in any way, shape or form be represented by patterns in the diffs that have been pushed over the past 30 years?
When I said "the thing I am doing has never been done by anyone else before" I didn't necessarily mean groundbreaking pushes-the-edge-of-computer-science stuff - I meant more pedestrian things like "nobody has ever published Python code to condense and uncondense JSON using this new format I just invented today": https://github.com/simonw/condense-json
I'm not claiming LLMs can invent new computer science. I'm saying it's not accurate to say "they can only produce code that's almost identical to what's in their training data".
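To give a flavour of the kind of task I mean, here's a toy sketch of a condense/uncondense round-trip (a simplified illustration of the general idea only, not the actual format or API in that repo):

```python
# Toy sketch of the general idea: replace repeated string values with small
# integer references plus a lookup table, then reverse the process.
# This is my own illustration, not the real condense-json implementation.

def condense(values):
    table = {}
    refs = []
    for value in values:
        refs.append(table.setdefault(value, len(table)))
    # Invert the table so each reference maps back to its original string.
    lookup = {ref: value for value, ref in table.items()}
    return {"refs": refs, "lookup": lookup}

def uncondense(doc):
    return [doc["lookup"][ref] for ref in doc["refs"]]

data = ["alpha", "beta", "alpha", "alpha", "beta"]
packed = condense(data)
assert uncondense(packed) == data
```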
> "they can only produce code that's almost identical to what's in their training data"
Again, you're misinterpreting me in a way that suggests you are reacting to the perception that someone attacked one of your core beliefs, rather than considering what I am saying and conversing about that.
I never even used the words "exact same thing" or "almost identical". Not even synonyms. I just said overfitting and quoted from an OpenAI/Anthropic paper that said "predict plausible changes to code from examples of changes".
Think about that. Don't react, think. Why do you equate overfitting and plausibility prediction with "exact" and "identical"? It is very obviously not what I said.
What I am getting at is that a cannon will kill the mosquito. But drawing a fly swatter on the cannonball and saying the plastic ones are obsolete now would be in bad faith. There's no need to tell someone pointing that out that they are claiming the cannon can only fire on mosquitoes that have been swatted before.
I don't think I understood your point then. I matched it with the common "LLMs can only produce code that's similar to what they've seen before" argument.
Reading back, you said:
> I often see people wondering if some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to diffs it was trained on.
I'll be honest: I don't understand what you mean by "overfitting to diffs it was trained on" there.
Maybe I don't understand what "overfitting" means in this context?
(I'm afraid I didn't understand your cannon / fly swatter analogy either.)
It's overkill. The models do not capture knowledge about coding. They overfit to the dataset. When one distills data into a useful model, the model can be used to predict the future behavior of the system.
That is the premise of LLM-as-AI. By training these models on enough data, knowledge of the world is purported to have been captured, creating something useful that can be leveraged to process new input and get a prediction of the trajectory of the system in some phase space.
But this, I argue, is not the case. The models merely overfit to the training data. Hence the variable results perceived by people. When their intentions and prompt fit the data in the training, the model appears to give good output. But when the situation and prompt do not, the models do not "reason" about it or "infer" anything. It fails. It gives you gibberish or goes in circles, or worse, if there is some "agentic" arrangement, it fails to terminate and burns tokens until you intervene.
It's overkill. And I am pointing out that it is overkill. It's not a clever system for creating code for any given situation. It overfits to the training data set. And your response is to claim that my argument is something else, not that it's overkill but that it can only kill dead things. I never said that. I see it's more than capable of spitting out useful code even if that exact same code is not in the training dataset. But it is just automating the process of going through Google, docs and Stack Overflow and assembling something for you. You might be good at searching and lucky, and it is just what you need. You might not be so used to using the right keywords, or just be using some uncommon language, or working in a domain that happens to not be well represented, and then it feels less useful. But instead of just coming up short as search, the model overkills and wastes your time and god knows how much subsidized energy and compute. Lucky you if you're not burning tokens on some agentic monstrosity.
You are correct that variable results could be a symptom of a failure to generalise well beyond the training set.
Such failure could happen if the models were overfit, or for other reasons. I don't think 'overfit', which is pretty well defined, is exactly the word you mean to use here.
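(For reference, the usual textbook sense of overfitting looks something like the toy sketch below: training error stays low while error on new inputs blows up. The degrees, sample sizes and noise level here are arbitrary, chosen only to make the effect visible.)

```python
# Toy illustration of overfitting in the textbook sense: a high-degree
# polynomial fits the noisy training points almost perfectly but generalises
# badly to new data, while a low-degree fit does not.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The degree-9 fit typically has near-zero training error but a much
    # larger test error than the degree-3 fit: that gap is overfitting.
    print(degree, round(train_err, 4), round(test_err, 4))
```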
However, I respectfully disagree with your claim. I think they are generalising well beyond the training dataset (though not as far beyond as say a good programmer would - at least not yet). I further think they are learning semantically.
Can't prove it in a comment, except to say that there's simply no way they'd be able to successfully manipulate such large pieces of code, using English language instructions, if they weren't great at generalisation and ok at understanding semantics.
I understand your position. But I think you're underestimating just how much training data is used and how much information can be encoded in hundreds of billions of parameters.
But this is the crux of the disagreement. I think the models overfit to the training data, hence the fluctuating behavior. And you think they show generalization and semantic understanding. Which, yeah, they apparently do. But the failure modes, in my opinion, show that they don't, and would be explained by overfitting.
If that's the case, it turns out that what I want is a system that's "overfitted to the dataset" on code, since I'm getting incredibly useful results for code out of it.
(I'm not personally interested in the whole AGI thing.)
Good man I never said anything about AGI. Why do you keep responding to things I never said?
This whole exchange was you having knee-jerk reactions to things you imagined I said. It has been incredibly frustrating. And at the end you shrug and say "eh it's useful to me"??
I am talking about this because of the deceitfulness, resource efficiency, and societal implications of this technology.
"That is the premise of LLM-as-AI" - I assumed that was an AGI reference. My definition of AGI is pretty much "hyped AI". What did you mean by "LLM-as-AI"?
In my own writing I don't even use the term "AI" very often because its meaning is so vague.
(Worse than that, I said "... is uninformed in my opinion" which was rude because I was saying that about a strawman argument.)
I did that thing where I saw an excuse to bang on one of my pet peeves (people saying "LLMs can't create new code if it's not already in their training data") and jumped at the opportunity.
I've tried to continue the rest of the conversation in good faith though. I'm sorry if it didn't come across that way.
Simon, intelligence exists (and unintelligence exists). When you write «I'm not claiming LLMs can invent new computer science», you imply intelligence exists.
We can implement it. And it is somewhat urgent, because intelligence is very desirable wealth - there is definite scarcity. It is even more urgent after the recent hype has made some people perversely confused about the idea of intelligence.
I’ve spent a fair amount of time trying to coax assistance out of LLMs when trying to design novel or custom neural network architectures. They are sometimes helpful with narrow aspects of this. But more often, they disregard key requirements in favor of the common patterns they were trained on.
That paper describes an experimental diff-focused approach from 2022. It's not clear to me how relevant it is to the way models like Claude 3.7 Sonnet (thinking) and o3-mini work today.
If you do not think past research by OpenAI and Anthropic on how to use LLMs to generate code is relevant to how Anthropic LLMs generate code 3 years later, I really don't think it is possible to have a reasonable conversation about this topic with you.
Can we be sure that research became part of their mainline model development process as opposed to being an interesting side-quest?
Are Gemini and DeepSeek and Llama and other strong coding models using the same ideas?
Llama and DeepSeek are at least slightly more open about their training processes so there might be clues in their papers (that's a lot of stuff to crunch through though).
I myself also think LLMs are more difficult to use for most tasks than is often touted, but I don't really jibe with statements like "Anyone who tells you otherwise is being misleading". Most of the time I find they are just using them in a very different capacity.
> If someone tells you that coding with LLMs is easy they are (probably unintentionally) misleading you. They may well have stumbled on to patterns that work, but those patterns do not come naturally to everyone.
I think you and I must have different definitions of the word "hype".
To me, it means LinkedIn influencers screaming "AGI is coming!", "It's so over", "Programming as a career is dead" etc.
Or implying that LLMs are flawless technology that can and should be used to solve every problem.
To hype something is to provide a dishonest impression of how great it is without ever admitting its weaknesses. That's what I try to avoid doing with LLMs.
"To hype something is to provide a dishonest impression of how great it is" is accurate.
Marketing hype is all about "provide a dishonest impression of how great it is". Putting the weaknesses in fine print doesn't change the hype.
Anyways, I don't mean to pile on, but I agree with some of the other posters here. An awful lot of extremely pro-AI posts that I've noticed have your name on them.
I don't think you are as critical of the tech as you think you are.
One of the reasons I do the "pelican riding a bicycle" thing is that it's a great way to deflate the hype around these tools - the supposedly best LLM in the world still draws a pelican that looks like it was done by a five year old! https://simonwillison.net/tags/pelican-riding-a-bicycle/
If you want AI hype there are a thousand places on the internet you can go to get it. I try not to be one of them.
I agree - the content you write about LLMs is informative and realistic, not hyped. I get a lot of value from it, especially because you write mostly as stream of consciousness and explain your approach and/or reasoning. Thank you for doing that.
In my experience, most people who say "Hey these tools are kind of disappointing" either refuse to provide a reproducible example of how it falls short, or if they do, it's clear that they're not using the tool correctly.
I'd love to see a reproducible example of these tools producing something that is exceptional. Or a clear reproducible example of using them the right way.
I've used them some (sorry, I didn't make detailed notes about my usage, probably used them wrong) but pretty much there are always subtle bugs that, if I didn't know better, I would have overlooked.
I don't doubt people find them useful, personally I'd rather spend my time learning about things that interest me instead of spending money learning how to prompt a machine to do something I can do myself that I also enjoy doing.
I think a lot of the disagreement on HN about this tech is that both sides are mostly at the extremes of either "it doesn't work at all and is pointless" or "it's amazing and makes me 100x more productive", with not much discussion of the middle ground: it works for some stuff, and knowing what it works well on makes it useful, but it won't solve all your problems.
Why are you setting the bar at "exceptional"? If it means that you can write your git commit messages more quickly and with fewer errors, then that's all the payoff most orgs need to make them worthwhile.
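As a rough sketch of the kind of workflow I mean (this assumes something like the `llm` Python library is installed and configured; the model name is just a placeholder):

```python
# Rough sketch: draft a commit message from the staged diff with an LLM.
# Assumes the `llm` Python library is installed and an API key is configured;
# the model name below is only a placeholder.
import subprocess
import llm

# Grab the staged changes as plain text.
diff = subprocess.run(
    ["git", "diff", "--staged"], capture_output=True, text=True, check=True
).stdout

model = llm.get_model("gpt-4o-mini")  # placeholder model name
response = model.prompt(
    "Write a concise, conventional commit message for this diff:\n\n" + diff
)
print(response.text())
```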
Because that is how they are being sold to us and hyped.
> If it means that you can write your git commit messages more quickly and with fewer errors then that's all the payoff most orgs need to make them worthwhile.
This is so trivial that it wouldn't even be worth looking into; it's basically zero value.
> I'd love to see a reproducible example of these tools producing something that is exceptional.
I’m happy that my standards are somewhat low, because the other day I used Claude Sonnet 3.7 to help me refactor around 70 source files and it worked out really nicely - with a bit of guidance along the way it got me a bunch of correctly architected interfaces and base/abstract classes and made the otherwise tedious task take much less time and effort, with some cleanup and improvements along the way. It all also works okay, after the needed amount of testing.
I don’t need exceptional, I need meaningful productivity improvements that make the career less stressful and frustrating.
Historically, that meant using a good IDE. Along the way, that also started to mean IaC and containers. Now that means LLMs.
I honestly think the problem is you are just a lot smarter than I am.
I find these tools wonderful but I am a lazy college dropout of the most average intelligence, a very shitty programmer who would never get paid to write code.
I am intellectually curious though and these tools help me level up closer to someone like you.
Of course, if I had 30 more IQ points I wouldn't need these tools but I don't have 30 more IQ points.
The latest example for me was trying to generate a thumbnail of a PSD in plain C and figure out the layers in there, as I was too lazy to read the specs, with the objective of bundling it as wasm and executing it in a browser. It never managed to extract a thumbnail from a given PSD. It's very confident while making stuff, but it never got anywhere despite the couple of hours I spent on it, which would have been better spent reading the specs and existing code on that topic.
Gemini 2.5 came out just over two weeks ago (25th March) and is a very significant improvement on Gemini 2.0 (5th February), according to a bunch of benchmarks but also the all-important vibes.
LLMs are a casino. They're probabilistic models which might come up with incredible solutions at the drop of a hat, then turn around and fumble even the most trivial stuff - I've had this same experience from GPT-3.5 to the latest and greatest models.
They come up with something amazing once, and then never again, leading me to believe it's not operator error but pure dumb luck or slight prompt wording that led me to be humbled once, and then tear my hair out in frustration the next time.
Granted, newer models tend to do more hitting than missing, but it's still far from a certainty that it'll spit out something good.
Admittedly, the first line is also my reaction to the likes of ASM or system level programming languages (C, C++, Rust…) because they can be unpleasant and difficult to use when compared to something that’d let me iterate more quickly (Go, Python, Node, …) for certain use cases.
For example, building a CLI tool in Go vs C++. Or maybe something to shuffle some data around and handle certain formatting in Python vs Rust. Or a GUI tool with Node/Electron vs anything else.
People telling me to RTFM and spend a decade practicing to use them well wouldn’t be wrong though, because you can do a lot with those tools, if you know how to use them well.
IDK, maybe there's a secret conspiracy of major LLM providers to split users into two groups, one that gets the good models and the other that gets the bad models, and to ensure each user is assigned to the same bucket at every provider.
Surely it's more likely that you and me got put into different buckets by the Deep LLM Cartel I just described, than it is for you to be holding the tool wrong.
When did 3.7 come out? I might have had the same experience. I think I have been using 3.5 with success, but I cannot remember exactly. I may have not used 3.7 for coding (as I had a couple of months break).
I will have to check, but apparently I have been using 3.5 with success, then. I will give 3.7 a try later, I hope it is really not that much worse, or is it? :(
This was 3.7. I did give Gemini a shot for a bit but it couldn’t do it either and the output didn’t look quite as nice. Also, I paid for a year of Claude so kind of feel stuck using it now.
Studio Ghibli might not have been affected yet, but only because the technology is not there yet. What's going to happen when someone can make a competing movie in their style with just a prompt? Should we all just be okay with it because it's been decided that Studio Ghibli has made enough money?
If the effort required to create that can just be ingested by a machine and replicated without consequence, how would it be viable for someone to justify that kind of investment? Where would the next evolution of the art form come from? Even if some company put in the time to create something amazing using AI that does require an investment, the precedent is that it can just be ingested and copied without consequence.
I think aside from what is legal, we need to think about what kind of world we want to live in. We can already plainly see what social media has done to the world. What do you honestly think the world will look like once this plays out?
> What's going to happen when someone can make a competing movie in their style with just a prompt?
Nothing? Just like how if some studio today invests millions of man-hours and does a competing movie in Studio Ghibli's aesthetic (but not including any of Studio Ghibli's characters, branding, etc. - basically, not the copyrightable or trademarkable stuff), nothing out of the ordinary is going to happen.
I mean, artistic style is not copyrightable, right?
You are missing the point entirely. If you can make a movie with just a prompt, who is going to invest the money creating something like a Ghibli movie just to have it ripped off? Instead people will just rip off what has already been done and everything just stagnates.
The lower cost is not the bad thing. Allowing an AI to learn from it and regurgitate is the bad thing. If we can put anything into an AI and then say whatever it spits out is "clean", even though it is obviously imitating what it learned from, whoever puts the investment into trying something new becomes the sucker.
Also, I don't get this weird sense of entitlement people have over someone else's work. Just because it can be copied means it should belong to everyone?
Can you please explain how you jumped to this conclusion?
I fail to see how artistic expression would cease to be a thing and how people will stop liking novelty. And as long as those are a thing, original styles will also be a thing.
If anything, making the entry barriers lower would result in more original styles, as art is [at least] frequently an evolutionary process, where existing ideas meet novel ones and mix in interesting ways. And entirely novel (from-scratch, if that's a thing) ideas will still keep appearing - if someone thinks of something, they're still free to express themselves, as has always been the case. I cannot think of why people would stop painting with brushes, fingers or anything else.
Art exists because of human nature. Nothing changes in this regard.
I'm sorry, but I do not think I understand the idea why and how Studio Ghibli is being "ripped off" in this scenario.
As I've said, art styles are not considered copyrightable. You say I'm missing the point, but I fail to see why. I've used the lack of copyright protection as a reality check, a verifiable fact that can be used to determine the current consensus on the matter. Based on this lack of legal protection, I'm concluding that societies have decided it's not something that needs to be protected, and thus that there is no "ripping off" in replicating a successful style. I have no doubt there are plenty of people who would think otherwise (and e.g. say that the current state of copyright is not optimal - which may well be true), but they need to argue about copyright protections, not technological accessibility. The latter merely exposes the former (by drastically lowering the cost barriers), but is not the root issue.
I also have doubts about your prediction of stagnation, particularly because you seem to ignore the demand side. People want novelty and originality; that was always the case and always will be (or at least for as long as human nature doesn't change). Things will change for sure (they always do), but I don't think stagnation is a realistic scenario.