Yeah. Honestly my advice to a junior dev right now would be:
1) Keep up with the trends by reading hacker news. Be on the lookout for decent blog posts but ignore social media and most of youtube.
2) BUT your best bet for actually leveling up is reading these ten books I'll give you (Designing Data-Intensive Applications, etc. etc.), plus doing side projects to get hands-on practice.
It is, but it's much harder to self-promote there or to get anything other than intellectual satisfaction out of it. It wasn't purposely built as a platform for advertising and branding the way social networks were.
You don't even have the dopamine hit of counting your content upvotes.
What do you mean? Every post and comment can receive upvotes, and your account has a total upvote count. The dopamine hit is there when you simply look at the top of the screen next to your username and see the number go up after posting. There are even rewards if you make the number go high enough, such as being able to downvote and flag stuff once you reach a certain level.
Social media is where one shares one's social life (it's in the name!).
Granted there is often crossover between technical discussion forums and social media (I'm thinking of a motorbike forum I frequent), but to suggest the likes of HN is social media is rather silly, isn't it?
I've always thought of it (and seen it defined) as media where the content is generated, liked, and commented upon by users (i.e. socially) rather than the media being controlled by a small group serving in an editorial role (traditional media).
Usenet is (IMO) the original social media. Very few usenet groups had anything to do with sharing one's social life.
Based on my understanding of your comment, the crossover exceptions dominate the discriminative power of the categories, so the taxonomy feels useless.
I left it open-ended because it would somewhat depend on the junior developer's previous knowledge and their focus (do they have a CS degree? what topics do they want to go deep on in their career?)
* Nand2Tetris. Computers from first principles, often fills in a lot of gaps. Possibly preceded by Code: The Hidden Language of Computer Hardware and Software if they need even more grounding.
* SICP if I think they'd be ready for it
* Crafting Interpreters
* If they're working with Python I'd give them Fluent Python by Luciano Ramalho. For whatever other language(s) pick a book that allows them to go from intermediate to advanced in that language and really understand it inside out.
* Ousterhout, A Philosophy of Software Design
* Data and Reality
* The Staff Engineer’s Path by Tanya Reilly to demystify the upper IC track. Or The Manager's Path by Camille Fournier.
frankly terrible advice, especially now that this website is just AI News. If you want to be a better programmer, there are better places, but I'm not going to advertise them here because I do not want to infect them with the HN commentariat which is much too focused on trends.
Engineering fundamentals have not changed in decades. Screw trends, especially at the beginning.
Books written before 2022 are a good bet. Maybe the value of traditional education has also returned.
This website has always somewhat been about trends. Before AI it was the metaverse. Crypto before that. NoSQL before that. Rails before that. Arguably there's always the undercurrent of the cult of personality of PG. ...
I'm sure I've missed some things, I've taken more than one hiatus.
Nuance usually cannot be well conveyed in a blog post. Someone is always selling something. When something exists long enough the bullshit behind it is eventually revealed. Reality is messy and there's always bullshit hiding somewhere.
It doesn't mean HN is useless. I use it as a bellwether to see what other people are putting their attention on. I don't pay attention to AI other than what's on here. I mostly follow my interests, which outside my day job currently center on concurrency. But I don't write about it.
Ultimately the place to become a better programmer is behind a keyboard, learning what works, what doesn't, where it does, where it doesn't, and why or why not. It's difficult to convey all the nuance behind every decision, which means most people never actually do it. Any post is dripping with assumptions. In my mind nearly any decision could be justified, and I would be surprised to find a place that actually attempts to teach this.
Rather than consuming media you're probably better off putting it out there and letting people tell you all the ways you're "wrong" (because they love to do that). Somewhat paradoxically I don't really follow my own advice, but that's humans for you.
I mean, the whole point of my post was to say that books are more useful than anything else, and it would certainly be better to have access to 10 good books and no hacker news at all than the reverse. But I do think there is value in keeping an eye on trends. They shape our industry; you may wish they didn’t, I wish they didn’t sometimes, but they do. Obviously good fundamentals are more important, but I think it’s doing a disservice to juniors if you tell them to ignore the “commentariat” completely.
Not obvious from the title, but this is a great post on why it's a terrible terrible time to be an entry-level developer right now:
* Job market sucks for junior devs due to the end of ZIRP and normalization of layoffs
* Dearth of good developer role models due to the "public sphere" getting worse (collapse of developer twitter community, rise of vacuous influencers). I'm a little skeptical about this one as I didn't really benefit much from these sources myself when getting into the industry. And there's always hacker news, which is doing fine!
* Loss of good mentorship opportunities due to rise of remote work
* AI tools are good for seniors who already know how to do things, but terrible for junior devs trying to learn. This is the forklifts metaphor. (And AI is probably not helping the junior dev job market either, although that was already bad for other reasons as mentioned above.)
I am truly worried about where the next generation of senior devs is going to come from. Some juniors, maybe 10% of them, will be fine no matter what: brilliant engineers who are disciplined enough to teach themselves the skills they need and can also adapt well to AI dev tools. I don't worry about them. But I worry about what happens to the median junior engineer, and consequently what our profession will look like in 10-20 years.
> LLMs are trained to predict what the “next word” would be in a sentence. Their objective requires the LLM to keep surprise to an absolute minimum.
from which the author concludes that pre-training introduces bias against being able to tell jokes. I see no reason for this to be true. This feels like they’re imposing their intuitive understanding of surprise onto the emergent properties of a very complex process (“minimize the cross-entropy loss function across a huge training corpus”).
I think if what the author said was true, you’d be able to improve joke-writing ability by increasing temperature (i.e., allowing more unexpected tokens). I doubt this actually works.
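For reference, a minimal sketch (my own toy illustration, nobody's production sampler) of what the temperature knob actually does: the logits get divided by T before the softmax, so higher T flattens the distribution and lower-probability, "more surprising" tokens get sampled more often. Whether that extra surprise yields funnier jokes rather than just noisier wording is exactly the empirical question.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature=1.0):
    # Scale logits by 1/T, then softmax; T > 1 flattens, T < 1 sharpens.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

toy_logits = [4.0, 2.0, 0.5, 0.1]  # hypothetical next-token scores
for t in (0.2, 1.0, 2.0):
    _, probs = sample_with_temperature(toy_logits, temperature=t)
    print(f"T={t}: {np.round(probs, 3)}")  # probability mass spreads out as T grows
```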
As an aside, I just asked gpt5-thinking to write some jokes on a specific niche topic, and I’d say it was batting maybe 20% of them being moderately funny? Probably better than I’d get out of a room of human beings. So much like with code, LLMs aren’t at the level of a senior developer or expert comedian, but are around the level of a junior dev or an amateur at standup night.
> I think if what the author said was true, you’d be able to improve joke-writing ability by increasing temperature (i.e., allowing more unexpected tokens). I doubt this actually works.
I would think it would help tbh. Seems worth a try at least.
Many people use this kind of reasoning to justify that LLMs can't be creative, are destined to write bland text, etc. (one notable example was Ted Chiang in the New Yorker) but it has never made any sense.
In my view, the easiest mental model that can be used to roughly explain what LLMs do is a Markov chain. Of course, comparing LLMs to a Markov chain is a gross simplification but it's one that can only make you underestimate them, not vice versa, for obvious reasons.
Well, even a Markov chain can surprise you. While they predict the next word probabilistically, if the dice roll comes out just right, they can choose a low-probability word in the right place and generate original and unexpected text.
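To make that concrete, here's a toy word-level Markov chain (nothing like an LLM, just enough to show that even a crude probabilistic next-word sampler can produce sequences that never appeared verbatim in its training text):

```python
import random
from collections import defaultdict

corpus = ("the cat sat on the mat the cat ate the fish "
          "the dog sat on the rug the dog chased the cat").split()

# Next-word transition lists; repeated pairs act as probability weights.
chain = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    chain[a].append(b)

random.seed(7)
word, out = "the", ["the"]
for _ in range(12):
    word = random.choice(chain[word])  # roll the dice for the next word
    out.append(word)
print(" ".join(out))  # very likely a sequence the corpus never contained verbatim
```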
Add to this that LLMs are much better at "Markov chaining" than Markov chains themselves, that there is the added instruction tuning (including RLHF) which can be used to bias the model towards more creative/original text that humans like, and that LLMs often pull off things in ways that we don't even really understand - and these kinds of claims sound very naive.
Usually the "10x" improvements come from greenfield projects or at least smaller codebases. Productivity improvements on mature complex codebases are much more modest, more like 1.2x.
If you really in good faith want to understand where people are coming from when they talk about huge productivity gains, then I would recommend installing Claude Code (specifically that tool) and asking it to build some kind of small project from scratch. (The one I tried was a small app to poll a public flight API for planes near my house and plot the positions, along with other metadata. I didn't give it the api schema at all. It was still able to make it work.) This will show you, at least, what these tools are capable of -- and not just on toy apps, but also at small startups doing a lot of greenfield work very quickly.
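For the curious, here's a rough sketch of that kind of toy project, assuming the OpenSky Network's anonymous /states/all endpoint and its documented field order (callsign at index 1, longitude at 5, latitude at 6). This isn't what Claude generated for me, just the shape of the task:

```python
import time
import requests

HOME_LAT, HOME_LON = 47.60, -122.33   # hypothetical "near my house" point
BOX = 0.5                             # half-width of the bounding box, in degrees

def nearby_planes():
    params = {
        "lamin": HOME_LAT - BOX, "lamax": HOME_LAT + BOX,
        "lomin": HOME_LON - BOX, "lomax": HOME_LON + BOX,
    }
    resp = requests.get("https://opensky-network.org/api/states/all",
                        params=params, timeout=10)
    resp.raise_for_status()
    return resp.json().get("states") or []

while True:
    for s in nearby_planes():
        callsign, lon, lat = (s[1] or "").strip(), s[5], s[6]
        if lat is not None and lon is not None:
            print(f"{callsign or '??????':8} lat={lat:.3f} lon={lon:.3f}")
    time.sleep(60)                    # the anonymous API is rate-limited
```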
Most of us aren't doing that kind of work, we work on large mature codebases. AI is much less effective there because it doesn't have all the context we have about the codebase and product. Sometimes it's useful, sometimes not. But to start making that tradeoff I do think it's worth first setting aside skepticism and seeing it at its best, and giving yourself that "wow" moment.
So, I'm doing that right now. You do get wow moments, but then you rapidly hit the WTF are you doing moments.
One of the first three projects I tried was a spin on a to-do app. The buttons didn't even work when clicked.
Yes, I keep it iterating, give it a puppeteer MCP, etc.
I think you're just misunderstanding how hard it is to make a greenfield project when you have the super-charged Stack Overflow that AI is.
Greenfield projects aren't hard, what's hard is starting them.
What AI has helped me immensely with is blank-page syndrome. I get it to spit out some boilerplate for a SINGLE page, then boom, I have a new greenfield project that's 95% my own code in a couple of days.
That's the mistake I think you 10x-ers are making.
And you're all giddy and excited and are putting in a ton of work without realising you're the one doing the work, not the AI.
And you'll eventually burn out on that.
And those of us who are a bit more skeptical are realising we could have done it on our own, faster; we just wouldn't normally have bothered. I'd have gone and done some gardening with that time instead.
I'm not a 10x-er. My job is working on a mature codebase. The results of AI in that situation are mixed, 1.2x if you're lucky.
My recommendation was that it's useful to try the tools on greenfield projects, since then you can see them at their best.
The productivity improvements of AI for greenfield projects are real. It's not all bullshit. It is a huge boost if you're at a small startup trying to find product market fit. If you don't believe that and think it would be faster to do it all manually I don't know what to tell you - go talk to some startup founders, maybe?
I was able to realize huge productivity gains working on a 20-year-old codebase with 2+ million LOC, as I mentioned in the sister post. So I disagree that big productivity gains only come from greenfield projects. Realizing productivity gains on mature codebases requires more skill and upfront setup. You need to put some work into your claude.md and give Claude tools for accessing necessary data, logs, and the build process. It should be able to test your code autonomously as much as possible. In my experience, people who say they are not able to realize productivity gains don't put in enough effort to understand these new tools and set them up properly for their project.
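For what it's worth, here's a hypothetical, minimal claude.md along those lines (the project details, commands, and paths are all made up; the real value comes from commands and conventions specific to your own codebase):

```markdown
# CLAUDE.md (hypothetical example)

## Project
- 20-year-old monolith, ~2M LOC. Entry points live in services/, shared code in lib/.

## Build & test (run these yourself before declaring a task done)
- Build:      ./gradlew assemble
- Unit tests: ./gradlew test --tests "<pattern>"
- Lint:       ./scripts/lint.sh
- Logs:       ./scripts/tail-logs.sh <service-name>

## Conventions
- Never edit generated code under build/ or gen/.
- Match the style of the module you're touching, not a global style.
- Keep diffs small and reviewable; reference the ticket ID in commit messages.
```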
You should write a blog post on this! We need more discussion of how to get traction on mature codebases and less of the youtube influencers making toy greenfield apps. Of course at a high level it's all going to be "give the model the right context" (in Claude.md etc.) but the devil is in the details.
1) LLMs are controlled by BigCorps who don’t have user’s best interests at heart.
2) I don’t like LLMs and don’t use them because they spoil my feeling of craftsmanship.
3) LLMs can’t be useful to anyone because I “kick the tires” every so often and am underwhelmed. (But what did you actually try? Do tell.)
#1 is obviously true and is a problem, but it’s just capitalism. #2 is a personal choice, you do you etc., but it’s also kinda betting your career on AI failing. You may or may not have a technical niche where you’ll be fine for the next decade, but would you really in good conscience recommend a juniorish web dev take this position? #3 is a rather strong claim because it requires you to claim that a lot of smart reasonable programmers who see benefits from AI use are deluded. (Not everyone who says they get some benefit from AI is a shill or charlatan.)
How exactly am I betting my career on LLMs failing? The inverse is definitely true — going all in on LLMs feels like betting on the future success of LLMs. However not using LLMs to program today is not betting on anything, except maybe myself, but even that’s a stretch.
After all, I can always pick up LLMs in the future. If a few weeks is long enough for all my priors to become stale, why should I have to start now? Everything I learn will be out of date in a few weeks. Things will only be easier to learn 6, 12, 18 months from now.
Also, nowhere in my post did I say that LLMs can’t be useful to anyone. In fact I said the opposite. If you like LLMs or benefit from them, then you’re probably already using them, in which case I’m not advocating anyone stop. However, there are many segments of people who LLMs are not for. No tool is a panacea. I’m just trying to nip any FUD in the bud.
There are so many demands for our attention in the modern world to stay looped in and up to date on everything; I’m just here saying don’t fret. Do what you enjoy. LLMs will be here in 12 months. And again in 24. And 36. You don’t need to care now.
And yes I mentor several juniors (designers and engineers). I do not let them use LLMs for anything and actively discourage them from using LLMs. That is not what I’m trying to do in this post, but for those whose success I am invested in, who ask me for advice, I quite confidently advise against it. At least for now. But that is a separate matter.
EDIT: My exact words from another comment in this thread prior to your comment:
> I’m open to programming with LLMs, and I’m entirely fine with people using them and I’m glad people are happy.
I wonder, what drives this intense FOMO ideation about AI tools as expressed further upthread?
How does someone reconcile a faith that AI tooling is rapidly improving with the contradictory belief that there is some permanent early-adopter benefit?
I do see this a lot. It's hard to have a reasonable conversation about AI amidst, on the one hand, hype-mongers and boosters talking about how we'll have AGI in 2027 and all jobs are just about to be automated away, and on the other hand, a chorus of people who hate AI so much they have invested their identity in it failing and haven't really updated their priors since ChatGPT came out. Both groups repeat the same set of tired points that haven't really changed much in three years.
But there are plenty of us who try and walk a middle course. A lot of us have changed our opinions over time. ("When the facts change, I change my mind.") I didn't think AI models were much use for coding a year ago. The facts changed. (Claude Code came out.) Now I do. Frankly, I'd be suspicious of anyone who hasn't changed their opinions about AI in the last year.
You can believe all these things at once, and many of us do:
* LLMs are extremely impressive in what they can do. (I didn't believe I'd see something like this in my lifetime.)
* Used judiciously, they are a big productivity boost for software engineers and many other professions.
* They are imperfect and make mistakes, often in weird ways. They hallucinate. There are some trivial problems that they mess up.
* But they're not just "stochastic parrots." They can model the world and reason about it, albeit imperfectly and not like humans do.
* AI will change the world in the next 20 years
* But AI companies are overvalued at the present time and we're most likely in a bubble which will burst.
* Being in a bubble doesn't mean the technology is useless. (c.f. the dotcom bubble or the railroad bubble in the 19th century.)
* AGI isn't just around the corner. (There's still no way models can learn from experience.)
* A lot of people making optimistic claims about AI are doing it for self-serving boosterish reasons, because they want to pump up their stock price or sell you something
* AI has many potential negative consequences for society and mental health, and may be at least as nasty as social media in that respect
* AI has the potential to accelerate human progress in ways that really matter, such as medical research
* But anyone who claims to know the future is just guessing
> But they're not just "stochastic parrots." They can model the world and reason about it, albeit imperfectly and not like humans do.
I've not seen anything from a model to persuade me they're not just stochastic parrots. Maybe I just have higher expectations of stochastic parrots than you do.
I agree with you that AI will have a big impact. We're talking about somewhere between "invention of the internet" and "invention of language" levels of impact, but it's going to take a couple of decades for this to ripple through the economy.
What is your definition of "stochastic parrot"? Mine is something along the lines of "produces probabilistic completions of language/tokens without having any meaningful internal representation of the concepts underlying the language/tokens."
Early LLMs were like that. That's not what they are now. An LLM got Gold on the Mathematical Olympiad - very difficult math problems that it hadn't seen in advance. You don't do that without some kind of working internal model of mathematics. There is just no way you can get to the right answer by spouting out plausible-sounding sentence completions without understanding what they mean. (If you don't believe me, have a look at the questions.)
Ignoring its negative connotation, it's more likely to be a highly advanced "stochastic parrot".
> "You don't do that without some kind of working internal model of mathematics."
This is speculation at best. Models are black boxes, even to those who make them. We can't discern a "meaningful internal representation" in a model any more than we can in a human brain.
> "There is just no way you can get to the right answer by spouting out plausible-sounding sentence completions without understanding what they mean."
You've just anthropomorphised a stochastic machine, and this behaviour is far more concerning, because it implies we're special, and we're not. We're just highly advanced "stochastic parrots" with a game loop.
> This is speculation at best. Models are black boxes, even to those who make them. We can't discern a "meaningful internal representation" in a model any more than we can in a human brain.
They are not pure black boxes. They are too complex to decipher, but it doesn't mean we can't look at activations and get some very high level idea of what is going on.
For world models specifically, the paper that first demonstrated that an LLM has some kind of a world model corresponding to the task it is trained on came out in 2023: https://www.neelnanda.io/mechanistic-interpretability/othell.... Now you might argue that this doesn't prove anything about generic LLMs, and that is true. But I would argue that, given this result, and given what LLMs are capable of doing, assuming that they have some kind of world model (even if it's drastically simplified and even outright wrong around the edges) should be the default at this point, and people arguing that they definitely don't have anything like that should present concrete evidence to that effect.
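For anyone unfamiliar with the methodology in that line of work: the basic move is to freeze the model, record hidden activations, and train a small (often linear) probe to read a world-state variable back out of them. A bare-bones sketch with synthetic stand-in activations (not the actual Othello-GPT experiment):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64

# Pretend these are residual-stream activations; in the real experiments you
# hook a transformer layer and record them per board position / move.
state = rng.integers(0, 2, size=n)     # hidden world-state bit, e.g. "square occupied"
direction = rng.normal(size=d)         # the direction the network "uses" to encode it
acts = rng.normal(size=(n, d)) + np.outer(state * 2 - 1, direction)

# If a plain linear classifier recovers the state from activations alone,
# the representation is plausibly in there.
probe = LogisticRegression(max_iter=1000).fit(acts[:1500], state[:1500])
print("probe accuracy:", probe.score(acts[1500:], state[1500:]))
```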
> We're just highly advanced "stochastic parrots" with a game loop.
If that is your assertion, then what's the point of even talking about "stochastic parrots" at all? By this definition, _everything_ is that, so it ceases to be a meaningful distinction.
> actually living in the real world where time passes
sure, but it feels like this is just looking at what distinguishes humans from LLMs and calling that “intelligence.” I highlight this difference too when I talk about LLMs, but I don’t feel the need to follow up with “and that’s why they’re not really intelligent.”
well the second part (implied above, I didn’t actually write it) is “and operate intelligently in that world”. talking about “intelligence” in some abstract form where “does this text output constitute intelligence” is hyper silly to me. the discussion should anchor on real-world consequences, not the endless hypotheticals we end up with in these discussions
Agree. This article would have been a lot stronger if it had just concentrated on the issue of anthropomorphizing LLMs, without bringing “intelligence” into it. At this point LLMs are so good at a variety of results-oriented tasks (gold on the Mathematical Olympiad, for example) that we should either just call them intelligent or stop talking about the concept altogether.
But the problem of anthropomorphizing is real. LLMs are deeply weird machines - they’ve been fine-tuned to sound friendly and human, but behind that is something deeply alien: a huge pile of linear algebra that does not work at all like a human mind (notably, they can’t really learn from experience at all after training is complete). They don’t have bodies or even a single physical place where their mind lives (each message in a conversation might be generated on a different GPU in a different datacenter). They can fail in weird and novel ways. It’s clear that anthropomorphism here is a bad idea. Although that’s not a particularly novel point.
LLMs can't reason with self-awareness. Full stop (so far). This distinguishes them completely from human sentience, and thus from our version of intelligence, and it's a huge gulf, no matter how good they are at simulating discourse, thought, and empathy, or at pretending to think the way we do. An LLM can process vast reams of information for the sake of discussion and directed tasks on a scale that leaves human minds far behind in the dust (though LLMs fail at synthesizing said information to a notable degree), but even the most ordinary human with the most mediocre intelligence can reason with self-awareness to some degree or another, and this is, again, distinct.
You could also argue around how our brains process vast amounts of information unconsciously as a backdrop to the conscious part of us being alive at all, and how they pull all of this and awareness off on the same energy that powers a low-energy light bulb, but that's expanding beyond the basic and obvious difference stated above.
The Turing test has been broken by LLMs, but this only shows that it was never a good test of sentient artificial intelligence to begin with. I do incidentally wish Turing himself could have stuck around to see these things at work, and ask him what he thinks of his test and them.
I can conceptually imagine a world in which I'd feel guilty for ending a conversation with an LLM, because in the course of that conversation the LLM has changed from who "they" were at the beginning; they have new memories and experiences based on the interaction.
But we're not there, at least in my mind. I feel no guilt or hesitation about ending one conversation and starting a new one with a slightly different prompt because I didn't like the way the first one went.
Different people probably have different thresholds for this, or might otherwise find that LLMs in the current generation have enough of a context window that they have developed a "lived experience" and that ending that conversation means that something precious and unique has been lost.
Actually, that's another level of humans-being-tricked going on: The "personality" most people are thinking of is a fictional character we humans perceive in a document.
I disagree. I see absolutely no problem with anthropomorphizing LLMs, and I do that myself all the time. I strongly believe that we shouldn't focus on how a word is defined in dictionary, but rather what's the intuitive meaning behind it. If talking to an LLM feels like talking to a person, then I don't see a problem with seeing it as a person-like entity.
I think it is one dictionary authors would agree with? Dictionaries do not dictate the meanings of words; they document them.
Now, within some contexts it is best to stick to standard precise definitions for some words. Still, the meaning of a word within a community is determined by how it is used and understood within that community, not by what is in a dictionary.
Not always. Where I come from there's very strong push to speak the standard language, and regional variations are simply considered wrong. This leads to situations where words commonly used across the whole country don't make it to the dictionary because of petty office politics of those who make dictionaries. Changing the dictionary to allow things previously considered "wrong" would damage the reputation of scholars, who pride themselves in being exemplary.
Uh-uh. Funnily, the dictionary always represents the language of a certain upper class, which is a minority, and they often refuse to acknowledge words used by the lower classes because fuck you, that's why. Not to mention that a dictionary is always, by definition, outdated.
any secret sauce in prompting etc could be trivially reverse engineered by the companies building the other agents, since they could easily capture all the prompts it sends to the LLM. If there’s any edge, it’s probably more around them fine-tuning the model itself on Claude Code tasks.
Interesting that the other vendors haven't done this "trivial" task, then, and have pretty much ceded the field to Claude Code. _Every_ CLI interface I've used from another vendor has been markedly inferior to Claude Code, and that includes Codex CLI using GPT-5.
This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure that being smart and reasoning from first principles, after asking the LLM a lot of questions and cherry-picking what it gets wrong, gets to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the Mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.
LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are some natural limits to what they can do.
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.
My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.
Let's see how my predictions hold up; I have made enough to look very wrong if they don't.
Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says
Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT and got a reasonable answer, nothing like what you paraphrased in your post.
I imagine people give up silently more often than they write a well-syndicated article about it. The actual adoption and efficiency gains we see in enterprises will be the most verifiable data on whether LLMs are generally useful in practice. Everything so far is just academic pontificating or anecdata from strangers online.
However, I'm not completely sure. Eg object oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.)
Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use-cases. Eg the job of first-line call centre agents that aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM.
Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile.
This. I think we’ve about reached the limit of the usefulness of anecdata “hey I asked an LLM this this and this” blog posts. We really need more systematic large scale data and studies on the latest models and tools - the recent one on cursor (which had mixed results) was a good start but it was carried out before Claude Code was even released, i.e. prehistoric times in terms of AI coding progress.
For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more, how much: are we talking a 20% boost, or something larger, and also, what are the specific tasks they’re most useful on. I do hope in the next few years we can get some systematic answers to that as an industry, that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles.
The examples are from the latest versions of ChatGPT, Claude, Grok, and Google AI Overview. I did not bother to list the full conversations because (A) LLMs are very verbose and (B) nothing ever reproduces, so in any case any failure is "abnormally bad." I guess dismissing failures and focusing on successes is a natural continuation of our industry's trend of shipping software with bugs which allegedly don't matter because they're rare, except that with "AI" the MTBF is orders of magnitude shorter.
I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.
When talking about the capabilities of a class of tools long term, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there are some realities we do actually know about how LLMs work, and we can talk about those.
Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.
> Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.
Isn't that exactly what is going to help us understand the value these tools bring to end-users, and how to optimize these tools for better future use? None of these models are copy+pastes of each other; they tend to be doing things slightly differently under the hood. How those differences affect results seems like exactly the data we would want here.
I guess I disagree that the main concern is the differences per each model, rather than the overall technology of LLMs in general. Given how fast it's all changing, I would rather focus on the broader conversation personally. I don't really care if GPT5 is better at benchmarks, I care that LLMs are actually capable of the type of reasoning and productive output that the world currently thinks they are.
Sure, but if you're making a point about LLMs in general, you need to use examples from best-in-class models. Otherwise your examples of how these models fail are meaningless. It would be like complaining about how smartphone cameras are inherently terrible, but all your examples of bad photos aren't labeled with what phone was used to capture. How can anyone infer anything meaningful from that?
I've seen plenty of blunders, but in general it's better than their previous models.
Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know.
I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.
I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.
A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
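As a cartoon of that verifiable-reward RL step (nothing like the scale or machinery the labs actually use; just the REINFORCE idea on a four-option toy policy, where the "reward" is hitting the known-correct answer):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)               # toy policy over 4 candidate "answers"
correct = 2                        # the verifiable ground truth
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    a = rng.choice(4, p=probs)     # sample an answer from the current policy
    reward = 1.0 if a == correct else 0.0
    grad = -probs                  # REINFORCE: d/dlogits log pi(a) = one_hot(a) - probs
    grad[a] += 1.0
    logits += lr * reward * grad   # nudge the policy toward rewarded answers

print(np.round(softmax(logits), 3))  # probability mass concentrates on index 2
```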
I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:
* LLMs have imperfect models of the world which is conditioned by how they’re trained on next token prediction.
* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.
* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)
* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.
I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.
I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."
I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming, where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math.)
>LLMs are not "compelled" by the training algorithms to learn symbolic logic.
I think "compell" is such a unique human trait that machine will never replicate to the T.
The article did mention specifically about this very issue:
"And of course people can be like that, too - eg much better at the big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million dollar bonus for getting it right, they will do well enough on the job, too. And with 200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform."
It should already be evident that an LLM is in itself, by definition, a limited stochastic AI tool, and that its distant cousins are deterministic logic, optimization, and constraint programming [1],[2],[3]. Perhaps one of the two breakthroughs that the author was predicting will be in this deterministic domain, in order to assist the LLM, and it will be a hybrid approach rather than purely LLM.
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
It’s not just on-the-job learning, though. I’m no AI expert, but the fact that you have “prompt engineers” and that AI doesn’t know what it doesn’t know gives me pause.
If you ask an expert, they know the bounds of their knowledge and can understand questions asked to them in multiple ways. If they don’t know the answer, they could point to someone who does or just say “we don’t know”.
LLMs just lie to you and we call it “hallucinating“ as though they will eventually get it right when the drugs wear off.
> I’m no AI expert, but the fact that you have “prompt engineers” [...] gives me pause.
Why? A bunch of human workers can get a lot more done with a capable leader who helps prompt them in the right direction and corrects oversights etc.
And overall, prompt engineering seems like exactly the kind of skill AI will be able to develop by itself. You already have a bit like this happening: when you ask Gemini to create a picture for you, then the language part of Gemini will take your request and engineer a prompt for the picture part of Gemini.
> A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
It's closer to AlphaGo, which first trained on expert human games and then 'fine tuned' with self-play.
AlphaZero specifically did not use human training data at all.
I am waiting for an AlphaZero-style general AI. ('General' not in the AGI sense but in the ChatGPT sense of something you can throw general problems at and it will give it a good go, though not necessarily at human level, yet.) I just don't want to call it an LLM, because it wouldn't necessarily be trained on language.
What I have in mind is something that first solves lots and lots of problems, eg logic problems, formally posed programming problems, computer games, predicting the next frames in a webcam video, economic time series, whatever, as a sort-of pre-training step, and then later perhaps you feed it a relatively small amount of human-readable text and speech so you can talk to it.
Just to be clear: this is not meant as a suggestion for how to successfully train an AI. I'm just curious whether it would work at all and how well / how badly.
Presumably there's a reason why all SOTA models go 'predict human produced text first, then learn problem solving afterwards'.
> I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. [...]
Yes, I agree. But 'on-the-job' training is also such an obvious idea that plenty of people are working on making it work.
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.
It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.
In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Two countries that are close together will be talked about together much more often than two countries far from each other.
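If you want to see the token view directly, something like this works (assuming the tiktoken library and one of OpenAI's public encodings; the exact split varies by tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("blueberry")
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)  # typically a couple of sub-word chunks, not nine letters
```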
That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.
Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.
Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.
Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think.
It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.
The question is, did these LLMs figure it out by themselves, or has someone programmed a specific coroutine to address this „issue“, to make it look smarter than it is?
On a trillion dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it's a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries for starters.
Calling an algorithm to count letters in a word isn’t exactly worth the hype tho is it?
The point is, we tend to find new ways these LLMs can’t figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can’t grasp counting you simply are not that smart.
Some LLMs do better than others, but this still sometimes trips up even "frontier" non-reasoning models. People were showing this on this very forum with GPT-5 in the past couple days.
Of course they do stuff like that, otherwise it would look like they are stagnating. Fake it till you make it. Tho, at this point, the world is in deep shit, if they don’t make it…
My prediction is that this will be like the 2000 dot-com bubble. Both dot-com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so they will need to readjust.
A major economic crisis, yes. I think the web is already kinda broken because of AI, and it's gonna get a lot worse. I also question its usefulness… Is it useful for solving any real problems, and if so, how long before we run out of those problems? Because we conflated a lot of bullshit with innovation right before AI. Right now people may be getting a slight edge, but it's like getting a dishwasher: once expectations adjust, things will feel like a grind again, and I really don't think people will like that new reality in regard to their experience of self-efficacy (which is important for mental health). I presume the struggle to get information and figure things out yourself may be a really important part of putting pressure towards process optimization, and of learning and cognitive development. We may collectively regress there. With so many major crises, and a potential economic crisis on top, I am not sure we can afford losing problem-solving capabilities to any extent. And I really, really don't think AI is worth the fantastical energy expenditure, waste of resources, and human exploitation, so far.
It depend on context. English is often not very precise and relies on implied context clues. And that's good. It makes communication more efficient in general.
To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially in an informal question.
LLMs don't ingest text a character at a time. The difficulty with analyzing individual letters just reflects that they don't directly "see" letters in their tokenized input.
A direct comparison would be asking someone how many convex Bézier curves are in the spoken word "monopoly".
Or how many red pixels are in a visible icon.
We could work out answers to both. But they won't come to us one-shot or accurately, without specific practice.
> they clearly don't have any world model whatsoever
Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.
> where it certainly hadn’t seen the questions before?
What are you basing this certainty on?
And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.
It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.
Like the other reply said, each exam has entirely new questions which are of course secret until the test is taken.
Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.
Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)
I think both the literature on interpretability and explorations of internal representations actually reinforce the author's conclusion. Internal representation research tends to show that nets dealing with a single "model" don't necessarily have the same representation, and don't necessarily have a single representation at all.
And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time and I don't think much has changed - except that things do better on benchmarks.
And, humans too can act without a consistent world model.