The data doesn't really support the claim that FP is best. Elixir tops the table at 97.5%, but C# (88.4%) is OOP and scores almost identically to Racket (88.9%), and Ruby (81.0%) and Java (80.9%) both outscore Scala (78.4%), which is explicitly functional. If FP were the driver, Scala should beat those languages, but it doesn't.
It's tempting to argue that a more constrained language helps, but Rust (62.8%) vs Elixir (97.5%) is an interesting data point here. Both are highly constrained, but in different directions. Elixir's constraints narrow the solution space: you can't mutate, you can't use loops, and you must pattern match, so every constraint eliminates options and funnels the LLM toward a smaller set of valid solutions to search through. Rust's constraints work differently: the borrow checker doesn't eliminate approaches, it adds a second axis of correctness the LLM has to satisfy simultaneously, on top of solving the actual problem.
Overall, it seems like languages with strong conventions and ecosystems that narrow the solution space beat languages where there's a thousand ways to do something. Elixir has one build tool, one formatter, one way to do things. C#, Kotlin, and Java have strong ceremony and convention that effectively narrow how you write a program. Meanwhile JS, Python, PHP, and Perl offer endless choices, fragmented ecosystems, and rapidly shifting idioms, and they cluster at the bottom of the table.
Scala is explicitly multiparadigm and offers a lot of advanced OOP features. It also had a Python-like (though reportedly better handled) 2 -> 3 transition, which deprecated some things, removed others, and added a bunch of new ones. Scala has always been complex, and right now it's also chaotic. It's a wonder the models can get that high a score with it, honestly.
Racket is a similarly large PL, with many abstractions built on the metaprogramming primitives it offers. Without looking at the generated code, it's hard to say anything, but I suspect the high score despite that might be because of the Scheme core of Racket: `racket/base` is a much smaller language than `racket`, so if the LLMs keep to it, it might narrow the solution space enough to show different results.
In general, I think you're half-right: the "solution space" size is a factor, but so is its shape, i.e. which features specifically are offered and how they interact. A compact and cohesive language design should yield better results than a merely reduced surface area. C is not a huge language, but the features it offers don't lend themselves much to writing correct code. Elixir is both relatively small and strongly steers a programmer toward safer idioms. Racket is big, but the advanced features are opt-in, while the baseline (immutable bindings, pure functions, expressive contracts) is similar to Elixir. Python is both huge and complex; "there's one obvious way to do it" has always been a bit of a joke. Rust is incredibly complex; the idea is that the tooling should let you handle that complexity easily, but that requires agents, and one-shotting solutions there won't work as well.
Seems plausible. I used to refer to StackOverflow before LLMs, and a good amount of the examples there were flawed code presented as working. If the LLM has less junk in its training data, it might benefit even though the volume of training data for that language is lower.
If we assume that the amount of training data matters at least a bit (which is a very reasonable assumption), I wouldn’t immediately discard the functional hypothesis. Scala’s score is almost equal to Java’s even though there’s probably something like two orders of magnitude less Scala than Java code in the wild. Similarly with C# and Racket.
Yep, I think you can reasonably argue that immutability + strong conventions are the most important dimensions (as opposed to FP vs. OOP, as much as I like FP and dislike OOP).
I took that to mean ≈ "Amount of training data isn't the big factor dwarfing all else." Depends who "we" refers to, I guess. Back when LLM-generated code was new, I definitely saw predictions that LLMs would struggle with niche or rarely used languages. These days, consensus among colleagues within earshot is that LLMs handle Rust much better than Python or C++ (corpus size and AutoCodeBench scores notwithstanding).
> 3. Storing it the way this article presents makes it usable for agents, but not humans. Whereas the point of knowledge graph, ontology, etc is to create the same layer for both humans and AI to interact with
If storing it this way makes it usable for agents, then why don't humans just use agents when they need to interact with it?
Let's say I want to know who my largest customer is, both by order value and by volume. I could either:
1. Prompt my agent and deal with writing the prompt, waiting for the agent to sift through all the data (which would be massive), and paying the token costs, all of which has to be repeated every time I want to answer this question, OR
2. Check my ontology for the answer, probably in a dashboard, which takes 5 seconds. I have a link I can freely share around my enterprise and I haven't spent token costs.
What's more, when I have sent my agent out on some task (go find out what revenue we're leaving on the table by not selling spot contracts to our biggest customers), my ontology gives me a few bits of data to validate the agent's work against. For humans and AI to work together, they need the same context layer.
Dario has made a specific cohort argument here. His numbers (from various interviews) are: you train a model in 2023 for $100M, deploy it, and it earns $200M over its lifetime. Meanwhile you train the 2024 model for $1B, which goes on to earn $2B. Each vintage returns 2x on its training cost.
However, the GAAP P&L tells the opposite story. You book $200M revenue in the same year you spend $1B training the next model, so you report an $800M loss. Next year you book $2B against $10B in training spend, reporting an $8B loss. The business looks like it's dying when every individual model generation actually generates a healthy profit.
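A toy sketch of that cohort-vs-GAAP arithmetic, taking Dario's illustrative numbers at face value (the 10x-per-vintage scaling and 2x lifetime return are his hypotheticals, not real financials):

```python
# Toy model of the cohort argument: vintage n costs 10x vintage n-1
# and earns back 2x its own training cost over its lifetime.
vintages = [2023, 2024, 2025]
train_cost = {y: 100e6 * 10 ** (y - 2023) for y in vintages}  # $100M, $1B, $10B
revenue = {y: 2 * train_cost[y] for y in vintages}            # $200M, $2B, $20B

for y in vintages:
    cohort_profit = revenue[y] - train_cost[y]  # always +1x its cost: each vintage "works"
    # GAAP-ish view: book this vintage's revenue in the same period you
    # expense the NEXT vintage's (10x larger) training run.
    reported = revenue[y] - train_cost.get(y + 1, 0)
    print(f"{y}: cohort profit ${cohort_profit / 1e6:,.0f}M, "
          f"reported P&L ${reported / 1e6:,.0f}M")

# 2023: cohort profit $100M, reported P&L $-800M
# 2024: cohort profit $1,000M, reported P&L $-8,000M
# 2025: cohort profit $10,000M, reported P&L $20,000M (no successor expensed in this toy)
```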
That's actually Dario's answer to your depreciation question. If each cohort earns back its training cost within its natural lifespan (however short that lifespan is), the depreciation schedule is already baked in. The model doesn't need to live forever, it just needs to return more than it cost before the next one replaces it. Whether that's actually happening at Anthropic is a different question, and one we can't answer without audited financials, but it's the claim Dario makes (and seems entirely reasonable from a distance).
GAAP doesn't really work here. The R&D treadmill means you're always betting on next year, and it's NOT inventory or something you can defer your cost on. It's an upfront R&D expense.
So what happens in year 10 when Anthropic spends $10B on a training run and it only returns $8B? They're cooked.
If those numbers are correct, then my assertion that "Almost certainly, any reasonable depreciation schedule of the cost of training will result in leading labs being presently wildly unprofitable." is incorrect.
And I admit that I made that assertion from my gut without actually knowing if it's true or not.
If you have to continually spend greater amounts of money to keep up with the competition on every new model then it is dying.
Every single time a company comes around and goes "Actually, GAAP is wrong, look at my new math that says we're good," it's led to much wailing and gnashing of teeth down the line when it inevitably isn't.
That's an interesting idea. I'm curious, though, are there any other industries and/or companies that have tried to pull this sort of thing off? And what ultimately happened to them?
Enron had a system like this. They regularly worked on large, long-term contracts that became profitable over years or decades. They wanted to pull the profits forward, so they would estimate the total value of the contract and book the profit when it closed. Mark-to-market accounting wasn't unheard of at the time, but using it for assets without an active market was unique. Without a market to mark against, the numbers were best-guess projections.
The problem is that everyone along the line is incentivized to be aggressive with the estimates (commissions for sales are bigger, public financials look better) and discouraged from correcting them when they go wrong.
Estimating multi-year returns on frontier models looks harder than estimating returns on oil and gas projects in the 90s.
Why would anyone use the $200M model when the $1B model is available? The company increases its bet with each iteration, increasing its risk. It blows up at some point, because a $2B return on the $1B investment can't be guaranteed.
On the GAAP point: $200M or $1B or $10B is not a loss but cash converted into an asset. It won't affect the bottom line at all, unless the company revalues the asset and says it's now worth $1M instead of $200M. That would hit the bottom line.
He says "You paid $100 million and then it made $200 million of revenue. There's some cost to inference with the model, but let's just assume in this cartoonish cartoon example that even if you add those two up, you're kind of in a good state. So, if every model was a company, the model is actually, in this example is actually profitable. What's going on is that at the same time"
Importantly, you'll notice that he's talking revenue, and assumes that inference is cheap enough/profitable enough that $100M + Inference_Over_Lifetime < $200M.
Based on the docs and API surface, I think the filesystem abstraction is probably copy-on-mount backed by object storage.
I suspect it works as follows: when a task starts, filesystem contents sync down from S3/R2/GCS to a local directory, which gets bind-mounted into the container. The agent reads and writes normally - no FUSE, no network round-trips per file op. On task completion or explicit sync, changes flush back to object storage. The presigned URL support for upload/download is the giveaway that object storage is the source of truth.
This makes way more sense than FUSE for agent workloads. Agents do thousands of small reads (find, grep, git status) that would each be a network call with FUSE. With copy-on-mount it's all local disk speed after initial sync.
Cross-task sharing falls out naturally - two tasks mounting the same filesystem ID just means two containers syncing from the same S3 prefix. Probably last-write-wins rather than distributed locking, which is fine since agents rarely have concurrent writes to the same file.
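If that guess is right, the whole lifecycle could be as simple as this sketch (bucket name, paths, and the sync-on-completion trigger are all my assumptions; `aws s3 sync` is just a stand-in for whatever transfer mechanism they actually use):

```python
import subprocess

def run_task(fs_id: str, task_cmd: list[str], workdir: str = "/mnt/task") -> None:
    """Hypothetical copy-on-mount lifecycle: sync down, work locally, sync back."""
    prefix = f"s3://agent-filesystems/{fs_id}/"  # object storage as source of truth

    # 1. Pull the filesystem contents down to local disk (paid once, up front).
    subprocess.run(["aws", "s3", "sync", prefix, workdir], check=True)

    # 2. The agent works against plain local disk: find/grep/git status are
    #    ordinary file ops, no network round-trip per call.
    subprocess.run(task_cmd, cwd=workdir, check=True)

    # 3. Flush changes back on completion. Two tasks sharing fs_id just sync
    #    against the same prefix; last write wins.
    subprocess.run(["aws", "s3", "sync", workdir, prefix], check=True)
```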
It strikes me there's more low-hanging fruit to pluck re: context window management. Backtracking seems like another promising direction for avoiding context bloat and compaction (i.e. when a model takes a few attempts to do the right thing, once it's done the right thing, prune the failed attempts out of the context).
Agree. I’d like more fine-grained control of context and compaction. If you spend time debugging in the middle of a session, once you’ve fixed the bugs you ought to be able to remove everything related to fixing them from the context and continue as you were before you encountered them. (Right now, depending on your IDE, this can be quite annoying to do manually. And I’m not aware of any that let you snip it out if you’ve worked with the agent on other tasks afterwards.)
I think agents should manage their own context too. For example, if you’re working with a tool that dumps a lot of logged information into context, those logs should get pruned out after one or two more prompts.
Context should be thought of as something that can be freely manipulated, rather than as a stack where things can only be appended to or removed from the end.
Yeah, the fact that we’ve treated context as immutable baffles me; it’s not like humans’ working memory keeps a perfect history of everything they’ve done over the last hour. It shouldn’t be that complicated to train a secondary model that just runs online compaction. E.g. it runs a tool call, the model determines what’s germane to the conversation and prunes the rest; or some task gets completed, so just leave a stub in the context that says "completed x", with a tool available to see the details of x if it becomes relevant again.
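A sketch of that stub-and-expand mechanic, assuming you control the message list sent back to the model (names and message shape are hypothetical):

```python
import uuid

archive: dict[str, str] = {}  # full tool outputs, keyed by stub id

def compact(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace older tool outputs with one-line stubs; expand_stub below
    lets the model pull the details back in if they become relevant."""
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < len(messages) - keep_last:
            stub_id = uuid.uuid4().hex[:8]
            archive[stub_id] = msg["content"]
            msg = {"role": "tool",
                   "content": f"[completed; details archived as {stub_id}]"}
        out.append(msg)
    return out

def expand_stub(stub_id: str) -> str:
    """Tool exposed to the model: retrieve an archived output by id."""
    return archive.get(stub_id, "unknown stub id")
```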
That's pretty much the approach we took with context-mode. Tool outputs get processed in a sandbox, only a stub summary comes back into context, and the full details stay in a searchable FTS5 index the model can query on demand. Not trained into the model itself, but gets you most of the way there as a plugin today.
This is a partial realization of the idea, but for a long-running agent the proportion of noise increases linearly with session length; unless you take an appropriately large machete to the problem, you’re still going to wind up with suboptimal results.
Yeah, I'd definitely like to be able to edit my context a lot more. Once you consider that, you start seeing things in your head like "select this big chunk of context and ask the model to simplify that part", or fixing the model having ingested too many tokens because it dumped in a whole file it didn't realize would be that large. There's about a half-dozen things like that that are immediately, obviously useful.
Oh that's quite a nice idea - agentic context management (riffing on agentic memory management).
There are some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" can be expressed concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context; the downside is it'll bust the prompt cache (perhaps the optimal trade-off is to batch the snips, as in the sketch below).
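A sketch of applying a batch of such snips, with the start/end markers represented as plain substrings (just one possible encoding):

```python
def apply_snips(context: str, snips: list[tuple[str, str]]) -> str:
    """Remove each chunk delimited by a (start_marker, end_marker) pair.
    Applying the snips as one batch busts the prompt cache once, not
    once per snip."""
    for start, end in snips:
        lo = context.find(start)
        hi = context.find(end, lo)
        if lo == -1 or hi == -1:
            continue  # marker not found; leave the context untouched
        context = context[:lo] + context[hi + len(end):]
    return context

# e.g. prune a resolved debugging detour out of the transcript
transcript = (
    "user: fix the off-by-one\n"
    "$ pytest -x ...long failing traceback...\n"
    "All 42 tests passed\n"
    "assistant: done, the fix was in range()\n"
)
print(apply_snips(transcript, [("$ pytest -x", "All 42 tests passed")]))
```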
So I built that in my chat harness. I just gave the agent a “prune” tool and it can remove shit it doesn’t need any more from its own context. But chat is last gen.
Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.
> I think agents should manage their own context too.
My intuition is that this should be almost trivial. If I copy/paste your long coding session into an LLM and ask it which parts can be removed from context without losing much, I'm confident that it will know to remove the debugging bits.
I generally do this when the agent gets stuck in a test loop or whatever after I’ve injected some later requirement and tweaked. Once I hit a decent place, I have the agent summarize, discard the branch (it’s part of the context too!) and start with the new prompt.
> For example, if you’re working with a tool that dumps a lot of logged information into context
I've set up a hook that blocks directly running certain common tools and instead tells Claude to pipe the output to a temporary file and search that for relevant info. There's still some noise where it tries to run the tool once, gets blocked, then runs it the right way. But it's better than before.
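Something like this, as a sketch (assuming Claude Code's PreToolUse hook contract: tool call JSON on stdin, exit code 2 to block with stderr fed back to the model; the command list is just an example):

```python
#!/usr/bin/env python3
# PreToolUse hook: block noisy commands run bare, and tell Claude to pipe
# the output to a file and search that instead.
import json
import sys

NOISY = ("npm test", "pytest", "cargo build")  # tune to your workflow

data = json.load(sys.stdin)
cmd = data.get("tool_input", {}).get("command", "")

if any(cmd.strip().startswith(n) for n in NOISY) and ">" not in cmd:
    print(f"Run `{cmd} > /tmp/out.log 2>&1` and grep /tmp/out.log for the "
          "relevant lines instead of dumping everything into context.",
          file=sys.stderr)
    sys.exit(2)  # exit code 2 blocks the tool call

sys.exit(0)  # allow everything else
```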
I think telling it to run those in a subagent should accomplish the same thing and ensure only the answer makes it to the main context. Otherwise you will still have some bloat from reading the exact output, although in some cases that could be good if you’re debugging or something
Not really because it reliably greps or searches the file for relevant info. So far I haven't seen it ever load the whole file. It might be more efficient for the main thread to have a subagent do it but probably at a significant slowdown penalty when all I'm doing is linting or running tests. So this is probably a judgement call depending on the situation.
That's exactly what context-mode does for tool outputs. Instead of dumping raw logs and snapshots into context, it runs them in a sandbox and only returns a summary. The full data stays in a local FTS5 index so you can search it later when you need specifics.
What I want is for the agent to initially get the full data and make the right decision based on it; later on, it doesn't need to know as much about how it got there.
Isn't that how thinking works? Intermediate tokens that then get replaced with the result?
I think something kinda easy for that could be to pretend the pruned output was actually produced by a subagent: copy the detailed logs out and replace them with a compacted summary.
Treat context like git shas. Yes, there is a specific order within a 'branch' but you should be able to do the equivalent of cherry-picking and rebasing it
I do this with my agents. Basically, every "work" oriented call spawns a subprocess which does not add anything to the parent context window. When the subprocess completes the task, I ask it to 1) provide a complete answer, 2) provide a succinct explanation of how the answer was arrived at, 3) provide a succinct explanation of any attempts which did not work, and 4) note anything learned during the process which may be useful in the future. Then I feed those four answers back to the parent as if they were magically arrived at.

Another thing I do for managing the context window: any tool/MCP call has its output piped into a file. The LLM can then read only parts of the file and add just those to its context if they're sufficient. For example, if some command produces a lot of output and ultimately ends in "Success!", the LLM can just tail the last line to see if it succeeded. If it did, the rest of the output doesn't need to be read; if it failed, the failure message is usually at the end of the log.

Something I'm working on now is having a smaller local model summarize the log output and feed that summarization to the more powerful LLM (because I can run my local model for ~free, but it is nowhere near as capable as the cloud models). I don't keep up with SOTA so I have no idea if what I'm doing is well known or not, but it works for me and my setup.
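The pipe-to-file part of that is simple enough to sketch (paths and the tail size are just illustrative choices):

```python
import subprocess

def run_logged(cmd: list[str], log_path: str = "/tmp/tool.log") -> str:
    """Run a command with all output captured to a file; hand the model
    only the tail. It can grep or read the file further if needed."""
    with open(log_path, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    with open(log_path) as log:
        tail = log.readlines()[-5:]  # "Success!" or the failure usually sits here
    return "".join(tail)

# The model sees only the last few lines of a long test run.
print(run_logged(["pytest", "-q"]))
```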
It feels like the late 1990s all over again, but instead of HTML and SQL, it’s coding agents. This time around, a lot of us are well experienced at software engineering, so we can find optimizations simply by using Claude Code all day long. We get an idea, we work with AI to help create a detailed design, and then let it develop it for us.
Totally agree. Failed attempts are just noise once the right path is found. Auto-detecting retry patterns and pruning them down to the final working version feels very doable, especially for clear cases like lint or compilation fixes.
Maybe the right answer is “why not both”, but subagents can also be used for that problem. That is, when something isn’t going as expected, fork a subagent to solve the problem and return with the answer.
It’s interesting to imagine a single model deciding to wipe its own memory though, and roll back in time to a past version of itself (only, with the answer to a vexing problem)
I forget where now but I'm sure I read an article from one of the coding harness companies talking about how they'd done just that. Effectively it could pass a note to its past self saying "Path X doesn't work", and otherwise reset the context to any previous point.
I could see this working like some sort of undo tree, with multiple branches you can jump back and forth between.
Here's a concrete example of what composition looks like in practice.
Say your team has an internal `infractl` CLI for managing your deploy infrastructure. No LLM has ever seen it in training data. You add `--mtp-describe` (one function call with any of the SDKs), then open Claude Code and type:
> !mtpcli
> How do I use infractl?
The first line runs `mtpcli`, which prints instructions teaching the LLM the `--mtp-describe` convention: how to discover tools, how schemas map to CLI invocations, how to compose with pipes. The second line causes the LLM to run `infractl --mtp-describe`, get back the full schema, and understand a tool it has never seen in training data. Now you say:
> Write a crontab entry that posts unhealthy pods to the #ops Slack channel every 5 minutes
And it composes your custom CLI with a third-party MCP server it's never touched before: your tool, a Slack MCP server, and `jq`, in a pipeline the LLM wrote because it could discover every piece. That script can run in CI, or on a Raspberry Pi. No tokens burned, no inference round-trips. The composition primitives have been here for 50 years. Bash is all you need!
Looks like another Claude App/Cowork-type competitor with slightly different tradeoffs (Cowork just calls Claude Code in a VM, this just calls Codex CLI with OS sandboxing).
Here's the Codex tech stack in case anyone was interested like me.
It's a smart move – while Codex has the same aspirations, limiting it to savvy power users will likely lead to better feedback, and less catastrophic misuse.
Meaning Sentry exposes an MCP layer with a tool-call layer and tool registry; in this case, the layer is provided by Sentry. Native would mean that calling specific Sentry APIs is provided as a specific integration path depending on the context. At least that's how I categorize it.
I'm so confused. Sentry is a native client crash reporting tool. What does this have to do with MCP or the LLM itself? Do you mean when interpreting the crash data?
But AI companies would want more engineers generating more tokens.
In this case, Anthropic would want 100 SWEs generating $100,000/mo of revenue. Replacing the very headcount that is responsible for token usage would hurt company growth.
Not to mention what happens when companies start doing this recursively. 100 SWE -> 10 SWE, 100 slack/gmail/notion/etc. subscriptions become 10.
> The simple evidence for this is that everyone who has invested the same resources in AI has produced roughly the same result.
I think this conflates together a lot of different types of AI investment - the application layer vs the model layer vs the cloud layer vs the chip layer.
It's entirely possible that it's hard to generate an economic profit at the model layer, but that doesn't mean that there can't be great returns from the other layers (and a lot of VC money is focused on the application layer).
> In python, ..., calling shell commands or other OS processes requires fiddling with the subprocess module, writing ad-hoc streaming loops, etc - don't even start with piping several commands together.
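For the record, here's what even the simplest two-command pipe looks like with the stdlib (this is essentially the recipe from the subprocess docs; the commands are just examples):

```python
import subprocess

# Equivalent of: ps aux | grep python
ps = subprocess.Popen(["ps", "aux"], stdout=subprocess.PIPE)
grep = subprocess.Popen(["grep", "python"], stdin=ps.stdout,
                        stdout=subprocess.PIPE, text=True)
ps.stdout.close()  # let ps receive SIGPIPE if grep exits first
output, _ = grep.communicate()
print(output)
```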