The data doesn't really support the claim that FP is best. Elixir tops the table at 97.5%, but C# (88.4%) is OOP and scores almost identically to Racket (88.9%), and Ruby (81.0%) and Java (80.9%) both outscore Scala (78.4%), which is explicitly functional. If FP were the driver, Scala should beat those languages, but it doesn't.
It's tempting to argue that a more constrained language helps, but Rust (62.8%) vs Elixir (97.5%) is an interesting data point here. Both are highly constrained, but in different directions. Elixir's constraints narrow the solution space: you can't mutate, you can't use loops, and you must pattern match, so every constraint eliminates options and funnels the LLM toward a smaller set of valid solutions to search through. Rust's constraints work differently: the borrow checker doesn't eliminate approaches, it adds a second axis of correctness the LLM has to satisfy simultaneously, on top of solving the actual problem.
Overall, it seems like languages with strong conventions and ecosystems that narrow the solution space beat languages where there's a thousand ways to do something. Elixir has one build tool, one formatter, one way to do things. C#, Kotlin, and Java have strong ceremony and convention that effectively narrow how you write a program. Meanwhile JS, Python, PHP, and Perl offer endless choices, fragmented ecosystems, and rapidly shifting idioms, and they cluster at the bottom of the table.
Scala is explicitly multiparadigm and offers a lot of advanced OOP features. It also had a Python-like (though reportedly better handled) 2 -> 3 transition, which deprecated some things, removed others, and added a bunch of new ones. Scala has always been complex, and right now it's also chaotic. It's a wonder the models can get that high a score with it, honestly.
Racket is a similarly large PL, with many abstractions built on the metaprogramming primitives it offers. Without looking at the generated code, it's hard to say anything, but I suspect the high score despite that might be because of the Scheme core of Racket: `racket/base` is a much smaller language than `racket`, so if the LLMs keep to it, it might narrow the solution space enough to show different results.
In general, I think you're half-right: the "solution space" size is a factor, but so is its shape, i.e. which features specifically are offered and how they interact. A compact and cohesive language design should yield better results than a merely reduced surface area. C is not a huge language, but the features it offers don't lend themselves much to writing correct code. Elixir is both relatively small and strongly steers a programmer toward safer idioms. Racket is big, but the advanced features are opt-in, while the baseline (immutable bindings, pure functions, expressive contracts) is similar to Elixir. Python is both huge and complex; "there's one obvious way to do it" has always been a bit of a joke. Rust is incredibly complex; the idea is that the tooling should let you handle that complexity easily, but that requires agents, and one-shotting solutions there won't work as well.
Seems plausible. I used to refer to StackOverflow before LLMs, and a good amount of the examples there were flawed code presented as working. If the LLM has less junk in its training data, it might benefit even though the volume of training data for that language is lower.
If we assume that the amount of training data matters at least a bit (which is a very reasonable assumption), I wouldn’t immediately discard the functional hypothesis. Scala’s score is almost equal to Java’s even though there’s probably something like two orders of magnitude less Scala than Java code in the wild. Similarly with C# and Racket.
Yep, I think you can reasonably argue that immutability + strong conventions are the most important dimensions (as opposed to FP vs. OOP, as much as I like FP and dislike OOP).
I took that to mean ≈ "Amount of training data isn't the big factor dwarfing all else." Depends who "we" refers to, I guess. Back when LLM-generated code was new, I definitely saw predictions that LLMs would struggle with niche or rarely used languages. These days, consensus among colleagues within earshot is that LLMs handle Rust much better than Python or C++ (corpus size and AutoCodeBench scores notwithstanding).
> 3. Storing it the way this article presents makes it usable for agents, but not humans. Whereas the point of knowledge graph, ontology, etc is to create the same layer for both humans and AI to interact with
If storing it this way makes it usable for agents, then why don't humans just use agents when they need to interact with it?
Let's say I want to know who my largest customer is, both by order value and by volume. I could either:
1. Prompt my agent and deal with writing the prompt, waiting for the agent to sift through all the data (which would be massive), and paying the token costs, all of which has to be repeated every time I want to answer this question, OR
2. Check my ontology for the answer, probably in a dashboard, which takes 5 seconds. I have a link I can freely share around my enterprise and I haven't spent token costs.
What's more, when I have sent my agent out on some task (go find out what revenue we're leaving on the table by not selling spot contracts to our biggest customers), my ontology gives me a few bits of data to validate the agent's work against. For humans and AI to work together, they need the same context layer.
Dario has made a specific cohort argument here. His numbers (from various interviews) are: you train a model in 2023 for $100M, deploy it, and it earns $200M over its lifetime. Meanwhile you train the 2024 model for $1B, which goes on to earn $2B. Each vintage returns 2x on its training cost.
However, the GAAP P&L tells the opposite story. You book $200M revenue in the same year you spend $1B training the next model, so you report an $800M loss. Next year you book $2B against $10B in training spend, reporting an $8B loss. The business looks like it's dying when every individual model generation actually generates a healthy profit.
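A toy sketch of that cohort-vs-GAAP arithmetic, taking Dario's illustrative numbers at face value (the 10x-per-vintage scaling and 2x lifetime return are his hypotheticals, not real financials):

```python
# Toy model of the cohort argument: vintage n costs 10x vintage n-1
# and earns back 2x its own training cost over its lifetime.
vintages = [2023, 2024, 2025]
train_cost = {y: 100e6 * 10 ** (y - 2023) for y in vintages}  # $100M, $1B, $10B
revenue = {y: 2 * train_cost[y] for y in vintages}            # $200M, $2B, $20B

for y in vintages:
    cohort_profit = revenue[y] - train_cost[y]  # always +1x its cost: each vintage "works"
    # GAAP-ish view: book this vintage's revenue in the same period you
    # expense the NEXT vintage's (10x larger) training run.
    reported = revenue[y] - train_cost.get(y + 1, 0)
    print(f"{y}: cohort profit ${cohort_profit / 1e6:,.0f}M, "
          f"reported P&L ${reported / 1e6:,.0f}M")

# 2023: cohort profit $100M, reported P&L $-800M
# 2024: cohort profit $1,000M, reported P&L $-8,000M
# 2025: cohort profit $10,000M, reported P&L $20,000M (no successor expensed in this toy)
```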
That's actually Dario's answer to your depreciation question. If each cohort earns back its training cost within its natural lifespan (however short that lifespan is), the depreciation schedule is already baked in. The model doesn't need to live forever, it just needs to return more than it cost before the next one replaces it. Whether that's actually happening at Anthropic is a different question, and one we can't answer without audited financials, but it's the claim Dario makes (and seems entirely reasonable from a distance).
GAAP doesn't really work here. The R&D treadmill means you're always betting on next year, and it's NOT inventory or something you can defer your cost on. It's an upfront R&D expense.
So what happens in year 10 when Anthropic spends $10B on a training run and it only returns $8B? They're cooked.
If those numbers are correct, then my assertion that "Almost certainly, any reasonable depreciation schedule of the cost of training will result in leading labs being presently wildly unprofitable." is incorrect.
And I admit that I made that assertion from my gut without actually knowing if it's true or not.
If you have to continually spend greater amounts of money to keep up with the competition on every new model then it is dying.
Every single time a company comes around and goes "Actually, GAAP is wrong, look at my new math that says we're good," it's led to much wailing and gnashing of teeth down the line when it inevitably isn't.
That's an interesting idea. I'm curious, though, are there any other industries and/or companies that have tried to pull this sort of thing off? And what ultimately happened to them?
Enron had a system like this. They regularly worked on large, long-term contracts that became profitable over years or decades. They wanted to pull the profits forward, so they would estimate the total value of the contract and book the profit when it closed. Mark-to-market accounting wasn't unheard of at the time, but using it for assets without an active market was unique. Without a market to mark against, the numbers were best-guess projections.
The problem is that everyone along the line is incentivized to be aggressive with the estimates (commissions for sales are bigger, public financials look better) and discouraged from correcting them when they go wrong.
Estimating multi-year returns on frontier models looks harder than estimating returns on oil and gas projects in the 90s.
Why would anyone use the $200M model when the $1B model is available? The company increases its bet with each iteration, increasing its risk. It blows up at some point, because a $2B return on the $1B investment can't be guaranteed.
On the GAAP point: $200M or $1B or $10B is not a loss but cash converted into an asset. It won't affect the bottom line at all, unless the company revalues the asset and says it's now worth $1M instead of $200M. That would hit the bottom line.
He says "You paid $100 million and then it made $200 million of revenue. There's some cost to inference with the model, but let's just assume in this cartoonish cartoon example that even if you add those two up, you're kind of in a good state. So, if every model was a company, the model is actually, in this example is actually profitable. What's going on is that at the same time"
Importantly, you'll notice that he's talking revenue, and assumes that inference is cheap enough/profitable enough that $100M + Inference_Over_Lifetime < $200M.
Based on the docs and API surface, I think the filesystem abstraction is probably copy-on-mount backed by object storage.
I suspect it works as follows: when a task starts, filesystem contents sync down from S3/R2/GCS to a local directory, which gets bind-mounted into the container. The agent reads and writes normally - no FUSE, no network round-trips per file op. On task completion or explicit sync, changes flush back to object storage. The presigned URL support for upload/download is the giveaway that object storage is the source of truth.
This makes way more sense than FUSE for agent workloads. Agents do thousands of small reads (find, grep, git status) that would each be a network call with FUSE. With copy-on-mount it's all local disk speed after initial sync.
Cross-task sharing falls out naturally - two tasks mounting the same filesystem ID just means two containers syncing from the same S3 prefix. Probably last-write-wins rather than distributed locking, which is fine since agents rarely have concurrent writes to the same file.
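If that guess is right, the whole lifecycle could be as simple as this sketch (bucket name, paths, and the sync-on-completion trigger are all my assumptions; `aws s3 sync` is just a stand-in for whatever transfer mechanism they actually use):

```python
import subprocess

def run_task(fs_id: str, task_cmd: list[str], workdir: str = "/mnt/task") -> None:
    """Hypothetical copy-on-mount lifecycle: sync down, work locally, sync back."""
    prefix = f"s3://agent-filesystems/{fs_id}/"  # object storage as source of truth

    # 1. Pull the filesystem contents down to local disk (paid once, up front).
    subprocess.run(["aws", "s3", "sync", prefix, workdir], check=True)

    # 2. The agent works against plain local disk: find/grep/git status are
    #    ordinary file ops, no network round-trip per call.
    subprocess.run(task_cmd, cwd=workdir, check=True)

    # 3. Flush changes back on completion. Two tasks sharing fs_id just sync
    #    against the same prefix; last write wins.
    subprocess.run(["aws", "s3", "sync", workdir, prefix], check=True)
```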
It strikes me there's more low-hanging fruit to pluck re: context window management. Backtracking seems like another promising direction for avoiding context bloat and compaction (i.e. when a model takes a few attempts to do the right thing, once it's done the right thing, prune the failed attempts out of the context).
Agree. I’d like more fine-grained control of context and compaction. If you spend time debugging in the middle of a session, once you’ve fixed the bugs you ought to be able to remove everything related to fixing them from the context and continue as you were before you encountered them. (Right now, depending on your IDE, this can be quite annoying to do manually. And I’m not aware of any that let you snip it out if you’ve worked with the agent on other tasks afterwards.)
I think agents should manage their own context too. For example, if you’re working with a tool that dumps a lot of logged information into context, those logs should get pruned out after one or two more prompts.
Context should be thought of as something that can be freely manipulated, rather than as a stack where things can only be appended to or removed from the end.
Yeah, the fact that we’ve treated context as immutable baffles me; it’s not like humans’ working memory keeps a perfect history of everything they’ve done over the last hour. It shouldn’t be that complicated to train a secondary model that just runs online compaction. E.g. it runs a tool call, the model determines what’s germane to the conversation and prunes the rest; or some task gets completed, so just leave a stub in the context that says "completed x", with a tool available to see the details of x if it becomes relevant again.
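A sketch of that stub-and-expand mechanic, assuming you control the message list sent back to the model (names and message shape are hypothetical):

```python
import uuid

archive: dict[str, str] = {}  # full tool outputs, keyed by stub id

def compact(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace older tool outputs with one-line stubs; expand_stub below
    lets the model pull the details back in if they become relevant."""
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < len(messages) - keep_last:
            stub_id = uuid.uuid4().hex[:8]
            archive[stub_id] = msg["content"]
            msg = {"role": "tool",
                   "content": f"[completed; details archived as {stub_id}]"}
        out.append(msg)
    return out

def expand_stub(stub_id: str) -> str:
    """Tool exposed to the model: retrieve an archived output by id."""
    return archive.get(stub_id, "unknown stub id")
```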
That's pretty much the approach we took with context-mode. Tool outputs get processed in a sandbox, only a stub summary comes back into context, and the full details stay in a searchable FTS5 index the model can query on demand. Not trained into the model itself, but gets you most of the way there as a plugin today.
This is a partial realization of the idea, but for a long-running agent the proportion of noise increases linearly with session length; unless you take an appropriately large machete to the problem, you’re still going to wind up with suboptimal results.
Yeah, I'd definitely like to be able to edit my context a lot more. Once you consider that, you start seeing things in your head like "select this big chunk of context and ask the model to simplify that part", or fixing the model having ingested too many tokens because it dumped in a whole file it didn't realize would be that large. There's about a half-dozen things like that that are immediately, obviously useful.
Oh that's quite a nice idea - agentic context management (riffing on agentic memory management).
There are some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" can be expressed concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context; the downside is it'll bust the prompt cache (perhaps the optimal trade-off is to batch the snips, as in the sketch below).
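A sketch of applying a batch of such snips, with the start/end markers represented as plain substrings (just one possible encoding):

```python
def apply_snips(context: str, snips: list[tuple[str, str]]) -> str:
    """Remove each chunk delimited by a (start_marker, end_marker) pair.
    Applying the snips as one batch busts the prompt cache once, not
    once per snip."""
    for start, end in snips:
        lo = context.find(start)
        hi = context.find(end, lo)
        if lo == -1 or hi == -1:
            continue  # marker not found; leave the context untouched
        context = context[:lo] + context[hi + len(end):]
    return context

# e.g. prune a resolved debugging detour out of the transcript
transcript = (
    "user: fix the off-by-one\n"
    "$ pytest -x ...long failing traceback...\n"
    "All 42 tests passed\n"
    "assistant: done, the fix was in range()\n"
)
print(apply_snips(transcript, [("$ pytest -x", "All 42 tests passed")]))
```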
So I built that in my chat harness. I just gave the agent a “prune” tool and it can remove shit it doesn’t need any more from its own context. But chat is last gen.
Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.
> I think agents should manage their own context too.
My intuition is that this should be almost trivial. If I copy/paste your long coding session into an LLM and ask it which parts can be removed from context without losing much, I'm confident that it will know to remove the debugging bits.
I generally do this when the agent gets stuck in a test loop or whatever after I’ve injected some later requirement and tweaked. Once I hit a decent place, I have the agent summarize, discard the branch (it’s part of the context too!) and start with the new prompt.
> For example, if you’re working with a tool that dumps a lot of logged information into context
I've set up a hook that blocks directly running certain common tools and instead tells Claude to pipe the output to a temporary file and search that for relevant info. There's still some noise where it tries to run the tool once, gets blocked, then runs it the right way. But it's better than before.
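Something like this, as a sketch (assuming Claude Code's PreToolUse hook contract: tool call JSON on stdin, exit code 2 to block with stderr fed back to the model; the command list is just an example):

```python
#!/usr/bin/env python3
# PreToolUse hook: block noisy commands run bare, and tell Claude to pipe
# the output to a file and search that instead.
import json
import sys

NOISY = ("npm test", "pytest", "cargo build")  # tune to your workflow

data = json.load(sys.stdin)
cmd = data.get("tool_input", {}).get("command", "")

if any(cmd.strip().startswith(n) for n in NOISY) and ">" not in cmd:
    print(f"Run `{cmd} > /tmp/out.log 2>&1` and grep /tmp/out.log for the "
          "relevant lines instead of dumping everything into context.",
          file=sys.stderr)
    sys.exit(2)  # exit code 2 blocks the tool call

sys.exit(0)  # allow everything else
```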
I think telling it to run those in a subagent should accomplish the same thing and ensure only the answer makes it to the main context. Otherwise you will still have some bloat from reading the exact output, although in some cases that could be good if you’re debugging or something
Not really because it reliably greps or searches the file for relevant info. So far I haven't seen it ever load the whole file. It might be more efficient for the main thread to have a subagent do it but probably at a significant slowdown penalty when all I'm doing is linting or running tests. So this is probably a judgement call depending on the situation.
That's exactly what context-mode does for tool outputs. Instead of dumping raw logs and snapshots into context, it runs them in a sandbox and only returns a summary. The full data stays in a local FTS5 index so you can search it later when you need specifics.
What I want is for the agent to initially get the full data and make the right decision based on it; later on, it doesn't need to know as much about how it got there.
Isn't that how thinking works? Intermediate tokens that then get replaced with the result?
I think something kinda easy for that could be to pretend the pruned output was actually produced by a subagent: copy the detailed logs out and replace them with a compacted summary.
Treat context like git shas. Yes, there is a specific order within a 'branch' but you should be able to do the equivalent of cherry-picking and rebasing it
I do this with my agents. Basically, every "work" oriented call spawns a subprocess which does not add anything to the parent context window. When the subprocess completes the task, I ask it to 1) provide a complete answer, 2) provide a succinct explanation of how the answer was arrived at, 3) provide a succinct explanation of any attempts which did not work, and 4) note anything learned during the process which may be useful in the future. Then I feed those four answers back to the parent as if they were magically arrived at.

Another thing I do for managing the context window: any tool/MCP call has its output piped into a file. The LLM can then read only parts of the file and add just those to its context if they're sufficient. For example, if some command produces a lot of output and ultimately ends in "Success!", the LLM can just tail the last line to see if it succeeded. If it did, the rest of the output doesn't need to be read; if it failed, the failure message is usually at the end of the log.

Something I'm working on now is having a smaller local model summarize the log output and feed that summarization to the more powerful LLM (because I can run my local model for ~free, but it is nowhere near as capable as the cloud models). I don't keep up with SOTA so I have no idea if what I'm doing is well known or not, but it works for me and my setup.
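The pipe-to-file part of that is simple enough to sketch (paths and the tail size are just illustrative choices):

```python
import subprocess

def run_logged(cmd: list[str], log_path: str = "/tmp/tool.log") -> str:
    """Run a command with all output captured to a file; hand the model
    only the tail. It can grep or read the file further if needed."""
    with open(log_path, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    with open(log_path) as log:
        tail = log.readlines()[-5:]  # "Success!" or the failure usually sits here
    return "".join(tail)

# The model sees only the last few lines of a long test run.
print(run_logged(["pytest", "-q"]))
```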
It feels like the late 1990s all over again, but instead of HTML and SQL, it’s coding agents. This time around, a lot of us are well experienced at software engineering, so we can find optimizations simply by using Claude Code all day long. We get an idea, we work with AI to help create a detailed design, and then let it develop it for us.
Totally agree. Failed attempts are just noise once the right path is found. Auto-detecting retry patterns and pruning them down to the final working version feels very doable, especially for clear cases like lint or compilation fixes.
Maybe the right answer is “why not both”, but subagents can also be used for that problem. That is, when something isn’t going as expected, fork a subagent to solve the problem and return with the answer.
It’s interesting to imagine a single model deciding to wipe its own memory though, and roll back in time to a past version of itself (only, with the answer to a vexing problem)
I forget where now but I'm sure I read an article from one of the coding harness companies talking about how they'd done just that. Effectively it could pass a note to its past self saying "Path X doesn't work", and otherwise reset the context to any previous point.
I could see this working like some sort of undo tree, with multiple branches you can jump back and forth between.
Here's a concrete example of what composition looks like in practice.
Say your team has an internal `infractl` CLI for managing your deploy infrastructure. No LLM has ever seen it in training data. You add `--mtp-describe` (one function call with any of the SDKs), then open Claude Code and type:
> !mtpcli
> How do I use infractl?
The first line runs `mtpcli`, which prints instructions teaching the LLM the `--mtp-describe` convention: how to discover tools, how schemas map to CLI invocations, how to compose with pipes. The second line causes the LLM to run `infractl --mtp-describe`, get back the full schema, and understand a tool it has never seen in training data. Now you say:
> Write a crontab entry that posts unhealthy pods to the #ops Slack channel every 5 minutes
And it composes your custom CLI with a third-party MCP server it's never touched before: your tool, a Slack MCP server, and `jq`, in a pipeline the LLM wrote because it could discover every piece. That script can run in CI, or on a Raspberry Pi. No tokens burned, no inference round-trips. The composition primitives have been here for 50 years. Bash is all you need!
Looks like another Claude App/Cowork-type competitor with slightly different tradeoffs (Cowork just calls Claude Code in a VM, this just calls Codex CLI with OS sandboxing).
Here's the Codex tech stack in case anyone was interested like me.
It's a smart move – while Codex has the same aspirations, limiting it to savvy power users will likely lead to better feedback, and less catastrophic misuse.
Meaning Sentry exposes an MCP layer with a tool-call layer and tool registry; in this case, the layer is provided by Sentry. Native would mean that calling specific Sentry APIs is provided as a specific integration path depending on the context. At least that's how I categorize it.
I'm so confused. Sentry is a native client crash reporting tool. What does this have to do with MCP or the LLM itself? Do you mean when interpreting the crash data?
But AI companies would want more engineers generating more tokens.
In this case, Anthropic would want 100 SWEs generating $100,000/mo of revenue. Replacing the very headcount that is responsible for token usage would hurt company growth.
Not to mention what happens when companies start doing this recursively. 100 SWE -> 10 SWE, 100 slack/gmail/notion/etc. subscriptions become 10.
> The simple evidence for this is that everyone who has invested the same resources in AI has produced roughly the same result.
I think this conflates together a lot of different types of AI investment - the application layer vs the model layer vs the cloud layer vs the chip layer.
It's entirely possible that it's hard to generate an economic profit at the model layer, but that doesn't mean that there can't be great returns from the other layers (and a lot of VC money is focused on the application layer).
> In python, ..., calling shell commands or other OS processes requires fiddling with the subprocess module, writing ad-hoc streaming loops, etc - don't even start with piping several commands together.
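For the record, here's what even the simplest two-command pipe looks like with the stdlib (this is essentially the recipe from the subprocess docs; the commands are just examples):

```python
import subprocess

# Equivalent of: ps aux | grep python
ps = subprocess.Popen(["ps", "aux"], stdout=subprocess.PIPE)
grep = subprocess.Popen(["grep", "python"], stdin=ps.stdout,
                        stdout=subprocess.PIPE, text=True)
ps.stdout.close()  # let ps receive SIGPIPE if grep exits first
output, _ = grep.communicate()
print(output)
```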