Sorry, but are you implying that for every system you integrate with, you verify the scope of an API key by checking each CRUD operation on every API endpoint it provides?
I think the suggestion from their "somewhat sympathetic" position is that if you are integrating with something you should (a) find out up front what limits it does or doesn't have on its API keys, so that it's not a nasty surprise later, and (b) absolutely don't give keys without really tight scopes to "agents."
The person here who deleted their prod DB with their agent made an assumption that an API key wouldn't have broad permissions if there weren't warnings ("We had no idea — and Railway's token-creation flow gave us no warning — that the same token had blanket authority across the entire Railway GraphQL API, including destructive operations like volumeDelete."). I don't know what the UI looks like exactly, but unless I'm explicitly selecting a specific set of limited permissions, I don't know why I'd assume "this won't do more than I am creating it for". It's like saying "I didn't ask the guy at the gun store to put bullets in; I wouldn't have given the gun to the agent if I'd known there were bullets in it."
I also would be wary of running on an "infrastructure provider" that didn't make things like that very clear.
Is this overly harsh? I don't know. I've had to explain far too many times to people (including other engineers) what makes doing certain things unsafe/foolish (since they initially think I'm wasting time checking things like that). So I think stories like this need to be taken as "absolutely do not make the same mistakes" cautionary tales by as many people as possible.
For every API you publish, do you verify that scoped API keys work as they should before you go live? If so, why would you not do the same for APIs you integrate with? It's all part of "your" system from the user's perspective.
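To be concrete about what "verify" means here: not exercising every endpoint, just a quick negative test that the key cannot do the destructive things you never granted it. A rough sketch in Python (the provider URL, paths, and status-code expectations are made up for illustration, not any specific vendor's API):

    # Hypothetical smoke test: a token created only for managing custom domains
    # should be rejected for a destructive operation it was never granted.
    # Endpoints, payloads, and the 401/403 expectation are illustrative only.
    import requests

    API = "https://api.example-provider.com"
    TOKEN = "token-created-only-for-custom-domains"
    headers = {"Authorization": f"Bearer {TOKEN}"}

    # Allowed: the operation the token was created for.
    r = requests.post(f"{API}/v1/domains", headers=headers,
                      json={"domain": "app.example.com"})
    assert r.status_code in (200, 201), "the granted operation should succeed"

    # Forbidden: a destructive operation outside the intended scope.
    r = requests.delete(f"{API}/v1/volumes/vol_123", headers=headers)
    assert r.status_code in (401, 403), (
        "token can delete volumes -- it is not scoped the way we assumed"
    )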
I think the author is being deceptive with this part:
>3. CLI tokens have blanket permissions across environments.
>The Railway CLI token I created to add and remove custom domains had the same volumeDelete permission as a token created for any other purpose. Tokens are not scoped by operation, by environment, or by resource at the permission level. There is no role-based access control for the Railway API — every token is effectively root. The Railway community has been asking for scoped tokens for years. It hasn't shipped.
They're trying to make it sound like there was some misleading design around scopes, but the last sentence gives it away. They simply assumed that a scope would be enforced somehow, even though they never explicitly defined one like you would in a service that actually supports them. (Or worse, they actually knew all this ahead of time and still proceeded).
That said, I haven't used this service so I can't evaluate the UX. I know that in GitHub or cloud IAM there is no ambiguity about what you're granting. And if I didn't have full confidence in the limits of a credential then I sure as hell wouldn't give it to an agent.
“why would you not do the same for APIs you integrate with?”
Who does that? Jira and Salesforce have hundreds of endpoints each. AWS has hundreds of services, and each may have hundreds of endpoints. Who on your team is testing the key scopes of every endpoint? Do you do it for each key you generate? After all, that external system could develop a bug in how it manages scopes at any moment. Or it could introduce new endpoints that aren't handled properly. So for existing keys, how frequently do you re-validate the scope against all the endpoints?
Yes, but my original reply was to someone who seemed to imply that this founder was dumb not to verify that Railway's API key, which should have been limited to managing custom domains, truly was limited to managing custom domains. I've never used Railway, but my pushback is that no one in the real world exhaustively verifies that a key is scoped properly against all third-party endpoints. We trust vendors to document how keys are scoped and to actually enforce that.
I think it is meaningful that the author didn't say "there was a bug in scope enforcement" or "the UX is really misleading; look at these screenshots." In fact, they even state this is a long-standing community feature request. And they don't even say they only discovered this after the incident!
It actually seems like they knew ahead of time and proceeded anyway, but are just using this critique as a way to shift blame.
No, I'm not. But it's clearly stated in the article that the API doesn't have scopes at all... So there was no reason to assume that some would be magically applied!
In GitHub or AWS etc. you expect scopes to work because you define them. However, if there is no way to define them in the first place, would you assume the system can somehow read your mind about what the client can access??
In fact, I now believe this is a deliberate rhetorical sleight of hand: point out a legitimate critique of the API design as if it were an excuse. But really, any responsible engineer would notice the lack of scopes immediately, and that would be a flashing siren warning them not to entrust those tokens to an agent.
Can you do basic addition of two arbitrary numbers with 100% accuracy (no tools)? No, you can't. You will make mistakes for a sufficiently large N even with pen and paper, and for a very small N without. Are you no longer generally intelligent?
LLMs should use tool calling (which is 100% reliable) instead of doing math internally. But in general it would be nice to be able to teach a process and have the AI execute it deterministically. In some sense, reliability between 99% and 100% is the worst because you still can't trust the output but the verification feels like wasted effort. Maybe code gen and execution will get us there.
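For the arithmetic case specifically the fix is cheap: have the model emit a structured tool call and let ordinary code do the addition. A minimal sketch (the tool-call dict is a made-up stand-in for whatever format your provider returns, not any particular API):

    # The model emits a tool call; a plain calculator does the arithmetic,
    # so correctness no longer depends on the model's internal "mental math".
    import json
    from decimal import Decimal

    def calculator_add(a: str, b: str) -> str:
        # Decimal keeps arbitrarily large integers/decimals exact.
        return str(Decimal(a) + Decimal(b))

    TOOLS = {"calculator_add": calculator_add}

    # Pretend the model chose to call the tool instead of answering directly.
    model_tool_call = {
        "name": "calculator_add",
        "arguments": json.dumps({"a": "314159265358979323846",
                                 "b": "271828182845904523536"}),
    }

    args = json.loads(model_tool_call["arguments"])
    print(TOOLS[model_tool_call["name"]](**args))
    # -> 585987448204883847382, exact regardless of how big the numbers get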
This is the exact problem CognOS was built to solve.
99% reliable means you still can't remove the human from the loop, because you never know which 1% you're in. The only way to actually trust the output is to attach a verifiable confidence signal to each response, not just hope the aggregate accuracy holds.
We built a local gateway that wraps every LLM output with a trust envelope: a decision trace, a risk score, and an explicit PASS/REFINE/ESCALATE/BLOCK classification. The point isn't to make LLMs more accurate; it's to make their uncertainty legible so the human knows when to step in.
Open source if you want to look at the architecture: github.com/base76-research-lab/operational-cognos
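For a feel of the shape of the idea, here is a simplified illustration (field names and thresholds below are illustrative only, not the actual schema in the repo):

    # Simplified illustration of a "trust envelope": the raw output plus
    # machine-readable signals a human or a router can act on.
    from dataclasses import dataclass, field
    from enum import Enum

    class Verdict(Enum):
        PASS = "pass"          # safe to use as-is
        REFINE = "refine"      # send back for another attempt
        ESCALATE = "escalate"  # needs a human decision
        BLOCK = "block"        # do not surface at all

    @dataclass
    class TrustEnvelope:
        output: str
        risk_score: float                       # 0.0 (low) .. 1.0 (high)
        decision_trace: list = field(default_factory=list)
        verdict: Verdict = Verdict.ESCALATE

    def classify(output: str, risk_score: float) -> TrustEnvelope:
        trace = [f"risk_score={risk_score:.2f}"]
        if risk_score < 0.1:
            verdict = Verdict.PASS
        elif risk_score < 0.4:
            verdict = Verdict.REFINE
        elif risk_score < 0.8:
            verdict = Verdict.ESCALATE
        else:
            verdict = Verdict.BLOCK
        trace.append(f"verdict={verdict.value}")
        return TrustEnvelope(output, risk_score, trace, verdict)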
“I think the decision was made because the people making this decision at Anthropic are well-intentioned, driven by values, and motivated by trying to make the transition to powerful AI to go well.”
Every single one of these CEOs happily pirated unimaginable amounts of copyrighted content. That directly hurt millions of real human beings: not just the prior creators, but also future ones, whose potential for success it crushed.
“You never needed 1000s of engineers to build software anyway”
What is the point of even mentioning this? We live in reality. In reality, there are countless companies with thousands of engineers building each piece of software. Outside of reality, yes, you can talk about a million hypothetical situations. Cherry-picking rare examples like Winamp does nothing but provide an example of an exception, which, yes, also exists in the real world.
This was a great article. The section “Training for the next state prediction” explains a solution using subagents. If I’m understanding it correctly, we could test if that solution is directionally correct today, right? I ask a LLM a question. It comes up with a few potential responses but sends those first to other agents in a prompt with the minimum required context. Those subagents can even do this recursively a few times. Eventually the original agent collects and analyzes subagents responses and responds to me.
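A rough sketch of what I mean, with `llm` stubbed out in place of a real model call and `depth` bounding the recursion:

    # Main agent drafts candidate answers; reviewer subagents see only the
    # question plus one candidate; reviews can recurse a level or two; the
    # main agent then synthesizes everything into a final reply.
    def llm(prompt: str) -> str:
        # Stub standing in for a real model call.
        return f"[model response to: {prompt[:60]}...]"

    def review(candidate: str, question: str, depth: int) -> str:
        critique = llm(f"Question: {question}\nCandidate answer: {candidate}\n"
                       "Point out factual or logical problems, if any.")
        if depth > 0:
            # Optionally let another subagent check the critique itself.
            critique += "\n" + review(critique, question, depth - 1)
        return critique

    def answer(question: str, n_candidates: int = 3, depth: int = 1) -> str:
        candidates = [llm(f"Answer this: {question} (attempt {i + 1})")
                      for i in range(n_candidates)]
        critiques = [review(c, question, depth) for c in candidates]
        summary = "\n\n".join(f"Candidate: {c}\nCritique: {k}"
                              for c, k in zip(candidates, critiques))
        return llm("Given these candidates and critiques, write the best "
                   f"final answer to '{question}':\n{summary}")

    print(answer("Which year did the Apollo 11 landing happen?"))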
Any attempt at world modeling using today's LLMs needs a goal function for the LLM to optimize. The LLM needs to build, evaluate, and update its model of the world. Personally, the main obstacle I've found is in updating the model: the data can be large, and I think LLMs aren't good at finding correlations.
"The model advances both the frontier coding performance of GPT‑5.2-Codex and the reasoning and professional knowledge capabilities of GPT‑5.2, together in one model, which is also 25% faster. ... With GPT‑5.3-Codex, Codex goes from an agent that can write and review code to an agent that can do nearly anything developers and professionals can do on a computer."
They're specifically saying that they're planning for an overall improvement over the general-purpose GPT 5.2.
The author gives this example of the problem and of the incorrect way to leverage AI:
"Sarah was relieved. She thought she could focus on high-value synthesis work. She’d take the agent’s output and refine it, add strategic insights, make it client-ready."
Then they propose a long-winded solution which is essentially the exact same thing, but uses the magical term "orchestrate" a few times to make it sound different.
In fairness to the author, I think their point was that you take _several_ agents (not just one) and find a way to have them work like a team of 20 people. In the example, Sarah is trying to do the same job she did before, just marginally better.
Yeah, I guess that's accurate, but they also explained that AI capabilities advance every 6-12 months and that managing a team of agents buys you a few years. So their proposed solution, and the conclusion that it keeps you safe for years, makes no sense right now. Multi-agent orchestration, with an agent doing the orchestrating, is all the rage nowadays.
They made half the point, in my opinion - that you should be "doing the thing that wasn't possible before" - but missed the other half: that maybe the thing you should be doing is owning and creating relationships with customers yourself instead of doing it through a company... which maybe wasn't possible before but is now.
I agree. But the article then seems to suggest 'you be the one left standing to orchestrate'. It didn't offer much of a suggestion about the other 20 people who would be gone.
It seemed to come down to the old 'just work better, faster, cheaper', but dialed up to 11 now.
I read it more as "look for the thing that was _never done_ because no one was going to hire 20 people to do it" and all the examples were pointing out how you _should not_ try to "better, faster, cheaper" AI because you will lose quickly on all those dimensions.
I realize the irony, of course, that this article is AI-generated but it provoked something close to an epiphany for me even so.