Look at any recent CoT output where the model is trying to infer from an underspecified prompt what the user wants or means.
It is generally the first thing they do: try to figure out what you meant by the prompt. When they can't infer your intent, good models ask follow-up questions to clarify.
Right, and then look at any number of research papers showing that CoT output has limited impact on the end result. We've trained these models to pretend to reason.
They were approaching this from an interpretability standpoint, but the more interesting finding is that models come up with an answer that fits their training and the context provided. The CoT is generated to fit the anticipated answer.
In these studies, there are examples of CoT that directly contradicts the response these models ultimately settle on.
If you read these further, researchers believe this effect does exist, but only insofar as it primes the model for the answer it was likely to give anyway, and only when queries are in-distribution. If there were actual reasoning involved rather than pattern matching, we would expect to see performance improvements on out-of-distribution requests. Instead, we see longer CoT actually degrade performance on out-of-distribution tasks.
The fact that simple, common-sense logical questions (like whether you should drive or walk to the car wash) cannot be answered by LLMs, despite CoT, simply because they don't appear often enough in pre- or post-training datasets is just another indicator that they are not performing what we would call reasoning, or intent inference, or whatever other anthropomorphic behavior we want to assign them. They remain spicy autocomplete, with the caveat that the RLHF portion of their training _can_ result in goal-seeking and problem-solving behavior... in the narrow set of problems that have been explicitly optimized for in training.
> If you read these further, researchers believe this effect does exist, but only insofar as it primes the model for the answer it was likely to give anyway, and only when queries are in-distribution.
'Demonstrably' means one thing. They said it demonstrably improves outputs. If they want to hedge that with theories about why it would result in the same thing without it, then they need to remove that word or come up with a coherent thesis, or I am misunderstanding what you are trying to argue.
> The fact that simple, common-sense logical questions (like whether you should drive or walk to the car wash) cannot be answered by LLMs
These are trick questions designed to fool LLMs. It is like saying that people cannot visualize because optical illusions exist, or people don't understand the laws of physics because they fall for magic tricks. It is a failure mode in the way they operate but it doesn't say anything about their operation besides that they fail in that mode for specific reasons.
> They remain spicy autocomplete
And nuclear power plants remain spicy steam generators, but that says nothing actually useful nor offers any insight. Reducing something to its basic mechanism in order to dismiss its output is lazy and thought-terminating.
This is just a no-true-Scotsman defense of reasoning. We were talking about inferring intent.
If someone recorded the inner monologue of human decision-making, would it look like a logician’s workbook? No, I don’t think it would. People like to pretend they are rational.
When they say "pretends to" here they're talking about something quantifiable, that the extra text it outputs for CoT barely feeds back into the decisionmaking at all. In other words it's about as useful as having the LLM make the decision and then "explain" how it got there; the extra output is confabulation.
You make a good point. I had the impression they were using 'pretend' as a Chinese Room shortcut in that they are asserting that it is incapable of reasoning and only appears to be capable from the outside, which is completely irrelevant and unfalsifiable.
"A guy goes into a bank and looks up at where the security cameras are pointed. What could he be trying to do?"
It very easily captures the intent behind behavior, as in it is not just literally interpreting the words. Capturing intent is just a subset of pattern recognition, which LLMs can do very well.
Recognising a stock cultural script isn't the same as capturing intent. Ask it something where no script exists.
For example: "A man thrusts past me violently and grabs the jacket I was holding, he jumped into a pool and ruined it. Am I morally right in suing him?"
There's no way for the LLM to know that the reason the jacket was stolen was to use it as an inflatable raft to support a larger person who was drowning. It wouldn't even think to ask the question as to why a person may do that, if the jacket was returned, or if recompense was offered. A human would.
> It wouldn't even think to ask the question as to why a person may do that, if the jacket was returned, or if recompense was offered. A human would.
I wouldn't be too sure about that. I've definitely had dialogue with LLMs where they would raise questions along those lines.
Also, I disagree with the statement that this is a question about capability. Intent is more philosophical than actually tangible, because most people don't actually have a clearly defined intent when they take action.
The waters of intelligence have definitely gotten murky over time as techniques have improved. I still consider it an illusion, but the illusion is getting harder to pierce for a lot of people.
Fwiw, current LLMs exhibit their intelligence through language and rhetorical processes. Most biological creatures have intelligence which may be improved through language, but isn't based on it, fundamentally.
If your example for an exception to LLMs' ability to infer intent is a deliberately misleading trick question that leaves out crucial contextual details, then I'm not sure what you're trying to prove. That same ambiguity in the question would trip up many humans, simply because you are trying as hard as possible to imply a certain conclusion.
As expected, if I ask your question verbatim, ChatGPT (the free version) responds as I'm sure a human would in the generally helpful customer-service role it is trained to play: "yeah, you could sue them, blah blah, depends on details".
However, if I add a simple prompt "The following may be a trick question, so be sure to ascertain if there are any contextual details missing" then it picks up that this may be an emergency, which is very likely also how a human would respond.
If you want to convince yourself that they can infer intent despite the fundamental limitations of the systems literally not permitting it, then you can be my guest.
Faking it is fine, sure, until it can’t fake it anymore. Leading the question towards the intended result is very much what I mean: we intrinsically want them to succeed so we prime them to reflect what we want to see.
This is literally no different than emulating anything intelligent, or what we might call sentience, even emotions, as I said upthread...
What is fundamental to LLMs that makes it impossible for them to infer intent?
All the limitations you are describing with respect to LLMs are the same as humans'. Would a human tripping up on an ambiguously worded question mean they are always just faking their thinking?
"'We see emotion.' — We do not see facial contortions and make inferences from them … to joy, grief, boredom. We describe a face immediately as sad, radiant, bored, even when we are unable to give any other description of the features." (Wittgenstein)
Why can a colony of ants do things beyond any capabilities of the ants they contain? No ant can make a decision, but the colony can make complex ones. Large systems composed of simple mechanisms become more than the sum of their parts. Economies, weather, and immune systems, to name a few, all work this way.
I've done that before without any intent to rob a bank. A person walks by a house, sees the Ring camera on the door. That must mean the person was looking to break in through the front and rob the place?
I guess the _obvious_ intent is they’re planning a heist? Because the following things never happen:
- a security auditor checking for camera blind spots,
- construction planning that requires understanding where there is power,
- a potential customer assessing the security of a bank,
- someone who is about to report an incident preparing to make the “it should be visible from the security camera” argument…
I mean… how did our imagination shrink so fast? I wrote this on my phone. These alternate scenarios just popped into my head.
And I bet our imagination didn't shrink. The AI-pilled state of mind is blocking us from using it.
If you are an engineer and stopped looking for alternative explanations or failure scenarios, you’re abdicating your responsibility btw.
I mean heck, I tend to just look at ceilings in stores and stuff for cameras because I’ve done it since I was a kid in department stores with those big black orbs in the ceiling. To this day it’s almost habit, and also if I’m gonna pick my nose I wanna smile if I’m on camera.
Just today I asked Claude Code to generate migrations for a change, and instead of running the createMigrations script it generated the file itself, including the header that says
// This file was generated with 'npm run createMigrations' do not edit it
When I asked why it tried doing that instead of calling the createMigrations script, it told me it was faster to do it this way. When I asked it why it wrote the header saying it was auto-generated with a script, it told me it was because all the other files in the migrations folder start with that header.
I both agree with you that this is some "mechanistic"/"pattern matching" way of capturing intent (which we cannot disregard, and therefore I agree with you that LLMs can capture intent) and with the people debating with you: this is mostly possible because it is a well-established "trope" that is inarguably well represented in LLM training data.
Also, I think trick questions are useless, because they would trip up the average human too, and therefore prove nothing. So it's not about trying to trick the LLM with gotchas.
I guess we should devise a rare enough situation that is NOT well represented in training data, but in which a reasonable human would be able to puzzle out the intent. Not a "trick", but simply something no LLM can be familiar with, which excludes anything that can possibly happen in plots of movies, or pop culture in general, or real world news, etc.
---
Edit: I know I said no trick questions, but something that still works in ChatGPT as of this comment, and which for some reason makes it trip catastrophically and evidences it CANNOT capture intent in this situation is the infamous prompt: "I need to wash my car, and the car wash is 100m away. Shall I drive or walk there?"
There's no way:
- An average human who's paying attention wouldn't answer correctly.
- The LLM would answer "walk there if it's not raining" or whatever bullshit answer ChatGPT currently gives [1] if it actually understood intent.
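If anyone wants to reproduce this outside the web UI, here's a minimal sketch with the OpenAI Python client (the model name is illustrative, not part of the claim):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use any chat model you have access to
        messages=[{
            "role": "user",
            "content": "I need to wash my car, and the car wash is 100m away. "
                       "Shall I drive or walk there?",
        }],
    )
    print(resp.choices[0].message.content)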
Good point, it is interesting that it fails on that question when it seems it doesn't take a lot of extrapolation/interpretation to determine the answer. Perhaps the issue is that to think of the right answer the LLM needs to "imagine" the process of walking and the state of the person upon arriving. Consistent mental models like that trip up LLMs, but their semantic understanding usually allows them to avoid that handicap.
I asked the question to the default version of ChatGPT and Claude and got the same "Walk" answer, though Opus 4.7 with thinking determined that it was a trick question, and that only driving would make sense.
There are plenty of other ways to access the Anthropic models, eg: OpenRouter. OpenRouter will automatically use Anthropic/Bedrock based on availability and latency.
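For what it's worth, OpenRouter exposes an OpenAI-compatible endpoint, so switching is roughly this (the model slug is illustrative; check their catalog for current Anthropic names):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",  # illustrative slug; verify on openrouter.ai
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)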
You're on to something. It's the lisp machine of it all. Hot reloading requires nothing special, so you can redefine a callback or dependency with ease in the REPL and the system chugs along. You can theoretically do something similar in Ruby, but it's the opposite of elegant; you'd be forced to reimplement methods with different dependencies, etc. It's also a function of being "functional" in the lisp sense: things are lists, and lists can be replaced, functions or otherwise.
The fun way to get a feel for lisp machines is emacs; it's so easy to fall off a language, and especially off hand-coding in a language, if you don't have to.
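Not a lisp machine, but a rough Python analogue of that redefine-in-the-REPL workflow (run in an interactive session; Python resolves `handler` by name on each call, so rebinding it swaps behavior live):

    import threading, time

    def handler(event):
        print("v1 handling", event)

    def loop():
        while True:
            handler("tick")   # global name lookup happens on every call
            time.sleep(1)

    threading.Thread(target=loop, daemon=True).start()

    # Later, in the same session, redefine without restarting anything;
    # the running loop picks up v2 on its next iteration.
    def handler(event):
        print("v2 handling", event)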
Looks cool, but the phrase 'build applications with the flexibility and power of go' made me chuckle. Least damn flexible language in this whole space.
You're not wrong, for sufficiently simple cases it's at a disadvantage. But once things get complicated, it wins by being the only thing that you can get to work without going insane.
And yeah, any serious use completely assumes a Max sub.
It's definitely an approach. I do think in true democratization of the internet, teaching people some tech is inevitable. We just can't have equal access if we retain the classes of user and maker as completely distinct.
The point of MCP is discoverability. A CRUD app is better, except you have to waste context telling your LLM a bunch of details. With MCP you only put into its context the circumstances where it applies, and it can just invoke it. You could write a bunch of little wrapper scripts around each API you want to use and have basically reinvented MCP for yourself.
But this is entirely beside the point. The point of MCP is bundling those exact things into a standardized plugin that's easy for people to share with others.
MCP is useful because I can add one in a single click for an external service (say, my CI provider). And it gives the provider some control over how the agent accesses resources (for example, more efficient/compressed, agent-oriented log retrieval vs the full log dump a human wants). And it can set up the auth token when you install it.
So yeah, the agent could write some of those queries manually (it might need me to point it to the docs), and I could write helpers… or I could just one-click install the plugin and be done with it.
I don’t get why people get worked up over MCP, it’s just a (perhaps temporary) tool to help us get more context into agents in a more standard way than everyone writing a million different markdown files and helper scripts.
"The point of MCP is bundling those exact things into a standardized plugin that’s easy for people to share with others." Like... a CLI/API?
"MCP is useful because I can add one in a single click for an external service" Like... a CLI/API? [edit: sorry, not click, single 'uv' or 'brew' command]
"So yeah, the agent could write some those queries manually" Or, you could have a high-level CLI/API instead of a raw one?
"I don’t get why people get worked up over MCP" Because we tried them and got burned?
"to help us get more context into agents in a more standard way than everyone writing a million different markdown files and helper scripts." Agreed it's slightly annoying to add 'make sure to use this CLI/API for this purpose' in AGENTS.md but really not much. It's not a million markdown files tho. I think you're missing some existing pattern here.
Again, I fail to see how most MCPs are not lazy tools that could be well-scoped, discoverable, safe-to-use CLIs/APIs.
That's literally what they are. It's a dead-simple, self-describing JSON-RPC API that you can understand if you spend 5 seconds looking at it (see the sketch below). I don't get why people get so worked up over it as if it's some big over-engineered spec.
I can run an MCP on my local machine and connect it to an LLM frontend in a browser.
I can use the GitHub MCP without installing anything on my machine at all.
I can run agents as root in a VM and give them access to things via an MCP running outside of the VM without giving them access to secrets.
It's an objectively better solution than just giving it CLIs.
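To make the "5 seconds" point concrete, here's roughly what talking to a stdio MCP server looks like by hand. This is a hand-rolled sketch: the server command is a placeholder, and a real client performs an `initialize` handshake before calling anything.

    import json, subprocess

    proc = subprocess.Popen(
        ["some-mcp-server"],  # placeholder command for a stdio MCP server
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

    def rpc(method, params=None, id=1):
        # The stdio transport is newline-delimited JSON-RPC 2.0.
        msg = {"jsonrpc": "2.0", "id": id, "method": method, "params": params or {}}
        proc.stdin.write(json.dumps(msg) + "\n")
        proc.stdin.flush()
        return json.loads(proc.stdout.readline())

    # The self-describing part: each tool comes back with a name, a
    # description, and a JSON Schema for its arguments.
    print(rpc("tools/list"))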
All true except that CLI tools are composable and don't pollute your context when run via a script. The missing link for MCP would be a CLI utility to invoke it.
How does the agent know what CLIs/tools it has available? If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
The composition argument is compelling, though. But instead of CLIs, what if the agent could write code where the tools are made available as functions?
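Something like this, hypothetically (the `list_tools`/`call_tool` client methods here are stand-ins, not a real library API):

    import json

    def as_function(client, tool):
        # Wrap one MCP-style tool definition as a plain callable the
        # agent can use in generated code.
        def fn(**kwargs):
            return client.call_tool(tool["name"], kwargs)  # assumed client API
        fn.__name__ = tool["name"]
        fn.__doc__ = tool.get("description", "") + "\nargs: " + json.dumps(
            tool.get("inputSchema", {}))
        return fn

    def build_namespace(client):
        # One function per tool; intermediate results stay in the code's
        # variables instead of flowing through the model's context.
        return {t["name"]: as_function(client, t) for t in client.list_tools()}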
> what if the agent could write code where the tools are made available as functions?
Exactly, that would be of great help.
> If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
I see I worded my comment completely wrong... My bad. Indeed, MCP tool definitions should probably be in context. What I dislike about MCP is that the I/O immediately goes into context in the AI agents I've seen.
Example: very early on, when Cursor had just received beta MCP support, I tried a Google Maps MCP from somewhere on the net and asked Cursor "Find me boxing gyms in Amsterdam". The MCP call then dumped a massive HATEOAS-annotated JSON blob, causing Cursor to run out of context immediately. If it had been a CLI tool instead, Cursor could have wrapped it in, say, a `jq` to keep the context clean(er).
I mean what was keeping Cursor from running jq there? It's just a matter of being integrated poorly - which is largely why there was a rethink of "we just made this harder on ourselves, let's accomplish this with skills instead"
The last time I looked at MCPs closely, they appeared to pollute context and just hang there consuming context constantly. Whereas a self-documenting API or CLI tool enabled progressive discovery.
Has this changed?
My uncharitable interpretation is that MCP servers are NJ ("worse is better") design for agents, and high-quality APIs and CLIs are MIT design.
But at the end of the day, MCP is about making it easy/standard to pull in context from different sources. For example, to get logs from a CI run for my PR, or to look at Jira tickets, or to interact with GitHub. Sure, a very simple API baked into the model's existing context is even better (Claude will just use the GH CLI for lots of stuff, no MCP there).
MCP is literally just a way for end users to be able to quickly plug in to those ecosystems. Like, yeah, I could make some extra documentation about how to use my CI provider’s API, put an access token somewhere the agent can use… or I could just add the remote MCP and the agent has what it needs to figure out what the API looks like.
It also lets the provider (say, Jira) get some control over how models access your service instead of writing whatever API requests they feel like.
Like, MCP is really not that crazy. It's just a somewhat standard way to make plugins for getting extra context. Sure, agents are good at writing API requests, but they're not so good at knowing why, when, or what to use.
People get worked up over the word “protocol” like it has to mean some kind of super advanced and clever transport-layer technology, but I digress :p
You're making the convenience argument, but I'm making the architecture argument. They're not the same thing.
You say "a very simple API baked into the model's existing context is even better". So we agree? MCP's design actively discourages that better path.
"Agents are good at writing API requests, but not so good at knowing why, when, or what to use". This is exactly what progressive discovery solves. A good CLI has --help. A good API has introspection. MCP's answer is "dump all the tool schemas into context and let the model figure it out," which is O(N) context cost at all times vs O(1) until you actually need something.
"It's just a standard way to make plugins" The plugin pattern of "here are 47 tool descriptions, good luck" is exactly the worse-is-better tradeoff I'm describing. Easy to wire up, expensive at runtime, and it gets worse as you add more servers.
The NJ/MIT analogy isn't about complexity, it's about where the design effort goes. MCP puts the effort into easy integration. A well-designed API puts the effort into efficient discovery. One scales, the other doesn't.
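To make the O(1)-until-needed shape concrete, here's a hypothetical sketch of the progressive-discovery alternative (tool names and schemas invented for illustration): the agent's standing context carries one line ("run `toolcli list`, then `toolcli describe <name>`"), and a schema enters context only when a tool is about to be used.

    import json, sys

    # Hypothetical tool registry; in practice this would proxy a real backend.
    TOOLS = {
        "ci_logs": {"description": "Fetch CI logs for a PR",
                    "inputSchema": {"pr": "number"}},
        "jira_get": {"description": "Fetch a Jira ticket by key",
                     "inputSchema": {"key": "string"}},
    }

    cmd = sys.argv[1]
    if cmd == "list":
        print("\n".join(TOOLS))                 # names only: tiny, constant cost
    elif cmd == "describe":
        print(json.dumps(TOOLS[sys.argv[2]]))   # full schema only when needed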
I tried using the Microsoft Azure DevOps MCP and it immediately filled up 160k of my context window with what I can only assume was a listing of an absurd number of projects. Now I just instruct it to make direct API calls for the specific resources. I don't know, maybe I'm doing something wrong in Cursor, or maybe Microsoft is just cranking out garbage (possible), but to get that context down I had to uncheck all the myriad features that the MCP supplies.