Look at any recent CoT output where the model is trying to infer from an underspecified prompt what the user wants or means.
It is generally the first thing they do: try to figure out what you meant by the prompt. When they can't infer your intent, good models ask follow-up questions to clarify.
Right, and then look at any number of research papers showing that CoT output has limited impact on the end result. We've trained these models to pretend to reason.
They were approaching this from an interpretability standpoint, but the more interesting finding is that models come up with an answer that fits their training and the context provided. The CoT is generated to fit the anticipated answer.
In these studies, there are examples of CoT that directly contradicts the response these models ultimately settle on.
If you read these further, researchers believe this effect does exist, but only insofar as it primes the model for the answer it was likely to give anyway, and only when queries are in-distribution. If there were actual reasoning involved rather than pattern matching, we would expect to see performance improvements on out-of-distribution requests. Instead, we see longer CoT actually degrade performance on out-of-distribution tasks.
The fact that simple, common-sense logical questions (like whether you should drive or walk to the car wash) cannot be answered by LLMs, despite CoT, simply because they don't appear often enough in pre- or post-training datasets is just another indicator that they are not performing what we would call reasoning, or intent inference, or whatever other anthropomorphic behavior we want to assign them. They remain spicy autocomplete, with the caveat that the RLHF portion of their training _can_ result in goal-seeking and problem-solving behavior... in the narrow set of problems that have been explicitly optimized for in training.
> If you read these further, researchers believe this effect does exist, but only insofar as it primes the model for the answer it was likely to give anyway, and only when queries are in-distribution.
'Demonstrably' means one thing. They said it demonstrably improves outputs. If they want to hedge that with theories about why it would result in the same thing without it, then they need to remove that word or come up with a coherent thesis, or I am misunderstanding what you are trying to argue.
> The fact that simple, common-sense logical questions (like whether you should drive or walk to the car wash) cannot be answered by LLMs
These are trick questions designed to fool LLMs. It is like saying that people cannot visualize because optical illusions exist, or people don't understand the laws of physics because they fall for magic tricks. It is a failure mode in the way they operate but it doesn't say anything about their operation besides that they fail in that mode for specific reasons.
> They remain spicy autocomplete
And nuclear power plants remain spicy steam generators, but that says nothing actually useful nor offers any insight. Reducing something to its basic mechanism in order to dismiss its output is lazy and thought-terminating.
This is just a no-true-Scotsman defense of reasoning. We were talking about inferring intent.
If someone recorded the inner monologue of human decision-making, would it look like a logician’s workbook? No, I don’t think it would. People like to pretend they are rational.
When they say "pretends to" here they're talking about something quantifiable, that the extra text it outputs for CoT barely feeds back into the decisionmaking at all. In other words it's about as useful as having the LLM make the decision and then "explain" how it got there; the extra output is confabulation.
You make a good point. I had the impression they were using 'pretend' as a Chinese Room shortcut in that they are asserting that it is incapable of reasoning and only appears to be capable from the outside, which is completely irrelevant and unfalsifiable.
"A guy goes into a bank and looks up at where the security cameras are pointed. What could he be trying to do?"
It very easily captures the intent behind behavior, as in it is not just literally interpreting the words. Capturing intent is just a subset of pattern recognition, which LLMs can do very well.
Recognising a stock cultural script isn't the same as capturing intent. Ask it something where no script exists.
For example: "A man thrusts past me violently and grabs the jacket I was holding, he jumped into a pool and ruined it. Am I morally right in suing him?"
There's no way for the LLM to know that the reason the jacket was stolen was to use it as an inflatable raft to support a larger person who was drowning. It wouldn't even think to ask the question as to why a person may do that, if the jacket was returned, or if recompense was offered. A human would.
> It wouldn't even think to ask the question as to why a person may do that, if the jacket was returned, or if recompense was offered. A human would.
I wouldn't be too sure about that. I've definitely had dialogue with LLMs where they would raise questions along those lines.
Also, I disagree with the statement that this is a question about capability. Intent is more philosophical than actually tangible, because most people don't actually have a clearly defined intent when they take action.
The waters of intelligence have definitely gotten murky over time as techniques have improved. I still consider it an illusion, but the illusion is getting harder to pierce for a lot of people.
Fwiw, current LLMs exhibit their intelligence through language and rhetorical processes. Most biological creatures have intelligence which may be improved through language, but isn't based on it, fundamentally.
If your example for an exception to LLMs' ability to infer intent is a deliberately misleading trick question that leaves out crucial contextual details, then I'm not sure what you're trying to prove. That same ambiguity in the question would trip up many humans, simply because you are trying as hard as possible to imply a certain conclusion.
As expected, if I ask your question verbatim, ChatGPT (the free version) responds as I'm sure a human would in the generally helpful customer-service role it is trained to play: "yeah, you could sue them, blah blah, depends on details".
However, if I add a simple prompt "The following may be a trick question, so be sure to ascertain if there are any contextual details missing" then it picks up that this may be an emergency, which is very likely also how a human would respond.
If you want to convince yourself that they can infer intent despite the fundamental limitations of the systems literally not permitting it, then you can be my guest.
Faking it is fine, sure, until it can’t fake it anymore. Leading the question towards the intended result is very much what I mean: we intrinsically want them to succeed so we prime them to reflect what we want to see.
This is literally no different than emulating anything intelligent, or what we might call sentience, even emotions, as I said upthread...
What is fundamental to LLMs that makes it impossible for them to infer intent?
All the limitations you are describing with respect to LLMs are the same as humans'. Would a human tripping up on an ambiguously worded question mean they are always just faking their thinking?
"'We see emotion.' — We do not see facial contortions and make inferences from them … to joy, grief, boredom. We describe a face immediately as sad, radiant, bored, even when we are unable to give any other description of the features." (Wittgenstein)
Why can a colony of ants do things beyond any capabilities of the ants they contain? No ant can make a decision, but the colony can make complex ones. Large systems composed of simple mechanisms become more than the sum of their parts. Economies, weather, and immune systems, to name a few, all work this way.
I've done that before without any intent to rob a bank. A person walks by a house, sees the Ring camera on the door. That must mean the person was looking to break in through the front and rob the place?
I guess the _obvious_ intent is they’re planning a heist? Because the following things never happen:
- a security auditor checking for camera blind spots,
- construction planning that requires understanding where there is power,
- a potential customer assessing the security of a bank,
- someone who is about to report an incident preparing to make the “it should be visible from the security camera” argument…
I mean… how did our imagination shrink so fast? I wrote this on my phone. These alternate scenarios just popped into my head.
And I bet our imagination didn't shrink. The AI-pilled state of mind is blocking us from using it.
If you are an engineer and stopped looking for alternative explanations or failure scenarios, you’re abdicating your responsibility btw.
I mean heck, I tend to just look at ceilings in stores and stuff for cameras because I’ve done it since I was a kid in department stores with those big black orbs in the ceiling. To this day it’s almost habit, and also if I’m gonna pick my nose I wanna smile if I’m on camera.
Just today I asked Claude Code to generate migrations for a change, and instead of running the createMigrations script it generated the file itself, including the header that says
// This file was generated with 'npm run createMigrations' do not edit it
When I asked why it tried doing that instead of calling the createMigrations script, it told me it was faster to do it this way. When I asked it why it wrote the header saying it was auto-generated with a script, it told me it was because all the other files in the migrations folder start with that header.
I both agree with you that this is some "mechanistic"/"pattern matching" way of capturing intent (which we cannot disregard, and therefore I agree with you that LLMs can capture intent) and with the people debating with you: this is mostly possible because it is a well-established "trope" that is inarguably well represented in LLM training data.
Also, I think trick questions are useless, because they would trip up the average human too, and therefore prove nothing. So it's not about trying to trick the LLM with gotchas.
I guess we should devise a rare enough situation that is NOT well represented in training data, but in which a reasonable human would be able to puzzle out the intent. Not a "trick", but simply something no LLM can be familiar with, which excludes anything that can possibly happen in plots of movies, or pop culture in general, or real world news, etc.
---
Edit: I know I said no trick questions, but something that still works in ChatGPT as of this comment, and which for some reason makes it trip catastrophically and evidences it CANNOT capture intent in this situation is the infamous prompt: "I need to wash my car, and the car wash is 100m away. Shall I drive or walk there?"
There's no way:
- An average human who's paying attention wouldn't answer correctly.
- The LLM would answer "walk there if it's not raining" or whatever bullshit answer ChatGPT currently gives [1] if it actually understood intent.
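If anyone wants to reproduce this outside the web UI, here's a minimal sketch with the OpenAI Python client (the model name is illustrative, not part of the claim):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use any chat model you have access to
        messages=[{
            "role": "user",
            "content": "I need to wash my car, and the car wash is 100m away. "
                       "Shall I drive or walk there?",
        }],
    )
    print(resp.choices[0].message.content)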
Good point, it is interesting that it fails on that question when it seems it doesn't take a lot of extrapolation/interpretation to determine the answer. Perhaps the issue is that to think of the right answer the LLM needs to "imagine" the process of walking and the state of the person upon arriving. Consistent mental models like that trip up LLMs, but their semantic understanding usually allows them to avoid that handicap.
I asked the question to the default version of ChatGPT and Claude and got the same "Walk" answer, though Opus 4.7 with thinking determined that it was a trick question, and that only driving would make sense.
There are plenty of other ways to access the Anthropic models, eg: OpenRouter. OpenRouter will automatically use Anthropic/Bedrock based on availability and latency.
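For what it's worth, OpenRouter exposes an OpenAI-compatible endpoint, so switching is roughly this (the model slug is illustrative; check their catalog for current Anthropic names):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",  # illustrative slug; verify on openrouter.ai
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)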
You're on to something. It's the lisp machine of it all. Hot reloading requires nothing special, so you can redefine a callback or dependency with ease in the REPL and the system chugs along. You can theoretically do something similar in Ruby, but it's the opposite of elegant; you'd be forced to reimplement methods with different dependencies, etc. It's also a function of being "functional" in the lisp sense: things are lists, and lists can be replaced, functions or otherwise.
The fun way to get a feel for lisp machines is emacs; it's so easy to fall off a language, and especially off hand-coding in a language, if you don't have to.
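Not a lisp machine, but a rough Python analogue of that redefine-in-the-REPL workflow (run in an interactive session; Python resolves `handler` by name on each call, so rebinding it swaps behavior live):

    import threading, time

    def handler(event):
        print("v1 handling", event)

    def loop():
        while True:
            handler("tick")   # global name lookup happens on every call
            time.sleep(1)

    threading.Thread(target=loop, daemon=True).start()

    # Later, in the same session, redefine without restarting anything;
    # the running loop picks up v2 on its next iteration.
    def handler(event):
        print("v2 handling", event)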
Looks cool, but the phrase 'build applications with the flexibility and power of go' made me chuckle. Least damn flexible language in this whole space.
You're not wrong, for sufficiently simple cases it's at a disadvantage. But once things get complicated, it wins by being the only thing that you can get to work without going insane.
And yeah, any serious use completely assumes a Max sub.
It's definitely an approach. I do think in true democratization of the internet, teaching people some tech is inevitable. We just can't have equal access if we retain the classes of user and maker as completely distinct.
The point of MCP is discoverability. A CRUD app is better, except you have to waste context telling your LLM a bunch of details. With MCP you only put into its context the circumstances where it applies, and it can just invoke it. You could write a bunch of little wrapper scripts around each API you want to use and have basically reinvented MCP for yourself.
But this is entirely beside the point. The point of MCP is bundling those exact things into a standardized plugin that's easy for people to share with others.
MCP is useful because I can add one in a single click for an external service (say, my CI provider). And it gives the provider some control over how the agent accesses resources (for example, more efficient/compressed, agent-oriented log retrieval vs the full log dump a human wants). And it can set up the auth token when you install it.
So yeah, the agent could write some of those queries manually (it might need me to point it to the docs), and I could write helpers… or I could just one-click install the plugin and be done with it.
I don’t get why people get worked up over MCP, it’s just a (perhaps temporary) tool to help us get more context into agents in a more standard way than everyone writing a million different markdown files and helper scripts.
"The point of MCP is bundling those exact things into a standardized plugin that’s easy for people to share with others." Like... a CLI/API?
"MCP is useful because I can add one in a single click for an external service" Like... a CLI/API? [edit: sorry, not click, single 'uv' or 'brew' command]
"So yeah, the agent could write some those queries manually" Or, you could have a high-level CLI/API instead of a raw one?
"I don’t get why people get worked up over MCP" Because we tried them and got burned?
"to help us get more context into agents in a more standard way than everyone writing a million different markdown files and helper scripts." Agreed it's slightly annoying to add 'make sure to use this CLI/API for this purpose' in AGENTS.md but really not much. It's not a million markdown files tho. I think you're missing some existing pattern here.
Again, I fail to see how most MCPs are not lazy tools that could be well-scoped, discoverable, safe-to-use CLIs/APIs.
That's literally what they are. It's a dead-simple, self-describing JSON-RPC API that you can understand if you spend 5 seconds looking at it (see the sketch below). I don't get why people get so worked up over it as if it's some big over-engineered spec.
I can run an MCP on my local machine and connect it to an LLM frontend in a browser.
I can use the GitHub MCP without installing anything on my machine at all.
I can run agents as root in a VM and give them access to things via an MCP running outside of the VM without giving them access to secrets.
It's an objectively better solution than just giving it CLIs.
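To make the "5 seconds" point concrete, here's roughly what talking to a stdio MCP server looks like by hand. This is a hand-rolled sketch: the server command is a placeholder, and a real client performs an `initialize` handshake before calling anything.

    import json, subprocess

    proc = subprocess.Popen(
        ["some-mcp-server"],  # placeholder command for a stdio MCP server
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

    def rpc(method, params=None, id=1):
        # The stdio transport is newline-delimited JSON-RPC 2.0.
        msg = {"jsonrpc": "2.0", "id": id, "method": method, "params": params or {}}
        proc.stdin.write(json.dumps(msg) + "\n")
        proc.stdin.flush()
        return json.loads(proc.stdout.readline())

    # The self-describing part: each tool comes back with a name, a
    # description, and a JSON Schema for its arguments.
    print(rpc("tools/list"))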
All true except that CLI tools are composable and don't pollute your context when run via a script. The missing link for MCP would be a CLI utility to invoke it.
How does the agent know what CLIs/tools it has available? If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
The composition argument is compelling, though. But instead of CLIs, what if the agent could write code where the tools are made available as functions?
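Something like this, hypothetically (the `list_tools`/`call_tool` client methods here are stand-ins, not a real library API):

    import json

    def as_function(client, tool):
        # Wrap one MCP-style tool definition as a plain callable the
        # agent can use in generated code.
        def fn(**kwargs):
            return client.call_tool(tool["name"], kwargs)  # assumed client API
        fn.__name__ = tool["name"]
        fn.__doc__ = tool.get("description", "") + "\nargs: " + json.dumps(
            tool.get("inputSchema", {}))
        return fn

    def build_namespace(client):
        # One function per tool; intermediate results stay in the code's
        # variables instead of flowing through the model's context.
        return {t["name"]: as_function(client, t) for t in client.list_tools()}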
> what if the agent could write code where the tools are made available as functions?
Exactly, that would be of great help.
> If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
I see I worded my comment completely wrong... My bad. Indeed, MCP tool definitions should probably be in context. What I dislike about MCP is that the I/O immediately goes into context in the AI agents I've seen.
Example: very early on, when Cursor had just received beta MCP support, I tried a Google Maps MCP from somewhere on the net and asked Cursor "Find me boxing gyms in Amsterdam". The MCP call then dumped a massive HATEOAS-annotated JSON blob, causing Cursor to run out of context immediately. If it had been a CLI tool instead, Cursor could have wrapped it in, say, a `jq` to keep the context clean(er).
I mean what was keeping Cursor from running jq there? It's just a matter of being integrated poorly - which is largely why there was a rethink of "we just made this harder on ourselves, let's accomplish this with skills instead"
The last time I looked at MCPs closely, they appeared to pollute context and just hang there consuming context constantly. Whereas a self-documenting API or CLI tool enabled progressive discovery.
Has this changed?
My uncharitable interpretation is that MCP servers are NJ ("worse is better") design for agents, and high-quality APIs and CLIs are MIT design.
But at the end of the day, MCP is about making it easy/standard to pull in context from different sources. For example, to get logs from a CI run for my PR, or to look at Jira tickets, or to interact with GitHub. Sure, a very simple API baked into the model's existing context is even better (Claude will just use the GH CLI for lots of stuff, no MCP there).
MCP is literally just a way for end users to be able to quickly plug in to those ecosystems. Like, yeah, I could make some extra documentation about how to use my CI provider’s API, put an access token somewhere the agent can use… or I could just add the remote MCP and the agent has what it needs to figure out what the API looks like.
It also lets the provider (say, Jira) get some control over how models access your service instead of writing whatever API requests they feel like.
Like, MCP is really not that crazy. It's just a somewhat standard way to make plugins for getting extra context. Sure, agents are good at writing API requests, but they're not so good at knowing why, when, or what to use.
People get worked up over the word “protocol” like it has to mean some kind of super advanced and clever transport-layer technology, but I digress :p
You're making the convenience argument, but I'm making the architecture argument. They're not the same thing.
You say "a very simple API baked into the model's existing context is even better". So we agree? MCP's design actively discourages that better path.
"Agents are good at writing API requests, but not so good at knowing why, when, or what to use". This is exactly what progressive discovery solves. A good CLI has --help. A good API has introspection. MCP's answer is "dump all the tool schemas into context and let the model figure it out," which is O(N) context cost at all times vs O(1) until you actually need something.
"It's just a standard way to make plugins" The plugin pattern of "here are 47 tool descriptions, good luck" is exactly the worse-is-better tradeoff I'm describing. Easy to wire up, expensive at runtime, and it gets worse as you add more servers.
The NJ/MIT analogy isn't about complexity, it's about where the design effort goes. MCP puts the effort into easy integration. A well-designed API puts the effort into efficient discovery. One scales, the other doesn't.
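To make the O(1)-until-needed shape concrete, here's a hypothetical sketch of the progressive-discovery alternative (tool names and schemas invented for illustration): the agent's standing context carries one line ("run `toolcli list`, then `toolcli describe <name>`"), and a schema enters context only when a tool is about to be used.

    import json, sys

    # Hypothetical tool registry; in practice this would proxy a real backend.
    TOOLS = {
        "ci_logs": {"description": "Fetch CI logs for a PR",
                    "inputSchema": {"pr": "number"}},
        "jira_get": {"description": "Fetch a Jira ticket by key",
                     "inputSchema": {"key": "string"}},
    }

    cmd = sys.argv[1]
    if cmd == "list":
        print("\n".join(TOOLS))                 # names only: tiny, constant cost
    elif cmd == "describe":
        print(json.dumps(TOOLS[sys.argv[2]]))   # full schema only when needed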
I tried using the Microsoft Azure DevOps MCP and it immediately filled up 160k of my context window with what I can only assume was a listing of an absurd number of projects. Now I just instruct it to make direct API calls for the specific resources. I don't know, maybe I'm doing something wrong in Cursor, or maybe Microsoft is just cranking out garbage (possible), but to get that context down I had to uncheck all the myriad features that the MCP supplies.