Hacker News | riskassessment's comments

Can someone explain to me how and in what way Claude Code is considered "agentic" and Cursor/Gemini CLI/Antigravity are not?


Gemini CLI is definitely agentic; Cursor and Antigravity have agentic tools.

Claude Code is simply considered the best agentic tool, not the only one lol


Stealthily degrade the model or stealthily constrain the model with a tighter harness? These coding tools like Claude Code were created to overcome the shortcomings of last year's models. Models have gotten better but the harnesses have not been rebuilt from scratch to reflect improved planning and tool use inherent to newer models.

I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.


I've been using pi.dev since December. The only significant change to the harness in that time which affects my usage is the availability of parallel tool calls. Yet Claude models have become unusable in the past month for many of the reasons observed here. Conclusion: it's not the harness.

I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.


Are you using Pi with a cloud subscription, or are you using the API?


Out of curiosity, what can parallel tool calls do that one can't do with parallel subagents and background processes?


How would you do a parallel subagent if you don't have parallel tool calls? Sub agents are tools.


You find that pay-per-use APIs degraded too?


Yes, absolutely.


Agree: it is Anthropic's aggressive changes to the harnesses and to the hidden base prompt we users do not see. Clearly intended to give long-right-tail users a haircut.


I feel like a "feature/model freeze" may be justified

just call it something like "[month] [year] edition" and work on the next release

users spend effort arriving at a narrow peak of performance, but every change keeps moving the peak sideways


The changes to reduce inference costs are intentional. The last thing you're going to do is let users linger on an older version that spends much more. That's essentially what's going on, with layers upon layers of social engineering on top.


Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.


> Love your point. Instructions found to be good by trial and error for one LLM may not be good for another LLM.

Well, according to this story, instructions refined by trial and error over months might be good for one LLM on Tuesday, and then be bad for the same LLM on Wednesday.


For what it's worth, early statins were originally cleared based only on the evidence that they lower cholesterol without longer term studies showing a reduction in mortality. Of course there is now plenty of evidence showing statins improve overall endpoints.


That’s true.

Similarly, there were other drugs that lowered cholesterol that didn’t show a significant reduction in coronary events. As we later learned, it’s not nearly as simple as “cholesterol bad.”


That doesn't sound like the same thing at all.


Nor is that inequality an oddity at all. If you think NaN should equal NaN, that thought probably stems from the belief that NaN is a singular entity, which is a misunderstanding of its purpose. NaN rather signifies a specific number that is not representable as a floating point. Two specific numbers that cannot be represented are not necessarily equal, because they may have resulted from different calculations!

I'll add that, if I recall correctly, in R the statement NaN == NaN evaluates to NA, which basically means "it is not known whether these numbers equal each other", a more reasonable result than False.
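To make that concrete, here's a minimal Python sketch of the IEEE 754 comparison behavior (Python shown for illustration; the R behavior described above differs):

```python
import math

a = float("nan")
b = float("nan")

# IEEE 754: NaN compares unequal to everything, including itself
print(a == b)   # False
print(a != a)   # True

# The reliable way to detect a NaN is an explicit predicate,
# not an equality test
print(math.isnan(a))  # True
```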


> "it is not known whether these numbers equal each other"

Equality, among other operations, is not defined for these inputs. NaNs really are a separate type of object embedded inside another object's value space. So you get the rare programmer's gift of being able to construct a statement that is not always realizable based solely on the values of your inputs.


It's the only "primitive type" that does that. If I deserialize data from the wire, I'll be very surprised when the same bits deserialize as unequal values. If it cannot be represented, then throwing makes more sense than trying to represent it.


Other primitive types also do this, but this is not clearly visible from high-level programming languages, because most HLLs have only incomplete support for the CPU hardware.

If you do a (signed) integer operation, the hardware does not fit the result into a register of the size expected by an HLL; the result has extra bits elsewhere, typically in a "flags" register.

So the result of an integer arithmetic operation has an extra bit, usually named the "overflow" bit. That bit is used to encode a not-a-number value, i.e. if the overflow bit is set, the result of the operation is an integer NaN.

For correct results, one should check whether the result is a NaN, which is called checking for integer overflow (unlike for FP, the integer execution units do not distinguish between true overflow and undefined operations, i.e. there are no distinct encodings for infinity and for NaN). After checking that the result is not a NaN, the extra bit can be stripped from the result.

If you serialize an integer number for sending it elsewhere, that implicitly assumes that wherever your number was produced, someone has tested for overflow, i.e. that the value is not a NaN, so the extra bit was correctly stripped from the value. If nobody has tested, your serialized value can be bogus, the same as when serializing a FP NaN and not checking later that it is a NaN, before using one of the 6 relational operators intended for total orders, which may be wrong for partial orders.
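The check-then-strip discipline described above can be sketched in Python (whose integers are unbounded, so the 64-bit wrap is simulated by hand; `add_i64_checked` is a made-up name for illustration, not a real API):

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def add_i64_checked(a, b):
    """Simulate 64-bit signed addition plus the CPU's overflow bit."""
    exact = a + b  # Python ints are exact, so compute the true sum first
    overflow = not (INT64_MIN <= exact <= INT64_MAX)
    # Wrap to 64 bits, as the hardware register would hold the result
    wrapped = ((exact - INT64_MIN) % 2**64) + INT64_MIN
    return wrapped, overflow

print(add_i64_checked(1, 2))          # (3, False)
print(add_i64_checked(2**62, 2**62))  # (-9223372036854775808, True)
```

Only after confirming the overflow flag is clear is the wrapped value safe to serialize.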


> They teach us Scientific Realism in school.

I'd argue the opposite is true for anyone who has studied statistics which is largely built on Instrumentalism (think George Box: 'All models are wrong, but some are useful') and Popperian falsification (Null Hypothesis testing). We are absolutely taught to treat models as predictive tools rather than metaphysical truths.


Statistics is often presented as metaphysical truth too. Or at least that was my experience in engineering school.

And taking fluid dynamics, we used the Reynolds number, which is a made-up ratio that helps with decision making... It's not like, when we answered questions, we could address the grey area we are discussing.

If I had to guess, I think it's due to Western civilization being built on Platonism (and even Aristotle was infected). Our science and morality were later built on Platonic realism. Only in the last 100-ish years are we starting to get over it.


I don't understand this reasoning. Randomizing people to AI vs standard of care is expensive and risky. Checking whether the AI can pass hypothetical scenarios seems like a perfectly reasonable approach to researching the safety of these models before running a clinical trial.


You would present those hypothetical scenarios to doctors too, and then the analysis of results would be done by reviewers who don't know whether each answer came from an AI or a doctor.


From the paper

> Three physicians independently assigned gold-standard triage levels based on cited clinical guidelines and clinical expertise, with high inter-rater agreement


You're misunderstanding. What this paper did: those three physicians set a ground truth to compare the AI responses against.

What people in this thread are asking for: evaluate a set of doctors on those same cases, and compare doctor vs. AI accuracy.


The issue is that those hypothetical scenarios do not have to look like how patients actually interact with the tool.

Real-life use is full of ill-posed questions, open-ended statements, inaccurate assessments of symptoms, and conclusory remarks sprinkled in between. Real use of chatbots for health by non-clinicians looks very different from scenario-based evaluation.


You can start by comparing "doctor" care vs "doctor who also uses AI" care


The ThinkPad shell could have undergone elastic deformation, which could reduce peak force.


I was expecting a system like Leibniz notation, Boolean Algebra, Begriffsschrift, or the notation system in Principia Mathematica


> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

Everyone in R uses data.frame, because tibble (and data.table) inherit from data.frame. This means that "first class" (base R) functions work directly on tibbles/data.tables. It also makes it trivial to convert between tibbles, data.tables, and data.frames.


> html

Would be willing to bet this is the issue. Adding HTML files to context for Gemini models results in a ton of token use.


why?

EDIT: why must users care?


Gotta learn all the quirks of the model before it's replaced in 8 minutes.


Quirks? like context window?


I'm saying it's egregious to expect all users to know that an HTML document, for some reason, uses an enormous amount of context in an LLM designed specifically for working with code.



The accepted answer is one that doesn't care about the questioner's use case and instead gives a pretty excessive "Don't do it"


It does also give the right solution, using an xml parser.
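For what it's worth, the parser approach can be sketched with Python's stdlib html.parser (the class name and sample markup here are made up for illustration; the accepted answer may recommend a different library):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, the kind of task
    people often reach for regex to do."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="https://example.com">here</a>.</p>')
print(parser.links)  # ['https://example.com']
```

Unlike a regex, the parser handles attribute order, whitespace, and nesting without any extra effort.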


We don’t know the use case.

Maybe the questioner is also in full control of the HTML creation and they don’t need a parser for all possible HTML edge cases.


Maybe they are, but they would also need to ensure a well-defined subset of HTML, and also show that the subset is a regular (Chomsky Type 3) grammar.

It seems that even the very conceptually simple example given by the questioner is impossible.

