Hacker News | Kuinox's comments

Em dashes exist in ChatGPT output because existing human text contains them, e.g. journal articles.


Remember, they only measured that the less time you spend on a task, the less you remember it.


> For example, you can think about how french fries taste but unless you can feel the taste you won't be repulsed if someone suggested making salad out of your leftover french fries.

Deep-fried potato salad is a thing, and humans didn't wait for LLMs to invent it.


[Yuk,] a thing where?



You changed the quote: his statement was about a specific technology, an AI that makes 3D characters move like zombies.

The author is also changing the subject of the quote.

He said it reminded him of a disabled friend, and that this technology was an insult to life itself.


It's a specific model run for maths. GPT-5 and Gemini 2.5 still cannot compute an arbitrary-length sum of whole numbers without a calculator. I have a procedurally generated benchmark of basic operations; LLMs get better at it over time, but they still can't solve basic maths or logic problems.

BTW I'm open to selling it, my email is on my hn profile.


Have you ever seen what these arbitrary length whole numbers look like once they are tokenized? They don't break down to one-digit-per-token, and the same long number has no guarantee of breaking down into tokens the same way every time it is encountered.

But the algorithms they teach humans in school to do long-hand arithmetic (which are liable to be the only algorithms demonstrated in the training data) require a single unique numeral for every digit.

This is the same source as the problem of counting "R"'s in "Strawberry".
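If you want to see it yourself, here is a rough sketch using the open tiktoken library (the cl100k_base vocabulary is just an example for illustration; every model's tokenizer splits numbers differently, and sometimes differently depending on surrounding text):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for n in ["1234567890123456789", "9876543210987654321"]:
        tokens = enc.encode(n)
        # Long numbers split into multi-digit chunks, not one digit per token.
        print(n, "->", [enc.decode([t]) for t in tokens])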


That was the initial thinking of everyone I explained this to, and it was my speculation as well, but when you look into its reasoning where it makes the mistake, it correctly extracts the digits out of the input tokens. As I said in another comment, most of the mistakes here happen when it recopies the answer it calculated from the summation table. You can avoid tokenization issues when it extracts the answer by making it output an array of digits of the answer; it will still fail at simply recopying the correct digit.


I recently saw someone who posted a leaked system prompt for GPT-5 (and regardless of the truth of the matter, since I can't confirm the authenticity of the claim, the point I'm making stands alone to some degree).

A portion of the system prompt was specifically instructing the LLM that math problems are, essentially, "special", and that there is zero tolerance for approximation or imprecision with these queries.

To some degree I get the issue here. Most queries are full of imprecision and generalization, and the same type of question may even get a different output if asked in a different context, but when it comes to math problems, we have absolutely zero tolerance for that. To us this is obvious, but looking from the outside it is a bit odd that we are so loose and sloppy with, well, basically everything we do, but then we put certain characters in a math format and become hyper-obsessed with ultra precision.

The actual system prompt section for this was funny though. It essentially said "you suck at math, you have a long history of sucking at math in all contexts, never attempt to do it yourself, always use the calculation tools you are provided."


o/~ Mathematics keeps your intellect intact / many answers should be carefully exact

But for daily application, use a close approximation, round it off.. o/~


> But the algorithms they teach humans in school to do long-hand arithmetic (which are liable to be the only algorithms demonstrated in the training data) require a single unique numeral for every digit.

But humans don't see single digits, we learn to parse noisy visual data into single digits and then use those single digits to do the math.

It is much easier for these models to understand what the number is based on the tokens and parse that than it is for a visual model to do it from an image, so getting those tokens streamed straight into its system makes the problem it has to solve much, much simpler than what humans do. We weren't born able to read numbers; we learn that.


i'd wager your benchmark problems require cumbersome arithmetic or are poorly worded / inadequately described. or, you're mislabeling them as basic math and logic (a domain within which LLMs have proven their strengths!)

i only call this out because you're selling it and don't hypothesize* on why they fail your simple problems. i suppose an easily aced bench wouldn't be very marketable


This is a simple sum of 2 whole numbers; the numbers are simply big.

Most of the time they build a correct summation table but fail to correctly copy the sum into the final result. That is not a tokenisation problem (you can change the output format to make sure of it). I have a separate benchmark that tests specifically this: when the input is too large, the LLMs fail to accurately copy the correct token. I suppose the positional embeddings are not perfectly learned, which sometimes causes a mistake.

The prompt is quite short, it uses structured output, and I can generate a nice graph of the % of correct responses across question difficulty (which is just the total digit count of the input numbers).

LLMs have a 100% success rate on these sums until they reach a frontier; past that, their accuracy collapses at various speeds depending on the model.
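To give an idea of the shape of it, here is a rough sketch of how such a procedurally generated case can be built (illustrative only, not the actual benchmark code; ask_llm is a hypothetical stub for whatever API call you use):

    import random

    def make_case(total_digits: int) -> tuple[str, str]:
        # Split the difficulty (total digit count of the inputs, >= 2) across the two operands.
        d1 = random.randint(1, total_digits - 1)
        d2 = total_digits - d1
        a = random.randint(10 ** (d1 - 1), 10 ** d1 - 1)
        b = random.randint(10 ** (d2 - 1), 10 ** d2 - 1)
        return f"Compute {a} + {b}. Answer with the digits only.", str(a + b)

    def accuracy(ask_llm, total_digits: int, n: int = 20) -> float:
        # ask_llm: hypothetical callable taking a prompt and returning the model's answer string.
        cases = [make_case(total_digits) for _ in range(n)]
        return sum(ask_llm(prompt).strip() == answer for prompt, answer in cases) / n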


This is close to what the Apple paper [1] also found on constraint satisfaction problems. As an example, on Towers of Hanoi, past a frontier, accuracy collapses.

Even when the algorithm steps are laid out precisely, they cannot be followed. Perhaps LLMs should be trained on Turing machine specs and be given a tape, lol.

Constraint satisfaction and combinatorics are where the search space is exponential and the techniques are not formalized (not enough data in the training set); they remain hard for machines, as seen in Problem 6 of the IMO, which could not be solved by LLMs. I suspect there is an aspect of human intelligence here which is not yet captured in LLMs.

[1] - https://machinelearning.apple.com/research/illusion-of-think...


Have you tried greedy decoding (temp 0) in aistudio?

The temp 0.7-1.0 defaults are not designed for reconstructing context with perfect accuracy.


I always use the lowest temperature that I can input. But GPT-5 doesn't support a temperature setting. You'll get something like:

{ "error": { "message": "Unsupported value: 'temperature' does not support 0.0 with this model. Only the default (1) value is supported.", "type": "invalid_request_error", "param": "temperature", "code": "unsupported_value" } }


I can't see why that's necessary when it can call a tool. Everyone uses a calculator. A logic problem it can solve with reasoning; perhaps it's not the smartest, but it can solve logic problems. All indications are that it will continue to become smarter.


Simple maths problems are simple logic problems. Here it doesn't even have to come up with a reasoning; it has probably already memorised how to solve sums. Yet it fails at that, which shows it cannot solve logic problems if there are too many steps.

> All indications are that it will continue to become smarter.

I'm not disputing that, every new model scores better on my benchmark, but right now none truly "solves" one of these small logic problems.


If it can frame the question for the tool, it therefore has the logic (whether that was static recall or deductive).

LLMs struggle with simple maths by nature of their architecture, not due to a lack of logic. Yes, they struggle with logic questions too, but the two aren't directly related here.


Most of the failures on these simple logic questions come from the inability to simply copy data accurately. Logic is too abstract to be measured, but this single bench shows something getting in its way. I have another bench that shows the LLMs making basic mistakes that could easily be avoided with minimal logic and observation.


> LLMs struggle with simple maths by nature of their architecture, not due to a lack of logic.

No, if it were good at logic it would have overcome that tiny architectural hurdle; it's such a trivial process to convert tokens to numbers that it is ridiculous to suggest that is the reason it fails at math.

The reason it fails at math is that it fails at logic, and math is the most direct form of logic we have. It doesn't fail at converting between formats: it can convert "strawberry" to its correct Base64 encoding, meaning it does know exactly what letters are there; it just lacks the logic to actually understand what "count letters" means.


It can't see that data so how can it convert it? It can only see the token input.

An analogy (probably poor) is like asking a human to see UV light. We can do so, but only with tools or by removing our lens.

The fact that SOTA models (not yet publicly available) can achieve gold at the IMO implies otherwise.


It's because math problems make it easy to check that the solution is correct, which allows a lot of 'search': https://yellow-apartment-148.notion.site/AI-Search-The-Bitte...


> GPT-5 and Gemini 2.5 still cannot compute an arbitrary-length sum of whole numbers without a calculator.

Neither can many humans, including some very smart ones. Even those who can will usually choose to use a calculator (or spreadsheet or whatever) rather than doing the arithmetic themselves.


> Neither can many humans...

1) GPT-5 is advertised as "PhD-level intelligence". So, I take OpenAI (and anyone else who advertises their bots with language like this) at their word about the bot's capabilities and constrain the set of humans I use for comparison to those who also have PhD-level intelligence.

2) Any human who has been introduced to long addition will absolutely be able to compute the sum of two whole numbers of arbitrary length. You may have to provide them a sufficiently strong incentive to actually do it long-hand, but they absolutely are capable because the method is not difficult. I'm fairly certain that most adult humans [0] (regardless of whether or not they have PhD-level intelligence) find the method to be trivial, if tedious.

[0] And many human children!
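For reference, the long-addition method really is just a right-to-left walk with a carry; a rough Python sketch of the schoolbook algorithm (illustrative, not anyone's actual code):

    def long_add(a: str, b: str) -> str:
        # Schoolbook long addition on decimal strings: right to left, one digit at a time, carrying.
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        digits, carry = [], 0
        for da, db in zip(reversed(a), reversed(b)):
            carry, d = divmod(int(da) + int(db) + carry, 10)
            digits.append(str(d))
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    assert long_add("999", "1") == "1000"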


I have a PhD, in mathematics, from a top university. If you give me, say, 100 10-digit numbers to add up and tell me to do the job in my head then I will probably get the answer wrong.

Of course, if you give me 100 10-digit numbers to add up and let me use a calculator, or pencil and paper, then I will probably get it right.

Same for, say, two 100-digit numbers. (I can probably get that one right without tools if you obligingly print them monospaced and put one of them immediately above the other where I can look at them.)

Anyway, the premise here seems to be simply false. I just gave ChatGPT and Claude (free versions of both; ChatGPT5, whatever specific model it routed my query to, and Sonnet 4) a list of 100 random 10-digit numbers to add up, with a prompt encouraging them to be careful about it but nothing beyond that (e.g., no specific strategies or tools to use), and both of them got the right total. Then I did the same with two 100-digit numbers and both of them got that right too.


https://i.imgur.com/l2elIAv.png

Difficulty is the number of digits. Small models struggle with 10-digit numbers; Gemini and GPT-5 are very good recent models. Gemini starts failing before 40 digits; GPT-5 (the one via the API; the online chat version is worse and I didn't test it) can do more than 120 digits (at that point it's pointless to test for more).


My tests of GPT-5 were using the online chat version.

Of course, I only ran it once; I can't at all rule out the possibility that sometimes it gets it wrong. But, again, the same is true of humans.


The online version is way worse; it also has a router that could route the request to a random model.


Right, but most (competent) humans will reliably use a calculator. It's difficult to get these models to reliably make lots of tool calls like that.


I do think that competent humans can solve any arbitrary sum of 2 whole numbers with a pen, paper, and time. LLMs can't do that.


That’s interesting, you added a tool. You did not just leave it to the human alone.


I'm not the fellow you replied to, but I felt like stepping in.

> That’s interesting, you added a tool.

The "tool" in this case, is a memory aid. Because they are computer programs running inside a fairly-ordinary computer, the LLMs have exactly the same sort of tool available to them. I would find a claim that LLMs don't have a free MB or so of RAM to use as scratch space for long addition to be unbelievable.


The fact that an LLM is running inside an ordinary computer does not mean that it gets to use all the abilities of that computer. They do not have megabytes of scratch space merely because the computer has a lot of memory.

They do have something a bit like it: their "context window", the amount of input and recently-generated output they get to look at while generating the next token. Claude Sonnet 4 has 1M tokens of context, but e.g. Opus 4.1 has only 200k and I think GPT-5 has 256k. And it doesn't really behave like "scratch space" in any useful sense; e.g., the models can't modify anything once it's there.


Well, the GPT-5 context window offers roughly a little more than a MB.
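(Back-of-envelope, assuming roughly 4 characters per token at one byte each: 256,000 tokens × 4 bytes ≈ 1,024,000 bytes, i.e. about 1 MB.)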


LLMs already get enough working memory; they do not fail because of a lack of working space.


Mozilla has been building local AI for years, and it has been great.

We got tons of local AI services of acceptable quality thanks to them: translation, text to speech, speech to text.

Let them cook; I'm sure a few of these features will be really useful.

Better searchability of your history sounds very practical.


Local translation has been pretty useful for me. It's obviously not as good as the fat commercial models used by e.g. Google Translate, but it gets the point across often enough that I have almost stopped using those commercial engines. What's interesting is that this only useful ML service came out of a collaboration with several European universities and was mostly an academic project, polished by a commercial interest into a useful service. It makes an interesting comparison with those pipe dreams conjured up by Mozilla business suits that set fire to millions of dollars and ended up with nothing.


I use Firefox translate a lot and like yourself, it’s exceedingly rare that I reach for other services.


Agree, the local translation is borderline unusable. I've had to change browsers as a result.


You say you agree, but the content of your comment seems to imply you're saying the opposite of GP?


Wouldn't a flamethrower be crazy effective against a drone?


at a distance close enough to hit the drone, its shrapnel and explosives could probably hit you. and you're sitting on a big tank of flammable material.

it's like the dudes that throw their rifle at the drone 6 feet away, hit it, cause it to explode... but they still take shrapnel to the head and neck anyway because it's only 6 feet away.

plus what happens if you miss?

and the ammo is big, heavy, and requires special tanks. non-starter of an idea.


You don't need to burn the drone, just the fiber cable.


Avalonia is cross-platform, unlike WPF.


The author used Copilot in Rider; tbh, Rider's is one of the worst Copilot integrations. Which LLM model was used? VS Code & VS Copilot let you select it.

I suspect these benches were run on the default model, GPT-4o, which is now more than a year old.


Oh, good point. Which one should I use? I also ran it with o3 and Claude Sonnet, but the results were similar or worse. (Some are in the repo.)


GitHub wasn't used there. It was on Codeberg.

