Something I noticed a long time ago is that going from 90% correct to 95% correct is not a 5% difference, it’s a 2x difference: the error rate halves, from 10% to 5%. As you approach 100%, shaving away the last few hundredths of a percent of error makes a qualitative difference.
“Computer” used to be a job, and human error rates were on the order of 1-2% per operation, no matter what level of training or experience the person had. Work had to be done in triplicate and cross-checked if it mattered.
Digital computers are down to error rates of roughly 10^-15 to 10^-22 and are hence treated as nearly infallible. We regularly write routines where a trillion steps have to execute flawlessly in sequence for things not to explode!
AIs can now output maybe 1K to 2K tokens in a sequence before they make a mistake. That’s 99.9% to 99.95% per-token accuracy! Better than a human already.
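To make that arithmetic concrete, here is a back-of-the-envelope sketch. It is my own toy model (independent per-step errors, rough rates), not anything rigorous:

```python
# Toy model: each step/token is independently wrong with probability p.
# Then the expected number of steps before the first error is ~1/p,
# and the chance of getting n steps in a row right is (1 - p)**n.

def expected_error_free_length(p):
    """Mean number of steps until the first error (geometric distribution)."""
    return 1 / p

def prob_clean_run(p, n):
    """Probability of n consecutive error-free steps."""
    return (1 - p) ** n

for label, p in [
    ("human 'computer' (~1% per step)", 1e-2),
    ("LLM (~0.05% per token)", 5e-4),
    ("digital computer (~1e-15 per op)", 1e-15),
]:
    print(f"{label:34s} expected clean run ~ {expected_error_free_length(p):,.0f} steps; "
          f"P(1,000 clean steps) = {prob_clean_run(p, 1000):.3g}")
```

On this toy model, a 0.05% per-token error rate gives exactly the 1K-2K clean runs I'm describing, while a 1% human error rate caps out around 100 steps.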
Don’t believe me?
Write me a 500-line program with pen and paper (not pencil!) and have it work the first time!
I’ve seen Gemini Pro 2.5 do this in a useful way.
As the error rates drop, the length of usefully correct sequences will get to 10K, then 100K, and maybe… who knows?
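Rough arithmetic on what that takes, using the same toy independence model as above: for a run of N tokens to come out clean with 50/50 odds, the per-token error rate needs to be about ln(2)/N.

```python
import math

# (1 - p)**N = 0.5  =>  p ≈ ln(2) / N for small p.
for n in (1_000, 10_000, 100_000):
    p = math.log(2) / n
    print(f"N = {n:>7,} tokens: need p ≈ {p:.1e}  (per-token accuracy ≈ {100 * (1 - p):.4f}%)")
```

So 100K-token clean runs need roughly another two nines of per-token accuracy beyond where we are now.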
There was a press release just today about Gemini Diffusion, which can alter already-generated tokens to correct mistakes.
Your expectations here are too low. People used to enter machine code on switches and punched paper tape, so yes, they made sure it worked the first time. Later, people held code reviews by marking up printouts of code, and software shipped in boxes that couldn't be changed until the next year.
Programmers who "iterate" buggy shit for 10 rounds until they get it right are a post-Google push-update phenomenon.
I don't think the length you're talking about is that much of an issue. As you say, depending on how you measure it, LLMs are already better than humans at remaining accurate over a long span of text.
The issue seems to be more in the intelligence department. You can't really leave them in an agent-like loop with compiler/shell output and expect them to make meaningful progress on their task past some small number of steps.
Improving their initial error-free token length is solving the wrong problem. I'd take a model with less initial accuracy than a human, as long as it were equally capable of iterating on its solution over time.
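To be concrete about what I mean by an agent-like loop, something like this sketch. It is purely illustrative; `llm.generate` is a hypothetical stand-in for whatever model API you're using, not a real library call:

```python
import subprocess

def agent_loop(llm, task, max_rounds=10):
    """Illustrative agent loop: run the model's code, feed errors back, repeat.
    `llm.generate` is a hypothetical stand-in for a real model API."""
    code = llm.generate(task)
    for _ in range(max_rounds):
        result = subprocess.run(["python", "-c", code],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return code  # ran cleanly; call it done
        # hand the compiler/shell output back and ask for a fix
        code = llm.generate(f"{task}\n\nThe last attempt failed with:\n{result.stderr}")
    return None  # stalled after max_rounds; this is where models stop making progress
```

The initial `llm.generate` call is where long error-free output helps; the loop is where the real bottleneck shows up.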
Error rates will drop.
Useful output length will go up.