More

panarky · 2025-11-19T01:02:10 1763514130

A paywall warning would be appreciated.

panarky · 2025-11-19T00:06:29 1763510789

The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.

It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1.

This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.

To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.

The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).

This isn't an incremental gain, it's a step-change leap in reducing hallucinations.

And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.

That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).

Xss3 · 2025-11-19T11:59:02 1763553542

This comment was written by an AI specifically instructed to be more concise than usual.

vitorgrs · 2025-11-19T04:06:32 1763525192

I usually ask a simple question that ALL the models get wrong: List of mayor of my city [Londrina]. ALL the models (offine) get wrong. And I mean, all the models. The best that I could, it's o3 I believe, saying it couldn't give a good answer for that, and told to access the city website.

Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.

This should be a simple answer, because all the data is on wikipedia, that certainly the models are trained on, but somehow most models don't manage to give that answer right, because... it's just a irrelevant city in a huge dataset.

But somehow, Gemini 3 did it.

Edit: Just asked "Cool places to visit in Londrina" (In portuguese), and it was also 99% right, unlike other models, who just create stuff. The only thing wrong here, it mentioned sakuras in a lake... Maybe it confused with Brazilian ipês, which are similar, and indeed the city it's full of them.

It seems to have a visual understanding, imo.

guiambros · 2025-11-19T04:53:06 1763527986

Ha, I just did the same with my hometown (Guaiba, RS), a city that is 1/6th of Londrina, and its wikipedia page in English hasn't been updated in years, and still has the wrong mayor (!).

Gemini 3 nailed on the first try, included political affiliation, and added some context on who they competed with and won over in each of the last 3 elections. And I just did a fun application with AI Studio, and it worked on first shot. Pretty impressive.

(disclaimer: Googler, but no affiliation with Gemini team)

usef- · 2025-11-19T04:59:26 1763528366

Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).

I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.

arach · 2025-11-19T04:48:02 1763527682

thanks for sharing, very interesting example

belter · 2025-11-19T08:26:04 1763540764

I asked Claude, and had no issues with the answer including mentioning the impeached Antonio Belinati...

dbbk · 2025-11-19T11:57:52 1763553472

Aren't you just describing tool calls?

calmoo · 2025-11-19T11:48:45 1763552925

Your comment is AI generated

red75prime · 2025-11-19T10:40:33 1763548833

> To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

You say "probabilistic generation" like it's some kind of a limitation. What is exactly the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.

coldtea · 2025-11-19T11:33:38 1763552018

>This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.

You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.

Loos like AI slop

legel · 2025-11-19T02:57:22 1763521042

Thanks for reporting these metrics and drawing the conclusion of an underlying breakthrough in search.

In his Nobel Prize winning speech, Demis Hassabis ends by discussing how he sees all of intelligence as a big tree-like search process.

https://youtube.com/watch?v=YtPaZsasmNA&t=1218

rafaelmn · 2025-11-19T04:50:28 1763527828

It tells me that the benchmark is probably leaking into training data, and going to the benchmark site :

> Model was published after the competition date, making contamination possible.

Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.

Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.

thomasahle · 2025-11-19T10:02:04 1763546524

From my understanding, Google put online the largest RL cluster in the world not so long ago. It's not surprising they do really well on things that are "easy" to RL, like math or SimpleQA

Der_Einzige · 2025-11-19T08:24:30 1763540670

You clearly AI generated this comment.

thejarren · 2025-11-19T00:15:57 1763511357

[flagged]

panarky · 2025-11-19T00:38:36 1763512716

Hmmm, I wrote those words myself, maybe I've spent too much time with LLMs and now I'm talking like them??

I'd be interested in any evidence-based arguments you might have beyond attacking my writing style and insinuating bad intent.

I found this commenter had sage advice about how to use HN well, I try to follow it: https://news.ycombinator.com/item?id=38944467

thejarren · 2025-11-19T05:44:29 1763531069

I’ll take you at your word, sorry for the incorrect callout. Your comment format appeared malicious, so my response wasn’t an attempt at being “snarky”, just acting defensively. I like the HN Rules/Guidelines.

momojo · 2025-11-19T02:26:35 1763519195

You mentioned "step change" twice. Maybe a once over next time? My favorite Mark Twain quote is (very paraphrased) "My apologies, had I more time, I would have written a shorter letter".

versteegen · 2025-11-19T08:52:38 1763542358

I thought the repetition was intentional.

ciberado · 2025-11-19T06:38:53 1763534333

This is something that is happening to me too, and frankly I'm a little concerned. English is not my first language, so I use AI for checking and writing many things. And I spend a lot of time with coding tools. And now I need sometimes to do a conscient effort to avoid mimicking some LLM patterns...

Aeolun · 2025-11-19T06:16:12 1763532972

“If you gaze long into an abyss, the abyss also gazes into you.”

gnabgib · 2025-11-19T06:21:45 1763533305

Is that you Nietzsche? Or are you Magog https://andromeda.fandom.com/wiki/Spirit_of_the_Abyss

ineedasername · 2025-11-19T04:46:11 1763527571

1) Models learn these patterns from common human usage. They are in the wild, and as such there will be people who use them naturally.

2) Now, given its for-some-reason-ubiquitous choice by models, it is also a phrasing that many more people are exposed to, every day.

Language is contagious. This phrasing is approaching herd levels, meaning models trained from up-to-the-moment web content will start to see it as less distinctly salient. Eventually, there will be some other high-signal novel phrase with high salience, and the attention heads will latch on to it from the surrounding context, and then that will be the new AI shibboleth.

It's just how language works. We see it in the mixes between generations when our kids pick up new lingo, and then it stops being in-group for them when it spreads too far.. Skibidi, 6 7, etc.

It's just how language works, and a generation ago the internet put it on steroids. Now? Even faster.

sindriava · 2025-11-19T00:33:58 1763512438

You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many peoples' brains.

roncesvalles · 2025-11-19T02:43:52 1763520232

Also discounting the fact that people actually do talk like that. In fact, these days I have to modify my prose to be intentionally less LLM-like lest the reader thinks it's LLM output.

panarky · 2025-11-18T22:33:55 1763505235

> I can't pay for it and it has 20 minutes of use

You can't provide an API key for a project that has billing enabled?

panarky · 2025-11-18T18:59:30 1763492370

The jump in ARC-AGI and MathArena suggests Google has solved the data scarcity problem for reasoning, maybe with synthetic data self-play??

This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.

If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.

panarky · 2025-11-18T15:49:33 1763480973

Claude Opus is $15 input, $75 output.

panarky · 2025-11-17T00:59:28 1763341168

The thing about bubbles is that it's devilishly difficult to tell the difference between a total sham and a genuine regime shift while it's happening because the hype level is similar for both.

heresie-dabord · 2025-11-17T01:49:16 1763344156

Sure, it certainly could be difficult. But this is often because of information assymetry — the purveyors are making big claims about the dragon in their garage. [1]

If your intuition told you that there ought to be some evidence for grand claims, you would be right to be suspicious of the continuing absence of suitable evidence.

[1] _ http://people.whitman.edu/~herbrawt/classes/110/Sagan.pdf

panarky · 2025-11-12T19:16:34 1762974994

93% of American drivers think they're better drivers than the median driver [0].

This overconfidence causes humans to take unnecessary risks that not only endanger themselves, but everyone else on the road.

After taking several dozen Waymo rides and watching them negotiate complex and ambiguous driving scenarios, including many situations where overconfident drivers are driving dangerously, I realize that Waymo is a far better driver than I am.

Waymos don't just prevent a large percentage of accidents by making fewer mistakes than a human driver, but Waymos also avoid a lot of accidents caused by other distracted and careless human drivers.

Now when I have to drive a car myself, my goal is to try to drive as much like a Waymo as I can.

[0] https://gwern.net/doc/psychology/1981-svenson.pdf

Night_Thastus · 2025-11-12T19:35:54 1762976154

It's not just overconfidence, it's selfishness.

Speeding feels like "I'm more important than everyone else and the safety of others and rules don't apply to me" personally. It's one thing to match the speed of traffic and avoid being a nuisance (that I'm fine with) - a lot of people just think they're the main character and everyone else is just in their way.

It's a problem that goes way beyond driving, sadly.

treis · 2025-11-12T20:11:02 1762978262

Eh this doesn't mean much. The quality of drivers is pretty bimodal.

You have the group that's really bad and does things like drive drunk, weave in and out of traffic, do their makeup and so on.

The other group generally pays attention and tries to drive safely. This is larger than the first group and realistically there's not all that much difference within the group.

If you're in group two you will think you're above average because the comparison is to the crap drivers in group one.

panarky · 2025-11-12T16:33:02 1762965182

If a science experiment works and is transformational can be worth a trillion dollars, how much is it worth if it has a 5% chance of being transformational?

startupsfail · 2025-11-12T16:43:29 1762965809

What if it is a 99% chance of being transformational and the results of that transformation are completely unpredictable?

WalterSear · 2025-11-12T19:11:33 1762974693

What it's transformational but takes a decade or so, instead of a year or so?

It's not like this isn't following exactly the same hype cycle as every other technological transformation.

panarky · 2025-11-11T19:18:10 1762888690

Right-click on the Firefox icon and choose "Open Profile Manager".

Or add "-p" to the startup command to do the same thing without right-clicking:

    firefox -p

panarky · 2025-11-06T16:05:34 1762445134

Nearly everything in life is non-deterministic.

Most things that are generally helpful and beneficial are not 100% helpful and beneficial 100% of the time.

I used GPT-4 as a second opinion on my medical tests and doctor's advice, and it suggested an alternate diagnosis and treatment plan that turned out to be correct. That was incredibly helpful and beneficial.

You're replying to a person who had a similar and even more helpful and beneficial experience because they're alive today.

Pedantically pointing out that a beneficial and helpful thing isn't 100% beneficial and helpful 100% of the time doesn't add anything useful to the conversation since everyone here already knows it's not 100%.