My initial impression is that this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations).
OpenAI will probably be above 60% within three months, if not immediately, at this $1000-per-question level of compute (which, tbh, is the right way to do it: we should throw compute at problems whenever possible, since that's the main advantage of silicon intelligence).
Their own admission that intelligence is a meaningless metric without a bound on compute is one of the main reasons AI will overtake human intelligence soon: simple scaling is very effective.
I mean, they have a verifier, so couldn't they get to 90% just by having the net generate random candidates and testing each against the verifier until one is numerically correct? I think the final solve rate is less important; the generality of the approach matters more.
But it depends on how many attempts you let it generate. The right comparison is to spend the same test-time RL compute on pure generation and compare success rates (if you generate for long enough, you will eventually hit the answer by chance).
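A minimal sketch of that compute-matched baseline, assuming hypothetical generate(problem) and verify(problem, answer) interfaces (neither is a real API, just stand-ins for "sample a candidate from the model" and "check it numerically"):

```python
import random

def solve_by_rejection(problem, generate, verify, budget):
    """Sample candidates until the verifier accepts one or the budget runs out."""
    for attempt in range(budget):
        candidate = generate(problem)
        if verify(problem, candidate):
            return attempt + 1  # number of samples spent on a solve
    return None  # unsolved within budget

def compare_success_rates(problems, generate, verify, budget):
    """Fraction of problems the pure generate-and-verify baseline solves
    at a fixed per-problem sample budget (the compute-matched comparison)."""
    solved = sum(
        1 for p in problems
        if solve_by_rejection(p, generate, verify, budget) is not None
    )
    return solved / len(problems)

# Toy usage: answers are integers in [0, 999], so the chance-level solve rate
# at a given budget is roughly 1 - (999/1000)**budget (~63% at 1000 guesses).
if __name__ == "__main__":
    toy_problems = [{"answer": random.randrange(1000)} for _ in range(50)]
    gen = lambda p: random.randrange(1000)
    ver = lambda p, a: a == p["answer"]
    for budget in (10, 100, 1000):
        rate = compare_success_rates(toy_problems, gen, ver, budget)
        print(f"budget={budget:>4}: solve rate ~ {rate:.2f}")
```

The point of the toy numbers is just that any success rate quoted for a verifier-in-the-loop system only means something relative to the solve rate a dumb sampler would reach with the same number of verifier calls.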