
There was a little more information in that reddit thread. Of the three difficulty tiers, 25% are T1 (easiest) and 50% are T2. Of the five public problems that the author looked at, two were T1 and two were T2. Glazer on reddit described T1 as "IMO/undergraduate problems", but the article author says that they don't consider them to be undergraduate problems. So the LLM is already doing what the author says they would be surprised about.

Also Glazer seemed to regret calling T1 "IMO/undergraduate", and not only because of the disparity between IMO and typical undergraduate. He said that "We bump problems down a tier if we feel the difficulty comes too heavily from applying a major result, even in an advanced field, as a black box, since that makes a problem vulnerable to naive attacks from models"

Also, all of the problems shown to Tao were T3.



The reddit thread is ... interesting (direct link[1]). It seems to be a debate among mathematicians, some of whom do have access to the secret set. But they're debating publicly and so naturally avoiding any concrete examples that would give the set away, so they wind up with fuzzy-fiddly language for the qualities of the problem tiers.

The "reality" of keeping this stuff secret 'cause someone would train on it is itself bizarre and certainly shouldn't be above questioning.

[1] https://www.reddit.com/r/OpenAI/comments/1hiq4yv/comment/m30...


It's not about training directly on the test set, it's about people discussing questions in the test set online (e.g., in forums), and then this data is swept up into the training set. That's what makes test set contamination so difficult to avoid.


Yes,

That is the "reality" - that because companies can train their models on the whole Internet, companies will train their (base) models on the entire Internet.

And in this situation, "having heard the problem" actually serves as a barrier to understanding these harder problems, since any variation of a known problem will receive a standard "half-assed guesstimate".

And these companies "can't not" use these base models since they're resigned to the "bitter lesson" (better the "bitter lesson viewpoint" imo) that they need large scale heuristics for the start of their process and only then can they start symbolic/reasoning manipulations.

But hold up! Why couldn't an organization freeze their training set and their problems and release both to the public? That would give us an idea of where the research stands. Ah, the answer comes out, 'cause they don't own the training set and the result they want to train is a commercial product that needs every drop of data to be the best. As Yann LeCun has said, this isn't research, this is product development.


>> It's not about training directly on the test set, it's about people discussing questions in the test set online

Don't kid yourself. There are tens of billions of dollars going into AI. Some of the humans involved would happily cheat on comparative tests to boost investment.


The incentives are definitely there, but even CEOs and VCs know that if they cheat the tests just to get more investment, they're only cheating themselves. No one is liquidating within the next 5 years so either they end up getting caught and lose everything or they spent all this energy trying to cheat while having a subpar model which results in them losing to competitors who actually invested in good technology.

Having a higher valuation could help with attracting better talent or more funding to invest in GPUs and actual model improvements but I don't think that outweighs the risks unless you're a tiny startup with nothing to show (but then you wouldn't have the money to bribe anyone).


People like to cheat. See the VW case. The company is big and established and still cheated.

It depends a lot on the individuals making up the company's command chain and their values.


Why is this any different from, say, Theranos?

CEOs and VCs will happily lie because they are convinced they are smarter than everyone else and will solve the problem before they get caught.


Theranos didn't have 10 different competitors doing the exact same thing. A new AI model which scores better on a random metric isn't going to suddenly make them the top model that everyone uses unless they're actually good. So while Theranos cheating would help put them in stores like CVS, an AI company cheating would just mean that they make a few sales before everyone realizes that their model is actually pretty bad compared to all the competitors.


Not having access to the dataset really makes the whole thing seem incredibly shady. Totally valid questions you are raising.


it’s a key aspect of the entire project. we have gone through many cycles of evals where the dataset is public


> So the LLM is already doing what the author says they would be surprised about.

that's if you unconditionally believe the result without any proofreading, confirmation, reproducibility, or even many details (we are given only one slide).



