
Assuming they trained this LLM on SAT/LSAT/GRE prep materials, I would totally expect they could get it this good. It's like having benchmark-aware code.

I think the whole concept of standardized tests may need to be re-evaluated.



> I would totally expect they could get it this good.

But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.


Yes. Being very familiar with the LSAT and being familiar enough with ML’s capability for finding patterns in volumes of similar data, I absolutely would have.

These LLMs are really exciting, but benchmarks like these exploit people’s misconceptions about both standardized tests and the technology.


From the paper:

> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
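
Here's a toy sketch in Python of the shape of that check (my guess at it, not OpenAI's actual pipeline; the names and the naive verbatim-substring match are mine, the real methodology is in Appendix C): flag exam questions that appear in the training data, rescore with those dropped, and report the lower of the two numbers.

    # Hypothetical sketch of the decontamination procedure quoted above.
    # "Seen" here means a naive verbatim-substring match against the corpus.
    def decontaminated_score(questions, graded, training_corpus):
        # questions: question texts; graded: parallel list of 1 (right) / 0 (wrong)
        seen = [q in training_corpus for q in questions]
        full = sum(graded) / len(graded)
        clean = [g for g, s in zip(graded, seen) if not s]
        clean_score = sum(clean) / len(clean) if clean else full
        return min(full, clean_score)

    corpus = "... huge blob of training text containing question two ..."
    qs = ["question one", "question two", "question three"]
    results = [1, 1, 0]
    print(decontaminated_score(qs, results, corpus))  # 0.5, not 0.67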


I think you're right that test prep materials were included in the dataset, even if only by accident. Except that humans have access to the same test prep materials, and they fail these exams all the time. The prep materials are just that: preparatory. They're representative of the test questions, but the actual test has different passages to read and different questions. On top of that, the LSAT isn't a math test with formulas where you just substitute different numbers in. Which is to say, the study guides are good practice, but passing the test beyond that represents a good command of the English language and an understanding of the subject material.

It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".


Humans fail because they can't review the entirety of test prep, can't remember very much, and have a much smaller number of “parameters” to store info in.

I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.

It is like reconstructing the Mandelbrot set at every zoom level with deep learning. Try it!
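
If you actually want to try it, here's a minimal sketch (my own toy, not anyone's published setup): fit a small MLP to Mandelbrot-set membership from (x, y) samples. It looks plausible at the scale it was trained on and tends to degrade as you zoom into the boundary, which is the point.

    # Toy experiment: learn Mandelbrot membership from coordinate samples.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def in_mandelbrot(c, max_iter=50):
        z = 0
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return 0
        return 1

    rng = np.random.default_rng(0)
    pts = rng.uniform([-2.0, -1.5], [1.0, 1.5], size=(20000, 2))
    labels = np.array([in_mandelbrot(complex(x, y)) for x, y in pts])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(pts, labels)

    # Decent fit at the original scale...
    print(clf.score(pts, labels))
    # ...typically much worse on a zoomed-in patch of the boundary
    # (around the seahorse valley) whose fine structure it never resolved.
    zoom = rng.uniform([-0.76, 0.06], [-0.73, 0.12], size=(2000, 2))
    zoom_labels = np.array([in_mandelbrot(complex(x, y)) for x, y in zoom])
    print(clf.score(zoom, zoom_labels))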


They mention in the article that, other than incidental material it may have seen in its general training data, they did not specifically train it for the tests.


The training data is so large that it incidentally includes basically anything that Google would index, plus the contents of however many thousands of copyrighted works they could get their hands on. So that would definitely include some test prep books.


They seem to be taking this into account (this is from the technical report itself, https://cdn.openai.com/papers/gpt-4.pdf, not the article):

> We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative.


By the same token, though, whatever test questions and answers it might have seen represent a tiny bit of the overall training data. It would be very surprising if it selectively "remembered" exact answers to all those questions, unless it was specifically trained repeatedly on them.


If it's trained on material scraped from the web, I imagine it would include all the test prep sites and forums.


Could they not have easily excluded any page with terms like LSAT? I’m sure it wouldn’t catch everything, but it would probably be close.
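
Something like this would be the naive version (hypothetical sketch; the term list is my own, and it would miss prep material that never names the exam):

    # Crude keyword filter over training documents, as suggested above.
    EXAM_TERMS = ("lsat", "logic games", "analytical reasoning")

    def keep(doc: str) -> bool:
        lowered = doc.lower()
        return not any(term in lowered for term in EXAM_TERMS)

    corpus = ["A passage about birds.", "Free LSAT practice questions!"]
    print([d for d in corpus if keep(d)])  # ['A passage about birds.']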


Totally. There's no way they removed all the prep material as well when they were trying to address the "contamination" issue with these standardized tests:

> for each exam we run a variant with these questions removed and report the lower score of the two.

I think that even with all the test prep material, which is surely helping the model, the high scores are still pretty impressive.


This feels the same as a human attending cram school to get better results on tests. Should we abolish them?


A test being a good indicator of human learning progress and ability is almost completely orthogonal to it being a good indicator of AI learning progress and ability.

In their everyday jobs, barely anyone uses even 5% of the knowledge and skills they were ever tested for. Even that's a better (but still very bad) reason to abolish tests.

What matters is the number of jobs that can be automated and replaced. We shall see. Many people have found LLMs useful in their work, and that will only grow in the future.


IMO, it's a good opportunity to rethink exams and the future of education. For many schools, education = good results on exams. Now GPT-4 is going to slam them and ask: what's the point now?


> I think the whole concept of standardized tests may need to be re-evaluated.

It's perfectly fine as a proxy for a human's future earnings.

To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.



