
Here's what's really terrifying about these tests: they're built on a fundamental misunderstanding of what these models are in the first place. They evaluate a personification of GPT, then use that evaluation to set expectations for GPT itself.

Tests like these are designed to evaluate subjective and logical understanding. That simply isn't what GPT does!

GPT models the content of its training corpus, then uses that model to generate more content.
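
To make that concrete, here is a minimal sketch of the generation loop, assuming the Hugging Face transformers library and the public "gpt2" checkpoint (any autoregressive model works the same way). The model assigns a probability to every possible next token based on patterns fit to its training corpus, and "answering" is nothing more than sampling from that distribution, one token at a time:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    ids = tok.encode("Q: What is the capital of France? A:", return_tensors="pt")
    with torch.no_grad():
        for _ in range(8):
            logits = model(ids).logits[0, -1]      # scores for every possible next token
            probs = torch.softmax(logits, dim=-1)  # a distribution fit to the corpus
            next_id = torch.multinomial(probs, 1)  # sample one token; no logic involved
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    print(tok.decode(ids[0]))

Nothing in that loop inspects the question's subject or reasons about it; every step is a lookup into the learned distribution.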

GPT does not do logic. GPT does not recognize or categorize subjects.

Instead, GPT relies on all of those behaviors (logic, subjective answers to questions, etc.) already being present in the language examples of its training corpus. It exhibits the implicit behavior of language itself by spitting out the (semantically) closest examples it has.

In the text corpus - written by people, and modeled by GPT - the semantically closest thing to a question is most likely a coherent and subjectively correct answer. That fact is the single tool that GPT's performance on these tests is founded upon. GPT will only "succeed" in answering a question when it happens to find the "correct answer" in the model it has built from its training corpus, in response to the specific phrasing of the question as written in the test.
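
To caricature that dependence, here is a toy bigram model of a three-sentence corpus (nothing like GPT's real architecture, but the same corpus-statistics principle taken to an extreme). The "correct answer" appears only because it already followed that exact phrasing in the corpus, and the same statistics cheerfully emit it for a question it never answered:

    import random
    from collections import defaultdict

    # A tiny "training corpus" in which the answer already follows the question.
    corpus = "Q: what is the capital of France ? A: Paris . " * 3
    tokens = corpus.split()

    # Model the corpus: record which token follows which.
    following = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev].append(nxt)

    def continuation(prompt, steps=2):
        out = prompt.split()
        for _ in range(steps):
            candidates = following.get(out[-1])
            if not candidates:  # phrasing never modeled: nothing to generate
                break
            out.append(random.choice(candidates))
        return " ".join(out)

    # The "correct answer", found purely because it followed this phrasing:
    print(continuation("Q: what is the capital of France ? A:"))   # ... A: Paris .

    # No logic, no subject recognition: the same statistics "answer" a different question.
    print(continuation("Q: what is the capital of Germany ? A:"))  # ... A: Paris .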

Effectively, these tests are evaluating the subjective correctness of the training corpus itself, in the context of answering the tests' questions.

If the training is "done well", then GPT's continuations of a test will include subjectively correct answers. But that means that "done well" is a metric for how "correct" the resulting "answer" is.

It is not a measure of how well GPT has modeled the language features present in its training corpus, or of how well it navigates that model to generate a preferable continuation: yet those are the behaviors that should be measured, because they are everything GPT itself is and does.
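
Those behaviors do have a standard proxy: held-out perplexity, which scores how well the model predicts text it wasn't trained on, independently of whether any particular "answer" is subjectively correct. A minimal sketch, again assuming transformers and the "gpt2" checkpoint:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    # Held-out text: we score prediction quality, not answer correctness.
    ids = tok.encode("The capital of France is Paris.", return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy

    print(f"perplexity: {torch.exp(loss).item():.1f}")  # lower = better corpus model

Lower perplexity means a better model of the corpus's language; it is exactly the kind of measure these tests never take.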

What we learn from these tests is so subjectively constrained that we can't honestly extrapolate the data into any meaningful expectations. GPT as a tool is not meant to be used strictly on these tests alone: it is meant to produce a wide variety of coherent language continuations. Evaluating the subjective answers to these tests does practically nothing to evaluate the behavior GPT is truly intended to exhibit.


