I made the WikiText-2 and WikiText-103 datasets that they compare against and, not too long ago, held state-of-the-art results on language modeling over PTB, WT-2, and WT-103.
OpenGPT-2's results are near equal to GPT-2's in zero-shot language model perplexity on multiple datasets [1].
The zero-shot perplexity results are also exactly where we'd expect the 1.5 billion parameter model to be: markedly better than OpenAI's 775M GPT-2 model [2] (the second largest model OpenAI trained), which they released in the last few days.
To me this is about as close a replication as you could expect, especially given OpenAI didn't release many of the exact training details. If OpenAI retrained the 1.5 billion parameter GPT-2 model I wouldn't be surprised to see the same variance in performance simply due to the random initialization of the parameters.
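For concreteness, the perplexity comparison above is a zero-shot evaluation: you take the pretrained model as-is, score a held-out corpus, and exponentiate the average token-level loss, with no fine-tuning on the target dataset. Below is a minimal sketch of that using Hugging Face's transformers library and the public gpt2-large checkpoint as a stand-in; the replication authors' exact tokenization, preprocessing, and windowing are assumptions here (non-overlapping blocks also understate the model slightly compared with a sliding window).

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Stand-in checkpoint; an OpenGPT-2 / GPT-2 1.5B checkpoint would slot in the same way.
    model_name = "gpt2-large"
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()

    @torch.no_grad()
    def corpus_perplexity(text, block=1024):
        """Zero-shot corpus perplexity: score non-overlapping blocks, no fine-tuning."""
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        nll_sum, n_tokens = 0.0, 0
        for i in range(0, len(ids) - 1, block):
            chunk = ids[i : i + block].unsqueeze(0)
            if chunk.size(1) < 2:
                break
            out = model(chunk, labels=chunk)      # loss = mean NLL over shifted positions
            scored = chunk.size(1) - 1
            nll_sum += out.loss.item() * scored
            n_tokens += scored
        return math.exp(nll_sum / n_tokens)

    # e.g. corpus_perplexity(open("wikitext-2.test.txt").read())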
That's true, but isn't the point of GPT-2 that it's strong at many tasks? It did really well at a lot more than just the four perplexity measures reported in OP.
[1]: https://miro.medium.com/max/3200/1*h1JoiQq9f1qOHS-rN4u57A.pn...
[2]: https://www.semanticscholar.org/paper/Language-Models-are-Un...