I made the WikiText-2 and WikiText-103 datasets that they compare against and, not too long ago, held state-of-the-art results on language modeling over PTB, WT-2, and WT-103.
OpenGPT-2's results are near equal to GPT-2's in zero-shot language model perplexity on multiple datasets [1].
The zero-shot perplexity results are also exactly where we'd expect the 1.5 billion parameter model to be: markedly better than OpenAI's 775M GPT-2 model [2] (the second largest model OpenAI trained), which they released in the last few days.
To me this is about as close a replication as you could expect, especially given OpenAI didn't release many of the exact training details. If OpenAI retrained the 1.5 billion parameter GPT-2 model I wouldn't be surprised to see the same variance in performance simply due to the random initialization of the parameters.
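For concreteness, the perplexity comparison above is a zero-shot evaluation: you take the pretrained model as-is, score a held-out corpus, and exponentiate the average token-level loss, with no fine-tuning on the target dataset. Below is a minimal sketch of that using Hugging Face's transformers library and the public gpt2-large checkpoint as a stand-in; the replication authors' exact tokenization, preprocessing, and windowing are assumptions here (non-overlapping blocks also understate the model slightly compared with a sliding window).

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Stand-in checkpoint; an OpenGPT-2 / GPT-2 1.5B checkpoint would slot in the same way.
    model_name = "gpt2-large"
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()

    @torch.no_grad()
    def corpus_perplexity(text, block=1024):
        """Zero-shot corpus perplexity: score non-overlapping blocks, no fine-tuning."""
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        nll_sum, n_tokens = 0.0, 0
        for i in range(0, len(ids) - 1, block):
            chunk = ids[i : i + block].unsqueeze(0)
            if chunk.size(1) < 2:
                break
            out = model(chunk, labels=chunk)      # loss = mean NLL over shifted positions
            scored = chunk.size(1) - 1
            nll_sum += out.loss.item() * scored
            n_tokens += scored
        return math.exp(nll_sum / n_tokens)

    # e.g. corpus_perplexity(open("wikitext-2.test.txt").read())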
That's true, but isn't the point of GPT-2 that it's strong at many tasks? It did really well at a lot more than just the four perplexity measures reported in OP.
[1]: https://miro.medium.com/max/3200/1*h1JoiQq9f1qOHS-rN4u57A.pn...
[2]: https://www.semanticscholar.org/paper/Language-Models-are-Un...