It's worth noting that, like other attempted replications, this model's perplexities mostly aren't as good as GPT-2's. Given that the title of the GPT-2 paper was "Language Models are Unsupervised Multitask Learners," I'd want to see a lot more metrics before I'd believe GPT-2 has actually been replicated, especially since every other time someone has made this claim, the metrics have shown otherwise. Until then, this is just a really big model.
I created the WikiText-2 and WikiText-103 datasets they compare against, and until not too long ago I held state-of-the-art results on language modeling over PTB, WT-2, and WT-103.
OpenGPT-2's results are nearly equal to GPT-2's in zero-shot language model perplexity on multiple datasets [1].
The zero-shot perplexity results are also exactly where we'd expect the 1.5 billion parameter model to be: markedly better than OpenAI's 774M GPT-2 model [2] (the second largest model OpenAI trained), which they released in the last few days.
To me, this is about as close a replication as you could expect, especially given that OpenAI didn't release many of the exact training details. If OpenAI retrained the 1.5 billion parameter GPT-2 model, I wouldn't be surprised to see the same variance in performance simply due to the random initialization of the parameters.
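For anyone who wants the concrete version of what's being compared: "zero-shot" here means scoring a held-out corpus with the pretrained model as-is, with no fine-tuning on that corpus, and reporting the exponentiated mean negative log-likelihood per token. A minimal sketch of that evaluation, assuming the HuggingFace transformers GPT-2 implementation; the checkpoint name and eval file are placeholders, not OP's actual pipeline:

```python
# Minimal sketch of a zero-shot perplexity evaluation, assuming the
# HuggingFace `transformers` GPT-2 implementation. The checkpoint name
# and eval file below are placeholders, not OP's actual setup.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

checkpoint = "gpt2"  # hypothetical: swap in the model under test
tokenizer = GPT2TokenizerFast.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)
model.eval()

# Tokenize the held-out corpus as one long stream of token ids.
text = open("wikitext-2-test.txt").read()  # hypothetical eval file
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Crude non-overlapping windows over the corpus: accumulate token-level
# negative log-likelihood, then exponentiate the mean to get perplexity.
# (A strided, overlapping evaluation gives slightly better numbers.)
max_len = model.config.n_positions  # 1024 for GPT-2
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, input_ids.size(1), max_len):
        chunk = input_ids[:, start : start + max_len]
        if chunk.size(1) < 2:  # nothing to predict in a 1-token chunk
            continue
        # With labels == input_ids, the model returns the mean
        # cross-entropy over the chunk's size(1) - 1 predicted tokens.
        loss = model(chunk, labels=chunk).loss
        nll_sum += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print(f"zero-shot perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```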
That's true, but isn't the point of GPT-2 that it's strong at many tasks? It did really well at a lot more than just the four perplexity measures reported in OP.
I suppose you can go and shit on people trying to replicate scientific work and tell them, extremely reductively, “Well, your almost-as-big number isn't as big, so fuck you,” like a soccer ultra comparing their team's score of 3 versus the opposing team's score of 2, as though the only question that matters is “Whose Soccer Team is Best?” Is that even the right question to ask?
People who actually do research don't just look at the absolute comparison of published numbers! Do you think that's how research is done, by chasing whatever has the biggest number? No repeat innovator does that.
It’s an interesting collision of world views for sure. This is a social media forum for a venture capital firm. They’d hate for anyone to discover that the numbers don’t tell the whole story, that actually everybody starts at zero, and that being second, because of the price premium put on first, is a huge opportunity. So even in some narrow, cynical interpretation, your point of view would lose people a ton of money. But I don’t really know anything about that.
Holy cow, dude. That's not even remotely like what I said. If you read the original paper, they show that GPT-2 does all sorts of cool things. In OP, they show that it's almost as good at one thing on a couple of datasets.
It's like if I wrote a paper showing that widgets improve liver health in young men, young women, and adult men. Additionally, widgets make you happy and taller and turn you blue. Then you come along and try to replicate my results, showing only that your version of a widget makes young men's and women's livers almost as healthy as mine did.
But sure, pretend I said whatever you want to argue against.