The perplexity numbers are for different tasks. MIM encodes a sentence into a latent variable and then reconstructs it, so its decoder already has access to a compressed summary of the very sentence it is scoring; that's how it reaches PTB perplexity 4.6. GPT-2 is generating the sentence from scratch, predicting each token from the prefix alone, which is a much harder problem and will on average give higher perplexity. I agree that a PTB perplexity of 4.6 on autoregressive language modeling would be a huge result.
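
To make the distinction concrete, here is a rough sketch of the two quantities as I understand them; the notation (a latent $z$, an encoder $\mathrm{enc}_\phi$, a per-token decoder $p_\theta$) is mine for illustration, not taken from either paper:

```latex
% Reconstruction perplexity: the decoder conditions on a latent z
% that was computed from the very sentence being scored.
\mathrm{PPL}_{\mathrm{recon}}
  = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}
      \log p_\theta\bigl(x_t \mid z,\; x_{<t}\bigr)\right),
  \qquad z = \mathrm{enc}_\phi(x_{1:T})

% Autoregressive perplexity: each token is predicted from the
% prefix alone, with no access to the rest of the sentence.
\mathrm{PPL}_{\mathrm{AR}}
  = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}
      \log p_\theta\bigl(x_t \mid x_{<t}\bigr)\right)
```

Since $z$ already encodes the whole sentence, the per-token uncertainty in the reconstruction case is far lower, so the two perplexities aren't comparable on the same scale.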