I recently found this paper[1] claiming near GPT-3 performance with only a fraction of the parameters. The authors seem to simply reformulate the input sequence to turn classification into a sequence generation task.
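
To make the idea concrete, here is a rough sketch of what such a reformulation could look like. The template, the label words, and the function names are all made up for illustration and are not taken from the paper; the point is only that the model is asked to produce a word, which is then mapped back to a class.

```python
# Hypothetical label-to-word mapping (an assumption, not from the paper).
LABEL_WORDS = {"positive": "great", "negative": "terrible"}


def to_generation_example(text, label=None, blank="___"):
    """Turn a (text, label) classification pair into a (prompt, target) pair.

    The prompt embeds the input in a fixed, made-up template with a blank;
    the target is the word the model should generate for that blank.
    """
    prompt = f"{text} All in all, it was {blank}."
    target = LABEL_WORDS[label] if label is not None else None
    return prompt, target


def to_label(generated_word):
    """Map a generated word back to its class label, or None if unknown."""
    cleaned = generated_word.strip().lower()
    for label, word in LABEL_WORDS.items():
        if word == cleaned:
            return label
    return None


prompt, target = to_generation_example("The food was cold.", "negative")
print(prompt)
print(target)
```

At inference time you would build the prompt without a label, let the language model fill in the blank, and pass its output through `to_label`.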

Disclaimer: I am not affiliated with any of the authors.

[1] https://arxiv.org/pdf/2009.07118.pdf