I recently found this paper[1], which claims near GPT-3 performance with only a fraction of the parameters. The authors seem to simply reformulate the input sequence so that classification becomes a sequence generation task.
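To make the reformulation idea concrete, here is a minimal sketch, assuming a sentiment classification task. The pattern string, the label words, and the function names are my own illustrative assumptions and are not taken from the paper. The word scores would come from a language model, which is left out here.

```python
from typing import Dict

# Hypothetical mapping from class labels to single words (illustration only).
LABEL_WORDS = {"positive": "great", "negative": "terrible"}


def to_cloze(text: str, mask_token: str = "[MASK]") -> str:
    """Wrap the raw input in a pattern so the label becomes a word to predict."""
    return f"{text} All in all, it was {mask_token}."


def label_from_word_scores(word_scores: Dict[str, float]) -> str:
    """Return the class whose label word received the highest score."""
    return max(
        LABEL_WORDS,
        key=lambda label: word_scores.get(LABEL_WORDS[label], float("-inf")),
    )


prompt = to_cloze("The food was cold.")
# Made-up scores standing in for a language model's output at the mask position.
predicted = label_from_word_scores({"great": 0.1, "terrible": 0.7})
```

The classifier head disappears: the model only has to fill in a word, and the mapping from words back to classes does the rest.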
Disclaimer: I am not affiliated with any of the authors.
[1] https://arxiv.org/pdf/2009.07118.pdf