
The repo seems to imply that it matches GPT-2, so I imagine any analyses of GPT-2 will give you a good idea.


Does anyone know the main differences between GPT-2 and GPT-3? Are there significant architectural changes, or is the advancement primarily from training?


If you google "GPT-2 vs GPT-3" you'll find lots of overviews and comparisons, like:

* https://www.kdnuggets.com/2021/02/gpt2-gpt3-openai-showdown....

* https://bakztfuture.substack.com/p/the-chasm-between-gpt-2-a...


Thanks. Sounds like they scaled the parameter count up by roughly 100x (1.5B to 175B), which made some "magic leap" that isn't yet well understood, and fed it a lot more training data from a broader mix of sources.


Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and the nanoGPT author is trying to reproduce those results here:

https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...
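
As a rough sketch of what "scaling in tandem" means (my own back-of-the-envelope, not the notebook's code), using the usual C ≈ 6·N·D training-FLOPs approximation and the ~20-tokens-per-parameter rule of thumb often quoted from Chinchilla:

    # Back-of-the-envelope Chinchilla-style compute-optimal split, assuming
    # C ~= 6 * N * D training FLOPs and ~20 tokens per parameter
    # (the exact ratio depends on which of the paper's fits you use).

    def compute_optimal_split(flops_budget, tokens_per_param=20.0):
        """Return (params N, tokens D) that roughly exhaust a FLOPs budget
        under C = 6 * N * D with D = tokens_per_param * N."""
        # C = 6 * N * (k * N)  =>  N = sqrt(C / (6 * k))
        n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    for c in (1e21, 1e22, 1e23, 1e24):
        n, d = compute_optimal_split(c)
        print(f"C = {c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")

Each 10x of compute buys you roughly sqrt(10) ≈ 3.2x more parameters and 3.2x more tokens, which is the "scale them equally" message.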


I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on its broad themes (e.g. scale model and data roughly in tandem) and just roughly interpolating the main figure, without sweating the details. The corresponding authors responded very kindly at first, and I was able to bring the results closer, but they've since gone dark. I'm still hoping to make things match; if others in the LLM space spot any issues in my own reproduction, please let me know.
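
For anyone who wants to poke at it, the kind of bookkeeping that moves these numbers looks roughly like this; both estimates are standard approximations, and the GPT-2-small-ish sizes below are illustrative assumptions rather than figures from the paper or from my notebook:

    # Two common ways to estimate transformer training FLOPs. Small bookkeeping
    # choices (embeddings, attention logits, backward multiplier) are where
    # reproductions can drift apart.

    def flops_simple(n_params, n_tokens):
        # Standard approximation: ~6 FLOPs per parameter per training token
        # (~2 for the forward matmuls, ~4 for backward).
        return 6.0 * n_params * n_tokens

    def flops_detailed(n_tokens, d_model, n_layers, d_ff, vocab, seq_len):
        """Count the main forward matmuls for one sequence, then scale to
        n_tokens with a 3x multiplier for backward (~2x forward)."""
        per_layer = (
            2 * seq_len * d_model * 3 * d_model                # q, k, v projections
            + 2 * seq_len * seq_len * d_model                  # attention logits q @ k^T
            + 2 * seq_len * seq_len * d_model                  # attention-weighted values
            + 2 * seq_len * d_model * d_model                  # attention output projection
            + 2 * seq_len * (d_model * d_ff + d_ff * d_model)  # MLP block
        )
        logits = 2 * seq_len * d_model * vocab                 # final unembedding
        forward_per_token = (n_layers * per_layer + logits) / seq_len
        return 3 * forward_per_token * n_tokens

    n_params, n_tokens = 124e6, 10e9  # GPT-2-small-ish, 10B tokens, illustrative
    print(f"6ND estimate:      {flops_simple(n_params, n_tokens):.3e}")
    print(f"detailed estimate: {flops_detailed(n_tokens, 768, 12, 3072, 50257, 1024):.3e}")

The two disagree by 10-20% at this scale, which is enough to shift fitted scaling-law constants.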


Oh, that's really interesting, and makes sense intuitively. From the abstract:

> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Assuming the GPT-3 authors know this, one could surmise they scaled up the number of training tokens by a similar factor.

Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.
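
A quick back-of-the-envelope with GPT-3's reported numbers (~175B parameters trained on ~300B tokens) and the rough ~20-tokens-per-parameter Chinchilla rule of thumb shows just how undertrained:

    gpt3_params = 175e9
    gpt3_tokens = 300e9
    tokens_per_param = 20  # rough Chinchilla rule of thumb

    optimal_tokens = tokens_per_param * gpt3_params  # ~3.5T tokens
    optimal_params = gpt3_tokens / tokens_per_param  # ~15B params

    print(f"Tokens a 175B model 'wants': ~{optimal_tokens / 1e12:.1f}T")
    print(f"Tokens GPT-3 actually saw:   ~{gpt3_tokens / 1e9:.0f}B "
          f"({gpt3_tokens / optimal_tokens:.0%} of that)")
    print(f"Compute-optimal size at 300B tokens: ~{optimal_params / 1e9:.0f}B params")

By that yardstick GPT-3 saw less than a tenth of the tokens a compute-optimal model of its size would have; equivalently, its token budget would have been better matched to a ~15B-parameter model.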


I'm having trouble finding GPT-2 use cases. Any guidance on what to search for?


The GPT family of models shines above 100B parameters. Almost nobody uses GPT-2 today. It's too weak.

If you want to go with a <1B-parameter model, you'd use BERT, which is bidirectional, or T5, which is easier to fine-tune on other tasks.
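
Something like this, via the Hugging Face transformers pipelines (the checkpoints are just common small examples, not a specific recommendation):

    # Typical <1B-parameter alternatives to GPT-2.
    from transformers import pipeline

    # BERT: bidirectional encoder, a natural fit for masked-token and
    # classification-style tasks.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_mask("The capital of France is [MASK]."))

    # T5: encoder-decoder, straightforward to prompt or fine-tune for
    # text-to-text tasks.
    t5 = pipeline("text2text-generation", model="t5-small")
    print(t5("summarize: GPT-2 is a decoder-only transformer released by OpenAI in 2019."))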


Something that immediately comes to mind is text summarization, though by now you'll be used to better results from GPT-3 or more recent models.
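
If you want to try it anyway, the GPT-2 paper's zero-shot trick was to append "TL;DR:" to the article and let the model continue; a rough sketch (generation settings are purely illustrative):

    # Zero-shot summarization the way the GPT-2 paper did it: append "TL;DR:"
    # and let the model continue the text.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    article = (
        "nanoGPT is a small, readable codebase for training GPT-style language "
        "models. It reproduces GPT-2 (124M) on OpenWebText and is meant to be "
        "easy to hack on."
    )
    prompt = article + "\nTL;DR:"

    out = generator(prompt, max_new_tokens=40, do_sample=True, top_k=50)
    print(out[0]["generated_text"][len(prompt):])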



