
The repo seems to imply that it matches GPT-2, so I imagine any analyses of GPT-2 will give you a good idea.


Does anyone know the main differences between GPT-2 and GPT-3? Are there significant architectural changes, or is the advancement primarily from training?


If you google "GPT-2 vs GPT-3" you'll find lots of overviews and comparisons, like:

* https://www.kdnuggets.com/2021/02/gpt2-gpt3-openai-showdown....

* https://bakztfuture.substack.com/p/the-chasm-between-gpt-2-a...


Thanks. Sounds like they scaled the parameter count up by roughly 100x (1.5B to 175B), which made some "magic leap" that isn't yet well understood, and fed it a lot more training data from a broader mix of sources.


Yes, although Chinchilla seems to imply that training data size matters a lot more than parameter count, and the nanoGPT author is trying to reproduce those results here:

https://github.com/karpathy/nanoGPT/blob/master/scaling_laws...
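
As a rough sketch of what "scaling in tandem" means (my own back-of-the-envelope, not the notebook's code), using the usual C ≈ 6·N·D training-FLOPs approximation and the ~20-tokens-per-parameter rule of thumb often quoted from Chinchilla:

    # Back-of-the-envelope Chinchilla-style compute-optimal split, assuming
    # C ~= 6 * N * D training FLOPs and ~20 tokens per parameter
    # (the exact ratio depends on which of the paper's fits you use).

    def compute_optimal_split(flops_budget, tokens_per_param=20.0):
        """Return (params N, tokens D) that roughly exhaust a FLOPs budget
        under C = 6 * N * D with D = tokens_per_param * N."""
        # C = 6 * N * (k * N)  =>  N = sqrt(C / (6 * k))
        n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    for c in (1e21, 1e22, 1e23, 1e24):
        n, d = compute_optimal_split(c)
        print(f"C = {c:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")

Each 10x of compute buys you roughly sqrt(10) ≈ 3.2x more parameters and 3.2x more tokens, which is the "scale them equally" message.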


I was also a bit surprised that the Chinchilla numbers and tables don't reproduce and that there are calculation bugs in the paper (e.g. the FLOPs calculation is wrong), especially because the paper has been so impactful in the field. Maybe people are focusing on its broad themes (e.g. scale model and data roughly in tandem) and just roughly interpolating the main figure, without sweating the details. The corresponding authors responded very kindly at first, and I was able to bring the results closer, but they've since gone dark. I'm still hoping to make things match; if others in the LLM space spot any issues in my own reproduction, please let me know.
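
For anyone who wants to poke at it, the kind of bookkeeping that moves these numbers looks roughly like this; both estimates are standard approximations, and the GPT-2-small-ish sizes below are illustrative assumptions rather than figures from the paper or from my notebook:

    # Two common ways to estimate transformer training FLOPs. Small bookkeeping
    # choices (embeddings, attention logits, backward multiplier) are where
    # reproductions can drift apart.

    def flops_simple(n_params, n_tokens):
        # Standard approximation: ~6 FLOPs per parameter per training token
        # (~2 for the forward matmuls, ~4 for backward).
        return 6.0 * n_params * n_tokens

    def flops_detailed(n_tokens, d_model, n_layers, d_ff, vocab, seq_len):
        """Count the main forward matmuls for one sequence, then scale to
        n_tokens with a 3x multiplier for backward (~2x forward)."""
        per_layer = (
            2 * seq_len * d_model * 3 * d_model                # q, k, v projections
            + 2 * seq_len * seq_len * d_model                  # attention logits q @ k^T
            + 2 * seq_len * seq_len * d_model                  # attention-weighted values
            + 2 * seq_len * d_model * d_model                  # attention output projection
            + 2 * seq_len * (d_model * d_ff + d_ff * d_model)  # MLP block
        )
        logits = 2 * seq_len * d_model * vocab                 # final unembedding
        forward_per_token = (n_layers * per_layer + logits) / seq_len
        return 3 * forward_per_token * n_tokens

    n_params, n_tokens = 124e6, 10e9  # GPT-2-small-ish, 10B tokens, illustrative
    print(f"6ND estimate:      {flops_simple(n_params, n_tokens):.3e}")
    print(f"detailed estimate: {flops_detailed(n_tokens, 768, 12, 3072, 50257, 1024):.3e}")

The two disagree by 10-20% at this scale, which is enough to shift fitted scaling-law constants.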


Oh, that's really interesting, and makes sense intuitively. From the abstract:

> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Assuming the GPT-3 authors know this, one could surmise they scaled up the number of training tokens by a similar factor.

Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.
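
A quick back-of-the-envelope with GPT-3's reported numbers (~175B parameters trained on ~300B tokens) and the rough ~20-tokens-per-parameter Chinchilla rule of thumb shows just how undertrained:

    gpt3_params = 175e9
    gpt3_tokens = 300e9
    tokens_per_param = 20  # rough Chinchilla rule of thumb

    optimal_tokens = tokens_per_param * gpt3_params  # ~3.5T tokens
    optimal_params = gpt3_tokens / tokens_per_param  # ~15B params

    print(f"Tokens a 175B model 'wants': ~{optimal_tokens / 1e12:.1f}T")
    print(f"Tokens GPT-3 actually saw:   ~{gpt3_tokens / 1e9:.0f}B "
          f"({gpt3_tokens / optimal_tokens:.0%} of that)")
    print(f"Compute-optimal size at 300B tokens: ~{optimal_params / 1e9:.0f}B params")

By that yardstick GPT-3 saw less than a tenth of the tokens a compute-optimal model of its size would have; equivalently, its token budget would have been better matched to a ~15B-parameter model.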


I'm having trouble finding GPT-2 use cases. Any guidance on what to search for?


The GPT family of models shines above 100B parameters. Almost nobody uses GPT-2 today. It's too weak.

If you want to go with a <1B-parameter model, you'd use BERT, which is bidirectional, or T5, which is easier to fine-tune on other tasks.
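
Something like this, via the Hugging Face transformers pipelines (the checkpoints are just common small examples, not a specific recommendation):

    # Typical <1B-parameter alternatives to GPT-2.
    from transformers import pipeline

    # BERT: bidirectional encoder, a natural fit for masked-token and
    # classification-style tasks.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_mask("The capital of France is [MASK]."))

    # T5: encoder-decoder, straightforward to prompt or fine-tune for
    # text-to-text tasks.
    t5 = pipeline("text2text-generation", model="t5-small")
    print(t5("summarize: GPT-2 is a decoder-only transformer released by OpenAI in 2019."))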


Something that immediately comes to mind is text summarization, though by now you'll be used to better results from GPT-3 or more recent models.
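
If you want to try it anyway, the GPT-2 paper's zero-shot trick was to append "TL;DR:" to the article and let the model continue; a rough sketch (generation settings are purely illustrative):

    # Zero-shot summarization the way the GPT-2 paper did it: append "TL;DR:"
    # and let the model continue the text.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    article = (
        "nanoGPT is a small, readable codebase for training GPT-style language "
        "models. It reproduces GPT-2 (124M) on OpenWebText and is meant to be "
        "easy to hack on."
    )
    prompt = article + "\nTL;DR:"

    out = generator(prompt, max_new_tokens=40, do_sample=True, top_k=50)
    print(out[0]["generated_text"][len(prompt):])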



