Hacker News

Which model are you training and on what dataset?


It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written fully from scratch and further tweaked/modified. I use it for experiments and as a testbed when developing my training harness (which I use for training other models too, and which receives all of my non-LLM-specific improvements, e.g. optimizers better than Adam, a custom GPU memory allocator, custom gradient accumulation that accumulates directly into the optimizers' state without using extra VRAM for gradients, etc.).
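The gradient-accumulation trick mentioned above can be sketched roughly like this (a pure-NumPy toy with SGD-plus-momentum standing in for the real optimizer; the class and method names are hypothetical, not from the actual harness). The idea: decay the momentum buffer once at the start of a step, then add each micro-batch gradient straight into it, so the momentum buffer doubles as the accumulator and no separate gradient tensor needs to persist between micro-batches:

```python
import numpy as np

class AccumulatingSGD:
    """SGD with momentum that folds each micro-batch gradient straight
    into the momentum buffer, so no separate accumulated-gradient
    tensor is kept alive across micro-batches.
    Hypothetical sketch, not the author's actual implementation."""

    def __init__(self, param, lr=0.1, beta=0.9):
        self.param = param
        self.lr = lr
        self.beta = beta
        self.momentum = np.zeros_like(param)
        self._first_micro_batch = True

    def accumulate(self, grad):
        # Decay the old momentum exactly once per optimizer step,
        # on the first micro-batch; after that, just add. This yields
        # momentum = beta * momentum_old + sum(micro_grads), identical
        # to accumulating gradients separately and updating once.
        if self._first_micro_batch:
            self.momentum *= self.beta
            self._first_micro_batch = False
        self.momentum += grad  # grad's memory can be freed right after

    def step(self):
        self.param -= self.lr * self.momentum
        self._first_micro_batch = True  # reset for the next step
```

The math works out only because momentum accumulation is linear in the gradient; for Adam-style optimizers the second-moment term needs more care, which is presumably part of why a custom implementation is needed.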

For the dataset I just use FineWeb-Edu.


Wow, thank you for the link to the code -- I hadn't seen it before. It contains a ton of useful tricks; lots to learn from there.



