Hacker News

Which model are you training and on what dataset?


It's a fully custom architecture, heavily inspired by the modded-nanogpt speedrun (https://github.com/KellerJordan/modded-nanogpt) but written fully from scratch and further tweaked/modified. I use it for experiments and as a testbed when developing my training harness (which I use for training other models too, and which receives all of my non-LLM-specific improvements, e.g. optimizers better than Adam, a custom GPU memory allocator, custom gradient accumulation that accumulates directly into the optimizers' state without using extra VRAM for gradients, etc.).
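The gradient-accumulation trick mentioned above can be sketched roughly like this (a pure-NumPy toy with SGD-plus-momentum standing in for the real optimizer; the class and method names are hypothetical, not from the actual harness). The idea: decay the momentum buffer once at the start of a step, then add each micro-batch gradient straight into it, so the momentum buffer doubles as the accumulator and no separate gradient tensor needs to persist between micro-batches:

```python
import numpy as np

class AccumulatingSGD:
    """SGD with momentum that folds each micro-batch gradient straight
    into the momentum buffer, so no separate accumulated-gradient
    tensor is kept alive across micro-batches.
    Hypothetical sketch, not the author's actual implementation."""

    def __init__(self, param, lr=0.1, beta=0.9):
        self.param = param
        self.lr = lr
        self.beta = beta
        self.momentum = np.zeros_like(param)
        self._first_micro_batch = True

    def accumulate(self, grad):
        # Decay the old momentum exactly once per optimizer step,
        # on the first micro-batch; after that, just add. This yields
        # momentum = beta * momentum_old + sum(micro_grads), identical
        # to accumulating gradients separately and updating once.
        if self._first_micro_batch:
            self.momentum *= self.beta
            self._first_micro_batch = False
        self.momentum += grad  # grad's memory can be freed right after

    def step(self):
        self.param -= self.lr * self.momentum
        self._first_micro_batch = True  # reset for the next step
```

The math works out only because momentum accumulation is linear in the gradient; for Adam-style optimizers the second-moment term needs more care, which is presumably part of why a custom implementation is needed.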

For the dataset I just use FineWeb-Edu.


Wow, thank you for the link to the code -- I hadn't seen it before. It contains a ton of useful tricks; lots to learn from there.



