So I remember DeepSeek used float8 for training, and Character AI also used int8 for training, so it is indeed possible, though training can sometimes be unstable. To my knowledge, DeepSeek is actually the first lab to use float8 at a large scale without causing loss spikes. They used the FP8 tensor cores, and then every fourth matrix multiply they accumulated the partial results into an FP32 accumulator, since it seems the Hopper tensor cores' accumulation mechanism might not be true FP32 accumulation. I wrote more here: https://x.com/danielhanchen/status/1872719599029850391
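
Roughly what I mean by periodic FP32 accumulation, as a toy NumPy sketch (this is not DeepSeek's actual kernel; float16 stands in for FP8 since NumPy has no FP8 dtype, and the chunk size and function name are made up for illustration):

  import numpy as np

  def low_precision_matmul_fp32_promotion(a, b, chunk=128):
      """Toy sketch: multiply A @ B in low-precision chunks along K,
      promoting each partial sum into an FP32 accumulator, mimicking the
      idea of periodically flushing a limited-precision tensor core
      accumulator into FP32."""
      m, k = a.shape
      k2, n = b.shape
      assert k == k2
      acc_fp32 = np.zeros((m, n), dtype=np.float32)
      for start in range(0, k, chunk):
          end = min(start + chunk, k)
          # Simulate reduced-precision partial accumulation (float16 here
          # as a stand-in for the tensor core's narrower accumulator).
          partial = (a[:, start:end].astype(np.float16) @
                     b[start:end, :].astype(np.float16)).astype(np.float32)
          # Periodic promotion: fold the partial sum into the FP32 accumulator.
          acc_fp32 += partial
      return acc_fp32

The point of chunking is that error from the narrow accumulator only builds up over a short stretch of the K dimension before being absorbed into full FP32, which is how you keep long reductions from drifting.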


