
Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).


I think GP means that if you internalize the bitter lesson (more data and more compute win), you stop imagining how to squeeze SOTA-minus-1 performance out of constrained compute environments.


This. When we ran out of speed on the CPU, we moved to the GPU. Same thing here: the more we work with these (22T-token) models, quantization, and reduced precision, the more we learn and the more novel techniques we find.
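
To make the quantization point concrete, here's a minimal sketch of symmetric per-tensor int8 weight quantization in NumPy (the function names and tensor shapes are illustrative, not taken from any particular library or from the model discussed above):

    import numpy as np

    def quantize_int8(weights):
        # Symmetric per-tensor quantization: map the largest magnitude to 127.
        scale = np.abs(weights).max() / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Recover approximate float32 weights from the int8 representation.
        return q.astype(np.float32) * scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, scale)).max())
    print("memory: %.1f MB -> %.1f MB" % (w.nbytes / 1e6, q.nbytes / 1e6))

Storing int8 weights cuts memory roughly 4x versus float32 at the cost of a small rounding error per weight, which is the basic trade-off behind the lower-precision quants mentioned above.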


What does that have to do with quantizing?



