
Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).


I think GP means that if you internalize the bitter lesson (more data and more compute win), you stop imagining how to squeeze SOTA-minus-1 performance out of constrained compute environments.


This. When we ran out of speed on the CPU, we moved to the GPU. Same thing here: the more we work with these (22T-token) models, quantization, and reduced precision, the more we learn and the more novel techniques we find.
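
To make the quantization point concrete, here's a minimal sketch of symmetric per-tensor int8 weight quantization in NumPy (the function names and tensor shapes are illustrative, not taken from any particular library or from the model discussed above):

    import numpy as np

    def quantize_int8(weights):
        # Symmetric per-tensor quantization: map the largest magnitude to 127.
        scale = np.abs(weights).max() / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Recover approximate float32 weights from the int8 representation.
        return q.astype(np.float32) * scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, scale)).max())
    print("memory: %.1f MB -> %.1f MB" % (w.nbytes / 1e6, q.nbytes / 1e6))

Storing int8 weights cuts memory roughly 4x versus float32 at the cost of a small rounding error per weight, which is the basic trade-off behind the lower-precision quants mentioned above.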


What does that have to do with quantizing?



