He lost me a bit at the end talking about running chatbots on CPUs. I know it's possible, but inference is inherently parallel computing, isn't it? Would that ever really make sense? I expected to hear something more like low-end consumer GPUs.
Recent-generation LLMs do seem to have some significant efficiency gains, and routers to decide whether you really need all of their power on a given question. And Google is building its custom TPUs. So I'm not sure I buy the idea that everyone ignores efficiency.
(Hi, Tom!) Reread the article and look for "CPU". The whole article is about doing deep learning on CPUs, not GPUs. Moonshine, the open source project and startup he talks about, shows speech recognition and realtime translation running on the device rather than on a server. My understanding is that doing The Math in parallel is itself a performance hack, but Doing Less Math is also a performance hack.
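For what it's worth, here's a rough sketch of what "Doing Less Math" can look like in practice: dynamic int8 quantization in PyTorch, which is one common way to make a model cheaper to run on a CPU. The tiny model below is just a placeholder I made up, not anything from the article or from Moonshine.

    # Sketch: shrink the matmuls a model does so CPU inference is cheaper.
    # The architecture and sizes here are hypothetical placeholders.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(512, 1024),
        nn.ReLU(),
        nn.Linear(1024, 512),
    )
    model.eval()

    # Convert the Linear layers to int8 weights with dynamic activation
    # quantization: same layer count, but each matmul moves and multiplies
    # roughly a quarter of the bytes compared to float32.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    with torch.inference_mode():
        x = torch.randn(1, 512)
        print(quantized(x).shape)  # torch.Size([1, 512]), computed in int8 on the CPU

Parallelism throws more arithmetic units at the same work; tricks like this just do less work, which is why CPU-only inference isn't as crazy as it sounds for small on-device models.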