It's worth taking a second to note that the author just assumes that readers understand "the wilderness" to mean "not Google".
This post gives a lot of credit to Google's infra and hardware teams, and I'd love to read a perspective from one of those insiders who then went on to do related work elsewhere.
> I was completely taken aback by the failure rate of GPUs as opposed to my experiences on TPUs at Google
Should be "I was completely unaware of the failure modes of GPUs, because I've spent my whole career inside Google, using Google TPUs, and was well-acquainted with those failure modes."
I've mostly used GPUs, and when I tried TPUs the jobs failed all the time for really hard-to-debug reasons. The indirection between the x86 host and the TPU device often cost me hours of hair-pulling, the kind of thing you just don't get with x86+nvidia+pytorch.
10-15 years ago, Google minted many $10m+ data scientists (aka Sawzall engineers) who also ventured "into the wilderness" and had very similar reactions. This blog post is much more about the OP hyping his company and personal brand than contributing useful notes to the community.
When was this? I use JAX+TPUs to train LLMs and haven't experienced many issues. IMO it was way easier to set up distributed training, sharding, etc. compared to PyTorch+GPUs.
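To give a sense of what I mean, here's a minimal, illustrative sketch (made-up shapes, not my actual training code) of the jax.sharding API: you describe the device mesh once and jit compiles the sharded program, instead of hand-wiring process groups and collectives as with PyTorch DDP/FSDP.

```python
# Illustrative only: shard a batch across whatever devices are present
# (TPU cores, GPUs, or a single CPU) using jax.sharding.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())           # all local devices
mesh = Mesh(devices, axis_names=("data",))  # 1-D data-parallel mesh

# Shard the batch along the "data" axis, replicate the parameters.
batch = jax.device_put(jnp.ones((8 * len(devices), 512)),
                       NamedSharding(mesh, P("data", None)))
params = jax.device_put(jnp.ones((512, 256)),
                        NamedSharding(mesh, P(None, None)))

@jax.jit
def forward(params, batch):
    # jit emits a single sharded program; no hand-written all-reduces needed
    return batch @ params

print(forward(params, batch).sharding)  # output stays sharded along "data"
```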
OP mentions the failure rate of GPUs: "If this were in GPU land, it would have failed within the first few days for sure."
In my experience, we've never had GPU failures, even for large-scale training. Our current training batch job reads a 20 GB JSON file (which takes 6 hours just to load) and has been running for more than 15 days without a hiccup. And we're using the older Tesla T4.
GPUs do have memory constraints, but if you plan and work around them, I haven't seen them crash in real life.
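To be concrete about "plan and work around them": the usual trick I have in mind is gradient accumulation, which keeps the per-step memory footprint small at the cost of wall-clock time. A rough sketch (my illustration, not our actual code):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 8  # effective batch = micro-batch size * accum_steps

def train_step(micro_batches):
    opt.zero_grad()
    for x, y in micro_batches:  # each micro-batch fits comfortably in GPU memory
        loss = nn.functional.mse_loss(model(x.to(device)), y.to(device))
        (loss / accum_steps).backward()  # gradients accumulate across micro-batches
    opt.step()  # one optimizer update per accumulation window
```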
That's an undemanding and well-debugged chip by this point (it launched 6 years ago!), so you aren't experiencing any of the pain that people running A100s or H100s (never mind those who have to stand up B100 clusters soon) are going through now.
Well, it would depend on the specifics of the JSON file, but eyeballing the stats at https://github.com/miloyip/nativejson-benchmark/tree/master suggests that even on a 2015 MacBook a parser like Configuru sustains several megabytes per second.
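A quick back-of-envelope check (my numbers, just rough throughputs eyeballed from that page) suggests raw parsing of a 20 GB file should be on the order of an hour, not six, so most of the load time is probably going elsewhere (object construction, I/O, copies):

```python
size_mb = 20 * 1024  # the 20 GB file from the parent comment
for mb_per_s in (2, 5, 10, 50):  # rough parser throughputs from the benchmark
    hours = size_mb / mb_per_s / 3600
    print(f"{mb_per_s:>3} MB/s -> {hours:.1f} h of pure parsing")
# 2 MB/s -> 2.8 h, 5 MB/s -> 1.1 h, 10 MB/s -> 0.6 h, 50 MB/s -> 0.1 h
```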
I took the phrase to mean "outside any large company". It seems like a fairly obvious metaphor: if you have a startup working on a large-scale infrastructure project, you have to set up your own logistics, just like a camp in the literal wilderness.
Agreed. It reads like Seven of Nine realizing she's separated from the Collective and needs to rely on lowly human capabilities. The insights into the vendors were informative.
Newbie question: what happens when an LLM training job experiences a hardware failure? I don't suppose you lose all the training progress, do you? Then the pain is mostly in diagnosing the problem and getting the cluster running again, but there's no need to worry about data loss, right?
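My guess is that periodic checkpointing is what limits the damage, so a crash only costs the steps since the last save, something like this (illustrative PyTorch-style sketch, not from the post):

```python
import torch
from torch import nn

model = nn.Linear(128, 128)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step, path="ckpt.pt"):
    # Saved periodically (e.g. every N steps) during training.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": opt.state_dict()}, path)

def resume(path="ckpt.pt"):
    # After the failed node is swapped out, training restarts from here.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```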