It's worth taking a second to note that the author just assumes that readers understand "the wilderness" to mean "not Google".
This post gives a lot of credit to Google's infra and hardware teams, and I'd love to read a perspective from one of those insiders who then went on to do related work elsewhere.
> I was completely taken aback by the failure rate of GPUs as opposed to my experiences on TPUs at Google
Should be "I was completely unaware of the failure modes of GPUs, because I've spent my whole career inside Google, using Google TPUs, and was well-acquainted with those failure modes."
I've mostly used GPUs, and when I tried TPUs the jobs failed all the time for really hard-to-debug reasons. The indirection between the x86 host and the TPU device often cost me hours of hair-pulling, the kind of thing you just don't get with x86+nvidia+pytorch.
10-15 years ago, Google minted many $10m+ data scientists (aka Sawzall engineers) who also ventured "into the wilderness" and had very similar reactions. This blog post is much more about the OP hyping his company and personal brand than contributing useful notes to the community.
When was this? I use JAX+TPUs to train LLMs and haven't experienced many issues. IMO it was way easier to set up distributed training, sharding, etc. compared to PyTorch+GPUs.
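To give a sense of what I mean, here's a minimal, illustrative sketch (made-up shapes, not my actual training code) of the jax.sharding API: you describe the device mesh once and jit compiles the sharded program, instead of hand-wiring process groups and collectives as with PyTorch DDP/FSDP.

```python
# Illustrative only: shard a batch across whatever devices are present
# (TPU cores, GPUs, or a single CPU) using jax.sharding.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())           # all local devices
mesh = Mesh(devices, axis_names=("data",))  # 1-D data-parallel mesh

# Shard the batch along the "data" axis, replicate the parameters.
batch = jax.device_put(jnp.ones((8 * len(devices), 512)),
                       NamedSharding(mesh, P("data", None)))
params = jax.device_put(jnp.ones((512, 256)),
                        NamedSharding(mesh, P(None, None)))

@jax.jit
def forward(params, batch):
    # jit emits a single sharded program; no hand-written all-reduces needed
    return batch @ params

print(forward(params, batch).sharding)  # output stays sharded along "data"
```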
OP mentions the failure rate of GPUs: "If this were in GPU land, it would have failed within the first few days for sure."
In my experience, we've never had GPU failures, even for large-scale training. Our current training batch job reads a 20 GB JSON file (which takes 6 hours just to load) and has been running for more than 15 days without a hiccup. And we're using the older Tesla T4.
GPUs do have memory constraints, but if you plan and work around them, I haven't seen them crash in real life.
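To be concrete about "plan and work around them": the usual trick I have in mind is gradient accumulation, which keeps the per-step memory footprint small at the cost of wall-clock time. A rough sketch (my illustration, not our actual code):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 8  # effective batch = micro-batch size * accum_steps

def train_step(micro_batches):
    opt.zero_grad()
    for x, y in micro_batches:  # each micro-batch fits comfortably in GPU memory
        loss = nn.functional.mse_loss(model(x.to(device)), y.to(device))
        (loss / accum_steps).backward()  # gradients accumulate across micro-batches
    opt.step()  # one optimizer update per accumulation window
```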
That's an undemanding and well-debugged chip by this point (it launched 6 years ago!), so you aren't experiencing any of the pain that people running A100s or H100s (never mind those who have to stand up B100 clusters soon) are going through now.
Well, it would depend on the specifics of the JSON file, but eyeballing the stats at https://github.com/miloyip/nativejson-benchmark/tree/master suggests that even on a 2015 MacBook a parser like Configuru sustains several megabytes per second.
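A quick back-of-envelope check (my numbers, just rough throughputs eyeballed from that page) suggests raw parsing of a 20 GB file should be on the order of an hour, not six, so most of the load time is probably going elsewhere (object construction, I/O, copies):

```python
size_mb = 20 * 1024  # the 20 GB file from the parent comment
for mb_per_s in (2, 5, 10, 50):  # rough parser throughputs from the benchmark
    hours = size_mb / mb_per_s / 3600
    print(f"{mb_per_s:>3} MB/s -> {hours:.1f} h of pure parsing")
# 2 MB/s -> 2.8 h, 5 MB/s -> 1.1 h, 10 MB/s -> 0.6 h, 50 MB/s -> 0.1 h
```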
I took the phrase to mean "outside any large company". It seems like a fairly obvious metaphor: if you have a startup working on a large-scale infrastructure project, you have to set up your own logistics, just like a camp in the literal wilderness.
Agreed. It reads like Seven of Nine realizing she's separated from the Collective and needs to rely on lowly human capabilities. The insights into the vendors were informative.
Newbie question: what happens when an LLM training job experiences a hardware failure? I don't suppose you lose all the training progress, do you? Then the pain is mostly in diagnosing the problem and getting the cluster running again, but there's no need to worry about data loss, right?
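My guess is that periodic checkpointing is what limits the damage, so a crash only costs the steps since the last save, something like this (illustrative PyTorch-style sketch, not from the post):

```python
import torch
from torch import nn

model = nn.Linear(128, 128)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step, path="ckpt.pt"):
    # Saved periodically (e.g. every N steps) during training.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": opt.state_dict()}, path)

def resume(path="ckpt.pt"):
    # After the failed node is swapped out, training restarts from here.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```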