Interesting! This was already the case with TPUs easily beating A100s. We sell Stable Diffusion fine-tuning on TPUs (dreamlook.ai), and people are amazed at how fast and cheap we can offer it, but there's no big secret: we just use hardware that's strictly faster and cheaper per unit of work.
I expect a new wave of "your task, but on superior hardware" services to crop up with these chips!
The v5es and v5ps are pretty amazing at running SD; we're handing over the SD3 code now so it can be optimised on those.
The v5es are particularly interesting given the millions that will land and the large pod sizes, which are especially well suited to million-token context windows.