We're talking about algorithms for TPUs, which quietly quantize your float32 matrices to bfloat16 behind your back [0]. This is aimed at a crowd that doesn't care about stability.
There's a big difference between not caring about stability and being willing to trade precision for better memory bandwidth in an application that doesn't benefit from the extra precision. When doing large training jobs on TPUs, stability is paramount! It's true that you have to know more about what you're doing when you reduce bit-depth - the horrors of floating point are harder to ignore, and reduced precision is wildly inappropriate for many scientific computations. However, the reduction of bit-depth is likely to continue as we seek to make modern models more efficient and economical to train and use.
What does this mean in practice? For ML, we usually don't care if a weight is 0.05 or 0.10, because we have millions of weights. We do care if one is 1.237e+27 instead of 1.237e-3, though.
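To make the precision-versus-range point concrete, here's a small numpy sketch. It simulates bfloat16 by truncating the low 16 bits of a float32 (my own toy rounding; real hardware typically rounds to nearest-even, so exact values differ slightly):

    import numpy as np

    def to_bfloat16(x):
        # Keep the sign, the full 8-bit exponent and the top 7 mantissa bits;
        # zero out the rest. That is bfloat16's layout inside a float32.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    # Precision: only ~2-3 significant decimal digits survive.
    print(to_bfloat16(0.05), to_bfloat16(0.10))   # ~0.0498 and ~0.0996
    print(to_bfloat16(1.2345678))                 # ~1.2344

    # Range: the 8-bit exponent is the same as float32's, so huge and tiny
    # magnitudes are still representable - you lose digits, not orders of
    # magnitude (unless an intermediate result actually overflows).
    print(to_bfloat16(1.237e+27), to_bfloat16(1.237e-3))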
Numerical errors have the annoying tendency to accumulate if you're not careful. So doing one matrix operation with low precision might be okay, while doing a dozen might completely garble your result.
This is not that relevant for ML. Each gradient pass re-computes your cost function and the gradients, so errors are not likely to accumulate across steps. The main thing is not to make errors big enough that you end up in a completely different part of the parameter space and derail progress, which is what the above commenter points out.
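Both points are easy to see in a toy numpy experiment (a sketch with made-up sizes, not anyone's production setup): chaining exactly norm-preserving rotations in float16 accumulates rounding error with the length of the chain, while anything recomputed from scratch - like the next gradient pass - starts again from zero accumulated error.

    import numpy as np

    rng = np.random.default_rng(0)
    n, chain_len = 64, 50

    # Random orthogonal matrices via QR: in exact arithmetic they preserve norms,
    # so any drift in the low-precision chain is purely rounding error.
    mats = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(chain_len)]
    x64 = rng.standard_normal(n)

    ref = x64.copy()                    # float64 reference chain
    low = x64.astype(np.float16)        # float16 chain
    for i, m in enumerate(mats, 1):
        ref = m @ ref
        # Round the running product back to float16 after every step.
        low = (m.astype(np.float16) @ low).astype(np.float16)
        if i in (1, 10, 50):
            err = np.linalg.norm(low.astype(np.float64) - ref) / np.linalg.norm(ref)
            print(f"after {i:2d} matmuls: relative error ~ {err:.1e}")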
I am familiarizing myself with recurrent neural networks and getting them trained online is a pain - I get NaNs all the time except for very small learning rates, which in turn prevent my networks from learning anything.
The deeper the network is, the more pronounced the accumulation of errors in online training becomes. Add 20-30 fully connected (not highway or residual) layers before the softmax and you'll see wonders: you won't be able to keep anything stable.
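For what it's worth, the failure mode is easy to reproduce in a toy linear recurrence (an illustration of the exploding-gradient/overflow mechanism with made-up sizes, not the setups described above): once the recurrent matrix has spectral radius above 1, backpropagated gradients grow exponentially with the number of steps and eventually overflow float32, and any sizeable learning-rate update then fills the weights with inf/NaN.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 32
    W = rng.standard_normal((n, n)).astype(np.float32)
    W *= 1.05 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius ~1.05

    grad = np.ones(n, dtype=np.float32)
    for t in range(1, 2001):
        grad = W.T @ grad        # backprop through one more (linear) time step
        if t % 500 == 0 or not np.all(np.isfinite(grad)):
            print(f"step {t:4d}: grad norm = {np.linalg.norm(grad):.3e}")
            if not np.all(np.isfinite(grad)):
                break

    # A weight update like W -= lr * grad_outer with any non-tiny lr would now
    # fill W with inf/NaN, which is where training falls apart.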
This isn't true in general. Very specific ML algorithms that were likely developed with years of blood, sweat, and tears may have this kind of resiliency, but I've been in the numerical weeds enough here that I wouldn't bet on even that without a real expert weighing in on it - and I wonder what the tradeoff is if it's true. It's very easy to have numerical stability issues absolutely crater ML results; been there, done that.
I have some ~15 year old experience with the math behind some of this, but actually none with day-to-day deep learning applications using any of the now-conventional algorithms, so my perspective here is perhaps not that of the most pragmatic user. The status quo may have improved, at least de facto.
I'm not really sure there is evidence for that. In fact, depending on your interpretation of why posits[1] work, we may even have empirical evidence that the opposite is true.
When building an MCMC sampler I was too lazy to properly code a matrix approximation needed to avoid some mathematical black hole and the corresponding underflow. It was cheaper to just ignore the faulty simulations.
Turns out our results were better than the papers we compared to, both in time and precision.
I am not that familiar with ml, but can't you just ignore those faulty weights?
With MCMC, depending on application, it seems risky to just toss out the NaN/inf results. I'd guess these numerical issues are more likely to occur in certain regions of the state space you're sampling from, so your resulting sample could end up a bit biased. In some cases the bias may be small or otherwise unimportant, so the speed-up and simpler code of filtering NaN/inf results is worth it, but in other cases (like when the MCMC samples feed into some chain of downstream computations) the bias may have sneaky insidious effects.
I didn't think deeply about this back then, since my parameter estimates were close to or better than the literature I compared to, but now I'm interested in checking the distribution of those NaN/inf. If I recall correctly they were uniformly distributed throughout an adaptive phase.
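A toy numpy sketch of why that distribution matters (the failure model here is made up, not the actual sampler): if the failed draws cluster in one region of the state space, silently dropping them biases the estimate; if they are spread uniformly, you mostly just lose effective sample size.

    import numpy as np

    rng = np.random.default_rng(0)
    draws = rng.standard_normal(1_000_000)   # stand-in for MCMC samples, true mean 0

    # Pretend draws above +1 "fail" (NaN) half the time.
    fails = (draws > 1.0) & (rng.random(draws.size) < 0.5)
    observed = np.where(fails, np.nan, draws)
    kept = observed[np.isfinite(observed)]
    print("true mean:        0.0")
    print("filtered mean:   ", kept.mean())       # noticeably below 0

    # Failures spread uniformly over the chain instead: no systematic bias,
    # just fewer samples.
    fails_uniform = rng.random(draws.size) < 0.1
    print("uniform-failure mean:", draws[~fails_uniform].mean())   # still ~0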
When people talk about AI taking over the world, a funny image pops up in my head where a robot is trying to enter a frying pan. When you ask it why it's doing that, it says "because I feel like [NaN, NaN, 2.45e24, NaN]", which is a perfectly valid reason.
I'm not at all caught up with this side of ML, but my first instinct is that faulty weights would lead to interpretability issues. The numbers represented by NaN/Inf vastly outnumber the ones within precision range, so interpreting them is much more of a guess.
"A considerable group of numerical analysts still believes in the folk “theorem” that fast MM is always numerical unstable, but in actual tests loss of accuracy in fast MM algorithms was limited, and formal proofs of quite reasonable numerical stability of all known fast MM algorithms is available (see [23], [90], [91], [62], and [61])." https://arxiv.org/abs/1804.04102
My concern is that there are not enough people who are qualified to determine if a fast algorithm can be used or not. It feels reckless to include less stable algorithms in a general purpose library when the vast majority of users are mainly concerned with speed and blissfully unaware of the pitfalls of floating-point arithmetic.
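The quoted claim is easy to spot-check for the classic case, though. Below is a minimal textbook Strassen recursion in numpy (power-of-two sizes only, my own throwaway code, nothing to do with the paper's new algorithms): in float32 it loses a little accuracy relative to the naive product, but nothing catastrophic.

    import numpy as np

    def strassen(A, B, leaf=64):
        n = A.shape[0]
        if n <= leaf:
            return A @ B                      # fall back to ordinary matmul
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, leaf)
        M2 = strassen(A21 + A22, B11, leaf)
        M3 = strassen(A11, B12 - B22, leaf)
        M4 = strassen(A22, B21 - B11, leaf)
        M5 = strassen(A11 + A12, B22, leaf)
        M6 = strassen(A21 - A11, B11 + B12, leaf)
        M7 = strassen(A12 - A22, B21 + B22, leaf)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    rng = np.random.default_rng(0)
    n = 1024
    A = rng.standard_normal((n, n)).astype(np.float32)
    B = rng.standard_normal((n, n)).astype(np.float32)
    exact = A.astype(np.float64) @ B.astype(np.float64)

    for name, C in [("naive float32   ", A @ B), ("Strassen float32", strassen(A, B))]:
        err = np.linalg.norm(C.astype(np.float64) - exact) / np.linalg.norm(exact)
        print(f"{name}: relative error ~ {err:.1e}")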
My reading of the paper is that the new 4x4 algorithm only works in Z/(2), where there are no issues of roundoff errors. (Z/(2) is the field of integers modulo 2.) The paper seems to say that for real numbers, Strassen is still the best known algorithm for the 4x4 case.
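Which is also why stability isn't a concern there: in Z/(2) the entries are 0/1 and everything is exact integer arithmetic, so there is nothing to round. For example:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(0, 2, size=(4, 4))
    B = rng.integers(0, 2, size=(4, 4))
    print((A @ B) % 2)   # exact: integer adds and multiplies, then reduce mod 2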
(Disclaimer: googler, I have nothing to do with this research.)
I am not a fan of Google. But this is such a bizarre, deliberately misleading, snarky comment. Lower precision floating point arithmetic is very common in ML training. There's no 'behind your back' going on here.
If there are in fact stability issues, I wonder which is cheaper: using this fancy algorithm or changing bfloat16 to something like bfloat14 and using a more stable matmul.
I don't see how that can be true. Lack of precision is one thing, lack of stability is very different.
Instability leads to divergence from the true answer, and I would expect super-linear divergence, which would quickly destroy any meaningful result (=> chaotic behaviour). But I'm not an expert.
In practice this doesn't happen, because numerically unstable NNs tend to have bad loss. A simple way to see this: instability means the network is highly sensitive to its inputs, so it will give wildly different results for essentially the same input, which is wrong. Furthermore, if the weights of your NN cause dramatic overflow/underflow that prevents it from predicting correctly, that shows up as a high loss, and training will move toward parameters where these rounding errors don't blow up.
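A toy sketch of the overflow half of that argument (a made-up two-layer ReLU net in numpy, not any particular training recipe): once the weights are large enough that the logits overflow, a naive softmax produces inf/NaN and the loss is enormous or NaN - exactly the kind of parameter region training has to steer away from.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(128).astype(np.float32)

    def loss_with_weight_scale(scale):
        # Hypothetical 128 -> 64 -> 10 net with weights drawn at a given scale.
        W1 = (scale * rng.standard_normal((64, 128))).astype(np.float32)
        W2 = (scale * rng.standard_normal((10, 64))).astype(np.float32)
        logits = W2 @ np.maximum(W1 @ x, 0)          # ReLU MLP, no normalization
        p = np.exp(logits) / np.exp(logits).sum()    # naive softmax, can overflow
        return -np.log(p[0])                         # cross-entropy for class 0

    print("sane weights:     ", loss_with_weight_scale(0.1))
    print("oversized weights:", loss_with_weight_scale(1e3))   # inf or NaN loss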
That's the wrong way to think about the problem. Lowering bits does not mean lowering model capacity. It's the opposite, in fact - it allows you to fit more parameters.
Well, yes, within your constraints. In the end, you are choosing between two aspects. Same as with screens: you could have resolution (1024x768!!) or color (16 bits!).
Kind of. If we get computers that are 1000x faster, it just becomes a tradeoff between higher precision and 1000x more parameters. The reason resolution has stopped being pushed is that our eyes have severe diminishing returns. It's not yet known whether brains do.
[0] https://cloud.google.com/blog/products/ai-machine-learning/b...