We're talking about algorithms for TPUs, which quietly quantize your float32 matrices to bfloat16 behind your back [0]. This is aimed at a crowd that doesn't care about stability.
There's a big difference between not caring about stability and being willing to trade precision for better memory bandwidth in an application that doesn't benefit from the extra precision. When doing large training jobs on TPUs, stability is paramount! It's true that you have to know more about what you're doing when you reduce bit-depth - the horrors of floating point are harder to ignore, and reduced precision is wildly inappropriate for many scientific computations. However, the reduction of bit-depth is likely to continue as we seek to make modern models more efficient and economical to train and use.
What does this mean in practice? For ML, we usually don't care if a weight is 0.05 or 0.10, because we have millions of weights. We do care if one is 1.237e+27 instead of 1.237e-3, though.
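To make the precision-versus-range point concrete, here's a small numpy sketch. It simulates bfloat16 by truncating the low 16 bits of a float32 (my own toy rounding; real hardware typically rounds to nearest-even, so exact values differ slightly):

    import numpy as np

    def to_bfloat16(x):
        # Keep the sign, the full 8-bit exponent and the top 7 mantissa bits;
        # zero out the rest. That is bfloat16's layout inside a float32.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    # Precision: only ~2-3 significant decimal digits survive.
    print(to_bfloat16(0.05), to_bfloat16(0.10))   # ~0.0498 and ~0.0996
    print(to_bfloat16(1.2345678))                 # ~1.2344

    # Range: the 8-bit exponent is the same as float32's, so huge and tiny
    # magnitudes are still representable - you lose digits, not orders of
    # magnitude (unless an intermediate result actually overflows).
    print(to_bfloat16(1.237e+27), to_bfloat16(1.237e-3))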
Numerical errors have the annoying tendency to accumulate if you're not careful. So doing one matrix operation with low precision might be okay, while doing a dozen might completely garble your result.
This is not that relevant for ML. Each gradient pass re-computes your cost function and the gradients, so errors are not likely to accumulate across steps. The main thing is not to make errors big enough that you end up in a completely different part of the parameter space and derail progress, which is what the above commenter points out.
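Both points are easy to see in a toy numpy experiment (a sketch with made-up sizes, not anyone's production setup): chaining exactly norm-preserving rotations in float16 accumulates rounding error with the length of the chain, while anything recomputed from scratch - like the next gradient pass - starts again from zero accumulated error.

    import numpy as np

    rng = np.random.default_rng(0)
    n, chain_len = 64, 50

    # Random orthogonal matrices via QR: in exact arithmetic they preserve norms,
    # so any drift in the low-precision chain is purely rounding error.
    mats = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(chain_len)]
    x64 = rng.standard_normal(n)

    ref = x64.copy()                    # float64 reference chain
    low = x64.astype(np.float16)        # float16 chain
    for i, m in enumerate(mats, 1):
        ref = m @ ref
        # Round the running product back to float16 after every step.
        low = (m.astype(np.float16) @ low).astype(np.float16)
        if i in (1, 10, 50):
            err = np.linalg.norm(low.astype(np.float64) - ref) / np.linalg.norm(ref)
            print(f"after {i:2d} matmuls: relative error ~ {err:.1e}")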
I am familiarizing myself with recurrent neural networks and getting them trained online is a pain - I get NaNs all the time except for very small learning rates, which in turn prevent my networks from learning anything.
The deeper the network is, the more pronounced the accumulation of errors in online training becomes. Add 20-30 fully connected (not highway or residual) layers before the softmax and you'll see wonders: you won't be able to keep anything stable.
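For what it's worth, the failure mode is easy to reproduce in a toy linear recurrence (an illustration of the exploding-gradient/overflow mechanism with made-up sizes, not the setups described above): once the recurrent matrix has spectral radius above 1, backpropagated gradients grow exponentially with the number of steps and eventually overflow float32, and any sizeable learning-rate update then fills the weights with inf/NaN.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 32
    W = rng.standard_normal((n, n)).astype(np.float32)
    W *= 1.05 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius ~1.05

    grad = np.ones(n, dtype=np.float32)
    for t in range(1, 2001):
        grad = W.T @ grad        # backprop through one more (linear) time step
        if t % 500 == 0 or not np.all(np.isfinite(grad)):
            print(f"step {t:4d}: grad norm = {np.linalg.norm(grad):.3e}")
            if not np.all(np.isfinite(grad)):
                break

    # A weight update like W -= lr * grad_outer with any non-tiny lr would now
    # fill W with inf/NaN, which is where training falls apart.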
This isn't true in general. Very specific ML algorithms that were likely developed with years of blood, sweat, and tears may have this kind of resiliency, but I've been in the numerical weeds enough here that I wouldn't bet on even that without a real expert weighing in on it - and I wonder what the tradeoff is if it's true. It's very easy to have numerical stability issues absolutely crater ML results; been there, done that.
I have some ~15 year old experience with the math behind some of this, but actually none with day-to-day deep learning applications using any of the now-conventional algorithms, so my perspective here is perhaps not that of the most pragmatic user. The status quo may have improved, at least de facto.
I'm not really sure there is evidence for that. In fact, depending on your interpretation of why posits[1] work, we may even have empirical evidence that the opposite is true.
When building an MCMC sampler I was too lazy to properly code a matrix approximation needed to avoid some mathematical black hole and the corresponding underflow. It was cheaper to just ignore the faulty simulations.
Turns out our results were better than the papers we compared to, both in time and precision.
I am not that familiar with ml, but can't you just ignore those faulty weights?
With MCMC, depending on application, it seems risky to just toss out the NaN/inf results. I'd guess these numerical issues are more likely to occur in certain regions of the state space you're sampling from, so your resulting sample could end up a bit biased. In some cases the bias may be small or otherwise unimportant, so the speed-up and simpler code of filtering NaN/inf results is worth it, but in other cases (like when the MCMC samples feed into some chain of downstream computations) the bias may have sneaky insidious effects.
I didn't think deeply about this back then, since my parameter estimates were close to or better than the literature I compared to, but now I'm interested in checking the distribution of those NaN/inf. If I recall correctly they were uniformly distributed throughout an adaptive phase.
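A toy numpy sketch of why that distribution matters (the failure model here is made up, not the actual sampler): if the failed draws cluster in one region of the state space, silently dropping them biases the estimate; if they are spread uniformly, you mostly just lose effective sample size.

    import numpy as np

    rng = np.random.default_rng(0)
    draws = rng.standard_normal(1_000_000)   # stand-in for MCMC samples, true mean 0

    # Pretend draws above +1 "fail" (NaN) half the time.
    fails = (draws > 1.0) & (rng.random(draws.size) < 0.5)
    observed = np.where(fails, np.nan, draws)
    kept = observed[np.isfinite(observed)]
    print("true mean:        0.0")
    print("filtered mean:   ", kept.mean())       # noticeably below 0

    # Failures spread uniformly over the chain instead: no systematic bias,
    # just fewer samples.
    fails_uniform = rng.random(draws.size) < 0.1
    print("uniform-failure mean:", draws[~fails_uniform].mean())   # still ~0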
When people talk about AI taking over the world, a funny image pops up in my head where a robot is trying to enter a frying pan. When you ask it why it's doing that, it says "because I feel like [NaN, NaN, 2.45e24, NaN]", which is a perfectly valid reason.
I'm not at all caught up with this side of ML, but my first instinct is that faulty weights would lead to interpretability issues. The numbers represented by NaN/Inf vastly outnumber the ones within precision range, so interpreting them is much more of a guess.
"A considerable group of numerical analysts still believes in the folk “theorem” that fast MM is always numerical unstable, but in actual tests loss of accuracy in fast MM algorithms was limited, and formal proofs of quite reasonable numerical stability of all known fast MM algorithms is available (see [23], [90], [91], [62], and [61])." https://arxiv.org/abs/1804.04102
My concern is that there are not enough people who are qualified to determine if a fast algorithm can be used or not. It feels reckless to include less stable algorithms in a general purpose library when the vast majority of users are mainly concerned with speed and blissfully unaware of the pitfalls of floating-point arithmetic.
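The quoted claim is easy to spot-check for the classic case, though. Below is a minimal textbook Strassen recursion in numpy (power-of-two sizes only, my own throwaway code, nothing to do with the paper's new algorithms): in float32 it loses a little accuracy relative to the naive product, but nothing catastrophic.

    import numpy as np

    def strassen(A, B, leaf=64):
        n = A.shape[0]
        if n <= leaf:
            return A @ B                      # fall back to ordinary matmul
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, leaf)
        M2 = strassen(A21 + A22, B11, leaf)
        M3 = strassen(A11, B12 - B22, leaf)
        M4 = strassen(A22, B21 - B11, leaf)
        M5 = strassen(A11 + A12, B22, leaf)
        M6 = strassen(A21 - A11, B11 + B12, leaf)
        M7 = strassen(A12 - A22, B21 + B22, leaf)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    rng = np.random.default_rng(0)
    n = 1024
    A = rng.standard_normal((n, n)).astype(np.float32)
    B = rng.standard_normal((n, n)).astype(np.float32)
    exact = A.astype(np.float64) @ B.astype(np.float64)

    for name, C in [("naive float32   ", A @ B), ("Strassen float32", strassen(A, B))]:
        err = np.linalg.norm(C.astype(np.float64) - exact) / np.linalg.norm(exact)
        print(f"{name}: relative error ~ {err:.1e}")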
My reading of the paper is that the new 4x4 algorithm only works in Z/(2), where there are no issues of roundoff errors. (Z/(2) is the field of integers modulo 2.) The paper seems to say that for real numbers, Strassen is still the best known algorithm for the 4x4 case.
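Which is also why stability isn't a concern there: in Z/(2) the entries are 0/1 and everything is exact integer arithmetic, so there is nothing to round. For example:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.integers(0, 2, size=(4, 4))
    B = rng.integers(0, 2, size=(4, 4))
    print((A @ B) % 2)   # exact: integer adds and multiplies, then reduce mod 2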
(Disclaimer: googler, I have nothing to do with this research.)
I am not a fan of Google. But this is such a bizarre, deliberately misleading, snarky comment. Lower precision floating point arithmetic is very common in ML training. There's no 'behind your back' going on here.
If there are in fact stability issues, I wonder which is cheaper: using this fancy algorithm or changing bfloat16 to something like bfloat14 and using a more stable matmul.
I don't see how that can be true. Lack of precision is one thing, lack of stability is very different.
Instability leads to divergence from the true answer, and I would expect super-linear divergence, which would quickly destroy any meaningful result (=> chaotic behaviour). But I'm not an expert.
In practice this doesn't happen, because numerically unstable NNs tend to have bad loss. A simple way to see this: instability means the network is highly sensitive to its inputs, so it will give wildly different results for essentially the same input, which is wrong. Furthermore, if the weights of your NN cause dramatic overflow/underflow that prevents it from predicting correctly, that shows up as a high loss, and training will move toward parameters where these rounding errors don't blow up.
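A toy sketch of the overflow half of that argument (a made-up two-layer ReLU net in numpy, not any particular training recipe): once the weights are large enough that the logits overflow, a naive softmax produces inf/NaN and the loss is enormous or NaN - exactly the kind of parameter region training has to steer away from.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(128).astype(np.float32)

    def loss_with_weight_scale(scale):
        # Hypothetical 128 -> 64 -> 10 net with weights drawn at a given scale.
        W1 = (scale * rng.standard_normal((64, 128))).astype(np.float32)
        W2 = (scale * rng.standard_normal((10, 64))).astype(np.float32)
        logits = W2 @ np.maximum(W1 @ x, 0)          # ReLU MLP, no normalization
        p = np.exp(logits) / np.exp(logits).sum()    # naive softmax, can overflow
        return -np.log(p[0])                         # cross-entropy for class 0

    print("sane weights:     ", loss_with_weight_scale(0.1))
    print("oversized weights:", loss_with_weight_scale(1e3))   # inf or NaN loss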
That's the wrong way to think about the problem. Lowering bits does not mean lowering model capacity. It's the opposite, in fact - it allows you to fit more parameters.
Well, yes, within your constraints. In the end, you are choosing between two aspects. Same as with screens: you could have resolution (1024x768!!) or color (16 bits!).
Kind of. If we get computers that are 1000x faster, it just becomes a tradeoff between higher precision and 1000x more parameters. The reason resolution has stopped being pushed is that our eyes have severe diminishing returns. It's not yet known whether brains do.
[0] https://cloud.google.com/blog/products/ai-machine-learning/b...