
I completely disagree with the conclusion of the article. The reason the examples worked so well is an arbitrary choice, which went completely uncommented.

The interval was chosen as 0 to 1. This single fact is what made this feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have to compute 10^100, which would have led to drastic numerical errors.

The article totally fails to give any of the legitimate and very important reasons why high-degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist because you just found one example where they did not occur.



The article specifically points out that these polynomials only work well on specific intervals (emphasis copied from the article):

"The second source of their bad reputation is misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entrely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."

As I understand it, one of the main ideas of this series of posts is that normalizing features to very specific intervals is important when fitting polynomials. I don't think this "went completely uncommented".


Yes! And the next articles in the series double down on this:

"Any polynomial basis has a “natural domain” where its approximation properties are well-known. Raw features must be normalized to that domain. The natural domain of the Bernstein basis is the interval [0,1][0,1]."


Totally irrelevant. The natural domain, if compact, can always be scaled; it has nothing to do with the numerical problems.

Also, the Hermite polynomials have an unbounded natural domain.


The quote has absolutely nothing to do with my point.

The scaling to an interval in the quote is about formal mathematical reasons, in particular that polynomials do not approximate continuous functions globally. This is totally unrelated to numerics.

The issue is that precisely this interval, 0 to 1, has to be chosen, as otherwise the numerics totally fall apart. The message of the article is that high-degree polynomials pose no danger, but that is wrong. All the examples in the article only work because of a specific choice of interval. All the major numerical issues are totally ignored, which would immediately invalidate the core thesis of the article. If you calculate 10^100 in 64-bit floating point you will run into trouble. The article pretends that will not be the case.


However, if you normalize your data to [0,1], you'll never have to compute 10^100 and thus never face any numerical issues. "Never" assumes no distribution shift.

Indeed, the examples work thanks to this choice of the interval, but this comes with the choice of the basis. Of course Bernstein basis functions explode outside [0,1], but I think the point is that high-degree polynomials pose no danger if you scale the data *according to the polynomial* (use [0,1] for Bernstein and [-1,1] for Chebyshev, for example). So the "magic combo" is polynomial + scaling to its interval. Otherwise all bets are off, of course.
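
Something like this, for example (a rough numpy sketch; the data, range, and degree are made up):

    import numpy as np
    from numpy.polynomial import Chebyshev

    rng = np.random.default_rng(0)
    x_raw = rng.uniform(0.0, 10.0, 2000)                 # raw feature on [0, 10]
    y = np.sin(x_raw) + 0.1 * rng.standard_normal(2000)

    # min-max scale to the Chebyshev basis' natural domain [-1, 1]
    lo, hi = x_raw.min(), x_raw.max()
    x = 2.0 * (x_raw - lo) / (hi - lo) - 1.0

    # a degree-100 fit is unremarkable once the data lives on [-1, 1]
    # (Chebyshev.fit can also handle this domain mapping itself via `domain`)
    p = Chebyshev.fit(x, y, deg=100)
    print(np.max(np.abs(p(x) - y)))                      # stays near the noise level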


The article totally ignores this and does not even mention the numerical issues at all, which is pretty insane.

Surely at least naming THE ONE reason high-degree polynomials are dangerous has to be done. Writing an article arguing that something is not a problem, while not even acknowledging the single most important reason why people believe the problem exists, is totally disingenuous and pretty terrible scholarship.

At least include that the choice of 0 to 1 is necessary for this to work. Not including it makes the author look either clueless or malicious.


The article says:

> These are n-degree polynomials defined on [0,1]…

How can it possibly be clearer?


Yes, and it fails to talk about boundary conditions or predicates or whatever.


I am the author of that post series... I see it made some noise :)

I partially agree with your claims. It indeed is an issue if you take high powers of numbers outside of the [-1, 1] interval, but I completely disagree with your assertion that this is the main issue.

Take the famous LogSumExp function as an example - it's used everywhere in machine learning. Nobody says it's a "big no no", even though a naive implementation blows up because you exponentiate large numbers. And if you want to back-prop through it - well, it's even worse. But there is also a "right" way to use it: LogSumExp(x_1, ..., x_n) = LogSumExp(x_1 - m, ..., x_n - m) + m, where m = max(x_1, ..., x_n).

But the point is that you don't even care! We have good implementations in library functions, such as torch.logsumexp or scipy.special.logsumexp, and we don't need to worry about it. We never have to compute LogSumExp ourselves and get into the numerical issues.
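
Roughly, the difference looks like this (a small illustrative sketch with made-up numbers, not code from the posts):

    import numpy as np
    from scipy.special import logsumexp

    x = np.array([1000.0, 1001.0, 1002.0])

    # naive formula overflows: exp(1000) is inf in float64
    naive = np.log(np.sum(np.exp(x)))            # -> inf (overflow warning)

    # max-shift trick: LogSumExp(x) = LogSumExp(x - m) + m, with m = max(x)
    m = np.max(x)
    stable = np.log(np.sum(np.exp(x - m))) + m   # -> ~1002.41

    print(naive, stable, logsumexp(x))           # the library call matches the stable form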

With polynomials it's the same. Take the Legendre basis. Over an arbitrary interval [a, b], you have the recurrence (n+1) P_{n+1}(x) = (2n+1) ((2x - a - b)/(b - a)) P_n(x) - n P_{n-1}(x). It's numerically stable. But you don't even care what the formula is - you just call the appropriate function from the np.polynomial.legendre package.
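
For instance, something along these lines (an illustrative sketch, not code from the posts; the interval, degree, and data are arbitrary):

    import numpy as np
    from numpy.polynomial import Legendre

    rng = np.random.default_rng(0)
    x = rng.uniform(-3.0, 5.0, 1000)            # raw feature on an arbitrary interval [a, b]
    y = np.exp(-x**2) + 0.01 * rng.standard_normal(1000)

    # .fit maps [a, b] onto the Legendre window [-1, 1] and builds the basis
    # via the recurrence internally - no explicit high powers anywhere
    p = Legendre.fit(x, y, deg=50, domain=[-3.0, 5.0])
    print(np.max(np.abs(p(x) - y)))             # stays at the noise level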

The difference between polynomials and LogSumExp stems from education - people are taught that LogSumExp is "normal" but polynomials are a "big no no". Most people know that there is a library function for LogSumExp, but don't know there are similarly good library functions for the "right" polynomial bases.

And why do I believe that the basis is the main issue? Well, it's because even if you use [-1, 1], the standard basis doesn't work with high degrees - it has an extremely badly conditioned Vandermonde matrix, and regularization can't help you get a reasonable polynomial. I show it in several posts in the series. The problem with large powers of small numbers is as bad as with large numbers. And the issue with the standard basis is that you have no choice but to compute these high powers - that's how the basis is defined.
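
You can see the conditioning gap in a few lines (again just a sketch; the degree and grid are arbitrary):

    import numpy as np
    from numpy.polynomial import polynomial as P   # standard (power) basis
    from numpy.polynomial import legendre as L     # Legendre basis

    x = np.linspace(-1, 1, 500)
    deg = 50

    V_power = P.polyvander(x, deg)   # columns 1, x, x^2, ..., x^50
    V_leg = L.legvander(x, deg)      # columns P_0(x), ..., P_50(x)

    print(np.linalg.cond(V_power))   # astronomically large - least squares is hopeless
    print(np.linalg.cond(V_leg))     # modest - high-degree fits stay stable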

Having said that, I will obviously take your input into account and say something about high powers in the post: explain why I believe it's exactly like LogSumExp, and why I believe it's the basis (rather than floating-point errors) that is the issue. You are correct that I wrongly assumed it's obvious, and it is not.

Like with any object, working with it by definition is oftentimes very different from how you work with it computationally. But I believe that once you have good library functions, you don't really care.



