The article specifically points out that these polynomials only work well on specific intervals (emphasis copied from the article):
"The second source of their bad reputation is misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entrely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."
As I understand it, one of the main ideas of this series of posts is that normalizing features to very specific intervals is important when fitting polynomials. I don't think this "went completely uncommented".
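For concreteness, the three normalization options named in that quote are all one-liners. A rough NumPy sketch (the function names are mine, not the article's):

```python
import numpy as np

def minmax_scale(x):
    # Linearly map the raw feature onto [0, 1].
    return (x - x.min()) / (x.max() - x.min())

def quantile_scale(x):
    # Replace each value by its empirical quantile (rank / (n-1)), also in [0, 1].
    # Ties get arbitrary distinct ranks, which is fine for a sketch.
    return np.argsort(np.argsort(x)) / (len(x) - 1)

def sigmoid_scale(x):
    # Squash a roughly centered feature into (0, 1) with a logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))
```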
Yes! And the next articles in the series double down on this:
"Any polynomial basis has a “natural domain” where its approximation properties are well-known. Raw features must be normalized to that domain. The natural domain of the Bernstein basis is the interval [0,1][0,1]."
The quote has absolutely nothing to do with my point.
The scaling to an interval in the quote is about formal mathematical reasons, in particular that polynomials do not approximate continuous functions globally. This is totally unrelated to numerics.
The issue is that, in particular, the interval 0 to 1 has to be chosen, as otherwise the numerics totally fall apart. The message of the article is that high-degree polynomials pose no danger, but that is wrong. All the examples in the article only work because of a specific choice of interval. All the major numerical issues are totally ignored, which would immediately invalidate the core thesis of the article. If you calculate 10^100 in 64-bit floating point you will run into trouble. The article pretends that will not be the case.
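To see what "the numerics fall apart" looks like in practice, here is a toy sketch (my own numbers, nothing from the article): a degree-15 fit, once on a raw "year-like" feature and once after min-max scaling to [0, 1]. Exact numbers will vary by setup, but the gap is large.

```python
import numpy as np

# Raw "year-like" feature and one period of a sine over it.
x_raw = np.linspace(2000.0, 2020.0, 400)
y = np.sin(2 * np.pi * (x_raw - 2000.0) / 20.0)

# Min-max scale the same feature to [0, 1].
x01 = (x_raw - x_raw.min()) / (x_raw.max() - x_raw.min())

deg = 15
coef_raw = np.polyfit(x_raw, y, deg)  # typically triggers a RankWarning
coef_01 = np.polyfit(x01, y, deg)

err_raw = np.max(np.abs(np.polyval(coef_raw, x_raw) - y))
err_01 = np.max(np.abs(np.polyval(coef_01, x01) - y))

# On the raw feature the monomials x**k are astronomically large and nearly
# collinear, so the fit error is orders of magnitude worse than on [0, 1].
print(err_raw, err_01)
```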
However, if you normalize your data to [0,1], you'll never have to compute 10^100 and thus never face any numerical issues. "Never" assumes no distribution shift.
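That part is easy to sanity-check (my own snippet, not the article's):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)
# Once the feature lives in [0, 1], even the 100th power stays in [0, 1]:
# no overflow, no huge intermediate values.
print(x ** 100)
```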
Indeed, the examples work thanks to this choice of the interval, but this comes with the choice of the basis. Of course Bernstein basis functions explode outside [0,1], but I think the point is that high-degree polynomials pose no danger if you scale the data *according to the polynomial* (use [0,1] for Bernstein and [-1,1] for Chebyshev, for example). So the "magic combo" is polynomial + scaling to its interval. Otherwise all bets are off, of course.
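NumPy's own polynomial classes make that combo explicit: for example, Chebyshev.fit records the data range as `domain` and maps it onto the basis's natural window [-1, 1] before fitting. A toy example of mine, reusing the data from the sketch above:

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Same kind of raw feature as above: a "year-like" range and one sine period.
x_raw = np.linspace(2000.0, 2020.0, 400)
y = np.sin(2 * np.pi * (x_raw - 2000.0) / 20.0)

# fit() stores the data range as `domain` and maps it to the Chebyshev
# window [-1, 1] internally, so the high degree is harmless here.
cheb = Chebyshev.fit(x_raw, y, deg=50)
print(cheb.domain)                      # [2000. 2020.]
print(np.max(np.abs(cheb(x_raw) - y)))  # tiny residual, near machine precision
```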
The article totally ignores this and does not even mention the numerical issues at all, which is pretty insane.
Surely at least naming THE ONE reason high-degree polynomials are dangerous has to be done. Writing an article arguing that something is not a problem, while not even acknowledging the single most important reason why people believe the problem exists, is totally disingenuous and pretty terrible scholarship.
At least include that the choice of 0 to 1 is necessary for this to work. Not including it makes the author look either clueless or malicious.
"The second source of their bad reputation is misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entrely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."
As I understand it, one of the main ideas of this series of posts is that normalizing features to very specific intervals is important when fitting polynomials. I don't think this "went completely uncommented".