There's something I'm fundamentally missing here--if the standard basis and the Bernstein basis describe exactly the same set of polynomials of degree n, then surely the polynomial of degree n that minimizes the mean square error is unique (and independent of the basis--the error is between the samples and the approximation; the coefficients/basis are not involved), so both the standard-basis and Bernstein-basis solutions are the same (pathological, overfitted, oscillating) curve?
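To spell out what I mean (my own sketch of the unregularized case; V and B are the design matrices of the standard and Bernstein bases evaluated at the m > n sample points, y the sample values):

    \hat{c} = \arg\min_c \|Vc - y\|^2, \qquad \hat{d} = \arg\min_d \|Bd - y\|^2

The columns of V and B span the same space of degree-n polynomials, so both fits are the projection of y onto that space and V\hat{c} = B\hat{d}, even though \hat{c} \neq \hat{d}.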
Like I understand how the standard basis is pathological because the higher-degree powers diverge like mad, so given "reasonable" coefficients the Bernstein basis is more likely to give "reasonable" curves, but if you're already minimizing the same error I don't understand how you arrive at a different curve.

What am I missing?
The minimization is regularized: you add a penalty term for large coefficients. The same polynomial has different coefficients in the two bases, so the penalty, and therefore the regularized solution, comes out different.
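A minimal sketch of that difference, assuming NumPy/SciPy (helper names like bernstein_design and fit are mine, not from the article): fit the same noisy samples with a degree-n polynomial in both bases, once without a penalty and once with a small L2 penalty on the coefficients.

    import numpy as np
    from scipy.special import comb

    def monomial_design(x, n):
        # Columns are the standard basis 1, x, x^2, ..., x^n.
        return x[:, None] ** np.arange(n + 1)

    def bernstein_design(x, n):
        # Columns are the Bernstein basis polynomials b_{k,n}(x) on [0, 1].
        k = np.arange(n + 1)
        return comb(n, k) * x[:, None] ** k * (1 - x[:, None]) ** (n - k)

    def fit(A, y, lam=0.0):
        # Minimize ||A c - y||^2 + lam * ||c||^2 via an augmented least-squares solve.
        p = A.shape[1]
        A_aug = np.vstack([A, np.sqrt(lam) * np.eye(p)])
        y_aug = np.concatenate([y, np.zeros(p)])
        return np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]

    rng = np.random.default_rng(0)
    n, m = 10, 30
    x = np.sort(rng.uniform(0.0, 1.0, m))
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)

    xs = np.linspace(0.0, 1.0, 200)          # dense grid for comparing the fitted curves
    V, B = monomial_design(x, n), bernstein_design(x, n)
    Vs, Bs = monomial_design(xs, n), bernstein_design(xs, n)

    # No penalty: same column space, same projection, same curve (up to rounding).
    print(np.max(np.abs(Vs @ fit(V, y) - Bs @ fit(B, y))))

    # With a penalty: it acts on different coefficient vectors, so the curves differ.
    lam = 1e-3
    print(np.max(np.abs(Vs @ fit(V, y, lam) - Bs @ fit(B, y, lam))))

The augmented-matrix form is just a numerically friendlier way of writing ridge regression; the point is only that the penalty ||c||^2 depends on the basis while the data term ||Ac - y||^2 does not.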
Ok, yeah, doing a little googling, that makes sense. I kind of feel the article author was burying the lede: this is really about ML optimization, where apparently regularization is the norm (so to speak lol), and basis selection is the whole ball game only indirectly, through the way it influences the convex optimization.