
Very well said.

> Almost every corner of an ML problem has an optimization problem that needs to be solved: there is a function you want to minimize, subject to constraints. Typically these functions are smooth everywhere, or sometimes smooth almost everywhere. So calculus shows up in (i) algorithms that find the bottom of these functions (if it exists) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter?", "what loss would these settings rack up on average?", and so on.
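
To make (i) and (ii) concrete, here is a minimal sketch (the data and names are invented for illustration): least squares as the "how close am I to the correct parameter" loss, minimized first by gradient descent and then in closed form via the normal equations.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # 100 datapoints, 3 parameters
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    def loss(w):
        # Everywhere-smooth: the average squared residual over the data.
        return np.mean((X @ w - y) ** 2)

    def grad(w):
        # Calculus at work: the gradient of the mean squared error.
        return 2.0 * X.T @ (X @ w - y) / len(y)

    # (i) Walk downhill to the bottom of the function.
    w = np.zeros(3)
    for _ in range(500):
        w -= 0.1 * grad(w)

    # (ii) Or derive the minimizer in closed form (the normal equations).
    w_closed = np.linalg.solve(X.T @ X, X.T @ y)

    print(loss(w), loss(w_closed))           # both near the noise floor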

> The reason this differs from a purely optimization / mathematical programming problem is that we can only approximately evaluate the actual function we care to optimize: the performance of our model on new / unseen data. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that is revealed to us slowly, one datapoint at a time, while the true function typically involves a continuum of datapoints. This is where probability enters the picture.
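
And here is the "revealed one datapoint at a time" part as a sketch: we never evaluate the true objective, which is an expectation over the whole data distribution; we only see noisy single-sample gradients, yet stochastic gradient descent still drives us toward the expectation's minimizer. The data stream below is hypothetical, standing in for "new / unseen data".

    import numpy as np

    rng = np.random.default_rng(1)
    true_w = np.array([2.0, -1.0, 0.5])

    def next_datapoint():
        # Hypothetical stream standing in for the continuum of datapoints;
        # the true objective E[(x.w - y)^2] is never evaluated directly.
        x = rng.normal(size=3)
        y = x @ true_w + 0.1 * rng.normal()
        return x, y

    w = np.zeros(3)
    for t in range(1, 5001):
        x, y = next_datapoint()
        g = 2.0 * (x @ w - y) * x    # gradient at a single datapoint
        w -= (0.5 / t) * g           # decaying steps, a standard SGD schedule

    print(w)    # approaches true_w even though each gradient is noisy

Each step optimizes the "wrong" function (one datapoint's loss), but on average the steps point at the right one, which is exactly where probability comes in.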



