Almost every corner of an ML problem has an optimization problem that needs to be solved: there is a function you want to minimize subject to constraints. Typically these functions are smooth everywhere, or sometimes almost everywhere. So calculus shows up in (i) algorithms for finding the bottom of these functions (if a bottom exists) or (ii) deriving the location of the minima in closed form. The functions in question are things like "how close am I to the correct parameter?" or "what loss would these settings rack up on average?", etc.
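To make (i) and (ii) concrete, here is a minimal sketch of plain gradient descent on a smooth squared-error loss. The loss, step size, iteration count, and toy data are all illustrative assumptions, not anything specific from the discussion above.

```python
import numpy as np

def squared_error_loss(w, X, y):
    """Smooth loss: how badly parameters w fit the data (X, y)."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def gradient(w, X, y):
    """Calculus at work: the gradient of the loss with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, lr=0.1, steps=500):
    """Route (i): walk downhill until we (hopefully) reach the bottom."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * gradient(w, X, y)
    return w

# Route (ii): for this particular loss, calculus also hands us the minimum
# in closed form, via the normal equations X^T X w = X^T y.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 2)), rng.normal(size=50)
print(gradient_descent(X, y), np.linalg.solve(X.T @ X, X.T @ y))  # the two routes agree
```

For this toy loss both routes land in the same place; for most ML losses only route (i) is available.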
The reason this differs from a pure optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that is being revealed to us slowly, one datapoint at a time, while the true function typically involves a continuum of datapoints. This is where we can bring probability into the picture (another option is to treat it as an adversarial game against nature). In the probabilistic approach, we assume that the functions being revealed to us lie in some probabilistic proximity of the true function and that the sample is slowly closing in on it. We have to be careful not to be too eager to model the revealed function; our goal is to optimize the function toward which these revealed functions are ultimately headed.
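A minimal sketch of that "revealed one datapoint at a time" picture: stochastic gradient descent nudges the parameters using each incoming example as a noisy proxy for the true (unseen) average loss. The data stream, the true parameters, and the step-size schedule here are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])        # the function the revealed samples are headed toward

def noisy_example():
    """One datapoint at a time: a noisy glimpse of the true relationship."""
    x = rng.normal(size=2)
    y = x @ true_w + rng.normal(scale=0.5)
    return x, y

w = np.zeros(2)
for t in range(1, 10001):
    x, y = noisy_example()
    grad = (x @ w - y) * x            # gradient of the loss on this single example
    w -= (0.5 / t**0.6) * grad        # decaying step: don't chase any one sample too eagerly
print(w)                              # drifts toward true_w without ever seeing the true loss
```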
Those things aside, if you have to choose just one prerequisite, I think it has to be linear algebra, and you already have that in your bag. Without it, a lot of multivariate calculus will not make much sense anyway. Then one can push things a little further and go for the linear algebra where your vectors have infinite dimension. This becomes important because your data will often carry far more information than you can encode in a finite-dimensional vector. Thankfully a lot of the intuition carries over to infinite dimensions (except when it does not). This goes by the name of functional analysis. Not absolutely essential, but a lack of intuition here can hold you back from certain kinds of work. You will just get a better (at times spatial or geometric) understanding of the picture, etc.
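One place that infinite-dimensional intuition pays off is kernel methods: the RBF kernel below is an inner product between feature maps living in an infinite-dimensional space, yet every quantity you actually compute stays finite. The data, kernel width, and ridge parameter are made up for the sketch.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2): an inner product <phi(x), phi(z)>
    for a feature map phi into an infinite-dimensional space you never write down."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Kernel ridge regression: fit in that infinite-dimensional space
# while only ever handling an n x n matrix of pairwise kernel values.
X = np.random.default_rng(1).normal(size=(20, 3))
y = np.sin(X[:, 0])
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K + 0.1 * np.eye(len(X)), y)      # dual coefficients
predict = lambda x_new: sum(a * rbf_kernel(x_new, x_i) for a, x_i in zip(alpha, X))
```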
Other than their motivating narratives, there is not much difference between probability/statistics and information theory. There is a one-to-one mapping between many, if not all, of their core problems. A lot of this applies to signal processing too. Many of the problems that we are stuck on in these domains are the same. Sometimes a problem seems better motivated in one narrative than the other. Some will call it finding the best code for the source, others will call it parameter estimation, yet others will call it learning.
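One concrete instance of that one-to-one mapping (a standard identity, not something specific to this thread): maximizing likelihood is the same as minimizing the total code length you would pay if you built a Shannon-optimal code from your model,

\[
\hat{\theta}_{\text{MLE}} \;=\; \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(x_i) \;=\; \arg\min_{\theta} \sum_{i=1}^{n} \big(-\log_2 p_{\theta}(x_i)\big),
\]

where -\log_2 p_{\theta}(x_i) is the ideal code length, in bits, for x_i under the model. The statistician's parameter estimation and the information theorist's best code for the source are literally the same optimization.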
Or, if I may paraphrase for the CS audience: blame the reals, \mathbb{R}. Otherwise it would have been the problem of reverse-engineering a noisy Turing machine that we can access only through its input and output. Pretty damn hard even if we don't get into the reals. In those situations you could potentially get by without calculus; algebra by itself should go a long way, but as I said, it gets frigging hard. Even learning the lowly regular expression from examples is hard. Calculus would still be helpful, because many of the combinatorial / counting problems that come up can be tackled with generating function techniques, where you run into integral calculus with complex numbers.
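For a flavor of where the complex integrals sneak into counting (a textbook fact, not anything from the discussion above): the n-th coefficient of a generating function can be extracted with Cauchy's integral formula,

\[
[z^n]\, f(z) \;=\; \frac{1}{2\pi i} \oint_{|z|=r} \frac{f(z)}{z^{\,n+1}}\, dz,
\]

and the singularities of f then dictate how fast the counts grow. For example, the Fibonacci numbers have f(z) = \frac{z}{1 - z - z^2} and grow like \varphi^n / \sqrt{5}, because the pole of f closest to the origin sits at 1/\varphi.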