I'm not super interested in ML but I am very interested in applied mathematics in computer science. I've got a fair bit of linear algebra due to cryptography, but have had virtually no need of any form of calculus (unless I'm relying on it without knowing it) in my career.
So beyond just saying that you'd need grounding in multivariable calculus to do serious ML work, I would be super interested in hearing more about why that is and what kinds of problems crop up in ML that demand it.
Calculus is essentially the study of how things change smoothly, and it gives you a very nice mechanism for talking about smooth change algebraically.
A system which is at an optimum will, at that exact point, be no longer increasing or decreasing: a metal sheet balanced at the peak of a hill rests flat.
Many problems in ML are optimization problems: given some set of constraints, what choice of unknown parameters minimizes the error? This can be very hard (NP-hard) in general, but if you set your problem up to be "smooth" then you can use calculus and its very nice set of algebraic solutions.
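To make that concrete, here's a minimal sketch (Python, with a made-up one-parameter error function) of the "smooth" case: differentiate, set the derivative to zero, and solve algebraically.

    # Illustrative only: a smooth one-parameter "error", minimised by solving f'(x) = 0.
    import sympy as sp

    x = sp.symbols("x", real=True)
    f = (x - 2) ** 2 + 1                 # a stand-in for some smooth error function

    print(sp.solve(sp.diff(f, x), x))    # [2] -- the error is minimised at x = 2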
You also need multivariate calculus because, typically, while you're only trying to minimize a single "error" value, you do so by changing many, many parameters at once. This means you've got to talk about smooth changes in a high-dimensional space.
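For example, here's a rough numpy sketch of gradient descent over 50 parameters at once; the quadratic loss and random data are just placeholders. The "error" is a single number, but its gradient is a vector with one partial derivative per parameter.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 50))       # fake data: 100 samples, 50 parameters
    y = rng.normal(size=100)

    def loss(w):
        return np.mean((A @ w - y) ** 2)

    def grad(w):                         # a vector of 50 partial derivatives
        return 2 * A.T @ (A @ w - y) / len(y)

    w = np.zeros(50)
    for _ in range(2000):
        w -= 0.01 * grad(w)              # step "downhill" in all 50 directions at once

    print(loss(np.zeros(50)), loss(w))   # the error should have dropped substantially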
--
The other side of calculus is integration, which talks about "measuring" how big things are. Most of probability is a discussion of very generalized ratios: "how big is this piece of the total?" is analogous to "what are the odds this will happen?"
The general theory of measure is complex, and essentially the only tool for tackling it is to build a complicated whole out of gigantic (infinite, really) sums of small, well-behaved pieces.
It just happens to turn out (and this is the big secret of calculus, the Fundamental Theorem) that this machinery (integration) is dual to the study of smooth changes, so you can knock them both out together.
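You can even see the duality numerically. This little sketch (standard normal density, nothing special about it) "measures" probability by summing many small pieces, then differentiates that running total and gets the original density back.

    import numpy as np

    xs = np.linspace(-5, 5, 10001)
    dx = xs[1] - xs[0]
    density = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

    cdf = np.cumsum(density) * dx                 # integration: a giant sum of small pieces
    print(cdf[-1])                                # ~1.0, the "size" of the whole distribution

    recovered = np.gradient(cdf, dx)              # differentiation undoes the integration
    print(np.max(np.abs(recovered - density)))    # ~1e-4: the two operations are (numerically) dual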
--
So ultimately, ML hinges upon being able to measure things (integration) and talk about how they change (differentiation). Those two happen to be the same concept, in a way, and they are essentially what you study in calculus.
A lot of probability theory requires it. For instance, ML is largely framed mathematically as a series of optimisation problems, which are then solved by finding the gradient and performing gradient descent; this requires elementary calculus to calculate the gradient.
Additionally, if you want to calculate a probability given a density function, or evaluate an expectation, you need to calculate several integrals. This arises quite often in the theoretical sections of ML papers/textbooks.
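As a small illustration (standard normal, E[X^2], which should come out to 1), evaluating an expectation numerically is literally just computing an integral:

    import numpy as np
    from scipy import integrate
    from scipy.stats import norm

    value, abs_err = integrate.quad(lambda x: x**2 * norm.pdf(x), -np.inf, np.inf)
    print(value)   # ~1.0, i.e. E[X^2] = Var(X) + E[X]^2 = 1 for a standard normal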
The use of calculus in ML is probably similar to the use of number theory in crypto: you can do applied work fine without it, but you understand the work a lot better by knowing the math, and are less likely to make dumb mistakes.
Most of ML is fitting models to data. To fit a model you minimise some error measure as a function of its real valued parameters, e.g. the weights of the connections in a neural network. The algorithms to do the minimisation are based on gradient descent, which depends on derivatives, i.e. differential calculus.
If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.
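Here's a toy sketch of that integral (a coin-flip likelihood with a flat prior, evaluated on a grid rather than in closed form): the denominator of Bayes' law is an integral over all parameter values, and the posterior mean is another integral.

    import numpy as np

    heads, flips = 7, 10
    theta = np.linspace(0.0, 1.0, 10001)
    dtheta = theta[1] - theta[0]
    prior = np.ones_like(theta)                        # flat prior on [0, 1]
    likelihood = theta**heads * (1 - theta)**(flips - heads)

    evidence = np.sum(likelihood * prior) * dtheta     # the integral in the denominator
    posterior = likelihood * prior / evidence

    print(np.sum(theta * posterior) * dtheta)          # posterior mean, ~0.667 = (7+1)/(10+2)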
For ML you just need Calculus 1 and 2. Curl/div and Stokes' theorem are Calculus 3, which is a physics thing. You don't need that for ML.
You may need the basics of functional analysis in certain areas of ML, which is arguably Calculus 4.
> Most of ML is fitting models to data. To fit a model you minimise some error measure as a function of its real valued parameters, e.g. the weights of the connections in a neural network. The algorithms to do the minimisation are based on gradient descent, which depends on derivatives, i.e. differential calculus.
> If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.
The most obvious thing is understanding back-propagation. Backprop is pretty much all partial derivatives / chain rule manipulations. Also a lot of machine learning involves convex optimization which entails some calculus.
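For instance, here's backprop done by hand on a one-hidden-unit toy network (weights and inputs made up), which is nothing more than the chain rule applied layer by layer, checked against a finite-difference estimate:

    import numpy as np

    x, target = 1.5, 2.0
    w1, w2 = 0.3, -0.8

    def loss(w1, w2):
        h = np.tanh(w1 * x)                 # hidden activation
        return 0.5 * (w2 * h - target)**2   # squared-error loss

    # Backprop: chain rule, one layer at a time.
    h = np.tanh(w1 * x)
    y = w2 * h
    dL_dy = y - target
    dL_dw2 = dL_dy * h
    dL_dh = dL_dy * w2
    dL_dw1 = dL_dh * (1 - h**2) * x         # derivative of tanh is 1 - tanh^2

    eps = 1e-6                              # sanity check against a numerical derivative
    numeric = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
    print(dL_dw1, numeric)                  # should agree to several decimal places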
Much of ML is optimization. This is linked to calculus by derivatives. There is the simple part that at a minimum or maximum the derivative is 0. However, more of the relevance comes from gradient descent, which depends very heavily on calculating derivatives and is one of the most universal fast optimization methods.
Beyond that, for iterative methods, convergence is a matter of limits. This again is calculus. Formulating iteration as repeatedly applying a function, we converge (locally) to a fixed point of that function when the derivative at that fixed point lies strictly between -1 and 1. Again derivatives come in.
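A tiny concrete example (the classic x <- cos(x) iteration): it settles on the fixed point precisely because the derivative there has magnitude less than 1.

    import math

    x = 1.0
    for _ in range(50):
        x = math.cos(x)               # repeatedly apply the function

    print(x)                          # ~0.739085, the fixed point of cos
    print(abs(-math.sin(x)))          # ~0.674 < 1, which is why the iteration converged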
Finally, for error estimation, Taylor expansions are often useful. Again, the topic here is calculus. Notably, everything I can think of involves limits and derivatives, not integrals. That might just be due to my hatred of integrals though.
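For what it's worth, here's a small sketch of the Taylor-expansion style of error estimate (a first-order approximation of exp near 0, with the standard second-order remainder bound; the numbers are arbitrary):

    import math

    x0, dx = 0.0, 0.1
    first_order = math.exp(x0) * (1 + dx)             # f(x0) + f'(x0) * dx
    actual = math.exp(x0 + dx)
    remainder_bound = math.exp(x0 + dx) * dx**2 / 2   # max |f''| on the interval, times dx^2 / 2

    print(actual - first_order)                       # ~0.00517, the real error
    print(remainder_bound)                            # ~0.00553, the Taylor bound on it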
I have a pretty good math background, but understanding K-L divergence ([0], a measure of the difference between two probability distributions) required revisiting some calculus. It's needed for understanding models with probabilistic output, used in both generative models and reinforcement learning.
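In the discrete case it's just a sum (a sketch with made-up distributions below); for continuous distributions that sum becomes an integral, which is where the calculus review came in for me.

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])

    kl = np.sum(p * np.log(p / q))    # KL(p || q): the expected log-ratio under p
    print(kl)                         # ~0.025; always >= 0, and 0 only when p == q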
Almost every corner of an ML problem has an optimization problem that needs to be solved: there is a function that you want to minimize subject to constraints. Typically these are everywhere smooth, or sometimes almost-everywhere smooth. So calculus shows up in (i) algorithms to find the bottom of these functions (if it exists) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter?", "what losses would these settings rack up on average?", etc. etc.
The reason this differs from a pure optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that is revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture (another option is to treat it as an adversarial game with nature). In the probabilistic approach, we assume that the function being revealed to us lies in some probabilistic proximity of the true function and that the sample is closing in on it slowly. We have to be careful not to be too eager to model the revealed function; our goal is to optimize the function toward which these revealed functions are ultimately headed.
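A rough sketch of that "revealed one datapoint at a time" picture (a linear model with made-up noise): each sample yields a noisy gradient of the true expected loss, and stochastic gradient descent follows those noisy gradients with small steps so that no single sample dominates.

    import numpy as np

    rng = np.random.default_rng(1)
    true_w = np.array([2.0, -1.0])            # the "true function" we never see directly
    w = np.zeros(2)

    for _ in range(5000):
        x = rng.normal(size=2)                # one new datapoint at a time
        y = x @ true_w + 0.1 * rng.normal()   # a noisy glimpse of the true relationship
        grad = 2 * (x @ w - y) * x            # gradient of this one sample's squared error
        w -= 0.01 * grad                      # small steps: don't chase any single sample

    print(w)                                  # should have drifted close to [2, -1]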
Those things aside, if you have to choose just one prereq, I think it has to be linear algebra, and you already have that in your bag. Without it, a lot of multivariate calculus will not make much sense anyway. Then one can push things a little and go for linear algebra where your vectors have infinite dimension. This becomes important because often your data has far more information than you can encode in a finite-dimensional vector. Thankfully a lot of the intuition carries over to infinite dimensions (except when it doesn't). This goes by the name of functional analysis. Not absolutely essential, but a lack of intuition here can hold you back from certain kinds of work. You will just get a better (at times spatial or geometric) understanding of the picture, etc. etc.
Other than their motivating narratives, there is not much difference between probability/stats and information theory. There is a one-to-one mapping between many, if not all, of their core problems. A lot of this applies to signal processing too. Many of the problems we are stuck on in these domains are the same. Sometimes a problem seems better motivated in one narrative than the other. Some will call it finding the best code for the source, others will call it parameter estimation, yet others will call it learning.
Or, if I may paraphrase for the CS audience, blame the reals \mathbb{R}. Otherwise it would have been the problem of reverse-engineering a noisy Turing machine that we can access only through its input and output. Pretty damn hard even if we don't get into the reals. In those situations you could potentially get by without calculus; algebra by itself should go a long way, but as I said it gets frigging hard. Learning even the lowly regular expression from examples is hard. Calculus would still be helpful because many combinatorial / counting problems that come up can be dealt with using generating function techniques, where you run into integral calculus with complex numbers.
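As a small taste of that last point, Cauchy's coefficient formula pulls the n-th coefficient out of a generating function with a contour integral; here it's approximated numerically on a circle for the Fibonacci generating function 1/(1 - z - z^2). The radius and sample count are arbitrary choices, as long as the circle stays inside the radius of convergence.

    import numpy as np

    def coefficient(f, n, radius=0.5, samples=4096):
        theta = np.linspace(0.0, 2 * np.pi, samples, endpoint=False)
        z = radius * np.exp(1j * theta)
        # (1 / 2*pi*i) * integral of f(z) / z^(n+1) dz around the circle, as a discrete mean
        return np.mean(f(z) * np.exp(-1j * n * theta)).real / radius**n

    fib_gf = lambda z: 1.0 / (1.0 - z - z**2)
    print([round(coefficient(fib_gf, n)) for n in range(10)])
    # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]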
> Almost every corner of an ML problem has an optimization problem that needs to be solved: there is a function that you want to minimize subject to constraints. Typically these are everywhere smooth, or sometimes almost-everywhere smooth. So calculus shows up in (i) algorithms to find the bottom of these functions (if it exists) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter?", "what losses would these settings rack up on average?", etc. etc.
> The reason this differs from a pure optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that is revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture
The optimization techniques required to actually fit models are almost all powered by some form of gradient descent, and integration is usually required in truly probabilistic models to go from a density function to predictions.
All of statistics and machine learning involves lots of integrals and derivatives. For example: expected values are integrals, and model fitting is done by hill climbing in the direction of the derivative.