Actually, I read the batch norm paper and maybe I forgot important details, but it roughly went like this: "here we add a term `b` to make sure the mean values of Ax+b are zero, which will help us with convergence; ah, and here is a covariance matrix!" — with no quantitative proof of how much that convergence was actually helped. Don't get me wrong, I intuitively agree that shifting the mean to zero should help, but math taught me that there is a huge difference between a seemingly correct statement and its proof. ML papers seem to just state these seemingly correct ideas without a real, proof-backed understanding of why they work. In other words, ML is largely about empirical results, peppered with math-like terminology. But don't take my blunt writing style personally.
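For what it's worth, the mechanism being described boils down to a few lines. This is a minimal numpy sketch of the batch norm forward pass (training mode only; the learned scale `gamma` and shift `beta` are what the paper adds on top of plain normalization — names and epsilon value are conventional choices, not taken from this thread):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features). Normalize each feature to zero mean
    # and unit variance across the batch, then apply the learned affine
    # transform gamma * x_hat + beta.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # deliberately off-center input
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# with gamma=1, beta=0 the per-feature means of y are ~0 and variances ~1
```

Whether this provably accelerates convergence is exactly the open question above; the code only shows what the operation does, not why it helps.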
Let's take the simplest example: recognizing grayscale 28x28 pictures of the digits 0-9. This is the MNIST example and can be done by my cat in 1 hour without prior knowledge. Let's choose what is probably the simplest model: 784 inputs fully connected to a 1024-vector, which is fully connected to a 10-vector, with relu at both steps. We know that this kinda works and converges quickly; in particular, after T steps we get error E(T), and E(1e6) < 0.03 (the number is a rough guess on my part). Can you tell me how T and E will change if we add another layer, 784->1024->1024->10, using the same relu? Same question, but replacing relu with tanh: 784->1024->10.
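To make the set-ups being compared concrete, here is a minimal numpy sketch of the three forward passes (784 inputs assumes standard 28x28 MNIST images; the He-style weight init and batch size are arbitrary illustration choices, and in practice the last layer would feed a softmax rather than another activation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, layers, act=relu):
    # Apply each fully connected layer followed by the activation.
    for W, b in layers:
        x = act(x @ W + b)
    return x

rng = np.random.default_rng(0)

def init(sizes):
    # One (W, b) pair per consecutive size pair, e.g. [784, 1024, 10].
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(32, 784))              # a batch of 32 flattened 28x28 images
relu_2layer  = mlp_forward(x, init([784, 1024, 10]))              # 784->1024->10, relu
relu_3layer  = mlp_forward(x, init([784, 1024, 1024, 10]))        # 784->1024->1024->10, relu
tanh_2layer  = mlp_forward(x, init([784, 1024, 10]), act=np.tanh) # 784->1024->10, tanh
```

This only defines the architectures; answering how T and E change still requires actually training them, which is the commenter's point.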
I think you and the person you're responding to might have slightly different expectations about what level of rigor counts as "math", much like how physicists and theoretical mathematicians often have somewhat different ideas about rigor.
My impression is that ML is of course guided by math, and people do want to understand why some things converge and others don't. But "in the field" many people just mess around with different set-ups and see what works (especially in deep learning), and theory sometimes follows to explain why it worked. I think you're right that a lot of progress in the field is based on intuition and some reasoning (e.g. trying something like an inception network) rather than derivations showing that a particular set-up should be successful. I get the impression that most low-level components are pretty well understood, but when they are stacked and combined it gets more complicated.
I would be very curious to see a video of your cat solving MNIST in 1 hour!