First up, what you seem to be talking about is Deep Learning, not Machine Learning in general. In general ML there are many theorems, some of which also apply to DL.
Also, the step of "shuffle around the ML graph using some intuition" involves gathering that intuition, which usually arises from a great deal of mathematical competence. A 3x3 conv kernel versus a 2x2 one can, for instance, be discussed in terms of Fourier theory and mathematical image processing, both areas with a huge body of built-in theory.
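To make that concrete, here's a minimal sketch (assuming NumPy; the box kernels are just illustrative choices) of the Fourier-theory angle: zero-pad a small conv kernel and look at its 2D frequency response, which tells you which spatial frequencies it passes or attenuates.

```python
import numpy as np

def frequency_response(kernel, size=64):
    """Zero-pad `kernel` to size x size and return the magnitude of its 2D FFT."""
    padded = np.zeros((size, size))
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    return np.abs(np.fft.fftshift(np.fft.fft2(padded)))

box3 = np.ones((3, 3)) / 9.0   # 3x3 averaging kernel
box2 = np.ones((2, 2)) / 4.0   # 2x2 averaging kernel

resp3 = frequency_response(box3)
resp2 = frequency_response(box2)

# The 3x3 box filter suppresses high frequencies more aggressively than
# the 2x2 one; comparing the two responses makes the difference concrete.
print("3x3 response range:", resp3.min(), resp3.max())
print("2x2 response range:", resp2.min(), resp2.max())
```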
Things like replacing the activation function were initially studied anecdotally: people noticed that in some settings one activation function or another would lead to better results. Eventually, theory caught up, showing that in large nets the initialization method and the activation function interact strongly, and problems like poor backprop signal propagation were tackled both theoretically and practically.
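You can see that interaction in a toy experiment (a minimal NumPy sketch; the widths, depths, and scales are just illustrative): push a random input through a deep stack of randomly initialized layers and watch the activation scale. With a poorly matched init, the signal vanishes layer by layer, which is exactly the signal-propagation problem the theory addresses.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(act, scale, width=256, depth=50):
    """Std of the activations after `depth` random layers with init scale `scale`."""
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale / np.sqrt(width)
        x = act(W @ x)
    return x.std()

relu = lambda z: np.maximum(z, 0.0)

# He-style scale sqrt(2) keeps the ReLU signal roughly stable with depth;
# a scale of 1.0 makes it shrink by a constant factor at every layer.
print("scale sqrt(2), ReLU:", forward_std(relu, np.sqrt(2)))
print("scale 1.0,     ReLU:", forward_std(relu, 1.0))
```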
Generally, the mystery comes from the vast parameterization of these DL models. They operate in a regime that is very hard to reason about: large, finite spaces. Small finite spaces get treated exhaustively. Infinite spaces get treated asymptotically. Large finite spaces sit awkwardly in between, only loosely bounded on either side by those methods.
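A back-of-the-envelope count shows why "large but finite" is the awkward case (the layer sizes and the 8-bit quantization here are purely illustrative assumptions): even a tiny, coarsely quantized MLP has a configuration space far beyond exhaustive search, yet it is still finite, so purely asymptotic tools don't directly apply either.

```python
def mlp_params(layers):
    """Parameter count of a fully connected net: weights plus biases per layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

tiny = mlp_params([4, 8, 2])         # a toy two-layer net
small = mlp_params([784, 256, 10])   # an MNIST-sized net

levels = 256  # pretend each weight is quantized to 8 bits
print(tiny, "params ->", f"{levels}^{tiny} configurations")
print(small, "params ->", f"{levels}^{small} configurations")
```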
So yes, it might feel like there's a dearth of theory in DL when it comes to the large-scale behavior of a general network. That can be super frustrating. At the same time, people are pushing through and creating more theory every day.