That's not entirely true. As someone who has a pretty long background in this area, works with LLMs/Diffusion models every day, and generally thinks there is a bit too much hype (but also a lot of potential): there is a lot that we don't really understand about how these models behave and why.
For starters, this article discusses how we need to change the architecture for solving XOR. That's something we do understand quite well. However, what we really don't understand is why architectures like transformers work so well. From an engineering standpoint they make sense, because the models look like they're doing something we want, and it makes intuitive sense that they work.
But from a theoretical standpoint it's not known why we really need all these fancy architectures (rather than just using a bunch of layers that should also be able to "figure out" what the network needs to do). All of our success has boiled down to "hey, let's try this and see if the model will learn better/faster."
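To be concrete about the part we do understand: XOR is the classic example of a problem no single linear layer can fit, while one small hidden layer with a nonlinearity learns it in seconds. A toy sketch of my own (PyTorch assumed, nothing from the article):

    import torch
    import torch.nn as nn

    # XOR: not linearly separable, so nn.Linear(2, 1) alone can never fit it,
    # but one hidden layer with a nonlinearity is enough.
    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1), nn.Sigmoid())
    opt = torch.optim.Adam(model.parameters(), lr=0.1)

    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy(model(X), y)
        loss.backward()
        opt.step()

    print(model(X).round())  # usually ends up at [0, 1, 1, 0]

That case we can explain end to end; the open question is why this kind of trial-and-error keeps working at transformer scale.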
Similarly, from a mathematical perspective, we do have some intuition around the reality that all NNs are basically doing some highly non-linear, complex transformation onto some latent surface where the problem is linearly separable. That gives us a sense of why these models probably can't learn "truth" (unless you do believe there exists a latent space of what is true and what is not, which is pretty radical). But if you start asking many more questions about how this works, or why the model would choose one representation over the other internally, we don't really know.
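The same toy problem makes that intuition concrete: lift the XOR inputs with one hand-picked non-linear feature and the classes become linearly separable. Again my own illustration, not anything from the article:

    import numpy as np

    # XOR is not linearly separable in the original 2-D input space, but after the
    # non-linear feature map (x1, x2) -> (x1, x2, x1*x2) one hyperplane splits it.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    phi = np.column_stack([X, X[:, 0] * X[:, 1]])   # add the product feature

    # The plane x1 + x2 - 2*x1*x2 = 0.5 now separates the classes:
    w, b = np.array([1.0, 1.0, -2.0]), -0.5
    print((phi @ w + b > 0).astype(int))            # -> [0 1 1 0]

A trained network finds its own version of that lifting; what we can't do is predict which representation it will land on.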
Nearly all of our progress in deep learning over the last decade has basically been hacking around and applying larger and larger amounts of data and compute resources. But at the end of the day even the best in the field don't really understand exactly what's happening.
Following from this: if you really want to understand these tools better, start playing with them and trying to build cool things. A deep understanding of the fundamentals is not much more useful for success with LLMs and Diffusion models than knowing how to efficiently implement B-trees is for building a cool product with a database back end.
It can absolutely be called magic when the creators of LLMs themselves openly say that they don't understand why they work the way they do. The word "magic" is very flexible ("his singing is magical", "it was a magical holiday in Vegas with four girls and me in the hotel room"), and it can definitely be used in this context to mean "something wonderful we don't fully understand".
>But from a theoretical standpoint it's not known why we really need all these fancy architectures
As someone who has been researching neural networks in a variety of settings for a very long time now, it is actually pretty obvious. There is also no real "magic" to it, even though it certainly might seem so to people who did not follow the academic world of research closely. But to those who do, all of this followed a pretty straightforward path, even though certain key steps were only obvious in hindsight.

We have known since the 90s that a perceptron with a single hidden layer can approximate any function with arbitrary accuracy (with some caveats that in practice boil down to computational limits), with the error scaling like 1/N in the number of hidden neurons. But the proof of that theorem already shows that this is by far not the most efficient way to approximate functions. While in practice you could plug the pixel values of an image directly into a perceptron, it turned out to be hugely more efficient computationally to use convolutions first as a dimensionality reduction scheme. This not only allowed people to train much larger networks on larger datasets, it also highlighted how additional layers enable hierarchical knowledge: the first layer of such a network might only encode lines or circles, while deeper layers could encode noses and ears and eventually entire human faces.

For language modeling, the thing holding everything back was likewise computational cost. Recurrent neural networks are theoretically even more powerful than simple perceptrons, but they come with a significant cost when computing gradients. Working around these constraints is what eventually led to the transformer, which at its core is just an extremely scalable, general purpose, differentiable algorithm that you can optimise using backpropagation. We didn't need this architecture from a purely theoretical perspective, but we needed it in practice, because our computing hardware is still very limited once we try to mimic actual biological neural networks as you would find them in the human brain.
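To put a number on the "hugely more efficient" point, here is a back-of-the-envelope sketch (my own, the layer sizes are arbitrary) comparing the parameter count of one fully connected layer against one convolutional layer on a 224x224 RGB image:

    import torch.nn as nn

    fc = nn.Linear(224 * 224 * 3, 4096)     # every pixel connected to every unit
    conv = nn.Conv2d(3, 64, kernel_size=3)  # 3x3 weights shared across all positions

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(fc))    # ~617 million parameters
    print(count(conv))  # 1,792 parameters

Weight sharing like this is what made training larger vision networks feasible on the hardware of the time.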
> But to those who do, all of this followed a pretty straightforward path, even though certain key steps were only obvious in hindsight.
The research still largely relies on post-hoc justification for these architectural benefits. We know CNNs work, we can open them up and see what they're doing, but we didn't get there from a theoretical foundation that predicted this outcome, nor do we have a real theoretical framework to justify them.
The history of pre-science is filled with very similar post-hoc justifications that let practitioners make progress but ultimately turned out to be wildly incorrect.
> it is actually pretty obvious.
In this entire reply you leave out the theoretical justifications that would back up this claim. You give many examples of intuitions for why these architectures work, but never dive into a rigorous explanation, because such explanations don't exist yet.
This comment simply outlines the growing "bag of tricks" we've built up over the years to solve problems, along with the common post-hoc justifications. But in its current state this is no different from alchemy, which did get some ideas correct and produced some useful practices, but ultimately failed to provide a theoretical framework for what was being done.
I don't know any serious deep learning researcher who disagrees that, at this point, practice far outpaces our theoretical understanding.
Is the key to answering this question the continued study of neurobiology? Are there any clues as to what the human brain is doing that apply to these concepts? Structuralism is radically popular; one would think that if it is right, we should be able to grow conscious beings from a certain original blueprint.
That opinion was held by a large part of the field for the longest time, and some still cling to it. These are usually the people who criticise transformers, because transformers go against everything they believe. But what we have seen in recent years points to the fact that the capability of neural networks is only a question of size. Yes, the human brain uses tricks like recurrent and convolutional layers as well, and to some extent it uses them better than we currently can. But transformers have shown that you don't need any of that for language processing, and not even for vision, demonstrating once again that you only need a sufficiently sized network. The details of the architecture are not that important, in the same way that your microprocessor architecture does not really matter once you're dealing with high-level programs in userland.
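A rough sketch of what "not even for vision" means in practice (my own illustration, loosely ViT-style, not anyone's actual model): an image becomes transformer tokens with nothing but a reshape and a per-patch linear projection, no convolutions anywhere.

    import torch

    img = torch.randn(1, 3, 224, 224)                 # (batch, channels, H, W)
    patch = 16

    # Cut the image into non-overlapping 16x16 patches and flatten each one.
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)        # (1, 3, 14, 14, 16, 16)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)  # (1, 196, 768)

    embed = torch.nn.Linear(3 * patch * patch, 512)   # per-patch "token" embedding
    tokens = embed(patches)                           # (1, 196, 512), ready for a standard transformer

From there a plain stack of attention blocks does the rest, which is the sense in which the architecture details matter less than size.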