Deep Learning Is Not So Mysterious or Different (arxiv.org)
485 points by wuubuu 48 days ago | 126 comments



If anyone wants to delve into machine learning, one of the superb resources I have found is Stanford's "Probability for Computer Scientists" (https://www.youtube.com/watch?v=2MuDZIAzBMY&list=PLoROMvodv4...).

It delves into the theoretical underpinnings of probability theory and ML, IMO better than any other course I have seen. (Yeah, Andrew Ng is legendary, but his course demands some mathematical familiarity with linear algebra topics.)

And of course, for deep learning, 3b1b is great for getting some visual introduction (https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ...).


I watched the 3b1b series on neural nets years ago, and it still accounts for 95% of my understanding of AI in general.

I’m not an ML person, but still. That guy has a serious gift for explaining stuff.

His video on the uncertainty principle explained stuff to me that my entire undergrad education failed to!


> That guy has a serious gift for explaining stuff

I'd like to challenge this idea.

I don't believe he's more gifted than other people. I strongly believe that the point is he spent a lot of time and effort to get better at explaining stuff.

He contemplated feedback and improved his explanations throughout the years.

His videos are excellent because he poured himself into making them excellent, not because he has a gift.

In my experience the professors who lack this ability do so because they don't put enough effort into it, not because they were born without it.


You're probably reading too much into previous poster's choice of the word "gift".

Most likely it is a slightly misused idiom rather than intending to convey that the teaching ability was obtained without effort.


I disagree. He has always been excellent from the beginning of his Youtube career. Maximum potential skill levels and skill acquisition/growth rates vary from person to person. I think most people wouldn't have as much success even with twice as many hours invested in the 4 separate crafts (!) of mathematics communication, data visualization, video animation, and video editing. I know I wouldn't, and I consider technical communication one of my strong suits.

Everyone can improve with practice, but some people really are gifted.


To be very good at something it is necessary, but not sufficient, to have a talent for it. The other 85% is hard work. You aren't going to pull just anyone off the street and have the same level of instruction, no matter how motivated they are.


I think the real genius is in translating all the heavy symbolic manipulation into visual processes that people can see and interpret. Suddenly, you are not seeing some abstract derivation somewhat removed from the real world, but another real visual process which you can pause and reason with.

That makes the whole concept tick.


It could be one or the other, or both;

being gifted and spending time to get it right are not mutually exclusive.


It helps that 3b1b doesn't start with a curriculum and then have to figure out how to teach it. Instead, he can select topics to suit his style.



From a comment I posted elsewhere, for written versions:

There is a course reader for CS109 [1]. You can download a PDF version of it.

There is also a book [2] for the excellent Caltech course [3].

[1] https://chrispiech.github.io/probabilityForComputerScientist...

[2] https://www.amazon.com/Learning-Data-Yaser-S-Abu-Mostafa/dp/...

[3] https://work.caltech.edu/telecourse


Your first two links don't work


That's because they posted them somewhere else (easy mistake to make; HN doesn't show you the full link in a comment, so copy/paste just copies the ellipsis)

https://chrispiech.github.io/probabilityForComputerScientist...

https://www.amazon.com/Learning-Data-Yaser-S-Abu-Mostafa/dp/...


Thanks. Sorry for the oversight.


Caltech's Learning from Data was really good too, if someone is looking for a theoretical understanding of ML topics.

https://work.caltech.edu/telecourse


I highly recommend the course you've mentioned (by Yaser Abu-Mostafa). In fact I still recommend it for picking up the basics; very good mix of math and intuition, Abu-Mostafa himself is a terrific teacher, and he is considerate and thoughtful in responding to questions at the end of his presentations. The last part is important if you're a beginner: it builds confidence in you that it's probably ok to ask what you might consider a simple question - it still deserves a good answer. The series is a bit dated now in terms of what it covers, but still solid as a foundational course.


Apparently the word “delve” is the biggest indicator of the use of ChatGPT according to Paul Graham.


That seems utterly bizarre to me. I don't use "delve" frequently myself, but it is common enough that it doesn't jump out as an unusual word. Perhaps it is overused or used in a not-exactly-usual context that tips one off that it is LLM-generated, but by itself it signifies nothing to me.


It is a very common word in Nigerian-style English, and Nigeria was a very common place to which RLHF tasks were outsourced. A sibling comment has a link, but it is also easy to google.


As a non-native speaker, I didn't know the word "delve", but now I do. I think the internet community is learning from LLMs?


  > learning from LLM
Or from each other?


Saying that kind of stuff is the biggest indicator of Paul Graham (pg) himself


I’d love to see an article delve into why that is.



Because it's common in Nigerian English, which is where they outsourced a lot of the RLHF conditioning work to.


Really!? Do you have a source for this? This would be really interesting if true.



Non native speaker here. Will remember this.

Hm... Seeing that, I realize I have used it multiple times in my comment. I was just trying to convey the meaning.

What is the right use of the word? What would be the right word to use here?


Native English speaker here. It was the right word. At the same time, while “delve” is common enough to be recognized, it’s not that commonly used in American English, so I also was wondering if this was AI generated.


Got it. What is the common phrase used in this case? Same as what the sibling comment said?


So ChatGPT or Nigerians or me apparently... :`(


It does kind of go with "deep" though when Deep Learning is the topic. Delve into the depths.


For me it's "eerie"; it just will not stop using this word.


Absolutely, here’s why.


Nonsense. ChatGPT uses the word a lot precisely because people used it a lot.


Apparently this depends on where people are. It is not used a lot in US English, but it is used a lot in African English.

Part of training LLMs involves extensive human feedback, and many LLM makers outsource that to Africa to save money. The LLMs then pick up and use African English.

See the link in this comment [1] for an interesting article about this.

[1] https://news.ycombinator.com/item?id=43394220


Just watched the whole thing. Thanks! I can't get into my Master's CS: AI program at UC Berkeley because I'm dumb, but seeing this first day of a Probability class kinda felt like I was beginning that program, haha.

I will add a great find for starting one's AI journey: https://www.youtube.com/watch?v=_xIwjmCH6D4 . It kind of needs you to know intermediate CS, since the first step is "learn Python".


And if anyone is interested in delving more deeply into the statistical concepts and results referenced in the paper of this post (e.g. VC dimension, PAC learning, etc.), I can recommend this book: https://amzn.eu/d/7Zwe6jw


Looks nice - are there written versions?


There is a course reader for CS109 [1]. You can download a PDF version of it.

There is also a book [2] for the excellent Caltech course [3].

[1] https://chrispiech.github.io/probabilityForComputerScientist...

[2] https://www.amazon.com/Learning-Data-Yaser-S-Abu-Mostafa/dp/...

[3] https://work.caltech.edu/telecourse


Thanks!


Yeah, I took CS109 (through SCPD); it was a blast. But it took some serious time commitment.


Great recommendations


Fully agree! 3blue1brown has single-handedly taught me the majority of what I've needed to know about it.

I actually started building my own neural network framework last week in C++! It's a great way to delve into the details of how they work. It currently supports only dense MLPs, but does so quite well, and work is underway for convolutional layers and pooling layers on a separate branch.

https://github.com/perkele1989/prkl-ann


Agreed, but PAC-Bayes and other descendants of VC theory are probably not the best explanation. The notion of algorithmic stability provides a (much) more compelling one. See [1] (particularly Sections 11 and 12)

[1] https://arxiv.org/abs/2203.10036


I'm a huge fan of HN just for replies such as this that smash the OP's post/product with something better. It's like at least half the reason I stick around here.

Thanks for the great read.


>smash with something better

Not a fan of the aggressive rhetoric here...


I too felt threatened


Yeah, and it's not "better", but actually less general, relying on optimization/GD, unlike OP.


Violent disagreement is violence.


Hard disagree. Your link relies on gradient descent as an explanation, whereas OP explains why optimization is not needed to understand DL generalization. PAC-Bayes and the other countable hypothesis bounds in OP are also quite different from VC dimension. The whole point of OP seems to be that these other frameworks, unlike VC dimension, can explain generalization with an arbitrarily flexible hypothesis space.


Yes, and that's the problem. What Zhang et al [2] showed convincingly in the Rethinking paper is that just focusing on the hypothesis space cannot be enough, since the same hypothesis space fits both real and random data, so it's already too large. Therefore, these methods that focus on the hypothesis space have to talk about a bias in practice towards a better subspace, and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypotheses over others in the space.

But once you are ready to do that then algorithmic stability is enough. You don't then need to think about Bayesian ensembles, or other proxies/simplifications etc. but can focus on just the specific learning setup you have. BTW algorithmic stability is not a new idea. An early version showed up within a few years of VC theory in the 80s in order to understand why nearest neighbors generalizes (it wasn't called algorithmic stability then though).

If you are interested in this, also recommend [3].

[2] https://arxiv.org/abs/1611.03530

[3] https://arxiv.org/abs/1902.04742


But it's not a problem, it's actually a good thing that OP's explanation is more general. One of the main points in the OP paper is that you do not in fact need proxies or simplification. You can derive generalization bounds that do explain this behavior, without needing to rely on optimization dynamics. This exactly responds to the tests set forth in Zhang et al. OP does not "rely on Bayesian ensembles, or other proxies/simplifications". That seems to be a misunderstanding of the paper. It's analyzing the solutions that neural networks actually reach, which differentiates it from a lot of other work. It also additionally shows how other simple model classes can reproduce the same behavior, and these reproductions do not depend on optimization.

"and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypothesis over others in the space." But the OP paper explains how even "guess and check" can generalize similarly to SGD. It's becoming more well understood that the role of the optimizer may have been historically overstated for understanding DL generalization. It seems to be more about loss landscapes.

Don't get me wrong, these references you're linking are super interesting. But they don't take away from the OP paper which is adding something quite valuable to the discussion.


Thank you for the great discussion. You've put your finger on the right thing I think. We can now dispense with the old VC-type thinking (i.e., that it's because the hypothesis space is not complex enough that we get generalization). Instead now the real question is this: is it the loss landscape itself, or the particular way in which the landscape is searched that leads to good generalization in deep learning.

One can think of an "exhaustive" search of the loss landscape with, say, God's computer, picking an arbitrary point among all the points that minimize the loss (or are close to the minimum). Or with our computers we can merely sample. But in both cases, it's hard to see how one would avoid picking "memorization" solutions in the loss landscape. Recall that in an over-parameterized setting, there will be many solutions that have the same low training loss but very different test losses. The reference in my original post [1] shows a nice example with a toy overparameterized linear model (Section 3) where multiple linear models fit the training data but have very different generalization. (It also shows why GD ends up picking the better-generalizing solution.)

Now people have argued that the curvature around the solution is a distinguishing factor between solutions that generalize well and those that don't. Granted, here we are already moving into the territory of how to sample the space, i.e. the specifics of the search algorithm (a direction you may not like), but even if we press ahead, it's not a satisfactory explanation, since in a linear model with L2 loss the curvature is the same everywhere, as Zhang et al. pointed out. So the curvature theories fail for the simplest case already unless one believes that somehow linear models are fundamentally different from deeper and non-linear models.

[1] points out other troubling facts about the curvature explanation (Section 12), but one I like more than the others is the following: As per curvature theories, the reason for good generalization at the start of the training process is fundamentally different from the reason for good generalization at the end of the training process. (As always, generalization is just the difference between test and training, and so good generalization is when that difference is small; not necessarily that the test loss is small.) At the start of the GD training process curvature theories would not be applicable (we just picked a random point after all) and so they would hold that we get good (in fact, perfect) generalization because we didn't look at the training data. However, at the end of training, they say we have good generalization because we found a shallow minimum. This lack of continuity is disconcerting. In contrast, stability-based arguments provide a continuous explanation: the longer you run SGD the less stable it is (so don't run it too long and you'll be fine, since you'll achieve an acceptable tradeoff between lowering the loss and overfitting).

[1]: https://arxiv.org/abs/2203.10036


Statistical mechanics is the lens that makes most sense to me, and it's well studied.


Good read, thanks for sharing


Anyone who wants to demystify ML should read: The StatQuest Illustrated Guide to Machine Learning [0] By Josh Starmer.

To this day I haven't found a teacher who could express complex ideas as clearly and concisely as Starmer does. It's written in an almost children's-book-like format that is very easy to read and understand. He also just published a book on neural networks that is just as good. Highly recommend even if you are already an expert, as it will give you great ways to teach and communicate complex ideas in ML.

[0]: https://www.goodreads.com/book/show/75622146-the-statquest-i...


I have followed a fair few StatQuest and other videos (treadmills with Youtube are great for fitness and learning in one)

I find that no single source seems to cover things in a way that I easily understand, but cumulatively they fill in the blanks of each other.

Serrano Academy has been a good source for me as well. https://www.youtube.com/@SerranoAcademy/videos

The best tutorials give you a clear sense that the teacher has a clear understanding of the underlying principles and how/why they are applied.

I have seen a fair few things that are effectively:

'To do X, you {math thing}', while also creating the impression that they don't understand why {math thing} is the right thing to do, just that {math thing} has a name and it produces the result. Meticulously explaining the minutiae of {math thing} substitutes for an understanding of what it is doing.

It really stood out to me when looking at UMAP and seeing a bunch of things where they got into the weeds in the math without explaining why these were the particular weeds to be looking in.

Then I found a talk by Leland McInnes that had this format:

{math thing} is a tool to do {objective}. It works, there is a proof, you don't need to understand it to use the tool, but the info for that is over there if you want to take a look. These are our objectives, let's use these tools to achieve them.

The tools are neither magical black boxes, nor confused with the actual goal. It really showed the power of fully understanding the topic.


Double Bam


Also would like to add that he has a YouTube channel as well https://youtube.com/@statquest


> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.

How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized likelihood approach. To find a good model for data, you would optimize a cost function over model space, and the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data", but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was more complex, less straightforward.


You're correct, and the term you're looking for is "regularisation".

There are two common ways of doing this:

* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)

* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights
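
A minimal numpy sketch of the first bullet (my own toy data and penalty strength, not anything from the paper): the training objective is a data-misfit term plus an L2 penalty on the weights, which is exactly the "penalized cost" setup the grandparent describes.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 10))
  true_w = np.zeros(10)
  true_w[:3] = [2.0, -1.0, 0.5]                # only 3 informative weights
  y = X @ true_w + 0.1 * rng.normal(size=100)

  lam, lr = 0.1, 0.01                          # lam trades data fit against simplicity
  w = np.zeros(10)
  for _ in range(2000):
      grad_fit = X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error
      grad_reg = lam * w                       # gradient of (lam / 2) * ||w||^2
      w -= lr * (grad_fit + grad_reg)

  print(np.round(w, 2))    # informative weights survive (slightly shrunk), the rest stay near zero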


Dropout is roughly equivalent to layer-specific L2 regularization, and it's easy to see why: asymptotically, dropping out random neurons will achieve something similar to shrinking weights towards zero proportional to their (squared) magnitude.

Trevor Hastie's Elements of Statistical Learning has a nice proof that (for linear models) L2 regularization is also semi-equivalent to dimensionality reduction, which you could use to motivate a "simplicity prior" idea in deep learning.

Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis (like splines but in higher dimensions) in feature space, and regularization induces smoothness in this N-dimensional basis by shrinking that basis towards being a flat hyper-plane.
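
A tiny 1-D sketch of that picture (made-up weights, purely illustrative): a layer of ReLUs is a set of hinge functions, the network output is a piecewise-linear combination of them, and shrinking those weights flattens the function.

  import numpy as np

  # Hypothetical hidden layer: 4 ReLU "hinges" with hand-picked knots and slopes.
  w = np.array([1.0, 1.0, -1.0, 1.0])     # input weights
  b = np.array([0.0, -1.0, 0.5, -2.0])    # biases place where each hinge bends
  a = np.array([0.5, -1.0, 0.8, 2.0])     # output weights

  def relu_net(x):
      # Linear combination of truncated (hinge) basis functions.
      return np.maximum(0.0, np.outer(x, w) + b) @ a

  x = np.linspace(-3, 3, 7)
  print(np.round(relu_net(x), 2))         # piecewise-linear in x, with kinks at the knots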


Wow! I think I dimly intuited your first paragraph already; I directionally get why your second might be true (although I'd have thought L1 was even more so, since it encourages zeros which is kind of like choosing a subspace).

Your third paragraph took me ages to get an intuition for - is the idea that regularisation penalises having "sharp elbows" at the join points of your hyper-spline thing? That's mind blowing and such an interesting way to think about what a ReLU layer is doing.

Thanks so much for a thought provoking comment, that's incredibly cool.


The solution to the L1 regularization problem is actually a specific form of the classical ReLU nonlinearity used in deep learning. I’m not sure if similar results hold for other nonlinearities, but this gave me good intuition for what thresholding is doing mathematically!
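
If it helps anyone, here is the identity I believe is being referred to, as a small sketch (assuming the standard 1-D L1-penalised least-squares problem, whose closed-form solution is soft thresholding): the soft-threshold operator is exactly a difference of two shifted ReLUs.

  import numpy as np

  def relu(x):
      return np.maximum(0.0, x)

  def soft_threshold(x, lam):
      # Closed-form minimiser of 0.5 * (z - x)^2 + lam * |z| over z.
      return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

  x = np.linspace(-3, 3, 13)
  lam = 1.0
  via_relu = relu(x - lam) - relu(-x - lam)    # shrink-towards-zero with a dead zone
  print(np.allclose(soft_threshold(x, lam), via_relu))    # True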


Here is an example for data-efficient vision transformers: https://arxiv.org/abs/2401.12511

Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks which have a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does where they have a special scheme for initializing vision transformers. So schemes like initialization that encourage the model to find the right solution without excessively constraining it would be a soft preference for simpler solutions.


I'm not a guru myself, but I'm sure someone will correct me if I'm wrong. :-)

The usual approach to supervised ML is to "invent" the model (layers, their parameters) or, more often, copy one from a known good reference, then define the cost function and feed it data. "Deep" learning just means that instead of a few layers you use a big number of them.

What you describe sounds like an automated way of tweaking the architecture, IIUC? Never done that, usually the cost of a run was too high to let an algorithm do that for me. But I'm curious if this approach is being used?


Yeah, it's straightforward to reproduce the results of the paper whose conclusion they criticize, "Understanding deep learning requires rethinking generalization", without any (explicit) regularization or anything else that can be easily described as a "soft preference for simpler solutions".


Yeah that's just regularized optimization which is actually just the Bayesian Learning Rule which is actually just variational Bayes.


the AdamW optimizer (basically the default in DL nowadays) is doing exactly that
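
In case it is useful, a minimal numpy sketch of what that looks like (my own single-vector illustration of an AdamW-style update with decoupled weight decay; the hyperparameter values are just roughly typical ones, not tied to any particular library):

  import numpy as np

  def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
      # Standard Adam moment estimates.
      m = b1 * m + (1 - b1) * grad
      v = b2 * v + (1 - b2) * grad ** 2
      m_hat = m / (1 - b1 ** t)
      v_hat = v / (1 - b2 ** t)
      # Decoupled weight decay: on top of the Adam step, pull w towards zero,
      # i.e. a soft preference for smaller (simpler) weights.
      return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w), m, v

  target = np.array([2.0, -1.0, 0.5])
  w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
  for t in range(1, 5001):
      grad = w - target                 # toy quadratic loss 0.5 * ||w - target||^2
      w, m, v = adamw_step(w, grad, m, v, t)
  print(np.round(w, 3))                 # lands near the target; the wd term adds a constant pull towards zero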


An interesting example in which "deep" networks are necessary is discussed in this fascinating and popular recent paper on RNNs [1]. Despite the fact that the minGRU and minLSTM models they propose don't explicitly model ordered state dependencies, they can learn them as long as they are deep enough (depth >= 3):

> Instead of explicitly modelling dependencies on previous states to capture long-range dependencies, these kinds of recurrent models can learn them by stacking multiple layers.

[1] https://arxiv.org/abs/2410.01201


Well it's not called Mysterious Learning or Different Learning for a reason.

In fact, with how many misnomers there are in the world, I think Deep Learning is actually a pretty great name, all things considered.

It properly communicates (imo) that the training data and resulting weights are complex enough that just looking at the learning/training process on its own is not sufficient to understand the resulting system (vs other "less deep" machine learning where it mostly is).


DNNs do not have special generalization powers. If anything, their generalization is likely weaker than more mathematically principled techniques like the SVM.

If you try to train a DNN to solve a classical ML problem like the "Wine Quality" dataset from the UCI Machine Learning repo [0], you will get abysmal results and overfitting.

The "magic" of LLMs comes from the training paradigm. Because the optimization is word prediction, you effectively have a data sample size equal to the number of words in the corpus - an inconceivably vast number. Because you are training against a vast dataset, you can use a proportionally immense model (e.g. 400B parameters) without overfitting. This vast (but justified) model complexity is what creates the amazing abilities of GPT/etc.

What wasn't obvious 10 years ago was the principle of "reusability" - the idea that the vastly complex model you trained using the LLM paradigm would have any practical value. Why is it useful to build an immensely sophisticated word prediction machine, who cares about predicting words? The reason is that all those concepts you learned from word-prediction can be reused for related NLP tasks.

[0] https://archive.ics.uci.edu/dataset/186/wine+quality


You may want to look at this. Neural network models with enough capacity to memorize random labels are still capable of generalizing well when fed actual data

Zhang et al (2021) 'Understanding deep learning (still) requires rethinking generalization'

https://dl.acm.org/doi/10.1145/3446776


When I was first getting into Deep Learning, learning the proof of the universal approximation theorem helped a lot. Once you understand why neural networks are able to approximate functions, it makes everything built on top of them much easier to understand.


A decade ago the paper "Understanding deep learning requires rethinking generalization" [0] was published. The submission is a response to that paper and subsequent literature.

Deep neural nets are notable for their strong generalization performance: despite being highly overparametrized they do not seem to overfit the training data. They still perform well on hold-out data and very often on out of distribution data "in the wild". The paper [0] noted a particularly odd feature of neural net training: one can train neural nets on standard datasets to fit random labels. There does not seem to be an inductive bias strong enough to rule out bad overfitting. It is in principle possible to train a model which performs perfectly on the training data but gives nonsense on the test data. But this doesn't seem to happen in practice.

The submission argues that this is unsurprising, and fits within standard theoretical frameworks for machine learning. In section 4 it is claimed that this kind of thing ("benign overfitting") is common to any learning algorithm with "a flexible hypothesis space, combined with a loss function that demands we fit the data, and a simplicity bias: amongst solutions that are consistent with the data (i.e., fit the data perfectly), the simpler ones are preferred".

The fact that the third of these conditions is satisfied, however, is non-trivial, and in my opinion is still not well understood. The results of [0] are reproducible with a wide variety of architectures, with or without any form of explicit regularization. If there is an inductive bias toward "simpler solutions" in fitting deep neural nets it has to come either from SGD itself or from some bias which is very generic in architecture. It's not something like "CNNs generalize well on image data because of an inductive bias toward translation invariant features." While there is some work on implicit smoothing by SGD, for example, in my opinion this is not sufficient to explain the phenomena observed in [0]. What I would find satisfying is a reproducible ablation study of neural net training that removed benign overfitting (+), so that it was clear what exactly are the necessary and sufficient conditions for this behavior in the context of neural nets. As far as I know this still has never been done, because it is not known what this would even entail.

(+) To be clear, I think this would not look like "the fit model still generalizes, but we can no longer fit random labels" but rather "the fit model now gives nonsense on holdout data".

[0] https://arxiv.org/abs/1611.03530
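
For anyone who wants to poke at this themselves, here is a small self-contained sketch (my own toy setup, far smaller than the experiments in [0]): the same over-parameterized MLP memorizes randomly permuted labels (training accuracy near 1.0, test accuracy around chance), yet generalizes reasonably when trained on the real labels.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.neural_network import MLPClassifier

  X, y = make_classification(n_samples=600, n_features=20, random_state=0)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
  y_rand = np.random.RandomState(0).permutation(y_tr)    # destroy the label-feature relationship

  def fit(labels):
      # Deliberately over-parameterized relative to the ~450 training points.
      net = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=3000, random_state=0)
      return net.fit(X_tr, labels)

  real, rand = fit(y_tr), fit(y_rand)
  print("real labels   train/test:", real.score(X_tr, y_tr), real.score(X_te, y_te))
  print("random labels train/test:", rand.score(X_tr, y_rand), rand.score(X_te, y_te))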


Doesn't the simplicity bias come explicitly from regularization techniques, such as dropout or the L2 norm?


Those are not necessary to reproduce benign overfitting


I wish I had the time to try this:

1.) Grab many GBs of text (books, etc).

2.) For each word, for each next $N words, store distance from current word, and increment count for word pair/distance.

3.) For each word, store most frequent word for each $N distance. [a]

4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would compare word pairs/distance and find most probable next set of word(s)

How close would this be to GPT 2?

[a] You could go one step further and store multiple words for each distance, ordered by frequency
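
A rough sketch of steps 2-4 (my own toy illustration, with a tiny inline "corpus" standing in for the GBs of text):

  from collections import Counter, defaultdict

  N = 3    # how many following words to count for each word
  corpus = "the cat sat on the mat and the cat ate the fish".split()

  # Step 2: count (word, distance) -> next-word frequencies.
  counts = defaultdict(Counter)
  for i, w in enumerate(corpus):
      for d in range(1, N + 1):
          if i + d < len(corpus):
              counts[(w, d)][corpus[i + d]] += 1

  # Steps 3-4: every prompt word within N positions of the next slot votes
  # with its observed frequencies; the most probable word wins.
  def predict_next(prompt):
      votes = Counter()
      for i, w in enumerate(prompt):
          d = len(prompt) - i              # distance from this word to the predicted slot
          if d <= N:
              votes.update(counts[(w, d)])
      return votes.most_common(1)[0][0] if votes else None

  print(predict_next(["the", "cat"]))      # "sat" or "ate", given this toy corpus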


The scaling is brutal. If you have a 20k word vocabulary and want to do 3 grams, you need a 20000^3 matrix of elements (8 trillion). Most of which is going to be empty.

GPT and friends cheat by not modeling each word separately, but rather a high-dimensional "embedding" (just a vector, if you also find new vocabulary silly). The embedding represents similar words near each other in this space. The famous king-man-queen example. So even if your training set has never seen "The Queen ordered the traitor <blank>", it might have previously seen "The King ordered the traitor beheaded". The vector representation lets the model use words that represent similar concepts without concrete examples.


Importantly, though, LLMs do not take the embeddings as input during training; they take the tokens and learn the embeddings as part of the training.

Specifically all Transformer-based models; older models used things like word2vec or elmo, but all current LLMs train their embeddings from scratch.


And tokens are now going down to the byte level:

https://ai.meta.com/research/publications/byte-latent-transf...


You shouldn't need to allocate every possible combination !_! if you dynamically add new pairs/distance as you find them. Im talkin simple for loops.


you might enjoy this read, which is an up-to-date document from this year laying out what was the state of the art 20 years ago:

https://web.stanford.edu/~jurafsky/slp3/3.pdf

Essentially you just count every n-gram that's actually in the corpus, and "fill in the blanks" for all the 0s with some simple rules for smoothing out the probability.
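
A minimal sketch of that idea (my own toy example, using add-one smoothing rather than the fancier schemes in the chapter): count the bigrams that actually occur, and give every unseen continuation a small non-zero probability.

  from collections import Counter, defaultdict

  corpus = "the cat sat on the mat and the dog sat on the rug".split()
  vocab = sorted(set(corpus))

  bigrams = defaultdict(Counter)
  for prev, nxt in zip(corpus, corpus[1:]):
      bigrams[prev][nxt] += 1

  def prob(nxt, prev, k=1.0):
      # Add-k smoothing: unseen bigrams get a small but non-zero probability.
      return (bigrams[prev][nxt] + k) / (sum(bigrams[prev].values()) + k * len(vocab))

  print(round(prob("cat", "the"), 3))    # seen bigram: relatively likely
  print(round(prob("rug", "cat"), 3))    # unseen bigram: small but non-zero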


There is some recent work [0] that explores this idea, scaling up n-gram models substantially while using word2vec vectors to understand similarity. Used to compute something the authors call the Creativity Index [1].

[0]: https://infini-gram.io

[1]: https://arxiv.org/abs/2410.04265v1


Claude Shannon was interested in this kind of thing and had a paper on the entropy per letter or word of English. He also has a section in his famous "A Mathematical Theory of Communication" that has experiments using the conditional probability of the next word based on the previous n=1,2 words from a few books. I wonder if the conditional entropy approaches zero as n increases, assuming ergodicity. But the number of entries in the conditional probability table blows up exponentially. The trick of combining multiple n=1 models at different distances sounds interesting, and reminds me a bit of contrastive prediction ML methods.

Anyway, the experiments in Shannon's paper sound similar to what you describe, but with less data and distance, so they should give some idea of how it would look. From the text:

* 5. First-order word approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED *


this is pretty close to how language models worked in the 90s-2000s. deep language models -- even GPT 2 -- are much much better. on the other hand, the n-gram language models are "surprisingly good" even for small n.


Pretty sure this wouldn't produce anything useful. Pretty sure this would generate incoherent gibberish that looks and sounds like English but makes no sense. This ignores perhaps the most important element of LLM's, the attention mechanism.


And, the attention mechanism scales quadratically with context length. This is where all of the insane memory bandwidth requirements come from.
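
A bare-bones numpy sketch of where the quadratic comes from (my own illustration: one head, no masking or batching): the score matrix compares every position with every other position, so it is n x n.

  import numpy as np

  def attention(Q, K, V):
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)                        # shape (n, n): every token vs every token
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
      return weights @ V

  n, d = 1024, 64                                          # n tokens of context, d dims per head
  rng = np.random.default_rng(0)
  Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
  print(attention(Q, K, V).shape)                          # (1024, 64), but the intermediate
                                                           # score matrix was 1024 x 1024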


Everything has meaning in precise relation to its frequency of co-occurrence with every other thing.

I, too, have been mulling this. Word to word, paragraph to paragraph. Even letter to letter.

Also what if you processed text in signal space? I keep wondering if that’s possible. Then you get it all at once rather than windows. Use a derivative of change for every page, so the phase space is the signal end to end.


> How close would this be to GPT 2

Here's a post from 2015 doing something a bit like this [1]

[1] https://nbviewer.org/gist/yoavg/d76121dfde2618422139


The problem is that for any reasonable value of N (>100) you will need prohibitive amounts of storage. And it will be extremely sparse. And you won’t capture any interactions between N-99 and N-98.

Transformers do that fairly well and are pretty efficient in training.


Markov chains are very very far off from gpt2.


Aren't they technically the same? GPT picks the next token given the state of the current context, based on probabilities and a random factor. That is mathematically equivalent to a Markov chain, isn't it?


Markov chains don't account for the full history. While all LLMs do have a context length, this is more a practical limitation based on resources rather than anything implicit in the model.


I actually tried something like that with the Bible back in 2021. Scaling is a bitch; these types of models are very difficult to train.


You can listen to an explanation of the paper here: https://www.pdftomp3.com/shared/67d8abf0ecf38326f8973e49

I created this tool last year to listen to a machine learning book; now I use it for ML papers. The explanations are still a bit repetitive, it's not perfect yet.


Correct me if I'm wrong, but an artificial neuron is just good old linear regression followed by an activation function to make it non linear. Make a network out of it and cool stuff happens.


This is like saying "the human brain is just some chemistry." You have the general idea correct, but there's a lot more going on that just that, and the emergent system is so much more complex that it deserves its own separate field.


Although with extra irony. "linear regression followed by an activation function to make it non linear". So it isn't good old linear regression because it is explicitly delinearised.


In a sense; linear regression can be computed exactly, so it refers to a specific technique for producing a linear model.

Most artificial neurons are trained stochastically rather than holistically, i.e. rather than looking at the entire training set and computing the gradient to minimize the squared loss or something similar, they look at each training example and compute the local gradient and make small changes in that direction.

In addition, the "activation function" almost universally used now is the rectified linear unit, which is linear for positive input and zero for negative input. This is non-decreasing as a function, but the fact that it is not strictly monotonic (it is flat for negative inputs) means that there is no additional loss accrued for overcorrecting in the negative direction.

Given this, using the term "linear regression" to describe the model of an artificial neuron is not really a useful heuristic.


MLPs are compositions of generalized linear models. That's not very enlightening though; the "mysterious" part is the macroscopics of the composition, which you can't really understand with the tools of statistics.


Yes. An artificial neuron, as a mathematical function f, is defined by f(x) = g(wx + b) where x is the input, w is the weight, b is the bias, and g is some non-linear activation function. Is that "good old linear regression followed by an activation function to make it non linear"? Yes, it is exactly that.
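
In code that definition is only a few lines (a sketch; the weights are arbitrary, and sigmoid is just one common choice of g):

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def neuron(x, w, b, g=sigmoid):
      # f(x) = g(w . x + b): an affine map (the same form as linear regression)
      # followed by a non-linear activation g.
      return g(np.dot(w, x) + b)

  x = np.array([0.5, -1.0, 2.0])    # input
  w = np.array([0.1, 0.4, -0.3])    # weights (arbitrary, for illustration)
  b = 0.2                           # bias
  print(neuron(x, w, b))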


I've seen the same patterns in neural networks that I've seen in simpler algorithms. It's less about mystery and more about complexity.


So where is the line that something becomes ‘AI’ and is regulated?


[flagged]


Please formulate your critique instead of simply labeling it with negative words.


Sure.

"Preprint" implies prior to printing, which means that there's a reasonable expectation for this paper to be submitted, accepted, and printed in a scholarly journal.

What we have here is little more than a tongue-in-cheek submission which carries an aesthetic of "hot-take" throughout the paper. This is unbecoming of one committed to scholarly pursuits and all but guarantees rejection from journals committed to professionalism.

Furthermore, what's really interesting is how this comment section has developed. It really is the blind leading the blind here.

I will not subject myself further to the consequences of Brandolini's law except to implore the reader to consider the signal-to-noise ratio resulting from being too tolerant of posts like this.


This is still just name calling. You are just using negatively charged adjectives without quoting or arguing the substance or even the style. Is your critique only about the presentation, or the substance of the ideas too?

What makes it unprofessional? To me it looks much better than a substantial chunk of my review stacks at ML conferences and journals. Are you an ML researcher? Maybe you're used to a different research community that's more "uptight"?


You're making a normative argument. The fact that other people publish crap is irrelevant, unless you actually intend to lend implicit justification of the status quo's existence just because it exists. "Ought", meet "is", etc.

The take proffered by TFA just isn't a useful take at all except perhaps for those who have never been elbow-deep in ML model architecture design, analysis, and training. The headline alludes to a surprising fact that you learn throughout course studies, a sidenote that can be repeatedly referred back to in order to emphasize the universality of statistical reasoning, but it's certainly not worthy of some kind of manifesto.

I agree we need to demystify ML for the common audience but this is a messaging problem much moreso than it is a pedagogical one. Typically the standard for publication is "genuine novel contribution" but no one who has been through a study regimen about ML will learn anything new from this. Preprints are supposed to be reserved for those papers which anticipate publication but I see no path for this paper to be accepted anywhere.


The paper offers a counterpoint to a published work (Zhang et al., 2021) which together with their earlier unpublished Arxiv version from 2016 has over 7000 citations. If you disagree with this rebuttal, by all means formulate what you find lacking.

> The fact that other people publish crap is irrelevant

You argued that this work is not something that can seriously be considered for publication and cannot be counted as a preprint. It has been pointed out that many works do get submitted to academic venues that aren't up to this quality. You're shifting goalposts.

You are making dismissive remarks without having to state your own view. Do you think Zhang et al's view is correct and deep learning shows novel effects that existing tools can't describe? Do you think the current manuscript does not effectively address those points? You have to argue if you think you have arguments. Labeling something a manifesto or a hot take is just low effort jab. Why do you think that the paper has no chance of acceptance? You are rehashing the same non-argument in different words.

A useful comment would state something like: the authors still do not explain effect X and Y that appears only in deep learning and not in classic ML. Or: the authors' point regarding effect X is incorrect and does not actually show what they claim to show. Etc. Simply saying "it's unserious" can be just turned back at your comment the same way.


> many works do get submitted to academic venues that aren't up to this quality

There's that normative argument rearing its head again. Not interested in jumping off bridges just because your friends do it, thanks.

It is unserious. The thesis amounts to "statistical models like these are mean-field, roughly-max-entropy approximations under the implied data generating process", which is not only an offhand comment a professor might make in ML 201 but tautological on its face. The fact they drag in a couple of citations to say as much is beside the point entirely.


It's a higher-quality article than half the submissions to NeurIPS and the rest of the AI/ML conferences, because it has the potential to remain relevant next year, and because of its high didactic content. I wouldn't classify it as a "hot take".


The implication that any software is "mysterious" is problematic - there is no "woo" here - the exact state of the machine running the software may be determined at every cycle. The exact instruction and the data it executed with may be precisely determined, as can the next instruction. The entire mythos of any software being a "black box" is just so much advertising jargon, perpetuated by tech bros who want to believe they are part of some Mr. Robot self-styled priestly class.


You're misunderstanding. A level of abstraction is necessary for operation of modern systems. There is no human alive who, given an intermediate step in the middle of some running learning algorithm, is able to understand and mentally model the full system at full man-made resolution, that is, down to the transistor level, on a modern CPU. Someone wishing to understand a piece of software in 2025 is forced to, at some point, accept that something somewhere "does what it says on the tin" and model it thusly rather than having a full understanding.


It's not a misunderstanding at all - but your response is certainly an attempt to obfuscate the point being made. The moment you represent anything in code, you are abstracting a real thing into its digital representation. That digital representation is fully formed at every cycle of the digital system processing it, and the state of the system - all the way down to the transistor level - may be precisely determined. To say otherwise is to make the same error as those who claim that consciousness or understanding are indefinable "extra-ordinary" things that we have to just accept exist without any justification or evidence.


Okay, then, you're just using your own personal definition of "black box" instead of the one everyone else uses.

Something that's a black box is unknown to the speaker. It's not understood to be unknowable to anyone.


So your claim is that there are instructions, data, or both that are unable to be determined in what, is by definition, a fully deterministic machine?


By an individual person, yes. I claim that there exists no single human capable of fully understanding the totality of the software and hardware down to the individual transistor level.


That's a very wrong statement. Pretty sure I could explain all the maths, all the physics, all the electronics, all the operating systems and all the user space of a single high level language operation, when I was a fresh graduate. Now, I have forgotten most of the physics and electronics, since the university was quite some time ago, but feel free to ask any decent student of an IT bachelor, they should be able to pretty much build the PC from scratch. Sure, modern processors and whatnot add a bunch of optimizations, but you seem to really overstate the complexity of the computer.


We're talking about two separate things.

I'm talking about understanding, fully, the state of the CPU. Not just the conceptual operation of the CPU. Like, given a specific, modern AMD or Intel CPU, understand fully all states of all transistors.


I agree and never claimed that "a single person" could - but just because something is too complex for a single person to fully understand does not make it "mysterious" or a "black box". So what is the claim you are making? Anything beyond the complexity of a single person to understand = magic?


We're just using different definitions for "black box".

My definition is that it's something unknown, yours is that it's something unknowable.


The mystery was never in the "how do computers calculate the probabilities of next tokens" but rather in the "why is it able to work so well" and "what does this individual neuron contribute to the whole model"


The mystery is in how the data is encoded in the parameters and why LLM performance scales so well with parameters. The key seems to be almost orthogonal vectors that allow neural networks to store so much data. They allow 2^(cn) vectors to be learned in an n-dimensional space, with c being a constant. Since almost orthogonal vectors have very small dot products, they minimally interfere with each other, allowing many concepts to coexist with limited cross-talk, which enables superposition.
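
A quick numpy sketch of the "almost orthogonal" point (my own illustration): random unit vectors in a high-dimensional space have pairwise dot products concentrated near zero, so you can pack far more of them than the dimension before they interfere much.

  import numpy as np

  rng = np.random.default_rng(0)
  n_dim, n_vecs = 1000, 5000                        # many more vectors than dimensions

  V = rng.normal(size=(n_vecs, n_dim))
  V /= np.linalg.norm(V, axis=1, keepdims=True)     # normalise to unit length

  # Dot products over a random sample of pairs (the full n_vecs x n_vecs matrix is big).
  i, j = rng.integers(0, n_vecs, size=(2, 10000))
  dots = (V[i] * V[j]).sum(axis=1)
  dots = dots[i != j]                               # drop the occasional self-pair

  print(float(np.abs(dots).mean()))                 # small: on the order of 1/sqrt(n_dim)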


I don't know any serious programmer who thinks that, just because each operation is simple, the operation of the whole thing can't be mysterious.


But the weights trained from machine learning are a black box, in the sense that no human designed e.g. the image processing kernels that those weights represent.

That is one reason people are skeptical of them: not only is training a large model at home expensive, and not only is the data too big to trivially store, but the weights are not trivial to debug either.



