I sort of wish that we would move on from the "grokking" terminology in the way that the field generally uses it (a magical kind of generalization that may-or-may-not-suddenly-happen if you train for a really long time).
I generally regard grokking as a failure mode in a lot of cases -- it's oftentimes not really a good thing. It tends to indicate that the combination of your network, task, and data is poorly suited for learning {XYZ} thing. There are emergent traits which I think the network can learn in a healthy manner over training, and I think those tend to fall under the 'generalization' umbrella.
Though I'd strongly prefer to call it 'transitive' rather than 'compositional' in terms of generalization, as transitive is the formal term most disciplines use for such things, while compositional has a different, more general meaning entirely. Similarly, I'd replace 'parametric' and 'non-parametric' with 'internal' and 'external', etc. Slogging through the definitional salad (this paper alone takes up roughly half of the top Kagi hits for 'parametric memory') makes actually interpreting an argument more difficult.
One reinterpretation of the problem is -- of course external-memory models will have trouble generalizing to certain things in the way that models relying on internal memory do! This is because, in part, models with internal memory will have had much more 'experience' integrating the examples that they've seen, whereas, for an external-memory model like a typical RAG setup, almost anything could show up at retrieval time.
But, that being said, I don't think you can necessarily isolate that to the type of memory that the model has alone, i.e., I don't think you can clearly say even in a direct comparison between the two motifs that it's the kind of memory itself (internal vs. external) that is to blame for this. I think that might end up leading down some unfruitful research paths if so.
That said, one positive about this paper is that they seem to have found a general circuit that forms for their task, and they analyze it; I believe that has value, but (and I know I tend to be harsh on papers generally) the rest of the paper seems to be more of a distraction.
Definitional salad buffets and speculation about the 'in' topics are going to be the things that make the headlines, but in order to make real progress, focusing on the fundamentals is really what's necessary here, I think. They may seem 'boring' a lot of the time, but they've certainly helped me quite a bit in my research. <3 :'))))
Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics-of-the-year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!)
There are certainly tradeoffs to both; the general transformer motif scales very well on a number of axes, so that may be the dominant algorithm for a while to come, though it will almost certainly change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).
This is incorrect enough to be dangerous (in my personal experience; I am not a doctor). They are non-drowsy because, as I understand it, they do not cross the blood-brain barrier effectively. Second and third generation antihistamines are fantastic.
While I agree with your comment, for some people non-drowsy antihistamines are a myth.
I must be overly sensitive or have a deficient BBB, because 10 mg of loratadine transforms me into a lethargic zombie for about 48 hours while providing minimal relief. A double dose of Vyvanse and a few coffees are not enough to bring me out of that state.
I had brain zaps with Zyrtec (Cetirizine) that took me a while to recognize for what they were because I thought they were related to other meds I was taking. I find Allegra (Fexofenadine) agrees with me a lot better. Personally I hate Claritin (Loratadine) as it definitely makes me depressed.
Experience with those others makes me wary of using Allegra except when my allergy symptoms are really bad.
BTW: Benadryl (Diphenhydramine), whose ingredient is also marketed in the same dose as a sleep aid, is really good for Poison Ivy because of its ability to penetrate into tissues really well. 30 years ago you would get a prescription for a round of steroid pills that would have you feeling pretty messed up for a week if you got Poison Ivy, but today you are likely to be told to go to the pharmacy and treat yourself with OTC pills. Poison Ivy is bad enough that most people will take the drowsiness.
Or even $7.99; that or $8.99 is sort of a nice line between signaling "very cheap game" and "potentially short but enjoyable experience for the evening, worth the gamble to find out".
I can't speak to it in general, that's just my 2c on the matter without really knowing too terribly much about the game (or game development in general, really, I'm just a consumer here! XD). <3 :'))))
Yeah, I saw the work from @Sree_Harsha_N, though the accuracy plot on the Adam/SGD side of things is very untuned -- about what one could expect from an afternoon of working with it -- and as far as baselines go, most people in the weeds with optimizers would recognize that it's not a good point of comparison (not to dump on the reproduction efforts).
Hence why I think it might be hard to compare them accurately: SGD and Adam/AdamW likely have better potential top ends, but they're going to get thrashed in public comparisons vs. an optimizer that performs more flatly overall. Aaron works at FAIR, so I am assuming that he knows this; I reached out with some concerns on my end a little bit before he published the optimizer but unfortunately didn't hear back either.
yeah it's been crazy to see how things have changed and im really glad that theres still interest in optimizing things for these benchmarks. ;P keller's pretty meticulous and has put in a lot of work for this from what i understand. im not sure where david's code came from originally, but it definitely impacted my code as i referenced it heavily when writing mine, and keller rewrote a lot of my code with his style + the improvements that he made in turn. hopefully the pedigree of minimal code can continue as a tradition, it really has a surprising impact
96 legitimately is pretty hard, i struggled doing it even in 2 minutes, so seeing it in 45 seconds is crazy. definitely gets exponentially harder for every fraction of a percent, so i think that's a pretty big achievement to hit :D
Ye this is a massive achievement indeed - I was quite astounded - I 100% will run this and I wanna read up on the paper - https://arxiv.org/pdf/2404.00498.pdf!
This is a pretty hyped-up optimizer that seems to have okay-ish performance in practice, but there are a number of major red flags here. For one, the baselines are decently sandbagged, yet the twitter posts sharing them (which are pretty hype-y) directly say that the baselines are "highly tuned" and that there's no benchmark trickery (which is flat-out wrong). If someone has not had experience with said benchmarks, it is a plausible statement; having worked with some of these datasets very closely, I can say some of the baselines are simply terrible, and I don't know where they came from.
Additionally, the optimizer does actually appear to have a kind of momentum, despite claims to the contrary, and uses it with a nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' because the schedule is actually hardcoded into the algorithm itself -- 1./steps_taken, which is hardly a rare learning rate schedule. It is a decently robust but sometimes suboptimal schedule, and I find it sketchy to claim that the method is 'schedule-free'. This also cripples the optimizer by tying performance to the number of steps taken -- which is potentially a problem if you are using any batchsize+lr scaling strategies, as I understand it.
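For concreteness, here is a rough sketch of the three-line inner loop I'm describing, written from memory in plain Python (variable names are mine, and the ordering may not match the released code exactly):

    # Schedule-Free-style SGD step, as I understand it (sketch from memory).
    # x: slow running average, z: fast iterate, t: steps taken so far.
    def schedule_free_sgd_step(x, z, grad_fn, lr, beta, t):
        # Gradient is evaluated at an interpolation between z and x --
        # this is the momentum / nesterov-like part.
        y = (1 - beta) * z + beta * x
        # Plain SGD step on the fast iterate.
        z = z - lr * grad_fn(y)
        # The 'schedule': x averages z with weight 1/steps_taken,
        # i.e. the hardcoded 1./t factor mentioned above.
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z
        return x, z

The point being: the y interpolation carries momentum-like state, and the 1/(t+1) averaging weight is, functionally, a schedule.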
There is a mixture of hype and substance here, and I wish the author were more straightforward with their approach and claims. I think there is the potential for a good "bolts-included" optimizer with some of the ideas being presented here, but the amount of overhyping and deception makes me not want to trust any of the work that follows.
Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at the very best deceptive, and at the very worst, untrue. I could be wrong -- these are just my personal opinions from my own experience, but I do occasionally find myself distraught about the things that tend to catch wind in the technical news cycle.
The behavior is actually more complex than a 1/t schedule. It behaves like a linear decay schedule 1-t/T with fixed stopping time T, as if T had been chosen in advance as the current timestep. When warmup is included, this is similar to high performance triangular learning rate schedules.
Schedules of the form 1/t perform really poorly in practice; we actually did a large-scale comparison that included them in a prior paper: https://arxiv.org/pdf/2310.07831.pdf
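As a quick back-of-the-envelope check of the linear-decay claim (my own toy script, not from the paper): uniformly averaging SGD iterates gives each gradient an effective weight of (T - t + 1)/T, roughly 1 - t/T, i.e. a linear decay rather than a 1/t decay.

    import numpy as np

    rng = np.random.default_rng(0)
    T, lr = 1000, 0.1
    grads = rng.normal(size=T)              # stand-in stochastic gradients

    # Path A: run SGD and keep the uniform average of the iterates z_1..z_T.
    z, iterates = 0.0, []
    for g in grads:
        z -= lr * g
        iterates.append(z)
    x_avg = np.mean(iterates)

    # Path B: one point reached with linearly decayed effective step sizes,
    # where gradient t gets weight (T - t + 1) / T.
    weights = (T - np.arange(1, T + 1) + 1) / T
    x_decay = -lr * np.sum(weights * grads)

    print(x_avg, x_decay)                   # identical up to floating point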
My main current concern is that I asked for a transformer benchmark to see if this works on transformers, but didn't get any response. Also, they seem particularly focused on CNN-type benchmarks, but did not bother to benchmark superconvergence + Ranger21 + the learning rate range finder, even though they explicitly said Schedule-Free needs tuning as well.
Their past research on D-Adaptation (which won an ICML 2023 best paper award) and their follow-up work Prodigy all did worse than or similar to AdamW, so maybe this works on CNNs but not on transformers - but for CNNs we already have superconvergence.
I shall wait for their paper which will come in 1-2 months.
Funding is a huge one as well. Funding is the wheel that drives the project (source: I have been hanging around the project people for a little while).
If you know anyone that would help chip in for Phase 2 of the project (scaling up), please let Nat know! (I'm not directly affiliated with the project management team, just pointing to him as a great contact for that.... <3 :')))) )
It seems "weird" that none of the mega-rich have committed a few million dollars for this; it looks like a very good way to build a legacy while benefiting humanity, and e.g. Bezos would probably find a million dollars behind the couch pillows.
Bezos funded the reading of the Archimedes palimpsest, didn't he? I guess this would be right in his wheelhouse then. Unless, of course, he is only willing to finance the decryption of his own property...
It's almost nauseating to me that every month or so our nation demonstrates it is capable of and willing to collectively chip in enough money to turn one random nobody into a near-billionaire, much of which gets promptly vaporized on drugs and tacky status-symbol purchases for themselves and maybe some immediate family, when the same money would fund a hundred Vesuvius Challenges a year at several times the scale of this project.
I appreciate the effort that went into this visualization; however, as someone who has worked with neural networks for 9 years, I found it far more confusing than helpful. I believe that was due to it trying to present all items at once instead of deferring to abstract concepts, though I am not entirely sure. <3 :'))))
Some of the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... ! If we have not read the foundations of the field that we are in, we are doomed to be mystified by unexplained phenomena which arise pretty naturally as consequences of already-distilled work!
That said, on a first pass / initial cursory examination, the experiments seem very thorough, and I appreciate the amount of detail that seems to have gone into them.
The choice between learning existing theory and attempting to re-derive it from scratch is, I think, a hard tradeoff: not having the traditional foundation allows for the discovery of new things, but having it allows for a deeper understanding of certain phenomena. There is a cost either way.
I've seen several people here in the comments seemingly shocked that a model that maximizes the log likelihood of a sequence given the data somehow does not magically deviate from that behavior when run in inference. It's a density estimation model, do you want it to magically recite Shakespeare from the void?
Please! Let's stick to the basics, it will help experiments like this make much more sense as there already is a very clear mathematical foundation which clearly explains it (and said emergent phenomena).
If you want more specifics, there are several layers; Shannon's treatment of ergodic systems is a good start (there is some minor deviation from that here, but it is likely a 'close enough' match to what's happening to be properly instructive to the reader about the general dynamics of what is going on, overall).
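To make the Shannon point concrete, here is a tiny toy of my own (not from the article): fit next-character frequencies on a short corpus and sample greedily. The "model" is just the empirical conditional density, and greedy sampling walks straight back through (and loops over) memorized training phrases.

    from collections import Counter, defaultdict

    corpus = "to be or not to be that is the question"
    order = 3
    counts = defaultdict(Counter)
    for i in range(len(corpus) - order):
        counts[corpus[i:i + order]][corpus[i + order]] += 1

    context = corpus[:order]
    out = context
    for _ in range(40):
        if context not in counts:
            break
        out += counts[context].most_common(1)[0][0]   # greedy = mode of the density
        context = out[-order:]
    print(out)    # regurgitates / loops over the training text

An LLM is vastly more sophisticated, of course, but the objective it is trained on is the same flavor of conditional density estimation, which is why the behavior in the article should not be mystifying.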
> which clearly explains it (and said emergent phenomena)
Very smart information theory people have looked at neural networks through the lens of information theory and published famous papers about it years ago. It couldn't explain many things about neural networks, but it was interesting nonetheless.
FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and that it totally explains everything... (kind of with so and so exceptions, well and also this and that and..). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).
The article introduced a discrete algorithmic method for approximating the gradient-optimized model.
It would be interesting to optimize the discrete algorithm for both design and inference times, and see if any space or time advantages over gradient learning could be found. Or whether new ideas popped up as a result of optimization successes or failures.
It also might have an advantage in terms of algorithm adjustments. For instance, given the most likely responses at each step, discard the most likely whenever the runners-up are not too far below - and see if that reliably avoided copyright issues.
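A hypothetical sketch of that adjustment (the function name and the margin parameter are mine, purely illustrative): at each step, if the runner-up is "not too far below" the top choice, skip the top choice.

    def pick_next(probs, margin=0.1):
        # probs: mapping of candidate token -> probability at this step.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
            return ranked[1][0]    # deliberately avoid the most likely continuation
        return ranked[0][0]        # otherwise take it as usual

    print(pick_next({"the": 0.40, "a": 0.35, "dog": 0.25}))   # -> "a"
    print(pick_next({"the": 0.80, "a": 0.15, "dog": 0.05}))   # -> "the"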
A lot easier to poke around a discrete algorithm, with zero uncertainty as to what is happening, vs. vast tensor models.
I appreciate what you're saying, but convergence (via alternative paths, of various depths) is its own signal. Repeated rediscovery perhaps isn't necessarily wastefulness, but affirmation and validation of deep truth for which there are multiple paths of arrival :)
I wish that this worked out in the long run! However, watching the field spin its wheels in the mud over and over with silly pet theories and local results makes it pretty clear that a lot of people are just chasing the butterfly, then after a few years grow disenchanted and sort of just give up.
The bridge comes when people connect concepts to those that are well known and well understood, and that is good. It is all well and good to say in theory that rediscovering things is bad -- it is not necessarily! But when it becomes groundhog day for years on end without significant theoretical change, then that is an indicator that something is amiss in general in how we learn and interpret information in the field.
Of course, this is just my crotchety young opinion coming up on 9 years in the field, so please take that with a grain of salt and all that.
Meanwhile, in economics, you have economists arguing that the findings of anthropologists are invalid because the anthropologists don't understand modern economic theory. It's history that needs to change.
In another adjacent thread, people are talking about the implications, with regard to copyright, of a neural network conforming to the training data within some error margin.
Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they're even used in applications like compression because of this property[2][3], so it's no surprise that when the NYT prompted OpenAI models with a few paragraphs of their articles, the models reproduced them nearly verbatim.
Yes! This is a consequence of empirical risk minimization via maximum likelihood estimation. To have a model not reproduce the density of the data it trained on would be like trying to get a horse and buggy to work well at speed, "now just without the wheels this time". It would generally not go all that well, I think! :'D
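A one-line way to see this (a standard identity, not something specific to the paper): maximizing average log likelihood is the same as minimizing the KL divergence to the empirical training distribution,

    \hat\theta_{\mathrm{MLE}}
      = \arg\max_\theta \frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i)
      = \arg\min_\theta \mathrm{KL}\big(\hat p_{\mathrm{data}} \,\|\, p_\theta\big)

so the training objective is, quite literally, "match the empirical density of the training set as closely as the model family allows". Reproducing training data is the objective working as intended, not an anomaly.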
I get the feeling you may not have read the paper as closely as you could have! Section 8 followed by Section 2 may look a tiny bit different if you consider it from this particular perspective.... ;)
This message confused me on a few dimensions, so I translated it a bit:
"State subjective perspective as objective fact. Cast shame upon the OP for not pre-aligning with said belief. Put the responsibility on the OP to prove that they are not deserving of shame."
I grew up in an environment where this kind of communication was sort of the default, hence why I was curious and wanted to drill down a bit and give it some thought. Of course, many people agree that Twitter is more unhealthy than healthy. But that's not entirely the point here, I think.