
Honestly, these days I just rely on arxiv. The conferences are so noisy that it is hard to really tell what's useful and what's crap. Twitter is a bit better but still a crap shoot. As far as I can tell, there's no good signal to differentiate with. And what's the point of journals/conferences if not to provide some reasonable signal? If it is a slot machine, it is useless.

And I feel like we're far too dismissive of the instances we see where good papers get rejected. We're too dismissive of the collusion rings. Why am I putting in all this time to write and all this time to review (and be an emergency reviewer) if we aren't going to take some basic steps forward? Fuck, I've saved a Welling paper from rejection by two reviewers who admitted to not knowing PDEs, and this was at a workshop (it should have been accepted into the main conference). I think review works for those already successful, who can p̶a̶y̶ "perform more experiments when requested" their way out of review hell, but we're ignoring a lot of good work simply for lack of m̶o̶n̶e̶y̶ compute. It slows down our progress toward AGI.



I agree with almost everything you said, except that Twitter is better than top conferences, and I take a contrarian view on the claim that reviewers slow down AGI with requests for additional experiments. Without going into specifics, which you can probably guess based on your background, too many ideas that work well, even optimally, at small scale fail horribly at large scale. Other ideas that work at super specialized settings don’t transfer or don’t generalize. The saving of two or three dimensions for exact symmetry operations is super important when you deal with a handful of dimensions, and is often irrelevant, or slows training down a lot, when you already deal with tens of thousands of dimensions. Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely.

It is very likely detrimental to our progress toward AGI that we lack abundant hardware for academics and hobbyists to contribute frontier experiments; however, we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences. The particular work listed on HN stands out despite its lack of scaling and will probably make it into a top conference (perhaps with some additional background citations), but not everything that is merely novel should make it into ICLR or NeurIPS or ICML; otherwise we could have a million papers in each a few years from today and nobody would be the wiser.


> too many ideas that work well, even optimally, at small scale fail horribly at large scale.

Not that I disagree, but I don't think that's a reason not to publish. There's another way to phrase what you've said:

  many ideas that work well at small scales do not trivially work at large scales
But this is true for many works, even transformers. You don't just scale by turning up model parameters and data. You can, but generally more is going on. So why hold these works back because of that? There may be nuggets of value in there, and people may learn how to scale them. Just because they don't scale (now or ever) doesn't mean they aren't of value (and let's be honest, if they don't scale, that's a real killer for the "scale is all you need" people).

> Other ideas that work at super specialized settings don’t transfer or don’t generalize.

It is also hard to tell whether those are really just hyper-parameter settings. Not that I disagree with you, but it is hard to tell.

> Correlations in huge multimodal datasets are way more complicated than most humans can grasp and we will not get to AGI before we can have a large enough group of people dealing with such data routinely.

I'm not sure I understand your argument here. The people I know who work at scale often have the worst understanding of large data. Not understanding the difference between density in a normal distribution and in a uniform one. Thinking that LERPing in a normal distribution yields representative samples. Or the relationship between cosine similarity and orthogonality. IME, people who work at scale benefit from being able to throw compute at problems.
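To make that concrete, here's a minimal numpy sketch (my own illustration, not anything from the thread) of why LERPing between two samples of a high-dimensional standard normal doesn't give representative samples: typical draws concentrate near a shell of radius about sqrt(d), the linear midpoint falls well inside it, and two random draws are nearly orthogonal.

  import numpy as np

  rng = np.random.default_rng(0)
  d = 10_000
  x, y = rng.standard_normal(d), rng.standard_normal(d)

  # Typical samples sit near the shell of radius sqrt(d) = 100
  print(np.linalg.norm(x), np.linalg.norm(y))

  # The LERP midpoint is ~N(0, I/2), so its norm is ~sqrt(d/2) ~= 70.7,
  # i.e. well inside the shell and not a representative sample
  print(np.linalg.norm(0.5 * (x + y)))

  # Two independent high-dim Gaussian draws are nearly orthogonal
  cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
  print(cos)  # ~0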

> we don’t do anybody a favor by increasing the entropy of the publications in the huge ML conferences

You and I have very different ideas as to what constitutes information gain. I would say the majority of people studying just two model families (LLMs and diffusion) results in less gain, not more.

And as I've said above, I don't care about novelty. It's a meaningless term. (and I wish to god people would read the fucking conference reviewer guidelines as they constantly violate them when discussing novelty)


I think information gain will be easy to measure in principle with an AI in the near future: given that the work is correct, how unexpected is it? Anything trivially predictable from the published literature, including exact reproduction disguised as novel, is not worthy of too much attention. Anything that has a chance of changing the model of the world is important. It can seem minor, even trivial, to some nasty reviewer, but if the effect is real and not demonstrated before, then it deserves attention. Until then, we deal with imperfect humans.

Regarding large multimodal data, I don’t know which people you refer to, so I can’t comment further. The current math is useful but very limited when it comes to understanding the densities in such data; random vectors are nearly orthogonal in high dimensions and densities are always sampled very sparsely. The type of understanding of data that would help progress in drug and material design, say, is very different from the type that can help a chatbot code. Obviously a future AI should understand it all, but it may take interdisciplinary collaborations that best start at an early age and don’t fit the current academic system very well, unfortunately.


> will be easy to measure in principle with an AI in the near future

I'd like to push back on this quite a bit. We don't have AI that shows decent reasoning capabilities. You can hope that this will be resolved, but I'd wager that this will just become more convoluted. A thing that acts like a human, even at an indistinguishable level, need not be human nor have the same capabilities as a human[0]. This question WILL get harder to answer in the future, I'm certain of that, but we do need to be careful.

Getting to the main point, metrics are fucking hard. The curse of dimensionality isn't just that there are lots of numbers, it is that your nearest neighbor becomes ambiguous. It is that the difference between the furthest point (neighbor) and the closest point (nearest neighbor) decreases. It is that orthogonality becomes a vaguer concept. That the mean may not be representative of a distribution. This is stuff that is incredibly complex and convolutes the nature of these measurements. For AI to be better than us, it would have to actually reason, because right now we __decide__ not to reason and instead __decide__ to take the easy way out and act as if metrics behave the same as they do in 2D (ignoring all advice from the mathematicians...).
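A quick sketch of the distance-concentration point (again my own illustration, assuming iid Gaussian data): as the dimension grows, the relative gap between a query's nearest and farthest neighbor shrinks, which is exactly what makes "nearest neighbor" ambiguous.

  import numpy as np

  rng = np.random.default_rng(0)
  n = 1_000
  for d in (2, 10, 100, 1_000, 10_000):
      points = rng.standard_normal((n, d))
      query = rng.standard_normal(d)
      dists = np.linalg.norm(points - query, axis=1)
      # Relative contrast between farthest and nearest neighbor;
      # it shrinks toward 0 as d grows, so "nearest" loses meaning.
      contrast = (dists.max() - dists.min()) / dists.min()
      print(f"d={d:>6}  relative contrast = {contrast:.3f}")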

It is not necessarily about the type of data when the issue we're facing sits at a level of abstraction above any particular type of data. Categorically they share a lot of features. The current mindset in ML is "you don't need math," when the current wall we face depends heavily on understanding exactly this kind of complex mathematics.

I think it is incredibly naive to just rely on AI to solve our problems. How do we make AI that solves problems when we __won't__ even address the basic nature of the problems themselves?

[0] As an example, think about an animatronic duck. It could be very lifelike and probably even fool a duck. In fact, we've seen pretty low quality ones fool animals, including just ones that are static and don't make sounds. Now imagine one that can fly and quack. But is it a duck? Can we do this without the robot being sentient? Certainly! Will it also fool humans? Almost surely! (No, I'm not suggesting birds aren't real. Just to clarify)


An AI that can help referee papers to advance human knowledge doesn’t need to have lots of human qualities. I think it suffices if a) it has the ability to judge correctness precisely, and b) it expresses a degree of surprise (low log likelihood?) if the correct data does not fit its current worldview.
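If it helps pin down what I mean by surprise, here is a toy sketch (hypothetical numbers, not a real referee model): score a verified claim by its surprisal, -log2 p, under the model's current worldview. Correct but low-probability results score high; trivially predictable ones score near zero.

  import math

  def surprisal_bits(p: float) -> float:
      """Surprisal, in bits, of an outcome the model assigned probability p."""
      return -math.log2(p)

  # Hypothetical probabilities a referee model might assign before verification
  claims = {
      "exact reproduction of a published result": 0.95,
      "modest improvement on a standard benchmark": 0.30,
      "simple method matching a scaled-up baseline": 0.02,
  }
  for claim, p in claims.items():
      print(f"{surprisal_bits(p):5.2f} bits  {claim}")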


> it has the ability to judge correctness precisely,

That's not possible from a paper.

> it expresses a degree of surprise (low log likelihood?)

I think you're interpreting statistical terms too literally.

The truth of the matter is that we rely on a lot of trust from both reviewers and authors. This isn't a mechanical process. You can't just take metrics at face value[0]. The difficulty of peer review is exactly the thing AI systems are __the worst__ at, and that we have absolutely no idea how to resolve. It is about nuance. Anything short of nuance and we get metric hacking. And boy, if you want to see academic work degrade, make the referee an automated system. No matter how complex that system is, I guarantee you human ingenuity will win and you'll just get metric hacking. We already see this in human-led systems (like "peer review"; anyone who's ever had a job has experienced this).

I for one don't want to see science led by metric hacking.

Processes will always be noisy, and I'm not suggesting we can get a perfect system. But if we're unwilling to recognize the limitations of our systems and the governing dynamics of the tools that we build, then we're doomed to metric hack. It's a tale as old as time (literally). Now, if we create a sentient intelligence, well, that's a completely different ball game, but that's not what you were arguing either.

  You need to stop focusing on "making things work" and start making sure they actually work. No measurement is perfectly aligned with one's goals. Anyone in ML who isn't intimately familiar with Goodhart's Law is simply an architect of Goodhart's Hell.
Especially if we are to discuss AGI, because there is no perfect way to measure and there never will be. That is a limitation of physics and mathematics. The story of the Jinni is precisely about this; we've just formalized it.

[0] This is the whole problem with SOTA. Some metrics no longer actually mean anything useful. I'll give an example: look at FID, the main metric for the goodness of image generation. Its assumptions are poor (the features aren't very normal, and it relies on an ImageNet-1k-trained network, which is extremely biased. And no, these issues aren't solved by just switching to CLIP-FID). There have been many papers written on this, and similar ones for any given metric.
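For concreteness, here is a minimal sketch of the Fréchet distance that FID computes (my own illustration; `real_feats` and `fake_feats` stand in for (n, d) arrays of Inception or CLIP features). It makes the assumption explicit: each feature set is summarized by a mean and covariance only, so anything non-Gaussian in the features is simply discarded.

  import numpy as np
  from scipy.linalg import sqrtm

  def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
      """Frechet distance between Gaussian fits to two feature sets."""
      mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
      cov_r = np.cov(real_feats, rowvar=False)
      cov_f = np.cov(fake_feats, rowvar=False)
      covmean = sqrtm(cov_r @ cov_f)
      if np.iscomplexobj(covmean):
          covmean = covmean.real  # drop tiny imaginary parts from numerical noise
      diff = mu_r - mu_f
      return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))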


Yes, arxiv is a good first source too. I mentioned conferences as a way to get exposed to diversity, but not necessarily (sadly) merit. It has been my experience as both an author and a reviewer that review quality has largely plummeted over the years. As a reviewer I have had to struggle with the ills of both "commission and omission," i.e., (a) convincing other reviewers to see an idea (from a trendy area such as in-context learning) as not novel (because it had been done before, even in the area of LLMs), and (b) getting them to see an idea as novel when it wouldn't have seemed so initially, because some reviewers weren't aware of the background or impact of anything non-LLM, or god forbid, non-DL. As an author this has affected me personally: I had to work on my PhD remotely, so I didn't have access to much compute, and I deliberately picked a non-DL area, and I paid the price for that in terms of multiple rejections, reviewer ghosting, and journals not responding for years (yes, years).


I've stopped considering novelty at all. The only thing I now consider is whether the precise technique has been done before. If not, well, I've seen pretty small things change results dramatically. The pattern that scares me more is that when authors do find simple but effective changes, they end up convoluting the ideas, because simplicity and clarity are often mistaken for a lack of novelty. And honestly, revisiting ideas is useful as our environments change. So I don't want to discourage this type of work.

Personally, this has affected me as a late PhD student. Late in the literal sense, as I'm not getting my work published (even some SOTA stuff) because of factors like these, and my department insists something is wrong with me but will not read my papers or the reviews, or suggest what I need to do besides "publish more." (Literally told to me: "try publishing 5 papers a year, one should get in.") You'll laugh at this: I submitted a paper to a workshop and a major complaint was that I didn't give enough background on StyleGAN because "not everyone would be familiar with the architecture" (while I can understand the comment, 8 pages is not much room when you have to show pictures on several datasets; my appendix was quite lengthy and included all the requested information). We just used a GAN as a proxy because diffusion is much more expensive to train (the most common complaints are "not enough datasets" and "how does it scale"). I think this is the reason so many universities use pretrained networks instead of training things from scratch, which just railroads research.

(I also had a paper double desk-rejected. First because it was "already published"; it took two months for them to realize it was on arxiv only. Then they fixed that and rejected it again because it "didn't cite relevant works," with no mention of what those works were... I've obviously lost all faith in the review process.)


Sorry to hear all this (after writing my other sibling comment). Please don't lose faith in the review process. It is still useful. At least until AGI can be a better reviewer, which is hopefully not too far in the future.


For me to regain faith in the review process I need to actually see some semblance of the review process working.

So far, instead, I've seen:

  - Banning social media posting, so that only big tech and collusion rings can promote work, in order to "protect the little guy"
  - Undoing the ban, to lots of complaints
  - Instituting a no-LLM policy with no teeth and no way to actually verify compliance
  - Instituting a high school track to get those rich kids in sooner
Until I see changes like "we're going to focus on review quality," I'm going to continue thinking it is a scam. They get paid by my tax dollars and by private companies, and I volunteer my time, for what...? Something an LLM could have actually done better? I'm seeing great papers from big (and small) labs get turned down while terrible papers get accepted. Collusion rings go unpunished. And methods get more and more convoluted as everyone tries to game the system.

You'd think that we in ML, of all people, would understand reward hacking. But until we admit it, we can't solve it. And if we can't solve it here, how the hell are we going to convince anyone we're going to create safe AGI?


I feel you. Here are some thoughts from the other side of the fence:

Social media banning aims to preserve anonymity when the reviews are blind. It is hard to convincingly keep anonymity for many submissions, but an effort to keep it is still worthwhile and typically helps the less privileged to get a fair shot at a decent review, avoiding the social media popularity contest.

The policies for LLM usage differ between conferences. The only possibly valid concern with the use of AI is the disclosure of non-public info to an outside LLM company that may happen to publish it or be retrained on that data (however unlikely this is in practice) before the paper becomes public; for example, someone could withdraw their submission and it would never see the light of day on the openreview website. (I personally disagree with this concern.) As far as I know there is no real limitation on using self-hosted AI as long as the reviewer takes full credit for the final product, and no limitation on using non-public AI to improve the review's clarity without dumping in the full paper text. A fraction of authors would appreciate better referee reports, so at a minimum, the use of AI can bridge the language gap. I wouldn't mind the conferences instituting automatic AI processing to help reviewers reduce ambiguity and avoid trivialities.

The high school track has been ridiculed, as expected. I think it is a great idea and doesn't only apply to rich kids. There are excellent specialized schools in NYC and other places in the US that might find ways to get resources for underprivileged, ambitious high schoolers. It is possible that in the future a variant of such a track will incentivize industry to donate compute resources to high school programs, and it may seed early, powerful local communities. I learned a lot at what would be middle school in the US by interacting with self-motivated children at an ad hoc computer club, and I kept up the same level of osmotic learning in the computer lab at college. The current state of AI is not super deep in terms of background knowledge, mostly super broad; some specialized high schools already cover calculus and linear algebra, and certainly many high schools nowadays provide sufficient background in programming and elementary data analysis.

My personal reward hacking is that the conferences provide a decent way to focus my reading on the top hundred or couple hundred plausible abstracts, and even when the eventual choice is wrong I get a much better reward-to-noise ratio than from social media or from trawling the raw arxiv feed (although LLMs help here as well). I always find it refreshing to see novel ideas in raw form, before they have been polished and before everyone can easily judge their worth. Too many of them get unnecessarily negative views, which is why the system integrates multiple reviewers and area chairs who can make corrective decisions. It is important to avoid too much noise, even at the risk of missing a couple of great ones, and yet it always hurts when people drop greatness because of misunderstandings or poor chair choices. No system is perfect, but scaling these conferences from a couple hundred people a year, up until about a dozen years ago, to approaching a hundred thousand a year has worked reasonably well.


> Social media banning aims to preserve anonymity when the reviews are blind.

Then ban preprints. That's the only resolution that actually solves the stated problem. But I think we both recognize that in doing so, we'd be taking steps back that aren't worth it.

> avoiding the social media popularity contest.

The unfortunate truth is that this has always been the case. It's just gotten worse because __we__ the researchers fall for this trap more than the public does. Specifically, we discourage dissenting opinions. Specifically, we still rely heavily on authority (but we call it prestige).

> The policies for LLM usage differ between conferences.

This is known, and my comment was a direct reference to the CVPR policy being laughable.

The point I was making is not as literal as your interpretation; it is one step abstracted: the official policies are being made carelessly, in ways that are laughable and demonstrate that barely an iota of reasoning went into them. That implies the goal is to signal rather than address the issues at hand. Because let's be real, resolving the issues is no easy task. So instead of acknowledging and addressing those difficulties, we sweep them under the rug and signal that we are doing something. But that's no different from throwing your hands up and giving up.

> The high school track ... doesn’t only apply to rich kids.

You're right in theory, but if you think this will hold in practice, I encourage you to reason a bit more deeply and talk to your peers who come from middle- and lower-class families. Ones where the parents were not in academia. Ones where they may be the only STEM person in their family. The only person pursuing graduate education. Maybe even the only one with an undergraduate degree (or where that is uncommon in their family). Ask them if they had a robotics club. A chess club. IB classes? AP classes? Hell, I'll even tell you that my undergraduate institution didn't even have research opportunities, and that is essentially a requirement for grad school now. Be wary of the bubbles you live in. If you do not have these people around you, then consider the bias/bubble that led to that situation.

And I'll ask you an important question: do you really think the difference between any two random STEM majors in undergrad is large? Sure, there's decent variance, but do you truthfully think that if you picked a random STEM student from a school ranked 100 and placed them in a top-10 school (assume financials are not an issue and forget family issues), they would not have a similar success rate? Because there's plenty of data on this (there's a reason I mentioned the specific caveats, but let's recognize those aren't about the person's capabilities, which is what my question is after).

If you are on my side, then I think you'd recognize that the way we are doing things gives up a lot of potential talent, and if you want to accelerate the path to AGI, I'd argue that this is far more influential than any r̶i̶c̶h̶ ̶c̶h̶i̶l̶d̶,̶ ̶c̶h̶i̶l̶d̶ ̶o̶f̶ ̶p̶r̶o̶f̶e̶s̶s̶o̶r̶ High School track. But we both know that's not going to happen, because we care more about e̶l̶i̶t̶i̶s̶m̶ "prestige" than efficiency. (And think about the consequences of this when we teach a machine to mimic humans.)

Edit: I want to make sure I ask a different question. You seem to recognize that there is a problem; I take it you think it's small. Then why defend it? Why not try to solve it? If you think there is no problem, why? And why do you think there isn't one when so many do? (There seems to be a bias in where these attitudes come from. And I want to make clear that I truly believe everyone is working hard. I don't think anyone is trying to undermine hard work. I don't care if you're at a rank-1 or rank-100 school; if you're doing a PhD, you're doing hard work.)


Sorry to hear that. My experiences haven't been very different. I really can't tell whether the current review process is the least bad among the alternatives or whether there is something better (if so, what is it?).


I'm sorry to hear that too. I really wish there was something that could be done. I imagine a lot of graduate students are in complicated situations because of this.

As for alternatives: I don't see why we don't just push to OpenReview and call it a day. We can link our code, it has revisions, and people can comment and review. I don't see how having 1-3 referees, who don't want to read my paper and have no interest in it but have strong incentives to reject it, provides any meaningful signal of value. I'll take arxiv over their opinions.


> I've saved a Welling paper from rejection from two reviewers who admitted to not knowing PDEs

Thank you for fighting the good fight.

This is why I love OpenReview: I can spot and ignore nonsensical reviewer criticisms and ratings and look for the insightful comments and rebuttals. Many reviewers do put a lot of very valuable work into reading and critiquing, most of which would go to waste if it weren't made public.


I like OR too, and I wish we would just post there instead. It has everything we need, and I see no value added by the venues. No one wants to act in good faith, and they have every incentive not to.

And I gotta say, I'm not going to put up a fight much longer. As soon as I get out of my PhD I intend to just post to OR.



