I am an ML researcher working in industry: by far the most effective way to maintain/advance my understanding of ML methods is to implement the core of an interesting paper and reproduce (some of) its results. Completing a working implementation forces your understanding to a whole other level than if you just read the paper and think "I get it". It is easy to read (for example) a diffusion/neural ODE paper and come away thinking that you "get it" while still having a wildly inadequate understanding of how to actually get it to work yourself.
You can view this approach in the same way that a beginner learns to program. The best way to learn is by attempting to implement (as much on your own as possible) something that solves a problem you're interested in. This has been my approach from the start (for both programming and ML), and is also what I would recommend for a beginner. I've found that continuing this practice, even while working on AI systems professionally, has been critical to maintaining a robust understanding of the evolving field of ML.
The key is finding a good method/paper that meets all of the following:
0) is inherently very interesting to you
1) you don't already have a robust understanding of the method
2) isn't so far above your head that you can't begin to grasp it
3) doesn't require access to datasets/compute resources you don't have
Of course, finding such a method isn't always easy and often takes some searching.
I want to contrast this with other approaches to learning AI, which include:
- downloading and running other people's ML code (in a jupyter notebook or otherwise)
- watching lecture series / talks giving overviews of AI methods
- reading (without putting into action) the latest ML papers
all of which I have found to be significantly less impactful on my learning.
Sorry if this is a stupid question, but from a non-practitioner's perspective, how or why is this sensible?
Most of the cutting edge papers are trained on several $100k worth of GPU time, so does it even make sense to implement the algorithm without the available data & compute? How can you be sure that your implementation is correct, if you can't train it (hence you can't run proper inference with a good model)?
Compare that to, e.g., reimplementing a pure CS paper: almost anything can be reimplemented in a simple way. Even something like "distributed database over 1000 nodes" doesn't technically need 1000 servers; you can just simulate them quite cheaply.
Of course there might be similar techniques for ML but I'm just not aware of them.
The whole objective here is personal learning; this advice would be wildly different if the question were how to practice ML professionally. The approach is directly analogous to advising a beginner programmer to get better at programming by actually writing computer programs.
> Most of the cutting edge papers are trained on several $100k worth of GPU time
It's beside the point, but I said nothing about requiring that the methods you choose to implement and learn from be cutting edge. More to the point, unless we have different definitions of what "cutting edge" means, you're wrong that "most of the cutting edge papers" require massive computational resources. If that were true it would be nearly impossible for the field to make progress at the pace it does. There is a plethora of research on purely algorithmic approaches which does not require massive compute resources, and in fact this is the most productive portion of research to learn from, because there the focus is on theory and on progress in how to conceptualize/frame ML problems. Works which amount to "we took method X and massively scaled it up" are (in my opinion) less intellectually interesting to someone seeking to grow their knowledge of ML (though the results may be extremely impressive and impactful, and the work may be intellectually very interesting for those working directly on that project).
> How can you be sure that your implementation is correct, if you can't train it (hence you can't run proper inference with a good model)?
This is like asking how you can be sure that you've correctly implemented a B-tree if you haven't used it to serve a distributed database to 1 million users. The answer is small, isolated tests.
One of the best ways to really test your knowledge of an ML algorithm is to design and write unit tests asserting that it behaves correctly on trivial cases. You'll find bugs in your implementation, but you'll also be forced to think carefully about which core characteristics of the algorithm must be asserted in order to convince yourself that it's correct. It's a common beginner mistake in ML to just run/train your model and have that be the only test of its correctness. It's like deploying a web service with zero tests and letting "do I get X number of users" be the only test of your code's correctness. That sounds insane, but it's basically equivalent to what most beginners do in ML (my former self included).
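To make that concrete, here's a minimal sketch in PyTorch of the kind of small, isolated tests I mean. The toy scaled-dot-product attention and the specific assertions are illustrative examples made up for this comment, not tests from any particular paper:

```python
import torch
import torch.nn.functional as F

def my_attention(q, k, v):
    # Toy reimplementation under test: scaled dot-product attention.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def test_attention_weights_sum_to_one():
    # The attention weights must form a probability distribution over keys.
    q, k = torch.randn(2, 4, 8), torch.randn(2, 6, 8)
    weights = F.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    assert torch.allclose(weights.sum(dim=-1), torch.ones(2, 4), atol=1e-5)

def test_single_key_returns_its_value():
    # With exactly one key/value pair, the output must equal that value.
    q, k, v = torch.randn(1, 3, 8), torch.randn(1, 1, 8), torch.randn(1, 1, 8)
    assert torch.allclose(my_attention(q, k, v), v.expand(1, 3, 8), atol=1e-5)

def test_tiny_model_overfits_one_batch():
    # Training sanity check: a small MLP should drive the loss near zero
    # on a single memorized batch.
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
    )
    x, y = torch.randn(16, 4), torch.randint(0, 2, (16,))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(1000):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.1
```

None of these need real data or a full training run, but each one pins down a property you'd expect from the math, and together they catch most of the silly bugs.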
Do you have a few of these cutting-edge algorithmic-advance papers in mind? Could you list them?
I guess I got too pessimistic because of things like "emergent features" [1] / "grokking" [2], which seem to happen only with a lot of compute, and also the fact that the original (vanilla) transformer architecture remains (one of) the best, despite many additional ideas and "advances" (though that is only evident at large scale) [3].
Because of the points above, it's really hard for me, as a non-expert, to assess which papers are true advancements, and which were only published in pursuit of vanity metrics (e.g. publication counts) but actually represent overfit/cherry-picked results rather than robust progress.
I posted in another comment on this thread a list of papers which met these criteria for me at the time and which I learned a lot by implementing.
> it's really hard for me, as a non-expert, to assess which papers are true advancements
It's hard for me too, though I wouldn't consider myself an expert, just someone with a moderate amount of experience. Learning to discriminate important from less important papers is another skill which takes effort to develop.
Often it might be viable to implement prediction w/o necessarily implementing training (especially if there are published weights or a reference implementation). Not viable for papers where the key contribution is a change to the pre-training objective / training methodology / optimizer, but useful for papers where the key contribution is architectural.
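As a rough sketch of that idea in PyTorch, using torchvision's published ResNet-18 weights as the reference (`MyResNet18` is a hypothetical module you'd have written from the paper, not something that exists):

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

from my_reimplementation import MyResNet18  # hypothetical: your from-scratch version

# Reference model with published weights.
reference = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()

# Load the same weights into your reimplementation (this assumes you matched
# the parameter names, or wrote a small mapping between the two state dicts).
mine = MyResNet18().eval()
mine.load_state_dict(reference.state_dict())

# If the forward passes agree on random inputs, the architecture is almost
# certainly implemented correctly, and no training run was needed.
x = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    assert torch.allclose(mine(x), reference(x), atol=1e-5)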
> Most of the cutting edge papers are trained on several $100k worth of GPU time
You can scale some things down. VGG-16 is basically a stack of convolutional layers; there's no reason you need 16 of them with an input size of 224x224x3 when you can just as easily watch a 4-layer CNN learn filters on inputs of size 64x64x1. Obviously if the paper's result is achieved through sheer compute this won't work, but plenty of results come purely from the architecture.
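For instance, a minimal sketch of that kind of scaled-down model in PyTorch (the layer widths here are arbitrary choices, not taken from the VGG paper):

```python
import torch
import torch.nn as nn

class TinyVGG(nn.Module):
    """A 4-layer VGG-style conv stack on 64x64 grayscale inputs."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8 -> 4
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = TinyVGG()
print(model(torch.randn(8, 1, 64, 64)).shape)  # torch.Size([8, 10])
```

Something this small trains in minutes on a laptop, and you can still watch the first-layer filters form during training.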
You could also implement and run networks that are designed to be really cheap to compute. ResNet/InceptionNet, for example. I think this is a pretty important part of the space right now, considering how performant, general, and therefore inefficient Transformer architectures are.
But these are "old" models from 5+ years ago. Implementing them is not going to help you get up to speed with more recent AI research. From the OP's post, it seems like he already knows these basics.
+1 on implementing papers; it's one of the best things you can do to improve your skills (anywhere in science or engineering, actually). A warning: I remember trying to do this back in my uni/grad days, and more often than not there is key information (perhaps even by accident) left out of the implementation description. I was more in mechanical engineering so perhaps this is less common in AI oriented papers, but I still think it's a valid thing to look out for.
> I was more in mechanical engineering so perhaps this is less common in AI oriented papers
No, you got it right. This is EXTREMELY prevalent in modern AI/ML papers, to everyone's detriment. In the majority of interesting cases, reproduction is only possible with the original code.
I think it's actually often worse in AI papers. Fortunately at least some bigger journals/conferences encourage or require releasing source code, which makes it easier to track down subtle details that the authors didn't clearly mention in the paper.
On top of that, due to its dependence on data and the ability to 'fudge' statistics, a lot of AI papers aren't really replicable even when there are no implementation subtleties. For example, I've run into papers on image generation which describe some trick to improve quality but focus entirely on standardized scores without providing any visual comparisons; as feared, on other datasets the trick turns out not to give as much of a visual improvement as the scores would suggest.
While in a lot of sciences or engineering many things can be attributed to being standard practice for experts in the field, AI moves too fast to have such standards and tends to be a bit too arbitrary for such standards to mean much.
This is an issue in biomedical research as well. Sometimes I've reached out to researchers who've done similar studies and asked them about missing details in their methods.
Because of criteria 0, 1, and 2, this depends entirely on the individual. However, some papers which fit the criteria for me at the time were the following:
I am not sure if this is exactly what you are looking for, but paperswithcode.com has a well-organized selection of research with publicly available source code. Anyone trying to reproduce the code independently from the paper can always take a peek at the original source for details which may not be clear.
I liked this site initially, but decided to read the ToS and was a bit turned off:
> To the extent that you provide User Content, you hereby grant us (and represent and warrant that you have the right to grant) an irrevocable, non-exclusive, royalty-free and fully-paid-up, worldwide license to reproduce, distribute, publicly display and perform, prepare derivative works of, incorporate into other works and otherwise use and exploit such User Content, and to grant sublicenses of the foregoing rights.
So not only can Meta use these cutting edge techniques in their products without needing to request permission from the implementer, they can also sell those derivatives to anyone and practically have full ownership over what was submitted. Furthermore:
> You assume all risks associated with the use of your User Content.
and
> For the avoidance of doubt, Meta Platforms does not claim ownership of User Content you submit or other content made available for inclusion via our Website.
So, if anything goes wrong, it is the implementer’s fault entirely, but Meta may freely profit from those submissions in any way they see fit.
Sure, this is a cynical reading of the ToS, but I’m assuming the ToS will only ever be used in Meta’s favor…
Most recent papers, in NLP at least, are so sparse on detail that it is impossible to reproduce their models. And then there's the compute cost, as at least one other poster has mentioned.