Scaling Down Deep Learning (greydanus.github.io)
124 points by lelf on Dec 5, 2020 | 36 comments


Model size is one of the most problematic aspects of deep learning.

I'm surprised we haven't seen more research on modular models. Random forests and boosting both work by building a collection of weak models, and combining their results to produce final predictions. It seems like building an ensemble of some sort of unsupervised learners, then training a smaller supervised network on that ensemble could produce good results, while also being more reusable and easier to train.
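
A minimal sketch of that idea, assuming scikit-learn; the PCA/KMeans feature extractors here are just illustrative stand-ins for the unsupervised ensemble:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.pipeline import FeatureUnion, make_pipeline
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Ensemble of unsupervised learners: each produces features, not predictions.
    unsupervised = FeatureUnion([
        ("pca", PCA(n_components=16)),
        ("kmeans", KMeans(n_clusters=20, random_state=0)),  # transform() = distances to centroids
    ])

    # A small supervised network trained on top of the ensemble's features.
    model = make_pipeline(unsupervised, MLPClassifier(hidden_layer_sizes=(32,), max_iter=500))
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))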


This is true for classical ML, but unfortunately it does not hold very well for hard problems. The reason is that GPT-3 et al. are meant as multipurpose pretrained models: their size is their strength, in that they can model highly complex datasets.

In my opinion the right approach to downsizing is to train at full size and then either distill or prune the model to keep the parts that are relevant to your problem.
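
For the pruning half of that, a minimal sketch using PyTorch's built-in magnitude pruning (the model and sparsity level are arbitrary placeholders):

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # A stand-in for a "full size" trained model.
    model = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10))

    # Remove the 70% smallest-magnitude weights from each linear layer.
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.7)
            prune.remove(module, "weight")  # make the pruning permanent

    sparsity = (model[0].weight == 0).float().mean().item()
    print(f"first layer sparsity: {sparsity:.0%}")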


This helps lessen issues like inference speed / requirements, but it doesn't address the environmental impact of training.


That's only for pretraining a model. Very few groups pretrain models, since it is so expensive in terms of GPU time. Finetuning a model for a specific task typically only takes a few hours. E.g. I regularly train multitask syntax models (POS tagging, lemmatization, morphological tagging, dependency relations, topological fields), which takes just a few hours on a consumer-level RTX 2060 super.

Unfortunately, distillation of smaller models can take a fair bit of time. However, there is a lot of recent work to make distillation more efficient, e.g. by not just training on the label distributions of a teacher model, but by also learning to emulate the teacher's attention, hidden layer outputs, etc.
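
A rough sketch of that kind of distillation loss, assuming teacher and student both expose logits and hidden states (shapes, layer mapping, and hyperparameters are simplified):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden, T=2.0, alpha=0.5):
        # Soft-label term: match the teacher's output distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hidden-state term: emulate the teacher's intermediate representations
        # (assumes the student's hidden size already matches, or was projected).
        hidden = F.mse_loss(student_hidden, teacher_hidden)
        return alpha * soft + (1 - alpha) * hidden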


Is there one model that you use more frequently than others as a base for these disparate fine tuning tasks? Basically, are there any that are particularly flexible?


In general, BERT would be the most common one. RoBERTa is the same model but trained for longer, which turns out to work better. T5 is a larger model, which works better on many tasks but is more expensive.
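
For reference, all three can be loaded with the Hugging Face transformers library, so it's cheap to try a few (a minimal sketch; the model names are their standard hub identifiers):

    from transformers import AutoModel, AutoTokenizer

    for name in ["bert-base-uncased", "roberta-base", "t5-base"]:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModel.from_pretrained(name)
        print(name, sum(p.numel() for p in model.parameters()) / 1e6, "M params")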


Thanks for the summary! I'm familiar with BERT, but less so the different variants, so that's quite helpful. I'll take a look at how RoBERTa works.


So far, of the models that run on GPUs with 8-16 GiB of VRAM, XLM-RoBERTa has been the best for these specific tasks. It worked better than the multilingual BERT model and language-specific BERT models by quite a wide margin.


Great, thanks very much for the pointer, especially the VRAM context - I'm looking to fine-tune on 2080Ti's rather than V100/A100s, so that's really good to know.


The environmental impact of training deep models is a tiny fraction of the impact of, e.g., mining Bitcoin.


Also, it's tiny compared to the server costs of big tech companies like Facebook and Google.


The environmental impact of training is generally considered to be a one-time cost.


Unfortunately, this is not what the science says in 2020. See for example the graph at https://youtu.be/OBCciGnOJVs?t=1250.

More parameters are inherently better, with the knowledge we have now.


You can make a model of any size with deep learning.

If your concerns are about overfitting, there are lots of regularization techniques used in practice, like dropout, weight decay, and data augmentation.
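
A quick sketch of those three knobs in PyTorch / torchvision (the values are arbitrary):

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import transforms

    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 256), nn.ReLU(),
        nn.Dropout(p=0.5),                      # dropout
        nn.Linear(256, 10),
    )
    optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # weight decay

    # Data augmentation applied to training images.
    train_transform = transforms.Compose([
        transforms.RandomCrop(28, padding=2),
        transforms.RandomRotation(10),
        transforms.ToTensor(),
    ])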

There's nothing preventing you from sharing weights across layers, and it would be interesting to see some research on that.


> There's nothing preventing you from sharing weights across layers, and it would be interesting to see some research on that.

E.g. the ALBERT model does that:

https://arxiv.org/abs/1909.11942

I have done model distillation of XLM-RoBERTa into ALBERT-based models with multiple layer groups, and for the tasks I was working on (syntax) it works really well.

E.g. we have gone from a finetuned ~1000MiB XLM-R base model to a 74MiB ALBERT-based model with barely any loss in accuracy.
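
The ALBERT-style sharing is easy to express in plain PyTorch: apply the same layer module repeatedly instead of stacking independent copies (a toy sketch, not the actual ALBERT code):

    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        def __init__(self, d_model=256, n_heads=4, n_steps=12):
            super().__init__()
            # One set of weights, reused n_steps times.
            self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.n_steps = n_steps

        def forward(self, x):
            for _ in range(self.n_steps):
                x = self.layer(x)  # same parameters at every "depth"
            return x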


You can also train a neural network with a single shared weight parameter:

https://weightagnostic.github.io/


The ROCKET paper [0] for time series classification works along those lines: they use a linear classifier on top of an ensemble of random convolutions (rough sketch below).

[0]: https://arxiv.org/abs/1910.13051
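
A very rough sketch of the idea: random, untrained 1-D convolutions as a fixed feature bank, then a linear model on top. This is a toy version, not the paper's exact featurization, which also uses random dilations and PPV pooling:

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.linear_model import RidgeClassifierCV

    def random_conv_features(x, n_kernels=500, kernel_size=9):
        # x: (n_series, length) -> (n_series, 2 * n_kernels) features
        conv = nn.Conv1d(1, n_kernels, kernel_size)   # random weights, never trained
        with torch.no_grad():
            out = conv(torch.as_tensor(x, dtype=torch.float32).unsqueeze(1))
        # Max and mean pooling per kernel, standing in for the paper's PPV/max features.
        return torch.cat([out.max(dim=2).values, out.mean(dim=2)], dim=1).numpy()

    # Usage: fit a linear classifier on the random features.
    # clf = RidgeClassifierCV().fit(random_conv_features(X_train), y_train)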


I'm all in favor of scaling down deep learning. Also, I recently published a paper on DNA compression that shows that deep learning is still quite easily beaten in terms of time and compression ratio by using specialized models and simple MLPs.

Open access to the paper here: https://doi.org/10.1093/gigascience/giaa119


Very interesting work, thanks for sharing! It's great that you included a link to the project on GitHub, it's unfortunately too rare to see research papers accompanied by source code.


Yes, this is a big problem. We want to do comparisons, and it is often quite difficult to even obtain the software and datasets.

For example, we are in the process of writing a paper on the compression of proteins. There are a few papers about this problem, but only one working compressor that we could find. Even more problematic is when the papers in question claim very good efficiency and we are left wondering if we are missing something, or if the data they used isn't the same data we have access to, etc. In that regard, GigaScience is great, because it places a lot of importance on reproducibility.


wait until DeepMind gets hold of this problem :-)


Meh, Deepmind is old news. Wait till the Jiuzhang team publishes its findings on this


One of the few things I remember from watching the fast.ai course a few years ago is to train a model with smaller data. For example, instead of 2k×2k images, downscale them to, say, 400×400. A network with a better design should still learn fastest on smaller data.

Also, one can train the network to a good accuracy, then change the input layer and unfreeze the inner layers; that way the network will have a head start.
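
A sketch of that head-start trick, assuming a simple MLP where the input layer's size depends on the image resolution (the layer indices below are specific to this toy model):

    import torch.nn as nn

    def make_model(resolution):
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(resolution * resolution, 512), nn.ReLU(),  # input layer
            nn.Linear(512, 256), nn.ReLU(),                      # inner layers
            nn.Linear(256, 10),
        )

    small = make_model(64)
    # ... train `small` on 64x64 images until accuracy is decent ...

    large = make_model(256)
    # Reuse the inner layers as a head start; only the input layer is new.
    large[3].load_state_dict(small[3].state_dict())
    large[5].load_state_dict(small[5].state_dict())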

Not sure how universal this principle is, but it seemed reasonable, if I remember it correctly, of course.

The approach described in the article looks very smart. Also could be handy for integration testing of ML frameworks. I've been working on my own DL framework, and this data set looks like a good way to test the training and inference pipelines E2E.


Nitpick: mice are not "far simpler" than humans. In terms of genomic complexity, all mammals are roughly the same.


I do really like the idea of an MNIST alternative to very quickly verify ideas. However I have a few nitpicks:

1. 10 classes is way too small to make meaningful estimates as to how well a model will do on a "proper" image dataset such as COCO or ImageNet. Even a vastly more complicated dataset like CIFAR-10 does not hold up.

2. I feel like CIFAR-100 is widely used as the dataset that you envision MNIST-1D to be. Personally, I found that some training methods work very well on CIFAR-100 but not that well on ImageNet, so TinyImageNet is now my go-to "verify new ideas" dataset.


Genuine question: are there real-world image recognition tasks that require training on more than, say, 10 or even 100 classes? I'm personally aware of only one that might come close, and that's because it's an image-based species detection module whose whole purpose is to recognize a large number of very specific subgroups. But most of the others I can think of have maybe a couple dozen, and sometimes even as few as 4 or 5, classes where they're useful, and the accuracy within those classes is much more important than the sheer number of possibilities.

I guess I'm just asking if COCO or ImageNet-trained networks are actually noticeably superior for most real-world tasks, or if it's just a metric that's used because the performance differences only show up in the long tail of the distribution.


> I guess I'm just asking if COCO or ImageNet-trained networks are actually noticeably superior for most real-world tasks, or if it's just a metric that's used because the performance differences only show up in the long tail of the distribution.

Given that for any real-world vision task you start from a model pretrained on those datasets, they will in fact be noticeably superior on the real-world task after finetuning, simply because the quality of the features extracted through the backbone is better.
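
For example, with torchvision, swapping a new classification head onto an ImageNet-pretrained backbone looks roughly like this (the class count and learning rate are placeholders):

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models

    # ImageNet-pretrained backbone; only the final layer is task-specific.
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 5)  # e.g. a 5-class real-world task

    # Optionally freeze the backbone and finetune just the new head first.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("fc")

    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)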


Even if not exactly related to the approach in the article, take a look at adapterhub.ml for training NLP transformer adapters (small modules inserted into large pretrained transformers), which often achieve comparable results to large transformers trained from scratch on many tasks, while taking hours to train instead of months.
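
The adapter idea itself is just a small bottleneck with a residual connection inserted after each transformer sublayer; a conceptual sketch (this is the generic pattern, not the adapterhub API):

    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck module inserted into a frozen pretrained transformer."""
        def __init__(self, d_model=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)
            self.up = nn.Linear(bottleneck, d_model)
            self.act = nn.GELU()

        def forward(self, hidden):
            # Residual connection: the adapter only learns a small correction.
            return hidden + self.up(self.act(self.down(hidden)))

    # During training, only the adapters (and the task head) get gradient updates;
    # the big pretrained transformer stays frozen.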



I'm pretty worried about designing new datasets specifically around observing things we've already seen in real data. Sure, we can observe spatial priors and double descent with this dataset by design, but when we see a new interesting property, what's the chance that it carries over to real-world data?


Seems very interesting, but I cannot read that thin, very light gray text on white background.


As someone with a degenerative eye disease (keratoconus) I almost universally hit ctrl++ about 5 times upon opening any article on hackernews. Heck, I view the HN itself at 300% size.


Have you tried using reader view? At least Firefox also remembers the zoom level.


Hi, I have a blog (in bio).

How would I make it easier for people like you to read?

Also, my blog has kind of unique colors. Do you find it easy to read?


The blog is very readable, even without enabling reader mode. The main problem for usability is sites that mess with the zoom, so that when you enlarge the view the formatting breaks and the text becomes unreadable.


Thanks for the feedback



