The model size in deep learning is one of the most problematic aspects of the tech.
I'm surprised we haven't seen more research on modular models. Random forests and boosting both work by building a collection of weak models, and combining their results to produce final predictions. It seems like building an ensemble of some sort of unsupervised learners, then training a smaller supervised network on that ensemble could produce good results, while also being more reusable and easier to train.
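As a toy sketch of that idea (my own illustration, nothing from the article): a bank of unsupervised learners — KMeans with different cluster counts — acts as the ensemble, and a small supervised classifier is trained on their combined outputs, here with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised "ensemble": each KMeans turns a sample into distances to its centroids.
extractors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
              for k in (8, 16, 32)]

def featurize(X):
    # Concatenate the cluster-distance features from every unsupervised learner.
    return np.hstack([e.transform(X) for e in extractors])

# Small supervised model trained on top of the ensemble's features.
clf = LogisticRegression(max_iter=1000).fit(featurize(X_train), y_train)
print("test accuracy:", clf.score(featurize(X_test), y_test))
```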
This is true for classical ML, but unfortunately it does not hold very well for hard problems. The reason is that GPT-3 et al. are meant as multipurpose pretrained models: their size is their strength, in that they can model highly complex datasets.
In my opinion the right approach to downsizing is to train at full size and then either distill or prune the model to keep the parts that are relevant to your problem.
That's only for pretraining a model. Very few groups pretrain models, since it is so expensive in terms of GPU time. Finetuning a model for a specific task typically only takes a few hours. E.g. I regularly train multitask syntax models (POS tagging, lemmatization, morphological tagging, dependency relations, topological fields), which takes just a few hours on a consumer-level RTX 2060 super.
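To give a feel for how lightweight that is, here's a minimal sketch of finetuning a pretrained encoder for POS tagging as token classification — dummy data, a single task, and it assumes the Hugging Face transformers API with a fast tokenizer, not my actual multitask setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=17)  # e.g. the 17 UPOS tags
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy pre-tokenized sentences with placeholder tag ids.
sentences = [["The", "cat", "sleeps"], ["Dogs", "bark"]]
tags = [[5, 11, 16], [11, 16]]

model.train()
for tokens, labels in zip(sentences, tags):
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
    # Label only the first subword of each word; mask the rest with -100.
    word_ids = enc.word_ids()
    aligned, prev = [], None
    for w in word_ids:
        aligned.append(-100 if w is None or w == prev else labels[w])
        prev = w
    out = model(**enc, labels=torch.tensor([aligned]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```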
Unfortunately, distillation of smaller models can take a fair bit of time. However, there is a lot of recent work on making distillation more efficient, e.g. by not just training on the label distributions of a teacher model, but also by learning to emulate the teacher's attention, hidden layer outputs, etc.
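Roughly, the objective looks like this — a PyTorch sketch of the general idea, not any specific paper's recipe: a KL term on the teacher's temperature-scaled label distribution, plus a term pushing student hidden states towards the teacher's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, temperature=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_out["logits"] / temperature, dim=-1),
        F.softmax(teacher_out["logits"] / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    # Hidden-state loss: make the student emulate the teacher's hidden outputs
    # (here just the last hidden state; real setups map several layers).
    hidden_loss = F.mse_loss(student_out["hidden"], teacher_out["hidden"])
    return alpha * soft_loss + (1 - alpha) * hidden_loss

# Toy usage with random tensors standing in for real model outputs.
student = {"logits": torch.randn(4, 10, requires_grad=True), "hidden": torch.randn(4, 16)}
teacher = {"logits": torch.randn(4, 10), "hidden": torch.randn(4, 16)}
print(distillation_loss(student, teacher))
```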
Is there one model that you use more frequently than others as a base for these disparate fine tuning tasks? Basically, are there any that are particularly flexible?
In general, BERT would be the most common one. RoBERTa is the same model but trained for longer, which turns out to work better. T5 is a larger model, which works better on many tasks but is more expensive.
So far, of the models that run on GPUs with 8-16GiB VRAM, XLM-RoBERTa has been the best for these specific tasks. It worked better than the multilingual BERT model and language-specific BERT models by quite a wide margin.
Great, thanks very much for the pointer, especially the VRAM context - I'm looking to fine-tune on 2080Ti's rather than V100/A100s, so that's really good to know.
You can make a model of any size with deep learning.
If your concerns are about overfitting, there are lots of regularization techniques used in practice, like dropout, weight decay, and data augmentation.
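For illustration, all three fit in a few lines of a typical PyTorch setup (my own minimal example):

```python
import torch
import torch.nn as nn
from torchvision import transforms

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout
    nn.Linear(256, 10),
)
# Weight decay (L2 penalty) handled by the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Data augmentation applied to the training images.
augment = transforms.Compose([
    transforms.RandomCrop(28, padding=2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```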
There's nothing preventing you from sharing weights across layers, and it would be interesting to see some research about that.
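A minimal sketch of what that could look like, ALBERT-style — one vanilla PyTorch encoder layer whose parameters are reused at every depth:

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_passes=6):
        super().__init__()
        # A single layer whose weights are applied num_passes times.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        for _ in range(self.num_passes):
            x = self.layer(x)  # same parameters at every depth
        return x

encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 10, 256))  # (batch, seq_len, d_model)
print(out.shape)
```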
I have done model distillation of XLM-RoBERTa into ALBERT-based models with multiple layer groups and for the tasks that I was working on (syntax) it works really well.
E.g. we have gone from a finetuned ~1000MiB XLM-R base model to a 74MiB ALBERT-based model with barely any loss in accuracy.
I'm all in favor of scaling down deep learning. Also, I recently published a paper on DNA compression that shows that deep learning is still quite easily beaten in terms of time and compression ratio by specialized models and simple MLPs.
Very interesting work, thanks for sharing! It's great that you included a link to the project on GitHub, it's unfortunately too rare to see research papers accompanied by source code.
Yes, this is a big problem. We want to do comparisons, but much of the software and many of the datasets are difficult to even download.
For example, we are in the process of writing a paper on the compression of proteins. There are a few papers about this problem, but only one working compressor that we could find. Even more problematic is when the papers in question claim very good efficiency and we are left wondering whether we are missing something, or whether the data they used isn't the same data we have access to, etc. In that regard, GigaScience is great, because it places a lot of importance on reproducibility.
One of the few things I remember from watching the fastai course a few years ago is to train a model on smaller data. For example, instead of 2k×2k images, downscale them to, say, 400×400. A network with a better design should still learn fastest on the smaller data.
Also, one can train the network to a good accuracy, then change the input layer and unfreeze the inner layers; that way the network gets a head start.
Not sure how universal this principle is, but it seemed reasonable, if I remember it correctly, of course.
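If I remember the reasoning right, the head start works because a fully-convolutional backbone accepts either resolution, so weights learned on small images carry over directly. A rough sketch, assuming a torchvision ResNet (not the actual fastai code):

```python
import torch
from torchvision import models

model = models.resnet18(num_classes=10)

small_batch = torch.randn(8, 3, 400, 400)    # downscaled from e.g. 2000x2000
large_batch = torch.randn(1, 3, 2000, 2000)  # original resolution

# Phase 1: iterate quickly on the small inputs.
print(model(small_batch).shape)  # torch.Size([8, 10])

# Phase 2: the very same network (and its learned weights) accepts the large
# inputs, giving it a head start; typically all layers are unfrozen here.
for p in model.parameters():
    p.requires_grad = True
print(model(large_batch).shape)  # torch.Size([1, 10])
```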
The approach described in the article looks very smart. Also could be handy for integration testing of ML frameworks. I've been working on my own DL framework, and this data set looks like a good way to test the training and inference pipelines E2E.
I do really like the idea of an MNIST alternative to very quickly verify ideas. However I have a few nitpicks:
1. 10 classes is way too small to make meaningful estimates as to how well a model will do on a "proper" image dataset such as COCO or ImageNet. Even a vastly more complicated dataset like CIFAR-10 does not hold up.
2. I feel like CIFAR-100 is widely used as the dataset that you envision MNIST-1D to be. Personally, I found that some training methods work very well on CIFAR-100 but not that well on ImageNet, so TinyImageNet is now my go-to "verify new ideas" dataset.
Genuine question: are there real-world image recognition tasks that require training on more than, say, 10 or even 100 classes? I'm personally aware of only one that might come close, and that's because it's an image-based species detection module whose whole purpose is to recognize a large number of very specific subgroups. But most of the others I can think of have maybe a couple dozen, and sometimes even as few as 4 or 5 classes where they're useful, and the accuracy within those cases is much more important than the sheer number of possibilities.
I guess I'm just asking if COCO or ImageNet-trained networks are actually noticeably superior for most real-world tasks, or if it's just a metric that's used because the performance differences only show up in the long tail of the distribution.
> I guess I'm just asking if COCO or ImageNet-trained networks are actually noticeably superior for most real-world tasks, or if it's just a metric that's used because the performance differences only show up in the long tail of the distribution.
Given that for any real-world vision task you start from a model pretrained on those datasets, they will in fact be noticeably superior on the real-world task after finetuning, simply because the quality of the features extracted through the backbone is better.
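Concretely, the usual recipe looks something like this — a sketch assuming a recent torchvision (with the weights enum); the 5-class head is just an example:

```python
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# Replace the 1000-class ImageNet head with one for the real-world label set.
model.fc = nn.Linear(model.fc.in_features, 5)

# Optionally freeze the pretrained backbone and train only the new head first,
# then unfreeze everything for a final finetuning pass.
for name, p in model.named_parameters():
    if not name.startswith("fc."):
        p.requires_grad = False
```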
Even if not exactly related to the approach in the article, take a look at adapterhub.ml for training NLP transformer adapters (small modules inserted into large pre-trained transformers), that often achieve comparable results to large transformers trained from scratch on many tasks while taking hours to train instead of months.
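The core idea is a tiny bottleneck module with a residual connection, dropped into an otherwise frozen pretrained transformer; a rough PyTorch sketch of that building block (not the adapterhub.ml API itself):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        # Only these few parameters are trained; the host transformer stays frozen.
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter()
print(adapter(torch.randn(2, 16, 768)).shape)  # (2, 16, 768)
```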
I'm pretty worried about designing new datasets specifically around observing things we've already seen in real data. Sure, we can observe spatial priors and double descent with this dataset by design, but when we see a new interesting property, what's the chance that it carries over to real-world data?
As someone with a degenerative eye disease (keratoconus) I almost universally hit ctrl++ about 5 times upon opening any article on hackernews. Heck, I view the HN itself at 300% size.
The blog site is very readable, even without enabling reader mode. The main usability problem is sites that mess with the zoom, so that when you enlarge the view the formatting breaks and the text becomes unreadable.