Small Language Models Are Also Few-Shot Learners (aclanthology.org)
87 points by YeGoblynQueenne on Oct 14, 2021 | hide | past | favorite | 26 comments


I've had this hypothesis for years that a good AI model would be something in between a neural net and probabilistic grammar.

In my mind it would involve some kind of generative grammar that would generate nodes (select from a pool), then these nodes could be trained. I'm thinking about something like a grammar for Bayesian networks or, more broadly, a generative program induction scheme where you'd specify a programming language grammar, generate programs that fit the data and tune the parameters.

I attempted and failed to implement some of these ideas (described here: https://www.quora.com/What-deep-learning-ideas-have-you-trie...). The search space for generative programs is just so huge and irregular, and I don't know how to keep things differentiable. In the paper, they mention gradient-based optimization, so maybe they figured out part of it.
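To make the idea concrete, here's a toy sketch of what such a scheme could look like: sample small expression trees from a grammar, then tune each tree's numeric constants against data. Everything here is invented for illustration (the grammar, function names, and the random-restart tuning), and random search sidesteps rather than solves the differentiability problem mentioned above:

```python
import random

def sample_program(depth, rng):
    """Sample an expression tree: 'x' (input), 'c' (tunable constant),
    or a binary ('add' | 'mul') node."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", "c"])
    op = rng.choice(["add", "mul"])
    return (op, sample_program(depth - 1, rng), sample_program(depth - 1, rng))

def count_consts(prog):
    """Number of tunable constants in the tree."""
    if isinstance(prog, tuple):
        return count_consts(prog[1]) + count_consts(prog[2])
    return 1 if prog == "c" else 0

def evaluate(prog, x, consts):
    """Evaluate a tree at input x, consuming constants from an iterator."""
    if prog == "x":
        return x
    if prog == "c":
        return next(consts)
    op, a, b = prog
    va, vb = evaluate(a, x, consts), evaluate(b, x, consts)
    return va + vb if op == "add" else va * vb

def fit(prog, xs, ys, rng, trials=200):
    """Tune constants by random restarts; return (squared error, params)."""
    n = count_consts(prog)
    best_err, best_p = float("inf"), []
    for _ in range(trials):
        p = [rng.uniform(-5, 5) for _ in range(n)]
        err = sum((evaluate(prog, x, iter(p)) - y) ** 2 for x, y in zip(xs, ys))
        if err < best_err:
            best_err, best_p = err, p
    return best_err, best_p

def induce(xs, ys, seed=0, candidates=300):
    """Generate candidate programs and keep the best-fitting one."""
    rng = random.Random(seed)
    best = (float("inf"), None, [])
    for _ in range(candidates):
        prog = sample_program(3, rng)
        err, p = fit(prog, xs, ys, rng)
        if err < best[0]:
            best = (err, prog, p)
    return best
```

The generate-then-tune split matches the two phases described above (grammar proposes structure, optimization fills in parameters), but a real system would need a far smarter search than this.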


While a code-data distinction is often illusory in practice, figuring out where this separation can be useful runs throughout the information theory underlying learning systems and programming languages. My question about this proposed model is how it handles the tradeoff between generality and efficiency that most learning systems are up against better than "mere" function approximation (e.g. ANNs). Concretely, treating parts of the solution space of some problem as a grammar specification seems like a way of simplifying the representation domain, so that the differentiable bits of a given model have fewer ultimately meaningless degrees of freedom to search over, but at some point you have to choose a grammar, and how would this work in the proposed system?


"Large carbon footprint"? Really? How do they know how the electricity is generated that is used by the OpenAI datacenter? Maybe it's all solar and wind.

Why can't they just talk about the reduction of energy or compute resources instead?


This is somewhat well covered in prior research, in so far as the information required is available. Here are some recent evaluations of the carbon footprint of modern models:

https://arxiv.org/abs/1906.02243 (cited in the above paper)

> The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

> CO2e = 0.954 p_t

> This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States.

(Other countries are also included in the paper.)
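For what it's worth, the quoted conversion is trivial to apply yourself. A quick sketch (the wattage and runtime below are made-up illustrative numbers, not figures from either paper):

```python
# EPA average for U.S. grid power, per the quoted paper:
# pounds of CO2 emitted per kilowatt-hour consumed.
EPA_LB_CO2_PER_KWH = 0.954

def estimated_co2_pounds(avg_power_watts, hours):
    """Estimated CO2 (lb) for a given average power draw and duration."""
    kwh = avg_power_watts / 1000.0 * hours
    return EPA_LB_CO2_PER_KWH * kwh

# Made-up example: eight 300 W GPUs training for 24 hours.
print(estimated_co2_pounds(8 * 300, 24))  # ~55 lb of CO2
```

Note the paper's fuller estimate also folds in datacenter overhead (PUE), so this is a lower bound on the grid-average case.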

https://arxiv.org/abs/2104.10350

The authors note that in many cases accurately estimating the carbon footprint is difficult because the information required is not publicly or readily available. However, they do provide some additional data, improved calculations, as well as motivations beyond CO2 reduction.


The abstract of the first paper says "Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X." If your aim is to optimize CO2 emissions, you have a lot of variables at your disposal, and the architecture of the network is just one of them.

If the paper from the post indeed tried to evaluate and compare various kinds of optimization, then citing the CO2 emissions would be valid. But since it only discusses the improvements in the model itself, then it would be much more productive to just point to the reductions in the required memory, GPU/TPU-time etc.

The readers can do the carbon math themselves, depending on how carbon-neutral their datacenter is.


There are also some pretty cool open source projects dedicated to tracking this kind of thing:

https://www.comet.ml/site/introducing-codecarbon-an-open-sou...


Can't you read the parent comment?

Because of the opaqueness of the electrical infrastructure, these carbon footprint measurements can't be precise; the data is simply not there.

Therefore, a better measurement is just the electricity usage...


The papers I provided go a bit beyond "electrical usage". One of them is cited by the paper in question.

Yes, they are approximations in lieu of more accurate data but that doesn't invalidate them as a tool or a motivation for future work such as this one.

Consider further that it's not just OpenAI's models in question: it's every practitioner who attempts to train similarly large models. These practitioners may not be using "green" data centers, even if we generously assume that OpenAI does. (Microsoft's 100% renewable target for data centers isn't until 2025. Read another way: they may be trying but they're not there yet.)

The available data and approximations illustrate that it is not accurate to assume that the average data center is powered by 100% renewables and carbon neutral. Thus the only reasonable conclusion is that more efficient models will have a positive impact on CO2 emissions, which is the motivation of the paper.

Even if you don't agree, it's not completely unfounded and is based on at least some research and data. At the end of the day, is this really worth fighting against? Who wants less energy efficient models?


[flagged]


I have provided recent and relevant citations with data and detailed comments. You have provided? What? Attacks? I'm honestly not even sure. Given your comment history I think I'm done here.


Hmm, I meant that we are just talking about the same thing...


The margin for error makes the studies that try to aggregate these things almost meaningless.

For example, Google CoLab is running in Google data centers, which are carbon neutral. Which is more or less true, but also depends on the exact nature of the source of their power and accuracy of the carbon offsets they purchase, and we all know Google is perfectly accurate and truthful all the time. /s lol

I trust Google insofar as their publicized usage of renewables and extensive use of solar goes. I also suspect that they're on-grid and a net producer of energy, taking advantage of deals with power companies and governments to eke out every last penny of value from their infrastructure.

The problem is that without exact reporting of numbers, the margin of error for that source alone creates huge uncertainty in trying to assess the net carbon footprint of their service. How much research is being done using Google infrastructure? How much is being done on college campus data centers that run their HPC clusters on solar and wind? How much money is spent on offsets by those other sources? Again, the reality requires exact knowledge, since the usage of offsets introduces huge uncertainty, so aggregate reported usage could vary in accuracy as a proxy by more than 100% of the naively assumed footprint.

The studies are only as good as their data, and the data isn't very good unless it's obtained through legal mandate, via subpoena or regulated reporting. To my knowledge, very little of the data available for these estimates is anything except self reported numbers. The math and analyses they do are great, but the margin of error likely exceeds 75%.


Eh, who cares? Maybe the energy usage from machine learning overall is negligible - I haven't looked into it. But still, what's wrong with showing that a result also has a reduced carbon footprint compared to modern methods?


The easy answer is that reduced energy usage for an individual operation does not mean reduced energy usage overall, all other things being equal (see Jevons paradox).

In practice, algorithms don't work in isolation. You may have to do additional pre-processing to get your data into this kind of model, offsetting any energy savings from the model's execution. Any number of small things go unaccounted for when you look at just this portion of an overall solution: you can't make statements about carbon usage from energy usage alone, nor about energy usage from computation requirements alone.

The actual energy usage was not independently measured, and it is not separable from the work itself. Adding it as a benefit in your paper without the additional work of proving it is marketing fluff. It doesn't belong, and it actively hurts studies that are focused on energy usage reduction with the explicit intent of reducing carbon footprints.


Because this is supposed to be a scientific paper, it should talk about quantifiable effects, not speculation. Especially in this case, where the high energy and monetary costs are easy to estimate: OpenAI spent $3 million just on the compute for training, and even inference requires multiple TPUs, each drawing hundreds of watts.


I agree it's silly. The fact that 99.9% of researchers don't have the resources to use these large models is a minor detail, but a chance to shoehorn in fighting climate change is an unmissable opportunity, apparently.


OpenAI is pretty open to researchers. They can apply for access to the API.

What many researchers can't do is replicate the training process.


Whenever somebody starts roasting current ML techniques about their carbon footprint, I take that as "I've got no better argument to present".

Sure, a reduction in computational expenses is in many ways desirable, but I don't think the carbon footprint of a model is a very good metric. There are much better arguments for more efficient models.

I guess you have to do what you have to do for those grants.


It's still millions of parameters. What kind of hardware is needed to train a 10 million parameter model?


A single consumer GPU for a few hours.


Yes, verbatim from the paper: "Moreover, training with PET can be performed in several hours on a single GPU without requiring expensive hyperparameter optimization."


Nice. Still reading it.


Oh! Nice! That's much more accessible than I was expecting.


Like, could I train it in a week with a few GPUs? Or would this still require a cluster to train in a reasonable amount of time?


I clicked on the link thinking I was going to be reading about Forth and other "simple" programming languages.


Since when are programming languages called language models?


I believe he was thinking of programming language paradigms -- e.g. procedural, stack-oriented, vector-oriented, functional, etc.

Though "learners" should have given it away as well



