Deep Neural Nets: 33 years ago and 33 years from now (2022) (karpathy.github.io)
286 points by gsky on Aug 26, 2023 | 93 comments



Something else I find exciting, starting with one of the reflections-

The original training took 3 days on a Sun 4/260 workstation; I can't find specifics, but I believe that era of early SPARC workstations would likely pull about 200 watts in total (the CPU wasn't super high powered, but the whole system, running with the disks, the monitor, etc., would pull about that).

So 200 watts * 72 hours = 14400 watt-hours of energy.

Karpathy trained the equivalent on a MacBook, not even fully utilized, in 90 seconds. Likely something around 20 watts * 0.025 hours = 0.5 watt-hours.

An energy efficiency improvement of nearly 30000x.
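
Spelling that arithmetic out as a quick sanity check (the 200 W / 72 h and 20 W / 90 s figures are of course rough guesses):

  sun_wh = 200 * 72          # Sun 4/260: ~200 W for 3 days = 14,400 Wh
  mac_wh = 20 * (90 / 3600)  # MacBook: ~20 W for 90 seconds = 0.5 Wh
  print(sun_wh, mac_wh, sun_wh / mac_wh)  # 14400 0.5 28800.0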


This is very interesting, because I've always thought that all NN performance should be measured in a unit with energy in the denominator.


It totally depends on what you want to use a measure for. Just like neither height nor volume alone will tell you what will fit in your car.

By any measure that puts energy used by the brain in the denominator, humans are probably dumber than ants. But that doesn't mean those measures are always accurate.

(For contemporary neural networks, you also have to distinguish training costs from inference costs.)


To add more context, humans are ~100W biological machines. The brain is ~20% of that power, about 20W.

The greatest form of general intelligence at 20W.

A MacBook Air is ~30W.

https://www.jackery.com/blogs/knowledge/how-many-watts-a-lap...


You're leaving out the training requirements


That’s not entirely fair. That’s the runtime cost, but it ignores the cost-to-information ratio and the cost of drawing from that pool.

Given a laptop is at 30W how much can a laptop do disconnected from the internet? Now how much can it do with the internet? Now how much information does the internet cost in terms of wattage? Now what’s the ratio?


What can a human do disconnected from society?


It is “the greatest” because we only appreciate intelligence that we ourselves understand. A 0.0001W calculator calculates arithmetic faster than any human brain.


I dispute that, if the metric is a chess game between an ant and a human


For inference that could be useful, but the energy is not for the model alone; it is for at least the tuple of: model, model architecture and compilation, and hardware chosen.


30k doesn't even sound like that much to me given Moore's law. I'd expect more improvement since 1989. Supercomputer performance has increased more than a millionfold since then.


My (wrong) intuition on reading your comment was that you were over-estimating the expected growth in performance over that time period. But after checking the maths based on Moore's Law, i.e. doubling every two years (though of course I understand that was a rough estimate, more of a concept prediction than something expected to be precise), you're right, so I'll share the maths for anyone else whose intuition might be as poor as mine:

Doubling every 2 years = compound annual growth rate (CAGR) of ~41.42%

  CAGR = ((End Value / Start Value)^(1 / Number of Years)) - 1

  ((2 / 1)^(1 / 2)) - 1 = 0.41421356237
Therefore in 34 years since then:

  1 * (1 + 0.41421356237)^34 = ~131,072
So x30k is ~4.4x less than 131k. Then again, that's equivalent to ~x1.833 every two years, compared to Moore's Law of x2 every two years, so only ~8% less growth per two years. Coming back to the fact that Moore's Law is a rough concept rather than an exact fact, that doesn't seem too far off!
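
The same maths as a quick script, just restating the numbers above:

  cagr = 2 ** (1 / 2) - 1              # doubling every 2 years ≈ 41.42%/year
  moore_34y = (1 + cagr) ** 34         # = 2**17 = 131,072
  actual_per_2y = 30_000 ** (2 / 34)   # ≈ 1.83x every two years
  print(round(cagr, 4), round(moore_34y), round(actual_per_2y, 2))
  # 0.4142 131072 1.83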


The rest of that difference can easily be explained by the difference in the class of hardware used. A desktop made today vs a laptop is roughly that factor 4. Not sure if back then there would have been laptops that you could have done this on for a more apples-to-apples comparison.

Modern laptops give great efficiency; when I went for solar power here, the first thing to go was the desktop computer. I still have it, but it hasn't run in over a year, and the elderly ThinkPad that is now my daily driver uses far less power and still has enough compute to serve my modest needs. But if I were to dive into something requiring much more compute I'd have to start the desktop again. Unfortunately, power management is not yet such that computers can really throttle down to 'miser mode' when you don't need the performance; it's a good step, but not as good as the jump from desktop to laptop.


Also the 'memory wall': remember that memory bandwidth did not grow at the pace of Moore's law. Sure, there are ways to mitigate it, but those eat into the chip budget and show up when real-world performance is measured.


Yes, true, and in a way that wall is still there. GPUs are limited in how much RAM they have because there is a way to sell you that memory at a multiple of the cost.

Imagine a GPU with a 128G or even 256G slot-based memory section that is sold unpopulated. 8 SODIMM slots or so.


Hadn't thought of that, good point


Imagine that we are discussing "proving" a law with historical data from our POV, but at the time it must've seemed like a theory at best, or comical at worst.

8% less growth is not the point. The "law" has stood the test of time, which says something about the guy and his vision.


It isn't much, but as the link says, the neural network they were reimplementing is too small to take advantage of modern hardware.


Amdahl's law


33 years ago is 2000/1999


Um... you might want to check your tens digit.


Oops


quickest maffs


> watt-hours

You mean joules (up to a constant factor)?


A watt-hour is 3600 joules, but watt-hours or kilowatt-hours are commonly used because they're easier to calculate with.
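
So for the figures upthread, if you prefer SI units:

  J_PER_WH = 3600
  print(14400 * J_PER_WH / 1e6)  # ~51.84 MJ for the 1989 training run
  print(0.5 * J_PER_WH)          # 1800 J for the MacBook run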


I really enjoyed this article. My only critique is that the 2055 predictions are "meta-linear". In other words: the author avoids the (probable) mistake of taking our current tech and linearly regressing the numbers 33 years forward, but the predictions still suggest a kind of "worldline symmetry" with the present date at the origin.

It's quite possible that none of these predictions will come true simply because the timeframe is large enough for many unanticipated breakthroughs and roadblocks.

Maybe someone will figure out a much, much simpler foundational architecture than "perceptrons++", maybe we'll all be training clouds of 3D gaussians, maybe quantum computers will finally take off and we don't even have the nouns for the building blocks we'll use.

On the negative side perhaps we hit a hard scaling limit (in hardware or training) that we didn't see coming. Or a civilizational setback.

All that said, though, if I were a betting man I wouldn't exactly wager against the article's conclusions; they're probably the best we can extrapolate knowing only the past and present state of affairs.


I think you are right, the next 33 years are likely to be very different.

I would lean to them being even more dramatic, due to the opportunity to advance algorithms, not just resources.

On the more obvious side, most libraries are not yet taking full advantage of many known gradient optimization techniques. It’s been so much easier to just add data & processing that there is an overhang of tools still to apply.

And large successful models are telling us important things.

For instance, it is clear that language models are learning a kind of logic of language similar to how we process thoughts, allowing highly disparate types of information to be woven together sensibly.

At some point, identifying the nature of that processing could radically simplify language processing.

That is just one opportunity for radical architecture and algorithm advances, and it would be revolutionary.


So should we spend the next 33 years doing the same things, just with more data and more compute power? That would be the logical conclusion of the breathless "I can't believe it is finally happening in my lifetime" and "we just need bigger models and more data" enthusiasm for LLMs when they first appeared. But can we really simply brute force our way to AGI?

Remember, 33 years ago "connectionist AI" wasn't the dominant AI paradigm, and "symbolic AI" wasn't the only other approach either - there were others, like "robotic functionalism" (the idea that you couldn't have true intelligence without interacting with the physical world). Maybe in 33 years some of these other approaches will have a resurgence, perhaps in combination with connectionist approaches. Or maybe there'll even be some entirely new approach.


Great article. I lived through the early days of artificial neural networks. I was on a DARPA advisory panel for neural network tooling in the mid 1980s, wrote the first version of the SAIC ANSim commercial product, and created the simple back-prop model that was deployed in the bomb detector my company built under contract to the FAA. I also managed a ‘conventional’ deep learning team at Capital One 5-6 years ago.

My world has been very exciting in the last 18 months. I spend as much time as I can exploring self hosted LLMs, APIs from Hugging Face, OpenAI, etc.

My mind is blown even thinking about tech 33 years from now!


The most fundamental change is the difference in what models are being trained on.

Little images of characters are a trivial kind of problem, very different from training on the linguistic and visual communication of essentially the whole human race.

Another 33 years of expanded computing resources won’t be training models to mimic the behavior and knowledge of humanity.

That problem (us!) will have been reduced to a toy problem long before then.


I think AI models will evolve by generating synthetic data, filtering and improving it, and then retraining. Possibly with external systems in the loop - code execution, search, human, simulation or robot. Quality won't degrade because there will be a lot of effort put into data filtering and diversity. We can always improve on a model by giving it more time.

Model architecture doesn't matter compared to the dataset. Any model from a class can learn the same skills from the same data, but change the data and they all change their abilities - the intelligence is in the data.

The future is data engineering, not model architecting. Human culture, by analogy, evolves faster than human biology. The data is evolving faster than the model. And in recent years we have seen a drastic reduction in novel architectures in AI: diverse datasets applied to the same transformer models. Even among the transformers, very few variants are widely used, and thousands of them have been abandoned.

I like to think of it as language evolution by memetics being the real engine behind intelligence. We and AI are riding the language exponential together.


> Model architecture doesn't matter compared to the dataset. Any model from a class can learn the same skills from the same data, but change the data and they all change their abilities - the intelligence is in the data.

You might be right in the same sense that big-O notation is 'right'. Constant factors can matter, especially once you have to take energy use into account.


Come close to solving the toy problem of autonomous driving first, we're still waiting.


I don’t know. I find pessimistic views, like you are expressing, very strange.

My Tesla drives and navigates itself most of the time. 90-95% at least, just not 100%.

As opposed to cars 10 or more years ago, which didn't do any of that.

To me it is much like the “God of the Gaps” when tremendous progress on a big problem is dismissed negatively, due to the (continuously shrinking) gaps of what it can’t do.


We are already five years late in autonomous vehicles replacing all truck drivers. We will see if we even have autonomous driving "long before" 33 years have passed. "AGI" (rebranded AI after "AI" failed to deliver) will of course still not be a thing.


I have no idea what “late” technology means.

And not delivering AGI yet is a problem?

What are these broad technology schedule based criticisms founded on?

I really want to understand this viewpoint!

Hopefully not the over-optimism of anyone who uses optimistic timelines as a motivational force. That’s not real data. Or a suitable benchmark for human progress.


Waymo and Cruise are operating without drivers, which is much more impressive, even if in limited areas.


I don’t know if it is much more impressive than Tesla, given Tesla’s “limited areas” don’t seem very limiting in my experience.

But I think all these complementary takes on the problem, with significant year-to-year progress by all three firms, are fantastic.

That used to be considered a fast learning curve!


It's not clear that compute will scale as it did for the next 33 years. But it doesn't really need to.

I read the article and I was thinking "my God, I remember I used MSE that weekend in my pet ML project and it really didn't work out that well; wrong loss function." Our current crop of LLMs, or the one next year, will be perfectly able to tell me how I can improve my code and graphs, which means that I can deploy some expert-level techniques that otherwise would be "locked" to me by 50000 hours of "mastery acquisition".
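
(For anyone curious what I mean by "wrong loss function": a minimal sketch in PyTorch with a made-up 10-class toy classifier, comparing MSE on one-hot targets with the usual cross-entropy on logits.)

  import torch, torch.nn as nn

  logits = torch.randn(32, 10)           # batch of 32, 10 classes
  targets = torch.randint(0, 10, (32,))  # integer class labels

  # what I was doing: MSE against one-hot targets
  mse = nn.MSELoss()(torch.softmax(logits, dim=1),
                     nn.functional.one_hot(targets, 10).float())

  # what works much better for classification: cross-entropy on raw logits
  ce = nn.CrossEntropyLoss()(logits, targets)
  print(mse.item(), ce.item())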

A part of me is telling me that we humans are doomed, and that in 33 years we would have created a world in which we humans are irrelevant. But another part tells me that if we avoid that fate and all the other dooms, the future might just be quite bright.


> or the one next year

We have heard, and will continue to hear, this sort of thing rather a lot. The last 5 yards are the hardest, but without them the previous 5 miles are of limited utility.


I think there’s going to be a point where we need to slow AI way, way down in order to avoid bad outcomes. I’m with Zvi Mowshowitz here: we should encourage progress and risk taking in every area except those where there are extinction risks. Applying today’s LLMs to all sorts of problems won’t end us. But I think we may only be a few years away from AGI that is conscious and can plan, and we don’t know the upper limit of how smart we’ll be able to make them.

And I think that we have a responsibility to any intelligent being we bring into the world. Some lament that there’s no test to become a parent; what about creating a million copies of a new virtual brain from scratch? And basically so they can be born into lifelong servitude.


This was really good. The only thing I didn't see explicitly discussed, although I guess it's obvious, is that what's different 33 years later is the inputs the models operate on. The '89 SOTA model used 16x16 greyscale images; today we have single-digit-megapixel color images. In 30 years a desktop will be able to train CLIP in 90 seconds, but what will the SOTA models be trained on?


Human behaviour, in a way far more general than which token we might next type. To mimic humans as closely as might be possible with the basic deep learning method, train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video and audio, and probably many other inputs, from many different people engaged in the full variety of human activity.


Humans have a brain that physically changes especially in childhood, so that is potentially a massive advantage.


Why? An adult by 25 only has ~146k hours of video experience “training,” most of it repeated, derivative, and unproductive. And their encoded genes can be observed in their genome, so they don’t need to be retrained by millions of years of evolution.
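
Rough arithmetic behind that 146k figure, assuming ~16 waking hours a day:

  print(25 * 365 * 16)  # = 146,000 waking hours by age 25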


Much of that time also includes physical interaction with the world, which makes it far more valuable because it can improve performance in a focused way.


How better to learn to do menial physical tasks like house cleaning and produce picking?


Neural nets seem to learn much slower than humans. Even GPT-2 has seen orders of magnitude more tokens of language than a human experiences in a lifetime. At least as far as language is concerned, humans are able to extract a lot more information from their training data.


Humans are also extensively pretrained by billions of years of evolution, so by starting from scratch GPT is admittedly disadvantaged from the get-go.


But the human genome is only 3 gigabytes and the vast majority of that is unlikely to be encoding brain structure.


"Only" 3 gigabytes.

The lambda calculus (a system we know is capable of infinite self-complexity, learning, etc. with the right program) can be described in a few hundred bits. And a neural net can be described in the lambda calculus in perhaps a few thousand bits.

Also, we have no idea how "compressed" the genome is.


Basic structure can be encoded, yes (and obviously is, given brains have consistent structure), but the weights or parameters, presuming that brains learn via synaptic weights, obviously do not fit in the genome.

Compression still must obey information theory.


True, but we don’t know how to recreate that kind of pretraining.


One difference is that humans are actively involved in data collection, so when there is a gap in their knowledge, they don’t just wait for the information to show up, they ask a question, etc.


Humans do not train on video; the idea of a video, or even a frame, is a high-level abstraction within the human brain.


We might have megapixel images that we can easily get with phone cameras, but virtually all vision models in common use take 224x224 resolution images as input, or maybe 384x384. Anything higher resolution than that just gets resampled down. It seems that you are better off using your compute budget on a bigger “brain” than on better “eyes” for now.
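
For concreteness, the usual ImageNet-style preprocessing looks roughly like this (a sketch with torchvision; exact sizes and stats vary by model):

  from torchvision import transforms

  preprocess = transforms.Compose([
      transforms.Resize(256),        # shrink the megapixel photo
      transforms.CenterCrop(224),    # the model only ever sees 224x224
      transforms.ToTensor(),
      transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225]),
  ])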


I don't think that's current. Certainly the object detection models work on bigger images, and the datasets they're pretrained on, e.g. COCO, are not 224x224. I think standard models pretrained on ImageNet, like the ResNets, usually have everything resized to 224x224, and so they favor this kind of scaling.


Millions of hours of data captured by headsets like the vision pro?

Not sure all the things it captures, but a model could be trained on the combination of audio/video/spatial/iris/what have you...


It’s interesting that in that time we almost completely lost interest in neural networks and then came back around to them.


I had to retake my AI class at university several times because I just didn’t agree with the “AI is symbolic search” aspect.

Now though, I’m sure people are taking LLMs and putting them together to do forward and backward chaining.


In this case there are good reasons for the resurgence, but that's really the case with pretty much anything software-related. Except the fashion cycles tend to be shorter with more mainstream technologies.


Thank Hinton for that. It's a pity we don't have a Nobel for software.

But a Turing award is pretty neat as well.


It's crazy how little has changed and how much has changed. I remember what a revelation "the unreasonable effectiveness of RNNs" was when I read it, and it feels like we live in a different world.


I think we could collectively have a more constructive and sober conversation if we kept that 2015 bit of work as a sort of baseline.

The new stuff is better, by a lot, and with more implications to come.

But those of us paying attention then had a frame of reference where “so much better it’s crazy” still stops short of “it’s out of control”.

It’s a lot better.


It's always refreshing to read Andrej Karpathy: the more he knows, the more he explores the fundamentals of the science of ML in a direct and simple way. The field is full of papers that, for a very hard-to-reproduce gain from some new convoluted architecture (in the hope of beating some state-of-the-art result), will happily fill 50 useless pages trying to make their work look "serious".


>> The original network trained for 3 days on a SUN-4/260 workstation.

This is exactly why I didn't start experimenting with this stuff back then. I read some articles and had the interest, but having no access to existing training data or "fast" computers was really a show stopper. This article really convinced me that the amazing results today are mostly due to hardware advances.

I will add my own view that 1) hardware will not advance anywhere near as much in the future, and 2) training and inference have to be done together, like real brains do. Then the AI will learn from experience while deployed, and you can clone the best ones later.


>This article really convinced me that the amazing results today are mostly due to hardware advances.

For LLMs that is true. But many other things like Whisper, Stable Diffusion etc. could in theory have been made a decade earlier.


Possible future development I'm excited about is something like GPT but for real world interaction - i.e. robots that take input from sensors and are able to physically navigate and manipulate the world.

Fine-tuning would then be used for specific environments (human hair, an apartment) and robots (robotic barber, cleaning robot).


Google put an LLM into a robot a few months ago. This is the second paper I've seen on it.

https://arxiv.org/abs/2306.08647


AI layperson here. Is there something like an MRI scan for a neural network?

Imagining I could take a foundation model, run it on my specialized task, and measure which regions of the neural network light up.

Then I could carve out unused regions of the network to create a more lightweight model.

Or is this a silly idea?
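
To make the question concrete, something like this is what I'm imagining: record how much each layer "lights up" on my task, then prune the quiet parts. A rough PyTorch sketch with a toy stand-in model and batch (no idea if this is how practitioners actually do it; the "carving out" part, i.e. structured pruning, is presumably the hard bit):

  import torch, torch.nn as nn

  model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 10))   # stand-in for a foundation model
  batch = torch.randn(8, 16)                 # stand-in for my task's inputs

  activity = {}
  def make_hook(name):
      def hook(module, inputs, output):
          if torch.is_tensor(output):
              # mean absolute activation = how much this layer "lights up"
              activity[name] = output.detach().abs().mean().item()
      return hook

  handles = [m.register_forward_hook(make_hook(n))
             for n, m in model.named_modules() if not list(m.children())]

  with torch.no_grad():
      model(batch)                            # run the specialized task
  for h in handles:
      h.remove()

  # least active layers/regions would be candidates for pruning
  print(sorted(activity.items(), key=lambda kv: kv[1]))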


Maybe a better idea of the future is to look at what LeCun is working on now as a future program. He wants to change quite a lot in order to move towards more animal-like cognitive abilities.

LeCun is not even really interested in supervised learning anymore, for example.

https://youtu.be/vyqXLJsmsrk?si=8n0ylC6qdLX06CmY

Note that the talk is not really primarily about ChatGPT even though that's in the title. The new ideas come a little way in. The beginning of the talk is just him explaining how unimpressed he is with LLMs. Which I think is a misjudgement, but that doesn't mean his plan doesn't have merit.


The odds of the next material advance in ML/AI coming from one of its pioneers are zero. Not to say we shouldn't listen to what LeCun has to say (the other ones have basically lost it) but focusing on him is a bad way of imagining the future.


I don't understand why Yann is so focused on his "animals are smarter than AI" analogy. If compute wasn't so limited, couldn't we just train a transformer on video, audio and text data? I don't see why it would not learn the basic physical structure of our world, just like a language transformer learns the grammatical and other structures of language. Then with this pretrained transformer you can build an agent and use some reinforcement learning too. I feel very confident that this would mirror the level of intelligence of non-human animals quite well.


Animals are indeed smarter than AI for certain aspects of intelligence (physical intelligence).

Evolution has optimized animal brains and bodies to survive and take care of the next generation. They have a good grasp of their environment: where they are, where food is, where predators are, basic communication if they live as a group. Babies grow up and start learning.

Our current AI is extremely power hungry compared to a brain. Cruise & Waymo put large power hungry supercomputers in cars. The computing system costs 100k+. They still make silly mistakes like crashing into fire trucks, blocking roads, driving into wet cement etc.

ChatGPT and friends make silly mistakes for trivial math problems that require a few hierarchical planning steps.

All in all, brains have some form of symbolic computation and reasoning that we haven’t been able to replicate with current AI algorithms.

I’m not saying we’ll never be able to but current AI is really hyped. Kinda like crypto boom of 2019.

There are some really hard algorithmic problems to be solved.

Google, Microsoft, Meta could have 1000x more computing power and data; however, in the grand space of all algorithms there exists a learning algorithm that is probably >10000X more efficient at generalized modeling and reasoning than what we have.

The fact that we (20W, generally intelligent biological computers) exist validates the hypothesis that there is a lot of advancement we can still make on the algorithm side.


There have been many attempts to do multi-modal pretraining; the difficulty is finding the right combination of data for it to be “useful” and “scalable”. It’s not trivial to just train a transformer on video, text, audio, etc., mainly due to the O(N^2) attention cost in token count, the time component of video, and so on.


I don't really see it. O(N^2) in context window length is not an issue, as you don't need particularly longer context windows than for text. You don't have to run this at 30 FPS; 12 would already be enough to understand what's going on. The dimensionality is much higher of course, but how many dimensions does the latent space of TikTok videos really have? Train an autoencoder and take only 1000 dimensions for the frames. Of course it's not that simple, but my mental models at the moment make me feel like this should work. What do I mean by work? A grainy, low-FPS video+subtitles transformer that looks terrible but, as a plus, actually has some decent physical consistency. I guess that is Phenaki etc., but I'd hope for much better understanding of the "grammar" of the physical world. That should be rather orthogonal to FPS or resolution (or the number of dimensions you take).
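
Back-of-the-envelope on the context window point, assuming one latent "frame token" per frame (all numbers made up for illustration):

  fps = 12                     # grainy but enough to follow what's happening
  tokens_per_frame = 1         # one ~1000-dim latent per frame
  context = 32_768             # a typical-ish context window
  minutes = context / (fps * tokens_per_frame) / 60
  print(minutes)               # ≈ 45 minutes of video per context window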


I know intuition is often wrong, but to me a mix of dedicated visual-processing AI, language AI and sound AI all somehow interacting with each other would be a good way to make an "animal"-like AI, rather than throwing loads of attention heads at everything all at once.


I'm just thinking of Google's Gato; it seemed to have absolutely no trouble integrating a large number of modalities.


Related:

Deep Neural Nets: 33 years ago and 33 years from now - https://news.ycombinator.com/item?id=30673821 - March 2022 (5 comments)


> Our datasets and models today [2055] look like a joke. Both are somewhere around 10,000,000X larger.

Will there really be 10 million times 400 million images floating around then?


I think you are limiting yourself by thinking of the dataset of the future as just being more and bigger images.

Perhaps it will be trained on whole videos, or a combination of different inputs from agents that move about in the real world / or a video game.


Maybe the real game changer in the future will be the ability to train the same model on very different kinds of inputs like video, images, text, audio... Imagine also that all these data cleaning tasks are already automated: you just feed the model PDFs, and a support model automatically extracts all the relevant metadata... or probably you'll just be able to select a set of books from an online library and your model will train on them as well (of course for a non-trivial subscription, lol).


10e6*400e6/8e9/365/18 = 76 images per person per waking hour; it's not implausible given how many cameras there are and how many moments people might snap to share with remote friends — I can easily believe we'll have always-on video chat with multiple people in AR glasses by that point.
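
Or spelled out (the one-year collection window at ~18 waking hours/day is my reading of the division):

  images = 10e6 * 400e6            # 10,000,000x the ~400M-image dataset
  per_person = images / 8e9        # = 500,000 images per person
  per_waking_hour = per_person / (365 * 18)
  print(per_person, per_waking_hour)   # 500000.0 ~76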


Most images are not shared though; just snapped. In the past you had photo albums no one ever looked in. And those weren't that many pics; now people (old and young) take 100s of pictures whenever, on iPhones often by holding the button so it snaps 100s of them in a few seconds.


> Most images are not shared though

Not yet.

As the joke goes:

People in the 60s:

I better not say that or the government will wiretap my house

People today:

Hey wiretap, do you have a recipe for pancakes?


Maybe you won't receive your "world coin" universal income dividend unless you livestream 24/7.


Maybe, but the input in 2055 will be more in the form of continuous/realtime data input streams.


No, there won't. I must assume he is exaggerating for the clicks.


Generate as many as you need.


Oh, also curious… today, how many individual image frames from video are there just from Tesla vehicles?


Training models from generated content degrades them over time.


The generated results can come from other means - for example, pretraining on rendered CG imagery is quite popular in the computer vision world, especially for problems where acquiring ground truth data in the real world is quite difficult.


Yet science fiction pushes civilization towards novelty



