
Time for a dedicated "AI box" at home with hotswapping compute boards? Maybe put it inside a humanoid or animal-like robot with TTS capabilities?

Sign me up for that kickstarter!

EDIT: Based on some quick googling (should I have asked ChatGPT instead?), Nvidia sells the Jetson Xavier NX dev kit for ~$610: https://www.electromaker.io/shop/product/nvidia-jetson-xavie...

Just need the robot toy dog enclosure

(See https://www.electromaker.io/blog/article/best-sbc-for-ai-sin... for a list of alternatives if that one is too expensive)



It's more likely that you want a lot of compute for a very small amount of time each day, which makes centralised/cloud processing the most obvious answer.

If I want a response within 100ms and have 1,000 AI queries per day, that's only about 2 minutes of aggregate processing time for the AI box per day: it's less than 1% utilised. If the same box is multi-user and on the internet, it can probably serve 50-100 people's queries concurrently.

The converse is that if you put it in the cloud, for the same cost you might effectively get 50x the hardware per user (i.e. rather than each of the 50 users having a local AI box with 1 GPU, you could have 1 AI box with 50 GPUs that is usable by all 50 users).
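
Back-of-envelope version of that utilisation argument (the numbers below are just the assumptions from above, not measurements, and the helper names are mine):

    # Rough utilisation of a single-user local "AI box"
    queries_per_day = 1000
    gpu_seconds_per_query = 0.1                 # assumed ~100 ms of compute per query
    busy_seconds = queries_per_day * gpu_seconds_per_query   # 100 s, under 2 minutes
    utilisation = busy_seconds / (24 * 3600)
    print(f"{busy_seconds / 60:.1f} min/day busy, {utilisation:.2%} utilised")
    # ~1.7 min/day and ~0.12% utilisation, which is why one shared box
    # could in principle serve dozens of users with a similar usage pattern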


"a lot of compute for a very little amount of time each day" sounds like something I can play games on when I'm not working.


Why not just buy a computer that is correctly-sized to play games, rather than buy an AI-sized computer that you mostly use for games?


Because I want both.


But not use both at once?


Well... No. If I'm sat playing a game, I'm unlikely to be generating AI queries.


Each billion parameters stored as 16-bit floats requires around 2 GB of GPU or TPU RAM. ChatGPT is rumoured to have around 1,000 billion; good open-source LLMs currently have around 7-20 billion, and consumer GPUs currently max out at 24 GB. You can quantize the model to e.g. 4 bits per parameter and apply other compression, but there is still quite a limit to what you can do with 24 GB of RAM. The Apple unified-memory approach may be a path forward to increasing that.

So one box gives you access to the small models; for a GPT-4-like model you'd need (for inference, and if you had the model and tools) probably 100 of those 4090s, or I guess 25 H100s with 96 GB, to fit the ~2 TB of model data.
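
To make that arithmetic concrete, a rough sketch (the parameter counts are the assumed/rumoured figures from above, not confirmed, and this ignores activations, KV cache and framework overhead; weight_gb is just an illustrative helper):

    # Weight memory needed just to hold the parameters for inference
    def weight_gb(params_billion, bits_per_param):
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    for name, params in [("7B open model", 7),
                         ("20B open model", 20),
                         ("rumoured ~1000B GPT-4-class model", 1000)]:
        for bits in (16, 4):
            print(f"{name:34s} {bits:2d}-bit: ~{weight_gb(params, bits):7.1f} GB")
    # 7B    @ 16-bit ->   ~14 GB (fits on a 24 GB consumer card)
    # 1000B @ 16-bit -> ~2000 GB (the ~2 TB of model data mentioned above)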


Currently we do not explore sparsity. The next iteration of models will be much more compact, with a focus on reducing effective tensor size.


It is quite likely that GPT-4 uses one or even two sparsity approaches on top of each other (namely, coarse-grained switch-transformer-like sparsity and fine-grained intra-tensor block sparsity), if you look at the openly available research CVs of its contributors.
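
For readers unfamiliar with the term, here is a toy sketch (in PyTorch, purely illustrative; not GPT-4's or any published model's actual implementation) of what coarse-grained switch-style sparsity means: a router sends each token to exactly one expert FFN, so only a fraction of the total weights is touched per forward pass.

    import torch
    import torch.nn as nn

    class ToySwitchFFN(nn.Module):
        """Top-1 (switch-style) mixture-of-experts feed-forward layer.
        Compute per token stays roughly constant while the total
        parameter count grows with the number of experts."""
        def __init__(self, d_model=64, d_ff=256, n_experts=4):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))

        def forward(self, x):                          # x: (tokens, d_model)
            gate = self.router(x).softmax(dim=-1)      # routing probabilities
            top_p, top_idx = gate.max(dim=-1)          # pick one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                chosen = top_idx == i
                if chosen.any():                       # only the routed tokens run
                    out[chosen] = top_p[chosen, None] * expert(x[chosen])
            return out

    y = ToySwitchFFN()(torch.randn(10, 64))   # each of 10 tokens hits 1 of 4 experts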

Google, in collaboration with OpenAI, has published an impressive tour de force in which they thoroughly developed and validated at scale a sparse transformer architecture applied to the general language-modelling task: https://arxiv.org/abs/2111.12763

This happened in November 2021, and there is a public implementation of this architecture on Google's public GitHub.

Impressively, for whatever reason, other up-and-coming players are still not releasing models trained with this approach, even though it promises a multiplicative payoff in inference economics. One boring explanation is conservatism around NN training at scale, where a training run costs O(yearly salary).

Let's hope the open source side of things catches up.


Not in collaboration with OpenAI: one of the authors joined OpenAI before the paper was written and put on arXiv.


At least in my experience with sparse matrix libraries like Eigen, you need to get the density down to about 5% non-zeros before switching from a dense to a sparse algorithm gives you execution-time benefits.

Of course, from a memory-bandwidth and model-size perspective, there may be benefits long before that.
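
A quick way to sanity-check that crossover yourself, as a rough sketch (SciPy standing in for Eigen; the exact break-even density depends a lot on the hardware, the library and the access pattern):

    import time
    import numpy as np
    from scipy import sparse

    n, density = 4000, 0.05                      # ~5% non-zeros
    a_csr = sparse.random(n, n, density=density, format="csr")
    a_dense = a_csr.toarray()
    x = np.random.rand(n)

    t0 = time.perf_counter()
    for _ in range(50):
        a_dense @ x                              # dense mat-vec
    t1 = time.perf_counter()
    for _ in range(50):
        a_csr @ x                                # sparse (CSR) mat-vec
    t2 = time.perf_counter()
    print(f"dense:  {(t1 - t0) / 50 * 1e3:.2f} ms")
    print(f"sparse: {(t2 - t1) / 50 * 1e3:.2f} ms")
    # at roughly this density the CSR version usually starts to win;
    # much denser than that and the dense BLAS path tends to be faster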


It seems like a ton of engineering effort has been put into these neural-network frameworks. How have they not explored sparsity yet? With numerical linear algebra that's, like, the most obvious thing to do (which is to say, you usually know beforehand whether your problem can be mapped to sparse matrices).

(Edit: just to be clear, I'm not saying I expect the whole field is full of dummies who missed something obvious; I don't know much at all about machine learning, so I'm sure I'm missing something.)


It's not at all strange to get something working before you start optimizing. I mean, if you could only run small models, how would you even know what you're losing by optimizing for space? Heck, you wouldn't even know how the model behaves, so you wouldn't know where to start shaving away.

I'm not saying it's impossible, but if resources allow, it makes a lot of sense to start with the biggest model you can still train. Especially since, for whatever reason, things seem to get a lot easier if you simply throw more computing power at it (kind of like how, no matter how advanced your caching algorithm is, it's not going to be more than 2 times faster than the simplest LRU algorithm with double the amount of cache).


From what I’ve read recently, most sparse methods just haven’t given that much improvement yet, and we’re only recently pushing up against the limits of the “just buy more RAM” approach.

It sounds like there is a lot of work happening on sparse networks now, so it’ll be interesting to see how this changes in the near future.


GPUs were built for dense math, and the field ran with it, to the point that the current best architectures are in part just the ones that run best on the subset of linear algebra GPUs are really good at.

There has been a lot of work on sparsity and on discovering sparse subnetworks inside trained dense networks. Intel has even proposed some alternative CPU-friendly architectures, and PyTorch/TensorFlow and GPUs are starting to do okay with sparse matrices, so things are changing.


Sounds a bit like premature optimization (to have done it _by_ now). I bet it's in the works now, though.


"Do things that don't scale" – PG


llama-65b quantized to 4 bits comes down to about 39 GB; you can run that on a 48 GB A6000 (~$4.5K) or on 2 x 24 GB 3090s (~$1,500 used). llama-30b (33B really, but who's counting) quantizes down to 19 GB (17 GB with some optimization), so that'll fit comfortably on a 24 GB GPU.

A 4-bit quantization of a 1,000B model should be <600 GB, so it would fit on a regular 8x80 GB DGX system.
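
Roughly the arithmetic behind those numbers (assuming ~0.5 bytes per parameter at 4 bits plus an assumed ~20% fudge factor for scales and any layers left unquantized; real GGML/GPTQ file sizes vary a bit):

    # 4-bit weight-size estimates: params * 0.5 bytes, plus ~20% assumed overhead
    for name, params_b in [("llama-30b (33B)", 33),
                           ("llama-65b", 65),
                           ("hypothetical 1000B model", 1000)]:
        raw_gb = params_b * 0.5
        print(f"{name:26s} ~{raw_gb:6.1f} GB raw, ~{raw_gb * 1.2:6.1f} GB with overhead")
    # 65B   -> ~32.5 GB raw, ~39 GB with overhead (the figure above)
    # 1000B -> ~500 GB raw, under the 8 x 80 GB = 640 GB of a DGX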


I wonder if an outdated chip architecture with just a lot of (64 GB?) GDDR4 or something would work? Recycle all the previous-generation cards into super-high-VRAM units.


I've seen reports of people being able to run LLMs at decent speeds on old Nvidia P40s. These are 24GB Pascal GPUs and can be bought for as low as $100 (although more commonly $200) on eBay.


Link to report please



You can do it on CPU now.

Benchmarks: https://github.com/ggerganov/llama.cpp/issues/34


It should work. After the initial prompt processing, token generation is typically limited by memory bandwidth more than by compute.
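
Rough intuition for why: every generated token has to stream essentially all of the weights through memory once, so an upper bound on tokens/second is about bandwidth divided by model size (the bandwidth and size numbers below are illustrative assumptions, not benchmarks, and token_rate_ceiling is just a made-up helper):

    # Crude ceiling: tokens/s <= memory_bandwidth / bytes_touched_per_token,
    # where bytes_touched_per_token is roughly the size of the weights.
    def token_rate_ceiling(model_gb, bandwidth_gb_per_s):
        return bandwidth_gb_per_s / model_gb

    print(token_rate_ceiling(model_gb=4, bandwidth_gb_per_s=400))    # ~100 tok/s ceiling
    print(token_rate_ceiling(model_gb=39, bandwidth_gb_per_s=200))   # ~5 tok/s ceiling
    # Real throughput comes in below the ceiling, but it scales with memory
    # bandwidth rather than FLOPS, which is why CPUs and unified-memory
    # systems do better than their raw compute would suggest.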


Intel Larrabee has entered the chat, but it showed up about a decade and a half too early.


For completeness, Apple's consumer GPUs currently max out at 64 GB minus OS overhead, so about 56 GB usable. But you are limited to 1 GPU per system.


Benchmarks for what you can do on CPU alone.

https://github.com/ggerganov/llama.cpp/issues/34

An M1 Max does about 100ms per token; a 64-core Threadripper does about 33ms per token.



