Someone needs to write a buyer's guide for GPUs and LLMs. For example, what's the best course of action if you don't need to train anything but do want to eventually run whatever model becomes the first local-capable equivalent to ChatGPT? Do you go with Nvidia for the CUDA cores or with AMD for more VRAM? Do you do neither and wait another generation?
If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of VRAM (a 3090 is fine, a 4090 has more performance potential), which can fit a 30B parameter 4-bit quantized model that can probably be fine-tuned to ChatGPT (3.5) level quality. If that's not enough, you can probably add a second card later on.
Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well, and it's easy to try for free: https://github.com/ggerganov/llama.cpp
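If you'd rather script it than use the bundled CLI, the llama-cpp-python bindings wrap the same library; a minimal sketch, where the model path is a placeholder for whatever 4-bit quantized model you've downloaded:

```python
# pip install llama-cpp-python  (Python bindings around llama.cpp)
from llama_cpp import Llama

# Placeholder path: point this at a quantized model file you've downloaded yourself.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

out = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:", "\n"],  # stop generating at the next question or newline
)
print(out["choices"][0]["text"])
```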
Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5 but no release date announced) and it's unclear how well it'll work - basically, unless you would rather be fighting w/ hardware than playing around w/ ML, it's probably best to avoid them (the older RDNA cards also don't have tensor cores, so perf would be hobbled even if you could get things running, and lots of software has been written w/ CUDA-only in mind).
I'm aware that a 6800 is not RDNA3. You stated broadly:
> Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box.
I was merely sharing that my experience doesn't match the claim that current consumer cards have terrible support.
Sure, and I was merely clarifying that only last-gen architectures work. While I'm glad that SD works for you, if we are getting into it, I think that having no support on current-gen flagship models does equal broadly terrible software support and is more worth highlighting, since otherwise someone might assume they could just go pick one of those up and get a 24GB GPU on the cheap, especially in the context of LLMs (which is what the OP was asking about).
For RDNA2, you apparently can get LLMs running, but it requires forking/patching both bitsandbytes and GPTQ: https://rentry.org/eq3hg - and this will be true for any library (eg, can you use accelerate? deepspeed? fastgen? who knows, but certainly no one is testing it and AMD doesn't care if you're not on CDNA). It's important to note again that anything that works atm will still only work with last-gen cards, on Linux only (ROCm does not work through WSL), w/ limited VRAM (no 30B q4 models), and since RDNA2 tensor support is awful, if the SD benchmarks are anything to go by, performance will still end up worse than an RTX 3050: https://www.tomshardware.com/news/stable-diffusion-gpu-bench...
> I think that having no support on current-gen flagship models does equal broadly terrible software support and is more worth highlighting, since otherwise someone might assume they could just go pick one of those up and get a 24GB GPU on the cheap, especially in the context of LLMs (which is what the OP was asking about).
Absolutely fair and I agree with this part. I started my reply with "FWIW" (For What It's Worth) on purpose.
> For RDNA2, you apparently can get LLMs running, but it requires forking/patching both bitsandbytes and GPTQ: https://rentry.org/eq3hg - and this will be true for any library (eg, can you use accelerate? deepspeed? fastgen? who knows, but certainly no one is testing it and AMD doesn't care if you're not on CDNA).
I haven't tried any of the GPU-based LLMs yet. SD leveraging PyTorch (which seems to have solid ROCm support) worked for me. It will not be faster than NVIDIA for sure but if someone already has a 16GB+ AMD card they may be able to at least play with stuff without needing to purchase an NVIDIA card instead.
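(For anyone wanting to sanity-check that a ROCm build of PyTorch actually sees their AMD card before spending time on the bigger stuff, a rough sketch - note that ROCm builds reuse the torch.cuda API, so "cuda" below doesn't imply an Nvidia GPU:)

```python
# Sanity check for a ROCm build of PyTorch.
import torch

print("PyTorch:", torch.__version__)
print("HIP/ROCm version:", torch.version.hip)     # None on CUDA-only or CPU builds
print("GPU visible:", torch.cuda.is_available())  # ROCm devices show up via the cuda API

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Tiny matmul on the GPU to confirm kernels actually run
    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).sum().item())
```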
Nvidia, and the highest amount of VRAM you can get.
Currently that's the 4090; the rumor is the 4090 Ti will have 48GB of VRAM, idk if it's worth waiting or not.
The more VRAM, the higher the parameter count you can run entirely in memory (by far the fastest option).
AMD is almost a joke in ML. The lack of CUDA support (which is Nvidia proprietary) is straight-up lethal, and even though ROCm does have much better support these days, from what I've seen performance is still a fraction of what it should be. I'm also not sure whether individual projects need to support it explicitly - I know PyTorch has backend support for it, but I'm not sure how easy it is to drop in.
I have access to an Nvidia A100. But as a layman, what specs does the rest of the system need to use it for some real work? I would assume there needs to be at least as much RAM as VRAM and maybe a few terabytes of disk space. Does anyone have experience with this?
A whole different department made those decisions and I don't think they had any idea of what is actually needed. They wanted to buy such a GPU because training your own model is a trendy thing and they wanted to stay within their budget (mostly used up by the GPU).
I'm just trying to scramble together something from the hacked together thing I now have to deal with.
I'm running models locally on my 3090 and it's fast enough, although building a vector database, for example, can take a while. I can run LoRA training but I haven't done anything meaningful with it so far. I chose the 3090 because of the 4090's power-cable issue (also, no NVLink, although I'm not sure that matters), but it's debatable whether my fears are justified. I need to leave the GPU running while I'm away and I just don't feel comfortable doing that with a 4090. I'd rather take the lower performance.
One caveat though: my ASUS B650E-F is barely supported by the current Ubuntu kernel (e.g. my microphone doesn't work, and before upgrading the kernel + BIOS I didn't have a LAN connection...), so expect some problems if you want to use a relatively new gaming setup for Linux.
Windows generally works, but there may be a somewhat small performance hit. IMO Linux is much easier to get working, judging by all the GitHub issue threads I see about SD/LLaMA stuff on Windows - but I don't use Windows so I don't have personal experience.
A 4090 with 24GB is 1800 USD; the Ada A6000 with 48GB is like 8000 USD, and idk where you'd even buy it. So if you want to run games and models locally the 4090 is honestly the best option.
EDIT: I forgot - there is a rumored 4090 Ti with 48GB of VRAM, no idea if that's worth waiting for.
The A6000 is actually the old generation, Ampere. The new Ada generation one is called 6000. Seems many places still sell A6000 (Ampere) for the same price as RTX 6000 (Ada) though, even though the new one is twice as fast.
Seems you can get used RTX A6000s for around $3000 on eBay.
You're kidding? So they called it the RTX 6000, then called it the RTX A6000 for Ampere, then went back to RTX 6000 for Ada?
Why do they do this? Sometimes consumer products are versioned weirdly to mislead customers (like Intel CPUs) - but it wouldn't even make sense to do that here, since these are enterprise cards?
Actually the first one is called Quadro RTX 6000, while the Ada one is just RTX 6000 without "Quadro" in front. Not that it makes the naming make much more sense.
According to GPT-4 the next generation one will be called Galactic Unicorn RTX 6000 :D
The 4090 is amazing, but it's a very large card. The 3090 is "good enough" for ML - same 24GB of VRAM - and you can pick them up used for half the price of a new 4090. That's what I did.
WSL on Windows is apparently decent, or native PyTorch, but a Windows/Ubuntu dual boot is still probably best.
Getting CUDA on OpenSUSE was super easy. The Nvidia blob drivers are easy to install and CUDA needs another download and some copy paste. Even Unreal Editor was easier to install than on Windows.
Save some money and go with a 3090 - same VRAM, and the speed difference probably isn't worth the premium for the 4090. Then upgrade when the rumored 5090 generational leap happens.
Unless you have two graphics cards (well, you can use an integrated GPU) and need to run both OSes at once, I think this will be less convenient than a dual-boot setup for most people, though.
You can't switch which GPU Linux is using without restarting the session.
Basically, you want Nvidia, and you want lots of VRAM. Buy used for much more bang for the buck.
Depending on your budget, get:
- an RTX 3060 with 12GB or
- 1 used RTX 3090 with 24GB (approx twice as expensive as the 3060 but twice the VRAM and much faster) or
- 2 used RTX 3090 cards if you need more than 24GB.
Everything beyond that gets quite a bit more expensive because then you need a platform with more PCIe lanes, you may need more than one PSU and you will have problems fitting and cooling everything.
With two cards and 2x24GB you can run the largest version of the LLaMA model (the 65B variant) and all its descendants with 4-bit quantization inside your GPUs' VRAM, i.e. with good performance. You can also try some low-resource fine-tuning variants (LoRA etc).
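For a rough idea of what the two-card setup looks like in code, here's a sketch using Hugging Face transformers + accelerate to split a model across both GPUs via device_map / max_memory. The model path and memory caps are placeholders, and the 4-bit GPTQ loaders people actually use for 65B have their own loading paths, so treat this as the general shape rather than the exact recipe:

```python
# Sketch: shard a large causal LM across two 24GB cards.
# Requires transformers + accelerate; the model path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-65b"  # placeholder for a local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                    # let accelerate split layers across GPUs
    max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each 24GB card
    torch_dtype="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```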
Oh and while you're at it, also get a decent amount of RAM like 64GB or 128GB (it's very cheap right now) and an NVMe SSD. These models are quite large.
Ask: how many parameters does the model have? At how many bits? Has it been quantized to a smaller number of bits, or if not, can it be without losing quality? Multiply the number of bits by the number of parameters, convert to GB, and then add some amount of overhead for however much else needs to load and/or how horrible/amazing the inference implementation is. Done.
Example: a 30B parameter model trained at 16-bit FP gets quantized down to 4-bit ints. 4 bits = 0.5 bytes. 30 billion * 0.5 bytes = 15GB of VRAM (plus a GB or a few of other overhead).
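The same arithmetic as a throwaway helper (the 2GB overhead default is just a guess, adjust to taste):

```python
# Rough VRAM estimate: parameters * bytes-per-parameter + guessed overhead.
def vram_estimate_gb(params_billion: float, bits: int, overhead_gb: float = 2.0) -> float:
    return params_billion * (bits / 8) + overhead_gb

print(vram_estimate_gb(30, 4))   # ~17.0 -> fits on a single 24GB card
print(vram_estimate_gb(65, 4))   # ~34.5 -> needs 2x24GB
print(vram_estimate_gb(13, 16))  # ~28.0 -> why unquantized 13B doesn't fit in 24GB
```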
Depends on whether you just want to play around or have actual business needs, but to play around I'm waiting for the next generation of unified memory architectures.
Well, currently there's just Apple, and they are super secretive about their roadmap. The top-line M1/M2 chips already perform quite well, and while the top versions look expensive, a pair of A6000s will definitely eat more money both to purchase and to run.
The M3 is around the corner though, and there's some announcement to come from Intel or Arm following their partnership. There's also a new card coming from Intel that is supposed to be aimed squarely at machine learning workloads; they don't have to segment their market by memory size like Nvidia does, but their cards aren't well supported as device targets. A pair of these will likely be very cost effective, but only if they get credible compatibility with the libraries and models.
A 3090 or 4090. That's it as far as LLMs go. You can run 13B on a 4080, but 13B sucks compared to 30B. And ROCm is a poorly supported, hair-pulling nightmare.