Someone needs to write a buyer's guide for GPUs and LLMs. For example, what's the best course of action if you don't need to train anything but do want to eventually run whatever model becomes the first local-capable equivalent to ChatGPT? Do you go with Nvidia for the CUDA cores or with AMD for more VRAM? Do you do neither and wait another generation?
If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of VRAM (a 3090 is fine, a 4090 has more performance potential), which can fit a 30B parameter 4-bit quantized model that can probably be fine-tuned to ChatGPT (3.5) level quality. If that's not enough, you can probably add a second card later on.
Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well, and it's easy to try for free: https://github.com/ggerganov/llama.cpp
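If you'd rather script it than use the bundled CLI, the llama-cpp-python bindings wrap the same library; a minimal sketch, where the model path is a placeholder for whatever 4-bit quantized model you've downloaded:

```python
# pip install llama-cpp-python  (Python bindings around llama.cpp)
from llama_cpp import Llama

# Placeholder path: point this at a quantized model file you've downloaded yourself.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

out = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:", "\n"],  # stop generating at the next question or newline
)
print(out["choices"][0]["text"])
```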
Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5 but no release date announced) and it's unclear how well it'll work - basically, unless you would rather be fighting w/ hardware than playing around w/ ML, it's probably best to avoid them (the older RDNA cards also don't have tensor cores, so perf would be hobbled even if you could get things running, and lots of software has been written w/ CUDA-only in mind).
I'm aware that a 6800 is not RDNA3. You stated broadly:
> Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box.
I was merely sharing that my experience doesn't match the claim that current consumer cards have terrible support.
Sure, and I was merely clarifying that only last-gen architectures work. While I'm glad that SD works for you, if we are getting into it, I think that having no support on current-gen flagship models does equal broadly terrible software support and is more worth highlighting, since otherwise someone might assume they could just go pick one of those up and get a 24GB GPU on the cheap, especially in the context of LLMs (which is what the OP was asking about).
For RDNA2, you apparently can get LLMs running, but it requires forking/patching both bitsandbytes and GPTQ: https://rentry.org/eq3hg - and this will be true for any library (eg, can you use accelerate? deepspeed? fastgen? who knows, but certainly no one is testing it and AMD doesn't care if you're not on CDNA). It's important to note again that anything that works atm will still only work with last-gen cards, on Linux only (ROCm does not work through WSL), w/ limited VRAM (no 30B q4 models), and since RDNA2 tensor support is awful, if the SD benchmarks are anything to go by, performance will still end up worse than an RTX 3050: https://www.tomshardware.com/news/stable-diffusion-gpu-bench...
> I think that having no support on current-gen flagship models does equal broadly terrible software support and is more worth highlighting, since otherwise someone might assume they could just go pick one of those up and get a 24GB GPU on the cheap, especially in the context of LLMs (which is what the OP was asking about).
Absolutely fair and I agree with this part. I started my reply with "FWIW" (For What It's Worth) on purpose.
> For RDNA2, you apparently can get LLMs running, but it requires forking/patching both bitsandbytes and GPTQ: https://rentry.org/eq3hg - and this will be true for any library (eg, can you use accelerate? deepspeed? fastgen? who knows, but certainly no one is testing it and AMD doesn't care if you're not on CDNA).
I haven't tried any of the GPU-based LLMs yet. SD leveraging PyTorch (which seems to have solid ROCm support) worked for me. It will not be faster than NVIDIA for sure but if someone already has a 16GB+ AMD card they may be able to at least play with stuff without needing to purchase an NVIDIA card instead.
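(For anyone wanting to sanity-check that a ROCm build of PyTorch actually sees their AMD card before spending time on the bigger stuff, a rough sketch - note that ROCm builds reuse the torch.cuda API, so "cuda" below doesn't imply an Nvidia GPU:)

```python
# Sanity check for a ROCm build of PyTorch.
import torch

print("PyTorch:", torch.__version__)
print("HIP/ROCm version:", torch.version.hip)     # None on CUDA-only or CPU builds
print("GPU visible:", torch.cuda.is_available())  # ROCm devices show up via the cuda API

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Tiny matmul on the GPU to confirm kernels actually run
    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).sum().item())
```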
Nvidia, and the highest amount of VRAM you can get.
Currently that's the 4090; the rumor is the 4090 Ti will have 48GB of VRAM, idk if it's worth waiting or not.
The more VRAM, the higher the parameter count you can run entirely in memory (by far the fastest option).
AMD is almost a joke in ML. The lack of CUDA support (which is Nvidia proprietary) is straight-up lethal, and even though ROCm does have much better support these days, from what I've seen performance is still a fraction of what it should be. I'm also not sure whether individual projects need to support it explicitly - I know PyTorch has backend support for it, but I'm not sure how easy it is to drop in.
I have access to an Nvidia A100. But as a layman, what specs does the rest of the system need to use it for some real work? I would assume there needs to be at least as much RAM as VRAM and maybe a few terabytes of disk space. Does anyone have experience with this?
A whole different department made those decisions and I don't think they had any idea of what is actually needed. They wanted to buy such a GPU because training your own model is a trendy thing and they wanted to stay within their budget (mostly used up by the GPU).
I'm just trying to scramble together something from the hacked together thing I now have to deal with.
I'm running models locally on my 3090 and it's fast enough, although building a vector database, for example, can take a while. I can run LoRA training but I haven't done anything meaningful with it so far. I chose the 3090 because of the 4090's power-cable issue (also, no NVLink, although I'm not sure that matters), but it's debatable whether my fears are justified. I need to leave the GPU running while I'm away and I just don't feel comfortable doing that with a 4090. I'd rather take the lower performance.
One caveat though: my ASUS B650E-F is barely supported by the current Ubuntu kernel (e.g. my microphone doesn't work, and before upgrading the kernel + BIOS I didn't have a LAN connection...), so expect some problems if you want to use a relatively new gaming setup for Linux.
Windows generally works, but there may be a somewhat small performance hit. IMO Linux is much easier to get working, judging by all the GitHub issue threads I see about SD/LLaMA stuff on Windows - but I don't use Windows so I don't have personal experience.
A 4090 with 24GB is 1800 USD; the Ada A6000 with 48GB is like 8000 USD, and idk where you'd even buy it. So if you want to run games and models locally the 4090 is honestly the best option.
EDIT: I forgot - there is a rumored 4090 Ti with 48GB of VRAM, no idea if that's worth waiting for.
The A6000 is actually the old generation, Ampere. The new Ada generation one is called 6000. Seems many places still sell A6000 (Ampere) for the same price as RTX 6000 (Ada) though, even though the new one is twice as fast.
Seems you can get used RTX A6000s for around $3000 on eBay.
You're kidding? So they called it the RTX 6000, then called it the RTX A6000 for Ampere, then went back to RTX 6000 for Ada?
Why do they do this? Sometimes consumer products are versioned weirdly to mislead customers (like Intel CPUs) - but it wouldn't even make sense to do that here, since these are enterprise cards?
Actually the first one is called Quadro RTX 6000, while the Ada one is just RTX 6000 without "Quadro" in front. Not that it makes the naming make much more sense.
According to GPT-4 the next generation one will be called Galactic Unicorn RTX 6000 :D
The 4090 is amazing, but it's a very large card. The 3090 is "good enough" for ML - same 24GB of VRAM - and you can pick them up used for half the price of a new 4090. That's what I did.
WSL on Windows is apparently decent, or native PyTorch, but a Windows/Ubuntu dual boot is still probably best.
Getting CUDA on OpenSUSE was super easy. The Nvidia blob drivers are easy to install and CUDA needs another download and some copy paste. Even Unreal Editor was easier to install than on Windows.
Save some money and go with a 3090 - same VRAM, and the speed difference probably isn't worth the premium for the 4090. Then upgrade when the rumored 5090 generational leap happens.
Unless you have two graphics cards (well, you can use an integrated GPU) and need to run both OSes at once, I think this will be less convenient than a dual-boot setup for most people, though.
You can't switch which GPU Linux is using without restarting the session.
Basically, you want Nvidia, and you want lots of VRAM. Buy used for much more bang for the buck.
Depending on your budget, get:
- an RTX 3060 with 12GB or
- 1 used RTX 3090 with 24GB (approx twice as expensive as the 3060 but twice the VRAM and much faster) or
- 2 used RTX 3090 cards if you need more than 24GB.
Everything beyond that gets quite a bit more expensive because then you need a platform with more PCIe lanes, you may need more than one PSU and you will have problems fitting and cooling everything.
With two cards and 2x24GB you can run the largest version of the LLaMA model (the 65B variant) and all its descendants with 4-bit quantization inside your GPUs' VRAM, i.e. with good performance. You can also try some low-resource fine-tuning variants (LoRA etc).
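For a rough idea of what the two-card setup looks like in code, here's a sketch using Hugging Face transformers + accelerate to split a model across both GPUs via device_map / max_memory. The model path and memory caps are placeholders, and the 4-bit GPTQ loaders people actually use for 65B have their own loading paths, so treat this as the general shape rather than the exact recipe:

```python
# Sketch: shard a large causal LM across two 24GB cards.
# Requires transformers + accelerate; the model path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-65b"  # placeholder for a local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                    # let accelerate split layers across GPUs
    max_memory={0: "22GiB", 1: "22GiB"},  # leave headroom on each 24GB card
    torch_dtype="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```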
Oh and while you're at it, also get a decent amount of RAM like 64GB or 128GB (it's very cheap right now) and an NVMe SSD. These models are quite large.
Ask: how many parameters does the model have? At how many bits? Has it been quantized to a smaller number of bits, or if not, can it be without losing quality? Multiply the number of bits by the number of parameters, convert to GB, and then add some amount of overhead for however much else needs to load and/or how horrible/amazing the inference implementation is. Done.
Example: a 30B parameter model trained at 16-bit FP gets quantized down to 4-bit ints. 4 bits = 0.5 bytes. 30 billion * 0.5 bytes = 15GB of VRAM (plus a GB or a few of other overhead).
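The same arithmetic as a throwaway helper (the 2GB overhead default is just a guess, adjust to taste):

```python
# Rough VRAM estimate: parameters * bytes-per-parameter + guessed overhead.
def vram_estimate_gb(params_billion: float, bits: int, overhead_gb: float = 2.0) -> float:
    return params_billion * (bits / 8) + overhead_gb

print(vram_estimate_gb(30, 4))   # ~17.0 -> fits on a single 24GB card
print(vram_estimate_gb(65, 4))   # ~34.5 -> needs 2x24GB
print(vram_estimate_gb(13, 16))  # ~28.0 -> why unquantized 13B doesn't fit in 24GB
```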
Depends on whether you just want to play around or have actual business needs, but to play around I'm waiting for the next generation of unified memory architectures.
Well, currently there's just Apple, and they are super secretive about their roadmap. The top-line M1/M2 chips already perform quite well, and while the top versions look expensive, a pair of A6000s will definitely eat more money both to purchase and to run.
The M3 is around the corner though, and there's some announcement to come from Intel or Arm following their partnership. There's also a new card coming from Intel that is supposed to be aimed squarely at machine learning workloads; they don't have to segment their market by memory size like Nvidia does, but their cards aren't well supported as device targets. A pair of these will likely be very cost effective, but only if they get credible compatibility with the libraries and models.
A 3090 or 4090. That's it as far as LLMs go. You can run 13B on a 4080, but 13B sucks compared to 30B. And ROCm is a poorly supported, hair-pulling nightmare.