
What GPU offers a good balance between cost and performance for running LLMs locally? I'd like to do more experimenting, and am due for a GPU upgrade from my 1080 anyway, but would like to spend less than $1600...


Inferencing does not require Nvidia GPUs at all, and it's almost criminal to be recommending dedicated GPUs with only 12GB of RAM.

Buy a Mac Mini or MacBook Pro with RAM maxed out.

I just bought an M4 mac mini for exactly this use case that has 64GB for ~2k. You can get 128GB on the MBP for ~5k. These will run much larger (and more useful) models.

EDIT: Since the request was for < $1600, you can still get a 32GB mac mini for $1200 or 24GB for $800


> it's almost criminal to be recommending dedicated GPUs with only 12GB of RAM.

If you already own a PC, it makes a hell of a lot more sense to spend $900 on a 3090 than it does to spec out a Mac Mini with 24gb of RAM. Plus, the Nvidia setup can scale to as many GPUs as you own which gives you options for upgrading that Apple wouldn't be caught dead offering.

Oh, and native Linux support that doesn't suck balls is a plus. I haven't benchmarked a Mac since the M2 generation, but the figures I can find put the M4 Max's compute somewhere near the desktop 3060 Ti: https://browser.geekbench.com/opencl-benchmarks


A Mac Mini with 24GB is ~$800 at the cheapest configuration. I can respect wanting to do a single part upgrade, but if you're using these LLMs for serious work, the price/perf for inferencing is far in favor of using Macs at the moment.

You can easily use the MacMini as a hub for running the LLM while you do work on your main computer (and it won't eat up your system resources or turn your primary computer into a heater)

I hope that more non-mac PCs come out optimized for high RAM SoC, I'm personally not a huge Apple fan but use them begrudgingly.

Also your $900 quote is a used/refurbished GPU. I've had plenty of GPUs burn out on me in the old days, not sure how it is nowadays, but that's a lot to pay for a used part IMO


If you're doing serious work, performance is more important than getting a good price/perf ratio, and a pair of 3090s is gonna be faster. It depends on your budget, though, as that configuration is a bit more expensive.


Whether performance or cost is more important depends on your use case. Some tasks that an LLM can do very well may not need to be done often, or even particularly quickly (as in my case).

e.g. LLM as one step of an ETL-style pipeline

Latency of the response really only matters if that response is user facing and is being actively awaited by the user


> M4 Max's compute somewhere near the desktop 3060 Ti

The only advantage is the M4 Max's ability to have way more VRAM than a 3060 Ti. You won't find many M4 Maxes with just 8 or 16 GB of RAM, and I don't think you can do much except use really small models with a 3060 Ti.


It's a bit of a moot point when CUDA will run 4 3060 Tis in parallel, with further options for paging out to system memory. Since most models (particularly bigger/MoE ones) are sparsely decoded, you can get quite a lot of mileage out of multiple PCIe slots fed with enough bandwidth.
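
For a concrete picture, here's a rough sketch with Hugging Face transformers/accelerate that shards one model across two GPUs and spills the overflow into system RAM. The model name and memory caps are placeholders, not a recommendation:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # example MoE model
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      torch_dtype=torch.float16,
      device_map="auto",  # shard layers across every visible GPU
      max_memory={0: "8GiB", 1: "8GiB", "cpu": "48GiB"},  # overflow into system RAM
  )
  tok = AutoTokenizer.from_pretrained(model_id)
  inputs = tok("Hello", return_tensors="pt").to("cuda")
  print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))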

There's no doubt in my mind that the PC is the better performer if raw power is your concern. It's far-and-away the better value if you don't need to buy new hardware and only need a GPU. $2,000 of Nvidia GPUs will buy you halfway to an enterprise cluster, $2,000 of Apple hardware will get you a laptop chip with HBM.


You need a lot of space for that, cooling, and a good fuse that won't trip when you turn it on. I would totally just pay the money for an M4 Ultra Mac Studio with 128 GB of RAM (or an M4 Max with 64 GB). It is a much cleaner setup, especially if you aren't interested in image generation (which the Macs are not good at yet).

If I could spend $4k on a non-Apple turn key solution that I could reasonably manage in my house, I would totally consider it.


Well, that's your call. If you're the sort of person that's willing to spend $2,000 on an M4 Ultra (which doesn't quite exist yet, but we can pretend it does), then I honest to god do not understand why you'd refuse to spend that same money on a Jetson Orin with the same amount of memory in a smaller footprint with better performance and lower power consumption.

Unless you're specifically speccing out a computer for mobile use, the price premium you spend on a Mac isn't for better software or faster hardware. If you can tolerate Linux or Windows, I don't see why you'd even consider Mac hardware for your desktop. In the OP's position, suggesting Apple hardware literally makes no sense. They're not asking for the best hardware that runs MacOS, they're asking for the best hardware for AI.

> If I could spend $4k on a non-Apple turn key solution that I could reasonably manage in my house, I would totally consider it.

You can't pay Apple $4k for a turnkey solution, either. MacOS is borderline useless for headless inference; Vulkan compute and OpenCL are both MIA, package managers break on regular system updates and don't support rollback, LTS support barely exists, most coreutils are outdated and unmaintained, Asahi features things that MacOS doesn't support and vice-versa... you can't fool me into thinking that's a "turn key solution" any day of the week. If your car requires you to pick a package manager after you turn the engine over, then I really feel sorry for you. The state of MacOS for AI inference is truly no better than what Microsoft did with DirectML. By some accounts it's quite a bit worse.


M4 Ultra with enough RAM will cost more than $2000. An M2 Ultra Mac Studio with 64GB is $3999, and you probably want more RAM than that to run the bigger models that the Ultra can handle (it is basically 2X as powerful as the Max, with more memory bandwidth). An M2 Max with 64GB of RAM, which is more reasonable, will run you $2,499. I have no idea if those prices will hold when the M4 Mac Studios finally come out (an M4 Max MBP with 64GB of RAM starts at $3900 ATM).

> You can't pay Apple $4k for a turnkey solution, either.

I've seen/read plenty of success stories of Metal ports of models being used via LM Studio without much configuration/setup/hardware scavenging, so we can just disagree there.


>You need a lot of space for that, cooling, and a good fuse

Or live in Europe, where any wall socket can give you closer to 3kW. For crazier setups like charging your EV you can have three-phase plugs with ~22kW to play with. 1 m² of floor space isn't that substantial either, unless you already live in a closet in the middle of the most crowded city.


Three-phase 240V at 16 amps is just about 11kW. You're not going to find anything above that in residential wiring unless it was purpose-built.

That's still a lot of power, though, and does not invalidate your point.


Reasonable? $7,000 for a laptop is pretty up there.

[Edit: OK I see I am adding cost when checking due to choosing a larger SSD drive, so $5,000 is more of a fair bottom price, with 1TB of storage.]

Responding specifically to this very specific claim: "Can get 128GB of ram for a reasonable price."

I'm open to your explanation of how this is reasonable — I mean, you didn't say cheap, to be fair. Maybe 128GB of ram on GPUs would be way more (that's like 6 x 4090s), is what you're saying.

For anyone who wants to reply with other amounts of memory, that's not what I'm talking about here.

But on another point, do you think the ram really buys you the equivalent of GPU memory? Is Apple's melding of CPU/GPU really that good?

I'm not just coming from a point of skepticism, I'm actually kind of hoping to be convinced you're right, so wanting to hear the argument in more detail.


It's reasonable in a "working professional who gets substantial value from" or "building an LLM driven startup project" kind of way.

It's not for the casual user, but for somebody who derives significant value from running it locally.

Personally I use the Mac Mini as a hub for a project I'm working on, as it gives me full control and is simply much cheaper operationally. A one-time ~$2000 cost isn't so bad for replacing tasks that a human would have to do. E.g. in my case I'm parsing loosely organized financial documents where structured data isn't available.
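
For the curious, a pipeline step like that doesn't need much code. Something like this sketch against the Ollama REST API works (the LAN address, model choice, and document text below are placeholders, not exactly what I run):

  import json, requests

  doc_text = "ACME Corp invoice ... total due $4,310.22 by Nov 30"  # placeholder
  resp = requests.post(
      "http://192.168.1.50:11434/api/generate",  # the Mini; address made up
      json={
          "model": "llama3.1:8b",  # example model
          "prompt": "Extract vendor, total and due date as JSON:\n" + doc_text,
          "format": "json",   # ask Ollama to constrain the output to valid JSON
          "stream": False,
      },
      timeout=120,
  )
  print(json.loads(resp.json()["response"]))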

I suspect the hardware costs will continue to decline rapidly as they have in the past though, so that $5k for 128GB will likely be $5k for 256GB in a year or two, and so on.

We're almost at the inflection point where really powerful models can be inferenced locally for cheap.


For a coding setup, should I go with a Mac Mini M4 pro with 64GB of RAM? Or is it better to go with a M4 max (only available for the MBP right now, maybe in the Studio in a few months)? I'm not really interested in the 4090/3090 approach, but it is hard to make a decision on Apple hardware ATM.

I don't see prices falling much in the near term, a Mac Studio M2 Max or Ultra has been keeping its value surprisingly well as of late (mainly because of AI?). Just like 3090s/4090s are holding their value really well also.


It's reasonable when the alternative is 2-4x4090 at $2.2K each (or 2xA6000 at 4.5K each) + server grade hardware to host them. Realistically, the vast majority of people should just buy a subscription or API access if they need to run grotesquely large models. While large LLMs (up to about 200B params) work on an MBP, they aren't super fast, and you do have to be plugged in - they chew through your battery like it's nothing. I know this because I have a 128GB M3 MBP.


How large of a model can you use with your 128GB M3? Anything you can tell would be great to hear. Number of parameters, quantization, which model, etc.


I'm running 123B parameter Mistral Large with no issues. Larger models will run, too, but slowly. I wish Ollama had support for speculative decoding.


Thanks for the reply. Is that quantized? And what's the bit size of the floating point values in that model (apologies if I'm not asking the question correctly).


OP here. I almost got a decked-out Mac Studio before I returned it for an Asus ROG, as native Linux support, upgradability, and CUDA support are much more important to me.

The meagre VRAM in these Nvidia consumer GPUs is indeed painful, but with the increasing performance of smaller LLMs and fine-tuned models, I don't think 12GB, 14GB, and 16GB Nvidia GPUs, which offer much better performance than a Mac, can be easily dismissed.


How about heat dissipation, which I assume puts an MBP at a disadvantage compared to a PC?


A MacBook Pro has lower peak thermal output but proportionally lower performance. For a given task you’d be dissipating somewhat similar heat, the MacBook Pro would just be spreading it out over a longer period of time.

nVidia GPUs actually have similar efficiency, despite all of Apple's marketing. The difference is that the nVidia GPUs have a much higher ceiling.


I consider the RTX 4060 Ti the best entry-level GPU for running small models. It has 16GB of RAM, which gives you plenty of space for running large context windows, and Tensor Cores, which are crucial for inference. For larger models, probably multiple RTX 3090s, since you can buy them on the cheap on the second-hand market.
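
To put a rough number on the context-window point, here's some back-of-envelope KV-cache math (dimensions assume a Llama-3-8B-like config with an fp16 cache):

  layers, kv_heads, head_dim = 32, 8, 128   # Llama-3-8B-ish
  bytes_per_elem = 2                        # fp16/bf16 KV cache
  per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
  print(per_token)                   # 131072 bytes, ~128 KiB per token
  print(per_token * 16_384 / 2**30)  # ~2.0 GiB of cache at 16k context

So an 8B-class model at Q4 (roughly 4-5GB of weights) plus a 16k context still leaves plenty of headroom on a 16GB card.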

I don’t have experience with AMD cards so I can’t vouch for them.


I know nothing about gpus. Should I be assuming that when people say "ram" in the context of gpus they always mean vram?


Not always, because system RAM also has to be equally adequate, but mostly yes, it's about the total VRAM of the GPU(s).


“GPU with xx RAM” means VRAM, yes.


If you want to wait until the 5090s come out, you should see a drop in the price of the 30xx and 40xx series. Right now, shopping used, you can get two 3090s or two 4080s in your price range. Conventional wisdom says two 3090s would be better, but this is all highly dependent on what models you want to run. Basically the first requirement is to have enough VRAM to host all of your model on it, and secondarily, the quality of the GPU.

Have a look through Hugging Face to see which models interest you. A rough estimate for the amount of VRAM you need is half the model size plus a couple gigs. So, if using the 70B models interests you, two 4080s wouldn't fit it, but two 3090s would. If you're just interested in the 1B, 3B and 7B models (llama 3B is fantastic), you really don't need much at all. A single 3060 can handle that, and those are not expensive.
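
If it helps, that rule of thumb as a tiny calculator (the 0.5 bytes per parameter corresponds roughly to a Q4-ish quant):

  def vram_estimate_gb(params_billion, overhead_gb=2):
      # "half the model size plus a couple gigs"
      return params_billion * 0.5 + overhead_gb

  for p in (3, 7, 32, 70):
      print(p, "B ->", vram_estimate_gb(p), "GB")
  # 70B -> ~37GB: fits across two 24GB 3090s, not across two 16GB 4080s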


I've been running some of the larger models (like Llama 405B) via CPU on a Dell R820. It's got 32 Xeon cores (across 4 chips) and 256 GB of RAM. I bought it used for around $400. The memory is NUMA, so it makes sense if the computing is done on local data; not sure if Ollama supports that.

The tokens per second are very slow, but at least it can run the model.

I think the future will be increasingly powerful NPUs built into future CPUs. That will need to be paired with higher-bandwidth memory, maybe HBM, or silicon photonics for off-chip memory.


Nvidia for compatibility, and as much VRAM as you can afford. Shouldn't be hard to find a 3090 / 3090 Ti in your price range. I have had decent success with a base 3080, but the 10GB really limits the models you can run.


Get a new or used 3090; it has 24GB of RAM and it's below $1600.


A lot of moderate power users are running an undervolted pair of used 3090s on a 1000-1200W PSU. 48GB of VRAM lets you run 70B models at Q4 with 16k context.

If you use speculative decoding (a small model generates tokens that are verified by a larger model; I'm not sure on the specifics) you can get past 20 tokens per second, it seems. You can also fit 32B models like Qwen/Qwen Coder at Q6 with lots of context this way; with spec decoding, closer to 40+ tok/s.
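
For anyone who wants the intuition, here's a toy sketch of the accept/reject logic; both "models" are stand-in functions, and a real implementation verifies the whole draft in a single forward pass of the big model, which is where the speedup comes from:

  def draft_propose(prefix, k=4):
      # stand-in for the small, fast draft model proposing k tokens
      return ["d%d" % (len(prefix) + i) for i in range(k)]

  def target_next(prefix):
      # stand-in for what the large model would pick at this position
      n = len(prefix)
      return "d%d" % n if n % 3 else "t%d" % n

  tokens = []
  while len(tokens) < 12:
      accepted = []
      for tok in draft_propose(tokens):
          want = target_next(tokens + accepted)
          if tok == want:
              accepted.append(tok)   # models agree: keep the cheap token
          else:
              accepted.append(want)  # disagree: take the big model's token
              break                  # and restart drafting from here
      tokens += accepted
  print(tokens)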


Second the other comment: as much VRAM as possible. A 3060 has 12GB at a reasonable price point. (And it is not too limiting.)


There's a huge step up in capability with 16GB and 24GB, for not too much more. The 4060 Ti has a 16GB version, for example. On the cheap end, the Intel Arc does too.

Next major step up is 48GB and then hundreds of GB. But a lot of ML models target 16-24GB since that's in the grad-student price range.


At the 48GB level, L40S are great cards and very cost effective. If you aren’t aiming for constant uptime on several >70B models at once, they’re for sure the way to go!


> L40S are great cards and very cost effective

from https://www.asacomputers.com/nvidia-l40s-48gb-graphics-card....

nvidia l40s 48gb graphics card Our price: $7,569.10*

Not arguing against 'great', but the cost efficiency is questionable. For 10% of that you can get two used 3090s. The good thing about LLMs is they are sequential and should be easily parallelized. The model can be split into several sub-models, one per GPU. Then 2, 3, 4... GPUs should improve performance proportionally on big batches, and make it possible to run a bigger model on low-end hardware.
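
As a concrete example of that kind of split, llama.cpp (here via llama-cpp-python built with CUDA) can divide a model's layers across GPUs; the GGUF path and the 50/50 split below are placeholders:

  from llama_cpp import Llama

  llm = Llama(
      model_path="models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # example file
      n_gpu_layers=-1,          # offload every layer to the GPUs
      tensor_split=[0.5, 0.5],  # share the weights between GPU 0 and GPU 1
      n_ctx=8192,
  )
  out = llm("Q: What is 2+2? A:", max_tokens=8)
  print(out["choices"][0]["text"])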


Dual 3090s are way cheaper than the l40s though. You can even buy a few backups.


Yeah, I’m specifically responding to the parent’s comment about the 48GB tier. When you’re looking in that range, it’s usually because you want to pack in as much vram as possible into your rack space, so consumer level cards are off the table. I definitely agree multiple 3090 is the way to go if you aren’t trying to host models for smaller scale enterprise use, which is where 48GB cards shine.


Honestly, I think if you just want to do inferencing, the 7600 XT and RX 6800 have 16GB at $300 and $400 on Amazon. It's gonna be my stop-gap until whatever. The RX 6800 has better memory bandwidth than the 4060 Ti (I think it matches the 4070).


AMD GPUs are a fantastic deal until you hit a problem. Some models/frameworks it works great. Others, not so much.


For sure, but I think people on the fine-tuning/training/Stable Diffusion side are more concerned with that. They make a big fuss about this and basically talk people out of a perfectly good and well-priced 16GB VRAM card that literally works out of the box with Ollama or LM Studio for text inferencing.

Kind of one of the reasons AMD is a sleeper stock for me. If people only knew.



