
So I'm thinking: inference seems mostly memory bound. With a fast CPU (for example, a 7950X with 16 cores) and 256GB of RAM (which seems to be the max for that platform), shouldn't that be enough to run even the largest models, albeit a bit slowly?

It seems that AMD EPYC CPUs support terabytes of RAM, and some are as cheap as 1000 EUR. Why not just run the full R1 model on that? It seems like it would be much cheaper than multiple of those insanely priced NVIDIA cards.




The bottleneck is mainly memory bandwidth. AMD EPYC hardware is appealing for local inference because it has higher memory bandwidth than desktop gear (8-12 memory channels vs. 2 on almost everything else), but it's not as fast as the Apple architectures and nowhere near VRAM speeds. If you want to drastically exceed ~3-5 tokens/s on 70B Q4 models, you usually still need GPUs.
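
To make that concrete, here's a back-of-the-envelope sketch (Python, with illustrative numbers rather than measurements): every generated token has to stream all active weights from memory, so bandwidth divided by model size gives an upper bound on decode speed.

    def max_tokens_per_sec(bandwidth_gbs, model_gb):
        # Upper bound for a purely bandwidth-bound decode: each token
        # reads every weight once. Ignores KV cache and compute time.
        return bandwidth_gbs / model_gb

    model_gb = 40.0  # ~70B params at Q4: roughly 0.5 bytes/param plus overhead
    for name, bw in [("dual-channel DDR5 desktop", 80.0),
                     ("12-channel EPYC DDR5", 460.0),
                     ("RTX 4090 GDDR6X", 1008.0)]:
        print(f"{name}: at most {max_tokens_per_sec(bw, model_gb):.1f} tok/s")

Real-world numbers land well below these ceilings on CPUs, which is why ~3-5 tok/s is a realistic EPYC figure despite an ~11 tok/s theoretical bound.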


This was beautifully illustrated in the recent Phoronix 5090 LLM benchmark[1], which I noted here[2]. The tested GPUs showed an almost perfectly linear relationship between generated tokens/s and GB/s of memory bandwidth, except for the 5090, where it dipped slightly.

I guess the 5090 either started, ever so slightly, to become compute limited as well, or hit some other overhead limitation.

[1]: https://www.phoronix.com/review/nvidia-rtx5090-llama-cpp

[2]: https://news.ycombinator.com/item?id=42847284


On Zen 5 you also get AVX-512, which llamafile takes advantage of for drastically improved prompt-processing speeds, at least. And the 12-channel EPYCs actually seem to have more memory bandwidth available than the Apple M series, especially considering it's all available to the CPU, as opposed to just some portion of it.


Maybe EPYC can make better use of the available bandwidth, but for comparison I have a water cooled Xeon W5-3435X running at 4.7GHz all-core with 8 channels of DDR5-6400, and CPU inference is still dog slow. With a 70B Q8 model I get 1 tok/s, which is a lot less than I thought I would get with 410GB/s max RAM bandwidth. If I run on 5x A4000s I get 6.1 tok/s, which makes sense... 448GB/s / 70GB = 6.4 tok/s max.
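
A quick sanity check on those figures (a sketch plugging in the numbers from the comment above; the "ceiling" is just peak bandwidth divided by model size):

    model_gb = 70.0  # 70B model at Q8, ~1 byte/param
    setups = [
        ("Xeon W5-3435X, 8ch DDR5-6400", 409.6, 1.0),  # peak GB/s, measured tok/s
        ("5x RTX A4000",                 448.0, 6.1),
    ]
    for name, bw, measured in setups:
        ceiling = bw / model_gb
        print(f"{name}: ceiling {ceiling:.1f} tok/s, measured "
              f"{measured} tok/s ({measured / ceiling:.0%})")

The GPUs run at ~95% of their bandwidth ceiling while the CPU reaches ~17% of its own, suggesting the Xeon never actually saturates its memory channels during inference.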


Very strange, as I get 2 tok/s with a 14B Q8 model on an old i5-12400 with DDR4.


It’s more expensive, but Zen4 Threadripper Pro is probably the way to go on that front. 8 memory channels, with DIMMs available up to DDR5-7200 for 8x32GB (256GB), or DDR5-6800 for 8x48GB (384GB). It’ll set you back ~$3k for the RAM and ~$6k for a CPU with 8 CCDs (the 7985WX, at least), and then ~$1k for a motherboard and however much you want to spend on NVMe. Basically ~$10k for a 384GB DDR5 system with ~435GB/s actual bandwidth. Not quite as fast as the 192GB Apple machines, but twice as much memory and more compute for “only” a few thousand more.
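
For reference, the ~435GB/s figure matches the theoretical peak of eight DDR5-6800 channels (a quick sketch; sustained bandwidth in practice will be somewhat lower):

    channels = 8
    transfers_per_sec = 6800e6  # DDR5-6800
    bytes_per_transfer = 8      # 64-bit channel
    peak_gbs = channels * transfers_per_sec * bytes_per_transfer / 1e9
    print(f"{peak_gbs:.1f} GB/s")  # -> 435.2 GB/s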


At these prices, I would just get 2x Digits for $6k and have 256GB.


I have a feeling that Digits will probably get sold out and pricing will get hiked WAY up.


Is it confirmed that you can get 256GB of VRAM for that amount? My understanding is that Digits pricing will start at $3k for some basic config.


What they meant is buying two whole separate computers.


I understand. It is still unclear if you can get 128GB of VRAM for $3k.


Well, I mean, the press release is pretty unambiguous.

>Each Project DIGITS features 128GB of unified, coherent memory and up to 4TB of NVMe storage.

Even if $3k is only the starting price, it doesn't sound like spending more buys you more memory.


OK, but it is not clear what kind of RAM that is, how many memory channels, etc. If the goal is just to have 128GB of some RAM, that could be achieved for a few hundred dollars.


Fine, but at that point you're arguing about the concept of the product. It's billed as a computer for AI and you're saying that it might not be more suitable for AI than a regular PC.


It is possible that one could build a better PC than Digits for AI. We will see once they release it.


FWIW Threadrippers go up to 1TB and Threadripper Pro up to 2TB. That's even in the lowest model of each series. (I know this because it happens to be the chip I have. Not saying you shouldn't go for Epyc if it works out better.)


Have you tried running the full R1 model with that? People in sibling comments mention high-end EPYCs for a $10K machine, but I’m curious whether it’s possible to make a $1-2K machine that could still run those big models simply because they fit in RAM.


I spent about $3000 on my machine, have the cheapest Threadripper CPU and 256GB of RAM, so no, 600GB won't fit in RAM on a $2K machine.

But everyone is using the distilled models, which are much smaller.
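
For a rough sense of what fits where (a sketch assuming ~1 byte/param at Q8 and ~0.55 bytes/param at Q4, ignoring KV cache and runtime overhead):

    models = {"R1 671B": 671, "distill 70B": 70, "distill 32B": 32, "distill 14B": 14}
    for name, billions in models.items():
        print(f"{name}: ~{billions}GB at Q8, ~{billions * 0.55:.0f}GB at Q4")

So even at Q4 the full R1 wants roughly 370GB, while the 32B and 14B distills fit comfortably on a cheap machine.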



