
So I'm thinking: inference seems mostly memory bound. With a fast CPU (for example, a 7950X with 16 cores) and 256GB of RAM (which seems to be the max for that platform), shouldn't that be enough to run even the largest models, albeit a bit slowly?

It seems that AMD EPYC CPUs support terabytes of RAM, and some are as cheap as 1000 EUR. Why not just run the full R1 model on that? It seems like it would be much cheaper than multiple of those insanely priced NVIDIA cards.




The bottleneck is mainly memory bandwidth. AMD EPYC hardware is appealing for local inference because it has higher memory bandwidth than desktop gear (8-12 memory channels vs. 2 on almost everything else), but it's not as fast as the Apple architectures and nowhere near VRAM speeds. If you want to drastically exceed ~3-5 tokens/s on 70B Q4 models, you usually still need GPUs.
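
To make that concrete, here's a back-of-the-envelope sketch (Python, with illustrative numbers rather than measurements): every generated token has to stream all active weights from memory, so bandwidth divided by model size gives an upper bound on decode speed.

    def max_tokens_per_sec(bandwidth_gbs, model_gb):
        # Upper bound for a purely bandwidth-bound decode: each token
        # reads every weight once. Ignores KV cache and compute time.
        return bandwidth_gbs / model_gb

    model_gb = 40.0  # ~70B params at Q4: roughly 0.5 bytes/param plus overhead
    for name, bw in [("dual-channel DDR5 desktop", 80.0),
                     ("12-channel EPYC DDR5", 460.0),
                     ("RTX 4090 GDDR6X", 1008.0)]:
        print(f"{name}: at most {max_tokens_per_sec(bw, model_gb):.1f} tok/s")

Real-world numbers land well below these ceilings on CPUs, which is why ~3-5 tok/s is a realistic EPYC figure despite an ~11 tok/s theoretical bound.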


This was beautifully illustrated in the recent Phoronix 5090 LLM benchmark[1], which I noted here[2]. The tested GPUs showed an almost perfectly linear relationship between generated tokens/s and GB/s of memory bandwidth, except for the 5090, where it dipped slightly.

I guess the 5090 either started, ever so slightly, to become compute limited as well, or hit some other overhead limitation.

[1]: https://www.phoronix.com/review/nvidia-rtx5090-llama-cpp

[2]: https://news.ycombinator.com/item?id=42847284


On Zen 5 you also get AVX-512, which llamafile takes advantage of for drastically improved prompt-processing speeds, at least. And the 12-channel EPYCs actually seem to have more memory bandwidth available than the Apple M series, especially considering it's all available to the CPU, as opposed to just some portion of it.


Maybe EPYC can make better use of the available bandwidth, but for comparison I have a water cooled Xeon W5-3435X running at 4.7GHz all-core with 8 channels of DDR5-6400, and CPU inference is still dog slow. With a 70B Q8 model I get 1 tok/s, which is a lot less than I thought I would get with 410GB/s max RAM bandwidth. If I run on 5x A4000s I get 6.1 tok/s, which makes sense... 448GB/s / 70GB = 6.4 tok/s max.
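
A quick sanity check on those figures (a sketch plugging in the numbers from the comment above; the "ceiling" is just peak bandwidth divided by model size):

    model_gb = 70.0  # 70B model at Q8, ~1 byte/param
    setups = [
        ("Xeon W5-3435X, 8ch DDR5-6400", 409.6, 1.0),  # peak GB/s, measured tok/s
        ("5x RTX A4000",                 448.0, 6.1),
    ]
    for name, bw, measured in setups:
        ceiling = bw / model_gb
        print(f"{name}: ceiling {ceiling:.1f} tok/s, measured "
              f"{measured} tok/s ({measured / ceiling:.0%})")

The GPUs run at ~95% of their bandwidth ceiling while the CPU reaches ~17% of its own, suggesting the Xeon never actually saturates its memory channels during inference.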


Very strange, as I get 2 tok/s with a 14B Q8 model on an old i5-12400 with DDR4.


It’s more expensive, but Zen4 Threadripper Pro is probably the way to go on that front. 8 memory channels, with DIMMs available up to DDR5-7200 for 8x32GB (256GB), or DDR5-6800 for 8x48GB (384GB). It’ll set you back ~$3k for the RAM and ~$6k for a CPU with 8 CCDs (the 7985WX, at least), and then ~$1k for a motherboard and however much you want to spend on NVMe. Basically ~$10k for a 384GB DDR5 system with ~435GB/s actual bandwidth. Not quite as fast as the 192GB Apple machines, but twice as much memory and more compute for “only” a few thousand more.
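
For reference, the ~435GB/s figure matches the theoretical peak of eight DDR5-6800 channels (a quick sketch; sustained bandwidth in practice will be somewhat lower):

    channels = 8
    transfers_per_sec = 6800e6  # DDR5-6800
    bytes_per_transfer = 8      # 64-bit channel
    peak_gbs = channels * transfers_per_sec * bytes_per_transfer / 1e9
    print(f"{peak_gbs:.1f} GB/s")  # -> 435.2 GB/s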


At these prices, I would just get 2x Digits for $6k and have 256GB.


I have a feeling that Digits will probably get sold out and pricing will get hiked WAY up.


Is it confirmed that you can get 256GB of VRAM for that amount? My understanding is that Digits pricing will start at $3k for some basic config.


What they meant is buying two whole separate computers.


I understand. It is still unclear if you can get 128GB of VRAM for $3k.


Well, I mean, the press release is pretty unambiguous.

>Each Project DIGITS features 128GB of unified, coherent memory and up to 4TB of NVMe storage.

Even if $3k is only the starting price, it doesn't sound like spending more buys you more memory.


OK, but it is not clear what kind of RAM that is, how many memory channels, etc. If the goal is just to have 128GB of some RAM, that could be achieved for a few hundred dollars.


Fine, but at that point you're arguing about the concept of the product. It's billed as a computer for AI and you're saying that it might not be more suitable for AI than a regular PC.


It is possible that one could build a better PC than Digits for AI. We will see once they release it.


FWIW Threadrippers go up to 1TB and Threadripper Pro up to 2TB. That's even in the lowest model of each series. (I know this because it happens to be the chip I have. Not saying you shouldn't go for Epyc if it works out better.)


Have you tried running the full R1 model with that? People in sibling comments mention high-end EPYCs for a $10K machine, but I’m curious whether it’s possible to make a $1-2K machine that could still run those big models simply because they fit in RAM.


I spent about $3000 on my machine, have the cheapest Threadripper CPU and 256GB of RAM, so no, 600GB won't fit in RAM on a $2K machine.

But everyone is using the distilled models, which are much smaller.
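
For a rough sense of what fits where (a sketch assuming ~1 byte/param at Q8 and ~0.55 bytes/param at Q4, ignoring KV cache and runtime overhead):

    models = {"R1 671B": 671, "distill 70B": 70, "distill 32B": 32, "distill 14B": 14}
    for name, billions in models.items():
        print(f"{name}: ~{billions}GB at Q8, ~{billions * 0.55:.0f}GB at Q4")

So even at Q4 the full R1 wants roughly 370GB, while the 32B and 14B distills fit comfortably on a cheap machine.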



