The benefits are primarily price: 96GB of VRAM would be 4x 3090/4090 (~$6K) or 2x A6000 (~$8-14K) cards (also, it looks like you can buy an 80GB A100 PCIe for about $15K atm). While Apple is using LPDDR5, it is also running a lot more memory channels than comparable PC hardware: the M2 has 100GB/s, the M2 Pro 200GB/s, the M2 Max 400GB/s, and the M2 Ultra 800GB/s (8-channel) of memory bandwidth. The Nvidia cards are around 900GB/s-1TB/s (an A100 PCIe gets up to ~1.5TB/s).
In practice, on quantized versions of the larger open LLMs, an M2 Ultra can currently run inference about 2-4x faster than the best PC CPU setups I've seen (mega Epyc systems), but also about 2-4x slower than 2x 4090s.
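For single-stream generation the usual rule of thumb is that decode is memory-bandwidth bound, so tokens/s is roughly effective bandwidth divided by the bytes touched per token (about the size of the quantized model). A rough back-of-envelope sketch; the efficiency factor, model size, and bandwidth figures here are just assumptions:

```python
# Back-of-envelope: batch-1 decode is roughly memory-bandwidth bound,
# so tokens/s ~= effective bandwidth / bytes touched per token (~model size).
# All numbers below are assumptions for illustration, not benchmarks.

def est_tokens_per_sec(bandwidth_gb_s, model_size_gb, efficiency=0.6):
    """Crude upper bound on single-stream decode speed."""
    return bandwidth_gb_s * efficiency / model_size_gb

model_gb = 40  # e.g. a ~70B model at ~4.5 bits/weight (assumed)
for name, bw in [("M2 Ultra", 800), ("RTX 4090", 1008), ("8-ch DDR5 CPU", 300)]:
    print(f"{name:>14}: ~{est_tokens_per_sec(bw, model_gb):.0f} tok/s")
```

Real numbers vary a lot with the quant format, framework, and batch size, so treat this as an order-of-magnitude estimate rather than a benchmark.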
That is useful info, but still does not quite address the question.
The question was how memory type, memory amount, and bandwidth factor into actual performance. So let me rephrase: given a budget of $X, what performance/limitations should you expect with:
- 256GB of non-unified DDR5 in a PC, just CPU
- 128GB of DDR5 for an APU
- 96GB of unified LPDDR5 (e.g. Apple Silicon)
- Whatever Nvidia will sell you for $X.
An answer of "just compare a single memory bandwidth number" seems a bit short. Sure, more bandwidth helps, but is half as much RAM at double bandwidth better or worse?
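To make the tradeoff concrete, this is the kind of back-of-envelope comparison I mean (every number here is made up for illustration): capacity decides which models fit at all, and bandwidth then roughly sets the speed.

```python
# Crude way to frame the RAM-vs-bandwidth question (all numbers are assumptions):
# capacity gates *which* models/quants fit, bandwidth then bounds decode speed.

def fits(model_gb, ram_gb, overhead_gb=8):
    # needs room for weights + KV cache / OS overhead (overhead_gb is a guess)
    return model_gb + overhead_gb <= ram_gb

configs = {"256GB @ 300GB/s": (256, 300), "128GB @ 600GB/s": (128, 600)}
for model_gb in (40, 140):  # e.g. a ~70B q4 vs a much larger quant (rough sizes)
    for name, (ram, bw) in configs.items():
        if fits(model_gb, ram):
            print(f"{model_gb}GB model on {name}: ~{0.6 * bw / model_gb:.1f} tok/s")
        else:
            print(f"{model_gb}GB model on {name}: does not fit")
```

In other words, half the RAM at double the bandwidth is faster for everything that still fits, and useless for anything that doesn't.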
No idea; I just said I wanted to try this out and see how it performs.
Doesn’t VRAM amount limit the size of the model you can load? I’m not talking about training, just inference. I also pointed out that these are not the greatest GPUs available, just that the advantage they have is being able to address more memory, since on those machines memory is a shared block between the system and the GPU.
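Roughly, the inference footprint is parameter count times bits per weight, plus KV cache and overhead, which is why being able to address one big shared pool matters. A quick sketch; the parameter counts and quant levels are just examples:

```python
# Rough check of whether a model's weights fit in (V)RAM for inference only.
# Parameter counts and bits-per-weight here are assumptions for illustration.

def weights_gb(n_params_billion, bits_per_weight):
    return n_params_billion * bits_per_weight / 8  # billions of params -> GB

for params_b, bits in [(70, 16), (70, 4), (180, 4)]:
    gb = weights_gb(params_b, bits)
    print(f"{params_b}B @ {bits}-bit: ~{gb:.0f}GB of weights "
          f"(+ KV cache/overhead) -> under 96GB unified? {gb < 96}")
```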