With 4-bit quantization it will take ~15 GB, so it fits easily. On 96 GB you can not only run the 30B model, you can even finetune it. As I understand it, these models were trained in float16, so the full 30B model takes 60 GB of RAM.
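The arithmetic, for what it's worth (a rough sketch that counts weights only, not activations or KV cache; real 4-bit formats add a small overhead for scales):

```python
# Rough weight-memory math for a 30B-parameter model.
params = 30e9

bytes_per_param = {
    "float16": 2.0,
    "int8":    1.0,
    "int4":    0.5,   # 4 bits packs two weights per byte (plus small scale overhead)
}

for fmt, b in bytes_per_param.items():
    print(f"{fmt:>8}: ~{params * b / 1e9:.0f} GB")
# float16: ~60 GB, int8: ~30 GB, int4: ~15 GB
```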
So you’re saying I could make the full model run on a 16-core Ryzen with 64 GB of DDR4? I have an 8 GB VRAM 3070, but based on this thread it sounds like the CPU might have better perf due to the RAM?
These are my observations from playing with this over the weekend.
1. There is no throughput benefit to running on GPU unless you can fit all the weights in VRAM. Otherwise, moving the weights eats up any benefit you get from the faster compute (rough numbers in the sketch after this list).
2. The quantized models do worse than non-quantized smaller models, so currently they aren't worth using for most use cases. My hope is that more sophisticated quantization methods (like GPTQ) will resolve this.
3. Much like using raw GPT-3, you need to put a lot of thought into your prompts. You can really tell it hasn't been 'aligned' or whatever the kids are calling it these days.
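To put rough numbers on point 1: token generation is bandwidth-bound, so time per token is roughly (bytes of weights touched) / (bandwidth of wherever the weights live). The figures below are ballpark assumptions, not benchmarks.

```python
# Back-of-envelope for point 1: if the weights don't fit in VRAM, every token
# has to stream them over PCIe, and that transfer dominates.
# Assumed numbers (rough, hardware-dependent): PCIe 4.0 x16 ~25 GB/s effective,
# 3070 VRAM ~450 GB/s, dual-channel DDR4 ~50 GB/s.
weights_gb = 15           # 4-bit 30B model, roughly
pcie_gbps  = 25
vram_gbps  = 450
ddr4_gbps  = 50

print(f"stream over PCIe per token : {weights_gb / pcie_gbps:.2f} s")   # ~0.60 s
print(f"read from VRAM per token   : {weights_gb / vram_gbps:.3f} s")   # ~0.033 s
print(f"read from system RAM (CPU) : {weights_gb / ddr4_gbps:.2f} s")   # ~0.30 s
```

Which is why, on these assumed numbers, a CPU reading weights out of system RAM can beat a GPU that has to stream them over the bus every token.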
This might be naïve, but couldn’t you just mmap the weights on an Apple Silicon MacBook? Why do you need to load the entire set of weights into memory at once?
Every generated token requires a pass over the entire model. For the largest model that means 60 GB of data, or at least 10 seconds per token on the fastest SSDs. Very heavy SSD wear from that many read operations would quickly burn out even enterprise drives, too.
Assuming a sensible, somewhat linear layout, using mmap to map the weights would let you keep a lot of them in memory, with potentially fairly minimal page-in overhead.
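In Python terms the idea looks something like this; the file name, offset and tensor size are made up for illustration, and a real loader would also parse a header with per-tensor offsets and dtypes:

```python
# Minimal sketch of lazily mapping a weights file instead of reading it all up front.
import mmap
import numpy as np

with open("model.bin", "rb") as f:                      # hypothetical weights file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View a slice of the file as a tensor without copying it into RAM; the kernel
# pages chunks in on first touch and can evict them under memory pressure.
offset, n_elems = 0, 4096 * 4096                        # hypothetical layer position/size
layer = np.frombuffer(mm, dtype=np.float16, count=n_elems, offset=offset)
print(layer[:4])                                        # first touch triggers the page-in
```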