With 4-bit quantization it will take ~15 GB, so it fits easily. On 96 GB you can not only run the 30B model, you can even finetune it. As I understand it, these models were trained in float16, so the full 30B model takes 60 GB of RAM.
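The arithmetic, for what it's worth (a rough sketch that counts weights only, not activations or KV cache; real 4-bit formats add a small overhead for scales):

```python
# Rough weight-memory math for a 30B-parameter model.
params = 30e9

bytes_per_param = {
    "float16": 2.0,
    "int8":    1.0,
    "int4":    0.5,   # 4 bits packs two weights per byte (plus small scale overhead)
}

for fmt, b in bytes_per_param.items():
    print(f"{fmt:>8}: ~{params * b / 1e9:.0f} GB")
# float16: ~60 GB, int8: ~30 GB, int4: ~15 GB
```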
So you’re saying I could make the full model run on a 16-core Ryzen with 64 GB of DDR4? I have an 8 GB VRAM 3070, but based on this thread it sounds like the CPU might have better perf due to the RAM?
These are my observations from playing with this over the weekend.
1. There is no throughput benefit to running on GPU unless you can fit all the weights in VRAM. Otherwise, moving the weights eats up any benefit you get from the faster compute (rough numbers in the sketch after this list).
2. The quantized models do worse than non-quantized smaller models, so currently they aren't worth using for most use cases. My hope is that more sophisticated quantization methods (like GPTQ) will resolve this.
3. Much like using raw GPT-3, you need to put a lot of thought into your prompts. You can really tell it hasn't been 'aligned' or whatever the kids are calling it these days.
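To put rough numbers on point 1: token generation is bandwidth-bound, so time per token is roughly (bytes of weights touched) / (bandwidth of wherever the weights live). The figures below are ballpark assumptions, not benchmarks.

```python
# Back-of-envelope for point 1: if the weights don't fit in VRAM, every token
# has to stream them over PCIe, and that transfer dominates.
# Assumed numbers (rough, hardware-dependent): PCIe 4.0 x16 ~25 GB/s effective,
# 3070 VRAM ~450 GB/s, dual-channel DDR4 ~50 GB/s.
weights_gb = 15           # 4-bit 30B model, roughly
pcie_gbps  = 25
vram_gbps  = 450
ddr4_gbps  = 50

print(f"stream over PCIe per token : {weights_gb / pcie_gbps:.2f} s")   # ~0.60 s
print(f"read from VRAM per token   : {weights_gb / vram_gbps:.3f} s")   # ~0.033 s
print(f"read from system RAM (CPU) : {weights_gb / ddr4_gbps:.2f} s")   # ~0.30 s
```

Which is why, on these assumed numbers, a CPU reading weights out of system RAM can beat a GPU that has to stream them over the bus every token.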
This might be naïve, but couldn’t you just mmap the weights on an Apple Silicon MacBook? Why do you need to load the entire set of weights into memory at once?
Every generated token requires a pass over the entire model. For the largest model that means 60 GB of data, or at least 10 seconds per token on the fastest SSDs. Very heavy SSD wear from that many read operations would quickly burn out even enterprise drives, too.
Assuming a sensible, somewhat linear layout, using mmap to map the weights would let you keep a lot of them in memory, with potentially fairly minimal page-in overhead.
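In Python terms the idea looks something like this; the file name, offset and tensor size are made up for illustration, and a real loader would also parse a header with per-tensor offsets and dtypes:

```python
# Minimal sketch of lazily mapping a weights file instead of reading it all up front.
import mmap
import numpy as np

with open("model.bin", "rb") as f:                      # hypothetical weights file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View a slice of the file as a tensor without copying it into RAM; the kernel
# pages chunks in on first touch and can evict them under memory pressure.
offset, n_elems = 0, 4096 * 4096                        # hypothetical layer position/size
layer = np.frombuffer(mm, dtype=np.float16, count=n_elems, offset=offset)
print(layer[:4])                                        # first touch triggers the page-in
```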