
Just a quick note: for most people (e.g., wanting to chat with a model in real time), I don't think running locally on a CPU is a very viable option unless you're very patient. On my 16-core Ryzen 5950X / 64GB DDR4-3800 system, running llama-2-70b-chat (q4_K_M) with llama.cpp (eb542d3) on a 100-token generation test (life's too short to try max context), I got 1.25 tokens/second (~1 word/second) of output.
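
For anyone wanting to reproduce a similar CPU-only test, the invocation is roughly the following (the model filename and prompt are placeholders, and GGML-era builds of that vintage may also want `-gqa 8` for 70B models):

    # CPU-only 100-token generation test with llama.cpp's main binary
    ./main -m ./models/llama-2-70b-chat.ggmlv3.q4_K_M.bin \
        -t 16 -n 100 \
        -p "Write a short note about local LLM inference."

`-t 16` matches the 16 physical cores and `-n 100` caps generation at 100 tokens.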

Compiling with cuBLAS and running with `-ngl 0` (~400MB of VRAM usage, no layers loaded) makes no perf difference. The most layers I can load on a headless 24GB 4090 is 45 of 83 (running `-ngl 45 --low-vram`), which brings speed up to 2.5 t/s. A little less painful, but still not super pleasant. For reference, people have reported 12-15 t/s with 2x4090s using exllama (GPTQ), running a 14,20 GPU split and fitting a full (NTK RoPE-scaled) 16K context into 48GB of VRAM.
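
If you want to try the partial-offload setup, the build and run steps look something like this (again a sketch, not exact; filename is a placeholder and flag behavior can change between llama.cpp versions):

    # build with cuBLAS (CUDA) support
    make clean && LLAMA_CUBLAS=1 make

    # offload 45 of 83 layers to a single 24GB 4090
    ./main -m ./models/llama-2-70b-chat.ggmlv3.q4_K_M.bin \
        -ngl 45 --low-vram -t 16 -n 100 \
        -p "Write a short note about local LLM inference."

`--low-vram` reduces VRAM usage at some speed cost, which is what lets 45 layers fit alongside the rest of the working set on a 24GB card.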


