
Yes, you can run inference at decent speeds on CPU with llama.cpp. A token is about 0.75 words, so you can see lots of people getting 4-8 words/s on their CPUs: https://github.com/ggerganov/llama.cpp/issues/34
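
To make the conversion concrete, here's a tiny C sketch that turns the tokens/s numbers llama.cpp reports into rough words/s, assuming the ~0.75 words-per-token rule of thumb above (the sample speeds are just hypothetical placeholders):

    #include <stdio.h>

    int main(void) {
        const double words_per_token = 0.75;   /* rule-of-thumb assumption */
        double tokens_per_s[] = {5.0, 10.0};   /* hypothetical measured speeds */
        for (int i = 0; i < 2; i++) {
            printf("%.1f tok/s ~= %.1f words/s\n",
                   tokens_per_s[i], tokens_per_s[i] * words_per_token);
        }
        return 0;
    }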

There are a lot of optimizations that can still be done. Here's one with a potential 15X AVX speedup, for example: https://github.com/ggerganov/llama.cpp/pull/996
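
For a sense of why AVX helps so much: most of the time goes into dot products inside the matrix multiplies, and wide SIMD lets you process 8 floats per instruction instead of 1. This is a minimal illustrative sketch of that idea, not the actual code from that PR (compile with -mavx):

    #include <immintrin.h>
    #include <stdio.h>

    /* Dot product of two float arrays, 8 elements at a time with AVX. */
    static float dot_avx(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        /* horizontal sum of the 8 accumulator lanes */
        float tmp[8];
        _mm256_storeu_ps(tmp, acc);
        float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3]
                  + tmp[4] + tmp[5] + tmp[6] + tmp[7];
        /* scalar tail for leftover elements */
        for (; i < n; i++) sum += a[i] * b[i];
        return sum;
    }

    int main(void) {
        float a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        float b[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        printf("%f\n", dot_avx(a, b, 10));  /* expect 55.0 */
        return 0;
    }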


