
Even on CPU, a Q4-quantized Llama model of 8B parameters or smaller should start responding within about 5 seconds (proportionally faster for smaller models), and then stream at several tokens per second.

There are a lot of things to criticize about LLMs (the answer is quite likely to ignore what you're actually asking, for example), but your speed problem looks like a configuration issue rather than an inherent limitation. Are you calling the API in streaming mode?
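For example, if you're running against an OpenAI-compatible local endpoint (llama.cpp's server and Ollama both expose one), streaming looks roughly like this in Python. The base_url, model name, and api_key below are placeholder assumptions; substitute whatever your local setup actually uses:

    # Minimal streaming sketch against an assumed OpenAI-compatible local endpoint.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    stream = client.chat.completions.create(
        model="llama-3-8b-instruct-q4",   # whatever model your server has loaded
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
        stream=True,                      # request tokens as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # first tokens arrive within seconds

Without stream=True the client blocks until the entire completion is finished, which on CPU can easily feel like the model is hanging even though the generation speed itself is normal.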


