Even on CPU, you should see the start of a response within ~5 seconds for a Q4-quantized Llama model of 8B parameters or smaller (proportionally faster for smaller models), which then streams at several tokens per second.
There's a lot to criticize about LLMs (the answer is quite likely to ignore what you're actually asking, for example), but your speed problem looks like a config issue rather than an inherent limitation. Are you calling the API in streaming mode?
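If not, that alone would explain the perceived slowness: without streaming you wait for the entire completion before seeing a single character. As a rough sketch with an OpenAI-compatible local server (llama.cpp's llama-server and Ollama both expose one), assuming it's listening on localhost:8080 and using the `openai` Python client; the model name here is just a placeholder for whatever you actually loaded:

```python
# Sketch: streaming a chat completion from a local OpenAI-compatible server.
# Assumptions: server at localhost:8080/v1 (llama-server default), `openai`
# client installed; adjust base_url and model to match your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct-q4",  # placeholder; use the name your server reports
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,  # tokens arrive as they are generated, not after the full answer
)

# Print each token fragment as soon as it comes in.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

With `stream=True` the first tokens should show up within a few seconds; without it, the call only returns once the whole answer is finished, which on CPU can easily feel like a minute of dead silence.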