But for code completion in an IDE, where the model has to react as you type, every 100 milliseconds of added response time is noticeable.
Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code completion in an IDE.
The smaller the model, the less has to be read from RAM for every single token: decoding is memory-bandwidth bound, so generating each token means streaming essentially all of the weights through the memory bus once.
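For a rough sense of scale, here's a back-of-envelope sketch. The numbers are illustrative assumptions, not measurements: fp16 weights (2 bytes per parameter) and ~1 TB/s of memory bandwidth, in the ballpark of a 24GB consumer card.

```python
# Back-of-envelope decode latency: generating one token means streaming
# every weight through the memory bus once. All numbers are assumptions
# for illustration: fp16 weights (2 bytes/param), ~1 TB/s bandwidth.
def per_token_latency_ms(params_billion: float,
                         bytes_per_param: float = 2.0,
                         bandwidth_gb_s: float = 1000.0) -> float:
    weight_gb = params_billion * bytes_per_param
    return weight_gb / bandwidth_gb_s * 1000.0

for size_b in (7, 3, 1):
    ms = per_token_latency_ms(size_b)
    print(f"{size_b}B model: ~{ms:.0f} ms/token (~{1000 / ms:.0f} tok/s) lower bound")
```

By this estimate a 7B model tops out around 70 tokens/second, while a 1B model could hit several hundred, before counting any other overhead.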
Batching mixes up this calculus a bit: one pass over the weights can serve many sequences at once, so total throughput scales with batch size even though each decode step takes the same wall-clock time, as the sketch below shows.
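A minimal sketch of that amortization, under the same assumed numbers as above:

```python
# With batching, one pass over the weights produces one token for every
# sequence in the batch, so aggregate throughput scales with batch size
# (until compute or KV-cache bandwidth saturates), while the latency of
# each individual decode step stays the same. Same assumptions as before.
def decode_throughput_tok_s(params_billion: float, batch_size: int,
                            bytes_per_param: float = 2.0,
                            bandwidth_gb_s: float = 1000.0) -> float:
    step_s = params_billion * bytes_per_param / bandwidth_gb_s
    return batch_size / step_s

print(decode_throughput_tok_s(7, 1))   # ~71 tok/s total
print(decode_throughput_tok_s(7, 8))   # ~571 tok/s total, same per-step latency
```

For a single user typing in an IDE, though, the effective batch size is 1, so batching helps a shared server's throughput far more than it helps completion latency.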
Would that be enough? shrug