Hacker News

You need roughly model size + n * (prompt + generated text), where n is the number of parallel users/requests.


It should be noted that the last part carries a fairly large factor that also scales with model size, because to run transformers efficiently you cache some of the intermediate activations from the attention blocks (the KV cache).

The factor is basically 2 * number of layers * number of embedding values (e.g. fp16) stored per token.
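As a rough back-of-the-envelope sketch of the two formulas above, the model dimensions below (layer count, embedding size, a 7B-parameter model, user count, and token budget) are illustrative assumptions, not numbers from the comments:

```python
# Sketch: estimate per-token KV-cache size and total serving memory.
# All model numbers are assumptions (roughly a 7B Llama-style model in fp16).
n_layers = 32
d_model = 4096          # embedding values cached per token, per layer, per K/V
bytes_per_value = 2     # fp16

# "2 * number of layers * number of embedding values" stored per token:
kv_values_per_token = 2 * n_layers * d_model
kv_bytes_per_token = kv_values_per_token * bytes_per_value   # 512 KiB here

# "model size + n * (prompt + generated text)":
model_bytes = 7e9 * bytes_per_value   # 7B parameters in fp16
n_users = 8                           # parallel requests (n)
tokens_per_user = 2048                # prompt + generated text, per request

total_bytes = model_bytes + n_users * tokens_per_user * kv_bytes_per_token
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Total memory:      {total_bytes / 1e9:.1f} GB")
```

With these assumed numbers, the KV cache for 8 concurrent 2048-token requests adds roughly 8.6 GB on top of the ~14 GB of weights, which is why batch size matters so much for serving memory.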





