
"Running" and "acceptable inference speed and quality" are two different constraints, particularly at scale/production.


I don't understand what you're trying to say?

From what I've read, a 4090 should blow an A100 away if you can fit within ~22GB of VRAM, which a 7B model should do comfortably.
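
Rough back-of-envelope sanity check on that (my own assumed numbers, not measured: bytes per parameter by precision, plus a flat ~2 GB allowance for KV cache / activations / CUDA context):

    # Estimate VRAM for a 7B-parameter model at different precisions.
    def estimate_vram_gb(n_params: float, bytes_per_param: float,
                         overhead_gb: float = 2.0) -> float:
        """Weights + a flat allowance for KV cache, activations, CUDA context."""
        weights_gb = n_params * bytes_per_param / 1024**3
        return weights_gb + overhead_gb

    for precision, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"7B @ {precision}: ~{estimate_vram_gb(7e9, nbytes):.1f} GB")

    # fp16 -> ~15 GB, int8 -> ~8.5 GB, int4 -> ~5.3 GB, all within a 4090's
    # 24 GB (roughly 22 GB usable in practice).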

And the latency (along with variability and availability) of the OpenAI API is terrible because of the load they're getting.



