
> Typical savings: 60-90% on most requests, since Gemini Flash is often free/cheapest, but you still get Claude or GPT-4 when needed.

This claim seems overstated. Accurately routing arbitrary prompts to the cheapest viable model is a hard problem. If it were reliably solvable, it would fundamentally disrupt the pricing models of OpenAI and Anthropic. In practice, you'd either sacrifice quality on edge cases or end up re-running failed requests on pricier models anyway, eating into those "savings".
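To make the arithmetic concrete, here's a minimal sketch of a cheap-first router with fallback. All of it is hypothetical: the model names, the per-token prices, and especially is_acceptable, which is the hard problem the headline claim glosses over.

    # A minimal two-tier routing sketch; names, prices, and the
    # acceptability check are illustrative assumptions, not real figures.
    CHEAP, CHEAP_PRICE = "flash-tier", 0.10       # $ per 1M tokens, made up
    FRONTIER, FRONTIER_PRICE = "frontier", 15.00  # made up

    def call_model(name: str, prompt: str) -> str:
        raise NotImplementedError  # stand-in for a real API call

    def is_acceptable(answer: str) -> bool:
        # The hard part: reliably detecting a bad cheap-model answer.
        return bool(answer.strip())

    def route(prompt: str) -> tuple[str, float]:
        answer = call_model(CHEAP, prompt)
        cost = CHEAP_PRICE  # pretend one request = 1M tokens for simplicity
        if not is_acceptable(answer):
            answer = call_model(FRONTIER, prompt)
            cost += FRONTIER_PRICE  # the failed cheap attempt is still paid for
        return answer, cost

    # With a 30% escalation rate the blended cost is
    # 0.7*0.10 + 0.3*(0.10 + 15.00) = 4.60 vs 15.00 frontier-only:
    # about 70% savings, and only if is_acceptable is nearly perfect.
    # Every miss either wastes a frontier call or silently ships a bad answer.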



I genuinely wonder what the use cases are where the required accuracy is so low (or, I guess, the prompts are so strong) that you don't need to rigorously use evals to prevent regressions with the model that works best--let alone actually just change models on the fly based on what's cheaper.
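For what it's worth, even a tiny pinned eval set makes "swap models on the fly" a testable claim rather than a hope. A toy sketch, where the golden set, the substring checker, and the threshold are all made-up stand-ins for a real eval suite:

    # A toy regression gate for model swaps; everything here is a
    # hypothetical placeholder, not a real eval framework.
    def call_model(name: str, prompt: str) -> str:
        raise NotImplementedError  # stand-in for a real API call

    GOLDEN = [
        ("What is 2 + 2?", "4"),
        ("What is the capital of France?", "paris"),
    ]

    def safe_to_swap(candidate: str, threshold: float = 0.95) -> bool:
        # Block any model (or router) change that drops accuracy on the
        # pinned set below the threshold.
        hits = sum(
            expected in call_model(candidate, prompt).lower()
            for prompt, expected in GOLDEN
        )
        return hits / len(GOLDEN) >= threshold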


Yes, and in addition, for some reason that use case is also not a fit for a cheap open-source model like qwen or kimi, but must be run on the cheapest model from the big three.



