The pricing is still such that you can't routinely use a customized Gemini with your own long fixed pre-filled context + a variable short query. If the one-time compute cost of the caching could be effectively amortized over many queries, this pattern would replace many fine-tuning and RAG cases with something more predictable and controllable.
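To make the amortization argument concrete, here is a back-of-envelope sketch. All prices are made-up placeholders, not actual Gemini pricing; the point is only the shape of the break-even calculation between resending a long fixed prefix every time and paying once to cache it plus a per-hour storage rent:

```python
# Hypothetical per-million-token prices (NOT real Gemini pricing).
PRICE_INPUT = 1.25      # $/M tokens for uncached input (assumed)
PRICE_CACHED = 0.31     # $/M tokens for reading cached input (assumed)
PRICE_STORAGE = 1.00    # $/M tokens per hour of cache storage (assumed)

def cost_without_cache(prefix_tok: int, query_tok: int, n_queries: int) -> float:
    """Resend the full fixed prefix with every query."""
    return n_queries * (prefix_tok + query_tok) / 1e6 * PRICE_INPUT

def cost_with_cache(prefix_tok: int, query_tok: int,
                    n_queries: int, hours: float) -> float:
    """Pay once to process the prefix into a cache, then do cheap
    cached reads plus a storage fee while the cache is kept alive."""
    write = prefix_tok / 1e6 * PRICE_INPUT
    reads = n_queries * (prefix_tok / 1e6 * PRICE_CACHED
                         + query_tok / 1e6 * PRICE_INPUT)
    storage = prefix_tok / 1e6 * PRICE_STORAGE * hours
    return write + reads + storage

# 1M-token fixed context, 1k-token queries: caching loses at 1 query
# but wins easily at 1000 queries over the same hour.
print(cost_without_cache(1_000_000, 1_000, 1))
print(cost_with_cache(1_000_000, 1_000, 1, hours=1.0))
print(cost_without_cache(1_000_000, 1_000, 1_000))
print(cost_with_cache(1_000_000, 1_000, 1_000, hours=1.0))
```

Under these placeholder numbers the cached path only pays off once the query volume is high enough to amortize the one-time write and the ongoing storage cost, which is exactly the regime the comment above is pointing at.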
It's not quite that simple, because the large cached state still has to be loaded into GPU memory for every request, but optimizations should be feasible if the usage rate is high enough to keep the cache alive on a dedicated machine.
Previous discussion: https://news.ycombinator.com/item?id=40034972#40036309