
That pricing is ridiculous. A token is essentially a 32-bit integer: four bytes. A million tokens is 4 MB. Imagine paying $1/hr for less than the storage of three floppies. That's two million times more expensive than the storage cost of standard S3 (720 hours × 256 million-token units per GB × $1 ≈ $184k per GB-month, vs. ~$0.09). Or 2,000 times more expensive than the storage cost of ElastiCache Serverless.

(Yes, I realize it's probably more than 4 MB, but it's still an outrageously high markup. They could do their own caching, not tell you they're doing it, keep the difference, and make even more money.)
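
A quick back-of-the-envelope of that comparison, using the figures above (4 MB per million tokens, $1 per million tokens per hour for the cache, and ~$0.09/GB-month for S3; all of these are this comment's assumptions):

  # All figures are the assumptions quoted above, not published numbers.
  MB_PER_MILLION_TOKENS = 4          # "a token is ~4 bytes"
  CACHE_PRICE_PER_MTOK_HOUR = 1.00   # $1 per million tokens per hour
  S3_PRICE_PER_GB_MONTH = 0.09
  HOURS_PER_MONTH = 720

  units_per_gb = 1024 / MB_PER_MILLION_TOKENS        # ~256 million-token units per GB
  cache_per_gb_month = units_per_gb * CACHE_PRICE_PER_MTOK_HOUR * HOURS_PER_MONTH
  print(cache_per_gb_month / S3_PRICE_PER_GB_MONTH)  # ~2,000,000x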



You have to store the KV cache, not the tokens. For Gemma 27B (probably slightly larger than Flash), this would be:

  Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision

  8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB
Note that this doesn't take into account further optimizations that Google might be using (quick sketch below).

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...

Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
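
Sketch of the formula above in Python, plugging in the same Gemma 27B numbers (layer count, KV heads, and head dim are taken from this comment, not re-checked against the config):

  def kv_cache_bytes(num_layers, num_kv_heads, dim_head, seq_length, bytes_per_value):
      # 2x for keys and values, stored per layer, per KV head, per position
      return 2 * num_layers * num_kv_heads * dim_head * seq_length * bytes_per_value

  # Gemma 27B at 8-bit precision with a 1M-token context, as above
  size = kv_cache_bytes(num_layers=46, num_kv_heads=16, dim_head=144,
                        seq_length=1_000_000, bytes_per_value=1)
  print(f"{size / 1e9:.0f} GB")  # ~212 GB, i.e. roughly 200 GB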


Is there an easy-to-understand source or paper about how this caching works?



Ask ChatGPT to explain how KV caching works. What they're doing here is essentially the same thing, with a few more engineering details.
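
For a rough picture of the mechanism (a minimal single-head NumPy sketch with made-up dimensions, not how any production stack actually implements it): the keys and values of every token seen so far are kept around, so each new token only needs its own K/V projection plus one attention pass over the cache, and "context caching" amounts to persisting those K/V tensors between requests.

  import numpy as np

  d = 64                                 # head dimension (arbitrary for this sketch)
  Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
  k_cache, v_cache = [], []              # this is what context caching would persist

  def decode_step(x):                    # x: embedding of the newest token, shape (d,)
      k_cache.append(x @ Wk)             # compute this token's key/value once...
      v_cache.append(x @ Wv)
      K, V = np.stack(k_cache), np.stack(v_cache)
      q = x @ Wq                         # ...then attend over everything cached so far
      scores = K @ q / np.sqrt(d)
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()
      return weights @ V                 # attention output for the new token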


Well, that really depends on where you're caching the data.

Is it a lot for caching in L1 on a chip somewhere? No, that'd be wildly cheap.

Is it a lot for "caching" on a tape somewhere? Yes.

So where on this scale does keeping it quick to get into GPU memory lie?

> That's two million times more expensive than the storage cost of standard S3 (

You're not really comparing to S3 at all.


"RAM near a GPU" is ~the same cost as "RAM near literally any other piece of hardware". Even if it has to traverse the network, that's a fairly low, fairly fixed (in, say, the same rack) cost. Hell, it's probably even fast enough to use an NVME disk.

Google can search the entire Internet in a fraction of a second, they can keep a million tokens within a few dozen milliseconds of a GPU for less than a dollar an hour.


Is that fast enough? And how much data is being stored? They're not storing the tokens you pass in but the activations after processing them. I'll take a wild stab that the activations for Claude 3.5 are nowhere near as small as 4 MB.


You've already frowned on my comparison to S3, but I think it's apt: the cache is many times more expensive than S3, yet even for a gigabyte of activations it doesn't need to be so much as two orders of magnitude faster than standard S3.

If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data, it's still almost an order of magnitude more expensive than an in-memory cache adjacent to the inference boxes (quick arithmetic below).

When your managed cache costs eight times as much as a general-purpose managed cache _in the cloud_, you've jumped the shark on pricing.
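
The arithmetic behind that eight-times figure, using the prices quoted in this thread (the 1 GB per million tokens is my guess, not a published number):

  # Prices quoted above; the GB-per-million-tokens figure is a guess.
  context_cache_per_mtok_hour = 1.00   # $/million tokens/hour
  elasticache_per_gb_hour = 0.125      # ElastiCache Serverless, $/GB/hour
  gb_per_million_tokens = 1.0          # generous guess at the cached data size

  ratio = context_cache_per_mtok_hour / (elasticache_per_gb_hour * gb_per_million_tokens)
  print(ratio)  # 8.0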


I do not understand your first paragraph. It's more expensive than S3, and it's an entirely different thing. It's cheaper than writing on vellum, and faster!

> If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data

Is it a gigabyte of data and is it fast enough?

You've guessed 4 MB and 1 GB. What's the actual data size here? What speed do you need to get it into GPU RAM?

The entire point here is to lower latency and costs, so it has to be close and fast.

Guessing at sizes isn't helping anything here.


What is stored is not the tokens, but all keys and values of all attention layers for each token.


+1 it wouldn’t be terribly useful if it were only caching the tokenizer output.


As I pointed out, even if it's a gig of data, that's still almost an order of magnitude more than the cost of a managed in-memory cache in the cloud. That's wild.


It’s probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K × 100 attention layers (roughly the scale of Llama 3.1 405B), that’s already 1M 16-bit floats per token = 2 MB. They’ve likely had to implement some kind of KV compression (like DeepSeek’s) to make this feasible at all.
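
Scaling that per-token estimate out to a million tokens (same rough numbers as above; the Llama-scale dimensions are just a stand-in for whatever Gemini actually uses):

  # Rough per-token KV estimate from the figures above (all assumptions).
  kv_dim_per_layer = 10_000     # ~10K key/value values per layer
  num_layers = 100
  bytes_per_value = 2           # 16-bit floats

  per_token = kv_dim_per_layer * num_layers * bytes_per_value   # ~2 MB per token
  total = per_token * 1_000_000                                 # 1M-token context
  print(f"{per_token / 1e6:.0f} MB/token, {total / 1e12:.0f} TB total")  # 2 MB, 2 TB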


There are some errors in your calculation.

> A token is essentially a 32-bit integer.

No, in a transformer a token is a vector; for larger models it's probably something like 6k-12k floats. At 8-bit precision, that's more like 6-12 kB per token.

So assuming 100k tokens, you end up with ~554 MB for the input tokens ALONE.

Depending on the model architecture the memory will vary, but from my observation the runtime memory increase is at least of the same order of magnitude as the memory used to load the model in the first place, and that's for a moderate context length (<32k); it grows linearly from there, not counting the n×n attention matrices.

So you're easily looking at caching 10-100 GB of data in a very hot state, and that is going to be very expensive indeed.


It's probably much more than a single GB of data for a million tokens.


It's not. You're confused and don't understand what you're talking about. Just read all the replies you've already gotten and try to understand why you're wrong instead of doubling down on an incorrect take.


It's more likely something like 400 GB of data.


You need to cache all the keys and values at each layer, otherwise there's no point in caching.



