
That pricing is ridiculous. A token is essentially a 32-bit integer: four bytes. A million tokens is 4 MB. Imagine paying $1/hr for less than the storage of three floppies. That's two million times more expensive than the storage cost of standard S3 (720 hours × 256 million-token units per GB × $1 ≈ $184k per GB-month, vs. ~$0.09). Or 2,000 times more expensive than the storage cost of ElastiCache Serverless.

(Yes, I realize it's probably more than 4 MB, but it's still an outrageously high markup. They could do their own caching, not tell you they're doing it, keep the difference, and make even more money.)
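
A quick back-of-the-envelope of that comparison, using the figures above (4 MB per million tokens, $1 per million tokens per hour for the cache, and ~$0.09/GB-month for S3; all of these are this comment's assumptions):

  # All figures are the assumptions quoted above, not published numbers.
  MB_PER_MILLION_TOKENS = 4          # "a token is ~4 bytes"
  CACHE_PRICE_PER_MTOK_HOUR = 1.00   # $1 per million tokens per hour
  S3_PRICE_PER_GB_MONTH = 0.09
  HOURS_PER_MONTH = 720

  units_per_gb = 1024 / MB_PER_MILLION_TOKENS        # ~256 million-token units per GB
  cache_per_gb_month = units_per_gb * CACHE_PRICE_PER_MTOK_HOUR * HOURS_PER_MONTH
  print(cache_per_gb_month / S3_PRICE_PER_GB_MONTH)  # ~2,000,000x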



You have to store the KV cache, not the tokens. For Gemma 27B (probably slightly larger than Flash), this would be:

  Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision

  8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB
Note that this doesn't take into account further optimizations that Google might be using (quick sketch below).

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...

Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
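
Sketch of the formula above in Python, plugging in the same Gemma 27B numbers (layer count, KV heads, and head dim are taken from this comment, not re-checked against the config):

  def kv_cache_bytes(num_layers, num_kv_heads, dim_head, seq_length, bytes_per_value):
      # 2x for keys and values, stored per layer, per KV head, per position
      return 2 * num_layers * num_kv_heads * dim_head * seq_length * bytes_per_value

  # Gemma 27B at 8-bit precision with a 1M-token context, as above
  size = kv_cache_bytes(num_layers=46, num_kv_heads=16, dim_head=144,
                        seq_length=1_000_000, bytes_per_value=1)
  print(f"{size / 1e9:.0f} GB")  # ~212 GB, i.e. roughly 200 GB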


Is there an easy-to-understand source or paper about how this caching works?



Ask ChatGPT to explain how KV caching works. What they're doing here is essentially the same thing, with a few more engineering details.
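
For a rough picture of the mechanism (a minimal single-head NumPy sketch with made-up dimensions, not how any production stack actually implements it): the keys and values of every token seen so far are kept around, so each new token only needs its own K/V projection plus one attention pass over the cache, and "context caching" amounts to persisting those K/V tensors between requests.

  import numpy as np

  d = 64                                 # head dimension (arbitrary for this sketch)
  Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
  k_cache, v_cache = [], []              # this is what context caching would persist

  def decode_step(x):                    # x: embedding of the newest token, shape (d,)
      k_cache.append(x @ Wk)             # compute this token's key/value once...
      v_cache.append(x @ Wv)
      K, V = np.stack(k_cache), np.stack(v_cache)
      q = x @ Wq                         # ...then attend over everything cached so far
      scores = K @ q / np.sqrt(d)
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()
      return weights @ V                 # attention output for the new token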


Well, that really depends on where you're caching the data.

Is it a lot for caching in L1 on a chip somewhere? No, that'd be wildly cheap.

Is it a lot for "caching" on a tape somewhere? Yes.

So where on this scale does keeping it quick to get into GPU memory lie?

> That's two million times more expensive than the storage cost of standard S3 (

You're not really comparing to S3 at all.


"RAM near a GPU" is ~the same cost as "RAM near literally any other piece of hardware". Even if it has to traverse the network, that's a fairly low, fairly fixed (in, say, the same rack) cost. Hell, it's probably even fast enough to use an NVME disk.

Google can search the entire Internet in a fraction of a second, they can keep a million tokens within a few dozen milliseconds of a GPU for less than a dollar an hour.


Is that fast enough? And how much data is being stored? They're not storing the tokens you pass in but the activations after processing them. I'll take a wild stab that the activations for Claude 3.5 are nowhere near as small as 4 MB.


You've already frowned on my comparison to S3, but I think it's apt: the cache is many times more expensive than S3, yet even for a gigabyte of activations it doesn't need to be so much as two orders of magnitude faster than standard S3.

If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data, it's still almost an order of magnitude more expensive than an in-memory cache adjacent to the inference boxes (quick arithmetic below).

When your managed cache costs eight times as much as a general-purpose managed cache _in the cloud_, you've jumped the shark on pricing.
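
The arithmetic behind that eight-times figure, using the prices quoted in this thread (the 1 GB per million tokens is my guess, not a published number):

  # Prices quoted above; the GB-per-million-tokens figure is a guess.
  context_cache_per_mtok_hour = 1.00   # $/million tokens/hour
  elasticache_per_gb_hour = 0.125      # ElastiCache Serverless, $/GB/hour
  gb_per_million_tokens = 1.0          # generous guess at the cached data size

  ratio = context_cache_per_mtok_hour / (elasticache_per_gb_hour * gb_per_million_tokens)
  print(ratio)  # 8.0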


I do not understand your first paragraph. It's more expensive than S3, and it's an entirely different thing. It's cheaper than writing on vellum, and faster!

> If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data

Is it a gigabyte of data and is it fast enough?

You've guessed 4 MB and 1 GB. What's the actual data size here? What speed do you need to get it into GPU RAM?

The entire point here is to lower latency and costs, so it has to be close and fast.

Guessing at sizes isn't helping anything here.


What is stored is not the tokens, but all keys and values of all attention layers for each token.


+1 it wouldn’t be terribly useful if it were only caching the tokenizer output.


As I pointed out, even if it's a gig of data, that's still almost an order of magnitude more than the cost of a managed in-memory cache in the cloud. That's wild.


It’s probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K × 100 attention layers (roughly the scale of Llama 3.1 405B), that’s already 1M 16-bit floats per token = 2 MB. They’ve likely had to implement some kind of KV compression (like DeepSeek’s) to make this feasible at all.
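
Scaling that per-token estimate out to a million tokens (same rough numbers as above; the Llama-scale dimensions are just a stand-in for whatever Gemini actually uses):

  # Rough per-token KV estimate from the figures above (all assumptions).
  kv_dim_per_layer = 10_000     # ~10K key/value values per layer
  num_layers = 100
  bytes_per_value = 2           # 16-bit floats

  per_token = kv_dim_per_layer * num_layers * bytes_per_value   # ~2 MB per token
  total = per_token * 1_000_000                                 # 1M-token context
  print(f"{per_token / 1e6:.0f} MB/token, {total / 1e12:.0f} TB total")  # 2 MB, 2 TB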


There are some errors in your calculation.

> A token is essentially a 32-bit integer.

No, in a transformer a token is a vector; for larger models it's probably something like 6k-12k floats. At 8-bit precision, that's more like 6-12 kB per token.

So assuming 100k tokens, you end up with ~554 MB for the input tokens ALONE.

Depending on the model architecture the memory will vary, but from my observation the runtime memory increase is at least of the same order of magnitude as the memory used to load the model in the first place, and that's for a moderate context length (<32k); it grows linearly from there, not counting the n×n attention matrices.

So you're easily looking at caching 10-100 GB of data in a very hot state, and that is going to be very expensive indeed.


It's probably much more than a single GB of data for a million tokens.


It's not. You're confused and don't understand what you're talking about. Just read all the replies you've already gotten and try to understand why you're wrong instead of doubling down on an incorrect take.


It's more likely something like 400 GB of data.


You need to cache all the keys and values at each layer, otherwise there's no point in caching.



