That pricing is ridiculous. A token is essentially a 32-bit integer. Four bytes. A million tokens is 4 MB. Imagine paying $1/hr for less than the storage of three floppies. That's about two million times more expensive than the storage cost of standard S3 (720 hours × 256M tokens per GB × $1/hr, vs $0.09), or about 2,000 times more expensive than the storage cost of ElastiCache Serverless.
(Yes, I realize it's probably more than 4 MB, but it's still an outrageously high markup. They could do their own caching, not tell you they're doing it, keep the difference, and make even more money.)
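Rough numbers, just to make that arithmetic explicit (this assumes the $1/hr-per-million-tokens figure, 4 bytes per token, and the S3 and ElastiCache rates quoted above; it's a sketch, not real pricing for any particular provider):

```python
# Back-of-the-envelope version of the comparison above. All inputs are the
# assumed figures from this thread, not verified list prices.
HOURS_PER_MONTH = 720
tokens_per_gb = 256e6                 # ~256M tokens x 4 bytes ~= 1 GB
cache_per_gb_month = (tokens_per_gb / 1e6) * 1.00 * HOURS_PER_MONTH  # $/GB-month at $1/hr per 1M tokens

s3_per_gb_month = 0.09                           # S3 figure used above
elasticache_per_gb_month = 0.125 * HOURS_PER_MONTH  # $0.125/GB-hr

print(f"token cache:    ${cache_per_gb_month:,.0f} per GB-month")
print(f"vs S3:          {cache_per_gb_month / s3_per_gb_month:,.0f}x")
print(f"vs ElastiCache: {cache_per_gb_month / elasticache_per_gb_month:,.0f}x")
# -> roughly $184,320 per GB-month, ~2,048,000x S3 and ~2,048x ElastiCache
```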
"RAM near a GPU" is ~the same cost as "RAM near literally any other piece of hardware". Even if it has to traverse the network, that's a fairly low, fairly fixed (in, say, the same rack) cost. Hell, it's probably even fast enough to use an NVME disk.
Google can search the entire Internet in a fraction of a second; they can keep a million tokens within a few dozen milliseconds of a GPU for less than a dollar an hour.
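To put some rough numbers on "close enough" (the payload sizes and bandwidths here are ballpark assumptions of mine, not measurements of anyone's infrastructure):

```python
# Rough transfer-time math for pulling a cached context back toward the GPU.
payloads_gb = [0.004, 1.0, 10.0]   # 4 MB, 1 GB, 10 GB
links_gb_per_s = {                 # order-of-magnitude sequential bandwidth (assumed)
    "NVMe SSD": 3,
    "100GbE network": 12,
    "PCIe 4.0 x16": 32,
}

for size in payloads_gb:
    for name, bw in links_gb_per_s.items():
        print(f"{size:>6} GB over {name:<15}: ~{size / bw * 1000:7.1f} ms")
```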
Is that fast enough? And how much data is being stored? They're not storing the tokens you pass in but the activations after processing them. I'll take a wild stab that the activations for Claude 3.5 aren't anywhere near as small as 4 MB.
You've already frowned on my comparison to S3, but I think it's apt: it's many times more expensive than S3, yet (even for a gigabyte of activations) it doesn't even need to be two orders of magnitude faster than standard S3 to do the job.
If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data, it's still almost an order of magnitude more expensive than an in-memory cache sitting next to the inference boxes.
When your managed cache costs eight times as much as a general-purpose managed cache _in the cloud_, you've jumped the shark on pricing.
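That's just one division, granting the most generous assumption in this thread (a full 1 GB of activations per million tokens, and the $1/hr figure from above):

```python
token_cache_rate = 1.00     # $/hr to keep 1M tokens cached (assumed figure)
elasticache_rate = 0.125    # $/GB-hr, ElastiCache Serverless
print(token_cache_rate / (1.0 * elasticache_rate))   # -> 8.0x
```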
I do not understand your first paragraph. It's more expensive than S3, and it's entirely different. It's cheaper than writing on vellum, and faster!
> If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data
Is it a gigabyte of data and is it fast enough?
You've guessed 4 MB and 1 GB. What's the actual data size here? What speed do you need to get it into GPU RAM?
The entire point here is to lower latency and costs, so it has to be close and fast.
As I pointed out, even if it's a gig of data, that's still almost an order of magnitude more than the cost of a managed in-memory cache in the cloud. That's wild.
It's probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K across ~100 attention layers (roughly the scale of Llama 3.1 405B), that's already 1M 16-bit floats per token = 2 MB. They have likely needed to implement some kind of KV compression (like DeepSeek's) to make this feasible at all.
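Spelling that estimate out (the dimensions below are the rough assumptions from this comment at Llama-3.1-405B scale, not published numbers for any Anthropic model):

```python
kv_values_per_layer = 10_000   # K+V width per token, per layer (assumed)
num_layers = 100               # attention layers (assumed)
bytes_per_value = 2            # fp16/bf16

per_token = kv_values_per_layer * num_layers * bytes_per_value
print(f"~{per_token / 1e6:.0f} MB per token")                         # ~2 MB
print(f"~{per_token * 1_000_000 / 1e12:.0f} TB per million tokens")   # ~2 TB
```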
No, in a transformer a token is a vector; for larger models that's probably something like 6k-12k floats. At 8-bit precision, that's more like 6-12 kB per token.
So with 100k tokens, you end up with ~554 MB for the input tokens ALONE.
Depending on your model architecture the memory will vary, but from my observation the runtime memory increase is at least of the same order of magnitude as the memory used to load the model in the first place, and that's for a moderate context length (<32k); it grows linearly from there, even before you count the n*n KV matrices.
So you are easily looking at caching 10-100 GB of data, in a very hot state, and that is going to be very expensive indeed.
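Scaling that estimate to a long context (the per-token values and 8-bit precision are the rough assumptions from this comment, not measured numbers for any particular model):

```python
# ~6k-12k values per token at 1 byte per value, over a 100k-token context.
for values_per_token in (6_000, 12_000):
    total_mb = values_per_token * 1 * 100_000 / 1e6
    print(f"{values_per_token:>6} values/token -> ~{total_mb:,.0f} MB per 100k tokens")
# -> ~600 MB to ~1,200 MB, in the same ballpark as the ~554 MB figure above
```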
It's not. You're confused and don't understand what you're talking about. Just read all the replies you've already gotten and try to understand why you're wrong instead of doubling down on an incorrect take.