"Full 131k" context , actually the full context is double that at 262144 context and with 8x yarn mutiplier it can go up to 2million. It looks like even full chip scale Cerebras has trouble with context length, well, this is a limitation of the transformer architechture itself where memory requirements scale ~linearly and compute requirements roughly quadratically with the increase in kv cache.
Anyway, YOU'RE NOT SERVING FULL CONTEXT, CEREBRAS, YOU'RE SERVING HALF. Also, what quantization exactly is this? Can the customers know?