What they really destroyed was the idea that OpenAI would be able to charge $200/month for their ChatGPT Pro subscription, which includes o1. That was always ridiculous IMO. The Free tier and $20/month Plus tier, along with their API business (minus any future plan to charge a ridiculous amount for API access to o1), will be fine.
> The Free tier and $20/month Plus tier along with their API business (minus any future plan to charge a ridiculous amount for API access to o1) will be fine.
Actually, no! If we take their paper at face value, the crucial innovations that make a strong model efficient are their much-reduced KV cache (multi-head latent attention) and their MoE approach:
- Where a standard model needs to store two large vectors (key and value) for each token at inference time, and load/store those over and over from memory, DeepSeek V3/R1 stores only one smaller vector c, a "compression" from which the large k, v vectors can be decoded on the fly (see the first sketch after this list).
- They use a fairly standard Mixture-of-Experts (MoE) approach, which works well in training thanks to their tricks, but its inference-time advantage is the same one every MoE technique gets immediately: of the ~85% of the 600B+ parameters that sit inside the MoE layers, each token's inference step uses only a small fraction. This cuts FLOPs and memory I/O by a large factor compared with a so-called dense model, where all weights are used for every token (cf. Llama 3 405B). The second sketch after this list illustrates the routing.
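For intuition, here is a minimal sketch of the compressed-KV idea: cache one small latent per token and rebuild K/V on demand. The dimensions, weight names, and the absence of RoPE handling (the real MLA carries a separate decoupled rotary key) are all simplifying assumptions, not DeepSeek's exact layout.

```python
# Minimal sketch of latent-KV caching (not DeepSeek's exact MLA; shapes and
# the missing RoPE/decoupled-key path are simplifying assumptions).
import torch

d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512

# Down-projection applied once per token: one small latent c is cached.
W_down_kv = torch.randn(d_model, d_latent) / d_model**0.5
# Up-projections applied at attention time to rebuild per-head K and V from c.
W_up_k = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5
W_up_v = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5

def cache_token(h: torch.Tensor) -> torch.Tensor:
    """Standard attention would cache k and v (2 * n_heads * d_head = 8192 floats
    per token here); this caches only the compressed latent c (512 floats)."""
    return h @ W_down_kv                                 # (d_latent,)

def decode_kv(c_cache: torch.Tensor):
    """Rebuild full K and V for all cached tokens on the fly from the latents."""
    k = (c_cache @ W_up_k).view(-1, n_heads, d_head)     # (seq, heads, d_head)
    v = (c_cache @ W_up_v).view(-1, n_heads, d_head)
    return k, v

# One decoding step: the cache holds only latents; K/V exist transiently.
c_cache = torch.stack([cache_token(torch.randn(d_model)) for _ in range(16)])
k, v = decode_kv(c_cache)
print(c_cache.shape, k.shape, v.shape)  # [16, 512]  [16, 32, 128]  [16, 32, 128]
```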
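And a minimal sketch of top-k expert routing, showing why only a small fraction of the MoE parameters do work per token. This is a generic softmax-gated MoE with made-up sizes, not DeepSeek's exact router with shared experts and auxiliary-loss-free load balancing.

```python
# Generic top-k MoE routing sketch (illustrative sizes; not DeepSeek's router).
import torch
import torch.nn.functional as F

d_model, n_experts, top_k, d_ff = 1024, 64, 4, 2048

gate = torch.randn(d_model, n_experts) / d_model**0.5
# Each expert is a small FFN; only top_k of the n_experts run for a given token.
experts_in  = torch.randn(n_experts, d_model, d_ff) / d_model**0.5
experts_out = torch.randn(n_experts, d_ff, d_model) / d_ff**0.5

def moe_layer(h: torch.Tensor) -> torch.Tensor:
    """h: (d_model,) hidden state for a single token."""
    scores = F.softmax(h @ gate, dim=-1)      # affinity for each expert
    weights, idx = scores.topk(top_k)         # keep only the k best experts
    weights = weights / weights.sum()         # renormalise over the chosen experts
    out = torch.zeros_like(h)
    for w, e in zip(weights, idx):            # only k experts compute anything
        out += w * (F.silu(h @ experts_in[e]) @ experts_out[e])
    return out

h = torch.randn(d_model)
print(moe_layer(h).shape)  # [1024]; FLOPs ~ top_k/n_experts of an equivalent dense FFN
```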