This sounds plausible and fascinating. Let’s see what it would have taken to tra...

This sounds plausible and fascinating. Let’s see what it would have taken to train a model as well.

Given an estimate of 6 FLOPs per token per parameter, training a 7B parameter model would require about 1.26×10^22 FLOPs. That translates to roughly 500 000 years on an 800 MFLOPS X-MP, far too long to be feasible. Training a 100M parameter model would still take nearly 70 years.

However, a 7M-parameter model would only have required about six months of training, and a 14M one about a year, so let’s settle on 10 million. That’s already far more reasonable than the 300K model I mentioned earlier.

Moreover, a 10M parameter model would have been far from useless. It could have performed decent summarization, categorization, basic code autocompletion, and even powered a simple chatbot with a short context, all that in 1984, which would have been pure sci-fi back in those days. And pretty snappy too, maybe around 10 tokens per second if not a little more.

Too bad we lacked the datasets and the concepts...