
Estimating the number of words ever used by humanity: start from the total number of humans who ever lived, ~110B, and multiply by 80 years * 365 days * 35,000 words/day ≈ 1B words per human lifetime; you get ~1.1*10^20 words.

Now, the training set of GPT-4 was around 14T tokens, call it roughly 10T words. Dividing the two gives a ratio of about 10 million.
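
A minimal sketch of that arithmetic (the 110B, 80-year, 35,000-words/day and 10T-word figures are the rough assumptions above, not measured values):

    # Back-of-envelope: total words ever spoken by humanity vs. GPT-4's training set.
    # All inputs are the rough assumptions from the comment above.

    humans_ever_lived = 110e9        # ~110 billion people ever born (common demographic estimate)
    words_per_day = 35_000           # assumed average spoken words per person per day
    lifetime_days = 80 * 365         # assumed 80-year lifespan

    words_per_lifetime = words_per_day * lifetime_days          # ~1.0e9 words
    total_human_words = humans_ever_lived * words_per_lifetime  # ~1.1e20 words

    gpt4_training_words = 10e12      # ~14T tokens, taken here as roughly 10T words

    print(f"words per lifetime: {words_per_lifetime:.2e}")                       # ~1.02e+09
    print(f"total human words:  {total_human_words:.2e}")                        # ~1.12e+20
    print(f"ratio to GPT-4 set: {total_human_words / gpt4_training_words:.2e}")  # ~1.1e+07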



> total number of humans who ever lived 110B

But why do you think that all 110B individuals have made a significant contribution to overall intelligence?

I think that, in the best case, each generation has only a few significant contributors; everyone else is just a carrier of genomic diversity and nothing more, and everything they have done disappears like a breath of wind.

So over 200k years, at one generation per 40 years, that's only 5,000 generations. If we count, say, 1,000 significant persons per generation, that's just 5 million people, more than four orders of magnitude below your estimate, and BTW much closer to known estimates of humanity's knowledge and to the size of the datasets used to train the largest existing models.
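
A quick sketch of this alternative count, under the same assumptions (200k years, 40-year generations, 1,000 significant contributors per generation):

    import math

    # Alternative estimate: only a few "significant contributors" per generation.
    generations = 200_000 // 40          # 200k years, one generation per 40 years -> 5,000
    contributors_per_generation = 1_000  # assumed number of significant contributors
    significant_people = generations * contributors_per_generation  # 5,000,000

    humans_ever_lived = 110e9
    print(significant_people)                                  # 5,000,000
    print(math.log10(humans_ever_lived / significant_people))  # ~4.3 orders of magnitude smaller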


BTW, one of the world's largest collections of texts, the Library of Congress, was estimated (in an old estimate) at about 100B characters of text, which at roughly 4 characters per token is on the order of 25B LLM tokens.

Plus, they hold thousands of multimedia carriers (film, music, etc.), and at one time they archived all tweets for historical preservation.

But all the multimedia and tweets are much smaller in volume than the texts; still, they add a few extra dimensions that are hard to express in text.
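
To put the text figure next to the GPT-4 number above, a minimal sketch of the conversion, assuming the ~100B-character estimate and the usual rule of thumb of ~4 characters per English token:

    # Rough comparison of the Library of Congress text estimate to GPT-4's training set.
    # The 100B-character figure is the old estimate quoted above; 4 chars/token is a rule of thumb.

    loc_characters = 100e9
    chars_per_token = 4                              # typical for English text with BPE-style tokenizers
    loc_tokens = loc_characters / chars_per_token    # ~2.5e10 tokens

    gpt4_training_tokens = 14e12
    print(f"LoC text: ~{loc_tokens:.1e} tokens")                                  # ~2.5e+10
    print(f"GPT-4 set is ~{gpt4_training_tokens / loc_tokens:.0f}x larger")       # ~560x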



