Haha, even less effective than I expected, but I enjoyed the read nonetheless.
I think if you have the luxury of assuming every token is a dictionary word, you can do much better by simply encoding each word as its index in the dictionary. Using the pigeonhole principle, you can probably prove that this is at least as good as the autocorrect approach, no matter how sophisticated the autocorrect is.
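Something like this, as a rough Python sketch (the toy word list is just for illustration, and I'm assuming both sides hold the same dictionary and every token is in it):

    # Encode each word as its index into a shared word list.
    WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]  # toy dictionary
    INDEX = {w: i for i, w in enumerate(WORDS)}

    def encode(text):
        return [INDEX[w] for w in text.split()]

    def decode(indices):
        return " ".join(WORDS[i] for i in indices)

    msg = "the quick brown fox jumps over the lazy dog"
    packed = encode(msg)
    print(packed)                 # [0, 1, 2, 3, 4, 5, 0, 6, 7]
    assert decode(packed) == msg

With a realistic dictionary of a few hundred thousand entries, a plain index fits in roughly 19 bits, so the per-word cost is bounded no matter how long the word is.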
There once was a group of old friends who would get together on a regular basis for social interaction, telling stories and making jokes. As these folks aged, they ended up telling pretty much the same old stories and jokes again and again until they reached the point where they just assigned a number to each.
Decades later, they would gather and sit in silence for long periods until one or the other of them would just say something like "47" and all the others would laugh.
I’m not so sure it’s “at least as good” as sophisticated autocorrect. If you have a sufficiently advanced prediction model, it seems to me that you could make better guesses from context, which might, just by chance, outperform this approach. If you narrowed it down to only the words that match under this approach, but still used some kind of context-based prediction to pick which one, I’d expect it to perform much better.
I don’t expect that would work. It would compress long words, but most short words would be encoded with numbers longer than the words themselves, and short words typically make up most of the text.
The trick is to encode the numbers in binary, not plaintext. But you'd still probably want to use Huffman coding rather than plain indexes, so that the common words get shorter codes.
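A toy sketch of that idea, with made-up frequencies (just the usual heapq-based Huffman construction, nothing clever):

    import heapq

    def huffman_codes(freqs):
        # Return {word: bitstring} for a {word: frequency} table.
        heap = [[f, [w, ""]] for w, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return {w: code for w, code in heap[0][1:]}

    freqs = {"the": 500, "of": 300, "pigeonhole": 2, "autocorrect": 1}
    print(huffman_codes(freqs))
    # "the" ends up with 1 bit, "of" with 2, the rare words with 3 each.

With real frequencies, something like "the" costs only a few bits instead of the 17 or so that a flat index into a 100K+ word dictionary would need.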
> I think if you have the luxury of assuming every token is a dictionary word, you can do much better by simply encoding each word as its index in the dictionary.
It's the standard pool/index tradeoff: instead of storing an array of possibly-duplicate objects, you simply store all unique objects once in a pool and store the array as an array of references into that pool.
You win if the original array of duplicate objects was so long or so duplicated that replacing it with references into the pool is a worthwhile reduction. The objectSize/indexSize ratio also plays some role.
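The whole idea fits in a few lines (a toy sketch, not bothering with a compact encoding for the indices):

    def pool_encode(text):
        pool, index, refs = [], {}, []
        for word in text.split():
            if word not in index:
                index[word] = len(pool)
                pool.append(word)
            refs.append(index[word])
        return pool, refs

    def pool_decode(pool, refs):
        return " ".join(pool[i] for i in refs)

    text = "to be or not to be that is the question to be"
    pool, refs = pool_encode(text)
    print(pool)   # ['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question']
    print(refs)   # [0, 1, 2, 3, 0, 1, 4, 5, 6, 7, 0, 1]
    assert pool_decode(pool, refs) == text

The pool itself costs space up front, of course, which is the catch.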
The vast majority of text objects anybody generates won't exceed 4MB though. 4MB is 4 million ASCII/UTF-8 characters; if we assume a typical English word is 4 characters for simplicity, that's 1 million words without spaces. A quick Google search for "novel that is 1 million words" says that's twice the word count of The Lord of the Rings, and about the word count of the first 3 A Song of Ice and Fire (Game of Thrones) books. Accounting for whitespace, longer common words, and inefficient encoding schemes would bring that overestimate down to, what, 150K words? A 300-page or 600-page book (depending on the spacing) according to Google, still massive.
I see it only working where there's massive pooling, like you say. An OS or a tool provides a known dictionary service and you call into it with a text string and get back an array of indices that you can then decode with another call. That amortizes the dictionary cost among all possible uses of the service, which is a lot if it's a standard, well-known service. Another scenario is a database / cloud storage / social media setting: any single text object might be small, but the total text stored overall is massive.
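If I had to sketch what such a service's interface might look like (the names here are purely hypothetical, just to make the shape concrete):

    class DictionaryService:
        # One system-wide word list shared by every caller, so no single
        # message has to carry its own dictionary.
        def __init__(self, words):
            self._words = list(words)
            self._index = {w: i for i, w in enumerate(self._words)}

        def encode(self, text):
            # Unknown words would need an escape mechanism; this sketch
            # assumes everything is in the shared dictionary.
            return [self._index[w] for w in text.split()]

        def decode(self, indices):
            return " ".join(self._words[i] for i in indices)

The dictionary lives inside the service once, and every caller's stored text is just an array of indices.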
Except that the corpus doesn't need to be stored as part of the compressed message, and can be considered part of the compression algorithm. It increases the size of the decoder by ~4MB, but doesn't increase the size of each message.
The autocorrect system might be able to do better if it had an understanding of grammar and proper English like GPT-3, and could actually fill in missing words, replace words, etc.