Haha, even less effective than I expected, but I enjoyed the read nonetheless.
I think if you have the luxury of assuming every token is a dictionary word, you can do much better by simply encoding each word as its index in the dictionary. Using the pigeonhole principle, you can probably prove that this is at least as good as the autocorrect approach, no matter how sophisticated the autocorrect is.
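Something like this, as a rough Python sketch (the toy word list is just for illustration, and I'm assuming both sides hold the same dictionary and every token is in it):

    # Encode each word as its index into a shared word list.
    WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]  # toy dictionary
    INDEX = {w: i for i, w in enumerate(WORDS)}

    def encode(text):
        return [INDEX[w] for w in text.split()]

    def decode(indices):
        return " ".join(WORDS[i] for i in indices)

    msg = "the quick brown fox jumps over the lazy dog"
    packed = encode(msg)
    print(packed)                 # [0, 1, 2, 3, 4, 5, 0, 6, 7]
    assert decode(packed) == msg

With a realistic dictionary of a few hundred thousand entries, a plain index fits in roughly 19 bits, so the per-word cost is bounded no matter how long the word is.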
There once was a group of old friends who would get together on a regular basis for social interaction, telling stories and making jokes. As these folks aged, they ended up telling pretty much the same old stories and jokes again and again until they reached the point where they just assigned a number to each.
Decades later, they would gather and sit in silence for long periods until one or the other of them would just say something like "47" and all the others would laugh.
I’m not so sure it’s “at least as good” as sophisticated autocorrect. If you have a sufficiently advanced prediction model, it seems to me that you could make better guesses from context, which might, just by chance, outperform this approach. If you narrowed it down to only the words that match under this approach, but still used some kind of context-based prediction to pick which one, I’d expect it to perform much better.
I don’t expect that would work. It would compress long words, but most short words would be encoded with numbers longer than the words themselves, and short words typically make up most of the text.
The trick is to encode the numbers in binary, not plaintext. But you'd still probably want to use Huffman coding rather than plain indexes, so that the common words get shorter codes.
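A toy sketch of that idea, with made-up frequencies (just the usual heapq-based Huffman construction, nothing clever):

    import heapq

    def huffman_codes(freqs):
        # Return {word: bitstring} for a {word: frequency} table.
        heap = [[f, [w, ""]] for w, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return {w: code for w, code in heap[0][1:]}

    freqs = {"the": 500, "of": 300, "pigeonhole": 2, "autocorrect": 1}
    print(huffman_codes(freqs))
    # "the" ends up with 1 bit, "of" with 2, the rare words with 3 each.

With real frequencies, something like "the" costs only a few bits instead of the 17 or so that a flat index into a 100K+ word dictionary would need.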
> I think if you have the luxury of assuming every token is a dictionary word, you can do much better by simply encoding each word as its index in the dictionary.
It's the standard pool/index tradeoff: instead of storing an array of possibly-duplicate objects, you simply store all unique objects once in a pool and store the array as an array of references into that pool.
You win if the original array of duplicate objects was so long or so duplicated that replacing it with references into the pool is a worthwhile reduction. The objectSize/indexSize ratio also plays some role.
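The whole idea fits in a few lines (a toy sketch, not bothering with a compact encoding for the indices):

    def pool_encode(text):
        pool, index, refs = [], {}, []
        for word in text.split():
            if word not in index:
                index[word] = len(pool)
                pool.append(word)
            refs.append(index[word])
        return pool, refs

    def pool_decode(pool, refs):
        return " ".join(pool[i] for i in refs)

    text = "to be or not to be that is the question to be"
    pool, refs = pool_encode(text)
    print(pool)   # ['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question']
    print(refs)   # [0, 1, 2, 3, 0, 1, 4, 5, 6, 7, 0, 1]
    assert pool_decode(pool, refs) == text

The pool itself costs space up front, of course, which is the catch.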
The vast majority of text objects anybody generates won't exceed 4MB though. 4MB is 4 million ASCII/UTF-8 characters; if we assume a typical English word is 4 characters for simplicity, that's 1 million words without spaces. A quick Google search for "novel that is 1 million words" says that's twice the word count of The Lord of the Rings, and about the word count of the first 3 A Song of Ice and Fire (Game of Thrones) books. Accounting for whitespace, longer common words, and inefficient encoding schemes would bring that overestimate down to, what, 150K words? A 300-page or 600-page book (depending on the spacing) according to Google, still massive.
I see it only working where there's massive pooling, like you say. An OS or a tool provides a known dictionary service and you call into it with a text string and get back an array of indices that you can then decode with another call. That amortizes the dictionary cost among all possible uses of the service, which is a lot if it's a standard, well-known service. Another scenario is a database / cloud storage / social media setting: any single text object might be small, but the total text stored overall is massive.
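If I had to sketch what such a service's interface might look like (the names here are purely hypothetical, just to make the shape concrete):

    class DictionaryService:
        # One system-wide word list shared by every caller, so no single
        # message has to carry its own dictionary.
        def __init__(self, words):
            self._words = list(words)
            self._index = {w: i for i, w in enumerate(self._words)}

        def encode(self, text):
            # Unknown words would need an escape mechanism; this sketch
            # assumes everything is in the shared dictionary.
            return [self._index[w] for w in text.split()]

        def decode(self, indices):
            return " ".join(self._words[i] for i in indices)

The dictionary lives inside the service once, and every caller's stored text is just an array of indices.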
Except that the corpus doesn't need to be stored as part of the compressed message, and can be considered part of the compression algorithm. It increases the size of the decoder by ~4MB, but doesn't increase the size of each message.
The autocorrect system might be able to do better if it had an understanding of grammar and proper English like GPT-3, and could actually fill in missing words, replace words, etc.