I implemented a "conceptual" lossy text compression a few months ago by using word2vec (glove) to convert each word to a list of floats, laid the result out in a square, then jpeg compressed it at different quality levels. This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.
This was a follow up to a similar demo where a colleague compressed the raw ascii bytes as jpeg, which started devolving into nonsense (that didn't even look like words anymore) below ~97 quality.
I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results.
Would be interesting to try with some other image formats as well.
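A minimal sketch of what such a pipeline could look like (not the original code; it assumes a plain-text GloVe file, uses Pillow for the jpeg round trip, drops words missing from the vocabulary, keeps one row per word instead of the square packing, and snaps each decoded row back to the nearest word vector):

    import numpy as np
    from PIL import Image

    def load_glove(path, limit=50000):
        # Plain-text GloVe format: "word f0 f1 f2 ..." on each line.
        words, vecs = [], []
        with open(path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= limit:
                    break
                parts = line.rstrip().split(" ")
                words.append(parts[0])
                vecs.append(np.asarray(parts[1:], dtype=np.float32))
        return words, np.stack(vecs)

    def jpeg_roundtrip(text, words, vecs, quality=50):
        index = {w: i for i, w in enumerate(words)}
        toks = [t for t in text.lower().split() if t in index]
        mat = vecs[[index[t] for t in toks]]          # one row of floats per word

        # Scale to 0..255 so the floats survive as an 8-bit grayscale jpeg.
        lo, hi = float(mat.min()), float(mat.max())
        img = ((mat - lo) / (hi - lo) * 255).astype(np.uint8)
        Image.fromarray(img).save("packed.jpg", quality=quality)

        # Decode: undo the scaling, then snap each row to the nearest word
        # vector; that lookup is what keeps the output all real words.
        back = np.asarray(Image.open("packed.jpg"), dtype=np.float32)
        back = back / 255 * (hi - lo) + lo
        out = []
        for row in back:
            out.append(words[int(((vecs - row) ** 2).sum(axis=1).argmin())])
        return " ".join(out)

Lowering `quality` is the knob that controls the descent described above.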
--
Could also play with the data layout to make the jpeg curves make sense in a different way. Right now the vectors are stacked end to end, like:
out = v0[0] + ... + v0[len-1] + v1[0] + ... + v1[len-1] + ...
If you consider the vector to be an N dimensional space, jpeg compression probably makes more sense if you lay it out like:
out = v0[0] + v1[0] + ... + v0[len-1] + v1[len-1] + ...
This would allow the DCT to apply to continuous coordinate spaces from the vectors.
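In numpy terms the second layout is just a transpose of the first, assuming the matrix is written into the image row by row. A quick sketch with dummy data:

    import numpy as np

    vecs = np.random.rand(8, 300).astype(np.float32)  # 8 word vectors, 300 dims (dummy data)

    # Current layout: one word vector per image row, so a raster scan reads
    # v0[0] .. v0[299], then v1[0] .. v1[299], and so on.
    end_to_end = vecs

    # Alternative: transpose so each row holds one coordinate across all the
    # words (v0[0], v1[0], ..., v7[0], then v0[1], v1[1], ...), letting the
    # DCT run along a single axis of the embedding space at a time.
    interleaved = vecs.T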
--
Another idea, which might produce even more sensible output at good compression ratios, would be to train a language model neural network encoder/decoder pair on a text compression task: teach it to "downscale" and "upscale" text, perhaps in an unsupervised or semi-supervised manner. Then you can also constrain the intermediate form.
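A minimal sketch of what that encoder/decoder could look like in PyTorch (just the shape of the idea, nothing trained or tuned; the GRU, the layer sizes, and the unshifted teacher forcing are all placeholder choices):

    import torch
    import torch.nn as nn

    class TextAutoencoder(nn.Module):
        """Encode text into a small bottleneck ("downscale"), then decode it back ("upscale")."""
        def __init__(self, vocab_size, emb=256, bottleneck=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.encoder = nn.GRU(emb, emb, batch_first=True)
            self.to_code = nn.Linear(emb, bottleneck)    # the intermediate form you could constrain
            self.from_code = nn.Linear(bottleneck, emb)
            self.decoder = nn.GRU(emb, emb, batch_first=True)
            self.out = nn.Linear(emb, vocab_size)

        def forward(self, tokens):                       # tokens: (batch, seq) of token ids
            _, h_last = self.encoder(self.embed(tokens))
            code = self.to_code(h_last[-1])              # (batch, bottleneck): the "compressed" text
            h0 = self.from_code(code).unsqueeze(0)
            y, _ = self.decoder(self.embed(tokens), h0)  # teacher forcing (unshifted, for brevity)
            return self.out(y), code

    # Training would minimize cross-entropy between the logits and the original
    # tokens; the bottleneck size is then the compression-ratio knob.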
Here's a quick zero-shot text compression attempt with GPT3, compressing then decompressing some of my sentences picked from upthread. Looks pretty impressive imo.
Notes:
These results include every attempt I made and are not curated in any way, except for the "image formats" completion, where I manually asked GPT3 for additional output because it stopped short.
To construct the same request, copy the appropriate prompt and append two more lines matching it, such as `{prompt}\n\nLong: your input text\nShort:`, then ask GPT3 for a completion.
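For example, against the legacy `openai` Completions endpoint the request would look roughly like this (the model name, temperature, and stop sequence are placeholders, not a record of the settings behind the outputs below):

    import openai  # legacy (pre-1.0) client

    def gpt3(prompt, text, dir_in="Long", dir_out="Short"):
        # Append the two extra lines described above, then ask for a completion.
        full = f"{prompt}\n\n{dir_in}: {text}\n{dir_out}:"
        resp = openai.Completion.create(
            model="text-davinci-002",
            prompt=full,
            max_tokens=128,
            temperature=0,
            stop="\n",
        )
        return resp["choices"][0]["text"].strip()

    # Compression:   gpt3(compress_prompt, "some long sentence")
    # Decompression: gpt3(decompress_prompt, "sum shrt txt", dir_in="Short", dir_out="Long")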
--
Compression:
Prompt
Make text shorter or longer, using slang, chat speak, or just add or remove vowels and silent letters.
Long: This text is compressed.
Short: dis txt is shrt
Outputs (each prediction starts after Short:)
Long: I want this sentence, only shorter.
Short: I wnt dis sentnce, only shrr
Long: This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.
Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.
Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.
Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.
--
Now I take the above outputs (without the inputs) and feed them back into GPT3 with a different prompt.
Decompression:
Prompt
Make text longer. Expand slang and chat speak, or just add vowels, silent letters, and punctuation.
Short: dis txt is shrt
Long: This text is compressed.
Short: I wnt dis sentnce, only shrr
Long: I want this sentence, only shorter.
Outputs (each prediction starts after Long:)
Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.
Long: This enhanced the compressed output by all real words, but long the JPEG quality basically controlled the descent into synonyms and made-up words.
Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.
Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image forms as well.
There's a "translate-shell" package for Linux that provides a "trans" command. It makes for the most fun lossy pipeline, though there's no compression.
$ echo "It’s Handy That I Have My Library Card Because I’m Totally Checking You Out." | trans -b en:vi | trans -b vi:en
Completely free when you need to inspect the fish and the whole country to check the water. report.
Like with audio and video, you lose some of the original details, but maybe that doesn’t matter to you if it captures enough of what you find important.
--
Take the English text, translate it into a language with simpler grammar, such as Bahasa Malaysia. Then do lossless compression on the result (you have to do this because e.g. the plural form of a noun in BM is simply the noun, twice).
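A rough sketch of that pipeline, shelling out to the `trans` command mentioned above (the `en:ms` language pair and the use of zlib for the lossless step are assumptions about the details, not a tested recipe):

    import subprocess
    import zlib

    def translate_then_compress(text):
        # English -> Malay via translate-shell, then lossless (deflate) compression.
        malay = subprocess.run(
            ["trans", "-b", "en:ms"],
            input=text, capture_output=True, text=True, check=True,
        ).stdout
        return malay, zlib.compress(malay.encode("utf-8"), 9)

    # The doubled-noun plurals are exactly the kind of repetition the lossless
    # pass soaks up.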
I always use "congrats" instead of "congralutions" because i always misspell it and the other side always get it even though "congrats" could mean other things
EDIT: I love every single answer below