
What exactly is lossy text compression then?

EDIT: I love every single answer below




I implemented a "conceptual" lossy text compressor a few months ago by using word2vec (GloVe) to convert each word to a list of floats, laying the result out in a square, then JPEG-compressing it at different quality levels. Decoding each degraded vector back to its nearest word ensured the decompressed output was all real words, but lowering the JPEG quality basically controlled the descent into synonyms and madness.

This was a follow-up to a similar demo where a colleague compressed the raw ASCII bytes as JPEG, which started devolving into nonsense (that didn't even look like words anymore) below ~97 quality.

I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.
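While the real demo used GloVe vectors and actual JPEG encoding, the pipeline can be sketched in miniature. Here random per-word vectors stand in for word2vec and coarse uniform quantization stands in for JPEG; both substitutions, plus the `compress`/`decompress` helper names and the tiny vocabulary, are my own assumptions for illustration. The key property survives: decoding snaps each degraded vector to its nearest real word, so the output is always real words.

```python
import random
import math

# Toy stand-ins: random vectors instead of GloVe, uniform quantization
# instead of JPEG. Fewer quantization levels ~ lower JPEG quality.
random.seed(0)
VOCAB = ["this", "text", "is", "short", "long", "word", "compress"]
DIM = 8
vec = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in VOCAB}

def compress(words, levels):
    """Quantize each word's vector to `levels` values per component."""
    return [[round((x + 1) / 2 * (levels - 1)) for x in vec[w]]
            for w in words]

def decompress(quantized, levels):
    """Map each degraded vector back to the nearest real word, so the
    decompressed output is guaranteed to be made of real words."""
    words = []
    for q in quantized:
        v = [x / (levels - 1) * 2 - 1 for x in q]
        words.append(min(VOCAB, key=lambda w: math.dist(vec[w], v)))
    return words

original = ["this", "text", "is", "short"]
roundtrip = decompress(compress(original, levels=256), levels=256)
print(roundtrip)  # at high "quality" this should recover the originals
```

Dropping `levels` from 256 toward 2 plays the same role as lowering the JPEG quality slider: nearby words start collapsing into one another.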

--

Could also play with the data layout to make the JPEG frequency curves make sense in a different way. Right now the vectors are stacked end to end, like:

  v0 = vec(words[0])
  v1 = vec(words[1])
  out = v0[0...len] + v1[0...len]
If you consider each vector to be a point in an N-dimensional space, JPEG compression probably makes more sense if you lay it out like:

  out = v0[0] + v1[0] ... v0[len-1] + v1[len-1]
This would let the DCT apply to one continuous coordinate axis of the vector space at a time, where adjacent values are more likely to vary smoothly.
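The two layouts above can be sketched directly. This assumes two toy 3-component vectors; "stacked" is the current end-to-end layout and "interleaved" groups component 0 of every word, then component 1, and so on:

```python
# Two ways to flatten word vectors into a 1-D stream before imaging.
words_vecs = [
    [0.1, 0.9, -0.3],   # v0 = vec(words[0])
    [0.2, 0.8, -0.2],   # v1 = vec(words[1])
]

# Stacked: v0's components, then v1's (the current layout).
stacked = [x for v in words_vecs for x in v]

# Interleaved: component d of every word, for each dimension d in turn,
# so each run of values comes from a single coordinate axis.
interleaved = [v[d] for d in range(len(words_vecs[0])) for v in words_vecs]

print(stacked)      # [0.1, 0.9, -0.3, 0.2, 0.8, -0.2]
print(interleaved)  # [0.1, 0.2, 0.9, 0.8, -0.3, -0.2]
```

In the interleaved stream, neighboring values for semantically similar words sit on the same embedding axis, which is exactly the kind of slowly varying signal the DCT compresses well.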

--

Another idea that would possibly have even more sensible output at good compression ratios, would be to train a language model neural network encoder/decoder pair on a text compression task. Teach it to "downscale" and "upscale" text, perhaps in an unsupervised or semi-supervised manner. Then you can also constrain the intermediate form.


Here's a quick zero-shot text compression attempt with GPT-3, compressing and then decompressing some of my sentences picked from upthread. The results look pretty impressive, imo.

Notes: These results include every attempt I made and are not curated in any way, except that for the "image formats" sentence I manually asked GPT-3 for additional output because it stopped short.

To construct the same request, copy the appropriate prompt below, append two more lines matching its pattern, such as `{prompt}\n\nLong: your input text\nShort:`, then ask GPT-3 for a completion.
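Assembling that request is just string concatenation; here's a minimal sketch, where `build_request` is a hypothetical helper name and the actual API call is omitted:

```python
# The few-shot compression prompt, as quoted below.
PROMPT = (
    "Make text shorter or longer, using slang, chat speak, or just add "
    "or remove vowels and silent letters.\n\n"
    "Long: This text is compressed.\n"
    "Short: dis txt is shrt"
)

def build_request(text):
    # Append two more lines matching the prompt's pattern; the model is
    # then asked to complete the text after the trailing "Short:".
    return f"{PROMPT}\n\nLong: {text}\nShort:"

print(build_request("I want this sentence, only shorter."))
```

Decompression works the same way with the second prompt, swapping the roles of the `Long:` and `Short:` lines.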

--

Compression:

Prompt

  Make text shorter or longer, using slang, chat speak, or just add or remove vowels and silent letters.

  Long: This text is compressed.
  Short: dis txt is shrt
Outputs (each prediction starts after Short:)

Long: I want this sentence, only shorter.

Short: I wnt dis sentnce, only shrr

Long: This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.

Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.

Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.

Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.

--

Now I take the above outputs (without the inputs) and feed them back into GPT-3 with a different prompt.

Decompression:

Prompt

  Make text longer. Expand slang and chat speak, or just add vowels, silent letters, and punctuation.

  Short: dis txt is shrt
  Long: This text is compressed.

  Short: I wnt dis sentnce, only shrr
  Long: I want this sentence, only shorter.
Outputs (each prediction starts after Long:)

Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.

Long: This enhanced the compressed output by all real words, but long the JPEG quality basically controlled the descent into synonyms and made-up words.

Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.

Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image forms as well.


For example, you can shorten the words by removing the vowels.

An example: https://studio.code.org/projects/applab/hxXJIEGg2yza_Q7t9W04...
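That vowel-removal idea fits in a few lines. This sketch keeps each word's first letter even if it's a vowel so words stay recognizable — my own tweak, not necessarily what the linked demo does:

```python
import re

def devowel(text):
    # Drop vowels, but only those preceded by another word character,
    # so word-initial vowels (as in "is") survive.
    return re.sub(r"(?<=\w)[aeiou]", "", text)

print(devowel("this is lossy text compression"))
# ths is lssy txt cmprssn
```

It's lossy in the classic sense: "cmprssn" could decompress to "compression" or "compressing", and the reader's language model picks the likely one.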


"Why waste time say lot word when few word do trick?"


Or, to quote the article, “there is a necessity of knowing dictionary that used in compression process”.


Words said I like. Button press up.


Thank.


+


There's a "translate-shell" package for Linux that provides a "trans" command. It makes for the most fun lossy pipeline, though there's no compression.

  $ echo "It’s Handy That I Have My Library Card Because I’m Totally Checking You Out." | trans -b en:vi | trans -b vi:en
  Completely free when you need to inspect the fish and the whole country to check the water. report.


Surely everyone had to learn how to do lossy text compression in primary school? At least everyone my age in the UK I think. It's called a précis.


If you use a smaller font then you can fit more text on the page, but it's hard to read.


Ths is losy tet comresio


Que?


A summary.

Like with audio and video, you lose some of the original details, but maybe that doesn’t matter to you if it captures enough of what you find important.


Take the English text, translate it into a language with simpler grammar, such as Bahasa Malaysia. Then do lossless compression on the result (you have to do this because e.g. the plural form of a noun in BM is simply the noun, twice).

To decompress do the process in reverse.


d way v used 2 txt b4 on fb


ᵀʰⁱˢ ˢ ᵃⁿ ᵉˣᵃᵐᵖˡ ᵒᶠ ˡᵒˢˢⁱ ᵗᵉˣ ᶜᵒᵐᵖʳᵉˢˢⁱᵒⁿ.


wow it's 50% smaller!


I always use "congrats" instead of "congralutions" because i always misspell it and the other side always get it even though "congrats" could mean other things


Any typo i make i blame on lossy compreson.


1. Start with your text

2. Remove some sentences here and there

3. ???

4. Profit


Yes, I was thinking the same thing when I read the title. I prefer to have my Dostoevsky the same after compression/decompression.

But seriously, is lossy text compression a thing (i.e., actually useful)?


or replacing synonyms with single representation


White space removal


Character removal



