
"OCR rot"


If you read the article, you'd know that OCR isn't used here. The issue is way more interesting than that.


Well... this issue isn't exactly unrelated to OCR. OCR compresses images of text by representing them as text, which is more abstract and therefore takes less space to describe. The particular glyphs being recognized are fixed in advance -- platonic a, b, c, etc.

Here, it would be fair to describe what's going on as OCR with the glyphs not being fixed in advance, but rather being discovered on the fly by the algorithm. The entire concept is to identify sections of the image that "show the same thing", and replace the data in those sections with pointers to a single representative patch. That's really not so different from compressing image data that looks suspiciously similar to a capital A down to the one byte 0x41. It's just that different image sections are being Optically Recognized as "similar to each other" rather than "similar to this hardcoded reference glyph".
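To make the idea concrete, here is a minimal Python sketch of that kind of patch substitution. It is not the algorithm from the article: the names (`hamming`, `compress`, `decompress`, `max_diff`), the fixed-size binary patches, and the Hamming-distance similarity test are all assumptions made for illustration.

```python
from typing import List, Tuple

# A patch is a small binary bitmap: a tuple of rows, each a tuple of 0/1 pixels.
Patch = Tuple[Tuple[int, ...], ...]


def hamming(a: Patch, b: Patch) -> int:
    """Count the pixels at which two same-sized patches differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(patches: List[Patch], max_diff: int) -> Tuple[List[Patch], List[int]]:
    """Replace each patch with a pointer into a dictionary built on the fly.

    A patch within max_diff pixels of an existing representative is stored
    as a pointer to that representative; otherwise it becomes a new one.
    """
    dictionary: List[Patch] = []
    pointers: List[int] = []
    for patch in patches:
        for i, representative in enumerate(dictionary):
            if hamming(patch, representative) <= max_diff:
                pointers.append(i)
                break
        else:
            # No representative was close enough, so this patch is a new "glyph".
            dictionary.append(patch)
            pointers.append(len(dictionary) - 1)
    return dictionary, pointers


def decompress(dictionary: List[Patch], pointers: List[int]) -> List[Patch]:
    """Rebuild the patch sequence from the representatives alone."""
    return [dictionary[i] for i in pointers]


# Hypothetical 3x3 patches: A2 differs from A by a single pixel, B is unrelated.
A: Patch = ((1, 1, 1), (1, 0, 1), (1, 1, 1))
A2: Patch = ((1, 1, 1), (1, 0, 1), (1, 1, 0))
B: Patch = ((0, 1, 0), (0, 1, 0), (0, 1, 0))

dictionary, pointers = compress([A, A2, B, A], max_diff=1)
restored = decompress(dictionary, pointers)
```

With `max_diff=1`, `A2` is judged similar to `A` and comes back from `decompress` as an exact copy of `A`, so the one-pixel difference is gone. That is the sense in which patches get "recognized" as each other instead of as a hardcoded reference glyph.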


OCR generally does not work as you describe. The common case is for the OCR system to tag characters in an image, so that text may be selected. More advanced systems will generate fonts from the images and replace the text with those. Either way, the text isn't reduced to a single byte.


I've read plenty of Kindle books that were clearly the product of OCR. True, "cl" hasn't reduced the image of a lowercase d to a single byte, but that was the intention. Don't confuse OCR, the concept, with OCR-as-implemented-in-a-particular-way, or with a-process-that-we-called-OCR-because-OCR-is-involved-at-some-point. OCR is any system that recognizes sections of image data as matching letter shapes[1].

"Generating a font from the image and replacing the original image data with that" is a very good description of what's going on here.

[1] Or numbers, or symbols like parentheses. The basic concept is letters.



