
Not sure what you mean by destructive replacement, since nothing is destroyed.

So I just looked into this, and it's specifically the Mixed Raster Content (MRC) pipeline (ISO/IEC 16485), which is used in lots of different scanners. There's no need to find which specific software generated it, because lots of them use it.

It's a technique used to attempt to isolate font characters of the same size and style as separate layers before OCR-ing to make OCR more accurate.

ABBYY FineReader, for example, is mentioned as producing the exact same type of results. But there's no guarantee that was the actual software because lots of scanning software does it -- it's a general technique. Plus it won't even be deterministically reproducible if it was e.g. scanned and OCR'd at higher resolution and then saved at a lower resolution, as is generally considered best practice for maximizing accuracy while keeping file sizes lower.

https://www.obamaconspiracy.org/2013/01/heres-the-birth-cert...

So this is very much a nothingburger. It's not an "alternative theory", it's a complete and total explanation.





So this program really doesn't keep the original image of the document as a raster layer? That's kind of surprising, especially if it's used in the legal world. Personally, I'd always want to be able to recover the original document from the OCR layers. Or, are you saying you can? Then you should tell Snopes, because it'll make the Snopes article a lot shorter if they can just lead with that.

I think you are misunderstanding. The pipeline is e.g.:

Scan (600 dpi) > MRC (600 dpi) > OCR (600 dpi) > Downsample (150 dpi) > Save to PDF (150 dpi)

The image is saved in raster format at 150 dpi. That's the document, but not at the original scanning resolution. If you performed MRC and OCR at the 150 dpi level, you'd get different/worse results than were originally obtained at 600 dpi. Which is why you always OCR before downsampling, and you downsample for smaller files.
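
To make the ordering point concrete, here is a toy sketch in Python. It is not how any real scanner works: the threshold "segmentation" is only a stand-in for the MRC/OCR steps, and the 8x8 "scan", the factor of 4, and all function names are made up for illustration. It just shows that segmenting first and shrinking afterwards can give a different answer than shrinking first and segmenting afterwards.

```python
def downsample(img, factor):
    """Box-filter downsample: average each factor x factor block of pixels."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / factor**2
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def text_mask(img, threshold=128):
    """Toy segmentation (assumption): any pixel darker than threshold is 'text'."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def reduce_mask(mask, factor):
    """Shrink a mask: a low-res cell is 'text' if any high-res pixel in it was."""
    h, w = len(mask), len(mask[0])
    return [
        [
            1 if any(mask[y + dy][x + dx] for dy in range(factor) for dx in range(factor)) else 0
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


SIZE, FACTOR = 8, 4  # hypothetical stand-ins for 600 dpi and 150 dpi

# White page (255) with a one-pixel-wide black stroke (0) in column 1.
scan = [[0 if x == 1 else 255 for x in range(SIZE)] for _ in range(SIZE)]

# Order used by the pipeline above: segment at full resolution, then shrink.
segment_then_shrink = reduce_mask(text_mask(scan), FACTOR)

# Order you are stuck with if you only have the saved file: shrink, then segment.
shrink_then_segment = text_mask(downsample(scan, FACTOR))

print(segment_then_shrink)
print(shrink_then_segment)
```

At full resolution the thin stroke is found; after averaging it blends into the white around it and the same threshold no longer sees it. That is the toy version of why rerunning the process on the saved file doesn't reproduce the original layers.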

This isn't changing anything about the Snopes article. It just explains why if you run MRC/OCR at the PDF resolution, you won't deterministically reproduce it because it's not the resolution it was originally run at.

You do understand that this OCR is only for being able to search and highlight text? It's not changing what's displayed. That's still the pixels.


I didn't see the original pixels in the document at any resolution though. That's the point.

You don't see the pixels when you zoom in? Try again:

https://obamawhitehouse.archives.gov/sites/default/files/rss...

If you don't see jaggy pixel edges to the letters and form elements, what do you see?


Are you saying that I'm saying that there are no pixels in the document? Like, do you think that I think that scanners have come to operate on pure platonic forms and no longer use the concept of pixels? That would be really cool, wouldn't it. But no, I don't believe that. Hm. Where did this conversation go wrong? I think I was unclear in my last statement. I have yet to see someone show the original scribbles or ink marks that these OCR layers were generated based on. That's what I meant by "destructive". Now, I'm no expert on documents, so you might want to just cut your losses and stop trying to educate me and let me be uneducated in this matter. I'll accept that I don't know what I'm talking about, and reduce my criticisms of this whole thing to pointing out that the explanations don't make sense to me.

> Are you saying that I'm saying that there are no pixels in the document?

I genuinely don't know what you're saying.

> Like, do you think that I think that scanners have come to operate on pure platonic forms and no longer use the concept of pixels? That would be really cool, wouldn't it.

Yes, because that is absolutely a thing. That's what Adobe ClearScan does, converting pixels to smooth vector outlines. Zoom in, and zero pixels in OCR'd text. That's not the case in this file though.

> I have yet to see someone show the original scribbles or ink marks that these OCR layers were generated based on. That's what I meant by "destructive".

I still genuinely don't know what you mean. The original scribbles and ink marks are a physical piece of paper. The MRC layers are generated from the scan and don't destroy anything, they only separate. The resulting layered bitmap is identical, pixel-for-pixel. My best interpretation of what you're saying is you want a higher-resolution scan? But why? Again, nothing "destructive" has happened except maybe reducing the resolution. But "destructive" is not a word people usually use for that.

> so you might want to just cut your losses and stop trying to educate me and let me be uneducated in this matter

I can't tell if you're being sarcastic or not. I am genuinely happy to help you understand, but if you really don't want to then obviously I won't spend any more time replying. But if you're going to publicly throw suspicion on the validity of the birth certificate, I feel it's important to correct the record here on HN simply for other people who might read this exchange.


I'm starting to understand where this conversation diverged. I'm coming from a place of having read the Snopes page and watched the videos linked there. I think understanding where I'm at is a good place to start trying to explain it to me. To put it more clearly, at this point I've seen a video that seems to show that the PDF has a collection of layers: some contain text, and one contains the page below. Now, it seems like you were saying that the text layers are just the pixels from the page moved up to a new layer. I said that I think that's surprising. Then we got caught up on the meaning of the words "original pixels". I probably should have said: a full buffer of pixels from the CCD sensor, perhaps with resolution reduction or compression, but nothing moved to new layers (whether that's normally considered "destructive" or not is another issue).

OK, well hopefully you understand now! This part is key:

> Now, it seems like you were saying that the text layers are just the pixels from the page moved up to a new layer. I said that I think that's surprising.

That's indeed all it is. It may be surprising, but that's how it works. Absolutely nothing about the pixels is changed as part of the process of that text layer separation. Later, when saving the final PDF, there's normal lossy compression, same as any JPEG, but the layer process actually preserves the text edges better than JPEG.
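
If it helps, the "pixels moved up to a new layer" idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real MRC encoder: the threshold rule, the tiny 3x3 "page", and the names `separate` and `composite` are all invented for illustration. The point is only that splitting pixels into layers and stacking them back gives you the exact same pixels.

```python
def separate(img, threshold=128):
    """Split a grayscale image into (mask, foreground, background) layers.

    Toy rule (assumption): pixels darker than threshold go to the foreground
    ("text") layer, everything else stays on the background layer. Pixel
    values are copied, never modified. None marks "nothing on this layer here".
    """
    mask = [[p < threshold for p in row] for row in img]
    fg = [[p if m else None for p, m in zip(row, mrow)] for row, mrow in zip(img, mask)]
    bg = [[None if m else p for p, m in zip(row, mrow)] for row, mrow in zip(img, mask)]
    return mask, fg, bg


def composite(mask, fg, bg):
    """Stack the layers back: take the foreground pixel wherever the mask is set."""
    return [
        [f if m else b for m, f, b in zip(mrow, frow, brow)]
        for mrow, frow, brow in zip(mask, fg, bg)
    ]


# Made-up 3x3 "page": light paper with a few dark ink pixels.
page = [
    [255, 30, 250],
    [240, 20, 245],
    [255, 25, 10],
]

mask, fg, bg = separate(page)
rebuilt = composite(mask, fg, bg)

print(rebuilt == page)  # separation followed by compositing is lossless here
```

Any loss comes afterwards, when the layers are compressed for saving, not from the separation step itself.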

So I do hope you're satisfied now that everything about this is just normal image processing, and that nothing has been destroyed, there's no "missing evidence" or anything. And that the whole idea that the layers imply some kind of manipulation or forgery is false. What you're seeing is just the scan itself, saved (presumably) at a lower resolution and with the normal image compression scanners produce. The layer separation process is completely non-destructive and doesn't manipulate pixels at all.



