Other interesting news from the link: they expect OCR to improve enough in the coming years that they can apply it to their entire library. This would liberate an enormous amount of knowledge for easier access.
They are talking about treating OCR as lossy. I wonder about making a lossless compression algorithm for text scans based on OCR: in effect, use the OCR to predict which text will show up and how it will look, then encode the pixel-level differences on top of that.
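To make the predict-then-encode-the-difference idea concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the 3x5 `GLYPHS` table stands in for a real font renderer, the "scan" is a flat run of 0/1 pixels, and `zlib` stands in for whatever entropy coder a real scheme would use. It is not how any existing OCR or scan format works, just the shape of the idea.

```python
import zlib

# Hypothetical 3x5 glyph bitmaps standing in for a real font renderer.
GLYPHS = {
    "H": ["101", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_H = 5


def render(text):
    """Render text to a flat bytes object of 0/1 pixels, row by row."""
    pixels = []
    for y in range(GLYPH_H):
        for ch in text:
            pixels.extend(int(bit) for bit in GLYPHS[ch][y])
    return bytes(pixels)


def encode(scan, ocr_text):
    """Store the OCR text plus the compressed XOR residual against its rendering."""
    predicted = render(ocr_text)
    if len(predicted) != len(scan):
        raise ValueError("prediction and scan must have the same size")
    residual = bytes(a ^ b for a, b in zip(scan, predicted))
    return ocr_text, zlib.compress(residual, 9)


def decode(ocr_text, packed):
    """Re-render the prediction and XOR the residual back in to recover the scan."""
    predicted = render(ocr_text)
    residual = zlib.decompress(packed)
    return bytes(a ^ b for a, b in zip(predicted, residual))


# A "scan" that differs from the clean rendering by two noisy pixels.
noisy = bytearray(render("HI HI"))
noisy[0] ^= 1
noisy[7] ^= 1
scan = bytes(noisy)

text, packed = encode(scan, "HI HI")
assert decode(text, packed) == scan  # round trip is exact
```

Note that the round trip stays exact even if the OCR gets the text wrong: a bad guess only makes the residual denser and so compresses worse. A real version would also have to predict layout, font, and glyph position, since otherwise the residual is far from empty.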