Having gone through the gut-wrenching task of choosing which data books to throw...

yourapostasy · on Aug 17, 2015

When I looked into book scanning a few years ago, the Kirtas (mentioned elsewhere in this thread) was, as far as I could tell from Net sources, the reference method for fast, high-volume, non-destructive, high-quality scanning, and it remains so. Even so, many libraries with Kirtas units still employ someone to stand watch over the page turning and ensure only one page at a time is flipped. Perfect page turning is apparently not a completely solved problem yet.

If procuring a Kirtas unit (and paying nearly $10K USD per year in maintenance fees) through a hacker collective or maker space is infeasible in your area, then the community at www.diybookscanner.org has a workable solution for a much smaller subset of what the Kirtas units address, so you could look into that as a modest workaround for the time being. I do wonder, though, what results they got for dewarping by simply taking pictures from all sides of the scanning target to synthetically construct a 3D volume, as perfect dewarping remains an open problem.

userbinator · on Aug 17, 2015

> Even so, many libraries with Kirtas units still employ someone to stand watch over the page turning and ensure only one page at a time is flipped

Most books have page numbers; couldn't they use those along with OCR to detect and retry skipped pages? Maybe even add a mode that shakes the pages more than usual in an attempt to separate ones stuck together. It doesn't sound too difficult to do (perhaps you'd have to tell it where the page number is), given what the Kirtas machine costs.
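
To make the idea concrete, here is a rough sketch of the check in Python. It is not anything the Kirtas software is known to do; it assumes some OCR step (not shown) has already produced one page number per capture, with None where the number could not be read. The function name find_page_gaps and the step parameter are made up for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def find_page_gaps(
    observed: Sequence[Optional[int]], step: int = 1
) -> List[Tuple[int, int, int]]:
    """Flag captures whose page number does not follow the last readable one.

    observed: OCR'd page numbers in capture order; None means unreadable.
    step: how much the page number should advance per capture
          (hypothetical knob: 2 if each camera only sees one side).
    Returns a list of (capture_index, expected_number, seen_number).
    """
    gaps = []
    last_num = None  # last page number that OCR managed to read
    last_idx = None  # capture index where that number was read
    for idx, num in enumerate(observed):
        if num is None:
            # Unreadable number: skip it, but the index still advances,
            # so the next readable page is checked against the right value.
            continue
        if last_num is not None:
            expected = last_num + step * (idx - last_idx)
            if num != expected:
                gaps.append((idx, expected, num))
        last_num, last_idx = num, idx
    return gaps


if __name__ == "__main__":
    # Page 4 was skipped, so the operator would be told to retry there.
    print(find_page_gaps([1, 2, 3, 5, 6]))
```

Each flagged index would be where the machine stops and retries the flip. Roman-numeral front matter, unnumbered plates, and restarted numbering would all trigger false alarms, so a real version would need more than this.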

yourapostasy · on Aug 22, 2015

The challenge seems to be that the OCR takes place in a post-processing phase instead of in real time, while the desire is to catch the improper page flip before putting away the book. Perhaps with one or more gigabit pipes, the image processing could take place in the cloud in near real time.

The Kirtas units seem highly regarded by conservators, who might have lots of objections to even gentle shaking of their sometimes fragile charges. The impression I get is that the slight vacuum the Kirtas applies to pages is the most handling that is accepted. Recent developments in computer vision and robotic fingers might eventually yield an improved robotic analog to a human page flipper.

My personal hunch is the popularization and (relative) mass adoption of the slower, lower-tech open source book scanners will eventually outstrip the dedicated scanning throughput of the high-end units, and put more digitized content onto the Net, along with a legal fight over content "abandoned" by publishers. When I digitize my content, it goes into my private collection, but I sure wish publishers were more aggressive with digitization of the older material, or lenient with letting that older material go into the public domain if they aren't even chasing the long-long-long tail of that material anymore.