Hey, I'm the author here. My web host is having some problems right now, and I'd always intended to post a link to the demo instead, so I've put up a Cloudflare redirect straight to the demo.
B. It's 6:30 AM and I haven't had my coffee yet... but am I reading this right that he isn't using Tesseract? I know he says it was a bad idea to even try compiling it, but then he spends a large part of the post talking about how great Tesseract is... just wanted to make sure I didn't miss a "Well, I finally bit the bullet and successfully got Tesseract compiled."
C. If he's not using Tesseract, then how does the accuracy of what he is using (GOCR and Ocrad) compare to Tesseract's? I see that GOCR was recently updated to 0.5 (though not yet uploaded to SourceForge, according to the notes at http://jocr.sourceforge.net/).
FWIW, Tesseract is at 3.02 and its latest release notes are dated 10/23/2012. While doing things in straight JS has a lot of value in web apps, Tesseract, in my experience, is really far ahead of its OSS peers, and further along than a lot of commercial packages. I'm not sure the conveniences of pure-JS OCR outweigh the need for accuracy in this domain.
Tesseract is certainly further along than its OSS peers, but it's not even close to commercial packages. The most promising OSS project I've seen is another Google-sponsored effort, OCRopus (http://code.google.com/p/ocropus/), but it is very much ongoing research.
Yeah, I didn't get the explanation of why he didn't use Tesseract either. It seems like a "self-evident" argument was dropped, but I'm not in the know about why it's obviously not possible. My guess is that Tesseract is spread across 100 files and does a ton of I/O, or something else that can't be translated easily? Dunno.
At the end of http://antimatter15.github.io/ocrad.js/demo.html, "When you include the training data, Tesseract is actually kind of massive — A functional Emscripten port would probably be at least 30 times the size of OCRAD.js!"
Why include the training data? There's no need to train in the browser; do the training with the native app and just port the code that runs the neural nets to the browser. It should be a fraction of the size.
I've kicked the tires on the live demo and found that it seems to work well only if you draw seriffed letters. My initial attempt at a B came back as a 'g'. I then tried again, making it as perfect as possible, and got an '8'. So then I tried once more, adding serifs (I think?) at the top and bottom extending to the left, and finally got a 'B'.
My attempts at full words took a lot of tweaking of the letter forms to get a correct match. Many letters were not identified at all.
It's intended for recognizing typeset characters, not handwriting.
So why did they provide an interface that lets you scribble? I dunno. For fun, I guess.
(Though the author does say "Ocrad does seem to vastly outperform GOCR when it comes to letter sketches on a canvas, so that's the one I'm focusing on here." which suggests that recognizing hand-drawn letters is something the author's interested in.)
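For what it's worth, the demo's scribble pad just hands the whole canvas to the library. A minimal sketch of that flow, assuming the global OCRAD() function the port appears to expose, which takes a canvas (or ImageData) and synchronously returns the recognized string (the 'sketchpad' id is made up for the example):

    // Sketch: recognize whatever has been drawn on a <canvas>.
    // Assumes ocrad.js is loaded via a <script> tag, which
    // defines a global OCRAD() function.
    var canvas = document.getElementById('sketchpad'); // hypothetical id
    var ctx = canvas.getContext('2d');

    // Ocrad wants dark glyphs on a light background, so fill
    // white and draw in black before calling it.
    var text = OCRAD(canvas); // synchronous; returns a string
    console.log('recognized:', text);

Even then, as noted above, the engine is tuned for print-like shapes, so sketchy strokes will often misread.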
I'm always curious... why not port the original by hand rather than use Emscripten? Naively you'd expect better performance and maintainability... is it just a time-saving thing?
Actually, Emscripten outputs a strict subset of JS dubbed asm.js. Using this subset allows some really significant speed improvements in execution, due to simplified type checking.
My understanding, then, is that for certain things this could well be faster than a hand-written JavaScript port.
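To make that concrete, here's a tiny hand-written illustration of the kind of code asm.js allows (not actual Emscripten output, just a sketch): every value is pinned to an int or double with |0 and similar annotations, which is what lets an engine validate types up front instead of checking them dynamically.

    function AsmModule(stdlib) {
      "use asm"; // opts the module into asm.js validation
      var imul = stdlib.Math.imul;

      // Sum of squares of 0..n-1, with every value typed.
      function sumsq(n) {
        n = n | 0;                // parameter declared int32
        var i = 0;
        var acc = 0;
        for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
          acc = (acc + imul(i, i)) | 0;
        }
        return acc | 0;           // return type declared int32
      }

      return { sumsq: sumsq };
    }

    var sumsq = AsmModule(window).sumsq; // e.g. sumsq(10) === 285

An engine that recognizes the "use asm" directive can compile the whole module ahead of time; engines that don't simply run it as ordinary JavaScript, since it's still valid JS.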
asm.js can be optimized quite easily, but V8 is not yet tuned for it, so the same code might run faster or slower depending on which JavaScript engine is running it.
I guess it saves you time developing and testing. It also makes you more confident the code actually works in production, since there are probably years of work behind existing libraries.
> there are probably years of work behind existing libraries
This is often good thinking, but it is fallacious.
For high-performance or critical code I've found many situations where platform libraries or ancient software have enormous bugs, memory leaks, or are trivially outperformed.
memcpy is perhaps the best example of this. Yes, memcpy... see e.g. http://software.intel.com/en-us/articles/memcpy-performance; skim-reading that, I already know how to outperform their implementation, even if only by a tiny bit (and no, I'm not thinking of cache hints, which suddenly seem to be the flavour of the month now that script kiddies have discovered them). I've also found bugs in increasing numbers over the years. The latest flavour of Microsoft madness (WinRT) is pretty leaky and ties your hands while slowing you down; Objective-C/Cocoa Touch isn't far behind, with overkill, super-generic late-binding interfaces that throw my performance up the wall, and a reference-counting system that has caused me more trouble than new and delete ever have, not to mention various bugs, especially in the wide-character and Unicode support in their C std lib. Don't get me started on *nix: even make has a serious failing in using timestamps to detect changes!
All code is written by programmers; most programmers are terrible, and some are merely bad.
This won't beat anything but a very, very simple captcha. From what I gathered, it uses feature extraction only, which means you can't train it on a lot of data to increase its accuracy.
Of course, but as the article says, Tesseract, which was developed at HP in the '80s and more recently adopted by Google, is really the engine of choice. You might check out
Yes, you can use the original project this was created from (Ocrad); since it's nice native code, you won't need to jump through any hoops like running it through Emscripten to use it that way.
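For example, a rough sketch of shelling out to the native binary from Node (assuming ocrad is installed and on your PATH; it reads PBM/PGM/PPM images, not PNG/JPEG, and prints the recognized text to stdout; the filename is a placeholder):

    // Sketch: call the native ocrad binary from Node.js.
    var execFile = require('child_process').execFile;

    execFile('ocrad', ['scan.pbm'], function (err, stdout, stderr) {
      if (err) throw err;
      console.log('recognized:', stdout.trim());
    });

Native Ocrad will also generally be faster than the Emscripten build, at the cost of not running in the browser.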