Hey, I'm the author here. My web host is having some problems right now, and I'd always intended to post a link to the demo instead, so I've put up a Cloudflare redirect straight to the demo.
B. It's 6:30 AM and I haven't had my coffee yet... but am I reading this right that he isn't using Tesseract? I know he says it was a bad idea to even try compiling it, but then he spends a large part of the post talking about how great Tesseract is... just wanted to make sure I didn't miss a "Well, I finally bit the bullet and successfully got Tesseract compiled."
C. If he's not using Tesseract, then how does the accuracy of what he is using (GOCR and Ocrad) compare to Tesseract's? I see that GOCR was recently updated to 0.5 (though not yet uploaded to SourceForge, according to the notes at http://jocr.sourceforge.net/).
FWIW, Tesseract is at 3.02 and its latest release notes are dated 10/23/2012. While doing things in straight JS has a lot of value in web apps, Tesseract, in my experience, is really far ahead of its OSS peers, and further along than a lot of commercial packages. I'm not sure the conveniences of pure-JS OCR outweigh the need for accuracy in this domain.
Tesseract is certainly further along than its OSS peers, but it's not even close to commercial packages. The most promising OSS project I've seen is another Google-sponsored effort, OCRopus (http://code.google.com/p/ocropus/), but it is very much ongoing research.
Yeah, I didn't get the explanation of why he didn't use Tesseract either. It seems like a "self-evident" argument was dropped, but I'm not in the know about why it's obviously not possible. My guess is that Tesseract is spread across 100 files and does a ton of I/O, or something else that can't be translated easily? Dunno.
At the end of http://antimatter15.github.io/ocrad.js/demo.html, "When you include the training data, Tesseract is actually kind of massive — A functional Emscripten port would probably be at least 30 times the size of OCRAD.js!"
Why include the training data? There's no need to train in the browser; do the training with the native app and just port the code that runs the neural nets to the browser. It should be a fraction of the size.
I've kicked the tires on the live demo and found that it seems to work well only if you draw seriffed letters. My initial attempt at a B came back as a 'g'. I then tried again, making it as perfect as possible, and got an '8'. So then I tried once more, adding serifs (I think?) at the top and bottom extending to the left, and finally got a 'B'.
My attempts at full words took a lot of tweaking of the letter forms to get a correct match. Many letters were not identified at all.
It's intended for recognizing typeset characters, not handwriting.
So why did they provide an interface that lets you scribble? I dunno. For fun, I guess.
(Though the author does say "Ocrad does seem to vastly outperform GOCR when it comes to letter sketches on a canvas, so that's the one I'm focusing on here." which suggests that recognizing hand-drawn letters is something the author's interested in.)
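For what it's worth, the demo's scribble pad just hands the whole canvas to the library. A minimal sketch of that flow, assuming the global OCRAD() function the port appears to expose, which takes a canvas (or ImageData) and synchronously returns the recognized string (the 'sketchpad' id is made up for the example):

    // Sketch: recognize whatever has been drawn on a <canvas>.
    // Assumes ocrad.js is loaded via a <script> tag, which
    // defines a global OCRAD() function.
    var canvas = document.getElementById('sketchpad'); // hypothetical id
    var ctx = canvas.getContext('2d');

    // Ocrad wants dark glyphs on a light background, so fill
    // white and draw in black before calling it.
    var text = OCRAD(canvas); // synchronous; returns a string
    console.log('recognized:', text);

Even then, as noted above, the engine is tuned for print-like shapes, so sketchy strokes will often misread.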
I'm always curious... why not port the original by hand rather than use Emscripten? Naively you'd expect better performance and maintainability... is it just a time-saving thing?
Actually, Emscripten outputs a strict subset of JS dubbed asm.js. Using this subset allows some really significant speed improvements in execution, due to simplified type checking.
My understanding, then, is that for certain things this could well be faster than a hand-written JavaScript port.
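To make that concrete, here's a tiny hand-written illustration of the kind of code asm.js allows (not actual Emscripten output, just a sketch): every value is pinned to an int or double with |0 and similar annotations, which is what lets an engine validate types up front instead of checking them dynamically.

    function AsmModule(stdlib) {
      "use asm"; // opts the module into asm.js validation
      var imul = stdlib.Math.imul;

      // Sum of squares of 0..n-1, with every value typed.
      function sumsq(n) {
        n = n | 0;                // parameter declared int32
        var i = 0;
        var acc = 0;
        for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
          acc = (acc + imul(i, i)) | 0;
        }
        return acc | 0;           // return type declared int32
      }

      return { sumsq: sumsq };
    }

    var sumsq = AsmModule(window).sumsq; // e.g. sumsq(10) === 285

An engine that recognizes the "use asm" directive can compile the whole module ahead of time; engines that don't simply run it as ordinary JavaScript, since it's still valid JS.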
asm.js can be optimized quite easily, but V8 is not yet tuned for it, so the same code might run faster or slower depending on which JavaScript engine is running it.
I guess it saves you time developing and testing. It also makes you more confident the code actually works in production, since there are probably years of work behind existing libraries.
> there are probably years of work behind existing libraries
This is often good thinking, but it is fallacious.
For high-performance or critical code I've found many situations where platform libraries or ancient software have enormous bugs, memory leaks, or are trivially outperformed.
memcpy is perhaps the best example of this. Yes, memcpy... see e.g. http://software.intel.com/en-us/articles/memcpy-performance; skim-reading that, I already know how to outperform their implementation, even if only by a tiny bit (and no, I'm not thinking of cache hints, which suddenly seem to be the flavour of the month now that script kiddies have discovered them). I've also found bugs in increasing numbers over the years. The latest flavour of Microsoft madness (WinRT) is pretty leaky and ties your hands while slowing you down; Objective-C/Cocoa Touch isn't far behind, with overkill, super-generic late-binding interfaces that throw my performance up the wall, and a reference-counting system that has caused me more trouble than new and delete ever have, not to mention various bugs, especially in the wide-character and Unicode support in their C std lib. Don't get me started on *nix: even make has a serious failing in using timestamps to detect changes!
All code is written by programmers; most programmers are terrible, and some are merely bad.
This won't beat anything but a very, very simple captcha. From what I gathered, it uses feature extraction only, which means you can't train it on a lot of data to increase its accuracy.
Of course, but as the article says, Tesseract, which was developed at HP in the '80s and more recently adopted by Google, is really the engine of choice. You might check out
Yes, you can use the original project this was created from (Ocrad); since it's nice native code, you won't need to jump through any hoops like running it through Emscripten to use it that way.
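For example, a rough sketch of shelling out to the native binary from Node (assuming ocrad is installed and on your PATH; it reads PBM/PGM/PPM images, not PNG/JPEG, and prints the recognized text to stdout; the filename is a placeholder):

    // Sketch: call the native ocrad binary from Node.js.
    var execFile = require('child_process').execFile;

    execFile('ocrad', ['scan.pbm'], function (err, stdout, stderr) {
      if (err) throw err;
      console.log('recognized:', stdout.trim());
    });

Native Ocrad will also generally be faster than the Emscripten build, at the cost of not running in the browser.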