This is a wicked cool site, but you need to put in screenshots of the input (how it went in) and the output (what the output looked like in an epub reader).
What approach do your algorithms use? Do you recognize titles, subtitles, etc. based on differences in fonts, spacing, line length and so on? Or do you need to enter regexps to recognize those?
Do you recognize paragraphs correctly?
Can you filter out front and back filler like the ToC, and extract only the 'content' pages?
If so, it's 90% of what I'm looking for and I think good enough to pay for :)
I have some notes on how to approach this from when I tried to make it myself, including what functionality I consider necessary for an MVP. Let me know if you're interested...
I'm working on an FAQ/Help page which will show some of those features in more detail.
The algorithm I use is a variation of the code described here: http://denis.papathanasiou.org/?p=343 except the output is HTML, not text, so that I can take into account things like font sizes and paragraph breaks.
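To give a rough idea of how font sizes and paragraph breaks can drive the HTML output, here is a minimal sketch. It is an illustration only, not the site's actual code: the `Line` record, the `lines_to_html` function, and both threshold ratios are made up for the example, and it assumes a PDF extractor has already supplied each line's text, font size, and the vertical gap above it.

```python
from dataclasses import dataclass
from html import escape
from statistics import median


@dataclass
class Line:
    text: str
    font_size: float  # in points (assumed to come from the PDF extractor)
    gap_above: float  # vertical space to the previous line, in points


def lines_to_html(lines, heading_ratio=1.2, para_gap_ratio=1.5):
    """Turn extracted lines into headings and paragraphs.

    Hypothetical heuristic: a line noticeably larger than the median
    font size is a heading; a gap noticeably larger than the median
    line gap starts a new paragraph.
    """
    if not lines:
        return ""
    body_size = median(line.font_size for line in lines)
    body_gap = median(line.gap_above for line in lines)
    out, para = [], []

    def flush():
        # Emit the paragraph collected so far, if any.
        if para:
            out.append("<p>" + escape(" ".join(para)) + "</p>")
            para.clear()

    for line in lines:
        if line.font_size >= body_size * heading_ratio:
            flush()
            out.append("<h2>" + escape(line.text) + "</h2>")
            continue
        if para and line.gap_above > body_gap * para_gap_ratio:
            flush()
        para.append(line.text)
    flush()
    return "\n".join(out)
```

Using medians means the thresholds adapt to each document rather than relying on fixed point sizes, but real PDFs will still need per-book tuning.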
If you sign up and try it (it's free for the first 3 days), you'll see that the parser renders each PDF page as text, and it's up to you to decide which range of pages you want to use in your book.
Feel free to contact me through the form on that site, and I can reply in more detail.