Hacker News | garethl's comments


Thanks!


The open-source Ghostscript [1] can convert simple PDFs to text, while keeping the layout. I doubt it will handle some of the more complicated cases outlined in the article though.

I use it quite successfully to turn my bank statements into text, which can then be further processed.

[1]: https://www.ghostscript.com/
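
For anyone who wants to try this, here is a minimal sketch of the conversion, assuming the `gs` executable is on the PATH (on Windows it is usually `gswin64c`) and using Ghostscript's `txtwrite` output device. The function names and file names are made up for illustration, and this is not necessarily the exact invocation I use.

```python
import subprocess
from pathlib import Path


def txtwrite_command(pdf_path, txt_path):
    """Build a Ghostscript command that extracts text with the txtwrite device."""
    return [
        "gs",                  # assumed executable name; adjust for your platform
        "-q",                  # suppress the startup banner
        "-sDEVICE=txtwrite",   # Ghostscript's text-extraction output device
        "-o", str(txt_path),   # -o sets the output file and implies -dBATCH -dNOPAUSE
        str(pdf_path),
    ]


def pdf_to_text(pdf_path):
    """Run Ghostscript on one PDF and return the path of the text file written."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(txtwrite_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling `pdf_to_text("statement.pdf")` would write `statement.txt` next to the input, ready for whatever further processing you need.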


I've recently done this. I have scanned over 5,000 documents to PDF, then batch converted those from PDF to TIFF using Ghostscript, and then used Tesseract to OCR the TIFF and combine both back into a searchable PDF. Tesseract may not be the world's best OCR software, but it's free, and both it and Ghostscript are easy to automate.

Now all I need is a good front end search system for my document archive.


How did you scan the documents to PDF? I use a Canon P-208 that has served me well for many many years (long may it!) and the OCR on that works well.

Does the scanning system you use not do OCR?

I use a Mac and Spotlight does a good job of indexing the files. I think alternatives for other OSes might be something like Apache Solr?


I have a Brother ADS-2700w[1] as my scanner, which is network connected. It scans directly to a network share (SMB, but it also supports FTP, NFS etc.) and outputs as PDF. The PDFs are basically 'dumb' PDFs, in that each page is an image wrapped up inside the PDF container.

So that's where Ghostscript comes in. On a schedule, a script picks up new PDFs in the share and runs them through Ghostscript to create a multipage TIFF. That TIFF is then given to Tesseract (which can't handle PDFs natively), which does the OCR and outputs a nice PDF with searchable text. All very simple.
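
Roughly, the script does something like the sketch below. This is a simplified illustration rather than the real thing: the directory arguments, function names, the 300 dpi resolution and the `tiffg4` device (black-and-white; `tiff24nc` would suit colour scans) are all assumptions. It expects `gs` and `tesseract` to be on the PATH.

```python
import subprocess
from pathlib import Path


def tiff_command(pdf_path, tiff_path, dpi=300):
    """Ghostscript command that renders every page of a PDF into one multipage TIFF."""
    return [
        "gs", "-q",
        "-sDEVICE=tiffg4",      # assumed device: 1-bit G4-compressed TIFF
        f"-r{dpi}",             # render resolution in dots per inch
        "-o", str(tiff_path),   # output file; implies -dBATCH -dNOPAUSE
        str(pdf_path),
    ]


def ocr_command(tiff_path, output_base):
    """Tesseract command that OCRs a TIFF and writes <output_base>.pdf."""
    # The trailing "pdf" config makes Tesseract emit a searchable PDF.
    return ["tesseract", str(tiff_path), str(output_base), "pdf"]


def process_new_pdfs(inbox, outbox):
    """Convert every PDF in inbox that has no counterpart in outbox yet."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for pdf in sorted(inbox.glob("*.pdf")):
        target = outbox / pdf.name
        if target.exists():
            continue  # already processed on an earlier run
        tiff = outbox / (pdf.stem + ".tiff")
        subprocess.run(tiff_command(pdf, tiff), check=True)
        subprocess.run(ocr_command(tiff, outbox / pdf.stem), check=True)
        tiff.unlink()  # the intermediate TIFF is no longer needed
        done.append(target)
    return done
```

Running `process_new_pdfs` from cron or Task Scheduler gives the "on a schedule" behaviour. Skipping files that already exist in the output folder keeps repeated runs cheap.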

The scanning of the pages is very fast, but the scanner takes an age sending the PDFs over the network. Its Ethernet port is only 100 Mbit/s, but to be honest I think the CPU inside the scanner is simply slow. It also doesn't have enough internal buffer, which means you can't scan the next document until the previous one has finished being sent to the share.
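
A quick back-of-the-envelope check shows why I blame the CPU rather than the port. The 25 MB file size below is a made-up example, not a measurement, and the calculation ignores protocol overhead.

```python
def transfer_seconds(size_megabytes, link_mbit_per_s):
    """Ideal time to push a file over a link, ignoring protocol overhead."""
    size_megabits = size_megabytes * 8  # 8 bits per byte
    return size_megabits / link_mbit_per_s


# Hypothetical 25 MB scan over a 100 Mbit/s link
print(transfer_seconds(25, 100))  # 2.0
```

Even a fairly large scan should cross a 100 Mbit/s link in a couple of seconds, so waits much longer than that point at the scanner itself.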

If I hooked the scanner up to USB, then the PC could run the Brother software, which does do OCR. But it's not automatic: all it does is display the PDF inside PaperPort once the scan is complete. For bulk scanning, it's not workable.

Regarding indexing - I've started looking at Solr, and it might suit my needs. I was hoping for a visual type search system, where you could see thumbnails of the PDFs in the results.

---

[1] https://www.brother.co.uk/scanners/ads-2700w


MuPDF and Xpdf can do that by selecting the text with the right mouse button.


xpdf seems to have started to respect the not-copyable flags, while in days of yore it didn't. So now, even for something like a manual for some command-line tools or a textbook on C++ or Rust, you still have to re-type the text (wtf). Time to remove it and search for something better, something that does not need a 0.5GB update every 3 days (on Windows). (Yes, exaggerating slightly.)


Maybe it's the new Qt version. OpenBSD still has the Motif one, and it works great. For Windows you have SumatraPDF, which is pretty good, and it's libre.

Also, MuPDF for Windows: https://www.mupdf.com/downloads/archive/mupdf-1.16.0-windows... Unzip it into a folder and run mupdf.exe.


It does say, "All EC2 host infrastructure has been updated with these new protections, and no customer action is required at the infrastructure level."

"Meanwhile, we suggest using the stronger security and isolation properties of EC2 instances to separate any untrusted workloads." As I read it, it is talking about running code within your instance - if you have untrusted workloads, rather run them in a separate instance, so as not to encounter issues like this cross-process.


Not sure how you can compare your own servers to S3. S3 has redundancy built in that can survive the loss of 2 data centers. I'm pretty sure it would be more expensive to roll your own version of that...


See my comment elsewhere with some numbers. You can beat S3 even with the new pricing in many cases. With the old pricing it's trivial.

With replication to 3 data centres.

