The open-source Ghostscript [1] can convert simple PDFs to text, while keeping the layout. I doubt it will handle some of the more complicated cases outlined in the article though.
I use it quite successfully to turn my bank statements into text, which can then be further processed.
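For anyone wanting to try the same thing, here is a minimal sketch of how that conversion could be scripted. It assumes the `gs` executable is on the PATH (on Windows it is usually `gswin64c`) and uses Ghostscript's `txtwrite` device. The function names and file names are made up for illustration, and this is not necessarily the exact invocation used above.

```python
import subprocess
from pathlib import Path


def gs_text_command(pdf_path, txt_path):
    """Build a Ghostscript command that writes the PDF's text to txt_path."""
    return [
        "gs",                 # assumed executable name; "gswin64c" on Windows
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit after the last input file
        "-sDEVICE=txtwrite",  # Ghostscript's text extraction device
        f"-sOutputFile={txt_path}",
        str(pdf_path),
    ]


def pdf_to_text(pdf_path, txt_path=None):
    """Run Ghostscript and return the path of the text file it wrote."""
    pdf_path = Path(pdf_path)
    # Default to the same name with a .txt suffix (hypothetical convention).
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    subprocess.run(gs_text_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling `pdf_to_text("statement.pdf")` would then leave `statement.txt` next to the input, ready for further processing. How well the layout survives depends on the PDF.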
I've recently done this. I have scanned over 5,000 documents to PDF, then batch converted those from PDF to TIFF using Ghostscript, and then used Tesseract to OCR the TIFF and combine both back into a searchable PDF. Tesseract may not be the world's best OCR software, but it's free, and both it and Ghostscript are easy to automate.
Now all I need is a good front end search system for my document archive.
I have a Brother ADS-2700w[1] as my scanner, which is network connected. It scans directly to a network share (SMB, but it also supports FTP, NFS, etc.) and outputs PDF. The PDFs are basically 'dumb' PDFs, in that each page is just an image wrapped up inside the PDF container.
So that's where Ghostscript comes in. On a schedule, a script picks up new PDFs in the share and runs them through Ghostscript to create a multipage TIFF. That TIFF is then given to Tesseract (as it can't handle PDFs natively), which does the OCR and outputs a nice PDF with searchable text. All very simple.
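A rough sketch of what such a script might look like, for illustration only. The directory layout, function names, the `tiffgray` device, the 300 dpi resolution, and the `eng` language are all assumptions, not the actual setup described above. It relies on `gs` and `tesseract` being on the PATH.

```python
import subprocess
from pathlib import Path


def gs_tiff_command(pdf_path, tiff_path, dpi=300):
    """Ghostscript command: render all pages of a PDF into one multipage TIFF."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",           # assumed: 8-bit grayscale TIFF output
        f"-r{dpi}",                    # assumed rendering resolution
        f"-sOutputFile={tiff_path}",   # no %d in the name, so one multipage file
        str(pdf_path),
    ]


def tesseract_command(tiff_path, out_base, lang="eng"):
    """Tesseract command: OCR the TIFF and write <out_base>.pdf with a text layer."""
    # Tesseract appends ".pdf" to out_base itself when given the "pdf" config.
    return ["tesseract", str(tiff_path), str(out_base), "-l", lang, "pdf"]


def pending_pdfs(inbox, outbox):
    """PDFs in inbox that do not yet have a same-named result in outbox."""
    inbox, outbox = Path(inbox), Path(outbox)
    return sorted(p for p in inbox.glob("*.pdf") if not (outbox / p.name).exists())


def process_inbox(inbox, outbox, workdir):
    """Convert every pending PDF and return the searchable PDFs written."""
    outbox, workdir = Path(outbox), Path(workdir)
    outbox.mkdir(parents=True, exist_ok=True)
    workdir.mkdir(parents=True, exist_ok=True)
    done = []
    for pdf in pending_pdfs(inbox, outbox):
        tiff = workdir / (pdf.stem + ".tiff")
        subprocess.run(gs_tiff_command(pdf, tiff), check=True)
        subprocess.run(tesseract_command(tiff, outbox / pdf.stem), check=True)
        tiff.unlink()  # the intermediate TIFF is no longer needed
        done.append(outbox / pdf.name)
    return done
```

Something like `process_inbox("/mnt/scans", "/mnt/archive", "/tmp/ocr")` (hypothetical paths) could then be run from cron or Task Scheduler. Note that `glob("*.pdf")` is case-sensitive on Linux, so a scanner that writes `.PDF` would need a different pattern.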
The scanning of the pages is very fast, but the scanner takes an age sending the PDFs over the network - its Ethernet port is only 100 Mbit/s, but to be honest I just think the CPU inside the scanner is slow. It also doesn't have enough internal buffer, which means you can't scan the next document until the previous one has finished being sent to the share.
If I hooked the scanner up via USB, then the PC could run the Brother software, which does do OCR - but it's not automatic; all it does is display the PDF inside PaperPort once the scan is complete. For bulk scanning, it's not workable.
Regarding indexing - I've started looking at Solr, and it might suit my needs. I was hoping for a visual type search system, where you could see thumbnails of the PDFs in the results.
xpdf seems to have started to respect the not-copyable flags, whereas in days of yore it didn't. So now, even for something like a manual for some command-line tools or a textbook on C++ or Rust, you still have to re-type the text (wtf). Time to remove it and search for something better, something that does not need a 0.5GB update every 3 days (on Windows). (Yes, exaggerating slightly.)
Maybe it's the new Qt version; OpenBSD still has the Motif one, and it works great. For Windows you have SumatraPDF, which is pretty good, and it's libre.
It does say, "All EC2 host infrastructure has been updated with these new protections, and no customer action is required at the infrastructure level."
"Meanwhile, we suggest using the stronger security and isolation properties of EC2 instances to separate any untrusted workloads." As I read it, it is talking about running code within your instance - if you have untrusted workloads, rather run them in a separate instance, so as not to encounter issues like this cross-process.
Not sure how you can compare your own servers to S3 - S3 has redundancy built in that can survive the loss of 2 data centers. I'm pretty sure it would be more expensive to roll your own version of that...
(via https://www.tug.org/texshowcase/)