I don't have much experience with PDF as a spec, but I guess I'm a "PDF hater."
It's the things that didn't need to be PDFs, but inexplicably are, that annoy me. Like data dumps from local governments that could have been machine-readable, or announcements that are distributed in print and emailed as PDFs, rather than lifting the content into the message body.
You're right - it's just the wrong format and it isn't intended for that. Governments should be publishing data dumps as text, CSV, or some other parsable format, not as PDFs.
Agreed, but there exist solutions [0] to make even PDF tables machine-readable (optionally making use of machine learning techniques). It's incredibly backwards and much harder than, say, CSV, but it might get the machine-reading job done.
> data dumps from local governments that could have been machine-readable
It's annoying, but if they were produced from a database (as opposed to scans), they're still usually machine-readable by converting the PDF to text, and then running a few regexes as needed to convert to something like CSV, if it's tabular in the first place.
In theory the text could be gibberish because of font subsetting that intentionally scrambles the glyphs, but that's rare and generally only implemented when a publisher is intentionally trying to thwart text extraction and/or font extraction, which I wouldn't expect a local government to either intend or to enable accidentally.
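For what it's worth, the "convert to text, then run a few regexes" route can be sketched roughly like this. It assumes the text came from something like `pdftotext -layout`, so that table columns end up separated by runs of two or more spaces. The function name, the `min_columns` parameter, and the sample report are all made up for illustration, and real dumps will usually need more cleanup (wrapped cells, page headers, and so on).

```python
import csv
import io
import re

# Assumption: in layout-preserved text, a run of 2+ whitespace characters
# marks a column boundary, while single spaces stay inside a cell.
COLUMN_GAP = re.compile(r"\s{2,}")


def layout_text_to_csv(text: str, min_columns: int = 2) -> str:
    """Turn layout-preserved text into CSV, skipping non-tabular lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        cells = COLUMN_GAP.split(line.strip())
        # Titles and blank lines yield a single cell, so they are dropped.
        if len(cells) >= min_columns:
            writer.writerow(cells)
    return out.getvalue()


# Hypothetical sample, shaped like layout-preserved output of a small table.
sample = """\
Quarterly Report

Department      Budget      Spent
Parks           12,000      9,500
Public Works    48,250      47,900
"""

if __name__ == "__main__":
    print(layout_text_to_csv(sample), end="")
```

The `csv` module quotes cells like `12,000` on its own, which is one reason to use it rather than joining the cells with commas by hand.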
I know of a company that was required to send HR data to a union (time clockings over a period of time). They didn't like it. They just printed a badly organized spreadsheet to a PDF. There, they sent the data, and it was unusable.
> data dumps from local governments that could have been machine-readable, or announcements that are distributed in print and emailed as PDFs, rather than lifting the content into the message body.
You should thank PDF for giving you any useful electronic copies at all.
If it's scanned-in papers, sticking them loosely in an e-mail or web page would be much more difficult to read through.
If it's text data, then perhaps it was primarily composed to be printed, and PDF allows easy creation of readable electronic copies with a minimum of effort from any input. Before PDF you might have gotten nothing at all, because most people don't have readers for various obscure proprietary input formats.
And PDF is far easier than other formats to convert into another format for your own consumption. Do you have a command-line tool which will extract the embedded images out of a Microsoft Word document? Or one that will convert it to plain text, preserving formatting? pdfimages and pdftotext -layout are very widely available.
> You should thank PDF for giving you any useful electronic copies at all.
I think the point is that data dumps in PDF format are not useful at all.
I object to your claim that it’s easier to convert. The only reason there are so many tools to do so is because it’s so hard/impossible in the first place.
> The only reason there are so many tools to do so is because it’s so hard
What exactly do you base that on? Have you written any PDF or postscript utilities?
Images are easily located in rather discrete chunks, and they are conveniently stored in standard formats like JPEG. Preserving the layout of text output takes a bit of work, but otherwise extracting text and images is pretty much a necessary early step in writing any PDF viewer. And I do believe even very early PDF viewers allowed arbitrary copy/paste of text.
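To illustrate the "discrete chunks" point, here is a deliberately naive sketch that pulls embedded JPEGs straight out of a PDF's bytes. It leans on the fact that JPEG images in a PDF (the `/DCTDecode` filter) are stored as ordinary JPEG data between the `stream` and `endstream` keywords, so it simply keeps the stream payloads that begin with the JPEG start-of-image marker. The function name and the fake PDF bytes are made up for illustration. It does not handle images stored with other filters, chained filters, or encrypted files, so treat it as a demonstration and not as a replacement for pdfimages.

```python
import re
from typing import List

# Naive assumption: every stream body sits between "stream<EOL>" and
# "endstream", and the body itself never contains the word "endstream".
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

# Every JPEG file starts with the SOI marker FF D8 followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def extract_jpeg_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the payload of each stream that looks like raw JPEG data."""
    return [
        match.group(1)
        for match in STREAM.finditer(pdf_bytes)
        if match.group(1).startswith(JPEG_MAGIC)
    ]


# Hypothetical, heavily simplified stand-in for a real PDF file.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"payload" + b"\xff\xd9"
fake_pdf = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"2 0 obj\n<< /Subtype /Image /Filter /DCTDecode >>\nstream\n"
    + fake_jpeg
    + b"\nendstream\nendobj\n"
)

if __name__ == "__main__":
    for index, data in enumerate(extract_jpeg_streams(fake_pdf)):
        print(index, len(data), "bytes")
```

Each returned payload could be written to a `.jpg` file as-is, which is the convenient part: no re-encoding is needed for this kind of image.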
And I would like to have an actual portable document format for electronic documents, but since MHTML support was for some reason dropped except in e-mail clients (as .eml), I'm stuck with this frankenformat that is PDF!