I don't have much experience with PDF as a spec, but I guess I'm a "PDF hater." ...

systemvoltage · on April 4, 2021

You're right - it's just the wrong format and it isn't intended for that. Governments should be publishing text, CSV, or other parsable formats, not PDFs, when it comes to data dumps.

diarrhea · on April 4, 2021

Agreed, but there exist solutions [0] to make even PDF tables machine-readable (optionally making use of machine learning techniques). It's incredibly backwards and much harder than, say, CSV, but it might get the machine-reading job done.

0: https://camelot-py.readthedocs.io/en/master/

crazygringo · on April 5, 2021

> data dumps from local governments that could have been machine-readable

It's annoying, but if they were produced from a database (as opposed to scans), they're still usually machine-readable by converting the PDF to text, and then running a few regexes as needed to convert to something like CSV, if it's tabular in the first place.
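As a rough illustration of the "convert to text, then regex" step, here is a minimal Python sketch. It assumes the text was already extracted with layout preserved and that columns are separated by runs of two or more spaces; the function name and the sample rows are made up for the example, and real dumps will need extra rules for wrapped cells, headers repeated on each page, and so on.

```python
import csv
import io
import re

# Assumption: columns are separated by two or more whitespace characters,
# which is typical for layout-preserved text of a simple table.
COLUMN_GAP = re.compile(r"\s{2,}")


def layout_text_to_csv(text: str) -> str:
    """Convert layout-preserved text into CSV, one row per non-blank line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows or pages
        writer.writerow(COLUMN_GAP.split(line))
    return out.getvalue()


# Made-up sample standing in for extracted PDF text.
sample = """
Name            Department      Hours
Jane Doe        Public Works    37.5
John Smith      Parks           40
"""

print(layout_text_to_csv(sample))
```

Single spaces inside a cell ("Public Works") survive because only gaps of two or more spaces are treated as column breaks.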

In theory the text could be gibberish because of font subsetting that intentionally scrambles the glyphs, but that's rare and generally only implemented when a publisher is intentionally trying to thwart text extraction and/or font extraction, which I wouldn't expect a local government to either intend or to enable accidentally.

BostonEnginerd · on April 4, 2021

Our city council posts PDF files which consist of scans of documents someone printed out. Sometimes they turn on OCR, but not often.

benhurmarcel · on April 5, 2021

Sometimes that's on purpose.

I know of a company that was required to send HR data (time clockings over a period of time) to a union. They didn't like it, so they just printed a badly organized spreadsheet to a PDF. There: they had sent the data, and it was unusable.

axiolite · on April 4, 2021

> data dumps from local governments that could have been machine-readable, or announcements that are distributed in print and emailed as PDFs, rather than lifting the content into the message body.

You should thank PDF for giving you any useful electronic copies at all.

If it's scanned-in papers, sticking them loosely in an e-mail or web page would be much more difficult to read through.

If it's text data, then perhaps it was primarily composed to be printed, and PDF allows easy creation of readable electronic copies with a minimum of effort from any input. Before PDF you might have gotten nothing at all, because most people don't have readers for various obscure proprietary input formats.

And PDF is far easier than other formats to convert into another format for your own consumption. Do you have a command-line tool which will extract the embedded images out of a Microsoft Word document? Or one that will convert it to plain text, preserving formatting? pdfimages and pdftotext -layout are very widely available.
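For anyone who wants to script those two tools, a minimal Python sketch might look like the following. The function names and output naming scheme are hypothetical; `-layout` is the flag mentioned above, and `-j` is an addition here that asks pdfimages to write JPEG-encoded images as .jpg files. It assumes both tools are installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf: str, txt: str) -> list[str]:
    # -layout keeps the physical layout of the page in the text output.
    return ["pdftotext", "-layout", pdf, txt]


def pdfimages_cmd(pdf: str, prefix: str) -> list[str]:
    # -j writes JPEG (DCT) images as .jpg instead of converting them.
    return ["pdfimages", "-j", pdf, prefix]


def extract(pdf: str, out_dir: str) -> None:
    """Dump the text and embedded images of `pdf` into `out_dir`."""
    for tool in ("pdftotext", "pdfimages"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf).stem
    subprocess.run(pdftotext_cmd(pdf, str(out / f"{stem}.txt")), check=True)
    subprocess.run(pdfimages_cmd(pdf, str(out / stem)), check=True)
```

Calling `extract("minutes.pdf", "out")` (file name made up) would leave `out/minutes.txt` plus one image file per embedded image, named by pdfimages from the given prefix.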

Aeolun · on April 4, 2021

> You should thank PDF for giving you any useful electronic copies at all.

I think the point is that data dumps in PDF format are not useful at all.

I take objection to your statement that it’s easier to convert. The only reason there are so many tools to do so is because it’s so hard/impossible in the first place.

axiolite · on April 4, 2021

> The only reason there are so many tools to do so is because it’s so hard

What exactly do you base that on? Have you written any PDF or PostScript utilities?

Images are easily located in rather discrete chunks, and they are conveniently stored in standard formats like JPEG. Preserving the layout of text output takes a bit of work, but otherwise extracting text and images is just about a necessary early step in writing any PDF viewer. And I do believe even very early PDF viewers allowed arbitrary copy/paste of text.
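To make the "discrete chunks" point concrete, here is a deliberately naive Python sketch that scans raw PDF bytes for stream objects declared with the DCTDecode (JPEG) filter. It is an illustration under assumptions, not how any real viewer does it: it ignores the /Length entry, compressed object streams, filter arrays, and nested dictionaries, and the function name is hypothetical.

```python
import re

# Matches a stream whose dictionary names the DCTDecode (JPEG) filter.
# Assumption: no nested ">>" appears between the filter name and the
# "stream" keyword; real-world PDFs can break this.
JPEG_STREAM = re.compile(
    rb"/Filter\s*/DCTDecode[^>]*>>\s*stream\r?\n(.*?)\r?\nendstream",
    re.DOTALL,
)


def find_jpegs(pdf_bytes: bytes) -> list[bytes]:
    """Return the raw bytes of each JPEG stream found by the naive scan."""
    images = []
    for match in JPEG_STREAM.finditer(pdf_bytes):
        data = match.group(1)
        if data.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
            images.append(data)
    return images


# Synthetic stand-in for a PDF image object (not a valid PDF or JPEG).
fake_jpeg = b"\xff\xd8\xff\xe0fake-image-data\xff\xd9"
fake_pdf = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /XObject /Subtype /Image /Filter /DCTDecode /Length "
    + str(len(fake_jpeg)).encode()
    + b" >>\nstream\n"
    + fake_jpeg
    + b"\nendstream\nendobj\n"
)

print(len(find_jpegs(fake_pdf)))
```

Because such a stream holds the JPEG file bytes unchanged, each returned item could be written straight to a .jpg file; a robust tool would parse the object structure and honor /Length instead of pattern matching.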

Aeolun · on April 6, 2021

I’ve attempted, and given up on, writing and reading PDF files more often than I can count.

Conversely, I’ve only ever tried to write a (new) Word file once, since it all worked right away.

BlueTemplar · on April 4, 2021

Also, PDF has very poor support for animation formats.

tonyedgecombe · on April 4, 2021

That's a good thing, I don't want to see animations in a PDF.

BlueTemplar · on April 4, 2021

And I would like to have an actual portable document format for electronic documents, but since MHTML support was for some reason dropped everywhere except in e-mail clients (as .eml), I'm stuck with this frankenformat that is PDF!

dredmorbius · on April 5, 2021

ePub (based on HTML as it happens) seems to come close, though even it has its warts.

nl · on April 5, 2021

MHTML support is there in both Firefox and Chrome. I didn't check other browsers.

BlueTemplar · on April 5, 2021

How can you save a web page as a single MHTML file in Firefox?

TuringTest · on April 5, 2021

Not MHTML, but you can download a page as a single HTML file with the TagSpaces Web Clipper extension. It converts images to inline format.

https://addons.mozilla.org/es/firefox/addon/tagspaces/
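For a sense of what "inline format" means here, the sketch below rewrites local `<img src="...">` references as base64 data URIs so the HTML file carries its own images. This is only an illustration of the general idea, not how TagSpaces or any other extension actually works; the function name is hypothetical, and a regex is a shortcut that a real tool would replace with a proper HTML parser.

```python
import base64
import mimetypes
import re
import tempfile
from pathlib import Path

# Captures the part before the src value, the value itself, and the
# closing quote. Assumption: src attributes use double quotes.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html: str, base_dir: str) -> str:
    """Replace local image references with base64 data URIs."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        path = Path(base_dir) / src
        # Leave remote or already-inlined images, and missing files, alone.
        if src.startswith(("data:", "http:", "https:", "//")) or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)


# Demo with a placeholder file (the bytes are not a real PNG image).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "dot.png").write_bytes(b"\x89PNG placeholder bytes")
    result = inline_images('<p><img alt="dot" src="dot.png"></p>', tmp)

print(result)
```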

BlueTemplar · on April 5, 2021

Interesting, I was using SingleFile myself, will have to look into the differences...

BlueTemplar · on April 12, 2021

SingleFile works better IMHO.