Nougat: Neural Optical Understanding for Academic Documents

echo_time · on Aug 31, 2023

Funnily enough the Example Page 1 is wrong. Rendering du^n as du^*, and then nu^n-1 as nw^*-1.

It is impressive but...it really feels like those are the details that really really matter.

nicodjimenez · on Aug 31, 2023

Second page is even worse. Ends in repeated \cdots and doesn’t finish parsing page. Also it read number 73 as 3 I guess because the previous section number was 2.

orbital-decay · on Sept 1, 2023

The main issue with OCR of anything with math in it is always that it has to be not 99.99% but 100% correct. Which is probably not possible.

stevenae · on Aug 31, 2023

Missed opportunity to call it "...Texts" instead of "Documents".

rnadomvirlabe · on Aug 31, 2023

If you are interested in this sort of thing for producing LaTeX documents, I've had good experience with [Mathpix](https://mathpix.com) in the past. It can take most anything I throw at it, and even render matrices well.

gremlinunderway · on Aug 31, 2023

This is great, but when is academia, business and government going to finally get off PDF as a typical standard? It's awful, not adaptive for mobile, and a pain in the ass to work with for any kind of development.

vosper · on Aug 31, 2023

What's a good alternative, for users and developers?

I don't have any love for PDF, but I'm actually not sure what's more cross-platform. Any browser will render PDF, so everyone already has a viewer on their computer. A browser will also print any document to PDF, and many other editors can export to PDF (though perhaps not import for editing)

It can't be replaced by an Office format, like docx, because even today apps like Pages can't render MS Office docs correctly half the time.

Doesn't seem like HTML would fly, either, given all the kinds of things that get embedded into PDF.

harshreality · on Aug 31, 2023

HTML and various javascript libraries like mathjax or other libraries for charts and graphs.

> Doesn't seem like HTML would fly, either, given all the kinds of things that get embedded into PDF.

That's ironic. Browser PDF readers, at least open source ones, render PDFs as HTML using javascript. At least I'm sure about FF because I just checked that text from a native-digital pdf showed up in the DOM in developer tools.

vosper · on Aug 31, 2023

> HTML and various javascript libraries like mathjax or other libraries for charts and graphs.

How do you email that to someone as an attachment? Can you embed all of that stuff into a single .html file?

harshreality · on Sept 1, 2023

Sure, epub.

You could (or maybe you can't, but ebook readers should allow you to) disable any network access without explicit confirmation, so the javascript can't do anything evil other than modify the ebook being displayed. If you can't do that, that's up to ebook readers to solve, and not a flaw with epubs.

sublinear · on Sept 1, 2023

> Can you embed all of that stuff into a single .html file?

Technically yes, but there are two problems.

First is that inline style and scripts are a potential security vulnerability.

Second is that if someone does not inline everything and instead references css/js from the web the document will stop rendering correctly when those resources go offline.

froh · on Aug 31, 2023

pdf looks the same everywhere and is self-contained. an "immutable" document which looks the same for everyone if it hashes to the same sha... key. which has a value on it's own.

harshreality · on Aug 31, 2023

There's nothing immutable about pdfs. If you have an "original" document, it'll always hash to whatever it hashes to. I fail to see the point. You can cite md5 hashes on LG the same whether they're pdfs or epubs or, heaven forbid, azw3 (amazon's proprietary epub-like format).

What's the obsession with "looking the same everywhere"?

Page references: this shouldn't be a thing. Academia has already solved this problem for notable texts. Rather than nearly uncountable numbers of paragraphs that all run together, paragraphs or short sections or lines are numbered. See any good edition of Plato or Aristotle, or just about any notable play or longer poem ever translated. Relying on a single published layout of a work to reference is dumb.

Citing exact line numbers isn't even necessary for native-language works. When they're digital, search works. It works even better in flowed-format texts than it does in pdfs, which sometimes, depending on how the pdf was constructed, won't match text properly across newlines.

Visual quality: As long as images—data, charts, graphs, photographs—are not degraded beyond usefulness, the actual text, and its display, is up to the reader application. Everyone uses the web complete with mathjax, and those doesn't have Knuth-approved formatting in every respect. But they're good enough, and they work everywhere on every device without squinting or pinch to zoom. There are some people who insist on putting pre-rendered images of math in html, and they always look worse, because they don't match the text without a lot of work to have extra high-res images that are auto-scaled according to viewport and surrounding font size—work that I bet not many people have ever done in the history of html publishing.

froh · on Aug 31, 2023

that's all missing the point.

mhtml would somewhat fit part of the bill of what PDF offers: a single downloadable "file" you can archive or forward and you know: the recipient will see exactly what you saw.

however the mhtml doesn't look the same, depending on the device. and looking.exactly the same helps a great deal in convincing a judge that we all talk about the same.thing.

get me right.

I hate PDF with all passion of my heart. epub (similar to mhtml) imho is a much better format for many intents and purposes and it allows to reflow the contents depending on the device.

but the claim was "PDF is useless and.shall go" and that's cutting.it too short.

harshreality · on Sept 1, 2023

I agree PDF is not entirely useless. I don't mind it that much for papers, but I don't see how it's any better than responsive formats. It's obviously important for presentations where exact visual relationships between items is fixed, or flyers (things that inherently get published on dead trees—leaving layout to an html rendering engine in that case is not great), or other things like that.

I still don't understand why it needs to look exactly the same. I get that habitually people say "turn to page X and look 2/3 down the page for the line starting "The quick brown fox jumped", but with digital documents that's not in my experience how anything works. You just say "the sentence starting with 'The quick brown fox'", and everyone can search for it in a few seconds.

If an official proceeding needs to be sure everyone's working from the same document, they can distribute, or publish hashes of, an epub or mhtml the same as they can for a pdf. There's no assurance that two pdfs that you think are the same document are actually the same document, any more than two epubs would be.

esafak · on Aug 31, 2023

Most academic PDFs are typeset and consequently look better than typical web sites. There are notable exceptions such as distill.pub

etrautmann · on Aug 31, 2023

how does line number citation work for responsive text?

harshreality · on Aug 31, 2023

You put line numbers in the margin (with css styling), or I've also seen it as [#] inline in the text, possibly styled differently to make it more intuitive that it's not part of the source text.

For the vast majority of works that are untranslated, that isn't necessary, because, as mentioned, search works fine, and it's faster, too. For translated works, the concept of one published source of truth for page numbers is already broken, so you need some alternative to page numbers anyway.

anigbrowl · on Aug 31, 2023

not adaptive for mobile

I look at pdfs on my phone all the time, it's great. 'Optimized for mobile' usually means oversized fonts and a shitty UI so I get RSI in my thumb from endless scrolling.

PDF is kind of an ugly format, but the problem with realtime text flow etc. is that designers are (at the behest of clients) are always trying to look visually distinct and as a result nothing is standardized or predictable at the rendering end. 95% of digital layout is ass compared to the print version.

yonatan8070 · on Sept 1, 2023

It works, but that doesn't mean it's good. A document designed to be printed on an A4 page isn't easy to read when scaled down to the size of a phone, especially for people with suboptimal eyesight, and if you zoom in to read the text, you need to scroll horizontally for every line you want to read

anigbrowl · on Sept 1, 2023

I'm aware of all that, although your complaint about zooming in is easily fixable by turning the phone sideways. My point is that PDF remains popular because it's predictable and because print layout standards are generally better than dynamic ones. That's the problem to address: people's desire to read longform text in a consistent way without being continually derailed by interactive cruft.

adr1an · on Aug 31, 2023

We need more DjVu!

ttul · on Sept 1, 2023

“Hey guys, we need to train a large language model to understand all of science. All we have is a stack of old papers from the dawn of time and we need to convert them into LaTeX…”

stavros · on Sept 1, 2023

Semi offtopic, but don't pretend you're making an acronym when the letters aren't the first ones in each word.

Or, as I like to call it, SOUNDYMAREHEATRONER.

mikewang · on Sept 1, 2023

I made a quick test: 1. GPU resoursce consuming. 16G~ 2. I test some document other than English, it is poor.

poulpy123 · on Sept 1, 2023

it's interesting to see they didn't pick their examples to exclude all failures. I find that really great and it should be done more often

czbond · on Aug 31, 2023

pretty impressive