
The article shows no evidence of fabrication, fraud, or misinformation, while making accusations of all three. All it shows is that ChatGPT was used, which is wildly escalated into "evidence manipulation" (ironically, without evidence).

Much more work is needed to show that this means anything.


If the output was not read even closely enough to catch obvious boilerplate GPT markers, then we can't expect that anything else in it was. That means everything else (numbers, interpretation, conclusions) was potentially never checked.

The authors use "fraud" in a specific sense here: "using ChatGPT fraudulently or undeclared," and they showed that the produced text was included without proper review. They also never accused those papers of misinformation, so they don't need to show evidence of that.


Search through YouTube via embedding search over video frames, in a click-to-semantic-search interface. (Right-click to play the video.)
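To make "embedding search over frames" concrete, here is a generic sketch of the technique (not our actual pipeline): sample frames, embed them and a text query with an assumed CLIP model via sentence-transformers, and rank frames by cosine similarity. File names and the query are placeholders.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # Embed sampled video frames and a text query in the same CLIP space,
    # then rank frames by cosine similarity to the query.
    model = SentenceTransformer("clip-ViT-B-32")

    frame_paths = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]
    frame_embeddings = model.encode([Image.open(p) for p in frame_paths])

    query_embedding = model.encode(["a person playing guitar"])
    scores = util.cos_sim(query_embedding, frame_embeddings)[0]

    best = int(scores.argmax())
    print(frame_paths[best], float(scores[best]))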


For now, the researcher using our tool needs to check.


Or perhaps free scientists to focus on genuine insight, ending publish-or-perish forever.


OP here: we have data-source-powered research papers in the making that report on experiments executed with codegen. Really hopeful that the papers can be informative at a minimum.

It's clear that superhuman citation depth and breadth are already imminent. Hopefully we can push hallucinations to near zero with the next generation of models.


that's great, yeah! what is the data source for the papers?


What libraries do you see as being SOTA? Fitz? Tika?

My hope is that computer vision + OCR will solve this once and for all in the near future.


To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.

Your second comment rings true, and in my opinion we're already there. I highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely say it is now. I threw documents at it that previously would just produce trash, and it handled them fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).

Cost is the kicker for me: 1,000 pages for $15 adds up fairly quickly at any sort of scale!
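For anyone who wants to try it, here is a minimal sketch of a table-analysis call with boto3, assuming AWS credentials are configured and a page has been rendered to an image (the file name is a placeholder):

    import boto3

    # Send one rendered page image to Textract's synchronous analysis
    # and pull out the table structure it detects.
    client = boto3.client("textract")

    with open("quarterly_report_page.png", "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )

    tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
    cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
    print(f"Found {len(tables)} table(s) with {len(cells)} cell(s)")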


OCR is built into Adobe's PDF reader; the issue is that it's $15 a month.

I really want to see OCR become easier to use, but I don't know why it's such a hard problem in the first place.


There is the Python library ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/) that works really well. I have found the results comparable to Adobe's in accuracy.

I believe it uses Tesseract, Ghostscript, and some other libraries.
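For reference, basic usage through its Python API is essentially a one-liner (file names are placeholders):

    import ocrmypdf

    # Add a searchable text layer to a scanned PDF.
    ocrmypdf.ocr("scanned_input.pdf", "searchable_output.pdf")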

Speaking of Ghostscript, one way to deal with problematic PDFs is to print them to a file and work with the result instead.
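One way to script that kind of round trip, assuming Ghostscript is installed and on PATH, is to push the file back through the pdfwrite device (paths are placeholders):

    import subprocess

    # Rewrite a problematic PDF through Ghostscript's pdfwrite device; the
    # result is a freshly generated file that downstream tools often handle better.
    subprocess.run(
        ["gs", "-o", "rewritten.pdf", "-sDEVICE=pdfwrite", "problematic.pdf"],
        check=True,
    )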


Do any open-source apps integrate this?

I'd love to just be able to search a PDF document for a string and get a list of results.


Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and Fitz produce very poor results. What is the reason this is still so backwards?


Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.

Plus all the fun of the fact that you can embed the following formats inside a PDF:

PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length-compressed data, and Deflate-compressed data.

All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.

Note the lack of OpenType fonts, also lack of proper Unicode!


> JPEG 2000 (dead)

Not sure what you mean by "dead", but tons of book scans, particularly those at archive.org, are PDFs made entirely of JPEG 2000 images.


Believe it or not, digital cinema projection is done with JPEG 2000: https://en.wikipedia.org/wiki/Digital_cinema


That’s wild!


I mean dead as in the fact that it’s used somewhere is noteworthy.

I’d love for JPEG XL to replace such uses!


> lack of proper Unicode

What do you mean?


PDF was defined way back before Unicode was ever a thing. It is natively an 8-bit character-set format for text handling. It gets around the limit of only 256 available characters by allowing custom byte-to-glyph mappings (think both ASCII and EBCDIC encodings in different parts of the same document). To typeset a glyph that is not in the currently active 256-character mapping, you switch to a different custom byte-to-glyph mapping and typeset the character from there.
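A toy illustration of the idea (not real PDF syntax, just the principle): the same byte decodes to different glyphs depending on which font's encoding map is currently active, so a text extractor has to track that state.

    # The byte 0x41 means different glyphs under different font encodings.
    font_f1 = {0x41: "A"}       # a WinAnsi-like map
    font_f2 = {0x41: "\u03b1"}  # a custom map where 0x41 is Greek alpha

    def decode(data: bytes, encoding_map: dict) -> str:
        return "".join(encoding_map[b] for b in data)

    print(decode(b"A", font_f1))  # -> A
    print(decode(b"A", font_f2))  # -> α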


> PDF was defined way back before Unicode was ever a thing.

Unicode 1.0 was released in 1991.

PDF 1.0 was released in 1993.


Others have given some good answers already, but I'll add one more: PDF is all about producing a final layout, and given a final layout, there are an infinite number of inputs that could have produced it. When you parse a PDF, most of the time you are trying to recover something higher level than, e.g., a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that kind of semantic extraction in mind, so you have to fudge things to get what you want, and that's where it becomes complex and crazy.

For example, I once had to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, and nothing in the PDF itself correlated the two in any way at all (they didn't appear anywhere near each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc.).
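The usual workaround is a geometric heuristic: extract each text run with its coordinates and pair a label with whatever sits on roughly the same baseline to its right. A minimal sketch with PyMuPDF (the file name and tolerance are placeholders, not the original system):

    import fitz  # PyMuPDF

    # Nothing in the file ties "Total:" to its amount, so pair them by position.
    page = fitz.open("invoice.pdf")[0]
    words = page.get_text("words")  # (x0, y0, x1, y1, text, block, line, word_no)

    for x0, y0, x1, y1, text, *_ in words:
        if text.startswith("Total"):
            # Words on roughly the same baseline, to the right of the label.
            same_line = sorted(
                (w for w in words if abs(w[1] - y0) < 3 and w[0] > x1),
                key=lambda w: w[0],
            )
            print(text, same_line[0][4] if same_line else None)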


1. PDF is mostly designed as a write-only (or render-only) format. PDF’s original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any other purpose than rendering resembles the task of disassembling machine code back into intelligible source code.

2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.

3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.

So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.


My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black-magic wizardry, especially since they also had to spend time on infra code before they brought more people like me on board. If anyone wants a great startup idea, look at solving a problem involving parsing PDFs en masse. Maybe something in legal tech; that market is absolutely ripe for disruption.


Droit does similar things.


That is awesome! Can “cross border” do things like process GDPR compliance regulations or is that not the intended use case?


PDF has always seemed to be a janky Adobe product.

Should a modern, open version of PDF be created, given how it has evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF 2?



Sadly, XPS is not supported by most software. I'd love to use something better than PDF, but even LibreOffice can't export to OXPS.


Anything related to XML is arguably even worse.


I know it's fun to hate on XML, but compared to inventing a new pseudo-text, pseudo-binary format, its parsing mechanics are well understood.

I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema, allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English prose, which makes for shitty constraint boundaries.
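A toy example of that difference, validating a made-up document against a made-up schema with lxml:

    from lxml import etree

    # A tiny schema: an <invoice> must contain exactly one decimal <total>.
    schema = etree.XMLSchema(etree.XML(b"""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="invoice">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="total" type="xs:decimal"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
    """))

    good = etree.XML(b"<invoice><total>32.56</total></invoice>")
    bad = etree.XML(b"<invoice><total>thirty-two</total></invoice>")
    print(schema.validate(good))  # True
    print(schema.validate(bad))   # False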


It was a fight between DiskPaper and PDF. PDF won because the tools were better and it was cross-platform.

And PDF is a subset of PostScript, the product that made Adobe and the DTP industry.

It's janky because the goal was to render identically everywhere. If you think it's easy, look at the code abortion that is CSS.


I know this is likely a case of "you know what I meant," but there already is a PDF 2.0: https://www.pdfa.org/resource/iso-32000-pdf/


I think it's not too late to create a modern open-source alternative to PDF. I find it unacceptable that something so widely used doesn't have proper free tools for editing. People shouldn't be limited by income when they want (or have) to use PDFs, or else be stuck with a bad experience. The other, bigger problem with PDF is that a lot of the time it's used for things it was never made for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDFs. Government forms should not use PDFs with hacky embedded scripts either.


It does seem like that would be a good opportunity to weed out some of the insecure aspects of the format.

Unfortunately in practice it would mean that everyone would have to support both PDF and PDF2.


It's not that bad; it's just that the problem is big enough in scope that the state of the art is provided by private industry, and to a lesser extent some open-source tools. You're probably way better off joining them rather than trying to beat them, unless you want to grind your brain into the dust over a page-layout spec.


The PDF format is itself a weird hybrid of text and binary.

(I have written a PDF parser myself.)


Genuinely curious: why is Apple search still so bad after poaching all of these Google people?


Because building a high quality search product is extremely difficult, it's not just a matter of hiring a few brilliant people.


Given the decline in quality of Google search over the last decade, my guess is that Google people are the problem.


Are these fines close to the business value gained from getting this legislation passed over the last 5 years? Seems unlikely. Nobody is going to jail. So where is the disincentive?


Agreed. When a company does something illegal, the "company" is liable. But it was actually a person within the company who made the decision to do this; companies are not sentient. Why are the decision makers not held liable instead?


Cost of doing business. Seems like this had positive ROI.

I’d like for companies that do illegal things to have repercussions in line with what an individual would face for doing the same thing. Whoever signs off on it at the highest level should pay the price.


The former President of AT&T Illinois has been charged and may go to prison: https://www.justice.gov/usao-ndil/pr/former-president-att-il...


To hell with monetary gain. It should be prison time for executives and board members.


The head of AT&T Illinois is very much going to do time in federal prison and AT&T provided the evidence to the feds to send him there.


Instead of ensuring that health care benefits are independent of corporate attachments, Marty Walsh continues to destroy useful distinctions in a system that congress refuses to appropriately reform. Sad.

