This article shows no evidence of fabrication, fraud, or misinformation, while making accusations of all three. All it shows is that ChatGPT was used, which is then wildly escalated into "evidence manipulation" (ironically, without evidence).
Much more work is needed to show that this means anything.
If the output was not even read to check for obvious GPT boilerplate markers, then we can't expect anything else in it was checked either. That means everything else (numbers, interpretation, conclusions) was potentially never checked.
The authors use "fraud" in a specific sense here: "using ChatGPT fraudulently or undeclared", and they proved that the produced text was included without proper review. They also never accused those papers of misinformation, so they don't need to show evidence of that.
OP here: we have data-source-powered research papers in the making that report on experiments executed with codegen. Really hopeful the papers can be informative at a minimum.
It's clear that superhuman citation depth and breadth is imminent. Hopefully we can push hallucinations to near zero with the next generation of models.
To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.
Your second comment rings true, and in my opinion we are there. Highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely say it's there now. I threw documents at it that previously would just spit out trash, and it handled them fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).
Cost is the kicker for me: 1,000 pages for $15 adds up fairly quickly at any sort of scale!
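If you want to kick the tires before committing to that bill, here's a minimal sketch with boto3 (the region and file name are placeholders; note the synchronous API only accepts single-page PDFs or images, so multi-page PDFs go through the async S3-based calls instead):

```python
# Minimal table-extraction sketch for AWS Textract via boto3.
# Assumes AWS credentials are configured; "report.pdf" is a
# placeholder for a single-page document.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

with open("report.pdf", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],  # ask for table structure, not just text
    )

# Results come back as "Block" objects; CELL blocks carry the
# row/column structure that makes quarterly-report tables usable.
for block in response["Blocks"]:
    if block["BlockType"] == "CELL":
        print(block["RowIndex"], block["ColumnIndex"], block.get("Confidence"))
```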
Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?
Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.
Plus all the fun of the fact that you can embed the following formats inside a PDF:
PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length-compressed data, Deflate-compressed data.
All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.
Note the lack of OpenType fonts, and the lack of proper Unicode!
PDF was defined way back before Unicode was a thing. It is natively an 8-bit character-set format for text handling. It gets around the limit of only 256 available characters by allowing custom byte-to-glyph mappings (think ASCII and EBCDIC encodings in different parts of the same document). To typeset a glyph that is not in the currently selected 256-character mapping, you switch to a different custom byte-to-glyph mapping and typeset the character from there.
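A toy sketch of that mechanism in Python, not real PDF code; the mappings below are made-up stand-ins for a font's /Differences-style encoding table:

```python
# The same byte decodes to different glyphs depending on which
# per-font encoding the content stream has currently selected.
# These dicts are invented examples, not real PDF encodings.
LATIN = {0xE9: "eacute"}      # byte 0xE9 -> "é" under this font
SYMBOLS = {0xE9: "clubsuit"}  # same byte -> "♣" under another font

def decode(content: bytes, encoding: dict[int, str]) -> list[str]:
    # A viewer does this per show-text operator, using whatever
    # font (and therefore byte-to-glyph map) was last selected.
    return [encoding.get(b, chr(b)) for b in content]

print(decode(b"\xe9", LATIN))    # ['eacute']
print(decode(b"\xe9", SYMBOLS))  # ['clubsuit']
```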
Others have given some good answers already, but I'll add one more: PDF is all about producing a final layout, but given a final layout, there are an infinite number of inputs that could have produced it. When you parse a PDF, most of the time you are trying to recover something higher level than, e.g., a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that kind of semantic extraction in mind, so you have to fudge things to get what you want, and that's where it becomes complex and crazy.
For example, I once had to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, and nothing in the PDF itself correlated the two in any way at all (they didn't appear anywhere near each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc.).
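One common workaround is to drop down to raw coordinates and re-group the words yourself. A sketch with PyMuPDF (fitz), where the file name is a placeholder and the line-bucketing heuristic is deliberately naive:

```python
# Pull every word with its bounding box and re-group by vertical
# position, since the PDF gives no logical link between "Total:"
# and the amount drawn next to it.
import fitz  # PyMuPDF

page = fitz.open("invoice.pdf")[0]  # hypothetical invoice

rows: dict[int, list[tuple[float, str]]] = {}
for x0, y0, x1, y1, word, *_ in page.get_text("words"):
    rows.setdefault(round(y0), []).append((x0, word))  # bucket by baseline

for y in sorted(rows):
    line = " ".join(w for _, w in sorted(rows[y]))  # left-to-right order
    if "Total:" in line:
        print(line)  # hopefully something like "Total: $32.56"
```

Real documents need fuzzier bucketing (words on one visual line rarely share an exact y coordinate), which is exactly the kind of fudging described above.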
1. PDF is mostly designed as a write-only (or render-only) format. PDF's original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any purpose other than rendering resembles disassembling machine code back into intelligible source code (see the sketch after this list).
2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.
3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.
So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.
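To make point 1 concrete, here's what a page's raw content stream looks like; a minimal sketch using PyMuPDF (fitz, mentioned above), with the file name as a placeholder:

```python
# Dump a page's decompressed content stream with PyMuPDF. What
# comes back is drawing "machine code" (select font, move, show
# string), with no document structure attached.
import fitz  # PyMuPDF

page = fitz.open("some.pdf")[0]  # hypothetical file
print(page.read_contents().decode("latin-1", errors="replace"))

# Typical output is a stream of operators along the lines of:
#   BT /F1 10 Tf 72 700 Td (Hello) Tj ET
# i.e. begin text, font F1 at 10pt, move to (72, 700), show "Hello".
# Recovering paragraphs, tables, or reading order from that is the
# disassembly problem from point 1.
```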
My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black-magic wizardry, because they were also contributing to infra code before they brought more people like me on board. If anyone wants a great startup idea, look at solving a problem involving parsing PDFs en masse. Maybe something in legal tech. That market is absolutely ripe for disruption.
PDF has always seemed to be a janky Adobe product.
Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF 2?
I know it's fun to hate on XML, but compared to inventing a new pseudo-text, pseudo-binary format, its parsing mechanics are well understood.
I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema, allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English prose, which makes for shitty constraint boundaries.
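A toy illustration of that contrast (the schema and documents below are invented): a machine-checkable schema lets a validator reject a malformed document automatically, with no prose interpretation required.

```python
# Validate XML against an XML Schema with lxml. Both the schema
# and the documents are toy examples for illustration.
from lxml import etree

schema = etree.XMLSchema(etree.XML("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="total" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))

good = etree.XML("<invoice><total>32.56</total></invoice>")
bad = etree.XML("<invoice><total>thirty-two</total></invoice>")
print(schema.validate(good))  # True
print(schema.validate(bad))   # False: violates the declared decimal type
```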
I think it's not too late to create a modern open-source alternative to PDF.
I find it unacceptable that something that has become so widely used doesn't have proper free tools for editing.
People shouldn't be limited by income when they want (have?) to use PDFs, or else suffer a bad experience.
The other, bigger problem with PDF is that a lot of the time it's used for things it was never made for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDF. Government forms should not use PDFs with hacky embedded scripts either.
It's not that bad; it's just that the problem is big enough in scope that the state of the art is provided by private industry, and to a lesser extent some open-source tools. You're probably way better off joining them than trying to beat them, unless you want to grind your brain into dust over a page-layout spec.
Are these fines close to the business value gained from getting this legislation passed over the last 5 years? Seems unlikely. Nobody is going to jail. So where is the disincentive?
Agreed. When a company does something illegal, the "company" is liable. But it was actually a person within the company who made the decision to do this; companies are not sentient. Why are the decision makers not held liable instead?
Cost of doing business. Seems like this had positive ROI.
I’d like for companies that do illegal things to have repercussions in line with what an individual would face for doing the same thing. Whoever signs off on it at the highest level should pay the price.
Instead of ensuring that health care benefits are independent of corporate attachments, Marty Walsh continues to destroy useful distinctions in a system that congress refuses to appropriately reform. Sad.