
The article shows no evidence of fabrication, fraud, or misinformation, while making accusations of all three. All it shows is that ChatGPT was used, which is wildly escalated into "evidence manipulation" (ironically, without evidence).

Much more work is needed to show that this means anything.


If the output was not read even closely enough to catch obvious boilerplate GPT markers, then we can't expect that anything else in it was. That means everything else (numbers, interpretation, conclusions) was potentially never checked.

The authors use "fraud" in a specific sense here: "using ChatGPT fraudulently or undeclared," and they showed that the produced text was included without proper review. They also never accused those papers of misinformation, so they don't need to show evidence of that.


Search through YouTube via embedding search over video frames, in a click-to-semantic-search interface. (Right-click to play the video.)
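To make "embedding search over frames" concrete, here is a generic sketch of the technique (not our actual pipeline): sample frames, embed them and a text query with an assumed CLIP model via sentence-transformers, and rank frames by cosine similarity. File names and the query are placeholders.

    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # Embed sampled video frames and a text query in the same CLIP space,
    # then rank frames by cosine similarity to the query.
    model = SentenceTransformer("clip-ViT-B-32")

    frame_paths = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]
    frame_embeddings = model.encode([Image.open(p) for p in frame_paths])

    query_embedding = model.encode(["a person playing guitar"])
    scores = util.cos_sim(query_embedding, frame_embeddings)[0]

    best = int(scores.argmax())
    print(frame_paths[best], float(scores[best]))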


For now, the researcher using our tool needs to check.


Or perhaps free scientists to focus on genuine insight, ending publish-or-perish forever.


OP here: we have data-source-powered research papers in the making that report on experiments executed with codegen. Really hopeful that the papers can be informative at a minimum.

It's clear that superhuman citation depth and breadth are already imminent. Hopefully we can push hallucinations to near zero with the next generation of models.


that's great, yeah! what is the data source for the papers?


What libraries do you see as being SOTA? Fitz? Tika?

My hope is that computer vision + OCR will solve this once and for all in the near future.


To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.

Your second comment rings true, and in my opinion we're already there. I highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely say it is now. I threw documents at it that previously would just produce trash, and it handled them fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).

Cost is the kicker for me: 1,000 pages for $15 adds up fairly quickly at any sort of scale!
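For anyone who wants to try it, here is a minimal sketch of a table-analysis call with boto3, assuming AWS credentials are configured and a page has been rendered to an image (the file name is a placeholder):

    import boto3

    # Send one rendered page image to Textract's synchronous analysis
    # and pull out the table structure it detects.
    client = boto3.client("textract")

    with open("quarterly_report_page.png", "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )

    tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
    cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
    print(f"Found {len(tables)} table(s) with {len(cells)} cell(s)")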


OCR is built into Adobe's PDF reader; the issue is that it's $15 a month.

I really want to see OCR become easier to use, but I don't know why it's such a hard problem in the first place.


There is the Python library ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/) that works really well. I have found the results comparable to Adobe's in accuracy.

I believe it uses Tesseract, Ghostscript, and some other libraries.
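For reference, basic usage through its Python API is essentially a one-liner (file names are placeholders):

    import ocrmypdf

    # Add a searchable text layer to a scanned PDF.
    ocrmypdf.ocr("scanned_input.pdf", "searchable_output.pdf")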

Speaking of Ghostscript, one way to deal with problematic PDFs is to print them to a file and work with the result instead.
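One way to script that kind of round trip, assuming Ghostscript is installed and on PATH, is to push the file back through the pdfwrite device (paths are placeholders):

    import subprocess

    # Rewrite a problematic PDF through Ghostscript's pdfwrite device; the
    # result is a freshly generated file that downstream tools often handle better.
    subprocess.run(
        ["gs", "-o", "rewritten.pdf", "-sDEVICE=pdfwrite", "problematic.pdf"],
        check=True,
    )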


Do any open-source apps integrate this?

I'd love to just be able to search a PDF document for a string and get a list of results.


Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and Fitz produce very poor results. What is the reason this is still so backwards?


Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.

Plus all the fun of the fact that you can embed the following formats inside a PDF:

PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type 1 fonts (dead), PostScript Type 3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC profiles, PostScript functions defining color spaces, XML forms (the worst), LZW-compressed data, run-length-compressed data, and Deflate-compressed data.

All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.

Note the lack of OpenType fonts, also lack of proper Unicode!


> JPEG 2000 (dead)

Not sure what you mean by "dead", but tons of book scans, particularly those at archive.org, are PDFs made entirely of JPEG 2000 images.


Believe it or not, digital cinema projection is done with JPEG 2000: https://en.wikipedia.org/wiki/Digital_cinema


That’s wild!


I mean dead as in the fact that it’s used somewhere is noteworthy.

I’d love for JPEG XL to replace such uses!


> lack of proper Unicode

What do you mean?


PDF was defined way back before Unicode was ever a thing. It is natively an 8-bit character-set format for text handling. It gets around the limit of only 256 available characters by allowing custom byte-to-glyph mappings (think both ASCII and EBCDIC encodings in different parts of the same document). To typeset a glyph that is not in the currently active 256-character mapping, you switch to a different custom byte-to-glyph mapping and typeset the character from there.
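A toy illustration of the idea (not real PDF syntax, just the principle): the same byte decodes to different glyphs depending on which font's encoding map is currently active, so a text extractor has to track that state.

    # The byte 0x41 means different glyphs under different font encodings.
    font_f1 = {0x41: "A"}       # a WinAnsi-like map
    font_f2 = {0x41: "\u03b1"}  # a custom map where 0x41 is Greek alpha

    def decode(data: bytes, encoding_map: dict) -> str:
        return "".join(encoding_map[b] for b in data)

    print(decode(b"A", font_f1))  # -> A
    print(decode(b"A", font_f2))  # -> α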


> PDF was defined way back before Unicode was ever a thing.

Unicode 1.0 was released in 1991.

PDF 1.0 was released in 1993.


Others have given some good answers already, but I'll add one more: PDF is all about producing a final layout, and given a final layout, there are an infinite number of inputs that could have produced it. When you parse a PDF, most of the time you are trying to recover something higher level than, e.g., a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that kind of semantic extraction in mind, so you have to fudge things to get what you want, and that's where it becomes complex and crazy.

For example, I once had to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like "Total: $32.56". But in the PDF there was an instruction to write out the string "Total:" and a separate one to write out the amount string, and nothing in the PDF itself correlated the two in any way at all (they didn't appear anywhere near each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc.).
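The usual workaround is a geometric heuristic: extract each text run with its coordinates and pair a label with whatever sits on roughly the same baseline to its right. A minimal sketch with PyMuPDF (the file name and tolerance are placeholders, not the original system):

    import fitz  # PyMuPDF

    # Nothing in the file ties "Total:" to its amount, so pair them by position.
    page = fitz.open("invoice.pdf")[0]
    words = page.get_text("words")  # (x0, y0, x1, y1, text, block, line, word_no)

    for x0, y0, x1, y1, text, *_ in words:
        if text.startswith("Total"):
            # Words on roughly the same baseline, to the right of the label.
            same_line = sorted(
                (w for w in words if abs(w[1] - y0) < 3 and w[0] > x1),
                key=lambda w: w[0],
            )
            print(text, same_line[0][4] if same_line else None)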


1. PDF is mostly designed as a write-only (or render-only) format. PDF’s original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any other purpose than rendering resembles the task of disassembling machine code back into intelligible source code.

2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.

3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.

So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.


My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black-magic wizardry, especially since they also had to spend time on infra code before they brought more people like me on board. If anyone wants a great startup idea, look at solving a problem involving parsing PDFs en masse. Maybe something in legal tech; that market is absolutely ripe for disruption.


Droit does similar things.


That is awesome! Can “cross border” do things like process GDPR compliance regulations or is that not the intended use case?


PDF has always seemed to be a janky Adobe product.

Should a modern, open version of PDF be created, given how it has evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF 2?



Sadly, XPS is not supported by most software. I'd love to use something better than PDF, but even LibreOffice can't export to OXPS.


Anything related to XML is arguably even worse.


I know it's fun to hate on XML, but compared to inventing a new pseudo-text, pseudo-binary format, its parsing mechanics are well understood.

I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema, allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English prose, which makes for shitty constraint boundaries.
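A toy example of that difference, validating a made-up document against a made-up schema with lxml:

    from lxml import etree

    # A tiny schema: an <invoice> must contain exactly one decimal <total>.
    schema = etree.XMLSchema(etree.XML(b"""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="invoice">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="total" type="xs:decimal"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
    """))

    good = etree.XML(b"<invoice><total>32.56</total></invoice>")
    bad = etree.XML(b"<invoice><total>thirty-two</total></invoice>")
    print(schema.validate(good))  # True
    print(schema.validate(bad))   # False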


It was a fight between DiskPaper and PDF. PDF won because the tools were better and it was cross-platform.

And PDF is a subset of PostScript, the product that made Adobe and the DTP industry.

It's janky because the goal was to render identically everywhere. If you think it's easy, look at the code abortion that is CSS.


I know this is likely a case of "you know what I meant," but there already is a PDF 2.0: https://www.pdfa.org/resource/iso-32000-pdf/


I think it's not too late to create a modern open-source alternative to PDF. I find it unacceptable that something so widely used doesn't have proper free tools for editing. People shouldn't be limited by income when they want (or have) to use PDFs, or else be stuck with a bad experience. The other, bigger problem with PDF is that a lot of the time it's used for things it was never made for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDFs. Government forms should not use PDFs with hacky embedded scripts either.


It does seem like that would be a good opportunity to weed out some of the insecure aspects of the format.

Unfortunately in practice it would mean that everyone would have to support both PDF and PDF2.


It's not that bad; it's just that the problem is big enough in scope that the state of the art is provided by private industry, and to a lesser extent some open-source tools. You're probably way better off joining them rather than trying to beat them, unless you want to grind your brain into the dust over a page-layout spec.


The PDF format is itself a weird hybrid of text and binary.

(I have written a PDF parser myself.)


Genuinely curious: why is Apple search still so bad after poaching all of these Google people?


Because building a high quality search product is extremely difficult, it's not just a matter of hiring a few brilliant people.


Given the decline in quality of Google search over the last decade, my guess is that Google people are the problem.


Are these fines close to the business value gained from getting this legislation passed over the last 5 years? Seems unlikely. Nobody is going to jail. So where is the disincentive?


Agreed. When a company does something illegal, the "company" is liable. But it was actually a person within the company who made the decision to do this; companies are not sentient. Why are the decision makers not held liable instead?


Cost of doing business. Seems like this had positive ROI.

I’d like for companies that do illegal things to have repercussions in line with what an individual would face for doing the same thing. Whoever signs off on it at the highest level should pay the price.


The former President of AT&T Illinois has been charged and may go to prison: https://www.justice.gov/usao-ndil/pr/former-president-att-il...


To hell with monetary gain. It should be prison time for executives and board members.


The head of AT&T Illinois is very much going to do time in federal prison and AT&T provided the evidence to the feds to send him there.


Instead of ensuring that health care benefits are independent of corporate attachments, Marty Walsh continues to destroy useful distinctions in a system that congress refuses to appropriately reform. Sad.

