AI for Data Journalism: demonstrating what we can do with this stuff (simonwillison.net)
167 points by duck on April 22, 2024 | 32 comments



Looks interesting - my reaction though is that AI is great at demos and always has been. The devil is in the details, and historically that has made it unusable for most applications despite the presence of a cool demo. I'm not criticizing the project; I get that these are demos. It's just that we judge AI too much by demos and then handwave about how it will actually work in practice.


That was very much the point of my talk - and the reason I had so many live demos (as opposed to pre-recorded demos). I wanted to have at least a few instances of demos going wildly wrong to help emphasize how unreliable this stuff is.

Being unreliable doesn't mean it isn't useful. Journalists handle unreliable sources all the time - fact-checking and comparing multiple sources is built into the profession. As such, I think journalists may be better equipped to make use of LLMs than most other professions!


I found the talk very interesting because it shows both the issues as well as potential solutions.

One of the demos (extracting text from a PDF turned PNG) makes me wonder how you're ever going to fact check whether something in there is a hallucination. Innocent doctors won't always turn out to be Michael Jackson's sister after all :)

But then in one of the last demos you're showing how the fact checking can be "engineered" right into the prompt: "What were the themes of this meeting and for each theme give me an illustrative quote". Now you can search for the quote.

This is kind of eye opening for me, because you could build this sort of deterministic provability into all kinds of prompts. It certainly doesn't work for all applications but where it does work it basically allows you to swap false positives for false negatives, which is extremely valuable in many cases.
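
A rough sketch of what I mean, with the prompt asking for verbatim quotes so the output can be checked afterwards (the model name and JSON shape here are placeholders, not anything from the talk):

    # Ask for themes plus verbatim quotes so every claim carries its own check.
    # Assumes the model actually returns valid JSON; a real version would use a
    # JSON output mode or retry on parse errors.
    import json
    from openai import OpenAI

    client = OpenAI()
    transcript = open("meeting_transcript.txt").read()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": (
                "List the main themes of this meeting. For each theme include one "
                "illustrative quote copied verbatim from the transcript. "
                'Respond as a JSON array of {"theme": ..., "quote": ...} objects.'
            )},
            {"role": "user", "content": transcript},
        ],
    )
    items = json.loads(response.choices[0].message.content)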


I think of AI as a “hint generator” that will give you some good guesses, but you still have to verify the guesses yourself. One thing it can help with is coming up with search terms that you might not have thought of.


What would be the equivalent of searching for quotes in your first (PNG) example?

Switching to a text source, what would you do if, say, 30% of the quotes do not match with Ctrl-F?


>What would be the equivalent of searching for quotes in your first (PNG) example?

I don't have a general answer to that. It depends on the specifics of the application. In many cases the documents I'm interested in will have some overlap with structured data I have stored in a database. In the concrete example there could be a register of practicing physicians that could be used for cross referencing. But in other cases I think it's an unsolved problem that may never be solved completely.

>Switching to a text source, what would you do if, say, 30% of the quotes do not match with Ctrl-F?

That's what I meant by swapping false positives for false negatives. You could simply throw out all the items for which you can't find the quote (which can obviously be done automatically). The remaining items are now "fact checked" to some degree. But the number of false negatives will probably have increased because not all the quotes without matches will be hallucinations.
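
In code the check is just a substring test over whatever the model returned, e.g. (assuming a list of theme/quote dicts like the one sketched above):

    # Keep only items whose quote appears verbatim in the source text.
    # This trades false positives (hallucinated quotes) for false negatives
    # (real quotes the model paraphrased or re-punctuated).
    def fact_checked(items, transcript):
        return [item for item in items if item["quote"] in transcript]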

Another approach would be to send the query separately to multiple different models or to ask one model to check another model's claims.

I think what works and what is good enough is highly application specific.


There are two issues to address:

1. The price of validation.

2. The quality.

The baseline is to do the work yourself and compare - the equivalent of a "brute force" solution. This of course defeats the purpose of the entire exercise. You propose an approach to reduce the validation price by crafting the prompt in such a way that the validation can be partially automated. This may reduce the quality because of false negatives and whatnot.

The underlying assumption is that this process is cheaper than "brute force" and the quality is "good enough". It would be interesting to see a writeup of some specific examples.


It is useful, perhaps very useful, for journalists and other people who use it for one-off tasks. It is very ill suited for massive automation at the moment, and that’s a real problem everyone struggles with.

The application of embedding vectors without the rest of the LLM can presently deliver a lot of sustainable innovation, at least compared to present-day SOTA models (imho of course).


Great talk and I have been enjoying your work. Keep it up!


AI in its current iteration is a human-enhancement tool, not a human-replacement tool. I do not understand how anyone who works for a living thinks this is a bad thing.

It would be like if the industrial revolution made artisanal weaving a million times more efficient without a giant satanic mill.


Regarding the campaign finance documents, it's not just a vision problem with handwritten filings. Even when I feed Gemini Pro a typed PDF with normal text (the kind you can search, select, and copy in any PDF reader) it just invents things to fill in the result. I asked it to summarize the largest donors of a local campaign given a 70-page PDF and it gave me a top-10 table that seemed plausible but all the names were pulled from somewhere else in Gemini's tiny brain. None of them appeared anywhere in the filing.


Generative AI is just not the Leatherman Tool everyone wants it to be.

There are absolutely ways to create solutions in this problem space using ML as a tool, but they are more specialized and probably not economical until the cost of training bespoke models goes down.


"Leatherman Tool" is a good comparison, as pretty much every "blade" in a Leathermann is a crappy tool, better only than the Swiss Army knife equivalent.

The Leatherman's reason for existence is that you have it with you for unexpected problems.


This is why I can't really trust LLMs for data tasks like this. I can't be certain that they won't be hallucinating results.


Yeah, I'm beginning to think detailed OCR-style extraction just isn't a good fit for these models.

It's frustrating, because getting that junk out of horribly scanned PDFs is 90% of the need of data journalism!

The best OCR API I've used is still AWS Textract - have you seen anything better?

I ended up building a tiny CLI tool for Textract because I found the thing so hard to use: https://github.com/simonw/textract-cli
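
If you want to call Textract directly, the underlying boto3 call is roughly this (a minimal sketch, not exactly what the CLI does):

    # Minimal sketch: synchronous Textract text detection on a single image
    import boto3

    textract = boto3.client("textract")

    with open("scanned_page.png", "rb") as f:
        result = textract.detect_document_text(Document={"Bytes": f.read()})

    # LINE blocks carry the recognized text; WORD blocks and geometry come back too
    for block in result["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block["Text"])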


Azure Doc Intel is good. It creates a semantic representation (in JSON) of the doc. The schema is useful, with page, paragraph, table, etc. objects. The bounding-box for each element is also useful. It flails with images, only giving you the bounding-box of where the image exists. It's up to you to extract any images separately and then figure out how to correlate them. Overall I think it's useful rather than rolling your own (at least for small scale).
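
Roughly what that looks like from Python, as a sketch using the azure-ai-formrecognizer SDK (the newer Document Intelligence SDK has slightly different names, so check the current docs):

    # Sketch: layout analysis returning paragraphs and tables with bounding regions
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # endpoint and key are placeholders for your own resource
    client = DocumentAnalysisClient(
        "https://<your-resource>.cognitiveservices.azure.com/",
        AzureKeyCredential("<key>"),
    )

    with open("filing.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()

    for para in result.paragraphs:
        print(para.content, para.bounding_regions)

    for table in result.tables:
        print(table.row_count, table.column_count)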


Yeah, I think MS' is the best out there, but agree that the usability leaves something to be desired. 2 thoughts:

1. I believe the IR jargon for getting a JSON of this form is Key Information Extraction (KIE). MS has an out-of-the-box model for this. I just tried the screenshot and it did a pretty good (but not perfect) job. It didn't get every form field, but most. MS sort-of has a flow for fine-tuning, but it really leaves a lot to be desired IMO. Curious if this would be "good enough" to satisfy the use case.

2. In terms of just OCR (i.e. getting the text/numeric strings correct), MS is known to be the best on typed text at the moment [1]. Handwriting is a different beast... but it looks like MS is doing a very good job there (and SOTA on handwriting is very good). In particular, it got all the numbers in that screenshot correct.

If you want to see the results from MS on the screenshot in this blog post, here's the entire JSON blob. A bit of a behemoth but the key/value stuff is in there: https://gist.github.com/cmoscardi/8c376094181451a49f0c62406e...

[1] https://mindee.github.io/doctr/latest/using_doctr/using_mode...


That does look pretty great, thanks for the tip.

Sending images through that API and then using an LLM to extract data from the text result from the OCR could be worth exploring.
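
The shape I have in mind is something like this, using my LLM Python library (a sketch - the model name is a placeholder and the prompt would need real work):

    # Sketch: pipe OCR output (from Textract, Azure, etc.) through an LLM for extraction
    import llm

    ocr_text = open("ocr_output.txt").read()

    model = llm.get_model("gpt-4o-mini")  # placeholder - any model you have configured
    response = model.prompt(
        ocr_text,
        system="Extract every donor name and donation amount as a JSON array of objects",
    )
    print(response.text())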


I have to admit I've been having trouble figuring out what to do with bounding boxes of elements - I've got those out of Textract, but it feels like there's a lot of custom code needed to get from a bunch of bounding boxes to a useful JSON structure of the document.

That's why the idea of having an LLM like GPT-4 Vision or Gemini Pro or Claude process these things is so tempting - I want to be able to tell it what to do and get back JSON. And I can! It's just not reliable enough (yet?) to be particularly useful.


Have you considered using something like SAM (Segment Anything) for the bounds, then OCR for the text, and finally running it through an LLM, which is a good predictor of text, to figure out missing words or typos (wrong chars)?


Oh, that's an interesting idea! One challenge I've had with OCR is that things like multiple columns frequently confuse it - pulling out the regions first, OCRing them independently and then using an LLM to try and piece everything back together could be a neat direction.


OCR software is thirty years in the making! There must be dozens of alternatives... interested to hear from people close to this topic.


I had a good experience using easyocr to get data out of lottery draw videos.


There was a story a few years ago about how many legislative bills introduced in the US were written by lobbyists, which might have benefited from these tools:

https://publicintegrity.org/politics/state-politics/copy-pas...

> "Using data provided by LegiScan, which tracks every proposed law introduced in the U.S., we pulled in digital copies of nearly 1 million pieces of legislation introduced between 2010 and Oct. 15, 2018."

The scoring system used seems fairly simple in comparison to what AI can do:

> "Our scoring system is based on three factors: the longest string of common text between a model and a bill; the number of common strings of five or more words; and the number of common strings of 10 or more words."
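
Those three factors are simple enough to approximate in a few lines (a rough sketch over whitespace-tokenized text, counting shared 5-word and 10-word sequences):

    # Rough approximation of that scoring: longest common word run, plus counts
    # of 5-word and 10-word sequences shared between two texts.
    from difflib import SequenceMatcher

    def ngrams(words, n):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def score(model_text, bill_text):
        a, b = model_text.lower().split(), bill_text.lower().split()
        sm = SequenceMatcher(None, a, b, autojunk=False)
        longest = sm.find_longest_match(0, len(a), 0, len(b)).size
        return longest, len(ngrams(a, 5) & ngrams(b, 5)), len(ngrams(a, 10) & ngrams(b, 10))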


I found a couple of bylined articles in the Daily Illini that I wrote as a freshman. Very boring stuff; they didn't give the exciting beats to a newbie.

Then I ran them through ChatGPT to see what it thinks "objective journalism" is.

https://albertcory50.substack.com/p/ai-does-journalism

Conclusion: if I'd done what it suggested, the editor would have red-pencilled it out. It isn't that hard to just write the facts. No one wants to hear "both sides" on a donation of land to the Park District.


One of the themes of my talk was that generating text directly is actually one of the least interesting applications of LLMs to journalism - that's why I focused on things like structured data extraction and code generation (SQL and code interpreter), those are much more useful for data journalists IMO.


This post is about a talk I gave at a data journalism conference, but it's also a demo of a whole bunch of projects I've been working on over the past couple of months/years:

- Claude 3 Haiku generation via phone or laptop camera: https://tools.simonwillison.net/haiku

- A new Datasette plugin for creating tables by pasting in CSV/TSV/JSON data: https://github.com/datasette/datasette-import

- A plugin that lets you ask a question in English and have that converted into a SQL query (also using Claude 3 Haiku): https://github.com/datasette/datasette-query-assistant

- shot-scraper for scraping web pages from the command-line: https://shot-scraper.datasette.io/en/stable/javascript.html

- Datasette Enrichments for applying bulk operations to data in a SQLite database table: https://enrichments.datasette.io/en/stable/ - demonstrating both a geocoder enrichment https://github.com/datasette/datasette-enrichments-opencage and a GPT-powered one: https://github.com/datasette/datasette-enrichments-gpt

- The LLM command-line tool: https://llm.datasette.io/ - including a preview of the image support, currently in a branch

- The datasette-extract plugin for extracting structured data into tables from unstructured text and images: https://www.datasette.cloud/blog/2024/datasette-extract/

- The new datasette-embeddings plugin for calculating and storing embeddings for table content and executing semantic search queries against them: https://github.com/datasette/datasette-embeddings

- Datasette Scribe by Alex Garcia: a tool for transcribing YouTube video audio into a database, diarizing speakers and making the whole thing searchable: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...


But Simon, you are a strongly opinionated guy whose opinions are interesting. Why don't you make a data journalism piece? Why don't you submit an article to the NYTimes, WSJ, etc.? They will publish it! What are you waiting for?


Sounds like a lot of work!

I have been thinking that I need to do some actual reporting myself at some point though, ultimate version of dogfooding for all of my stuff.


Looks great Simon! Do you have any anecdotes of journalists using the techniques in this demo in news pieces we can read?


Most of the stuff I presented in this talk (the structured data extraction things, AI query assistance etc) is so new that it's not had a chance to be used for a published story yet - I'm working with a few newsrooms right now getting them set up with it though.

Datasette itself has been used for all kinds of things within newsrooms. Two recent examples I heard about: the WSJ use it internally for tools around topics like CEO compensation tracking, and Bellingcat have used it for some of their work that involves leaked data relating to Russia.

The problem with open source tools is that people can use them without telling you about it! I'm trying to get better at encouraging people to share details like this with me; I'd really like to get some good use-cases written up.


"Haikus from images with Claude 3 Haiku"

But think of all the poet jobs you are eliminating...



