First look at the data: https://pastila.nl/?05ee30a0/be7f1715c7de106b95cccd9385a...

wumms · on July 1, 2024

Seems to correlate nicely [0]:

> Prime Minister of the United Kingdom, from 1940 to 1945 during the Second World War, and 1951 to 1955

> Died 24 January 1965

[0] https://en.wikipedia.org/wiki/Winston_Churchill

zX41ZdbW · on July 1, 2024

Also uploaded it to the public playground for queries:

https://play.clickhouse.com/play?user=play#U0VMRUNUIHllYXIsI...

zX41ZdbW · on July 1, 2024

But the scan quality is subpar. Example:

> For belter safekecping Russta’s $2¢4,000,000 collection of crown jewels, probably (he finesl array of gems ever assem- bled at one tle

https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

whistle650 · on July 1, 2024

https://chatgpt.com/share/13f553a8-5cff-42a1-be95-4a9d33cd10...

May also be easy to correct a lot of it:

“For better safekeeping, Russia’s $24,000,000 collection of crown jewels, probably the finest array of gems ever assembled at one time,”

bbarnett · on July 1, 2024

But are you correcting the OCR or miscorrecting the originals?

I want original text, including misspellings, and original regional / historical spellings, including slang (which may look like another word, but is not, and isn't in a dictionary).

You cannot fix OCR text without looking at the original.

brabel · on July 1, 2024

With the spelling having been fixed, even if imperfectly, you could much more easily search for content and find relevant results, and then go on to look at the originals. What you want is still possible, unless you unreasonably make it a requirement that the transcriptions should be perfect.

bbarnett · on July 1, 2024

Proper transcription to digital is to do so with accuracy, not "close enough".

DemocracyFTW2 · on July 1, 2024

to quote myself, "every interesting data set will have inaccuracies in it"

bbarnett · on July 1, 2024

There is a vast difference between a rare, honest mistake, and an attempt to mitigate them...

vs willingly knowing you are introducing corrections that are ridiculously wrong.

Advocating and being a champion for inaccuracy, really isn't a positive. You should find a new thing to quote about yourself.

DemocracyFTW2 · on July 2, 2024

This is not what this phrase is about. I came to it working on the structural data of just under 100k Chinese characters. I'd spend hours, days and weeks proofreading and correcting formulas, so your "advocating and being a champion for inaccuracy" doesn't stick. But absent an automated, complete coverage of all records against a known error-free data set, there will likely be a small percentage of errors and dubious cases.

And thanks, by the way, for the readiness to jump to conclusions and fire a salvo of allegations, viz. "willingly", "knowingly", "introducing", "ridiculous".

bbarnett · on July 2, 2024

You're making statements supporting the concept that errors are unavoidable, with an air of "oh well!", in a thread where someone is claiming AI is a solution... right after demonstrating a 10x error!

AI is a ridiculous answer, with its hallucinations and absurd error rates. If you didn't intend to support that level of absurd error rate, you shouldn't be replying in defence.

It sounds like you did not want to give that impression. If so, I suggest you look at the chain of replies, and the context.

AI hype is literally a danger to us all.

DemocracyFTW2 · on July 4, 2024

oh well

notachatbot1234 · on July 1, 2024

"$2¢4,000,000" should be "$204,000,000" rather than ChatGPT's "$24,000,000".
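
A rough sketch of why the text alone can't settle this, assuming the stray "¢" is either pure noise or a misread "0". Both options are guesses, and the function name and substitution list are made up for illustration:

```python
import re

def candidate_readings(token, garbled="¢", substitutes=("", "0")):
    """Return every amount the garbled token could stand for.

    `substitutes` is a hypothetical list of what the garbled
    character might have been in the original print.
    """
    readings = set()
    for sub in substitutes:
        # Swap in the guess, then keep only the digits.
        digits = re.sub(r"[^0-9]", "", token.replace(garbled, sub))
        readings.add(int(digits))
    return sorted(readings)

readings = candidate_readings("$2¢4,000,000")
print(readings)                    # [24000000, 204000000]
print(readings[1] / readings[0])   # 8.5
```

Both readings are well-formed dollar amounts, so picking between them needs the original scan or some outside source, not just the OCR string.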

djhn · on July 1, 2024

Are you aware of any models that perform as well as an LLM on this task at lower cost?

bl4ckneon · on July 1, 2024

Self-hosted LLM?

dr_kiszonka · on July 1, 2024

In case it wasn't intentional, you may be doxxing yourself.