
First look at the data: https://pastila.nl/?05ee30a0/be7f1715c7de106b95cccd9385a6c2e...

TLDR: it makes sense :)



Seems to correlate nicely [0]:

> Prime Minister of the United Kingdom, from 1940 to 1945 during the Second World War, and 1951 to 1955

> Died 24 January 1965

[0] https://en.wikipedia.org/wiki/Winston_Churchill


Also uploaded it to the public playground for queries:

https://play.clickhouse.com/play?user=play#U0VMRUNUIHllYXIsI...


But the scan quality is subpar. Example:

> For belter safekecping Russta’s $2¢4,000,000 collection of crown jewels, probably (he finesl array of gems ever assem- bled at one tle

https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...


https://chatgpt.com/share/13f553a8-5cff-42a1-be95-4a9d33cd10...

May also be easy to correct a lot of it:

“For better safekeeping, Russia’s $24,000,000 collection of crown jewels, probably the finest array of gems ever assembled at one time,”


But are you correcting the OCR or miscorrecting the originals?

I want original text, including misspellings, and original regional / historical spellings, including slang (which may look like another word, but is not, and isn't in a dictionary).

You cannot fix OCR text without looking at the original.


With the spelling having been fixed, even if imperfectly, you could much more easily search for content and find relevant results, and then go on to look at the originals. What you want is still possible, unless you unreasonably make it a requirement that the transcriptions should be perfect.
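As a rough sketch of that workflow (not anyone's actual pipeline): keep the raw OCR string untouched and store the corrected text next to it purely as a search aid. The `Passage` class and `search` helper are hypothetical names made up for illustration; the two strings are the ones quoted upthread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    raw_ocr: str    # untouched OCR output, kept as the record of the scan
    corrected: str  # machine-corrected text, used only to make search easier


def search(passages, query):
    """Return passages whose corrected text contains the query (case-insensitive)."""
    needle = query.lower()
    return [p for p in passages if needle in p.corrected.lower()]


passages = [
    Passage(
        raw_ocr="For belter safekecping Russta’s $2¢4,000,000 collection of crown jewels",
        corrected="For better safekeeping, Russia’s $24,000,000 collection of crown jewels",
    ),
]

# The query matches the corrected text; the caller still gets the raw OCR back
# and can go on to check it against the original scan.
hits = search(passages, "safekeeping")
for hit in hits:
    print(hit.raw_ocr)
```

The correction is never written over the source, so a wrong fix only costs a missed or spurious search hit; it does not alter the transcription itself.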


Proper transcription to digital is to do so with accuracy, not "close enough".


to quote myself, "every interesting data set will have inaccuracies in it"


There is a vast difference between a rare, honest mistake, and an attempt to mitigate them...

vs willingly knowing you are introducing corrections that are ridiculously wrong.

Advocating and being a champion for inaccuracy, really isn't a positive. You should find a new thing to quote about yourself.


This is not what this phrase is about. I came to it working on the structural data of just under 100k Chinese characters. I'd spend hours, days and weeks proofreading and correcting formulas, so your "advocating and being a champion for inaccuracy" doesn't stick. But absent an automated, complete coverage of all records against a known error-free data set, there will likely be a small percentage of errors and dubious cases.

And thanks, by the way, for the readiness to jump to conclusions and fire a salvo of allegations, viz. "willingly", "knowingly", "introducing", "ridiculous".


You're making statements supporting the concept that errors are unavoidable, with an air of "oh well!", in a thread where someone is claiming AI is a solution... right after demonstrating a 10x error!

AI is a ridiculous answer, with its hallucinations and absurd error rates. If you didn't intend to support that level of absurd error rate, you shouldn't be replying in defence.

It sounds like you did not want to give that impression. If so, I suggest you look at the chain of replies, and the context.

AI hype is literally a danger to us all.


oh well


"$2¢4,000,000" should be "$204,000,000" rather than ChatGPT's "$24,000,000".


Are you aware of any models that perform as well as an LLM on this task at lower cost?


Self-hosted LLM?


In case it wasn't intentional, you may be doxxing yourself.



