
It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).

Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.

Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise reduction bath and come out sounding like it was recorded in a room full of soft furniture.



>~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement

97% accuracy means roughly three or four errors per minute of speech. That seems potentially extremely problematic for something like law enforcement use where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".
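
For a rough back-of-the-envelope check (assuming a typical conversational rate of around 140 words per minute, which is an assumption rather than a measured figure):

    # word error rate -> errors per minute of speech (rough arithmetic)
    words_per_minute = 140        # assumed conversational speaking rate
    word_error_rate = 0.03        # "97% accuracy"
    print(words_per_minute * word_error_rate)   # ~4.2 errors per minute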


No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.

Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.


Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)


By the time you're prosecuting someone in court, yes of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes you can identify which content probably has errors and flag it as such.

Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.

So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.

NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.


What tools do you use to do this? I once hacked together an editor like this maybe a decade ago -- edit speech as text from OCR -- and sorely need one now.

Alignment of video to text is a big problem for me too.


This can be done via https://www.descript.com/ - you can edit video/audio by editing the transcript.

You can even add/modify words that weren't originally there https://www.descript.com/overdub


Thank you!


> So when I say that machine transcription is as good as human realtime transcription now...

Would you go as far as to assert machine transcription can be used as an objective benchmark of a speaker’s verbal legibility?

It is fraught with political and interpersonal dynamics to approach someone even privately one on one today and gently suggest their career would get a huge boost if they hired a voice coach to help improve their verbal communication delivery. So even when I don’t directly mention their accent, it becomes a very sensitive subject with many.

However, if audio professionals like you can point to a system and say the raw biomechanics and acoustic physics of the world dictate that this is as physically and psychometrically good as audio parsing of human speech gets regardless whether the system was biologically evolved or ML evolved, the conversation can be couched even more objectively.

I enable recording and voice transcription in every meeting I can (ostensibly for DE&I but really for my own selfish purposes), and already observe in myself I have to work hard to overcome a tendency to gloss over speakers who don’t transcribe well when I review meeting transcripts to jot down any key information I might have missed taking notes upon during the meeting.

Note that I’m perfectly aware that my foreign language verbal skills are nowhere near the English skills of those I have tried to help. If the lingua franca of the coding world switched to Urdu tomorrow, then I’d hire help to learn and polish my spoken Urdu, like I went to a speech coach when learning public speaking because I can always use help in the many skills I lack.


Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.


You double check things that you think are important, in this case, passages that will be used as evidence in court.


> I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.
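
As a minimal sketch of what that review pass could look like (assuming Whisper-style output, where each segment carries an avg_logprob confidence score; the file name and the 10% cutoff are just illustrative):

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("interview.wav")

    # sort segments by confidence and hand the least-confident 10% to a human reviewer
    segments = sorted(result["segments"], key=lambda s: s["avg_logprob"])
    to_review = segments[: max(1, len(segments) // 10)]
    for seg in to_review:
        print(f'{seg["start"]:7.1f}s  {seg["text"]}')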


You can also use multiple transcription engines and then use mismatches among the text streams to narrow down the % of content that needs to be reviewed. This is quite similar to multi-voting OCR for document images.

The principle is that the engines (hopefully) have different failure modes, and therefore each engine's 2-3% of errors falls in different areas of the audio. The key underlying assumption is that the engines' error events rarely coincide.

With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
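
A toy sketch of that 2-of-3 vote (assuming the three transcripts have already been word-aligned, which in practice is the hard part):

    from collections import Counter

    def vote(a, b, c):
        # accept a word if at least two engines agree, otherwise flag it for review
        word, n = Counter([a, b, c]).most_common(1)[0]
        return word if n >= 2 else f"[REVIEW: {a}/{b}/{c}]"

    engine_a = "the suspect left at nine".split()
    engine_b = "the suspect left at night".split()
    engine_c = "the suspect left at nine".split()
    print(" ".join(vote(*w) for w in zip(engine_a, engine_b, engine_c)))
    # -> the suspect left at nine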


I had to do a lot of manual transcription in Journalism school. Using a tool like Descript saved HOURS of my life. Generally it was 80% accurate, but going over a two-hour-long recording again at 3x speed while reading over the transcript, fixing errors from memory or pausing, took a five-hour job down to 30-40 minutes. Either way, somebody is going to have to listen to the recording. This just removes a layer of grunt work.


Having done audio transcription in college as a side gig, I can say it takes a lot longer than it sounds. Even at a decent 100wpm you'll take about 5 minutes to type out 1 minute of audio.

Not having to pause + rewind will save a ton of time for that 3%.


Maybe you could run the text through a grammar checker to identify the errors.
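
Something like this, perhaps (a minimal sketch using the language_tool_python wrapper around LanguageTool; the sample sentence is made up):

    import language_tool_python

    tool = language_tool_python.LanguageTool("en-US")
    transcript = "We met him their at the warehouse and he sad nothing."
    for match in tool.check(transcript):
        print(match)   # each match points at a suspicious span worth re-listening to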


That might work if people were required to speak grammatically.


For real. The way people normally speak - with backtracking, repetition, restarting sentences, or stopping mid-sentence and starting a new one with entirely different nouns or subjects - is perfectly natural in synchronous conversation and isn't jarring, but written down as-is, it's like 40% noise.


For a good example of this, read ANY of Trump's speeches transcribed.


I mean if you want to make it unnecessarily political, Biden's are worse: https://www.youtube.com/watch?v=3bWM1zsnTJc


To be fair, you chose a video that displays an amalgamation of the biggest gaffes of 2021 for Biden.

“During his term as President of the United States, Donald Trump made tens of thousands of false or misleading claims. The Washington Post's fact-checker had tallied the number as 30,573 by January 2021, an average of about 21 per day by the end of his presidency.” [1][2][3][4]

I think it’s fair to say there would be a 100-plus-hour-long video / documentary if they were all compiled into one. Lovely!

  - [1] Fact Checker (January 20, 2021). "In four years, President Trump made 30,573 false or misleading claims". The Washington Post. Archived from the original on January 20, 2021.

  - [2] Kessler, Glenn (January 23, 2021). "Trump made 30,573 false or misleading claims as president. Nearly half came in his final year". The Washington Post. Archived from the original on January 24, 2021. Retrieved January 24, 2021.

  - [3] Elfrink, Tim (August 14, 2020). "'Do you regret at all, all the lying you've done?': A reporter's blunt question to Trump goes unanswered". The Washington Post. Retrieved August 14, 2020.

  - [4] https://en.m.wikipedia.org/wiki/Veracity_of_statements_by_Do...


Oh no no, I wasn't trying to be political, it's just one that I read... and wow, you're right!


>equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.


ML systems tending to make weird mistakes, rather than the subtle ones that make sense in context like human transcribers do, is likely to make those mistakes easier to spot.

And there are humans in the loop too, and an enormous amount of redundancy in the questions and answers, so even plausible false transcriptions will get picked up on if they matter. Nobody gets sent to jail simply because the transcription process - human or machine - accidentally substitutes "I did it" in place of "I didn't" midway through a two-hour interview.


The thing is that 'Likely' is very far away from 'always'. There is no guarantee the mistake will be easy to spot.

For entertainment purposes AI transcription is awesome.

For serious business applications, the ability to recognize mistakes will continue to be a field to which serious attention is given. It would be interesting to see an AI process double-check itself, and also run a logic check on whether the transcription makes sense, so that it can report sections flagged as incongruous or of dubious reliability.


+1. There is a widespread "metric fallacy" or "task fallacy" going around. Models of course optimize for metrics, so they tend to perform well on those related metrics.

Humans, however, are not simply metric optimizers. Though it's always in the interest of those corporations producing metric optimizers (i.e. models) to paint humans as such, so their models shine in comparison. They want humans to look like bad machines, so it looks like they should be automated. Not to say they shouldn't in many cases, just that there's a clear one-sidedness in all corporate PR (and funded research, especially that research which is also PR).

All this to say that yes I agree with you. And if we humans don't want our unsustainable economic growth to turn us even more into machines (as our bureaucratic creep has done quite well thus far), we should fight such rhetoric that aims to paint humans simply as machines or task-doers.


If you know which 2-3% are the false positives, you have a very lucrative business model.


When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone's or something's name (one that is rare / unique) and map it onto a known, similar-sounding word.


Sometimes even humans will disagree about what was said in a recording - I had this happen recently. I heard a specific sentence; the other person heard the exact opposite. I cannot say who was right: even after listening to the recording several times on headphones and speakers, I'm as certain of my interpretation as the other party was of theirs.


I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.


It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.


It already exists. The commercial product I use most is called sonix.ai and I think they have a free tier or trial period. It has some shortcomings, but it's shockingly good.


Google Voice voicemail transcription used to do this, with varying levels of gray. It seems that feature is gone, now.


Yeah, I tried to use automated transcription for a research project and we had to do it all manually because the few errors (I would say it did pretty well given our recording quality) were often dropped words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.


Microsoft announced their voice transcription technology a couple of years ago and were also touting ~97-98% accuracy, which was actually better than human transcription error rates. The errors are usually partly people garbling their own speech, or moving their head while talking so the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt".


If it's anything like Microsoft Teams transcription, I doubt the 97%+ accuracy.


I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can make out critical timestamps in conversations and a law enforcement officer will always manually confirm the software's assessments.


Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.


In all honesty, this is the correct mindset to have. I have limited expertise in this topic, and you should be aware that other law enforcement agencies probably do not handle this the same way.


I imagine a certain percentage of a given population is on a voice call at any one time.

1. Set up a computer with voice recognition software that flags certain patterns.

2. Connect computer to voice call communication network.

3. Configure computer to switch between calls every x number of seconds.

Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.


This is called "a fishing expedition" and is wildly unconstitutional in the US.

>The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.


Are you sure about that? [0]

Besides I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.

[0] - https://en.wikipedia.org/wiki/Jewel_v._NSA


Yes, it is wildly unconstitutional, but in practice don't the courts endorse the asinine "it's not a search unless we find something" argument from the NSA?

Power always just finds a way to rationalize what it wants to do.


see: Operation PRISM


Not really. Imagine that they do simple keyword matching on the text. Anything that's missed (lost in the 3% of errors) the criminals get away with. Anything that matches is then checked by a human (by listening to the audio at that timestamp). So you only need to manually check those matches, and even then only if something you're interested in is found.
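
A minimal sketch of that triage step (assuming Whisper-style segments with timestamps; the watch-list and sample segments are made up):

    KEYWORDS = {"warehouse", "shipment", "cash"}   # illustrative watch-list

    segments = [
        {"start": 12.0, "end": 15.5, "text": "meet me at the warehouse at nine"},
        {"start": 80.0, "end": 84.0, "text": "how was your weekend"},
    ]

    # only the flagged timestamps get listened to by a human
    for seg in segments:
        hits = KEYWORDS & set(seg["text"].lower().split())
        if hits:
            print(f'{seg["start"]:.0f}s-{seg["end"]:.0f}s {sorted(hits)}: {seg["text"]}')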


One would think that the few crucial bits of information gleaned are listened to manually, and the machine transcription is not the only thing the judge or a jury sees.


You have absolutely ruined someone's day way before they're sitting in front of a jury.


Stuff like that is a very good tell that someone has zero experience with law enforcement.


I've not found that to be the case.

For technical content, I use Rev.com and provide a glossary, and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters: I've never found AI so far to handle words like "TCP/IP", "FAT disk format", or "Big Endian" well.

I'm interested to test out Whisper on this one.

https://corecursive.com/063-apple-2001/


There's already software that can imitate a person's voice, so we already have all the pieces to do speech-to-text, clean it up with GPT-3, and go back to text-to-speech in the original person's voice. Maybe with a style transfer to keep the person's inflections etc. the same?
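
A rough sketch of that pipeline (assuming the pre-1.0 openai Completion API for the cleanup step; the voice-cloning TTS call at the end is purely hypothetical and just marks where such a tool would slot in):

    import whisper, openai   # openai.api_key must be set elsewhere

    # 1. speech-to-text
    text = whisper.load_model("base").transcribe("raw_take.wav")["text"]

    # 2. clean up fillers and false starts with GPT-3
    cleaned = openai.Completion.create(
        model="text-davinci-003",
        prompt="Remove filler words and false starts, keeping the meaning:\n\n" + text,
        max_tokens=1024,
    )["choices"][0]["text"]

    # 3. back to audio in the original voice (hypothetical function, no such API here)
    # clone_and_speak(cleaned, reference_audio="raw_take.wav")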


I think something similar already exists. See this, for example: https://koe.ai/recast/

Although I don't know if they're using anything similar to what you suggest. Very cool idea, anyway!


Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.


Not sure about open source, but in general, automated transcription systems need a separate track for each different speaker. So for example, for a phone call with one person on each end, you need two separate channels (recording systems usually split them left/right on one stereo file).
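
For a two-speaker stereo recording, a minimal sketch of that split-then-transcribe approach (file names are illustrative):

    import soundfile as sf
    import whisper

    audio, sr = sf.read("call.wav")            # stereo: shape (n_samples, 2)
    sf.write("speaker_left.wav", audio[:, 0], sr)
    sf.write("speaker_right.wav", audio[:, 1], sr)

    model = whisper.load_model("base")
    for name in ("speaker_left.wav", "speaker_right.wav"):
        print(name, model.transcribe(name)["text"][:80])

Interleaving the two transcripts by segment start time then gives you a rough speaker-labelled transcript.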


I'm not sure if you've tried Descript, but their ML-based "Studio Sound" filter makes bad audio sound like it was recorded and edited nicely.


Any recommendations for particular services?


I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.


That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It’s always been possible, just expensive and time consuming for moderate improvements.



