
It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).

Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.

Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise reduction bath and come out sounding like it was recorded in a room full of soft furniture.



>~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement

97% accuracy means roughly three or four errors per minute of speech. That seems potentially extremely problematic for something like law enforcement use where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".
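
For a rough back-of-the-envelope check (assuming a typical conversational rate of around 140 words per minute, which is an assumption rather than a measured figure):

    # word error rate -> errors per minute of speech (rough arithmetic)
    words_per_minute = 140        # assumed conversational speaking rate
    word_error_rate = 0.03        # "97% accuracy"
    print(words_per_minute * word_error_rate)   # ~4.2 errors per minute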


No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.

Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.


Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

(edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)


By the time you're prosecuting someone in court, yes of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes you can identify which content probably has errors and flag it as such.

Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.

So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.

NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.


What tools do you use to do this? I once hacked together an editor like this maybe a decade ago -- edit speech as text from OCR -- and sorely need one now.

Alignment of video to text is a big problem for me too.


This can be done via https://www.descript.com/ - you can edit video/audio by editing the transcript.

You can even add/modify words that weren't originally there https://www.descript.com/overdub


Thank you!


> So when I say that machine transcription is as good as human realtime transcription now...

Would you go as far as to assert machine transcription can be used as an objective benchmark of a speaker’s verbal legibility?

It is fraught with political and interpersonal dynamics to approach someone even privately one on one today and gently suggest their career would get a huge boost if they hired a voice coach to help improve their verbal communication delivery. So even when I don’t directly mention their accent, it becomes a very sensitive subject with many.

However, if audio professionals like you can point to a system and say the raw biomechanics and acoustic physics of the world dictate that this is as physically and psychometrically good as audio parsing of human speech gets regardless whether the system was biologically evolved or ML evolved, the conversation can be couched even more objectively.

I enable recording and voice transcription in every meeting I can (ostensibly for DE&I but really for my own selfish purposes), and already observe in myself I have to work hard to overcome a tendency to gloss over speakers who don’t transcribe well when I review meeting transcripts to jot down any key information I might have missed taking notes upon during the meeting.

Note that I’m perfectly aware that my foreign language verbal skills are nowhere near the English skills of those I have tried to help. If the lingua franca of the coding world switched to Urdu tomorrow, then I’d hire help to learn and polish my spoken Urdu, like I went to a speech coach when learning public speaking because I can always use help in the many skills I lack.


Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.


You double check things that you think are important, in this case, passages that will be used as evidence in court.


> I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.

Suppose 90% of the errors are in the 10% where the model is least confident. Then you can review just 10% of your content and take a 2% error rate down to 0.2% error rate.
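
As a minimal sketch of what that review pass could look like (assuming Whisper-style output, where each segment carries an avg_logprob confidence score; the file name and the 10% cutoff are just illustrative):

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("interview.wav")

    # sort segments by confidence and hand the least-confident 10% to a human reviewer
    segments = sorted(result["segments"], key=lambda s: s["avg_logprob"])
    to_review = segments[: max(1, len(segments) // 10)]
    for seg in to_review:
        print(f'{seg["start"]:7.1f}s  {seg["text"]}')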


You can also use multiple transcription engines and then use mismatches among the text streams to narrow down the % of content that needs to be reviewed. This is quite similar to multi-voting OCR for document images.

The principle is that the engines (hopefully) have different failure modes, and therefore each engine's 2-3% of errors falls in different areas of the audio. The key underlying assumption is that the engines' error events rarely coincide.

With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
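
A toy sketch of that 2-of-3 vote (assuming the three transcripts have already been word-aligned, which in practice is the hard part):

    from collections import Counter

    def vote(a, b, c):
        # accept a word if at least two engines agree, otherwise flag it for review
        word, n = Counter([a, b, c]).most_common(1)[0]
        return word if n >= 2 else f"[REVIEW: {a}/{b}/{c}]"

    engine_a = "the suspect left at nine".split()
    engine_b = "the suspect left at night".split()
    engine_c = "the suspect left at nine".split()
    print(" ".join(vote(*w) for w in zip(engine_a, engine_b, engine_c)))
    # -> the suspect left at nine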


I had to do a lot of manual transcription in Journalism school. Using a tool like Descript saved HOURS of my life. Generally it was 80% accurate, but going over a two-hour-long recording again at 3x speed while reading over the transcript, fixing errors from memory or pausing, took a five-hour job down to 30-40 minutes. Either way, somebody is going to have to listen to the recording. This just removes a layer of grunt work.


Having done audio transcription in college as a side gig, I can say it takes a lot longer than it sounds. Even at a decent 100wpm you'll take about 5 minutes to type out 1 minute of audio.

Not having to pause + rewind will save a ton of time for that 3%.


Maybe you could run the text through a grammar checker to identify the errors.
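
Something like this, perhaps (a minimal sketch using the language_tool_python wrapper around LanguageTool; the sample sentence is made up):

    import language_tool_python

    tool = language_tool_python.LanguageTool("en-US")
    transcript = "We met him their at the warehouse and he sad nothing."
    for match in tool.check(transcript):
        print(match)   # each match points at a suspicious span worth re-listening to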


That might work if people were required to speak grammatically.


For real. The way people normally speak - with backtracking, repetition, restarting sentences, or stopping mid-sentence and starting a new one with entirely different nouns or subjects - is perfectly natural in synchronous conversation and isn't jarring, but written down as-is, it's like 40% noise.


For a good example of this, read ANY of Trump's speeches transcribed.


I mean if you want to make it unnecessarily political, Biden's are worse: https://www.youtube.com/watch?v=3bWM1zsnTJc


To be fair, you chose a video that displays an amalgamation of the biggest gaffes of 2021 for Biden.

“During his term as President of the United States, Donald Trump made tens of thousands of false or misleading claims. The Washington Post's fact-checker had tallied the number as 30,573 by January 2021, an average of about 21 per day by the end of his presidency.” [1][2][3][4]

I think it’s fair to say there would be a 100-plus-hour-long video / documentary if they were all compiled into one. Lovely!

  - [1] Fact Checker (January 20, 2021). "In four years, President Trump made 30,573 false or misleading claims". The Washington Post. Archived from the original on January 20, 2021.

  - [2] Kessler, Glenn (January 23, 2021). "Trump made 30,573 false or misleading claims as president. Nearly half came in his final year". The Washington Post. Archived from the original on January 24, 2021. Retrieved January 24, 2021.

  - [3] Elfrink, Tim (August 14, 2020). "'Do you regret at all, all the lying you've done?': A reporter's blunt question to Trump goes unanswered". The Washington Post. Retrieved August 14, 2020.

  - [4] https://en.m.wikipedia.org/wiki/Veracity_of_statements_by_Do...


Oh no no, I wasn't trying to be political, it's just one that I read... and wow, you're right!


>equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].

ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.


ML systems tending to make weird mistakes, rather than the subtle ones that make sense in context like human transcribers do, is likely to make those mistakes easier to spot.

And there are humans in the loop too, and an enormous amount of redundancy in the questions and answers, so even plausible false transcriptions will get picked up on if they matter. Nobody gets sent to jail simply because the transcription process - human or machine - accidentally substitutes "I did it" in place of "I didn't" midway through a two-hour interview.


The thing is that 'Likely' is very far away from 'always'. There is no guarantee the mistake will be easy to spot.

For entertainment purposes AI transcription is awesome.

For serious business applications, the ability to recognize mistakes will continue to be a field to which serious attention is given. It would be interesting to see an AI process double-check itself, and also run a logic check on whether the transcription makes sense, so that it can report sections flagged as incongruous or of dubious reliability.


+1. There is a widespread "metric fallacy" or "task fallacy" going around. Models of course optimize for metrics, so they tend to perform well on those related metrics.

Humans, however, are not simply metric optimizers. Though it's always in the interest of those corporations producing metric optimizers (i.e. models) to paint humans as such, so their models shine in comparison. They want humans to look like bad machines, so it looks like they should be automated. Not to say they shouldn't in many cases, just that there's a clear one-sidedness in all corporate PR (and funded research, especially that research which is also PR).

All this to say that yes I agree with you. And if we humans don't want our unsustainable economic growth to turn us even more into machines (as our bureaucratic creep has done quite well thus far), we should fight such rhetoric that aims to paint humans simply as machines or task-doers.


If you know which 2-3% are the false positives, you have a very lucrative business model.


When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone's or something's name (one that is rare / unique) and map it onto a known, similar-sounding word.


Sometimes even humans will disagree about what was said in a recording - I had this happen recently. I heard a specific sentence; the other person heard the exact opposite. I cannot say who was right: even after listening to the recording several times on headphones and speakers, I'm as certain of my interpretation as the other party was of theirs.


I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.


It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.


It already exists. The commercial product I use most is called sonix.ai and I think they have a free tier or trial period. It has some shortcomings, but it's shockingly good.


Google Voice voicemail transcription used to do this, with varying levels of gray. It seems that feature is gone, now.


Yeah, I tried to use automated transcription for a research project and we had to do it all manually because the few errors (I would say it did pretty well given our recording quality) were often dropped words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.


Microsoft announced their voice transcription technology a couple of years ago and were also touting ~97-98% accuracy, which was actually better than human transcription error rates. The errors are usually partly people garbling their own speech, or moving their head while talking so the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt".


If it's anything like Microsoft Teams transcription, I doubt the 97%+ accuracy.


I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can make out critical timestamps in conversations and a law enforcement officer will always manually confirm the software's assessments.


Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.


In all honesty, this is the correct mindset to have. I have limited expertise in this topic, and you should be aware that other law enforcement agencies probably do not handle this the same way.


I imagine a certain percentage of a given population is on a voice call at any one time.

1. Set up a computer with voice recognition software that flags certain patterns.

2. Connect computer to voice call communication network.

3. Configure computer to switch between calls every x number of seconds.

Think of it like a system to generate leads for law enforcement that can be integrated with other systems to produce the best quality leads.


This is called "a fishing expedition" and is wildly unconstitutional in the US.

>The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.


Are you sure about that? [0]

Besides I wasn't talking about the USA when I said this. I was remembering a conversation I once had with a person who worked as a technician in a telephone exchange.

[0] - https://en.wikipedia.org/wiki/Jewel_v._NSA


Yes, it is wildly unconstitutional, but in practice don't the courts endorse the asinine "it's not a search unless we find something" argument from the NSA?

Power always just finds a way to rationalize what it wants to do.


see: Operation PRISM


Not really. Imagine that they do simple keyword matching on the text. Anything that's missed (lost in the 3% of errors) the criminals get away with. Anything that matches is then checked by a human (by listening to the audio at that timestamp). So you only need to manually check those matches, and even then only if something you're interested in is found.
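
A minimal sketch of that triage step (assuming Whisper-style segments with timestamps; the watch-list and sample segments are made up):

    KEYWORDS = {"warehouse", "shipment", "cash"}   # illustrative watch-list

    segments = [
        {"start": 12.0, "end": 15.5, "text": "meet me at the warehouse at nine"},
        {"start": 80.0, "end": 84.0, "text": "how was your weekend"},
    ]

    # only the flagged timestamps get listened to by a human
    for seg in segments:
        hits = KEYWORDS & set(seg["text"].lower().split())
        if hits:
            print(f'{seg["start"]:.0f}s-{seg["end"]:.0f}s {sorted(hits)}: {seg["text"]}')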


One would think that the few crucial bits of information gleaned are listened to manually, and the machine transcription is not the only thing the judge or a jury sees.


You have absolutely ruined someone's day way before they're sitting in front of a jury.


Stuff like that is a very good tell that someone has zero experience with law enforcement.


I've not found that to be the case.

For technical content, I use Rev.com and provide a glossary, and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters: I've never found AI so far to handle words like "TCP/IP", "FAT disk format", or "Big Endian" well.

I'm interested to test out Whisper on this one.

https://corecursive.com/063-apple-2001/


There's already software that can imitate a person's voice, so we already have all the pieces to do speech-to-text, clean it up with GPT-3, and go back to text-to-speech in the original person's voice. Maybe with a style transfer to keep the person's inflections etc. the same?
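
A rough sketch of that pipeline (assuming the pre-1.0 openai Completion API for the cleanup step; the voice-cloning TTS call at the end is purely hypothetical and just marks where such a tool would slot in):

    import whisper, openai   # openai.api_key must be set elsewhere

    # 1. speech-to-text
    text = whisper.load_model("base").transcribe("raw_take.wav")["text"]

    # 2. clean up fillers and false starts with GPT-3
    cleaned = openai.Completion.create(
        model="text-davinci-003",
        prompt="Remove filler words and false starts, keeping the meaning:\n\n" + text,
        max_tokens=1024,
    )["choices"][0]["text"]

    # 3. back to audio in the original voice (hypothetical function, no such API here)
    # clone_and_speak(cleaned, reference_audio="raw_take.wav")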


I think something similar already exists. See this, for example: https://koe.ai/recast/

Although I don't know if they're using anything similar to what you suggest. Very cool idea, anyway!


Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.


Not sure about open source, but in general, automated transcription systems need a separate track for each different speaker. So for example, for a phone call with one person on each end, you need two separate channels (recording systems usually split them left/right on one stereo file).
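
For a two-speaker stereo recording, a minimal sketch of that split-then-transcribe approach (file names are illustrative):

    import soundfile as sf
    import whisper

    audio, sr = sf.read("call.wav")            # stereo: shape (n_samples, 2)
    sf.write("speaker_left.wav", audio[:, 0], sr)
    sf.write("speaker_right.wav", audio[:, 1], sr)

    model = whisper.load_model("base")
    for name in ("speaker_left.wav", "speaker_right.wav"):
        print(name, model.transcribe(name)["text"][:80])

Interleaving the two transcripts by segment start time then gives you a rough speaker-labelled transcript.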


I'm not sure if you've tried Descript, but their ML-based "Studio Sound" filter makes bad audio sound like it was recorded and edited nicely.


Any recommendations for particular services?


I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.


That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It’s always been possible, just expensive and time consuming for moderate improvements.



