Enhance Speech from Adobe – Free AI filter for cleaning up spoken audio (adobe.com)
340 points by modinfo on Dec 19, 2022 | 123 comments


This is incredible. I took a talk recording I made for NeurIPS and passed it through this tool. The improvement in audio quality was night and day. It went from clearly being recorded in a bedroom to a studio-like experience.

[0, original] https://youtu.be/gwkCIdwHRhc

[1, enhanced] https://youtu.be/RPnUqmSyZ6Q


The result is nowhere near studio quality. It's highly compressed and full of artifacts. I found the original to be more pleasant to listen to. It at least sounds natural.


It's so good it looks weird. It removed all the room acoustics, such that you would expect to see a mic very close to your face.


I feel the exact same way. It sounds like the mic is close to his mouth, but when watching the video it's clearly not there. It's super trippy.


I agree. I was also taken aback for the first few seconds, but after that I was able to adjust to the sound.

This would be wilder if I heard similarly enhanced audio live in an unconstrained environment. It would seem like a voice-over.


Wow you’re right, this is a great improvement in quality.

Also, very enjoyable & clear presentation.


Thank you very much! :)


This difference actually has an audio “uncanny valley” effect because the voice and the video setting are so different now.


Use AI to separate the speech and remix the room noise back into the recording.


Or just remix the background to match the expected environment for the audio!


This seems to work well for certain voices and recordings and quite badly for others.

Is there any way to easily train it on my own voice, so it isolates mine (or any other trained voice) from noisy environments?


I was shocked at how well it worked in this case. Thanks for the examples!


I was shocked how (on my notebook) it sounds like there is something wrong with the speakers.


I tried listening to it with my AirPods on. The enhanced audio sounds much more natural this way, as if I’m in the room with you.


I was very excited about this, and then I tried it. I gave a talk at a local meetup 8 years ago [0] that wasn't mic'd properly, but I uploaded it anyways. It was never the most pleasant thing to listen to, but the audio was understandable enough that for years it was a top search result for "javascript promises" on YouTube and accumulated 39K views.

However, when I ripped that video's audio and put it through the linked AI filter (Adobe Podcast's Enhance Speech), it went from being unpleasant-but-understandable to perfectly clear gibberish [1]. For example, about 15 seconds in I said:

> [...] at Second Street, the company whose office you are currently sitting in. Uh, you can find me on the web on GitHub, on Twitter, or my own site at Kerrick Long Dot Com.

In the "enhanced" version, it sounds like a person who isn't me said the following in soundproofed studio (with a head cold):

> [...] at Second Street, the cubbany od office you are currently city-aly. You gidhi me on the web on GitDub, on Twitter, or my own side at Kerri Flow Dot Cowf.

[0]: https://youtube.com/watch?v=wc72cyYt8-c

[1]: https://soundcloud.com/kerricklong/javascript-promises-think...


I used this filter to reduce wind noise during a guided tour of BB63:

https://youtu.be/1LDlOmKtfeQ?t=60

This is a 20 yr old vid shot at Pearl Harbor on the deck of USS Missouri with a low-end consumer Hi-8 camera using built-in mics. Notice flags and clothes rippling in the wind. This filter works pretty well in this case. Maybe it was trained on a New York accent? Mixing in the enhanced track as needed works well. Using only the enhanced track may sound artificial at times.

Here's another vid shot at a brewery where this filter helped clarify the brewmaster's voice over a noisy restaurant and outdoor machinery (again, built-in mic):

https://youtu.be/nANSdnYj-R0

I found this filter useful.


> Mixing in the enhanced track as needed works well. Using only the enhanced track may sound artificial at times.

Funnily, in order to use this sort of model for tasks involving speech recognition, the literature often recommends mixing some of the original noisy audio back in. This reduces the impact of artifacts introduced by the enhancement, which would otherwise reduce ASR quality due to domain shift in the data.
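For illustration, here's a minimal sketch of that remix step in Python, assuming two mono float arrays at the same sample rate that are already time-aligned (the function name and the 80/20 ratio are mine, not from any paper):

    import numpy as np

    def remix(enhanced, original, wet=0.8):
        # Blend the enhanced track with the raw recording. Keeping some of
        # the original signal (and its noise floor) masks enhancement
        # artifacts, for both human listeners and ASR systems.
        return wet * enhanced + (1.0 - wet) * original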

Guess humans and computers have similar needs in this case. :)

These are impressive results, the audio mostly sounds like you gave the guy a lapel mic. :P


As a human I had a hard time understanding that audio


Honestly you might have had a hard time understanding that speech in person


Having an unlisted version on YouTube from before the audio was filtered would make it easier to hear the true difference, but I agree the quality is impressive as is.


I did so on the brewery tour:

https://youtu.be/M5pdHVoXQHE


Thanks! Honestly, I liked the post-filtering audio with no "natural audio" mixed back in. I get the reasoning, but still. Did you do testing with anyone to gauge viewer preference? If so, how?


This filter is not perfect so you'll have a few audio artifacts that you won't want in your final mix. Mixing with the original audio hides these artifacts somewhat. Using this filter also increases your audio mixing effort (you're mixing in a second vocal track) so it's easiest to mix the 2 audio tracks at constant levels then adjust when you hit an artifact. This is my personal preference on this style of audio mixing, and others who watched (and gave feedback) couldn't tell it was enhanced and had no difficulty understanding the speaker.


To push it to the limit I recorded exactly the same recording with my phone microphone and my AT875R XLR shotgun. I did this because my phone microphone is poor and picks up a lot of echo. Results are as follows:

- If the microphone quality itself is bad, the enhanced audio is still pretty horrible.

- It does clean up echo, but there's some pretty aggressive EQ that doesn't sound nice, and the noise gate is pretty severe.

- Compared to my XLR shotgun, the quality of the phone was pretty horrible.

What we can conclude is that if you already have a good recording but with some problems, you might be able to use this to remove those problems. However, don't expect a crappy microphone to turn into a good microphone, or a crappy recording to turn into "studio quality".

The bottom line is that there's no substitute for a decent microphone in a decent space. (At the very minimum, small room without echo.)


This is just the beginning.

Look at all the money that's been going into making phones compelling substitutes for professional cameras. And camera tech is still evolving. Consumer audio devices will get the same attention and investment.

We'll eventually have models for audio signals in all sorts of distorted and noisy environments. I'd bet that in ten years a cellphone microphone will be able to duplicate a professional audio setup in 90% of circumstances.


Out of curiosity what phone do you have?


OP should clean both microphones on their phone; usually a sewing needle or thin toothpick does the trick, but 99% isopropyl alcohol on a toothbrush might be needed afterwards if the grille inside that protects the microphone is also clogged up.

Both microphones need to be cleaned of any blockages so the hardware echo and noise cancellation on a given phone works well. Otherwise you've got distorted audio getting processed as if it's not distorted...


Important question, because a microphone array (which exists on some phones, and in things like home voice assistants) can be steered into the equivalent of a shotgun mic's pattern, or something even more focused. It's just that an algorithm aims it toward the strongest signal, instead of the user aiming a hypercardioid mic manually. Either way, this is what reduces the ratio of reverberant room sound ("echo").
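For anyone curious what "steering" means concretely, here's a toy delay-and-sum beamformer for a uniform linear array; the function and its parameters are illustrative, not any phone vendor's actual pipeline:

    import numpy as np

    def delay_and_sum(mics, sample_rate, spacing_m, angle_deg, c=343.0):
        # mics: shape (n_mics, n_samples), one row per microphone.
        # Signals arriving from the steered direction add coherently;
        # off-axis sound (including room reverb) partially cancels.
        n_mics, n_samples = mics.shape
        out = np.zeros(n_samples)
        for i in range(n_mics):
            # Extra distance sound travels to mic i relative to mic 0.
            delay_s = i * spacing_m * np.sin(np.radians(angle_deg)) / c
            shift = int(round(delay_s * sample_rate))
            # Undo the delay, then sum (np.roll wraps at the edges;
            # fine for a toy example).
            out += np.roll(mics[i], -shift)
        return out / n_mics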


> At the very minimum, small room without echo

Why small room? Does that reduce echo?

In theory, would a gigantic room, with miles to the nearest wall, be even better?


>In theory, would a gigantic room that was miles to the nearest wall be even better?

Yes, as effectively it's open space. In practice though, to record in high quality you would rather build an anechoic environment, as small as possible (preferably a booth).


Small spaces bring hard surfaces closer to the mic, making flutter echoes and room-mode resonances louder, creating murky-sounding bass pooling, and leaving the sound overly sensitive to mic position within the space; you can sound oddly different without warning.

You need to absorb the sound of your voice, so there is less echo, with baffling material behind the mic, and absorb room tone and echoes in the area the directional microphone is pointed at, generally behind your head. Small spaces have only disadvantages as studios.

To take advantage of a reach-in closet full of clothes, put some pillows on the shelf over the clothes, take the closet doors off, and back into the closet as much as you can. In this way the microphone is primarily listening to the baffled sound inside the closet, and you can avoid bass pooling by speaking into the room—ideally with baffling material (e.g. see http://PillowFortStudios.com/ ) ON the back of a LDC microphone.


"as small as possible (preferably a booth)."

This is where the musicians' equivalent of the Startup Garage cliché comes from: recorded in the closet.


Sound takes less time to travel in smaller rooms, so the delay between the original sound wave and the reflected one is small, which makes them harder for human ears to distinguish.
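A quick back-of-the-envelope check of that claim (the distances are just illustrative):

    SPEED_OF_SOUND = 343.0  # m/s at room temperature

    def first_reflection_delay_ms(distance_to_wall_m):
        # The reflection travels to the wall and back.
        return 2 * distance_to_wall_m / SPEED_OF_SOUND * 1000

    print(first_reflection_delay_ms(2))   # ~11.7 ms in a small room
    print(first_reflection_delay_ms(15))  # ~87.5 ms in a large hall

Reflections arriving within roughly 30 ms tend to fuse with the direct sound (the precedence effect), while longer delays are heard as distinct echoes.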


Beware of the wind in such a room

(and of the rain too)


> don't expect a crappy microphone to turn into a good microphone

It's the same as with photos. If your raw material is bad, no tool on earth can make it good.


The tools can make stuff up and produce something, may I say "good", from it, but it won't be the same content :)


StableDiffusion+img2img may beg to differ :)


How about running it several times, will that make it better or worse?


That could be a fun exercise, I'd imagine you might end up with something like "I Am Sitting In A Room" [0][1]

[0] https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room

[1] https://www.youtube.com/watch?v=fAxHlLK3Oyk


Shameless plug, but I've built similar software as an indie hacker. I think it's similar in quality to Adobe's.

https://www.audostudio.com/


I was suspicious of your claim, but this seems very good. Your first example is way more extreme than Adobe's (a restaurant with background noise as loud as your voice??), but the output is ridiculously clear.

I would definitely say this is "similar quality" to Adobe's. Nice work.


This isn't meant as a criticism of your work, just of your pricing model, in that it reflects the industry-wide creep of subscription-based services into narrower and narrower niches.

I've checked several similar services, and it really frustrates me that all of them price by the hour but bill as a monthly plan, when in actual fact the real cost is $/hour of processed data. These plans inevitably end up as a series of x hours per month + some features, plus all features of the previous plan tier… with many companies using larger hour requirements as a way to force you to pay more per month.
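To make that concrete with made-up numbers, the effective rate is just the monthly price divided by the hours you actually process, which tiered plans tend to obscure:

    def effective_rate(monthly_price, hours_processed):
        # Real cost in $/hour of processed audio (hypothetical plan numbers).
        return monthly_price / hours_processed

    print(effective_rate(30, 10))  # $3.00/hour if the 10 included hours are fully used
    print(effective_rate(30, 2))   # $15.00/hour in a slow month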

Also, why the hell is it so hard to find the equivalent of this technology for real-time use, as in streaming my microphone audio live? I would happily pay $100–$250 for an AudioUnit plugin (or other equivalent audio pipeline plugin format) for this kind of real-time voice cleanup… but it doesn't exist. So can you help explain why? Since you've built something like this, I'm hoping you have more insight into why it's harder for real-time processing…


https://krisp.ai might work for you?


I tested a few recordings of an Indian Swami giving speeches in English back in the '70s. The recording had a lot of background noise; not great. You have to listen very carefully to hear what is being said. I was hoping for good results, but...

Results:

- background noise was reduced

- some previously clear words are turned into garbled non-words

- some parts are replaced by a different Indian voice, I assume AI, so it sounds like multiple people talking

All in all, the results are not anywhere near what the sample shows.


Time is a flat circle. Remember when Xerox copiers would randomly replace digits with different digits?

https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...


Good news: This fancy new compression reduces the error rate from 5% to 1% for the same file size

Bad news: While the 5% was a minor inconvenience for customers, the 1% is bad enough to end your company


Well, that’s a nightmare I never knew existed…


Epic talk (in German): David Kriesel, "Traue keinem Scan, den du nicht selbst gefälscht hast" ("Don't trust any scan you didn't fake yourself"):

https://www.youtube.com/watch?v=7FeqF1-Z1g0


Watched it twice! It's very well made and funny!


They should've added English subtitles by now.



I love how they say "yeah, character substitution is a known issue". How was that ever fine?


Close the ticket. My performance review scorecard doesn't include "known issues"


Hi there, I have made a free open-source tool that does better. Care to check it out? https://github.com/shahules786/mayavoz


This is very neat! Have you done any profiling on what codecs and sample rates this performs best on?

Just curious how the performance differs between PCMU @ 8 kHz and Opus @ 48 kHz, or IMBE and AMBE+2 (Project 25 public safety audio codecs) :D

My dream would be doing audio processing in real time to clean up the audio of phone calls


Hi, most models perform best at 16 kHz. The current architectures do not support real-time speech enhancement, but I plan to add that in the future.
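So if your source audio is at 44.1 or 48 kHz, you would typically downsample before feeding a 16 kHz model; a minimal sketch using scipy (not mayavoz's own API, which may handle this internally):

    from math import gcd
    from scipy.signal import resample_poly

    def to_16k(audio, orig_rate):
        # resample_poly needs an integer up/down ratio,
        # e.g. 48000 -> 16000 reduces to 1/3.
        g = gcd(16000, orig_rate)
        return resample_poly(audio, 16000 // g, orig_rate // g)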


Honestly, it sounds like you're judging on a pretty big outlier example. The sample seems to be aimed more at background noise, and even that sample is extremely easy to understand without the enhancement. There are a bunch of tools out there that are probably better aimed at your goals.


In time, these tools will gain control knobs and eventually start to focus on longer tail audio recovery tasks. I have hope for our old audio. Where there's signal, there's a way.


One interesting thing I realized after experimenting with this today is that if you upload an MP3 (as opposed to a WAV), your audio won't sync correctly afterward. For example, I found the effect worked too well at certain points, so I tried to blend it with the original audio, but then it sounded as if the phase was off. Uploading a WAV resolved this issue for me.
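That offset is consistent with the padding MP3 encoders add at the start of the stream. If you're stuck with an MP3 source, one workaround is to estimate the lag by cross-correlation before blending; a sketch assuming mono numpy arrays at the same sample rate:

    import numpy as np
    from scipy.signal import correlate

    def align(enhanced, original):
        # Find the lag (in samples) at which the two tracks line up best.
        corr = correlate(enhanced, original, mode="full")
        lag = np.argmax(corr) - (len(original) - 1)
        return np.roll(enhanced, -lag)  # shift enhanced back into alignment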


I tried this on a couple of speeches that were recorded in front of an artificial waterfall, and the output is not just bad - it’s not the English language. Nor any language on earth that I’m aware of. Haha

The tool that now comes with the latest updates to FCPX handled them without a problem. (Still some background noise, but you can clearly hear every word.) I think Adobe has a long way to go on this.


After seeing this this AM, I used it in a proper professional mix session for an ad we were finishing today. Sit-down interview shot with an iPhone (don't ask). I output stems of each subject's dialog, uploaded them separately, laid them back in, and mixed a little bit of room tone back into it.

Gobsmacked all around. Mine, that it did better in 10 seconds than I think nearly anyone could have done in 10 hours. Theirs, that they thought I performed a literal miracle.

Life comes at you pretty fast sometimes.


Can it remove the professor's coughing? I have recordings of some interesting lectures, recorded by a professor who had covid or something and coughs every few words.

Many other professors (as well as me, occasionally) also involuntarily (and often unawares) say something like "eeeeeh" when they struggle to recall the right word. It would be great if this could be removed as well.


I don't know their roadmap, but if anything I'd point you to a tool called Descript, which is great at this. You can get transcriptions and make edits based on the text (e.g. coughs, ums, etc.). Descript.app (I am not affiliated with them, just a fan)


Thank you very much. I never knew anything like this already existed; I only wondered why there wasn't one, given all the advances in AI.


Interesting "automagic" tool for audio post targeted towards hobbyists and creators. Anyone able to compare this with the entry level[1] version of the professional standard dialog cleanup software, RX?

[1] https://www.izotope.com/en/shop/rx-10-elements.html


Programmed to hallucinate from every sound. Goes "du du duuuu du du" with background music.


The demo is pretty disappointing in that it hitches whenever you flip the switch. They should have invested a little more in making it seamlessly switch.


One of the good things about OpenAI is that their demos are great on a mobile phone. This one is totally broken on iOS.


Audio demo works on iOS for me (safari and chrome). I think the visual is cut off tho


and on Android


I tried uploading my favorite recorded audio file (MP3), from a Japanese program made in the mid-1980s about an ancient burial mound in Japan. A famous Japanese archaeologist, the late Koichi Mori (a professor at Doshisya University at that time), talks about the Hashihaka mound, but he talks in ... in Spanish?


> I tried uploading my favorite recorded audio file (MP3), from a Japanese program made in the mid-1980s about an ancient burial mound in Japan. A famous Japanese archaeologist, the late Koichi Mori (a professor at Doshisya University at that time), talks about the Hashihaka mound, but he talks in ... in Spanish?

I'm not sure I understand what you're saying. Do you mean that the talk was in Japanese, but the output of this service somehow screwed it up in a way that it sounds like Spanish?

Or are you just mentioning that the thing you uploaded was Spanish, and not describing the quality of the output?


I uploaded a sound file in which the professor talks in Japanese, and the enhanced file sounds like a speech in Spanish.


I wish YouTube had such an option, along with sound normalization and perhaps compression. So many videos, lectures, and chats have bad sound, and it really becomes an issue for people who don't have the best hearing.

It would also be great if used in Zoom recordings of podcasts.


I experienced something absolutely bizarre with this, making me want to try and reupload to see if the same thing occurs. I had some footage laying around that was taken on a windy day, with buses and wind and kids screaming, and for the most part it was greatly improved.

However, a few seconds into my recording there is a part where there is someone else's dialogue for a few seconds. I can't make out what they're saying but it definitely sounds like a man, with a Latin-American accent, speaking English for a second.

Could that be a hallucination, or did they somehow mix in audio from another recording? It only lasts for about a second, but it's so strange.


My guess would be that it fixates on the most dominant source available and mutes the other factors. It probably favors human voices over other ambient noise, therefore singling the man out.

It will get really freaky when there's ambient noise resembling a human voice. I'm thinking of the bear scene from the movie Annihilation.


One should take an STT transcription of the raw and modified media streams and do a diff to find unintended modifications.
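That's easy to automate with a word-level diff. A sketch where transcribe() is a stand-in for whatever STT engine you use (it is not a real library call):

    import difflib

    def find_unintended_edits(raw_words, enhanced_words):
        # Print spans where the enhanced transcript diverges from the raw one.
        matcher = difflib.SequenceMatcher(None, raw_words, enhanced_words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                print(op, " ".join(raw_words[i1:i2]), "->",
                      " ".join(enhanced_words[j1:j2]))

    # Usage (transcribe() is hypothetical):
    # find_unintended_edits(transcribe("raw.wav").split(),
    #                       transcribe("enhanced.wav").split())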


I tried reuploading, and again the exact same thing happened, which is interesting because it seems it's producing audio fairly deterministically. That's not how I think of most AI-produced results, but I'm not an expert.


A lot of AI-produced results can be deterministic if you want them to be. For Stable Diffusion, just set the seed of the initial noise, and it's deterministic. With GPT-3, set the temperature to 0 (always choose the highest-probability word).

It can be a useful technique for learning how slightly different prompts affect things
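A toy sketch of both knobs (generic sampling logic, not any particular model's API):

    import numpy as np

    def sample_token(logits, temperature, rng):
        if temperature == 0.0:
            # Greedy decoding: always the highest-probability token,
            # so repeated runs produce identical output.
            return int(np.argmax(logits))
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    # Fixing the seed makes even temperature > 0 sampling reproducible:
    rng = np.random.default_rng(seed=42)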


Whatever happened to Voco, their 'photoshop for voice'? I'm not a frequent HN poster, so I don't know how to 'properly' cite the presentation, but I remember them showing it off years ago and... nothing came of it.


To find old things like that, use the search box at the bottom of the page: https://hn.algolia.com/?q=adobe+voco. Big splash 6 years ago, then nothing.

It seems Adobe never advanced Voco past the research project stage ([1] via [2]). I'm guessing they had trouble getting it to work reliably on a wide-enough range of real-world audio.

[1] https://community.adobe.com/t5/audition-discussions/beta-tes...

[2] https://en.wikipedia.org/wiki/Adobe_Voco


I imagine it got shut down, either by their own executives/ethics department or by outside pressure. It was announced the same year the term "fake news" took off. Not the best time to get the world excited about a "photoshop for voice".


That would be the first time for a big company to make a morally sound decision, though.


I've been using this as one of several tools (noise gate in Audacity, The Levelator) to increase the audio quality in my podcast. Subjectively, I think it's been working, and I love the simplicity of the interface. No tuning, just trust the AI. It works well for standard spoken English, but will do some horrible things to music (it won't detect music and be like "hey don't do anything to this"). So you shouldn't run it on a file that includes both music and voice.


This is really cool, I wish this was shipped as an AU plugin I could use with Audio Hijack and Loopback to process audio for video conferences. I already use a pretty detailed filter-chain that greatly improves audio quality, but this would give more of a "radio voice" quality compared to my current filter chain. I've found that improving audio quality has a marked difference in how people respond to my proposals when I present them, working remotely.


There is an open-source tool that does the same https://github.com/shahules786/mayavoz


Plugging is okay, but you should always disclose if you are involved (which appears to be the case here based on your username).


Their username is in the repository name, that seems pretty clear to me...


Thanks for pointing that out; I'm not a regular user of HN. I am myself the creator of Mayavoz.


Is there a way to train it on your own voice to isolate your own voice but not distort it too much?


I've just tried it but it seems the Adobe one works better and outputs a clearer voice.


How does this stack up against something like Auphonic? https://auphonic.com/


Does anyone have a recommendation for how to 'upsample' recordings like these? [0] They're ripped from old tapes, with heavy noise reduction, and compressed.

[0] https://fsi-languages.yojik.eu/languages/FSI/fsi-french-basi...


Try descript.com - their tool called "Studio Sound" is quite incredible.


How close are we to AI understanding your voice, then simply re-creating it with perfect clarity?

I feel like we’re real close


> How close are we to AI understanding your voice, then simply re-creating it with perfect clarity?

People do this regularly today using Descript. https://www.descript.com/overdub

Through answering this I also found Respeecher, which seems interesting too. https://www.respeecher.com/


Oh! Not only are we past close, but it’s a polished commercial product. AI is stunning. Thank you for the links.


Adobe had this some time ago: https://www.youtube.com/watch?v=I3l4XLZ59iw It was discontinued.


For folks that would like something like this live during video conferencing, check out krisp.ai.


I was expecting someone to have mentioned Descript's Studio Sound audio enhancement filter.


Does anybody know how this works? As somebody with a serious interest in signal processing, machine learning, and audio, I'm genuinely interested. Have Adobe published anything about the technology behind this?


I seem to recall a demo where a trumpet sound was plugged into an aggressive high quality speech denoiser, making it sound like a speaker. Does anyone remember this demo, or have similar links on creative use of this tech?


Nvidia has a similar thing called RTX Voice, but it's even more impressive and runs live while recording

https://www.youtube.com/watch?v=uWUHkCgslNE


It seems quite different, actually. While this filter tries to completely change the sound of the whole recording, RTX Voice isn't really improving the audio quality; it focuses only on removing noise. I think that's a much more sound approach, and it doesn't end up actually changing the language as in some of the examples below.


It's not usable for everyone, though. I've got a friend with a high-pitched voice, and RTX Voice totally butchers it if she's not careful with how she speaks. A giggle turns into a garbled mess, and so on.


How long until there's a way to process more than one hour at a time? After trying a sample, I cannot wait to run some old audiobooks through this. The results sound like the recordings were made twenty years later.


I have a lot of trouble understanding people on group calls. I wonder if something like this software could help me?


Free as in "free beer", I assume.


Most likely they'll use your recordings to enhance their own audio models.


It significantly improved the quality for me but also made many (english) words very hard to understand.


Not sure when we get this tech real-time and part of our video conferencing tools but … can't wait.


No need to wait. It's already built into all major conferencing platforms today (Teams, Webex, Meet etc).

If you'd like to run something locally, there's also https://www.nvidia.com/en-us/geforce/guides/nvidia-rtx-voice....


AMD and NVIDIA have already done that. There's also Krisp which you can pay to use anywhere or just use the free Discord integration. I'm sure there are others too.


I hope they add this to Adobe Premiere, it would be useful for cleaning up audio from talk recordings.


That would be convenient, but for now you could bounce the audio out, run it through the online tool and then bring it back in to Premiere.


I want to add this to zoom so I can hear my coworkers better.


You can use Nvidia Broadcast with a noise-filtered speaker. I use this for Zoom meetings and it works wonders.


I use an M1 Mac and an iPhone for my Zoom calls :-/ AFAIK, that requires an Nvidia chip.


I use Adobe Audition for podcast editing and they have a pretty nice feature built in for that already - not sure how this differs. I am also not a big fan of edit-by-transcription, since I do like to remove occasional long silences, weird sounds, etc. That being said the mic check looks pretty useful.


I uploaded the raw audio of myself, and in the beginning I'm talking to myself while off mic. Adobe turned it into some weird non-English thing.


Anyone know if this is integrated into Premiere yet?


Waves Clarity Vx is comparably good and costs 29.95 USD at one of the frequent sales (such as now). All you need is a VST host, which you're probably using already anyway.


I need this on my guild discord server



