This is incredible. I took a talk recording I made for NeurIPS and passed it through this tool. The improvement in audio quality was night and day. It went from clearly being recorded in a bedroom to a studio-like experience.
The result is nowhere near studio quality. It's highly compressed and full of artifacts. I found the original to be more pleasant to listen to. It at least sounds natural.
I was very excited about this, and then I tried it. I gave a talk at a local meetup 8 years ago [0] that wasn't mic'd properly, but I uploaded it anyways. It was never the most pleasant thing to listen to, but the audio was understandable enough that for years it was a top search result for "javascript promises" on YouTube and accumulated 39K views.
However, when I ripped that video's audio and put it through the linked AI filter (Adobe Podcast's Enhance Speech), it went from being unpleasant-but-understandable to perfectly clear gibberish [1]. For example, about 15 seconds in I said:
> [...] at Second Street, the company whose office you are currently sitting in. Uh, you can find me on the web on GitHub, on Twitter, or my own site at Kerrick Long Dot Com.
In the "enhanced" version, it sounds like a person who isn't me said the following in soundproofed studio (with a head cold):
> [...] at Second Street, the cubbany od office you are currently city-aly. You gidhi me on the web on GitDub, on Twitter, or my own side at Kerri Flow Dot Cowf.
This is a 20-year-old video shot at Pearl Harbor on the deck of the USS Missouri with a low-end consumer Hi-8 camera using its built-in mics. Notice the flags and clothes rippling in the wind. This filter works pretty well in this case. Maybe it was trained on a New York accent?
Here's another video, shot at a brewery, where this filter helped clarify the brewmaster's voice over a noisy restaurant and outdoor machinery (again, built-in mic). Mixing in the enhanced track as needed works well. Using only the enhanced track may sound artificial at times.
> Mixing in the enhanced track as needed works well. Using only the enhanced track may sound artificial at times.
Funnily enough, when using this sort of model for speech-recognition tasks, the literature often recommends mixing some of the original noisy audio back in. This reduces the impact of artifacts introduced by the enhancement, which would otherwise hurt ASR quality due to domain shift in the data.
Guess humans and computers have similar needs in this case. :)
These are impressive results; the audio mostly sounds like you gave the guy a lapel mic. :P
Having an unlisted version on YouTube from before the audio was filtered would make it easier to hear the true difference, but I agree the quality is impressive as is.
Thanks! Honestly, I liked the post-filtering audio with no "natural audio" mixed back in. I get the reasoning, but still. Did you do testing with anyone to gauge viewer preference? If so, how?
This filter is not perfect, so you'll have a few audio artifacts that you won't want in your final mix. Mixing with the original audio hides these artifacts somewhat. Using this filter also increases your audio-mixing effort (you're mixing in a second vocal track), so it's easiest to mix the two tracks at constant levels and then adjust when you hit an artifact. This is my personal preference for this style of audio mixing, and others who watched (and gave feedback) couldn't tell it was enhanced and had no difficulty understanding the speaker.
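If you want to do that blend outside a DAW, here's a minimal sketch of the constant-level mix (assuming both files share a sample rate and channel layout and are already time-aligned; the filenames and the 80/20 ratio are just placeholders to adjust by ear):

```python
# Blend an AI-enhanced vocal track with the original recording at a fixed ratio.
# Assumes both files have the same sample rate/channels and are time-aligned.
import numpy as np
import soundfile as sf

enhanced, sr = sf.read("enhanced.wav", dtype="float64")
original, sr_orig = sf.read("original.wav", dtype="float64")
assert sr == sr_orig, "sample rates must match before mixing"

# Trim to the shorter length so the arrays line up sample-for-sample.
n = min(len(enhanced), len(original))
mix = 0.8 * enhanced[:n] + 0.2 * original[:n]  # constant levels; adjust to taste

# Guard against clipping, then write the blended track.
mix /= max(1.0, np.max(np.abs(mix)))
sf.write("blended.wav", mix, sr)
```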
To push it to the limit, I recorded exactly the same thing with my phone microphone and my AT875R XLR shotgun mic. I did this because my phone microphone is poor and picks up a lot of echo. Results are as follows:
- If the microphone quality itself is bad, the enhanced audio is still pretty horrible.
- It does clean up echo, but there's some pretty aggressive EQ that doesn't sound nice, and the noise gate is pretty severe
- Compared to my XLR shotgun, the quality of the phone was pretty horrible
What we can conclude is that if you already have a good recording but with some problems, you might be able to use this to remove those problems. However, don't expect a crappy microphone to turn into a good microphone, or a crappy recording to turn into "studio quality".
The bottom line is that there's no substitute for a decent microphone in a decent space. (At the very minimum, small room without echo.)
Look at all the money that's been going into making phones compelling substitutes for professional cameras. And camera tech is still evolving. Consumer audio devices will get the same attention and investment.
We'll eventually have models for audio signals in all sorts of distorted and noisy environments. I'd bet that in ten years a cellphone microphone can duplicate a professional audio setup in 90% of circumstances.
OP should clean both microphones on their phone; usually a sewing needle or a thin toothpick can do the trick, but 99% isopropyl alcohol on a toothbrush might be needed afterwards if the grill inside that protects the microphone is also clogged up.
Both microphones need to be cleaned of any blockages so the hardware echo and noise cancellation on a given phone works well. Otherwise you've got distorted audio getting processed as if it's not distorted...
Important question because a microphone array (which exists on some phones, and things like home voice assistants) can be steered into the equivalent of a shotgun mic's pattern, or even more focused than that. It's just that an algorithm aims it toward the strongest signal, instead of the user aiming a hypercardioid mic manually. Either way, this is what reduces the ratio of reverberant room sound ("echo").
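For anyone curious what "steering" means concretely, here is a toy delay-and-sum sketch (made-up linear-array geometry and integer-sample delays; real phone and voice-assistant firmware uses far more sophisticated adaptive beamformers):

```python
# Toy delay-and-sum beamformer: delay each mic's signal so sound arriving
# from the chosen direction lines up, then average. Off-axis sound stays
# misaligned and partially cancels, which narrows the pickup pattern.
import numpy as np

C = 343.0  # speed of sound, m/s

def delay_and_sum(signals, mic_x, angle_deg, sr):
    """signals: (n_mics, n_samples); mic_x: mic positions along one axis, in metres."""
    angle = np.deg2rad(angle_deg)
    # Arrival-time differences for a plane wave coming from angle_deg.
    delays = np.asarray(mic_x) * np.sin(angle) / C
    delays -= delays.min()                      # make all delays non-negative
    shifts = np.round(delays * sr).astype(int)  # integer-sample approximation
    out = np.zeros(signals.shape[1])
    for sig, s in zip(signals, shifts):
        out += np.roll(sig, -s)                 # advance each channel into alignment
    return out / len(signals)
```

The point is just that aligning and summing the channels reinforces sound from the chosen direction while everything off-axis partially cancels, much like a shotgun mic's pattern but aimed in software.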
> In theory, would a gigantic room that was miles to the nearest wall be even better?
Yes, as effectively it's open space. In practice though, to record in high quality you would rather build an anechoic environment, as small as possible (preferably a booth).
Small spaces bring hard surfaces closer to the mic, which makes flutter echoes and room-mode resonances louder, creates murky-sounding bass pooling, and makes the result overly sensitive to mic position within the space; you can sound oddly different without warning.
You need to absorb the sound of your voice so there is less echo, with baffling material behind the mic, and absorb room tone and echoes in the area the directional microphone is pointed at, generally behind your head. Small spaces have only disadvantages as studios.
To take advantage of a reach-in closet full of clothes, put some pillows on the shelf above the clothes, take the closet doors off, and back into the closet as far as you can. That way the microphone is primarily picking up the baffled sound inside the closet, and you can avoid bass pooling by speaking out into the room, ideally with baffling material (e.g. see http://PillowFortStudios.com/ ) ON the back of an LDC microphone.
Sound takes less time to travel in smaller rooms, so the delay between the original sound wave and the reflected one is small, which makes the two harder for human ears to distinguish.
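Rough numbers, assuming sound at about 343 m/s and a single first reflection off the nearest surface:

```python
# Extra path length for a first reflection off a surface `d` metres away is
# roughly 2*d, so the echo arrives about 2*d/343 seconds after the direct sound.
SPEED_OF_SOUND = 343.0  # m/s at room temperature

for d in (0.5, 1.0, 5.0, 20.0):  # metres to the nearest reflective surface
    delay_ms = 2 * d / SPEED_OF_SOUND * 1000
    print(f"{d:5.1f} m -> reflection arrives ~{delay_ms:5.1f} ms late")

# ~0.5-1 m (small room): a few ms, fused with the direct sound as colouration.
# ~20 m (huge room): over 100 ms, heard as a distinct (but much quieter) echo.
```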
I was suspicious of your claim, but this seems very good. Your first example is way more extreme than Adobe's (a restaurant with background noise as loud as your voice??), but the output is ridiculously clear.
I would definitely say this is "similar quality" to adobe's. Nice work.
This isn’t meant as a criticism of your work, just your pricing model in that it reflects the industry wide creep of subscription based services into more and more narrow niches.
I’ve checked several similar services and it really frustrates me that all of them price by the hour but bill as a monthly plan, when in actual fact the real cost is $/hour of processed data. These plans inevitably end up as a series of X hours per month + some features, plus all the features of the previous plan tier… with many companies using larger hour requirements as a way to force you to pay more per month.
Also, why the hell is it so hard to find the equivalent of this technology for real time, as in streaming my microphone audio live? I would happily pay $100 to $250 for an AudioUnit plugin (or another equivalent audio-pipeline plugin format) for this kind of real-time voice cleanup… but it doesn’t exist. So can you help explain why? Since you’ve built something like this, I’m hoping you have more insight into why real-time processing is harder…
I tested a few recordings of an Indian Swami giving speeches in English back in the '70s. The recordings have a lot of background noise and aren't great; you have to listen very carefully to hear what is being said. I was hoping for good results, but...
Results:
- background noise was reduced
- some previously clear words are turned into garbled non-words
- some parts are replaced by a different Indian voice, I assume AI, so it sounds like multiple people talking
All in all, the results are not anywhere near what the sample shows.
Honestly, it sounds like you're judging on a pretty big outlier example. The sample seems to more be aimed at background noise and even that sample is extremely easy to understand without the enhancement. There are a bunch of tools out there that are probably better aimed at your goals.
In time, these tools will gain control knobs and eventually start to focus on longer tail audio recovery tasks. I have hope for our old audio. Where there's signal, there's a way.
One interesting thing I realized after experimenting with this today is that if you upload an mp3 (as opposed to a wav), your audio won't really sync correctly afterward. For example, I found the effect to work too well at certain points, so I tried to blend it with the original audio, but then it sounded as if the phase was off. Uploading a wav resolved this issue for me.
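That lines up with MP3 encoder/decoder padding typically shifting the decoded audio by a small, fixed number of samples. If you're stuck with an MP3 source, one workaround is to estimate the offset by cross-correlation before blending (a rough sketch, assuming mono files at the same sample rate; the filenames are placeholders):

```python
# Estimate the sample offset between the original and enhanced tracks by
# cross-correlating a short window, then shift the enhanced track to match.
import numpy as np
import soundfile as sf
from scipy.signal import correlate

orig, sr = sf.read("original.wav")
enh, _ = sf.read("enhanced_from_mp3.wav")

win = min(len(orig), len(enh), sr * 10)          # compare the first ~10 seconds
corr = correlate(enh[:win], orig[:win], mode="full")
lag = np.argmax(corr) - (win - 1)                # positive lag: enhanced runs late

aligned = np.roll(enh, -lag)                     # crude shift; zero the wrap-around
if lag > 0:
    aligned[-lag:] = 0
elif lag < 0:
    aligned[:-lag] = 0
sf.write("enhanced_aligned.wav", aligned, sr)
```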
I tried this on a couple of speeches that were recorded in front of an artificial waterfall, and the output is not just bad - it’s not the English language. Nor any language on earth that I’m aware of. Haha
The tool that now comes with the latest updates to FCPX handled them without a problem. (Still some background noise, but you can clearly hear every word.) I think Adobe has a long way to go on this.
After seeing this this morning, I used it in a proper professional mix session for an ad we were finishing today. Sit-down interview shot with an iPhone (don't ask). I output stems of each subject's dialog, uploaded them separately, laid them back in, and mixed a little bit of room tone back into it.
Gobsmack all around. Mine, that it did better in 10 seconds than I think nearly anyone could have done in 10 hours. Theirs, that they thought I performed a literal miracle.
Can it remove the professor's coughing? I have recordings of some interesting lectures, recorded by a professor who had covid or something and coughed every few words.
Many other professors (as well as me, occasionally) also involuntarily (and often without being aware of it) say something like "eeeeeh" when they struggle to recall the right word. It would be great if this could be removed as well.
I don’t know their roadmap, but if anything I’d point you to a tool called Descript, which is great at this. You can get transcriptions and make edits based on the text (e.g. coughs, ums, etc.). Descript.app (I am not affiliated with them, just a fan)
Interesting "automagic" tool for audio post targeted towards hobbyists and creators. Anyone able to compare this with the entry level[1] version of the professional standard dialog cleanup software, RX?
The demo is pretty disappointing in that it hitches whenever you flip the switch. They should have invested a little more in making it seamlessly switch.
I tried uploading my favorite recorded audio file (MP3), from a Japanese program made in the mid-1980s about an ancient burial mound (grave) in Japan. A famous Japanese archaeologist, the late Koichi Mori (a professor at Doshisya University at the time), talks about the Hashihaka mound, but he talks in ... Spanish?
> I tried uploading my favorite recorded audio file (MP3), from a Japanese program made in the mid-1980s about an ancient burial mound (grave) in Japan. A famous Japanese archaeologist, the late Koichi Mori (a professor at Doshisya University at the time), talks about the Hashihaka mound, but he talks in ... Spanish?
I'm not sure I understand what you're saying. Do you mean that the talk was in Japanese but the output of this service somehow screwed it up in a way that it sounds like spanish?
Or are you just mentioning that the thing you uploaded was spanish and not describing the quality of the output?
I wish YouTube had such an option, along with sound normalization and perhaps compression. So many videos, lectures, and chats have bad sound, and it really becomes an issue for people who don't have the best hearing.
Also would be great if used in Zoom recordings of podcasts
I experienced something absolutely bizarre with this, making me want to try and reupload to see if the same thing occurs. I had some footage laying around that was taken on a windy day, with buses and wind and kids screaming, and for the most part it was greatly improved.
However, a few seconds into my recording there is a part where there is someone else's dialogue for a few seconds. I can't make out what they're saying but it definitely sounds like a man, with a Latin-American accent, speaking English for a second.
Could that be a hallucination, or did they somehow mix in audio from another recording? It only lasts for about a second, but it's so strange.
My guess would be that it fixates on the most dominant source available and mutes the other factors. It probably favors human voices over other ambient noise, therefore singling the man out.
It will really get freaky when there's an ambient noise resembling a human voice. I'm thinking of the bear scene from the movie Annihilation.
I tried re-uploading and the exact same thing happened again, which is interesting because it seems it's producing audio fairly deterministically, which is not how I think most AI-produced results work, but I'm not an expert.
A lot of AI-produced results can be deterministic if you want them to be. For Stable Diffusion, just set the seed of the initial noise and it's deterministic. With GPT-3, set the temperature to 0 (always choose the highest-probability word).
It can be a useful technique for learning how slightly different prompts affect things.
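A toy illustration of both knobs (plain NumPy with stand-in logits, not any particular model's API): temperature 0 collapses to an argmax, and a fixed seed makes the sampling path repeatable.

```python
# Two ways to get deterministic output from a sampling-based model:
#   1) temperature -> 0: always pick the highest-probability token (argmax)
#   2) fixed random seed: the sampling itself becomes repeatable
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])  # stand-in for a model's next-token scores

def sample(logits, temperature, rng):
    if temperature == 0:
        return int(np.argmax(logits))             # greedy: same answer every run
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))  # stochastic, but seedable

print(sample(logits, 0.0, None))                       # deterministic by construction
print(sample(logits, 1.0, np.random.default_rng(42)))  # deterministic because seeded
```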
Whatever happened to Voco, their 'photoshop for voice'?
I'm not a frequent HN poster, so I don't know how to 'properly' cite the presentation, but I remember them showing it off years ago and... nothing came of it.
To find old things like that, use the search box at the bottom of the page: https://hn.algolia.com/?q=adobe+voco. Big splash 6 years ago, then nothing.
It seems Adobe never advanced Voco past the research project stage ([1] via [2]). I'm guessing they had trouble getting it to work reliably on a wide-enough range of real-world audio.
I imagine it got shut down, either by their own executives/ethics department or by outside pressure. It was announced in the same year that the term "fake news" took off. Not the best time to get the world excited about a "photoshop for voice".
I've been using this as one of several tools (noise gate in Audacity, The Levelator) to increase the audio quality in my podcast. Subjectively, I think it's been working, and I love the simplicity of the interface. No tuning, just trust the AI. It works well for standard spoken English, but will do some horrible things to music (it won't detect music and be like "hey don't do anything to this"). So you shouldn't run it on a file that includes both music and voice.
This is really cool, I wish this was shipped as an AU plugin I could use with Audio Hijack and Loopback to process audio for video conferences. I already use a pretty detailed filter-chain that greatly improves audio quality, but this would give more of a "radio voice" quality compared to my current filter chain. I've found that improving audio quality has a marked difference in how people respond to my proposals when I present them, working remotely.
Does anybody know how this works? As somebody with a serious interest in signal processing, machine learning, and audio, I'm genuinely interested. Have Adobe published anything about the technology behind this?
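As far as I can tell Adobe hasn't published details, and it's almost certainly a learned model that re-synthesizes speech rather than classic DSP (which would explain the hallucinated words people report here). For contrast, the old-school non-ML end of "speech enhancement" is mask-based spectral gating; here is a minimal sketch, with an arbitrary 0.5 s noise window and threshold:

```python
# Classic (non-ML) spectral gating: estimate the noise floor from a "silent"
# stretch, then attenuate STFT bins whose magnitude doesn't clear it.
# Modern tools instead use neural nets to predict the mask (or regenerate the
# speech outright), which is why they can invent words.
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

audio, sr = sf.read("noisy_speech.wav")          # assumed mono
f, t, spec = stft(audio, fs=sr, nperseg=1024)    # hop = 512 samples by default

noise_frames = spec[:, : int(0.5 * sr / 512)]    # first ~0.5 s assumed noise-only
noise_floor = np.mean(np.abs(noise_frames), axis=1, keepdims=True)

mask = np.clip((np.abs(spec) - 1.5 * noise_floor) / (np.abs(spec) + 1e-12), 0.0, 1.0)
_, cleaned = istft(spec * mask, fs=sr, nperseg=1024)
sf.write("denoised.wav", cleaned, sr)
```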
I seem to recall a demo where a trumpet sound was plugged into an aggressive high quality speech denoiser, making it sound like a speaker. Does anyone remember this demo, or have similar links on creative use of this tech?
It seems quite different, actually. While this filter tries to completely change the sound of the whole recording, RTX Voice isn't really improving the audio quality; it focuses only on removing noise. I think that's a much more sound approach, and it doesn't end up actually changing the language like in some of the examples below.
It's not usable for everyone, though. I have a friend with a high-pitched voice, and RTX Voice totally butchers it if she's not careful with how she speaks. And a giggle turns into a garbled mess, and so on.
How long until there's a way to process more than one hour at a time? After trying a sample, I can't wait to run some old audiobooks through this. The results sound like they were recorded twenty years later.
AMD and NVIDIA have already done that. There's also Krisp which you can pay to use anywhere or just use the free Discord integration. I'm sure there are others too.
I use Adobe Audition for podcast editing and they have a pretty nice feature built in for that already - not sure how this differs. I am also not a big fan of edit-by-transcription, since I do like to remove occasional long silences, weird sounds, etc. That being said the mic check looks pretty useful.
Waves Clarity Vx is comparably good and costs 29.95 USD at one of the frequent sales (such as now). All you need is a VST host, which you're probably using already anyway.
[0, original] https://youtu.be/gwkCIdwHRhc
[1, enhanced] https://youtu.be/RPnUqmSyZ6Q