I think it's plausible that it can roughly clone the timbre that my exact vocal and sinus geometry produces, then apply that to its internal representation of a "baseline human voice". However, I don't believe that it can capture exactly every nuance in articulation, emotional expression, and intonation. I simply can't imagine even a voice actor iterating through "the full range" in merely 15 seconds. And what about mannerisms, like pronouncing "you all" as "y'all"? Ending every sentence by affixing the "uh" sound to the last word-uh? Or how they sound a sigh or laugh?
In Dutch, for instance, there are three ways to pronounce the R, and everyone has a certain combination of when they use which kind.
A lot of people miss the forest for the trees. They can clone a voice using 15 seconds of audio. So how good will the clone be when they have a 10-minute, 30-minute, or larger sample? And it will probably keep getting better.
> And what about mannerisms, like pronouncing "you all" as "y'all"?
In the US this is highly determined by location, to the point where the New York Times even built a cool quiz [0] that can guess where you're from pretty accurately based on a few dozen vocab questions. So there's a dataset out there that would allow you to handle most of this variance by simply plugging in a ZIP code for the person you're spoofing!
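As a toy illustration (all of the region/vocab data below is made up, and `guess_region` is a hypothetical helper, not the quiz's actual method), the idea of mapping vocab choices to a likely region can be sketched as:

```python
# Hypothetical mapping of vocab questions to each region's typical answer.
REGION_USAGE = {
    "South":     {"you all": "y'all", "soft drink": "coke"},
    "Northeast": {"you all": "you guys", "soft drink": "soda"},
}

def guess_region(answers):
    """Score each region by how many of the speaker's answers
    match that region's typical usage; return the best match."""
    scores = {
        region: sum(answers.get(q) == a for q, a in usage.items())
        for region, usage in REGION_USAGE.items()
    }
    return max(scores, key=scores.get)

print(guess_region({"you all": "y'all", "soft drink": "coke"}))  # South
```

A real system would use many more questions and weighted probabilities per region, but the lookup-table shape of the problem is the same.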
It's often not necessary to sample all keys on a piano to make a passable piano sound generator, though that will have its own limitations.
A lot of these generators rely heavily on the base model and are worse at generating e.g. Dutch accented English and/or Dutch accented Dutch. But it works great for e.g. Californian accented English with unique voice.
I'll speculate "what is really happening" - Think of a voice like a musical instrument.
This model learned to be a Shazam for human voices instead of songs. And just like you can figure out by ear that a snippet of music was made on a Yamaha DX7, it seems 15 seconds is enough to narrow it down to a reasonably small set of vocoders that can recreate the given sample.
It feels like OpenAI is mostly concerned with developing proofs of the untrustability of every digital medium, creating a convincing case either for doing everything in person again, or necessitating cryptographic signatures on absolutely everything.
In some ways, this just feels like a reversion to the mean for humanity. “Don’t believe everything you read” is a maxim as old as print, and it’s good advice. If people ultimately learn that videos and recordings can’t be trusted, is that a huge loss? I’m not sure.
Of course, the road from here to there may be bumpy...
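For illustration, here is a minimal sketch of what "signatures on everything" could look like. A real provenance scheme (e.g. C2PA) would use asymmetric signatures and embedded metadata; this toy uses a standard-library HMAC with a made-up key, only to show the core property that any edit to a clip breaks verification:

```python
import hmac
import hashlib

KEY = b"publisher-secret-key"  # hypothetical key; real schemes use key pairs

def sign(audio_bytes: bytes) -> str:
    """Produce an authentication tag for a recording."""
    return hmac.new(KEY, audio_bytes, hashlib.sha256).hexdigest()

def verify(audio_bytes: bytes, tag: str) -> bool:
    """Check a recording against its tag in constant time."""
    return hmac.compare_digest(sign(audio_bytes), tag)

clip = b"\x00\x01fake-pcm-samples"
tag = sign(clip)
print(verify(clip, tag))            # untouched clip checks out
print(verify(clip + b"\xff", tag))  # any edit fails verification
```

The hard part, of course, is not the math but the social layer: distributing keys, deciding who counts as a trusted signer, and getting every capture device and platform to participate.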
> In some ways, this just feels like a reversion to the mean for humanity.
No it doesn't, this is ridiculous. All of you guys with your "this is just like the printing press" arguments are so intellectually lazy, think a little more about the problems of scale and speed and the way information spreads in the modern world.
In which cases would they have been trustworthy content anyway? You could easily take real videos out of context, which people were already doing to mislead.
Ultimately you should reason and investigate deeper to hold a somewhat accurate opinion on anything.
Most media and news out there already put the agenda first, then compile cherry-picked content from a wide variety of sources as evidence for that agenda. And you can already find content to support any agenda.
It is a simple algorithm of:
1. We want to prove X.
2. From millions of datapoints we pick the 100 that support X the most.
3. We write an article on it.
4. Most of our readers already agree with X so they will be happy about it. No need to go deeper.
5. Anyone who doesn't agree would probably not read our platform anyway.
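As a toy sketch of steps 1-2 above (function names and data are all hypothetical), the cherry-picking amounts to ranking by agreement and keeping the top slice:

```python
def pick_supporting_evidence(datapoints, supports_x, n=100):
    """Step 2 of the 'algorithm': from all datapoints, keep only
    the n that support the desired conclusion X most strongly."""
    ranked = sorted(datapoints, key=supports_x, reverse=True)
    return ranked[:n]

# Hypothetical example: datapoints are numbers, and "support for X"
# is just the value itself.
data = list(range(1000))
evidence = pick_supporting_evidence(data, supports_x=lambda d: d, n=100)
print(len(evidence), min(evidence))  # 100 points, all from the top of the range
```

The point of the sketch: nothing in the selected evidence is false, yet the sample is maximally unrepresentative, which is why the resulting article can be both "sourced" and misleading.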
The answer we see is that if you put AI-generated stuff in places people don't look at too carefully, they won't care because they won't notice. I'm not sure that's a great precedent.
Right, we're in a somewhat dangerous period where the general public doesn't understand the capabilities that exist and how they can be misused. But if you mentally fast-forward to a time when people do understand this and have even become blasé about it, what does that time look like? (Presumably, you don't wire money based on a phone call without kicking the tires a bit...)
> It feels like OpenAI is mostly concerned with developing proofs of the untrustability of every digital medium
Which, to me, makes sense. Once the underlying technology exists, a malicious actor would not think twice before developing tools of deception like these. It makes sense that OpenAI would work on this "in the open" to demonstrate that we now need to be skeptical of audio.
When I hear "ML-based generative tools pushing people out of their jobs is just a political problem", what I hear is "ClosedAI and Microsoft definitely had nothing to do with choosing not to license and compensate the original creators whose works they acquired to build their commercial products; it's all the government's fault".
Often the same people will cry over-regulation about ML tech rollout potentially being suppressed and controlled via policy, even though that is literally the implication of their position.
> OpenAI is mostly concerned with developing proofs of the untrustability of every digital medium, creating a convincing case either for doing everything in person again, or necessitating cryptographic signatures on absolutely everything.
Am I alone in thinking that this would not be a bad thing?
> When the product officially rolls out there will be a “no-go voice list” that detects and prevents AI-generated speakers that are too similar to prominent figures.
Should they then let anyone register their "voice ID" to prevent others from generating similar voices?
What if your voice happens to resemble the voice of a "prominent figure?"
It's insane that this protection exists for "prominent figures" but nobody else. The same damage that can be done to prominent figures can be done to regular people, they acknowledge the damage can be done, and yet they're still rolling this out? The more OpenAI does the more it's clear they don't give a single damn about the consequences of their technology.
Murdering a random civilian and murdering a public official are also very different levels of damage, but we still outlaw and prosecute both of them. This is the same thing.
Not to mention, what constitutes a political leader? Just the upper echelon? What about local civil servants? Mayors, cops, judges? Are they gonna have a database of every public or political figure in the country? No they won't. This is absurd.
Many jurisdictions actually do have higher penalties for crimes of violence committed against public officials than ordinary citizens. Assault a USPS postal worker at a post office and you automatically have a higher sentence than assaulting a UPS worker at a UPS store.
It’s a little tricky because if someone cloned your voice first they could pretend they were you and stop you from using “their” voice, then do nasty stuff with it. This way they can still do that, but they can’t prevent you from using your own voice unless you have to upload an ID with your voice or something. It’s much easier to enforce this on known celebrities or otherwise well known people than the general public.
It reminds me of the new EU law being implemented, in that it eliminates the use of AI for private companies but explicitly allows it for law enforcement. Both are methods of consolidating power. Safer to be open for everyone and unrestricted than to explicitly protect the few.
VoiceCraft, which was released open source a week or so ago, can do the same with 5-10 seconds of audio and is pretty convincing. It's pretty fun to play around with: https://github.com/jasonppy/VoiceCraft
Submitted multiple times in the past few days. Here's a link to the most upvoted one as of now:
https://news.ycombinator.com/item?id=39865340
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (jasonppy.github.io)
Altman and OAI – and their investors, more importantly – have already proven over decades and decades that they're in it for the money and at any cost to people including their own paying customers.
This generation is making the same can-kicking mistakes as their parents and nobody wants to admit it.
I am not sure why this is being developed. The potential for misuse is so much greater than any possible beneficial use cases I can imagine. Text to speech is cool, why do you have to introduce technologies that allow for voice cloning, which will inevitably be used (more than it already is) to commit fraud and hurt people?
Society is not ready for all of these "intelligences", heck, we can't even figure out what to do about all the drawbacks of social media and we've had it for decades.
Voice cloning has a good benefit-to-drawback ratio compared to many technologies. Threat actors with enough resources could always clone your voice. Anyone can hire voice actors. It costs more time and money, sure, but it was always possible.
Beneficial use cases:
* Accessibility: Give people that lost their voice their voice back.
* Personalized Digital Assistants: More humane voice.
* Voiceovers: Sick content creator can let the AI do the voice, filmmakers can fix mistakes in post and so on and so forth.
* Language Learning: Learn an accent by mimicking how the AI voices yourself in the target accent.
* Audiobook Narration: Make a good audiobook out of a textbook.
* Preservation of Cultural Heritage: Preserve an accent that's about to die out.
* Corporate Training: Believe it or not, a robotic voice sucks to listen to after a while.
* Interactive Gaming: Entire virtual DND campaign based on a prompt with AI generated world, story, and characters drawn and voiced by AI with quality control by AI.
Now compare that to guns, tanks, flamethrowers, chemical weapons, atomic bombs, hydrogen bombs, killer drones and bombers. Yeah uh... death. Made to kill people.
You are indeed missing something: you don't need machine powered instant voice "cloning" tech to have good artificial voices, which would do basically everything you're talking about.
Voice cloning has always been technically possible and so has fake videos but lowering the barrier of entry to voice cloning and deepfakes has very real implications for society that we're not remotely equipped to deal with.
Nothing is believable anymore. Stuff like this will only become more commonplace:
> I am not sure why this is being developed. The potential for misuse is so much greater than any possible beneficial use cases I can imagine.
It's being developed because for decades, tech companies have touted the supposed benefits of new technology and hidden the costs. It's been an avalanche effect of greater and greater technology for greater and greater costs. Only now, the costs are so great that people are starting to wake up.
When we look back in time to see the devastation wrought by technology, we won't look at AI as the starting point. It will be the smartphones, the 4G internet, and the 8K video that we so blindly accepted without ever considering the immense changes to society that they implied. AI may be the start of the end of reasonable society, but it is only the APEX of a phenomenon that has been happening ever since we accepted fossil fuel use without considering the implications of climate change. It's an entire societal attitude.
Descript acquired Lyrebird years ago and already have a product that can synthesize your voice.
They require training data longer than 15 seconds, which could lead the output to resemble the actual voice more closely.
I've seen weird behaviors where the AI voice forces a British accent, which I don't have, onto certain words.
Descript also uses voice synthesis to regenerate edited portions of conversations, with a noticeable cut to smooth the transition, which is pretty useful.
Certainly what it seems like. They have a model they can't release without backlash, but still want the credit and don't want to wait to receive accolades until they can figure out how to prevent abuse (which I assume is just impossible).
Why is this big news? There have been open source models that can do this for quite some time, including the recently released https://github.com/jasonppy/VoiceCraft
I would imagine if I lost my voice that I would prefer a standard TTS voice as opposed to paying a subscription to have an unholy digital facsimile voice that OpenAI owned.
> OpenAI says they see this technology being useful for reading assistance, language translation and helping those who suffer from sudden or degenerative speech conditions.
Of course they would. They are manipulating others by appealing to accessibility. Who can argue with that? Accessibility has become the "but think of the children" appeal of the AI world. Yes, it can help those who suffer from degenerative speech conditions, but it can also deceive and push propaganda to whole new heights.
This sort of manipulation is characteristic of antisocial personality disorder. In fact, if you look up the symptoms of that disorder, technology companies like OpenAI exhibit quite a lot of them...
As a wise man once said: "drunk driving may kill a lot of people, but it also helps a lot of people get to work on time, so, it;s impossible to say if its bad or not"
Sounds like a fool. We have the ability to weigh costs and benefits. We have advanced linear models and computers. We have the math to split the fucking atom, but when it comes to even thinking about the social costs of technology, suddenly everyone is impotent. What a joke.
I agree with you entirely. I could not have said it better myself. However, for context, the comment you are responding to is a quote of a very sarcastic poaster known as "dril."
> They are manipulating others by appealing to accessibility.
I'd argue voice cloning isn't really crucial to accessibility. I mean, this is just cosmetics. I'd like to hear opinions of people who actually can benefit from this and if it is really such a "game changer" for them.
Indeed, I agree. And before I hear from them, I would like to make sure that it wasn't some tech company that provided all the funding and technology in return for a PR piece as is quite common in the tech world.
Yes they did. That is typical of tech companies; it's amazing that people actually believe it. But that's our society's attitude: new tech is an improvement. It's a faith stronger than most people have in their god(s).
Unfortunately we can't put the cat back in the box, so even if you don't have that attitude, what are you to do? Join the Amish? New tech isn't always an improvement, but we can't uninvent things, just like we can't unsee things.
Slavoj Žižek made a great point about this. He said that we have been conditioned to believe that anything is possible with technology and science (cure for cancer, limitless energy, flying cars, teleportation), and we have ALSO been conditioned to believe that nothing can change sociologically. ("The world will always be the same.")
But that conditioning is not actually true and perhaps you are exhibiting it: we actually can change things. I can already think of some things we could do:
(1) Form an economic coalition that bans the use of AI, and squeezes out AI. Actually, the general public attitude towards AI is already ambivalent enough so that this might work.
(2) Form a revolution against modern tech-controlled society. This would obviously involve a lot of things, but it could involve mass public activity against AI companies.
(1) First, most large companies are essentially tech companies in some way, at least in that they exist to advance technology. Therefore, your observation is good reason to abolish large companies.
(2) (Direct) tech companies often exhibit one symptom that other companies don't, at least not as strongly: they lie directly to the public more often. Car companies, drink companies, etc. don't lie about making the world a better place; they just appeal to base desires. Tech companies actually are deluded enough to think they are making the world a better place, and use manipulation to convince us that this is true.
The fact that so many of them believe in that effective altruism nonsense shows why none of these people should ever be in control of such vast amounts of power and resources. They think they should have all the money and power because they're tech CEOs and have a maniacally unhinged confidence in their understanding of complex global societal issues / are straight-up lying because they're greedy, and the rest of us peons should just bow to their will.
Tech companies and CEOs are experts in the fine art of exploiting the prisoner's dilemma to quickly amass wealth at great societal expense, all within the system. I'm afraid the only way to solve this problem is a combination of educating society by making them more aware and taking tech companies down through mass public opposition / mob rule.
Making AI-voice videos of Trump and Biden playing Minecraft is one thing; they clearly sound overly robotic and that makes it funny. But being able to perfectly replicate voices in a realistic way is a technology that has very little real-life utility and so, so many real-life downsides.
Literally, what is the point of this tech? To eliminate voice actors? I don't buy for a second their supposed use cases of accessibility or assisting people with disabilities. We're going to enable people to fake voices realistically off of 15 seconds of, say, a voicemail, just so media companies can save a few bucks? This is actually madness.
The only "legitimate" use I can think of is making call center AI agents sound like people. And all the other ones involve getting enough of someone's "voiceprint" to impersonate them to get money or ruin reputations.