I think it's plausible that it can roughly clone the timbre that my exact vocal and sinus geometry produces, then apply that to its internal representation of a "baseline human voice". However, I don't believe that it can capture exactly every nuance in articulation, emotional expression, and intonation. I simply can't imagine even a voice actor iterating through "the full range" in merely 15 seconds. And what about mannerisms, like pronouncing "you all" as "y'all"? Ending every sentence by affixing the "uh" sound to the last word-uh? Or how they sound a sigh or laugh?
In Dutch, for instance, there are three ways to pronounce the R, and everyone has a certain combination of when they use which kind.
A lot of people miss the forest for the trees. They can clone a voice using 15 seconds of audio. So how good will the clone be when they have a 10-minute, 30-minute, or larger sample? And it will probably keep getting better.
> And what about mannerisms, like pronouncing "you all" as "y'all"?
In the US this is highly determined by location, to the point where the New York Times even built a cool quiz [0] that can guess where you're from pretty accurately based on a few dozen vocab questions. So there's a dataset out there that would allow you to handle most of this variance by simply plugging in a ZIP code for the person you're spoofing!
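As a toy illustration (all of the region/vocab data below is made up, and `guess_region` is a hypothetical helper, not the quiz's actual method), the idea of mapping vocab choices to a likely region can be sketched as:

```python
# Hypothetical mapping of vocab questions to each region's typical answer.
REGION_USAGE = {
    "South":     {"you all": "y'all", "soft drink": "coke"},
    "Northeast": {"you all": "you guys", "soft drink": "soda"},
}

def guess_region(answers):
    """Score each region by how many of the speaker's answers
    match that region's typical usage; return the best match."""
    scores = {
        region: sum(answers.get(q) == a for q, a in usage.items())
        for region, usage in REGION_USAGE.items()
    }
    return max(scores, key=scores.get)

print(guess_region({"you all": "y'all", "soft drink": "coke"}))  # South
```

A real system would use many more questions and weighted probabilities per region, but the lookup-table shape of the problem is the same.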
It's often not necessary to sample all keys on a piano to make a passable piano sound generator, though that will have its own limitations.
A lot of these generators rely heavily on the base model and are worse at generating e.g. Dutch accented English and/or Dutch accented Dutch. But it works great for e.g. Californian accented English with unique voice.
I'll speculate "what is really happening" - Think of a voice like a musical instrument.
This model learned to be a Shazam for human voices instead of songs. And just like you can figure out by ear that a snippet of music was made on a Yamaha DX7, it seems 15 seconds is enough to narrow it down to a reasonably small set of vocoders that can recreate the given sample.
It feels like OpenAI is mostly concerned with developing proofs of the untrustability of every digital medium, creating a convincing case either for doing everything in person again, or necessitating cryptographic signatures on absolutely everything.
In some ways, this just feels like a reversion to the mean for humanity. “Don’t believe everything you read” is a maxim as old as print, and it’s good advice. If people ultimately learn that videos and recordings can’t be trusted, is that a huge loss? I’m not sure.
Of course, the road from here to there may be bumpy...
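For illustration, here is a minimal sketch of what "signatures on everything" could look like. A real provenance scheme (e.g. C2PA) would use asymmetric signatures and embedded metadata; this toy uses a standard-library HMAC with a made-up key, only to show the core property that any edit to a clip breaks verification:

```python
import hmac
import hashlib

KEY = b"publisher-secret-key"  # hypothetical key; real schemes use key pairs

def sign(audio_bytes: bytes) -> str:
    """Produce an authentication tag for a recording."""
    return hmac.new(KEY, audio_bytes, hashlib.sha256).hexdigest()

def verify(audio_bytes: bytes, tag: str) -> bool:
    """Check a recording against its tag in constant time."""
    return hmac.compare_digest(sign(audio_bytes), tag)

clip = b"\x00\x01fake-pcm-samples"
tag = sign(clip)
print(verify(clip, tag))            # untouched clip checks out
print(verify(clip + b"\xff", tag))  # any edit fails verification
```

The hard part, of course, is not the math but the social layer: distributing keys, deciding who counts as a trusted signer, and getting every capture device and platform to participate.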
> In some ways, this just feels like a reversion to the mean for humanity.
No it doesn't, this is ridiculous. All of you guys with your "this is just like the printing press" arguments are so intellectually lazy, think a little more about the problems of scale and speed and the way information spreads in the modern world.
In which cases would they have been trustworthy content anyway? You could easily take real videos out of context, which people were already doing to mislead.
Ultimately you should reason and investigate deeper to hold a somewhat accurate opinion on anything.
Most media and news out there already put the agenda first, then compile cherry-picked content from a wide variety of sources as evidence for that agenda. And you can already find content to support any agenda.
It is a simple algorithm of:
1. We want to prove X.
2. From millions of datapoints we pick the 100 that support X the most.
3. We write an article on it.
4. Most of our readers already agree with X so they will be happy about it. No need to go deeper.
5. Anyone who doesn't agree would probably not read our platform anyway.
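As a toy sketch of steps 1-2 above (function names and data are all hypothetical), the cherry-picking amounts to ranking by agreement and keeping the top slice:

```python
def pick_supporting_evidence(datapoints, supports_x, n=100):
    """Step 2 of the 'algorithm': from all datapoints, keep only
    the n that support the desired conclusion X most strongly."""
    ranked = sorted(datapoints, key=supports_x, reverse=True)
    return ranked[:n]

# Hypothetical example: datapoints are numbers, and "support for X"
# is just the value itself.
data = list(range(1000))
evidence = pick_supporting_evidence(data, supports_x=lambda d: d, n=100)
print(len(evidence), min(evidence))  # 100 points, all from the top of the range
```

The point of the sketch: nothing in the selected evidence is false, yet the sample is maximally unrepresentative, which is why the resulting article can be both "sourced" and misleading.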
The answer we see is that if you put AI-generated stuff in places people don't look at too carefully, they won't care because they won't notice. I'm not sure that's a great precedent.
Right, we're in a somewhat dangerous period where the general public doesn't understand the capabilities that exist and how they can be misused. But if you mentally fast-forward to a time when people do understand this and have even become blasé about it, what does that time look like? (Presumably, you don't wire money based on a phone call without kicking the tires a bit...)
> It feels like OpenAI is mostly concerned with developing proofs of the untrustability of every digital medium
Which, to me, makes sense. Once the underlying technology exists, a malicious actor would not think twice before developing tools of deception like these. It makes sense that OpenAI would work on this "in the open" to demonstrate that we now need to be skeptical of audio.
When I hear "ML-based generative tools pushing people out of their jobs is just a political problem", what I hear is "ClosedAI and Microsoft definitely had nothing to do with choosing not to license and compensate the original creators whose works they acquired to build their commercial products; it's all the government's fault".
Often the same people will cry over-regulation about ML tech rollout potentially being suppressed and controlled via policy, even though that is literally the implication of their position.
> OpenAI is mostly concerned with developing proofs of the untrustability of every digital medium, creating a convincing case either for doing everything in person again, or necessitating cryptographic signatures on absolutely everything.
Am I alone in thinking that this would not be a bad thing?
> When the product officially rolls out there will be a “no-go voice list” that detects and prevents AI-generated speakers that are too similar to prominent figures.
Should they then let anyone register their "voice ID" to prevent others from generating similar voices?
What if your voice happens to resemble the voice of a "prominent figure?"
It's insane that this protection exists for "prominent figures" but nobody else. The same damage that can be done to prominent figures can be done to regular people, they acknowledge the damage can be done, and yet they're still rolling this out? The more OpenAI does the more it's clear they don't give a single damn about the consequences of their technology.
Murdering a random civilian and murdering a public official are also very different levels of damage, but we still outlaw and prosecute both of them. This is the same thing.
Not to mention, what constitutes a political leader? Just the upper echelon? What about local civil servants? Mayors, cops, judges? Are they gonna have a database of every public or political figure in the country? No they won't. This is absurd.
Many jurisdictions actually do have higher penalties for crimes of violence committed against public officials than ordinary citizens. Assault a USPS postal worker at a post office and you automatically have a higher sentence than assaulting a UPS worker at a UPS store.
It’s a little tricky because if someone cloned your voice first they could pretend they were you and stop you from using “their” voice, then do nasty stuff with it. This way they can still do that, but they can’t prevent you from using your own voice unless you have to upload an ID with your voice or something. It’s much easier to enforce this on known celebrities or otherwise well known people than the general public.
It reminds me of the new EU law being implemented, in that it eliminates the use of AI for private companies but explicitly allows it for law enforcement. Both are methods of consolidating power. Safer to be open for everyone and unrestricted than to explicitly protect the few.
VoiceCraft, which was released open source a week or so ago, can do the same with 5-10 seconds of audio and is pretty convincing. It's pretty fun to play around with: https://github.com/jasonppy/VoiceCraft
Submitted multiple times in the past few days. Here's a link to the most upvoted one as of now:
https://news.ycombinator.com/item?id=39865340
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (jasonppy.github.io)
Altman and OAI – and their investors, more importantly – have already proven over decades and decades that they're in it for the money and at any cost to people including their own paying customers.
This generation is making the same can-kicking mistakes as their parents and nobody wants to admit it.
I am not sure why this is being developed. The potential for misuse is so much greater than any possible beneficial use cases I can imagine. Text to speech is cool, why do you have to introduce technologies that allow for voice cloning, which will inevitably be used (more than it already is) to commit fraud and hurt people?
Society is not ready for all of these "intelligences", heck, we can't even figure out what to do about all the drawbacks of social media and we've had it for decades.
Voice cloning has a good benefit-to-drawback ratio compared to many technologies. Threat actors with enough resources could always clone your voice. Anyone can hire voice actors. It costs more time and money, sure, but it was always possible.
Beneficial use cases:
* Accessibility: Give people that lost their voice their voice back.
* Personalized Digital Assistants: More humane voice.
* Voiceovers: Sick content creator can let the AI do the voice, filmmakers can fix mistakes in post and so on and so forth.
* Language Learning: Learn an accent by mimicking how the AI voices yourself in the target accent.
* Audiobook Narration: Make a good audiobook out of a textbook.
* Preservation of Cultural Heritage: Preserve an accent that's about to die out.
* Corporate Training: Believe it or not, a robotic voice sucks to listen to after a while.
* Interactive Gaming: Entire virtual DND campaign based on a prompt with AI generated world, story, and characters drawn and voiced by AI with quality control by AI.
Now compare that to guns, tanks, flamethrowers, chemical weapons, atomic bombs, hydrogen bombs, killer drones and bombers. Yeah uh... death. Made to kill people.
You are indeed missing something: you don't need machine powered instant voice "cloning" tech to have good artificial voices, which would do basically everything you're talking about.
Voice cloning has always been technically possible and so has fake videos but lowering the barrier of entry to voice cloning and deepfakes has very real implications for society that we're not remotely equipped to deal with.
Nothing is believable anymore. Stuff like this will only become more commonplace:
> I am not sure why this is being developed. The potential for misuse is so much greater than any possible beneficial use cases I can imagine.
It's being developed because for decades, tech companies have touted the supposed benefits of new technology and hidden the costs. It's been an avalanche effect of greater and greater technology for greater and greater costs. Only now, the costs are so great that people are starting to wake up.
When we look back in time to see the devastation wrought by technology, we won't look at AI as the starting point. It will be the smartphones, the 4G internet, and the 8K video that we so blindly accepted without ever considering the immense changes to society that they implied. AI may be the start of the end of reasonable society, but it is only the APEX of a phenomenon that has been happening ever since we accepted fossil fuel use without considering the implications of climate change. It's an entire societal attitude.
Descript acquired Lyrebird years ago and already have a product that can synthesize your voice.
They require training data longer than 15 seconds, which could lead the output to resemble the actual voice more closely.
I've seen weird behaviors where the AI voice forces a British accent, which I don't have, onto certain words.
Descript also uses voice synthesis to regenerate edited portions of conversations, with a noticeable cut to smooth the transition, which is pretty useful.
Certainly what it seems like. They have a model they can't release without backlash, but still want the credit and don't want to wait to receive accolades until they can figure out how to prevent abuse (which I assume is just impossible).
Why is this big news? There have been open source models that can do this for quite some time, including the recently released https://github.com/jasonppy/VoiceCraft
I would imagine if I lost my voice that I would prefer a standard TTS voice as opposed to paying a subscription to have an unholy digital facsimile voice that OpenAI owned.
> OpenAI says they see this technology being useful for reading assistance, language translation and helping those who suffer from sudden or degenerative speech conditions.
Of course they would. They are manipulating others by appealing to accessibility. Who can argue with that? Accessibility has become the "but think of the children" appeal of the AI world. Yes, it can help those who suffer from degenerative speech conditions, but it can also deceive and push propaganda to whole new heights.
This sort of manipulation is characteristic of antisocial personality disorder. In fact, if you look up the symptoms of that disorder, technology companies like OpenAI exhibit quite a lot of them...
As a wise man once said: "drunk driving may kill a lot of people, but it also helps a lot of people get to work on time, so, it;s impossible to say if its bad or not"
Sounds like a fool. We have the ability to weigh costs and benefits. We have advanced linear models and computers. We have the math to split the fucking atom, but when it comes to even thinking about the social costs of technology, suddenly everyone is impotent. What a joke.
I agree with you entirely. I could not have said it better myself. However, for context, the comment you are responding to is a quote of a very sarcastic poaster known as "dril."
> They are manipulating others by appealing to accessibility.
I'd argue voice cloning isn't really crucial to accessibility. I mean, this is just cosmetics. I'd like to hear opinions of people who actually can benefit from this and if it is really such a "game changer" for them.
Indeed, I agree. And before I hear from them, I would like to make sure that it wasn't some tech company that provided all the funding and technology in return for a PR piece as is quite common in the tech world.
Yes they did. That is typical of tech companies; it's amazing that people actually believe it. But that's our society's attitude: new tech is an improvement. It's a faith stronger than most people have in their god(s).
Unfortunately we can't put the cat back in the box, so even if you don't have that attitude, what are you to do? Join the Amish? New tech isn't always an improvement, but we can't uninvent things, just like we can't unsee things.
Slavoj Žižek made a great point about this. He said that we have been conditioned to believe that anything is possible with technology and science (cure for cancer, limitless energy, flying cars, teleportation), and we have ALSO been conditioned to believe that nothing can change sociologically. ("The world will always be the same.")
But that conditioning is not actually true and perhaps you are exhibiting it: we actually can change things. I can already think of some things we could do:
(1) Form an economic coalition that bans the use of AI, and squeezes out AI. Actually, the general public attitude towards AI is already ambivalent enough so that this might work.
(2) Form a revolution against modern tech-controlled society. This would obviously involve a lot of things, but it could involve mass public activity against AI companies.
(1) First, most large companies are essentially tech companies in some way, at least in that they exist to advance technology. Therefore, your observation is good reason to abolish large companies.
(2) (Direct) tech companies often exhibit one symptom that other companies don't, at least not as strongly: they lie directly to the public more often. Car companies, drink companies, etc. don't lie about making the world a better place; they just appeal to base desires. Tech companies actually are deluded enough to think they are making the world a better place, and use manipulation to convince us that this is true.
The fact that so many of them believe in that effective altruism nonsense shows why none of these people should ever be in control of such vast amounts of power and resources. They think they should have all the money and power because they're tech CEOs and have a maniacally unhinged confidence in their understanding of complex global societal issues / are straight-up lying because they're greedy, and the rest of us peons should just bow to their will.
Tech companies and CEOs are experts in the fine art of exploiting the prisoner's dilemma to quickly amass wealth at great societal expense, all within the system. I'm afraid the only way to solve this problem is a combination of educating society by making them more aware and taking tech companies down through mass public opposition / mob rule.
Making AI-voice videos of Trump and Biden playing Minecraft is one thing; they clearly sound overly robotic and that makes it funny. But being able to perfectly replicate voices in a realistic way is a technology that has very little real-life utility and so, so many real-life downsides.
Literally, what is the point of this tech? To eliminate voice actors? I don't buy for a second their supposed use cases of accessibility or assisting people with disabilities. We're going to enable people to fake voices realistically off of 15 seconds of, say, a voicemail, just so media companies can save a few bucks? This is actually madness.
The only "legitimate" use I can think of is making call center AI agents sound like people. And all the other ones involve getting enough of someone's "voiceprint" to impersonate them to get money or ruin reputations.