In CGI there were always these milestones I watched getting reached: trees with leaves finally looking close to realistic, wind blowing through grass looking almost realistic, hair, jelly. It was usually Pixar shorts pointing out what they had been focusing on, and then you'd see it applied to their movies.
Then mocap, and mapping digital faces onto real actors, which was first mind-blowing to see in Pirates of the Caribbean, then the apes in one of the Planet of the Apes movies... So much in the CGI industry has already reached a point where the hardest problems seem to have been solved.
When I now clicked play on the first synthesized dialogue from the Dialogue Synthesis section ("Where did you go last summer? | I went to Greece, it was amazing."), I was blown away. It's as if we've reached another of those milestones where a problem appears to be cracked. Machines will really be able to sound like humans, indistinguishable from them.
Five to ten years ago, if you wanted to deal with TTS, the best option you had was to let your Android phone render the text into an audio file, because everything else sounded really bad. Open-source stuff in particular sounded absolutely horrible.
So how long will it be until we will be able to download something of this quality onto a future-gen Raspberry Pi which can do some AI processing, where we make an HTTP call and it starts speaking through the audio out in a perfect voice without relying on the cloud? 5 years?
To answer that question, it is necessary to answer the question "why did we want actual human singers in the past?"
If you think it is because there was no alternative, then you will see no barriers to the adoption of the sort of system you imagine.
But if you think it is because all human art is a form (however weird, tangled and opaque) of storytelling, and for it to work the person experiencing the art needs to believe that there must be a story to be told, then your imagined system is not that interesting.
However, we are seeing an overwhelming propensity even today for people (even people who ought to know better) to ascribe intentionality and even emotion to computer systems that demonstrably do not have it (and have done so as far back as Eliza in the 1960s). It seems likely to me that most people will rapidly come to believe in a sufficiently intentional and emotional backstory to your putative wunder-singer that their belief will satisfy their desire for "a person behind the art".
Whether a machine can sing like a human does not matter much as far as the appreciation of art goes: all art is self-expression, and self-expression always requires a self to express and a self to receive. Someone programmed the voice to sing something, to someone, for some reason.
Both human and software-generated singing exist today, and either can be boring nonsense or speak to you and leave an emotional impact.
(That said, if said machine is actually a derivative work produced from vocal samples without appropriate licensing, that’d make it dubious legally or morally. To my knowledge, preexisting software-generated singing like Vocaloid does not suffer from this problem.)
If your definition of art includes things made without a self, a thinking mind with intention and agency, then go ahead and ponder how the Sun and human DNA are art; I'll wait for you to come back with a new definition.
You are welcome to produce a definition that separates those things.
You have seen my definition. By definition an act of self-expression cannot occur without a self.
Without clarifying what makes something art, your claim that art can be something produced without a self (i.e., is not self-expression) demotes art to the level of “any random thing”.
As a choral singer, if there’s an app that one day allows me to sing with a fake choir of extremely good singers, I would enjoy doing that all day long. And it would allow my actual choir to practice way more, making our performances far better.
From what I've seen, the music industry (all genres including pop, classical, etc.) tends to be more about personality than actual content these days, so it's not really an issue.
Virtual singers are already pretty good anyways, I feel we've already passed the point of diminishing returns.
And even if you look at virtual singers, the ones which are popular are the ones from ten years ago, not the newer ones with more realistic voices.
What do you mean by "ten years ago"? Yamaha has been producing vocal synths since the 2000s, but it is quite pointless to use Vocaloid 4 technology when you can use Vocaloid 6.
People like playing drums too, but a drum machine means that if you're not any good at it or too busy but you need drum sounds you can have drum sounds.
There are rights issues if the result is that it replaces a particular singer: if you made it so that Sneaker Pimps can fire Kelli but still have her voice on subsequent songs, that's a problem. But suppose you're a bedroom musician, and you realise you've got a piece that really wants somebody with a different voice than yours to make it work - you can pay someone, but technology like this offers a cheaper, easier option.
>So how long will it be until we will be able to download something of this quality onto a future-gen Raspberry Pi which can do some AI processing, where we make an HTTP call and it starts speaking through the audio out in a perfect voice without relying on the cloud?
5 years? It's probably possible roughly whenever the larger Whisper models can run on it. Probably the next Raspberry Pi, running quantized or optimized versions of some audio model.
It may be almost possible right now if you tried really, really hard and used a small model fine-tuned on a single voice, instead of something larger and more general purpose that can do any voice. I think whisper-tiny works on a Pi in real time, right? And that's not even leveraging the GPU on the Pi. (https://github.com/ggerganov/whisper.cpp/discussions/166)
Edit: looks like medium is 30x slower on the Pi than the tiny model, so I may have been overly optimistic. I didn't realize Whisper tiny was that much faster than medium.
This method works pretty well with Tortoise, letting you use the super fast Tortoise quality settings but get quality similar to the larger models. Fine-tuning the whole thing on just one voice removes a lot of the cool capabilities of course. With Tortoise, that would still be way too slow for a Pi but potentially that same strategy could work with faster models like SoundStorm.
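To make the original question concrete, here's a minimal sketch (purely hypothetical, nothing from the paper) of the kind of no-cloud setup being imagined: a tiny HTTP endpoint on the Pi that hands text to a local model and plays the result through the audio out. The synthesize() function is a stand-in for whatever small, single-voice model ends up fitting on the hardware.

    # Hypothetical sketch: synthesize() stands in for a future on-device TTS model.
    import numpy as np
    import sounddevice as sd          # assumption: any local audio-output library would do
    from flask import Flask, request

    app = Flask(__name__)
    SAMPLE_RATE = 24_000

    def synthesize(text: str) -> np.ndarray:
        """Placeholder for a small, quantized, single-voice model; returns silence here."""
        return np.zeros(int(SAMPLE_RATE * 1.5), dtype=np.float32)

    @app.route("/speak", methods=["POST"])
    def speak():
        text = request.get_json(force=True)["text"]
        audio = synthesize(text)
        sd.play(audio, SAMPLE_RATE)   # straight out of the Pi's audio jack, no cloud round trip
        sd.wait()
        return {"ok": True, "chars": len(text)}

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

A client would then just POST {"text": "..."} to /speak; the open question is purely whether the model behind synthesize() can run in real time on Pi-class hardware.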
In terms of quality there's still a lot of room to go on long-term coherence, i.e. long audio segments. When a real person reads an audiobook, the words at the top of the page have a pretty big impact on how the words at the bottom of the page are read, and there can be some impact at any distance, page 10 to page 300. When you try audiobooks on the super high-end TTS models and listen carefully, you really notice the mismatch. It's like the reader recorded the paragraphs out of order, or like video game voice lines where you can tell the actors recorded their lines separately and weren't reacting to each other's performance.
You can bump the context window to a minute, two minutes. That's going to get you closer and probably good enough for some books. In the short term a human could simply adjust all the audio samples and manually tweak things to sound correct, so this will enable fan-created audiobooks where they take the time to get it right. But for fully automated books the mismatch drives me nuts. The performance is just soooo close for certain segments that when you get a tonal mismatch it hurts.
Over the years I've also watched CGI progress and been amazed by the steps there -- the joy of seeing yet another complex part of reality rendered was a constant.
These days, though... every new technique developed to simulate and replicate human creativity and behavior brings a constant sense of unease.
If I view or read something, do I have the right to know whether it's generated?
It's good that Bing and Bard are using the latest Microsoft and Google Cloud offerings, but it would be nice to see these speech advances (along with AudioPaLM - https://google-research.github.io/seanet/audiopalm/examples/ etc.) hit public APIs and/or user interfaces.
Bard's TTS is alright but it's clearly behind.
On that note, Bing's English/Korean TTS is really good. I also didn't realize Microsoft uses the best offerings for free TTS on edge so it blows google's default tts voices away.
Yes. I'm not saying Google's top Cloud offerings are bad, although I still think Microsoft's stuff is better.
Just that:
1. It's behind their current SOTA research.
2. You can only use those voices extensively by paying for them. Microsoft offers their best stuff in Edge for free, so for reading a PDF or web page aloud, Microsoft is far better.
It's disappointing, but I wouldn't expect research algorithms to be available immediately unless they held it back until the product is ready. I guess Apple would do that?
> I also didn't realize Microsoft uses the best offerings for free TTS on edge so it blows google's default tts voices away.
This sounds really interesting - can you share a bit more? I'm behind in this space, my parser got all jammed up, something like: "Microsoft uses [the best offerings for free TTS](as in FOSS libraries, or free as in beer SaaS?) [on edge](Edge browser, or on the edge as in client's computer?)(Is the implication that all TTS on the client's computer blows Google's default TTS voices away?)"
I believe they mean that the free TTS feature in Microsoft Edge uses their best technology, and that said tech is better than Google's default offering.
Ha, I'm not even asking for code/model releases. It's just a bit funny that what you can *pay* Google to use is so far behind what they have up and running, collecting dust.
Sure, but there was a time not too long ago when companies were still in the "good will" phase of handing out even highly valuable models like CLIP, guided-diffusion, etc. Come to think of it, it was mostly OpenAI doing this. And they kinda still do? But far more selectively. I'm just preemptively romanticizing that.
I don’t want to defend Google’s business practices, but this is such a trite comment someone always feels compelled to post on anything about Google, including even a research paper, apparently.
It's a very relevant comment. It tells you not to rely on, or expect further development of, any new Google technology, even seemingly good ones, as they can go to the graveyard like many others.
I don't bother to post the comment, but the high likelihood of any Google project/product being killed within a year or two is absolutely the first thought I have whenever a new Google project/product is announced (not because of HN posts, but because of their history), so good job on that Google.
I'll argue it's not trite. It's a concise compilation of the thousands of teeth-gnashing comments here on HN and all over the internet whenever Google randomly drowns another one of its children.
Just fucking stay away from Google products. Period.
First of all, it isn't a product. It's a f*king research paper, like dozens of others showing up on HN every day. Most of them never become a product.
Second of all, by whining ad nauseam you drown out discussion of the merits of the technology and chase people away. I hardly read Google news on HN now precisely for that reason. Imagine if "Attention Is All You Need" came out now? [0]
Save your complaint for when Google makes it a product.
>Save your complaint for when Google makes it a product.
or save yourself the trouble and find alternatives to big-G.
It's entirely their own fault that people now view all Google news as temporary and fleeting. People don't want to put time into things that'll get thrown away in a year.
Reading G research papers seems like a shortcut to me, know what will be thrown away in 2 years before it's a valid product in 1 year and someone gets huckleberry'd into devoting time and effort into implementing the dead-product-walking API.
> It's entirely their own fault that people now view all Google news as temporary and fleeting. People don't want to put time into things that'll get thrown away in a year.
Most research doesn't become a product, from Google or anyone else. As research projects they still have value, unless you are saying Google research is garbage because they have gotten into the habit of canceling their products.
> or save yourself the trouble and find alternatives to big-G
Totally valid point. No need to complain about it in a post about Google research though. It’s tiresome.
Google's non-advertising revenue in the latest quarter was about $15 billion. Is that a significant amount of non-ads product revenue? At least it is higher than the revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix, Broadcom, Qualcomm, or Salesforce in that same quarter.
I think their non-ads businesses alone would be the 6th largest US tech company by revenue. (Amazon, Apple, Microsoft, the ads business of Alphabet, Meta. Am I forgetting something?)
Revenue is easy when you lose money on every dollar. Last quarter Ads printed $21B of income; the rest was a loss, except Cloud, which for the very first time wasn't losing hundreds of millions of dollars.
When people wax eloquent about how the artisans will just find something new to do for work, what they fail to mention is that the new work is often a menial and lower-paid job. When Amazon puts mom and pop shops out of business, they don’t go start new businesses, they go get jobs at Wal-Mart.
Interesting that SoundStorm was trained to produce dialog between two people using transcripts annotated with '|' marking changes in voice. But the exact same '|' characters seem to mostly work in the Bark model out of the box and also produce a dialog?
Maybe a third or a bit more of Bark outputs come out as one person talking to themselves -- and it often misses a voice change. But the pipe characters do reliably produce audio that sounds like a dialog in performance style.
Is there some text-audio data somewhere in the training data that uses | for voice changes?
Amusingly, Bark tends to render the SoundStorm prompts sarcastically. Not sure if that's a difference in style in the models, or just Google cherry picking the more straightforward line readings as the featured samples.
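For anyone who wants to poke at this themselves, the experiment is roughly the following, using Bark's Python API (as published in its README) and the pipe-delimited prompt from the SoundStorm demo page; whether the pipe actually triggers a voice change on a given run is exactly what's in question.

    # Feed a SoundStorm-style pipe-delimited dialogue prompt to Bark and save the result.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the Bark checkpoints on first run

    prompt = "Where did you go last summer? | I went to Greece, it was amazing."
    audio = generate_audio(prompt)  # numpy float array at Bark's native sample rate

    write_wav("bark_dialogue.wav", SAMPLE_RATE, audio)

Listening to a batch of these is how you'd estimate how often the '|' is honored versus collapsed into a monologue.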
The creators won't say, as far as I know, but Bark looks to be trained on a lot of YouTube corpora (rather than typical ML audio datasets), where audio may have transcripts like that, which is why stuff like [laughs] works.
In the future, will children think it's normal to talk like, "Hey, what up, Youtube! ... Be sure to like and subscribe! ... Smash that like button! ... Let me know in the comments down below!"?
I wonder how ML trained on the tone transitions to a sponsored segment dripping with secret shame... would infect general speech.
Yeah, I often try to think about what might be in a YouTube caption when finding prompts that work in Bark. But the pipe character isn't one I remember seeing on YouTube. Maybe it's part of some other audio dataset, though. Or maybe it's on YouTube but only in non-English videos.
I wonder if work marketplaces like UpWork and Fiverr will adapt quickly enough to this new situation, where many of their services, which in the past were done by humans, can now be done by software.
Their current marketplace interface seems inadequate for this. Instead of contacting a human and then waiting for them to finish the work, buyers will want to get results right away.
Therefore they will have to change their platform to work like an app store, where sellers connect their services and buyers can use them directly.
Announcements often get played repeatedly -- "Train 101 to Lisbon is now on track 5". Why do you want to torture the station's workers with that?
Instead, make an effort to start a conversation with your fellow travelers, or graciously respond to such an effort from them. Apologies if you already do.
Taken to its conclusion, doesn't this just mean we should outlaw all forms of mechanization in order to preserve human jobs as much as possible? Would this be better for a country than just allocating a portion of the economy's output and using it to provide social safety nets?
Since you mention it, there was some news a while back about some African countries deliberately not using construction machines to build roads, as a means to foster more jobs.
Likewise, in many countries having someone at the gas station to fill the tank is still a job, and most likely even with EVs they would be the ones taking care of the charging.
At the beginning of the 20th century in the US, 3 in 4 workers were either laborers, farmers, miners, or household service workers. By the end of the century that number had fallen to 1 in 4.
Of course that wasn't a net loss; it was part of a larger economic transformation that created more higher-paying jobs.
This is the lump of labor fallacy.
It's not about "what jobs will be left", it's about the new jobs we'll invent with all the time we'll have on our hands.
There was never a fixed number of jobs, there's a fixed number of workers.
I don't think you understand. We aren't about to automate all the things humans can "currently" do. We are about to automate everything that separates us from a brick. What can possibly be left, and why would an unemployed person pay for whatever is left?
The nature of the jobs changes too, though. Do you believe jobs will become more or less skilled/autonomous, and connect people more or less with their fellow man? Some of us are pessimistic about those questions.
Where I see the benefit is in dialogue replacement. It takes a lot of time to call an actor back to the stage after they have finished the project. They might have moved on to another project, or, depending on how in demand they are, they have a tight schedule and your time with them is limited. Then on top of all that, some actors are just not very good at the process. So you have them in a room for a long time trying to get the right performance, which is sometimes difficult because the original performance was done in some specific environment.
Having a tool at my fingertips where I can feed it some of the actor's previous lines and have it belch out something to fill the gaps with set parameters, and be able to move along in the project without all the logistics, would be heaven.
It would, however, kill an entire field of expertise. It would also devalue the actor. Though it's already happening: there are already programs on the market that replace voice actors altogether and are being used in the video game space.
For the work I do, I can see the benefits it could bring. But I am also fully aware it probably will be heavily abused.
App Stores, the web, etc. How else does software as a service get sold? It’s not a new thing. Probably a lot of these things will just end up as features in existing systems.
Existing app stores like the ones on iOS and Android mostly target casual use cases, mobile devices, and on-device software, not "buy once" experiences for work via software as a service. They also do not offer a unified experience: two "text-to-speech" apps could have completely different user interfaces.
The web does not have good discovery and reputation management, and also does not provide a unified interface. That is why marketplaces like Booking.com, Amazon, Spotify, etc. have become so big.
The most impressive part of this is that they are seemingly able to produce 30 seconds of TTS from just 3 seconds of source material. That is super cool, and honestly much further along the curve than I expected it to be.
No, I am not interested in playing a game generated this way, because the exact reason I want to listen to NPC conversations is that they are written by humans.
Impressive, although the first example fails at the very last second when the virtual guy's voice glides up while saying "what?" and the autotune-ish pitch correction effect is clearly audible.
Other examples are nothing short of incredible. If it really can output minutes of credible voice after training on just a few seconds, the next step will be to make it sing. I foresee a legal storm approaching when someone uses a similar technology to put, for example, Elvis's voice (without naming him) in some advertising, so that fans would recognize him but the audio wouldn't match any of his lyrics or any of his existing songs.
I don't have anything intelligent to say about this, but it's a lot of fun making all the samples play at the same time - sort of like the HTML version of Ableton Live.
Seems like we wouldn’t be far at all from just correlating this to face movement (including subtle iris movement and blinks, not just the mouth). As long as you clearly label it as CGI it’s harmless and I’m excited for the day to come. Might be quite fun to chat with a little buddy this way
Did you read the paper? They intentionally steered the quality to ensure they sound fake. Their generated speech is "very easy to detect" according to the reference at the end of the paper.
I found that in my (high quality) studio monitors, the audio sounded fine and hard to distinguish from 24kHz wav. But in headphones, the artifacts were pretty obvious. So probably some reverberation will do a lot to cover up artifacts. In the paper, they only do a subjective comparison between the generated audio and the soundstream-encoded original audio, which seems a bit disingenuous. Listening to soundstream audio in headphones, I can hear those same artifacts.
This is nothing short of amazing. It is exciting, a bit scary as well, what the future will bring.
It just makes me sad that I cannot open this page in Safari. It will not play a single audio clip, yet Chrome plays it fine. So here we are, able to generate audio, video, and code, and do amazing things with AI, but a simple website with text and audio doesn't work on the most popular laptop out there.
Good for fraudsters and spammers, bad for anyone who ever hoped to make a living from voice acting. I'm perplexed by AI technologists' seemingly incessant drive to automate away the existence of artistic performers.
LLMs aren't great and can't be relied upon in a business setting, or at least I would not rely on them.
But think open-world games. GTA VII, for example, where all NPCs have their dialogue auto-generated in real time and also converted to audio in real time.
That would make for a world that's a lot more spontaneous, with a lot less effort.
Right now, if memory serves me right, GTA V's dialogue alone runs to 5,000 pages or more, hand-written.
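As a sketch of what that loop might look like (all the model calls are hypothetical stubs; the pipe-delimited dialogue format is borrowed from the paper's demo prompts), the interesting part is simply that both steps have to run faster than the audio they produce:

    # Hypothetical real-time NPC chatter loop: an LLM stub drafts a pipe-delimited
    # exchange, a dialogue-TTS stub voices it. Neither backend is real here.
    import time
    import numpy as np

    SAMPLE_RATE = 24_000

    def draft_exchange(npc_a: str, npc_b: str, context: str) -> str:
        """Stand-in for an on-device LLM; a real game would prompt a local model."""
        return f"Did you hear about {context}? | Yeah, the whole block is talking about it."

    def synthesize_dialogue(script: str) -> np.ndarray:
        """Stand-in for a SoundStorm-style dialogue TTS; returns silence of plausible length."""
        seconds = 0.06 * len(script)            # rough guess: ~60 ms of audio per character
        return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

    start = time.perf_counter()
    script = draft_exchange("Vendor", "Passerby", "the marina fire")
    audio = synthesize_dialogue(script)
    elapsed = time.perf_counter() - start

    print(f"{len(audio) / SAMPLE_RATE:.1f}s of audio generated in {elapsed * 1000:.1f}ms")

The whole premise only works if generation time stays well under the audio length, which is exactly what fast parallel decoders like SoundStorm are chasing.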
That's all true, but I think it's a pity that the jobs that currently exist for voice artists will disappear. Gamers and consumers will have somewhat better interactive experiences, which is good. Indie game developers will also be able to put out games with lower budgets, which is nice for them. But the market for voice acting work is largely going to dry up and blow away for people who are not already at the top of that field. People who could previously have made a modest but sufficient living as voice performers will be replaced by computer-generated voices. It will be almost impossible to make a living in that field within 5 years.
Generative models around images are nothing new and have been around for a while already. But even today, if you really want creative control and expression, you need a designer that's good with Photoshop or Illustrator etc.
This is applicable to LLMs as well. You can get them to write plausible BS, but if you really want a well-articulated, rooted-in-reality write-up about something, a human has to be brought on board.
This equally extends to voice-over. If you really want expressive and creative control to produce an outstanding rendering of something, AI isn't going to cut it.
This is only true if you assume AI isn't going to keep improving. It gets significantly better on a quarterly basis, far faster than the time it takes for an actor to develop their craft and career. The output quality of today's cutting-edge models would have been science fiction only 2-3 years ago.
I'm not so sure about the future. Such models, all of these models, don't have a well-understood input-to-output mapping, and that's going to be a problem for a very long time.
I am not. The use cases like interactive assistants for the blind will generate very little commercial activity compared to the uses (and abuses) for entertainment and marketing purposes. A good example of this from the real world is the absence of cheap/open ASL interpretation for deaf people.
Ever notice the huge font on older people's phones? So big that a screen may only contain a few lines of text. Or that people have to pull out their reading glasses every time they check their phone? Text to speech is a godsend in those cases. Enormous benefits to an increasingly older population.
Huh? How big is the blind group compared to the older population?
You are saying it's not economical to use text to speech to support blind people. I'm saying the benefits are huge for the older population. It isn't just for fraudsters or spammers as you claim.
No, I'm not saying that at all. I'm saying the resources invested in helping people will be dwarfed by those invested in crap designed to exploit them economically or criminally.
Set aside the fact that you have absolutely no proof of that claim: the criminal world is tiny compared to the number of people who benefit from TTS (God forbid if that isn't the case). Encryption, as an example, is hugely beneficial to regular people despite being used and exploited extensively in shady and questionable activities.
Completely different technology. Encryption doesn't directly replace human labor. And nobody is complaining about the existence of TTS per se, but rather pointing out that lots of resources are being poured into replacing human expressive performance skills with machinery. There's a strike on in Hollywood right now because studio owners want to strongarm performers into contractual arrangements that give away resynthesis rights of their appearance and performance in perpetuity.
None of the objections voiced to my original comment have even attempted to engage with this economic industrial reality.
I don't need your help imagining use cases; I've been in this field a lot longer than you, and have talked up the technological possibilities of AI-powered TTS here for *years*. I understand the technology very well and am bullish on it. What I'm saying is that too much of the effort is being spent solving the wrong problems. Please try reading what I wrote instead of your imaginary subtext.
They don't, otherwise there would be many former CEOs living in tents. In reality, those who control large amounts of capital are quite willing (and increasingly say so in the open) to deprive others of their livelihoods, homes, and ability to feed themselves in order to realize a marginal increase in their own wealth.