In CGI there were always these milestones I watched getting reached: trees with leaves finally looking close to realistic, wind blowing through grass looking almost realistic, hair, jelly. It was usually Pixar shorts pointing out what they had been focusing on, and then you'd see it applied to their movies.
Then mocap, and mapping digital faces onto real actors, which was first mind-blowing to see in Pirates of the Caribbean, then the apes in one of the Planet of the Apes movies... So much in the CGI industry has already reached a point where the hardest problems seem to have been solved.
When I now clicked play on the first synthesized dialogue from the Dialogue Synthesis section ("Where did you go last summer? | I went to Greece, it was amazing."), I was blown away. It's as if we've reached another of those milestones where a problem appears to be cracked. Machines will really be able to sound like humans, indistinguishable from them.
Five to ten years ago, if you wanted to deal with TTS, the best option you had was to let your Android phone render the text into an audio file, because everything else sounded really bad. Open-source stuff in particular sounded absolutely horrible.
So how long will it be until we will be able to download something of this quality onto a future-gen Raspberry Pi which can do some AI processing, where we make an HTTP call and it starts speaking through the audio out in a perfect voice without relying on the cloud? 5 years?
To answer that question, it is necessary to answer the question "why did we want actual human singers in the past?"
If you think it is because there was no alternative, then you will see no barriers to the adoption of the sort of system you imagine.
But if you think it is because all human art is a form (however weird, tangled and opaque) of storytelling, and for it to work the person experiencing the art needs to believe that there must be a story to be told, then your imagined system is not that interesting.
However, we are seeing an overwhelming propensity even today for people (even people who ought to know better) to ascribe intentionality and even emotion to computer systems that demonstrably do not have it (and have done so as far back as Eliza in the 1960s). It seems likely to me that most people will rapidly come to believe in a sufficiently intentional and emotional backstory to your putative wunder-singer that their belief will satisfy their desire for "a person behind the art".
Whether a machine can sing like a human does not matter much as far as the appreciation of art goes: all art is self-expression, and self-expression always requires a self to express and a self to receive. Someone programmed the voice to sing something, to someone, for some reason.
Both human and software-generated singing exist today, and either can be boring nonsense or speak to you and leave an emotional impact.
(That said, if said machine is actually a derivative work produced from vocal samples without appropriate licensing, that’d make it dubious legally or morally. To my knowledge, preexisting software-generated singing like Vocaloid does not suffer from this problem.)
If your definition of art includes things made without a self, a thinking mind with intention and agency, then go ahead and ponder how the Sun and human DNA are art; I'll wait for you to come back with a new definition.
You are welcome to produce a definition that separates those things.
You have seen my definition. By definition an act of self-expression cannot occur without a self.
Without clarifying what makes something art, your claim that art can be something produced without a self (i.e., is not self-expression) demotes art to the level of “any random thing”.
As a choral singer, if there’s an app that one day allows me to sing with a fake choir of extremely good singers, I would enjoy doing that all day long. And it would allow my actual choir to practice way more, making our performances far better.
From what I've seen, the music industry (all genres including pop, classical, etc.) tends to be more about personality than actual content these days, so it's not really an issue.
Virtual singers are already pretty good anyways, I feel we've already passed the point of diminishing returns.
And even if you look at virtual singers, the ones which are popular are the ones from ten years ago, not the newer ones with more realistic voices.
What do you mean by "ten years ago"? Yamaha has been producing vocal synths since the 2000s, but it is quite pointless to use Vocaloid 4 technology when you can use Vocaloid 6.
People like playing drums too, but a drum machine means that if you're not any good at it or too busy but you need drum sounds you can have drum sounds.
There are rights issues if the result is that it replaces a particular singer: if you made it so that Sneaker Pimps can fire Kelli but still have her voice on subsequent songs, that's a problem. But suppose you're a bedroom musician, and you realise you've got a piece that really wants somebody with a different voice than yours to make it work - you can pay someone, but technology like this offers a cheaper, easier option.
>So how long will it be until we will be able to download something of this quality onto a future-gen Raspberry Pi which can do some AI processing, where we make an HTTP call and it starts speaking through the audio out in a perfect voice without relying on the cloud?
5 years? It's probably possible roughly whenever the larger Whisper models can run on it. Probably the next Raspberry Pi, running quantized or optimized versions of some audio model.
It may be almost possible right now if you tried really, really hard and used a small model fine-tuned on a single voice, instead of something larger and more general purpose that can do any voice. I think whisper-tiny works on a Pi in real time, right? And that's not even leveraging the GPU on the Pi. (https://github.com/ggerganov/whisper.cpp/discussions/166)
Edit: looks like medium is 30x slower on the Pi than the tiny model, so I may have been overly optimistic. I didn't realize Whisper tiny was that much faster than medium.
This method works pretty well with Tortoise, letting you use the super fast Tortoise quality settings but get quality similar to the larger models. Fine-tuning the whole thing on just one voice removes a lot of the cool capabilities of course. With Tortoise, that would still be way too slow for a Pi but potentially that same strategy could work with faster models like SoundStorm.
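To make the original question concrete, here's a minimal sketch (purely hypothetical, nothing from the paper) of the kind of no-cloud setup being imagined: a tiny HTTP endpoint on the Pi that hands text to a local model and plays the result through the audio out. The synthesize() function is a stand-in for whatever small, single-voice model ends up fitting on the hardware.

    # Hypothetical sketch: synthesize() stands in for a future on-device TTS model.
    import numpy as np
    import sounddevice as sd          # assumption: any local audio-output library would do
    from flask import Flask, request

    app = Flask(__name__)
    SAMPLE_RATE = 24_000

    def synthesize(text: str) -> np.ndarray:
        """Placeholder for a small, quantized, single-voice model; returns silence here."""
        return np.zeros(int(SAMPLE_RATE * 1.5), dtype=np.float32)

    @app.route("/speak", methods=["POST"])
    def speak():
        text = request.get_json(force=True)["text"]
        audio = synthesize(text)
        sd.play(audio, SAMPLE_RATE)   # straight out of the Pi's audio jack, no cloud round trip
        sd.wait()
        return {"ok": True, "chars": len(text)}

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

A client would then just POST {"text": "..."} to /speak; the open question is purely whether the model behind synthesize() can run in real time on Pi-class hardware.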
In terms of quality there's still a lot of room to go on long-term coherence, i.e. long audio segments. When a real person reads an audiobook, the words at the top of the page have a pretty big impact on how the words at the bottom of the page are read, and there can be some impact at any distance, page 10 to page 300. When you try audiobooks on the super high-end TTS models and listen carefully, you really notice the mismatch. It's like the reader recorded the paragraphs out of order, or like video game voice lines where you can tell the actors recorded their lines separately and weren't reacting to each other's performance.
You can bump the context window to a minute, two minutes. That's going to get you closer and probably good enough for some books. In the short term a human could simply adjust all the audio samples and manually tweak things to sound correct, so this will enable fan-created audiobooks where they take the time to get it right. But for fully automated books the mismatch drives me nuts. The performance is just soooo close for certain segments that when you get a tonal mismatch it hurts.
Over the years I've also watched CGI progress and been amazed by the steps there -- the joy of seeing yet another complex part of reality rendered was a constant.
These days, though... every new technique developed to simulate and replicate human creativity and behavior brings a constant sense of unease.
If I view or read something, do I have the right to know whether it's generated?
It's good that Bing and Bard are using the latest Microsoft and Google Cloud offerings, but it would be nice to see these speech advances (along with AudioPaLM - https://google-research.github.io/seanet/audiopalm/examples/ etc.) hit public APIs and/or user interfaces.
Bard's TTS is alright but it's clearly behind.
On that note, Bing's English/Korean TTS is really good. I also didn't realize Microsoft uses the best offerings for free TTS on edge so it blows google's default tts voices away.
Yes. I'm not saying Google's top Cloud offerings are bad, although I still think Microsoft's stuff is better.
Just that:
1. It's behind their current SOTA research.
2. You can only use those voices extensively by paying for them. Microsoft offers their best stuff in Edge for free, so for reading a PDF or web page aloud, Microsoft is far better.
It's disappointing, but I wouldn't expect research algorithms to be available immediately unless they held it back until the product is ready. I guess Apple would do that?
> I also didn't realize Microsoft uses the best offerings for free TTS on edge so it blows google's default tts voices away.
This sounds really interesting - can you share a bit more? I'm behind in this space, my parser got all jammed up, something like: "Microsoft uses [the best offerings for free TTS](as in FOSS libraries, or free as in beer SaaS?) [on edge](Edge browser, or on the edge as in client's computer?)(Is the implication that all TTS on the client's computer blows Google's default TTS voices away?)"
I believe they mean that the free TTS feature in Microsoft Edge uses their best technology, and that said tech is better than Google's default offering.
Ha, I'm not even asking for code/model releases. It's just a bit funny that what you can *pay* Google to use is so far behind what they have up and running, collecting dust.
Sure, but there was a time not too long ago when companies were still in the "good will" phase of handing out even highly valuable models like CLIP, guided-diffusion, etc. Come to think of it, it was mostly OpenAI doing this. And they kinda still do? But far more selectively. I'm just preemptively romanticizing that.
I don’t want to defend Google’s business practices, but this is such a trite comment someone always feels compelled to post on anything about Google, including even a research paper, apparently.
It's a very relevant comment. It tells you not to rely on, or expect further development of, any new Google technology, even seemingly good ones, as they can go to the graveyard like many others.
I don't bother to post the comment, but the high likelihood of any Google project/product being killed within a year or two is absolutely the first thought I have whenever a new Google project/product is announced (not because of HN posts, but because of their history), so good job on that Google.
I'll argue it's not trite. It's a concise compilation of the thousands of teeth-gnashing comments here on HN and all over the internet whenever Google randomly drowns another one of its children.
Just fucking stay away from Google products. Period.
First of all, it isn't a product. It's a f*king research paper, like dozens of others showing up on HN every day. Most of them never become a product.
Second of all, by whining ad nauseam you drown out discussion of the merits of the technology and chase people away. I hardly read Google news on HN now precisely for that reason. Imagine if "Attention Is All You Need" came out now? [0]
Save your complaint for when Google makes it a product.
>Save your complaint for when Google makes it a product.
or save yourself the trouble and find alternatives to big-G.
It's entirely their own fault that people now view all Google news as temporary and fleeting. People don't want to put time into things that'll get thrown away in a year.
Reading G research papers seems like a shortcut to me, know what will be thrown away in 2 years before it's a valid product in 1 year and someone gets huckleberry'd into devoting time and effort into implementing the dead-product-walking API.
> It's entirely their own fault that people now view all Google news as temporary and fleeting. People don't want to put time into things that'll get thrown away in a year.
Most research doesn't become a product, from Google or anyone else. As research projects they still have value, unless you are saying Google research is garbage because they have gotten into the habit of canceling their products.
> or save yourself the trouble and find alternatives to big-G
Totally valid point. No need to complain about it in a post about Google research though. It’s tiresome.
Google's non-advertising revenue in the latest quarter was about $15 billion. Is that a significant amount of non-ads product revenue? At least it is higher than the revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix, Broadcom, Qualcomm, or Salesforce in that same quarter.
I think their non-ads businesses alone would be the 6th largest US tech company by revenue. (Amazon, Apple, Microsoft, the ads business of Alphabet, Meta. Am I forgetting something?)
Revenue is easy when you lose money on every dollar. Last quarter Ads printed $21B of income; the rest was a loss, except Cloud, which for the very first time wasn't losing hundreds of millions of dollars.
When people wax eloquent about how the artisans will just find something new to do for work, what they fail to mention is that the new work is often a menial and lower-paid job. When Amazon puts mom and pop shops out of business, they don’t go start new businesses, they go get jobs at Wal-Mart.
Interesting that SoundStorm was trained to produce dialog between two people using transcripts annotated with '|' marking changes in voice. But the exact same '|' characters seem to mostly work in the Bark model out of the box and also produce a dialog?
Maybe a third or a bit more of Bark outputs come out as one person talking to themselves -- and it often misses a voice change. But the pipe characters do reliably produce audio that sounds like a dialog in performance style.
Is there some text-audio data somewhere in the training data that uses | for voice changes?
Amusingly, Bark tends to render the SoundStorm prompts sarcastically. Not sure if that's a difference in style in the models, or just Google cherry picking the more straightforward line readings as the featured samples.
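For anyone who wants to poke at this themselves, the experiment is roughly the following, using Bark's Python API (as published in its README) and the pipe-delimited prompt from the SoundStorm demo page; whether the pipe actually triggers a voice change on a given run is exactly what's in question.

    # Feed a SoundStorm-style pipe-delimited dialogue prompt to Bark and save the result.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the Bark checkpoints on first run

    prompt = "Where did you go last summer? | I went to Greece, it was amazing."
    audio = generate_audio(prompt)  # numpy float array at Bark's native sample rate

    write_wav("bark_dialogue.wav", SAMPLE_RATE, audio)

Listening to a batch of these is how you'd estimate how often the '|' is honored versus collapsed into a monologue.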
The creators won't say, as far as I know, but Bark looks to be trained on a lot of YouTube corpora (rather than typical ML audio datasets), where audio may have transcripts like that, which is why stuff like [laughs] works.
In the future, will children think it's normal to talk like, "Hey, what up, Youtube! ... Be sure to like and subscribe! ... Smash that like button! ... Let me know in the comments down below!"?
I wonder how ML trained on the tone transitions to a sponsored segment dripping with secret shame... would infect general speech.
Yeah, I often try to think about what might be in a YouTube caption when finding prompts that work in Bark. But the pipe character isn't one I remember seeing on YouTube. Maybe it's part of some other audio dataset, though. Or maybe it's on YouTube but only in non-English videos.
I wonder if work marketplaces like UpWork and Fiverr will adapt quickly enough to this new situation, where many of their services, which in the past were done by humans, can now be done by software.
Their current marketplace interface seems inadequate for this. Instead of contacting a human and then waiting for them to finish the work, buyers will want to get results right away.
Therefore they will have to change their platform to work like an app store, where sellers connect their services and buyers can use them directly.
Announcements often get played repeatedly -- "Train 101 to Lisbon is now on track 5". Why do you want to torture the station's workers with that?
Instead, make an effort to start a conversation with your fellow travelers, or graciously respond to such an effort from them. Apologies if you already do.
Taken to its conclusion, doesn't this just mean we should outlaw all forms of mechanization in order to preserve human jobs as much as possible? Would this be better for a country than just allocating a portion of the economy's output and using it to provide social safety nets?
Since you mention it, there was some news a while back about some African countries deliberately not using construction machines to build roads, as a means to foster more jobs.
Likewise, in many countries having someone at the gas station to fill the tank is still a job, and most likely even with EVs they would be the ones taking care of the charging.
At the beginning of the 20th century in the US, 3 in 4 workers were either laborers, farmers, miners, or household service workers. By the end of the century that number had fallen to 1 in 4.
Of course that wasn't a net loss; it was part of a larger economic transformation that created more higher-paying jobs.
This is the lump of labor fallacy.
It's not about "what jobs will be left", it's about the new jobs we'll invent with all the time we'll have on our hands.
There was never a fixed number of jobs, there's a fixed number of workers.
I don't think you understand. We aren't about to automate all the things humans can "currently" do. We are about to automate everything that separates us from a brick. What can possibly be left, and why would an unemployed person pay for whatever is left?
The nature of the jobs changes too, though. Do you believe jobs will become more or less skilled/autonomous, and connect people more or less with their fellow man? Some of us are pessimistic about those questions.
Where I see the benefit is in dialogue replacement. It takes a lot of time to call an actor back to the stage after they have finished the project. They might have moved on to another project, or, depending on how in demand they are, they have a tight schedule and your time with them is limited. Then on top of all that, some actors are just not very good at the process. So you have them in a room for a long time trying to get the right performance, which is sometimes difficult because the original performance was done in some specific environment.
Having a tool at my fingertips where I can feed it some of the actor's previous lines and have it belch out something to fill the gaps with set parameters, and be able to move along in the project without all the logistics, would be heaven.
It would, however, kill an entire field of expertise. It would also devalue the actor. Though it's already happening: there are already programs on the market that replace voice actors altogether and are being used in the video game space.
For the work I do, I can see the benefits it could bring. But I am also fully aware it probably will be heavily abused.
App Stores, the web, etc. How else does software as a service get sold? It’s not a new thing. Probably a lot of these things will just end up as features in existing systems.
Existing app stores like the ones on iOS and Android mostly target casual use cases, mobile devices, and on-device software, not "buy once" experiences for work via software as a service. They also do not offer a unified experience: two "text-to-speech" apps could have completely different user interfaces.
The web does not have good discovery and reputation management, and also does not provide a unified interface. That is why marketplaces like Booking.com, Amazon, Spotify, etc. have become so big.
The most impressive part of this is that they are seemingly able to produce 30 seconds of TTS from just 3 seconds of source material. That is super cool, and honestly much further along the curve than I expected it to be.
No, I am not interested in playing a game generated this way, because the exact reason I want to listen to NPC conversations is that they are written by humans.
Impressive, although the first example fails at the very last second when the virtual guy's voice glides up while saying "what?" and the autotune-ish pitch correction effect is clearly audible.
Other examples are nothing short of incredible. If it really can output minutes of credible voice after training on just a few seconds, the next step will be to make it sing. I foresee a legal storm approaching when someone uses a similar technology to put, for example, Elvis's voice (without naming him) in some advertising, so that fans would recognize him but the audio wouldn't match any of his lyrics or any of his existing songs.
I don't have anything intelligent to say about this, but it's a lot of fun making all the samples play at the same time - sort of like the HTML version of Ableton Live.
Seems like we wouldn’t be far at all from just correlating this to face movement (including subtle iris movement and blinks, not just the mouth). As long as you clearly label it as CGI it’s harmless and I’m excited for the day to come. Might be quite fun to chat with a little buddy this way
Did you read the paper? They intentionally steered the quality to ensure they sound fake. Their generated speech is "very easy to detect" according to the reference at the end of the paper.
I found that in my (high quality) studio monitors, the audio sounded fine and hard to distinguish from 24kHz wav. But in headphones, the artifacts were pretty obvious. So probably some reverberation will do a lot to cover up artifacts. In the paper, they only do a subjective comparison between the generated audio and the soundstream-encoded original audio, which seems a bit disingenuous. Listening to soundstream audio in headphones, I can hear those same artifacts.
This is nothing short of amazing. It is exciting, a bit scary as well, what the future will bring.
It just makes me sad that I cannot open this page in Safari. It will not play a single audio clip, yet Chrome plays it fine. So here we are, able to generate audio, video, and code, and do amazing things with AI, but a simple website with text and audio doesn't work on the most popular laptop out there.
Good for fraudsters and spammers, bad for anyone who ever hoped to make a living from voice acting. I'm perplexed by AI technologists' seemingly incessant drive to automate away the existence of artistic performers.
LLMs aren't great and can't be relied upon in a business setting, or at least I would not rely on them.
But think open-world games. GTA VII, for example, where all NPCs have their dialogue auto-generated in real time and also converted to audio in real time.
That would make for a world that's a lot more spontaneous, with a lot less effort.
Right now, if memory serves me right, GTA V's dialogue alone runs to 5,000 pages or more, hand-written.
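As a sketch of what that loop might look like (all the model calls are hypothetical stubs; the pipe-delimited dialogue format is borrowed from the paper's demo prompts), the interesting part is simply that both steps have to run faster than the audio they produce:

    # Hypothetical real-time NPC chatter loop: an LLM stub drafts a pipe-delimited
    # exchange, a dialogue-TTS stub voices it. Neither backend is real here.
    import time
    import numpy as np

    SAMPLE_RATE = 24_000

    def draft_exchange(npc_a: str, npc_b: str, context: str) -> str:
        """Stand-in for an on-device LLM; a real game would prompt a local model."""
        return f"Did you hear about {context}? | Yeah, the whole block is talking about it."

    def synthesize_dialogue(script: str) -> np.ndarray:
        """Stand-in for a SoundStorm-style dialogue TTS; returns silence of plausible length."""
        seconds = 0.06 * len(script)            # rough guess: ~60 ms of audio per character
        return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

    start = time.perf_counter()
    script = draft_exchange("Vendor", "Passerby", "the marina fire")
    audio = synthesize_dialogue(script)
    elapsed = time.perf_counter() - start

    print(f"{len(audio) / SAMPLE_RATE:.1f}s of audio generated in {elapsed * 1000:.1f}ms")

The whole premise only works if generation time stays well under the audio length, which is exactly what fast parallel decoders like SoundStorm are chasing.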
That's all true, but I think it's a pity that the jobs that currently exist for voice artists will disappear. Gamers and consumers will have somewhat better interactive experiences, which is good. Indie game developers will also be able to put out games with lower budgets, which is nice for them. But the market for voice acting work is largely going to dry up and blow away for people who are not already at the top of that field. People who could previously have made a modest but sufficient living as voice performers will be replaced by computer-generated voices. It will be almost impossible to make a living in that field within 5 years.
Generative models around images are nothing new and have been around for a while already. But even today, if you really want creative control and expression, you need a designer that's good with Photoshop or Illustrator etc.
This is applicable to LLMs as well. You can get them to write plausible BS, but if you really want a well-articulated, rooted-in-reality write-up about something, a human has to be brought on board.
This equally extends to voice-over. If you really want expressive and creative control to produce an outstanding rendering of something, AI isn't going to cut it.
This is only true if you assume AI isn't going to keep improving. It gets significantly better on a quarterly basis, far faster than the time it takes for an actor to develop their craft and career. The output quality of today's cutting-edge models would have been science fiction only 2-3 years ago.
I'm not so sure about the future. Such models, all of these models, don't have a well-understood input-to-output mapping, and that's going to be a problem for a very long time.
I am not. The use cases like interactive assistants for the blind will generate very little commercial activity compared to the uses (and abuses) for entertainment and marketing purposes. A good example of this from the real world is the absence of cheap/open ASL interpretation for deaf people.
Ever notice the huge font on older people's phones? So big that a screen may only contain a few lines of text. Or that people have to pull out their reading glasses every time they check their phone? Text to speech is a godsend in those cases. Enormous benefits to an increasingly older population.
Huh? How big is the blind group compared to the older population?
You are saying it's not economical to use text to speech to support blind people. I'm saying the benefits are huge for the older population. It isn't just for fraudsters or spammers as you claim.
No, I'm not saying that at all. I'm saying the resources invested in helping people will be dwarfed by those invested in crap designed to exploit them economically or criminally.
Set aside the fact that you have absolutely no proof of that claim: the criminal world is tiny compared to the number of people who benefit from TTS (God forbid if that isn't the case). Encryption, as an example, is hugely beneficial to regular people despite being used and exploited extensively in shady and questionable activities.
Completely different technology. Encryption doesn't directly replace human labor. And nobody is complaining about the existence of TTS per se, but rather pointing out that lots of resources are being poured into replacing human expressive performance skills with machinery. There's a strike on in Hollywood right now because studio owners want to strongarm performers into contractual arrangements that give away resynthesis rights of their appearance and performance in perpetuity.
None of the objections voiced to my original comment have even attempted to engage with this economic industrial reality.
I don't need your help imagining use cases; I've been in this field a lot longer than you, and have talked up the technological possibilities of AI-powered TTS here for *years*. I understand the technology very well and am bullish on it. What I'm saying is that too much of the effort is being spent solving the wrong problems. Please try reading what I wrote instead of your imaginary subtext.
They don't, otherwise there would be many former CEOs living in tents. In reality, those who control large amounts of capital are quite willing (and increasingly say so in the open) to deprive others of their livelihoods, homes, and ability to feed themselves in order to realize a marginal increase in their own wealth.