SoundStorm: Efficient Parallel Audio Generation (google-research.github.io)
293 points by sh_tomer on July 16, 2023 | 130 comments



In CGI there were always these milestones I watched getting reached: trees with leaves finally looking close to realistic, wind blowing through grass looking almost realistic, hair, jelly. It was usually Pixar shorts that showed off what they had been focusing on, and then you'd see it applied to their movies.

Then mocap, mapping digital faces onto real actors, which was first mind-blowing to see in Pirates of the Caribbean, then the apes in one of the Planet of the Apes movies... So much in the CGI industry has already reached a point where the hardest problems seem to have been solved.

When I now clicked play on the first Synthesized Dialogue from the Dialogue Synthesis section ("Where did you go last summer? | I went to Greece, it was amazing."), I was blown away. It's as if we've now reached one of those milestones where a problem appears to be fixed or cracked. Machines will be able to really sound like humans, indistinguishable from them.

Five to ten years ago, if you wanted to deal with TTS, the best option you had was to let your Android phone render a TTS into an audio file, because everything else sounded really bad. Especially open source stuff sounded absolutely horrible.

So how long will it be until we will be able to download something of this quality onto a future-gen Raspberry Pi which can do some AI processing, where we make an HTTP call and it starts speaking through the audio out in a perfect voice without relying on the cloud? 5 years?


Another question, how long until we have systems that can sing 10 octaves and we don't need/want any actual human singers anymore?


To answer that question, it is necessary to answer the question "why did we want actual human singers in the past?"

If you think it is because there was no alternative, then you will see no barriers to the adoption of the sort of system you imagine.

But if you think it is because all human art is a form (however weird, tangled, and opaque) of storytelling, and for it to work the person experiencing the art needs to believe that there must be a story to be told, then your imagined system is not that interesting.

However, we are seeing an overwhelming propensity even today for people (even people who ought to know better) to ascribe intentionality and even emotion to computer systems that demonstrably do not have it (going as far back as ELIZA in the 1960s). It seems likely to me that most people will rapidly come to believe in a sufficiently intentional and emotional backstory to your putative wunder-singer that their belief will satisfy their desire for "a person behind the art".


Whether a machine can sing like a human does not matter much for the appreciation of art: all art is self-expression, and self-expression always requires a self to express and a self to receive. Someone programmed the voice to sing something to someone for some reason.

Both human and software-generated singing exist today, and either can be boring nonsense or speak to you and leave an emotional impact.

(That said, if said machine is actually a derivative work produced from vocal samples without appropriate licensing, that’d make it dubious legally or morally. To my knowledge, preexisting software-generated singing like Vocaloid does not suffer from this problem.)


The production of art demonstrably does not require a self.

The production of art that people want to experience probably requires a self, but may only require the belief that a self was involved.

The experience of art requires a self, as do all experiences.


If your definition of art includes things made without a self (a thinking mind with intention and agency), then go ahead and ponder how the Sun and human DNA are art; I'll wait for when you come back with a new definition.


That there can be art made without a self does not imply that anything made without a self is art.

That there are things made which are art does not imply that anything made is art.

That things can be made does not imply that everything is made.


You are welcome to produce a definition that separates those things.

You have seen my definition. By definition, an act of self-expression cannot occur without a self.

Without clarifying what makes something art, your claim that art can be something produced without a self (i.e., is not self-expression) demotes art to the level of “any random thing”.


With multimodal vocal-LLMs we’ll be able to prompt any kind of personality to the singer.

“You are a castrato who sailed with Blackbeard and survived the Holocaust with a degree in Klingon studies from Oxbridge…”


As a choral singer, if there’s an app that one day allows me to sing with a fake choir of extremely good singers, I would enjoy doing that all day long. And it would allow my actual choir to practice way more, making our performances far better.


This exists right now!

Not as an app exactly, but you should check out Holly Herndon and Mat Dryhurst’s suite of tools called “Holly Plus”:

https://holly.plus/

I’m pretty sure you can access their model somehow and even train your own voice using their “spawning” approach.

She did an awesome TED talk demonstrating this:

https://www.ted.com/talks/holly_herndon_what_if_you_could_si...

Here’s a cool example, using Dolly Parton’s song “Jolene”:

https://www.youtube.com/watch?v=kPAEMUzDxuo

I don’t think it’s quite at the level of consumer use yet, but I know they’re working on it. Definitely check it out.


From what I've seen, the music industry (all genres including pop, classical, etc.) tends to be more about personality than actual content these days, so it's not really an issue.

Virtual singers are already pretty good anyway; I feel we've already passed the point of diminishing returns.

And even if you look at virtual singers, the ones which are popular are the ones from ten years ago, not the newer ones with more realistic voices.


What do you mean by "ten years ago"? Yamaha has been producing vocal synths since the early 2000s, but it is quite pointless to use the technology of Vocaloid 4 if you can use Vocaloid 6.


> these days

How far back are you counting as "these days"? I'd say the entire recording industry has always been about attaching an image to songs.


Aren't we there yet? https://en.wikipedia.org/wiki/Hatsune_Miku

I saw somewhere that this "girl" fills concert venues.


The Star Trek: Voyager episode "Virtuoso" thought of that already. Its answer was that you get something which is barely recognisable as music.


If you had to pick one and only one, which would you pick?

- art is the process of creating

- art is the outcome of creating

I know which one I'd pick.


The popularity of karaoke suggests that the quality of the singing is not especially relevant.


People like to sing along though.


People like playing drums too, but a drum machine means that if you're not any good at it, or too busy, but you need drum sounds, you can still have drum sounds.

There are rights issues if the result is that it replaces a particular singer: if you made it so that Sneaker Pimps could fire Kelli but still have her voice on subsequent songs, that's a problem. But suppose you're a bedroom musician, and you realise you've got a piece that really wants somebody with a different voice than yours to make it work - you can pay someone, but technology like this offers a cheaper, easier option.


>So how long will it be until we will be able to download something of this quality onto a future-gen Raspberry Pi which can do some AI processing, where we make an HTTP call and it starts speaking through the audio out in a perfect voice without relying on the cloud?

5 years? It's probably possible roughly whenever the larger Whisper models can run on it. Probably the next Raspberry Pi, running quantized or optimized versions of some audio model.

It may be almost possible right now if you tried really, really hard, and you used a small model fine-tuned on a single voice, instead of something larger and more general purpose that can do any voice. I think whisper-tiny works on a Pi in real time, right? And that's not leveraging the GPU on the Pi. (https://github.com/ggerganov/whisper.cpp/discussions/166)

Edit: looks like medium is 30x slower on the Pi than the tiny model, so I may have been overly optimistic. I didn't realize Whisper tiny was that much faster than medium.
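(If you want to check the tiny-vs-medium gap on your own hardware, here's a minimal sketch with the openai-whisper Python package; the model names are real, but the WAV path is just a placeholder:)

    import time
    import whisper  # pip install openai-whisper

    # "tiny" is ~39M parameters, "medium" is ~769M -- hence the speed gap.
    model = whisper.load_model("tiny")

    start = time.time()
    result = model.transcribe("sample.wav")  # placeholder test clip
    print(f"transcribed in {time.time() - start:.1f}s: {result['text']}")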

This method works pretty well with Tortoise, letting you use the super-fast Tortoise quality settings but get quality similar to the larger models. Fine-tuning the whole thing on just one voice removes a lot of the cool capabilities, of course. With Tortoise, that would still be way too slow for a Pi, but potentially that same strategy could work with faster models like SoundStorm.

In terms of quality there's still a lot of room to go with long-term coherence, like long audio segments. When a real person reads an audiobook, the words at the top of the page have a pretty big impact on how the words at the bottom of the page are read. And there can be some impact at any distance, page 10 to page 300. When you try audiobooks on super high-end TTS models and listen carefully, you really notice the mismatch. It's like the reader recorded the paragraphs out of order, or video game voice lines where you can tell the actors recorded all the lines separately and were not reacting to each other's performance.

You can bump the context windows: a minute, two minutes. That's gonna get you closer and probably good enough for some books. In the short term a human could simply adjust all the audio samples and manually tweak things to sound correct. So this will enable fan-created audiobooks where they take the time to get it right. But for fully automated books the mismatch drives me nuts. The performance is just soooo close for certain segments that when you get a tonal mismatch it hurts.


If you need a really compact form factor, you can buy a Jetson right now and run more complex models on it. It's pricey though.


Over the years I've also watched CGI progress and been amazed by the steps there -- the joy of seeing yet another complex part of reality rendered was a constant.

These days, though... every new technique developed to simulate and replicate human creativity and behaviors comes with a constant sense of unease.

If I view or read something, do I have the right to know whether it's generated?


I would bet 2 years tops


It's good that Bing and Bard are using the latest Microsoft and Google Cloud offerings, but it would be nice to see these speech advances (along with AudioPaLM - https://google-research.github.io/seanet/audiopalm/examples/ etc.) hit public APIs and/or user interfaces.

Bard's TTS is alright but it's clearly behind.

On that note, Bing's English/Korean TTS is really good. I also didn't realize Microsoft uses its best offerings for free TTS in Edge, so it blows Google's default TTS voices away.


I used Azure TTS for a product demo voice-over recently, and nobody I showed it to knew it wasn't a human doing it!

Some of Azure's voices are better than others, and the TTS web app has a few minor bugs, but overall I was really pleased with the whole experience.
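(If anyone wants to reproduce this kind of voice-over, here's a minimal sketch using the Azure Speech SDK for Python; the key, region, output filename, and voice choice are placeholder assumptions, not what I actually used:)

    import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # one of the neural voices

    # Write the synthesized voice-over straight to a WAV file.
    audio_config = speechsdk.audio.AudioOutputConfig(filename="voiceover.wav")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    synthesizer.speak_text_async("Welcome to the product demo.").get()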


Have you tried Google Cloud Studio voices?

https://cloud.google.com/text-to-speech/docs/wavenet#studio_...
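For reference, calling one of them via the google-cloud-texttospeech Python client looks roughly like this (a minimal sketch; it assumes GCP credentials are already configured, and en-US-Studio-O is one of the Studio voice names):

    from google.cloud import texttospeech  # pip install google-cloud-texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text="Have you tried Studio voices?"),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US", name="en-US-Studio-O"  # a Studio voice
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    with open("studio.mp3", "wb") as f:
        f.write(response.audio_content)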


Yes. I'm not saying Google's top Cloud offerings are bad, although I still think Microsoft's stuff is better.

Just that

1. It's behind their current SOTA research.

2. You can only use those voices extensively by paying for them. Microsoft offers its best stuff in Edge for free, so for reading aloud a PDF or web page, Microsoft is far better.


It's disappointing, but I wouldn't expect research algorithms to be available immediately unless they held it back until the product is ready. I guess Apple would do that?


By “SOTA” TTS I think you mean LLM-based TTS? With sound and language tokens trained GPT-style?

Without going into too much detail, imo they’re not really usable right now for TTS use cases.


Not necessarily LLM-style. The above isn't, for instance.

Also, Google's Studio voices are excellent. Definitely better than Microsoft's best, albeit a very limited selection of voices.


> I also didn't realize Microsoft uses the best offerings for free TTS on edge so it blows google's default tts voices away.

This sounds really interesting - can you share a bit more? I'm behind in this space, my parser got all jammed up, something like: "Microsoft uses [the best offerings for free TTS](as in FOSS libraries, or free as in beer SaaS?) [on edge](Edge browser, or on the edge as in client's computer?)(Is the implication that all TTS on the client's computer blows Google's default TTS voices away?)"


The top voices you'd pay for on Azure's TTS services can be used for free to read web page (and PDF) text in Microsoft Edge. I don't mean open source.

This is not the case with Google.


I didn't know that. Edge is really good. I just downloaded it, and features like that are great.


I believe they mean that the free TTS feature in Microsoft Edge uses their best technology, and that said tech is better than Google's default offering.


> public api's and/or user interfaces

sigh. Google used to release _some_ models. Guess the fun early days are coming to an end.


Ha, I'm not even asking for code/model releases. It's just a bit funny that what you can *pay* Google to use is so far behind what they have up and running, collecting dust.


I'm speculating here, but to me it looks like the product teams are not working closely with the research (R&D) teams.

Even the demo website is on Github Pages instead of a Google domain/blog.


Also true.


Google is a business and this is clearly a valuable product.


Sure, but there was a time not too long ago when companies were still in the "good will" phase of handing out even highly valuable models like CLIP, guided-diffusion, etc. Come to think of it, it was mostly OpenAI doing this. And they kinda still do? But far more selectively. I'm just preemptively romanticizing that.


[flagged]


I don’t want to defend Google’s business practices, but this is such a trite comment someone always feels compelled to post on anything about Google, including even a research paper, apparently.


It's a very relevant comment. It tells you not to rely on, or expect further development of, any new Google technology, even seemingly good ones, as it can go to the graveyard like many others.


I don't bother to post the comment, but the high likelihood of any Google project/product being killed within a year or two is absolutely the first thought I have whenever a new Google project/product is announced (not because of HN posts, but because of their history), so good job on that, Google.


I'll argue it's not trite. It's a concise compilation of the thousands of teeth-gnashing comments here on HN and all over the internet whenever Google randomly drowns another one of its children.

Just fucking stay away from Google products. Period.


First of all, it isn't a product. It's a f*king research paper, like dozens of others showing up on HN every day. Most of them never become products.

Second of all, by whining nauseatingly you drown out discussion of the merits of the technology, and chase people away. I hardly read Google news on HN now precisely for that reason. Imagine if "Attention Is All You Need" came out now? [0]

Save your complaint for when Google makes it a product.

[0] - https://news.ycombinator.com/item?id=15938082


>Save your complaint for when Google makes it a product.

or save yourself the trouble and find alternatives to big-G.

It's entirely their own fault that people now view all Google news as temporary and fleeting. People don't want to put time into things that'll get thrown away in a year.

Reading Google research papers seems like a shortcut to me: know what will be thrown away in 2 years before it's even a valid product in 1 year, and before someone gets huckleberry'd into devoting time and effort to implementing the dead-product-walking API.


> It's entirely their own fault that people now view all Google news as temporary and fleeting. People don't want to put time into things that'll get thrown away in a year.

Most research doesn't become a product, from Google or anyone else. As research projects they still have value, unless you are saying Google research is garbage because Google has gotten into the habit of canceling its products.

> or save yourself the trouble and find alternatives to big-G

Totally valid point. No need to complain about it in a post about Google research though. It’s tiresome.


> unless you are saying Google research is garbage because they get into the habit of canceling their products.

Bravo! Now you're getting it!

(actually it's worse than garbage because they'll lock it all behind patents before abandoning any finished product)


A product is something you sell to make money. The only real Google product is users sold to advertisers.


Google's non-advertising revenue in the latest quarter was about $15 billion. Is that a significant amount of non-ads product revenue? At least it is higher than the revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix, Broadcom, Qualcomm, or Salesforce in that same quarter.

I think their non-ads businesses alone would be the 6th largest US tech company by revenue. (Amazon, Apple, Microsoft, the ads business of Alphabet, Meta. Am I forgetting something?)


Revenue is easy when you lose money on every dollar. Last quarter Ads printed $21B of income; the rest was a loss, except Cloud, which for the very first time wasn't losing hundreds of millions.

https://abc.xyz/assets/investor/static/pdf/2023Q1_alphabet_e...


Uh, what about all of their paid cloud offerings?


A distraction. It generated a whole 1% of overall profit last quarter, and that was the first time it didn't lose money. https://www.cnbc.com/2023/04/25/googles-cloud-business-turns...


How in God's name is this a controversial take? Good lord.


When people wax eloquent about how the artisans will just find something new to do for work, what they fail to mention is that the new work is often a menial and lower-paid job. When Amazon puts mom and pop shops out of business, they don’t go start new businesses, they go get jobs at Wal-Mart.


Interesting that SoundStorm was trained to produce dialog between two people using transcripts annotated with '|' marking changes in voice. But the exact same '|' characters seem to mostly work in the Bark model out of the box and also produce a dialog?

Maybe a third or a bit more of Bark outputs are one person talking to themselves -- it often misses a voice change. But the pipe characters do reliably produce audio that sounds like a dialog in performance style.

https://twitter.com/jonathanfly/status/1675987073893904386

Is there some text-audio data somewhere in the training data that uses | for voice changes?

Amusingly, Bark tends to render the SoundStorm prompts sarcastically. Not sure if that's a difference in style in the models, or just Google cherry picking the more straightforward line readings as the featured samples.
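For anyone who wants to reproduce the pipe trick, here's roughly the snippet I use with suno-ai/bark's Python API (a minimal sketch; the prompt is the SoundStorm demo line, and the pipe behavior is undocumented, so results vary):

    from scipy.io.wavfile import write as write_wav
    from bark import SAMPLE_RATE, generate_audio, preload_models  # pip install git+https://github.com/suno-ai/bark.git

    preload_models()  # downloads the model weights on first run

    # "|" is the SoundStorm-style speaker-change marker.
    prompt = "Where did you go last summer? | I went to Greece, it was amazing."
    audio = generate_audio(prompt)  # numpy array at 24 kHz
    write_wav("dialog.wav", SAMPLE_RATE, audio)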


The creators won't say, as far as I know, but Bark looks to be trained on a lot of YouTube corpora (rather than typical ML audio datasets), where audio may have transcripts like that, which is why stuff like [laughs] works.


In the future, will children think it's normal to talk like, "Hey, what up, Youtube! ... Be sure to like and subscribe! ... Smash that like button! ... Let me know in the comments down below!"?

I wonder how ML trained on the tone transitions to a sponsored segment dripping with secret shame... would infect general speech.



Yeah, I often try to think about what might be in a YouTube caption when finding prompts that work in Bark. But the pipe character isn't one I remember seeing on YouTube. Maybe it's part of some other audio dataset, though. Or maybe it's on YouTube but only in non-English videos.


I wonder if work marketplaces like UpWork and Fiverr will adapt quickly enough to this new situation, where many of their services, which in the past were done by humans, can now be done by software.

Their current marketplace interface seems inadequate for this. Instead of contacting a human and then waiting for them to finish the work, buyers will want to get results right away.

Therefore they will have to change their platforms to work like an app store, where sellers connect their services and buyers can use those services directly.


Why does everybody focus on "how will this replace humans?" It's just a really good text-to-speech.


Because it will replace humans, and that's worth thinking about?


Maybe because I no longer hear friendly human voices at train stations, but rather computer-generated train announcements?

While those people are now looking for jobs elsewhere.


Announcements often get played repeatedly -- "Train 101 to Lisbon is now on track 5". Why would you want to torture the station's workers with that?

Instead, make an effort to start a conversation with your fellow travelers, or graciously respond to such an effort from them. Apologies if you already do.


Better a tortured job that puts food on the table than none at all.


Taken to its conclusion, doesn't this just mean we should outlaw all forms of mechanization in order to preserve human jobs as much as possible? Would this be better for a country than just allocating a portion of the economy's output and using it to provide social safety nets?


Since you mention it, there was some news a while back about some African countries deliberately not using construction machines to build roads, as a means to foster more jobs.

Likewise, in many countries having someone at the gas station to fill the tank is still a job, and most likely even with EVs they would be the ones taking care of the charging.


tell that to the kids in Nike sweatshops


Their families appreciate it.

As does everyone who buys Nike.


At the beginning of the 20th century in the US, 3 in 4 workers were either laborers, farmers, miners, or household service workers. By the end of the century that number had fallen to 1 in 4.

Of course that wasn't a net loss; it was part of a larger economic transformation that created more higher-paying jobs.


Looking at the employment rate across the US, it doesn't look like it.

Especially for the people living paycheck to paycheck, without any kind of healthcare support.


Fantastic! That's a massive efficiency gain.

We will not run out of productive things to do with our time. Labor force participation has stayed in the 60-70% range despite centuries of automation.


Lovely capitalism.


Personally I can't wait for all the streets to be lined with the homeless like in SF. So good.


It's kinda sad to see you believe that this is the inevitable outcome.


Well, if we imagine it, the only things that will be left are physical jobs that can't be done by computers.

At least until they get clever enough to start a transformers line factory.


This is the lump of labor fallacy. It's not about "what jobs will be left", it's about the new jobs we'll invent with all the time we'll have on our hands.

There was never a fixed number of jobs, there's a fixed number of workers.


I don't think you understand. We aren't about to automate all things humans can "currently" do. We are about to automate everything that separates us from a brick. What can possibly be left, and why would an unemployed person pay for whatever is left?


The nature of the jobs change too though. Do you believe jobs will become more or less skilled/autonomous and connect people more or less with their fellow man? Some of us are pessimistic about those questions.


> There was never a fixed number of jobs, there's a fixed number of workers.

Isn’t this the ‘problem’ that AI is trying to solve?


Well, we can also return to feudalism.


> where many of their services, which in the past were done by humans, can now be done by software.

Their users are already using AI to do the work that they are supposed to do. I think that's fine.


Where I see the benefit is in dialogue replacement. It takes a lot of time to call an actor back to the stage after they have finished the project. They might have moved on to another project, or, depending on how in demand they are, they have a tight schedule and your time with them is limited. Then on top of all that, some actors are just not very good at the process. So you have them in a room for a long time trying to get the right performance, which is sometimes difficult because that performance was done in some type of specific environment.

Having a tool at my fingertips where I can feed it some of the actor's previous lines and have it belch out something to fill the gaps, with set parameters, and be able to move along in the project without all the logistics would be heaven.

It would, however, kill an entire field of expertise. It would also devalue the actor as well. Though it's already happening: there are already programs on the market that replace voice actors altogether and are being used in the video game space.

For the work I do, I can see the benefits it could bring. But I am also fully aware it probably will be heavily abused.


Why wouldn't people just use existing software markets?


For example?


App Stores, the web, etc. How else does software as a service get sold? It’s not a new thing. Probably a lot of these things will just end up as features in existing systems.


Existing app stores like the ones on iOS and Android mostly target casual use cases, mobile devices, and on-device software, not "buy once" experiences for work via software as a service. They also do not offer a unified experience: two text-to-speech apps could have completely different user interfaces.

The web does not have good discovery and reputation management, and also does not provide a unified interface. That is why marketplaces like Booking.com, Amazon, Spotify, etc. have become so big.


The most impressive part of this is that they are seemingly able to produce 30 seconds of TTS from just 3 seconds of source material. That is super cool and honestly much further along the curve than I expected it to be.


With all the recent advances, are there any decent TTS voices for Linux that are not complicated to set up for a regular user?


No, I am not interested in playing a game generated this way, because the exact reason I want to listen to NPC conversations is that they are written by humans.


Impressive, although the first example fails at the very last second, when the virtual guy's voice glides up while saying "what?" and the autotune-ish pitch correction effect is clearly audible. The other examples are nothing short of incredible. If it really can output minutes of credible voice just by training it with a few seconds, the next step will be to make it sing. I foresee a legal storm approaching when someone uses a similar technology to put, for example, Elvis's voice (without naming him) in some advertising, so that fans would recognize him but the audio wouldn't match any of his lyrics or any existing songs by him.


I don't believe the first example was created by SoundStorm. It sounds forced and modulated to me.


I don't have anything intelligent to say about this, but it's a LOT of fun making all the samples play at the same time - sort of like the HTML version of Ableton Live.



Seems like we wouldn't be far at all from correlating this with face movement (including subtle iris movement and blinks, not just the mouth). As long as it's clearly labeled as CGI it's harmless, and I'm excited for the day to come. Might be quite fun to chat with a little buddy this way.


I've wasted (counting) about 300 seconds of my life listening to these audio files and they all sound and seem fake...


Did you read the paper? They intentionally steered the quality to ensure they sound fake. Their generated speech is "very easy to detect" according to the reference at the end of the paper.


I found that on my (high-quality) studio monitors, the audio sounded fine and hard to distinguish from 24kHz WAV. But in headphones, the artifacts were pretty obvious. So probably some reverberation will do a lot to cover up artifacts. In the paper, they only do a subjective comparison between the generated audio and the SoundStream-encoded original audio, which seems a bit disingenuous. Listening to SoundStream audio in headphones, I can hear those same artifacts.


Just to be clear, one could mistake them for some (voice) actor reading a book (maybe), but even to my untrained ear they sound fake and artificial.

Am I missing something?


It's meant to sound artificial. The focus is on speed and consistency.


This is nothing short of amazing. It is exciting, a bit scary as well, what the future will bring.

It just makes me sad that I cannot open this page in Safari. It will not play a single audio clip, yet Chrome plays them fine. So here we are, able to generate audio, video, and code, and do amazing things with AI, but a simple website with text and audio doesn't work on the most popular laptop out there.


If you want to use the modern multimedia web, you can't use Lynx or Safari.


They are not comparable.

I understand that some applications might prefer "other browsers", but for a simple page that plays audio snippets this was really a disappointment.


That name is a blast from the past. Anyone remember Nvidia SoundStorm?


Good for fraudsters and spammers, bad for anyone who ever hoped to make a living from voice acting. I'm perplexed by AI technologists' seemingly incessant drive to automate away the existence of artistic performers.


LLMs aren't great and can't be relied upon in a business setting, or at least I would not rely on them.

But think open-world games. GTA VII, for example, where all NPCs have their dialog auto-generated in real time and also converted to audio in real time.

That's going to be a world that's a lot more spontaneous, with a lot less effort.

Right now, if memory serves me right, GTA V's dialog alone runs to 5,000 pages or more, hand-written.


That's all true, but I think it's a pity that the jobs that currently exist for voice artists will disappear. Gamers and consumers will have somewhat better interactive experiences, which is good. Indie game developers will also be able to put out games with lower budgets, which is nice for them. But the market for voice acting work is largely going to dry up and blow away for people who are not already at the top of that field. People who could previously have made a modest but sufficient living as voice performers will be replaced by computer-generated voices. It will be almost impossible to make a living in that field within 5 years.


Generative image models are nothing new and have been around for a while. But even today, if you really want creative control and expression, you need a designer who's good with Photoshop or Illustrator, etc.

This is applicable to LLMs as well. You can get one to write plausible BS, but if you really want a rooted-in-reality, well-articulated write-up about something, a human has to be taken on board.

This equally extends to voice-over. If you really want the expressive and creative control to produce an outstanding rendering of something, AI isn't going to cut it.


This is only true if you assume AI isn't going to keep improving. It gets significantly better on a quarterly basis, far faster than the time it takes for an actor to develop their craft and career. The output quality of today's cutting-edge models would have been science fiction only 2-3 years ago.


I'm not so sure about the future. These models, all of them, don't have a well-understood input-output mapping, and that's going to be a problem for a very long time.


With some select exceptions, the quality of AAA storytelling has deteriorated.

I welcome any tech that makes indies more competitive.


You are being deliberately pessimistic. There are a million fantastic, practical uses for text-to-speech.


I am not. Use cases like interactive assistants for the blind will generate very little commercial activity compared to the uses (and abuses) for entertainment and marketing purposes. A good example of this from the real world is the absence of cheap/open ASL interpretation for deaf people.


Ever notice the huge fonts on older people's phones? So big that a screen may only contain a few lines of text. Or that people have to pull out their reading glasses every time they check their phone? Text to speech is a godsend in those cases. Enormous benefits to an increasingly older population.


'helping blind people' was literally the first use case I mentioned. Maybe you should have read the comment before reacting to it.


Huh? How big is the blind group compared to the older population?

You are saying it's not economical to use text to speech to support blind people. I'm saying the benefits are huge for the older population. It isn't just for fraudsters or spammers as you claim.


No, I'm not saying that at all. I'm saying the resources invested in helping people will be dwarfed by those invested in crap designed to exploit them economically or criminally.


Setting aside the fact that you have absolutely no proof of that claim, the criminal world is tiny compared to the number of people who benefit from TTS (God forbid if that isn't the case). Encryption, as an example, is hugely beneficial to regular people despite being used and exploited extensively in shady and questionable activities.


Completely different technology. Encryption doesn't directly replace human labor. And nobody is complaining about the existence of TTS per se, but rather pointing out that lots of resources are being poured into replacing human expressive performance skills with machinery. There's a strike on in Hollywood right now because studio owners want to strongarm performers into contractual arrangements that give away resynthesis rights of their appearance and performance in perpetuity.

None of the objections voiced to my original comment have even attempted to engage with this economic industrial reality.


Imagine having an app on your phone that turns any ebook into an audiobook.

Imagine replacing crappy phone menus with polite virtual assistants that actually understand what you're saying.

Imagine an AI language tutor that speaks every language in the world fluently. Or a universal speech-to-speech translator.

And that's just off the top of my head. Clever people will come up with a lot more uses, I'm sure.


I don't need your help imagining use cases; I've been in this field a lot longer than you, and have talked up the technological possibilities of AI-powered TTS here for *years*. I understand the technology very well and am bullish on it. What I'm saying is that too much of the effort is being spent on solving the wrong problems. Please try reading what I wrote instead of your imaginary subtext.


Why spare artists if everyone else gets replaced by technology?


They don't, otherwise there would be many former CEOs living in tents. In reality, those who control large amounts of capital are quite willing (and increasingly say so in the open) to deprive others of their livelihoods, homes, and ability to feed themselves in order to realize a marginal increase in their own wealth.


The functions of CEOs and board members are closer to being replaced than, say, those of roofers or garbage workers.

It will be interesting to see what happens when that pressure is applied...


Capitalism will never allow capitalists to be made redundant - that's not how it works.


It will if it's more profitable, which it is.


Close. Too flat of a sound overall.


Why would you choose to synthesize audio using such awful valley girl accents?



