
> The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.

This part stuck out to me during the Google I/O demo, as an intentional deficiency is an interesting design decision.



> as an intentional deficiency is an interesting design decision.

Well, in semantics/pragmatics these discourse particles are often not deficiencies at all. They are signals with a practical semantic purpose. "Hmm"s and "uh"s can signal attentiveness, turn-taking (turn holding, turn yielding, etc.), agreement - just to name a few.

For any machine system to be able to pass as human, it will have to be able to control these nuances or people will pick up on something being wrong, though they might not be able to articulate precisely what.


I really enjoyed the machine's "uhs" and "uhms" in the demo speech. However, I felt the "uh-huh"s sounded forced. It's funny how these subtleties are very important in human conversation.


I think probably because "uh-huh" can have many different meanings based on inflection!

As a "non-word", it relies heavily on how it is conveyed.

Imagine someone asks you a question, I bet you can answer using just the word "uh-huh" but conveying these different emotions:

rude, perky, bored, upset, annoyed, dubious, excited

and probably a dozen more.

Even using the "perky" or "happy" one in a situation where it isn't warranted might sound rude or unthoughtful!


It's not a new thing. A famous tax preparation program introduced a "compute" screen that took a few seconds, to make people more comfortable with the results even though the computation itself was essentially instantaneous.


It's really just an audio version of a loading bar or spinner - users get really uncomfortable if the UI becomes unresponsive for even a few hundred milliseconds, but they'll wait for several seconds if it looks like something is happening.

See also:

https://en.wikipedia.org/wiki/Comfort_noise


People have learned that the spinner doesn't indicate real progress, though. The progress bar still has some life in it, except that those are often fake, not measuring actual progress.


OS-level cursor spinners like the Mac pinwheel have lost credibility, because they don't reliably indicate whether the system is temporarily unresponsive or needs to be restarted. Modern multitasking OSes have a wide range of situations in which they can become mostly unresponsive without actually crashing.

Spinners on the application or UI element level are more credible, but generally worse than a progress bar. They're still very useful as a comfort indicator for short delays.

Progress bars have very low credibility on Windows, because users have learned that they're basically useless as an indicator of wait time. A progress bar might get stuck at 7%, then suddenly rush to 100%; conversely, it might get stuck at 95% but never finish. The bar offers no real indication of the actual level of progress; in most cases, this could be greatly improved with a bit of educated guesswork.

A completely fictitious progress bar can be extremely credible, because it's totally predictable - if you need to create a 10 second delay, then it's easy to make the bar progress linearly from 0% to 100% in that time. Users learn very quickly that your progress bar tells the truth about how long they'll be waiting, even though it's lying about the reason for the wait.
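
A minimal sketch of such a purely time-driven bar (the names and the 10-second figure are illustrative, not anyone's actual implementation):

    // Hypothetical sketch: a progress bar driven only by elapsed time.
    // It fills linearly over totalMs, regardless of what the backend is doing.
    function fakeProgressBar(totalMs: number, onUpdate: (pct: number) => void): void {
      const start = Date.now();
      const timer = setInterval(() => {
        const pct = Math.min(100, ((Date.now() - start) / totalMs) * 100);
        onUpdate(Math.round(pct)); // e.g. set the bar's width to pct%
        if (pct >= 100) clearInterval(timer);
      }, 100);
    }

    // Fill the bar over exactly 10 seconds.
    fakeProgressBar(10_000, pct => console.log(`${pct}%`));

The bar always tells the truth about the wait time, because the wait time is whatever the bar says it is.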


> Progress bars have very low credibility on Windows, because users have learned that they're basically useless as an indicator of wait time. A progress bar might get stuck at 7%, then suddenly rush to 100%; conversely, it might get stuck at 95% but never finish. The bar offers no real indication of the actual level of progress

I disagree with this; I find the progress bars more credible with erratic timing. (And ideally, a display of the task currently at hand, like "Copying tiny file. Copying tiny file. Copying giant file............")

A progress bar that smoothly fills from 0 to 100 looks like an animation that somebody thought would make you happy to watch. A progress bar that lags at 7% and then rushes the rest of the way looks like the software has some internal metric for task completion, and is reporting according to that metric. This implies that when the number changes, progress has happened, which isn't the case for a progress bar that isn't affected by workload.

The software can't use "how much time has elapsed?" as a progress metric, because it doesn't know how much time things will take, and because the passage of time does not actually cause -- or reflect -- any progress. That progress bar would be a spinner, not a progress bar.


> Spinners on the application or UI element level are more credible, but generally worse than a progress bar. They're still very useful as a comfort indicator for short delays.

Strongly disagree. A spinner on the web UI element that lasts longer than ~1 second indicates for me that the site's JavaScript broke again, and it's time to reload or wait for the devs to notice and fix it.


He's not talking about the cursor.

He's talking about a circular loading animation. Like the one that replaces the submit button when you're making a post on Twitter/Facebook.


I'm talking exactly about that spinner. It's a lie. You quickly learn it has no relation whatsoever to what's happening in the background. And indeed it doesn't, because it's an animated GIF, completely detached from any logic or networking code!

(Compare the CLI spinner/fan - that "/ - \ |" animation used to indicate progress. There you know that each tick of the spinner means work has been done, because it has to be animated from code, and it's much simpler to just update it from the code that does the work.)
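
A rough sketch of that kind of work-driven spinner (the helper names are made up): the frame only advances when the working code calls tick(), so every visible movement corresponds to a unit of work that actually completed.

    // Hypothetical sketch: the "/ - \ |" spinner advances only when the
    // code doing the work calls tick(), so each frame means real progress.
    const frames = ["|", "/", "-", "\\"];
    let frame = 0;

    function tick(): void {
      process.stdout.write(`\r${frames[frame]} working...`);
      frame = (frame + 1) % frames.length;
    }

    function processItems(items: string[]): void {
      for (const item of items) {
        doWork(item); // stand-in for whatever the real per-item work is
        tick();       // the spinner only moves after a unit of work completes
      }
      process.stdout.write("\rdone.        \n");
    }

    function doWork(item: string): void {
      // placeholder for the actual work
    }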


That's not true at all. In the websites I and many others build, that loading spinner is linked directly to network code.

The spinner appears when a request is made. It disappears when the request is resolved.
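
Roughly this pattern, in other words (a sketch with assumed helper names and endpoint); the finally block is what keeps a failed request from leaving the spinner looping forever:

    // Hypothetical sketch: spinner shown when the request starts,
    // hidden when it settles - whether it succeeds or fails.
    function showSpinner(): void { /* e.g. unhide the spinner element */ }
    function hideSpinner(): void { /* e.g. hide it again */ }

    async function submitPost(body: unknown): Promise<void> {
      showSpinner();
      try {
        const res = await fetch("/api/posts", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(body),
        });
        if (!res.ok) throw new Error(`Request failed: ${res.status}`);
      } finally {
        hideSpinner(); // without this, an error leaves the spinner spinning forever
      }
    }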


I was talking about animation. Show/hide on request made/resolved gives only binary information about starting and finishing something. But the spinning animation itself does not represent any operations being executed. It may very well be that the request failed and a bug in JS made it not remove the spinner. You end up with a forever-looping animation of "work", even though no work is being done. This makes the spinner an untrustworthy element.


Still better than nothing? Sure, maybe sometimes exceptions aren't handled properly, but at least you know that it was trying to do something, rather than having users click a submit button 10x because there was no UI feedback whatsoever.


The most annoying part of progress bars is the fact that programs so often use multiple bars. What's the point of watching a bar slowly reach 100%, only for it to be replaced with another progress bar that starts from 0 again?


My apps add a second "outer" progress bar for that use-case.
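
Presumably something along these lines (a sketch; the callback names are illustrative): the outer bar tracks the whole batch, while the inner bar restarts for each task.

    // Hypothetical sketch of a two-level progress display:
    // the outer bar covers the whole batch, the inner bar restarts per task.
    async function runBatch(
      tasks: Array<(onProgress: (pct: number) => void) => Promise<void>>,
      outer: (pct: number) => void,
      inner: (pct: number) => void,
    ): Promise<void> {
      for (let i = 0; i < tasks.length; i++) {
        inner(0);                          // inner bar resets for each task
        await tasks[i](pct => inner(pct)); // each task reports its own 0-100
        outer(Math.round(((i + 1) / tasks.length) * 100)); // outer advances once per task
      }
    }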


The "please wait while we verify your passcode" on our corporate phone conference system drives me nuts. In the time that it took to speak that sentence, the passcode could have been verified millions of times.


That may be yet another use case for delays: it makes brute-forcing (or even plain guessing of a few common codes) a lot slower.
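
A sketch of that idea (the three-second figure and the names are illustrative): a fixed delay per attempt makes every guess expensive, even though the check itself is instant.

    // Hypothetical sketch: a fixed delay before verifying, so every guess
    // costs the caller a few seconds no matter how fast the check itself is.
    const EXPECTED_CODE = "123456"; // illustrative only

    async function verifyPasscode(code: string): Promise<boolean> {
      await new Promise(resolve => setTimeout(resolve, 3000)); // artificial wait
      return code === EXPECTED_CODE;
    }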


In true market economy fashion, the comfort noise is also a perfect advertising opportunity.

For instance, I frequently deal with ATMs that display "please wait" screens between every operation. Those screens usually last between 1 and 3 seconds, and it's obviously because the operations take that long, and totally not because they also display a half-screen or full-screen ad...


I've heard the HP-12C calculator also slows down its screen refresh on purpose: when it first came out it was blazing fast, and people couldn't believe the math was right.


This is a pretty common pattern. Lots of websites also have "establishing secure connection" interstitials for the same reason.


Yep, and the 10-second "deal" compilations for travel packages really happen in a fraction of a second. They just purposefully delay the results to make it seem like they are doing a lot of processing in finding all the possible deals and showing you the best ones.


It can be a more friendly way of rate-limiting expensive DB queries. An interstitial that says "too many queries, try again in 10 seconds" is far more annoying than a loading bar.


Yup. We have a similar thing at my company. Every time we try to test out of the loading animation, conversion and retention go down. It’s an amazing thing to see.


That's the opposite of what parent and other commenters are saying. Users prefer the loading animations, according to the growth hackers.


I think they were agreeing. "Test out of" seems to be another way of saying "we tried getting rid of the spinners but people didn't like it"?


That famous tax preparation software added several screens to "review" the data.


Most of the flight search companies do the same ("Finding the best/cheapest flights for you"). It's almost instantaneous, but they introduce this artificial wait.


That seems unlikely. Flight search really does take a long time, because they need to make API calls to external services for most customer requests, and they need to refresh prices roughly hourly and so cannot rely on cached data. Also, even the best flight search websites are frustratingly slow. If that delay were created intentionally, then they've already lost me as a customer as a result.


I can't seem to find that post right now, but a person (on Quora or Reddit, I think) who worked on the development team of a flight search company mentioned this.


I don't know if I'd call it a "deficiency" - if we interpret "disfluency" in a literal sense as "not flowing" without negative connotation, then the interruptions (hmm, uh, okay) are actually communicating useful information to the other party. I might even say that omitting those interruptions (and replacing them with, say, dead silence) might be poor communication.


The "um" isn't a deficiency, but the slow response is. If the response is artificially delayed to give the appearance of slow thinking, and an "um" added to fill the artificially long silence, that's an artificial deficiency.


I interpreted it differently. It isn't to give the appearance of slow thinking. It is to wait for the other person to be ready to accept the answer.

When talking to real humans, I've encountered people who don't do this, and I find it makes communication difficult and frustrating.

I'm not 100% sure why I need this pause, but I know I need it. Maybe I'm considering whether my question made sense or needs corrections/additions, which means I can't focus on the answer yet. Or maybe it takes time to switch the brain from "speaking mode" to "listening mode".

At any rate, when people answer without pausing, I have to ask them to repeat the first few words they said because I didn't catch them. And the reason I didn't catch them wasn't mumbling or background noise or anything. Well-formed sounds made it to my ear just fine, but my brain wasn't ready to accept them for a fraction of a second.


It's not a deficiency if understanding is increased. If that fake pause increases the listener's understanding of the sentence (it might), then the 'slow response' is not a deficiency but an improvement.

Edit: should the robot talk at 2x normal speaking speed in order to more quickly convey the necessary information? Slowing the speech down artificially so a human could easily understand it sounds like a deficiency to me. (By your definition).


Ums and other filler words are not as bad as they are made out to be among the public speaking crowd: https://www.eab.com/daily-briefing/2016/07/29/um-filler-word...


Reminds me of comfort noise in the telephone system.

Even though the system encodes silences noise-free (to improve compression), it deliberately inserts noise because otherwise people think the line is dead.


Similar to how, when designing a virtual face, it looks more natural if it has some slight asymmetries and "defects"; and when designing a synthetic drumbeat, if it's "perfect" it sounds totally robotic.

Imperfection is natural and comfortable. Perfect corners and edges are artificial and weird to the point of being distracting.


People are more willing to get on board with bots acting like people than the other way around.

The speech disfluencies used by Duplex in the salon and restaurant interactions are perfect examples of why natural speech sounds natural. It's the cadence as well as the timing.


Google Maps has the most pleasant and human-sounding voice approach I’ve encountered in any such system.

All other GPS guidance voices sound incredibly crude and mechanical in comparison.


The Brazilian Portuguese voice is not that great.


In my city I can recognize that Google has two different 'voices' or voice libraries. They sound slightly different. I'm curious how that works and why it's not all done with one.


I've noticed this as well. My working hypothesis is that one is for high(er) bandwidth and the other for low bandwidth situations.


My understanding is that the "low-fi", more robotic one uses an offline TTS engine for when there is no connectivity. When connectivity is good, it will switch to the better, cloud-based one.


Similar to lens flares in (first person) video games.


If you are building a system that mimics human speech, you need to teach it to be imperfect and to use common parlance. Otherwise you will fall into the uncanny valley. If you listen to the conversation again, there are several points where they lose immersion. For example, no one would say "12 pm"; they would say "noon" instead. Google has clearly done some impressive work here, and I'm now a bit more confident/scared that they will be able to successfully fool me in the next few years.


Your argument about calling it "noon" instead of "PM" is just illogical. I'd always use PM instead of noon -- whenever I'm trying to be specific about something (appointments and such). I understand the argument you're trying to make, but that example isn't enough to support it.


How is "PM" more specific than "noon"?

Absent the context of this conversation, it's not immediately obvious to me whether 12 PM is midday or midnight, whereas "noon" is unambiguous.


I think historically "noon" expresses less precision, although I suspect that's less true now that everyone always knows what time it is and has GPS in their pocket to help calculate arrival times.

20 years ago, had I said to someone "I'll be there at 12pm" it would have had a stronger implication of precision than "I'll be there at noon." I don't think it's true today.


12pm is also commonly interpreted as midnight so "12 noon" or "12 midnight" is generally preferred when scheduling meetings or deadlines in order to avoid confusion.


Understood. So the difference is you interpret "noon" as an ill-defined probability distribution centred roughly around midday, rather than a concrete point in time, whereas you interpret "12PM" as a concrete point in time. Fair enough.


Go back further and "noon" is "whenever the sun is at its zenith" so we've definitely made strides.


Isn't that still the case, assuming central position within timezone and no summer time?




