
If your 200ms number is round trip, 90ms + 90ms is more than half that budget.


The pipeline is basically:

- stream your voice through an encoder: X ms

- send the encoded packets over the network: 20-100 ms latency (fiber vs. mobile phone)

- potentially decode + re-encode (if the receiver does not support the sender's codec, e.g., a landline phone using an old codec)

- stream the packets through a decoder: Y ms

If you are aiming for 60 ms audio latency, which is what I would consider "good", then in the best case (20 ms network latency; both ends using the same codec, so no transcoding) the encoder + decoder together get at most 40 ms (e.g., 20 ms for the encoder and 20 ms for the decoder).
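To make the arithmetic concrete, here is a minimal sketch in Python that sums the stages of the pipeline above against that budget. The stage values are illustrative assumptions for the best case (same codec on both ends, fiber link), not measurements of any particular codec:

    # Illustrative latency-budget check for the pipeline above.
    # All values are in milliseconds, and all stage numbers are
    # assumptions for the best case, not measured figures.

    BUDGET_MS = 60  # end-to-end target considered "good" here

    stages = {
        "encoder": 20,    # the X ms above (assumed)
        "network": 20,    # best-case fiber latency from the range above
        "transcode": 0,   # zero when sender and receiver share a codec
        "decoder": 20,    # the Y ms above (assumed)
    }

    total = sum(stages.values())
    print(f"one-way latency: {total} ms (budget: {BUDGET_MS} ms)")
    if total > BUDGET_MS:
        print(f"over budget by {total - BUDGET_MS} ms")
    else:
        print(f"headroom: {BUDGET_MS - total} ms")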

It should be obvious that a decoder that misses the 20 ms budget and instead takes 90 ms, more than 4x the budget, can produce better audio (ideally 3-4x better).

Latency-wise, everything below 60 ms is really good, 60 ms is good, and the 60-200 ms range goes from good to unusable. That is, 200 ms, which is what this new codec would hit under ideal conditions, is a latency humans consider "unusable" because it is too high for a fluent conversation.
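Those bands as a tiny Python sketch (the thresholds are this comment's judgment calls, not values from any standard):

    def rate_latency(ms: float) -> str:
        """Rough quality bands for mouth-to-ear latency, per the comment above."""
        if ms <= 60:
            return "good"        # well below 60 ms: really good
        if ms < 200:
            return "degrading"   # 60-200 ms shades from good toward unusable
        return "unusable"        # 200 ms and up: no fluent conversation

    print(rate_latency(40))   # good
    print(rate_latency(180))  # degrading
    print(rate_latency(200))  # unusable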

For me personally, if latency is higher than 120 ms, I really don't care how good a codec "sounds". I use a phone to talk to people, and if we start speaking over each other and cutting each other off because the latency is too high, then the function of the phone is gone.

It's like having a super nice car that cannot drive. Sure, it's nice, but when I want to use a car, I actually want to drive it somewhere. If it cannot drive anywhere, then it is not very useful to me.


It's not round trip; that's mouth-to-ear.



