The Lyra version is clearly much louder. This is a serious problem and it borders on being reasonable to call it "cheating".
It's well known in the audio biz that if you ask people to compare two experiences, and one of them is a bit louder than the other, people will say that the louder one was better, or came through more clearly, or whatever it is you're trying to market for. For the purpose of comparing artifacts in two samples, it's absolutely crucial that they be the same volume. You might as well compare two image compression codecs where one of them "enhances" the colors of the original image.
Note: I took the clips for this comparison from the "clean speech" examples at the original source on Googleblog, not the blogspam.
Definitely a real effect, but it seems like Google accounted for that in their listening tests.
The Google blog post links to the Lyra paper[1], and Section 5.2 of the paper says:
> To evaluate the absolute quality of the different systems on different SNRs a Mean Opinion Score (MOS) listening test was performed. Except for data collection, we followed the ITU-T P.800 (ACR) recommendation.
You can download those ITU test procedures[2], and skimming through that, it does mention making "the necessary gain adjustments, so as to bring each group of sentences to the standardized active speech level" and a 1000 Hz calibration test tone related to that. (See sections B.1.7 and B.1.8.)
So, if I skimmed correctly, and if the ITU's method of distilling speech loudness into a single number is an effective way to match the volume levels[3], then it seems like they did what they could to avoid cheating at the listening tests.
It is still interesting that Lyra makes things louder, though.
That's good information, thanks. My comment is mostly directed at the misleading blog post. I have no direct reason to believe that the study itself was compromised, though it would be great to have confirmation from the authors that it was not.
The part about matching volume levels in the ITU recommendation seems to be talking about making sure the source recordings were balanced. All their clips might well have been exactly at the ITU recommended level of -26 dB, but if Lyra introduced a level mismatch this would have to have been corrected at a later stage, and it's at least possible that it might not have been. The Lyra paper does explicitly say that they didn't follow the ITU rec for "data collection".
Interestingly, the Opus and Reference sources are almost exactly -26 dB relative to full scale (according to several measurements of loudness), but the Lyra clip is about 6 dB hotter. So the source (the reference clip) exactly follows the ITU rec. Did they remember to fix the levels on the Lyra clips? I hope so!
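If anyone wants to reproduce the check, here is a minimal sketch, assuming the three clips have been saved locally (file names are placeholders). It uses plain RMS in dBFS, which is only a rough stand-in for the ITU active speech level:

    # Rough level check: plain RMS in dBFS, not the ITU active speech level,
    # so treat the numbers as approximate. File names are placeholders.
    import numpy as np
    import soundfile as sf  # pip install soundfile

    def rms_dbfs(path):
        data, _ = sf.read(path)       # float samples in [-1.0, 1.0]
        if data.ndim > 1:
            data = data.mean(axis=1)  # fold stereo to mono
        return 20 * np.log10(np.sqrt(np.mean(data ** 2)))

    for name in ("reference.wav", "opus.wav", "lyra.wav"):
        print(f"{name}: {rms_dbfs(name):.1f} dBFS")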
Excellent catch. To be precise you need a measure of perceptual loudness rather than raw waveform excursion, but I would expect the results to be in line with what you've found.
> It's well known in the audio biz that if you ask people to compare two experiences, and one of them is a bit louder than the other, people will say that the louder one was better, or came through more clearly, or whatever it is you're trying to market for.
As a former mastering engineer, you're absolutely right that this is well understood in the audio industry. I used to present my clients with level-matched comparisons of source audio vs. processed so they would understand exactly what was being done, aesthetically.
Yes. Assuming this was done by the Lyra encoder directly, and not the person who wrote the blog post pushing the slider, you have to wonder how it would respond to an input with a peak around -3 dB. Would it clip? Is it performing some kind of normalization? Who knows!
It's also interesting that the Lyra clip is ever so slightly longer than the other two. The Opus clip has exactly the same number of samples as the reference. Maybe they didn't use a decoder for Lyra at all, just played the file on one system and recorded it using a line-in on another?
Well, the blog post states they use a generative model. If that means what I think it means, they are doing in audio what folks have done in images: sketch where a rabbit should be and have the model generate a rabbit there. Great encoding, because the notion of 'rabbit-ness' is in the model, not the data.
Again, assuming I understand correctly, that isn't a "transcoder"; it's more of a "here is a seed, generate what this seed creates" kind of thing.
Another way to look at it would be to think about text to speech. That takes words, as characters, and applies a model for how the speech is spoken, and generates audio. You could think of that as really low bit rate audio but the result doesn't "sound" like the person who wrote the text, it sounds like the model. If instead, you did speech to text and captured the person's timbre and allophones as a model, sent the model and then sent the text, you would get what they said, and it would sound like them.
It is a pretty neat trick if they are doing it that way, since it seems reasonably obvious for speech that if you could do it this way, the combination of model deltas and phonemes would be a VERY dense encoding.
But that is the thing, what if it isn't a codec? What if it is simply a set of model parameters, a generative model, and a stream of fiducial bits which trigger the model? We have already seen some of this with generative models that let you generate voices that sound like the speaker data used to train the model, right? What if, instead of say "i-frames" (or whatever their equivalent would be in an audio codec), you sent "m-frames" which were tweaks to the generative model for the next few bits of data.
I think I understand what he is saying; what I am struggling with is why a 'sound' GAN would care about different languages when an 'image' GAN doesn't care about different images.
What I'm getting at is this: do they use the sample as a training data set, with a streamlined model generation algorithm, so that they can send new initial model parameters as a blob before the rest of the data arrives?
It has my head spinning but the possibilities seem pretty tantalizing here.
I think you would agree that a GAN, or any generative model, can only generate something in the same domain as what it was trained on. If you trained it mostly on human faces with a little bit of rabbits, it's not going to generate rabbits well. If you trained it mostly on English text and a little bit on Mandarin, it's not going to generate good text in Mandarin. Same with sounds. Different languages use different sounds.
If they use any generative model in their codec, they had to train it first, offline, on some dataset. They can't possibly train it equally well on all languages, so we should be able to tell the difference in quality when comparing English to more exotic languages.
I agree with you 100%! This is where I am wondering:
> If they use any generative model in their codec, they had to train it first, offline, on some dataset.
One thing I'm wondering is whether they have a model that can be "retrained" on the fly.
Let's assume for this discussion that you've got a model with 1024 weights in it. You train it on spoken text, all languages, just throw anything at it that is speech. That gets you a generalized model that isn't specialized for any particular kind of speech, and the results will be predictably mixed when you generate random speech from it. But if you take it and run a "mini" training pass on just the sample of interest (you have this general model, you digitize the speech, you run it through your trainer), now the generalized model is better at generating exactly this kind of speech, agreed? So now you take the weights and generate a set of changes from the previous "generic" set, you bundle those changes in the header of the data you are sending and label them appropriately. Now you send only the data bits from the training set that are needed to activate those parts of the model that were updated. Your data product becomes (<model deltas>, <sound deltas>).
What I'm wondering is this: if every digitization is used to train the model, and you can send the model deltas in a way that the receiver can incorporate into its local model predictably, can you then send just the essential features of the digitized sound and have them regenerated by the model on the other end (which has incorporated the model deltas you sent)?
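Purely to make that concrete, here is a toy sketch of the "send the weight deltas" idea, not anything Lyra is documented to do; the shapes, loss, and training loop are all made up:

    # Toy sketch: fine-tune a copy of a shared "generic" model on the incoming
    # sample, then ship only the weight deltas. Everything here is illustrative.
    import copy
    import torch

    generic = torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.Tanh(), torch.nn.Linear(64, 64))
    tuned = copy.deepcopy(generic)

    features = torch.randn(100, 64)   # stand-in for features of the current speech
    opt = torch.optim.SGD(tuned.parameters(), lr=1e-3)
    for _ in range(5):                # a quick "mini" training pass on just this sample
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(tuned(features), features)
        loss.backward()
        opt.step()

    # Sender ships only the changes relative to the generic weights...
    deltas = {n: (p - g).detach() for (n, p), (_, g) in
              zip(tuned.named_parameters(), generic.named_parameters())}

    # ...and the receiver folds them into its own copy of the generic model.
    receiver = copy.deepcopy(generic)
    with torch.no_grad():
        for n, p in receiver.named_parameters():
            p += deltas[n]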
Here is an analogy for how I'm thinking about this, and it can be completely wrong, just speculating. If you wanted to "transport" a human with the least number of bits you could simply take their DNA and their mental state and transmit THAT to a cloning facility. No need to digitize every scar, every bit of tissue, instead a model is used to regenerate the person and their 'state' is sent as state of mind.
That is clearly science fiction, but some of the GAN models I've played with have this "feel" where they will produce reliably consistent results from the same seed. Not exact results necessarily, but very consistent.
From that, and this article, I'm wondering if they figured out how to compute the 'seed' + 'initial conditions', given the model, that will reproduce what was just digitized. If they have, then it's a pretty amazing result.
What you described could work in principle, but in practice, "mini" training on a single sample is not likely to produce good results, unless the sample is very large. Also, this finetuning would most likely be quite resource intensive. I recall older speech recognition systems where they would ask you to read a specific text sample to adapt the model to your voice, so yes, this can work.
If you can fit a large generative model (e.g. an rnn or a transformer) in the codec, you might be able to offer something like "prompt engineering" [1], where the weights of the model don't change, but the hidden state vectors are adjusted using the current input. So, using your analogy, weights would be DNA, and the hidden state vectors would be the "mental state". By talking to this person you adjust their mental state to hopefully steer the conversation in the right direction.
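If it helps, here is a toy version of that "weights are DNA, hidden state is mental state" framing; shapes are invented and nothing here comes from the Lyra paper:

    # Toy sketch: parameters stay fixed, only the recurrent state adapts to
    # the incoming speech features. Shapes are invented for illustration.
    import torch

    gru = torch.nn.GRU(input_size=64, hidden_size=128, batch_first=True)
    readout = torch.nn.Linear(128, 64)

    hidden = torch.zeros(1, 1, 128)        # the "mental state"
    for _ in range(10):                    # pretend packets arriving over time
        frame = torch.randn(1, 1, 64)      # stand-in for one frame of features
        out, hidden = gru(frame, hidden)   # weights unchanged; only `hidden` moves
        audio_frame = readout(out)         # stand-in for the regenerated audio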
I do still prefer Lyra overall, though not as much as some others (see sibling comment). To me, Lyra is cleaner and easier to understand, but the artifacts it introduces are more annoying and fatiguing than those introduced by Opus. Some people in this thread have reported trouble understanding Lyra, which I attribute to the strange artifacts it introduces.
When I was doing amateur audio engineering from my parents' basement 15 years ago this phenomenon was easily noticeable and extremely difficult to avoid, particularly when doing things where the entire point is to change the loudness of everything (one aspect of mastering) or to change the loudness of things relative to other things (mixing). My "solution" was to simply take a long break (perhaps overnight) and see if I still thought the newer version sounded better with clear ears than I remember the old version sounding with clear ears.
1. Lossy codecs will use a low-pass filter to get rid of hard-to-compress higher frequencies. This is often inaudible, but even when it is audible, it should lower the volume, unless you're applying some kind of compensation for it.
2. It's true that lossy codecs compress different frequencies differently, but that's not usually done in such a way that amounts to applying EQ to the frequencies.
3. Even if the relative balance of frequencies did shift as a result of applying lossy compression, this is still done in a way that the overall loudness of the audio does not change. In this case the Lyra output has changed significantly and in an easily audible way (about +6 dB). You could easily get the same effect in Opus just by amplifying (or applying compression to) the result, but Opus is doing things correctly.
I wouldn't call this cheating though. Audio compression makes use of the way we perceive sound. If an artefact is perceived as clearer when it is louder compared to another one at the same compression bitrate but lower volume, I would say this falls into the category of psychoacoustic compression.
If simply turning up the volume made it easier to understand the speech, then not turning up the volume on the other codecs would make for an unfair comparison.
Cloudflare are kindly hosting [1] a free HTTP gateway for the IPFS [2] network. So I can host an image myself on a server with IPFS, and Cloudflare will cache it for me. This is better than Imgur because the latter has been redirecting users to annoying "social" pages with ads instead of showing them the actual image, at least in some cases. I also can't be sure whether Imgur recompresses your uploads or not - I assume they usually do.
It's also more generally useful because I can host other files too, not just images.
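As a rough sketch of the consumer side (the CID is a placeholder, and I'm assuming Cloudflare's cloudflare-ipfs.com gateway URL scheme):

    # Fetch a file that has been added/pinned on IPFS, via Cloudflare's gateway.
    # The CID below is a placeholder; use the hash printed when the file was added.
    import requests

    cid = "Qm..."  # placeholder content identifier
    url = f"https://cloudflare-ipfs.com/ipfs/{cid}"

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open("fetched_file", "wb") as f:
        f.write(resp.content)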
Is hosting the image yourself, on something like a $5 Digital Ocean Droplet and a $10 personal domain, out of the question? This would seem to be the ideal setup in terms of a simple, decentralized file hosting solution. What are the downsides of this approach?
(I can imagine a server package that can modify index.html sub-resource URLs depending on current server load, preferring private, locally hosted sub-resources but willing to use 3rd party solutions like Cloudflare, too, if required by a black swan event.)
Out of the question? No. As convenient as running one command on a desktop computer? Also no.
> the ideal situation in terms of simple, decentralized file hosting solution
Not sure what you mean by "decentralized" if you are in fact hosting it yourself.
> What are the downsides of this approach?
Well, for the casual person it has the obvious downside that you have to have your own VPS. Most people don't have those. Even if you do, IPFS has a couple of advantages: you can host images anonymously, and anyone anywhere in the world can "pin" the image to make sure it stays live. If you're using a server and you forget to pay DO your $5 one month, all your images go poof into the ether.
Not OP, and I haven't done it myself yet, but it makes a lot of sense. It's basically free image hosting if you can get the file cached by Cloudflare.
Imgur these days is slow and riddled with ads. A page view will sometimes load many times the image size in Javascript, stylesheets and images. It also doesn't allow the user to just view the raw image, going as far as redirecting requests for the raw image to a web page if you access the URL directly.
The only downside I see is that the URL is less user friendly without the IPFS toolset installed. Sounds like a pretty good idea to me.
Are these imgur problems a USA thing? Because I literally never had any of the behavior described. Direct image links always go to the image, there is no JS or HTML or anything.
Yep same here. Maybe the issue is about non-direct links, but it could be that imgur changes what it responds with depending on the request. If the url ends with .jpg it can still serve an HTML page.
If you upload and then link to an image on Imgur and the person clicking the link has not run Imgur's javascript yet within $timeperiod, the image will not display. Instead you'll be given javascript to run.
Cloudflare as a gateway is distasteful and this won't last long, but for now at least when you click an ipfs image over cloudflare you get an image and not javascript code.
Not OP, but I assume it's because it kind of defeats the purpose of IPFS. IPFS is all about links that refer to content and not location; a Cloudflare link is back to a location, and when the CF mirror goes down, the link will be broken.
But it's also the only way normal users can see the content.
What I want to know is whether Lyra takes any longer to encode than the alternatives.
Because as far as I can tell, nobody cares in the slightest about latency.
Phone calls are getting to be like writing postcards to each other. Speak in a whole paragraph. Wait several seconds for the latency to clear. Then the other party responds with a whole paragraph, waits several seconds for the latency to clear...
Improvements to fidelity are nice-to-have, but I would like some real-time in my real-time communications, please.
I had the pleasure of using a real landline just before the pandemic. Honest wire-to-wire connection between two ranches, so no silly VOIP steps between.
It was fantastic.
You don't appreciate how much latency is destroying our ability to communicate verbally until you go back to the old way.
One example is arguing. It's no wonder people used to be able to argue with one another on a telephone. You could raise your voice and still hear the other side and adjust your speech in real time. Today it's just one party shouting over the other to drown the opponent out.
Between miserable latency, not-so-great fidelity, and the fecklessness of phone companies in the face of the robocall epidemic, I have come to hate phone calls.
I'm rooting for something to replace phone communications. Any chance that Matrix can do better on any of those fronts? Especially on fidelity and latency since they're germane to the high-level subject of this discussion.
One conspiracy theory is that tech companies have lobbied to prevent real action on robo-calls, in order to get people like you to hate calls and migrate to online services.
It definitely exists in France. I still receive a robo call every single day on my land-line, liste rouge/Robinson or not. I have stopped bothering reporting them. The prevalence of "who called me/is it important" services makes me think it is not an uncommon problem.
Another thing I clearly experienced in the beginning of the mobile phone era, when one still could compare those things, was dynamic range. When speaking via a good landline, you could hear the presence of the other, but on mobile, there was only speech or silence. Noise suppression saves bandwidth, but it is so aggressive that the nuances in the other's utterances (breathing, hesitations, etc.) simply disappear. For a simple transaction that's not a problem, but when your SO lives far away, miscommunication arises too easily.
VoIP is definitely not the issue here. Codecs can be, and are, fast. I have no idea what cellular providers are doing to mangle the voice path so bad, but it certainly isn’t inherent to VoIP.
I have no idea what they are doing but it sucks. I have in many occasions found that making a call using WhatsApp works much better - both in quality and latency - than making a phone call. Phone lines are so atrociously bad, they've put me off from doing a voice call unless it's absolutely necessary.
Not all packets are created equal. ATM (Asynchronous Transfer Mode) cells have a fixed 48-byte payload, but IP (Internet Protocol) packets have longer headers and variable-length payloads up to about 1500 bytes. ATM is designed for low latency audio transfer, whereas IP incentivizes using bigger packets to reduce overhead at the expense of more latency.
Historically, sure. But at a realistic modern internet speed of, say, 15 MB/s, 1500 bytes takes 0.1ms to transfer. Even at 1.5 MB/s it takes 1ms. Typical VoIP call latency apparently ranges from 20ms to 200ms, so packet size is not a major contributor.
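A quick check of that arithmetic, using the link speeds assumed above:

    # Serialization delay for a full 1500-byte packet at the link speeds above.
    packet_bytes = 1500
    for mbytes_per_s in (15, 1.5):
        delay_ms = packet_bytes / (mbytes_per_s * 1_000_000) * 1000
        print(f"{mbytes_per_s} MB/s -> {delay_ms:.2f} ms per packet")
    # 15 MB/s -> 0.10 ms; 1.5 MB/s -> 1.00 ms; tiny next to 20-200 ms call latency.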
Ever experienced European ISDN to European ISDN at 64kb/s?
I know Americans said "I still don't need (it)", but I still do miss it :-)
Funny thing was I had better (and cheaper!) calls to the US using calling cards, dialing into Frankfurt, and from there to the US, than using the native offer of my telco.
Arguing is definitely affected by latency but it is at least possible to reduce latency, e.g. by getting fibre internet or using ethernet.
A more annoying "feature" of many VoIP systems is that they mute the other person while you're talking. You literally can't interrupt people because they won't hear you.
I presume this is done in order to reduce feedback, but it still sucks.
I'm glad you bring up latency - I've experienced several seconds of latency in Discord and it's really terrible when it happens (diagnosed via a side channel). The worst thing is that the app does nothing to try to salvage the conversation and doesn't tell you it's in a degraded state, so you'd never know this was a problem.
Add to that when the other party is hearing you over speakers, and the echo cancellation kicks in whenever you start to speak so that you can't hear what they're saying when they try to interject. You just see lips moving, stop speaking and you start to hear them mid sentence.
A cell phone, tbh, has about the same latency as a landline in most end-to-end call circumstances. Latency only really becomes noticeable above about 600ms (and is only a real problem over 1000).
Do you have a citation for that? I've heard 200ms as the key threshold and the ITU uses 100ms for their default delay sensitivity class[1]. One key concept is that this isn't fixed but situational: if you're watching a TV show, the threshold is higher than if you're trying to react to something which is higher than simply noticing a delay and speech is more forgiving than, say, music hitting a precise tempo (I believe musicians have been tested as noticing delays down into the 10-20ms range).
One other big factor is consistency: if the variability is due to compressor overhead which is constant, the effect will be noticeable but less distracting than if it's varying due to something like wireless conditions.
I know that if the delay is constant, a good musician can compensate for it. I think it's a lot easier to hear that two things don't happen at exactly the same time than it is to tell whether two things happen 200ms or 300ms apart.
Musicians have that internal metronome to compare things with.
I disagree with these numbers, in general. Though of course "noticeable" is subjective and varies by use case as well as by person.
For many people, end-to-end audio latency in a 1:1 conversation becomes noticeable/annoying at 200ms. And in a multi-participant conversation, talking over each other becomes noticeably more common even at 100ms compared to 50ms.
I think latency is noticeable at even lower values. As a basic example, try to sing a song with someone over a voice call. Consider using Airpods or similar bluetooth headphones to make it more apparent.
This is a fantastic question. I agree with you that we're slowly boiling the frog (and the frog is ourselves) in accepting more and more latency in our real-time communications.
I think the answer for Lyra is that latency is a concern, but maybe at this stage not as much of a concern as it could be. I'm only guessing, though, based on this [0]:
> The basic architecture of the Lyra codec is quite simple. Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission.
That sounds like the minimum frame size for Lyra is 40ms. For Opus (the audio codec used for most WebRTC applications), the default frame size is 20ms [1], and most implementations support frame sizes of 10ms [2].
Of course, your favorite web browser might not default to 20ms frames for Opus. And by "most implementations" I meant Google Chrome. :-)
I can't click a button I see flash faster than 200 ms, and that's when 100% of my attention is on looking closely at a light changing color and clicking. If 20ms is an issue for a conversation, where there's so much more processing and understanding on both sides, I'm clearly a subhuman...
I spent the last year and a half living about as far from my home city as possible (Melbourne > Edinburgh) (Currently in quarantine) and the latency of phone calls drives me nuts.
The huge gaps between people speaking, and the complete change in conversation flow since you have to speak in huge continuous paragraphs. The constant "go ahead" , "no you go" etc... Ugh.
But I agree with you... nobody seems to care except me.
Well, there's part of your problem right there. No need to even mention Bluetooth or device-related delays.
South Australia to North Britain is the better part of literally half way across the globe. It's either 64000 km at lightspeed (via satellite) or roughly 35000 km via optic fibre cable at 60% lightspeed (e.g. equivalent to ~60000km at lightspeed).
That's 200 ms one way latency just from the distance alone (best case scenario, no less), so 400 ms of latency just from distance alone. Even with something like Starlink we'd still be talking about at least 100 ms latency.
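Working through those numbers (the distances are the rough estimates above; speed of light taken as 300,000 km/s, light in fibre at 60% of that):

    # Propagation delay from the distances estimated above.
    C_KM_PER_S = 300_000
    fibre_km, satellite_km = 35_000, 64_000

    one_way_fibre_ms = fibre_km / (0.6 * C_KM_PER_S) * 1000   # ~194 ms
    one_way_sat_ms = satellite_km / C_KM_PER_S * 1000         # ~213 ms
    print(one_way_fibre_ms, one_way_sat_ms)                   # round trip is double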
The whole latency from wireless protocols and codecs is just the cherry on top.
I also agree. Sometimes I speak to work colleagues who are on BT headsets and it must add what feels like 100ms to the call. I can sometimes hear the end of my sentence as I speak. Infuriating.
I don't understand how Bluetooth has made it this far for phone calls. It has two modes: unintelligible quality and bad latency, or decent quality and ludicrous latency. The entire standard is an advertisement for Airpods.
Mostly Windows and Android. Although both audio pipelines aren't great I don't think it's a platform problem--I've never seen a measurement of even 100ms lag with Bluetooth (excluding aptX-LL) and it's usually in the 300ms range. It also still only does two-way audio at 8kHz, which is unbelievable to me in 2021.
I would go the exact opposite route for this exact reason: something like 24Kbit ADPCM or 16Kbit G728 can provide absolutely decent quality with only a 5 ms delay. This kind of bandwidth is now available in the vast majority of VoIP scenarios; a 100ms 3Kbps codec is only relevant for extreme niches.
Discussed a few days ago [1]. Copying my comment here:
Even 3G AMR, which I think was a pre-2000 speech codec, started at 5Kbps, with a latency of only ~20ms. If I am reading correctly, the encode, due to its ML nature, would take at least 40ms and up to 90ms for Lyra.
I am sure there is some specific usage for which this would be a great fit, but for most consumer consumption I can't think of one off the top of my head. One should also be aware of the current roadmap for 5G and the ongoing work on 6G. We still have a long way to go in maximising bandwidth / transfer per capita, i.e. more bandwidth for everybody.
It seems to be the case with ML that they want to take these speech codecs to new low bitrates. While that is fun as research, I would much rather they push the envelope at 6 / 8Kbps, if not even higher, closer to perfection, with even lower latency (10ms if not lower).
Providers just don't seem to care. My internet connection is a hundred times faster than it was in 2004 but I'm still playing games with the same 50ms-60ms of lag. Sure the speed of electrical signals in wire is a good chunk of it but there's still so much room for improvement.
> but I'm still playing games with the same 50ms-60ms of lag
What kind of lag, though? Input lag has actually gone up in the past 15 years (e.g. due to displays and USB device polling). These things add up quickly and it doesn't even have to be just the network that introduces lag.
> This trick enables Lyra to not only run on cloud servers, but also on-device on mid-range phones in real time (with a processing latency of 90ms, which is in line with other traditional speech codecs).
Worth noting that the paper itself explicitly says 90ms algorithmic delay. Seems plausible to me that there might be additional processing required on top of that, and then you have to add network transit time... suffice it to say that the vast majority of use cases are going to be better served by Opus, for the time being.
It covers it, but it's not exactly brilliant. 200ms is the point at which conversation breaks down. If half that budget has gone on the codec, not much has to happen on the wire for it to be noticeable.
- send encoded packets over the network: 20-100ms latency (fiber vs mobile phone)
- potential decoding + encoding (if the receiver does not support the sender's codec, e.g., a landline phone using an old codec)
- stream packets through a decoder: Y ms
If you are aiming for 60ms audio latency, which is what I would consider "good", then in the best scenario (20ms network latency; both using same codec) the latency of the encoder+decoder has to be max 40ms (e.g. 20ms for encoder, and 20ms for decoder).
It should be obvious that a codec that does not meet the 20 ms budget, but takes 90 ms instead (> 3x the budget), can produce better audio (ideally 3-4x better).
Latency-wise, everything below 60 ms is really good, 60ms is good, and the 60-200ms range goes from good to unusable. That is, 200ms, which is what this new codec would hit under ideal conditions, is a latency that humans consider "unusable" because it is too high to have a fluent conversation.
For me, personally, if latency is higher than 120ms, I really don't care about how good a codec "sounds". I use a phone to talk to people, and if we start speaking over each other, cutting each other, etc. because latency is too high, then the function of the phone is gone.
It's like having a super nice car that cannot drive. Sure, it's nice, but when I want to use a car, I actually want to drive it somewhere. If it cannot drive anywhere, then it is not very useful to me.
Isn't the actual phone network packet-switched and running over fibre optics now anyway? I don't think you can get a literally analog phone call anymore can you?
You haven't had "analog" phone calls for decades. Your analog line was converted to a digital signal at the central office or concentrator (ugly green boxes).
From the CO to a tandem through the core network your call was digitally switched to the endpoint where it was converted back to analog for the callee's phone.
I should have been clearer, the OP doesn't want "analog" service but PSTN (TDM switched) service. Unfortunately that is dying out as carriers are moving to all-IP core networks. As customers have moved away from fixed landline service it ends up costing more per customer to keep the PSTN running. The last mile equipment is the same for TDM or IP switching but they can dump that directly into an IP network rather than maintain a hierarchy of COs/tandems/toll switches with dedicated links between them.
You probably can still get a real phone, but plugging a plain old phone into an optical network terminal, which converts to VoIP, has most of the virtue of POTS. I use a regular POTS device with Sonic fiber service and it sounds fantastic, with no noticeable latency.
Most of the problems with voice over mobile networks is caused by frame drops and wacky inter-arrival times. A wired IP network just doesn't have those problems.
Going back to the article/paper, I'd love to hear more about how Lyra interacts with Duo's machine-learned voice interpolation that fills in for dropped frames. Do they complement each other, or interfere?
When Google's announcement [1] was posted a few days ago, I listened to their samples and heard an odd effect in the "chocolate bread" sample (the video chat example) [1], which is not mirrored in this article.
On that sample, I felt [2] that the Lyra version exaggerates the pronunciation of the phrase 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness, and overshoots both the lead consonant and first vowel of 'choc', and then proceeds to wash the entire rest of the sentence with a peculiar brightened voice that's high, lacks consonant definition, and is close to ringing.
I'm guessing it's actually style transfer, because though the result sounds not much like the speaker's original, the result is reminiscent of the speech pattern and accent that people with East Asian and Southeast Asian ancestry adopt when speaking American English. It was surprising, given that the speaker doesn't sound like that in the original. I wonder if others hear this too.
While Lyra sounds richer and wider-band than Opus or Speex at these bitrates, the degradations and artifacts of those codecs are universally recognized (through years of familiarity with telephones) as compression artifacts and not innate features of the speaker themselves. Therefore listeners can be expected to be sympathetic to the quality issues and not attribute the whole of the sound on the speaker's person.
If AI-trained voice synthesizer codecs become the norm, and they perform well on most speakers, that expectation will go away, and the resulting audio will be attributed wholly to the speaker. That increases the impact of mistakes and misrepresentations introduced by the codec, unbeknownst to the speaker and listener.
> 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness
I honestly don't hear a 'th' in the original.
> It was surprising, given that the speaker doesn't sound like that in the original.
I disagree. Note that the speaker says "these bread". The three possibilities for those two words—"these bread", "thiiiis bread", and "these breads" with a dropped "s"—would all be weird things for a native English speaker to say, for different reasons relating to either wrong pronunciation of "this" or "breads", or the fact that bread is its own collective noun and therefore we typically require separate qualifiers like "these buns" or "these loaves" when separating multiple individual "pieces" (another) into a non-collective. We ask for "some bread" or "a piece of bread", but we don't say "a bread" or "some breads" unless we are discussing categorical types of bread ("ciabatta and rye are breads") rather than instances of such, and only one type of bread is represented in the video.
The Lyra reproduction has a band-pass filtered quality to it, but I find it still remarkably representative of the reference.
Yes yes yes please somebody look into latency with these fancy ML methods! You can quite literally have most ML models approximated to a very good degree as very fast DSP using very few processor cycles given modern-day CPU optimizations. Or heck use an ISA simulator plugged into another fancy ML model, and have it also optimize to minimize instruction count while recreating the same signal! (having a model optimize on its own inference is a neat trick, but I digress.) I’m sure ML is just one bottleneck among many (looking at you, Chromium) but I so desperately wish people started caring about latency again.
"When a man looks for something beyond his reach ..."
The word "looks" sounds completely wrong for me with Lyra. To the point of completely not understanding what this word is supposed to be (first example with your [1] link).
For me "looks" sounds fine but the word before, "Man", sounds like "Lan". So to me the opus sample sounds more understandable. Even though the "quality" of Lyra is better, that shouldn't be the score to optimize for, but fidelity of the compression. It's not helpful if the compression algorithm generates a beautiful flower from a flower image but it's a red flower instead of a blue one like the original. Gives me Xerox vibes...
Similarly, for me, the word "miracle" in the noisy environment becomes something like "vericle" with Lyra, whereas in Opus it is clearer. (Speex does fairly badly, but in a way that's a clear failure overall rather than making it sound like something else.)
That remains to be seen. In my experience the performance with anything other than (US) English is mediocre at best, and the less common the language, the worse the results get.
So while Spanish, French, or German might get there eventually, don't even try Polish, Czech, or Farsi (Persian) dialects.
Nothing about licensing or patents. I assume the worst (read: unusable for small businesses)?
10+ years ago I worked in a small VoIP shop, where we had a very high quality (low jitter) but low bandwidth connection. I researched many codecs of the time (2010-ish).
We liked Speex because it can be used "without strings attached". Also, I could choose the quality depending on the bandwidth. For low bandwidth g729 was better, although we couldn't use it because of royalties (though I allowed myself to test it).
We chose alaw/ulaw when bandwidth was not a concern, and speex when it was.
Since it does not mention usability outside of google, I also find this comparison unfair or incomplete: if you are comparing a proprietary codec, compare it to g729. If you are comparing a codec to speex, it should be open/free.
These days the correct comparison would be to Opus, which is similarly unencumbered and performs fantastically at low bitrates (and has a speech specific mode for even lower bitrates, because it's a hybrid of two codecs). It's also extremely low latency, so there's now no reason to accept trade-offs. (For the same bitrate as alaw/ulaw, you can get high quality full band music with Opus.)
These days it's more or less the standard for realtime voice. WebRTC uses it, most of the popular realtime voice applications use it as well, as does Signal.
I am curious about this as well. With all the debates in the threads here about this codec, I think it would be worth having the folks at Mumble [1] (murmur server) incorporate it as an option so people could fire up an instance and put it to the test. That of course only works if the license is compatible. They compare the performance to Opus which is used currently by Mumble.
> Nothing about licensing or patents. I assume the worst (read: unusable for small businesses)?
If there's a free software implementation, and a company offering the service based in the EU (or shop around and find any other jurisdiction where software patents don't matter), it's often YOLO - but call that "legal arbitrage" if you want to sound fancy :)
What are the current choices for "CD quality" speech compression (lossy but indiscernible)? I just had a discussion with a friend about keeping an always-on speech recorder running, and wondered about disk space consumption.
A fair point if you're evaluating actual usage rather than just quality, but Google isn't going to be the only one with this sort of codec. We can expect better free versions to come along in time.
I found the Lyra codecs in both examples easily the most difficult to comprehend, even compared against the scratchy Speex.
Am I the only one? It is a little odd to me to see the praise here and on the previous discussion.
To be fair, I am convinced I have APD (and, to be fair again, I have never got it checked out).
E: Just realized there is a third example. Perhaps it is not as strong a statement due to Opus' doubled bitrate, but it is still far scratchier. Yet, it is more decipherable than the Lyra codec to me.
It sounds 'clean' but the remaining artefacts are quite weird. Somehow the intonation/timing feels off, which just sounds plain odd. The compression artefacts are 'ugly' but at least recognisable as compression artefacts.
I personally wouldn't say "difficult to comprehend". I would say that the Lyra audio is "cleaner" but that the artifacts that there are, are louder and more annoying in Lyra. There's a very bad ringing effect and some flutter. If you personally find these artifacts distracting or confusing, I could very easily see the Lyra examples being harder to understand.
I'm almost certain that Lyra has increased the volume on the first sample too. It's quite audible, although I haven't confirmed this with Audacity.
Through good quality headphones, I actually find the Lyra artifacts rather piercing and think I'd pretty quickly get fatigued through having it in my ears over a long conversation. Maybe they would handle this better with a bit of a lowpass filter added.
I noticed the loudness, too, followed by a bit of a letdown with the subsequent odd artifacting. I was wondering whether that was also a factor in the user-perception ratings similar to how FM radio stations all started deploying compression to sound louder because it increased the odds of people favorably picking out their station while scanning through the spectrum.
After reading vidarh's comment below ("The biggest challenge with evaluating all of these, is that once you've listened to a comprehensible version of one of these samples, they all sound more intelligible."), I recommend trying to listen to the Lyra version first, then listen to the original next. It actually does sound very strange when you don't have the true audio parsed in your head already. I'm not sure if I would like having to do the extra work to translate this in my head on every conversation.
I did as you said and watched the Lyra version first. I couldn’t understand her and the tone wasn’t matching her lips and mouth. It sounds like someone with a speech impediment is dubbing someone who is not. I wonder how much miscommunication is caused by poor compression and latency.
Codec 2 does a better job of isolating the parts of sound which are most necessary to intelligible speech, without necessarily caring too much about preserving the original qualities of the speaker's voice or environment.
Fun fact: Codec 2 can be used to transmit voice over IRC:
I had to listen to your wav sample 4 times before understanding what it was saying... To me, that isn't intelligible... Perhaps with practice one could learn to understand it, but that isn't really what I want from my audio codec.
The biggest challenge with evaluating all of these, is that once you've listened to a comprehensible version of one of these samples, they all sound more intelligible. I had problems with the example too. After hearing the original it's now easy. It makes it really hard to properly assess the intelligibility for developers without decent sized panels of people to help evaluating them.
Or maybe they are visually impaired. I'm always impressed when I sit next to a blind person in a train and hear them using a smartphone.
The screen readers run at 2x normal speed (at least) and to my untrained ears the robotic noises just sound like a garbled mess instead of intelligible speech. The blind person using the phone, however, seems to have no problem understanding it. Fascinates me every time.
Funny thing: voice codecs below 2.4kbps are export controlled by ITAR, right alongside military technology and nuclear devices. David Rowe, who made the codec2 open source voice codec, had to get confirmation from his govt that they would not enforce it on his project: https://www.mail-archive.com/freetel-codec2@lists.sourceforg...
I'm not sure exactly why - someone mentioned that it might be because deep submarines have very little bandwidth and need to use them, but I don't have a reference.
Nevertheless, wouldn't be surprised if google didn't want the hassle.
This reminds me of a technology in Vernor Vinge's "Zones of Thought" series. I think they called it "evocations": at the beginning of a call, a model is transmitted that allows the other end to re-construct what the sender would look/sound like from severely abridged data. It sure sounds plausible - the semantically meaningful parts of a conversation (video or audio) would appear to have significantly less entropy than all of the details captured by a mic/webcam. The fact that things like JPEG and MP3 exist is proof enough, and those (to my knowledge) aren't even feature-based.
Maybe N years from now, your {Skype,FaceTime,Zoom,Jitsi} call starts by transmitting a pre-trained auto-encoder that can reproduce your speech and visual appearance with a "good enough" margin of error from a few kbps worth of data.
As one post alluded to the side effect of creating a "synthetic accent", I wonder what kind of social implications this will create.
Over time, as these "synthetic accents" gain widespread adoption, would the mainstream pronunciation also adopt such an accent? It's certainly interesting to think about.
I'm not going to say it's not impressive given the limitations, but am I an outlier if I say it sounds unacceptably bad for a voice call? If someone was on discord/zoom/hangouts with that quality I'd ask them to check if they had some hardware issues or their connection was borked.
The original Iridium satellite phone voice codec did voice calls at approximately 2800 bps 22 years ago, but the quality is quite poor, and it's very rudimentary. Encoding power and CPUs were quite different in 1998/1999.
Important to remember that this type of codec can be used as a backup for higher-bandwidth codecs. You don't necessarily need to hear its artifacts all the time. The higher-level codec also only needs to encode the differences between the prediction and the ground truth. The same thinking applies to video, especially of faces. Neural nets are a huge leap forward for this type of data compression and will likely be used pretty much everywhere in the future with great success.
Isn't VoIP at such low data rates already dominated by the overhead of UDP, IP and whatever lower layer? Multiplexing it with a low-bandwidth video stream would be possible, though.
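Rough per-packet accounting, assuming one 40 ms Lyra frame per packet over plain RTP/UDP/IPv4 with standard header sizes and no header compression:

    # Headers dominate a 3 kbps stream: 15 bytes of payload vs. 40 bytes of
    # RTP(12) + UDP(8) + IPv4(20) headers per 40 ms packet.
    payload_bytes = 3000 * 0.040 / 8            # 15 bytes of codec data per frame
    header_bytes = 12 + 8 + 20                  # RTP + UDP + IPv4
    total_bytes = payload_bytes + header_bytes  # 55 bytes per packet
    wire_kbps = total_bytes * 8 / 0.040 / 1000  # ~11 kbps on the wire
    print(payload_bytes, header_bytes, wire_kbps)

So at 3 kbps the headers really are most of the bandwidth, unless you bundle multiple frames per packet or use some form of header compression.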
I was thinking this could be most relevant for something like digital wireless transmissions.
G.729 is 21-30kbps with transport overhead, depending on a few factors. So shaving off 5kbps would still be meaningful. Or better quality at the same bandwidth might enable in-band DTMF or fax, neither works on G.729 now.
I thought I remembered getting 2400 baud fax (unreliably) working on G.729. Though I guess, yes, if this codec is trained on voices that probably doesn't bode well for fax.
Ah, I keep forgetting how low bandwidth fax modems are. That might actually work – 2400 baud is basically a person whistling or humming one out of two tones :)
60ish sounds right, though with Ethernet it's going to be padded to a minimum 64 bytes regardless. Might not matter depending what your bottleneck link actually uses though.
Why in the world do we need 3 kbps audio for voice? It's so hard to hear people speaking over mobile phones. Why don't we use 32-64 kbps 48 kHz for all voice communication? GSM CSD offers 9-14 kbps down/up, 3G offers 384 kbps, and EDGE offers 473 kbps. Why limit to 3 kbps?
I think the old Iridium satellite constellation only went as high as 3 kbps. It was heavily depended on by the US military. I'm sure there's still a need for this.
This is subjective, but there is loss of real speech information here. I don't hear the accents of these speech samples daily. I can easily follow the uncompressed versions. But the heavily compressed versions delete cues and I could not initially understand them.
This is an effect even with regular telephony these days. A smartphone, carefully held a little way from your ear because you don't trust it to mask touch events when used as a phone, and using 16kbps audio, is not as understandable as an old fashioned hardline phone. Ironically, higher fidelity audio via app (e.g. Whatsapp phone calls) scores better, despite the occasional glitches.
It is unfortunate that, for now, this appears to be proprietary, closed source, and treated as a Google competitive advantage over others, unlike Opus, which is fully open.
If they add it to WebRTC as they suggest, it will get auto-included in nearly all videoconferencing applications (most use webrtc under the covers, and a simple git pull will get it included in the next release).
They briefly mention the existence and current role of the webrtc codecs, but I don't see where they suggest they intend to contribute it or open it up as a library others can use.
If they don't release it as part of webrtc, they won't be able to use it in browser based videoconferencing. Google Meet/Hangouts on desktop rely on that.
While it is theoretically possible to process audio with something compiled to WebAssembly, data can't be marshalled into/out of a WebAssembly worker without the main browser thread's help, and that tends to be too janky to use for realtime audio on most platforms.
That pretty much forces Google's hand - if they want to use it in their web-based products, it must be open source and available to all competitors.
They might say this feature only works with their mobile apps though.
It would take some time to see how it goes. It's possible to run it in browsers without it being a member of the WebRTC codec family. The codec can be deployed as a WebAssembly module, with WebRTC acting as a data channel for it. Zoom used to deploy a similar approach while staying on its proprietary codec.
I listened to the first 4 samples of the woman speaking and Lyra was the clearest and the closest to the loudness of the original, but it was also the most different. Lyra sounds like a computer voice of the woman speaking the sentence back. I don't know how to explain it better; I can pick out the slight buildup/double "p" and it's not present in the Lyra version. Almost like it's removing human speech imperfections.
When they converted back to wav from Lyra, they used a 32-bit 16 kHz wav instead of a 16-bit 16 kHz wav like the source. The size of the Lyra file is almost exactly 2x as big as the reference.
Note that this isn't cheating in any way, the source is the source, so it's just a quirk from their conversion process. Probably the tooling around Lyra is pretty rudimentary and the decoder could only output a 32 bit file.
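For anyone who wants to check the container parameters themselves, a minimal sketch (file names are placeholders; soundfile is used because the stdlib wave module may not read 32-bit float WAVs):

    # Print the subtype, sample rate, and sample count of each downloaded clip.
    import soundfile as sf  # pip install soundfile

    for name in ("reference.wav", "opus.wav", "lyra.wav"):  # placeholder names
        info = sf.info(name)
        print(name, info.subtype, info.samplerate, "Hz,", info.frames, "samples")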
Since a browser can't play lyra I think they took the lyra output and put it inside something lossless like a 44kHz stereo wav so that people can listen to it.
Hmm, I have to say it's impressive to do this at 3 kbps, but even the Lyra sample I found very hard to understand, and misheard most of the words. The original was the only one I could understand fully.
I think this will be the rough direction, but not exactly text; rather some other efficient, machine-readable embedding of speech that is also able to carry tone and rhythm effectively and pronunciation accurately and unambiguously.
Basically yes. "Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission. The features themselves are log mel spectrograms, a list of numbers representing the speech energy in different frequency bands, which have traditionally been used for their perceptual relevance because they are modeled after human auditory response."
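For reference, that kind of feature is easy to compute yourself; here is a sketch with librosa, where the frame/hop/mel-band settings are illustrative rather than Lyra's actual ones:

    # Log mel spectrogram frames from 16 kHz speech, one vector every 40 ms.
    # n_fft and n_mels here are illustrative, not Lyra's real settings.
    import librosa

    audio, sr = librosa.load("speech.wav", sr=16000)   # placeholder input file
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=1024,
        hop_length=int(0.040 * sr),                    # 40 ms hop
        n_mels=80,
    )
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)                               # (n_mels, number of frames)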
Only if you're ready to kill all the intonation nuance. If you're ok with that, why not stick to just reading text? At least we can use emotes in there.
"As with any ML based system, the model must be trained to make sure that it works for everyone. We’ve trained Lyra with thousands of hours of audio with speakers in over 70 languages using open-source audio libraries and then verifying the audio quality with expert and crowdsourced listeners. ... Lyra trains on a wide dataset, including speakers in a myriad of languages, to make sure the codec is robust to any situation it might encounter."