A long time ago in Internet time, Justin Frankel (creator of Winamp) created a tool for live music collaboration that approached this problem in a very different way. Basically, it _adds_ delay for everyone in a way that synchronizes the musical measures, so that you play a measure while listening to everyone else's previous measure.
I never tried it because I'm not a musician (just a longtime fan of Justin and Winamp), but I always found the concept very interesting. Apparently it is still alive: https://www.cockos.com/ninjam/
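If I understand the trick correctly, the core of it fits in a few lines. Here's a minimal sketch in Python, assuming a shared clock and a fixed tempo (the names are mine, not NINJAM's actual code):

```python
# Minimal sketch of NINJAM-style interval syncing (my own illustration).
# With a shared clock, the delay is always exactly one measure: while
# you play measure N live, you hear everyone else's measure N - 1,
# which had a full measure's worth of time to cross the network.

BPM = 120
BEATS_PER_MEASURE = 4
SECONDS_PER_MEASURE = 60.0 / BPM * BEATS_PER_MEASURE  # 2.0 s at 120 BPM

def current_measure(t_seconds: float) -> int:
    """Measure index on the shared, clock-synced timeline."""
    return int(t_seconds // SECONDS_PER_MEASURE)

def measure_to_play_back(t_seconds: float) -> int:
    """Which of the peers' measures to play while you record live."""
    return current_measure(t_seconds) - 1
```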
Bingo this is the answer. You can't "speed up" the speed of light, but you can have predetermined latency with a synchronized clock that lets you do lots of great things.
Other ideas (this is not new, btw): you don't get to hear the other participants, but you're all on the same synced clock with a metronome, and you trust the overall output will be OK. The final mix is synced for viewers (obviously on a delay to achieve sync).
Most studio recordings are done iteratively, where you record a rough "scratch" track but then one by one, record over each part so that the final recording is the sum of everyone playing their best. This combines the feeling of playing with a group asynchronously with producing a high-quality recording. Would be cool to see tools to make this easier although the current gen of DAWs is pretty good.
It's worth watching some interviews with artists like Jack White to get their take on the quantization of music like this. It's not always the best way to make a track. And then when you get to a concert, if the artists can't bring it on stage, it's a bit of a letdown.
Latency is the most crucial issue for audio when doing live broadcasts, especially when the peers all need to be synchronized to a rhythm or pulse.
However, it doesn't stop there. Latency is also important for voice, not just music; excess latency is also the cause of the dreaded zoom fatigue [0].
We humans seem to have brains designed for particular cadences of conversation, and products like Zoom really work to disrupt & disengage these preferences we have, leading to poor communications outcomes.
I wonder if over time, we will learn to adapt to the new cadence as we continue to socialize over platforms like zoom. My hunch is that we will, it will just take time for our minds to adapt to the new medium.
We'll likely fix the latency issues inherent to the way we do live broadcasting on the web today before the human race has the time needed to adapt to the new cadence.
Our vocal communication was solidified into our genetic code over millions of years of iteration. I imagine that in a decade or less, the latency issue will be fixed for most of the connected world.
I'm not sure I understand: if the cello is delayed so I can feel like I'm playing my trumpet in sync, then the cello player can't also feel he's in sync with the trumpet; he must be ahead.
So there is some sort of hierarchy of instruments or a dumb sync track pre-recorded.
I'm not a musician, but I interpreted it like this:
Let's say the delay is 15 seconds: you hear the composite, including your part delayed, and you think "cool, I played the right thing at the right time, I'll keep going with my same assumptions," or you hear "my part was way off, but the rest sounded decent enough, better do something different."
The part where everyone is just playing terribly isn't a concern or doesn't manifest because you've got a bunch of intermediate or better musicians playing.
You're right, but most speakerphones still aren't even properly full duplex, much less utopian zoom software for apex cyber virtual orchestras.
The number of times I trip up on my own words because the last thing I said is blaring out of a friend's or family member's mobile speakerphone and back into the mic is disappointing on all levels, to say the least.
I think it works only if everyone is playing the same chord/scale. Kinda like someone playing with a loop pedal. You're right that it definitely wouldn't work with a piece of music.
I've spent most of 2020 building something similar. Now I just wish I had a landing page ready so I could plug it here, instead of over-engineering the app itself, but oh well :)
Anyway, what I'm building is meant more for the repetitive kinds of electronic music, but I solved the problem by just making it work like a shared loop pedal that records up to 16 measures of audio.
Everyone works asynchronously and can add or remove audio on their own time, but loops get synced to other players as they get recorded.
Best part? You can do cool stuff with browsers nowadays (even record uncompressed audio). So it just needs a web browser.
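The grid logic behind that is genuinely tiny. A stripped-down sketch (not my real code, which has a lot more plumbing):

```python
# Loops snap to a shared grid of up to 16 measures: whatever you record
# goes live for everyone at the next loop boundary, so nobody has to be
# in real-time sync while actually recording.

LOOP_MEASURES = 16

def next_loop_boundary(current_measure: int) -> int:
    """First measure at which a freshly recorded loop starts playing
    for all connected players."""
    return ((current_measure // LOOP_MEASURES) + 1) * LOOP_MEASURES
```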
No comment that mentions Justin Frankel and Winamp is complete without mentioning Reaper. Software written with true passion - that's what Winamp was and that's what Reaper is today.
We tackled this kind of problem with voice systems on the server side: record all streams simultaneously, sync and merge them, and stream the recording. Downside: there will be a delay, but it is a fixed delay, and it's determined by the worst connection. (This was around 1995.)
When I saw the headline, I thought they might provide some more insight, or maybe novel ideas on how to overcome the latency problems. Very disappointing public relations piece.
I agree with others, latency will always be an issue over a distance.
Thought Leadership ... ha ha ... I'm impressed by their leadership; I didn't expect latency and timing to matter in music. Then again, I'm also a novice and ignorant :)
The real problem here is latency, and I'm good friends with the folks who are working the hardest on that issue for real musicians wanting to collaborate with each other in real time. The most important work is being done with SoundJack.
This. I've asked my musician friends what teaching has been like through COVID. All the recording platforms, they say, have at least a 0.5s delay, which makes playing together with a student very difficult. The teacher sometimes has to prerecord a sequence and play a video instead, which makes teaching online not very convenient or fun for the teacher.
The established technique and one that many people have independently invented (myself included) is to chain the performers so one person hears only a metronome and each other person hears the previous members on the chain, forwarding along the audio in sync.
As I've said elsewhere in the thread, Zoom is the wrong tree to bark up, since these tools exist.
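In case the shape of it isn't obvious, here's a hedged sketch in Python (the names and framing are mine, not any particular tool's):

```python
# Person 0 hears only the metronome; person k hears the mix of everyone
# upstream, adds their own part, and forwards the result downstream.
# Network delay between nodes never breaks sync, because each player
# only has to stay in time with what they are actually hearing.

from typing import List

Frame = List[float]  # a block of audio samples, kept simple on purpose

def mix(a: Frame, b: Frame) -> Frame:
    return [x + y for x, y in zip(a, b)]

def chain_node(upstream: Frame, my_audio: Frame) -> Frame:
    """What each node forwards downstream: the upstream mix plus its
    own playing. The last node's output is the full ensemble."""
    return mix(upstream, my_audio)
```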
As a musician, audio engineer, and software engineer, this just blew my mind. I can't believe I've never heard of or thought of this until now. But my primary focus has been non-audio software through the pandemic, so I haven't been actively trying to find solutions to this. Thank you.
What applications would you recommend? Can it also do video, if not each link in the chain, at least one camera for the conductor (the node providing the metronome audio)? I realize the conductor would hear nobody at all, until playing back a recording from the end of the chain. Playback of said recording could begin after the song, or one measure into the song if there are no tempo/time changes (NINJAM style).
There's a sea shanty called the Wellerman that recently blew up on TikTok, and listening to it I was amazed how in sync the whole thing was despite all the layers. (It uses the same chain technique, but not "live".)
Not-live overdubbing with perfect sync has been going on since the dawn of multi-track recording, although the popularity of including video from each layer (and DIY without a tracking engineer running the session) surged in popularity during the pandemic. With modern editing in a DAW, quantizing every phrase, word, or even syllable to the grid (or other layers) isn't difficult at all, just a bit tedious.
For which kind of music does this work in practice, though? The first musician in the chain would have a very lonely experience ;-)
Another technique is to mix all musicians on a server and send the mix back to all of them. Everyone hears the exact same signal, so in theory people could play perfectly in sync, but in practice hearing yourself delayed is very awkward and needs lots of training. We tried it but gave up quickly.
The first musician could be the conductor; they are already accustomed to conducting a bit ahead of the sound. Or the instrument with the most notes could go first.
Musicians want/need to hear each other. This is obvious for improvised music, but even when everything is written out, players have/want to listen to each other.
While this approach certainly works from a technical perspective, I don't see many musical styles where it would yield a satisfying musical result. However, it might work for pieces which are specifically written for this constellation.
Latency can be radically improved with SoundJack and tweaks to router settings, combined with audio equipment that introduces as little latency as possible. (i.e., don't use cheap USB mics.)
Do all of this, and you can collaborate really well with people in the same geographic area as you. Of course you can't get around the speed of light, but you can do a lot better than the default setup or Zoom will give you.
Yeah: it is maybe worth noting explicitly that the speed of sound is sufficiently slower than the speed of light that two people playing instruments on opposite sides of a small stage could actually, if they have pickup mics and headphones using software carefully designed to minimize latency, be reasonably far apart over the Internet and have the same experience (such as a bunch of musicians living in San Francisco, or all in the same region of Los Angeles).
Even about two orders of magnitude further than that, based on the work done so far.
For instance, 22.5ms latency has been achieved at a distance of 900 miles (between the middle of Kentucky and Boston) with SoundJack and this setup:
Both port-forwarded, both using Fast Music Boxes, both on fiber internet connections. First user had an MXL770 XLR mic and an Audient EVO4 interface. Second user also had an XLR mic and an audio interface.
I think you're underestimating the amount of latency that the human ear can detect. If the audio is as little as 10 ms out of sync, either way, someone will notice that.
Depends on what sound you are measuring. A single drum hit where one channel is delayed by 10ms is very audible. Would a violin and cello player have issues playing together if one was delayed by 10ms though? Could anyone hear that in the audio itself, or even detect it in their performance at all? That seems a very different thing.
Sound travels at ~1ft per ms in air, so 10ms is equivalent to being 10ft away. Sure, you notice the lag at that distance, but it's still workable. 30ms (or 30ft) is the point at which I would give up on trying to play rhythmic music well. That said, 30ms of latency is a lot of wiggle room where the speed of light is involved.
And I think you are underestimating the speed of light ;P. The round-trip latency (twice my distance) from my apartment in Santa Barbara to servers I have in Nevada and Northern California is 20ms, making me the equivalent of like 11ft away from those places by sound. If I were communicating within San Francisco or Los Angeles I should trivially be able to hit 3ms round trip times.
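For anyone who wants to sanity-check the figures flying around this subthread, the two rules of thumb are easy to encode (the constants are round-number assumptions):

```python
SOUND_FT_PER_MS = 1.125   # ~343 m/s in air
FIBER_KM_PER_MS = 200.0   # ~c / 1.5 in single-mode fiber

def acoustic_equivalent_ft(one_way_ms: float) -> float:
    """How far apart two players on a stage would stand to experience
    the same one-way delay acoustically."""
    return one_way_ms * SOUND_FT_PER_MS

def fiber_one_way_ms(km: float) -> float:
    """Idealized fiber transit time, ignoring routers and cable slack."""
    return km / FIBER_KM_PER_MS

print(acoustic_equivalent_ft(10))   # 10 ms of latency ~ 11.25 ft apart
print(fiber_one_way_ms(1000))       # 1,000 km of glass ~ 5 ms one way
```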
That is the role of a conductor. He hears the piece as it should be heard and is an active participant in "tuning" it. My understanding is that he can slow down/speed up various performers to get them in sync.
Performers need to learn to ignore the rest and play their instrument while looking at the conductor.
Over surprisingly short distances you run into issues with the speed of light. I did live sound engineering when I was younger, and worked in some studio recordings, and the conventional wisdom of the field is that musicians will begin to get distracted by latency greater than 10ms (which you have to care about when sending foldback to a musician, even when all of your equipment is in the same room).
10ms only gets light 2/3 of the straight-line distance between LA and NY. After you start accounting for latency in infrastructure components, round trips, and the fact that cables aren't laid in straight lines between two users, the idea of live musical collaboration over even short distances starts to seem implausible.
It's actually not implausible; it's already happening. No, you can't do it between LA and NY, but you definitely can do it across distances of, say, several hundred miles quite well.
Well, I don't know how fast electrons travel in copper, but I imagine it's well below c. In optical fiber, the speed of light is about 2/3 c. So you should probably at least multiply your numbers by 2/3; in that case, you can only get 4/9 of the way between LA and NYC.
The threshold is more like 3ms in my experience. When tracking vocals with headphones (no acoustic latency) anything higher will start feeling weird. This is also why Thunderbolt is preferred over USB for recording; you can get in and out of the CPU with effects much faster.
Distance from the source of sound is actually very important in live performances. There have been some empirical studies on it, but it's still something people often get rather opinionated about. There is a reason, though, that a small ensemble can play to the beat of something like a drum but a large ensemble like an orchestra can't. There's plenty of writing about how orchestras lag behind conductors and how different sections of an orchestra have to take their timing cues differently depending on where they're situated.
10ms is what I was always told was the generally accepted threshold after which a musician starts to experience increased difficulty as a result of the latency, which largely aligns with my own experience with audio latency. From there the difficulty just keeps increasing, and the performance quality keeps degrading, until at some point the musicians and the audience give up.
Thanks for writing this. I made the exact same point in another subthread: that 10ms is enough to mess with your brain. My only experience of it is when the audio and video were out of sync by only about 10ms while watching something (I forget what). I can't watch anything like that for any real length of time, and I'm not even trying to play along.
The refractive index of single-mode fiber is ~1.5, so 10ms of fiber transit is closer to 1250 miles. The fact that no two real-world internet users are connected by a straight line of fiber-optic cabling brings that in even further. And if you were relying on round-trip latency for some reason, that would cut the distance in half.
And that's just for an idealized communication network. A real-world use case would have a lot more latency introduced by routing/switching, and all of the typical quality issues you'd expect from ISPs. I used to live within line of sight of the data centre that hosted an online video game I played, and the lower bound of my ping was about 15ms; it was usually in the 20s. Those factors are why something like this is unlikely to ever provide a high-quality experience over even relatively short distances. I've just always found it interesting how quickly you start to run into issues with the speed of light when you're trying to optimize for network latency.
I mean, the stated use case is for a music school. I actually live in a University town. We have tons of students all "working remotely" within a single mile radius of campus, and many of them are connected together by our campus intranet -- the University does wireless point-to-multipoint links to connect all of these random housing options (which include large buildings they simply bought in the community) -- which is something I think we could easily push to include more buildings. I bet we could easily pull off 2ms networking overhead.
(And no: you are wrong about round trip. If you care about round trip time then 3ms of round trip by sound is so close that the violin player is going to be elbowing the singer. Sound is fundamentally slow, and you need to just accept that.)
A well architected local network is about the only type of network that a service like this could operate over, and still provide a decent experience.
Also, the reason you'd care about round-trip time is if you needed mixing or processing to be done to musician foldback (I never mentioned 3ms being an issue, btw). Mixing and processing sound before it is sent to the foldback monitors is a completely standard process in just about all live and studio sound engineering. Latency is a very important consideration for sound engineers, and it's why in situations like stadium performances, the only sound that matters to a musician on stage is the sound coming out of the foldback monitors directly in front of them. I'm getting a rather strong impression that you don't know much about what actually goes into sound engineering.
I guess I don't understand why you think "sound engineering" is relevant to the point of being required, as the question to me here is more about what happens at something like a "session"? I used to go every week to play with a large mix of random people at an Irish bar... we definitely didn't have monitors (or a sound engineer ;P). People have been playing music with each other in groups for a long time, and being within 11'3" of everyone seems unrealistic. I have been in drum circles larger than that (I am at best a percussionist, though I was dabbling with fiddle), and I would assume that to be the case where precision matters the most. The stage we were on (which is weird: most sessions I've seen--and I have seen more than I participated in as I got involved in this stuff due to dating a musician--are at dining tables) was definitely over ten feet wide.
FWIW, when I asked my professional musician ex about all of this (which was before I read this comment of yours here) she also mentioned monitors, but it was because she claimed a core consideration for her with respect to distance in a performance was volume of her instrument drowning out her playing companions (and when I pushed into that she said if the acoustics were bad enough they would use monitors). She also told me that people she knew were already using software to play with each other over the Internet. If nothing else, I feel like "proof by counter example" should win here vs. your statement that this is just somehow impossible? As others have pointed out in this thread, the SoundJack people exist and claim to do this (whether or not you believe they have users: my ex apparently knows of users, though I don't know if they are using the same software).
"The biggest problem is Latency". Try turning off video and instead run Mumble with low latency settings. Should be way better than zoom at least. Why do we always have to create something new from scratch when we already have a solution that just needs a bit of polish?
I gotta mention JackTrip in this thread. You can stream full quality audio with near zero latency. If you're within a ~300 mile radius of your peer, it boasts <20ms latency, which is low enough to feel instant. I've tried it with a friend and it's pretty incredible.
My dad uses https://www.soundjack.eu/ running on a Raspberry Pi to practice with his wind quartet, along with a separate video call so they can see each other.
It's also a peer-to-peer system, and not easy to set up correctly.
Sorry, but 20ms is more than double the amount I can tolerate when playing my instrument.
Even just the difference between local loopback (instrument -> headphones) vs instrument -> USB -> PC -> USB -> headphones, feels like playing with someone continually dragging the tempo down.
I'm sure I could relearn and adapt to >10ms, but this is a far cry from a live experience of playing with others, and I am definitely not alone in my cohort (after trying several low latency remote meetup configurations).
20ms is actually equivalent to playing/singing with someone around 20 feet away from you, on a stage. (Sound travels around 1.13 feet per millisecond.)
Many of us in classical music are doing this all the time. So it's definitely doable. Especially if you have a conductor! Indeed, that's one of the reasons conductors are useful.
Listening to yourself on a 20ms delay is a far different thing than listening to someone else on the same delay. The latter is what you experience in everyday life all the time, as sound travels about 1ms/foot. The former causes a negative feedback loop that will cause problems as you try in vain to effect an impossible correction. https://en.wikipedia.org/wiki/Delayed_Auditory_Feedback
It is pretty painful to set up, that's for sure. You need to be comfortable with the command line so I haven't been able to experiment yet with non-computer friends. I'm going to try to create some sort of one-liner setup script for my friends to use but I haven't gotten there yet.
There are some good instructions here for different systems.
For anyone who hasn't seen it yet - this adjacent project sells a complete unit utilizing this software. As far as I can tell it's just a raspberry pi with an ADC/DAC hat, in a metal enclosure, and some sort of cloud configuration interface. They also publish the details and the pi image if you want to build your own, and subscribe to their cloud plans.
I was thinking recently about how bad modern voice calls are. It's all highly compressed and effectively half-duplex, forcing the participants to take turns talking.
The old analog POTS phone system wasn't like this. It had that feeling of almost being in the same room: no latency, true bidirectional conversation.
I was wondering if there's anything that tries to get close to this old POTS level of quality, at the cost of higher bandwidth and processing power.
Can some of the software mentioned in this thread achieve this?
> The old analog POTS phone system wasn't like this.
I mostly agree with this, but POTS started going digital surprisingly early, before VoIP ever became a thing. And of course digitization back then typically chopped off a huge amount of the high frequencies, making things sound... well, like the phone.
Yes, but it used TDM near-exclusively in the digital domain, not packet switching like we do now, which led to vastly different levels of service compared to today. The problem isn't digital; the problem is packet switching.
Most people making calls today have no conception of how good telephones used to be.
Sadly for some moronic reason FaceTime can’t use hardwired Ethernet. I have a reliable 4ms ping and Apple sells Ethernet adapters for iPads but FaceTime won’t use them.
Yeah systems like zoom are designed for enterprise where hundreds of simultaneous calls are happening on the same network. They work very hard to make those packets as tiny as possible.
I'm a musician and I'm incredibly confused by this point. Music instruction cannot possibly be effective without a visual component, regardless of whether it's a solo instruction or a group. An A/V method with a bit of latency is almost always going to be superior to audio-only.
With one group of friends I've been doing weekly calls for months now, with audio through Mumble and video through Discord, where everyone stays muted.
Are you doing instruction, including "your hand position isn't right" etc.? For that, latency is tolerable, but different latency for picture and audio is problematic.
How you place your hands, bend your wrists, stretch your fingers and generally hold your instrument can greatly impact how you play it. Depending on the instrument, a simple bend of a wrist can lead you to consistently play out of tune, or make it hard for you to hit a specific note.
Even just touching strings in different ways can produce wildly different results; think playing a harmonic note on a bass guitar, which is how you get such a low-sounding instrument to play a high-pitched note. How you blow air through your horn can greatly change the timbre and mood of your music.
There are a lot more examples, and they are all difficult to correct if you aren't able to see what the musician is doing.
Physical matters of technique. Depending on the instrument (or lack thereof): grip, placement of fingers, posture, breathing, lip embouchure, and more. Heck, you could even have it put together wrong or be holding it upside down. A remote teacher cannot easily convey these things in a timely manner without seeing the student.
Please re-read my post and note that, at the beginning of the sentence you quoted, I specifically targeted the instruction of music. More specifically, you cannot help teach a blind violinist how to prevent themselves from continuously playing their notes flat without being able to see how they are playing.
Blind people are perfectly capable of learning to play music. Blind people and non-blind people alike can learn on their own, or seek out instruction. It is the latter that we are discussing here.
Totally, and I'm sure there have been many, but being able to touch and feel the placement of the musician's hands/fingers/etc. would be important. There are certainly things they could teach without that touch, just like non-blind people could teach a few things remotely sans video, but core areas of playing wouldn't get the right attention.
Latency will always be an issue here. 200ms is enough to throw off the timing of an “all-at-once” note or syncing up sections.
I’ve been a hobbyist musician since leaving school 20 years ago. I play piano, guitar, bass, mandolin, cello, drums, etc. All of the typical video platforms fall short of having a low enough latency to achieve “remote orchestra”.
There was one product/service I tried that got closer than anyone, JamKazam (https://jamkazam.com/), and there were a few audio-only options back in the mid-2000s that had close to 20-50ms, which was better.
No, it will not. :) You can get to 5ms above the network delay, even using compressed audio (opus), though some sound devices may be picky.
Network delay can be just a few milliseconds even across a city, assuming that no latency murdering devices (wifi or nics in interrupt mitigation mode) are on the path.
This means that you can have lower audio delay from a compressed audio conference crossing your state than is achieved from speed-of-sound delays from a performer sitting on the other side of a moderate sized room!
The future is already here -- it's just not very evenly distributed. (yet)
> Network delay can be just a few milliseconds even across a city
If you limit your collaboration radius in that way, then sure (though it still depends on the router technology the packets encounter along the way). But if you want to go beyond that, the numbers get bigger. For transcontinental stuff, much much longer.
One way delay between San Francisco and New York city is 37ms across the public internet today.
This is the same one-way delay as sound in air across 12 meters, which would be on the large side for an orchestra pit.
It does get larger as you go further, but live performance with people on the other side of a large continent is completely realistic and has been done. :)
Sure, expecting to go between NYC and someone in china behind the GFW is probably asking too much for a seamless experience. But many people mostly want to work with people in the same country as them ... and for that, with sufficient technology, latency need not be an issue.
> live performance with people on the other side of a large continent is completely realistic and has been done. :)
In case you're not aware, I'm the original author of JACK, around which JackTrip is built, which is the most likely and reliable tool for such a collaboration. I certainly know people who've done this, and I regret that I didn't realize back in February how useful it would have been to make something like JackTrip into a much easier-to-use tool for computer-naive folks.
37ms cross-country in the USA is optimistic, but certainly possible. US->Europe is not so great.
In case you're not aware, I'm one of the authors of Opus. :P
When the pandemic hit I thought about putting out some pointers to easier to use low latency streaming resources... but... that would require overcoming pandemic lockdown funk. And honestly it's still kind of a technical rats nest to get everything working. It's not exactly musician friendly to need instructions like "next you need to make sure your nic doesn't enable interrupt mitigation when there are more than 100 packets per second...". :)
Another example of HN being amazing. In the interest of playing too, I did some work on reducing audio latency in Android, but that all predated AAudio/Oboe, which takes that quite a bit further and makes at least the Pixel devices a plausible platform for low-latency audio collaboration apps as suggested in this thread. I for one would love to see that happen - I miss singing hymns on Sunday mornings.
Yeah, I'm really sad that I didn't get some serious effort together on this back at the start of the year. There's no lockdown funk here, just too busy working on Ardour.
One of the 2 netjack implementations uses Opus - great stuff.
Absolutely right that the good stuff (i.e. not zoom) requires way too much setup, and even then (as you note) you're not guaranteed reliable function because you don't control all the hops. Soundjack gets the user-side fairly good, but it's still not quite what I think would have really taken off during the pandemic - probably needs a mobile app for that.
And I love AOO, I use it as the basis for SonoBus.... I’ve been meaning to contact you about it, and work on merging your latest into my fork (which has diverged a bit over the last few months).
Latency in this case is cumulative: 1-5ms for your DAW to encode the analog signal, 1-?ms to encode it in whatever codec is used for transmission, 10-1000ms for the actual transmission, and 1-5ms to decode the signal into audio channels for playback. Don't forget this is duplex: you are transmitting as you are also receiving. In an ideal world we could get very close to zero latency (enough that an orchestra could be in step with each other).
It sounds doable. It sounds like we should already be there. Looking around, though, you'll see we aren't, because of technical issues like those described, which are systemic to the infrastructure used.
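To make the cumulative point concrete, here's an illustrative one-way budget (every number is an assumption for illustration, not a measurement):

```python
# The network is only one term in the sum, and in a duplex session the
# whole sum is felt twice: once out, once back.

one_way_budget_ms = {
    "adc_plus_daw_buffer": 4,     # analog capture and DAW buffering
    "codec_encode": 2,            # e.g. a low-delay codec setting
    "network_transit": 15,        # the irreducible physics part
    "jitter_buffer_and_dac": 4,   # decode, de-jitter, playback
}

one_way = sum(one_way_budget_ms.values())
print(f"one-way total: {one_way} ms")          # 25 ms in this example
print(f"call-and-response: {2 * one_way} ms")  # your note -> their reply
```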
Network delay can be just a few milliseconds across a city, or you can discover, like my brother who teaches elementary music, that your kids have different ISPs and some are awful.
He's been teaching music remotely since August, and one of the first days he demonstrated why they have to be on mute on Zoom by having them all sing happy birthday to one of the students. It was utter chaos. I worked with him to try to find a solution, we tried various tools, but found nothing that worked across cable, DSL, WISP, and cellular connections.
Some tools could sync after the fact so the recording sounded good (if each musician keeps the same tempo), but the whole point is that the kids needed to be able to hear and follow each other, and just one student at 500ms latency totally ruins the tech.
So he teaches music to second graders by zoom, and they're all on mute unless he unmutes them one by one.
Yes, it will, as soon as you are 1000km away. The speed of light creates a lot of unresolvable latency. How can you play live between London and Sydney? You need a different musical understanding.
1000km at the speed of light is 3.33ms, which on its own isn't an issue. While fiber is slower, you can have low-latency connections over that distance without issue.
The real problem is that computer audio isn't designed for low latency. Generally, as long as video and audio are reasonably synchronized, nobody notices; latency is rarely a consideration on its own.
Even with the state of the art equipment, transcontinental music making will always exclude the kinds of music that most of us play while in the same room together.
I was specifically mentioning 1,000km as the metric.
AWS latency isn't great between regions. Looking at some other sites, you can see London to NYC is 70ms, but that's also 5,500km, which is far beyond the 1,000km mentioned. Ex: https://wondernetwork.com/pings/London
Scaling that down, 70ms / 5.5 = ~12.7ms round trip for 1,000km. The speed of sound is only ~1,125 ft/s, and you halve the distance for a round trip, so 1,000km over fiber is the equivalent of ~7.14 feet apart in the same room, plus whatever latency your personal computer and software adds.
My DSL line gives me a 20 ms roundtrip to the first hop. My only choices at this address are DSL, LTE (latency seems highly variable, and much more expensive than DSL), or paying at least $50k to get muni fiber installed. So latency will always be an issue here.
Well, I could probably get some form of POTS either analog or ISDN/T1
DSL is pretty much the worst of the wireline services as far as latency goes. Even some of the wireless technologies are better.
If you use Sonic, at least in some areas they have a setting where you can turn down the level of forward error correction and get lower latency in exchange for somewhat higher packet loss.
Jitter is an issue as well. If the packets get routed in different ways the variability in latency might annoy you. Ever had a video call where the person gets really slow and then speeds up again?
For extremely low-latency audio, you simply convert most jitter into loss (and the rest into added delay). How annoying the loss is depends on the quality of the loss concealment.
If your conferencing is 'slowing down' and 'speeding up' then it's simply not a realtime communications medium.
> Network delay can be just a few milliseconds even across a city, assuming that no latency murdering devices (wifi or nics in interrupt mitigation mode) are on the path.
Even with fiber, all ISPs I have ever used at home have latency-murdering devices between my modem and their connection to the Internet. And I have to go through these hops even when I connect to an IP on the same ISP.
Best I ever got was 12ms. Once you connect your laptop to the Internet in a data center you will instantly know what I mean.
High jitter also occurs when the signal is marginal and the client scans the WiFi frequencies for other base stations to associate to (on the assumption that maybe you moved and there is another base station with a stronger signal). Since it has to “tune” into the other channels to listen for beacons, there will be a noticeable jitter while it is not tuned to the “current” channel.
This is my understanding as I’m still learning. Fascinating this stuff works at all.
I mentioned JamKazam in this thread above about latency. It’s the top of my list for this purpose but still has some issues when the connection quality isn’t good (ISP issues)
Some friends of mine had a pretty low latency (sub 50ms) using Jamulus (https://github.com/corrados/jamulus), which is an open source conferencing tool specially made for musicians. Need to self-host though.
Mumble is probably not the right software given the purpose built ones that exist. Jamkazam is one. There are many. The apparent ignorance about existing tools in this story and comment section is driving me nuts.
Because turning off video makes it useless for a music instruction tele-conference app -- which is all about the instructor being able to see how you sit, how you use your fingers, show you fingerings, point mistakes, and so on.
In other words, they don't want a real time music streaming app, they want a music teaching app. Duh!
Mumble requires running your own servers, and many universities already pay for Zoom licenses. Why not ask your vendor to add a feature to software you're already paying for?
Mumble has excellent audio quality but it's marred by its lack of feedback cancellation. Mumble is basically unusable if anybody isn't using a headset to listen.
Which may be tied to its low latency... I assume there are algorithmic reasons why cancellation needs latency.
That and between the audio setup wizard and its unorthodox auth system, the onboarding process is brutal for non-technical users.
Engineers often discount the value of discoverability and accessibility. I would argue those are the main reasons Zoom became ubiquitous in the first place.
We use OBS Studio in conjunction with the teacher's preferred videotelephony software (currently Skype) for our daughter's piano lessons.
I have defined four scenes in OBS which she can switch with A, S, Z and X, followed by a space bar press.
A is the webcam of the notebook, which sits right above the keyboard of our digital piano. This shows just the semi-portrait of the person playing.
Z shows the view from a second webcam mounted above the keyboard to display the keys and the fingers.
X is the view of just the virtual keyboard from a Synthesia window, where the pressed keys are highlighted.
S is all three previously mentioned scenes combined in one nicely arranged view.
Audio from the digital piano is also a source for scenes in OBS.
OBS provides a virtual camera that can be used like a normal camera in most, but not all conferencing software. For example, the preinstalled Skype in Windows 10 doesn't recognize the virtual cam, but Skype Desktop from skype.com does.
It's not perfect, but - I think - a decent setup, quickly built with only a laptop, one extra webcam and free software.
Congratulations. The teacher must be really happy.
My better half teaches online as well. Your setup goes the extra mile and leaves all other setups in the dust. Most parents cannot be bothered to make sure a basic setup is working.
For music education even a $5 lav mic would be a huge help.
Now I wonder about performing together? Everything I've seen that looks like a live Zoom performance of 10-20 musicians is actually recorded separately (with a click track) and then edited together in a video to make it look like it's Zoom.
But now that the music quality is there... Zoom could conceivably broadcast a click track (e.g. the host's audio that would go to performers' speakers but not the audience's, or even the host being a conductor or lead performer), performers would play "solo" (without hearing each other), but then the central Zoom server would wait for and cache every performer's audio until they were all received, with a suitable buffer, and then output it mixed, to the audience, synchronized to the host's (click-track) audio (timecodes).
And unlike normal Zoom calls where it only ever transmits 1 or 2 audio streams (the loudest ones at the moment), it would always mix all of them.
It's probably complicated enough that it wouldn't be worth Zoom's effort, and also niche enough not to be worth a startup developing a separate product for it... but it certainly seems doable, no?
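The server-side part, at least conceptually, seems small. A toy sketch of the buffer-then-mix step (entirely hypothetical, obviously not Zoom's code; the roster and function names are made up):

```python
from collections import defaultdict
from typing import Dict, List

# Hold each performer's audio frame until one has arrived from every
# performer for the same click-track timestamp, then mix and release
# the result to the audience on a fixed delay.

PERFORMERS = {"host_click", "violin", "cello"}  # hypothetical roster
pending: Dict[int, Dict[str, List[float]]] = defaultdict(dict)

def send_to_audience(timestamp: int, samples: List[float]) -> None:
    pass  # placeholder: hand off to the broadcast pipeline

def on_performer_frame(timestamp: int, who: str,
                       samples: List[float]) -> None:
    pending[timestamp][who] = samples
    if set(pending[timestamp]) == PERFORMERS:
        frames = pending.pop(timestamp).values()
        mixed = [sum(column) for column in zip(*frames)]
        send_to_audience(timestamp, mixed)
```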
Another way to do this is to put everyone in a chain, and have each person be listening to what the person before them was playing a few seconds ago. I wrote some software that does this, and anyone is welcome to play with it: https://echo.jefftk.com
Not the same as your idea, but similar is NINJAM, which is tempo synced, but deliberately makes the latency equal to one bar, phrase, or other segment of time. You are always playing live to what everyone else played the very last time around, but you are still playing on the beat together. It obviously works better for repetitive jam sessions than for structured songs, but it looks very interesting. I haven't yet tried it with my music buddies, and should.
Bit unrelated, but I recall an old webpage that had a grid of 10 or so YouTube videos embedded, each with a different instrument playing (they didn’t auto play).
This was pre-mobile, so it was expected that you could play multiple videos at once. I was amazed, and it was the intent, that independent of how few or many videos you played and no matter when you started each of them, all sounds blended together nicely.
I don’t know what to call it when there’s no inter-sound coordination - a timing independent harmony? It was so cool.
Anyways, I would be super interested in attending a zoom concept of that sort that might side step some technical difficulties.
This is correct. The reason is more or less that humans are really, really good at staying in sync, but even with only 2 parts to listen to, it starts to take an almost impossible level of effort to keep proper time. In person, without any network lag, musicians can make adjustments in real time (e.g. always, always, always follow the soloist, even if the soloist is off tempo or off pitch). The conductor is listening to the soloist too, so you don't need to worry about conflicting directions.
Here are a couple of examples I know of on YouTube where a single musician plays multiple parts, which are then edited together:
Now professional cellist, graduating senior at time of recording, Sarah Chaffee playing an original transcription of Alice Cooper's Poison with 5 cello parts (with a short guest appearance by Shostakovich at about 2:42, then showing up again in the outro to play us out at 3:33): https://www.youtube.com/watch?v=PjYwQHFsld0
This guy has a really, uhm, interesting violin technique. ;) Just the fact that he chose to play the instrument the way he did tells me he does not actually know how to play violin. What he's doing is playing it like a teeny tiny cello. Since the notes are so much closer together on a violin's fingerboard than on a cello's, I'm betting he needed an extra take or 2 on that. His performance is passable, but nothing when put up against that of an actual, professional violinist.
Zoom is not the right software for this. It already exists. There are many online streaming musical collaboration tools, including free ones. Jamkazam is one example.
My band did that for a while. It was a lot of work for the producers.
We also played a specially composed piece for playing over zoom. We each recorded separately then sent our files in later. But it was live. And the music was super-simple.
Possible solution: First everybody plays to the same click track without hearing each other. Then everybody plays the same thing again but this time hearing what others played during the first round.
This can be repeated as many times as desired. On the 3rd pass you will hear what others played during the 2nd round, which is presumably better played than their first pass, when they only heard the click track.
Thus you can "jam" with the other musicians the way they played last time around.
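In rough Python, the monitoring rule would be something like this (my own formulation of the idea above):

```python
def monitor_mix(takes_by_round: list, me: str, round_no: int) -> list:
    """What player `me` hears while recording in round `round_no`:
    the click track plus everyone else's takes from the prior round.
    Each entry of takes_by_round is a dict mapping player -> take."""
    if round_no == 0:
        return ["click"]
    previous = takes_by_round[round_no - 1]
    return ["click"] + [take for player, take in previous.items()
                        if player != me]
```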
Something like a live recording session? Where the musicians input is synchronized and gets streamed into the DAW of each participant so they can mix their own version. This could be fun.
During the first lockdown I watched a demo video of a remote VST plugin combo that allowed the recording artist to record his vocals, and the recording engineer to manipulate the audio each from their own homes.
Yes, but sort of decentralized. I can play with any combination of recordings that were recorded over each other. There’s some neat graph theory stuff in there.
There is an iOS version available now too, which has inbuilt synths and effects but also allows external sources. That in essence means people are doing exactly that!
The app is a lot of fun even with little musical ability. With Studio it seems they're trying to achieve your vision, but in the DAW.
The flaw here is when one musician makes a timing mistake, the others may play off it as they naturally would while jamming, but then the mistake is compounded. Similar to a copy of a copy, the flaws are amplified.
During lockdown I was able to have music jams with high quality uncompressed audio at 10ms (effectively real time), using Jamulus for the audio, and Zoom for the video. Then the whole thing was live streamed to Twitch.
Me too - jamming at about 15ms lag with Jamulus felt really natural. In fact after months of talking to other people through zoom, just chatting with my friends with low latency, high quality, and without the auto processing/muting that zoom seems to do, was amazing. Conversation just flowed back and forth better!
There's a bunch of similar bits of software, but none of them are super welcoming for the less technically inclined. I guess the experience is often going to be limited by network performance (specifically jitter and loss on UDP), so it's a challenging thing to make super slick/consistent. I think that low-latency 5G with latency guarantees might help solve it when it's widely available, though.
I can also report a positive experience with Jamulus. There's probably people in this discussion who have only experienced the latency through Zoom or Skype, and assume it's about as good as you can get over the internet. It's not even close.
If you have special software like Jamulus, non-wireless internet, players geographically close to each other, and an external audio interface, the latency is surprisingly good. If you have most but not all of that, it can still be usable. Our end-to-end latency was around 30 ms, so not as low as the parent's, but we were still able to play even "jam songs" where we just followed each other for chord changes instead of agreeing on them beforehand.
> If I had a child, I would have severe qualms with it.
Yeah, I'm skeptical. If you had a child, you'd be so spread thin with keeping your kid healthy and safe, while paying for it all, that the licensing of some software they use would be the least of your concerns. You'd be worried that his girlfriend might be manipulative, if he's smoking too much weed, if he's partying too much, if he is being bullied, or if saving up for the PS5 is a wise use of his summer job funds, most likely.
For the last few years, there have been complaints that Youtube Kids has issues where the algorithm stumbles onto content using cartoon characters in ways that are disturbing for young kids.
Do you think imaginative play is supposed to be from whole cloth? I thought that was half the point of children’s media, to give raw material for children to synthesize into imaginative play.
My kids play very differently when they are not given media. They can create their own stories. They dig in the mud instead of grabbing some dirt and calling it Thomas.
I loved Thomas as a kid and will let my kids watch it, but if you actually watch the plots, they have very odd morals to the stories and are generally male chauvinist. But it's trains, and it's generally harmless IMHO as long as you discuss the shortcomings with your children.
I can tell you right now they're just gonna default to Zoom as that's the service that is mostly talked about, and probably the easiest to set up and use. Even my grandmother knows how to set up a Zoom meeting.
As a teacher of the Deaf, teaching in sign language, I've found the video quality of various video conferencing platforms to be highly variable. I wish Zoom had more tuning options for video so we could optimize our video streams for better clarity (e.g. tweaking FPS or enforcing a minimum bitrate/resolution/quality/etc. for the video stream). Interesting to know there are also issues on the audio side.
I also recommend Jamulus (https://jamulus.io/) for really low-latency audio. It got a fair bit of attention and development done during 2020 and now works better than ever.
I regularly stumble upon open Jamulus rooms that some music teachers use for their private lessons. They usually spend a lot of time (painfully so) explaining to the students how to connect everything and use Zoom simultaneously for video.
Piano is a notoriously difficult instrument even for regular music production, because its broad frequency range and harmonics are a pain to deal with. So kudos to Zoom for at least making it sound palatable.
I do wish there were more examples for other types of instruments, or even voices. Guess I'll have to try it out myself.
> Piano is a notoriously difficult instrument even for regular music production…
Er, sort of. I think there’s some incorrect information in this comment.
Piano is an acoustic instrument, and to record an acoustic instrument well, you generally need to have a decent instrument in a good acoustic space with good microphone placement. For most people, getting all three is quite hard. Microphone placement for piano ain’t exactly easy but it’s not really harder than, say, guitar or violin. The hard part is that you inevitably capture a lot of ambiance from the room. With a room that sounds bad or with poor mic placement you’re going to get an inferior piano sound.
What makes, for example, guitar easier is really just the fact that if you have any decent room somewhere, you can fit somebody with a guitar into it. It’s small & portable. You still need a good instrument, a good room, and good mic placement but you don’t need to have the guitar permanently set up.
The frequency range of a piano & its harmonics are not especially difficult to tame in music production.
There are some instruments that are kind of "easier" in a sense, like vocals, electric guitar, or electric bass, but I don't think the differences are all that stark. The main difference is that you are much more likely to use only a single microphone for these instruments, and more likely to use multiple mics for piano / acoustic guitar / violin / etc.
If there’s one common instrument that requires the most effort in music production I’ll say that it’s the humble drum kit, hands down. If you’re in a home studio, the drum kit might spur you to add more channels, make you work harder setting up the instrument and mics, and make you do more work with acoustic treatment.
Ooh something I have direct personal experience in. I currently use Zoom once a week for a music lesson. The latency is poor but my friend is not super great with computers so it's a reasonable compromise.
Now _playing_ together is much more challenging, and not something especially viable for students (which is why it's a bit weird targeting this at Zoom): you definitely need a proper audio interface, most likely with ~5ms of latency. You then also absolutely have to be on a wired network connection, which on most modern laptops probably requires another adapter, as wireless introduces too much latency. Overall you need < 25ms of latency in order to be able to play "live" together. This is doable but you need the right tools.
I'm a folk musician who plays in pubs and you can even see the effects of latency in a real-life setting like that - if the room is too large then people on the other side start to drop out of time. What I do in this situation is follow their hands rather than what I can hear. Unfortunately this too doesn't work online due to video latency being even worse. Metronomes don't really work for our music because it doesn't follow a standard BPM, it can and should vary.
We did manage a few sessions using JamKazam (https://jamkazam.com/) which works well enough with the tweaks listed above. It's largely restricted to people who know what they're doing, sadly.
I can use products like GeForce Now and manually set my server to a location a few thousand miles away. Ping time increases from 10-15ms to about 70ms, and for anything but twitch gaming I get high-fidelity graphics, controls, and sound, synced indistinguishably from playing on my local PC.
This is not a technology bottleneck, it is an implementation bottleneck. Zoom has benefited from the WFH and remote learning movement, but the market is ripe for disruption with a higher quality experience.
There is so much low-hanging fruit in Zoom that they just refuse to add options for. The "turn on original audio" option is fairly good at removing noticeable DSP processing, but the audio can often suffer from compression issues.
I have been trying to follow the instructions on Zoom's web site to turn on original audio (because I always use headphones on a call), but the options just aren't where the web site says they are. How did you manage it?
To my knowledge, some settings are locked behind an institution's setup. You might need to log in to the Zoom website where your preferences are and see if there is a setting to enable advanced settings, or something to that effect.
Then, prior to screen sharing or audio sharing, there is a button on the top left for "original audio".
There are many notes here about latency and how it can/cannot be solved, etc. This ignores a basic technique used by most video conferencing vendors that is not music-friendly: unpredictable (non-constant) network delays happen, and to help counter them, most software occasionally speeds up audio playback to help clients catch up when audio arrives later than expected. This is a HUGE deal-breaker for music lessons, as it wreaks havoc on musical timing... even when the lesson does not require live playing together.
Just turning this off would make a big difference to online music lessons, I think. The "speed it up to catch up" technique is fine for speech, which is the dominant use case.
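Conceptually the two policies differ by only a few lines (my own naming, not any vendor's actual code):

```python
def speech_mode_rate(buffered_ms: float, target_ms: float) -> float:
    """Warp playback slightly faster when the jitter buffer overshoots:
    fine for speech, ruinous for rhythm teaching."""
    return 1.05 if buffered_ms > 1.5 * target_ms else 1.0

def music_mode_play(arrival_ms: float, deadline_ms: float) -> bool:
    """Fixed playout deadline: late packets are dropped and concealed,
    so the tempo the student hears is never stretched."""
    return arrival_ms <= deadline_ms
```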
My guitar teacher has switched to using Rock Out Loud (https://rockoutloud.live/) and it's made an enormous difference. Music is no longer distorted and the lag is negligible, so we're now able to jam together in real time rather than having to take turns playing sections.
Yes, and I am quite sure it would also be possible to "hide" a time code within the incoming audio that the software could use to remedy slight discrepancies over time. From my experience with sound and sound engineering, the brain wouldn't be able to hear the small fluctuations the software would need to introduce, and processing power nowadays is more than capable of handling this audio quickly.
I remember experiencing issues earlier last year with Zoom correctly transmitting the sub bass frequencies in a music production class I was taking.
Perhaps that was a user error, but in a craft where getting your frequencies right across the whole spectrum is paramount any kind of lossy optimization can lead to undesirable results.
I wonder if there are parts of Zoom's audio transmission that optimize for the frequency range of the human voice, and everything else gets mangled a little, along the lines of chroma subsampling optimization in the visual world.
I am not certain why you are being downvoted. It is fairly common for voice stacks to include a high-pass filter to remove breathing noises and other low-frequency artifacts, and this could cause exactly the issues you describe.
When I first started teaching online I used Google Hangouts for 6 months and it was beautiful. I do a lot of rhythm teaching, where the student uses their own metronome and taps, claps, or sticks the rhythm while I listen for accuracy. I was dismayed when I switched to Zoom and experienced the filter. It has no idea what to do with that kind of audio, whereas it was so simple in GH. Even with all of the Zoom settings optimally configured, you never know for certain what's going to get through and what's going to be canceled out.
As for configuring the settings, it's easy enough to walk a student through the desktop config, but last I checked the tablet config was ridiculous, and even required uninstalling and reinstalling. I would be so happy if that's no longer the case.
The best part of Zoom imo is being able to share your computer’s audio or the audio from a certain window. That combined with Google Shared Piano is pretty clutch for me.
[edit] Oh yeah, one other offering to the Zoom gods: I also do a lot of screen sharing, like most of the lesson, and I hate how the video feeds default to a tiny format that cannot be resized. A lot of times I need to be able to share my screen AND see my student's hands, but that's just impossible now. I could do it with Google Hangouts no problem...
Two family members are college music students. I'm a musician too. The answer is that they're making do with whatever they've got, because it happened suddenly and everybody had to scramble. When the lockdown started, I bought them some good microphones, and to be honest, even mainstream headphones have pretty good fidelity.
Possibly a bigger issue is having a place where they can work. If you've ever heard a podcast ruined by background noise, you know what I mean.
I think that musicians are particularly adept at mentally filtering out shortcomings of the audio "stack" when they need to, and hear what they need to hear. There are still seminal recordings that are only available as scratchy phonograph records and tapes made with primitive microphones. Maybe it's because we process music at a symbolic level. And it's possible to focus the lessons on the things that can be done remotely. By the time they reach college, they're already playing at an exceptionally high level.
A friend of mine who's a music professor at another university says that he encourages students to use "whatever works" including their cell phones, so that the mechanics of recording don't distract from learning music. They've always had to deal with issues such as students who can't afford professional quality instruments.
Of course they're missing out on ensemble playing.
Since the edit window is closed, I'll add that the psychological side of the equation is an even bigger issue. For instance when you're supposed to record your juries, you end up spending hours trying to get a perfect take, as you get more tired and frustrated.
When it was in person, you showed up, played, and were done, good or bad. As a performing musician, I greatly prefer to make a mistake in front of an audience than play perfectly into a microphone.
I agree with what you're saying, and it applies well to most instruments, but it's worth noting that certain instruments, like the violin, work very badly with some recording devices. If you make violin recordings on most laptops and phones (with default quality settings), they will sound horrible.
The two aforementioned music students play violin and cello, and I play double bass. There are a couple of issues that I've noticed. The first, especially pertinent to the bass, is that the sound doesn't come from any single place, so you have to find a sweet spot for the mic that captures a representative tone quality and doesn't collide with the bow, etc.
With the violin, that place is probably above the fiddle, so you have to find a way to dangle your phone, or put a mic on a stand. Especially since you're probably standing to play, for reasons of good posture.
And you want the mic to be close, basically to drown out the effects of typically bad room acoustics, unless you're lucky enough to have a nice recording room. Plus, noise from HVAC and other sources.
And... mainstream recording apps for primarily voice use have built in compression and possibly other artifacts that you have to figure out how to turn off.
And... if you want audio and video, it further constrains your options. You often need your teacher to see you play, to comment on posture and technique issues.
Once it's all working, then the mic in a cell phone and its audio input hardware are actually surprisingly decent. The tiny little condenser mic elements are fairly high fidelity, and an audio codec is pretty much a slam dunk nowadays.
I have some knowledge of mic'ing acoustic guitars and there are some similar issues. In particular, with an acoustic guitar most of the sound comes from the soundboard, which is also the case for a violin and cello. With the bigger instruments I can definitely see the issue: the soundboard and the whole body are enormous. For acoustic guitars, the typical advice is to use a condenser mic aimed at the 12th fret (avoiding most of the fret noise without getting too boomy from the bass of the body/soundboard), about 12 inches away on a mic stand. I think condenser mics are pretty much the best option here as well.
Personally I don't like the sound of a close-mic'd acoustic guitar. Unlike cello and violin, there can be a lot of fret noise, and wound strings can be quite squeaky. Classical guitars have different, quieter wound strings under less tension, and nylon for unwound. This also reduces fret noise and squeakiness, but personally I prefer the full sound of steel strings. I've been working a lot on getting my fretting technique cleaner to eliminate all of that mess and sound more professional, but to be honest I actually prefer to hear my own guitar live in a quiet, resonant room and not through a recording. Bathrooms with tile are great, if a little small! Of course that sounds like complete garbage if you try to record it. But it is absolutely heavenly live :-)
For the double bass, I imagine you'd be best off with a mic on a stand a couple of feet away, using a mic with a very narrow pickup pattern (almost a directional/shotgun mic).
If you're doing any kind of composition where mixing and mastering are a mandatory component, you're strongly encouraged to have a proper set of cans, or even better, a decent set of monitors. The tech has gotten rather good these days, so you can get a decent pair for not that much. But if the class is primarily about the theory and the arrangement, and not so much about the final mix, then I don't imagine a good setup will be needed or even recommended.
Does the article ever say what the settings we should change are, or is it just saying that Yale has a leadership platform that it exercised for musicians during the pandemic?
I'm rehearsing with a group of highly amateur musicians, and an easily configurable improvement here is something we would definitely adopt, but I missed what the changed settings should be. Thx
Would an n-way conference over ham radio offer good enough latency (if all participants are within a couple of km and the signal among everyone is good enough) to make band practice possible?
The amateur radio rules prohibit music (except a weirdly specific exception for NASA). Also, HF tends to be really noisy and low bandwidth, not sure it'd be very pleasant for music.
Edit: And another problem that's probably even more important for this use case, most common ham radio equipment is simplex (only one person can transmit at a time). In the VHF/UHF range you might be able to engineer a high bandwidth duplex (or TDM) system, but then your range would be limited to a few miles.
Most ham radio equipment is half-duplex, not simplex. Your average consumer hi-fi radio is simplex, because it can only receive, not send. Ham radio is half-duplex, because it can send and receive, but not at the same time.
I'm not a ham, but wouldn't bandwidth be an issue? I was under the impression that channel widths in the amateur bands were pretty narrow (<10kHz). Though I don't know if that applies to all bands, or just HF.
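Some rough, hedged numbers (my own back-of-envelope, not measurements) suggest propagation isn't the problem at that range; it's the channel width and whatever codec/modem you'd run over it:

```python
# Back-of-envelope sketch: RF propagation over a couple of km is negligible
# next to the ~25-30 ms at which ensemble playing starts to fall apart;
# the narrow channel and codec/framing overhead are the real constraints.
C = 3.0e8                        # speed of light, m/s
dist_m = 2_000                   # "within a couple km"
prop_ms = dist_m / C * 1e3
print(f"propagation delay: {prop_ms:.4f} ms")        # ~0.0067 ms

budget_ms = 25                   # rough rule of thumb for playing in time

pcm_kbps = 48_000 * 16 / 1_000   # mono 16-bit/48 kHz PCM = 768 kbps
print(f"uncompressed audio needs ~{pcm_kbps:.0f} kbps, far beyond what a "
      f"<10 kHz channel carries; latency budget left: "
      f"{budget_ms - prop_ms:.2f} ms")
```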
I'm assuming they use standard codecs like H.264/MPEG-4 and AAC for wide hardware compatibility. Zoom runs on a large variety of platforms including mobile devices. Would be very surprised if it was the likes of Opus etc.
My impression is that the standard version has compression/processing much like standard telephones, where most of the harmonics/overtones are removed. This ends up creating a bass-heavy sound.
I would argue that over the past decade, most humans have been exposed to this kind of processing/compression and our "tastes" reflect our overwhelming exposure to this sound. Just look at how headphones are marketed these days, or any sound system for that matter. "How much bass can I get? Will it knock my socks off?"
The second portion of the video clearly has much more detail and clarity in the higher frequencies and, as a result, can sound a bit "tinnier". It's not a quality I'd love to listen to music at, but it certainly helps convey the nuances of the harmonies in the piano much, much better.
TLDR: "the new functionality allows users to disable echo cancellation and post-processing, get rid of compression, and increases the audio codec quality from 22 kHz to 48 kHz, 96 kbps mono / 192 kbps stereo"