The threshold is more like 3ms in my experience. When tracking vocals with headphones (no acoustic latency) anything higher will start feeling weird. This is also why Thunderbolt is preferred over USB for recording; you can get in and out of the CPU with effects much faster.
Distance from the source of sound is actually very important in live performances. There have been some empirical studies on it, but it’s still something people often get rather opinionated about. There is a reason, though, that a small ensemble can play to the beat of something like a drum, but a large ensemble like an orchestra can’t. There’s plenty of writing about how orchestras lag behind conductors and how different sections of orchestras have to take their timing cues differently depending on where they’re situated.
10ms is what I was always told was the generally accepted threshold after which a musician starts to experience increased difficulty as a result of the latency, which largely aligns with my own experience with audio latency. From there the difficulty just keeps increasing, and the performance quality keeps degrading, until at some point the musicians and the audience give up.
Thanks for writing this. I made the exact same point in another subthread, that 10ms is enough to mess with your brain. My only experience of it is when the audio and the visual are out of sync by only about 10ms while watching something (I forget what). I can't watch anything like that for any real length of time, and I'm not even trying to play along.
The refractive index of single-mode fiber is ~1.5, so light travels through it at about two thirds of its vacuum speed, and 10ms of fiber transit is closer to 1250 miles. The fact that no two real-world internet users are connected by a straight line of fiber optic cabling is going to bring that in even further. If you were relying on round trip latency for some reason, that would cut the distance in half.
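For anyone who wants to check the arithmetic, here is a rough sketch of the calculation. The function name and the default refractive index of 1.5 are my own assumptions for illustration, and it models an idealized straight run of fiber with no routing or switching delay at all.

```python
# Idealized reach of a signal in optical fiber for a given latency budget.
# Assumes a straight fiber run and a refractive index of ~1.5 (illustrative).

C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum
KM_PER_MILE = 1.609344


def fiber_reach_miles(latency_ms, refractive_index=1.5, round_trip=False):
    """Return how far apart two endpoints can be, in miles.

    If round_trip is True, the latency budget covers the trip there and
    back, so only half of it is available for each direction.
    """
    speed_km_per_s = C_VACUUM_KM_PER_S / refractive_index
    one_way_ms = latency_ms / 2 if round_trip else latency_ms
    return speed_km_per_s * (one_way_ms / 1000) / KM_PER_MILE


if __name__ == "__main__":
    print(f"one way:    {fiber_reach_miles(10):.0f} miles")
    print(f"round trip: {fiber_reach_miles(10, round_trip=True):.0f} miles")
```

A 10ms one-way budget comes out a little under 1250 miles, and treating the same 10ms as a round trip budget halves that.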
And that’s just for an idealized communication network. A real-world use case would have a lot more latency introduced by routing/switching, and all of the typical quality issues you’d expect from ISPs. I used to live within line of sight of the data centre that hosted an online video game that I played, and the lower bound of my ping was about 15ms; it was usually in the 20s. Those factors are why something like this is unlikely to ever provide a high quality experience over even relatively short distances. I’ve just always found it interesting how quickly you start to run into issues with the speed of light when you’re trying to optimize for network latency.
I mean, the stated use case is for a music school. I actually live in a University town. We have tons of students all "working remotely" within a single mile radius of campus, and many of them are connected together by our campus intranet -- the University does wireless point-to-multipoint links to connect all of these random housing options (which include large buildings they simply bought in the community) -- which is something I think we could easily push to include more buildings. I bet we could easily pull off 2ms networking overhead.
(And no: you are wrong about round trip. If you care about round trip time then 3ms of round trip by sound is so close that the violin player is going to be elbowing the singer. Sound is fundamentally slow, and you need to just accept that.)
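To put rough numbers on "sound is slow", here is a minimal sketch of the same kind of calculation for sound in air. The 343 m/s figure is an assumed value for air at roughly room temperature, and the function name is just for illustration.

```python
# Distance sound covers in air within a given latency budget.
# Assumes ~343 m/s, i.e. air at roughly room temperature (illustrative).

SPEED_OF_SOUND_M_PER_S = 343.0
FEET_PER_METER = 3.28084


def sound_distance_feet(latency_ms, round_trip=False):
    """Return the separation, in feet, that uses up the latency budget.

    If round_trip is True, the budget covers sound going there and back,
    so each direction only gets half of it.
    """
    one_way_ms = latency_ms / 2 if round_trip else latency_ms
    return SPEED_OF_SOUND_M_PER_S * (one_way_ms / 1000) * FEET_PER_METER


if __name__ == "__main__":
    for ms in (3, 10):
        print(f"{ms}ms one way:    {sound_distance_feet(ms):.2f} ft")
        print(f"{ms}ms round trip: {sound_distance_feet(ms, round_trip=True):.2f} ft")
```

Under that assumption, 3ms one way is a bit over three feet, 3ms round trip is under two feet, and 10ms one way is a little over eleven feet.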
A well-architected local network is about the only type of network that a service like this could operate over and still provide a decent experience.
Also, the reason you'd care about round trip time is if you needed mixing or processing to be done to musician foldback (I never mentioned 3ms being an issue btw). Mixing and processing sound before it is sent to foldback monitors is a completely standard process in just about all live and studio sound engineering. Latency is a very important consideration for sound engineers, and it's why, in situations like stadium performances, the only sound that matters to a musician on stage is the sound coming out of the foldback monitors directly in front of them. I'm getting a rather strong impression that you don't know much about what actually goes into sound engineering.
I guess I don't understand why you think "sound engineering" is relevant to the point of being required, as the question to me here is more about what happens at something like a "session"? I used to go every week to play with a large mix of random people at an Irish bar... we definitely didn't have monitors (or a sound engineer ;P). People have been playing music with each other in groups for a long time, and being within 11'3" of everyone seems unrealistic. I have been in drum circles larger than that (I am at best a percussionist, though I was dabbling with fiddle), and I would assume that to be the case where precision matters the most. The stage we were on (which is weird: most sessions I've seen--and I have seen more than I participated in as I got involved in this stuff due to dating a musician--are at dining tables) was definitely over ten feet wide.
FWIW, when I asked my professional musician ex about all of this (which was before I read this comment of yours here) she also mentioned monitors, but it was because she claimed a core consideration for her with respect to distance in a performance was volume of her instrument drowning out her playing companions (and when I pushed into that she said if the acoustics were bad enough they would use monitors). She also told me that people she knew were already using software to play with each other over the Internet. If nothing else, I feel like "proof by counter example" should win here vs. your statement that this is just somehow impossible? As others have pointed out in this thread, the SoundJack people exist and claim to do this (whether or not you believe they have users: my ex apparently knows of users, though I don't know if they are using the same software).