I did a prototype of a 3D low-latency server-side mixing system, based on a hypothetical 4k clients at 48 kHz, each being mixed with the 64 loudest clients, using Opus forced to CELT-only mode and running 256-sample stereo frames at 128 kbps. It worked well, using only 6 cores for that workload. The mixing itself was trivial, and decoding and encoding 4k streams was entirely doable; the issue at that rate was 1.5M network packets a second. If I were to revisit it, I'd look at using a simple MDCT-based codec with a simple psychoacoustic model based on MPC (minus CVD), modified for shorter frames and for MDCT behaviour versus PQMF behaviour, without any Huffman or entropy coding, and I'd put that codec on the GPU. Small tests I did on a 1080 Ti indicated ~1M clients could be decoded, mixed and encoded (same specs as above). The problem then is how to handle ~370M network packets a second :)
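For anyone curious, here's roughly what that per-frame server loop looks like. This is a minimal sketch of my own, not the prototype's actual code: `mix_frame`, `TOP_N`, the RMS-based loudness ranking, and the codec bindings are all illustrative assumptions, and a real version would do Opus/CELT decode and encode around this and apply per-source 3D panning gains rather than a straight sum.

```python
import numpy as np

SAMPLE_RATE = 48_000
FRAME_SAMPLES = 256          # 256 stereo samples per frame, as in the post
TOP_N = 64                   # mix only the 64 loudest clients per listener
FRAMES_PER_SEC = SAMPLE_RATE / FRAME_SAMPLES   # 187.5 frames/s per client

def mix_frame(client_pcm: dict[int, np.ndarray]) -> dict[int, np.ndarray]:
    """For each client, mix the TOP_N loudest *other* clients.

    client_pcm maps client id -> decoded (FRAME_SAMPLES, 2) float32 frame.
    Returns client id -> mixed frame ready for re-encoding.
    """
    # Rank clients once per frame by RMS loudness of their decoded audio
    # (an assumed loudness metric; the prototype may have used another).
    loudness = {cid: float(np.sqrt(np.mean(pcm ** 2)))
                for cid, pcm in client_pcm.items()}
    ranked = sorted(loudness, key=loudness.get, reverse=True)

    mixes = {}
    for cid in client_pcm:
        # Take the TOP_N loudest clients, excluding the listener themselves.
        sources = [c for c in ranked if c != cid][:TOP_N]
        mix = np.zeros((FRAME_SAMPLES, 2), dtype=np.float32)
        for src in sources:
            mix += client_pcm[src]   # real code: apply 3D spatial gain here
        mixes[cid] = np.clip(mix, -1.0, 1.0)
    return mixes

# Packet-rate arithmetic from the post: each client sends and receives
# 187.5 frames/s, so 4_000 * 187.5 * 2 = 1.5M packets/s; the same numbers
# scaled to ~1M clients give ~375M packets/s, i.e. the ~370M figure above.
```

The 256-sample frame size is what drives the packet rate: smaller frames mean lower latency but proportionally more packets per second per client.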
Edit: Had high hopes for High Fidelity, and came very close to asking for a job there ;) Shame it’s kaput, didn’t know that :(
Those are interesting ideas, thanks! I'll have to try and play with that.
High Fidelity the company is still around, but they pivoted radically multiple times. Initially their plan was social VR of sorts. Then they tried to make a corporate product for meetings and such, and gave up on that right before COVID-19 hit!
And after that they ripped out all the 3D and VR and scaled down to a 2D, overhead spatial audio web thing. Think something like Zoom, only you have an icon that you can move around to get closer or further to other people.
The original code still lives on; we picked it up and are working on improvements. Feel free to visit our Discord (see my profile).
Apparently the RP1 team handles bigger crowd loads through muxing on the server, though I'm not sure exactly how that works out for spatial audio. There's a Kent Bye Voices of VR podcast discussing how they got 4k users in the same shard.
But unless it's a non-commercial project, the cost shouldn't be a big deal, so it's still a bit strange.