I suggested the reverse of this a little while ago (mute the audio in countries where it isn't licensed, rather than blocking the video outright), but the general conclusion was that the video and audio are muxed together into a single deliverable. The file is almost certainly pre-generated (once per resolution) to avoid the server-side cost of muxing the two on the fly for every stream.
I guess in theory you could generate a 'demux mapping' and the client could request byte-ranges corresponding to only one channel, but that seems incredibly complicated, would generate huge requests, and is probably bypassable on the client-side anyway (more of an issue for my idea than yours).
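
To make the 'demux mapping' idea concrete, here's a rough sketch in Python. Everything in it is hypothetical: the `Chunk` layout, the chunk sizes, and the function names are made up for illustration, and I'm not claiming any real site or container format works this way. It only shows how a per-track index could be turned into an HTTP `Range` header, and why that header balloons when the two tracks are finely interleaved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One contiguous run of bytes in the muxed file (hypothetical index entry)."""
    track: str   # "video" or "audio"
    offset: int  # byte offset of the chunk in the muxed file
    length: int  # chunk size in bytes


def demux_ranges(chunks, track):
    """Return inclusive (start, end) byte ranges covering only `track`.

    Chunks that sit back to back in the file are merged into one range.
    """
    ranges = []
    for c in sorted(chunks, key=lambda c: c.offset):
        if c.track != track or c.length <= 0:
            continue
        start, end = c.offset, c.offset + c.length - 1
        if ranges and ranges[-1][1] + 1 == start:
            # Directly follows the previous range, so extend it.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


def range_header(ranges):
    """Format the ranges as the value of an HTTP Range request header."""
    return "bytes=" + ",".join(f"{s}-{e}" for s, e in ranges)


def interleaved_layout(n_pairs, video_len=4000, audio_len=500):
    """Build a made-up file layout: video chunk, audio chunk, repeated."""
    chunks, offset = [], 0
    for _ in range(n_pairs):
        for track, length in (("video", video_len), ("audio", audio_len)):
            chunks.append(Chunk(track, offset, length))
            offset += length
    return chunks


if __name__ == "__main__":
    layout = interleaved_layout(3)
    print(range_header(demux_ranges(layout, "video")))
    # With thousands of interleaved chunks, the header grows with them.
    big = demux_ranges(interleaved_layout(10_000), "video")
    print(len(big), "ranges,", len(range_header(big)), "header characters")
```

Because the tracks alternate, nothing merges: each video chunk becomes its own range, so the request grows linearly with the length of the video. And since the client is the one choosing which ranges to ask for, nothing here stops it from simply requesting the audio ranges as well.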