Correct. There are several levels at which this applies:

Phone hardware (microphones, speakers) is only tuned to capture and reproduce the frequencies that are 'useful' for human speech.

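The sampling rates used by telephony audio codecs, e.g. 8kHz or 16kHz, cut off _before_ the human ear's limits: a signal sampled at a given rate can only carry frequencies up to half that rate, so 4kHz or 8kHz, against an upper hearing limit of roughly 20kHz. They aren't even trying to reproduce everything the ear can detect; just human speech to decent quality.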
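As a rough illustration of the sampling-rate point, the sketch below computes the highest frequency a given sampling rate can represent (half the rate) and where a tone above that limit would land if it were sampled without any filtering. The function names and the 19kHz test tone are illustrative assumptions, not taken from any particular codec; real phone pipelines normally low-pass filter the input first, so such a tone would simply be removed rather than folded down.

```python
def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def aliased_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone sampled with no anti-aliasing filter.

    Hypothetical helper for illustration: frequencies fold back around
    multiples of the sampling rate into the range [0, sample_rate / 2].
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


if __name__ == "__main__":
    tone_hz = 19_000  # assumed near-ultrasonic tone, close to the edge of hearing
    for rate_hz in (8_000, 16_000):
        print(
            f"rate={rate_hz} Hz, "
            f"limit={nyquist_limit(rate_hz)} Hz, "
            f"{tone_hz} Hz tone would appear at "
            f"{aliased_frequency(tone_hz, rate_hz)} Hz"
        )
```

Either way, the tone does not survive as a 19kHz tone: it is either filtered out or shows up as an audible lower-frequency artifact.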

Codecs are optimized to make human speech intelligible. The person listening to you on the phone isn't receiving a complete waveform for the recorded frequency range. The signal has been compressed to reduce the bandwidth required, and the goal isn't lossless compression; it's decent-quality speech after decompression.
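To make "not lossless" concrete, here is a minimal sketch of μ-law companding, the idea behind the classic G.711 telephony codec. It is a stand-in chosen for simplicity, not a claim about which codec any given phone uses: it applies the continuous μ-law formula with a plain 8-bit-style quantizer rather than the exact G.711 segment encoding, and modern speech codecs use far more elaborate models. The `roundtrip` helper and the `levels` parameter are assumptions made for the example.

```python
import math

MU = 255.0  # companding constant used by mu-law


def compress(x: float) -> float:
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y: float) -> float:
    """Inverse of compress()."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def roundtrip(x: float, levels: int = 127) -> float:
    """Compress, quantize to a small number of levels, then expand.

    The quantization step is where information is discarded.
    """
    quantized = round(compress(x) * levels) / levels
    return expand(quantized)


if __name__ == "__main__":
    for sample in (0.001, 0.05, 0.3, 0.9):
        decoded = roundtrip(sample)
        print(f"in={sample:.4f} out={decoded:.4f} error={decoded - sample:+.5f}")
```

The decoded samples are close to the originals but not identical, which is fine for intelligible speech and is exactly why the listener never receives the full recorded waveform.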

It's completely possible to play tones alongside speech that we won't notice, but in the general case they can't be tones outside the range the human ear can detect.