There are already apps/technologies that transmit information through audio at frequencies not audible to humans. It should be trivial to adapt this so that two interacting AI systems can perform an "AI handshake" in the audio at the start and then switch to a more efficient form of communication.
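There's no standard protocol for this, so the following is only a minimal sketch of the idea: emit a quiet near-ultrasonic preamble and listen for it with the Goertzel algorithm. The 48 kHz rate, 19 kHz carrier, 0.1 s duration, amplitude, and detection threshold are all arbitrary illustrative choices, and it assumes hardware that actually captures 19 kHz (see the caveats in the reply below).

```python
import numpy as np

SAMPLE_RATE = 48_000     # Hz; assumes capture hardware that goes this high
HANDSHAKE_FREQ = 19_000  # Hz; arbitrary near-ultrasonic choice
DURATION = 0.1           # seconds of preamble tone

def make_handshake(amplitude: float = 0.05) -> np.ndarray:
    """Generate a quiet 19 kHz preamble tone to mix under normal audio."""
    t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * HANDSHAKE_FREQ * t)

def goertzel_power(block: np.ndarray, freq: float, rate: int) -> float:
    """Signal power at a single frequency, via the Goertzel algorithm."""
    k = round(freq * len(block) / rate)
    coeff = 2 * np.cos(2 * np.pi * k / len(block))
    s_prev = s_prev2 = 0.0
    for x in block:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2

def heard_handshake(block: np.ndarray, threshold: float = 1000.0) -> bool:
    """Threshold is an illustrative guess, not a calibrated value."""
    return goertzel_power(block, HANDSHAKE_FREQ, SAMPLE_RATE) > threshold

# Self-test: the detector fires on the preamble, not on noise.
tone = make_handshake()
noise = np.random.default_rng(0).normal(0, 0.05, tone.shape)
print(heard_handshake(tone), heard_handshake(noise))  # True False
```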
Correct. There are several levels at which this applies:
Phone hardware (microphones, speakers) is calibrated only to handle the frequencies useful for human speech.
The sampling rates used by audio codecs tend to cut off well _before_ the human ear's limits, e.g. 8kHz (narrowband) or 16kHz (wideband), which by Nyquist can only carry frequencies up to 4kHz and 8kHz respectively, nowhere near the ear's ~20kHz ceiling (see the sketch after this list). They aren't even trying to reproduce everything the ear can detect; just human speech to decent quality.
Codecs are optimized to make human speech intelligible. The person listening to you on the phone isn't receiving a complete waveform even within the recorded frequency range: the signal has been compressed to reduce the bandwidth required, and the goal isn't e.g. lossless compression; it's decent quality speech after decompression.
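To make the sampling-rate point concrete, here's a small sketch (assuming numpy and scipy are available): an 18 kHz tone is perfectly representable at 48 kHz, but resampling to the 8 kHz narrowband telephone rate low-pass filters away everything above the 4 kHz Nyquist limit, so nothing of the tone survives to carry a signal.

```python
import numpy as np
from scipy.signal import resample_poly

RATE_IN = 48_000    # Hz: a typical capture rate
RATE_PHONE = 8_000  # Hz: narrowband telephony; Nyquist limit = 4 kHz
TONE = 18_000       # Hz: inaudible to most adults, fine at 48 kHz

t = np.arange(RATE_IN) / RATE_IN  # one second of audio
x = np.sin(2 * np.pi * TONE * t)

# Resample 48 kHz -> 8 kHz (factor 1/6); resample_poly low-pass
# filters first, so content above 4 kHz is discarded, not aliased.
y = resample_poly(x, 1, 6)

def rms(sig: np.ndarray) -> float:
    return float(np.sqrt(np.mean(sig ** 2)))

print(f"RMS at 48 kHz: {rms(x):.3f}")   # ~0.707: the tone is there
print(f"RMS at  8 kHz: {rms(y):.6f}")   # ~0: nothing left after the cutoff
```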
It's completely possible to play tones alongside speech that we won't notice, but in the general case, not tones that the human ear can't detect; the phone chain above strips those out before they arrive.
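For example, at a full 44.1 kHz sampling rate you can mix a carrier around 18 kHz, hard for most adults to notice but still well inside the representable band, under ordinary audio. A sketch under those assumptions; the -40 dBFS level is a guess rather than a tested value, and the placeholder "speech" is just a 300 Hz tone to keep it self-contained:

```python
import numpy as np

RATE = 44_100     # Hz: full-bandwidth audio; Nyquist = 22.05 kHz
CARRIER = 18_000  # Hz: hard to notice for most adults, but not ultrasonic

def embed_tone(speech: np.ndarray, level: float = 0.01) -> np.ndarray:
    """Mix a low-level carrier (-40 dBFS here) under existing audio."""
    t = np.arange(len(speech)) / RATE
    return speech + level * np.sin(2 * np.pi * CARRIER * t)

# Placeholder "speech": one second of a 300 Hz tone.
t = np.arange(RATE) / RATE
speech = 0.5 * np.sin(2 * np.pi * 300 * t)
mixed = embed_tone(speech)

# The carrier is barely audible at this level but obvious in the spectrum.
spectrum = np.abs(np.fft.rfft(mixed))
freqs = np.fft.rfftfreq(len(mixed), 1 / RATE)
mask = freqs > 17_000
peak = freqs[mask][np.argmax(spectrum[mask])]
print(f"strongest component above 17 kHz: {peak:.0f} Hz")  # ~18000
```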