A noisy word might easily be guessed from context. Likewise, a semantically ambiguous word might be resolved by other cues, such as the speaker's tone, facial expressions, and so on.
I suspect the parent's point is that the ambiguity in each modality might be resolved with information encoded in the other: one can pattern match based on cross-modal context. I think some of the work on multimodal transformer models demonstrates this.
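The mechanism behind that cross-modal pattern matching is usually cross-attention: tokens from one modality form the queries, and features from the other modality supply the keys and values. Here's a toy sketch with NumPy; the feature vectors are random stand-ins, not real audio or visual embeddings.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from one modality
    attend to keys/values from another modality."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Softmax over the keys so attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Hypothetical toy features: one ambiguous audio token (query)
# and two visual context vectors (keys/values), all 4-dimensional.
rng = np.random.default_rng(0)
audio_token = rng.normal(size=(1, 4))   # e.g. a noisy word embedding
visual_ctx = rng.normal(size=(2, 4))    # e.g. facial-expression features

fused = cross_attention(audio_token, visual_ctx, visual_ctx)
print(fused.shape)  # (1, 4): the audio token, now enriched with visual context
```

In a real multimodal transformer the fused representation feeds into later layers, so a word that is acoustically noisy or semantically ambiguous can be resolved by whichever modality carries the disambiguating signal.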