Thank you. Please consider adding speech-to-text transcription under each voice message. It might not be perfect, but it would be useful for rapid communication.
On the other hand, the majority of audio content can be listened to in the background, whereas text or video requires more attention.