I’ve had limited success combining the results of pyannotate with whisper transc...

I’ve had limited success combining the results of pyannotate with whisper transcripts (based on time stamps).

It’s ok, but the quality of speaker identification is nowhere near as good as the transcription itself.

I’d love to see models which try and use stereo information in recordings to solve the problem. Or, given a fixed camera and static speakers, I thought it should even be possible to use video to add information about who is speaking. There doesn’t seem to be anything like that right now tho.