How much audio or data do you need, on average, to train something? I guess I'm wondering whether a few minutes of someone speaking would be enough.
The implementation I'm using works best with several clips of under 30 seconds each. I've tested it by cutting up 20-minute interviews so that only the person I care about is speaking, and the results seem to be about the same as with a half dozen 15-second clips.
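If it helps, here is a rough sketch of the chopping step, not the tool I'm actually using. It assumes the recording is an uncompressed WAV file that has already been trimmed down to the one speaker, and it just cuts at fixed intervals rather than at pauses. The names `clip_bounds` and `split_wav` and the 15-second default are made up for illustration.

```python
import wave


def clip_bounds(total_frames, frame_rate, max_seconds=15.0):
    """Return (start, end) frame pairs, each spanning at most max_seconds."""
    step = int(frame_rate * max_seconds)
    if step <= 0:
        raise ValueError("max_seconds is too small for this frame rate")
    return [(s, min(s + step, total_frames)) for s in range(0, total_frames, step)]


def split_wav(src_path, out_prefix, max_seconds=15.0):
    """Write consecutive clips as out_prefix_000.wav, ... and return their paths."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        bounds = clip_bounds(src.getnframes(), src.getframerate(), max_seconds)
        for i, (start, end) in enumerate(bounds):
            src.setpos(start)  # jump to the first frame of this clip
            frames = src.readframes(end - start)
            path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(path, "wb") as dst:
                dst.setparams(params)  # same channels, sample width, and rate
                dst.writeframes(frames)  # frame count in the header is fixed on close
            paths.append(path)
    return paths
```

Something like `split_wav("interview_speaker_only.wav", "clip")` would then give a folder of short clips to pick from. The last clip is usually shorter than the rest, and cutting at fixed intervals can land mid-word, so it is worth listening through and dropping any bad ones.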