Yeah I was frustrated by slow and hard to use OSS diarization too; recently released a library to address that, check it out: https://github.com/narcotic-sh/senko
Also https://zanshin.sh, if you'd like speaker diarization when watching YouTube videos
Hey, thanks for this. Been trying it out and it's very fast but seems to hear more speakers than are in the audio. I didn't see a way to tweak speaker similarity settings or merge speakers in some way. Any advice?
Yeah unfortunately, since the diarization is based on acoustic features, it really does require high-fidelity voice recordings to get the best results.
However, I just added another knob to the Diarizer class called mer_cos, which controls the speaker merging threshold. The default is 0.875, so perhaps try lowering to 0.8. That should help.
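For anyone wondering what a merge threshold like this does, here's a rough sketch. It is not Senko's actual code: the function names and the union-find merging are my own illustration, and the only things taken from above are the parameter name mer_cos and the 0.875 default. The assumption is that two speaker clusters get merged when the cosine similarity of their centroid embeddings is at or above the threshold, so lowering the value merges more aggressively.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def merge_speakers(centroids, mer_cos=0.875):
    """Hypothetical helper: merge speakers whose centroids are similar enough.

    Returns one label per input centroid; merged speakers share a label.
    """
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent pointers to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine(centroids[i], centroids[j]) >= mer_cos:
                parent[find(j)] = find(i)

    # Relabel the surviving groups as 0, 1, 2, ... in order of first appearance.
    roots = sorted({find(i) for i in range(len(centroids))})
    relabel = {root: k for k, root in enumerate(roots)}
    return [relabel[find(i)] for i in range(len(centroids))]

# Toy 2-D "embeddings": the first two are similar (cosine of about 0.85).
centroids = [[1.0, 0.0], [0.85, 0.53], [0.0, 1.0]]
labels_default = merge_speakers(centroids, mer_cos=0.875)
labels_lowered = merge_speakers(centroids, mer_cos=0.8)
```

With the toy vectors, the first two speakers stay separate at 0.875 and collapse into one at 0.8, which is the effect you'd want when the output has too many speakers.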
I'll also get around to adding an oracle/min/max speakers feature at some point, for cases where you know the exact number of speakers ahead of time, or wanna set upper/lower bounds. Gotten busy with another project, so haven't done it yet. PRs welcome though! haha
Thanks, `mer_cos` definitely gets me closer. I appreciate that. Yeah, I was thinking providing a param for the expected number of speakers would be nice. I'll check out the codebase and see if that's something I can contribute :).
Yeah would love contributions! Here's a brief overview of how I think it can be done:
Senko has two clustering types: (1) spectral for audio < 20 mins in length, and (2) UMAP+HDBSCAN for >= 20 mins. In the clustering code, spectral actually already supports oracle/min/max speakers, but UMAP+HDBSCAN doesn't. However, someone forked Senko and added min/max speakers to that here (for oracle, I guess min = max): https://github.com/DedZago/senko/commit/c33812ae185a5cd420f2...
So I think all that's required is basically just testing this thoroughly to make sure it doesn't introduce any regressions in clustering quality. And then just wiring the oracle/min/max parameters to the Diarizer class, or diarize() func.
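To make the wiring part concrete, here's a minimal sketch of how the parameters could be normalized before they reach either clustering path. Everything here is hypothetical (the function names, the argument names, and the string labels are not Senko's API); the only details taken from above are the 20 minute cutoff and the idea that oracle is just min = max.

```python
SPECTRAL_MAX_SECONDS = 20 * 60  # audio shorter than this uses spectral clustering

def pick_clustering(duration_s):
    """Choose the clustering type from the audio duration in seconds."""
    return "spectral" if duration_s < SPECTRAL_MAX_SECONDS else "umap_hdbscan"

def resolve_speaker_bounds(num_speakers=None, min_speakers=None, max_speakers=None):
    """Turn oracle/min/max inputs into a single (lower, upper) pair.

    An oracle count is treated as min == max. An upper bound of None
    means "no upper limit".
    """
    if num_speakers is not None:
        if min_speakers is not None or max_speakers is not None:
            raise ValueError("pass either num_speakers or min/max bounds, not both")
        if num_speakers < 1:
            raise ValueError("num_speakers must be at least 1")
        return num_speakers, num_speakers

    lower = 1 if min_speakers is None else min_speakers
    upper = max_speakers
    if lower < 1:
        raise ValueError("min_speakers must be at least 1")
    if upper is not None and upper < lower:
        raise ValueError("max_speakers must be >= min_speakers")
    return lower, upper

# Example: a 45 minute recording known to have exactly 3 speakers.
method = pick_clustering(45 * 60)
bounds = resolve_speaker_bounds(num_speakers=3)
```

Resolving the bounds once up front means both clustering paths receive the same (lower, upper) pair, so the regression testing can focus on how each path honors it.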
Thanks :)
Agreed, the limiting factor has been diarization (generating the "who speaks when" data) speed. But the diarization backend of this app that I developed can now process 1 hour of audio in ~8 seconds on an M3 Mac. So that's more or less a solved problem now (at least on Mac), just UI work remains.
Blame that one on the US two-party system. Multiple parties means you can have okay-ish feelings about 2-3 of them, and support people based on their ideas.
While genuinely a sad statistic, should it still be called "sixth grade level" at that point if less than half of adults, much less 12-year-olds, actually reach it?
I mean it should because it should be a reasonable level to reach were it not for the dismantling of the educational system, but apparently it's not.
I think this is a reasonable question. However, I would argue that it still should be.
Firstly, I should say that reading scores aren't typically measured by grade levels for this type of study. That's just a colloquialism we use to make it comprehensible to the average person. The PIAAC for example uses a numerical score that translates to "levels of competency". [1]
Still, I think it's a valuable way to express the idea. There exist levels beyond the sixth. Even if most folks don't attain those higher levels anymore, we do need some way to refer to them, and the sixth grade is when a high school bound adult should have attained that level in order to keep up with later coursework.