Hacker News | hamza_q_'s comments


Hilarious that this is maintained by facebook and yet SAM fails so badly


Yeah I was frustrated by slow and hard to use OSS diarization too; recently released a library to address that, check it out: https://github.com/narcotic-sh/senko

Also https://zanshin.sh, if you'd like speaker diarization when watching YouTube videos


Hey, thanks for this. Been trying it out and it's very fast but seems to hear more speakers than are in the audio. I didn't see a way to tweak speaker similarity settings or merge speakers in some way. Any advice?


Thanks for checking it out!

Yeah, unfortunately, since the diarization is based on acoustic features, it really does require high-fidelity voice recordings to get the best results. However, I just added another knob to the Diarizer class called mer_cos, which controls the speaker merging threshold. The default is 0.875, so perhaps try lowering it to 0.8. That should help.

I'll also get around to adding an oracle/min/max speakers feature at some point, for cases where you know the exact number of speakers ahead of time, or wanna set upper/lower bounds. Gotten busy with another project, so haven't done it yet. PRs welcome though! haha
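
To illustrate why lowering a cosine merging threshold reduces the speaker count, here is a minimal sketch. It is not Senko's implementation: the `merge_speakers` function, the toy 2-D centroids, and the union-find merging strategy are all assumptions made for illustration. Only the threshold values 0.875 and 0.8 come from the comment above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def merge_speakers(centroids, threshold):
    """Hypothetical helper: group speaker centroids whose pairwise cosine
    similarity is at or above `threshold`. Returns sorted lists of indices."""
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine(centroids[i], centroids[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy centroids (made up): 0 and 1 are similar voices, 2 is clearly different.
centroids = [[1.0, 0.0], [0.85, 0.53], [0.0, 1.0]]

strict = merge_speakers(centroids, 0.875)  # the default mentioned above
loose = merge_speakers(centroids, 0.8)     # the suggested lower value
print(len(strict), "speakers at 0.875;", len(loose), "speakers at 0.8")
```

In this toy case, centroids 0 and 1 have a cosine similarity of roughly 0.85, so they stay separate at 0.875 but merge at 0.8. That is the general effect to expect from a lower threshold: fewer, larger speaker groups.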


Thanks, `mer_cos` definitely gets me closer. I appreciate that. Yeah, I was thinking providing a param for the expected number of speakers would be nice. I'll check out the codebase and see if that's something I can contribute :).


Yeah would love contributions! Here's a brief overview of how I think it can be done:

Senko has two clustering types: (1) spectral for audio < 20 mins in length, and (2) UMAP+HDBSCAN for >= 20 mins. In the clustering code, spectral actually already supports oracle/min/max speakers, but UMAP+HDBSCAN doesn't. However, someone forked Senko and added min/max speakers to that here (for oracle, I guess min = max): https://github.com/DedZago/senko/commit/c33812ae185a5cd420f2...

So I think all that's required is basically just testing this thoroughly to make sure it doesn't introduce any regressions in clustering quality. And then just wiring the oracle/min/max parameters to the Diarizer class, or diarize() func.
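
The wiring described above could be sketched roughly as follows. This is not code from Senko or the fork: the function names, parameter names, and returned labels are hypothetical. Only the 20-minute cutoff and the "oracle means min = max" idea come from the comment.

```python
from typing import Optional, Tuple

# Cutoff taken from the comment above: spectral below 20 minutes, UMAP+HDBSCAN otherwise.
LONG_AUDIO_SECONDS = 20 * 60

def choose_clustering(duration_seconds: float) -> str:
    """Hypothetical helper: pick a clustering type from the audio length."""
    return "spectral" if duration_seconds < LONG_AUDIO_SECONDS else "umap_hdbscan"

def resolve_speaker_bounds(
    oracle: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper: turn oracle/min/max into a (min, max) pair.

    A known speaker count (oracle) is treated as min == max.
    None means "no bound on this side".
    """
    if oracle is not None:
        if oracle < 1:
            raise ValueError("oracle speaker count must be at least 1")
        return oracle, oracle
    for bound in (min_speakers, max_speakers):
        if bound is not None and bound < 1:
            raise ValueError("speaker bounds must be at least 1")
    if (min_speakers is not None and max_speakers is not None
            and min_speakers > max_speakers):
        raise ValueError("min_speakers cannot exceed max_speakers")
    return min_speakers, max_speakers

print(choose_clustering(12 * 60), resolve_speaker_bounds(oracle=3))
print(choose_clustering(45 * 60), resolve_speaker_bounds(min_speakers=2, max_speakers=5))
```

Normalizing everything to a (min, max) pair up front means both clustering paths would only need to understand bounds, not a separate oracle case.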


looks interesting. will check it out.


Thanks for COD: MW2 (2009), Vince. The game of my childhood. Rest in Peace.


Cool use of ONNX! Fluid Inference also have great implementations of Parakeet v2/v3 in CoreML for Apple devices and OpenVINO for Intel:

https://github.com/FluidInference/FluidAudio

https://github.com/FluidInference/eddy-audio


Location: Vancouver, BC, Canada

Remote: Yes

Willing to relocate: Yes

Technologies: diarization, Voice AI, PyTorch, CoreML, Svelte/SvelteKit, Flask, SQLite, Tauri

Résumé/CV: https://hamzaq.com/Hamza_Qayyum_Resume_Public.pdf

Email: mhamzaqayyum [at] icloud [dot] com

---------

Projects:

- Senko: very fast, accurate, speaker diarization (https://senko.sh)

- Zanshin: novel media player that allows you to navigate by speaker (https://zanshin.sh)


Thought about it, but it seems they have some stringent prereqs they'd like: https://github.com/ghostty-org/ghostty/issues/189

I didn't care for those; just told Claude Code to add in the feature directly. So they probably wouldn't accept the PR if I made one.


Thanks :) Agreed, the limiting factor has been the speed of diarization (generating the "who speaks when" data). But the diarization backend of this app that I developed can now process 1 hour of audio in ~8 seconds on an M3 Mac. So that's more or less a solved problem now (at least on Mac); just UI work remains.


We do know; it's just not in the popular consciousness yet. Read a bit of Marshall McLuhan.


Taking bets on how fast Marshall McLuhan re-enters the public consciousness :)


It's remarkable that Marshall McLuhan's ideas haven't entered the public consciousness yet.


That book is brutally dense reading. It almost needs a translation for normal folks.

It is absolutely no wonder the ideas have not caught on more.


53% of American adults read below the sixth grade level. No idea that requires more than a sixth grade education will ever be mainstream again.

Huxley was right.


It’s been a long time, if ever, that people voted for ideas. They vote for party, as they always have, or for a charismatic candidate.


Blame that one on the US two-party system. Multiple parties means you can have okay-ish feelings about 2-3 of them, and support people based on their ideas.


While it's genuinely a sad statistic, should it still be called "sixth grade level" at that point if less than half of adults, much less 12-year-olds, actually reach it?

I mean it should because it should be a reasonable level to reach were it not for the dismantling of the educational system, but apparently it's not.


I think this is a reasonable question. However, I would argue that it still should be.

Firstly, I should say that reading scores aren't typically measured by grade levels for this type of study. That's just a colloquialism we use to make it comprehensible to the average person. The PIAAC for example uses a numerical score that translates to "levels of competency". [1]

Still, I think it's a valuable way to express the idea. There exist levels beyond the sixth. Even if most folks don't attain those higher levels anymore, we do need some way to refer to them, and the sixth grade is when a high school bound adult should have attained that level in order to keep up with later coursework.

[1] https://nces.ed.gov/surveys/piaac/skillsmap/

