
"Widjosumarajzer" = video summarizer

It's just a hodgepodge of prototype scripts, but one I've actually used on a few occasions already. Most of the work is manual, but it could easily be run as "fire and forget", with some way to make corrections afterwards.

First, I'm using pyannote for speech recognition: it converts audio to text while being able to discern speakers: SPEAKER_01, SPEAKER_02, etc. The diarization provides nice timestamps, with resolution down to parts of words, which I later use in a minimal UI to quickly skip around when text is selected.
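A rough sketch of this step. The pyannote call below is illustrative only (it needs a Hugging Face auth token and a model download; the file name and token are placeholders); the plain-Python helper shows the idea of mapping word timestamps onto speaker turns:

```python
def assign_speaker(turns, t):
    """Return the speaker label whose turn covers time t (in seconds),
    or None if nobody is speaking then. turns: [(start, end, label)]."""
    for start, end, label in turns:
        if start <= t < end:
            return label
    return None


if __name__ == "__main__":
    # Real usage would look roughly like this (hypothetical paths/token):
    # from pyannote.audio import Pipeline
    # pipeline = Pipeline.from_pretrained(
    #     "pyannote/speaker-diarization-3.1", use_auth_token="hf_...")
    # diarization = pipeline("meeting.wav")
    # turns = [(seg.start, seg.end, label)
    #          for seg, _, label in diarization.itertracks(yield_label=True)]
    turns = [(0.0, 4.2, "SPEAKER_01"), (4.2, 9.7, "SPEAKER_02")]
    print(assign_speaker(turns, 5.0))  # SPEAKER_02
```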

Next, I run an LLM prompt to identify speakers; so if SPEAKER_02 said "Hey Greg" to SPEAKER_05, it will infer SPEAKER_05 = Greg. I think it was my first time using Mistral 7B, and I went "wow" out loud the first time it got it right.
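The LLM call itself is assumed here (any chat model would do); this sketch shows the plumbing around it: parsing a reply of the form "SPEAKER_05 = Greg" and applying the mapping to the transcript. The reply format is my assumption, not the exact prompt from the post:

```python
import re


def parse_name_map(llm_reply):
    """Extract {'SPEAKER_05': 'Greg', ...} from lines like 'SPEAKER_05 = Greg'."""
    return dict(re.findall(r"(SPEAKER_\d+)\s*=\s*(\w+)", llm_reply))


def rename_speakers(transcript, name_map):
    """Replace speaker labels in (speaker, text) pairs with real names,
    keeping the original label where no name was identified."""
    return [(name_map.get(spk, spk), text) for spk, text in transcript]


transcript = [("SPEAKER_02", "Hey Greg, did you push the fix?"),
              ("SPEAKER_05", "Yes, this morning.")]
reply = "Based on the dialogue: SPEAKER_05 = Greg"
print(rename_speakers(transcript, parse_name_map(reply)))
```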

After that, I manually fill in the holes in the speaker names and move on to grouping chunks of text for summarization. That doesn't seem interesting at first glance, but removing the filler words, of which there are a ton in any presentation or meeting, is a huge help. I do it chunk by chunk. Here I lean on the best LLM available and often pick the Dolphin finetune of Mixtral.
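The chunking could look something like this greedy grouping (the character budget and the idea of sending each chunk with a "summarize, drop the filler" prompt are my assumptions about the workflow):

```python
def chunk_lines(lines, max_chars=4000):
    """Greedily group transcript lines into chunks that fit an LLM context.

    Each chunk would then be sent to the model with a prompt along the
    lines of "summarize this, dropping filler words".
    """
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


print(chunk_lines(["aaa", "bbb", "ccc"], max_chars=6))  # ['aaa\nbbb', 'ccc']
```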

Last, I summarize those summaries and slap the result on the front of the Google Doc.

I also insert some relevant screenshots in between chunks (might go with automatic scene-change detection via ffmpeg in the future).
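For the scene-change idea, ffmpeg's built-in `select='gt(scene,T)'` filter could dump a frame whenever the scene-change score exceeds a threshold. The exact invocation below is my sketch (the post only mentions the idea), and only the argument construction is shown:

```python
def scene_grab_cmd(video, out_pattern, threshold=0.4):
    """Build ffmpeg args that save a frame whenever the scene-change
    score exceeds `threshold` (frames land in out_pattern, e.g. shots/%04d.png)."""
    return ["ffmpeg", "-i", video,
            "-vf", f"select='gt(scene,{threshold})'",
            "-vsync", "vfr", out_pattern]


print(" ".join(scene_grab_cmd("talk.mp4", "shots/%04d.png")))
# Run it with: subprocess.run(scene_grab_cmd(...), check=True)
```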

aaand that's it: a doc that is easily searchable. Previously I had a bunch of 30 to 90 minute meeting recordings, and any attempt at searching required a linear scan of the files. Now, with a lot of additional prompt massaging, I was able to:

- create meeting notes, with especially worthwhile "what did I promise to send later" points

- this is huge: TALK with the transcript. I paste the whole transcript into Mistral 7B with 32k context and simply ask questions and follow-ups. No more watching or skimming an hour-long video; just ask the transcript whether there was another round of lay-offs or whether the parking space rules changed.

- draw a Mermaid sequence diagram of a request flowing across services. It wasn't perfect, but it got me super excited about future possibilities for creating or updating service documentation based on ad-hoc meetings.
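The "talk with the transcript" trick can be sketched as building an OpenAI-style message list with the whole transcript stuffed into the system prompt (the function and message wording are hypothetical, not the author's exact setup):

```python
def transcript_chat(transcript, question, history=None):
    """Build a chat message list for one Q&A turn against a transcript.

    history: previous [{'role': ..., 'content': ...}] turns, so follow-up
    questions keep their context.
    """
    messages = [{"role": "system",
                 "content": "Answer questions using only this meeting "
                            "transcript:\n" + transcript}]
    messages += history or []
    messages.append({"role": "user", "content": question})
    return messages


msgs = transcript_chat("Greg: the parking rules changed last week.",
                       "Did the parking rules change?")
print(msgs[-1]["content"])  # Did the parking rules change?
```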

I guess everybody is actually trying to build the same thing; it seems like a no-brainer given current tools' capabilities.



Very interested in this. I have been contemplating building something similar, but am unaware of any existing services that do this. I haven't played with pyannote; how does it compare to whisper? I also thought it might be useful to OCR the screenshots and use the text to inform the summarization and transcription, especially for things like code snippets and domain-specific terms.


I remember whisper v3 large blowing my mind: it was able to properly transcribe some two-language monstrosity ("przescreenować", the English phrase "to screen a candidate", but conjugated according to standard Polish rules). Once I saw that, I thought "it's finally time: truly good transcription has finally arrived".

So I view whisper as SOTA, with excellent accuracy.

Now, for the type of transcription I need, discerning speakers is much more valuable than word-for-word accuracy: it will all be summarized anyway, and that tends to gloss over some of the errors.

That said, pyannote has also caught me off guard: it correctly annotated a lazily spoken "DP8" in a non-native speaker's accent.

It looks really good


Is pyannote the best diarization library you've found? What's SOTA? I've been using a SaaS product (Gladia) and I'm getting close to my 10-hour mark.


It was the first one I tried, and good enough that I didn't need to look further.



