The only thing Whisper misses is speaker diarization. I'm currently working on a model that uses Whisper + pyannote to transcribe interviews and also detect who is speaking. It's working, but damn, it takes so long.
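Roughly, the glue between the two is something like this minimal sketch - it assumes pyannote.audio's pretrained "pyannote/speaker-diarization" pipeline and a naive "give each Whisper segment to the speaker with the most overlap" rule (file name, model size, and token are placeholders):

    import whisper
    from pyannote.audio import Pipeline

    AUDIO = "interview.wav"

    # 1) Whisper gives segments with start/end timestamps
    model = whisper.load_model("medium")
    result = model.transcribe(AUDIO)

    # 2) pyannote gives speaker turns with timestamps
    diarization = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token="HF_TOKEN")(AUDIO)

    # 3) label each Whisper segment with the speaker whose turn overlaps it most
    def speaker_for(start, end):
        best, best_overlap = "UNKNOWN", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        return best

    for seg in result["segments"]:
        print(f'{speaker_for(seg["start"], seg["end"])}: {seg["text"].strip()}')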
My goal for my project is to build a tool that transcribes interviews (e.g., in sales or recruiting) and puts the transcription through ChatGPT (waiting for the API atm) to make a summary that looks like the notes of the call. Speaker diarization is important so that I don't have more than 4,000 tokens of input for ChatGPT. I will see how it goes, but if it's reliable enough (it looks like it so far), it will save the time it takes to write meeting notes and rewrite them to send to someone after the call (hiring managers etc.). Imagine a 10x Otter.ai or something like that.
Why are you waiting for the API? The OpenAI Playground has API examples you can copy-paste. You can go over 4,000 tokens if you have a business justification and a payment method. You have access to most of their models, even the new Codex ones.
Edit: I looked at your link and misunderstood. I think I understand now: you're waiting for the ChatGPT-specific model?
You are correct that I was incorrect. Thank you for correcting me. I misread their documentation. It sounds like they might increase the token limit in the future, but right now it's 4,097 tokens shared between the prompt and the completion.
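For the meantime, the non-ChatGPT route already works with the plain completions endpoint - a rough sketch of the summarization call (model choice, prompt wording, and token budget are just illustrative; the 4,097 tokens cover prompt plus completion):

    import openai

    openai.api_key = "sk-..."

    transcript = open("interview_transcript.txt").read()

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Summarize this interview as concise meeting notes:\n\n" + transcript,
        max_tokens=500,   # leaves roughly 3,600 tokens for the prompt side
        temperature=0.3,
    )
    print(response["choices"][0]["text"].strip())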
Which uses 8x less memory than the Python implementation for the tiny model. It would be a good idea to keep an eye on it, since there are Python bindings planned on the roadmap:
I understand this is self-hosting the OpenAI Whisper model (which I see is fully MIT-licensed, weights and all), so it isn't calling any OpenAI APIs the way other GPT-related tools do.
Whisper-UI is also looking really nice lately, but I think it's still pretty early in development. The ability to click on the transcript and hear the audio at that particular moment is great.
https://github.com/hayabhay/whisper-ui
I was working on this yesterday. It seems that the most common approach with Whisper is simply to break the audio into chunks and transcribe each one separately. This works, but as you'd expect it sometimes has trouble at the edges. The segments also have to be sufficiently long (around 10 s) or the accuracy suffers, meaning it's not truly real-time.
You could do better by overlapping the segments, except then stitching the transcriptions together becomes an issue, since Whisper doesn't provide reliable per-token timestamps [0] and the output for the shared part of two overlapping segments isn't necessarily the same. I can imagine a cool approach where you transcribe long, overlapping chunks in real time and intelligently merge the stream of words somehow, though.
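To make the naive version concrete, the chunked approach is roughly this (just a sketch; the 10 s window and file name are placeholders, and you can see where the edge problems come from):

    import whisper

    CHUNK_SECONDS = 10
    SAMPLE_RATE = 16000  # whisper.load_audio resamples everything to 16 kHz

    model = whisper.load_model("base")
    audio = whisper.load_audio("meeting.wav")

    step = CHUNK_SECONDS * SAMPLE_RATE
    for i in range(0, len(audio), step):
        chunk = audio[i:i + step]
        # each window is transcribed independently, so words straddling a
        # boundary get cut in half - that's the edge trouble mentioned above
        print(model.transcribe(chunk, fp16=False)["text"].strip())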
Some more useful discussion here (whisper.cpp project, but still relevant) [1].
I run this locally for a few work-related tasks. One useful feature is being able to provide your own 'jargon' in the initial prompt, which improves recognition quality (--initial_prompt "jargon1 jargon2 ...").
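Same trick from the Python API, for anyone not using the CLI (initial_prompt is a regular transcribe() option; the jargon list here is obviously just an example):

    import whisper

    model = whisper.load_model("small")
    result = model.transcribe(
        "standup.wav",
        initial_prompt="Kubernetes, Terraform, gRPC, Postgres, CI/CD",
    )
    print(result["text"])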
Is there an open-source speech recognition model that can be restricted to a smaller, domain-specific dictionary?
Use case: I want to transcribe my poker hands while playing, e.g., "Flop was 2 of spades, 3 of diamonds and King of spades", "Button raised to $20", etc.
When I tried using Whisper and some other model, the recognition accuracy was atrocious, and it kept finding non-poker words that sounded similar to poker words. I want to restrict its search space to my own list of poker words, which should significantly increase the accuracy (theoretically).
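The closest workaround I can think of is post-processing rather than actually constraining the decoder - something like this crude sketch that fuzzy-matches each output word against my wordlist (the list here is just a tiny sample):

    import difflib
    import whisper

    POKER_WORDS = ["flop", "turn", "river", "button", "raised", "spades",
                   "diamonds", "hearts", "clubs", "king", "queen", "ace"]

    model = whisper.load_model("base")
    text = model.transcribe("hand_history.wav")["text"]

    corrected = []
    for word in text.split():
        # snap anything that sounds close to a poker word onto the wordlist
        match = difflib.get_close_matches(word.lower().strip(".,"),
                                          POKER_WORDS, n=1, cutoff=0.8)
        corrected.append(match[0] if match else word)
    print(" ".join(corrected))

But that's a hack, not a real restriction of the search space.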
This looks really good, thanks!
Really appreciate this and all the other Whisper implementations in this thread, as I am setting up transcriptions for my 120+ podcast episodes.
Generally yes, when it produces sane output at all. But while YT can get stuff comically wrong, I've never seen it just go off the rails and start hallucinating and mindlessly repeating itself, which Whisper sometimes does, especially if you're also trying to get it to translate something. Whisper will sometimes output a stream of things like "Please subscribe to my channel and follow me on Twitter!" or "Thank you for watching."
On one source I tried the other day, the first 90 seconds or so is just generic opening music, no speech, but it "transcribes" it as "This is the end of the video. Thank you for watching. Please subscribe to the channel if you like. See you in the next video. Thank you for watching. Please subscribe to the channel if you like. Thank you for watching. ..." If you help it along by cutting up the source into only the spoken segments, you can get it to do better, but just throwing it at a directory of material is probably going to leave you with some disappointment.
Then sometimes it does something surprising: on a J-pop song, after hallucinating a bit during the intro, it spat out a translation in the form you might find on a lyrics site, that is, each line was "japanese-characters romaji-version english-translation". I haven't been able to get it to do it again (even for the same source).
Yeah, it can help a bit with the looping, but it introduces other problems. I recalled from earlier that a combo of tweaking the no_speech_threshold and logprob_threshold settings helped somewhat, though trying again on a random video it doesn't do much. It still hallucinates a stream of captions (albeit non-repetitive, though one run had several Touhou-related lines) for what should be 4 minutes of looping background music before the first sentence. If all one needs Whisper for is transcribing English, though, I still think it's pretty decent. On my test video now it will 'correctly' transcribe the music as ♪ when I ask it to just transcribe it as English.
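For reference, these are the knobs I mean (both are regular transcribe() options; the exact values are just ones I happened to try, not recommendations):

    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe(
        "video_audio.wav",
        no_speech_threshold=0.4,           # default 0.6; lower = quicker to call a segment silent
        logprob_threshold=-0.5,            # default -1.0; raising it also makes the silence check trigger more
        condition_on_previous_text=False,  # helps break the repetition loops
    )
    print(result["text"])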
Related/off topic: is there a documented way to improve the accuracy for a particular language? Say we can put in the effort to collect thousands of verified/transcribed samples of a language that currently scores poorly (by WER). What steps do I have to take to get those improvements into the system?
Yes, you need to fine-tune the model with your data. This might be easy or hard, depending on your experience level, the complexity of the model, and the available tooling.
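For Whisper specifically, the usual route is the Hugging Face transformers stack. Very compressed sketch - the dataset name, column names, and hyperparameters are placeholders, and in practice you also need a padding data collator and an eval metric (the HF "Fine-Tune Whisper" guide walks through the full recipe):

    from datasets import load_dataset, Audio
    from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                              Seq2SeqTrainingArguments, Seq2SeqTrainer)

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    ds = load_dataset("your_org/your_transcribed_corpus", split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16000))

    def prepare(batch):
        audio = batch["audio"]
        batch["input_features"] = processor(
            audio["array"], sampling_rate=16000).input_features[0]
        batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
        return batch

    ds = ds.map(prepare, remove_columns=ds.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="whisper-small-finetuned",
                                      per_device_train_batch_size=8,
                                      learning_rate=1e-5,
                                      max_steps=4000),
        train_dataset=ds,
        # data_collator=... (pads input_features and labels) goes here
    )
    trainer.train()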
Very cool - I have a homegrown setup where a script scans my iCloud audio notes directory and generates transcriptions for any new notes. Works like a charm.
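A minimal version of that idea looks roughly like this sketch (the iCloud path and the skip-if-a-.txt-already-exists convention are assumptions, not necessarily how the original setup works):

    import pathlib
    import whisper

    NOTES_DIR = pathlib.Path.home() / "Library/Mobile Documents/com~apple~CloudDocs/AudioNotes"

    model = whisper.load_model("base")

    for audio_path in sorted(NOTES_DIR.glob("*.m4a")):
        out_path = audio_path.with_suffix(".txt")
        if out_path.exists():
            continue  # already transcribed on an earlier run
        result = model.transcribe(str(audio_path))
        out_path.write_text(result["text"].strip())
        print(f"Transcribed {audio_path.name}")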