Really interesting, I can see ton of potential uses. 2 questions: 1) how does it...

nshm · on Sept 21, 2022

You properly mentioned timestamps. There are many other important properties of good ASR system like vocabulary adaptability (if you can introduce new words) or streaming. Or confidences. Or latency of the output. Compared to Vosk models this model can not work in streaming manner, so not very suitable for real-time applications.

But in general the model is robust and accurate and trained on the amount of speech we never dreamed about in Vosk. We will certainly benefit from this model as a teacher (together with others like gigaspeech models). I recently wrote about it https://alphacephei.com/nsh/2022/06/14/voting.html

goffi · on Sept 21, 2022

> goffi

for 2), it's actually written in the description: "phrase-level timestamps", so it should be possible (phrase level is neat for skipping to a special location on a video, but maybe not for audio editing).