
Really interesting, I can see a ton of potential uses.

2 questions:

1) How does it compare to state-of-the-art FOSS solutions? I'm thinking of DeepSpeech or Vosk.

2) Would it somehow be possible to associate timestamps with the recognized words? That would be amazing for things such as audio editing or skipping to a particular location in a video.



You rightly mentioned timestamps. There are many other important properties of a good ASR system, like vocabulary adaptability (whether you can introduce new words), streaming, confidences, or the latency of the output. Compared to Vosk models, this model cannot work in a streaming manner, so it is not very suitable for real-time applications.

But in general the model is robust and accurate, and it is trained on an amount of speech we never dreamed of in Vosk. We will certainly benefit from this model as a teacher (together with others, like the gigaspeech models). I recently wrote about it: https://alphacephei.com/nsh/2022/06/14/voting.html


> goffi

For 2), it's actually written in the description: "phrase-level timestamps", so it should be possible (phrase level is neat for skipping to a specific location in a video, but maybe not precise enough for audio editing).
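
To make the skipping use case concrete, here is a minimal sketch in plain Python. It assumes the phrase-level timestamps are available as (start, end, text) tuples; that layout, the sample data, and the helper names `find_phrase_start` and `phrase_at` are made up for illustration and are not taken from the model's description.

```python
from bisect import bisect_right

# Hypothetical phrase-level segments: (start_seconds, end_seconds, text).
# The tuple layout and the contents are assumptions for illustration only.
segments = [
    (0.0, 2.5, "welcome to the show"),
    (2.5, 6.0, "today we talk about speech recognition"),
    (6.0, 9.5, "first let us look at timestamps"),
]


def find_phrase_start(segments, query):
    """Return the start time of the first phrase containing `query`, or None."""
    needle = query.lower()
    for start, _end, text in segments:
        if needle in text.lower():
            return start
    return None


def phrase_at(segments, t):
    """Return the phrase spoken at time `t` (seconds), or None if there is none."""
    # Segments are assumed to be sorted by start time and non-overlapping.
    starts = [start for start, _end, _text in segments]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < segments[i][1]:
        return segments[i][2]
    return None


if __name__ == "__main__":
    # A player could seek to this offset to jump to where the phrase begins.
    print(find_phrase_start(segments, "speech recognition"))
    print(phrase_at(segments, 7.0))
```

The seek can only land on the start of a phrase, which is fine for jumping around a video but too coarse for cutting audio at word boundaries.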



