Ah, but that sounds like a very inefficient approach, which probably still has q...

huac · on Jan 30, 2024

whisper is simply not designed for this, in many ways, and it's impressive engineering to try and overcome its limitations, but I can't help but feel that it is easier to just use an architecture that is designed for the problem.

I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b

Oranguru · on Jan 30, 2024

Very interesting. Thanks for the references. Have you released the code or pre-trained models yet or do you plan to do so at some point?

albertzeyer · on Jan 30, 2024

The code is all released already. You find it here: https://github.com/rwth-i6/returnn-experiments/tree/master/2...

This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.

We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.