Hacker News | renus's comments

Everything runs locally, we use:

- WhisperLive for the transcription - https://github.com/collabora/WhisperLive
- WhisperSpeech for the text-to-speech - https://github.com/collabora/WhisperSpeech

and an LLM (phi-2, Mistral, etc.) in between


Thank you! When I read OpenAI, I was thinking it would be going through them. This revelation is perfect timing for me… keeping user data even more private. Excellent!


We are going to put the sample interface into the Docker image, so it's mainly:

> docker run --gpus all --shm-size 64G -p 80:80 -it ghcr.io/collabora/whisperfusion:latest

instead of:

> docker run --gpus all --shm-size 64G -p 6006:6006 -p 8888:8888 -it ghcr.io/collabora/whisperfusion:latest
> cd examples/chatbot/html
> python -m http.server


A fast turnaround time is also super important; if the transcription is not correct, waiting multiple seconds for each turn would kill the application. E.g., ordering food using voice is only convenient if it gets me right all the time; if not, I will fall back to the app.


To streamline the experience, we don't wait until after the pause to send the transcription to the LLM. Instead, we use the time spent waiting for the end-of-sentence trigger (the pause) to generate the LLM and text-to-speech output. So ideally, once we have detected the pause, everything has already been processed.
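A minimal sketch of that idea, not WhisperFusion's actual code: every time the partial transcript changes, a response is generated speculatively, and when the pause is detected the cached response is reused if the transcript is unchanged. The class name `SpeculativeResponder` and the `generate` callable (a stand-in for the LLM plus text-to-speech step) are hypothetical.

```python
from typing import Callable, Optional


class SpeculativeResponder:
    """Keeps a response ready for the latest partial transcript."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        # `generate` is a hypothetical stand-in for the LLM + text-to-speech step.
        self.generate = generate
        self.cached_text: Optional[str] = None
        self.cached_reply: Optional[str] = None
        self.calls = 0  # number of times `generate` actually ran

    def on_partial(self, transcript: str) -> None:
        # Called while the user is still speaking; regenerate only on change.
        if transcript != self.cached_text:
            self.cached_reply = self.generate(transcript)
            self.cached_text = transcript
            self.calls += 1

    def on_pause(self, final_transcript: str) -> Optional[str]:
        # At the pause, reuse the speculative reply if the transcript is unchanged;
        # otherwise fall back to generating for the final transcript.
        if final_transcript != self.cached_text:
            self.on_partial(final_transcript)
        return self.cached_reply
```

In the good case `on_pause` does no work at all, which is where the latency saving comes from; the cost is that speculative runs for transcripts that later change are wasted compute.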


We tested https://huggingface.co/cognitivecomputations/dolphin-2_6-phi... as well; in some tasks it performs better. That said, you can also use Mistral, as we support a few models through TensorRT-LLM.


pyryt posted https://arxiv.org/abs/2010.10874, which might be helpful here, but we will probably end up with personalized models that have learned from conversation styles. A magic stop/processing word would be the easiest to add since you already have the transcript, but it takes away the natural feel of a conversation.


Good point; another area we are currently looking into is predicting intention. Often, when talking to someone, we have a good idea of what that person might say next. That would not only help with latency but also allow us to give better answers and load the right context.


For the transcription part, we are looking into W2v-BERT 2.0 as well and will make it available in a live-streaming context. That said, Whisper, especially small (<50ms), is not as compute-heavy; right now, most of the compute is consumed by the LLM.


No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to do 30 seconds' worth of processing more than once a second. It just multiplies the problem up. If you can't offload it to a GPU, it's painfully inefficient.

As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.
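The multiplication described above can be made concrete with a small back-of-the-envelope sketch. The 30-second window is Whisper's fixed input length; the function names and the `real_time_factor` parameter (seconds of compute per second of input audio) are assumptions introduced only for illustration.

```python
WINDOW_SECONDS = 30.0  # Whisper's fixed input window


def audio_seconds_processed_per_second(updates_per_second: float) -> float:
    """Seconds of (mostly padded) audio pushed through the model per wall-clock second."""
    return WINDOW_SECONDS * updates_per_second


def max_updates_per_second(real_time_factor: float) -> float:
    """Upper bound on transcript updates per second.

    `real_time_factor` is a hypothetical measure: seconds of compute needed
    per second of input audio on the hardware in question.
    """
    return 1.0 / (WINDOW_SECONDS * real_time_factor)
```

For example, two updates per second means processing 60 seconds of audio every second, and a setup that needs 0.1 s of compute per second of audio could manage only about one update every three seconds.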


For minimum latency you want a recurrent model that works in the time domain. A Mamba-like model could do it.


WhisperLive builds upon the Whisper model; for the demo, we used small.en, but you can also use large without introducing a bigger latency for the overall pipeline since the transcription process is decoupled from the LLM and text-to-speech process.


Yes, but when you change Whisper to make it live, to get WhisperLive, surely this has an effect on the WER, it will get worse. The question is, how much worse? And what is the latency? Depending on the type of streaming model, you might be able to control the latency, so you get a graph, latency vs WER, and in the extreme (offline) case, you have the original WER.

How exactly does WhisperLive actually work? Did you reduce the chunk size from 30 sec to something lower? To what? Is this fixed or can it be configured by the user? Where can I find information on those details, or even a broad overview of how WhisperLive works?
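For readers less familiar with the metric in the latency vs WER question above: WER is the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The following is a plain sketch of that standard definition with whitespace tokenization and no text normalization, which is a simplifying assumption; the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Measuring this on the same test set at several latency settings gives the latency vs WER curve being asked for.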



Yes I have looked there. I did not find any WER numbers and latency numbers (ideally both together in a graph). I also did not find the model being described.

*Edit*

Ah, when you write faster_whisper, you actually mean https://github.com/SYSTRAN/faster-whisper?

And for streaming, you use https://github.com/ufal/whisper_streaming? So, the model as described in http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main...?

There, for example in Table 1, you have exactly that, latency vs WER. But the latency is huge (2.85 sec the lowest). Usually, streaming speech recognition systems have latency well below 1 sec.

But anyway, is this actually what you use in WhisperLive / WhisperFusion? I think it would be good to give a bit more details on that.


WhisperLive supports both TensorRT and faster-whisper. We didn't reduce the chunk size; rather, we use padding based on the chunk size received from the client. Reducing the segment size should be a more optimised solution in the live scenario.

For streaming we continuously stream audio bytes of fixed size to the server and send the completed segments back to the client while incrementing the timestamp_offset.
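A rough sketch of how such a loop could look, under stated assumptions and not taken from the WhisperLive source: the buffered audio is right-padded with silence to the 30-second window, a hypothetical `transcribe` callable returns `(start, end, text, completed)` tuples relative to the start of the buffer, and `timestamp_offset` advances past whatever audio the completed segments covered. The 16 kHz sample rate and all names other than `timestamp_offset` are illustrative.

```python
from typing import Callable, List, Tuple

SAMPLE_RATE = 16000                 # assumed sample rate
WINDOW_SAMPLES = 30 * SAMPLE_RATE   # 30-second model window

RawSegment = Tuple[float, float, str, bool]  # start, end, text, completed
Segment = Tuple[float, float, str]           # absolute start, absolute end, text


def pad_to_window(samples: List[float]) -> List[float]:
    """Right-pad with silence (zeros) to exactly 30 seconds of samples."""
    samples = samples[:WINDOW_SAMPLES]
    return samples + [0.0] * (WINDOW_SAMPLES - len(samples))


class StreamingSession:
    def __init__(self, transcribe: Callable[[List[float]], List[RawSegment]]) -> None:
        self.transcribe = transcribe  # hypothetical model wrapper
        self.buffer: List[float] = []
        self.timestamp_offset = 0.0   # seconds of audio already finalized

    def add_chunk(self, chunk: List[float]) -> List[Segment]:
        """Append a fixed-size chunk and return newly completed segments."""
        self.buffer.extend(chunk)
        completed: List[Segment] = []
        consumed = 0.0
        for start, end, text, done in self.transcribe(pad_to_window(self.buffer)):
            if not done:
                break  # keep unfinished audio for the next call
            completed.append(
                (self.timestamp_offset + start, self.timestamp_offset + end, text)
            )
            consumed = end
        # Drop the audio covered by completed segments and advance the offset.
        drop = int(consumed * SAMPLE_RATE)
        del self.buffer[:drop]
        self.timestamp_offset += drop / SAMPLE_RATE
        return completed
```

Note that every call still runs the model over a full padded window, which is the inefficiency raised elsewhere in this thread.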


Ah, but that sounds like a very inefficient approach, which probably still has quite high latency, and probably also performs badly in terms of word-error-rate (WER).

But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still okish enough, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".

Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.

I actually have worked on this problem myself. E.g. see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g. for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).

Or see Emformer (https://arxiv.org/abs/2010.10759) as another popular model.


Whisper is simply not designed for this, in many ways, and it's impressive engineering to try and overcome its limitations, but I can't help but feel that it is easier to just use an architecture that is designed for the problem.

I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b


Very interesting. Thanks for the references. Have you released the code or pre-trained models yet or do you plan to do so at some point?


The code is all released already. You can find it here: https://github.com/rwth-i6/returnn-experiments/tree/master/2...

This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.

We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.


We will add the details, thanks for pointing it out.



WhisperFusion is fully open-source - https://github.com/collabora/WhisperFusion

