renus · on Jan 29, 2024

Everything runs locally, we use:

- WhisperLive for the transcription - https://github.com/collabora/WhisperLive
- WhisperSpeech for the text-to-speech - https://github.com/collabora/WhisperSpeech

and an LLM (phi-2, Mistral, etc.) in between

WhackyIdeas · on Jan 29, 2024

Thank you! When I read OpenAI, I was thinking it would be going through them. This revelation is perfect timing for me… keeping user data even more private. Excellent!

renus · on Jan 29, 2024

We are going to put the sample interface into the Docker image, so it will mainly be:

> docker run --gpus all --shm-size 64G -p 80:80 -it ghcr.io/collabora/whisperfusion:latest

instead of:

> docker run --gpus all --shm-size 64G -p 6006:6006 -p 8888:8888 -it ghcr.io/collabora/whisperfusion:latest

> cd examples/chatbot/html

> python -m http.server

renus · on Jan 29, 2024

A fast turnaround time is also super important; if the transcription is not correct, waiting multiple seconds for each turn would kill the application. E.g., ordering food using voice is only convenient if it gets me right all the time; if not, I will fall back to the app.

renus · on Jan 29, 2024

To streamline the experience, we don't wait for the pause before sending the transcription to the LLM. Instead, we use the time spent waiting for the end-of-sentence trigger (the pause) to generate the LLM and text-to-speech output. Ideally, by the time we detect the pause, everything has already been processed.
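A minimal sketch of that idea, not the WhisperFusion code: the reply is computed for each partial transcript while waiting, and reused when the pause arrives if the transcript has not changed. The class name, the `generate` callable (a stand-in for LLM plus text-to-speech), and the synchronous flow are all assumptions for illustration; a real pipeline would run this in separate processes or threads.

```python
from typing import Callable, Optional


class SpeculativeResponder:
    """Precompute a reply for the latest transcript while waiting for the pause."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self.generate = generate  # hypothetical stand-in for LLM + text-to-speech
        self.cached_for: Optional[str] = None
        self.cached_reply: Optional[str] = None
        self.calls = 0  # how many times generate() actually ran

    def on_partial(self, transcript: str) -> None:
        # Called whenever a new partial transcript arrives.
        # Regenerate only if the transcript differs from the cached one.
        if transcript != self.cached_for:
            self.cached_reply = self.generate(transcript)
            self.cached_for = transcript
            self.calls += 1

    def on_pause(self, transcript: str) -> str:
        # At the end-of-sentence trigger: if the transcript is unchanged,
        # the reply is already cached and no extra generation happens.
        self.on_partial(transcript)
        assert self.cached_reply is not None
        return self.cached_reply


responder = SpeculativeResponder(lambda text: text.upper())
responder.on_partial("order a")
responder.on_partial("order a pizza")
reply = responder.on_pause("order a pizza")  # served from the cache
```

The trade-off is wasted compute on partial transcripts that later change, in exchange for a reply that is ready the moment the pause is detected.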

renus · on Jan 29, 2024

We tested https://huggingface.co/cognitivecomputations/dolphin-2_6-phi... as well; in some tasks it performs better. That said, you can also use Mistral; we support a few models through TensorRT-LLM.

renus · on Jan 29, 2024

pyryt posted https://arxiv.org/abs/2010.10874, which might be helpful here, but we will probably end up with personalized models that have learned from conversation styles. A magic stop/processing word would be the easiest to add since you already have the transcript, but it takes away the natural feel of a conversation.

renus · on Jan 29, 2024

Good point; another area we are currently looking into is predicting intention. Often, when talking to someone, we have a good idea of what that person might say next. That would not only help with latency but also allow us to give better answers and load the right context.

renus · on Jan 29, 2024

For the transcription part, we are looking into W2v-BERT 2.0 as well and will make it available in a live-streaming context. That said, Whisper, especially small (<50ms), is not as compute-heavy; right now, most of the compute is consumed by the LLM.

regularfry · on Jan 29, 2024

No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to do 30 seconds' worth of processing more than once a second. It just multiplies the problem up. If you can't offload it to a GPU, it's painfully inefficient.
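As rough arithmetic for that multiplication (a sketch; the function names and the example speed figure are made up for illustration, not measurements):

```python
def audio_seconds_per_wall_second(window_s: float = 30.0,
                                  updates_per_s: float = 1.0) -> float:
    """Seconds of (padded) audio the model must process per second of real time."""
    return window_s * updates_per_s


def max_updates_per_second(model_speed: float, window_s: float = 30.0) -> float:
    """Upper bound on transcript updates per second.

    model_speed is a hypothetical figure: seconds of audio the hardware
    can process per wall-clock second.
    """
    return model_speed / window_s


# Two updates per second means pushing 60 s of audio through the model every second.
load = audio_seconds_per_wall_second(window_s=30.0, updates_per_s=2.0)

# Hardware that runs at 60x real time can sustain at most 2 updates per second.
limit = max_updates_per_second(model_speed=60.0)
```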

As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.

intalentive · on Jan 29, 2024

For minimum latency you want a recurrent model that works in the time domain. A Mamba-like model could do it.

renus · on Jan 29, 2024

WhisperLive builds upon the Whisper model; for the demo, we used small.en, but you can also use large without introducing a bigger latency for the overall pipeline since the transcription process is decoupled from the LLM and text-to-speech process.

albertzeyer · on Jan 29, 2024

Yes, but when you change Whisper to make it live, to get WhisperLive, surely this has an effect on the WER, it will get worse. The question is, how much worse? And what is the latency? Depending on the type of streaming model, you might be able to control the latency, so you get a graph, latency vs WER, and in the extreme (offline) case, you have the original WER.
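For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A small self-contained sketch, with whitespace tokenization as a simplifying assumption (real evaluations normalize text first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


# One deletion ("a") and one substitution ("pizza" -> "pizzas") over 4 reference words.
score = wer("order a large pizza", "order large pizzas")
```

Computing this at several latency settings gives the points of the latency vs WER graph.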

How exactly does WhisperLive work actually? Did you reduce the chunk size from 30 sec to something lower? To what? Is this fixed or can it be configured by the user? Where can I find information on those details, or even a broad overview on how WhisperLive works?

renus · on Jan 29, 2024

https://github.com/collabora/WhisperLive

albertzeyer · on Jan 29, 2024

Yes I have looked there. I did not find any WER numbers and latency numbers (ideally both together in a graph). I also did not find the model being described.

*Edit*

Ah, when you write faster_whisper, you actually mean https://github.com/SYSTRAN/faster-whisper?

And for streaming, you use https://github.com/ufal/whisper_streaming? So, the model as described in http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main...?

There, for example in Table 1, you have exactly that, latency vs WER. But the latency is huge (2.85 sec the lowest). Usually, streaming speech recognition systems have latency well below 1 sec.

But anyway, is this actually what you use in WhisperLive / WhisperFusion? I think it would be good to give a bit more details on that.

stiffler01 · on Jan 29, 2024

WhisperLive supports both TensorRT and faster-whisper. We didn't reduce the chunk size; rather, we use padding based on the chunk size received from the client. Reducing the segment size should be a more optimised solution in the live scenario.

For streaming we continuously stream audio bytes of fixed size to the server and send the completed segments back to the client while incrementing the timestamp_offset.
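A simplified sketch of that scheme, not the actual WhisperLive code: fixed-size chunks accumulate in a buffer, the not-yet-finalized audio is zero-padded to Whisper's 30-second window, and `timestamp_offset` advances as segments complete. `StreamBuffer`, `pad_to_window`, and the method names are hypothetical; the 16 kHz sample rate is what Whisper models expect.

```python
from typing import List

SAMPLE_RATE = 16_000  # Whisper models consume 16 kHz mono audio
WINDOW_S = 30         # Whisper's fixed input window, in seconds


def pad_to_window(samples: List[float]) -> List[float]:
    """Zero-pad (or truncate) samples to exactly one 30-second window."""
    target = SAMPLE_RATE * WINDOW_S
    clipped = samples[:target]
    return clipped + [0.0] * (target - len(clipped))


class StreamBuffer:
    """Server-side buffer for one client (hypothetical helper)."""

    def __init__(self) -> None:
        self.samples: List[float] = []
        self.timestamp_offset = 0.0  # seconds of audio already finalized

    def add_chunk(self, chunk: List[float]) -> None:
        # The client streams fixed-size chunks; just append them.
        self.samples.extend(chunk)

    def pending_window(self) -> List[float]:
        # Only audio after timestamp_offset still needs transcription.
        start = int(self.timestamp_offset * SAMPLE_RATE)
        return pad_to_window(self.samples[start:])

    def complete_segment(self, duration_s: float) -> None:
        # A segment was finalized and sent back to the client.
        self.timestamp_offset += duration_s


buf = StreamBuffer()
buf.add_chunk([1.0] * 4096)
buf.add_chunk([1.0] * 4096)
window = buf.pending_window()   # 8192 real samples, the rest zeros
buf.complete_segment(0.25)      # finalize the first 0.25 s (4000 samples)
```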

albertzeyer · on Jan 29, 2024

Ah, but that sounds like a very inefficient approach, which probably still has quite high latency, and probably also performs badly in terms of word-error-rate (WER).

But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still okish enough, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".

Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.

I actually have worked on this problem myself. E.g. see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g. for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).
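To put the "very minor" degradation in relative terms, here is a quick calculation using only the numbers quoted above (the helper name is made up for illustration):

```python
OFFLINE_WER = 7.4  # percent, TEDLIUM-v2 dev, as quoted above

# (median latency in seconds, WER in percent), as quoted above
OPERATING_POINTS = [(1.11, 7.5), (0.78, 7.7)]


def relative_degradation(wer_pct: float, baseline_pct: float = OFFLINE_WER) -> float:
    """Relative WER increase over the offline baseline, in percent."""
    return (wer_pct - baseline_pct) / baseline_pct * 100.0


for latency_s, wer_pct in OPERATING_POINTS:
    rel = relative_degradation(wer_pct)
    print(f"{latency_s:.2f} s -> {wer_pct:.1f}% WER (+{rel:.1f}% relative)")
```

That is roughly 1.4% relative degradation at 1.11 sec and 4.1% at 0.78 sec.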

Or see Emformer (https://arxiv.org/abs/2010.10759) as another popular model.

huac · on Jan 30, 2024

Whisper is simply not designed for this, in many ways, and it's impressive engineering to try and overcome its limitations, but I can't help but feel that it is easier to just use an architecture that is designed for the problem.

I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b

Oranguru · on Jan 30, 2024

Very interesting. Thanks for the references. Have you released the code or pre-trained models yet or do you plan to do so at some point?

albertzeyer · on Jan 30, 2024

The code is all released already. You find it here: https://github.com/rwth-i6/returnn-experiments/tree/master/2...

This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.

We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.

renus · on Jan 29, 2024

We will add the details, thanks for pointing it out.

renus · on Jan 25, 2024

WhisperFusion is fully open-source - https://github.com/collabora/WhisperFusion