Ok, my test harness is ready. My A40 box will be busy until later tonight, but on an NVIDIA A2 [1], this is the batchsize=1 throughput I'm seeing (a sketch of the measurement follows the numbers). Common Voice, default Whisper settings, card is staying at 97-100% utilization:
tiny.en: ~18 sec/sec
base.en: ~14 sec/sec
small.en: ~6 sec/sec
medium.en: ~2.2 sec/sec
large: ~1.0 sec/sec (fairly wide variance when ramping up as this is slow to process individual clips)
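The measurement itself is nothing fancy, roughly the sketch below using the openai-whisper package. The clip path and count here are placeholders, not my actual harness:

    import glob
    import time

    import whisper  # pip install openai-whisper

    model = whisper.load_model("small.en")  # swap in tiny.en, base.en, medium.en, large
    clips = sorted(glob.glob("common_voice/clips/*.mp3"))[:200]  # placeholder path/count

    audio_sec = 0.0
    wall_sec = 0.0
    for path in clips:
        audio = whisper.load_audio(path)  # 16 kHz mono float32 waveform
        audio_sec += len(audio) / whisper.audio.SAMPLE_RATE
        t0 = time.perf_counter()
        model.transcribe(audio)  # default decoding options, batch size 1
        wall_sec += time.perf_counter() - t0

    print(f"{audio_sec / wall_sec:.1f} seconds of audio transcribed per second")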
Isn’t the A2 much weaker than a 3090? So those results are promising.
EDIT: for what it's worth, NVIDIA rates the A2 at 18 TFLOPS of FP16, and Apple rates the current A16 Neural Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples to apples" comparison.
If you count the GPU component and memory bandwidth, the Apple M2 is slightly weaker on paper for 16-bit inference than the NVIDIA A2, if you manage to use the whole chip efficiently. The A16 is then slightly weaker than the M2.
Sure, the Whisper Tiny model is probably going to be fast enough, but from my preliminary results I'm not sure it will be any better than other models that are much, much faster in this power class.
Whisper Large looks pretty cool, but it seems much harder to run in any meaningful realtime fashion. It's likely pretty useful for batch transcription though.
Even if you hit a realtime factor of 1x, the model can leverage up to 30 seconds of future audio context. So at 1x, if you speak for 10 seconds, you'll potentially need to wait another 10 seconds to use the result. This kind of latency is generally unsatisfying.
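To put rough numbers on that, treating the throughput figures above as real-time factors and assuming nothing is emitted until you stop speaking:

    # Back-of-the-envelope: if transcription only starts once you finish talking,
    # the extra wait is roughly clip length / real-time factor.
    def post_speech_wait(clip_seconds, realtime_factor):
        return clip_seconds / realtime_factor

    print(post_speech_wait(10, 1.0))   # large at ~1x    -> ~10 s of extra waiting
    print(post_speech_wait(10, 18.0))  # tiny.en at ~18x -> ~0.6 s of extra waiting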
EDIT: After writing and posting the original version of this comment, I did an experiment where I dictated it to Siri, and then saved that audio (which was recorded simultaneously), which I then fed to both Whisper's tiny.en and medium.en... Siri did terribly for me. Whisper tiny.en was 100% accurate, as far as I can tell, and the only thing Whisper medium.en did was add a few commas that tiny.en had missed. I actually ended up playing the audio file for Siri as well, and that did not end well either. YMMV, but even the tiny model seems very useful.
tiny.en took 17.5 seconds to process the ~1 minute audio file, and medium.en took 351 seconds, but I think there is a lot of room for performance optimization on this M2 MBA. The model evaluation was purely using the CPU, not GPU or neural engine, and it wasn't even using all of the CPU cores for whatever reason.
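On the CPU-core issue, this is roughly what I'd try next, as an untested guess rather than a verified fix: explicitly raise PyTorch's thread count and time the same file again. whisper.load_model also takes a device argument, so the "mps" (Metal) backend could be worth poking at too, though I don't know whether all of Whisper's ops run on it. The file name below is just a placeholder.

    import os
    import time

    import torch
    import whisper

    torch.set_num_threads(os.cpu_count())  # let PyTorch use every core
    print("torch threads:", torch.get_num_threads())

    model = whisper.load_model("tiny.en", device="cpu")  # device="mps" is the other experiment

    t0 = time.perf_counter()
    result = model.transcribe("dictation.m4a")  # placeholder file name
    print(f"{time.perf_counter() - t0:.1f} s elapsed:", result["text"][:80])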
----
With Siri dictation, I feel like I usually spend at least as much time correcting its mistakes as I do speaking the dictation itself. In some cases, that is still faster/easier than typing, but I would rather have a voice model that can work in about the same total amount of time without requiring constant corrections. If I speak for 30 seconds, then I can do other things for 30 seconds while my phone processes it… that might actually be preferable if it gets it right. Otherwise, I’ll be spending 30 seconds actively editing it anyway. Even an improvement on the number of edits required per dictation would be nice. Admittedly, I feel like Google and Microsoft already do a much better job here.
It could be interesting to use the tiny model to give a preview of the writing while the large model is taking its time, and then allow the user to tap on words that changed to see the predictions from the tiny model and correct back to them if they want. I was doing some experiments a few minutes ago, and on one audio clip, the tiny model wrote down a very literal interpretation of an uncommon sci-fi word, and that was more accurate than either the medium or the large models. The rest of the time, the larger models did better, as expected.
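A rough sketch of that preview idea, nothing more than difflib over the two word sequences (the example strings are made up, not my actual clip): align the tiny and large transcripts and mark the spans that changed, so the UI knows where a tap-to-revert makes sense.

    from difflib import SequenceMatcher

    def changed_spans(tiny_text, large_text):
        tiny_words = tiny_text.split()
        large_words = large_text.split()
        spans = []
        for op, i1, i2, j1, j2 in SequenceMatcher(a=tiny_words, b=large_words).get_opcodes():
            if op != "equal":
                spans.append({
                    "tiny": " ".join(tiny_words[i1:i2]),    # what the fast preview showed
                    "large": " ".join(large_words[j1:j2]),  # what the slower pass wrote
                })
        return spans

    # made-up strings, just to show the shape of the output
    print(changed_spans("the ansible relay went down", "the answerable relay went down"))
    # -> [{'tiny': 'ansible', 'large': 'answerable'}]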
But, I don’t know. This is interesting to me, but I agree there could be issues with making it workable for real-time transcription.