What is the practical latency difference you see between on-device and, say, Whisper run remotely in streaming mode over the internet? Comparable? It seems that network latency would be mostly negligible (assuming reasonable internet/cell coverage), or at least compensated for by the higher-end hardware on the other side?
If you run a smaller distil-whisper variant AND you optimize the decoder to run on the Apple Neural Engine, you can get latency down to ~300ms without any backend infra.
The issue is that the smaller models tend to suck, which is why the fine-tuning is valuable.
My hypothesis is that you can distill a giant model like Gemini down into a tiny distil-whisper model, but it depends on the machine you are running, which is why local AI is a PITA.
Yeah, sorry, that was unclear on my part. I chunk at the endpoint level; Whisper itself obviously processes 30s windows. The memory/latency thing I was referring to is more about processing longer files end to end through the pipeline, not a single Whisper pass. My FastAPI wrapper just splits the audio and runs chunks sequentially, so total wall time scales linearly with file length, nothing fancy.
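Roughly what that wrapper does, as a minimal sketch (assuming the openai-whisper package; the chunk size and model name are just illustrative):

    import whisper  # openai-whisper package

    CHUNK_SECONDS = 30
    SAMPLE_RATE = 16_000  # whisper.load_audio resamples everything to 16 kHz

    model = whisper.load_model("small")          # model size is illustrative
    audio = whisper.load_audio("long_file.wav")  # mono float32 array

    # Split into 30 s chunks and transcribe them one after another,
    # so total wall time grows linearly with file length.
    texts = []
    for start in range(0, len(audio), CHUNK_SECONDS * SAMPLE_RATE):
        chunk = audio[start:start + CHUNK_SECONDS * SAMPLE_RATE]
        result = model.transcribe(chunk, fp16=False)
        texts.append(result["text"].strip())

    print(" ".join(texts))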
Wondering the same. It certainly can run beyond 30 seconds, but at some point I believe the output should degrade.
Plus, you could do actual batch inference instead. Or, if you must carry forward the context, you could still do it linearly, but the memory usage shouldn't just explode.
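A sketch of what batched inference could look like with the Hugging Face ASR pipeline, assuming the transformers package (the model name is just an example): chunk_length_s splits the long audio internally and batch_size runs those chunks through the model together.

    from transformers import pipeline

    # Chunks the audio into 30 s windows internally and runs them
    # through the model in batches instead of one at a time.
    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-small.en",  # illustrative model id
        chunk_length_s=30,
        batch_size=8,
    )
    print(asr("long_file.wav")["text"])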
Excellent work still; your repo is much more robust and fleshed out, and I am just beelining straight to an audio LoRA without really knowing what I'm doing, as this is my first time attempting a ~real ML training project.
Definitely interested in swapping notes if you are, though. Probably the biggest thing that came out of this exercise for us was realizing that Apple actually has some really powerful local inference/data processing tools available; they're just marketed much more towards application developers, so a lot of them fly under the radar.
We just published https://github.com/accretional/macos-vision to make Apple's local OCR, image segmentation, foreground masking, facial analysis, classification, and video tracking functionality accessible to anybody via CLI, and hopefully more common in ML and data workloads. Hopefully you or someone else can get some use out of it. I definitely will from yours!
Here’s the trick: use Gemini Pro deep research to create “Advanced Hacker’s Field Guide for X” where X is the problem that you are trying to solve. Ask for all the known issues, common bugs, unintuitive patterns, etc. Get very detailed if you want.
Then feed that to Claude / Codex / Cursor. Basically, create a cheat sheet for your AI agents.
Memory usage increases quadratically with sequence length. Therefore, using shorter sequences during fine-tuning can prevent memory explosions. On my 64GB RAM machine, I'm limited to input sequences of about 2,000 tokens, considering my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total).
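For intuition, a rough back-of-the-envelope sketch of why the attention score matrices blow up (the layer/head counts and dtype are illustrative assumptions, and this ignores fused kernels that avoid materializing the full matrix):

    # Rough memory for materialized attention score matrices during training,
    # assuming activations for every layer are kept around for backprop.
    def attn_scores_gb(seq_len, layers=32, heads=32, bytes_per_val=2, batch=1):
        # one (seq_len x seq_len) matrix per head per layer
        return batch * layers * heads * seq_len**2 * bytes_per_val / 1024**3

    print(attn_scores_gb(2_000))  # ~ 7.6 GB
    print(attn_scores_gb(4_000))  # ~30.5 GB; 2x the length costs 4x the memory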
Ah, that makes sense; quadratic scaling is brutal. So with 96GB I'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing, or is that not worth the speed tradeoff at these sizes?
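For reference, if the model goes through Hugging Face transformers, gradient checkpointing is usually just a flag; a minimal sketch:

    from transformers import TrainingArguments

    # Recompute activations during the backward pass instead of storing them,
    # trading extra compute for a much smaller activation memory footprint.
    args = TrainingArguments(
        output_dir="out",
        gradient_checkpointing=True,
    )
    # or directly on an already-loaded model:
    # model.gradient_checkpointing_enable()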
Shouldn't FlashAttention address the quadratic increase in memory footprint w.r.t. sequence length during fine-tuning/training? I'm also pretty sure the quadratic blow-up does not apply to pure inference, due to how KV-caching works.
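Assuming the model is loaded through Hugging Face transformers and the flash-attn package is installed, enabling it is typically just a loading flag; a minimal sketch (the model id is illustrative):

    import torch
    from transformers import AutoModelForCausalLM

    # FlashAttention computes attention in tiles, so the full
    # seq_len x seq_len score matrix is never materialized in memory.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",  # illustrative model id
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )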
We've had no WW3 (so far) and no one here needs to worry about being drafted into a war. Gatling might have thought his gun would reduce the number of war fatalities, but Oppenheimer thought he would end the world. Both were wrong.
Alternative take: Inventors are bad at predicting the downstream societal effects of their inventions.
Let's assume a nuclear exchange happens at some point during a war. There is a very high chance that this will cause an escalation leading to a nuclear apocalypse.
Since this result is presumably inevitable given enough time, it's more like nukes prevented another major world war and stole a form of peace from the future, temporarily. That peace debt might be repaid with the end of everything.
Funny how the unintentional close calls become more sparse with time. I wonder if that’s because humanity got better at dealing with the responsibility or because the oopsies haven’t been declassified yet.
Nuclear weapons traded a high probability of a major war for a low probability of an apocalyptic war.
My question is, how low is that probability, exactly? Because the tradeoff looks very different if it’s one in a million per year, versus one in a hundred per year.
My assessment, looking at the history and the close calls, is that it’s more like one in a hundred.
I haven't heard a peep about conscription; can you provide a source? There was some vague national service proposal for school leavers a couple of years ago, but that was it.
Unlike Apple's formal "developer evangelist" and several others I contacted, the guy actually took the time to talk to us, and I was/am grateful for that. He's a cog in a very large corporate machine. Apple is Apple. He's not the CEO. He was doing his job and did me a favor. I am grateful to him.
> Why would someone say "You're a mosquito. Apple will just stomp on you and you will not exist.", it makes zero sense to me given the context laid out here.
I'm telling you what I was told. It's a true story. I was there. It happened to me.