
yeah, it came out after I started on my project last year. Only issue is that you can't fine-tune it on Apple Silicon.


More data -> better, faster on-device models

The actual plan was to distill Gemini 2.5 Pro into the best on-device voice dictation model.

Pretty sure it would have worked. Alas.


Reasons for running local aside...

What is the practical latency difference you see between on-device and, say, whisper, in streaming mode, over the internet? Comparable? Seems that internet latency would be mostly negligible (assuming reasonable internet/cell coverage), or at least compensated for by the higher end hardware on the other side?


depends on the model!

If you run a smaller whisper-distil variant AND you optimize the decoder to run on Apple Neural Engine, you can get latency down to ~300ms without any backend infra.

The issue is that the smaller models tend to suck, which is why the fine-tuning is valuable.

My hypothesis is that you can distill a giant model like Gemini into a tiny Whisper variant.

but it depends on the machine you're running, which is why local AI is a PITA.
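The distillation idea above can be sketched as a soft-label objective (a minimal illustration of the standard knowledge-distillation recipe, not the author's actual pipeline): the teacher's softened token distribution supervises the student via KL divergence.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    the classic knowledge-distillation objective."""
    p = softmax(teacher_logits, temperature)   # teacher (e.g. a giant model) soft labels
    q = softmax(student_logits, temperature)   # student (e.g. a tiny ASR model) predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Matching logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))      # 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

In a real run you would minimize this loss (plus the usual hard-label term) over the student's parameters; the temperature smooths the teacher's distribution so the student also learns from near-miss tokens.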


re: Whisper v3 -- how is this possible? Whisper has a 30s context window. You have to chunk it.


Yeah sorry that was unclear on my part. I chunk at the endpoint level, whisper itself obviously processes 30s windows. The memory/latency thing I was referring to is more about processing longer files end to end through the pipeline, not a single whisper pass. My fastapi wrapper just splits the audio and runs chunks sequentially so total wall time scales linearly with file length, nothing fancy.
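The sequential chunking described above can be sketched like this (a minimal illustration; `transcribe_30s` is a hypothetical stand-in for a single Whisper pass, not the actual wrapper code):

```python
SAMPLE_RATE = 16_000                         # Whisper's expected sample rate
WINDOW_SECONDS = 30                          # Whisper's fixed context window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS

def split_into_windows(samples):
    """Split raw audio samples into consecutive 30s windows (last may be shorter)."""
    return [samples[i:i + WINDOW_SAMPLES]
            for i in range(0, len(samples), WINDOW_SAMPLES)]

def transcribe_file(samples, transcribe_30s):
    """Run each window through a single-pass transcriber sequentially.

    Wall time scales linearly with file length because windows are processed
    one at a time with no cross-window state.
    """
    return " ".join(transcribe_30s(w) for w in split_into_windows(samples))

# Fake 75 seconds of audio -> 3 windows (30s + 30s + 15s).
fake_audio = [0.0] * (SAMPLE_RATE * 75)
windows = split_into_windows(fake_audio)
print(len(windows))                    # 3
print(len(windows[-1]) / SAMPLE_RATE)  # 15.0
```

A production pipeline would split on silence/endpoints rather than hard 30s boundaries to avoid cutting words in half, but the linear-scaling property is the same.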


Wondering the same. It can certainly run beyond 30 seconds, but at some point I believe the output should degrade.

Plus you could do actual batch inference instead. Or if you must carry forward the context you could still do it linearly, but the memory usage shouldn't just explode.


Great minds think alike!

Also, I had a huge head start, as I spent a month or two working on this in September 2025, shelved it and dusted it back off this weekend.


Excellent work still; your repo is much more robust and fleshed out, and I'm just beelining straight to audio LoRA without really knowing what I'm doing, as this is my first time attempting a ~real ML training project.

I think in https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... you have a superset of the various kludges I have in my fine-tuning repo. I'm going to study this and do what I can to learn from it. Really appreciate you sharing it here!

Definitely interested in swapping notes if you are, though. Probably the biggest thing that came out of this exercise for us was realizing that Apple actually has some really powerful local inference/data-processing tools; they're just marketed much more towards application developers, so a lot of them fly under the radar.

We just published https://github.com/accretional/macos-vision to make Apple's local OCR, image segmentation, foreground-masking, facial analysis, classification, and video tracking functionality accessible via CLI, and more usable in ML and data workloads. Hopefully you or someone else can get some use out of it. I definitely will from yours!


Look inside here: https://github.com/mattmireles/gemma-tuner-multimodal/tree/m...

Here’s the trick: use Gemini Pro deep research to create “Advanced Hacker’s Field Guide for X” where X is the problem that you are trying to solve. Ask for all the known issues, common bugs, unintuitive patterns, etc. Get very detailed if you want.

Then feed that to Claude / Codex / Cursor. Basically, create a cheat sheet for your AI agents.

This will unlock a whole new level of capability.

I’m @mattmireles on Twitter — feel free to DM me.


you are welcome! It was a fun side quest


Memory usage increases quadratically with sequence length. Therefore, using shorter sequences during fine-tuning can prevent memory explosions. On my 64GB RAM machine, I'm limited to input sequences of about 2,000 tokens, considering my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total).
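The quadratic scaling above can be made concrete with a rough back-of-the-envelope (illustrative numbers, not the actual model config: assume fp16 attention score matrices materialized per head per layer):

```python
def attention_score_bytes(seq_len, n_layers, n_heads, bytes_per_el=2):
    """Memory for materialized attention scores: one (seq_len x seq_len)
    matrix per head per layer, in fp16 (2 bytes per element)."""
    return n_layers * n_heads * seq_len * seq_len * bytes_per_el

# Hypothetical mid-size model: 32 layers, 16 heads.
for seq in (1_000, 2_000, 4_000):
    gib = attention_score_bytes(seq, n_layers=32, n_heads=16) / 2**30
    print(f"{seq:>5} tokens -> {gib:.1f} GiB of attention scores")

# Doubling the sequence quadruples the footprint:
ratio = attention_score_bytes(4_000, 32, 16) / attention_score_bytes(2_000, 32, 16)
print(ratio)  # 4.0
```

This counts only the score matrices; training also holds activations and gradients, so the real ceiling arrives even sooner, which is consistent with hitting a wall around 3k tokens on 64GB.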


Ah, that makes sense, quadratic scaling is brutal. So with 96GB I'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing, or is that not worth the speed tradeoff at these sizes?


Haven’t tried yet. That’s on the to-do list. But good suggestion.


Shouldn't FlashAttention address the quadratic increase in memory footprint wrt. fine-tuning/training? I'm also pretty sure that it does not apply to pure inference due to how KV-caching works.


Here's the gist of the paper for anyone interested: https://gist.is/science.org/en/VdSDF9qjxbH8


Nukes gave us peace and freedom.

We've had no WW3 (so far) and no one here needs to worry about being drafted into a war. Gatling might have thought his gun would reduce the number of war fatalities, but Oppenheimer thought he would end the world. Both were wrong.

Alternative take: Inventors are bad at predicting the downstream societal effects of their inventions.


Let's assume a nuclear exchange happens at some point during a war. There is a very high chance that this will cause an escalation leading to a nuclear apocalypse.

Since such an exchange becomes increasingly likely over time, it's more like nukes prevented another major world war and stole a form of peace from the future, temporarily. That peace debt might be repaid with the end of everything.



Funny how the unintentional close calls become more sparse with time. I wonder if that’s because humanity got better at dealing with the responsibility or because the oopsies haven’t been declassified yet.


let's assume the trees rise up and set fire to the ionosphere.


well whatever society is left will definitely be "peaceful" for at least a couple of decades.


Nuclear weapons traded a high probability of a major war for a low probability of an apocalyptic war.

My question is, how low is that probability, exactly? Because the tradeoff looks very different if it’s one in a million per year, versus one in a hundred per year.

My assessment, looking at the history and the close calls, is that it’s more like one in a hundred.


It certainly rises if the USA votes for an irresponsible crook.


It very much depends on where "here" is.

At least, it gives impunity to attack others with less fear of retaliation…


> no one here needs to worry about being drafted into a war.

here meaning the US or HN?


> no one here needs to worry about being drafted into a war

Lots of talk in the UK recently about conscription.



I haven’t heard a peep about conscription, can you provide a source? There was some vague national service proposal for school leavers a couple of years ago, but that was it.



One person in the Lords raising the issue in no way constitutes widespread calls for conscription.


Germany is already working on it. Austria has made its mandatory service longer.


Nuclear PROLIFERATION gave us peace and freedom.

The Americans wanted to keep it all to themselves you know...


Your bar is higher than mine.

Unlike Apple's formal "developer evangelist" and several others I contacted, the guy actually took the time to talk to us, and I was/am grateful for that. He's a cog in a very large corporate machine. Apple is Apple. He's not the CEO. He was doing his job and did me a favor. I am grateful to him.


> Why would someone say "You're a mosquito. Apple will just stomp on you and you will not exist.", it makes zero sense to me given the context laid out here.

I'm telling you what I was told. It's a true story. I was there. It happened to me.

Why would I make up a detail like that?

