Every time an STT/TTS model is posted I wonder if it will change my current workflow on macOS, which is:
STT with Parakeet-V3 via the Hex [1] app for near-instant, good-enough transcription for talking to AI agents.
TTS using KyutAI's Pocket-TTS, an amazing 100M-param model. I used this to make a "voice" plugin [2] for Claude Code.
So far I haven't seen anything that replaces these for me, or I haven't been persuaded enough to spend time testing an alternative (explore/exploit and all that).
[1] Hex STT app - https://github.com/kitlangton/Hex, which is macOS-only.
(also good free/OSS alternatives: Handy, VoiceInk. No need for Wispr, Superwhisper etc)
I guess you mean for STT. For my use case of talking to AIs or coding agents, pure STT accuracy matters less than transcription speed. Transcription needs to be near-instant, and accuracy "good enough" that the AIs can "read between the lines". Parakeet-V3 gives exactly this.
My regular workflow is to run code agents in tmux panes, and I often have Claude Code consult/collaborate with Codex using my tmux-cli [1] tool, a wrapper around tmux that provides good defaults (delays etc.) for robustly sending messages, waiting for completion, and so on.
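The core trick is roughly this (a simplified sketch, not the actual tmux-cli implementation; the pane name and delay value are illustrative):

```python
import subprocess
import time

def build_send_keys(pane: str, text: str) -> list[str]:
    """Construct the tmux argv that types `text` into `pane`."""
    return ["tmux", "send-keys", "-t", pane, text]

def send_message(pane: str, message: str, delay: float = 0.5) -> None:
    """Type the message, pause, then press Enter as a separate keystroke.
    Sending the newline in the same burst as the text is exactly what
    flaky TUIs (agents mid-render) tend to drop."""
    subprocess.run(build_send_keys(pane, message), check=True)
    time.sleep(delay)  # give the target app time to ingest the text
    subprocess.run(build_send_keys(pane, "Enter"), check=True)
```
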
I now often have CC make technical/architecture diagrams with TikZ; the results look much better than Mermaid but still require multiple iterations to fix bad arrows, bad layouts, etc.
Diagrams are still far from solved. We need a good non-gameable diagrams benchmark.
You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:
The 26B-A4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5-35B-A3B. However, the tau2-bench results [1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don't expect the former to do well on tool-heavy agentic tasks:
Did you have any Anthropic-vs-OpenAI API spec issues with Claude Code? I have been using mlx_vlm and vMLX, and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server?
Correct, no issues: for at least a few months now, llama.cpp's server has exposed an Anthropic Messages API at /v1/messages, in addition to the OpenAI-compatible API at /v1/chat/completions. Claude Code uses the former.
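If you want to sanity-check that your local server speaks the Anthropic shape, this is roughly the payload Claude Code POSTs to /v1/messages (a sketch; the model name and port are placeholders for a local llama-server):

```python
import json
import urllib.request

# Minimal Anthropic-style Messages payload. Note the top-level `system`
# field and required `max_tokens` -- both differ from the OpenAI
# /v1/chat/completions shape, which is a common source of 400 errors.
payload = {
    "model": "gemma-4-26b-a4b",  # llama-server serves whatever it loaded
    "max_tokens": 256,
    "system": "You are a helpful coding assistant.",
    "messages": [{"role": "user", "content": "Say hello."}],
}

def post_messages(base_url: str = "http://localhost:8080") -> bytes:
    """POST the payload to a local server's Anthropic-style endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```
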
At least for Gemma4-26B-A4B, token-gen speed with oMLX is far worse on my M1 Max 64GB MacBook than with llama-server:
Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user message):
llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0 --top-p 0.95 --top-k 64):
- pp ≈ 395 tok/s
- tg ≈ 40 tok/s
oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
- pp ≈ 350 tok/s
- tg ≈ 5–13 tok/s
Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x slower than llama.cpp's Metal backend. Counter-intuitive, given MLX is Apple's native ML framework.
Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max with 128GB is the sweet spot for me locally. The prompt-decode caching keeps things coherent and fast even when contexts get north of 100K tokens.
For token-generation speed, a challenging test is to see how a model performs in a code-agent harness like Claude Code, whose system prompt alone runs anywhere from 15K to 40K tokens (+ tools/skills etc.).
Here the 26B-A4B variant is head and shoulders above recent open-weight models, at least on my trusty M1 Max 64GB MacBook.
I set up Claude Code to use this variant via llama-server, with 37K tokens of initial context, and it performs very well: ~40 tok/s, far better than Qwen3.5-35B-A3B, though I don't know yet about its intelligence or tool-calling consistency. Prompt-processing speed is comparable to the Qwen variant at ~400 tok/s.
My informal tests, all with roughly 30K-37K tokens initial context:
For me, one of the most interesting aspects is how compaction works. It turns out compaction still preserves the full original pre-compaction conversation in the session JSONL file, with those messages marked as "not to be sent to the API". This means that even after compaction, if you think something was lost, you can tell CC to "look in the session log files to find details about what we did with XYZ". I knew this before the leak, since it can be seen from the session logs. Some more details:
The full conversation is preserved in the JSONL file, and messages are filtered before being sent to the API.
Key mechanisms:
1. JSONL is append-only — old pre-compaction messages are never deleted. New messages (boundary marker, summary, attachments) are appended after compaction.
2. Messages have flags controlling API visibility:
- isCompactSummary: true — marks the AI-generated summary message
- isVisibleInTranscriptOnly: true — prevents a message from being sent to the API
- isMeta — another filter for non-API messages
- getMessagesAfterCompactBoundary() returns only post-compaction messages for API calls
3. After compaction, the API sees only:
- The compact boundary marker
- The summary message
- Attachments (file refs, plan, skills)
- Any new messages after compaction
4. Three compaction types exist:
- Full compaction — API summarizes all old messages
- Session memory compaction — uses extracted session memory as summary (cheaper)
- Microcompaction — clears old tool result content when cache is cold (>1h idle)
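In Python terms, the replay/filtering amounts to roughly this (a sketch: the JSONL layout and the isCompactBoundary flag name are my guesses; the other flags come from the list above):

```python
import json

def load_api_messages(session_path: str) -> list[dict]:
    """Replay a session JSONL and return only the messages that would be
    sent to the API: drop transcript-only/meta entries and, if a compact
    boundary exists, everything before the most recent one."""
    with open(session_path) as f:
        messages = [json.loads(line) for line in f if line.strip()]

    # These entries stay on disk (the append-only log) but are filtered
    # out of API calls.
    visible = [
        m for m in messages
        if not m.get("isVisibleInTranscriptOnly") and not m.get("isMeta")
    ]

    # Keep the boundary marker itself plus everything after it.
    last_boundary = max(
        (i for i, m in enumerate(visible) if m.get("isCompactBoundary")),
        default=-1,
    )
    return visible[last_boundary:] if last_boundary >= 0 else visible
```
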
I dug into this more. It's disabled by default, and it's a cost/token-usage optimization.
The logic is:
1. Anthropic's API has a server-side prompt cache with a 1-hour TTL
2. When you're actively using a session, each API call reuses the cached prefix — you only pay for new tokens
3. After 1 hour idle, that cache is guaranteed expired
4. Your next message will re-send and re-process the entire conversation from scratch — every token, full price
5. So if you have 150K tokens of old Grep/Read/Bash outputs sitting in the conversation, you're paying to re-ingest all of that even though it's stale context the model probably doesn't need

The microcompact says: "since we're paying full price anyway, let's shrink the bill by clearing the bulky stuff."
What's preserved vs lost:
- The tool_use blocks (what tool was called, with what arguments) — kept
- The tool_result content (the actual output) — replaced with [Old tool result content cleared]
- The most recent 5 tool results — kept
So Claude can still see "I ran Grep for foo in src/" but not the 500-line grep output from 2 hours ago.
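As a sketch, the clearing rule above looks roughly like this (the sentinel string is from the logs; the message shapes and function name are simplified guesses):

```python
CLEARED = "[Old tool result content cleared]"
KEEP_RECENT = 5  # the most recent N tool results survive intact

def microcompact(messages: list[dict]) -> list[dict]:
    """Blank out the content of all but the last KEEP_RECENT tool
    results. tool_use blocks (tool name + arguments) are untouched, so
    the model still sees *what* was run, just not the bulky output."""
    result_idxs = [
        i for i, m in enumerate(messages) if m.get("type") == "tool_result"
    ]
    to_clear = set(result_idxs[:-KEEP_RECENT])  # all but the last 5
    return [
        {**m, "content": CLEARED} if i in to_clear else m
        for i, m in enumerate(messages)
    ]
```
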
Does it affect quality? Yes, somewhat — but the tradeoff is that without it, you're paying potentially tens of thousands of tokens to re-ingest stale tool outputs that the model already acted on. And remember, if the conversation is long enough, full compaction would have summarized those messages anyway.
And critically: this is disabled by default (enabled: false in timeBasedMCConfig.ts:31). It's behind a GrowthBook feature flag that Anthropic controls server-side. So unless they've flipped it on for your account, it's not happening to you.
> it's basically a cost optimization masquerading as a feature
Cost optimization in the user's favor.
Remember that every time you send a new message to the LLM, you are actually re-sending the entire conversation with that new last message appended.
Remember that LLMs are fixed functions; the only variable is the context input (and temperature, sure).
Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.
To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation/context. But they're not going to keep that GPU cache warm for you forever, so it'll time out after some inactivity.
So the microcompaction-on-idle softens the token-consumption blow after you've stepped away for lunch: your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch.
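Back-of-envelope arithmetic for why caching matters, with made-up numbers (100 turns, 1K new tokens per turn):

```python
def total_input_tokens(turns: int, tokens_per_turn: int, cached: bool) -> int:
    """Total input tokens billed over a conversation. Without caching,
    turn i re-sends the whole i-turn prefix (quadratic total). With a
    warm cache you pay only the per-turn delta (linear total)."""
    if cached:
        return turns * tokens_per_turn
    return sum(i * tokens_per_turn for i in range(1, turns + 1))

uncached = total_input_tokens(100, 1000, cached=False)  # 5,050,000 tokens
cached = total_input_tokens(100, 1000, cached=True)     # 100,000 tokens
```

A ~50x difference over 100 turns, which is why providers bother with prefix caches at all, and why a cold cache after lunch hurts.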
I've always liked umlet and umletino (web version) for a nice mix of drag and drop and edit by text editor. In the absence of good enough layout algorithms, the ability to manually drag things to the right place is kind of essential. The resulting diagrams are not so pretty of course.
I have tried a lot of tools in this space. If it comes out looking alright, that's usually because it was so simple that it didn't actually need a diagram. Anything with a bit of non-trivial structure seems to escalate quickly, with essentially no good options other than esoteric styling hacks to make it look any good.
This seems to be a thing where you can have pretty automated layouts, complex diagrams, or correct diagrams and can only have two out of three.
Which means that almost 100% of my use cases for these tools never really work for me unless I sit down and grab some old-school drawing tool (or just give up on the whole notion, which is much more likely). If it was trivial, I wouldn't bother making a diagram. These tools seem only usable for stuff where diagrams were overkill to begin with. I saw no examples in the linked article (or the rest of the site; I browsed the top few recent articles) to really counter this.
Agree. For what it's worth, in interviews Cherny (Claude Code creator) and Steinberger (OpenClaw creator) say they keep things simple and use none of the workflow frameworks. The latter even said he doesn't use plan mode, but I find that very useful: exiting plan mode starts clean with compressed context.
https://news.ycombinator.com/item?id=45114245
[2] Claude Code Voice Plugin - https://pchalasani.github.io/claude-code-tools/plugins-detai...