Hacker News | d4rkp4ttern's comments

This was on HN 7 months ago:

https://news.ycombinator.com/item?id=45114245

Every time an STT/TTS model is posted I wonder if it will change my current workflow on macOS, which is:

STT with Parakeet-V3 via Hex [1] app for near-instant good-enough transcription for talking to AI agents.

TTS using KyutAI’s Pocket-TTS, an amazing 100M-param model. I used this to make a "voice" plugin [2] for Claude Code.

So far I haven’t seen anything that replaces these for me, nor have I been persuaded enough to spend time testing an alternative (explore/exploit and all that).

[1] Hex STT app - https://github.com/kitlangton/Hex (macOS-only). Also good free/OSS alternatives: Handy and VoiceInk; no need for Wispr, Superwhisper, etc.

[2] Claude Code Voice Plugin - https://pchalasani.github.io/claude-code-tools/plugins-detai...


What do you consider to be the model with the highest accuracy?

I guess you mean for STT. For my use case of talking to AIs or coding agents, pure STT accuracy matters less than transcription speed. Transcription needs to be near-instant, with accuracy "good enough" that the AIs can "read between the lines". Parakeet-V3 gives exactly this.

My regular workflow is to run code agents in tmux panes, and I often have Claude Code consult/collaborate with Codex using my tmux-cli [1] tool, a wrapper around tmux that provides good defaults (delays etc.) for robustly sending messages, waiting for completion, and so on.

[1] https://pchalasani.github.io/claude-code-tools/tools/tmux-cl...
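I haven't shown tmux-cli's actual interface here, but the "robust send" trick such a wrapper relies on (send the text literally, pause, then send Enter separately so the target program doesn't swallow the newline) can be sketched like this; the function name and delay default are my own illustration:

```python
def tmux_send(pane: str, text: str, delay: float = 0.3) -> list[list[str]]:
    """Build the command sequence for a robust two-step tmux send:
    the literal text first, then Enter after a short delay, so the
    target program has time to register the input."""
    return [
        ["tmux", "send-keys", "-t", pane, "-l", text],  # -l: send text literally, no key-name lookup
        ["sleep", str(delay)],
        ["tmux", "send-keys", "-t", pane, "Enter"],
    ]

cmds = tmux_send("%3", "summarize the last test failure")
```

Sending Enter as a separate `send-keys` call is the key defensive move: many TUI programs (including code agents) drop a trailing newline that arrives in the same burst as the text.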


I now often have CC make technical/architecture diagrams with TikZ; the results look much better than Mermaid but still require multiple iterations to fix bad arrows, bad layouts, etc.

Diagrams are still far from solved. We need a good non-gameable diagrams benchmark.


You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:

https://pchalasani.github.io/claude-code-tools/integrations/...

The 26B-A4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5-35B-A3B. However, the tau2-bench results [1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don’t expect the former to do well on tool-heavy agentic tasks:

[1] https://news.ycombinator.com/item?id=47616761


Did you have any Anthropic-vs-OpenAI API issues with Claude Code? I have been using mlx_vlm and vMLX, and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server?


Correct, no issues: for at least a few months now, llama.cpp's server has exposed an Anthropic Messages API at /v1/messages in addition to the OpenAI-compatible API at /v1/chat/completions, and Claude Code uses the former.
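As a sketch of why this matters: the two endpoints expect differently shaped request bodies (a top-level `system` field and a required `max_tokens` for the Anthropic Messages API, vs. a system-role message in the list for the OpenAI-style API). A minimal illustration, with hypothetical model/prompt values:

```python
def anthropic_payload(model: str, system: str, user: str) -> dict:
    # Anthropic Messages API (/v1/messages): system prompt is a
    # top-level field, and max_tokens is required.
    return {
        "model": model,
        "max_tokens": 1024,
        "system": system,
        "messages": [{"role": "user", "content": user}],
    }

def openai_payload(model: str, system: str, user: str) -> dict:
    # OpenAI-compatible API (/v1/chat/completions): the system prompt
    # is just the first entry in the messages list.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
```

A client built for one shape gets 400 Bad Request from a server that only implements the other, which is why a server exposing both endpoints works with Claude Code unmodified.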


I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.


At least for Gemma4-26B-A4B, token-gen speed with oMLX is far worse than with llama-server on my M1 Max 64GB MacBook:

  Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload
  was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user
  message):

  llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0
  --top-p 0.95 --top-k 64):
  - pp ≈ 395 tok/s
  - tg ≈ 40 tok/s

  oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with
  sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
  - pp ≈ 350 tok/s
  - tg ≈ 5–13 tok/s

  Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x
  slower than llama.cpp's Metal backend. Counter-intuitive given MLX is Apple's native ML framework.


Check out vMLX if you use Apple Silicon. https://github.com/jjang-ai/mlxstudio


Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max w 128GB is the sweet spot for me locally. The prompt decode caching keeps things coherent and fast even when contexts get north of 100k tokens.


Have you been using `omlx serve`? If so, how are you bumping up the max context size? I'm not seeing a param to go above 32k?


You can set it in ~/.omlx/settings.json; ask a code-agent to figure it out by pointing it at the oMLX repo.


For token-generation speed, a challenging test is to see how it performs in a code-agent harness like Claude Code, whose system prompt alone is anywhere from 15K to 40K tokens (plus tools/skills, etc.).

Here the 26B-A4B variant is head and shoulders above recent open-weight models, at least on my trusty M1 Max 64GB MacBook.

I set up Claude Code to use this variant via llama-server, with 37K tokens initial context, and it performs very well: ~40 tokens/sec, far better than Qwen3.5-35B-A3B, though I don't know yet about the intelligence or tool-calling consistency. Prompt processing speed is comparable to the Qwen variant at ~400 tok/s.

My informal tests, all with roughly 30K-37K tokens initial context:

    ┌────────────────────┬───────────────┬────────────┐
    │       Model        │ Active Params │ tg (tok/s) │
    ├────────────────────┼───────────────┼────────────┤
    │ Gemma-4-26B-A4B    │ 4B            │ ~40        │
    ├────────────────────┼───────────────┼────────────┤
    │ GPT-OSS-20B        │ 3.6B          │ ~17-38     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-30B-A3B      │ 3B            │ ~15-27     │
    ├────────────────────┼───────────────┼────────────┤
    │ GLM-4.7-Flash      │ 3B            │ ~12-13     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3.5-35B-A3B    │ 3B            │ ~12        │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-Next-80B-A3B │ 3B            │ ~3-5       │
    └────────────────────┴───────────────┴────────────┘

Full instructions for running this and other open-weight models with Claude Code are here:

https://pchalasani.github.io/claude-code-tools/integrations/...


GPT-OSS-20B is not dense.


Thanks, fixed


For me one of the most interesting aspects is how compaction works. It turns out compaction still preserves the full original pre-compaction conversation in the session jsonl file, and those are marked as "not to be sent to the API". Which means, even after compaction, if you think something was lost, you can tell CC to "look in the session log files to find details about what we did with XYZ". I knew this before the leak since it can be seen from the session logs. Some more details:

  The full conversation is preserved in the JSONL file, and messages
  are filtered before being sent to the API.

  Key mechanisms:

  1. JSONL is append-only — old pre-compaction messages are never deleted. New messages (boundary
  marker, summary, attachments) are appended after compaction.
  2. Messages have flags controlling API visibility:
    - isCompactSummary: true — marks the AI-generated summary message
    - isVisibleInTranscriptOnly: true — prevents a message from being sent to the API
    - isMeta — another filter for non-API messages
    - getMessagesAfterCompactBoundary() returns only post-compaction messages for API calls
  3. After compaction, the API sees only:
    - The compact boundary marker
    - The summary message
    - Attachments (file refs, plan, skills)
    - Any new messages after compaction
  4. Three compaction types exist:
    - Full compaction — API summarizes all old messages
    - Session memory compaction — uses extracted session memory as summary (cheaper)
    - Microcompaction — clears old tool result content when cache is cold (>1h idle)
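A toy version of that filtering logic, using the flag names above (`isCompactBoundary` is my own stand-in for the boundary marker, whose real field name I don't know):

```python
def messages_for_api(session_messages: list[dict]) -> list[dict]:
    """Sketch of the post-compaction filter: walk the append-only log,
    keep only messages after the latest compact boundary, and drop
    anything flagged transcript-only or meta."""
    # Find the most recent compaction boundary, if any.
    last_boundary = -1
    for i, m in enumerate(session_messages):
        if m.get("isCompactBoundary"):
            last_boundary = i
    visible = session_messages[last_boundary + 1:] if last_boundary >= 0 else session_messages
    return [
        m for m in visible
        if not m.get("isVisibleInTranscriptOnly") and not m.get("isMeta")
    ]

log = [
    {"role": "user", "content": "old question"},
    {"role": "assistant", "content": "old answer"},
    {"isCompactBoundary": True},  # appended at compaction time
    {"role": "assistant", "content": "summary...", "isCompactSummary": True},
    {"role": "user", "content": "debug note", "isVisibleInTranscriptOnly": True},
    {"role": "user", "content": "new question"},
]
api_msgs = messages_for_api(log)  # summary + new question only
```

The point is that filtering at read time, rather than deleting at compaction time, is what makes the "look in the session log files" recovery trick possible.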


What is microcompaction? I didn’t realize there was anything time-based in CC. When I go eat dinner and come back, has it compacted while I was gone?


I dug into this more. It's disabled by default, and it's a cost/token-usage optimization.

  The logic is:

  1. Anthropic's API has a server-side prompt cache with a 1-hour TTL
  2. When you're actively using a session, each API call reuses the cached prefix — you only pay
  for new tokens
  3. After 1 hour idle, that cache is guaranteed expired
  4. Your next message will re-send and re-process the entire conversation from scratch — every
  token, full price
  5. So if you have 150K tokens of old Grep/Read/Bash outputs sitting in the conversation, you're
  paying to re-ingest all of that even though it's stale context the model probably doesn't need

  The microcompact says: "since we're paying full price anyway, let's shrink the bill by clearing
  the bulky stuff."

  What's preserved vs lost:
  - The tool_use blocks (what tool was called, with what arguments) — kept
  - The tool_result content (the actual output) — replaced with [Old tool result content cleared]
  - The most recent 5 tool results — kept

  So Claude can still see "I ran Grep for foo in src/" but not the 500-line grep output from 2
  hours ago.

  Does it affect quality? Yes, somewhat — but the tradeoff is that without it, you're paying
  potentially tens of thousands of tokens to re-ingest stale tool outputs that the model already
  acted on. And remember, if the conversation is long enough, full compaction would have summarized
   those messages anyway.

  And critically: this is disabled by default (enabled: false in timeBasedMCConfig.ts:31). It's
  behind a GrowthBook feature flag that Anthropic controls server-side. So unless they've flipped
  it on for your account, it's not happening to you.
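A rough sketch of that clearing step (function name is mine, and the real implementation works over structured content blocks rather than flat dicts):

```python
def microcompact(messages: list[dict], keep_last: int = 5) -> list[dict]:
    """Clear old tool_result contents, keeping tool_use blocks untouched
    and the most recent `keep_last` tool results intact."""
    result_idxs = [i for i, m in enumerate(messages) if m.get("type") == "tool_result"]
    to_clear = set(result_idxs[:-keep_last]) if keep_last else set(result_idxs)
    return [
        {**m, "content": "[Old tool result content cleared]"} if i in to_clear else m
        for i, m in enumerate(messages)
    ]

msgs = [{"type": "tool_result", "content": f"output {i}"} for i in range(7)]
cleared = microcompact(msgs)  # the oldest 2 of 7 results get cleared
```

Because only `tool_result` contents are touched, the model still sees which tools were called with which arguments; only the bulky outputs disappear.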


[flagged]


> it's basically a cost optimization masquerading as a feature

Cost optimization in the user's favor.

Remember that every time you send a new message to the LLM, you are actually sending the entire conversation again with that added last message to the LLM.

Remember that LLMs are fixed functions; the only variable is the context input (and temperature, sure).

Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.

To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation/context. But they're not going to keep that GPU cache warm for you forever, so it'll time out after some inactivity.

So the microcompaction-on-idle happens to soften the token consumption blow after you've stepped away for lunch, your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch.
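A back-of-envelope illustration of the quadratic blowup (simplified: it treats cache hits as free, whereas real providers bill cache reads at a reduced rate rather than zero):

```python
def input_tokens_billed(turn_tokens: list[int], cache_warm: bool) -> int:
    """Total input tokens paid for across a conversation. With a warm
    prompt cache you pay roughly only for each turn's new tokens; with
    a cold cache every request re-processes the whole growing prefix."""
    total, prefix = 0, 0
    for t in turn_tokens:
        prefix += t  # conversation grows every turn
        total += t if cache_warm else prefix
    return total

turns = [1000] * 50  # 50 turns, ~1K new tokens each
warm = input_tokens_billed(turns, cache_warm=True)    # linear: 50 * 1000
cold = input_tokens_billed(turns, cache_warm=False)   # quadratic: sum of growing prefixes
```

With these numbers the cold total is over 25x the warm one, which is why shrinking the context before paying the cold-cache price is worth it.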


that frustration regex is missing "idiot", which is the most common frustration word I use with code-agents


Another option I use is to ask the code-agent to make a diagram using TikZ (as a .tex file), which can then be converted to PDF/PNG.

But in general AI diagramming is still unsolved; it needs several iterations to get rid of wonky/wrong arrows, misplaced boxes, misplaced text, etc.
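For reference, a minimal standalone .tex skeleton of the kind an agent can iterate on (node names and layout are purely illustrative); build with pdflatex, then convert the PDF to PNG with e.g. pdftocairo:

```latex
% diagram.tex -- minimal standalone TikZ skeleton for agent-generated diagrams.
% Build: pdflatex diagram.tex   (then e.g. `pdftocairo -png diagram.pdf`)
\documentclass[tikz,border=4pt]{standalone}
\usetikzlibrary{positioning,arrows.meta}
\begin{document}
\begin{tikzpicture}[node distance=12mm,
    box/.style={draw, rounded corners, minimum width=24mm, minimum height=8mm}]
  \node[box] (client) {Client};
  \node[box, right=of client] (server) {Server};
  \node[box, below=of server] (db) {DB};
  \draw[-Stealth] (client) -- node[above] {HTTP} (server);
  \draw[-Stealth] (server) -- (db);
\end{tikzpicture}
\end{document}
```

Using the `positioning` library (`right=of`, `below=of`) instead of absolute coordinates is what makes the iterate-and-fix loop tolerable: moving one box doesn't require recomputing every arrow.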


I've always liked umlet and umletino (web version) for a nice mix of drag and drop and edit by text editor. In the absence of good enough layout algorithms, the ability to manually drag things to the right place is kind of essential. The resulting diagrams are not so pretty of course.

I have tried a lot of tools in this space. If it comes out looking alright, that's usually because it was so simple that it didn't actually need a diagram. Anything with a bit of non-trivial structure seems to quickly escalate, with essentially no good options other than esoteric styling hacks to make it look any good.

This seems to be a thing where you can have pretty automated layouts, complex diagrams, or correct diagrams and can only have two out of three.

Which means that almost 100% of my use cases for these tools never really work for me unless I sit down and grab some old school drawing tool (or just give up on the whole notion, which is much more likely). If it was trivial, I wouldn't bother making a diagram. These tools seem only usable for stuff where diagrams were overkill to begin with. I saw no examples on the linked article (and the rest of the site; I browsed the top few recent articles) to really counter this.


Agree. For what it’s worth, in interviews Cherny (Claude Code creator) and Steinberger (OpenClaw creator) say they keep things simple and use none of the workflow frameworks. The latter said he doesn’t even use plan mode, but I find it very useful: exiting plan mode starts clean with compressed context.


They backed out the “clear context and execute plan” thing recently. It’s a bummer, I thought it was great.


Maybe they figured it wasn't needed with 1M context?


Anecdotal evidence says that the 1M context one still gets stupid around 200-300k tokens.

Context still matters and I'll never stop implementing things in small slices instead of trying to one-shot.


Thanks - I use worktrees and direnv but never thought about this .envrc trick to auto-share .env across worktrees.


But how did you use your main worktree's .env before? Symlink it?

