I'm working on something tangential to this: using cloud coding agents to bring the workflow to mobile. The breakthrough for me was realizing that the IDE isn't needed anymore, and cloud repos + sandboxes open up the ability to continue working from anywhere. mouse.dev
I run the Gemini Live API over a mesh-hosted, managed WebRTC cloud. It works fantastically, and I've been running it for 2 years. You can try WebSockets, handle ephemeral keys, etc., but when you speak with people running voice agents at scale in this space, many of the issues are already solved by WebRTC and Pipecat, and by the many resources this space has poured into what are now solved problems. It certainly feels like overkill, and it probably is, but once a connection is established, it's pretty magical. The startup time and buffering have been solved for quicker voice connections too: https://github.com/pipecat-ai/pipecat-examples/tree/main/ins... (video is harder)
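For anyone curious what that looks like in practice, here's a rough sketch of a Pipecat pipeline fronting Gemini Live over a WebRTC transport. The module paths and constructor params are from Pipecat's examples and shift between versions, so treat the exact names as assumptions rather than a pinned API:

    import asyncio
    import os

    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.pipeline.runner import PipelineRunner
    from pipecat.pipeline.task import PipelineTask
    from pipecat.services.gemini_multimodal_live.gemini import (
        GeminiMultimodalLiveLLMService,
    )
    from pipecat.transports.services.daily import DailyParams, DailyTransport

    async def main():
        # WebRTC transport (Daily-hosted here; other Pipecat transports work too)
        transport = DailyTransport(
            os.environ["DAILY_ROOM_URL"],
            None,                      # token; None for public rooms
            "voice-bot",
            DailyParams(audio_in_enabled=True, audio_out_enabled=True),
        )

        # Speech-to-speech Gemini Live service: no separate STT/TTS stages needed
        llm = GeminiMultimodalLiveLLMService(api_key=os.environ["GOOGLE_API_KEY"])

        # Frames flow: transport input -> model -> transport output
        pipeline = Pipeline([transport.input(), llm, transport.output()])
        await PipelineRunner().run(PipelineTask(pipeline))

    if __name__ == "__main__":
        asyncio.run(main())

The point of the framework is everything you don't see here: reconnection, jitter buffering, interruption handling, all the solved problems mentioned above.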
There are a lot of extremely smart people who have come back to WebRTC time and time again because it continues to solve problems other methods and protocols can't. That said, QUIC is certainly interesting going forward, but I primarily stream voice + vision at 1 fps, so WebRTC just makes sense, and WebSockets fail and are insecure at scale for this use case (see https://www.daily.co/videosaurus/websockets-and-webrtc/). Also, just listen to Sean in this thread; dude knows what's up.
Ocean-2, a full-scale prototype deployed off the Washington coast, built by the Panthalassa team of inventors, programmers, welders, physicists, and engineers.
Before starting CNN, Ted Turner captained the sailing yacht Courageous to an America's Cup victory, 4-0 over the Australians in Newport, RI, during what was arguably sailing's heyday.
Oh my god I finally get a very specific Harvey Birdman joke as a result of this factoid. Fuck me, Phil Ken Sebben as a parody of Ted Turner kinda works.
That I was aware of. I'm more familiar with his media and wildlife conservation efforts than his business acumen or sports achievements. Captain Planet, Turner Classic Movies, Hanna-Barbera, Cartoon Network, etc.
I wish I had known about Pipecat a lot sooner. I found out about it a few weeks back, and since Gemma 4 launched, I've been building my own entirely local voice assistant using Gemma 4 + Kokoro TTS + Whisper from scratch - https://github.com/pncnmnp/strawberry.
Looks like everyone is building one of these. I have my own little version that uses streaming STT; it can actually be too fast in some cases. I also have a little ring buffer grabbing audio from before the wake word detection fires, so it can hear "Hey Jarvis, turn on the lights" without a deliberate pause: https://github.com/jaggederest/pronghorn/
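In case the pre-roll trick is unclear, here's a minimal sketch of the idea. The chunk size, pre-roll length, and detector/STT hooks are placeholders, not pronghorn's actual code:

    from collections import deque

    SAMPLE_RATE = 16_000
    CHUNK = 512                       # samples per audio callback
    PRE_ROLL_SECONDS = 1.5            # how much pre-wake-word audio to keep

    ring = deque(maxlen=int(SAMPLE_RATE * PRE_ROLL_SECONDS / CHUNK))

    def detect_wake_word(chunk) -> bool:
        return False                  # placeholder: plug in a real detector

    def start_transcription(chunks):
        pass                          # placeholder: hand off to streaming STT

    def on_audio_chunk(chunk):
        """Called for every chunk coming off the mic stream."""
        ring.append(chunk)
        if detect_wake_word(chunk):
            # Prepend the buffered audio so speech that overlapped the wake
            # word ("...turn on the lights") isn't lost.
            start_transcription(list(ring))
            ring.clear()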
The whole setup works on my M2 MacBook Pro with 16 GB RAM. I use Gemma 4B via LiteRT-LM.
I've found that LiteRT-LM has a much lower DRAM footprint than Ollama. I've also made tons of optimizations in the code. For example, you can do quite a bit with a 16k context window for a voice assistant while keeping a good footprint, so I keep track of token usage and perform an auto-compaction after a while. I use sub-agents and only do deep-think calls with them, so the context window is separated out. In a multi-turn conversation, if Gemma 4 directly processes audio input, the KV cache fills up within a few turns, so I channel it all via Whisper.
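As a sketch of what that token tracking plus auto-compaction can look like (the threshold, the keep-last-N tail, and the summarize step are my assumptions, not strawberry's exact implementation):

    CONTEXT_WINDOW = 16_384
    COMPACT_AT = int(CONTEXT_WINDOW * 0.75)    # compact before the window fills

    history = []                               # [{"role": ..., "text": ...}]
    used_tokens = 0

    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)          # crude stand-in for a tokenizer

    def summarize(turns) -> str:
        # Placeholder: in practice this would be an LLM call over older turns.
        return "Earlier conversation: " + " | ".join(t["text"][:40] for t in turns)

    def compact():
        """Replace older turns with a summary, keeping the recent tail verbatim."""
        global used_tokens
        keep = history[-4:]
        history[:] = [{"role": "system", "text": summarize(history[:-4])}, *keep]
        used_tokens = sum(count_tokens(t["text"]) for t in history)

    def add_turn(role: str, text: str):
        global used_tokens
        history.append({"role": role, "text": text})
        used_tokens += count_tokens(text)
        if used_tokens > COMPACT_AT:
            compact()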
Also, by far the biggest optimization is the 3-stage producer-consumer architecture. LiteRT-LM streams tokens, and I split them into sentences. A synthesizer thread then converts each sentence to audio via Kokoro TTS, and the main thread plays the audio chunks sequentially. There's also a parallel barge-in monitor thread. https://github.com/pncnmnp/strawberry/blob/main/main.py#L446
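A toy version of that 3-stage pipeline, with queues between the stages and a barge-in event (the TTS and playback calls are stubs; see main.py above for the real thing):

    import queue
    import re
    import threading

    sentences: "queue.Queue[str | None]" = queue.Queue()
    audio: "queue.Queue[bytes | None]" = queue.Queue()
    barge_in = threading.Event()      # set by a parallel mic-monitor thread

    def tts(sentence: str) -> bytes:  # stub for Kokoro TTS
        return sentence.encode()

    def play(chunk: bytes):           # stub for the audio sink
        pass

    def producer(token_iter):
        """Stage 1: accumulate streamed tokens, emit complete sentences."""
        buf = ""
        for tok in token_iter:
            buf += tok
            parts = re.split(r"(?<=[.!?])\s+", buf)
            for sent in parts[:-1]:
                sentences.put(sent)
            buf = parts[-1]
        if buf.strip():
            sentences.put(buf)
        sentences.put(None)           # end-of-stream sentinel

    def synthesizer():
        """Stage 2: turn each sentence into audio as soon as it's ready."""
        while (sent := sentences.get()) is not None:
            audio.put(tts(sent))
        audio.put(None)

    def player():
        """Stage 3: play chunks in order; bail out immediately on barge-in."""
        while (chunk := audio.get()) is not None and not barge_in.is_set():
            play(chunk)

    synth = threading.Thread(target=synthesizer)
    out = threading.Thread(target=player)
    synth.start(); out.start()
    producer(iter(["Hel", "lo there. ", "How can ", "I help?"]))
    synth.join(); out.join()

The win is latency: the first sentence starts playing while the model is still generating the rest of the response.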
I did not want to use openWakeWord or Picovoice because they had limitations on which wake word you could choose. The alternative was to train a model of my own. So I created my own wake word detection pipeline using Whisper Tiny, and it works surprisingly well: https://github.com/pncnmnp/strawberry/blob/main/main.py#L143...
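One plausible shape of a Whisper-based wake word check is to transcribe a short rolling window and fuzzy-match the wake phrase against it. That's roughly the idea, though the threshold and phrase below are my placeholders, not strawberry's exact code:

    from difflib import SequenceMatcher

    import whisper                    # openai-whisper

    model = whisper.load_model("tiny.en")
    WAKE_PHRASE = "hey strawberry"    # placeholder phrase
    THRESHOLD = 0.8

    def heard_wake_word(wav_path: str) -> bool:
        text = model.transcribe(wav_path, fp16=False)["text"].lower()
        words = text.split()
        n = len(WAKE_PHRASE.split())
        # Slide an n-word window over the transcript and fuzzy-match it,
        # so minor mis-transcriptions ("hey straw berry") still trigger.
        return any(
            SequenceMatcher(None, " ".join(words[i:i + n]), WAKE_PHRASE).ratio()
            >= THRESHOLD
            for i in range(max(1, len(words) - n + 1))
        )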
I'm using the MacBook's built-in microphones for this, though, and I haven't fully tested it with other microphones. I've been ironing out the rough edges on a daily basis. I should write a quick blog on this too.
Check out [0]. You can do 'Voice AI' on small/cheap hardware. It's the most fun you can have in the space ATM :) It's been a while, but I posted a demo here [1].
If you like Pipecat's focus on speed, you might also try out our open-source project, which comes with all the batteries included (knowledge base, telephony/SIP, variables, BYOK for any LLM/STT/TTS, speech-to-speech, etc.).
Agreed. OpenCode is a strong base, and with a couple of modifications it can become a very effective harness. For my side project mouse.dev, I've been combining parts from OpenCode, Claude Code, and Hermes to build a cloud agent architecture that works well from mobile.
I haven't run formal evals, but I improved the experience for my own needs, and it feels noticeably better with these modifications:
- Claude-style subagents
- an MCP layer for higher-level tools
- Cursor-style control-plane modes like Ask, Plan, Debug, and Build
The MCP layer lets the harness use things like GitHub file/code read, PR creation, web search/fetch, structured user questions, plan-mode switching, user skills, and subagents.
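For a sense of what that layer looks like, here's a hedged sketch using the MCP Python SDK's FastMCP. The tool names mirror the capabilities just listed, but the bodies are illustrative stubs, not mouse.dev's actual implementation:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("harness-tools")

    @mcp.tool()
    def ask_user_question(question: str, options: list[str]) -> str:
        """Surface a structured question to the user and wait for an answer."""
        # Illustrative stub; a real harness would route this to the mobile UI.
        return options[0] if options else "ok"

    @mcp.tool()
    def enter_plan_mode(goal: str) -> str:
        """Switch the agent into plan mode for the given goal."""
        return f"plan mode entered for: {goal}"

    @mcp.tool()
    def spawn_subagent(task: str) -> str:
        """Kick off a subagent with its own context window."""
        return f"subagent started: {task}"

    if __name__ == "__main__":
        mcp.run()                      # serves the tools over stdio by default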
So the improvement is mostly from better UI/UX orchestration and tool access. There are some things from Hermes that are interesting as well.
Most of my focus has been on applying this stack to sandboxed cloud agents so you can properly code and work from mobile devices.
I can't definitively say that the stack is better or worse than Claude Code; it's more just tuned for my use case, I guess.
Kudos! Cool idea. I'm on the same path you are; you're just one step ahead. For mouse.dev, what are you using for the cloud agent sandbox piece? I haven't moved my agents to the cloud yet (for on-the-go mobile enablement). Would Islo be a competitor to mouse?
Cool! I've been mostly building for what coding from an iPhone can look like. The cloud agent sandbox portion is definitely not polished yet but is working well so far. I looked at Daytona, E2B, Modal, etc., but decided to roll my own with Fly.io, with a TTL on agent creation. Mouse uses per-thread sandboxes (not shared-container multi-workspace) and Postgres for agent history, etc.
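Rough shape of the per-thread sandbox creation against the Fly Machines API. The endpoint and auto_destroy flag are Fly's; the app name, image, and the timeout-as-TTL trick are just assumptions about one way to do it, not mouse's actual code:

    import os

    import requests

    FLY_API = "https://api.machines.dev/v1"
    APP = "mouse-sandboxes"          # hypothetical app name
    HEADERS = {"Authorization": f"Bearer {os.environ['FLY_API_TOKEN']}"}

    def create_sandbox(thread_id: str, ttl_seconds: int = 3600) -> dict:
        config = {
            "image": "registry.fly.io/agent-sandbox:latest",  # hypothetical image
            "auto_destroy": True,     # destroy the machine when its process exits
            "env": {"THREAD_ID": thread_id},
            # Enforce the TTL inside the sandbox: exit (and thus auto-destroy)
            # after ttl_seconds unless the agent finishes sooner.
            "init": {"cmd": ["sh", "-c", f"timeout {ttl_seconds} /agent/run"]},
        }
        resp = requests.post(
            f"{FLY_API}/apps/{APP}/machines",
            headers=HEADERS,
            json={"name": f"thread-{thread_id}", "config": config},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()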
I'll have to look more at Islo. I definitely think it's a growing space with a lot of opportunity for those who participate and solve problems.
I'm a Claude Code Web fan and a rather heavy user, so I was interested in your product. However, I couldn't find an answer on the website: which parts did you find so good that you ported them?
Nothing groundbreaking, but I'll do a blog writeup on the architecture if it would be helpful for people. My focus has been on mobile.
The main pieces I've integrated for mouse.dev, inspired by Claude/Cursor, were plan mode, agent questions, subagents, pre/post hooks, context compaction, repo-local skills, and permission modes. So mostly tools like enter_plan_mode, ask_user_question, and spawn_subagent, plus .mouse/skills and .mouse/plans.
One nice feature is continuity. If you're working on desktop and save a plan to .mouse/plans, you can pick it up later on mobile with cloud agents, or do the reverse: plan something from your phone, then review and build it when you're back at your desk. That was my initial goal with this project, because I've found the plan/act loop so helpful.
Mouse Cloud Agents is mostly an OpenCode-based harness, but everything routes through our MCP/event system so it’s mobile-first and provider-agnostic.
I intentionally skipped a lot of IDE and Claude Code style desktop features. The bet is that this new style of coding is becoming less “edit files in an IDE” and more “steer a capable coding chatbot.”
Would love to hear from anyone reading who's iterating on harness architecture; it's been really fun to work on.