I use Claude Code a lot. A lot lot. I make it do atomic Git commits for me. When it gets stuck and, instead of just saying so, starts to refactor half of the codebase, I jump back to the commit where the issue first appeared and get a summary of the involved files. Those go in full text (not as file uploads) into o3-pro. And you can be sure it finds the issue, or at least points out where the issue does not appear. Would love to have o3-pro as an MCP, so whenever Claude Code goes on a "let's refactor everything" coding spree, it just asks o3-pro.
Sounds like you're doing the equivalent of Aider's architect mode (use one model for the reasoning, and another for the code changes).
I would encourage you to try it. It's generally (much) cheaper doing stuff in Aider, but if you're paying a monthly subscription and using it a lot, Claude Code may be cheaper...
Can you give an example of what Claude works on autonomously for hours? I only use the chat; maybe I'm just not prompting well, but I throw away almost everything Claude writes and solve it in significantly fewer lines of code using the proper abstractions.
Currently I am coding a Node/React/TS Firebase app that allows dynamic multi-agent workflows to automate content workflows (a workflow.json defines: call this model, then pass this part of that model's output to that other model, then combine it with this model to do that).
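To make that concrete, a workflow.json step graph could look roughly like the sketch below; the field names (id, model, inputFrom, etc.) and the model names are purely illustrative, not the actual schema of the app:

    // rough sketch of a workflow.json step graph
    // field names and model names are made up for illustration
    interface WorkflowStep {
      id: string;                                     // unique step name
      model: string;                                  // which model runs this step
      prompt: string;                                 // instruction for this step
      inputFrom?: { step: string; field: string }[];  // parts of earlier outputs to pass in
    }

    interface Workflow {
      name: string;
      steps: WorkflowStep[];
    }

    const workflow: Workflow = {
      name: "draft-and-polish",
      steps: [
        { id: "outline", model: "gemini-2.5-pro", prompt: "Outline the article." },
        { id: "draft", model: "gpt-4o", prompt: "Write the article from the outline.",
          inputFrom: [{ step: "outline", field: "text" }] },
        { id: "combine", model: "claude-sonnet", prompt: "Merge outline and draft into a final piece.",
          inputFrom: [{ step: "outline", field: "text" }, { step: "draft", field: "text" }] },
      ],
    };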
My setup is Claude Code in YOLO mode with the Playwright MCP + a browser MCP (to do stuff in the logged-in Firebase web interface), plus search enabled.
The prototype was developed via Firebase Studio until I reached a dead end there; then I used Claude Code to rip out Firebase Genkit and hooked in google-genai, openai, ...
The whole codebase goes into Google Gemini Studio (because of the million-token window) to write tickets, more tickets, and even more tickets.
Claude Code then has the job of implementing these tickets (creating a detailed task list for each ticket first) and then coding until done. The end of each task list is a working Playwright end-to-end test with verified output.
And atomic commits.
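A minimal sketch of what such an end-to-end check could look like; the URL, selectors, and expected text are placeholders, not the app's real UI:

    // minimal Playwright end-to-end test sketch
    // URL, selectors, and expected output text are placeholders
    import { test, expect } from '@playwright/test';

    test('workflow run produces verified output', async ({ page }) => {
      await page.goto('http://localhost:3000');  // local dev server
      await page.getByRole('button', { name: 'Run workflow' }).click();
      // wait for the multi-model pipeline to finish and check real output appeared
      await expect(page.getByTestId('workflow-output'))
        .toContainText('combined result', { timeout: 120_000 });
    });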
I hooked AnyDesk up to my computer so I can check in at some point to tell it to continue or to read Claude.md again (the meta instructions which basically tell it not to use fallbacks, mock data, or cheat in any other way).
Every fourth ticket is refactoring for simplicity and documentation.
The tickets must be updated before each commit and moved to the done folder only when 100% tested OK.
So yeah, when I wake up in the morning, either magic happened and the tickets are all done, or it got stuck and refactored half the codebase. In that case it works for an hour going over all the git commits to find out where it went wrong.
What I need are multiple coding agents which challenge each other at crucial points.
I have to ask a probably naive question: after the initial boilerplate/scaffolding, is this actually any faster than just typing in the code you want? Or using the standard AI flow from before these long-task agents? It feels like you are juggling and bouncing between async tools, doubling back on output, and relying on constant trial and error to get things working.
I'm sure lots of code is being generated, but I do wonder about the effectiveness ratio when I read comments like the one above. Like there is a sweet spot after the initial scaffold where it's easier just to express yourself in code?
Yeah, so far I've only seen it work in cases where the task is extremely simple: using pervasively adopted libraries to build widely implemented solutions. Add something a little out there and things start to unravel.