The debugging part at this scale is harder than you would expect - behavioral drift between parallel agent instances is nearly invisible without something aggregating what they are actually doing across runs. We hit this ourselves: two agents completing the same task successfully via completely different paths, one of which quietly broke edge cases in prod. The only thing that caught it was treating the conversation traces as a dataset, not just logs.
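Rough sketch of what "traces as a dataset" looked like for us - assuming each run is a JSONL record with a task id and an ordered list of tool calls (field names here are illustrative, not any particular framework's schema): group runs by task and flag tasks where agents took materially different paths.

```python
import json
from collections import defaultdict

def load_traces(path):
    """Each line: {"task_id": ..., "agent_id": ..., "tool_calls": [...]} (illustrative schema)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def divergent_tasks(traces):
    """Group runs by task and flag tasks where agents took different tool-call paths."""
    paths = defaultdict(set)
    for run in traces:
        # Normalize a run to the ordered tuple of tool names it invoked.
        paths[run["task_id"]].add(tuple(call["tool"] for call in run["tool_calls"]))
    return {task: variants for task, variants in paths.items() if len(variants) > 1}

for task, variants in divergent_tasks(load_traces("traces.jsonl")).items():
    print(task, "->", len(variants), "distinct paths")
```

Even something this crude surfaces the "two different paths to the same green checkmark" cases that never show up when you only read individual logs.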
Imbue team member here - that's an interesting problem in general, but we haven't really run into this a lot here. Each testing agent is asked to work on one single issue and, to our slight surprise, most of the changes merge cleanly.
When they don't merge cleanly, it's time for human intervention, and the integration step leaves a record of which branches failed to merge.
Finally, when you do need to debug individual agents:
- Because mngr is, at a low level, just managed tmux sessions (local and remote), it's very easy to attach to those sessions (`mngr connect`). This works even if the agent has been stopped, because mngr remembers enough about an agent to resurrect it.
- `mngr message` also allows you to batch-message a bunch of agents. So if you do need to resume a lot of agents, you can experiment on one agent, figure out a good prompt, and then batch-message every other agent.
In this testing scenario, most agents don't actually require human intervention, and we've found that just connecting to a few individual agents to resolve problems is smooth and easy enough.
I think they're one of the very few who actually support eBPF & XDP, which you do need when you're building low-level stuff. Plus the bare-metal setup is out of this world lol.
Runtime policies as an actual gate rather than prompt instructions is the right model. Most frameworks just bolt governance on as a wrapper and hope the model obeys. What I'd want on top of this: observability into why agents are hitting policy blocks, not just that they were blocked.
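Agreed, and the cheapest version of that observability is making the gate return a structured decision rather than a bare allow/deny, so every block carries the policy that fired and why. A minimal sketch - the policy shapes and field names are invented for illustration, not any framework's API:

```python
from dataclasses import dataclass, field
import time

@dataclass
class Decision:
    allowed: bool
    policy: str     # which policy fired
    reason: str     # human-readable explanation of the block
    context: dict   # what the agent was trying to do

@dataclass
class PolicyGate:
    """Evaluates an action against policies and keeps an audit log of every block."""
    policies: list                       # callables: action -> (ok, reason)
    audit_log: list = field(default_factory=list)

    def check(self, action: dict) -> Decision:
        for policy in self.policies:
            ok, reason = policy(action)
            if not ok:
                decision = Decision(False, policy.__name__, reason, action)
                self.audit_log.append((time.time(), decision))
                return decision
        return Decision(True, "", "", action)

def no_external_writes(action):
    if action.get("type") == "http" and action.get("method") in ("POST", "PUT", "DELETE"):
        return False, f"external write to {action.get('url')} not permitted"
    return True, ""

gate = PolicyGate(policies=[no_external_writes])
print(gate.check({"type": "http", "method": "POST", "url": "https://example.com/api"}))
```

Once blocks are structured records instead of log lines, "why do agents keep hitting this policy" becomes a query over the audit log rather than a grep.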
Yep, totally agree. And Orloj has this built in. It tracks the entire lifecycle of your tasks through traces in real time, so you can audit why things went well or badly. During a task you can see how many tokens each call used (input/output) and the latency of each model/tool call.
The oracle problem is tractable when the output is code: you can compile it, run tests, diff the output. For conversational AI it's much harder. We've seen teams use LLM-as-judge as their validation layer and it works until the judge starts missing the same failure modes as the generator.
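To make the code case concrete, the oracle really can be that mechanical: does it build, does the existing suite still pass. A rough sketch of that kind of validation layer, assuming generated changes land in a working tree with a pytest suite (commands and paths are illustrative, not any particular team's setup):

```python
import subprocess

def validate_patch(repo_dir: str) -> dict:
    """Run the deterministic oracles available for code: build it, run the tests."""
    results = {}
    # 1. Does it even compile / import? (example: byte-compile all Python sources)
    build = subprocess.run(["python", "-m", "compileall", "-q", repo_dir],
                           capture_output=True, text=True)
    results["build"] = build.returncode == 0
    # 2. Does the existing test suite still pass?
    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo_dir,
                           capture_output=True, text=True)
    results["tests"] = tests.returncode == 0
    results["test_output"] = tests.stdout[-2000:]  # keep the tail for debugging
    return results
```

Conversational output has no equivalent exit code, which is exactly why the judge ends up inheriting the generator's blind spots.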
The MoE point matters here: sparse activation means you're not reading all 2TB per forward pass, but the access pattern flips from sequential to random, which is exactly the worst case for NVMe. Been thinking about this a lot for agent inference workloads where you want consistent latency more than peak throughput.
We've seen this exact pattern. Most devtools assume a human will eventually log in and contextualize the data. When the 'user' is an agent, you need the surface to be machine-readable by default, not as an afterthought. The adapter approach mostly doesn't work - you end up with a translation layer that loses exactly the signal you needed.
4.4 tok/s with reliable structured output is a solid local benchmark, although the question is whether SSD streaming introduces per-token latency variance that messes up tool call parsing downstream. The gap between 400 GB/s unified memory bandwidth and 17.5 GB/s SSD reads means the SSD is in the hot path pretty much every time an expert isn't cached.
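Back-of-envelope on that gap, using the bandwidth figures above - the expert size and miss count per token are my assumptions, not measurements from this setup:

```python
# Bandwidth figures from the discussion; expert size and miss count are assumptions.
MEM_BW = 400e9    # unified memory, bytes/s
SSD_BW = 17.5e9   # SSD reads, bytes/s (sequential; random reads will be worse)

expert_bytes   = 2.5e9   # assumed size of one quantized expert's weights
experts_missed = 2       # assumed experts per token that aren't resident in memory

t_mem = expert_bytes * experts_missed / MEM_BW
t_ssd = expert_bytes * experts_missed / SSD_BW
print(f"served from memory: {t_mem*1000:.1f} ms/token, from SSD: {t_ssd*1000:.1f} ms/token")
# ~12.5 ms vs ~286 ms per token for the same miss pattern: under these assumptions the
# SSD reads alone cap you at roughly 3-4 tok/s, the same ballpark as the 4.4 tok/s
# figure, and per-token latency swings with whatever fraction of experts missed cache.
```

That variance is the part that worries me for tool calling: a parser that tolerates 250 ms/token on average can still time out on the occasional token that needed a cold expert.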
we've started to document any a/b decision we take in terms of tech and store it in our internal engineering docs! have only gone back to it once in a while, but it usually helps keep us grounded
IMO the under-discussed risk here is that sites will start serving different content to verified crawlers vs real users. You're already seeing it with known search bots getting sanitized views. If your agent's context comes from a crawl the site knows is going to an AI, you have no guarantee it matches what a human sees, and that data quality problem won't surface until your agent starts acting on selectively curated information.
Hard limits are a good first layer but they don't tell you why the agent is looping. Retrying because it's confused, retrying because a dependency is flaky, and genuine planning loops are three different problems with different fixes. What helped us was logging the agent's intent at each step, and if it's asking the same underlying question three times in different syntax, that's the signal to bail early rather than burning through your iteration budget.
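A stripped-down version of that "same question three times in different syntax" check, assuming each step logs a short natural-language intent - string similarity here is a crude stand-in, embedding similarity would generalize better:

```python
from difflib import SequenceMatcher

class LoopDetector:
    """Flags when an agent keeps asking the same underlying question in different syntax."""
    def __init__(self, max_repeats: int = 3, threshold: float = 0.6):
        self.max_repeats = max_repeats
        self.threshold = threshold
        self.intents = []  # intents logged so far this run

    def record(self, intent: str) -> bool:
        normalized = intent.lower().strip()
        repeats = 1 + sum(
            1 for prev in self.intents
            if SequenceMatcher(None, prev, normalized).ratio() >= self.threshold
        )
        self.intents.append(normalized)
        return repeats >= self.max_repeats

detector = LoopDetector()
for intent in ["where is the config file?",
               "find the config file",
               "locate the config file path"]:
    if detector.record(intent):
        print("same underlying question again:", intent, "- bail early")
```

The payoff isn't just bailing earlier; the logged intents tell you afterwards whether the loop was confusion, a flaky dependency, or a real planning failure.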