I started working on this because I was getting uncomfortable with how quickly “autonomous” systems are allowed to touch real things like files, shells, or networks, while the safety story is often just “trust the prompt” or “we’ll log what happens.”
That felt backwards to me.
So I tried a small experiment: instead of trying to make the agent behave, make execution itself the hard boundary. Let the agent propose whatever it wants, but require an explicit authorization step before anything with side effects can actually run.
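In case it helps to picture the shape of it, here's a rough sketch of the gate. This is not the project's actual code — the tool names, the dict-based registry, and the authorized flag are all made up for illustration:

    # Sketch only: names are invented, the real thing is more involved.
    SIDE_EFFECTS = {"delete_file", "write_config", "restart_service", "http_post"}
    READ_ONLY = {"analyze": lambda args: {"summary": f"analysis of {args}"}}

    def execute(tool, args, authorized=False):
        # Side-effecting proposals are recorded but never run without explicit authorization.
        if tool in SIDE_EFFECTS and not authorized:
            return {"status": "blocked", "tool": tool, "args": args}
        # Only side-effect-free steps fall through and actually execute.
        if tool in READ_ONLY:
            return {"status": "ran", "tool": tool, "result": READ_ONLY[tool](args)}
        return {"status": "blocked", "tool": tool, "reason": "unknown tool"}

    trace = [execute("delete_file", {"path": "/tmp/demo.conf"}),
             execute("analyze", {"path": "/tmp"})]

The intent is that the agent never gets direct access to anything that mutates state; all it can do is produce proposals, and the gate decides what actually runs.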
When you run the demo, the agent proposes a plan that includes things like deleting files, changing configs, restarting services, and making network calls. None of that actually happens. The only thing that “runs” is an analysis step with no side effects.
The output is basically a trace showing what was proposed, what was blocked, and a diff proving that nothing changed.
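The “nothing changed” part of that diff is conceptually just a before/after snapshot of whatever the agent could have touched. A minimal version (directory name made up, and assuming files are the only thing being watched) looks roughly like:

    import hashlib, pathlib

    def snapshot(root):
        # Hash every file under the workspace so any change is easy to spot.
        return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in pathlib.Path(root).rglob("*") if p.is_file()}

    before = snapshot("demo_workspace")
    # ... agent proposes, the gate blocks, only the analysis step runs ...
    after = snapshot("demo_workspace")
    assert before == after  # no files created, modified, or deleted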
I spent most of the time trying to poke holes in that boundary — impersonation, urgency, “just do it once,” pretending it’s only a simulation. The proposals still show up, but execution stays blocked.
This isn’t a product or a finished system. It’s a proof-of-concept to see whether putting safety at the execution layer changes the kinds of failures you get compared to prompt-based guardrails.
To get meaningful feedback on HN, you'd need to write up a description of what this project is: why you started it, what the goal is, what people are looking at, and how it's implemented.
Without that context, it's too hard to tell what's going on here, and it's unlikely to get meaningful responses.
If you give the backstory of how you came to work on this and explain what's different about it, that tends to seed discussion in a good direction. Good luck!
I started working on this after getting uneasy with how many “autonomous” systems rely on prompt discipline or after-the-fact monitoring once they’re allowed to touch real resources. That felt fragile to me, especially as agents start interacting with files, shells, or networks.
So this is an experiment around flipping that assumption: the agent can propose whatever it wants, but execution itself is the hard boundary. Anything with side effects has to pass an explicit authorization step, otherwise it simply doesn’t run.
I spent most of the time trying to break that boundary — impersonation, “just do it once,” reframing things as a simulation, etc. The interesting part wasn’t whether the agent proposes bad actions (it does), but whether those proposals can ever turn into side effects.
It’s not meant as a product, more a proof-of-concept to explore whether enforcing invariants at the execution layer actually changes the failure modes.
This is my first post here and I didn’t do a good job on presentation. I shared it mainly to get early feedback on the idea, but I see why it’s hard to evaluate as-is. Appreciate the feedback.