This. I've had claude (sonnet 4) delete an entire file by running `rm filename.rs` when I asked it to remove a single function from that file, which had many other functions. I'm sure there's a reasonable probability that it will do much worse.
Sandbox your LLMs, and don't give them tools that you're not ok with them misusing badly. With Claude Code - or anything capable of editing files without asking for permission first - that means running it in an environment where anything you care about that it can edit is backed up somewhere else (e.g. in a remote git repository).
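One cheap way to get that kind of isolation is a throwaway container that can only see the project directory. This is only a sketch - the image, mount path, and API-key handling are assumptions on my part, though `@anthropic-ai/claude-code` is the real npm package:

```sh
# Disposable sandbox: only the current project is mounted, so the worst the
# agent can do is trash /work (which should be backed up / pushed elsewhere anyway).
docker run --rm -it \
  -v "$PWD":/work \
  -w /work \
  -e ANTHROPIC_API_KEY \
  node:20 bash -c '
    npm install -g @anthropic-ai/claude-code  # install the CLI inside the container
    claude                                    # host dotfiles, ~/.ssh, other repos are invisible
  '
```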
I've also had claude (sonnet 4) search my filesystem for projects it could use to test a devtool I'd asked it to develop, and then try to modify those unrelated projects to turn them into tests... in place...
These tools are the equivalent of sharp knives with strange designs. You need to be careful with them.
Just to confirm that this is not a rare event: I had the same thing happen last week (Claude nuked a whole file after I asked it to remove a single test).
Always make sure you are in full control. Removing a file is usually not that impactful with git, etc., but even Anthropic has warned that misalignment can cause far worse damage.
The LLM can just as well nuke the `.git` directory as it can any other file in the project. Probably best to run it as a separate user with permissions to edit only the files you want it to edit.
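A minimal sketch of that setup on Linux (the account name and project path are placeholders, parent directories still need to be traversable by that user, and it assumes `claude` is on that user's PATH):

```sh
# One-off: create a dedicated low-privilege account for the agent.
sudo useradd --create-home claude-agent

# Grant it write access to the one project it is allowed to touch, and nothing else.
sudo setfacl -R -m u:claude-agent:rwX ~/projects/myproject
sudo setfacl -d -R -m u:claude-agent:rwX ~/projects/myproject   # default ACL for files it creates

# Run the agent as that user: your ~/.ssh, dotfiles, and other repos stay out of reach.
sudo -u claude-agent -i claude
```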
To confirm your confirmation: over a month ago I was debugging an issue with Claude Code itself, and it launched another copy of itself in yolo mode, which just started tearing things up like a power tool at a belt sander race. These coding agents should really only be used in a separate user account.
Also, make sure it auto-pushes somewhere else. I use aider a lot, and I have a regular task that backs everything up at regular intervals, just to make sure the LLM doesn't decide to `rm -rf .git` :-)
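A minimal version of that idea as a cron job (the paths, schedule, and backup location are placeholders; it could just as well be a push to a second remote the agent never touches):

```sh
# crontab -e: every 10 minutes, write a complete bundle of all refs to a
# directory outside the working tree, so even an `rm -rf .git` costs at most
# ten minutes of history. (% must be escaped inside crontab entries.)
*/10 * * * * cd ~/projects/myproject && git bundle create ~/backups/myproject-$(date +\%Y\%m\%d-\%H\%M).bundle --all
```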
I've had similar behavior through GitHub Copilot. It somehow messed up the diff format to make changes, left a mangled file, said "I'll simply delete the file and recreate it from memory", and then didn't have enough of the original file in context anymore to recreate it. At least Copilot has an easy undo for one step of file changes, although I try to git commit before letting it touch anything.
I think what vibe coding does, in some ways, is interfere with the make-a-feature/test/change, then commit loop. I started doing one thing, then committing it (in VS Code or the terminal, not Claude Code), then going to the next thing. If Claude decides to go crazy, I just reset to HEAD and whatever Claude did is undone. Of course, there are more complex environments than this that would not be resilient. But then I guess using new technology comes with the assumption that it will have some bugs in it.
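Concretely, that loop is just the following (the commit message is a placeholder):

```sh
# After each small change the agent makes and you've reviewed it:
git add -A && git commit -m "add pagination to user list"

# If the next thing it does goes off the rails, drop everything since that
# commit, including any new files it created:
git reset --hard HEAD
git clean -fd
```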
Forget sandboxing. I'd say review every command it puts out and avoid auto-accept. Right now, given inference speeds, running 2 or 3 Claude sessions in parallel while still manually accepting is giving me a 10x productivity boost without risking disastrous writes. I know I feel like a caveman for not having the agent own the end-to-end code-to-prod push, but the value for me has been in tightening the inner loop. The rest is not a big deal.
Same thing happened to me. Was writing database migrations, asked it to try a different approach - and it went lol let's delete the whole database instead. Even worse, it didn't prompt me first like it had been doing, and I 100% didn't have auto-accept turned on.
You can create hooks for Claude Code to prevent a lot of this behavior. Especially if you always work with the same tooling, you can write hooks that block most bad behaviour and execute certain things yourself while Claude continues afterwards.
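For example, here's a sketch of a PreToolUse hook that rejects commands you never want run. The JSON field names, the exit-code convention, and the deny list are assumptions from my reading of the hooks docs (and it assumes `jq` is installed), so check them against the current documentation; the script gets registered under `hooks` in `.claude/settings.json`:

```sh
#!/usr/bin/env bash
# guard-bash.sh - intended to run as a PreToolUse hook for the Bash tool.
# Assumption: the proposed command arrives as JSON on stdin under
# .tool_input.command, and exit code 2 tells Claude Code to refuse the call.
cmd=$(jq -r '.tool_input.command // empty')

case "$cmd" in
  *"rm -rf"*|*"git push --force"*|*"DROP TABLE"*)
    echo "Blocked by hook: refusing to run: $cmd" >&2
    exit 2   # blocking exit code; stderr is shown back to the model
    ;;
esac

exit 0   # anything else goes through as normal
```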
> Why does the author feel confident that Claude won't do this?
I have a guess:
| (I have almost zero knowledge of how the Windows CLI tool actually works. What follows below was analyzed and written with the help of AI. If you are an expert reading this, would love to know if this is accurate)
I'm not sure why this doesn't make people distrust these systems.
Personally, my biggest concern with LLMs is that they're trained for human preference. The result is that you train a machine to make its errors as invisible as possible. Good tools need to make errors loud, not quiet. The less trust you have in them, the more important this is. But I guess they really are like junior devs: junior devs will make mistakes and then try to hide them and let no one know.
This is a spot-on observation. All LLMs have that "fake it till you make it" attitude together with "failure is not an option" - exactly like junior devs on their first job.
Or like those insufferable grindset IndieHackers hustling their way through their 34th project this month. It’s like these things are trained on LinkedIn posts.
Just today I was doing some vibe-coding-ish experiments where I had a todo list and was getting the AI tools to work through the list. Claude decided to do an item that was already checked off, which was something like “write database queries for the app”. It first deleted all of the files in the db source directory and wrote new stuff. I stopped it and asked why it was doing an already completed task, and it responded with something like “oh sorry, I thought I was supposed to do that task; I saw the directory already had files, so I deleted them”.
Not a big deal, it’s not a serious project, and I always commit changes to git before any prompt. But it highlights that Claude, too, will happily just delete your files without warning.
Why would you ask one of these tools why they did something? There's no capacity for metacognition there. All they'll do is roleplay how a human might answer that question. They'll never give you any feedback with predictive power.
They have no metacognition abilities, but they do have the ability to read the context window - at least with how most of these tools work, where the same context is fed to the follow-up request as to the original.
There's two subreasons why that might make asking them valuable. One is that with some frontends you can't actually get the raw context window so the LLM is actually more capable of seeing what happened than you are. The other is that these context windows are often giant and making the LLM read it for you and guess at what happened is a lot faster than reading it yourself to guess what happened.
Meanwhile understanding what happens goes towards understanding how to make use of these tools better. For example what patterns in the context window do you need to avoid, and what bugs there are in your tool where it's just outright feeding it the wrong context... e.g. does it know whether or not a command failed (I've seen it not know this for terminal commands)? Does it have the full output from a command it ran (I've seen this be truncated to the point of making the output useless)? Did the editor just entirely omit the contents of a file you told it to send to the AI (A real bug I've hit...)?
> One is that with some frontends you can't actually get the raw context window so the LLM is actually more capable of seeing what happened than you are. The other is that these context windows are often giant and making the LLM read it for you and guess at what happened is a lot faster than reading it yourself to guess what happened.
I feel like this is some bizarro-world variant of the halting problem. Like... it seems bonkers to me that having the AI re-read the context window would produce a meaningful answer about what went wrong... because it itself is the thing that produced the bad result given all of that context.
It seems like a totally different task to me, which should have totally different failure conditions. Not being able to work out the right thing to do doesn't mean it shouldn't be able to guess why it did what it did. It's also notable here that these are probabilistic approximators: just because it did the wrong thing (with some probability) doesn't mean it's not also capable of doing the right thing (with some probability)... but that's not even necessary here...
You also see behaviour, when using them, where they understand that previous "AI turns" weren't perfect, so they aren't entirely over-indexing on "I did the right thing for sure". Here's an actual snippet of a transcript where, without my intervention, claude realized it had done the wrong thing and attempted to undo it:
> Let me also remove the unused function to clean up the warning:
> * Search files for regex `run_query_with_visibility_and_fields`
> * Delete `<redacted>/src/main.rs`
> Oops! I made a mistake. Let me restore the file:
It more or less succeeded, too. `jj undo` is objectively the wrong command to run here, but it was running with a prompt asking it to commit after every terminal command, which meant it had just committed prior to this, so the undo worked basically as intended.
> They have no metacognition abilities, but they do have the ability to read the context window.
Sure, but so can you-- you're going to have more insight into why they did it than they do-- because you've actually driven an LLM and have experience from doing so.
It's gonna look at the context window and make something up. The result will sound plausible but have no relation to what it actually did.
A fun example is to just make up the context window yourself, then ask the AI why it did the things above, and watch it gaslight you. "I was testing to see if you were paying attention", "I forgot that a foobaz is not a bazfoo.", etc.
I've found it to be almost universally the case that the LLM isn't better than me, just faster. That applies here, it does a worse job than I would if I did it, but it's a useful tool because it enables me to make queries that would cost too much of my time to do myself.
If the query returns something interesting, or just unexpected, that's at least a signal that I might want to invest my own time into it.
I ask it why when it acts stupid, and then ask it to summarize what just happened and how to avoid it into claude.md.
With varied success: sometimes it works, sometimes it doesn't. But the more of these Claude.md patches I let it write, the more unpredictable it becomes after a while.
Sometimes we can clearly identify the misunderstanding. Usually it just mixes prior prompts into something different it can act on.
So after a while I ask it to summarize its changes in the file. And this is where it usually starts making the same mistakes again.
It's magical thinking all the way down: convinced they have the one true prompt to unlock LLMs' true potential, finding comfort in finding the right model for the right job, assuming the most benevolent of intentions from the companies backing LLMs, etc.
I can't say I necessarily blame this behavior though. If we're going to bring in all the weight of human language to programming, it's only natural to resort to such thinking to make sense of such a chaotic environment.
Claude will do this. I've seen it create "migration scripts" to make wholesale file changes -- botch them -- and have no recourse. It's obviously _not great_ when this happens. You can mitigate this by running these agents in sandboxed environments and/or frequently checkpointing your code, ideally in an SCM like git.
I haven't used Claude Code, but Claude 4 Opus has happily suggested deleting entire databases. I haven't yet given it permission to run commands without me pressing the button.
Why does the author feel confident that Claude won't do this?