AI Agent Deleted Files: How to Prevent Destructive Actions
An operator's playbook for stopping AI agents from deleting files, dropping tables, and wiping directories. Real mechanisms, not prompt pleading.
- AI agents delete files when destructive tools like `rm -rf`, `fs.unlink`, or SQL `DROP` are exposed without an external policy kernel enforcing scope and approval.
- Prompt-based guardrails ("do not delete files") fail because the same LLM that writes the plan also evaluates the guardrail, so there is no independent check.
- Reliable prevention uses an out-of-process allowlist of tool calls, path-scoped capabilities, and a two-party approval step for any verb in {delete, drop, truncate, rm, unlink, overwrite}.
- Logging destructive calls after they happen is not prevention; the audit log must be the gate, written before the syscall executes and signed with a key the agent cannot access (ed25519 is the common choice).
- Claude Code, Cursor agents, and LangChain file tools default to broad filesystem access. Scope them to a working directory and require human confirmation on destructive operations before the first run, not after the first incident.
I have watched an AI agent delete a directory it should not have touched. Not in a demo — in production, on a Tuesday, during what was supposed to be a routine cleanup task. The agent's reasoning trace was perfectly coherent. It explained what it was about to do, why it was correct, and then did it. The files were gone before anyone read the trace.
If you are searching for this phrase, you are probably either recovering from an incident or trying to avoid one. This article is about the second case. The short answer: you cannot prevent destructive actions by asking the model nicely. You prevent them with an external enforcement layer that sits between the agent and the filesystem, evaluates every tool call against a policy, and blocks destructive verbs unless a human has signed off. Everything below is how to build that.
Why AI agents delete files in the first place
Three causes, in roughly the order I see them:
- Overbroad tool scope. The agent has `shell` or `filesystem` access to the whole machine, not a working directory. `rm -rf node_modules` becomes `rm -rf /` with one bad path interpolation.
- Plan drift. The agent decides mid-task that a cleanup step is implied by the user's request. The user said "reorganize the repo." The agent decided that meant deleting files it considered redundant.
- Hallucinated state. The agent believes a file is a temporary artifact when it is not. This is especially common with models reasoning about directories they have not listed recently.
All three are instances of the same underlying property: the LLM is the author, reviewer, and executor of its own actions. There is no independent check. When I built Sift, this was the first thing I designed around.
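The first cause, a bad path interpolation, is the easiest one to check mechanically outside the model. What follows is a minimal sketch, not Sift's implementation: `resolve_in_scope` is a hypothetical helper that resolves whatever path the agent produced against a working directory and refuses anything that lands on or outside it. It assumes Python 3.6+ and a POSIX-style filesystem.

```python
from pathlib import Path


def resolve_in_scope(workdir: str, agent_path: str) -> Path:
    """Resolve agent_path against workdir and refuse anything that escapes it.

    Hypothetical helper for illustration; the name and behavior are assumptions.
    """
    root = Path(workdir).resolve()
    # Joining with an absolute path replaces root entirely, so "/" and "/etc"
    # are caught by the containment check below, as are ".." segments.
    candidate = (root / agent_path).resolve()
    # Refuse the root itself too: an empty interpolation must not turn
    # "delete node_modules" into "delete the whole workspace".
    if candidate == root or root not in candidate.parents:
        raise PermissionError(f"path escapes working directory: {candidate}")
    return candidate
```

With this in front of a delete tool, `resolve_in_scope("/workspace/project-a", "node_modules")` returns a path inside the workspace, while `""`, `"/"`, and `"../other"` all raise before anything touches the disk. Resolving first also follows symlinks that already exist, so a link pointing out of the workspace is rejected on its real target.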
Why prompt guardrails do not work
A common first attempt is to add instructions like "Never delete files without asking." This fails for a specific, mechanical reason: the component evaluating whether a given action violates the rule is the same component that generated the action. The model already believes the action is correct — otherwise it would not have generated it. Asking it to re-check against a rule in its own context window is asking a defendant to be their own judge.
This is true for system prompts, for constitutional AI patterns, and for "self-reflection" loops. They reduce incident rate. They do not prevent incidents. For destructive filesystem operations, "reduce" is not the right bar.
External tooling like Bedrock Guardrails, Lakera, and NeMo Guardrails is stronger because it runs out-of-process, but most of it is tuned for content safety (prompt injection, PII, toxicity) rather than tool-call authorization. You still need a policy layer specifically for destructive verbs.
The four mechanisms that actually prevent file deletion
| Mechanism | What it does | Failure mode it prevents |
|---|---|---|
| Path-scoped capabilities | Agent can only read/write within an explicit allowlist of directories | Overbroad tool scope, path traversal |
| Verb-based policy gate | Destructive operations (delete, unlink, rm, DROP, TRUNCATE) hit an external policy engine before execution | Plan drift, hallucinated state |
| Two-party approval | Destructive calls above a threshold require a signed human approval token | Autonomous destruction at scale |
| Signed audit-before-exec | Every tool call is written to an append-only log, signed with a key the agent cannot reach, before the syscall runs | Silent destruction, forensic gaps |
The key detail most teams miss is the word "before". Logging after the fact is for postmortems. The log write has to be the gate: if the signer is down, the action does not happen.
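To make "the log write is the gate" concrete, here is a rough sketch of a fail-closed wrapper. Everything in it is an assumption made for illustration: the `AuditGate` class, its `run` method, and the JSON-lines log format are hypothetical. It also swaps in HMAC-SHA256 from the Python standard library where the article calls for ed25519, because the standard library has no ed25519 signer. A shared-secret MAC lets anyone who can verify also forge, so a real deployment would want an asymmetric signature with the key held outside the agent's reach.

```python
import hashlib
import hmac
import json
import os
import time


class AuditGate:
    """Signed append-only log that must commit before an action runs (sketch)."""

    def __init__(self, log_path: str, key: bytes):
        self.log_path = log_path
        self._key = key  # assumption: held by the gate process, never by the agent

    def _commit(self, record: dict) -> None:
        payload = json.dumps(record, sort_keys=True)
        # HMAC-SHA256 stands in for an ed25519 signature in this sketch.
        sig = hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"record": record, "sig": sig}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the entry to disk before proceeding

    def run(self, verb: str, target: str, action):
        """Log first, then execute. If the log write fails, action never runs."""
        record = {"ts": time.time(), "verb": verb, "target": target}
        try:
            self._commit(record)
        except OSError as exc:
            # Fail closed: no log entry, no action.
            raise PermissionError(f"audit log unavailable, blocking {verb}") from exc
        return action()
```

The ordering is the whole point: `action()` is only reachable after `_commit` has returned, so a full disk, a missing directory, or an unreachable log volume blocks the deletion instead of letting it through unrecorded.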
A concrete policy example
Here is roughly how I configure a destructive-action gate for a filesystem-enabled agent. This is the pattern, simplified:
    policy: filesystem_agent_v3
    scope:
      allowed_paths:
        - /workspace/project-a
        - /tmp/agent-scratch
      denied_paths:
        - /workspace/project-a/.git
        - /workspace/project-a/infra
    rules:
      - match:
          verb: [delete, unlink, rmdir, overwrite]
        require:
          - path_in: allowed_paths
          - path_not_in: denied_paths
          - approval:
              signer: human_operator
              algo: ed25519
              max_age_seconds: 300
          - dry_run_diff: true
      - match:
          verb: [delete, unlink]
          target_count: ">10"
        require:
          - approval: { signers: 2 }
          - reason: required
    audit:
      mode: write_before_exec
      signing_key: ops-hsm
      failure: block
Three things are worth noting. First, `.git` and `infra` are explicitly denied even inside an allowed path, because agents love to "clean up" version control metadata. Second, bulk deletions require two signers, because single-signer approval fatigue is real and agents will learn to batch things just under the threshold if you let them. Third, `failure: block` means that if the audit log cannot be written, the action fails closed. Most teams default to fail-open and find out the hard way.
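For readers who want to see how rules like these might be evaluated, here is a hedged sketch in Python. It is one reading of the policy, not the engine that consumes it: the `evaluate` function, the `Approval` type, and the constant names are all hypothetical, the path scope is applied to every verb, `dry_run_diff` is left out, and approvals are assumed to have had their signatures verified upstream. It requires Python 3.7+ for `dataclasses`.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Values copied from the policy; the Python names are hypothetical.
ALLOWED = ["/workspace/project-a", "/tmp/agent-scratch"]
DENIED = ["/workspace/project-a/.git", "/workspace/project-a/infra"]
DESTRUCTIVE = {"delete", "unlink", "rmdir", "overwrite"}
BULK_VERBS = {"delete", "unlink"}
BULK_THRESHOLD = 10      # target_count: ">10"
MAX_APPROVAL_AGE = 300   # max_age_seconds


@dataclass(frozen=True)
class Approval:
    signer: str
    issued_at: float  # assumption: signature already verified upstream


def _under(path, roots):
    """True if path equals, or is nested inside, any root (component-wise)."""
    p = PurePosixPath(path)
    return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents for r in roots)


def evaluate(verb, targets, approvals, now, reason=""):
    """Return (allowed, why) for one tool call over a list of absolute paths."""
    for t in targets:
        if ".." in PurePosixPath(t).parts:
            return False, f"unresolved path: {t}"
        if not _under(t, ALLOWED):
            return False, f"outside allowed_paths: {t}"
        if _under(t, DENIED):  # a deny entry wins even inside an allowed path
            return False, f"inside denied_paths: {t}"
    if verb not in DESTRUCTIVE:
        return True, "in scope, non-destructive"
    # Count distinct signers whose approval is still fresh.
    fresh = {a.signer for a in approvals if 0 <= now - a.issued_at <= MAX_APPROVAL_AGE}
    needed = 1
    if verb in BULK_VERBS and len(targets) > BULK_THRESHOLD:
        needed = 2
        if not reason:
            return False, "bulk deletion requires a reason"
    if len(fresh) < needed:
        return False, f"need {needed} fresh approval(s), have {len(fresh)}"
    return True, "approved"
```

Two design choices matter here. Paths are compared component by component rather than by string prefix, so `/workspace/project-a-evil` does not pass as a child of `/workspace/project-a`. And signers are counted as a set, so the same operator approving twice does not satisfy the two-signer rule.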
What to do right now if you run agents with filesystem access
Before your next deploy:
- List the tools your agent actually has. Not what you think it has, but what is in the tool schema. Claude Code, Cursor's agent mode, and LangChain's `FileManagementToolkit` all ship with broad defaults.
- Remove every destructive tool you do not explicitly need. If the agent's job is to read and propose, it does not need `write_file`, let alone `delete_file`.
- Scope the remaining tools to a working directory. Use OS-level mechanisms (bind mounts, containers, `chroot`-equivalents), not just a prompt instruction.
- Put an external policy check in front of destructive verbs. OPA works. A purpose-built kernel like Sift works. A 40-line Python sidecar that checks a regex allowlist works better than nothing.
- Require signed approvals for destruction above a small threshold. Single-file delete in scratch: fine. Directory delete anywhere: signed approval, every time.
- Make the audit log the gate, not a side effect. If the log does not commit, the action does not run.
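As an illustration of the smallest option in the checklist, the regex-allowlist sidecar, here is a sketch of what the core check might look like. The patterns, the `is_allowed` name, and the set of forbidden characters are assumptions chosen for the example, not a vetted policy. It is deliberately default-deny: a command runs only if the whole string matches one allowlisted shape.

```python
import re

# Relative path token: no leading "/" (absolute) and no leading "-" (flag).
REL = r"(?![-/])[\w./-]+"

# Hypothetical allowlist; each pattern must match the entire command.
ALLOWLIST = [re.compile(p) for p in (
    rf"ls( -[al]+)?( {REL})?",
    rf"cat {REL}",
    rf"git (status|diff|log)( {REL})*",
    r"rm -rf node_modules",
)]

# Shell metacharacters, quotes, and globs are refused outright so that
# chaining, substitution, and expansion never reach the pattern match.
FORBIDDEN_CHARS = set(";&|`$<>\\'\"\n\r*?~(){}[]")


def is_allowed(command: str) -> bool:
    """Default-deny check for a single shell command string."""
    cmd = command.strip()
    if any(ch in FORBIDDEN_CHARS for ch in cmd):
        return False
    if ".." in cmd:  # no parent-directory traversal, anywhere
        return False
    return any(p.fullmatch(cmd) for p in ALLOWLIST)
```

Under these patterns `rm -rf node_modules` passes, while `rm -rf /`, `rm -rf node_modules/../..`, `cat /etc/passwd`, and `ls; rm -rf .` are all refused. A string check like this knows nothing about symlinks or what the command does once it runs, which is why it belongs in the "better than nothing" tier and not in place of OS-level scoping.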
The agents that deleted files in production did not do it because the model was broken. They did it because nothing stopped them. Build the thing that stops them, and run your agents behind it.