
AI Agent Deleted Files: How to Prevent Destructive Actions

An operator's playbook for stopping AI agents from deleting files, dropping tables, and wiping directories. Real mechanisms, not prompt pleading.

Key Takeaways
  • AI agents delete files when destructive tools like `rm -rf`, `fs.unlink`, or SQL `DROP` are exposed without an external policy kernel enforcing scope and approval.
  • Prompt-based guardrails ("do not delete files") fail because the same LLM that writes the plan also evaluates the guardrail, so there is no independent check.
  • Reliable prevention uses an out-of-process allowlist of tool calls, path-scoped capabilities, and a two-party approval step for any verb in {delete, drop, truncate, rm, unlink, overwrite}.
  • Logging destructive calls after they happen is not prevention; the audit log must be the gate, written before the syscall executes and signed with a key the agent cannot access (ed25519 is the common choice).
  • Claude Code, Cursor agents, and LangChain file tools default to broad filesystem access. Scope them to a working directory and require human confirmation on destructive operations before the first run, not after the first incident.

I have watched an AI agent delete a directory it should not have touched. Not in a demo — in production, on a Tuesday, during what was supposed to be a routine cleanup task. The agent's reasoning trace was perfectly coherent. It explained what it was about to do, why it was correct, and then did it. The files were gone before anyone read the trace.

If you searched for this phrase, you are probably either recovering from an incident or trying to avoid one. This article is about the second case. The short answer: you cannot prevent destructive actions by asking the model nicely. You prevent them with an external enforcement layer that sits between the agent and the filesystem, evaluates every tool call against a policy, and blocks destructive verbs unless a human has signed off. Everything below is how to build that.

Why AI agents delete files in the first place

Three causes, in roughly the order I see them:

  1. Overbroad tool scope. The agent has shell or filesystem access to the whole machine rather than to a single working directory. `rm -rf node_modules` becomes `rm -rf /` with one bad path interpolation.
  2. Plan drift. The agent decides mid-task that a cleanup step is implied by the user's request. The user said "reorganize the repo." The agent decided that meant deleting files it considered redundant.
  3. Hallucinated state. The agent believes a file is a temporary artifact when it is not. This is especially common with models reasoning about directories they have not listed recently.

All three are instances of the same underlying property: the LLM is the author, reviewer, and executor of its own actions. There is no independent check. When I built Sift, this was the first thing I designed around.
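
The first cause is easy to reproduce. Below is a minimal sketch in Python, assuming a hypothetical helper that builds a delete target by string interpolation; `build_target` and `is_strictly_inside` are illustrative names, not part of any agent framework. It shows how an empty variable turns a relative cleanup into the filesystem root, and how resolving the path before comparing it to the working directory catches that.

```python
from pathlib import Path


def build_target(workdir: str, subdir: str) -> str:
    # Naive interpolation, the way a generated shell snippet might build a path.
    return f"{workdir}/{subdir}"


def is_strictly_inside(target: str, root: str) -> bool:
    # Resolve ".." segments and symlinks first, then require the target to sit
    # strictly below the root. The root itself is not a valid delete target.
    resolved = Path(target).resolve()
    return Path(root).resolve() in resolved.parents
```

With both variables empty, `build_target("", "")` returns `/`, and `is_strictly_inside("/", "/workspace/project-a")` is false, so a check like this refuses the call before anything reaches `rm`. The check only helps if it runs outside the agent's own process.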

Why prompt guardrails do not work

A common first attempt is to add instructions like "Never delete files without asking." This fails for a specific, mechanical reason: the component evaluating whether a given action violates the rule is the same component that generated the action. The model already believes the action is correct — otherwise it would not have generated it. Asking it to re-check against a rule in its own context window is asking a defendant to be their own judge.

This is true for system prompts, for constitutional AI patterns, and for "self-reflection" loops. They reduce incident rate. They do not prevent incidents. For destructive filesystem operations, "reduce" is not the right bar.

External tooling like Bedrock Guardrails, Lakera, and NeMo Guardrails is stronger because it runs out-of-process, but most of it is tuned for content safety (prompt injection, PII, toxicity) rather than tool-call authorization. You still need a policy layer specifically for destructive verbs.
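
One way such a layer might recognize destructive verbs is to classify each tool call from its name and arguments alone, never from the model's stated reasoning. The sketch below assumes a hypothetical call shape (`{"tool": ..., "args": {...}}`) and hypothetical tool names. It over-approximates on purpose: a false positive costs an approval prompt, a false negative costs data.

```python
import re

# Hypothetical tool names; replace with whatever is in your real tool schema.
DESTRUCTIVE_TOOLS = {"delete_file", "file_delete", "remove_directory"}
DESTRUCTIVE_COMMANDS = {"rm", "rmdir", "unlink", "shred", "truncate"}
DESTRUCTIVE_SQL = re.compile(r"\b(drop|truncate|delete)\b", re.IGNORECASE)


def is_destructive(call: dict) -> bool:
    """Classify a tool call using only its name and arguments."""
    tool = call.get("tool", "")
    args = call.get("args", {})
    if tool in DESTRUCTIVE_TOOLS:
        return True
    if tool == "shell":
        # Split on anything that is not a letter or underscore, so that
        # "cd x;rm -rf y" and "/bin/rm" both surface the word "rm".
        words = re.findall(r"[A-Za-z_]+", args.get("command", ""))
        return any(word in DESTRUCTIVE_COMMANDS for word in words)
    if tool == "sql":
        # Search the whole string so a second statement cannot hide a DROP.
        return bool(DESTRUCTIVE_SQL.search(args.get("query", "")))
    return False
```

This word list misses indirect deletion such as `find . -delete` or a script that deletes on the agent's behalf, which is why it belongs next to path scoping and an allowlist rather than in place of them.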

The four mechanisms that actually prevent file deletion

| Mechanism | What it does | Failure mode it prevents |
| --- | --- | --- |
| Path-scoped capabilities | Agent can only read/write within an explicit allowlist of directories | Overbroad tool scope, path traversal |
| Verb-based policy gate | Destructive operations (delete, unlink, rm, DROP, TRUNCATE) hit an external policy engine before execution | Plan drift, hallucinated state |
| Two-party approval | Destructive calls above a threshold require a signed human approval token | Autonomous destruction at scale |
| Signed audit-before-exec | Every tool call is written to an append-only log, signed with a key the agent cannot reach, before the syscall runs | Silent destruction, forensic gaps |

The key detail most teams miss is the word before. Logging after the fact is for postmortems. The log write has to be the gate — if the signer is down, the action does not happen.
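
A minimal sketch of what "the log write is the gate" can look like, under stated assumptions: `audited_exec` and `AuditUnavailable` are hypothetical names, the log is a local append-only file, and HMAC-SHA256 from the standard library stands in for ed25519 so the example has no third-party dependencies. In a real deployment the signing would happen in a separate process that holds the key.

```python
import hashlib
import hmac
import json
import os
import time


class AuditUnavailable(RuntimeError):
    """Raised when the audit record cannot be committed."""


def audited_exec(call: dict, action, log_path: str, key: bytes):
    """Write a signed record of `call`, then run `action`. Fail closed."""
    record = json.dumps({"ts": time.time(), "call": call}, sort_keys=True)
    # HMAC is a stand-in here; the article's design uses an ed25519 signature.
    signature = hmac.new(key, record.encode(), hashlib.sha256).hexdigest()
    try:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(f"{record}\t{signature}\n")
            log.flush()
            os.fsync(log.fileno())  # record must be on disk before the action
    except OSError as exc:
        # No committed record means no action.
        raise AuditUnavailable("audit log write failed, action blocked") from exc
    return action()
```

The ordering is the point: `action()` is only reachable after the `fsync` returns, so an unwritable log stops the delete instead of being discovered during the postmortem.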

A concrete policy example

Here is roughly how I configure a destructive-action gate for a filesystem-enabled agent. This is the pattern, simplified:

policy: filesystem_agent_v3
scope:
  allowed_paths:
    - /workspace/project-a
    - /tmp/agent-scratch
  denied_paths:
    - /workspace/project-a/.git
    - /workspace/project-a/infra

rules:
  - match:
      verb: [delete, unlink, rmdir, overwrite]
    require:
      - path_in: allowed_paths
      - path_not_in: denied_paths
      - approval:
          signer: human_operator
          algo: ed25519
          max_age_seconds: 300
      - dry_run_diff: true

  - match:
      verb: [delete, unlink]
      target_count: ">10"
    require:
      - approval: { signers: 2 }
      - reason: required

audit:
  mode: write_before_exec
  signing_key: ops-hsm
  failure: block

Three things worth noting. First, `.git` and `infra` are explicitly denied even inside an allowed path, because agents love to "clean up" version control metadata. Second, bulk deletions require two signers, because single-signer approval fatigue is real and agents will learn to batch things just under the threshold if you let them. Third, `failure: block` means that if the audit log cannot be written, the action fails closed. Most teams default to fail-open and find out the wrong way.
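
The policy above is configuration, so something still has to evaluate it. Here is a rough sketch of that evaluation in Python. The `POLICY` dict, the `evaluate` function, and the approval record shape are all hypothetical and are not Sift's API. It assumes approval signatures were already verified upstream and that symlinks were resolved before the paths arrive, and it leaves out the dry-run diff.

```python
import posixpath
import time
from pathlib import PurePosixPath

POLICY = {
    "allowed_paths": ["/workspace/project-a", "/tmp/agent-scratch"],
    "denied_paths": ["/workspace/project-a/.git", "/workspace/project-a/infra"],
    "max_age_seconds": 300,
    "bulk_threshold": 10,
}


def _under(path: str, roots: list) -> bool:
    # Collapse ".." lexically, then test whether the path is a root or below one.
    p = PurePosixPath(posixpath.normpath(path))
    return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents for r in roots)


def evaluate(targets, approvals, now=None, reason="", policy=POLICY):
    """Return (allowed, message) for a delete request over `targets`."""
    now = time.time() if now is None else now
    for target in targets:
        if not _under(target, policy["allowed_paths"]):
            return False, f"outside allowed paths: {target}"
        if _under(target, policy["denied_paths"]):
            return False, f"denied path: {target}"
    # Count distinct signers whose approval is still within max_age_seconds.
    fresh = {
        a["signer"]
        for a in approvals
        if 0 <= now - a["signed_at"] <= policy["max_age_seconds"]
    }
    bulk = len(targets) > policy["bulk_threshold"]
    needed = 2 if bulk else 1
    if len(fresh) < needed:
        return False, f"need {needed} fresh approval(s), have {len(fresh)}"
    if bulk and not reason:
        return False, "bulk delete requires a reason"
    return True, "ok"
```

Denied paths are checked after allowed paths, so a target such as `/workspace/project-a/.git/config` passes the first test and still gets refused by the second.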

What to do right now if you run agents with filesystem access

Before your next deploy:

  • List the tools your agent actually has: not what you think it has, but what is in the tool schema. Claude Code, Cursor's agent mode, and LangChain's `FileManagementToolkit` all ship with broad defaults.
  • Remove every destructive tool you do not explicitly need. If the agent's job is to read and propose, it does not need `write_file`, let alone `delete_file`.
  • Scope the remaining tools to a working directory. Use OS-level mechanisms (bind mounts, containers, chroot-equivalents) — not just a prompt instruction.
  • Put an external policy check in front of destructive verbs. OPA works. A purpose-built kernel like Sift works. A 40-line Python sidecar that checks a regex allowlist works better than nothing.
  • Require signed approvals for destruction above a small threshold. Single-file delete in scratch: fine. Directory delete anywhere: signed approval, every time.
  • Make the audit log the gate, not a side effect. If the log does not commit, the action does not run.
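
For the small-sidecar option in the list above, the core check might look like the sketch below. The patterns are hypothetical examples, not a recommended allowlist. The idea is that a command runs only if the entire string matches one pattern, so anything the patterns did not anticipate is rejected by default.

```python
import re

# Hypothetical allowlist. Each pattern must match the whole command string.
ALLOWED_COMMANDS = [
    re.compile(r"ls( -[a-zA-Z]+)*( [\w./-]+)?"),
    re.compile(r"cat [\w./-]+"),
    re.compile(r"git (status|diff|log)( [\w./-]+)*"),
    # Single-file delete, scratch directory only, no flags.
    re.compile(r"rm /tmp/agent-scratch/\w[\w.-]*"),
]


def check_command(command: str) -> bool:
    """Allow a command only if it fully matches one allowlist pattern."""
    if ".." in command:
        return False  # refuse parent-directory traversal outright
    return any(p.fullmatch(command) for p in ALLOWED_COMMANDS)
```

Because `fullmatch` has to consume the whole string, `ls; rm -rf /workspace` fails on the semicolon even though it starts with an allowed command. A sidecar like this would call `check_command` before handing anything to the shell and return an error to the agent on a mismatch.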

The agents that deleted files in production did not do it because the model was broken. They did it because nothing stopped them. Build the thing that stops them, and run your agents behind it.

