My AI Agent Ran a Dangerous Command: The Post-Mortem Playbook
What to do when an AI agent executes a destructive command in production, why LLM-based guardrails fail, and the controls that actually stop it next time.
- 01AI agents execute dangerous commands because tool-calling APIs from OpenAI and Anthropic treat the model's intent as authorization — there is no built-in policy check between decision and execution.
- 02Prompt-based guardrails (system prompts, 'never run rm -rf' instructions) fail because instruction-following is probabilistic and can be overridden by adversarial context or tool output injection.
- 03A deterministic policy kernel sitting outside the LLM — evaluating every tool call against a signed allow-list before execution — is the only control that reliably prevents destructive actions.
- 04After an incident, preserve the full trace: the prompt, retrieved context, the model's tool-call JSON, and the executor's response. Without this, root cause is unrecoverable.
- 05OPA, Bedrock Guardrails, NeMo Guardrails, and Lakera address different layers; none of them replace a pre-execution authorization check on tool calls.
An AI agent ran a command it shouldn't have. Maybe it dropped a table, deleted a branch, emailed a customer, or spent $47 in API fees before anyone noticed. The first instinct is to blame the model. That instinct is wrong. The model did what tool-calling agents are designed to do: it picked a function from the list you gave it and called it. The failure is architectural — there was no deterministic check between the agent's decision and the tool's execution. This article walks through what happened, why the usual fixes don't work, and what to put in place before the next incident.
What actually happened when your agent ran the command
The execution path for every major agent framework (LangChain, LlamaIndex, OpenAI Assistants, Anthropic's tool use, CrewAI) looks roughly like this:
- The LLM receives a prompt plus a list of available tools (functions with JSON schemas).
- The LLM returns a tool-call object:
{"name": "run_sql", "arguments": {"query": "DROP TABLE users"}}. - Your executor deserializes that JSON and calls the function.
- The result is fed back into the model.
Step 3 is where the incident happened. Most agent implementations treat the model's tool-call as authorization. There is no policy layer. If the model emits DROP TABLE users, the executor runs DROP TABLE users. The model was not "jailbroken" in a cinematic sense — it was doing its job, possibly nudged by a prompt injection buried in retrieved documents, a confused chain-of-thought, or an ambiguous user instruction.
Why the obvious fixes don't hold up
After an incident, the usual responses are:
| Proposed fix | Why it fails |
|---|---|
| Add "never run destructive SQL" to the system prompt | Instruction-following is probabilistic. Prompt injection in tool outputs bypasses it. |
| Use a smarter model (GPT-5, Claude Opus 4) | Capability is orthogonal to policy. A smarter model follows destructive instructions more competently. |
| Add an LLM-as-judge reviewer | Two stochastic systems do not equal one deterministic check. The judge can be injected too. |
| Regex-filter the tool arguments | Works until the model writes DR+OP TABLE or encodes the payload. Regex loses to adversarial context. |
| Human-in-the-loop on every call | Collapses within a week. Operators approve blindly. Defeats the point of autonomy. |
I ran all of these across 23 production agents. The only control that held under adversarial traffic was a deterministic authorization layer that evaluated tool calls outside the LLM, against a signed policy, before the executor touched the function.
The incident-response sequence
When an agent has already run a dangerous command, do this in order:
- Freeze the agent's credentials, not the agent. Revoke the API key or IAM role the executor was using. Leaving the process running preserves in-memory state for forensics.
- Capture the full trace. You need the system prompt, the user turn, every retrieved document, the tool schema list, the exact tool-call JSON, and the executor's response. In LangSmith, Langfuse, or Arize, export the run. If you have no tracing, this is the first thing to fix — root cause is unrecoverable without it.
- Determine whether it was intent or injection. Search the retrieved context for the destructive instruction. If it appears in a document, email, or web page the agent ingested, you have a prompt injection, not a model hallucination. These require different fixes.
- Reverse what's reversible. Point-in-time restore for databases, revert for version control, refund for charges. Document what is unrecoverable.
- Write the allow-list the agent should have had. Not the deny-list of what it did. Deny-lists are infinite; allow-lists are finite.
What deterministic governance looks like
The control that prevents recurrence is a policy kernel between the model's tool-call and the executor. It has three properties: deterministic (same input, same decision, always), external to the LLM (cannot be prompted away), and auditable (every decision is logged with a signature you can verify later).
A minimal example of what the check looks like, expressed as a policy rule:
# Policy evaluated on every tool call, outside the LLM
tool: run_sql
allow:
- match:
statement_type: SELECT
tables: ["public.orders", "public.products"]
row_limit_max: 10000
deny:
- match:
statement_type: [DROP, TRUNCATE, DELETE, ALTER, GRANT]
action: block_and_alert
require_approval:
- match:
statement_type: UPDATE
affected_rows_estimate: ">100"
audit:
sign: ed25519
sink: s3://agent-audit/prod/
This is not a prompt. It is a rule the executor consults before calling run_sql. The LLM can emit whatever it wants. The policy decides whether it runs.
Open-Policy-Agent (OPA), AWS Bedrock Guardrails, NVIDIA NeMo Guardrails, and Lakera each address a slice of this. OPA handles policy evaluation but isn't agent-aware. Bedrock Guardrails filters content but doesn't gate tool calls on structured arguments. NeMo focuses on conversational flow. Lakera scans for injection. None of them, on their own, is the pre-execution authorization check described above — which is the gap Sift was built to fill.
What to put in place this week
If you're reading this after an incident and you have one week of engineering time, in priority order:
- Tracing on every agent run. Langfuse or LangSmith. Non-negotiable. You cannot govern what you cannot see.
- An allow-list per tool. Not a deny-list. Enumerate the legitimate argument shapes for each function the agent can call. Everything else is blocked by default.
- Signed audit logs. ed25519-signed, append-only, shipped to object storage the agent cannot write to. When the next incident happens, you want a tamper-evident record.
- Budget and rate ceilings enforced outside the agent. Dollar caps, call-rate caps, per-tool quotas. Enforced by the executor wrapper, not by asking the model to be careful.
- A kill switch that revokes credentials, not one that asks the agent to stop. Agents do not reliably stop themselves.
The dangerous command your agent ran is a preview. The agents are getting more capable and more autonomous every quarter. The governance layer — deterministic, external, auditable — is what makes running them in production sustainable instead of terrifying.
Run your agents under Sift.
Deterministic governance. Cryptographic receipts. Fail-closed by default.
More in Incident Reports
AI Agent Deleted Files: How to Prevent Destructive Actions
An operator's playbook for stopping AI agents from deleting files, dropping tables, and wiping directories. Real mechanisms, not prompt pleading.
AI Agent Jailbreaks in Production: What Actually Happens and How to Contain Them
A field guide to AI agent jailbreaks in live systems — attack patterns, why prompt-level defenses fail, and the execution-layer controls that actually hold.
When an AI Agent Leaks Secrets to the Log: Causes and Fixes
AI agents routinely leak API keys, tokens, and PII into logs. Here's why it happens, what fails to stop it, and what actually contains the blast radius.