Autonomous AI Agent Safety Risks: What Actually Breaks in Production

The real safety risks of autonomous AI agents in production: prompt injection, tool misuse, cost runaways, and why wrappers like guardrails aren't enough.

Key Takeaways
  1. The top autonomous agent safety risks are prompt injection via retrieved content, unbounded tool execution, unbounded spend, and unverifiable agent state.
  2. Content-layer guardrails (Bedrock Guardrails, Lakera, NeMo) filter text but do not stop an agent from calling a destructive tool with valid-looking arguments.
  3. A single unbounded agent loop can burn thousands of dollars in API fees in minutes; OpenAI and Anthropic both bill per token on retries and tool-call failures.
  4. Deterministic pre-execution checks on tool calls, not post-hoc logging, are what actually prevent damaging actions in autonomous systems.
  5. Signed, append-only audit logs (e.g. ed25519-signed) are required to reconstruct what an agent did and why, because LLM reasoning traces are not reproducible.

I run 23 autonomous agents in production. Over the last eighteen months I've watched every class of failure described in the academic literature happen to real agents handling real traffic — and a few the papers haven't gotten to yet. This is the honest list of what goes wrong, why the common mitigations don't fully work, and what does.

The four failure modes that actually cause incidents

Most "AI safety" discourse is about model outputs. In agent systems, the output is rarely the problem. The problem is what the agent does with that output — which tool it calls, with which arguments, how many times, against which system.

The four failure modes I see repeatedly:

| Failure mode | What it looks like in production | Typical blast radius |
|---|---|---|
| Prompt injection via retrieved content | A customer email, PDF, or web page contains instructions the agent follows | Data exfiltration, unauthorized tool calls |
| Unbounded tool execution | Agent loops on a failing tool call, or chains tools beyond intended scope | Corrupted state, downstream system load |
| Cost runaways | Retry logic + tool-call failures + long context = geometric token spend | $100 to $10,000+ in minutes |
| Unverifiable state | Agent "believes" it completed an action it didn't, or vice versa | Silent data loss, duplicate side effects |

I burned $47 in OpenAI credits once before I noticed an agent stuck in a self-correction loop on a malformed JSON schema. That was cheap. A colleague at a YC company burned $3,200 in four hours on a similar bug against Claude Opus.
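A back-of-the-envelope sketch of why these loops get expensive so quickly. Every number here is a hypothetical placeholder, not a figure from any provider's price sheet: a flat price per 1,000 input tokens, a starting context size, and a fixed number of tokens appended per failed attempt. Because each retry resends the whole context, cumulative spend grows faster than linearly even with no fan-out at all.

```python
def loop_cost_usd(retries: int, base_tokens: int, added_per_retry: int,
                  usd_per_1k_input: float) -> float:
    """Cumulative input cost of a loop that resends its whole context on every attempt."""
    total_tokens = 0
    context = base_tokens
    for _ in range(retries):
        total_tokens += context      # the full context is billed again on each attempt
        context += added_per_retry   # the error and the "corrected" attempt get appended
    return total_tokens / 1000 * usd_per_1k_input


# Hypothetical inputs, for illustration only.
for n in (10, 100, 1000):
    cost = loop_cost_usd(n, base_tokens=8_000, added_per_retry=1_500, usd_per_1k_input=0.01)
    print(f"{n} retries: ${cost:,.2f}")
```

Under these assumptions, doubling the number of retries more than doubles the bill, which is why a spend cap has to be enforced outside the loop rather than noticed afterwards.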

Why prompt injection is worse for agents than for chatbots

With a chatbot, prompt injection produces an embarrassing output. With an agent, it produces an action. The August 2024 Slack AI exfiltration disclosure, the Microsoft 365 Copilot EchoLeak disclosure in June 2025, and the ongoing GitHub Copilot indirect-injection research all share the same shape: untrusted content enters the context window, and the agent treats its instructions as operator intent.

You cannot solve this at the model layer. Anthropic, OpenAI, and Google have all said so explicitly in their own docs. The model has no reliable way to distinguish "instructions from the developer" from "instructions embedded in a document the developer asked me to read." This is architectural, not a training problem.

The mitigations that get marketed — Lakera, Bedrock Guardrails, NVIDIA NeMo Guardrails — are content classifiers. They're useful for blocking obvious jailbreak strings. They do nothing when an injected instruction tells the agent to call send_email with a perfectly valid-looking recipient and a perfectly valid-looking body.

Unbounded tool execution is the expensive one

Every framework I've used — LangChain, LangGraph, CrewAI, AutoGen, the OpenAI Agents SDK — ships with default loop limits that are too high and default timeouts that are too long. More importantly, none of them enforce semantic limits. "Don't call delete_record more than 5 times in a session" is not a config option. You have to write it yourself, per tool, and then hope your code path always goes through your wrapper.
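A minimal sketch of what such a hand-written semantic limit might look like. The names `LimitedToolbox` and `ToolLimitExceeded` are hypothetical and do not come from any of the frameworks above; the sketch assumes tools are plain Python callables and that the agent can only reach them through this one object.

```python
from collections import Counter
from typing import Any, Callable, Dict


class ToolLimitExceeded(Exception):
    """Raised instead of running the tool once its per-session cap is reached."""


class LimitedToolbox:
    """Single choke point that enforces a per-session call cap for each tool."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], limits: Dict[str, int]):
        self._tools = tools
        self._limits = limits
        self._calls: Counter = Counter()

    def call(self, name: str, **kwargs: Any) -> Any:
        # Fail closed: a tool with no configured limit gets a cap of 0 calls.
        limit = self._limits.get(name, 0)
        if self._calls[name] >= limit:
            raise ToolLimitExceeded(f"{name}: limit of {limit} calls per session reached")
        self._calls[name] += 1  # count the attempt even if the tool itself then raises
        return self._tools[name](**kwargs)


deleted = []
box = LimitedToolbox(
    tools={"delete_record": lambda record_id: deleted.append(record_id)},
    limits={"delete_record": 5},
)
for record_id in range(5):
    box.call("delete_record", record_id=record_id)
# A sixth box.call("delete_record", ...) in this session raises ToolLimitExceeded.
```

The cap only holds if nothing else holds a reference to the raw callables, which is exactly the "hope your code path always goes through your wrapper" problem.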

The three sub-failures I see:

  • Retry storms. Tool returns a 500, agent retries, tool returns a 500, agent retries with slightly different args, tool returns a 500. Repeat until token budget exhausted.
  • Scope creep within a single run. Agent was supposed to read one record. It read one, then decided to "verify" by reading 200.
  • Cross-tool chaining. Agent reads from production DB, writes summary to a shared doc, which is then indexed by the next agent, which treats the summary as ground truth.
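The retry-storm case in particular can be cut off with a consecutive-failure breaker keyed on the tool name rather than on the arguments, since the agent tends to vary the arguments between attempts. This is a sketch under assumptions: `FailureBreaker`, `CircuitOpen`, and the `update_crm` tool name are hypothetical, and a production version would probably also want a cool-down before re-enabling the tool.

```python
from collections import defaultdict
from typing import Any, Callable, Dict


class CircuitOpen(Exception):
    """Raised when a tool has failed too many times in a row."""


class FailureBreaker:
    def __init__(self, max_consecutive_failures: int = 3):
        self.max_failures = max_consecutive_failures
        self._failures: Dict[str, int] = defaultdict(int)

    def call(self, name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
        # Refuse before executing: the tool is not touched once the breaker is open.
        if self._failures[name] >= self.max_failures:
            raise CircuitOpen(f"{name} disabled after {self.max_failures} consecutive failures")
        try:
            result = tool(**kwargs)
        except Exception:
            self._failures[name] += 1
            raise
        self._failures[name] = 0  # any success resets the streak
        return result


attempts = []

def flaky_tool(**kwargs: Any) -> None:
    """Stand-in for a tool whose backend keeps returning a 500."""
    attempts.append(kwargs)
    raise RuntimeError("500 from upstream")

breaker = FailureBreaker(max_consecutive_failures=3)
for attempt in range(5):
    try:
        breaker.call("update_crm", flaky_tool, attempt=attempt)
    except RuntimeError:
        pass   # a real agent would feed this error back to the model
    except CircuitOpen:
        break  # stop the loop instead of spending more tokens
```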

Why wrapping the LLM isn't enough

The standard 2024-era architecture is: LLM → guardrail layer → tool executor. The guardrail layer reads the model's proposed action and decides whether to allow it. This fails for three reasons.

  1. The guardrail is itself a model call, usually an LLM or a learned classifier. It has the same injection surface as the primary agent.
  2. It can't see prior state. "Is this the 47th call to wire_transfer this hour?" is not a question a stateless classifier can answer.
  3. Its decisions aren't auditable. When it blocks, you get a probability score. When the compliance team asks why a specific action was allowed at 2:14 AM, you cannot reconstruct the decision deterministically.

What's needed is a deterministic policy layer — real code, not a model — that sits between the agent and every tool. Think OPA/Rego for agent actions, not for Kubernetes RBAC.
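As a small illustration of the kind of question only stateful code can answer, here is a sliding-window counter that can tell whether this is the Nth call to a tool in the past hour. The class name and the two-calls-per-hour limit are hypothetical, and the sketch keeps its state in memory; a real deployment would presumably need that state to survive restarts.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allows at most `max_calls` per tool within the trailing `window_seconds`."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._timestamps: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, tool_name: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._timestamps[tool_name]
        # Drop calls that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)  # only allowed calls count against the limit
        return True


# Explicit timestamps (in seconds) make the behaviour easy to follow.
limiter = SlidingWindowLimiter(max_calls=2, window_seconds=3600)
first = limiter.allow("wire_transfer", now=0)      # allowed
second = limiter.allow("wire_transfer", now=10)    # allowed
third = limiter.allow("wire_transfer", now=20)     # denied: two calls already in the window
later = limiter.allow("wire_transfer", now=3601)   # allowed: the call at t=0 has aged out
```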

What actually works: pre-execution checks and signed audit

The pattern I've landed on, after rebuilding it three times across different stacks, has two properties:

  1. Every tool call is checked by deterministic code before execution, against a policy that can see full session history.
  2. Every decision — allow, deny, modify — is written to an append-only log signed with ed25519 so the record is tamper-evident.

A minimal version of the check looks like this:

def authorize_tool_call(call: ToolCall, session: SessionState) -> Decision:
    # Hard limits: plain code, no model in the loop
    if session.spend_usd > session.budget_usd:
        return Decision.deny("budget_exceeded")
    # .get() with a default of 0 means a tool missing from LIMITS is denied, not a KeyError
    if session.tool_calls.get(call.name, 0) >= LIMITS.get(call.name, 0):
        return Decision.deny("rate_limit")
    if call.name in DESTRUCTIVE and not session.human_approved:
        return Decision.deny("requires_approval")
    # Argument validation against the tool's JSON schema
    if not validate(call.args, SCHEMAS[call.name]):
        return Decision.deny("invalid_args")
    return Decision.allow()

This is unglamorous. It's also the only thing I've seen stop the failure modes above in practice. The LLM proposes; deterministic code disposes.
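For readers who want to run the check, here is one way the surrounding pieces could be filled in. Everything outside `authorize_tool_call` is an assumption made for illustration: the dataclass layouts, the example tool names and limits, and especially `validate`, which is a toy key-and-type check standing in for a real JSON Schema validator. Treating tools that are missing from `LIMITS` as denied is also an added assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str = ""

    @classmethod
    def allow(cls) -> "Decision":
        return cls(True)

    @classmethod
    def deny(cls, reason: str) -> "Decision":
        return cls(False, reason)


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class SessionState:
    budget_usd: float
    spend_usd: float = 0.0
    human_approved: bool = False
    tool_calls: Dict[str, int] = field(default_factory=dict)  # calls made so far, per tool


# Hypothetical policy tables.
LIMITS = {"read_record": 20, "delete_record": 5}
DESTRUCTIVE = {"delete_record"}
SCHEMAS = {"read_record": {"record_id": int}, "delete_record": {"record_id": int}}


def validate(args: Dict[str, Any], schema: Dict[str, type]) -> bool:
    """Toy stand-in for JSON Schema validation: exact key set, expected Python types."""
    return args.keys() == schema.keys() and all(
        isinstance(args[key], expected) for key, expected in schema.items()
    )


def authorize_tool_call(call: ToolCall, session: SessionState) -> Decision:
    if session.spend_usd > session.budget_usd:
        return Decision.deny("budget_exceeded")
    # Tools missing from LIMITS get a cap of 0, so unknown tools are denied here.
    if session.tool_calls.get(call.name, 0) >= LIMITS.get(call.name, 0):
        return Decision.deny("rate_limit")
    if call.name in DESTRUCTIVE and not session.human_approved:
        return Decision.deny("requires_approval")
    if not validate(call.args, SCHEMAS[call.name]):
        return Decision.deny("invalid_args")
    return Decision.allow()


session = SessionState(budget_usd=1.00)
read_ok = authorize_tool_call(ToolCall("read_record", {"record_id": 7}), session)
delete_blocked = authorize_tool_call(ToolCall("delete_record", {"record_id": 7}), session)
unknown_blocked = authorize_tool_call(ToolCall("drop_table", {}), session)
```

The function only decides. The caller is assumed to update `session.tool_calls` and `session.spend_usd` after each executed call, and to refuse to execute anything that was not allowed.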

The audit requirement people underestimate

When an agent does something wrong at 2 AM, you need to answer three questions within minutes: what did it do, what was it told, and why was the action allowed. LLM reasoning traces are not reproducible — run the same prompt twice and you get different chains of thought. The only reliable artifact is the policy decision log.

If that log isn't signed, it isn't evidence. Anyone with write access to your observability stack can rewrite history. ed25519 signatures on each log entry, with the previous entry's hash included, give you an append-only chain that survives a breach of the logging system itself.
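A sketch of the chaining idea, with the signer left pluggable so the example runs on the standard library alone. The entry layout, the `GENESIS` value, and all function names are hypothetical. The demo signer is an HMAC stand-in, which is symmetric and therefore weaker than what the text describes: whoever can verify can also forge. With the `cryptography` package, `Ed25519PrivateKey.sign(data)` and `Ed25519PublicKey.verify(signature, data)` would take its place behind a thin adapter, since `verify` takes the signature first and raises instead of returning `False`.

```python
import hashlib
import hmac
import json
from typing import Callable, Dict, List

GENESIS = "0" * 64  # prev_hash used for the very first entry


def _digest(body: Dict) -> bytes:
    # Canonical JSON so the same entry always hashes to the same bytes.
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).digest()


def append_entry(log: List[Dict], decision: Dict, sign: Callable[[bytes], bytes]) -> Dict:
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"seq": len(log), "prev_hash": prev_hash, "decision": decision}
    digest = _digest(body)
    entry = {**body, "hash": digest.hex(), "sig": sign(digest).hex()}
    log.append(entry)
    return entry


def verify_log(log: List[Dict], verify: Callable[[bytes, bytes], bool]) -> bool:
    prev_hash = GENESIS
    for seq, entry in enumerate(log):
        body = {"seq": entry["seq"], "prev_hash": entry["prev_hash"], "decision": entry["decision"]}
        digest = _digest(body)
        if entry["seq"] != seq or entry["prev_hash"] != prev_hash:
            return False  # an entry was removed, reordered, or inserted
        if entry["hash"] != digest.hex() or not verify(digest, bytes.fromhex(entry["sig"])):
            return False  # the entry's content or signature was altered
        prev_hash = entry["hash"]
    return True


# Stand-in signer (assumption): HMAC-SHA256 with a demo key. Swap in Ed25519 for real use.
_DEMO_KEY = b"demo-key-not-for-production"

def demo_sign(data: bytes) -> bytes:
    return hmac.new(_DEMO_KEY, data, hashlib.sha256).digest()

def demo_verify(data: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(demo_sign(data), signature)


log: List[Dict] = []
append_entry(log, {"tool": "send_email", "verdict": "deny", "reason": "requires_approval"}, demo_sign)
append_entry(log, {"tool": "read_record", "verdict": "allow"}, demo_sign)
intact = verify_log(log, demo_verify)
```

Editing or deleting an earlier entry breaks verification for everything from that point on. Dropping entries from the tail is not caught by this sketch; that would need the latest hash to be anchored somewhere outside the log.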

What we built

Sift is the governance kernel I extracted from running these 23 agents. It does exactly what's described above: deterministic pre-execution checks on every tool call, signed append-only audit, and hard budget and rate limits that can't be bypassed by the agent itself. It's not a guardrail wrapper around an LLM. It's the layer that makes autonomous agents something you can actually leave running overnight.

The safety risks of autonomous agents are real, but they're specific and addressable. The mistake is treating them as a model-quality problem instead of an execution-governance problem.

Run your agents under Sift.

Deterministic governance. Cryptographic receipts. Fail-closed by default.
