When an AI Agent Leaks Secrets to the Log: Causes and Fixes

AI agents routinely leak API keys, tokens, and PII into logs. Here's why it happens, what fails to stop it, and what actually contains the blast radius.

Key Takeaways
  • LLM-based agents leak secrets when tool outputs, stack traces, or env dumps pass through the model context and get echoed to stdout, traces, or observability platforms like Datadog and LangSmith.
  • Prompt-level instructions such as 'never print API keys' fail because the model has no deterministic mechanism to recognize secrets it has never seen before.
  • The standard containment pattern is a deterministic pre-log redactor that pattern-matches known secret formats (AWS AKIA, GitHub ghp_, Stripe sk_live_, JWT, PEM blocks) before any write to disk or network.
  • Rotate every credential the agent had access to during the leak window: logs are replicated to SIEMs, backups, and LLM training pipelines within minutes.
  • Structural fixes include scoped short-lived credentials via AWS STS or HashiCorp Vault, tool-output filtering before context injection, and signed audit logs that separate operator-visible from model-visible fields.

An AI agent leaked a secret to a log. That is the search query, and if you are reading this at 2 a.m. with a PagerDuty alert open, here is the short version: rotate every credential the agent had access to during the leak window, purge the log sinks you control, and assume the ones you don't control (Datadog, LangSmith, Sentry, CloudWatch, S3 backups) have already replicated. Then come back and fix the actual cause, which is almost never what the agent 'decided' to do. It is what your execution layer allowed it to do.

I run 23 autonomous agents in production. I have seen this happen. The first time cost me about six hours of rotation work across AWS IAM, Stripe, GitHub, and an internal service mesh. The mechanism is boringly consistent, and so is the fix.

How the leak actually happens

The agent did not 'decide' to print the secret. One of four things happened:

  1. Tool output echo. A tool returned a response containing a token (e.g., a shell tool ran env, or an HTTP tool returned a response body with a session cookie). The model summarized it back to the user, or a tracing library captured the raw tool output before any redaction.
  2. Error trace. An exception bubbled up from a library like boto3 or stripe-python with the credential embedded in the request signature or URL. The agent's default exception handler wrote the full trace to stdout, which your logger happily shipped to Datadog.
  3. Context-window bleed. The system prompt or a RAG retrieval contained a secret (for example, someone committed a .env to a repo the embedder indexed). The model then reproduced it on request, verbatim or closely enough that an attacker could recover the real value.
  4. Observability capture. LangSmith, Langfuse, Helicone, and similar tools log the entire prompt and completion by default. If a secret passed through context, it is now in a third-party SaaS retention window — typically 30 to 90 days.

In every case, the model is a passive conduit. The leak is a property of the scaffolding around it.
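
For the first mechanism (a shell tool running env), one way to shrink what a tool can echo is to start the subprocess with an allowlisted environment. This is a minimal sketch, not a complete sandbox: the names SAFE_ENV_VARS, scrubbed_env, and run_tool are hypothetical, and the allowlist contents are an assumption you would adapt to your own tools.

```python
import os
import subprocess

# Hypothetical allowlist: only the variables a tool genuinely needs.
SAFE_ENV_VARS = frozenset({"PATH", "HOME", "LANG", "TZ"})


def scrubbed_env(environ=None, allow=SAFE_ENV_VARS):
    """Return a copy of the environment containing only allowlisted names."""
    source = os.environ if environ is None else environ
    return {name: value for name, value in source.items() if name in allow}


def run_tool(argv, environ=None):
    """Run a tool subprocess that inherits only the scrubbed environment."""
    result = subprocess.run(
        argv,
        env=scrubbed_env(environ),  # secrets in the parent env are not passed on
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

With this in place, a tool that dumps its environment can only report the allowlisted variables. It does nothing for secrets that arrive in HTTP responses or files, which still need the output filtering described later.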

Why 'tell the model not to' does not work

The first instinct is to add a system-prompt instruction: 'Never print API keys, passwords, or secrets.' This fails for a specific reason: the model has no deterministic secret classifier. It pattern-matches against what it has seen in training, so it will usually redact strings that look like the textbook examples (sk-..., AKIA...) and just as predictably miss anything novel: internal service tokens, custom HMAC signatures, base64-encoded session blobs, database connection strings with inline passwords.

Anthropic, OpenAI, and the major frameworks are explicit about this in their own docs. Prompt-level guardrails are advisory. They are not a security boundary.

The same problem applies to LLM-based filters (Lakera, NeMo Guardrails, Bedrock Guardrails). They reduce the rate but do not bound it. If your compliance posture requires 'zero secrets in logs,' a probabilistic filter is the wrong layer.

What actually contains the blast radius

The working pattern has three layers, and they have to be deterministic — pure functions that run before anything touches a log sink or a network egress.

| Layer | What it does | Where it runs |
| --- | --- | --- |
| Credential scoping | Short-lived tokens via AWS STS, Vault dynamic secrets, GitHub fine-grained PATs | Before the agent starts |
| Tool output filtering | Regex + entropy pass over every tool return value before it enters context | Between tool executor and model |
| Log/trace redaction | Pattern-match known secret formats on every log line and span attribute | Before any write to stdout, file, or network |
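
The entropy half of the tool output filtering layer can be sketched as follows. Fixed patterns catch known formats; a Shannon-entropy check is a deterministic heuristic for tokens that match no known format. Everything named here (TOKEN_RE, ENTROPY_THRESHOLD, filter_tool_output) is hypothetical, and the threshold of 4.0 bits per character is an assumption to tune against your own traffic, not a recommended value.

```python
import math
import re
from collections import Counter

# Candidate tokens: long runs of key-like characters, with optional base64 padding.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/_\-]{20,}={0,2}")

# Assumed threshold in bits per character; tune it on real tool output.
ENTROPY_THRESHOLD = 4.0


def shannon_entropy(token: str) -> float:
    """Shannon entropy of the token's character distribution, in bits per character."""
    if not token:
        return 0.0
    total = len(token)
    counts = Counter(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def filter_tool_output(text: str) -> str:
    """Mask long tokens whose character entropy meets the threshold."""
    def _mask(match: re.Match) -> str:
        token = match.group(0)
        if shannon_entropy(token) >= ENTROPY_THRESHOLD:
            return "[HIGH_ENTROPY]"
        return token  # long but repetitive or word-like: leave it alone

    return TOKEN_RE.sub(_mask, text)
```

Expect false positives on content hashes, UUID-like identifiers, and some long paths. In a setup like this the entropy pass would run after the fixed-pattern pass, so that known formats keep their specific labels.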

Here is the minimum redactor I run in front of every agent process. It is not clever. Clever is the problem.

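import re

# Ordered list of (compiled pattern, replacement). Every pattern is a fixed,
# known secret format; nothing here guesses.
SECRET_PATTERNS = [
    # AWS access key ID
    (re.compile(r'AKIA[0-9A-Z]{16}'), '[AWS_AKID]'),
    # GitHub classic personal access token
    (re.compile(r'ghp_[A-Za-z0-9]{36}'), '[GITHUB_PAT]'),
    # Stripe live secret key
    (re.compile(r'sk_live_[A-Za-z0-9]{24,}'), '[STRIPE_LIVE]'),
    # PEM private key block; [A-Z ]* also matches the bare PKCS#8 header
    # "-----BEGIN PRIVATE KEY-----", which [A-Z ]+ would miss
    (re.compile(r'-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]+?-----END [A-Z ]*PRIVATE KEY-----'), '[PEM]'),
    # JWT: three base64url segments, the first starting with eyJ
    (re.compile(r'eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}'), '[JWT]'),
    # Postgres URL with inline credentials; whitespace is excluded so a match
    # cannot run across lines
    (re.compile(r'postgres(?:ql)?://[^:@/\s]+:[^@\s]+@'), 'postgres://[REDACTED]@'),
]

def redact(s: str) -> str:
    """Apply every pattern in order and return the redacted string."""
    for pat, repl in SECRET_PATTERNS:
        s = pat.sub(repl, s)
    return s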

This runs inside the logging formatter, the tool output handler, and the trace exporter. Three places. Same function. The agent cannot write to any sink that does not route through it.
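
Wiring the redactor into the logging formatter might look like the sketch below, using Python's standard logging module. The redact function here is a one-pattern stand-in so the example is self-contained; RedactingFormatter, build_logger, and the "agent" logger name are hypothetical.

```python
import logging
import re

# Stand-in for the full redactor; one pattern keeps the sketch short.
_AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")


def redact(s: str) -> str:
    return _AWS_KEY.sub("[AWS_AKID]", s)


class RedactingFormatter(logging.Formatter):
    """Redact the fully formatted record, including any traceback text."""

    def format(self, record: logging.LogRecord) -> str:
        # super().format() interpolates args and appends exception text,
        # so redacting its result covers the message and the stack trace.
        return redact(super().format(record))


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("agent")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False  # keep records away from unredacted root handlers
    handler = logging.StreamHandler(stream)
    handler.setFormatter(RedactingFormatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Redacting after formatting, instead of filtering record.msg, is a deliberate choice in this sketch: it also catches secrets that arrive through format arguments or exception traces. Any handler attached without this formatter bypasses it, which is why propagation to the root logger is switched off.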

The audit-log problem

There is a second-order issue that bites once you deploy this. Your redacted logs are now useful for debugging but useless for forensics. When the auditor asks 'what did the agent actually see,' you need an unredacted record — and that record itself becomes a secret-bearing artifact.

The resolution is to split the log stream. One channel is operator-visible, fully redacted, retained long-term. The other is a sealed forensic channel: encrypted at write time with a key only the incident-response role can access, signed per entry (Ed25519 works), and retained on a shorter window. This is the pattern we implemented in Sift's kernel because the governance frameworks auditors cite (SOC 2 CC7.2, ISO 27001 A.12.4, the EU AI Act Article 12) all expect logs whose integrity can be demonstrated, and standard JSON-to-Datadog does not clear that bar.
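
To make the tamper-evidence idea concrete, here is a minimal sketch of a chained, authenticated log. It is not Sift's implementation. It swaps in HMAC-SHA256 from the standard library in place of Ed25519 signatures, and it leaves out encryption entirely; both substitutions are assumptions made so the example runs without third-party packages. The function names are hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first entry


def _mac(key: bytes, prev_mac: str, payload: dict) -> str:
    """MAC over the previous entry's MAC plus this entry's payload."""
    body = json.dumps({"prev": prev_mac, "payload": payload}, sort_keys=True)
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(chain: list, key: bytes, payload: dict) -> dict:
    """Append an entry whose MAC binds it to everything before it."""
    prev_mac = chain[-1]["mac"] if chain else GENESIS
    entry = {"prev": prev_mac, "payload": payload, "mac": _mac(key, prev_mac, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list, key: bytes) -> bool:
    """Return False if any entry was edited, reordered, or removed from the middle."""
    prev_mac = GENESIS
    for entry in chain:
        if entry["prev"] != prev_mac:
            return False
        expected = _mac(key, entry["prev"], entry["payload"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev_mac = entry["mac"]
    return True
```

HMAC is symmetric, so anyone who can verify the chain can also forge it. An asymmetric signature such as Ed25519 removes that limitation by letting auditors verify with a public key. Truncation of the most recent entries is also undetectable here unless the latest MAC is anchored somewhere outside the log.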

The checklist after an incident

  • Identify the leak window: from the earliest log entry that contains the secret to the timestamp at which the credential was rotated or revoked.
  • Rotate every credential the agent's IAM role, service account, or token scope could reach during that window. Not just the one you saw.
  • Revoke active sessions and OAuth refresh tokens, and rotate JWT signing keys if the key itself leaked.
  • Purge internal log sinks you control. File a deletion request with third-party observability vendors — most honor it within 7 days.
  • Check if the leaked secret appeared in any output returned to an end user; if yes, you likely have a disclosure obligation.
  • Add the leaked secret's format to your redactor's pattern list so the same shape cannot leak again.
  • Move the offending credential behind a short-lived token issuer if it is not already.
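
The last checklist item can be sketched with AWS STS. The function below takes the STS client as a parameter (in practice something like boto3.client("sts")), so it can be exercised without AWS access. The function name and its defaults are hypothetical. The assume_role parameters and the shape of the Credentials response follow the STS AssumeRole API.

```python
from typing import Any, Dict, Optional


def issue_short_lived_credentials(
    sts_client: Any,
    role_arn: str,
    session_name: str,
    duration_seconds: int = 900,
    session_policy_json: Optional[str] = None,
) -> Dict[str, Any]:
    """Exchange a long-lived identity for temporary credentials via AssumeRole."""
    if duration_seconds < 900:
        # 900 seconds is the minimum duration AssumeRole accepts.
        raise ValueError("duration_seconds must be at least 900")
    params: Dict[str, Any] = {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": duration_seconds,
    }
    if session_policy_json is not None:
        # An inline session policy can only narrow the role's permissions.
        params["Policy"] = session_policy_json
    creds = sts_client.assume_role(**params)["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
        "expiration": creds["Expiration"],
    }
```

The first three returned keys match the keyword arguments of boto3.Session, so the agent process can be started with a session that expires on its own. A leaked temporary key then has a bounded lifetime, although it still needs rotation handling if it leaks inside that window.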

The leak is not a model failure. It is a missing layer between the model and the world. Put the layer in, make it deterministic, and the class of incident goes away.
