When an AI Agent Leaks Secrets to the Log: Causes and Fixes
AI agents routinely leak API keys, tokens, and PII into logs. Here's why it happens, what fails to stop it, and what actually contains the blast radius.
- LLM-based agents leak secrets when tool outputs, stack traces, or env dumps pass through the model context and get echoed to stdout, traces, or observability platforms like Datadog and LangSmith.
- Prompt-level instructions such as 'never print API keys' fail because the model has no deterministic mechanism to recognize secrets it has never seen before.
- The standard containment pattern is a deterministic pre-log redactor that pattern-matches known secret formats (AWS AKIA, GitHub ghp_, Stripe sk_live_, JWT, PEM blocks) before any write to disk or network.
- Rotate every credential the agent had access to during the leak window. Logs are replicated to SIEMs, backups, and LLM training pipelines within minutes.
- Structural fixes include scoped short-lived credentials via AWS STS or HashiCorp Vault, tool-output filtering before context injection, and signed audit logs that separate operator-visible from model-visible fields.
An AI agent leaked a secret to a log. That is the search query, and if you are reading this at 2 a.m. with a PagerDuty alert open, here is the short version: rotate every credential the agent had access to during the leak window, purge the log sinks you control, and assume the ones you don't control (Datadog, LangSmith, Sentry, CloudWatch, S3 backups) have already replicated. Then come back and fix the actual cause, which is almost never what the agent 'decided' to do. It is what your execution layer allowed it to do.
I run 23 autonomous agents in production. I have seen this happen. The first time cost me about six hours of rotation work across AWS IAM, Stripe, GitHub, and an internal service mesh. The mechanism is boringly consistent, and so is the fix.
How the leak actually happens
The agent did not 'decide' to print the secret. One of four things happened:
- Tool output echo. A tool returned a response containing a token (e.g., a shell tool ran `env`, or an HTTP tool returned a response body with a session cookie). The model summarized it back to the user, or a tracing library captured the raw tool output before any redaction.
- Error trace. An exception bubbled up from a library like `boto3` or `stripe-python` with the credential embedded in the request signature or URL. The agent's default exception handler wrote the full trace to stdout, which your logger happily shipped to Datadog.
- Context-window bleed. The system prompt or a RAG retrieval contained a secret (someone committed a `.env` to a repo the embedder indexed). The model then reproduced it on request, or hallucinated it close enough to the real value to be useful to an attacker.
- Observability capture. LangSmith, Langfuse, Helicone, and similar tools log the entire prompt and completion by default. If a secret passed through context, it is now in a third-party SaaS retention window, typically 30 to 90 days.
In every case, the model is a passive conduit. The leak is a property of the scaffolding around it.
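The error-trace path is the easiest one to reproduce. The sketch below is illustrative only: `call_api`, the URL, and the placeholder key are all made up, and the point is just that a formatted traceback carries whatever the exception message carried.

```python
import traceback


def call_api(url: str) -> None:
    # Hypothetical client call: many HTTP libraries include the full
    # request URL, query string and all, in the exception message.
    raise ConnectionError(f"request failed: {url}")


def format_failure(url: str) -> str:
    """Return the traceback text a default handler would write to stdout."""
    try:
        call_api(url)
    except ConnectionError:
        return traceback.format_exc()
    return ""


# Placeholder value, not a real credential.
trace = format_failure(
    "https://api.example.invalid/v1/charge?api_key=EXAMPLE-NOT-A-REAL-KEY"
)
leaked = "EXAMPLE-NOT-A-REAL-KEY" in trace
```

Nothing in that path involves the model at all. The key reaches the log because the handler prints the trace verbatim.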
Why 'tell the model not to' does not work
The first instinct is to add a system-prompt instruction: Never print API keys, passwords, or secrets. This fails for a specific reason. The model has no deterministic secret classifier. It pattern-matches against what it has seen in training, which means it will reliably redact strings that look like the textbook examples (sk-..., AKIA...) and reliably fail on anything novel: internal service tokens, custom HMAC signatures, base64-encoded session blobs, database connection strings with inline passwords.
Anthropic, OpenAI, and the major frameworks are explicit about this in their own docs. Prompt-level guardrails are advisory. They are not a security boundary.
The same problem applies to LLM-based filters (Lakera, NeMo Guardrails, Bedrock Guardrails). They reduce the rate but do not bound it. If your compliance posture requires 'zero secrets in logs,' a probabilistic filter is the wrong layer.
What actually contains the blast radius
The working pattern has three layers, and they have to be deterministic — pure functions that run before anything touches a log sink or a network egress.
| Layer | What it does | Where it runs |
|---|---|---|
| Credential scoping | Short-lived tokens via AWS STS, Vault dynamic secrets, GitHub fine-grained PATs | Before the agent starts |
| Tool output filtering | Regex + entropy pass over every tool return value before it enters context | Between tool executor and model |
| Log/trace redaction | Pattern-match known secret formats on every log line and span attribute | Before any write to stdout, file, or network |
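The entropy half of the tool output filtering row can be sketched roughly as below. This is one assumed shape, not the exact filter described here: `filter_tool_output`, the 20-character minimum, and the 4.0 bits-per-character threshold are illustrative choices that need tuning against real traffic.

```python
import math
import re
from collections import Counter

# Candidate tokens: long runs of base64/URL-safe characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/_-]{20,}")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of s in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def filter_tool_output(text: str, threshold: float = 4.0) -> str:
    """Mask long tokens whose entropy meets the threshold; keep the rest."""

    def _mask(match: re.Match) -> str:
        token = match.group(0)
        if shannon_entropy(token) >= threshold:
            return "[HIGH_ENTROPY]"
        return token

    return TOKEN_RE.sub(_mask, text)


masked = filter_tool_output("session q8Zr2LxW9vTn4KbY7mJc5HdP issued")
```

The pass is still deterministic: the same input always yields the same output. It does have blind spots. Hex-only strings top out at 4 bits per character, so most hex secrets fall under this threshold, and lowering it starts masking commit hashes and UUIDs. It is best treated as a net under the regex list rather than a replacement for it.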
Here is the minimum redactor I run in front of every agent process. It is not clever. Clever is the problem.
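import re

# (pattern, replacement) pairs, applied in order. Known formats only.
SECRET_PATTERNS = [
    (re.compile(r'AKIA[0-9A-Z]{16}'), '[AWS_AKID]'),
    (re.compile(r'ghp_[A-Za-z0-9]{36}'), '[GITHUB_PAT]'),
    (re.compile(r'sk_live_[A-Za-z0-9]{24,}'), '[STRIPE_LIVE]'),
    # Optional type word, so plain "BEGIN PRIVATE KEY" (PKCS#8) matches too.
    (re.compile(r'-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----[\s\S]+?-----END (?:[A-Z]+ )*PRIVATE KEY-----'), '[PEM]'),
    (re.compile(r'eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}'), '[JWT]'),
    # Exclude whitespace so a match cannot run across unrelated lines.
    (re.compile(r'(postgres(?:ql)?://)[^:/\s]+:[^@\s]+@'), r'\1[REDACTED]@'),
]


def redact(s: str) -> str:
    """Replace every known secret format in s with a fixed placeholder."""
    for pat, repl in SECRET_PATTERNS:
        s = pat.sub(repl, s)
    return s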
This runs inside the logging formatter, the tool output handler, and the trace exporter. Three places. Same function. The agent cannot write to any sink that does not route through it.
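For the logging formatter specifically, one way to wire it in with the standard library is a `logging.Formatter` subclass that scrubs the fully rendered line. This is a minimal sketch: `RedactingFormatter` and the `agent.demo` logger name are hypothetical, and the inline `redact` covers a single pattern only to keep the example self-contained.

```python
import io
import logging
import re

AWS_KEY_RE = re.compile(r"AKIA[0-9A-Z]{16}")


def redact(s: str) -> str:
    # Stand-in for the full pattern list; one pattern keeps the sketch short.
    return AWS_KEY_RE.sub("[AWS_AKID]", s)


class RedactingFormatter(logging.Formatter):
    """Redact after formatting, so args and exception text are covered too."""

    def format(self, record: logging.LogRecord) -> str:
        return redact(super().format(record))


stream = io.StringIO()  # stands in for stdout or a file sink
handler = logging.StreamHandler(stream)
handler.setFormatter(RedactingFormatter("%(levelname)s %(message)s"))

logger = logging.getLogger("agent.demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the record away from unredacted root handlers

# AKIAIOSFODNN7EXAMPLE is the placeholder key from AWS documentation.
logger.info("tool returned %s", "AKIAIOSFODNN7EXAMPLE")
output = stream.getvalue()
```

Redacting the rendered string instead of `record.msg` matters because the secret usually arrives through the format arguments or the traceback, not the message template. The formatter only protects handlers it is attached to, which is why propagation to the root logger is switched off here.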
The audit-log problem
There is a second-order issue that bites once you deploy this. Your redacted logs are now useful for debugging but useless for forensics. When the auditor asks 'what did the agent actually see,' you need an unredacted record — and that record itself becomes a secret-bearing artifact.
The resolution is to split the log stream. One channel is operator-visible, fully redacted, retained long-term. The other is a sealed forensic channel: encrypted at write time with a key only the incident-response role can access, signed per entry (ed25519 works), and retained on a shorter window. This is the pattern we implemented in Sift's kernel because the governance frameworks auditors cite most often (SOC 2 CC7.2, ISO 27001 A.12.4, the EU AI Act Article 12) all expect log records whose integrity can be demonstrated, and standard JSON-to-Datadog does not clear that bar.
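The tamper-evidence half of the forensic channel can be illustrated with the standard library alone. To be clear about the substitution: this sketch uses an HMAC-SHA256 hash chain as a stand-in for per-entry ed25519 signatures, leaves encryption out entirely, and is not Sift's implementation. `append_entry` and `verify_chain` are hypothetical names.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # "previous MAC" used for the first entry


def _mac(key: bytes, prev: str, body: str) -> str:
    return hmac.new(key, (prev + body).encode(), hashlib.sha256).hexdigest()


def append_entry(chain: list, key: bytes, payload: dict) -> None:
    """Append payload, binding it to the MAC of the previous entry."""
    prev = chain[-1]["mac"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)
    chain.append({"prev": prev, "body": body, "mac": _mac(key, prev, body)})


def verify_chain(chain: list, key: bytes) -> bool:
    """Return True only if no entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in chain:
        expected = _mac(key, prev, entry["body"])
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev = entry["mac"]
    return True


key = b"demo-key-not-for-production"
chain: list = []
append_entry(chain, key, {"event": "tool_call", "tool": "shell"})
append_entry(chain, key, {"event": "tool_result", "bytes": 512})
intact = verify_chain(chain, key)
```

Because each MAC covers the previous one, editing or dropping an entry in the middle breaks every entry after it. The catch with HMAC is that whoever can verify the chain can also forge it. That is why an asymmetric scheme like ed25519 is the better fit when the verifier should not hold signing power.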
The checklist after an incident
- Identify the leak window: from the earliest log entry that contains the secret to the timestamp at which the credential was rotated.
- Rotate every credential the agent's IAM role, service account, or token scope could reach during that window. Not just the one you saw.
- Revoke active sessions (OAuth refresh tokens, JWT signing keys if the key itself leaked).
- Purge internal log sinks you control. File a deletion request with third-party observability vendors — most honor it within 7 days.
- Check if the leaked secret appeared in any output returned to an end user; if yes, you likely have a disclosure obligation.
- Add the leaked secret's format to your redactor's pattern list so the same shape cannot leak again.
- Move the offending credential behind a short-lived token issuer if it is not already.
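For the last item, a short-lived issuer on AWS can be as small as one STS call. This sketch assumes a boto3 STS client is passed in (for example `boto3.client("sts")`). The role ARN, session name, and bucket are hypothetical, and the S3 read-only session policy is just an example of narrowing scope.

```python
import json


def issue_scoped_credentials(sts_client, role_arn: str, session_name: str,
                             bucket_arn: str) -> dict:
    """Return temporary credentials limited to reading one bucket.

    sts_client is assumed to behave like a boto3 STS client.
    """
    # A session policy can only narrow what the role already allows.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"{bucket_arn}/*",
            }
        ],
    }
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=900,  # 15 minutes, the shortest STS permits
        Policy=json.dumps(session_policy),
    )
    # Keys: AccessKeyId, SecretAccessKey, SessionToken, Expiration.
    return response["Credentials"]
```

If these credentials end up in a log, the exposure lasts 15 minutes and covers one bucket instead of everything the role can reach. Temporary access key IDs start with ASIA rather than AKIA, so the redactor's pattern list needs a matching entry.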
The leak is not a model failure. It is a missing layer between the model and the world. Put the layer in, make it deterministic, and the class of incident goes away.