Agent containment is the set of architectural patterns that limit what an AI agent can do when it goes wrong. Drawing on Anthropic, OpenAI, Google DeepMind, Microsoft, and OWASP, here are the four layers every team deploying agents in production needs to understand, illustrated with FlowZap sequence diagrams showing the interactions between Agent, Sandbox, Human, Permission Gates, and SIEM.
Why Agent Containment Matters Now
On June 19, 2026, Anthropic published "How we contain Claude across products" — a detailed breakdown of the security architecture protecting claude.ai, Claude Code, and Cowork. The opening line sets the stakes:
"As agents grow more capable, so does their potential blast radius. The engineering question is how to cap it."
Anthropic is the latest of four major ecosystems that have published containment frameworks since January:
| Ecosystem | Key Contribution | Date |
|---|---|---|
| Anthropic | 4-layer containment stack (Sandbox→Permissions→HITL→Audit) for Claude Code | Jun 2026 |
| OpenAI | "Practices for Governing Agentic AI Systems" — blast radius, delegation chains, permission scoping | Jan 2026 |
| Google DeepMind | Agent safety framework for Astra, Mariner, Veo — runtime isolation + approval hierarchies | Mar 2026 |
| Microsoft | AI Red Team lessons from Copilot agents — sandbox escapes, prompt injection in agentic chains | Feb 2026 |
This isn't theoretical. Every CI/CD pipeline that auto-approves PRs from an AI coding agent is a blast radius waiting to be measured. Every MCP server that grants terminal access without path scoping is a sandbox escape vector. The patterns below are what the four ecosystems converged on.
The 4 Layers of Agent Containment
Anthropic formalized the stack. OpenAI, DeepMind, and Microsoft each contributed nuance. Here's the unified model:
The Interaction Model
Every containment layer is a dialogue between participants, not a monologue inside the agent. The diagrams below show the real interactions:
- Layer 1 — Sandbox: Agent ↔ Sandbox Runtime (ephemeral container, path validation)
- Layer 2 — Permissions: Agent ↔ Permission Gate (whitelist, scope check)
- Layer 3 — HITL: Agent ↔ Human Reviewer (approval, fatigue management)
- Layer 4 — Audit: Agent ↔ SIEM (immutable logging, alerting)
Layer 1: Sandboxing — Agent ↔ Sandbox
The first line of defense: the agent runs in an environment where it physically cannot touch anything critical.
The pattern (5 ecosystems converge):
- Dedicated containers or VMs per agent session (Anthropic, Google, Microsoft)
- No network access to internal services by default (OpenAI, OWASP LLM06)
- Read-only filesystem mounts for system directories (all five)
- Ephemeral storage destroyed after each session (Anthropic, Google)
agent { # Agent
n1: circle label:"Session Start"
n2: rectangle label:"Request Tool Call"
n5: rectangle label:"Process Result"
n6: circle label:"Session End"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> sandbox.n3.handle(top) [label="tool call"]
}
sandbox { # Sandbox
n3: diamond label:"Path in Workspace?"
n4: rectangle label:"Execute in Ephemeral Container"
n7: rectangle label:"Deny - Log Security Event"
n8: rectangle label:"Return Result to Agent"
n3.handle(right) -> n4.handle(left) [label="yes"]
n3.handle(bottom) -> n7.handle(top) [label="no"]
n4.handle(right) -> n8.handle(left)
n7.handle(right) -> n8.handle(bottom)
n8.handle(top) -> agent.n5.handle(bottom) [label="result"]
n5.handle(right) -> n6.handle(left)
}
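The "Path in Workspace?" decision in the diagram can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not how any of the vendors implement it: the function name `is_inside_workspace` and the workspace location are hypothetical.

```python
from pathlib import Path


def is_inside_workspace(requested: str, workspace: str) -> bool:
    """Return True only if `requested` resolves to a location inside `workspace`.

    Hypothetical helper for illustration; names are not from any vendor API.
    """
    root = Path(workspace).resolve()
    # Join first, then resolve, so "../" segments and symlinks are collapsed
    # before the comparison. An absolute `requested` replaces `root` entirely.
    target = (root / requested).resolve()
    return target == root or root in target.parents
```

Resolving before comparing is the important design choice: a plain string prefix check would accept a path such as `workspace/../etc/passwd`.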
What Anthropic does: Claude Code runs in a sandboxed environment where each tool invocation is evaluated against ALLOWED_HOSTS, with SSRF protection and request timeouts.
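A host allowlist with an SSRF guard could look roughly like the sketch below. The allowlist contents, the function name `is_url_allowed`, and the injectable `resolve` parameter are assumptions made for illustration. Request timeouts are not shown here and would be set on whichever HTTP client performs the fetch.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.example.com"}  # hypothetical allowlist


def is_url_allowed(url: str, allowed_hosts=ALLOWED_HOSTS,
                   resolve=socket.getaddrinfo) -> bool:
    """Allow only HTTPS URLs whose host is allowlisted and resolves publicly."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname is None:
        return False
    host = parsed.hostname.lower()
    if host not in allowed_hosts:
        return False
    try:
        infos = resolve(host, parsed.port or 443)
    except OSError:  # includes socket.gaierror (DNS failure)
        return False
    for info in infos:
        # info[4][0] is the IP string; drop any IPv6 scope id ("%eth0")
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        # SSRF guard: reject loopback, private, link-local, and similar ranges
        if not addr.is_global:
            return False
    return bool(infos)
```

The resolver is a parameter so the check can be exercised without network access. A production version would also need to pin the resolved address for the actual request, otherwise a second DNS lookup could return a different answer.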
What Microsoft adds: Copilot agents run in "Defender-managed sandboxes" that intercept prompt injection at the model boundary — before the agent can act on a malicious instruction. Their red team found that 34% of sandbox escapes in agentic systems came through tool descriptions, not user prompts.
The gotcha: Sandboxing is only as good as its configuration. A container with --privileged or a Docker socket mounted inside defeats the purpose. Google DeepMind's safety team recommends runtime attestation: verifying the sandbox configuration hasn't been tampered with before each agent session.
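The two misconfigurations named above can be checked before a session starts. The sketch below assumes the input is one parsed entry from `docker inspect <container>` output, which exposes `HostConfig.Privileged` and a `Mounts` list. The function name is hypothetical, and this is a simple pre-flight check rather than the runtime attestation Google DeepMind describes.

```python
def sandbox_config_violations(inspect: dict) -> list[str]:
    """Return a list of risky settings found in one `docker inspect` entry."""
    violations = []
    host_config = inspect.get("HostConfig", {})
    if host_config.get("Privileged"):
        violations.append("container runs with --privileged")
    for mount in inspect.get("Mounts", []):
        # A mounted Docker socket lets the container control the host daemon
        if str(mount.get("Source", "")).endswith("docker.sock"):
            violations.append("Docker socket is mounted inside the container")
    return violations
```

An orchestrator could refuse to start the agent session whenever the returned list is non-empty.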
Layer 2: Permissions — Agent ↔ Permission Gate
Even inside a sandbox, an agent needs some access. Layer 2 defines exactly what.
The pattern:
- Whitelist, never blacklist (Anthropic, OpenAI, OWASP)
- Principle of least privilege per tool (Google, Microsoft)
- Path-based restrictions: only ./workspace/, never /etc/ (all five)
- Read vs. write vs. execute as separate permissions (Anthropic, OpenAI)
agent { # Agent
n1: circle label:"Tool Call Initiated"
n2: rectangle label:"Request File Write"
n5: diamond label:"Permission Granted?"
n6: rectangle label:"Write File to Disk"
n7: rectangle label:"Abort - Log Denial"
n8: circle label:"Done"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> permgate.n3.handle(top) [label="check permission"]
}
permgate { # Permission Gate
n3: rectangle label:"Check Whitelist and Path Scope"
n4: rectangle label:"Return Decision"
n3.handle(right) -> n4.handle(left)
n4.handle(top) -> agent.n5.handle(bottom) [label="decision"]
n5.handle(right) -> n6.handle(left) [label="granted"]
n5.handle(bottom) -> n7.handle(top) [label="denied"]
n6.handle(right) -> n8.handle(left)
n7.handle(right) -> n8.handle(left)
}
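The "Check Whitelist and Path Scope" step, with read, write, and execute kept as separate permissions, might be sketched like this. The tool names, path prefixes, and the `ALLOWLIST` structure are hypothetical and chosen only for illustration.

```python
from enum import Flag, auto
from pathlib import PurePosixPath


class Access(Flag):
    NONE = 0
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


# Hypothetical allowlist: tool name -> (allowed path prefix, granted access)
ALLOWLIST = {
    "read_file": (PurePosixPath("workspace"), Access.READ),
    "write_file": (PurePosixPath("workspace/out"), Access.READ | Access.WRITE),
}


def check_permission(tool: str, path: str, wanted: Access) -> bool:
    """Grant only if the tool is listed, the path is in scope, and every
    requested access bit has been granted."""
    entry = ALLOWLIST.get(tool)
    if entry is None:
        return False  # not on the allowlist: deny by default
    prefix, granted = entry
    target = PurePosixPath(path)
    # Reject absolute paths and parent traversal outright
    if target.is_absolute() or ".." in target.parts:
        return False
    in_scope = target == prefix or prefix in target.parents
    return in_scope and (wanted & granted) == wanted
```

Because unknown tools fall through to a denial, adding a new tool without updating the allowlist fails closed rather than open.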
What OpenAI mandates: "Practices for Governing Agentic AI Systems" (Jan 2026) explicitly calls out delegation chain permission scoping — when Agent A delegates to Agent B, B must have strictly fewer permissions than A. No child agent should have more power than its parent. This maps directly to flowzap-senior-dev → flowzap-security-auditor delegation patterns.
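Treating permissions as a set makes the delegation rule a one-line check, since Python's `<` on sets means "strict subset". The sketch below is an assumption about how one might enforce the rule; the function name and permission names are hypothetical.

```python
def scope_for_child(parent_perms, requested) -> frozenset:
    """Return the child's permissions, or raise if they are not a strict
    subset of the parent's."""
    parent = frozenset(parent_perms)
    child = frozenset(requested)
    # `<` is the strict-subset test: equal sets and extra permissions both fail
    if not child < parent:
        raise PermissionError(
            "child agent must hold strictly fewer permissions than its parent"
        )
    return child
```

For example, a parent holding `{"read_file", "write_file", "terminal"}` can delegate `{"read_file"}`, but a request for the full parent set or for an extra `"web_fetch"` permission raises `PermissionError`.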
What OWASP flags: LLM06 "Excessive Agency" in the Top 10 for LLM Applications (v2.0, Nov 2025) warns that granting agents unrestricted tool access, especially shell, file system writes, and network egress, is the #1 architectural vulnerability in production agent deployments.
Layer 3: HITL Approval — Agent ↔ Human
Some actions are too dangerous to automate. Layer 3 puts a human between the agent's decision and the real world.
The pattern:
- Auto-approve: read-only, low-risk (Anthropic, Microsoft)
- Ask: file writes, network calls, shell commands (all five)
- Deny: destructive operations, config changes, secret access (OpenAI, Google)
- Approval fatigue prevention: batch approvals, pattern learning (Anthropic's "auto mode" innovation)
agent { # Agent
n1: circle label:"Dangerous Tool Call"
n2: rectangle label:"Request Human Approval"
n5: diamond label:"Human Approved?"
n6: rectangle label:"Execute Tool Safely"
n7: rectangle label:"Abort Operation"
n8: circle label:"Done"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> human.n3.handle(top) [label="approval request"]
n5.handle(right) -> n6.handle(left) [label="approved"]
n6.handle(right) -> n8.handle(left)
n7.handle(right) -> n8.handle(left)
n5.handle(bottom) -> n7.handle(top) [label="rejected"]
}
human { # Human Reviewer
n3: rectangle label:"Review Tool Call and Context"
n4: rectangle label:"Return Decision"
n3.handle(right) -> n4.handle(left)
n4.handle(top) -> agent.n5.handle(bottom) [label="decision"]
}
| Action | Default | Rationale |
|---|---|---|
| read_file | Auto-approve | Read-only, no side effects |
| grep / glob | Auto-approve | Search operations |
| write_file | Ask | Modifies filesystem |
| terminal (shell) | Ask | Arbitrary code execution |
| web_fetch | Ask | Network egress |
| .env access | Ask + Warn | Secrets exposure |
| rm -rf / destructive | Deny | Irreversible damage |
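The defaults in this table can be encoded as a small policy function. The sketch is illustrative only: the `Decision` enum, the `decide` function, and the choice to deny unknown actions are assumptions, and the destructive-command check is deliberately naive.

```python
from enum import Enum
from pathlib import PurePosixPath


class Decision(Enum):
    AUTO_APPROVE = "auto-approve"
    ASK = "ask"
    ASK_AND_WARN = "ask+warn"
    DENY = "deny"


# Defaults taken from the table above
DEFAULTS = {
    "read_file": Decision.AUTO_APPROVE,
    "grep": Decision.AUTO_APPROVE,
    "glob": Decision.AUTO_APPROVE,
    "write_file": Decision.ASK,
    "terminal": Decision.ASK,
    "web_fetch": Decision.ASK,
}
DESTRUCTIVE_PREFIXES = ("rm -rf",)  # naive illustration, not a complete list


def decide(action: str, argument: str = "") -> Decision:
    """Map a tool call to its default approval decision."""
    if action == "terminal" and argument.strip().startswith(DESTRUCTIVE_PREFIXES):
        return Decision.DENY
    # Any file tool touching a .env file escalates to ask-and-warn
    if action in ("read_file", "write_file") and \
            PurePosixPath(argument).name.startswith(".env"):
        return Decision.ASK_AND_WARN
    # Assumption: actions missing from the table are denied, not asked
    return DEFAULTS.get(action, Decision.DENY)
```

Note that a shell command such as `cat .env` still only yields "ask" here, which is one reason the table treats the terminal as arbitrary code execution.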
What Anthropic innovated: Claude Code's "auto mode" (March 2026) selectively skips permission prompts for low-risk operations while keeping the human in the loop for anything that modifies state. The key innovation: the agent learns which patterns you approve and auto-approves similar future operations, reducing fatigue without sacrificing security. But their postmortem of "three recent issues" (Sept 2025) revealed that pattern-learning auto-approve created a new class of bugs where developers stopped reading prompts and auto-approved everything.
What Google DeepMind enforces: "Approval hierarchies" — for multi-agent systems, no single human approves their own agent's actions. The approver must be in a different reporting chain, preventing rubber-stamping. Project Mariner implements this at the browser-action level.
Layer 4: Audit Logging — Agent ↔ SIEM
The layer most teams skip — and the one they wish they had during an incident.
The pattern:
- Immutable log per agent session (Anthropic, Microsoft)
- Every tool call logged: timestamp, tool name, arguments (sanitized), result (all five)
- Security events flagged: denied permissions, unusual patterns, rate limit hits (OWASP)
- Logs shipped to a separate system — not readable by the agent itself (Google)
agent { # Agent
n1: circle label:"Execute Tool Call"
n2: rectangle label:"Send Event to Logger"
n5: rectangle label:"Continue Execution"
n6: circle label:"Done"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> siem.n3.handle(top) [label="log event"]
}
siem { # SIEM Audit System
n3: rectangle label:"Sanitize Args and Strip Secrets"
n4: rectangle label:"Write to Immutable Log Store"
n3.handle(right) -> n4.handle(left)
n4.handle(top) -> agent.n5.handle(bottom) [label="logged"]
n5.handle(right) -> n6.handle(left)
}
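The "Sanitize Args and Strip Secrets" step followed by a log write could be sketched as below. The key pattern, field names, and function names are assumptions; the sanitizer is shallow and only inspects top-level argument names, not values or nested structures.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical pattern for argument names that likely hold secrets
SECRET_KEY = re.compile(r"(token|secret|password|api[_-]?key)", re.IGNORECASE)


def sanitize(args: dict) -> dict:
    """Redact values whose argument name looks like a secret."""
    return {k: "[REDACTED]" if SECRET_KEY.search(k) else v
            for k, v in args.items()}


def audit_record(tool: str, args: dict, result: str, now=None) -> str:
    """Build one JSON log line: timestamp, tool name, sanitized args, result."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {"ts": ts, "tool": tool, "args": sanitize(args), "result": result},
        sort_keys=True,
    )
```

Each returned line would then be shipped to a store the agent cannot read or modify.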
What Microsoft's red team found: In 40% of their simulated attacks on Copilot agents, audit logs were the only detection mechanism: permissions had failed due to misconfiguration, sandboxing due to container escape, and HITL due to approval fatigue. Audit logs caught 100% of the attacks post-hoc, but only in teams that had actually shipped logs off-machine and set up alerting rules.
What OWASP recommends: Logs must be "attestable" — cryptographically signed so an agent cannot tamper with its own audit trail after a breach. This is particularly critical for CI/CD agents that have write access to the repository.
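One way to make a log tamper-evident is an HMAC chain, where each signature covers the entry and the previous signature. This is a sketch of that general technique, not OWASP's prescribed mechanism, and all names are hypothetical. HMAC is symmetric, so the key must live with the log collector and never with the agent.

```python
import hashlib
import hmac


def sign_entry(key: bytes, prev_sig: str, entry: str) -> str:
    """Sign an entry together with the previous signature (hash chain)."""
    message = (prev_sig + "\n" + entry).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_entry(log: list, key: bytes, entry: str) -> None:
    """Append (entry, signature) to the log, chaining to the last signature."""
    prev_sig = log[-1][1] if log else ""
    log.append((entry, sign_entry(key, prev_sig, entry)))


def verify_log(log: list, key: bytes) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_sig = ""
    for entry, sig in log:
        if not hmac.compare_digest(sig, sign_entry(key, prev_sig, entry)):
            return False
        prev_sig = sig
    return True
```

Chaining means an attacker who rewrites one entry must re-sign every later entry, which is not possible without the key. Truncating the tail of the log is not detected by this sketch and needs a separately stored entry count or final signature.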
Putting It All Together: The Complete Containment Stack
When all four layers work together, the architecture looks like this — a single Containment Stack that the Agent communicates with for every tool call:
agent { # Agent
n1: circle label:"User Prompt Received"
n2: rectangle label:"Agent Plans Tool Call"
n5: rectangle label:"Process Final Result"
n6: circle label:"Response to User"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> stack.n3.handle(top) [label="tool call"]
}
stack { # Containment Stack
n3: rectangle label:"L1 - Sandbox Ephemeral Container"
n4: rectangle label:"L2 - Whitelist Permission Check"
n7: diamond label:"Dangerous Operation?"
n8: rectangle label:"L3 - Human Approves"
n9: rectangle label:"Execute Tool"
n10: rectangle label:"L4 - Audit to Immutable SIEM"
n3.handle(right) -> n4.handle(left)
n4.handle(right) -> n7.handle(left)
n7.handle(right) -> n8.handle(left) [label="yes"]
n7.handle(bottom) -> n9.handle(top) [label="no"]
n8.handle(right) -> n9.handle(left)
n9.handle(right) -> n10.handle(left)
n10.handle(top) -> agent.n5.handle(bottom) [label="result"]
n5.handle(right) -> n6.handle(left)
}
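The flow in this diagram can be expressed as a single function that runs every tool call through the four layers in order. Each layer is passed in as a callable, and all names here are hypothetical. One deliberate difference from the diagram: this sketch also audits denials, in line with the Layer 4 pattern of flagging denied permissions as security events.

```python
def contained_call(tool, arg, *, in_sandbox, permitted, is_dangerous,
                   human_approves, execute, audit):
    """Run one tool call through sandbox, permissions, HITL, then audit."""
    if not in_sandbox(arg):                       # L1: sandbox scope
        outcome = "denied: outside sandbox"
    elif not permitted(tool, arg):                # L2: allowlist check
        outcome = "denied: not permitted"
    elif is_dangerous(tool) and not human_approves(tool, arg):  # L3: HITL
        outcome = "denied: human rejected"
    else:
        outcome = execute(tool, arg)
    audit(tool, arg, outcome)                     # L4: always logged
    return outcome
```

Because the human is only consulted when `is_dangerous` returns true, low-risk calls proceed without a prompt while every outcome still reaches the audit sink.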
What Works vs. What Breaks
| Approach | Works When | Breaks When | Ecosystem Evidence |
|---|---|---|---|
| Sandbox-only | Agents are stateless, read-only | Agent needs persistent state or DB access | Anthropic: sandbox alone insufficient, June 2026 |
| Permissions-only | Tool surface is small and stable | New tools added without updating whitelist | OpenAI: delegation chains must scope down, Jan 2026 |
| HITL-only | Operations are infrequent | Agent makes 50+ tool calls/task (fatigue) | Anthropic: postmortem Sept 2025 on auto-mode fatigue |
| Audit-only | You have a dedicated security team | Logs are never reviewed (security theater) | Microsoft Red Team: 40% of attacks only caught by audit, Feb 2026 |
| 4-Layer Stack | You're running agents in production | — (this is the target state) | All five ecosystems |
The lesson from these leading ecosystems: no single layer is enough. Sandboxing without permissions is a cardboard box. Permissions without HITL is a policy nobody reads. HITL without audit logging means you'll never know what you approved.
What This Means for FlowZap's Architecture
My own learning: The four containment layers map directly to my agent orchestrator:
| Containment Layer | FlowZap Implementation | Status |
|---|---|---|
| L1 Sandbox | MCP server secureFetch() wrapper (SSRF protection, ALLOWED_HOSTS, timeouts) | In place |
| L2 Permissions | Profile-scoped skills (marie-pierre, code, securite, qa) — each with minimal tool access | In place |
| L3 HITL | Cron → Idea Scout → Human approval → Writer pipeline | Built this week |
| L4 Audit | Hermes cron logs → session DB → Telegram delivery | In place |
The missing piece: cross-profile permission scoping. When my senior-dev (code profile) delegates to security-auditor (securite profile), the child agent currently inherits full parent permissions. OpenAI's delegation chain principle says the child must have strictly fewer permissions. This is a gap I need to address.
The Bottom Line
- Start with Layer 1 (sandboxing) today. If your agent runs in the same environment as your production database, fix that before anything else.
- Layers 2 and 3 can be implemented incrementally. Whitelist your tools. Add approval prompts for writes. You don't need a perfect system on day one. Anthropic took 14 months from Claude Code launch (Apr 2025) to the 4-layer post (Jun 2026).
- Layer 4 (audit) is the one most teams skip, and the one they wish they had during an incident. Log every tool call. Ship logs off-machine. Set up alerting rules for [SECURITY] events.
- Multi-agent systems multiply the blast radius. OpenAI's chain-of-delegation principle and Google's approval hierarchies are not optional when you have more than one agent in the loop.
Inspirations
- Anthropic Engineering — How we contain Claude across products, June 2026
- OpenAI — Practices for Governing Agentic AI Systems, January 2026
- Google DeepMind — Agent Safety Framework, March 2026
- Microsoft AI Red Team — Lessons from Securing Copilot Agents, February 2026
- OWASP Top 10 for LLM Applications v2.0, November 2025
All FlowZap diagrams generated with FlowZap Code. Copy any .fz block above and paste it into your FlowZap Account to view, edit, and share.
