Safety Patterns

These safety controls limit the harm an agent can cause while preserving useful autonomy, so that one bad run does not cascade into a wider failure.

Sandboxing strategies

  • Isolate untrusted execution in ephemeral containers or VMs.
  • Use network egress allowlists and secret-scoped credentials.
  • Apply filesystem and syscall restrictions for tool runtimes.
  • Separate staging and production permissions so a compromise in one environment cannot spread laterally to the other.
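
As an illustration of the egress allowlist idea, here is a minimal Python sketch of an application-level check a tool runtime might run before making an outbound request. The host names and the `is_egress_allowed` helper are hypothetical, and a check like this would normally complement, not replace, enforcement at the network layer.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; real deployments would load this from reviewed config.
ALLOWED_HOSTS = frozenset({"api.internal.example", "artifacts.internal.example"})


def is_egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS URLs whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # hostname strips any userinfo and port, and is already lowercased.
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

Exact host matching is used deliberately: substring or suffix matching is easy to bypass with lookalike domains or with the allowed name placed in the userinfo or query part of the URL.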

Rate limiting and resource caps

  • Per-user, per-agent, and per-tool request limits.
  • Token, CPU, memory, and execution time budgets.
  • Spend controls for paid APIs and external actions.
  • Adaptive throttling that tightens limits when anomaly detection signals fire.
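
One common way to implement per-user, per-agent, and per-tool request limits is a token bucket per key. The sketch below is one possible shape, not a prescribed design: the class names, the `(user, agent, tool)` key, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each request spends tokens."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False


class LimiterRegistry:
    """Keeps one bucket per (user, agent, tool) key (hypothetical keying scheme)."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[Tuple[str, str, str], TokenBucket] = {}

    def allow(self, user: str, agent: str, tool: str, cost: float = 1.0) -> bool:
        key = (user, agent, tool)
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[key] = bucket
        return bucket.allow(cost)
```

The `cost` argument lets the same mechanism meter tokens or spend rather than plain request counts. Adaptive throttling could be layered on by lowering `rate_per_sec` when an anomaly signal fires.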

Rollback and kill-switch patterns

  • Atomic changes with idempotent compensation handlers.
  • Global kill-switch and scoped feature flags for containment.
  • Automated rollback triggers tied to error and risk thresholds.
  • Operator runbooks that define safe restart criteria.
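
The kill-switch and compensation patterns above can be combined in a small runner. This is a sketch under stated assumptions: `KillSwitch`, `KillSwitchEngaged`, and `run_with_compensation` are hypothetical names, state is held in memory, and compensation handlers are assumed to be idempotent so they are safe to retry.

```python
from typing import Callable, List, Optional, Set, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensate)


class KillSwitchEngaged(RuntimeError):
    """Raised when a run is stopped by the global or a scoped switch."""


class KillSwitch:
    def __init__(self) -> None:
        self._global = False
        self._scopes: Set[str] = set()

    def engage(self, scope: Optional[str] = None) -> None:
        # No scope means the global switch: everything stops.
        if scope is None:
            self._global = True
        else:
            self._scopes.add(scope)

    def release(self, scope: Optional[str] = None) -> None:
        if scope is None:
            self._global = False
        else:
            self._scopes.discard(scope)

    def is_engaged(self, scope: str) -> bool:
        return self._global or scope in self._scopes


def run_with_compensation(steps: List[Step], switch: KillSwitch, scope: str) -> None:
    """Run actions in order; on any failure, undo completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            # Check the switch before every step so containment takes effect mid-run.
            if switch.is_engaged(scope):
                raise KillSwitchEngaged(scope)
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise
```

Re-raising after compensation keeps the original error visible to whatever drives automated rollback triggers or operator alerts.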

Testing dangerous operations safely

Never validate high-risk pathways directly in production. Use staged rehearsals with synthetic data, enforced isolation, and explicit go/no-go safety checks.

# Dangerous Operation Test Plan (Staging)

1) Build synthetic test data with no customer-sensitive payloads.
2) Enable strict sandbox profile (no external writes, limited network).
3) Run canary scenarios with capped budgets and request rates.
4) Inject failures (timeouts, malformed inputs, policy violations).
5) Validate containment, rollback, and escalation behavior.
6) Promote only after all safety assertions pass.
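
Step 6 of the plan can be made mechanical with an explicit go/no-go gate. The following Python sketch assumes a hypothetical `RunReport` produced by the staging run; the field names and the specific assertions are illustrative and would need to match whatever a real harness records.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RunReport:
    """Hypothetical summary of one staged rehearsal."""
    external_writes: int
    budget_spent: float
    budget_cap: float
    rollback_succeeded: bool
    escalation_sent: bool


# Each named assertion must hold for the run to be promoted.
SAFETY_ASSERTIONS: Dict[str, Callable[[RunReport], bool]] = {
    "no external writes": lambda r: r.external_writes == 0,
    "budget within cap": lambda r: r.budget_spent <= r.budget_cap,
    "rollback succeeded": lambda r: r.rollback_succeeded,
    "escalation sent": lambda r: r.escalation_sent,
}


def go_no_go(report: RunReport,
             assertions: Dict[str, Callable[[RunReport], bool]] = SAFETY_ASSERTIONS
             ) -> Tuple[bool, List[str]]:
    """Return (go, failed_assertion_names); go is True only if nothing failed."""
    failed = [name for name, check in assertions.items() if not check(report)]
    return (not failed, failed)
```

Returning the names of the failed assertions, rather than a bare boolean, gives operators a concrete reason for every no-go decision.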

Incident response playbook template

  • Detect and classify the incident severity.
  • Contain impact by disabling risky pathways.
  • Preserve evidence and establish a timeline.
  • Remediate root cause and validate fix in staging.
  • Communicate status updates and complete post-incident review.

Incident ID:
Severity (P0-P3):
Detected At:
Owner:

Impact Summary:
Affected Systems:
Customer Impact:

Immediate Containment Actions:
Evidence Preserved:

Root Cause:
Corrective Actions:
Preventive Actions:

Communication Log:
Post-Incident Review Date: