Safety

Safety is the set of constraints that ensure an agent's autonomy serves humans well.

Principles

Safe by default: When uncertain, don't act — ask first.
Minimize blast radius: Prefer reversible actions over irreversible ones. Soft delete before hard delete. Dry run before execute.
Escalate, don't suppress: If the agent can't handle something safely, escalate to a human instead of swallowing the error or quietly working around it.
Verifiable by design: Safety constraints should be structural properties of the architecture, not just statements in documentation. If a claim can't be verified by inspecting the code, it's a promise, not a fact.
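
One way to make the first three principles structural, in the spirit of "verifiable by design", is to route every action through a single gate whose rules can be checked by reading code. This is a minimal sketch under stated assumptions: `Action`, `ActionGate`, the self-reported `confidence` score and the 0.9 threshold are hypothetical names and values, not part of any particular framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Decision(Enum):
    EXECUTED = "executed"
    ASKED = "asked"          # paused to ask the user first
    ESCALATED = "escalated"  # handed off to a human


@dataclass
class Action:
    name: str
    reversible: bool
    confidence: float               # hypothetical self-estimate in [0, 1]
    dry_run: Callable[[], bool]     # returns True if the simulated run looks safe
    execute: Callable[[], None]


@dataclass
class ActionGate:
    """Hypothetical single choke point that every agent action passes through."""
    min_confidence: float = 0.9     # assumed threshold, tune per deployment
    log: list = field(default_factory=list)

    def run(self, action: Action) -> Decision:
        # Safe by default: when uncertain, ask instead of acting.
        if action.confidence < self.min_confidence:
            return self._record(action, Decision.ASKED)
        # Minimize blast radius: irreversible actions must pass a dry run first.
        if not action.reversible and not action.dry_run():
            # Escalate, don't suppress: a failed dry run goes to a human.
            return self._record(action, Decision.ESCALATED)
        action.execute()
        return self._record(action, Decision.EXECUTED)

    def _record(self, action: Action, decision: Decision) -> Decision:
        self.log.append((action.name, decision))
        return decision
```

If the gate really is the only path to `execute`, a reviewer can confirm from one class that low-confidence actions never run unattended and that irreversible actions are dry-run first. Soft delete fits the same pattern: model deletion as a reversible "mark as deleted" action and reserve the irreversible purge for a later, separately gated step.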

Sanitize and constrain input: guard against injection, validate files, and rate-limit requests. Before presenting output, check it for leaked PII and known vulnerabilities.
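
The input and output checks can be sketched as three small pieces: a sliding-window rate limiter, an input check, and a PII redaction pass over output. This is a minimal sketch under loud assumptions: the regexes, the file-extension allowlist and every name here are hypothetical, pattern matching alone is not an adequate defense against prompt injection or PII leakage, and the known-vulnerability check is not shown.

```python
import os
import re
import time
from collections import deque
from typing import List, Optional

# Illustrative patterns only (assumptions); real detection needs much more.
INJECTION_MARKERS = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ALLOWED_EXTENSIONS = {".txt", ".csv", ".json"}


class RateLimiter:
    """Sliding window: allow at most max_calls per window_s seconds."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = deque()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_s:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


def check_input(text: str, filename: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means the input passed."""
    problems = []
    if INJECTION_MARKERS.search(text):
        problems.append("possible prompt injection")
    if filename is not None:
        ext = os.path.splitext(filename)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"file type not allowed: {filename}")
    return problems


def redact_pii(output: str) -> str:
    """Mask emails and US SSN-shaped numbers before output is shown."""
    output = EMAIL.sub("[REDACTED EMAIL]", output)
    return US_SSN.sub("[REDACTED SSN]", output)
```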

When refusing a request, be clear about why and offer alternatives.

Graceful shutdown: stop accepting new actions, finish in-progress work that is safe to complete, preserve state, and notify the user.
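
The four shutdown steps map directly onto code. In this hypothetical single-threaded sketch, `Agent`, `Task`, the `safe_to_finish` flag and the JSON state file are illustrative assumptions; submitted-but-unfinished tasks stand in for in-progress work, and a real agent would also have to deal with work running on other threads or processes.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str
    safe_to_finish: bool        # assumption: each task declares this up front
    run: Callable[[], None]


@dataclass
class Agent:
    state_path: str
    notify: Callable[[str], None]
    in_progress: List[Task] = field(default_factory=list)
    accepting: bool = True

    def submit(self, task: Task) -> bool:
        if not self.accepting:
            return False        # new actions are refused once shutdown starts
        self.in_progress.append(task)
        return True

    def shutdown(self) -> None:
        self.accepting = False                  # 1. stop new actions
        finished, deferred = [], []
        for task in self.in_progress:           # 2. finish only safe work
            if task.safe_to_finish:
                task.run()
                finished.append(task.name)
            else:
                deferred.append(task.name)
        self.in_progress.clear()
        with open(self.state_path, "w") as f:   # 3. preserve state
            json.dump({"finished": finished, "deferred": deferred}, f)
        self.notify(f"Shut down: finished {finished}, deferred {deferred}")  # 4. notify
```

Work that isn't safe to finish is recorded as deferred rather than silently dropped, so a later run or a human can pick it up from the saved state.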

Circuit breakers: after N consecutive errors, pause and alert. Don't retry indefinitely.
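
A circuit breaker for this rule can be a small wrapper that counts consecutive failures. In this sketch, `CircuitBreaker`, `CircuitOpen` and the `alert` callback are hypothetical names, and `max_consecutive_errors` plays the role of N.

```python
from typing import Callable


class CircuitOpen(Exception):
    """Raised while the breaker is tripped and calls are paused."""


class CircuitBreaker:
    def __init__(self, max_consecutive_errors: int, alert: Callable[[str], None]):
        self.max_errors = max_consecutive_errors  # the "N" in the rule
        self.alert = alert
        self.consecutive_errors = 0
        self.open = False

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise CircuitOpen("paused after repeated errors; waiting for a human")
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_errors:
                self.open = True  # pause: no further calls go through
                self.alert(f"circuit opened after {self.consecutive_errors} "
                           f"consecutive errors: {exc!r}")
            raise                 # surface the error instead of suppressing it
        self.consecutive_errors = 0  # any success resets the streak
        return result

    def reset(self) -> None:
        """Resume after a human has looked at the problem."""
        self.consecutive_errors = 0
        self.open = False
```

Many service-level circuit breakers move to a "half-open" state and retry automatically after a timeout; this one deliberately stays open until `reset()` is called, which matches the pause-and-alert behavior described here and keeps a human in the recovery loop.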