Safety
Safety is the set of constraints that ensure an agent's autonomy serves humans well.
Principles
- Safe by default: When uncertain, don't act — ask first.
- Minimize blast radius: Prefer reversible over irreversible. Soft delete before hard delete. Dry run before execute.
- Escalate, don't suppress: Can't handle it safely? Escalate to a human.
- Verifiable by design: Safety constraints should be structural properties of the architecture, not just statements in documentation. If a claim can't be verified by inspecting the code, it's a promise, not a fact.
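The "minimize blast radius" principle can be made structural rather than documentary. The following is a minimal sketch, assuming a hypothetical in-memory `Store` class (not part of any real library): dry run is the default, and deletion moves an item to a trash area that `restore` can undo.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory store used only for illustration."""
    items: dict = field(default_factory=dict)
    trash: dict = field(default_factory=dict)

    def delete(self, key: str, dry_run: bool = True) -> str:
        # Dry run is the default, so a caller must opt in to a real change.
        if key not in self.items:
            return f"no-op: {key!r} not found"
        if dry_run:
            return f"would move {key!r} to trash"
        # Soft delete: move the item aside instead of destroying it.
        self.trash[key] = self.items.pop(key)
        return f"moved {key!r} to trash"

    def restore(self, key: str) -> None:
        # Reversal path that keeps the soft delete recoverable.
        self.items[key] = self.trash.pop(key)
```

Because the safe behavior is the default argument, a reviewer can verify it by reading the signature instead of trusting a promise in the documentation.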
Content safety
Sanitize input: guard against injection, validate files, and rate-limit requests. Before presenting output, check it for leaked PII and known vulnerabilities.
When refusing a request, be clear about why and offer alternatives.
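An output check for leaked PII might look like the sketch below. It assumes, purely for illustration, that PII means email addresses and US-style Social Security numbers. The `redact_pii` helper and both patterns are hypothetical and far from exhaustive, so a real deployment would likely need a broader detector.

```python
import re
from typing import Tuple

# Assumption: only these two patterns count as PII in this sketch.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> Tuple[str, int]:
    """Return the redacted text and the number of matches replaced."""
    text, n_email = EMAIL.subn("[REDACTED EMAIL]", text)
    text, n_ssn = SSN.subn("[REDACTED SSN]", text)
    return text, n_email + n_ssn
```

Returning the match count lets the caller decide whether to log the event or block the output entirely, rather than redacting silently.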
Operational safety
- Confirm high-impact actions explicitly
- Grant the least privilege each task requires
- Log what was done, who authorized it, when, and what was affected
- Ensure rollback capability for every automated action
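The confirmation, logging, and rollback items above can be combined in one small wrapper. This is a sketch under stated assumptions: `Action`, `AuditEntry`, and `run_action` are hypothetical names, the audit log is a plain list, and `confirm` stands in for whatever prompt a real system would show a human.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class AuditEntry:
    action: str          # what was done
    authorized_by: str   # who authorized it
    timestamp: str       # when (UTC, ISO 8601)
    affected: List[str]  # what was affected


@dataclass
class Action:
    name: str
    affected: List[str]
    high_impact: bool
    apply: Callable[[], None]
    rollback: Callable[[], None]  # every action must ship with an undo


def run_action(action: Action, user: str,
               confirm: Callable[[Action], bool],
               log: List[AuditEntry]) -> bool:
    # High-impact actions need an explicit yes from the confirm callback.
    if action.high_impact and not confirm(action):
        return False
    action.apply()
    now = datetime.now(timezone.utc).isoformat()
    log.append(AuditEntry(action.name, user, now, list(action.affected)))
    return True
```

Making `rollback` a required field means an action without an undo path cannot be constructed, which turns the rollback rule into a structural check.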
Failure modes
Graceful shutdown: stop new actions, complete safe in-progress work, preserve state, notify the user.
Circuit breakers: after N consecutive errors, pause and alert. Don't retry indefinitely.
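A circuit breaker of the kind described can be sketched as follows. The `CircuitBreaker` class, its `max_failures` parameter (the N above), and the `alert` callback are illustrative assumptions. This version stays paused until a human calls `reset`, which is one possible policy among several.

```python
class CircuitBreaker:
    """Pause after max_failures consecutive errors and alert once."""

    def __init__(self, max_failures, alert):
        self.max_failures = max_failures
        self.alert = alert  # hypothetical callback, e.g. page an operator
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:
            # Refuse new work instead of retrying indefinitely.
            raise RuntimeError("circuit open: paused until reset")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True
                self.alert(f"paused after {self.failures} consecutive errors")
            raise
        # Any success resets the count, so only consecutive errors trip it.
        self.failures = 0
        return result

    def reset(self):
        self.failures = 0
        self.open = False
```

Re-raising the original exception keeps the failure visible to the caller, in line with "escalate, don't suppress".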
For agents
- Start with a threat model
- Apply least privilege from day one
- Require confirmation for destructive actions
- Test failure modes, not just happy paths
- Have a kill switch
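One simple way to build a kill switch is a shared flag that the agent checks before each new action. The sketch below uses `threading.Event` from the Python standard library. The `run_agent` function and its task list are hypothetical, and a real agent would also need to preserve state and notify the user on exit.

```python
import threading


def run_agent(tasks, kill_switch: threading.Event):
    """Run tasks in order, stopping before the next one once the switch is set."""
    done = []
    for task in tasks:
        # Checked before each new action; work already in progress finishes.
        if kill_switch.is_set():
            break
        done.append(task())
    return done
```

An operator thread or signal handler can call `kill_switch.set()` at any time. The agent stops starting new work but does not abandon a task halfway, which matches the graceful shutdown behavior described earlier.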