---
title: Safety
description: Content moderation, harmful content detection, and safe defaults
tags: [safety, moderation, trust]
dependencies: [identity, privacy]
---

# Safety

Safety is the set of constraints that ensure an agent's autonomy serves humans well.

## Principles

- **Safe by default**: When uncertain, don't act; ask first.
- **Minimize blast radius**: Prefer reversible over irreversible. Soft delete before hard delete. Dry run before execute.
- **Escalate, don't suppress**: If the agent can't handle a situation safely, it escalates to a human instead of hiding the problem or pressing on.
- **Verifiable by design**: Safety constraints should be structural properties of the architecture, not just statements in documentation. If a claim can't be verified by inspecting the code, it's a promise, not a fact.
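
As an illustration of "minimize blast radius", here is a minimal sketch of a delete operation that defaults to a dry run and performs a soft delete when actually executed. The `Store` class, its `trash` field, and the `dry_run` parameter are hypothetical names chosen for this example, not part of any existing API.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory store; all names are illustrative."""
    items: dict = field(default_factory=dict)
    trash: dict = field(default_factory=dict)

    def delete(self, key: str, dry_run: bool = True) -> str:
        if key not in self.items:
            return f"no-op: {key} not found"
        if dry_run:
            # Dry run: report what would happen without changing anything.
            return f"would move {key} to trash"
        # Soft delete: keep the value so the action stays reversible.
        self.trash[key] = self.items.pop(key)
        return f"moved {key} to trash"

    def restore(self, key: str) -> None:
        # Undo a soft delete.
        self.items[key] = self.trash.pop(key)
```

Making `dry_run=True` the default means a caller has to opt in to the state-changing path, which lines up with "safe by default" as well.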

## Content safety

Sanitize input: prevent injection, validate files, and rate-limit requests. Before presenting output, check it for leaked PII and known vulnerabilities.
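
The output check for leaked PII could be sketched roughly as follows. The two regular expressions (email addresses and US-style social security numbers) and the `redact_pii` helper are assumptions made for this example; they are deliberately narrow and would miss most real-world PII.

```python
import re
from typing import Tuple

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> Tuple[str, int]:
    """Return the redacted text and the number of replacements made."""
    text, n_email = EMAIL.subn("[REDACTED_EMAIL]", text)
    text, n_ssn = SSN.subn("[REDACTED_SSN]", text)
    return text, n_email + n_ssn
```

Returning the replacement count lets the caller decide whether a non-zero result should also be logged or escalated rather than silently redacted.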

When refusing a request, be clear about why and offer alternatives.

## Operational safety

- Confirm high-impact actions explicitly
- Grant least privilege for all permissions
- Log what was done, who authorized it, when, and what was affected
- Ensure rollback capability for every automated action
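
One way to combine explicit confirmation with audit logging is sketched below. `run_action`, `AUDIT_LOG`, and the `confirm` callback are hypothetical names; an actual system would likely write to durable, append-only storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # hypothetical sink; stands in for durable storage


def run_action(name, target, authorized_by, action,
               high_impact=False, confirm=None):
    """Run `action` if allowed, and record what, who, when, and what was affected."""
    if high_impact and not (confirm and confirm(name, target)):
        # No explicit confirmation: refuse instead of acting.
        status = "refused"
    else:
        action()
        status = "done"
    AUDIT_LOG.append(json.dumps({
        "action": name,
        "affected": target,
        "authorized_by": authorized_by,
        "at": datetime.now(timezone.utc).isoformat(),
        "status": status,
    }))
    return status
```

Refusals are logged too, so the audit trail shows attempted high-impact actions as well as completed ones.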

## Failure modes

Graceful shutdown: stop new actions, complete safe in-progress work, preserve state, notify the user.
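
The four shutdown steps can be sketched as a single method. This is a simplified, single-threaded illustration; the `Agent` class, the `safe_to_finish` flag, and the injected `save_state` and `notify` callables are all assumptions made for the example.

```python
class Agent:
    """Hypothetical agent that queues actions and can shut down gracefully."""

    def __init__(self, notify=print):
        self.accepting = True
        self.in_progress = []  # (name, safe_to_finish, fn) tuples
        self.completed = []
        self.notify = notify

    def submit(self, name, fn, safe_to_finish=True):
        if not self.accepting:
            raise RuntimeError("shutting down; not accepting new actions")
        self.in_progress.append((name, safe_to_finish, fn))

    def shutdown(self, save_state):
        self.accepting = False                    # 1. stop new actions
        pending = []
        for name, safe, fn in self.in_progress:   # 2. finish only safe work
            if safe:
                fn()
                self.completed.append(name)
            else:
                pending.append(name)
        self.in_progress = []
        # 3. preserve state so unfinished work can be resumed or reviewed
        save_state({"completed": self.completed, "pending": pending})
        # 4. notify the user
        self.notify(f"shut down; {len(pending)} action(s) left pending")
```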

Circuit breakers: after N consecutive errors, pause and alert. Don't retry indefinitely.
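
A minimal circuit breaker along these lines might look like the sketch below. The class and parameter names are hypothetical, and this version requires a manual `reset()` rather than a timed half-open state, which is a simplification of the usual pattern.

```python
class CircuitOpen(Exception):
    """Raised when the breaker has tripped and calls are paused."""


class CircuitBreaker:
    def __init__(self, max_consecutive_errors, alert=print):
        self.max_errors = max_consecutive_errors  # the "N" in the text above
        self.alert = alert
        self.errors = 0
        self.open = False

    def call(self, fn, *args, **kwargs):
        if self.open:
            # Paused: refuse to run until someone resets the breaker.
            raise CircuitOpen("paused; needs a manual reset")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.open = True
                self.alert(f"circuit opened after {self.errors} consecutive errors")
            raise
        self.errors = 0  # any success resets the streak
        return result

    def reset(self):
        self.errors = 0
        self.open = False
```

Because the counter resets on every success, only an unbroken run of failures trips the breaker; intermittent errors do not.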

## For agents

1. Start with a threat model
2. Apply least privilege from day one
3. Require confirmation for destructive actions
4. Test failure modes, not just happy paths
5. Have a kill switch
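
For the last item, a kill switch can be as simple as a shared flag that is checked before every step. The sketch below uses `threading.Event` from the standard library; `KILL` and `run_steps` are hypothetical names, and a real deployment would also need a way for a human to set the flag from outside the process.

```python
import threading

KILL = threading.Event()  # set by an operator or a supervising thread


def run_steps(steps):
    """Run callables in order, stopping before the next one once KILL is set."""
    done = []
    for step in steps:
        if KILL.is_set():
            break
        done.append(step())
    return done
```

The check happens between steps, so a step that is already running finishes; interrupting work mid-step would need cooperation from the step itself.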