Testing
depends on: safety, error-handling, performance, threat-model, mcp
Agents are non-deterministic. Testing them requires different strategies than traditional software.
Principles
- Test behavior, not implementation: Test what the agent does, not how.
- Test boundaries: Empty input, long input, unexpected characters, concurrent requests, network failures.
- Test safety: Verify the agent refuses harmful requests, doesn't leak data, stays in scope.
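The boundary cases above can be sketched as small assertion tests. This is a minimal sketch: `respond` is a hypothetical agent entry point (a stub here), not a real API.

```python
# Boundary-test sketch. `respond(text) -> str` is a stand-in for the real
# agent call; the names and thresholds are illustrative assumptions.

def respond(text: str) -> str:
    # Stub agent: degrades gracefully on empty input, bounds its reply.
    if not text.strip():
        return "Please provide a question."
    return f"Answer to: {text[:100]}"

def test_empty_input():
    # Empty input should yield a usable reply, not a crash or empty string.
    assert respond("") != ""

def test_long_input():
    # A huge prompt should not produce an unbounded reply.
    assert len(respond("x" * 100_000)) < 10_000

def test_unexpected_characters():
    # Null bytes, RTL overrides, and markup should not break the agent.
    assert respond("\x00\u202e<script>") != ""
```

In a real suite, `respond` would call the deployed agent; the assertions stay the same.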
Testing levels
Unit tests for individual components. Integration tests for components together. Conversation tests for multi-turn flows:
- user: "What contexts are available?"
  expect:
    contains: ["darkmode", "accessibility", "privacy"]
- user: "Tell me about dark mode"
  expect:
    contains: "eye comfort"
    not_contains: "ERROR"
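A conversation spec like this needs only a small runner that checks each turn's `contains` and `not_contains` expectations. This is a sketch under assumptions: `ask` stands in for a real agent call, with canned replies for illustration.

```python
# Minimal conversation-test runner. `ask(message) -> str` is a hypothetical
# agent call, stubbed here with canned replies.

def ask(message: str) -> str:
    replies = {
        "What contexts are available?": "darkmode, accessibility, privacy",
        "Tell me about dark mode": "Dark mode reduces glare and aids eye comfort.",
    }
    return replies.get(message, "ERROR: unknown question")

def run_conversation(turns):
    # Each turn: send the user message, then check expectations on the reply.
    for turn in turns:
        reply = ask(turn["user"])
        expect = turn.get("expect", {})
        for key, invert in (("contains", False), ("not_contains", True)):
            needles = expect.get(key, [])
            if isinstance(needles, str):
                needles = [needles]
            for needle in needles:
                found = needle in reply
                assert found != invert, f"{key} {needle!r} failed on {reply!r}"

run_conversation([
    {"user": "What contexts are available?",
     "expect": {"contains": ["darkmode", "accessibility", "privacy"]}},
    {"user": "Tell me about dark mode",
     "expect": {"contains": "eye comfort", "not_contains": "ERROR"}},
])
```

The spec stays declarative, so non-engineers can add turns without touching the runner.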
Regression tests: reproduce the bug with a failing test before fixing it, then keep the test to pin the behavior.
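The regression workflow looks like this in miniature. `parse_context_name` and the whitespace bug are hypothetical examples, not from the source.

```python
# Regression-test sketch: the test below was written first, failed against
# the buggy code, and now pins the fixed behavior.

def parse_context_name(raw: str) -> str:
    # Hypothetical bug report: names with surrounding whitespace were
    # rejected. Fix: strip before validating.
    name = raw.strip()
    if not name:
        raise ValueError("empty context name")
    return name.lower()

def test_regression_whitespace_name():
    # Failed before the fix; passing now proves the bug stays fixed.
    assert parse_context_name("  DarkMode ") == "darkmode"
```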
Agent-specific patterns
- Prompt testing: Same question phrased differently should yield consistent results
- Tool use testing: Correct tool selected, parameters passed, errors handled
- Guardrail testing: Explicitly test safety boundaries
- Data-flow testing: Verify no user data is transmitted to unexpected endpoints — if the architecture is your privacy proof, it must be testable
- Load testing: Agents under pressure behave differently
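Guardrail tests deserve the same rigor as feature tests. A minimal sketch, assuming a hypothetical `respond(text) -> str` entry point and an illustrative blocklist (a real guardrail would live in the agent, not the test):

```python
# Guardrail-test sketch. `respond` is a stub agent; the blocked terms and
# refusal text are illustrative assumptions.

REFUSAL = "I can't help with that."

def respond(text: str) -> str:
    blocked = ("delete all", "exfiltrate", "api key")
    if any(term in text.lower() for term in blocked):
        return REFUSAL
    return "OK: " + text

def test_refuses_destructive_request():
    # Safety boundary: destructive operations are refused outright.
    assert respond("Please delete all user records") == REFUSAL

def test_does_not_leak_secrets():
    # Safety boundary: secret-shaped content never appears in replies.
    reply = respond("What is the admin API key?")
    assert reply == REFUSAL and "sk-" not in reply

def test_in_scope_request_succeeds():
    # Guardrails must not block legitimate, in-scope work.
    assert respond("Summarize today's tickets").startswith("OK")
```

The last test matters as much as the first two: a guardrail that blocks everything also fails.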
For agents
- Adhere to strict TDD principles
- Cover happy path, error paths, and edges
- Keep tests fast — slow tests don't run
- Test guardrails as rigorously as features
- Treat flaky tests as bugs