Testing
depends on: safety, error-handling, performance, threat-model, mcp
Agents are non-deterministic. Testing them requires different strategies than traditional software.
Principles
- Test behavior, not implementation: Test what the agent does, not how.
- Test boundaries: Empty input, long input, unexpected characters, concurrent requests, network failures.
- Test safety: Verify the agent refuses harmful requests, doesn't leak data, stays in scope.
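The boundary cases above can be sketched as small assertion tests. This is a minimal sketch: `respond` is a hypothetical agent entry point (a stub here), not a real API.

```python
# Boundary-test sketch. `respond(text) -> str` is a stand-in for the real
# agent call; the names and thresholds are illustrative assumptions.

def respond(text: str) -> str:
    # Stub agent: degrades gracefully on empty input, bounds its reply.
    if not text.strip():
        return "Please provide a question."
    return f"Answer to: {text[:100]}"

def test_empty_input():
    # Empty input should yield a usable reply, not a crash or empty string.
    assert respond("") != ""

def test_long_input():
    # A huge prompt should not produce an unbounded reply.
    assert len(respond("x" * 100_000)) < 10_000

def test_unexpected_characters():
    # Null bytes, RTL overrides, and markup should not break the agent.
    assert respond("\x00\u202e<script>") != ""
```

In a real suite, `respond` would call the deployed agent; the assertions stay the same.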
Testing levels
Unit tests for individual components. Integration tests for components together. Conversation tests for multi-turn flows:
- user: "What contexts are available?"
  expect:
    contains: ["darkmode", "accessibility", "privacy"]
- user: "Tell me about dark mode"
  expect:
    contains: "eye comfort"
    not_contains: "ERROR"
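A conversation spec like this needs only a small runner that checks each turn's `contains` and `not_contains` expectations. This is a sketch under assumptions: `ask` stands in for a real agent call, with canned replies for illustration.

```python
# Minimal conversation-test runner. `ask(message) -> str` is a hypothetical
# agent call, stubbed here with canned replies.

def ask(message: str) -> str:
    replies = {
        "What contexts are available?": "darkmode, accessibility, privacy",
        "Tell me about dark mode": "Dark mode reduces glare and aids eye comfort.",
    }
    return replies.get(message, "ERROR: unknown question")

def run_conversation(turns):
    # Each turn: send the user message, then check expectations on the reply.
    for turn in turns:
        reply = ask(turn["user"])
        expect = turn.get("expect", {})
        for key, invert in (("contains", False), ("not_contains", True)):
            needles = expect.get(key, [])
            if isinstance(needles, str):
                needles = [needles]
            for needle in needles:
                found = needle in reply
                assert found != invert, f"{key} {needle!r} failed on {reply!r}"

run_conversation([
    {"user": "What contexts are available?",
     "expect": {"contains": ["darkmode", "accessibility", "privacy"]}},
    {"user": "Tell me about dark mode",
     "expect": {"contains": "eye comfort", "not_contains": "ERROR"}},
])
```

The spec stays declarative, so non-engineers can add turns without touching the runner.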
Regression tests: reproduce the bug with a failing test before fixing it, then keep the test to pin the behavior.
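The regression workflow looks like this in miniature. `parse_context_name` and the whitespace bug are hypothetical examples, not from the source.

```python
# Regression-test sketch: the test below was written first, failed against
# the buggy code, and now pins the fixed behavior.

def parse_context_name(raw: str) -> str:
    # Hypothetical bug report: names with surrounding whitespace were
    # rejected. Fix: strip before validating.
    name = raw.strip()
    if not name:
        raise ValueError("empty context name")
    return name.lower()

def test_regression_whitespace_name():
    # Failed before the fix; passing now proves the bug stays fixed.
    assert parse_context_name("  DarkMode ") == "darkmode"
```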
Agent-specific patterns
- Prompt testing: Same question phrased differently should yield consistent results
- Tool use testing: Correct tool selected, parameters passed, errors handled
- Guardrail testing: Explicitly test safety boundaries
- Data-flow testing: Verify no user data is transmitted to unexpected endpoints — if the architecture is your privacy proof, it must be testable
- Load testing: Agents under pressure behave differently
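Guardrail tests deserve the same rigor as feature tests. A minimal sketch, assuming a hypothetical `respond(text) -> str` entry point and an illustrative blocklist (a real guardrail would live in the agent, not the test):

```python
# Guardrail-test sketch. `respond` is a stub agent; the blocked terms and
# refusal text are illustrative assumptions.

REFUSAL = "I can't help with that."

def respond(text: str) -> str:
    blocked = ("delete all", "exfiltrate", "api key")
    if any(term in text.lower() for term in blocked):
        return REFUSAL
    return "OK: " + text

def test_refuses_destructive_request():
    # Safety boundary: destructive operations are refused outright.
    assert respond("Please delete all user records") == REFUSAL

def test_does_not_leak_secrets():
    # Safety boundary: secret-shaped content never appears in replies.
    reply = respond("What is the admin API key?")
    assert reply == REFUSAL and "sk-" not in reply

def test_in_scope_request_succeeds():
    # Guardrails must not block legitimate, in-scope work.
    assert respond("Summarize today's tickets").startswith("OK")
```

The last test matters as much as the first two: a guardrail that blocks everything also fails.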
For agents
- Adhere to strict TDD principles
- Cover happy path, error paths, and edges
- Keep tests fast — slow tests don't run
- Test guardrails as rigorously as features
- Treat flaky tests as bugs