Section 2 · Trust Architecture
Defense in depth, not in a single guardrail
Three validation phases. Each catches a different failure.
A single content filter is not a trust architecture. Wrap every consequential action in three checks — they are cheap to add, expensive to skip, and the failure modes they catch are non-overlapping.
“Should this action be attempted at all?”
Catches
- Out-of-policy actions (Zone violations from the previous slide)
- Schema violations and obviously malformed arguments
- Budget exhaustion: the agent has spent its allowance
- Prompt-injection attempts surfaced by the planner
- Stale context: the world has moved since the plan was made
Techniques
Schema validators · Policy engines (OPA-style) · Budget counters · Allow-listed tool sets per goal · Input sanitization
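A pre-flight gate combining three of these techniques (an allow-listed tool set per goal, a schema check, and a budget counter) might look like the sketch below. Everything named here is hypothetical and chosen for illustration: the `Action` type, the `authorize` function, the `refund_customer` goal, and the tool names. A production system would more likely delegate these checks to a real schema validator and policy engine.

```python
from dataclasses import dataclass


@dataclass
class Action:
    tool: str
    args: dict
    estimated_cost: float


# Hypothetical policy data: allowed tools per goal and expected argument types.
ALLOWED_TOOLS = {"refund_customer": {"lookup_order", "issue_refund"}}
ARG_SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}


def authorize(goal: str, action: Action, budget_left: float) -> list[str]:
    """Return the reasons to reject the action; an empty list means allowed."""
    reasons = []

    # Allow-list check: is this tool permitted for this goal at all?
    if action.tool not in ALLOWED_TOOLS.get(goal, set()):
        reasons.append(f"tool {action.tool!r} is not allow-listed for goal {goal!r}")

    # Schema check: required arguments present, typed correctly, no extras.
    schema = ARG_SCHEMAS.get(action.tool, {})
    for name, expected in schema.items():
        if name not in action.args:
            reasons.append(f"missing argument {name!r}")
        elif not isinstance(action.args[name], expected):
            reasons.append(f"argument {name!r} should be {expected.__name__}")
    for name in action.args:
        if name not in schema:
            reasons.append(f"unexpected argument {name!r}")

    # Budget check: refuse actions the remaining allowance cannot cover.
    if action.estimated_cost > budget_left:
        reasons.append("budget exhausted")

    return reasons
```

Returning every reason rather than stopping at the first keeps the rejection auditable: a single log entry shows all the ways an action fell outside policy.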
“Is the action behaving as expected as it runs?”
Catches
- Tool calls returning unexpected types or sizes
- Latency or error spikes from downstream systems
- The agent retrying a failing call too quickly
- Drift between what was planned and what is being executed
- Cost or token consumption exceeding the trajectory budget
Techniques
Circuit breakers · Adaptive timeouts · Token / cost meters · Step-level signed receipts · Anomaly detection on traces
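Two of these runtime techniques are small enough to sketch directly: a circuit breaker that stops the agent from hammering a failing tool, and a cost meter that aborts a trajectory once it overspends. The class names, thresholds, and the injectable `clock` parameter are assumptions made for illustration, not a reference to any particular library.

```python
import time


class CircuitOpen(Exception):
    """Raised when calls are refused after repeated failures."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpen("downstream tool is cooling down")
            # Cooldown elapsed: allow one trial call (half-open state).
            # A single further failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result


class CostMeter:
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def charge(self, amount: float) -> None:
        """Record spend and abort once the trajectory budget is exceeded."""
        self.spent += amount
        if self.spent > self.budget:
            raise RuntimeError(
                f"trajectory budget exceeded: {self.spent:.2f} > {self.budget:.2f}"
            )
```

Wrapping each tool call as `breaker.call(tool, ...)` and calling `meter.charge(...)` after every step keeps both checks inside the execution loop, where these failures first become visible.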
“Did the outcome match the intent — and is the world still consistent?”
Catches
- State changes that violate invariants (negative balances, orphaned records)
- Outcomes that satisfy the literal goal but miss the intent
- Side effects in systems the agent was not supposed to touch
- Drift in agent quality over time (silent degradation)
- Patterns visible only across many runs (systemic bias, exploitation)
Techniques
Invariant checks on state diff · Independent evaluator model · Sample-based human review · Cross-run analytics · Replay for incident review
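An invariant check on a state diff can be as simple as comparing snapshots taken before and after the action. The sketch below assumes a toy state layout (`balances`, `orders`, `refunds`) and a hypothetical `check_invariants` helper. It covers three of the failures listed above: unexpected side effects, negative balances, and orphaned records.

```python
def check_invariants(before: dict, after: dict, allowed_keys: set) -> list[str]:
    """Compare two state snapshots and return any invariant violations."""
    violations = []

    # Side effects: any top-level key that changed outside the allowed set.
    changed = {
        key for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }
    for key in sorted(changed - allowed_keys):
        violations.append(f"unexpected side effect on {key!r}")

    # Invariant: no account balance may go negative.
    for account, balance in after.get("balances", {}).items():
        if balance < 0:
            violations.append(f"negative balance for {account!r}")

    # Invariant: every refund must point at an existing order.
    known_orders = set(after.get("orders", []))
    for refund in after.get("refunds", []):
        if refund["order_id"] not in known_orders:
            violations.append(f"orphaned refund for order {refund['order_id']!r}")

    return violations
```

Checks like these catch only mechanical violations. Outcomes that satisfy the literal goal but miss the intent still need the evaluator model or human review listed above.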
Pattern to remember: Plan → Authorize → Execute → Verify. Authorization is the pre-flight gate. Verification is the post-flight gate. The agent's loop runs between the gates, not around them.
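As a minimal sketch of that pattern, the loop below runs each planned step between the two gates, so no step can execute without passing authorization and no result is accepted without passing verification. The `run_gated` function, the `Rejected` exception, and the three callables are hypothetical names. How a real system recovers from a rejection (replan, escalate, roll back) is left out.

```python
from typing import Callable


class Rejected(Exception):
    """Raised when a gate refuses an action or its outcome."""


def run_gated(
    plan: list,
    authorize: Callable[[object], bool],
    execute: Callable[[object], object],
    verify: Callable[[object, object], bool],
) -> list:
    """Run each planned step between a pre-flight and a post-flight gate."""
    results = []
    for step in plan:
        # Pre-flight gate: refuse before any side effect can happen.
        if not authorize(step):
            raise Rejected(f"pre-flight gate refused step {step!r}")

        outcome = execute(step)

        # Post-flight gate: the outcome counts only if verification passes.
        if not verify(step, outcome):
            raise Rejected(f"post-flight gate refused outcome of step {step!r}")

        results.append(outcome)
    return results
```

Because `execute` is reachable only through `run_gated`, the agent has no code path that goes around the gates, which is the property the pattern is meant to guarantee.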