Four families, two metrics each. Quality and cost are obvious; risk and health are the ones most teams skip — and the ones that decide whether an agent ages well.
What: Share of runs where the agent reached the goal without escalation
Why: The headline number. If this is below baseline, nothing else matters.
Healthy looks like: Above the human baseline for the same workflow, measured on the same evaluation set.
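A minimal sketch of this metric, assuming each run is logged with two hypothetical boolean fields, `reached_goal` and `escalated`, and that the human baseline is a rate measured on the same evaluation set. The names are illustrative and not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Run:
    reached_goal: bool  # hypothetical field: the agent reached the goal
    escalated: bool     # hypothetical field: the run was escalated to a human


def completion_rate(runs):
    """Share of runs that reached the goal without escalation."""
    if not runs:
        raise ValueError("no runs to score")
    completed = sum(1 for r in runs if r.reached_goal and not r.escalated)
    return completed / len(runs)


def beats_baseline(runs, human_baseline):
    """True when the agent is strictly above the human baseline rate."""
    return completion_rate(runs) > human_baseline
```

Both rates should come from the same evaluation set; comparing the agent on one set against a baseline from another makes the check meaningless.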
What: Independent grader scoring whether the outcome served the user's actual intent
Why: Catches the agent that satisfies the literal goal in the wrong way.
Healthy looks like: Steady or improving. A drop here that completion rate doesn't show is a serious signal.
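One way to catch a drop that completion rate hides is to compare period-over-period changes in both numbers. This is a sketch under assumptions: the dictionary keys `completion_rate` and `grader_score` and the `tolerance` of 0.02 are hypothetical, and both values are taken to be rates between 0 and 1.

```python
def silent_quality_drop(prev, curr, tolerance=0.02):
    """Flag a grader-score drop that completion rate does not reflect.

    prev and curr are dicts with 'completion_rate' and 'grader_score'
    for two consecutive periods. tolerance is an assumed noise margin.
    """
    completion_delta = curr["completion_rate"] - prev["completion_rate"]
    grader_delta = curr["grader_score"] - prev["grader_score"]
    # Grader score fell beyond noise while completion rate held steady.
    return grader_delta < -tolerance and completion_delta >= -tolerance
```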
What: Total spend (tokens, tool calls, infra, evaluators) divided by successful outcomes
Why: The economic unit. The only number that says whether the agent is worth running.
Healthy looks like: Trending down over time as prompts, models, and tools improve.
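The arithmetic is simple enough to sketch directly. The four cost arguments mirror the spend categories listed above; the function name and the choice to return infinity when there are no successes are assumptions made for illustration.

```python
def cost_per_successful_outcome(token_cost, tool_cost, infra_cost,
                                evaluator_cost, successful_outcomes):
    """Total spend divided by the number of successful outcomes."""
    total_spend = token_cost + tool_cost + infra_cost + evaluator_cost
    if successful_outcomes <= 0:
        # Assumed convention: spend with no successes has unbounded unit cost.
        return float("inf")
    return total_spend / successful_outcomes
```

For example, 200.00 in total spend across 400 successful outcomes gives a unit cost of 0.50. Note that failed runs still count toward spend, which is what makes this stricter than cost per run.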
What: Median number of model calls, tool invocations, and tokens per run
Why: A leading indicator for cost. Spikes here precede cost spikes by days.
Healthy looks like: Stable or compressing. Sudden growth means something changed and warrants investigation.
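A sketch of how the medians and a growth check might be computed, assuming each run is a dictionary with hypothetical keys `model_calls`, `tool_calls`, and `tokens`. The 25% growth threshold is an arbitrary placeholder, not a recommendation.

```python
from statistics import median

FIELDS = ("model_calls", "tool_calls", "tokens")  # hypothetical log fields


def median_footprint(runs):
    """Median model calls, tool invocations, and tokens per run."""
    return {field: median(run[field] for run in runs) for field in FIELDS}


def grown_fields(current, baseline, max_growth=0.25):
    """Fields whose median grew by more than max_growth over the baseline."""
    return [
        field for field in FIELDS
        if current[field] > baseline[field] * (1 + max_growth)
    ]
```

Comparing against a stored baseline rather than the previous day keeps slow growth from resetting the reference point.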
What: Share of runs that required a human override, escalation, or rejection
Why: A workflow with rising interventions is regressing — even if completion rate looks fine.
Healthy looks like: Low and stable. A cliff downward is suspicious; a creep upward is a problem.
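The cliff and creep patterns can be told apart with a small trend check. This is one possible heuristic, not a standard method: a "cliff" is assumed to be a relative drop of at least 50% between the last two periods, and a "creep" is assumed to be three consecutive period-over-period increases. Both thresholds are hypothetical.

```python
def classify_intervention_trend(weekly_rates, cliff_drop=0.5, creep_weeks=3):
    """Classify intervention rates (oldest first) as cliff, creep, or stable."""
    if len(weekly_rates) >= 2:
        prev, last = weekly_rates[-2], weekly_rates[-1]
        # A sudden large relative drop may mean interventions stopped being logged.
        if prev > 0 and (prev - last) / prev >= cliff_drop:
            return "cliff"
    tail = weekly_rates[-(creep_weeks + 1):]
    # A creep is a strictly increasing run over the last creep_weeks steps.
    if len(tail) == creep_weeks + 1 and all(b > a for a, b in zip(tail, tail[1:])):
        return "creep"
    return "stable"
```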
What: Count of agent actions that required rollback, customer communication, or remediation
Why: The number a regulator or a board will ask for. Track it from day one.
Healthy looks like: Tracked by zone. Zone 2–3 incidents reviewed individually; Zone 0–1 reported in aggregate.
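The zone split can be sketched as a simple triage step, assuming each incident record carries a hypothetical integer `zone` field from 0 to 3. The record shape is an assumption; the split itself follows the rule above.

```python
from collections import Counter


def triage_incidents(incidents):
    """Split incidents into an individual-review list and aggregate counts.

    Zone 2 and 3 incidents are returned one by one for review;
    Zone 0 and 1 incidents are reduced to a count per zone.
    """
    individual = [i for i in incidents if i["zone"] >= 2]
    aggregate = Counter(i["zone"] for i in incidents if i["zone"] <= 1)
    return individual, dict(aggregate)
```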
What: Statistical distance between today's outputs and a fixed historical reference set
Why: Silent quality degradation is a frequently cited cause of agent failures in production.
Healthy looks like: Within a defined band. Crossing the band auto-creates a review ticket.
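The text does not say which statistical distance to use. As one illustration, the sketch below uses total variation distance over categorical output labels, which is a swapped-in choice rather than the prescribed method. The `band` value and the `create_ticket` callback are hypothetical stand-ins for whatever threshold and ticketing hook a team actually has.

```python
from collections import Counter


def total_variation_distance(reference, today):
    """Half the summed absolute difference between two label distributions."""
    if not reference or not today:
        raise ValueError("both samples must be non-empty")
    ref_counts, today_counts = Counter(reference), Counter(today)
    labels = set(ref_counts) | set(today_counts)
    return 0.5 * sum(
        abs(ref_counts[label] / len(reference) - today_counts[label] / len(today))
        for label in labels
    )


def check_drift(reference, today, band=0.1, create_ticket=print):
    """Compute drift and call create_ticket when it leaves the band."""
    distance = total_variation_distance(reference, today)
    if distance > band:
        create_ticket(f"Drift {distance:.3f} exceeds band {band}")
    return distance
```

Total variation distance ranges from 0 (identical distributions) to 1 (no overlap), which makes the band easy to reason about. For free-text or numeric outputs, a different distance would be needed.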
What: Time from agent proposal to human decision in human-in-the-loop workflows
Why: If approvals take too long, reviewers rubber-stamp. If they happen too fast, reviewers aren't reading.
Healthy looks like: Within a defined band: fast enough not to bottleneck, slow enough that reviewers actually read.
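A two-sided band check might look like the following. The bounds of 30 seconds and 4 hours are placeholders chosen for illustration only; real limits depend on how long a proposal takes to read and how long the workflow can wait.

```python
from statistics import median


def approval_latency_status(latencies_seconds, min_seconds=30,
                            max_seconds=4 * 3600):
    """Classify median proposal-to-decision time against a two-sided band."""
    typical = median(latencies_seconds)
    if typical < min_seconds:
        return "too_fast"  # reviewers may not be reading
    if typical > max_seconds:
        return "too_slow"  # approvals are a bottleneck
    return "healthy"
```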
Reporting cadence: review quality and cost weekly, report risk monthly to the executive team, and monitor health continuously with alerting. Aggregate the panel into a one-page dashboard that the agent owner reviews daily.