
How you know it’s working

You don’t need a full “AI ops” department on day one. You need a short list of signals that tell you whether the system is healthy — and a few failure shapes you can recognize quickly.

Healthy signals

  • Task success rate — for defined jobs, did the user get the right outcome without rework?
  • Latency p95 — not average: tail latency is what people feel.
  • Cost per successful task — ties quality to money (see what it costs).
  • Tool error rate — timeouts, 4xx/5xx from APIs, rate limits.
  • Escalations — how often humans must intervene; should trend down as prompts and tools improve.
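The first three signals are easy to compute from per-task records. As a minimal sketch (the `TaskRecord` shape is a hypothetical example, not a prescribed schema), p95 latency and cost per successful task look like this:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    succeeded: bool     # did the user get the right outcome without rework?
    latency_ms: float
    cost_usd: float

def p95_latency_ms(records):
    """Tail latency: the value 95% of tasks finish under."""
    latencies = sorted(r.latency_ms for r in records)
    idx = max(0, int(len(latencies) * 0.95) - 1)
    return latencies[idx]

def cost_per_successful_task(records):
    """Total spend divided by successful outcomes only —
    failed tasks still cost money, so they inflate this number."""
    successes = sum(1 for r in records if r.succeeded)
    total = sum(r.cost_usd for r in records)
    return total / successes if successes else float("inf")
```

Note that cost divides by *successes*, not total tasks: a cheap system that fails half the time is not cheap.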

Common failure shapes

| Symptom | Likely cause | First fix |
| --- | --- | --- |
| Right sometimes, wildly wrong sometimes | vague instructions / missing eval criteria | tighten the spec; add examples |
| Slow and expensive | everything routed to a huge model | route by tier |
| Forgets context mid-run | long tool chains without summarization | summarize + checkpoint state |
| "Done" but nothing happened | tool silently failed | surface errors to the model; retry policy |
| Stuck repeating | missing stop condition | cap steps; add a critic/reviewer step |
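Two of those first fixes — "surface errors to the model; retry policy" and "cap steps" — fit in a few lines. A minimal sketch (function names, the result dict shape, and the backoff constants are illustrative assumptions):

```python
import time

MAX_STEPS = 12   # hard cap so a looping agent cannot run forever

def call_tool_with_retry(tool, *args, retries=2):
    """Retry transient tool failures; on final failure return the error
    text so the model sees it instead of a silent 'done'."""
    for attempt in range(retries + 1):
        try:
            return {"ok": True, "result": tool(*args)}
        except Exception as exc:
            if attempt == retries:
                return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
            time.sleep(0.1 * 2 ** attempt)  # short backoff for illustration

def run(agent_step):
    """Step cap: stop and escalate rather than repeat indefinitely."""
    for step in range(MAX_STEPS):
        if agent_step(step):   # agent_step returns True when done
            return "finished"
    return "hit step cap"      # escalate to a human here
```

Returning the error payload (rather than raising) is the point: the model can only recover from a failure it can read.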

What “good visibility” means

If you run more than one agent or long chains, you want tasks, spend, and an audit trail — the difference between debugging with data vs. guessing in chat transcripts.

Minimal stack idea

  • Structured logs for each agent step (what tool, what args, what result code).
  • Traces if you can (OpenTelemetry is common) — nice-to-have, not a blocker.
  • Weekly review of top failures — cheapest “continuous improvement” loop.
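The "structured logs" bullet above can be as small as one JSON line per step. A sketch, assuming nothing beyond the standard library (the field names are illustrative, not a standard):

```python
import json
import time

def log_step(agent, tool, args, result_code, **extra):
    """Emit one JSON line per agent step — greppable, and cheap to
    aggregate into the weekly failure review."""
    record = {
        "ts": time.time(),
        "agent": agent,
        "tool": tool,
        "args": args,
        "result_code": result_code,
        **extra,
    }
    print(json.dumps(record, default=str))
    return record
```

With logs in this shape, "top failures this week" is a one-liner: filter on `result_code`, group by `tool`.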