How you know it’s working
You don’t need a full “AI ops” department on day one. You need a short list of signals that tell you whether the system is healthy — and a few failure shapes you can recognize quickly.
Healthy signals
- Task success rate — for defined jobs, did the user get the right outcome without rework?
- p95 latency — not the average: tail latency is what users feel.
- Cost per successful task — ties quality to money (see what it costs).
- Tool error rate — timeouts, 4xx/5xx from APIs, rate limits.
- Escalations — how often humans must intervene; should trend down as prompts and tools improve.
Common failure shapes
| Symptom | Likely cause | First fix |
|---|---|---|
| Right sometimes, wildly wrong sometimes | vague instructions / missing eval criteria | tighten the spec; add examples |
| Slow and expensive | everything routed to a huge model | route by tier |
| Forgets context mid-run | long tool chains without summarization | summarize + checkpoint state |
| “Done” but nothing happened | tool silently failed | surface errors to the model; retry policy |
| Stuck repeating | missing stop condition | cap steps; add a critic/reviewer step |
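Two of the fixes in the table — capping steps and surfacing tool errors to the model — fit naturally in the agent loop itself. A minimal sketch; `call_model` and `run_tool` are stand-ins for your own plumbing, not a real API:

```python
MAX_STEPS = 10     # hard cap: "stuck repeating" can't run forever
MAX_RETRIES = 2    # retry policy for flaky tools

def run_task(task, call_model, run_tool):
    history = [task]
    for step in range(MAX_STEPS):
        action = call_model(history)
        if action["type"] == "finish":
            return action["result"]
        ok, result = False, None
        for attempt in range(MAX_RETRIES + 1):
            ok, result = run_tool(action)   # returns (success, payload_or_error)
            if ok:
                break
        # Surface the outcome either way, so a silent tool failure
        # never becomes "done, but nothing happened".
        history.append({"tool": action["name"], "ok": ok, "result": result})
    raise RuntimeError(f"no finish after {MAX_STEPS} steps")
```

The key design choice is the last `append`: the model sees `ok: False` and the error payload, so it can retry differently or escalate instead of claiming success.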
What “good visibility” means
If you run more than one agent or long chains, you want visibility into tasks, spend, and an audit trail — that is the difference between debugging with data and guessing from chat transcripts.
Minimal stack idea
- Structured logs for each agent step (what tool, what args, what result code).
- Traces if you can (OpenTelemetry is common) — nice-to-have, not a blocker.
- Weekly review of top failures — cheapest “continuous improvement” loop.
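For the structured logs, one JSON line per agent step is usually enough: grep-able, cheap, and sufficient input for the weekly failure review. A sketch with an illustrative field layout (the names are conventions, not a standard schema):

```python
import io
import json
import time
import uuid

def log_step(run_id, tool, args, status, latency_s, log_file):
    # One JSON object per line ("JSONL"): easy to tail, grep, and load.
    record = {
        "ts": time.time(),
        "run_id": run_id,        # groups all steps of one task
        "tool": tool,
        "args": args,
        "status": status,        # e.g. "ok", "timeout", "http_429"
        "latency_s": round(latency_s, 3),
    }
    log_file.write(json.dumps(record) + "\n")

# Usage sketch (writing to a buffer; in practice, an append-mode file):
buf = io.StringIO()
run_id = str(uuid.uuid4())
log_step(run_id, "web_search", {"q": "p95 latency"}, "ok", 0.42, buf)
log_step(run_id, "fetch_url", {"url": "https://example.com"}, "http_429", 1.8, buf)
```

With logs in this shape, the weekly review is a one-liner: group by `status != "ok"` and sort by count.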