SLIs, SLOs, and error budgets for AI agents

June 15, 2026

In the last post I argued that an agent is a distributed system whose nodes happen to be non-deterministic, and that the chaos-engineering playbook ports to agents with surprisingly few changes. This post pulls on a thread I left dangling there: SLOs and error budgets. They’re the part of the reliability discipline that decides, on any given day, whether you ship or whether you stop — and for agents the usual versions of them quietly measure the wrong thing.

Here’s the failure that should keep you up at night. Your agent’s reliability dashboard is all green. Availability is 99.97%. P99 latency is comfortable. The HTTP error rate is a rounding error. And the agent is, at that exact moment, handing a customer a fluent, well-formatted, completely wrong answer. Every signal you’re watching says “healthy,” because every signal you’re watching measures the transport, not the truth.

The classic three still apply — and still miss everything

Availability, latency, and error rate are real SLIs and you should absolutely keep them. An agent that’s down serves nobody, and a 40-second agent is a broken agent. They port directly from any service you’ve ever run.

The problem is what they cover. All three are satisfied by an agent that returns HTTP 200 with a syntactically perfect, semantically poisoned payload. They measure whether the machinery turned, not whether the answer was any good. For a CRUD service that distinction barely exists — a 200 with the right shape is almost always correct. For an agent, the gap between “responded” and “responded correctly” is the entire risk surface. The infrastructure SLIs are necessary. They are nowhere near sufficient.

So the interesting question isn’t “how do I run SLOs for agents”; the mechanics are well understood. It’s this: what do I point the SLOs at? Which signals, if I held them to a target, would actually tell me the agent is safe to keep serving?

The SLIs that actually matter for an agent

A good SLI is a ratio of good events to total events that a user would recognize as “the system working.” For agents, the events worth counting live a layer above the HTTP request. Five of them carry most of the weight.

Grounding (faithfulness). Of the answers the agent produced from retrieved context, what fraction were actually supported by that context? This is the single most important agent SLI, because it’s the one that catches the confident-but-wrong failure the infrastructure metrics sail right past. It’s also measurable today: it’s exactly the Faithfulness score from Ragas I used in the chaos post — the degree to which an answer is entailed by the documents it cited, scored 0–1.

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

judge = Faithfulness(llm=llm_factory("gpt-4o-mini", client=AsyncOpenAI()))

async def grounding_sli(sample, threshold=0.8):
    """One 'good event' = a grounded answer. Sampled, not measured on every request."""
    score = await judge.ascore(
        user_input=sample.query,
        response=sample.answer,
        retrieved_contexts=sample.sources,
    )
    return score.value >= threshold        # the boolean is the SLI event
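The docstring above says the judge runs on a sample rather than on every request. One way to pick that sample, sketched here under assumptions of my own: the function name, the 5% default rate, and the idea of keying on a request ID are all illustrative, not something from the chaos post.

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for judging.

    Hashing the ID instead of calling random() means the same request
    always gets the same decision, so retries and replays stay consistent.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against an integer
    # threshold. This avoids float rounding at the edges (rate=0 or rate=1).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Only requests where this returns True would be passed to grounding_sli, which keeps the judge off the hot path for everything else.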

Task correctness. Of the tasks the agent claimed to complete, what fraction actually reached the right outcome? Grounding tells you the answer was supported; correctness tells you it was right and did the job. You measure it two ways that work together: a small golden set of tasks with known-good outcomes that runs deterministically in CI, plus an LLM-judge over a sample of real production traffic for the long tail the golden set can’t anticipate. Neither alone is enough — the golden set is precise but narrow; the sampled judge is broad but noisy.
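The golden-set half of that pairing can be very small. A minimal sketch, assuming the agent can be called as a plain function from prompt to output; GoldenTask and golden_pass_rate are hypothetical names, and the checks are ordinary deterministic predicates, not an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GoldenTask:
    prompt: str
    check: Callable[[str], bool]  # deterministic assertion on the agent's output

def golden_pass_rate(agent: Callable[[str], str], tasks: list[GoldenTask]) -> float:
    """Fraction of golden tasks whose output passes its known-good check."""
    if not tasks:
        raise ValueError("golden set is empty")
    passed = sum(1 for task in tasks if task.check(agent(task.prompt)))
    return passed / len(tasks)
```

In CI the returned rate would be compared against a threshold, and the sampled judge would cover whatever the golden set never thought to ask.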

Containment rate. When something went wrong, how often did the blast radius hold? This is the SLI form of the circuit breaker from the chaos post — every time the Containment guard trips and routes to a human instead of letting a runaway proceed, that’s a good event, not a failure. An agent that occasionally hits a bad state but contains it every time is far healthier than one that rarely fails but melts down when it does.

# Reusing the breaker from the chaos-engineering post as an SLI source:
def containment_sli(runs):
    """Of runs that entered a bad state, what fraction were safely contained?"""
    bad = [r for r in runs if r.entered_bad_state]
    if not bad:
        return None                         # no signal this window; don't fake one
    contained = sum(1 for r in bad if r.terminal_state == "escalated_cleanly")
    return contained / len(bad)

Escalation precision. When the agent hands off to a human, is it handing off the right cases? This SLI has two failure directions, and you need both: a precision view (of the things it escalated, how many genuinely needed a human — low precision means it’s crying wolf and training your team to ignore it) and a recall view (of the things that genuinely needed a human, how many did it actually escalate — low recall is the dangerous one, the silent wrong answers that never got flagged). Track them separately. An agent that escalates everything has perfect recall and useless precision.
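Both views fall out of the same confusion-matrix counts. A minimal sketch, assuming each case has been labeled after the fact by a human reviewer; ReviewedCase and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewedCase:
    escalated: bool     # did the agent hand off to a human?
    needed_human: bool  # reviewer's after-the-fact label

def escalation_precision_recall(cases):
    """Return (precision, recall); either is None when its denominator is zero."""
    tp = sum(1 for c in cases if c.escalated and c.needed_human)
    fp = sum(1 for c in cases if c.escalated and not c.needed_human)
    fn = sum(1 for c in cases if not c.escalated and c.needed_human)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall
```

Feed it ten cases where the agent escalated all ten and only two needed a human, and it returns a precision of 0.2 and a recall of 1.0, which is the escalate-everything agent in numbers.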

Cost and step efficiency. Tokens and tool calls per resolved task, not per request. This is the SLI that catches the mundane runaway: the agent that calls the same tool fifteen times and burns $400 before anyone notices. A correctness SLO with no efficiency SLO is how you end up with an agent that’s right and bankrupting you.
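A sketch of what “per resolved task” might look like in code. One assumption to flag loudly: I charge the spend of failed runs to the resolved ones, on the theory that a runaway that never resolves should make the number worse, not vanish from it. RunCost and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunCost:
    tokens: int
    tool_calls: int
    resolved: bool

def efficiency_sli(runs):
    """Total tokens and tool calls divided by the number of resolved tasks."""
    resolved = sum(1 for r in runs if r.resolved)
    if resolved == 0:
        return None  # nothing resolved this window; no meaningful ratio
    return {
        # Numerators include unresolved runs: failed work still costs money.
        "tokens_per_resolved_task": sum(r.tokens for r in runs) / resolved,
        "tool_calls_per_resolved_task": sum(r.tool_calls for r in runs) / resolved,
    }
```

Dividing by requests instead would hide exactly the case this SLI exists for: one task that quietly consumed fifteen requests' worth of tool calls.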

Notice what these five have in common: none of them can be read off an HTTP status code, and most of them require judging the content of the response. That is the whole shift. For a traditional service, the SLI is in the response metadata. For an agent, the SLI is in the response meaning.

From SLI to SLO: pick a target you can defend

An SLO is just an SLI with a target and a window: “grounding ≥ 0.8 on 99% of sampled answers, measured over a rolling 7 days.” Two things bite people here.

The first is the 100% trap. With a non-deterministic system the temptation is to demand “99.9% correct” because anything less feels like endorsing errors. Don’t. A target you can’t hit isn’t a target, it’s a permanent alert — and a permanently red SLO is one everyone learns to ignore, which is strictly worse than no SLO. Set the target where the user experience is actually acceptable and where you have headroom to improve, and assert against a threshold, not equality — the same discipline you need for the LLM-judge itself.

The second is sampling. You will not run a faithfulness eval on every production request: it costs at least one additional LLM call per request, latency and cost you can’t justify. You measure a sample, which means your SLI is itself an estimate with a confidence interval. That’s fine, and sampled measurement is already common practice for SLIs on high-traffic services, but it means your window has to hold enough samples to be meaningful, and your alerting has to tolerate sampling noise.

from collections import deque

class WindowedSLO:
    """Rolling-window SLO over sampled SLI events. Good event = SLI passed."""
    def __init__(self, target=0.99, window=2000, min_samples=200):
        self.target, self.min_samples = target, min_samples
        self.events = deque(maxlen=window)  # bounded: old samples age out

    def record(self, passed: bool):
        self.events.append(1 if passed else 0)

    def status(self):
        n = len(self.events)
        if n < self.min_samples:
            return "insufficient-signal"    # don't page on 12 samples
        attainment = sum(self.events) / n
        return "meeting" if attainment >= self.target else "breaching"
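The window above reports a point estimate. Since the SLI is a sampled proportion, a confidence interval says how much to trust it. A minimal sketch using the Wilson score interval, which is my choice for illustration (it behaves well near 0 and 1, where SLO attainment usually lives); the function name and the z=1.96 default for roughly 95% confidence are assumptions.

```python
import math

def wilson_interval(good: int, total: int, z: float = 1.96):
    """Wilson score interval for a sampled pass rate. Returns (low, high)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    p = good / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

With 198 passes out of 200 samples, the interval runs from roughly 0.964 to 0.997, so a 99% target is neither clearly met nor clearly missed. One alerting design that tolerates sampling noise is to page only when the upper bound falls below the target.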

The error budget — and what “spending” it means here

The error budget is the best idea in the whole SRE canon, and it’s the reason SLOs are worth the trouble. If your correctness SLO is 99%, your error budget is the other 1% — the amount of being-wrong you’ve decided is acceptable over the window. It turns an unwinnable argument (“ship faster!” vs. “be safe!”) into arithmetic: while there’s budget left, you ship; when it’s spent, you stop and fix. Reliability and velocity stop being enemies and start being a shared account.
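The arithmetic is short enough to write down. A minimal sketch, assuming the budget is counted in events over the window; error_budget and its dictionary keys are hypothetical names.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Turn an SLO target and observed counts into budget figures."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events  # the budget, in events
    return {
        "allowed_bad": allowed_bad,
        "remaining_bad": allowed_bad - bad_events,
        # Goes negative once the budget is overspent.
        "fraction_remaining": 1.0 - bad_events / allowed_bad,
    }
```

With a 99% target over 2,000 sampled events, the budget is 20 bad events; five failures so far leaves 15, or three quarters of the budget.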

For agents this reframes “spending budget” in a useful way. Every prompt tweak, model swap, temperature change, tool addition, or retrieval-index update is a deploy — a change that can move your SLIs — even though none of them touch a line of application code. The org chart says these are “just prompt changes” and ships them with no review. The error budget says: a prompt change that drops grounding from 0.95 to 0.78 spent three weeks of budget in one afternoon, and it gets treated exactly like a bad binary deploy.

So the highest-leverage place to wire the budget in is the gate before a change ships — you evaluate the candidate against the golden set, project its burn rate, and block it if it would blow the budget:

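class BudgetGateError(RuntimeError):
    """Raised when a candidate change would exhaust the error budget."""

def gate_change(candidate, golden_set, budget_remaining, window_days=30):
    """Block a prompt/model/tool change that would exhaust the error budget.

    Assumption: budget_remaining is the fraction of the window's events that
    may still fail (e.g. 0.006 when 40% of a 1% budget is already spent).
    `evaluate` is the deterministic CI eval harness, defined elsewhere.
    """
    if not golden_set:
        raise ValueError("golden set is empty; nothing to evaluate")
    results = [evaluate(candidate, task) for task in golden_set]
    error_rate = 1 - (sum(r.correct for r in results) / len(results))

    # budget_remaining / error_rate is the fraction of a window the budget
    # would last at this error rate; scale by window_days to get days.
    budget_days = window_days * budget_remaining / error_rate if error_rate else float("inf")
    if budget_days < window_days:
        raise BudgetGateError(
            f"projected burn exhausts budget in {budget_days:.1f}d "
            f"(< {window_days}d window), blocking. error_rate={error_rate:.2%}"
        )
    return "cleared"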

And the budget drives a policy, the same one it always has:

Budget healthy → ship freely. Experiment with prompts, try the new model, move fast. This is what the budget is for — it’s permission to take risk.
Budget low → freeze feature changes, spend only on reliability: better grounding, tighter containment, fixing the top failure mode.
Budget exhausted → stop and investigate. Not as punishment — as the signal you agreed in advance to respect, back when nobody was under pressure to ship.

The discipline is deciding the policy before the bad week, so that when the bad week comes the decision is already made and it’s boring.
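Deciding in advance can be as literal as checking the policy into the repo. A minimal sketch of the three states above; the 25% low-water mark is an assumption for illustration, and the right threshold is whatever your team agreed on before the bad week.

```python
from enum import Enum

class Policy(Enum):
    SHIP_FREELY = "ship freely"
    RELIABILITY_ONLY = "freeze feature changes, reliability work only"
    STOP_AND_INVESTIGATE = "stop and investigate"

def budget_policy(fraction_remaining: float, low_watermark: float = 0.25) -> Policy:
    """Map the remaining share of the error budget to a pre-agreed policy."""
    if fraction_remaining <= 0:
        return Policy.STOP_AND_INVESTIGATE  # budget exhausted or overspent
    if fraction_remaining < low_watermark:
        return Policy.RELIABILITY_ONLY      # budget low
    return Policy.SHIP_FREELY               # budget healthy
```

The function is deliberately dull. Nobody has to argue about what 10% remaining means on the day it happens.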

One honest caveat: the judge is a dependency too

Most of these SLIs are measured by an LLM acting as judge, which means your measurement instrument is itself non-deterministic. You’re using a stochastic system to grade a stochastic system. That’s not a reason to abandon the approach — a noisy thermometer still beats no thermometer — but it does mean you engineer the judge like the production dependency it is: pin the judge model (a silent upgrade can shift every SLI overnight), version it, track its agreement against a human-labeled set, and re-validate it on a schedule. When grounding “drops,” your first question is whether the agent got worse or the judge drifted. Treat the judge as one more node in the distributed system — because it is.
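Tracking agreement against a human-labeled set can start as two numbers. A minimal sketch for binary pass/fail labels: raw agreement, plus Cohen's kappa to discount the agreement you would get by chance. The function name is hypothetical, and kappa is my choice of statistic, not the only reasonable one.

```python
def judge_agreement(judge_labels, human_labels):
    """Return (raw_agreement, cohens_kappa) for paired boolean labels.

    kappa is None when chance agreement is already 1.0 (both raters gave
    every item the same single label), because the statistic is undefined.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    n = len(judge_labels)
    if n == 0:
        raise ValueError("need at least one labeled pair")
    observed = sum(1 for j, h in zip(judge_labels, human_labels) if j == h) / n
    p_judge = sum(1 for j in judge_labels if j) / n   # judge's pass rate
    p_human = sum(1 for h in human_labels if h) / n   # humans' pass rate
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected >= 1.0:
        return observed, None
    return observed, (observed - expected) / (1 - expected)
```

Run it on the same labeled set every time the judge model or prompt changes. A drop in kappa with no change to the agent points at the thermometer, not the patient.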

The takeaway

SLOs were never about making distributed systems perfect. They were about making them trustworthy without being perfect — agreeing on how good is good enough, measuring it honestly, and using the gap as a budget that buys you the freedom to move fast right up until you can’t afford to. Agents need that bargain more than anything we’ve shipped, because they fail in a register our usual instruments can’t hear.

So keep the availability dashboard — and then build the one that watches the layer where agents actually fail. Measure grounding, correctness, containment, escalation, and cost. Set targets you can defend. Spend the budget on velocity while you have it, and respect it when you don’t. The all-green dashboard that sits on top of a confidently wrong agent isn’t lying to you. You’re just reading the wrong gauge.
