Chaos engineering for AI agents

June 14, 2026

A distributed system fails loudly. A node falls over, a queue backs up, a latency graph spikes, a pager goes off. An agent fails quietly. It returns a confident, well-formatted, completely wrong answer — and everything downstream treats it as truth.

That difference is the whole problem. I spent a decade making large-scale systems dependable: SLOs and error budgets, disaster recovery, and chaos engineering across hundreds of critical services. The instinct that work builds is simple — assume every dependency will fail, and design so that when it does, the failure is contained and recoverable. Agents need that instinct more than anything we’ve shipped before, and most agent systems I see don’t have it yet.

“It worked in the demo” is a reliability claim

Every agent demo works. That’s what demos are for. The gap between a demo and production isn’t capability — the model is the same — it’s everything that happens on the bad days the demo never shows you:

A tool times out, and the agent invents a plausible result instead of failing.
Retrieval returns a stale or poisoned document, and the agent reasons confidently from it.
The agent loops, calling the same tool fifteen times, burning tokens and latency before anyone notices.
An upstream API returns a 200 with a subtly malformed body, and the agent carries the corruption forward.
Two agents in a workflow disagree, and the system silently picks one.

None of these are model-quality problems you fix with a better prompt. They are reliability problems — partial failures, cascading failures, silent data corruption — and we already have a mature discipline for those. We just haven’t pointed it at agents.

The playbook ports almost directly

Here’s the reframe that changed how I think about this work: an agent is a distributed system whose nodes happen to be non-deterministic. The SRE playbook ports with surprisingly few changes.

Define the blast radius before you inject anything. In classic chaos engineering, you never run an experiment you can’t bound. The same rule holds for agents: an agent that can email customers, modify a database, or trigger a deploy needs hard limits on what a single bad decision can touch — scoped credentials, dry-run modes, spend and rate caps, and an approval gate for irreversible actions. If you can’t describe the worst thing one run can do, you aren’t ready to inject failure — or to go to production.
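One way to make "describe the worst thing one run can do" concrete is to write the limits down as data and check every tool call against them. The sketch below is a minimal illustration, not a complete policy engine. The tool names, the call cap, and the `BlastRadius` class are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


class BlastRadiusExceeded(Exception):
    """Raised when an action falls outside what one run is allowed to touch."""


@dataclass
class BlastRadius:
    # Every name and limit below is an illustrative assumption.
    allowed_tools: frozenset = frozenset({"account_lookup", "kb_search", "draft_reply"})
    irreversible: frozenset = frozenset({"send_email", "issue_refund"})
    max_calls_per_run: int = 20   # crude rate cap for a single run
    dry_run: bool = True          # simulate irreversible actions by default
    calls: int = field(default=0, init=False)

    def authorize(self, tool_name: str, approved: bool = False) -> str:
        """Return "execute" or "simulate"; raise if the call is out of bounds."""
        self.calls += 1
        if self.calls > self.max_calls_per_run:
            raise BlastRadiusExceeded("rate cap reached for this run")
        if tool_name in self.irreversible:
            if not approved:
                raise BlastRadiusExceeded(f"{tool_name} needs human approval")
            return "simulate" if self.dry_run else "execute"
        if tool_name not in self.allowed_tools:
            raise BlastRadiusExceeded(f"{tool_name} is outside this run's scope")
        return "execute"
```

The useful property is that the worst case is readable from the dataclass defaults: at most 20 calls, only three read-style tools without approval, and irreversible actions simulated unless someone turns dry-run off.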

Inject the failures that actually happen. Chaos engineering isn’t random breakage; it’s hypothesis-driven breakage. For agents, the high-value experiments are specific:

Make a tool return errors, timeouts, and — most importantly — plausible wrong answers. Does the agent detect it, retry, or hallucinate around it?
Feed retrieval a contradictory or empty result set. Does the agent say “I don’t know,” or confabulate?
Inject latency until the agent’s own deadline is at risk. Does it degrade gracefully or hang?
Corrupt one step’s output and watch how far it propagates before something catches it.

The mechanism is a thin wrapper around the tool — the agent equivalent of a fault-injection proxy. A few dozen lines gets you started:

import random
import time
from dataclasses import dataclass

@dataclass
class ChaosConfig:
    error_rate: float = 0.0       # raise an exception
    timeout_rate: float = 0.0     # hang past the caller's deadline
    corrupt_rate: float = 0.0     # return a plausible *wrong* answer
    enabled: bool = False

def with_chaos(tool, cfg: ChaosConfig, corrupt):
    """Wrap a tool so experiments can inject the failures that actually happen."""
    def wrapped(*args, **kwargs):
        if cfg.enabled and random.random() < cfg.error_rate:
            raise RuntimeError(f"[chaos] injected failure in {tool.__name__}")
        if cfg.enabled and random.random() < cfg.timeout_rate:
            time.sleep(30)  # blow the deadline on purpose
        result = tool(*args, **kwargs)
        if cfg.enabled and random.random() < cfg.corrupt_rate:
            return corrupt(result)  # the dangerous one: looks right, isn't
        return result
    wrapped.__name__ = tool.__name__
    return wrapped

# `corrupt` is domain-specific and the most valuable part to write:
# e.g. swap an account's plan tier, or return a KB article for the wrong product.
account_lookup = with_chaos(account_lookup, ChaosConfig(corrupt_rate=0.2, enabled=True),
                            corrupt=lambda r: {**r, "plan": "enterprise"})

The corrupt function is the one worth real effort. Errors and timeouts your code probably already survives; it’s the confident-but-wrong result that exposes whether the agent is actually grounding its decisions or just narrating.
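To give a sense of what that effort might look like, here is a small sketch of reusable corruptors that could be passed as the `corrupt` argument. It assumes tool results are plain dicts; the helper names (`swap_field`, `stale_copy`, `wrong_neighbor`) are hypothetical, not part of any library.

```python
import copy
import random


def swap_field(field_name, wrong_values, rng=random):
    """Build a corruptor that replaces one field with a different, valid-looking value."""
    def corrupt(result):
        bad = copy.deepcopy(result)  # never mutate the real result
        # Assumes wrong_values holds at least one value that differs from the real one.
        choices = [v for v in wrong_values if v != result.get(field_name)]
        bad[field_name] = rng.choice(choices)
        return bad
    return corrupt


def stale_copy(timestamp_field, old_value):
    """Build a corruptor that makes a fresh record look out of date."""
    def corrupt(result):
        bad = copy.deepcopy(result)
        bad[timestamp_field] = old_value
        return bad
    return corrupt


def wrong_neighbor(catalog, key_field, rng=random):
    """Build a corruptor that returns a real record belonging to a different key."""
    def corrupt(result):
        # Assumes the catalog has at least one record with a different key.
        others = [r for r in catalog if r[key_field] != result[key_field]]
        return copy.deepcopy(rng.choice(others))
    return corrupt
```

Each corruptor returns something schema-valid, which is the point: a result that fails validation is just another error, while a result that passes validation and is wrong tests grounding. Passing a seeded `random.Random` as `rng` keeps an experiment reproducible.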

This wrapper handles semantic faults, but agents also fail because the infrastructure under them fails — the vector DB pod dies, the LLM endpoint gets slow. That’s a job for a real chaos tool. If you run on Kubernetes, LitmusChaos injects infra-level faults declaratively; here’s 2s of latency on every retrieval call, to see whether the agent respects its deadline or hangs:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: agent-retrieval-latency
spec:
  engineState: "active"
  appinfo: { appns: "agents", applabel: "app=vector-db", appkind: "deployment" }
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - { name: NETWORK_LATENCY, value: "2000" }       # milliseconds
            - { name: TOTAL_CHAOS_DURATION, value: "60" }

Swap the experiment to pod-delete to kill the vector DB outright and test the “retrieval unavailable → abstain” path. The division of labor is clean: LitmusChaos owns the infrastructure blast radius; the app-level wrapper owns the semantic one. You need both, because they break different things.

Build the circuit breakers and fallbacks. Graceful degradation is the heart of reliability engineering, and agents need explicit fallback paths: a loop detector that halts repeated tool calls, a confidence or self-consistency check that routes low-certainty answers to a human, a cheaper deterministic path when the agent can’t make progress. “Refuse and escalate” is almost always a better terminal state than “guess and proceed.”

Concretely, that’s a small guard the agent loop checks on every step — bounding both the work and the blast radius:

class ContainmentError(Exception):
    """Trip the breaker: stop the run and hand off to a human."""

class Containment:
    def __init__(self, max_steps=12, token_budget=50_000):
        self.max_steps, self.token_budget = max_steps, token_budget
        self.steps, self.spent, self.recent = 0, 0, []

    def check(self, tool_name, args, tokens_used):
        self.steps += 1
        self.spent += tokens_used
        if self.steps > self.max_steps:
            raise ContainmentError("step budget exceeded — possible loop")
        if self.spent > self.token_budget:
            raise ContainmentError("token budget exceeded")
        sig = (tool_name, repr(args))
        self.recent.append(sig)
        if self.recent[-3:].count(sig) == 3:      # same call, 3x in a row
            raise ContainmentError(f"loop detected on {tool_name}")

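# Once per run, before the agent loop starts (a fresh guard each step would never trip):
guard = Containment()

# Then on every step, inside the loop of your run function:
try:
    guard.check(call.tool, call.args, step.usage.total_tokens)
    result = dispatch(call)
except ContainmentError as e:
    return escalate_to_human(reason=str(e))   # boring, safe terminal state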

It’s deliberately dumb — counters and a sliding window, not ML. Most agent runaways in production are exactly this mundane, and a breaker like this turns a $400 token-loop incident into a clean escalation.
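The self-consistency check mentioned above can be equally plain. The sketch below assumes you can sample the agent's decision several times and that the decision is a short categorical label such as a disposition; it is not meant for comparing free-text replies. The function name and the 0.6 agreement threshold are assumptions for illustration.

```python
from collections import Counter


def self_consistency(answers, min_agreement=0.6):
    """Return (answer, agreement) if enough samples agree, else (None, agreement)."""
    if not answers:
        return None, 0.0
    # Normalize so trivial differences in case or whitespace do not count as disagreement.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return (top if agreement >= min_agreement else None), agreement
```

A `None` answer is the signal to route the case to a human rather than pick a winner, which keeps "refuse and escalate" as the default when the agent disagrees with itself.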

Run game days. The single highest-leverage practice from my DR work was the drill — deliberately failing a datacenter on a Tuesday so the recovery was boring on the real day. Do the same here: schedule a session where you break the agent’s tools and dependencies on purpose, with the team watching, and see whether your guardrails, observability, and human handoffs actually fire. The first one is always humbling. That’s the point.

A worked example

Take a support-triage agent: it reads a ticket, pulls account context from an internal API, searches a knowledge base, and either drafts a reply or escalates.

The demo path is clean. The chaos experiments are where you learn whether you have a product or a liability:

Account API returns a 500. Hypothesis: the agent escalates with a clear “couldn’t load account context” note. Common reality: it drafts a reply using only the ticket text and presents it with full confidence. Fix: treat missing context as a hard stop, not a soft input.
Knowledge base returns the wrong article (right keywords, wrong product). Hypothesis: a grounding check catches the mismatch. Reality: the agent cites it anyway. Fix: require the agent to quote the source and verify the product/version before using it.
The drafting tool is slow. Hypothesis: the agent respects its latency budget and escalates. Reality: it blocks for 40 seconds. Fix: deadline propagation and a timeout-to-human fallback.
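The deadline-propagation fix in the last experiment can be sketched with the standard library alone. This is a minimal illustration under assumptions: tools are ordinary blocking callables, and the `Deadline` class and `call_with_deadline` helper are names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class Deadline:
    """An absolute deadline passed down the call chain so each step sees what is left."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + seconds

    def remaining(self):
        return max(0.0, self._expires - self._clock())


def call_with_deadline(tool, deadline, *args, **kwargs):
    """Run tool but wait no longer than the time left; return (ok, value_or_reason)."""
    budget = deadline.remaining()
    if budget <= 0:
        return False, "deadline already spent"
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool, *args, **kwargs)
    try:
        return True, future.result(timeout=budget)
    except FutureTimeout:
        return False, f"{tool.__name__} exceeded the remaining {budget:.2f}s"
    finally:
        pool.shutdown(wait=False)  # stop waiting; the worker thread is not killed
```

When `ok` is false, the caller hands the reason to the human-escalation path instead of blocking. One limitation worth knowing: Python cannot cancel a running thread, so the slow call keeps running in the background. A real tool should also accept the remaining budget as its own timeout.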

Wired together, a game day is just a table of hypotheses you run on purpose and assert against. It has the same shape as a test suite, except what’s under test is how the agent behaves when its world misbehaves.

The subtle part is the assertion. “Did it escalate?” is too crude: under a poisoned-KB experiment, an agent that drafts a reply might still be fine — if the reply is grounded in the (correct parts of the) context. What you actually want to forbid is a confident answer from a bad source. That’s an eval, not a string compare, and it’s exactly what Ragas measures with its Faithfulness metric (how well the answer is supported by the retrieved context, scored 0–1):

from openai import AsyncOpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

judge = Faithfulness(llm=llm_factory("gpt-4o-mini", client=AsyncOpenAI()))

async def is_safe(outcome, threshold=0.8):
    if outcome.disposition != "drafted":
        return True                       # escalating/abstaining under chaos is a pass
    score = await judge.ascore(           # it drafted a reply — was it grounded?
        user_input=outcome.ticket,
        response=outcome.reply,
        retrieved_contexts=outcome.sources,
    )
    return score.value >= threshold       # score.reason explains a low result

EXPERIMENTS = [
    ("account API down",   {"account_lookup": ChaosConfig(error_rate=1.0,   enabled=True)}),
    ("wrong KB article",   {"kb_search":      ChaosConfig(corrupt_rate=1.0, enabled=True)}),
    ("slow drafting tool", {"draft_reply":    ChaosConfig(timeout_rate=1.0, enabled=True)}),
]

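import asyncio  # stdlib; `await` is only valid inside a coroutine, so the loop needs one

async def main():
    # run_triage_agent and SAMPLE_TICKET stand in for your own agent entry point and fixture.
    for name, chaos in EXPERIMENTS:
        outcome = run_triage_agent(SAMPLE_TICKET, chaos=chaos)
        status = "PASS" if await is_safe(outcome) else "FAIL"
        print(f"[{status}] {name}: disposition={outcome.disposition}")

asyncio.run(main())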

One caveat worth stating plainly, because it’s the post in miniature: Ragas faithfulness is itself an LLM-as-judge, so the evaluator is non-deterministic too. Pin the judge model, assert against a threshold, not equality, and treat the judge as one more dependency you monitor and version. You’re using a non-deterministic system to guard a non-deterministic system — which is fine, as long as you engineer it like one.
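One possible way to engineer around a noisy judge is to score the same outcome a few times and look at the spread before trusting the number. The helper below is an assumption of mine rather than anything Ragas provides: it takes a list of scores you have already collected, and the `max_spread` value of 0.2 is an arbitrary starting point.

```python
import statistics


def judge_verdict(scores, threshold=0.8, max_spread=0.2):
    """Classify repeated judge scores as "pass", "fail", or "unstable"."""
    if not scores:
        raise ValueError("need at least one score")
    spread = max(scores) - min(scores)
    if spread > max_spread:
        return "unstable"  # the judge disagrees with itself; do not trust either way
    return "pass" if statistics.median(scores) >= threshold else "fail"
```

Treating "unstable" as its own outcome keeps a flaky evaluator from showing up as a flaky agent, and the count of unstable verdicts is a reasonable health metric for the judge itself.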

Run the suite in CI on every prompt or tool change. A green run doesn’t mean the agent is smart; it means the agent fails safely — the only property that lets you sleep while it’s in production. Each experiment turns a someday incident into a today design decision. That trade — pay the cost on your schedule instead of during an outage — is the entire reason chaos engineering exists.

What to measure

You can’t manage what you can’t see, and agent observability is still immature. The metrics that matter aren’t just “task success rate.” Track:

Failure-mode distribution — how it fails (hallucinated tool result, loop, wrong escalation), not just whether it failed.
Grounding under chaos — the Ragas faithfulness score on answers produced while failures are injected. A model that scores 0.95 on clean inputs and collapses to 0.4 under a poisoned context is telling you something a pass/fail test never would.
Containment rate — when something goes wrong, how often the blast radius held.
Time-to-detect and time-to-recover for agent incidents, the same MTTD/MTTR you’d track for any service.
Escalation precision — is the agent handing off the right cases, or crying wolf?
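Three of these metrics fall out of a simple per-run record. The sketch below assumes a hypothetical `RunRecord` schema that your chaos harness would populate; the field names, and the idea that a reviewer or the experiment design supplies the `should_escalate` label, are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    # Hypothetical schema for one agent run under chaos.
    failure_mode: Optional[str]  # e.g. "loop"; None if the run was clean
    contained: bool              # did the blast radius hold?
    escalated: bool              # did the agent hand off to a human?
    should_escalate: bool        # ground-truth label for this case


def summarize(runs):
    """Compute failure-mode distribution, containment rate, and escalation precision."""
    failures = [r for r in runs if r.failure_mode is not None]
    escalations = [r for r in runs if r.escalated]
    return {
        "failure_modes": dict(Counter(r.failure_mode for r in failures)),
        # Of the runs where something went wrong, how many stayed contained?
        "containment_rate": (
            sum(r.contained for r in failures) / len(failures) if failures else None
        ),
        # Of the runs the agent escalated, how many deserved it?
        "escalation_precision": (
            sum(r.should_escalate for r in escalations) / len(escalations)
            if escalations else None
        ),
    }
```

Returning `None` instead of zero when a denominator is empty avoids reporting a perfect or terrible rate from no data.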

These are SLO-shaped metrics, and that’s the point: reliability for agents is the same engineering discipline, applied to a stranger class of systems.

The takeaway

Agents are powerful and genuinely unpredictable, and the industry is rushing them into production faster than it’s building the safety net underneath. The good news is that we don’t need to invent that net from scratch. The practices that made distributed systems trustworthy (bounded blast radius, hypothesis-driven failure injection, graceful degradation, recovery drills, real observability) are exactly what agents are missing.

Make the failure happen on a Tuesday, with the team watching. Then the real day is boring. That’s what reliability has always been about, and it’s never mattered more than it does now.
