Engineering Manager · LinkedIn

Amar Chaudhari

Making AI agents reliable enough to trust in production.

I've spent a decade making large-scale systems dependable — reliability platforms, disaster recovery, and chaos engineering across hundreds of critical services. I'm now focused on the next frontier: reliability for Agentic AI.

Where I'm headed

Agents are powerful but unpredictable. My work is making them trustworthy — applying hard-won reliability practice to a new class of systems.

Evaluation & guardrails

Designing evals, SLOs, and guardrails for non-deterministic agents — measuring quality where there is no single correct answer.

Failure-mode engineering

Bringing chaos engineering, disaster recovery, and incident practice to agentic workflows so failures are expected, contained, and recoverable.

Agents in production

The observability, capacity, and control-plane work that lets agents operate safely at scale — not just in a demo.

Selected projects

AstroSequence

An AI astrophotography coach

VibeCodeAtlas vibecodeatlas.com ↗

Product Hunt for vibe-coded apps

Mintu mintu.ai ↗

AI assistant for everyday life

Recent writing

Jul 2026 · Twelve monitoring signals for AI agent reliability

I ran 672 agent episodes through injected incidents. During the worst one, the agent issued unauthorized refunds while latency, error rate, and throughput stayed perfectly green. Here's what a dashboard has to watch instead, and a working paper with the receipts.

Jun 2026 · SLIs, SLOs, and error budgets for AI agents

A green reliability dashboard can sit on top of an agent that's confidently wrong. The classic SLIs measure the wrong layer. Here are the ones that actually tell you whether an agent is safe in production.

Jun 2026 · Chaos engineering for AI agents

The reliability playbook that tamed distributed systems is the missing layer for agents in production. Here's how to make agent failure expected instead of surprising.