Amar Chaudhari

Experience

Engineering Manager, Online Databases

LinkedIn · 2025 – Present

  • Lead LinkedIn’s Online Analytics team, managing a unified multi-engine analytics platform across Pinot, ClickHouse, and related engines — ~13,000 servers and ~10 PB of data.
  • Drive the AI-native evolution of the platform, applying AI to onboarding, query understanding, ingestion optimization, troubleshooting, and capacity planning.
  • Operate as a Tech Lead Manager for a 15-engineer team, spending 50%+ of time on coding, architecture, design reviews, and platform tradeoff decisions.
  • Established technical leadership across core platform areas by developing tech leads and clarifying ownership boundaries.
  • Served on LinkedIn’s AI-first hiring council, helping define interview processes and evaluation criteria.

Engineering Manager, Reliability Infra

LinkedIn · 2022 – 2024

  • Scaled ForgeFire from a nascent stress-testing platform into LinkedIn’s enterprise-wide reliability validation standard — 500+ critical services, ~30% lower change failure rate, ~25% lower MTTR.
  • Defined and drove LinkedIn’s performance testing and disaster recovery strategy in partnership with Principal Staff engineers, VPs, and the infra, product, and SRE orgs.
  • Led development of LinkedIn’s Disaster Recovery platform, automating failout of unhealthy datacenters and cutting time-to-mitigate by ~70%.
  • Introduced AI agents to automate stress-test creation, environment setup, and result analysis — ~80% faster authoring, ~75% faster setup.
  • Grew the team from 3 to 7 engineers and sponsored multiple promotions.

Site Reliability Engineer, Tech Lead

LinkedIn · 2018 – 2022

  • Led reliability and performance engineering for LinkedIn Company Pages, redesigning a monolith into microservices and scaling to ~1M QPS serving 200M+ pages — ~10% lower P99 latency, ~40% less GC pause time.
  • Established a reliability-first operating model: SLOs/SLAs, error budgets, automated monitoring, graceful degradation, capacity planning — ~50% less unplanned downtime.
  • Introduced disaster recovery and chaos engineering as team standards.
  • Designed and rolled out Investigator, a triage platform adopted by 4,000+ engineers and TSMs, cutting manual debugging toil by ~80%.

Software Engineer, Network Automation

Rakuten · 2013 – 2016

  • Implemented zero-touch provisioning for Juniper and Cisco devices, reducing deployment time from hours to minutes.
  • Built network configuration automation across a global fleet, eliminating recurring configuration drift.
  • Migrated monitoring for 3,000+ devices to enterprise observability platforms (PRTG, Grafana, PagerDuty) with automated discovery and alerting.

Education

University of Colorado, Boulder
Master's — Network Engineering · 2016 – 2018

Pune Institute of Computer Technology
Bachelor's — Computer Science (IT) · 2009 – 2013

Skills

Leadership & Execution: Team Building, Hiring, Mentorship, Technical Strategy, Roadmap & OKR Planning, Stakeholder Management

Systems & Platform Engineering: Distributed Systems, Microservices, Platform Engineering, Control Planes, Kubernetes, AWS, GCP

Reliability & Operations: SLO/SLA/SLI, Capacity Planning, Incident Management, Disaster Recovery, Chaos Engineering, Observability

AI & Agent Tooling: Python, LangChain, LangGraph, OpenAI Agents SDK, RAG, LLM Evaluation, Guardrails, Claude Code