Accelerate MTTR
Playbook for reducing mean time to restore (MTTR) using telemetry, runbooks, and gamedays.
When to use this playbook
Run the MTTR play when customer-impacting incidents linger, on-call engineers scramble for context, or retros repeat the same action items. It complements reliability investments by tightening the feedback loop between detection, diagnosis, remediation, and learning.
Desired outcomes
- Critical incidents are detected within minutes and routed to the right responders.
- Engineers have the context and authority to remediate safely, using rehearsed playbooks.
- Post-incident learning closes the loop with measurable improvements to systems and processes.
Core plays
- Strengthen detection. Review SLOs for tier-1 services, align alerting thresholds to user impact, and add synthetic checks for top customer journeys. Eliminate noisy alerts and track time-to-detect.
- Standardize response workflows. Maintain an on-call rota with clear escalation, provide incident command templates, and ensure runbooks link to dashboards, feature flags, and rollback scripts.
- Accelerate diagnosis. Embed deployment markers in dashboards, enrich alerts with recent changes, and create service ownership pages in the developer portal so context is one click away.
- Automate remediation. Script rollbacks, feature flag toggles, and infrastructure failovers. Test them during calm periods and document prerequisites in runbooks.
- Institutionalize learning. Run blameless post-incident reviews within 48 hours, capture timeline data, and track follow-up actions in the team backlog. Review progress during weekly ops reviews.
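As an illustration of the "enrich alerts with recent changes" play, here is a minimal sketch that attaches recent deployments to an alert. The `Deploy` and `Alert` types, the `enrich_alert` helper, and the two-hour lookback are hypothetical choices for this example and are not tied to any particular alerting or deployment tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Deploy:
    # Hypothetical record of a deployment event.
    service: str
    version: str
    deployed_at: datetime


@dataclass
class Alert:
    # Hypothetical alert payload; recent_changes is filled in by enrich_alert.
    service: str
    fired_at: datetime
    summary: str
    recent_changes: list = field(default_factory=list)


def enrich_alert(alert, deploys, lookback=timedelta(hours=2)):
    """Attach deploys of the alerting service that landed in the lookback window."""
    window_start = alert.fired_at - lookback
    alert.recent_changes = sorted(
        (
            d
            for d in deploys
            if d.service == alert.service
            and window_start <= d.deployed_at <= alert.fired_at
        ),
        key=lambda d: d.deployed_at,
        reverse=True,  # newest change first, since it is the likeliest suspect
    )
    return alert


deploys = [
    Deploy("checkout", "v41", datetime(2024, 5, 1, 9, 0)),
    Deploy("checkout", "v42", datetime(2024, 5, 1, 11, 30)),
    Deploy("search", "v7", datetime(2024, 5, 1, 11, 45)),
]
alert = enrich_alert(
    Alert("checkout", datetime(2024, 5, 1, 12, 0), "5xx rate above SLO"),
    deploys,
)
print([d.version for d in alert.recent_changes])  # ['v42']
```

Only `v42` is attached: `v41` falls outside the two-hour window and `v7` belongs to a different service. In practice the lookback would likely be tuned per service to match its deployment frequency.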
Operating cadence
- Daily pager health check to monitor noise and responder load.
- Monthly incident drills or gamedays to exercise people, process, and tooling.
- Quarterly leadership review covering MTTR trends, staffing, and investment priorities.
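The daily pager health check can be approximated with a small summary of the previous day's pages. The sketch below assumes each page is a dictionary with a `responder` name and an `actionable` flag set by the responder; both field names and the `pager_health` helper are hypothetical.

```python
from collections import Counter


def pager_health(pages):
    """Summarize noise and responder load for a list of page records."""
    total = len(pages)
    noisy = sum(1 for p in pages if not p["actionable"])
    load = Counter(p["responder"] for p in pages)
    return {
        "total_pages": total,
        # Share of pages that required no action; 0.0 when there were no pages.
        "noise_ratio": noisy / total if total else 0.0,
        "pages_per_responder": dict(load),
    }


pages = [
    {"responder": "ana", "actionable": True},
    {"responder": "ana", "actionable": False},
    {"responder": "ben", "actionable": True},
    {"responder": "ana", "actionable": True},
]
report = pager_health(pages)
print(report["noise_ratio"])          # 0.25
print(report["pages_per_responder"])  # {'ana': 3, 'ben': 1}
```

A rising noise ratio points at alerts to tune or remove, while a skewed per-responder count suggests rebalancing the rota.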
Signals you are winning
- Median MTTR falls quarter-over-quarter while incident volume stays flat or drops.
- Pagers are quieter; responders acknowledge alerts quickly and rarely escalate for missing context.
- Post-incident actions close within agreed SLAs and feed platform/product backlogs with meaningful improvements.
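To check the first signal, median time to restore can be grouped by calendar quarter. This is a minimal sketch that assumes each incident is a pair of timestamps, `started_at` (start of customer impact) and `resolved_at`; which timestamps bound an incident is a team decision, and the helper names here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def quarter_of(ts):
    """Return a label such as '2024-Q1' for a timestamp."""
    return f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"


def median_mttr_by_quarter(incidents):
    """Median minutes from impact start to resolution, keyed by quarter."""
    durations = defaultdict(list)
    for started_at, resolved_at in incidents:
        minutes = (resolved_at - started_at).total_seconds() / 60
        durations[quarter_of(started_at)].append(minutes)
    return {q: median(v) for q, v in sorted(durations.items())}


incidents = [
    (datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 11, 30)),  # 90 min
    (datetime(2024, 2, 3, 8, 0), datetime(2024, 2, 3, 8, 40)),      # 40 min
    (datetime(2024, 3, 20, 14, 0), datetime(2024, 3, 20, 15, 0)),   # 60 min
    (datetime(2024, 4, 5, 9, 0), datetime(2024, 4, 5, 9, 30)),      # 30 min
    (datetime(2024, 5, 12, 22, 0), datetime(2024, 5, 12, 22, 50)),  # 50 min
]
trend = median_mttr_by_quarter(incidents)
print(trend)  # {'2024-Q1': 60.0, '2024-Q2': 40.0}
```

The median is used rather than the mean so that a single long-running incident does not mask the typical recovery time. Reading the trend alongside the incident count per quarter helps confirm that the improvement is not simply a result of how incidents are classified.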
Supporting assets
- MTTR response checklist for day-to-day operations.
- FAQ for communicating expectations to executives and partner teams.
- Related manuals: manual/01-dora-accelerate/index.md, manual/04-platform-engineering/index.md.
