MTTR response checklist
Checklist for improving incident detection, diagnosis, and recovery.
- Inventory tier-1 services, confirm SLOs exist, and verify alert thresholds map to user impact.
- Audit alert routing: on-call schedule active, escalation paths documented, and ownership pages current.
- Review dashboards: deploy markers enabled, recent logs surfaced, and runbooks linked from the same view.
- Ensure responders have access to feature flag consoles, rollback scripts, and infrastructure consoles before incidents.
- Conduct monthly gamedays or tabletop exercises covering major failure modes; record outcomes and action items.
- During incidents, assign incident commander, scribe, and communications lead within the first five minutes.
- Capture timeline, contributing factors, and customer impact in the incident report within 24 hours.
- Create remediation tasks with clear owners and due dates; track to closure during ops reviews.
- Update runbooks and the developer portal with lessons learned after each incident.
Prerequisites
- Centralized logging, metrics, and tracing platform accessible to responders.
- Agreement on blameless post-incident practice and time allocated to close actions.
Pitfalls
- Allowing alert fatigue to desensitize responders to real signals.
- Focusing solely on tooling while ignoring staffing, process clarity, or communications.
Need a guided incident rehearsal? Connect with us via /contact.
