MTTR response checklist

Checklist for reducing mean time to recovery (MTTR) by improving incident detection, diagnosis, and recovery.

  1. Inventory tier-1 services, confirm SLOs exist, and verify alert thresholds map to user impact.
  2. Audit alert routing: on-call schedule active, escalation paths documented, and ownership pages current.
  3. Review dashboards: deploy markers enabled, recent logs surfaced, and runbooks linked from the same view.
  4. Ensure responders have access to feature flag consoles, rollback scripts, and infrastructure consoles before an incident starts.
  5. Conduct monthly gamedays or tabletop exercises covering major failure modes; record outcomes and action items.
  6. During an incident, assign an incident commander, a scribe, and a communications lead within the first five minutes.
  7. Capture timeline, contributing factors, and customer impact in the incident report within 24 hours.
  8. Create remediation tasks with clear owners and due dates; track them to closure during ops reviews.
  9. Update runbooks and the developer portal with lessons learned after each incident.
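
The timeline captured in step 7 can feed a simple trend report for ops reviews. The sketch below is illustrative only: the Incident record, its field names, and the summarize helper are hypothetical, and it assumes recovery time is measured from the start of user impact (some teams measure from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions, not a standard schema.
    started: datetime   # when user impact began
    detected: datetime  # when the first alert fired or a report arrived
    resolved: datetime  # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Return mean time to detect and mean time to recover, in minutes."""
    return {
        "mttd_minutes": mean_minutes([i.detected - i.started for i in incidents]),
        "mttr_minutes": mean_minutes([i.resolved - i.started for i in incidents]),
    }
```

Reporting detection time separately from recovery time helps show whether alerting (steps 1 to 3) or response (steps 4 to 6) is the larger contributor.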

Prerequisites

  • Centralized logging, metrics, and tracing platform accessible to responders.
  • Agreement on blameless post-incident practice and time allocated to close actions.

Pitfalls

  • Allowing alert fatigue to desensitize responders to real signals.
  • Focusing solely on tooling while ignoring staffing, process clarity, or communications.
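
One way to spot alert fatigue early is to check how often each alert led to real action. The following is a minimal sketch under stated assumptions: the input is a list of (alert name, was actionable) pairs exported from a paging tool, and the noisy_alerts function and its thresholds are hypothetical defaults to tune, not recommended values.

```python
from collections import Counter


def noisy_alerts(pages, min_pages=5, max_actionable_ratio=0.5):
    """Return alert names that paged often but were rarely actionable.

    pages: iterable of (alert_name, was_actionable) tuples (assumed format).
    """
    total = Counter()
    actionable = Counter()
    for name, was_actionable in pages:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1
    # Ignore alerts with too few pages to judge, then flag low-signal ones.
    return sorted(
        name
        for name, count in total.items()
        if count >= min_pages and actionable[name] / count < max_actionable_ratio
    )
```

Alerts returned by a check like this are candidates for retuning, downgrading to a ticket, or removal.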

Need a guided incident rehearsal? Connect with us via /contact.