Effective On-call

Brendon Thiede
September 4th, 2025

Effective on-call

  • Alerts are meaningful
  • Alerts are actionable
  • Playbooks are concise
  • Playbooks are deterministic (no ambiguity)

Reliability

SRE Service Hierarchy of Needs https://sre.google/sre-book/part-III-practices/

What are alerts?

Alerts are notifications that something requires attention.

  • They can be informational, warning, or critical.
  • They should be meaningful.
  • They should be actionable.
  • They should be prioritized.

How is this different than monitoring?

Monitoring is the process of collecting, analyzing, and using information to track the performance of a system. Alerts are a subset of monitoring, focused on notifying the right people when something goes wrong.

Meaningful alerts

Meaningful alerts represent an actual degradation of service, data loss, or other customer or compliance impacting event.

Symptoms (high resource utilization, error rates, large queues, etc.) do not necessarily indicate a real problem.

Actionable alerts

Actionable alerts are those that can be acted upon by the on-call SRE alerted individual.

  • They should have a clear owner.
  • They should have a clear resolution path.
  • They should not require escalation.

Consider system improvements for recurring alerts.

Signs of bad alerts

  • Self-resolving all the time.
  • Require escalation to another team.
  • Can happen even when things are "good".

Playbooks

Playbooks are step-by-step instructions for responding to alerts.

  • They should be concise.
  • They should be deterministic (no ambiguity).
  • They should be easy to follow.
  • They should be tested regularly.

Bad Playbooks

Bad playbooks are those that are not effective in guiding the on-call engineer through the resolution process.

  • They are too long and complex.
  • They have ambiguous options or language.
  • They are not regularly updated.
  • They do not account for all failure scenarios.
  • They are too broad.
  • They don't have a clear target state/finish line.

What you can do:

  • Alerts are meaningful
  • Alerts are actionable
  • Playbooks are concise
  • Playbooks are deterministic (no ambiguity)

Thank you!

Sleep soundly