On-Call Best Practices: Part 1

This is Part 1 in a multi-part series dealing with tips for being on-call.

Photo Credit: Aaron Jacobs

There is only one thing worse than being woken up at 3am by PagerDuty to learn that your systems are down: waking up on your own at 8am and discovering that your systems were down for 5 hours and nobody got the alert.

This post, along with future ‘Best Practices’ posts, will include tips on how to make sure your high-severity alerts are received reliably and promptly by on-call staff.  This is the first step in reducing your mean time to recovery (MTTR) when – not if – problems happen.

Equipment

First, and most obviously, a cellphone is a must for an on-call shift.  This is for receiving phone and SMS alerts when not at home, as well as for contacting and being contacted by others when high-severity problems occur.

Make sure to set your phone’s ringer volume to ‘high’ when your shift starts, to reduce the chance of sleeping through an alert, or missing a call or SMS in noisy situations. If your SMS ringtone has a separate volume control, make sure to crank that up too. Picking sharp or piercing phone and SMS ringtones always helps. Finally, if you bought a cellphone not exactly known for its awesome battery life, like I did long ago, then you’d be wise to keep a charger handy as well.

Another must-have piece of equipment for an on-call shift is a mobile USB broadband modem or mobile hotspot device. This, of course, is only true if your team can deal with operational issues remotely, such as with a laptop and VPN connection. If this is the case, then having one of these mobile devices allows on-call staff to connect to the internet from wherever they are, rather than having to rush home (or worse, to the office!) to fight operational fires.

These mobile hotspots or modems can be a lifesaver, both by reducing incident response times and by improving the lives of on-call staff, who can now venture further than 10-15 minutes away from home or other sources of guaranteed internet connections. Only one device is needed per team, since it can be passed around along with the primary on-call rotation if need be, and the monthly fees are quite economical for basic data plans. We recently got the new LG Verizon 4G modem for our on-call (and yes, PagerDuty has on-call too), and it seems pretty decent, but any 3G device would likely work just as well.

Contact Methods

Users should always include multiple contact methods in their PagerDuty Contact Info to ensure reliable delivery of notifications. SMS is a quick, terse, and convenient notification method, but the protocol does not guarantee immediate delivery of messages, and notifications can occasionally be delayed significantly within your mobile carrier’s network. We’ve seen occasional delays of several minutes, and sometimes more, between when an SMS is sent and when it is received by a handset. On our end, we partner with multiple SMS providers to try to ensure reliable and timely delivery of our SMS notifications, but it isn’t always enough.

To that end, we strongly recommend using both SMS and phone notifications in your user contact info. Use a fairly short delay between notifications (a couple of minutes at most), as it is quick and easy to acknowledge a notification once it is actually received. If you have a work phone line that is landline or VoIP-based, you can also include that as a third contact method in case your cell coverage is spotty at work. If you want to kick it old-school, you can even set up PagerDuty to send email to your alphanumeric pager! (Just so long as your pager’s wireless carrier has an email-to-pager gateway; ironically, this is the only method of paging that PagerDuty currently supports. Nobody has asked for more yet, but let us know if you want it.)
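To make the staggered setup concrete, here is a minimal sketch in Python of the kind of rule list described above: an SMS right away, a phone call a couple of minutes later, and a work line after that. This is only an illustration of the idea, not PagerDuty's API or data model; the names NotificationRule, RULES, and notifications_before_ack, as well as the specific delays, are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    """Hypothetical model of one contact method (not a PagerDuty type)."""
    method: str          # e.g. "sms", "phone", "work_phone"
    delay_minutes: int   # minutes after the incident is assigned


# Assumed example setup: SMS immediately, then escalating phone calls
# a couple of minutes apart.
RULES = [
    NotificationRule("phone", 2),
    NotificationRule("sms", 0),
    NotificationRule("work_phone", 4),
]


def notifications_before_ack(rules, acknowledged_after=None):
    """Return the rules that fire before the incident is acknowledged.

    acknowledged_after is the minute at which the on-call person
    acknowledges; None means nobody acknowledged, so every rule fires.
    """
    ordered = sorted(rules, key=lambda rule: rule.delay_minutes)
    if acknowledged_after is None:
        return ordered
    return [rule for rule in ordered if rule.delay_minutes < acknowledged_after]


if __name__ == "__main__":
    # Acknowledged at minute 3: the SMS and the cellphone call went out,
    # but the work line never needed to ring.
    for rule in notifications_before_ack(RULES, acknowledged_after=3):
        print(f"t+{rule.delay_minutes}min: notify via {rule.method}")
```

The point of the short gaps is that a delayed SMS costs you at most a couple of minutes before the phone call takes over, and an early acknowledgement cancels everything that has not fired yet.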

Future ‘Best Practices’ posts will give tips for creating robust on-call schedules and escalation policies.  Stay tuned!



6 Responses to On-Call Best Practices: Part 1

  1. Pingback: Sysadmin Sunday #25 « Boxed Ice Blog

  2. Chris says:

    Having a broadband modem is only a viable option if your sysadmin team is all in the same locale. Large organizations, like ours, rotate between members in multiple timezones.

    • John Laban says:

      True, but then your organization could spring for (at least) one mobile broadband modem per local team.

      Or alternatively, if your personal cellphone and mobile carrier support this, you could enable a data-tethering or wifi-hotspot feature on your personal cellphone, and have your employer reimburse you for the extra costs of the data plan, if applicable.

      The increase in uptime (and potentially the decrease in attrition due to stress/strain on the ops team) should be worth the relatively small monthly fees to the organization, IMO.

  3. Rvmenon says:

    Did you write Part 2, and a post on on-call schedules?
