Outage Post-Mortem – June 14

On Thursday, June 14, starting at 8:44pm Pacific time, PagerDuty suffered a serious outage. The application experienced 30 minutes of downtime, followed by a period of high load. We take the reliability of our systems very seriously; it’s our number one priority. We’ve written this post-mortem to give you full disclosure on what happened, what went wrong, what we did right and what we’re doing to ensure this never happens again.

We also want to apologize for the outage. It shouldn’t have lasted as long as it did. Really, it shouldn’t have happened at all.

Background

The PagerDuty infrastructure is hosted on multiple providers. At the time of this writing, our main provider is Amazon AWS and our primary system is hosted on AWS in the US-East region.

We have designed our systems from the outset to be fault tolerant across multiple Availability Zones (AZs). AZs are basically separate data centers within an AWS region. So, in the event of a single AZ failure, PagerDuty can recover quickly by flipping to a backup AZ.

AWS can fail in multiple different ways. The failure of a single AZ is the simple case, which we’ve designed for by distributing our primary stack (including both database and web servers) across multiple AZs.

Another, much more serious, AWS failure scenario is the failure of an entire region at once. Historically, this has happened a few times. PagerDuty is designed to be fault tolerant in this case as well. We’ve set up a hot backup of our full stack on a separate hosting provider (non-Amazon). As part of this, we’ve developed an emergency flip process which allows us to quickly switch from AWS to our backup provider.
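The two failure cases above can be read as a small decision rule. The following is a minimal sketch of that rule, not our actual tooling: the `Failover` enum, the `choose_failover` function and its two boolean inputs are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Failover(Enum):
    """Hypothetical labels for the recovery paths described above."""
    NONE = "no action"
    BACKUP_AZ = "flip to backup AZ"
    BACKUP_PROVIDER = "emergency flip to backup provider"


def choose_failover(primary_az_healthy: bool, region_healthy: bool) -> Failover:
    """Pick a recovery path for the single-AZ and whole-region failure cases."""
    if not region_healthy:
        # A region-wide failure means a backup AZ in the same region
        # cannot be relied on, so leave the provider entirely.
        return Failover.BACKUP_PROVIDER
    if not primary_az_healthy:
        # The simple case: only one AZ is affected.
        return Failover.BACKUP_AZ
    return Failover.NONE
```

In practice the hard part is deciding whether the region as a whole is unhealthy; the sketch assumes that signal is already available.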

The outage

Note: All times referenced below are Pacific time.

  • At 8:44pm, our monitoring tools detected an outage of the PagerDuty app.
  • At 8:49pm, the PagerDuty on-call engineer was contacted. The on-call person realized this was a severity-1 incident and contacted a few other engineers to help out.
  • At 8:55pm, the decision was made to do the emergency flip to our backup provider (the one that is not in AWS). The reason for this decision is that the response team noticed the AWS console was down, which led them to believe that a flip to a backup AZ would not work.
  • At 9:14pm, the flip to the backup hosting provider was completed. At this point, our systems were fully operational but under very high load, working to process a large backlog of notifications.
  • At 9:22pm, the large backlog of notifications was cleared. At this point, we were still under high load: new notifications were taking 5 – 6 minutes to go through.
  • By 10:03pm, the load dropped and we were back to normal, processing notifications within seconds.
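For readers who want the intervals spelled out, the arithmetic behind the timeline can be sketched as follows. The timestamps are the ones listed above; the helper `to_minutes` and the dictionary keys are names made up for this illustration.

```python
def to_minutes(clock: str) -> int:
    """Convert a time such as '8:44pm' to minutes since midnight."""
    time_part, meridiem = clock[:-2], clock[-2:].lower()
    hours, minutes = map(int, time_part.split(":"))
    if meridiem == "pm" and hours != 12:
        hours += 12
    if meridiem == "am" and hours == 12:
        hours = 0
    return hours * 60 + minutes


def minutes_between(start: str, end: str) -> int:
    """Elapsed minutes between two same-day clock times."""
    return to_minutes(end) - to_minutes(start)


# Timestamps taken from the timeline above (Pacific time).
intervals = {
    "detection_to_page": minutes_between("8:44pm", "8:49pm"),
    "page_to_flip_decision": minutes_between("8:49pm", "8:55pm"),
    "flip_duration": minutes_between("8:55pm", "9:14pm"),
    "total_downtime": minutes_between("8:44pm", "9:14pm"),
    "backlog_cleared_after_flip": minutes_between("9:14pm", "9:22pm"),
    "high_load_after_flip": minutes_between("9:14pm", "10:03pm"),
}
```

This gives 5 minutes from detection to page, 6 minutes from page to the flip decision, 19 minutes for the flip itself (roughly the 20 minutes quoted in this post), 30 minutes of total downtime, and 49 minutes of high load after the flip.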

What we did wrong

We are not happy with the fact that it took us 30 minutes from the start of the outage to complete the emergency flip to our backup stack. Also, we are not happy with how we dealt with the high load after the flip was completed. Here is the full list of issues:

  1. Our monitoring tools took 5 minutes to notify us of the problem. This is too long: an acceptable value is in the range of 1-2 minutes.
  2. After the on-call engineer realized he was dealing with a severity 1 issue, it took him several minutes to set up a group call with the rest of the team. Our sev-1 process relies on Skype for setting up the group call. However, Skype crashed on us a few times, resulting in wasted time.
  3. The emergency flip took too long – 20 minutes. The flip guide is too verbose and thus hard to follow during an emergency. Also, some of the people who worked on doing the flip had never done one before, resulting in the flip taking more time.
  4. After the flip, we had trouble spinning up all of our background task processors at full capacity.

What we did right

All in all, we did an emergency data center flip within 30 minutes of an outage and the flip process worked. Here’s the list of things that went well:

  1. We responded very quickly to the outage.
  2. We made the right call to do the emergency flip off of AWS. This was done after noticing that the AWS console was down.
  3. We communicated often via our Twitter channels on both @pagerduty and @pagerdutyops to keep our customers up to date on the outage. Just in case you didn’t already know, our @pagerdutyops Twitter account is reserved for emergency outage notifications; in fact, you can set it up to SMS you for each update we make. For more info, see http://support.pagerduty.com/entries/21059657-what-if-pagerduty-goes-down.
  4. Bigger picture: our redundant hot backup stack hosted on a different (non-Amazon) provider worked very well. This stack helped us contain the outage to only 30 minutes. If we didn’t have the backup stack, the outage might have lasted 1 hour or even longer.

What we’re going to do to prevent this from happening again

Big picture:

We are migrating data centers. We are moving off of AWS US-East and into US-West. This data center migration was scheduled (well before the outage) to happen on June 19. The reason for this migration is that many of our customers (over 20%) run their applications in US-East. When the region is experiencing problems, we lose capacity and we are under heavy load from our customers. Essentially, our failures are correlated with those of our customers. So, in the short term (i.e. tomorrow) we are moving to US-West. As a quick aside, the timing of this outage couldn’t have been worse: we were just 5 days away from our scheduled flip to US-West when the outage happened.

Long term, we’ve realized two main things:

  • We cannot trust any single hosting provider.
  • Flips are bad: they always result in downtime and sometimes they don’t work.

As a result, we will move to a 3-hosting-provider setup where the PagerDuty system is distributed across 3 data centers (3 DCs to start, 5 later on). We are also moving to a multi-node distributed fault-tolerant data store (Cassandra) which doesn’t require flips when a data center fails. This project was started last December and we already have the notification sending component of PagerDuty running on the new stack.
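To give a feel for why 3 and then 5 data centers matter, here is a small sketch of the majority-quorum arithmetic. It assumes one replica per data center and that reads and writes need a majority of replicas; this post does not specify the actual replication or consistency settings, and the function names are illustrative only.

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - majority_quorum(replicas)


# Assumption: one replica per data center.
for data_centers in (3, 5):
    print(
        data_centers,
        "DCs: quorum",
        majority_quorum(data_centers),
        "- tolerates",
        tolerated_failures(data_centers),
        "DC failure(s)",
    )
```

Under these assumptions, 3 data centers keep a majority with 1 data center down, and 5 data centers keep a majority with 2 down, with no flip required in either case.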

Details:

  1. We are looking at implementing another internal monitoring tool which will alert the on-call within 1-2 minutes of the time an outage is detected.
  2. We will look into implementing an alternative to Skype for starting a group call to deal with sev-1 issues. Most likely, we will set up a reliable conference calling bridge, unless we find a better solution. (Readers, if you have any suggestions for this, please tell us in the comments.)
  3. We will work on streamlining the emergency flip process. As part of that, we will condense the flip manual and test it thoroughly. We will also look at automating parts of the process.
  4. We will perform load testing on our systems and look for ways to optimize the background processes that send out notifications. The goal is to recover quickly after doing an emergency flip and deal with high load scenarios.
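
As an illustration of item 1 above, here is a minimal sketch of a check that alerts after a fixed number of consecutive failed health checks. It is not the monitoring tool we will build: `OutageDetector`, `worst_case_alert_delay`, the 30-second interval and the threshold of 3 are all assumptions chosen so that the worst-case delay lands inside the 1-2 minute target.

```python
from dataclasses import dataclass


@dataclass
class OutageDetector:
    """Fire one alert after `failure_threshold` consecutive failed checks."""
    failure_threshold: int = 3
    consecutive_failures: int = 0
    alerted: bool = False

    def record(self, check_ok: bool) -> bool:
        """Record one health-check result; return True if an alert should fire now."""
        if check_ok:
            # Recovery resets the counter and re-arms the alert.
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.alerted:
            self.alerted = True
            return True
        return False


def worst_case_alert_delay(check_interval_s: int, failure_threshold: int) -> int:
    """Upper bound in seconds from outage start to alert.

    Ignores check timeouts and notification delivery time.
    """
    return check_interval_s * failure_threshold
```

With a check every 30 seconds and a threshold of 3, `worst_case_alert_delay(30, 3)` is 90 seconds. Requiring several consecutive failures trades a little latency for fewer false alarms from a single dropped check.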
  • Drew

    May I ask, how do you initiate this failover from one DC to another so quickly? Do you do it at the DNS level? If so, how do you account for the lag in propagating the DNS change?

    • Alex Solomon

      For the emergency flip to our backup hosting provider, we do a DNS flip (we have a short TTL).  It does take a few minutes to propagate but it’s the only solution we have to do a cross-provider flip like this.

      We’ve also done practice flips in the past (where the old stack was still functional), and we noticed that the vast majority of clients were well-behaved, respected the TTL, and moved on to the new IP within minutes.

  • Jeff Gonzalez

    You can create a persistent Google+ Hangout if you schedule it far into the future. Then you can share the link to it on a wiki and anyone who has the link can join in. Sorry for the comment necromancy.