Outage Post Mortem – March 15

As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems very seriously, and are writing this to give you full disclosure on what happened, what we did wrong, what we did right, and what we’re doing to help prevent this in the future.

We also want to let you know that we are very sorry this outage happened. We have been working hard over the past 6 months on re-engineering our systems to be fully fault tolerant. We are tantalizingly close, but not quite there yet. Read on for the full details and steps we are taking to make sure this never happens again.

Background

PagerDuty’s main systems are hosted on Amazon Web Services’ EC2.  AWS has the concept of “Availability Zones” (AZ’s), in which hosts are intended to fail independently of hosts in other availability zones within the same EC2 region.

PagerDuty takes advantage of these availability zones and makes sure to spread its hosts and datastores across multiple AZ’s.  In the event of a failure of a single AZ, PagerDuty can recover quickly by redirecting traffic to a surviving AZ.
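As a rough illustration only, redirecting traffic to a surviving AZ can be thought of as picking the first healthy zone from a preference list. The function name `pick_zone`, the zone names, and the health map are hypothetical and are not taken from PagerDuty's actual systems.

```python
from typing import Dict, Optional, Sequence


def pick_zone(preferred: Sequence[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first zone, in preference order, that is reported healthy.

    Zones missing from the health map are treated as unhealthy.
    """
    for zone in preferred:
        if healthy.get(zone, False):
            return zone
    # No zone is healthy: this is the region-wide failure case,
    # which cannot be solved by moving between zones.
    return None


# Hypothetical zone names, used only for illustration.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
one_zone_down = {"us-east-1a": False, "us-east-1b": True, "us-east-1c": True}
region_down = {zone: False for zone in zones}

print(pick_zone(zones, one_zone_down))
print(pick_zone(zones, region_down))
```

The second call returns `None`, which is the case this kind of failover cannot handle on its own.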

However, there are clearly situations in which all Availability Zones in a given EC2 region fail at once.  From experience, these situations happen roughly every 6 months.  One such region-wide failure occurred early this morning, in which AWS suffered internet connectivity issues across all of its US-East-1 region at once.

The Outage

PagerDuty became inaccessible at 2:27am this morning.

Knowing that fallbacks within other AZ’s aren’t enough, PagerDuty has another fully-functional replica of its entire stack running in another (completely separately owned and operated) datacenter.  We began the procedure to flip to this replica after we were notified of the problem with EC2 and it became obvious that EC2 was having a region-wide outage.

At 2:42am (15 minutes after the start of the outage), EC2’s US-East-1 region re-appeared, and our systems started to quickly process the backlog of incoming API and email-based events, creating a large number of outgoing notifications to our customers.  At this point we aborted the flip to our fallback external notifications stack.

What we did wrong

Fifteen minutes seems like a long time between when our outage began and when we performed our flip.  And it is.

We use multiple external monitoring systems to monitor PagerDuty and alert all of us when there are issues (we can’t use PagerDuty ourselves, alas!).  After careful examination, we found that the alerts from these systems were delayed by a few minutes.  As a result, we responded to the outage a few minutes late.

This is obviously an action item on us to remedy as soon as possible.  These minutes count.  We know they are very important to you.  We will look at switching or augmenting our monitoring systems as soon as possible.

Another miss on our part was not notifying all of you immediately of our outage via our emergency mass-broadcast system (see http://support.pagerduty.com/entries/21059657-what-if-pagerduty-goes-down).  This was due to an internal miscommunication on when it is appropriate to use this system.  We will come out with another blog post shortly that details exactly how we use this system going forward, and a reminder on how you can register yourself for it.

What we did right

We’ve previously taken steps to be able to mitigate these large-scale EC2 events when they happen.

One such step is the very existence of our externally-hosted fallback PagerDuty environment.  This is an (expensive) solution to this rare problem.  We regularly run internal fire drills where we test and practice the procedure to flip to this environment.  We will continue these drills.

Another step that we’ve taken to mitigate these large-scale EC2 events is to make sure our systems can handle the very high amounts of traffic we see when a third of our customers (the ones hosted on EC2) all go down at the same time. We’ve made many improvements to our systems over the past 6 months: our system now queues events quickly, intelligently sheds load under high-traffic scenarios in order to continue operating, and makes absolutely sure not to fail to page any of our customers.  These systems performed very well this morning, preventing further alerting delays.
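The combination of queueing, load shedding, and never dropping a page can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of PagerDuty's implementation: the `Event` and `SheddingQueue` names, the `critical` flag, and the `soft_limit` threshold are all hypothetical, and the post does not say which work actually gets shed.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class Event:
    key: str
    critical: bool  # hypothetical flag: True if the event must lead to a page


class SheddingQueue:
    """FIFO queue that sheds non-critical events once a soft limit is reached.

    Critical events are always accepted, so a traffic spike never causes
    a page to be dropped; only non-critical work is refused.
    """

    def __init__(self, soft_limit: int) -> None:
        self.soft_limit = soft_limit
        self._items: Deque[Event] = deque()
        self.shed_count = 0

    def offer(self, event: Event) -> bool:
        """Enqueue the event; return False if it was shed."""
        if not event.critical and len(self._items) >= self.soft_limit:
            self.shed_count += 1
            return False
        self._items.append(event)
        return True

    def take(self) -> Optional[Event]:
        """Remove and return the oldest queued event, or None if empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)


queue = SheddingQueue(soft_limit=2)
results = [
    queue.offer(Event("a", critical=True)),
    queue.offer(Event("b", critical=False)),
    queue.offer(Event("c", critical=False)),  # over the limit: shed
    queue.offer(Event("d", critical=True)),   # critical: always accepted
]
drained = [queue.take().key for _ in range(len(queue))]
```

In this sketch the soft limit only applies to non-critical events, so the queue can grow past it under a burst of critical events; a real system would also need to bound or persist that backlog.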

What we’re going to do

A flip, no matter how quick, involves some downtime. This leaves a sour taste in our mouths. We are working (hard!) on our internal re-architecture to fully move to a notification processing system that involves NO temporary single points of failure, even when that SPOF is “all of EC2 east”.

Our new system will use a clustered multi-node datastore deployed on multiple hosts located in multiple independent data centers with different hosting providers. The new system will be able to survive a data center outage without any flips whatsoever. That’s right, we’re going flip-less (because the word “flip” is synonymous with “outage”). We are working full-steam on building this new system and deploying it as soon as possible, while making sure we stay stable during the changeover. This re-engineering effort is fairly substantial, so stand by for a few shorter term solutions.
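One common way a multi-node datastore survives the loss of a data center without a flip is majority-quorum writes. The sketch below assumes that scheme purely for illustration; this post does not name the datastore or its replication strategy, and the function and data center names are hypothetical.

```python
from typing import Dict


def write_succeeds(acks: Dict[str, bool]) -> bool:
    """Return True when a strict majority of replicas acknowledged the write.

    `acks` maps a replica's data center name to whether it acknowledged.
    """
    needed = len(acks) // 2 + 1
    return sum(acks.values()) >= needed


# Three replicas in three independent (hypothetical) data centers.
all_up = {"dc-a": True, "dc-b": True, "dc-c": True}
one_down = {"dc-a": False, "dc-b": True, "dc-c": True}
two_down = {"dc-a": False, "dc-b": False, "dc-c": True}

print(write_succeeds(all_up))
print(write_succeeds(one_down))
print(write_succeeds(two_down))
```

With three replicas, writes keep succeeding while any single data center is unreachable, and no traffic has to be redirected; losing two of the three stops writes.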

During our internal post-mortem this morning, we identified a few places where we can immediately improve the availability of our external event endpoints. These include building better redundancy into our email endpoint as well as our API endpoint. We are prioritizing these changes to the top of the heap.

We are also taking a closer look at moving our primary systems off of AWS US-East. In the short-term, we will continue to use US-East in some capacity (perhaps as a secondary provider). Longer term, we will switch all of our critical systems off of AWS altogether.

Finally, as mentioned above, we will improve our own monitoring systems. We’ve had alerts delivered too slowly by our own external web monitoring, and we will fix this ASAP. We will also improve our Twitter-based emergency broadcast procedure, which helps us announce to you when we are experiencing internal problems. Stay tuned for another blog post about this in the next few days.

 
