Outage Post-Mortem – April 13, 2013

We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it.  Most of this work is invisible, hidden behind the API and the user interface our customers interact with. However, when those systems fail, the failure is very noticeable: notifications are delayed and our API endpoints return 500s.  That’s what happened on Saturday, April 13, at around 8:00am Pacific Time, when PagerDuty suffered an outage triggered by degradation in a peering point used by two AWS regions.

We are writing this post to let our customers know what happened, what we have learned, and what we’ll do to fix the issues uncovered by this outage.

Background

PagerDuty’s infrastructure is hosted in three different datacenters (two in AWS and one in Linode).  For the past year, we’ve been rearchitecting our software so that it can survive the outage of an entire datacenter, including that datacenter being partitioned from the network. What we did not specifically design for was the simultaneous failure of two datacenters. However unlikely, that is what happened on Saturday morning. We treat each AWS region as a datacenter, and when both of them failed at the same time, we could not remain available on our one remaining datacenter.

We picked our three datacenters to have no dependencies on one another, and made sure that they are physically separated. However, we have since learned that two of the datacenters shared a common peering point. This peering point experienced an outage that took both of those datacenters offline.
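
The hidden dependency described above can be checked for mechanically. The following is a minimal sketch, assuming a simple majority quorum across datacenters; the datacenter and peering-point names are hypothetical placeholders, not our real topology. It lists every upstream dependency whose failure alone would leave fewer datacenters than a majority.

```python
# Hypothetical model: each datacenter maps to the upstream dependencies it
# relies on. All names here are illustrative, not a real topology.
DATACENTERS = {
    "aws-region-a": {"peering-point-x"},
    "aws-region-b": {"peering-point-x"},
    "linode": {"peering-point-y"},
}


def quorum_size(n):
    """Smallest strict majority of n members (assumes majority quorum)."""
    return n // 2 + 1


def survivors(datacenters, failed_dependency):
    """Datacenters that do not rely on the failed dependency."""
    return [dc for dc, deps in datacenters.items() if failed_dependency not in deps]


def single_points_of_failure(datacenters):
    """Dependencies whose failure alone leaves fewer than a majority alive."""
    needed = quorum_size(len(datacenters))
    all_deps = set().union(*datacenters.values())
    return sorted(
        dep for dep in all_deps if len(survivors(datacenters, dep)) < needed
    )


if __name__ == "__main__":
    print(single_points_of_failure(DATACENTERS))
```

With three datacenters a majority is two, so any dependency shared by two of them shows up in the result even though the datacenters themselves are physically separate.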

The Outage

Note: All times referenced below are in Pacific Time.

  • At 7:57am, according to AWS, a connectivity issue begins due to a degrading peering point in Northern California
  • At 8:11am, the PagerDuty on-call engineer is paged about an issue with one of the nodes in our notification dispatch system
  • At 8:13am, an attempt is made to bring back the failed node, without success
  • At 8:18am, our monitoring system detects multiple-provider failure for notifications (caused by the connectivity issue). At this time, most notifications are still going through, but with increased latencies and error rates
  • At 8:31am, a Sev-2 is declared and more engineers are paged to help out
  • At 8:35am, PagerDuty completely loses its ability to dispatch notifications, as it cannot establish quorum due to high network latency. A Sev-1 is declared
  • At 8:53am, the PagerDuty notification dispatch system reaches quorum and starts to process all queued notifications
  • At 9:23am, according to AWS, the connectivity issue at the Northern California peering point ends

During the post-mortem analysis, our engineers also determined that a misconfiguration on our coordinator service prevented us from recovering quickly.  In all, PagerDuty wasn’t able to dispatch notifications for 18 minutes between 8:35am and 8:53am; however, during this time, our events API was still able to accept events.
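
To illustrate how high latency alone can cost a cluster its quorum, here is a rough sketch. It assumes a node counts a peer as reachable only if the peer answers within a timeout, and that quorum means a strict majority of the cluster including the node itself. The function name, the timeout, and the latency figures are illustrative assumptions, not measurements from this incident or the actual protocol of our coordinator service.

```python
def has_quorum(peer_latencies_ms, timeout_ms, cluster_size):
    """Return True if this node can see a strict majority of the cluster.

    peer_latencies_ms holds one round-trip latency per *other* node,
    or None if that peer did not answer at all. (Hypothetical model.)
    """
    # Count ourselves, plus every peer that answered within the timeout.
    reachable = 1 + sum(
        1
        for latency in peer_latencies_ms
        if latency is not None and latency <= timeout_ms
    )
    return reachable >= cluster_size // 2 + 1


if __name__ == "__main__":
    # Illustrative numbers only: a three-node cluster with a 200 ms timeout.
    print(has_quorum([20, 30], 200, 3))     # both peers respond in time
    print(has_quorum([20, 900], 200, 3))    # one slow peer, majority still holds
    print(has_quorum([900, None], 200, 3))  # both peers slow or unreachable
```

In this model a peer that is up but slow is indistinguishable from a peer that is down, which is why a latency spike across two of three sites has the same effect as losing both.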

What we’re going to do

As with every major outage, we learned something new about deficiencies in our software.  These are some of our plans to rectify the issues we discovered.

Short term

  1. During our analysis, we found that we didn’t have adequate logging to debug issues within some of our systems.  We have now added more logging and started aggregating the logs in a single place for better searchability.
  2. During the outage, most of the failed coordinator processes were restarted manually.  We are going to add a process watcher to restart such processes automatically.
  3. We also found that we didn’t have good visibility into the inter-host connectivity. We’ll be building a dashboard that shows this.
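
As an illustration of the process watcher in item 2, here is a minimal sketch of a restart loop. The function name and parameters are hypothetical, and this is not the tool we will deploy; in practice an existing supervisor such as systemd, runit, or supervisord would likely do this job.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=1.0):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Stops when the process exits cleanly or max_restarts is used up.
    Returns the number of restarts performed. (Hypothetical helper.)
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        # Linear backoff so a crash-looping process does not spin the host.
        time.sleep(backoff_seconds * restarts)


if __name__ == "__main__":
    # Demo with a child process that always fails immediately.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_restarts=2, backoff_seconds=0))
```

Capping the number of restarts matters: a watcher that restarts forever can hide a persistent fault, so once the cap is reached the failure should be escalated to a human.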

Long term

  1. We found that not all of our engineering staff are up to date with Cassandra and ZooKeeper.  We’ll be investing time to train our staff on both of these technologies.
  2. Investigate moving off one of the AWS regions.  We’ll need to do our homework when picking a new hosting provider and datacenter to avoid another single point of failure.

7 Responses to Outage Post-Mortem – April 13, 2013

  1. tcigna says:

    Can you clarify one of the items, are your two AWS instances in the same Availability Zone? Wouldn’t you consider moving one to another AZ to prevent this? Thanks.

    • Ryan Hoskin says:

      Not only are we in different availability zones, we’re in completely different regions that were both affected by a single peering point.  We’re in the process of evaluating changing one of the regions to avoid this type of vulnerability.

  2. Clear, concise and honest – many thanks

  3. Dennis Rowe says:

    How did you page out when PagerDuty was down?

    • Ryan Hoskin says:

      There was an 18 minute period where outgoing alerts were queued. In terms of receiving events our API endpoints may have returned 500 during this time. Incoming email alerts to us were queued and sent out once we recovered. During the outage we tweeted on the @pagerdutyops to notify customers that there was a service interruption. Let me know if this doesn’t address your question.
