PagerDuty Blog

Outage Post-Mortem – Jan 24, 2013

On January 24, 25 and 26, 2013, PagerDuty suffered several outages.  The events API, which our customers use to submit events into PagerDuty from their monitoring tools, was down during the outages.  Our web application, used to access and configure customer accounts, was also down during these periods.

We’ve written this post-mortem to explain what happened and what we’re doing to ensure it never happens again.  Last but not least, we would like to apologize for these outages.  While no single outage was prolonged, we strongly believe that even 2 minutes of downtime is unacceptable, and we are working hard to improve our availability, both in the short term and the long term.

Background

The PagerDuty infrastructure is hosted in multiple data centers (DCs).  The notification dispatch component of PagerDuty is fully redundant across 3 DCs and can survive a DC outage without any downtime.  We’ve designed the system to use a distributed data store which doesn’t require any sort of failover or flip when an entire DC goes offline.

However, the events API, which is backed by a queuing system, still relies on our legacy database system, a traditional RDBMS.  This system has a primary database that is synchronously replicated to a secondary host, plus a tertiary database that is asynchronously replicated (in case both the primary and secondary have problems).  If the primary host goes down, our standard operating procedure is to flip to the secondary host.  The downside is that the flip process requires a few minutes of downtime.
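
For context, the flip is roughly the sequence of steps below.  This is a simplified sketch assuming a MySQL-style primary/secondary replication pair; the host names and exact statements are illustrative, not our actual tooling.

    #!/usr/bin/env python3
    """Simplified sketch of a primary-to-secondary flip.
    Assumes a MySQL-style replication pair; host names and statements
    are illustrative, and the real procedure is run from a runbook."""
    import subprocess

    OLD_PRIMARY = "db-primary.example.internal"    # hypothetical hosts
    NEW_PRIMARY = "db-secondary.example.internal"

    def run_sql(host, statement):
        # Execute a single SQL statement on a host via the mysql CLI.
        subprocess.run(["mysql", "-h", host, "-e", statement], check=True)

    # 1. Stop writes on the old primary, if it is still reachable.
    run_sql(OLD_PRIMARY, "SET GLOBAL read_only = 1;")

    # 2. Once the secondary has applied everything, detach it from replication.
    run_sql(NEW_PRIMARY, "STOP SLAVE;")

    # 3. Open the new primary for writes.
    run_sql(NEW_PRIMARY, "SET GLOBAL read_only = 0;")

    # 4. Repoint the application at the new primary (config push, VIP move,
    #    or DNS change), then verify the events API and website recover.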

Outage Details

Note: All times referenced below are in Pacific time.

On 1/24

  • At 8:25am, the events API and website both went down.
  • At 8:25am, the PagerDuty on-call engineers were alerted.
  • At 8:32am, we started a Severity-1 conference call.
  • At 8:36am, we started the flip process from the primary db to the secondary.
  • At 8:41am, the flip process was completed.
  • At 8:42am, the events API and website were brought back online.

Later on that day, we had several blips:

  • At 4:16pm: small blip – 1min outage
  • At 10:37pm: small blip – 1min outage
  • At 10:51pm: small blip – 1min outage

Throughout the day, we investigated the issue and worked on the post-mortem.  As part of the investigation, we noticed a large number of invocations of a particular slow query on the database.  We modified the code to stop invoking the offending query.  At that point, we believed the outages had been caused by a single slow query, which we had now fixed, so we thought the underlying problem was also fixed.
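
For illustration, this kind of pattern can be surfaced by digesting the database’s slow query log and counting how often each slow statement appears.  A minimal sketch, assuming a MySQL-style slow log format (the file path is hypothetical):

    #!/usr/bin/env python3
    """Minimal slow-query digest: count statements whose Query_time exceeds
    a threshold.  Assumes a MySQL-style slow query log; path is hypothetical."""
    import re
    from collections import Counter

    SLOW_LOG = "/var/log/mysql/slow.log"   # hypothetical path
    THRESHOLD = 1.0                        # seconds

    counts = Counter()
    query_time = None
    with open(SLOW_LOG) as log:
        for line in log:
            match = re.search(r"# Query_time: ([\d.]+)", line)
            if match:
                query_time = float(match.group(1))
            elif (query_time is not None and query_time > THRESHOLD
                  and line.strip().upper().startswith(
                      ("SELECT", "INSERT", "UPDATE", "DELETE"))):
                counts[line.strip()[:80]] += 1   # group by the statement's first line
                query_time = None

    for statement, count in counts.most_common(10):
        print(f"{count:6d}  {statement}")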

On 1/25

  • At 7:05am: small blip – 3min outage

We investigated the new outage and found another problematic slow query, which we fixed immediately.

On 1/26

  • At 2:28am, the events API and website went down.
  • At 2:28am, the on-call engineers were paged.
  • At 2:38am, both the events API and website had recovered.

At this point, we concluded that the best thing to do was to upgrade the db machines to larger hosts.  Engineers worked through the night to build all-new db machines (primary, secondary and tertiary) on better hardware.

Around 6am, we believed the new machines were ready.  From 6:15am to 9:14am, we attempted a couple of times to flip the database to the new primary machine, each time unsuccessfully.  Each of these attempts caused a few minutes of downtime.

At this point, we gave up on flipping to the new machine.  The flip did not work because the data snapshot on the new machine was not uploaded correctly, a mistake made by engineers who were extremely tired and burned out after working through the night on the upgrades.

After about 12 hours of rest, the engineers started from scratch and built new db machines.  The freshly rested engineers put a new primary database in place; a few hours later, they also put in an upgraded secondary database and an upgraded tertiary database.

What we’re going to do to prevent this from happening again

Short term

We will set up rigorous monitoring for slow queries on our data store [already done].  We will also automate the building of a new database server via chef.  The db server was one of the last components in our infrastructure to be chef’ed, and on 1/26 and 1/27 we rebuilt the db machines by hand instead of using chef, which was a time-consuming and error-prone process.
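
As a sketch of what the slow-query monitoring looks like, the check below polls MySQL’s cumulative Slow_queries counter and raises an alert when it grows faster than an allowed rate.  The host name, threshold and alerting hook are illustrative assumptions, not our actual monitoring code.

    #!/usr/bin/env python3
    """Sketch of slow-query monitoring: poll MySQL's cumulative Slow_queries
    counter and alert when it grows faster than an allowed rate.  Host name,
    threshold and the alerting hook are illustrative assumptions."""
    import subprocess
    import time

    DB_HOST = "db-primary.example.internal"   # hypothetical host
    MAX_SLOW_PER_MINUTE = 5

    def slow_query_count():
        # Slow_queries is a cumulative counter maintained by MySQL.
        out = subprocess.run(
            ["mysql", "-h", DB_HOST, "-N", "-e",
             "SHOW GLOBAL STATUS LIKE 'Slow_queries';"],
            check=True, capture_output=True, text=True).stdout
        return int(out.split()[-1])

    def trigger_incident(description):
        # Placeholder: in practice this would open an incident, e.g. by
        # posting to an alerting endpoint such as the PagerDuty events API.
        print("ALERT:", description)

    previous = slow_query_count()
    while True:
        time.sleep(60)
        current = slow_query_count()
        if current - previous > MAX_SLOW_PER_MINUTE:
            trigger_incident(
                f"{current - previous} slow queries in the last minute on {DB_HOST}")
        previous = current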

We will also institute a more rigorous development process, whereby new features and changes to the code base must be vetted for database performance as part of the regular code review process [already done].
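
One way to make that vetting concrete during review is to run EXPLAIN on any new or changed query against a database with production-like data and flag plans that do a full table scan.  A minimal sketch, with a hypothetical staging host and example query:

    #!/usr/bin/env python3
    """Sketch of a review-time check: run EXPLAIN on a proposed query and
    flag plans that don't use an index.  Host and query are illustrative;
    a real check would run against a staging copy with production-like data."""
    import subprocess

    STAGING_HOST = "db-staging.example.internal"              # hypothetical host
    query = "SELECT * FROM incidents WHERE account_id = 42"   # query under review

    plan = subprocess.run(
        ["mysql", "-h", STAGING_HOST, "-E", "-e", "EXPLAIN " + query],
        check=True, capture_output=True, text=True).stdout

    # In MySQL's vertical EXPLAIN output, access type "ALL" means a full table
    # scan, and "key: NULL" means no index was chosen.
    if "type: ALL" in plan or "key: NULL" in plan:
        print("WARNING: query may not be index-backed:\n" + plan)
    else:
        print("Query plan looks index-backed.")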

We will also set up better host metrics for the database server so we can detect early on when we are approaching capacity, and upgrade the server in an orderly way.
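
For example, a basic capacity check on the db host might track CPU, memory and disk usage and flag when any of them pass a threshold.  A minimal sketch using the psutil library (thresholds are illustrative):

    #!/usr/bin/env python3
    """Sketch of basic host-capacity metrics for a database server.
    Thresholds are illustrative; in practice readings are shipped to a
    metrics system and alerted on trends, not checked one-off like this."""
    import psutil

    THRESHOLDS = {"cpu": 80.0, "memory": 85.0, "disk": 80.0}   # percent

    readings = {
        "cpu": psutil.cpu_percent(interval=1),        # % CPU over a 1s sample
        "memory": psutil.virtual_memory().percent,    # % RAM in use
        "disk": psutil.disk_usage("/").percent,       # % of root volume used
    }

    for name, value in readings.items():
        status = "OVER THRESHOLD" if value > THRESHOLDS[name] else "ok"
        print(f"{name}: {value:.1f}% ({status})")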

Long term

We will remove our events API’s dependency on our main RDBMS.  To give a bit more context, our events API is backed by a queue: incoming events are enqueued, and background workers process the queued events.  This design lets us properly handle and process large volumes of event traffic.
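
A minimal sketch of that enqueue/worker split (the in-memory queue here is only a stand-in for a durable, replicated one):

    #!/usr/bin/env python3
    """Sketch of the enqueue/worker pattern behind an events API: the API
    handler only enqueues, and background workers drain the queue.  The
    in-memory queue is a stand-in for a durable, replicated one."""
    import queue
    import threading

    event_queue = queue.Queue()

    def handle_incoming_event(payload):
        # The API handler does the minimum: accept the event and enqueue it,
        # so bursts of traffic don't block on downstream processing.
        event_queue.put(payload)

    def worker():
        while True:
            event = event_queue.get()
            # Real processing: de-duplication, routing, notification dispatch.
            print("processing", event)
            event_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()

    handle_incoming_event({"service_key": "example", "description": "CPU load high"})
    event_queue.join()   # wait until the worker has drained the queue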

Currently, this queue is reliant on our main SQL database.  As explained above, this DB is fully redundant with 2 backups across 2 data centers, but requires a failover when the main (primary) db goes down.

As a result of this post-mortem, we will fast-track a project to re-architect the events API queue and workers to use our newer distributed data store.  This data store is distributed across 5 nodes and 3 independent data centers, and it’s designed to survive the outage of an entire data center without requiring any failover process and without any downtime whatsoever.
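
To illustrate why that design needs no failover, here is a conceptual sketch of quorum replication: a write succeeds as long as a majority of replicas acknowledge it, so losing one data center’s replicas doesn’t interrupt writes.  The replica names, quorum size and replica interface are illustrative assumptions, not our actual data store.

    #!/usr/bin/env python3
    """Conceptual sketch of quorum replication across 5 nodes in 3 DCs:
    a write succeeds if a majority of replicas acknowledge it, so losing
    an entire DC does not require any failover.  Names are illustrative."""

    REPLICAS = ["dc1-a", "dc1-b", "dc2-a", "dc2-b", "dc3-a"]   # 5 nodes, 3 DCs
    QUORUM = len(REPLICAS) // 2 + 1                            # 3 of 5

    def write_to_replica(replica, key, value, down_dcs):
        # Stand-in for a network write; it fails if the replica's DC is down.
        return not any(replica.startswith(dc) for dc in down_dcs)

    def quorum_write(key, value, down_dcs=()):
        acks = sum(write_to_replica(r, key, value, down_dcs) for r in REPLICAS)
        return acks >= QUORUM

    # Losing one data center still leaves 3 of 5 replicas, so writes succeed.
    print(quorum_write("event:123", "payload", down_dcs=("dc1",)))          # True
    print(quorum_write("event:123", "payload", down_dcs=("dc1", "dc2")))    # False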