Category Archives: Availability

Outage Post-Mortem – April 13, 2013

We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it. Most of this work is invisible, hidden behind the API and the user interface our customers interact with. However, when they fail, … Continue reading

Posted in Availability | 7 Comments

Outage Post-Mortem – Jan 24, 2013

On January 24, 25 and 26, 2013, PagerDuty suffered several outages. The events API, used by our customers to submit monitoring events into PagerDuty from monitoring tools, was down during the outages. Our web application, used to access and configure … Continue reading

Posted in Availability | Leave a comment

A UTC Leap second vs Derecho

At PagerDuty, we usually get a front-row seat to anything that’s wrong with the internet. Last weekend, a derecho storm took out 7% of AWS and a leap second added to UTC caused servers to panic. As we mentioned in … Continue reading

Posted in Availability | Leave a comment

AWS Outage (June 29th) – Weathering the Storm

On the evening of Friday, June 29th, Amazon Web Services (AWS) experienced a major outage at its North Virginia location due to a loss of power. This outage, the second in June, affected numerous AWS customers who use PagerDuty. As … Continue reading

Posted in Availability | 1 Comment

Outage Post-Mortem – June 14

On Thursday, June 14, starting at 8:44pm Pacific time, PagerDuty suffered a serious outage. The application experienced 30 minutes of downtime, followed by a period of high load. We take the reliability of our systems very seriously; it’s our number … Continue reading

Posted in Announcements, Availability | 3 Comments

Pressure Release Valves

This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with … Continue reading

Posted in Availability | Leave a comment

A Standard Operating Procedure for when s*IT hits the fan

This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – MTBF … Continue reading

Posted in Availability | Leave a comment

Availability lessons from shoe companies and ancient warlords

This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – … Continue reading

Posted in Availability | 1 Comment

The ups and downs of Availability

This post is meant as a quick introduction to some concepts of system availability, so that subsequent posts in this series make sense. I’ll go over concepts like availability, SLA, mean time between failure, mean time to recovery, etc. Continue reading

Posted in Availability | 5 Comments