Category Archives: Availability
Apr 24 13
Outage Post-Mortem – April 13, 2013
We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it. Most of this work is invisible, hidden behind the API and the user interface our customers interact with. However, when they fail, … Continue reading
Posted in Availability
7 Comments
Jan 29 13
Outage Post-Mortem – Jan 24, 2013
On January 24, 25 and 26, 2013, PagerDuty suffered several outages. The events API, used by our customers to submit monitoring events into PagerDuty from monitoring tools, was down during the outages. Our web application, used to access and configure … Continue reading
Posted in Availability
Jul 13 12
A UTC Leap second vs Derecho
At PagerDuty, we usually get a front seat to anything that’s wrong with the internet. Last weekend, a derecho storm took out 7% of AWS and a leap second added to UTC caused servers to panic. As we mentioned in … Continue reading
On the evening of Friday, June 29th, Amazon Web Services (AWS) experienced a major outage at its North Virginia location due to a loss of power. This outage, the second in June, affected numerous AWS customers who use PagerDuty. As … Continue reading
Jun 18 12
Outage Post-Mortem – June 14
On Thursday, June 14, starting at 8:44pm Pacific time, PagerDuty suffered a serious outage. The application experienced 30 minutes of downtime, followed by a period of high load. We take the reliability of our systems very seriously; it’s our number … Continue reading
Posted in Announcements, Availability
3 Comments
Jan 27 12
Pressure Release Valves
This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with … Continue reading
This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – MTBF … Continue reading
This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure – … Continue reading
Apr 18 11
The ups and downs of Availability
This post is meant as a quick introduction to some concepts of system availability, so that subsequent posts in this series make sense. I’ll go over concepts like availability, SLA, mean time between failure, mean time to recovery, etc. Continue reading

