Category Archives: Availability

Outage Post-Mortem – March 25, 2014

On March 25th, PagerDuty suffered intermittent service degradation over a three-hour span, which affected our customers in a variety of ways. During the service degradation, PagerDuty was unable to accept 2.5% of attempts to send events to our integrations … Continue reading

Posted in Availability

Outage Post-Mortem – Jan 23, 2014

At PagerDuty, our customers rely on us to be highly available and reliable when their infrastructure may not be. Unfortunately, bugs sometimes surface in our software. When these incidents occur, we make sure that we offer transparency for our customers … Continue reading

Posted in Availability

Outage Post-Mortem – Jan 16, 2014

At PagerDuty we offer transparency of any outage that negatively impacts PagerDuty customers. We are proud of PagerDuty’s superior reliability, but occasionally we may have a snafu. We recommend that you follow our dedicated Twitter account, @PagerDutyOps, to be notified … Continue reading

Posted in Availability | 1 Comment

How PagerDuty Offers Transparency of Outages for Our Customers

At PagerDuty we’ve invested in superior reliability of our service. We strive for 100% uptime to ensure that any events detected by your monitoring tools are routed to the correct person on your team via a PagerDuty alert. While we … Continue reading

Posted in Availability, Community, Customer

Outage Post-Mortem – Dec 11, 2013

On Dec 11th, PagerDuty suffered an outage which affected a subset of customers and blocked access to all pagerduty.com addresses. First off, we deeply apologize for this. Any outage, no matter how many customers were affected, is unacceptable. The root … Continue reading

Posted in Availability

Don’t Lose $460 Million From a Software Glitch

High-frequency trading accounts for 50% of US securities trading. With thousands of securities totaling millions of dollars traded every millisecond, robust and reliable computer systems are needed to automate the trades. These automated microtransactions can produce huge rewards. But serious … Continue reading

Posted in Availability, Best Practices, On-Call

Outage Post-Mortem – May 30, 2013

As a member of PagerDuty’s realtime engineering team, I count designing and implementing our systems with high availability and reliability among my top concerns. On May 30, 2013 we had a brief outage that resulted in a degradation of our alerting reliability. … Continue reading

Posted in Availability

Outage Post-Mortem – April 13, 2013

We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it. Most of this work is invisible, hidden behind the API and the user interface our customers interact with. However, when they fail, … Continue reading

Posted in Availability | 7 Comments

Outage Post-Mortem – Jan 24, 2013

On January 24, 25 and 26, 2013, PagerDuty suffered several outages. The events API, which our customers use to submit events from their monitoring tools into PagerDuty, was down during the outages. Our web application, used to access and configure … Continue reading

Posted in Availability

A UTC Leap Second vs. Derecho

At PagerDuty, we usually get a front seat to anything that’s wrong with the internet. Last weekend, a derecho storm took out 7% of AWS and a leap second added to UTC caused servers to panic. As we mentioned in … Continue reading

Posted in Availability