Category Archives: Reliability

Outage Post Mortem – June 3rd & 4th, 2014

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. In … Continue reading

Posted in Reliability

Mobile Monitoring Metrics that Matter for Reliability

This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app performance, allowing Developers and IT Operations to deliver high performing, highly reliable, highly available mobile … Continue reading

Posted in Partnerships, Reliability

Developers Need Monitoring Too

This is a guest blog post from Erik Näslund, Director of Disrapt. Erik is a back-end developer and operations guy. He created his first game at the age of six using AMOS Professional on the Amiga. There was a period … Continue reading

Posted in Partnerships, Reliability

Lessons Learned from Creating a Reliable Mobile Build

PagerDuty engineers are obsessed with reliability. Letting down customers when they’ve been paged is the worst. With that in mind, we’re always designing and thinking of ways to maintain and build systems that maximize resiliency — including our mobile apps. … Continue reading

Posted in Reliability

End-to-End SMS Provider Testing, It’s How We Ensure SMS Alerts are Delivered

Reliability is important to us. We even inject failure into our systems every Friday to prove it. But when it comes to sending alerts, reliability goes beyond writing flawless code. We rely on several third-party carriers to deliver alerts to our … Continue reading

Posted in Reliability

Outage Post Mortem – April 14th, 2014

On April 14th, PagerDuty suffered an outage that affected customers on both the mobile and web applications. During the outage, customers may have had issues managing their accounts, and some alerts were delayed. When these incidents … Continue reading

Posted in Reliability

Keep Your Website Available with the Right Monitoring Practices

In its simplest form, website monitoring is the process of testing and verifying that end-users can actually use your service. There are several great SaaS applications that will ping your system to let you know if you are up … Continue reading

Posted in Reliability

Increasing Quality and Reliability with Continuous Integration

Continuous integration (CI) is a software development practice where team members frequently merge their work to reduce problems and conflicts. Each push is backed by an automated build (and test) to detect errors. By checking in with one another frequently, teams … Continue reading

Posted in Operations Performance, Reliability

Outage Post-Mortem – March 25, 2014

On March 25th, PagerDuty suffered intermittent service degradation over a three-hour span, which affected our customers in a variety of ways. During the service degradation, PagerDuty was unable to accept 2.5% of attempts to send events to our integrations … Continue reading

Posted in Reliability

Injecting Failure at Netflix, Staying Reliable for 40+ Million Customers

Corey Bertram, Site Reliability Engineer at Netflix, recently spoke to a DevOps Meetup group at PagerDuty HQ about injecting failure at Netflix. Corey wanted to show people what can go wrong, because anything that can go wrong will. Promoting … Continue reading

Posted in Reliability