Category Archives: Reliability

How to Ditch Scheduled Maintenance

You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution: Ditch scheduled maintenance altogether. That sounds like a bold proposition. But as Doug explained at DevOps Days … Continue reading

Posted in Operations Performance, Reliability | Leave a comment

Who watches the watchmen?

How we drink our own champagne (and do monitoring at PagerDuty). We deliver over 4 million alerts each month, and companies count on us to let them know when they have outages. So, who watches the watchmen? Arup Chakrabarti, PagerDuty’s … Continue reading

Posted in Events, Operations Performance, Reliability | 1 Comment

Blameless post mortems – strategies for success

When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for running blameless post mortems. Failure is inevitable in complex systems. While it’s tempting to find … Continue reading

Posted in Operations Performance, Reliability | 1 Comment

A Disunity of Data: The Case For Alerting on What You See

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack. The assumption underlying all monitoring systems is the existence … Continue reading

Posted in Partnerships, Reliability | Comments Off

Outage Post Mortem – June 3rd & 4th, 2014

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. In … Continue reading

Posted in Reliability | Comments Off

Mobile Monitoring Metrics that Matter for Reliability

This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app performance, allowing Developers and IT Operations to deliver high performing, highly reliable, highly available mobile … Continue reading

Posted in Partnerships, Reliability | Comments Off

Developers Need Monitoring Too

This is a guest blog post from Erik Näslund, Director of Disrapt. Erik is a back-end developer and operations guy. He created his first game at the age of six using AMOS Professional on the Amiga. There was a period … Continue reading

Posted in Partnerships, Reliability | 1 Comment

Lessons Learned from Creating a Reliable Mobile Build

PagerDuty engineers are obsessed with reliability. Letting down customers when they’ve been paged is the worst. With that in mind, we’re always designing and thinking of ways to maintain and build systems that maximize resiliency — including our mobile apps. … Continue reading

Posted in Reliability | Comments Off

End-to-End SMS Provider Testing, It’s How We Ensure SMS Alerts are Delivered

Reliability is important to us. We even inject failure into our systems every Friday to prove it. But when it comes to sending alerts, reliability goes beyond writing flawless code. We rely on several third-party carriers to deliver alerts to our … Continue reading

Posted in Reliability | Comments Off

Outage Post Mortem – April 14th, 2014

On April 14th, PagerDuty suffered an outage that affected customers on both the mobile and web applications. During the outage, customers may have had issues managing their accounts, and some alerts were delayed. When these incidents … Continue reading

Posted in Reliability | Comments Off