Tag Archives: john
Hiring software engineers is hard. We all know this. If you get past the problem of sourcing and landing good candidates (which is hard in itself), the whole issue of “is this person I’m talking to ‘good enough’ to actually … Continue reading
Posted in Best Practices, Blog
Tagged hiring, Hiring Best Practices, Hiring Developers, Hiring Engineers, Jobs at PagerDuty, john, machine learning
Having one person on-call isn’t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone’s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck … Continue reading
Jan 27 12
Pressure Release Valves
This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn’t like the last 15 operations issues you’ve dealt with … Continue reading
This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failures – MTBF … Continue reading
Like pretty much everything else in Rails, optimistic locking is nice and easy to set up: you simply add a “lock_version” column to your ActiveRecord model and you’re all set. If a given Rails process is trying to update some record, … Continue reading
This is the second in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failures – … Continue reading
Tired of getting a flood of PagerDuty incidents whenever a problem occurs with one of your systems? Do many of the incidents seem identical? Do you spend valuable time trying to fend off the seemingly never-ending PagerDuty phone calls and … Continue reading
Jun 20 11
New APIs Available Now
Have you ever said to yourself: “PagerDuty is great, but I wish I could better integrate it into the custom tools I already use.” Or maybe: “Why can’t I see more reports on the number of incidents each of my … Continue reading
Today, at around 1am Pacific Time, Amazon began having major problems with some of their cloud infrastructure: specifically with their EC2, EBS, and RDS offerings. We’d like to share some statistics on the alerts we sent out – via phone or SMS – during the outage. Continue reading
Apr 18 11
The ups and downs of Availability
This post is meant as a quick introduction to some concepts of system availability, so that subsequent posts in this series make sense. I’ll go over concepts like availability, SLA, mean time between failures, mean time to recovery, etc. Continue reading

