Jun 14 10

Assorted New Features

by Andrew Miklas

Browsing through our UserVoice feature requests is a pretty humbling experience for all of us working on PagerDuty.  It seems that as far as we’ve come with PagerDuty in our first year, we have at least another ten years of work ahead!

We’re just putting the finishing touches on the Nagios integration API now.  We’re going to have the first version of this out in the next two weeks or so.  But in the meantime, we thought we should show you all some of the other smaller features we’ve recently launched.

Looping on Escalation Chains

Up until now, incidents ran exactly once through their escalation policies.  Thus, unanswered incidents remained assigned to the last person on the escalation chain. Needless to say, this caused some problems if an alert made it through the escalation process without anyone taking action.

To ensure that open incidents are always dealt with, the final rule of an escalation policy can now direct PagerDuty to reassign the incident to the first person in the chain, and begin the escalation process anew.
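To make the difference concrete, here is a minimal sketch of who ends up holding an unanswered incident with and without looping. It is an illustration only, not PagerDuty's implementation: the function name `assignee_at_step` and the people in `chain` are made up for the example.

```python
from typing import Sequence


def assignee_at_step(chain: Sequence[str], step: int, loop: bool) -> str:
    """Return who holds the incident after `step` escalations (0 = first rule).

    Hypothetical helper: with loop=False the incident sticks to the last
    person in the chain; with loop=True it wraps around to the first.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    if loop:
        return chain[step % len(chain)]
    return chain[min(step, len(chain) - 1)]


chain = ["alice", "bob", "carol"]  # made-up escalation chain
without_loop = [assignee_at_step(chain, s, loop=False) for s in range(5)]
with_loop = [assignee_at_step(chain, s, loop=True) for s in range(5)]
print(without_loop)
print(with_loop)
```

Without looping, every step past the end of the chain stays with the last person; with looping, the fourth escalation goes back to the first person.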

We’re especially curious if anyone needs additional flexibility in the escalation policies.  Would the ability to loop back to a rule other than the first be useful to anyone?

PagerDuty Escalation Policies

Better Regex Support

We’ve made it possible to specify both “AND” and “OR” trigger message regex filters.  We’ve also added the option to filter incoming messages based on the “from” address.  If you’ve ever accidentally hit “reply-all” on a trigger message you’ve been CC’ed on, you’ll know exactly why we’ve added the from filter option.
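As a rough sketch of how such filters could combine (the function `passes_filters`, its parameters, and the sample addresses are assumptions for illustration, not the actual PagerDuty settings), “AND” requires every pattern to match, “OR” requires at least one, and the from filter is checked first:

```python
import re
from typing import Optional, Sequence


def passes_filters(
    message: str,
    sender: str,
    patterns: Sequence[str],
    mode: str = "and",
    from_pattern: Optional[str] = None,
) -> bool:
    """Decide whether an incoming email should trigger (hypothetical helper)."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    # Reject anything whose sender does not match the from filter,
    # for example an accidental reply-all from a colleague.
    if from_pattern is not None and re.search(from_pattern, sender) is None:
        return False
    if not patterns:
        return True  # no message filters configured
    hits = [re.search(p, message) is not None for p in patterns]
    return all(hits) if mode == "and" else any(hits)


text = "CRITICAL: db01 disk full"
print(passes_filters(text, "nagios@example.com", ["CRITICAL", "memory"], "and"))
print(passes_filters(text, "nagios@example.com", ["CRITICAL", "memory"], "or"))
print(passes_filters(text, "bob@example.com", ["CRITICAL"], "and", r"^nagios@"))
```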

PagerDuty Service Email Filters

SSL & TLS

Our customers often find themselves needing to log into their PagerDuty accounts while on open WiFi points at airports, coffee shops, and the like.  Up until now, this was sort of a dicey proposition, since only the PagerDuty login and billing pages were SSL protected.  By popular request, we’ve added the option to enable SSL across your entire PagerDuty account.  To enable this option, get your account owner to visit the “Account Settings” page and flip on the SSL option.

PagerDuty Account Settings - SSL

By the way, we’ve also configured our mail servers to accept TLS protected SMTP sessions… perfect in case you suspect your network operator or upstream provider has some BOFH tendencies.  Simply configure your outbound mail servers to use TLS opportunistically, and you should be all set.  If you’d like to check to see if your mail is being received encrypted at our end, click “View Email” on an incident trigger and then use the “View Raw Message” link.  If the message is encrypted, the last hop listed in the receive headers will mention a TLS-enabled connection.
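If you would rather script that check, something like the sketch below could work on a raw message you have saved. It assumes that the most recent hop is the topmost `Received` header (mail servers prepend their header), and that a TLS-protected hop mentions either “TLS” or the `ESMTPS` protocol keyword. The exact wording varies by mail server, and the sample message here is invented.

```python
import email
import re


def last_hop_used_tls(raw_message: str) -> bool:
    """Heuristic check of the most recent Received header (assumption-laden)."""
    msg = email.message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return False
    # The receiving server prepends its header, so index 0 is the final hop.
    return re.search(r"\bTLS|\bESMTPS\b", received[0], re.IGNORECASE) is not None


# Invented sample; real headers will look different.
RAW = (
    "Received: from mail.example.com (mail.example.com [192.0.2.10])\n"
    "\t(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))\n"
    "\tby mx.example.net (Postfix) with ESMTP id ABC123;\n"
    "\tMon, 14 Jun 2010 10:00:00 -0400 (EDT)\n"
    "Received: from localhost (localhost [127.0.0.1])\n"
    "\tby mail.example.com (Postfix) with ESMTP id DEF456;\n"
    "\tMon, 14 Jun 2010 09:59:59 -0400 (EDT)\n"
    "From: nagios@example.com\n"
    "Subject: test trigger\n"
    "\n"
    "body\n"
)

print(last_hop_used_tls(RAW))
```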

Apr 12 10

PagerDuty 2.0

by Alex Solomon

We’re happy to announce we’ve released the new version of PagerDuty, which has multi-incident support. To try it out, just log into your PagerDuty account.

This new feature corrects an over-simplification in PagerDuty’s design: up to now, PD required you to create a new alarm for each type of problem that your monitoring systems are capable of detecting. Unfortunately, this doesn’t work very well if you’re using a monitoring tool like Nagios, which can monitor thousands of hosts and services at once. The new release can now handle multiple open incidents from a single monitoring system; we call this “multi-incident support”.

Here’s a quick summary of the changes in the new release:

  • Alarms have been renamed to Services.
  • Alarm Groups have been renamed to Escalation Policies.
  • Services can now track multiple open incidents at once.
  • Incident “suppression” has been renamed to “acknowledgement”.
  • The amount of time an incident stays Acknowledged is now configurable on a service-by-service basis.

The new version of PD is 100% backwards compatible with the previous version. Yes, we’ve renamed a bunch of stuff, but we’ve been very careful to retain the same behavior as the old version for your existing services. Read on for more details.

The big change: Multi-Incident Support

PagerDuty is now capable of tracking multiple open concurrent incidents.  Put another way, your monitoring system can tell PagerDuty about 100 simultaneous and independent problems without you needing to create 100 PagerDuty alarms (as was the case in the old version of PD).

PagerDuty now uses “incidents” rather than “alarms” as the main object.  Your support team will be acknowledging, escalating, and resolving incidents, instead of alarms.  Incidents in PagerDuty are similar to tickets in a bug tracking system: they are created when a problem is detected, and are resolved or closed when the problem is fixed.

Since PagerDuty can now handle hundreds of open incidents at once, we’ve tried to carefully design PagerDuty’s interface to make it easy to work with large collections of incidents.  The new Incidents and Dashboard tabs feature tables that let you see all of the open incidents assigned to you at a glance.  You can also easily triage your incidents straight from these pages using the controls located at the top of the table.

Incidents tab

Turning on multi-incident support for your PagerDuty services

By default, the PagerDuty services still work the same way they’ve always worked: they can only have one incident open at once. The reason for this is to maintain backwards compatibility.

You can enable multi-incident support for any existing service. Here’s how:

  1. Click on the “Services” tab, and click the “Edit” link (under Actions) for the service you wish to modify.
  2. Under the “Email integration settings” section, you’ll see 3 options:
    • Open a new incident for each trigger email
    • Open a new incident for each new trigger email subject
    • Open a new incident only if an open incident does not already exist

    Email integration settings
    The first option, if selected, will cause the service to open a new incident for each trigger email sent to the service’s email address.

    The second option, if selected, will cause the service to open a new incident based on the email subject: if an open incident with the same subject already exists, the email is appended to this incident; if not, a new incident is created.

    The third option, which should be selected by default for an existing service, allows a service to maintain the behavior of the old version of PagerDuty. It basically turns multi-incident support off: if selected, the service can only have one open incident at any one time. When the service receives a trigger email, it opens a new incident if the service doesn’t already have an open incident; otherwise, it appends the email to the open incident.

  3. To turn multi-incident support on, select either the first or second option.
  4. Click “Save changes” at the bottom of the page, and you’re done.
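To compare the three options side by side, here is a small model of how a service might decide between opening a new incident and appending to an existing one. This is only a sketch of the behavior described above: the class names, the mode constants, and the sample subjects are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical names for the three email integration options.
PER_EMAIL = "per_email"      # new incident for each trigger email
PER_SUBJECT = "per_subject"  # new incident for each new email subject
SINGLE = "single"            # new incident only if none is open


@dataclass
class Incident:
    subject: str
    emails: List[str] = field(default_factory=list)
    is_open: bool = True


@dataclass
class Service:
    mode: str = SINGLE
    incidents: List[Incident] = field(default_factory=list)

    def open_incidents(self) -> List[Incident]:
        return [i for i in self.incidents if i.is_open]

    def receive(self, subject: str, body: str) -> Incident:
        """Route a trigger email to an existing or a new incident."""
        candidates = self.open_incidents()
        target: Optional[Incident] = None
        if self.mode == PER_SUBJECT:
            target = next((i for i in candidates if i.subject == subject), None)
        elif self.mode == SINGLE:
            target = candidates[0] if candidates else None
        if target is None:  # PER_EMAIL always lands here
            target = Incident(subject)
            self.incidents.append(target)
        target.emails.append(body)
        return target


def count_open(mode: str) -> int:
    svc = Service(mode=mode)
    for subject in ["db01 DOWN", "db01 DOWN", "web02 DOWN"]:
        svc.receive(subject, body="trigger email text")
    return len(svc.open_incidents())


for mode in (PER_EMAIL, PER_SUBJECT, SINGLE):
    print(mode, count_open(mode))
```

With the same three emails (two sharing a subject), the first option yields three open incidents, the second two, and the third one.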

Alarms are now Services

We’ve renamed “alarms” to “services”.  Services are now used only to represent an integration point between PagerDuty and your monitoring services. Currently, the PagerDuty services integrate with your monitoring systems via email integration (just like in the old version of PD). In the coming weeks, we will also add support for an HTTP-based API for the PagerDuty services. This will allow your monitoring systems to trigger/acknowledge/resolve incidents in PagerDuty via a synchronous API call.

For similar reasons, we’ve renamed “alarm groups” to “escalation policies”.  We think the new name better captures the use of these objects.

Incident “suppression” is now incident “acknowledgement”

We’ve also renamed incident “suppression” to “acknowledge”.  As before, this feature is used to temporarily prevent an incident from generating alerts.  We thought the word “acknowledge” better captured the purpose of the feature: “stop bothering me about this problem for now… I’m working on it!”.

We’ve also made the acknowledgement timeout configurable on a service-by-service basis. This means that you can set the amount of time that an incident stays in the Acknowledged state before it reverts back to Triggered and alerts you again. The timeout is set to 30 minutes by default for each service, but you can change it or even turn it off easily:

  1. Click on the “Services” tab, and click the “Edit” link (under Actions) for the service you wish to modify.
  2. Under the “Incident settings” section, you’ll see an entry for the “Incident ack timeout”.

    Incident ack timeout

  3. By default, the timeout is set to “30 minutes”. To modify the timeout, click and change the value of this drop-down. You can also disable the timeout altogether, by unchecking the checkbox labeled “Enable a timeout for incidents left in the Acknowledged state for too long”. We recommend leaving the timeout enabled, to ensure you don’t forget incidents in the Acknowledged state.
  4. Click “Save changes” at the bottom of the page, and you’re done.
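The effect of the timeout can be sketched in a few lines. The function name and the state strings are assumptions made for the example; only the 30-minute default and the option to disable the timeout come from the description above.

```python
from datetime import datetime, timedelta
from typing import Optional


def state_after_ack(
    acked_at: datetime,
    now: datetime,
    timeout: Optional[timedelta] = timedelta(minutes=30),
) -> str:
    """Return the incident state at `now` for an incident acknowledged at `acked_at`.

    timeout=None models a service with the ack timeout disabled.
    """
    if timeout is not None and now - acked_at >= timeout:
        return "triggered"  # reverts and alerts again
    return "acknowledged"


acked = datetime(2010, 4, 12, 9, 0)
print(state_after_ack(acked, acked + timedelta(minutes=29)))
print(state_after_ack(acked, acked + timedelta(minutes=30)))
print(state_after_ack(acked, acked + timedelta(hours=5), timeout=None))
```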

What’s next?

Next up is support for a PagerDuty API. This will make it easier to integrate PagerDuty with popular monitoring tools like Nagios, Zenoss, monit, Munin and many others. The API will allow your monitoring system to trigger, acknowledge and resolve incidents directly in PagerDuty, via a synchronous call to the API.

Mar 18 10

Preview release of the new “multi-incident” version of PagerDuty

by Alex Solomon

We’ve been carefully reviewing your feature requests to try to understand how best to improve PagerDuty.  One feature request came up far more often than the rest: make it easier to integrate PagerDuty with monitoring tools.  We’ve taken this request to heart and have begun reworking PagerDuty so that we will soon be able to support API integration with monitoring systems like Nagios.

Before we can release an API for PagerDuty, though, we need to correct some over-simplifications in PagerDuty’s design.  Up until now, PD required you to create a new alarm for each kind of problem that your monitoring systems are capable of detecting.  Unfortunately, this doesn’t work so well if you’re using a monitoring tool like Nagios that can track thousands of conditions at once.

So, for the past few weeks, we’ve been busy re-designing PD so that it can handle multiple open incidents from a single monitoring service.  We’re just about ready to roll out this new-and-improved version of PagerDuty, but before we do, we’d like to give you the chance to familiarize yourself with the system, and let us know if there’s any way we can make the new system even better prior to launch.

How do I try it out?

Glad you asked!  For at least the next week, we’re going to run a preview of the new PagerDuty system.  To log in, visit:

http://<your-subdomain>.pd-staging.com

and use your normal PagerDuty email and password.

All of your data has been migrated from your PagerDuty account, so you can see exactly how the system will look once we update the software on our production servers.  The preview release is fully functional, so please feel free to kick the tires and have it dispatch a few alerts for you.  Don’t worry: nothing you do in your preview account will have any impact on your production environment.  Of course, all SMS and phone calls made from the preview environment will be free of charge.

In order to maintain backward compatibility, we’ve configured all existing alarms to only support one active incident at once.  To remove this restriction, simply:

  1. Click the “Services” tab
  2. Select one of your existing alarms
  3. Click “Edit this service” on the right side of the screen
  4. Switch the incident creation mode to “Open a new incident for each trigger email”
  5. Click “Save Changes”

Email integration settings

The big change: Multi-Incident Support

PagerDuty is now capable of tracking multiple open concurrent incidents.  Put another way, your monitoring system can tell PagerDuty about 100 simultaneous and independent problems without you needing to create 100 PagerDuty alarms, as is the case now.

PagerDuty now uses “incidents” rather than “alarms” as the main object.  Your support team will be acknowledging, escalating, and resolving incidents, instead of the alarms that they work with now.  Incidents in PagerDuty are similar to tickets in a bug tracking system: they are created when a problem is detected, and are resolved or closed when the problem is fixed.

Since PagerDuty can now handle hundreds of open incidents at once, we’ve tried to carefully design PagerDuty’s interface to make it easy to work with large collections of incidents.  The new Incidents and Dashboard tabs feature tables that let you see all of the open incidents assigned to you at a glance.  You can also easily triage your incidents straight from these pages using the controls located at the top of the table.

Incidents tab

One of the biggest advantages to PagerDuty’s existing single-incident design is that it can’t generate alert storms.  Even if Nagios sends hundreds of emails to PagerDuty at once, you’ll only receive one set of phone calls and SMS messages.  We’ve been careful to preserve this feature in the new version of the product.  PagerDuty will intelligently bundle multiple incidents into a single set of notifications so that you aren’t overwhelmed with alerts.

Other changes

We’ve made a few other small changes to support the new multi-incident functionality.

First, we’ve renamed “alarms” to “services”.  Alarms/services are now used only to represent an integration point between PagerDuty and your monitoring services.  Currently, PagerDuty only has one type of service: the simple email-triggered mechanism you used in the previous version of PagerDuty.  In the coming weeks, we will be adding support for API-driven services so that we can offer even closer integration with products like Nagios.

For similar reasons, we’ve renamed “alarm groups” to “escalation policies”.  We think the new name better captures the use of these objects.

Finally, we’ve renamed incident “suppression” to “acknowledge”.  As before, this feature is used to temporarily prevent an incident from generating alerts.  We thought the word “acknowledge” better captured the purpose of the feature: “stop bothering me about this problem for now… I’m working on it!”.

What’s next

Next up is support for a PagerDuty API.  Once we’ve deployed PagerDuty multi-incident to production and ensured that everyone is comfortable with the new system, we’ll announce our plans for the API.  Stay tuned for more info!

Mar 7 10

New Feature: Alarm Auto-resolution

by Alex Solomon

We’d like to announce a new PagerDuty feature: auto-resolution of alarms. Auto-resolution is a setting on the PagerDuty alarms; if enabled, an alarm will automatically resolve itself after a specified amount of time.

Alarm auto-resolution is an important safety mechanism in case you forget an alarm in the Triggered state. This all makes perfect sense if you understand how the PagerDuty alarms work.

Alarms in PagerDuty are stateful. Each alarm starts out in the Idle state. Upon receiving a trigger email, the alarm transitions to the Triggered state and begins to alert your team based on the rules specified by the alarm’s alarm group. However, if an already Triggered alarm receives additional trigger emails, it logs them but *does not re-start the alerting process*. This can be dangerous, as I’ll explain below.

In the normal case, an alarm is triggered and notifies the person on-call. That person receives the phone/SMS/email alert, fixes the problem and resolves the alarm. In some cases, the person on-call does not receive the alert (this can happen if your cell runs out of batteries, or has no reception, or you forget your phone in another room and go to sleep). In these cases, the alarm is automatically escalated to a secondary person, who then picks up the alert and resolves the alarm. It’s also possible (and this has happened a few times to some of our customers) that an alarm triggers and contacts all of the people in the escalation chain, but nobody picks it up.

When an alarm runs out of people to notify, it stays in the Triggered state until someone resolves it. This is a dangerous state for an alarm to be in, because, as I mentioned above, any trigger emails to the alarm will not restart the alerting process. The alarm must be explicitly resolved to re-enable alerting.
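The danger is easier to see in a small state-machine sketch. This models only the behavior described above (Idle and Triggered states, extra trigger emails logged without re-alerting, and an optional auto-resolve time); the class, its attributes, and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional


class Alarm:
    """Toy model of a stateful alarm (not PagerDuty's actual code)."""

    def __init__(self, auto_resolve_after: Optional[timedelta] = None):
        self.state = "idle"
        self.triggered_at: Optional[datetime] = None
        self.auto_resolve_after = auto_resolve_after
        self.log: List[str] = []
        self.alert_runs = 0  # how many times the alerting process started

    def _maybe_auto_resolve(self, now: datetime) -> None:
        if (
            self.state == "triggered"
            and self.auto_resolve_after is not None
            and now - self.triggered_at >= self.auto_resolve_after
        ):
            self.resolve()

    def trigger(self, now: datetime, message: str) -> None:
        self._maybe_auto_resolve(now)
        self.log.append(message)  # every trigger email is logged
        if self.state == "idle":
            self.state = "triggered"
            self.triggered_at = now
            self.alert_runs += 1  # alerting starts only from Idle

    def resolve(self) -> None:
        self.state = "idle"
        self.triggered_at = None


start = datetime(2010, 3, 7, 2, 0)
forgotten = Alarm()  # nobody resolves it, no auto-resolution
forgotten.trigger(start, "disk full")
forgotten.trigger(start + timedelta(hours=2), "disk full again")

protected = Alarm(auto_resolve_after=timedelta(hours=1))
protected.trigger(start, "disk full")
protected.trigger(start + timedelta(hours=2), "disk full again")

print(forgotten.alert_runs, protected.alert_runs)
```

In the first case the second problem is logged but nobody is alerted; in the second, the alarm has already resolved itself, so the new trigger starts the alerting process again.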

Alarm auto-resolve setting

This is where auto-resolution comes in. We strongly recommend you turn it on for all of your alarms. Here’s how to enable auto-resolution for an alarm:

  1. Click on the Alarms tab, and click one of your alarms.
  2. Near the top of the page, you’ll see “Auto resolve”. Click “change”.
  3. Set the amount of time after which the alarm is auto-resolved. This should be set according to the amount of time an alarm would take to run out of people to notify (as specified by the rules set in your alarm group).

Nov 23 09

New Feature: Reports

by Andrew Miklas

For about the last month, we’ve been busy at work on our most requested feature: billing. Hmm… ok, perhaps billing isn’t quite the #1 feature request, but surprisingly, we did actually have a few customers who were asking for it.

Yesterday, we rolled out PagerDuty’s reporting component, which is probably the most user-visible component of the billing project. Reports will give you a “phone bill” style view of all the alerts PagerDuty sent in a month, along with who received the alert and which alarm triggered each alert. These reports should help you determine which of our pricing plans is best for your organization. For some of our customers, reports will also be useful when billing internal departments for out-of-hour service requests dispatched by PagerDuty.

Reports show who PagerDuty contacted, which method we used, when we sent out the alert, and which alarm triggered the alert.

If you have other types of reports you’d like to be able to generate from your PagerDuty account data, please let us know. One feature we’re thinking of adding to the reporting module is the ability to download the reports in CSV format. We haven’t started work on this yet, but if it’s a heavily requested feature, we could look at sliding it up a bit in the work queue.

Sep 9 09

New PagerDuty Feature: Alarm Groups

by Andrew Miklas

We are proud to announce the release of a brand new feature to PagerDuty: alarm groups. Sounds simple, but it’s actually quite a sizable update to our system. To sum it up, alarm groups allow you to route problems differently depending on their source.

First of all, alarm groups allow you to organize your alarms into groups. For example, you might want to create a group called “DB Alarms” for your database alarms, and another group called “Website Alarms” for alarms related to your site.

Secondly, and more importantly, alarm groups allow you to specify what happens when an alarm in the group is triggered. Each alarm group has a set of rules called Alerting and Escalation rules. These rules specify who to contact when an alarm is triggered, and when to escalate if the person does not acknowledge the alert.

Alarm group alerting and escalation rules

As you can see in the example above (of an alarm group for database-specific alarms), the first rule says to contact the on-call person on the “DB Admins Primary” schedule. The escalation timeout for the first rule is 5 minutes; this means that if the Primary DB on-call doesn’t acknowledge the alert within 5 min, it will be escalated. If this happens, the second alerting & escalation rule is invoked, and so on.

You can set the alerting rules to contact a specific individual (like rule 4 above) or the person that is on-call on a specific on-call schedule (like rules 1 to 3 above).
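As a rough model of how these rules play out over time, the sketch below walks a list of (target, escalation timeout) pairs and returns who would be contacted after an alert has gone unacknowledged for a given number of minutes. Only the first rule (“DB Admins Primary”, 5 minutes) comes from the example; the other rules, the function name, and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Rule = Tuple[str, int]  # (who to contact, escalation timeout in minutes)


def target_after(rules: List[Rule], minutes_unacknowledged: int) -> str:
    """Return the target of the rule in effect after the given wait."""
    if not rules:
        raise ValueError("an alarm group needs at least one rule")
    deadline = 0
    for target, timeout in rules:
        deadline += timeout
        if minutes_unacknowledged < deadline:
            return target
    return rules[-1][0]  # nobody left to escalate to


RULES: List[Rule] = [
    ("schedule: DB Admins Primary", 5),     # from the example above
    ("schedule: DB Admins Secondary", 10),  # hypothetical
    ("user: Jane Doe", 15),                 # hypothetical individual
]

for minutes in (0, 5, 15, 120):
    print(minutes, target_after(RULES, minutes))
```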

With this new release, we have also lifted the restriction of only 3 on-call schedules (aka rotations). You now have the ability to create as many schedules as you need. You can browse all of your on-call schedules by clicking the On-call Schedules tab.

On-call Schedules tab

The new alarm groups feature is available to all existing accounts under the Alarm Groups tab. We’ve created a “Default” alarm group for you already, and have put all of your existing alarms in this group. Your alerting and escalation settings now reside under this “Default” alarm group. To access them, click on the Alarm Groups tab. In the alarm groups table, click on Default (under Alarm group name).

Rest assured, all of your settings, on-call schedules, alerting and escalation rules, users, and user contact rules are unchanged. If you have any questions or feedback, please get in touch.

Aug 27 09

First Post!!1!

by Andrew Miklas

Six months out of the gate, and we finally got around to setting up our blog. Pretty bad, eh?  Up ’til now, we’ve been heads-down adding features to PagerDuty.

Anyway, we plan to use this blog to keep our customers (and curious onlookers) up to date regarding the development of PagerDuty, and our experiences running a startup out of Toronto.

Just in case you’ve randomly stumbled across our blog, PagerDuty is an alert management product we’ve been working on since February. Most web developers are pretty familiar with monitoring systems like Pingdom and Nagios. These tools are indispensable for any serious online business: they rapidly detect problems with your systems. What they don’t do as well, though, is ensure that someone knows about a detected problem.

With PagerDuty, we’re trying to bring top-notch alerting functionality to the monitoring tools IT pros have already come to trust. We help you establish an on-call rotation for your IT/ops team, and then dispatch alerts to the on-call engineer using your choice of phone calls, SMSes, or emails. PagerDuty’s email-based triggering makes integration with all of your existing monitoring software a snap.

If you haven’t already, take a look at PagerDuty. The service is completely free for the next little while (during our Beta period), so sign up for an account and let us know what you think.