Identify and Fix Problems Faster with Advanced Analytics

“True genius resides in the capacity for evaluation of uncertain, hazardous, and conflicting information” - Winston Churchill

IT Operations teams are awash in data – today’s technology has enabled us to capture millions of data points about what’s happening with our systems and people. Making sense of that data can be a challenge, but it’s necessary to maintain uptime and deliver great experiences to your customers.

We’re excited to announce our newest feature designed to help teams do just that. Advanced Analytics shows trends in key operational metrics to help teams fix problems faster and prevent them from happening in the first place. There’s a lot of valuable information stored in PagerDuty, and we wanted to expose that in an easy-to-use way to foster better decisions and richer, data-backed conversations.

Fix problems faster

As we spoke with several Operations teams to understand the metrics they’re tracking, we heard over and over again that response and resolution times are a key concern. As companies grow and begin to distribute their operations geographically, these metrics become even more important indicators of how well teams are handling incidents.

With PagerDuty’s Team Report, you can view trends in your Mean Time to Resolve (MTTR) and Mean Time to Acknowledge (MTTA), overlaid against incident count, recent incidents, and other contextual metrics. Taken together, these give companies and teams a holistic picture of how they are performing and where they need support.


The Advanced Analytics Team Report shows MTTR and MTTA trends along with incident count, summary metrics, and details for individual incidents

Identify and prevent alerting problems

Mature operations teams are always striving to reduce duplicate, low-severity, and noisy alerts, ensuring that when an incident is triggered, it represents a real problem that needs solving. PagerDuty’s System Report helps teams visualize incident load and other metrics across different escalation policies and services, highlighting trends and spikes to help prioritize engineering work.


The Advanced Analytics System Report shows incident count over time, and lets you filter by escalation policy or monitoring service

Drill-down and context

Visualizations help identify areas and trends to investigate, but measurement is not useful if it doesn’t drive action. A summary dashboard shows % changes over time, a single click on the chart drills down to a more specific time range, and tables list the underlying data points.

In preview now

Advanced Analytics is available now for all PagerDuty customers as part of our 30-day public preview. Log in to your account to try the new reports, and check out some of the resources we’ve put together to help you get the most out of analytics:

Advanced Analytics is a feature of our Enterprise plan. If you have any questions about your plan, please don’t hesitate to contact your Account Manager or our Support team.

Finally, we want to know what you think! To share feedback or questions about Advanced Analytics, please get in touch.


Best practices to make your metrics meaningful in PagerDuty

This post is the second in our series about how you can use data to improve your IT operations. Our first post was on alert fatigue.

A few weeks ago, we blogged on the key performance metrics that top Operations teams track. As we’ve spoken with our beta testers for Advanced Reporting, we’ve learned quite a bit about how teams are measuring Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR). The way your team uses PagerDuty can have a significant impact on how these metrics look, so we wanted to share a few best practices to make the metrics meaningful.

1. Develop guidelines for acknowledging incidents

The time it takes to respond to an incident is a key performance metric. To understand your time to response in PagerDuty, we recommend that you acknowledge an incident when you begin working on it. Furthermore, if you’re on a multi-user escalation policy, this practice is even more important – we’ve just released an update so that once you acknowledge an incident, your teammates will be notified that they no longer need to worry about the alert.

Many high-performing Operations teams set targets for Ack time because it is one metric teams typically have a lot of control over. PagerDuty’s Team Report can show you trends in your TTA so you can see whether you are falling within your targets, and how the TTA varies with incident count.

2. Define when to resolve

We recommend resolving incidents when they are fully closed and the service has resumed fully operational status. If you’re using an API integration, PagerDuty will automatically resolve incidents when we receive an “everything is OK” message from the service. However, if you’re resolving incidents manually, make sure your team knows to resolve incidents in PagerDuty when the problem is fixed. To make incident resolution even easier, we’ll soon be releasing an update to our email-based integrations to auto-resolve incidents from email.
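
If you’re building such an integration yourself, the “everything is OK” message is simply a resolve event sent to the same generic Events API endpoint that triggered the incident, reusing the original incident key. A minimal sketch with curl (the service key and incident key are placeholders):

# Sketch: resolve a PagerDuty incident via the generic Events API.
# SERVICE_KEY and INCIDENT_KEY are placeholders for your integration key and
# the incident_key used when the incident was originally triggered.
curl -X POST https://events.pagerduty.com/generic/2010-04-15/create_event.json \
  -H "Content-Type: application/json" \
  -d '{
        "service_key": "SERVICE_KEY",
        "event_type": "resolve",
        "incident_key": "INCIDENT_KEY",
        "description": "Check is back to OK"
      }'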

3. Use timeouts carefully

When you create the settings for a service, you can set two timeouts: the incident ack timeout and the auto resolve timeout. These timeouts can have an impact on your MTTA and MTTR metrics, so it’s important to understand how they are configured.

An incident ack timeout provides a safety net if an alert wakes you up in the middle of the night and you fall back asleep after acknowledging it. Once the timeout is reached, the incident will re-open and notify you again. If falling asleep after acking an incident is a big problem for your team, you should keep the incident ack timeout in effect – however, it can make your MTTA metrics more complex. The incident ack timeout can be configured independently for each service, and the default setting is 30 minutes.

If you’re not in the habit of resolving incidents when the work is done, auto resolve timeouts are in place to close incidents that have been forgotten. This timeout is also configurable in the Service settings, and the default is 4 hours. If you’re using this timeout, you’ll want to make sure it is longer than the time it takes to resolve most of your incidents (you can use our System or Team Reports to see your incident resolution time). To make sure you don’t forget about open incidents, PagerDuty will also send you an email every 24 hours if you have incidents that have been open for longer than a day.
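
Both timeouts are also exposed on the service object in PagerDuty’s REST API, which makes it easy to audit them across all of your services at once. A rough sketch (the acknowledgement_timeout and auto_resolve_timeout field names, reported in seconds, come from the REST API’s service object – verify them against the API documentation for your account):

# Sketch: list each service's ack timeout and auto-resolve timeout (seconds).
# API_TOKEN and the subdomain are placeholders; requires the jq utility.
curl -s -H "Content-type: application/json" \
     -H "Authorization: Token token=API_TOKEN" \
     "https://YOUR-SUBDOMAIN.pagerduty.com/api/v1/services" |
  jq -r '.services[] | [.name, .acknowledgement_timeout, .auto_resolve_timeout] | @tsv'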

4. Treat flapping alerts

A flapping alert is one that is triggered, then resolves quickly thereafter. Flapping is typically caused when the metric being monitored is hovering around a threshold. Flapping alerts can clutter your MTTR & MTTA metrics – on the Team Report, you may see a high number of alerts with a low resolution time, or a resolve time lower than ack time (auto-resolved incidents never get ack-ed). It’s a good idea to investigate flapping alerts since they can contribute to alert fatigue (not to mention causing annoyance) – many times they can be cured by adjusting the threshold. For more resources on flapping alerts, check out these New Relic and Nagios articles.


Let’s talk about Alert Fatigue

This is the first post in our series on how you can use data to improve your IT operations. The second post is about best practices to make your metrics meaningful in PagerDuty.

Alert fatigue is a problem that’s not easy to solve, but there are things you can start doing today to make it better. Using data about your alerts, you can make a serious investment in cleaning up your monitoring systems and preventing non-actionable alerts.

To help, we’ve compiled a 7-step process for combatting alert fatigue.

Reducing Alert Fatigue in 7 Steps

1. Commit to action

Cleaning up your monitoring systems is hard, and it’s easy to become desensitized to high alert levels. But the first step is to decide to do something about it. Take a quick look at your data. How many alerts are you getting off hours, and what’s their impact on the team?

Then, as a team, commit time to cleaning up your alerting workflows. Etsy designated a “hack week” to tackle their big monitoring hygiene problem, but setting aside a few hours a week or one day each month could also work.

2. Cut alerts that aren’t actionable & adjust thresholds

Start by reviewing your most common alerts (Hint: you can drill into incidents in PagerDuty’s new Advanced Reports). Gather the people who were on call recently, and for each alert, determine whether it was actionable.

Once you find non-actionable alerts, cut them.

It’s common to monitor and alert on CPU and memory usage because these are indicators that something is wrong. However, the metrics by themselves are NOT actionable because they don’t give specific information about what’s wrong. Etsy stopped monitoring these metrics, and focused instead on checks that gave more specific, actionable information.

You may also need to adjust the thresholds on your checks. Dan Slimmon from Exosite shared a great talk “Smoke Alarms and Car Alarms”, which details how two concepts from medical testing can help you alert only when there is a problem. The concepts are sensitivity and specificity, and together they give you a positive predictive value (PPV) – the likelihood that something is actually wrong when an alert goes off. The talk also shares strategies for improving your PPV using hysteresis (looking at historical values in addition to current values), as well as other techniques.

3. Save non-severe incidents for the morning

While all alerts are important, some may not be urgent. These non-urgent issues shouldn’t be waking you or your team up in the middle of the night. Consider creating separate workflows for non-severe incidents so these don’t interrupt your sleep or your workday. In PagerDuty, don’t forget to disable “Incident Ack Timeout” and “Incident Auto-Resolution” on low severity services.

4. Consolidate related alerts

When something goes wrong, you may get several alerts related to the same problem. Take advantage of monitoring dependencies if you can set them, and leverage our best practices for alert consolidation in PagerDuty:

  • Use an incident key to tell PagerDuty that certain events are related. For example, if you have multiple servers that go down, each individual one may generate a notification to PagerDuty. However, if those notifications all have the same incident key, we’ll consolidate them into one alert that tells you 30 servers are down (see the sketch after this list).
  • During an alert storm, PagerDuty will also bundle alerts that are triggered after the first incident. For example, if 10 incidents are triggered within the space of 1 minute, after your first alert, you’ll receive a single, aggregated alert.
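
As a sketch of the incident key approach, here are two events from different hosts sharing one key so PagerDuty rolls them into a single incident (the endpoint and fields are from the generic Events API; the service key is a placeholder):

# Sketch: two trigger events sharing an incident_key, so PagerDuty
# deduplicates them into one incident instead of paging separately for each.
# SERVICE_KEY is a placeholder for your service's integration key.
for host in web01 web02; do
  curl -X POST https://events.pagerduty.com/generic/2010-04-15/create_event.json \
    -H "Content-Type: application/json" \
    -d "{
          \"service_key\": \"SERVICE_KEY\",
          \"event_type\": \"trigger\",
          \"incident_key\": \"frontend-pool-down\",
          \"description\": \"Host $host is unreachable\"
        }"
done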

5. Give alerts relevant names & descriptions

Nothing sucks more than getting an alert saying that something is broken without information to help you gauge the severity of the issue and what to do next.

  • Give your alerts descriptive names. If you’re giving a metric (e.g. disk space used), make sure there’s enough context around the number to let someone put it in perspective. Is disk space 80% full, or 99%?
  • Include relevant troubleshooting information in the alert description, like a link to existing documentation or runbooks that will help the team dig deeper. In PagerDuty, you can add a client_url to the incident, or include a runbook link in the service description (see the sketch below).
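
If you trigger events through PagerDuty’s generic Events API, the client_url and details fields are a natural place for that context to travel with the alert. A brief sketch (the service key, runbook URL, and metric values are placeholders):

# Sketch: a trigger event that carries context for the responder - a
# descriptive summary, a runbook link (client_url), and the raw numbers.
# SERVICE_KEY and the runbook URL are placeholders.
curl -X POST https://events.pagerduty.com/generic/2010-04-15/create_event.json \
  -H "Content-Type: application/json" \
  -d '{
        "service_key": "SERVICE_KEY",
        "event_type": "trigger",
        "incident_key": "db01-low-disk",
        "description": "db01 /var is 92% full (threshold 90%)",
        "client": "disk-space-monitor",
        "client_url": "https://wiki.example.com/runbooks/low-disk-space",
        "details": { "host": "db01", "mount": "/var", "percent_used": 92 }
      }'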

6. Make sure the right people are getting alerts

When teams first start monitoring, we commonly see them sending all of their alerts to everyone. No one wants to receive alerts that aren’t meaningful, so if you have different teams responsible for certain parts of your infrastructure, use Escalation Policies in PagerDuty to direct alerts appropriately.

7. Keep it up to date with regular reviews

Don’t let your clean-up effort go to waste. Create a weekly process to review alerts. Etsy created a cool weekly review process they call “Opsweekly” (Github repo here), but we’ve heard of other companies that use a spreadsheet during weekly reviews.

To prevent alert fatigue from becoming the new norm, set quantifiable limits for the on-call experience. If you hit these ceilings, it’s time to take action – whether that be monitoring clean-up or a little time off. At PagerDuty, we look at the number of alerts we get on a weekly basis, and if that number is more than 15 for an on-call team, we’ll do a debrief to review the alerts.

Most importantly, take ownership of monitoring hygiene as a team – if you get an alert that isn’t actionable, even once, make it your responsibility to ensure no one ever gets woken up for that alert again.

Additional Resources:


BabyDuty: Being On-call is Similar to Being a Parent

What did the parent say to the on-call engineer at 3am? “At least yours doesn’t smell.” Believe it or not, lack of sleep isn’t the only similarity between managing critical infrastructure and the wonderful adventure of new parenthood. One of our own, Dave Cliffe, recently delivered a talk at Velocity Santa Clara on this very topic, after experiencing the many joys and stresses of both on-call duty and parenting. Here are a few expert tips that he wanted to pass on.

Redundancy Matters

No, having twins or triplets doesn’t mean “redundancy” (they’re not replaceable)! Instead, think in terms of resources. You and your partner will serve as crutches for each other – which means that maybe, when you’re lucky, one of you will be able to snatch a few hours of precious sleep. (And if you’re a single parent, just know that we are in absolute AWE of you!)

“Share the load!”

Redundancy is important in infrastructure design, too. At PagerDuty, not only do we have an active-active architecture for our application and infrastructure, but we also use multiple contact providers for redundancy to ensure you always receive the alert you need in the method of your choice. We’re relentless about reliability. We’re rigorous about redundancy. We’re cuckoo for Cocoa Puffs! *Ahem* … sorry, too many kids’ commercials.

Which of your systems can you not afford to lose? Think in terms of highest priority and build redundancy that way.

Seek Out a Mentor

Dave named a couple of baby books that were recommended to him by his mentor that helped him navigate the first few months of parenthood (they’re Brain Rules for Baby and Raising Resilient Children in case you wanted to buy them). Mentors are also helpful in the engineering world.

Most developers who join Sumo Logic have never been on call before. To help new developers experience on-call life without being thrown into the deep end, junior engineers share on-call responsibility with a veteran while they get up to speed. This not only helps them understand their systems more holistically, but also acclimates them to the on-call culture.

Practice shadowing to learn from senior colleagues how the critical systems in your organization work, and establish crisis-mode roles before things go belly-up. But don’t shadow other parents: that’s creepy.

Be Smart About Alerts

A screaming baby or an outage notification demands immediate attention; you know not to ignore those types of alerts.

But do you need to take action on each and every alert? If you receive an alert that your network is lagging but you’re in charge of datastores, your services may not be required. Similarly, paying attention to how your baby cries will help you determine what’s going on – is the baby hungry, in need of a diaper change, teething, or just looking for attention?

We haven’t yet developed a BabyDuty platform for deciphering cries, but PagerDuty does let you customize the kind of alerts you receive, as well as determine how you like to be alerted. As children get older, they develop the ability to communicate what they want or need – yet with PagerDuty, this functionality is available right now. You can skip right over the Terrible Twos!

Dave concluded his talk with two suggestions: apply your on-call learning to parenting, and be sure to step back and take moments to admire what you’ve created.

“Enjoy your kids. They’re wonderful, and so is on-call duty.”


DIY Arduino YUN Integration for PagerDuty Alerts

Using a little code inspiration from the Gmail Lamp, Daniel Gentleman of Thoughtfix (@Thoughtfix) built an awesome PagerDuty Arduino integration by combining an Arduino Yun, a 3D-printed sphere, and Adafruit’s NeoPixel Ring to create a visual status check of his PagerDuty dashboard. Check it out in the video of his device (and his cat) in action below:

Using PagerDuty’s API, the device checks in with PagerDuty every 15 seconds to update its status. The device is configured to see both acknowledged and triggered alerts. When it discovers alerts in both states, the globe turns orange and then red, giving a brief pulse of orange. Daniel notes that this is intentional, as a way to see that there are alerts in both states, but the lamp will stay red for the rest of the 15-second cycle to show the more critical status. When all alerts are resolved, the lamp glows green.

Keeping the Integration Reliable

Like many DIY projects, the PagerDuty Status Lamp started out as a for-fun project. But Daniel saw the lamp filling a real need for a quick visual cue about his open incidents – one that may also be used in the future as a visual cue for others to leave him alone when his lamp is red. To make the integration reliable, Daniel added a simple function to the Arduino to check for WiFi and ensure that a connection is always established.

He did this by creating a ping -c 1 shell script in the home directory and wrapping the PagerDuty API calls in a test for a successful ping. Daniel told us that in his final version, he has the device ping Bing.com every second (much more frequently than the calls to PagerDuty, which occur every 15 seconds). If the connection is lost, the lamp fades to a blue light.
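
A minimal sketch of that kind of connectivity guard, assuming the script layout described above (the exact contents of Daniel’s scripts aren’t published in this post):

#!/bin/sh
# Sketch: only query PagerDuty when the network is reachable.
# Ping once; if it succeeds, run the real check (ack.sh prints the count
# of acknowledged incidents). If the ping fails, print nothing, which the
# Arduino side could treat as "connection lost" and fade the lamp to blue.
if ping -c 1 bing.com > /dev/null 2>&1; then
  exec /root/ack.sh
fi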


Future Possibilities

Because the device uses very simple shell scripts and one I/O pin on the Arduino Yun, Daniel explained that it’s possible to expand this to show who is on call, change the lights depending on which team has alerts, or even physically move a robot based on PagerDuty API calls.

Ready to Build Your Own?


Here’s what you will need:

Source code:

// This code is for a PagerDuty status lamp using an Arduino Yun
// with an Adafruit NeoPixel ring (12 LEDs) on pin 6.
// It requires three files in root's home directory:
//   ack.sh  (curl request to PagerDuty's API to check for acknowledged alerts)
//   trig.sh (curl request to PagerDuty's API to check for triggered alerts)
//   cacert.pem (from http://curl.haxx.se/ca/cacert.pem for SSL curl requests)
//
// Example ack.sh (change status=acknowledged to status=triggered for trig.sh):
//   curl -H "Content-type: application/json" \
//        -H "Authorization: Token token=[your token]" \
//        -X GET -G --data-urlencode "status=acknowledged" \
//        https://[YourPagerDutyURL].pagerduty.com/api/v1/incidents/count \
//        --cacert /root/cacert.pem
//
// Obtain a PagerDuty API token from
//   https://[YourPagerDutyURL].pagerduty.com/api_keys

#include <Adafruit_NeoPixel.h>
#include <Process.h>

#define PIN 6

Adafruit_NeoPixel strip = Adafruit_NeoPixel(12, PIN, NEO_GRB + NEO_KHZ800);

void setup() {
  // Initialize the Bridge between the Arduino and Linux sides of the Yun
  Bridge.begin();
  // Initialize Serial for debugging output
  Serial.begin(9600);

  // Start the NeoPixel ring and show blue until the first check completes
  strip.begin();
  strip.show(); // Initialize all pixels to 'off'
  colorWipe(strip.Color(0, 0, 255), 50); // Blue
  delay(100);
}

void loop() {
  int ackd = runAckd();
  int trig = runTrigger();
  if ((ackd == 0) && (trig == 0)) {
    colorWipe(strip.Color(0, 255, 0), 50); // Green
  }
  if (ackd >= 1) {
    colorWipe(strip.Color(255, 153, 0), 50); // Orange
  }
  if (trig >= 1) {
    colorWipe(strip.Color(255, 0, 0), 50); // Red
  }
  delay(15000);
}

int runTrigger() {
  Process p; // Create a process and call it "p"
  p.runShellCommand("/root/trig.sh");

  // Check for triggered incidents.
  // A process output can be read with the stream methods.
  while (p.running()); // do nothing until the process finishes, so you get the whole output
  int result = p.parseInt();  /* look for an integer */
  Serial.print("Triggered:"); /* some serial debugging */
  Serial.println(result);

  // Ensure the last bit of data is sent.
  Serial.flush();
  return result;
}

int runAckd() {
  Process p; // Create a process and call it "p"
  p.runShellCommand("/root/ack.sh");

  // Check for acknowledged incidents.
  // A process output can be read with the stream methods.
  while (p.running()); // do nothing until the process finishes, so you get the whole output
  int result = p.parseInt();  /* look for an integer */
  Serial.print("Ackd:");      /* some serial debugging */
  Serial.println(result);

  // Ensure the last bit of data is sent.
  Serial.flush();
  return result;
}

void colorWipe(uint32_t c, uint8_t wait) {
  for (uint16_t i = 0; i < strip.numPixels(); i++) {
    strip.setPixelColor(i, c);
    strip.show();
    delay(wait);
  }
}

Have a DIY PagerDuty project of your own? Let us know by emailing support@pagerduty.com.


A Disunity of Data: The Case For Alerting on What You See

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack.

The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an airplane, or even a thing that simply exists by means of miraculous biology, like the human body. A thing that would be perfect, but for its interaction with the dirty, analog reality of meatspace: the playground of entropy and chaos, where the best engineering we can manage eventually goes sideways through age, human error, and random happenstance.

Monitor that which you cannot control

Airplanes and bodies are systems. Not only do we expect them to operate in a particular way, but we have well-defined mental models that describe their proper operational characteristics — simple assumptions about how they should work, which map to metrics we can measure that enable us to describe these systems as ‘good’ or ‘bad’. Airplane tires should maintain 60 PSI of pressure, human hearts should beat between 40 and 100 times per minute. The JDBC client shouldn’t ever need more than a pool of 150 DB connections.

This is why we monitor: to obtain closed-loop feedback on the operational characteristics of systems we cannot fully control, to make sure they’re operating within bounds we expect. Fundamentally, monitoring is a signal processing problem, and therefore, all monitoring systems are signal processing systems.  Some monitoring systems sample and generate signals based on real-world measurements, and others collect and process signal to do things like visualization, aberrant-behavior detection, and alerting and notification.

Below is an admittedly oversimplified diagram that describes how monitoring systems for things like human hearts and aircraft oil-pressure work (allowing, of course, for the omission of the more esoteric inner-workings of the human heart, which cannot be monitored). In it, we see sensors generating a signal that feeds the various components that provide operational feedback about the system to human operators.

[Diagram: sensors generate a single signal that feeds visualization, alerting, and notification components]

All too often, however, what we see in IT monitoring systems design looks like the figure below, where multiple sensors are employed to generate duplicate signals for each component that generates a different kind of feedback.

[Diagram: multiple sensors generating duplicate signals, one per feedback component]

There are many reasons this anti-pattern can emerge, but most of them reduce to a cognitive dissonance between different IT groups, where each group believes it is monitoring for a different reason and therefore requires different analysis tools. The operations and development teams, for example, might believe that monitoring OS metrics is fundamentally different from monitoring application metrics, and therefore each implements its own suite of monitoring tools to meet what it believes are exclusive requirements. For OS metrics, operations may think it requires minute-resolution state data on which to alert, while development focuses on second-resolution performance metrics to visualize their application performance.

In reality, both teams share the same requirement: a telemetry signal that will provide closed-loop feedback from the systems they’re interested in, but because each implements a different toolchain to achieve this requirement, they wind up creating redundant signals, to feed different tools.

Disparate signals make for unpredictable results

When different data sources are used for alerting and graphing, one source or the other might generate false positives or negatives. Each might monitor subtly different metrics under the guise of the same name, or the same metric in subtly different ways. When an engineer is awakened in the middle of the night by an alert from such a system, and the visualization feedback doesn’t agree with the event-notification feedback, an already precarious, stressful and confusing situation is made worse, and precious time is wasted vetting one monitoring system or the other.

Ultimately, the truth of which source was correct is irrelevant. Even if someone successfully undertakes the substantial forensic effort necessary to puzzle it out, there’s unlikely to be a meaningful corrective action that can be taken to synchronize the behavior of the sources in the future.  The inevitable result is that engineers will begin ignoring both monitoring systems because neither can be trusted.

Fixing false-positives caused by disparate data sources isn’t a question of improving an unreliable monitoring system, but rather one of making two unreliable monitoring systems agree with each other in every case. If we were using two different EKGs on the same patient – one for visualization and another for notification – and the result was unreliable, we would most likely redesign the system to operate on a common signal, and focus on making that signal as accurate as possible. That is to say, the easiest way to solve this problem is to simply alert on what you see.

Synchronize your alerting and visualization on a common signal       

Alerting on what you see doesn’t require that everyone in the organization use the same monitoring tool to collect the metric data that’s interesting to them; it only requires that the processing and notification systems use a common input signal.

The specific means by which you achieve a commonality of input signal depends on the tools currently in use at your organization. If you’re a Nagios/Ganglia shop, for example, you could modify Nagios to alert on data collected by Ganglia, instead of collecting one stream of metrics data from Ganglia for visualization and a different signal from Nagios for alerting.

Librato and PagerDuty are an excellent choice for centrally processing the telemetry signals from all of the data collectors currently in use at your organization. With turn-key integration into AWS and Heroku, and support for nearly 100 open-source monitoring tools, log sinks, instrumentation libraries, and daemons, it’s a cinch to configure your current tools to emit metrics to Librato.


By combining Librato and PagerDuty, all engineers from any team can easily process, visualize, and correlate events in your telemetry signals, as well as send notifications and escalations that are guaranteed to reflect the data in those visualizations. Your engineers can use the tools they want, while ensuring that the signals emitted by those tools can be employed to provide effective, timely feedback to everyone in the organization. Sign up for a Librato free trial today and learn how to integrate PagerDuty with Librato.


The 4 Operational Metrics You Should Be Tracking

Living in a data-rich world is a blessing and a curse. Flexible monitoring systems, open APIs, and easy data visualization resources make it simple to graph anything you want, but too much data quickly becomes noisy and un-actionable.

We’ve blogged, spoken, and thought hard about what you should monitor and why from a systems perspective, but what about monitoring data on your operations performance? We worked with a large number of PagerDuty customers as we built out our new Advanced Reporting feature, including some of the most sophisticated operations teams out there. We’d like to share some specific metrics and guidelines that help teams measure and improve their operational performance.

Top Metrics to Track

1. Raw Incident Count

A spike or continuous upward trend in the number of incidents a team receives tells you two things: either that team’s infrastructure has a serious problem, or their monitoring tools are misconfigured and need adjustment.

Incident counts may rise as an organization grows, but real incidents per responder should stay constant or move downward as the organization identifies and fixes low-quality alerts, builds runbooks, automates common fixes, and becomes more operationally mature.

“We were spending lots of time closing down redundant alerts.” – Kit Reynolds, IS Product Manager, thetrainline.com

When looking at incidents, it’s important to break them down by team or service, and then drill into the underlying incidents to understand what is causing problems. Was that spike on Wednesday due to a failed deploy that caused issues across multiple teams, or just a flapping monitoring system on a low-severity service? Comparing incident counts across services and teams also helps to put your numbers in context, so you understand whether a particular incident load is better or worse than the organization average.

2. Mean Time to Resolution (MTTR)

Time to resolution is the gold standard for operational readiness. When an incident occurs, how long does it take your team to fix it?

Downtime not only hurts your revenue but also customer loyalty, so it’s critical to make sure your team can react quickly to all incidents. For Major League Soccer, their fans expect their 20 web properties to be up during live matches. Justin Slattery, Director of Engineering, and his team are constantly working to improve their resolution times because “the cost of an outage during a middle of a game is incalculable.”

While resolution time is important to track, it’s often hard to benchmark, and companies will see variances in TTR based on the complexity of their environment, the way teams and infrastructure responsibility are organized, their industry, and other factors. However, standardized runbooks, infrastructure automation, and reliable alerting and escalation policies will all help drive this number down.

3. Time to Acknowledgement / Time to Response

This is the metric most teams forget about – the time it takes a team to acknowledge and start work on an incident.

“Time to Respond is important because it will help you identify which teams and individuals are prepared for being on-call. Fast response time is a proxy for a culture of operational readiness, and teams with the attitude and tools to respond faster tend to have the attitude and tools to recover faster.” – Arup Chakrabarti, Operations Manager, PagerDuty

While an incident responder may not always have control over the root cause of a particular incident, one factor they are 100% responsible for is their time to acknowledgement and response. Operationally mature teams have high expectations for their team members’ time to respond, and hold themselves accountable with internal targets on response time.

If you’re using an incident management system like PagerDuty, an escalation timeout is a great way of enforcing a response time target. For example, if you decide that all incidents should be responded to within 5 minutes, then set your timeout to 5 minutes to make sure the next person in line is alerted. To gauge the team’s performance, and determine whether your target needs to be adjusted, you can track the number of incidents that are escalated.

4. Escalations

For most organizations using an incident management tool, an escalation is an exception – a sign that either a responder wasn’t able to get to an incident in time, or that he or she didn’t have the tools or skills to work on it. While escalation policies are a necessary and valuable part of incident management, teams should generally be trying to drive the number of escalations down over time.

There are some situations in which an escalation will be part of standard operating practice. For example, you might have a NOC, first-tier support team or even auto-remediation tool that triages or escalates incoming incidents based on their content. In this case, you’ll want to track what types of alerts should be escalated, and what normal numbers should look like for those alerts.

Track your Operations Performance with PagerDuty

“Before PagerDuty, it might take a day to respond to incidents. Now, it’s seconds.” – Aashay Desai, DevOps, Inkling.

PagerDuty has always supported extracting rich incident data through our full-coverage API, and we’ve also offered limited in-app reporting to all customers.

We’ll soon be releasing our new Advanced Reporting features for Enterprise customers, and are accepting signups for our private beta.


Finding a Scalable Way to Stop Attackers

Evan Gilman, operations engineer at PagerDuty, recently spoke at a meetup at PagerDuty HQ. The first thing Evan noted: “Security is hard.” Whether you’re a small shop with a constantly evolving codebase or a huge enterprise with many thousands of employees, attackers will keep coming, so you need to find a scalable way to stop them.

Secure by default

Evan emphasized the importance of being “secure by default” in relation to file permissions and security privileges. If you put security checks in place, make it a pain in the ass for others to work around them. What good are rules if people know they are bendable? The checks weren’t added for an arbitrary reason – they are there to protect your customers and team. Also, secure everything. Your logs may not have passwords, but they may contain sensitive data such as customer information, so you need to secure it all.

Be paranoid

As a general rule, Evan noted, you should assume your network is hostile. That’s especially true in the cloud.

“You have no idea what else [is] running on the rack next to you.”

Encrypt all network traffic – both inter- and intra-datacenter – and we’ve found the most success encrypting at the transport layer. Also, remember to sanitize the data leaving your data infrastructure, because you can’t trust the provider to watch over what you leave behind.

Automate and distribute enforcement as much as you can

For Evan and the PagerDuty team, automation is about distributed security policy enforcement. Create a centralized ruleset to manage policy and then push it out to individual nodes so they can enforce themselves.

For example, below is a snippet of the code we use to distribute enforcement, which reads: Cassandra storage on port 7000 should only be accessible to nodes with the Cassandra role.

[Image: snippet of PagerDuty’s iptables rules restricting Cassandra’s storage port]
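
As an illustration of that policy (not PagerDuty’s actual ruleset), an iptables rule chain expressing “only Cassandra-role nodes may reach port 7000” could look like the following, with the accept list generated from your configuration management data; the addresses and chain name here are made up:

# Illustrative sketch only: restrict Cassandra inter-node storage traffic
# (TCP 7000) to hosts carrying the Cassandra role, drop everything else.
iptables -N CASSANDRA_STORAGE
iptables -A INPUT -p tcp --dport 7000 -j CASSANDRA_STORAGE
iptables -A CASSANDRA_STORAGE -s 10.0.1.11 -j ACCEPT   # cassandra node (example address)
iptables -A CASSANDRA_STORAGE -s 10.0.1.12 -j ACCEPT   # cassandra node (example address)
iptables -A CASSANDRA_STORAGE -j DROP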

Take action when there’s something wrong

Ultimately, whatever security solutions you opt for will have to be user-responsive – whether your users are people within your organization, the general public or both.

You need to set up monitoring and alerts to let you know when things aren’t going right. Evan suggested monitoring the level of encryption in your data traffic. If you know 80% of your traffic should be encrypted but only 25% is over a given period, there’s something wrong and someone needs to get paged immediately.

Since PagerDuty is distributed across multiple 3rd party providers, we don’t have VPC available to use so we leverage host-based intrusion detection (HIDS) to let us know when there are problems.

The most important advice from Evan? Start today. You’re going to have to do it eventually, and by starting now you can reduce technical debt and clear out the bad stuff you already have. Watch his talk below:

Want to see more talks from this meetup? Check out:

Or learn more about Security at PagerDuty:


Don’t Build Your Own Operations Performance Platform

Building is second nature for many engineers. Naturally you could build a solution to solve the problems of:

  • Having multiple monitoring tools for your infrastructure
  • Sending multiple types of alerts and re-routing to another person until they’re answered
  • Handling on-call shift changes with one click when life happens
  • Having the dashboards to identify hotspots to make proactive fixes

If you did that, there are quite a few things you’d need to think about. And it’s basically the only thing we’ve been thinking about for over 5 years. Plus building your own tool could cost your business over $330,000 in one year!

Let PagerDuty do the work for you.

[Infographic: PagerDuty vs. building your own operations performance platform]


Chatting with PagerDuty’s API

Guest blog post from Simon Westlake, Chief Technical Officer of Powercode, a complete CRM, OSS, and billing system designed for ISPs. Powercode is in use by over two hundred ISPs around the globe.

Since Powercode is used worldwide by a variety of different types of ISPs, we have integrations with all kinds of different third party services. These integrations encompass email account management, invoice printing, credit card processing, vehicle tracking, equipment provisioning and much more. However, most of these integrations only end up benefiting a handful of our customers, which is why it was a very pleasant surprise to see the heavy utilization of PagerDuty amongst our customers when we integrated it.

Considerations when integrating with 3rd party apps

I first heard about PagerDuty when we held a Powercode user meetup event in our hometown of Random Lake, Wisconsin back in 2012. One of our very passionate users, Steve, was talking to me about how he used Powercode for his ISP. I mentioned that one of the things that we find very challenging is dealing with after hours emergencies, as many of the smaller ISPs using Powercode don’t have the revenue to justify running a 24×7 network operations center (NOC), and that it’s a real pain hoping your buzzing phone wakes you up when you get an email from your network monitoring system letting you know that half the network is down. He quickly pulled out his laptop and logged into his PagerDuty account to show me what we were missing.

After he walked me through the interface and the feature set, I decided on the spot that it was critical for us to integrate PagerDuty into Powercode. However, we have three hard requirements that we always adhere to when we integrate third party services into Powercode:

  1. Does the service have an API?
  2. Is the API well written and documented?
  3. Does the company provide a testing/integration environment for developers?

We’ve been down dark roads before where we’ve decided to skip one of these requirements, and it always turns into a long-term nightmare. Poorly written APIs (or no API at all) and little support from the third party mean we end up having to patch together a tenuous integration and, if customers come to really rely on the integration, it’s an ongoing headache trying to keep it running. Thankfully, PagerDuty delivered on all three counts. The API was solid and well documented, and they readily provided a testing environment for us to integrate the service into Powercode. When we look at new providers, I always cross my fingers for a consistent, JSON-based REST API – and thankfully, that’s what we got!

One of the things I really like when building an integration with a third party system is to find that the API exposed to us is the same API used to build the core system by the original developers and that certainly appeared to be the case with the PagerDuty API. We were really easily able to tie in everything we needed and the integration was smooth and painless.

Analyzing PagerDuty’s Integration API

There were a couple of decisions we had to make when working with the API. Prior to integrating PagerDuty, the only option for alerting in Powercode was to trigger an email. The triggering mechanism had a variety of configuration options such as:

  1. How long should this device be in an alert state prior to an alert being generated?
  2. How many times should Powercode repeat the notification?
  3. What is the frequency and amount of repetitions?

We quickly realized that maintaining this configuration didn’t make sense given the ability to set up your alerting parameters in PagerDuty. We also wanted a way to maintain a history of alerts for devices in PagerDuty. Finally, we had to make some decisions about whether or not to set up two-way integration with PagerDuty – if an alert is opened or modified in PagerDuty, should it manipulate anything in Powercode?

After much deliberation, we decided not to integrate two-way communication. We wanted Powercode to remain the ‘master’ of incident status, and to encourage people to utilize the Powercode interface to manage their equipment. This left us with a problem to solve – what happens if someone resolves an incident inside PagerDuty while it is still alerting in Powercode?

To deal with this, we decided to trigger an incident creation or update in PagerDuty on every cycle of our notification engine, which runs once every minute. PagerDuty logs updates to an open incident without triggering another incident, and automatically bundles open incidents that occur around the same time into one alert to reduce alerting noise. While this can create a long list of incident updates in PagerDuty, it gave us some benefits:

  1. If the status of a device changes, that change is reflected in the incident description in PagerDuty. For example, if a router is alerting because CPU usage is too high and it then begins alerting because the temperature is too high, re-triggering the incident allows us to populate this information into the description.
  2. If a user resolves an incident in PagerDuty that is not really resolved (the device is still in an alerting state), it will be re-opened automatically.

One of the nice things about the PagerDuty API is that it allows you to submit an ‘incident key’ in order to track the incident in question. We decided to use the unique ID in our database associated with the piece of equipment that is alerting as the incident key – this simplified the deduplication process and allowed us to maintain a history within PagerDuty of incidents that had occurred with that piece of equipment. It also made it easy to resolve or acknowledge incidents due to changes within Powercode – we always knew how to reference the incident in question without having to store another identifier. This seemingly small feature in the PagerDuty API really expedited our ability to get it integrated quickly. See an example below for how simple this is for us to do in PHP:

[Code screenshot: the PHP call Powercode uses to trigger a PagerDuty incident with the equipment ID as the incident key]

This gives us a nice list of descriptive incidents in PagerDuty:

[Screenshot: descriptive Powercode incidents listed in PagerDuty]

Keep everyone in the know with PagerDuty

Our initial integration only used the ‘Integration API’ of PagerDuty. We reasoned that most of the other functionality would be controlled and accessed by users directly through the PagerDuty application, and it didn’t serve much purpose to recreate it all within Powercode. However, over time, we slowly found uses for the data in other sections. For example, we deployed a system within our NOC that uses the schedules and escalation policies section of the API to display to our local technicians who the current on-call person is and who to call in the event of an escalation. Our next plan is to implement the webhooks section of the API to be able to store logs inside Powercode that show who is currently working an incident – this allows us to give our customers the ability to get better real time data without needing to create accounts in PagerDuty for every member of their organization.

Final Thoughts

One thing I really like is finding that our customers respond positively and use the service. I believe that the PagerDuty integration in Powercode is the highest utilized third party service out of all the different services we have integrated. Once we started to show people how it worked, the response was universally positive. We even began using PagerDuty for our after hours support for Powercode itself – if you call into our emergency support line after hours and leave a voicemail, it opens an incident in PagerDuty to run through the escalation process!

We continue to recommend PagerDuty to our customers and I’m confident that their solid API and excellent support will mean ongoing integration in Powercode is a no brainer. Check out this integration guide to see how easy it is to integrate Powercode with PagerDuty.
