Monitoring Best Practices Learned from IT Outages

Guest post by Alexis Lê-Quôc, co-founder and CTO of Datadog. Datadog is a monitoring service for IT, Operations and Development teams who want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.

At Datadog we eat our own… dogfood. We track hundreds of thousands of metrics internally. Learning what to alert on and what to monitor has taken us some time. Not all metrics are created equal, and we have come up with a simple way to manage them, which anyone can master. Here’s how we do it.

Monitoring goals

Why would you spend time getting better monitoring?

  1. To know about an issue before your customers or your boss
  2. To know how your systems & applications are performing
  3. To minimize your stress level

Classifying metrics

What kind of metrics does your monitoring tool track? Examples are CPU utilization, memory utilization, and database or web requests. Those are very different types of metrics, but they can all be divided into two fundamental classes: work metrics and resource metrics.

Work metrics
A work metric measures how much useful stuff your system or application is producing. For instance, we could look at the number of queries that a database is responding to or the number of pages that a web server is serving per second. The purpose of a database is to answer queries. The purpose of a web server is to serve pages. So these are appropriate work metrics.

Another work metric is how much money your application is producing. That’s a very useful work metric for tracking availability and understanding the effectiveness of your application and infrastructure.

Resource metrics
The other class is resource metrics. A resource is something that is consumed to produce something useful: you use a resource to produce work. So a resource metric measures how much of something is consumed to produce that work. When you ask, “how much CPU am I consuming in the database?”, the answer doesn’t really tell you whether that consumption is useful. It just says, “Well, I have CPU to spare” or “my CPU is completely maxed out.” The same goes for memory, disk, network and so on. In general, I’ve used resource metrics for capacity planning rather than for availability management.

Optimizing your monitoring

Now that we’ve defined work and resource metrics, we can move on to best practices.

1. Classify key metrics as work or resource

Look at your key metrics, specifically the ones you really care about, and figure out whether they’re work metrics or resource metrics.

2. Only alert on work metrics

Once you’ve done this classification – and it’s really important to spend time doing this – you need to identify what you want to get alerted on. You only want to get alerted on work metrics.

In other words, you want to get alerted on things that measure how useful your system is.

I should mention that it’s useful to alert on some resource metrics if they’re a leading indicator of a failure. For instance, disk space is a resource metric. However, when you run out of disk space, the whole show stops so it’s also important to alert on these metrics. But in general, alerting on resource metrics should be rare.

3. Only alert on actionable work metrics

The tweak to the previous best practice is that you really only want to alert on actionable work metrics. In other words, you want to alert on work metrics that you can do something about.

For instance, an actionable work metric for a web server is how many webpages you serve without errors per second. That’s an actionable work metric because if you’re serving zero pages, your website is not running at all – it’s down.

A non-actionable work metric could be how many 404s you’re serving per second. This isn’t actionable because it depends entirely on what people are doing on your site. If they are browsing to URLs that don’t exist, you’re going to get a lot of 404s. That doesn’t mean something is broken; it means visitors are doing something unexpected. So you should not alert on non-actionable work metrics.
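
To make the distinction concrete, here is a minimal sketch in Python of an alerting rule that pages only on actionable work metrics; the metric names, thresholds, and the page() helper are hypothetical, not Datadog functionality:

# Minimal sketch: page only on actionable work metrics.
# Metric names, thresholds, and page() are hypothetical.

def page(message):
    print("PAGE:", message)  # stand-in for a real notification

def evaluate(metrics):
    # Work metric, actionable: error-free pages served per second.
    if metrics["pages_served_ok_per_sec"] < 1:
        page("Web tier is serving no successful pages")

    # Work metric, NOT actionable: the 404 rate depends on what visitors
    # request, so graph it but do not page on it.
    # Resource metrics (CPU, memory): use them for capacity planning.
    # Exception: a leading indicator of failure, such as a disk filling up.
    if metrics["disk_used_pct"] > 95:
        page("Disk almost full - the whole show stops when it runs out")

evaluate({"pages_served_ok_per_sec": 0, "disk_used_pct": 42})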

4. Review metrics and alerts periodically

The fourth, and maybe one of the hardest best practices, is to actually do a review and iterate on this process on a regular basis. Maybe it’s a weekly, bi-weekly or monthly thing, but you really want to carve out some time in your busy schedule and do a review with your team.

Back to goals

Now, let’s tie these best practices back to the initial goals of monitoring that I mentioned. Classifying key metrics as work or resource is a prerequisite for everything.

a. To know about an issue before your customers or your boss

Only alert on work metrics, so you know you won’t be alerted on things that aren’t useful and will therefore hear about real issues sooner.

b. To minimize your stress level

Only alert on actionable work metrics, because you’re not going to get alerted on things over which you have no control.

c. To know how your systems & applications are performing

Review metrics and alerts periodically so you have a good sense of how your systems are performing and trending, and where you can make changes.

Use these best practices to improve your monitoring strategy and when you’re ready to implement, try a 14-day free trial of Datadog to graph and alert on your actionable work metrics and any other metrics and events from over 80 common infrastructure tools.


The Importance of Severity Levels to Reduce MTTR

Guest blog post by Elle Sidell, Lukas Burkoň, and Jan Prachař of Testomato. Testomato offers easy automated testing to check website pages and forms for problems that could potentially damage a visitor’s experience.

We all know how important monitoring is for making sure our websites and applications stay error free, but that’s only one part of the equation. What do you do once an error has been found? How do you decide what steps to take next?

Rating the severity of an issue and making sure the right person is notified plays a big role in how quickly and efficiently problems get resolved. We’ve pulled together a quick guide about the importance of error severity and how to set severity levels that fit your team’s escalation policy.

What Are Severities and Why Are They Important

In simple terms, the severity of an error indicates how serious an issue is depending on the negative impact it will have on users and the overall quality of an application or website.

For example, an error that causes a website to crash would be considered high severity, while a spelling error might (in some cases) be considered low.

Severity is important because it:

  • Helps you reduce and control the amount of alerting noise.
  • Makes the process of handling errors smoother.
  • Improves how effectively and efficiently you resolve issues.

Having a severity alert process in place can help you prioritize the most crucial problems and avoid disturbing the wrong people with issues that are outside their normal area of responsibility.

On a larger scale, it makes decisions about what to fix easier for everyone.

How to Create Escalation Rules That Work for Your Team

Understanding the benefits of rating the severity of an incident is easy, but creating a severity process that works for your team can be tricky. There’s no silver bullet for this process. What works for you may not work for another team – even if it’s the same size or in the same industry.

How you choose to set up your severity levels can vary depending on the project and its infrastructure, the organization of your team, and the tools you use. So where do you start?

In our experience, there are 3 main things you need to think about when creating an escalation process:

  1. Severity structure
  2. Team organization structure
  3. Thresholds and their corresponding notification channel

Errors with higher severity will naturally require a more reliable notification channel. For example, you might choose to send an SMS using PagerDuty for a high severity error, while one that is considered minor may not trigger an alert to help reduce noise. Instead, you could choose to leave it as a notification from Testomato, which can be viewed by someone at a later time.

1) Severity Structure

One of the easiest ways to set up a severity structure is to identify the most critical parts of your website or application based on their business value.

For example, the most critical parts of an e-shop would be its product catalogue and its checkout. These are the features that would severely affect the business if they stopped working, so issues affecting them need to be prioritized above all others.

Here’s one method we’ve found helpful for creating a severity structure:

  1. Create a list of the key features or content objects on your website or web application. (e.g. catalogue, checkout, homepage, signup, etc.). It’s a good idea to keep your list simple to make it easier to prioritize issues.
  2. Analyze your alert history and identify any common problems that may require a different severity level than you would normally assign (e.g. false timeouts may need to be marked as low severity, even though a timeout would normally be categorized higher on your scale).
  3. Decide on the levels you’d like to use for your scale (e.g. low, medium, high). You can add more levels depending on the size of your project and team.
  4. Once you have completed your list and analysis, estimate the severity level of each feature or content object, as well as any recurring errors that you found in your history.

There’s no right or wrong way to do this. The most important thing is to know how your team will classify specific incidents and to make sure that everyone is on the same page.
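
As a rough illustration of the method above (illustrative only, not a Testomato feature), a severity map for the e-shop example can be as simple as a lookup with overrides for known recurring errors; the feature names and levels are made up:

# Illustrative severity map for the e-shop example; not a Testomato feature.
SEVERITY = {
    "checkout": "high",
    "product_catalogue": "high",
    "signup": "medium",
    "homepage": "medium",
    "blog": "low",
}

# Known recurring errors that override the default level for a feature.
OVERRIDES = {
    ("checkout", "false_timeout"): "low",
}

def classify(feature, error_type=None):
    return OVERRIDES.get((feature, error_type), SEVERITY.get(feature, "medium"))

print(classify("checkout"))                   # high
print(classify("checkout", "false_timeout"))  # low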

2) Organization Structure

The next thing you’ll want to do is take a look at the structure of your team.

Having a clear understanding of how your team is structured and automating issue communication will help you define a more efficient flow of communication later on. For instance, team members responsible for your environment should be notified about issues immediately, while a project manager may only want to be kept in the loop for critical issues so they’re aware of possible problems.

Based on what we’ve seen with the project teams at Testomato, development teams are usually structured roughly as follows:

  • Freelancer: team management by the client; project development by a one-person team; monitoring is none / manual, with problems noticed by the client or users.
  • Small team*: team management by the CEO or a single person; project development by a few developers or a developer / admin; monitoring is none, with problems noticed by users.
  • Large team: team management by the CEO, CTO, VP of Engineering, and team leads; project development by a team of developers, a team of testers, and a team of admins; monitoring is none, with problems noticed by users.

*A small team would generally be found in a web design agency or early stage startup.

For a more detailed structure, here are a few more questions to keep in mind:

  • Who needs to be part of the alert process?
  • What are each person’s responsibilities when it comes to fixing an issue?
  • At what point does an alert require that this role be brought into the communication loop?

3) Communication Structure

One of the hardest parts of severities can be putting together a communication structure, especially if you don’t have a strong idea about how alerts should flow through your team structure.

Think of it this way:

  • Severity Structure: How serious is this problem?
  • Organization Structure: Whose responsibility is it?
  • Communication Structure: If X happens, how and when should team members be contacted?

The main goal of severity levels is to make sure the right people are aware of issues and help prioritize them. Setting a communication structure lets you connect different levels of your severity structure to roles from your organization and add more defined actions based on time sensitivity or error frequency. This way you can guarantee the right people are contacted using the proper channel that is required for the situation. If a responder is not available, there is an escalation path to ensure someone else on the team is notified.

Assigning notification channels and setting thresholds that correspond to your team organization means that problems are handled efficiently and only involve the people needed to solve them.

For example, if a critical incident occurs on your website, an admin receives a phone call immediately and an SMS is sent to the developer responsible for this feature at the same time. If the problem is not resolved after 10 minutes, the team manager will also receive a phone call.

On the other hand, a warning might only warrant an email for the team admin and any relevant developers.

Within PagerDuty, you can create two Testomato services – one general and another for critical issues – and match each service to the escalation policy it needs. If you have an SLA of 15 minutes for critical incidents, that escalation path will be tighter than the one for general incidents.
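
As a rough sketch of how the three structures come together (illustrative only, not actual PagerDuty or Testomato configuration), a routing table can tie each severity level to its channels and escalation timing:

# Illustrative routing table tying severity levels to channels and
# escalation timing; not actual PagerDuty or Testomato configuration.
ROUTING = {
    "critical": [
        {"after_minutes": 0,  "notify": "on-call admin",         "channel": "phone"},
        {"after_minutes": 0,  "notify": "responsible developer", "channel": "sms"},
        {"after_minutes": 10, "notify": "team manager",          "channel": "phone"},
    ],
    "warning": [
        {"after_minutes": 0, "notify": "admins and developers", "channel": "email"},
    ],
}

def escalation_plan(severity):
    for step in ROUTING.get(severity, []):
        print(f"{severity}: after {step['after_minutes']} min, "
              f"{step['channel']} to {step['notify']}")

escalation_plan("critical")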

Here’s a basic overview of how we use severity levels at Testomato using both PagerDuty notifications and our own email alerts:

Team Members: manager, 2 admins (responsible for production), and 2-3 developers.

When errors occur on their project, they use the following process:

PagerDuty – SMS and Phone Call

  • All errors are sent to PagerDuty.
  • PagerDuty sends an SMS immediately to both admins.
  • After 5 minutes, an admin is called according to the on-call schedule.
  • After 15 minutes, a team manager is also called.
  • Developers are not contacted by PagerDuty.

Testomato – Email

  • Both errors and warnings are sent as Testomato email notifications to both admins and the developers.
  • Warnings are only sent as emails.
  • Developers are sent emails about both errors and warnings to stay informed about production problems.

We hope you’ve found this post helpful! What severity alert process works best for your team?


A deep dive into how we built Advanced Analytics

Advanced Analytics was a big project for us – not only was it a big step toward helping operations teams improve their performance with actionable data, but it also presented a complex design and engineering challenge.

Designing to Solve Problems

When we design new features, we always want to ensure that we solve real problems for our customers. As we dug into how our customers were using PagerDuty, and the goals they had for their Operations teams, one of the biggest pain points we found was a lack of visibility into what was happening with their operations. While this problem looks different at every company, we noticed many teams struggling with knowing what areas of their system are the most problematic and how their teams are performing.

Designing for Reliability and Scale

We process tens of millions of incidents a month. Since our customers count on us to improve their uptime, the reliability of our product is a core value here at PagerDuty. We needed to ensure that the infrastructure behind Advanced Analytics supported our reliability and performance needs now and in the future.

Reporting load looks different from load on a mobile app or dashboard; rather than needing a small set of incidents right now, you want calculations over a larger number of incidents from a longer time range.

We needed to make sure reporting calls did not bog down our main incidents database and API, so we built a set of decoupled services that ensures we can serve data quickly to customers while avoiding any impact on our main alerting pipeline, web app, and mobile app.

A fleet of ETL workers takes data from a production slave database, iterating over incident log entries and flattening them into a raw facts table with a specific set of metrics. A second service serves up the raw denormalized incidents facts for display in drilldown tables, and the third consumes the raw data to quickly calculate aggregate metrics. When you access a report, these services serve the mix of summary and raw data that you ask for.
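
As a rough sketch of the kind of flattening an ETL worker performs (the field names below are illustrative rather than our exact schema), each incident’s log entries collapse into a single row of per-incident facts:

# Sketch of flattening one incident's log entries into a row of facts.
# Field names are illustrative rather than an exact schema.
def flatten(incident_id, log_entries):
    """log_entries: list of dicts with 'type' and 'at' (epoch seconds)."""
    triggered = min(e["at"] for e in log_entries if e["type"] == "trigger")
    acks = [e["at"] for e in log_entries if e["type"] == "acknowledge"]
    resolved = max(e["at"] for e in log_entries if e["type"] == "resolve")
    return {
        "incident_id": incident_id,
        "seconds_to_ack": (min(acks) - triggered) if acks else None,
        "seconds_to_resolve": resolved - triggered,
        "escalations": sum(1 for e in log_entries if e["type"] == "escalate"),
    }

print(flatten("PABC123", [
    {"type": "trigger", "at": 0},
    {"type": "acknowledge", "at": 120},
    {"type": "resolve", "at": 900},
]))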


Advanced Analytics accesses a large volume of data, so we explored pre-computing some of the metrics. However, pre-computing presents tradeoffs between data staleness, the number of code paths to serve pre-rolled and on-demand reports, and UX design, so we wanted to make sure we did just the right amount. Based on testing, we discovered that by pre-computing per-incident metrics, we were able to strike the right balance of performance and flexibility.

We knew at the start that Advanced Analytics would serve as the foundation for exposing so much new data to our customers — more even than we’d be able to anticipate. That’s why we built our analytics framework to handle change. When we develop new and better ways to calculate valuable metrics, or have even richer usage data to expose, we can re-process data without any interruption to the feature. Customers will simply see new functionality available once all of the data is in place. In practice, this also allows us to deal with temporary fluctuations in data availability or integrity without downtime.

In practice, all this work is invisible to the user – they go to our reports page, select whatever they want to see, and quickly see their data rendered. But it’s important to us that we build our features to the same bar of scale and reliability we’re famous for.

Getting to the “So What?”

It would be easy to just take our existing reports and add filters, but we wanted to do more. We wanted to give users the context and flexibility to take away real, actionable insights from the reports.

We did this in three ways:

  1. Presenting individual metrics alongside aggregate summaries, so that customers can benchmark how a particular team or service is doing against the greater whole.
  2. Showing how metrics have changed since the last time period, so that customers understand at a high level whether they are improving their performance.
  3. Offering quick, simple drilldown to the underlying incidents, services, escalation policies, and users, so that customers can access the granular details of their operations activity.

Learning and Iterating

We collected customer feedback throughout our design and development process, and to make sure we were ready to ship, we ran a comprehensive beta test with select groups of customers. Throughout this process, we got great feedback that helped us iterate to deliver the best possible solution.

Beta customers took instantly to the new reports, excited to have greater visibility into their systems and people, and eager to share how they wanted to use the feature to enact positive change in their teams. Some of our favorite use cases:

  • Identifying teams (escalation policies) with the lowest median times to resolve, so that other teams in the same company could learn from their operational practices and improve ops metrics companywide
  • Using the Team Report for weekly team meetings, reviewing how key metrics have changed from the previous week, and looking at escalated incidents to identify what went wrong
  • Using the incident drilldown to see where similar incidents occurred together, and finding duplicate or noisy alerts to remediate

Speaking with beta customers also provided us a great deal of UX feedback. Throughout our alpha and beta, we made UX and usability tweaks to ensure that our interactions were supporting the needs of our widely-varied customer base — from those with only one or two users and services up to those with hundreds.

While we’re thrilled to deliver this comprehensive solution to operations reporting, we see this as just the first step in PagerDuty’s analytics journey. We’re excited to continue helping our customers improve their uptime & reliability through analytics.

Tell us what you think!

Advanced Analytics is live as a 30-day preview to all customers, after which it will be a feature of our Enterprise plan. We’d love to hear what you think – email support@pagerduty.com with any feedback, and we promise we’ll read every single piece of it.


Datacenter and Natural Disasters: Responsiveness Matters

New Zealand is located on the southern tier of the Pacific “Ring of Fire”, which makes it no stranger to seismic activity. On average, there are about 10 earthquakes per day that are felt by people in New Zealand! Most of the earthquakes experienced by Kiwis are small – 4.0 or less on the Richter scale – yet some are substantial.


The country’s most recent major quake, in early 2011, devastated the city of Christchurch and led to renewed awareness of the threat that tectonic shifts can pose. In the process, it brought to the fore the work of companies like GNS Science.

GNS Science provides geoscience information to the general public and other businesses through a system called GeoNet. It operates a network of seismometers that generate data around the clock, which means that when an earthquake does happen, GNS Science is one of the first organizations to know.

But for many years, earthquake notifications didn’t always reach the on-call earthquake and volcano response team members who needed to know. This meant they sometimes received phone calls from the media asking for insight on a quake they weren’t yet aware had happened! Antiquated technology was largely to blame: GNS Science used spreadsheets for scheduling and physical pagers for alerting, neither of which proved entirely reliable.

“Managing spreadsheets and using pagers were ineffective. We needed to move from the stone age,” says Kevin Fenaughty, Datacenter Manager at GNS Science.

GNS Science first began using PagerDuty to centralize and send IT incident notifications. After its success in alerting the right engineer when issues occurred, the datacenter team recommended PagerDuty when the response team wanted to move away from its unreliable alerting system. Today, GeoNet’s seismometer network alerts a designated on-call response team member through PagerDuty whenever a quake of 4.5 or larger is detected. Only alerting on larger earthquakes is similar to IT incident alerting – you only want to wake someone up for an actionable issue that is affecting people.

During an IT crisis or a natural disaster, it is important not only to keep everyone in the know but also to have a flexible system for making changes on the fly. If someone has been on-call for hours during a disaster and is completely burnt out, they can easily swap the on-call schedule in PagerDuty without any fuss. GNS Science has also found PagerDuty to be very reliable.

“After every big crisis, we have a debrief around what improvements we need to make, and PagerDuty is never mentioned – it just works.”

In an earthquake-prone country like New Zealand, having the right information at the right time is essential for public safety. GNS Science, with the help of PagerDuty, is making that possible.

“PagerDuty allows our guys to be more responsive. They are ready and prepared to give information on an earthquake when people call in.”

PagerDuty is headquartered in San Francisco, about 35 miles from Napa, the epicenter of the 6.0 earthquake that hit a couple of weeks ago, so earthquakes are definitely top of mind for us. PagerDuty is designed for engineers, but we’re excited to see the PagerDuty platform used to increase responsiveness to incidents outside of IT.


Identify and Fix Problems Faster with Advanced Analytics

“True genius resides in the capacity for evaluation of uncertain, hazardous, and conflicting information” - Winston Churchill

IT Operations teams are awash in data – today’s technology has enabled us to capture millions of data points about what’s happening with our systems and people. Making sense of that data can be a challenge, but it’s necessary to maintain uptime and deliver great experiences to your customers.

We’re excited to announce our newest feature designed to help teams do just that. Advanced Analytics shows trends in key operational metrics to help teams fix problems faster and prevent them from happening in the first place. There’s a lot of valuable information stored in PagerDuty, and we wanted to expose that in an easy-to-use way to foster better decisions and richer, data-backed conversations.

Fix problems faster

As we spoke with several Operations teams to understand the metrics they’re tracking, we heard over and over again that response & resolution times are a key concern. As companies grow and begin to distribute their operations geographically, these metrics become even more important indicators of how well teams are handling incidents.

With PagerDuty’s Team Report, you can view trends in your Mean Time to Resolve (MTTR) and Mean Time to Acknowledge (MTTA), overlaid against incident count, recent incidents, and other contextual metrics. All together, companies and teams can see a holistic picture of how they are performing and where they need support.
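
For reference, here is a minimal sketch of how MTTA and MTTR summary numbers fall out of per-incident timestamps (the incident data below is made up for illustration):

# Sketch: deriving MTTA/MTTR summaries from per-incident timestamps
# (epoch seconds). The incident data is made up for illustration.
def mean(values):
    return sum(values) / len(values) if values else 0.0

def summarize(incidents):
    mtta = mean([i["acknowledged"] - i["triggered"] for i in incidents])
    mttr = mean([i["resolved"] - i["triggered"] for i in incidents])
    return mtta, mttr

this_week = [{"triggered": 0, "acknowledged": 120, "resolved": 900},
             {"triggered": 0, "acknowledged": 60, "resolved": 600}]
last_week = [{"triggered": 0, "acknowledged": 300, "resolved": 2400}]

mtta, mttr = summarize(this_week)
prev_mtta, prev_mttr = summarize(last_week)
print(f"MTTA {mtta/60:.1f} min ({(mtta - prev_mtta) / prev_mtta:+.0%} vs last period)")
print(f"MTTR {mttr/60:.1f} min ({(mttr - prev_mttr) / prev_mttr:+.0%} vs last period)")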


The Advanced Analytics Team Report shows MTTR and MTTA trends along with incident count, summary metrics, and details for individual incidents

Identify and prevent alerting problems

Mature operations teams are always striving to reduce duplicate, low-severity, and noisy alerts, ensuring that when an incident is triggered, it represents a real problem that needs solving. PagerDuty’s System Report helps teams visualize incident load and other metrics across different escalation policies and services, highlighting trends and spikes to help prioritize engineering work.

The Advanced Analytics System Report shows incident count over time, and lets you filter by escalation policy or monitoring service.

Drill-down and context

Visualizations help identify areas and trends to investigate, but measurement is not useful if it doesn’t drive action. A summary dashboard shows % changes over time, a single click on the chart drills down to a more specific time range, and tables list the underlying data points.

In preview now

Advanced Analytics is available now for all PagerDuty customers as part of our 30-day public preview. Log into your account to see the new reports, and check out the resources we’ve put together to help you get the most out of analytics.

Advanced Analytics is a feature of our Enterprise plan. If you have any questions about your plan, please don’t hesitate to contact your Account Manager or our Support team.

Finally, we want to know what you think! To share feedback or questions about Advanced Analytics, please get in touch.


Best practices to make your metrics meaningful in PagerDuty

This post is the second in our series about how you can use data to improve your IT operations. Our first post was on alert fatigue.

A few weeks ago, we blogged on key performance metrics that top Operations teams track. As we’ve spoken with our beta testers for Advanced Reporting, we’ve learned quite a bit about how teams are measuring Mean Time to Acknowledge (MTTA) & Mean Time to Resolve (MTTR). The way your team uses PagerDuty can have a significant impact on how these metrics look, so we wanted to share a few best practices to make the metrics meaningful.

1. Develop guidelines for acknowledging incidents

The time it takes to respond to an incident is a key performance metric. To understand your time to response in PagerDuty, we recommend that you acknowledge an incident when you begin working on it. Furthermore, if you’re on a multi-user escalation policy, this practice is even more important – we’ve just released an update so that once you acknowledge an incident, your teammates will be notified that they no longer need to worry about the alert.

Many high-performing Operations teams set targets for Ack time because it is one metric teams typically have a lot of control over. PagerDuty’s Team Report can show you trends in your TTA so you can see whether you are falling within your targets, and how the TTA varies with incident count.

2. Define when to resolve

We recommend resolving incidents when they are fully closed and the service has resumed fully operational status. If you’re using an API integration, PagerDuty will automatically resolve incidents when we receive an “everything is OK” message from the service. However, if you’re resolving incidents manually, make sure your team knows to resolve incidents in PagerDuty when the problem is fixed. To make incident resolution even easier, we’ll soon be releasing an update to our email-based integrations to auto-resolve incidents from email.
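
As a rough sketch of what that looks like (the endpoint and field names below assume the generic events API, v1 2010-04-15; check your integration guide for the exact payload), the monitoring tool sends a “resolve” event carrying the same incident key it used when it triggered the alert:

# Sketch of auto-resolution via the generic events API: the "resolve" event
# carries the same incident_key as the original "trigger". Endpoint and field
# names assume the v1 (2010-04-15) generic API; check your integration docs.
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"

def send_event(service_key, event_type, incident_key, description):
    payload = {
        "service_key": service_key,    # from the PagerDuty service's settings
        "event_type": event_type,      # "trigger" or "resolve"
        "incident_key": incident_key,  # ties the resolve to the open incident
        "description": description,
    }
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(request).read()

# When the check starts failing:
#   send_event("YOUR_SERVICE_KEY", "trigger", "db-disk-check", "Disk check failing")
# When the monitoring tool reports "everything is OK" again:
#   send_event("YOUR_SERVICE_KEY", "resolve", "db-disk-check", "Disk check recovered")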

3. Use timeouts carefully

When you create the settings for a service, you can set two timeouts: the incident ack timeout and the auto resolve timeout. These timeouts can have an impact on your MTTA and MTTR metrics, so it’s important to understand how they are configured.

An incident ack timeout provides a safety net if an alert wakes you up in the middle of the night and you fall back asleep after acknowledging it. Once the timeout is reached, the incident will re-open and notify you again. If falling asleep after acking an incident is a big problem for your team, you should keep the incident ack timeout in effect – however, it can make your MTTA metrics more complex. The incident ack timeout can be configured independently for each service, and the default setting is 30 minutes.

If you’re not in the habit of resolving incidents when the work is done, auto resolve timeouts are in place to close incidents that have been forgotten. This timeout is also configurable in the Service settings, and the default is 4 hours. If you’re using this timeout, you’ll want to make sure it is longer than the time it takes to resolve most of your incidents (you can use our System or Team Reports to see your incident resolution time). To make sure you don’t forget about open incidents, PagerDuty will also send you an email every 24 hours if you have incidents that have been open for longer than a day.

4. Treat flapping alerts

A flapping alert is one that is triggered, then resolves quickly thereafter. Flapping is typically caused when the metric being monitored is hovering around a threshold. Flapping alerts can clutter your MTTR & MTTA metrics – on the Team Report, you may see a high number of alerts with a low resolution time, or a resolve time lower than ack time (auto-resolved incidents never get ack-ed). It’s a good idea to investigate flapping alerts since they can contribute to alert fatigue (not to mention causing annoyance) – many times they can be cured by adjusting the threshold. For more resources on flapping alerts, check out these New Relic and Nagios articles.


Let’s talk about Alert Fatigue

This is the first post in our series on how you can use data to improve your IT operations. The second post is about best practices to make your metrics meaningful in PagerDuty.

Alert fatigue is a problem that’s not easy to solve, but there are things you can start doing today to make it better. Using data about your alerts, you can seriously invest in cleaning up your monitoring systems and preventing non-actionable alerts.

To help, we’ve compiled a 7-step process for combatting alert fatigue.

Reducing Alert Fatigue in 7 Steps

1. Commit to action

Cleaning up your monitoring systems is hard, and it’s easy to become desensitized to high alert levels. But the first step is to decide to do something about it. Take a quick look at your data: how many alerts are you getting off hours, and what is their impact on the team?

Then, as a team, commit time to cleaning up your alerting workflows. Etsy designated a “hack week” to tackle their big monitoring hygiene problem, but setting aside a few hours a week or one day each month could also work.

2. Cut alerts that aren’t actionable & adjust thresholds

Start by reviewing your most common alerts (Hint: you can drill into incidents in PagerDuty’s new Advanced Reports). Gather the people who were on call recently, and for each alert, determine whether it was actionable.

Once you find non-actionable alerts, cut them.

It’s common to monitor and alert on CPU and memory usage because these are indicators that something is wrong. However, the metrics by themselves are NOT actionable because they don’t give specific information about what’s wrong. Etsy stopped monitoring these metrics, and focused instead on checks that gave more specific, actionable information.

You may also need to adjust the thresholds on your checks. Dan Slimmon from Exosite shared a great talk “Smoke Alarms and Car Alarms”, which details how two concepts from medical testing can help you alert only when there is a problem. The concepts are sensitivity and specificity, and together they give you a positive predictive value (PPV) – the likelihood that something is actually wrong when an alert goes off. The talk also shares strategies for improving your PPV using hysteresis (looking at historical values in addition to current values), as well as other techniques.
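
As a worked example (with made-up numbers), here is how sensitivity, specificity, and prevalence combine into a PPV, and why tightening specificity pays off so much:

# Worked example (made-up numbers): positive predictive value is the chance
# something is really wrong when an alert fires.
#   PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
def ppv(sensitivity, specificity, prevalence):
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# A "99% accurate" check for a condition that is rarely actually occurring:
print(f"{ppv(0.99, 0.99, 0.001):.0%}")    # ~9% - most pages are false alarms
# Improving specificity (e.g. with hysteresis) moves the needle the most:
print(f"{ppv(0.99, 0.9999, 0.001):.0%}")  # ~91%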

3. Save non-severe incidents for the morning

While all alerts are important, some may not be urgent. These non-urgent issues shouldn’t be waking you or your team up in the middle of the night. Consider creating separate workflows for non-severe incidents so these don’t interrupt your sleep or your workday. In PagerDuty, don’t forget to disable “Incident Ack Timeout” and “Incident Auto-Resolution” on low severity services.

4. Consolidate related alerts

When something goes wrong, you may get several alerts related to the same problem. Take advantage of monitoring dependencies if you can set them, and leverage our best practices for alert consolidation in PagerDuty:

  • Use an incident key to tell PagerDuty that certain events are related. For example, if you have multiple servers that go down, each individual one may generate a notification to PagerDuty. However, if those notifications all have the same incident key, we’ll consolidate the notifications to one alert that tells you 30 servers are down (see the sketch after this list).
  • During an alert storm, PagerDuty will also bundle alerts that are triggered after the first incident. For example, if 10 incidents are triggered within the space of 1 minute, after your first alert, you’ll receive a single, aggregated alert.
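
Here is a rough illustration of that incident-key behavior (a simplified model, not our actual implementation): events that share an incident key collapse into a single open incident.

# Rough illustration of incident-key de-duplication (a simplified model):
# events sharing an incident_key collapse into one open incident.
open_incidents = {}

def receive_event(incident_key, description):
    incident = open_incidents.setdefault(incident_key, {"count": 0})
    incident["count"] += 1
    incident["description"] = f"{description} ({incident['count']} events)"

for n in range(30):
    receive_event("web-tier-down", f"server web-{n:02d} unreachable")

print(len(open_incidents))  # 1 open incident, not 30
print(open_incidents["web-tier-down"]["description"])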

5. Give alerts relevant names & descriptions

Nothing sucks more than getting an alert saying that something is broken without information to help you gauge the severity of the issue and what to do next.

  • Give your alerts descriptive names. If you’re giving a metric (e.g. disk space used), make sure there’s enough context around the number to let someone put it in perspective. Is disk space 80% full, or 99%?
  • Include relevant troubleshooting information in the alert description, like a link to existing documentation or runbooks that will help the team dig deeper. In PagerDuty, you can add a client_url to the incident, or include a runbook link in the service description.

6. Make sure the right people are getting alerts

When teams first start monitoring, we commonly see them sending all of their alerts to everyone. No one wants to receive alerts that aren’t meaningful, so if you have different teams responsible for certain parts of your infrastructure, use Escalation Policies in PagerDuty to direct alerts appropriately.

7. Keep it up to date with regular reviews

Don’t let your clean-up effort go to waste. Create a weekly process to review alerts. Etsy created a cool weekly review process they call “Opsweekly” (available on GitHub), but we’ve heard of other companies that use a spreadsheet during weekly reviews.

To prevent alert fatigue from becoming the new norm, set quantifiable metrics for the on-call experience. If you hit these ceilings, it’s time to take action – whether that be monitoring clean-up or a little time off. At PagerDuty, we look at the number of alerts we get on a weekly basis, and if that number is more than 15 for an on-call team, we’ll do a de-brief to review the alerts.

Most importantly, take ownership of monitoring hygiene as a team – if you get an alert that isn’t actionable, even once, make it your responsibility to make sure no one ever gets woken up for that alert again.



BabyDuty: Being On-call is Similar to Being a Parent

What did the parent say to the on-call engineer at 3am? “At least yours doesn’t smell.” Believe it or not, lack of sleep isn’t the only similarity between managing critical infrastructure and the wonderful adventure of new parenthood. One of our own, Dave Cliffe, recently delivered a talk at Velocity Santa Clara on this very topic, after experiencing the many joys and stresses of both on-call duty and parenting. Here are a few expert tips that he wanted to pass on.

Redundancy Matters

No, having twins or triplets doesn’t mean “redundancy” (they’re not replaceable)! Instead, think in terms of resources. You and your partner will serve as crutches for each other – which means that maybe, when you’re lucky, one of you will be able to snatch a few hours of precious sleep. (And if you’re a single parent, just know that we are in absolute AWE of you!)

“Share the load!”

Redundancy is important in infrastructure design, too. At PagerDuty, not only do we have an active-active architecture for our application and infrastructure, but we also use multiple contact providers for redundancy to ensure you always receive the alert you need in the method of your choice. We’re relentless about reliability. We’re rigorous about redundancy. We’re cuckoo for Cocoa Puffs! *Ahem* … sorry, too many kids’ commercials.

Which of your systems can you not afford to lose? Think in terms of highest priority and build redundancy that way.

Seek Out a Mentor

Dave named a couple of baby books that were recommended to him by his mentor that helped him navigate the first few months of parenthood (they’re Brain Rules for Baby and Raising Resilient Children in case you wanted to buy them). Mentors are also helpful in the engineering world.

Most developers who joined Sumo Logic had never been on-call before. To help new developers experience on-call life without feeling like they’ve been thrown into the deep end, junior engineers share on-call responsibility with a veteran to get up to speed. This not only helps them understand their systems more holistically but also acclimate to the on-call culture.

Practice shadowing to learn from senior colleagues how the critical systems in your organization work, and establish crisis-mode roles before things go belly-up. But don’t shadow other parents: that’s creepy.

Be Smart About Alerts

A screaming baby or an outage notification demands immediate attention; you know not to ignore those types of alerts.

But do you need to take action on each and every alert? If you receive an alert that your network is lagging, but you’re in charge of datastores, your services may not be required. Similarly, paying attention to how your baby cries will help you determine what’s going on – is s/he hungry, in need of a diaper change, teething, or do they just want attention?

We haven’t yet developed a BabyDuty platform for deciphering cries, but PagerDuty does let you customize the kind of alerts you receive, as well as determine how you like to be alerted. As children get older, they develop the ability to communicate what they want or need – yet with PagerDuty, this functionality is available right now. You can skip right over the Terrible Twos!

Dave concluded his talk with two suggestions: apply your on-call learning to parenting, and be sure to step back and take moments to admire what you’ve created.

“Enjoy your kids. They’re wonderful, and so is on-call duty.”


DIY Arduino YUN Integration for PagerDuty Alerts

Using a little code inspiration from the Gmail Lamp, Daniel Gentleman of Thoughtfix (@Thoughtfix) built an awesome PagerDuty Arduino integration by combining an Arduino YUN, a 3D printed sphere and Adafruit’s NeoPixel Ring to create a visual status check of his PagerDuty dashboard. Check out the video of his device (and his cat) in action below:

Using PagerDuty’s API, the device checks in with PagerDuty every 15 seconds to update its status. The device is configured to see both acknowledged and triggered alerts. When it finds alerts in both states, the globe turns orange and then red, giving it a brief pulse of orange. Daniel notes that this is intentional: the pulse shows that there are alerts in both states, while the lamp stays red for the rest of the 15-second cycle to show the more critical status. When all alerts are resolved, the lamp glows green.

Keeping the Integration Reliable

Like many DIY projects, the PagerDuty Status Lamp started out as a for-fun project. But Daniel saw this lamp filling a real need for a quick visual cue for his open incidents, and in the future it may even serve as a signal for others to leave him alone when his lamp is red. To make the integration reliable, Daniel added a simple function to the Arduino to check for WiFi and ensure that a connection is always established.

He did this by creating a ping -c 1 shell script in the home directory and wrapping the PagerDuty API calls in a test for a successful ping. Daniel told us that in his final version, the device pings Bing.com every second (much more frequently than the calls to PagerDuty, which occur every 15 seconds). If the connection is lost, the lamp fades to a blue light.


Future Possibilities

Because the device uses very simple shell scripts and one I/O pin on the Arduino Yun, Daniel explained that it’s possible to expand it to show who is on-call, to change the lights depending on which team has alerts, and even to physically move a robot based on PagerDuty API calls.

Ready to Build Your Own?


Here’s what you will need:

  • An Arduino YUN
  • An Adafruit NeoPixel Ring (12 LEDs)
  • A 3D printed sphere to act as a diffuser

Source code:

// This code is for a PagerDuty status lamp using an Arduino Yun
// with an Adafruit NeoPixel ring (12 LEDs) on pin 6.
// It requires three files in root's home directory:
//   ack.sh     (curl request to PagerDuty's API to check for acknowledged alerts)
//   trig.sh    (curl request to PagerDuty's API to check for triggered alerts)
//   cacert.pem (from http://curl.haxx.se/ca/cacert.pem for SSL curl requests)
//
// Example ack.sh (change status=acknowledged to status=triggered for trig.sh):
//   curl -H "Content-type: application/json" \
//        -H "Authorization: Token token=[your token]" \
//        -X GET -G --data-urlencode "status=acknowledged" \
//        https://[YourPagerDutyURL].pagerduty.com/api/v1/incidents/count \
//        --cacert /root/cacert.pem
//
// Obtain a PagerDuty API token from
//   https://[YourPagerDutyURL].pagerduty.com/api_keys

#include <Adafruit_NeoPixel.h>
#include <Process.h>

#define PIN 6

Adafruit_NeoPixel strip = Adafruit_NeoPixel(12, PIN, NEO_GRB + NEO_KHZ800);

void setup() {
  // Initialize the Bridge between the Arduino and Linux sides of the Yun
  Bridge.begin();
  // Initialize Serial for debugging output
  Serial.begin(9600);

  // Set up the NeoPixel ring and show blue while starting up
  strip.begin();
  strip.show(); // Initialize all pixels to 'off'
  colorWipe(strip.Color(0, 0, 255), 50); // Blue
  delay(100);
}

void loop() {
  int ackd = runAckd();
  int trig = runTrigger();
  if ((ackd == 0) && (trig == 0)) {
    colorWipe(strip.Color(0, 255, 0), 50); // Green: all clear
  }
  if (ackd >= 1) {
    colorWipe(strip.Color(255, 153, 0), 50); // Orange: acknowledged alerts
  }
  if (trig >= 1) {
    colorWipe(strip.Color(255, 0, 0), 50); // Red: triggered alerts
  }
  delay(15000); // Wait 15 seconds before checking again
}

// Count triggered incidents by running trig.sh on the Linux side
int runTrigger() {
  Process p; // Create a process and call it "p"
  p.runShellCommand("/root/trig.sh");

  // A process output can be read with the stream methods
  while (p.running()); // do nothing until the process finishes,
                       // so you get the whole output
  int result = p.parseInt(); /* look for an integer */
  Serial.print("Triggered:"); /* some serial debugging */
  Serial.println(result);

  // Ensure the last bit of data is sent.
  Serial.flush();
  return result;
}

// Count acknowledged incidents by running ack.sh on the Linux side
int runAckd() {
  Process p; // Create a process and call it "p"
  p.runShellCommand("/root/ack.sh");

  // A process output can be read with the stream methods
  while (p.running()); // do nothing until the process finishes,
                       // so you get the whole output
  int result = p.parseInt(); /* look for an integer */
  Serial.print("Ackd:"); /* some serial debugging */
  Serial.println(result);

  // Ensure the last bit of data is sent.
  Serial.flush();
  return result;
}

// Fill the pixels one after the other with a single color
void colorWipe(uint32_t c, uint8_t wait) {
  for (uint16_t i = 0; i < strip.numPixels(); i++) {
    strip.setPixelColor(i, c);
    strip.show();
    delay(wait);
  }
}

Have a DIY PagerDuty project of your own? Let us know by emailing support@pagerduty.com.


A Disunity of Data: The Case For Alerting on What You See

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack.

The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an airplane, or even a thing that simply exists by means of miraculous biology, like the human body. A thing that would be perfect, but for its interaction with the dirty, analog reality of meatspace; the playground of entropy, and chaos where the best engineering we can manage eventually goes sideways through age, human-error, and random happenstance.

Monitor that which you cannot control

Airplanes and bodies are systems. Not only do we expect them to operate in a particular way, but we have well-defined mental models that describe their proper operational characteristics — simple assumptions about how they should work, which map to metrics we can measure that enable us to describe these systems as ‘good’ or ‘bad’. Airplane tires should maintain 60 PSI of pressure, human hearts should beat between 40 and 100 times per minute. The JDBC client shouldn’t ever need more than a pool of 150 DB connections.

This is why we monitor: to obtain closed-loop feedback on the operational characteristics of systems we cannot fully control, to make sure they’re operating within bounds we expect. Fundamentally, monitoring is a signal processing problem, and therefore, all monitoring systems are signal processing systems.  Some monitoring systems sample and generate signals based on real-world measurements, and others collect and process signal to do things like visualization, aberrant-behavior detection, and alerting and notification.

Below is an admittedly oversimplified diagram that describes how monitoring systems for things like human hearts and aircraft oil-pressure work (allowing, of course, for the omission of the more esoteric inner-workings of the human heart, which cannot be monitored). In it, we see sensors generating a signal that feeds the various components that provide operational feedback about the system to human operators.


All too often however, what we see in IT monitoring systems design looks like the figure below, where multiple sensors are employed to generate duplicate signals for each component that generates a different kind of feedback.


There are many reasons this anti-pattern can emerge, but most of these reduce to a cognitive dissonance between different IT groups, where each group believes they are monitoring for a different reason, and therefore require different analysis tools. The operations and development teams, for example might believe that monitoring OS metrics is fundamentally different from monitoring application metrics, and therefore each implements their own suite of monitoring tools to meet what they believe are exclusive requirements. For OS metrics, operations may think it requires minute-resolution state data on which they can alert, while development focuses on second-resolution performance metrics to visualize their application performance.

In reality, both teams share the same requirement: a telemetry signal that will provide closed-loop feedback from the systems they’re interested in, but because each implements a different toolchain to achieve this requirement, they wind up creating redundant signals, to feed different tools.

Disparate signals make for unpredictable results

When different data sources are used for alerting and graphing, one source or the other might generate false positives or negatives. Each might monitor subtly different metrics under the guise of the same name, or the same metric in subtly different ways. When an engineer is awakened in the middle of the night by an alert from such a system, and the visualization feedback doesn’t agree with the event-notification feedback, an already precarious, stressful and confusing situation is made worse, and precious time is wasted vetting one monitoring system or the other.

Ultimately, the truth of which source was correct is irrelevant. Even if someone successfully undertakes the substantial forensic effort necessary to puzzle it out, there’s unlikely to be a meaningful corrective action that can be taken to synchronize the behavior of the sources in the future.  The inevitable result is that engineers will begin ignoring both monitoring systems because neither can be trusted.

Fixing false-positives caused by disparate data sources isn’t a question of improving an unreliable monitoring system, but rather one of making two unreliable monitoring systems agree with each other in every case. If we were using two different EKGs on the same patient — one for visualization and another for notification — and the result was unreliable, we would most likely redesign the system to operate on a common signal, and focus on making that signal as accurate as possible. That is to say, the easiest way to solve this problem is to simply alert on what you see.

Synchronize your alerting and visualization on a common signal

Alerting on what you see doesn’t require that everyone in the organization use the same monitoring tool to collect the metric data that’s interesting to them; it only requires that the processing and notification systems use a common input signal.

The specific means by which you achieve a commonality of input signal depends on the tools currently in use at your organization. If you’re a Nagios/Ganglia shop, for example, you could modify Nagios to alert on data collected by Ganglia, instead of collecting one stream of metrics data from Ganglia for visualization and a different signal from Nagios for alerting.

Librato and PagerDuty are an excellent choice for centrally processing the telemetry signals from all of the data collectors currently in use at your organization. With turn-key integration into AWS and Heroku, and support for nearly 100 open-source monitoring tools, log sinks, instrumentation libraries, and daemons, it’s a cinch to configure your current tools to emit metrics to Librato.


By combining Librato and PagerDuty, engineers from any team can easily process, visualize, and correlate events in your telemetry signals, as well as send notifications and escalations that are guaranteed to reflect the data in those visualizations. Your engineers can use the tools they want, while ensuring that the signals emitted by those tools provide effective, timely feedback to everyone in the organization. Sign up for a Librato free trial today and learn how to integrate PagerDuty with Librato.
