100 and Counting: Aruba Networks Now a PagerDuty Platform Partner

We’re excited to announce that Aruba Networks has joined PagerDuty’s partner ecosystem, officially marking our 100th platform integration. Big welcome to Aruba, and big thanks to PagerDuty’s community of builders and customers for helping us reach this milestone.

Identify security attacks and whales

Aruba Networks’ ClearPass access management system gives companies visibility into activity on their network. By integrating with PagerDuty, ClearPass customers learn about network security issues immediately, reducing customer impact.

“Integrating our solution with PagerDuty’s operations performance platform allows Aruba Networks to provide end-to-end incident management on top of our ClearPass system to give our customers peace of mind knowing incidents will be escalated until someone responds.” – Cameron Esdaile, senior director of emerging technologies at Aruba Networks.
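As an illustration of the general pattern – not Aruba’s actual implementation – here is a minimal sketch of how a monitoring or access-management system can push an event into PagerDuty through the generic Events API. The integration key and event details are placeholders.

```python
# Minimal sketch: push a network-security event into PagerDuty via the
# generic Events API. The integration key and event details are placeholders.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
SERVICE_KEY = "YOUR_CLEARPASS_SERVICE_KEY"  # hypothetical integration key

def trigger_security_incident(summary, details):
    payload = {
        "service_key": SERVICE_KEY,
        "event_type": "trigger",
        "description": summary,    # becomes the incident title
        "incident_key": summary,   # de-duplicates repeats of the same alert
        "details": details,        # extra context for the responder
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: an access-management system flags a suspicious authentication pattern.
trigger_security_incident(
    "Repeated failed authentications from unknown device",
    {"device_mac": "00:11:22:33:44:55", "location": "HQ, floor 3"},
)
```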

Companies can also identify VIPs who enter their premises and triage any log-in issues to deliver a great customer experience. Casinos, for instance, can make sure their high rollers are happily taken care of from the get-go.

Build on PagerDuty’s Platform

Now with 100 ready-to-use integrations available in our partner ecosystem, the PagerDuty community has an easy, quick way to connect their accounts to other infrastructure, application, and business tools for more seamless operations management. And we’re not stopping here! PagerDuty will continue to actively forge additional integration partnerships. Interested in building an integration? Let’s talk!


Blameless post mortems – strategies for success

When something goes wrong, getting to the ‘what’ without worrying about the ‘who’ is critical for understanding failures. Two engineering managers share their strategies for running blameless post mortems.

Failure is inevitable in complex systems. While it’s tempting to find a single person to blame, according to Sidney Dekker, these failures are usually the results of broader design issues in our systems. The good news is that we can design systems to reduce the risk of human errors, but in order to do that, we need to look at the many factors that contribute to failure – both systemic and human. Blameless post mortems, where the goal isn’t to figure out who made a mistake but how the mistake was made, are a tool that can help. While running one is not an easy task, the effort is well worth it. Here, two engineering managers describe some of the challenges and share how they make blameless postmortems successful.

Start with the right mindset

The attitude you take to the discussion is critical and sets the tone for the entire conversation. “You ignore the ‘this person did that’ part,” explains PagerDuty Engineering Manager Arup Chakrabarti. “What matters most is the customer impact, and that’s what you focus on.”

Mike Panchenko, CTO at Opsmatic, says that the approach is based on the assumption that no one wants to make a mistake. “Everyone has to assume that everyone else comes to work to do a good job,” he says. “If someone’s done something bad, it’s not about their character or commitment, it’s just that computers are hard and often you just break stuff.”

Don’t fear failure

Because it’s going to happen. “One thing I always tell my team is that if they’re not screwing up now and then, they’re probably not moving fast enough,” says Chakrabarti. “What’s important is, you learn from your mistakes as quickly as possible, fix it quickly, and keep moving forward.”

Nip blaming in the bud 

There are no shortcuts here. “You have to be very open about saying, ‘Hey, I will not tolerate person A blaming person B,’” says Chakrabarti. “You have to call it out immediately, which is uncomfortable. But you have to do it, or else it gives whoever’s doing it a free pass.”

Panchenko agrees: “I’m a pretty direct guy, so when I see that going on, I immediately say ‘stop doing that.'”

That goes for inviting blame, too

“There’s a natural tendency of people to take blame,” says Panchenko. “But a lot of times, there’s the ‘last straw’ that breaks the system.” He describes a recent outage where a bunch of nodes were restarted due to a bug in an automation library. That bug was triggered by the re-appearance of a long-deprecated Chef recipe in the run list. The recipe, in turn, was added back to the runlist due to a misunderstanding about the purpose of a role file left around after a different migration/deprecation. The whole thing took over a month to develop. “Whoever was the next person to run that command was going to land on that mine,” he says, “and usually the person who makes the fatal keystroke expects to be blamed. Getting people to relax and accept the fact that the purpose of the post mortem isn’t to figure out who’s going to get fired for the outage is the biggest challenge for me.”

Handle ongoing performance issues later

It’s natural to be apprehensive about sharing things that didn’t go well when your job performance or credibility may be on the line. The trick is separating ongoing performance issues from “failures” that happen because of shortcomings in your processes or designs.

Panchenko pays attention to the kind of mistake that was made. “Once you see a failure of a certain kind, you should be adding monitoring or safeguards,” he says. “If you’re doing that, the main way someone’s going to be a bad apple is if they’re not following the process. So that’s what I look for: do we have a process in place to avoid the errors, and are the errors happening because the process is being circumvented, or does the process need to be improved?”

And sometimes, yes, you do need to fire people. “I have had scenarios where a single individual keeps making the same mistake, and you have to coach them and give them the opportunity to fix it,” says Chakrabarti. “But after enough time, you have to take that level of action.”

Get executive buy-in 

Both Arup and Mike agree that successful blameless postmortems won’t work without backing from upper-level management. “You have to get top-down support for it,” says Chakrabarti, “and the reason I say that is that blameless postmortems require more work. It’s very easy to walk into a room and say ‘Dave did it, let’s just fire him and we’ve fixed the problem.'” Instead, though, you’re telling the executives that not only did someone on your team cause an expensive outage, but they’re going to be involved in fixing it too. “Almost any executive is going to be very concerned about that,” he says.

“The one thing that’s definitely true is that the tone has to be set at the top,” says Panchenko. “And the tone has to be larger than just postmortems.”

Have you led or participated in blameless post mortems? We’d love to hear more about your experiences – leave us comments below!


rm -rf “breast cancer”

At PagerDuty, we pride ourselves on supporting the everyday hero, so naturally, we take it upon ourselves to give back to the community. Each year we’ve actively participated in Movember, so this year we decided to unite behind other causes we’re equally passionate about. One of our beloved employees is a breast cancer survivor, so we wanted to rally around this cause by raising awareness and supporting those who are fighting back.

Last Wednesday we celebrated Breast Cancer Awareness Month by putting a spicy spin on our weekly Whiskey Wednesday tradition with pink bubbly and delicious cupcakes. To raise money for the cause, PagerDutonians purchased raffle tickets for awesome t-shirts and a custom bag. Some even played poker, with all proceeds donated to breast cancer charities.

We’re starting to ramp up our social responsibility efforts and are looking for new ways to lend a helping hand. In addition to raising money for breast cancer charities, last Friday a group of us volunteered at a Habitat for Humanity site. Next month, we will celebrate Movember with a mustache-growing competition, and all money raised will go to the Movember Foundation. For the holidays, we plan on hosting another food drive for the SF-Marin County Food Bank. If you have any suggestions for more ways we can give back, please feel free to reach out!


PagerDuty @ #FS14

Last week, we attended New Relic’s FutureStack14 conference, which was a great opportunity for us to connect with our friends at New Relic and with PagerDuty customers.


PagerDuty’s FS14 on-call survival guide

Unlike most conferences held in hotels or conference centers, FutureStack14 took place at Fort Mason, a former army post, and offered attendees great views of the Golden Gate Bridge and Alcatraz. Lew Cirne delivered a keynote around New Relic’s vision of making software more delightful for its customers. On average, we spend around 6 hours in front of software each day, and we’ve all been frustrated with software at one time or another. The culprits? Lag. Errors. Downtime. If 10% of those 6 hours are painful, that’s 36 minutes a day – 13,140 minutes of bad experiences each year – and life’s too short for bad software. New Relic is trying to eliminate those painful software moments, and making sure software is always available and high performing is part of PagerDuty’s calling as well.

Using Data to Take Action

New Relic announced new Insights capabilities, which give business users an easy way to slice and dice their data to answer questions such as “Which marketing campaign is converting?” and “Which customers are using our newest feature?” Data has the power to inform business decisions, and we have seen our customers use PagerDuty data to improve their operations. By tracking data about problems in their systems and how their teams respond to them, PagerDuty customers have been able to resolve issues faster and reduce the impact on their own customers.

Speaking of data, we’re excited to partner with New Relic on tomorrow’s Decreasing MTTR and Alerting Fatigue webinar. David Shackleford, product manager at PagerDuty, will be speaking about which operational metrics to track and how to track them. Sign up for the webinar today!

We’re looking forward to attending next year’s FutureStack. Confetti FTW!


OK GO rockin’ out


A Duty To Alert

Guest blog by Tim Yocum, Ops Director at Compose. Compose is a database service providing production-ready, auto-scaling MongoDB and Elasticsearch databases.

Compose users trust our team to take care of their data and their databases with our intelligent infrastructure and smart people ready to respond to alerts 24/7 to take care of any problems. That’s part of the Compose promise. But we think there’s a lesson to be learned by our users from how we handle those alerts.


Growth creates complexity

A typical Compose customer has two major parts to their stack: their data storage and their application. While most data storage incidents and alerts are handled by us, we find there’s often no mechanism for generating and handling alerts in the application. That’s not by design; early on, it’s unfortunately a company’s own customers who let them know the system isn’t working, and the team scales organically to handle those complaints, translate them into application issues, triage the problems, and resolve them.

But as you scale up your systems and company, that organic approach puts a load on your people and affects response times and the quality of triage. Your application’s architecture becomes more complex, with more moving parts and more subsystems. It’s at this point that we often get calls about database problems which turn out to be further up the customer’s stack, in the application.

Smoke alarms

There may already be instrumentation or monitoring in some of your system components, so you already get a number of alerts from your systems. By adding more instrumentation you can cover the blind spots. Other systems can also provide performance metrics and regular checks. Adding all of these to the equation gives you as much alert visibility and sensitivity as possible.

But then you realize that what you’ve created is more alerts for your people to handle. There will be a lot of noise, because any single failure typically has a ripple effect, creating alerts in different systems. Some failures will be like smoke alarms, ringing loudly with no obvious cause, while others will generate only one alert yet have a massive impact. On top of all that, you may be receiving these alerts from multiple monitoring systems tied to different people. You may be tempted to build your own alert management system… but that’s another component to monitor, and you’ll end up spending more time engineering system monitoring than growing your business.

Answering alerts at Compose

At Compose, our business is delivering databases to our customers, which is why we use PagerDuty. Our older Nagios and newer Sensu server monitoring systems both integrate with PagerDuty to report on the overall state of our servers. We also use our own Compose plugin to monitor our production systems for high lock and stepdown events, turning them into alerts too. We have a premium support offering, and we use PagerDuty to ensure rapid responses to its 911 contact points: 911 support emails are picked up by PagerDuty’s email hooks, while 911 calls pass through Twilio into PagerDuty, turning those calls into alerts as well.
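Compose doesn’t publish that glue code, but the Twilio handoff can be sketched roughly as follows: a small web handler receives Twilio’s incoming-call webhook and triggers a PagerDuty incident through the generic Events API. The route, integration key, and Flask framing are illustrative assumptions, not Compose’s actual code.

```python
# Rough sketch (not Compose's actual code): accept a Twilio voice webhook for
# a 911 support line and turn the call into a PagerDuty incident.
import requests
from flask import Flask, request

app = Flask(__name__)
SERVICE_KEY = "YOUR_911_SERVICE_KEY"  # hypothetical PagerDuty integration key

@app.route("/twilio/911-call", methods=["POST"])
def incoming_911_call():
    caller = request.form.get("From", "unknown caller")
    requests.post(
        "https://events.pagerduty.com/generic/2010-04-15/create_event.json",
        json={
            "service_key": SERVICE_KEY,
            "event_type": "trigger",
            "description": f"911 support call from {caller}",
        },
        timeout=10,
    )
    # TwiML response telling Twilio what to play back to the caller.
    return (
        "<Response><Say>Thanks, the on-call engineer has been paged.</Say></Response>",
        200,
        {"Content-Type": "text/xml"},
    )
```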

With the alerts collected, collated, and deduplicated in PagerDuty, we then use its rotation management to handle two simultaneous, overlapping rotations of support staff. The lead rotation is the primary contact and the second is a backup contact. The schedules overlap, and where there are jobs best done by two people, we can bring the primary and secondary in on the job together.

We then add the ops team to that mix as an extra backup. Finally, we ensure that scheduled maintenance doesn’t unnecessarily alert the people on call – no one likes being woken up or disturbed at dinner only to find out everything is fine. We have scripts we invoke through hubot that take hosts and services down and ensure that alerts from those systems are caught in Nagios and Sensu and not forwarded on to PagerDuty.
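The hubot scripts themselves aren’t shown here, but the underlying step – scheduling downtime in Nagios so checks for a host under maintenance never page anyone – boils down to something like this sketch, which writes a SCHEDULE_HOST_DOWNTIME external command to Nagios’s command file. The file path and duration are assumptions.

```python
# Illustrative sketch: schedule host downtime in Nagios so that checks for a
# host under maintenance don't notify (and therefore never reach PagerDuty).
# The command-file path is an assumption; adjust it for your installation.
import time

NAGIOS_CMD_FILE = "/var/lib/nagios/rw/nagios.cmd"  # assumed location of the command pipe

def schedule_host_downtime(host, minutes, author="hubot", comment="planned maintenance"):
    now = int(time.time())
    end = now + minutes * 60
    # Format: SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
    command = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{end};1;0;"
               f"{minutes * 60};{author};{comment}\n")
    with open(NAGIOS_CMD_FILE, "w") as cmd_pipe:
        cmd_pipe.write(command)

schedule_host_downtime("db-shard-07", minutes=30)
```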

“PagerDuty has become an indispensable part of Compose operations. We used to rely on manually checking multiple systems and using a calendar to work out the on-call rotation. Now we are more effective, not just at getting alerts in a timely fashion, but also at spreading the load of delivering high-quality support internally. We wouldn’t be without it now.” – Ben Wyrosdick, co-founder of Compose
Whatever your business is, you want to focus on that. If you use Compose, you’re already using us to take database management off your to-do list. Using PagerDuty to monitor your applications lets you focus on your business even more. Learn how to integrate PagerDuty with Compose here to get alerts from your Compose-hosted databases as part of your complete alert management system.

Incident Status Change Notifications for Peace of Mind

Getting paged for an incident and rushing to your computer only to find that the incident was auto-resolved or acknowledged by another team member is incredibly frustrating.

We’ve heard the feedback loud and clear, and we’ve recently released incident status update notifications – now you can customize how you’d like to be alerted when the status changes on one of your incidents.

Incident status change notifications can tell you when:

  • An incident is escalated (and therefore no longer assigned to you)
  • An incident is resolved (either automatically or by someone else)
  • An incident is acknowledged by someone else (this may be handy if you’re using a multi-user escalation policy).

You can customize when and how you’d like to be notified about incident status changes from your PagerDuty account. Simply log into PagerDuty and navigate to your profile. On the page, you’ll see the “Status Updates” section, and from there you can customize your status update notification preferences.

Questions? Contact Support.


How to Create a Data-driven Culture

This is the third post in our series on using data to improve your IT operations. The second post on making your metrics meaningful is posted here.

In tech, there’s no shortage of data. It can help you manage your systems and teams better, but getting the most out of the data available to you is about more than just collecting the numbers. You need a culture that pushes people to make decisions with data and to measure the success of those decisions with data. At least in theory, relying on data lets managers not only make good decisions with lower risk but also have the confidence to make them quickly. It also provides a way to know whether a particular decision paid off.

Actually creating such a culture, though, is more complicated than simply declaring that from now on your operations will be data driven. What data do you measure? How do you respond to it? And what steps do you take to get your team to buy into the whole idea in the first place?

Following are suggestions for how to implement a sustainable data-driven company culture – one that the team will buy into and that is self-reinforcing – along with some pitfalls to watch out for:

1. Determine what to measure. The purpose here is to use data to make your business more nimble and confident. Managers need to understand the priorities of company executives and select the metrics that support those objectives. If you measure everything and treat it all as equally important, you’ll become bogged down in irrelevant minutiae. To get you started, here are four key operational metrics you should be tracking.

2. Relate the metrics to both the specific business goals and the team’s role in achieving them. Mean Time to Repair (MTTR) is a great high-level performance indicator, but it’s not always easily actionable by the team. Mean Time to Acknowledge an incident (MTTA) is a component of MTTR and is usually more actionable. Track both the key performance indicators and the metrics that lead into or contribute to them, so that you understand how the team’s work contributes to the overall goal (a small worked example follows this list).

3. Democratize the information. In a data-driven culture, everyone’s a data analyst. But for that to happen, the data has to be made more transparent than most companies are used to, and the team needs the tools to access it. Make sure everyone has some kind of dashboard or other window into the data and that they understand (through training, if necessary) how to extract insights from it. Our newest feature, Advanced Analytics, is a great way to share data with everyone on your team.

4. Empower the team to speak up and take action. Everyone should feel free both to propose their own insights and suggest actions and to question the proposals of others, including upper management. “Do you have data to back that up?” should be a question that no one is afraid to ask (and everyone is prepared to answer).

5. Never stop testing. Before you start measuring, you don’t know everything you don’t know. New questions will arise that you may not already have a means to answer. Be prepared to test to get new data—which means being prepared to be surprised.

6. Act on the data. Nothing is more discouraging than to announce with great fanfare that “we are a data-driven organization,” only to have the team watch data-supported ideas languish or, worse, decisions still be made on the basis of what the boss thinks. A data-driven culture actually has to be driven by the data.
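To ground item 2 above, here is a minimal sketch of computing MTTA and MTTR from incident timestamps. The field names and data are illustrative, not a particular API’s schema.

```python
# Minimal sketch: compute MTTA and MTTR from incident timestamps.
# Field names and data are illustrative, not a specific API schema.
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

incidents = [
    {"created": "2014-10-01T02:14:00", "acknowledged": "2014-10-01T02:19:00",
     "resolved": "2014-10-01T02:47:00"},
    {"created": "2014-10-02T11:03:00", "acknowledged": "2014-10-02T11:05:00",
     "resolved": "2014-10-02T11:21:00"},
]

def minutes_between(start, end):
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

mtta = sum(minutes_between(i["created"], i["acknowledged"]) for i in incidents) / len(incidents)
mttr = sum(minutes_between(i["created"], i["resolved"]) for i in incidents) / len(incidents)
print(f"MTTA: {mtta:.1f} minutes, MTTR: {mttr:.1f} minutes")
```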

There are also several obstacles to becoming a true data-driven culture. Don’t make these mistakes:

1. Don’t get stuck in the past. Metrics, by their nature, reflect things that have already happened, and it’s easy to get sucked into spending a lot of time dissecting reports, discussing reasons, or assigning blame. What happened, happened—now what does it tell you about what to do next?

2. Don’t just focus on the numbers. It’s easier to manage to numbers than to goals, and metrics will incentivize people to “work to the test.” Remember, the metrics are a means to an end—keep their relation to your business goals front and center.

3. Don’t get paralyzed. What you want out of metrics are insights, and simply measuring more things won’t necessarily help. Avoid the “analysis paralysis” that can come with having more information than you need.

Finally, a data-driven culture is a feedback loop. Top-performing operations teams typically discuss what their data shows on a weekly basis. You now have transparent access to data and a crew trained to interpret it; close the loop by reporting the results of the actions you took, empowering everyone to start the process again.


Monitoring Best Practices Learned from IT Outages

Guest post by Alexis Lê-Quôc, co-founder and CTO of Datadog. Datadog is a monitoring service for IT, Operations and Development teams who want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.

At Datadog we eat our own… dogfood. We track hundreds of thousands of metrics internally. Learning what to alert on and what to monitor has taken us some time. Not all metrics are made equal, and we have come up with a simple way to manage them, which anyone can master. Here’s how we do it.

Monitoring goals

Why would you spend time getting better monitoring?

  1. To know about an issue before your customers or your boss
  2. To know how your systems & applications are performing
  3. To minimize your stress level

Classifying metrics

What kinds of metrics does your monitoring tool track? Examples are CPU utilization, memory utilization, and database or web requests. That’s a lot of different types of metrics, but they can all be divided into two fundamental classes – work and resource.

Work metrics
A work metric measures how much useful stuff your system or application is producing. For instance, we could look at the number of queries that a database is responding to or the number of pages that a web server is serving per second. The purpose of a database is to answer queries. The purpose of a web server is to serve pages. So these are appropriate work metrics.

Another work metric would be something like how much money your application is producing. That’s a very useful work metric for tracking availability and understanding the effectiveness of your application and infrastructure.

Resource metrics
The other class is resource metrics. A resource is something that is used to produce something useful; a resource metric measures how much of it is consumed to produce work. When you ask, “How much CPU is my database consuming?” the answer doesn’t really say whether that work is useful or not. It just says, “Well, I have more CPU available” or “My CPU is completely maxed out.” The same goes for memory, disk, network, and so on. In general, I’ve used resource metrics for capacity planning rather than for availability management.

Optimizing your monitoring

Now that we’ve defined work and resource metrics, we can move on to best practices.

1. Classify key metrics as work or resource

Look at your key metrics, specifically the ones you really care about, and figure out whether they’re work metrics or resource metrics.

2. Only alert on work metrics

Once you’ve done this classification – and it’s really important to spend time doing this – you need to identify what you want to get alerted on. You only want to get alerted on work metrics.

In other words, you want to get alerted on things that measure how useful your system is.

I should mention that it’s useful to alert on some resource metrics if they’re a leading indicator of a failure. For instance, disk space is a resource metric. However, when you run out of disk space, the whole show stops so it’s also important to alert on these metrics. But in general, alerting on resource metrics should be rare.

3. Only alert on actionable work metrics

The tweak to the previous best practice is that you really only want to alert on actionable work metrics. In other words, you want to alert on work metrics that you can do something about.

For instance, an actionable work metric for a web server is how many pages it serves without errors per second. That’s a work metric because if you’re serving zero pages, your website isn’t running at all – it’s down.

A non-actionable work metric could be how many 404s you’re serving per second. This isn’t actionable because it depends entirely on what people are doing on your site: if they browse to URLs that don’t exist, you’re going to get a lot of 404s. That doesn’t mean something is broken, just that visitors are doing something unexpected. So you should not alert on non-actionable work metrics.
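As a toy illustration of this rule – not Datadog’s implementation – an alerting check might page only on the work metric (error-free pages served per second) and merely record the 404 rate:

```python
# Toy illustration: page on an actionable work metric (error-free pages served
# per second) and only record a non-actionable one (404s per second).
def evaluate_alerts(pages_ok_per_sec, pages_404_per_sec, ok_threshold=50):
    alerts = []
    if pages_ok_per_sec < ok_threshold:
        # The site isn't doing its job -> wake someone up.
        alerts.append(f"Successful pages/sec dropped to {pages_ok_per_sec}")
    # The 404 rate is kept for later analysis but never pages anyone.
    metrics_log = {"pages_ok_per_sec": pages_ok_per_sec,
                   "pages_404_per_sec": pages_404_per_sec}
    return alerts, metrics_log

print(evaluate_alerts(pages_ok_per_sec=12, pages_404_per_sec=300))
```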

4. Review metrics and alerts periodically

The fourth, and maybe one of the hardest best practices, is to actually do a review and iterate on this process on a regular basis. Maybe it’s a weekly, bi-weekly or monthly thing, but you really want to carve out some time in your busy schedule and do a review with your team.

Back to goals

Now, let’s tie these best practices back to the initial goals of monitoring that I mentioned. Classifying key metrics as work or resource is a prerequisite for everything.

a. To know about an issue before your customers or your boss

Only alert on work metrics, so you know you won’t be alerted on stuff that isn’t useful and therefore get a much better result.

b. To minimize your stress level

Only alert on actionable work metrics, because then you won’t get alerted on things over which you have no control.

c. To know how your systems & applications are performing

Review metrics and alerts periodically, so you have a good sense of how your systems are performing and trending, and of how you can change things.

Use these best practices to improve your monitoring strategy, and when you’re ready to implement them, try a 14-day free trial of Datadog to graph and alert on your actionable work metrics and any other metrics and events from over 80 common infrastructure tools.


The Importance of Severity Levels to Reduce MTTR

Guest blog post by Elle Sidell, Lukas Burkoň, and Jan Prachař of Testomato. Testomato offers easy automated testing to check website pages and forms for problems that could potentially damage a visitor’s experience.

We all know how important monitoring is for making sure our websites and applications stay error free, but that’s only one part of the equation. What do you do once an error has been found? How do you decide what steps to take next?

Rating the severity of an issue and making sure the right person is notified plays a big role in how quickly and efficiently problems get resolved. We’ve pulled together a quick guide about the importance of error severity and how to set severity levels that fit your team’s escalation policy.

What Are Severities and Why Are They Important

In simple terms, the severity of an error indicates how serious an issue is depending on the negative impact it will have on users and the overall quality of an application or website.

For example, an error that causes a website to crash would be considered high severity, while a spelling error might (in some cases) be considered low.

Severity is important because it:

  • Helps you reduce and control the amount of alerting noise.
  • Makes the process of handling errors smoother.
  • Improves how effectively and efficiently you resolve issues.

Having a severity alert process in place can help you prioritize the most crucial problems and avoid disturbing the wrong people with issues that are outside their normal area of responsibility.

On a larger scale, it makes decisions about what to fix easier for everyone.

How to Create Escalation Rules That Work for Your Team

Understanding the benefits of rating the severity of an incident is easy, but creating a severity process that works for your team can be tricky. There’s no silver bullet here. What works for you may not work for another team – even one of the same size or in the same industry.

How you choose to set up your severity levels can vary depending on your team, the project and its infrastructure, the organization of your team, and the tools you use. So where do you start?

In our experience, there are 3 main things you need to think about when creating an escalation process:

  1. Severity structure
  2. Team organization structure
  3. Thresholds and their corresponding notification channel

Errors with higher severity will naturally require a more reliable notification channel. For example, you might choose to send an SMS using PagerDuty for a high severity error, while one that is considered minor may not trigger an alert at all, to help reduce noise. Instead, you could leave it as a notification from Testomato, which someone can review at a later time.

1) Severity Structure

One of the easiest ways to set up a severity structure is to identify the most critical parts of your website or application based on their business value.

For example, the most critical parts of an e-shop would be its product catalogue and its checkout. These are the features that would severely affect the business if they stopped working, so issues with them need to be prioritized above all other issues.

Here’s one method we’ve found helpful for creating a severity structure:

  1. Create a list of the key features or content objects on your website or web application. (e.g. catalogue, checkout, homepage, signup, etc.). It’s a good idea to keep your list simple to make it easier to prioritize issues.
  2. Analyze your alert history and identify any common problems that may require a different severity level than you would normally assign (i.e. false timeouts may need to be marked as low severity, even though a timeout would be categorized somewhere higher on your scale).
  3. Decide on the levels you’d like to use for your scale (e.g. low, medium, high). You can add more levels depending on the size of your project and team.
  4. Once you have completed your list and analysis, estimate the severity level of each feature or content object, as well as any recurring errors that you found in your history.

There’s no right or wrong way to do this. The most important thing to know is how your team will classify specific incidents and make sure that everyone is on the same page.
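As a purely hypothetical example of what such a structure might look like once it’s written down (the features and levels below are placeholders for your own):

```python
# Hypothetical severity structure for an e-shop; the features and levels are
# placeholders to replace with your own list.
SEVERITY = {
    "checkout":          "high",    # a broken checkout stops revenue immediately
    "product_catalogue": "high",
    "signup":            "medium",
    "homepage":          "medium",
    "blog":              "low",
    "false_timeout":     "low",     # known recurring error, deliberately downgraded
}

print(SEVERITY["checkout"])  # -> "high"
```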

2) Organization Structure

The next thing you’ll want to do is take a look at the structure of your team.

Having a clear understanding of how your team is structured and automating issue communication will help you define a more efficient flow of communication later on. For instance, team members responsible for your environment should be notified about issues immediately, while a project manager may only want to be kept in the loop for critical issues so they’re aware of possible problems.

Based on what we’ve seen with the project teams at Testomato, development teams are usually structured according to the following table:

Company/Team Size | Team Management | Project Development | Monitoring
freelancer | client | one-person team | none / manually; client; users
small team* | CEO | a few developers; a single developer / admin | none; users
large team | CEO; CTO; VP Eng; Team Leads | a team of developers; a team of testers; a team of admins | none; users

*A small team would generally be found in a web design agency or early stage startup.

For a more detailed structure, here are a few more questions to keep in mind:

  • Who needs to be part of the alert process?
  • What are each person’s responsibilities when it comes to fixing an issue?
  • At what point does an alert require that this role be brought into the communication loop?

3) Communication Structure

One of the hardest parts of severities can be putting together a communication structure, especially if you don’t have a strong idea about how alerts should flow through your team structure.

Think of it this way:

  • Severity Structure: How serious is this problem?
  • Organization Structure: Whose responsibility is it?
  • Communication Structure: If X happens, how and when should team members be contacted?

The main goal of severity levels is to make sure the right people are aware of issues and help prioritize them. Setting a communication structure lets you connect different levels of your severity structure to roles from your organization and add more defined actions based on time sensitivity or error frequency. This way you can guarantee the right people are contacted using the proper channel that is required for the situation. If a responder is not available, there is an escalation path to ensure someone else on the team is notified.

Assigning notification channels and setting thresholds that correspond to your team organization means that problems are handled efficiently and only involve the people needed to solve them.

For example, if a critical incident occurs on your website, an admin receives a phone call immediately and an SMS is sent to the developer responsible for this feature at the same time. If the problem is not resolved after 10 minutes, the team manager will also receive a phone call.

On the other hand, a warning might only warrant an email for the team admin and any relevant developers.

Within PagerDuty, you can create two Testomato services – one general and one for critical issues – and match each service to the escalation policy it needs. If you have an SLA of 15 minutes for critical incidents, that escalation path will be tighter than the one for general incidents.
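One way to picture that split – a sketch with made-up integration keys, not Testomato’s actual code – is to route each check result to the critical or general PagerDuty service based on its severity:

```python
# Sketch: route an alert to one of two PagerDuty services based on severity.
# The integration keys are placeholders.
import requests

SERVICE_KEYS = {
    "critical": "CRITICAL_SERVICE_KEY",  # tight escalation policy (e.g. 15-minute SLA)
    "general":  "GENERAL_SERVICE_KEY",   # looser escalation policy
}

def page(severity, description):
    key = SERVICE_KEYS["critical"] if severity == "high" else SERVICE_KEYS["general"]
    requests.post(
        "https://events.pagerduty.com/generic/2010-04-15/create_event.json",
        json={"service_key": key, "event_type": "trigger", "description": description},
        timeout=10,
    )

page("high", "Checkout form returns HTTP 500")
```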

Here’s a basic overview of how we use severity levels at Testomato using both PagerDuty notifications and our own email alerts:

Team Members: manager, 2 admins (responsible for production), and 2-3 developers.

When errors occur on their project, the team uses the following process:

PagerDuty – SMS and Phone Call

  • All errors are sent to PagerDuty.
  • PagerDuty sends an SMS immediately to both admins.
  • After 5 minutes, an admin is called according to the on-call schedule.
  • After 15 minutes, a team manager is also called.
  • Developers are not contacted by PagerDuty.

Testomato – Email

  • Both errors and warnings are sent as Testomato email notifications to both admins and the developers.
  • Warnings are only sent as emails.
  • Developers are sent emails about both errors and warnings to stay informed about production problems.

We hope you’ve found this post helpful! What severity alert process works best for your team?


A deep dive into how we built Advanced Analytics

Advanced Analytics was a big project for us – not only was it a big step toward helping operations teams improve their performance with actionable data, but it also presented a complex design and engineering challenge.

Designing to Solve Problems

When we design new features, we always want to ensure that we solve real problems for our customers. As we dug into how our customers were using PagerDuty, and the goals they had for their Operations teams, one of the biggest pain points we found was a lack of visibility into what was happening with their operations. While this problem looks different at every company, we noticed many teams struggling with knowing what areas of their system are the most problematic and how their teams are performing.

Designing for Reliability and Scale

We process tens of millions of incidents a month. Since our customers count on us to improve their uptime, the reliability of our product is a core value here at PagerDuty. We needed to ensure that the infrastructure behind Advanced Analytics supported our reliability and performance needs now and in the future.

Reporting load looks different than load on a mobile app or dashboard; rather than needing a small set of incidents right now, you want larger numbers of incidents with calculations from a larger time range.

We needed to make sure reporting calls did not bog down our main incidents database and API, so we built a set of decoupled services that ensures we can serve data quickly to customers while avoiding any impact on our main alerting pipeline, web app, and mobile app.

A fleet of ETL workers takes data from a production slave database, iterating over incident log entries and flattening them into a raw facts table with a specific set of metrics. A second service serves up the raw, denormalized incident facts for display in drilldown tables, and a third consumes the raw data to quickly calculate aggregate metrics. When you access a report, these services serve the mix of summary and raw data you ask for.
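To make the ETL step more concrete, here is a simplified sketch of the kind of flattening involved – folding an incident’s log entries into a single row of per-incident facts. The field names and metrics are illustrative, not PagerDuty’s actual schema.

```python
# Simplified sketch of the ETL flattening step. Field names are illustrative,
# not PagerDuty's actual schema.
from datetime import datetime

def flatten_incident(incident_id, log_entries):
    """Fold an incident's log entries into one row of per-incident facts.

    log_entries: list of dicts like
    {"type": "trigger" | "acknowledge" | "escalate" | "resolve", "at": datetime}.
    """
    first = {}
    for entry in sorted(log_entries, key=lambda e: e["at"]):
        first.setdefault(entry["type"], entry["at"])  # keep the earliest of each type

    triggered = first.get("trigger")
    acked = first.get("acknowledge")
    resolved = first.get("resolve")
    return {
        "incident_id": incident_id,
        "triggered_at": triggered,
        "seconds_to_ack": (acked - triggered).total_seconds() if acked else None,
        "seconds_to_resolve": (resolved - triggered).total_seconds() if resolved else None,
        "escalation_count": sum(1 for e in log_entries if e["type"] == "escalate"),
    }

# Example row for the facts table:
row = flatten_incident("PD-1234", [
    {"type": "trigger", "at": datetime(2014, 10, 1, 2, 14)},
    {"type": "acknowledge", "at": datetime(2014, 10, 1, 2, 19)},
    {"type": "resolve", "at": datetime(2014, 10, 1, 2, 47)},
])
```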


Advanced Analytics accesses a large volume of data, so we explored pre-computing some of the metrics. However, pre-computing presents tradeoffs between data staleness, the number of code paths to serve pre-rolled and on-demand reports, and UX design, so we wanted to make sure we did just the right amount. Based on testing, we discovered that by pre-computing per-incident metrics, we were able to strike the right balance of performance and flexibility.

We knew at the start that Advanced Analytics would serve as the foundation for exposing so much new data to our customers — more even than we’d be able to anticipate. That’s why we built our analytics framework to handle change. When we develop new and better ways to calculate valuable metrics, or have even richer usage data to expose, we can re-process data without any interruption to the feature. Customers will simply see new functionality available once all of the data is in place. In practice, this also allows us to deal with temporary fluctuations in data availability or integrity without downtime.

In practice, all this work is invisible to the user – they go to our reports page, select whatever they want to see, and quickly see their data rendered. But it’s important to us that we build our features to the same bar of scale and reliability we’re famous for.

Getting to the “So What?”

It would be easy to just take our existing reports and add filters, but we wanted to do more. We wanted to give users the context and flexibility to take away real, actionable insights from the reports.

We did this in three ways:

  1. Presenting individual metrics alongside aggregate summaries, so that customers can see how a particular team or service is doing compared to the greater whole.
  2. Showing how metrics have changed since the last time period, so that customers understand at a high level whether they are improving their performance.
  3. Offering quick, simple drilldown to the underlying incidents, services, escalation policies, and users, so that customers can access the granular details of their operations activity.

Learning and Iterating

We collected customer feedback throughout our design and development process, and to make sure we were ready to ship, we ran a comprehensive beta test with select groups of customers. Throughout this process, we got great feedback that helped us iterate to deliver the best possible solution.

Beta customers took instantly to the new reports, excited to have greater visibility into their systems and people, and eager to share how they wanted to use the feature to enact positive change in their teams. Some of our favorite use cases:

  • Identifying teams (escalation policies) with the lowest median times to resolve, so that other teams in the same company could learn from their operational practices and improve ops metrics companywide
  • Using the Team Report for weekly team meetings, reviewing how key metrics have changed from the previous week, and looking at escalated incidents to identify what went wrong
  • Using the incident drilldown to see where similar incidents occurred together, and finding duplicate or noisy alerts to remediate

Speaking with beta customers also gave us a great deal of UX feedback. Throughout our alpha and beta, we made UX and usability tweaks to ensure that our interactions supported the needs of our widely varied customer base — from those with only one or two users and services up to those with hundreds.

While we’re thrilled to deliver this comprehensive solution to operations reporting, we see this as just the first step in PagerDuty’s analytics journey. We’re excited to continue helping our customers improve their uptime & reliability through analytics.

Tell us what you think!

Advanced Analytics is live as a 30-day preview to all customers, after which it will be a feature of our Enterprise plan. We’d love to hear what you think – email support@pagerduty.com with any feedback, and we promise we’ll read every single piece of it.
