A Disunity of Data: The Case For Alerting on What You See

Guest blog post by Dave Josephsen, developer evangelist at Librato. Librato provides a complete solution for monitoring and understanding the metrics that impact your business at all levels of the stack.

The assumption underlying all monitoring systems is the existence of an entity that we cannot fully control. A thing we have created, like an airplane, or even a thing that simply exists by means of miraculous biology, like the human body. A thing that would be perfect, but for its interaction with the dirty, analog reality of meatspace: the playground of entropy and chaos, where the best engineering we can manage eventually goes sideways through age, human error, and random happenstance.

Monitor that which you cannot control

Airplanes and bodies are systems. Not only do we expect them to operate in a particular way, but we have well-defined mental models that describe their proper operational characteristics — simple assumptions about how they should work, which map to metrics we can measure that enable us to describe these systems as ‘good’ or ‘bad’. Airplane tires should maintain 60 PSI of pressure, human hearts should beat between 40 and 100 times per minute. The JDBC client shouldn’t ever need more than a pool of 150 DB connections.

This is why we monitor: to obtain closed-loop feedback on the operational characteristics of systems we cannot fully control, to make sure they’re operating within the bounds we expect. Fundamentally, monitoring is a signal processing problem, and therefore all monitoring systems are signal processing systems. Some monitoring systems sample and generate signals based on real-world measurements, and others collect and process signals to do things like visualization, aberrant-behavior detection, alerting, and notification.

Below is an admittedly oversimplified diagram that describes how monitoring systems for things like human hearts and aircraft oil pressure work (allowing, of course, for the omission of the more esoteric inner workings of the human heart, which cannot be monitored). In it, we see sensors generating a signal that feeds the various components that provide operational feedback about the system to human operators.

[Diagram: a single sensor signal feeding the visualization, monitoring, and notification components]

All too often, however, what we see in IT monitoring systems design looks like the figure below, where multiple sensors are employed to generate duplicate signals for each component that generates a different kind of feedback.

[Diagram: multiple sensors generating duplicate signals, one per feedback component]

There are many reasons this anti-pattern can emerge, but most of them reduce to a cognitive dissonance between different IT groups, where each group believes it is monitoring for a different reason and therefore requires different analysis tools. The operations and development teams, for example, might believe that monitoring OS metrics is fundamentally different from monitoring application metrics, and so each implements its own suite of monitoring tools to meet what it believes are exclusive requirements. For OS metrics, operations may want minute-resolution state data on which it can alert, while development focuses on second-resolution performance metrics to visualize application behavior.

In reality, both teams share the same requirement: a telemetry signal that provides closed-loop feedback from the systems they’re interested in. But because each implements a different toolchain to meet that requirement, they wind up creating redundant signals to feed different tools.

Disparate signals make for unpredictable results

When different data sources are used for alerting and graphing, one source or the other might generate false positives or negatives. Each might monitor subtly different metrics under the guise of the same name, or the same metric in subtly different ways. When an engineer is awakened in the middle of the night by an alert from such a system and the visualization feedback doesn’t agree with the event-notification feedback, an already precarious, stressful, and confusing situation is made worse, and precious time is wasted vetting one monitoring system against the other.

Ultimately, the truth of which source was correct is irrelevant. Even if someone successfully undertakes the substantial forensic effort necessary to puzzle it out, there’s unlikely to be a meaningful corrective action that can be taken to synchronize the behavior of the sources in the future.  The inevitable result is that engineers will begin ignoring both monitoring systems because neither can be trusted.

Fixing false positives caused by disparate data sources isn’t a question of improving one unreliable monitoring system, but rather one of making two unreliable monitoring systems agree with each other in every case. If we were using two different EKGs on the same patient — one for visualization and another for notification — and the result was unreliable, we would most likely redesign the system to operate on a common signal, and focus on making that signal as accurate as possible. That is to say, the easiest way to solve this problem is to simply alert on what you see.

Synchronize your alerting and visualization on a common signal       

Alerting on what you see doesn’t require that everyone in the organization use the same monitoring tool to collect the metric data they’re interested in; it only requires that the processing and notification systems use a common input signal.

The specific means by which you achieve that commonality of input signal depends on the tools currently in use at your organization. If you’re a Nagios/Ganglia shop, for example, you could modify Nagios to alert on data collected by Ganglia, instead of collecting one stream of metrics from Ganglia for visualization and a separate signal from Nagios for alerting.
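As a rough sketch of what that can look like (plugin paths, flags, and thresholds vary by install; check_ganglia.py is the contrib script that ships with Ganglia, and the hostname and thresholds here are placeholders), a Nagios service that alerts on a Ganglia-collected metric might be defined like this:

    # Hypothetical Nagios object config: alert on the load_one metric that
    # Ganglia already collects, rather than re-measuring it with a second agent.
    define command {
        command_name  check_ganglia
        command_line  /usr/lib/nagios/plugins/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
    }

    define service {
        use                  generic-service
        host_name            web01
        service_description  Load via Ganglia
        check_command        check_ganglia!load_one!5!10
    }

With a setup like this, the graph an engineer looks at and the alert that wakes them up are derived from the same Ganglia signal.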

Librato and PagerDuty are an excellent choice for centrally processing the telemetry signals from all of the data collectors currently in use at your organization. With turn-key integration into AWS and Heroku, and support for nearly 100 open-source monitoring tools, log sinks, instrumentation libraries, and daemons, it’s a cinch to configure your current tools to emit metrics to Librato.
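For example, submitting a custom gauge to Librato amounts to one authenticated POST. Here is a minimal Python sketch against the Librato metrics API; the credentials and metric name are placeholders:

    # Minimal sketch: push one gauge measurement to Librato.
    # LIBRATO_USER / LIBRATO_TOKEN and the metric name are placeholders.
    import requests

    LIBRATO_USER = "you@example.com"
    LIBRATO_TOKEN = "your-api-token"

    payload = {"gauges": [{"name": "app.db.pool.connections", "value": 87}]}
    resp = requests.post("https://metrics-api.librato.com/v1/metrics",
                         json=payload,
                         auth=(LIBRATO_USER, LIBRATO_TOKEN))
    resp.raise_for_status()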

[Diagram: data collectors feeding a common telemetry signal into Librato, with alerts routed through PagerDuty]

By combining Librato and PagerDuty, engineers from any team can easily process, visualize, and correlate events in your telemetry signals, as well as send notifications and escalations that are guaranteed to reflect the data in those visualizations. Your engineers can use the tools they want, while ensuring that the signals emitted by those tools provide effective, timely feedback to everyone in the organization. Sign up for a Librato free trial today and learn how to integrate PagerDuty with Librato.


The 4 Operational Metrics You Should Be Tracking

Living in a data-rich world is a blessing and a curse. Flexible monitoring systems, open APIs, and easy data visualization resources make it simple to graph anything you want, but too much data quickly becomes noisy and un-actionable.

We’ve blogged, spoken, and thought hard about what you should monitor and why from a systems perspective, but what about monitoring data on your operations performance? We worked with a large number of PagerDuty customers as we built out our new Advanced Reporting feature, including some of the most sophisticated operations teams out there. We’d like to share some specific metrics and guidelines that help teams measure and improve their operational performance.

Top Metrics to Track

1. Raw Incident Count

A spike or continuous upward trend in the number of incidents a team receives tells you one of two things: either the team’s infrastructure has a serious problem, or its monitoring tools are misconfigured and need adjustment.

Incident counts may rise as an organization grows, but real incidents per responder should stay constant or move downward as the organization identifies and fixes low-quality alerts, builds runbooks, automates common fixes, and becomes more operationally mature.

“We were spending lots of time closing down redundant alerts.” – Kit Reynolds, IS Product Manager, thetrainline.com

When looking at incidents, it’s important to break them down by team or service, and then drill into the underlying incidents to understand what is causing problems. Was that spike on Wednesday due to a failed deploy that caused issues across multiple teams, or just a flapping monitoring system on a low-severity service? Comparing incident counts across services and teams also helps to put your numbers in context, so you understand whether a particular incident load is better or worse than the organization average.

2. Mean Time to Resolution (MTTR)

Time to resolution is the gold standard for operational readiness. When an incident occurs, how long does it take your team to fix it?

Downtime hurts not only your revenue but also customer loyalty, so it’s critical to make sure your team can react quickly to all incidents. Major League Soccer’s fans expect its 20 web properties to be up during live matches. Justin Slattery, Director of Engineering, and his team are constantly working to improve their resolution times because “the cost of an outage during the middle of a game is incalculable.”

While resolution time is important to track, it’s often hard to benchmark: companies see variances in TTR based on the complexity of their environment, the way teams and infrastructure responsibility are organized, their industry, and other factors. However, standardized runbooks, infrastructure automation, and reliable alerting and escalation policies will all help drive this number down.

3. Time to Acknowledgement / Time to Response

This is the metric most teams forget about: the time it takes a team to acknowledge and start work on an incident.

“Time to Respond is important because it will help you identify which teams and individuals are prepared for being on-call. Fast response time is a proxy for a culture of operational readiness, and teams with the attitude and tools to respond faster tend to have the attitude and tools to recover faster.” – Arup Chakrabarti, Operations Manager, PagerDuty

While an incident responder may not always have control over the root cause of a particular incident, one factor they are 100% responsible for is their time to acknowledgement and response. Operationally mature teams have high expectations for their team members’ time to respond, and hold themselves accountable with internal targets on response time.

If you’re using an incident management system like PagerDuty, an escalation timeout is a great way of enforcing a response-time target. For example, if you decide that all incidents should be responded to within 5 minutes, set your timeout to 5 minutes to make sure the next person in line is alerted. To gauge the team’s performance, and to determine whether your target needs to be adjusted, you can track the number of incidents that get escalated.
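As an illustration of tracking escalations programmatically, here is a rough Python sketch that counts escalation log entries via PagerDuty’s REST API (endpoint and type names reflect the current public API and should be checked against the docs; the API key and date range are placeholders, and a real script would also follow pagination):

    # Rough sketch: count how many times incidents escalated in a period,
    # by filtering incident log entries. API_KEY is a placeholder.
    import requests

    API_KEY = "your-pagerduty-api-key"
    headers = {"Authorization": "Token token=" + API_KEY,
               "Accept": "application/vnd.pagerduty+json;version=2"}

    resp = requests.get("https://api.pagerduty.com/log_entries",
                        headers=headers,
                        params={"since": "2014-06-01", "until": "2014-07-01"})
    resp.raise_for_status()

    entries = resp.json()["log_entries"]
    escalations = [e for e in entries if e["type"] == "escalate_log_entry"]
    print("escalations this period: %d" % len(escalations))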

4. Escalations

For most organizations using an incident management tool, an escalation is an exception – a sign that either a responder wasn’t able to get to an incident in time, or that he or she didn’t have the tools or skills to work on it. While escalation policies are a necessary and valuable part of incident management, teams should generally be trying to drive the number of escalations down over time.

There are some situations in which an escalation will be part of standard operating practice. For example, you might have a NOC, first-tier support team or even auto-remediation tool that triages or escalates incoming incidents based on their content. In this case, you’ll want to track what types of alerts should be escalated, and what normal numbers should look like for those alerts.

Track your Operations Performance with PagerDuty

“Before PagerDuty, it might take a day to respond to incidents. Now, it’s seconds.” – Aashay Desai, DevOps, Inkling.

PagerDuty has always supported extracting rich incident data through our full-coverage API, and we’ve also offered limited in-app reporting to all customers.

We’ll soon be releasing our new Advanced Reporting features for Enterprise customers, and are accepting signups for our private beta.


Finding a Scalable Way to Stop Attackers

Evan Gilman, operations engineer at PagerDuty, recently spoke at a meetup at PagerDuty HQ. The first thing Evan noted: “Security is hard.” Whether you’re a small shop with a constantly evolving codebase or a huge enterprise with many thousands of employees, attackers will keep coming, so you need to find a scalable way to stop them.

Secure by default

Evan emphasized the importance of being “secure by default” in relation to file permissions and security privileges. If you put security checks in place, make it a pain in the ass for others to work around them. What good are rules if people know they are bendable? The checks weren’t added for arbitrary reasons – they are there to protect your customers and your team. Also, secure everything. Your logs may not contain passwords, but they may contain sensitive data such as customer information, so you need to secure all of it.

Be paranoid

As a general rule, Evan noted, you should assume your network is hostile. That’s especially true in the cloud.

“You have no idea what else [is] running on the rack next to you.”

Encrypt all network traffic – both inter- and intra-datacenter. What has worked for us is encrypting at the transport layer. Also, remember to sanitize the data leaving your infrastructure, because you can’t trust the provider to watch over what you leave behind.

Automate and distribute enforcement as much as you can

For Evan and the PagerDuty team, automation is about distributed security policy enforcement. Create a centralized ruleset to manage policy, then push it out to individual nodes so they can enforce it themselves.

For example, below is a snippet of the code we use to distribute enforcement, which reads: Cassandra storage on port 7000 should only be accessible to nodes with the Cassandra role.

[Figure: iptables rules snippet]
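In spirit, the snippet boils down to rules like the following (an illustrative reconstruction, not PagerDuty’s actual ruleset; the source addresses stand in for whatever your configuration management derives from the Cassandra role):

    # Illustrative only: permit Cassandra inter-node storage traffic (port 7000)
    # from known Cassandra-role nodes, and drop it from everyone else.
    # 10.0.1.10 and 10.0.1.11 are placeholder role-derived addresses.
    iptables -A INPUT -p tcp --dport 7000 -s 10.0.1.10 -j ACCEPT
    iptables -A INPUT -p tcp --dport 7000 -s 10.0.1.11 -j ACCEPT
    iptables -A INPUT -p tcp --dport 7000 -j DROP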

Take action when there’s something wrong

Ultimately, whatever security solutions you opt for will have to be user-responsive – whether your users are people within your organization, the general public or both.

You need to set up monitoring and alerts to let you know when things aren’t going right. Evan suggested monitoring the level of encryption in your data traffic. If you know 80% of your traffic should be encrypted but only 25% is over a given period, something is wrong and someone needs to get paged immediately.
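The check itself can be as simple as comparing a ratio to a target. A toy Python sketch (metric collection and the actual page are left out; the numbers are placeholders):

    # Toy sketch: decide whether to page when the share of encrypted traffic
    # drops below the expected level.
    def encryption_alert(encrypted_bytes, total_bytes, expected=0.80):
        ratio = encrypted_bytes / float(total_bytes)
        if ratio < expected:
            return "page: encrypted traffic at %.0f%%, expected >= %.0f%%" % (
                ratio * 100, expected * 100)
        return None

    print(encryption_alert(25, 100))  # -> page: encrypted traffic at 25% ...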

Since PagerDuty is distributed across multiple third-party providers, we don’t have a VPC available to use, so we leverage host-based intrusion detection (HIDS) to let us know when there are problems.

The most important advice from Evan? Start today. You’re going to have to do it eventually, and by starting now you can reduce technical debt and weed out the bad stuff you already have. Watch his talk below:

Want to see more talks from this meetup? Check out:

Or learn more about Security at PagerDuty:


Don’t Build Your Own Operations Performance Platform

Building is second nature for many engineers. Naturally, you could build a solution to solve the problems of:

  • Having multiple monitoring tools for your infrastructure
  • Sending multiple types of alerts and re-routing to another person until they’re answered
  • Handling on-call shift changes with one click when life happens
  • Having the dashboards to identify hotspots to make proactive fixes

If you did that, there are quite a few things you’d need to think about. And it’s basically the only thing we’ve been thinking about for over 5 years. Plus building your own tool could cost your business over $330,000 in one year!

Let PagerDuty do the work for you.


[Infographic: the cost of building your own operations performance platform]


Chatting with PagerDuty’s API

Guest blog post from Simon Westlake, Chief Technical Officer of Powercode, a complete CRM, OSS and billing system designed for ISPs. Powercode is in use by over two hundred ISPs around the globe.

Since Powercode is used worldwide by a variety of different types of ISPs, we have integrations with all kinds of different third party services. These integrations encompass email account management, invoice printing, credit card processing, vehicle tracking, equipment provisioning and much more. However, most of these integrations only end up benefiting a handful of our customers, which is why it was a very pleasant surprise to see the heavy utilization of PagerDuty amongst our customers when we integrated it.

Considerations when integrating with 3rd party apps

I first heard about PagerDuty when we held a Powercode user meetup event in our hometown of Random Lake, Wisconsin back in 2012. One of our very passionate users, Steve, was talking to me about how he used Powercode for his ISP. I mentioned that one of the things that we find very challenging is dealing with after hours emergencies, as many of the smaller ISPs using Powercode don’t have the revenue to justify running a 24×7 network operations center (NOC), and that it’s a real pain hoping your buzzing phone wakes you up when you get an email from your network monitoring system letting you know that half the network is down. He quickly pulled out his laptop and logged into his PagerDuty account to show me what we were missing.

After he walked me through the interface and the feature set, I decided on the spot that it was critical for us to integrate PagerDuty into Powercode. However, we have three hard requirements that we always adhere to when we integrate third party services into Powercode:

  1. Does the service have an API?
  2. Is the API well written and documented?
  3. Does the company provide a testing/integration environment for developers?

We’ve been down dark roads before where we’ve decided to skip one of these requirements and it always turns into a long-term nightmare. Poorly written APIs (or no API at all) and little support from the third party mean we end up having to patch together a tenuous integration and, if customers come to really rely on it, it’s an ongoing headache trying to keep it running. Thankfully, PagerDuty delivered on all three counts. The API was solid and well documented, and they readily provided a testing environment for us to integrate the service into Powercode. When we look at new providers, I always cross my fingers for a consistent, RESTful, JSON-based API, and thankfully, that’s what we got!

One of the things I really like when building an integration with a third party system is to find that the API exposed to us is the same API used to build the core system by the original developers and that certainly appeared to be the case with the PagerDuty API. We were really easily able to tie in everything we needed and the integration was smooth and painless.

Analyzing PagerDuty’s Integration API

There were a couple of decisions we had to make when working with the API. Prior to integrating PagerDuty, the only option for alerting in Powercode was to trigger an email. The triggering mechanism had a variety of configuration options such as:

  1. How long should this device be in an alert state prior to an alert being generated?
  2. How many times should Powercode repeat the notification?
  3. What is the frequency and amount of repetitions?

We quickly realized that maintaining this configuration didn’t make sense given the ability to set up your alerting parameters in PagerDuty. We also wanted a way to maintain a history of alerts for devices in PagerDuty. Finally, we had to decide whether or not to set up two-way integration with PagerDuty – if an alert is opened or modified in PagerDuty, should it manipulate anything in Powercode?

After much deliberation, we decided not to integrate two-way communication. We wanted Powercode to remain the ‘master’ as far as the status of incidents, and to encourage people to use the Powercode interface to manage their equipment. This left us with a problem to solve: what happens if someone resolves an incident inside PagerDuty while it is still alerting in Powercode?

To deal with this, we decided to trigger an incident creation or update in PagerDuty on every cycle of our notification engine, which runs once a minute. PagerDuty logs updates to an open incident without triggering another incident, and automatically bundles open incidents that occur around the same time into one alert to reduce alerting noise. While this can create a long list of incident updates in PagerDuty, it gave us some benefits:

  1. If the status of a device changes, that change is reflected in the incident description in PagerDuty. For example, if a router is alerting because CPU usage is too high and it then begins alerting because the temperature is too high, re-triggering the incident allows us to populate this information into the description.
  2. If a user resolves an incident in PagerDuty that is not really resolved (the device is still in an alerting state), it will be re-opened automatically.

One of the nice things about the PagerDuty API is that it allows you to submit an ‘incident key’ in order to track the incident in question. We decided to use the unique ID in our database associated with the piece of equipment that is alerting as the incident key – this simplified the deduplication process and allowed us to maintain a history within PagerDuty of incidents that had occurred with that piece of equipment. It also made it easy to resolve or acknowledge incidents due to changes within Powercode – we always knew how to reference the incident in question without having to store another identifier. This seemingly small feature in the PagerDuty API really expedited our ability to get it integrated quickly. See an example below for how simple this is for us to do in PHP:

[Figure: PHP snippet that triggers a PagerDuty incident via the Integration API]
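The screenshot above shows the PHP version; a rough Python equivalent of the same call against the generic Events API endpoint of the day would look like this (the service key and equipment ID are placeholders):

    # Sketch: trigger or update a PagerDuty incident, deduplicated on the
    # alerting equipment's database ID via incident_key.
    import requests

    def trigger_equipment_alert(service_key, equipment_id, description):
        event = {
            "service_key": service_key,          # placeholder integration key
            "event_type": "trigger",
            "incident_key": "equipment-%d" % equipment_id,  # dedup key
            "description": description,
        }
        url = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
        resp = requests.post(url, json=event)
        resp.raise_for_status()
        return resp.json()

    trigger_equipment_alert("your-service-key", 4211,
                            "Router core-1: CPU usage too high")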

This gives us a nice list of descriptive incidents in PagerDuty:

[Screenshot: descriptive incident list in PagerDuty]

Keep everyone in the know with PagerDuty

Our initial integration only used the ‘Integration API’ of PagerDuty. We reasoned that most of the other functionality would be controlled and accessed by users directly through the PagerDuty application, and it didn’t serve much purpose to recreate it all within Powercode. However, over time, we slowly found uses for the data in other sections. For example, we deployed a system within our NOC that uses the schedules and escalation policies section of the API to display to our local technicians who the current on-call person is and who to call in the event of an escalation. Our next plan is to implement the webhooks section of the API to be able to store logs inside Powercode that show who is currently working an incident – this allows us to give our customers the ability to get better real time data without needing to create accounts in PagerDuty for every member of their organization.
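As a sketch of the on-call lookup half of that (this uses PagerDuty’s current REST API for illustration; the API key and escalation policy ID are placeholders):

    # Sketch: fetch who is currently on call for an escalation policy, for
    # display on a NOC dashboard. API_KEY and the policy ID are placeholders.
    import requests

    API_KEY = "your-pagerduty-api-key"
    headers = {"Authorization": "Token token=" + API_KEY,
               "Accept": "application/vnd.pagerduty+json;version=2"}

    resp = requests.get("https://api.pagerduty.com/oncalls",
                        headers=headers,
                        params={"escalation_policy_ids[]": "PABC123"})
    resp.raise_for_status()

    for oncall in resp.json()["oncalls"]:
        print("level %d: %s" % (oncall["escalation_level"],
                                oncall["user"]["summary"]))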

Final Thoughts

One thing I really like is finding that our customers respond positively and use the service. I believe that the PagerDuty integration in Powercode is the highest utilized third party service out of all the different services we have integrated. Once we started to show people how it worked, the response was universally positive. We even began using PagerDuty for our after hours support for Powercode itself – if you call into our emergency support line after hours and leave a voicemail, it opens an incident in PagerDuty to run through the escalation process!

We continue to recommend PagerDuty to our customers and I’m confident that their solid API and excellent support will mean ongoing integration in Powercode is a no brainer. Check out this integration guide to see how easy it is to integrate Powercode with PagerDuty.


Identify Identity Lifecycles for Cloud App Security

Last month, we covered the tactics Twitter employs to keep users’ data safe. Stephen Lee, the director of platform solutions at Okta, also spoke at the recent PagerDuty security meetup about SaaS apps – the considerations companies need to make around their adoption and protecting the data inside them.

Provisioning the right applications

The great benefit of SaaS products is that they’re available to any user via the web. Yet cloud apps’ ease of use also presents the issue of access control: who gets access to which apps and at what level?

At Okta, automation helps solve one of the major issues of access control: provisioning. When a new employee comes on board – regardless of whether he works in operations, engineering, sales or some other department – he will need to be granted access to certain applications, and automation can greatly simplify this process.

Access control automation also lessens the risk that an employee will need to manually request access to an app down the line, helping reduce IT’s workload and enhancing productivity.

Plan around mobile device use

An important trend that SaaS is helping to power is the growing mobile-friendliness of workplaces. Mobility is great for employees, who are empowered to work where and how they want, in ways that would have been impossible just 10 years ago. Yet it presents real headaches for IT security teams.

Not only does company data live on an ever-greater number of mobile devices, which can easily be lost or stolen, but many of those devices are personal ones. What happens when an employee resigns or is fired – can her former employer be confident that company data won’t go with her?

These considerations demand a robust system for managing access control, one that makes it easy to grant or revoke access on a person-by-person basis.

Anticipate cloud for everything

SaaS isn’t just about enabling mobility. There are other benefits to adopting cloud technology in the enterprise, including cost savings, access to new features and user-friendliness. Yet the massive cloud shift – one of two major trends Stephen pointed to in the corporate tech marketplace, the other one being mobile device adoption – isn’t without security challenges.

For example, managing authentication and authorization when users are accessing apps from a number of locations on a number of different devices is quite hard. You’ve also got the interaction between the end-user and the actual applications – how do you ensure secure connections on networks you don’t control? Then there’s the matter of security audits, to which all public companies are subjected. If you get audited, you’ll have to prove you can generate data around “who has access to what”.

Stephen suggested thinking about security in the context of “identity lifecycles”. The first step in developing a comprehensive security plan is to map out these lifecycles for both internal and external users, thinking in terms of access control. The lifecycle approach makes particular sense when used in concert with a “secure by default” ethos, where security checks are baked in to the product development process.

Think about your users

Another benefit to identity lifecycles: they force companies to identify whose access, precisely, they are controlling. Whether it’s actual users, in Twitter’s case, or engineers and operations folks internally, “lifecycling” requires a holistic look at security.

At Okta, the question is one of accuracy: can people access what they need to access on a reliable basis? Stephen presented a view of Okta’s end-user as the Okta security team’s “customer”.

“They need to be able to access what they need, but they shouldn’t be able to access what they don’t.” – Stephen Lee, Okta

Thinking in terms of others’ needs is a rare thing in the business world, not least in IT, which spends most of its time immersed in device provisioning, bug fixes, system architecture and so on. Yet Stephen points the way to a better, more “customer”-friendly version of enterprise IT.

Watch Stephen’s full presentation here:


You’ve got data. Now what?

Guest blog post from Angel Fernández Camba, developer at Logtrust. With Logtrust, you can view all your business insights in dashboards and get alerts on any parameter you need, always in real time.

Companies don’t have to search far and wide to find ways to increase their top-line revenue. Many are sitting on a wealth of data waiting to be analyzed. But how exactly do you turn data into money? The answer is simple: logs. Logs are events that happen inside servers, applications, and firewalls, and they contain a lot of interesting information that every department within a company can use. Marketing can filter the data to discover new sources of revenue, while IT teams can detect and track suspicious behavior in real time to eliminate downtime and increase customer loyalty.

[Image: IT performance dashboard]

Data is all around us

Mobile phones, tablets and laptops are connected to the company’s servers, and whenever someone is surfing the web, sending emails or using applications, information is generated. This vast amount of “hidden” information holds actionable intelligence waiting to be revealed, but unfortunately most of it is unstructured, which makes it harder to discover useful information among all the data. Some companies turn to big data tools to crunch the numbers, but resources are limited. Those who are familiar with log management solutions create in-house scripts to get the job done. All of this translates into a lot of effort on limited resources, so you need an easier way to summarize your information and find the insights that make a positive impact on your business.

[Image: sources of big data]

How to bring your data to life

Understanding your data is the most important thing. You can have tons of useful data, but without the right tools it is worthless. Data representations can help you understand relations between variables when an error or event happens. This is important because you may want to know how similar events have occurred, or to categorize them. When you are diagnosing a problem, it’s helpful to have a visual representation of the data to spot trends, instead of scanning millions of rows in a table. The way the data is represented also matters: depending on the nature of the data, there are many ways to present it in visual form.

Graphs can give you X-ray vision

Some of us are visual learners, so graphs can be a good way to quickly highlight hotspot areas. Imagine we have a website and we want to know when the service is not working properly. First, we should look at how many requests we are dispatching, and how many of them have errors. A map can help us discover whether we have problems in a specific country or worldwide. Below, for example, is a request distribution map:

[Figure: global request distribution heat map]

Voronoi charts are another way to visualize the same information:

[Figure: Voronoi chart of request and error distribution, highlighting Spain]

Since Spain is the problem area, we can drill down one level deeper to see which parts of Spain have errors. A visualization of errors by city can give further insight into where the problem areas are. And for contrast: how does this error distribution compare to the rest of the world?

The more requests a country sends, the more likely it is to be a source of errors, so let’s assume our problem is not affecting just one country but the whole world. The graph above shows the error distribution for a given day; in the graph below, we take a closer look at the last hour to see whether the problem follows the same distribution.

[Figure: error distribution for the last hour]

The error distribution map for the last hour is focused in Germany. Stats like these can lead us to detect system anomalies. Looking at the server stats, we can see where the errors are coming from.

[Figure: server dashboard showing a CPU overload]

With the dashboard above, we can see our server had a CPU overload caused by system maintenance routines. If a CPU overload like this starts impacting your customers, you’ll need to notify your on-call engineer to resolve the issue. Get more value from your data by integrating PagerDuty and Logtrust today.


Coming Soon – Advanced Reporting

PagerDuty can help you gain insight into your Operations, helping you better manage and prevent incidents. With PagerDuty, you can streamline your incident response, manage on-call scheduling more efficiently, and soon – analyze & prevent incidents.

Soon we’ll be launching Advanced Reporting to take your Operations to the next level. With our new reports, you’ll be able to see what’s going on across your infrastructure, analyze trends and turn insights into action.

Your infrastructure in a single view

PagerDuty brings all of your monitoring services into a single view so you can easily analyze incidents across your entire infrastructure. Our dashboards will help you see alerts over time, by service and by team, so you can reduce alert volume and eliminate non-actionable alerts.

Analyze trends

Once you know what’s happening, dig deeper to understand why. Are your incidents going up or down over time? Which incidents are taking the longest to resolve? How quickly is the team acknowledging and resolving alerts? Our new reports will show you these trends.

Turn insights into action

With a better sense of the hotspots in your infrastructure and your team’s workflow, you can more effectively prioritize your work. Focus on what’s really going to drive greater reliability rather than the problems that happen to be noisiest this week. Drive reductions in your MTTR by finding and eliminating bottlenecks that are slowing down the team. And finally, keep everyone happy and healthy by monitoring the incident workload and proactively giving time off if you notice things have been crazy recently.

For example, let’s say you noticed that you always have more incidents on Mondays. Digging deeper, you see that you’re getting a lot of alerts about a slow response time for one of your API endpoints. It turns out there’s a particularly expensive query that’s running on your database at this time, generated by users who are running weekly reports in the app. You work with the application development team on a way to improve the query, and in the meantime, you increase the threshold for this alert on Mondays (since after all, it didn’t cause an outage) and make sure everyone on your team knows how they can quickly investigate this incident.

You’re invited to our public preview

Before we release Advanced Reporting, we’ll launch a public preview. All of our customers, regardless of their plan, will have the chance to test drive the reports during our preview. Stay tuned – we’ll be sure to let you know when the new reports are available in your account. Can’t wait? Request Beta access!


Defending the Bird: Product Security Engineering at Twitter

Alex Smolen, Software Engineer at Twitter, recently spoke to our DevOps Meetup group at PagerDuty HQ about the philosophies and best practices his teams follow to maintain a high level of security for their 255+ million monthly active users.

Security in a Fast-Moving Environment: The Challenges

Twitter is one of the world’s most widely used social networks and they are continuing to add users at a steady clip.

While Twitter’s growth is exciting, it also poses challenges from a security standpoint. Because so many people count on Twitter to deliver real-time news and information, it’s a constant target for hackers. Two past incidents illustrate why Twitter security matters:

  • When The Associated Press’ account (@AP) was compromised and a tweet was sent about a nonexistent bombing, it drove down the stock market that day
  • When spam was sent from Barack Obama’s account, Twitter received an FTC consent decree related to information security

Twitter’s fast growth also demands lots of infrastructure investment, which forces the company’s security team to move quickly. The site was established as a Rails app, but it has since been switched to a Scala-based architecture. That change demanded all-new security tools and techniques.

Plus, Alex noted, the security team is responsible for many other apps that Twitter has acquired, on top of Twitter itself.

The First Step in Securing Twitter: Automation

Automation is one of the strategies both PagerDuty and Twitter use to optimize for security. The driving force behind automation at Twitter is a desire to make sure that everything the engineering team does by hand actually requires creativity or judgment.

“When we’re doing something and we think it’s tedious, we try to figure out a way to automate it.” – Alex Smolen, Software Engineer, Twitter

It was at one of Twitter’s Hack Weeks, which is like a big science fair, that the issue of automation first arose. From those initial efforts, the security team created a central place to manage security information and run both static and dynamic analyses.

Automation helps Twitter’s engineers find security issues early on in the development process. When security problems do crop up, Twitter’s automation tools – in concert with PagerDuty’s operations performance platform – help assign incidents to the right people, so problems get solved more quickly.

One example is a program called Brakeman, which is run against Rails apps and reports the vulnerabilities in the apps’ code. If a vulnerability is discovered, the developer is alerted so they can get to the issue quickly. The goal is to close the loop as fast as possible, since the later something is discovered, the more complex and expensive it is to resolve.
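Brakeman is open source, and running it against a Rails checkout is a one-liner (the output path here is a placeholder; the report format is inferred from the file extension):

    # Install and run Brakeman's static analysis against a Rails app.
    gem install brakeman
    brakeman -o brakeman-report.json /path/to/rails/app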

Other tools include Coffee Break for JavaScript and Phantom Gang, which dynamically scans Twitter’s live site. As with Brakeman, issues are assigned to the right on-call person for the job.

The Second Step: Robust Code Review Process

Security is not just the security team’s responsibility; it is owned by many engineers. There are also specific teams that deal with spam and abuse.

On the theme of shared accountability, Twitter’s developers are encouraged to work out security kinks early in the code-development process. Sensitive code gets a security review as soon as it is submitted, and devs can also use a self-serve form to request the security team’s input.

The security engineering team keeps itself accountable with the help of a homebuilt dashboard showing which reviews need to be done. Once upon a time, Twitter’s security engineers used roshambo to assign code reviews, but as the team scaled, they switched to a script that assigns reviews randomly.

“Roshambo is really hard to do over Skype.”

The Third Step: Designing Around Users

Twitter users, all 200-plus-million of them, have a vested interest in the site remaining secure. For that reason, some of Twitter’s security measures are customized for specific use cases.

One is two-factor authentication, which has been available on Twitter for some time. Initially, it was SMS-based; today, there is a natively built version that can generate a private key to sign login requests.

Another user-facing measure is an emphasis on SSL. Twitter was one of the first major services to require 100% SSL. Yet because many sites still allow non-SSL connections, Alex’s team has built in HTTP Strict Transport Security (HSTS), which instructs browsers to only visit the SSL version of the site. Another strategy in use is certificate pinning: if someone tries accessing Twitter with a forged certificate, the native client won’t accept it.
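HSTS itself is just a response header. For instance, in nginx (the max-age value here is illustrative):

    # Illustrative nginx directive: tell browsers to use HTTPS exclusively
    # for the next year, including on subdomains.
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains";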

Ultimately, Alex said, security is about enabling people – both users and Twitter’s own engineers. Given that Twitter’s security team represents about 1% of all the engineers in the company, keeping Twitter secure isn’t easy. But with the right processes and tools, those engineers can do their jobs effectively and keep Twitter humming.

Watch Alex’s full presentation here:

Stay tuned for blog posts around the other two security meetup presentations from Stephen Lee (Okta) and our very own Evan Gilman.


Effective Start / End Practices for On-Call Scheduling

Since we launched on-call handoff notifications, lots of our customers have used them to make sure they never forget when they’re on-call. Over the years, we’ve seen a wide variety of on-call schedules, and we thought we’d share some of the more favored practices we’ve seen.

Exchange Shifts During Business Hours

Below is a distribution of all start and end times for on-call shifts scheduled within PagerDuty:

[Chart: distribution of on-call shift start and end times]

The most popular time to hand off an on-call shift is midnight, followed by 8:00 AM and 9:00 AM, then 5:00 PM and 6:00 PM. Despite the popularity of the midnight swap, at PagerDuty we recommend having your handoff occur during business hours, preferably when both parties are present in the office. Unless you both happen to be in the office at midnight, in which case, go home.

Switching your shift while on-site gives you the opportunity to talk to the next person going on-call about any issues that occurred during the previous shift, or to give a heads-up on anything they may want to be on the lookout for.

At GREE, their team syncs up every Monday morning to review alerts from the previous week, go over the upcoming schedule and handoff the rotation to the next on-call team. This gives each team additional insight into the week ahead and makes sure that everyone knows who is the primary, secondary and manager responsible for keeping GREE reliable each week.

Don’t Switch Shifts Over the Weekend

Below is a graph of shift handoffs distributed by day of the week:

[Chart: shift handoffs by day of the week]

If on-call responsibility is exchanged while on-site, we’d expect to see fewer handoffs occurring over the weekend. This distribution closely aligns with our scheduling philosophy at PagerDuty. By switching shifts on Monday, you are able to recap an entire week of data with minimal confusion.

Or you can schedule your shift exchanges during your weekly team meetings. This still allows you to review information and give a heads-up about any potential problems.

Keep Regular Shift Lengths

Another hot topic in on-call scheduling is shift length. Should you switch weekly? Daily? Hourly? While much of this may depend on the size of your team, you will also want to consider other factors. You may want to review your historical alerting data to see if there are any hot times in your systems, to make sure no single person is getting the short end of the stick and burning out.

Below is a distribution of popular shift lengths, from 1 hour up to 2 weeks:

[Chart: distribution of shift lengths, 1 hour to 2 weeks]

The most popular shift lengths seem to be 8 hours, 12 hours, 1 day and 1 week. Keeping simple shift lengths means less confusion and forgetfulness about when you begin or end a shift.

So when should you use each of these shift lengths?

  • 8 Hours – Great for someone who is covering the business day. You may have another team covering off-business hours.
  • 12 Hours – This is popular for teams utilizing PagerDuty’s Follow-the-Sun schedule, which allows your international teams to be on-call during hours they would be awake.
  • 1 Day – Simple for medium size teams where everyone is going to be responsible for one day.
  • 1 Week – Great for small teams so they don’t have to toss responsibility back and forth to each other.

Find a Schedule that Works for Your Team

At PagerDuty, each internal team handles on-call scheduling differently. Our Operations Team has a simple weekly rotation, while our Realtime Team has a weekday / weekend rotation, where people are on call either during the week or over the weekend. This is because our Realtime Team is slightly larger than our Operations Team, so they wanted team members on call more often and for shorter shifts, to keep operational skills from getting rusty.

Our actual Realtime Team schedule:

[Screenshot: PagerDuty Realtime Team on-call schedule]

It’s important to find a process that works for your team. Even within a single organization, different departments may find one approach works better than another. PagerDuty’s On-Call Schedules give you the ability to customize your team’s shifts however works best for you. If you’re not quite sure where to start, just remember the basics: switch shifts when both parties are present in the office (if possible) and maintain a standard shift length (e.g. 12 hours, 1 day, 1 week) to help avoid confusion about when someone’s shift starts and ends.

To make these transitions more convenient, we also offer heads-up notifications for when you start and end your shift. Simply log into your PagerDuty account and edit your On-Call Notification Rules from your profile page to get started.

