Outage Post Mortem – June 3rd & 4th, 2014

On June 3rd and 4th, PagerDuty’s Notification Pipeline suffered two large SEV-1 outages. On the 3rd, the outage resulted in a period of poor performance that led to some delayed notifications. On the 4th, the outage was more severe. To recover from it, we purged in-flight data from the system, which resulted in failed notification delivery, failure to accept incoming events from our integration API, and a significant number of delayed notifications.

We would like to apologize to all of our customers who were affected by this outage. This was a very serious issue, and we are taking steps to prevent an outage of this scale from happening again in the future.

What Happened?

Our notification pipeline relies on a NoSQL datastore called Cassandra. This open source, distributed, highly available and decentralized datastore powers some of the most popular sites/services on the internet. Cassandra has also proven to be a complicated system that is difficult to manage and tune correctly.

On June 3rd, the repair process on one of the nodes in our Cassandra cluster kicked off as part of normal operations. This background repair process, used to keep stored data consistent across the cluster, puts substantial strain on Cassandra, which impacted how well our datastore performed. The repair process, combined with the additional high workload being applied at the time, put the Cassandra cluster into a heavily degraded state.

To remedy the situation, our team decreased the load on the cluster, stopping the repair process as part of this effort. While this temporarily resolved the incident, the cluster oscillated between periods of stability and instability for six hours. We then cut communication between some of the nodes in an attempt to stabilize the cluster, and eventually normal operations resumed.

During this outage, PagerDuty’s Notification Pipeline was degraded to a point where approximately 3% of events sent to our integration API could not be received, and a small number of notifications (a fraction of 1%) experienced delayed delivery.

On June 4th, our team manually restarted the repair process that had been postponed on the 3rd. Despite disabling a substantial amount of optional system load, the repair process eventually re-triggered the previous day’s outage. Unfortunately, this second outage was much more damaging: during its course we were unable to receive 14.9% of events sent to our integration API, 27.5% of notifications were not delivered at all, and 60.9% of notifications were delayed by more than 5 minutes.

At first we attempted to repeat the previous day’s process to get Cassandra stabilized, but these efforts did not have the same result. After several additional attempts to stabilize the notification pipeline, we decided to take a drastic measure to regain control of it: a “factory reset” that deleted all in-flight data in the notification pipeline. This allowed the team to gradually restore service, stabilizing the pipeline and returning it to regular operation. Cassandra recovered immediately after the “reset”, although some of our downstream systems required manual intervention to bring their data consistent with the new “blank slate”.

Though our systems are now fully operational, we are still conducting our root cause analysis, as we need to understand why our stabilization approaches didn’t work. Fundamentally, however, we know that we were under-scaled, and we know that we were sharing the cluster amongst too many different services with disparate load and usage patterns.

What are we doing?

Moving forward our top priority is to make sure an outage like this does not affect our customers again. We take reliability incredibly seriously at PagerDuty and will be prioritizing projects that will help make our system more stable in the future. Here are a few of the changes we will be undertaking to prevent this type of outage from occurring again:

  • Vertically scaling the existing Cassandra nodes (bigger, faster servers)

  • Setting up multiple Cassandra clusters and distributing our workloads across them

  • Establishing system load thresholds at which, in the future, we will proactively horizontally scale-up our existing Cassandra clusters

  • Upgrading the current & new clusters to a more recent version of Cassandra

  • Implementing further load shedding techniques to help us control Cassandra at high loads

  • Bringing additional Cassandra expertise in-house
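As a sketch of what the threshold-based scaling item above could look like, here is a minimal check that flags metrics which have crossed a scale-up limit. The metric names and threshold values are illustrative assumptions, not PagerDuty's actual operating numbers.

```python
# Hypothetical scale-up thresholds, per metric (illustrative values only).
SCALE_THRESHOLDS = {
    "cpu_utilization_pct": 70.0,
    "disk_used_pct": 60.0,
    "pending_compactions": 100,
}

def scale_actions(metrics):
    """Return the metrics that have crossed their scale-up threshold."""
    return sorted(
        name for name, limit in SCALE_THRESHOLDS.items()
        if metrics.get(name, 0) >= limit
    )

# Example: a node with plenty of CPU headroom but growing disk usage.
drifting = scale_actions({
    "cpu_utilization_pct": 45.0,
    "disk_used_pct": 62.5,
    "pending_compactions": 30,
})
# -> ["disk_used_pct"]
```

The point of establishing thresholds like these in advance is that scaling becomes a proactive, routine response to a number, rather than an emergency reaction to a degraded cluster.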

One last thing that needs to be mentioned:  we had already decided to take some of the above actions as we had noticed similar issues recently.  We had made some of our planned improvements already, but unfortunately we had decided to do the rest of the improvements in an order that was based mostly on efficiency:  we had decided to do the Cassandra version upgrade before the vertical + horizontal scaling.  Unfortunately we ran out of time.  It’s now evident that the scaling had to happen first, and since the 4th, we have already completed the vertical scaling and are partway through the splitting of our Cassandra clusters based on workload and usage.

If you have any questions or concerns, shoot us an email at support@pagerduty.com.



Cutting Response Time in Half while Increasing our Support Ticket Volume

Keeping our customers happy is a source of pride at PagerDuty. While having a reliable product that is loved by our customers makes our lives easier, people still run into the occasional issue. We’ve implemented some unique tools and processes to ensure our customers receive the support they deserve.

Keeping Response Time at a Minimum

For the past year, our goal has been an initial response to customers within two hours. While this is easy to do in principle, keeping an eye on such a metric in real time isn’t as easy as you would hope. Earlier this year, we developed a dashboard that shows our queue of open tickets, color-coded to visually indicate how long each ticket has been idle.


As each new ticket inches towards our two-hour window it transitions from green to red. This makes it easy to see at a glance which customers we’re behind in responding to.
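The green-to-red transition could be sketched as a simple mapping from idle time to a color. The intermediate color stops here are assumptions for illustration; only the two-hour goal comes from the text above.

```python
SLA_MINUTES = 120  # our two-hour first-response goal

def ticket_color(idle_minutes):
    """Map how long a ticket has sat idle to a dashboard color."""
    fraction = idle_minutes / SLA_MINUTES
    if fraction < 0.5:
        return "green"
    if fraction < 0.85:
        return "yellow"
    return "red"
```

A ticket that just arrived renders green, one idle for 90 minutes renders yellow, and anything approaching the two-hour mark renders red, so a glance at the board tells you exactly where to spend the next hour.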

Our dashboard also makes it easy to see if any of our agents have too many tickets on their plate, helping us balance our queue and spread the work evenly. We’ve already seen a drastic effect on our customer wait times:


This is all while our ticket volumes have increased steadily:

[Chart: created tickets by submission method, rolling 13 months]

Alerts for Missing our Internal SLA

Occasionally, a ticket will go untouched for more than two hours. If and when this happens, we trigger a PagerDuty incident that alerts the agent responsible for the ticket. If the alert is not acknowledged or resolved, it will get escalated to another agent on the team.
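One way to wire this up is through PagerDuty's generic events integration, posting a "trigger" event when a ticket breaches the SLA. The sketch below only builds the payload; the service key and ticket fields are made up, and actually sending the HTTP request is left out so the example stays self-contained.

```python
import json

# Generic events integration endpoint (as of this writing).
EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"

def sla_breach_event(service_key, ticket_id, agent, idle_minutes):
    """Build a 'trigger' event payload for a ticket that blew its SLA."""
    return {
        "service_key": service_key,
        "event_type": "trigger",
        # One incident per ticket: repeated triggers dedupe on this key.
        "incident_key": "ticket-%s" % ticket_id,
        "description": "Ticket %s untouched for %d minutes (owner: %s)"
                       % (ticket_id, idle_minutes, agent),
    }

payload = json.dumps(sla_breach_event("EXAMPLE_SERVICE_KEY", 4242, "alice", 133))
```

Using the ticket ID as the incident key means a still-idle ticket re-alerts into the same incident rather than opening a new one each time the check runs.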


Our team has a designated room in HipChat to discuss support issues. Using PagerDuty’s HipChat integration the entire team can start chatting to tackle a latent customer issue.


Keeping Our Customers 100% Satisfied

The combination of all of these tools helps us hit our goal of 100% customer satisfaction. This goal is at the forefront of all of our agents’ minds, and one we only narrowly miss in most months: our lowest monthly rating over the past 12 months was 96%, and we hit 100% in both October and December. We use Zendesk’s automated customer satisfaction emails to capture this information.


We track these metrics in GoodData, which syncs automatically with Zendesk, making it really easy to build dashboards with useful information.

We try to be as open as possible with the rest of the PagerDuty teams, so we display what our customers are saying about us (even bad things) on a television that shows recent customer satisfaction ratings:


This dashboard was made during a recent hack day. Source is available here: https://github.com/ryanhoskin/satisfaction

But this is just the tip of the iceberg; we use a variety of tools to help manage interactions with our customers.

  • Zendesk for managing tickets and our knowledge base
  • Olark for live chat, configured to create tickets within Zendesk
  • JIRA for communication between teams and filing bugs
  • HipChat for JIRA and PagerDuty updates
  • Skitch for screenshots

Our team is here to help our customers, so please don’t hesitate to reach out with any questions or any support you need to use PagerDuty. The best way to connect with us is via support@pagerduty.com, and we’ll see your name pop right up on our dashboard.


How We Ensure PagerDuty is Secure for our Customers

We at PagerDuty take security very seriously. To us, security means protecting customer data and making sure that customers can securely communicate with PagerDuty. Given our focus on high availability, we pay a lot of attention to how we design and monitor our security infrastructure.

In a dynamic environment like PagerDuty’s, providing robust host and network level security is not easy. Doing so in a way that is distributed, fault-tolerant, and hosting provider-agnostic introduces even more challenges. In this series of blog posts, we wanted to highlight some of the strategies and techniques that PagerDuty’s Operations Engineering team is using to secure the PagerDuty platform and to keep customer data safe.

Here are some of the best practices we follow around security:

Establish internal standards

All of our internal security decision making is checked against a list of philosophies and conventions that we maintain. This list is not written in stone (we update it when we find problems), but it forces us to understand where we are making trade-offs and helps us with our decision making. It also makes it easy for new engineers to quickly understand why things are set up the way they are.

Secure by default

We follow a convention of securing everything by default, which means that disabling any security service has to be done via an override or exception rule. This serves to enforce consistency across our dev, test, and production environments.

As tempting as it is to poke a hole in the local firewall or to disable SSL when connecting to MySQL, we don’t want to be making these types of security changes in our production or test environments. Setting our tools to automatically “do the right thing” keeps all of our engineers honest. Also, by having this kind of consistency, we can debug security-related issues earlier in the development cycle.
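The secure-by-default convention can be sketched as a default-deny policy where every allowance is an explicit, named exception. The rule names, networks, and ports below are illustrative, not our actual policy.

```python
DEFAULT_POLICY = "deny"

# Every allowance must be a named exception; nothing is open implicitly.
EXCEPTIONS = {
    ("10.0.0.0/8", 9042): "cassandra-internode",
    ("10.0.0.0/8", 22):   "ssh-from-bastion",
}

def decide(source_net, port):
    """Allow only traffic matching a named exception; deny everything else."""
    rule = EXCEPTIONS.get((source_net, port))
    return ("allow", rule) if rule else (DEFAULT_POLICY, None)
```

Because each exception carries a name, "poking a hole" requires writing down why the hole exists, and the same policy file applies unchanged in dev, test, and production.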

Assume a hostile and flaky network

All of our infrastructure is deployed to cloud hosting providers, whose networks we cannot control. Additionally, we are deployed across multiple regions, so a good chunk of our data traffic goes over the WAN. This introduces the challenges of packet loss and high latency – as well as the possibility that intruders may try to eavesdrop on our traffic.

With this in mind, we encrypt all data in flight and always assume that our data is flowing through networks where we have little visibility.
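In Python terms, the "hostile network" stance amounts to refusing unverified peers whenever a connection is opened. This is a minimal sketch using the standard library's `ssl` module, not a description of our actual tooling.

```python
import ssl

def strict_client_context():
    """TLS context that verifies both the peer's certificate and hostname."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = True           # reject certificate/hostname mismatches
    ctx.verify_mode = ssl.CERT_REQUIRED  # never talk to an unverified peer
    return ctx
```

Centralizing the context construction means no individual call site can quietly downgrade to unverified TLS.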

Be provider-agnostic

Security Groups, VPCs, rescue consoles: these are all examples of provider-specific tools that we are unable to use, because we are spread across multiple hosting providers and need to avoid vendor lock-in. All of our security tooling has to be based on commonly available Linux tools or installable packages, which eliminates our dependency on provider-specific security tools and leads to better stability. We leverage Chef to do most of this work for us and have built out nearly all of our tooling on top of it.

Centralize policy management and distribute policy enforcement

Most companies approach AAA (Authentication, Authorization, Accounting) by having a single source of truth for access control and then using that source of truth as the authorization mechanism as well. Examples include using an LDAP server, using a RADIUS server, or storing network policies on a perimeter firewall. Instead of relying on a single source of truth for both policy management and enforcement, we split out and distribute the enforcement pieces to the individual nodes in the network. Our typical pattern: when a change is introduced into the network, the single source of truth updates the policy, which is then pushed out to all of the nodes.
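The centralize-then-distribute pattern can be sketched as a central policy keyed by role, rendered into the concrete rule list each node enforces locally. The roles, networks, and ports here are invented for illustration.

```python
# Single source of truth: policy per role (illustrative rules only).
CENTRAL_POLICY = {
    "all": [("allow", "10.1.0.0/16", 22)],      # e.g. SSH from an admin net
    "web": [("allow", "0.0.0.0/0", 443)],
    "db":  [("allow", "10.0.0.0/8", 3306)],
}

def rules_for_node(roles):
    """Render the enforcement rules a node with these roles applies locally."""
    rules = list(CENTRAL_POLICY["all"])
    for role in roles:
        rules.extend(CENTRAL_POLICY.get(role, []))
    return rules
```

On a policy change, only `CENTRAL_POLICY` is edited; re-rendering and pushing `rules_for_node(...)` to each node keeps enforcement local while the truth stays central, so losing the policy server does not stop any node from enforcing its last-known rules.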

Constantly validate

While all of the above serves to provide a robust security architecture, it’s important that we validate our security measures to ensure that they’re doing what we actually need them to do.

Some companies do quarterly penetration testing, but in our dynamic environment that is too slow. We actively scan, monitor, and alert (with PagerDuty) on anything that is not expected. Whether a mistake is made (e.g. an engineer accidentally opens the wrong network port on a server) or there is actual malicious behavior (e.g. someone trying to brute-force an endpoint), we get alerted to the problem immediately.
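The core of such a scan is a diff between what should be open and what is. A minimal sketch, with the approved port set as an assumed input:

```python
def unexpected_ports(expected, observed):
    """Ports that are open but not in the approved set: alert on these."""
    return sorted(set(observed) - set(expected))

# e.g. an engineer accidentally left 8080 open on a host
# that should only expose SSH and HTTPS:
drift = unexpected_ports({22, 443}, {22, 443, 8080})
```

Running a check like this continuously, and routing any non-empty result through PagerDuty, is what turns a quarterly audit into an alert that fires within minutes of the mistake.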

This is the first post in a series about how we manage security at PagerDuty. To continue reading this series, check out Defining and Distributing Security Protocols for Fault Tolerance.


Custom Alert Sounds Now Available for iOS and Android

Our new version of the iOS and Android mobile apps makes push notifications even better. We’ve seen the use of push notifications increase since we released our mobile app, but we’ve also heard that the notifications may sometimes be overlooked because they sound the same as every other notification. We’re excited to announce that one of our most requested updates – the ability to set custom sounds for push notifications – is here.

Custom Alert Sounds

We know that it’s important to be able to distinguish between a PagerDuty alert and your mom messaging you on Facebook. To make sure you never miss an alert, our mobile apps now come with 10 sounds to choose from for your PagerDuty push notifications. We’ve included some of our favorites below.

PagerDuty Alert – let us become your favorite robot

Morse Code – it actually spells out PagerDuty

Sad Trombone – when your systems let you down

If you’re using an Android phone, you can also set the push notification to play a sound even if your phone is on silent.

Push notifications are a great way to quickly acknowledge an alert. Opening the mobile app directly from the notification lets you see all relevant info for the incident, and you can swipe to acknowledge, resolve or reassign it. However, we recommend setting a backup notification method (i.e. phone call or SMS message) since there are a few situations that could cause you to miss one.

Don’t already have the mobile app? You can download it for iOS or Android.

Are there any other features you would like to see in our mobile app? Shoot us an email at support@pagerduty.com or send us a tweet, @PagerDuty.


Mobile Monitoring Metrics that Matter for Reliability

This is a guest blog post from Justin Liu of Crittercism, which provides mobile app performance management. Crittercism products monitor every aspect of mobile app performance, allowing Developers and IT Operations to deliver high performing, highly reliable, highly available mobile apps.

Mobile apps are now critical for all types of businesses. Whether your company builds an eCommerce app for consumers or a CRM app for enterprises, your apps need to work all the time. The first step in having a high performing mobile app is to ensure that any app you build and manage is stable and responsive.

The worlds of DevOps and mobile development are colliding. Like in any successful DevOps team, this means empowering your mobile developers with real-time visibility into mobile app performance so they can quickly identify and resolve problem areas. Here’s a short guide to two mobile metrics that matter to your users and your business.

Where Mobile Monitoring Complexity Comes From

Before diving into the metrics, it’s important to note that there is a range of considerations that you normally don’t need to think about in the web world.  Why does your app crash in China but not in the US?  How does performance differ across OS versions?  What is the effect of carrier latency on transaction conversion rates in your shopping app?  There are millions of questions you can only ask once your mobile app has been released into the wild.

Some factors that affect performance are outside your control, such as devices, operating systems, and networks.


The complexity increases as your app starts connecting to cloud services, such as social platforms, analytics tools, and ad platforms.  Even if you are building an internal B2E (business-to-employee) app, you may rely on proprietary ERP systems that you access through the cloud.


Two Metrics You Need to Track

Every month at Crittercism we monitor over 30 billion mobile app sessions and have found that the two most important mobile app performance metrics are:

1. Uptime. Our data shows that mobile app uptime, meaning the percent of app loads that don’t crash, must be greater than 99% for your app to be competitive in public app stores. Through stack traces, remote logging and other types of root cause analysis, Crittercism lets developers pin down the exact causes of problems in their apps.


2. Responsiveness. In addition, your app’s service calls must respond in under one second to satisfy users.
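The two metrics above reduce to simple arithmetic over session data. This sketch uses invented field names, not Crittercism's actual schema:

```python
def uptime_pct(app_loads, crashes):
    """Percent of app loads that did not end in a crash."""
    return 100.0 * (app_loads - crashes) / app_loads

def responsive(latencies_ms, budget_ms=1000):
    """True when every monitored service call met the one-second budget."""
    return all(t < budget_ms for t in latencies_ms)

# 10,000 app loads with 50 crashes is 99.5% uptime,
# clearing the 99% bar mentioned above.
```

Tracking both together matters: an app that never crashes but takes three seconds per call still fails users, just more quietly.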


Decrease Mobile Downtime

Crashes and cloud service errors are the primary culprits of poorly performing mobile apps, and Crittercism is the only product that combines error monitoring with service monitoring for mobile platforms.  Now you can trigger alerts for any metric that Crittercism tracks.  By integrating Crittercism with PagerDuty, you ensure that you’re using the same systems for monitoring mobile apps that you already use for the rest of your DevOps infrastructure.

Integrate Crittercism and PagerDuty today to start delivering high performing mobile apps.


Custom Alert Sounds and Increased Reliability for Push Notifications

In our quest to bring better and more functionality to our iOS and Android mobile apps, we will soon be releasing a new version that should help you get the most out of your push notifications.

Six months ago, we talked about how push notification growth began overtaking voice, SMS and email alerts. That trend has continued, and because of this high adoption, bringing more utility to push notifications became a top priority.

In our upcoming iOS and Android apps, we will improve push notifications in two ways: by making them testable and by adding custom alert sounds, the latter being a feature many of our customers have requested through our support team and via Twitter.

Testing Push Notifications

When you log into the PagerDuty mobile app, we capture the device’s unique ID in order to send push notifications. However, when your device ID changes, push notifications may not arrive on your device. This could happen when you upgrade your device’s OS or reset your device.

To mitigate this kind of issue, we are adding handling to our code to ensure we capture and store the correct device ID, even if that ID periodically changes. Customers will also be able to send test push notifications from the web application.
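The device-ID handling boils down to reconciling the token the device reports now with the one on record. A minimal sketch, with a plain dict standing in for the server-side store:

```python
push_tokens = {}  # user_id -> push token on record (stand-in for a real store)

def sync_push_token(user_id, current_token):
    """Record the device's current token; report whether it changed."""
    changed = push_tokens.get(user_id) != current_token
    if changed:
        push_tokens[user_id] = current_token
    return changed
```

Running this on every app login or resume means an OS upgrade or device reset that rotates the token silently heals itself the next time the app opens, instead of leaving notifications going to a dead token.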

It’s worth noting that while push notifications can be quite helpful, they may not be as reliable as our other notification methods. Neither Google Cloud Messaging nor Apple Push Notification Service guarantees delivery times for push notifications, and it’s not yet possible to add redundancy or to switch providers during outages. We recommend that you set up additional alert channels to ensure you never miss an alert.

Custom Alert Sounds

Since launching our mobile apps, we’ve received lots of requests for custom alert sounds for push notifications. When you receive a PagerDuty push notification on your mobile device, you need a way to distinguish that alert from all your other apps sending you push notifications, whether it’s your latest Facebook status getting a new like or your friend besting your top score in Candy Crush.

The next version of our app will be bundled with 10 sound options that you can set as the sound for your PagerDuty push notifications. We used the vast Creative Commons library to bring you a wide array of bleeps, bloops, alarms, and unique sounds that will ensure you won’t mistake a PagerDuty push notification with FourSquare asking you to check into your apartment when you walk in for the night.

Are there any other features you would like to see in our mobile app? Shoot us an email at support@pagerduty.com or send us a tweet, @PagerDuty.


Eliminate Bottlenecks with Transparent Operations Communication

Guest blog post from Ville Saarinen, Development & Marketing at Flowdock. Ville is interested in the combination of sociology and technology, and is based in Helsinki, Finland along with the rest of the Flowdock team.

Software development teams have a wide selection of tools and services to choose from to help with specific tasks. There are services that help with team communication, version control, testing, deployment, monitoring and customer support. And since many of these tools are offered as an online service, getting set up can be as simple as signing up.

Unfortunately, there’s also a downside to this wealth of choice. Staying up to date with your work requires constant jumping from one service to another, or enabling notifications that flood your inbox. And this applies to the team level as well: if someone wants to see what the team is working on, they need to either check each service’s activity stream separately, or start duplicating information (“I’ve finished working on XYZ”) into project management tools.

At Flowdock, we’ve tackled this problem with a team inbox. It integrates with the team’s tools to provide a real-time activity stream of everything that the team is working on. The team inbox lives right next to the team’s chat, where the team spends a large portion of their day, providing a window into what the team is actually doing. This sparks discussions based on activity, not reporting. And everyone is happier due to the decreased email load.

Find the Right People

If you’re on a team that’s responsible for keeping a service up and running 24×7, resolving issues quickly is critical for customer satisfaction and loyalty. Assigning a person to an issue helps build a sense of ownership, and will help items be resolved more quickly. However, if the owner of an issue is unable to solve the problem, what then? Mass blasting everyone isn’t difficult, but is an activity that doesn’t scale. Escalating the issue to specific team members works better – and with a team inbox, those you tapped to help out are quickly brought up to speed.

Another scenario is when a problem arises and the primary on-call engineer is unreachable. Automatic escalations help find an available person, but since the escalation notification is visible to the whole team anyone who is available can react. A single person is no longer the bottleneck.

Make Past Resolutions More Discoverable

Whenever a problem is solved, something new is learned. Good practice includes holding post-mortems to figure out why the problem came up, how to fix the root cause, and how to document the solution. The post-mortem should be held as soon as possible, while the details of the issue are still fresh in the team’s memory. We recommend putting the summary of these discussions into the team’s chat room and tagging it with an agreed-upon tag. That way, by searching for the right tag, people can find solutions much more quickly when similar issues arise in the future. Even better, by tagging the conversations and actions that occurred while resolving an issue, you can follow its resolution blow-by-blow, after the fact. This type of passive collaboration helps a team learn from the past, and makes them much more effective as a team.

Another benefit of having these conversations (or their summaries) out in the open is that it can help other teams. We recommend that our users open their chat rooms to everyone in their organization. That way, when someone outside the team has a question, they can simply join the chat and ask the team. Or, if they know what tag to search for, they can find the answer directly from previous conversations or activity notifications without bothering anyone.

Information Transparency

Making information transparent to everyone can have huge benefits for teams and organizations. With the right information, better decisions are made, and the gatekeepers of information are no longer the bottleneck.

The challenge with making information available is that people inevitably suffer from information overload. How do you have time to work while you’re constantly being bombarded by new information? Where can you find the solution to your specific problem when you have such a vast amount of information available at your fingertips? The key is to use communication channels that allow users to define what is important to them, and what is not. Users should be able to decide what types of events call for their attention. The channel should also help users categorize the data (e.g. with tags) so that the right information can be found later on.


With a more open communication culture – open chat rooms, visible work activity – everyone can use operations information to quickly resolve incidents and create a culture of shared responsibility.


Triage PagerDuty Alerts Using Loggly

Guest blog post by Jason Skowronki, product manager at Loggly. Loggly is the world’s most popular cloud-based log management service, with over 3,500 active customers. It helps developers and system administrators troubleshoot problems, monitor system status, and proactively address issues with alerts.

You’re out to dinner with friends and you receive an alert through PagerDuty. Your signup rate has dropped way below its usual level. This could indicate a serious problem with your site, but it could also just be an unusual traffic pattern. Should you leave the restaurant and rush home? Or would you just be sacrificing much-deserved downtime for something that could wait until tomorrow?

Alerting is critical to the 24×7, net-centric economy. It’s a way to minimize the impact of application problems on revenue and profits. At Loggly, we love PagerDuty because it has brought sanity to how we become aware of operational problems, assign the right resources to solve those problems, and follow them to completion. It answers the all-important “who” questions and is a perfect complement to the Loggly service, which gives DevOps teams a way to delve into the “why.”

Triage and Find the Root Cause of Problems Faster

Let’s go back to our interrupted meal. PagerDuty will tell you that an alert fired due to an unexpected decrease in signups. However, you need more information about which system the problem is coming from and who is responsible. You click on the alert and go straight to your Loggly dashboard, where you see that the alert fired at the exact same time that a deployment happened. So this is probably a real site problem. Time to get the check.


While you’re waiting, you search for your signup page logs. You see that clicks are being recorded but that calls aren’t consistently being sent to the back-end service. Later, some digging into the code shows you that the page isn’t rendering correctly in Internet Explorer browsers. You roll back the deployment, file a bug with the front-end team, and resolve the PagerDuty alert.
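The deploy-time correlation that made this call easy can be sketched as a timestamp check: did any deployment land shortly before the alert fired? Timestamps are plain epoch seconds here, and the ten-minute window is an assumed value.

```python
def deploy_near_alert(alert_ts, deploy_timestamps, window_s=600):
    """Return the most recent deploy within `window_s` before the alert, or None."""
    candidates = [d for d in deploy_timestamps
                  if 0 <= alert_ts - d <= window_s]
    return max(candidates) if candidates else None

# Alert at t=1000; a deploy at t=950 is 50 seconds earlier -> likely culprit.
suspect = deploy_near_alert(1000, [100, 950])
```

A match is strong evidence the alert is a real regression worth leaving dinner for; no match points toward an unusual traffic pattern instead.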

Loggly offers DevOps teams deeper visibility into their systems, both during initial assessment and triage and as they work to isolate and resolve their operational problems. Our powerful search and filtering, point-and-click charting, and dashboards help you make instant sense of tons of log data coming from applications, platforms, and systems. You can quickly see correlations between an alert state and other things happening on your systems, and you have access to all of the data you need to find root causes.

As a result, you can stop interrupting your day for small problems so you can focus on the big ones. And you can solve those big ones much faster.

Hang out with Loggly, New Relic and PagerDuty tonight at DataBeat’s reception and have some Data-tinis on us!


How We Added Single Sign-On to PagerDuty

Since we launched Single Sign-On, we’ve received a lot of good feedback from our customers about having one less password to remember. While SSO doesn’t impact PagerDuty’s core operations performance management feature set, planning and implementing the new feature took a lot of careful consideration.

Coming Up with A Plan

After deciding to offer single sign-on support for PagerDuty, we knew we needed something that could integrate well with our Rails application. We researched possible solutions and determined that OneLogin’s Ruby SAML Toolkit was the best fit for us. It is trusted by other companies in our space and has all the capabilities we envisioned needing.

To ensure this would work, we scrounged up a working prototype with a simple implementation: one endpoint to initiate authentication from our end and another to consume the response from identity providers.

The proof of concept worked, allowing us to start building the feature we wanted.

OneLogin’s Ruby SAML Toolkit

To make sure that using OneLogin’s Ruby SAML toolkit didn’t introduce vulnerabilities into our systems, we began with a mixture of manual and automated tests. One scenario we tested: what happens when a SAML response from an identity provider is captured and subsequently tampered with (e.g. a changed email address)? The toolkit already has checks in place for this scenario and forbids authentication.

Even with the toolkit’s robust functionality, we did make a couple of modifications to make it fit our needs and give it the PagerDuty stamp of approval. For example, we’ve modified SAML endpoints to always use SSL and made sure certificate validity periods are honored.

What about Mobile and SAML?

In order to support SSO for our mobile applications, we used SAML in combination with OAuth where our iOS and Android applications obtain an authentication token after successful SAML authentication.

Learning from Customer Previews

Before any major release, we give a few of our customers the chance to preview our upcoming feature with a test drive. This not only allows us to find bugs and dive deeper into our customers’ use cases, but (and arguably more importantly) allows us to learn to monitor our new feature effectively.

Refining Monitoring for Actionable Alerts

We monitor our Single Sign-On feature primarily using Sumo Logic. Setting up Sumo Logic’s monitoring for our SAML and OAuth endpoints during our customer preview set the stage for us to learn about what we should monitor.

In our initial setup, we learned that our alerts were not detailed enough (or actionable). We often received an alert reading simply “Invalid SAML response,” which forced our on-call engineers to spend time investigating what went wrong before they could fix the issue. As a result, we identified specific events and set thresholds to create alerts with meaningful information (e.g. SAML response invalid, XML failed validation, assertion invalid due to date/time restrictions) without revealing any sensitive data about the authentication attempt.
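One way to make log-based alerts like these actionable is to emit a structured event per failure with a specific reason code, so the log search can alert on each failure class separately. A hedged sketch of the idea (field names and reason codes are invented for illustration, and no emails or assertion contents are logged):

```ruby
require "json"
require "time"

# Emit one machine-parseable log line per SAML failure. A log tool such
# as Sumo Logic can then count and alert per reason code instead of on a
# generic "Invalid SAML response" string.
def log_saml_failure(reason, idp:)
  event = {
    event:  "saml_auth_failure",
    reason: reason,               # e.g. :signature_invalid, :assertion_expired
    idp:    idp,                  # deliberately no user or assertion data
    at:     Time.now.utc.iso8601
  }
  $stdout.puts(event.to_json)
  event
end

log_saml_failure(:assertion_expired, idp: "idp.example.com")
```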

For our team to work more effectively, we created dashboards and search queries for both the number of successful authentications and the number of failures. Depending on the error counts, we can start to determine what’s going on and whether action is required.

Our customer preview taught us a lot about authentication behavior, ensuring we knew what each event meant so we could fine-tune our alerting. If something unexpected occurs, we can act quickly and deliver the reliability our customers depend on.

Accounting for Clock Drift

Specifically, during our preview we were confronted by clock drift. We learned that authentication between PagerDuty and an identity provider can fail due to clock skew between servers: an identity provider’s server may be ahead of ours, causing authentication to fail because the timestamps in the assertion don’t line up with our server’s clock.

Luckily, OneLogin’s Ruby SAML Toolkit has a built-in setting that accounts for clock drift when validating authentication. After this discovery, and after several passes in our testing environment, we settled on an allowed drift value to account for the time differences between servers.
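The idea behind such an allowance is simple: treat an assertion as valid if "now" falls inside its NotBefore/NotOnOrAfter window widened by a small tolerance. A minimal sketch, assuming a hypothetical 60-second allowance (the value actually chosen is tuned per deployment, and the toolkit performs this check internally):

```ruby
require "time"

ALLOWED_DRIFT = 60 # seconds; illustrative, not the value we use in production

# Valid if now is inside [NotBefore, NotOnOrAfter) widened by the drift
# allowance on both sides.
def within_validity_window?(not_before, not_on_or_after, now: Time.now.utc)
  now >= (not_before - ALLOWED_DRIFT) && now < (not_on_or_after + ALLOWED_DRIFT)
end

# An IdP whose clock runs 30s ahead issues an assertion "from the future";
# without the allowance this would be rejected outright.
now        = Time.utc(2014, 6, 1, 12, 0, 0)
not_before = now + 30
within_validity_window?(not_before, not_before + 300, now: now)  # => true
```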

Setting Up SSO for your PagerDuty Account

We’ve partnered with Okta, OneLogin, and Ping Identity to make it really easy to turn on SSO for your PagerDuty account. But you can also implement SSO with any SAML 2.0-capable identity provider, including Active Directory (via ADFS, which uses Active Directory Domain Services as its identity store). We’ve also put together a guide to help you set up SSO with Google Apps.

Has this feature benefited your team? Let us know in the comments.


ChefConf 2014: How to Mock a Mockingbird

Chef is a powerful operations tool – its flexibility and automation capability make it extremely popular in organizations with extensive service deployments. Our own Ranjib Dey was invited to speak at Chef’s annual confab, ChefConf, which took place in April. Here are some of the highlights of Ranjib’s talk, “How To Mock a Mockingbird” (named after a book on mathematical philosophy).

Assume that things will break

“Design for failure” may be a more concise way of expressing this point. It’s a fundamental piece of the PagerDuty story – PagerDuty was created around the principle that crises are inevitable for all operations teams. Our role is to make handling those crises a little more pleasant.

Ranjib suggests that developers accept failure as a given. What DevOps teams can do to mitigate failure, Ranjib goes on to say, is strive to make failures as “inexpensive and isolated” as possible.

Design matters

If planning for failure is the “why” in Ranjib’s discussion (as in, “Why should I take Ranjib seriously?”), code design is the “how”. In an environment where things break, mitigation-by-design is the best defense against downtime.

What this means in practice, Ranjib says, is to minimize the code you deploy. Not only does greater code complexity make testing more difficult, but leaner code can be replaced more quickly should bugs present themselves (which, again, they inevitably will).

“Anything that is complex in theory will be difficult to test.”

Test intelligently

How you validate your codebase is linked closely to the way the code actually looks. That is, your testing processes will depend on the strength of the code’s design.

Your code should work well in happy-path scenario tests, for example. Happy-path scenarios, a.k.a. acceptance testing, measure code performance in ordinary use cases.

Code design is truly validated when your services are placed under stress – and the best way to do this is unit testing. Not only is unit testing very fast, Ranjib notes, it can capture 80 to 90 percent of errors (in an ideal scenario). That makes it a particularly smart validation strategy compared with acceptance testing, which, while comprehensive, is slow (and comes with a steep learning curve).

Further supporting the value of unit testing, Ranjib goes on to say, is the role it plays in enforcing codebase consistency. Rapid-fire unit tests help maintain conventionality in code design; for this reason, unit testing should be viewed as a foundational testing methodology.

“Unit testing is invaluable for long-term maintainability.”

Communicate for best results

We are evangelists for unit testing because it empowers devs to manage their code from end to end (with support from ops). In an environment where devs and ops work towards shared goals, collaboration is hugely important – and in order for collaboration to thrive, everyone has to communicate.

With communication a critical piece of DevOps culture, Ranjib says, breaking down knowledge silos should be made a priority. Good design is the principal element in healthy codebases (as stated in Ranjib’s first point), and design cannot flourish if teams aren’t sharing with each other what they know.

“Communication will influence design.”

Get comfortable with complexity

Illustrating the importance of communication were two of Ranjib’s slides – one depicting a typical reference code topology and the next showing a picture of a plate of spaghetti. Reference topologies, Ranjib argues, fail to capture all of the dependencies and ops services that deployments contain. Real topologies, he goes on to say, look a lot more like that mass of spaghetti.

Quite simply, modern codebases are going to be too complex to communicate simplistically. And the most fault-tolerant topologies will be scale-free – that is, contain a plurality of nodes and spokes such that no single element is “linked” dependently to another.

The idea of not knowing how everything in your codebase is connected (and being OK with it) was put forward by Netflix site reliability engineer Corey Bertram when he stopped by PagerDuty HQ in March. Corey’s talk illustrated the nature of deployments, suggesting that embracing a Zen attitude is the only logical response to their wild complexity.

“Nobody knows exactly what a fault-tolerant network architecture should look like.”
