How We Added Multi-User Alerting to PagerDuty

Since we launched our Multi-User Alerting feature last week we’ve received a lot of good feedback and have seen high adoption across the board. Multi-User Alerting was our most requested feature and we wanted to make sure we got it right while maintaining our reliability standards. We’ve made significant changes to our architecture and workflow to lay the foundation for future alerting use cases.

Coming Up With a Plan

The biggest challenge we faced was the need to restructure our alerting data model to assign incidents to multiple users and allow multiple users to acknowledge incidents. We had to do this restructuring while keeping everything running for our customers (i.e. we don’t have the concept of service downtime at PagerDuty). Our Product Team documented how they wanted PagerDuty to change and worked together with our Realtime Team and our Web Team to create a rollout schedule. Although the changes that we were making were big, we came up with a release plan so that each piece of the rollout was incremental and backwards compatible.

Keeping Track of Multi-Acks and Ack Timeouts

With Multi-User Alerting, incidents are assigned to multiple users, and can be acknowledged by multiple users. Multi-acks help teams gather information around incident responsiveness – you can find out who rises to the occasion and who sleeps through their alerts. Since incidents now can be assigned to and acknowledged by more than one user we had to add new tables to our database to keep track of that information.

Incidents are associated with services, and within each service the team can set what the incident acknowledgment timeout should be. Incident ack timeouts cause the incident to return to the Triggered state and prevent incidents from being forgotten, so services that are highly critical are configured with shorter timeouts. The timeout starts when the incident is first acknowledged, and with multiple acknowledgements it is reset every time the incident is re-acked. For example, if the incident ack timeout is set at 5 minutes and User A acks an incident at 2:00 AM and User B acks the same incident at 2:04 AM, the incident ack timeout would cause the incident to return to the Triggered state at 2:09 AM rather than 2:05 AM. This prevents Users A and B from receiving another alert just 1 minute after User B acknowledged it. We added this behavior to prevent people from being bogged down with alerts for acknowledged incidents that they are actively fixing.
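A minimal sketch of the reset behavior described above, assuming the timeout is simply measured from the most recent acknowledgement. The function name `unack_time` and the dates are hypothetical and only serve the illustration; this is not the actual PagerDuty implementation.

```python
from datetime import datetime, timedelta


def unack_time(ack_times, ack_timeout):
    """Return when the incident would go back to Triggered.

    Assumption: the timeout is counted from the latest acknowledgement,
    so every re-ack pushes the deadline out again.
    """
    if not ack_times:
        return None  # never acknowledged, so there is no ack timeout to apply
    return max(ack_times) + ack_timeout


# User A acks at 2:00 AM, User B acks at 2:04 AM, timeout is 5 minutes.
acks = [datetime(2014, 4, 1, 2, 0), datetime(2014, 4, 1, 2, 4)]
deadline = unack_time(acks, timedelta(minutes=5))
print(deadline.strftime("%H:%M"))  # 02:09
```

With only User A's ack in the list, the same call would return 2:05 AM, which is the behavior the re-ack is meant to avoid.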

Incident Snapshots for Minute Zero Alert Guarantee

The PagerDuty alerting pipeline was built to send alerts that contain incident information that is as up-to-date as possible. If a user has only one notification rule with a 1 minute delay and an incident is resolved 30 seconds after it is triggered, we won’t send the user an alert because the incident has already been resolved. One consequence of Multi-User Alerting is that it introduced the possibility that one user on an escalation policy could acknowledge/resolve/re-assign an incident before another user was notified about it.

To deal with this wrinkle we modified our pipeline to implement special “minute zero” behavior. If a user has a notification rule set up with a zero-minute delay, we infer that they want to be notified about an incident in all cases. Even if the incident gets modified in the short period between when it was triggered and when the notification goes out, the notification will still go out. For example, if User A and User B are on the same escalation level with an immediate notification rule, and User A acknowledges an incident immediately after it was triggered, it will not prevent an alert from going out to User B.
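The decision can be sketched as a small predicate. The names `NotificationRule` and `should_notify` are hypothetical, and the sketch assumes that a delayed rule only fires while the incident is still in the Triggered state; the real pipeline is certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class NotificationRule:
    delay_minutes: int  # 0 means "notify immediately"


def should_notify(rule, incident_status_at_send):
    """Decide whether a queued notification should still go out."""
    if rule.delay_minutes == 0:
        # "Minute zero" guarantee: always notify, even if someone else
        # already acknowledged or resolved the incident.
        return True
    # Assumption: delayed rules use the latest incident state and are
    # skipped once the incident is no longer triggered.
    return incident_status_at_send == "triggered"


print(should_notify(NotificationRule(0), "acknowledged"))  # True
print(should_notify(NotificationRule(1), "resolved"))      # False
```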

Time Between Escalation Levels

One workaround that PagerDuty customers were using before Multi-User Alerting was to add escalation rules to an escalation policy that were separated by a very low escalation delay (e.g. 1 minute) – effectively alerting a bunch of users within a short time period. With Multi-User Alerting, this workaround is no longer required, and we encourage any customers who were using this workaround to reconfigure their escalation policies: you can now add up to 10 escalation targets (users or schedules) to an escalation level.

Multi-User Alerting has expanded alerting scenarios for our customers so they can reach the right people who need to know at the right time. Are there alerting scenarios you want to address but can’t? Let us know in the comments.


Posted in Features

Increasing Quality and Reliability with Continuous Integration

Continuous integration (CI) is a software development practice where team members frequently merge their work to decrease problems and conflicts. Each push is supported by an automated build (and test) to detect errors. By checking in with one another frequently, teams can develop software more quickly and reliably. In essence, CI is about verifying the quality of code to ensure no bugs are introduced into the production environment. If bugs are found in testing, the source is easily discovered and fixed. By testing the code after every commit, you can shrink the investigation around the source of the bug to a shorter time period. But manually testing the code is a pain and redundant. Many tests can be reused, so we have created multiple automated tests to make it easier to test frequently. Additionally, since these tests are iterative, once a bug is found we’ll create a test that looks for it in future code reviews so old bugs are never reintroduced.

Before an Automated Build

At PagerDuty, after we decide on what we need to build, a JIRA ticket is created to make it easy to collaborate and to keep members updated on the status. Inside the ticket we include information about what this feature or fix will do and what the known impacts are. Then we create local branches from our Git repo for the feature we want to develop or the issue/bug we want to fix, and give each branch the same name as the JIRA ticket. Git is a Distributed Version Control System (DVCS), so there is no single source repository where we take the code from; instead there are multiple working copies. This avoids the single point of failure of traditional single-source repositories, which rely on one physical machine. We’re all about redundancy here at PagerDuty (multiple databases, multiple hosting providers, multiple contact providers for multiple contact methods, etc.), so having a DVCS makes it easier for us to develop locally even when there are issues. Bazaar and Mercurial are a few other DVCSs you may want to check out.

Write Tests First

Although it would be nice to have automated tests for everything we build, it takes time to build them. Our tests are created before the code is written so we can use them to govern our designs and to avoid writing code that is hard to test. This test-driven development (TDD) improves software design and helps us maintain the code more easily. We prioritize test criteria in the order below since they have the greatest impact on reliability and resources.

1. Security – Critical bugs that block our workflows fall in this category. If the fix mutates any business critical code path we want to make sure we have everything tested.

2. Strategic – Large scale code rearrangement, adding new features. These tests tend to add corresponding specifications in our test suite. This addresses both happy path scenarios as well as any known regressions. For example, adding different types of services/micro services (a new persistent store) or a new tool (that automates a repetitive long running manual work).

3. Consistency – As a growing team, we need to make sure that the code we build is easy for someone new to understand and build on top of. This exercise is an established best practice for code quality, error handling and identifying performance issues. Anyone who knows Chef should be able to understand our code base. For example, we isolate our customizations, capture them as separate patches/libraries, and then send those patches to upstream projects. In these scenarios we write specs for the integration layer (i.e. the glue that ties our extensions to external libraries, like community cookbooks, gems, tools etc.).

4. Shared Knowledge – Every piece of functionality has a valid domain and certain domain assumptions. We use tests to establish what those domains are so we know the boundaries of a feature. These are very specific to our own infrastructure, its dependencies and overall topology. An example is how we generate search-driven dynamic configuration files for different services (for instance, we always sort search results before consuming them). We write tests to validate and enforce these assumptions, which are also relied on by downstream tool chains (like naming conventions across servers, environments etc.).

Our Testing Suite

The tests we write fall into 5 categories. All code built must pass tests in the order below, except for load tests, before it is deployed to maintain quality and reliability.

Semantic tests: We use lint checks for the overall semantics of the code and common best practices: RuboCop for Ruby linting and Foodcritic for Chef-specific linting. These are code-aware tools, so depending on the language you write in, these tools may or may not work for you. Lint tools are applied globally after every commit, and we don’t have to write any additional code for this.

There have been several occasions when lint tests caught actual bugs in addition to pointing out styling errors. For example, Foodcritic can detect Chef resources that do not send notifications when updated.

Unit tests: We write unit tests for almost every bit of code. If we are developing Chef recipes, ChefSpec tests are written first. If we are writing raw Ruby libraries, RSpec tests are written first. Lint and unit tests don’t look for functionality; they test whether the code has good or bad design.

Good design makes it easy for other members to pick up the code and understand it quickly. Additionally, these tests show how easy it is to decouple the code. Technology is ever changing and the code has to be flexible. If Ubuntu or nginx releases a patch for security reasons, how easy is it to accept that change?

Functional tests: These tests are intended to verify a feature or piece of functionality as a whole, without any implementation knowledge and without mocking or stubbing any subcomponent. We also strive to make the functional specs as human readable as possible, in simple English, without any programming-language-specific constructs.

These tests help out with:

  • new server provisioning

  • existing server teardown

  • an entire cluster provisioning

  • whether a sequence of operation works or not

We use Cucumber and Aruba to write functional tests. These tests are not concerned with how the code is written, only with whether it works. Cucumber is a BDD tool that allows specifications to be written in a readable fashion (using Gherkin), while Aruba is a Cucumber extension that allows testing command line applications. Since a vast majority of our tools provide a command line interface (CLI), we find these testing tools very handy and easy to use.

Integration tests: These tests make sure everything is working when combined with all other services within a production-like topology against a production-like traffic pattern. This also helps us answer whether our system automation suite will work alongside different services, and against every change made in them or in other third-party services that we consume.

Load tests: These help us determine what scale of traffic we can handle and quickly identify the main performance bottlenecks. We do a series of setup tasks to ensure we have production-like data volume. Generally these tests are time consuming and resource intensive, hence they are performed periodically against a set of code changes (batching). Code changes where we feel performance is not a concern (config changes, UI tweaks) at times bypass these tests.

Automating Deploy & Sanity Checking

After the code passes all the tests, we hand it off to another team member for a sanity check before releasing it. We do a manual code review to get a second opinion and ensure that bugs aren’t introduced into production. The peer code review helps to ensure that no requirements were missed and that the code meets design standards.

We follow a semi-automatic deployment process, where CI assists in testing and project-specific tools (like Capistrano and Chef) assist in deployment, but the actual deployment is triggered manually. The deployment tool itself sends a message to the PagerDuty HipChat room to let everyone know when something is being deployed. It sends both pre-deployment and post-deployment notifications (as lock and unlock service messages). This helps us understand what is being deployed and avoid concurrent deployments.
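The lock and unlock messages can be pictured with a small sketch. Everything here is hypothetical: `DeployLock` is an in-memory stand-in, and `notify` is any callable that would post to a chat room. The source does not describe how the real deployment tool is built, so treat this only as an illustration of the idea.

```python
from contextlib import contextmanager


class DeployLock:
    """In-memory stand-in for a per-service deployment lock."""

    def __init__(self, notify):
        self._holders = {}     # service name -> user currently deploying
        self._notify = notify  # callable that posts a chat message

    @contextmanager
    def deploying(self, service, user):
        if service in self._holders:
            # Refuse concurrent deployments of the same service.
            holder = self._holders[service]
            raise RuntimeError(f"{service} is already being deployed by {holder}")
        self._holders[service] = user
        self._notify(f"lock {service}: {user} is deploying")  # pre-deployment
        try:
            yield
        finally:
            del self._holders[service]
            self._notify(f"unlock {service}: {user} finished deploying")  # post-deployment


messages = []
lock = DeployLock(messages.append)
with lock.deploying("web", "alice"):
    pass  # the manually triggered deployment would run here
print(messages)
```

A second `lock.deploying("web", ...)` entered while the first is still active raises `RuntimeError`, which is the concurrent-deployment case the notifications are meant to prevent.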

With Continuous Integration we create a baseline quality of software that must be met and maintained which lowers the risk around our releases.

Posted in Best Practices

Outage Post-Mortem – March 25, 2014

On March 25th, PagerDuty suffered intermittent service degradation over a three hour span, which affected our customers in a variety of ways. During the service degradation, PagerDuty was unable to accept 2.5% of attempts to send events to our integrations endpoints and 11.0% of notifications experienced a delay in delivery – instead of arriving within five minutes of the triggering event, they were sent up to twenty-five minutes after the triggering event.

We take reliability seriously; an outage of this magnitude, as well as the impact it causes to our customers, is unacceptable. We apologize to all customers that were affected and are working to ensure the underlying causes never affect PagerDuty customers again.

What Happened?

Much of the PagerDuty notifications pipeline is built around Cassandra, a distributed NoSQL datastore. We use Cassandra for its excellent durability and availability characteristics, and it works extremely well for us. As we have moved more of the notifications pipeline over to use Cassandra the workload being applied to our Cassandra cluster has increased, including both steady-state load and a variety of bursty batch-style scheduled jobs.

On March 25, the Cassandra cluster was subjected to a higher than typical workload from several separate back-end services, but was still within capacity. However, some scheduled jobs then applied significant bursty workloads against the Cassandra cluster, which put multiple Cassandra cluster nodes into an overload state. The overloaded Cassandra nodes reacted by canceling various queued up requests, resulting in internal clients experiencing processing failures.

Request failures are not unexpected; many of our internal clients have retry-upon-fail logic to power through transient failures. However, these retries were counterproductive in the face of Cassandra overload: many of the cancelled requests were immediately retried, which extended the overload period longer than necessary until the retries subsided over time.

In summary, significant fluctuations in our Cassandra workload surpassed the cluster’s processing capacity, and failures occurred as a result. In addition, client retry logic resulted in the workload taking much longer to dissipate, extending the service interruption period.

What We Are Doing About This

Even with excellent monitoring and alerting in place, bursty workloads are dangerous: by the time their impact can be measured, the damage may already be done. Instead, an overall workload that has low variability should be the goal. With that in mind, we have re-balanced our scheduled jobs so that they are temporally distributed to minimize their overlap. In addition, we are flattening the intensity of our scheduled jobs so that each has a much more consistent and predictable load, albeit applied over a longer period of time.
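One simple way to picture temporal distribution is to give each scheduled job an evenly spaced start offset inside a shared window. This is a sketch under that assumption; the function and job names are hypothetical and the post does not say how the jobs were actually rescheduled.

```python
def stagger_offsets(job_names, window_seconds):
    """Assign each job an evenly spaced start offset (in seconds) within the window."""
    if not job_names:
        return {}
    step = window_seconds / len(job_names)
    return {name: int(i * step) for i, name in enumerate(job_names)}


# Four hypothetical jobs spread across one hour instead of all starting at :00.
offsets = stagger_offsets(["rollup", "cleanup", "report", "reindex"], 3600)
print(offsets)  # {'rollup': 0, 'cleanup': 900, 'report': 1800, 'reindex': 2700}
```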

Also, although our datasets are logically separated already, having a single shared Cassandra cluster for the entire notifications pipeline is still problematic. In addition to the combined workload from multiple systems being hard to model and accurately predict, it also means that when overload occurs it can impact multiple systems. To reduce this overload ripple effect, we will be isolating related systems to use separate Cassandra clusters, eliminating the ability for systems to interfere with each other via Cassandra.

Our failure detection and retry policies also need rebalancing, so that they better take into account overload scenarios and permit back-off and load dissipation.
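A common way to let retries back off is exponential backoff with jitter, sketched below. This is a general technique, not a description of the retry policy PagerDuty adopted; `retry_with_backoff` and its parameters are hypothetical, and a real client would catch only its specific timeout or overload errors rather than every exception.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(), retrying failures with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": wait a random fraction of the ceiling so that
            # many clients do not all retry at the same instant.
            sleep(rng() * ceiling)


calls = {"count": 0}


def flaky():
    """Fails twice, then succeeds (stands in for an overloaded datastore)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("overloaded")
    return "ok"


print(retry_with_backoff(flaky, sleep=lambda seconds: None))  # ok
```

Compared with immediate retries, the growing and randomized waits give an overloaded cluster room for its queue to drain.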

Finally, we need to extend our failure testing regime to include overload scenarios, both within our Failure Friday sessions and beyond.

We take every customer-affecting outage seriously, and will be taking the above steps (and several more) to make PagerDuty even more reliable. If you have any questions or comments, please let us know.

Posted in Availability

PagerDuty and | Conference Calls, Heartbeat Monitoring & More

PagerDuty sends out webhooks when different events happen on an incident. A webhook is a custom HTTP request that can contain more or less anything you want, posted to an address you specify. It’s a great way to get applications talking to each other without the complexity of a full-blown API. We currently use webhooks to support integrations with HipChat, Slack, Zapier and other tools, and we frequently see customers using this feature to build their own custom integrations.

Dave Hayes, product manager at PagerDuty, used webhooks in a previous hackday to create an animated map of PagerDuty incidents. He used Firebase to handle PagerDuty’s incoming webhooks, but I wanted something even simpler, and found lets you choose a URL that listens for incoming webhooks, then executes a script on whatever comes in. It’s awesome for translating between different types of input and output, kind of like a coder’s version of Zapier. As a bonus, it can also run scheduled scripts (cron jobs) at specified intervals.

For my project, I wanted to use webscripts to automate incident conference calling.

When a high-severity incident comes in, the first thing you want to do is get some eyes on the problem. You may want a backup or manager on the line if you’re a junior responder, or for severe problems, you may want the on-calls from several different teams.

In the heat of the moment, you don’t want to be fussing with pasting a conference call number or URL into a chat room; your team should instantly have the information they need to start collaborating. I wanted to figure out an automated way to add a conference call to a PagerDuty incident the instant someone acks it.

I was originally going to try using the Twilio API for this project, but I read about VoiceChatAPI (a Plivo project), which makes creating a new conference call ultra-simple. You don’t even need an API key!

Here’s how I set up my webscript.


And here’s how to do it:

1. Go to, create a new webscript and give it an address

2. Paste in the webscript from here (or this one if you don’t want/need the hipchat integration).

3. Create a new PagerDuty service, and add a webhook pointing at the webscript address.
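To make the flow concrete, here is a rough sketch of the kind of logic such a script needs: pick the acknowledged incidents out of an incoming webhook body and build a note carrying the conference details. The payload shape (`messages`, `type`, `data.incident.id`), the event name `incident.acknowledge`, and the example URL are all assumptions made for illustration; check the webhook documentation for the real format. The original script linked above is not reproduced here.

```python
import json


def acked_incident_ids(raw_body):
    """Return IDs of incidents acknowledged in a webhook body (assumed shape)."""
    payload = json.loads(raw_body)
    ids = []
    for message in payload.get("messages", []):
        # Assumed event name for an acknowledgement.
        if message.get("type") == "incident.acknowledge":
            ids.append(message["data"]["incident"]["id"])
    return ids


def conference_note(incident_id, conference_url):
    """Build the text that would be attached to the incident or posted to chat."""
    return f"Incident {incident_id} acknowledged. Join the call: {conference_url}"


# Hypothetical webhook body with one trigger and one acknowledgement.
body = json.dumps({"messages": [
    {"type": "incident.trigger", "data": {"incident": {"id": "P1"}}},
    {"type": "incident.acknowledge", "data": {"incident": {"id": "P2"}}},
]})
for incident_id in acked_incident_ids(body):
    print(conference_note(incident_id, "https://conference.example/room-123"))
```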

Besides conference calls, webscripts can do a bunch of other useful things for PagerDuty customers:

You can find script examples and additional tips for using webscripts in the pagerduty-webscripts repo on GitHub.

Thanks to Steve Marx (@smarx) and DH (@dhfromkorea) for their help in putting together this project, as well as the teams at and Plivo.

Find these scripts useful? Have your own way to get PagerDuty and other apps talking to each other? Leave a comment, drop us a line on Twitter or submit a pull request.


Posted in Community, Hack Day, Integrations

Alert All the Right People with Multi-User Alerting

PagerDuty escalation policies just reached the next level (pun intended). You can now add up to 10 team members to each level of your policies to notify more people about the incidents they care about. Make sure your team’s issues are responded to quickly and resolved in a flash by getting all the right people notified and working together.

Alert Multiple Responders at Once

In a high severity incident, you want to immediately get all hands on deck. With PagerDuty’s Multi-User Alerting, you can alert up to 10 responders in each level of your escalation policy.

Critical incident? Alert the primary, secondary and tertiary at the same time to get someone on it faster. Slower responders can join the resolution in progress.

Or in the rare case that your primary and secondary on-call miss their alert, you can use Multi-User Alerting to get the rest of the team responding to incidents to ensure nothing ever falls through the cracks.


When you hire new team members it’s imperative to get them up to speed quickly so they can start contributing to your team. With Multi-User Alerting you can have a trainee share on-call responsibility with a seasoned engineer to tag team incidents in order to get them into the swing of things, fast.

Keep Followers and Interested Parties Informed in Real-Time

With Multi-User Alerting it’s easy for managers or interested parties to stay in the loop about what’s going on.

Non-responders can add themselves to escalation policies to stay informed — or if they can handle the issue themselves, acknowledge and start working on it.

As a manager you can put yourself on any level of an escalation policy to be notified along with your team when they are receiving alerts, and immediately know when an incident gets escalated.

Notify Multiple Teams for Incidents

Critical incidents should often wake up people from multiple teams: is this a web or an ops issue? You now have the ability to wake up someone from each team to tackle incidents together.

Try it now!

Multi-User Alerting is already available in your account, just edit any of your existing escalation policies or create a new one.  No more 1-minute escalations, no more workarounds needed. If you have any questions, shoot us an email at

Please remember to wake your team up responsibly. Unless it’s truly a sev-1, let the primary responder do their job — you hired good people for a reason. Not every incident needs to wake up 10 people.

Posted in Announcements, Features

How Our Customers Scale On-Call with PagerDuty

“With PagerDuty we have been able to consolidate our alert stream”
- Chris Peters, Operations Lead at Expensify


“I can’t imagine life without PagerDuty. Having multiple alerting methods and escalations are no-brainers.”
- Brian Sensale, Senior Engineering Manager at Brightcove

Three years ago Brightcove took their first steps to transition to a DevOps organization model by getting each of their developers on-call to create a sense of ownership over the design, production and support phases of their code. At first, they used a rotational Blackberry synced to their monitoring tools, but this process was cumbersome, error-prone and limited participation. To fully transition towards a DevOps model, they needed the technology to accompany their cultural shift. Brightcove adopted PagerDuty to bring order to their on-call scheduling chaos.

Read Their Story


“PagerDuty is the go-to tool for anything that requires alerting.”
- Stefan Zier, Lead Architect at Sumo Logic

When a new developer joins Sumo Logic, it’s important to get them acclimated to their IT environment as quickly as possible. Not all developers hired at Sumo Logic have come from a DevOps environment, nor has every developer been on-call before. Sumo Logic uses PagerDuty to quickly on-board their new hires to get them up to speed and resolving incidents quickly.

Read Their Story


“As we scaled, we staffed to have someone always on-call. PagerDuty made this much easier for us.”
- Justin Slattery, Director of Engineering at MLS Soccer

To avoid burnout, MLS Digital turned to PagerDuty to manage their on-call schedules. They have set up rotating on-call schedules so everyone doesn’t have to be on-call all the time. They can hang out with friends and family without bringing their laptop along and have a stronger team spirit knowing the on-call responsibility is fairly shared, which has been imperative as they scale their business.

Read Their Story


“PagerDuty has blossomed within Ping Identity. It has become a core piece of our infrastructure.”
- Beau Christensen, Manager of Infrastructure Operations at Ping Identity

PagerDuty was initially adopted within the infrastructure team and has expanded to support, developers and help desk teams. PagerDuty has seamlessly integrated into Ping Identity’s culture and has made it easy for cross-functional teams to collaborate on incidents. Easier collaboration combined with more effective alerting has decreased incident repair time by 100%.

Read Their Story


“PagerDuty has changed the way we run our business”
– Will Maier, Director of Operations at Simple

Simple requires that employees across the company, regardless of team or role, use the same tools. The Simple team is in constant communication over IRC. Non-engineers regularly use tools traditionally considered tech team staples. For example, their PR team uses PagerDuty alerts to keep an eye on outages or issues so that they can respond to media and customers as other teams are working on solutions.

Read Their Story

Posted in Community, Customer

Foundations of a Successful DevOps Team

This is the final post of our series about transitioning to a DevOps culture (for now). To start from the beginning, check out Why You Should Establish a DevOps Culture.

When we talk about DevOps we often default to a conversation around collaboration and culture. But some of the most important aspects of taking a DevOps approach in your organization are the tools and procedures that drive this movement. Even more important is how your tools and culture relate to each other to create a foundation of self-service, prioritization and people.

Developer Self-Service with Infrastructure Automation

In a DevOps model, developers are empowered to do a lot on their own where previously they would have to rely on an operations team. To effectively transform into a DevOps culture, operations engineers help developers by creating tools that arm developers to solve problems on their own.

These self-service tools maintain consistency within your products’ performance, functional and physical attributes. They also keep your requirements, design and operational information in check throughout your product’s life cycle. Tools like Chef (which we use at PagerDuty) allow us to treat our infrastructure as code and enforce consistency across our test, development and production environments via version control, automated testing and peer reviews.

By providing developers with the tools they need to put out their own fires, more time can be spent improving your product, services and processes.

Establishing Prioritization for What You Monitor

Tracking what customers care about will help you prioritize your monitoring metrics. At PagerDuty, we prioritize our ability to accept events and send alerts to our customers above all other systems and processes. We take reliability seriously and know how much our customers depend on us to receive alerts. If they don’t get alerts they may not recognize an error in their systems, which may cause extended outages. Because we monitor and prioritize metrics customers care about, there is a sense of urgency whenever one of our monitoring tools alerts us of an incident.

Our engineers won’t delay fixing an alerting issue because they’re in the process of attending to a non-customer-facing incident. By establishing priority we can quickly shift gears to make sure our service is available to you at all times.

We recommend that you figure out which parts of your product and services matter most to your customers and focus your monitoring and alerting on these areas. You will find that your customer satisfaction rate increases when there is less disruption to the services your customers care about most.

Connect People to Each Other and Your Systems

Being empowered and working together is necessary for the adoption of a DevOps culture. Tools are meant to aid and reinforce our relationships with each other. For example, GitHub allows your team not only to store code, but also to collaborate, share a central source of knowledge and rely on version control in case something goes wrong and needs to be rolled back.

At PagerDuty, we strive to be the interconnecting layer between your tools and people so you can respond to incidents faster and reduce your mean time to repair. We hope that our service creates a sense of accountability for your team members and allows teams to work together for a unified cause (i.e. fixing a critical incident). You will not have a successful business or working product without people.

For people to effectively impact your business, they need to have the ability to work when they are needed. Getting people on-call to keep them aware of the systems they are accountable for is an effective way to start.

“Getting devs on-call means we are targeting the appropriate teams with actionable alerts.” – Eduardo Saito, Director of Engineering at GREE

Those who haven’t been on-call before may be resistant, but they will soon see the value of being on-call as everyone aligns their pains with the pains of their customers.

We’ve spoken quite a bit lately on how to transition to a DevOps model in your organization. If you are just starting out remember that a culture of collaboration is key. You cannot operate a DevOps model with siloed departments. This may be the biggest hurdle for your company. But if you focus your energy on empowering your team through self-service, using tools that connect people to their systems and identifying a prioritization among your common goals you will be well on your way.

Update 4/10/14 – Continue reading the transitioning to DevOps series:

Posted in Best Practices, Community

PagerDuty Partners with San Francisco & New York to Deliver Parking Meter Alerts, Ticket Automation

Finding street parking in San Francisco and New York is a nightmare. Then when you find the holy grail of parking spaces you need to pay by the minute. But you never know how long you need so you either put in too much money or not enough. But what if there were a way to only pay what we needed without risking coming back to a ticket on our windshields?

At PagerDuty, we’ve learned that people use PagerDuty to get alerted about all sorts of things. Our friends at Sumo Logic have even used us to alert them when their office AC unit filled with water. So after a few team members complained about tickets outside our San Francisco office, we proposed a mutually beneficial solution to the San Francisco Municipal Transportation Agency (SFMTA). Our program will reward proactive drivers while allowing the SFMTA to generate revenue from expired parking spaces.

Starting June 1, 2014, a small pilot program will launch in SOMA before expanding to the rest of San Francisco and then New York in the fall. You will be able to sync your smartphone with your parking meter to receive alerts via voice, SMS, email or push notification, 10, 5 or 1 minutes before your time runs out. Once you acknowledge, you will have the ability to add time to your meter directly from your phone. In addition, SFMTA agreed to provide 2 minutes of leniency for the first acknowledgement.

To generate revenue, SFMTA will be able to ticket your vehicle virtually once your time has expired, sending tickets directly to your smartphone without having to visit your vehicle. Integrating their ticketing system with our technology effectively automates their process.

We encourage everyone to download the new ParkingDuty app, which will be available for iOS and Android on May 25, 2014, to prepare for the June 1, 2014 pilot launch. Anyone who parks without using the app will not be able to add time via their mobile device. In these cases, SFMTA will receive alerts immediately when a parking meter’s time runs out, and a small mounted camera will snap a photo when an alert triggers so that tickets can be mailed to your vehicle’s registered address.

Next time you park on the street in San Francisco or New York, just look for the PagerDuty logo on your meter. By the way, happy April Fools’ Day. Get pranking.


Posted in Announcements

Scaling & Eliminating Waste in Your DevOps Environment

This is the third post in a series to help your engineering team transition into a DevOps organizational model. Here we’ll discuss how to scale and eliminate waste from your engineering environment. Click here to start at the beginning of the series, Why You Need to Establish a DevOps Culture.

Eliminating waste from your processes relies heavily on understanding what your operations people need and what your developers need. Having this granular understanding will help close the gaps between these two teams and make them one. As every team differs, we can’t tell you exactly what these needs are. But we can help you form the strategy to tackle these challenges within your culture.

Find Out What Everyone Really Wants

You will need to establish a shared vision between your developers and operations team that aligns with your company’s values and mission. To do this you will want to open up a dialogue between key stakeholders on each team. You will not want to make assumptions about what the other team needs or generalize their pain. If you do, you’ll misinterpret each other’s organizational needs.

Once you get the conversation going you’ll be able to find some common ground. Instead of focusing on conflicting interests, identify your common goal and start to align teams for a singular purpose. Once you set up a plan to reach a common goal, you’ll most likely find that you don’t have many, if any, conflicting interests. After all, you all work for the same company with a shared mission and values.

Use the Scientific Method

We’ve all learned about the scientific method in grade school. But its core principles are easily forgotten when your deadlines are looming.

Plan. Transitioning to a DevOps model is a high level plan. In order to scale and eliminate waste during your transition you will need to focus your efforts on a few areas at a time. Common areas that may be causing problems in your organization are communication, tooling, expertise or team skepticism. You will want to identify which of these common problems are negatively impacting your team and start thinking about how to fix them. It would be beneficial to grab some initial measurements that you will be able to compare later. Even if you don’t have systems in place for measurements now, any ad hoc measurements you can take will be helpful to compare to later.

Do. Now it’s time to start doing stuff. Take what you have hypothesized during the planning phase and take a leap. For example, if you hypothesized that using continuous integration in your development and production environments will decrease the time it takes to deploy new code, then it is time to implement your continuous integration system. Or if you learned in your initial assessment that your team is lacking specific skills, you may need to hire (or contract) people with the expertise required.

Measure Results. After you’ve implemented your new tool you will need to see if your changes made a difference. Is your development team still having issues while deploying new code? If so, have there been fewer issues, or is everything the same?

Respond. Using your metrics you can make a decision to act. If your plan didn’t work as you thought it would, use the data you collected to adjust your hypothesis and try something else. Or amend what you have already implemented to address any new, unexpected challenges.

Continuously Improve. Finally, you will want to repeat this over and over for everything in your organization that needs to be addressed to create a DevOps culture. It won’t be quick or easy, and you may not even see results right away, but it will be worth it.
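
To make the Measure Results and Respond steps concrete, here is a minimal Python sketch of comparing ad hoc measurements taken before and after a change. Everything in it is an assumption made for illustration: the function names, the 10% threshold, and the sample numbers (failed deploys per week) are hypothetical and not taken from any real team or tool.

```python
from statistics import mean


def relative_change(baseline, after):
    """Fractional change in the mean from baseline to after (negative means a decrease)."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(after) - base) / base


def respond(baseline, after, min_improvement=0.10):
    """Suggest a next step for a metric where lower is better.

    min_improvement is a hypothetical threshold: the change has to move the
    metric by at least this fraction before we call it a real difference.
    """
    change = relative_change(baseline, after)
    if change <= -min_improvement:
        return "keep"      # the hypothesis held, keep the change
    if change >= min_improvement:
        return "revert"    # things got worse, try something else
    return "adjust"        # no clear difference, refine the hypothesis


# Hypothetical ad hoc measurements: failed deploys per week.
baseline_failures = [6, 5, 7, 6]   # before introducing continuous integration
after_failures = [3, 4, 2, 3]      # after introducing continuous integration

print(respond(baseline_failures, after_failures))
```

Even rough numbers like these give you something to compare against on the next pass through the loop. The threshold is a judgment call, so pick one that matches how noisy your own measurements are.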

Hire & Train with Your DevOps Culture in Mind

Finding the right people with the right mindset is crucial to scaling your DevOps culture. You will need to get your Human Resources team and your recruiters up to speed on the type of people you are looking for. You will want to hire people who thrive in or are excited to join a team with a DevOps philosophy. This means looking for people with both the specific skills you need for your team and the cultural values you are looking for. Though they may be tough to find, they’re out there.

When your new hires join your team, it’s imperative that you train and mold them to be valuable players on your team and to work towards the common goal of achieving a DevOps culture. At PagerDuty, we inject failure into our systems every Friday. Not only is this a great way to test the resilience of our systems, but Failure Friday gets our dev and ops teams in a room together, allowing them to cross-train and gain a complete understanding of our processes beyond their designated roles.

And we’re not alone. We recently spoke with our friends at Sumo Logic about how they use training to ramp up their new developers and acclimate them to their culture, as many of them have not come from a DevOps environment and are not used to being on call.

Sumo Logic pairs new developers with seasoned employees to share on-call responsibility, which helps blur the lines between the traditional developer and operations engineering roles. They use immersive shadowing during business hours until the developer is up to speed on both their system and their way of life.

It’s imperative that you build your DevOps culture into your on-boarding process as you scale because people are resistant to change. Not everyone is cut out for this. It’s stressful and it takes people out of their comfort zones. But if you are proactive in your approach and mold people early on in your organization, you will be able to evolve your engineering culture into something to brag about.

Assign a DevOps Champion

Making the decision to enter into a cultural transition is only the first step. Committing to it is an entirely different and aggressive beast. To keep everyone on track you may want to assign a DevOps Champion to answer any questions and help make the transition as easy as possible.

Your appointed champion should make themselves available so anyone can ask questions or raise concerns. Your DevOps Champion needs to personify your company’s goals and be an aid to everyone involved. They need to be approachable, a good listener and someone who wants to make everyone feel valued and included. Champions are good communicators who can explain the benefits for internal teams of all sizes and the company at large.

Your champion should encourage pride and a sense of ownership in all employees, helping to empower them. They should also implement a “no asshole rule” to help alleviate tension during a transition and reduce any negativity that may come from disgruntled employees. It also helps if your champion isn’t an asshole themselves.

You can’t half-ass a transition to a DevOps culture; you need to go all in to make a difference. Once you get everyone on board, many of your challenges will just be getting started. Keep an open mind, breathe and remember to keep everyone’s needs at the forefront of your mind while solving issues logically and for the future.

Lastly, keep in mind that even for highly performing teams this won’t happen overnight. For organizations with over a thousand employees it may take years, with one hundred or more it may take a year, and even for teams with as few as 10 employees it can take several months to implement. Just remember it will be worth it.

Update 4/10/14 – Continue reading the Transitioning to DevOps series:

Posted in Best Practices

Breaking Down Silos Doesn’t Happen Overnight

This is the second post in a series to help your engineering team transition into a DevOps organizational model. Here we’ll discuss how to start breaking down silos in your organization. Click here to start at the beginning of the series, Why You Need to Establish a DevOps Culture.

Introducing a DevOps culture to your company and breaking down silos isn’t like making other changes in your organization. You don’t submit a proposal and neatly weigh the pros and cons. If you do, you will probably meet a lot of resistance. Obtaining buy-in from business stakeholders can feel nearly impossible. But this doesn’t mean you shouldn’t advocate for change; you may just need to be stealthy with your approach.

You’ll need lots of patience if you commit yourself to changing your peers’ mindsets. To do so you will need to start with forming strong relationships and teaching concepts before introducing them to the tools needed for testing and configuration management.

Start with What You’ve Got

People are naturally resistant to change. To business leaders, change sounds costly. To others, change feels disruptive. Your best bet is to nurture your company’s pre-existing culture. You will need to feed your company’s own values back to them and determine what everyone really cares about.

Start by assessing your culture. Does your company focus on the customer or is it really all about the money? Are you laid back or do you have a regular ol’ corporate environment? Do your office parties have red solo cups or crystal champagne flutes? Are your offsites playing paintball or hitting a round at the country club? It’s important that you take a step back and figure out how to transition to a DevOps model with your current culture. Change doesn’t happen overnight, and expecting your company to rapidly do a complete 180 will be futile.

After you have assessed your culture, start reinforcing the parts your teammates like and the parts that align with your ideal DevOps culture. Next, get everyone participating in your culture. Hold events for cross-functional teams to participate in. Go bowling together, grab a beer, mix up teams and head to a paintball course. Or simply invite them to a larger brainstorming session just to hear their ideas and build relationships. Working together is the first step toward shared accountability, so you can succeed, and sometimes fail, together.

“Beer is the most powerful tool in the DevOps environment. Take everyone down to the bar and just let them talk.” – Ben Rockwood, Joyent [Source: Keynote Address at the 25th Large Installation System Administration Conference (LISA ‘11)]

Inspire Others by Valuing Their Opinion and Including Them

Speaking of getting cross-functional teams together to brainstorm, inspiring and including others will do wonders to break down silos in your organization. In a siloed culture, it’s easy to feel like you’re not included. This leads to thinking that your ideas don’t matter and that you can’t change anything, and eventually it may lead to apathy.

Start by inviting both developers and operations engineers to conversations about what is required for a project. Don’t delegate tasks, but instead let people play an active role in figuring out the best way to accomplish a goal. Having input and playing a role in the decision making for a project will create a sense of ownership and pride in one’s work. This will give you the ability to recognize people for their contribution to further reinforce this sentiment.

Beginning with a single project will pave the way for your company’s entire engineering culture. This allows people to see projects from their initial conception all the way through to their completion. This will not only make people feel pride in their work, but start breaking down the “Us vs. Them” sentiment that often runs through development and operations teams. It’s about succeeding together, not pointing fingers when something unexpected happens.

It only takes a question or a meeting invite to spark a conversation that makes people feel engaged and gives them a sense of responsibility for a project’s success. You will not only find solutions for your single project, but also start to identify processes and tactics for working together in the future.

Show, Then Tell Concepts

You will want to pursue a show, then tell approach when teaching concepts. By showing, you lead by example and demonstrate proof of concept. It’s important to be genuinely interested in helping your colleagues improve their work by adding to their skill set.

For example, if you are on your company’s operations team and want to help a developer be more productive, you can show, then tell him or her how to run a command. If they learn from your example rather than just hearing an explanation, they will be more likely to replicate these actions in the future.

You will also want to avoid words that may have a negative connotation, have a vague meaning, be associated with a fad or be considered an empty buzzword, such as “DevOps.” If you really want to have a DevOps culture you will need to live it to give it meaning, not just talk about it.

When you need to use words to reinforce what you have shown them, focus on the benefits of collaboration and of improving overall service delivery, such as shipping features to customers faster so they’ll get off your backs.

Find Your Advocates

By now you have gotten your team to work together, inspired them and even taught them a few new concepts to add fluidity to their roles. Hopefully, you have learned a few new tricks along the way as well. This is a two-way street after all.

Unfortunately, your work is far from done. Chances are not everyone was receptive to your invitations or to collaborating with you. Not everyone was open to learning. If it were that easy, wouldn’t everyone be using a DevOps organizational model?

To combat their apathy for the cause, you will need to find advocates on different teams to help teach and give your ideas more exposure. In order for any culture shift to take place, more than one person will need to start advocating for it. Find those who seemed most receptive to your invitations and teachings, then recruit them.

For the naysayers, help them understand (maybe over a few beers) how transitioning to a DevOps operational model will directly impact their lives. Discuss what each team wants and needs from the other, and ask them to advocate from their side, bringing everyone together to solve problems jointly from the beginning.

Advocates can get more of their teammates on board by reassuring them that the transition will be made with everyone’s input in mind. By having people engaged in honest, beneficial collaboration you will finally be on your way to creating a true DevOps culture.

Update 4/10/14 – Continue reading the Transitioning to DevOps series:

Posted in Best Practices