Chef at PagerDuty

This is the first post of a multi-part series on some of the operations challenges that the team at PagerDuty is solving.

At PagerDuty we strive for high availability at every layer of our stack. We attain this by writing resilient software that then runs on resilient infrastructure. We take this into account when we design our infrastructure automation: we assume that pieces will fail and that we need to be able to replace or rebuild them quickly.

For this first post about our Operations Engineering team, we will cover how we automate our infrastructure using Chef, a highly extensible, Ruby-based, search-driven configuration management tool, and what practices we have learned along the way. We will cover our typical workflow and how we ensure that we can safely roll out new, resilient, and predictable infrastructure.

The Team

Before diving into the technical details, here is some context about the team behind the magic. Our Operations Engineering team at PagerDuty is currently made up of 4 engineers. The team is responsible for a few areas: infrastructure automation, host-level security, persistence/data stores, and productivity tools. The team is made up of generalists, with each team member having 1-2 areas of depth. While the Operations Engineering team has its own PagerDuty on-call rotation, each engineering team at PagerDuty also participates in on-call.

The Hardware

We currently own 150+ servers spanning multiple cloud providers. The servers are split into multiple environments (Staging, Load Test, and Production) and multiple services (app servers, persistence servers, load balancers, and mail servers). Each of our three environments has a dedicated Chef server to prevent hosts from polluting other environments.

The Workflow

The Chef code base is three years old and has around 3.5k commits.

Chef repository

Following is the skeleton of our chef repository:

  • git repo
    • cookbooks (stores community cookbooks that contain our customizations)
    • site-cookbooks (stores our wrappers around community cookbooks, our custom cookbooks, LWRPs, etc.)
    • data_bags (stores all data bags that are not encrypted)
    • lib (Ruby libraries that are used across site-cookbooks/* and knife plugins)
    • roles (stores all roles)

We use the standard feature-branch workflow for our repo. A feature can be tactical work (spawning a new type of service), maintenance work (upgrading/patching), or strategic work (infrastructure improvements, large-scale refactoring, etc.). Feature branches are unit tested via Jenkins, which constantly watches GitHub for new changes. We then use the staging environment for integration testing: feature branches that pass the tests are deployed to the staging environment's Chef server. Depending on the feature, most branches go through a code review via a pull request. The code review is purposefully manual: we make sure that at least one other team member gives a +1 on the code. If there is a larger debate about the code, we block out time during our team meetings to discuss it.

From there, the feature branch is merged and we invoke our restore script, which deletes all existing cookbooks from the Chef server and then uploads all roles, environments, and cookbooks from master. The restore process generally takes less than a minute. We do not follow any strict deployment schedule; we prefer to deploy whenever we can. Unless it's a hot-fix, we prefer to do deployments during office hours when everyone is awake. We run chef-client once a day via cron throughout the week. If we need on-demand Chef execution, we use pssh or knife ssh with a controlled concurrency level.
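As a minimal sketch (not our actual restore script) of the "delete everything, then re-upload from master" step described above, assuming standard knife and berks commands and the repository layout shown earlier:

    # restore.rb (illustrative sketch only)

    def run!(cmd)
      puts "==> #{cmd}"
      system(cmd) || abort("command failed: #{cmd}")
    end

    # Purge every cookbook version on the Chef server so stale artifacts cannot linger.
    run! 'knife cookbook bulk delete ".*" --purge --yes'

    # Re-upload the current state of master.
    run! 'berks upload'                                   # community + pd-* cookbooks
    run! 'knife role from file roles/*.rb'                # roles
    run! 'knife environment from file environments/*.rb'  # environments
    run! 'knife upload data_bags'                         # unencrypted data bags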

Chef Testing

All PagerDuty custom cookbooks have a spec directory containing ChefSpec-based tests, and we recently migrated to ChefSpec 3. We use ChefSpec and RSpec stubbing capabilities extensively, as the vast majority of our custom recipes use search, encrypted data bags, etc. Apart from the cookbook-specific unit tests that reside inside the spec subdirectory of individual cookbooks, we have a top-level spec directory with functional and unit tests. The unit tests are mostly ChefSpec-based assertions about roles or environments, while the functional tests are all LXC- and RSpec-based assertions. The functional test suite uses Chef Zero to create an in-memory server, then uses the restore script and the chef restore knife plugin to emulate a staging or production server. We then spawn an individual LXC container per role, using the same bootstrap process as our production servers. Once we successfully converge a node, we make assertions based on the role; for example, a ZooKeeper functional spec will telnet locally and run 'stats' to verify that requests can be served. This covers most of our code base, except the integration with individual cloud providers.
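To illustrate the stubbing approach, here is a minimal ChefSpec 3 sketch; the cookbook name, search query, and data bag contents are hypothetical:

    # site-cookbooks/pd-zookeeper/spec/default_spec.rb (illustrative)
    require 'chefspec'

    describe 'pd-zookeeper::default' do
      let(:chef_run) { ChefSpec::Runner.new.converge(described_recipe) }

      before do
        # Stub search so the spec never needs a real Chef server.
        stub_search(:node, 'role:zookeeper').and_return([])
        # Encrypted data bag access is stubbed with plain RSpec.
        allow(Chef::EncryptedDataBagItem).to receive(:load)
          .with('secrets', 'zookeeper')
          .and_return('password' => 'not-a-real-secret')
      end

      it 'installs the zookeeper package' do
        expect(chef_run).to install_package('zookeeper')
      end
    end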

Cookbook Management

We heavily use community cookbooks and try not to create cookbooks when there is a well-maintained open source alternative. Instead, we prefer to write wrapper cookbooks with a "pd" prefix that layer our customizations on top of the community cookbooks. An example is the pd-memcached cookbook, which wraps the memcached community cookbook and provides iptables rules and other PagerDuty-specific customization.
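A wrapper recipe in this style could look like the following sketch (the attribute value and rule name are assumptions, not our actual code):

    # site-cookbooks/pd-memcached/recipes/default.rb (illustrative)

    # Override community-cookbook attributes with PagerDuty-specific values.
    node.default['memcached']['memory'] = 1024

    # Delegate installation and configuration to the community cookbook.
    include_recipe 'memcached'

    # Layer on PagerDuty-specific firewalling; the iptables community cookbook's
    # iptables_rule definition renders a memcached.erb template from this cookbook.
    iptables_rule 'memcached'

The wrapper's metadata.rb would declare depends on both 'memcached' and 'iptables'.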

Both community cookbooks and our PagerDuty custom cookbooks are managed by Berkshelf. All custom cookbooks (pd-*) stay inside the site-cookbooks directory in the chef repo. We use several custom knife plugins. Two of them, chef restore and chef backup, take care of fully backing up and restoring our Chef server (nodes, clients, data bags); with these, we can easily move Chef servers from host to host. Other knife plugins are used to spawn servers, perform teardowns, and check the status of third-party services.
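A simplified sketch of how a Berksfile can tie this together (the source URL and the specific cookbooks shown are illustrative and depend on the Berkshelf version in use):

    # Berksfile (illustrative)
    source 'https://supermarket.chef.io'

    # Community cookbooks are resolved from the public index...
    cookbook 'memcached'
    cookbook 'iptables'

    # ...while pd-* wrapper cookbooks come from the repo itself.
    cookbook 'pd-base',      path: 'site-cookbooks/pd-base'
    cookbook 'pd-memcached', path: 'site-cookbooks/pd-memcached'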

Gaining Confidence via Testing and Predictability

Currently, we are confident in our ability to spawn and safely tear down our infrastructure when we have the appropriate tests in place. When we initially took a TDD approach to our infrastructure, there was a steep learning curve for the team. We still run into issues when spinning up nodes across multiple providers, and network dependencies on external services (e.g. hosted monitoring services, log management services) have introduced additional failure modes and security requirements. We have responded to these challenges by adopting aggressive memoization techniques and by introducing security testing automation tools (e.g. gauntlt) into the operations toolkit (more on this in a later post).

A key challenge remains with cross-component versioning issues and the upfront, proactive effort required to keep dependencies updated. Some code quality issues in community cookbooks have also hampered us. But we understand these are complex, time-bound problems, and we are part of the bigger community responsible for fixing them.


19 Responses to Chef at PagerDuty

  1. Curtis says:

    Great article! Just curious, why was the decision made to tear down the chef repo, then restore with all changes?

    • Ranjib Dey says:

      We don't version-freeze community cookbooks in the environments; they are locked via Berkshelf. This frees us from managing cookbook versions per environment, but we take the risk of having multiple versions of cookbooks on the chef server. The teardown approach takes care of that. It also provides a uniform approach for dealing with roles and other artifacts that are not versioned in chef. Lastly, this ensures we can recreate our chef server from just our repo (and the node backups) on a regular basis.

  2. Ranjib Dey says:

    The chef repo is a git entity, while the chef server is backed by Postgres and Bookshelf (kind of like S3). Chef does not provide a way to sync them, i.e. to delete stuff that doesn't exist in the git repo. This is problematic: you might delete a cookbook/role from the git repo (aka chef repo) that you are not using any more, but it will still be present on the chef server. A node can still use it, so you won't realize this until you migrate the chef server and restore it from the git repo (the node will fail against the new chef server as it cannot meet its cookbook dependency). The delete-all approach helped us get around this problem: if we have dangling references, they get caught fast, right in the test environment, where we do automated deployment against every merge. (The same logic applies to data bags as well.)

    Another reason for this is to get rid of multiple versions of the same cookbook. We don't version-freeze cookbooks per environment; there's only one version of every cookbook on the server. Chef will always pick the highest version of a cookbook unless a version is explicitly frozen in environments or run lists/roles. The community cookbooks provide their own versions, while we use a single version across all our wrappers (the version comes from the pd-base cookbook), which follows semver principles. So our entire chef repo has a single version, and together with the Berksfile it captures the state of our chef server entirely.
    happy cooking
    ranjib

    • John says:

      Hi Ranjib,

      you said: “we use a single version across all our wrappers (the version comes from pd-base cookbook)”

      could you explain how you propagate pd-base version to all cookbooks?
      thanks

      • Ranjib Dey says:

        pd-base provides a few libraries; one of them contains a constant named PagerDuty::ChefRepository::VERSION. All our wrapper cookbooks require this in their metadata.rb and assign `version PagerDuty::ChefRepository::VERSION`.

        does that answer your question?

        • John says:

          I see.
          and it works because Berks or Knife actually evaluate the metadata.rb on upload. It makes sense.
          thanks a lot!

        • Curtis says:

          Are those pd-base libraries available just through a ‘depends’ statement?

          Or, do you pull in with Ruby ‘require’ statements at the top of your metadata.rb?

          I really like the concept, just curious how to make the base libraries available everywhere.

          • Ranjib Dey says:

            Yes, pd-base is a cookbook. You don't need the depends statement to use that library inside metadata; for the version bit we just use require_relative. metadata.rb is only evaluated during upload; afterwards Chef uses the generated JSON (which has the resolved version). We declare pd-base as a dependency only when we use the pd-base libraries/recipes/LWRPs during the Chef run.
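            As a hypothetical illustration of that pattern (the relative path and file name are assumptions, not our actual layout):

                # site-cookbooks/pd-memcached/metadata.rb (illustrative)
                # Pull the shared version constant in from pd-base's library file.
                require_relative '../pd-base/libraries/version'

                name    'pd-memcached'
                version PagerDuty::ChefRepository::VERSION
                depends 'memcached'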

  3. serverascode says:

    What other security tools are you using? gauntlt sounds really interesting, but it sounds like you are using others as well. Would love to hear about them. :)

    • Arup Chakrabarti says:

      @serverascode We are currently using gauntlt to continuously run port scan attacks against our infrastructure. We have also written some chef code that automatically updates our firewalls. Lastly, we are using tools like OSSEC to continuously scan the infrastructure for strange behavior. We have even more security testing automation work to do, and we will definitely be posting another blog post dedicated to it within the next few months.

      • Tehmasp Chaudhri says:

        Hey Arup,

        Have you done any integration w/ Chef and OSSEC in such a way that permitted changes to systems via Chef don’t necessarily trigger OSSEC alerts but changes done outside of a config mgmt system do? Thanks!

        • Arup Chakrabarti says:

          We are very heavy users of OSSEC and we have it integrated into Chef. The way that we cleared out all the noise was by setting it up in our environment, and then aggressively tuning the rules to filter out alerts from known changes by Chef. Our Chef runs are fairly predictable (when they run, how long they take, what they touch, etc) so this was feasible for us.

  4. alexism says:

    Hi Ranjib, thanks for sharing this information with us.

    I have a couple of questions about your site-cookbooks directory. I am currently using the same approach, but I am considering creating a repo for each cookbook.

    My motivations are mainly ease and scope of testing and cookbook version management.

    a. In your case, how do you manage cookbook versions?

    b. Do you do continuous integration per cookbook, or do you consider all your cookbooks as a single ‘deliverable’ and run all tests for all cookbooks on every commit?

    c. When do you merge your dev branch into the prod branch? when all the cookbooks are “green”? does such a gate impact the time it takes for a cookbook to be released in prod?

    d. On a different topic, how do you share libraries across cookbooks? I’m referring to this sentence:
    “lib (ruby libraries that are used across site-cookbook/* and knife plugins)”

    e. You don’t mention Chef environments in your post; don’t you use them? Do you consider the latest ‘stable’ commit in the chef repo as the one and only environment, with all nodes belonging to that single env? If not, how do you partition nodes? With different Chef servers or organizations?

    thanks in advance for your answers,

    Alexis

    • Arup Chakrabarti says:

      Hi Alexis, answers to your questions are inline:

      a. In your case, how do you manage cookbook versions?

      We currently do not manage cookbook versions at the cookbook level. We use a git hash and timestamp to maintain a repo-level version, but this has bitten us at times when someone uploads a single incorrect cookbook. We are looking at ways to automatically modify cookbook version numbers when we invoke our restore scripts: check which cookbooks have been modified and then update their versions.
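      As a hypothetical illustration (the exact format is an assumption), the repo-level version can be derived like this:

          # Illustrative only: combine a UTC timestamp with the current git hash
          # to produce a repo-level version string.
          sha          = `git rev-parse --short HEAD`.strip
          timestamp    = Time.now.utc.strftime('%Y%m%d%H%M%S')
          repo_version = "#{timestamp}-git#{sha}"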

      b. Do you do continuous integration per cookbook, or do you consider all your cookbooks as a single ‘deliverable’ and run all tests for all cookbooks on every commit?

      We test each and every cookbook on every commit, even if a cookbook was not modified. We have caught problems where there is a change outside of our control (e.g. remote_url has changed for some package) with this model.

      c. When do you merge your dev branch into the prod branch? when all the cookbooks are “green”? does such a gate impact the time it takes for a cookbook to be released in prod?

      Yes, we only merge into our production (master) branch when all tests are passing. We find that for the initial cookbook creation, it will take a bit longer since you need to write up all the specs etc, but in the long run, we can make changes faster since we can validate those changes faster.

      d. On a different topic, how do you share libraries across cookbooks? I’m referring to this sentence:
      “lib (ruby libraries that are used across site-cookbook/* and knife plugins)”

      For our Ruby libraries, we are using Bundler. So for our knife commands, we would type ‘bundle exec knife $PLUGIN $ACTION’.

      e. You don’t mention Chef environments in your post; don’t you use them? Do you consider the latest ‘stable’ commit in the chef repo as the one and only environment, with all nodes belonging to that single env? If not, how do you partition nodes? With different Chef servers or organizations?

      We use separate chef servers for our test/staging infrastructure and our production infrastructure. Within production, we try to only have one environment to avoid problems where you might get back unexpected results with chef search. Within our test/staging environments, we have a few chef environments to split up groups of servers. Environments are also part of the repo as we use them to control certain attributes.

      • Curtis says:

        We’re also testing every cookbook on every commit, but it takes upwards of 14-15 minutes to run all of our spec tests (for ~450 examples).

        Are you currently using anything to speed up your tests for quicker feedback?

  5. Seth Vargo says:

    Great article Ranjib!

  6. spustay says:

    Great article, thanks for sharing!

  7. Curtis says:

    Your article states that you don’t store encrypted data bags in your data_bags directory in the repo.

    What’s the current solution at PagerDuty for encrypted data bags?

    We’ve attempted to implement ChefVault, however, we’ve had issues getting around the chicken-and-egg problem that comes with it. It’s hard to automate a deployment when you’re required to update data bags every time new machines are created.