Behind the Scenes: PagerDuty’s April Hack Day

Everybody loves Fridays, but the second Friday of every month is special at PagerDuty—that’s Hack Day. On Hack Day, anyone at PagerDuty can work on any project they like for the entire day, even non-technical staff. Whether it’s a tech demo, some kind of code cleanup, or odds and ends for a public GitHub repo, everyone is encouraged to participate. The following week, the hackers present their projects to the company and trophies are awarded for Most Awesome, Most Useful, and “No Codez” (non-programming) projects.

There’s good-spirited competition for the Hack Day awards, and some of the hackers put a lot of effort into their presentations. Some are so inspired that they even spend evening or weekend time making their projects better (which technically is cheating, but at least they’re passionate!).

Everyone at PagerDuty loves seeing the projects, and April’s Hack Day did not disappoint. Here’s a look at the winners in the technical categories, plus an honorable mention.

Most Useful: Doug’s Colorized Prompts

One constant danger of coding is accidentally running commands on a production box that were intended for a test box. Doug Barth tackled the issue and defined command prompts that are customized based on the environment a programmer is coding in. There are three styles: Development, Staging, and Production. The styles are color-coded for each environment (green = development, yellow = staging, red = production), and a unique prefix is applied for programmers who don’t have color support in their terminal. For the root user, the prompt is underlined as an extra warning since they have access to break even more stuff. As a bonus, the title bar also shows the prefix so the programmer can easily spot the production tab he or she left open.
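To make the idea concrete, here is a minimal sketch of how such a prompt could be assembled. It is not Doug’s actual code: the `build_prompt` helper, the `DEV`/`STG`/`PROD` prefixes, and the exact prompt layout are assumptions made for illustration. It only relies on standard bash `PS1` escapes and ANSI color codes.

```python
# ANSI color codes per environment; the text prefix doubles as a
# fallback for terminals without color support.
# The prefixes below are hypothetical, not taken from the original hack.
STYLES = {
    "development": {"prefix": "DEV", "color": "32"},   # green
    "staging": {"prefix": "STG", "color": "33"},       # yellow
    "production": {"prefix": "PROD", "color": "31"},   # red
}


def build_prompt(environment, is_root=False, use_color=True):
    """Return a bash PS1 string for the given environment (illustrative helper)."""
    style = STYLES[environment]
    label = "[{}]".format(style["prefix"])
    # xterm title-bar escape, so the prefix also shows up on the terminal tab.
    title = "\\[\\e]0;{} \\u@\\h\\a\\]".format(label)
    body = "{} \\u@\\h:\\w\\$ ".format(label)
    if not use_color:
        return title + body
    codes = style["color"]
    if is_root:
        codes += ";4"  # SGR code 4 = underline, the extra warning for root
    # \[ and \] tell bash the enclosed bytes take up no width on screen.
    return "{}\\[\\e[{}m\\]{}\\[\\e[0m\\]".format(title, codes, body)


if __name__ == "__main__":
    print(build_prompt("production", is_root=True))
```

The printed string could be assigned to `PS1` in a shell profile on each class of server; how the environment is detected on a given box is left out of this sketch.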


Having distinct prompts for each set of servers will help PagerDuty programmers keep their code straight and prevent a lot of headaches. Thanks, Doug, for a truly useful hack.

Most Awesome: Evan’s MiFi Battery Hack

At PagerDuty, we rely heavily on a fleet of MiFi portable Wi-Fi hotspots to provide connectivity to our on-call engineers while they’re on the road. Unfortunately, the battery life on these devices is pretty terrible, which creates some rather untimely obstacles. Since most of the engineers sit right next to their MiFis while using them, Evan Gilman developed some code to reduce the signal strength of the MiFi and so increase the device’s battery life. The code modifies the transmit power of the wireless card inside the MiFi, reducing it by an arbitrary amount.

In his testing, Evan was able to achieve nearly twice the battery life, ensuring that a fully-charged MiFi will live past the 8-hour mark under constant moderate use. With this, we have hope that the days of searching for an outlet in the midst of a crisis are behind us.

The code isn’t suitable for public release yet, but Evan plans to package it into a gem once it gets polished a bit.

Runner-Up for Most Awesome: Ian’s Voting App

Ian Enders developed a web app that lets everyone at PagerDuty submit their categories for Hack Day and then place their votes. We used to vote on a whiteboard, but this didn’t work very well with remote teams (we have a new office in Toronto). Yeah, it’s pretty meta.


The code is public and you can check it out at https://github.com/ienders/leethaxors. Have ideas to improve it? Fork it and let us see what you can do.

And that’s a wrap. We hope you enjoyed seeing these projects and learning what we work on when we take a break from our regular work days. Stay tuned for more updates from future Hack Days.

 

Posted in Community

Thoughts from ChefConf 2013

#ChefConf is a three-day annual conference featuring demonstrations, workshops, and keynote presentations on the future of infrastructure automation. It’s designed for users of the Chef automation framework. 

PagerDutonians Ranjib Dey, Evan Gilman, and Doug Barth just got their brains filled to capacity at #ChefConf 2013 in San Francisco, where they learned about the latest Chef trends and played with ideas to make PagerDuty faster and more automated for our customers. This year’s speakers included a Who’s Who from Facebook, Disney, Message Bus, Adobe, Kickstarter, Riot Games, and more.

Workshops and Presentations

PagerDuty attended several outstanding workshops and presentations. A few notables:

Jamie Winsor of Riot Games spoke about the fabulous Berkshelf tool, which not only eases cookbook reusability but also encourages clean and well-understood development practices. He also spoke about “Mother Brain,” a work-in-progress project built using ZeroMQ that enables real-time Chef orchestration driven by monitoring. Very cool stuff.

Miah Johnson from HotelTonight gave a thorough step-by-step presentation on how to refactor cookbooks. The quality of community cookbooks has been a major concern. With a growing volume of cookbooks, this talk would be very helpful for anyone who is involved with cookbook development.

Seth Chisamore at Opscode presented on omnibuses. Omnibuses are grand unified installers that bundle everything above glibc into a single monolithic platform-specific package. Chef, Vagrant, and Sensu are common examples. The omnibus project provides a Ruby-based DSL and allows reusing existing Chef cookbooks. This opens up some potential applications for PagerDuty that could drastically reduce our Chef run time.

Hackathon Happiness

As good as the keynotes were, some of the best takeaways came from the workshops and hackathons. During the pre-conference hackathon, we got to work on kitchen-LXC integration. LXC stands for Linux containers (think of them as lightweight virtual machines). We got kitchen-LXC running with Btrfs support. This is significant for PagerDuty, because it lets us create an entire production-like environment (we spawned about 30 containers on an MBP) and run integration testing locally, end to end.

During the post-conference hackathon we worked on Chef-Berkshelf integration. Currently, Berkshelf is executed as a separate step to manage cookbooks (for example, uploading cookbooks to a Chef server), so we set out to build a library that lets Chef use Berkshelf by itself. With this integration, users won’t have to upload or update community cookbooks on their Chef server. The cookbook storage can also be distributed across multiple Chef servers, reducing the load on the central Chef server. By the end of the day we had the library working and had published a RubyGem.

Air Time

While we were there, Doug Barth, one of our operations engineers, was interviewed about the ways PagerDuty uses Chef to automate our infrastructure. You can read the transcript here. Ranjib Dey, another of our operations engineers, got some air time too by participating in a couple of special #ChefConf podcasts by Food Fight: the Day 1 wrap-up and an episode on LWRPs. He also appeared in a Ship Show podcast.

IT Network Connections

ChefConf was an amazing experience, but one of the best highlights wasn’t programmed into the event: it was connecting with friends and partners in our network. We got to meet a lot of folks with whom we’ve been interacting on IRC and GitHub for the past three years. We also spent some time with the Datadog folks and discussed our common pain points. Check out the photo of us and the Datadogs that they tweeted.

PagerDuty and Datadogs at #ChefConf

#ChefConf 2013 promised to be the premier event for IT infrastructure automation, and it didn’t disappoint. Our three ambassadors are armed with new ideas and possibilities, and we can’t wait to see how their ideas get implemented to make PagerDuty even better for our customers.

Posted in Community, Customer, Events

PagerDuty SMS Alerts Now Sent Via Short Code

PagerDuty’s number one priority is and will continue to be reliability. Unfortunately, reliability can be an issue with SMS messaging. However, our engineering teams have been working very hard to make sure that we provide the best possible deliverability on SMS messaging in order to continue to increase reliability.


One way we have improved the delivery rate of PagerDuty SMS alerts is by sending them via short code. US cell carriers pre-approve SMS traffic from our short code, so SMS alerts should now be delivered more reliably across all carriers.

Any PagerDuty SMS alert sent to a US phone number will be sent from PDUTY (738-89), and any failover will be sent via any of these long codes. This change should greatly reduce the issues our customers have experienced with receiving SMS alerts in the US.

For PagerDuty short code responses, standard text messaging rates should apply.

We are extremely excited about short codes and, as always, would love to hear your feedback! Please reach out to us at support at pagerduty.com.

Posted in Announcements, Features

Outage Post-Mortem – April 13, 2013

We spend an enormous amount of our time on the reliability of PagerDuty and the infrastructure that hosts it.  Most of this work is invisible, hidden behind the API and the user interface our customers interact with. However, when those systems fail, the failures become very noticeable as delays in notifications and 500s on our API endpoints.  That’s what happened on Saturday, April 13, at around 8:00am Pacific Time. PagerDuty suffered an outage triggered by degradation in a peering point used by two AWS regions.

We are writing this post to let our customers know what happened, what we have learned, and what we’ll do to fix the issues uncovered by this outage.

Background

PagerDuty’s infrastructure is hosted in three different datacenters (two in AWS and another in Linode).  For the past year, we’ve been rearchitecting our software so that it can survive the outage of an entire datacenter (including that datacenter being partitioned from the network). What we did not specifically build into our design was the ability to survive the failure of two datacenters at once. However unlikely, that is what happened on Saturday morning. Since we treat each AWS region as a datacenter, and both of them failed at the same time, we weren’t able to remain available with only our one remaining datacenter.

We picked our three datacenters to have no dependencies among them, and made sure that they are physically separated. However, we have since learned that two of the datacenters shared a common peering point. This peering point experienced an outage that resulted in both of those datacenters going offline.

The Outage

Note: All times referenced below are in Pacific Time.

  • At 7:57am, according to AWS, a connectivity issue begins due to a degrading peering point in Northern California
  • At 8:11am, the PagerDuty on-call engineer is paged about an issue with one of the nodes in our notification dispatch system
  • At 8:13am, an attempt is made to bring back the failed node, without success
  • At 8:18am, our monitoring system detects multiple-provider failures for notifications (caused by the connectivity issue). At this time, most notifications are still going through, but with increased latencies and error rates
  • At 8:31am, a Sev-2 is declared and more engineers are paged to help out
  • At 8:35am, PagerDuty completely loses its ability to dispatch notifications, as it cannot establish quorum due to high network latency. A Sev-1 is declared
  • At 8:53am, the PagerDuty notification dispatch system reaches quorum and starts to process all queued notifications
  • At 9:23am, according to AWS, the connectivity issue at the Northern California peering point ends

During the post-mortem analysis, our engineers also determined that a misconfiguration on our coordinator service prevented us from recovering quickly.  In all, PagerDuty wasn’t able to dispatch notifications for 18 minutes between 8:35am and 8:53am; however, during this time, our events API was still able to accept events.
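For readers unfamiliar with quorum, the sketch below illustrates why a three-datacenter layout tolerates losing one datacenter but not two. It assumes a simple strict-majority rule and one coordinator node per datacenter; the node counts and datacenter names are hypothetical, and our real systems are more involved than this.

```python
def has_quorum(reachable_nodes, total_nodes):
    """A strict majority of nodes must be reachable to form a quorum."""
    return reachable_nodes > total_nodes // 2


def survives_datacenter_loss(nodes_per_dc, failed_dcs):
    """Check whether quorum still holds after the given datacenters drop out."""
    total = sum(nodes_per_dc.values())
    reachable = sum(
        count for dc, count in nodes_per_dc.items() if dc not in failed_dcs
    )
    return has_quorum(reachable, total)


if __name__ == "__main__":
    # Hypothetical layout: one coordinator node in each of three datacenters.
    layout = {"aws-1": 1, "aws-2": 1, "linode": 1}
    print(survives_datacenter_loss(layout, {"aws-1"}))           # True
    print(survives_datacenter_loss(layout, {"aws-1", "aws-2"}))  # False
```

With two of three datacenters unreachable, only one node out of three remains, which is not a majority. That matches the situation between 8:35am and 8:53am described above.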

What we’re going to do

As always with major outages, we learn something new about deficiencies in our software.  These are some of our plans to rectify the discovered issues.

Short term

  1. During our analysis, we found that we didn’t have adequate logging to debug issues within some of our systems.  We have now added more logging and started to aggregate the logs into a single source for better searchability.
  2. During the outage, most of the failed coordinator processes were restarted manually.  We are going to add a process watcher to restart such processes automatically.
  3. We also found that we didn’t have good visibility into the inter-host connectivity. We’ll be building a dashboard that shows this.
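As an illustration of the second item, a process watcher can be as small as a loop that re-launches a command whenever it exits with an error. The sketch below is a toy version under that assumption; the `watch` function and its parameters are made up for this example, and in practice an existing supervisor (such as systemd, runit, or monit) would usually be a better fit than hand-rolled code.

```python
import subprocess
import sys
import time


def watch(command, max_restarts=3, delay_seconds=0.1):
    """Run `command`, restarting it whenever it exits with a non-zero code.

    Returns the number of restarts performed. Stops when the process
    exits cleanly or the restart budget is used up.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay_seconds)  # brief pause to avoid a tight crash loop


if __name__ == "__main__":
    # Stand-in for a crashing coordinator process: a child that always fails.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(watch(failing, max_restarts=2, delay_seconds=0))
```

The restart budget matters: a watcher that restarts forever can hide a process that is crashing for a real reason, so a capped count (plus an alert when the cap is hit) is a reasonable default.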

Long term

  1. We also found that not all of our engineering staff are up to date with Cassandra and ZooKeeper.  We’ll be investing time to train our staff on both of these technologies.
  2. Investigate moving off one of the AWS regions.  We’ll need to do our homework when picking a new hosting provider and datacenter to avoid a single point of failure.

Posted in Availability

Rackspace Cloud Monitoring Now Integrates with PagerDuty


PagerDuty is proud to announce an integration with Rackspace Cloud Monitoring, a system that allows you to monitor your critical systems even if they are hosted in a distributed environment.  With the powers of Rackspace Cloud Monitoring and PagerDuty combined, you now have the best of monitoring and alerting systems available, ensuring that your team is notified of critical events in your systems.

Getting Started

To use this new integration, you will need both a PagerDuty account and a Rackspace Cloud Monitoring account.  Follow these links to sign up for accounts at PagerDuty and Rackspace Cloud Monitoring.  To configure your integration, you can follow our easy Rackspace Cloud Monitoring Integration Guide.

Integration in Action

First you must configure a Monitoring Check for your system within Rackspace Cloud Monitoring. After an alert has been triggered, you’ll see the incident within PagerDuty.

Rackspace Cloud Monitoring Incident

To get additional details as to the contents of the alert, simply click on ‘View Message’ and all of the information from Rackspace Cloud Monitoring will be displayed for you.

Rackspace Cloud Monitoring Details

If you have any questions or issues, don’t hesitate to contact us at support [at] pagerduty.com.

Posted in Announcements, Integrations, Partnerships

AppDynamics Integration for PagerDuty

It’s a busy integration-filled day at PagerDuty. We’re excited to announce an integration with AppDynamics, a popular enterprise application performance monitoring and management solution. By combining AppDynamics’ granular visibility of applications with PagerDuty’s reliable alerting capabilities, customers can make sure the right people are proactively notified when business impact occurs, so IT teams can get their apps back up and running as quickly as possible.

AppDynamics and PagerDuty in action

You’ll need PagerDuty and AppDynamics accounts to get started – if you don’t already have both, you can sign up for free trials of PagerDuty and AppDynamics.  Once you complete this simple integration, you’ll start receiving incidents in PagerDuty created by AppDynamics’ out-of-the-box policies.

When you click the ‘Details’ link for an incident, you’ll see the details for that incident, including the Incident Log:

PagerDuty and AppDynamics incident details

If you are interested in learning more about the event itself, simply click ‘View message’ and all of the AppDynamics event details are displayed, showing which policy was breached, the violation value, the severity, etc.:

AppDynamics incident message

Give the integration a whirl

To get started, we have written an easy-to-follow AppDynamics integration guide for you. If you have any questions or issues, don’t hesitate to contact us at support [at] pagerduty.com.

Posted in Announcements, Integrations, Partnerships

PagerDuty integrates with Salesforce Desk.com

We all know the great importance of customer service in this modern Internet era.  One of the things that contribute to providing great customer service is the ability of an organization to quickly respond to customer issues.  PagerDuty is all about providing tools to quickly resolve critical issues in an organization, including customer issues.  With this in mind, we are very excited to announce a new kind of integration, the first of many – integration with a leading customer support solution provider, Salesforce Desk.com.

IT alerting, coming to a help desk near you

Imagine a scenario where one of your high-value customers files an urgent ticket at 3:00 a.m., hoping to get a quick resolution.  With a traditional setup, that ticket won’t be answered until the next business day.  PagerDuty integration with Desk.com changes all that – an action set up in Desk.com triggers a PagerDuty incident that wakes up the appropriate on-call support person to handle the situation.  You also get the other benefits that come with PagerDuty, including alerts via SMS, phone, and push notification, and the ability to escalate unhandled incidents.  If you’d like to get started, we have written an easy-to-follow integration guide for you.

PagerDuty Desk.com incident

We at PagerDuty, along with Salesforce, are very excited to provide this integration so that you can wow your customers.  Please feel free to test-drive the integration by creating a trial PagerDuty account.

Photo credit: Bram Cymet / Creative Commons Flickr

Posted in Announcements, Integrations, Partnerships

Getting to da Choppa! (aka: Customizing Your Android Notifications)


Being a fan of Ahhhnold’s meme-worthy performance in Predator, I acquired the Get to da Choppa sound bite as a ringtone on my Android. Then I thought, “Wouldn’t it be cool if PagerDuty notified me with that?”

With the release of the new Android app, you can set a special ringtone for your PagerDuty push notification alerts. There will never be any confusion; you will know right away when Arnold is telling you to “Get to Da Choppa” that it is a PagerDuty alert!

Enabling Custom Notifications

Android Custom Notifications for PagerDuty
Here’s how you set up a custom ringtone for PagerDuty push notifications from your Android app:

  1. Open the browser on your mobile device.
  2. Tap the Settings icon on the bottom right.
  3. Tap Push Notification Sounds.
  4. Select a ringtone from your built-in ringtones, or from your media ringtones by downloading the Ring Extended application on your phone.
  5. Tap OK and your custom ringtone is configured. Voila!

Hasta la vista, stock sounds

We are very excited about the addition of customized ringtones and hope that this will make it even easier for you to respond quickly to critical alerts. I mean, c’mon, no one would dare defy Arnold!

If you need assistance setting up custom ringtones on your phone or have any feedback, we would love to hear from you.

iOS user?

As you might imagine, adding custom sounds is a little different on the iOS app. We’re working on it and would love to hear what sounds you would like to have bundled with the PagerDuty iPhone app. Contact us on Twitter with your suggestions.

Posted in Features

New PagerDutonians

Since January, we’ve been very busy expanding the PagerDuty team. We’re happy to announce the addition of Joe Lambe, Edgar Salazar, Arup Chakrabarti, Evan Gilman, Ryan Duffield, Ranjib Dey, and Clay Smith.

Joe, our new Director of Demand Generation, joins us from Atlassian, where he was responsible for the dark arts of demand generation. Joe is an expert in online advertising & SEO and has a PR background. Besides being a marketing guru, Joe is an avid cyclist, enjoys yoga, and loves an amazing Italian meal.

Edgar has quickly established himself as a coding machine with rapid and well-documented pull requests. Within a week’s time, Edgar won the nickname Machine Gun, though our office manager would like to call him the Cereal Bandit. No matter what time of day it is, you will find Edgar enjoying a bowl of cereal.

Arup claims that he was hired for his charming wit and dashing good looks, which isn’t that far from the truth. Arup brings years of experience growing teams and team members at Amazon and Netflix to expand and improve our Operations Team. Don’t let his title fool you: Arup is a well-rounded software engineer who studied Biomedical Engineering at Boston University.

All the way from Florida, Evan joins the Operations Team as a Senior Engineer. Evan impressed us during his onsite interview when he passed all of his interviews in 30 minutes or less! When Evan isn’t in the SF office, you can find him with a camera in an exotic part of the world.

If he’s not at a Toronto Blue Jays game or jamming away on the drums, you can find Mr. Duffield at the Toronto PagerDuty office. Ryan recently joined the Realtime team after working at his own startup for 2+ years. Though he works in the tech industry, Ryan is also an amateur astronomer.

The title for Happiest Dutonian in the SF office goes to Ranjib. Regardless of the situation, Ranjib always has a smile on his face. He joins the Operations Team after working at Google and ThoughtWorks.

Clay is the newest Dutonian to join the Product team, after working at Thomson Reuters Business. Clay brings a passion for building tools, automating, and generally making front-end JavaScript code work better. We say we hired Clay for his coding skills, but we really hired him for his award-winning Texas chili recipe that has 6 different types of meat!!

Posted in Announcements

Announcing Google Reader Alerts

We understand how painful it can be when monitoring tools don’t work. Without reliable alerting and on-call scheduling, you’re bound to lose sleep, dollars or customers. Since launching PagerDuty, we’ve focused on building a product that provides a single dashboard for the tools you use to monitor your infrastructure. Said another way, when s*IT breaks, we wake you up.

Our vision to be the central command center for IT teams means we’re always on the hunt for new opportunities to increase the tools we integrate with and the ways customers can receive alerts.

Today, we alert you with emails, SMS and phone alerts to over 170 countries around the world, and recently iOS and Android push notifications. While these were a good start, our work wasn’t done yet.

 

Our best alert yet: Google Reader

Based on overwhelming customer interest, we’re happy to announce our latest notification type: Google Reader alerts.

We’re big fans of RSS readers as a way to make sense of the sprawling World Wide Web. Outside of our inboxes and GitHub accounts, Google Reader is our watering hole for the latest Hacker News posts, Android vs. Apple flamewar and xkcd comic. We like Twitter hashtags and Facebook likes as much as the next social media expert, but these services haven’t replaced the need for a simple, open solution for content digestion.

With Google Reader, we leverage Google’s track record of ensuring that the latest RSS post arrives safely, every time. Google Reader is the de facto RSS reader standard and we are confident we picked the right partner for the future.

 

Google Reader and PagerDuty in action

The PagerDuty Google Reader alert type can be enabled in a few easy steps. In minutes, you’ll be in RSS bliss.

Step 1: Create a PagerDuty free trial if you don’t have an existing account.

Step 2: Log in to your PagerDuty account.

Step 3: Once logged in, click on My Profile settings under your email address.

Step 4: Under Contact Methods, click the new menu option of RSS, “Create Google Reader RSS Feed”.

Google Reader RSS feed for PagerDuty

Step 5: Copy your new RSS feed.

Step 6: Head over to Google Reader and add a new RSS subscription, pasting your feed URL like any other RSS-enabled site.

Voila. Now you can receive unlimited Google Reader alerts for PagerDuty, with the same customized notification rules you’ve come to expect.

Respond, acknowledge or resolve alerts without leaving Google Reader. Just email your account address, for example tobiasfunkejorts@pagerduty.com, with ‘acknowledge’ or ‘resolve’ and we’ll update the alert status.

Google Reader Alerts for PagerDuty

Early Google Reader love

We’ve been dogfooding the new features at the PagerDuty offices and it’s given us richer social integration with our services and escalation policies. It’s hard to imagine life without Google Reader now.

But don’t just take our word for it. Feedback from our Google Reader private beta has been incredible.

“Our best partnership yet.” – Alex Solomon, PagerDuty Co-founder and CEO

“Reeder usage has jumped 10x since the beta launched. That’s a lot of alerts!” – Reeder, popular RSS reader app

“Thanks for making my midnight alerts even harder to sleep through, I reaaally appreciate it.” – Anonymous customer

“Cease and desist.” – Google

 

One step forward, two steps back

This is just the beginning. The following features are planned for later this year:

  • Alerts go social. Share your favorite PagerDuty alerts on Google Buzz or Wave.
  • Get extra weekend alerts. Subscribe to a curated list of PagerDuty alerts hand-picked by our staff to make you feel better about your next outage.
  • More notification services. Our team is testing alerts sent via custom GIFs, USPS mail and barbershop quartets.

We hope you like Google Reader alerts as much as we do. If you’ve got another notification type you’re dying to have added to PagerDuty, tweet us your ideas @pagerduty.
 

Posted in Features, Integrations, Partnerships