New! Automatically Resolve Email Incidents with Email Parsing

Our newest feature, Email Parsing, lets you automatically resolve incidents from your email integrations and add custom text from the emails to the incident. This helps keep everybody in the loop when your systems recover and improves the accuracy of resolution time reporting.


How it Works

To automatically resolve email incidents, you’ll want to set up your monitoring service to send an email at the onset of an incident (Trigger Emails) as well as when the system returns to normal (Recovery Emails). These two sets of emails are linked by an Incident Key, a unique identifier shared between the two emails, often the name of the host, server, or application experiencing the problem.

To parse emails, you need to add rules that tell PagerDuty whether an email is a trigger or a recovery email, as well as where to find the incident key to match them up. You can find an in-depth guide on doing this here.
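
As a rough illustration of the idea (not PagerDuty's actual rule engine), the sketch below decides whether an email is a trigger or a recovery from its subject line and pulls out an incident key with a regular expression. The subject format, the `PROBLEM`/`RECOVERY` keywords, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical subject format: "<STATUS>: <host> - <message>"
SUBJECT_RE = re.compile(r"^(?P<status>PROBLEM|RECOVERY):\s*(?P<key>\S+)")


def parse_email(subject):
    """Return (action, incident_key), or None if no rule matches."""
    match = SUBJECT_RE.match(subject)
    if match is None:
        return None
    action = "trigger" if match.group("status") == "PROBLEM" else "resolve"
    return action, match.group("key")


def apply_emails(subjects):
    """Track open incidents by key; a recovery email closes the matching one."""
    open_incidents = set()
    for subject in subjects:
        parsed = parse_email(subject)
        if parsed is None:
            continue  # email matched no rule, so it is ignored here
        action, key = parsed
        if action == "trigger":
            open_incidents.add(key)
        else:
            open_incidents.discard(key)
    return open_incidents
```

Because both emails carry the same key (the host name in this sketch), the recovery email can be matched to the incident the trigger email opened.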


With email parsing, you can also add custom fields to an incident and update them throughout the lifecycle of the incident. These fields will appear in the incident details table, and are a great way to get useful information out of the email and displayed front-and-center for responders.


You can find a guide on how to use these email parsers here, including specific guides for parsing emails from ZenDesk, Wormly, and JIRA. Have a guide you’d like us to write? Let us know!

 

Posted in Announcements, Features

So, What is ChatOps? And How do I Get Started?

Help your teams communicate and collaborate better

You’re probably hearing the word ChatOps more and more – at conferences, on Reddit and Hacker News, around the water cooler (or keg) – but what does it actually mean? And why and how would you implement it at your organization?

ChatOps, a term widely credited to GitHub, is all about conversation-driven development. By bringing your tools into your conversations and using a chat bot modified to work with key plugins and scripts, teams can automate tasks and collaborate, working better, cheaper and faster.

Here’s the 30,000-foot view: While in a chat room, team members type commands that the chat bot is configured to execute through custom scripts and plugins. These can range from code deployments to security event responses to team member notifications. The entire team collaborates in real-time as commands are executed.
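
To make the flow concrete, here is a minimal sketch of a command dispatcher in Python. It is not how Hubot, Lita, or Err are implemented; the `ChatBot` class, the `deploy` command, and its wording are hypothetical and only show how a chat message can be matched to a script while the exchange stays visible to the room.

```python
import re


class ChatBot:
    """Minimal command dispatcher; real bots add chat adapters and plugins."""

    def __init__(self):
        self.handlers = []  # list of (compiled pattern, function)
        self.log = []       # every command and reply stays visible to the room

    def command(self, pattern):
        """Decorator that registers a handler for messages matching pattern."""
        def register(func):
            self.handlers.append((re.compile(pattern), func))
            return func
        return register

    def handle(self, user, message):
        """Run the first matching handler and record the exchange."""
        for pattern, func in self.handlers:
            match = pattern.fullmatch(message)
            if match:
                reply = func(user, *match.groups())
                self.log.append((user, message, reply))
                return reply
        return None  # not a command; the bot stays quiet


bot = ChatBot()


@bot.command(r"deploy (\w+) to (\w+)")
def deploy(user, app, env):
    # A real handler would hand off to a deployment server here.
    return f"{user} is deploying {app} to {env}"
```

Calling `bot.handle("alice", "deploy web to staging")` returns the reply and appends the exchange to `bot.log`, which stands in for the shared chat history.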

Let’s break this down a little more. You probably use a chat client at work – HipChat, Slack, Flowdock, and Campfire are some commonly used tools – and if you already have one in place, you’re on the right path. Then there are the chat bots, and they’re all open source.

  • Hubot: GitHub’s bot written in CoffeeScript and Node.js (docs)
  • Lita: Written in Ruby
  • Err: Written in Python

While chat bots do the commanding, sometimes you also have deployment servers listening for these commands, doing the heavy lifting of executing deployment tasks as background jobs. With GitHub’s Hubot, the deployment server is called Heaven. Check out how Flowdock recently implemented ChatOps with those tools in their workflow. Similar to how Hubot tells Heaven what to do, PagerDuty’s bot Officer URL tells Igor, our deployment server, what to do. I’ll share some more detailed information about ChatOps at PagerDuty in the next blog post.

The results?

Visibility Across The Board

Everyone has experienced the pain of figuring out if a particular command was run by a coworker. ChatOps helps bring that work into the foreground, by putting all of it in one place – everyone’s actions, notifications and diagnoses happen in full view. This encourages teams to be transparent.  Different plugins can help expose more information to everyone (replacing opaque IP addresses with DNS names and other metadata, for example).  Beyond operating more efficiently right from the get-go, it helps new hires jump right in and learn by doing. And it flattens the typical workflows teams use to deploy and diagnose. (Not to mention, it makes remote work a whole lot easier.)

It’s also how PagerDuty better onboards talent, improves our infrastructure through automation, and, as Jesse Newland at GitHub says, puts tools at the center of the conversation.


Employing ChatOps even benefits non-technical teams. By having a central location for chat-based tools, teams like sales, marketing, and finance can understand what’s going on in your infrastructure – when you’re deploying code, who’s responsible for which systems and what they do – without having to walk over and interrupt. They can learn right from the bot itself.

Automate Manual Tasks

Tasks that used to be done manually, and often involved human error, are now automated through the chat bot. You can reduce tedious and error-prone hand-typed SQL statements, or put in place proper tests around often-repeated commands. Once a task is in chat, it’s a fast and easy way for other teams to make requests (no more ticket volleyball!). ChatOps can also improve your continuous delivery process. By easily understanding where a deployment started – and who started it – you’re able to cut out extra tasks and manual follow-up, and deploy code continuously throughout the day.

Want to employ ChatOps at your company? Here are some tips on how to get started.

Pick Your Bot

The three chat bots we listed above – Lita, Hubot, and Err – provide teams with options to best suit their workflow. Different bots have different plugins and development languages ranging from Ruby to Node to Python, so pick the ecosystem that best fits your shop.

Plug It In

Hubot, Lita, and Err each offer tons of scripts and plugins, and your team can easily use any of them today.

Start Small & Iterate

There are a lot of powerful ChatOps tools, plugins, and extras available, so it’s probably a good idea to start simple and get experience to find out what works best for your team. Try various bot integrations and scripts in your team chat room, and then stick with the ones you like best. There may be some trial and error – but that’s ok, it’s a part of the process.

The more you get used to coding, executing commands, etc. with your chat bot, the more efficient you’ll become. As your team reaps the benefits of employing ChatOps, other teams – like Front-end, Mobile, and more – will also catch on and implement it on their side. With the technical and non-technical folks participating, you’re developing not only efficient processes, but also a more development-focused culture in your company.


Posted in Operations Performance

PagerDuty + Desk.com App Hub

Making it even easier to deliver great customer experiences

Customer communication is a key part of successful incident response. We’re excited to be a part of the new Desk.com App Hub to make it even easier for customer support teams to collaborate with IT and Engineering teams on critical incidents by having all their necessary tools in one place. With the Desk.com and PagerDuty integration, customer support teams can automatically be notified if critical, high-priority tickets are opened in their helpdesk.

Traditionally, if a critical customer has an issue in the middle of the night, the ticket wouldn’t be answered until the next business day. When these incidents occur, fast response times are critical to maintaining customer happiness in a world of immediate solutions. To achieve high customer service levels, your team can track key metrics such as time to first response and resolution time. By including your support team in PagerDuty, you’re keeping the right people in the loop – allowing them to be notified, hop on conference calls, and proactively manage customer communications – and ultimately delivering better experiences when your business experiences those inevitable outages.

To get started with PagerDuty and Desk.com, check out our integration guide.

Become a PagerDuty Platform Partner

We’re continuing to build our ecosystem and now have more than 100 ready-to-use integrations available. Interested in becoming a partner or building an integration? Drop us a line!

Posted in Announcements, Partnerships

How to Ditch Scheduled Maintenance

You like sleep and weekends. Customers hate losing access to your system due to maintenance. PagerDuty operations engineer Doug Barth has the solution:

Ditch scheduled maintenance altogether.

That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense.

Scheduled maintenance tends to take place late at night on weekends—a tough proposition for operations engineers and admins. Customers require access at all hours, not just daylight ones. And scheduled maintenance implies your system is less reliable than you think, because you’re afraid to change it during the workday.

The solution? Avoid it altogether, and replace it with fast, iterative maintenance strategies that don’t compromise your entire system.

That might sound a bit ‘out there.’ But shelving scheduled maintenance is easier than you think. In his talk, Doug offered four ways to do it.

Deploy in stages

First things first: if you discard scheduled maintenance, your deployments need to be rock-solid. They should be scripted, fast, and quick to roll back, and rollbacks should be tested periodically to ensure they don’t lag.

They also need to be forward and backward compatible by one version. It’s not an option to stop the presses when you push out a new version. Red-blue-green deployments are crucial here, as they ensure only a third of your infrastructure undergoes changes at any given time.

Lastly, stateless apps must be the norm. You should be able to reboot an app without any effect on the customer (like forced logouts or lost shopping carts).

Send canaries into the coal mine

Use canary deploys judiciously to test rollouts, judge their integrity and compare results. These test deployments affect only a small segment of your system, so bad code or an unexpected error doesn’t spell disaster for your entire service.

Doug suggested a few practical ways to accomplish this:

  • Gate features so you can put out code dark and slowly apply new features to a subset of customers.
  • Find ways to slowly bleed traffic over from one system to another, to reduce risk from misconfiguration or cold infrastructure.
  • Run critical path code on the side. Execute it and log errors, but don’t depend on it right away.

As Doug summed it up for the DevOps Days crowd: “Avoid knife-edge changes like the plague.”
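
Feature gating for a subset of customers can be sketched as a deterministic percentage rollout. This is only one possible approach, not necessarily the one Doug described; the function name, the hashing scheme, and the feature names are assumptions made for illustration.

```python
import hashlib


def in_rollout(feature, customer_id, percent):
    """Deterministically place a customer in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99 per feature/customer
    return bucket < percent
```

Because the bucket depends only on the feature name and customer ID, a customer who sees the feature at 10% still sees it at 50%, so the rollout can grow gradually without flapping.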

Make retries your new best friend

Your system should be loaded with retries. Build them into all service layer hops, and use exponential backoffs to avoid overwhelming the downstream system. The requests between service layers must be idempotent, Doug emphasized. When they are, you’ll be able to reissue requests to new servers without double-applying changes.
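
A minimal sketch of both ideas follows, assuming a client that retries on `ConnectionError` and a toy server that deduplicates by request ID. The names `call_with_retries` and `Ledger`, and the default delays, are hypothetical; the random jitter is an addition not mentioned in the talk summary.

```python
import random
import time


def call_with_retries(request, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Retry request() with exponential backoff and jitter.

    request must be idempotent, because it may run more than once.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retry bursts


class Ledger:
    """Toy server that applies each request ID at most once."""

    def __init__(self):
        self.applied = {}

    def credit(self, request_id, amount):
        # Re-sending the same request ID does not double-apply the change.
        if request_id not in self.applied:
            self.applied[request_id] = amount
        return sum(self.applied.values())
```

The delay ceiling doubles on each attempt, which eases pressure on a struggling downstream service, while the request ID makes a reissued request safe.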

Use queues where you don’t care about the response to decouple the client from the server. If you’re stuck with a request/response flow, use a circuit breaker approach, where your client library delivers back minimal results if a service is down—reducing front-end latency and errors.
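
The circuit breaker approach can be sketched as a small wrapper around a service call. This is a simplified illustration under assumed thresholds, not a production library; the class name and its parameters are hypothetical.

```python
import time


class CircuitBreaker:
    """After too many consecutive failures, skip the call and return a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback      # open: fail fast without calling the service
            self.opened_at = None    # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0            # success closes the circuit again
        return result
```

While the circuit is open the client returns the minimal fallback immediately instead of waiting on a dead service, which is what keeps front-end latency down.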

Don’t put all of your eggs in one basket

Distribute your data to many servers, so that no one server is so important you can’t safely work on it.

At PagerDuty, the team uses multi-master clusters, which help with operations and vertical scaling. They also use multiple database servers like Cassandra: No one server is that special, which means operational work can happen during the day.

Put together, these strategies help admins and operational engineers sleep more, worry less and maintain better—all ahead of schedule.

 Questions? Share your thoughts in the comments below.

Posted in Operations Performance, Reliability

October Hackday: iOS 8, Dev Docs, ChatOps, and more

We love hackday here at PagerDuty – it’s a great opportunity for everyone at the company to work on projects they’re passionate about, get the creative juices flowing, and see how we can mix up the tools and technologies we know to help out our users and each other. Last month was one of PagerDuty’s most exciting hackdays, with a great mix of cool “What if?” projects, open source contributions, and little tools to help out around the office.

We’ve really been enjoying iOS 8 here, and love all the new APIs that came with it. Steve won our “Awesome” category with a proof of concept that uses the Yosemite/iOS 8 “Handoff” feature to start incident response on an iPhone, then pick it up where you left off in your Mac’s browser.


Also within the iOS 8 theme, Alper and Clay showed off a “Today” widget that lets iPhone users know when they’re on-call next:


We actually just got these two projects into our newest iOS build, so look out for an update on your phone and download the app if you don’t have it!

We also had some great dev-tool projects.

Grant, Amanda, and Greg have been giving our developer docs some much needed love. Greg gave them a spiffy new skin; Grant moved them behind Cloudflare (improving latency for our EU customers) and enabled HTTPS (security yay!); and Amanda started on moving them from a custom Rails app to a static Jekyll backend.


Even the fancy scheduling power of Google Calendar can’t fight Parkinson’s Law, so David Y wrote a bot to help us figure out what meeting rooms are free via HipChat.


Our new Sr. PM Chris, along with the UX team, wrote a flashcard deck for the spaced repetition system Anki to help new employees put faces to names.


Tim wrote a plugin for Lita (our ChatOps bot) that looks up common security vulnerabilities from different databases, so that we can quickly access this info and show it in our team HipChat rooms.

Finally, Shack put together a guide on how to set up custom Chrome Omnibox searches for internal resources. We’re loading this into our default machine images to help new employees find things.


 

And that’s the October 2014 Hackday! Want to participate in the next one? Come work with us :)

 

 

Posted in Events, Features

Fostering Diversity in Tech with an Online Community Policy

PagerDuty has a social media policy. Here’s how we developed it, and why.

Shortcut to our Community Policy: http://www.pagerduty.com/community-policy/

At PagerDuty, we’re committed to promoting diversity in technology and fostering innovation from all people, regardless of what might make them unique or different. Creating a culture that truly promotes diversity takes thoughtful work and requires that we reflect on how small things that we do or say can be perceived by others. Unconscious biases (video here) are always at play. Simple things like the snacks and beverages you offer at office events can have unintended perceptions and consequences.

There is a decent amount of awareness around how we can create the right culture in physical spaces like the office workplace and professional conferences. But what about social media? The amount of harassment on social media is surprising and discouraging. We view social media conversations the same way we view person-to-person conversations in the office: inappropriate or harassing comments have no place in our physical spaces, and we don’t want to see them in our online spaces either.

Unfortunately, we had a surprise recently when we posted Twitter ads with pictures showcasing PagerDuty t-shirts. We received a few comments that, we felt, didn’t support the kind of culture we’re trying to create.

When we noticed these comments, the reaction was universal: this was not OK, and we had to do something about it. Everyone here, from the leaders of the company down, wanted to take action. And so the idea of developing an explicit online community policy was born.

We researched what other companies are doing, and were surprised to find a lack of information about how others handle social media harassment. It’s common for organizers of tech conferences to lay out anti-harassment guidelines for attendees; and, of course, most workplaces have internal anti-harassment rules. But there weren’t many examples of companies doing the same for social media.

What we finally based ours on was a template provided by the Geek Feminism wiki. Our policy says that we don’t tolerate harassment, regardless of an individual’s identity or affiliations, and that if we notice such behavior or if a member of our community reports a problem, we can take any action to address it up to and including expulsion. The policy applies to any online space PagerDuty provides, from Twitter and Facebook to our Community Forums.

It may not seem like much. After all, it’s only 160 words on a webpage. But we believe that clearly stating our position on social media harassment and putting a process in place to respond to it will help us reduce and prevent it over time. For one, it’s a tangible way of promoting the culture we want to create. For another, we can now point to clear guidelines and a list of possible consequences if an incident should occur.

So far, we haven’t needed to do anything as drastic as block or expel someone. It only took a couple of days to develop the policy and post it. Once it was up, we approached the people who’d made the problematic Twitter comments — by email, so it didn’t become a public discussion — and asked them to take their comments down. Most of them were surprised, not realizing they’d said anything offensive (which is common — most instances of harassment aren’t intentional). A couple of them pushed back, but we simply thanked them for their feedback and reiterated the request to remove their comments. We’ve had to call on the policy a few times since then as well, but so far straightforward requests have done the trick.

Our hope is to find ways to make the policy more visible to our followers and community members. But in the meantime, it’s giving us the tool we needed to create the kind of community spaces we want.

Questions or comments? Contact us at communities@pagerduty.com.

Posted in Announcements

Reducing your Incident Resolution Time

A little while back, we blogged on key performance metrics that top Operations teams track. Mean time to resolution (MTTR) was one of those metrics. It’s the time between failure and recovery from failure, and it’s directly linked to your uptime. MTTR is a great metric to track; however, it’s also important to avoid a myopic focus.

Putting MTTR into perspective

Your overall downtime is a function of the number of outages as well as the length of each. Dan Slimmon does a great job discussing these two factors and how you may want to think about prioritizing them. Depending on your situation, it may be more important to minimize noisy alerts that resolve quickly (meaning your MTTR may actually increase when you do this). But if you’ve identified MTTR as an area for improvement, here are some strategies that may help.
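
A small worked example makes the trade-off visible. The outage durations below are made up for illustration; the point is only that removing short, noisy incidents lowers total downtime while raising the average.

```python
def summarize(outage_minutes):
    """Return (count, total downtime, MTTR) for a list of outage durations."""
    count = len(outage_minutes)
    total = sum(outage_minutes)
    mttr = total / count if count else 0.0
    return count, total, mttr


# Hypothetical month: four short blips and one long outage.
before = [2, 2, 2, 2, 60]
# The same month with the noisy blips eliminated.
after = [60]

print(summarize(before))  # (5, 68, 13.6)
print(summarize(after))   # (1, 60, 60.0)
```

Total downtime drops from 68 to 60 minutes, yet MTTR rises from 13.6 to 60 minutes, which is why MTTR should not be read in isolation.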

Working faster won’t solve the problem

It’d be nice if we could fix outages faster simply by working faster, but we all know this isn’t true. To make sustainable, measurable improvements to your MTTR, you need to do a deep investigation into what happens during an outage. True – there will always be variability in your resolution time due to the complexity of incidents. But taking a look at your processes is a good place to start – often the key to shaving minutes lies in how your people and systems work together.

Check out your RESPONSE time

The “MTTR” clock starts ticking as soon as an incident is triggered, and with adjustments to your notification processes, you may be able to achieve some quick wins.

Curious to know how your response time stacks up? We looked at a month of PagerDuty data to understand acknowledgement (response) and resolution times, and how they are related. The median ack time was 2.82 minutes, and 56% of incidents were acknowledged within 4 minutes. The median resolution time was 28 minutes. For 40% of incidents, the acknowledgement time was between 0% and 20% of the resolution time.

Median Response Time: 2.82 minutes

Median Resolution Time: 28 minutes

Incident Response Time as % of Resolution Time

If your response time is on the longer side, you may want to look at how the team is getting alerted. Do alerts reliably reach the right person? If the first person notified does not respond, can the alerts automatically be escalated, and how much time do you really need to wait before moving on? Setting the right expectations and goals around response time can help ensure that all team members are responding to their alerts as quickly as possible. 
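
To reason about escalation timing, it can help to model the policy as a list of levels with timeouts. The sketch below is a simplified model, not PagerDuty's escalation engine; the responder names and timeouts are hypothetical.

```python
def who_is_notified(minutes_unacknowledged, levels):
    """Return the responder at the current escalation level.

    levels is a list of (responder, minutes to wait before escalating past them).
    """
    elapsed = 0
    for responder, timeout in levels:
        elapsed += timeout
        if minutes_unacknowledged < elapsed:
            return responder
    return levels[-1][0]  # nobody left to escalate to; stay on the last level


# Hypothetical policy: 5 minutes for the primary, then 10 for the secondary.
policy = [("primary", 5), ("secondary", 10), ("manager", 15)]
```

With this policy an incident that sits unacknowledged for 5 minutes moves to the secondary, and after 15 minutes it reaches the manager. Shortening the first timeout is the most direct way to cut the worst-case response time.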

Establish a process for outages

An outage is a stressful time, and it’s not when you want to be figuring out how you respond to incidents. Establish a process (even if it’s not perfect at first) so everyone knows what to do. Make sure you have the following elements in place:

  1. Establish a communication protocol - If the incident is something more than one person needs to work on, make sure everyone understands where they need to be. Conference calls or Google Hangouts are a good idea, or a single room in HipChat.
  2. Establish a leader - This is the person who’ll be directing the work of the team in resolving the outage. They’ll be taking notes and giving orders. If the rest of the team disagrees, the leader can be voted out, but another leader should be established immediately.
  3. Take great notes - Record everything that’s happening during the outage. These notes will be a helpful reference when you look back during the post mortem. At PagerDuty, some of our call leaders like using a paper notebook beside their laptop as a visual reminder that they should be recording everything.
  4. Practice makes perfect - If you’re not having frequent outages, practice your incident response plan monthly to make sure the team is well-versed. Also, don’t forget to train new hires on the process.

To learn more, check out Blake Gentry’s talk about incident management at Heroku.

Find and fix the problem

Finding out what’s actually going wrong is often the lion’s share of your resolution time. It’s critical to have instrumentation and analytics for each of your services, and make sure that information helps you identify what’s going wrong. For problems that are somewhat common and well understood, you may be able to implement automated fixes. We’ll dive into each of these areas in later posts.

Posted in Operations Performance

Super Charge Data Infrastructure Automation with SaltStack and PagerDuty

Welcome, SaltStack, to the PagerDuty platform! SaltStack is an open source configuration management and remote execution tool that allows you to manage tens of thousands of servers. With the latest PagerDuty integration, you can monitor failures, oversee changes to your infrastructure, keep tabs on system vitals, and manage code deployments. Already mighty on its own, SaltStack has super powers when integrated with PagerDuty.

Monitoring Failures

The most obvious use of the SaltStack and PagerDuty integration is triggering alerts when things break. If your deployment doesn’t go as planned, we’ll let the right person know.

Monitoring Changes

Salt states allow you to declare what state a server should be in, and if it’s not compliant, make the necessary changes to enforce the state. State changes will trigger incidents in PagerDuty which you can acknowledge and triage. With Salt’s “onchanges” requisite, you’re all set.

Monitoring System Vitals

Like Salt states, Salt monitoring states let you define what thresholds your systems should be running at. While monitoring states aren’t designed to make or enforce changes to your deployment, they’ll monitor your systems’ vitals and generate an alert when your system runs outside the bounds that have been configured.

Code Deployment

Salt can also help you manage code deployments across your infrastructure. For example, you may need to stop a web server, deploy your code, then restart the web server. You’d then want to make sure the web application is still functional after the web server restarts and that the new deployment hasn’t placed stress on the web server. Using Salt and PagerDuty, you can automate this process and be alerted in real time if things don’t go as planned.

Learn More at AWS re:Invent

We’re rallying the troops and heading down to Vegas for AWS re:Invent this week. If you want more information on our newest integration – or for questions, swag or just a chat – visit booth 948. Our buds, SaltStack, will be at K19.

Ready to get started? Check out the guide for setting up SaltStack with PagerDuty.

 

Posted in Announcements, Events, Partnerships

Movember is Upon Us….


That’s right…it’s one of our favorite times of the year here at PagerDuty: Movember. Not only do we get to grow awesome mustaches, but we are also supporting a great cause as a work community. There’s something special and exciting about watching an epic ‘stache evolve over the month while supporting and raising awareness for men’s health. For the third year in a row, we’ve joined forces with the Movember Foundation, the leading organization committed to – quite literally – changing the face of men’s health.

How does it work, you ask? PagerDuty “Mo Bros” shave their face on the first day of the month and after that, no mo’. Sure, you can trim and groom the ‘stache however you see fit – some opt for the handlebar, the villain, the sheriff – but the object is to grow it out as long as possible. Of course, we can’t forget about our “Mo Sistas” either! Ladies can show support for the cause by creating buzz around the office or donating to the Movember Foundation.

At the end of the month, we compare “results” and vote on the most epic mustache, and the winner is crowned Mr. Movember. The most enthusiastic gal throughout the month earns the title “Miss Movember.” We actually get quite competitive about it. It’s not uncommon to see a bidding war to shave mid-month between the most threatening, hefty Mo Bro ‘staches. We end the month with a Movember Party where the “Mo Bros” shave the mustaches and we snap after shots. Sounds like fun? ABSOLUTELY. Join our team and track our progress at our Mo Space page. Team name: PagerDuty.

Go Mo Bros, grow!

Posted in Events

Who watches the watchmen?

How we drink our own champagne (and do monitoring at PagerDuty)

We deliver over 4 million alerts each month, and companies count on us to let them know when they have outages. So, who watches the watchmen? Arup Chakrabarti, PagerDuty’s engineering manager, spoke about how we monitor our own systems at DevOps Days Chicago earlier this month. Here are some highlights from his talk about the monitoring tools and philosophies we use here at PagerDuty.

Use the right tool

New Relic is one of the tools we use because it provides lots of graphs and reports. Application performance management tools give you a lot of information, which is helpful when you don’t really know what your critical metrics are. But they can be hard to customize, and all that information can result in “analysis paralysis.”

PagerDuty also uses StatsD and DataDog to monitor key metrics, because they’re easy to use and very customizable, though it can take a little time (we did a half-day session with our engineers) to get teams up to speed on the metrics. SumoLogic analyzes critical app logs, and PagerDuty engineers set up alerts on patterns in the logs. Wormly and Monitis provide external monitoring, though the team did have to build out a smarter health check page that alerts on unexpected values. And, finally, PagerDuty uses PagerDuty to consolidate alerts from all of these monitoring systems and notify us when things go wrong.

Avoid single-host monitoring for distributed systems

“Assume that anything that’s running on a single box is brittle and will break at the most inopportune time,” says Chakrabarti. Rather, PagerDuty sets up alerts on cluster-level metrics, such as the overall number of 500 errors, not the number in a single log file, and overall latency, not one box’s latency. For this reason, PagerDuty tries to funnel all of their systems through the monitoring system rather than feeding data directly from the servers or services into PagerDuty.

We funnel server and service alerts through a highly-available monitoring system so that we alert on the overall impact rather than individual box issues.
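
The difference between per-box and cluster-level alerting can be sketched in a few lines. The host names, counts, and the 5% threshold below are assumptions for illustration, not PagerDuty's actual configuration.

```python
def cluster_error_rate(per_host):
    """per_host maps host -> (error_count, request_count) for one time window."""
    errors = sum(e for e, _ in per_host.values())
    requests = sum(r for _, r in per_host.values())
    return errors / requests if requests else 0.0


def should_alert(per_host, threshold=0.05):
    """Alert on the overall error rate rather than any single host's rate."""
    return cluster_error_rate(per_host) >= threshold


# Hypothetical window: one box is unhealthy, but the cluster absorbs it.
window = {"web01": (50, 100), "web02": (0, 1000), "web03": (0, 1000)}
```

Here `web01` fails half of its requests, yet the cluster-wide rate is 50 out of 2,100 requests, roughly 2.4%, so `should_alert(window)` stays quiet. The alert fires only once the overall impact crosses the threshold.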


Chakrabarti also discusses dependency monitoring, or how to monitor the performance of SaaS systems that you don’t control. There’s no great answer for this problem yet. We do a combination of manual checks and automated pings. As an example, he tells the story of getting a call from a customer who wasn’t getting their SMSes. Upon investigation, it turned out that our SMS provider was sending the messages, but for some reason the wireless carrier was blocking them. As a result, we built out a testing framework, “a.k.a. how to abuse unlimited messaging plans.” Every minute, we send an SMS alert to every major carrier, and measure the response times.

We send SMS messages to the major mobile carriers every minute and measure the response times to make sure we know if the carriers are experiencing issues that may be affecting the deliverability of our SMS alerts.

Alert on what customers care about

A lot of people make the mistake of alerting on every single thing that’s wrong in the log, Chakrabarti says. “If the customer doesn’t notice it, maybe it doesn’t need to be alerted on.” But, he warns, the word “customer” can mean different things within the same organization. “If you’re working on end-user things, you’re going to want to monitor on latency. If you’re worried more about internal operations, you might care about the CPU in your Cassandra cluster because you know that’ll affect your other engineering teams.” We have a great blog post on what to alert on if you want to learn more.

Validate that the alerts work

Perhaps the best example of watching the watchers is the fact that “every now and then, you might have to go in manually and check that your alerts are still working,” says Chakrabarti. “We have something at PagerDuty we call Failure Friday, when basically we go in and attack our own services.” The team leaves all the alerts up and running, and proceeds to break processes, the network, and the data centers, with the intent of validating the alerts.

What has the team learned from Failure Friday? “Process monitoring was co-mingled with process running,” Chakrabarti explains. “If the service dies, the monitoring of the service also dies, and you never find out about it until it dies on every single box.” And that, in short, is the reason for external monitoring.

Posted in Events, Operations Performance, Reliability