Breaking Down Silos Doesn’t Happen Overnight

This is the second post in a series to help your engineering team transition to a DevOps organizational model. Here we’ll discuss how to start breaking down silos in your organization. To start at the beginning of the series, read Why You Need to Establish a DevOps Culture.

Introducing a DevOps culture to your company and breaking down silos isn’t like making other changes in your organization. You don’t submit a proposal and neatly weigh the pros and cons; if you do, you will probably meet a lot of resistance. Obtaining buy-in from business stakeholders can feel nearly impossible. But this doesn’t mean you shouldn’t advocate for change; you may just need to be stealthy in your approach.

You’ll need lots of patience if you commit yourself to changing your peers’ mindsets. Start by forming strong relationships and teaching concepts before introducing the tools needed for testing and configuration management.

Start with What You’ve Got

People are naturally resistant to change. To business leaders, change sounds costly. To others, change feels disruptive. Your best bet is to nurture your company’s pre-existing culture. You will need to feed your company’s own values back to them and determine what everyone really cares about.

Start by assessing your culture. Does your company focus on the customer, or is it really all about the money? Are you laid back, or do you have a regular ol’ corporate environment? Do your office parties have red Solo cups or crystal champagne flutes? Are your offsites playing paintball or hitting a round at the country club? It’s important to take a step back and figure out how to transition to a DevOps model within your current culture. Change doesn’t happen overnight, and expecting your company to rapidly do a complete 180 will be futile.

After you have assessed your culture, start reinforcing the parts your teammates like and the parts that align with your ideal DevOps culture. Next, get everyone participating in it. Hold events for cross-functional teams to participate in. Go bowling together, grab a beer, mix up teams and head to a paintball course. Or simply invite people to a larger brainstorming session just to hear their ideas and build relationships. Working together is the first step toward shared accountability, so you can succeed, and sometimes fail, together.

“Beer is the most powerful tool in the DevOps environment. Take everyone down to the bar and just let them talk.” – Ben Rockwood, Joyent [Source: Keynote Address at the 25th Large Installation System Administration Conference (LISA ‘11)]

Inspire Others by Valuing Their Opinions and Including Them

Speaking of getting cross-functional teams together to brainstorm: inspiring and including others will do wonders to break down silos in your organization. In a siloed culture, it’s easy to feel left out. That leads to thinking your ideas don’t matter and you can’t change anything, and eventually being excluded may give way to apathy.

Start by inviting both developers and operations engineers to conversations about what is required for a project. Don’t delegate tasks, but instead let people play an active role in figuring out the best way to accomplish a goal. Having input and playing a role in the decision making for a project will create a sense of ownership and pride in one’s work. This will give you the ability to recognize people for their contribution to further reinforce this sentiment.

Beginning with a single project will pave the way for your company’s entire engineering culture. This allows people to see projects from their initial conception all the way through to their completion. This will not only make people feel pride in their work, but start breaking down the “Us vs. Them” sentiment that often runs through development and operations teams. It’s about succeeding together, not pointing fingers when something unexpected happens.

It only takes a question or a meeting invite to spark a conversation that makes people feel engaged and builds a sense of responsibility for a project’s success. You will not only find solutions for your single project, but start to identify processes and tactics for working together in the future.

Show, Then Tell Concepts

You will want to pursue a show-then-tell approach when teaching concepts. By showing, you lead by example and demonstrate proof of concept. It’s important to be genuinely interested in helping your colleagues improve their work by adding to their skill sets.

For example, if you are on your company’s operations team and want to help a developer be more productive, you can show, then tell, him or her how to run a command. By learning from your example rather than just hearing an explanation, they will be more likely to replicate these actions in the future.

You will also want to avoid words that carry negative connotations, have vague meanings, are associated with fads or may be considered empty buzzwords, such as “DevOps.” If you really want a DevOps culture, you will need to live it to give it meaning, not just talk about it.

When you do need words to reinforce what you have shown, focus on the benefits of collaboration and improved service delivery, such as shipping features to customers faster so they’ll get off your backs.

Find Your Advocates

By now you have gotten your team to work together, inspired them and even taught them a few new concepts to add fluidity to their roles. Hopefully, you have learned a few new tricks along the way as well. This is a two-way street after all.

Unfortunately, your work is far from done. Chances are not everyone was receptive to your invitations or collaborating with you. Not everyone was open to learning. If it was that easy, wouldn’t everyone be using a DevOps organizational model?

To combat this apathy for the cause, you will need to find advocates on different teams to help teach and give your ideas more exposure. For any culture shift to take place, more than one person will need to advocate for it. Find those who seemed most receptive to your invitations and teachings, then recruit them.

For the naysayers, help them understand (maybe over a few beers) how transitioning to a DevOps operational model will directly impact their lives. Discuss what each team wants and needs from the others, and request that they advocate from their side, bringing everyone together to start solving problems from the beginning.

Advocates can get more of their teammates on board by reassuring them that the transition will be made with everyone’s input in mind. By getting people engaged in honest, beneficial collaboration you will finally be on your way to creating a true DevOps culture.



Why You Need to Establish a DevOps Culture

This is the first post in a series to help your engineering team transition into a DevOps model. We’ll start with the whys and get to the hows in future posts. Stay tuned.

DevOps is a software development approach that focuses on collaboration between developers and operations, where developers are empowered to own their code from cradle to grave and operations builds automation tools for the devs to use. Under a DevOps model you share a common goal: quickly delivering quality products and services through more frequent deployment and collaboration.

Most of us have spent time working in an office with so much red tape it can take a month just to get a new stapler or replace a broken keyboard. Nobody likes these unnecessary and seemingly countless levels of approval, especially for small updates such as a homepage sign-up button or a quick bug fix to solve a customer’s frustration.

In response to these blockers, we go into our offices, sit at our desks and commit to only a small piece of the puzzle. We just can’t get things done, so why bother? But we want to do more. We want to feel valuable and know that we can contribute more to make things better. For engineers, these roadblocks can impact workflows, stunt creativity and hurt the product.

Forming a DevOps culture in our companies will help us eliminate many of these roadblocks. The DevOps model is a software development ideology that encourages developers and operations engineers to work together in order to make products better and faster through more frequent deployment and automation.

Instead of drowning in your waterfall model, linearly moving from one step to the next, adopting a DevOps culture will help create a collaborative environment where work is distributed between ops and devs. This allows ops engineers to develop systems that enable devs to deploy their own code, and to deploy it frequently.

DevOps Teams are Holistic

In a properly constructed culture, both software developers and ops engineers have their roles intertwined. Instead of tossing problems over the wall with little regard for the people on the other side, they align their teams rather than work against each other.

While most of us have our core disciplines we specialize in, our professional environments should be conducive to learning and using other skills, even if they aren’t listed on our resume. With more fluidity in our roles we can get more done because we will feel more empowered to do so.

Eventually, this blending of roles will lead to organic collaboration and socializing among teams. This keeps your team flat and goals aligned, aiding a culture that emphasizes line-level communication between teams so code can be deployed more frequently and reliably.

Because everyone takes ownership of their own work from cradle to grave, they hold themselves accountable for any issues that arise instead of pointing fingers. For example, a developer deploying a new feature will own the reliability of that feature and not simply toss responsibility to the operations team.

Did We Mention More Frequent Deploys?

Let’s drill down on this point a bit more. Faster, more frequent deploys are better for your business. In a DevOps organizational model you move much smaller blocks with each deploy, which means less risk: the chances of something going wrong in a small block are far lower than when moving multiple, larger blocks at once. Should something break, you roll back one small piece of code instead of combing through and re-QAing months’ worth of work to identify the problem.

Within a DevOps model each change to your environment is easier to monitor, so you can measure key metrics and improve your system based on data. With proper automation, such as continuous integration, your development environments will finally keep up with your production environments, allowing you to confidently test new code without your customers experiencing hiccups in your service. This makes your code’s behavior significantly more predictable, so customers can seamlessly enjoy new features in your product.

Define Your Culture

In the end, your culture is unique to your team and business (and it should be). We’ll help you to identify the steps you need to take to start creating your own collaborative environment, but its outcome will be something none of us can predict.

Keep reading: over the next several weeks, we’ll dive deeper into how you can introduce a DevOps organizational model into your company’s culture.



Avoid an Inbox Full of Stress, Get Everyone On-Call

Whenever we meet someone, the first question we’re asked is what we do for a living. We are always on the job, even though we try our hardest not to be.

While this can cause stress or worry, it also creates a sense of ownership over our responsibilities. None of us wants someone else to have to pick up the slack because we aren’t around.

In a sense, we’re all already on-call. From the support person receiving a customer email late into the evening, to the office manager who has to respond to an alarm going off in the office on a Saturday. So why not make it official?

It’s Crazy: Solve Problems Only When There Are Problems

By explicitly assigning responsibility for incidents, you avoid making everyone own every problem. Designated responders can react to incidents faster, knowing they are accountable, instead of waiting for someone else to respond.

By resolving incidents faster you minimize customer impact. The office manager who knows about a broken AC unit can make arrangements so the team isn’t delayed come Monday morning. Support reps can make sure customer issues get a response, alleviating chaos and avoiding a public outcry for help on social media.

Why Should I Care About Being On-Call?

Easy. Remember the transitive property from algebra class? If A = B and B = C, then A = C. In terms of your business, happy customers = happy business leaders and happier business leaders = happier you. Therefore (pretend we inserted that three dotted triangle) happy customers = happy you.

Instead of getting an inbox full of customer complaints in the morning, being on-call and addressing the issue the night before can make everyone on your team and your customers happy.

This accountability may be a tad annoying when you are on-call. But when it’s not your shift, isn’t it great to know someone else is making sure you aren’t going to be overly stressed the next day?

Not only will you increase customer trust by being responsive and available, you can become a leader in your field. In the process you will be strengthening your team and putting in place predictable measures to build relationships and cooperation among cross-functional teammates.

That’s Nice, But It Can’t Be Done.

Wrong! At PagerDuty we see it every day. Our customers are leaders in their fields and many have introduced an on-call mentality of accountability at several layers of their organization. They have accomplished this by creating cultures that support each other, instead of pushing responsibilities to each other.

At zeebox, both technical and non-technical teams are on-call. This helps the company strengthen relationships with its partners by ensuring that its SpotSynch technology, which provides users with clickable TV commercials on their mobile devices, is working properly. For low-severity issues, it’s routine for a content producer or editor to receive a page.

Simple offers a modern banking solution and an equally modern outlook on efficiency among teammates. At Simple, everyone from operations engineers and developers to the risk team, support reps and even the PR team has the chance to be on-call. With everyone on-call, time in the office is focused on communicating with each other instead of putting out fires.

These are just two examples of amazing companies that are being proactive and taking accountability for their systems and business practices. Who goes on call in your organization?


Hack Your On-Call Status with PagerDuty’s API

Knowing your on-call status is more important than knowing if it’s raining outside. Unlike dealing with the drizzle that’s passed over San Francisco recently, if I’m on-call I need to leave the house with more than an umbrella.

Your on-call status is an integral part of PagerDuty’s iOS and Android apps. We’ve also created a few new endpoints on our REST API that make getting your current on-call status more straightforward. Full documentation of these new endpoints, as always, is on our developer documentation site.

To stay informed of my current on-call status, I’ve hammered out a couple of shell script one-liners and a JavaScript bookmarklet. You will need a PagerDuty API key to try the scripts out. These were written for a Mac OS X environment with node installed, but can easily be modified to run in other environments.

Send an Inspirational Email Reminder

Sometimes it’s useful to treat an escalation policy as a mailing list. For a particularly rough on-call rotation, this script will send an inspirational email to everyone on-call for a particular escalation policy. Animal videos are always appreciated.
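Here’s a minimal sketch of the idea, expanded from a one-liner for readability. The response shape (an "escalation_policies" array with "on_call" user entries including an email) is an assumption to verify against the developer docs; "acme", YOUR_API_KEY and the "Ops" policy name are placeholders to swap for your own:

# A sketch, not a definitive script. Endpoint response shape assumed;
# subdomain, API key and policy name are placeholders.
curl -s -H "Authorization: Token token=YOUR_API_KEY" \
  "https://acme.pagerduty.com/api/v1/escalation_policies/on_call" |
python -c '
import json, sys
for policy in json.load(sys.stdin)["escalation_policies"]:
    if policy["name"] == "Ops":            # the rotation having a rough week
        for entry in policy["on_call"]:
            print entry["user"]["email"]
' |
while read addr; do
  echo "Hang in there. Obligatory animal video: http://example.com/cats" |
    mail -s "On-call inspiration" "$addr"
done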

A Pleasant Voice Reminder

amioncall_terminal

When you have over 50 tabs open in Chrome, more text on the screen is the last thing you need. I’ve created an alias of this script so when I type “amioncall” in the terminal Mac OS X will speak my on-call status in a pleasant voice.
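Mine boils down to something like the following sketch. The user-scoped on_call endpoint and its response shape are assumptions to check against the developer docs, and the subdomain, API key and user ID are placeholders; drop the function in your shell profile and alias away:

# Sketch of the "amioncall" helper; response shape assumed, placeholders used.
amioncall() {
  curl -s -H "Authorization: Token token=YOUR_API_KEY" \
    "https://acme.pagerduty.com/api/v1/users/YOUR_USER_ID/on_call" |
  python -c '
import json, sys
user = json.load(sys.stdin)["user"]
# Assumes "on_call" is a non-empty list while you are on duty.
print "You are on call." if user.get("on_call") else "You are not on call. Relax."
' |
  say
}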

Bookmarklet for your Toolbar

javascript_toolbar

I have on-call handoff notifications configured for all of my accounts, but sometimes it’s useful to know how long you have left until your shift ends. I wrote a tiny JavaScript bookmarklet that sits in my toolbar and pops up my on-call status on any page.

Hooking it All Up!


I Married an On-Call Engineer

This is a guest blog post from Katie Newland. It’s a reaction to her spouse receiving PagerDuty notifications at inopportune times and how his on-call responsibilities impact their relationship. Do you have a story about how your relationship has been impacted by on-call responsibilities? Shoot us an email at support@pagerduty.com and we might feature you or your significant other’s perspective in the next edition of this series.

I eye him across the room. Brow furrowed, face shrouded by the hood of his octocat hoodie, completely entranced by that big black screen with white text – what a stud. How did I get so lucky? He’s always been the smartest guy in the room. He’s logical, responsible and cool under pressure in a way that always makes me feel like I never have to worry about anything. One would think I hit the spousal jackpot. And yet, he’s not perfect – he has another lover. And I must constantly compete for his attention – making sure I wear the cutest outfit, have the wittiest comment, come up with the best ideas and still – this other lover – she gets more attention than me. He drops everything at her beck and call, even in the middle of the night and on holidays. She is the center of his whole world. The moment she calls, I cease to exist: we shift priorities, cancel our plans and all of her needs are tended to first. Who is this captivating consort? PagerDuty.

I was first introduced to this two-timing man in college. But in those days, I was his one and only true love. Our romance bloomed over competitive games of flip cup, late-night pizzas and PBRs – all without a second thought about uptime. After graduation, he got his first real job. That’s when things changed.

Upon first introduction, I viewed PagerDuty as my arch nemesis. At this job, Jesse was the only one on call, lugging his laptop everywhere he went: weddings, bars and even the grocery store. Sleeping four consecutive hours became a luxury. After missing a page while scuba diving, he became even more reluctant to partake in activities that prevented him from being within arm’s reach of his computer.

But, over the years (and with a much improved on-call shift), I’ve accepted this other lover into our marriage and have even fostered their affair. It’s me who frantically shakes him awake during midnight pages. It’s me who presses 4 to acknowledge an alert while simultaneously tossing him a towel during mid-shower pages. And it’s me who apologizes profusely to our friends while he resolves outages during their party.

Though PagerDuty always seems to surface at the most inopportune times, I have to admit she’s not all bad. A loyal sidekick, PagerDuty allows him to be the first to know if problems arise and gives him a head start on getting things back up and running. PagerDuty is reliable. When an alert comes in, it’s legit. And other teammates can easily escalate to him in case everything is burning down.

Ops spouses have it rough. Over romantic candle-lit dinners, discussion turns to databases, DNS servers, NTP reflection attacks, and SSDs. He can quickly sink into what I call the “code zone,” in which whatever he’s hacking on is so intriguing he can’t see or hear me. I’ve even stooped so low as to text him “pagerduty alert” in hopes of snapping his attention back. Sadly, my faux pages do not garner the same immediate response. Though I often play second fiddle to PagerDuty, I’m immensely proud of the work he does and what he’s accomplished. And you wouldn’t know it by his programmer-toned arms, but my Jesse is a badass – rescuing the world one DDoS at a time.


Please Stop My Monitoring Alert Noise

We get it. You hate getting alerts. As Jason Floyd, Senior DevOps Manager at Real Networks, put it: “I love you and I hate you. PagerDuty makes my job easier and wakes me up at 4 AM.” You are (kind of) OK with getting woken up at 2 AM if your server is really on fire. But if it’s a minor issue that doesn’t affect your end users’ experience, you want your sleep. You want your alerts to be smarter and to bother you only when you really need them. Below are four common monitoring alert pains we hear about and how PagerDuty helps.

I am woken up for unimportant issues

Your monitoring tools’ alerts are the digital version of the boy who cried wolf. After getting bombarded with low-severity alerts, when the big one hits, you treat it as just another false alarm. Noisy alerts make you feel sluggish and not your usual nimble self. With PagerDuty, you can filter alerts, much like your inbox does, based on subject, body and from address. Only get alerted when you need it.

I am getting multiple alerts for the same issue

Getting alerted at 2 AM is hard enough without being alerted throughout the night for the same issue you already acknowledged. When multiple alerts for the same issue come in around the same time, PagerDuty automatically de-duplicates them so only one incident is created. As long as the incident is open, any alerts for the same issue are appended to the original incident and no new alerts are fired off. You still have a log of all the alerts for reporting purposes, but you won’t suffer from alert fatigue.
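You can watch the de-duplication happen by sending two triggers with the same incident_key through the generic events integration API. A sketch; YOUR_SERVICE_KEY stands in for a real Generic API service key:

# Two triggers sharing an incident_key: the second is appended to the
# open incident instead of paging you again.
for i in 1 2; do
  curl -s -X POST "https://events.pagerduty.com/generic/2010-04-15/create_event.json" \
    -H "Content-Type: application/json" \
    -d '{"service_key": "YOUR_SERVICE_KEY",
         "event_type": "trigger",
         "incident_key": "db01/disk-full",
         "description": "Disk usage on db01 above 95%"}'
  echo
done

Both events end up logged on a single incident, so you get one page and still keep the full alert history for reporting.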

I am getting alerted for issues I’m not responsible for

Getting alerts for something you’re not responsible for is like getting mail addressed to the tenants who used to live in your unit. You just don’t want it. You feel guilty ignoring it, but you’re not really sure what to do with it, so you either let it sit or try to send it back. And if you’re the intended recipient of urgent mail, you want it immediately. To decrease confusion and delays, PagerDuty routes alerts directly to the person responsible, and you can re-assign issues with a click. No more awkward exchanges.

I am getting many alerts at the same time and can’t keep track

During a large outage, many alerts are sent from multiple tools. Your phone is screaming at you and you’re ready to throw it against the wall. Avoid breaking your phone with PagerDuty’s alert bundling: PagerDuty groups alerts occurring at the same time into one alert, while each unique alert is still logged individually so you can manage each incident in turn. Additionally, instead of logging into multiple tools to view incidents, get a complete view and track all your incidents in a single window with PagerDuty. Split screens cause split minds, and you need to be sharp when fighting your IT fires.


Injecting Failure at Netflix, Staying Reliable for 40+ Million Customers

Corey Bertram, Site Reliability Engineer at Netflix, recently spoke to a DevOps meetup group at PagerDuty HQ about injecting failure at Netflix. Corey wanted to show people what can go wrong, because anything that can go wrong, will. Promoting chaos and injecting failure has been a great way to keep Netflix up and running for its 40+ million customers.

Tasked with Netflix’s uptime and reliability, Corey said,

“I spend a lot of time thinking about how to break Netflix”

According to Corey, by injecting failure into their production systems Netflix has been able to significantly improve how they react to failure.

A rarity, especially among large companies, is Netflix’s culture of Freedom and Responsibility. Every developer at Netflix is free to do whatever they think is best for Netflix. Their estimated 1,000 engineers are encouraged to be bold and solve problems, which allows everything at Netflix to happen organically. It’s for this reason that Netflix does not have an operations team; instead, every engineer is responsible for their own services from conception through production.

Corey admits this creates a hostile environment for engineers where every incident is unique and no singular person knows how it all works. But, when their engineers are told to go wild, they do. They don’t shy away from challenges and find solutions to problems no other company has ever experienced.

Netflix has hundreds of databases and hundreds of services in the production path, which makes frequently injecting failure in their system necessary for their continued success and growth.

“No one knows how Netflix works. I say that in the most sincerest way possible. No one understands how end-to-end this thing works anymore. It’s massive.”

Taking a Different Approach to Failure and Reliability

Deploys are happening 24/7 at Netflix, so anything can happen across their tens of thousands of instances at any time. Because of this they have decided to focus on clusters rather than individual incidents. According to Corey, it’s easier to roll back a thousand services than just one, so you can spot trends.

Corey admits that Netflix doesn’t test, in the traditional sense: it’s impossible to mimic what has been built in production in a testing environment. That doesn’t mean no testing is done, but their testing environments are only a small fraction of what is occurring in production. When services are deployed they face an entirely new environment shaped by the unique conditions of production.

“From a reliability standpoint, we are kind of just along for the ride.”

In light of not having a representative test environment, Netflix has automated everything and created the Simian Army.
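For a flavor of the approach, here is the simplest possible chaos-monkey-style sketch. It is not Netflix’s tooling (the Simian Army is its own open source project): it just picks one running, opted-in instance at random and terminates it, using real AWS CLI calls and an invented chaos=optin tag as the opt-in convention:

# Minimal chaos sketch, not the Simian Army. The chaos=optin tag is an
# invented convention standing in for whatever opt-in your teams agree on.
victim=$(aws ec2 describe-instances \
  --filters "Name=tag:chaos,Values=optin" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text |
  tr '\t' '\n' | shuf -n 1)
if [ -n "$victim" ]; then
  echo "Injecting failure: terminating $victim"
  aws ec2 terminate-instances --instance-ids "$victim"
else
  echo "No opted-in instances running; nothing to break today."
fi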

Inject Failure… But Don’t Break Netflix

Reliability is secured at Netflix by continuously automating the testing of production systems. By purposely poking at their systems they can see if those systems can really stand up in a fight. But instilling the need for reliability meant that the concepts had to be sold internally. To do this, Netflix decided to brand, promote and incentivize the use of their process, the Simian Army.

Start Small. Find your easy wins and keep it simple by going after your low-hanging fruit. According to Corey, it’s these easy wins that will bite you if they are ignored. Don’t get bogged down creating hundreds of test scenarios.

Log Everything. Some people say Netflix is a logging company that happens to stream videos, because they log every customer action to gain insight into what’s working and what’s not. You can’t be successful without insight, so log everything: all of your metrics, graphs, alerts, everything. You will want to invest heavily in your insight infrastructure in order to scale.

Scale to Zone Reliability Testing. Zone-level testing is a great way to see how you will handle a zone outage. Netflix builds everything in threes, so they should be able to withstand the loss of a zone. For Netflix, Chaos Gorilla automates the relocation of traffic, scales traffic, then nukes everything.

Tip: If you are on Amazon, Corey recommends using asymmetrical load balancing to avoid throwing a ton of traffic into one zone that is still standing after an outage.

Allow Opt-Outs, But Encourage Opt-Ins. You may not want all of your services to experience failure, as this may cause delays or the loss of weeks’ worth of work. You want to build relationships with your developers, not burn them by destroying their work.

Get a War Room (They’re Critical). When running failure automation it’s essential to have a rep from every team present. You don’t know how your system may react to the failure. Having everyone together to monitor the services they are responsible for makes it easy to react and address what you have learned.

Repeat. Often. Currently, Netflix runs their failure automations quarterly, and is moving to a bi-weekly cadence. This isn’t simple or easy, but it’s necessary if you want to scale and stay reliable.

Corey sums up that if you are looking to increase reliability, it is not a task you can take on alone, or you will fail. And while you will always need to balance reliability against cost and innovation, he reminds us that it’s even more essential to keep it simple.


Build Out Your PagerDuty Reports with Zoho

Two of the most important metrics for any on-call team are Incident Volume and Mean Time to Repair (MTTR). Tracking how many incidents are coming into your system – and from which services – helps you identify both systemic infrastructure issues, and also misconfigured monitoring tools. Whether it’s a problem with the core system or just a monitoring threshold that needs adjusting, if you’re seeing dozens of incidents a day, there’s something there to fix!

Tracking MTTR helps show you how quickly your team is resolving issues. While incidents are always going to vary some in complexity, by looking at high-level trends in MTTR over time and across different escalation policies or services, you can start to identify opportunities to improve the way your team solves problems. Is one escalation policy consistently solving things faster than others? Maybe they’ve built more reliability into their system, or maybe they’re collaborating over HipChat or storing incident runbooks in the ‘notes’ field for each PagerDuty incident. Tracking MTTR outliers can help managers identify best practices as well as places to help struggling teams.

You can currently see some incident data in the PagerDuty “Reports” tab, but I wanted to build some additional reporting features to help us track down which services and escalation policies had unusually high (or low) incident volumes and MTTR.

I got a tip on a cool way to do this from our friends at Outbrain: they use Zoho Reports to query our API once an hour, then build dashboards from that data. After a few hours of wrangling with Zoho, here’s our report:

dash_report

You can filter by date, escalation policy, assignee and service. You can also click into any data point to see details about the related incidents!

To start building your own reports, we’ve put together a quick guide to help you through the process. First, set up an account in Zoho Reports. You can get basic reporting with a free account. Paid versions are available, which add additional data allocations as well as private, shareable dashboards.

Basic Setup

Create a new table, and choose “Import Excel, CSV, HTML, Google Drive….”

PagerDuty_reports_zoho_2

Choose "JSON" as the type, "Web" as the source, and enter "https://<your_subdomain>.pagerduty.com/api/v1/incidents" into the URL field. This calls the PagerDuty API for a list of incidents.

PagerDuty_reports_zoho_3

Next, you’ll see a list of available columns. Feel free to remove some from the import, and if you’d like, double-click the headers to rename them into something more readable. Then click “Next.”

zoho_import

You should then see something like this: a table of all your incidents. If you’d like to clean it up a little (and didn’t pick and choose columns on import), click the button on the right to adjust which columns show, and in what order.

PagerDuty_reports_zoho_5

Populating the incidents table

Now let’s get this table populated. The PagerDuty API has a row limit of 100, which means we can only get 100 rows of data on each call.  However, we can add an “offset” parameter to our API call to control which row that 100 starts at. To start to fill up our table, first click “Import into this table.”

PagerDuty_reports_zoho_6

Select “incidents.id” as the column to match on, to make sure we don’t import duplicates. Then add “&offset=100” to the URL (you shouldn’t have to change anything else).

PagerDuty_reports_zoho_7

You’ll go through the same process of selecting columns to import, but shouldn’t have to change anything:

zoho_import2

Click “Create” and you should have 100 more rows in your table!

If you want to import a lot of historical data, you’ll need to go through this process once per hundred records, setting offset to 200 the next time, then 300, and so on. If you have a large amount of data to import, you may want to call the API directly from the command line and use a script to dump the info to CSV.
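That script could be as simple as the following sketch, which assumes the v1 incidents endpoint shown above, an API key, and jq installed; the "acme" subdomain, YOUR_API_KEY and the chosen columns are placeholders to adapt:

# Sketch: page through /api/v1/incidents 100 rows at a time into a CSV.
echo "id,created_on,status,service" > incidents.csv
offset=0
while :; do
  page=$(curl -s -H "Authorization: Token token=YOUR_API_KEY" \
    "https://acme.pagerduty.com/api/v1/incidents?limit=100&offset=$offset")
  rows=$(echo "$page" | jq -r \
    '.incidents[] | [.id, .created_on, .status, .service.name] | @csv')
  [ -z "$rows" ] && break    # an empty page means we have everything
  echo "$rows" >> incidents.csv
  offset=$((offset + 100))
done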

Finally, let’s set up Zoho to automatically grab new incidents as they come in. Click “Import,” then “Refetch/Schedule Import.”

PagerDuty_reports_zoho_9

You may see some information pre-filled here; if not, fill out the rest. For URL, make sure to use the URL without an offset parameter (i.e. "https://<subdomain>.pagerduty.com/api/v1/incidents").

Set Zoho to pull data every hour:

PagerDuty_reports_zoho_10

Massaging our data

Now you have an incidents table in Zoho. If all you want to report on is the number of incidents, you’re ready to go. However, if you’d also like to report on resolution time, we’ll need to do a little more work. Select "New," then "New Query Table."

 PagerDuty_reports_zoho_11

Query tables let you execute SQL commands against your base incidents table, which will help us get the data we want in a graphable way. If you know SQL, you’re probably good at this point. If not, here’s a sample query to get you started:

SELECT distinct "incidents.id" as incident_id, timestamp("incidents.created_on") as created_date, incidents."incidents.html_url" as link, "incidents.escalation_policy.name" as escalation_policy, "incidents.service.name" as service, "incidents.resolved_by_user.name" as resolver, sec_to_time((unix_timestamp("incidents.last_status_change_on") - unix_timestamp("incidents.created_on"))) as resolvetime, timestamp("incidents.last_status_change_on") as resolved_date, round((unix_timestamp("incidents.last_status_change_on") - unix_timestamp("incidents.created_on"))/60) as resolvemins, concat_ws(':', "incidents.trigger_summary_data.subject", "incidents.trigger_summary_data.description") as details

FROM incidents

WHERE "incidents.status" = 'resolved'

Click “Execute” and you should see a new table appear below. This table is the one we’ll be doing our graphing from.

NOTE: You may be tempted to graph some metrics from your original table, and some from the query table. This will work, but doing everything from the query table enables some extra functionality I’ll explain later.

PagerDuty_reports_zoho_12

Graph Time

Now let’s make our first graph. From your query table, select “New,” then “New Chart View.”

PagerDuty_reports_zoho_13

Let’s make a graph that shows incident numbers by service over time. Whether it’s an infrastructure problem or just misconfigured monitoring thresholds, high incident volume is a sign that there’s something to go fix.

Drag ‘created_date’ into the x-axis, and set it to “Full Date.” Drag “incident_id” into the y-axis, and set it to ‘count.’ Then drag ‘service’ into the ‘color’ field – you can think of “color” as the variable that splits a column into multiple series. Click “Generate Graph,” and you’ll see something like this:

PagerDuty_reports_zoho_14

By clicking the different services on the right, you can control which services show. You can also use the “Filters” tab to include or exclude certain services (such as low-severity or test services).

Now let’s add an escalation policy filter, so that you can see services broken down by team. Click the “User Filters” tab, then drag in ‘escalation_policy.’ Then click ‘View Mode.’

PagerDuty_reports_zoho_15

Now you can filter the list of graphed services by escalation policy! This is a big help for team leads who only want to see the services they are responsible for.

PagerDuty_reports_zoho_16

Go ahead and save your report, and give it a descriptive name.

PagerDuty_reports_zoho_17

Now let’s graph time-to-resolve, to show us how quickly our team is fixing problems. Make a new chart and put created_date on the x-axis, resolvemins on the y-axis, and escalation_policy as "color." I also like to add incident_id (count) to the tooltip box, so that mousing over a certain day will show me how many incidents there were as well.

PagerDuty_reports_zoho_18

Over in "User Filters," you can add whatever you’d like, but I suggest adding a filter for 'resolvemins'. This gives us a ham-fisted way of excluding long-running incidents and getting a less noisy graph. Note to the math nerds out there: median or percentile would be a much better metric for resolution time. I’ll leave that exercise to you :-)

PagerDuty_reports_zoho_19

Click “generate graph,” play around with the various filters, and make sure you’re happy. Then save this report.

Putting it all together

Finally, let’s build a company dashboard. Select “New Dashboard”:

PagerDuty_reports_zoho_20

From the left sidebar, drag in your two reports. You’ll see that the user filters from both reports are automatically added to the top! You can also add any other user filters you would like, and they will affect both of the reports on the dashboard.

 PagerDuty_reports_zoho_21

When you’re satisfied with your dashboard, go ahead and save it. You can add additional graphical, tabular and summary views to your dashboard – my final one looks like this:

dash_report

Sharing your report

Finally, let’s show off our awesome reports to other people at the company. To share a report or dashboard, just click the “Publish” menu in the toolbar:

PagerDuty_reports_zoho_23

You’ll be given links or embed code that you can use to share your graph around your organization (we embed ours in our Confluence wiki). Note that on the free Zoho plan, you’ll need to be logged in to view the dashboard, but paid plans offer public embeds as well.

We hope you enjoy! Thanks again to Outbrain for the tip.

10 Common Server Monitoring Mistakes from the Trenches

This is a guest blog post from Shawn Parrish of NodePing, one of our monitoring partners, about how to avoid some of the more common monitoring stumbling points. NodePing provides simple and affordable external server monitoring services. To learn more about NodePing, visit their website (https://nodeping.com).

I have been responsible for servers and service monitoring for years and have probably made nearly all the mistakes. So listen to the war stories from a guy with scars and learn from my mistakes. Here’s 10 low bridges I’ve bumped my head on. Most of these are smack-your-forehead-duh common sense. Mind the gap.

Here are 10 common server monitoring mistakes I’ve made.

1. Not checking all my servers

Yeah, it seems like a no-brainer, but with so many irons in the fire, it’s hard to remember to configure server monitoring for all of them. Some of the more commonly forgotten servers are:

  • Secondary DNS and MX servers.  This ‘B’ squad of servers usually gets in the game when the primary servers are offline for maintenance or have failed. If I don’t keep my eye on them too, they may not be working when I need them the most. Be sure to keep an eye on your failover boxes.

  • New servers.  Ah, the smell of fresh pizza boxes from Dell! After all the fun stuff (OS install, configuration, burn-in, hardening, testing, etc) the two most forgotten ‘must-haves’ on a new server are the corporate asset tag (anybody still use those?) and setting up server monitoring. Add it to your checklist.

  • Cloud servers. Those quick VPS and AWS instances are easy to set up, and easy to forget to monitor.

  • Temporary/Permanent servers.  You know the ones I’m talking about. The ‘proof of concept’ development box that was thrown together from retired hardware that has suddenly been dubbed as ‘production’. It needs monitoring too.

2. Not checking all services on a host

We know most failures take the whole box down, but if I don’t watch each service on a host, I could have a running website while FTP has flatlined. The most common one I forget is to check both HTTP and HTTPS. Sure, it’s the same ‘service’, but the Apache configuration is separate and the firewall rules are likely separate. Also don’t forget SSL checks, separate from the HTTPS checks, to ensure you have valid SSL certificates. I’ve gotten the embarrassing calls about the site being ‘down’ only to find out that the cert had expired. Oh, yeah… I was supposed to renew that, wasn’t I?
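For the certificate case in particular, a cron-able expiry check is cheap insurance while you wire up proper SSL monitoring. A sketch using openssl’s -checkend flag, with example.com as a placeholder host:

# Warn if the cert expires within 14 days.
host=example.com
if openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null |
   openssl x509 -noout -checkend $((14 * 24 * 3600)); then
  echo "$host: certificate good for at least 14 more days"
else
  echo "$host: certificate expires within 14 days (or is already broken)!"
  exit 1
fi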

3. Not checking often enough

Users and bosses have very little tolerance for downtime. I learned this lesson when trying a cheap monitoring service that only provided 10-minute check intervals. That’s up to 9.96 minutes of risk (pretty good math, huh?) that my server might be down before I’m alerted. Configure 1-minute check intervals on all services. Even if I don’t need to respond right away (a development box that goes down in the middle of the night), I’ll know when it went down to within 60 seconds, which can be helpful when slogging through the logs for root cause analysis later.

4. Not checking HTTP content

A standard HTTP check is good, but the ‘default’ under-construction Apache server page has given me that happy 200 response code and a green ‘PASS’ in my monitoring service just like my real site does. Choose something in the footer of the page that doesn’t change and do an HTTP content-matching check on that. Don’t use the domain name though; that may show up in the ‘default’ page too and make the check less useful.

It’s also important to make sure certain content does NOT show up on a page. We’ve all visited a CMS site that displayed that nice ‘Unable to connect to database’ error. You want to know if that happens.
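In curl-and-grep terms, a content check boils down to something like this sketch, where the URL and both marker strings are placeholders for your own site’s stable footer text and your CMS’s error message:

# PASS only if the stable footer text is present AND the CMS error is absent.
url="http://example.com/"
body=$(curl -s --max-time 10 "$url")
if echo "$body" | grep -q "Copyright Example Corp" &&
   ! echo "$body" | grep -q "Unable to connect to database"; then
  echo "PASS: content check on $url"
else
  echo "FAIL: content check on $url"
  exit 1
fi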

5. Not setting the correct timeout

Timeouts for a service are very subjective and should be configurable in your monitoring service. The web guys tell me our public website should load in under 2 seconds or our visitors will go elsewhere. If my HTTP service check takes 3.5 seconds, that should be considered a FAIL result and someone should be notified. Likewise, if I had a 4-second ‘helo’ delay configured in my sendmail, I’d want to move that timeout up over 5 seconds. Timeouts set too high let performance issues go unnoticed; timeouts set too low just increase notification noise. It takes time to tweak these on a per-service level.
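curl’s timing variables make it easy to spot-check a service against its budget. A sketch, with the URL and the 2-second threshold as placeholders to tune per service:

# Fail the check if total load time exceeds the 2-second budget.
t=$(curl -s -o /dev/null --max-time 10 -w '%{time_total}' "http://example.com/")
if [ "$(echo "$t > 2.0" | bc)" -eq 1 ]; then
  echo "FAIL: page took ${t}s (budget: 2.0s)"
  exit 1
fi
echo "PASS: page took ${t}s"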

6. Forgetting DNS goes both ways

Sure, I’ve got DNS checks to make sure my hostnames resolve to my IPs, but I all too often forget to check the reverse DNS (rDNS) entries as well. It’s especially important for SMTP servers to have properly resolving PTR records, or my email will be headed for the spam bucket. I always monitor SPF and DKIM records while I’m at it. Your monitoring service can do that, right?

Even when I’m using a reputable external DNS service I set up DNS checks to monitor each of the NS records on my domains. A misconfiguration on my part or theirs will cause all kinds of havoc.
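Checking both directions is just two dig calls. In this sketch mail.example.com is a placeholder, and note that the PTR answer comes back as a fully qualified name with a trailing dot:

# Verify forward and reverse DNS agree for a mail server.
host=mail.example.com
ip=$(dig +short A "$host" | head -n 1)
ptr=$(dig +short -x "$ip")
if [ "$ptr" = "$host." ]; then
  echo "OK: $host -> $ip -> $ptr"
else
  echo "MISMATCH: $host resolves to $ip, but $ip reverses to '$ptr'"
  exit 1
fi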

7. Sensitivity too low/high

Some servers or services seem prone to little hiccups that don’t take the server down but may intermittently cause checks to fail due to traffic, routing or maybe the phase of the moon. Nothing’s more annoying than a 3 AM ‘down’ SMS for a host that really isn’t down. Some folks call this a false positive or flapping; I call it a nuisance. Of course I shouldn’t jump every time a single ping loses its way around the interwebs or a single SMTP ‘helo’ goes unanswered. But then the more dangerous condition sets in: I may be tempted to start ignoring notifications altogether because of all the noise from alerts I really don’t care about.

A good monitoring service handles this nicely by allowing me to adjust the sensitivity of each check. Set this too low and my notifications for legitimate down events take too long to reach me, but set it too high and I’m swamped with useless false positive notifications. Again, this is something that should be configured per service and will take time to tweak.

8. Notifying the wrong person

Nothing ruins a vacation like a ‘host down’ notification. Sure, I’ve got backup sysadmins that should be covering for me, but I forget to change the PagerDuty schedules so notifications get delivered to them and not me.

9. Not choosing the correct notification type

Quick on the heels of #8 is knowing which type of notification to send. Yeah, I’ve made the mistake of configuring email alerts for when the email server is down. Critical server notifications should almost always be sent via SMS, voice, or persistent mobile push.

10. Not whitelisting the notification system’s email address

Quick on the heels of #9 (we’ve got lots of heels around here) is recognizing that if I don’t whitelist the monitoring service’s email address, its messages may end up in the spam bucket.

Bonus!

11. Paying too much

I’ve paid hundreds of dollars a month for a mediocre monitoring service for a couple dozen servers before. That’s just stupid. NodePing costs $15 a month for 200 servers/services at 1-minute intervals, and it’s not the only cost-effective monitoring service out there. Be sure to shop around to find one that fits your needs well. Pair it up with PagerDuty’s on-call/hand-off capabilities and you’re well on your way to avoiding the scars I’ve got, without losing your shirt.

Nuff said, true believer.

Tips for Tackling System Issues with PC Monitor and PagerDuty

This is a guest blog post from PC Monitor, one of our monitoring partners, about how to best use their system and PagerDuty together to tackle issues. You can learn more about PC Monitor on their website (https://www.mobilepcmonitor.com/).

Has a server ever gone down while you were away from a computer, leaving you scrambling to get online to resolve the issue? Unfortunately this happens more often than most people think, and it can easily be avoided.

In today’s day and age, we live fully connected lives through some pretty incredible technology. With startups like SmartThings making our homes fully connected and letting us control aspects of our home remotely, it’s only fitting that the same happens in the office.


According to eMarketer, smartphone users worldwide will total 1.75 billion in 2014, and more and more of those devices are not just for personal use anymore.

Busy IT system administrators are tasked with a very important job: making sure all systems are running smoothly. But what happens when something goes wrong?

Monitoring servers from your phone is just half the battle. The other half is taking action. And that’s exactly where Mobile PC Monitor comes in. Trusted by 250,000 consumers and professionals in over 100 countries around the world, PC Monitor is the easiest way to securely monitor and take action on any IT system remotely, from any smartphone or tablet.

PC_Monitor_Diagram

Here are three tips on what to do when a system has issues and you have both PagerDuty and PC Monitor working together.

1. Have a plan of attack ready

Set up alerts in PagerDuty to notify the correct people when an issue occurs. Alerts can be delivered via phone, SMS or email. We always recommend having more than one person receive alerts, each with a different number of retries.

2. Take action

Once you get that alert via PagerDuty and open the PC Monitor app on iOS, Android or Windows Phone, you have lots of options for attacking the problem. PC Monitor enables users to run terminal commands directly from the app. You can restart or shut down if necessary, too.

3. Always be prepared

Have a game plan ready in case this happens again. Know exactly which commands to run, or what sequence to do things in. And make sure others know what the game plan is … not just you.

Integrating PagerDuty with PC Monitor is easy and only requires a few steps. PagerDuty has put together a guide to help you get fully set up with PC Monitor.
