How Cascadeo Integrates PagerDuty Into Its NOC, Instant Messaging and Ops Support Platform

Over the past few years, PagerDuty has alerted thousands of users, letting them know when their systems are down. It’s what we do, and we’re proud to be seen as an integral part of their monitoring solution. Every once in a while we come across a customer using PagerDuty in a way that makes even us say COOL! One such customer is Cascadeo. Below is their story.

Cascadeo is an IT operations company that focuses on providing long-term DevOps infrastructure and operations support for a wide variety of clients. With a staff of top-talent systems and network engineers, project managers, and their own worldwide 24/7 NOC, Cascadeo gives customers access to a highly experienced DevOps team. Customers can focus on their applications, from initial development through growth, while the team at Cascadeo supports the critical infrastructure on which the applications run.

At the core of Cascadeo’s product offering is a platform they developed called “The Cascadeo Operations Support System”, also known as COSS. COSS is a clustered application that runs across multiple regions on Amazon Web Services (AWS) and uses Amazon RDS for its database backend. Its function is to integrate all operations support systems into a cohesive ecosystem. Cascadeo uses a wide variety of SaaS tools in their operations, including PagerDuty (escalations and guaranteed-delivery messaging), Zendesk (workflow), Harvest (time tracking), and a number of others. A critical requirement for each tool chosen is that it have a rich set of REST APIs to be used in integration. COSS acts as the routing bus for all of these systems, either by reaching out to their APIs or by exposing REST endpoints that other systems use to access COSS (e.g. the COSS Alerts API).

Cascadeo provides each of their customers with an instant messaging operations room. In that room resides the entire Cascadeo team dedicated to that customer (NOC, PMs, lead engineers, sys admins, etc.) and everyone from the customer’s IT group (engineers, managers, etc.). All communications, including maintenance windows, status updates, and requests for assistance, happen in this room. A virtual ops room gives full transparency into all IT service requests and issue resolution, and it provides a log of all communications, which is crucial for audit purposes.

Cascadeo uses PagerDuty in two ways: as an escalation platform to activate on-call teams and as a guaranteed-delivery messaging platform. Taking things to a new level, Cascadeo has integrated PagerDuty into their COSS platform through PagerDuty’s API. Each Cascadeo team member has a service assigned to him or her within PagerDuty. Through a series of commands used within the COSS instant messaging room, any Cascadeo employee can issue notification commands at normal, urgent, or emergency priority. These requests are sent to PagerDuty, which then alerts the team member who can address the customer’s needs. Once that team member has been notified of the service request, all responses (ACKs, resolutions, or escalations) are captured back within the instant messaging room.

Here is a sample transcript:

Cascadeo NOC D

Ludy :Device: x7.acme.com
IP Address: 10.25.0.13
Component: CheckDNSELB
Severity: Critical
Time: 2012/11/04 13:53:53.000
Message:
CRITICAL: 10.25.0.13 did return a CNAME record

== Escalation notes ==

=== Tier notes ===

  • This node is grouped under Infrastructure. It could be CRITICAL/HIGH/LOW Priority Infra. Read Device notes below.

=== Device notes ===

* Post in Cascadeo and Acme chatroom and indicate as an URGENT issue.

* If the event happens to be a non-urgent (as confirmed by the engineer), remind the engineer to put a transform for Event escalation notes.

* Create an URGENT ticket for Cascadeo SE and post ticket # in Acme chatroom. Note in the ticket that the Acme contact is: Fred Flintstone/Barney Rubble.

* How to access x7.acme.com — https://sites.google.com/a/acme/ops/runbook/incident-response-recipes#TOC-Responding-to-issue-with-x7.acme.com

Cascadeo NOC D

.oncall [casc/int/systems, 32962] Pls check 32962 re Acme : Systems : fisheye.acme.com (172.30.0.14) CRITICAL: 10.25.0.13 did return a CNAME record/Time: 2012/11/04 13:53:53.000

6:00

COSS Bot

NOC, I’ve successfully sent event coss_102-32962 to COSS oncall macro – Systems via PagerDuty.

COSS Bot

NOC, PagerDuty is reporting that incident coss_102-32962 (for COSS oncall macro – Systems) was acknowledged by Romel Emperado on 2012-Nov-4 06:01AM PST.
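
The article does not show how COSS turns a chat command into a PagerDuty request, so the following is only a minimal sketch of one way the `.oncall` line in the transcript could be handled. The regular expression, the function names, the `ticket-` deduplication key, and the fixed default severity are all hypothetical. The request body follows the general shape of PagerDuty's Events API v2, which is an assumption here: the source does not say which PagerDuty API or version COSS actually calls.

```python
import re

# Hypothetical pattern for the ".oncall [queue, ticket] message" form
# that appears in the sample transcript.
ONCALL_RE = re.compile(
    r"^\.oncall\s+\[(?P<queue>[^,\]]+),\s*(?P<ticket>\d+)\]\s+(?P<message>.+)$"
)


def parse_oncall(line):
    """Split a chat command into queue name, ticket number, and message."""
    match = ONCALL_RE.match(line.strip())
    if match is None:
        raise ValueError("not an .oncall command: %r" % line)
    return {
        "queue": match.group("queue").strip(),
        "ticket": int(match.group("ticket")),
        "message": match.group("message").strip(),
    }


def build_trigger_event(command, routing_key, severity="critical"):
    """Build a trigger body in the assumed Events API v2 shape.

    routing_key identifies the target PagerDuty service; the caller
    supplies it. Nothing is sent over the network here.
    """
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Hypothetical key format, so repeat commands for the same
        # ticket map onto the same incident.
        "dedup_key": "ticket-%d" % command["ticket"],
        "payload": {
            "summary": command["message"],
            "source": command["queue"],
            "severity": severity,
        },
    }
```

Keeping parsing and payload construction separate from the HTTP call makes both steps easy to check in isolation; a real bot would still need to post the body to PagerDuty and report the result back to the room.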

In addition to creating a separate service for each team member, Cascadeo creates on-call queues in PagerDuty that are associated with each customer. This allows the NOC, the lead engineer, or the project manager associated with the client to send alerts to the client’s escalation contacts. PagerDuty’s flexibility in defining escalation policies is very useful in complying with specific customer alerting requirements.

“We live and breathe 24×7 Operations, both in the cloud and in the data center,” says Ophir Ronen, a principal at Cascadeo. “For us, PagerDuty is a key tool in our handling of mission critical operations.”

In addition to leveraging PagerDuty, Cascadeo deploys Zenoss to all of their clients. As part of the provisioning process, they spend a significant amount of time tuning Zenoss to increase the signal-to-noise ratio. When a network management system (NMS) is first installed, it generates an enormous amount of noise (events that are not actionable).

Cascadeo conducts a series of triage sessions, typically twice per week, where they work with their clients to categorize the top 10 noisiest events as either actionable or noise. Actionable events require remediation and escalation data, which is collated and embedded into the event itself. That way, when the issue recurs, Cascadeo’s NOC and on-call engineers have the remediation and escalation information immediately at hand, which dramatically reduces mean time to repair.
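
As a rough illustration of the ranking step in such a triage session, the sketch below counts events by device and component and returns the noisiest ones. The function name and the event dictionary layout are hypothetical; the `device` and `component` field names are borrowed from the labels in the sample transcript, and the article does not describe how Cascadeo actually computes the list.

```python
from collections import Counter


def noisiest_events(events, top_n=10):
    """Return the top_n (device, component) pairs by event count.

    Each event is assumed to be a dict with "device" and "component"
    keys; ties keep first-seen order.
    """
    counts = Counter((e["device"], e["component"]) for e in events)
    return counts.most_common(top_n)


# Example input with made-up events.
sample = [
    {"device": "x7.acme.com", "component": "CheckDNSELB"},
    {"device": "x7.acme.com", "component": "CheckDNSELB"},
    {"device": "db1.acme.com", "component": "DiskUsage"},
]
for (device, component), count in noisiest_events(sample):
    print(device, component, count)
```

Each pair in the resulting list would then be reviewed with the client and marked as actionable or noise.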

The COSS solution lives in the world of web services. Not only does it reach out to PagerDuty’s APIs, it also exposes APIs of its own. For example, through the COSS Alerts API, Cascadeo can receive specific alerts from Zenoss, buffer them, pass them through the COSS platform for tracking, and then hand them to PagerDuty to trigger a notification.
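
The receive, buffer, track, and forward flow described above can be sketched with a small in-memory queue. This is an illustration under stated assumptions, not the COSS implementation: the `AlertBuffer` class, its method names, and the `forward` callable (standing in for whatever code sends the alert to PagerDuty) are all hypothetical.

```python
from collections import deque


class AlertBuffer:
    """Hold incoming alerts until they have been forwarded downstream."""

    def __init__(self, forward):
        self._forward = forward   # callable that delivers one alert
        self._pending = deque()   # alerts not yet delivered
        self.tracked = []         # every alert ever received, for tracking

    def receive(self, alert):
        """Record an alert and queue it for delivery."""
        self.tracked.append(alert)
        self._pending.append(alert)

    def flush(self):
        """Forward queued alerts in arrival order; return how many were sent.

        An alert is removed from the queue only after forward() returns,
        so if forward() raises, that alert stays queued for the next flush.
        """
        sent = 0
        while self._pending:
            self._forward(self._pending[0])
            self._pending.popleft()
            sent += 1
        return sent

    def pending_count(self):
        return len(self._pending)
```

Removing an alert only after a successful hand-off is one simple way to avoid losing it when the downstream service is briefly unreachable, at the cost of possible duplicates if the failure happens after delivery.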

Cascadeo is a rapidly growing company of more than 80 people, distributed across six time zones. They are able to offer extremely high levels of service delivery thanks to their talented teams and the COSS platform. According to Ophir, “Our integration with PagerDuty via our COSS platform allows us to easily activate our resources, distributed around the world, to help our clients. Guaranteed delivery and multi-cloud redundancy is key for us which is why we selected PagerDuty as the tool to handle the critical alert and messaging functions of our OSS.”
