Meet the PagerDuty Android app

Android Logo

We’ve had mobile on our minds lately.

PagerDuty’s recently revamped mobile site has allowed for less desktop fuss and more sleepless nights for our customers (maybe that’s not a good thing). Looking at all mobile devices, 97% of mobile users were using either iOS or Android, so native apps were the next step.

A month ago, we launched the PagerDuty iPhone app. And today, after a few weeks of private beta testing, we’re launching the PagerDuty Android app, available for download in the Android Market.

The Droid you’ve been looking for

The Android app makes the mobile site functionality native, with a few extras.

PagerDuty Android App Incident Log Screenshot

The full feature list:

  • Receive unlimited push notification alerts
  • New custom sounds for push notifications (the most requested beta feature)
  • Easily access and respond to open incidents (acknowledge, resolve or reassign)
  • Quickly see when you’re on-call
  • Access a contact list of all users in your account (includes phone & SMS numbers and email addresses for each user)

A big kudos to everyone who expressed interest in the beta, logged bugs, and provided feedback on how to make the app better.

Ready for it? Download PagerDuty for Android.

Still waiting on your native app?

If you’re not an Android or iOS user, we’d love to hear what platform you’re using. Email us at support at pagerduty dot com with your feedback. In the meantime, don’t forget about the PagerDuty mobile-optimized site that works on all platforms.

 

Posted in Announcements | 3 Comments

PagerDuty attending the first Monitorama conference

Monitorama 2013 Conference

The first ever Monitorama conference is being held this weekend in Boston, MA. We’re excited to be sponsoring this new conference, full of great sessions and panels with some of the leading experts on monitoring and operations technologies.

Meet some PagerDutians IRL

James Litton, PagerDuty

James Litton

Alex, our CEO, and James, one of our Operations Engineers, will be in attendance to absorb the conference and maybe even chat with you. We are always interested in hearing how people are using PagerDuty in their companies so feel free to tell us your favorite and least favorite things.

Alex Solomon, PagerDuty

Alex Solomon

You can look for the bright green PagerDuty logo that James and Alex will be sporting and feel free to ask for your own t-shirt too! We might even have a few laptop stickers if you ask nicely.

 

 

Posted in Events | Leave a comment

Improved Scout Integration with PagerDuty

As PagerDuty fans know well, we rely on robust integrations with a host of other monitoring tools and tracking systems to allow PagerDuty to be the central command center for IT. Building on the integration we launched in June 2012 with server monitoring tool Scout, we have great news to report!

The Scout team has added support that allows you to integrate PagerDuty through multiple Services and Escalation Policies. This new feature will help if you need to route your Scout alerts to different teams based on thresholds, servers and/or applications.

PagerDuty Scout integration screenshot

Scout’s new Notification Groups feature lets you define notification groups, add multiple PagerDuty services, assign them to different groups, and specify which triggers go to which groups!

In addition, you can now trigger specific PagerDuty services at defined thresholds. For example, you could create threshold A that activates PagerDuty integration #1, and threshold B that activates PagerDuty integration #2. These integrations can be associated with different escalation policies so that you are notified differently, according to your configuration, based on the type of alert.

Get your hands on it

We at PagerDuty are very excited about these improvements in Scout Integration so please feel free to test drive it by creating trial accounts in PagerDuty and Scout!

Posted in Announcements, Integrations, Partnerships | Leave a comment

Introducing the PagerDuty iPhone app

PagerDuty iOS App Push Notification

We’re excited to announce the first official PagerDuty app for iOS, optimized for iPhone and iPod Touch devices. It’s available today in the App Store.

Download PagerDuty for iPhone and let us know what you think.

The journey to iOS

Back in January, we made several improvements to the mobile version of PagerDuty.  We focused on improving accessibility to the features you need most on the go: we added a mobile “who’s on-call” display; improved the ability to respond to and triage incidents; and added a mobile user directory to quickly find teammates (and call them if necessary).

We’re proud of our mobile site, but the goal has always been a native app.  To that end, we’re excited to release the first version of our native iOS app.  Best of all, this means we’re bringing a fourth kind of alert to you — push notifications!

The nitty-gritty on features

PagerDuty iPhone App On-Call Screenshot

The new iOS app allows you to do the following:

  • Receive unlimited push notification alerts
  • Easily access and respond to open incidents (acknowledge, resolve or reassign)
  • Quickly see when you’re on-call
  • Access a contact list of all users in your account (includes phone & SMS numbers and email addresses for each user)

What are you waiting for? Download the PagerDuty iPhone app. Tweet us @pagerduty or email us at support@pagerduty.com with feedback, requests, or bugs.

Android users, you’re next

We haven’t forgotten about our Android users. We’re working out the final kinks of the Android app, and plan to launch in the next few weeks. If you want to be the first to know, follow us on Twitter @pagerduty or subscribe to the PagerDuty blog for a future post.

 

Posted in Announcements | 11 Comments

Expanding PagerDuty with $10.7M new funding

I’m very happy to announce we’ve just received $10.7M in funding, led by Andreessen Horowitz.  Also participating in the round were Jesse Robbins, founder of Opscode; WIN Funding; and existing investors Baseline, Harrison Metal and Ignition.

We will be using the funding to accelerate product and market development of our IT incident tracking and on-call management platform.  In other words, more money means we’ll hire more crazy-smart engineers to write more features, further raise the bar on our system’s reliability, and ultimately make even more customers happy.

I’ll be perfectly honest: Up to this point, we’ve been flying a bit under the radar.  We haven’t talked very much about our traction and success thus far.  We also haven’t done much in the way of publicity and self-promotion.  Instead, we’ve relied on our customers to spread PagerDuty via word of mouth.  This strategy has actually worked out quite well (as it turns out, when you build a good product that solves a real, hair-on-fire problem, people will pay for it).  Today, we have thousands of customers ranging from large enterprise companies (Microsoft, Adobe, Intuit, EA) to startups (Square, GitHub, Pinterest, Etsy) and everything in between.  We’ve come a long way, but we still have a long way to go.

The main reason we’ve raised this new funding round is to accomplish our vision for the product much faster.  At this point, I’m sure you’re wondering, “What is the PagerDuty vision?”  Today, we are the “9-1-1 dispatch” system for IT.  It’s a bit like normal “9-1-1”, which is used to dispatch emergency services such as police and ambulance.  Our system dispatches engineers to fix critical issues in your IT infrastructure.

The next major step in the vision is to expand beyond just the critical incidents.  We ultimately will become the central nervous system of IT: we’ll provide the interconnecting fabric between your systems and the people responsible for managing them.  We will continue to focus on solving the people part of IT incident management and leave monitoring for the monitoring guys.  Our big audacious goal is to reduce the noise.  If you think about it, devops teams use multiple monitoring tools, each of which produces a lot of alerts.  Most of these alerts are not critical or high priority.  We want to develop PagerDuty to slurp in all of these monitoring events and increase the signal-to-noise ratio for our users.  In other words, only the critical issues should wake you up at 4am, false alerts should be automatically filtered out, and low priority incidents should be surfaced in aggregate in summary reports.  Of course, there’s a lot more to this, but we can’t give everything away just yet :) .

What makes the vision really exciting is that we’re solving an important problem, incident response, that’s never really been solved very well before.  We’re replacing cobbled-together solutions and manual processes, and ultimately helping DevOps engineers resolve issues faster and reduce downtime.  We want to help our users become heroes at their company (and in front of their boss).

From a product perspective, we are really focused on building a system that’s intuitive to use, that doesn’t have a steep learning curve (like many other enterprise software systems and IT tools), and that really meshes with the mantra of “make easy things easy and hard things possible”.

Finally, what really excites us from an engineering perspective is building an extremely reliable, fault-tolerant, distributed system at scale.  Indeed, we have an absolute obsession with reliability.  We believe that even two minutes of downtime is completely unacceptable, and that planned downtime and maintenance windows are no longer acceptable in today’s 24×7 world.  We’ve found that failover architectures aren’t suitable for extreme uptime applications like PagerDuty, and have therefore started converting our message dispatch pipe to use fully distributed data stores like Cassandra and Zookeeper.  Our ultimate goal is to be able to survive the total loss of a data center without any interruption or delay whatsoever to alert deliveries.

If this sounds exciting to you and you want to join us on this path, we are looking for smart reliability engineers, DevOps engineers, and front-end JavaScript experts to help us re-invent the devops tools space.

Posted in Announcements | 2 Comments

More Pager Dutonians

We’re happy to announce the addition of 4 new Pager dudes and dudettes: David Lanstein, Doug Barth, Ryan Hoskin, and Sam Noland.

A few years back you would have found David coding away at Splunk, but not today! As the Director of Major Accounts, David will be heading our Sales Department. David brings 2 years of technical sales experience and the remarkable skill of opening champagne.

 

Doug is an all-around smart guy and a great addition to our ops team. He has the rare skill of homing in on small details while still keeping the big picture in mind. Along with being straight-up brilliant, Doug brings in the best homemade baked goods. Hackday project: Parallelizing our test suite to run faster on multi-core machines.

 

Ryan, AKA the “Microsoft Support Guru”, brings 8+ years of technical customer support experience to the PD support team. Since the majority of the Pager dudes and dudettes work on Macs, we’re ecstatic to finally have a Microsoft expert on board to help support our Windows users. Though he won’t admit it, Ryan is quite the dancer. Hackday project: PowerShell script to import Windows Active Directory users into PagerDuty.

 

And then we have Sam, our newest addition to the Happiness Team. With her previous Vibe Managing experience, Sam brings a tool belt full of tricks, ideas, and skills that will help maintain our company culture in both our San Francisco and Toronto offices. [Editor's note: Sam wrote most of this blog post but was entirely too modest about herself.] Hackday project: Ducksboard status board in our lobby.

 

We’re always growing, and if you have an urge to make the world more reliable, give us a shout.

 

Posted in Announcements | Leave a comment

Outage Post-Mortem – Jan 24, 2013

On January 24, 25 and 26, 2013, PagerDuty suffered several outages.  The events API, used by our customers to submit monitoring events into PagerDuty from monitoring tools, was down during the outages.  Our web application, used to access and configure customer accounts, was also affected and may have been unavailable during the outages.

We’ve written this post-mortem to let you know what happened and to also let you know what we’re doing to ensure this never happens again.  Last but not least, we would like to apologize for this outage.  While we didn’t have any single prolonged outage during this period, we strongly believe in the mantra that even 2 minutes of downtime is unacceptable and we’d like to let you know we’re working hard on improving our availability, both in the short term and the long term.

Background

The PagerDuty infrastructure is hosted in multiple data centers (DCs).  The notification dispatch component of PagerDuty is fully redundant across 3 DCs and can survive a DC outage without any downtime.  We’ve designed the system to use a distributed data store which doesn’t require any sort of failover or flip when an entire DC goes offline.

However, the events API, which is backed by a queuing system, still relies on our old legacy database system, based on a traditional RDBMS.  This system has a primary database which is synchronously replicated to a secondary host.  The system also has a tertiary database which is asynchronously replicated (just in case both the primary and secondary have problems).  If the primary host goes down, our standard operating procedure is to do a flip to the secondary host.  The downside is that the flip process requires a few minutes of downtime.

Outage Details

Note: All times referenced below are in Pacific time.

On 1/24

  • At 8:25am, the events API and website both went down.
  • At 8:25am, the PagerDuty on-call engineers were alerted.
  • At 8:32am, we started a Severity-1 conference call.
  • At 8:36am, we started the flip process from the primary db to the secondary.
  • At 8:41am, the flip process was completed.
  • At 8:42am, the events API and website were brought back online.

Later on that day, we had several blips:

  • At 4:16pm: small blip – 1min outage
  • At 10:37pm: small blip – 1min outage
  • At 10:51pm: small blip – 1min outage

Throughout the day, we worked on investigating the issue and worked on the post-mortem.  As part of the investigation, we noticed a large number of invocations of a particular slow query on the database.  We modified the code to turn off the invocation of the offending query.  At this point, we thought the outages were caused by a single slow query, which we had fixed, so we thought the underlying problem was also fixed.

On 1/25

  • At 7:05am: small blip – 3min outage

We investigated the new outage and found another problematic slow query, which we fixed immediately.

On 1/26

  • At 2:28am, the events API and website went down.
  • At 2:28am, the on-call engineers were paged.
  • At 2:38am, both the events API and website had recovered.

At this point, we came to the conclusion that the best thing to do is upgrade the db machine to a larger host.  Engineers worked through the night to build all new db machines (primary, secondary and tertiary) on better hardware.

Around 6am, we believed the building of the new machines was complete.  From 6:15am to 9:14am, we attempted to flip the database to a new primary machine a couple of times, each time unsuccessfully.  Each of these attempts caused a few minutes of downtime.

At this point, we gave up on flipping to the new machine.  The flip did not work because the data snapshot on the new machine was not uploaded correctly, a mistake made by engineers who were extremely tired and burned out after working through the night on the upgrades.

After getting rest for about 12 hours, the engineers started from scratch building new db machines.  The freshly rested engineers put a new primary database in place.  A few hours afterwards, they also put in an upgraded secondary database and an upgraded tertiary database.

What we’re going to do to prevent this from happening again

Short term

We will set up rigorous monitoring for slow queries on our data store [already done].  We will also automate the building of a new database server via chef.  The db server was one of the last components to be chef’ed in our infrastructure, and on 1/26 and 1/27, we re-built db machines by hand instead of using chef, which was a time consuming and error-prone process.

We have also instituted a more rigorous development process, whereby new features and changes to the code base must be vetted for database performance as part of the regular code review process [already done].

We will also set up better host metrics for the database server so we can detect early on if and when we are approaching capacity and upgrade the server in an orderly way.

Long term

We will remove the dependency of our events API from our main RDBMS database.  To give a bit more context, our events API is backed by a queue: incoming events are enqueued, and background workers process queued events.  The reason for this is so we can properly handle and process large volumes of event traffic.
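To make the enqueue-then-process pattern concrete, here is a minimal sketch in Python. It is an illustration only, not our production code: the in-memory `queue.Queue` stands in for the database-backed queue, and the `run_pipeline` name and the event dictionaries are hypothetical.

```python
import queue
import threading


def run_pipeline(events, num_workers=2):
    """Enqueue every event, then let background workers drain the queue."""
    pending = queue.Queue()   # stand-in for the real, durable queue
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            event = pending.get()
            if event is None:          # sentinel: no more work for this worker
                pending.task_done()
                return
            with lock:                 # "processing" here just records the event id
                processed.append(event["id"])
            pending.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    # The API side only enqueues; it never waits on processing.
    for event in events:
        pending.put(event)

    # One sentinel per worker so each of them shuts down cleanly.
    for _ in workers:
        pending.put(None)

    pending.join()
    for w in workers:
        w.join()
    return sorted(processed)
```

Because accepting an event only requires an enqueue, a burst of incoming traffic lengthens the queue instead of slowing down the API itself, which is why the availability of the queue's backing store matters so much.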

Currently, this queue is reliant on our main SQL database.  As explained above, this DB is fully redundant with 2 backups across 2 data centers, but requires a failover when the main (primary) db goes down.

As a result of this post-mortem, we will fast-track a project to re-architect the events API queue and workers to use our newer distributed data store.  This data store is distributed across 5 nodes and 3 independent data centers, and it’s designed to survive the outage of an entire data center without requiring any failover process and without any downtime whatsoever.

Posted in Availability | Leave a comment

Mobile site improvements

We take our hackdays pretty seriously at PagerDuty, and we’re excited that new features and write-ups are starting to trickle out from our most recent one. The first feature to make it into the product comes from our head of product for our ever-improving mobile site experience.

Mobile On-Call Display shows you whether you are on or off call. This feature also allows you to view which user on your team is on call within every configured Escalation Policy.

It’s one click away from the home page, so you no longer need to wonder who is on call or rush to a computer to figure it out: this information is now quickly accessible from your smartphone!

Just select the phone icon on the bottom left and the screen displays “ON-CALL” or “OFF-CALL” and how many triggered and acknowledged incidents are currently assigned to you.

User is ON-CALL

You can also acknowledge, escalate, or resolve the incident assigned to you from the mobile site.

User can ack, resolve or escalate the incident

We have also recognized our customers’ need to quickly contact users in their on-call rotation. In order to make this easier, we have added a Mobile User Directory feature to our mobile site. The Mobile User Directory is a contact button that appears at the bottom center of your mobile page and, when selected, lists every user in your account with all of their contact information. Please take a look at the example in the screenshot below!

Directory of Users

When a specific user is selected, you can view their phone, SMS, and email contact information as well as contact them directly.

User's Contact Info

If you haven’t tried our mobile site yet, there has never been a better time. If you have, our team is always interested in your feedback and suggestions, so please feel free to drop us a line at support@pagerduty.com, or you could always just work here and make your own hackday projects.

Posted in Announcements, Features | 2 Comments

How Cascadeo Integrates PagerDuty Into Its NOC, Instant Messaging and Ops Support Platform

Over the past few years, PagerDuty has alerted thousands of users, letting them know when their systems are down. It’s what we do, and we’re proud to be seen as an integral part of their monitoring solution. Every once in a while we come across a customer that is using PagerDuty in such a way that it even makes us say COOL! One such customer is Cascadeo. Below is their story.

Cascadeo is an IT operations company that focuses on providing long term DevOps infrastructure and operations support for a wide variety of clients. With a staff of top talent systems/network engineers, project managers, and their own worldwide 24/7 NOC, Cascadeo customers benefit from a highly experienced DevOps team. Cascadeo customers can focus on their applications, from initial development through growth, while the team at Cascadeo supports the critical infrastructure on which the applications run.

At the core of Cascadeo’s product offering is a platform they developed called “The Cascadeo Operations Support System”, also known as COSS. COSS is a clustered application that runs across multiple regions on Amazon Web Services (AWS). COSS uses Amazon RDS for its database backend, and its function is to integrate all operations support systems into a cohesive ecosystem.  Cascadeo uses a wide variety of SaaS tools in their operations including: PagerDuty (escalations/guaranteed delivery messaging), Zendesk (workflow), Harvest (time tracking), and a number of other tools. A critical requirement for each tool chosen is that it have a rich set of REST APIs to be used in integration. COSS acts as the routing bus for all of these systems by either reaching out to their APIs, or by generating REST endpoints for various systems to use in accessing COSS (e.g. COSS Alerts API).

Cascadeo provides each of their customers with an instant messaging operations room. In that room resides the entire Cascadeo team dedicated to that customer  (NOC, PMs, Lead Engineers, Sys Admins, etc.) and everyone from the customer’s IT group (engineers,  managers, etc). All communications including maintenance windows, status updates, and requests for assistance happen in this room. Use of a virtual ops room allows for full transparency of all IT service requests and issue resolution as well as a log of all communications which is crucial for audit purposes.

Cascadeo uses PagerDuty in two ways: As an escalation platform to activate on-call teams and as a guaranteed-delivery messaging platform. Taking things to a new level, Cascadeo has integrated PagerDuty into their COSS platform through the use of PagerDuty’s API. Each Cascadeo team member has a service assigned to him or her within PagerDuty. Through a series of commands used within the COSS instant messaging room, any Cascadeo employee can issue notification commands at normal, urgent, and emergency priority. These requests are sent to PagerDuty and then proceed to alert the appropriate team member that can address a customer’s needs. Once notified of the service request, all responses (ACKs, resolutions or escalations) are then captured back within the instant messaging room.

Here is a sample transcript:

Cascadeo NOC D

Ludy: Device: x7.acme.com
IP Address: 10.25.0.13
Component: CheckDNSELB
Severity: Critical
Time: 2012/11/04 13:53:53.000
Message:
CRITICAL: 10.25.0.13 did return a CNAME record

== Escalation notes ==

=== Tier notes ===

  • This node is grouped under Infrastructure. It could be CRITICAL/HIGH/LOW Priority Infra. Read Device notes below.

=== Device notes ===

* Post in Cascadeo and Acme chatroom and indicate as an URGENT issue.

* If the event happens to be a non-urgent (as confirmed by the engineer), remind the engineer to put a transform for Event escalation notes.

* Create an URGENT ticket for Cascadeo SE and post ticket # in Acme chatroom. Note in the ticket that the Acme contact is: Fred Flintstone/Barney Rubble.

* How to access x7.acme.com – https://sites.google.com/a/acme/ops/runbook/incident-response-recipes#TOC-Responding-to-issue-with-x7.acme.com

Cascadeo NOC D

.oncall [casc/int/systems, 32962] Pls check 32962 re Acme : Systems : fisheye.acme.com (172.30.0.14) CRITICAL: 10.25.0.13 did return a CNAME record/Time: 2012/11/04 13:53:53.000

6:00

COSS Bot

NOC, I’ve successfully sent event coss_102-32962 to COSS oncall macro – Systems via PagerDuty.

COSS Bot

NOC, PagerDuty is reporting that incident coss_102-32962 (for COSS oncall macro – Systems) was acknowledged by Romel Emperado on 2012-Nov-4 06:01AM PST.
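As a rough illustration of what might happen between the `.oncall` command and the bot's confirmation, the sketch below parses the command and builds a trigger payload. This is not Cascadeo's code: the `build_trigger_event` helper, the `coss-` incident key prefix, and the placeholder service key are hypothetical, and the field names are modeled on PagerDuty's generic Events API, so check the current API documentation before relying on them.

```python
import re

# Matches commands such as ".oncall [casc/int/systems, 32962] Pls check ..."
ONCALL_PATTERN = re.compile(
    r"^\.oncall \[(?P<queue>[^,\]]+),\s*(?P<ticket>\d+)\]\s*(?P<message>.+)$"
)


def build_trigger_event(command, service_key):
    """Turn a chat-room .oncall command into a trigger payload (a dict)."""
    match = ONCALL_PATTERN.match(command)
    if match is None:
        raise ValueError("not an .oncall command")
    return {
        "service_key": service_key,                      # identifies the target service
        "event_type": "trigger",
        "incident_key": "coss-" + match.group("ticket"),  # hypothetical key format
        "description": match.group("message"),
        "details": {"queue": match.group("queue")},
    }


payload = build_trigger_event(
    ".oncall [casc/int/systems, 32962] Pls check 32962 re Acme",
    service_key="EXAMPLE_SERVICE_KEY",  # placeholder, not a real key
)
```

Reusing the ticket number in the incident key is one way a bot could later match an acknowledgement back to the request that caused it, as in the transcript above.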

 

In addition to uniquely creating services for their team members, Cascadeo creates on-call queues in PagerDuty that are associated with each customer. This allows for the NOC, the lead engineer, or the project manager associated with the client, to send alerts to the client’s escalation contacts. PagerDuty’s flexibility in defining escalation policies is very useful in complying with specific customer alerting requirements.

“We live and breathe 24×7 Operations, both in the cloud and in the data center,” says Ophir Ronen, a principal at Cascadeo. “For us, PagerDuty is a key tool in our handling of mission critical operations.”

In addition to leveraging PagerDuty, Cascadeo deploys Zenoss to all of their clients. As a part of the provisioning process, they spend a significant amount of time tuning Zenoss to increase the signal to noise ratio. When an NMS is first installed, it generates an enormous amount of noise (events which are not actionable).

Cascadeo conducts a series of triage sessions, typically twice per week, where they work with their clients to categorize the top 10 noisiest events as either actionable or noise. Actionable events require remediation and escalation data that is collated and embedded into the event itself. That way, when the issue recurs, Cascadeo’s NOC and on-call engineers will have the remediation/escalation information immediately at hand which dramatically reduces mean time to repair.
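The first step of such a triage session, finding the noisiest events, can be sketched in a few lines of Python. This is only an illustration: the event dictionaries, the `component` field, and the `noisiest_events` name are assumptions, not part of Zenoss or COSS.

```python
from collections import Counter


def noisiest_events(event_log, top_n=10):
    """Return (component, count) pairs for the most frequent event sources."""
    counts = Counter(event["component"] for event in event_log)
    return counts.most_common(top_n)


# Hypothetical event log: three DNS check failures, two disk alerts, one ping alert.
sample_log = (
    [{"component": "CheckDNSELB"}] * 3
    + [{"component": "disk"}] * 2
    + [{"component": "ping"}]
)
top_two = noisiest_events(sample_log, top_n=2)
```

Each entry in the resulting list is then a candidate for the "actionable or noise" decision described above.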

The COSS solution lives in the world of web services. Not only does it reach out to PagerDuty’s APIs, but it too has APIs. For example, by using the COSS Alerts API, Cascadeo can receive specific alerts from Zenoss, buffer and pass them through the COSS platform for tracking, and then pass on those alerts to PagerDuty to trigger an alert.

Cascadeo is a rapidly growing company of more than 80 people, distributed across 6 time zones. They are able to offer extremely high levels of service delivery thanks to their talented teams and the COSS platform. According to Ophir, “Our integration with PagerDuty via our COSS platform allows us to easily activate our resources, distributed around the world, to help our clients. Guaranteed delivery and multi-cloud redundancy is key for us which is why we selected PagerDuty as the tool to handle the critical alert and messaging functions of our OSS”.

Posted in Best Practices, Blog, Customer | Leave a comment

Trading up Your Engine: How to Move Your IOPS-heavy MySQL/Rails Stack to Unicode Without Downtime

Out with the old, in with the New

You’re a techie working for one of the multitude of startups that rushed to market, where the founders hastily glued a Rails app together with candy-bar wrappers and tinfoil.  Once it became obvious that enthusiasm was no substitute for raw coding power, developers were hired to paper over holes in the software architecture.  Finally, when those developers realized what manner of untamed beast the app was, they hired you to clean up the mess and make things pretty.

You know your stack.  You’ve got an old MySQL database; probably MySQL 5.0 or 5.1.  It was set up with default settings (read: we support English) from day one, and likely the only real change (“advancement”) anyone added since then is a read-slave and asynchronous replication.  After years of continued operation in this mode, your devs have come up with a thousand unmaintainable, awful fixes to allow some non-ASCII characters to be stored in BLOB fields.  Meanwhile, your support people complain that most of the planet gets errors using your app with their non-romanized names, and management is annoyed at the sheer number of subtly different transliteration functions in the code.

This was the situation at PagerDuty several months back, and this article discusses how we fixed it – how we transitioned from MySQL 5.1 storing latin1 (ISO-8859-1) characters to the shiny MySQL 5.5 with unicode (UTF-8) characters… and how you never noticed.

The Problem With MySQL in a Nutshell

The character sets used by MySQL when writing data to disk impose some limitations on your application.  A naïve user might claim that MySQL doesn’t need to know anything about character sets, which would make sense if you wanted poor performance when sorting your strings; your CHARs and VARCHARs.  Since you want to take advantage of database indexing to perform implicit server side sorts (in-process client-side sorting:  please die), MySQL has to understand the characters you’re using so that it has a context to sort in that isn’t just ordinal value.  Unfortunately, the default character set MySQL understands is latin1, which excludes symbols used in approximately 90% of the world.  A unicode character set like UTF-8 is much more appropriate when you intend to store multinational-symbol strings without resorting to BLOBs.

MySQL character sets are cooked into a column at column create time.  Of course MySQL has long allowed you to ALTER TABLE and modify this property, which makes it easier to move from one character set to another, but ALTER TABLE locks the whole table when doing its work, which is no good for live applications under heavy write, where your users expect continued responsiveness.  Something a little more involved is necessary.  This is the story of that something.

Before You Get Started, Read the Requirements

At PagerDuty, we saw this challenge as a surmountable technical obstacle that shouldn’t impact our business.  Namely:

  • For as long as we continue to use MySQL, we never again want to migrate datastores due to symbol-related storage/input problems (we want to accept a universal symbol set).
  • This switch needed to have at most negligible impact on the ongoing performance of the PagerDuty application (no new cloud infrastructure could be brought up for the purposes of event throughput).
  • Corollary:  no significant storage resources should be newly allocated to accommodate UTF-8-encoded MySQL characters (we allow for at most 2 times the old storage requirements; this is not unreasonable given that most of our users will simply use romanized characters, expecting anything else to fail).
  • The whole procedure that gets this out the door should incur negligible (< 1 minute) downtime for our users.

Sound ambitious? This is the minimum set of conditions we were given, and we’re pleased to say we succeeded in meeting all of them.

MySQL – Unwinding a Plethora of Insanity

MySQL makes converting to UTF-8 incredibly painful, in order to try to cover up the limitations of the InnoDB engine.  We begin by discussing problems with indices over CHAR/VARCHAR data, assuming you use InnoDB (which we used, because at least our server was not from the Stone Age).

Did you know that InnoDB has low limits on the size of single-column indices?  We didn’t either, but we found out the lengths those sneaky MySQL devs went to in order to keep this from hurting you, the unwary user.  You see, MySQL 5.1’s “utf8” encoding is not true UTF-8.  UTF-8 encodes symbols in 1 to 4 bytes.  MySQL’s utf8 supports only symbols of 1 to 3 bytes.  This breaks our first objective – to support all characters.  In order to solve that little oversight, we used the “utf8mb4” encoding supplied in MySQL 5.5[1]… except we weren’t yet running 5.5.  Our solution to this problem required flipping database servers (I did say there’d be a bit of downtime!) – but we’ll get to that.
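To make the 3-byte versus 4-byte distinction concrete, here is a minimal Ruby sketch that prints how many bytes a few sample characters need in UTF-8.  The helper name fits_mysql_utf8? and the sample characters are ours, chosen purely for illustration.

```ruby
# Hypothetical helper: true if every character of the string encodes to at
# most 3 bytes in UTF-8, i.e. it would fit in MySQL's 3-byte "utf8" charset.
def fits_mysql_utf8?(string)
  string.encode("UTF-8").each_char.all? { |char| char.bytesize <= 3 }
end

# Sample characters, written as escapes so the file encoding doesn't matter.
samples = {
  "a"         => "ASCII letter",
  "\u00e9"    => "Latin letter with an accent",
  "\u20ac"    => "euro sign",
  "\u{1F600}" => "emoji outside the Basic Multilingual Plane"
}

samples.each do |char, description|
  puts format("%-45s %d byte(s), fits 3-byte utf8: %s",
              description, char.bytesize, fits_mysql_utf8?(char))
end
```

Only the last sample needs 4 bytes, which is exactly the class of symbol that the 3-byte encoding cannot store.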

Initial testing of MySQL 5.5 was positive until we attempted to recreate our production tables via mysqldump[2] with UTF-8 encoding in place of latin1:

mysqldump -d the_database | sed -e "s/\(.*DEFAULT CHARSET=\)latin1/\1utf8mb4/" | mysql the_database_utf8
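For anyone squinting at that sed expression, the same substitution can be sketched in Ruby and run against a few sample lines.  The schema lines below are made up to resemble mysqldump -d output, and retarget_default_charset is a name we invented for this sketch.

```ruby
# Ruby equivalent of the sed expression: rewrite the table-level default
# charset on a single line of a schema dump.
def retarget_default_charset(line)
  line.sub(/(.*DEFAULT CHARSET=)latin1/, '\1utf8mb4')
end

# Hypothetical lines resembling `mysqldump -d` output.
schema = [
  ") ENGINE=InnoDB DEFAULT CHARSET=latin1;",
  "  `name` varchar(255) CHARACTER SET latin1 NOT NULL,",
  ") ENGINE=InnoDB DEFAULT CHARSET=utf8;"
]

schema.each { |line| puts retarget_default_charset(line) }
```

Only the first line changes.  Column-level CHARACTER SET clauses and tables that already use another default charset pass through untouched, so a schema that sets charsets per column would need more than this one-liner.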

Please don’t hit us.  We were startled by this strange error:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

Oh woe is MySQL!  To have seen what it has seen, see what it sees!  Indeed, if you glance at the fine print, InnoDB only supports single-column indices of at most 767 bytes.  Perhaps that’s why the “utf8” encoding only supports a maximum of 3 bytes per character:  it means conversions to utf8 from other charsets work when column indices are involved.  Consider:  in order for the index comparator to be fast, all entries need to be the same size, namely the maximum combined size of the column(s) they are striped over.  With a VARCHAR(255), a pretty standard cell type, and MySQL’s crippled utf8 encoding, max_length_of_string * max_size_of_char expands to 255 * 3 => 765.  With utf8mb4, 255 * 4 => 1020.  Oops.  What a pickle.
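The arithmetic is simple enough to check in a few lines of Ruby.  This is only a sketch of the worst-case calculation described above; the constant and method names are ours, and the two limits are the 767 and 3072 byte figures quoted in this article.

```ruby
# Worst-case bytes per character for each MySQL character set.
MAX_BYTES_PER_CHAR = { "latin1" => 1, "utf8" => 3, "utf8mb4" => 4 }.freeze

# InnoDB single-column index limits quoted in this article.
DEFAULT_LIMIT      = 767
LARGE_PREFIX_LIMIT = 3072

# Worst-case index entry size, in bytes, for a VARCHAR(length) column.
def index_bytes(length, charset)
  length * MAX_BYTES_PER_CHAR.fetch(charset)
end

MAX_BYTES_PER_CHAR.each_key do |charset|
  bytes = index_bytes(255, charset)
  puts format("VARCHAR(255) %-8s %4d bytes  default limit ok: %-5s large prefix ok: %s",
              charset, bytes, bytes <= DEFAULT_LIMIT, bytes <= LARGE_PREFIX_LIMIT)
end

# Longest utf8mb4 VARCHAR that still fits under the default limit.
puts DEFAULT_LIMIT / MAX_BYTES_PER_CHAR["utf8mb4"]
```

The last line works out to 191 characters, so shrinking indexed columns to VARCHAR(191) is the other way out if you can’t raise the limit.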

Thankfully, the link which describes this limitation also describes the workaround (allowing index size to grow to a max of 3072 bytes for a single column), which led to some of the following lines in our /etc/my.cnf file:

[client]
default-character-set = utf8mb4

[mysqld]
default-storage-engine = INNODB
sql-mode="NO_ENGINE_SUBSTITUTION"

# file_per_table is required for large_prefix
innodb_file_per_table
# file_format = Barracuda is required for large_prefix
innodb_file_format = Barracuda
# large_prefix gives max single-column indices of 3072 bytes = win!
# we'll also need to set ROW_FORMAT=DYNAMIC on each table, though.
innodb_large_prefix

character-set-client-handshake = FALSE
collation-server = utf8mb4_unicode_ci
# a repeated option overrides the earlier one, so both settings share one init-connect
init-connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
character-set-server = utf8mb4

[mysqldump]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

We’re convinced that there’s a more concise way to get what you want out of MySQL, but as the old adage tells us about this situation:  “take off, nuke the site from orbit… it’s the only way to be sure.”  If you boot up a server with this my.cnf file and mysql_install_db, CREATE TABLE statements specifying ROW_FORMAT=DYNAMIC will do the right thing, and give you CHAR/VARCHAR strings that can be indexed, while also supporting all the symbols you could ever want.

There is a somewhat-related problem here, that multi-column indices are also bound to a maximum of 3072 bytes.  This one might be harder to solve.  We didn’t have a clever solution to the issue – only one composite index was affected by it, and that index happened to run over a table with few rows (which was thus ALTERable).  The index ran over column phone_number which was needlessly a VARCHAR(255), so a quick ALTER TABLE (well, its abstracted cousin, the Rails “migration”) took care of this for us and sized it down.
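To illustrate why sizing down one column can rescue a composite index, here is a small sketch of the sum involved.  The column names and lengths are hypothetical (only phone_number as a VARCHAR(255) comes from our schema), and the 32-character replacement length is an assumption made for the example.

```ruby
COMPOSITE_LIMIT = 3072   # bytes, the multi-column limit mentioned above
BYTES_PER_CHAR  = 4      # utf8mb4 worst case

# Worst-case size of a composite index over [name, VARCHAR length] pairs.
def composite_index_bytes(columns)
  columns.sum { |_name, length| length * BYTES_PER_CHAR }
end

# Hypothetical index definition before and after shrinking phone_number.
wide    = [["email", 255], ["address", 300], ["phone_number", 255]]
trimmed = wide.map { |name, length| name == "phone_number" ? [name, 32] : [name, length] }

[wide, trimmed].each do |columns|
  bytes = composite_index_bytes(columns)
  puts format("%4d bytes, fits: %s", bytes, bytes <= COMPOSITE_LIMIT)
end
```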

Collations:  Learn to Stop Worrying and Love Unicode

Our thinking was that suddenly enlarging indices would turn our snappy MySQL server into a lumbering behemoth.  This turned out not to be the case – the net result of our migration was neutral, teetering on a small speed gain!  If you’re running a semi-modern Rails application, chances are, this will be true for you too.  The reason?  Collations.

Collations tell MySQL how to sort strings so that they make sense; intuitively, the string “abc” comes before “bbc” in an ascending sort because the leading ‘a’ alphabetically precedes ‘b’.  However, complicated characters require more complicated rules.  For example, the Deutsches Institut für Normung (DIN) defines two possible latin1 collations:  DIN-1 (German dictionary) ordering defines the ‘ß’ symbol as equivalent to ‘s’, and DIN-2 (German phone books) defines ß = “ss” (among other differences).  After the reduction is performed, a standard (English) lexical sort is used.
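The reduce-then-sort idea can be sketched in Ruby.  This is a deliberately simplified stand-in for the two German collations, covering only the umlauts and ß, and is not how MySQL implements them; the names DIN1, DIN2 and sort_key are ours.

```ruby
# Simplified reduction rules: dictionary order (DIN-1) and phone book order (DIN-2).
DIN1 = { "ä" => "a",  "ö" => "o",  "ü" => "u",  "ß" => "s"  }.freeze
DIN2 = { "ä" => "ae", "ö" => "oe", "ü" => "ue", "ß" => "ss" }.freeze

# Reduce a word to its sort key; the keys are then compared lexically.
def sort_key(word, rules)
  word.downcase.gsub(/[äöüß]/, rules)
end

words = %w[Mufti Müller Muth]

puts "DIN-1: " + words.sort_by { |word| sort_key(word, DIN1) }.join(", ")
puts "DIN-2: " + words.sort_by { |word| sort_key(word, DIN2) }.join(", ")
```

Under the dictionary rules Müller sorts as “muller” and lands after Mufti; under the phone book rules it sorts as “mueller” and lands before it.  Same data, different collation, different order.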

This matters when client connections want a query ordered by a string field, and thus you need some way to sort the strings.  MySQL collations, if used properly, give you that ordering for basically free (as long as you have an index over the string fields).  One often disregarded prerequisite to this benefit is that both client and server must be speaking the same character set – and the same collation.  It turns out that until our UTF-8 database migration, this was not true at PagerDuty.

Consider your Rails application.  Chances are, you’re using either the MySQL/Ruby or the mysql2 gem to power ActiveRecord.  They’re reading from a database.yml file which specifies what as the encoding type?  Oh, utf8?  If you dig through the gem code for a while (which we ended up having to do), you’ll notice that that encoding gets passed into the MySQL connection settings; it becomes the character-set (and defines the collation) used to talk to MySQL.  The fact that you set this to utf8 while talking to a latin1-backed DB is the core of the speedup you’re about to make.

Fact: this entire time, you’ve been wasting CPU cycles on sorting.  MySQL has abstracted this away, and has given you strings sorted in an order that your client (Rails) app understood, all the while having to maintain a mapping between a UTF-8 collation (probably utf8_general_ci) and whatever collation your tables have been bound by.  Don’t believe me?  Watch what happens when you set the client and server to both run at utf8mb4 with the same collation (we chose utf8mb4_unicode_ci; see here for a discussion of Unicode collation differences in MySQL).  Enjoy the speedup.  Thank me later.
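If you want to see the mismatch for yourself, one approach is to compare the character set and collation variables MySQL reports for a connection.  The sketch below assumes you have already fetched them into a Hash (for example from SHOW VARIABLES LIKE 'character_set%' and SHOW VARIABLES LIKE 'collation%'); the helper name and the sample values are hypothetical, and it checks the server defaults rather than each table’s own collation.

```ruby
# Variables that should all agree once client and server speak the same charset.
CHARSET_VARS = %w[
  character_set_client character_set_connection
  character_set_results character_set_server
].freeze
COLLATION_VARS = %w[collation_connection collation_server].freeze

# List every variable that disagrees with the expected charset or collation.
def charset_mismatches(vars, charset: "utf8mb4", collation: "utf8mb4_unicode_ci")
  wrong  = CHARSET_VARS.reject   { |name| vars[name] == charset }
  wrong += COLLATION_VARS.reject { |name| vars[name] == collation }
  wrong
end

# Hypothetical values for a utf8 Rails client talking to a latin1 server.
before = {
  "character_set_client"     => "utf8",
  "character_set_connection" => "utf8",
  "character_set_results"    => "utf8",
  "character_set_server"     => "latin1",
  "collation_connection"     => "utf8_general_ci",
  "collation_server"         => "latin1_swedish_ci"
}

puts charset_mismatches(before).inspect
puts charset_mismatches(before, charset: "utf8", collation: "utf8_general_ci").inspect
```

The second call shows the telltale sign:  the connection agrees with itself, and only the server-side variables stand out.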

Keep Your Data Chugging:  Migrate + Replicate + Upgrade

It took us this much text, but at last we get to the tricky part:  how are you going to migrate your old, coal-powered datastore?  You’ve already seen a clever hack using mysqldump + sed to load data into a new server.  But your old database is still under write – so now what?  The solution here involves MySQL throwing you a bone with this insistence on knowing and separating client/server character sets.

We’d love to see the internals of how this works, but unfortunately I wasn’t able to take the time to read the MySQL source code or find credible information about this.  Using the above config file for our new server, simply setting up master/slave replication between our old DB and the new 5.5 UTF-8 one worked flawlessly.  We tested all manner of latin1 characters inserted into the old database, and they came back without issue in the replicated copy.  MySQL was doing all the correct translations, and we just had to sit back and watch.  Once the mesmerizing effect wore off, it was time to do some work – namely, all of our webservers needed to have their mysql-client packages updated.  For you see, mysql-client 5.1 doesn’t speak utf8mb4, and will have some issues talking to your 5.5 server also.

In order to do this, we used chef to quickly spin up new app-backend servers – clones of our existing servers – but with the new mysql-client version, and configured so that the background workers (all of PagerDuty’s queue processing and asynchronous tasks) were disabled.  The wonderful fruits of the cloud – occasionally useful!  These servers were pointed at the slave database, and were already configured (via chef environment settings, which make much more sense to those who use chef) to fully support utf8mb4 all the way through Rails via mysql2.  With these bad boys ready (and some testing to ensure that they worked the way our testing environments had), we were ready to flip our database.

Do it, Do it Now!  Come on, Flip Me!

The flip process is that incredibly risky moment when you’re simply not sure if everything will work or if you missed some crucial detail, and your customers are about to be very unhappy.  You nervously go through your checklist, making sure that you underscore the critical moment of no return.  In our flip, we had the following components:

  • shut down current background workers on the old app-backends
  • (at this point we’re no longer processing notification sending, but still queueing requests)
  • lock the master database
  • (at this point we are fully stopped – this is the downtime you were warned about!  New requests are frozen)
  • stop and reset the slave
  • run chef on our customer-facing load-balancers, bringing them into the new chef environment and changing them to use our newly spun-up machines as app-backends
  • (at this point we’re taking requests again, pre-flip requests will time-out)
  • spin up background workers on the new app-backends
  • (at this point we are fully functional)
  • terminate the old app-backends

As you can imagine since I’m talking to you about it now, all of these executed without error.  You’ve read in another post how our background processes run, and how easy it is to script monit, particularly in conjunction with chef, to shut down and spin up our background tasks.  Chef-client executions on our load-balancers typically complete within 20 seconds as a result of tireless work by our ops team, so we knew this would be the upper bound for our downtime.  The only SQL commands we had to run to jog things into action were:

(on the master)

BEGIN;
FLUSH TABLES WITH READ LOCK;

(on the slave, once you verify it’s caught up with the master)

STOP SLAVE;
RESET SLAVE;
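The “caught up” check deserves a moment of care before you stop the slave.  A minimal sketch, assuming you have the output of SHOW MASTER STATUS (taken after the lock) and SHOW SLAVE STATUS as Hashes keyed by column name; the helper name and the sample coordinates are made up for illustration.

```ruby
# True when the slave's SQL thread has executed everything the locked master
# has written, judged by comparing binary log coordinates.
def slave_caught_up?(master, slave)
  slave["Slave_SQL_Running"] == "Yes" &&
    slave["Relay_Master_Log_File"] == master["File"] &&
    slave["Exec_Master_Log_Pos"].to_i == master["Position"].to_i
end

# Hypothetical status rows.
master = { "File" => "mysql-bin.000042", "Position" => "107344" }
lagging = {
  "Slave_SQL_Running"     => "Yes",
  "Relay_Master_Log_File" => "mysql-bin.000042",
  "Exec_Master_Log_Pos"   => "107001"
}
current = lagging.merge("Exec_Master_Log_Pos" => "107344")

puts slave_caught_up?(master, lagging)
puts slave_caught_up?(master, current)
```

Because the master is locked, its position no longer moves, so once the two coordinates match they stay matched.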

That was it.  The stress was gone, and you, the customer, barely noticed we were temporarily ignoring your events.  All that was left was leisurely reconfiguration of our slaves and backups to target a new MySQL server.  Oh, and Rails needed some love; here’s what we added to our application’s config/initializers/activerecord_ext.rb:

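module ActiveRecord
  module ConnectionAdapters
    module SchemaStatements
      # Wrap create_table so every new table defaults to utf8mb4 and the
      # DYNAMIC row format that innodb_large_prefix requires.
      def create_table_with_dynamic_row_format(table_name, options = {}, &block)
        new_options = options.dup
        # Build a new string rather than appending with <<, which would mutate
        # the caller's :options string (dup is only a shallow copy).
        new_options[:options] = "#{new_options[:options]} DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC"
        create_table_without_dynamic_row_format(table_name, new_options, &block)
      end

      # alias_method_chain is the ActiveSupport helper available in Rails of this era.
      alias_method_chain :create_table, :dynamic_row_format
    end
  end
end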
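A note for readers on newer Rails:  alias_method_chain was deprecated in Rails 5.0 and removed in 5.1, and the usual replacement is Module#prepend.  The sketch below shows the same wrapping technique against a stand-in class so that it runs without Rails.  FakeSchema and DynamicRowFormat are hypothetical names, and which real adapter class you would prepend to depends on your Rails version, so treat that part as an assumption to verify.

```ruby
# Stand-in for the schema statements, so the sketch runs without Rails.
class FakeSchema
  def create_table(table_name, options = {})
    [table_name, options]
  end
end

# Same behaviour as the initializer, expressed with prepend and super.
module DynamicRowFormat
  TABLE_OPTIONS = " DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC"

  def create_table(table_name, options = {}, &block)
    new_options = options.dup
    new_options[:options] = "#{new_options[:options]}#{TABLE_OPTIONS}"
    super(table_name, new_options, &block)
  end
end

FakeSchema.prepend(DynamicRowFormat)

p FakeSchema.new.create_table(:users, { options: "ENGINE=InnoDB" })
```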

Post-Mortem

If you made it this far, congratulations dear reader – you’re dedicated.  Hopefully, this article inspires you to do good and erase the stains of years of predominantly latin1 applications in your company.  Give that engine a good overhaul.  Be warned though… in the technologically-misimplemented world of Unicode, the excitement never ends.  There are always more components to bring into the 21st-century language mix.

[1] If you oppose Han Unification, then after all the effort here we unfortunately still don’t support your eclectic characters.

[2] Your mysqldump should be set to use the utf8 character set (a strict superset of latin1) to generate your text files, otherwise you may see a whackload of gibberish inserted into your new DB.
