Availability lessons from shoe companies and ancient warlords

This is the second in a series of posts on increasing overall availability of your service or system.

In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failures (MTBF) and mean time to recovery (MTTR).  Both increasing MTBF and reducing MTTR are important, but reducing MTTR is arguably easier.  It doesn’t take months of engineering work and capital expenditure to show results; it can often be achieved incrementally with some additional tools, procedures, and processes.

In this post, we’ll talk about things you can do today to help reduce MTTR and effectively increase your availability.

Just do it!

During any outage scenario, minutes matter.  Depending on the business, every additional minute of downtime could result in lost revenue, lost customer trust, or worse.  To avoid wasting these precious minutes – every one of which adds directly to your MTTR – cultivate a ‘bias for action’ attitude within your team.

What ‘bias for action’ means in an operational situation:  if at any point you have a hypothesis about what is causing your outage, and you or someone on your team comes up with an idea for a fix that might help, just do it.  Just give it a shot.

This attitude will help prevent indecision paralysis from gripping you when you’re faced with a really bad operational problem and not enough data to make a completely informed, well-thought-out decision on what approach to try.  Making the perfect fix 2 hours into an outage is almost always less of a win than making an imperfect but helpful fix 15 minutes in.  And who knows:  one of the first things you try might actually fix the problem completely.  There’s only one way to know for sure:  just try it.  An outage is not the time to be risk-averse.

An important factor in cultivating this operational attitude within your organization is to not penalize people for making mistakes or taking risks while trying to fix a large operational problem.  Bouncing the entire fleet of frontends made the problem worse for a little while?  Oh well; it was worth a try.  We won’t try that solution as quickly next time.

Of course, while you should take risks and try things, there’s no point being stupid about it.  Before truncating that database table, make sure you have a backup.  Before running that update statement across 12 thousand rows, have someone double-check the SQL for you, and break it up into multiple transactions if you can.  And weigh the magnitude of the problem against the fix that you’re attempting:  outages are one thing, but if your systems are instead only working at reduced capacity or functionality, you might want to hold off on those radical fixes where the potential downside for messing up is very large or catastrophic.
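For example, here’s a minimal sketch of the “break it up into multiple transactions” idea, using Python’s built-in sqlite3 module.  The orders table, the status column, and the batch size are all illustrative assumptions, not a prescription for your schema:

```python
# A sketch only: "orders", "status", and BATCH_SIZE are illustrative assumptions.
import sqlite3

BATCH_SIZE = 1000  # keep each transaction small so a mistake stays small too

conn = sqlite3.connect("example.db")
try:
    while True:
        with conn:  # each loop iteration commits as its own transaction
            cursor = conn.execute(
                """
                UPDATE orders
                   SET status = 'archived'
                 WHERE rowid IN (
                     SELECT rowid FROM orders WHERE status = 'stale' LIMIT ?
                 )
                """,
                (BATCH_SIZE,),
            )
        if cursor.rowcount == 0:
            break  # nothing left to update
finally:
    conn.close()
```

Working in small batches means that if the statement turns out to be wrong, you’ve only damaged one batch, not all 12,000 rows at once.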

Know your enemies, and yourself

So it is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. – Sun Tzu

Enemies

You and your team should be very familiar with the common problems – enemies – that your service or system faces on a day-to-day basis.  I don’t mean the most catastrophic and exotic potential problems that you might face, but the real and everyday failure modes that your system has encountered in the past, and almost certainly will encounter in the future.

You know what kind of problems I’m talking about:  that scaling issue that hasn’t yet been licked, that fussy legacy service that occasionally flakes out, that database that has the nasty habit of just seizing up, or those specific sets of circumstances that kick off a message storm.  Whatever the failure mode, if it’s something that crops up frequently in your system, everyone on your team should know (or be trained) on how to handle it.

Yes, you’re probably working on fixing the root cause of the problem (and if not, you should be), and you’re hoping it’ll soon be a distant unhappy memory.  But many of your most chronic problems can’t be easily and completely vanquished – or you’d have done it long ago – and some legacy systems are difficult to change.  So you should still have documented and detailed procedures for how to face these problems, and this documentation should be easily accessible to your team during future incidents.  An internal wiki is a great place for this Emergency Operations Guide.

Yourself

You and your team should also be very familiar with the tools you have at your disposal to be able to understand the state of your systems.

To begin, I’ll start with the most obvious: you need to know there is a problem in order to be able to fix it. So you need monitoring. Lots of it.

Monitor at the host level: CPU, free memory, swap usage, disk space, disk IOPS, network I/O, or whatever host-level attributes are important for the system in question. There are tons of great monitoring solutions available for this.
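As an illustration, here’s a minimal sketch of a host-level check in Python using the third-party psutil library.  The library choice and every threshold below are my assumptions, not a recommendation of any particular monitoring product:

```python
# A sketch only: psutil and all thresholds here are illustrative assumptions.
import psutil

def check_host():
    """Return a list of human-readable problems found on this host."""
    problems = []

    cpu = psutil.cpu_percent(interval=1)  # % CPU busy over a 1-second sample
    if cpu > 90:
        problems.append(f"high CPU usage: {cpu:.0f}%")

    memory = psutil.virtual_memory()
    if memory.percent > 90:
        problems.append(f"low free memory: {memory.percent:.0f}% used")

    swap = psutil.swap_memory()
    if swap.percent > 50:
        problems.append(f"heavy swap usage: {swap.percent:.0f}% used")

    disk = psutil.disk_usage("/")
    if disk.percent > 85:
        problems.append(f"disk nearly full: {disk.percent:.0f}% used")

    return problems

if __name__ == "__main__":
    for problem in check_host():
        print(problem)
```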

Monitor at the application level: Set up monitors to check and report on application health metrics like request latency, throughput, processing delays, queue sizes, error rates, database performance, end-to-end performance, etc. Set up log scans that continuously watch your service logs for bad signs. If you have an externally-facing system, set up an external monitor to check whether your system/website is up, accepting requests, and healthy.
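For instance, a bare-bones external “is it up, and fast enough?” check might look like the following sketch, using only the Python standard library.  The health-check URL and the latency threshold are illustrative assumptions:

```python
# A sketch only: the URL and latency threshold are illustrative assumptions.
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.com/health"  # hypothetical health endpoint
MAX_LATENCY_SECONDS = 2.0

def check_site():
    """Return None if the site looks healthy, otherwise a description of the problem."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
            latency = time.monotonic() - start
            if response.status != 200:
                return f"unhealthy: HTTP {response.status}"
            if latency > MAX_LATENCY_SECONDS:
                return f"slow: responded in {latency:.2f}s"
            return None
    except (urllib.error.URLError, OSError) as exc:
        return f"down: {exc}"

if __name__ == "__main__":
    print(check_site() or "healthy")
```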

Know how to use your monitoring tools.  These tools are awesome resources for quickly and visually figuring out what’s going wrong during a failure, but if your team doesn’t know how to use them or their UI (or even how to log in), they’re not going to be of much use.  Add links to interesting and useful monitoring graphs and charts to the Emergency Operations Guide mentioned above, for quick access.

But these monitoring systems aren’t of much use if nobody is listening to them. You should – shameless plug! – use a system like PagerDuty to bridge the gap between your monitoring systems and your on-call staff (the people who will actually fix the problem), as well as to organize on-call schedules and escalation policies.  Another, more expensive, option would be something like a NOC (network operations center).  I won’t harp too much on this point, but there’s no reason for there to be any delay whatsoever between your monitoring systems realizing there’s a problem and you realizing there’s a problem.
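To make that bridge concrete, here’s a minimal sketch of a check handing an alert off to PagerDuty.  It assumes the Events API v2 endpoint and a routing key from one of your service integrations; treat it as an illustration and verify the details against your own account’s integration settings:

```python
# A sketch only: assumes the PagerDuty Events API v2 endpoint and a routing key
# from one of your service integrations; check your own integration settings.
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder

def trigger_incident(summary, source, severity="critical"):
    """Send a trigger event to PagerDuty; returns the HTTP status (202 = accepted)."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # short, human-readable description of the problem
            "source": source,      # the host or service that noticed it
            "severity": severity,  # critical, error, warning, or info
        },
    }
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status

# For example, wire in the host check sketched earlier:
#   for problem in check_host():
#       trigger_incident(problem, source="web-01.example.com")
```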

I’ll follow up soon with another post detailing some more of my favorite tips.