<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PagerDuty Blog</title>
	<atom:link href="http://blog.pagerduty.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.pagerduty.com</link>
	<description></description>
	<lastBuildDate>Tue, 08 May 2012 01:16:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>PagerDuty Pricing Changes</title>
		<link>http://blog.pagerduty.com/2012/04/pagerduty-pricing-changes/</link>
		<comments>http://blog.pagerduty.com/2012/04/pagerduty-pricing-changes/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 18:16:53 +0000</pubDate>
		<dc:creator>Alex Solomon</dc:creator>
				<category><![CDATA[Announcements]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=2083</guid>
		<description><![CDATA[UPDATE: The new pricing is now live! Please don&#8217;t hesitate to get in touch with us if you have any issues signing up under the new system or upgrading from an older plan. We&#8217;d like to announce we are making &#8230; <a href="http://blog.pagerduty.com/2012/04/pagerduty-pricing-changes/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><strong>UPDATE: The new pricing is now live! Please don&#8217;t hesitate to get in touch with us if you have any issues signing up under the new system or upgrading from an older plan.</strong></p>
<p>We&#8217;d like to announce we are making changes to our pricing for PagerDuty: we&#8217;re simplifying the pricing scheme, reducing overage alert charges, and increasing the price per user. Traditionally, pricing changes have been a very touchy subject for SaaS companies. We would like to do the right thing for all of our customers, so we are automatically grandfathering all existing accounts. This means if you&#8217;re a PagerDuty customer, you won&#8217;t be subject to any potential price increases.</p>
<h2>Why are you changing your prices?</h2>
<p>When we first launched PagerDuty, we based our pricing model on the 37signals classic <a href="http://highrisehq.com/signup" target="_blank">tiered model</a> where prices double as you go up the tiers but you also get more than double the stuff. This pricing model served us well in the early days. Over the last year, however, we&#8217;ve run into several limitations:</p>
<ul>
<li>Smaller IT ops teams are paying for a larger plan than what they need on a user basis, in order to get more alerts.</li>
<li>We have lots of companies with well over 25 users. Since our max plan, the X-Large, only supports 25 users, we find ourselves spending a lot of time doing custom billing for customers.</li>
<li>Our current pricing has expensive overage alerts on most of the plans ($0.60 each). If one of our customers had a bad month and used a lot of alerts, they&#8217;d end up with a large bill from us, which sucks.</li>
</ul>
<p>So, we&#8217;re making the switch to a per user per month pricing model. The main goal for the new pricing is simplicity: no more pricing tiers, user limits or alert limits; you can add as many users as you need. We&#8217;ve also eliminated all overage alert charges for US/Canada and reduced the price for overages in other countries.</p>
<h2>What is the new pricing?</h2>
<p>We&#8217;re scrapping the tiered plans and switching to per user per month simple pricing:</p>
<ul>
<li>$18 per user per month</li>
</ul>
<p>Every user gets:</p>
<ul>
<li>UNLIMITED phone &amp; SMS alerts to US and Canada</li>
<li>20 phone &amp; SMS alerts per user per month to all other supported countries</li>
<li>UNLIMITED email alerts</li>
</ul>
<p>(Additional international phone &amp; SMS alerts are $0.35 each)</p>
<p><strong>The good news</strong>: Unlimited alerting for US/Canada, which means no more overage charges ever. For international accounts, we&#8217;re including 20 phone &amp; SMS alerts per user per month. These alerts pool together and can be used by any user in the account. We&#8217;ve also reduced the price for international overages, should you need more.</p>
<p><strong>The not-so-good news</strong>: A higher per user per month price than the old pricing.</p>
<p><strong>Startups and small teams</strong>: We&#8217;ll also offer a Starter plan at $9 / user / month for up to 3 users. This plan comes with 50 US/Canada phone &amp; SMS alerts per user per month and 10 International phone &amp; SMS alerts per user per month.</p>
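<p>As a rough illustration of the arithmetic above, here is a minimal Python sketch that estimates a monthly bill on the standard $18 plan. The function and constant names are hypothetical and exist only for this example. The Starter plan is left out because its overage rates are not stated here.</p>

```python
from decimal import Decimal

# Figures taken from the announcement; the names are illustrative only.
PRICE_PER_USER = Decimal("18.00")     # standard plan, per user per month
INTL_ALERTS_PER_USER = 20             # international alerts, pooled per account
INTL_OVERAGE_PRICE = Decimal("0.35")  # each additional international alert


def monthly_cost(users, intl_alerts):
    """Estimate a standard-plan monthly bill in US dollars.

    US/Canada phone and SMS alerts and all email alerts are unlimited,
    so they do not appear in the calculation.
    """
    pooled_allowance = users * INTL_ALERTS_PER_USER
    overage_alerts = max(0, intl_alerts - pooled_allowance)
    return users * PRICE_PER_USER + overage_alerts * INTL_OVERAGE_PRICE


# 5 users pool 100 international alerts; 130 sent means 30 are billed.
print(monthly_cost(users=5, intl_alerts=130))  # prints 100.50
```

<p>Because the international allowance is pooled, a single busy user does not trigger overage charges as long as the account as a whole stays within its allowance.</p>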
<h2>When will the changes happen?</h2>
<p>We&#8217;ll switch to the new pricing on May 7, 2012 (3 weeks from today).</p>
<h2>How does this impact me?</h2>
<p><strong>If you&#8217;re already a paying customer</strong>: Nothing will change. All paying customers are automatically grandfathered into their current pricing plan indefinitely. That being said, after the new pricing launch date, you&#8217;ll no longer be able to switch to another one of the old plans (so, if you&#8217;re on the Large, you wouldn&#8217;t be able to upgrade to the X-Large or downgrade to the Medium after May 7). Of course, after May 7th, you&#8217;ll have the option to upgrade to the new pricing or stay on your existing plan.</p>
<p><strong>If you&#8217;re currently on your 30-day trial</strong>: You should check and see if our existing plans are cheaper for your team than $18/user/month &#8212; if you use a lot of overage alerts each month, they might not be. If they are, sign up for a paid plan before May 7th and you&#8217;ll be able to keep that price forever. If you don&#8217;t, you&#8217;ll pay $18/user/month (or $9/user/month for the small team Starter plan).</p>
<p>If you have any questions or feedback, please <a href="http://www.pagerduty.com/contacts" target="_blank">contact us</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/04/pagerduty-pricing-changes/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>New Reporting Feature: Who Ever Said Numbers Were Boring?</title>
		<link>http://blog.pagerduty.com/2012/04/who-ever-said-numbers-were-boring/</link>
		<comments>http://blog.pagerduty.com/2012/04/who-ever-said-numbers-were-boring/#comments</comments>
		<pubDate>Wed, 04 Apr 2012 18:30:24 +0000</pubDate>
		<dc:creator>Ian Enders</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Features]]></category>
		<category><![CDATA[announcements]]></category>
		<category><![CDATA[functionality]]></category>
		<category><![CDATA[reporting]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=2010</guid>
		<description><![CDATA[When we aren&#8217;t dealing with event storms and cloud outages, we&#8217;re working hard on improving our product for you. One such effort has been to let you use more of the wealth of information we store for your account. We &#8230; <a href="http://blog.pagerduty.com/2012/04/who-ever-said-numbers-were-boring/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/SampleReport-1.png" alt="" title="Sample PagerDuty Report" width="449" height="241" class="aligncenter size-full wp-image-2021" /><br />
When we aren&#8217;t dealing with event storms and cloud outages, we&#8217;re working hard on improving our product for you. One such effort has been to let you use more of the wealth of information we store for your account. We already keep track of every incident you&#8217;ve ever sent us, every phone call we&#8217;ve ever sent you, and every SMS message that ever woke you up. As it turns out, if we improve the navigability and visibility of that information it can tell some interesting stories.</p>
<p>In the past, our Reporting tab on your PagerDuty account read more like a phone bill than a report. You could see the total number of alerts sent for a given month, and get a specific list of phone calls made or emails sent for that month. That was about it.</p>
<p>Sure, you can still do that. But you can now do a lot more. </p>
<p><span id="more-2010"></span></p>
<p>You can now get a high level view of how your operations are trending over time. Are you getting better at it? Worse? What weeks last year were totally abysmal? Do your employees get a case of the Mondays after being hounded by operational issues? Are your weekends and holidays generally light on headaches and nice to your Ops Engineers? What was the all-time worst day for operations last year? You can now answer these questions quickly and relatively easily by visualizing the data in PagerDuty. Just tune your date range with some of our default date-ranges, or use your own. Change your data granularity by setting the data rollup and you&#8217;re off to the races.</p>
<p>Once you&#8217;ve found the data you care about, you can drill down to get specific details about what alerts were firing or what incidents were causing them for a particular time range. Once you&#8217;ve drilled down, you can further refine your time range query to locate operational problems. Maybe you&#8217;ll be able to correlate events you never realized were related.</p>
<p>If the new reporting interface doesn&#8217;t provide you all of the data slicing you need, you can always export the data to CSV, open it up in Excel and perform further ninja operations on your numbers.</p>
<p>Regardless, we&#8217;re just getting started and we&#8217;ve barely scratched the surface on things we can do. We have lots more ideas on where we can go from here. What do <em>YOU</em> want to see?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/04/who-ever-said-numbers-were-boring/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>We&#8217;re switching to Free + Ad-Supported (April Fools)</title>
		<link>http://blog.pagerduty.com/2012/04/were-switching-to-free-ad-supported/</link>
		<comments>http://blog.pagerduty.com/2012/04/were-switching-to-free-ad-supported/#comments</comments>
		<pubDate>Sun, 01 Apr 2012 14:00:01 +0000</pubDate>
		<dc:creator>Alex Solomon</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[stories]]></category>
		<category><![CDATA[worst practices]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=2032</guid>
		<description><![CDATA[This was an April Fools post, we&#8217;re quite happy with our current business model. We enjoyed writing it though, so we&#8217;ll keep it up: We are very excited to announce a major change in the business model of PagerDuty: free &#8230; <a href="http://blog.pagerduty.com/2012/04/were-switching-to-free-ad-supported/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><strong>This was an April Fools post, we&#8217;re quite happy with our current business model.  We enjoyed writing it though, so we&#8217;ll keep it up:</strong></p>
<p>We are very excited to announce a major change in the business model of PagerDuty: free + ad-supported. As you all know, our old business model involved charging money for the PagerDuty product. In fact, we didn&#8217;t even have a free plan; only paid plans with a 30-day free trial.</p>
<p>This is a perfectly cromulent business model which works for many companies, but it just doesn&#8217;t scale as we set our sights on becoming the next Facebook or Google. Both Facebook and Google have free products and both companies monetize these products by showing lots of very relevant ads to their users. These companies are definitely web-scale and so is their very successful business model of &#8220;free + ad-supported&#8221;.</p>
<h2>Visual display ads (aka &#8220;ignorable&#8221; ads)</h2>
<p>After making the decision internally to switch to free + ad-supported, we were faced with an essential question: where and how to advertise? Everyone&#8217;s already doing visual display ads. In fact, display ads became popular back in the 90s during the first dot-com bubble.</p>
<p>The main issue we have with display ads is that they&#8217;re very easy to ignore: you can do so by just looking away. We&#8217;ve dubbed these types of ads &#8220;ignorable&#8221; and we&#8217;ve decided not to adopt them for our web application. If we sold these types of ads to advertisers, we&#8217;d feel like we&#8217;re ripping them off because 99.9% of people just ignore them (assuming they haven&#8217;t installed ad-block to do so automatically). As such, we are perfectly content to let Facebook and Google fight over the ignorable display ad scraps.</p>
<p><img class="aligncenter size-full wp-image-2049" title="Google ignorable ads" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/google_ignorable_ads1.png" alt="Google ignorable ads" width="400" height="228" /></p>
<h2>Ads you can&#8217;t ignore</h2>
<p>We knew we had to come up with a revolutionary new type of ad, one that you can&#8217;t just ignore. We tried really hard to do so and even had multiple company-wide brainstorming meetings in the process, yet we were completely stumped. Then, one day, one of our interns came up with an idea: advertise <strong>inside</strong> the PagerDuty alerts. Brilliant!</p>
<p>As you know, PagerDuty plugs into a variety of monitoring systems and manages alerting, escalation and on-call scheduling. We dispatch phone call alerts, SMS alerts and email alerts to our users. Instead of messing about with display ads, we thought let&#8217;s leverage the alerts we already send to our users in order to advertise to them. Let&#8217;s go through an example to see how it works:</p>
<ul>
<li>Let&#8217;s say you are on-call this week and your entire data center goes down at 2am.</li>
<li>Your monitoring system notices and tells PagerDuty.</li>
<li>We ring your cellphone and wake you up.</li>
<li>When you pick up, you hear &#8220;PagerDuty alert&#8221; followed by a quick audio advertisement from one of our many ad sponsors, followed by the details of the particular issue (in this case &#8220;Your data center is down.&#8221;)</li>
</ul>
<p><img class="size-full wp-image-2054 alignright" title="A Captive Audience" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/captive-audience-200-annotated.png" alt="A Captive Audience" width="200" height="150" /></p>
<p>The best part is that you can&#8217;t ignore these ads. In order to hear the critical details of your outage, you have to sit patiently and listen to a quick 30-second spot about delicious Pepsi Cola or durable Goodyear Tires. For those of you who are responsible for on-call duty, <strong>it&#8217;s your job</strong> to receive alerts from our service and listen to them, including the ad within. Brands can now, for the first time in history, advertise to an extremely captive audience, one whose job it is to listen to our alerts and the brands&#8217; ads.</p>
<h2>The new-and-revolutionary Alert-Ad™ ad network</h2>
<p>We knew we were onto something really big here. The world has already experienced ads on the web, TV, radio, billboards, magazines and newspapers. However, nobody has ever advertised inside IT alerts, until now.</p>
<p>PagerDuty is the first company in history to bring advertising to phone call, SMS and email alerts. This is absolutely groundbreaking and revolutionary; alerts are the next frontier of advertising. We are doing this via our new Alert-Ad™ advertising network.</p>
<p><img class="alignright size-full wp-image-2058" title="Sysadmin demographic" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/sysadmin-2-copy.png" alt="Sysadmin demographic" width="170" height="158" />Alert-Ad™ allows advertisers and brands to reach a very captive audience of sysadmins, devops and developers, a market segment that has been notoriously difficult to advertise to. This demographic has typically been anti-advertising: they tend to use ad blockers, prefer browsing the web using text-only browsers like Lynx and tend to avoid paying for cable service, instead preferring to watch Netflix. They also tend to be middle-class professionals with large disposable incomes.</p>
<p>Alert-Ad™ allows the best brands and advertisers in the world to target this fertile, lucrative demographic very precisely. Advertisers can target ads to specific times of the day, for specific types of errors, to various regions in the world. For example, Microsoft could advertise the uptime of Windows Server 2008 to sysadmins receiving alerts about their Ubuntu servers going down. Another example: 5-Hour Energy could advertise their energy drink to sysadmins receiving alerts between 1am and 5am. The possibilities are endless.</p>
<h2>Advertisers are already flocking to Alert-Ad™</h2>
<p>We&#8217;re also very excited to announce that we&#8217;ve already signed up some really great brands onto the Alert-Ad™ network. We&#8217;re already working with Match.com, the leaders in online dating, Pfizer, makers of Viagra, and 5-hour Energy, makers of the 5-Hour Energy energy drink.</p>
<p>Here&#8217;s an actual Alert-Ad™ that was recently sent to one of our users (who was asleep at the time):</p>
<p><object width="400" height="27" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="flashvars" value="audioUrl=http://blog.pagerduty.com/wp-content/uploads/April-fools-v21.mp3" /><param name="src" value="http://www.google.com/reader/ui/3523697345-audio-player.swf" /><param name="quality" value="best" /><embed width="400" height="27" type="application/x-shockwave-flash" src="http://www.google.com/reader/ui/3523697345-audio-player.swf" flashvars="audioUrl=http://blog.pagerduty.com/wp-content/uploads/April-fools-v21.mp3" quality="best" /></object></p>
<p>We can also include ads inside the PagerDuty SMS alerts. Here&#8217;s an SMS Alert-Ad™:</p>
<p><img class="aligncenter size-full wp-image-2060" title="SMS Alert-Ad" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/iphone_alertad_sms.jpg" alt="SMS Alert-Ad" width="300" height="445" /></p>
<p>In conclusion, we&#8217;re very excited about the possibilities of our new Alert-Ad™ system as well as the now-free PagerDuty product. We have lots of other new and revolutionary ideas in the pipeline for other places we can place ads that cannot be ignored. Stay tuned!</p>
<p>&nbsp;</p>
<p><strong>Update:</strong> We&#8217;ve already gotten early feedback from PagerDuty users who are receiving the first Alert-Ad™ ads: they&#8217;d like an option to skip the ad and get to their critical alerts faster.</p>
<p>We are an agile company, so we&#8217;ve responded. For the low fee of $0.15 (a nickel and a dime), you can press star (*) on your phone&#8217;s keypad during a phone-call alert to listen to the ad at twice the speed.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/04/were-switching-to-free-ad-supported/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
<enclosure url="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/April-fools-v21.mp3" length="848539" type="audio/mpeg" />
		</item>
		<item>
		<title>Outage Post Mortem &#8211; March 15</title>
		<link>http://blog.pagerduty.com/2012/03/outage-post-mortem-march-15/</link>
		<comments>http://blog.pagerduty.com/2012/03/outage-post-mortem-march-15/#comments</comments>
		<pubDate>Fri, 16 Mar 2012 03:40:50 +0000</pubDate>
		<dc:creator>John Laban</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1985</guid>
		<description><![CDATA[As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems very seriously, and are writing this to give you full disclosure on what happened, what we &#8230; <a href="http://blog.pagerduty.com/2012/03/outage-post-mortem-march-15/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>As some of you know, PagerDuty suffered an outage for a total of 15 minutes this morning. We take the reliability of our systems <em>very</em> seriously, and are writing this to give you full disclosure on what happened, what we did wrong, what we did right, and what we’re doing to help prevent this in the future.</p>
<p>We also want to let you know that we are very sorry this outage happened. We have been working hard over the past 6 months on re-engineering our systems to be fully fault tolerant. We are tantalizingly close, but not quite there yet. Read on for the full details and steps we are taking to make sure this never happens again.</p>
<h1>Background</h1>
<p>PagerDuty&#8217;s main systems are hosted on Amazon Web Services&#8217; EC2.  AWS has the concept of &#8220;Availability Zones&#8221; (AZ&#8217;s), in which hosts are intended to fail independently of hosts in other availability zones within the same EC2 region.</p>
<p>PagerDuty takes advantage of these availability zones and makes sure to spread its hosts and datastores across multiple AZ&#8217;s.  In the event of a failure of a single AZ, PagerDuty can recover quickly by redirecting traffic to a surviving AZ.</p>
<p>However, it&#8217;s quite obvious that there are many situations in which all Availability Zones in a given EC2 region fail at once.  From experience, these situations happen roughly every 6 months.  One such region-wide failure occurred early this morning, in which AWS suffered internet connectivity issues across all of its US-East-1 region at once.</p>
<h1>The Outage</h1>
<p>PagerDuty became inaccessible at 2:27am this morning.</p>
<p>Knowing that fallbacks within other AZ&#8217;s aren&#8217;t enough, PagerDuty has another fully-functional replica of its entire stack running in another (completely separately owned and operated) datacenter.  We began the procedure to flip to this replica after we were notified of the problem with EC2 and when it became obvious that EC2 was having a region-wide outage.</p>
<p>At 2:42am (15 minutes after the start of the outage), EC2&#8217;s US-East-1 region re-appeared, and our systems started to quickly process the backlog of incoming API and email-based events, creating a large number of outgoing notifications to our customers.  At this point we aborted the flip to our fallback external notifications stack.</p>
<h1>What we did wrong</h1>
<p>Fifteen minutes seems like a long time between when our outage began and when we began our flip.  And it is.</p>
<p>We use multiple external monitoring systems to monitor PagerDuty and alert all of us when there are issues (we can&#8217;t use PagerDuty ourselves, alas!).  After careful examination, we found that the alerts from these systems were delayed by a few minutes.  As a result, we responded to the outage a few minutes late.</p>
<p>This is obviously an action item on us to remedy as soon as possible.  These minutes count.  We know they are very important to you.  We will look at switching or augmenting our monitoring systems as soon as possible.</p>
<p>Another miss on our part was not notifying all of you immediately of our outage via our emergency mass-broadcast system (see <a href="http://support.pagerduty.com/entries/21059657-what-if-pagerduty-goes-down">http://support.pagerduty.com/entries/21059657-what-if-pagerduty-goes-down</a>).  This was due to an internal miscommunication on when it is appropriate to use this system.  We will come out with another blog post shortly that details exactly how we use this system going forward, and a reminder on how you can register yourself for it.</p>
<h1>What we did right</h1>
<p>We&#8217;ve previously taken steps to be able to mitigate these large-scale EC2 events when they happen.</p>
<p>One such step is the very existence of our externally-hosted fallback PagerDuty environment.  This is an (expensive) solution to this rare problem.  We regularly run internal fire drills where we test and practice the procedure to flip to this environment.  We will continue these drills.</p>
<p>Another step that we’ve taken to mitigate these large-scale EC2 events is to make sure our systems can handle the very high amounts of traffic we see when a third of our customers (all the ones hosted on EC2) all go down at the same time. We&#8217;ve made many improvements to our systems over the past 6 months: our system now queues events quickly, intelligently sheds load under high-traffic scenarios in order to continue operating, and makes absolutely sure not to fail to page any of our customers.  These systems performed very well this morning, preventing further alerting delays.</p>
<h1>What we&#8217;re going to do</h1>
<p>A flip, no matter how quick, involves some downtime. This leaves a sour taste in our mouths. We are working (hard!) on our internal re-architecture to fully move to a notification processing system that involves NO temporary single points of failure, even when that SPOF is “all of EC2 east”.</p>
<p>Our new system will use a clustered multi-node datastore deployed on multiple hosts located in multiple independent data centers with different hosting providers. The new system will be able to survive a data center outage without any flips whatsoever. That&#8217;s right, we&#8217;re going flip-less (because the word &#8220;flip&#8221; is synonymous with &#8220;outage&#8221;). We are working full-steam on building this new system and deploying it as soon as possible, while making sure we stay stable during the changeover. This re-engineering effort is fairly substantial, so stand by for a few shorter term solutions.</p>
<p>During our internal post-mortem this morning, we identified a few places where we can immediately improve the availability of our external event endpoints. These include building better redundancy into our email endpoint as well as our API endpoint. We are moving these changes to the top of the heap.</p>
<p>We are also taking a closer look at moving our primary systems off of AWS US-East. In the short-term, we will continue to use US-East in some capacity (perhaps as a secondary provider). Longer term, we will switch all of our critical systems off of AWS altogether.</p>
<p>Finally, as mentioned above, we will improve our own monitoring systems. We&#8217;ve had alerts delivered too slowly by our own external web monitoring, and we will fix this ASAP. We will also improve our Twitter-based emergency broadcast procedure, which helps us announce to you when we are experiencing internal problems. Stay tuned for another blog post about this in the next few days.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/03/outage-post-mortem-march-15/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On-call best practices:  Page your manager</title>
		<link>http://blog.pagerduty.com/2012/03/on-call-best-practices-page-your-manager/</link>
		<comments>http://blog.pagerduty.com/2012/03/on-call-best-practices-page-your-manager/#comments</comments>
		<pubDate>Fri, 09 Mar 2012 02:51:33 +0000</pubDate>
		<dc:creator>John Laban</dc:creator>
				<category><![CDATA[Best Practices]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1860</guid>
		<description><![CDATA[Having one person on-call isn&#8217;t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone&#8217;s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck &#8230; <a href="http://blog.pagerduty.com/2012/03/on-call-best-practices-page-your-manager/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.pagerduty.com/2012/03/on-call-best-practices-page-your-manager/escalate/" rel="attachment wp-att-1883"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/escalate-300x291.png" alt="" title="escalate" width="300" height="291" class="alignright size-medium wp-image-1883" /></a>Having one person on-call isn&#8217;t enough. What happens if your on-call engineer sleeps through their alert? What happens if their phone&#8217;s battery dies without them knowing, or if they get an alert at a really inconvenient time, like when stuck on a bus or in traffic? It <em>will</em> happen.</p>
<p>You need a backup! One or more people, waiting in the wings, ready to spring into action if your primary on-call is <span style="text-decoration: line-through;">criminally negligent</span> unable to perform his or her duties to the best of their abilities at any given time.</p>
<p>These backups don&#8217;t need to be AS &#8220;on-call&#8221; as your primary engineer<a id="refFN1" href="#FN1"><sup>[1]</sup></a>, but what they lose in readiness they make up in numbers. It often makes sense to have multiple backups.</p>
<p>In PagerDuty we group the currently-on-call engineer and all of his or her backups into an &#8220;escalation policy&#8221;, which sets the order in which we alert (email/phone/SMS) people, and the delays in between alerts. There are a lot of ways to organize these escalation policies, but I&#8217;ll go over some patterns used by many of our customers (both big and small).</p>
<h2>Primary and Secondary</h2>
<p>You of course already have some sort of &#8220;primary&#8221; on-call engineer, and this glorious position is hopefully determined by a <a href="http://blog.pagerduty.com/2012/01/04/its-schedulin-time/" target="_blank">calendar that rotates between the people in your ops team</a> in a fair way<a href="#FN2" id="refFN2"><sup>[2]</sup></a>.</p>
<p>Many of our customers will supplement this rotation with an (unimaginatively-named) &#8220;secondary&#8221; rotation.  This secondary rotation is usually set up to shadow the primary rotation by being configured identically to the primary rotation, but offset by a certain amount of time so it&#8217;s impossible to be both primary on-call and secondary on-call at the same time.  For example, if your primary rotation contains &#8220;Alex, Bob, and Charlie&#8221;, then your secondary rotation can contain &#8220;Bob, Charlie, and Alex&#8221;, in that order.</p>
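<p>The offset idea can be sketched in a few lines of Python. This is only an illustration, not how PagerDuty schedules are implemented: the helper names are made up, shifts are assumed to be equal-length and numbered from zero, and the rotation is assumed to have at least two people.</p>

```python
def secondary_rotation(primary, offset=1):
    """Return the primary rotation shifted by `offset` positions."""
    offset %= len(primary)
    return primary[offset:] + primary[:offset]


def on_call(rotation, shift_index):
    """Return who is on call for a given shift number (0, 1, 2, ...)."""
    return rotation[shift_index % len(rotation)]


primary = ["Alex", "Bob", "Charlie"]
secondary = secondary_rotation(primary)

for shift in range(3):
    print(shift, on_call(primary, shift), on_call(secondary, shift))
# prints:
# 0 Alex Bob
# 1 Bob Charlie
# 2 Charlie Alex
```

<p>As long as the offset is not a multiple of the rotation length, nobody is primary and secondary during the same shift.</p>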
<p>Over 25% of the (many) on-call calendars we manage here at PagerDuty have either the word &#8220;primary&#8221; or &#8220;secondary&#8221; in them, so this is a very commonly-used pattern throughout our customer base. </p>
<h2>First-tier and Last-tier Support</h2>
<p>If your company is big enough or your operations tasks plentiful enough, you may also have a separate first-tier support team that handles all the basic operations tasks before even your &#8220;primary&#8221; on-call engineer.  This front-line team is usually trained in handling all of the repetitive and annoying small problems that crop up often and have clear resolution procedures.  (You know, that stuff that you should seriously just fix already, but hey, you&#8217;re busy.)  These first-tier support teams are often shared amongst multiple engineering teams, and are placed first in an escalation policy.</p>
<p>So who is placed last in the escalation policy?  Who is the <em>last</em>-tier support team?  Management, of course!</p>
<p>Your escalation policy shouldn&#8217;t just end with your primary or secondary ops teams, but should escalate up to their manager&#8217;s phone, and then maybe even their manager&#8217;s manager&#8217;s phone, assuming nobody acknowledges the alert in time.  </p>
<p>This has two purposes:  (1) management <i>is</i> ultimately responsible for these important systems and is logically who should be informed if any major problems fall through the cracks, and (2) your ops team will be less likely to &#8220;miss&#8221; an automated PagerDuty phone call if they know it just means that their boss will be phoning them up personally in a few minutes and asking some very pointed questions.</p>
<h2>Example</h2>
<p>So, putting this all together, a (very complete) Escalation Policy example for a hypothetical &#8220;Database Ops&#8221; team might look something like this:</p>
<ol>
<li>Assign the incident to the user who is on-call in the <b>First-Tier Ops Team</b> schedule</li>
<li>Assign the incident to the user who is on-call in the <b>Primary DBA</b> schedule</li>
<li>Assign the incident to the user who is on-call in the <b>Secondary DBA</b> schedule</li>
<li>Assign the incident to <b>Dilbert Adams</b> (team lead)</li>
<li>Assign the incident to <b>Pointi Haredboss</b> (dev manager)</li>
</ol>
<p>Set timeouts of maybe 10-30 minutes between escalations, depending on your needs.   Note that any individuals (managers) put directly in the escalation policy are essentially on-call <em>all the time</em>, as last lines of defense, so they should try to keep their cell phones handy and charged at all times.</p>
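<p>To make the escalation order concrete, here is a minimal sketch that models the policy above as plain data and works out which level an unacknowledged incident has reached.  The 15-minute timeout, the field names, and the choice to stay on the last level once the list runs out are all assumptions made for illustration, not a description of how PagerDuty itself stores or evaluates policies.</p>

```javascript
// The example policy as an ordered list; `kind` and `target` are illustrative field names.
const escalationPolicy = [
  { kind: 'schedule', target: 'First-Tier Ops Team' },
  { kind: 'schedule', target: 'Primary DBA' },
  { kind: 'schedule', target: 'Secondary DBA' },
  { kind: 'user', target: 'Dilbert Adams' },
  { kind: 'user', target: 'Pointi Haredboss' },
];

// Assumed timeout between escalations (the post suggests somewhere in the 10-30 minute range).
const TIMEOUT_MINUTES = 15;

// Return the policy level that should currently hold the incident.
// Assumption: once the last level is reached, the incident stays there.
function currentLevel(policy, minutesUnacknowledged, timeoutMinutes) {
  const step = Math.floor(minutesUnacknowledged / timeoutMinutes);
  return policy[Math.min(step, policy.length - 1)];
}

console.log(currentLevel(escalationPolicy, 0, TIMEOUT_MINUTES).target);
console.log(currentLevel(escalationPolicy, 50, TIMEOUT_MINUTES).target);
```

<p>With these numbers, an alert nobody acknowledges reaches the team lead after 45 minutes and the dev manager after an hour.</p>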
<p><a id="FN1" href="#refFN1">[1]</a> They don&#8217;t need to carry around a <a href="http://blog.pagerduty.com/2011/03/30/on-call-best-practices-part-1" target="_blank">mobile broadband device</a>, for instance, and can feel less guilty about doing things like <span style="text-decoration: line-through;">binge drinking while on duty</span> occasionally partaking in activities that might slightly decrease their on-call abilities.</p>
<p><a href="#refFN2" id="FN2">[2]</a> The primary engineer doesn&#8217;t have to be determined by a rotation, and you can instead have some poor soul be primary on-call every minute of every day for his whole tenure at your company, but that tenure may be short due to burnout.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/03/on-call-best-practices-page-your-manager/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>New Relic Integration with PagerDuty</title>
		<link>http://blog.pagerduty.com/2012/02/new-relic-integration-with-pagerduty/</link>
		<comments>http://blog.pagerduty.com/2012/02/new-relic-integration-with-pagerduty/#comments</comments>
		<pubDate>Wed, 29 Feb 2012 17:13:12 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[alerting]]></category>
		<category><![CDATA[announcements]]></category>
		<category><![CDATA[integration]]></category>
		<category><![CDATA[new relic]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1928</guid>
		<description><![CDATA[We are very excited to announce a new integration with New Relic. As with all of our integrations, once you hook up New Relic to PagerDuty, you&#8217;ll be able to set up phone, SMS and email alerts for critical issues &#8230; <a href="http://blog.pagerduty.com/2012/02/new-relic-integration-with-pagerduty/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://newrelic.com/"><img class="alignright size-full wp-image-1954" title="New Relic logo" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/NewRelic-logo.png" alt="New Relic logo" width="190" height="40" /></a>We are very excited to announce a new integration with <a href="http://newrelic.com/" target="_blank">New Relic</a>. As with all of our integrations, once you hook up New Relic to PagerDuty, you&#8217;ll be able to set up phone, SMS and email alerts for critical issues with your web application stack (as monitored by New Relic). However, we&#8217;ve worked hard to make the process of hooking up the two accounts very easy: in fact, you can do it in less than a minute. Our handy <a href="http://www.pagerduty.com/docs/guides/new-relic-integration-guide" target="_blank">step-by-step guide</a> will walk you through the process.</p>
<p>Some of you may have already been using PagerDuty for New Relic alerts by forwarding alert emails from NR into PD. Our new API-based integration is much better: New Relic can both create <strong>and resolve</strong> incidents in PagerDuty. That means once an outage is resolved, there&#8217;s no need to go into PagerDuty to manually resolve the associated incidents; they&#8217;re automatically resolved when New Relic detects the problem is fixed.</p>
<p>Just in case you haven&#8217;t heard of New Relic (really?), they&#8217;ve built an awesome web application performance tool that works with most of the popular web stacks, including Ruby on Rails, Java and .NET. In addition to webapp performance, New Relic also monitors your servers and end-user performance. We use them at PagerDuty to monitor the performance of our front-end web stack and we&#8217;re definitely big fans of the product.</p>
<h2>Integration Details</h2>
<p>The integration is super-simple to set up. In New Relic, click on &#8220;<strong>Account Settings</strong>&#8221; and then &#8220;<strong>API + web integrations</strong>&#8221; and you&#8217;ll notice a PagerDuty tab and a large green &#8220;Alert with PagerDuty&#8221; button. Just click the button and you&#8217;ll be guided through the integration process. You can also refer to our <a href="http://www.pagerduty.com/docs/guides/new-relic-integration-guide" target="_blank">integration guide</a> for the step-by-step rundown.</p>
<p><a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/ConfiguredNewRelic1.png"><img class="alignnone size-full wp-image-1947" title="ConfiguredNewRelic" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/ConfiguredNewRelic1.png" alt="" width="560" height="409" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/02/new-relic-integration-with-pagerduty/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Triggering an alert from a phone call (code sample)</title>
		<link>http://blog.pagerduty.com/2012/02/triggering-an-alert-from-a-phone-call-code-sample/</link>
		<comments>http://blog.pagerduty.com/2012/02/triggering-an-alert-from-a-phone-call-code-sample/#comments</comments>
		<pubDate>Mon, 27 Feb 2012 22:51:17 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[alerting]]></category>
		<category><![CDATA[code sample]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[phone]]></category>
		<category><![CDATA[twilio]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1909</guid>
		<description><![CDATA[I get a lot of requests to handle &#038; escalate phone calls as well as alerts from monitoring systems. Here&#8217;s a code sample that lets you hand out a phone number, let the caller record a message and have that &#8230; <a href="http://blog.pagerduty.com/2012/02/triggering-an-alert-from-a-phone-call-code-sample/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I get a lot of requests to handle &#038; escalate phone calls as well as alerts from monitoring systems.  Here&#8217;s a code sample that lets you hand out a phone number, let the caller record a message and have that message escalate just like a normal PagerDuty alert.  As an added bonus, most smartphones will let you hear the message and call the user back from the SMS.</p>
<p>We have regular hackdays at <a href="http://www.pagerduty.com">PagerDuty</a>, where we build things outside the core product without management (another reason you should <a href="http://blog.pagerduty.com//www.pagerduty.com/jobs">work here</a>).  A few weeks ago, I rolled out a proof-of-concept <a href="">Google App Engine</a> script to use Twilio to record a voicemail and then to pass it around like a regular alert.  Triggering alerts from phone calls hasn&#8217;t made its way onto the development roadmap, so I&#8217;m sharing this code sample as a workaround for our more technically inclined users.  All the usual caveats and disclaimers apply, namely that our SLAs don&#8217;t cover it.</p>
<p><a href="http://twilio.com">Twilio</a> will happily turn a phone call into an MP3 and give us a link to it (which means to get this to work you&#8217;re going to need to sign up for a Twilio account as well as a <a href="https://appengine.google.com">Google App Engine</a> account).  We then use Google&#8217;s URL shortener to shrink the URL into something that will fit in an SMS; all modern smartphones can figure out what to do with that.</p>
<p>The end result: assuming your users have SMS contact methods set up, they&#8217;ll receive an SMS like: </p>
<pre><code>ALRT #145 on Phone in: <a href="http://goo.gl/UMmDx">http://goo.gl/UMmDx</a> <a href="#">+14153490382</a> Reply 4:Ack, 6:Resolv.</code></pre>
<p>If you&#8217;re comfortable deploying code, it&#8217;s up on <a href="https://github.com/eurica/PagerDutyCallDesk">https://github.com/eurica/PagerDutyCallDesk</a></p>
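<p>For readers who just want the shape of the flow, here is a rough sketch of the glue step and not the code from the repository above.  It assumes the recording callback supplies <code>RecordingUrl</code> and <code>From</code> fields, that a trigger event carries <code>service_key</code>, <code>event_type</code> and <code>description</code>, and it takes the URL shortener as a plain function argument; the function name, service key and short URL are placeholders, so check the current Twilio and PagerDuty docs before relying on any of it.</p>

```javascript
// Hypothetical glue: turn a recording callback into a trigger event payload.
// `params` stands in for the fields posted after a recording (assumed: RecordingUrl, From).
// `shorten` is any function that maps a long URL to a short one.
function buildTriggerEvent(params, serviceKey, shorten) {
  const shortUrl = shorten(params.RecordingUrl);
  return {
    service_key: serviceKey, // placeholder value identifying the target service
    event_type: 'trigger',
    // Short link to the recording plus the caller's number, so both fit in an SMS.
    description: shortUrl + ' ' + params.From,
  };
}

// Example with a stub shortener and made-up values; nothing is sent over the network.
const exampleEvent = buildTriggerEvent(
  { RecordingUrl: 'http://recordings.example/a-very-long-recording-id', From: '+15555550100' },
  'YOUR_SERVICE_KEY',
  () => 'http://short.example/abc'
);
console.log(JSON.stringify(exampleEvent));
```

<p>Keeping the shortener as an argument makes the function easy to try out without calling any external service.</p>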
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/02/triggering-an-alert-from-a-phone-call-code-sample/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Not breaking your Google Analytics (like a pro)</title>
		<link>http://blog.pagerduty.com/2012/02/not-breaking-your-google-analytics-like-a-pro/</link>
		<comments>http://blog.pagerduty.com/2012/02/not-breaking-your-google-analytics-like-a-pro/#comments</comments>
		<pubDate>Thu, 16 Feb 2012 22:31:46 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Availability]]></category>
		<category><![CDATA[marketing for mathies]]></category>
		<category><![CDATA[stories]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1800</guid>
		<description><![CDATA[As a general rule, whatever percentage you think your test coverage is, it isn&#8217;t. Whatever amount of the known surface area you&#8217;re covering, there&#8217;s going to be an exciting swath of things you didn&#8217;t realize that you need to test. &#8230; <a href="http://blog.pagerduty.com/2012/02/not-breaking-your-google-analytics-like-a-pro/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>As a general rule, whatever percentage you think your test coverage is, it isn&#8217;t.  Whatever amount of the known surface area you&#8217;re covering, there&#8217;s going to be an exciting swath of things you didn&#8217;t realize that you need to test.  Analytics fell into that bucket for us.</p>
<p>We use Google Analytics in our webapp to get a feel for how users use the product, most recently to determine which functionality was prioritized for the <a href="http://blog.pagerduty.com/2012/02/06/we-have-a-mobile-site/" title="We have a mobile site">mobile site</a>.  So generally I look at our analytics every week or two to help developers out, and when Simon asked me to see how popular the mobile site was, I was pretty sure the answer was not &#8220;It decreased the use of our webapp by 98%.&#8221;:<br />
<a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/app-users-by-day.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/app-users-by-day.png" alt="" title="app users by day" width="458" height="334" class="aligncenter size-full wp-image-1804" /></a></p>
<p>I won&#8217;t name names, but the culprit rhymes with &#8220;itwasian&#8221;.</p>
<h3>What broke</h3>
<p>Our UI consists almost entirely of <a href="http://haml-lang.com/">HAML</a> powered by <a href="http://documentcloud.github.com/backbone/">backbone.js</a>, often <a href="https://github.com/ienders/jammit/tree/haml-js">at the same time</a>.  Which meant that we refactored the default Google Analytics code:</p>
<pre style="overflow-x:scroll; font-family: monospace">  :javascript
    var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
    document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
  :javascript // <-- This line was removed
    try {
      var pageTracker = _gat._getTracker("UA-8759953-1");
      pageTracker._setDomainName(".pagerduty.com");
      pageTracker._setAllowHash(false);
      pageTracker._trackPageview();
      #{yield :google_analytics}
    } catch(err) {}
</pre>
<p>You'll notice that this generates two JavaScript blocks, which we helpfully merged into one.</p>
<p>That's what broke everything, where by "everything" I'm excluding some mobile and rare browsers that still executed the code as intended.  For 98% of our visitors, merging those two script blocks means the browser never gets a chance to process the script tag written by document.write, so ga.js hasn't loaded by the time _gat is referenced.  _gat doesn't exist, the empty catch block swallows the error, and that's the end of our analytics on this page.</p>
<p>The simple fix is, of course, to put the second script block back in.  But instead we moved to the newest asynchronous Google Analytics code, which doesn't need two blocks, since it only requires _gaq to exist as a plain JavaScript array, with the rest of the functionality coming later, whenever the browser gets around to it.</p>
<pre style="overflow-x:scroll; font-family: monospace">:javascript
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-8759953-1']);
  _gaq.push(['_setDomainName', 'pagerduty.com']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</pre>
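<p>The reason the asynchronous snippet doesn't care about load order is that the page only ever pushes commands onto an array, and the library replays them whenever it arrives.  The sketch below imitates that pattern with a stand-in tracker; <code>installTracker</code> and the tracker methods are invented for illustration and are not ga.js internals.</p>

```javascript
// The page side: queue commands before any library code exists.
var commandQueue = [];
commandQueue.push(['_setAccount', 'UA-XXXXX-X']);
commandQueue.push(['_trackPageview']);

// Hypothetical stand-in for what an async library does once it finally loads.
function installTracker(pending) {
  var calls = [];
  var tracker = {
    _setAccount: function (id) { calls.push('account:' + id); },
    _trackPageview: function () { calls.push('pageview'); },
  };
  // Look up the command by name and call it with the remaining items as arguments.
  function run(command) {
    tracker[command[0]].apply(tracker, command.slice(1));
  }
  // Replay everything that was queued while the library was still loading.
  pending.forEach(run);
  // From here on, push executes commands immediately instead of queueing them.
  return { push: run, calls: calls };
}

// "Library loaded": swap the array for the live object, then keep pushing as before.
commandQueue = installTracker(commandQueue);
commandQueue.push(['_trackPageview']);
console.log(commandQueue.calls.join(', '));
```

<p>Because the calling code only ever uses <code>push</code>, it behaves the same whether the library has loaded yet or not.</p>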
<h3>Testing that Google Analytics is working</h3>
<p>To ensure I catch this sooner in the future, I've set up some intelligence events on our application and the website inside of Google Analytics to detect if we have abnormally low or high numbers of visitors.  </p>
<p>There's at least a day's delay before the email gets sent out, which is a shame (I received the last one while writing this) but it's another layer in our web of alerting.  Log into your Google Analytics account, and you'll see "Intelligence Events":</p>
<p><a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/intelligence-events.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/intelligence-events-300x139.png" alt="" title="intelligence events" width="300" height="139" class="aligncenter size-medium wp-image-1808" /></a></p>
<p>I intend to set up some more advanced heuristics later, but for now let's just test that analytics is working:<br />
<a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/analytics-alert.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/analytics-alert-300x133.png" alt="" title="analytics alert" width="300" height="133" class="aligncenter size-medium wp-image-1802" /></a></p>
<p>For testing, that day of lag had one advantage: it sent me yesterday's alert, from when our analytics were still broken (but I'm really stretching to call that an advantage).</p>
<h3>Tying it all in to PagerDuty</h3>
<p>This isn't the kind of alert I want to be woken up in the middle of the night for, but I still use PagerDuty as an incident management system for my analytics alerts, partially for the dogfooding, but also to track our media mentions, Twitter mentions, etc.  </p>
<p>For this I've set up a service that doesn't auto-resolve or expire acknowledgements, to track everything that emails "analyze-me@pdt-dave.pagerduty.com".</p>
<p><a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/Analyze-Me-1.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/Analyze-Me-1-300x196.png" alt="" title="Analyze Me" width="300" height="196" class="aligncenter size-medium wp-image-1811" /></a></p>
<h3>Test all the things</h3>
<p>So now I'm filling up our <a href="http://www.fogcreek.com/fogbugz/">Fogbugz</a> with new things to test: </p>
<ul>
<li>Our t-shirt mailings</li>
<li>Whether we respond to customer inquiries quickly enough</li>
<li>Our load times across the website, app stack, blog and the support site (again with automated alerts from Google).  We already test this with <a href="http://newrelic.com/">New Relic</a>.</li>
</ul>
<p>I don't have a good procedure for determining what we're forgetting to test, but I do have a couple of principles:</p>
<ul>
<li>If it fails once, it gets tested forever. I'm kind of expecting this to get lost in the shuffle, but apparently we never sent t-shirts to one or two people that I promised them to, so that's been automated and I now have an ad hoc report of who has and hasn't received their shirts.  When push comes to crazy, I may integrate it with USPS tracking.</li>
<li>It needs to be automatic, ideally sending you an alert when some measurement is out of band.  We track our time to resolution with Zendesk, so one of my projects is to automate our metrics with <a href="https://support.zendesk.com/entries/20012032-streamlining-workflow-with-time-based-events-and-automations">automations</a>.</li>
<li>Be a jerk. I'm testing other people's work and trying to get it to fail.  I don't care what the reliability team's metrics say, if my page load time spikes, I'm going to demand some answers.</li>
</ul>
<p>We're still a young company, but we're fiercely dedicated to uptime, and when you're dealing with bugs, there are known unknowns and unknown unknowns.  When I started here, I never would've guessed how much I'd enjoy shooting Nerf guns across the office whenever the average page load time increases.</p>
<p>(With any luck, this will be the first post in a series on what happens when you make a mathematician do your marketing.)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/02/not-breaking-your-google-analytics-like-a-pro/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>We have a mobile site</title>
		<link>http://blog.pagerduty.com/2012/02/we-have-a-mobile-site/</link>
		<comments>http://blog.pagerduty.com/2012/02/we-have-a-mobile-site/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 23:45:03 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[ui]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1783</guid>
		<description><![CDATA[We&#8217;re an engineering-heavy organization, but recently we&#8217;ve taken on a critical mass of design passion in the organization and hopefully it&#8217;s starting to show. Simon built out a mobile site to facilitate acknowledging and resolving alerts from your phone: Now &#8230; <a href="http://blog.pagerduty.com/2012/02/we-have-a-mobile-site/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re an engineering-heavy organization, but recently we&#8217;ve taken on a critical mass of design passion in the organization and hopefully it&#8217;s starting to show.  Simon built out a mobile site to facilitate acknowledging and resolving alerts from your phone:<br />
<center><a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/newui-mobile.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/newui-mobile-173x300.png" alt="" title="newui screenshot" width="173" height="300" class="alignnone size-medium wp-image-1784" /></a><a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/newui-mobile2.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/newui-mobile2-168x300.png" alt="" title="newui screenshot 2" width="168" height="300" class="alignnone size-medium wp-image-1785" /></a></center></p>
<p>Now you&#8217;ll have all the major PagerDuty functionality in an easier form factor, and the full webpage is just a click away if you need it.</p>
<p>On the full web-app, Ian has dragged us (kicking and screaming) into the second decade of the 21<sup>st</sup> century with a more consistent modern feel.  There haven&#8217;t been any functionality changes in either of these updates, but hopefully we can make your experience a little less awful when we wake you up at 4am.  </p>
<p><a href="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/newui.png"><img src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/newui-300x202.png" alt="" title="newui" width="300" height="202" class="aligncenter size-medium wp-image-1786" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/02/we-have-a-mobile-site/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Pressure Release Valves</title>
		<link>http://blog.pagerduty.com/2012/01/pressure-release-valves/</link>
		<comments>http://blog.pagerduty.com/2012/01/pressure-release-valves/#comments</comments>
		<pubDate>Sat, 28 Jan 2012 00:28:32 +0000</pubDate>
		<dc:creator>John Laban</dc:creator>
				<category><![CDATA[Availability]]></category>
		<category><![CDATA[MTTR]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1387</guid>
		<description><![CDATA[This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn&#8217;t like the last 15 operations issues you&#8217;ve dealt with &#8230; <a href="http://blog.pagerduty.com/2012/01/pressure-release-valves/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><em>This is the fourth in <a href="http://blog.pagerduty.com/category/availability/" target="_blank">a series of posts</a> on increasing overall availability of your service or system.</em></p>
<p>Have you ever gotten <a href="http://www.pagerduty.com/" target="_blank">paged</a>, and known right away that this problem isn&#8217;t like the <a href="http://blog.pagerduty.com/2011/10/03/reducing-mttr-lessons-from-sun-tzu/" target="_blank">last 15 operations issues you&#8217;ve dealt with this week</a>?  That this problem is special, and is really, really bad?  You know, that kind of problem that you&#8217;ve been worrying about deep in your subconscious for weeks now, and that you&#8217;ve been hoping would never happen?</p>
<p>Well, what do you do when it happens?  Often in these high-pressure situations, you&#8217;ll have a very brief period of time (say, minutes) before a problem goes from &#8216;pretty-bad-but-our-customers-will-forgive-us-and-some-might-not-even-notice&#8217; to simply catastrophic.  If you&#8217;re a <a href="http://en.wikipedia.org/wiki/Scout_Motto" target="_blank">Boy or Girl Scout</a>, you&#8217;d just open up the Pressure Release Valve you&#8217;ve prepared beforehand and prevent the problem from escalating out of control.</p>
<h2>Build Pressure Release Valves</h2>
<p><a href="http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/ear-steam/" rel="attachment wp-att-1349"><img class="alignright size-medium wp-image-1349" title="ear-steam" src="http://pagerduty.zippykid.netdna-cdn.com/wp-content/uploads/ear-steam-300x200.jpg" alt="" width="300" height="200" /></a>When building or maintaining one of the systems or services that you own, have you ever said to yourself:  &#8220;You know, if situation X ever happened, as improbable as it is, we&#8217;d be really boned&#8221;?  Situation X could be any hypothetical catastrophic disaster scenario for your given system:  both master and slave datastores go down simultaneously;  all your customers or clients decide to flood you with their theoretical peak loads of traffic at once;  your cloud provider of choice suffers a multiple-availability-zone outage;  your multicast-based messaging system suffers from a feedback loop;  etc. </p>
<p>The problem is, if you work with a given system long enough, there&#8217;s a higher-than-you&#8217;d-like chance that &#8220;Situation X&#8221; will actually crop up.</p>
<p>So what can you do?  Yes, you could try to engineer a system to try to prevent these catastrophic failures altogether.  But building something like this could be time- and cost-prohibitive and can easily lead to over-engineered systems if you go too far.  Spending a lot of development time targeting failure scenarios that perhaps have a 5% chance of happening over the course of your lifetime isn&#8217;t the best use of your resources<a href="#FN1" id="refFN1"><sup>[1]</sup></a>.</p>
<p>Instead, create <strong><em>pressure release valves</em></strong>.  You can think of these as a sort of lever or knob that you can adjust during failures in order to reduce the severity of your problem while it is being worked on.  They can often take the form of a <a href="http://zookeeper.apache.org/" target="_blank">configuration-based</a> boolean or constant that can be easily changed in case of an emergency, but can come in other forms too.</p>
<p>You can use these pressure release valves to easily flip off (or on) some piece of critical functionality or to dial up or down some important value used in your application.  I&#8217;ll go into some examples below.</p>
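<p>As a minimal sketch of what such a valve might look like in code (the setting names, the limit and the request handler are all hypothetical, and a real system would more likely read these values from a shared configuration store than from an in-process object):</p>

```javascript
// Hypothetical valves. In practice these would live somewhere that can be
// changed during an emergency without a deploy.
const valves = {
  expensiveWidgetEnabled: true, // boolean valve: flip off to shed load
  maxRequestsPerSecond: 100,    // dial valve: turn down to throttle traffic
};

// Toy request handler that consults the valves on every request.
function handleRequest(requestsThisSecond, settings) {
  if (requestsThisSecond > settings.maxRequestsPerSecond) {
    // Over the limit: refuse early rather than let the whole system fall over.
    return { status: 503, body: 'Busy, please retry shortly' };
  }
  const parts = ['core page'];
  if (settings.expensiveWidgetEnabled) {
    parts.push('expensive widget');
  }
  return { status: 200, body: parts.join(' + ') };
}

console.log(handleRequest(50, valves).body);
valves.expensiveWidgetEnabled = false; // emergency: sacrifice the widget
console.log(handleRequest(50, valves).body);
```

<p>The point is that the handler checks the valves at request time, so changing a value takes effect immediately while the on-call engineer works on the real fix.</p>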
<h2>Brainstorm</h2>
<p>To come up with these pressure release valves, get together with your team and brainstorm some (perhaps even semi-outlandish) ways in which your system or service can fail catastrophically.  </p>
<p>For each of these failure modes, figure out a way in which the system could be temporarily patched, re-routed, short-circuited, or generally hacked to temporarily reduce the magnitude of the problem.  The goal would be to bring the system back to a functioning state:  you will probably be forced to sacrifice functionality in order to do so.  Usually, the one or two people who are most intimately familiar with a given system must design these hacks.  Since these people are not always available in an emergency, it&#8217;s good to explore these ideas ahead of time.</p>
<p>After you create a list of all the catastrophic failure modes and the corresponding hacks that would be needed to get the system back in a (semi) working state, you can also start figuring out common patterns in the hacks:  </p>
<ul>
<li>Would adding a throttle on the incoming requests help in a large number of these failure situations?</li>
<li>Would disabling the computationally-expensive widget X or Y on your website reduce load?</li>
<li>Would the ability to re-route all incoming requests from datacenter A to B turn a partial outage into just some latency issues?</li>
<li>Would relaxing your consistency requirements <a href="http://en.wikipedia.org/wiki/CAP_theorem" target="_blank">make your system available</a> again, at the cost of a bit of corrupt data?</li>
<li>What other functionality can you sacrifice on-demand from your datastore to get it partially functioning again?  Durability?  Historical data?  The ability to do writes (by using a read-only slave)?</li>
<li>Would flipping off some of your non-critical background workflows free up capacity for your more important ones?</li>
<li>Would the ritual sacrifice of an intern appease the operations gods?<a href="#FN2" id="refFN2"><sup>[2]</sup></a></li>
</ul>
<p>Limping along at only partial functionality is much better than a complete outage, and also takes pressure off the on-call staff while they <a href="http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/" target="_blank">get started on their methodical S.O.P</a> for fixing the root cause of the problem.</p>
<p>As I said earlier, you <em>could</em> try over-engineering a system to prevent these rare exotic catastrophes before they happen, but it often just isn&#8217;t worth it.  Plus, even then, there would probably <em>still</em> be other even-more-improbable-but-still-possible failure modes that could benefit from these brainstorming discussions.  So don&#8217;t necessarily waste large amounts of time engineering ways to prevent these obscure problems, <em>but don&#8217;t ignore their possibility either.</em> Talk about them!</p>
<p>If anyone has more examples of pressure release valves you keep in your own operations toolkit, I&#8217;d be very interested in hearing about them in the comments.</p>
<p><a href="#refFN1" id="FN1">[1]</a> Ignore this advice if you&#8217;re building something like a nuclear reactor.  Make that shit work.<br />
<a href="#refFN2" id="FN2">[2]</a> Just kidding.  Operations gods don&#8217;t get out of bed for anything less than a fulltime newhire college grad.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/01/pressure-release-valves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

