<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PagerDuty Blog</title>
	<atom:link href="http://blog.pagerduty.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.pagerduty.com</link>
	<description></description>
	<lastBuildDate>Mon, 30 Jan 2012 21:09:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Pressure Release Valves</title>
		<link>http://blog.pagerduty.com/2012/01/27/pressure-release-valves/</link>
		<comments>http://blog.pagerduty.com/2012/01/27/pressure-release-valves/#comments</comments>
		<pubDate>Sat, 28 Jan 2012 00:28:32 +0000</pubDate>
		<dc:creator>John Laban</dc:creator>
				<category><![CDATA[Availability]]></category>
		<category><![CDATA[MTTR]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1387</guid>
		<description><![CDATA[This is the fourth in a series of posts on increasing overall availability of your service or system. Have you ever gotten paged, and known right away that this problem isn&#8217;t like the last 15 operations issues you&#8217;ve dealt with &#8230; <a href="http://blog.pagerduty.com/2012/01/27/pressure-release-valves/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><em>This is the fourth in <a href="http://blog.pagerduty.com/category/availability/" target="_blank">a series of posts</a> on increasing overall availability of your service or system.</em></p>
<p>Have you ever gotten <a href="http://www.pagerduty.com/" target="_blank">paged</a>, and known right away that this problem isn&#8217;t like the <a href="http://blog.pagerduty.com/2011/10/03/reducing-mttr-lessons-from-sun-tzu/" target="_blank">last 15 operations issues you&#8217;ve dealt with this week</a>?  That this problem is special, and is really, really bad?  You know, that kind of problem that you&#8217;ve been worrying about deep in your subconscious for weeks now, and that you&#8217;ve been hoping would never happen?</p>
<p>Well, what do you do when it happens?  Often in these high-pressure situations, you&#8217;ll have a very brief period of time (say, minutes) before a problem goes from &#8216;pretty-bad-but-our-customers-will-forgive-us-and-some-might-not-even-notice&#8217; to simply catastrophic.  If you&#8217;re a <a href="http://en.wikipedia.org/wiki/Scout_Motto" target="_blank">Boy or Girl Scout</a>, you&#8217;d just open up the Pressure Release Valve you&#8217;ve prepared beforehand and prevent the problem from escalating out of control.</p>
<h2>Build Pressure Release Valves</h2>
<p><a href="http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/ear-steam/" rel="attachment wp-att-1349"><img class="alignright size-medium wp-image-1349" title="ear-steam" src="http://pagerduty.zkimg.com/wp-content/uploads/ear-steam-300x200.jpg" alt="" width="300" height="200" /></a>When building or maintaining one of the systems or services that you own, have you ever said to yourself:  &#8220;You know, if situation X ever happened, as improbable as it is, we&#8217;d be really boned&#8221;?  Situation X could be any hypothetical catastrophic disaster scenario for your given system:  both master and slave datastores go down simultaneously;  all your customers or clients decide to flood you with their theoretical peak loads of traffic at once;  your cloud provider of choice suffers a multiple-availability-zone outage;  your multicast-based messaging system suffers from a feedback loop;  etc. </p>
<p>The problem is, if you work with a given system long enough, there&#8217;s a higher-than-you&#8217;d-like chance that &#8220;Situation X&#8221; will actually crop up.</p>
<p>So what can you do?  Yes, you could try to engineer a system that prevents these catastrophic failures altogether.  But building something like this could be time- and cost-prohibitive and can easily lead to over-engineered systems if you go too far.  Spending a lot of development time targeting failure scenarios that perhaps have a 5% chance of happening over the course of your lifetime isn&#8217;t the best use of your resources<a href="#FN1" id="refFN1"><sup>[1]</sup></a>.</p>
<p>Instead, create <strong><em>pressure release valves</em></strong>.  You can think of these as a sort of lever or knob that you can adjust during failures in order to reduce the severity of your problem while it is being worked on.  They can often take the form of a <a href="http://zookeeper.apache.org/" target="_blank">configuration-based</a> boolean or constant that can be easily changed in case of an emergency, but can come in other forms too.</p>
<p>You can use these pressure release valves to easily flip off (or on) some piece of critical functionality or to dial up or down some important value used in your application.  I&#8217;ll go into some examples below.</p>
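<p>As a rough illustration of a configuration-based valve, here is a minimal Python sketch.  It assumes the overrides live in a small JSON file that an operator can edit by hand; the valve names, the file format, and the <code>render_page</code> function are all hypothetical and only stand in for whatever configuration store and functionality your own system has.</p>

```python
import json

# Hypothetical defaults: every valve is closed (normal operation).
DEFAULT_VALVES = {
    "expensive_widget_enabled": True,  # boolean valve: flip off under load
    "max_requests_per_second": 500,    # constant valve: dial down under load
}


def apply_overrides(raw_text):
    """Merge operator overrides (JSON text) onto the defaults.

    Bad input never makes the emergency worse: anything that is not a
    JSON object is ignored and the defaults are returned unchanged.
    """
    valves = dict(DEFAULT_VALVES)
    try:
        overrides = json.loads(raw_text)
    except ValueError:
        return valves
    if isinstance(overrides, dict):
        # Ignore unknown keys so a typo cannot introduce a new setting.
        valves.update({k: v for k, v in overrides.items() if k in valves})
    return valves


def load_valves(path):
    """Re-read the override file on each call so edits apply without a deploy."""
    try:
        with open(path) as f:
            return apply_overrides(f.read())
    except OSError:
        return dict(DEFAULT_VALVES)


def render_page(valves):
    """Serve the core page, skipping the costly widget when it is switched off."""
    parts = ["core content"]
    if valves["expensive_widget_enabled"]:
        parts.append("expensive widget")
    return parts
```

<p>The design choice worth copying is that the valve fails safe: a missing or corrupt override file falls back to normal behavior instead of taking the service down with it.</p>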
<h2>Brainstorm</h2>
<p>To come up with these pressure release valves, get together with your team and brainstorm some (perhaps even semi-outlandish) ways in which your system or service can fail catastrophically.  </p>
<p>For each of these failure modes, figure out a way in which the system could be temporarily patched, re-routed, short-circuited, or generally hacked to temporarily reduce the magnitude of the problem.  The goal would be to bring the system back to a functioning state:  you will probably be forced to sacrifice functionality in order to do so.  Usually, the 1 &#8211; 2 people who are most intimately familiar with a given system must design these hacks.  Since these people are not always available in an emergency, it&#8217;s good to explore these ideas ahead of time.</p>
<p>After you create a list of all the catastrophic failure modes and the corresponding hacks that would be needed to get the system back in a (semi) working state, you can also start figuring out common patterns in the hacks:  </p>
<ul>
<li>Would adding a throttle on the incoming requests help in a large number of these failure situations?</li>
<li>Would disabling the computationally-expensive widget X or Y on your website reduce load?</li>
<li>Would the ability to re-route all incoming requests from datacenter A to B turn a partial outage into just some latency issues?</li>
<li>Would relaxing your consistency requirements result in a bit of corrupt data but <a href="http://en.wikipedia.org/wiki/CAP_theorem" target="_blank">make your system available</a> again?</li>
<li>What other functionality can you sacrifice on-demand from your datastore to get it partially functioning again?  Durability?  Historical data?  The ability to do writes (by using a read-only slave)?</li>
<li>Would flipping off some of your non-critical background workflows free up capacity for your more important ones?</li>
<li>Would the ritual sacrifice of an intern appease the operations gods?<a href="#FN2" id="refFN2"><sup>[2]</sup></a></li>
</ul>
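<p>To make the first pattern in the list concrete, here is one way a throttle on incoming requests could look.  This is only a sketch: it uses a token bucket, which is one common throttling technique rather than anything prescribed above, and the class and parameter names are made up for illustration.  The valve is the <code>set_rate</code> method, which lets the on-call dial the allowed request rate down (or back up) while the service keeps running.</p>

```python
import time


class AdjustableThrottle:
    """Token-bucket throttle whose rate can be changed while running."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = float(rate_per_second)
        self.burst = float(burst)
        self._clock = clock            # injectable so tests can fake time
        self._tokens = float(burst)    # start with a full bucket
        self._last = clock()

    def _refill(self):
        """Add the tokens earned since the last call, capped at the burst size."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)

    def set_rate(self, rate_per_second):
        """The valve: dial the allowed request rate up or down."""
        self._refill()  # settle tokens earned at the old rate first
        self.rate = float(rate_per_second)

    def allow(self):
        """Return True to serve the request, False to shed it."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

<p>In an emergency you would call something like <code>throttle.set_rate(50)</code> from an admin console or a config watcher; requests for which <code>allow()</code> returns False get a quick &#8220;try again later&#8221; response instead of adding load.</p>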
<p>Limping along at only partial functionality is much better than a complete outage, and also takes pressure off the on-call staff while they <a href="http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/" target="_blank">get started on their methodical S.O.P</a> for fixing the root cause of the problem.</p>
<p>As I said earlier, you <em>could</em> try over-engineering a system to prevent these rare exotic catastrophes before they happen, but it often just isn&#8217;t worth it.  Plus, even then, there would probably <em>still</em> be other even-more-improbable-but-still-possible failure modes that could benefit from these brainstorming discussions.  So don&#8217;t necessarily waste large amounts of time engineering ways to prevent these obscure problems, <em>but don&#8217;t ignore their possibility either.</em> Talk about them!</p>
<p>If anyone has more examples of pressure release valves you keep in your own operations toolkit, I&#8217;d be very interested in hearing about them in the comments.</p>
<p><a href="#refFN1" id="FN1">[1]</a> Ignore this advice if you&#8217;re building something like a nuclear reactor.  Make that shit work.<br />
<a href="#refFN2" id="FN2">[2]</a> Just kidding.  Operations gods don&#8217;t get out of bed for anything less than a fulltime newhire college grad.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/01/27/pressure-release-valves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Short emails for your pager</title>
		<link>http://blog.pagerduty.com/2012/01/25/short-emails-for-your-pager/</link>
		<comments>http://blog.pagerduty.com/2012/01/25/short-emails-for-your-pager/#comments</comments>
		<pubDate>Wed, 25 Jan 2012 22:49:14 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Features]]></category>
		<category><![CDATA[On-call]]></category>
		<category><![CDATA[pagers]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1495</guid>
		<description><![CDATA[Our company literally grew out of our founders&#8217; frustration at being on pager duty, where the person on-call would physically carry a pager &#8212; so pagers have always been floating around the PagerDuty universe. For a while now, we&#8217;ve had &#8230; <a href="http://blog.pagerduty.com/2012/01/25/short-emails-for-your-pager/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Our company literally grew out of our founders&#8217; frustration at being on pager duty, where the person on-call would physically carry a pager &#8212; so pagers have always been floating around the PagerDuty universe.</p>
<p>For a while now, we&#8217;ve had a feature that sends alerts as short emails, aimed at customers who send emails to pagers, but we&#8217;ve only enabled it selectively for some accounts.  Based on customer demand, we&#8217;re now turning it on for all accounts, so if you&#8217;re sending emails to a pager (or an SMS gateway<a href="#sms_gateways_warning">*</a>) you can enable short emails and we&#8217;ll send the same text we&#8217;d send via an SMS:<br />
<img src="http://pagerduty.zkimg.com/wp-content/uploads/short-email-UI-cropped.png" alt="" title="short email UI cropped" width="544" height="150" class="aligncenter size-full wp-image-1658" /><br />
Sends:<br />
<img src="http://pagerduty.zkimg.com/wp-content/uploads/Pagerduty.com-Mail-PagerDuty-ALERT-dave@pagerduty.com_.png" alt="" title="Short email alert" width="370" height="62" class="aligncenter size-full wp-image-1499" /></p>
<p><a name="sms_gateways_warning">*</a> Just a gentle reminder: although we don&#8217;t stop you from sending emails to an SMS gateway, we can&#8217;t guarantee delivery.  SMSes sent this way are not guaranteed to be delivered quickly or at all.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/01/25/short-emails-for-your-pager/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCALE 10X, Linux in Southern California</title>
		<link>http://blog.pagerduty.com/2012/01/24/scale-10x-linux-in-southern-california/</link>
		<comments>http://blog.pagerduty.com/2012/01/24/scale-10x-linux-in-southern-california/#comments</comments>
		<pubDate>Tue, 24 Jan 2012 23:55:29 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1660</guid>
		<description><![CDATA[One of our goals this year is to attend more conferences outside of San Francisco, and after the Southern California Linux Expo in Los Angeles, I think it&#8217;s the right call. We&#8217;re a nerdy bunch, and we definitely felt in &#8230; <a href="http://blog.pagerduty.com/2012/01/24/scale-10x-linux-in-southern-california/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One of our goals this year is to attend more conferences outside of San Francisco, and after the <a href="http://www.socallinuxexpo.org/scale10x">Southern California Linux Expo</a> in Los Angeles, I think it&#8217;s the right call.  We&#8217;re a nerdy bunch, and we definitely felt in our element. After <a href="http://www.socallinuxexpo.org/scale10x/speakers/James/Litton">James</a> was on a <a href="http://www.socallinuxexpo.org/scale10x/presentations/panel-monitoring-sucks">panel</a> discussing why &#8220;Monitoring Sucks,&#8221; he was pulled into dozens of nitty-gritty conversations about integrating and deploying monitoring solutions and the cost/benefits of rolling your own.</p>
<p><center><img src="http://a.yfrog.com/img736/5156/662xr.jpg" title="Audience for Monitoring Sucks!" style="max-width:500px;"><br /><small>The Audience for the &#8220;Monitoring Sucks!&#8221; panel, from <a href="https://twitter.com/#!/zenoss/status/160496023377166336">Zenoss</a></small></center></p>
<p>It was a great crowd with a lot of great open source projects on the go, even if I feel I may have to hand in my nerd card for my lack of GitHub followers.  In my defense, I managed to weigh our helium balloons to neutral buoyancy quicker than the <a href="http://blog.linuxchixla.org/">Linux Chix</a> (we were studying air currents in the conference hall, as one does [I loved their penguins]). There are already some <a href="https://www.linux.com/news/featured-blogs/196:zonker/539103:looking-back-on-scale-10x">good</a> <a href="http://cwebber.ucr.edu/2012/01/scale-10x-recap/">recaps</a> out there, but I missed the talks and can&#8217;t comment on those (though I only heard good things). Apart from the great customer contact, we <a href="http://www.socallinuxexpo.org/blog/looking-work">picked up</a> a few exciting resumes from applicants.  <a href="https://twitter.com/#!/irabinovitch">Ilan</a> and the all-volunteer team did a great job, and I expect we&#8217;ll be at SCALE11x next year.</p>
<div id="attachment_1663" class="wp-caption aligncenter" style="width: 310px"><a href="http://pagerduty.zkimg.com/wp-content/uploads/photo-3.jpg"><img src="http://pagerduty.zkimg.com/wp-content/uploads/photo-3-300x224.jpg" alt="" title="Alex hands out an iPad" width="300" height="224" class="size-medium wp-image-1663" /></a><p class="wp-caption-text">We gave away an iPad as part of our &quot;Be a PagerDuty Hero&quot; contest</p></div>
<p>We gave away a lot of t-shirts (we&#8217;re down to a few smalls and XXXL shirts left) and all of our pens, so next year we&#8217;ll bring more stuff &#8212; I don&#8217;t think I really expected over 2000 attendees at the event, all of whom needed to see James from <a href="http://blog.crashspace.org/">Crash Space</a>&#8216;s Wiimote light-saber and <a href="http://www.scuzzstuff.org/oe_cake/">OE Cake</a> demo. </p>
<p>Great time, see you all in February 2012.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/01/24/scale-10x-linux-in-southern-california/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>It&#8217;s Schedulin&#8217; Time!</title>
		<link>http://blog.pagerduty.com/2012/01/04/its-schedulin-time/</link>
		<comments>http://blog.pagerduty.com/2012/01/04/its-schedulin-time/#comments</comments>
		<pubDate>Wed, 04 Jan 2012 22:21:22 +0000</pubDate>
		<dc:creator>Ian Enders</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Features]]></category>
		<category><![CDATA[announcements]]></category>
		<category><![CDATA[calendar]]></category>
		<category><![CDATA[design]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1570</guid>
		<description><![CDATA[The new on-call scheduling tool is here. We migrated all accounts to it prior to the holiday break. Some of you have played with it. Some of you love it. Some of you don&#8217;t love it as much, but still &#8230; <a href="http://blog.pagerduty.com/2012/01/04/its-schedulin-time/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.pagerduty.com/2012/01/04/its-schedulin-time/gramma-paged-2/" rel="attachment wp-att-1591"><img src="http://pagerduty.zkimg.com/wp-content/uploads/gramma-paged1.jpg" alt="&quot;Why are you talking to me scary robotic devil voice!?!&quot;" title="&quot;Why are you talking to me scary robotic devil voice!?!&quot;" width="575" height="289" class="aligncenter size-full wp-image-1591" /></a></p>
<p>The new on-call scheduling tool is here. We migrated all accounts to it prior to the holiday break. Some of you have played with it. Some of you love it. Some of you don&#8217;t love it as much, but still like it a lot. I&#8217;m happy with both of those result types.</p>
<p>At a high level, the new PagerDuty on-call scheduler is a toolkit that allows you to create <a href = "http://support.pagerduty.com/tags/example/entries">pretty much any</a> (recurring) on-call rotation you want. Say you&#8217;re the owner of a web-based startup and you want to be on-call all the time to get notified when your site goes down. You can do that. (You could do it before, but you know&#8230; I&#8217;m being explicit.) Say you want to create a DB Secondary Rotation where you have in-house DBAs on-call weekly from 8:00am to 8:00pm, and then a daily rotating night support staff in an entirely different continent on-call overnight. You can do that. (We call that a &#8220;<a href = "http://support.pagerduty.com/entries/20562768-follow-the-sun-schedule" title = "Follow the Sun">Follow the Sun</a>&#8221; schedule.) Say you want to create a Frontend Dev Primary Rotation that has three developers who rotate daily, but not on weekends. You can do that. (We don&#8217;t recommend having a dead-zone where nobody is on-call, but it is technically possible.) Say you want to put your Grandma on-call for 30 minutes once a week at 7:30am on Wednesday. You can do that. (Though I don&#8217;t know why you would.)</p>
<p><a href="http://blog.pagerduty.com/2012/01/04/its-schedulin-time/screen-shot-2012-01-04-at-2-13-50-pm/" rel="attachment wp-att-1638"><img src="http://pagerduty.zkimg.com/wp-content/uploads/Screen-Shot-2012-01-04-at-2.13.50-PM.png" alt="" title="Screen-Shot-2012-01-04-at-2.13.50-PM" width="575" height="196" class="aligncenter size-full wp-image-1638" /></a></p>
<p>How do you achieve all these wacky calendars designed to torture your loved ones at random intervals? Think Photoshop. Think <a href="http://en.wikipedia.org/wiki/Layers_%28digital_image_editing%29">digital image layers</a>. What you do is basically superimpose different rotations on top of each other, leaving gaps in the higher priority layers to allow lower priority ones to &#8220;show through.&#8221; The layer gaps are like the transparent parts of an image layer that allow you to compose something more interesting.</p>
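<p>If the layer metaphor is easier to follow in code, here is a tiny Python sketch of the idea.  It is not how the scheduler is actually implemented; the data layout (a list of layers, each a list of <code>(start_hour, end_hour, person)</code> shifts within one day) and the names are assumptions made up for this example.</p>

```python
def on_call(layers, hour):
    """Return who is on call at `hour` (0-23), or None for a dead zone.

    `layers` is ordered from highest to lowest priority. Each layer is a
    list of (start_hour, end_hour, person) shifts; hours not covered by
    any shift are that layer's transparent gaps.
    """
    for layer in layers:
        for start, end, person in layer:
            if start <= hour < end:
                return person  # the highest layer covering this hour wins
    return None  # every layer is transparent here: nobody is on call


# Hypothetical example: in-house DBAs cover 08:00-20:00 on the top layer,
# and an overnight team on the bottom layer shows through the gaps.
dba_layer = [(8, 20, "in-house DBA")]
overnight_layer = [(0, 24, "overnight support")]
schedule = [dba_layer, overnight_layer]
```

<p>With this layout, <code>on_call(schedule, 9)</code> picks the DBA from the top layer, while <code>on_call(schedule, 23)</code> falls through the gap to the overnight team.</p>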
<p>So that&#8217;s all well and good, and actually the important part. What is less important, but more exciting to me (because I&#8217;m a dork) is all the behind-the-scenes magic and the actual design and usability that plays out on the front stage. I&#8217;m not going to go into that so much though. I encourage you to play with the new tool and see how it feels. Do you like the feel of smooth responsive JavaScript under your mouse clicks? How do you like the non-blocking page loads and the on-the-fly calendar previewing which responds to all of your actions? These are the sorts of things we are going to push for more and more as we continue to develop the PagerDuty product.</p>
<p>With this project (and another one which will be announced very soon), we are starting a rollout of a site restyling and modernization of our app stack. For those of you who are up to date on your design blogs, yes, Twitter&#8217;s Bootstrap Framework does lay some basic foundation for our new layout and basic page elements. We feel that it&#8217;s best to stick with well known visual paradigms so you can intuitively jump into our (somewhat complicated) application quickly, get productive, and get your systems monitored properly without breaking too much of a sweat.</p>
<p>I better cut this off now as it&#8217;s getting long-winded. But, rest assured, this is just the start of even more greatness. We have been getting some great feedback during our Beta period for this feature, and we listen to your suggestions very intently. There is plenty more where this came from!</p>
<p>Oh, and we&#8217;ve also added iCal and Webcal integration for the calendar; stay tuned for another blog post with the full details.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2012/01/04/its-schedulin-time/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What monitoring tools do you use?</title>
		<link>http://blog.pagerduty.com/2011/12/22/what-monitoring-tools-do-you-use/</link>
		<comments>http://blog.pagerduty.com/2011/12/22/what-monitoring-tools-do-you-use/#comments</comments>
		<pubDate>Fri, 23 Dec 2011 01:41:27 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[data]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1628</guid>
		<description><![CDATA[We support any monitoring tool that can send an email or make a JSON call, but we support tighter integration with some than others. We recently got our hands on a brilliant systems programmer and we&#8217;d like to have him &#8230; <a href="http://blog.pagerduty.com/2011/12/22/what-monitoring-tools-do-you-use/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>We support any monitoring tool that can send an email or make a JSON call, but we support tighter integration with some than others.  We recently got our hands on a brilliant systems programmer and we&#8217;d like to have him building the best possible product for the most useful monitoring systems for our current and potential customers.</p>
<p>To that end, I&#8217;m hoping you could spend 2 minutes filling out this survey: <a href="https://www.surveymonkey.com/s/PDMONB">https://www.surveymonkey.com/s/PDMONB</a></p>
<p>Thanks!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2011/12/22/what-monitoring-tools-do-you-use/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Team building with facial hair</title>
		<link>http://blog.pagerduty.com/2011/11/23/team-building-movember/</link>
		<comments>http://blog.pagerduty.com/2011/11/23/team-building-movember/#comments</comments>
		<pubDate>Thu, 24 Nov 2011 01:46:12 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Community]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1520</guid>
		<description><![CDATA[One of the best things about growing our team at PagerDuty is that we have critical mass for things like Movember (it&#8217;s also an impressive comment on the state of remote working that Ian has managed to police our faces &#8230; <a href="http://blog.pagerduty.com/2011/11/23/team-building-movember/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One of the best things about growing our team at PagerDuty is that we have critical mass for things like Movember (it&#8217;s also an impressive comment on the state of remote working that <a href="http://us.movember.com/mospace/2082150/">Ian</a> has managed to police our faces while working from Toronto for most of this month). <a href="http://us.movember.com/mospace/2294024/">Simon</a> took an early lead, but <a href="http://us.movember.com/mospace/2642094/">Lukasz</a> has blown us all out of the water (even though I think <a href="http://us.movember.com/mospace/2091994/">James</a> is growing the best mustache, even if <a href="http://us.movember.com/mospace/2293618/">Ali</a> and <a href="http://us.movember.com/mospace/2238360/">John</a> aren&#8217;t making that easy). <a href="http://us.movember.com/mospace/2096076/">Liz</a> and <a href="http://us.movember.com/mospace/2411992/">Ana</a> are in too, presumably to make sure <a href="http://us.movember.com/mospace/2672632/">I</a> don&#8217;t come in last in the facial hair department. <a href="http://us.movember.com/mospace/2240984/">Baskar</a>&#8216;s grown enough facial hair for the whole founding team (<strike>too much in fact, there seems to be some bearding going on</strike> Score one for peer pressure!).</p>
<p>You can donate to our team using any of the links above.</p>
<div style="line-height:0"><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3505-150x150.jpg" alt="" title="IMG_3505" width="150" height="150" class="alignnone size-thumbnail wp-image-1552" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3451-150x150.jpg" alt="" title="IMG_3451" width="150" height="150" class="alignnone size-thumbnail wp-image-1536" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3473-150x150.jpg" alt="" title="IMG_3473" width="150" height="150" class="alignnone size-thumbnail wp-image-1542" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3472-150x150.jpg" alt="" title="IMG_3472" width="150" height="150" class="alignnone size-thumbnail wp-image-1541" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3486-150x150.jpg" alt="" title="IMG_3486" width="150" height="150" class="alignnone size-thumbnail wp-image-1545" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3464-150x150.jpg" alt="" title="IMG_3464" width="150" height="150" class="alignnone size-thumbnail wp-image-1539" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3462-150x150.jpg" alt="" title="IMG_3462" width="150" height="150" class="alignnone size-thumbnail wp-image-1538" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3476-150x150.jpg" alt="" title="IMG_3476" width="150" height="150" class="alignnone size-thumbnail wp-image-1544" /><img src="http://pagerduty.zkimg.com/wp-content/uploads/IMG_3450-150x150.jpg" alt="" title="IMG_3450" width="150" height="150" class="alignnone size-thumbnail wp-image-1535" /></div>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2011/11/23/team-building-movember/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>With a little inline help from my friends</title>
		<link>http://blog.pagerduty.com/2011/11/21/with-a-little-inline-help-from-my-friends/</link>
		<comments>http://blog.pagerduty.com/2011/11/21/with-a-little-inline-help-from-my-friends/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 17:54:22 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Features]]></category>
		<category><![CDATA[functionality]]></category>
		<category><![CDATA[help]]></category>
		<category><![CDATA[support]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1465</guid>
		<description><![CDATA[We are absolutely spoiled in terms of the technical competence of our users, so sometimes it&#8217;s difficult for us to imagine that anyone would need help inside our product. But now, thanks to the work our support team has been doing &#8230; <a href="http://blog.pagerduty.com/2011/11/21/with-a-little-inline-help-from-my-friends/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>We are absolutely spoiled in terms of the technical competence of our users, so sometimes it&#8217;s difficult for us to imagine that anyone would need help inside our product.  But now, thanks to the work our support team has been doing on the <a href="http://support.pagerduty.com/forums">knowledge base</a>, we&#8217;re going to start exposing context-sensitive help on every page.</p>
<p>This is very much a work in progress, and we&#8217;ll continue adding documentation &#8212; I&#8217;ll be watching which pages people ask for help with and prioritizing them, and you can continue to email us at <a href="mailto:support@pagerduty.com">support@pagerduty.com</a>.</p>
<p><a href="http://blog.pagerduty.com/2011/11/21/with-a-little-inline-help-from-my-friends/google-chrome/" rel="attachment wp-att-1488"><img src="http://pagerduty.zkimg.com/wp-content/uploads/Google-Chrome-300x182.png" alt="" title="PagerDuty inline help" width="300" height="182" class="aligncenter size-medium wp-image-1488" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2011/11/21/with-a-little-inline-help-from-my-friends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Standard Operating Procedure for when s*IT hits the fan</title>
		<link>http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/</link>
		<comments>http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/#comments</comments>
		<pubDate>Tue, 08 Nov 2011 19:36:14 +0000</pubDate>
		<dc:creator>John Laban</dc:creator>
				<category><![CDATA[Availability]]></category>
		<category><![CDATA[MTTR]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1205</guid>
		<description><![CDATA[This is the third in a series of posts on increasing overall availability of your service or system. In the first post of this series, we defined and introduced some concepts of system availability, including mean time between failure &#8211; MTBF &#8230; <a href="http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><em>This is the third in <a href="http://blog.pagerduty.com/category/availability/" target="_self">a series of posts</a> on increasing overall availability of your service or system.</em></p>
<p>In the <a href="http://blog.pagerduty.com/2011/04/18/the-ups-and-downs-of-availability/" target="_self">first post</a> of this series, we defined and introduced some concepts of system availability, including mean time between failure &#8211; MTBF &#8211; and mean time to recovery &#8211; MTTR. In our <a href="http://blog.pagerduty.com/2011/10/03/reducing-mttr-lessons-from-sun-tzu/" target="_blank">second post</a> we went on to discuss easy ways in which you can effectively reduce MTTR starting <em>now</em>. This post continues on that theme with more tips on reducing MTTR and directly increasing your availability.</p>
<h2>Have an S.O.P</h2>
<p><a href="http://blog.pagerduty.com/2011/10/03/reducing-mttr-lessons-from-sun-tzu/sop_sign/" rel="attachment wp-att-1005"><img class="alignright size-medium wp-image-1005" title="Sop_Sign" src="http://pagerduty.zkimg.com/wp-content/uploads/Sop_Sign-221x300.jpg" alt="" width="221" height="300" /></a>For the <em>really</em> bad problems, have a Standard Operating Procedure that everyone knows how to follow.  The SOP should be a set of steps that make the problem easier to work on by improving communication and organization. This differs from the documented failure modes in the &#8220;know your enemy&#8221; section of <a href="http://blog.pagerduty.com/2011/10/03/reducing-mttr-lessons-from-sun-tzu/" target="_blank">my last post</a> in that an SOP is a generic procedure that can be used in pretty much <em>any</em> major failure scenario.</p>
<p>The SOP is something your on-call primary engineer can bust out if some major catastrophe happens that he or she can&#8217;t resolve within a designated timeframe &#8211; say, 10 minutes.</p>
<p>An example SOP might be for your on-call to:</p>
<ul>
<li>Start a conference call and invite others on the team.  This ensures that everyone working on the problem has a quick and easy communications link, and that you don&#8217;t <a href="http://en.wikipedia.org/wiki/Split-brain_(computing)" target="_blank">step on each other&#8217;s toes</a>.  You could use a dedicated conference bridge phone line, or something as simple as Skype.  Just have this voice channel organized beforehand, so you don&#8217;t have to figure out everyone&#8217;s Skype ID, etc., when really pressed for time.</li>
<li>Make sure there is a designated call leader or &#8220;<a href="http://en.wikipedia.org/wiki/Incident_commander" target="_blank">Incident Commander</a>&#8221; &#8211; a seasoned on-call veteran who doesn&#8217;t necessarily know the broken system in question but knows how to direct others in debugging and resolution tasks.  This call leader should keep everyone on track, make sure balls don&#8217;t get dropped, and resolve disputes if they come up.</li>
<li>Have a set of diagnostics that the on-call primary can start on ASAP while the call is being set up and people are joining.  These diagnostics can be things like monitoring data, relevant graphs, related problems in other systems, etc., and will be immediately useful when the conference call begins.</li>
<li>Have a designated chat system prepared for sharing data, links, code snippets, or whatever needs to be shared non-verbally.  Some ticketing systems are good for this as well.  Here at PagerDuty we use <a href="https://www.hipchat.com/" target="_blank">HipChat</a>.</li>
<li>If you&#8217;re part of a large organization, have a designated person who communicates with business stakeholders, such as upper management.  These stakeholders will be understandably (very) interested in your fixing the problem as fast as possible, but will likely be disruptive if they join the conference call directly.  Your VP probably doesn&#8217;t know the bowels and intricacies of your messaging layer or caching system, but can definitely intimidate the already-stressed engineers in a conference call.  These people can, however, be very useful if major decisions need to be made that might disrupt other facets of the business (say by causing you to temporarily lose orders/requests/money/face/whatever) in order to help fix the problem at hand, so communication with them is important.</li>
</ul>
<p>Having an SOP for really high-severity issues reduces the variability of response and resolution times, and gets everyone up to speed quickly, including business decision makers.  Having a clear procedure to follow at the start of a large issue also reduces stress, uncertainty, and confusion.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2011/11/08/a-standard-operating-procedure-for-when-sit-hits-the-fan/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Loggly and PagerDuty for threshold alerting</title>
		<link>http://blog.pagerduty.com/2011/10/31/using-loggly-and-pagerduty-to-for-threshold-alerting/</link>
		<comments>http://blog.pagerduty.com/2011/10/31/using-loggly-and-pagerduty-to-for-threshold-alerting/#comments</comments>
		<pubDate>Mon, 31 Oct 2011 17:16:00 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Community]]></category>
		<category><![CDATA[loggly]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1460</guid>
		<description><![CDATA[Loggly has a great blog post up about using alert birds to trigger PagerDuty alerts in response to heuristics on your logs: &#8220;..in the example above, if my web servers are spewing 500 exceptions, I want my ops folks to &#8230; <a href="http://blog.pagerduty.com/2011/10/31/using-loggly-and-pagerduty-to-for-threshold-alerting/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Loggly has a great blog post up about using <a href="http://loggly.com/blog/2011/10/pagerduty-loggly-and-alert-birds/">alert birds to trigger PagerDuty</a> alerts in response to heuristics on your logs:</p>
<blockquote><p>&#8220;..in the example above, if my web servers are spewing 500 exceptions, I want my ops folks to get notified, provided there are more than 10 &#8211; I don&#8217;t want to wake anyone up over a little blip!&#8221;</p></blockquote>
<p>As someone who gets those 2am phone calls, anything to eliminate false positives is exciting.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2011/10/31/using-loggly-and-pagerduty-to-for-threshold-alerting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>iCal Integration and Hand-off Notification</title>
		<link>http://blog.pagerduty.com/2011/10/25/ical-integration-and-hand-off-notification/</link>
		<comments>http://blog.pagerduty.com/2011/10/25/ical-integration-and-hand-off-notification/#comments</comments>
		<pubDate>Tue, 25 Oct 2011 23:39:32 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Features]]></category>
		<category><![CDATA[calendar]]></category>
		<category><![CDATA[functionality]]></category>

		<guid isPermaLink="false">http://blog.pagerduty.com/?p=1437</guid>
		<description><![CDATA[Our new calendar is getting its final touches and will add iCal integration to your profile page &#8212; if you&#8217;d like to get in on the beta, let us know at support@pagerduty.com (I forgot to mention it was still in &#8230; <a href="http://blog.pagerduty.com/2011/10/25/ical-integration-and-hand-off-notification/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Our new calendar is getting its final touches and will add iCal integration to your profile page &#8212; if you&#8217;d like to get in on the beta, let us know at <a href="mailto:support@pagerduty.com">support@pagerduty.com</a>. (I forgot to mention it was still in beta in an earlier version of this post.)</p>
<p>Now, you can either download your calendar for the next month, or use a secret URL that&#8217;s unique to your user profile to integrate your on-call schedule with whatever calendar app you use.  It&#8217;s read-only right now, but there are two things that I&#8217;m really excited about:</p>
<ul>
<li>If you share your secret URL with your team, they can add your schedule to their calendar, so managers can see the whole team at once in their calendar.</li>
<li>If you use a program that gives you reminders, you can <a href="http://support.pagerduty.com/entries/20585413-getting-hand-off-notifications">be notified before you go on call</a>, which is a popular request on our <a href="http://feedback.pagerduty.com">feedback forum</a>.</li>
</ul>
<p><img class="alignnone size-full wp-image-1447" title="Find your secret calendar URL" src="http://pagerduty.zkimg.com/wp-content/uploads/Copy-link-address1.png" alt="" width="553" height="298" /></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pagerduty.com/2011/10/25/ical-integration-and-hand-off-notification/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

