PagerDuty Blog

10 Common Server Monitoring Mistakes from the Trenches

This is a guest blog post from Shawn Parrish of NodePing, one of our monitoring partners, about how to avoid some of the more common monitoring stumbling points. NodePing provides simple and affordable external server monitoring services. To learn more about NodePing, visit their website (https://nodeping.com).

I have been responsible for servers and service monitoring for years and have probably made nearly all the mistakes. So listen to the war stories from a guy with scars and learn from my mistakes. Here’s 10 low bridges I’ve bumped my head on. Most of these are smack-your-forehead-duh common sense. Mind the gap.

Here are 10 common server monitoring mistakes I’ve made.

1. Not checking all my servers

Yeah, it seems like a no-brainer, but when I have so many irons in the fire, it’s hard to remember to configure server monitoring for all of them. Some of the more commonly forgotten servers are below (and a quick way to catch the stragglers follows the list):

  • Secondary DNS and MX servers.  This ‘B’ squad of servers usually gets in the game when the primary servers are offline for maintenance or have failed. If I don’t keep my eye on them too, they may not be working when I need them the most. Be sure to keep an eye on your failover boxes.

  • New servers.  Ah, the smell of fresh pizza boxes from Dell! After all the fun stuff (OS install, configuration, burn-in, hardening, testing, etc) the two most forgotten ‘must-haves’ on a new server are the corporate asset tag (anybody still use those?) and setting up server monitoring. Add it to your checklist.

  • Cloud servers. Those quick VPS and AWS instances are easy to set up, and easy to forget to monitor.

  • Temporary/Permanent servers.  You know the ones I’m talking about. The ‘proof of concept’ development box that was thrown together from retired hardware and has suddenly been dubbed ‘production’. It needs monitoring too.
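If you can export a host list from your monitoring service and another from your inventory or configuration management tool, it only takes a few lines to diff them. Here’s a minimal sketch in Python; both filenames are hypothetical placeholders for whatever exports you have on hand.

    # Compare a full server inventory against the hosts that are actually
    # monitored; anything left over is a forgotten box.
    with open("all_servers.txt") as f:
        inventory = {line.strip() for line in f if line.strip()}

    with open("monitored_servers.txt") as f:
        monitored = {line.strip() for line in f if line.strip()}

    for host in sorted(inventory - monitored):
        print(f"NOT MONITORED: {host}")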

2. Not checking all services on a host

We know most failures take the whole box down, but if I don’t watch each service on a host, I could have a running website while FTP has flatlined. The most common thing I forget is to check both HTTP and HTTPS. Sure, it’s the same ‘service’, but the Apache configuration is separate and the firewall rules are likely separate too. Also don’t forget SSL checks, separate from the HTTPS checks, to ensure your SSL certificates are still valid. I’ve gotten the embarrassing calls about the site being ‘down’ only to find out that the cert had expired. Oh, yeah… I was supposed to renew that, wasn’t I?
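Checking the certificate itself is cheap to do. Here’s a minimal sketch in Python, assuming a hypothetical host name and a 14-day warning threshold; a monitoring service with SSL checks does the same thing for you on a schedule.

    # Connect over TLS, pull the certificate, and warn when it is close to
    # expiring. HOST and WARN_DAYS are placeholders to tune for your site.
    import socket
    import ssl
    from datetime import datetime, timezone

    HOST = "www.example.com"
    PORT = 443
    WARN_DAYS = 14

    context = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            cert = tls.getpeercert()

    # 'notAfter' is a string like 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    days_left = (expires - datetime.now(timezone.utc)).days
    if days_left < WARN_DAYS:
        print(f"WARN: certificate for {HOST} expires in {days_left} days")
    else:
        print(f"OK: {days_left} days left on the certificate")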

3. Not checking often enough

Users and bosses have very little tolerance for downtime. I learned that lesson when trying to use a cheap monitoring service that only provided 10-minute check intervals. That’s up to 9.96 minutes of risk (pretty good math, huh?) that my server might be down before I’m alerted. Now I configure 1-minute check intervals on all services. Even if I don’t need to respond right away (say, a development box that goes down in the middle of the night), I’ll know when it went down to within 60 seconds, which can be helpful when slogging through the logs for root cause analysis later.

4. Not checking HTTP content

A standard HTTP check is good, but the ‘default’ under-construction Apache server page has given me that happy 200 response code and a green ‘PASS’ in my monitoring service just like my real site does. Choose something in the footer of the page that doesn’t change and do an HTTP content-matching check on that. Don’t use the domain name, though; that may show up on the ‘default’ page too and make the check less useful.

It’s also important to make sure certain content does NOT show up on a page. We’ve all visited a CMS site that displayed that nice ‘Unable to connect to database’ error. You want to know if that happens.
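Both rules can live in the same check: one string that must be on the page and one that must not be. Here’s a minimal sketch using the Python ‘requests’ library; the URL and match strings are placeholders for your own site.

    # A content-matching HTTP check: the status code alone isn't enough.
    import requests

    URL = "https://www.example.com/"
    MUST_CONTAIN = "© Example Corp"                  # stable footer text, not the domain name
    MUST_NOT_CONTAIN = "Unable to connect to database"

    resp = requests.get(URL, timeout=10)
    if resp.status_code != 200:
        print(f"FAIL: got HTTP {resp.status_code}")
    elif MUST_CONTAIN not in resp.text:
        print("FAIL: expected footer text is missing (default page?)")
    elif MUST_NOT_CONTAIN in resp.text:
        print("FAIL: page is up but showing a database error")
    else:
        print("PASS")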

5. Not setting the correct timeout

Timeouts for a service are very subjective and should be configurable in your monitoring service. The web guys tell me our public website should load in under 2 seconds or our visitors will go elsewhere. If my HTTP service check takes 3.5 seconds, that should be considered a FAIL result and someone should be notified. Likewise, if I had a 4-second ‘helo’ delay configured in my sendmail, I’d want to move that check’s timeout up over 5 seconds. Timeouts set too high let performance issues go unnoticed; timeouts set too low just increase my notification noise. It takes time to tweak these on a per-service level.
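The same idea in code: a slow response should count as a failure, not a pass. Here’s a minimal sketch assuming the web team’s 2-second budget and a placeholder URL; note that the ‘requests’ timeout covers connect and read time rather than total wall-clock time, which is close enough for this purpose.

    # Treat a slow response as a FAIL even though the service answered.
    import requests

    URL = "https://www.example.com/"
    TIMEOUT_SECONDS = 2.0   # tune per service

    try:
        resp = requests.get(URL, timeout=TIMEOUT_SECONDS)
        print(f"PASS: responded in {resp.elapsed.total_seconds():.2f}s")
    except requests.exceptions.Timeout:
        print(f"FAIL: no response within {TIMEOUT_SECONDS}s")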

6. Forgetting DNS goes both ways

Sure, I’ve got DNS checks to make sure my hostnames are resolving to my IPs, but all too often I forget to check the reverse DNS (rDNS) entries as well. It’s especially important for SMTP services to have properly resolving PTR records, or my email will be headed for the spam bucket. I always monitor SPF and DKIM records while I’m at it. Your monitoring service can do that, right?

Even when I’m using a reputable external DNS service, I set up DNS checks to monitor each of the NS records on my domains. A misconfiguration on my part or theirs will cause all kinds of havoc.
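Here’s a minimal sketch of a forward-plus-reverse check using only the Python standard library; the hostname and expected IP are placeholders for your own mail server, and a real DNS check would query the authoritative servers directly.

    # Forward lookup (hostname -> IP) and reverse lookup (IP -> PTR) should agree.
    import socket

    HOSTNAME = "mail.example.com"
    EXPECTED_IP = "203.0.113.10"

    ip = socket.gethostbyname(HOSTNAME)
    if ip != EXPECTED_IP:
        print(f"FAIL: {HOSTNAME} resolves to {ip}, expected {EXPECTED_IP}")

    ptr_name, _, _ = socket.gethostbyaddr(ip)
    if ptr_name.rstrip(".") != HOSTNAME:
        print(f"FAIL: PTR for {ip} is {ptr_name}, expected {HOSTNAME}")
    else:
        print(f"PASS: forward and reverse DNS agree for {HOSTNAME}")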

7. Sensitivity too low/high

Some servers or services seem more prone to little hiccups that don’t take the server down but may intermittently cause checks to fail due to traffic, routing, or maybe the phase of the moon. Nothing’s more annoying than a 3AM ‘down’ SMS for a host that really isn’t down. Some folks call this a false positive or flapping; I call it a nuisance. Of course I shouldn’t jump every time a single ping loses its way around the interwebs or an SMTP ‘helo’ goes unanswered, but that’s when a more dangerous condition sets in: I may be tempted to start ignoring notifications altogether because of all the noise from alerts I really don’t care about.

A good monitoring service handles this nicely by allowing me to adjust the sensitivity of each check. Set this too low and my notifications for legitimate down events take too long to reach me, but set it too high and I’m swamped with useless false positive notifications. Again, this is something that should be configured per service and will take time to tweak.
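If you’re rolling your own checks, the usual trick is to require several consecutive failures before anyone gets paged. Here’s a minimal sketch with a simple TCP connect check; the host, port, and thresholds are placeholders, and a decent monitoring service exposes the same knob as a ‘sensitivity’ or ‘recheck’ setting.

    # Only alert after N failed checks in a row, so one lost packet doesn't
    # page anyone at 3AM. Tune the threshold per service.
    import socket
    import time

    HOST = "www.example.com"
    PORT = 443
    FAILURES_BEFORE_ALERT = 3   # higher = less sensitive
    CHECK_INTERVAL = 60         # seconds between checks

    def run_check() -> bool:
        # Simple TCP connect check; swap in whatever check the service needs.
        try:
            with socket.create_connection((HOST, PORT), timeout=5):
                return True
        except OSError:
            return False

    consecutive_failures = 0
    while True:
        if run_check():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_ALERT:
                print(f"ALERT: {HOST}:{PORT} failed {FAILURES_BEFORE_ALERT} checks in a row")
        time.sleep(CHECK_INTERVAL)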

8. Notifying the wrong person

Nothing ruins a vacation like a ‘host down’ notification. Sure, I’ve got backup sysadmins who should be covering for me, but I forget to change the PagerDuty schedules so that notifications get delivered to them instead of me.

9. Not choosing the correct notification type

Quick on the heels of #8 is knowing which type of notification to send. Yeah, I’ve made the mistake of configuring email alerts to fire when the email server is down. Critical server notifications should almost always be sent via SMS, voice, or persistent mobile push.

10. Not whitelisting the notification system’s email address

Quick on the heels of #9 (we’ve got lots of heels around here) is recognizing that if I don’t whitelist the monitoring service’s email address, its messages may end up in the spam bucket.

Bonus!

11. Paying too much

I’ve paid hundreds of dollars a month for a mediocre monitoring service for a couple dozen servers before. That’s just stupid. NodePing costs $15 a month for 200 servers/services at 1-minute intervals, and it’s not the only cost-effective monitoring service out there. Be sure to shop around to find one that fits your needs well. Pair it up with PagerDuty’s on-call/hand-off capabilities and you’re well on your way to avoiding the scars I’ve got without losing your shirt.

Nuff said, true believer.