Defending the Bird: Product Security Engineering at Twitter

Alex Smolen, Software Engineer at Twitter, recently spoke to our DevOps Meetup group at PagerDuty HQ about the philosophies and best practices his teams follow to maintain a high level of security for their 255+ million monthly active users.

Security in a Fast-Moving Environment: The Challenges

Twitter is one of the world’s most widely used social networks, and it continues to add users at a steady clip.

While Twitter’s growth is exciting, it also poses challenges from a security standpoint. Because so many people count on Twitter to deliver real-time news and information, it’s a constant target for hackers. Two past incidents illustrate why Twitter security matters:

  • When The Associated Press’ account (@AP) was compromised and a tweet was sent about a nonexistent bombing, it drove down the stock market that day
  • When spam was sent from Barack Obama’s account, Twitter received an FTC consent decree related to information security

Twitter’s fast growth also demands heavy infrastructure investment, which forces the company’s security team to move quickly. The site began as a Rails app but has since moved to a Scala-based architecture, a change that demanded all-new security tools and techniques.

Plus, Alex noted, the security team is responsible for many acquired apps on top of Twitter itself.

The First Step in Securing Twitter: Automation

Automation is one of the strategies that both PagerDuty and Twitter use to optimize for security. The driving force behind automation at Twitter is a desire to reserve engineers’ creativity and judgment for the work that actually requires it.

“When we’re doing something and we think it’s tedious, we try to figure out a way to automate it.” – Alex Smolen, Software Engineer, Twitter

The push for automation first arose at one of Twitter’s Hack Weeks, which Alex likened to a big science fair. From those initial efforts, the security team created a central place to manage security information and run both static and dynamic analyses.

Automation helps Twitter’s engineers find security issues early on in the development process. When security problems do crop up, Twitter’s automation tools – in concert with PagerDuty’s operations performance platform – help assign incidents to the right people, so problems get solved more quickly.

One example is Brakeman, a static-analysis tool that scans Rails apps for vulnerabilities in their code. If a vulnerability is discovered, the developer is alerted so they can get to the issue quickly. The goal is to close the loop as fast as possible, since the later a problem is discovered, the more complex and expensive it is to resolve.
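As a rough illustration (not Twitter’s actual tooling), here’s how a script might drive Brakeman and collect its findings for routing. The JSON report keys shown are Brakeman’s, though their exact shape can vary between versions, and the app path is hypothetical:

```python
import json
import subprocess

def scan_rails_app(app_path):
    """Run Brakeman against a Rails app and return its warning list."""
    # -q suppresses informational output; -f json emits a machine-readable report.
    result = subprocess.run(
        ["brakeman", "-q", "-f", "json", app_path],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout)
    return report.get("warnings", [])

# Each warning could then be routed to the developer who owns the file.
for warning in scan_rails_app("/srv/example-rails-app"):  # hypothetical path
    print(warning.get("warning_type"), warning.get("file"), warning.get("line"))
```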

Other tools include Coffee Break for JavaScript and Phantom Gang, which dynamically scans Twitter’s live site. As with Brakeman, issues are assigned to the right on-call person for the job.

The Second Step: Robust Code Review Process

Security is not just the security team’s responsibility; it is owned by engineers across the company. There are also specific teams that deal with spam and abuse.

On the theme of shared accountability, Twitter’s developers are encouraged to work out security kinks early in the development process. Sensitive code gets a security review as soon as it is submitted, and devs can use a self-serve form to request the security team’s input.

The security engineering team keeps itself accountable with the help of a homebuilt dashboard showing which reviews still need to be done. Once upon a time, Twitter’s security engineers used roshambo to assign code reviews; as the team scaled, they switched to a script that assigns reviews randomly.

“Roshambo is really hard to do over Skype.”
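Such an assignment script can be tiny. A minimal sketch, with an invented roster and review queue:

```python
import random

ENGINEERS = ["alice", "bob", "carol", "dave"]  # hypothetical team roster

def assign_reviews(pending_reviews):
    """Randomly spread pending security reviews across the team."""
    return {review: random.choice(ENGINEERS) for review in pending_reviews}

print(assign_reviews(["change-101", "change-102", "change-103"]))
```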

The Third Step: Designing Around Users

Twitter users, all 255-plus million of them, have a vested interest in the site remaining secure. For that reason, some of Twitter’s security measures are customized for specific use cases.

One is two-factor authentication, which has been available on Twitter for some time. Initially, it was SMS-based; today, there is a natively built version that can generate a private key to sign login requests.
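Twitter hasn’t published the exact scheme, but the general pattern is a device-held private key whose public half is enrolled with the server. A minimal sketch of that pattern using Ed25519 via Python’s cryptography package:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Enrollment: the device generates a key pair and registers the public half.
device_key = Ed25519PrivateKey.generate()
enrolled_public_key = device_key.public_key()

# Login: the server issues a challenge; the device signs it with the private key.
challenge = b"random-nonce-from-server"
signature = device_key.sign(challenge)

# The server verifies the signature against the enrolled public key.
try:
    enrolled_public_key.verify(signature, challenge)
    print("login approved")
except InvalidSignature:
    print("login rejected")
```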

Another user-facing measure is an emphasis on SSL. Twitter was one of the first major services to require 100% SSL. Yet because browsers will still attempt non-SSL connections by default, Alex’s team has enabled HTTP Strict Transport Security (HSTS), which instructs browsers to only ever use the SSL version of the site. Another strategy in use is certificate pinning: if someone tries to intercept traffic to Twitter with a forged certificate, the native client won’t accept it.
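HSTS itself is just a response header. A minimal sketch of setting it in a hypothetical Flask app:

```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_hsts(response):
    # Tell browsers to use HTTPS for all requests to this domain for one year.
    response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return response
```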

Ultimately, Alex said, security is about enabling people – both users and Twitter’s own engineers. Given that Twitter’s security team represents about 1% of all the engineers in the company, keeping Twitter secure isn’t easy. But with the right processes and tools, those engineers can do their jobs effectively and keep Twitter humming.

Watch Alex’s full presentation here:

Stay tuned for blog posts around the other two security meetup presentations from Stephen Lee (Okta) and our very own Evan Gilman.


Effective Start / End Practices for On-Call Scheduling

Since we launched on-call handoff notifications, lots of our customers have used them to make sure they never forget when they’re on call. Over the years, we’ve seen a variety of on-call schedules and thought we’d share some of the more favored practices.

Exchange Shifts During Business Hours

Below is a distribution of all start and end times for on-call shifts scheduled within PagerDuty:

[Chart: distribution of on-call shift start and end times]

The most popular time to hand off an on-call shift is midnight, followed by 8:00 AM and 9:00 AM, then 5:00 PM and 6:00 PM. Despite the popularity of the midnight swap, at PagerDuty we recommend having your handoff occur during business hours, preferably when both parties are present in the office. Unless you both happen to be in the office at midnight, in which case, go home.

Switching your shift while on-site gives you the opportunity to talk to the next person going on-call about any issues that occurred during the previous shift, or to give a heads-up on anything they may want to be on the lookout for.

At GREE, the team syncs up every Monday morning to review alerts from the previous week, go over the upcoming schedule and hand off the rotation to the next on-call team. This gives each team additional insight into the week ahead and makes sure everyone knows who is the primary, secondary and manager responsible for keeping GREE reliable each week.

Don’t Switch Shifts Over the Weekend

Below is a graph of shift handoffs distributed by day of the week:

[Chart: shift handoffs by day of the week]

Since we recommend exchanging on-call responsibility while on-site, we’d expect to see fewer shifts change hands over the weekend, and this distribution closely aligns with our scheduling philosophy at PagerDuty. By switching shifts on Monday, you are able to review an entire week of data with minimal confusion.

Alternatively, you can schedule your shift exchanges during your weekly team meetings. This still lets you review information and give a heads-up about any potential problems.

Keep Regular Shift Lengths

Another hot topic in on-call scheduling is shift length. Should you switch weekly? Daily? Hourly? While much of this may depend on the size of your team, you’ll also want to consider other factors. For example, review your historical alerting data for hot spots in your systems to make sure no single person is getting the short end of the stick, leading to burnout.

Below is a distribution of popular shift lengths, from 1 hour up to 2 weeks:

[Chart: distribution of shift lengths]

The most popular shift lengths seem to be 8 hours, 12 hours, 1 day and 1 week. Keeping simple shift lengths means less confusion and forgetfulness about when you begin or end a shift.

So when should you use each of these shift lengths?

  • 8 Hours – Great for someone who is covering the business day. You may have another team covering off-business hours.
  • 12 Hours – This is popular for teams utilizing PagerDuty’s Follow-the-Sun schedule, which allows your international teams to be on-call during hours they would be awake.
  • 1 Day – Simple for medium-sized teams where everyone is responsible for one day at a time.
  • 1 Week – Great for small teams so they don’t have to toss responsibility back and forth to each other.

Find a Schedule that Works for Your Team

At PagerDuty, each internal team handles on-call scheduling differently. Our Operations Team has a simple weekly rotation, while our Realtime Team has a weekday/weekend rotation, where weekday and weekend coverage are handed off separately. This is because our Realtime Team is slightly larger than our Operations Team, so they wanted team members on call more often and for shorter shifts to keep their operational skills from getting rusty.

Our actual Real-Time team schedule:

[Screenshot: the Realtime Team’s on-call schedule]

It’s important to find a process that works for your team. Even within a single organization, different departments may find one approach works better than another. PagerDuty’s On-Call Schedules let you customize your team’s shifts however works best for you. If you’re not quite sure where to start, just remember the basics: switch shifts when both parties are present in the office (if possible) and maintain a standard shift length (e.g. 12 hours, 1 day, 1 week) to help avoid confusion about when shifts start and end.

To make this transition more convenient, we also now offer heads-up notifications before you start and end your shift. Simply log into your PagerDuty account and edit your On-Call Notification Rules from your profile page to get started.



Get Notified Before You Go On-Call

In February, we launched On-Call Handoff Notifications so you never forget that you’re on call and miss an alert!

By popular demand, we’ve tweaked this feature to let you decide when you want to be notified of your shift, up to 48 hours before your shift begins.

[Screenshot: On-Call Handoff Notification settings]

From your profile page in PagerDuty, scroll down to On-Call Handoff Notification Rules. Simply select how many hours before your shift begins or ends you’d like to be notified, and choose your notification method.

Are you a bit of a snoozer who needs to be reminded multiple times? No problem: you can set up to 10 On-Call Handoff Notification Rules.

Pro Tip: When deciding when to be notified about your shift, take a glance at your schedule to see when you typically go on call. If your shift switches in the morning, you may not want your notification to needlessly wake you up in the middle of the night.


Meetup: Keeping Customer Data Safe


This Friday, July 11th at 12:00 PM we’re hosting our second Meetup at PagerDuty HQ. Swing by, grab a slice of pizza and a cup of beer, and hear speakers from Twitter, PagerDuty and Okta on how to keep your customers’ data safe.

RSVP on our meetup group to attend. Not in San Francisco? No sweat; we’ll also be streaming the event live. Register here to save your spot.

Alex Smolen, Software Engineer at Twitter. Alex has a master’s degree from the School of Information at UC Berkeley. Previously, he was a web security consultant at Foundstone, a division of McAfee.

With over a billion global registered users, Twitter’s security team is responsible for keeping every account safe and for protecting Twitter as a tool for free speech. From two-factor authentication to geo-signals, threat levels differ for each user, and account-level security needs to be granular enough to identify and stop hackers across a diverse user base. Working closely with engineering teams across the company to design and implement secure systems, Alex Smolen and his security team take an automated approach, deploying a specific suite of tools to proactively find and fix vulnerabilities.

Evan Gilman, Operations Engineer at PagerDuty. Evan is a Senior Engineer on our Operations Team, and when he isn’t in the SF office, you can find him with a camera in an exotic part of the world.

Evan will discuss how we establish security standards at PagerDuty and constantly validate our security architecture. He’ll also dive into how PagerDuty creates fault-tolerant protocols and sets up monitoring to immediately tackle any security threat. For PagerDuty, protecting customer data not only builds customer trust, but helps ensure maximum uptime across the platform.

Stephen Lee, Director of Platform Solutions at Okta. Stephen is in charge of product strategy and evangelism, focusing on solutions for ISV/SI partners and customers. Prior to joining Okta, Stephen spent 10+ years at Oracle in roles from engineering to product management in Identity Management. With the ever-expanding number of devices, cloud applications, and people (employees, partners, customers and consumers), IT faces a tough challenge to securely and efficiently manage access. When managing security for internal users, the problem often spans IT, HR, and Operations. When managing security for external-facing applications, the problem goes beyond IT and business owners, potentially involving partners’ and customers’ IT.

RSVP for Meetup | RSVP for Live Stream

*Doors open at 11:30 AM. Event will start promptly at 12:00 PM and doors will close at 12:05 PM.


Security Monitoring, Alerting and Automation

Constant validation is an essential piece of PagerDuty’s security methodology – and it takes place by way of continuous monitoring and alerting. A robust monitoring system helps us proactively detect issues and resolve them quickly.

Here are a handful of the monitoring and alerting tactics that we employ.

Port Availability Monitoring

Using our dynamic firewalls, we maintain a list of ports that should be open or closed to the world. Since this information is held in our Chef server, we are able to build out the checks for which ports should be open or closed on each server. We then run these checks continuously, and if one fails, we receive a PagerDuty alert for it. We use a framework called Gauntlt to do this, as it makes simple checks against infrastructure security very easy.
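Gauntlt expresses these checks in its own attack files; as a rough Python equivalent of the idea (the policy structure here is invented, where the real one is derived from Chef data):

```python
import socket

# Hypothetical policy, as it might be derived from Chef node data.
POLICY = {"10.0.0.5": {"open": {22, 443}, "closed": {3306}}}

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_ports(policy):
    """Return a list of policy violations; a non-empty list would page someone."""
    failures = []
    for host, rules in policy.items():
        failures += [f"{host}:{p} should be open" for p in rules["open"] if not port_is_open(host, p)]
        failures += [f"{host}:{p} should be closed" for p in rules["closed"] if port_is_open(host, p)]
    return failures
```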

Centralized Logging and Analysis

We currently use Sumo Logic for centralized logging. From a security standpoint, we do this because one of the first things an attacker will do is shut down logging to hide their tracks. By shipping logs off-host and setting up pattern alerts on them, we can quickly react to problems we find. In addition, we use OSSEC to collect and analyze all syslog and application log data.

Active Response

Lastly, for well-understood attacks, we have tools in place that can take action without any input from our team members. We are still very early in our active-response implementation, but as our infrastructure grows, we will need to build out more of these solutions so we are not constantly reacting to security incidents.

  • DenyHosts. We have deployed DenyHosts to every server in our infrastructure. If a non-existent user tries to log in, or if there is another brute-force attack, we actively block the IP. While we have external SSH disabled on our infrastructure, we still leverage a set of gateway or ‘jump’ servers to access our servers. Since setting this up last July, we have blocked 1,085 unique IP addresses from accessing our infrastructure.

  • OSSEC. We use the open-source intrusion detection system OSSEC for detecting strange behavior on our servers. It continuously analyzes critical log files and directories for anomalous changes. OSSEC has different ‘levels’ of alerts; low- and medium-level ones will send out an email, while high-level alerts will create a PagerDuty incident so a member of our Operations team can immediately respond to the problem. We are not currently leveraging OSSEC’s built-in blocking abilities, but as we learn more about the common attack patterns on our infrastructure, we plan on enabling them.

Being proactive about monitoring is how we keep our services up and running. The active-response tools listed above hint at where we’d like to go with our security architecture.


Alert on the Internet of Everything

Our customers are natural tinkerers and builders, and we’re excited to launch our integration with Temboo.

Many PagerDuty customers have found great success in using our solution to alert the right person and teams when issues occur in their systems and software. Some PagerDuty customers have been applying our alerting and on-call capabilities to other unique use cases such as creating an on-call rotation for roommate chores and sending alerts when trees are illegally cut down.

Temboo offers a unique programming platform that normalizes access to 100+ APIs, databases, and code utilities to give developers the ability to connect to other applications without all the headache.

APIs are powerful, but require maintenance

Applications aren’t useful when they’re siloed. APIs give disparate applications a common way to talk to one another, but for developers, maintaining those integrations and keeping up with API documentation is a hassle. Temboo sits on top of APIs to abstract away the complexity of managing and integrating with other applications. With Temboo, you can generate just a few lines of code in the programming language of your choice from your browser, and use those few lines to easily incorporate over 2,000 API processes into your project.

Helping makers connect

Arduino is an open-source, lightweight computing platform designed to give makers an easy way to create devices that interact with their environment using sensors and actuators. The uses of Arduino are endless: with the ability to sense the environment, tinkerers have created robots, thermostats, and motion detectors from scratch. Temboo partners with Arduino to make it easier for projects to interact with web applications. With Temboo, every Arduino can easily grab data from and interact with web-based services like Fitbit, Facebook, Google, and now PagerDuty. Temboo’s integration with PagerDuty (aka PagerDuty Choreos) makes it easier for Arduino and other hardware to trigger PagerDuty alerts.

[Diagram: Arduino–Temboo integration schema]

For example, if you really want to buy a drone on eBay and want a real-time alert when one is listed, Temboo’s eBay and PagerDuty Choreos let you do just that. If you want an alert whenever the humidity in your greenhouse dips below a certain level, you can use Temboo’s PagerDuty Choreos for that, too. Even if you just want an alert every time the weather at the beach is warm enough to go swimming, Temboo and PagerDuty can take care of it; all this and more can be done with just a few short lines of code.
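For the greenhouse example, Temboo’s generated Choreo code would handle the API call for you; as a hedged sketch of the underlying idea, here’s a direct post to PagerDuty’s generic events endpoint when a reading crosses a threshold (service key and threshold are placeholders):

```python
import requests

EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
SERVICE_KEY = "YOUR_PAGERDUTY_SERVICE_KEY"  # placeholder

def check_humidity(reading, threshold=40.0):
    """Trigger a PagerDuty incident when greenhouse humidity drops too low."""
    if reading < threshold:
        requests.post(EVENTS_URL, json={
            "service_key": SERVICE_KEY,
            "event_type": "trigger",
            "description": f"Greenhouse humidity low: {reading}%",
        })

check_humidity(35.2)
```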

Let your imagination take you far and away. Read this integration guide to start connecting PagerDuty with the Internet of Everything.


10 Common Ops Mistakes

Updated 7/24/2014: This blog post was updated to more accurately reflect Arup’s talk.

Arup Chakrabarti, PagerDuty’s operations engineering manager, stopped by Heavybit Industries’ HQ to discuss the biggest mistakes an operations team can make and how to head them off. To check out the full video, visit Heavybit’s video library.

1. Getting It Wrong in Infrastructure Setup

Creating Accounts

A lot of people use personal accounts when setting up enterprise infrastructure deployments. Instead, create new accounts using corporate addresses to enforce consistency.

Be wary of how you store passwords. Keeping them in your git repo could require you to wipe out your entire git history at a later date. It’s better to save passwords within configuration management so they can be plugged in as needed.
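As a minimal illustration of the point (names are hypothetical), keep the secret out of the code and pull it from the environment or a file your configuration management renders at deploy time:

```python
import os

# Anti-pattern: a hardcoded password lives in git history forever.
# DB_PASSWORD = "hunter2"

# Better: the repo only references the secret; configuration management
# (or the environment) supplies the actual value on each host.
DB_PASSWORD = os.environ["DB_PASSWORD"]
```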

Selecting Tools

Another good move with new deployments: select your tools wisely. For example, leverage PaaS tools as long as possible – that way, you can focus on acquiring customers instead of building infrastructure. And don’t be afraid to employ “boring” products like Java. Well-established, tried-and-true tech can let you do some really cool stuff.

2. Poorly Designed Test Environments

Keep Test and Production Separate

You don’t want to risk your test and production environments mingling in any way. Be sure to set up test environments with different hosting and provider accounts from what you use in production.

Virtual Machines

Performing local development? There’s no way around it: applications will run differently on local machines and in production. To simulate a production environment as closely as possible, create VMs with a tool like Vagrant.

3. Incorrect Configuration Management

Ansible and Salt are both tools that are really easy to learn. Specifically, Ansible makes infrastructure-as-code deployment super-simple for ops teams.

What is infrastructure-as-code? Essentially, it’s the process of building infrastructure in such a way that it can be spun up or down quickly and consistently. Server configurations are going to get screwed up regardless of where your infrastructure is running, so you have to be prepared to restore your servers in as little time as possible.

Whatever tool you use, it’s best as a rule of thumb to limit the number of automation tools you’re using. Each one is a source of truth in your infrastructure, which means it’s also a point of failure.

4. Deploying the Wrong Way

Consistency matters

Every piece of code must be deployed in as similar a fashion as possible. But getting all of your engineers to practice consistency can be a challenge.

Orchestrate your efforts

Powerful automation software can certainly help enforce consistency. But automation tools are only appropriate for big deployments – so when you’re getting started, Arup suggests managing deploys with git and employing an orchestration tool: for example, Capistrano for Rails, Celery for Python, or Ansible and Salt for both orchestration and configuration management.

5. Not Handling Incidents Correctly

Have a process in place

Creating and documenting an incident management process is absolutely necessary, even if the process isn’t perfect.

You should be prepared to review the incident-management document on an ongoing basis, too. If you’re rarely experiencing downtime, frequent reviews won’t really be necessary.

Put everyone on-call

It’s becoming less and less common for companies to have dedicated on-call teams – instead, everyone who touches production code is expected to be reachable in the event of downtime.

This requires a platform (like PagerDuty) that can notify different people in different ways. What really matters is getting a hold of the right people at the right time.

6. Neglecting Monitoring and Alerting

Start anywhere

The specific tool you use for monitoring is less important than just putting something in place. PagerDuty uses StatsD in concert with Datadog; open-source tools like Nagios can be just as effective.

If you have the money, an application performance management tool like New Relic might be a good fit. But, what matters most is that you have a monitoring tool on deck.

“You have no excuse to not have any monitoring and alerting on your app, even when you first launch,” – Arup Chakrabarti, Engineering Manager, PagerDuty

7. Failing to Maintain Backups

Systematizing backups and restores

Just like monitoring and alerting, backing up your data is non-negotiable. Scheduling regular backups to S3 is a standard industry practice today.

At least once a month, try restoring your production dataset into a test environment to confirm that your backups are working as designed.
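A hedged sketch of the upload half using boto3, the AWS SDK for Python (bucket and paths are hypothetical, and credentials are assumed to be configured):

```python
import datetime
import boto3

s3 = boto3.client("s3")

def backup_to_s3(dump_path, bucket="example-db-backups"):
    """Upload a database dump to S3 under a dated key."""
    key = f"backups/{datetime.date.today().isoformat()}/db.dump"
    s3.upload_file(dump_path, bucket, key)
    return key

# The restore test is the other half: download the latest dump into a test
# environment, load it, and verify the data actually comes back.
```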

8. Ignoring High Availability Principles

‘Multiple’ is the watchword

Having multiple servers at every layer, multiple stateless app servers and multiple load balancers is a no-brainer. Only with multiple failover options can you truly say you’ve optimized for HA.

Datastore design matters, too

Datastores (like Cassandra) are essential because with multimaster data clusters, individual nodes can be taken out with absolutely no customer-facing impact. Clustered datastores are ideal in fast-moving deployment environments for this reason.

9. Falling Into Common Security Traps

Relying solely on SSH

Instead of exposing SSH directly on your database servers and load balancers, use gateway boxes. You can run proxies through these gateways and lock traffic down if you suspect an incursion.

Not configuring individual user accounts

When an employee leaves your organization, it’s nice to be able to revoke his or her access quickly. But there are other reasons to set people up with individual user accounts on your various tools. Someone’s laptop may get lost. An individual might need a password reset. It’s a lot easier, Arup notes, to revoke or reset one user password than a master account password.

Failing to activate encryption in dev

Making encryption a part of the development cycle helps you catch security-related bugs early in development. Plus, forcing devs to think constantly about security is simply a good practice.

10. Ignoring Internal IT Needs

Not strictly an operations problem, but…

IT isn’t always ops’ concern. But on certain issues, both teams are stakeholders. For example:

  • Commonality in equipment: If an engineer loses her custom-built laptop, how long will it take to get her a replacement? Strive for consistency in hardware to streamline machine deployments.

  • Granting access to the right tools: On-boarding documents are a good way to share login information with new hires.

  • Imaging local machines: With disk images stored on USB, provisioning or reprovisioning equipment is a snap.

  • Turning on disk encryption: With encryption, no need to worry if a machine gets lost.

There are millions more mistakes that operations teams can make. But these 10 tend to be the most commonly seen, even at companies like Amazon, Netflix and PagerDuty.

Have your own ops mistake you’d like to share? Let us know in the comments below.


Transparent, Real-Time Incident Customer Communication with StatusPage.io


Outages happen. Any amount of downtime is unacceptable to customers, but providing honest, up-to-date information decreases backlash and creates a sense of trust that you’re on it. We’ve partnered with StatusPage.io to help you streamline communication to your internal teams and customers.

As the hub of all your operations information, PagerDuty gives users visibility into issues across the entire stack to quickly diagnose root cause issues. While you’re working together to determine the root cause, you can now automatically tie PagerDuty incidents to your internal and external-facing status pages to keep everyone updated on system performance.

Eliminate swarms of emails and tickets

Your colleagues depend on your systems to be up and running to do their jobs. If the network is down, everyone needs to know. With the new integration, engineers can automatically send status notifications company-wide, letting them fix the problem at hand while avoiding swarms of internal status emails. Douglas Jarquin, Technology Operations Manager at Zumba Fitness, says that after adopting StatusPage.io, “instead of the whole company running to (the IT manager’s) desk, he sends updates company wide with a few clicks.”

Without being bogged down by emails, engineers can focus on fixing the problem. Everyone at all levels of the company can stay up to date on the progress and get notified when issues are fixed so they can get back to it.

Build customer trust

For a SaaS company, uptime is extremely important. Your customers depend on your service to work whenever they need it. Using StatusPage.io, your customers can proactively opt in to status notifications, helping you turn painful outages into memorable customer experiences.

“We struggled to provide appropriate incident communication to our customers with a homegrown status page for years. StatusPage.io provides us with the right tool for the job.” – Ernest Durbin, Operations Engineer, KISSmetrics

Deliver relevant information

Your systems are made up of many different functional pieces. With this integration, you can tie PagerDuty Services to a specific Component on your status page. For example, if an incident occurs that affects your APIs, the API Component of your status page will be updated to reflect that issue. With fine-grained settings, customers can subscribe to receive notifications via email, SMS and webhook for a specific piece of your system. You can also add performance metrics such as uptime and response time, helping customers to self-diagnose when there may be a problem with your system.
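As a hedged sketch of the component-update half (IDs and key are placeholders, and the endpoint shape reflects StatusPage.io’s v1 REST API as we understand it; check their docs before relying on it):

```python
import requests

PAGE_ID = "your_page_id"             # placeholder
COMPONENT_ID = "your_component_id"   # placeholder
API_KEY = "your_statuspage_api_key"  # placeholder

def set_component_status(status):
    """Update a status page component, e.g. from a PagerDuty webhook handler."""
    resp = requests.patch(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/components/{COMPONENT_ID}.json",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={"component": {"status": status}},
    )
    resp.raise_for_status()

set_component_status("partial_outage")
```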

[Screenshot: PagerDuty services tied to StatusPage.io components]

With PagerDuty and StatusPage.io, engineering teams can talk directly with their customers, without ever having to manually send an email or text. Get started today and also subscribe to PagerDuty’s system status page to get updates on how we’re doing.


Saving the Rainforest One Alert at a Time

Update 6/25/2014: Rainforest Connection recently launched a crowdfunding campaign. Visit their Kickstarter page to learn more or donate to their cause.

Every year, tens of thousands of square miles of forest are removed by illegal logging throughout the Earth’s rainforests. Rainforest Connection (RFCx) is on a mission to stop illegal logging and poaching by transforming recycled cell phones into autonomous, solar-powered listening devices that can monitor and pinpoint chainsaw activity. RFCx is changing the game by creating the world’s first real-time logging detection system, and PagerDuty makes it possible for it all to work at scale.

You might say that RFCx was accidentally founded by Topher White, a software developer, after volunteering in Indonesia and witnessing illegal logging firsthand.

During his time volunteering, he would frequently run into loggers illegally cutting down trees, many times only a short five-minute walk, a few hundred yards, from a camp full of guards whose primary responsibility was to stop illegal logging.

Today, even the best-equipped governments and NGOs that have invested in fighting illegal logging rely upon state-of-the-art drones or satellite surveillance to collect deforestation data. While these technologies have revolutionized environmental protection, and changed our understanding of the scope of the problem, they all have one important shortcoming: by the time environmental damage can be detected, it’s already days, weeks or even years too late. Topher and his team wanted to find a way to provide real-time data to make deforestation preventable on the ground.

According to White, deforestation is one of the biggest sources of the carbon emissions that cause climate change. Every year, deforestation is responsible for more than 17% of all human-caused CO2 emissions—more than all the world’s planes, trains, ships, cars and trucks combined. It’s also a primary driver of the highest rate of species extinction since the age of the dinosaurs. To combat this, RFCx developed a system to alert nearby rangers that illegal logging is happening so they can intervene.

“For many illegal loggers in Indonesia and around the world, there isn’t a real risk of being caught. They’re not hardened criminals, and they’re quick to back-off if interrupted. With real-time alerts, we can make it possible for rangers to arrive on the scene and ask them to stop before things get carried away. With no significant damage done, arrests and criminal charges could even be avoided.” – Topher White, Founder, RFCx

Scaling a Hacked System

Discarded Android phones are retrofitted with extra-sensitive microphones and connect to a cloud-based API that identifies specific sound signatures, such as chainsaws or even a monkey’s scream. Since chainsaws have a loud, distinctive sound signature that isn’t found in the natural soundscape of the forest, the noise is easy to detect as an anomaly, even up to a kilometer away.

Once a chainsaw is detected, rangers are alerted through PagerDuty via phone call, SMS, email and push notification, along with the location of the phone that was triggered. Rangers can then intervene and ask the loggers to leave. White noted that, based on their experiences so far, most loggers are happy to oblige and leave the area when interrupted.

During initial testing in Sumatra last year, within two days of deploying the system they were already discovering illegal logging sites and immediately began saving trees in the area. At first, they used only email to alert rangers, but quickly discovered that email alerts alone were not sufficient; they needed a system that could alert larger numbers of rangers in the forest.

They had to keep in mind that most rangers have a smartphone and needed to be alerted on whatever device they were already using. That’s when RFCx found PagerDuty, which, according to White, helped them skip months of development work in the field. Using PagerDuty, they were able to go a step further and include a link in SMS alerts showing the location of the phone that heard the chainsaw. Rangers can also listen to a recording of the sound to confirm what it is before trekking out into the forest.

“The most amazing part of PagerDuty is that it’s not push notifications, or SMS, or a phone call, it’s any and all of those options at the same time. We don’t need to know what kind of phones our partners prefer to use, so this frees them up and eases our integration into their workflow. Also, we can customize phone call alerts, so before a ranger runs off to check a sound, we can send out an alert with a confirmation.” – Topher White, Founder, RFCx
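A hedged sketch of what such a trigger might look like against PagerDuty’s generic events endpoint; the device fields, map link, and service key are all invented for illustration:

```python
import requests

EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
SERVICE_KEY = "RFCX_SERVICE_KEY"  # placeholder

def alert_rangers(device_id, lat, lon, audio_url):
    """Page nearby rangers with the triggered device's location and sound clip."""
    requests.post(EVENTS_URL, json={
        "service_key": SERVICE_KEY,
        "event_type": "trigger",
        "description": f"Chainsaw detected near device {device_id}",
        "details": {
            "location": f"https://maps.google.com/?q={lat},{lon}",
            "recording": audio_url,
        },
    })
```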

Obtaining Energy Off-the-Grid

It goes without saying that the remote tropical rainforest is off the grid. There are no roads, running water or electricity. Despite this, the populations are very well-connected: cell-phone use and coverage are prevalent, with phones charged by generators once per day.

The RFCx team is able to use this cell coverage to keep their system active 24/7, but the bigger problem to solve is power. To address it, RFCx had to design a special kind of solar panel that can generate power even when partially obstructed from the sun by vegetation in the area. Most solar panels produce almost nothing when any part of the panel is obstructed. Thanks to this design, the panels generate much more power than the phones need, so the devices should never have to be removed for charging.

What’s Next for Rainforest Connection?

RFCx is working to inspire people everywhere to take a stand against illegal logging and poaching by letting them listen live to the sounds of the rainforest, and, when it occurs, to the sound of deforestation.

“It’s no longer engaging for people around the world to read about huge areas being destroyed, but if you can listen in on the forest, and on the destruction as it occurs, people may become re-invested in pushing for change.” – Topher White, Founder, RFCx

White also spoke about a future for RFCx that expands beyond illegal logging, including the possibility of using the same technology to recognize patterns of animal distress based on their sound signatures.

Rainforest Connection was recently featured on BBC World’s Horizons episode on Extreme Recycling. To check it out, click the screenshot below or visit: http://r-f.cx/1pfQlS0

[Screenshot: Rainforest Connection featured on BBC World’s Horizons]

Here at PagerDuty we are proud that our technology has been able to serve RFCx’s noble mission. To learn more, donate or get involved visit Rainforest Connection’s website:
https://rfcx.org


Defining and Distributing Security Protocols for Fault Tolerance

This is the second post in a series about how we manage security at PagerDuty. To start from the beginning, check out How We Ensure PagerDuty is Secure for our Customers.

High Availability and Reliability are extremely important here at PagerDuty. We have spent an enormous amount of time designing and architecting our service to withstand failures across servers, datacenters, regions, providers, external dependencies and more. But given that failure is inevitable, how do we build fault tolerance into our security systems?

Two things to keep in mind: First, dedicated network or security hardware introduces the problem of single points of failure, where a VPN appliance that goes down can take out an entire datacenter. Second, in a security model where only the edge of the network is protected, an attacker that is able to penetrate that single layer will gain access to everything.

To keep these issues from arising, our overall approach has been to centrally manage our security policies, and push out enforcement to all nodes. Each node is responsible for looking up the policy definitions as well as deriving and enforcing a tailored ruleset.

Dynamically Calculated Local Firewalls

The first strategy we implemented in our centralized-management/distributed-enforcement model was our dynamic firewalls.

We have been migrating PagerDuty to a Service Oriented Architecture over the last two years, and with that comes the opportunity to better isolate each service and contain lateral movement. When we define a new cluster of servers in Chef, we also set up the firewall ruleset that defines what group the server belongs to and what groups can talk to it. From this, each server can automatically create entire IPTables chains, open service ports for the appropriate sources, and drop all other traffic. Each time a server is added or removed, we don’t need to update any policies; the nodes detect the change and recalculate the ruleset.
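A simplified Python sketch of that derivation step (group names, addresses, and the policy layout are invented; in production the data comes from the Chef server):

```python
# Which groups may talk to which, and on what ports.
POLICY = {
    "web": {"allowed_from": {"lb": [443]}},
    "db":  {"allowed_from": {"web": [3306]}},
}
# Current group membership, as discovered from the Chef server.
GROUP_MEMBERS = {"lb": ["10.0.1.5"], "web": ["10.0.2.10", "10.0.2.11"]}

def iptables_rules(my_group):
    """Derive this node's INPUT chain from the central policy."""
    rules = []
    for src_group, ports in POLICY[my_group]["allowed_from"].items():
        for ip in GROUP_MEMBERS[src_group]:
            for port in ports:
                rules.append(f"-A INPUT -s {ip} -p tcp --dport {port} -j ACCEPT")
    rules.append("-A INPUT -j DROP")  # everything else is dropped
    return rules

print("\n".join(iptables_rules("db")))
```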

We have seen a bunch of benefits from this approach:

  • We can easily create network partitions when needed. (This is how we make sure our dev, test, and production environments cannot talk to each other.)

  • We can isolate individual servers when we need to practice attacking them.

  • We can easily figure out which servers are talking to each other because all of the inbound rules have to be defined upfront.

  • We are using simple and straightforward Linux IPTables. If there is a firewall problem, every engineer understands how to manipulate the firewall and deploy a fix.

  • There is no single-point-of-failure network device. If a single server goes down or something more catastrophic happens, the rest of the system will continue to operate in a secure fashion.

Distributed Traffic Encryption

For encrypting network traffic, there are two dominant methods: Virtual Private Networks (VPNs) and per-app/service encryption. We found problems with both.

A typical VPN implementation with dedicated gateways at each of our AWS and Linode regions would have had a number of issues:

  • A near-single point of failure. Even if you deploy multiple gateways to each region, anytime a gateway server goes away there is either a failover or a reduction in capacity, which results in connectivity issues.

  • Cost and scalability. Because we are using standard virtual machines and not dedicated networking hardware, we would have to use very large instance sizes to encrypt and decrypt traffic for the servers behind them. We were concerned with conventional VPN gateways’ ability to scale with our traffic spikes.

  • Latency. Because we already have cross-region calls being made, we want as few hops as possible when connecting to non-local servers.

Per-app/service encryption methods – like forcing MySQL to only allow SSL connections or making sure that Cassandra uses internode encryption – do have a place in our security framework. But there are problems with only using this approach:

  • It’s easy to forget. While security is part of everyone’s job at a company, many times people will forget to enable the appropriate security settings.

  • Each app/service has a slightly different way of encrypting data. While most connection libraries support SSL, it can be implemented differently each time. Moreover, this means that anytime we add a new service, we have to rethink how to handle the encryption.

To solve the above issues, we implemented a point-to-point encryption model based on IPSec in transport mode. This enables all traffic between specified nodes on the network to be encrypted, regardless of where the node is located and what other nodes it is talking to. Again, we followed our centralized policy management convention by calculating the relationships on a Chef server and then pushing them out to each node.

There have been several benefits to using point-to-point encryption instead of the traditional VPN model:

  • Decentralized encryption. Instead of relying on critical VPN gateways, each node can handle its own encryption (removing single points of failure).

  • Scalability. Since relationships are only calculated for the nodes that a single node needs to talk to (as opposed to every node), the overhead of the encryption is quite low. In our initial benchmarks, we found that performance suffered when one node had to talk to thousands of nodes, but as long as our services remain isolated and contained, this model should scale for our needs.

  • Efficiency. We are able to take advantage of the dedicated AES hardware that ships with most modern chipsets. Additionally, since the traffic is encrypted as well as compressed, we have seen only a 2-3% impact on our network throughput.

  • Within-datacenter encryption. Sending traffic over dedicated links within or across datacenters is generally secure, but recent events have raised the specter of security breakdowns in these kinds of connections. Point-to-point encryption provides a better alternative.

  • One less dependency on NAT. As more networks support IPv6 and a global address space, the private address space provided by VPNs will have to be re-done. Our point-to-point model easily supports a global address space.

  • Full end-to-end encryption. Switches, routers, fiber taps, neighboring hosts, and the hosting providers themselves are all potential intrusion vectors. By encrypting traffic all the way through, even an intruder who succeeded in capturing our traffic would be unable to read any of it.

Role-Based Access Control

PagerDuty follows a least-privilege permissions model. Basically, this means that engineers only have access to the servers they need to get their job done. We leverage Chef, in concert with straightforward Linux users/groups, to build out our access controls.

Whenever we add a new user to our infrastructure, we have to add in the groups to which this user belongs. Whenever we add a new group of servers, we have to specify which user groups have access to these servers. With this information, we are able to build out the passwd and group files on each host. Because this is all stored in JSON config files and is in version control, it is easy to wrap access requests/approvals around our code review process.
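A minimal sketch of the lookup (the JSON layout and names are invented; the real definitions live in version-controlled Chef data):

```python
import json

# Hypothetical user definitions, kept in version control.
USERS_JSON = """
{
  "alice": {"groups": ["ops", "db-admin"]},
  "bob":   {"groups": ["ops"]}
}
"""

def users_for_host(host_groups):
    """Return the users who belong to any group granted access to this host."""
    users = json.loads(USERS_JSON)
    return sorted(name for name, spec in users.items()
                  if set(spec["groups"]) & set(host_groups))

print(users_for_host(["db-admin"]))  # -> ['alice']
```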

