PagerDuty Blog

Don’t Lose $460 Million From a Software Glitch

High-frequency trading accounts for 50% of U.S. securities trading. With thousands of securities worth millions of dollars traded every millisecond, robust, reliable computer systems are needed to automate the trades. These automated microtransactions can produce huge rewards, but serious repercussions can follow when something goes wrong, and they are exacerbated when the issue goes unfixed.

On August 1st, 2012, Knight Capital lost $460 million after 45 minutes of trading. Knight received 97 emails from its systems informing it there was an issue, but it responded too late. The SEC post-mortem document released on October 16, 2013 goes into more detail about the events leading up to the software bug, but the excerpt below summarizes the poor DevOps process around deploying new code:

15. Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review. (page 6)

Software bugs can lead to huge losses in any industry. If you run an e-commerce website or serve ads on your pages, you lose money every second your site is down. Customer retention also suffers: frustrated customers turn to rival companies. Slow response times have financial and reputational consequences. When problems occur, act quickly.

3 Steps for Quick IT Incident Management

Bugs can be created at any stage of development and are a consequence of human error. It also takes human intervention to resolve them. Based on Knight Capital’s three biggest mistakes, implement the best practices below to mitigate your IT incidents.

1. Set Thresholds.

B. Knight did not have controls reasonably designed to prevent it from entering orders for equity securities that exceeded pre-set capital thresholds for the firm, in the aggregate, as required under Rule 15c3-5(c)(1)(i). In particular, Knight failed to link accounts to firm-wide capital thresholds, and Knight relied on financial risk controls that were not capable of preventing the entry of orders (page 4)

Knight Capital made a huge error by not setting limits based on its financial capacity. Had it set financial thresholds, such as a limit at 75% of capital, it could have prevented some of its $460 million loss. Computer systems are becoming more and more complex, and many components can break when limits are crossed. Website incidents can stem from issues like DNS performance or CPU usage, so set thresholds for those metrics too. Instead of waiting for CPU usage to reach 100%, set your threshold at 70% to avoid a disastrous outage. Thresholds act as a signal of bigger incidents to come and create a sense of urgency to resolve problems. Set proactive thresholds to keep from going over the cliff.
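Below is a minimal sketch of what proactive thresholds can look like in practice. The metric names and limits (cpu_usage_pct, capital_used_pct, dns_latency_ms) are hypothetical illustrations, not anything from Knight’s systems or the SEC order; the point is simply to warn well before the hard limit is reached.

```python
# A minimal sketch of proactive warning thresholds (hypothetical metrics
# and limits). Warn at 70% so responders have time to act before a hard
# 100% limit causes an outage.

WARN_THRESHOLDS = {
    "cpu_usage_pct": 70.0,      # warn well before CPU hits 100%
    "capital_used_pct": 75.0,   # e.g. 75% of a firm-wide capital limit
    "dns_latency_ms": 200.0,    # slow DNS lookups often precede outages
}

def check_thresholds(metrics: dict) -> list:
    """Return an alert message for every metric over its warning threshold."""
    alerts = []
    for name, limit in WARN_THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"WARNING: {name} at {value} (threshold {limit})")
    return alerts

# Example: a polling loop would feed in current readings.
print(check_thresholds({"cpu_usage_pct": 82.5, "dns_latency_ms": 45.0}))
# -> ['WARNING: cpu_usage_pct at 82.5 (threshold 70.0)']
```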

2. Create a People Process.

19. On August 1, Knight also received orders eligible for the RLP but that were designated for pre-market trading. SMARS processed these orders and, beginning at approximately 8:01 a.m. ET, an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open. Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received. However, these messages were sent in real time, were caused by the code deployment failure, and provided Knight with a potential opportunity to identify and fix the coding issue prior to the market open. These notifications were not acted upon before the market opened and were not used to diagnose the problem after the open. (page 6)

27. On August 1, Knight did not have supervisory procedures concerning incident response. More specifically, Knight did not have supervisory procedures to guide its relevant personnel when significant issues developed. On August 1, Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment. Knight’s system continued to send millions of child orders while its personnel attempted to identify the source of the problem. In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server. (page 8)

Knight Capital’s systems sent 97 emails to warn that there was an issue, yet those emails were used neither to prevent nor to diagnose the problem. What good is an alert if there’s no one around to hear it? When thresholds are crossed, the right people need to be contacted immediately to resolve the incident. Since computer systems run 24/7/365, teams are set up to respond to issues whenever they arise. Whether you run a network operations center (NOC) or an on-call rotation, the right people need to be reachable wherever they may be. If you are on call, you own every incident that comes in, and you must act quickly to resolve high-severity problems. Coupling a person with an incident creates a sense of responsibility and accountability.
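The sketch below shows one way a threshold breach can be routed to whoever is on call rather than landing in a shared inbox. The schedule, addresses, and escalation behavior are hypothetical, and this is not PagerDuty’s API; it only illustrates attaching a named owner to every alert.

```python
# A minimal sketch of paging the current on-call responder (hypothetical
# schedule and escalation). A threshold breach goes to a named owner,
# with a backup identified up front in case the page goes unanswered.
from datetime import datetime, timezone

ON_CALL_SCHEDULE = [
    # (weekdays, primary responder, backup responder)
    (range(0, 5), "alice@example.com", "bob@example.com"),   # Mon-Fri
    (range(5, 7), "carol@example.com", "dave@example.com"),  # Sat-Sun
]

def current_on_call(now: datetime) -> tuple:
    for days, primary, backup in ON_CALL_SCHEDULE:
        if now.weekday() in days:
            return primary, backup
    raise RuntimeError("schedule has a gap")  # a gap means no one gets paged

def page(alert: str) -> None:
    primary, backup = current_on_call(datetime.now(timezone.utc))
    print(f"Paging {primary} (escalate to {backup} if unacknowledged): {alert}")

page("WARNING: cpu_usage_pct at 82.5 (threshold 70.0)")
```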

3. Learn From Previous Incidents.

33. Several previous events presented an opportunity for Knight to review the adequacy of its controls in their entirety. For example, in October 2011, Knight used test data to perform a weekend disaster recovery test. After the test concluded, Knight’s LMM desk mistakenly continued to use the test data to generate automated quotes when trading began that Monday morning. Knight experienced a nearly $7.5 million loss as a result of this event. Knight responded to the event by limiting the operation of the system to market hours, changing the control so that this system would stop providing quotes after receiving an execution, and adding an item to a disaster recovery checklist that required a check of the test data. Knight did not broadly consider whether it had sufficient controls to prevent the entry of erroneous orders, regardless of the specific system that sent the orders or the particular reason for that system’s error. Knight also did not have a mechanism to test whether their systems were relying on stale data. (page 10)

Knight’s August 2012 incident was not the first one it experienced. After losing $7.5 million in an October 2011 incident, Knight did not put enough controls in place to prevent future erroneous order entry. It should have treated the $7.5 million loss as a comparatively cheap lesson, next to the $460 million loss the following year, and re-evaluated the adequacy of its controls. After any incident is resolved, analyze what happened and iterate on your incident management process. Consistent, accurate measurements help uncover trends. Those trends then feed back into setting thresholds and refining the people process for incidents, ultimately making incident management more efficient.
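As a rough illustration of that feedback loop, the sketch below aggregates resolved incidents by root cause and resolution time. The incident records and field names are made up; the idea is simply that consistent measurement surfaces the trends worth acting on.

```python
# A minimal sketch of mining resolved incidents for trends (hypothetical
# records). Recurring root causes and long resolution times point at where
# thresholds and the people process need tightening.
from collections import Counter

resolved_incidents = [
    {"service": "trading-gateway", "root_cause": "bad deploy", "minutes_to_resolve": 45},
    {"service": "web-checkout", "root_cause": "stale test data", "minutes_to_resolve": 120},
    {"service": "web-checkout", "root_cause": "bad deploy", "minutes_to_resolve": 30},
]

by_cause = Counter(i["root_cause"] for i in resolved_incidents)
print(by_cause.most_common())
# -> [('bad deploy', 2), ('stale test data', 1)]

avg = sum(i["minutes_to_resolve"] for i in resolved_incidents) / len(resolved_incidents)
print(f"Average time to resolve: {avg:.0f} minutes")
```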

Don’t let problems fester; fix them immediately. Incidents are costly: customers get angry, reputations are blemished, and money is lost. In Knight Capital’s case, hundreds of millions of dollars were lost. The three best practices above form a cyclical, never-ending process, but it’s worth it.