Lights Out Tonight

Trouble in the Hard Drive

Background: In a world where our lives are increasingly tied to the internet, cybersecurity is the unsung hero that keeps our data safe. Recently, the cybersecurity firm CrowdStrike experienced a significant outage, leaving many companies vulnerable and sparking concerns about our dependence on digital fortresses.

Initial Trigger: The outage was caused by a faulty software update. Reports indicate that a content update to CrowdStrike’s Falcon sensor contained a critical bug that crashed Windows machines running the software, leaving them stuck on the blue screen of death.

This failure disrupted services for countless clients, including airlines, with thousands of flights delayed or cancelled.

On the bright side, the outage was not only an inconvenience but also a respite from the daily face time many of us have with our screens. Maybe the blue screen of death gave you a chance to take a walk or talk to a colleague while you waited for a fix.

It reminds us of another outage that didn’t really have a “bright” side…but you could still see the stars.

Deja Vu: On August 14, 2003, the Northeastern United States and parts of Canada experienced one of the largest power outages in history. The blackout left an estimated 50 million people without electricity for up to two days. The cause? A software bug in an alarm system that failed to alert control room operators to redistribute power after a transmission line in Ohio brushed against overgrown trees. This cascading failure emphasized how interconnected and vulnerable our infrastructure is. Much like CrowdStrike's recent ordeal, the 2003 blackout revealed the fragility of the systems we depend on and the ripple effects when they fail.

Numbers: $4.24 million

The average cost of a data breach in 2021, according to IBM's "Cost of a Data Breach Report 2021."

Quote: “To Err is Human; To Really Foul Things Up Requires a Computer”

Bill Vaughan¹

Fact: Like the CrowdStrike outage, the 2003 blackout started with one little thing going wrong in a computer and snowballed into many big things going wrong. The software bug kept operators from redistributing load after overloaded transmission lines drooped into foliage.

A glitch known as a "race condition" (a bug in which the outcome depends on the unpredictable timing of events) occurred in General Electric Energy's Unix-based XA/21 energy management system. The bug caused FirstEnergy's control room alarm system to stop working for over an hour without the operators knowing. As a result, the operators didn't receive any audio or visual alerts about important changes in the system.

Because of the alarm system failure, a backlog of unprocessed events built up, and the primary server crashed within 30 minutes. The system automatically switched to a backup server, but it also failed shortly after. These server failures drastically slowed the screen refresh on the operators' computer consoles, from the usual 1–3 seconds to 59 seconds per screen.

Without the alarm alerts, the operators ignored a critical call from American Electric Power about issues with a major power line in northeast Ohio.
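For the curious, here is a toy sketch of how a race condition like the one mentioned above can lose information. It is not the XA/21 code, which we have not seen. The shared counter and the function names are made up for illustration, and the timing is scripted by hand so the result is the same on every run. Two writers each try to add one to a shared count. When their steps overlap, one update silently disappears.

```python
def increment(shared):
    """One writer's read-modify-write, split into two steps by `yield`."""
    value = shared["count"]        # step 1: read the current value
    yield                          # pause: another writer may run here
    shared["count"] = value + 1    # step 2: write back the stale value + 1


def run_interleaved():
    """Both writers read before either writes, so one update is lost."""
    shared = {"count": 0}          # hypothetical shared state
    a, b = increment(shared), increment(shared)
    next(a)                        # writer A reads 0
    next(b)                        # writer B also reads 0
    for writer in (a, b):
        try:
            next(writer)           # each writes 0 + 1
        except StopIteration:
            pass                   # the writer has finished
    return shared["count"]


def run_serialized():
    """Each writer finishes completely before the next one starts."""
    shared = {"count": 0}
    for writer in (increment(shared), increment(shared)):
        for _ in writer:           # run this writer to completion
            pass
    return shared["count"]


print("interleaved:", run_interleaved())
print("serialized:", run_serialized())
```

Two increments should give 2, but the interleaved run ends at 1. Real systems usually prevent this by making the read and the write a single uninterruptible step, for example with a lock.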

Definition: Black Start

The ability of a power plant to restart operations without relying on the external electric power grid. This can be done through an on-site standby generator or a tie-line to another plant or emergency generator.

List: Just a Small Glitch

  • Flash Crash, 2010: The U.S. stock market experienced a rapid and severe crash, with the Dow Jones Industrial Average plummeting about 1,000 points in a matter of minutes. The crash was triggered by a large automated trading algorithm that initiated a massive sell-off, which cascaded through the market. High-frequency trading systems amplified the effect, creating a feedback loop that caused widespread panic and significant temporary losses.

  • Galaxy IV, 1998: A failure in the Galaxy IV satellite's onboard control processor caused it to malfunction, disrupting service for roughly 80 to 90% of pagers in the United States and affecting news wire transmissions, credit card transactions, and television and radio broadcasts.

  • O2 Network, 2018: O2, one of the UK's largest mobile network providers, experienced a day-long outage affecting millions of customers. The failure was traced to an expired software certificate in the network's systems provided by Ericsson, which led to a cascading failure affecting O2 and other operators globally.

And about those stars we mentioned…

The photo on the left shows a home in Goodwood, Ontario, on a normal night. On the right is the same spot during the 2003 power outage.²