CrowdStrike - the Lessons Learned
Almost as numerous as the sectors brought down by the CrowdStrike incident are the best practices ignored or circumvented by the company and its customers. We look at what happened and the measures needed to stop it happening again.
Who knew one software update could bring so many global systems to a screeching halt? On July 19, 2024, the largest IT outage in history disrupted multiple industries across the globe. Banking, healthcare, and the airline industry all suffered debilitating disruptions. The culprit was a normal software update automatically delivered to customers' installations by the cyber security firm CrowdStrike.
Such updates were a routine occurrence, taking place daily or multiple times a day to ensure CrowdStrike's customers' platforms were protected from the latest viruses and other cyber threats. This update, however, contained a fatal programming error. Loaded by CrowdStrike's Falcon sensor, which runs inside the Windows kernel at the heart of each client machine, the faulty update crashed the operating system. The result was the 'blue screen of death' and millions of unusable Windows PCs.
Understanding the CrowdStrike meltdown
To understand what went wrong, we must start at the beginning. Nearly 300 of the Fortune 500 companies reportedly use CrowdStrike, which has 29,000 customers worldwide, but the faulty Falcon update affected only machines running Windows. That was due, in part, to a decision made back in 2009 by the European Union (EU).
The fatal EU ruling
In an effort to prevent anti-competitive behavior, the EU ruled that “Microsoft must ensure that third-party products can interoperate with Microsoft’s relevant software products using the same interoperability information on an equal footing as other Microsoft products,” as Computer Weekly puts it.
Since Microsoft security products worked by accessing the kernel code at the heart of the Windows operating system, the EU ruling meant that Microsoft had to allow competing products the same access - even though allowing third-party code to run at such a critical level of the operating system is widely considered poor practice.
This was the first step in the sequence of questionable moves that led to the disaster.
(Incidentally, this leaves pundits wondering whether the mass outage will give Microsoft grounds to push back on the ruling.)
Not a cyberattack
When Microsoft Windows began displaying the dreaded “Blue Screen of Death” across industries and geographies, CrowdStrike told its clients that its widely used Falcon Sensor software was the cause. The problem was not a cyberattack but a routine software update gone wrong.
ABC News Australia explains: “The company said the update, designed to target malicious system communication tools in cyber attacks, triggered a ‘logic error’ that resulted in an operating system crash on Windows systems…” Essentially, it was a simple mistake in coding.
CrowdStrike's responsibility
While the EU ruling may have given CrowdStrike the ability to trigger a massive outage, responsibility for the defective code and its global distribution lies with the company itself.
David Brumley, professor of electrical and computer engineering at Carnegie Mellon University, told TIME, “Their code is buggy, and it was sitting there as a ticking time bomb.” Brumley went on to detail the steps a company should take to avoid these kinds of widespread problems: “First, there should have been rigorous software testing to catch bugs; second, there should have been testing on different types of machines; and third, the roll out should have been slow with smaller sets of users to screen for negative ramifications.”
In short, the company bypassed simple best practices for critical software releases: careful pre-release testing and a phased rollout to small segments of the user base.
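Brumley's third step, a phased rollout, need not be elaborate. The sketch below is a hypothetical illustration, not CrowdStrike's actual release pipeline: an update is promoted ring by ring, from internal machines to a small canary slice of customers to the full fleet, and promotion halts the moment crash telemetry in any ring exceeds a threshold. The ring sizes, soak time, and telemetry hooks are all assumptions.

```python
import time

# Hypothetical rollout rings: each entry is (name, share of the fleet).
ROLLOUT_RINGS = [
    ("internal", 0.001),   # the vendor's own test machines
    ("canary",   0.01),    # roughly 1% of customer hosts
    ("early",    0.10),
    ("broad",    1.00),
]

CRASH_RATE_THRESHOLD = 0.001   # abort if more than 0.1% of hosts in a ring crash
SOAK_TIME_SECONDS = 60 * 60    # let each ring run for an hour before promoting


def deploy_to_ring(update_id: str, ring_name: str, share: float) -> None:
    """Placeholder: push the update to the given slice of the fleet."""
    raise NotImplementedError("wire up to the real update channel")


def crash_rate(ring_name: str) -> float:
    """Placeholder: query fleet telemetry for the ring's crash rate."""
    raise NotImplementedError("wire up to real telemetry")


def phased_rollout(update_id: str) -> bool:
    """Promote an update ring by ring, halting on any health regression."""
    for ring_name, share in ROLLOUT_RINGS:
        deploy_to_ring(update_id, ring_name, share)
        time.sleep(SOAK_TIME_SECONDS)          # soak period before promotion
        rate = crash_rate(ring_name)
        if rate > CRASH_RATE_THRESHOLD:
            print(f"Halting {update_id}: crash rate {rate:.4%} in ring '{ring_name}'")
            return False                        # stop before wider exposure
        print(f"Ring '{ring_name}' healthy ({rate:.4%}); promoting")
    return True
```

Even a crude gate like this could have confined a kernel-crashing update to a small fraction of hosts instead of the entire installed base.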
The incident was a dramatic illustration of a dangerous tendency: today’s tech development environment prioritizes speed and efficiency in software distribution at the cost of resilience. This results in the classic engineer’s no-no: a single point of failure, which can cause the entire system to crash because of one small flaw.
The customer's responsibility
CrowdStrike’s failure to test the software before release and to use a phased roll-out was a major contributor to the outage, but users also bear responsibility: accepting daily updates from an external source into critical live kernel code, without robust backup or disaster-recovery protocols, is a risky practice.
Customers whose machines were hit by the faulty software were forced to manually boot each computer in safe mode, find the offending CrowdStrike file, delete it, and then reboot into Windows. With millions of computers affected, it is easy to imagine how time-consuming this was.
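The fix itself was small; the tedium lay in repeating it across millions of machines. For illustration only, the sketch below assumes the widely reported location and naming pattern of the faulty channel file (the CrowdStrike driver folder under System32, files matching C-00000291*.sys); verify both against vendor guidance before relying on them. In practice administrators did this by hand, or with batch scripts, from Safe Mode or the Windows Recovery Environment, where a full Python runtime is normally unavailable.

```python
from pathlib import Path

# Widely reported location and name pattern of the faulty channel file.
# Both are assumptions for this sketch; confirm against vendor guidance.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_file() -> int:
    """Delete channel files matching the faulty pattern; return how many were removed."""
    removed = 0
    if not CROWDSTRIKE_DIR.exists():
        print("CrowdStrike driver directory not found; nothing to do.")
        return removed
    for path in CROWDSTRIKE_DIR.glob(FAULTY_PATTERN):
        print(f"Removing {path}")
        path.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    count = remove_faulty_channel_file()
    print(f"Removed {count} file(s); reboot the machine normally afterwards.")
```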
That’s why companies need to reconsider their recovery strategies. “I’ve talked to several CISOs and CSOs who are considering triggering restore-from-backup protocols instead of manually booting each computer,” Eric O’Neill, a security expert, said in a press statement. “Companies that haven’t invested in rapid backup solutions are stuck in a Catch-22.”
Never again?
The CrowdStrike incident highlights the need for more robust practices on the part of both software producers and software users. In particular, users need better procedures for detecting and responding to issues.
"The outage’s global scale highlights the need for advanced monitoring tools and robust incident response plans," according to a report on the outage on tech portal The New Stack. "Real-time monitoring and alerting systems should be in place to catch issues as they occur. IT teams should develop detailed incident response plans with clear protocols for quick identification, isolation and resolution of issues. These plans should include root-cause analysis and post-incident reviews to improve response strategies continuously."
Coping with a monoculture
The incident has sharply focused attention on the interconnectedness of today’s cloud-computing systems and the challenge this represents for organizations. The CrowdStrike outage is a wake-up call.
As The Guardian puts it, “Just as the pandemic forced us to confront the limitations of the global supply chains that had been created to improve efficiency rather than resilience, this CrowdStrike mistake should trigger a reappraisal of our networked world.”
Consolidation — and even simple platform dominance — in the tech industry has become a risk factor for companies worldwide. Not only is CrowdStrike one of the largest providers of cybersecurity services, but as The Guardian says, “Microsoft has a stranglehold on the business computing marketplace. Every large organization runs Windows, and most small businesses do, too.”
This type of monoculture makes an attractive target for attackers, but it is also problematic in relatively benign situations like the CrowdStrike Falcon outage. The onus is on software producers and users alike to ensure that they have robust procedures in place to manage the resulting vulnerabilities.