Update 07AUG2024: CrowdStrike released a technical root cause analysis that confirms that an array out-of-bounds read, very similar to our example, caused the issue.
On 19 July 2024, an estimated 8.5 million Windows computers worldwide crashed and were unable to reboot, stuck in a blue screen of death. The outage impacted businesses and governments around the globe, affecting a wide range of industries, including transportation, financial services, and healthcare.
Not unexpectedly, this immediately raised fears of a large-scale cyberattack. Was this the long-feared global hacker attack aimed at disrupting our computer-based world and causing chaos worldwide? Thankfully, no. Within hours of the outage, CrowdStrike confirmed that a faulty update to its endpoint protection software, specifically its Falcon Sensor, caused the issue.
While the affected source code is not published, this blog post summarizes what CrowdStrike has publicly confirmed and examines code-level problems that could have led to this global outage. Our goal is to shed light on what type of bugs can lead to such serious software reliability issues in general and why catching code issues early in the development process is as important as catching security vulnerabilities.
What Happened: What We Know So Far (25JULY2024)
CrowdStrike Falcon Sensor is a lightweight agent that collects endpoint data and protects a computer from cyberattacks. To monitor system processes, detect malicious activity, and respond to threats in real time, it needs access to low-level system functions. This requires it to run a Windows kernel driver, and such drivers are usually written in C or C++. Because it should not be easy to disable the protection, this driver is marked as a Boot-Start driver, which makes it mandatory for Windows startup.
This means that Falcon becomes a very important, sensitive component of the operating system once installed. To recap:
- The kernel driver is required for Windows to boot.
- The kernel driver has extensive capabilities to interact directly with hardware, manage system resources, and access protected memory.
- The kernel driver influences the operating system's core behavior.
Due to the immense responsibility and trust put in kernel drivers, they usually must pass extensive testing via Microsoft’s Windows Update program. Driver packages that pass the tests of the Windows Hardware Lab Kit are digitally signed by Microsoft and marked as trustworthy. Although Falcon’s driver itself is also signed, complete testing via the Windows Hardware Lab Kit requires time. To respond quickly to novel techniques of cyber threat actors, Falcon needs a more flexible way to change the behavior of its kernel driver. For this purpose, CrowdStrike provides Rapid Response Content that is delivered in the form of a content configuration update. These updates contain Channel Files that the driver dynamically loads. These files influence how the kernel driver works.
The update that caused the outage contained a faulty channel file, which resulted in the kernel driver reading memory out-of-bounds [source]. While a user-land application would simply crash because of an issue like this, a kernel driver sitting at the heart of the operating system causes the whole system to crash, resulting in the infamous blue screen we have seen during the outage.
Exploring Potential Root Causes in the Code
The incident has intrigued experts around the world, who set out to determine the exact root cause of this out-of-bounds memory read. Although some of the early theories have already been proven wrong, and CrowdStrike has not disclosed the faulty source code, let’s have a look at scenarios that may cause an issue like this.
Null Pointer Dereference
A pointer in C and C++ is a variable that stores a memory address, allowing direct manipulation of data and efficient memory management. A pointer to null, also known as a null pointer, is created by initializing a pointer object to 0, NULL, or in the case of C++ nullptr. A null pointer points neither to an object nor to valid memory. As a consequence, dereferencing such a pointer or accessing the memory it points to is undefined behavior, which usually results in a whole system crash for a kernel driver:
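As a minimal sketch of this pattern (the function name read_or is hypothetical and not taken from any real driver code), the buggy dereference is shown in a comment and the function below it checks the pointer before using it:

```cpp
#include <cassert>

// Bug pattern (left as a comment because executing it is undefined behavior):
//   int* config = nullptr;
//   int value = *config;   // dereferences a null pointer

// Hypothetical helper: returns the pointed-to value,
// or `fallback` when the pointer is null.
int read_or(const int* value, int fallback) {
    if (value == nullptr) {
        return fallback;  // nothing valid to read
    }
    return *value;        // safe: the pointer was checked first
}
```

With this helper, read_or(nullptr, -1) returns -1 instead of crashing.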
In addition to using the * operator, accessing a member of a structure (using ->) or an element of an array (using []) also dereferences the pointer and very likely causes a crash if performed on a pointer to null:
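The sketch below illustrates both forms under assumed names (Entry, entry_id_or, and element_or are hypothetical). Each access is guarded, and the array access additionally checks the index against the element count:

```cpp
#include <cassert>
#include <cstddef>

struct Entry {  // hypothetical structure used only for illustration
    int id;
};

// Bug patterns (undefined behavior when the pointer is null):
//   entry->id
//   values[index]

// Member access through a pointer, guarded by a null check.
int entry_id_or(const Entry* entry, int fallback) {
    return entry != nullptr ? entry->id : fallback;
}

// Element access through a pointer, guarded by a null and a bounds check.
int element_or(const int* values, std::size_t count,
               std::size_t index, int fallback) {
    if (values == nullptr || index >= count) {
        return fallback;
    }
    return values[index];
}
```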
You can find out more about Null Pointer Dereferences in our S2259 rule documentation. While the security community initially suspected a Null Pointer Dereference behind the outage [source], this was later proven wrong [source]. Rather, it is suspected that an uninitialized variable could be the root cause.
Uninitialized Variables
Local variables in C and C++ must be declared to allocate memory and can optionally be initialized with a specific value upon declaration. A local variable of any built-in type (such as int, float, and pointers) that is declared without an initial value is not initialized to any particular value, because doing so would incur a slight computational overhead. Consequently, if the variable is read before a value is assigned to it, it holds an arbitrary value left in its memory location by previous program operations, likely resulting in unintended behavior:
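A small sketch of the problem and its fix, using a hypothetical function sum_positive: the commented-out declaration leaves the accumulator with an indeterminate value, while the active line initializes it at declaration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: adds up the positive values in an array.
int sum_positive(const int* values, std::size_t count) {
    assert(values != nullptr || count == 0);

    // int total;    // bug: indeterminate value, so `total += ...` reads garbage
    int total = 0;   // fix: initialize at the point of declaration

    for (std::size_t i = 0; i < count; ++i) {
        if (values[i] > 0) {
            total += values[i];
        }
    }
    return total;
}
```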
Similarly, structures that simply aggregate variables of built-in types, such as arrays or struct/class types without a constructor, will not initialize their members when declared without an initializer:
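To illustrate, here is a sketch with a hypothetical Header aggregate and a local array. The commented-out declarations leave their contents indeterminate, while an empty initializer zeroes every member or element:

```cpp
#include <cassert>

struct Header {  // hypothetical aggregate: built-in members, no constructor
    int version;
    int field_count;
};

Header make_header() {
    // Header h;     // bug pattern: version and field_count are indeterminate
    Header h{};      // value-initialization sets every member to zero
    return h;
}

int sum_zeroed_array() {
    // int buffer[4];      // bug pattern: all four elements are indeterminate
    int buffer[4] = {};    // all four elements are zero
    int sum = 0;
    for (int value : buffer) {
        sum += value;
    }
    return sum;
}
```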
Finally, allocating objects of built-in types or of such aggregate types on the heap also does not initialize their values:
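For heap memory obtained through the C allocation functions, one possible remedy is sketched below. The function name make_zeroed_counters is hypothetical; std::malloc returns memory with indeterminate contents, whereas std::calloc returns zeroed memory:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Returns a zero-filled array of `count` ints, or nullptr if allocation fails.
// The caller releases the memory with std::free.
int* make_zeroed_counters(std::size_t count) {
    // Bug pattern: the contents of malloc'ed memory are indeterminate.
    //   int* counters = static_cast<int*>(std::malloc(count * sizeof(int)));
    //   return counters;  // reading counters[i] before writing it is a bug

    // calloc zero-fills the allocation before returning it.
    return static_cast<int*>(std::calloc(count, sizeof(int)));
}
```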
This also applies when new is used without an initializer in C++:
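The difference comes down to the form of the new expression. In this sketch (Point, make_point, and make_counters are hypothetical names), the comments list the forms that leave values indeterminate and the functions use the forms that zero them:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct Point {  // hypothetical aggregate without a constructor
    int x;
    int y;
};

// new Point       -> default-initialization: x and y are indeterminate
// new Point()     -> value-initialization: x and y are zero
// new int[n]      -> elements are indeterminate
// new int[n]()    -> elements are zero

std::unique_ptr<Point> make_point() {
    return std::unique_ptr<Point>(new Point());
}

std::unique_ptr<int[]> make_counters(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]());
}
```

Returning std::unique_ptr is a design choice of this sketch so that the allocations are released automatically.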
You can find out more about uninitialized variables in our S836 rule documentation.
The lack of variable initialization is one type of issue that can lead to out-of-bounds memory reads, which are mentioned in the preliminary post-incident review released by CrowdStrike [source]. But other issues can lead to out-of-bounds memory reads as well. Let’s have a look at these issues in general.
Out-of-bound Memory Access
Arrays and buffers are contiguous blocks of memory accessed using numerical indices to reference individual elements. Array overruns and buffer overflows occur when memory access accidentally exceeds the boundary of the allocated array or buffer. These overreaching accesses cause some of the most damaging and difficult-to-track defects. Not only do these faulty accesses constitute undefined behavior, but they frequently introduce security vulnerabilities, too.
This type of issue can, for example, occur when referencing elements of an array:
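As a sketch with assumed names (kSlots, kSlotCount, and read_slot are hypothetical and not CrowdStrike's code), the table below has four elements, so index 4 is already one past the end. The accessor rejects such indices instead of reading adjacent memory:

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kSlotCount = 4;
const int kSlots[kSlotCount] = {10, 20, 30, 40};

// Bug pattern: valid indices are 0..3, so this reads past the array.
//   int value = kSlots[4];

// Stores the element in `out` and returns true only for a valid index.
bool read_slot(std::size_t index, int& out) {
    if (index >= kSlotCount) {
        return false;  // out of bounds: leave `out` untouched
    }
    out = kSlots[index];
    return true;
}
```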
In a similar fashion, a pointer can access out-of-bound memory:
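A typical case is a pointer that is advanced in a loop without knowing where the buffer ends. The following sketch (count_until_zero is a hypothetical helper) passes an explicit end pointer and stops there:

```cpp
#include <cassert>
#include <cstddef>

// Bug pattern: if the buffer contains no zero, `p` walks past its end.
//   while (*p != 0) { ++p; }

// Counts the elements before the first zero, never reading at or beyond `end`.
std::size_t count_until_zero(const int* begin, const int* end) {
    assert(begin != nullptr && end != nullptr);

    std::size_t count = 0;
    for (const int* p = begin; p != end && *p != 0; ++p) {
        ++count;
    }
    return count;
}
```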
Furthermore, unsafe calls to functions like memcpy may introduce out-of-bound memory access:
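One way to guard such a call is sketched here under assumed names (bounded_copy is hypothetical). The copy length is clamped to the smaller of the two buffer sizes, so neither the read from the source nor the write to the destination can leave its buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bug pattern: the length is not checked against either buffer.
//   std::memcpy(dst, src, length);

// Copies at most `dst_size` bytes and returns the number of bytes copied.
std::size_t bounded_copy(void* dst, std::size_t dst_size,
                         const void* src, std::size_t src_size) {
    std::size_t n = src_size < dst_size ? src_size : dst_size;
    if (n > 0) {
        assert(dst != nullptr && src != nullptr);
        std::memcpy(dst, src, n);
    }
    return n;
}
```

Silently truncating is only one option. Depending on the caller, reporting an error when the source does not fit may be the better choice.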
You can find out more about out-of-bound memory access in our S3519 rule documentation.
What We Can Take Away from the CrowdStrike Outage
Bugs are an inevitable part of software development and regularly occur in code – all code is susceptible. Here, we illustrated how three different types of bugs can lead to an outage just like this one. While the affected source code is not published, it becomes evident that fixing all of these issues is essential.
This outage reminds us of the impact that even a small code issue can have – the financial damage alone could reach tens of billions of dollars [source].
A lot of attention is paid to software and code security, but reliability and maintainability issues are often neglected. You can talk to our team about finding and fixing these issues early in the development process here.