UPDATED: Last week, a “global IT outage” saw millions of Windows computers abruptly crash into the infamous “blue screen of death”, causing chaos for many organisations and their customers.
Yet contrary to hysterical reports in the media, and particularly from YouTube pundits, the outage did not take out “every” Windows PC. According to a Microsoft estimate, 8.5 million computers, primarily Windows 10 machines, were affected.
Around 1.4 billion PCs run a Windows operating system, of which Windows 10 commands roughly 70%, or just short of a billion. That means less than 1% of Windows 10 machines were affected. Yet these were primarily enterprise machines, leading to crashes at airports, hotels, banks, healthcare facilities, government agencies, and many more – creating a highly visible and publicised outage.
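The arithmetic behind that figure can be checked in a few lines. This is a rough sketch using only the numbers quoted above, and it assumes, as the estimate does, that the affected machines can be counted against the Windows 10 total even though a minority ran other Windows versions.

```python
# Figures quoted in the article (all approximate).
windows_pcs = 1.4e9        # PCs running any version of Windows
windows_10_share = 0.70    # share of those PCs running Windows 10
affected = 8.5e6           # Microsoft's estimate of affected machines

windows_10_pcs = windows_pcs * windows_10_share

# Assumption: treat every affected machine as a Windows 10 machine.
share_of_windows_10 = affected / windows_10_pcs
share_of_all_windows = affected / windows_pcs

print(f"Windows 10 PCs: about {windows_10_pcs / 1e6:.0f} million")
print(f"Affected share of Windows 10 PCs: {share_of_windows_10:.2%}")
print(f"Affected share of all Windows PCs: {share_of_all_windows:.2%}")
```

On these numbers the affected share works out to roughly 0.87% of Windows 10 machines, or about 0.61% of all Windows PCs.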
The culprit was CrowdStrike, a popular cyber security vendor that makes endpoint detection software. CrowdStrike released an update that caused one of the drivers of its Falcon software to fail. Drivers run in a computer's kernel area, a very sensitive and high-control part of a computer's software systems. Most software does not run in the kernel because a failure there will crash the machine to prevent additional damage, such as data corruption. This is the infamous blue screen of death.
The event has sparked debate about whether Windows systems are safe and reliable, and whether the market relies too heavily on them. Let’s look at these two matters separately.
Is Windows safe?
The first question, whether Windows systems are safe and reliable, is moot. The crash was caused by CrowdStrike and had very little to do with Microsoft. While some have used the event as proof that Windows is an inferior operating system, media outlets like The Register report that a CrowdStrike patch released for some Linux distributions a month prior also caused kernel crashes (what Linux calls a “kernel panic”).*
Additionally, CrowdStrike nixed the Windows update once it became aware of the issue. Had it not, the same faulty update could have reached Linux and macOS systems, since CrowdStrike also runs software at the Linux and macOS kernel level. Windows is not an outlier – however, there is some truth to the claim that Windows could use more kernel protection.
Ultimately, this outage was CrowdStrike’s fault. It failed to test and vet the update properly. Normally, drivers undergo heavy vetting. However, that takes a lot of time, so infosec vendors can use non-kernel techniques to update systems and ensure protection against new cybercrime attacks. These techniques, though, can influence how a driver operates. CrowdStrike updated a content file that unintentionally stopped a driver from working, causing a system crash.
Why did CrowdStrike not properly test and vet the software? Why did it deploy the update to so many computers rather than ease it into circulation? These questions will undoubtedly be asked by many, including Microsoft’s executives.
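One common way to ease an update into circulation is a staged, ring-based rollout: a small group of machines receives the update first, and it only spreads further if that group stays healthy. The sketch below is purely illustrative and does not describe CrowdStrike's actual process. The ring names, their sizes, the crash-rate threshold, and every function name are hypothetical.

```python
import hashlib
from typing import Optional

# Hypothetical rings: (name, cumulative share of the fleet covered).
RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 0.50), ("everyone", 1.00)]


def bucket(machine_id: str) -> float:
    """Map a machine ID to a stable value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def ring_for(machine_id: str) -> str:
    """Return the first ring whose cumulative share covers this machine."""
    value = bucket(machine_id)
    for name, share in RINGS:
        if value < share:
            return name
    return RINGS[-1][0]  # defensive fallback; the last ring covers everything


def may_install(machine_id: str, current_ring: str) -> bool:
    """A machine installs only once the rollout has reached its ring."""
    order = [name for name, _ in RINGS]
    return order.index(ring_for(machine_id)) <= order.index(current_ring)


def next_ring(current_ring: str, crash_rate: float,
              max_crash_rate: float = 0.001) -> Optional[str]:
    """Advance one ring if the observed crash rate is acceptable.

    Returns None to signal that the rollout should halt.
    """
    if crash_rate > max_crash_rate:
        return None
    order = [name for name, _ in RINGS]
    index = order.index(current_ring)
    return order[min(index + 1, len(order) - 1)]
```

With this arrangement, a call such as `may_install("laptop-0042", "canary")` is true for only about 1% of machine IDs, so a faulty update that crashes the canary ring would halt the rollout before reaching the rest of the fleet.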
Is Windows too popular?
The second issue is whether the market leans too heavily on Windows. But that's an irrelevant question, seeing that this crash could have affected other operating systems and that the update could have been deployed more carefully. Updates often cause kernel-level crashes, but usually on a small number of computers. The issue is not with Windows, but with the scale of the event, which stems from what appears to be a poorly managed rollout.
It's not Windows' fault that it's popular. The operating system is more user-friendly than the technical sandbagging common in Linux and more accessible than the elitist pricing of macOS systems. Consumers give Windows an overwhelming market presence, making it the most likely to become the visible face of such a mistake.
Are organisations exposed because they rely too much on Windows or Microsoft? It is a question that has undoubtedly been asked more often in these past few days than it was before Friday last week. Of course, the answer will depend on the organisation. However, as a general point, they are only more at risk because Windows is more popular.
Software supply chain problems can affect any operating system. For example, a few months ago, suspected nation-state attackers almost slipped a secret backdoor into major Linux distributions through the XZ Utils compression library. Fortunately, they were stopped after, ironically, a Microsoft engineer uncovered the operation.
The outage's risks
The biggest risks from the outage are not the short-term business damage for clients or reliance on Windows. Attempts to clean up this mess might cause more problems than the outage itself. Criminals started to exploit the outage almost immediately, publishing fake websites and newsletters offering dubious fixes that infect machines or compromise administrative accounts instead.
Fixing the issue involves booting a machine into safe mode and using administrative privileges to delete the offending patch. This requires physical access to the machine, and there are situations where technology staff must work remotely with non-tech staff located near those computers – including giving them access to admin accounts or permissions. If not managed carefully, fixing the outage could lead to weaker security.
The overall takeaway is an inconvenient truth about the digital era: systems have to be updated to address security flaws, and in many cases, security software needs some level of kernel access to do a proper job (especially to ensure that cybercriminals don't gain kernel-level access).
Such outages have happened before and will happen again. What's important is that the vendors deploying those patches do a proper job and that companies have patch risk management strategies.
* UPDATE (28/11/2024): The Register has subsequently reported that CrowdStrike's software on Linux crashed due to a Linux kernel bug involving BPF, and that CrowdStrike's Falcon code, running at the kernel level, was not affected.