SA CIOs minimally impacted by global CrowdStrike outage

Johannesburg, 22 Jul 2024
Flaws in a CrowdStrike update caused blue screens of death on Microsoft Windows machines.

While the CrowdStrike outage on Friday caused a ruckus across the globe, South African-based chief information officers (CIOs) were not hugely impacted.

ITWeb conducted a survey of local CIOs to gauge how the outage impacted their businesses. The “blue screen of death” issue impacted banks, airlines, TV broadcasters, supermarkets and many more businesses worldwide.

In South Africa, Capitec was one of the organisations that said on Friday morning it experienced significant disruptions across all its banking channels due to a global downtime incident involving CrowdStrike, a key technology service provider. The bank restored its services later in the day.

CrowdStrike is an American cyber security technology company based in Austin, Texas. It provides endpoint security, threat intelligence and cyber attack response services.

It released a software update to its Falcon Sensor endpoint security software; however, flaws in the update caused “blue screens of death” on Microsoft Windows machines, disrupting millions of Windows computers worldwide.

Over the weekend, Microsoft revealed that an estimated 8.5 million Windows machines were impacted globally.

Affected machines were forced into a boot loop, making them unusable. This was caused by an update to a configuration file, Channel File 291, which CrowdStrike says triggered a logic error and caused the operating system to crash.

Although CrowdStrike fixed the update, computers stuck in a boot loop were still unable to connect to the internet to download the patch before Falcon loaded and crashed the device again.

CrowdStrike founder and CEO George Kurtz says: “I want to sincerely apologise directly to all of you for the outage. All of CrowdStrike understands the gravity and impact of the situation. We quickly identified the issue and deployed a fix, allowing us to focus diligently on restoring customer systems as our highest priority.

“The outage was caused by a defect found in a Falcon content update for Windows hosts. Mac and Linux hosts are not impacted. This was not a cyber attack. We are working closely with impacted customers and partners to ensure all systems are restored, so you can deliver the services your customers rely on.”

According to Kurtz, CrowdStrike is operating normally, and this issue does not affect Falcon platform systems.

“There is no impact to any protection if the Falcon sensor is installed. Falcon Complete and Falcon OverWatch services are not disrupted.”

Survey findings


ITWeb conducted a poll of South African CIOs this morning, between 8am and 12pm. Of the 44 respondents, the majority (57%) said their organisations weren’t affected by the CrowdStrike outage.

Of the 43% of CIO respondents who were affected, 52% noted it was their Microsoft systems that bore the brunt of the outage (34% said it was both PCs and systems, and 13% said it was only PCs).

Asked to estimate what percentage of their systems were impacted, 48% said it was less than 10%, with 6% saying between 91% and 100% of their systems were impacted.

Further, 96% said affected PCs were already back online, with the remaining 4% expecting any affected PCs to be back online in the next couple of days.

One of the survey respondents, Aadiel Ayob, executive for technology solutions at Sizwe IT Group, comments: “This update, intended to enhance protection against zero-day threats, inadvertently caused a DDoS [distributed denial-of-service] attack on systems that were available to check in at the time of the update.”

Ayob notes that given the widespread deployment of its Falcon sensor, CrowdStrike has a responsibility to rigorously follow its quality assurance processes every time, without exception.

“This incident is a crucial learning opportunity; drastic and appropriate actions are necessary to prevent such failures in the future. The Windows OS should be capable of determining in advance whether an update could be harmful to the normal functioning of a system and halting the update process if necessary – a self-defending system, which is essential to maintaining robust cyber security in an increasingly complex threat landscape.

“What this incident reminds us of is that technology can and did fail us, and it failed us globally. While we hope for the best, we must also be prepared for the worst that can happen.”

Phila Ndarana, chief technology officer at the Auditor-General of South Africa, says the impact was minimal.

“The latest global IT outage has brought to the fore, yet again, the ‘existential crisis’ that cyber security matters pose to and for all of us. The outage was widespread, with potential devastating impacts on clients and businesses. It is incumbent upon all of us as IT practitioners, players and professionals to remain vigilant and have a no-compromise attitude towards cyber security threats.”

Says Orapeleng Moeti, IT operations manager at Bridgestone: “Yes, we did have minimal impact – but indirectly via a third-party provider. Our biggest partner, which hosts our ERP system, was affected. On the Microsoft side we didn’t have any interruptions.

“We do not allow automatic updates on critical servers, so the impact would have been minimal if we did use CrowdStrike – which we do not.”

Much wider implications

According to market analyst firm Forrester, the outage brings severe economic consequences, as well as having a widespread impact on the health and well-being of those affected.

It notes that emergency response services in some cities were disrupted, and hospitals across the globe had to cancel scheduled surgeries.

Forrester urges IT leaders to empower authorised system administrators to fix the problems quickly and effectively. This includes backing up hard disk encryption keys (for BitLocker or a third-party encryption product), as these may be critical for recovery in such instances, as well as using privileged identity management solutions for break-glass emergency situations.

“Crisis events require an ‘all-hands-on-deck’ response, but be sure to reserve a few analysts to continue monitoring other systems. Threat actors may use this time to attack while you’re distracted,” says the firm.

Jason Jordaan, principal forensic analyst and MD of DFIR Labs, says: “One of the lessons I hope will be taken away from this is to not simply blindly trust, and to actually relook at the contractual implications of the contracts we have in place with security vendors.

“In so many instances, we trust vendors to provide services and technology to help keep organisations safe, but at a contractual level, the contracts we enter into favour them and provide them liability escape clauses.

“This was a vendor failure, no different to when any other engineered system fails and we hold the engineering organisations involved liable for the failure, and there should be similar levels of accountability.”

Jordaan says another lesson is going to be one of whether “we as an industry allow updates to be automatically pushed into our systems by vendors, without the ability to test these ourselves. We often in digital forensics talk about trust but verify: ‘we don’t trust anything until we can objectively prove it’, and the same needs to be said about cyber security.”

Thankfully, he adds, incidents like this are rare, but the issue of “buggy” code is not, and organisations need to better prepare for this by not allowing anything into production that has not been properly tested.

Learning from the chaos

Arthur Goldstuck, MD of World Wide Worx, comments that the CrowdStrike outage is a case study in how to improve cyber security resilience and response strategy. In particular, he says, it highlights the need for a robust incident response plan, including communication protocols, roles and responsibilities, and recovery procedures.

“Underpinning the strategy must be a solid redundancy and backup system that includes backup servers, alternative communication channels and secondary data storage solutions to maintain operations during a primary system failure.

“Large organisations should also conduct regular testing and drills of incident response and disaster recovery plans, so that gaps or weaknesses in the plans are identified,” says Goldstuck.

“CrowdStrike – along with organisations that roll out software updates to large client bases – has even more to learn from the outage, including the need to test updates in various ‘sandbox’ environments before deployment. They should also implement staged or phased rollouts to gradually introduce updates across the user base, and monitor for any issues. It’s astonishing that neither of these practices are followed, but we can expect things to change.

“And then CrowdStrike and its customers – and by extension all similar organisations – should have a reliable and quick rollback mechanism in place, so that they can revert to previous versions if updates become trainwrecks,” he concludes.