5 minute read 24 Jul 2024
How businesses can prepare for widespread IT outages

How businesses can insulate their IT systems against global outages

By Puneet Kukreja

EY UK & Ireland Cyber Security Leader

As the EY UK & Ireland Cyber Leader, Puneet is passionate about building client-centric, growth-focused, high-performing delivery organisations which are engineering-led and powered by managed services.

The major technology outage involving CrowdStrike highlights the world’s critical reliance on IT services and underscores the pressing need for comprehensive business continuity planning.

In brief
  • The world’s leading IT services providers are now as critically important to economies and society as energy, water, and other infrastructure providers.
  • Concentration of IT systems with a few providers poses significant risk with potential catastrophic consequences for safety and the economy, and emphasises the need for rigorous quality control measures.
  • Organisations that recovered fastest were those that had prepared in advance through IT resilience strategies and business continuity planning.

As organisations around the world continue to recover from what some have described as the biggest IT outage in history, the CrowdStrike software glitch serves as a wake-up call to keep businesses secure against unforeseen IT failures that could potentially bring services across the globe to a grinding halt.

It is estimated that 8.5 million Windows devices¹ across 674,620 direct customers in 1,200 unique industries were affected² by a flaw in a routine update issued for a piece of cybersecurity software.

It was not a cyberattack or breach. However, the outage has triggered warnings from cybersecurity experts about a surge in hacking attempts exploiting the IT disruption.

By comparison, the WannaCry virus in 2017 infected around 230,000 computers across 150 countries before a kill switch was identified, a figure that pales beside the scale of the disruption on 19 July 2024.

The widespread impact of the global IT outage was quite alarming for those directly affected. People were not able to withdraw money from bank accounts, supermarkets were forced to close, airline fleets were grounded, and congestion built up at major ports across the world.

Global IT outage exposes critical fault lines

The outage brings organisations like major software vendors and IT infrastructure providers into the realm of critical infrastructure, underscoring their importance to our daily lives as well as their broad socio-economic importance. It also brings into focus the question of trust. Just as people turn on the tap in their homes to get clean water that they don’t need to test before consuming, they turn on their computers with the same level of trust, not expecting a “blue screen of death” as a result of a routine update from a trusted provider.

There is a significant element of concentration risk at play. A vast majority of the world's IT systems run on a handful of providers. Should any one of them experience an outage, the results could be catastrophic, extending far beyond mere inconvenience. Such an event could compromise public health and safety, and even put lives at risk. In this light, the recent global IT outage might seem relatively minor.

The outage has brought the issue of quality control in software updates into the spotlight, drawing attention to the urgent need for more rigorous scrutiny during the testing phase before deployment. It raises the question of whether fundamental changes are necessary in the operations of essential technology service providers. For instance, should new quality assurance protocols be implemented to govern the rollout of updates and new software releases?

How can risks be minimised?

One way to reduce concentration risk is to diversify. But the interconnectedness of the technology provider ecosystem means that this may not be very practical.

The question of trust will arise for many of the organisations affected by the recent outage. At least some of them may be considering switching providers. This is not necessarily a wise course of action, though. It would risk further disruption with no guarantee that the new solution will be as effective. The fact remains that the likely cause of the outage was human error, and this does happen from time to time, even in the very best organisations.

This puts the focus back on the affected organisations.

Every organisation must take responsibility for its ability to function and provide services to its customers, even in the most trying of circumstances.

It matters little to your customers whether an IT outage was caused by a cyberattack or by a flawed software update; all they care about is that they are not disrupted.

One of the more heartening aspects of the global IT outage was the resilience displayed by many organisations. There were cases of airlines switching rapidly to manual check-in systems and reporting no flight delays despite the outage. Stories like that abounded, but the experience was by no means uniform, and there were also many instances of businesses and major facilities having to shut their doors.

This increases the importance of IT resilience and robust Business Continuity Plans (BCPs). IT resilience has now become a fundamental aspect of business operations, enabling organisations to quickly recover and maintain continuity in the face of unforeseen disruptions such as that caused by the global outage.

By embedding IT resilience into their core strategies, businesses can ensure that they remain operational and competitive, and continue to serve their customers even amidst the growing complexities and vulnerabilities of the digital landscape.

‘Know’ how to build better resilience

The introduction of regulatory frameworks such as the NIS2 Directive and Digital Operational Resilience Act (DORA) makes IT resilience and BCPs even more important. Article 18 in the NIS2 Directive mandates that essential and important entities implement risk management measures, including advanced threat detection and continuous monitoring. Article 20 requires regular testing and updating of these measures to ensure effectiveness.

DORA, on the other hand, emphasises operational resilience in the financial sector, with Article 11 focusing on the need for thorough digital operational resilience testing, and Article 15 mandating comprehensive incident response and recovery plans. Organisations must foster a culture of resilience through regular employee training and the maintenance of redundancies across critical systems, ensuring quick recovery from disruptions.

By adhering to NIS2 and DORA, businesses can enhance their resilience, ensuring they remain operational and competitive amidst evolving digital threats and not just those related to cybersecurity.

In this respect, businesses should:

Armed with these five “knows” organisations will be able to recover quickly and continue to operate even during times of extreme disruption.

Summary

The widespread disruption caused by the global IT outage highlights the vulnerability of organisations around the world to failures at a small number of major IT services and infrastructure providers. This may require new approaches to the quality assurance of new software releases and updates. It also highlights the critical importance of IT resilience and business continuity planning in enabling organisations to deal with unforeseen events and IT outages.
