CrowdStrike’s massive outage forces a rethink on IT automation

The CrowdStrike outage on July 19 is a stark reminder of the risks inherent in the current trajectory of IT automation. As businesses continue to push for greater efficiency through automation, the incident underscores the pressing need to reevaluate the balance between speed and safety in software deployment.

How CrowdStrike’s glitch grounded airlines and cost billions

On July 19, a routine software update to CrowdStrike’s platform triggered a global IT crisis. Millions of Windows computers crashed simultaneously, causing widespread disruptions that rippled across multiple industries.

The aviation sector sank into turmoil, with airlines forced to ground flights, leading to cascading delays and cancellations. Banking systems also faltered as critical apps crashed, leaving customers unable to access their accounts or perform transactions.

The financial impact of this outage was staggering. Fortune 500 companies alone are estimated to have incurred losses exceeding $5.4 billion, encompassing the immediate costs of downtime and the longer-term repercussions, including damage to brand reputation, customer trust, and operational continuity.

For many of these companies, the outage was more than a technical glitch; it was a costly lesson in the dangers of relying too heavily on automated systems without sufficient safeguards.

The dangerous dependence on vendor updates in a world obsessed with automation

The CrowdStrike incident underscored a growing concern within the tech industry: heavy reliance on vendor-driven, automated updates. As businesses increasingly adopt IT automation, the convenience of centralized updates provided by vendors like Microsoft has become both a boon and a potential liability.

Centralized updates streamline processes, reduce manual intervention, and make sure that systems remain current with the latest features and security patches.

This reliance, however, comes with a hidden cost. When an update goes wrong, as it did with CrowdStrike, the effects can be immediate and widespread, affecting not just a single organization but potentially millions of users across the globe.

The assumption that updates from trusted vendors are inherently safe can lead to complacency, leaving organizations vulnerable to catastrophic failures.

Why blind faith in automation could be your biggest IT risk

Phil Fersht, CEO and chief analyst at HFS Research, draws attention to the dangers of placing blind trust in automated updates. Fersht highlights how even minor code issues can spiral into massive disruptions when propagated through automated systems. This trust, often placed in large, well-established tech vendors like Microsoft, can create a false sense of security.

Organizations may believe that because they are dealing with reputable companies, the updates being pushed are infallible. That belief breeds a dangerous level of complacency, where critical quality assurance measures are bypassed or minimized on the assumption that the vendor has already covered all the bases.

The risks of automation in IT: What you need to know

The push towards automation in IT is not without its risks, as was brought to light by the CrowdStrike incident. While automation offers several benefits, such as efficiency, consistency, and scalability, it also introduces new challenges that must be carefully and expertly managed.

Automation’s benefits and pitfalls: Lessons from the CrowdStrike crisis

The journey towards IT automation began with the introduction of package-manager utilities in Unix and later in Linux, which made it easier to manage software updates across large numbers of systems. This approach gained traction as organizations recognized the efficiency it brought to their IT operations.

Microsoft’s transition to cloud-based solutions, particularly with Microsoft 365, further accelerated this trend, offering businesses the promise of seamless, automated updates delivered directly from the cloud.

This convenience, however, is a double-edged sword. While automation reduces the need for manual intervention and makes sure that systems are always up-to-date, it also means that any flaws in an update can be rapidly deployed across vast numbers of systems.

The very feature that makes automation attractive—its ability to replicate changes quickly and consistently—can become its biggest drawback when something goes wrong. This is what makes the CrowdStrike incident so concerning; a single faulty update had the power to bring down systems on a global scale within a matter of minutes.

The consequences of rapid propagation

John Annand, research director at Info-Tech Research, emphasized that the speed at which automation can propagate changes is both a strength and a weakness. When updates are flawless, this speed makes sure that organizations remain secure and up-to-date without delay.

But when an update contains errors, this same speed can amplify the problem, spreading it far and wide before anyone has a chance to react.

The financial implications of such rapid propagation are enormous. According to data from Splunk, IT downtime costs U.S. businesses over $400 billion annually—stemming from immediate operational disruptions and from the longer-term impacts on customer satisfaction, regulatory compliance, and competitive positioning.

In the case of the CrowdStrike outage, the speed at which the faulty update spread meant that businesses had little time to respond, exacerbating the financial and operational damage.

Protecting your business from future IT outages

The CrowdStrike outage has triggered a widespread reassessment of how IT systems are managed and maintained. Businesses now recognize that while automation and centralized updates bring efficiency, they also introduce risks that require careful management.

The conversation has shifted towards protecting IT infrastructures against future disruptions, focusing on improving and optimizing internal processes, and adopting more nuanced strategies for software deployment.

The need for better quality assurance

Quality assurance (QA) has always been a core facet of IT management, but the CrowdStrike incident has reinforced its growing importance in an era of rapid automation.

Analysts stress that robust QA processes are no longer optional; they are essential to maintaining operational integrity. Traditional QA models, which typically involved basic testing and validation, are no longer sufficient. In today’s fast-paced IT environment, where updates are rolled out continuously and at scale, there must be more comprehensive checks and balances to catch potential issues before they escalate into crises.

Internal QA measures must evolve to include automated testing suites, rigorous regression testing, and continuous integration and continuous deployment (CI/CD) pipelines that can identify issues early in the development process.

The goal is to create a multi-layered safety net that catches errors at every stage, from development to deployment—minimizing the risk of faulty updates reaching production environments and affecting end users.

Organizations should consider adopting a more conservative approach to vendor updates, delaying implementation until thorough internal testing has been completed. Such delays may seem counterintuitive in a world that values speed, but they are a necessary precaution against costly outages.
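
To make this concrete, here is a minimal, hypothetical sketch in Python of an internal QA gate that holds a staged vendor update until the organization's own checks pass. The VendorUpdate type, the individual checks, and the version strings are illustrative assumptions rather than any vendor's actual tooling; real checks would boot reference OS images and run the full regression suite.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VendorUpdate:
    vendor: str
    version: str
    artifact_path: str

def internal_qa_gate(update: VendorUpdate, checks: List[Callable[[VendorUpdate], bool]]) -> bool:
    """Run every internal check against the staged update; block rollout on any failure."""
    for check in checks:
        if not check(update):
            print(f"Blocking {update.vendor} {update.version}: {check.__name__} failed")
            return False
    print(f"{update.vendor} {update.version} cleared internal QA and is eligible for rollout")
    return True

# Illustrative checks; in practice these would exercise representative system images.
def boots_on_reference_images(update: VendorUpdate) -> bool:
    return True  # placeholder for spinning up VMs that mirror production builds

def passes_regression_suite(update: VendorUpdate) -> bool:
    return True  # placeholder for the organization's automated regression tests

if __name__ == "__main__":
    staged = VendorUpdate("ExampleVendor", "2024.07.19", "/staging/update.pkg")
    internal_qa_gate(staged, [boots_on_reference_images, passes_regression_suite])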

Canary deployment: A strategy to prevent IT disasters

Canary deployment has become a favored strategy for limiting the risks associated with software updates. This technique typically involves rolling out updates to a small, controlled group of users before releasing them to the broader user base.

Companies can then monitor the update’s performance in a real-world environment, identify any issues, and make adjustments before the update reaches a larger audience.

The key advantage of canary deployment is that it lets businesses catch potential problems early, reducing the risk of widespread disruption. If an issue arises during the canary phase, it can be addressed without impacting the entire user base, limiting the scope of any potential damage.

Canary deployment also provides valuable data on how an update interacts with different system configurations and environments, offering insights that might not be apparent during initial testing. This ultimately helps organizations make more informed decisions about when and how to proceed with full-scale deployment.
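
As a rough illustration of the pattern described above, the following Python sketch deploys a hypothetical update to a small canary cohort, compares an observed error rate against a threshold, and only then promotes it to the rest of the fleet. The host names, the 5 percent canary fraction, and the error-rate threshold are assumptions chosen for the example, not a real rollout system.

import random
from typing import List

def deploy(hosts: List[str], version: str) -> None:
    print(f"Deploying {version} to {len(hosts)} host(s)")

def error_rate(hosts: List[str]) -> float:
    # Placeholder for real telemetry (crash reports, health checks, etc.).
    return random.uniform(0.0, 0.05)

def canary_rollout(all_hosts: List[str], version: str,
                   canary_fraction: float = 0.05, max_error_rate: float = 0.01) -> bool:
    canary_count = max(1, int(len(all_hosts) * canary_fraction))
    canary_hosts, remaining = all_hosts[:canary_count], all_hosts[canary_count:]
    deploy(canary_hosts, version)
    observed = error_rate(canary_hosts)
    if observed > max_error_rate:
        print(f"Canary failed ({observed:.2%} errors); rolling back, fleet untouched")
        return False
    print(f"Canary healthy ({observed:.2%} errors); promoting to remaining {len(remaining)} hosts")
    deploy(remaining, version)
    return True

if __name__ == "__main__":
    fleet = [f"host-{i:04d}" for i in range(1000)]
    canary_rollout(fleet, version="2024.07.19-hotfix")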

CrowdStrike’s response post-outage

In the aftermath of the outage, CrowdStrike has taken steps to restore customer confidence and prevent similar incidents from occurring in the future. The company has announced a series of measures aimed at strengthening its QA processes and deployment strategies.

One key initiative is adding more rigorous validation testing to its update procedures, including stress-testing updates under a variety of conditions to make sure they perform as expected in real-world environments.

CrowdStrike is also implementing a staggered deployment strategy that will roll updates out in phases rather than pushing them to all users simultaneously. The approach mirrors the canary deployment model and is designed to limit the impact of any unforeseen issues.

CrowdStrike’s response also reinforces the importance of transparency and communication during a crisis. The company openly acknowledged the problem and detailed the steps being taken to address it, aiming to rebuild trust with its customers and offering a model for how other organizations can face similar challenges.

How the CrowdStrike outage is shaping new IT management practices

The fallout from the CrowdStrike outage has prompted a broader rethinking of IT management practices. Businesses are now more aware of the potential risks associated with rapid automation and are adjusting their strategies accordingly, reflecting a growing understanding that while automation brings efficiency, it must be balanced with caution and thorough oversight.

Why testing upgrades thoroughly is more crucial than ever

The CrowdStrike incident has made organizations far more stringent about testing software upgrades before they are rolled out. In the past, companies might have relied heavily on vendor assurances or conducted only minimal testing before deploying updates.

Today, there is a greater emphasis on thorough, end-to-end testing that includes real-world scenarios and edge cases.

Organizations are adopting more sophisticated testing methodologies, including automated test environments that can simulate a wide range of operating conditions—helping make sure that updates are compatible with existing systems and do not introduce new vulnerabilities. This shift towards more rigorous testing is driven by the understanding that the costs of an outage far outweigh the time and resources required for thorough testing.

Businesses are also investing in advanced testing tools that integrate with their CI/CD pipelines, enabling continuous testing throughout the development cycle, making sure that potential issues are identified and resolved long before updates reach production.
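
As one possible shape for such pipeline-integrated checks, the hedged Python sketch below uses pytest to run a compatibility test across a small matrix of system configurations. The configuration entries and the update_is_compatible helper are hypothetical placeholders for an organization's real test harness.

import pytest

CONFIG_MATRIX = [
    {"os": "windows-10", "agent": "7.15", "locale": "en-US"},
    {"os": "windows-11", "agent": "7.16", "locale": "de-DE"},
    {"os": "server-2019", "agent": "7.16", "locale": "en-GB"},
]

def update_is_compatible(config: dict, update_version: str) -> bool:
    """Placeholder: in practice this would boot a test image with the given
    configuration, apply the update, and verify that services come back healthy."""
    return True

@pytest.mark.parametrize("config", CONFIG_MATRIX, ids=lambda c: f"{c['os']}-agent-{c['agent']}")
def test_update_compatibility(config):
    assert update_is_compatible(config, update_version="2024.07.19")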

Digital twins, synthetic data and the future of safe IT upgrades

In response to the increasing complexity of IT environments, companies are turning to digital twin models and synthetic data as part of their testing and risk management strategies.

A digital twin is a virtual replica of a physical system, allowing organizations to simulate and test updates in a controlled environment before applying them to live systems—providing a safe space to explore the potential impacts of updates without risking actual operations.

Synthetic data, on the other hand, is artificially generated data that mimics real-world data. It’s used to test systems under many different scenarios, including those that are difficult to replicate with actual data.

Using synthetic data lets companies stress-test their systems against a wide range of potential issues, including those that may only occur under rare or extreme conditions.
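
A minimal sketch of that idea, assuming a hypothetical parse_record component under test, is shown below: synthetic records are generated in Python, a share of them deliberately malformed to cover rare and extreme cases, and the run confirms the component degrades gracefully rather than crashing.

import random
import string

def make_synthetic_record(malformed: bool = False) -> bytes:
    if malformed:
        # Rare/extreme cases: empty payloads, null bytes, random junk.
        return random.choice([b"", b"\x00" * 3, bytes(random.getrandbits(8) for _ in range(7))])
    name = "".join(random.choices(string.ascii_lowercase, k=8))
    return f"event,{name},{random.randint(0, 10_000)}".encode()

def parse_record(raw: bytes) -> dict:
    """Hypothetical parser under test: it must fail safely on bad input."""
    try:
        kind, name, value = raw.decode().split(",")
        return {"kind": kind, "name": name, "value": int(value)}
    except (UnicodeDecodeError, ValueError):
        return {"kind": "invalid"}  # degrade gracefully instead of crashing

if __name__ == "__main__":
    records = [make_synthetic_record(malformed=(random.random() < 0.2)) for _ in range(10_000)]
    results = [parse_record(r) for r in records]
    rejected = sum(1 for r in results if r["kind"] == "invalid")
    print(f"Processed {len(records)} synthetic records; {rejected} rejected safely")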

Phil Fersht highlights the importance of these tools in preventing future incidents like the CrowdStrike outage. Organizations can use digital twins and synthetic data to gain a deeper understanding of how updates will interact with their systems, reducing the likelihood of unforeseen problems. This also supports continuous improvement, as insights gained from testing can inform future development and deployment strategies.

Rethinking your approach to software updates after CrowdStrike’s failure

The CrowdStrike outage has also prompted executives to reconsider how quickly they adopt updates, particularly those pushed by third-party vendors. In the past, many companies may have rushed to implement updates as soon as they were released, driven by the desire to stay current and secure. Now, there is a growing recognition that immediate adoption may not always be the best course of action.

Executives are encouraged to take a more measured approach, weighing the benefits of new updates against the potential risks. That may mean delaying updates until they have been thoroughly tested internally, or opting for a phased rollout that mirrors the canary deployment model.

Companies must prioritize operational safety over speed if they are to reduce the risk of outages and maintain greater control over their IT environments.

This shift in strategy reflects a broader trend towards more deliberate and thoughtful IT management. As businesses continue to face the challenges of an increasingly automated world, the lessons learned from the CrowdStrike outage will be a strong reference point for shaping future practices.

Final thoughts

As you reflect on the lessons from the CrowdStrike outage, ask yourself: Is your brand’s pursuit of automation compromising your operational resilience? In a world where speed and efficiency are priorities, have you built enough safeguards to protect your business from the unseen risks?

Now is the time to reassess and fortify your strategies—because the next IT failure could be just one update away.

Tim Boesen

August 12, 2024
