CrowdStrike Postmortem – what organisations can learn and how to improve
On Friday, July 19th, 2024, the world witnessed a widespread disruption that crippled more than 8 million Windows computers running CrowdStrike’s Falcon platform. As the dust settles and we gain more insight into the incident, it’s crucial to dissect what happened and explore ways to strengthen our systems to avoid or mitigate the impact of similar incidents.
This retrospective analysis by Digital and Cloud Solutions Practice Manager, Mohit Dewan, looks at the cause of the outage and the steps taken to resolve it. It then provides insight into how organisations can achieve the critical balance between rapid updates and system resilience through testing, operational controls, and Observability. Additionally, we share a real-world example of our client’s response to the disruption.
The Issue
The CrowdStrike Falcon platform is utilised to safeguard endpoints such as workstations, laptops, servers, and virtual hosts from a wide range of cyber-attacks. It is normal for the platform to receive multiple updates daily so it can combat threats in real time. The problem occurred when CrowdStrike issued one such update to its popular Falcon platform.
The issue, which only impacted Windows hosts, arose when the update contained a logic error in one of its channel (configuration) files that triggered an out-of-bounds memory read in the Falcon sensor’s kernel-level driver. Because the driver runs at the kernel level in Windows, the fault crashed the operating system, producing the ‘Blue Screen of Death’ (BSOD) as Windows failed to boot.
The Fix
The issue was addressed by instructing CrowdStrike customers to revert to the last known stable configuration, or to boot into Safe Mode as an administrator and follow the remediation procedure, which included deleting the problematic configuration files delivered in the latest Falcon sensor update.
The challenge, however, was that remediation required manual intervention on every affected host. With no automated method available, organisations had to spend a significant amount of valuable staff time rectifying the problem.
So, how does this type of issue occur, and how can we prevent it in the future?
The Resilience Balance – how could this be prevented?
A key factor is that the CrowdStrike Falcon Sensor can receive updates several times per day from the cloud platform.
“While the frequency of these updates is one of CrowdStrike’s strengths, allowing near real-time identification and blocking of newly detected threat vectors, it also introduces risks and complacency, as demonstrated by this widespread incident.”
Mohit Dewan, Digital and Cloud Solutions Practice Manager
Three resilience strategies
In the following section, we will explore various measures that can be implemented to prevent such problems from arising in the future. However, it’s important to keep in mind that incorporating testing, governance, and controls inevitably slows the deployment process. Implementing these controls may reduce the risk of installing buggy software, but it increases the risk from threat vectors your systems are not yet protected against. The key, therefore, is to understand your systems and the risk profile each one presents. For example:
- An internet-facing system with high customer engagement should receive updates promptly to ensure it is fully protected.
- For key operational infrastructure that is not internet facing, such as the application server that controls machinery on an assembly line, keeping the machinery running is the priority, so every update should be tested thoroughly even if that means running slightly behind the latest version.
In this balance of resilience, it is imperative to have a nuanced understanding of each of your systems and platforms and to create appropriate policies for each, rather than implementing blanket policies across all systems. It is also a reminder not to become complacent in your testing.
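To make this concrete, here is a minimal sketch of how per-system update policies could be captured as explicit configuration rather than a blanket rule. The tiers, policy names, and numbers below are illustrative assumptions, not a prescribed standard; a real policy set would come from your own risk assessment and change-management framework.

```python
from dataclasses import dataclass

# Hypothetical per-system update policies. The tiers, names, and numbers are
# illustrative only; a real policy set would come from your own risk assessment.
@dataclass
class UpdatePolicy:
    name: str
    versions_behind: int         # 0 = take updates immediately, 1 = n-1, 2 = n-2 ...
    soak_days: int               # 0 = no delay, 10 = d+10 ...
    requires_nonprod_test: bool  # must pass non-production testing first

POLICIES = {
    "internet_facing": UpdatePolicy("rapid", 0, 0, requires_nonprod_test=False),
    "internal_business": UpdatePolicy("balanced", 1, 5, requires_nonprod_test=True),
    "operational_technology": UpdatePolicy("conservative", 2, 10, requires_nonprod_test=True),
}

def policy_for(system_tier: str) -> UpdatePolicy:
    """Look up the update policy that applies to a system's risk tier."""
    return POLICIES[system_tier]

if __name__ == "__main__":
    print(policy_for("operational_technology"))
```

Keeping the policy as explicit data like this makes it easy to review under change governance and to feed into deployment tooling.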
1. Testing – don’t forget the basics
One of the most fundamental and important ways to prevent this kind of issue is to implement a testing framework. Ideally, the vendor (CrowdStrike) should take ownership of this process, as it’s unreasonable to expect customers to test multiple updates per day before they are rolled out. In any case, the CrowdStrike Falcon platform automatically pushes these updates to the endpoint, which underscores the need for thorough testing on the vendor’s part.
Bear in mind that other vendors update far less frequently, so setting and enforcing a policy where all updates are applied in non-production environments and functionally tested before deployment to production can prevent similar issues. This should be defined and enforced as part of an organisation’s change policy.
So how do you test multiple times per day?
To tackle the challenge CrowdStrike faced of testing multiple updates per day, an automated test suite should be implemented as part of the CI/CD pipeline. This suite can deploy each update to multiple platforms, perform a series of checks, and produce a report. All checks must pass before the update can be cleared for release.
In the recent CrowdStrike incident, some automated checks were applied to the update, but they operated at too low a level and were themselves flawed, failing to detect the memory issue in the channel file. There was no further check at a higher level, meaning the update was never tested as compiled, installed code running on a real host. This oversight highlights the need for more comprehensive and correctly configured testing processes.
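As an illustration, the sketch below shows the shape of such a release gate: install the compiled update on a pool of test hosts and clear it for release only if every smoke check passes. The host names and the deploy_update and check_host functions are stand-ins for your own lab and health-check tooling; they are not part of any vendor API.

```python
import sys

# Illustrative release gate: install the compiled update on a pool of test hosts
# and clear it for release only if every smoke check passes on every host.
# The host names, deploy_update() and check_host() are stand-ins for your own
# VM-farm / device-lab and health-check tooling, not any vendor API.

TEST_HOSTS = ["win10-test-01", "win11-test-02", "server2019-test-03"]  # hypothetical

def deploy_update(host: str, package: str) -> None:
    """Install the update package on a test host (stub)."""
    print(f"installing {package} on {host}")

def check_host(host: str) -> dict:
    """Post-install smoke checks: clean reboot, sensor service running,
    kernel driver loaded without errors (stub returns passing results)."""
    return {"booted": True, "service_running": True, "driver_loaded": True}

def gate_release(package: str) -> bool:
    """Deploy to every test host and report whether all checks passed."""
    all_passed = True
    for host in TEST_HOSTS:
        deploy_update(host, package)
        report = check_host(host)
        print(f"{host}: {report}")
        all_passed = all_passed and all(report.values())
    return all_passed

if __name__ == "__main__":
    package = sys.argv[1] if len(sys.argv) > 1 else "content-update.pkg"
    sys.exit(0 if gate_release(package) else 1)  # a non-zero exit blocks the pipeline stage
```

The important property is that the gate exercises the update as installed, running code on representative platforms, not just as a file passing static validation.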
2. Operational Controls – putting control back in your organisation’s hands
In addition to testing, there are operational and governance controls that can be implemented which are specifically designed to slow the rollout of updates so that buggy software is caught before it reaches production.
Typically, this takes the form of a policy such as n-1, where the organisation’s change policy mandates that the installed version be one release behind the latest. Depending on the organisation’s needs, this can be extended to two or even three versions behind; however, it’s important to check that the installed version is still supported by the vendor.
Another approach is d+10, where a new version is not installed on the organisation’s systems until 10 days after release, letting other customers discover and iron out any issues first. Again, substitute whatever number of days is appropriate to the risk level of the system.
In both cases the customer needs to be in control of which versions are installed and must be able to choose to skip a version. This can, in most cases, be negotiated with the vendor even when the product is on an automatic update or release-train platform.
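The sketch below shows how an n-1 / d+10 rule could be evaluated before a rollout is approved. The version numbers and release dates are invented for the example; in practice they would come from the vendor’s release feed and your change-management records.

```python
from datetime import date, timedelta

# Illustrative n-1 / d+10 gate. The versions and release dates are invented for
# the example; in practice they would come from the vendor's release feed and
# your change-management records.

def approved_version(releases: list[tuple[str, date]],
                     versions_behind: int = 1,
                     soak_days: int = 10,
                     today: date | None = None) -> str | None:
    """Return the newest version that is at least `versions_behind` releases
    behind the latest (n-1, n-2, ...) and at least `soak_days` old (d+10)."""
    today = today or date.today()
    newest_first = sorted(releases, key=lambda r: r[1], reverse=True)
    for index, (version, released) in enumerate(newest_first):
        if index >= versions_behind and (today - released) >= timedelta(days=soak_days):
            return version
    return None  # nothing qualifies yet: stay on the currently installed version

if __name__ == "__main__":
    releases = [("4.2.0", date(2024, 7, 18)),
                ("4.1.0", date(2024, 6, 30)),
                ("4.0.0", date(2024, 6, 10))]
    print(approved_version(releases, versions_behind=1, soak_days=10,
                           today=date(2024, 7, 22)))  # -> 4.1.0
```

A check like this can sit in the change-approval step, so the organisation, not the vendor’s release schedule, decides which version lands on production systems.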
3. Observability is essential – a proactive approach
Regardless of the policies and level of controls applied to each of your systems and platforms, Observability is the essential ingredient of a resilient organisation.
Observability ensures the smooth operation of your applications and IT infrastructure, giving you complete visibility so you can detect problems before they escalate into full-blown incidents.
When done correctly, Observability continuously collects appropriate health information from all your systems, giving you the ability to monitor specific components in real time or to examine historical records. This proactive approach enables timely intervention and helps maintain system stability and performance.
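As a simple illustration of the kind of check this enables, the sketch below flags hosts whose last health report is older than a threshold (20 minutes, echoing the case study that follows). The host names and heartbeat timestamps are hard-coded assumptions; in a real deployment they would be queried from your Observability platform’s telemetry store.

```python
from datetime import datetime, timedelta, timezone

# Illustrative staleness check over host heartbeat telemetry. The host names and
# timestamps are hard-coded assumptions; in a real deployment they would be
# queried from your Observability platform's telemetry store.

STALE_AFTER = timedelta(minutes=20)

def stale_hosts(last_seen: dict[str, datetime],
                now: datetime | None = None) -> list[str]:
    """Return hosts that were reporting health data but have gone quiet for
    longer than the threshold - candidates for a crash or boot loop."""
    now = now or datetime.now(timezone.utc)
    return [host for host, seen in last_seen.items() if now - seen > STALE_AFTER]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_seen = {
        "syd-ws-0412": now - timedelta(minutes=3),   # still reporting
        "mel-ws-1033": now - timedelta(minutes=45),  # gone quiet: flag it
        "bne-srv-0007": now - timedelta(hours=2),    # gone quiet: flag it
    }
    print(stale_hosts(last_seen, now))
```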
Observability case study
Observability in action during the CrowdStrike disruption
In response to the CrowdStrike issue, Avocado worked with one of our clients, a large enterprise where we had previously deployed an Observability solution, to provide real-time insight into the impact of the problem. In less than one hour, Avocado consultants deployed new dashboards that interrogated health telemetry from over 10,000 Windows hosts. The dashboards showed in real time the number of hosts that had CrowdStrike Falcon installed, along with their hostnames and locations, and flagged any Windows hosts that were online but had become unavailable in the last 20 minutes, again with their locations.
As a result, this organisation was able to quickly ascertain the impact of the CrowdStrike issue on the business and formulate a response, deploying squads of engineers at each of its physical locations to carry out remediation on affected systems.
With other organisations expected to be out of action for up to two weeks, the impact on the bottom line was significant. Having the Observability platform already deployed greatly reduced investigation time and Mean Time To Resolution (MTTR), saving the organisation millions of dollars.
Final thoughts
The recent CrowdStrike outage highlighted critical lessons in managing software updates and maintaining system resilience. The incident, caused by a problematic update, underscored the importance of robust testing frameworks and operational controls to prevent buggy software deployments. Observability emerges as a crucial component for proactive response, offering real-time insights and significantly reducing investigation and resolution times, saving both time and costs.
How much did the CrowdStrike incident impact your organisation’s revenue and ability to respond? Are you prepared for similar events in the future?
To strengthen your defences and ensure readiness for future challenges, consider integrating comprehensive testing, governance, and observability strategies. Reach out to Avocado by filling out an enquiry form to learn how we can assist in enhancing your organisation’s resilience and response capabilities.
In the News
This information was syndicated to these TechDay sites:
IT Brief Asia – https://itbrief.asia/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
SecurityBrief Asia – https://securitybrief.asia/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
IT Brief Australia – https://itbrief.com.au/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
SecurityBrief Australia – https://securitybrief.com.au/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
IT Brief New Zealand – https://itbrief.co.nz/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
SecurityBrief New Zealand – https://securitybrief.co.nz/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
IT Brief UK – https://itbrief.co.uk/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
SecurityBrief UK – https://securitybrief.co.uk/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
IT Brief India – https://itbrief.in/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
SecurityBrief India – https://securitybrief.in/story/crowdstrike-outage-fuels-rise-in-phishing-scams-experts-argue
Also appearing on:
The Ultimate Guide to Supply Chain Systems – https://techday.com.au/tag/supply-chain-logistics
