CrowdStrike Failure Analysis
With recommendations for changes to the CrowdStrike deployment process, Windows driver signing policy, and IT department contract terms
There is a CrowdStrike failure explainer video out that gives a good (if very detailed) explanation:
(h/t John Rushby).
Setting aside my broader concerns about software industry quality practices, my reaction to this video is:
(1) CrowdStrike was taking matters into their own hands and fumbled the ball somehow. We still don't know whether it was a testing failure or a production staging failure that pushed a wrong/corrupted file instead of a properly tested production file; I suspect a production staging failure. Given that there were two previous catastrophic failures on Linux that same year (see https://en.wikipedia.org/wiki/CrowdStrike), this suggests negligent quality practices in their staging/release process. One of the video comments suggests CS intentionally bypassed customer staging mechanisms (I have heard this from multiple sources at this point), which would make the CS fail even worse, but this is not yet confirmed. They could have been more resilient to such failures by making their code robust to corrupted data, but that is a failure of defense-in-depth, not the proximate cause of the outage IMHO.
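To make that defense-in-depth point concrete, here is a minimal sketch of what "robust to corrupted data" means in practice, assuming a hypothetical file format (the names, magic value, and layout below are invented for illustration and are not CrowdStrike's actual format): validate everything about an untrusted data file before using it, and on any failure keep running with the last known-good configuration instead of crashing.

```c
/* Minimal sketch of defensive parsing for an untrusted data file.
 * All format details (channel_header, CHANNEL_MAGIC, etc.) are
 * hypothetical. The technique: check every assumption before use,
 * and fail safe by rejecting the file rather than crashing. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHANNEL_MAGIC 0x43534446u   /* hypothetical file-type tag */
#define MAX_RECORDS   4096u         /* hypothetical sanity bound */

typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t record_count;  /* fixed-size records following the header */
} channel_header;

typedef struct {
    uint32_t field_a;
    uint32_t field_b;
} channel_record;

/* Returns 0 on success, -1 on any validation failure. The caller is
 * expected to keep the previous known-good configuration on failure. */
static int load_channel_file(const uint8_t *buf, size_t len)
{
    channel_header hdr;

    if (len < sizeof hdr)
        return -1;                 /* truncated header */
    memcpy(&hdr, buf, sizeof hdr);

    if (hdr.magic != CHANNEL_MAGIC)
        return -1;                 /* wrong or corrupted file */
    if (hdr.record_count > MAX_RECORDS)
        return -1;                 /* implausible record count */
    if ((len - sizeof hdr) / sizeof(channel_record) < hdr.record_count)
        return -1;                 /* records would run past end of buffer */

    /* Only now is it safe to walk the records. */
    const uint8_t *p = buf + sizeof hdr;
    for (uint32_t i = 0; i < hdr.record_count; i++) {
        channel_record rec;
        memcpy(&rec, p + (size_t)i * sizeof rec, sizeof rec);
        (void)rec;  /* ...apply rec here, with per-field range checks... */
    }
    return 0;
}

int main(void)
{
    uint8_t junk[8] = { 0 };       /* deliberately invalid input */
    if (load_channel_file(junk, sizeof junk) != 0)
        puts("bad content file rejected; keeping last known-good config");
    return 0;
}
```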
(2) Microsoft has a non-resilient boot-start driver arrangement and is using driver signing/certification as a mitigation, which they then allow companies to bypass. Microsoft should not be allowing unsigned boot-start drivers to run. Period. If CS doesn't like this, they should set up some arrangement with MS to get the signing done on an expedited basis. Or convince MS that no change to their data file can cause the code to crash, and get a special signing arrangement for immediate signing if only the data file were changed. And so on. CS is a big enough player to figure this out. Microsoft should completely own (be held responsible for) all boot-start driver failures. So partly this failure is in fact on MS for having a non-resilient critical OS design feature.
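As a sketch of what a more resilient boot-start arrangement could look like, assuming a simple and entirely hypothetical policy: count consecutive failed boots, and past a threshold bring the machine up without non-essential third-party boot-start drivers instead of crash-looping. Real Windows boot policy is far more involved than this; the point is only that the OS, not the driver vendor, can bound the damage.

```c
/* Sketch of a boot-resilience policy: after repeated boot failures,
 * skip optional third-party boot-start drivers rather than crash-loop.
 * The threshold, driver names, and policy are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_FAILED_BOOTS 2   /* hypothetical threshold */

typedef struct {
    const char *name;
    bool essential;          /* e.g., storage drivers: required to boot at all */
} boot_driver;

static bool should_load(const boot_driver *d, int consecutive_failures)
{
    if (d->essential)
        return true;                                  /* no choice */
    return consecutive_failures < MAX_FAILED_BOOTS;   /* skip after repeats */
}

int main(void)
{
    const boot_driver drivers[] = {
        { "disk.sys",       true  },
        { "thirdparty.sys", false },   /* hypothetical security agent */
    };
    for (int failures = 0; failures <= MAX_FAILED_BOOTS; failures++) {
        printf("boot attempt after %d consecutive failures:\n", failures);
        for (size_t i = 0; i < sizeof drivers / sizeof drivers[0]; i++)
            printf("  %-16s %s\n", drivers[i].name,
                   should_load(&drivers[i], failures) ? "load" : "skip");
    }
    return 0;
}
```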
(3) According to a comment on this video that I've seen pop up multiple times, CS is alleged to have bypassed the N-1 phased deployment capability used by IT departments. If true, IT departments in big companies are doing business with a company that bypassed phased deployments. That is an unforgivable (alleged) sin by CS; if true, it should not be forgiven. Regardless of how the fact-finding goes, the lesson learned should be that bypassing phased deployment should be forbidden by contract by all those IT departments going forward. CS might send a hair-on-fire e-mail to IT departments urging an expedited release, but should not preemptively bypass phased deployment.
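To pin down what the N-1 capability means mechanically, here is a sketch with hypothetical build numbers and names: a customer pinned to N-1 receives the release one behind the newest, and no vendor-side path should be able to route around that selection.

```c
/* Sketch of N-1 phased deployment: version selection honors the
 * customer's ring policy, with no vendor-side override path.
 * Build numbers and names are hypothetical. */
#include <stdio.h>

typedef enum { RING_LATEST, RING_N_MINUS_1, RING_N_MINUS_2 } ring_policy;

/* versions[] is ordered newest-first. */
static int select_version(const int *versions, int count, ring_policy p)
{
    int offset = (int)p;          /* 0, 1, or 2 releases behind newest */
    if (offset >= count)
        offset = count - 1;       /* not enough history: take the oldest */
    return versions[offset];
}

int main(void)
{
    const int versions[] = { 7011, 7010, 7009 };  /* hypothetical builds */
    printf("latest ring gets: %d\n",
           select_version(versions, 3, RING_LATEST));
    printf("N-1 ring gets:    %d\n",
           select_version(versions, 3, RING_N_MINUS_1));
    return 0;
}
```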
In short, again from a narrow view and on the available information, this is a failure of multiple layers of checks and balances, which, barring new information from CS, it seems CS willfully bypassed (or created mechanisms to bypass, depending on specifics they have not yet released).
**** Update ****
CrowdStrike has released a preliminary incident report here: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
The short version: software changes are tested and subject to phased release by IT groups. But data changes are pushed straight to production with NO TESTING at all -- just a check by a Content Validator. The Content Validator was defective, allowing bad content to get through.
Summarizing their findings:
- They do test and use phased deployment for code changes. But this was not a code change.
- They do test and use phased deployment for new calibration data templates. But this was a new instance of an existing template.
- They do not test and do not use phased deployment for calibration data changes (additional instances of an existing template). This was a calibration data change (a new instance of an existing template).
"Do not test" means they use some sort of checker tool (that missed the issue) and go straight to instant worldwide deployment at full scale. From the CS page: "Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."
This does not really change my recommendations; it just gives them more texture and context.