a subsidiary of ProTitleUSA

CrowdStrike and Microsoft

So, what actually happened in the CrowdStrike and Microsoft outage?

Based on my interviews with a few folks from MSFT, here are a few things I learned:

1. The kernel update from CrowdStrike referenced a memory location used for booting that pointed to a hardware location that did not exist. The memory reference was part of the PE code that CrowdStrike had access to.

2. The code containing the call to the nonexistent location was marked as "essential" code, tricking the system into believing it could not boot without it. After several unsuccessful boot attempts, the system triggered an update download check, which got stuck in an infinite loop because the updater kept trying to execute code it did not have.

3. The update from CrowdStrike was pushed within seconds to 8 million users on a single deployment branch. Instead, the rollout should have gone customer branch by customer branch in a staggered fashion: if one branch fails to update, stop the rollout to the remaining branches.

4. Three bugs pushed by CrowdStrike at the same time caused a worldwide outage. The deployment lacked thorough testing.

5. While both companies got a bad rap for this incident, MSFT came to the rescue: 10,000 engineers switched over from other tech support functions, such as Azure or Windows, to help resolve the situation.
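
To make point 3 concrete, here is a minimal sketch of a staggered rollout that updates one customer branch at a time and halts as soon as a branch fails. It is only an illustration, not how CrowdStrike or Microsoft actually deploy updates: the names `staged_rollout`, `RolloutResult`, `apply_update`, and the branch labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    updated: List[str] = field(default_factory=list)  # branches that took the update
    failed: Optional[str] = None                      # first branch that failed, if any
    skipped: List[str] = field(default_factory=list)  # branches never attempted


def staged_rollout(branches: List[str],
                   apply_update: Callable[[str], bool]) -> RolloutResult:
    """Update branches one at a time; stop at the first failure.

    `apply_update` is a hypothetical callback that deploys the update to one
    branch and returns True only if that branch came back healthy.
    """
    result = RolloutResult()
    for index, branch in enumerate(branches):
        if apply_update(branch):
            result.updated.append(branch)
        else:
            # Halt the rollout: every remaining branch keeps the old version.
            result.failed = branch
            result.skipped = list(branches[index + 1:])
            break
    return result


# Usage: pretend the update breaks "branch-b" (made-up branch names).
def fake_update(branch: str) -> bool:
    return branch != "branch-b"


outcome = staged_rollout(["branch-a", "branch-b", "branch-c"], fake_update)
print(outcome.updated, outcome.failed, outcome.skipped)
```

In this sketch a bad update reaches only the first branch that fails, and every branch after it is left untouched, instead of all users receiving it at once.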

There are a number of lessons learned, and the biggest one is to use phased deployment for update code when you have many users.
