Last week saw the first major service disruption to Office 365 in several years. A severe storm in Texas impacted the cooling system at the US South Central data centre, which resulted in protective systems in the data centre switching into containment mode and shutting down servers to prevent further damage. Many people in the immediate local area were affected, but more worryingly, so were users far outside the local area as cascading effects were felt with Azure AD across the world.
That was last week. Everything is back to normal. I decried the state of communication during the outage and asked for more human moments throughout. And after monitoring the Microsoft news sites for the past week and seeing nothing (nada, zilch, zero) about the outage and what went wrong, I’m left wondering why not.
Clearly something happened that should not have happened. Clearly something in how Azure AD (and other non-regional services like the Azure Resource Manager) is engineered and architected is not where it should be yet. What I’m looking for is an explanation of what went wrong, what Microsoft is going to do to resolve it properly this time, and perhaps even some insight into what happened inside the data centre last week.
Customers purchasing cloud services from Microsoft rely on those services to do their work. When everything is working fine, everyone is happy. When there’s a problem, getting back to a normal state as quickly as possible is critical. But just as critical, and perhaps even more important, is the deep analysis of what happened, what was learnt, and what is being done to prevent a recurrence. An outage we can accept, albeit grudgingly. A failure to learn from what happened we are much less willing to tolerate.
And the unwillingness to publicly disclose the learnings from a major outage makes a post like this one highly suspect, even though I’m sure the guidance is great:
We have heard from you, our customers, that you’d like us to provide more guidance and recommendations to help you successfully deploy Azure Active Directory (AD). So today, I’m excited to share a new set of step-by-step deployment plans based on the best practices we’ve learned from working with thousands of customers to successfully roll-out Azure AD.
Deployment plans guide you through the business value, planning considerations, implementation steps, and management of Azure AD solutions. They bring together everything you need to deploy Azure AD capabilities to get the maximum value. Deployment plans include Microsoft recommended best practices, user communications, planning guides, implementation steps, test cases, and more!
In the first instance, as a consequence of what happened last week, customers across the world would be happier to know that Microsoft itself can “successfully deploy Azure Active Directory” in such a way that local outages don’t cause global meltdowns.