After a pretty good run, Microsoft Office 365 had a major outage in San Antonio, TX this week. A lightning storm during the night caused a power spike in the US South Central data centre. The spike disrupted the cooling system in one part of the data centre, which triggered the automatic systems to start shutting down parts of the facility to prevent further problems. Some Office 365 customers in the region were affected. Some Azure customers in the region were affected. And various Office 365 customers around the world were also affected – one segment due to a historical decision to host metadata for early Visual Studio Team Services (VSTS) customers out of the San Antonio data centre, regardless of where they were actually located, and another segment due to degraded availability of Azure Active Directory and the way the global architecture of multi-factor authentication has been configured.
During the outage, Microsoft used two main avenues for communicating status to the world: the Microsoft Azure status page and the Microsoft Office 365 status page. Because of the problem in US South Central, however, the Azure status page was only intermittently available; it kept going up and down. The Office 365 status page was more reliable, although much less informative, with updates delivered only infrequently (every 3 hours). It would be good to see more information on the Office 365 status page, along with a more regular stream of updates (even every 15 minutes). When everything has gone south and your ability to work has been degraded, any new information about timeframes, resolutions, and current actions is highly valued.
The other avenue was Twitter. At 1.13pm Texas time on September 4, the @Office365Status Twitter account posted this update:
When I look through the responses to the tweet, I see the following: disagreement, uncertainty, and perplexity. Various people disagreed that services were restored, as they were still under outage conditions. Others were asking what the impact of the outage would be, such as on their legal holds. And still others were perplexed that Microsoft could let this happen; shouldn’t this have been designed out of the service by now?
While the above can be seen, there is something that can’t be seen: any attempt during the outage by @Office365Status to directly respond, engage, allay fears, spread hope, or give updates. Zilch. Nada. Zero. Surely a company with more than 100,000 employees could have at least one person on hand in such outage conditions to provide a human moment to the paying customers of Office 365 who responded to the original post.
To ask for more insight. “What are you seeing at your place? How many people is this affecting?”
To offer an apology one individual at a time. “Bruce, I’m sorry that Office 365 is down now. We’re working to put it back together as fast as possible.”
To give updates on what the engineers in the data centre were doing. “We have 25 engineers arriving from out-of-state, to help with the physical clean up. You should see the mess!”
(I’m making up these answers.)
To answer the hard questions. “Legal holds won’t be compromised. They will stay in place. But clearly, nothing new will be added while the system is out.” Or, “Will your email send after we’re back online? Yes, if it is in your Outlook outbox. It depends in other situations. What is happening at your place?”
To share any updates on where the team was at, and what was happening. “Wow, what a day this has been for us. So unexpected. Our first responder team is about to go off-duty, and the second team is starting in 4 minutes.”
A person. A human moment. Dear Microsoft, you can do this.
Satya has said that people shouldn’t join Microsoft to be cool, but to make others cool. But when the chilling winds of an outage blow across an already cool population, the warmth of a human being delivering a human moment is needed to keep the balance and prevent everyone from freezing.