Why No Lessons Learned, Microsoft?

Last week saw the first major service disruption to Office 365 in several years. A severe storm in Texas impacted the cooling system at the US South Central data centre, which resulted in protective systems in the data centre switching into containment mode and shutting down servers to prevent further damage. Many people in the immediate local area were affected, but more worryingly, so were users far outside the local area as cascading effects were felt with Azure AD across the world.

That was last week. Everything is back to normal. I decried the state of communication during the outage and asked for more human moments throughout. And after monitoring the Microsoft news sites for the past week and seeing nothing (nada, zilch, zero) about the outage and what went wrong, I’m left wondering why not.

Clearly something happened that should not have happened. Clearly something in how Azure AD (and other non-regional services like the Azure Resource Manager) is engineered / architected is not where it should be yet. What I’m looking for is an explanation and elaboration of what happened, what Microsoft is going to do to resolve it properly this time, and perhaps even some insight into what happened in the data centre last week.

Customers purchasing cloud services from Microsoft rely on those services to do their work. When everything is working fine, everyone is happy. When there’s a problem, getting back to a normal state as quickly as possible is critical. Just as critical, and perhaps even more important, is the deep analysis of what happened, what was learnt, and what will be done / is being done to prevent a recurrence. An outage we can accept, albeit grudgingly. A failure to learn from what happened we are much less willing to tolerate.

And the unwillingness to publicly disclose the learnings from a major outage makes a post like this one highly suspect, even though I’m sure the guidance is great:

We have heard from you, our customers, that you’d like us to provide more guidance and recommendations to help you successfully deploy Azure Active Directory (AD). So today, I’m excited to share a new set of step-by-step deployment plans based on the best practices we’ve learned from working with thousands of customers to successfully roll-out Azure AD.


Deployment plans guide you through the business value, planning considerations, implementation steps, and management of Azure AD solutions. They bring together everything you need to deploy Azure AD capabilities to get the maximum value. Deployment plans include Microsoft recommended best practices, user communications, planning guides, implementation steps, test cases, and more!

In the first instance, as a consequence of what happened last week, customers across the world would be happier to know that Microsoft itself can “successfully deploy Azure Active Directory” in such a way that local outages don’t cause global meltdowns.

More Human Moments from Microsoft

After a pretty good run, Microsoft Office 365 had a major outage in San Antonio TX this week. A lightning storm during the night caused a power spike in the US South Central data centre, which negatively affected the cooling system in one part of the data centre, which triggered the automatic systems to start shutting down parts of the data centre to prevent further problems. Some Office 365 customers in the region were affected. Some Azure customers in the region were affected. And various Office 365 customers around the world were also affected – one segment due to a historical decision to host metadata for early Visual Studio Team Services (VSTS) customers out of the San Antonio data centre, regardless of where they were actually located, and another segment due to Azure Active Directory suffering from degraded availability and how the global architecture of multi-factor authentication has been configured.

During the outage, Microsoft used two main avenues for communicating status to the world: the Microsoft Azure status page, and the Microsoft Office 365 status page. The problem in US South Central was such, however, that the Azure status page was only intermittently available; it kept going up and down. The Office 365 status page was more reliable, although much less informative, with updates delivered only infrequently (every 3 hours). It would be good to see more information on the Office 365 status page, along with a more regular stream of updates (every 15 minutes even). When everything has gone south and your ability to work has been degraded, any new information about timeframes, resolutions, and current actions is highly valued.

The other avenue was Twitter. At 1.13pm Texas time on September 4, the @Office365Status Twitter account posted this update:

When I look through the responses to the tweet, I see the following: disagreement, uncertainty, and perplexity. Various people disagreed that services were restored, as they were still under outage conditions. Others were asking what the impact of the outage would be, such as on their legal holds. And still others were perplexed that Microsoft could let this happen; shouldn’t this have been designed out of the service by now?

While the above can be seen, there is something that can’t be seen: any attempt during the outage by @Office365Status to directly respond, engage, allay fears, spread hope, or give updates. Zilch. Nada. Zero. As a company with more than 100,000 employees, surely in such outage conditions, at least one person could be on hand to provide a human moment to the paying customers of Office 365 who have responded to the original post.

To ask for more insight. What are you seeing at your place? How many people is this affecting?

To offer an apology one individual at a time. Bruce, I’m sorry that Office 365 is down now. We’re working to put it back together as fast as possible.

To give updates on what the engineers in the data centre were doing. We have 25 engineers arriving from out-of-state, to help with the physical clean up. You should see the mess!

(I’m making up these answers).

To answer the hard questions. Legal holds won’t be compromised. They will stay in place. But clearly, nothing new will be added while the system is out. Or, Will your email send after we’re back online? Yes, if it is in your Outlook outbox. It depends in other situations. What is happening at your place?

To share any updates on where the team was at, and what was happening. Wow, what a day this has been for us. So unexpected. Our first responder team is about to go off-duty, and the second team is starting in 4 minutes.

A person. A human moment. Dear Microsoft, you can do this.

Satya has said that people shouldn’t join Microsoft to be cool, but to make others cool. But when the chilling winds of an outage blow across an already cool population, the warmth of a human being delivering a human moment is needed to keep the balance and prevent everyone from freezing.

Rotten from the Core?

Along with many others, my late 2017 Apple MacBook Pro suffers from a problematic keyboard. While I have had the laptop since before Christmas last year, I haven’t used the keyboard very often, since it is usually connected to a large screen and external keyboard in the office. The first time I used it directly I wondered what on earth was going wrong. The D key misbehaves constantly – misfires, gets stuck, provides a delayed response – and other keys have their share of problems too. And every time since the first direct usage it has misbehaved, got in the way, been annoying, stopped my writing flow, etc. Not great for a tool that is supposed to be invisible and let me get on with my work.
 
I went into the local authorised Apple repair shop yesterday to ask about getting it fixed under the keyboard replacement / repair program that Apple is running. I was met with a willingness to get it repaired, but a complete and utter indifference to flexibility in how it could be fixed. My guess is that hands-on time with the machine will be about one and a half hours maximum, and yet I have to hand it in for 3-5 days. Or for 100 minutes of repair time, I lose it for 100 hours.
 
“Can I book this in for a specific time to be looked at?”
“No.”
 
“Can you assess it, order the replacement pieces, and then I’ll bring it back in when the keyboard arrives from Apple?”
“No.”
 
“How long will it take to get fixed?”
“Don’t know. Anywhere from 3-5 days.”
 
“Could I bring it in at 9am on Monday to get the process started?”
“No. Our technicians don’t start at 9am.”
 
“I use this machine to do my work. How does Apple expect me to get anything done if it is tied up for 5 days?”
“Don’t know.”
 
I’m not used to interacting with people who evidence such absolute indifference in a professional or business setting. What’s worse than the direct reaction of the one employee I spoke to yesterday is that he is situated within a wider company – a company that is apparently perfectly fine with indifference, poor service, and a rotten attitude.

Why does Apple tolerate this level of indifference, process inefficiency, and poor customer service in its ecosystem? Have they really baked an ecosystem of indifference? Is it rotten from the core?

Perhaps it is time to explore a new Surface of computing.

Protecting Mobile Devices

Mobile devices as endpoints to corporate information have taken the world by storm. The “mobile first” mantra refers to the preferential use of a mobile device before a desktop or laptop. Have phone, will work (or even run the company). The potential of the device to enable new ways of working has to be safeguarded from that which could undermine both current execution and the integrity of long-range plans.

In the Microsoft 365 world, this is the role of Intune Mobile Threat Defense. The service looks at what’s happening on devices, with applications, with the content of messages, with the types of network traffic going through the device … and makes a determination whether all is well or starting to go rotten (slowly or quickly). When a threat is detected – which can be in collaboration with another mobile threat analysis vendor – new protections are enforced to reduce risk, stop data loss, and contain the threat. These could be conditional access policies, such that the end user has to verify that they are the person requesting access to the information through a second factor or means of authentication. Or it could be more draconian, whereby data is locked and blocked from access by anyone or anything. If the device can be remediated – via a secondary user authentication action or a device update that contains the threat – everything goes back to how it is supposed to work.
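To make the detect-then-enforce flow above concrete, here is a minimal sketch in Python. It is not Intune's API. The threat levels, the DeviceSignal class, and the access_decision function are all names I have made up for illustration, and a real policy would weigh far more signals than a single level.

```python
from dataclasses import dataclass
from enum import IntEnum


class ThreatLevel(IntEnum):
    # Hypothetical scale, ordered from healthy to compromised.
    SECURED = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class DeviceSignal:
    # Hypothetical report from a mobile threat analysis service.
    device_id: str
    threat_level: ThreatLevel


def access_decision(signal, max_allowed=ThreatLevel.LOW):
    """Return 'allow', 'require_second_factor', or 'block' for a device."""
    if signal.threat_level <= max_allowed:
        return "allow"                  # all is well
    if signal.threat_level == ThreatLevel.HIGH:
        return "block"                  # the draconian option: lock the data
    return "require_second_factor"      # the user must verify who they are


print(access_decision(DeviceSignal("phone-1", ThreatLevel.MEDIUM)))
```

Once the device is remediated and reports a lower level, the same function returns "allow" again, which is the "everything goes back to how it is supposed to work" step.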

The Microsoft Intune Team just announced a new integration with BETTER Mobile for leveraging signals from BETTER ActiveShield to trigger Intune policies around conditional access and other mitigation policies. Current Intune customers can get 50 free licenses for 18 months from BETTER Mobile, to try out the integration.

Spoof Intelligence in Office 365

Microsoft added Spoof Intelligence for email security earlier this year (January 2018 I think). This was included as a feature of the Office 365 Enterprise E5 plan, as well as a feature of the Advanced Threat Protection add-on for non-E5 customers. Spoof Intelligence provides visibility into who is spoofing your domain and/or domains that are sending email to you, and provides the capability to allow or deny any of these sending patterns. Spoofing means sending as a domain when you aren’t actually part of that domain, and the default behaviour in anti-spam engines is to treat spoofed email as junk or otherwise invalid. But that’s not always true.

In its documentation on Spoof Intelligence, Microsoft lists several situations when spoofing is valid:

When a sender spoofs an email address, they appear to be sending mail on behalf of one or more user accounts within one of your organization’s domains, or an external domain sending to your organization. Surprisingly, there are some legitimate business reasons for spoofing. For example, in these cases, you wouldn’t block the sender from spoofing your domain:
– You have third-party senders who use your domain to send bulk mail to your own employees for company polls.
– You have hired an external company to generate and send out advertising or product updates on your behalf.
– An assistant who regularly needs to send email for another person within your organization.
– An application that is configured to spoof its own organization in order to send internal notifications by email.


External domains frequently send spoofed email, and many of these reasons are legitimate. For example, here are some legitimate cases when external senders send spoofed email:
– The sender is on a discussion mailing list, and the mailing list is relaying the email from the original sender to all the participants on the mailing list.
– An external company is sending email on behalf of another company (for example, an automated report, or a software-as-a-service company).


You need a way to ensure that the mail sent by legitimate spoofers doesn’t get caught up in spam filters in Office 365 or external email systems.

There’s a plethora of technical standards and reputation dealings and authentication magic happening in the background to determine whether a message is spoofed or not, but the simple idea is that Spoof Intelligence provides a simple way of seeing who is spoofing you, and providing you with the ability to mark these spoofs as valid or invalid.
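As a rough illustration of that simple idea, and nothing more, here is a Python sketch. It assumes the hard part (working out which domain actually sent the message) has already been done by the standards and authentication checks in the background. The classify function, the decisions table, and the example domains are my own inventions, not how Microsoft implements Spoof Intelligence.

```python
from email.utils import parseaddr


def domain_of(address):
    """Return the lower-cased domain part of an address or From header."""
    _, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower()


def classify(from_header, authenticated_domain, decisions):
    """Classify one message based on the administrator's decisions.

    decisions maps (spoofed_domain, sending_domain) to True (allow)
    or False (block). Pairs not in the table still need a human decision.
    """
    claimed = domain_of(from_header)
    sender = authenticated_domain.lower()
    if claimed == sender:
        return "not_spoofed"
    verdict = decisions.get((claimed, sender))
    if verdict is None:
        return "needs_review"
    return "allowed_spoof" if verdict else "blocked_spoof"


# Hypothetical decisions: a bulk mailer we hired is allowed to send as us.
decisions = {("example.com", "bulkmailer.example.net"): True}
print(classify("Polls <polls@example.com>", "bulkmailer.example.net", decisions))
```

The "needs_review" result is the interesting one: it is the list of sender pairs an administrator would look through and mark as valid or invalid.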

Email Security

The prevalence of email (addresses, services, checking behaviours) has made it a key vector for hackers, attackers, and others devoted to maleficence. There are many varieties of bad email:
– spam – unwanted email messages, normally carrying a commercial offer. Annoying and productivity draining at best; may carry other nefarious payloads at worst.
– phishing – an email message pretending to be something it is not, with the intent of capturing a user account for subsequent actions. Phishing is a key method of account compromise.
– whaling / CEO fraud / spear-phishing – a highly targeted email message sent to a specific individual (normally high ranking with special financial authorisations) requesting a specific action that sounds highly probable and likely given the (falsified) details in the email.
– privileged account compromise – targeted efforts to get the account credentials for a high-level IT administrator, because once you have the keys to the kingdom, you have the kingdom.
– attachments infected with viruses, ransomware, and malware
– … and many more.

Security vendors provide a range of protections against the many and varied types of attacks:
– anti-spam to filter out unwanted messages, based on certain attributes and qualities.
– reputation services that analyse message characteristics to discern the valid from the invalid.
– signature-based anti-malware services, that compare message and attachment characteristics with known malware signatures.
– detonation chambers to deal with unknown, new, and never-been-seen-before malware variants (zero-day threats). Attachments and other links are executed in a controlled environment and recursively analysed for fingerprints of badness. If nothing is identified, the message and attachment are deemed safe and passed through to the user.
– domain name checking to see whether the signals about message authenticity align with the domain name represented in the sender details.
– domain name lookalike and sound alike checks, to see if the sender is trying to fool you by using a valid domain name with valid reputation but that is pretending to be your domain name or the domain name of a trusted business partner. Such as michaelsampson.net versus michealsampson.net or michae1sampson.net or m1chaelsampson.net or michaelsampsn.net. If you don’t look closely enough, you’ll miss the false pretence.
– wider industry standards around email reputation and authentication, to minimise the valid attack surface, and thus force the creation of false signals when everything doesn’t line up.
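
The lookalike check in the list above is the easiest of these to sketch. One common way to measure "almost the same" is edit distance (Levenshtein distance), which counts the single-character changes needed to turn one string into another. This is my own illustration in Python, not any vendor's method, and the threshold of 2 is an assumption picked to catch the examples above.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def is_lookalike(candidate, trusted, max_distance=2):
    """True when candidate is close to, but not the same as, a trusted domain."""
    candidate, trusted = candidate.lower(), trusted.lower()
    return candidate != trusted and edit_distance(candidate, trusted) <= max_distance


for domain in ["michealsampson.net", "michae1sampson.net", "michaelsampsn.net"]:
    print(domain, is_lookalike(domain, "michaelsampson.net"))
```

A real service would also need to handle characters that merely look alike (such as 1 and l) and names that sound alike, which plain edit distance does not capture.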

For every program manager, product manager and software engineer focused on making productivity-enhancing tools, there are at least as many focused on safeguarding that productivity through security tools.

Excel for Mac Plus Excel MVPs

Excel Table Talk Episode 6

Back in the early 1990s, the first client project that paid decent money (NZ$25 per hour, which my client described as “charging like a bull”) required the use of a massive spreadsheet to analyse cost flows in a small manufacturing firm. I spent hours and hours collecting the data, looking at how to manage it, and then how to use a spreadsheet with macros to automate the analysis. While PowerPoint got me my first visit to London to present at a conference in March 1994, it was Lotus 1-2-3 that enabled me to graduate debt free, and in retrospect, probably enabled me to get my first job after university. There was a lot of financial modelling required in that first role, although unlike at my manufacturing client, Microsoft Excel was the product of choice.

The above video presents recent updates in Excel for Mac, and at the end, has some short interviews with the Excel MVPs at the recent MVP Summit. I don’t do a lot with Excel anymore, and my days of charging $25 per hour are very much last millennium, but it’s fascinating to see the Excel team keeping on working with a product that directly and indirectly touches so many of the decision making processes in the world.

Data Residency in Office 365 – It’s Complicated

Microsoft announced data residency for data in Microsoft Teams for Canada, with Australia and Japan coming before the end of August as well. However, the details matter. For one, in 2018 this only applies to two groups of customers:
– brand new customers who are creating a brand new tenant based in the Canadian geo; or
– current customers who have a tenant based in Canada but who have never opened nor touched Microsoft Teams.

In 2019 there will be a migration option for customers based in Canada (their tenant is homed there), and by extension, this will apply for Australia and Japan too. But for all practical purposes, this announcement is just a signaling device that a change is coming. It may make a difference for Canadian organisations evaluating Office 365 right now, but everyone already using Office 365 is stuck with the current state for a while yet.

In the table above, I try to pick out the specifics on where your data is located, because … it’s complicated. It depends on how / why / who / what / where. In looking at the above:

Brand New Customer in Canada geo – yes, you could have all your Microsoft Teams data stored in Canada.
Existing Customer in Canada geo – your SharePoint and OneDrive storage and Exchange mailboxes will be stored in Canada, but your Teams data is stored in the United States currently. From 2019, you will have the option of migrating it to Canada.
Existing Customer with Multi-Geo – *if* the SharePoint site for the Team was created in the Canadian geo, and *if* the user sharing a file has their OneDrive also located in the Canadian geo, then your files will be stored in region, but until you migrate in 2019, your conversation and chat data in Teams will be stored out of geo (United States or other).

Data residency is a specialised area. If it’s in the “don’t care” bucket for you, then roll on. If it’s in the “it matters a lot to our organisation for several specific reasons,” then study the details.

Information Protection in Windows 10 and OneDrive

Paul Thurrott’s analysis of the soon-to-be-available Windows 10 update – Version 1809 (Redstone 5) – included this snippet that caught my eye:

Storage Sense now integrates with OneDrive and can automatically change any downloaded files to online-only if you haven’t used them in a configurable number of days (in Settings > System > Storage > Storage Sense).

Every vendor struggles with the balance between releasing tools that enable productivity through information availability and protecting information from too much disclosure / availability. What a person should have access to, based on their job role and their tasks, is a governance question for organisations, one that’s enabled by technical capabilities offered by vendors. Data loss prevention stops people from sending information to other people when it’s sensitive or confidential and the other party doesn’t have access rights. Access control lists on collaborative workspaces, shared folders, and systems of all kinds provide another form of information protection – it lets those who need the content in, and keeps those who don’t have the right to the content out. Role-based access control goes a step further and adds the nuance of who can and cannot take specific actions within a system.

Choosing to sync your OneDrive contents to a local machine is great for productivity – everything is immediately available whether you are connected to the network or not. But the risk is that unauthorised access to your machine – directly by a person or indirectly by a security threat executing and exfiltrating the data on your disk – will enable access to content by people who do not have authorisation. To information that is sensitive, confidential, or in need of special protections. The above forthcoming integration with Storage Sense in Windows 10 will mean that content from OneDrive that is not used often can be removed from local storage, reducing the potential information protection disclosure surface. If it’s not there directly, it can’t be accessed directly … and thus there’s another action required to gain access, which can be evaluated against up-to-the-second security policies.
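
To show the selection rule in miniature, here is a Python sketch that finds files in a folder that have not been accessed in a configurable number of days. It is only an illustration of the idea. The stale_files function is my own name, it does not convert anything to online-only, and it is not how Storage Sense or OneDrive work internally. It also assumes the file system records last-access times, which is not always the case.

```python
import time
from pathlib import Path


def stale_files(folder, max_idle_days, now=None):
    """Yield files under folder whose last access is older than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            yield path


# List candidates in the current folder that have been idle for 30 days.
for path in stale_files(".", 30):
    print(path)
```

Everything this returns is a candidate for removal from local storage, and everything it skips stays available offline.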

Onboarding and Offboarding: The Hidden Processes

There’s a whole set of activities required for effectively onboarding new employees and offboarding departing ones. People to coordinate. Processes to develop and operate efficiently. Magic moments that should just happen – because first impressions count and create memories.

One of the behind-the-scenes or hidden processes involves setting up access for the new employee to the systems they require for doing their work. An email account. Access to the collaborative workspace tools being used. HR system access. And more. This can be done manually by an IT administrator with super-user privileges across systems, or driven based on policy using a directory service with provisioning (and de-provisioning) capabilities. The latter means an administrator creates a user account in one central system (the directory), adds the user to a group that has access rights to specific other systems, and the provisioning service notes the change and follows a pre-defined script for adding the new user to other connected systems.

For Office 365 and Microsoft 365 customers, the user provisioning service in Azure Active Directory enables automated, policy-based provisioning of non-Microsoft cloud apps, such as Salesforce, Slack, GoToMeeting, Dropbox, Box and more. This creates sanctioned accounts in these services, decreasing the footprint of unsanctioned apps and shadow IT services. Last week, Microsoft announced additional services can now be provisioned and deprovisioned using Azure AD – including Asana, BlueJeans, Bonusly, LucidChart, and Zendesk.

And when an employee leaves, removing them from the groups with access to other systems essentially runs the process in reverse: user accounts are revoked and thus access privileges are removed.
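
The group-driven process described above can be sketched as a toy model in Python. This is not the Azure AD provisioning service or its API. The Directory class, its methods, and the group names are all hypothetical, and a real service talks to each connected app over the network rather than updating a dictionary.

```python
class Directory:
    """Toy directory: group membership drives accounts in connected apps."""

    def __init__(self, group_apps):
        self.group_apps = group_apps  # group name -> set of app names
        self.memberships = {}         # user -> set of group names
        self.accounts = {}            # app name -> set of users with an account

    def _sync(self, user):
        """Reconcile the user's app accounts with their current groups."""
        entitled = set()
        for group in self.memberships.get(user, set()):
            entitled |= self.group_apps.get(group, set())
        for app in set().union(*self.group_apps.values()):
            users = self.accounts.setdefault(app, set())
            if app in entitled:
                users.add(user)      # provision
            else:
                users.discard(user)  # deprovision

    def add_to_group(self, user, group):
        self.memberships.setdefault(user, set()).add(group)
        self._sync(user)

    def remove_from_group(self, user, group):
        self.memberships.get(user, set()).discard(group)
        self._sync(user)


directory = Directory({"sales": {"Salesforce", "Slack"}, "all-staff": {"Slack"}})
directory.add_to_group("bruce", "sales")       # onboarding
directory.remove_from_group("bruce", "sales")  # offboarding
print(directory.accounts)
```

The point of the sketch is that the administrator only ever touches group membership. Accounts in the connected apps follow from it in both directions.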

Being intentional / deliberate / automated in this area is another example of what information protection looks like in practice.