Anirban Ghoshal
Senior Writer

Microsoft blames Aussie data center outage on staff strength, failed automation

News
Sep 04, 2023 | 3 mins
Data Center, Microsoft Azure

The outage that occurred on August 30 led to downtime in Azure services pertaining to APIs, databases, and applications.


Microsoft has blamed insufficient staffing and failed automation for a data center outage in Australia on August 30 that prevented users from accessing Azure, Microsoft 365, and Power Platform services for over 24 hours.

In a post-incident analysis report, Microsoft said the outage was caused by a utility power sag in the Australia East region, which in turn “tripped a subset of the cooling units offline in one data center, within one of the Availability Zones.”

With the cooling units offline, rising temperatures forced an automated shutdown of the data center to preserve data and infrastructure health, affecting compute, network, and storage services.

However, Microsoft said the cooling units could have been restarted manually, but there were not enough personnel on site at the data center to do so in time.

“Due to the size of the data center campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner. We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place,” Microsoft wrote as part of the report.

In addition, the company said it is working on other major reforms, such as improving the data center's existing automation so that services are restored faster when an incident occurs.

“We are exploring ways to improve existing automation to be more resilient to various voltage sag event types,” Microsoft said, adding that an evaluation was underway to ensure that the highest-load servers and their corresponding chillers restarted first.

In the past few months, Microsoft has reported several outages, particularly affecting Microsoft 365 services. In July, an outage took out its OneDrive for Business and SharePoint Online services.

In June, users faced issues with Outlook Web, Teams, OneDrive for Business, and SharePoint for over eight hours. 

In May, the company reported that UK users were facing issues accessing some service offerings under Microsoft 365. In April, Microsoft said it was investigating an issue where certain users were unable to use the search functionality in multiple Microsoft 365 services. Outlook on the Web, Exchange Online, SharePoint Online, Microsoft Teams, and Outlook desktop clients were among the affected services.

In another incident in April, users could not access Microsoft 365 web applications and Teams.

Microsoft also suffered a global outage in February, when users once again could not access email and Teams. It suffered a similar outage in January.