A Microsoft Azure DevOps outage in the South Brazil region, which lasted over 10 hours, was caused by a typo in code that led to the deletion of 17 production databases.
Having apologized to impacted customers, Microsoft has now issued a full post-mortem, sharing details of the investigation, which ran from when the outage was first noticed at 12:10 UTC on May 24 until its remedy at 22:31 UTC the same day.
Microsoft principal software engineering manager Eric Mattingly shared details of the codebase upgrade that formed part of Sprint 222. The pull request contained a hidden typo bug in the snapshot deletion job, which ended up deleting the entire Azure SQL Server rather than an individual Azure SQL Database.
Mattingly explained: “when the job deleted the Azure SQL Server, it also deleted all seventeen production databases for the scale unit,” though he confirmed that no customer data was ultimately lost in the incident.
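The failure mode described above, where a cleanup job aimed at a single database snapshot instead deletes the server hosting every database, can be sketched in a minimal model. This is purely illustrative: the class names, the `cleanup_snapshot` function, and the one-token bug are hypothetical stand-ins, not Microsoft's actual code.

```python
# Illustrative model of the reported bug class: a snapshot-deletion job
# that deletes the server (cascading to all hosted databases) instead of
# the single snapshot database it was given. All names are hypothetical.

class SqlDatabase:
    def __init__(self, name):
        self.name = name
        self.deleted = False

    def delete(self):
        self.deleted = True


class SqlServer:
    """A server hosts many databases; deleting it removes all of them."""

    def __init__(self, name, databases):
        self.name = name
        self.databases = databases
        self.deleted = False

    def delete(self):
        # Deleting the server cascades to every database it hosts.
        self.deleted = True
        for db in self.databases:
            db.delete()


def cleanup_snapshot(server, snapshot_db):
    # Intended call: snapshot_db.delete()
    # Typo-style bug: the delete is invoked on the server object instead,
    # taking out every production database on that scale unit.
    server.delete()


prod_dbs = [SqlDatabase(f"prod-{i}") for i in range(17)]
snapshot = SqlDatabase("snapshot-old")
server = SqlServer("scale-unit-server", prod_dbs + [snapshot])

cleanup_snapshot(server, snapshot)
print(sum(db.deleted for db in prod_dbs))  # all 17 production databases gone
```

The point of the sketch is how small the diff between the safe call and the catastrophic one can be: a single identifier, easily missed in code review.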
The outage was detected within 20 minutes, at which point the company’s on-call engineers got to work. However, according to the event log, the root cause was not identified until 16:04, almost four hours after the outage had begun.
Microsoft attributed the more than ten-hour recovery to the fact that customers are unable to restore Azure SQL Servers themselves, as well as to backup redundancy complications and a “complex set of issues with [its] web servers.”
Having learned from its mistake, Microsoft has now promised to roll out Azure Resource Manager Locks on its key resources, in an effort to prevent future accidental deletions.
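The mitigation works by making deletion fail unless a lock is explicitly removed first. A minimal model of that behavior, assuming a `CanNotDelete`-style lock like the one Azure Resource Manager provides (the classes and method names here are hypothetical, not the Azure SDK):

```python
# Illustrative model (not the Azure API) of how a CanNotDelete resource
# lock turns an accidental delete into a loud failure instead of data loss.

class ResourceLockedError(Exception):
    pass


class Resource:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.deleted = False

    def add_lock(self, lock_type):
        self.locks.add(lock_type)

    def remove_lock(self, lock_type):
        self.locks.discard(lock_type)

    def delete(self):
        # A CanNotDelete lock blocks deletion until explicitly removed,
        # forcing a deliberate two-step action for destructive operations.
        if "CanNotDelete" in self.locks:
            raise ResourceLockedError(f"{self.name} has a CanNotDelete lock")
        self.deleted = True


server = Resource("prod-sql-server")
server.add_lock("CanNotDelete")

try:
    server.delete()  # a buggy cleanup job hits the lock and fails fast
except ResourceLockedError as err:
    print(err)

print(server.deleted)  # the server survives the accidental delete
```

In Azure itself the equivalent is applying a management lock of type `CanNotDelete` to the resource or resource group, so that even a correctly authorized deletion request is rejected until an operator deliberately removes the lock.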
Despite a same-day fix, customers in the region were left without access to some services for several hours. The incident underlines how easily things can go wrong, and the importance of backup plans that reduce reliance on a single service provider, including cloud storage and other off-premises infrastructure.