
Cloud services change all the time, whether it’s adding new features or fixing bugs and security vulnerabilities; that’s one of the big advantages over on-prem software. But every change is also an opportunity to introduce the bugs and regressions that are the main reasons for reliability issues and cloud downtime. In an attempt to avoid issues like that, Azure uses a safe deployment process that rolls out updates in phases, running them on progressively larger rings of infrastructure and using continuous, AI-powered monitoring to detect any issues that were missed during development and testing.

When Microsoft launched its Chaos Studio service for testing how workloads cope with unexpected faults last year, Azure CTO Mark Russinovich explained the safe deployment process. “We go through a canary cluster as part of our safe deployment, which is an internal Azure region where we’ve got synthetic tests and we’ve got internal workloads that actually test services before they go out. This is the first production environment that the code for a new service update reaches, so we want to make sure that we can validate it and get a good sense for the quality of it before we move it out and actually have it touch customers.”


After the canary region, the code is deployed into a pilot region, then a region with light usage, then a more heavily used region, then progressively to all Azure regions (which are grouped into pairs geographically, with updates going to one region in each pair first and then to the other). All the way through that deployment process, he explained, “We’ve got AIOps monitoring everything to look for regressions.”
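
To make that progression concrete, the sketch below models it as an ordered list of deployment stages, each gated on the health of the one before it. The stage names and the health-check callback are illustrative only, not Azure’s actual deployment tooling:

```python
# Minimal sketch of a staged ("ring") rollout: each stage only starts once the
# previous stage has been deployed and reported healthy.
# Stage names and the health-check callback are illustrative, not Azure's tooling.

STAGES = [
    "canary region",
    "pilot region",
    "light-usage region",
    "heavy-usage region",
    "first region of each geographic pair",
    "second region of each geographic pair",
]

def roll_out(build_id: str, is_healthy) -> bool:
    """Deploy build_id stage by stage; stop on the first detected regression."""
    for stage in STAGES:
        print(f"Deploying {build_id} to {stage}...")
        if not is_healthy(build_id, stage):  # monitoring verdict for this stage
            print(f"Regression detected in {stage}; halting rollout of {build_id}.")
            return False
    print(f"{build_id} rolled out to all stages.")
    return True

# Example: a stand-in health check that always passes.
roll_out("build-2024.01", lambda build, stage: True)
```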

AIOps (techniques using big data, machine learning and visualization to automate IT operations) can detect problems developers can’t find by debugging their code, because they might be caused by dependencies or interactions that only come into play when the code is live and being used in conjunction with other Azure services.

A bad rollout might crash VMs, slow them down, make them slower to provision or stop them communicating, or it might affect monitoring agents, storage, telemetry or operations on the control plane. But those same problems might also be caused by a hardware failure, temporary network issues or timeouts in service APIs, which rolling back the latest deployment wouldn’t fix. There are hundreds of deployments a day on Azure, most of them targeting hundreds or thousands of clusters, all of which have to be monitored, and a single deployment can take anywhere from ten minutes to 18 hours. With thousands of components running in more than 200 data centres across over 60 regions, and with problems like a memory leak that might not show up for several days, or might show up as very subtle issues in many clusters that add up to a significant problem across a whole region, it’s hard for human operators to work out which change caused a specific problem, especially if it’s caused by an interaction with another component or service.

The AIOps system Microsoft uses, called Gandalf, “watches the deployment and health signals in the new build and long run and finds correlations, even if [they’re] not obvious,” Microsoft said. Gandalf looks at performance data (including CPU and memory usage), failure signals (like OS crashes, node faults and VM reboots as well as API call failures in the control plane) and takes information from other Azure services that track failures to spot problems and track them back to specific deployments.
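
Conceptually, that means lining fault events up against deployment timelines and flagging the builds whose arrival on a cluster coincides with a jump in failures. Here’s a rough, hypothetical illustration of that kind of correlation, using invented data rather than Gandalf’s actual signals or model:

```python
# Toy correlation of fault counts with deployments: for each build, count the
# faults that occur on a cluster shortly after the build lands there.
# All data below is invented for illustration.
from collections import defaultdict

deployments = [  # (build, cluster, hour the build reached the cluster)
    ("build-A", "cluster-1", 10), ("build-A", "cluster-2", 11),
    ("build-B", "cluster-3", 10),
]
faults = [  # (cluster, hour of fault) e.g. node faults, VM reboots, API failures
    ("cluster-1", 11), ("cluster-1", 12), ("cluster-2", 12), ("cluster-3", 5),
]

def faults_after_deployment(window: int = 2):
    """Count faults occurring within `window` hours after each build's arrival."""
    score = defaultdict(int)
    for build, cluster, deployed_at in deployments:
        for fault_cluster, fault_hour in faults:
            if fault_cluster == cluster and deployed_at <= fault_hour <= deployed_at + window:
                score[build] += 1
    return dict(score)

print(faults_after_deployment())  # {'build-A': 3} -> build-A is the likely suspect
```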

It knows when a deployment is happening and looks at how many nodes, clusters and customers a failure would affect, to recommend whether new code is safe to roll out across Azure or whether it should be blocked because problems it causes in the canary region would translate into significant problems in production.
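
One way to picture that recommendation is as a blast-radius estimate: scale what was observed in the canary ring up to the footprint the build would have in production, and block the rollout if the projected impact crosses a threshold. The numbers and thresholds below are invented purely for illustration:

```python
# Hypothetical go/no-go gate: project canary impact onto the production footprint.
def rollout_decision(canary_failures: int, canary_nodes: int,
                     production_nodes: int, customers_per_node: float,
                     max_customers_impacted: int = 100) -> str:
    """Block the rollout if the projected customer impact exceeds the threshold."""
    failure_rate = canary_failures / canary_nodes
    projected_failing_nodes = failure_rate * production_nodes
    projected_customers = projected_failing_nodes * customers_per_node
    return "block" if projected_customers > max_customers_impacted else "proceed"

# 3 failing nodes out of 500 in canary, projected onto 200,000 production nodes.
print(rollout_decision(canary_failures=3, canary_nodes=500,
                       production_nodes=200_000, customers_per_node=0.5))  # "block"
```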

Gandalf captures fault information from one hour before and after each deployment as streaming data in Azure Data Explorer (also known as Kusto), which is designed for fast analytics: it usually takes about five minutes for Gandalf to make a decision about a deployment. It also tracks system behaviour for 30 days after deployment, to spot the longer-term issues (those decisions take about three hours).
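
The one-hour-before/one-hour-after window amounts to a before-and-after comparison for every deployment. A minimal Python sketch of that comparison, with made-up fault timestamps (this is not the actual query Gandalf runs against Azure Data Explorer):

```python
# Sketch: compare fault counts in the hour before vs. the hour after a deployment.
# Timestamps are invented; in Azure this data would live in Azure Data Explorer.
import pandas as pd

deploy_time = pd.Timestamp("2024-01-15 12:00")
fault_times = pd.to_datetime([
    "2024-01-15 11:20",                      # before the deployment
    "2024-01-15 12:05", "2024-01-15 12:30",  # after the deployment
    "2024-01-15 12:45",
])
faults = pd.Series(1, index=fault_times)

before = faults[deploy_time - pd.Timedelta(hours=1): deploy_time].sum()
after = faults[deploy_time: deploy_time + pd.Timedelta(hours=1)].sum()
print(f"faults before: {before}, after: {after}")  # faults before: 1, after: 3
```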


It’s not the only technique Microsoft uses to make Azure more resilient. “A memory leak caused by a new regressed payload would be stopped by Gandalf,” Microsoft explained. “Meanwhile, we have a resiliency mechanism to auto mitigate the already deployed nodes with leaking issues such as restarting the node if there are no customer workloads, or live-migrating running VMs if the node is not empty.”
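
That mitigation policy (restart empty nodes, live-migrate the VMs off busy ones first) is simple enough to express directly. A hypothetical sketch, with invented function and node names standing in for the real resiliency machinery:

```python
# Hypothetical auto-mitigation for a node with a detected memory leak:
# restart it if it carries no customer VMs, otherwise live-migrate the VMs first.
def mitigate_leaking_node(node_id: str, running_vms: list[str]) -> str:
    if not running_vms:
        return f"restart {node_id}"                      # no customer impact
    migrations = ", ".join(f"live-migrate {vm}" for vm in running_vms)
    return f"{migrations}; then restart {node_id}"       # drain first, then restart

print(mitigate_leaking_node("node-42", []))                # restart node-42
print(mitigate_leaking_node("node-43", ["vm-a", "vm-b"]))  # migrate, then restart
```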

“AIOps is good for detecting patterns that naturally occur and making correlations based on historical data as well as training,” Microsoft said. The problems it finds are with new deployment payloads, but for other issues, like day-zero bugs, Azure uses other techniques such as chaos testing. “Day-zero bugs might be triggered by rarely occurring workloads, manifested in both previous versions and new versions and occur randomly, or have no strong correlation with new deployment. Chaos testing can capture such bugs by introducing failures randomly and testing that the system holds up as expected.”
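
Chaos testing, as described there, means injecting random failures and checking that the system still behaves as expected. Here’s a toy illustration of the pattern (the flaky dependency and retrying caller are invented; Azure’s production tool for this is Chaos Studio, mentioned earlier):

```python
# Toy chaos test: randomly inject a failure into a dependency call and check
# that the caller's retry logic still produces a correct result.
import random

def flaky_dependency(x: int, failure_rate: float = 0.3) -> int:
    """Stand-in for a downstream service that is made to fail at random."""
    if random.random() < failure_rate:
        raise TimeoutError("injected fault")
    return x * 2

def resilient_call(x: int, attempts: int = 20) -> int:
    """The behaviour under test: retry on failure instead of crashing."""
    for _ in range(attempts):
        try:
            return flaky_dependency(x)
        except TimeoutError:
            continue
    raise RuntimeError("dependency unavailable after retries")

# Run the workload many times under injected faults and assert it holds up.
assert all(resilient_call(n) == n * 2 for n in range(1000))
print("survived injected faults")
```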

Gandalf has been running for nearly four years, initially for some key Azure infrastructure components, stopping deployments that would otherwise have caused critical failures. It now covers more Azure components, including hot-patching Azure hosts: “We are creating holistic monitoring solutions for Azure host resource health and blocking rollouts causing host resource leaks like memory and disk space,” Microsoft said.

“We are building intelligence for gating the quality of new builds of Azure infrastructure components using AIOps prior to rolling out a component to production. The key idea is to build a pre-production environment that can run A/B testing for representative customer workloads. This pre-production environment also has a good representation of settings in a production environment (hardware, VM SKU, etc.). This system gets feedback from Gandalf, so that similar issues captured in the production environment will be avoided when launched.”
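
In practice, A/B testing a new build against the current one comes down to running the same representative workload on both and comparing health metrics. A rough, hypothetical sketch of such a gate (the metric and threshold are invented, not Microsoft’s actual system):

```python
# Toy A/B gate: run the same workload on the current and candidate builds and
# compare a health metric (here, error rate). Numbers are invented.
def ab_gate(control_errors: int, control_requests: int,
            candidate_errors: int, candidate_requests: int,
            max_relative_increase: float = 0.10) -> str:
    """Fail the candidate if its error rate is >10% worse than the control's."""
    control_rate = control_errors / control_requests
    candidate_rate = candidate_errors / candidate_requests
    if candidate_rate > control_rate * (1 + max_relative_increase):
        return "fail: candidate regresses the error rate"
    return "pass"

print(ab_gate(control_errors=12, control_requests=10_000,
              candidate_errors=30, candidate_requests=10_000))  # fail
```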

Gandalf now looks at more signals. “We have started to explore the ideas of correlating signals across the whole Azure stack from datacentres ([like] temperature, humidity), hardware, host environment, to customer experience.” And it’s getting smarter about correlating failures. “We’re working to put higher weight to the failures that impact mission critical customers or high-cost services,” a spokesperson said.
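
Weighting failures by the customers or services they hit is, conceptually, a weighted sum rather than a raw failure count. A hypothetical illustration, with invented tiers and weights:

```python
# Hypothetical weighted impact score: failures affecting mission-critical
# customers or high-cost services count for more than a raw failure count.
WEIGHTS = {"mission_critical": 10.0, "high_cost": 5.0, "standard": 1.0}

def weighted_impact(failures: list[str]) -> float:
    """failures is a list of customer/service tiers hit by the regression."""
    return sum(WEIGHTS.get(tier, 1.0) for tier in failures)

# Two mission-critical hits outweigh ten standard ones.
print(weighted_impact(["mission_critical", "mission_critical"]))  # 20.0
print(weighted_impact(["standard"] * 10))                         # 10.0
```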

It’s also being applied to changes to the settings in Azure as well as the components that make up the service. “In addition to payload rollout safety, we are building intelligence to make any changes (settings) in production safe.”

AIOps for enterprises

Gandalf is part of how Microsoft protects Azure, and as with other internal tools like its chaos engineering platform, the company is considering packaging some of these AIOps deployment techniques as a service to protect customer workloads.

Microsoft Defender for Cloud (previously known as Azure Security Center) and Sentinel cloud SIEM already use similar machine learning techniques for security, Russinovich noted. “AIOps is effectively operating there to look at the data and determine where there’s an incident. [The way] we’ve been using AIOps looking at the telemetry inside Azure to understand if there’s a regression or an incident with hardware or software someplace will show up in services supporting the monitoring data we’ve got, like Azure Monitor,” he suggested.

Microsoft already has Azure customers who are operating at the same scale as its own internal services, and large organizations are already using AIOps tools to manage their own infrastructure, so it makes sense to give them these tools to work reliably at cloud scale.