
Organizations are spending an average of $2.5 million per year on on-call operations, according to a report by Dimensional Research and automation provider Shoreline.io. They also suffer an average of 8.7 major incidents each year, 62% of which escalate to the C-suite, the Benchmarking Production Operations Report found.

The report highlights a number of challenges and opportunities for the cloud operations industry, maintaining that even though organizations are spending millions of dollars per year on on-call operations, they continue to suffer major outages that impact customer and employee productivity.

Cloud reliability challenges

Some 97% of organizational leaders said they prioritize cloud reliability. Yet despite this focus, companies highlight several major impediments to improving reliability. At the top of the list is the complexity of the environments they are managing.

“As a company’s product complexity increases, it becomes harder and harder to find SRE [site reliability engineering] and DevOps professionals that have the breadth of experience needed,” the report said.


The second biggest issue respondents cited is the lack of time to focus on preventing incidents or automating fixes. “This truly becomes a vicious cycle where the less time a team has, the less they can invest in improvements, while the product continues to grow and become more complex,” the report noted. “As the load on operations teams increases, people leave, causing the burden to be shared by fewer people.”

The report makes the case for organizations to start investing in incident prevention and repair automation right away, no matter where they are on their journey.

Among the other key findings:

  • Service providers and human error are responsible for 72% of major incidents
  • Human error is 5x more likely to cause a major outage than automation error
  • The average time to resolve escalated incidents is 10.7 hours
  • Fifty-five percent of incidents are escalated to second-line responders or experts outside of the on-call team
  • Forty-eight percent of incidents are low-value, repetitive toil

As more organizations prioritize reducing the total number of incidents, decreasing costs, and shortening the time to recover, the survey showed how pressing the reliability problem has become:

  • Ninety-eight percent of organizations face challenges in delivering highly reliable cloud applications
  • SRE teams grew 26% in the last 12 months
  • Cloud footprints grew 38% in the last 12 months
  • Modern technologies are making infrastructure management more difficult, with 73% reporting that multicloud makes their job harder and 52% reporting that Kubernetes and microservices make their job harder

“The growth of cloud footprints is outpacing the growth of on-call teams,” said Diane Hagglund, principal at Dimensional Research, in a statement. “Cloud environments are becoming increasingly complex while it is particularly challenging to find staff with the expertise to meet on-call needs, leaving incident response teams struggling to meet reliability demands.”


How to improve on-call productivity

The report details several recommendations for improving on-call productivity, including:

Ensure incident management systems provide insight

Ninety-eight percent of organizations reported struggles with their incident management approach. Using ticketing data to gain insight into on-call operations is key to uncovering opportunities to improve productivity.
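As a rough illustration of what mining ticketing data might look like, the sketch below computes three of the kinds of figures the report discusses: the share of incidents that are escalated, the share that are repeats, and the average time to resolve escalated incidents. The `Ticket` fields, the sample records, and the rule that any category seen more than once counts as repetitive are assumptions made for this example; they do not come from the report or from any particular ticketing system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields; a real ticketing export will look different.
    category: str            # short label for the kind of incident
    escalated: bool          # True if handed to second-line responders
    hours_to_resolve: float  # time from open to close, in hours


def summarize(tickets):
    """Return the escalation rate, repeat share, and mean hours for escalated tickets."""
    total = len(tickets)
    if total == 0:
        return {"escalation_rate": 0.0, "repeat_share": 0.0, "mean_hours_escalated": 0.0}

    escalated = [t for t in tickets if t.escalated]

    # Assumption: a category that appears more than once is treated as repetitive work.
    counts = Counter(t.category for t in tickets)
    repeats = sum(n for n in counts.values() if n > 1)

    mean_hours = (
        sum(t.hours_to_resolve for t in escalated) / len(escalated) if escalated else 0.0
    )
    return {
        "escalation_rate": len(escalated) / total,
        "repeat_share": repeats / total,
        "mean_hours_escalated": mean_hours,
    }


# Made-up sample data, for illustration only.
SAMPLE = [
    Ticket("disk-full", False, 0.5),
    Ticket("disk-full", False, 0.5),
    Ticket("cert-expired", True, 6.0),
    Ticket("db-failover", True, 10.0),
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In practice, how a team defines a "repeat" or an "escalation" depends on how its ticketing system records incidents, so the grouping rule here is only a starting point.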

Attack escalations

The biggest opportunity to improve on-call productivity lies in reducing incident escalations, which account for 78% of on-call time. Investing in self-service tools to empower support teams will not only reduce the total number of escalations but also provide more comprehensive diagnostic data.

Attack repetitive, low-value work or toil

Forty-eight percent of incidents are repetitive. That presents an opportunity to create self-healing incident remediation, freeing teams from repetitive tasks so they can dedicate more time to improving resiliency, securing environments, and lowering costs, further improving productivity.

“The current approach to on-call is unsustainable, with the rapid growth of cloud infrastructure leaving SRE teams faced with thousands of hours of work per month,” said Anurag Gupta, founder and CEO at Shoreline.io, in a statement. “Utilizing automation to address escalations and eliminate low value, repetitive work will dramatically improve team productivity and overall customer experience.”

Dimensional Research said over 300 on-call practitioners, managers and executives were polled to learn about incident response in production cloud environments. Survey participants are responsible for running businesses that manage anywhere from fewer than 20 to more than 10,000 nodes, the firm said.