On-call cloud operations cost organizations an average of $2.5 million per year

Ticketing data is key to gaining insight into on-call operations and uncovering opportunities to improve productivity, according to a new report from Dimensional Research and Shoreline.io.

Organizations are spending an average of $2.5 million per year on on-call operations, according to a report by Dimensional Research and automation provider Shoreline.io. They also suffer an average of 8.7 major incidents each year, 62% of which escalate to the C-suite, the Benchmarking Production Operations Report found.

The report highlights a number of challenges and opportunities for the cloud operations industry, maintaining that even though organizations are spending millions of dollars per year on on-call operations, they continue to suffer major outages that impact customer and employee productivity.

Cloud reliability challenges

Some 97% of organizational leaders said they prioritize cloud reliability. Yet despite this focus, companies highlight several major impediments to improving reliability. At the top of the list is the complexity of the environments they are managing.

“As a company’s product complexity increases, it becomes harder and harder to find SRE [site reliability engineering] and DevOps professionals that have the breadth of experience needed,” the report said.

The second-biggest issue respondents cited is the lack of time to focus on preventing incidents or automating fixes. “This truly becomes a vicious cycle where the less time a team has, the less they can invest in improvements, while the product continues to grow and become more complex,” the report noted. “As the load on operations teams increases, people leave, causing the burden to be shared by fewer people.”

This report makes the case for organizations to start investing in incident prevention and repair automation right away, no matter where they are on their journey.

Among the other key findings:

Service providers and human error are responsible for 72% of major incidents
Human error is 5x more likely to cause a major outage than automation error
The average time to resolve escalated incidents is 10.7 hours
Fifty-five percent of incidents are escalated to second-line responders or experts outside of the on-call team
Forty-eight percent of incidents are low-value, repetitive toil

As more organizations prioritize reducing the total number of incidents, decreasing costs, and shortening the time to recover, the survey indicated how significant reliability is:

Ninety-eight percent of organizations face challenges in delivering highly reliable cloud applications
SRE teams grew 26% in the last 12 months
Cloud footprints grew 38% in the last 12 months
Modern technologies are making infrastructure management more difficult, with 73% reporting that multicloud makes their job harder and 52% reporting that Kubernetes and microservices make their job harder

“The growth of cloud footprints is outpacing the growth of on-call teams,” said Diane Hagglund, principal at Dimensional Research, in a statement. “Cloud environments are becoming increasingly complex while it is particularly challenging to find staff with the expertise to meet on-call needs, leaving incident response teams struggling to meet reliability demands.”

How to improve on-call productivity

The report details several recommendations for improving on-call operations, including:

Ensure incident management systems provide insight

Ninety-eight percent of organizations reported struggles with their incident management approach. Using ticketing data to gain insight into on-call operations is key to uncovering opportunities to improve productivity.
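As a rough illustration of what mining ticketing data could look like, the sketch below computes an escalation rate and the share of tickets that fall into repeating categories. It is not taken from the report: the `Ticket` fields, the `summarize` function and the `repeat_threshold` cutoff are all hypothetical, and a real ticketing export would need its own field mapping.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields; a real ticketing export will differ.
    category: str    # e.g. "disk-full", "cert-expired"
    escalated: bool  # True if handed to second-line responders


def summarize(tickets: list[Ticket], repeat_threshold: int = 3) -> dict:
    """Summarize escalation rate and share of tickets in repeating categories."""
    total = len(tickets)
    if total == 0:
        return {"escalation_rate": 0.0, "repetitive_share": 0.0, "top_categories": []}

    counts = Counter(t.category for t in tickets)
    escalated = sum(1 for t in tickets if t.escalated)
    # A category counts as "repetitive" once it appears repeat_threshold times
    # (an assumed cutoff, not a definition from the report).
    repetitive = sum(n for n in counts.values() if n >= repeat_threshold)

    return {
        "escalation_rate": escalated / total,
        "repetitive_share": repetitive / total,
        "top_categories": counts.most_common(3),
    }
```

Categories that dominate `top_categories` while rarely escalating would be natural first candidates for automation; those that escalate often may point to gaps in tooling or documentation.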

Attack escalations

The biggest opportunity to improve on-call productivity is reducing incident escalations, which account for 78% of on-call time. Investing in self-service tools to empower support teams will not only reduce the total number of escalations but also provide more comprehensive diagnostic data.

Attack repetitive, low-value work or toil

Forty-eight percent of incidents are repetitive, presenting an opportunity to create self-healing incident remediation that frees teams of repetitive tasks so they can dedicate more time to improving resiliency, securing environments, and lowering costs to further improve productivity.
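One simple way to picture self-healing remediation is a lookup from a known incident signature to an automated fix, with anything unrecognized still going to a human. The sketch below is a minimal illustration under that assumption; the `RUNBOOKS` registry, the `runbook` decorator and the `disk-full` action are hypothetical names and do not represent Shoreline.io's product or any specific tool.

```python
from typing import Callable

# Hypothetical registry: incident signature -> automated fix.
RUNBOOKS: dict[str, Callable[[str], str]] = {}


def runbook(signature: str):
    """Register a function as the automated fix for one incident signature."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        RUNBOOKS[signature] = fn
        return fn
    return register


@runbook("disk-full")
def clear_temp_files(host: str) -> str:
    # Placeholder: a real action would run a command on the host.
    return f"cleared temp files on {host}"


def handle(signature: str, host: str) -> tuple[str, str]:
    """Run the registered fix if one exists; otherwise escalate to on-call."""
    fix = RUNBOOKS.get(signature)
    if fix is None:
        return ("escalated", f"no runbook for {signature!r} on {host}")
    return ("auto-remediated", fix(host))
```

For example, `handle("disk-full", "web-1")` would run the registered fix, while an unknown signature such as `"oom"` would still be escalated.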

“The current approach to on-call is unsustainable, with the rapid growth of cloud infrastructure leaving SRE teams faced with thousands of hours of work per month,” said Anurag Gupta, founder and CEO at Shoreline.io, in a statement. “Utilizing automation to address escalations and eliminate low value, repetitive work will dramatically improve team productivity and overall customer experience.”

Dimensional Research said more than 300 on-call practitioners, managers and executives were polled to learn about incident response in production cloud environments. Survey participants are responsible for running businesses that manage anywhere from fewer than 20 to more than 10,000 nodes, the firm said.
