How Hundred-Year-Old Enterprises Improve IT Ops using Data and AIOps

Drive has 700+ articles for digital transformation leaders written by StarCIO Digital Trailblazer, Isaac Sacolick.

What do three multibillion-dollar companies that have been around for over one hundred years have in common? There might be straightforward answers if they were in the same industry, but what if one is in media, another in financial services, and a third in food service distribution?

IT leaders from Wiley, OneMain Financial, and US Foods presented at BigPanda’s recent Resolve ’21 and Pandapalooza event on how they’re modernizing their IT operations with AIOps. I’ve already shared insights from this event, including 3 AIOps secrets that boost quick business impact and seven lessons from IT leaders on operating at digital speeds with AIOps. This post explores how companies that must continually reinvent themselves use data and machine learning to deliver great IT service management experiences.

Keep in mind that information technology wasn’t around when these three companies were founded, and they introduced many of the systems running their businesses over decades. Yet at the event, their leaders presented how they are using machine learning and automation to improve the mean time to recovery (MTTR) from IT incidents and increase the reliability and performance of their systems.

I was most interested in seeing how these leaders used AIOps and leveraged data in IT Operations.

Use DevOps to Improve Data Quality

Didier Le Tien, VP of Application Development at US Foods, explained how having clean operational data was critical to support production applications. He states, “Changing your process through tools gives you an opportunity to collect the better quality data needed to prove or disprove you are on the right track. It’s one of the key elements to be more data-driven. This data has allowed us to think outside of the box when it comes to our operations, for example, having the visibility to identify production issues faster, use data to improve troubleshooting, and then address potential bugs. Because you have the data, concepts like AIOps became a reality for us.”

I love these comments because they illustrate:

The importance of creating and cleansing data when instituting new processes and tools
How having cleansed operational data helps teams think outside of the box
Their targeted improvement metrics using AIOps and open-box machine learning capabilities

Reduce Alert Fatigue – Automation and Machine Learning

Sam Chatman, VP of IT Ops at OneMain Financial, explains the impact of leveraging AIOps: “Being able to understand what is released, when it’s released, and the potential impacts of that release. We are overcoming alert fatigue, and BigPanda will be our Watson of the Enterprise Monitoring Center (EMC) by automating alerts, opening incident tickets, and identifying those actions to improve our mean time to recovery. This helps us keep our systems up when our users and customers need them to be.”

For other organizations, it might help to visualize what naturally happens to IT operations’ monitoring programs over time. Every time systems go down and IT gets thrown under the bus for a major incident, teams add new monitoring systems and alerts to improve their response times. As new multicloud, database, and microservice technologies emerge, they add even more monitoring tools and increase their observability capabilities.

Having more operational data and alerts is a good first step, but alert fatigue kicks in when tier-one support teams respond and must make sense of dozens to thousands of alerts. OneMain has broken that cycle by establishing an EMC, investing in AIOps, focusing on customer experience, and addressing alert fatigue.

OneMain Financial’s EMC is relatively new, and it has already made a significant business impact. Sam shares one best practice: overcoming alert fatigue requires not only better data but also tools for automating aspects of the response. The automation improves communications and frees up time so that IT operations can focus on troubleshooting and restoring service. As Sam points out, the shift from tasks to problem-solving helps move everyone’s focus to improving customer and end-user experience.

Enable Actionable Insights – Improve Signal to Noise Ratios

If automation is part of how IT Operations improve recovery times, then reducing noisy alerts to a correlated and manageable number of incidents is another best practice. Kiran Venkatesan, Architect at Wiley, shares a core practice in improving the signal to noise ratio in the data used by IT Ops for incident management.

Kiran says, “If there is a lot of noise, then there is no benefit. We have started measuring compression rates in how much noise is generated by event monitoring tools. How many alerts are duplicated, can be aggregated, or are correlated? How much of an actionable incident is produced based on all of the enrichment that is going in within the context of the particular business service?”
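To make the idea of a compression rate concrete, here is a minimal sketch in Python. It is not Wiley’s or BigPanda’s method: the alert fields (service, check, host), the grouping rule, and the formula (one minus incidents divided by alerts) are all assumptions chosen for illustration.

```python
from collections import defaultdict

def compression_rate(alerts, key=("service", "check")):
    """Group alerts that share the same key fields and report the reduction.

    Assumed definition: rate = 1 - (number of groups / number of alerts).
    """
    if not alerts:
        return 0.0, {}
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts with identical key fields are treated as one incident.
        groups[tuple(alert[k] for k in key)].append(alert)
    rate = 1 - len(groups) / len(alerts)
    return rate, dict(groups)

# Hypothetical sample data: three hosts raise the same alert for one service.
alerts = [
    {"service": "checkout", "check": "cpu_high", "host": "web-1"},
    {"service": "checkout", "check": "cpu_high", "host": "web-2"},
    {"service": "checkout", "check": "cpu_high", "host": "web-3"},
    {"service": "search", "check": "latency", "host": "api-1"},
]

rate, incidents = compression_rate(alerts)
print(f"{len(alerts)} alerts -> {len(incidents)} incidents ({rate:.0%} compression)")
```

In this sample, four alerts collapse into two incidents, a 50% compression rate. Real correlation would likely also consider time windows and topology, which this sketch leaves out.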

So improving IT operations takes more than cleansed and correlated data: the data must lead to actionable, accurate, and at least partially automated responses. One important step is to map incidents to the impacted business services, define service level objectives, and improve communications.

Better Data Enables Automatic Incident Triage

The next step in the journey goes beyond reducing alert noise, correlating monitoring data, and enabling response automations. In the middle of the incident management process are bridge calls, war rooms, and other group efforts between subject matter experts. Their goal is to work collaboratively with all the available data and aim to troubleshoot issues, identify root causes, and prescribe courses of action.

Even as the operational data quality improves, the triage process can be the longest, most painful step in the incident pipeline.

BigPanda customers talked about ways their IT operations take advantage of automatic incident triage. Context is automatically added to each incident, including identifying the impacted business services, the teams who must stay informed, and the type of issues that need addressing. With this context added to the incident, first-level teams can then route the incident to the appropriate support teams. The approach should eliminate the “all hands on deck” concepts prevalent in IT Ops teams that haven’t invested in AIOps. Helping IT operations triage incidents is very promising for IT leaders looking beyond improving MTTR. Proactive leaders also aim to reduce the number of monthly incidents and enhance IT support personnel’s work-life balance.
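As a rough illustration of adding context to an incident and routing it, here is a small Python sketch. Everything in it is hypothetical: the SERVICE_MAP lookup, the host-name prefix convention, the team names, and the Incident fields are assumptions for the example, not how BigPanda or any of these companies implement triage.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from host-name prefix to (business service, owning team).
SERVICE_MAP = {
    "web": ("Online ordering", "web-platform"),
    "db": ("Order database", "data-engineering"),
}

@dataclass
class Incident:
    host: str
    summary: str
    business_service: str = "Unknown"
    route_to: str = "tier-1"          # default: stays with first-level support
    notify: list = field(default_factory=list)

def enrich(incident):
    """Attach business-service context and pick a support team to route to."""
    prefix = incident.host.split("-")[0]
    if prefix in SERVICE_MAP:
        service, team = SERVICE_MAP[prefix]
        incident.business_service = service
        incident.route_to = team
        incident.notify = [team, "service-desk"]
    return incident

routed = enrich(Incident(host="db-07", summary="replication lag"))
print(routed.business_service, "->", routed.route_to)
```

An incident whose host matches no known prefix keeps its defaults and stays with tier-one support, so only the incidents with clear ownership skip the manual handoff.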

When you see that hundred-year-old enterprises recognize the importance of high system reliability and equip IT operations with AIOps tools to improve service levels, you sense how important both customer and employee experiences are to these companies. When you listen to their leaders, it becomes clear that many IT organizations have much to gain by improving IT operational data and investing in AIOps.

This post is brought to you by BigPanda

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.