Drive has 700+ articles for digital transformation leaders written by StarCIO Digital Trailblazer, Isaac Sacolick. Learn more.

The tools and practices of IT Operations have to get better and easier. Every IT Ops engineer has the job of responding to alerts when a website is down or unresponsive. To restore service, the engineer follows a certain procedure to restart the web server and validate that the website is operational. Maybe it happens again a few days later and another engineer repeats the procedure to restore service. If it happens yet again, a proactive engineer hopefully takes the initiative to develop a simple script that automates this procedure.
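
As a rough illustration of the kind of simple script described above, here is a minimal Python sketch. It assumes a Linux host where the web server runs as a systemd service; the `nginx` service name, the retry count, and the wait time are hypothetical placeholders rather than details from any real environment.

```python
import subprocess
import time
import urllib.request
from typing import Callable


def site_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError (including HTTP 4xx/5xx errors) and socket timeouts
        # are subclasses of OSError.
        return False


def restart_web_server(service: str = "nginx") -> None:
    """Restart the web server; the service name is an assumption."""
    subprocess.run(["systemctl", "restart", service], check=True)


def restore_service(
    check: Callable[[], bool],
    restart: Callable[[], None],
    attempts: int = 3,
    wait_seconds: float = 10.0,
) -> bool:
    """Restart until the health check passes or the attempts run out."""
    if check():
        return True  # Already healthy, nothing to do.
    for _ in range(attempts):
        restart()
        time.sleep(wait_seconds)  # Give the server time to come back.
        if check():
            return True
    return False  # Still down: escalate to a human.
```

An engineer would call `restore_service` with a check such as `lambda: site_is_up(url)` and `restart_web_server` as the restart action. Passing both in as functions keeps the retry logic testable without touching a real server.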

Today, monitoring and automating incident response is both more important and more complex.

Businesses expect SaaS-competitive performance levels from their applications, especially ones that are directly used by customers. Customers not only expect applications to be available, but also want fast, secure, and consistent performance. When there is an issue, customers and leaders expect IT to resolve it very quickly: in seconds or minutes, not hours.

In addition, many of the underlying application architectures are more complex, with applications calling on more services, connecting to more databases, integrating with more data sources, and leveraging more third-party APIs and web components. Managing incident response on these applications is often slow and error-prone because of the number of subsystems that need to be reviewed, the number of tools being used to capture operational data, and the complexity of the procedures needed to restore service.

In larger organizations, managing multiple applications, web services, and databases in the cloud and on premises can get expensive because the volume of incidents can be high (and growing every day) and the amount of manual work associated with managing them is constantly increasing. Making matters more complicated is the disparity of tools that IT operations (and, increasingly, digital operations) uses to monitor, respond, recover, and perform root cause analysis.

Managing the increased demand and complexity in incident response

Adding more people to the incident response team is becoming a less tenable option for many IT organizations that are being asked to implement and manage more applications as part of their digital transformation programs, but with only marginal increases in the IT operations budgets.
CIOs need to be looking at new options to manage the growing complexity and to lead the transformation of their IT operation to a digital operation.
CIOs can do more with less by looking for tools that enable digital operations by:
  • Enabling the aggregation of data and analytics from multiple IT operational tools into a single management system.
  • Leveraging open box machine learning tools to process operational data and help identify systems that are the root cause of an application failure.
  • Automating the response to an increasing number and variety of incidents, to improve customer experience.
  • Measuring the improvements of key performance indicators such as MTTD (mean time to discovery), MTTA (mean time to acknowledge) and MTTR (mean time to repair).
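
To make the last point concrete, here is a minimal Python sketch of how these three indicators might be computed from incident timestamps. The `Incident` fields are hypothetical, and the definitions are one common convention assumed for illustration: MTTD runs from fault start to first alert, MTTA from alert to acknowledgement, and MTTR from fault start to restored service. Teams define these intervals differently, so treat the choices here as assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the fault actually began
    detected: datetime      # when monitoring raised the first alert
    acknowledged: datetime  # when an engineer picked it up
    resolved: datetime      # when service was restored


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def kpis(incidents: list[Incident]) -> dict[str, float]:
    """Compute MTTD, MTTA, and MTTR in minutes (assumed definitions)."""
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTA": mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "MTTR": mean_minutes([i.resolved - i.started for i in incidents]),
    }
```

Tracking the same three numbers before and after a tooling change gives a simple way to see whether the change helped. Note that `kpis` raises `statistics.StatisticsError` on an empty incident list.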

The complexity behind a single user journey

Let’s look at the diagrammed example of one user journey that goes across three different Node.js applications, leverages five different microservices deployed as Lambda functions, and performs transactions with two RDS databases, all deployed to AWS. These databases are also connected by three data pipeline services that are used to send updated data from enterprise systems hosted in a data center. The Node.js applications also connect to two external APIs and embed two other JavaScript widgets.
All in, there are twenty different systems that make up this user’s journey and that need to be monitored for incidents. But that doesn’t tell the full story. As shown, there are thirty-five different connections being made, and three of them go across a VPN between the public cloud’s VPC and the data center. All of these services are being monitored by a myriad of different tools such as AWS CloudWatch and DataDog on AWS, and SiteScope and Splunk in the data center. In addition, there are two different teams with operational responsibilities: the one for the data center uses ServiceNow, while the cloud DevOps team uses Jira Ops.
When there is an incident, the service that sends out the alert is not always the one that’s the root cause of the problem. Let’s say Service 5 is running a slow query on the DB2 database that’s impacting the performance of a handful of queries running through Service 4. Each of these queries isn’t slow enough to trip an alert, but the aggregate of their performance is slowing down App 3 significantly and it begins to send out an alert.
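
The arithmetic behind this scenario can be sketched in a few lines of Python. All numbers here are hypothetical: the thresholds, the latencies, and the assumption that one App 3 request runs its Service 4 queries one after another are invented purely to illustrate how per-query monitors can stay quiet while the end-to-end request trips an alert.

```python
QUERY_ALERT_MS = 500      # hypothetical per-query alert threshold
REQUEST_ALERT_MS = 2000   # hypothetical end-to-end threshold for App 3


def queries_over_threshold(latencies_ms: list[int]) -> list[int]:
    """Return the individual query latencies that would trip an alert."""
    return [ms for ms in latencies_ms if ms > QUERY_ALERT_MS]


def request_alerts(latencies_ms: list[int]) -> bool:
    """True if the summed latency of sequential queries trips the app alert."""
    return sum(latencies_ms) > REQUEST_ALERT_MS


# Hypothetical latencies for the queries behind one App 3 request.
normal = [120, 150, 110, 140, 130]     # sums to 650 ms
degraded = [450, 480, 430, 470, 460]   # sums to 2290 ms

# In the degraded case no single query exceeds 500 ms,
# yet the request as a whole exceeds 2000 ms.
no_query_alerts = queries_over_threshold(degraded) == []
app_alert_fires = request_alerts(degraded)
```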
Without automation, the person in IT Ops responding to this alert needs to check several monitors across CloudWatch and DataDog to investigate the slowness in App 3. She may find the slow queries but will have a hard time pinpointing which service and query started it all.
What she can’t easily see is that an ETL from Data Pipeline 3 kicked off just before these queries began to slow down. She will totally miss this point because it’s outside her area of responsibility, and the data center team won’t notice the problem because from their perspective, the data pipeline is running normally.
Meanwhile, customers are suffering. How long will it be until this mess is sorted out and performance restored?

Leveraging AI and automation in incident response

Now let’s look at this same scenario when there is some automation and open box machine learning in place through an autonomous digital operations platform like BigPanda.
With such a platform, alerts from CloudWatch, DataDog, SiteScope, and Splunk are aggregated and then correlated, in real time, into discrete incidents. This means that for App 3, all the alerts from the underlying services, databases (including the ones in the data center), and the data pipelines are correlated into a single incident. When alerts from App 3 are triggered, the open box machine learning algorithm determines that Service 5’s query was the first performance issue and that it has a dependent data pipeline that is running. The automation then also opens tickets in ServiceNow and Jira Ops with these details, to help the digital operations team coordinate, review, and resolve the issue.
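
To give a feel for what correlating alerts into one incident involves, here is a deliberately simplified Python sketch. It is not BigPanda’s algorithm and uses no machine learning: it assumes a hand-written dependency map (`DEPENDS_ON`, hypothetical), groups alerts on connected components that fall within a time window before the triggering alert, and treats the earliest one as the probable starting point. The `Alert` fields and the sample timings are likewise invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    source: str          # monitoring tool that raised the alert
    component: str       # system the alert is about
    raised_at: datetime


# Hypothetical dependency map: component -> components it calls.
DEPENDS_ON = {
    "App 3": ["Service 4"],
    "Service 4": ["DB2"],
    "Service 5": ["DB2"],
    "DB2": ["Data Pipeline 3"],
}


def related_components(root: str) -> set[str]:
    """Everything connected to root, following edges in both directions."""
    neighbours: dict[str, set[str]] = {}
    for caller, callees in DEPENDS_ON.items():
        for callee in callees:
            neighbours.setdefault(caller, set()).add(callee)
            neighbours.setdefault(callee, set()).add(caller)
    seen, stack = {root}, [root]
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def correlate(trigger: Alert, alerts: list[Alert], window: timedelta) -> list[Alert]:
    """Related alerts raised within `window` before the trigger, oldest first."""
    scope = related_components(trigger.component)
    incident = [
        a for a in alerts
        if a.component in scope
        and timedelta(0) <= trigger.raised_at - a.raised_at <= window
    ]
    return sorted(incident, key=lambda a: a.raised_at)
```

Under these assumptions, the first element of the list returned by `correlate` is the earliest related alert, which is a reasonable first place to look. Following edges in both directions is what lets Service 5 be linked to App 3 through the shared DB2 database, though in a large topology that rule would pull in far too much and would need tightening.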
Over time, you can expect the automation and AI to improve. For example, after open box machine learning correlates alerts into problematic incidents, the automation could trigger scripts to resolve the issue.
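
One way to picture automation triggering scripts is a lookup from incident type to a remediation runbook, with anything unrecognized escalated to a person. The Python sketch below is illustrative only: the incident type names and runbook functions are hypothetical, and the runbooks return a description of the action instead of performing it.

```python
from typing import Callable

# A runbook takes the affected component and returns a summary of what it did.
Runbook = Callable[[str], str]


def cancel_slow_query(component: str) -> str:
    # Placeholder: a real runbook would call the database's admin interface.
    return f"cancelled long-running query on {component}"


def restart_service(component: str) -> str:
    # Placeholder: a real runbook would call the deployment platform.
    return f"restarted {component}"


# Hypothetical mapping from incident type to an approved runbook.
RUNBOOKS: dict[str, Runbook] = {
    "slow_query": cancel_slow_query,
    "unresponsive": restart_service,
}


def remediate(incident_type: str, component: str) -> str:
    """Run the matching runbook, or escalate when none is registered."""
    runbook = RUNBOOKS.get(incident_type)
    if runbook is None:
        return f"no runbook for {incident_type!r}; escalating {component} to on-call"
    return runbook(component)
```

Keeping the mapping explicit means only incident types that someone has reviewed get an automated response, while everything else still reaches the on-call engineer.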
But as this example illustrates, improving MTTD and MTTR is not trivial when user experience is increasingly tied to many different microservices, databases, services, and integrations. Digital operations teams need to find IT Ops tools that make it easy to integrate with a diverse set of IT systems and monitoring tools, correlate data from all of these sources into actionable incidents, and automate various aspects of incident response. Such tools help maximize the uptime and performance of customer-facing applications and services.
This post is brought to you by BigPanda.io
 
The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.io.
