Drive has 700+ articles for digital transformation leaders written by StarCIO Digital Trailblazer, Isaac Sacolick. Learn more.

An application is waiting more than three seconds for an API’s response. The response time exceeds the performance requirements for this API, so a monitoring tool triggers an alert that automatically creates an incident ticket. By the time a service manager in IT Ops responds, the API shows acceptable response performance, and the ticket is closed without investigation.

How AIOps Help SREs Measure Error Budgets and Fulfill SLOs - Sacolick

What the service manager doesn’t see is that this is the fifth time in two weeks that this API tripped alerts, and two customer service complaint tickets are likely related to the problem. This IT group isn’t using AIOps to correlate alerts and automate integrations between tools, so recognizing this customer-impacting and recurring problem, triaging the root cause, and prioritizing its remediation are not on anyone’s radar. Instead, IT is investing time to close tickets while customers are complaining.

What Are Service Level Objectives and Error Budgets?

IT organizations must manage to higher expected service levels while supporting a mix of cloud-native applications, microservices, and legacy monolithic applications. But progressive IT organizations, including several leaders at hundred-year-old companies, are investing in AIOps, establishing SRE practices, changing how DevOps teams improve application reliability, resolving incidents faster, and reducing alert fatigue.

I spoke to Jason Walker, field CTO at BigPanda, about applying SRE methodologies, measuring Service Level Objectives (SLOs), and managing error budgets using AIOps capabilities.

Jason acknowledges that more people in IT Ops and their business stakeholders must understand SRE terminology and methodologies. He explains, “Error budgets are a useful way to think about issues in the context of providing a reliable service. Maybe you’ve decided, ‘my SLO is 99.9 percent,’ and the ratio of failures to attempts is going to be my service level indicator (SLI). You can only afford one failure for every 1,000 attempts. That’s your error budget.”

So instead of counting failures against time, such as the number of alerts per week, service level objectives capture error events as a percentage of total events.
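To make the arithmetic concrete, here is a minimal Python sketch of an event-based calculation using the 99.9 percent SLO from the quote. The function names and the sample counts (3 failures in 10,000 attempts) are illustrative assumptions, not taken from any particular tool.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail, with the SLO given as a fraction (0.999 = 99.9%)."""
    return 1.0 - slo


def sli_failure_ratio(failures: int, attempts: int) -> float:
    """SLI expressed as the ratio of failed events to total events."""
    if attempts == 0:
        return 0.0  # no traffic, so nothing has failed
    return failures / attempts


def budget_consumed(failures: int, attempts: int, slo: float) -> float:
    """Share of the error budget used so far (1.0 means the budget is fully spent)."""
    return sli_failure_ratio(failures, attempts) / error_budget(slo)


slo = 0.999
# Allowed failures per 1,000 attempts under a 99.9 percent SLO.
allowed_per_thousand = round(error_budget(slo) * 1000)
# Hypothetical sample: 3 failures out of 10,000 attempts.
used = budget_consumed(failures=3, attempts=10_000, slo=slo)
print(allowed_per_thousand, round(used, 2))
```

With these sample numbers, the service has used roughly 30 percent of its budget, which is the kind of figure a burndown report would track over time.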

How Using Error Budgets Reduces Alert Fatigue

Using SLOs can change the business and operational mindset on how to monitor, what to measure, when to alert, and how IT Ops responds to incidents.

SREs use burndown reports for monitoring error rates in the same way developers use this type of report to monitor sprint, release, and epic burndowns. Alerts are generated only when the burndown exceeds the error budget for a designated time period. Some groups also use predictive algorithms to consider whether errors are trending in that direction.

Walker goes on to explain how measuring errors and tracking error budgets with burndowns changes the approach. He says, “Sustained breaching over that ratio for a given period or spiking by exceeding the ratio by a significant amount should trigger an SLO alert so that you can take action.  You can scale it up to the business service level and measure it down to the microservice level.”
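The two triggers Walker describes, sustained breaching and spiking, can be sketched as a simple check over recent measurement windows. This is a minimal illustration under stated assumptions: the `Window` type, the three-window rule for "sustained," and the 10x multiplier for "spiking" are hypothetical choices, and real tools will define these thresholds differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Window:
    """Failure and attempt counts for one measurement window (for example, one hour)."""
    failures: int
    attempts: int


def failure_ratio(w: Window) -> float:
    return w.failures / w.attempts if w.attempts else 0.0


def slo_alert(windows: List[Window], budget: float,
              sustained_windows: int = 3, spike_factor: float = 10.0) -> Optional[str]:
    """Return 'spike', 'sustained', or None based on the most recent windows.

    'spike': the latest window's ratio is at least spike_factor times the budget.
    'sustained': each of the last sustained_windows windows is over the budget.
    Both thresholds are assumed values chosen for illustration.
    """
    if not windows:
        return None
    if failure_ratio(windows[-1]) >= spike_factor * budget:
        return "spike"
    recent = windows[-sustained_windows:]
    if len(recent) == sustained_windows and all(failure_ratio(w) > budget for w in recent):
        return "sustained"
    return None


budget = 0.001  # error budget for a 99.9 percent SLO
history = [Window(0, 1000), Window(2, 1000), Window(3, 1000), Window(2, 1000)]
print(slo_alert(history, budget))
```

A single window slightly over budget returns no alert here, which is the behavior that cuts down on pages for short-lived blips.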

The approach helps reduce alert fatigue, a condition that plagues IT Ops when every issue automatically triggers an alert and sets off pagers. Business leaders can also collaborate with IT Ops to define error budgets with business context. For example, they may set higher SLOs and smaller error budgets during peak hours or peak seasons.
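One way to picture a business-defined SLO is as a target that depends on the time of day. The sketch below is only an illustration: the peak window (09:00 to 17:59) and both targets are hypothetical values that a business and IT Ops would agree on together.

```python
from datetime import datetime

# Hypothetical values: a stricter target during an assumed peak window.
PEAK_HOURS = range(9, 18)  # 09:00 through 17:59 local time
PEAK_SLO = 0.9995
OFF_PEAK_SLO = 0.999


def slo_for(timestamp: datetime) -> float:
    """Return the SLO target that applies at the given time."""
    return PEAK_SLO if timestamp.hour in PEAK_HOURS else OFF_PEAK_SLO


def error_budget_for(timestamp: datetime) -> float:
    """Return the matching error budget; a higher SLO leaves a smaller budget."""
    return 1.0 - slo_for(timestamp)


print(slo_for(datetime(2024, 1, 15, 10, 0)), slo_for(datetime(2024, 1, 15, 22, 0)))
```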

Managing Incidents with Error Budgets and AIOps Event Correlation

So, to go back to my example, the first API error probably does not trigger an alert or record an incident if the SLO for this service is being met and the error budget has not been exceeded. But by the fifth error in two weeks, chances are the error budget for this service is exceeded and requires action.

IT Ops teams using AIOps capabilities have an advantage when measuring error budgets. Let’s say the API alert triggers other alerts from the consuming microservice and several downstream applications. The AIOps open box machine learning algorithms can correlate these alerts and escalate them as one incident ticket to IT Ops. Tools then show the time sequence of alerts, which helps IT Ops triage the issue faster, and they can kick off automated responses that address known issues. The combination of these capabilities allows IT Ops to improve their mean time to resolution.
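To give a rough sense of what correlating alerts into one incident means, here is a deliberately simple Python sketch. It is not how any AIOps product works; it groups alerts that arrive close together in time from sources that are known to be related. The `Alert` and `Incident` types, the five-minute window, and the service names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str
    timestamp: float  # seconds since an arbitrary reference point


@dataclass
class Incident:
    alerts: list = field(default_factory=list)


def correlate(alerts, related, window_seconds=300.0):
    """Group alerts into incidents when they are close in time and their sources are related.

    `related` is a set of frozenset pairs naming sources that depend on each other.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            last = incident.alerts[-1]
            close = alert.timestamp - last.timestamp <= window_seconds
            linked = any(
                a.source == alert.source
                or frozenset((a.source, alert.source)) in related
                for a in incident.alerts
            )
            if close and linked:
                incident.alerts.append(alert)
                break
        else:
            # No existing incident matched, so this alert opens a new one.
            incidents.append(Incident([alert]))
    return incidents


related = {
    frozenset(("orders-api", "checkout-service")),
    frozenset(("checkout-service", "web-storefront")),
}
alerts = [
    Alert("web-storefront", 60.0),
    Alert("orders-api", 0.0),
    Alert("billing-db", 90.0),
    Alert("checkout-service", 30.0),
]
incidents = correlate(alerts, related)
print([[a.source for a in i.alerts] for i in incidents])
```

Because each incident keeps its alerts in time order, the first alert in the group is a natural starting point for triage.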

IT Ops also benefits by using the AIOps open integration hub that connects to ServiceNow, Jira, and Slack. Customer service is automatically notified of the issue and resolution via Slack, and when the SREs determine that the root cause is a code issue, a Jira defect is created on the appropriate team’s backlog.

How SREs Use Error Budgets to Prioritize App Improvements

Error budgets serve as a tool for IT Ops to recognize and prioritize which alerts require incident management. But SREs also use error budgets to prioritize the operational issues and technical debt that agile teams should invest development time in addressing.

These SREs use error budgets and their burndowns to have a dialog with agile product owners on prioritization. When business services, applications, dataops services, or microservices consistently exceed their error budgets, there should be a rationale to invest in the development effort to address root causes. On the other hand, if the product owner isn’t prioritizing remediations, then IT Ops may be justified in reducing the SLOs and managing to a larger error budget.
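A simple way to bring data to that conversation is to rank services by how far they have overrun their budgets. The following Python sketch assumes each service reports failures, attempts, and an SLO for the same period; the function name, service names, and counts are made up for illustration.

```python
def rank_by_budget_overrun(services):
    """Return the names of services over their error budget, worst overrun first.

    `services` maps a service name to a (failures, attempts, slo) tuple,
    with the SLO given as a fraction strictly between 0 and 1.
    """
    overruns = []
    for name, (failures, attempts, slo) in services.items():
        if not 0.0 < slo < 1.0:
            raise ValueError(f"SLO for {name} must be between 0 and 1")
        budget = 1.0 - slo
        ratio = failures / attempts if attempts else 0.0
        consumed = ratio / budget  # 1.0 means the budget is exactly spent
        if consumed > 1.0:
            overruns.append((consumed, name))
    return [name for consumed, name in sorted(overruns, reverse=True)]


# Hypothetical counts for one reporting period.
services = {
    "orders-api": (50, 10_000, 0.999),
    "search": (5, 10_000, 0.999),
    "billing": (120, 10_000, 0.995),
}
print(rank_by_budget_overrun(services))
```

Services that stay within budget drop out of the list, which keeps the discussion with the product owner focused on the ones that need development time.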

SREs using a topology mesh can show the dependencies and relationships between microservices, applications, databases, and business services to the product owner and application architects. So once there is agreement on upgrades and fixing defects, these maps help illustrate where development teams should focus on improvements.
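Underneath a dependency map like this is a graph that can be walked programmatically. The sketch below is a generic breadth-first traversal, not a representation of any vendor's topology feature, and the service names and graph shape are hypothetical.

```python
from collections import deque


def transitive_dependencies(graph, start):
    """List everything `start` depends on, directly or indirectly, in breadth-first order.

    `graph` maps each component to the components it calls or relies on.
    """
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node == start or node in seen:
            continue  # skip cycles and components already visited
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


# Hypothetical topology for a checkout business service.
graph = {
    "checkout": ["orders-api", "payments-api"],
    "orders-api": ["orders-db"],
    "payments-api": ["orders-db", "fraud-service"],
}
print(transitive_dependencies(graph, "checkout"))
```

In this made-up example, `orders-db` sits beneath both APIs, so a reliability fix there would benefit more than one consumer.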

Defining SLOs and error budgets is a key practice for IT organizations implementing digital transformations, hybrid working, cloud migrations, and other technology investments. Using AIOps in the implementation is a game-changer as it correlates alerts from multiple sources, streamlines incident reporting, supports faster issue triage, and enables workflow integrations.

This post is brought to you by BigPanda

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.


