
Harnessing ML for Cloud Resource Anomaly Detection: OpsNow’s Unique Model

OpsNow Team
December 11, 2023

In the ever-expanding universe of cloud computing, maintaining oversight of resource usage and costs is akin to navigating a labyrinth. Mixed signals and conflicting information quickly make well-intentioned efforts taxing. Anomalies in billing and cloud resources can signal critical issues ranging from budgetary blowouts to architectural inefficiencies, and anomaly detection can help, if built properly. While AI/ML is both a likely solution and a buzzword, OpsNow has implemented a true ML model, which we believe is the best and most trusted way to rein in cloud abnormalities in the background while you focus on your daily workload.

The Purpose of Anomaly Detection

Anomaly detection in cloud environments serves as an early warning system, designed to flag irregular patterns that deviate from the norm. These patterns could represent anything from a spike in data traffic, to rogue malware or unauthorized deployments, to an unexpected cost surge from misconfiguration. The goal is not just to detect these anomalies but to do so accurately enough that false positives are minimized and only genuine threats trigger alerts.

A Solid Model Built with Robust Design

At the core of our anomaly model is a duo of sophisticated forecasting algorithms: ARIMA and ETS. ARIMA (AutoRegressive Integrated Moving Average) models aim to describe the autocorrelations in the data. They do this by using a combination of past values and past forecast errors to forecast future points in a time series. The model consists of three parts: an AR term that regresses the variable on its own lagged values, an I (integrated) term that captures the degree of differencing needed to make the series stationary, and an MA term that models each observation's dependence on past forecast errors. When properly configured, ARIMA models are powerful at uncovering a wide range of time series patterns.
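To make the AR and I terms concrete, here is a minimal sketch of an ARIMA(1,1,0) forecast with a constant, fitted by least squares in plain Python. It is an illustration only, not OpsNow's implementation: the function names, the sample `daily_cost` series, and the 3-sigma threshold `k` are all assumptions, and the MA term is left out to keep the example short.

```python
from statistics import mean

def fit_arima_110(series):
    """Fit ARIMA(1,1,0) with a constant by conditional least squares.

    Needs at least four observations. Returns (const, phi, sigma),
    where sigma is the residual standard deviation.
    """
    # I term: difference once so the model works on day-over-day changes.
    diffs = [b - a for a, b in zip(series, series[1:])]
    # AR term: regress each change on the previous change.
    x, y = diffs[:-1], diffs[1:]
    mx, my = mean(x), mean(y)
    var_x = sum((a - mx) ** 2 for a in x)
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / var_x if var_x else 0.0
    const = my - phi * mx
    residuals = [b - (const + phi * a) for a, b in zip(x, y)]
    dof = max(len(residuals) - 2, 1)  # two fitted parameters
    sigma = (sum(r * r for r in residuals) / dof) ** 0.5
    return const, phi, sigma

def forecast_next(series, const, phi):
    """One-step-ahead forecast: last value plus the predicted change."""
    last_change = series[-1] - series[-2]
    return series[-1] + const + phi * last_change

def is_anomaly(series, observed, k=3.0):
    """Flag `observed` if it is more than k residual sigmas from the forecast."""
    const, phi, sigma = fit_arima_110(series)
    expected = forecast_next(series, const, phi)
    return abs(observed - expected) > k * sigma, expected

# Made-up daily cost figures, for illustration only.
daily_cost = [100, 102, 101, 104, 106, 105, 108, 110, 109, 112]
flagged, expected = is_anomaly(daily_cost, observed=160)
print(flagged, round(expected, 2))
```

A production model would normally select the ARIMA orders automatically and estimate all terms, MA included, by maximum likelihood with a statistics library.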

ETS (short for Error, Trend, Seasonality) decomposes a time series into its foundational components: error, trend and seasonality. ETS provides the flexibility to model the errors as either additive or multiplicative, the trend as additive (linear), multiplicative (exponential) or damped, and the seasonality as additive or multiplicative. The full ETS taxonomy includes 30 model variations accounting for different real-world time series properties. ETS models simplify time series forecasting into an automated learning process by selecting the variation that best fits the dataset.
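As a rough illustration of one member of the ETS family, the sketch below runs an additive-error, damped-additive-trend model with no seasonality, usually written ETS(A,Ad,N), and picks its smoothing parameters with a small grid search. The function names, the grid, the fixed damping factor `phi=0.9`, and the initial-state heuristic are assumptions made for this example; a real implementation would estimate every parameter and compare several ETS variations.

```python
from itertools import product

def run_ets_aadn(series, alpha, beta, phi=0.9):
    """Run ETS(A,Ad,N) over `series`.

    Returns (one_step_errors, next_forecast).
    """
    # Heuristic initial state (an assumption): first value and first change.
    level, trend = series[0], series[1] - series[0]
    errors = []
    for y in series[1:]:
        forecast = level + phi * trend      # damped-trend one-step forecast
        err = y - forecast                  # additive error
        errors.append(err)
        # State updates in error-correction form.
        level = forecast + alpha * err
        trend = phi * trend + beta * err
    return errors, level + phi * trend

def fit_ets(series, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick alpha and beta from a coarse grid by minimizing squared error."""
    best = None
    for alpha, beta in product(grid, grid):
        if beta > alpha:                    # keep beta no larger than alpha
            continue
        errors, next_forecast = run_ets_aadn(series, alpha, beta)
        sse = sum(e * e for e in errors)
        if best is None or sse < best["sse"]:
            best = {"sse": sse, "alpha": alpha, "beta": beta,
                    "forecast": next_forecast}
    return best

usage = [10, 12, 14, 16, 18, 20]            # made-up, steadily growing usage
fit = fit_ets(usage)
print(fit["alpha"], fit["beta"], round(fit["forecast"], 2))
```

The grid search stands in for the automated model selection described above: the parameters that best explain the history are the ones used for the next forecast.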

These algorithms are trained daily on fresh data – reflecting the constant variations of cloud patterns. This daily refresh cycle is what keeps the model in lockstep with reality, a critical feature that shields against the risks associated with stale data.

Why OpsNow's ML-Powered Anomaly Detection Is a Game Changer

1. **Daily Training Regimen**: Unlike models that grow obsolete with each passing day, OpsNow evolves. By training on new data daily, our engine maintains its edge, ensuring its alerts are based on current growth data and not outdated estimates.

2. **Principal Component Regression (PCR)**: Our use of PCR is OpsNow’s differentiator. PCR dives deep, using principal component analysis to sift through the noise and identify the root cause of anomalies. It's a method that doesn't just spot the issue but also understands it.

3. **Detailed Analysis**: The devil is in the details, and OpsNow thrives on them. By breaking down the data by service, region, and instance type, we avoid the one-size-fits-all trap, offering tailored insights that generic models can't match.

4. **Karhunen-Loève Transformation**: After PCA does its job, our Karhunen-Loève transformation steps in. This algorithm reconstructs the data from its principal components, revealing the actual source of the anomaly. It's the equivalent of having a map that leads straight to the issue, bypassing all the red herrings that more generic tools would have sent teams chasing.
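One common way to use PCA for locating the source of an anomaly is reconstruction error: project a new observation onto the principal components learned from normal history, reconstruct it, and see which dimension the reconstruction fails to explain. For discrete data the Karhunen-Loève transform coincides with the PCA projection, so the sketch below covers both steps. It is a simplified, assumed illustration, not OpsNow's actual pipeline: it keeps only the first component, finds it by power iteration, and uses made-up cost columns.

```python
def top_component(rows, iters=200):
    """Return (column_means, first_principal_component) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / d ** 0.5] * d                 # arbitrary unit start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return means, v

def reconstruction_errors(row, means, v):
    """Project `row` onto v, reconstruct it, and return per-column error."""
    centered = [x - m for x, m in zip(row, means)]
    score = sum(c * vi for c, vi in zip(centered, v))   # forward transform
    recon = [score * vi for vi in v]                    # inverse transform
    return [abs(c - r) for c, r in zip(centered, recon)]

# Hypothetical daily costs per dimension; columns normally move together.
columns = ["compute", "storage", "network"]
history = [[100, 50, 20], [110, 55, 22], [120, 60, 24],
           [90, 45, 18], [105, 52.5, 21], [115, 57.5, 23]]
means, v = top_component(history)

today = [110, 55, 40]                        # network cost has spiked
errors = reconstruction_errors(today, means, v)
suspect = columns[errors.index(max(errors))]
print(suspect)
```

The column with the largest reconstruction error is the one that breaks the usual relationship between dimensions, which is why it is a reasonable first place to look.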

Minimizing False Positives: A Balancing Act

One of the greatest challenges in anomaly detection is the elimination of false positives. By harnessing the collective power of ARIMA, ETS, and PCR, our model strikes a delicate balance. It's finely tuned to discern between a true anomaly and a small blip in the data, sparing teams from unnecessary fire drills. Having been in the CloudOps domain for years, we understand how important it is to have an anomaly detection system everyone can trust. 
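One simple way to trade sensitivity for fewer false positives is to require agreement between models before alerting. The sketch below assumes each model supplies an expected value and a residual standard deviation, and raises an alert only when enough of them consider the observation an outlier. The function name, the 3-sigma threshold, and the sample numbers are assumptions for illustration, not a description of how OpsNow combines its models.

```python
def consensus_alert(observed, forecasts, k=3.0, min_votes=None):
    """Alert only if enough models see `observed` as an outlier.

    forecasts: mapping of model name -> (expected_value, residual_sigma).
    min_votes: votes required; defaults to all models (unanimous).
    Returns (alert, votes_by_model).
    """
    votes = {
        name: abs(observed - expected) > k * sigma
        for name, (expected, sigma) in forecasts.items()
    }
    needed = len(votes) if min_votes is None else min_votes
    return sum(votes.values()) >= needed, votes

# Hypothetical one-day-ahead forecasts from two models.
forecasts = {"arima": (112.0, 1.5), "ets": (111.0, 2.0)}

print(consensus_alert(117, forecasts))   # only one model objects
print(consensus_alert(130, forecasts))   # both models object
```

Requiring a unanimous vote suppresses small blips that only one model dislikes, at the cost of reacting later to anomalies that only one model can see.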

Pinpointing Cost Savings

When an anomaly detection process is implemented as part of a cost management model, businesses can find pockets of unexpected usage, and savings. Active instances left over after projects, misconfigured environments (have you ever left sharding at default values?) and even mis-keyed instance sizes can all have a significant, negative impact on your monthly bill. In an active dev environment these issues happen day in and day out, and having the alerts and processes in place ultimately keeps costs low and well-managed.

The Bottom Line

By monitoring cloud usage with fine-grained ML and alerting quickly to irregularities, OpsNow plays a crucial role in maintaining budget discipline. We built our tooling based on years of experience and launched it in OpsNow to give enterprises a proactive approach to anomaly management. The result is the purposeful use of ML technology to solve complicated problems and prevent cost overruns. OpsNow anomaly detection is not just a tool; it's a cloud watchdog ensuring your resources are utilized efficiently and economically so you and your team can keep things headed in the right direction.

Get your FinOps in order and try OpsNow. Want a bit more help? Schedule a free, no-commitment 2-hour consultation with OpsNow.
