Thoughts on Data Science, IT Operations Analytics, programming and other random topics

Anomalies, alarms and actionability

09 Apr 2014

(first published on IBM Service Management 360 site)

In traditional IT performance management (PM), thresholding is the most common way to ensure that monitored environments behave as desired. More recently, tools such as IBM SmartCloud Analytics – Predictive Insights have begun using more sophisticated behavioral learning techniques to detect anomalies in IT environments in ways that are both different from and complementary to thresholding.

Anomaly events

A common approach for predictive analytics systems is to first develop an understanding of “normal” (air quotes here) behavior and look for changes. In data science-speak, these would be considered anomalies, but sometimes I simply refer to the approach as “change-based” anomaly detection.

When such anomalies are detected, whether by traditional means or by more sophisticated ones, the PM system typically produces an event and dispatches it to an event-management system. That’s where the fun starts. Depending on the severity of that alarm and the implied consequences, it’ll be greeted with an indifferent shrug or severe panic by the operators charged with keeping things running smoothly.

Too many shruggable alarms will undermine confidence in the PM system. Perhaps more important, these alarms can be costly to deal with. PM systems must therefore ensure that the alarms they produce are worthy of action.

Changes in system behavior

If you think about the essential mechanisms used to decide when there are anomalies, traditional thresholding applies explicit limits to behavior. Anything outside those limits is unacceptable. In the newer anomaly detection schemes, a change in system behavior is a key characteristic leading to the conclusion that an anomaly exists.
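To make the contrast concrete, here is a minimal sketch of the two mechanisms. It is purely illustrative and is not how any particular product works: the function names, the window size, the z-score limit and the sample CPU readings are all assumptions of mine. The change-based detector simply compares each sample with the mean and standard deviation of the few samples before it, which is about the simplest stand-in for "learned normal behavior" I can think of.

```python
from statistics import mean, stdev


def threshold_alarms(values, upper_limit):
    """Traditional thresholding: flag every sample above an explicit limit."""
    return [i for i, v in enumerate(values) if v > upper_limit]


def change_alarms(values, window=5, z_limit=3.0):
    """Change-based detection: flag samples that deviate from recent behavior.

    The baseline is the previous `window` samples (a hypothetical choice).
    """
    alarms = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any departure from it counts as a change.
            if values[i] != mu:
                alarms.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            alarms.append(i)
    return alarms


# Made-up CPU utilization readings (percent), one per polling interval.
cpu = [40, 42, 41, 39, 40, 41, 70, 71, 69, 70, 72, 95]

print(threshold_alarms(cpu, upper_limit=90))  # indexes above the fixed limit
print(change_alarms(cpu))                     # indexes where behavior shifted
```

With these readings the fixed limit of 90 only fires on the final sample, while the change-based detector also fires when utilization steps up from around 40 to around 70. Whether that earlier alarm is useful or just noise is exactly the question of actionability.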

Some changes to the environments are intentional and desirable. For example, if we had a system where a server had been running in an overloaded condition for some time, then a positive change might be to reconfigure to reduce the load. This is a change, but it’s not typically a change that requires further attention. Any resulting anomaly would not be actionable (unless of course one was looking for confirmation of the change).

In other cases, where behavior changes only slightly, there is no practical value in detecting and alarming on the change, and no action should result. If such an alarm were generated, a skilled operator would look at the “anomalous situation” and recognize it as acceptable behavior, while a less skilled operator might waste effort dealing with an unimportant situation.
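One simple way to picture the two cases above is as a filter sitting between the anomaly detector and the event-management system. The sketch below is a toy under stated assumptions: the `Anomaly` record, the `higher_is_worse` flag and the 20% minimum change are all hypothetical names and values I made up for illustration, and a real system would need far richer context to make this call.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    """A detected change in one metric (hypothetical record layout)."""
    metric: str
    baseline: float      # what "normal" looked like before the change
    observed: float      # what the metric looks like now
    higher_is_worse: bool = True


def is_actionable(anomaly, min_relative_change=0.20):
    """Return True only for changes that are both significant and for the worse."""
    change = anomaly.observed - anomaly.baseline
    if anomaly.baseline == 0:
        relative = float("inf") if change != 0 else 0.0
    else:
        relative = abs(change) / abs(anomaly.baseline)

    # Minor change: not worth an operator's time.
    if relative < min_relative_change:
        return False

    # Significant change: only alarm if things got worse, not better.
    if anomaly.higher_is_worse:
        return change > 0
    return change < 0


candidates = [
    Anomaly("cpu_load", baseline=90, observed=45),   # overloaded server relieved
    Anomaly("cpu_load", baseline=40, observed=42),   # minor wobble
    Anomaly("cpu_load", baseline=40, observed=70),   # significant degradation
    Anomaly("throughput", baseline=100, observed=50, higher_is_worse=False),
]

for a in candidates:
    print(a.metric, a.baseline, a.observed, is_actionable(a))
```

Here the relieved server and the minor wobble are suppressed, while the load increase and the throughput drop go through as alarms. The hard part, of course, is that a real system rarely knows up front which direction is "worse" or how big a change has to be before it matters.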

Actionable situations

The operators who receive these alarms do not generally care about the reasons why the PM system decided that an anomaly exists and why an alarm is generated. They care that the alarms are telling them something important and that the situation is actionable. Actionable here means that there is enough information for some action to be taken!

I’ve had customers tell me that delivering unimportant alarms, or, indeed, outright false positives, is one of the worst things that performance management systems do. Further, they’ve indicated that they’d rather have no alarms produced by the system than have ones that don’t meet the basic expectations of importance and actionability. Check out Ozgun Odabasi’s explanation of some good reasons to avoid false alarm events.

With traditional thresholding, the importance and actionability of the alarms are generally a given: they are implicit in the original reason for setting the thresholds. (This of course assumes the thresholds were set up properly initially, which is not always the case.) In any case, establishing and maintaining these thresholds is generally labor-intensive. With change-based anomaly detection approaches, things get a little trickier. They can save dramatically on labor, provided the PM system can discern important changes from unimportant ones, good changes from bad changes and so on.

We at IBM are actively working on developing schemes to help the PM system deal with these discernment problems. What do you think about these approaches? I would love to hear your thoughts on approaches that deal with distinctions between such anomalies and alarms as well as what your important considerations are in this area.

Please leave comments here or follow me on Twitter @rmckeown and we can discuss!
