Thoughts on Data Science, IT Operations Analytics, programming and other random topics

Building confidence in your predictive anomaly detection system

08 Apr 2015

(first published on IBM Service Management 360 site)

So you are faced with trialing or promoting a new predictive analytics anomaly detection system. Despite being excited by the promise of this new technology, you need to convince not only yourself but probably also skeptical colleagues that it is indeed worth adopting. How do you even approach this situation?

Even the seemingly obvious promise of early notification of emerging problems, as claimed by products such as IBM Operations Analytics – Predictive Insights, can run into unanticipated roadblocks on the way to acceptance. In my last blog, Using Predictive Insights to Maximize Your Chances of Success, I gave the anecdote of a customer being too overworked to even consider a new set of anomaly events. When it comes down to it, if the value isn’t immediately obvious and compelling in the evaluation phase, the barrier to adoption can be almost insurmountable. So the key here is to be able to identify, quantify and explain the value to those interested.

You must make these assessments yourself rather than simply accept the vendor’s general claims. In addition to notions of “trust but verify,” the practical reality is that the value of such tools is usually highly dependent on the environment, the selected data, the configuration and the processes for dealing with the output. A predictive analytics tool that produces great value in one context might not produce much in your particular context, and vice versa.

Having conducted many such trials and early deployments in recent years, I find it useful to approach the confidence-building challenge in the following two stages:

  1. Determining the proportion of actionable anomaly events
  2. Determining financial value of anomaly events

Determining the proportion of actionable anomaly events

In traditional threshold-based monitoring, the threshold levels are typically set explicitly, and you are effectively saying to the monitoring system, “Alert me when things go outside these bounds.” It is simple and explicit from the outset. With more advanced analytics, where the system learns aspects of your environment such as its “normal” behavior, there is room for interpretation on the part of the algorithms. Alerts emanating from such analytics may or may not meet your expectations of being actionable. Some may reflect trivially minor changes in your system; others may result from known, normal activity. In these cases, and in others, you would reasonably conclude that there is nothing to do with the anomaly other than delete or cancel it. You must, though, do at least a basic survey to get a sense of how many are actionable and how many are not.
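To make the contrast between an explicit threshold and a learned notion of “normal” a little more concrete, here is a minimal Python sketch. It is purely illustrative: the function names, the sample values and the "mean plus or minus k standard deviations" rule are my own assumptions for the example, and they are not how Predictive Insights or any particular product works.

```python
import statistics


def threshold_alerts(values, upper):
    """Indices of values above an explicitly configured upper bound."""
    return [i for i, v in enumerate(values) if v > upper]


def baseline_alerts(history, values, k=3.0):
    """Indices of values more than k standard deviations away from the
    mean of `history`. This is a stand-in for "learned normal behavior",
    not a real product algorithm."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    return [i for i, v in enumerate(values) if abs(v - mean) > k * spread]


# Hypothetical metric samples, made up for illustration only.
history = [50, 52, 49, 51, 50, 48, 51, 49]
new_values = [50, 58, 91]

print(threshold_alerts(new_values, upper=90))
print(baseline_alerts(history, new_values))
```

With these made-up numbers, the fixed threshold of 90 flags only the last value, while the learned baseline also flags the 58. Whether that extra alert is actionable or just noise is exactly the judgment call discussed above.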

Given the typical volume of events and the effort involved in deciding whether each one is actionable, it will probably not be practical to go through all the anomaly events. We often set up a simple experiment in which we randomly select a number of anomaly events, e.g. 30, present them to the customer and ask simple questions such as, “Is this anomaly event actionable to you or not?” Recording the answers gives a sense of the proportion of actionable events.
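The sampling experiment can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: the event IDs, the sample size default and the recorded verdicts are hypothetical, and the Wilson score interval is simply one reasonable way to show how much uncertainty a sample of about 30 carries. It is not part of any product.

```python
import math
import random


def sample_events(event_ids, k=30, seed=None):
    """Pick up to k anomaly events uniformly at random, without replacement."""
    ids = list(event_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def actionable_rate(verdicts, z=1.96):
    """Return (proportion, low, high) using a Wilson score interval.

    verdicts: booleans, True where the reviewer answered "actionable".
    z: normal quantile; 1.96 corresponds to roughly 95% confidence.
    """
    n = len(verdicts)
    if n == 0:
        raise ValueError("no verdicts recorded")
    p = sum(verdicts) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical usage: 500 event IDs, 30 reviewed, 18 judged actionable.
to_review = sample_events(range(1, 501), k=30, seed=42)
verdicts = [True] * 18 + [False] * 12
rate, low, high = actionable_rate(verdicts)
print(f"{len(to_review)} reviewed, actionable: {rate:.0%} ({low:.0%} to {high:.0%})")
```

With 18 of 30 judged actionable, the interval runs from roughly 42% to 75%. That is wide, but it is usually enough to tell "most of this is noise" apart from "a good share of this is useful."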

The determinations here are usually reached through a combination of expert insight from the users (“I see that, and it looks valid to me”), comparisons with output from legacy monitoring (“You produced an anomaly event, and the legacy monitoring did too”), and comparisons with other data in the environment (e.g. trouble tickets, configuration/provisioning changes).

The basic objective here is to convince yourself that enough of what is coming out is actionable or otherwise meaningful and useful.

Determining the financial value of the anomaly events

It is not necessary to get a particularly high score in Stage 1, but it is a useful data point that gives you a sense of proportion. What is ultimately more important are the financial implications. Outages avoided, operator jobs made easier, faster resolution of issues, etc., all have a monetary value.

A single significant anomaly, detected early and with sufficient financial value, could easily justify the entire cost of deploying the analytics tool.

How much is it worth to avoid a high-profile outage on your network? How much is it worth to identify and resolve a fault with your data centre energy that might otherwise not get picked up until the next monthly maintenance activity? Conversely, many low-value anomalies may end up not being worth the effort to deal with. So you must put some thought into the financial value of those valid events.

For events deemed particularly interesting, you should attempt to put a monetary value on them. This can sometimes be tricky and obviously requires an ability to estimate costs in the environment. It clearly moves beyond the purely technical realm, and there may be a bit of art to it. You may be lucky and have some obvious anomalies occur during a trial phase for which you can readily estimate costs. However you approach it, putting a dollar value on these events is critical.

Put simply, the approach is to first learn how much of the output is actionable and then estimate the cost of the associated (in)action: if you were alerted and didn’t act, what might the eventual problem situations cost you?
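As a back-of-the-envelope illustration of the two stages combined, the Python sketch below tallies reviewed events. Everything in it is an assumption made for the example: the `ReviewedEvent` fields, the event names and the dollar figures are hypothetical placeholders that you would replace with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class ReviewedEvent:
    name: str               # free-text label for the anomaly event
    actionable: bool        # Stage 1 verdict from the reviewer
    cost_if_ignored: float  # estimated cost of the eventual problem
    handling_cost: float    # estimated effort to triage and act on it


def net_value(events):
    """Estimated cost avoided on actionable events, minus the effort
    spent handling every event that was raised."""
    avoided = sum(e.cost_if_ignored for e in events if e.actionable)
    effort = sum(e.handling_cost for e in events)
    return avoided - effort


def worth_acting_on(events):
    """Actionable events whose estimated avoided cost exceeds the effort."""
    return [e for e in events
            if e.actionable and e.cost_if_ignored > e.handling_cost]


# Placeholder figures, invented purely for illustration.
events = [
    ReviewedEvent("storage latency drift", True, 50_000.0, 500.0),
    ReviewedEvent("nightly backup spike", False, 0.0, 100.0),
    ReviewedEvent("minor CPU wobble", True, 200.0, 400.0),
]

print(net_value(events))
print([e.name for e in worth_acting_on(events)])
```

Even in this toy example the pattern described above shows up: one high-value event dominates the total, while a low-value actionable event costs more to handle than it saves.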

The approaches mentioned above apply both to early trial phases and to production, as you seek to optimize the analytics deployment. The promise of this technology is great, and it’s easy to let enthusiasm get the better of you. By adopting some basic methodical approaches, you will quite rapidly reach conclusions on the potential value in your environment. These conclusions will be useful input to your purchase decisions and will also provide concrete support as you encourage adoption by your colleagues.

I’d love to hear about your experiences, good or bad, attempting to identify and prove value as you adopt these technologies. You can contact me here in the comment section below or on Twitter @rmckeown.
