

Reduce the Burden and Cost of Threshold Management

15 Jun 2015

(first published on IBM Service Management 360 site)

Tools such as IBM Operations Analytics - Predictive Insights are proving extremely useful in providing early notification of emerging problems, giving operators the opportunity to resolve situations before they become service-impacting. This is undoubtedly a game-changing capability.

However, as useful as it is, truly realizing the value can require changes in how operators go about their business. This is because a process change is often needed to treat potential problems proactively rather than treating actual problems reactively, as is the dominant mode today. Some organizations will take time to adjust. However, there is another key benefit of such technology which is more immediately and generally realizable: the cost savings that result from simplified threshold management.

This was made clear to us in the early days of Predictive Insights development. We were intently focused on the goal of building a system to detect emerging problems, when one client made an astute observation during a demo session:

'If you guys produced exactly the same events as we are receiving from our current monitoring systems, but reduced the cost of managing those thresholds, you'd be on to a real winner'.

We were so consumed with our shiny bleeding-edge technology that we initially overlooked this potential. In their particular case, they were spending in excess of seven figures annually simply managing threshold settings. In the short term, reducing those costs was key, and notions like early notification could wait!

So where does this potential come from? The detection of emerging problems is often based on detecting changes in the environment. Change, in turn, is detected by applying machine learning techniques to develop a sense of 'normal' behavior in the environment and spotting deviations from that norm. Performance data emerging from the systems is analyzed and compared with normal or expected behavior, and if it falls outside the expected ranges, anomalies are declared and alerts generated.

In the course of this, the anomaly detection system essentially determines threshold levels or bands within which behavior is acceptable and outside which it is not. The key here is that these analytics systems do not need to be explicitly told what levels are acceptable. The implicit guidance given is that 'the system being monitored is behaving normally' and therefore 'watch and learn'. Herein lies the opportunity for cost savings. If you do not have to explicitly set threshold levels, instead relying on the system to deduce those levels, you can save a tremendous amount of work and money.
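To make the 'watch and learn' idea concrete, here is a deliberately simple sketch in Python of how an implicit threshold band might be mined from observed behavior. This is not how Predictive Insights itself works internally; the function names, the synthetic data, and the mean ± k·sigma band are purely illustrative stand-ins for far more sophisticated baselining (seasonality, correlation across metrics, and so on).

```python
import numpy as np

def learned_band(history, k=3.0):
    """Derive an implicit threshold band from a window of 'normal' history.

    A rough stand-in for what an analytics tool learns: the band is
    mean +/- k standard deviations of the training window.
    """
    mean, std = np.mean(history), np.std(history)
    return mean - k * std, mean + k * std

def detect_anomalies(values, band):
    """Flag samples that fall outside the learned band."""
    low, high = band
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Example: a week of 'normal' response times (ms), then a new day to check.
rng = np.random.default_rng(42)
training = rng.normal(loc=250, scale=20, size=7 * 24)              # learned as normal
today = np.append(rng.normal(loc=250, scale=20, size=23), 410.0)   # one spike

band = learned_band(training)
print("learned band (ms):", band)
print("anomalies:", detect_anomalies(today, band))
```

The point is that nobody had to decide that, say, 320 ms is the limit; the band falls out of the data, and it moves if normal behavior moves.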

Now, particularly in situations where Service Level Agreements (SLAs) must be enforced, there will still be a need for explicit threshold settings. When someone says 'Web Response Time will be less than 2 seconds' and writes a legal contract in those terms, it is pretty black and white. However, a large proportion of thresholds in a well-managed environment are really there to ensure that the various systems are doing their bit to support meeting those SLAs. Ensuring that any one of them meets its particular threshold can be critical to satisfying the overall SLA.

This gives us a two-tier perspective on thresholds (both are illustrated in the short sketch after this list). We have:

  • Explicit thresholds directly measuring or enforcing SLAs
  • Implicit thresholds mined from normal operation of the key components delivering the services
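To illustrate how the two tiers can coexist, here is a hypothetical check that applies both an explicit, contractually defined SLA limit and an implicit band learned as in the earlier sketch. The metric names, the 2-second figure reused from the SLA example, and the `evaluate` helper are illustrative assumptions, not a real product API.

```python
# Two-tier check: an explicit SLA limit plus an implicit learned band.
# The SLA limit comes from the contract; the band comes from something
# like learned_band() above. Names and values are illustrative only.

SLA_LIMIT_MS = 2000.0          # explicit: "Web Response Time will be less than 2 seconds"

def evaluate(metric_name, value, learned_band=None):
    """Return a list of alerts for one sample of one metric."""
    alerts = []
    # Tier 1: explicit, contractually defined threshold.
    if metric_name == "web_response_time_ms" and value >= SLA_LIMIT_MS:
        alerts.append(f"SLA breach: {metric_name}={value:.0f}ms >= {SLA_LIMIT_MS:.0f}ms")
    # Tier 2: implicit threshold mined from normal operation.
    if learned_band is not None:
        low, high = learned_band
        if not (low <= value <= high):
            alerts.append(f"Anomaly: {metric_name}={value:.0f} outside learned band "
                          f"[{low:.0f}, {high:.0f}]")
    return alerts

# A backend metric with no contractual limit is still covered by its band.
print(evaluate("web_response_time_ms", 2300.0, learned_band=(180.0, 320.0)))
print(evaluate("db_queue_depth", 95.0, learned_band=(0.0, 40.0)))
```

Only the first tier needs to be written down and maintained by hand; the second tier is the part you can stop managing manually.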

It is in this second category of thresholds that we find candidates for elimination and potential cost savings. Eliminating these can be achieved without compromising operational reliability. As was succinctly stated in this blog post, 'Why set thresholds manually when the software can do it for you?' [The blog posting has since been taken down]. In the case of Consolidated Communications, they were able to realize savings of $300K annually from a similar reduction in threshold maintenance effort. In fact, switching to such a 'normal-based' thresholding scheme often provides better coverage than traditional manual thresholds. There are simply too many places and dark corners to adequately understand and place boundaries around. It is better to let the analytics system figure out what is appropriate, and to do it across all the metrics.

There is an argument that says 'well, if the system is operating normally and satisfying SLAs, then the implicit thresholds mined out should cover the SLAs as well'. In practice, we have both the legal implications of meeting SLAs to consider, with the resulting need for crisply defined operational limits, and the fact that a system can change in a controlled fashion and still meet its SLAs. So the separation turns out to be essential to practical operations, and that is desirable. We end up with the best of both worlds: precise thresholds rooted firmly in the business agreements and understandable from the outside, and implicit thresholds derived from the operation of the system and more inward-facing.

Looking forward, as you think about deploying such operational analytics, perhaps drawn by the lure of early notification of emerging problems, don't forget that savings may be achieved more readily through reductions in threshold management effort. Review your threshold portfolio, think carefully about which of the two categories each threshold falls into, and use that to estimate the potential savings if you were to eliminate the explicit management of the second group.

I’d love to hear your thoughts on the potential for savings here, or indeed on Operations Analytics in general.

You can contact me in the comments section below or on Twitter @rmckeown.
