

Tell Me Something I Don't Know About My Environment

17 Jul 2015

(first published on the IBM Service Management 360 site)

'Tell me something I don't know about my environment' is a challenge we've received on a few occasions when working through the goals for a trial involving IBM Operations Analytics - Predictive Insights (PI). It usually comes from potential customers who already have rich monitoring and thresholding capabilities deployed in their environments. They already know a lot about what is going on there, and part of what is behind the question is a desire to see what kind of additional value our product can bring to the table.

As a matter of best practice, we try to focus trials on specific use cases and areas of known problems, so open-ended challenges like this are not something we would naturally attempt. The motivations for a specific focus include a desire to scope the trial in terms of time and effort, and to understand up front where we are trying to provide value. The 'tell me something...' challenge is generally a much trickier proposition because it is wide open and we don't know what they already know! We do welcome such challenges though, given our confidence in our analytics capabilities, and on a number of occasions we have been quite successful in showing new insights and highlighting the additional value that our analytics solutions can bring.

While so much of predictive analytics is targeted at, well, 'prediction' and forward-looking aspects like getting ahead of emerging problems, the same analytical techniques can often be applied retrospectively to historical data sets to shed light on what transpired in the environment. I like to think of this as a form of forensics, answering questions such as 'What happened?' and 'What caused the issue?' These new insights can guide refinements to monitoring, processes to prevent recurring problems, and so on.
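PI's own analytics are proprietary and considerably more sophisticated, but to make the retrospective idea concrete, here is a minimal sketch in Python of forensic anomaly detection over a historical metric extract. It uses a simple rolling z-score in place of learned baselines; the file name, column layout, window size, and threshold are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical historical extract: one column per metric, indexed by timestamp.
# The file name and layout are illustrative, not a real PI export format.
df = pd.read_csv("metrics_history.csv", index_col="timestamp", parse_dates=True)

window = 288  # e.g. one day of 5-minute samples

# Each metric's recent past defines its "normal" behaviour.
rolling_mean = df.rolling(window).mean()
rolling_std = df.rolling(window).std()

# Flag samples more than 3 standard deviations from that baseline.
zscores = (df - rolling_mean) / rolling_std
anomalies = zscores.abs() > 3

# Forensic summary: when and where did each metric misbehave?
for metric in df.columns:
    times = df.index[anomalies[metric]]
    if len(times) > 0:
        print(f"{metric}: {len(times)} anomalous samples, first at {times[0]}")
```

Run over weeks of history, even something this crude tends to surface the spikes and shifts worth asking 'What happened?' about; the value of a real product lies in doing this well, at scale, without hand-tuned thresholds.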

In trial contexts, for such retrospectives, we'll typically take a dump of historical data – usually between two and eight weeks' worth, though sometimes more and sometimes less; in fact, even a couple of weeks of data has yielded valuable insights. We run that through our analytics and take a look at the output. For PI, the output can be a series of anomaly events as well as perspectives on metrics/KPIs and their relationships – metrics that exhibit anomalies at the same time, and metrics that tend to move together (in a correlation sense).
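The 'move together' relationships can likewise be illustrated with a plain pairwise correlation over the same hypothetical extract. Again, this is a generic sketch rather than PI's actual method, and the file name and 0.9 cutoff are assumptions:

```python
import pandas as pd

# Same hypothetical historical extract as before (names are illustrative).
df = pd.read_csv("metrics_history.csv", index_col="timestamp", parse_dates=True)

# Pairwise Pearson correlation across all metrics in the extract.
corr = df.corr()

# Report strongly related pairs; the 0.9 cutoff is an arbitrary choice.
threshold = 0.9
pairs = corr.where(lambda c: c.abs() >= threshold).stack()
for (a, b), r in pairs.items():
    if a < b:  # skip self-pairs and mirror duplicates
        print(f"{a} <-> {b}: r = {r:.2f}")
```

Pairs that normally move in lockstep are interesting precisely when they stop doing so – that divergence is often the first visible sign of trouble.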

We've found situations that were previously known to the client – an outage detected by their monitoring, for example. We also regularly find situations that were previously unknown but, upon investigation by the client, prove interesting and worthy of further attention. Some of these situations resulted in outages which, in hindsight, could have been prevented had our tools been deployed – or at the very least, we would have alerted the client with enough time to address the issues. Others pointed to inadequate monitoring regimes that were allowing issues to slip through the cracks and could have turned into service-affecting situations. Of course, we can't be too critical in such cases: it can be tough (read 'expensive') to monitor everything we'd like to, especially in a highly dynamic environment. But with PI, you get the potential of wider coverage whilst saving money (see Reduce Cost of Threshold Management with Operational Analytics). Sometimes, given the complexity of the environments, it's not even obvious what should be monitored – another argument for automated systems to take care of this!

So whilst identifying and showing something previously unknown in a trial context may be impressive and help move things along from trial to deployment, the real value is in what those historical insights point to: situations that are detectable by these same predictive analytics going forward. In the pressure cooker of resolving operational issues, the historical perspective may be of marginal interest. However, combining detection when it counts – in close-to-real-time contexts – with historical perspectives that help solve recurring issues and other subtle negative behaviors is extremely valuable. In those latter cases, experts need to draw on a variety of data sources to support their investigations and conclusions, and historical data is key. It is incredibly useful to apply the same technology consistently across both current analysis (where timeframes of interest may be measured in minutes) and historical analysis (where timeframes of weeks and months may be relevant). Insights gleaned historically can be conveniently applied to real-time situations, and vice versa.

So I encourage you to give these technologies a try. Chances are you will learn something valuable about your environment and processes during the course of your explorations. I'd love to hear about any discoveries you make, or indeed your thoughts on Operations Analytics in general. You can reach me in the comments section below or on Twitter @rmckeown.
