Thoughts on Data Science, IT Operations Analytics, programming and other random topics


Shift to the Left with Predictive Insights and Log Analysis

11 Jun 2014

(first published on IBM Service Management 360 site)

Combining different capabilities from an IT Operations Analytics portfolio promises to provide better opportunities for avoiding outages and gaining insight into our monitored environments than has previously been possible. This statement applies to the basic single-use approach, where capabilities are used in isolation, but even more so to the approach where you put complementary capabilities together. We’ve been combining such capabilities here at work to achieve a goal we call “shift to the left.” It is our working phrase for the notion that we are shifting our activities earlier and earlier on the timeline. In other words, we are carrying out our operations activities earlier than has previously been possible. The objective is to get ahead of problems before they affect clients and users.

So how do we go about doing this? We do this by applying IT Operations Analytics (ITOA) to the volume and variety of data that is available in the typical IT environment. Individual tools and capabilities have value in isolation, but it doesn’t take long to realize that by combining these tools and looking across the data you can extract dramatically better value from your tool investments. This is really no different than the traditional IT approach, where we can often get significantly more value by combining and cross-analyzing our data. In the specific context of ITOA, we are really only at the beginning of understanding what might be possible; the future certainly looks exciting!

Focusing on the specific problem of IT outage avoidance, my immediate colleagues and I have developed tools and techniques to help hard-pressed IT operators and DevOps staff deal with emerging situations that, if left unchecked, could affect their service. You can get a sense of how enthusiastic we are about this mission by watching our short YouTube video on the data scientist perspective.

For our “shift to the left” objective, two basic capabilities are used: predict and search. We use IBM SmartCloud Analytics – Predictive Insights and IBM SmartCloud Analytics – Log Analysis to provide each of these respective capabilities. Each of these products has advanced capabilities and value in its own right. Predictive Insights and similar tools help give earlier notification of emerging problems. They do this by looking more deeply at the nature of the data and by applying machine learning techniques to understand the normal behaviors in the environments and to detect noteworthy changes. Log Analysis provides the ability to ingest vast quantities of IT data, logs, events and more; slice and dice and index them in real time; and facilitate ad hoc searches and presentation of those search results in various rich forms.
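To make the idea of "learning normal behavior and detecting noteworthy changes" concrete, here is a minimal sketch in Python. It is not how Predictive Insights works internally. It assumes a much simpler technique, a trailing-window baseline with a z-score threshold, and the function name, window size, threshold and sample data are all made up for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that deviate strongly from a trailing baseline.

    Illustrative assumption: "normal" is the mean of the previous `window`
    samples, and "noteworthy" means more than `threshold` sample standard
    deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: flag any change at all.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, one sample per interval.
response_times_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
anomalous_indices = find_anomalies(response_times_ms)
print(anomalous_indices)
```

A real predict tool has to cope with seasonality, relationships between metrics and much noisier data, but the shape is the same: learn a baseline, then flag departures from it as early as possible.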

The magic occurs when we combine these two capabilities. By using predict tools we can get earlier notification of problems and by using search tools we can rapidly search across the environment data to determine what has been going on. As an aside, these two are part of the framework—predict, search and optimize—through which we view ITOA at the moment.

When we receive a predict notification, we need that information to be actionable. In this case, “actionable” means that there are some obvious next best steps. Often, it will be obvious from that basic predict information what those next steps are. At other times, especially in the early stages of the problem where signals may be subtle or related to something the operator has not had previous experience with, it may not be clear at all what needs to be done. In these circumstances, we need to give the operator some help.

Here’s where we bring in search. We can help operators to seamlessly move from predict environments to search environments. If they can conveniently bring the relevant context with them to drive their search, then they can quickly gain additional insight into the issue. The initial search results are bounded in time (near the time of the anomaly) and in space (for the servers, devices and resources implicated in the anomaly) and provide more related context from the environment. This additional perspective is often all that is needed to determine the next best steps. The convenience of driving this scoped search and the speed at which these results can be obtained are both keys to keeping the timeline compressed, that is, to staying to the “left” of the problem.
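The scoping described above, bounded in time and in space, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Anomaly` and `LogRecord` types, the `scoped_search` function and the window sizes are hypothetical, and the sketch filters an in-memory list rather than calling any Log Analysis API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical context carried over from the predict tool."""
    detected_at: datetime
    resources: frozenset  # hosts or devices implicated in the anomaly


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical, already-parsed log line."""
    timestamp: datetime
    host: str
    message: str


def scoped_search(records, anomaly,
                  before=timedelta(minutes=30), after=timedelta(minutes=5)):
    """Return records near the anomaly in time and limited to its resources."""
    start = anomaly.detected_at - before
    end = anomaly.detected_at + after
    hits = [r for r in records
            if start <= r.timestamp <= end and r.host in anomaly.resources]
    return sorted(hits, key=lambda r: r.timestamp)


# Made-up sample data for illustration only.
anomaly = Anomaly(datetime(2014, 6, 11, 10, 0), frozenset({"web01", "db01"}))
records = [
    LogRecord(datetime(2014, 6, 11, 9, 50), "db01", "connection pool exhausted"),
    LogRecord(datetime(2014, 6, 11, 9, 45), "web01", "slow upstream response"),
    LogRecord(datetime(2014, 6, 11, 9, 55), "mail01", "queue flushed"),
    LogRecord(datetime(2014, 6, 11, 8, 0), "web01", "nightly job finished"),
]
results = scoped_search(records, anomaly)
for record in results:
    print(record.timestamp, record.host, record.message)
```

The window reaches further back than forward on the assumption that the interesting evidence usually precedes the anomaly. The point is that the operator never has to type the time range or host list by hand, because the predict context supplies both.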

Finally—and perhaps specific to the tools we use—if the user is still puzzled about what to do, he or she can select the search results and with a simple right-click of the mouse, invoke an automated “expert advisor” to take a look at those results. This expert advisor will suggest next best steps and other related collateral (documents, best practices and more).

So predict and search, coupled sometimes with automated expert advice, can combine to reduce the amount of time it would usually take to detect, understand and resolve a situation. This is just one particular functional combination of these capabilities. As an industry, we are at the beginning of discovering what is possible here. As you consider applying IT analytics tools in your own environments, I encourage you to broaden your sense of what’s possible rather than limit your perspective to a single product capability—be that predict or search or whatever. Just think about all the different kinds of data you have available! How would you combine that data and apply analytic techniques to get an earlier understanding of what is going on in your environment?

As usual, I’d love to hear your thoughts and bright ideas on such combinations, so connect with me on Twitter @rmckeown.
