I'd recently been thinking about aspects of 'the curse of big data' as they applied processing timeseries data from IT operations systems. These timeseries represent things such as cpu utilization, disk i/o rates and webserver response times. Given the volumes of data we deal with, this 'curse' often pops up, or at least the threat of it, in discussions. So we were having one of those back to basics discussions, related to simple correlations between such timeseries, and a number of questions around how many strong correlations were we likely to see, and when could we be confident that the correlations were valid (by some of our subjective definitions).

I'd read Vincent Granville's post The curse of big data some time ago, and later after the above mentioned timeseries discussion, I stumbled upon this wonderful short video Statistics Without the Agonizating Pain. It occured to me that I could have little bit of fun exploring the timeseries correlation question above by putting the two perspectives together. Vincent's post worked out the details using 'plain old-fashioned statistical modeling', and laid out some of concepts and terminology. Starting from there, I decided to then follow John's approach, and code my way around and into the problem, and in this case, my weapon of choice was R. R enabled a very rapid and succinct initial exploration.

My initial basic question was *'Given a set of many completely random timeseries, how many will be strongly correlated?'* Despite the hypothetical nature of this question, it's a good place to start to get your mind around the problem and how one might tackle it through coding.

What I'll do over the next few blog postings is take you through the steps I followed to explore the problem and answer the question.

Following Vincent's choice in variables, I'll use `k`

for the number of timeseries, each with `n`

observations. I define a function in R to generate a random timeseries with the `n`

observations. `i`

is an index I'll pass in later when calling this from one of the R looping functions (`sapply`

), but I don't use it in the function. Think of it as an iteration count.

timeseries <- function(i,n){ rnorm(n) }

The function simply uses `rnorm`

to generate `n`

values, with mean `0`

and standard deviation of `1`

. Set my variables

> k <- 1000 > n <- 20

Now I can generate a matrix of such timeseries

> allTimeseries <- sapply(1:k,timeseries,n)

`sapply`

will call the timeseries function with the values `1`

to `k`

as a first argume, which the function itself ignores, along with `n`

, our number of samples for each timeseries, which it does use. It will return a `matrix`

of `k`

timeseries, each of length `n`

. After we produce it, just look at the class type and the dimensions (`dim`

function) so ensure that you are getting something sensible back

> class(allTimeseries) [1] "matrix" > dim(allTimeseries) [1] 20 1000

Then, take a look at some of the actual values. Here is a snippet of my first row, which in this case would be the first observation for all `1000`

timeseries

> allTimeseries[1,] [1] 0.9812934126 0.0809221328 -0.5305226328 1.2741343709 -0.2454727926 [6] 1.8375363326 -0.1305255863 -0.8833065228 0.5038225212 -0.4978515373 .. [991] 0.4624402215 0.6315268798 1.2979804546 -0.5167739176 -0.8921856944 [996] -0.4604987605 -0.0972627883 -1.7788625651 0.3786650806 -0.0043285202

Of course, your actual values will differ. Next, use the `cor`

function to determine correlations between each of the timeseries

> r <- cor(allTimeseries)

`r`

is a correlation matrix, it is also symmetric, with the correlation between timeseries `i`

and timeseries `j`

being at `r[i,j]`

and `r[j,i]`

.

Our original question is basically 'how many of our timeseries (pairs) have strong correlation?'. Let's pick `0.8`

as our threshold. In terms of our correlation matrix `r`

, we need to extract all the correlation values and determine how many are greater than `0.8`

> correlations <- r[ which(lower.tri(r),TRUE) ]

Let's break this apart to see what's going on. First we extract the values from the lower triangle of `r`

. We could equally well have done the upper triangle of `r`

due to the symmetry. The function `lower.tri(r)`

returns a matrix, the same size as `r`

, with `TRUE`

in those positions which are in the lower triangle.

> lower.tri(r) [,1] [,2] [,3] [,4] [,5] ... [1,] FALSE FALSE FALSE FALSE FALSE ... [2,] TRUE FALSE FALSE FALSE FALSE ... [3,] TRUE TRUE FALSE FALSE FALSE ... ...

This is convenient as we can now apply the `which`

function to get the actual indices of those lower triangle elements. It works through our matrix with TRUEs and FALSEs and returns those indices of the elements which are TRUE

> which(lower.tri(r),TRUE) row col [1,] 2 1 [2,] 3 1 [3,] 4 1 ... [499500,] 1000 999

With the indices, we can get now the actual correlation values from r, as shown in the original line above, and repeated here

> correlations <- r[ which(lower.tri(r),TRUE) ]

The length of this correlation vector can be seen

> length(correlations) [1] 499500

This is exactly what we would have expected. `1000 KPIs`

give `1000*999/2`

correlations (when you think about how many pairs of timeseries we are compairing).

Take a look

> correlations [1] -0.03881186 -0.10470447 -0.22386602 -0.07264758 0.14677464 -0.23483565 [7] -0.10244466 -0.08548905 0.13903625 0.40835771 0.01554844 -0.24424305 ...

We only want those correlations which are greater than 0.8. Also remembering that we want to allow for strong postive *and* negative correlation, we need to use the `abs`

function so we pick up correlations greater than `0.8`

or less than `-0.8`

> correlations[abs(correlations)>0.8] [1] -0.8176548 -0.8008682 0.8040069 -0.8194776 0.8292056 -0.8094844 0.8166100 [8] 0.8017423 -0.8008229

My original question was about how many timeseries are strongly correlated, so we use the `length`

function. We can eyeball the output above to see how many in this case, but in the general case it's better to compute it

> length(correlations[abs(correlations)>0.8]) [1] 9

So for this single set of data, we see that there are `9`

pairs of timeseries that are strongly correlated. Of course, this was one randomly generated set and another set might have a completely different result. So we are not done yet. We have a few building blocks here and in the next post I'll show how to take these and get a more general sense of what is going on