

Exploring correlations in random timeseries with R cont.

22 Oct 2014

In my last post, Exploring correlations in random timeseries with R, I started to show how we could use R to explore a simple question: 'How many correlations would we expect between randomly produced timeseries?'. The emphasis is on applying code to the problem, rather than sweating the stats theory behind it.

Last time, through the easy application of a few functions in R - for 1000 timeseries, 20 observations each, and a correlation threshold of 0.8 - we arrived at an answer of 9. That was the result of a single experiment, and not useful in isolation. What we need to do now is carry out that experiment multiple times, organize the individual results, and draw some conclusions. Put simply, we'll be doing some Monte Carlo style experiments.

So let's get set up to run this experiment multiple times. We gather up the code snippets from last time, wrapping the items we executed individually into a new function, singleExperimentResult:

# k             number of timeseries
# n             number of observations per timeseries
# corThreshold  correlation threshold .. typically 0.8
   
# i is an id supplied by sapply; it is deliberately ignored here
timeseries <- function(i,n){
  rnorm(n)                                       # one random timeseries: n draws from N(0,1)
}

singleExperimentResult <- function(i,k,n,corThreshold) {

  allTimeseries <- sapply(1:k,timeseries,n)      # n x k matrix, one timeseries per column
  r <- cor(allTimeseries)                        # k x k matrix of pairwise correlations
  correlations <- r[ which(lower.tri(r),TRUE) ]  # each pair counted once: lower triangle only
  length(correlations[abs(correlations)>corThreshold])
}

Now we can more conveniently execute the code from last time. Think of i as an id, or indeed an experiment number, that we can assign when we invoke the function from R iteration functions like sapply (see below).

> singleExperimentResult(1,1000,20,0.8)
[1] 12
> singleExperimentResult(1,1000,20,0.8)
[1] 15
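
As an aside, if you ever need a run to be repeatable - say, to reproduce the exact counts quoted in a post like this - you can fix R's random seed first (42 below is an arbitrary choice; any fixed value will do):

> set.seed(42)   # rnorm() now produces the same draws on every run
> singleExperimentResult(1,1000,20,0.8)

With the seed fixed, repeating that pair of lines returns the identical count each time.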

Notice, in the two calls above, that we get different results each time. This is obviously due to the random timeseries we are creating. Now - and here is the key leap - we can execute this experiment multiple times and get all the results back. We don't generally expect the same result on each experiment, but by assembling and analyzing all the results we can draw some conclusions about what is going on. Let's start with, say, 500 iterations. The sapply function gives us a convenient way to iterate and gather up all the results. The function call below took about 3 minutes on my laptop - be patient.

> results <- sapply(1:500,singleExperimentResult,1000,20,0.8)
> length(results)
[1] 500
> results
  [1] 14 16  9 13  6 17 11 16  7 11 15 14  8 16  9 13 16 12 11 14  7 14 12 12 11
 [26] 11  9 10 14  8 17  7  8 12 15 13 16  8  9 16  8 10 18  8  9  7  9 14  9  6
 [51]  8 13  6 14  8 10 14 15 13 10  8 14 21 12 14 13  4  7 19 10 13 10  8 14 13
 [76] 12 14 11  7 10 10 13 12 11 10  9 13 16  7 16 16 12 15 10 14 15 14  9  7  7
[101]  5 14  9 13 19 11 10 13 14 16  9 14 10 17 12 14 13 13 11 10  4 15  9 16  7
[126] 13 14 13  8  6  9 11  7 11 16 18 12 13 12 12 10 12 10  7 14 14 15 11 10 10
[151] 10 12 15 13 12 13  8 15  9 11 10 13 15 14  8  8  7  8 12 11 11  8  6 21  4
[176]  9  7 11  9 13  8 11 13  8 16 12 10 20 16  7 12 10 12  7 11  8 15  9 16  6
[201]  7 12 10  8 12 13 12  8  8  7  7 19 12  8 11  5 17 12  9  5 12 12 17 14  9
[226]  7  5 12  9  7 11 12  9 10 13 12 10 11  7  6 13 16 10  7 12 10 15  8  8  8
[251] 10  9 12 13 11 14 17 11  6 13 11  9 14  5 15 16 14 12  7 11 16 15  8 12  9
[276] 19  9 11 13 11 15 14 18 13 12 15  6 10 16  8 13 16 15 10 12 12 14 15  9  9
[301] 15 10 11  8 16 16 10 11 11 11 10 11  9 13 10  9 13 13 11 12 13 12 11 11 11
[326] 11 15 16 11 16  3 14 13 11  9  7  9 15 10 18  7 15  9  8  9 11  9 10  7 22
[351] 12 12  6  3 14 15 18  9 10 15 14 12  8 11 13  9 10 13 11  8  7 12 16 10 10
[376] 18 14  8 14  6 15 14 12 11  7 14  6 14 10 14 16 10  8  8 13 16  9 12  6 14
[401]  9 12  6 11 17  7 12 11 11 11 16  8 12  9 18 15 10 13 15 16 10 12 10 18 11
[426] 14 13  7  8 11 14 10 14 14 11  9 16 10 16 13 10 16  5 16 11 10 13 15 14 15
[451] 10 11 10 12 12 15 12 20 15 17 16 10 13 12 16 13  8 13  9 10 13 13 11 11 15
[476] 11 10 17  8  8  7 13  9 16 12  7 13 12  8 13 10 12  8 14 13 11 14 16 14 11

To make sense of this, let's plot it as a histogram:

> hist(results)
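
The histogram gives the shape; a few one-liners summarise the same run numerically (a rerun of the experiment would shift these slightly):

> range(results)    # the extremes: 3 and 22 for the run above
> median(results)   # the typical count, matching the peak of the histogram
> mean(results)     # the average count across all 500 experiments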

The rough answer to our original question, then, is that with such a set of random timeseries we might generally expect somewhere between 3 and 22 correlations, with around 11 being the most likely count. So completely random data does throw up correlations - and our real-world IT data, intuitively at least, would likely produce even more correlations which we might consider questionable. We can now use our little program to explore the effects of changing the different input parameters. We can see, for example, that if we increase the number of observations, n, in each timeseries, the number of correlations drops off.
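
To see that effect for ourselves, here is a quick sketch (the iteration count of 100 and the choice of n=50 are arbitrary, picked just to keep the run time down):

> resultsN50 <- sapply(1:100,singleExperimentResult,1000,50,0.8)
> mean(results)      # n=20: around 11 chance correlations per experiment
> mean(resultsN50)   # n=50: the count collapses towards zero

With 50 observations per timeseries, a sample correlation above 0.8 between unrelated series becomes vanishingly rare, so the spurious correlations all but disappear.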

So by applying some basic programming skills and a bit of logical reasoning, we were able to sidestep any complicated stats (where 'complicated' is highly subjective anyway!), and still arrive at a deeper understanding of our area of focus. This level of understanding is 'plenty good' for a lot of engineering and other real-world problems. In my particular problem, though, this is only the start; there are other factors to consider. Even just scaling this up, e.g. to 10000 timeseries, will start to tax R a bit (10000x10000 gives us a 100,000,000-element correlation matrix to play with - which my R copes with, but 100,000 timeseries causes it to fail). There are other techniques beyond this simple approach which we can apply. Those are topics for later posts.
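
As a footnote, the memory arithmetic behind that scaling remark is easy to check in R itself (assuming standard 8-byte doubles):

> 10000^2 * 8 / 2^30    # 10000x10000 correlation matrix: ~0.75 GB
[1] 0.7450581
> 100000^2 * 8 / 2^30   # 100000x100000: ~75 GB - no surprise my R fails
[1] 74.50581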
