I first discovered Kaggle about a year ago when listening to TWiT.TV's wonderful Triangulation. I was immediately struck by the simplicity of the idea: combining data science/analytics with a crowdsourced competitive element seemed like a winner to me, and it seems to have taken off. There is now a vibrant community of data science practitioners developing around it. If you haven't seen it yet, I encourage you to take a look.
Today I was wondering if Kaggle had been applied to any of the problems in the IT Operations Analytics (ITOA) area. ITOA, with its huge volumes of often quite heterogeneous data, would seem to be a great area to apply the collective brainpower of the Kaggle ecosystem, at least from a challenge perspective. So I took a look at the Kaggle site and reviewed the set of competitions posted to date. There were 124 competitions listed at the time of writing, and I didn't find a single one in the area of ITOA. Oh well! Maybe it's early days yet in Kaggle-land.
I see at least three potential reasons for the lack of ITOA/Kaggle engagement today, in priority order:
Organizations simply not being aware of Kaggle. By mentioning it here today, I'm doing my bit to help with the awareness aspect :-)
Reluctance of IT orgs to share data
I know from my own experience that plenty of the customers I deal with are unwilling to let their data leave their site/domain, with security being the usual reason given. And that's just for releasing data to us, their NDA-bound partners! Releasing data to an even more public forum would be a steeper challenge still. However, much of the ITOA data in question is not directly or obviously tied to end customers, and as you work through it you often find the security concerns to be overstated, though not in all cases. Even then, anonymizing this data is usually quite trivial, and anonymization combined with the closed competitions Kaggle offers goes a long way toward alleviating the concerns.
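To give a flavour of how trivial this can be: a minimal sketch of the kind of anonymization I have in mind, replacing identifiers (hostnames, IPs) in monitoring records with stable pseudonyms while leaving the metric values untouched. The field names and the salt value here are purely illustrative, not from any real dataset.

```python
import hashlib

# Keep the salt out of the released dataset; without it,
# the pseudonyms cannot be reversed by dictionary attack.
SALT = "per-engagement-secret"

def pseudonymize(value: str) -> str:
    """Map an identifier to a stable, irreversible token.

    Salted SHA-256 keeps the mapping consistent across records
    (the same host always gets the same token), so correlations
    in the data survive anonymization.
    """
    digest = hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()
    return "node-" + digest[:12]

# Hypothetical monitoring record: only the identifier is rewritten.
record = {"host": "payroll-db-01.acme.example", "cpu_util": 87.5}
record["host"] = pseudonymize(record["host"])
```

Because the mapping is deterministic, analyses that join on hostname still work on the anonymized data; the original names simply never leave the site.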
Factors considered in results evaluation
Ok, this is one that is dear to my heart given my focus on actual product development. Developing predictive models that can make sufficiently accurate predictions about the behaviour of aspects of the IT environment is but one of the analytics challenges here. Equally important is being able to make those predictions in real time at 'IT scale' (see Lots of data in the data center) and to evolve the prediction models continuously to reflect the reality of ever-changing IT environments. So a system of evaluation that allows for the inclusion of considerations beyond basic accuracy is needed.
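To make the point concrete, here is one hypothetical way such an evaluation could be expressed: a weighted score that blends accuracy with a latency penalty. The weights, the latency budget, and the function itself are my own invention for illustration, not an actual Kaggle scoring mechanism.

```python
def itoa_score(accuracy: float, latency_ms: float,
               latency_budget_ms: float = 100.0,
               w_accuracy: float = 0.7) -> float:
    """Blend prediction accuracy with a real-time latency penalty.

    A model that stays within its latency budget gets full credit
    for the latency component; one that exceeds the budget scores
    progressively worse, so a slightly less accurate but much
    faster model can win overall.
    """
    latency_factor = min(1.0, latency_budget_ms / max(latency_ms, 1e-9))
    return w_accuracy * accuracy + (1.0 - w_accuracy) * latency_factor

# A fast, slightly less accurate model...
fast = itoa_score(accuracy=0.90, latency_ms=20.0)
# ...versus a slower, more accurate one.
slow = itoa_score(accuracy=0.95, latency_ms=400.0)
```

With these (arbitrary) weights the fast model outscores the accurate-but-slow one, which is exactly the trade-off a production ITOA system has to reward.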
On a side note, Kaggle's competition around 'Conway's Reverse Game of Life' piqued my interest. I have happy memories of programming the Game of Life in BASIC back in my early computer programming days on a Sharp MZ80K.
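For anyone who never got to write it, the forward Game of Life update rule fits in a few lines in any modern language. A minimal sketch in Python (the BASIC original on the MZ80K looked rather different!):

```python
from collections import Counter

def life_step(live: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """One generation of Conway's Game of Life on an unbounded grid.

    Rules: a live cell with 2 or 3 live neighbours survives;
    a dead cell with exactly 3 live neighbours becomes live;
    everything else dies or stays dead.
    """
    # Count how many live neighbours each candidate cell has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = {(0, 1), (1, 1), (2, 1)}
```

The 'Reverse' competition asked for the much harder inverse problem: given a later board state, recover a plausible earlier one.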