Thoughts on Data Science, IT Operations Analytics, programming and other random topics

We adopt Logstash for ETL of metric data

25 Jul 2014

In the course of my daily analytics job, I regularly need to extract and prepare data from a variety of sources. Roughly speaking, I need to Extract, Transform and Load (ETL) data from sources to analytics destinations, a major one at this time being Predictive Insights. Generally, this ETL activity is a continuous, near-real-time process, though there are plenty of occasions where it is a once-off activity.

Over the last few years we've used various ETL solutions in the product, from full commercial tools like DataStage to once-off homebrew tools (programs, scripts etc.) crafted as needed and more or less discarded soon after. While the commercial tools are all-powerful, that power often comes with large deployment footprints and steep learning curves. These drawbacks were prompting our field folks to create their own once-off ETL tools, which was leading to a proliferation of solutions and limiting the scope for re-use.

So we set about finding something lighter weight that would be maybe an 80% solution to our needs. What were those needs? Well, simply, extracting data from sources, doing basic format transformations, and making the data available, or pushing it, to the downstream analytics consumers. ETL! At the same time, our sister Log Analytics product had started to make extensive use of Logstash for its mediation needs when dealing with 'log' data. It didn't take a great leap to realize that, even though they were focused on 'log' data, we were doing analogous operations in our attempts to get 'performance' data. A vibrant and active external open source community was a major consideration for me personally too.
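
To make those three needs a little more concrete, here is a rough Python sketch of the extract, transform and load steps applied to a few rows of metric data. It is only an illustration of the shape of the problem, not Logstash and not our actual tooling: the CSV layout, the column names (timestamp, host, metric, value) and the JSON-lines output format are all assumptions made up for the example.

```python
import csv
import io
import json

# Hypothetical raw metric export; the columns are assumptions for illustration.
RAW_CSV = """timestamp,host,metric,value
2014-07-25T10:00:00Z,web01,cpu_pct,42.5
2014-07-25T10:00:00Z,web02,cpu_pct,n/a
2014-07-25T10:05:00Z,web01,cpu_pct,47.0
"""


def extract(text):
    """Extract: parse CSV text into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert values to floats, dropping rows that do not parse."""
    cleaned = []
    for row in rows:
        try:
            value = float(row["value"])
        except ValueError:
            # Skip unparseable samples such as "n/a".
            continue
        cleaned.append({
            "timestamp": row["timestamp"],
            "host": row["host"],
            "metric": row["metric"],
            "value": value,
        })
    return cleaned


def load(records):
    """Load: serialize as JSON lines, ready to hand to a downstream consumer."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    print(load(transform(extract(RAW_CSV))))
```

Each once-off homebrew tool tends to end up re-implementing some variation of these three functions, which is the duplication we wanted a shared, lightweight tool to absorb.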

Couple that with the fact that our performance and log analytics products are intended to work closely together. More often than not, the same field guys are deploying and integrating both products at a customer site. Having two different mediation solutions, with all the costs implied, would require some significant advantage from the second one to warrant its inclusion. So, in the end, we concluded that for mediation of our performance data we should adopt Logstash as a lightweight, flexible solution. Note too that the product itself has native mediation capabilities, so part of what we were after here was something to bridge the gap between the huge variety of external sources and those internal interfaces.

Over the next few months I'll report back on some of our experiences in adopting Logstash here.
