DATA CORRELATION

Big data analytics typically treats correlation between data points as the starting point for insights. The familiar arguments of correlation versus causation are repeated time and again, and in case after case we find that organisations are misled by the sheer quantity of data there is to correlate.

The initial hurdle in an organisation’s nascent big data stage is the overwhelming disparity in sources of data: webserver logs, application logs, business process and transaction logs, data from databases and NoSQL data stores, OS event logs, system logs, and data from network and security devices. The list is endless. Correlation across these is, of course, a key aspect of big data analytics, but it is by no means the be-all and end-all of analytics.

[Figure: Data sources]

So what, then, are we trying to correlate? Why? And where do we start?

There are questions (or a lack thereof, in some cases), and there are metrics that answer, or help answer, these questions. Instead of looking at correlation merely from the view of data points, let’s correlate the questions, across disparate teams, verticals, and business units. Let’s look at correlation from the point of view of the problem sets, and use this view to get to the relevant data. The answers may lie in a single source or in multiple, disparate sources. As long as we identify these problem sets correctly, we will be directed down a narrow, well-defined path through the complex sea of data sets, and find the data points we really want, and need, to correlate.
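
To make the idea a little more concrete, here is a minimal Python sketch of what "correlating the questions" might look like. Everything in it is hypothetical: the questions, the metric names, and the mapping from metrics to data sources are invented purely for illustration. The point is only that grouping questions by the metrics they share tells you which sources are worth correlating in the first place.

```python
from collections import defaultdict

# Hypothetical questions from different teams, each tagged with the metrics
# that would help answer it. All names are illustrative assumptions.
QUESTIONS = {
    "Why did checkout conversions drop last week?": {"checkout_latency", "payment_errors"},
    "Are customers seeing slow pages?": {"checkout_latency", "page_load_time"},
    "Is the payment gateway unreliable?": {"payment_errors"},
}

# Hypothetical mapping from each metric to the data source that holds it.
METRIC_SOURCES = {
    "checkout_latency": "application logs",
    "page_load_time": "webserver logs",
    "payment_errors": "transaction logs",
}


def related_questions(questions):
    """Group questions by the metrics they share."""
    by_metric = defaultdict(set)
    for question, metrics in questions.items():
        for metric in metrics:
            by_metric[metric].add(question)
    # Keep only metrics that more than one question depends on.
    return {m: sorted(qs) for m, qs in by_metric.items() if len(qs) > 1}


def sources_to_correlate(questions, metric_sources):
    """List the data sources needed to answer the given questions."""
    needed = set()
    for metrics in questions.values():
        needed.update(metric_sources[m] for m in metrics)
    return sorted(needed)


if __name__ == "__main__":
    for metric, qs in related_questions(QUESTIONS).items():
        print(metric, "->", qs)
    print(sources_to_correlate(QUESTIONS, METRIC_SOURCES))
```

In this toy setup, two teams' questions meet at the same latency metric, which suggests starting with the application logs rather than trawling every source at once.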