• Conrad Chuang

Babies, analytics and official dates

In "The Data-Driven Parent" (The Atlantic, May 2012), Mya Frazier covers baby-data apps, which help new parents capture and analyze data about their baby.

At their most basic, these first-generation baby-data apps offer … a substitute for handwritten diaper-change and feeding logs. The apps' greater innovation, however, has been in charting and analyzing [and individual] children's data… Forthcoming versions … [will allow] parents to compare their child with other children in great detail.

Being able to collect, analyze and evaluate against a large (and growing) set of observations has always been one of the more exciting aspects of big data. And by the power of transitive logic, futurists and IT pundits have assured us that we can apply the same techniques to big data in our organizations.

This all sounds sensible, but what about the master data, or the "small data," that defines the big data that's being collected? Differences in how attributes, identifiers and hierarchies evolve over time have an impact on the way one performs analysis.

With epidemiological data, attributes (a form of master data) such as "weight in pounds" or "height in inches" mean the same thing across time periods and across observers. This stability allows you to make period-on-period comparisons for a single baby, create distributions from multiple babies, and compare a baby to that distribution of big baby-data. This attribute stability holds for most natural, or scientific, data sets; without it, research collaborations would be challenging.

Unlike epidemiological data, the master data that defines the information in your enterprise's transactional systems changes in value and meaning across time periods and functions. For example, Ford attempted to reinvigorate its mid-sized sedan sales by changing the identifier of the Ford Taurus to the Ford 500 between 2005 and 2007. This means a ten-year analysis of Ford Taurus sales would need to link those two pools of data.
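To make the linking step concrete, here is a minimal sketch of one way to do it: an alias table that maps every historical identifier to a single canonical key before totals are computed. The key `taurus-line`, the function name and the unit counts are all invented for illustration; they are not real sales figures or anyone's actual schema.

```python
from collections import defaultdict

# Hypothetical alias table: every historical identifier points to one canonical key.
CANONICAL_NAME = {
    "Ford Taurus": "taurus-line",
    "Ford 500": "taurus-line",
}

# Made-up rows of (identifier, year, units), for illustration only.
sales = [
    ("Ford Taurus", 2004, 100),
    ("Ford 500", 2006, 80),
    ("Ford Taurus", 2008, 90),
]

def units_by_canonical_name(rows):
    """Sum units per canonical key so renamed products land in one pool."""
    totals = defaultdict(int)
    for name, _year, units in rows:
        # Identifiers with no alias entry are kept under their own name.
        totals[CANONICAL_NAME.get(name, name)] += units
    return dict(totals)

totals = units_by_canonical_name(sales)
```

Without the alias table, the same rows would split into two separate totals, and a trend line for "Ford Taurus" would show a gap that is an artifact of naming rather than of sales.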

The "same thing, different name" problem is a relatively easy fix. But consider the problems caused by the "same name, different meaning" problem. One of our European customers decided to restructure their sales regions (or sales hierarchies) at the beginning of this fiscal year to take advantage of new market opportunities in Russia. Instead of coming up with brand new sales territory names, they just added Russia to their "Eastern European" sales territory. This means that any data about the "Eastern European" territory is now incompatible across time periods (you'd be comparing two different geographic regions). Performance analysis needs to correct for this or run the risk of misattributing outcomes and drawing the wrong conclusions. Another complicating fact: the changes to sales territories were reviewed, approved and accepted before they went into effect.
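One way to guard against this, sketched below under invented assumptions, is to store each territory as a list of dated versions and to check that two periods share the same definition before comparing them. The member countries, dates and function names are hypothetical and are not taken from the customer's data.

```python
from datetime import date

# Hypothetical versions of one territory name, sorted by start date.
# Member lists and dates are invented for illustration.
TERRITORY_VERSIONS = {
    "Eastern European": [
        (date(2011, 1, 1), frozenset({"Poland", "Hungary"})),
        (date(2012, 1, 1), frozenset({"Poland", "Hungary", "Russia"})),
    ],
}

def members_on(territory, day):
    """Return the member set in force on `day`, or None if none applies yet."""
    current = None
    for valid_from, members in TERRITORY_VERSIONS[territory]:
        if valid_from <= day:
            current = members  # later versions override earlier ones
    return current

def comparable(territory, day_a, day_b):
    """True only when both days fall under the same territory definition."""
    return members_on(territory, day_a) == members_on(territory, day_b)
```

Here `comparable("Eastern European", date(2011, 6, 1), date(2012, 6, 1))` is false, which is the signal that a year-on-year comparison needs to be restated on a common definition first.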

The complication is that we need to account for two kinds of time data. The first is obvious: when the data was changed. The other is a bit less obvious: when the data was valid, or official. (For our customer, the changes to sales territories were made at the beginning of Q4 FY2011 with the expectation that the sales team would not start working those territories until Q1 FY201.) This is why we designed EBX5 to capture both kinds of time data. Our customers want to know who changed what, and when; and when the change became official, or valid.
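The two kinds of time can be sketched as two separate dates on every change record. What follows is a generic illustration of that idea, not EBX5's data model or API; the class, field names, people and dates are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TerritoryChange:
    territory: str
    members: frozenset
    changed_by: str    # who made the change
    recorded_on: date  # when the change was entered
    valid_from: date   # when the change became official

# Invented history: the second change is recorded months before it takes effect.
history = [
    TerritoryChange("Eastern European", frozenset({"Poland", "Hungary"}),
                    "alice", date(2010, 1, 5), date(2010, 1, 5)),
    TerritoryChange("Eastern European", frozenset({"Poland", "Hungary", "Russia"}),
                    "bob", date(2011, 10, 3), date(2012, 1, 1)),
]

def official_members(changes, territory, valid_on, known_on):
    """Members official on `valid_on`, using only changes recorded by `known_on`."""
    candidates = [
        c for c in changes
        if c.territory == territory
        and c.recorded_on <= known_on
        and c.valid_from <= valid_on
    ]
    if not candidates:
        return None
    # The latest official version wins; ties go to the latest recording.
    return max(candidates, key=lambda c: (c.valid_from, c.recorded_on)).members
```

Asking about the same territory with different pairs of dates gives different answers: a change that has been recorded but is not yet official is ignored for current reporting, yet is visible to anyone planning for the period in which it takes effect.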

So, big data and babies aside, it doesn't take much to remind us that an understanding of the time dimension (when changed and when official) in the context of "same thing, different name" and "same name, different meaning" lies at the core of having confidence in the comparability of any data, big or small.

I'm amassing a collection of stories about how failing to understand "same thing, different name" and/or "same name, different meaning" has led to disastrous, humorous, or otherwise notably incorrect conclusions. Let me know if you have some examples to share.