• Conrad Chuang

Key Master Data Facets: Points of Entry

Over the past two months we've been covering aspects of the master data management problem that prospects tend to overlook. So far we've covered modeling (part 1, part 2), relationships, time, and context. Today we're focusing on point of entry.

By point-of-entry we're referring to an approach to data quality. In point-of-entry approaches, you apply data quality rules where your data enters the platform, or at the point of entry. This is one of several approaches to data quality. I've heard the approaches to data quality described in this way:

  • Clean-the-ocean

  • Clean-the-river

  • Clean-the-rain

Clean-the-ocean projects are intended to be performed once. You have a pool of bad data and the goal is to match, merge and de-duplicate that large volume of information. The hope is that after the cleansing project is complete, you'll have a baseline that you build upon going forward; maintaining quality requires a point-of-entry program.

Clean-the-river and clean-the-rain approaches are both point-of-entry techniques. I've heard people differentiate rivers from rain in different ways. Some people posit the split is between governed vs. un-governed approaches; others assert that it's batch vs. record.

I think they're both right. In fact here's how I'd define the different categories.

  • Governed: Master data that requires multiple individuals to create and approve the content. Master data that is subject to stewardship, where individuals evaluate and approve new records, falls into this category.

  • Ungoverned: Master data that can be automatically vetted through business rules. Master data that is subject to survivorship processes, where an automated system uses a set of heuristics to make decisions without human involvement, falls into this category.

  • Batch: Multiple records arrive for processing at the same time. The records could be conveyed in any format (XML, text files). Because batch methods bundle multiple creates, updates or deletes, they are usually not thought of as real-time methods.

  • Record: A single record is processed. Because the record is a single create, update or delete, it is often used as part of a real-time approach.
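To make the batch/record distinction concrete, here's a minimal sketch in Python. The function names, fields and rules are invented for illustration; the point is simply that the same point-of-entry quality rules can sit behind both a single-record (real-time) path and a batch path:

```python
# Hypothetical sketch: one set of point-of-entry quality rules, two entry paths.
# All names (validate_record, process_batch, the fields) are illustrative.

def validate_record(record: dict) -> list:
    """Apply point-of-entry quality rules to a single record (real-time path).

    Returns a list of rule violations; an empty list means the record passes.
    """
    errors = []
    if not record.get("name"):
        errors.append("missing name")
    if "@" not in record.get("email", ""):
        errors.append("invalid email")
    return errors

def process_batch(records: list) -> dict:
    """Apply the same rules to a batch (e.g. a nightly file load),
    splitting the input into accepted and rejected records."""
    accepted, rejected = [], []
    for record in records:
        (rejected if validate_record(record) else accepted).append(record)
    return {"accepted": accepted, "rejected": rejected}
```

Either path can then feed governed (a steward reviews the rejects) or ungoverned (the rejects are auto-corrected or dropped by rule) downstream handling.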

2×2 Matrix of Point-of-Entry Data Quality options

If we create a simple 2×2 matrix with all those options, we can account for most of the point-of-entry quality use cases that I've seen in our customer base. Here's that matrix with some examples:

                Batch                                Record
  Governed      Stewards review a file of            A steward reviews and approves
                proposed changes before loading      each new record as it arrives
  Ungoverned    Survivorship rules applied           Business rules validate each
                automatically to a file load         record in real time

While I am a big fan of ungoverned (automated) processes, I'm also well aware of their limits. Many business rules engines are finite state machines, meaning that you need to explicitly account for all of the possibilities ahead of time. While this might work for simple rules, these systems struggle when you start to deal with multiple contingencies and deeply nested conditionals. Also, there will be situations in which the volume does not justify an automated solution. If you're processing a handful of updates to the chart of accounts on a weekly basis, will the time and money you invest in building, testing (and updating) those business rules ever achieve positive ROI? Finally, there's the rate of change of the rules. If you're in an industry where the rules change often, encoding, updating and managing all those business rules over time is a considerable investment.
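To illustrate why ungoverned survivorship requires enumerating the possibilities up front, here is a hypothetical sketch of a tiny survivorship heuristic. The source-precedence ordering and field names are invented for illustration; a real rules engine has to encode many more contingencies, which is exactly where the approach gets expensive:

```python
# Hypothetical survivorship sketch: for each field across a set of duplicate
# records, keep the value from the most trusted source that supplies one.
# Every contingency must be encoded explicitly ahead of time.

SOURCE_PRECEDENCE = ["crm", "erp", "web"]  # invented ordering, most trusted first

def survive(duplicates: list) -> dict:
    """Merge duplicate records into one surviving record, field by field,
    using the fixed source-precedence rule above."""
    merged = {}
    fields = {f for rec in duplicates for f in rec if f != "source"}
    for field in fields:
        for source in SOURCE_PRECEDENCE:
            for rec in duplicates:
                if rec.get("source") == source and rec.get(field):
                    merged[field] = rec[field]
                    break
            if field in merged:
                break
    return merged
```

Even this toy rule already has implicit edge cases (two records from the same source, empty values, fields no source supplies), each of which would need its own explicit handling in production.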

Questions to consider

1. Do I have a point-of-entry strategy?

We've seen customers focus intently on how they will clean up their current pool of bad data (clean-the-ocean). The intensity is often a response to pressure from the business to improve data quality. Sometimes this leads to a situation where not much thought has been given to how they will deal with the ongoing quality process. Without an ongoing quality strategy, entropy takes over and the team is back to cleaning the bad data ocean in no time flat.

2. Do my systems support individual record (realtime) interaction patterns?

Whether or not you can do realtime will often depend on your upstream systems: can they provide the information in realtime? A secondary question is whether realtime matters for your use case. Some of our customers release monotonic versions of their master data (e.g., once a week, month or quarter). Realtime may not be necessary in these kinds of use cases because the results do not need to be put into production immediately.

3. How much governance does my data need?

How many individuals are involved in the authoring, review and approval of your master data? If no individuals are required to review or approve additions and changes, you may not require governance. Even if the content does not require governance, you may choose to use governed workflows because of the complexity of the content. Processing hierarchies or categorization data often falls into this category because judgement calls are required.

Speaking of governance, stay tuned for our next post on governance.