
Deduplicating information

Companies have a seemingly irresistible urge to create another copy of an existing set of data.

By Dr Barry Devlin, Founder and principal, 9sight Consulting
Johannesburg, 09 Jul 2010

What is the biggest information problem? If there were one behaviour within business or IT that could be changed overnight, which would bring the greatest data management benefits? How could companies immediately reduce ongoing information delivery and maintenance costs?

In my experience, IT often has a somewhat dysfunctional approach to information. More often than not, IT is not just supported in this behaviour by business budget owners; it is actually driven to it. Finance directors and marketing managers, who otherwise resist spending R100 on paperclips, are often among the first to indulge in this expensive and destructive behaviour. What am I talking about? Simply the seemingly irresistible urge to create another copy of an existing set of data!

At its most pervasive, this can be seen in the wonderful world of spreadsheets. Perfectly adequate information in the company's business intelligence (BI) system is copied into a spreadsheet, manipulated and mangled, pivoted and prodded until new insights emerge. Of course, this is valid and often valuable business behaviour. The problem is, what happens next? The spreadsheet data and calculations are saved for future use. The copy of the data has become hardened, in terms of structure and often content as well. Future changes in the BI system, especially in structure and meaning, can instantly invalidate this spreadsheet, downstream copies built upon it and the entire decision-making edifice constructed around them. And let's not even mention the effect of an invisible calculation error in a spreadsheet...
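To make that divergence risk concrete, here is a minimal sketch of how a saved extract could be checked against its BI source. It assumes both sides can be exported to CSV; the file names and the Python approach are illustrative assumptions, not a prescription.

```python
# A minimal sketch of checking a saved extract against its BI source,
# assuming both can be exported to CSV. File names are hypothetical.
import csv
import hashlib

def fingerprint(path):
    """Return the header row and a checksum over the data rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        digest = hashlib.sha256()
        for row in reader:
            digest.update(("\x1f".join(row) + "\n").encode("utf-8"))
    return header, digest.hexdigest()

extract_header, extract_sum = fingerprint("saved_extract.csv")
source_header, source_sum = fingerprint("bi_source_export.csv")

if extract_header != source_header:
    print("Structural drift: the BI source no longer matches the copy's layout.")
elif extract_sum != source_sum:
    print("Content drift: same structure, but the data has moved on.")
else:
    print("The copy still matches its source - for now.")
```

Even a check this crude would catch the hardened spreadsheet whose source has quietly changed underneath it.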

Let's move up a level. Marketing wants to do the latest gee-whiz analysis of every click-through pattern on the company's Web site since 2000. Vendor X has the solution - a new data warehouse appliance offering query speeds 150 times faster than the existing warehouse. It's a no-brainer. Marketing is happy with its innovative campaigns, and even finance signs off on the clear return on investment delivered by the new machine. Except, of course, that this bright, shiny server requires all of the existing clickstream data to be copied and maintained on an ongoing basis. Who's counting the cost of this additional, ongoing management effort?

The blame game

It's easy to blame business people who, driven by passion for business results and unaware of data management implications, simply want to have the information they need in the most useful form possible... now. IT, of course, would never be guilty of such short-sighted behaviour. Really?

The truth is that IT departments behave in exactly the same way. New applications are built with their own independent databases - to reduce inter-project dependencies, shorten delivery times and so on - irrespective of whether the information already exists elsewhere in the IT environment. Even the widely accepted data warehouse architecture explicitly sanctions data duplication between the enterprise data warehouse (EDW) and dependent data marts, and implicitly assumes that copying (and transforming) data from the operational to the informational environment is the only way to provide decision support. Even independent data marts - fed directly from the operational environment rather than the EDW - are accepted, however reluctantly.

In most businesses and IT departments, it doesn't take much analysis to get a rough estimate of the costs of creating and maintaining these copies of data. Beyond the hardware and software costs - often relatively small these days - are the staff costs of finding and analysing data duplicates, tracking down inconsistencies and fire-fighting when a discrepancy becomes evident to business management. On the business side are similar ongoing costs of trying to manage copies of data, but by far the most telling are the costs of lost opportunities or mistaken decisions when the duplicated data has diverged from the truth of the properly managed and controlled centralised data warehouse.
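The arithmetic need not be sophisticated to be sobering. The sketch below totals the annual cost of a single maintained copy; every figure in it is an illustrative assumption, to be replaced with your own numbers.

```python
# A back-of-the-envelope estimate of one data copy's annual cost.
# All figures below are illustrative assumptions, not benchmarks.
hardware_and_software = 50_000      # rand per year: extra storage and licences
sync_hours_per_month = 10           # staff time keeping the copy loaded and in step
reconciliation_hours_per_month = 8  # staff time chasing down discrepancies
hourly_rate = 600                   # rand per staff hour, fully loaded

annual_staff_cost = (sync_hours_per_month
                     + reconciliation_hours_per_month) * 12 * hourly_rate
total = hardware_and_software + annual_staff_cost
print(f"Annual cost of this one copy: R{total:,}")  # R179,600 with these inputs
```

Multiply that by the number of independent marts and hardened spreadsheets in circulation, and the business case for trimming down writes itself - before counting a single mistaken decision.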

Trimming down

So, assuming that companies would like to reduce some of these costs, here are four behavioural changes to implement to improve data management and reduce information duplication in an organisation:

1. Instigate a “lean data” policy across the organisation and educate both business users and IT personnel in its benefits. Although some data duplication is unavoidable, this policy ensures the starting point of every solution is the existing data resource; a simple duplicate-data audit, sketched after this list, can supply the initial inventory.
2. Revisit existing data marts with a view to either combining marts with similar content or absorbing marts back into the EDW. Improvements in database performance since the marts were originally defined may enable the same solutions without duplicate data.
3. Define and implement a new policy regarding ongoing use or re-use of spreadsheets. When the same spreadsheet has been used in a management meeting three times in succession, for example, it should be evaluated by IT for possible incorporation of its function into the standard BI system.
4. Evaluate new database technologies to see if the additional power they offer could allow a significant decrease in the level of data duplication in the data warehouse environment.
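
As a starting point for point 1, here is a minimal sketch of a duplicate-data audit. It fingerprints candidate tables so that byte-identical copies hash alike; the sqlite3 connections and table names are hypothetical stand-ins for whatever databases the environment actually runs.

```python
# A minimal "lean data" audit sketch: fingerprint candidate tables so that
# identical copies hash alike. Database files and table names are hypothetical.
import hashlib
import sqlite3
from collections import defaultdict

def table_fingerprint(conn, table):
    """Checksum a table's rows in a stable order so copies compare equal."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY 1"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

# Hypothetical (database file, table) pairs suspected of overlapping.
candidates = [
    ("warehouse.db", "customers"),
    ("mart_a.db", "customers"),
    ("mart_b.db", "customer_copy"),
]

matches = defaultdict(list)
for db_file, table in candidates:
    with sqlite3.connect(db_file) as conn:
        matches[table_fingerprint(conn, table)].append(f"{db_file}:{table}")

for locations in matches.values():
    if len(locations) > 1:
        print("Identical data held in:", ", ".join(locations))
```

A list of confirmed duplicates is exactly the evidence needed to argue for absorbing a mart back into the EDW, as point 2 suggests.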

Happy deduplicating!
