Why do most data warehouse developers shy away from documenting the work they put so much effort into creating?
Too many developers believe the programs they write are self-explanatory, and that any additional text needed to understand what they have programmed can be found inside the code. That premise holds from a debugging point of view; however, it carries no weight when it comes to helping users understand the complexity and dependencies of the vast number of programs that make up a data mart solution within a business intelligence (BI) system.
Another favourite 'excuse' for not producing documentation is that developers believe it is all available in the metadata. In a way this is true, but getting to it can be very hard: one has to know where to look, and might have to scratch through millions of lines of code to find it.
Therefore, perhaps we should take a fresh look at the data warehouse documentation that is needed, and start identifying those documents that add real value to a project, rather than those that get done merely as part of some BI project methodology or because someone thinks they might be 'cool' to have. In this way, it will be more acceptable for developers to participate and produce genuinely usable documentation that will make a huge difference in understanding and maintaining a system.
Essential
René Muiyser is principal BI consultant at PBT.
Let's start by identifying the technical documentation that is of utmost importance when a data mart is handed over to maintenance and support. At some large companies, which operate through separate departments for quality assurance, program management, project management, life cycles and correspondence, for example, a single data mart deliverable can involve dozens of different documents. And this covers only the documentation delivered as part of the solution, not the project management documents and other documents needed during the analysis, development, testing and implementation phases.
The reality, however, is that only four technical documents are really needed. If these are properly populated, any resource who understands the concepts of data warehousing, the relational database and the ETL software will be able to support and maintain the delivered data warehouse solution with little effort.
These documents include:
1. Architecture document
This document describes the proposed architecture for the entire solution. It describes the hardware infrastructure and the physical design of the architecture, which includes the RDBMS and ETL software. It also covers subject areas such as database backup, security and performance. There should be a single document for the total business intelligence solution.
2. Logical design document
This document describes the high-level specification of the recommended solution and must address and satisfy all business requirements. It should not contain too much technical detail, but rather reflect a logical solution that can be presented to, and understood by, the user. It should incorporate the star schema solution and the data flow processes.
3. Source-to-target mapping spreadsheet
This document eliminates any confusion as to how data is transformed as the data items are moved from the source systems to the data mart. It maps each source field in each source system to the appropriate target field in the data mart schema. It also clearly documents all business rules that govern how data values are integrated or split up.
4. ETL process document
This document describes in detail the extract, transform and load processes that move the source data into the respective data mart. It also provides information on the order of ETL program execution, as well as the batch and error logging process.
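To make the source-to-target mapping in item 3 more concrete, the following is a minimal sketch of how the rows of such a spreadsheet might be represented and applied in Python. All system, table and field names (`crm`, `billing`, `dim_customer`, `fact_sales` and so on) and the business rules are hypothetical, invented purely for illustration; a real mapping would come from the project's own spreadsheet.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class FieldMapping:
    """One row of a source-to-target mapping spreadsheet (hypothetical layout)."""
    source_system: str
    source_field: str
    target_table: str
    target_column: str
    rule: str                         # business rule, in plain language
    transform: Callable[[str], Any]   # executable form of the same rule


# Hypothetical mapping rows; names and rules are illustrative assumptions.
MAPPINGS: List[FieldMapping] = [
    FieldMapping("crm", "cust_nm", "dim_customer", "customer_name",
                 "Trim whitespace and title-case",
                 lambda v: v.strip().title()),
    FieldMapping("crm", "cust_cntry", "dim_customer", "country_code",
                 "Trim whitespace and upper-case",
                 lambda v: v.strip().upper()),
    FieldMapping("billing", "amt_cents", "fact_sales", "amount",
                 "Convert cents to currency units",
                 lambda v: int(v) / 100),
]


def apply_mappings(source_system: str,
                   record: Dict[str, str],
                   mappings: List[FieldMapping] = MAPPINGS) -> Dict[str, Dict[str, Any]]:
    """Return {target_table: {target_column: value}} for one source record."""
    result: Dict[str, Dict[str, Any]] = {}
    for m in mappings:
        # Only apply rows for this source system whose field is present.
        if m.source_system == source_system and m.source_field in record:
            value = m.transform(record[m.source_field])
            result.setdefault(m.target_table, {})[m.target_column] = value
    return result


if __name__ == "__main__":
    print(apply_mappings("crm", {"cust_nm": "  jane doe ", "cust_cntry": "za"}))
```

Keeping the plain-language rule next to its executable form is one possible design choice: the same rows can then drive the load and be exported as the spreadsheet that support teams read.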
All possibilities have been explored to minimise the amount of documentation, and these four were found to be non-negotiable as deliverables of a data warehouse project. One might even find some duplication between these documents, owing to their different usages and audiences. From a technical perspective, the data could be explained at different levels, i.e., at table level, column level or both.
Everybody knows data warehouse documentation is a time-consuming process, but some documentation has to be available. The technical documentation to be delivered as part of a BI project should be clearly indicated in the scope of work when it is negotiated with the client. These documents will not only make life easier for maintenance and support teams, but will also ease the handover and skills transfer to internal resources if the development work was outsourced to BI consultants.