By now, every data-driven business understands the value and role of a data warehouse as the central repository of an enterprise’s data, providing a trusted source of data that can be used for analysis, reporting and analytics across the organisation.
Some of the key characteristics as well as challenges of a data warehouse today include:
- Data integration: Integrating different types of data across multiple data formats, platforms and technologies.
- Historical data views: Providing reliable data lineage to facilitate point-in-time reporting, trend analysis and predictions.
- Master and reference data: No data analysis or reporting can have value without a clear understanding of the underlying data, as well as a common vocabulary.
- Data quality: The importance of data quality for a data warehouse cannot be overstated. The quality of reports and analytics is only as good as the quality of the underlying data.
- Optimisation: Restructuring incoming data, originally optimised for data transfer, bulk storage or OLTP performance, so that it is instead optimised for OLAP (a simple sketch of this reshaping follows this list).
- Competing methodologies: Choosing the most appropriate data warehouse design methodology. Traditionally this was a choice between Inmon and Kimball; more recently it also includes options such as Data Vault, as well as newly emerging approaches for cloud and big data implementations.
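As a simple illustration of the optimisation challenge above, the sketch below reshapes normalised, OLTP-style rows into a star-schema fact table with surrogate dimension keys and pre-computed measures. It is a minimal, hypothetical example: the orders, customers, dim_customer and fact_sales structures are invented purely for illustration.

```python
from datetime import date

# Normalised OLTP source data: separate entities joined by natural keys.
customers = {101: {"name": "Acme Ltd", "region": "EMEA"}}
products = {55: {"sku": "WIDGET-9", "category": "Hardware"}}
orders = [
    {"order_id": 9001, "customer_id": 101, "product_id": 55,
     "order_date": date(2021, 3, 14), "qty": 3, "unit_price": 25.0},
]

# Star-schema targets, optimised for analytical (OLAP) queries.
dim_customer, dim_product, fact_sales = {}, {}, []

def surrogate_key(dimension: dict, natural_key, attributes: dict) -> int:
    """Assign (or reuse) a surrogate key and store the dimension attributes."""
    if natural_key not in dimension:
        dimension[natural_key] = {"key": len(dimension) + 1, **attributes}
    return dimension[natural_key]["key"]

for o in orders:
    # Denormalise each order into a fact row keyed on conformed dimensions,
    # with measures pre-computed for aggregation.
    fact_sales.append({
        "customer_key": surrogate_key(dim_customer, o["customer_id"], customers[o["customer_id"]]),
        "product_key": surrogate_key(dim_product, o["product_id"], products[o["product_id"]]),
        "date_key": int(o["order_date"].strftime("%Y%m%d")),
        "quantity": o["qty"],
        "revenue": o["qty"] * o["unit_price"],
    })

print(fact_sales)
```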
Time and consistent investment
As you would expect from reading these items, implementing a data warehouse is a complex process that needs to be thoroughly understood and mapped out. There are numerous factors and criteria to be considered, and many scenarios that can and will increase the complexity of the implementation.
As a result, the analysis, design, development, testing and implementation of a data warehouse is, for most businesses, a lengthy process. For a large corporate, three years is not an unreasonable length of time for the development of an enterprise data warehouse.
Even at the end of this period, with a fully developed data warehouse, the development does not end. The business continues to change: transforming digitally, implementing new systems, upgrading old ones, forming new divisions, merging with other companies, restructuring existing ones. All of these changes affect the data warehouse.
This requires a continuous, significant investment in the data warehouse, to keep it current and in sync with the current state of the business.
Over and above this, the data warehouse also requires constant maintenance and monitoring, so that day-to-day support and operations keep its data constantly available for use and the data warehouse delivering value.
Considering the problem of metadata
The data warehouse life cycle involves defining business requirements, information requirements, information models, source to target mapping specifications, physical model implementation, solution design and standards creation, ETL development, testing, logging, deployment between environments, and finally maintenance and support.
The key takeaway is that there are numerous steps along the software development life cycle (SDLC). Each of these steps gathers its own metadata, and the further one goes down the list of activities, the greater the volume of metadata generated.
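To make this concrete, the sketch below shows one slice of that metadata: a source-to-target mapping specification captured as structured data rather than buried in a spreadsheet. All table, column and transformation names are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class ColumnMapping:
    """One row of a source-to-target mapping specification."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transformation: str = "direct"  # e.g. "direct", "trim", "lookup(...)"

# A hypothetical mapping for a customer dimension.
mapping_spec = [
    ColumnMapping("crm.customer", "cust_no", "dim_customer", "customer_natural_key"),
    ColumnMapping("crm.customer", "cust_name", "dim_customer", "customer_name", "trim"),
    ColumnMapping("crm.customer", "country", "dim_customer", "region_code", "lookup(ref_region)"),
]

for m in mapping_spec:
    print(f"{m.source_table}.{m.source_column} -> "
          f"{m.target_table}.{m.target_column} [{m.transformation}]")
```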
Once deployed, the solution generates still more metadata in the form of audit logs and metrics that record the performance and health status of the solution.
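That run-time metadata might take a shape like the record below: one audit entry per load run, capturing volumes, timings and outcome. The job and field names are illustrative assumptions, not a standard.

```python
from datetime import datetime, timezone

# One illustrative audit record written by a (hypothetical) dimension load job.
audit_record = {
    "job_name": "load_dim_customer",
    "run_started": datetime(2021, 3, 14, 2, 0, tzinfo=timezone.utc).isoformat(),
    "run_finished": datetime(2021, 3, 14, 2, 4, tzinfo=timezone.utc).isoformat(),
    "rows_read": 120_000,
    "rows_inserted": 1_850,
    "rows_updated": 320,
    "rows_rejected": 4,
    "status": "SUCCEEDED",
}

print(audit_record)
```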
The challenge this poses to the data warehouse is that all this metadata is disparate and disjointed. It can reside in spreadsheets, documents, collaboration portals, or within various tools: ETL tools, testing tools, even metadata management tools.
In fact, this metadata often gathers like an avalanche, gaining volume and momentum, yet with only rare exceptions it remains disjointed and purely descriptive in nature.
The metadata itself never serves any higher, functional purpose. It represents a sunk cost, delivering little real value to the business.
Change the paradigm
Data warehouse automation, however, can flip this scenario completely on its head. It challenges the status quo, asking why expensive human resources must repeatedly perform the same actions.
It challenges the wastefulness of gathering and storing metadata that serves no higher purpose.
With data warehouse automation, optimisation, efficiency and reuse become the new status quo within the data warehouse.
The concept of “sweating the asset” becomes the new normal. Metadata that is generated is centralised and used, in conjunction with standards, templates, frameworks and utilities, to optimise and automate as much of the data warehouse life cycle as possible.
In this new paradigm, metadata is no longer simply gathered and stored. The metadata used to define the requirements and physical properties of the solution is instrumental in dynamically generating the resulting solution.
This is a model-driven development approach: the design of the target solution, together with its associated metadata, is combined with best practice standards and patterns to dynamically generate the required solution.
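As a minimal sketch of what that generation can look like (continuing the hypothetical mapping metadata from earlier, and assuming nothing about any particular tool), the design metadata is pushed through a simple template to emit working DDL; a real automation platform would of course cover far more patterns.

```python
# Target-side mapping metadata: (target_table, target_column, data_type).
# All names and types are hypothetical.
mapping_spec = [
    ("dim_customer", "customer_key", "INTEGER"),
    ("dim_customer", "customer_natural_key", "VARCHAR(30)"),
    ("dim_customer", "customer_name", "VARCHAR(200)"),
    ("dim_customer", "region_code", "VARCHAR(10)"),
]

# A best-practice pattern captured once as a template and reused for every table.
TABLE_TEMPLATE = "CREATE TABLE {table} (\n    {columns}\n);"

def generate_ddl(spec) -> str:
    """Generate CREATE TABLE statements directly from the mapping metadata."""
    tables = {}
    for table, column, data_type in spec:
        tables.setdefault(table, []).append(f"{column} {data_type}")
    return "\n\n".join(
        TABLE_TEMPLATE.format(table=t, columns=",\n    ".join(cols))
        for t, cols in tables.items()
    )

print(generate_ddl(mapping_spec))
```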
In this paradigm, no metadata is wasted, and all metadata is integrated, shared and reused, enabling multiple use cases along the SDLC.
More and more, corporates are choosing to either buy, or build in-house, data warehouse automation capability. What is clear is that those corporates that commit to this paradigm will have far more efficient, stable and higher-quality solutions delivered at a fraction of the cost.
Watch out for my next article, where I will walk you through the details of how data warehouse automation works in the real world.