Data sharing is surfacing more and more within the business intelligence (BI) community. As this mostly happens not at an enterprise level, but at a business or functional level, it has become apparent that the data sharing process needs to be guided, on a technical level, to form a sharable and synchronised extraction and loading process.
To enable data sharing, one must develop and abide by a common set of policies, procedures and standards governing data management and access for both the short and the long term. It is also necessary to develop standard data models, data elements and other metadata that define this shared environment, as well as a repository system for storing related metadata so that it is easily accessible.
Sharing data, however, implies an educational task: ensuring that all users within the enterprise understand the relationship between the value of data, the sharing of data, accessibility to data, and the caution needed not to misinterpret information. Access to data does not mean anybody is allowed to modify or disclose the data. There must be an education process in place, as well as a change in the organisational culture regarding this process.
Open secret
Further to this, one of the main focus points regarding data sharing must be data privacy and security policies. Although these two policies appear to run counter to the whole data sharing principle, it is possible to provide adequate access to open information while keeping secure information protected, by identifying and developing security needs at the data level. This is done by implementing measures designed for the protection and privacy of personally identifiable information.
Open sharing of information and the release of information via relevant legislation must be balanced against the need to restrict the availability of classified, proprietary, and sensitive information.
Having said all of the above, sharing of data must start at the source systems and the extraction into the staging area of an enterprise data warehouse (EDW). Master data management (MDM) plays a major role here to identify the “single version of the truth”. Although duplicate data could be extracted from different source systems, the MDM will eventually reduce this to the absolute minimum.
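The MDM matching step described above can be sketched in a few lines. This is a minimal, illustrative example only: the matching key (a national ID), the field names and the "most recently updated wins" survivorship rule are all hypothetical assumptions, not a prescription for any particular MDM product.

```python
# Illustrative MDM-style matching sketch: collapse duplicate records
# extracted from different source systems into one "golden" record per
# matching key. Field names and the survivorship rule are assumptions.

def build_golden_records(records):
    """Keep one record per matching key, preferring the most recently
    updated source record (ISO dates compare correctly as strings)."""
    golden = {}
    for rec in records:
        key = rec["national_id"]  # hypothetical matching key
        current = golden.get(key)
        if current is None or rec["updated"] > current["updated"]:
            golden[key] = rec
    return list(golden.values())

# Two source systems supplying overlapping customer data
crm = [{"national_id": "123", "name": "J. Smith", "updated": "2011-03-01"}]
billing = [
    {"national_id": "123", "name": "John Smith", "updated": "2011-05-14"},
    {"national_id": "456", "name": "A. Jones", "updated": "2011-04-02"},
]

masters = build_golden_records(crm + billing)
# one record per customer survives, regardless of how many sources fed it
```

In practice the matching rule is rarely a single exact key, but the principle is the same: duplicates may be extracted, yet the MDM layer reduces them to the absolute minimum before the data is shared downstream.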
Data privacy and security policies do not always relate to the source data loading routines, which load source system data into the staging area of the EDW, but rather to the downstream data loads into enterprise data marts with sharable dimensions.
Guidelines
Rene Muiyser is principal consultant at PBT.
The following guidance is around the loading of these enterprise dimensions, where data needs to be shared between different business/functional units, within the EDW environment, and where the data could possibly already reside for a certain business or functional data mart in the EDW.
These guidelines should be applied during the solution designing of the ETL process in a scenario as described above:
* Use existing feed
* If all the required data (including all fields, transformations, grain, etc) resides in a data mart, split the relevant ETL process into two parts: a single extraction, followed by a multiple-load process into the respective business/functional data marts, when the data is needed in more than one dimension.
* If only certain fields are missing and the original source feed has no business need for the additional fields, add the fields to the extraction and split the load into two parts to load only the required fields into each respective business/functional data mart dimension.
* The actual splitting of the feed into two separate streams could be done before the original transformation is applied (separate transformations) or after the transformation step, if the data needs to look exactly the same.
* Use new feed
* If the above solution is not achievable due to completely different requirements of the same data (different view, grain, etc) then the option would be to write a new extract from the staging area.
* The only remaining requirement would be to switch back to the original feed once the original extraction process has been changed in such a way that a single feed can provide data to both target data marts.
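The "use existing feed" pattern above can be sketched as a single extraction with a shared transformation and a split load. This is only an illustrative sketch: the staging data, field names and the two target dimensions are hypothetical, and a real implementation would live in an ETL tool rather than plain Python.

```python
# Sketch of the "single extraction, multiple load" pattern: extract once
# from staging, apply the shared transformation, then split the stream so
# each data mart dimension receives only the fields it requires.
# All table and field names below are hypothetical.

def extract_from_staging():
    """Single extraction covering the superset of fields both marts need."""
    return [
        {"cust_id": 1, "name": "J. Smith", "region": "West", "segment": "Retail"},
        {"cust_id": 2, "name": "A. Jones", "region": "East", "segment": "Corporate"},
    ]

def transform(rows):
    """Shared transformation applied before the split, so both target
    dimensions see exactly the same data."""
    return [dict(row, name=row["name"].upper()) for row in rows]

def load(rows, fields, target):
    """Load only the fields a given business/functional dimension needs."""
    target.extend({f: row[f] for f in fields} for row in rows)

sales_dim, finance_dim = [], []
rows = transform(extract_from_staging())                 # extract and transform once
load(rows, ["cust_id", "name", "region"], sales_dim)     # first mart's load
load(rows, ["cust_id", "name", "segment"], finance_dim)  # second mart's load
```

If the two dimensions instead needed different transformations, the split would move before the transformation step, as the guideline above describes.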
Issues such as development time and cost will always play a role in the design of any solution, but one needs to keep in mind the long-term benefit for the overall BI audience - whether it is the source or target system owners. This does not mean one should lose focus on the initial business requirements, or build a solution based on cost and development time alone. There will always have to be some sort of trade-off between a business and a technical solution in this regard.
Data sharing plays a crucial role in the overall process of BI. It is for this reason that the process of data sharing needs to be undertaken in the correct manner - ensuring a smooth process, and importantly, an effective outcome. Taking the above into consideration will assist a business in getting this process right.