
The ETL framework

An extraction, transformation and loading framework originates from difficulties experienced with BI projects.

By René Muiyser, Principal BI consultant at PBT Group.
Johannesburg, 09 Jan 2012

Based on experiences over the past 15 years, an extraction, transformation and loading (ETL) framework normally emanates from difficulties that arise within business intelligence (BI) projects: hindrances or inefficiencies experienced with "off the shelf" products, and/or the need for standard processes within a specific environment to assist in delivering error-free, end-to-end ETL processes in the data warehouse.

One of the main advantages seen in implementing such a framework is that it deconstructs the ETL process into a number of high-level processes, each with well-defined objectives.


The data is taken through a structured journey with major checkpoints along the way. Keep in mind that the processes should be built around a methodology that enables data to be extracted from a source system, translated, validated, integrity-checked, audited and easily loaded into a target system to meet the set objectives.
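The article does not prescribe a technology, but the checkpointed journey above can be loosely sketched in code. In this illustration (all stage names, functions and sample rows are invented for the example), each stage transforms the data and a checkpoint records the row count for auditing:

```python
def run_pipeline(rows, stages):
    """Pass data through each named stage; record a checkpoint after each."""
    checkpoints = []
    for name, fn in stages:
        rows = fn(rows)
        checkpoints.append((name, len(rows)))  # audit the row count at this checkpoint
    return rows, checkpoints

# Illustrative stages: translate source codes, validate, and integrity-check
# against a known reference set.
translate = lambda rows: [
    {**r, "status": {"A": "active", "I": "inactive"}.get(r["status"], r["status"])}
    for r in rows
]
validate = lambda rows: [r for r in rows if r.get("id") is not None]
known_ids = {1, 2, 3}
integrity = lambda rows: [r for r in rows if r["id"] in known_ids]

source = [{"id": 1, "status": "A"}, {"id": None, "status": "I"}, {"id": 9, "status": "A"}]
target, audit = run_pipeline(
    source,
    [("translate", translate), ("validate", validate), ("integrity", integrity)],
)
# target holds only the valid, known rows; audit records row counts per checkpoint
```

The checkpoint log is what gives each high-level process its well-defined, verifiable objective.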

ETL tips

In line with this, a framework is intended to add rigour and structure to ETL designs. As such, below I have outlined the basic approaches businesses should aim to include when designing a framework:

* Add consistency to the design approach.
* Supplement development in the database environment.
* Implement the framework from a process point of view.
* Satisfy process requirements so that processes are not prone to failure.
* Create processes that can scale to larger data volumes and accommodate changes to the underlying system architecture.
* Ensure the framework can change easily to support changes in business requirements.
* Supply process status reporting that gives a user an understanding of the basic health and condition of the environment.
* Optimise the transportation of data from source systems to exploitation areas, and improve and enhance all supporting processes.
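The status-reporting point in the list above can be made concrete with a small sketch. This is an invented, minimal example (the log structure and function names are assumptions, not from any specific product): each task records its outcome, and a summary gives a basic view of the environment's health:

```python
from datetime import datetime

# Hypothetical process-status log: one entry per executed ETL task.
status_log = []

def record(task, outcome):
    """Record the outcome of one ETL task with a timestamp."""
    status_log.append({"task": task, "outcome": outcome, "at": datetime.now()})

def health():
    """Summarise basic environment health: a count per outcome."""
    summary = {}
    for entry in status_log:
        summary[entry["outcome"]] = summary.get(entry["outcome"], 0) + 1
    return summary

record("load_customer", "success")
record("load_orders", "failed")
# health() -> {"success": 1, "failed": 1}
```

Even a summary this simple tells an operator at a glance whether the overnight loads need attention.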

Ideally, the back-end of the framework should be implemented in a database language, and the front-end should preferably be part of the same technology stack and database calendar, so that tasks can be scheduled accordingly.

As mentioned previously, the purpose of the framework is to implement standard methodologies by using templates to generate generic sections of code, leaving developers free to focus their attention on the custom aspects of the task at hand. The framework, with its built-in routines and templates, must manage best-practice aspects such as generating statistics and implementing auditing procedures, with little or no impact on the ETL developer's time.

The template is king

In summary, a framework must be template-driven, allowing for standardisation, re-use and the ability to implement global changes quickly and easily. Developers should typically use predefined templates, within a wizard-driven interface, to assist with the creation of standardised tasks. The ETL code will then be generated from these task definitions.

If any standard related to a particular task type changes, the necessary changes are made to the template and the code is regenerated for all relevant tasks automatically, applying the new standard to every task created from that template. Developers therefore no longer need to manually locate and update each instance of code affected by a change to a standard. The core of the framework should contain metadata-driven templates that are used to run ETL tasks in the data warehouse.
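As a loose sketch of this template-driven regeneration (the task metadata, SQL shape and `audit` call are invented for the example; a real framework would be far richer), each task is pure metadata and the code is generated from the template, so changing the template once updates every task:

```python
from string import Template

# Hypothetical task metadata: each task is just a set of values,
# and the ETL code is generated from the template on demand.
tasks = [
    {"target": "dw.customer", "source": "stg.customer"},
    {"target": "dw.orders", "source": "stg.orders"},
]

def generate(template, tasks):
    """Generate the ETL code for every task from one shared template."""
    return [template.substitute(t) for t in tasks]

template = Template("INSERT INTO $target SELECT * FROM $source;")
code = generate(template, tasks)

# A standard changes (say, every load must now be audited): edit the
# template once and regenerate the code for all tasks automatically.
template = Template("INSERT INTO $target SELECT * FROM $source; CALL audit('$target');")
code = generate(template, tasks)
```

No task definition was touched; the new auditing standard reached every task through a single template change.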

In order to optimise the re-usability of the framework, the templates should make use of environmental variables. These variables would then be set up for each environment on which the framework can run, and should include information such as database links, schema names, default values, fixed table names, template placeholders and file directories. This allows the same ETL task to be deployed to multiple environments, with all environment-specific aspects defined in environmental variables.
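A minimal sketch of this idea, with invented environment names and values (the schema names, database links and directories here are placeholders, not real settings): the task definition stays identical, and deployment just substitutes that environment's variables:

```python
from string import Template

# Hypothetical per-environment variables: one set per environment
# on which the framework can run.
environments = {
    "dev":  {"schema": "dw_dev",  "db_link": "devdb",  "file_dir": "/data/dev"},
    "prod": {"schema": "dw_prod", "db_link": "proddb", "file_dir": "/data/prod"},
}

# One task definition, shared by all environments.
task = Template("INSERT INTO $schema.customer SELECT * FROM stg.customer@$db_link;")

def deploy(task, env):
    """Resolve a task for one environment by substituting its variables."""
    return task.substitute(environments[env])

dev_sql = deploy(task, "dev")
prod_sql = deploy(task, "prod")
```

Everything unique to an environment lives in its variable set, so promoting a task from development to production requires no edits to the task itself.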

Full-featured ETL tool prices are generally all over the map. Unsurprisingly, what has been noticed lately is that businesses aim to invest purely in something suited and beneficial to their particular environment. On the odd occasion, companies have the solution, expertise and presence to successfully execute several BI and performance measurement plans. A framework can be tailored specifically to the job at hand, and does not need weeks or months of customisation.

However, it needs to supplement the client's software investment with an affordable and workable ETL solution, reducing the time, cost and effort required to deliver BI projects successfully.
