In my previous article, I discussed some of the potential alternatives to data warehouses. The key point I stressed there was that these alternatives should still be considered part of the greater data warehouse, and that they need to complement and enable the holistic data warehouse, contributing to a single enterprise solution.
In this article, I continue with the discussion and focus on another data warehouse alternative: the data lakehouse.
The data lakehouse is a hybrid architecture that implements the best concepts of both the data warehouse and the data lake methodologies, while avoiding their respective drawbacks.
What are the cons of a data lake? Data lakes grow rapidly, resulting in a proliferation of data whose access and usage can be difficult to manage and control. This can lead to governance and privacy issues.
The implementation of data lakes led to businesses using them as the new source for the data warehouse. Since these two methodologies were based on fundamentally different technologies, massive volumes of data had to be transferred between the two environments, requiring additional processing and placing extra strain on potentially overburdened networks. This also led, once again, to redundant data storage, as the data was held in both the data lake and the landing zone of the data warehouse.
The data lakehouse architecture implements both the data warehouse and the data lake on the same platform. Big data platforms have evolved in recent years. Enhancements to SQL query engines such as Hive and Impala have resulted in increased support for the data management features that a data warehouse relies on.
As a result of this evolution of capability and maturity in the big data platforms, it became apparent that big data platforms could serve both data lake AND data warehouse requirements.
In a data lakehouse, data can be rapidly ingested into a data lake. This data can be stored in raw format, then structured, curated, audited, etc., as per all the normal requirements of a data lake. The data is then immediately made available to data scientists and data analysts.
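To make that flow concrete, here is a minimal PySpark sketch of the "land raw, then curate" pattern. The paths, column names and types are hypothetical, and the choice of Spark and Parquet is simply one plausible stack, not a prescribed implementation.

```python
# Minimal sketch: land a source extract in the raw zone, then curate it.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-ingest").getOrCreate()

# 1. Land the extract exactly as received, in the raw zone of the lake.
raw = spark.read.json("s3a://lake/raw/orders/2024-05-01/")
raw.write.mode("append").parquet("s3a://lake/raw/orders_parquet/")

# 2. Curate: enforce types, derive a partition column, drop obvious junk.
curated = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .dropna(subset=["order_id"])
)
curated.write.mode("overwrite").partitionBy("order_date").parquet("s3a://lake/curated/orders/")
```

The curated output is what analysts and data scientists can query immediately, and it is also the input to the warehouse load step described next.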
This data then feeds directly into the data warehouse ETL load routines, to populate additional data structures in the big data platform for the associated dimensions and facts. These data structures benefit from the distribution and replication inherent in the big data platform. In addition, the big data file formats such as Parquet and ORC provide advanced column store capabilities, which further optimise the storage and usage of the data stored.
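As a rough sketch of that load routine, the curated data might be shaped into a conformed dimension and a fact table and written out as Parquet; the table names, join key and paths below are assumptions made for illustration.

```python
# Hypothetical warehouse load: build a dimension and a fact from curated data,
# stored as Parquet so the column-store benefits described above apply.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-load").getOrCreate()

customers = spark.read.parquet("s3a://lake/curated/customers/")
orders = spark.read.parquet("s3a://lake/curated/orders/")

# A simple customer dimension: one row per business key.
dim_customer = (
    customers.select("customer_id", "customer_name", "segment")
             .dropDuplicates(["customer_id"])
)
dim_customer.write.mode("overwrite").parquet("s3a://lake/warehouse/dim_customer/")

# An orders fact, partitioned by date to suit typical warehouse query patterns.
fact_orders = (
    orders.join(dim_customer, "customer_id", "left")
          .select("order_id", "customer_id", "order_date", "amount")
)
fact_orders.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://lake/warehouse/fact_orders/"
)
```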
The data that is loaded into these big data files is then exposed via the data warehouse engines. These engines have now matured and are able to provide the database management and performance features that we require of a data warehouse: for example, caching, indexing, ACID compliance, auditing, schema enforcement, zero-copy cloning and query optimisation.
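A sketch of that exposure step follows, using Spark SQL as a stand-in for whichever engine (Hive, Impala or similar) is actually in play; the database and table names are assumed, and the exact DDL varies by engine.

```python
# Hypothetical exposure step: register the Parquet files as a table so warehouse
# consumers can query it through a SQL engine (Spark SQL here; Hive/Impala DDL differs).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-serve").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS dw")
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.fact_orders
    USING PARQUET
    LOCATION 's3a://lake/warehouse/fact_orders/'
""")

# A typical warehouse-style aggregation against the exposed table.
spark.sql("""
    SELECT order_date, SUM(amount) AS total_sales
    FROM dw.fact_orders
    GROUP BY order_date
    ORDER BY order_date
""").show()
```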
Furthermore, many of these database abstraction technologies provide better support for semi-structured data (such as XML, JSON, etc.) than conventional relational databases do.
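As a small illustration of that semi-structured support, nested JSON can be read and queried in place, without first flattening it into relational columns; the event structure below is entirely assumed.

```python
# Hypothetical clickstream events with nested structs and arrays, queried directly.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-semistructured").getOrCreate()

events = spark.read.json("s3a://lake/raw/clickstream/")  # schema inferred from the JSON

events.select(
    "event_id",
    F.col("device.os").alias("device_os"),   # nested struct attribute
    F.explode("items").alias("item"),        # array of line items, one row per element
).show()
```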
The data warehouse design methodology is not compromised by this implementation. In other words, our data model can still be designed according to conventional data warehouse best practice, aligned to either Ralph Kimball or Bill Inmon.
The benefit is that we get to consolidate our data in a single platform. We process data less; we move data less; we store data less (redundantly). The big data platform is inherently more affordable than enterprise data warehouse infrastructure platforms and associated software. We also benefit from the inherent scalability of the big data platform.
As storage and processing requirements grow, the big data platform’s ability to scale out storage and compute means the data warehouse is no longer held hostage by legacy infrastructure systems.
A final point to note on infrastructure is that the recommended data lakehouse architecture advises keeping storage and compute nodes in separate clusters. This means that, as with cloud-based big data solutions, the data lakehouse can scale storage and compute capabilities independently.
The data lakehouse is admittedly a new concept. As such, I will be the first to say there is a lot of room for it to mature. I confess though, that despite its newness and all the cons that a new concept inevitably implies, I do really like the idea, as it combines three concepts that I am passionate about.
I love working in data lakes. I love the open data formats, the open source languages, the flexibility, the scalability. Everything about the platform speaks to my inner techy. I love the data warehouse. It was what I grew up with, career-wise. I love how dimensional models bring sanity to the chaos of an organisation's data. I love how the ETL consolidates and conforms the data into a unified solution, and I love how impressive our dashboards and reports look when they are based on well-implemented data models.
Finally, I love agile delivery and delivering value to my customers quickly. I love breaking down the conventional barriers of solution delivery. It is for this reason that I like this new concept.
I believe the data lakehouse enables the principles of agility and brings the data lake and the data warehouse closer together, in a way that leverages each of their strengths and alleviates their weaknesses.