In a world increasingly driven by data, it is crucial to take a deeper look at all aspects of data curiosity. While many in the industry refer to ‘data curiosity’ as indicating an interest in or curiosity about what the final numbers or visualisations mean, it could be argued that data curiosity should be a much broader categorisation than just the output itself.
A Forbes article published in 2019 states: “Data curiosity is the connective tissue that ties data literacy and data storytelling together.”
The world has seen significant changes in technology and data tools since then. Many of these new data tools operate as a “black box” environment where you supply data, and the tool generates the output required – be it a model or a report.
Artificial intelligence is increasingly being used in such tools to simplify data analysis and exploration and make them accessible to a wider audience, not just those who specialise in data.
The result is an impact on what could be termed ‘data literacy’, where users no longer interrogate and understand the data being fed into these tools. Data curiosity as a concept therefore needs to incorporate the aspects which inform data literacy – the where, how and why of the data’s creation.
Understanding the data used is crucial in obtaining sensible outputs. The old adage “rubbish in – rubbish out” most definitely still applies. Focus has shifted to the flashy outputs that can be generated, with the result that curiosity about where the data originated and how it was created has fallen by the wayside.
Data literacy, or an expanded definition of data curiosity, is the most crucial step in any data work that is done – whether it is using the data to build new data artefacts or features, or pulling it into modelling or reporting.
So, which aspects of data lineage or origination are important to be curious about? In short – everything. The ‘where’ portion of understanding lineage assists in highlighting any restrictions or systemic issues which might creep into the data based on the system(s) that created it and the primary function of those systems.
Often data from a system is pulled into additional processes which have a focus or intent that differs from the primary function of the originating system. As such, the data could have gaps, inconsistencies or missing information, which could impact the additional processes and require work-arounds, additional sourcing or enhancement/creation of business rules for the data’s use.
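As a rough illustration of this kind of check, the sketch below counts missing values per field in a small set of records. The field names and sample rows are hypothetical, and treating None, an absent field and an empty string as "missing" is an assumption that would need to match the conventions of the originating system.

```python
from collections import Counter

# Hypothetical records pulled from a source system; field names are illustrative only.
records = [
    {"customer_id": 1, "branch": "A", "income": 52000},
    {"customer_id": 2, "branch": "", "income": None},
    {"customer_id": 3, "branch": "B"},  # 'income' was never captured for this row
]

def missing_counts(rows, fields):
    """Count, per field, how many rows have no usable value."""
    counts = Counter()
    for row in rows:
        for field in fields:
            value = row.get(field)
            # Assumption: None, an absent key, or an empty string all mean "missing".
            if value is None or value == "":
                counts[field] += 1
    # Report every requested field, including those with no gaps.
    return {field: counts[field] for field in fields}

gaps = missing_counts(records, ["customer_id", "branch", "income"])
print(gaps)
```

A profile like this does not explain why a field is empty, but it shows where to start asking questions about the originating system.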
Another aspect of lineage to be curious about is the ‘how’ of its creation. Is the data primarily raw data or are there aspects of it which have been defined or calculated? If so, what are the definitions and business rules for the creation of these variables? Do these align with expectations and what is needed from the process, build or report?
If they differ, what is the end impact on what is trying to be achieved? Are the differences material enough to justify being recalculated or redefined to work in the proposed environment?
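One way to judge whether a difference is material is to recompute the derived variable under the rule the new process expects and compare it with the value supplied. The sketch below assumes a hypothetical 'net_income' field, a hypothetical business rule (gross income minus tax) and an arbitrary 1% tolerance; none of these come from a real system.

```python
# Hypothetical rows: 'net_income' arrives pre-calculated by the source system.
rows = [
    {"id": 1, "gross_income": 1000.0, "tax": 200.0, "net_income": 800.0},
    {"id": 2, "gross_income": 2000.0, "tax": 500.0, "net_income": 1400.0},
    {"id": 3, "gross_income": 1500.0, "tax": 300.0, "net_income": 1195.0},
]

def expected_net_income(row):
    # Assumed business rule for the new process: gross income minus tax.
    return row["gross_income"] - row["tax"]

def material_differences(data, tolerance=0.01):
    """Return ids whose supplied value differs from the expected value
    by more than the given relative tolerance."""
    flagged = []
    for row in data:
        expected = expected_net_income(row)
        # Assumes the expected value is non-zero; a zero would need separate handling.
        relative_gap = abs(row["net_income"] - expected) / abs(expected)
        if relative_gap > tolerance:
            flagged.append(row["id"])
    return flagged

flagged_ids = material_differences(rows)
print(flagged_ids)
```

The tolerance is the part that needs a business decision: what counts as material depends on what the process, build or report is meant to achieve.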
Once the ‘where’ and the ‘how’ have been explored, the company should have a good idea of the ‘why’. If the data was created to perform a function fundamentally different from what the organisation wants to use it for, the organisation is then well-positioned to leverage what works, and create or define what is missing or misaligned. Without understanding the ‘why’ of data lineage, there is a risk of using the data inappropriately and therefore impacting the final outputs.
It is therefore my opinion that ‘data curiosity’ as a concept needs to encompass the entire data lifecycle, not just the end results. Data workers need to understand where the data they are using comes from in order to appropriately consume, transform and enhance it for additional insights.
Often, being curious about and interrogating the where, how and why of data’s creation can lead to not only enhanced literacy, but also crucial insights which can significantly improve further use or enhancements through the remainder of the lifecycle.
Broad data curiosity is even more important in larger corporations or data centres, where the data consumed by analysts is often the product of multiple layers of feature creation and a combination of system outputs.
Being curious enough to stop and ask the where, how and why of the data consumed will guarantee better quality output as well as assist in minimising rework further down the line.