The impact of AI on data engineering

By Julian Thomas, Principal consultant at PBT Group

Johannesburg, 25 Mar 2024

As someone who is dedicated to the profession of data architecture and engineering, I am very interested in any new trend or development that can affect these two professions. And no trend before has ever had the potential to impact these professions as much as artificial intelligence (AI).

I remember some years back collaborating with colleagues, to brainstorm and predict the potential impact AI could have on our consulting business. We identified a few key points for consideration:

Trust: AI services were only as good as the underlying data they were modelled on. Many of these services were built on data mined from the internet, so there was no guarantee of the trustworthiness or accuracy of this data. Of course, professional AI services are self-learning, so their ability to choose more trustworthy results would improve over time.

Legality: The concern of legality was raised. What is the provenance and ownership of the information provided? Whose information is being used to compile the answer, and does using it infringe copyright?

Safety: Assisted use cases, for example a developer using an AI chatbot to pose development questions and accelerate development, were of lesser concern. The reason for this being that the developer would still have to fact-check and test the answer before implementation in production.

Greater concern was raised around automated AI use cases, where AI was used to automatically make decisions and implement resulting actions. How would this be monitored? What safeguards would be in place to ensure appropriate and safe behaviour?

Furthermore, for both scenarios, was there a potential risk of confidential intellectual property (IP) being submitted (or simply inferred) through the process of engaging with the AI service or solution? Who had ownership of any such IP that might be inadvertently imparted during this process? Was it safe, and who else might have access to it?

This was of relevance to the data industry, where access to data represents such a strong competitive advantage, and where data is under strict governance and oversight.

Capability: Leveraging AI effectively requires deep investment and maturity in a variety of skills and technologies. The emerging data fabric methodology relies heavily on AI-driven processing, driving automated ingestion and integration of data, intelligent automation of BI delivery to consumers, etc. Implementing this on-premises would be a monumental undertaking that most organisations simply could not afford, nor have the appetite for.

Fast forward a couple of years and we can see that a lot of these concerns have materialised. I am sure any developer reading this will confirm the importance of verifying any code provided by AI platforms.

I use these platforms extensively to accelerate my research and development, and I have found them to be invaluable tools. However, I have encountered numerous scenarios where the generated code was either not relevant or did not work. So, TRUST…but VERIFY.

On the legal side, there is growing concern around copyrighted material. We have in the past year seen a slow but steady increase in copyright lawsuits being lodged against the providers of AI services. It has become clear that, while it is easy to access such services, one must be careful: just because information is available in an AI service does not mean it is free to use. Using this information, especially in activities that have a direct commercial benefit, can break copyright law.

Safety has also proven to be of concern − especially in the data industry. In our industry, our data provides a competitive advantage, and many companies are rushing to lock down AI services in this regard.

Developers and analysts publishing core data models and supporting data to help with research, development and problem-solving are potentially giving away core company IP, with unclear ownership attached, and with no knowledge of who might have access to this data.

Many organisations have therefore locked this down, or implemented ring-fenced instances of these AI services and models on-premises behind tightly controlled networks and firewalls.

So, what does this all mean for the data engineer?

Good data engineering solutions must leverage advanced analytics in a variety of ways. We should be using these techniques for analysing data, anomaly and outlier detection, advanced matching for data quality analysis, natural language processing for extracting data from unstructured data and performing sentiment analysis, recommendation systems, fraud detection − the list is endless.
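To make one of these techniques concrete, here is a minimal sketch of outlier detection using the interquartile range (Tukey's fences), written with only the Python standard library. The function name iqr_outliers, the multiplier k=1.5 and the sample daily row counts are illustrative assumptions, not taken from any particular platform or project.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles() sorts the data itself; n=4 yields the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical row counts from a daily ingestion job; one load is suspiciously large.
daily_row_counts = [1020, 998, 1005, 1011, 987, 1003, 4500, 1009]
print(iqr_outliers(daily_row_counts))  # prints [4500]
```

A check like this could run as a data quality gate after each load. The threshold k is a judgement call that depends on how noisy the data is, which is exactly where domain knowledge comes in.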

The reality is these capabilities have been around since before AI. They have always been important skills for data engineers and analysts.

Many people also speak about use cases such as automated ingestion, real-time optimisation of ingestion processes, parallelised processing, optimised resource utilisation, etc. I personally don’t see data engineers getting too involved in this level of detail.

I believe that industry in general is moving to the cloud, and that cloud platforms and services will naturally provide this capability as a service “behind the scenes”.

What I do believe is that the role of the data engineer remains secure. Data is complex, and working with it often requires domain knowledge that is unique to an organisation. Data engineers are required to provide this unique domain expertise, overcome data and algorithm bias, refine models to increase performance, validate results and make key judgement calls.

Data engineers do, however, need to up their game to keep up with the changing industry. The data engineers of the future will need a strong blend of mathematical, statistical, computer science and data skills and knowledge. These need to be blended so that data engineers can leverage AI in building solutions that are highly automated and fault-tolerant, with rich data governance and data quality features.

In short, I believe data engineers have challenging yet exciting times ahead of them.
