Everybody talks about data quality and how it is the foundation of AI, and plenty of products are being offered: something for measuring data quality as part of a data governance solution, a Master Data Management solution, address-cleansing features in the ETL tool, and so on.
But customers need a complete solution, integrated into their business workflows. I will show how this can be built and that it actually fits very well into a modern data stack.
The first time data quality became a big topic was in the context of Data Warehousing: how can you draw the right conclusions when the data is inconsistent?
Depending on the sentiment, the approach was:
What about the following thought:
The above process gives the best result in terms of benefit, but also in terms of cost and risk. Why? Let’s follow the business process, the different personas, and their priorities when working with the data.
A customer calls, ordering one ton of a material. The sales agent enters only the minimally required information because both are under time pressure. If the order entry takes too long, we are in danger of losing the customer. Striving for perfect data quality has the lowest priority: ask whether the customer has ordered something in the past, pray the caller knows, check material availability, agree on price and shipment date. Next caller, please.
The sales order process has the fundamental checks implemented, but its focus is ease of use.
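To make this concrete, here is a minimal sketch, with hypothetical field names, of the kind of check an order entry screen typically performs: only the fields needed to complete the sale are mandatory, everything else is accepted as typed.

```python
# Hypothetical order-entry validation: only the bare minimum is enforced.
def validate_order_entry(order: dict) -> list[str]:
    errors = []
    for field in ("customer_id", "material", "quantity"):
        if not order.get(field):
            errors.append(f"missing required field: {field}")
    if order.get("quantity", 0) <= 0:
        errors.append("quantity must be positive")
    # Region, customer name spelling, duplicate checks, ... are NOT enforced
    # here - that would slow the sales agent down.
    return errors

print(validate_order_entry(
    {"customer_id": "C100", "material": "STEEL-1", "quantity": 1000}))
# -> []  (order accepted, even though e.g. the region field is empty)
```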
This data then flows to other systems, and occasionally issues are uncovered. These must be fixed after the fact, which costs money. The analytical platform is where all data ends up, where data is connected with all other data, and where consistency is required not only within one business object (e.g. one sales order with everything belonging to it: customer, address, materials, …) but also at the aggregated level. Even if every single sales order is correct, the sum of revenue per customer region might be wrong due to missing region information, different spellings of regions, etc.
The analytical platform is the one impacted the most and has the highest data quality requirements.
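A tiny, made-up example illustrates the aggregation problem: every order is individually valid, yet the revenue per region splits into the wrong buckets because of a missing or differently spelled region value.

```python
# Made-up records: each order is fine on its own, the aggregate is not.
from collections import defaultdict

orders = [
    {"order_id": 1, "region": "EMEA", "revenue": 1000.0},
    {"order_id": 2, "region": "emea", "revenue":  500.0},   # different spelling
    {"order_id": 3, "region": None,   "revenue":  800.0},   # missing region
]

revenue_per_region = defaultdict(float)
for o in orders:
    revenue_per_region[o["region"]] += o["revenue"]
print(dict(revenue_per_region))
# {'EMEA': 1000.0, 'emea': 500.0, None: 800.0}  <- three buckets instead of one

# A simple aggregate-level rule the analytical platform can apply:
suspicious = [o for o in orders
              if o["region"] is None or o["region"] != o["region"].upper()]
print(f"{len(suspicious)} of {len(orders)} orders have doubtful region values")
```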
In an AI-driven world, this gets even worse. We no longer look at the data ourselves; AI agents make decisions based on the data, and they have been trained with that data.
That’s like a basketball player who occasionally misjudges the distance to the hoop. In those cases the likelihood of a three-pointer is lower – understandable – but when it happens during training, this single flawed record impacts all future throws.
It is easy to say: fix all problems in the source. It would make the most sense, as it avoids return shipments due to wrong shipping addresses, catalogs being mailed twice, and all the other money wasted on bad data. It is also unrealistic. Hence: report on problems, quantify them, try to get them fixed in the source, but do not rely on that happening.
The analytical and AI-powered systems require a whole different level of data quality, a level the source systems cannot even achieve. Therefore something must be built outside the source systems anyhow.
If doubtful data has been identified, play that information back via a business workflow, but only for those cases where the business process would benefit from it. As said, the business workflow needs consistent data within one business object, the analytical platform across all of them. These are two distinct requirements.
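One way to picture the distinction is to route rule results by their scope; the rule names and scopes below are assumptions for illustration only.

```python
# Hypothetical rule results: object-level violations go back to the business
# via a workflow task, aggregate-level doubts only annotate the analytical copy.
rule_results = [
    {"order_id": 1, "rule": "shipping_address_incomplete", "scope": "object"},
    {"order_id": 2, "rule": "region_spelling_unusual",     "scope": "aggregate"},
]

workflow_tasks  = [r for r in rule_results if r["scope"] == "object"]
analytics_flags = [r for r in rule_results if r["scope"] == "aggregate"]
print(f"{len(workflow_tasks)} workflow task(s), {len(analytics_flags)} analytics flag(s)")
```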
Modern data integration is about getting changed data once and then distributing it to all consumers.
Usually the intermediary used for distributing the data is Apache Kafka, and with that we have the required low-latency solution. All that is missing is a rules engine – the rtdi.io rules engine.
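The sketch below shows the pattern in principle, not the rtdi.io rules engine API: a consumer reads changed records from a Kafka topic, evaluates a couple of example rules, and republishes each record together with its rule results, so every downstream consumer – ERP feedback workflow, data warehouse, AI pipeline – reads the same assessed data. The topic names, the confluent-kafka Python client, and the rules themselves are assumptions for illustration.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dq-rules",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["sales-orders"])          # assumed input topic

def apply_rules(order: dict) -> list[dict]:
    """Evaluate a few example rules and return their results."""
    results = []
    if not order.get("region"):
        results.append({"rule": "region_missing", "status": "warn"})
    if order.get("quantity", 0) <= 0:
        results.append({"rule": "quantity_invalid", "status": "fail"})
    return results

# Read each changed record once, attach rule results, publish the assessed copy.
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order = json.loads(msg.value())
    order["_dq_results"] = apply_rules(order)
    producer.produce("sales-orders-assessed", json.dumps(order).encode())
    producer.flush()
```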
Now we can add multiple feedback loops to increase the data quality:
This approach has multiple advantages for the business:
From an architectural point of view: