While working for SAP I have seen many products around data governance and business rules validation. All did address a subset of the problem and were built on assumptions valid back then but no longer.
Apache Kafka did enable much more and this KStream service is the result, incorporating everything learned.
SAP Information Steward
The idea of SAP Information Steward is to look at the current data in the database, validate it against a set of rules and create data quality dashboards. This then shows that 5% of the data is invalid because it violates at least one rule and also lists the impact of the individual rules.
If all these tests are executed on a daily basis and the result stored in a database, data quality trends can be rendered.
Let’s imagine the data to validate is 1m rows, there are 100 rules and the data is retained for 5 years. This would mean 183bn rule results must be saved in the database along side the actual data of just 1m. Obviously this is way too expensive for a database and hence Information Steward stores only the aggregated results (“Rule #1 – check order value – was violated today 5 times”) and some sample data.
This limits the use dramatically. Was rule #1 violated for the same product? Always the same customer? Online orders or entered by the sales team? Nobody knows.
The other problem with Information Steward is that the quality rules are tested once a day. It would be better to test them right the second when they are loaded. But that cannot be done either as the database has no means to invoke an Information Steward run and even if, executing all rules which might take an hour every few seconds does not make sense anyhow.
With Kafka we have neither of the problems. A Kafka producer puts the data into Kafka the instant they are saved in the source database and this Business Rules Service consumes the changed data, applies all rules, adds the rule results to every single record and this validated record is saved in another topic. The only question remaining is how to visualize the data. It can be loaded into a Big Data database or together with the actual data into the relational target database. The volume is still large, 1m rows times 100 rules is 100m rows but at least the results do not need to be evaluated and saved for every single day.
Types of rules
Another thing learned is that are different types of rules. Some products can deal with some but we need all rules for a successful project.
Rules based on rules
Another often overlooked requirement is to use the output of one rule as input in the next rule. Rules are executed in stages. In stage 1 all data is standardized. Now we have the proper status codes or ‘?’ if it was an unknown code. We know that the data is valid. In the next stage the decision rules and more validation rules can be used based on the standardized values.
Without that feature all rules would need to contain the same standardization rules.
One thing most commercial business rules tool cannot work with is nested data. These are database tools and hence work with relational data only. But when the incoming data is nested the rules will use data from different levels of the nested schema and rules are executed for all elements within an array node.
By using Kafka the requirement to cleanse the data in realtime is a no-brainer. The millisecond data appears in a topic, the KStream consumes it and puts the resulting message – which includes the rule result – into a another topic.
Data Quality Dashboards
As said before, one important part is to get statistics about the data quality. Information Steward stores those as aggregated results in its own repository and provides predefined reports. This limits the use to only the provided reports. But by storing the the rule results per input record, the data quality is just another measure any BI tool can be used to report on it. “How many records are marked as invalid as of today?” This is a simple sum() over all records that have at least one rule violation.
While monitoring the data quality certainly is an important aspect, thanks to Kafka much more can be achieved. How about adding a consumer that sends an email whenever the topic with the cleansed data gets a record which has a rule violation. Thus alerting the user proactively that he should fix the data in the source system. Or trigger a workflow at certain conditions? One augmentation rule defines the condition and the trigger tests on the rule result.
Types of tests
Tests on fields with a single value, e.g. a number or a string (opposed to array) allow to set the condition, how to treat the rule if the test returns false (PASS/WARN/FAIL?) and optionally setting the field the test is bound with to a new value. In the example below the Quantity field has a test named “Test Quantity” and it checks if the value of the Quantity field is null. If that is the case, this returns FAIL but overwrites the Quantity with the value 0 also to help the reporting user to find all rows where the Quantity is null or zero.
With these options the different rule types are just combinations of rule results and value substitution. An augmenting rule is a rule that returns PASS always and sets the value to the derived value. A validation rule returns PASS/WARN/FAIL without modifying the value.
In real life examples the rules tend to get quite complex, making it hard to define them. For example a rule might categorize a value into “LowVolume” if the Qty between 0 and 10, “MidVolume” if Qty between 10 and 50 and everything above should be “HighVolume”. Where does a Qty = 10 end up in? Is there an overlap? Is a value range missing? What if 10 is changed to 15 – are both places modified?
This can get very complex but the framework allows to specify multiple tests on the same columns and to define what should happen.
This way complex decisions tables can be built easily where each individual rule is simple.
Each tested record gets an array field where the individual applied rules and their results are listed. Therefore the user knows what tests have been applied, which one were skipped and the test result of each.
But the overarching goal was to check if a record is valid or not. Therefore the record itself has a flag as well and it is the aggregation of all PASS/WARN/FAIL (considering the “Test unit one passes/fails”).
Reporting on this result provides immediate feedback which records are considered invalid and should be looked at. And by analyzing the failed records per rule and other attributes the root cause can be found quickly.