Key Challenges
• Improve data quality.
• Reduce number of support tickets.
• Automate data validation and email alerts.
Validating Data
When systems consume large amounts of data from multiple sources, data
discrepancies and data loss can occur. These problems decrease the
effectiveness and accuracy of data systems, and users typically report them by
submitting support tickets for mismatched or missing data. Resolving those
tickets requires significant time.
Our client needed to reduce the number of support tickets regarding
mismatched or missing data in their partner sales attribution system. These
issues accounted for over 20% of the client’s support tickets. With their
previous solution, the client validated data only during the last stage of
their architecture, using build verification tests (BVTs) and manual data
validation. Because validation occurred so late, data problems from early
stages compounded throughout the entire system. Our solution needed to improve
the data validation process.
Our Approach
We implemented data validation mechanisms at each stage of
the architecture, including the staging, mart, and cube comparison stages.
Validating the data at each stage ensured developers could track anomalies as
they appeared, which enabled earlier corrective action. We anticipated that
earlier corrective action would lead to fewer support tickets.
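The minimal sketch below illustrates this stage-by-stage orchestration in Python. The stage names follow the text; the idea of passing each check as a function that returns None on success or a failure message is an assumption for illustration, not the client's actual implementation.

    STAGES = ["staging", "mart", "cube comparison"]

    def validate_stage(stage, checks):
        # Run every registered validation check against the given stage and
        # collect any failure messages.
        return [msg for msg in (check(stage) for check in checks) if msg is not None]

    def run_pipeline(checks):
        for stage in STAGES:
            failures = validate_stage(stage, checks)
            if failures:
                # Surfacing failures per stage lets developers take corrective
                # action before problems compound in later stages.
                print(f"[{stage}] validation failures:", failures)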
To validate the data, we import it from multiple
repositories into our system. The system processes the data using three
parameters: BVTs, KPIs, and row counts.
Our data validation approach using BVTs includes four steps.
First, we gather a list of all test cases from the corresponding database. Second,
we run the test-case queries using multi-threading. Then, we compile the results
into a single table. Finally, we send the results to the developers in an alert
email.
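The sketch below shows one way to implement these four steps in Python, assuming a pyodbc connection, a hypothetical bvt_test_cases metadata table, the convention that each test query returns a mismatch count, and placeholder SMTP and email addresses. The client's actual implementation may differ.

    import smtplib
    from concurrent.futures import ThreadPoolExecutor
    from email.message import EmailMessage

    import pyodbc  # assumed database driver

    def fetch_test_cases(conn):
        # Step 1: gather all BVT test cases from a (hypothetical) metadata table.
        cursor = conn.cursor()
        cursor.execute("SELECT test_name, test_query FROM bvt_test_cases")
        return cursor.fetchall()

    def run_test_case(conn_str, test_name, test_query):
        # Step 2: each test-case query runs on its own connection and thread.
        with pyodbc.connect(conn_str) as conn:
            row = conn.cursor().execute(test_query).fetchone()
            passed = bool(row and row[0] == 0)  # assumed convention: query returns a mismatch count
            return test_name, "PASS" if passed else "FAIL"

    def run_bvts(conn_str):
        with pyodbc.connect(conn_str) as conn:
            test_cases = fetch_test_cases(conn)

        # Step 2 (continued): run the queries concurrently with a thread pool.
        with ThreadPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(lambda tc: run_test_case(conn_str, *tc), test_cases))

        # Step 3: compile the results into a single table.
        report = "\n".join(f"{name}\t{status}" for name, status in results)

        # Step 4: send the compiled results to the developers in an alert email.
        msg = EmailMessage()
        msg["Subject"] = "BVT data validation results"
        msg["From"] = "validation-alerts@example.com"
        msg["To"] = "dev-team@example.com"
        msg.set_content(report)
        with smtplib.SMTP("smtp.example.com") as smtp:
            smtp.send_message(msg)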
To validate row counts, we use a different approach. We set
a threshold for how much the number of rows in a table may change during data
processing, which ensures row counts do not shift dramatically between stages.
The threshold is dynamic: the process generates different threshold values for
different tables depending on historical trends and the amount of data. For
example, tables with fewer than 1,000 rows cannot exceed a five percent
threshold, while tables with more than 1,000,000 rows cannot exceed a one
percent threshold.
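A minimal sketch of this check follows. The five percent and one percent cutoffs come from the text; the three percent middle band and the example table name are illustrative assumptions, and a full implementation would also factor in historical trends.

    def row_count_threshold(row_count):
        # Return the maximum allowed fractional change for a table of a given size.
        if row_count < 1_000:
            return 0.05   # small tables may vary by up to 5%
        if row_count > 1_000_000:
            return 0.01   # very large tables may vary by up to 1%
        return 0.03       # assumed middle band for illustration

    def validate_row_count(table_name, previous_count, current_count):
        # Flag the table when the row count changes more than its dynamic threshold.
        threshold = row_count_threshold(previous_count)
        change = abs(current_count - previous_count) / max(previous_count, 1)
        if change > threshold:
            return f"{table_name}: row count changed {change:.1%} (limit {threshold:.0%})"
        return None

    # Example: a 500-row staging table that loses 40 rows (8%) exceeds the 5% limit.
    print(validate_row_count("staging_orders", 500, 460))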
Validating data against project KPIs required another
distinct approach. At the beginning of the project, we defined KPIs with our
client to measure the project’s success. We execute these KPIs at multiple
stages of the architecture and validate the data against them to ensure the
data remains aligned with the ultimate KPIs.
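The sketch below shows one way such a check could look. The specific KPI (total attributed revenue), the column name, and the one percent tolerance are illustrative assumptions rather than the KPIs actually agreed with the client.

    KPI_TOLERANCE = 0.01  # assumed allowable drift for illustration

    def total_attributed_revenue(rows):
        # Hypothetical KPI: sum of attributed revenue across partner sales rows.
        return sum(r["attributed_revenue"] for r in rows)

    def validate_kpi(stage_name, stage_rows, baseline_value):
        # Compare the KPI at this stage against the value agreed at the start
        # of the pipeline (baseline_value is assumed to be nonzero).
        stage_value = total_attributed_revenue(stage_rows)
        drift = abs(stage_value - baseline_value) / baseline_value
        if drift > KPI_TOLERANCE:
            return f"{stage_name}: KPI drifted {drift:.2%} from the agreed baseline"
        return None

Running the same KPI query at each stage and comparing it with the baseline makes it clear which stage introduced a discrepancy.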
Results: Improved Reporting Time, Better Data Quality, and Reduced Support Tickets
The data validation project concluded with great results. Previously,
one of the streams required 24 hours to publish reports after receiving
upstream data. Now, the stream publishes reports in only six hours—four times
faster! Other streams saw similarly impressive results. In addition, data
quality has significantly improved. The total percentage of user tickets raised
for mismatched or missing data has declined from 20% to 15%.