Thursday, June 27, 2019

Case Study: Data Validation Framework Reduces Support Tickets

Key Challenges

   Improve data quality.
   Reduce number of support tickets.
   Automate data validation and email alerts.

Validating Data

When using large amounts of data from multiple sources, data discrepancies and data loss can occur. Data discrepancies and data loss decrease the effectiveness and accuracy of data systems. To combat data discrepancies and data loss, users can submit support tickets for mismatched or missing data. Resolving support tickets for mismatched and missing data requires significant time.

Our client needed to reduce the number of support tickets regarding mismatched or missing data for their partner sales attribution system. Issues regarding mismatched or missing data accounted for over 20% of our client’s support tickets. With their previous solution, our client validated data during the last stage of their architecture. Our client validated the data using build verification tests (BVTs) and manual data validation. Because data validation only occurred during the last stage of the architecture, data problems from early stages compounded throughout the entire system. Our solution needed to improve the data validation process.

Our Approach

We implemented data validation mechanisms at each stage of the architecture. The architecture's stages included the staging, mart, cube comparison, and so on. Validating the data at each stage ensured developers could track anomalies. To track anomalies, we enabled earlier corrective action. We anticipated that earlier corrective action would lead to reduced support tickets.
To validate the data, we import data from multiple repositories into our system. The system processes the data using three parameters: BVTs, KPIs, and row counts.

Our data validation approach using BVTs includes four steps. First, we gather a list of all test cases from the corresponding database. Second, we run the testcase queries using multi-threading. Then, we compile the results into a single table. Finally, we send the results to the developers in an alert email.

To validate row counts, we use a different approach. We fix a threshold value for the number of rows. The threshold value ensures the number of rows does not change dramatically during data processing. The threshold value is dynamic. The process generates different threshold values for different tables depending on historical trends and the amount of data. For example, tables with less than 1,000 rows cannot exceed a five percent threshold. Tables with greater than 1,000,000 rows cannot exceed a one percent threshold.

Validating data against project KPIs required another distinct approach. At the beginning of the project—to measure the project’s success—we defined KPIs with our client. We execute the KPIs at multiple stages of the architecture. To ensure data remains aligned with the ultimate KPIs, we validate the data against previously defined KPIs .

Results: Improved Reporting Time, Better Data Quality, and Reduced Support Tickets

The data validation project concluded with great results. Previously, one of the streams required 24 hours to publish reports after receiving upstream data. Now, the stream publishes reports in only six hours—four times faster! Other streams saw similarly impressive results. In addition, data quality has significantly improved. The total percentage of user tickets raised for mismatched or missing data has declined from 20% to 15%.