Thursday, June 13, 2019

DataOps Practices Create Incremental Business Value



Key Challenges

   Ensure the partner portal delivers accurate data.
   Improve report reliability.
   Respond to user support questions quickly.

Building a Team to Support 440,000 Partners

In 2017, our client revolutionized its partner services. 440,000 partners relied on our client’s analytics platform, but the platform inefficiently sourced data from silos. To improve performance, the client consolidated their business reporting into a single partner portal. Partner reporting, business insights, and performance improved significantly. At the center of the new portal, a Power BI dashboard provided insights to over 13,000 users. The dashboard integrated data from over thirty sources, handling millions of queries annually. At the time, no other business had a larger Power BI implementation. Over the next year, dashboard and partner portal use continued to increase. The portal’s growing popularity spurred the need for a support team.

To support the portal, we assembled a 40-member team. We divided the team into three domains: infrastructure, data refresh, and user support. The infrastructure team monitored data pipelines and ensured that reports loaded quickly. The data refresh team refreshed the data from upstream sources and transaction systems. The user support team answered questions about reports, managed data queries, and located reports. Together, the teams bridged the gap between development and operations and delivered continuous incremental value to our client. The principles the teams followed are codified as DataOps.

DataOps is the convergence of agile methodologies, DevOps practices, and statistical process control. This article examines the roles our DataOps teams fulfill for our client’s partner portal. This article offers a glimpse into the benefits of DataOps practices.

Reliability

A primary responsibility of DataOps teams is ensuring report uptime. Reports drive critical business decisions, so uptime is crucial. To improve uptime, we created a monitoring framework. The framework simplifies maintenance, monitoring, and report validation at all stages of development. The framework also tracks basic functions (such as page rendering) by capturing snapshots and delivering verification emails. Tracking the emails and examining the snapshots allows our DataOps teams to quickly detect failures. The framework even validates report data against backend data, pinpointing errors in the data pipeline.

The DataOps teams further increased reliability by implementing supplementary monitoring and control systems. The teams created an infrastructure health monitoring system that detects potential failure scenarios and corrects them. The teams automated alerts to monitor CPU, memory usage, and disk space and detect spikes outside of acceptable limits. With automated alerts, the portal requires less active monitoring. Less time spent actively monitoring freed team resources to address more complex challenges.

One early complex challenge was addressing the sheer number of data refresh jobs. Data refreshes must occur on schedule so users can base business decisions on the latest data points. Because of the partner portal’s large scale, the teams could not manually check each job. To improve the data refresh reliability, the DataOps teams built a data refresh tracking and monitoring system. The tracking system displayed the historical statuses of data refreshes at subsystem levels. The system triggered alerts if it detected delays in the operational data refresh pipeline. The tracking and monitoring system assured users that data points were never stale.

Cost Reduction

As the partner portal supported more users, the DataOps teams needed to monitor usage of cloud-based resources. In pay-as-you-go models, it is always important to optimize the usage of cloud-based resources.

Resource monitoring revealed discrepancies from the client’s estimated usage. The usage varied depending on:

   The usability of the applications.
   The user base’s knowledge of the existence and availability of the data and report assets.
   The user base’s training regarding using assets for day-to-day business activities.
   The users’ actual usage of the assets.

Identifying the sources of usage variations enabled the DataOps teams to tailor the resources to the client’s needs.

Resource usage also varied depending on our client’s business cycles. By predicting spikes and drop-offs in activity, the DataOps teams could optimally provision cloud-based resources. The ability to review usage patterns also proved useful when rationing provisioned assets. The teams could detect deviations from planned usage and scale assets depending on actual usage. The system alerted the teams about unexpected peaks in usage. The alerts allowed the teams to detect improperly functioning applications and correct the issues quickly. The teams further reduced costs by scheduling virtual machines to shut down during periods of low demand.

Data Quality

Erroneous data reduces users’ trust in information systems, resulting in reduced data asset usage. To ensure users receive high-quality data, the DataOps teams adopted statistical process control (SPC) techniques. SPC emphasizes early detection of nonconforming data and ensures rapid responses. To detect errors, the team developed a trend monitoring tool that calculates rolling averages and movement trends of key data points. If key data points deviate beyond an acceptable threshold, the system generates alerts. The alerts allow the DataOps teams to correct the data in a timely manner.

User Support

Responding to user requests is a large part of the DataOps teams’ work. Prompt support responses are essential for a good user experience. Originally, when a user required access to certain assets, a DataOps team member had to validate the request. Often, requests were pressing, and manual approval slowed the process. In response, the DataOps teams created a self-service tool for access requests. The tool, a simple request form, improved the user experience and significantly reduced the DataOps teams’ workload. A rules-based system accompanied the form, and the form automatically forwarded requests to authorized approvers. Upon arrival, authorized approvers could automatically grant access if the user had sufficient permissions. Alternatively, the approver could collate approved requests, allowing the operations team to provide access later.

DataOps Creates Operational Efficiencies

DataOps is essential for successfully planning, developing, and executing data solutions for all enterprises. DataOps tools and practices allowed our client to:

   Rapidly develop and enhance reports.
   Gather data optimally.
   Validate data.
   Ensure data availability for a large user base, aiding business decisions.

Our client’s partner portal allows 440,000 business partners to define goals according to uniform data points. Over the course of the last year, the DataOps team managed a 50% increase in reports without increasing the team size. During this time, the total number of users grew by a factor of ten. The DataOps teams sustained this growth without increasing team size by continually delivering incremental value. Our DataOps teams improved the availability, reliability, and quality of data delivered.