Wednesday, June 26, 2019

Improve Data Delivery Using Custom DataOps Monitoring Framework

Key Challenges

   Create tools to monitor data infrastructure.
   Optimize uptime, cost, and performance.
   Immediately alert the DataOps team if an error occurs.

Monitoring Large-Scale Infrastructure

DataOps teams continually improve data delivery, ensuring that data reaches end users consistently. They manage three components of a data system: infrastructure, data refresh, and user support. As a data system grows, so does the effort each component demands. To deliver consistent data, teams must remove sources of error from the data pipeline. Automated monitoring catches errors early and minimizes the need for human supervision; in large-scale data systems, it is a necessity.

Recently, a software company tasked our DataOps team with managing a 17,000-user sales partner portal. Due to the scale of the portal, our team faced an enormous monitoring challenge. This article examines 10 automated monitoring tools our DataOps team created to improve uptime, cost, and performance.

Custom DataOps Monitoring Tools

Below are the 10 automated tools we created to monitor our client’s sales partner portal.

1. Server Performance Monitor

Our server monitoring tool lets administrators watch every server in one place, in real time. It displays all 30 servers on a single screen with live CPU, memory, and disk usage. If any of these metrics exceeds its threshold, the offending server is highlighted red and the DataOps team is alerted, enabling the team to keep the servers balanced.
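A minimal sketch of the threshold check such a dashboard could run on each refresh. The metric names and limits below are illustrative assumptions, not the tool's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class ServerStats:
    name: str
    cpu_pct: float
    memory_pct: float
    disk_pct: float

# Assumed thresholds; the real tool's limits are set per environment.
THRESHOLDS = {"cpu_pct": 85.0, "memory_pct": 90.0, "disk_pct": 90.0}

def flag_servers(stats: list[ServerStats]) -> list[str]:
    """Return names of servers whose usage exceeds any threshold."""
    flagged = []
    for s in stats:
        if any(getattr(s, metric) > limit for metric, limit in THRESHOLDS.items()):
            flagged.append(s.name)  # these servers would be highlighted red
    return flagged
```

A flagged name is what would drive both the red highlight and the team alert.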

2. Deactivate Azure Resources During Nonbusiness Hours

We reduced server costs by 20% by turning off Azure resources during nonbusiness hours. Because Azure bills on a pay-per-use basis, we pay only for the resources we actually use. We created runbooks that shut down unneeded resources at 12:00 a.m. every day.
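The decision the nightly runbook makes can be sketched as a simple schedule check. The 12:00 a.m. shutdown comes from the setup above; the 6:00 a.m. business start is an assumed value:

```python
from datetime import datetime, time

BUSINESS_START = time(6, 0)  # assumed start of business hours
# Shutdown runs at 12:00 a.m., so resources stay off from midnight
# until business hours begin.

def should_be_running(now: datetime) -> bool:
    """Resources run only from business start through midnight."""
    return now.time() >= BUSINESS_START
```

A runbook would deallocate any resource for which `should_be_running` returns False.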

3. Automatically Activate VMs

We saved five minutes of effort per server every day by starting servers on an as-needed basis. The automatic activation system uses Microsoft Flow to trigger a runbook, which starts the VM and posts an alert in Microsoft Teams. The DataOps team can start any server as soon as they arrive at the office.
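The Teams notification step can be sketched as building the JSON body that a Teams incoming webhook accepts (a plain `text` message); the VM and requester names are hypothetical:

```python
import json

def vm_started_message(vm_name: str, requested_by: str) -> str:
    """Build the JSON body for a Teams incoming webhook notification."""
    payload = {"text": f"VM **{vm_name}** was started at the request of {requested_by}."}
    return json.dumps(payload)
```

The runbook would POST this body to the channel's webhook URL after the VM start succeeds.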

4. Offline Server Alert

Our alert system triggers an immediate notification when a server goes offline. The notifications preempt crisis situations that could result in costly delays. To ensure consistent uptime, the system checks the server’s “heartbeat” every five minutes. If a server goes offline, log analytics and a runbook trigger an alert in Microsoft Teams.
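The five-minute heartbeat check reduces to a staleness test. Treating a server as offline only after two missed beats is an assumption here, added to avoid false alarms on a single late sample:

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=5)  # per the check described above

def is_offline(last_heartbeat: datetime, now: datetime,
               missed_beats: int = 2) -> bool:
    """Consider a server offline after `missed_beats` missed heartbeats."""
    return now - last_heartbeat > missed_beats * HEARTBEAT_INTERVAL
```

When `is_offline` is true, the log-analytics query and runbook would fire the Teams alert.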

5. CPU / Memory / Disk Utilization Alert

Our monitoring system also posts a Teams alert when CPU, memory, or disk utilization exceeds its threshold. Like the offline alert, it uses log analytics and a runbook to trigger the notification. By catching excessive usage early, these alerts help keep report load times optimal.

6. Power BI Memory Optimizer

High memory usage slows report performance. To prevent that decline without downtime, we created a tool that releases memory whenever usage crosses a set threshold.
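One way such a release trigger could work is with hysteresis: release once when usage crosses a high mark, then re-arm only after usage falls back below a low mark, so the tool does not fire on every sample while usage hovers near the line. The two thresholds and the re-arm design are assumptions, not the tool's documented behavior:

```python
class MemoryOptimizer:
    """Release memory above HIGH; re-arm once usage drops below LOW."""
    HIGH, LOW = 85.0, 70.0  # assumed percentages

    def __init__(self) -> None:
        self.armed = True
        self.releases = 0

    def sample(self, memory_pct: float) -> None:
        if self.armed and memory_pct > self.HIGH:
            self.releases += 1   # the real tool would free memory here
            self.armed = False
        elif memory_pct < self.LOW:
            self.armed = True
```

Feeding the loop a usage series shows it releases once per excursion above the high mark rather than once per sample.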

7. Azure Data Factory (ADF) Failure Alert

To avoid refresh delays, we created a tool that monitors our ADF pipelines and posts a failure notification in Microsoft Teams whenever a pipeline run fails.
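The filtering step can be sketched over a simplified run record (`pipelineName`, `status`, `message` are an assumed shape, loosely modeled on ADF run metadata). "Failed" is one of ADF's run status values:

```python
def failed_runs(pipeline_runs: list[dict]) -> list[dict]:
    """Pick out failed runs from a list of run records."""
    return [run for run in pipeline_runs if run.get("status") == "Failed"]

def alert_text(run: dict) -> str:
    """Format one failed run as a Teams message line."""
    return f"ADF pipeline '{run['pipelineName']}' failed: {run.get('message', 'no message')}"
```

Each entry returned by `failed_runs` would become one Teams notification.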

8. Production Report Scanner

To ensure optimal report performance, this tool posts a Teams alert when a report uses non-production (dev) resources, which hurt data recency, performance, and accessibility.
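A sketch of the scan, assuming non-production resources can be recognized by a naming convention in the data source host (the `dev`/`test`/`uat` markers are an assumption):

```python
NON_PROD_MARKERS = ("dev", "test", "uat")  # assumed naming convention

def non_prod_reports(reports: dict[str, str]) -> list[str]:
    """Map of report name -> data source host; return offending reports."""
    return [name for name, host in reports.items()
            if any(marker in host.lower() for marker in NON_PROD_MARKERS)]
```

Any report name returned here would trigger the Teams alert.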

9. Gateway Offline Alert

The gateway offline alert prevents downtime. If an on-premises gateway goes offline, the tool posts an alert in Microsoft Teams so the DataOps team can fix the gateway immediately.

10. Resource Optimization

Our resource optimization tool keeps costs and resource usage in check. It issues upgrade and downgrade recommendations using Azure Resource Advisor, and it conducts an internal audit to determine stream and server mapping usage. With these insights, the DataOps team can resize resources to match demand and decommission unneeded resources without obstructing development.
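The resize decision can be sketched as a simple sizing rule on observed utilization; the percentages are assumed cutoffs, not the tool's actual policy:

```python
def recommend(avg_cpu_pct: float) -> str:
    """Toy sizing rule: sustained high usage -> upgrade, sustained low -> downgrade."""
    if avg_cpu_pct > 80.0:
        return "upgrade"
    if avg_cpu_pct < 20.0:
        return "downgrade"
    return "keep"
```

In practice such a rule would look at a sustained window of samples, not a single average, before recommending a resize.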

Results: Optimized Costs and Uptime

The monitoring tools were a great success: our client enjoyed lower costs and improved uptime. In the year after we implemented the tools, the number of VMs and reports grew by 50%, while costs increased by only 25%. In other words, our optimizations decreased the cost per stream. Our client's end users experienced improved report availability and reliability, and the client has since enlisted our DataOps team for subsequent projects, including the management of their biggest yearly conference.