
Key Challenges
• Create tools to monitor data infrastructure.
• Optimize uptime, cost, and performance.
• Immediately alert the DataOps team if an error occurs.
Monitoring Large-Scale Infrastructure
DataOps teams continually improve data delivery, ensuring that data reaches end users consistently. DataOps teams manage three data system components: infrastructure, data refresh, and user support. The larger the data system, the larger these components grow. To deliver consistent data, teams must remove sources of error from the data pipeline. Automated monitoring catches errors early and minimizes the need for human supervision, making automation a necessity in large-scale data systems.

Recently, a software company tasked our DataOps team with managing a 17,000-user sales partner portal. Due to the scale of the portal, our team faced an enormous monitoring challenge. This article examines 10 automated monitoring tools our DataOps team created to improve uptime, cost, and performance.

Custom DataOps Monitoring Tools
Below are the 10 automated tools we created to monitor our client’s sales partner portal.
1. Server Performance Monitor
With our server monitoring tool, administrators can monitor all servers in one place, in real time. The tool displays all 30 servers on one screen. The visual includes real-time monitoring for CPU, memory, and disk usage. If CPU, memory, or disk usage exceeds set thresholds, the offending server is highlighted in red. When a server exceeds a threshold, the DataOps team is alerted, enabling the team to rebalance load before performance degrades.
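The red-highlight logic described above amounts to a per-server threshold check. A minimal sketch follows; the metric names and threshold percentages are illustrative assumptions, not the client's actual configuration.

```python
# Assumed utilization thresholds (percent); the real dashboard's values may differ.
THRESHOLDS = {"cpu": 85.0, "memory": 90.0, "disk": 90.0}

def flag_servers(metrics_by_server, thresholds=THRESHOLDS):
    """Return servers whose CPU, memory, or disk usage exceeds a threshold."""
    flagged = {}
    for server, metrics in metrics_by_server.items():
        breaches = [m for m, value in metrics.items()
                    if value > thresholds.get(m, float("inf"))]
        if breaches:
            flagged[server] = breaches  # these servers render in red
    return flagged

sample = {
    "srv-01": {"cpu": 42.0, "memory": 61.0, "disk": 55.0},
    "srv-02": {"cpu": 91.5, "memory": 95.0, "disk": 30.0},
}
print(flag_servers(sample))  # srv-02 breaches cpu and memory
```

In a live dashboard, the flagged set would drive both the red highlight and the Teams alert.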
2. Deactivate Azure Resources During Nonbusiness Hours
We reduced server costs by 20% by turning off Azure resources during nonbusiness hours. Because Azure resources are billed on a pay-per-use basis, resources left running overnight incur avoidable cost. We improved efficiency by creating runbooks that turn off unneeded resources at 12:00 a.m. every day.
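The scheduling decision behind such a runbook can be sketched as a simple time-window check. The article only states a midnight shutdown, so the business-hours window below is an illustrative assumption.

```python
from datetime import time

# Assumed business window; only the midnight shutdown is stated in the article.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)

def outside_business_hours(now: time) -> bool:
    """True when a runbook may safely deallocate pay-per-use resources."""
    return not (BUSINESS_START <= now < BUSINESS_END)

print(outside_business_hours(time(0, 0)))    # True: midnight shutdown window
print(outside_business_hours(time(10, 30)))  # False: working hours
```

The real runbook would call the Azure deallocation APIs when this check returns True.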
3. Automatically Activate VMs
We saved five minutes of effort per server every day by starting servers on an as-needed basis. We implemented an automatic activation system using Microsoft Flow to trigger a runbook. The runbook starts the VM and sends an alert in Microsoft Teams. Our system allows the DataOps team to start any server as soon as they arrive at the office.
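The runbook's two steps (start the VM, announce it in Teams) can be sketched with injected dependencies so the example stays runnable without Azure credentials. The client and notifier here are stand-ins, not the team's actual code.

```python
def activate_vm(compute_client, notifier, resource_group, vm_name):
    """Start a VM on demand and announce it in the team channel.

    compute_client and notifier are injected; in the real runbook they
    would be an Azure compute client and a Teams webhook poster.
    """
    compute_client.start(resource_group, vm_name)
    notifier(f"VM '{vm_name}' in '{resource_group}' is starting.")

class FakeCompute:
    """Stand-in for the Azure SDK client, recording start requests."""
    def __init__(self):
        self.started = []
    def start(self, rg, name):
        self.started.append((rg, name))

messages = []
fake = FakeCompute()
activate_vm(fake, messages.append, "rg-portal", "vm-report-01")
print(fake.started, messages)
```

Injecting the client this way also makes the runbook logic easy to unit-test.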
4. Offline Server Alert
Our alert system triggers an immediate notification when a server goes offline. The notifications preempt crisis situations that could result in costly delays. To ensure consistent uptime, the system checks each server’s “heartbeat” every five minutes. If a server goes offline, Log Analytics and a runbook trigger an alert in Microsoft Teams.
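The offline test reduces to checking whether a server's last heartbeat is stale. In this sketch the grace period equals the five-minute check cadence from the article; that equivalence is an assumption.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=5)  # the article's check cadence

def is_offline(last_heartbeat: datetime, now: datetime,
               grace: timedelta = HEARTBEAT_INTERVAL) -> bool:
    """A server is considered offline when its last heartbeat is stale."""
    return now - last_heartbeat > grace

now = datetime(2023, 1, 1, 12, 0)
print(is_offline(datetime(2023, 1, 1, 11, 52), now))  # True: 8 minutes silent
print(is_offline(datetime(2023, 1, 1, 11, 58), now))  # False: within grace
```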
5. CPU/Memory/Disk Utilization Alert
Our monitoring system also triggers a Teams alert when CPU, memory, or disk utilization exceeds threshold values. These alerts likewise use Log Analytics and a runbook to post notifications in Microsoft Teams, and they ensure optimal report load times by heading off excessive usage.
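A Teams incoming webhook accepts a simple JSON payload; a minimal alert message for a threshold breach might be built like this. The message wording is illustrative, and the real runbook may use a richer card format.

```python
import json

def utilization_alert(server, metric, value, threshold):
    """Build a minimal Teams incoming-webhook payload for a threshold breach."""
    return {
        "text": (f"ALERT {server}: {metric} at {value:.1f}% "
                 f"exceeds the {threshold:.0f}% threshold.")
    }

payload = utilization_alert("srv-02", "memory", 95.0, 90)
print(json.dumps(payload))
```

The payload would be POSTed to the channel's webhook URL by the runbook.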
6. Power BI Memory Optimizer
To optimize performance without downtime, we created a memory optimization tool. When memory usage is high, report performance slows. To prevent performance decline, the tool releases memory when usage crosses a set threshold.
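The trigger condition can be sketched as a threshold check that invokes an injected release action. The 90% threshold and the release callback are illustrative assumptions; the actual cleanup mechanism is not described in the article.

```python
def maybe_release_memory(usage_pct, release_fn, threshold=90.0):
    """Invoke the injected memory-release action when usage crosses the threshold."""
    if usage_pct >= threshold:
        release_fn()  # in production, this would trigger the memory cleanup
        return True
    return False

released = []
maybe_release_memory(93.0, lambda: released.append("released"))
print(released)  # ['released']
```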
7. Azure Data Factory (ADF) Failure Alert
To avoid refresh delays, we created a tool that alerts the DataOps team if an ADF pipeline fails. The tool monitors the ADF pipeline and sends a failure notification in Microsoft Teams.
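At its core, the tool filters recent pipeline runs for failures. The sketch below represents run records as plain dictionaries; "Failed" and "Succeeded" are standard ADF run statuses, but the record shape is an assumption.

```python
def failed_runs(pipeline_runs):
    """Pick out the ADF pipeline runs that need a failure notification."""
    return [run["name"] for run in pipeline_runs if run["status"] == "Failed"]

runs = [
    {"name": "daily-refresh", "status": "Succeeded"},
    {"name": "partner-load", "status": "Failed"},
]
print(failed_runs(runs))  # ['partner-load']
```

Each returned name would become a Teams failure notification.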
8. Production Report Scanner
To ensure optimal report performance, this tool triggers an alert in Microsoft Teams when a report uses non-production resources. Using non-production (dev) resources degrades data recency, performance, and accessibility, so catching these references early prevents those impacts.
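One common way to detect a non-production resource is by naming convention in the data source's connection string. The marker list and hostnames below are hypothetical; the team's actual detection method is not described.

```python
NONPROD_MARKERS = ("dev", "test", "staging")  # assumed naming convention

def uses_nonprod_source(connection_string: str) -> bool:
    """Flag a report data source that points at a non-production resource."""
    lowered = connection_string.lower()
    return any(marker in lowered for marker in NONPROD_MARKERS)

print(uses_nonprod_source("sql-portal-dev.database.windows.net"))   # True
print(uses_nonprod_source("sql-portal-prod.database.windows.net"))  # False
```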
9. Gateway Offline Alert
The gateway offline alert tool prevents downtime. If an on-premises gateway goes offline, the tool triggers an alert in Microsoft Teams. The DataOps team receives the alert and can fix the gateway immediately, ensuring uptime.
10. Resource Optimization
Our resource optimization tool optimizes costs and resource usage. With it, we can upgrade or downgrade resources based on usage. The tool issues upgrade and downgrade recommendations using Azure Resource Advisor, and it conducts an internal audit to determine stream and server mapping usage. The tool allows the DataOps team to decommission unneeded resources without obstructing development.
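The upgrade/downgrade decision can be sketched as utilization banding. The 20%/80% bands are illustrative assumptions; the team's tool leans on advisor recommendations and its internal audit rather than fixed cutoffs.

```python
def sizing_recommendation(avg_cpu_pct, low=20.0, high=80.0):
    """Recommend a resource tier change based on average utilization.

    The low/high bands are illustrative, not the tool's actual rules.
    """
    if avg_cpu_pct < low:
        return "downgrade"
    if avg_cpu_pct > high:
        return "upgrade"
    return "keep"

print(sizing_recommendation(12.0))  # downgrade
print(sizing_recommendation(55.0))  # keep
print(sizing_recommendation(91.0))  # upgrade
```

Resources that stay in the "downgrade" band across the audit window would be candidates for decommissioning.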
