Friday, August 2, 2019

Delivering Data Faster: The Five Phases of Data Migration to the Cloud

Migrating Data Systems

Data systems are integral to business success because they generate the insights that drive sales and marketing. Migrating data systems to the cloud offers many benefits: cloud architectures improve refresh speed, consolidate resources, and improve data uniformity. But data migration requires transferring terabytes or even petabytes of data, and it requires two production environments to run in parallel. Because data migration involves navigating multiple logistical challenges, it requires a carefully considered plan.

In this article, we will detail our data migration process. We will examine data migration in five phases:

1.   Discovery Phase
2.   Analysis and Design Phase
3.   Planning and Approval Phase
4.   Execution Phase
5.   Verification Phase

Discovery Phase

During the discovery phase, we evaluate our client’s existing data system. Every client has distinct needs. We look at the data sources and the relationships among them, the data itself, its size and scale, and its refresh frequencies.

In our most recent data migration project, we partnered with a global technology company to address the following four challenges:

1.   High data refresh and process times
2.   Server dependencies among multiple streams
3.   Difficulty managing and maintaining infrastructure
4.   Difficulty patching infrastructure

On average, our client’s data required 24 hours to pull and process. Our client’s existing solution to slow processing times was simply to upgrade servers. Continuously upgrading servers resulted in a significant infrastructure cost.

By migrating their data, our client hoped to meet four main objectives:

1.   Enable real-time reporting
2.   Facilitate independent refreshes of multiple streams
3.   Easily manage and maintain infrastructure
4.   Decrease costs

Analysis and Design Phase

We divide our analysis and design phase into three steps. First, we research potential applicable technologies. Second, we read documentation to determine which technologies we want to use. Finally, we perform a series of proofs of concept to determine technology suitability and cost-benefit.

In our most recent data migration, we created the following proofs of concept (see Figure 1):

•   Azure Analysis Services
◦   We conducted a proof of concept using Azure Analysis Services to optimize processing through Azure Data Warehouse, Azure SQL Database, and Azure Data Lake Storage. Specifically, we examined how multi-partitioning and file splitting affected optimization.
•   Azure Data Warehouse
◦   We processed multiple models simultaneously by cloning Azure Data Warehouse to parallelize processing and improve execution speed.
•   Azure Data Warehouse versus Azure SQL Database
◦   We conducted a cost-benefit analysis for Azure Data Warehouse and Azure SQL Database. Of the two technologies, Azure Data Warehouse offered greater scalability. Because our client’s data system transfers 50 terabytes of data per day, we chose Azure Data Warehouse.
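
To illustrate the idea behind the file splitting and multi-partitioning proofs of concept, here is a minimal sketch in plain Python. It is not the code used in the project: the function names are hypothetical, and `process_partition` is a stand-in for whatever per-partition load or refresh a real system would run. It only shows the general pattern of splitting a dataset into near-equal partitions and processing them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, partition_count):
    """Split rows into partition_count contiguous chunks of near-equal size."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    base, extra = divmod(len(rows), partition_count)
    partitions = []
    start = 0
    for index in range(partition_count):
        # The first `extra` partitions take one additional row each.
        size = base + (1 if index < extra else 0)
        partitions.append(rows[start:start + size])
        start += size
    return partitions


def process_partition(partition):
    # Hypothetical stand-in for a per-partition load or refresh.
    return sum(partition)


def process_in_parallel(rows, partition_count):
    """Process every partition concurrently and combine the partial results."""
    partitions = split_into_partitions(rows, partition_count)
    with ThreadPoolExecutor(max_workers=partition_count) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    return sum(partial_results)
```

In a sketch like this, the partition count is the tuning knob: too few partitions leave workers idle, while too many add scheduling overhead. That trade-off is the kind of question a proof of concept can answer with measurements.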

Planning and Approval Phase

During the planning and approval phase, we propose our solution and receive feedback from the client. For our recent client, we needed to meet the four objectives identified during discovery. We proposed five steps to address those objectives:

1.   To ease infrastructure maintenance, leverage the serverless, scalable, and distributed architecture of the cloud.
2.   Implement a real-time job and asset monitoring framework.
3.   Implement real-time data refresh pipelines for real-time reporting.
4.   To improve processing speed, leverage the distributed processing power of the cloud.
5.   Enable data publishes based on source availability using an intelligent job monitoring and validation framework.
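
The last step, publishing based on source availability, can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the framework we built: the `Stream` class, the stream names, and the source names are all hypothetical. The point is only that each stream declares the sources it needs, and a stream becomes eligible to publish as soon as those sources have arrived, independently of the other streams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """A reporting stream and the upstream sources it depends on (hypothetical)."""
    name: str
    required_sources: frozenset


def streams_ready_to_publish(streams, available_sources, already_published):
    """Return the names of streams whose required sources have all arrived."""
    ready = []
    for stream in streams:
        if stream.name in already_published:
            continue  # Do not publish the same stream twice in one cycle.
        if stream.required_sources <= available_sources:
            ready.append(stream.name)
    return ready


# Example data, invented for illustration only.
streams = [
    Stream("sales", frozenset({"orders", "customers"})),
    Stream("marketing", frozenset({"campaigns"})),
]
ready_now = streams_ready_to_publish(
    streams,
    available_sources={"orders", "campaigns"},
    already_published=set(),
)
```

Here `ready_now` contains only "marketing", because the "customers" source for the sales stream has not arrived yet. The marketing stream does not have to wait for it, which is what makes independent refreshes possible.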

Figure 1: A high-level representation of a large data system.

Execution Phase

During the execution phase, we implement the approved architecture. For our recent client, we used Azure Data Lake Storage for the storage layer. We used Azure Databricks for the processing layer. For the publishing layer, we used tabular models and SQL. We used Power BI and Excel for the visualization layer. The project’s architecture consisted of a seven-step process:

1.   Data is staged into Azure Data Lake Storage from upstream using Azure Data Factory.
2.   Data is transferred to Azure Databricks from Azure Data Lake Storage.
3.   Data is processed using Azure Databricks.
4.   Processed data is moved back into Azure Data Lake Storage for downstream users and into Azure Data Warehouse to create reporting views.
5.   The intelligent job monitoring and validation framework provides independent processing and refreshes.
6.   The tabular model is processed.
7.   Reports are visualized using Power BI or Excel.
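
One way to picture the steps above is as a dependency graph of stages. The sketch below is illustrative only and assumes Python 3.9 or later for the standard-library `graphlib` module. The stage names are hypothetical, and the dependencies are our reading of the list, in particular the assumption that the tabular model depends on the reporting views in Azure Data Warehouse. The monitoring framework from step 5 is represented by a simple log rather than a separate stage, and no Azure services are actually called.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (names are hypothetical).
PIPELINE = {
    "stage_to_lake": set(),
    "load_into_databricks": {"stage_to_lake"},
    "process_in_databricks": {"load_into_databricks"},
    "write_back_to_lake": {"process_in_databricks"},
    "load_warehouse_views": {"process_in_databricks"},
    "process_tabular_model": {"load_warehouse_views"},
    "refresh_reports": {"process_tabular_model"},
}


def execution_order(pipeline):
    """Return one valid order in which every stage runs after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def run_pipeline(pipeline, run_stage):
    """Run each stage in dependency order and keep a simple status log."""
    log = []
    for stage in execution_order(pipeline):
        run_stage(stage)  # Caller-supplied function that does the real work.
        log.append((stage, "succeeded"))
    return log
```

Because writing back to the data lake and loading the warehouse views both depend only on the processing stage, a scheduler is free to run them in either order or side by side. Expressing the pipeline as a graph makes that kind of independence explicit.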

Verification Phase

During the verification phase, we conduct user acceptance testing (UAT) through numerous user acceptance sessions. UAT allows technical users, business users, and operations team members to become familiar with the new data system as the old system is gradually moved offline.

Benefits of Data Migration

Our client’s data migration reduced data latency, improved data availability, modernized the data infrastructure, and enabled real-time reporting.

•   Latency Benefits
◦   2X more efficient refresh cycle
◦   Source availability-based refresh schedule
•   Data Availability
◦   Single source of truth for all reporting layers
◦   Unlimited scalability with accelerated data processing
◦   Improved disaster recovery measures with geo-replication
•   Data Infrastructure
◦   Automatic start, termination, and scaling of local storage
◦   Reduced support costs
•   Real-Time Reporting
◦   Real-time data refresh pipelines for critical reports

Our data migration process balances overcoming logistical challenges with achieving business objectives.