Lessons from a Petabyte-Scale Cloud Migration
Migrating petabytes of banking data between cloud providers is the kind of project that keeps you up at night. When Banregio — one of Mexico's largest banks — needed to move their analytical data warehouse from Azure Synapse Analytics to AWS Redshift, our team at Matilda Cloud took it on. Here's what I learned from a migration where failure wasn't an option.
The Scale of the Challenge
Banregio's Synapse environment held petabytes of structured financial data — transaction histories, customer analytics, regulatory reports, and real-time dashboards. This wasn't a lift-and-shift of a few VMs. We were migrating a live, production data warehouse that feeds critical banking operations, with strict requirements:
- Zero data loss: Banking regulation mandates complete data integrity.
- Minimal downtime: The analytical platform supports daily operations and regulatory reporting.
- Security compliance: All data must be encrypted in transit and at rest, with audit trails.
- Performance parity: Query performance on Redshift must match or exceed Synapse.
Architecture and Approach
We designed a phased migration strategy rather than a big-bang cutover:
1. Schema Migration: Translate Synapse table definitions, distributions, and indexes to Redshift equivalents. This isn't a 1:1 mapping. Synapse distribution styles (hash, round-robin, replicate) have rough counterparts in Redshift's KEY, EVEN, and ALL distribution styles, but Redshift has no indexes, so distribution keys and sort keys have to be chosen per table from the actual query patterns.
2. Historical Data Transfer: Export the bulk historical data from Synapse to Azure Blob Storage with Azure Data Factory, then transfer it to S3 over a dedicated VPN tunnel. Redshift COPY commands load the data from S3.
3. Incremental Sync: While the historical data transfers, keep the delta in sync using change data capture. This minimizes the cutover window.
4. Validation and Reconciliation: Run automated row counts, checksums, and sample query comparisons between source and target.
5. Cutover: Final sync, application switchover, and monitoring.
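The distribution-style part of the schema migration step can be sketched as a small lookup. This is a minimal illustration, assuming the conventional correspondence (hash to KEY, round-robin to EVEN, replicate to ALL); the helper name `redshift_dist_clause` and the column name in the usage line are hypothetical, and the sketch deliberately leaves sort keys and compression encodings out, since those depend on the workload.

```python
from typing import Optional


def redshift_dist_clause(synapse_distribution: str,
                         hash_column: Optional[str] = None) -> str:
    """Return a Redshift distribution clause for a Synapse distribution style.

    Hypothetical helper: it only covers the starting-point mapping and does
    not decide whether that mapping is the best choice for a given table.
    """
    # Normalize spellings such as "round-robin" or "Round_Robin".
    style = synapse_distribution.strip().upper().replace("-", "_")

    if style == "HASH":
        if not hash_column:
            raise ValueError("HASH distribution needs the hash column")
        return f"DISTSTYLE KEY DISTKEY ({hash_column})"
    if style == "ROUND_ROBIN":
        return "DISTSTYLE EVEN"
    if style == "REPLICATE":
        return "DISTSTYLE ALL"
    raise ValueError(f"unknown Synapse distribution: {synapse_distribution!r}")
```

For example, `redshift_dist_clause("HASH", "customer_id")` returns `DISTSTYLE KEY DISTKEY (customer_id)`, which can be appended to the translated `CREATE TABLE` statement before the sort key is chosen by hand.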
The Networking Foundation
Before a single byte of data moved, we spent weeks on networking. The data transfer path needed to be both fast and secure:
- VPN Tunnels: Site-to-site VPN between Azure and AWS with redundant tunnels for failover.
- Dedicated Bandwidth: Provisioned sufficient network bandwidth to avoid competing with production traffic.
- Network Segmentation: Isolated migration traffic on dedicated subnets with strict NSG/security group rules.
- Load Balancing: Distributed the data transfer across multiple parallel streams to maximize throughput without overwhelming any single path.
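Spreading a transfer across parallel streams is largely a question of balancing bytes per stream. The sketch below shows one common way to do that, a greedy "largest file first, least-loaded stream next" assignment. It is an illustration rather than the scheduler used in this migration, and `assign_to_streams` and the file names are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_to_streams(file_sizes: Dict[str, int],
                      stream_count: int) -> List[List[str]]:
    """Assign files to transfer streams so that total bytes stay balanced."""
    if stream_count < 1:
        raise ValueError("stream_count must be at least 1")

    # Min-heap of (bytes already assigned, stream index).
    heap = [(0, index) for index in range(stream_count)]
    heapq.heapify(heap)
    streams: List[List[str]] = [[] for _ in range(stream_count)]

    # Place the largest files first; each goes to the least-loaded stream.
    largest_first = sorted(file_sizes.items(), key=lambda item: item[1], reverse=True)
    for name, size in largest_first:
        load, index = heapq.heappop(heap)
        streams[index].append(name)
        heapq.heappush(heap, (load + size, index))
    return streams


plan = assign_to_streams({"a.parquet": 10, "b.parquet": 7,
                          "c.parquet": 5, "d.parquet": 4}, stream_count=2)
print(plan)  # [['a.parquet', 'd.parquet'], ['b.parquet', 'c.parquet']]
```

In this example the two streams carry 14 and 12 units respectively, so neither path is left idle while the other is saturated.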
Data Validation: The Hardest Part
Moving the data was actually the straightforward part. Proving that every row arrived correctly was where we spent the most effort. Our automated validation framework ran three levels of checks:
- Row Count Validation: Table-by-table comparison of source and target row counts.
- Checksum Validation: Hash-based checksums on key columns to verify data integrity.
- Query Validation: A suite of 200+ analytical queries run against both Synapse and Redshift, comparing results within acceptable precision tolerances.
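The three levels of checks can be sketched on in-memory rows as follows. This is a simplified illustration, not the framework itself: at this scale the counts and hashes would be computed inside each warehouse and only the results compared. The function names, the SHA-256 choice, and the default tolerance are all assumptions.

```python
import hashlib
import math
from typing import Iterable, Sequence, Tuple

Row = Tuple[object, ...]


def table_checksum(rows: Iterable[Row], key_columns: Sequence[int]) -> str:
    """Order-independent checksum over the selected columns of every row."""
    combined = 0
    for row in rows:
        payload = "\x1f".join(repr(row[i]) for i in key_columns).encode("utf-8")
        digest = hashlib.sha256(payload).digest()
        # Summing (rather than XOR-ing) keeps duplicate rows from cancelling out.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
    return f"{combined:064x}"


def validate_table(source_rows: Iterable[Row], target_rows: Iterable[Row],
                   key_columns: Sequence[int]) -> dict:
    """Levels 1 and 2: row counts and checksums."""
    source, target = list(source_rows), list(target_rows)
    return {
        "row_count_ok": len(source) == len(target),
        "checksum_ok": table_checksum(source, key_columns)
        == table_checksum(target, key_columns),
    }


def results_match(source: Sequence[float], target: Sequence[float],
                  rel_tol: float = 1e-9) -> bool:
    """Level 3: compare numeric query results within a relative tolerance."""
    return len(source) == len(target) and all(
        math.isclose(s, t, rel_tol=rel_tol) for s, t in zip(source, target)
    )
```

The checksum is order-independent on purpose, because the two engines rarely return rows in the same order. In practice values also need to be normalized (decimal scale, timestamp precision, trailing spaces) before hashing, otherwise representation differences show up as false mismatches.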
Disaster Recovery Planning
For a banking migration, you need a rollback plan for the rollback plan. We designed multiple fallback strategies:
- Point-in-time snapshots of both Synapse and Redshift at every stage.
- Parallel operation: Both systems ran simultaneously for two weeks post-cutover, with the ability to fail back to Synapse within minutes.
- Automated health checks that monitored Redshift query performance and data freshness, triggering alerts if degradation was detected.
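A health check of this kind reduces to comparing a few measurements against thresholds. The sketch below assumes two signals, the runtime of a canary query and the timestamp of the last successful load. The class, the function, and the threshold values are hypothetical placeholders, not the values used in production.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class HealthSample:
    query_seconds: float       # runtime of a canary query on Redshift
    last_loaded_at: datetime   # when the newest data landed


def health_alerts(sample: HealthSample,
                  baseline_seconds: float,
                  now: datetime,
                  max_slowdown: float = 1.5,
                  max_staleness: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return the names of the checks that failed (empty list means healthy)."""
    alerts: List[str] = []
    # Performance: alert when the query is slower than baseline * max_slowdown.
    if sample.query_seconds > baseline_seconds * max_slowdown:
        alerts.append("query_performance")
    # Freshness: alert when the newest data is older than max_staleness.
    if now - sample.last_loaded_at > max_staleness:
        alerts.append("data_freshness")
    return alerts
```

A scheduler would call `health_alerts` on a fixed interval and page someone whenever the returned list is non-empty. During the parallel-operation window, the same signal can feed the decision to fail back.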
Terraform and Infrastructure as Code
The entire AWS infrastructure — Redshift clusters, S3 buckets, VPN configurations, IAM roles, security groups — was provisioned and managed through Terraform. This gave us reproducible environments for testing the migration process before running it against production, and made the rollback plan concrete: we could tear down and rebuild the target environment from code.
Key Takeaways
1. Invest in validation infrastructure. The migration pipeline itself was maybe 30% of the effort. Validation, reconciliation, and monitoring were the other 70%.
2. Network first. Get the networking right before you start moving data. Every hour spent on network design saved days of troubleshooting later.
3. Parallel operation is non-negotiable for critical systems. Running both environments simultaneously during the transition period is expensive, but it's insurance you can't afford to skip.
4. Schema translation requires domain expertise. Automated schema conversion tools get you 80% of the way there. The remaining 20% (distribution keys, sort keys, compression encodings) determines whether your queries are fast or painfully slow.
5. Automation is survival. At petabyte scale, you cannot manually verify data integrity. Every check must be automated, scheduled, and wired to alerts.
The migration completed on schedule with zero data inconsistencies. Banregio's analytical queries now run on Redshift with comparable or better performance, and the bank has the flexibility of the AWS ecosystem for future data initiatives.