In today’s digital landscape, databases have become the cornerstone of application performance, particularly with the meteoric rise of viral applications. The accelerating pace of user adoption is exemplified by platforms like Facebook, Instagram, and ChatGPT, where the time required to reach one million active users has shrunk dramatically over the years.
For instance, while Facebook took approximately ten months to reach one million users in 2004, Instagram achieved the same milestone in just two and a half months in 2010. More recently, ChatGPT saw similar explosive growth, reaching one million users in just five days post-launch.
This accelerated growth necessitates that database architectures, which are already complex, must evolve at an equally rapid pace. With the potential for user base expansions of 10x – 100x overnight, re-architecting databases without disrupting service has become a critical operational challenge. Sometimes, designs need to factor in 10x – 100x future growth right from the beginning.
Scaling approaches for Large Databases
Vertical Scaling
Vertical scaling involves upgrading a single database server with more powerful hardware—such as additional CPU, RAM, or storage—to handle increased load. This approach enhances performance and capacity quickly but is limited by the server’s maximum capabilities and can become costly and impractical as demands grow. Additionally, vertical scaling can introduce a single point of failure, which is undesirable for high-availability systems.
With the time to reach one million, and then one hundred million, active users shrinking dramatically, traditional vertical scaling approaches often reach their limits very quickly: hardware constraints appear, and the growing complexity and cost of scaling a single node yield diminishing returns.
Horizontal Scaling aka Sharding
As a result, many organizations turn to horizontal scaling techniques, such as sharding, to address scalability challenges. Sharding involves dividing a large database into smaller, more manageable pieces called shards, each of which can be hosted on a separate database server [1]. Each customer’s data resides in an assigned shard, and using techniques such as consistent hashing, the application routes each customer’s request to the appropriate shard [2]. This approach allows for greater flexibility and improved performance by distributing the load across multiple nodes. Sharding also enhances database availability, as the failure of a single shard is preferable to the entire database going down [3].
However, sharding is also not without its limitations. It introduces complexities in data management, as cross-shard queries require additional coordination, which can lead to performance bottlenecks. Additionally, maintaining consistency and balancing the load effectively across shards requires sophisticated algorithms and careful planning [4]. Despite these challenges, sharding remains a powerful technique for scaling out databases.
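To make the routing concrete, here is a minimal sketch of consistent-hash shard routing in Python. The `ConsistentHashRouter` class, the shard names, and the virtual-node count are illustrative assumptions, not a specific production implementation; a key property of this scheme is that removing a shard only remaps the keys that lived on it.

```python
import bisect
import hashlib

class ConsistentHashRouter:
    """Route a customer's requests to a shard via a consistent hash ring.
    Multiple virtual nodes (vnodes) per shard smooth out the distribution."""

    def __init__(self, shards, vnodes=100):
        # Ring of (hash point, shard), sorted by hash point.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def shard_for(self, customer_id):
        # Walk clockwise to the first ring point at or after the key's
        # hash, wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, self._hash(customer_id))
        return self._ring[idx % len(self._ring)][1]

router = ConsistentHashRouter(["shard-a", "shard-b", "shard-c"])
shard = router.shard_for("customer-42")  # deterministic: same key, same shard
```

Because only the removed shard’s vnodes disappear from the ring, keys that were already routed to surviving shards keep their assignments, which is exactly the rebalancing property that makes consistent hashing attractive for resharding.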
Migrating large databases with zero downtime
Database migration is a critical process for organizations that need to upgrade, scale, or transition their data infrastructure in response to viral growth. In high-availability scenarios, where continuous operation is essential, achieving zero downtime is particularly challenging. Below, we present a comprehensive approach to migrating large-scale databases with live traffic, emphasizing the importance of planning, validation, and execution to ensure a seamless transition.
Creating a Comprehensive Backup
The initial step in any migration process is to create a comprehensive backup of the existing database. This backup is crucial for several reasons:
- Data Safety: It acts as a fail-safe mechanism to restore the database to its original state in case of migration failures.
- Reference Point: It provides a reference for data consistency checks and validation during and after the migration.
Implementing point-in-time recovery (PITR) mechanisms can enhance data safety by allowing restoration to any specific moment before an incident [5].
Establishing a Multi-Cell Sharded Architecture
Once the backup is secured, the next phase involves creating a new database with a multi-cell sharded architecture. As explained above, vertical scaling is capped by the largest server you can provision, which may not be enough for future growth. Horizontal scaling, by contrast, is not capped by hardware size: you can provision as many database partitions (shards) as projected growth requires and distribute them across multiple servers or locations. The multi-cell architecture enhances scalability and availability [1]:
- Sharding Strategy: The database is partitioned based on certain criteria (e.g., customer ID, geographic location), ensuring even distribution of data and load.
- Cell Management: Each cell operates independently, facilitating parallel processing and minimizing the impact of failures. This approach aligns with microservices architecture principles, promoting isolation and scalability [6].
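The mapping from shard key to cell can be sketched as follows. The `CELLS` registry, the endpoint names, and the `cell_for` helper are hypothetical; the point is that a stable hash of the shard key (here, customer ID) deterministically pins each customer to one isolated cell.

```python
import hashlib

# Hypothetical cell registry: each cell is an independent failure domain
# with its own database endpoint.
CELLS = {
    0: {"endpoint": "db-cell-0.internal", "region": "us-east-1"},
    1: {"endpoint": "db-cell-1.internal", "region": "us-east-1"},
    2: {"endpoint": "db-cell-2.internal", "region": "us-west-2"},
}

def cell_for(customer_id: str) -> dict:
    """Deterministically map a customer to a cell via a stable hash of
    the shard key, so all of that customer's data lives in one cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]
```

Because the hash is stable across processes and restarts, every service instance agrees on which cell owns a given customer without any coordination.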
Asynchronous Propagation of Live Traffic
During the setup phase, live traffic is propagated to the new database asynchronously. This process involves:
- Dual Writes: Implementing a system where all write operations are sent to both the old and new databases, ensuring that the new database stays updated with the latest data [7].
- Data Replication: Using data replication techniques to synchronize data between the old and new databases in real-time or near-real-time.
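A minimal dual-write sketch follows, assuming the old database remains the source of truth during this phase; plain dicts stand in for real database clients, and the `DualWriter` name is illustrative. The authoritative write must succeed, while a failed shadow write is logged for later reconciliation rather than failing the customer’s request.

```python
import logging

log = logging.getLogger("migration.dual-write")

class DualWriter:
    """Send every write to both databases. The old database is still the
    source of truth, so its write must succeed; a failure on the new side
    is logged for later repair instead of failing the request."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def put(self, key, value):
        self.old_db[key] = value          # authoritative write: must succeed
        try:
            self.new_db[key] = value      # best-effort shadow write
        except Exception:
            log.exception("shadow write failed for key=%r; repair later", key)

old_db, new_db = {}, {}                   # stand-ins for real database clients
writer = DualWriter(old_db, new_db)
writer.put("customer-42", {"plan": "pro"})
```

In production the failed shadow writes would typically be captured in a repair queue so the backfill or a reconciliation job can close the gap.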
Backfilling Historical Data
Once the new database is receiving live traffic, historical data from the existing database must be backfilled. This step is essential for:
- Completeness: Ensuring the new database contains the entire dataset, including historical records.
- Data Integrity: Maintaining data consistency between the old and new systems by synchronizing all historical data.
Backfilling can be performed using bulk data transfer methods optimized for large datasets and scheduled during low-traffic periods to minimize impact.
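The backfill step can be sketched as a keyset-paginated copy loop; the `backfill` helper and the in-memory row format below are illustrative assumptions. The crucial detail is put-if-absent semantics: rows already written by the live dual-write path must never be overwritten with stale historical data.

```python
def backfill(source_rows, target, batch_size=1000):
    """Copy historical rows in primary-key order using keyset pagination.
    Returns the number of rows inserted; rows already present in the
    target (e.g. written by dual writes) are left untouched."""
    last_id = None
    inserted = 0
    while True:
        # Emulates: SELECT * FROM t WHERE id > last_id ORDER BY id LIMIT n
        batch = [r for r in sorted(source_rows, key=lambda r: r["id"])
                 if last_id is None or r["id"] > last_id][:batch_size]
        if not batch:
            return inserted
        for row in batch:
            if row["id"] not in target:   # put-if-absent: never clobber
                target[row["id"]] = row   # fresher data from dual writes
                inserted += 1
        last_id = batch[-1]["id"]
```

Keyset pagination (resuming from the last seen primary key) keeps each batch cheap on large tables and makes the job trivially resumable after an interruption.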
Offline Validation
After backfilling, offline validation is performed to compare the new database against the original backup. This phase includes:
- Data Verification: Checking that all data has been accurately transferred and that the new database reflects the state of the old database at the time of backup.
- Integrity Checks: Performing integrity checks to confirm that no data has been corrupted or lost during the migration process.
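Chunked checksums are one common way to make this comparison tractable at scale; the helper names below are hypothetical. Instead of diffing every row pair, both sides hash fixed-size chunks of rows (in the same key order), and only chunks whose digests differ need row-level investigation.

```python
import hashlib

def chunk_checksum(rows):
    """Order-sensitive digest over a chunk of rows; both sides must hash
    rows in the same key order for digests to be comparable."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(sorted(row.items())).encode())
    return digest.hexdigest()

def find_mismatched_chunks(old_rows, new_rows, chunk_size=1000):
    """Compare the two datasets chunk by chunk and return the indexes of
    chunks whose checksums differ, so only those need row-level diffing."""
    mismatched = []
    total = max(len(old_rows), len(new_rows))
    for start in range(0, total, chunk_size):
        old_chunk = old_rows[start:start + chunk_size]
        new_chunk = new_rows[start:start + chunk_size]
        if chunk_checksum(old_chunk) != chunk_checksum(new_chunk):
            mismatched.append(start // chunk_size)
    return mismatched
```

This turns a full-table comparison into a cheap digest pass plus a narrow follow-up on the handful of chunks that actually disagree.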
Live Validation
Following offline validation, live validation involves:
- Real-Time Comparison: Cross-referencing data read from both the new and old databases to ensure consistency in real-time.
- Consistency Monitoring: Continuously monitoring for discrepancies and ensuring that the new database accurately reflects live traffic data.
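A sketch of sampled shadow reads follows, assuming dict-like database clients; the `ShadowReader` name and sample rate are illustrative. Reads are still served from the old, authoritative database, while a configurable sample of responses is compared against the new database and discrepancies are counted (in production, these counters would feed metrics and alarms).

```python
import random

class ShadowReader:
    """Serve reads from the old (authoritative) database while comparing a
    sample of responses against the new database and counting mismatches."""

    def __init__(self, old_db, new_db, sample_rate=0.1):
        self.old_db = old_db
        self.new_db = new_db
        self.sample_rate = sample_rate
        self.compared = 0
        self.mismatches = 0

    def get(self, key):
        value = self.old_db.get(key)          # old DB still serves traffic
        if random.random() < self.sample_rate:
            self.compared += 1
            if self.new_db.get(key) != value:
                self.mismatches += 1          # emit a metric/alarm here
        return value
```

Sampling keeps the extra read load on the new database bounded while still surfacing systematic divergence quickly under real traffic.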
Cutover Process
Once confidence in the new database’s accuracy is established, the cutover process begins:
- Traffic Redirection: Gradually directing all read and write traffic to the new system, effectively transitioning operations from the old database.
- Backup Strategy: Retaining the old database as a backup and taking regular backups of the new database to protect against potential issues.
- Monitoring: Continuously monitoring the new system for stability and performance issues.
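One common way to implement gradual traffic redirection is a sticky percentage rollout keyed on the shard key; `use_new_db` below is an illustrative helper, not a specific product feature. Each customer hashes to a stable bucket from 0 to 99, so raising the rollout percentage (say 1 → 10 → 50 → 100) only ever moves more customers onto the new database and never flips a customer back and forth between the two systems.

```python
import hashlib

def use_new_db(customer_id: str, rollout_pct: int) -> bool:
    """Sticky percentage rollout for cutover: a customer's bucket is a
    stable hash in [0, 100), so increasing rollout_pct is monotonic --
    once a customer is on the new database, they stay there."""
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

Stickiness matters during cutover: a customer bouncing between databases would observe their own writes inconsistently, whereas a one-way ratchet confines each customer’s reads and writes to a single system at any rollout level.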
Decommissioning the Old Database
After verifying the stability and operational status of the new database, the old database can be decommissioned. This step includes:
- Final Backup: Taking a final backup of the old database before it is decommissioned.
- Resource Reallocation: Reallocating or releasing resources previously dedicated to the old database.
Best Practices
- Comprehensive Testing: Simulate the migration process in a staging environment before rolling it out in production.
- Monitoring and Observability: Implement robust monitoring (metrics, canaries, alarms) to detect issues early.
- Rollback Plan: Always have a rollback strategy in case of unforeseen problems.
- Stakeholder Communication: Proactively communicate all stages of your database migration, and keep all relevant teams informed throughout the process.
Conclusion
Migrating very large-scale databases with live traffic and achieving zero downtime requires meticulous planning, advanced techniques such as sharding, and a structured approach to ensure a seamless transition. By following the outlined steps—backup creation, architecture setup, live traffic propagation, data validation, and careful cutover—organizations can effectively manage database migrations while maintaining uninterrupted service. Embracing these strategies is essential in an era where application availability and scalability are critical to success.
About the Author
Vibhas Zanpure is a senior software engineer at Amazon, with more than a decade of experience in building hyperscale, cellular, cloud-native applications. He has a strong passion for databases, solving complex technical problems, and building hyperscale distributed systems.
References
- [1] Baker, J., Bond, C., Corbett, J. C., et al. (2011). Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In Proceedings of the Conference on Innovative Data Systems Research (CIDR ’11), 223–234.
- [2] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., & Lewin, D. (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’97), 654–663.
- [3] Amazon Web Services. What Is Database Sharding? https://aws.amazon.com/what-is/database-sharding/
- [4] Abadi, D. J. (2012). Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story. IEEE Computer, 45(2), 37–42.
- [5] The PostgreSQL Global Development Group. (2023). PostgreSQL Documentation: Continuous Archiving and Point-in-Time Recovery (PITR).
- [6] Newman, S. (2015). Building Microservices: Designing Fine-Grained Systems. O’Reilly Media.
- [7] Amazon Web Services. Rolling Back from a Migration with AWS DMS. https://aws.amazon.com/blogs/database/rolling-back-from-a-migration-with-aws-dms/