Introduction: The Staging Database Problem Every Team Faces
If you’ve ever needed a production-identical staging database, you know the pain. Traditional approaches — snapshots, logical dumps, replicas — are slow, expensive, or both. A 2TB Aurora database snapshot restore can take 30+ minutes and immediately doubles your storage costs. Logical dumps with mysqldump or pg_dump can take hours and hammer your production instance.
Amazon Aurora Cloning solves this with a copy-on-write protocol at the storage layer. A clone shares the same underlying storage pages as the source until either side modifies data. The result: a full, independent copy of your production database created in minutes, with near-zero additional storage cost at creation time.
This article is for intermediate-level DevOps engineers and database administrators who already run Aurora in production and want to implement a reliable, automated production-to-staging cloning workflow. You should be comfortable with the AWS CLI, IAM policies, and basic Aurora administration.
Prerequisites
- An existing Amazon Aurora MySQL-Compatible or Aurora PostgreSQL-Compatible cluster (version 1.x+ for MySQL, any for PostgreSQL)
- AWS CLI v2 installed and configured with appropriate credentials
- IAM permissions for rds:RestoreDBClusterToPointInTime, rds:CreateDBInstance, rds:DescribeDBClusters, and related actions
- Python 3.8+ with boto3 installed (for the automation scripts)
- Understanding of Aurora cluster architecture (cluster endpoint, reader endpoint, instances)
How Aurora Cloning Works Under the Hood
Aurora’s storage layer is fundamentally different from traditional RDS. Data is stored in a shared, distributed storage volume that spans three Availability Zones and is organized into 10GB logical units called protection groups. Each protection group is replicated six ways, with two copies in each of the three AZs.
When you create a clone, Aurora uses a copy-on-write protocol at the storage page level. Here’s what actually happens:
- At clone creation: Aurora creates new metadata pointers that reference the same underlying storage pages as the source. No data is physically copied. This is why cloning takes minutes regardless of database size.
- On read (from either source or clone): Both clusters read from the same shared pages. No additional storage is consumed.
- On write (from either source or clone): Aurora allocates a new storage page, copies the original page contents to it, then applies the modification. Only the modified pages consume additional storage.
This means a 4TB production database clone initially costs $0 in additional storage. You only pay for the pages that diverge over time, plus the compute cost of the clone’s DB instances.
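To make that concrete, here is a small back-of-the-envelope calculation. It only does arithmetic with figures mentioned in this article (a 4TB source, Aurora’s roughly $0.10 per GB-month standard storage price); the 50GB divergence figure is an assumption for illustration.

#!/usr/bin/env python3
# Rough estimate of a clone's incremental storage cost under copy-on-write.
# All inputs are illustrative assumptions; substitute your own numbers.
source_size_gb = 4096        # production volume size (not billed again for the clone)
diverged_gb = 50             # pages rewritten on either side since the clone was created
price_per_gb_month = 0.10    # Aurora standard storage price, us-east-1

clone_cost = diverged_gb * price_per_gb_month
full_copy_cost = source_size_gb * price_per_gb_month
print(f"Clone storage cost:   ~${clone_cost:.2f}/month")
print(f"Full-copy equivalent: ~${full_copy_cost:.2f}/month")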
Key Limitations to Know Upfront
| Limitation | Detail |
| --- | --- |
| Maximum clones per source | Up to 15 clones directly from a source cluster. You can also create clones of clones (up to 15 from each), though clone chains are limited to a depth of 15. |
| Cross-region cloning | Not supported. Clones must be in the same AWS Region as the source. |
| Cross-account cloning | Supported via AWS RAM (Resource Access Manager). Requires sharing the source cluster. |
| Parallel DDL during clone | Avoid running heavy DDL (ALTER TABLE) on the source during clone creation for best performance. |
| Engine version mismatch | Clone must use the same engine version as the source at creation time. You can upgrade afterward. |
| Aurora Serverless v1 | Cloning is supported for Aurora Serverless v2 and provisioned. Aurora Serverless v1 supports cloning as well, but only to provisioned clusters. |
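A practical way to keep an eye on the 15-clone limit is the CloneGroupId field returned by describe-db-clusters: a source cluster and its clones share the same clone group ID. Here is a minimal sketch, assuming the prod-aurora-mysql identifier used in the examples below; treat it as a convenience check, not an authoritative count.

#!/usr/bin/env python3
# List the clusters that share a clone group with the source cluster.
# Assumes the prod-aurora-mysql identifier from this article's examples.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
source = rds.describe_db_clusters(DBClusterIdentifier="prod-aurora-mysql")["DBClusters"][0]
clone_group = source.get("CloneGroupId")

if clone_group:
    # No pagination here; fine for small fleets.
    siblings = [
        c["DBClusterIdentifier"]
        for c in rds.describe_db_clusters()["DBClusters"]
        if c.get("CloneGroupId") == clone_group
    ]
    print(f"Clusters in clone group {clone_group}: {siblings}")
else:
    print("describe-db-clusters returned no CloneGroupId for the source.")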
Creating an Aurora Clone: Step by Step with the AWS CLI
The Aurora clone operation uses the restore-db-cluster-to-point-in-time API with --restore-type copy-on-write. This is the key parameter that distinguishes cloning from a point-in-time restore.
Step 1: Identify Your Source Cluster
# List your Aurora clusters
aws rds describe-db-clusters \
--query "DBClusters[*].[DBClusterIdentifier,Engine,EngineVersion,Status]" \
--output table
# Get details of your production cluster
aws rds describe-db-clusters \
--db-cluster-identifier prod-aurora-mysql \
--query "DBClusters[0].{ClusterID:DBClusterIdentifier,Engine:Engine,Version:EngineVersion,Status:Status,StorageEncrypted:StorageEncrypted,VpcSecurityGroups:VpcSecurityGroups[*].VpcSecurityGroupId,Subnets:DBSubnetGroup}" \
--output json
Step 2: Create the Clone Cluster
# Create the Aurora clone
aws rds restore-db-cluster-to-point-in-time \
--source-db-cluster-identifier prod-aurora-mysql \
--db-cluster-identifier staging-aurora-mysql \
--restore-type copy-on-write \
--use-latest-restorable-time \
--db-subnet-group-name staging-db-subnet-group \
--vpc-security-group-ids sg-0a1b2c3d4e5f67890 \
--tags Key=Environment,Value=staging Key=ManagedBy,Value=devops
# Wait for the clone cluster to become available
aws rds wait db-cluster-available \
--db-cluster-identifier staging-aurora-mysql
echo "Clone cluster is available."
Important: The restore-db-cluster-to-point-in-time command creates only the cluster (storage layer). You still need to add at least one DB instance to it.
Step 3: Add a DB Instance to the Clone
# Create a writer instance in the cloned cluster
aws rds create-db-instance \
--db-instance-identifier staging-aurora-mysql-instance-1 \
--db-cluster-identifier staging-aurora-mysql \
--db-instance-class db.r6g.large \
--engine aurora-mysql
# Wait for the instance to become available
aws rds wait db-instance-available \
--db-instance-identifier staging-aurora-mysql-instance-1
echo "Clone instance is ready."
Notice that I used db.r6g.large for staging instead of whatever your production might use (e.g., db.r6g.2xlarge). This is a common and recommended cost optimization — your staging workload likely doesn’t need the same compute power as production.
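If the staging workload later turns out to need more (or less) compute, you can resize the clone's instance in place instead of recreating it. A minimal sketch with boto3's modify_db_instance; the identifier matches the examples above, and the target class is an assumption. ApplyImmediately triggers a short restart of that instance.

#!/usr/bin/env python3
# Resize the staging instance in place; expect a brief restart.
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.modify_db_instance(
    DBInstanceIdentifier="staging-aurora-mysql-instance-1",  # from the examples above
    DBInstanceClass="db.r6g.xlarge",                         # assumption: the size you actually need
    ApplyImmediately=True,
)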
Step 4: Retrieve the Clone Endpoint
# Get the cluster endpoint for your staging clone
aws rds describe-db-clusters \
--db-cluster-identifier staging-aurora-mysql \
--query "DBClusters[0].Endpoint" \
--output text
# Example output:
# staging-aurora-mysql.cluster-abc123xyz.us-east-1.rds.amazonaws.com
Automating the Full Workflow with Python and Boto3
In practice, you’ll want to automate the entire clone lifecycle: create, use for testing, then destroy. Here’s a production-ready Python script that handles the complete workflow including cleanup of old clones.
#!/usr/bin/env python3
"""
Aurora Clone Manager — Creates staging clones from production
and cleans up old clones automatically.
"""
import boto3
from datetime import datetime, timezone
rds = boto3.client("rds", region_name="us-east-1")
SOURCE_CLUSTER = "prod-aurora-mysql"
CLONE_PREFIX = "staging-aurora-mysql"
CLONE_INSTANCE_CLASS = "db.r6g.large"
SUBNET_GROUP = "staging-db-subnet-group"
SECURITY_GROUPS = ["sg-0a1b2c3d4e5f67890"]
ENGINE = "aurora-mysql"
def get_clone_identifier():
"""Generate a unique clone identifier with timestamp."""
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
return f"{CLONE_PREFIX}-{timestamp}"
def delete_old_clones(keep_latest=1):
"""Delete old staging clones, keeping the N most recent."""
response = rds.describe_db_clusters()
clones = [
c for c in response["DBClusters"]
if c["DBClusterIdentifier"].startswith(CLONE_PREFIX)
]
# Sort by creation time, newest first
clones.sort(key=lambda c: c["ClusterCreateTime"], reverse=True)
for clone in clones[keep_latest:]:
cluster_id = clone["DBClusterIdentifier"]
print(f"Deleting old clone: {cluster_id}")
# First, delete all instances in the cluster
for member in clone["DBClusterMembers"]:
instance_id = member["DBInstanceIdentifier"]
print(f" Deleting instance: {instance_id}")
rds.delete_db_instance(
DBInstanceIdentifier=instance_id,
SkipFinalSnapshot=True,
)
waiter = rds.get_waiter("db_instance_deleted")
waiter.wait(
DBInstanceIdentifier=instance_id,
WaiterConfig={"Delay": 30, "MaxAttempts": 60},
)
# Then delete the cluster
rds.delete_db_cluster(
DBClusterIdentifier=cluster_id,
SkipFinalSnapshot=True,
)
print(f" Cluster {cluster_id} deletion initiated.")
def create_clone():
"""Create a new Aurora clone from the source cluster."""
clone_id = get_clone_identifier()
instance_id = f"{clone_id}-instance-1"
print(f"Creating clone cluster: {clone_id}")
rds.restore_db_cluster_to_point_in_time(
SourceDBClusterIdentifier=SOURCE_CLUSTER,
DBClusterIdentifier=clone_id,
RestoreType="copy-on-write",
UseLatestRestorableTime=True,
DBSubnetGroupName=SUBNET_GROUP,
VpcSecurityGroupIds=SECURITY_GROUPS,
Tags=[
{"Key": "Environment", "Value": "staging"},
{"Key": "ManagedBy", "Value": "clone-manager"},
{"Key": "SourceCluster", "Value": SOURCE_CLUSTER},
],
)
# Wait for cluster to be available
print("Waiting for clone cluster to become available...")
waiter = rds.get_waiter("db_cluster_available")
waiter.wait(
DBClusterIdentifier=clone_id,
WaiterConfig={"Delay": 30, "MaxAttempts": 60},
)
print("Clone cluster is available.")
# Create the writer instance
print(f"Creating instance: {instance_id}")
rds.create_db_instance(
DBInstanceIdentifier=instance_id,
DBClusterIdentifier=clone_id,
DBInstanceClass=CLONE_INSTANCE_CLASS,
Engine=ENGINE,
)
waiter = rds.get_waiter("db_instance_available")
waiter.wait(
DBInstanceIdentifier=instance_id,
WaiterConfig={"Delay": 30, "MaxAttempts": 60},
)
# Get the endpoint
response = rds.describe_db_clusters(DBClusterIdentifier=clone_id)
endpoint = response["DBClusters"][0]["Endpoint"]
print(f"\nClone ready!")
print(f" Cluster: {clone_id}")
print(f" Endpoint: {endpoint}")
print(f" Instance: {instance_id} ({CLONE_INSTANCE_CLASS})")
return clone_id, endpoint
if __name__ == "__main__":
print("=== Aurora Clone Manager ===\n")
# Clean up old clones first (keep only the latest 1)
print("--- Cleaning up old clones ---")
delete_old_clones(keep_latest=1)
# Create new clone
print("\n--- Creating new clone ---")
clone_id, endpoint = create_clone()
print(f"\nDone. Update your staging config to use: {endpoint}")
You can trigger this script from a CI/CD pipeline, a nightly cron job, or an AWS Lambda function (though be aware Lambda’s 15-minute timeout may not be sufficient — consider AWS Step Functions for orchestration).
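One pattern that works around the Lambda timeout is to do the blocking waits outside the function: have Step Functions loop on a short status-check Lambda instead of calling the boto3 waiters in-process. The sketch below is a hypothetical handler for that polling step, using only describe calls; field names come from the standard RDS API responses.

#!/usr/bin/env python3
# Hypothetical Lambda handler for a Step Functions polling loop: reports whether
# the clone cluster and all of its instances are available yet.
import boto3

rds = boto3.client("rds")

def handler(event, context):
    cluster_id = event["cluster_id"]
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    instances_ready = all(
        rds.describe_db_instances(DBInstanceIdentifier=m["DBInstanceIdentifier"])
           ["DBInstances"][0]["DBInstanceStatus"] == "available"
        for m in cluster["DBClusterMembers"]
    )
    return {
        "cluster_status": cluster["Status"],
        "ready": cluster["Status"] == "available" and instances_ready,
        "endpoint": cluster.get("Endpoint"),
    }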
Data Sanitization: Protecting Production Data in Staging
Cloning production to staging means your staging environment has real customer data. This is a compliance and security concern you must address. Run a sanitization script immediately after the clone is available.
#!/bin/bash
# sanitize-staging.sh — Run after clone creation
STAGING_ENDPOINT="staging-aurora-mysql-20240115-0800.cluster-abc123.us-east-1.rds.amazonaws.com"
STAGING_USER="admin"
STAGING_DB="myapp"
# Mask PII and truncate sensitive tables.
# Supply the password via the MYSQL_PWD environment variable (or ~/.my.cnf) so
# the script runs non-interactively; a bare -p would prompt on the terminal.
mysql -h "${STAGING_ENDPOINT}" -u "${STAGING_USER}" "${STAGING_DB}" <<'EOF'
-- Mask PII in the users table
UPDATE users SET
email = CONCAT('user_', id, '@example.com'),
phone = '555-000-0000',
first_name = CONCAT('FirstName_', id),
last_name = CONCAT('LastName_', id)
WHERE id > 0;
-- Truncate sensitive audit logs
TRUNCATE TABLE payment_audit_log;
TRUNCATE TABLE session_tokens;
-- Reset all passwords to a known staging hash
UPDATE users SET password_hash = '$2b$12$LJ3m4ys8Kqx.staged.hash.for.testing.only'
WHERE id > 0;
-- Verify row counts to confirm sanitization ran
SELECT 'users' AS tbl, COUNT(*) AS cnt FROM users
UNION ALL
SELECT 'payment_audit_log', COUNT(*) FROM payment_audit_log;
EOF
echo "Sanitization complete."
Critical point on copy-on-write cost: The sanitization UPDATE statements will cause page divergence. If you update every row in a 500GB table, those pages will be allocated as new storage for the clone. Plan accordingly — sanitizing a large table means that table’s storage is effectively duplicated.
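You can watch that divergence rather than guess at it. Aurora clusters publish a VolumeBytesUsed metric to CloudWatch (AWS/RDS namespace, DBClusterIdentifier dimension); comparing the source and the clone before and after sanitization gives you a feel for how much storage the UPDATEs forced the clone to own. A sketch, reusing the cluster identifiers from the CLI examples; verify against your actual bill rather than treating this as a billing tool.

#!/usr/bin/env python3
# Read the latest VolumeBytesUsed datapoint for the source and clone clusters.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

def latest_volume_gb(cluster_id):
    stats = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="VolumeBytesUsed",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] / 1024 ** 3 if points else None

for cluster in ("prod-aurora-mysql", "staging-aurora-mysql"):
    gb = latest_volume_gb(cluster)
    print(f"{cluster}: {gb:.1f} GB" if gb is not None else f"{cluster}: no datapoints yet")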
Cost and Performance Considerations
Storage Cost Model
Aurora storage is billed at $0.10 per GB-month (standard pricing in us-east-1, as of early 2025