Architecting Fault-Tolerant Multi-Region Systems
This guide documents a real-world journey of designing, building, and operating a multi-region
AWS platform capable of sustaining a 99.99% Service Level Agreement (SLA). Whether you're a
junior cloud engineer trying to wrap your head around high-availability concepts or a senior
architect validating your own design decisions, this book was written with you in mind. I've tried to
avoid the trap of making things sound easy when they aren't — because the truth is, multi-region
systems are genuinely hard, and I want you to walk away with practical knowledge, not just
theoretical fluff.
Twenty-three minutes in the payments world isn't just embarrassing. It's regulatory exposure, it's
lost transaction revenue, and it's customer trust eroding in real time. The CTO received a call from
their largest enterprise client threatening to terminate the contract if they couldn't demonstrate a
credible path to 99.99% availability within the next 180 days.
Let me stop here, because this number gets thrown around a lot without people really understanding what it implies.

Going from 99.9% to 99.99% isn't just adding a "9" — it's a 10x reduction in your downtime
budget. That 23-minute outage I described? It had already consumed roughly 44% of the entire
annual downtime budget for a 99.99% SLA target. In one incident.
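The arithmetic is worth making concrete. A quick back-of-the-envelope script (purely illustrative, no AWS dependencies):

```shell
# Annual downtime budget for a given SLA target, and how much of it
# a single outage consumes.
minutes_per_year=525600   # 365 days * 24 hours * 60 minutes
sla="99.99"
outage_minutes=23

budget=$(awk -v m="$minutes_per_year" -v s="$sla" \
  'BEGIN { printf "%.2f", m * (100 - s) / 100 }')
consumed=$(awk -v o="$outage_minutes" -v b="$budget" \
  'BEGIN { printf "%.0f", o / b * 100 }')

echo "Downtime budget at ${sla}%: ${budget} minutes/year"
echo "A ${outage_minutes}-minute outage consumes ${consumed}% of it"
```

Run it with any SLA value to see how quickly the budget shrinks: 99.9% gives you 525.6 minutes a year; 99.99% gives you 52.56.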
This is why 99.99% SLA almost always implies multi-region architecture. A single AWS region,
even with all its redundancy features, can experience regional-level events (as the October 2025
AWS outage demonstrated) that no amount of Multi-AZ configuration can protect you from. The
moment you accept that regional outages are a real threat — not a theoretical one — the
architecture decisions start to fall into place naturally.
The existing environment had several constraints that we couldn't simply ignore:
• Database schema complexity: The MySQL database had 340+ tables with complex
foreign key relationships. This wasn't something we could casually migrate to a different
engine without significant risk.
• Application coupling: The monolith maintained long-lived database connections with
stored procedures and MySQL-specific features. Any database change required close
coordination with the development team.
• Compliance requirements: As a financial services company, they fell under RBI (Reserve
Bank of India) guidelines, which imposed specific data residency requirements — certain
customer PII had to remain within India's geographic boundaries.
• Operational maturity: The team was genuinely skilled at running the existing stack, but
had limited experience with multi-region architecture, global traffic management, or
chaos engineering.
The CFO was supportive but cautious. The approved budget for this transformation was $80,000
USD for the initial 120-day engagement (my professional fees plus implementation costs), with
an expected ongoing infrastructure cost increase of no more than 40% above their current AWS
bill.
Their current AWS spend was approximately $12,000/month. That meant our target architecture
needed to land somewhere under $17,000/month. This budget constraint was, frankly, one of the
most useful forcing functions of the entire project. It stopped me from reaching for every shiny
AWS service and forced genuine prioritization.
Timeline: 120 days to production readiness, with a hard deadline driven by that enterprise client's
contract review.
Chapter 2: Initial Assessment — What I Found Under the Hood
My Evaluation Process
The first two weeks of the engagement were purely investigative. I've seen architects rush past
this phase and it almost always costs them later. Before I drew a single architecture diagram, I
wanted to understand the system as it actually existed, not as it was documented.
I started with a set of structured discovery sessions with different stakeholder groups:
With the engineering team (3 sessions, 6 hours total): The engineers were candid in ways that
surprised me. They knew the system had problems. The most common phrase I heard was "we
have a ticket for that." The backlog was littered with "improve connection handling," "add retry
logic," "implement circuit breaker" tickets — all marked medium priority, all perpetually
deprioritized in favor of feature work.
With the operations team (2 sessions, 4 hours total): The ops team managed everything
through the AWS console. There was almost no Infrastructure as Code. Resources had been
created manually, often with inconsistent naming, missing tags, and no documentation of why
certain configurations existed. I asked one engineer why the RDS parameter group
had max_connections set to 500 instead of the default, and nobody could remember.
With the business leadership (1 session, 2 hours): Leadership understood the financial stakes
clearly. They also surfaced a requirement that hadn't been in the brief: the platform needed to
support a Southeast Asian expansion within 18 months, which meant the architecture we built
now needed to accommodate a second region in Singapore (ap-southeast-1) as a future growth market.

I ran a two-week observation period, collecting metrics from CloudWatch, enabling VPC Flow
Logs (which had not been turned on), and analyzing RDS Performance Insights data. Here's a
summary of what I found:
Application Performance:
• Average API response time: 340ms at baseline, spiking to 2.8 seconds during business
hours
• P99 response time: 4.2 seconds (unacceptable for a payments platform)
• Database query response time: The top 5 slowest queries accounted for 73% of all
database load
• Connection pool wait time: 180ms on average during peak hours, with occasional
timeouts
Infrastructure Reliability:
• The Auto Scaling Group was configured with a minimum of 2 instances and a maximum
of 4 — far too constrained for production traffic
• Health checks on the Application Load Balancer were using the root path /, which
returned a 200 even when the database connection was broken (classic trap)
• No scheduled maintenance windows had been defined for RDS, meaning AWS could
apply minor version updates at any time
• Backup retention was set to 7 days with no cross-region backup copy configured
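The root-path health-check trap above has a structural fix: point the target group at a dedicated endpoint whose handler runs a cheap query (such as SELECT 1) against the database, so a broken connection pool actually fails the check. A sketch in Terraform — the resource names, port, and thresholds here are illustrative, not the exact values deployed:

```hcl
resource "aws_lb_target_group" "app" {
  name     = "payments-app" # illustrative name
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.primary.id

  health_check {
    path                = "/health" # deep check, not "/"
    matcher             = "200"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```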
Security Posture:
• Root account had no MFA enabled (this was the most alarming finding)
• IAM users had direct AdministratorAccess policies with no MFA enforcement
• Security groups allowed inbound port 22 (SSH) from 0.0.0.0/0 on bastion hosts
• No CloudTrail logging to an immutable S3 bucket
Cost Efficiency:
• 12 EC2 instances running in us-east-1 from a previous project that had been
"temporarily" migrated and never cleaned up — approximately $1,200/month of pure
waste
After the assessment, I put together a formal risk register. The top risks were:
1. Single region dependency — Any regional event would cause complete service loss
2. No automated failover for any tier — Every failure required manual human
intervention
3. Database as a single point of failure — Even with Multi-AZ, the failover process
exposed application fragility
4. Zero runbook documentation — Recovery procedures existed only in people's heads
5. Insufficient observability — Without distributed tracing, diagnosing cross-service failures was slow guesswork
6. Compliance gap — No evidence of data classification, no encryption at rest for several RDS tables, no audit logging of data access
Chapter 3: Solution Design — The Architecture Decisions That Mattered
This was the first major architectural decision, and I want to be transparent about the deliberation
process, because it's not as straightforward as most guides make it seem.
Active-Active means both regions handle live traffic simultaneously. This gives you true zero-
downtime failover — if one region goes down, the other region is already serving traffic and
simply absorbs the load. The challenge is that it requires your data layer to be genuinely multi-
master, meaning writes can happen in both regions simultaneously. For a payment processing
system, this introduces the risk of split-brain scenarios and conflict resolution complexity.
Active-Passive (Warm Standby) means one region handles all production traffic while a second
region maintains a running but scaled-down replica of the entire infrastructure. During a regional
failure, traffic is redirected to the standby region, and the standby infrastructure scales up to
handle full production load. This approach balances cost against recovery time — the secondary
region costs roughly 30-40% of the primary, but your failover involves a few minutes of DNS
propagation and database promotion.
For FinServ Co., I recommended a Warm Standby Active-Passive architecture as the initial
implementation, with a clear migration path to Active-Active for non-payment workloads. My
reasoning:
• The compliance constraints around financial transactions made true active-active
database writes across regions legally complex
• The budget constraint made full active-active infrastructure cost-prohibitive (it essentially
doubles your infrastructure costs)
• 99.99% SLA with warm standby and sub-5-minute RTO is achievable with modern AWS
services like Aurora Global Database and Route 53 ARC
Why This Matters: Many architects jump straight to active-active because it sounds better. But
active-active with a poorly designed data layer can actually be less reliable than well-executed
active-passive, because data consistency issues can corrupt application state across both regions.
AWS Services Selected (and Why Each One Made the Cut)
I went through a deliberate service selection process. For every major component, I evaluated the AWS-native option, the open-source self-managed option, and any relevant managed third-party alternative.

Amazon Aurora Global Database (Data Tier) The main alternative was conventional cross-region RDS Read Replicas, which can have replication lag of 30 seconds to several minutes depending on write volume. For a payments platform where a 5-minute RPO is required, Aurora Global Database was the clear choice.
The secondary benefit was the failover mechanics. Aurora Global Database supports a "managed
planned failover" for scheduled maintenance that promotes the secondary region to primary with
zero data loss. For unplanned failures, the promotion typically completes in under 1 minute.
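Both failover modes map to single CLI calls. A hedged sketch — the global cluster and target cluster identifiers below are placeholders:

```shell
# Planned switchover for maintenance: zero data loss, replication
# drains before the secondary is promoted.
aws rds switchover-global-cluster \
  --global-cluster-identifier finservco-global \
  --target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finservco-secondary

# Unplanned regional failure: promote the secondary immediately,
# accepting up to the current replication lag as data loss.
aws rds failover-global-cluster \
  --global-cluster-identifier finservco-global \
  --target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finservco-secondary \
  --allow-data-loss
```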
Amazon ECS on Fargate (Primary Application Tier) The team had been running EC2-based
Auto Scaling Groups. I evaluated three options: EC2 with ASG, ECS on Fargate, and EKS. EKS was
ruled out based on operational complexity and cost — managing a Kubernetes control plane
adds significant overhead, and the application didn't have requirements that demanded
Kubernetes-specific features (custom resource definitions, complex scheduling, etc.). EC2 with
ASG was familiar to the team but added unnecessary OS-level management burden. Fargate
struck the right balance: container-based workloads with no node management, native
integration with Service Discovery and ALB, and straightforward auto-scaling with built-in multi-
AZ support.
Route 53 Application Recovery Controller (ARC) This deserves a dedicated discussion because
it's one of the most underutilized services in the AWS portfolio. Route 53 ARC, particularly after
the introduction of ARC Region Switch in August 2025, provides a fully managed, centralized
orchestration layer for multi-region failover. Before ARC, executing a regional failover required
running complex scripts across multiple services — updating Route 53 records, promoting Aurora
secondaries, scaling up ECS services in the standby region, updating application configuration —
all in the right sequence and within a tight time window. ARC Region Switch turns this into a
guided, pre-validated workflow. As one disaster recovery architect noted in the community when
ARC Region Switch launched: "This will make x-region DR runbooks so much easier".
AWS Global Accelerator Route 53 health checks have a propagation delay — changes to DNS
records can take 60-120 seconds to propagate globally, depending on client-side TTL caching.
Global Accelerator solves this by routing traffic through AWS's private backbone network and
performing health checks at the edge, with failover measured in seconds rather than minutes. For a 99.99% target, that difference matters.
Amazon VPC with Transit Gateway The multi-region architecture required secure, low-latency
connectivity between regions for replication traffic and management plane operations. Rather
than setting up individual VPC peering connections (which doesn't scale), I implemented Transit
Gateway in each region with inter-region peering. This also laid the groundwork for the future
Singapore expansion.
AWS Secrets Manager with Cross-Region Replication Given the security findings around
secrets in environment variables, AWS Secrets Manager was non-negotiable. The cross-region
replication feature ensures that the standby region has access to all application secrets without
any manual synchronization.
AWS KMS with Multi-Region Keys Encryption at rest is straightforward in a single region. In a
multi-region context, you need to decide how to handle encryption key distribution. AWS KMS
Multi-Region Keys allow the same logical key material to be available in multiple regions, which
means encrypted data (Aurora snapshots, S3 objects, EBS volumes) can be decrypted in the
secondary region without key migration procedures. I'll cover the key policy design in depth in a
later chapter.
AWS Systems Manager Session Manager I made a point of eliminating all SSH-based access
from the architecture. Session Manager provides browser-based and CLI-based shell access to
EC2 instances (and ECS containers via ECS Exec) using IAM authentication, with full audit logging
to CloudTrail and CloudWatch. No bastion hosts, no port 22, no SSH keys to rotate. In a financial
services environment, this is not just a convenience — it's a significant reduction in attack surface.
Let me walk through the architecture from edge to core, because understanding the traffic flow is
essential for understanding why each component exists.
• DNS resolution via Route 53, using latency-based routing as the primary routing policy
with health checks at the region level
• AWS Global Accelerator providing anycast IP addresses and edge-based health checking
with sub-30-second failover capability
• CloudFront CDN in front of static assets and read-heavy API endpoints, with S3 origin in
the primary region and cross-region replication to secondary
• AWS WAF attached to the ALB with managed rule groups for OWASP Top 10
Standby Region Application Tier (ap-southeast-1):
• Same ECS service definitions using the same container images from ECR with cross-
region replication
• Route 53 ARC routing controls keep this region's traffic at zero until failover is triggered
Data Layer:
• Aurora Global Database with primary cluster in ap-south-1 (3 AZs, 2 read replicas)
• Aurora secondary cluster in ap-southeast-1 with ~1 second replication lag
• ElastiCache Redis with Global Datastore for session state replication
• S3 with Cross-Region Replication for all object storage
Management and Security Layer:
One area where I pushed back on the default recommendation was NAT Gateway costs. Routing
all private subnet traffic through NAT Gateways for accessing AWS services is expensive at scale. I
implemented VPC Endpoints for all services that support them (S3, DynamoDB, SSM, Secrets
Manager, ECR, CloudWatch Logs, STS) — this eliminated roughly $600/month in NAT Gateway
data processing charges while simultaneously improving security (traffic stays on the AWS
backbone).
Chapter 4: Solution Design Deep Dive — The Areas That
Bite You
FinOps isn't an afterthought — it's a design principle. Multi-region architectures have a well-
deserved reputation for runaway costs, particularly around data transfer, and I wanted to bake
cost visibility into the platform from day one.
Tagging Strategy
Before deploying a single resource, I established a mandatory tagging schema enforced through
AWS Config rules and Service Control Policies:
Any resource missing these tags triggers a Config rule violation and a notification to the team
lead. This isn't just about cost allocation — it's about accountability. When a developer sees their
name on a $400/month RDS instance they created for a test, they tend to clean it up.
With a 40% On-Demand cost reduction target, I implemented a tiered commitment strategy
aligned with the AWS FinOps Framework:
• 3-Year Compute Savings Plans for the baseline ECS Fargate workload that runs 24/7
(approximately 60% of average Fargate usage) — this yielded around 52% savings vs. On-
Demand
• 1-Year Convertible Reserved Instances for the Aurora database instances —
Convertible RIs allow instance family exchanges, which gives flexibility as the database
scales
• Regional RIs rather than Zonal RIs for EC2 workloads, to allow the RI discount to apply to matching usage in any Availability Zone in the region
One thing I got wrong on the first pass: I purchased Standard RIs for the production EC2
instances in the secondary region, which ran at near-zero utilization most of the time (remember,
it's warm standby). When I modeled the actual utilization, Convertible RIs were a better fit for
secondary-region resources because they allowed exchanges as the architecture evolved.
Pro Tip: Use the AWS Cost Explorer Reserved Instance Utilization and Coverage reports
weekly, not monthly. By the time the monthly report arrives, you've already missed the window to
adjust. Set up CloudWatch alarms on RI utilization dropping below 80% — that's your signal that
something in the architecture has changed.
I set up Cost Anomaly Detection with monitors for each service and each account, with
thresholds set to alert at 15% above the 30-day rolling average. This caught an early issue where
VPC Flow Logs to CloudWatch were generating far more data than expected (more on this in the
challenges section).
This is where multi-region architectures bleed money if you're not careful. Every byte that crosses a region boundary costs money, so minimizing cross-region data transfer was treated as a first-class design constraint rather than a cleanup task for later.
AWS launched Database Savings Plans at re:Invent 2025, extending commitment-based discounts
to RDS and Aurora workloads. If you're architecting a new multi-region platform today, factor in
Database Savings Plans alongside Compute Savings Plans — the combined coverage can reduce
your database costs by 30-40%.
I use the AWS Well-Architected Framework not as a compliance checkbox but as a genuine
design review tool. Let me walk through how each pillar influenced specific decisions in this
architecture.
Operational Excellence The key operational excellence principle for multi-region systems is: everything must be automated, documented, and regularly tested. The worst time to figure out your failover procedure is in the middle of an actual regional failure.
• AWS Systems Manager Documents (SSM Documents) for all operational runbooks, executed via Automation, not shell scripts
• AWS Config Conformance Packs to continuously validate that the infrastructure stays aligned with its intended configuration
Reliability (The Core Pillar for This Project) The reliability pillar is most directly relevant to our
99.99% SLA goal. The foundational principle is fault isolation: design your system so that failures
are contained and cannot cascade across boundaries. AWS Regions are the largest fault isolation
boundary, which is precisely why multi-region is necessary for 99.99%.
Key reliability design decisions:
• No cross-region dependencies in the request path for either region — each region must be able to serve requests using only the resources it owns
Security The security pillar is woven throughout every layer of this architecture. The single most
impactful change from a security posture standpoint was implementing the principle of least
privilege at every layer, enforced through IAM Permission Boundaries (explained in detail in a
later section).
Performance Efficiency For a 340ms average API response time, there was meaningful
optimization opportunity. The architecture addressed performance at multiple layers:
• CloudFront caching for static assets and read-heavy API responses reduced origin load by
approximately 60%
• ElastiCache Redis for session and computed result caching reduced database read load
significantly
• Aurora Read Replicas in the primary region for read-heavy reporting queries, offloading
the writer instance
Sustainability AWS added sustainability as the sixth pillar in 2021. For this architecture,
sustainability considerations included:
• Graviton3-based EC2 instances for any EC2 workloads (Graviton offers comparable or
better performance at significantly lower power consumption)
• ECS Fargate spot tasks for non-production environments
• S3 Intelligent-Tiering for document storage to automatically move infrequently accessed
objects to cheaper storage classes
AWS Security Reference Architecture (SRA)
The AWS Security Reference Architecture provides a blueprint for deploying security services
across a multi-account AWS Organization. Before I explain how I applied it, let me explain why the multi-account model matters in the first place.

Running all your workloads in a single AWS account is like running your entire company from a
single server — it works until it doesn't, and when something goes wrong, the blast radius is your
entire business. The SRA recommends a purpose-built multi-account structure where security
boundaries between accounts limit the damage any single compromised credential can cause.
This account boundary approach means that even if an attacker compromises credentials in the
Production Primary Account, they cannot access CloudTrail logs (which reside in the Log Archive
Account), cannot modify GuardDuty findings (Security Tooling Account), and cannot affect the
Shared Services infrastructure.
AWS Control Tower manages the guardrails across these accounts. Control Tower provides:
• Pre-built Service Control Policies (SCPs) that prevent disabling GuardDuty, CloudTrail, or
Config
• Automated account vending with a Landing Zone template
• Centralized compliance dashboard
GuardDuty is enabled organization-wide, with the Security Tooling account as the delegated administrator.
AWS Security Hub is configured as an aggregated findings dashboard in the Security Tooling
account. It pulls findings from GuardDuty, Inspector, Macie, Config, and IAM Access Analyzer —
giving a single pane of glass for the security posture across all accounts.
SCPs in the Management Account define the outer boundary of what any account in the
organization can do, regardless of what IAM policies say. Think of SCPs as a veto power — the
maximum permissions any principal can ever have.
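To make the veto concrete, here is a minimal SCP in that spirit. The action names are real, but this is an illustrative sketch, not the exact policy set deployed:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDisablingSecurityServices",
      "Effect": "Deny",
      "Action": [
        "guardduty:DeleteDetector",
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "config:StopConfigurationRecorder",
        "config:DeleteConfigurationRecorder"
      ],
      "Resource": "*"
    }
  ]
}
```

Even a principal with AdministratorAccess in a member account cannot perform these actions, because the SCP caps what any IAM policy can grant.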
This section gets more attention than it deserves in most guides, so I want to explain the why as well as the how. Session Manager gives you shell access to instances without any open inbound ports. Access is controlled entirely through IAM
policies, which means the same identity federation and MFA enforcement that protects your AWS
console also protects your instance access. Every session is logged to CloudWatch Logs and
CloudTrail — you have a complete, tamper-evident audit trail of every command executed.
First, ensure the SSM Agent is installed on your instances. For Amazon Linux 2023 and Amazon
Linux 2, it's pre-installed. For other distributions, you'll need to install it manually.
The instance needs an IAM instance profile with the AmazonSSMManagedInstanceCore managed
policy. This allows the SSM agent to register with Systems Manager and maintain the session.
Critically, the agent communicates outbound to the SSM endpoints — no inbound connectivity
required.
For instances in private subnets (as all production instances should be), you need VPC Endpoints
for SSM:
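A sketch of the endpoint creation, with placeholder VPC, subnet, and security group IDs:

```shell
# Interface VPC Endpoints for the three Session Manager services.
# The IDs below are placeholders - substitute your own.
for service in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc123 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.ap-south-1.${service}" \
    --subnet-ids subnet-0abc123 subnet-0def456 \
    --security-group-ids sg-0abc123 \
    --private-dns-enabled \
    --region ap-south-1
done
```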
What these commands do: Each create-vpc-endpoint call creates an Interface VPC Endpoint
— essentially an Elastic Network Interface (ENI) in your subnet that routes traffic destined for the
specified AWS service through AWS's private network rather than through the internet or NAT
Gateway. The --private-dns-enabled flag means that the standard SSM endpoint URLs resolve
to private IP addresses when queried from within your VPC. The --security-group-
ids parameter controls which instances can communicate with the endpoint.
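For reference, the session preferences described next live in the SSM-SessionManagerRunShell document. A sketch with placeholder bucket, log group, and key names:

```json
{
  "schemaVersion": "1.0",
  "description": "Session Manager preferences with full audit logging",
  "sessionType": "Standard_Stream",
  "inputs": {
    "s3BucketName": "finservco-session-logs",
    "s3KeyPrefix": "sessions/",
    "s3EncryptionEnabled": true,
    "cloudWatchLogGroupName": "/ssm/session-logs",
    "cloudWatchEncryptionEnabled": true,
    "cloudWatchStreamingEnabled": true,
    "kmsKeyId": "arn:aws:kms:ap-south-1:123456789012:key/example-key-id",
    "shellProfile": {
      "linux": "export HISTFILE=/dev/null"
    }
  }
}
```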
What this configuration does: The s3BucketName and s3KeyPrefix settings ensure that every
session is recorded and stored in S3 for long-term audit
purposes. cloudWatchStreamingEnabled: true means you get real-time log streaming as the
session is active, which is useful for security monitoring. The kmsKeyId ensures that session logs
are encrypted using your CMK (Customer Managed Key). The shellProfile at the bottom
sets HISTFILE=/dev/null to prevent commands from being cached in the shell history file (since
we're capturing everything in CloudWatch anyway, the shell history file would be redundant and
potentially accessible to other processes).
Connecting to an Instance:
# Start a session from your local workstation (requires the Session Manager plugin for the AWS CLI)
aws ssm start-session \
  --target i-0abc123def456789 \
  --region ap-south-1
# For ECS containers using ECS Exec
aws ecs execute-command \
--cluster production-primary \
--task arn:aws:ecs:ap-south-1:123456789012:task/abc123 \
--container payment-processor \
--command "/bin/bash" \
--interactive \
--region ap-south-1
Gotcha: ECS Exec requires --enable-execute-command to be set when creating the ECS service, AND the task role needs ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, and ssmmessages:OpenDataChannel permissions. I've seen this trip up a lot of teams who set up ECS Exec but never test it before they actually need it.
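A minimal identity-policy statement for the task role covering those four actions might look like this (these channel actions are conventionally granted against all resources):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEcsExecChannels",
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
```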
VPC Flow Logs capture metadata about IP traffic flowing through your network interfaces. Note
the word "metadata" — Flow Logs don't capture the actual packet content, just the who, what,
when, and whether. This is an important distinction because it means they're not a substitute for
a full packet capture solution, but they're invaluable for troubleshooting, security analysis, and
compliance.
Why most teams aren't getting value from Flow Logs: The common failure mode I see is
teams enabling Flow Logs, collecting them somewhere, and never looking at them. Logs without
monitoring are expensive storage. The value of Flow Logs comes from building detection and
alerting logic on top of them.
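To ground the discussion, here is a sketch of enabling Flow Logs with a custom field set; the VPC ID and IAM role ARN are placeholders:

```shell
# Enable VPC Flow Logs to CloudWatch with a custom log format.
# vpc-0abc123 and the role ARN are placeholders.
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0abc123 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name "/aws/vpc/flow-logs/primary" \
  --deliver-logs-permission-arn "arn:aws:iam::123456789012:role/flow-logs-role" \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${flow-direction} ${vpc-id} ${subnet-id} ${tcp-flags} ${pkt-srcaddr} ${pkt-dstaddr}' \
  --region ap-south-1
```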
What the custom log format achieves: The default VPC Flow Log format captures only a subset of available fields. The custom format adds fields like flow-direction (whether traffic is ingress or egress relative to the ENI), vpc-id and subnet-id (so each flow can be attributed to a specific network segment), tcp-flags (useful for detecting port scans and SYN floods), and pkt-srcaddr/pkt-dstaddr (the actual source and destination addresses, which differ from srcaddr/dstaddr when NAT is involved).
# Create a metric filter that detects rejected traffic (potential port scans or unauthorized access attempts)
aws logs put-metric-filter \
  --log-group-name "/aws/vpc/flow-logs/primary" \
  --filter-name "RejectedTraffic" \
  --filter-pattern "[version, account, interface, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action=REJECT, ...]" \
  --metric-transformations metricName=RejectedConnections,metricNamespace=VPCFlowLogs,metricValue=1,unit=Count \
  --region ap-south-1
# Create an alarm if rejected traffic spikes above normal baseline
aws cloudwatch put-metric-alarm \
--alarm-name "HighRejectedTraffic" \
--alarm-description "Spike in rejected VPC traffic - possible port scan or unauthorized access" \
--metric-name RejectedConnections \
--namespace VPCFlowLogs \
--statistic Sum \
--period 300 \
--threshold 1000 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-south-1:123456789012:security-alerts \
--region ap-south-1
What this does: The put-metric-filter command tells CloudWatch to scan incoming Flow Log
entries and, whenever it finds a log entry where the action field equals REJECT, increment a
counter metric named RejectedConnections. The alarm then fires if this counter exceeds 1,000
rejected connections within a 5-minute window across two consecutive evaluation periods (i.e.,
10 minutes sustained). This kind of alert would catch a port scan in real time.
Cost Gotcha: VPC Flow Logs to CloudWatch can get very expensive at scale. A busy VPC with
many ENIs can generate hundreds of gigabytes of logs per day. My recommendation: send Flow
Logs to both CloudWatch (with a 30-day retention and metric filters for alerting) AND S3 with
Parquet format (for longer-term analysis). This gives you real-time detection from CloudWatch
while keeping long-term query costs manageable via Athena on S3.
Encryption is easy to enable. Encryption that actually improves your security posture, rather than merely satisfying an audit checklist, takes deliberate design, starting with the key policy.
A KMS key policy is unlike any other AWS resource policy. The critical difference: the key policy
is the primary access control mechanism for KMS keys. Even an IAM user
with AdministratorAccess cannot use a KMS key unless the key policy explicitly allows it. This is
actually the intended behavior — it means that key access cannot be accidentally granted
through overly permissive IAM policies.
Every KMS key must have at least one statement that grants the AWS account root access, or the
key becomes permanently inaccessible (a situation you cannot recover from). This is the first
thing I check in any KMS key audit.
For a multi-region architecture, you have two options for KMS keys:
1. Separate regional keys — Create independent keys in each region. Data encrypted in
the primary region cannot be decrypted in the secondary region without re-encryption.
2. Multi-Region keys — A single logical key that can be replicated to multiple regions. The
same key material (same key ID prefix mrk-) is available in both regions.
For Aurora Global Database, S3 Cross-Region Replication, and any data that moves between
regions, Multi-Region keys are the practical choice. Using separate regional keys with cross-
region data would require re-encryption at the application layer, which adds complexity and
latency.
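Here is a sketch of the key policy shape being described; the account ID and role names are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RootAccountAccess",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AuroraServiceAccess",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/aurora-encryption-role" },
      "Action": ["kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey", "kms:CreateGrant"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "kms:ViaService": "rds.ap-south-1.amazonaws.com" }
      }
    },
    {
      "Sid": "KeyAdministration",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/kms-key-admin" },
      "Action": [
        "kms:Create*", "kms:Describe*", "kms:Enable*", "kms:Put*",
        "kms:Update*", "kms:Revoke*", "kms:Disable*", "kms:Get*",
        "kms:ScheduleKeyDeletion", "kms:CancelKeyDeletion"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ApplicationKeyUsage",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/ecs-payment-processor-task-role" },
      "Action": ["kms:Decrypt", "kms:GenerateDataKey*"],
      "Resource": "*"
    }
  ]
}
```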
Understanding the key policy structure: The four statements above implement a clear
separation of concerns:
1. Root account access is the safety net — it ensures that if all other principals are
accidentally removed, the account root user can still access the key. Never remove this
statement.
2. Aurora service access uses kms:ViaService to scope the permission to actions taken
through the RDS service, not arbitrary Decrypt calls directly via the KMS API. This means
an attacker who obtains an Aurora role's credentials cannot use them to decrypt arbitrary
data outside of the RDS context.
3. Key administration is restricted to specific named roles, not the
entire Administrator group. Key lifecycle management (creation, deletion, rotation
changes) should have an even more restricted audience than IAM administration.
4. Application key usage grants only Decrypt and GenerateDataKey* — the minimum
permissions needed by applications to read and write encrypted data. No Encrypt (they
use GenerateDataKey* for envelope encryption), no administrative actions.
Permission Boundaries are one of the most misunderstood concepts in AWS IAM. Let me explain
them clearly.
Think of it this way: Your IAM role has a permission boundary that allows s3:*. Even if you then
attach an identity-based policy that grants s3:* plus ec2:*, the EC2 permissions are silently
blocked by the boundary. The effective permissions are the intersection of what the identity-
based policy allows AND what the permission boundary allows.
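A toy shell model of that evaluation rule (this simulates the intersection logic only; it is not an AWS API, and the action names are illustrative):

```shell
# Effective permissions = intersection of the permission boundary
# and the identity-based policy.
boundary="s3:GetObject s3:PutObject"
identity="s3:GetObject s3:PutObject ec2:RunInstances"

# List each action once per policy; an action appearing twice is in both,
# so "uniq -d" yields the intersection.
effective=$(printf '%s\n' $boundary $identity | sort | uniq -d)

echo "$effective"
# ec2:RunInstances does not appear: the boundary silently blocks it.
```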
The architecture uses a developer role vending machine pattern — developers can create IAM
roles for their services, but they cannot create roles more powerful than their own permission
boundary. This prevents privilege escalation attacks where a developer creates a role
with AdministratorAccess and uses it to escape their authorized scope.
# Permission boundary definition - caps what any role created in this account can do
resource "aws_iam_policy" "developer_permission_boundary" {
name = "DeveloperPermissionBoundary"
description = "Maximum permissions allowed for developer-created roles"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowApplicationServices"
Effect = "Allow"
Action = [
# Only allow the specific services the application legitimately uses
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket",
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret",
"kms:Decrypt",
"kms:GenerateDataKey",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"xray:PutTraceSegments",
"xray:PutTelemetryRecords",
"ssmmessages:*",
"ec2messages:*"
]
Resource = "*"
},
{
# Explicit deny: no IAM actions at all through developer-created roles
# This prevents the "create a role and grant yourself admin" escape
hatch
Sid = "DenyIAMExceptPassRole"
Effect = "Deny"
Action = [
"iam:AttachRolePolicy",
"iam:CreateRole",
"iam:DeleteRole",
"iam:DetachRolePolicy",
"iam:PutRolePolicy"
]
Resource = "*"
},
{
# Deny access to organization-level actions and billing
Sid = "DenyOrganizationActions"
Effect = "Deny"
Action = [
"organizations:*",
"account:*",
"billing:*"
]
Resource = "*"
}
]
})
}
# When creating an ECS task role, always attach the permission boundary
resource "aws_iam_role" "ecs_task_role" {
name = "ecs-payment-processor-task-role"
# The boundary caps this role's maximum permissions
permissions_boundary = aws_iam_policy.developer_permission_boundary.arn
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "[Link]"
}
}
]
})
}
# Even if this role later gets an over-permissive policy attached,
# the permission boundary ensures the maximum effective permissions
# never exceed what we defined in DeveloperPermissionBoundary
resource "aws_iam_role_policy_attachment" "ecs_task_policy" {
role = aws_iam_role.ecs_task_role.name
policy_arn = aws_iam_policy.ecs_application_policy.arn
}
Permission boundaries don't grant permissions on their own. They only limit them. This means
you need BOTH a permission boundary AND an identity-based policy for the principal to actually
do anything. I've seen teams attach a permission boundary to a role, then wonder why the role
suddenly can't do anything — it's because the boundary alone grants nothing; it just limits what the identity-based policies can grant.
The first phase was all about building the foundational infrastructure that everything else would
sit on top of. I always tell junior engineers: time spent on the foundation is never wasted. A shaky
foundation means every layer on top is unreliable.
The network architecture is more nuanced than most tutorials suggest. The key principle is: treat
network segmentation as a security control, not just an organizational tool.
The secondary VPC uses its own dedicated /16 range — an important convention. Because we're peering these VPCs via Transit Gateway, their CIDR ranges must not overlap. Using non-overlapping /16 blocks (for example, 10.0.0.0/16 for the primary and 10.1.0.0/16 for the secondary) is a pattern that also accommodates future regional expansion (10.2.0.0/16 for a third region, etc.).
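A quick pre-flight check for this convention, using Python's standard ipaddress module (the CIDR values here are illustrative):

```python
import ipaddress

# Planned regional CIDRs - values illustrative
regions = {
    "primary": ipaddress.ip_network("10.0.0.0/16"),
    "secondary": ipaddress.ip_network("10.1.0.0/16"),
    "third": ipaddress.ip_network("10.2.0.0/16"),
}

def find_overlaps(nets):
    """Return every pair of named networks whose address ranges overlap."""
    names = list(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]

assert find_overlaps(regions) == []  # safe to attach all three to the TGW

# A misconfigured range is caught before it ever reaches Terraform:
regions["fourth"] = ipaddress.ip_network("10.1.128.0/17")  # inside secondary!
print(find_overlaps(regions))  # [('secondary', 'fourth')]
```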
The security baseline was established before any workload resources were created. The three
most critical security foundations were:
# Enable CloudTrail organization trail - captures ALL API calls across all accounts
aws cloudtrail create-trail \
  --name finserv-org-trail \
  --s3-bucket-name finserv-cloudtrail-archive-${LOG_ARCHIVE_ACCOUNT_ID} \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --is-organization-trail \
  --kms-key-id arn:aws:kms:ap-south-1:${SECURITY_ACCOUNT_ID}:key/${CLOUDTRAIL_KEY_ID} \
  --region ap-south-1

# What each flag does:
# --is-multi-region-trail: Captures API events from ALL regions, not just ap-south-1
# --enable-log-file-validation: Creates SHA-256 hash digests so you can verify logs weren't tampered with
# --is-organization-trail: Applies to all accounts in the AWS Organization
# --kms-key-id: Encrypts log files using our CMK (the default is SSE-S3; this is stronger)

# Enable CloudTrail logging
aws cloudtrail start-logging \
  --name finserv-org-trail \
  --region ap-south-1

# Harden the IAM account password policy
# (Note: MFA enforcement itself requires an IAM policy condition, not the password policy)
aws iam update-account-password-policy \
  --minimum-password-length 14 \
  --require-symbols \
  --require-numbers \
  --require-uppercase-characters \
  --require-lowercase-characters \
  --allow-users-to-change-password \
  --max-password-age 90 \
  --password-reuse-prevention 12
This is where the multi-region magic becomes operational. ARC provides a control plane for
managing which region is "active" from a traffic routing perspective, with validation checks to
ensure the target region is actually ready to receive traffic before flipping the switch.
Pro Tip: The ARC cluster endpoint used in the CLI commands above is a Regional endpoint that's
part of ARC's own high-availability design. ARC endpoints are deployed across multiple
Availability Zones in multiple regions specifically so that failover operations remain available even
during regional events. Always store the ARC cluster endpoint in your runbook — don't assume
you can look it up during an actual outage.
Performance Tuning
The initial post-launch metrics were promising but not where we needed them. Average API
response time had dropped from 340ms to 185ms, but the P99 was still at 1.8 seconds — higher
than the 500ms target.
Root cause analysis using X-Ray distributed tracing identified three culprits:
1. Cold start latency in Fargate — New tasks took 45-60 seconds to initialize before
accepting traffic, meaning that scale-out events created temporary capacity gaps
2. Aurora connection pooling — The application was creating a new database connection
per API request rather than using a connection pool
3. Synchronous downstream API calls — Three external payment gateway API calls were
being made sequentially rather than in parallel
For the Aurora connection issue, I implemented RDS Proxy as a connection pooler between the application and Aurora. For the Fargate cold start issue, I added a warm buffer to the ECS service by keeping a small fleet of pre-initialized tasks running.
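The third culprit, the sequential downstream gateway calls, has a straightforward fix: issue the independent calls concurrently so that total latency is bounded by the slowest call rather than the sum. A minimal asyncio sketch, with illustrative gateway names and latencies:

```python
import asyncio
import time

# Stand-ins for the three external gateway calls (names and latencies illustrative)
async def call_gateway(name: str, latency: float) -> str:
    await asyncio.sleep(latency)
    return f"{name}: ok"

async def sequential():
    # Latencies add up: ~0.35s total
    results = []
    for name, lat in [("fraud", 0.12), ("fx", 0.08), ("acquirer", 0.15)]:
        results.append(await call_gateway(name, lat))
    return results

async def parallel():
    # Bounded by the slowest call: ~0.15s total
    return await asyncio.gather(
        call_gateway("fraud", 0.12),
        call_gateway("fx", 0.08),
        call_gateway("acquirer", 0.15),
    )

seq = asyncio.run(sequential())
par = asyncio.run(parallel())
print(seq == par)  # True - identical results, roughly 0.35s vs 0.15s wall time
```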
Before implementing DR, you need agreement on what the targets actually are. I facilitated a
workshop with the FinServ Co. leadership team to define these precisely:
Regional failure (unplanned): RTO < 5 minutes, RPO < 1 second (Aurora replication lag), Priority P0
RTO (Recovery Time Objective) is how quickly you must restore service. RPO (Recovery Point
Objective) is how much data loss is acceptable — specifically, how far back you can tolerate
rolling back to.
For the 99.99% SLA math: with a 5-minute RTO and assuming one regional event per year, a
single incident consumes roughly 10% of the annual downtime budget. That leaves margin for
smaller incidents throughout the year.
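The arithmetic behind these budget figures is worth keeping as a reusable snippet:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_downtime_budget_minutes(sla: float) -> float:
    """Total minutes of downtime per year permitted by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla)

budget_9999 = annual_downtime_budget_minutes(0.9999)
print(f"99.99% budget: {budget_9999:.1f} min/year")        # ~52.6

# One 5-minute regional failover consumes roughly 10% of the annual budget
print(f"5-min incident: {5 / budget_9999:.1%} of budget")  # ~9.5%

# And the 23-minute outage from Chapter 1, against the same target:
print(f"23-min outage: {23 / budget_9999:.1%} of budget")  # ~44%
```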
AWS documents four DR strategies, each with different cost and recovery time implications:
1. Backup and Restore
2. Pilot Light
   • RTO: 30-60 minutes (need to provision and scale up infrastructure from minimal state)
3. Warm Standby
4. Active-Active (Multi-Site)
For FinServ Co., Warm Standby was the right balance. The secondary region runs at
approximately 40% of primary capacity cost — enough to serve traffic immediately upon failover
but without the expense of full active-active.
This runbook was encoded into an SSM Automation document so it can be executed with a
single command. Here's the human-readable version with explanations:
1. Confirm that the primary region is genuinely degraded (not a false alarm)
- Check Global Accelerator health check status
- Check Route 53 ARC cluster routing control states
- Verify with at least two independent monitoring sources
2. Check Aurora Global Database replication lag
- If lag > 30 seconds, assess whether data loss is acceptable
- If lag is within SLA (< 1 second), proceed
# Step 2: Flip the ARC routing controls atomically (disable primary, enable secondary)
# Replace <arc-cluster-endpoint> with the Regional cluster endpoint from your runbook
aws route53-recovery-cluster update-routing-control-states \
  --update-routing-control-state-entries \
  "[{\"RoutingControlArn\":\"PRIMARY_RC_ARN\",\"RoutingControlState\":\"Off\"},
    {\"RoutingControlArn\":\"SECONDARY_RC_ARN\",\"RoutingControlState\":\"On\"}]" \
  --endpoint-url https://<arc-cluster-endpoint>
# Step 3: Scale up ECS services in secondary region
aws ecs update-service \
--cluster finserv-production-secondary \
--service payment-processor \
--desired-count 18 \
--region ap-southeast-1
# Step 4: Promote Aurora secondary to writer
aws rds failover-global-cluster \
  --global-cluster-identifier finserv-global \
  --target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finserv-aurora-secondary \
  --region ap-south-1

# Step 5: Update application configuration to point to secondary Aurora endpoint
# (This is handled automatically if using RDS Proxy + Aurora Global Database,
#  since the application connects to the proxy, which reconnects to the new writer)
What happened: Three weeks after enabling VPC Flow Logs to CloudWatch, the monthly AWS bill came in $2,400 higher than projected. The culprit: CloudWatch Logs ingestion charges for a high-traffic VPC were enormous — we were generating approximately 50GB of flow log data per day.
Solution:
1. Send Flow Logs primarily to S3 with Parquet format (reduces storage costs by ~75% vs standard text format)
2. Keep CloudWatch Logs only for specific subnets (database subnets, management
subnets) where real-time alerting matters most
3. Implement a CloudWatch Logs subscription filter to extract only REJECT records to a
separate smaller log group for metric filters
This reduced the CloudWatch Logs ingestion cost from $750/day to approximately $45/day.
Lesson learned: Always model your VPC Flow Logs volume before enabling them. A rough
estimate: take your peak network throughput in GB/day, assume Flow Log metadata is
approximately 5-10% of actual traffic volume, and calculate accordingly. For high-throughput
environments, S3-first is almost always the right approach.
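That sizing heuristic as a small calculator. The pricing constants below are illustrative only; verify current CloudWatch Logs ingestion and S3 rates for your region:

```python
def estimate_flow_log_gb_per_day(traffic_gb_per_day: float,
                                 metadata_ratio: float = 0.075) -> float:
    """Rough sizing: flow log metadata is roughly 5-10% of traffic volume."""
    return traffic_gb_per_day * metadata_ratio

# Illustrative unit costs - check current AWS pricing before relying on these
CW_INGEST_PER_GB = 0.50   # CloudWatch Logs ingestion, $/GB
S3_DELIVERY_PER_GB = 0.025  # effective S3 delivery cost, $/GB

volume = estimate_flow_log_gb_per_day(700)  # ~700 GB/day of traffic
print(f"~{volume:.1f} GB/day of flow logs")
print(f"CloudWatch: ${volume * CW_INGEST_PER_GB:.2f}/day "
      f"vs S3: ${volume * S3_DELIVERY_PER_GB:.2f}/day")
```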
What happened: During our first DR drill, we successfully promoted the Aurora secondary
cluster to writer in under 90 seconds. However, in-flight database transactions during the
promotion window were not gracefully handled — they received connection errors, and
approximately 0.3% of payment requests during the failover window resulted in error responses
to users.
Root cause: The application was connecting directly to the Aurora cluster writer endpoint. During
promotion, the writer endpoint becomes unavailable for 30-60 seconds while the new writer
initializes. The application's database connection pool didn't have adequate retry logic.
Solution:
1. Implement RDS Proxy between the application and Aurora. RDS Proxy handles the
writer endpoint change transparently and queues requests during the brief promotion
window, dramatically reducing connection errors.
2. Add retry logic to the application with exponential backoff for database operations
(maximum 3 retries with 100ms, 200ms, 400ms backoff).
3. Implement the circuit breaker pattern using the Resilience4j library — if database
errors exceed 50% of requests in a 10-second window, open the circuit and return a
graceful degraded response rather than cascading failures.
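Item 2's backoff schedule (100ms, 200ms, 400ms) can be sketched as a small helper; the flaky_query function below simulates the promotion window:

```python
import time

def with_retries(operation, max_retries=3, base_delay=0.1):
    """Retry with exponential backoff: 100ms, 200ms, 400ms between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # budget exhausted - surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulate the failover window: the first two attempts hit a dead connection
attempts = {"n": 0}
def flaky_query():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("writer endpoint unavailable")
    return "row"

print(with_retries(flaky_query))  # "row", after two backoff sleeps
```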
# Create RDS Proxy target group pointing to Aurora Global Database
aws rds register-db-proxy-targets \
--db-proxy-name finserv-aurora-proxy \
--db-cluster-identifiers finserv-aurora-primary \
--region ap-south-1
# The proxy target group automatically detects the Aurora writer endpoint
# After failover, the proxy reconnects to the new writer endpoint
# Application connection strings remain pointing at the proxy endpoint
throughout
Lesson learned: RDS Proxy isn't optional for production Aurora deployments doing frequent
failovers — it's a necessity. The 8-minute timeout default for connection pinning in RDS Proxy
also needs tuning for transaction-heavy workloads; we reduced it to 120 seconds.
What happened: During the first failover test, the secondary region ECS tasks failed to start
because Secrets Manager replica secrets weren't immediately available. Specifically, a newly
rotated secret in the primary region had not yet replicated to the secondary before we triggered
the failover.
Root cause: Secrets Manager cross-region replication is eventually consistent. The default
rotation and replication cycle meant there could be a 5-10 second window where a newly rotated
secret existed only in the primary region.
Solution: Implement a validation Lambda that runs hourly and verifies that all critical secrets exist and are accessible in the secondary region, alerting before any gap becomes a problem:
import boto3

def lambda_handler(event, context):
    """
    Validates that all critical secrets are replicated and accessible
    in the secondary region. Publishes a CloudWatch metric for alerting.
    """
    primary_sm = boto3.client('secretsmanager', region_name='ap-south-1')
    secondary_sm = boto3.client('secretsmanager', region_name='ap-southeast-1')
    cloudwatch = boto3.client('cloudwatch', region_name='ap-south-1')

    # Enumerate the secrets that must exist in both regions
    # (pagination omitted for brevity; a hardcoded critical list also works)
    critical_secrets = [
        s['Name'] for s in primary_sm.list_secrets()['SecretList']
    ]

    replication_failures = 0
    for secret_name in critical_secrets:
        try:
            value = secondary_sm.get_secret_value(SecretId=secret_name)
            # Also verify the secret value is non-empty (not just that the ARN exists)
            if not value.get('SecretString') and not value.get('SecretBinary'):
                print(f"WARNING: Secret {secret_name} exists but has no value in secondary region")
                replication_failures += 1
        except secondary_sm.exceptions.ResourceNotFoundException:
            print(f"CRITICAL: Secret {secret_name} NOT FOUND in secondary region")
            replication_failures += 1
        except Exception as e:
            print(f"ERROR checking secret {secret_name}: {str(e)}")
            replication_failures += 1

    # Publish the failure count so a CloudWatch alarm can page on > 0
    # (metric namespace illustrative)
    cloudwatch.put_metric_data(
        Namespace='FinServ/DR',
        MetricData=[{'MetricName': 'SecretReplicationFailures',
                     'Value': replication_failures, 'Unit': 'Count'}]
    )

    return {
        'statusCode': 200,
        'checked': len(critical_secrets),
        'failures': replication_failures
    }
Lesson learned: DR validation should be a continuous automated process, not something you
only check during a planned drill. This Lambda runs every hour and would alert the on-call team
within 60 minutes of any replication failure.
What happened: When we scaled the secondary region's ECS tasks from 6 (warm standby) to 18
(production capacity) during a failover drill, we observed a 25-second spike in container startup
latency. The culprit: all 18 tasks were simultaneously pulling their container images from ECR
through the NAT Gateway, saturating its bandwidth.
Solution:
1. ECR VPC Endpoints — Eliminate NAT Gateway for ECR image pulls by routing through a
VPC Endpoint
2. ECR image pre-warming — Run a scheduled task in the secondary region every 6 hours that pulls the latest container image, ensuring the image is already present in the regional registry before a failover scale-out needs it
Lesson learned: VPC Endpoints for ECR should be standard in any production Fargate
deployment. They improve security (traffic stays on AWS backbone), reduce NAT Gateway costs,
and eliminate NAT Gateway bandwidth as a constraint.
Root cause investigation: The canary was running HTTP/1.1 requests against the ALB endpoint.
During certain periods, the ALB was initiating keep-alive connection recycling, which caused the
canary's first request in a new connection to experience a slightly elevated response time that
crossed the canary's timeout threshold.
Solution:
1. Adjust canary timeout threshold from 2 seconds to 5 seconds (matching our P99 performance baseline)
2. Configure the canary to retry once before marking a failure (a single transient failure shouldn't page the on-call team)
After 60 days of production operation on the new architecture, here are the measured outcomes:
• Achieved SLA: 99.991% over the first 60 days (2.6 minutes total downtime across two
minor incidents)
• Incident count: 2 minor incidents vs. 7 in the comparable prior period
• Mean Time to Detect (MTTD): 2.3 minutes (down from 11 minutes with previous
monitoring)
• Mean Time to Resolve (MTTR): 8.7 minutes (down from 47 minutes)
• Successful DR drills: 3/3 (RTO consistently under 4 minutes, well within 5-minute target)
Performance
Cost
• Monthly infrastructure cost: $16,200 (from $12,000 — 35% increase, within the 40%
budget constraint)
• Waste eliminated: $1,200/month in orphaned resources + $600/month in NAT Gateway
savings via VPC Endpoints
• Reserved Instance coverage: 78% of eligible resources committed (target was 70%)
• Projected annual savings from RI/SP commitments: $28,400/year vs. pure On-
Demand
Security
• Critical findings in Security Hub: 0 (down from 47 at initial assessment)
• High findings: 3 (all acknowledged with remediation timelines)
• GuardDuty findings triggering alerts: 2 (both investigated, confirmed benign —
automated responses handled them)
• Days since last IAM credential exposure finding: 60 (with Secrets Manager, this
category of finding effectively disappeared)
Team Productivity
• Mean time to provision new environments: From 3 days to 45 minutes (Terraform
modules)
• On-call incident pages per month: From 23 to 7
• Time spent on infrastructure troubleshooting: Down approximately 60% per engineer
per week
Chapter 9: Key Takeaways
• Aurora Global Database with RDS Proxy was the single highest-impact technical
choice. The combination of sub-second RPO from Aurora Global Database and
transparent failover handling from RDS Proxy resolved our biggest reliability weakness at
a reasonable cost.
• Route 53 ARC with Safety Rules prevented what could have been a catastrophic
operator error during one of our DR drills when a team member accidentally tried to turn
off both region routing controls simultaneously — ARC's safety rule blocked the
operation and fired an alert.
• VPC Endpoints for all applicable services reduced both cost and attack surface simultaneously. It's one of those rare architectural decisions where security and cost optimization pull in the same direction.
• Start with the health check endpoint design. The deep health check (/health/deep) that validates database connectivity should have been the first thing we built. We wasted two weeks troubleshooting load balancer behavior because we were using the wrong health check endpoint.
• Consider AWS Backup with cross-region copy for operational backups. I relied on
Aurora's native backup capabilities, which are excellent for RPO but awkward for point-in-
time restore workflows. AWS Backup provides a centralized backup management
experience that I'd use from the start in future engagements.
• Implement AWS Resilience Hub earlier. AWS Resilience Hub is a service that analyzes your workload and generates a resiliency score against your defined RTO and RPO targets. I discovered it mid-project, and it would have been a useful validation tool during the design phase.
• Never treat 99.99% SLA as a single architectural decision. It's the cumulative effect of
dozens of smaller decisions, each of which eliminates a class of failure.
• DR runbooks that haven't been tested in the last 30 days are documentation, not
runbooks. Run drills relentlessly.
• Always use update-routing-control-states (plural, atomic) rather than
individual update-routing-control-state calls when changing ARC. The atomic version
ensures safety rules are evaluated against the final desired state, not intermediate states.
• KMS Multi-Region Keys are not optional for multi-region architectures with encrypted data. Dealing with cross-region re-encryption at the application layer is a significant engineering burden that multi-region keys eliminate.
• Set up Cost Anomaly Detection in the first week. Cost surprises are much easier to
address when caught in week 2 rather than month 2.
• Don't skip the stakeholder interview phase. The requirement that the architecture
accommodate Southeast Asian expansion (from the business interview) significantly
influenced several technical decisions. That requirement never appeared in any written
documentation.
• Automate everything, document the automation. The goal isn't to write a runbook
that a human reads during an incident. The goal is an SSM Automation document that a
human approves during an incident while the system executes it.
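The point about atomic routing-control updates can be illustrated with a simplified model of an ARC safety rule that requires at least one control to remain On: a sequential update passes through an invalid intermediate state that the atomic update never exposes. The rule logic and all names here are illustrative, not ARC's actual implementation:

```python
def apply_updates(states: dict, updates: dict, atomic: bool) -> dict:
    """Simplified model of an ARC 'at least one control On' safety rule."""
    def check(s):
        if not any(v == "On" for v in s.values()):
            raise RuntimeError("safety rule violated: all routing controls Off")

    if atomic:
        final = {**states, **updates}
        check(final)  # rule evaluated against the FINAL state only
        return final

    for name, value in updates.items():  # one call per control
        states = {**states, name: value}
        check(states)  # rule sees every INTERMEDIATE state
    return states

states = {"primary": "On", "secondary": "Off"}
updates = {"primary": "Off", "secondary": "On"}

print(apply_updates(states, updates, atomic=True))   # succeeds

try:
    apply_updates(states, updates, atomic=False)     # primary goes Off first...
except RuntimeError as e:
    print(e)  # ...leaving zero controls On, so the rule blocks mid-sequence
```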
Chapter 10: Tech Stack Summary
• DNS & Global Traffic: Route 53, Global Accelerator, Route 53 ARC (ARC provides safety rules and atomic routing control changes)
• Access: Systems Manager Session Manager (Zero SSH, full audit trail)
• Governance: AWS Organizations, Control Tower, Config, SCPs (CIS benchmarks enforced via Config Conformance Packs)
• Monitoring: CloudWatch, X-Ray, CloudWatch Synthetics, Container Insights (Centralized dashboard, distributed tracing)
• Logging: CloudTrail (org-wide), VPC Flow Logs (S3 + CW), ALB Access Logs (Log Archive account with immutable S3 bucket)
By the end of the 120-day engagement, every tier of the architecture had been designed around
the assumption of failure. Not "if something fails," but "when something fails." The infrastructure
expects faults and handles them automatically. The operations team gets paged when something
interesting is happening, not when something is already broken.
The measure of a well-designed system isn't that it never fails. It's that when it fails, the failure is
contained, detected quickly, and resolved automatically — ideally before any user notices
anything at all.
This ebook reflects real-world architectural patterns and configurations used in production AWS
environments. Service pricing and specific API parameters are subject to change — always verify
against current AWS documentation before implementing. All account IDs, resource ARNs, and
company names used in examples are illustrative.
Appendix A: Complete Terraform Module Structure
One of the things that keeps multi-region Terraform projects manageable is a clean, consistent
module structure. Here's the full directory layout I used for this project, along with explanations
of what lives where and why.
finserv-infrastructure/
├── environments/
│   ├── production/
│   │   ├── main.tf              # Root module — calls all child modules
│   │   ├── variables.tf         # Input variables for production environment
│   │   ├── outputs.tf           # Exported values (ALB ARNs, Aurora endpoints, etc.)
│   │   ├── terraform.tfvars     # Actual values (committed to repo, no secrets)
│   │   ├── backend.tf           # S3 remote state + DynamoDB locking configuration
│   │   └── providers.tf         # Multi-region provider configuration
│   ├── staging/
│   │   └── ...                  # Same structure, different tfvars values
│   └── development/
│       └── ...
│
├── modules/
│   ├── vpc/
│   │   ├── main.tf              # VPC, subnets, IGW, NAT Gateways, route tables
│   │   ├── variables.tf         # cidr_block, azs, environment, etc.
│   │   ├── outputs.tf           # vpc_id, subnet_ids, nat_gateway_ids
│   │   └── versions.tf
│   ├── transit-gateway/
│   │   ├── main.tf              # TGW, attachments, inter-region peering
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── security-groups/
│   │   ├── main.tf              # All security group definitions
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── vpc-endpoints/
│   │   ├── main.tf              # Interface and Gateway endpoints
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── kms/
│   │   ├── main.tf              # Multi-region keys, replica keys, key aliases
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── iam/
│   │   ├── main.tf              # Roles, policies, permission boundaries
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── aurora-global/
│   │   ├── main.tf              # Global cluster, primary cluster, secondary cluster
│   │   ├── parameter-groups.tf  # Cluster and instance parameter groups
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── rds-proxy/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── elasticache-global/
│   │   ├── main.tf              # Redis Global Datastore
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ecr/
│   │   ├── main.tf              # Repositories with cross-region replication
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── versions.tf
│   ├── ecs/
│   │   ├── main.tf              # ECS cluster with Container Insights
│   │   ├── task-definition.tf   # Task definition template
│   │   ├── service.tf           # ECS service with auto scaling
│   │   ├── alb.tf               # Application Load Balancer + Target Groups
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── global-accelerator/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── route53-arc/
│   │   ├── main.tf              # ARC cluster, control panel, routing controls, safety rules
│   │   ├── health-checks.tf     # Route 53 health checks backed by ARC
│   │   ├── dns-records.tf       # Failover DNS records
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── secrets-manager/
│   │   ├── main.tf              # Secrets with cross-region replication
│   │   ├── rotation.tf          # Rotation Lambda configurations
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ssm-session-manager/
│   │   ├── main.tf              # Session preferences document, IAM role, VPC endpoints
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── monitoring/
│   │   ├── dashboards.tf        # CloudWatch Dashboards
│   │   ├── alarms.tf            # All CloudWatch Alarms
│   │   ├── log-groups.tf        # Log groups with retention and encryption
│   │   ├── metric-filters.tf    # VPC Flow Log metric filters
│   │   ├── synthetics.tf        # CloudWatch Synthetics canaries
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── waf/
│       ├── main.tf              # WAF WebACL with managed rule groups
│       ├── variables.tf
│       └── outputs.tf
│
├── scripts/
│   ├── failover/
│   │   ├── execute-failover.sh  # Orchestrates full regional failover
│   │   ├── preflight-checks.sh  # Pre-flight checks before failover
│   │   └── validate-failover.sh # Post-failover verification
│   ├── cost/
│   │   └── ri-sp-coverage.sh    # Reports RI/SP coverage percentage
│   └── security/
│       └── rotate-secret.sh     # Manual secret rotation trigger
│
└── docs/
    ├── architecture-decisions/  # ADR (Architecture Decision Records)
    ├── runbooks/                # Operational runbooks
    └── diagrams/                # Architecture diagrams (draw.io source files)
# providers.tf
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30"
    }
  }
}

# Primary region provider
provider "aws" {
  alias  = "primary"
  region = var.primary_region # "ap-south-1"
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "finserv"
      Owner       = var.owner_team
    }
  }
}

# Secondary region provider
provider "aws" {
  alias  = "secondary"
  region = var.secondary_region # "ap-southeast-1"
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "finserv"
      Owner       = var.owner_team
      Region      = "secondary"
    }
  }
}

# Global services provider (us-east-1 — required for Route 53, IAM, Global Accelerator)
provider "aws" {
  alias  = "global"
  region = "us-east-1"
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "finserv"
    }
  }
}
Gotcha: default_tags in the provider block applies tags to every resource created by that
provider alias. This is a powerful way to ensure baseline tagging compliance without repeating
tags in every resource block. However, be aware that some AWS resources don't support tags at
all — Terraform will throw an error if default_tags tries to apply to a non-taggable resource.
The workaround is using ignore_tags in the provider configuration for those specific tag keys.
Appendix B: Security Group Reference
Security groups are the last line of defense in your network architecture. Here is the complete
security group matrix for this architecture — what each group allows, what it blocks, and why.
# security-groups/main.tf

# ─────────────────────────────────────────────────────────────────
# ALB Security Group — faces the internet
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "alb" {
  provider    = aws.primary
  name        = "finserv-alb-sg"
  description = "Security group for Application Load Balancer"
  vpc_id      = var.vpc_id

  # HTTPS only from internet — HTTP is redirected to HTTPS at the listener level
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS from internet"
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTP from internet — redirected to HTTPS by ALB listener rule"
  }

  tags = { Name = "finserv-alb-sg" }
}

# ALB can only talk to ECS tasks (enforced by a security group reference, not CIDR).
# Defined as a standalone rule because the ALB and ECS task groups reference each
# other; inline rules in both would create a Terraform dependency cycle.
resource "aws_security_group_rule" "alb_to_ecs_tasks" {
  provider                 = aws.primary
  type                     = "egress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.alb.id
  source_security_group_id = aws_security_group.ecs_tasks.id
  description              = "Forward to ECS tasks on port 8080"
}
# ─────────────────────────────────────────────────────────────────
# ECS Tasks Security Group — application tier
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "ecs_tasks" {
  provider    = aws.primary
  name        = "finserv-ecs-tasks-sg"
  description = "Security group for ECS Fargate tasks"
  vpc_id      = var.vpc_id

  # Only accept traffic from the ALB
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "Traffic from ALB only"
  }

  # Allow all outbound — tasks need to reach Aurora, ElastiCache, Secrets Manager,
  # ECR, CloudWatch, SSM, and potentially external payment gateways
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "All outbound traffic"
  }

  tags = { Name = "finserv-ecs-tasks-sg" }
}
# ─────────────────────────────────────────────────────────────────
# RDS Proxy Security Group
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "rds_proxy" {
  provider    = aws.primary
  name        = "finserv-rds-proxy-sg"
  description = "Security group for RDS Proxy"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
    description     = "MySQL connections from ECS tasks"
  }

  tags = { Name = "finserv-rds-proxy-sg" }
}

# Standalone rule: the proxy and Aurora groups reference each other, so an inline
# egress here plus the inline ingress on the Aurora group would form a dependency cycle
resource "aws_security_group_rule" "rds_proxy_to_aurora" {
  provider                 = aws.primary
  type                     = "egress"
  from_port                = 3306
  to_port                  = 3306
  protocol                 = "tcp"
  security_group_id        = aws_security_group.rds_proxy.id
  source_security_group_id = aws_security_group.aurora.id
  description              = "MySQL connections to Aurora cluster"
}
# ─────────────────────────────────────────────────────────────────
# Aurora Security Group — most restrictive
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "aurora" {
  provider    = aws.primary
  name        = "finserv-aurora-sg"
  description = "Security group for Aurora Global Database cluster"
  vpc_id      = var.vpc_id

  # Only accept connections from the RDS Proxy — NOT directly from ECS tasks
  # This forces all database connections to go through the proxy
  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.rds_proxy.id]
    description     = "MySQL from RDS Proxy only — direct application access blocked"
  }

  # No outbound rules needed — Aurora doesn't initiate outbound connections
  # (Aurora-to-Aurora replication uses internal AWS networking, not these security groups)
  tags = { Name = "finserv-aurora-sg" }
}
# ─────────────────────────────────────────────────────────────────
# ElastiCache Security Group
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "elasticache" {
  provider    = aws.primary
  name        = "finserv-elasticache-sg"
  description = "Security group for ElastiCache Redis"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
    description     = "Redis from ECS tasks"
  }

  tags = { Name = "finserv-elasticache-sg" }
}

# ─────────────────────────────────────────────────────────────────
# VPC Endpoints Security Group
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "vpc_endpoints" {
  provider    = aws.primary
  name        = "finserv-vpc-endpoints-sg"
  description = "Security group for VPC Interface Endpoints"
  vpc_id      = var.vpc_id

  # Accept HTTPS from the entire VPC CIDR — all resources in the VPC
  # should be able to use the endpoints
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
    description = "HTTPS from VPC CIDR for AWS service access"
  }

  tags = { Name = "finserv-vpc-endpoints-sg" }
}
Internet
    │ HTTPS (443)
    ▼
[ALB SG]
    │ Port 8080
    ▼
[ECS Tasks SG]
    │ Port 3306        │ Port 6379         │ Port 443
    ▼                  ▼                   ▼
[RDS Proxy SG]   [ElastiCache SG]   [VPC Endpoints SG]
    │ Port 3306        │                   │
    ▼                  ▼                   ▼
[Aurora SG]      Redis Cluster      AWS Services
                                    (SSM, ECR, etc.)
The key design principle here: no layer can be reached except by the layer immediately
above it. ECS tasks cannot talk directly to Aurora — all connections must pass through RDS
Proxy. This layered security model means that even if an attacker compromises an ECS task, they
face an additional barrier before reaching the database.
Appendix C: AWS Config Rules Reference
AWS Config continuously evaluates your resource configurations against rules you define. Here
are the Config rules I implemented, with explanations of what each one checks and why it
matters.
# config-rules/main.tf

# ─────────────────────────────────────────────────────────────────
# Managed Rules (AWS-maintained rule logic)
# ─────────────────────────────────────────────────────────────────

# Ensure MFA is enabled for the root account
resource "aws_config_config_rule" "root_mfa_enabled" {
  name        = "root-account-mfa-enabled"
  description = "Checks whether the root user of your AWS account requires MFA"

  source {
    owner             = "AWS"
    source_identifier = "ROOT_ACCOUNT_MFA_ENABLED"
    # This is a periodic rule — evaluated every 24 hours, not triggered by events
  }
}
# Ensure CloudTrail is enabled in all regions
resource "aws_config_config_rule" "cloudtrail_enabled" {
name = "cloudtrail-enabled"
description = "Checks whether CloudTrail is enabled and logging API calls"
source {
owner = "AWS"
source_identifier = "CLOUD_TRAIL_ENABLED"
}
input_parameters = jsonencode({
s3BucketName = "finserv-cloudtrail-archive"
# Ensures CloudTrail is not just enabled, but logging to the correct bucket
})
}
# Ensure no security groups allow unrestricted SSH inbound
resource "aws_config_config_rule" "no_unrestricted_ssh" {
name = "restricted-ssh"
description = "Checks that no security groups allow unrestricted inbound SSH (port 22)"
source {
owner = "AWS"
source_identifier = "INCOMING_SSH_DISABLED"
}
# This rule triggers whenever a security group is created or modified
}
# Ensure EBS volumes are encrypted
resource "aws_config_config_rule" "ebs_encryption" {
name = "encrypted-volumes"
description = "Checks that attached EBS volumes are encrypted"
source {
owner = "AWS"
source_identifier = "ENCRYPTED_VOLUMES"
}
input_parameters = jsonencode({
kmsId = aws_kms_key.primary.arn
# Optionally enforce use of a specific CMK, not just any encryption
})
}
# Ensure RDS instances are encrypted
resource "aws_config_config_rule" "rds_encrypted" {
name = "rds-storage-encrypted"
source {
owner = "AWS"
source_identifier = "RDS_STORAGE_ENCRYPTED"
}
}
# Ensure RDS instances have deletion protection
resource "aws_config_config_rule" "rds_deletion_protection" {
name = "rds-instance-deletion-protection-enabled"
source {
owner = "AWS"
source_identifier = "RDS_INSTANCE_DELETION_PROTECTION_ENABLED"
}
}
# Ensure Multi-AZ is enabled for RDS
resource "aws_config_config_rule" "rds_multi_az" {
name = "rds-multi-az-support"
source {
owner = "AWS"
source_identifier = "RDS_MULTI_AZ_SUPPORT"
}
}
# Ensure S3 buckets have server-side encryption enabled
resource "aws_config_config_rule" "s3_bucket_encryption" {
name = "s3-bucket-server-side-encryption-enabled"
source {
owner = "AWS"
source_identifier = "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"
}
}
# Ensure S3 buckets block public access
resource "aws_config_config_rule" "s3_public_access_block" {
name = "s3-account-level-public-access-blocks-periodic"
source {
owner = "AWS"
source_identifier = "S3_ACCOUNT_LEVEL_PUBLIC_ACCESS_BLOCKS_PERIODIC"
}
}
# Ensure VPC Flow Logs are enabled
resource "aws_config_config_rule" "vpc_flow_logs" {
name = "vpc-flow-logs-enabled"
description = "Checks whether VPC Flow Logs are enabled for each VPC"
source {
owner = "AWS"
source_identifier = "VPC_FLOW_LOGS_ENABLED"
}
input_parameters = jsonencode({
trafficType = "ALL"
# Ensure ALL traffic is captured, not just REJECT or ACCEPT
})
}
# Ensure GuardDuty is enabled
resource "aws_config_config_rule" "guardduty_enabled" {
name = "guardduty-enabled-centralized"
source {
owner = "AWS"
source_identifier = "GUARDDUTY_ENABLED_CENTRALIZED"
}
}
# ─────────────────────────────────────────────────────────────────
# Custom Rules (Lambda-backed — enforce custom business logic)
# ─────────────────────────────────────────────────────────────────
# Custom rule: Enforce mandatory tags on all taggable resources
resource "aws_config_config_rule" "mandatory_tags" {
name = "mandatory-tags-enforcement"
description = "Ensures all resources have required tags: Environment, Owner, ManagedBy"
source {
owner = "CUSTOM_LAMBDA"
source_identifier = aws_lambda_function.config_mandatory_tags.arn
source_detail {
event_source = "aws.config"
message_type = "ConfigurationItemChangeNotification"
# Triggers on every resource configuration change — catches untagged
# resources within seconds of creation
}
}
scope {
compliance_resource_types = [
"AWS::EC2::Instance",
"AWS::RDS::DBInstance",
"AWS::RDS::DBCluster",
"AWS::ECS::Service",
"AWS::S3::Bucket",
"AWS::ElastiCache::ReplicationGroup"
]
}
depends_on = [aws_config_configuration_recorder.primary]
}
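The Lambda behind this rule is not reproduced in the appendix, but its core evaluation logic is simple enough to sketch. The following is an illustrative implementation of the tag check only (not the actual function); a real Config rule handler would wrap this in code that parses the ConfigurationItem from the invoking event and reports the verdict back with the `config:PutEvaluations` API:

```python
REQUIRED_TAGS = {"Environment", "Owner", "ManagedBy"}

def evaluate_tags(resource_tags: dict) -> tuple[str, str]:
    """Return a (compliance_type, annotation) pair for one resource."""
    missing = sorted(REQUIRED_TAGS - set(resource_tags))
    if missing:
        return "NON_COMPLIANT", f"Missing required tags: {', '.join(missing)}"
    return "COMPLIANT", "All mandatory tags present"

# Example: a resource tagged by a pipeline that forgot ManagedBy
status, note = evaluate_tags({"Environment": "production", "Owner": "payments-team"})
print(status, "-", note)  # NON_COMPLIANT - Missing required tags: ManagedBy
```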
For certain rules, rather than just alerting on violations, I configured automatic remediation using SSM Automation documents.
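The remediation wiring itself isn't reproduced above; as a sketch of its shape, here is what it looks like for the restricted-ssh rule, using the AWS-managed `AWS-DisablePublicAccessForSecurityGroup` automation document (the IAM role name and retry values below are illustrative, not the platform's actual values):

```hcl
resource "aws_config_remediation_configuration" "restricted_ssh" {
  config_rule_name = aws_config_config_rule.no_unrestricted_ssh.name
  resource_type    = "AWS::EC2::SecurityGroup"
  target_type      = "SSM_DOCUMENT"
  target_id        = "AWS-DisablePublicAccessForSecurityGroup"

  # The offending security group ID flows through from the Config evaluation
  parameter {
    name           = "GroupId"
    resource_value = "RESOURCE_ID"
  }
  parameter {
    name         = "AutomationAssumeRole"
    static_value = aws_iam_role.config_remediation.arn # illustrative role name
  }

  automatic                  = true
  maximum_automatic_attempts = 3
  retry_attempt_seconds      = 60
}
```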
Appendix D: CloudWatch Alarms Reference
This appendix lists every CloudWatch alarm configured in the architecture, organized by category. For each alarm, the threshold, period, and action are specified along with a plain-English explanation of what it detects.
# monitoring/alarms.tf
locals {
# SNS topic ARNs for alert routing
critical_topic = "arn:aws:sns:ap-south-1:123456789012:finserv-critical-alerts"
warning_topic = "arn:aws:sns:ap-south-1:123456789012:finserv-warning-alerts"
info_topic = "arn:aws:sns:ap-south-1:123456789012:finserv-info-alerts"
}
# ─────────────────────────────────────────────────────────────────
# APPLICATION TIER ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "alb_5xx_rate" {
alarm_name = "ALB-5XX-ErrorRate-High"
alarm_description = "ALB 5XX error rate exceeds 1% — indicates application or infrastructure errors"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
threshold = 1
treat_missing_data = "notBreaching"
metric_query {
id = "error_rate"
expression = "errors / requests * 100"
label = "5XX Error Rate %"
return_data = true
}
metric_query {
id = "errors"
metric {
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_Target_5XX_Count"
dimensions = { LoadBalancer = aws_lb.primary.arn_suffix }
period = 60
stat = "Sum"
}
}
metric_query {
id = "requests"
metric {
namespace = "AWS/ApplicationELB"
metric_name = "RequestCount"
dimensions = { LoadBalancer = aws_lb.primary.arn_suffix }
period = 60
stat = "Sum"
}
}
alarm_actions = [local.critical_topic]
ok_actions = [local.info_topic]
}
resource "aws_cloudwatch_metric_alarm" "alb_target_response_time_p99" {
alarm_name = "ALB-P99-ResponseTime-High"
alarm_description = "P99 target response time exceeds 2 seconds — degraded user experience"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "TargetResponseTime"
namespace = "AWS/ApplicationELB"
period = 60
# Note: percentile statistics like p99 require "extended_statistic", not "statistic"
extended_statistic = "p99"
threshold = 2
treat_missing_data = "notBreaching"
dimensions = { LoadBalancer = aws_lb.primary.arn_suffix }
alarm_actions = [local.warning_topic]
}
resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_hosts" {
alarm_name = "ALB-UnhealthyHosts-NonZero"
alarm_description = "One or more ECS tasks are failing ALB health checks — potential task crash or startup failure"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "UnHealthyHostCount"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Maximum"
threshold = 0
treat_missing_data = "notBreaching"
dimensions = {
LoadBalancer = aws_lb.primary.arn_suffix
TargetGroup = aws_lb_target_group.payment_processor.arn_suffix
}
alarm_actions = [local.critical_topic]
}
# ─────────────────────────────────────────────────────────────────
# ECS FARGATE ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
alarm_name = "ECS-PaymentProcessor-CPU-High"
alarm_description = "ECS service CPU utilization exceeds 80% — auto scaling should be triggered but verify"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 60
statistic = "Average"
threshold = 80
dimensions = {
ClusterName = aws_ecs_cluster.primary.name
ServiceName = aws_ecs_service.payment_processor.name
}
alarm_actions = [local.warning_topic]
}
resource "aws_cloudwatch_metric_alarm" "ecs_memory_high" {
alarm_name = "ECS-PaymentProcessor-Memory-High"
alarm_description = "ECS service memory utilization exceeds 85% — risk of OOM task failures"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "MemoryUtilization"
namespace = "AWS/ECS"
period = 60
statistic = "Average"
threshold = 85
dimensions = {
ClusterName = aws_ecs_cluster.primary.name
ServiceName = aws_ecs_service.payment_processor.name
}
alarm_actions = [local.warning_topic]
}
resource "aws_cloudwatch_metric_alarm" "ecs_running_task_count_low" {
alarm_name = "ECS-PaymentProcessor-RunningTasks-BelowMinimum"
alarm_description = "Running task count dropped below minimum desired count — possible deployment failure or task crash loop"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "RunningTaskCount"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Minimum"
threshold = 6 # Our minimum desired count
treat_missing_data = "breaching"
# treat_missing_data = "breaching": if the metric stops reporting (e.g., a
# Container Insights issue), assume the worst and fire the alarm
dimensions = {
ClusterName = aws_ecs_cluster.primary.name
ServiceName = aws_ecs_service.payment_processor.name
}
alarm_actions = [local.critical_topic]
}
# ─────────────────────────────────────────────────────────────────
# DATABASE ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "aurora_cpu_high" {
alarm_name = "Aurora-Primary-CPU-High"
alarm_description = "Aurora writer instance CPU exceeds 70% — investigate slow queries"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = 70
dimensions = { DBClusterIdentifier = aws_rds_cluster.primary.cluster_identifier }
alarm_actions = [local.warning_topic]
}
resource "aws_cloudwatch_metric_alarm" "aurora_freeable_memory" {
alarm_name = "Aurora-Primary-FreeableMemory-Low"
alarm_description = "Aurora freeable memory below 1GB — buffer pool pressure, possible swapping"
comparison_operator = "LessThanThreshold"
evaluation_periods = 3
metric_name = "FreeableMemory"
namespace = "AWS/RDS"
period = 60
statistic = "Average"
threshold = 1073741824 # 1 GB in bytes
dimensions = { DBClusterIdentifier = aws_rds_cluster.primary.cluster_identifier }
alarm_actions = [local.warning_topic]
}
resource "aws_cloudwatch_metric_alarm" "aurora_replication_lag" {
alarm_name = "Aurora-GlobalDB-ReplicationLag-High"
alarm_description = "Aurora Global Database replication lag exceeds 5 seconds — RPO at risk"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "AuroraGlobalDBReplicationLag"
namespace = "AWS/RDS"
period = 60
statistic = "Maximum"
threshold = 5000 # 5 seconds in milliseconds
treat_missing_data = "breaching"
dimensions = { DBClusterIdentifier = aws_rds_cluster.secondary.cluster_identifier }
alarm_actions = [local.critical_topic]
}
resource "aws_cloudwatch_metric_alarm" "rds_proxy_client_connections" {
alarm_name = "RDSProxy-ClientConnections-High"
alarm_description = "RDS Proxy client connection count approaching limit — connection pool saturation risk"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ClientConnections"
namespace = "AWS/RDS"
period = 60
statistic = "Maximum"
threshold = 800 # Alert at 80% of the 1000 max_connections limit
dimensions = { ProxyName = aws_db_proxy.primary.name }
alarm_actions = [local.warning_topic]
}
# ─────────────────────────────────────────────────────────────────
# DISASTER RECOVERY ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "secrets_replication_failure" {
alarm_name = "DR-SecretsManager-ReplicationFailure"
alarm_description = "Secrets Manager secrets not fully replicated to secondary region — DR readiness at risk"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "SecretsReplicationFailures"
namespace = "FinServ/DR"
period = 3600 # Checked hourly by validation Lambda
statistic = "Maximum"
threshold = 0
treat_missing_data = "breaching"
alarm_actions = [local.critical_topic]
}
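The validation Lambda that feeds this custom metric is not reproduced here, but its core check is easy to sketch. This illustrative function (not the actual Lambda) counts failed replicas from the response shape that the appendix's own `list-secrets` query assumes; the real function would then publish the count into the FinServ/DR namespace via `cloudwatch:PutMetricData`:

```python
def count_replication_failures(secret_list: list[dict]) -> int:
    """Count secret replicas whose replication status is not 'InSync'.

    Each entry in `secret_list` may carry a ReplicationStatus list with
    one item per target region, whose Status is InSync / InProgress / Failed.
    """
    failures = 0
    for secret in secret_list:
        for replica in secret.get("ReplicationStatus", []):
            if replica.get("Status") != "InSync":
                failures += 1
    return failures

# Example: one healthy replica, one failed replica (names are illustrative)
secrets = [
    {"Name": "finserv/aurora/password",
     "ReplicationStatus": [{"Region": "ap-southeast-1", "Status": "InSync"}]},
    {"Name": "finserv/api/signing-key",
     "ReplicationStatus": [{"Region": "ap-southeast-1", "Status": "Failed"}]},
]
print(count_replication_failures(secrets))  # prints 1
```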
resource "aws_cloudwatch_metric_alarm" "secondary_region_task_count" {
provider = aws.secondary
alarm_name = "DR-Secondary-ECS-Tasks-Down"
alarm_description = "Secondary region ECS tasks not running — warm standby compromised"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "RunningTaskCount"
namespace = "ECS/ContainerInsights"
period = 300
statistic = "Minimum"
threshold = 2 # Minimum warm standby tasks expected
treat_missing_data = "breaching"
dimensions = {
ClusterName = aws_ecs_cluster.secondary.name
ServiceName = "payment-processor"
}
alarm_actions = [local.critical_topic]
}
Appendix E: Useful AWS CLI One-Liners for Operations
This appendix is a quick-reference collection of CLI commands you'll reach for repeatedly in day-
to-day operations of a multi-region platform. I keep this list pinned in the team's internal wiki.
# ─────────────────────────────────────────────────────────────────
# REGION STATUS AND FAILOVER READINESS
# ─────────────────────────────────────────────────────────────────
# Check Aurora Global Database secondary cluster sync status
# (replication lag itself is the CloudWatch metric AuroraGlobalDBReplicationLag)
aws rds describe-global-clusters \
--global-cluster-identifier finserv-global \
--query 'GlobalClusters[0].GlobalClusterMembers[?IsWriter==`false`].{Cluster:DBClusterArn,Status:SynchronizationStatus}' \
--region ap-south-1
# Get current Aurora Global Database status and member details
aws rds describe-global-clusters \
--global-cluster-identifier finserv-global \
--region ap-south-1 \
--output table
# Check Route 53 ARC routing control states (which regions are active)
aws route53-recovery-control-config list-routing-controls \
--control-panel-arn arn:aws:route53-recovery-control::123456789012:controlpanel/PANEL_ID \
--endpoint-url https://<arc-cluster-endpoint>
# ─────────────────────────────────────────────────────────────────
# ECS OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all running tasks in a service with their AZ and IP
aws ecs list-tasks \
--cluster finserv-production-primary \
--service-name payment-processor \
--region ap-south-1 \
--query 'taskArns' \
--output text | xargs aws ecs describe-tasks \
--cluster finserv-production-primary \
--region ap-south-1 \
--query 'tasks[*].{TaskID:taskArn,AZ:availabilityZone,Status:lastStatus,IP:attachments[0].details[?name==`privateIPv4Address`].value|[0]}' \
--output table \
--tasks
# (--tasks goes last so xargs appends the task ARNs to it)
# Force a new deployment (rolling update with zero downtime)
aws ecs update-service \
--cluster finserv-production-primary \
--service payment-processor \
--force-new-deployment \
--region ap-south-1
# Watch ECS service events in real time (useful during deployments)
watch -n 5 'aws ecs describe-services \
--cluster finserv-production-primary \
--services payment-processor \
--region ap-south-1 \
--query "services[0].events[:5]" \
--output table'
# Scale ECS service manually (useful during failover to secondary)
aws ecs update-service \
--cluster finserv-production-secondary \
--service payment-processor \
--desired-count 18 \
--region ap-southeast-1
# ─────────────────────────────────────────────────────────────────
# AURORA OPERATIONS
# ─────────────────────────────────────────────────────────────────
# Check Aurora cluster endpoints (writer and reader)
aws rds describe-db-clusters \
--db-cluster-identifier finserv-aurora-primary \
--query 'DBClusters[0].{Writer:Endpoint,Reader:ReaderEndpoint,Status:Status,Members:DBClusterMembers[*].{ID:DBInstanceIdentifier,Writer:IsClusterWriter,Status:DBInstanceStatus}}' \
--region ap-south-1 \
--output table
# Check RDS Proxy target health
aws rds describe-db-proxy-targets \
--db-proxy-name finserv-aurora-proxy \
--region ap-south-1 \
--query 'Targets[*].{Endpoint:Endpoint,Port:Port,State:TargetHealth.State,Description:TargetHealth.Description}' \
--output table
# Perform a planned Aurora Global Database failover (zero data loss)
aws rds failover-global-cluster \
--global-cluster-identifier finserv-global \
--target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finserv-aurora-secondary \
--region ap-south-1
# Monitor the failover progress
aws rds describe-global-clusters \
--global-cluster-identifier finserv-global \
--query 'GlobalClusters[0].{Status:Status,Members:GlobalClusterMembers[*].{Cluster:DBClusterArn,Writer:IsWriter,Sync:SynchronizationStatus}}' \
--region ap-south-1
# ─────────────────────────────────────────────────────────────────
# SECRETS MANAGER OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all secrets and their replication status
aws secretsmanager list-secrets \
--region ap-south-1 \
--query 'SecretList[*].{Name:Name,ReplicationStatus:ReplicationStatus[*].{Region:Region,Status:Status}}' \
--output table
# Manually rotate a secret immediately (triggers the rotation Lambda)
aws secretsmanager rotate-secret \
--secret-id finserv/aurora/password \
--rotate-immediately \
--region ap-south-1
# Verify a secret value can be retrieved in the secondary region
aws secretsmanager get-secret-value \
--secret-id finserv/aurora/password \
--region ap-southeast-1 \
--query '{Name:Name,VersionId:VersionId,LastRotated:CreatedDate}' \
--output table
# ─────────────────────────────────────────────────────────────────
# COST AND RESOURCE HYGIENE
# ─────────────────────────────────────────────────────────────────
# Find all untagged EC2 instances (missing the mandatory "Owner" tag)
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" \
--query 'Reservations[*].Instances[?!not_null(Tags[?Key==`Owner`].Value|[0])].[InstanceId,LaunchTime,InstanceType]' \
--region ap-south-1 \
--output table
# Find EC2 instances not accessed via SSM in the last 30 days
# (potential zombie instances worth investigating)
aws ssm describe-instance-information \
--query 'InstanceInformationList[*].{ID:InstanceId,PingStatus:PingStatus,LastPing:LastPingDateTime,PlatformType:PlatformType}' \
--region ap-south-1 \
--output table
# List all EBS snapshots older than 90 days (candidates for cleanup)
aws ec2 describe-snapshots \
--owner-ids self \
--query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601)'][*].{SnapshotId:SnapshotId,Size:VolumeSize,StartTime:StartTime,Description:Description}" \
--region ap-south-1 \
--output table
# Check Reserved Instance utilization percentage
aws ce get-reservation-utilization \
--time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity MONTHLY \
--query 'UtilizationsByTime[*].Total.{UtilizationPercentage:UtilizationPercentage,PurchasedHours:PurchasedHours,UsedHours:UsedHours}' \
--region us-east-1
# ─────────────────────────────────────────────────────────────────
# SECURITY OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all IAM roles WITHOUT permission boundaries (security risk)
aws iam list-roles \
--query 'Roles[?!PermissionsBoundary].{RoleName:RoleName,CreatedDate:CreateDate,Path:Path}' \
--output table
# Check GuardDuty findings in the last 24 hours
aws guardduty list-findings \
--detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text --region ap-south-1) \
--finding-criteria '{"Criterion":{"updatedAt":{"Gte":'$(date -d '24 hours ago' +%s000)'},"severity":{"Gte":4}}}' \
--region ap-south-1
# Get details of the most recent GuardDuty findings
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text --region ap-south-1)
FINDING_IDS=$(aws guardduty list-findings --detector-id $DETECTOR_ID --region ap-south-1 --query 'FindingIds[:5]' --output text)
aws guardduty get-findings \
--detector-id $DETECTOR_ID \
--finding-ids $FINDING_IDS \
--query 'Findings[*].{Type:Type,Severity:Severity,Title:Title,Region:Region,UpdatedAt:UpdatedAt}' \
--region ap-south-1 \
--output table
# ─────────────────────────────────────────────────────────────────
# SESSION MANAGER OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all managed instances available for Session Manager access
aws ssm describe-instance-information \
--filters "Key=PingStatus,Values=Online" \
--query 'InstanceInformationList[*].{ID:InstanceId,Name:ComputerName,OS:PlatformType,Version:PlatformVersion,AgentVersion:AgentVersion}' \
--region ap-south-1 \
--output table
# Start a Session Manager session (interactive shell)
aws ssm start-session \
--target i-0abc123def456789 \
--region ap-south-1
# Run a non-interactive command on a remote instance via SSM Run Command
aws ssm send-command \
--instance-ids i-0abc123def456789 \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["df -h","free -m","uptime"]' \
--query 'Command.CommandId' \
--output text \
--region ap-south-1 \
| xargs -I {} aws ssm get-command-invocation \
--command-id {} \
--instance-id i-0abc123def456789 \
--query 'StandardOutputContent' \
--output text \
--region ap-south-1
Appendix F: Disaster Recovery Testing Checklist
Regular DR testing is what separates theoretical availability from actual availability. Below is the
checklist I use for quarterly DR drills. The goal is to execute this checklist under simulated
pressure conditions — timers running, communication channels active — not as a leisurely
walkthrough.
Post-Drill Activities
Scenario: Aurora replication lag exceeds 30 seconds at drill start
Action: Postpone the drill. Investigate replication health. Common causes are high write volume on the primary, network congestion on the TGW inter-region peering link, or an Aurora maintenance event in progress. Do NOT proceed with failover while lag is high unless it's a genuine emergency — you would be accepting data loss that you don't need to.
Scenario: Secondary ECS tasks fail to scale up within 3 minutes
Action: Check ECR image pull success (are the VPC Endpoints healthy?), check Fargate task launch failures in ECS Events, and check Secrets Manager access from the secondary region. Common cause: Fargate capacity constraints in the secondary region during a real regional event, when everyone is failing over simultaneously. Consider running a larger warm-standby task count so less emergency scale-up is needed.
Scenario: ARC routing control change returns an error
Action: Verify you're using the correct ARC cluster endpoint. ARC endpoints are regional — if your primary region is unavailable, switch to the secondary region's ARC endpoint. The endpoints are distributed specifically so that at least one remains available even during a regional event.
Scenario: Synthetic canary reports success but real users report errors
Action: Remember that CloudWatch Synthetics canaries run from within AWS infrastructure, not from outside the AWS network, so they can miss internet-path problems. If the issue affects specific geographies (e.g., India-based users only), it may be a Global Accelerator routing issue rather than an application issue. Check Global Accelerator flow logs for client-specific routing decisions.
Appendix G: Glossary of Key Terms
This glossary is intentionally written for readers who may encounter these terms for the first time.
Skip to what you need.
Active-Active Architecture: A design where all regions simultaneously handle live production
traffic. Writes and reads can happen in any region. Requires careful data consistency
management but delivers the lowest possible RTO.
Active-Passive Architecture: A design where one region handles all production traffic (active) while another region maintains a standby environment (passive). Traffic only shifts to the passive region during a failover.
Anycast IP Address: A single IP address that routes to the "nearest" server based on network
topology. AWS Global Accelerator provides anycast IP addresses, meaning users in Mumbai and
Singapore both resolve the same IP address but reach the nearest AWS edge location.
Aurora Global Database: An Amazon Aurora feature that replicates data from a primary region
cluster to one or more secondary region clusters with sub-second replication lag. The secondary
clusters are read-only until promoted during a failover.
Canary Deployment: A deployment strategy where a small percentage of traffic is routed to the
new version before rolling it out to all users. Named after the "canary in a coal mine" concept —
if the canary (new version) fails, you know before the full deployment.
Circuit Breaker Pattern: A software design pattern that detects repeated failures and "opens" a circuit to stop additional calls to the failing component, allowing it time to recover. Prevents cascading failures across dependent services.
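As a concrete illustration, the pattern reduces to a failure counter plus a cool-down timer. This is a deliberately minimal sketch, not production code and not tied to this platform's codebase:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive failures,
    rejects calls while open, and allows one trial call (half-open) once
    `reset_timeout` seconds have elapsed."""

    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: half-open, let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0      # success resets the count
        self.opened_at = None  # and closes the circuit
        return result
```

A typical use is wrapping calls to a dependency that may be degraded during a regional event, so a struggling downstream service isn't hammered by retries.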
Conformance Pack: A collection of AWS Config rules deployed together as a package. AWS
provides pre-built conformance packs for standards like CIS Benchmarks, PCI-DSS, and HIPAA.
Customer Managed Key (CMK): An AWS KMS key that you create and control, as opposed to
AWS-managed keys or AWS-owned keys. CMKs give you control over key policies, rotation
schedules, and the ability to audit key usage in CloudTrail.
Elastic Network Interface (ENI): A virtual network interface that can be attached to an EC2
instance or ECS Fargate task. In awsvpc networking mode, each ECS task gets its own ENI and its
own IP address.
Envelope Encryption: A technique where data is encrypted with a data key (DEK), and the data
key itself is encrypted with a master key (CMK). AWS services use envelope encryption — you
never directly encrypt data with the CMK itself.
Fargate: An AWS compute engine for containers that eliminates the need to manage EC2
instances (nodes). You specify CPU and memory requirements for your container tasks, and AWS
handles the underlying compute.
Global Accelerator: An AWS networking service that routes user traffic through AWS's private
global network backbone rather than the public internet, reducing latency and providing sub-30-
second automatic failover between endpoints.
GuardDuty: AWS's managed threat detection service. It continuously analyzes CloudTrail logs,
VPC Flow Logs, and DNS logs using machine learning to identify suspicious activity.
IAM Permission Boundary: A maximum permissions policy attached to an IAM role or user that limits what other policies can grant. Even if an identity-based policy grants an action, the boundary must also allow it, or the action is denied.
Multi-AZ: A configuration where a resource (like an RDS database or ALB) is deployed across
multiple Availability Zones within a single AWS Region. Protects against AZ-level failures but not
regional failures.
Multi-Region: A configuration where infrastructure is deployed across multiple AWS Regions.
Protects against regional failures. More complex and expensive than Multi-AZ.
NAT Gateway: A managed AWS service that allows resources in private subnets to initiate
outbound connections to the internet while blocking unsolicited inbound connections. Data
processing charges ($0.045/GB in ap-south-1) make it important to minimize unnecessary traffic
through NAT.
RDS Proxy: A fully managed database proxy for RDS and Aurora that pools and shares database
connections, improving application scalability and resilience during failovers.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in
time. An RPO of 1 second means you can tolerate losing at most 1 second of data. Lower RPO =
more expensive replication.
RTO (Recovery Time Objective): The maximum acceptable time to restore service after a failure.
An RTO of 5 minutes means you must restore service within 5 minutes of a failure being
detected. Lower RTO = more expensive standby infrastructure.
Route 53 ARC (Application Recovery Controller): An AWS service that provides centralized
control and safety checks for multi-region failover operations, including routing controls that act
as on/off switches for regional traffic.
Savings Plan: An AWS pricing model where you commit to a consistent amount of compute
usage (in $/hour) for 1 or 3 years in exchange for discounts of up to 66% vs. On-Demand pricing.
Service Control Policy (SCP): An AWS Organizations policy that sets permission guardrails for accounts within an organizational unit or the entire organization. SCPs act as a veto — even an account's administrators cannot perform actions an SCP denies.
Transit Gateway (TGW): An AWS networking service that acts as a central hub for connecting
multiple VPCs and on-premises networks. Supports inter-region peering for connecting VPCs
across AWS regions.
VPC Endpoint: A networking component that allows traffic from within your VPC to reach AWS
services (like S3, SSM, or ECR) over AWS's private network rather than through the public internet
or NAT Gateway.
VPC Flow Logs: A feature that captures metadata about IP traffic flowing through network interfaces in your VPC. Captured data includes source/destination IPs, ports, protocols, and accept/reject decisions.
WAF (Web Application Firewall): AWS WAF examines HTTP/HTTPS requests to your ALB or
CloudFront distribution and can block requests matching patterns associated with common
attacks (SQL injection, XSS, bad bots, etc.).
Warm Standby: A DR strategy where a scaled-down but functional version of the production
environment runs in the standby region. Upon failover, the standby scales up to handle full
production load. Balances cost against recovery time.
Appendix H: Architecture Decision Records (ADRs)
Context: The existing application used MySQL with stored procedures and MySQL-specific
functions. Migrating to Aurora PostgreSQL would require significant application changes.
Decision: Use Aurora MySQL 8.0 (Aurora 3.x) with Global Database.
Consequences:
Consequences:
Review Trigger: Reconsider Active-Active if ap-southeast-1 traffic exceeds 40% of total traffic
volume.
Consequences:
Review Trigger: Reconsider if the engineering team size grows beyond 20 and multiple distinct
services require independent deployment pipelines.
Context: The existing setup used bastion hosts with SSH key-based access.
Decision: Replace all SSH and bastion host access with AWS Systems Manager Session Manager.
Consequences:
• Positive: Eliminates port 22 from all security groups (reduced attack surface)
• Positive: SSM port forwarding still lets engineers connect to private resources (e.g., database GUIs connecting to Aurora through an SSM tunnel)
Appendix I: Recommended AWS Documentation and
Further Reading
The following official AWS resources are the most valuable references for the topics covered in
this book. I refer to these regularly and recommend bookmarking all of them.
Security:
Networking:
Operations:
These appendices serve as a living reference companion to the main ebook. As AWS releases new
services and features, specific CLI parameters and service configurations will evolve — always
validate against the current AWS documentation before implementation. The architectural patterns
and design principles, however, tend to remain stable even as the tooling changes.
Appendix J: SLA Validation Summary
Platform: FinServ Co. Multi-Region Payment Processing Platform
Primary Region: ap-south-1
RPO (maximum data loss per incident): target 1 second; measured 0.8s average replication lag ✅
DR Drill RPO (data loss during failover): target < 1 second; measured 0.8 seconds ✅
Annualised budget consumption rate: ~4.9% per 60 days → projected ~30% annual consumption. Well within budget.
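The arithmetic behind these figures is worth making explicit. A quick sketch of the downtime-budget math (pure arithmetic, no AWS dependency):

```python
MIN_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability: float, window_minutes: float = MIN_PER_YEAR) -> float:
    """Downtime implied by an availability level over a given window."""
    return window_minutes * (1 - availability)

annual_budget = downtime_minutes(0.9999)             # budget at 99.99% SLA
observed = downtime_minutes(0.99997, 60 * 24 * 60)   # downtime incurred in 60 days at 99.997%
consumed = observed / annual_budget                  # fraction of the annual budget used
annualised = consumed * (365 / 60)                   # projected full-year consumption

print(f"annual budget: {annual_budget:.2f} min")  # annual budget: 52.56 min
print(f"60-day consumption: {consumed:.1%}")      # 60-day consumption: 4.9%
print(f"projected annual: {annualised:.1%}")      # projected annual: 30.0%
```

The same function reproduces the earlier numbers too: a single 23-minute outage is 23 / 52.56 ≈ 44% of the annual 99.99% budget.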
Week 10 drill: unplanned failure simulation (AZ kill). RTO 0m 41s; RPO 0s (Multi-AZ). Result: ✅ PASS, sub-minute AZ recovery confirmed.
Next scheduled drill: Week 24 (simulated unplanned regional failure with database write load
active)
Based on measured production metrics, validated DR drill results, and continuous monitoring data,
the FinServ Co. multi-region platform demonstrates the architectural controls and operational
readiness required to sustain a 99.99% SLA. The platform has maintained 99.997% availability
over the first 60 days of production operation, consuming less than 10% of its annual downtime
budget. All three disaster recovery drills have met or exceeded the defined RTO (< 5 minutes) and
RPO (< 1 second) targets.
Signed off: Manish Kumar, AWS Solutions Architect
Review cadence: Monthly scorecard update | Quarterly full re-assessment
This summary should be reviewed at each monthly operations meeting and updated after every DR
drill. Any single incident consuming more than 20% of the annual downtime budget (10 minutes 31
seconds) should trigger an immediate architecture review.