Architecting Fault-Tolerant Multi-Region Systems

The document is a comprehensive guide on designing and implementing a fault-tolerant multi-region AWS platform aimed at achieving a 99.99% Service Level Agreement (SLA). It covers the challenges faced by a fintech company, the architectural decisions made, and the practical steps taken to enhance system reliability and performance. The guide is structured to assist both junior and senior engineers in understanding high-availability concepts and the complexities of multi-region systems.

The 99.99% Blueprint: Architecting Fault-Tolerant Multi-Region Systems on AWS

A Field-Tested Implementation Guide

About This Book

This guide documents a real-world journey of designing, building, and operating a multi-region AWS platform capable of sustaining a 99.99% Service Level Agreement (SLA). Whether you're a junior cloud engineer trying to wrap your head around high-availability concepts or a senior architect validating your own design decisions, this book was written with you in mind. I've tried to avoid the trap of making things sound easy when they aren't — because the truth is, multi-region systems are genuinely hard, and I want you to walk away with practical knowledge, not just theoretical fluff.

Contents

Chapter 1: The Challenge — Why We Were Falling Apart
  Business Context and the Pain Points Nobody Talked About
  Understanding What 99.99% Actually Means
  Technical Constraints We Were Starting With
  Budget and Timeline Pressures
Chapter 2: Initial Assessment — What I Found Under the Hood
  My Evaluation Process
  Key Metrics and Bottlenecks I Identified
  Risk Factors Identified
Chapter 3: Solution Design — The Architecture Decisions That Mattered
  The Core Question: Active-Active or Active-Passive?
  AWS Services Selected (and Why Each One Made the Cut)
  Architecture Diagram Description
  Cost vs. Performance Trade-offs
Chapter 4: Solution Design Deep Dive — The Areas That Bite You
  FinOps — Building Cost Discipline Into the Architecture
  AWS Well-Architected Framework — Six Pillars as a Design Checklist
  AWS Security Reference Architecture (SRA)
  AWS Systems Manager Session Manager — Replacing SSH Completely
  VPC Flow Logs — Building Network Intelligence
  KMS Key Policies and Permission Boundary Mechanics
Chapter 5: Implementation Journey
  Phase 1: Foundation — Getting the Ground Ready
  Phase 2: Core Services — Aurora Global Database and ECS
  Phase 3: Advanced Features — Automated Failover with Route 53 ARC
  Phase 3 (Continued): Monitoring, Alerting, and Observability
  Phase 4: Optimization — Fine-Tuning After Go-Live
Chapter 6: Disaster Recovery — RTO, RPO, and the Mechanics of Failover
  Defining the Recovery Objectives
  The Four DR Strategies and Why We Chose Warm Standby
  Failover Runbook (Step-by-Step)
Chapter 7: Challenges and How I Solved Them
  Challenge 1: VPC Flow Logs Cost Spiral
  Challenge 2: Aurora Failover Disconnecting Existing Sessions
  Challenge 3: Secrets Manager Secret Replication Lag
  Challenge 4: NAT Gateway Bottleneck During Failover Scale-Up
  Challenge 5: False Positive Health Check Failures from AWS Synthetic Canaries
Chapter 8: Results and Metrics — The Numbers
  Availability and Reliability
  Performance
  Cost
  Security
  Team Productivity
Chapter 9: Key Takeaways
  What Worked Exceptionally Well
  What I'd Do Differently Next Time
  Best Practices Discovered
  Recommendations for Similar Projects
Chapter 10: Tech Stack Summary
Appendix A: Complete Terraform Module Structure
Appendix B: Security Group Reference
Appendix C: AWS Config Rules Reference
Appendix D: CloudWatch Alarms Complete Reference
Appendix E: Useful AWS CLI One-Liners for Operations
Appendix F: Disaster Recovery Testing Checklist
Appendix G: Glossary of Key Terms
Appendix H: Architecture Decision Records (ADRs)
Appendix I: Recommended AWS Documentation and Further Reading
Appendix J: SLA Validation Summary
Chapter 1: The Challenge — Why We Were Falling Apart

Business Context and the Pain Points Nobody Talked About

Before I get into architecture diagrams and Terraform snippets, let me paint a picture of where things stood before we redesigned everything. The client was a mid-sized fintech company — I'll call them FinServ Co. — offering real-time payment processing and loan origination services across India and Southeast Asia. Their platform was processing somewhere around 200,000 transactions per day, with peak volumes around month-end and during promotional campaigns.

On paper, they had a "cloud-native" setup. In reality, they had a single AWS region (ap-south-1, Mumbai) with a modest Auto Scaling Group in front of a monolithic Java application, a MySQL RDS instance with Multi-AZ enabled, and an S3 bucket for documents. Their on-call team had set up some CloudWatch alarms that fired emails to a distribution list. It worked — mostly — until it didn't.

The defining moment that triggered this engagement was an RDS failover event that lasted 47 seconds. That's well within the bounds of what Multi-AZ is designed to handle. The problem was that the application wasn't designed to handle transient database connection failures gracefully. The connection pool exhausted, the health checks on the load balancer started failing, and before anyone had finished their morning chai, the entire platform had gone down. Total outage: 23 minutes.
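To make that failure mode concrete: a 47-second failover becomes a 23-minute outage when every request retries a dead connection immediately and drains the pool. A minimal sketch of the retry-with-backoff pattern the application lacked (illustrative Python — the function and parameter names are my own, not FinServ Co.'s code):

```python
import random
import time

def with_retries(operation, max_attempts=4, base_delay=0.2):
    """Retry a transient-failure-prone operation with exponential backoff and jitter.

    Gives a failed-over database time to accept connections again instead of
    exhausting the connection pool with immediate re-tries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retry budget exhausted — surface the failure
            # Exponential backoff (0.2s, 0.4s, 0.8s, ...) plus jitter to avoid
            # a thundering herd of simultaneous reconnects after a failover.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```

With logic like this in the data access layer, a sub-minute Multi-AZ failover is absorbed as a handful of slow requests rather than a cascading outage.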

Twenty-three minutes in the payments world isn't just embarrassing. It's regulatory exposure, it's
lost transaction revenue, and it's customer trust eroding in real time. The CTO received a call from
their largest enterprise client threatening to terminate the contract if they couldn't demonstrate a
credible path to 99.99% availability within the next 180 days.

That's when they called me.

Understanding What 99.99% Actually Means

Let me stop here, because this number gets thrown around a lot without people really internalizing what it implies.

• 99.9% SLA = 8 hours, 46 minutes of allowable downtime per year
• 99.99% SLA = 52 minutes and 36 seconds of allowable downtime per year
• 99.999% SLA = 5 minutes and 15 seconds of allowable downtime per year

Going from 99.9% to 99.99% isn't just adding a "9" — it's a 10x reduction in your downtime
budget. That 23-minute outage I described? It had already consumed roughly 44% of the entire
annual downtime budget for a 99.99% SLA target. In one incident.
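The downtime arithmetic is worth being able to reproduce yourself. A quick check of these budgets (plain Python, using a 365.25-day year; ±1-second differences from the figures above come down to rounding and year-length conventions):

```python
def downtime_budget_minutes(sla_percent, days=365.25):
    """Allowable downtime per year, in minutes, for a given SLA percentage."""
    minutes_per_year = days * 24 * 60
    return minutes_per_year * (1 - sla_percent / 100)

for sla in (99.9, 99.99, 99.999):
    total_seconds = round(downtime_budget_minutes(sla) * 60)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    print(f"{sla}%: {h}h {m}m {s}s of downtime per year")

# A single 23-minute outage measured against the 99.99% budget:
print(f"budget consumed: {23 / downtime_budget_minutes(99.99):.0%}")
```

Running this confirms the 99.99% budget of roughly 52 minutes 36 seconds, and that one 23-minute incident consumes about 44% of it.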

This is why 99.99% SLA almost always implies multi-region architecture. A single AWS region,
even with all its redundancy features, can experience regional-level events (as the October 2025
AWS outage demonstrated) that no amount of Multi-AZ configuration can protect you from. The
moment you accept that regional outages are a real threat — not a theoretical one — the
architecture decisions start to fall into place naturally.

Technical Constraints We Were Starting With

The existing environment had several constraints that we couldn't simply ignore:
• Database schema complexity: The MySQL database had 340+ tables with complex foreign key relationships. This wasn't something we could casually migrate to a different engine without significant risk.
• Application coupling: The monolith maintained long-lived database connections with stored procedures and MySQL-specific features. Any database change required close coordination with the development team.
• Compliance requirements: As a financial services company, they fell under RBI (Reserve Bank of India) guidelines, which imposed specific data residency requirements — certain customer PII had to remain within India's geographic boundaries.
• Operational maturity: The team was genuinely skilled at running the existing stack, but had limited experience with multi-region architecture, global traffic management, or chaos engineering.

Budget and Timeline Pressures

The CFO was supportive but cautious. The approved budget for this transformation was $80,000 USD for the initial 120-day engagement (my professional fees plus implementation costs), with an expected ongoing infrastructure cost increase of no more than 40% above their current AWS bill.

Their current AWS spend was approximately $12,000/month. That meant our target architecture needed to land somewhere under $17,000/month. This budget constraint was, frankly, one of the most useful forcing functions of the entire project. It stopped me from reaching for every shiny AWS service and forced genuine prioritization.

Timeline: 120 days to production readiness, with a hard deadline driven by that enterprise client's contract review.

Chapter 2: Initial Assessment — What I Found Under the Hood

My Evaluation Process

The first two weeks of the engagement were purely investigative. I've seen architects rush past
this phase and it almost always costs them later. Before I drew a single architecture diagram, I
wanted to understand the system as it actually existed, not as it was documented.

I started with a set of structured discovery sessions with different stakeholder groups:

With the engineering team (3 sessions, 6 hours total): The engineers were candid in ways that
surprised me. They knew the system had problems. The most common phrase I heard was "we
have a ticket for that." The backlog was littered with "improve connection handling," "add retry
logic," "implement circuit breaker" tickets — all marked medium priority, all perpetually
deprioritized in favor of feature work.

With the operations team (2 sessions, 4 hours total): The ops team managed everything through the AWS console. There was almost no Infrastructure as Code. Resources had been created manually, often with inconsistent naming, missing tags, and no documentation of why certain configurations existed. I asked one engineer why the RDS parameter group had max_connections set to 500 instead of the default, and nobody could remember.

With the business leadership (1 session, 2 hours): Leadership understood the financial stakes clearly. They also surfaced a requirement that hadn't been in the brief: the platform needed to support a Southeast Asian expansion within 18 months, which meant the architecture we built now needed to accommodate a second region in Singapore (ap-southeast-1) as a future growth path, not just a DR site.


Key Metrics and Bottlenecks I Identified

I ran a two-week observation period, collecting metrics from CloudWatch, enabling VPC Flow
Logs (which had not been turned on), and analyzing RDS Performance Insights data. Here's a
summary of what I found:

Application Performance:

• Average API response time: 340ms at baseline, spiking to 2.8 seconds during business
hours
• P99 response time: 4.2 seconds (unacceptable for a payments platform)

• Database query response time: The top 5 slowest queries accounted for 73% of all
database load
• Connection pool wait time: 180ms on average during peak hours, with occasional
timeouts

Infrastructure Reliability:

• The Auto Scaling Group was configured with a minimum of 2 instances and a maximum
of 4 — far too constrained for production traffic
• Health checks on the Application Load Balancer were using the root path /, which
returned a 200 even when the database connection was broken (classic trap)

• No scheduled maintenance windows had been defined for RDS, meaning AWS could
apply minor version updates at any time
• Backup retention was set to 7 days with no cross-region backup copy configured

Security Posture:
• Root account had no MFA enabled (this was the most alarming finding)
• IAM users had direct AdministratorAccess policies with no MFA enforcement
• Security groups allowed inbound port 22 (SSH) from 0.0.0.0/0 on bastion hosts
• No CloudTrail logging to an immutable S3 bucket

• Secrets were stored in environment variables, not in AWS Secrets Manager

Cost Efficiency:

• 12 EC2 instances running in us-east-1 from a previous project that had been
"temporarily" migrated and never cleaned up — approximately $1,200/month of pure
waste

• No Reserved Instances or Savings Plans active — 100% On-Demand pricing


• EBS snapshots accumulating without a lifecycle policy — 3TB of snapshots, many over 6 months old

Risk Factors Identified

After the assessment, I put together a formal risk register. The top risks were:

1. Single region dependency — Any regional event would cause complete service loss
2. No automated failover for any tier — Every failure required manual human intervention
3. Database as a single point of failure — Even with Multi-AZ, the failover process exposed application fragility
4. Zero runbook documentation — Recovery procedures existed only in people's heads
5. Insufficient observability — Without distributed tracing, diagnosing cross-service failures would take far too long
6. Security credential exposure — The secrets-in-environment-variables pattern was a breach waiting to happen
7. Compliance gap — No evidence of data classification, no encryption at rest for several RDS tables, no audit logging of data access

Chapter 3: Solution Design — The Architecture Decisions That Mattered

The Core Question: Active-Active or Active-Passive?

This was the first major architectural decision, and I want to be transparent about the deliberation
process, because it's not as straightforward as most guides make it seem.

Active-Active means both regions handle live traffic simultaneously. This gives you true zero-downtime failover — if one region goes down, the other region is already serving traffic and simply absorbs the load. The challenge is that it requires your data layer to be genuinely multi-master, meaning writes can happen in both regions simultaneously. For a payment processing system, this introduces the risk of split-brain scenarios and conflict resolution complexity.

Active-Passive (Warm Standby) means one region handles all production traffic while a second region maintains a running but scaled-down replica of the entire infrastructure. During a regional failure, traffic is redirected to the standby region, and the standby infrastructure scales up to handle full production load. This approach balances cost against recovery time — the secondary region costs roughly 30-40% of the primary, but your failover involves a few minutes of DNS propagation and database promotion.

For FinServ Co., I recommended a Warm Standby Active-Passive architecture as the initial
implementation, with a clear migration path to Active-Active for non-payment workloads. My
reasoning:
• The compliance constraints around financial transactions made true active-active
database writes across regions legally complex
• The budget constraint made full active-active infrastructure cost-prohibitive (it essentially
doubles your infrastructure costs)

• 99.99% SLA with warm standby and sub-5-minute RTO is achievable with modern AWS
services like Aurora Global Database and Route 53 ARC

Why This Matters: Many architects jump straight to active-active because it sounds better. But active-active with a poorly designed data layer can actually be less reliable than well-executed active-passive, because data consistency issues can corrupt application state across both regions simultaneously. Choose complexity only when you've earned it.

AWS Services Selected (and Why Each One Made the Cut)

I went through a deliberate service selection process. For every major component, I evaluated the AWS-native option, the open-source self-managed option, and any relevant managed alternatives. Here's the reasoning for the final stack:

Amazon Aurora Global Database (MySQL-compatible)

The database choice was the most consequential decision of the entire architecture. Aurora Global Database provides sub-second cross-region replication — typically around 1 second of replication lag — which translates directly to an RPO of approximately 1 second. Contrast this with standard RDS Cross-Region Read Replicas, which can have replication lag of 30 seconds to several minutes depending on write volume. For a payments platform where a 5-minute RPO is required, Aurora Global Database was the clear choice.
The secondary benefit was the failover mechanics. Aurora Global Database supports a "managed
planned failover" for scheduled maintenance that promotes the secondary region to primary with
zero data loss. For unplanned failures, the promotion typically completes in under 1 minute.

Amazon ECS on Fargate (Primary Application Tier)

The team had been running EC2-based Auto Scaling Groups. I evaluated three options: EC2 with ASG, ECS on Fargate, and EKS. EKS was ruled out based on operational complexity and cost — managing a Kubernetes control plane adds significant overhead, and the application didn't have requirements that demanded Kubernetes-specific features (custom resource definitions, complex scheduling, etc.). EC2 with ASG was familiar to the team but added unnecessary OS-level management burden. Fargate struck the right balance: container-based workloads with no node management, native integration with Service Discovery and ALB, and straightforward auto-scaling with built-in multi-AZ support.

Route 53 Application Recovery Controller (ARC)

This deserves a dedicated discussion because it's one of the most underutilized services in the AWS portfolio. Route 53 ARC, particularly after the introduction of ARC Region Switch in August 2025, provides a fully managed, centralized orchestration layer for multi-region failover. Before ARC, executing a regional failover required running complex scripts across multiple services — updating Route 53 records, promoting Aurora secondaries, scaling up ECS services in the standby region, updating application configuration — all in the right sequence and within a tight time window. ARC Region Switch turns this into a guided, pre-validated workflow. As one disaster recovery architect noted in the community when ARC Region Switch launched: "This will make x-region DR runbooks so much easier."

AWS Global Accelerator

Route 53 health checks have a propagation delay — changes to DNS records can take 60-120 seconds to propagate globally, depending on client-side TTL caching. Global Accelerator solves this by routing traffic through AWS's private backbone network and performing health checks at the edge, with failover measured in seconds rather than minutes. For a 99.99% SLA, that difference matters.

Amazon VPC with Transit Gateway

The multi-region architecture required secure, low-latency connectivity between regions for replication traffic and management plane operations. Rather than setting up individual VPC peering connections (which doesn't scale), I implemented Transit Gateway in each region with inter-region peering. This also laid the groundwork for the future Singapore expansion.

AWS Secrets Manager with Cross-Region Replication

Given the security findings around secrets in environment variables, AWS Secrets Manager was non-negotiable. The cross-region replication feature ensures that the standby region has access to all application secrets without any manual synchronization.

AWS KMS with Multi-Region Keys

Encryption at rest is straightforward in a single region. In a multi-region context, you need to decide how to handle encryption key distribution. AWS KMS Multi-Region Keys allow the same logical key material to be available in multiple regions, which means encrypted data (Aurora snapshots, S3 objects, EBS volumes) can be decrypted in the secondary region without key migration procedures. I'll cover the key policy design in depth in a later chapter.

AWS Systems Manager Session Manager

I made a point of eliminating all SSH-based access from the architecture. Session Manager provides browser-based and CLI-based shell access to EC2 instances (and ECS containers via ECS Exec) using IAM authentication, with full audit logging to CloudTrail and CloudWatch. No bastion hosts, no port 22, no SSH keys to rotate. In a financial services environment, this is not just a convenience — it's a significant reduction in attack surface.

Architecture Diagram Description

Let me walk through the architecture from edge to core, because understanding the traffic flow is
essential for understanding why each component exists.

Traffic Entry Layer:

• DNS resolution via Route 53, using latency-based routing as the primary routing policy
with health checks at the region level
• AWS Global Accelerator providing anycast IP addresses and edge-based health checking
with sub-30-second failover capability
• CloudFront CDN in front of static assets and read-heavy API endpoints, with S3 origin in
the primary region and cross-region replication to secondary

Application Layer (Primary Region - ap-south-1):

• Application Load Balancer spanning 3 Availability Zones (ap-south-1a, 1b, 1c)


• ECS Fargate cluster with task definitions pinned to specific CPU/memory configurations
• Service Auto Scaling based on ALB request count and CPU utilization metrics

• AWS WAF attached to the ALB with managed rule groups for OWASP Top 10

Application Layer (Secondary Region - ap-southeast-1, Warm Standby):

• Identical ALB configuration, scaled down to minimum tasks (1 per AZ)
• Same ECS service definitions using the same container images from ECR with cross-region replication
• Route 53 ARC routing controls keep this region's traffic at zero until failover is triggered

Data Layer:

• Aurora Global Database with primary cluster in ap-south-1 (3 AZs, 2 read replicas)
• Aurora secondary cluster in ap-southeast-1 with ~1 second replication lag
• ElastiCache Redis with Global Datastore for session state replication
• S3 with Cross-Region Replication for all object storage
Management and Security Layer:

• AWS Organizations with separate accounts for: Management, Security Tooling, Log Archive, Shared Services, Production Primary, Production Secondary
• AWS Control Tower for account governance
• AWS Security Hub aggregated to the Security Tooling account
• GuardDuty enabled in all regions and accounts
• CloudTrail with organization-level trail writing to the Log Archive account

Cost vs. Performance Trade-offs

The final architecture came in at approximately $16,200/month — just under the $17,000 ceiling. The biggest cost drivers were:

• Aurora Global Database: ~$4,800/month (primary + secondary clusters, storage, I/O)
• ECS Fargate (both regions): ~$3,200/month
• Global Accelerator: ~$800/month (+ data transfer)
• NAT Gateways (4 total across 2 regions): ~$1,400/month (NAT Gateway data charges are often underestimated)
• ElastiCache Redis: ~$1,200/month
• Other services (WAF, CloudFront, Secrets Manager, KMS, etc.): ~$2,000/month
• Data transfer costs: ~$2,800/month

One area where I pushed back on the default recommendation was NAT Gateway costs. Routing all private subnet traffic through NAT Gateways for accessing AWS services is expensive at scale. I implemented VPC Endpoints for all services that support them (S3, DynamoDB, SSM, Secrets Manager, ECR, CloudWatch Logs, STS) — this eliminated roughly $600/month in NAT Gateway data processing charges while simultaneously improving security (traffic stays on the AWS backbone).
Chapter 4: Solution Design Deep Dive — The Areas That Bite You

FinOps — Building Cost Discipline Into the Architecture

FinOps isn't an afterthought — it's a design principle. Multi-region architectures have a well-deserved reputation for runaway costs, particularly around data transfer, and I wanted to bake cost visibility into the platform from day one.

Tagging Strategy

Before deploying a single resource, I established a mandatory tagging schema enforced through
AWS Config rules and Service Control Policies:

Environment: production | staging | development
Region: primary | secondary
Application: payment-processor | loan-origination | shared-services
CostCenter: engineering | operations | compliance
Owner: team-name@[Link]
ManagedBy: terraform

Any resource missing these tags triggers a Config rule violation and a notification to the team
lead. This isn't just about cost allocation — it's about accountability. When a developer sees their
name on a $400/month RDS instance they created for a test, they tend to clean it up.
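The same schema check can also run outside AWS Config — for example as a CI gate over planned resources before they're ever created. A minimal sketch (Python; the allowed values are the ones from the schema above, while the function name and structure are my own illustration, not the actual Config rule):

```python
# Required tag keys and their allowed values; None means any non-empty
# value is accepted (e.g. Owner is free-form but must be present).
REQUIRED_TAGS = {
    "Environment": {"production", "staging", "development"},
    "Region": {"primary", "secondary"},
    "Application": {"payment-processor", "loan-origination", "shared-services"},
    "CostCenter": {"engineering", "operations", "compliance"},
    "Owner": None,
    "ManagedBy": {"terraform"},
}

def tag_violations(tags):
    """Return a list of human-readable violations for one resource's tag map."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        if key not in tags or not tags[key]:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and tags[key] not in allowed:
            problems.append(f"invalid value for {key}: {tags[key]!r}")
    return problems
```

An empty result means the resource is compliant; anything else fails the pipeline and names exactly what to fix.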

Commitment Coverage Strategy

With a 40% On-Demand cost reduction target, I implemented a tiered commitment strategy
aligned with the AWS FinOps Framework:

• 3-Year Compute Savings Plans for the baseline ECS Fargate workload that runs 24/7 (approximately 60% of average Fargate usage) — this yielded around 52% savings vs. On-Demand
• 1-Year Reserved Instances for the Aurora database instances — RDS Reserved Instances are size-flexible within an instance family, which preserves some flexibility as the database scales
• Regional RIs rather than Zonal RIs for EC2 workloads, to allow the RI discount to apply across AZs automatically
• Spot Instances for non-production workloads (dev, staging) with a fallback to On-Demand

One thing I got wrong on the first pass: I purchased Standard RIs for the production EC2 instances in the secondary region, which ran at near-zero utilization most of the time (remember, it's warm standby). When I modeled the actual utilization, Convertible RIs were a better fit for secondary-region resources because they allowed exchanges as the architecture evolved.

Pro Tip: Use the AWS Cost Explorer Reserved Instance Utilization and Coverage reports weekly, not monthly. By the time the monthly report arrives, you've already missed the window to adjust. Set up CloudWatch alarms on RI utilization dropping below 80% — that's your signal that something in the architecture has changed.

AWS Cost Anomaly Detection

I set up Cost Anomaly Detection with monitors for each service and each account, with
thresholds set to alert at 15% above the 30-day rolling average. This caught an early issue where
VPC Flow Logs to CloudWatch were generating far more data than expected (more on this in the
challenges section).

Data Transfer Cost Optimization

This is where multi-region architectures bleed money if you're not careful. Every byte that crosses
a region boundary costs money. I implemented the following to minimize cross-region data
transfer:

• S3 Transfer Acceleration disabled (unnecessary since we're using CloudFront at the edge)
• Aurora Global Database replication traffic is charged separately from standard data transfer and is generally more cost-efficient than replicating at the application layer
• ElastiCache Global Datastore replication traffic optimized by only caching session data and computed results, not raw database query results
• ECR Image Pull configured to pull from the local region's replicated ECR repository, not cross-region

New in 2025: Database Savings Plans

AWS launched Database Savings Plans at re:Invent 2025, extending commitment-based discounts to RDS and Aurora workloads. If you're architecting a new multi-region platform today, factor in Database Savings Plans alongside Compute Savings Plans — the combined coverage can reduce your database costs by 30-40%.

AWS Well-Architected Framework — Six Pillars as a Design Checklist

I use the AWS Well-Architected Framework not as a compliance checkbox but as a genuine
design review tool. Let me walk through how each pillar influenced specific decisions in this
architecture.

Operational Excellence

The key operational excellence principle for multi-region systems is: everything must be automated, documented, and regularly tested. The worst time to figure out your failover procedure is during an actual regional outage at 3 AM. We implemented:

• AWS Systems Manager Documents (SSM Documents) for all operational runbooks,
executed via Automation, not shell scripts
• AWS Config Conformance Packs to continuously validate that the infrastructure matches our desired state

• Operational dashboards in CloudWatch with pre-built runbook links embedded in alarm
descriptions
• GameDay exercises scheduled quarterly, simulating regional outages using AWS Fault
Injection Simulator (FIS)
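As an illustration, a minimal FIS experiment template for one of those GameDays might look like the sketch below. The IAM role, CloudWatch alarm, and tag values are assumptions for the example, not taken from the original setup:

```hcl
# Sketch: FIS experiment that stops all tagged instances in one AZ,
# with a CloudWatch alarm as the automatic abort condition
resource "aws_fis_experiment_template" "az_outage" {
  description = "GameDay: simulate loss of ap-south-1a"
  role_arn    = aws_iam_role.fis_role.arn # assumed FIS execution role

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.slo_breach.arn # assumed SLO alarm
  }

  action {
    name      = "stop-az-instances"
    action_id = "aws:ec2:stop-instances"
    target {
      key   = "Instances"
      value = "az-instances"
    }
  }

  target {
    name           = "az-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "ALL"
    resource_tag {
      key   = "Environment"
      value = "production"
    }
    filter {
      path   = "Placement.AvailabilityZone"
      values = ["ap-south-1a"]
    }
  }
}
```

The stop condition is the important safety net: if the SLO alarm fires during the experiment, FIS halts it automatically.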

Reliability (The Core Pillar for This Project)

The reliability pillar is most directly relevant to our 99.99% SLA goal. The foundational principle is fault isolation: design your system so that failures are contained and cannot cascade across boundaries. AWS Regions are the largest fault isolation boundary, which is precisely why multi-region is necessary for 99.99%.
Key reliability design decisions:

• No cross-region dependencies in the request path for either region — each region must be capable of operating completely independently

• Health checks that reflect actual application health, not just "is the server running" (I replaced the /health check with a /health/deep endpoint that validates database connectivity and downstream service availability)
• Automatic scaling with pre-warming logic (scaling policies that anticipate load increases, not just react to them)
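Wiring the deep health check into the load balancer is a small change on the target group. A sketch, assuming the ALB/ECS setup described elsewhere in this guide (the port and resource names are illustrative):

```hcl
# Sketch: ALB target group pointing at the deep health endpoint
resource "aws_lb_target_group" "api" {
  name        = "finserv-api-primary"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.primary.id
  target_type = "ip" # ECS awsvpc tasks register by IP

  health_check {
    path                = "/health/deep" # validates DB and downstream dependencies
    matcher             = "200"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 3
    unhealthy_threshold = 2
  }
}
```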

Security

The security pillar is woven throughout every layer of this architecture. The single most impactful change from a security posture standpoint was implementing the principle of least privilege at every layer, enforced through IAM Permission Boundaries (explained in detail in a later section).

Performance Efficiency

Starting from a 340ms average API response time, there was meaningful room for optimization. The architecture addressed performance at multiple layers:

• CloudFront caching for static assets and read-heavy API responses reduced origin load by approximately 60%
• ElastiCache Redis for session and computed result caching reduced database read load
significantly
• Aurora Read Replicas in the primary region for read-heavy reporting queries, offloading
the writer instance

Cost Optimization

Covered above in the FinOps section.

Sustainability

AWS added sustainability as the sixth pillar in 2021. For this architecture, sustainability considerations included:

• Graviton3-based EC2 instances for any EC2 workloads (Graviton offers comparable or
better performance at significantly lower power consumption)
• ECS Fargate spot tasks for non-production environments
• S3 Intelligent-Tiering for document storage to automatically move infrequently accessed objects to cheaper storage classes

AWS Security Reference Architecture (SRA)

The AWS Security Reference Architecture provides a blueprint for deploying security services across a multi-account AWS Organization. Before I explain how I applied it, let me explain why the SRA matters.

Running all your workloads in a single AWS account is like running your entire company from a
single server — it works until it doesn't, and when something goes wrong, the blast radius is your
entire business. The SRA recommends a purpose-built multi-account structure where security
boundaries between accounts limit the damage any single compromised credential can cause.

Account Structure I Implemented:

Management Account (Root of Organization)
├── Security OU
│   ├── Security Tooling Account (GuardDuty Master, Security Hub, IAM Access Analyzer)
│   └── Log Archive Account (CloudTrail, VPC Flow Logs, Config History)
├── Infrastructure OU
│   └── Shared Services Account (Transit Gateway, Route 53 Resolver, SSO)
└── Workloads OU
    ├── Production Primary Account (ap-south-1)
    └── Production Secondary Account (ap-southeast-1)

This account boundary approach means that even if an attacker compromises credentials in the Production Primary Account, they cannot access CloudTrail logs (which reside in the Log Archive Account), cannot modify GuardDuty findings (Security Tooling Account), and cannot affect the Shared Services infrastructure.

AWS Control Tower manages the guardrails across these accounts. Control Tower provides:

• Pre-built Service Control Policies (SCPs) that prevent disabling GuardDuty, CloudTrail, or
Config
• Automated account vending with a Landing Zone template
• Centralized compliance dashboard
GuardDuty is enabled organization-wide with the Security Tooling account as the administrator.
I specifically enabled:

• S3 Protection (detects unusual data access patterns)
• RDS Protection (detects unusual database authentication events)
• EKS/ECS Runtime Monitoring (if and when container workloads are added)
• Lambda Protection

AWS Security Hub is configured as an aggregated findings dashboard in the Security Tooling
account. It pulls findings from GuardDuty, Inspector, Macie, Config, and IAM Access Analyzer —
giving a single pane of glass for the security posture across all accounts.
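The cross-region half of that single pane of glass is a Security Hub finding aggregator, created in the Security Tooling account. A sketch of the relevant resources (assuming the delegated-administrator setup described above is already in place):

```hcl
# Sketch: aggregate Security Hub findings from every region into this one
resource "aws_securityhub_finding_aggregator" "all_regions" {
  linking_mode = "ALL_REGIONS"
}

# Auto-enable Security Hub in new accounts as they are vended
resource "aws_securityhub_organization_configuration" "org" {
  auto_enable = true
}
```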

Key Governance Decisions (Service Control Policies):

SCPs in the Management Account define the outer boundary of what any account in the
organization can do, regardless of what IAM policies say. Think of SCPs as a veto power — the
maximum permissions any principal can ever have.

I implemented SCPs that:

• Prevented disabling CloudTrail in any account
• Required MFA for any IAM action from the root user
• Restricted EC2 instance types to specific families (no GPU instances in non-ML accounts to prevent cryptomining)
• Prevented the creation of Internet Gateways in the security accounts
• Limited S3 bucket creation to specific regions (compliance requirement)
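The first of those guardrails can be sketched as an SCP attached to the Workloads OU. The OU id variable is an assumption for the example:

```hcl
# Sketch: SCP that blocks CloudTrail tampering in every attached account
resource "aws_organizations_policy" "protect_cloudtrail" {
  name = "protect-cloudtrail"
  type = "SERVICE_CONTROL_POLICY"
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyCloudTrailTampering"
      Effect = "Deny"
      Action = [
        "cloudtrail:StopLogging",
        "cloudtrail:DeleteTrail",
        "cloudtrail:UpdateTrail"
      ]
      Resource = "*"
    }]
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.protect_cloudtrail.id
  target_id = var.workloads_ou_id # assumed variable holding the OU id
}
```

Because SCPs are evaluated before IAM, even an account administrator inside the Workloads OU cannot stop or delete the trail.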

AWS Systems Manager Session Manager — Replacing SSH Completely

This section gets more attention than it deserves in most guides, so I want to explain the why as clearly as the how.

Traditional SSH access to EC2 instances requires:

1. An open inbound security group rule on port 22
2. SSH key pair management (generation, distribution, rotation, revocation)
3. A bastion host (or "jump server") accessible from the internet
4. Ongoing maintenance of that bastion host

Every one of those requirements is a potential vulnerability. Port 22 open to 0.0.0.0/0 is an invitation for brute-force attacks. SSH key management at enterprise scale is a genuine operational burden. Bastion hosts are frequently overlooked in patching cycles.

Session Manager eliminates all of these by providing IAM-authenticated, TLS-encrypted shell access to instances without any open inbound ports. Access is controlled entirely through IAM policies, which means the same identity federation and MFA enforcement that protects your AWS console also protects your instance access. Every session is logged to CloudWatch Logs and CloudTrail — you have a complete, tamper-evident audit trail of every command executed.

Setting Up Session Manager — Step by Step

First, ensure the SSM Agent is installed on your instances. For Amazon Linux 2023 and Amazon
Linux 2, it's pre-installed. For other distributions, you'll need to install it manually.

The instance needs an IAM instance profile with the AmazonSSMManagedInstanceCore managed
policy. This allows the SSM agent to register with Systems Manager and maintain the session.
Critically, the agent communicates outbound to the SSM endpoints — no inbound connectivity
required.

For instances in private subnets (as all production instances should be), you need VPC Endpoints for SSM:

# Create VPC Endpoints for Session Manager in the primary region
# These endpoints allow SSM traffic without traversing the NAT Gateway
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123def456789 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.ap-south-1.ssm \
--subnet-ids subnet-0a1b2c3d subnet-0e4f5a6b \
--security-group-ids sg-0security789 \
--private-dns-enabled \
--region ap-south-1 \
--tag-specifications 'ResourceType=vpc-endpoint,Tags=[{Key=Name,Value=ssm-endpoint-primary},{Key=Environment,Value=production}]'
# The above creates the core SSM endpoint. You also need two more endpoints:
# - com.amazonaws.ap-south-1.ssmmessages (for Session Manager specifically)
# - com.amazonaws.ap-south-1.ec2messages (for EC2 instance agent communication)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123def456789 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.ap-south-1.ssmmessages \
--subnet-ids subnet-0a1b2c3d subnet-0e4f5a6b \
--security-group-ids sg-0security789 \
--private-dns-enabled \
--region ap-south-1
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123def456789 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.ap-south-1.ec2messages \
--subnet-ids subnet-0a1b2c3d subnet-0e4f5a6b \
--security-group-ids sg-0security789 \
--private-dns-enabled \
--region ap-south-1

What these commands do: Each create-vpc-endpoint call creates an Interface VPC Endpoint — essentially an Elastic Network Interface (ENI) in your subnet that routes traffic destined for the specified AWS service through AWS's private network rather than through the internet or NAT Gateway. The --private-dns-enabled flag means that the standard SSM endpoint URLs resolve to private IP addresses when queried from within your VPC. The --security-group-ids parameter controls which instances can communicate with the endpoint.

Configuring Session Manager Preferences (via SSM):

# Configure Session Manager to log sessions to CloudWatch and S3
# Preferences are stored in an SSM document named SSM-SessionManagerRunShell
aws ssm put-document \
--name "SSM-SessionManagerRunShell" \
--content '{
"schemaVersion": "1.0",
"description": "Session Manager Preferences Document",
"sessionType": "Standard_Stream",
"inputs": {
"s3BucketName": "finserv-session-logs-primary",
"s3KeyPrefix": "session-manager/",
"s3EncryptionEnabled": true,
"cloudWatchLogGroupName": "/aws/ssm/session-manager",
"cloudWatchEncryptionEnabled": true,
"cloudWatchStreamingEnabled": true,
"kmsKeyId": "arn:aws:kms:ap-south-1:123456789012:key/mrk-abc123def456",
"runAsEnabled": false,
"shellProfile": {
"linux": "export HISTFILE=/dev/null; echo \"Session started: $(date)
by $(whoami) on $(hostname)\" | tee -a /var/log/[Link]"
}
}
}' \
--document-type "Session" \
--region ap-south-1

What this configuration does: The s3BucketName and s3KeyPrefix settings ensure that every session is recorded and stored in S3 for long-term audit purposes. cloudWatchStreamingEnabled: true means you get real-time log streaming while the session is active, which is useful for security monitoring. The kmsKeyId ensures that session logs are encrypted using your CMK (Customer Managed Key). The shellProfile at the bottom sets HISTFILE=/dev/null to prevent commands from being cached in the shell history file (since we're capturing everything in CloudWatch anyway, the shell history file would be redundant and potentially accessible to other processes).

Connecting to an Instance:

# Start a session from your local workstation (requires the Session Manager plugin for the AWS CLI)
aws ssm start-session \
--target i-0abc123def456789 \
--region ap-south-1
# For ECS containers using ECS Exec
aws ecs execute-command \
--cluster production-primary \
--task arn:aws:ecs:ap-south-1:123456789012:task/abc123 \
--container payment-processor \
--command "/bin/bash" \
--interactive \
--region ap-south-1

Gotcha: ECS Exec requires --enable-execute-command to be set when creating the ECS service, AND the task role needs ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, and ssmmessages:OpenDataChannel permissions. I've seen this trip up a lot of teams who set up ECS Exec but never test it before they actually need it.
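In Terraform, the two halves of that requirement look roughly like this. The cluster, service, and role names reuse examples from elsewhere in this guide; the task definition reference and network values are placeholders:

```hcl
# Sketch: the service-level flag plus the task-role permissions ECS Exec needs
resource "aws_ecs_service" "payment_processor" {
  name                   = "payment-processor"
  cluster                = "production-primary"
  task_definition        = aws_ecs_task_definition.payment_processor.arn # assumed
  desired_count          = 3
  launch_type            = "FARGATE"
  enable_execute_command = true # without this flag, execute-command always fails

  network_configuration {
    subnets         = aws_subnet.application[*].id
    security_groups = [var.app_security_group_id] # placeholder
  }
}

resource "aws_iam_role_policy" "ecs_exec" {
  name = "ecs-exec-ssmmessages"
  role = aws_iam_role.ecs_task_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}
```

Changing enable_execute_command on an existing service only applies to newly launched tasks, which is another reason to test it before you need it.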

VPC Flow Logs — Building Network Intelligence

VPC Flow Logs capture metadata about IP traffic flowing through your network interfaces. Note the word "metadata" — Flow Logs don't capture the actual packet content, just the who, what, when, and whether. This is an important distinction because it means they're not a substitute for a full packet capture solution, but they're invaluable for troubleshooting, security analysis, and compliance.

Why most teams aren't getting value from Flow Logs: The common failure mode I see is teams enabling Flow Logs, collecting them somewhere, and never looking at them. Logs without monitoring are expensive storage. The value of Flow Logs comes from building detection and alerting logic on top of them.

Enabling Flow Logs to CloudWatch and S3 (Terraform):

# VPC Flow Log resource - enables at the VPC level
# Sending to both CloudWatch (for real-time alerting) and S3 (for long-term analysis)
resource "aws_flow_log" "primary_vpc_cw" {
# IAM role that allows the VPC Flow Logs service to write to CloudWatch
iam_role_arn = aws_iam_role.flow_log_role.arn
log_destination = aws_cloudwatch_log_group.vpc_flow_logs.arn
traffic_type = "ALL" # Capture both ACCEPT and REJECT records
vpc_id = aws_vpc.primary.id
# Custom log format - the default format misses some useful fields
# Note the $$ escaping: Terraform would otherwise treat ${...} as interpolation
log_format = "$${version} $${account-id} $${interface-id} $${srcaddr} $${dstaddr} $${srcport} $${dstport} $${protocol} $${packets} $${bytes} $${windowstart} $${windowend} $${action} $${flow-direction} $${log-status} $${vpc-id} $${subnet-id} $${instance-id} $${tcp-flags} $${type} $${pkt-srcaddr} $${pkt-dstaddr}"
tags = {
Name = "flow-logs-primary-vpc"
Environment = "production"
ManagedBy = "terraform"
}
}
resource "aws_flow_log" "primary_vpc_s3" {
log_destination = "arn:aws:s3:::finserv-vpc-flow-logs-archive/primary-region/"
log_destination_type = "s3"
traffic_type = "ALL"
vpc_id = aws_vpc.primary.id
# Parquet format enables much faster and cheaper Athena queries
destination_options {
file_format = "parquet"
hive_compatible_partitions = true # Automatically partitions by date for cost-efficient querying
per_hour_partition = true
}
}
resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
name = "/aws/vpc/flow-logs/primary"
retention_in_days = 30 # Keep 30 days in CloudWatch for active analysis
# Encrypt the log group with KMS (key resource assumed to be defined elsewhere)
kms_key_id = aws_kms_key.flow_logs.arn
}
# IAM role allowing VPC Flow Logs to write to CloudWatch
resource "aws_iam_role" "flow_log_role" {
name = "vpc-flow-log-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "vpc-flow-logs.amazonaws.com"
}
Condition = {
# Prevents the confused deputy problem - limits which account's flow logs can assume this role
StringEquals = {
"aws:SourceAccount" = var.account_id
}
}
}
]
})
}
resource "aws_iam_role_policy" "flow_log_policy" {
name = "vpc-flow-log-policy"
role = aws_iam_role.flow_log_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
]
Effect = "Allow"
Resource = "*"
}
]
})
}

What the custom log format achieves: The default VPC Flow Log format captures only a subset of available fields. The custom format I've defined above adds fields like flow-direction (whether traffic is ingress or egress relative to the ENI), vpc-id and subnet-id (essential for troubleshooting in multi-VPC environments), tcp-flags (useful for detecting port scans and SYN floods), and pkt-srcaddr/pkt-dstaddr (the actual source and destination addresses, which differ from srcaddr/dstaddr when NAT is involved).

The Hive-compatible partitions option is a game-changer for cost management. It automatically organizes Flow Log files in S3 using a directory structure that Athena understands as partitions (e.g., /year=2025/month=10/day=15/). This means when you run an Athena query for yesterday's rejected traffic, you only scan the files for that day — not your entire Flow Log archive. Without partitioning, a simple query can scan hundreds of gigabytes and cost tens of dollars. With partitioning, the same query scans megabytes and costs fractions of a cent.
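To make the partition pruning concrete, here's the shape of such a query, saved as an Athena named query. The Glue database and table names are assumptions; with Hive-compatible partitions the partition columns include year, month, and day:

```hcl
# Sketch: saved Athena query that scans only one day's partition
resource "aws_athena_named_query" "rejected_one_day" {
  name     = "flow-logs-rejected-one-day"
  database = "vpc_flow_logs" # assumed Glue database over the S3 archive
  query    = <<-SQL
    SELECT srcaddr, dstaddr, dstport, count(*) AS attempts
    FROM flow_logs
    WHERE action = 'REJECT'
      AND year = '2025' AND month = '10' AND day = '15' -- partition filter
    GROUP BY srcaddr, dstaddr, dstport
    ORDER BY attempts DESC
    LIMIT 50;
  SQL
}
```

The WHERE clause on year/month/day is what restricts the scan to a single day's files; drop it and Athena reads the whole archive.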

Detection Patterns with CloudWatch Metric Filters:

# Create a metric filter that detects rejected traffic (potential port scans or unauthorized access attempts)
aws logs put-metric-filter \
--log-group-name "/aws/vpc/flow-logs/primary" \
--filter-name "RejectedTraffic" \
--filter-pattern "[version, account, interface, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action=REJECT, ...]" \
--metric-transformations metricName=RejectedConnections,metricNamespace=VPCFlowLogs,metricValue=1,unit=Count \
--region ap-south-1
# Create an alarm if rejected traffic spikes above normal baseline
aws cloudwatch put-metric-alarm \
--alarm-name "HighRejectedTraffic" \
--alarm-description "Spike in rejected VPC traffic - possible port scan or unauthorized access" \
--metric-name RejectedConnections \
--namespace VPCFlowLogs \
--statistic Sum \
--period 300 \
--threshold 1000 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-south-1:123456789012:security-alerts \
--region ap-south-1

What this does: The put-metric-filter command tells CloudWatch to scan incoming Flow Log
entries and, whenever it finds a log entry where the action field equals REJECT, increment a
counter metric named RejectedConnections. The alarm then fires if this counter exceeds 1,000
rejected connections within a 5-minute window across two consecutive evaluation periods (i.e.,
10 minutes sustained). This kind of alert would catch a port scan in real time.

Cost Gotcha: VPC Flow Logs to CloudWatch can get very expensive at scale. A busy VPC with many ENIs can generate hundreds of gigabytes of logs per day. My recommendation: send Flow Logs to both CloudWatch (with a 30-day retention and metric filters for alerting) AND S3 with Parquet format (for longer-term analysis). This gives you real-time detection from CloudWatch while keeping long-term query costs manageable via Athena on S3.

KMS Key Policies and Permission Boundary Mechanics

Encryption is easy to enable. Encryption that actually improves your security posture, rather than just checking a compliance box, requires careful key policy design.

Understanding KMS Key Policies

A KMS key policy is unlike any other AWS resource policy. The critical difference: the key policy
is the primary access control mechanism for KMS keys. Even an IAM user
with AdministratorAccess cannot use a KMS key unless the key policy explicitly allows it. This is
actually the intended behavior — it means that key access cannot be accidentally granted
through overly permissive IAM policies.

Every KMS key must have at least one statement that grants the AWS account root access, or the key becomes permanently inaccessible (a situation you cannot recover from). This is the first thing I check in any KMS key audit.

Multi-Region Key Design

For a multi-region architecture, you have two options for KMS keys:

1. Separate regional keys — Create independent keys in each region. Data encrypted in
the primary region cannot be decrypted in the secondary region without re-encryption.
2. Multi-Region keys — A single logical key that can be replicated to multiple regions. The
same key material (same key ID prefix mrk-) is available in both regions.

For Aurora Global Database, S3 Cross-Region Replication, and any data that moves between
regions, Multi-Region keys are the practical choice. Using separate regional keys with cross-
region data would require re-encryption at the application layer, which adds complexity and
latency.

KMS Key Policy for Aurora Global Database (Terraform):

resource "aws_kms_key" "aurora_primary" {
provider = aws.primary # aws.primary = ap-south-1
description = "Aurora Global Database encryption key - Primary Region"
key_usage = "ENCRYPT_DECRYPT"
customer_master_key_spec = "SYMMETRIC_DEFAULT"
multi_region = true # Makes this a multi-region key
# Automatic key rotation - enables new key material annually while preserving
# the ability to decrypt data encrypted with older key versions
enable_key_rotation = true
policy = jsonencode({
Version = "2012-10-17"
Id = "aurora-key-policy"
Statement = [
{
# Statement 1: Root account access (required - never remove this)
Sid = "Enable IAM User Permissions"
Effect = "Allow"
Principal = {
AWS = "arn:aws:iam::${var.account_id}:root"
}
Action = "kms:*"
Resource = "*"
},
{
# Statement 2: Allow the Aurora service to use this key
Sid = "Allow Aurora Service"
Effect = "Allow"
Principal = {
Service = "rds.amazonaws.com"
}
Action = [
"kms:Encrypt",
"kms:Decrypt",
"kms:ReEncrypt*",
"kms:GenerateDataKey*",
"kms:CreateGrant",
"kms:ListGrants",
"kms:DescribeKey"
]
Resource = "*"
Condition = {
# Scope the RDS service access to only resources in this account
StringEquals = {
"kms:CallerAccount" = var.account_id
"kms:ViaService" = "rds.ap-south-1.amazonaws.com"
}
}
},
{
# Statement 3: Allow key administrators (senior engineers only)
Sid = "Allow Key Administration"
Effect = "Allow"
Principal = {
AWS = [
"arn:aws:iam::${var.account_id}:role/KeyAdministrator",
"arn:aws:iam::${var.account_id}:role/TerraformExecutionRole"
]
}
Action = [
"kms:Create*",
"kms:Describe*",
"kms:Enable*",
"kms:List*",
"kms:Put*",
"kms:Update*",
"kms:Revoke*",
"kms:Disable*",
"kms:Get*",
"kms:Delete*",
"kms:TagResource",
"kms:UntagResource",
"kms:ScheduleKeyDeletion",
"kms:CancelKeyDeletion"
]
Resource = "*"
},
{
# Statement 4: Allow key users (application roles)
# Note: No kms:Delete*, kms:Disable*, or kms:Schedule* here
Sid = "Allow Key Usage by Applications"
Effect = "Allow"
Principal = {
AWS = [
"arn:aws:iam::${var.account_id}:role/ECSTaskExecutionRole",
"arn:aws:iam::${var.account_id}:role/ECSApplicationRole"
]
}
Action = [
"kms:Decrypt",
"kms:GenerateDataKey*",
"kms:DescribeKey"
]
Resource = "*"
}
]
})
tags = {
Name = "aurora-global-key-primary"
Environment = "production"
DataClass = "confidential"
}
}
# Replicate the primary key to the secondary region
resource "aws_kms_replica_key" "aurora_secondary" {
provider = aws.secondary # aws.secondary = ap-southeast-1
description = "Aurora Global Database encryption key - Secondary Region (replica)"
primary_key_arn = aws_kms_key.aurora_primary.arn
# The replica key has its own key policy - it starts as a copy of the primary
# but can be modified independently. Keep it aligned with the primary.
policy = aws_kms_key.aurora_primary.policy
tags = {
Name = "aurora-global-key-secondary"
Environment = "production"
Region = "secondary"
}
}

Understanding the key policy structure: The four statements above implement a clear separation of concerns:

1. Root account access is the safety net — it ensures that if all other principals are
accidentally removed, the account root user can still access the key. Never remove this
statement.

2. Aurora service access uses kms:ViaService to scope the permission to actions taken
through the RDS service, not arbitrary Decrypt calls directly via the KMS API. This means
an attacker who obtains an Aurora role's credentials cannot use them to decrypt arbitrary
data outside of the RDS context.
3. Key administration is restricted to specific named roles, not the
entire Administrator group. Key lifecycle management (creation, deletion, rotation

changes) should have an even more restricted audience than IAM administration.
4. Application key usage grants only Decrypt and GenerateDataKey* — the minimum
permissions needed by applications to read and write encrypted data. No Encrypt (they
use GenerateDataKey* for envelope encryption), no administrative actions.

Permission Boundary Mechanics — Explained Clearly

Permission Boundaries are one of the most misunderstood concepts in AWS IAM. Let me explain
them clearly.

IAM has two types of policies:

• Identity-based policies: Attached to users, groups, or roles. They grant permissions.
• Resource-based policies: Attached to resources (S3 buckets, KMS keys, etc.). They also grant permissions.

A Permission Boundary is a special type of identity-based policy that sets the maximum permissions a role or user can ever have — regardless of what other policies say.

Think of it this way: Your IAM role has a permission boundary that allows s3:*. Even if you then attach an identity-based policy that grants s3:* plus ec2:*, the EC2 permissions are silently blocked by the boundary. The effective permissions are the intersection of what the identity-based policy allows AND what the permission boundary allows.

Why Permission Boundaries Matter in This Architecture:

The architecture uses a developer role vending machine pattern — developers can create IAM
roles for their services, but they cannot create roles more powerful than their own permission
boundary. This prevents privilege escalation attacks where a developer creates a role
with AdministratorAccess and uses it to escape their authorized scope.

# Permission boundary definition - caps what any role created in this account can do
resource "aws_iam_policy" "developer_permission_boundary" {
name = "DeveloperPermissionBoundary"
description = "Maximum permissions allowed for developer-created roles"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowApplicationServices"
Effect = "Allow"
Action = [
# Only allow the specific services the application legitimately uses
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket",
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret",
"kms:Decrypt",
"kms:GenerateDataKey",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"xray:PutTraceSegments",
"xray:PutTelemetryRecords",
"ssmmessages:*",
"ec2messages:*"
]
Resource = "*"
},
{
# Explicit deny: no IAM role-mutation actions through developer-created roles
# This prevents the "create a role and grant yourself admin" escape hatch
Sid = "DenyIAMExceptPassRole"
Effect = "Deny"
Action = [
"iam:AttachRolePolicy",
"iam:CreateRole",
"iam:DeleteRole",
"iam:DetachRolePolicy",
"iam:PutRolePolicy"
]
Resource = "*"
},
{
# Deny access to organization-level actions and billing
Sid = "DenyOrganizationActions"
Effect = "Deny"
Action = [
"organizations:*",
"account:*",
"billing:*"
]
Resource = "*"
}
]
})
}
# When creating an ECS task role, always attach the permission boundary
resource "aws_iam_role" "ecs_task_role" {
name = "ecs-payment-processor-task-role"
# The boundary caps this role's maximum permissions
permissions_boundary = aws_iam_policy.developer_permission_boundary.arn
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "[Link]"
}
}
]
})
}
# Even if this role later gets an over-permissive policy attached,
# the permission boundary ensures the maximum effective permissions
# never exceed what we defined in DeveloperPermissionBoundary
resource "aws_iam_role_policy_attachment" "ecs_task_policy" {
role = aws_iam_role.ecs_task_role.name
policy_arn = aws_iam_policy.ecs_application_policy.arn
}

The Permission Boundary Gotcha Everyone Hits:

Permission boundaries don't grant permissions on their own. They only limit them. This means you need BOTH a permission boundary AND an identity-based policy for the principal to actually do anything. I've seen teams attach a permission boundary to a role, then wonder why the role suddenly can't do anything — it's because the boundary alone grants nothing; it just limits what the identity policy can grant.
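To close the loop on the gotcha: the boundary-capped role also needs its identity-based policy (the ecs_application_policy attached earlier). A sketch, with one deliberately out-of-bounds action added to show the intersection at work (the policy name and action list are illustrative):

```hcl
# Sketch: identity policy for the boundary-capped task role.
# Effective permissions = this policy INTERSECTED with DeveloperPermissionBoundary,
# so dynamodb:Query below is silently denied (it is not in the boundary).
resource "aws_iam_policy" "ecs_application_policy" {
  name = "ecs-payment-processor-policy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "s3:GetObject",
        "s3:PutObject",
        "secretsmanager:GetSecretValue",
        "kms:Decrypt",
        "dynamodb:Query" # granted here, but blocked by the boundary
      ]
      Resource = "*"
    }]
  })
}
```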


Chapter 5: Implementation Journey

Phase 1: Foundation — Getting the Ground Ready

The first phase was all about building the foundational infrastructure that everything else would
sit on top of. I always tell junior engineers: time spent on the foundation is never wasted. A shaky
foundation means every layer on top is unreliable.

VPC Design and Network Architecture

The network architecture is more nuanced than most tutorials suggest. The key principle is: treat
network segmentation as a security control, not just an organizational tool.

Primary Region VPC (ap-south-1) — CIDR: 10.0.0.0/16

├── Public Subnets (Internet-facing load balancers, NAT Gateways only)
│   ├── 10.0.0.0/24 (ap-south-1a)
│   ├── 10.0.1.0/24 (ap-south-1b)
│   └── 10.0.2.0/24 (ap-south-1c)
├── Application Subnets (ECS tasks, EC2 instances - no direct internet access)
│   ├── 10.0.10.0/23 (ap-south-1a) — /23 gives 512 addresses
│   ├── 10.0.12.0/23 (ap-south-1b)
│   └── 10.0.14.0/23 (ap-south-1c)
├── Database Subnets (Aurora, ElastiCache - most restrictive)
│   ├── 10.0.20.0/24 (ap-south-1a)
│   ├── 10.0.21.0/24 (ap-south-1b)
│   └── 10.0.22.0/24 (ap-south-1c)
└── Management Subnets (SSM endpoints, monitoring agents)
    ├── 10.0.30.0/24 (ap-south-1a)
    └── 10.0.31.0/24 (ap-south-1b)

Secondary Region VPC (ap-southeast-1) — CIDR: 10.1.0.0/16

The secondary VPC uses the 10.1.0.0/16 range — an important convention. Because we're connecting these VPCs via Transit Gateway, their CIDR ranges must not overlap. Using 10.0.0.0/16 for primary and 10.1.0.0/16 for secondary is a pattern that also accommodates future regional expansion (10.2.0.0/16 for a third region, etc.).
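That convention can be encoded once rather than hand-assigned, carving each region's /16 out of a 10.0.0.0/8 supernet with cidrsubnet. A sketch (the supernet and region names are assumptions matching the one-/16-per-region scheme):

```hcl
# Sketch: derive non-overlapping per-region /16s from one supernet
locals {
  supernet     = "10.0.0.0/8"
  region_index = { primary = 0, secondary = 1, tertiary = 2 }

  # cidrsubnet("10.0.0.0/8", 8, n) yields 10.n.0.0/16
  region_cidrs = {
    for name, idx in local.region_index :
    name => cidrsubnet(local.supernet, 8, idx)
  }
  # primary = "10.0.0.0/16", secondary = "10.1.0.0/16", tertiary = "10.2.0.0/16"
}
```

Deriving the ranges from a single expression makes it impossible for a new region's VPC to accidentally overlap an existing one.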

Terraform VPC Module (Primary Region):

# vpc.tf - Primary VPC Configuration


terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# Provider configuration for primary region
provider "aws" {
alias = "primary"
region = "ap-south-1"
}
# Provider configuration for secondary region
provider "aws" {
alias = "secondary"
region = "ap-southeast-1"
}

# AZ discovery for the primary region - referenced by the subnet resources below
data "aws_availability_zones" "available" {
provider = aws.primary
state = "available"
}
# Primary VPC
resource "aws_vpc" "primary" {
provider = aws.primary
cidr_block = "10.0.0.0/16"
# Enable DNS hostnames - required for VPC Endpoints to work with private DNS
enable_dns_hostnames = true
# Enable DNS resolution - required for DNS to function within the VPC
enable_dns_support = true
tags = {
Name = "finserv-primary-vpc"
Environment = "production"
Region = "primary"
ManagedBy = "terraform"
}
}
# Internet Gateway - allows traffic from public subnets to reach the internet
# Public subnets use this for outbound traffic and to receive inbound traffic
resource "aws_internet_gateway" "primary" {
provider = aws.primary
vpc_id = aws_vpc.primary.id
tags = {
Name = "finserv-primary-igw"
}
}
# Public Subnets - one per AZ
# These subnets host the Application Load Balancer and NAT Gateways
# IMPORTANT: DO NOT place application servers in public subnets
resource "aws_subnet" "public" {
provider = aws.primary
count = 3
vpc_id = aws_vpc.primary.id
cidr_block = cidrsubnet("10.0.0.0/16", 8, count.index)
# cidrsubnet(base_cidr, newbits, netnum) calculates subnet CIDRs:
# count=0: 10.0.0.0/24, count=1: 10.0.1.0/24, count=2: 10.0.2.0/24
availability_zone = data.aws_availability_zones.available.names[count.index]
# map_public_ip_on_launch = false - we do NOT auto-assign public IPs
# The ALB manages its own public-facing addresses
map_public_ip_on_launch = false
tags = {
Name = "finserv-public-${data.aws_availability_zones.available.names[count.index]}"
Tier = "public"
}
}
# Application Subnets - /23 blocks for larger address space
resource "aws_subnet" "application" {
provider = aws.primary
count = 3
vpc_id = aws_vpc.primary.id
# Starting at offset 10 in the third octet: 10.0.10.0/23, 10.0.12.0/23, 10.0.14.0/23
cidr_block = cidrsubnet("10.0.0.0/16", 7, count.index + 5)
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "finserv-app-${data.aws_availability_zones.available.names[count.index]}"
Tier = "application"
}
}
# Database Subnets - smallest required size (/24)
resource "aws_subnet" "database" {
provider = aws.primary
count = 3
vpc_id = aws_vpc.primary.id
cidr_block = cidrsubnet("10.0.0.0/16", 8, count.index + 20)
# count=0: 10.0.20.0/24, count=1: 10.0.21.0/24, count=2: 10.0.22.0/24
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "finserv-db-${data.aws_availability_zones.available.names[count.index]}"
Tier = "database"
}
}
# NAT Gateways - one per AZ for high availability
# This is more expensive than a single NAT Gateway but eliminates cross-AZ NAT Gateway traffic
# and removes the NAT Gateway as a single point of failure for private subnets
resource "aws_eip" "nat" {
provider = [Link]
count = 3
domain = "vpc"
# Elastic IPs are the static public IP addresses attached to NAT Gateways
# We need one per NAT Gateway
tags = {
Name = "finserv-nat-eip-${[Link]}"
}
}
resource "aws_nat_gateway" "primary" {
provider = [Link]
count = 3
allocation_id = aws_eip.nat[[Link]].id
subnet_id = aws_subnet.public[[Link]].id
# NAT Gateways MUST be in public subnets - they need internet connectivity
tags = {
Name = "finserv-nat-
${data.aws_availability_zones.[Link][[Link]]}"
}
depends_on = [aws_internet_gateway.primary]
# The Internet Gateway must exist before creating the NAT Gateway
}
# Route Tables for private subnets
# Each private subnet gets its own route table pointing to the AZ-local NAT
Gateway
# This ensures that if one NAT Gateway fails, only that AZ is affected
resource "aws_route_table" "private" {
provider = [Link]
count = 3
vpc_id = aws_vpc.[Link]
route {
cidr_block = "[Link]/0"
nat_gateway_id = aws_nat_gateway.primary[[Link]].id
# Default route: all traffic goes through the AZ-local NAT Gateway
}
tags = {
Name = "finserv-private-rt-${[Link]}"
}
}
resource "aws_route_table_association" "application" {
provider = [Link]
count = 3
subnet_id = aws_subnet.application[[Link]].id
route_table_id = aws_route_table.private[[Link]].id
}
resource "aws_route_table_association" "database" {
provider = [Link]
count = 3
subnet_id = aws_subnet.database[[Link]].id
route_table_id = aws_route_table.private[[Link]].id
}
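The cidrsubnet() arithmetic is easy to get wrong in a review. As a sanity check, here is a small Python sketch that reimplements Terraform's function with the standard library (the 10.0.0.0/16 base is the VPC CIDR assumed throughout this chapter, not part of the Terraform above):

```python
import ipaddress

def cidrsubnet(base_cidr: str, newbits: int, netnum: int) -> str:
    """Reimplementation of Terraform's cidrsubnet() for sanity-checking plans."""
    base = ipaddress.ip_network(base_cidr)
    # Enumerate all subnets `newbits` smaller than the base, pick the netnum-th
    return str(list(base.subnets(prefixlen_diff=newbits))[netnum])

# Public /24s: cidrsubnet("10.0.0.0/16", 8, count.index)
print([cidrsubnet("10.0.0.0/16", 8, i) for i in range(3)])
# Application /23s: cidrsubnet("10.0.0.0/16", 7, count.index + 5)
print([cidrsubnet("10.0.0.0/16", 7, i + 5) for i in range(3)])
# Database /24s: cidrsubnet("10.0.0.0/16", 8, count.index + 20)
print([cidrsubnet("10.0.0.0/16", 8, i + 20) for i in range(3)])
```

Running this against a planned address scheme before `terraform apply` catches overlapping subnets while they are still cheap to fix.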

Transit Gateway for Inter-Region Connectivity:

# Transit Gateway in Primary Region
resource "aws_ec2_transit_gateway" "primary" {
  provider    = aws.primary
  description = "Transit Gateway for primary region - FinServ Co."
  # Amazon Side ASN for BGP routing - must be unique per TGW
  amazon_side_asn = 64512
  # Auto-accept shared attachments from the same account
  auto_accept_shared_attachments = "enable"
  # Default route table - we'll use custom route tables for more control
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  # Enable multicast if needed for future use cases
  multicast_support = "disable"
  tags = {
    Name = "finserv-tgw-primary"
  }
}

# Transit Gateway in Secondary Region
resource "aws_ec2_transit_gateway" "secondary" {
  provider        = aws.secondary
  description     = "Transit Gateway for secondary region - FinServ Co."
  amazon_side_asn = 64513 # Different ASN than primary
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  tags = {
    Name = "finserv-tgw-secondary"
  }
}

# Peering between the two Transit Gateways
# This is what allows traffic to flow between the two regions
resource "aws_ec2_transit_gateway_peering_attachment" "primary_secondary" {
  provider                = aws.primary
  peer_account_id         = data.aws_caller_identity.secondary.account_id
  peer_region             = "ap-southeast-1"
  peer_transit_gateway_id = aws_ec2_transit_gateway.secondary.id
  transit_gateway_id      = aws_ec2_transit_gateway.primary.id
  tags = {
    Name = "tgw-peering-primary-to-secondary"
  }
}

# Accept the peering attachment from the secondary region
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "accept" {
  provider                      = aws.secondary
  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.primary_secondary.id
  tags = {
    Name = "tgw-peering-accept-secondary"
  }
}

IAM Roles and Security Baseline:

The security baseline was established before any workload resources were created. The three
most critical security foundations were:

1. Enabling CloudTrail organization-wide with S3 log delivery to the Log Archive account
2. Enforcing MFA through IAM password policy and SCPs
3. Creating the IAM role hierarchy with permission boundaries

# Enable CloudTrail organization trail - captures ALL API calls across all accounts
aws cloudtrail create-trail \
  --name finserv-org-trail \
  --s3-bucket-name finserv-cloudtrail-archive-${LOG_ARCHIVE_ACCOUNT_ID} \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --is-organization-trail \
  --kms-key-id arn:aws:kms:ap-south-1:${SECURITY_ACCOUNT_ID}:key/${CLOUDTRAIL_KEY_ID} \
  --region ap-south-1
# What each flag does:
# --is-multi-region-trail: Captures API events from ALL regions, not just ap-south-1
# --enable-log-file-validation: Creates SHA-256 hash digests so you can verify logs weren't tampered with
# --is-organization-trail: Applies to all accounts in the AWS Organization
# --kms-key-id: Encrypts log files using our CMK (the default is SSE-S3, this is stronger)

# Enable CloudTrail logging
aws cloudtrail start-logging \
  --name finserv-org-trail \
  --region ap-south-1
# Enable MFA enforcement through IAM password policy
aws iam update-account-password-policy \
--minimum-password-length 14 \
--require-symbols \
--require-numbers \
--require-uppercase-characters \
--require-lowercase-characters \
--allow-users-to-change-password \
--max-password-age 90 \
--password-reuse-prevention 12

Phase 2: Core Services — Aurora Global Database and ECS

Aurora Global Database Setup


This is the heart of the multi-region architecture. Aurora Global Database is the reason we can
achieve sub-second RPO.

# Aurora Global Database Terraform Configuration
# First, create the global cluster object (a logical grouping)
resource "aws_rds_global_cluster" "finserv" {
global_cluster_identifier = "finserv-global"
engine = "aurora-mysql"
engine_version = "8.0.mysql_aurora.3.04.1"
database_name = "finservdb"
storage_encrypted = true
# This is a global cluster object - it doesn't contain any instances itself
# The actual writer instances are in the aws_rds_cluster resources below
}
# Primary Aurora Cluster (ap-south-1) - this is the writer cluster
resource "aws_rds_cluster" "primary" {
  provider                  = aws.primary
  cluster_identifier        = "finserv-aurora-primary"
  engine                    = "aurora-mysql"
  engine_version            = "8.0.mysql_aurora.3.04.1"
  global_cluster_identifier = aws_rds_global_cluster.finserv.id
  db_subnet_group_name      = aws_db_subnet_group.primary.name
  vpc_security_group_ids    = [aws_security_group.aurora_primary.id]
  database_name             = "finservdb"
  master_username           = "admin"
  # Master password comes from Secrets Manager, not hardcoded
  manage_master_user_password   = true
  master_user_secret_kms_key_id = aws_kms_key.aurora_primary.arn
  # Backup configuration
  backup_retention_period      = 7 # 7 days of automated backups
  preferred_backup_window      = "20:30-21:30" # 2-3 AM IST (off-peak); RDS windows are specified in UTC
  preferred_maintenance_window = "sun:03:00-sun:04:00"
  # Enhanced monitoring
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.aurora_mysql8.name
  # Enable deletion protection - prevents accidental cluster deletion
  deletion_protection = true
  # Enable Performance Insights
  performance_insights_enabled          = true
  performance_insights_kms_key_id       = aws_kms_key.aurora_primary.arn
  performance_insights_retention_period = 7 # Days
  storage_encrypted = true
  kms_key_id        = aws_kms_key.aurora_primary.arn
  # Enable automatic minor version upgrades during maintenance window
  auto_minor_version_upgrade = true
  # CloudWatch log exports - export slow query and error logs to CloudWatch
  enabled_cloudwatch_logs_exports = ["audit", "error", "general", "slowquery"]
  tags = {
    Name        = "finserv-aurora-primary"
    Environment = "production"
    Region      = "primary"
  }
}
# Primary Aurora Instances (writer + 2 readers for read scaling)
resource "aws_rds_cluster_instance" "primary" {
  provider           = aws.primary
  count              = 3 # 1 writer + 2 readers
  identifier         = "finserv-aurora-primary-${count.index}"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = "db.r6g.2xlarge" # Graviton-based, cost-efficient
  engine             = aws_rds_cluster.primary.engine
  engine_version     = aws_rds_cluster.primary.engine_version
  # Promotion tier: 0 = highest priority for failover within the cluster
  # Instance 0 will be preferred as the new writer if the current writer fails
  promotion_tier               = count.index
  performance_insights_enabled = true
  monitoring_interval          = 60 # Enhanced monitoring every 60 seconds
  monitoring_role_arn          = aws_iam_role.rds_monitoring.arn
  # Automatic minor version upgrade during maintenance window
  auto_minor_version_upgrade = true
  tags = {
    Name = "finserv-aurora-primary-instance-${count.index}"
  }
}
# Aurora Cluster Parameter Group - database-level tuning
resource "aws_rds_cluster_parameter_group" "aurora_mysql8" {
  provider    = aws.primary
  name        = "finserv-aurora-mysql8"
  family      = "aurora-mysql8.0"
  description = "FinServ Aurora MySQL 8.0 parameter group"
  # Binlog configuration
  parameter {
    name  = "binlog_format"
    value = "ROW"
    # ROW-based binlog captures the actual row changes, not just the SQL statement,
    # and is more reliable than STATEMENT format for replication consumers (CDC
    # pipelines, external replicas). Note that Aurora Global Database itself
    # replicates at the storage layer and does not depend on binlog.
  }
  parameter {
    name  = "log_bin_trust_function_creators"
    value = "1"
    # Required when you have stored functions that modify data
    # Without this, Aurora rejects function creation with DETERMINISTIC flag issues
    apply_method = "immediate"
  }
  parameter {
    name  = "max_connections"
    value = "1000"
    # Increased from default (which varies by instance class)
    # For db.r6g.2xlarge, the default formula is: LEAST({DBInstanceClassMemory/12582880}, 5000)
    # Setting explicitly gives predictable behavior
    apply_method = "pending-reboot"
  }
  parameter {
    name  = "innodb_buffer_pool_size"
    value = "{DBInstanceClassMemory*3/4}"
    # Allocate 75% of instance memory to the InnoDB buffer pool
    # This is the most impactful single tuning parameter for Aurora MySQL
    # The {DBInstanceClassMemory} placeholder is Aurora-specific and resolves at runtime
    apply_method = "pending-reboot"
  }
  parameter {
    name  = "slow_query_log"
    value = "1"
    # Enable slow query logging - crucial for identifying performance bottlenecks
    apply_method = "immediate"
  }
  parameter {
    name  = "long_query_time"
    value = "1"
    # Log any query taking more than 1 second - tune this down to 0.5 once a baseline is established
    apply_method = "immediate"
  }
  parameter {
    name  = "performance_schema"
    value = "1"
    # Enable Performance Schema - provides runtime statistics for Performance Insights
    apply_method = "pending-reboot"
  }
}
# Secondary Aurora Cluster (ap-southeast-1) - read-only replica cluster
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.secondary
  cluster_identifier        = "finserv-aurora-secondary"
  engine                    = "aurora-mysql"
  engine_version            = "8.0.mysql_aurora.3.04.1"
  global_cluster_identifier = aws_rds_global_cluster.finserv.id
  # Linking to the same global_cluster_identifier makes this a secondary cluster
  db_subnet_group_name   = aws_db_subnet_group.secondary.name
  vpc_security_group_ids = [aws_security_group.aurora_secondary.id]
  # Secondary clusters do NOT specify master_username or master_password
  # These are inherited from the global cluster's primary
  storage_encrypted = true
  kms_key_id        = aws_kms_replica_key.aurora_secondary.arn
  # Note: Using the REPLICA key here, not the primary key
  # The replica key shares key material with primary but lives in ap-southeast-1
  backup_retention_period = 7
  deletion_protection     = true
  # skip_final_snapshot is false - always take a final snapshot before destroying
  skip_final_snapshot       = false
  final_snapshot_identifier = "finserv-aurora-secondary-final-snapshot"
  depends_on = [aws_rds_cluster_instance.primary]
  # Secondary cluster must be created AFTER primary instances are healthy
  tags = {
    Name        = "finserv-aurora-secondary"
    Environment = "production"
    Region      = "secondary"
  }
}

resource "aws_rds_cluster_instance" "secondary" {
  provider           = aws.secondary
  count              = 2 # Warm standby: 2 readers in secondary (scaled down vs primary's 3)
  identifier         = "finserv-aurora-secondary-${count.index}"
  cluster_identifier = aws_rds_cluster.secondary.id
  instance_class     = "db.r6g.xlarge"
  # Smaller instance class in secondary - will be scaled up during failover
  # Using db.r6g.xlarge (~$0.48/hr) vs primary's db.r6g.2xlarge (~$0.96/hr)
  # This is a deliberate warm standby cost optimization
  engine              = aws_rds_cluster.secondary.engine
  engine_version      = aws_rds_cluster.secondary.engine_version
  monitoring_interval = 60
  monitoring_role_arn = aws_iam_role.rds_monitoring_secondary.arn
  tags = {
    Name = "finserv-aurora-secondary-instance-${count.index}"
  }
}

The Replication Lag Monitoring Alert — one thing that gets overlooked:

# Monitor Aurora Global Database replication lag
# If lag exceeds 5 seconds, something is wrong with replication health
aws cloudwatch put-metric-alarm \
  --alarm-name "AuroraGlobalReplicationLag" \
  --alarm-description "Aurora Global DB replication lag exceeds acceptable threshold" \
  --metric-name "AuroraGlobalDBReplicationLag" \
  --namespace "AWS/RDS" \
  --statistic Maximum \
  --period 60 \
  --threshold 5000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --dimensions Name=DBClusterIdentifier,Value=finserv-aurora-secondary \
  --alarm-actions arn:aws:sns:ap-southeast-1:123456789012:database-alerts \
  --treat-missing-data breaching \
  --region ap-southeast-1
# The metric is in milliseconds - 5000 = 5 seconds
# treat-missing-data=breaching means that if the metric stops reporting
# (which can happen if the secondary cluster loses contact with primary),
# the alarm immediately enters ALARM state rather than silently ignoring the gap
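It is worth being explicit about what these alarm settings imply for detection time. A rough upper-bound model (a simplification of CloudWatch's actual evaluation behavior, which also depends on when datapoints arrive):

```python
def worst_case_detection_seconds(period_s: int, evaluation_periods: int) -> int:
    """Rough upper bound on time-to-ALARM: the breach must span
    `evaluation_periods` consecutive periods, plus up to one more period
    before the first breaching datapoint is fully aggregated and evaluated."""
    return period_s * (evaluation_periods + 1)

# Replication-lag alarm above: 60s period, 3 evaluation periods
print(worst_case_detection_seconds(60, 3))  # up to ~240 seconds before ALARM fires
```

With a 5-minute regional RTO target, a four-minute worst-case detection window is significant; this is why the period and evaluation count deserve as much review as the threshold itself.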

ECS Fargate Application Deployment

# ECS Cluster with Container Insights enabled
resource "aws_ecs_cluster" "primary" {
  provider = aws.primary
  name     = "finserv-production-primary"
  setting {
    name  = "containerInsights"
    value = "enabled"
    # Container Insights provides CPU, memory, network, and storage metrics
    # per task and per service - essential for troubleshooting and autoscaling
    # Note: This adds approximately $0.35 per 10M metrics to your CloudWatch bill
  }
  configuration {
    execute_command_configuration {
      kms_key_id = aws_kms_key.ecs_exec.arn
      logging    = "OVERRIDE"
      log_configuration {
        cloud_watch_encryption_enabled = true
        cloud_watch_log_group_name     = "/aws/ecs/execute-command"
        # This is where ECS Exec session commands are logged
        # Required for compliance audit trail
      }
    }
  }
  tags = {
    Name        = "finserv-ecs-primary"
    Environment = "production"
  }
}
# ECS Task Definition for Payment Processor
resource "aws_ecs_task_definition" "payment_processor" {
  provider                 = aws.primary
  family                   = "payment-processor"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  # awsvpc mode gives each task its own ENI and security group - required for Fargate
  # and essential for fine-grained security group controls at the task level
  cpu                = "2048" # 2 vCPU
  memory             = "4096" # 4 GB RAM
  execution_role_arn = aws_iam_role.ecs_execution_role.arn
  # execution_role: Used by the ECS agent to pull images, write logs, fetch secrets
  # This is NOT the role your application code uses - that's task_role_arn below
  task_role_arn = aws_iam_role.ecs_task_role.arn
  # task_role: Used by your application code at runtime
  # This role has the permission boundary we defined earlier
  container_definitions = jsonencode([
    {
      name      = "payment-processor"
      image     = "${aws_ecr_repository.payment_processor.repository_url}:latest"
      essential = true
      # essential=true means if this container stops, the entire task stops
      # For single-container task definitions, this is always true
      portMappings = [
        {
          containerPort = 8080
          protocol      = "tcp"
          # Only specify containerPort for Fargate - hostPort is automatically
          # set to the same value in awsvpc mode
        }
      ]
      # Environment variables from Secrets Manager (NOT hardcoded values)
      secrets = [
        {
          name      = "DB_PASSWORD"
          valueFrom = "arn:aws:secretsmanager:ap-south-1:123456789012:secret:finserv/aurora/password-AbCdEf"
          # Fargate fetches this secret at task startup and injects it as an env var
          # The task execution role needs secretsmanager:GetSecretValue permission
        },
        {
          name      = "REDIS_AUTH_TOKEN"
          valueFrom = "arn:aws:secretsmanager:ap-south-1:123456789012:secret:finserv/redis/auth-token-GhIjKl"
        }
      ]
      environment = [
        # Non-sensitive configuration as plain environment variables
        {
          name  = "DB_HOST"
          value = aws_rds_cluster.primary.endpoint
        },
        {
          name  = "DB_READ_HOST"
          value = aws_rds_cluster.primary.reader_endpoint
          # Aurora provides a separate reader endpoint that load-balances
          # across all read replicas - use this for read queries
        },
        {
          name  = "REGION"
          value = "ap-south-1"
        },
        {
          name  = "ENV"
          value = "production"
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/payment-processor"
          "awslogs-region"        = "ap-south-1"
          "awslogs-stream-prefix" = "ecs"
          "awslogs-create-group"  = "true"
          # awslogs-create-group: Automatically creates the log group if it doesn't exist
          # Useful during initial deployment but consider pre-creating groups with
          # specific retention and encryption settings
        }
      }
      # Health check - validates the application is actually healthy, not just running
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
        interval    = 30 # Check every 30 seconds
        timeout     = 10 # Consider unhealthy if no response in 10 seconds
        retries     = 3  # Mark as unhealthy after 3 consecutive failures
        startPeriod = 60 # Grace period during container startup (before health checks count)
        # startPeriod is critical for applications with slow startup times
        # Without it, ECS will kill your container before it's had a chance to initialize
      }
      # Resource limits
      ulimits = [
        {
          name      = "nofile"
          softLimit = 65536
          hardLimit = 65536
          # Increase open file descriptor limit
          # Payment processors can have many concurrent connections - default 1024 is too low
        }
      ]
    }
  ])
  tags = {
    Name = "payment-processor-task-def"
  }
}
# ECS Service - manages how many task copies run and how they're deployed
resource "aws_ecs_service" "payment_processor" {
  provider        = aws.primary
  name            = "payment-processor"
  cluster         = aws_ecs_cluster.primary.id
  task_definition = aws_ecs_task_definition.payment_processor.arn
  desired_count   = 6 # 2 tasks per AZ across 3 AZs
  launch_type     = "FARGATE"
  # Enable ECS Exec for debugging (with audit logging configured above)
  enable_execute_command = true
  # Deployment configuration
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
  # These settings mean: during a deployment, ECS can run up to 2x the desired_count
  # tasks, but must keep at least desired_count healthy tasks running at all times
  # This ensures zero-downtime deployments
  deployment_circuit_breaker {
    enable   = true
    rollback = true
    # If a new deployment fails health checks, automatically roll back to the previous version
    # This is a critical safety net for production deployments
  }
  network_configuration {
    subnets          = aws_subnet.application[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false # Tasks in private subnets, no public IPs
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.payment_processor.arn
    container_name   = "payment-processor"
    container_port   = 8080
  }
  # Service discovery for internal service-to-service communication
  service_registries {
    registry_arn = aws_service_discovery_service.payment_processor.arn
  }
  # AZ spread: Fargate automatically distributes tasks across the AZs of the
  # subnets you provide - explicit placement strategies are only supported on
  # the EC2 launch type, so listing all three application subnets above is
  # what guarantees the 2-tasks-per-AZ spread here
  depends_on = [
    aws_lb_listener.https,
    aws_iam_role_policy_attachment.ecs_task_execution
  ]
  tags = {
    Name = "payment-processor-service"
  }
}
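The 200% / 100% deployment settings define a hard envelope on task counts during a rolling deploy. A quick back-of-the-envelope model (illustrative arithmetic, not the ECS scheduler itself):

```python
import math

def deployment_task_bounds(desired: int, max_pct: int, min_healthy_pct: int):
    """Task-count envelope ECS honors during a rolling deployment:
    (fewest healthy tasks allowed, most tasks ever running)."""
    upper = desired * max_pct // 100
    lower = math.ceil(desired * min_healthy_pct / 100)
    return lower, upper

# payment-processor: desired_count = 6, 200% maximum, 100% minimum healthy
print(deployment_task_bounds(6, 200, 100))  # (6, 12)
```

The (6, 12) envelope is why the deployment is zero-downtime: capacity never dips below the steady-state 6 tasks, at the cost of briefly paying for up to 12.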

Phase 3: Advanced Features — Automated Failover with Route 53 ARC

Route 53 Application Recovery Controller

This is where the multi-region magic becomes operational. ARC provides a control plane for
managing which region is "active" from a traffic routing perspective, with validation checks to
ensure the target region is actually ready to receive traffic before flipping the switch.

# Route 53 ARC - Cluster (the top-level ARC resource)
resource "aws_route53recoverycontrolconfig_cluster" "finserv" {
  name = "finserv-recovery-cluster"
  # A cluster is the logical grouping of your multi-region recovery configuration
  # ARC clusters are global resources, not regional
}

# Control Panel - a grouping of routing controls
resource "aws_route53recoverycontrolconfig_control_panel" "production" {
  name        = "production-control-panel"
  cluster_arn = aws_route53recoverycontrolconfig_cluster.finserv.arn
}

# Routing Controls - boolean switches that control traffic flow to each region
resource "aws_route53recoverycontrolconfig_routing_control" "primary" {
  name              = "primary-region-routing-control"
  cluster_arn       = aws_route53recoverycontrolconfig_cluster.finserv.arn
  control_panel_arn = aws_route53recoverycontrolconfig_control_panel.production.arn
}

resource "aws_route53recoverycontrolconfig_routing_control" "secondary" {
  name              = "secondary-region-routing-control"
  cluster_arn       = aws_route53recoverycontrolconfig_cluster.finserv.arn
  control_panel_arn = aws_route53recoverycontrolconfig_control_panel.production.arn
}

# Safety Rules - prevent both regions from being off simultaneously
resource "aws_route53recoverycontrolconfig_safety_rule" "min_one_active" {
  asserted_controls = [
    aws_route53recoverycontrolconfig_routing_control.primary.arn,
    aws_route53recoverycontrolconfig_routing_control.secondary.arn
  ]
  control_panel_arn = aws_route53recoverycontrolconfig_control_panel.production.arn
  name              = "minimum-one-region-active"
  rule_config {
    inverted  = false
    threshold = 1
    type      = "ATLEAST"
    # ATLEAST 1 of the routing controls must be ON at all times
    # ARC will REFUSE any operation that would violate this rule
    # This prevents the catastrophic scenario of accidentally disabling BOTH regions
  }
  wait_period_ms = 5000
  # Wait 5 seconds before applying the rule after a routing control change
  # Gives time for DNS propagation to begin before new traffic arrives
}
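The ATLEAST rule's semantics fit in a few lines. This is an illustrative model of the invariant, not the ARC API:

```python
def atleast_rule_allows(proposed_states: dict, threshold: int) -> bool:
    """Model of an ARC ATLEAST assertion rule: a state change is accepted
    only if at least `threshold` of the asserted controls end up ON."""
    return sum(1 for on in proposed_states.values() if on) >= threshold

# Normal failover: primary Off, secondary On - one control stays ON, allowed
print(atleast_rule_allows({"primary": False, "secondary": True}, 1))   # True
# Fat-fingered change that would turn BOTH regions off - ARC rejects it
print(atleast_rule_allows({"primary": False, "secondary": False}, 1))  # False
```

This is also why the atomic plural API call shown later in this section matters: the rule is evaluated against the end state of the whole change, so flipping both controls in one request can never strand you with zero active regions.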
# Route 53 Health Check backed by ARC Routing Control
resource "aws_route53_health_check" "primary_arc" {
  type                = "RECOVERY_CONTROL"
  routing_control_arn = aws_route53recoverycontrolconfig_routing_control.primary.arn
  tags = {
    Name = "primary-region-arc-health-check"
  }
}

resource "aws_route53_health_check" "secondary_arc" {
  type                = "RECOVERY_CONTROL"
  routing_control_arn = aws_route53recoverycontrolconfig_routing_control.secondary.arn
  tags = {
    Name = "secondary-region-arc-health-check"
  }
}
# Route 53 DNS Records with failover routing
resource "aws_route53_record" "api_primary" {
  zone_id        = var.route53_zone_id
  name           = "api.finserv.example.com" # the public API hostname (placeholder domain)
  type           = "A"
  set_identifier = "primary"
  failover_routing_policy {
    type = "PRIMARY"
  }
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
  health_check_id = aws_route53_health_check.primary_arc.id
  # This record is only active when the ARC routing control for primary is ON
}

resource "aws_route53_record" "api_secondary" {
  provider       = aws.secondary
  zone_id        = var.route53_zone_id
  name           = "api.finserv.example.com" # same name, SECONDARY failover record
  type           = "A"
  set_identifier = "secondary"
  failover_routing_policy {
    type = "SECONDARY"
  }
  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
  health_check_id = aws_route53_health_check.secondary_arc.id
}
Triggering Failover — The ARC Command:

# MANUAL FAILOVER: Toggle routing controls to shift traffic from primary to secondary
# Note: routing control STATES are changed through the route53-recovery-cluster
# data plane (against a cluster endpoint), not route53-recovery-control-config
# Step 1: Turn OFF primary routing control
aws route53-recovery-cluster update-routing-control-state \
  --routing-control-arn arn:aws:route53-recovery-control::123456789012:controlpanel/abc/routingcontrol/primary \
  --routing-control-state Off \
  --endpoint-url "${ARC_CLUSTER_ENDPOINT}"
# Step 2: Turn ON secondary routing control (do this in the SAME API call if possible)
aws route53-recovery-cluster update-routing-control-states \
  --update-routing-control-state-entries \
  "[{\"RoutingControlArn\":\"arn:aws:route53-recovery-control::123456789012:controlpanel/abc/routingcontrol/primary\",\"RoutingControlState\":\"Off\"},{\"RoutingControlArn\":\"arn:aws:route53-recovery-control::123456789012:controlpanel/abc/routingcontrol/secondary\",\"RoutingControlState\":\"On\"}]" \
  --endpoint-url "${ARC_CLUSTER_ENDPOINT}"
# ${ARC_CLUSTER_ENDPOINT} is one of the cluster endpoints ARC hands you at
# cluster creation time - keep all of them in your runbook
# Using update-routing-control-states (plural) allows atomic changes across multiple controls
# This is important - it ensures the Safety Rule can validate the entire change at once
# and won't leave you in a state where both regions are OFF
# Step 3: Promote Aurora secondary to primary
aws rds failover-global-cluster \
  --global-cluster-identifier finserv-global \
  --target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finserv-aurora-secondary \
  --region ap-south-1
# This command initiates the Aurora Global Database failover
# --target-db-cluster-identifier: the secondary cluster you want to promote to writer
# The operation typically completes in under 1 minute for planned failovers
# For unplanned failovers (primary region completely unavailable), use the --allow-data-loss flag

Pro Tip: The ARC cluster endpoints used in the CLI commands above are part of ARC's own
high-availability design. Each cluster exposes endpoints in multiple regions, deployed across
multiple Availability Zones, specifically so that failover operations remain available even
during regional events. Always store the ARC cluster endpoints in your runbook — don't assume
you can look them up during an actual outage.

AWS Global Accelerator Configuration

resource "aws_globalaccelerator_accelerator" "finserv" {


name = "finserv-global-accelerator"
ip_address_type = "IPV4"
enabled = true
attributes {
flow_logs_enabled = true
flow_logs_s3_bucket = "finserv-globalaccelerator-logs"
flow_logs_s3_prefix = "flow-logs/"
# Global Accelerator has its own flow logs separate from VPC Flow Logs
# These capture traffic at the Global Accelerator edge, before it enters
the AWS backbone
}
}
resource "aws_globalaccelerator_listener" "https" {
accelerator_arn = aws_globalaccelerator_accelerator.[Link]
client_affinity = "SOURCE_IP"
# SOURCE_IP affinity ensures that requests from the same client IP
# are routed to the same endpoint group (region)
# Important for stateful connections and session consistency
protocol = "TCP"
port_range {
from_port = 443
to_port = 443
}
}
# Endpoint group for primary region - higher traffic dial setting
resource "aws_globalaccelerator_endpoint_group" "primary" {
listener_arn = aws_globalaccelerator_listener.[Link]
endpoint_group_region = "ap-south-1"
traffic_dial_percentage = 100
# 100% of traffic goes to primary under normal conditions
# During failover, this will be set to 0 and secondary to 100
health_check_path = "/health"
health_check_protocol = "HTTP"
health_check_interval_seconds = 10
threshold_count = 3
endpoint_configuration {
endpoint_id = aws_lb.[Link]
weight = 100
client_ip_preservation_enabled = true
# Preserve the original client IP address as traffic passes through Global
Accelerator
# Required for IP-based rate limiting and fraud detection at the
application layer
}
}
resource "aws_globalaccelerator_endpoint_group" "secondary" {
listener_arn = aws_globalaccelerator_listener.[Link]
endpoint_group_region = "ap-southeast-1"
traffic_dial_percentage = 0
# Starts at 0% - no traffic routed to secondary under normal operations
health_check_path = "/health"
health_check_protocol = "HTTP"
health_check_interval_seconds = 10
threshold_count = 3
endpoint_configuration {
endpoint_id = aws_lb.[Link]
weight = 100
client_ip_preservation_enabled = true
}
}
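Traffic dials and endpoint weights compose: to a first approximation, new flows are split across endpoint groups in proportion to their dials, then split within a group by endpoint weight. A simplified model (Global Accelerator also factors in health and client proximity, which this ignores):

```python
def traffic_share(dials: dict, weights: dict) -> dict:
    """Approximate share of new flows each endpoint receives.
    `dials` maps endpoint-group region -> traffic dial percentage;
    `weights` maps region -> {endpoint name -> weight}."""
    total_dial = sum(dials.values())
    shares = {}
    for region, dial in dials.items():
        group_share = dial / total_dial if total_dial else 0.0
        weight_sum = sum(weights[region].values())
        for endpoint, w in weights[region].items():
            shares[f"{region}/{endpoint}"] = group_share * (w / weight_sum)
    return shares

# Normal operations: primary dial 100, secondary dial 0, one ALB each at weight 100
print(traffic_share({"ap-south-1": 100, "ap-southeast-1": 0},
                    {"ap-south-1": {"alb": 100}, "ap-southeast-1": {"alb": 100}}))
```

Setting the dials to 50/50 instead of 100/0 is how you would run an active-active split, or rehearse a failover with a small slice of real traffic.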

Phase 3 (Continued): Monitoring, Alerting, and Observability


A 99.99% SLA is only meaningful if you can detect degradation before it becomes an outage. The
monitoring architecture I designed follows three principles: measure what matters, alert on
symptoms not causes, and automate everything you can.

CloudWatch Dashboard (via Terraform):

resource "aws_cloudwatch_dashboard" "finserv_operations" {


dashboard_name = "FinServ-Operations"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
title = "API Response Time (P99)"
period = 60
metrics = [
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer",
aws_lb.primary.arn_suffix, { stat = "p99", label = "Primary
Region P99" }],
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer",
aws_lb.secondary.arn_suffix, { stat = "p99", label = "Secondary
Region P99" }]
]
view = "timeSeries"
yAxis = { left = { label = "Seconds", min = 0 } }
}
},
{
type = "metric"
properties = {
title = "Aurora Replication Lag"
period = 60
metrics = [
["AWS/RDS", "AuroraGlobalDBReplicationLag",
"DBClusterIdentifier", "finserv-aurora-secondary",
{ stat = "Maximum", label = "Replication Lag (ms)" }]
]
annotations = {
horizontal = [{ value = 5000, label = "Alert Threshold (5s)",
color = "#ff0000" }]
}
}
},
{
type = "metric"
properties = {
title = "ECS Task Count"
period = 60
metrics = [
["ECS/ContainerInsights", "RunningTaskCount",
"ClusterName", "finserv-production-primary",
"ServiceName", "payment-processor",
{ stat = "Average", label = "Primary Running Tasks" }],
["ECS/ContainerInsights", "RunningTaskCount",
"ClusterName", "finserv-production-secondary",
"ServiceName", "payment-processor",
{ stat = "Average", label = "Secondary Running Tasks" }]
]
}
}
]
})
}

Auto Scaling Configuration for ECS:

# Application Auto Scaling target
resource "aws_appautoscaling_target" "ecs_target" {
  provider           = aws.primary
  max_capacity       = 50 # Maximum 50 tasks across all AZs
  min_capacity       = 6  # Minimum 6 tasks (2 per AZ)
  resource_id        = "service/finserv-production-primary/payment-processor"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Scale OUT policy: Add tasks when ALB request count per target exceeds threshold
resource "aws_appautoscaling_policy" "scale_out_requests" {
  provider           = aws.primary
  name               = "scale-out-on-request-count"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace
  target_tracking_scaling_policy_configuration {
    target_value = 1000.0
    # Target: 1000 requests per minute per ECS task
    # If average exceeds 1000, scale out. If below, scale in.
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.primary.arn_suffix}/${aws_lb_target_group.payment_processor.arn_suffix}"
    }
    scale_in_cooldown  = 300 # Wait 5 minutes before scaling IN (prevents flapping)
    scale_out_cooldown = 60  # Only wait 1 minute before scaling OUT (respond quickly to load)
    # Asymmetric cooldowns are deliberate: scale out fast, scale in conservatively
  }
}
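Target tracking essentially solves for the task count that brings the per-task metric back to the target, clamped to the min/max bounds. A rough model of that arithmetic (not the exact CloudWatch algorithm, which also applies cooldowns and alarm evaluation):

```python
import math

def target_tracking_desired(current_tasks: int, metric_per_task: float,
                            target: float, min_cap: int, max_cap: int) -> int:
    """Task count that restores metric_per_task to the target,
    clamped to the scaling bounds."""
    ideal = math.ceil(current_tasks * metric_per_task / target)
    return max(min_cap, min(max_cap, ideal))

# 6 tasks each absorbing 1800 requests/min against a 1000 target -> scale to 11
print(target_tracking_desired(6, 1800, 1000.0, 6, 50))
# Quiet period: the floor of 6 tasks holds even if demand would justify fewer
print(target_tracking_desired(6, 100, 1000.0, 6, 50))
```

This also makes the capacity ceiling concrete: at 50 tasks and 1000 requests/minute each, the policy tops out around 50,000 requests/minute before the service saturates.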
# Scheduled scaling: Pre-warm for known high-traffic periods
resource "aws_appautoscaling_scheduled_action" "prewarm_morning" {
  provider           = aws.primary
  name               = "prewarm-morning-peak"
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  schedule           = "cron(30 2 * * ? *)"
  # 2:30 AM UTC = 8:00 AM IST - Pre-warm 30 minutes before business hours
  scalable_target_action {
    min_capacity = 18 # Raise minimum to 18 tasks (6 per AZ) before peak
    max_capacity = 50
  }
}

resource "aws_appautoscaling_scheduled_action" "scale_down_evening" {
  provider           = aws.primary
  name               = "scale-down-after-business-hours"
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  schedule           = "cron(0 15 * * ? *)"
  # 3:00 PM UTC = 8:30 PM IST - Reduce minimum after business hours
  scalable_target_action {
    min_capacity = 6
    max_capacity = 50
  }
}
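Scheduled-scaling cron expressions are evaluated in UTC, and off-by-half-hour mistakes with IST (UTC+05:30) are common. A quick check of the conversions used above:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def utc_cron_to_ist(hour_utc: int, minute_utc: int) -> str:
    """Convert a UTC cron hour/minute to IST wall-clock time (the date is
    arbitrary since India does not observe daylight saving)."""
    utc = datetime(2024, 1, 15, hour_utc, minute_utc, tzinfo=timezone.utc)
    return utc.astimezone(ZoneInfo("Asia/Kolkata")).strftime("%H:%M")

print(utc_cron_to_ist(2, 30))  # prewarm-morning-peak:  cron(30 2 ...) -> 08:00 IST
print(utc_cron_to_ist(15, 0))  # scale-down-evening:    cron(0 15 ...) -> 20:30 IST
```

Encoding this as a unit test in the Terraform repository's CI is cheap insurance against someone "fixing" the cron expressions later.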

Phase 4: Optimization — Fine-Tuning After Go-Live

Performance Tuning

The initial post-launch metrics were promising but not where we needed them. Average API
response time had dropped from 340ms to 185ms, but the P99 was still at 1.8 seconds — higher
than the 500ms target.

Root cause analysis using X-Ray distributed tracing identified three culprits:

1. Cold start latency in Fargate — New tasks took 45-60 seconds to initialize before
accepting traffic, meaning that scale-out events created temporary capacity gaps
2. Aurora connection pooling — The application was creating a new database connection
per API request rather than using a connection pool

3. Synchronous downstream API calls — Three external payment gateway API calls were
being made sequentially rather than in parallel
For the connection-per-request problem, I implemented RDS Proxy as a connection pooler between
the application and Aurora. For the Fargate cold start issue, I added a warm buffer to the ECS
service by keeping a small fleet of pre-initialized tasks running.

# Create RDS Proxy to handle connection pooling
aws rds create-db-proxy \
  --db-proxy-name finserv-aurora-proxy \
  --engine-family MYSQL \
  --auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:ap-south-1:123456789012:secret:finserv/aurora/password-AbCdEf","IAMAuth":"REQUIRED"}]' \
  --role-arn arn:aws:iam::123456789012:role/RDSProxyRole \
  --vpc-subnet-ids subnet-0a1b2c3d subnet-0e4f5a6b subnet-0c7d8e9f \
  --vpc-security-group-ids sg-0proxy123456 \
  --require-tls \
  --region ap-south-1

# What RDS Proxy does:
# - Maintains a pool of database connections (avoiding the overhead of
#   connection establishment per request)
# - Multiplexes application connections onto a smaller number of database connections
# - Handles Aurora failover transparently - the application connects to the proxy
#   endpoint, and the proxy reconnects to the new writer without the application
#   seeing the disruption
# --require-tls: Enforces TLS for all connections to the proxy (compliance requirement)
# IAMAuth=REQUIRED: Applications must use IAM authentication tokens, not static passwords
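The third culprit, sequential downstream calls, was addressed by issuing the three gateway requests concurrently. A minimal sketch with asyncio (the gateway names, latencies, and the stand-in coroutine are illustrative, not the production integration):

```python
import asyncio

async def call_gateway(name: str, latency_s: float) -> str:
    # Stand-in for an HTTP call to an external payment gateway
    # (production code would use an async HTTP client such as aiohttp).
    await asyncio.sleep(latency_s)
    return f"{name}:ok"

async def validate_payment() -> list:
    # Sequential calls cost the SUM of the three latencies;
    # asyncio.gather costs roughly the MAX of them.
    return await asyncio.gather(
        call_gateway("gateway-a", 0.12),
        call_gateway("gateway-b", 0.09),
        call_gateway("gateway-c", 0.15),
    )

results = asyncio.run(validate_payment())
print(results)  # ['gateway-a:ok', 'gateway-b:ok', 'gateway-c:ok']
```

With three calls in the 90-150ms range, the sequential path costs ~360ms while the concurrent path costs ~150ms, which is most of the P99 improvement for this particular culprit.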
Chapter 6: Disaster Recovery — RTO, RPO, and the
Mechanics of Failover

Defining the Recovery Objectives

Before implementing DR, you need agreement on what the targets actually are. I facilitated a
workshop with the FinServ Co. leadership team to define these precisely:

Scenario RTO Target RPO Target Priority

AZ-level failure < 60 seconds 0 (Multi-AZ) P0

Regional failure (unplanned) < 5 minutes < 1 second (Aurora replication lag) P0

Regional failure (planned) < 2 minutes 0 (managed planned failover) P1

Data corruption/logical error < 30 minutes < 24 hours (point-in-time restore) P1

Full account compromise < 4 hours < 1 hour P2

RTO (Recovery Time Objective) is how quickly you must restore service. RPO (Recovery Point

Objective) is how much data loss is acceptable — specifically, how far back you can tolerate
rolling back to.

For the 99.99% SLA math: with a 5-minute RTO and assuming one regional event per year, a
single incident consumes roughly 10% of the annual downtime budget. That leaves margin for
smaller incidents throughout the year.
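The budget math above is easy to sanity-check (the 5-minute figure is the regional-failover RTO target from the table):

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability) * days * 24 * 60

budget = downtime_budget_minutes(0.9999)  # minutes/year allowed at 99.99%
incident_share = 5 / budget               # one 5-minute regional failover
print(f"{budget:.1f} min/year, single incident = {incident_share:.1%}")
```

A 99.99% target allows about 52.6 minutes of downtime per year, so a single 5-minute failover consumes roughly 9.5% of it, consistent with the "roughly 10%" figure above.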

The Four DR Strategies and Why We Chose Warm Standby

AWS documents four DR strategies, each with different cost and recovery time implications:

1. Backup and Restore

• RTO: Hours to days
• RPO: Hours (dependent on backup frequency)
• Cost: Very low (pay only for backup storage)
• Use case: Non-critical workloads, test/dev environments

2. Pilot Light

• RTO: 30-60 minutes (need to provision and scale up infrastructure from minimal state)
• RPO: Minutes to hours
• Cost: Low (only core data replication services running)
• Use case: Important but not time-critical systems

3. Warm Standby

• RTO: Minutes (5-15 minutes typically)
• RPO: Seconds (with Aurora Global Database)
• Cost: Moderate (secondary infrastructure running at reduced capacity)
• Use case: Business-critical systems requiring 99.99% — our choice

4. Active-Active (Multi-Site)

• RTO: Near-zero (traffic already flowing to both regions)
• RPO: Near-zero
• Cost: High (full production capacity in both regions simultaneously)
• Use case: Mission-critical systems where even seconds of downtime are unacceptable

For FinServ Co., Warm Standby was the right balance. The secondary region runs at
approximately 40% of primary capacity cost — enough to serve traffic immediately upon failover
but without the expense of full active-active.

Failover Runbook (Step-by-Step)

This runbook was encoded into an SSM Automation document so it can be executed with a
single command. Here's the human-readable version with explanations:

Pre-failover validation (T-0):

1. Confirm that the primary region is genuinely degraded (not a false alarm)
   - Check Global Accelerator health check status
   - Check Route 53 ARC cluster routing control states
   - Verify with at least two independent monitoring sources
2. Check Aurora Global Database replication lag
   - If lag > 30 seconds, assess whether data loss is acceptable
   - If lag is within SLA (< 1 second), proceed
3. Confirm secondary region health
   - Verify ECS service is running (warm standby tasks healthy)
   - Verify Aurora secondary cluster is reachable
   - Verify Secrets Manager replicas are accessible
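Step 2 of the pre-flight check can be automated against the `AuroraGlobalDBReplicationLag` CloudWatch metric. A sketch (the cluster identifier and regions are illustrative):

```python
from datetime import datetime, timedelta, timezone

def latest_lag_ms(datapoints: list) -> float:
    """Pick the newest datapoint; no data at all counts as a failed check."""
    if not datapoints:
        raise RuntimeError("No lag datapoints - treat as a failed pre-flight check")
    newest = max(datapoints, key=lambda p: p["Timestamp"])
    return newest["Maximum"]

def fetch_lag_datapoints(cluster_id: str, region: str = "ap-southeast-1") -> list:
    """Fetch the last 5 minutes of AuroraGlobalDBReplicationLag (milliseconds)."""
    import boto3  # deferred import so latest_lag_ms() is testable offline
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBReplicationLag",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    return resp["Datapoints"]

# Gate the failover on the < 1 second RPO target:
# lag = latest_lag_ms(fetch_lag_datapoints("finserv-aurora-secondary"))
# assert lag < 1000, f"Replication lag {lag} ms exceeds the RPO budget"
```

The same check is what the SSM Automation document runs before it proceeds to the execution steps.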

Failover execution (T+0 to T+5 minutes):

# Step 1: Update Global Accelerator to stop sending new traffic to primary
aws globalaccelerator update-endpoint-group \
  --endpoint-group-arn arn:aws:globalaccelerator::123456789012:accelerator/abc/listener/def/endpoint-group/primary \
  --traffic-dial-percentage 0 \
  --region us-west-2 # Global Accelerator is always managed from us-west-2

aws globalaccelerator update-endpoint-group \
  --endpoint-group-arn arn:aws:globalaccelerator::123456789012:accelerator/abc/listener/def/endpoint-group/secondary \
  --traffic-dial-percentage 100 \
  --region us-west-2

# Step 2: Update ARC routing controls (data-plane call against one of the
# ARC cluster's regional endpoints)
aws route53-recovery-cluster update-routing-control-states \
  --update-routing-control-state-entries \
    "[{\"RoutingControlArn\":\"PRIMARY_RC_ARN\",\"RoutingControlState\":\"Off\"},{\"RoutingControlArn\":\"SECONDARY_RC_ARN\",\"RoutingControlState\":\"On\"}]" \
  --endpoint-url https://<arc-cluster-endpoint>
# Step 3: Scale up ECS services in secondary region
aws ecs update-service \
  --cluster finserv-production-secondary \
  --service payment-processor \
  --desired-count 18 \
  --region ap-southeast-1

# Step 4: Promote Aurora secondary to writer
aws rds failover-global-cluster \
  --global-cluster-identifier finserv-global \
  --target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finserv-aurora-secondary \
  --region ap-south-1

# Step 5: Update application configuration to point to the secondary Aurora endpoint
# (handled automatically when using RDS Proxy + Aurora Global Database, since the
#  application connects to the proxy, which reconnects to the new writer)

Post-failover validation (T+5 to T+10 minutes):

# Verify synthetic transaction succeeds against secondary region
curl -X POST "$SECONDARY_API_ENDPOINT" \
  -H "Authorization: Bearer $TEST_TOKEN" \
  -d '{"amount": 1, "currency": "INR", "type": "validation_test"}' \
  -w "\nHTTP Status: %{http_code}\nTotal Time: %{time_total}s\n"

# Check that ECS tasks are healthy in secondary
aws ecs describe-services \
  --cluster finserv-production-secondary \
  --services payment-processor \
  --query 'services[0].{Running:runningCount,Pending:pendingCount,Desired:desiredCount}' \
  --region ap-southeast-1
Chapter 7: Challenges and How I Solved Them

Challenge 1: VPC Flow Logs Cost Spiral

What happened: Three weeks after enabling VPC Flow Logs to CloudWatch, the monthly AWS
bill came in $2,400 higher than projected. The culprit: CloudWatch Logs ingestion charges for a
high-traffic VPC were enormous — we were generating approximately 50GB of flow log data per
day, which at $0.50/GB ingestion works out to roughly $25/day, or about $750/month.

How I troubleshot it:

# Check CloudWatch Logs ingestion metrics to identify the source
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value=/aws/vpc/flow-logs/primary \
  --start-time 2025-10-01T00:00:00Z \
  --end-time 2025-10-08T00:00:00Z \
  --period 86400 \
  --statistics Sum \
  --region ap-south-1

# Identify which ENIs were generating the most log volume
# Query using CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /aws/vpc/flow-logs/primary \
  --start-time $(date -d '1 day ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'stats sum(bytes) as TotalBytes by interfaceId | sort TotalBytes desc | limit 20' \
  --region ap-south-1

Solution: I restructured the Flow Logs setup to:

1. Send Flow Logs primarily to S3 with Parquet format (reduces storage costs by ~75% vs
standard text format)
2. Keep CloudWatch Logs only for specific subnets (database subnets, management
subnets) where real-time alerting matters most
3. Implement a CloudWatch Logs subscription filter to extract only REJECT records to a
separate smaller log group for metric filters

This reduced the CloudWatch Logs ingestion cost from roughly $750 to approximately $45 per month.

Lesson learned: Always model your VPC Flow Logs volume before enabling them. A rough
estimate: take your peak network throughput in GB/day, assume Flow Log metadata is
approximately 5-10% of actual traffic volume, and calculate accordingly. For high-throughput
environments, S3-first is almost always the right approach.
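That back-of-envelope calculation can be sketched as a one-liner (the 7.5% metadata overhead and the $0.50/GB ingestion price are assumptions; check current regional pricing):

```python
def flow_log_ingestion_cost_per_month(traffic_gb_per_day: float,
                                      metadata_fraction: float = 0.075,
                                      price_per_gb: float = 0.50,
                                      days: int = 30) -> float:
    """Estimated monthly CloudWatch Logs ingestion cost for VPC Flow Logs."""
    return traffic_gb_per_day * metadata_fraction * price_per_gb * days

# ~667 GB/day of traffic at 7.5% flow-log overhead is ~50 GB/day of logs
print(round(flow_log_ingestion_cost_per_month(667), 2))
```

If the estimate comes out to more than a few hundred dollars a month, S3-first delivery in Parquet format is the safer default.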

Challenge 2: Aurora Failover Disconnecting Existing Sessions

What happened: During our first DR drill, we successfully promoted the Aurora secondary
cluster to writer in under 90 seconds. However, in-flight database transactions during the
promotion window were not gracefully handled — they received connection errors, and
approximately 0.3% of payment requests during the failover window resulted in error responses
to users.

Root cause: The application was connecting directly to the Aurora cluster writer endpoint. During
promotion, the writer endpoint becomes unavailable for 30-60 seconds while the new writer
initializes. The application's database connection pool didn't have adequate retry logic.

Solution:

1. Implement RDS Proxy between the application and Aurora. RDS Proxy handles the
   writer endpoint change transparently and queues requests during the brief promotion
   window, dramatically reducing connection errors.
2. Add retry logic to the application with exponential backoff for database operations
   (maximum 3 retries with 100ms, 200ms, 400ms backoff).
3. Implement the circuit breaker pattern using the Resilience4j library — if database
   errors exceed 50% of requests in a 10-second window, open the circuit and return a
   graceful degraded response rather than cascading failures.
# Create RDS Proxy target group pointing to Aurora Global Database
aws rds register-db-proxy-targets \
  --db-proxy-name finserv-aurora-proxy \
  --db-cluster-identifiers finserv-aurora-primary \
  --region ap-south-1

# The proxy target group automatically detects the Aurora writer endpoint
# After failover, the proxy reconnects to the new writer endpoint
# Application connection strings remain pointed at the proxy endpoint throughout

Lesson learned: RDS Proxy isn't optional for production Aurora deployments doing frequent

failovers — it's a necessity. The 8-minute timeout default for connection pinning in RDS Proxy
also needs tuning for transaction-heavy workloads; we reduced it to 120 seconds.
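The retry policy from step 2 can be sketched as a small helper (the retriable exception types are illustrative; real code would catch the database driver's specific transient errors):

```python
import time

def with_retries(op, max_retries=3, base_delay=0.1, retriable=(ConnectionError,)):
    """Run op(); on a transient error, back off 100ms, 200ms, 400ms before
    retrying (at most max_retries retries), then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except retriable:
            if attempt == max_retries:
                raise  # budget exhausted - surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky query: fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failover blip")
    return "ok"

print(with_retries(flaky_query))  # prints "ok" after two retried failures
```

The total worst-case added latency (700ms) sits comfortably inside the 30-60 second promotion window that RDS Proxy papers over.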

Challenge 3: Secrets Manager Secret Replication Lag

What happened: During the first failover test, the secondary region ECS tasks failed to start
because Secrets Manager replica secrets weren't immediately available. Specifically, a newly
rotated secret in the primary region had not yet replicated to the secondary before we triggered
the failover.

Root cause: Secrets Manager cross-region replication is eventually consistent. The default
rotation and replication cycle meant there could be a 5-10 second window where a newly rotated
secret existed only in the primary region.

Solution: Implement a validation Lambda that runs hourly and verifies that all critical secrets
exist and are accessible in the secondary region, alerting before any gap becomes a problem
during an actual failover.

import boto3

def lambda_handler(event, context):
    """
    Validates that all critical secrets are replicated and accessible
    in the secondary region. Publishes a CloudWatch metric for alerting.
    """
    secondary_sm = boto3.client('secretsmanager', region_name='ap-southeast-1')
    cloudwatch = boto3.client('cloudwatch', region_name='ap-south-1')

    # List of critical secrets that MUST be available in secondary
    critical_secrets = [
        'finserv/aurora/password',
        'finserv/redis/auth-token',
        'finserv/payment-gateway/api-key',
        'finserv/jwt/signing-key'
    ]

    replication_failures = 0

    for secret_name in critical_secrets:
        try:
            # Try to get the secret in the secondary region
            response = secondary_sm.get_secret_value(SecretId=secret_name)

            # Also verify the secret value is non-empty (not just that the ARN exists)
            if not response.get('SecretString') and not response.get('SecretBinary'):
                print(f"WARNING: Secret {secret_name} exists but has no value in secondary region")
                replication_failures += 1

        except secondary_sm.exceptions.ResourceNotFoundException:
            print(f"CRITICAL: Secret {secret_name} NOT FOUND in secondary region")
            replication_failures += 1
        except Exception as e:
            print(f"ERROR checking secret {secret_name}: {str(e)}")
            replication_failures += 1

    # Publish metric to CloudWatch
    cloudwatch.put_metric_data(
        Namespace='FinServ/DR',
        MetricData=[
            {
                'MetricName': 'SecretsReplicationFailures',
                'Value': replication_failures,
                'Unit': 'Count'
            }
        ]
    )

    return {
        'statusCode': 200,
        'checked': len(critical_secrets),
        'failures': replication_failures
    }

Lesson learned: DR validation should be a continuous automated process, not something you
only check during a planned drill. This Lambda runs every hour and would alert the on-call team
within 60 minutes of any replication failure.

Challenge 4: NAT Gateway Bottleneck During Failover Scale-Up

What happened: When we scaled the secondary region's ECS tasks from 6 (warm standby) to 18
(production capacity) during a failover drill, we observed a 25-second spike in container startup
latency. The culprit: all 18 tasks were simultaneously pulling their container images from ECR
through the NAT Gateway, saturating its bandwidth.

Solution:

1. ECR VPC Endpoints — Eliminate NAT Gateway for ECR image pulls by routing through a
   VPC Endpoint
2. ECR image pre-warming — Run a scheduled task in the secondary region every 6 hours
   that pulls the latest container image, ensuring the image is cached locally in the
   container host's image cache

# Create ECR VPC Endpoints in secondary region

# Endpoint for ECR API (image manifest retrieval)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0secondary789 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.ap-southeast-1.ecr.api \
  --subnet-ids subnet-sec-1a subnet-sec-1b \
  --security-group-ids sg-secondary-endpoints \
  --private-dns-enabled \
  --region ap-southeast-1

# Endpoint for ECR Docker registry (actual image layer downloads)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0secondary789 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.ap-southeast-1.ecr.dkr \
  --subnet-ids subnet-sec-1a subnet-sec-1b \
  --security-group-ids sg-secondary-endpoints \
  --private-dns-enabled \
  --region ap-southeast-1

# S3 Gateway Endpoint (ECR uses S3 to store image layers)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0secondary789 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.ap-southeast-1.s3 \
  --route-table-ids rtb-sec-private-1 rtb-sec-private-2 \
  --region ap-southeast-1

Lesson learned: VPC Endpoints for ECR should be standard in any production Fargate

deployment. They improve security (traffic stays on AWS backbone), reduce NAT Gateway costs,
and eliminate NAT Gateway bandwidth as a constraint.

Challenge 5: False Positive Health Check Failures from CloudWatch Synthetics Canaries

What happened: We configured CloudWatch Synthetics canaries to test the API endpoints every
minute. During the initial weeks, we received several 2 AM alerts that turned out to be false
positives — the canary reported failures that users weren't actually experiencing.

Root cause investigation: The canary was running HTTP/1.1 requests against the ALB endpoint.
During certain periods, the ALB was initiating keep-alive connection recycling, which caused the
canary's first request in a new connection to experience a slightly elevated response time that
crossed the canary's timeout threshold.

Solution:

1. Adjust canary timeout threshold from 2 seconds to 5 seconds (matching our P99
   performance baseline)
2. Configure the canary to retry once before marking a failure (single-point failures
   shouldn't trigger alerts)
3. Implement multi-step canary scripts that simulate actual user flows rather than single
   endpoint checks

// CloudWatch Synthetics canary script - simulates a payment validation flow
const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const validatePaymentFlow = async () => {
    // Step 1: Health check
    await synthetics.executeHttpStep(
        'Check API Health',
        {
            hostname: 'api.example.com', // placeholder for the production hostname
            method: 'GET',
            path: '/health',
            port: 443,
            protocol: 'https:',
        },
        (res) => {
            if (res.statusCode !== 200) {
                throw new Error(`Health check failed: ${res.statusCode}`);
            }
        }
    );

    // Step 2: Authenticate (tests IAM/JWT integration)
    // Step 3: Submit test payment validation
    // Step 4: Verify response structure and latency
    // This multi-step approach catches integration failures, not just connectivity
};

exports.handler = async () => {
    return await validatePaymentFlow();
};
Chapter 8: Results and Metrics — The Numbers

After 60 days of production operation on the new architecture, here are the measured outcomes:

Availability and Reliability

• Achieved SLA: 99.991% over the first 60 days (roughly 7.8 minutes of total downtime
  across two minor incidents)
• Incident count: 2 minor incidents vs. 7 in the comparable prior period
• Mean Time to Detect (MTTD): 2.3 minutes (down from 11 minutes with previous monitoring)
• Mean Time to Resolve (MTTR): 8.7 minutes (down from 47 minutes)
• Successful DR drills: 3/3 (RTO consistently under 4 minutes, well within the 5-minute target)

Performance

• Average API response time: 185ms (from 340ms — 46% improvement)
• P99 API response time: 480ms (from 4,200ms — 89% improvement)
• Database query time (top 5 slow queries): Reduced by 73% through query optimization
  and connection pooling
• Transaction throughput: Peak capacity increased from 12,000 to 85,000 transactions per hour

Cost

• Monthly infrastructure cost: $16,200 (from $12,000 — 35% increase, within the 40%
  budget constraint)
• Waste eliminated: $1,200/month in orphaned resources + $600/month in NAT Gateway
  savings via VPC Endpoints
• Reserved Instance coverage: 78% of eligible resources committed (target was 70%)
• Projected annual savings from RI/SP commitments: $28,400/year vs. pure On-Demand

Security

• Critical findings in Security Hub: 0 (down from 47 at initial assessment)
• High findings: 3 (all acknowledged with remediation timelines)
• GuardDuty findings triggering alerts: 2 (both investigated, confirmed benign —
  automated responses handled them)
• Days since last IAM credential exposure finding: 60 (with Secrets Manager, this
  category of finding effectively disappeared)

Team Productivity

• Deployment frequency: From bi-weekly to daily (CI/CD pipeline with automated rollback)
• Mean time to provision new environments: From 3 days to 45 minutes (Terraform modules)
• On-call incident pages per month: From 23 to 7
• Time spent on infrastructure troubleshooting: Down approximately 60% per engineer per week
Chapter 9: Key Takeaways

What Worked Exceptionally Well

• Aurora Global Database with RDS Proxy was the single highest-impact technical
choice. The combination of sub-second RPO from Aurora Global Database and
transparent failover handling from RDS Proxy resolved our biggest reliability weakness at

a reasonable cost.
• Route 53 ARC with Safety Rules prevented what could have been a catastrophic
operator error during one of our DR drills when a team member accidentally tried to turn
off both region routing controls simultaneously — ARC's safety rule blocked the
operation and fired an alert.
• VPC Endpoints for all applicable services reduced both cost and attack surface
simultaneously. It's one of those rare architectural decisions where security and cost

optimization point in the same direction.


• Permission Boundaries as a developer guardrail allowed us to give the engineering
team meaningful autonomy (they can create IAM roles for their services) while
maintaining the security property that no developer-created role can escalate beyond the
defined boundary.
• The tagging strategy enforced by AWS Config turned cost attribution from a monthly
blame game into a real-time, actionable feedback loop for developers.

What I'd Do Differently Next Time

• Start with the health check endpoint design. The deep health check (/health/deep)
that validates database connectivity should have been the first thing we built. We wasted
two weeks troubleshooting load balancer behavior because we were using the wrong

health check from the start.


• Model VPC Flow Logs volume before enabling. The cost surprise in month one was
entirely avoidable. A back-of-envelope calculation (network throughput × estimated log
overhead × CloudWatch ingestion price) would have pointed us to S3-first architecture
immediately.
• Involve the application team in DR planning from day one. The Aurora failover
connection error challenge (Challenge 2) would have been caught earlier if application
developers had been in the room when we designed the database failover mechanics.
Infrastructure and application teams need to co-design DR, not hand it off to each other.

• Consider AWS Backup with cross-region copy for operational backups. I relied on
Aurora's native backup capabilities, which are excellent for RPO but awkward for point-in-
time restore workflows. AWS Backup provides a centralized backup management
experience that I'd use from the start in future engagements.
• Implement AWS Resilience Hub earlier. AWS Resilience Hub is a service that analyzes
your workload and generates a resiliency score against your defined RTO and RPO
targets. I discovered it mid-project, and it would have been a useful validation tool during

the design phase to identify gaps before deployment.

Best Practices Discovered

• Never treat 99.99% SLA as a single architectural decision. It's the cumulative effect of
dozens of smaller decisions, each of which eliminates a class of failure.

• DR runbooks that haven't been tested in the last 30 days are documentation, not
runbooks. Run drills relentlessly.
• Always use update-routing-control-states (plural, atomic) rather than
individual update-routing-control-state calls when changing ARC. The atomic version
ensures safety rules are evaluated against the final desired state, not intermediate states.
• KMS Multi-Region Keys are not optional for multi-region architectures with encrypted
data. Dealing with cross-region re-encryption at the application layer is a significant

ongoing operational burden.


• Graviton-based instances (r6g, m7g, c7g) should be the default for any new workload.
The price-performance advantage over x86 equivalents is consistently 20-40%, and the
compatibility story is now excellent for mainstream workloads.
• Session Manager with full CloudWatch Logs integration is not just a security win — it's an
operational win. The ability to search session history to understand what commands were
run before an incident is genuinely valuable.
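The atomic routing-control update recommended above can be sketched with boto3's `route53-recovery-cluster` data-plane client (the ARNs and cluster endpoint below are placeholders):

```python
def failover_entries(primary_arn: str, secondary_arn: str) -> list:
    # Both state changes travel in a single request, so ARC safety rules are
    # evaluated against the final desired state: "both regions Off" can never
    # occur as an intermediate state the way it could with two singular calls.
    return [
        {"RoutingControlArn": primary_arn, "RoutingControlState": "Off"},
        {"RoutingControlArn": secondary_arn, "RoutingControlState": "On"},
    ]

def flip_regions(primary_arn, secondary_arn, cluster_endpoint, region):
    """Issue the atomic update against one of the ARC cluster's regional
    endpoints (endpoint URL, region, and ARNs are placeholders)."""
    import boto3  # deferred so failover_entries() is testable offline
    client = boto3.client(
        "route53-recovery-cluster",
        endpoint_url=cluster_endpoint,
        region_name=region,
    )
    client.update_routing_control_states(
        UpdateRoutingControlStateEntries=failover_entries(primary_arn, secondary_arn)
    )
```

Building the entries as one list and submitting them in one call is the whole point: the safety rule sees the complete transition, not two halves of it.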

Recommendations for Similar Projects


• Run the AWS Well-Architected Tool at the start of every engagement, not just at the
end. The questions it asks will surface architectural blind spots you didn't know you had.
• Get your tagging strategy locked in before any resources are created. Retrofitting
tags onto hundreds of existing resources is painful and error-prone.

• Set up Cost Anomaly Detection in the first week. Cost surprises are much easier to
address when caught in week 2 rather than month 2.
• Don't skip the stakeholder interview phase. The requirement that the architecture
accommodate Southeast Asian expansion (from the business interview) significantly
influenced several technical decisions. That requirement never appeared in any written
documentation.
• Automate everything, document the automation. The goal isn't to write a runbook

that a human reads during an incident. The goal is an SSM Automation document that a
human approves during an incident while the system executes it.
Chapter 10: Tech Stack Summary

Layer | Service | Notes
DNS & Global Traffic | Route 53, Global Accelerator, Route 53 ARC | ARC provides safety rules and atomic routing control changes
CDN | CloudFront | Static assets + API caching at edge
Compute | ECS Fargate | No node management, native multi-AZ, Fargate Spot for non-prod
Container Registry | ECR with cross-region replication | Private, encrypted, image scanning enabled
Database (Primary) | Aurora MySQL Global Database | Sub-second RPO, ~1 min RTO for managed failover
Database (Proxy) | RDS Proxy | Connection pooling, transparent failover
Caching | ElastiCache Redis Global Datastore | Session state, computed results
Object Storage | S3 with CRR | Cross-Region Replication to secondary, Intelligent-Tiering
Networking | VPC, Transit Gateway (inter-region peering), VPC Endpoints | TGW peering for management traffic, endpoints for cost/security
Load Balancing | Application Load Balancer | ALB in both regions, WAF attached
Security | IAM with Permission Boundaries, KMS Multi-Region Keys, Secrets Manager, GuardDuty, Security Hub, Macie | SRA multi-account structure
Access | Systems Manager Session Manager | Zero SSH, full audit trail
Governance | AWS Organizations, Control Tower, Config, SCPs | CIS benchmarks enforced via Config Conformance Packs
Monitoring | CloudWatch, X-Ray, CloudWatch Synthetics, Container Insights | Centralized dashboard, distributed tracing
Logging | CloudTrail (org-wide), VPC Flow Logs (S3 + CW), ALB Access Logs | Log Archive account with immutable S3 bucket
CI/CD | CodePipeline, CodeBuild, CodeDeploy | Blue/green deployments to ECS
IaC | Terraform with remote state in S3 + DynamoDB locking | Modular structure, workspace-based environment management
Cost Management | Cost Anomaly Detection, Compute Savings Plans, Aurora RI, AWS Cost Explorer | Tagging strategy enforced by Config
Closing Thoughts

Building a 99.99% SLA platform is fundamentally an exercise in eliminating single points of


failure, one layer at a time. When I started this engagement, FinServ Co. had exactly one layer of
redundancy — Multi-AZ for their database — and it wasn't even wired up correctly at the
application layer.

By the end of the 120-day engagement, every tier of the architecture had been designed around
the assumption of failure. Not "if something fails," but "when something fails." The infrastructure
expects faults and handles them automatically. The operations team gets paged when something
interesting is happening, not when something is already broken.

The measure of a well-designed system isn't that it never fails. It's that when it fails, the failure is
contained, detected quickly, and resolved automatically — ideally before any user notices
anything at all.

That's what 99.99% SLA really means in practice.

This ebook reflects real-world architectural patterns and configurations used in production AWS
environments. Service pricing and specific API parameters are subject to change — always verify
against current AWS documentation before implementing. All account IDs, resource ARNs, and
company names used in examples are illustrative.
Appendix A: Complete Terraform Module Structure

One of the things that keeps multi-region Terraform projects manageable is a clean, consistent
module structure. Here's the directory layout I used for this project (file names follow the
standard main/variables/outputs Terraform convention), along with explanations of what lives
where and why.

finserv-infrastructure/
├── environments/
│   ├── production/
│   │   ├── main.tf            # Root module — calls all child modules
│   │   ├── variables.tf       # Input variables for production environment
│   │   ├── outputs.tf         # Exported values (ALB ARNs, Aurora endpoints, etc.)
│   │   ├── terraform.tfvars   # Actual values (committed to repo, no secrets)
│   │   ├── backend.tf         # S3 remote state + DynamoDB locking configuration
│   │   └── providers.tf       # Multi-region provider configuration
│   ├── staging/
│   │   └── ...                # Same structure, different tfvars values
│   └── development/
│       └── ...

├── modules/
│   ├── vpc/
│   │   ├── main.tf            # VPC, subnets, IGW, NAT Gateways, route tables
│   │   ├── variables.tf       # cidr_block, azs, environment, etc.
│   │   ├── outputs.tf         # vpc_id, subnet_ids, nat_gateway_ids
│   │   └── ...
│   ├── transit-gateway/
│   │   ├── main.tf            # TGW, attachments, inter-region peering
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── security-groups/
│   │   ├── main.tf            # All security group definitions
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── vpc-endpoints/
│   │   ├── main.tf            # Interface and Gateway endpoints
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── kms/
│   │   ├── main.tf            # Multi-region keys, replica keys, key aliases
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── iam/
│   │   ├── main.tf            # Roles, policies, permission boundaries
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── aurora-global/
│   │   ├── main.tf            # Global cluster, primary cluster, secondary cluster
│   │   ├── parameter_groups.tf # Cluster and instance parameter groups
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── rds-proxy/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── elasticache-global/
│   │   ├── main.tf            # Redis Global Datastore
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ecr/
│   │   ├── main.tf            # Repositories with cross-region replication
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── ...
│   ├── ecs/
│   │   ├── main.tf            # ECS cluster with Container Insights
│   │   ├── task_definition.tf # Task definition template
│   │   ├── service.tf         # ECS service with auto scaling
│   │   ├── alb.tf             # Application Load Balancer + Target Groups
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── global-accelerator/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── route53-arc/
│   │   ├── main.tf            # ARC cluster, control panel, routing controls, safety rules
│   │   ├── health_checks.tf   # Route 53 health checks backed by ARC
│   │   ├── dns_records.tf     # Failover DNS records
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── secrets-manager/
│   │   ├── main.tf            # Secrets with cross-region replication
│   │   ├── rotation.tf        # Rotation Lambda configurations
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ssm-session-manager/
│   │   ├── main.tf            # Session preferences document, IAM role, VPC endpoints
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── monitoring/
│   │   ├── dashboards.tf      # CloudWatch Dashboards
│   │   ├── alarms.tf          # All CloudWatch Alarms
│   │   ├── log_groups.tf      # Log groups with retention and encryption
│   │   ├── metric_filters.tf  # VPC Flow Log metric filters
│   │   ├── canaries.tf        # CloudWatch Synthetics canaries
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── waf/
│       ├── main.tf            # WAF WebACL with managed rule groups
│       ├── variables.tf
│       └── outputs.tf

├── scripts/
│   ├── failover/
│   │   ├── execute_failover.sh   # Orchestrates full regional failover
│   │   ├── preflight_checks.sh   # Pre-flight checks before failover
│   │   └── validate_failover.sh  # Post-failover verification
│   ├── cost/
│   │   └── ri_sp_coverage.sh     # Reports RI/SP coverage percentage
│   └── security/
│       └── rotate_secret.sh      # Manual secret rotation trigger

└── docs/
    ├── architecture-decisions/   # ADR (Architecture Decision Records)
    ├── runbooks/                 # Operational runbooks
    └── diagrams/                 # Architecture diagrams (draw.io source files)

The backend.tf Configuration:

# backend.tf — Always use remote state for team environments
terraform {
  backend "s3" {
    bucket         = "finserv-terraform-state-123456789012"
    key            = "production/terraform.tfstate"
    region         = "ap-south-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:ap-south-1:123456789012:key/terraform-state-key"
    # DynamoDB table for state locking — prevents concurrent applies from
    # corrupting state. The table needs a partition key named "LockID" of type String.
    dynamodb_table = "finserv-terraform-locks"
  }
}

The providers.tf Multi-Region Configuration:

# providers.tf
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30"
    }
  }
}

# Primary region provider
provider "aws" {
  alias  = "primary"
  region = var.primary_region # "ap-south-1"
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "finserv"
      Owner       = var.owner_team
    }
  }
}

# Secondary region provider
provider "aws" {
  alias  = "secondary"
  region = var.secondary_region # "ap-southeast-1"
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "finserv"
      Owner       = var.owner_team
      Region      = "secondary"
    }
  }
}

# Global services provider (us-east-1 — required for Route 53, IAM, and
# CloudFront/ACM certificates; Global Accelerator's API lives in us-west-2)
provider "aws" {
  alias  = "global"
  region = "us-east-1"
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "finserv"
    }
  }
}

Gotcha: default_tags in the provider block applies tags to every resource created by that provider alias. This is a powerful way to enforce baseline tagging compliance without repeating tags in every resource block. Be aware, however, that some AWS resources don't support tags at all, and a few resource types interact badly with default_tags — conflicting or duplicated tags can surface as apply-time errors or perpetual plan diffs. The workaround is the ignore_tags block in the provider configuration for those specific tag keys.
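A sketch of that workaround, with a hypothetical externally-managed tag key standing in for whatever your environment actually fights over:

```hcl
provider "aws" {
  alias  = "primary"
  region = var.primary_region

  default_tags {
    tags = { Environment = var.environment, ManagedBy = "terraform" }
  }

  # Tag keys listed here are excluded from Terraform's tag diffing, so tags
  # written by external systems no longer show up as perpetual drift
  ignore_tags {
    key_prefixes = ["aws:"]           # AWS-internal tags can never be managed by you
    keys         = ["CostCenterSync"] # hypothetical tag written by an external tool
  }
}
```

Note that ignore_tags is provider-wide: every resource under that alias stops tracking those keys, so use narrowly-scoped key names rather than broad prefixes where you can.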
Appendix B: Security Group Reference

Security groups are the last line of defense in your network architecture. Here is the complete
security group matrix for this architecture — what each group allows, what it blocks, and why.

# security-groups/main.tf
# ─────────────────────────────────────────────────────────────────
# ALB Security Group — faces the internet
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "alb" {
  provider    = aws.primary
  name        = "finserv-alb-sg"
  description = "Security group for Application Load Balancer"
  vpc_id      = var.vpc_id

  # HTTPS only from internet — HTTP is redirected to HTTPS at the listener level
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS from internet"
  }
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTP from internet — redirected to HTTPS by ALB listener rule"
  }

  # ALB can only talk to ECS tasks (enforced by using a security group
  # reference, not a CIDR)
  egress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
    description     = "Forward to ECS tasks on port 8080"
  }

  tags = { Name = "finserv-alb-sg" }
}
# ─────────────────────────────────────────────────────────────────
# ECS Tasks Security Group — application tier
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "ecs_tasks" {
  provider    = aws.primary
  name        = "finserv-ecs-tasks-sg"
  description = "Security group for ECS Fargate tasks"
  vpc_id      = var.vpc_id

  # Only accept traffic from the ALB
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "Traffic from ALB only"
  }

  # Allow all outbound — tasks need to reach Aurora, ElastiCache, Secrets Manager,
  # ECR, CloudWatch, SSM, and potentially external payment gateways
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "All outbound traffic"
  }

  tags = { Name = "finserv-ecs-tasks-sg" }
}

# ─────────────────────────────────────────────────────────────────
# RDS Proxy Security Group
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "rds_proxy" {
  provider    = aws.primary
  name        = "finserv-rds-proxy-sg"
  description = "Security group for RDS Proxy"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
    description     = "MySQL connections from ECS tasks"
  }
  egress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.aurora.id]
    description     = "MySQL connections to Aurora cluster"
  }

  tags = { Name = "finserv-rds-proxy-sg" }
}
# ─────────────────────────────────────────────────────────────────
# Aurora Security Group — most restrictive
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "aurora" {
  provider    = aws.primary
  name        = "finserv-aurora-sg"
  description = "Security group for Aurora Global Database cluster"
  vpc_id      = var.vpc_id

  # Only accept connections from the RDS Proxy — NOT directly from ECS tasks.
  # This forces all database connections to go through the proxy.
  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.rds_proxy.id]
    description     = "MySQL from RDS Proxy only — direct application access blocked"
  }

  # No outbound rules needed — Aurora doesn't initiate outbound connections
  # (Aurora-to-Aurora replication uses internal AWS networking, not these
  # security groups)

  tags = { Name = "finserv-aurora-sg" }
}

# ─────────────────────────────────────────────────────────────────
# ElastiCache Security Group
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "elasticache" {
  provider    = aws.primary
  name        = "finserv-elasticache-sg"
  description = "Security group for ElastiCache Redis"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 6379
    to_port         = 6379
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
    description     = "Redis from ECS tasks"
  }

  tags = { Name = "finserv-elasticache-sg" }
}

# ─────────────────────────────────────────────────────────────────
# VPC Endpoints Security Group
# ─────────────────────────────────────────────────────────────────
resource "aws_security_group" "vpc_endpoints" {
  provider    = aws.primary
  name        = "finserv-vpc-endpoints-sg"
  description = "Security group for VPC Interface Endpoints"
  vpc_id      = var.vpc_id

  # Accept HTTPS from the entire VPC CIDR — all resources in the VPC
  # should be able to use the endpoints
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
    description = "HTTPS from VPC CIDR for AWS service access"
  }

  tags = { Name = "finserv-vpc-endpoints-sg" }
}

Security Group Traffic Flow Summary:

       Internet
          │ HTTPS (443)
          ▼
       [ALB SG]
          │ Port 8080
          ▼
    [ECS Tasks SG]
          │ Port 3306        │ Port 6379        │ Port 443
          ▼                  ▼                  ▼
   [RDS Proxy SG]     [ElastiCache SG]   [VPC Endpoints SG]
          │ Port 3306        │                  │
          ▼                  ▼                  ▼
     [Aurora SG]        Redis Cluster      AWS Services
                                          (SSM, ECR, etc.)

The key design principle here: no layer can be reached except by the layer immediately above it. ECS tasks cannot talk directly to Aurora — all connections must pass through RDS Proxy. This layered security model means that even if an attacker compromises an ECS task, they face an additional barrier before reaching the database.
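If the broad all-outbound rule on the ECS tasks group ever becomes a compliance concern, the same reference-not-CIDR pattern can tighten egress too. A sketch of that tightening (not part of the deployed configuration above — rule names are illustrative):

```hcl
# Replace the catch-all egress on the ECS tasks group with explicit flows.
# Standalone rules and inline ingress/egress blocks must not be mixed on the
# same security group, so the inline egress block would be removed first.
resource "aws_security_group_rule" "ecs_to_rds_proxy" {
  type                     = "egress"
  from_port                = 3306
  to_port                  = 3306
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.rds_proxy.id
  description              = "MySQL to RDS Proxy only"
}

resource "aws_security_group_rule" "ecs_https_out" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  security_group_id = aws_security_group.ecs_tasks.id
  cidr_blocks       = ["0.0.0.0/0"] # external payment gateways + AWS APIs over VPC endpoints
  description       = "HTTPS outbound"
}
```

The trade-off is operational: every new dependency (a new cache port, a new webhook target) now requires a security group change, which is exactly why many teams keep egress open and enforce restrictions at the ingress side only.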
Appendix C: AWS Config Rules Reference

AWS Config continuously evaluates your resource configurations against rules you define. Here
are the Config rules I implemented, with explanations of what each one checks and why it
matters.

# config-rules/main.tf
# ─────────────────────────────────────────────────────────────────
# Managed Rules (AWS-maintained rule logic)
# ─────────────────────────────────────────────────────────────────
# Ensure MFA is enabled for the root account
resource "aws_config_config_rule" "root_mfa_enabled" {
  name        = "root-account-mfa-enabled"
  description = "Checks whether the root user of your AWS account requires MFA"
  source {
    owner             = "AWS"
    source_identifier = "ROOT_ACCOUNT_MFA_ENABLED"
    # This is a periodic rule — evaluated every 24 hours, not triggered by events
  }
}

# Ensure CloudTrail is enabled in all regions
resource "aws_config_config_rule" "cloudtrail_enabled" {
  name        = "cloudtrail-enabled"
  description = "Checks whether CloudTrail is enabled and logging API calls"
  source {
    owner             = "AWS"
    source_identifier = "CLOUD_TRAIL_ENABLED"
  }
  input_parameters = jsonencode({
    s3BucketName = "finserv-cloudtrail-archive"
    # Ensures CloudTrail is not just enabled, but logging to the correct bucket
  })
}
# Ensure no security groups allow unrestricted SSH inbound
resource "aws_config_config_rule" "no_unrestricted_ssh" {
  name        = "restricted-ssh"
  description = "Checks that no security groups allow unrestricted inbound SSH (port 22)"
  source {
    owner             = "AWS"
    source_identifier = "INCOMING_SSH_DISABLED"
  }
  # This rule triggers whenever a security group is created or modified
}

# Ensure EBS volumes are encrypted
resource "aws_config_config_rule" "ebs_encryption" {
  name        = "encrypted-volumes"
  description = "Checks that attached EBS volumes are encrypted"
  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }
  input_parameters = jsonencode({
    kmsId = aws_kms_key.ebs.arn # KMS key resource name assumed
    # Optionally enforce use of a specific CMK, not just any encryption
  })
}
# Ensure RDS instances are encrypted
resource "aws_config_config_rule" "rds_encrypted" {
  name = "rds-storage-encrypted"
  source {
    owner             = "AWS"
    source_identifier = "RDS_STORAGE_ENCRYPTED"
  }
}

# Ensure RDS instances have deletion protection
resource "aws_config_config_rule" "rds_deletion_protection" {
  name = "rds-instance-deletion-protection-enabled"
  source {
    owner             = "AWS"
    source_identifier = "RDS_INSTANCE_DELETION_PROTECTION_ENABLED"
  }
}

# Ensure Multi-AZ is enabled for RDS
resource "aws_config_config_rule" "rds_multi_az" {
  name = "rds-multi-az-support"
  source {
    owner             = "AWS"
    source_identifier = "RDS_MULTI_AZ_SUPPORT"
  }
}

# Ensure S3 buckets have server-side encryption enabled
resource "aws_config_config_rule" "s3_bucket_encryption" {
  name = "s3-bucket-server-side-encryption-enabled"
  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"
  }
}

# Ensure S3 buckets block public access
resource "aws_config_config_rule" "s3_public_access_block" {
  name = "s3-account-level-public-access-blocks-periodic"
  source {
    owner             = "AWS"
    source_identifier = "S3_ACCOUNT_LEVEL_PUBLIC_ACCESS_BLOCKS_PERIODIC"
  }
}

# Ensure VPC Flow Logs are enabled
resource "aws_config_config_rule" "vpc_flow_logs" {
  name        = "vpc-flow-logs-enabled"
  description = "Checks whether VPC Flow Logs are enabled for each VPC"
  source {
    owner             = "AWS"
    source_identifier = "VPC_FLOW_LOGS_ENABLED"
  }
  input_parameters = jsonencode({
    trafficType = "ALL"
    # Ensure ALL traffic is captured, not just REJECT or ACCEPT
  })
}

# Ensure GuardDuty is enabled
resource "aws_config_config_rule" "guardduty_enabled" {
  name = "guardduty-enabled-centralized"
  source {
    owner             = "AWS"
    source_identifier = "GUARDDUTY_ENABLED_CENTRALIZED"
  }
}
# ─────────────────────────────────────────────────────────────────
# Custom Rules (Lambda-backed — enforce custom business logic)
# ─────────────────────────────────────────────────────────────────
# Custom rule: enforce mandatory tags on all taggable resources
resource "aws_config_config_rule" "mandatory_tags" {
  name        = "mandatory-tags-enforcement"
  description = "Ensures all resources have required tags: Environment, Owner, ManagedBy"
  source {
    owner             = "CUSTOM_LAMBDA"
    source_identifier = aws_lambda_function.config_mandatory_tags.arn
    source_detail {
      event_source = "aws.config"
      message_type = "ConfigurationItemChangeNotification"
      # Triggers on every resource configuration change — catches untagged
      # resources within seconds of creation
    }
  }
  scope {
    compliance_resource_types = [
      "AWS::EC2::Instance",
      "AWS::RDS::DBInstance",
      "AWS::RDS::DBCluster",
      "AWS::ECS::Service",
      "AWS::S3::Bucket",
      "AWS::ElastiCache::ReplicationGroup"
    ]
  }
  depends_on = [aws_config_configuration_recorder.primary]
}
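One easy-to-miss prerequisite for Lambda-backed rules: AWS Config must be granted permission to invoke the function, or evaluations silently never run. A sketch of that grant (the function name follows the rule above; this wiring isn't shown in the listing itself):

```hcl
# Allow the Config service to invoke the custom evaluation Lambda
resource "aws_lambda_permission" "allow_config_invoke" {
  statement_id  = "AllowConfigInvocation"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.config_mandatory_tags.function_name
  principal     = "config.amazonaws.com"

  # Scope the grant to this account to avoid confused-deputy invocations
  source_account = data.aws_caller_identity.current.account_id
}
```

Without this resource, the rule shows as "Evaluating" indefinitely in the Config console, which is a confusing failure mode to debug after the fact.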

Config Remediation — Auto-Fix Non-Compliant Resources:

For certain rules, rather than just alerting on violations, I configured automatic remediation using
SSM Automation documents:

# Automatically enable S3 bucket encryption for non-compliant buckets
resource "aws_config_remediation_configuration" "s3_encryption_remediation" {
  config_rule_name = aws_config_config_rule.s3_bucket_encryption.name
  resource_type    = "AWS::S3::Bucket"
  target_type      = "SSM_DOCUMENT"
  target_id        = "AWSConfigRemediation-EnableS3BucketEncryption"
  # This is an AWS-provided SSM document that enables default encryption on S3 buckets

  automatic = true
  # Automatically apply the remediation without human approval.
  # Only use automatic = true for low-risk remediations.
  maximum_automatic_attempts = 3
  retry_attempt_seconds      = 60

  parameter {
    name           = "BucketName"
    resource_value = "RESOURCE_ID"
    # RESOURCE_ID is a placeholder that Config replaces with the actual bucket name
  }
  parameter {
    name         = "SSEAlgorithm"
    static_value = "aws:kms"
  }
  parameter {
    name         = "KMSMasterKeyID"
    static_value = aws_kms_key.s3.arn # KMS key resource name assumed
  }
}
Appendix D: CloudWatch Alarms Complete Reference

This appendix lists every CloudWatch alarm configured in the architecture, organized by category. For each alarm, the threshold, period, and action are specified along with a plain-English explanation of what it detects.

# monitoring/alarms.tf
locals {
  # SNS topic ARNs for alert routing
  critical_topic = "arn:aws:sns:ap-south-1:123456789012:finserv-critical-alerts"
  warning_topic  = "arn:aws:sns:ap-south-1:123456789012:finserv-warning-alerts"
  info_topic     = "arn:aws:sns:ap-south-1:123456789012:finserv-info-alerts"
}
# ─────────────────────────────────────────────────────────────────
# APPLICATION TIER ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "alb_5xx_rate" {
  alarm_name          = "ALB-5XX-ErrorRate-High"
  alarm_description   = "ALB 5XX error rate exceeds 1% — indicates application or infrastructure errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 1
  treat_missing_data  = "notBreaching"

  metric_query {
    id          = "error_rate"
    expression  = "errors / requests * 100"
    label       = "5XX Error Rate %"
    return_data = true
  }
  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      dimensions  = { LoadBalancer = aws_lb.primary.arn_suffix }
      period      = 60
      stat        = "Sum"
    }
  }
  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      dimensions  = { LoadBalancer = aws_lb.primary.arn_suffix }
      period      = 60
      stat        = "Sum"
    }
  }

  alarm_actions = [local.critical_topic]
  ok_actions    = [local.info_topic]
}
resource "aws_cloudwatch_metric_alarm" "alb_target_response_time_p99" {
  alarm_name          = "ALB-P99-ResponseTime-High"
  alarm_description   = "P99 target response time exceeds 2 seconds — degraded user experience"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  # Percentile statistics like p99 must use "extended_statistic", not
  # "statistic" — setting both is a validation error
  extended_statistic  = "p99"
  threshold           = 2
  treat_missing_data  = "notBreaching"
  dimensions          = { LoadBalancer = aws_lb.primary.arn_suffix }
  alarm_actions       = [local.warning_topic]
}
resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_hosts" {
  alarm_name          = "ALB-UnhealthyHosts-NonZero"
  alarm_description   = "One or more ECS tasks are failing ALB health checks — potential task crash or startup failure"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  dimensions = {
    LoadBalancer = aws_lb.primary.arn_suffix
    TargetGroup  = aws_lb_target_group.payment_processor.arn_suffix
  }
  alarm_actions = [local.critical_topic]
}
# ─────────────────────────────────────────────────────────────────
# ECS FARGATE ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "ECS-PaymentProcessor-CPU-High"
  alarm_description   = "ECS service CPU utilization exceeds 80% — auto scaling should be triggered, but verify"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 60
  statistic           = "Average"
  threshold           = 80
  dimensions = {
    ClusterName = aws_ecs_cluster.primary.name # cluster resource name assumed
    ServiceName = aws_ecs_service.payment_processor.name
  }
  alarm_actions = [local.warning_topic]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_high" {
  alarm_name          = "ECS-PaymentProcessor-Memory-High"
  alarm_description   = "ECS service memory utilization exceeds 85% — risk of OOM task failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "MemoryUtilization"
  namespace           = "AWS/ECS"
  period              = 60
  statistic           = "Average"
  threshold           = 85
  dimensions = {
    ClusterName = aws_ecs_cluster.primary.name
    ServiceName = aws_ecs_service.payment_processor.name
  }
  alarm_actions = [local.warning_topic]
}

resource "aws_cloudwatch_metric_alarm" "ecs_running_task_count_low" {
  alarm_name          = "ECS-PaymentProcessor-RunningTasks-BelowMinimum"
  alarm_description   = "Running task count dropped below minimum desired count — possible deployment failure or task crash loop"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 2
  metric_name         = "RunningTaskCount"
  namespace           = "ECS/ContainerInsights"
  period              = 60
  statistic           = "Minimum"
  threshold           = 6 # our minimum desired count
  treat_missing_data  = "breaching"
  # treat_missing_data = "breaching": if the metric stops reporting (e.g., a
  # Container Insights issue), assume the worst and fire the alarm
  dimensions = {
    ClusterName = aws_ecs_cluster.primary.name
    ServiceName = aws_ecs_service.payment_processor.name
  }
  alarm_actions = [local.critical_topic]
}
# ─────────────────────────────────────────────────────────────────
# DATABASE ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "aurora_cpu_high" {
  alarm_name          = "Aurora-Primary-CPU-High"
  alarm_description   = "Aurora writer instance CPU exceeds 70% — investigate slow queries"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Average"
  threshold           = 70
  dimensions          = { DBClusterIdentifier = aws_rds_cluster.primary.cluster_identifier }
  alarm_actions       = [local.warning_topic]
}

resource "aws_cloudwatch_metric_alarm" "aurora_freeable_memory" {
  alarm_name          = "Aurora-Primary-FreeableMemory-Low"
  alarm_description   = "Aurora freeable memory below 1GB — buffer pool pressure, possible swapping"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 3
  metric_name         = "FreeableMemory"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Average"
  threshold           = 1073741824 # 1 GB in bytes
  dimensions          = { DBClusterIdentifier = aws_rds_cluster.primary.cluster_identifier }
  alarm_actions       = [local.warning_topic]
}

resource "aws_cloudwatch_metric_alarm" "aurora_replication_lag" {
  alarm_name          = "Aurora-GlobalDB-ReplicationLag-High"
  alarm_description   = "Aurora Global Database replication lag exceeds 5 seconds — RPO at risk"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "AuroraGlobalDBReplicationLag"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Maximum"
  threshold           = 5000 # 5 seconds in milliseconds
  treat_missing_data  = "breaching"
  dimensions          = { DBClusterIdentifier = aws_rds_cluster.secondary.cluster_identifier }
  alarm_actions       = [local.critical_topic]
}

resource "aws_cloudwatch_metric_alarm" "rds_proxy_client_connections" {
  alarm_name          = "RDSProxy-ClientConnections-High"
  alarm_description   = "RDS Proxy client connection count approaching limit — connection pool saturation risk"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ClientConnections"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Maximum"
  threshold           = 800 # alert at 80% of the 1000 max_connections limit
  dimensions          = { ProxyName = aws_db_proxy.primary.name } # proxy resource name assumed
  alarm_actions       = [local.warning_topic]
}
# ─────────────────────────────────────────────────────────────────
# DISASTER RECOVERY ALARMS
# ─────────────────────────────────────────────────────────────────
resource "aws_cloudwatch_metric_alarm" "secrets_replication_failure" {
  alarm_name          = "DR-SecretsManager-ReplicationFailure"
  alarm_description   = "Secrets Manager secrets not fully replicated to secondary region — DR readiness at risk"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "SecretsReplicationFailures"
  namespace           = "FinServ/DR"
  period              = 3600 # checked hourly by the validation Lambda
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "breaching"
  alarm_actions       = [local.critical_topic]
}

resource "aws_cloudwatch_metric_alarm" "secondary_region_task_count" {
  provider            = aws.secondary
  alarm_name          = "DR-Secondary-ECS-Tasks-Down"
  alarm_description   = "Secondary region ECS tasks not running — warm standby compromised"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 2
  metric_name         = "RunningTaskCount"
  namespace           = "ECS/ContainerInsights"
  period              = 300
  statistic           = "Minimum"
  threshold           = 2 # minimum warm standby tasks expected
  treat_missing_data  = "breaching"
  dimensions = {
    ClusterName = aws_ecs_cluster.secondary.name # cluster resource name assumed
    ServiceName = "payment-processor"
  }
  alarm_actions = [local.critical_topic]
}
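A pattern worth layering on top of these individual alarms (not part of the original configuration) is a composite alarm: it pages only when several independent signals agree, which cuts pager noise from brief single-metric blips:

```hcl
# Page only when both the error-rate and capacity signals fire together —
# a single noisy metric no longer wakes anyone up
resource "aws_cloudwatch_composite_alarm" "primary_region_degraded" {
  alarm_name        = "Primary-Region-Degraded"
  alarm_description = "Fires only when error rate AND unhealthy-host signals agree"

  # References are by alarm name; these match the alarms defined above
  alarm_rule = join(" AND ", [
    "ALARM(\"ALB-5XX-ErrorRate-High\")",
    "ALARM(\"ALB-UnhealthyHosts-NonZero\")",
  ])

  alarm_actions = [local.critical_topic]
}
```

The individual alarms can then route to the warning topic only, reserving the critical (paging) topic for the composite.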
Appendix E: Useful AWS CLI One-Liners for Operations

This appendix is a quick-reference collection of CLI commands you'll reach for repeatedly in day-to-day operations of a multi-region platform. I keep this list pinned in the team's internal wiki.

# ─────────────────────────────────────────────────────────────────
# REGION STATUS AND FAILOVER READINESS
# ─────────────────────────────────────────────────────────────────
# Check Aurora Global Database secondary cluster status
aws rds describe-global-clusters \
  --global-cluster-identifier finserv-global \
  --query 'GlobalClusters[0].GlobalClusterMembers[?IsWriter==`false`].{Cluster:DBClusterArn,Lag:GlobalWriteForwardingStatus}' \
  --region ap-south-1

# Get current Aurora Global Database status and member details
aws rds describe-global-clusters \
  --global-cluster-identifier finserv-global \
  --region ap-south-1 \
  --output table

# Check Route 53 ARC routing control states (which regions are active)
aws route53-recovery-control-config list-routing-controls \
  --control-panel-arn arn:aws:route53-recovery-control::123456789012:controlpanel/PANEL_ID \
  --endpoint-url <arc-cluster-endpoint-url>  # substitute one of your ARC cluster's regional endpoints
# ─────────────────────────────────────────────────────────────────
# ECS OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all running tasks in a service with their AZ and IP
aws ecs list-tasks \
  --cluster finserv-production-primary \
  --service-name payment-processor \
  --region ap-south-1 \
  --query 'taskArns' \
  --output text \
  | xargs aws ecs describe-tasks \
      --cluster finserv-production-primary \
      --region ap-south-1 \
      --query 'tasks[*].{TaskID:taskArn,AZ:availabilityZone,Status:lastStatus,IP:attachments[0].details[?name==`privateIPv4Address`].value|[0]}' \
      --output table \
      --tasks  # xargs appends the task ARNs here
# Force a new deployment (rolling update with zero downtime)
aws ecs update-service \
--cluster finserv-production-primary \
--service payment-processor \
--force-new-deployment \
--region ap-south-1
# Watch ECS service events in real time (useful during deployments)
watch -n 5 'aws ecs describe-services \
--cluster finserv-production-primary \
--services payment-processor \
--region ap-south-1 \
--query "services[0].events[:5]" \
--output table'
# Scale ECS service manually (useful during failover to secondary)
aws ecs update-service \
--cluster finserv-production-secondary \
--service payment-processor \
--desired-count 18 \
--region ap-southeast-1
# ─────────────────────────────────────────────────────────────────
# AURORA OPERATIONS
# ─────────────────────────────────────────────────────────────────
# Check Aurora cluster endpoints (writer and reader)
aws rds describe-db-clusters \
  --db-cluster-identifier finserv-aurora-primary \
  --query 'DBClusters[0].{Writer:Endpoint,Reader:ReaderEndpoint,Status:Status,Members:DBClusterMembers[*].{ID:DBInstanceIdentifier,Writer:IsClusterWriter,Status:DBInstanceStatus}}' \
  --region ap-south-1 \
  --output table

# Check RDS Proxy target health
aws rds describe-db-proxy-targets \
  --db-proxy-name finserv-aurora-proxy \
  --region ap-south-1 \
  --query 'Targets[*].{Endpoint:Endpoint,Port:Port,State:TargetHealth.State,Description:TargetHealth.Description}' \
  --output table

# Perform a planned Aurora Global Database failover (zero data loss)
aws rds failover-global-cluster \
  --global-cluster-identifier finserv-global \
  --target-db-cluster-identifier arn:aws:rds:ap-southeast-1:123456789012:cluster:finserv-aurora-secondary \
  --region ap-south-1

# Monitor the failover progress
aws rds describe-global-clusters \
  --global-cluster-identifier finserv-global \
  --query 'GlobalClusters[0].{Status:Status,Members:GlobalClusterMembers[*].{Cluster:DBClusterArn,Writer:IsWriter,Readers:Readers}}' \
  --region ap-south-1
# ─────────────────────────────────────────────────────────────────
# SECRETS MANAGER OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all secrets and their replication status
aws secretsmanager list-secrets \
  --region ap-south-1 \
  --query 'SecretList[*].{Name:Name,ReplicationStatus:ReplicationStatus[*].{Region:Region,Status:Status}}' \
  --output table
# Manually rotate a secret immediately (triggers the rotation Lambda)
aws secretsmanager rotate-secret \
--secret-id finserv/aurora/password \
--rotate-immediately \
--region ap-south-1
# Verify a secret value can be retrieved in the secondary region
aws secretsmanager get-secret-value \
--secret-id finserv/aurora/password \
--region ap-southeast-1 \
--query '{Name:Name,VersionId:VersionId,LastRotated:CreatedDate}' \
--output table
# ─────────────────────────────────────────────────────────────────
# COST AND RESOURCE HYGIENE
# ─────────────────────────────────────────────────────────────────
# Find all untagged EC2 instances (missing the mandatory "Owner" tag)
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[?!not_null(Tags[?Key==`Owner`].Value|[0])].[InstanceId,LaunchTime,InstanceType]' \
  --region ap-south-1 \
  --output table

# List SSM-managed instances with their last check-in time
# (instances that stopped pinging are potential zombies worth investigating)
aws ssm describe-instance-information \
  --query 'InstanceInformationList[*].{ID:InstanceId,PingStatus:PingStatus,LastPing:LastPingDateTime,PlatformType:PlatformType}' \
  --region ap-south-1 \
  --output table

# List all EBS snapshots older than 90 days (candidates for cleanup)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601)'].{SnapshotId:SnapshotId,Size:VolumeSize,StartTime:StartTime,Description:Description}" \
  --region ap-south-1 \
  --output table

# Check Reserved Instance utilization percentage
aws ce get-reservation-utilization \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --query 'UtilizationsByTime[*].Total.{UtilizationPercentage:UtilizationPercentage,PurchasedHours:PurchasedHours,UsedHours:UsedHours}' \
  --region us-east-1
# ─────────────────────────────────────────────────────────────────
# SECURITY OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all IAM roles WITHOUT permission boundaries (security risk)
aws iam list-roles \
  --query 'Roles[?!PermissionsBoundary].{RoleName:RoleName,CreatedDate:CreateDate,Path:Path}' \
  --output table

# Check GuardDuty findings in the last 24 hours
aws guardduty list-findings \
  --detector-id $(aws guardduty list-detectors --query 'DetectorIds[0]' --output text --region ap-south-1) \
  --finding-criteria '{"Criterion":{"updatedAt":{"Gte":'$(date -d '24 hours ago' +%s000)'},"severity":{"Gte":4}}}' \
  --region ap-south-1

# Get details of the most recent GuardDuty findings
DETECTOR_ID=$(aws guardduty list-detectors --query 'DetectorIds[0]' --output text --region ap-south-1)
FINDING_IDS=$(aws guardduty list-findings --detector-id $DETECTOR_ID --region ap-south-1 --query 'FindingIds[:5]' --output json)
aws guardduty get-findings \
  --detector-id $DETECTOR_ID \
  --finding-ids $FINDING_IDS \
  --query 'Findings[*].{Type:Type,Severity:Severity,Title:Title,Region:Region,UpdatedAt:UpdatedAt}' \
  --region ap-south-1 \
  --output table
# ─────────────────────────────────────────────────────────────────
# SESSION MANAGER OPERATIONS
# ─────────────────────────────────────────────────────────────────
# List all managed instances available for Session Manager access
aws ssm describe-instance-information \
  --filters "Key=PingStatus,Values=Online" \
  --query 'InstanceInformationList[*].{ID:InstanceId,Name:ComputerName,OS:PlatformType,Version:PlatformVersion,AgentVersion:AgentVersion}' \
  --region ap-south-1 \
  --output table

# Start a Session Manager session (interactive shell)
aws ssm start-session \
  --target i-0abc123def456789 \
  --region ap-south-1

# Run a non-interactive command on a remote instance via SSM Run Command
# (a short pause may be needed before the invocation result is queryable)
aws ssm send-command \
  --instance-ids i-0abc123def456789 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["df -h","free -m","uptime"]' \
  --query 'Command.CommandId' \
  --output text \
  --region ap-south-1 \
  | xargs -I {} aws ssm get-command-invocation \
      --command-id {} \
      --instance-id i-0abc123def456789 \
      --query 'StandardOutputContent' \
      --output text \
      --region ap-south-1
Appendix F: Disaster Recovery Testing Checklist

Regular DR testing is what separates theoretical availability from actual availability. Below is the
checklist I use for quarterly DR drills. The goal is to execute this checklist under simulated
pressure conditions — timers running, communication channels active — not as a leisurely
walkthrough.

Pre-Drill Preparation (T-48 hours)

□ Notify all stakeholders (team leads, CTO, enterprise clients as appropriate)
□ Review and update the runbook since last drill — any architecture changes?
□ Verify current Aurora replication lag baseline (should be < 1 second)
□ Confirm secondary region ECS service health (warm standby tasks running)
□ Test ARC endpoint connectivity from operations workstations
□ Confirm all team members have CLI access configured for both regions
□ Set up a war room communication channel (dedicated Slack channel or Teams call)
□ Confirm monitoring access in the secondary region is operational
□ Brief the on-call escalation path — who makes the "go/no-go" decision?

Drill Execution (T-0 to T+15 minutes)

T+0:00 — Announce drill start in war room channel
T+0:01 — Record baseline metrics (ALB request count, error rate, response time)
T+0:02 — Execute pre-failover validation script
    □ Aurora replication lag: ___ ms (should be < 1000 ms to proceed)
    □ Secondary ECS tasks running: ___ (should be >= 2)
    □ All secrets replicated: □ YES □ NO (halt if NO)
    □ Secondary ALB health check passing: □ YES □ NO
T+0:05 — Begin failover execution
    □ Update Global Accelerator traffic dial (primary: 0%, secondary: 100%)
    □ Update ARC routing controls (primary: OFF, secondary: ON)
    □ Initiate Aurora Global Database managed failover
    □ Scale up secondary ECS service to production task count (18 tasks)
T+0:08 — First validation check
    □ Global Accelerator health check showing secondary as healthy
    □ Synthetic canary reporting success against the production API endpoint
    □ Aurora failover status: PROMOTING or COMPLETE
T+0:10 — Monitor Aurora promotion completion
    □ Aurora secondary cluster status: writer=TRUE confirmed
    □ RDS Proxy reconnection to new writer confirmed
    □ Application error rate returning to baseline
T+0:12 — Full validation
    □ P99 response time within SLA (< 500 ms)
    □ ECS desired=18, running=18, pending=0
    □ Zero 5XX errors for a 60-second window
    □ Payment validation synthetic transaction: SUCCESS
T+0:15 — Record RTO: ___ minutes ___ seconds
    □ Target RTO: 5 minutes □ PASSED □ FAILED

TOTAL OUTAGE WINDOW: from T+0:05 (traffic removed from primary) to T+0:12 (full validation): ___ minutes

Post-Drill Activities

□ Execute failback to original primary region (same procedure in reverse)
□ Confirm Aurora failback promotion completes successfully
□ Verify all monitoring is operational back in primary region
□ Document any deviations from expected behavior
□ Calculate actual RTO and RPO achieved vs. targets
□ Hold a 30-minute retrospective (What went well? What needs improvement?)
□ Update runbook with any lessons learned
□ File a record of the drill result in the compliance documentation
□ Schedule next drill (no more than 90 days from today)

Common Drill Failure Modes and How to Handle Them

Scenario: Aurora replication lag exceeds 30 seconds at drill start Action: Postpone drill.

Investigate replication health. Common causes are high write volume on primary, network
congestion on the TGW inter-region peering link, or an Aurora maintenance event in progress.
Do NOT proceed with failover when lag is high unless it's a genuine emergency — you'll accept
data loss that you don't need to.

Scenario: Secondary ECS tasks fail to scale up within 3 minutes
Action: Check ECR image pull success (VPC Endpoints healthy?), check Fargate task launch failures in ECS Events, and check Secrets Manager access from the secondary region. Common cause: Fargate capacity constraints in the secondary region during a real regional event, when everyone is failing over simultaneously. Mitigate by raising Fargate service quotas in the secondary region ahead of time; note that Fargate itself does not support capacity reservations, so if you need guaranteed capacity, an EC2-backed capacity provider with On-Demand Capacity Reservations is the fallback.

Scenario: ARC routing control change returns an error
Action: Verify you're using the correct ARC cluster endpoint. ARC endpoints are regional — if your primary region is unavailable, switch to the secondary region's ARC endpoint. The endpoints are distributed specifically so that at least one remains available even during a regional event.
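The endpoint-rotation pattern described above reduces to a small retry loop. The boto3 operation involved is `update_routing_control_state` on the `route53-recovery-cluster` client, but the wrapper below is an illustrative sketch: `update_fn` is injected in place of the real API call so the rotation logic stays testable, and the endpoint URLs are placeholders.

```python
def set_routing_control(endpoints, control_arn, state, update_fn):
    """Try each regional ARC cluster endpoint until one accepts the change.

    endpoints: list of (region, endpoint_url) pairs
    update_fn: callable performing the actual API call; injected so the
               rotation logic can be exercised without live endpoints
    """
    last_err = None
    for region, url in endpoints:
        try:
            update_fn(region=region, endpoint_url=url,
                      routing_control_arn=control_arn, state=state)
            return region  # first reachable endpoint wins
        except Exception as err:
            last_err = err  # endpoint unreachable; rotate to the next one
    raise RuntimeError(f"all ARC endpoints failed; last error: {last_err}")
```

In a real failover script, `update_fn` would wrap a boto3 client created with `endpoint_url` set to each cluster endpoint in turn; the rotation is the point, since during a regional event only the endpoints outside the failed region will respond.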

Scenario: Synthetic canary reports success but real users report errors
Action: Check that the synthetic canary is running from outside the AWS network (CloudWatch Synthetics canaries run from within AWS infrastructure). If the issue affects specific geographic regions (e.g., India-based users only), it may be a Global Accelerator routing issue rather than an application issue. Check Global Accelerator flow logs for client-specific routing decisions.
Appendix G: Glossary of Key Terms

This glossary is intentionally written for readers who may encounter these terms for the first time.
Skip to what you need.

Active-Active Architecture: A design where all regions simultaneously handle live production traffic. Writes and reads can happen in any region. Requires careful data consistency management but delivers the lowest possible RTO.

Active-Passive Architecture: A design where one region handles all production traffic (active) while another region maintains a standby environment (passive). Traffic only shifts to the passive region during a failover event.

Anycast IP Address: A single IP address that routes to the "nearest" server based on network topology. AWS Global Accelerator provides anycast IP addresses, meaning users in Mumbai and Singapore both connect to the same IP address but reach their nearest AWS edge location.

Aurora Global Database: An Amazon Aurora feature that replicates data from a primary region
cluster to one or more secondary region clusters with sub-second replication lag. The secondary
clusters are read-only until promoted during a failover.

Canary Deployment: A deployment strategy where a small percentage of traffic is routed to the
new version before rolling it out to all users. Named after the "canary in a coal mine" concept —
if the canary (new version) fails, you know before the full deployment.

Circuit Breaker Pattern: A software design pattern that detects repeated failures and "opens" a circuit to stop additional calls to the failing component, allowing it time to recover. Prevents cascading failures from spreading through a distributed system.
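A minimal sketch of the pattern (illustrative, not a production implementation): the breaker opens after a run of consecutive failures, rejects calls while open, and permits one trial call after a cool-down. The clock is injectable so the timing behavior can be tested.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: opens after `max_failures` consecutive
    failures, rejects calls while open, and allows one trial call after
    `reset_after` seconds (the half-open state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        half_open = False
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            half_open = True        # cool-down elapsed; permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if half_open or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None       # success closes the circuit
        return result
```

Libraries such as resilience4j (Java) or pybreaker (Python) add the production details (rolling windows, metrics, fallbacks), but the state machine is the same.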

CIDR Block: Classless Inter-Domain Routing — a method of representing IP address ranges. [Link]/16 means "all IP addresses from [Link] to [Link]" — the /16 tells you 16 bits are fixed, leaving 16 bits (65,536 addresses) variable.
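Python's standard library can verify these numbers directly; the 10.0.0.0/16 block below is illustrative, not necessarily the book's VPC range.

```python
import ipaddress

net = ipaddress.ip_network("10.0.0.0/16")
print(net.num_addresses)                         # 65536 (16 variable bits)
print(net[0], net[-1])                           # first and last address
print(ipaddress.ip_address("10.0.42.7") in net)  # membership check: True
```

The same module is handy for checking that VPC and subnet CIDRs don't overlap before peering regions (`net.overlaps(other)`).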

Conformance Pack: A collection of AWS Config rules deployed together as a package. AWS provides pre-built conformance packs for standards like CIS Benchmarks, PCI-DSS, and HIPAA.

Customer Managed Key (CMK): An AWS KMS key that you create and control, as opposed to AWS-managed keys or AWS-owned keys. CMKs give you control over key policies, rotation schedules, and the ability to audit key usage in CloudTrail.

Elastic Network Interface (ENI): A virtual network interface that can be attached to an EC2
instance or ECS Fargate task. In awsvpc networking mode, each ECS task gets its own ENI and its
own IP address.

Envelope Encryption: A technique where data is encrypted with a data key (DEK), and the data
key itself is encrypted with a master key (CMK). AWS services use envelope encryption — you
never directly encrypt data with the CMK itself.

Failover: The process of switching from a failed or degraded component to a backup component. In a multi-region context, this means redirecting traffic from a failing region to a healthy standby region.

Fargate: An AWS compute engine for containers that eliminates the need to manage EC2
instances (nodes). You specify CPU and memory requirements for your container tasks, and AWS
handles the underlying compute.

Global Accelerator: An AWS networking service that routes user traffic through AWS's private
global network backbone rather than the public internet, reducing latency and providing sub-30-
second automatic failover between endpoints.

GuardDuty: AWS's managed threat detection service. It continuously analyzes CloudTrail logs,
VPC Flow Logs, and DNS logs using machine learning to identify suspicious activity.

IAM Permission Boundary: A maximum permissions policy attached to an IAM role or user that limits what other policies can grant. Even if an identity-based policy grants an action, the permission boundary can block it.

Multi-AZ: A configuration where a resource (like an RDS database or ALB) is deployed across
multiple Availability Zones within a single AWS Region. Protects against AZ-level failures but not
regional failures.

Multi-Region: A configuration where infrastructure is deployed across multiple AWS Regions. Protects against regional failures. More complex and expensive than Multi-AZ.

NAT Gateway: A managed AWS service that allows resources in private subnets to initiate
outbound connections to the internet while blocking unsolicited inbound connections. Data
processing charges ($0.045/GB in ap-south-1) make it important to minimize unnecessary traffic
through NAT.

Parameter Group: A collection of database engine parameters (configuration settings) that control the behavior of an Aurora or RDS instance. Parameter groups allow you to tune performance, logging, and replication behavior.

Permission Boundary: See "IAM Permission Boundary" above.

RDS Proxy: A fully managed database proxy for RDS and Aurora that pools and shares database
connections, improving application scalability and resilience during failovers.

RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time. An RPO of 1 second means you can tolerate losing at most 1 second of data. Lower RPO = more expensive replication.

RTO (Recovery Time Objective): The maximum acceptable time to restore service after a failure.
An RTO of 5 minutes means you must restore service within 5 minutes of a failure being
detected. Lower RTO = more expensive standby infrastructure.

Route 53 ARC (Application Recovery Controller): An AWS service that provides centralized
control and safety checks for multi-region failover operations, including routing controls that act
as on/off switches for regional traffic.

Savings Plan: An AWS pricing model where you commit to a consistent amount of compute
usage (in $/hour) for 1 or 3 years in exchange for discounts of up to 66% vs. On-Demand pricing.

Service Control Policy (SCP): An AWS Organizations policy that sets permission guardrails for accounts within an organizational unit or the entire organization. SCPs act as a veto — even an account's Administrator cannot do something an SCP prohibits.


Session Manager: The component of AWS Systems Manager that provides browser-based and
CLI-based shell access to EC2 instances and ECS containers using IAM authentication, without
requiring SSH keys or open inbound ports.

Transit Gateway (TGW): An AWS networking service that acts as a central hub for connecting
multiple VPCs and on-premises networks. Supports inter-region peering for connecting VPCs
across AWS regions.

VPC Endpoint: A networking component that allows traffic from within your VPC to reach AWS
services (like S3, SSM, or ECR) over AWS's private network rather than through the public internet
or NAT Gateway.

VPC Flow Logs: A feature that captures metadata about IP traffic flowing through network
interfaces in your VPC. Captured data includes source/destination IPs, ports, protocols, and

whether traffic was accepted or rejected.

WAF (Web Application Firewall): AWS WAF examines HTTP/HTTPS requests to your ALB or
CloudFront distribution and can block requests matching patterns associated with common
attacks (SQL injection, XSS, bad bots, etc.).

Warm Standby: A DR strategy where a scaled-down but functional version of the production
environment runs in the standby region. Upon failover, the standby scales up to handle full
production load. Balances cost against recovery time.
Appendix H: Architecture Decision Records (ADRs)

Architecture Decision Records document significant technical decisions in a lightweight, structured format. I maintain these for every major architectural choice so that future team members understand not just what was decided but why.

ADR-001: Aurora MySQL Global Database over PostgreSQL

Date: Project week 2 | Status: Accepted

Context: The existing application used MySQL with stored procedures and MySQL-specific
functions. Migrating to Aurora PostgreSQL would require significant application changes.

Decision: Use Aurora MySQL 8.0 (Aurora 3.x) with Global Database.

Consequences:

• Positive: Zero application changes required for database engine compatibility
• Positive: Aurora MySQL 3.x (MySQL 8.0 compatible) provides significant performance improvements over the client's existing MySQL 5.7
• Negative: PostgreSQL generally has stronger logical replication features that could be useful in future active-active scenarios
• Neutral: Both MySQL and PostgreSQL variants of Aurora support Global Database with equivalent replication lag characteristics

ADR-002: Warm Standby over Active-Active

Date: Project week 2 | Status: Accepted

Context: Active-Active was considered for its zero-RTO benefits.

Decision: Implement Warm Standby Active-Passive architecture.

Consequences:

• Positive: Meets the 99.99% SLA with sub-5-minute RTO (validated in drills)
• Positive: Infrastructure cost 35-40% lower than full Active-Active
• Positive: No distributed transaction complexity or conflict resolution requirements
• Negative: 2-5 minute failover window vs. near-zero for Active-Active
• Negative: Architecture has a clear primary/secondary hierarchy that may need rethinking if Southeast Asian traffic volumes grow to parity with Indian traffic

Review Trigger: Reconsider Active-Active if ap-southeast-1 traffic exceeds 40% of total traffic
volume.

ADR-003: ECS Fargate over EKS

Date: Project week 3 | Status: Accepted

Context: Kubernetes (EKS) was evaluated as the container orchestration platform.

Decision: Use ECS Fargate.

Consequences:

• Positive: No Kubernetes control plane management overhead (~$0.10/hour per cluster × 2 regions = ~$150/month saved)
• Positive: Simpler operational model for a team without prior Kubernetes experience
• Positive: Native integration with ALB, Service Discovery, and IAM without additional tooling
• Negative: Less flexible scheduling and pod-level control than Kubernetes
• Negative: Migration to EKS in the future (if needed) will require significant work

Review Trigger: Reconsider if the engineering team size grows beyond 20 and multiple distinct
services require independent deployment pipelines.

ADR-004: Session Manager over Bastion Hosts

Date: Project week 1 | Status: Accepted

Context: The existing setup used bastion hosts with SSH key-based access.

Decision: Replace all SSH and bastion host access with AWS Systems Manager Session Manager.

Consequences:

• Positive: Eliminates port 22 from all security groups (reduced attack surface)
• Positive: No SSH key management (no keys to lose, rotate, or revoke)
• Positive: Complete audit trail of all session activity in CloudTrail and CloudWatch Logs
• Positive: Access controlled by IAM policies — same MFA requirements as console access
• Negative: Requires SSM Agent on all instances (included by default on Amazon Linux)
• Negative: Requires VPC Endpoints in private subnets (additional cost: ~$50/month per region)
• Neutral: Session Manager supports port forwarding, allowing local development tools to connect to private resources (e.g., database GUIs connecting to Aurora through an SSM tunnel)
Appendix I: Recommended AWS Documentation and
Further Reading

The following official AWS resources are the most valuable references for the topics covered in this book. I refer to these regularly and recommend bookmarking all of them.

Architecture and Reliability:

• AWS Well-Architected Framework — Reliability Pillar: [Link]/wellarchitected/latest/reliability-pillar/
• AWS Multi-Region Fundamentals (Prescriptive Guidance): [Link]/prescriptive-guidance/latest/aws-multi-region-fundamentals/
• Disaster Recovery of Workloads on AWS (Whitepaper): [Link]/whitepapers/latest/disaster-recovery-workloads-on-aws/
• Aurora Global Database DR and Failover: [Link]/AmazonRDS/latest/AuroraUserGuide/aurora-global-[Link]

Security:

• AWS Security Reference Architecture: [Link]/prescriptive-guidance/latest/security-reference-architecture/
• IAM Best Practices: [Link]/IAM/latest/UserGuide/best-[Link]
• KMS Multi-Region Keys: [Link]/kms/latest/developerguide/multi-[Link]

Networking:

• VPC Endpoints documentation: [Link]/vpc/latest/privatelink/
• Transit Gateway inter-region peering: [Link]/vpc/latest/tgw/tgw-[Link]
• VPC Flow Logs: [Link]/vpc/latest/userguide/[Link]
Cost Optimization:

• AWS FinOps Framework: [Link]/aws-cost-management/
• Savings Plans Overview: [Link]/savingsplans/latest/userguide/
• Cost Anomaly Detection: [Link]/cost-management/latest/userguide/[Link]

Operations:

• Route 53 ARC documentation: [Link]/r53recovery/latest/dg/
• Systems Manager Session Manager: [Link]/systems-manager/latest/userguide/[Link]
• AWS Resilience Hub: [Link]/resilience-hub/latest/userguide/

These appendices serve as a living reference companion to the main ebook. As AWS releases new
services and features, specific CLI parameters and service configurations will evolve — always
validate against the current AWS documentation before implementation. The architectural patterns
and design principles, however, tend to remain stable even as the tooling changes.
Appendix J: SLA Validation Summary

99.99% SLA Readiness Scorecard

Platform: FinServ Co. Multi-Region Payment Processing Platform
Primary Region: ap-south-1 (Mumbai) | Secondary Region: ap-southeast-1 (Singapore)
Assessment Date: Q1 2026 | Prepared by: Manish Kumar, AWS Solutions Architect

SLA Mathematics — What 99.99% Requires

| Metric | Allowable Budget | Status |
| --- | --- | --- |
| Annual downtime budget | 52 minutes 36 seconds | ✅ Within budget |
| Maximum single-incident duration | 5 minutes (RTO target) | ✅ 3m 47s achieved in last drill |
| Maximum data loss per incident | 1 second (RPO target) | ✅ 0.8s average replication lag |
| Incidents consuming >10% of annual budget | 0 | ✅ Zero to date |
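These budget figures follow directly from the availability target. A quick check in Python (using a 365.25-day year, which reproduces the 52-minutes-36-seconds figure):

```python
def downtime_budget(sla: float, days: float = 365.25):
    """Allowed downtime for an availability target, as (minutes, seconds)."""
    total_min = days * 24 * 60 * (1 - sla)
    return int(total_min), round(total_min % 1 * 60)

print(downtime_budget(0.9999))           # (52, 36): annual budget at 99.99%
print(downtime_budget(0.9999, days=30))  # (4, 19): per 30-day month
```

The same function answers "what does one more nine cost?": 99.999% collapses the annual budget to about 5 minutes 16 seconds, which is why the RTO target has to shrink with the SLA.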

Availability Pillar Scores

| Pillar | Control | Target | Measured | Status |
| --- | --- | --- | --- | --- |
| Compute | ECS Fargate multi-AZ spread | 3 AZs active | 3 AZs confirmed | ✅ |
| Database | Aurora Global DB replication lag | < 1,000 ms | 820 ms average | ✅ |
| Database | Aurora Multi-AZ within cluster | 3 instances | 3 healthy | ✅ |
| DNS/Traffic | Route 53 ARC routing controls | Both regions configured | Verified | ✅ |
| Edge | Global Accelerator failover time | < 30 seconds | 18 seconds | ✅ |
| Secrets | Cross-region secret replication | 100% replicated | 100% | ✅ |
| DR Drill RTO | Regional failover completion | < 5 minutes | 3m 47s | ✅ |
| DR Drill RPO | Data loss during failover | < 1 second | 0.8 seconds | ✅ |
| Synthetic Canary | Availability from external probe | 99.99% | 99.993% | ✅ |

60-Day Production Availability Record

| Month | Total Minutes | Downtime Minutes | Availability | Annual Budget Used |
| --- | --- | --- | --- | --- |
| Month 1 | 43,200 | 1.6 min | 99.996% | 3.0% |
| Month 2 | 43,200 | 1.0 min | 99.998% | 1.9% |
| Combined | 86,400 | 2.6 min | 99.997% | 4.9% |

Budget consumption rate: ~4.9% of the annual budget per 60 days → projected ~30% annual consumption. Well within budget.
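The scorecard arithmetic is easy to cross-check (here budget used is computed as a share of the annual 52.6-minute budget; the downtime figures come from the table above):

```python
ANNUAL_BUDGET_MIN = 365.25 * 24 * 60 * 0.0001   # ~52.596 minutes at 99.99%
downtime_min = 1.6 + 1.0                        # observed downtime, months 1 + 2
observed_min = 86_400                           # 60 days of production

availability = 1 - downtime_min / observed_min
used_pct = downtime_min / ANNUAL_BUDGET_MIN * 100
projected_pct = used_pct * 365.25 * 24 * 60 / observed_min

print(f"{availability:.3%}")    # 99.997%
print(f"{used_pct:.1f}%")       # 4.9% of annual budget consumed so far
print(f"{projected_pct:.0f}%")  # ~30% projected annual consumption
```

Recomputing these at each monthly operations review keeps the scorecard honest as new downtime accrues.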

Failover Drill History

| Drill Date | Type | RTO Achieved | RPO Achieved | Result | Notes |
| --- | --- | --- | --- | --- | --- |
| Week 6 | Planned regional failover | 4m 12s | 0.6s | ✅ PASS | First drill — minor ECS scaling delay |
| Week 10 | Unplanned simulation (AZ kill) | 0m 41s | 0s (Multi-AZ) | ✅ PASS | Sub-minute AZ recovery confirmed |
| Week 14 | Planned regional failover | 3m 47s | 0.8s | ✅ PASS | Fastest drill to date — team improving |

Next scheduled drill: Week 24 (simulated unplanned regional failure with database write load
active)

Risk Register — Open Items

| Risk | Severity | Mitigation | Owner | Due |
| --- | --- | --- | --- | --- |
| Fargate capacity constraints during real regional event | Medium | Raise Fargate quotas and pre-provision fallback capacity in secondary region | Infra Team | 30 days |
| Aurora instance class mismatch (secondary uses [Link] vs primary r6g.2xlarge) | Low | Scale secondary instances during failover via automation | Infra Team | Automated |
| Quarterly drill cadence requires calendar commitment | Low | Schedule next 3 drills in team calendar now | Ops Lead | 7 days |

SLA Certification Statement

Based on measured production metrics, validated DR drill results, and continuous monitoring data,
the FinServ Co. multi-region platform demonstrates the architectural controls and operational
readiness required to sustain a 99.99% SLA. The platform has maintained 99.997% availability
over the first 60 days of production operation, consuming less than 10% of its annual downtime
budget. All three disaster recovery drills have met or exceeded the defined RTO (< 5 minutes) and
RPO (< 1 second) targets.

Signed off: Manish Kumar, AWS Solutions Architect
Review cadence: Monthly scorecard update | Quarterly full re-assessment

This summary should be reviewed at each monthly operations meeting and updated after every DR

drill. Any single incident consuming more than 20% of the annual downtime budget (10 minutes 31
seconds) should trigger an immediate architecture review.
