AWS Infrastructure Event Readiness
AWS Guidelines and Best Practices
December 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Notices
This document is provided for informational purposes only. It represents
AWS’s current product offerings and practices as of the date of issue of this
document, which are subject to change without notice. Customers are
responsible for making their own independent assessment of the information
in this document and any use of AWS’s products or services, each of which is
provided “as is” without warranty of any kind, whether express or implied.
This document does not create any warranties, representations, contractual
commitments, conditions or assurances from AWS, its affiliates, suppliers or
licensors. The responsibilities and liabilities of AWS to its customers are
controlled by AWS agreements, and this document is not part of, nor does it
modify, any agreement between AWS and its customers.
Contents
Introduction
Infrastructure Event Readiness Planning
  What is a Planned Infrastructure Event?
  What Happens During a Planned Infrastructure Event?
Design Principles
  Discrete Workloads
  Automation
  Diversity/Resiliency
  Mitigating Against External Attacks
  Cost Optimization
Event Management Process
  Infrastructure Event Schedule
  Planning and Preparation
  Operational Readiness (Day of Event)
  Post-Event Activities
Conclusion
Contributors
Further Reading
Appendix
  Detailed Architecture Review Checklist
Abstract
This whitepaper describes guidelines and best practices for customers with
production workloads deployed on Amazon Web Services (AWS) who want to
design and provision their cloud-based applications to handle planned scaling
events, such as product launches or seasonal traffic spikes, gracefully and
dynamically. We address general design principles as well as provide specific
best practices and guidance across multiple conceptual areas of infrastructure
event planning. We then describe operational readiness considerations and
practices, and post-event activities.
Introduction
Infrastructure event readiness is about designing and preparing for
anticipated and significant events that have an impact on your business.
During such events, it is critical that the company's web service remains
reliable, responsive, and highly fault tolerant under all conditions and
changes in traffic patterns. Examples of such events are expansion into new territories,
new product or feature launches, seasonal events, or significant business
announcements or marketing events.
With Amazon Web Services (AWS), your company can scale up its
infrastructure in preparation for a planned scaling event on a dynamic,
adaptable, pay-as-you-go basis. Amazon's rich array of elastic and
programmable products and services gives your company access to the same
highly secure, reliable, and fast infrastructure that Amazon uses to run its own
global network and enables your company to nimbly adapt in response to its
own rapidly changing business requirements.
This whitepaper outlines best practices and design principles to guide your
infrastructure event planning and execution and how you can use AWS
services to ensure that your applications are ready to scale up and scale out as
your business needs dictate.
During the planned event, AWS customers can open support cases with AWS
for troubleshooting or real-time support (such as a server going down).
Customers who subscribe to the AWS Enterprise Support plan have the
additional flexibility to talk with support engineers immediately and to raise
critical severity cases if rapid response is required.
After the event, AWS resources are designed to automatically scale down to
appropriate levels to match traffic levels, or continue to scale up, as events
dictate.
Design Principles
Preparation for planned events starts at the beginning of any
implementation of a cloud-based application stack or workload, with a
design that follows best practices.
Discrete Workloads
A design based on best practices is essential to the effective management of
planned-event workloads at both normal and elevated traffic levels. From the
start, design discrete and independent functional groupings of resources
centered on a specific business application or product. This section describes
the multiple dimensions to this design goal.
Tagging
Tags are used to label and organize resources. They are an essential
component of managing infrastructure resources during a planned
infrastructure event. On AWS, tags are customer-managed, key-value labels
applied to an individual managed resource, such as a load balancer or an
Amazon Elastic Compute Cloud (Amazon EC2) instance. By referencing well-
defined tags that have been attached to AWS resources, you can easily
identify which resources within your overall infrastructure comprise your
planned-event workload, and then analyze those resources for
preparedness. Tags can also be used for cost-allocation purposes.
Tags can be used to organize, for example, Amazon EC2 instances, Amazon
Machine Images (AMIs), load balancers, security groups, Amazon
Relational Database Service (Amazon RDS) resources, Amazon Virtual Private
Cloud (Amazon VPC) resources, Amazon Route 53 health checks, and Amazon
Simple Storage Service (Amazon S3) buckets.
For examples of how to create and manage tags, and put them in Resource
Groups, see Resource Groups and Tagging for AWS.2
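As a sketch of tag-based workload identification, the snippet below builds the tag filter shape used by EC2 describe calls and selects matching resources from an in-memory list. The tag key and value ("event", "product-launch") are hypothetical examples, not names from this paper.

```python
# Sketch: identify the resources that make up a planned-event workload by tag.
# The "event"/"product-launch" tag is a hypothetical labeling convention.

def build_tag_filters(tags):
    """Build the Filters list accepted by EC2 describe_* calls
    (e.g. boto3's ec2.describe_instances) from a dict of tag key/values."""
    return [{"Name": f"tag:{key}", "Values": [value]} for key, value in tags.items()]

def select_event_resources(resources, tags):
    """Filter an in-memory list of resource descriptions (each carrying a
    'Tags' list of {'Key': ..., 'Value': ...}) down to the event workload."""
    def matches(resource):
        attached = {t["Key"]: t["Value"] for t in resource.get("Tags", [])}
        return all(attached.get(k) == v for k, v in tags.items())
    return [r for r in resources if matches(r)]

if __name__ == "__main__":
    fleet = [
        {"InstanceId": "i-001", "Tags": [{"Key": "event", "Value": "product-launch"}]},
        {"InstanceId": "i-002", "Tags": [{"Key": "event", "Value": "steady-state"}]},
    ]
    print([r["InstanceId"] for r in select_event_resources(fleet, {"event": "product-launch"})])
    print(build_tag_filters({"event": "product-launch"}))
```

The same filter dict can be passed unchanged to the EC2 API, so the event inventory and the live query stay in sync.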
Loose Coupling
When architecting for the cloud, design every component of your application
stack to operate as independently as possible from each other. This gives
cloud-based workloads the advantage of resiliency and scalability.
Managed Services
By using managed services and their service endpoints, you can leverage the
power of production-ready resources as part of your design solution for
handling increased volume, reach, and transaction rates during a planned
infrastructure event. You don’t need to provision and administer your own
servers that perform the same functions as managed services.
For more information on AWS service endpoints, see AWS Regions and
Endpoints.4 See also Amazon EMR,5 Amazon RDS,6 and Amazon ECS7 for
examples of managed services that have endpoints.
Serverless Architectures
Leverage AWS Lambda as a strategy to effectively respond to dynamically
changing processing loads during a planned infrastructure event. Lambda is
an event-driven, serverless computing platform. It’s a dynamically invoked
service that runs code (for example, Python, Node.js, or Java) in response
to events (delivered via notifications) and automatically manages the
compute resources that the code requires. Lambda doesn't require provisioning, prior to the event, of Amazon
Elastic Compute Cloud (EC2) resources. The Amazon Simple Notification
Service (Amazon SNS) can be configured to trigger Lambda functions. See
Amazon Simple Notification Service8 for details.
Lambda serverless functions can execute code that accesses or invokes other
AWS services, performing database operations, data transformations, object or file
retrieval, or even scaling operations in response to external events or internal
system load metrics. AWS Lambda can also generate new notifications or
events of its own, and even launch other Lambda functions.
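A minimal sketch of the event-driven pattern above follows: a Lambda-style handler that parses an SNS notification and returns the actions it would dispatch. The message fields ("action", "scale_to") are hypothetical; a real function would invoke other AWS services rather than return a plan.

```python
import json

# Sketch of an event-driven Lambda handler triggered by Amazon SNS.
# SNS delivers each notification under Records[].Sns.Message; the payload
# fields below ("action", "scale_to") are illustrative assumptions.

def handler(event, context):
    """Entry point Lambda would invoke for each SNS notification batch."""
    results = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # Dispatch on the notification payload; this stub just echoes a plan.
        results.append({"action": message.get("action"),
                        "scale_to": message.get("scale_to")})
    return results

if __name__ == "__main__":
    sns_event = {"Records": [{"Sns": {"Message": json.dumps(
        {"action": "scale-out", "scale_to": 12})}}]}
    print(handler(sns_event, None))
```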
AWS Lambda provides the ability to exercise fine control over scaling
operations during a planned infrastructure event. For example, Lambda can
be used to extend the functionality of Auto Scaling operations to perform
actions such as notifying third-party systems that they also need to scale, or
for adding additional network interfaces to new instances as they are
provisioned. See Using AWS Lambda with Auto Scaling Lifecycle Hooks9 for
examples of how to use Lambda to customize scaling operations.
Automation
Auto Scaling
A critical component of infrastructure event planning is Auto Scaling. Being
able to automatically scale an application’s capacity up or down according to
pre-defined conditions helps to maintain application availability during
fluctuations in traffic patterns and volume that occur in a planned
infrastructure event.
AWS provides Auto Scaling capability across many of its resources, including
EC2 instances, database capacity, containers, etc.
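The arithmetic behind such scaling decisions can be sketched as follows: size the fleet so each instance serves roughly a target level of load, clamped to the group's bounds. This mirrors the proportional logic of target-tracking policies; the bounds and target are illustrative assumptions, not AWS defaults.

```python
import math

# Sketch of the proportional logic behind target-tracking scaling: size the
# fleet so the metric returns to its target. min/max mirror the bounds of a
# hypothetical Auto Scaling group.

def desired_capacity(current_capacity, metric_value, target_value,
                     min_size=1, max_size=50):
    """Return the instance count needed to bring a utilization-style metric
    back to its target, clamped to the group's min/max size."""
    if metric_value <= 0:
        return min_size
    needed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, needed))

if __name__ == "__main__":
    # 10 instances at 90% CPU against a 60% target: scale out to 15.
    print(desired_capacity(10, 90.0, 60.0))
```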
Configuration Management/Orchestration
Integral to a robust, reliable, and responsive planned infrastructure event
strategy is the incorporation of configuration management and orchestration
tools for individual resource state management and application stack
deployment.
Diversity/Resiliency
Remove Single Points of Failure and Bottlenecks
When planning for an infrastructure event, analyze your application stacks for
any single points of failure (SPOF) or performance bottlenecks. For example,
is there any single instance of a server, data volume, database, NAT gateway,
or load balancer that would cause the entire application, or significant
portions of it, to stop working if it were to fail?
If you have a complex multistep workflow where there is a need to track the
current state of each step in the workflow, Amazon Simple Workflow Service
(SWF) can be used to centrally store execution history and make these
workloads stateless.
When a single compute resource can't meet the need, you can design your workloads
so that tasks and data are partitioned into smaller fragments and executed in
parallel across a cluster of compute resources. Distributed processing
should be stateless, since the independent nodes on which the partitioned
data and tasks are being processed may fail. In that case, the distributed
processing scheduling engine automatically restarts failed tasks on
another node of the cluster.
For more information on these types of workloads, see Big Data Analytics
Options on AWS.13
Load Balancing
With the Elastic Load Balancing service (ELB), a fleet of application servers can
be attached to a load balancer and yet be distributed across multiple
Availability Zones. When the EC2 instances in a particular Availability Zone
sitting behind a load balancer fail their health checks, the load balancer stops
sending traffic to those nodes. When combined with Auto Scaling, the
number of healthy nodes is automatically rebalanced across the remaining
Availability Zones, and no manual intervention is required.
It’s also possible to have load balancing across Regions by using Amazon
Route 53 and latency-based DNS routing algorithms. See Latency Based
Routing for more information.14
There are numerous techniques that can be used for load shedding such as
caching or latency-based DNS routing. With latency-based DNS routing, the
IP addresses of those application servers that are responding with the least
latency are returned by the DNS servers in response to name resolution
requests. Caching can take place close to the application, using an in-memory
caching layer such as Amazon ElastiCache. You can also deploy a caching
layer that is closer to the user’s edge location, using a global content
distribution network such as Amazon CloudFront.
For more information about ElastiCache and CloudFront, see Getting Started
with ElastiCache15 and Amazon CloudFront CDN.16
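The selection rule behind latency-based routing can be sketched in a few lines: return the regional endpoint answering with the lowest measured latency. The endpoint names and measurements below are illustrative only.

```python
# Sketch of the decision a latency-based DNS resolver makes: hand back the
# regional endpoint with the lowest observed latency. Hostnames and
# latencies are illustrative placeholders.

def pick_lowest_latency(latencies_ms):
    """Given {endpoint: measured latency in ms}, return the endpoint a
    latency-based routing policy would favor."""
    return min(latencies_ms, key=latencies_ms.get)

if __name__ == "__main__":
    measured = {"us-east-1.example.com": 82.0,
                "eu-west-1.example.com": 31.5,
                "ap-southeast-2.example.com": 210.0}
    print(pick_lowest_latency(measured))  # eu-west-1.example.com
```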
Mitigating Against External Attacks
There are numerous actions you can take at each of these layers to mitigate
against such an attack. For example, you can protect against saturation
events by overprovisioning network and server capacity or implementing
auto-scaling technologies that are configured to react to attack patterns. You
can also make use of purpose-built DDoS mitigation systems such as
application firewalls, dynamic load shedding at the edge using Content
Distribution Networks (CDNs), network layer threat pattern recognition and
filtering, or routing your traffic or requests through a DDoS mitigation
provider.
AWS provides automatic DDoS protection as part of AWS Shield Standard,
which is included with all AWS services in every AWS Region at no additional
cost. When a network- or transport-layer attack is detected, it is automatically
mitigated at the AWS border, before the traffic is routed to an AWS Region.
To make use of this capability it is important to architect your application for
DDoS-resiliency.
A DDoS-resilient reference architecture includes several AWS services that
can help you improve your web application's resiliency against DDoS attacks.
On AWS, you can implement a WAF from the AWS Marketplace or use AWS
WAF, which allows you to build your own rules or subscribe to rules managed
by Marketplace vendors. With AWS WAF, you can use regular rules to block
known bad patterns, or rate-based rules to temporarily block requests from
sources that match conditions you define and exceed a given rate. Deploy
these rules using an AWS CloudFormation template. If you have applications
distributed across many AWS accounts, deploy and manage AWS WAF rules
for your entire organization by using AWS Firewall Manager.
To learn more about deploying preconfigured protections with AWS WAF, see
AWS WAF Security Automations18. To learn more about rules available from
Marketplace vendors, see Managed Rules for AWS WAF19. To learn more
about managing rules with AWS Firewall Manager, see Getting Started with
AWS Firewall Manager20.
Cost Optimization
Reserved vs Spot vs On-Demand
Controlling the costs of provisioned resources in the cloud is closely tied to
the ability to dynamically provision these resources based on systems metrics
and other performance and health check criteria. With Auto Scaling, resource
utilization can be closely matched to actual processing and storage needs,
minimizing wasteful expense and underutilized resources.
Another dimension of cost control in the cloud is being able to choose from
the following: On-Demand instances, Reserved Instances (RIs), or Spot
Instances. In addition, DynamoDB offers a reservation capacity capability.
With On-Demand instances you pay for only the Amazon EC2 instances you
use. On-Demand instances let you pay for compute capacity by the hour with
no long-term commitments.
Spot Instances allow you to bid on spare Amazon EC2 computing capacity.
Spot Instances are often available at a discount compared to On-Demand
pricing, which significantly reduces the cost of running your cloud-based
applications.
When designing for the cloud, some use cases are better suited for the use of
Spot Instances than others. For example, since Spot Instances can be
reclaimed at any time once the Spot price rises above your bid, you should consider
running Spot Instances only for relatively stateless and horizontally scaled
application stacks. For stateful applications or expensive processing loads,
Reserved Instances or On-Demand instances may be more appropriate. For
mission-critical applications where capacity limitations are out of the
question, Reserved Instances are the optimal choice.
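The cost trade-off between these purchase options can be sketched as a blended-rate calculation across the fleet. The prices and discount percentages below are placeholders for illustration, not actual AWS rates.

```python
# Sketch: compare the hourly cost of a fleet under a mix of purchase
# options. The $0.10/hr rate and discount figures are placeholders,
# not actual AWS pricing.

def fleet_cost_per_hour(instances, on_demand_rate, mix):
    """mix maps a purchase option to (fraction of fleet, discount vs
    On-Demand). Returns the blended hourly cost of the whole fleet."""
    total = 0.0
    for fraction, discount in mix.values():
        total += instances * fraction * on_demand_rate * (1.0 - discount)
    return round(total, 2)

if __name__ == "__main__":
    # 100 instances: 40% RIs at ~40% off, 40% Spot at ~70% off, 20% On-Demand.
    mix = {"reserved": (0.4, 0.40), "spot": (0.4, 0.70), "on_demand": (0.2, 0.0)}
    print(fleet_cost_per_hour(100, 0.10, mix))  # 5.6 vs 10.0 all On-Demand
```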
Event Management Process
Infrastructure Event Schedule
Week 1:
• Nominate a team to drive planning and engineering for the
infrastructure event.
• Conduct meetings between stakeholders to understand the
parameters of the event (scale, duration, time, geographic reach,
affected workloads) and the success criteria.
• Engage any downstream or upstream partners and vendors.
Weeks 2-3:
• Review architecture and adjust as needed.
• Conduct operational review; adjust as needed.
• Follow best practices described in this paper and in footnoted
references.
• Identify risks and develop mitigation plans.
Planning and Preparation
Architecture Review
An essential part of your preparation for an infrastructure event is an
architectural review of the application stack that will experience the upsurge
in traffic. The purpose of the review is to verify and identify potential areas of
risk to either the scalability or reliability of the application and to identify
opportunities for optimization in advance of the event.
Reliability: The ability of a system to recover from infrastructure or
service failures, dynamically acquire computing resources to meet demand,
and mitigate disruptions such as misconfigurations or transient network
issues. Key areas: Service Limits; Multiple Availability Zones and Regions;
Scalability; Health Check/Monitoring; Backup/Disaster Recovery (DR);
Networking; Self-Healing Automation.
Performance Efficiency: The ability to use computing resources efficiently
to meet system requirements, and to maintain that efficiency as demand
changes and technologies evolve. Key areas: Right AWS Services; Resource
Utilization; Storage Architecture; Caching; Latency Requirements.
Operational Excellence: The ability to run and monitor systems to deliver
business value and to continually improve supporting processes and
procedures. Key areas: Runbooks; Playbooks; Continuous
Integration/Continuous Deployment (CI/CD); Game Days; Infrastructure as
Code; Root Cause Analyses (RCAs).
Operational Review
In addition to an architectural review, which is more focused on the design
components of an application, review your cloud operations and
management practices to evaluate how well you are addressing the
management of your cloud workloads. The goal of the review is to identify
operational gaps and issues and take actions in advance of the event to
minimize them.
Service Limits
Cloud services providers typically have limits on the different resources that
you can use. Limits are usually imposed on a per-account and per-region
basis. The resources affected include instances, volumes, streams, serverless
invocations, snapshots, number of VPCs, security rules, and so on. Limits are a
safety measure against runaway code or rogue actors attempting to abuse
resources, and serve as a control to help minimize billing risk.
Some service limits are raised automatically over time as you expand your
footprint in the cloud, but most require that you request an increase by
opening a support case. Other limits can't be changed at all.
For more information on limits for various AWS services and how to check
them, see AWS Service Limits23 and Trusted Advisor.24
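A pre-event limit check can be sketched as follows: project each service's usage at the planned peak and flag any limit without enough headroom. The limit names and numbers are illustrative; real figures would come from Trusted Advisor or the service APIs.

```python
# Sketch of a pre-event service-limit check: flag any limit that the
# planned peak would exceed. Limit names and values are illustrative,
# not real account figures.

def limits_at_risk(limits, usage, planned_peak_factor):
    """Return the limits whose current usage, multiplied by the expected
    growth factor for the event, would exceed the account limit."""
    at_risk = {}
    for name, limit in limits.items():
        projected = usage.get(name, 0) * planned_peak_factor
        if projected > limit:
            at_risk[name] = {"limit": limit, "projected": projected}
    return at_risk

if __name__ == "__main__":
    limits = {"ec2-on-demand-instances": 200, "ebs-volumes": 5000}
    usage = {"ec2-on-demand-instances": 60, "ebs-volumes": 900}
    # A 4x traffic event: the EC2 limit needs a raise, EBS has headroom.
    print(limits_at_risk(limits, usage, planned_peak_factor=4))
```

Running this check weeks before the event leaves time to open the limit-increase support cases described above.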
Pattern Recognition
Baselines
You should document “back to healthy” values for key metrics prior to the
commencement of an infrastructure event. This helps you to determine when
an application/service is safely returned to normal levels following the
completion/end of the event. For example, identifying that the normal
transaction rate through a load balancer is 2,500 requests per second will
help determine when it is safe to begin wind down procedures after the
event.
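The "back to healthy" decision above can be sketched as a tolerance check against the documented baseline. The 2,500 requests-per-second figure echoes the example in the text; the 10% tolerance is an assumption to tune per metric.

```python
# Sketch: decide whether a key metric has returned to its documented
# pre-event baseline. The 10% tolerance is an assumed default, to be
# tuned for each metric.

def back_to_healthy(current, baseline, tolerance=0.10):
    """True once the metric is within `tolerance` of its baseline."""
    return abs(current - baseline) <= baseline * tolerance

if __name__ == "__main__":
    baseline_rps = 2500  # documented normal load-balancer transaction rate
    print(back_to_healthy(2650, baseline_rps))  # True: within 10%
    print(back_to_healthy(4000, baseline_rps))  # False: still elevated
```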
Proportionality
Review the proportionality of scaling required by the various components of
an application stack when preparing for an infrastructure event. This
proportionality is not always one-to-one. For example, a ten-fold increase in
transactions per second across a load balancer might require a twenty-fold
increase in storage capacity, number of streaming shards, or number of
database read and write operations, due to processing that might be taking
place in the front-facing application.
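This kind of proportionality review can be sketched as a worksheet that scales each component by its own factor rather than assuming one-to-one growth. The component names and per-component multipliers below are hypothetical values that would come from load testing.

```python
# Sketch of a proportionality worksheet: scale each stack component by its
# own factor instead of assuming one-to-one growth. Component names and
# multipliers are hypothetical, derived from load testing.

def scaled_requirements(baseline, traffic_multiplier, proportionality):
    """baseline: current capacity per component. proportionality: extra
    growth factor per component relative to front-end traffic growth."""
    return {
        component: int(capacity * traffic_multiplier * proportionality.get(component, 1.0))
        for component, capacity in baseline.items()
    }

if __name__ == "__main__":
    baseline = {"load_balancer_rps": 2500, "stream_shards": 10, "db_reads_per_s": 4000}
    # A 10x traffic event where streams and reads grow twice as fast as traffic.
    factors = {"stream_shards": 2.0, "db_reads_per_s": 2.0}
    print(scaled_requirements(baseline, 10, factors))
```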
Communications Plan
Prior to the event, develop a communications plan. Gather a list of internal
stakeholders and support groups and identify who should be contacted at
various stages of the event in various scenarios, such as beginning of the
event, during the event, end of the event, post-event analysis, emergency
contacts, contacts during troubleshooting situations, etc.
• Stakeholders
• Operations managers
• Developers
• Support teams
• Cloud service provider teams
• Network operations center (NOC) team
As you gather a list of internal contacts you should also develop a contact list
of external stakeholders involved with the continuous live delivery of the
application. These stakeholders include partners and vendors supporting key
components of the stack, downstream and upstream vendors providing
external services, data feeds, authentication services, and so on.
• Telecommunications vendors
• Live data streaming partners
• PR marketing contacts
• Advertising partners
• Technical consultants involved with service engineering
Ask for the following information from each provider:
NOC Preparation
Prior to the event, instruct your operations and/or developer team to create a
live metrics dashboard that monitors each critical component of the web
service in production as the event occurs. Ideally, the dashboard should
automatically present updated metrics every minute or at an interval that is
suitable and effective during the event.
Runbook Preparation
You should develop a runbook in preparation for the infrastructure event. A
runbook is an operational manual containing a compilation of procedures and
operations that your operators will carry out during the event. Event
runbooks can be outgrowths of existing runbooks used for routine operations
and exception handling. Typically, a runbook contains procedures to begin,
stop, supervise, and debug a system. It should also describe procedures for
handling unexpected events and contingencies.
Monitor
Monitoring Plan
Database, application, and operating system monitoring is crucial to ensure a
successful event. Set up comprehensive monitoring systems to effectively
detect and respond immediately to serious incidents during the infrastructure
event. Incorporate both AWS and customer monitoring data. Ensure that
monitoring tools are instrumented at the appropriate level for an application
based on its business criticality. Implementing a monitoring plan that
collectively gathers monitoring data from all of your AWS solution segments
will help in debugging a complex failure if it occurs.
• What monitoring tools and dashboards must be set up for the event?
• What are the monitoring objectives and the allowed thresholds? What
events will trigger actions?
• What resources and metrics from these resources will be monitored
and how often must they be polled?
• Who will perform the monitoring tasks? What monitoring alerts are in
place? Who will be alerted?
• What remediation plans have been set up for common and expected
failures? What about unexpected events?
• What is the escalation process in the case of operational failure of any
critical systems components?
The following AWS monitoring tools can be used as part of your plan:
• Amazon EC2 instance health: Used for viewing status checks and for
scheduling events for your instances based on their status, such as
auto-rebooting or restarting an instance.
• Amazon SNS: Used for setting up, operating, and sending event-
driven notifications.
• AWS X-Ray: Used to debug and analyze distributed applications and
microservices architecture by analyzing data flows across system
components.
• Amazon Elasticsearch Service: Used for centralized log collection,
real-time log analysis, and rapid, heuristic detection of problems.
• Third-party tools: Used for real-time analytics and full-stack
monitoring and visibility.
• Standard operating system monitoring tools: Used for OS-level
monitoring.
For more details about AWS monitoring tools, see Automated and Manual
Monitoring.25 See also Using Amazon CloudWatch Dashboards26 and Publish
Custom Metrics.27
Notifications
A crucial operational element in your design for infrastructure events is the
configuration of alarms and notifications to integrate with your monitoring
solutions. These alarms and notifications can be used with services such as
AWS Lambda to trigger actions based on the alert. Automating responses to
operational events is a key element to enabling mitigation, rollback, and
recovery with maximum responsiveness.
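The evaluation rule behind such alarms can be sketched as follows: fire only when a set number of consecutive datapoints breach the threshold, then hand off to the notification or automation step. The metric, threshold, and period count are illustrative, not prescribed values.

```python
# Sketch of a CloudWatch-style alarm evaluation: enter ALARM only when the
# last N datapoints all breach the threshold, avoiding flapping on a single
# spike. Threshold and period values are illustrative.

def alarm_state(datapoints, threshold, evaluation_periods):
    """Return 'ALARM' if the last `evaluation_periods` datapoints all
    exceed the threshold, else 'OK'."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(d > threshold for d in recent):
        return "ALARM"
    return "OK"

if __name__ == "__main__":
    latency_ms = [120, 480, 510, 530]  # p99 latency samples, one per period
    print(alarm_state(latency_ms, threshold=450, evaluation_periods=3))  # ALARM
    print(alarm_state(latency_ms, threshold=600, evaluation_periods=3))  # OK
```

In a live setup, the ALARM transition would publish to an SNS topic, which in turn can trigger the Lambda-based remediation described earlier.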
Operational Readiness (Day of Event)
War Room
During the event, have an open conference bridge with the following
participants:
• Business stakeholders
Throughout most of the event, conversation on this conference bridge
should be minimal. If an adverse operational event arises, the key people who
can respond to the event will already be on this bridge ready to act and
consult.
Leadership Reporting
During the event, send an email hourly to key leadership stakeholders. This
update should include the following:
At the conclusion of the event, a summary email should be sent in the
following format:
• Overall event summary with synopsis of issues encountered
• Final metrics
• Updated remedy plan that details the issues and resolutions
• Key points-of-contact for any follow-ups that stakeholders may have.
Contingency Plan
Each step in the event’s preparation process should have a corresponding
contingency action that has been verified in a test environment.
• What are the worst-case scenarios that can occur during the event?
• What types of events would cause a negative public relations impact?
• Which third-party components and services might fail during the
event?
• Which metrics should be monitored that would indicate that a worst-
case scenario is occurring?
• What is the rollback plan for each identified worst-case scenario?
• How long will each rollback process take? What is the acceptable
Recovery Point Objective (RPO) and Recovery Time Objective (RTO)?
(See Using AWS for Disaster Recovery28 for additional information on
these concepts.)
Post-Event Activities
Post-Mortem Analysis
We recommend a post-mortem analysis as part of an infrastructure event
management lifecycle. Post mortems allow you to collaborate with each team
involved and identify areas that might need further optimization, such as
operational procedures, implementation details, failover and recovery
procedures, etc. This is especially relevant if an application stack encountered
disruptions during the event and a root cause analysis (RCA) is needed. A
post-mortem analysis helps provide data points and other essential
information needed in an RCA document.
Wind-Down Process
Immediately following the conclusion of the infrastructure event, the wind-
down process should begin. During this period, monitor relevant applications
and services to ensure traffic has reverted back to normal production levels.
Use the health dashboards created during the event’s preparation phase to
verify the normalization of traffic and transaction rates. Wind-down periods
for some events may be linear and straightforward, while others may
experience uneven or more gradual reductions in volume. Some traffic
patterns from the event may persist. For example, recovering from a surge in
traffic generally requires straightforward wind-down procedures, whereas an
application deployment or expansion into a new geographical Region may
have long-lasting effects requiring you to carefully monitor new traffic
patterns as part of the permanent application stack.
At some point following the completion of the event, you must determine
when it is safe to end event management operations. Refer to the previously
documented “normal” values for key metrics to help determine when to
declare that an event is completed or ended. We recommend splitting wind-
down activities into two branches, which could have different timelines. Focus
the first branch on operational management of the event, such as sending
communications to internal and external stakeholders and partners, and the
resetting of service limits. Focus the second branch on technical aspects of
the wind-down such as scale-down procedures, validation of the health of the
environment, and criteria for determining whether architectural changes
should be reverted or committed.
The timeline associated with each of those branches can vary depending on
the nature of the event, key metrics, and customer comfort. We’ve outlined
some common tasks associated with each branch in Tables 2 and 3 to help
you determine the appropriate time-to-end management for an event.
Communications: Notification to internal and external stakeholders that the
event has ended. The time-to-end communication should be aligned with the
definition of the completion of the event. Use "back to healthy" metrics to
determine when it is appropriate to end communication. Alternatively, you
can end communication in tiers. For example, you could end the war room
bridge but leave the event escalation procedures intact in case of
post-event failures.
Service Limits/Cost Containment: Although it may be tempting to retain an
elevated service limit after an event, keep in mind that service limits are
also used as a safety net. Service limits protect you and your costs by
preventing excess service usage, be that from a compromised account or
misconfigured automation.
Reporting and Analysis: Data collection and collation of event metrics,
accompanied by analytical narratives showing patterns, trends, problem
areas, successful procedures, ad-hoc procedures, the timeline of the event,
and whether or not success criteria were met, should be developed and
distributed to all internal parties identified in the communications plan.
A detailed cost analysis should also be developed to show the operational
expense of supporting the event.
Optimization Tasks: Enterprise organizations evolve over time as they
continue to improve their operations. Operational optimization requires the
constant collection of metrics, operational trends, and lessons learned
from events to uncover opportunities for improvement. Optimization ties
back with preparation to form a feedback loop to address operational issues
and prevent them from reoccurring.
Service Limits/Cost Containment: Although it may be tempting to retain
elevated service limits after an event, keep in mind that service limits
also serve as a safety net. Service limits protect your operations and
operating costs by preventing excess service usage, either through
malicious activity stemming from a compromised account or through
misconfigured automation.
Scale Down Procedures: Revert resources that were scaled up during the
preparation phase. These items are unique to your architecture, but the
following examples are common:
• EC2/RDS instance size
• Reserved capacity
Validation of Health of Environment: Compare to baseline metrics and review
production health to verify that, after the event and after scale-down
procedures have been completed, the affected systems are reporting normal
behavior.
Disposition of Architectural Changes: Some changes made in preparation for
the event may be worth keeping, depending on the nature of the event and
observation of operational metrics. For example, expansion into a new
geographical Region might require a permanent increase of resources in that
Region, or raising certain service limits or configuration parameters, such
as the number of partitions in a database, shards in a stream, or PIOPS in
a volume, might be a performance-tuning measure that should be persisted.
Optimize
Perhaps the most important component of infrastructure event management
is the post-event analysis and the identification of operational and
architectural challenges observed and opportunities for improvement.
Infrastructure events are rarely one-time events. They might be seasonal or
coincide with new releases of an application, or they might be part of the
growth of the company as it expands into new markets and territories. Thus,
every infrastructure event is an opportunity to observe, improve, and prepare
more effectively for the next one.
Conclusion
AWS provides building blocks in the form of elastic and programmable
products and services that your company can assemble to support virtually
any scale of workload. With AWS infrastructure event guidelines and best
practices, coupled with our complete set of highly available services, your
company can design and prepare for major business events so that scaling
demands are met smoothly and dynamically, with fast response and global
reach.
Contributors
The following individuals and organizations contributed to this document:
Further Reading
For additional reading on operational and architectural best practices, see
Operational Checklists for AWS.29 We also recommend reviewing the AWS
Well-Architected Framework30 for a structured approach to evaluating
cloud-based application delivery stacks. AWS offers Infrastructure Event
Management (IEM) as a premium support offering for customers who want
more direct involvement of AWS Technical Account Managers and Support
Engineers in their design, planning, and day-of-event operations. For more
details about the AWS IEM premium support offering, see
Infrastructure Event Management.31
Appendix
Detailed Architecture Review Checklist
Security (Yes-No-N/A)
Y—N—N/A We rotate our AWS Identity and Access Management (IAM) access keys, user
passwords, and the credentials for the resources involved in our application at most every
3 months, per AWS security best practices. We apply a password policy in every
account, and we use hardware or virtual multi-factor authentication (MFA) devices.
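The 3-month rotation target can be checked mechanically. A sketch, assuming key creation dates have already been fetched (for example, via `iam.list_access_keys`); the key IDs and dates here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Sketch: flag access keys older than the ~3-month (90-day) rotation target.
# In practice, creation dates come from iam.list_access_keys(); these are
# hard-coded, hypothetical values.

def stale_keys(keys, max_age_days=90, now=None):
    """Return IDs of keys created before the rotation cutoff."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [k["id"] for k in keys if k["created"] < cutoff]

now = datetime(2018, 12, 1, tzinfo=timezone.utc)
keys = [
    {"id": "key-fresh", "created": datetime(2018, 11, 1, tzinfo=timezone.utc)},
    {"id": "key-stale", "created": datetime(2018, 6, 1, tzinfo=timezone.utc)},
]
print(stale_keys(keys, now=now))  # ['key-stale']
```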
Y—N—N/A We have internal security processes and controls for enforcing unique,
role-based, least-privilege access to AWS APIs, leveraging IAM.
Y—N—N/A We use IAM roles for EC2 instances wherever possible instead of embedding any
credentials inside AMIs.
Y—N—N/A We apply the latest security patches to our EC2 instances, both Windows and
Linux. We use operating system access controls, including Amazon EC2 security
group rules, VPC network access control lists, OS hardening, host-based firewalls,
intrusion detection/prevention, monitoring software configuration, and host inventory.
Y—N—N/A We ensure that network connectivity to and from the organization's AWS and
corporate environments uses encrypted transport protocols.
Y—N—N/A We apply a centralized log and audit management solution to identify and
analyze any unusual access patterns or malicious attacks on the environment.
Y—N—N/A We have security event and incident management, correlation, and reporting
processes in place.
Y—N—N/A We make sure that there isn't unrestricted access to AWS resources in any of
our security groups.
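A sketch of such a check; the rule dicts mirror the shape of `ec2.describe_security_groups` output, but the sample group is hypothetical. A world-open port may be intentional (such as 443 on a public web tier), so findings still need human review.

```python
# Sketch: find security group rules open to the world (0.0.0.0/0 or ::/0).
# The sample group below is hypothetical.

OPEN_CIDRS = {"0.0.0.0/0", "::/0"}

def world_open_rules(group):
    """Return (group_id, from_port) pairs for rules open to any address."""
    findings = []
    for perm in group.get("IpPermissions", []):
        ranges = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
        ranges += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
        if OPEN_CIDRS.intersection(filter(None, ranges)):
            findings.append((group["GroupId"], perm.get("FromPort")))
    return findings

sg = {"GroupId": "sg-example", "IpPermissions": [
    {"FromPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "Ipv6Ranges": []},
    {"FromPort": 22, "IpRanges": [{"CidrIp": "10.0.0.0/8"}], "Ipv6Ranges": []},
]}
print(world_open_rules(sg))  # [('sg-example', 443)]
```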
Y—N—N/A We use a secure protocol (HTTPS or SSL), up-to-date security policies, and
strong ciphers for front-end connections (client to load balancer), so that requests
between the clients and the load balancer are encrypted.
Y—N—N/A We configure our Amazon Route 53 MX resource record set to have a TXT
resource record set that contains a corresponding Sender Policy Framework (SPF) value
to specify the servers that are authorized to send email for our domain.
Y—N—N/A We architect our application for DDoS resiliency by using services that operate
from the AWS Global Edge Network, like Amazon CloudFront and Amazon Route 53, as
well as additional AWS services that mitigate against Layer 3 through 6 attacks (see
Summary of DDoS Mitigation Best Practices in the Appendix).
Reliability (Yes-No-N/A)
Y—N—N/A We deploy our application on a fleet of EC2 instances in an Auto Scaling group
to ensure automatic horizontal scaling based on pre-defined scaling plans. Learn more.
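One way to express such a scaling plan is a target-tracking policy. The sketch below builds the parameters that might be passed to boto3's `autoscaling.put_scaling_policy`; the group name and the 50% CPU target are hypothetical.

```python
# Sketch: a target-tracking scaling policy keeps the group's average CPU
# near a target by adding or removing instances. Values are illustrative.

def target_tracking_policy(asg_name, target_cpu_pct):
    """Build put_scaling_policy parameters for an average-CPU target."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "cpu-target-{}".format(target_cpu_pct),
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": float(target_cpu_pct),
        },
    }

policy = target_tracking_policy("web-asg", 50)
print(policy["PolicyType"])  # TargetTrackingScaling
```

With boto3, the dict would be passed as keyword arguments to `client("autoscaling").put_scaling_policy(**policy)`.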
Y—N—N/A We use an Elastic Load Balancing health check in our Auto Scaling group
configuration to ensure that the Auto Scaling group acts on the health of the underlying
EC2 instances. (Applicable only if you use load balancers in Auto Scaling groups.)
Y—N—N/A We deploy critical components of our applications across multiple Availability
Zones and appropriately replicate data between zones. We test how failure within these
components affects application availability using Elastic Load Balancing, Amazon Route
53, or any appropriate third-party tool.
Y—N—N/A In the database layer, we deploy our Amazon RDS instances in multiple
Availability Zones to enhance database availability by synchronously replicating to a
standby instance in a different Availability Zone.
Y—N—N/A We define processes for either automatic or manual failover in case of any
outage or performance degradation.
Y—N—N/A We use CNAME records to map our DNS name to our services. We DON’T use A records.
Y—N—N/A We configure a lower time-to-live (TTL) value for our Amazon Route 53 record
set. This avoids delays when DNS resolvers request updated DNS records when rerouting
traffic. (For example, this can occur when DNS failover detects and responds to a failure
of one of your endpoints.)
Y—N—N/A We have at least two VPN tunnels configured to provide redundancy in case of
outage or planned maintenance of the devices at the AWS endpoint.
Y—N—N/A We use AWS Direct Connect and have two Direct Connect connections
configured at all times to provide redundancy in case a device is unavailable. The
connections are provisioned at different Direct Connect locations to provide redundancy
in case a location is unavailable. We configure the connectivity to our virtual private
gateway to have multiple virtual interfaces configured across multiple Direct Connect
connections and locations.
Y—N—N/A We use Windows instances and ensure that we are using the latest paravirtual
(PV) drivers. PV drivers help optimize performance and minimize runtime issues and
security risks. We also ensure that the EC2Config agent is running the latest version on
our Windows instances.
Y—N—N/A We take snapshots of our Amazon Elastic Block Store (EBS) volumes to ensure
point-in-time recovery in case of failure.
Y—N—N/A We use separate Amazon EBS volumes for the operating system and
application/database data where appropriate.
Y—N—N/A We apply the latest kernel, software, and driver patches on any Linux instances.
Y—N—N/A We run a usage check report against our service limits and make sure that
current usage across AWS services is at or below 80% of the service limits. Learn more.
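The 80% check itself is simple arithmetic once usage and limits have been collected (for example, from Trusted Advisor). A sketch with illustrative figures:

```python
# Sketch: flag services whose usage exceeds a fraction of their limit.
# The limits and usage figures below are illustrative, not real quotas.

def over_threshold(usage, limits, threshold=0.80):
    """Return {service: usage_ratio} for services above the threshold."""
    return {svc: usage[svc] / limits[svc]
            for svc in usage
            if usage[svc] / limits[svc] > threshold}

limits = {"ec2_on_demand": 200, "eips": 5}
usage  = {"ec2_on_demand": 150, "eips": 5}
print(over_threshold(usage, limits))  # {'eips': 1.0}
```

Services flagged here are candidates for a limit-increase request well before the event.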
Y—N—N/A We understand that some dynamic HTTP request headers that Amazon
CloudFront receives (User-Agent, Date, etc.) can impact performance by reducing the
cache hit ratio and increasing the load on the origin. Learn more.
Y—N—N/A We ensure that the maximum throughput of an EC2 instance is greater than the
aggregate maximum throughput of the attached EBS volumes. We also use EBS-optimized
instances with PIOPS EBS volumes to get the expected performance out of the volumes.
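A quick way to reason about this check; the throughput figures below are illustrative, not official instance or volume specifications, which should be taken from the AWS documentation for your instance type.

```python
# Sketch: can the attached volumes collectively demand more throughput
# than the instance's EBS-optimized ceiling allows? Figures are illustrative.

def ebs_bottleneck(instance_max_mbps, volume_max_mbps):
    """Return (is_bottlenecked, aggregate_volume_throughput)."""
    aggregate = sum(volume_max_mbps)
    return aggregate > instance_max_mbps, aggregate

bottlenecked, aggregate = ebs_bottleneck(850, [250, 250, 250, 250])
print(bottlenecked, aggregate)  # True 1000 -> the instance caps the volumes
```

When the aggregate exceeds the instance ceiling, either a larger instance or fewer/slower volumes removes the mismatch.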
Y—N—N/A We ensure that the solution design doesn't have a bottleneck in the
infrastructure or a stress point in the database or the application design.
Y—N—N/A In our designs, we avoid using a large number of rules in the security group(s)
attached to our application instances, because a large number of rules in a security group
may degrade performance.
Y—N—N/A We note whether the infrastructure event may involve over-provisioned
capacity that needs to be cleaned up after the event to avoid unnecessary cost.
Y—N—N/A We right-size all of our infrastructure components, including EC2 instance size,
RDS DB instance size, caching cluster node size and count, Redshift cluster node size and
count, and EBS volume size.
Y—N—N/A We use Spot Instances where appropriate. Spot Instances are ideal for
workloads that have flexible start and end times. Typical use cases for Spot Instances are
batch processing, report generation, and high-performance computing workloads.
Notes
1. https://aws.amazon.com/answers/account-management/aws-tagging-strategies/
2. https://aws.amazon.com/blogs/aws/resource-groups-and-tagging/
3. https://aws.amazon.com/sqs/
4. http://docs.aws.amazon.com/general/latest/gr/rande.html
5. https://aws.amazon.com/emr/
6. https://aws.amazon.com/rds/
7. https://aws.amazon.com/ecs/
8. https://aws.amazon.com/sns/
9. https://aws.amazon.com/blogs/compute/using-aws-lambda-with-auto-scaling-lifecycle-hooks/
10. http://docs.aws.amazon.com/lambda/latest/dg/welcome.html
11. https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/
12. https://aws.amazon.com/answers/configuration-management/aws-infrastructure-configuration-management/
13. https://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS%20.pdf
14. http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html#routing-policy-latency
15. https://aws.amazon.com/elasticache/
16. https://aws.amazon.com/cloudfront/
17. https://docs.aws.amazon.com/waf/latest/developerguide/getting-started-ddos.html
18. https://aws.amazon.com/answers/security/aws-waf-security-automations/
19. https://aws.amazon.com/mp/security/WAFManagedRules/
20. https://docs.aws.amazon.com/waf/latest/developerguide/getting-started-fms.html
21. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts-on-demand-reserved-instances.html
22. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
23. https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
24. https://aws.amazon.com/about-aws/whats-new/2014/07/31/aws-trusted-advisor-security-and-service-limits-checks-now-free/
25. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring_automated_manual.html
26. http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Dashboards.html
27. http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html
28. https://aws.amazon.com/blogs/aws/new-whitepaper-use-aws-for-disaster-recovery/
29. http://media.amazonwebservices.com/AWS_Operational_Checklists.pdf
30. http://d0.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf
31. https://aws.amazon.com/premiumsupport/iem/