Section 1:
Introduction

About the exam & course setup
About the Data Engineer Associate Certification
Why get certified?
✓ Impactful way to advance your career
✓ Positioning as an expert
✓ Future-proof + great job opportunities
What is covered?
✓ AWS Certified Data Engineer Associate
✓ https://aws.amazon.com/certification/certified-data-engineer-associate
Demos
✓ Not needed for the exam
✓ Help with memorizing
✓ Give you a practical foundation
✓ Clear the exam with ease
Goal
✓ Knowledge for working with AWS
✓ Master the exam

Passing Score
✓ 720 / 1000
✓ Goal: Achieve a score of 850+

Free Trial Account
✓ Not needed for the exam
✓ Helps with memorizing
✓ Gives you practical knowledge
Exam Overview
❑ https://aws.amazon.com/certification/certified-data-engineer-associate
❑ Duration: 130 min
❑ 65 questions – multiple choice, multiple select
❑ Scenario-based questions – find the best solution
A data engineer needs to create an ETL process that automatically extracts data from an
Amazon S3 bucket, transforms it, and loads it into Amazon Redshift. Which AWS service is
the EASIEST way to achieve this?
❑ AWS Lambda
❑ AWS Glue ✓
❑ AWS Step Functions
❑ Amazon EMR
Recipe to clear the exam
Lectures: Step-by-step incl. demos, ~30-60 min / day
Quizzes: Practice and test your knowledge
Slides: Repeat and go through important points
Practice Test: Evaluate knowledge & weaknesses, eliminate weaknesses
Book Exam: Confident & prepared
Final Tips: Resources, Q&A section, reviews, connect & congratulate
Section 2:
Data Ingestion
AWS S3 – Storage

AWS S3 - Storage
Main Storage Solution
• One of the most important building blocks in AWS
• S3 = "Simple Storage Service"
• Cost-effective and simple object storage

Data Management
• Buckets (containers for storage) and objects (files)
• Simple web services interface
AWS S3 - Storage
• Buckets (containers for storage) and objects (files)
Data Management
• Each bucket is created in a specific region
AWS S3 - Storage
Rules
• Each bucket is created in a specific region
• Buckets must have a globally unique name (across all regions, across all accounts)
• Between 3 (min) and 63 (max) characters
• Only lowercase letters, numbers, dots (.), and hyphens (-)
• Must begin and end with a letter or number
• Not formatted as an IP address (for example, 192.168.5.4)
AWS S3 - Storage
Key
• Each object is identified by a unique, user-assigned key
• Upload: example.txt ⇒ key: documents/example.txt
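A minimal sketch of how bucket and key fit together when uploading an object with the AWS SDK for Python (boto3); the bucket and file names are hypothetical:

import boto3

s3 = boto3.client("s3")

# The bucket name must be globally unique; the key is the user-assigned
# identifier of the object inside the bucket (hypothetical names).
s3.upload_file(
    Filename="example.txt",                 # local file
    Bucket="my-data-engineer-demo-bucket",  # bucket (created in a specific region)
    Key="documents/example.txt",            # object key
)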
AWS S3 - Storage
Use Cases
• Backup & recovery
• Websites, applications
• Data archiving
• Data lakes
• … etc.
AWS S3 – Storage Classes

Storage Class           | Use Case                                                                 | Durability    | Availability
S3 Standard             | Frequently accessed data                                                 | 99.999999999% | 99.99%
S3 Intelligent-Tiering  | Data with unknown or changing access patterns                            | 99.999999999% | 99.90%
S3 Standard-IA          | Less frequently accessed data that requires rapid access when needed    | 99.999999999% | 99.90%
S3 One Zone-IA          | Same as Standard-IA, but stored in a single AZ for cost savings          | 99.999999999% | 99.50%
S3 Glacier              | Long-term archiving with retrieval times ranging from minutes to hours  | 99.999999999% | 99.99% (after retrieval)
S3 Glacier Deep Archive | Longest-term archiving where a retrieval time of 12 hours is acceptable | 99.999999999% | 99.99% (after retrieval)

Durability: 99.999999999% ("11 nines") – likelihood of losing an object in a year (1 in 100 billion)
Availability: Percentage of time that the service is operational
Lifecycle Rules: Define changes of storage class over time
Versioning: Allows you to retrieve previous versions of an object
Data Ingestion Methods
Streaming Ingestion
● Enables real-time ingestion
● Ideal for time-sensitive data
● More expensive and intricate
● Implemented using services like Amazon Kinesis for streaming data

VS

Batch Ingestion
● Ingests data periodically in batches
● Typically large volumes
● Cost-effective and efficient
● Tools like AWS Glue commonly used
AWS Glue
AWS Glue
▪ Fully-managed ETL service
▪ Designed to make it easy to load and transform data
▪ Visual interface:
  Easily create ETL jobs without code
▪ Various integrations:
  Amazon S3, Amazon Redshift, Amazon RDS,
  Kinesis Data Streams, DocumentDB etc.
AWS Glue
▪ Script auto-generated behind the scenes
▪ Uses Spark behind the scenes
(without need to manage anything)
▪ Serverless:
AWS Glue takes care of the underlying
infrastructure
▪ Pay-as-you-go:
You only pay for what you use
AWS Glue Data Catalog
▪ Centralized Data Catalog:
  Stores table schemas and metadata
  ⇒ allows querying by:
  Amazon Athena, Redshift, QuickSight, EMR
▪ Glue Crawlers:
  Scan data sources, infer the schema,
  and store it in the AWS Glue Data Catalog
  ⇒ Can automatically classify data
▪ Scheduling:
  Run on a schedule or based on triggers
  + incremental loads/crawling (ETL jobs & crawlers)
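A hedged boto3 sketch of creating and starting such a crawler; the crawler name, IAM role, database, S3 path, and schedule are hypothetical:

import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 path, infers the schema and stores it
# in the Glue Data Catalog (role, database and path are hypothetical).
glue.create_crawler(
    Name="device-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="device_data_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-engineer-demo-bucket/data/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run every day at 02:00 UTC
)

# Run it on demand as well
glue.start_crawler(Name="device-data-crawler")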
Section 3:
Querying with Athena
AWS Athena
AWS Athena
▪ Interactive query service:
Query files in S3 using SQL
▪ Serverless:
No infrastructure to manage
Pay-as-you-go
S3 Bucket ⇒ Crawler ⇒ Data Catalog ⇒ Athena ⇒ QuickSight
AWS Athena
▪ Log Analysis:
Analyzing log files stored in Amazon S3
▪ Ad-Hoc Analysis:
Ad-hoc queries on data lakes stored in S3
▪ Data Lake Analytics: Building a data lake on
Amazon S3 and using Athena to query data
▪ Real-Time Analytics:
Integrating Athena with streaming data sources
such as Amazon Kinesis
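A minimal sketch of running one of these ad-hoc queries with boto3; the database, table, SQL, and result bucket are hypothetical:

import time
import boto3

athena = boto3.client("athena")

# Start a SQL query against a table registered in the Glue Data Catalog
# (database, table and output location are hypothetical).
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS cnt FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/output/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])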
Athena
Federated Queries
Athena
Federated Query:
▪ Query data sources other than S3 buckets using a data
connector.
• Relational and non-relational data sources
• Object data sources
• Custom data sources
▪ Federated data sources: (built-in examples)
• Amazon CloudWatch Logs,
• Amazon DynamoDB,
• Amazon DocumentDB,
• Amazon RDS,…
▪ Work with multiple sources
• Amazon RDS – products table
• Amazon DocumentDB - detailed customer profile data
• Amazon DynamoDB - user interactions
Athena
Workgroups
Athena
Workgroups:
Isolate queries from other queries in the same account
▪ Isolate queries for different…
• Teams
• Use cases
• Applications
⇒ Different settings or to track and control cost
▪ Control…
• Query execution settings
• Access
• Cost
• Type of engine (Athena SQL vs. Apache Spark)
▪ Up to 1000 workgroups per region
▪ Each account has primary workgroup
Athena - Cost
Cost:
▪ Pay for queries run – amount of data scanned
▪ Cost saving possible with reserved capacity
Athena - Performance Optimization
▪ Use partitions
  • Eliminate data partitions that need to be scanned (pruning)
▪ Use partition projection
  • Automate partition management
  • Speed up queries for tables with many partitions
▪ AWS Glue Partition Indexes:
  • Athena retrieves only the relevant partitions instead of loading all of them
  • Optimize query planning and reduce query runtime
Athena – Query Result Reuse
▪ What does it do?
  Reuses previous results that match your query and maximum age
▪ Benefits?
  Improves query performance & reduces cost

Athena – Query Result Reuse
▪ When to use?
  o Queries where the source data doesn't change frequently
  o Repeated queries
  o Large datasets with complex queries
Athena - Performance Optimization
▪ Data Compression:
Reduce file size to speed up queries
▪ Format Conversion:
transform data into an optimized structure such as Apache
Parquet or Apache ORC columnar formats
Section 4:
AWS Glue Deep Dive
Glue Costs
Glue Costs
▪ Crawlers:
  ▪ Hourly rate based on the number of DPUs used
  ▪ Billed per second with a 10-minute minimum
▪ What are DPUs?
  ▪ DPUs = Data Processing Units
  ▪ A single DPU provides 4 vCPUs and 16 GB of memory
▪ Data Catalog:
  o Up to a million objects for free
  o $1.00 per 100,000 objects over a million, per month
Glue Costs
▪ ETL jobs:
  ▪ Hourly rate based on the number of DPUs used
  ▪ Billed per second with a 10-minute minimum
  ▪ AWS Glue versions 2.0 and later have a 1-minute minimum
▪ How many DPUs are used?
  ▪ Apache Spark: Minimum of 2 DPUs – Default: 10 DPUs
  ▪ Spark Streaming: Minimum of 2 DPUs – Default: 2 DPUs
  ▪ Ray job (ML/AI): Minimum of 2 M-DPUs (high memory) – Default: 6 M-DPUs
  ▪ Python Shell job (flexible & simple): 1 DPU or 0.0625 DPU – Default: 0.0625 DPU
▪ Cost of DPUs
  o $0.44 per DPU-hour (may differ depending on region)
Glue Costs
▪ Glue Job Notebooks / Interactive Sessions:
▪ Used to interactively develop ETL code in notebooks
▪ Based on time session is active and number of DPUs
▪ Configurable idle timeouts
▪ 1-minute minimum billing
▪ Minimum of 2 DPUs – Default: 5 DPUs
Glue Costs
▪ ETL jobs cost example:
▪ Apache Spark job
▪ Runs for 15 minutes
▪ Uses 6 DPU
▪ 1 DPU-Hour is $0.44
⇒ Job ran for 1/4th of an hour and used 6 DPUs
⇒ 6 DPU * 1/4 hour * $0.44 = $0.66.
▪ Interactive Session cost example:
▪ Use a notebook in Glue Studio to interactively develop your ETL code.
▪ 5 DPU (default)
▪ Keep the session running for 24 minutes (2/5th of an hour)
⇒ Billed for 5 DPUs * 2/5 hour at $0.44 per DPU-Hour = $0.88.
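A small Python helper that reproduces the DPU-hour arithmetic from the two examples above; it ignores the per-job billing minimums, and the $0.44 rate is just the example rate from the slide:

DPU_HOUR_RATE = 0.44  # USD per DPU-hour (example rate, varies by region)

def glue_cost(dpus: int, runtime_minutes: float, rate: float = DPU_HOUR_RATE) -> float:
    """Cost of a Glue job or session: DPUs * hours * rate per DPU-hour."""
    return dpus * (runtime_minutes / 60) * rate

print(glue_cost(6, 15))   # Spark job example          -> 0.66
print(glue_cost(5, 24))   # Interactive session example -> 0.88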
AWS Budgets
AWS Budgets
▪ Alarms:
Set budgets & receive alarms when exceeded
▪ Actual & Forecasted
Help to manage cost
▪ Budget Types:
  o Cost budget
  o Usage budget
  o Savings Plans budget
  o Reservation budget
▪ Budgets are free:
  The first two action-enabled budgets are free, then $0.10/day
Stateful vs. Stateless
Stateful vs. Stateless
▪ Stateful:
Systems remember past interactions for influencing future ones.
▪ Stateless:
Systems process each request independently without relying on
past interactions.
Stateful vs. Stateless
Data Ingestion Context:
▪ Stateful:
Maintain context for each data ingestion event.
▪ Stateless:
Process each data ingestion event independently.
Stateful vs. Stateless
Data Ingestion in AWS:
● Amazon Kinesis:
Supports both stateful (Data Streams) and stateless (Data
Firehose) data processing.
● AWS Data Pipeline:
Orchestrates workflows for both stateful and stateless data
ingestion.
● AWS Glue:
Offers stateful or stateless ETL jobs with features like job
bookmarks for tracking progress.
Glue Transformations
AWS Glue – Extract Transform Load

Extract
• Amazon RDS, Aurora, DynamoDB
• Amazon Redshift
• Amazon S3, Kinesis

Transform
• Filtering: Remove unnecessary data
• Joining: Combine data
• Aggregation: Summarize data
• FindMatches ML: Identify records that refer to the same entity
• Detect PII: Identify and manage sensitive information
• Format conversion: CSV <-> Parquet <-> JSON <-> XML

Load
• Amazon RDS, Aurora, DynamoDB
• Amazon Redshift
• Amazon S3, Kinesis
Glue Workflows
AWS Glue – Glue Workflows
● Orchestrate multi-step data processing jobs, manage executions
and monitoring of jobs/crawlers.
● Ideally used for managing AWS Glue operations, but can also be leveraged for other services.
● Provides visual interface.
● You can create workflows manually or with AWS Glue Blueprints.
● Triggers initiate jobs and crawlers.
○ Schedule Triggers: Starts the workflow at regular intervals.
○ On-Demand Triggers: Manually start the workflow from AWS Console.
○ EventBridge Event: Launches the workflow based on specific events captured by
Amazon EventBridge.
○ On-Demand & EventBridge: Combination of On-Demand and EventBridge rules.
○ Lambda Function: A Lambda function with a trigger that invokes the workflow.
Glue Job Types
AWS Glue
Job Types
Spark ETL Jobs: ⇒ Large-scale data processing
  ⇒ 2 DPU to 100 DPU
Spark Streaming ETL Jobs: ⇒ Analyze data in real-time
  ⇒ 2 DPU to 100 DPU
Python Shell Jobs: ⇒ Suitable for lightweight tasks
  ⇒ 0.0625 DPU to 1 DPU
Ray Jobs: ⇒ Suitable for parallel processing tasks
AWS Glue
Execution Types
Standard Execution:
  ⇒ Designed for predictable ETL jobs
  ⇒ Jobs start running immediately
  ⇒ Guarantees consistent job execution times
Flex Execution:
  ⇒ Cost-effective option for less time-sensitive ETL jobs
  ⇒ Jobs may start with some delay
Partitioning
AWS Glue – Partitioning
● Enhances the performance of AWS Glue
  ○ Provides better query performance
  ○ Reduces I/O operations
● AWS Glue can skip over large segments within partitioned data
● AWS Glue can process each partition independently
● Provides cost efficiency by reducing query effort
● In AWS Glue, define partitioning as part of ETL job scripts; also possible within the Glue Data Catalog

Example folder structure (device-data-bucket):
/data/year=2023/month=01/day=01/…
/data/year=2023/month=02/day=01/…
/data/year=2024/month=01/day=01/…
/data/year=2024/month=02/day=01/…
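A hedged PySpark sketch of a Glue ETL script that writes output partitioned this way; the database, table, bucket path, and partition columns are hypothetical and assume the dataset already carries year/month/day columns:

# Sketch of a Glue ETL script that writes output partitioned by year/month/day
# so downstream queries can prune partitions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="device_data_db", table_name="raw_events"
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://device-data-bucket/data/",
        "partitionKeys": ["year", "month", "day"],  # creates year=/month=/day= folders
    },
    format="parquet",
)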
AWS Glue DataBrew
AWS Glue DataBrew
● Data preparation tool with a visual interface
● Cleaning and data formatting processes
● Pre-built transformations
● No coding required
● Automate data preparation

Example flow: Amazon S3 (Data Lake) ⇒ AWS Glue DataBrew ⇒ Amazon Redshift
AWS Glue DataBrew
AWS Glue DataBrew - Transformations
Project Where you configure transformation tasks
Step Applied transformation to your dataset
Recipe Set of transformation steps; can be saved and reused
Job Execution of a recipe on a dataset; output to locations such as S3
Schedule Schedule jobs to automate transformation
Data Profiling Understand quality and characteristics of your data
AWS Glue DataBrew - Transformations
● NEST_TO_MAP:
  ○ Convert columns into a map
  Example: (Name=Alice, Age=30, City=New York) ⇒ {'Name': 'Alice', 'Age': 30, 'City': 'New York'}
● NEST_TO_ARRAY:
  ○ Convert columns into an array
  Example: (Name=Alice, Age=30, City=New York) ⇒ ['Alice', 30, 'New York']
● NEST_TO_STRUCT:
  o Like NEST_TO_MAP but retains the exact data type and order
AWS Glue DataBrew - Transformations
● UNNEST_ARRAY:
  ○ Expands an array to multiple columns
  Example: ['Alice', 30, 'New York'] ⇒ (Name=Alice, Age=30, City=New York)
AWS Glue DataBrew - Transformations
● PIVOT
  o Pivot column and pivot values to rotate data from rows into columns
  Before:                     After:
  Product  Quarter  Sales     Product  Q1   Q2
  A        Q1       150       A        150  200
  A        Q2       200       B        180  210
  B        Q1       180
  B        Q2       210
● UNPIVOT
  o Pivot columns into rows (attribute + value)
  Before:                     After:
  Name   Age  City            Attribute  Value
  Frank  40   Miami           Name       Frank
                              Age        40
                              City       Miami
AWS Glue DataBrew - Transformations
● TRANSPOSE
  ○ Switch columns and rows
  Before:                     After:
  Name   Age  City            Attribute  Alice     Frank
  Alice  30   New York        Age        30        32
  Frank  32   Miami           City       New York  Miami
AWS Glue DataBrew - Transformations
Join ⇒ Combine two datasets.
Split ⇒ Split a column into multiple columns based on a delimiter.
Filter ⇒ Apply conditions to keep only specific rows in your dataset.
Sort ⇒ Arrange the rows in your dataset in ascending or descending order.
Date/Time Conversions ⇒ Convert strings to date/time formats or change between different date/time formats.
Count Distinct ⇒ Calculate the number of unique entries in a column.
Section 5:
Serverless Compute with
Lambda
AWS Lambda
AWS Lambda
● What is AWS Lambda?
○ Lets you run code without managing servers
⇒ automatically scaling based on demand
○ Serverless compute service
⇒ No need to provision or manage servers
▪ Various programming languages
o Python
o Java
o Node.js
o Go
o …
AWS Lambda
Use cases
○ Data processing tasks on data in Amazon S3 or DynamoDB
○ Event-driven ingestion:
■ S3
■ DynamoDB
■ Kinesis
⇒ Process real-time events, such as file upload
○ Automation:
Automate tasks and workflows by triggering Lambda functions in
response to events
AWS Lambda
Advantages
○ Scalable: Scales automatically depending on workload
○ Cost efficient: Pay only for what you use
○ Simplicity: No need to manage infrastructure
Typically stateless: Each invocation is independent and doesn't maintain state
AWS Lambda – S3 Notification
File upload to the S3 Bucket ⇒ S3 notification triggers the Lambda function ⇒ Lambda executes code
(e.g. file transfer, data processing)
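A minimal sketch of a Lambda handler for such an S3 notification; the bucket contents and processing step are hypothetical:

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The S3 notification event lists one or more records with bucket and key
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        print(f"Processing s3://{bucket}/{key} ({len(body)} bytes)")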
AWS Lambda – Kinesis Data Stream
Kinesis Data Streams as event source ⇒ the trigger continuously executes the Lambda function
(e.g. file transfer, data processing)
⇒ Executes in batches
⇒ Automatically scales
Lambda Layers
• Contain supplementary code:
  • library dependencies,
  • custom runtimes, or
  • configuration files

Without layers:
Each Lambda function (Function 1, Function 2) packages its own function code together
with its code dependencies, custom runtimes, configuration files, etc.
Lambda Layers
1. Package the layer content (.zip)
2. Create the layer in Lambda
3. Add the layer to your functions
4. The function can access the contents of the layer at runtime

With layers:
Each Lambda function (Function 1, Function 2) contains only its function code and
references Lambda Layer 1, which holds the shared code dependencies, custom runtimes,
configuration files, etc.
Benefits
1 Share Dependencies Across Multiple Functions
2 Separate Core Logic from Dependencies
3 Reduce Deployment Package Size
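A hedged boto3 sketch of steps 1-3 above; the layer name, zip file, runtime, and function name are hypothetical:

import boto3

lam = boto3.client("lambda")

# 1-2: package the layer content as a .zip and create the layer
with open("my-layer.zip", "rb") as f:
    layer = lam.publish_layer_version(
        LayerName="shared-dependencies",      # hypothetical layer name
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.12"],
    )

# 3: attach the layer to a function; 4: the function can then import its contents
lam.update_function_configuration(
    FunctionName="my-etl-function",           # hypothetical function name
    Layers=[layer["LayerVersionArn"]],
)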
Replayability
Replayability
Definition of Replayability:
○ Ability to reprocess or re-ingest data that has already been handled.
Why is it important?
○ Error Handling: Corrects processing mistakes and recovers lost data.
○ Data Consistency: Ensures uniform data across distributed systems.
○ Adapting to Changes: Adjusts to schema or source data changes.
○ Testing and Development: Facilitates feature testing and debugging
without risking live data integrity.
Replayability
Strategies for Implementing Replayability
○ Idempotent Operations:
Ensure repeated data processing yields consistent results.
○ Logging and Auditing:
Keep detailed records for tracking and diagnosing issues.
○ Checkpointing:
Use markers for efficient data process resumption.
○ Backfilling Mechanisms:
Update historical data with new information as needed.
Replayability
Replayability is an important safety net for data
processing.
It ensures systems are resilient, accurate, and
adaptable to changes.
Section 6:
Data Streaming
Amazon Kinesis
Amazon Kinesis
Collection of services:
Handling different aspects of data streaming
▪ Kinesis Data Streams
Ingest and process large volumes of streaming data
▪ Kinesis Firehose
Fully managed service to deliver streaming data to destinations more easily
⇒ Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk.
▪ Managed Apache Flink
(formerly Kinesis Data Analytics)
Analyze streaming data in real time using standard SQL queries
Amazon Kinesis
Use cases
Variety of real-time data processing and analytics use cases
1) Real-time analytics
Analyzing streaming data to gain insights
2) IoT Processing
Ingesting and processing data from IoT devices or sensors
3) Security and fraud detection:
Detecting anomalies and responding to security threats in real time
Amazon Kinesis Data Streams

Producer
• Device or application that generates and writes data to a Kinesis data stream (data source)
• Examples: AWS SDK, Kinesis Producer Library (KPL), Amazon Kinesis Agent

Data Record
• Unit (up to 1 MB) of data (e.g. JSON objects, log entries), formatted as a data record
• Includes a partition key, which determines the shard to which the record will be assigned

Data Stream
• Consists of shards (Shard 1, Shard 2, Shard 3, …)
Amazon Kinesis Data Streams
Producers (AWS SDKs and APIs, Kinesis Producer Library (KPL), custom applications, IoT devices)
write data records with a partition key that determines the shard.

Shards
• Basic units of capacity
• 1 shard: 1 MB/s in (in-throughput), 1000 records/s
           2 MB/s out (out-throughput)
• ⇒ Determines the stream's overall capacity for ingesting and processing data
• ⇒ Efficient parallel processing

Durability
• Configurable retention period (default 24 hours, up to 365 days)
• Replicated across multiple Availability Zones ⇒ resilient to failure
• Data is immutable
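A minimal boto3 sketch of a producer writing one record; the stream name and payload are hypothetical:

import json
import boto3

kinesis = boto3.client("kinesis")

# Write one data record; the partition key determines the target shard.
kinesis.put_record(
    StreamName="device-events",
    Data=json.dumps({"device_id": "sensor-42", "temperature": 21.7}).encode("utf-8"),
    PartitionKey="sensor-42",  # records with the same key land on the same shard
)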
Amazon Kinesis Data Streams
Scalability
• Add and remove shards dynamically
• Auto Scaling to automatically scale
• ⇒ Elastically scale up and down

Capacity mode
• Provisioned mode
  Must specify the number of shards
  Can increase and decrease the number
  Pay an hourly rate per shard
• On-demand mode
  Automatically scales shards based on throughput peaks over the last 30 days
  Default: 4 MB/s or 4,000 records/s
Amazon Kinesis Data Streams
Producers (AWS SDK, AWS SDKs and APIs, Kinesis Producer Library (KPL), custom applications,
IoT devices) write data to the Kinesis data stream (Shard 1, Shard 2, Shard 3, …).

Consumers
Get records from Amazon Kinesis Data Streams and process them (Kinesis Data Streams application):
• Kinesis Client Library (KCL)
• Kinesis Data Firehose
• Managed Apache Flink (formerly Kinesis Data Analytics)
• Lambda
• Other services
Throughput and Latency
Throughput
Throughput in AWS Kinesis Data Streams
● Definition:
Volume of data (in MB or records per second) ingested into or retrieved from a
Kinesis Data Stream.
● Measurement Units:
Units per second (e.g., Mbps, records per second)
● Real-World:
Actual rate of data processing, accounting for all factors.
● Shard-Based Scaling:
Scalable through number of shards; each shard adds fixed capacity to stream.
● Proportional Relationship:
Total stream throughput directly relates to shard count.
● Optimization Goal:
Improving capacity to process more data within timeframe for high-volume data
Bandwidth vs. Throughput
Bandwidth in AWS Kinesis Data Streams
● Definition:
  The maximum data transfer rate
● Theoretical Upper Limit:
  The potential maximum for throughput
Latency
Latency and Propagation Delay
● Definition Latency:
The time from initiating a process to the availability of the result.
● Propagation Delay:
Specific latency from when a record is written to when it's read by a consumer.
● Influencing Factor:
Significantly affected by the polling interval, or how often consumer
applications check for new data.
Latency
Latency and Propagation Delay
● Recommendation:
Poll each shard once per second per consumer application to balance efficiency
and avoid exceeding API limits.
● Kinesis Client Library (KCL) Defaults:
Configured to poll every second, keeping average delays below one second.
● Reducing Delays for Immediate Data Needs:
Increase the KCL polling frequency for quicker data access, with careful
management to avoid API rate limit issues.
Enhanced Fan-Out
Fan-Out
● A single stream distributes data to multiple consumers
● Distribute data to different applications

VS

Fan-In
● Multiple sources converge towards a single destination
● Destination can be another stream or another storage system
● Combine data from different sensors into a single stream
Enhanced Fan-Out
● Traditionally: Shared throughput (standard consumers)
⇒ Potential bottleneck
● Solution: Enhanced fan-out
⇒ Push data through HTTP/2
⇒ Each registered consumer its
own dedicated read throughput
(up to 20 consumers)
⇒ Each consumer: Up to 2 MB/second per shard
⇒ Add more consumers without creating bottleneck
⇒ E.g. 10 consumers ⇒ 20 MB/s instead of 2 MB/s
Enhanced Fan-Out - Benefits
● Increased Throughput:
Each consumer gets its own dedicated bandwidth,
allowing for faster data processing.
● Improved Scalability:
The system can handle more concurrent consumers
without performance degradation.
● Reduced Latency:
Improved latency (~70ms)
● Simplified Application Development:
You don't need to implement complex logic to
manage shared throughput between consumers.
Enhanced Fan-Out – When to Use?
● Higher Cost
● High number of consumers
● Require reduced latency
Troubleshooting &
Performance in Kinesis
Kinesis Data Stream
Performance Challenges for Producers:
Slow Write Rates
1) Problem: Service Limits & Throttling
Solution: Monitor Throughput Exceptions
2) Problem: Uneven data distribution to shards – "Hot Shards"
Solution: Effective partition key strategy
3) Problem: High throughput and small batches can be inefficient
Solution: Batch records to aggregate multiple records
Kinesis Data Stream
Performance Challenges for Consumers:
Slow Read Rates
1) Problem: Hitting shard read limits (per-shard limit)
Solution: Increasing shard count
2) Problem: Low value for maximum number of GetRecords per call
Solution: System-defaults are generally recommended.
3) Problem: Logic inside ProcessRecords takes longer than expected.
Solution: Change logic, test with empty records
Kinesis Data Stream
GetRecords returns empty records array
Every call to GetRecords returns a ShardIterator value (must be used in the next iteration).
ShardIterator NULL ⇒ Shard has been closed.
Empty records reasons:
1) No more data in the shard.
2) No data pointed to by the ShardIterator.
⇒ It is not an issue and usually automatically handled (Kinesis Client Library).
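A hedged boto3 sketch of the GetRecords loop described above, for a single shard; the stream and shard names are hypothetical, and in practice the Kinesis Client Library handles iterators and polling for you:

import time
import boto3

kinesis = boto3.client("kinesis")

# Get an iterator for one shard, then keep calling GetRecords with NextShardIterator.
iterator = kinesis.get_shard_iterator(
    StreamName="device-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",   # start from the oldest available record
)["ShardIterator"]

while iterator is not None:             # a missing/None iterator means the shard was closed
    response = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in response["Records"]:  # an empty list here is normal, not an error
        print(record["Data"])
    iterator = response.get("NextShardIterator")
    time.sleep(1)                       # poll roughly once per second per shard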
Kinesis Data Stream
Additional Issues
1) Problem: Skipped records
   Cause: Might be due to unhandled exceptions.
   Solution: Handle all exceptions in processRecords.
2) Problem: Expired shard iterators
   Cause: A shard iterator expires if GetRecords is not called for more than 5 minutes,
   often because the DynamoDB lease table used by Kinesis (KCL) does not have enough
   capacity (e.g. with a large number of shards).
   Solution: Increase the write capacity of the lease table.
Kinesis Data Stream
Additional Issues
3) Problem: Consumers falling behind
Solution: Increase retention
Monitor GetRecords.IteratorAgeMilliseconds or MillisBehindLatest metrics
- Spikes ⇒ API failures (transient)
- Steady increases ⇒ Limited resources or processing logic
Kinesis Data Stream
Additional Issues
4) Problem: Unauthorized KMS Master Key Permission Error
   Cause: Writing to an encrypted stream without the necessary permissions on the KMS master key.
   Solution: Ensure you have the correct permissions via AWS KMS and IAM policies.
Amazon Kinesis
Data Firehose
Overview
o Fully Managed Service:
Automates data streaming and loading, reducing the need
for manual setup and administration.
o Effortless Data Handling:
Captures, transforms, and loads data streams into AWS
data stores and analytic tools with minimal effort.
o Automatic Scaling:
Dynamically adjusts to the volume of incoming data,
providing seamless scalability.
Amazon Kinesis Data Firehose

Producers (generate data streams):
  Kinesis Data Streams, AWS CloudWatch Logs, AWS IoT, Amazon CloudWatch Events,
  custom applications

Buffering:
  Records are consumed in batches ⇒ near-real-time vs. efficient / performant
  Buffer by file size (up to 128 MB) or time interval (up to 900 seconds)
  ⇒ Elastically scales up and down

Transformations:
  AWS Lambda integrated ⇒ on-the-fly transformations

Reliability:
  Fallback S3 bucket to archive the original records on processing issues
  Retry mechanism

Consumers (destinations):
  Amazon S3, Redshift (indirect via S3), OpenSearch, Splunk
Key Features
o Near Real-Time Processing:
Buffers based on time and size, balancing prompt data handling
with the efficiency of batch deliveries.
o Broad Data Format Support:
Handles multiple data formats and conversions (e.g., JSON to
Parquet/ORC).
o Data Security and Transfer Efficiency:
Compresses and encrypts data to enhance security during transfer.
o Real-Time Analytics and Metrics:
Ideal for scenarios requiring quick data analysis, like log or
event data capture.
Pricing
Consumption-Based Pricing:
Costs are tied to the volume of data processed, suitable for various
operational scales.
Kinesis Data Streams
● Manual shard setup
● Coding consumers/producers
● Real-time (200 ms / 70 ms latency)
● Data storage up to 365 days

VS

Kinesis Firehose
● Fully managed experience
● Focus on delivery (efficient)
● Near-real-time
● Ease of use
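A minimal boto3 sketch of sending a record to a Firehose delivery stream, which then buffers and delivers it to its configured destination (e.g. S3); the stream name and payload are hypothetical:

import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers this record and delivers it in a batch to the destination.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"page": "/home", "user": "u-123"}) + "\n").encode("utf-8")},
)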
Managed Service for
Apache Flink
What is Amazon Managed Service for Apache Flink?
o Fully Managed Service:
  For querying and processing of data streams.
o Real-time stream processing and analytics:
  Scalable real-time analytics using Flink
  (e.g. real-time monitoring, real-time website traffic, streaming ETL)
  Supported languages: SQL, Python, Scala, Java
o Apache Flink under the hood:
  Open-source stream processing framework managed by AWS
o Serverless & Scalable
AMS Flink

Flink Sources:
  Kinesis Data Streams, Amazon MSK, Amazon S3, custom data sources

On-the-fly processing with Flink's streaming engine:
  Event-driven actions, anomaly detection
  Define real-time actions, algorithms in real-time
  Low latency / near real-time, minimal code
  Stateful computations: maintain state based on incoming data
  Checkpoints & snapshots ensure fault tolerance

Flink Sinks:
  e.g. Kinesis Data Streams, Amazon MSK, Amazon S3, custom destinations
Pricing
o Pay-as-You-Go
Consumption based
o Kinesis Processing Units (KPUs)
Charges are based on the number of KPUs
(1 KPU has 1 vCPU and 4 GB of memory)
o Application Orchestration
Each application requires one additional KPU for orchestration.
o Storage and Backups
per GB per month
o Automatic Scaling
Number of KPUs automatically scaled based on needs.
Manually provisioning possible.
o AMS Flink Studio (Interactive Mode)
Charged for two additional KPUs
Amazon Managed Streaming for Apache Kafka (MSK)
Overview
o Managed Kafka Service
Streamlines setup and management of Apache Kafka.
o Purpose
Enables processing of streaming data in real-time.
Architecture
o Kafka Brokers:
Servers that manage the publishing and subscribing of records
o ZooKeeper Nodes:
Coordinates Kafka brokers and manages cluster metadata.
High Availability & Storage
o Multi-AZ Deployment
Clusters distributed across Availability Zones for fault tolerance.
o EBS Storage
Utilizes Elastic Block Store for durable message storage, supporting
seamless data recovery.
Custom Configuration & Scalability
o Message Size Flexibility
Configurable message sizes (up to 10 MB), more flexibility than Kinesis (1
MB limit)
o Scalable Clusters
Easily adjustable broker counts and storage sizes to handle varying
workloads.
Producers & Consumers
o Producers
Applications that send data to MSK clusters.
o Consumers
Applications or services retrieving and processing data from MSK.
Comparison with Amazon Kinesis
o Customization vs. Convenience
MSK offers in-depth configurability; Kinesis offers easier setup with fixed
limits.
o Message Handling
MSK supports larger message sizes, critical for specific use cases.
MSK
● More granular control, more management
● More complex pipelines
● Message size: up to 10 MB (default 1 MB)
● Topics & Partitions

VS

Kinesis
● More managed experience
● Straightforward setup
● Message size: 1 MB limit
● Streams & Shards
MSK
● Encryption
  In-flight: TLS encryption OR plaintext possible
  At rest: supports KMS encryption
● Access Control:
  ○ Mutual TLS
  ○ SASL/SCRAM username/password authentication mechanism, also relying on Kafka ACLs
  ○ IAM Access Control (authentication and authorization using IAM)

VS

Kinesis
● Encryption
  In-flight: TLS encryption by default
  At rest: supports KMS encryption
● Access Control:
  ○ Uses IAM policies for both authentication and authorization
Section 7:
Storage with S3
Partitioning
Overview
• Partitioning data through a combination of bucket and
folder structure.
Importance of Folder Structure

Data Management
▪ Simplifying tasks such as data retention and archival
▪ Facilitating easy archiving or deletion of older data partitions

Query Performance
▪ Improving query performance by allowing queries to process only relevant data subsets
▪ Reducing the amount of data scanned in queries, leading to cost savings
Implementation of Partitioning

Organizing Data
▪ Organizing data into folders and subfolders, including the use of buckets
▪ Enabling the use of Glue crawlers to automatically create partition keys

Partitioning Examples
▪ Time-based partitioning example with S3 keys:
  year=2021/month=11/day=05/filename.txt
▪ Introduction of metadata tagging for custom metadata attachment

Importance of maintaining the Glue Catalog for managing metadata and defining partitions
Amazon Athena
Query using Athena
Storage Classes and Lifecycle Configuration

Storage Classes
S3 Storage Classes
S3 Standard
S3 Intelligent-Tiering
Amazon S3 Express One Zone
S3 Standard-Infrequent Access
S3 One Zone-Infrequent Access
S3 Glacier Instant Retrieval
S3 Glacier Flexible Retrieval
S3 Glacier Deep Archive
S3 Storage Classes
Amazon S3 Standard (S3 Standard)
• It is best for storing frequently accessed data.
• It delivers low latency and high throughput.
• Appropriate for: cloud applications, dynamic websites, content distribution,
  big data analytics, mobile and gaming applications
S3 Storage Classes
S3 Intelligent-Tiering
• Best for unknown or changing access patterns.
• Automatically moves data to the most cost-effective access tier based on access frequency:
  Frequent access tier, Infrequent access tier, Archive access tier
S3 Storage Classes
Amazon S3 Express One Zone
• Is a high-performance, single-Availability Zone storage class.
• It can improve data access speeds by 10x and reduce request costs by 50%
compared to S3 Standard.
• Data is stored in a different bucket type—an Amazon S3 directory bucket
S3 Storage Classes
S3 Standard-Infrequent Access (S3 Standard-IA)
• Is best for data that is accessed less frequently, but requires rapid
access when needed.
• ideal for long-term storage, backups, and as a data store for disaster
recovery files.
S3 One Zone-Infrequent Access (S3 One Zone-IA)
• Is best for data that is accessed less frequently, but requires rapid
access when needed.
• Stores data in a single AZ.
• Ideal for storing secondary backup copies of on premises data
S3 Storage Classes
S3 Glacier Storage Classes
• Purpose-built for data archiving:
  S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive
S3 Storage Classes
S3 Glacier Instant Retrieval
• Best for long-lived data that is rarely accessed and requires retrieval in milliseconds.
• Delivers the fastest access to archive storage.
• Ideal for archive data that needs immediate access.
S3 Storage Classes
S3 Glacier Flexible Retrieval
• Best for archive data that is accessed 1-2 times per year and is retrieved asynchronously.
• An ideal solution for backup, disaster recovery, and offsite data storage needs.
• Configurable retrieval times, from minutes to hours, with free bulk retrievals.
S3 Storage Classes
S3 Glacier Deep Archive
• The cheapest archival option.
• Supports long-term retention and digital preservation for data that may be accessed once or twice a year.
• Objects can be restored within 12 hours.
Lifecycle configuration
Lifecycle configuration
• A lifecycle configuration is a set of rules that define actions that Amazon S3 applies
  to a group of objects:
  Transition actions and Expiration actions
Lifecycle configuration
Supported lifecycle transitions
S3 Standard
S3 Standard-IA
S3 Intelligent-Tiering
S3 One Zone-IA
S3 Glacier Instant Retrieval
S3 Glacier Flexible Retrieval
S3 Glacier Deep Archive
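A hedged boto3 sketch of one such rule set; the bucket name, prefix, and day thresholds are hypothetical:

import boto3

s3 = boto3.client("s3")

# Transition objects under logs/ to Standard-IA after 30 days, to Glacier Flexible
# Retrieval after 90 days, and expire them after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-engineer-demo-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)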
Versioning
Versioning
• Helps you manage and preserve the versions of objects in your bucket
  (one S3 object can have multiple versions).
• Useful for:
  Update safety
  Delete protection
Versioning
• Disabled by default
• You can enable S3 bucket versioning using the S3 Console, AWS CLI, or AWS SDKs
Versioning
Versioning enabled: uploading the same S3 object again creates additional versions,
each identified by its own version ID
(e.g. Version ID: Null, Version ID: SwS…, Version ID: NXD…)
Versioning
Encryption and Bucket Policy
Encryption
Encryption
Encryption in transit
• Data travels encrypted between the client and the bucket using Secure Socket Layer /
  Transport Layer Security (SSL/TLS):
  Client (unencrypted) ⇒ SSL/TLS (encrypted) ⇒ Bucket (unencrypted)
Encryption
Encryption at rest
• Data is encrypted and then stored as it is.
• Can be done at Server Side or Client Side.
• All buckets have encryption at rest configured by default.
Encryption
Server Side Encryption methods
• Server-side encryption with Amazon S3 managed keys (SSE-S3)
• Server-side encryption with AWS KMS (SSE-KMS)
• Dual-layer server-side encryption with AWS KMS keys (DSSE-KMS)
• Server-side encryption with customer-provided keys (SSE-C)
Encryption
Server-side encryption with Amazon S3 managed keys (SSE-S3)
User uploads an unencrypted object ⇒ S3 encrypts it with an S3 managed key ⇒ object is stored encrypted in the bucket

Encryption
Server-side encryption with AWS KMS (SSE-KMS)
User uploads an unencrypted object ⇒ S3 encrypts it with a KMS managed key ⇒ object is stored encrypted in the bucket

Encryption
Dual-layer server-side encryption with AWS KMS keys (DSSE-KMS)
User uploads an unencrypted object ⇒ S3 applies two layers of encryption with KMS managed keys ⇒ object is stored encrypted in the bucket

Encryption
Server-side encryption with customer-provided keys (SSE-C)
User uploads an unencrypted object together with a client key over HTTPS ⇒ S3 encrypts it with the provided key ⇒ object is stored encrypted in the bucket

Encryption
Client-side Encryption
User encrypts the object locally with a client key ⇒ uploads the encrypted object ⇒ object is stored encrypted in the bucket
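A minimal boto3 sketch of requesting SSE-KMS for a single upload; the bucket, key, and KMS key ARN are hypothetical:

import boto3

s3 = boto3.client("s3")

# Ask S3 to encrypt this object at rest with a specific KMS key (SSE-KMS).
s3.put_object(
    Bucket="my-data-engineer-demo-bucket",
    Key="documents/confidential.txt",
    Body=b"sensitive content",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:eu-central-1:123456789012:key/11111111-2222-3333-4444-555555555555",
)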
Bucket Policy
Bucket Policy
• Defines permissions and access
control for objects within an
Amazon S3 bucket.
• Is written in JSON format.
Specifies
• Who (which users or services) can
access the bucket and
• What actions they can perform on
the bucket and its objects.
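A hedged sketch of attaching such a JSON policy with boto3; the account ID, role, and bucket name are hypothetical:

import json
import boto3

s3 = boto3.client("s3")

# Who (Principal) may do what (Action) on which resources (Resource).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAnalyticsRoleRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/AnalyticsRole"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-data-engineer-demo-bucket/*",
        }
    ],
}

s3.put_bucket_policy(
    Bucket="my-data-engineer-demo-bucket",
    Policy=json.dumps(policy),
)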
Access Points and Object Lambda

S3 Access Points
Access Points
• Lets you create customized entry points to a bucket, each with its own unique
  policies for access control.
• Multiple users access the same bucket, each through their own access point.
Access Points
Each Access Point has
• DNS name (Internet Origin or VPC Origin)
• Access point policy (similar to bucket policy)
Features
• Customized permissions
• Improved scalability and organization
• Enhanced security
Object Lambda
Object Lambda
With Object Lambda you can
• Transform data as it is retrieved
• No need to create a separate version of the data
Use Cases
Filtering Sensitive Converting Augmenting
Information Data Formats Data
Object Lambda
• Add your own code to process data
• No need to duplicate data

Flow: The user sends a data request to an Object Lambda Access Point ⇒ it invokes a
Lambda function (contains the transform logic), which reads the object from the bucket
through a standard access point ⇒ the transformed data is returned to the user.
Object Lambda
Use Cases
• Redact PII (personally identifiable information) for analytics
• Convert data formats such as XML to JSON
• Augment data with information from other services
S3 Event Notification
S3 Event Notification
S3 Event Notification
The Event can be
• New object created events
• Object removal events
• Restore object events
• Replication events
S3 Event Notification
• S3 Lifecycle expiration events
• S3 Lifecycle transition events
• S3 Intelligent-Tiering automatic archival events
• Object tagging events
• Object ACL PUT events
S3 Event Notification
S3 can send event notification messages to
Simple Notification Service topics
Simple Queue Service queues
Lambda function
EventBridge
S3 Event Notification
The Event Examples
• s3:ObjectCreated:Put:
Object is uploaded to the bucket using the PUT method.
• s3:ObjectCreated:Post:
Object is uploaded to the bucket using the POST method.
• s3:ObjectRemoved:Delete:
Object is deleted from the bucket.
Wildcards
• s3:ObjectCreated:*
Captures all object creation events
Prefix and Suffix Filters
Filter to objects in a directory to trigger the event (e.g. uploads/images/)
Filter to file types (e.g. '.jpg') to trigger the event
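A hedged boto3 sketch of configuring such a notification with prefix and suffix filters; the bucket name and Lambda function ARN are hypothetical:

import boto3

s3 = boto3.client("s3")

# Trigger a Lambda function for every object created under uploads/images/ with a .jpg suffix.
s3.put_bucket_notification_configuration(
    Bucket="my-data-engineer-demo-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-central-1:123456789012:function:process-image",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "uploads/images/"},
                            {"Name": "suffix", "Value": ".jpg"},
                        ]
                    }
                },
            }
        ]
    },
)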
S3 Event Notification
Example: A user uploads a file to the bucket ⇒ the s3:ObjectCreated:Put event triggers a
Lambda function ⇒ the function inserts a record into an RDS database (INSERT INTO …).
Section 8:
Other Storage Services
EBS
AWS EBS
• Scalable block storage solution in AWS
• EBS = Elastic Block Store
• Persistent storage for a variety of data types
• Attached to EC2 instances
• Bound to a single Availability Zone
• Highly scalable
AWS EBS - Storage
• Scalability
• Durability
Key features
• Block-level storage
• Persistent storage
• High performance
• Cost effective
EBS functionality
Select a volume type (SSD-based or HDD-based) ⇒ create the Amazon Elastic Block Store
volume ⇒ attach it to an EC2 instance ⇒ use it from your application
⇒ Bound to a specific AZ
⇒ Use snapshots to move data to a different AZ or region
EBS Snapshots
▪ Definition:
Instantaneous copies of AWS resources.
▪ Incremental Backups: Improves efficiency.
▪ Block-Level Backup: Detailed data capturing.
▪ Data Consistency: Before creating a snapshot, AWS ensures data consistency
by temporarily pausing I/O operations.
▪ S3 Compatible
▪ Lifecycle Management: Allows you to define lifecycle policies for EBS
snapshots, enabling automation of snapshot management tasks.
▪ Data Recovery: Dynamic use of snapshots.
▪ Cost Management: Reasonable cost.
EBS Capacity provisioning
• Increased capacity: modify the volume size
• Monitoring and optimization

Volume types (EBS Volume in AWS)
• GP2/GP3
• IO1/IO2
• ST1/SC1
• Magnetic

Provisioned IOPS
• Option to increase the volume size and to provision IOPS for IOPS volume types such as SSD (IO1)
Delete on Termination
Determines whether the volume should be automatically deleted when its
associated EC2 instance is terminated.
● Enabled:
The volume will be automatically deleted by AWS when the associated EC2 instance is
terminated.
● Disabled:
The EBS volume will persist even after the associated EC2 instance is terminated.
⇒ Managing the deletion attribute
Amazon EFS
Elastic File System
Amazon Elastic File System (Amazon EFS) provides serverless, fully elastic
file storage so that you can share files without provisioning or managing
storage capacity and performance.
▪ Multi-AZ Availability
▪ Scalability
▪ Shared File System
▪ Elasticity
▪ NFSv4.1 protocol
▪ Performance Modes:
  ▪ General Purpose (broad range of workloads)
  ▪ Max I/O (high throughput / IOPS)
▪ Pay-as-you-go pricing
POSIX Standard API
The POSIX (Portable Operating System Interface) standard defines a set of APIs for
compatibility between various UNIX-based operating systems.
▪ POSIX APIs provide a wide range of functionality.
▪ POSIX simplifies the porting of applications to Linux and fosters interoperability
  between different platforms.
Performance & Storage classes on EFS
• Performance scaling
• Automatic throughput scaling
• Performance and Throughput Modes
• Supports 1000 concurrent NFS connections
• Storage Classes: IA / One Zone-IA
Performance Mode (General Purpose)
○ Designed for a wide range of workloads, including latency-sensitive applications and
  those with mixed read/write operations.
○ Offers low latency and good throughput for most use cases.
○ Suitable for applications such as web serving, content management, and development
  environments.
○ Automatically scales performance based on the amount of data stored in the file system.

VS

Performance Mode (Max I/O)
○ Optimized for applications that require the highest levels of aggregate throughput and IOPS.
○ Provides higher IOPS and throughput compared to the General Purpose mode, making it
  suitable for highly parallel, I/O-intensive workloads (at the cost of slightly higher latencies).
○ Ideal for applications such as big data analytics, media processing, and database workloads.
○ Performance does not scale automatically based on data size; users need to manually
  adjust provisioned throughput capacity.
Throughput Mode (Bursting Throughput)
○ Designed for workloads with unpredictable or spiky access patterns.
○ Provides burst credits that allow the file system to achieve throughput levels higher
  than its baseline for short periods, enabling burst workloads to achieve high performance
  without provisioning throughput capacity.
○ Suitable for applications with intermittent usage patterns, such as development and test
  environments, or applications with periodic data processing tasks.

VS

Throughput Mode (Provisioned Throughput)
○ Designed for applications with predictable or sustained throughput requirements.
○ Users can provision a specific amount of throughput (in MiB/s) for the file system,
  ensuring consistent performance regardless of workload spikes or burst credits.
○ Suitable for applications with continuous data processing, high-volume data transfers,
  or large-scale analytics workloads where predictable performance is critical.
AWS Backup
AWS Backup
● Centralized backup for many AWS Services.
○ EC2, RDS, EFS, S3, DocumentDB…
● Provides automated backup scheduling.
● Cross-region & Cross-account backup capabilities.
● You can restore individual files or entire system.
● Compliance and Security:
○ Set rules for retentions and lifecycle of backups
○ Access controlled within IAM
● Process can be monitored.
AWS Backup
Define a backup plan for your AWS resources with its frequency and retention.

AWS Backup Vault Lock
Backup Vault ⇒ a container where backups are stored securely.
● Enhances security and compliance with immutable safeguards
AWS Backup
AWS Backup Vault Lock
● Immutability:
  ○ Policies you set become immutable.
  ○ Uses the WORM (Write Once Read Many) method.
  ○ No one, not even the root user, can change or delete the recovery points.
● Provides regulatory compliance
● Vault Lock Modes:
  • Compliance Mode ⇒ The policy cannot be changed or deleted for the lock period.
  • Governance Mode ⇒ Specified IAM users can update policies but cannot delete recovery points.
Section 9:
DynamoDB
DynamoDB
AWS DynamoDB
o Serverless NoSQL database
o NoSQL ⇒ "Not Only SQL" or "non-relational" database

Relational DB
● Structured data in tables
● Requires a predefined schema (tables + rows)
● Scalability: vertically (scale up)
● SQL queries (complex SQL queries possible)
● Optimized for complex queries
● Example use case: reporting

VS

NoSQL
● Structured, semi-structured and unstructured data
● No need for a fixed schema (schemaless)
● Scalability: horizontally (scale out)
● Varying ways of querying (no joins, no aggregations)
● Optimized for performance, large volumes & flexibility
● Example use case: big data, real-time analytics
AWS DynamoDB
o Data Model: Supports both key-value and document data model
o Fully-managed distributed database
o Performs very well under high traffic and high workloads
o High-availability & durability through distribution & AZ-replication
o Supports encryption at rest
o Millisecond latency
AWS DynamoDB
DynamoDB consists of:
- Tables: A collection of data
- Items: Individual records in a table (400 KB max per item)
- Attributes: Properties of an item
All tables in DynamoDB have a Primary Key
Books table – example items:

{
  "BookID": "B101",
  "Title": "The Great Adventure",
  "Author": "Ella Fitzgerald",
  "Price": 9.99,
  "Genres": ["Adventure", "Fantasy"],
  "PublishYear": 2010
}

{
  "BookID": "B102",
  "Title": "Cooking Simplified",
  "Author": "James Oliver",
  "Price": 20.00,
  "Genres": ["Cooking"],
  "PublishYear": 2015,
  "Publisher": {
    "Name": "Yummy Books",
    "Location": "New York"
  }
}

{
  "BookID": "B103",
  "Title": "Modern Web Development",
  "Author": "Lisa Ray",
  "Price": 31.50,
  "Genres": ["Technology", "Educational"],
  "PublishYear": 2018,
  "RelatedCourses": ["Web Development 101", "Advanced JavaScript"]
}

o Each JSON document is an item; each key/value pair is an attribute
o Each item has a unique identifier: the primary key (BookID)
o The table is schemaless
o Nested attributes are possible (e.g. "Publisher")
AWS DynamoDB
• Primary Keys
o Uniquely identifies items in a table.
o Items in a table cannot have the same “key”
o They must be scalar (string, number or binary)
o Specified at the creation time of the table
o Required for data organization and retrieval
• Types of Primary Keys
Partition Key (hash attribute):
A single “key” is used to distinguish items in a table
Composite Key:
Two “keys” are used to distinguish items in a table:
A partition and a sort key (range attribute)
Partition Key
▪ Also known as Hash Attribute
▪ Contains only a Partition Key
▪ Items cannot have the same partition key

VS

Composite Key
▪ Also known as Hash and Range Attribute
▪ Contains a Partition Key and a Sort Key
▪ Items can have the same partition key, provided the sort key is different
AWS DynamoDB

Simple Partition Key
Primary Key | Attributes
Student No  | Age | Degree
21829332    | 19  | Bcom
21789906    | 17  | Science
21689541    | 26  | Maths
22587413    | 30  | Tourism
(Partition Key: Student No)

Composite Key
Primary Key                  | Attributes
Student No | Graduation Year | Age | Degree
21829332   | 2020            | 19  | Bcom
21829332   | 2018            | 17  | Science
21689541   | 2016            | 26  | Maths
22587413   | 2014            | 30  | Tourism
(Partition Key: Student No, Sort Key: Graduation Year)
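A hedged boto3 sketch of writing and reading an item in a table with such a composite key; the table and attribute names are hypothetical:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Students")  # hypothetical table with a composite key

# Write an item identified by partition key (StudentNo) + sort key (GraduationYear)
table.put_item(
    Item={
        "StudentNo": "21829332",
        "GraduationYear": 2020,
        "Age": 19,
        "Degree": "Bcom",
    }
)

# Retrieve exactly one item by its full primary key
item = table.get_item(Key={"StudentNo": "21829332", "GraduationYear": 2020})["Item"]
print(item)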
AWS DynamoDB
Primary keys
o Most efficient way to retrieve data
o Determine the physical partition where data is stored
What if you want to query data using another attribute?
AWS DynamoDB
Secondary Indexes
o They allow for data in tables to be queried using an alternative
“key” than the Primary Key
o Allows for a range of query patterns
Secondary Index Types
o Local Secondary Index (LSI)
o Global Secondary Index (GSI)
AWS DynamoDB
Local Secondary Index ▪ Index has the same partition key as the
base table, but a different sort key.
▪ Created at the same time as the table and
cannot be modified after table creation.
▪ Maximum of 5 Local Secondary Indexes
per table
Global Secondary Index ▪ An index with a partition key and sort key
that can be different from the base table.
▪ Can be modified after table has been created
▪ Maximum of 20 GSI per table.
▪ If writes throttle, base table writes
will throttle too
AWS DynamoDB

ProductID (PK) | ProductName         | Category    | Price   | Manufacturer | StockQuantity
B101           | The Great Adventure | Books       | 9.99    | YummyBooks   | 100
E102           | Ultra HD TV 55''    | Electronics | 1200.00 | ViewSonic    | 30
B103           | Cooking Simplified  | Books       | 20.00   | TastyPress   | 50
E103           | Gamer Pro Keyboard  | Electronics | 129.99  | KeyCraft     | 75

Table Partition Key: ProductID
CategoryIndex (GSI): partition key Category (with Price as sort key)
ManufacturerIndex (GSI): partition key Manufacturer
AWS DynamoDB
How to create a Secondary Index?
o We use "projected attributes"
o Attributes that are copied from a table to an index
o The attributes to copy can be specified
o Maximum of 20 projected attributes per index
Three options for projected attributes:
o All: All of the attributes from the base table are projected into the index.
o Keys Only: Only the key attributes from the base table are projected into the
index.
o Include: Only specific attributes from the base table are projected into the
index.
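A hedged boto3 sketch of querying through a GSI instead of the table's primary key; the table and index names follow the earlier example and are hypothetical:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")  # hypothetical table from the example above

# Query by an attribute other than the table's primary key via a GSI
response = table.query(
    IndexName="CategoryIndex",                     # GSI with Category as partition key
    KeyConditionExpression=Key("Category").eq("Books"),
)
for item in response["Items"]:
    print(item["ProductID"], item.get("Price"))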
AWS DynamoDB
DynamoDB Streams
o Capture changes made to data in DynamoDB tables [Writes/Deletes/Updates]
o Changes are sent to a stream record
o Changes are chronologically ordered
o One record per single write operation (INSERT / UPDATE / DELETE)
o Real-time access
o Automatic shard management
(Table ⇒ changes ⇒ Stream)
AWS DynamoDB
DynamoDB Streams
o Optional feature – Disabled by default
o Records are all changes made to an item during one operation
o You can decide what should be recorded to the stream
Note:
o Changes made before activation of Stream are not recorded
o Data in the stream is retained for 24 hours
AWS DynamoDB
• Stream Record Options
  o KEYS_ONLY: view only the key attributes of the modified item
  o NEW_IMAGE: view the item after changes were made
  o OLD_IMAGE: view the item before changes were made
  o NEW_AND_OLD_IMAGES: view the item both before and after changes were made
AWS DynamoDB
Processing Streams
1. AWS Lambda:
2. Amazon Kinesis Data Streams ⇒ Data Firehose
3. Amazon Elasticsearch Service
4. Custom Applications
5. AWS Glue
6. Cross-Region Replication
AWS DynamoDB
Lambda
1. Enable stream on table
2. Create a Lambda Function
3. Configure Trigger: Event Source Mapping ⇒ trigger for new records
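A minimal sketch of the Lambda handler that the event source mapping would invoke; the processing step is hypothetical, and which images appear depends on the stream record option chosen above:

def lambda_handler(event, context):
    # Each record carries the event name, the item keys and (depending on the
    # stream view type) the new and/or old item images.
    for record in event["Records"]:
        event_name = record["eventName"]                # INSERT | MODIFY | REMOVE
        keys = record["dynamodb"]["Keys"]
        new_image = record["dynamodb"].get("NewImage")  # present for NEW_IMAGE / NEW_AND_OLD_IMAGES
        print(event_name, keys, new_image)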
AWS DynamoDB
● DynamoDB APIs
o APIs are used to interact with DynamoDB programmatically
o They allow for the management of data in DynamoDB tables
o Applications are to use simple API Operations to interact with DynamoDB
High Level Operations
o Control Plane
o Data Plane
o DynamoDB Streams APIs
o Transaction
AWS DynamoDB – High Level Operations
(o Control Plane  o Data Plane  o DynamoDB Streams APIs  o Transaction)

Control Plane
These operations allow you to create and manage DynamoDB tables
o CreateTable – Creates a new table.
o DescribeTable – Returns information about a table.
o ListTables – Returns the names of all of your tables in a list.
o UpdateTable – Modifies the settings of a table or its indexes.
o DeleteTable – Removes a table and all of its dependent objects from DynamoDB.
AWS DynamoDB – High Level Operations
(o Control Plane  o Data Plane  o DynamoDB Streams APIs  o Transaction)

Data Plane
These operations allow you to perform Create, Read, Update, and Delete on
data in tables. (CRUD operations)
o I. PartiQL APIs
o II: DynamoDB’s classic CRUD APIs
AWS DynamoDB – High Level Operations
(o Control Plane  o Data Plane  o DynamoDB Streams APIs  o Transaction)

Data Plane
PartiQL is a SQL-compatible query language for DynamoDB that can be used
to perform CRUD operations
I. PartiQL APIs :
o ExecuteStatement – Reads multiple items from a table. You can also
write or update a single item from a table.
o BatchExecuteStatement – Writes, updates, or reads multiple items from
a table.
AWS DynamoDB – High Level Operations
(o Control Plane  o Data Plane  o DynamoDB Streams APIs  o Transaction)

Data Plane
II. DynamoDB’s classic CRUD APIs:
The Primary key must be specified
1. Creating data
o PutItem – Writes a single item to a table.
o BatchWriteItem – Writes up to 25 items to a table.
AWS DynamoDB – High Level Operations
(o Control Plane  o Data Plane  o DynamoDB Streams APIs  o Transaction)

Data Plane
II. DynamoDB’s classic CRUD APIs:
The Primary key must be specified
2. Reading Data
o GetItem - Retrieves a single item from a table.
o BatchGetItem – Retrieves up to 100 items from one or more tables.
o Query – Retrieves all items that have a specific partition key.
o Scan – Retrieves all items in the specified table or index.
AWS DynamoDB – High Level Operations
(o Control Plane  o Data Plane  o DynamoDB Streams APIs  o Transaction)

Data Plane
II. DynamoDB’s classic CRUD APIs:
The Primary key must be specified
3. Updating data
o UpdateItem – Modifies one or more attributes in an item.
4. Deleting data
o DeleteItem – Deletes a single item from a table.
o BatchWriteItem – Deletes up to 25 items from one or more
tables.
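A short boto3 sketch of updating and deleting a single item (the "status" attribute is a hypothetical example):

import boto3

dynamodb = boto3.client("dynamodb")

# UpdateItem: modify one attribute; the full primary key must be specified
dynamodb.update_item(
    TableName="demo_table",
    Key={"pk": {"S": "user#1"}},
    UpdateExpression="SET #s = :s",
    ExpressionAttributeNames={"#s": "status"},
    ExpressionAttributeValues={":s": {"S": "inactive"}},
)

# DeleteItem: remove a single item by key
dynamodb.delete_item(
    TableName="demo_table",
    Key={"pk": {"S": "user#1"}},
)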
AWS DynamoDB
DynamoDB Streams APIs
Operations for enabling/disabling a stream on a table and allowing access to the data
modification records in a stream.
o ListStreams – Returns a list of all your streams, or a stream for a
specific table.
o DescribeStream – Returns information about a stream
o GetShardIterator – Returns a shard iterator
o GetRecords – Retrieves one or more stream records, using a given shard
iterator.
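A minimal sketch of these stream APIs with the boto3 "dynamodbstreams" client, reading from the first shard only (table name hypothetical; production consumers such as Lambda handle shards for you):

import boto3

streams = boto3.client("dynamodbstreams")

# ListStreams / DescribeStream: locate the stream and one of its shards
stream_arn = streams.list_streams(TableName="demo_table")["Streams"][0]["StreamArn"]
shard_id = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]["ShardId"]

# GetShardIterator: start reading from the oldest record in the shard
iterator = streams.get_shard_iterator(
    StreamArn=stream_arn,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# GetRecords: fetch the stream records behind that iterator
records = streams.get_records(ShardIterator=iterator)["Records"]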
AWS DynamoDB
Transactions
Transactions provide atomicity, consistency, isolation, and durability (ACID) and can use either PartiQL or the classic CRUD APIs.
PartiQL APIs:
o ExecuteTransaction – A batch operation that allows CRUD
operations on multiple items both within and across tables.
AWS DynamoDB
Transactions
DynamoDB’s classic CRUD APIs:
o TransactWriteItems – A batch operation that allows Put, Update,
and Delete operations on multiple items within and across tables.
o TransactGetItems – A batch operation that allows Get operations to
retrieve multiple items from one or more tables.
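A minimal TransactWriteItems sketch: either both writes below succeed or neither does (table and attribute names are hypothetical):

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.transact_write_items(
    TransactItems=[
        # Put a new order item
        {"Put": {
            "TableName": "orders",
            "Item": {"pk": {"S": "order#1"}, "total": {"N": "100"}},
        }},
        # Atomically increment the customer's order counter
        {"Update": {
            "TableName": "customers",
            "Key": {"pk": {"S": "user#1"}},
            "UpdateExpression": "SET order_count = order_count + :one",
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }},
    ]
)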
AWS DynamoDB
Data Types
Data Type – Definition – Types
Scalar Type – Can represent exactly one value – Number, String, Boolean
Document Type – Can represent a complex structure with nested attributes – List, Map
Set Type – Can represent multiple scalar values – String set, Number set, Binary set
Cost & Configuration
Read/Write Capacity Modes
On-Demand
o Billing: Pay per read/write operation without pre-specifying throughput.
o Flexibility: Ideal for unpredictable or spiky workloads.
o Performance: Maintains consistent, low-latency performance at any scale.
⇒ Requests never throttle
o Management: AWS handles scaling, reducing operational overhead.
o Cost Implications: More expensive for predictable workloads (premium for scalability).
Cost & Configuration
Read/Write Capacity Modes
Provisioned Mode
o Throughput: Specify expected reads/writes per second in RCUs and WCUs.
o Cost Efficiency: More economical for predictable workloads.
o Auto Scaling: Automatically adjusts throughput based on traffic, optimizing costs.
o Throttling Risk: Exceeding provisioned throughput can lead to throttling.
o Management Overhead: Requires monitoring and occasional manual adjustments.
Cost & Configuration
Read/Write Capacity Modes
Provisioned Mode
Reserved Capacity
o Long-Term Commitment:
Commit to specific RCUs and WCUs for 1 or 3 years.
o Discounted Pricing:
Reduced rates
o Use Case:
Suited for stable, predictable workloads over long periods.
Cost & Configuration
Read/Write Capacity Modes
Provisioned Mode
Auto Scaling
o Dynamic Scaling: Adjusts throughput based on utilization.
o Cost Optimization: Lowers costs by matching capacity to demand.
o Setup Complexity: Requires setting minimum, maximum, and target utilization levels.
o Response Time: Minor delays in scaling might occur, but it's designed to be responsive.
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
Write Capacity Unit
• Definition:
Measures write throughput; 1 WCU equals 1 write per second for items up to 1 KB.
• Consumption:
Items over 1 KB require more WCUs, e.g., a 3 KB item needs 3 WCUs per write.
• Provisioning:
Allocate WCUs based on expected writes.
Adjust with over-provisioning for spikes or use Auto Scaling for dynamism.
• Cost:
Pay for provisioned WCUs, used or not. Planning is important to manage expenses.
• Throttling:
Exceeding WCUs leads to throttled writes, potentially causing
ProvisionedThroughputExceededExceptions.
Opt for on-demand capacity for automatic scaling.
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
Read Capacity Unit
• RCU Definition: One strongly consistent read per second (or
two eventually consistent reads) for items up to 4 KB.
• Throughput Calculation: Larger items consume more WCUs/RCUs.
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
Read Consistency
• Eventually Consistent Reads:
Fast, high throughput, but data might be slightly outdated.
• Strongly Consistent Reads:
Ensures the latest data view at a higher cost.
• Application Needs:
Choose based on the criticality of data freshness vs. throughput
requirements.
AWS DynamoDB
Write Capacity Units (WCUs) & Read Capacity Units (RCUs)
Read Capacity Unit
• RCU Definition: One strongly consistent read per second (or
two eventually consistent reads) for items up to 4 KB.
• Throughput Calculation: Larger items consume more WCUs/RCUs.
Example 1
Total data: 1 item, 8 KB
Read type: strongly consistent
Conclusion: 2 RCUs to read an 8 KB item strongly consistently once per second.

Example 2
Total data: 10 items, each 3 KB
Read type: eventually consistent
Conclusion: 5 RCUs to perform eventually consistent reads of ten 3 KB items per second.
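As a sanity check of the arithmetic behind these examples, a small sketch that reproduces both numbers (helper names are my own, not AWS APIs):

import math

def rcus(item_size_kb, reads_per_second, strongly_consistent=True):
    # 1 RCU = 1 strongly consistent (or 2 eventually consistent) reads/s for items up to 4 KB
    units_per_read = math.ceil(item_size_kb / 4)
    if not strongly_consistent:
        units_per_read /= 2
    return math.ceil(units_per_read * reads_per_second)

def wcus(item_size_kb, writes_per_second):
    # 1 WCU = 1 write per second for items up to 1 KB
    return math.ceil(item_size_kb) * writes_per_second

print(rcus(8, 1, strongly_consistent=True))    # 2  (example 1)
print(rcus(3, 10, strongly_consistent=False))  # 5  (example 2)
print(wcus(3, 1))                              # 3  (3 KB item -> 3 WCUs per write)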
AWS DynamoDB
Performance Optimization
Hot Partitions
● Overloaded partitions due to uneven data distribution.
● Caused by poorly chosen partition keys.
Throttling
● Happens when requests exceed table/index throughput limits.
● Leads to higher latencies or request failures.
AWS DynamoDB
Performance Optimization
Best Practices to Avoid Throttling
● Even Data Distribution: Use well-designed partition keys.
● Exponential Backoff: Incorporate into retry logic.
● Monitor Throughput: Use CloudWatch, adjust settings as needed.
● Use DAX: Cache reads to reduce table load.
AWS DynamoDB
Performance Optimization
Burst Capacity
● Allows handling short traffic spikes.
● Draws on up to 5 minutes of unused read/write capacity that DynamoDB retains.
Adaptive Capacity
● Automatically balances throughput across partitions.
● Maintains performance, but key design still matters.
AWS DynamoDB
Read Write Capacity Modes
Reserved Capacity
o Purchase Reserved Capacity in advance for tables in standard table class.
o One-time upfront fee with a specified minimum provisioned usage level
over the commitment period
o Billed hourly at the rate of reserved capacity
o Unused reserved capacity is applied to accounts in the same AWS organization - can be
turned off
o Regional – not available for standard IA table class/On-demand mode
AWS DynamoDB
Read/Write Capacity Modes
DynamoDB Auto Scaling
o Define maximum and minimum read/write capacity units as well as target
utilization % within that range
o DynamoDB maintains target utilizations and auto-scales to meet
increases/decreases in traffic
o Helps avoid throttling, though requests can still throttle if traffic spikes faster than scaling responds
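A minimal sketch of configuring auto scaling for a table's read capacity via the Application Auto Scaling API (table name and limits are hypothetical):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target (5-100 RCUs)
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/demo_table",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=100,
)

# Target-tracking policy: keep consumed/provisioned utilization around 70%
autoscaling.put_scaling_policy(
    PolicyName="demo-table-read-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/demo_table",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)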
AWS DynamoDB
DynamoDB Accelerator (DAX)
DynamoDB's in-memory cache; improves read performance up to 10x, with microsecond latency.
To enable it: Create a DAX Cluster
o DAX Cluster: A DAX cluster has one or more nodes running on individual
instances with one node as the primary node.
o Accessing DAX: Applications access DAX through the endpoints of the DAX cluster
o “ThrottlingException”: Returned if requests exceed the capacity of a node
⇒ DAX limits the rate at which it accepts additional requests by returning a
ThrottlingException.
AWS DynamoDB
DynamoDB Accelerator (DAX)
Read Operations
If a requested item is in the cache (cache hit), DAX returns it to the application without
accessing DynamoDB.
If the item is not in the cache (cache miss), DAX forwards the request to DynamoDB.
API calls:
o BatchGetItem
o GetItem
o Query
o Scan
AWS DynamoDB
DynamoDB Accelerator (DAX)
Write Operations
Data is written to the DynamoDB table, and then to the DAX cluster.
API calls:
o BatchWriteItem
o UpdateItem
o DeleteItem
o PutItem
AWS DynamoDB
DynamoDB TTL
DynamoDB Time To Live feature allows for the automatic deletion of items in a table
by setting an expiration date.
o Expired items are typically deleted within 48 hours of the expiration date.
o TTL does not consume write throughput
o Expired items pending deletion can still be updated.
o The expiration timestamp must be in Unix epoch format (seconds)
o TTL deletions appear in DynamoDB Streams as system (service) deletions
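A minimal sketch of enabling TTL and writing an item that expires in one hour (the attribute name "expires_at" is a hypothetical choice):

import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on the table; the named attribute must hold a Unix epoch timestamp in seconds
dynamodb.update_time_to_live(
    TableName="demo_table",
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expires_at",
    },
)

# Write an item that expires roughly one hour from now
dynamodb.put_item(
    TableName="demo_table",
    Item={
        "pk": {"S": "session#1"},
        "expires_at": {"N": str(int(time.time()) + 3600)},
    },
)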
Section 10:
Redshift Data Warehouse
Amazon Redshift
Amazon Redshift
• Fully managed, highly available, cost-effective petabyte-scale
data warehouse service in the cloud.
• ANSI SQL compatible relational database.
• Can be accessed using Amazon Redshift query editor v2 or any other
Business Intelligence (BI) tool.
Amazon Redshift
• Handles large amounts of data.
• Supports multiple data sources, including
CSV JSON Apache Parquet
Amazon Redshift
Features
Scalable Columnar Storage
Massively Parallel Integration with Other AWS
Processing (MPP) Services
Advanced Compression Security Features
Amazon Redshift
Use cases include
Data Warehousing Business Intelligence
Analytics Log Analysis
IoT Data Processing Real-Time Dashboards
Amazon Redshift Cluster
Part One
Amazon Redshift Cluster
• Is the core infrastructure component of Redshift.
• Runs an Amazon Redshift engine and contains one or more databases.
• Executes workloads coming from external data apps.
• Uses replication and continuous backups.
• Automatically recovers from failures.
Amazon Redshift Cluster
(Diagram: a leader node coordinating multiple compute nodes.)
Leader Node
• Coordinates two or more compute nodes.
• Aggregates results from compute nodes.
• Develops the execution plan.
• Assigns a portion of the data to each compute node.
Amazon Redshift Cluster
Compute Nodes
• Have CPU, memory, and disk storage.
• Run the query execution plans.
• Transmit data among themselves.
• Capacity can be increased by increasing the number of nodes, upgrading the node type, or both.
• Use a provisioned cluster or Redshift Serverless.
Redshift Managed Storage
Redshift Managed Storage (RMS)
• Uses large, high-performance solid-state drives (SSDs) for fast local
storage.
• Uses S3 for longer-term durable storage.
• Pricing is the same for RMS regardless of whether the data resides in
high-performance SSDs or in S3.
RA3 and DC2 Node types
RA3 and DC2 Node types
RA3 nodes
• Uses Redshift Managed Storage (RMS).
• Separates compute and storage.
• Scale and pay for compute and managed storage independently.
• Supports multi Availability Zone (AZ)
RA3 and DC2 Node types
DC2 nodes
• Local SSD storage included.
• DC2 nodes store your data locally for high performance.
• Available on single Availability Zone (AZ) only.
• Recommended by AWS for datasets under 1 TB (compressed).
• You can add more compute nodes to increase the storage capacity of the
cluster.
Amazon Redshift Cluster
Resizing clusters
Used to change the node type or the number of nodes in the cluster
Node Slices:
• Compute nodes are split into slices
• Each handling a part of the workload.
• Leader node distributes data & tasks to the slices for parallel processing
Elastic resize
• Adjust number of nodes without disruption or restart
(by redistributing data slices across nodes)
• Completes quickly.
Can be used to …
• Add or remove nodes from your cluster.
• Change Node Type: From DS2 to RA3
Amazon Redshift Cluster
Classic resize
• Similar to elastic resize.
• Takes more time to complete
• Useful in cases where elastic resize is not applicable
Amazon Redshift
snapshots
Amazon Redshift snapshots
• Snapshots are point-in-time backups of a cluster.
• Snapshots are stored internally on S3.
Snapshot
• Snapshot can be taken manually or automatically.
Amazon Redshift snapshot
Automated Snapshot
• Incremental snapshots are taken automatically.
• Is enabled by default.
• By default, a snapshot is taken every eight hours or after every 5 GB per node of data changes (whichever comes first).
• The default retention period is one day.
Amazon Redshift snapshots
Manual Snapshots
• Can be taken any time.
• By default, manual snapshots are retained indefinitely.
• You can specify the retention period when you create a manual snapshot,
or you can change the retention period by modifying the snapshot.
Sharing data across AWS
Regions
Sharing data across AWS Regions
• You can share data across AWS Regions without the need to copy data
manually.
• Sharing can be done without Amazon S3 as a medium.
• With cross-Region data sharing, you can share data across clusters in the
same AWS account, or in different AWS accounts even when the clusters are
in different Regions.
Distribution Styles
Distribution Styles
• Clusters store data across compute nodes.
• Use Distribution Keys for that distribution
• Distribution Styles determine where data is stored (compute node)
• Distribute the workload uniformly among the nodes in the cluster
• Minimize data movement during query execution.
Distribution Styles
KEY distribution
• Rows are distributed according to the values in one column.
• The leader node places matching values on the same node slice.
• Useful for tables that participate frequently in joins
Distribution Styles
ALL distribution
• A copy of the entire table is distributed to every node.
• Multiplies the storage required by the number of nodes in the cluster.
• Is appropriate only for relatively small and slowly changing tables.
• Faster query operations
Distribution Styles
EVEN distribution
• Leader node distributes the rows across the slices.
(regardless of the values in any particular column)
• Appropriate when a table doesn't participate in joins.
• Appropriate when there isn't a clear choice between KEY distribution and
ALL distribution.
Distribution Styles
AUTO distribution
• Redshift assigns an optimal distribution style based on the size of the
table data.
• Is the default distribution Style
• How it works: Redshift initially assigns ALL distribution to small tables and changes the style (to KEY or EVEN) as the table grows.
• The change in distribution style occurs in the background with minimal
impact to user queries.
Vacuum and Workload
Management
Vacuum
Vacuum
• Re-sorts rows and reclaims space in either a specified table or all tables in the current database.
• Redshift automatically sorts data and runs VACUUM DELETE in the background.
• By default, VACUUM skips the sort phase for tables that are already more than 95% sorted.
• Users can access tables while they are being vacuumed.
• Redshift automatically performs a DELETE ONLY vacuum in the background.
Vacuum
Command Syntax
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX | RECLUSTER ]
[ [ table_name ] [ TO threshold PERCENT ] [ BOOST ] ]
• FULL : Sorts the specified table and reclaims disk space.
• SORT ONLY : Sorts the specified table (or all tables) without
reclaiming space
• DELETE ONLY : Reclaims disk space only.
• REINDEX : Rebuilds the indexes on the tables.
Useful for tables using Interleaved Sort Keys.
• RECLUSTER : Sorts the portions of the table that are unsorted.
Vacuum
• TO threshold PERCENT :
Specifies a threshold above which VACUUM skips sort phase &
reclaiming space in delete phase.
• BOOST :
Runs the VACUUM command with additional resources as they are available.
Redshift Integration
AWS Redshift Integration
Integrates with: S3, EMR, Lambda, DMS, EC2, Data Pipeline, MSK, Kinesis Data Streams (KDS), and DynamoDB.
Copy command
copy favoritemovies
from 'dynamodb://Movies'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
• Is used to load large amounts of data from outside of Redshift.
• Uses Amazon Redshift's massively parallel processing (MPP) architecture.
• Chooses an optimal compression scheme.
• Supported compression formats include gzip, lzop, and bzip2.
Moving data between S3
and Redshift
Loading Data from S3 to Redshift
• You can add data to your Amazon Redshift tables
either by using an INSERT or COPY command.
• Loading data with COPY
○ Is faster than INSERT.
○ Decrypts data as it is loaded from S3.
○ Can use a manifest file; requires an IAM role (or credentials) for authorization.
Unloading Data to S3 from Redshift
• The UNLOAD command allows you to export the results
of a query to Amazon S3.
• It supports exporting data in various formats,
including
CSV Apache Parquet ORC
Moving data between S3 and Redshift
Redshift Auto Copy from Amazon S3
• Automatically copies data from an Amazon S3 bucket into Amazon Redshift.
Enhanced VPC routing
• Enables you to route the network traffic through a VPC instead of the internet.
Amazon Aurora zero-ETL integration
• Allows changes made to the Amazon Aurora RDBMS
database to be replicated in the Amazon Redshift
database seconds after the Aurora updates.
• Eliminates the need for custom data pipelines
• Enables customers to analyze petabytes of
transactional data in near real time.
Data Transformation using
Amazon Redshift
Data Transformation using Amazon Redshift
Design patterns when moving data from source systems to a data warehouse
• Extract Transform Load (ETL)
• Extract Load Transform (ELT)
Data Transformation using Amazon Redshift
Extract Transform Load (ETL)
(Diagram: data is extracted from multiple sources, transformed, and then loaded into the target.)
Data Transformation using Amazon Redshift
Extract Load Transform (ELT)
(Diagram: data is extracted from multiple sources and loaded into the target MPP database, where it is then transformed.)
Data Transformation using Amazon Redshift
• Amazon Redshift provides a functionality to process all your data in one
place with its in-database transformation (ELT) capabilities.
• SQL Transformations
• Stored Procedures
• User-Defined Functions (UDFs)
Data Transformation using Amazon Redshift
• Amazon Redshift can connect to ETL platforms using JDBC and ODBC.
• Popular ETL platforms that integrate with Amazon Redshift include third-
party tools like
• Informatica
• Matillion
• dbt
• AWS-native tools like “AWS Glue”.
Amazon Redshift Federated
Queries
Amazon Redshift Federated Queries
• Used to combine and analyze data across different data sources.
• Reduces data transmission while optimizing efficiency.
• Eliminates the necessity of ETL pipelines.
Amazon Redshift Federated Queries
Amazon RDS
Amazon Aurora
Other data source
accessible via
JDBC
Amazon Redshift Federated Queries
• Uses external schema definitions.
• Can be used to incorporate live data.
• Distributes part of the computation for federated queries directly into
the remote operational databases.
• Uses parallel processing capacity.
Amazon Redshift Spectrum
Amazon Redshift Spectrum
• Enables you to run complex SQL queries directly against data stored in
Amazon S3.
• You define external tables in Redshift cluster that reference the data
files stored in S3.
Amazon Redshift Spectrum
• Handles the execution of queries by retrieving only the necessary data
from S3.
• Supports various data formats as well as gzip and Snappy compression.
• Resides on dedicated Amazon Redshift servers.
• Pushes many compute-intensive tasks down to the Redshift Spectrum layer.
Amazon Redshift Spectrum
• Scales intelligently.
• Redshift Spectrum tables can be created by defining the structure for
your files and registering them as tables in an external data catalog.
• The external data catalog can be the AWS Glue Data Catalog, Amazon Athena, or an Apache Hive metastore.
Amazon Redshift Spectrum
• Changes to the external data catalog are immediately available.
• You can partition the external tables on one or more columns.
• Redshift Spectrum tables can be queried and joined just as any other
Amazon Redshift table.
Amazon Redshift Spectrum
Amazon Redshift Spectrum considerations
• Redshift cluster and the S3 bucket must be in the same AWS Region.
• Redshift Spectrum doesn't support enhanced VPC routing with provisioned
clusters.
• Supports Amazon S3 access point aliases.
• Redshift Spectrum doesn't support VPC with Amazon S3 access point aliases
Amazon Redshift Spectrum
Amazon Redshift Spectrum considerations (continued)
• You can't perform update or delete operations on external tables.
• To create a new external table in the specified schema, you can use
CREATE EXTERNAL TABLE.
• To insert the results of a SELECT query into existing external tables on
external catalogs, you can use INSERT (external table).
• Unless you are using an AWS Glue Data Catalog that is enabled for AWS
Lake Formation, you can't control user permissions on an external table.
Amazon Redshift Spectrum
Amazon Redshift Spectrum considerations (continued)
• To run Redshift Spectrum queries, the database user must have permission
to create temporary tables in the database.
• Redshift Spectrum doesn't support Amazon EMR with Kerberos.
System Tables and Views
Redshift System Tables and Views
System Table System View
• Contain information • Views to organize
about how the system is information from the
functioning. system tables.
Redshift System Tables and Views
• Queried the same way as any other database tables.
• Some system tables can only be used by AWS staff for diagnostic purposes.
Redshift System Tables and Views
Types of system tables and views
• SVV views – details on database objects
  SVV_ALL_TABLES: see all tables
  SVV_ALL_COLUMNS: see a union of columns
• SYS views – monitor query and workload performance in clusters
  SYS_QUERY_HISTORY: see details of user queries
• STL views – generated from system logs for historical records
  STL_ALERT_EVENT_LOG: identify opportunities to improve query performance
  STL_VACUUM: statistics for tables that have been vacuumed
• STV tables – snapshots of the current system data
  STV_EXEC_STATE: information about queries & query steps actively running
• SVCS views – details about queries on both the main and concurrency scaling clusters
  SVCS_QUERY_SUMMARY: general information about the execution of a query
• SVL views – contain references to STL tables & logs for more detailed information
  SVL_USER_INFO: data about Amazon Redshift database users
Redshift Data API
Redshift Data API
• It is a lightweight, HTTPS-based API that is used for running queries
against Amazon Redshift.
❑ Lambda
❑ SageMaker notebooks
❑ Other web-based applications
(Typical callers: serverless, event-driven architectures, web applications, and third-party services.)
• Can be used to run SQL queries asynchronously.
• Can be used as an alternative to using JDBC or ODBC drivers.
• No need to manage and maintain WebSocket or JDBC connections.
• Can be setup with very little operational overhead (very easy setup!)
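A minimal Data API sketch with boto3: submit a statement asynchronously, then poll for the result (cluster, database, user, and table names are hypothetical):

import boto3

redshift_data = boto3.client("redshift-data")

# Submit a query asynchronously; no JDBC/ODBC connection to manage
resp = redshift_data.execute_statement(
    ClusterIdentifier="demo-cluster",   # hypothetical provisioned cluster
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT venuename, venuecity FROM venue LIMIT 10;",
)
statement_id = resp["Id"]

# Later: check the status and fetch the result set
status = redshift_data.describe_statement(Id=statement_id)["Status"]
if status == "FINISHED":
    rows = redshift_data.get_statement_result(Id=statement_id)["Records"]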
Redshift Data API
Serverless Execution
Asynchronous Processing
Direct Integration with AWS Services (e.g. Lambda / SageMaker)
Simplified Management
Redshift Data API
• Maximum duration of a query is 24 hours.
• Maximum number of active queries per Amazon Redshift cluster is 200.
• Maximum query result size (after compression) is 100 MB.
• Maximum retention time for query results is 24 hours.
Redshift Data API
• Maximum query statement size is 100 KB.
• Is available to query single-node and multiple-node clusters of the
following node types:
dc2.large dc2.8xlarge ra3.xlplus ra3.4xlarge
ra3.16xlarge
Access Control
• Authorize a user or service by adding the managed policy
⇒ AmazonRedshiftDataFullAccess
Monitoring Data API
• Monitoring of Data API events in EventBridge
• EventBridge routes data to targets such as AWS Lambda or SNS.
• Option to schedule Data API operations within EventBridge
Data Sharing
Data Sharing
• Used to securely share access to live data across
Clusters Workgroups AWS accounts AWS Regions
Availability Zones
• Data is live
Data Sharing
(Diagram: a producer cluster shares a datashare containing tables, views, and user-defined functions with a consumer cluster.)
Data Sharing
Datashare types: standard datashares, AWS Data Exchange datashares, and AWS Lake Formation-managed datashares.
Data Sharing
• The consumer is charged for all compute and cross-region data
transfer fees
• The producer is charged for the underlying storage of data
• The performance of the queries on shared data depends on the compute
capacity of the consumer clusters.
• Data sharing continues to work when clusters are resized or when the
producer cluster is paused.
Data Sharing
• Data sharing is supported for all provisioned ra3 cluster types and
Amazon Redshift Serverless.
• For cross-account and cross-Region data sharing, both the producer
and consumer clusters and serverless namespaces must be encrypted.
• You can only share SQL UDFs through datashares. Python and Lambda
UDFs aren't supported.
Data Sharing
• Adding external schemas, tables, or late-binding views on external
tables to datashares is not supported.
• Consumers can't add datashare objects to another datashare.
Workload Management
(WLM)
Workload Management (WLM)
• Helps you manage query performance for competing queries
• Define how resources are allocated to different queries.
• Uses query queues
⇒ Route queries to the appropriate queues
• Enables users to flexibly manage priorities within workloads.
• You can create up to 8 queues.
Workload Management (WLM)
Automatic Work Load Management
• Redshift manages:
▪ How many queries run concurrently
▪ How much memory is allocated to each dispatched query.
• When complex queries ⇒ lower concurrency
• When lighter queries ⇒ higher concurrency
Workload Management (WLM)
Automatic Work Load Management
• You can configure the following for each query queue:
Queue: configurable with different priorities and rules
User groups: assign a set of user groups to a queue
Query groups: assign a set of query groups to a queue
Priority: relative importance of queries
Concurrency scaling mode: automatically adds additional cluster capacity
Workload Management (WLM)
Manual Work Load Management
• Manage system performance by modifying WLM configuration.
• You can configure the amount of memory allocated to each queue.
⇒ Automatic usually higher throughput
⇒ Manual offers more control
Workload Management (WLM)
Short query acceleration (SQA)
• Prioritizes selected short-running queries over longer-running queries.
• Uses machine learning to predict the execution time of a query.
• Runs short-running queries in a dedicated space.
Workload Management (WLM)
Short query acceleration (SQA)
• CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT
statements, are eligible for SQA.
Redshift Serverless
Redshift Serverless
• No Cluster Management.
• On demand scaling.
• Pay as you go.
Redshift Serverless vs. Redshift provisioned
Node
● Serverless: automatically provisions and manages capacity for you.
● Provisioned: you build a cluster with node types that meet your cost and performance specifications.
Workload management and concurrency scaling
● Serverless: automatically manages resources efficiently and scales based on workloads, within the thresholds of cost controls.
● Provisioned: you enable concurrency scaling on your cluster to handle periods of heavy load.
Port
● Serverless: you can choose a port from the range 5431–5455 or 8191–8215.
● Provisioned: you can choose any port to connect.
Resizing
● Serverless: not applicable.
● Provisioned: cluster resize can be done to add or remove nodes.
Encryption
● Serverless: always encrypted with AWS KMS, with AWS managed or customer managed keys.
● Provisioned: can be encrypted with AWS KMS (with AWS managed or customer managed keys), or unencrypted.
Redshift ML
Redshift ML
• Allows users to create, train, and apply machine learning models using
just SQL.
• Users can utilize SQL commands to create and manage machine learning
models, which are then trained using data stored in Redshift.
• Amazon Redshift ML enables you to train models with one single SQL CREATE
MODEL command.
Redshift ML
• Lets users create predictions without the need to move data out of Amazon
Redshift.
• Users can create, train, and deploy machine learning models directly
within the Redshift environment.
• Useful for users who don't have expertise in machine learning tools, languages, algorithms, and APIs.
Redshift ML
• Redshift ML supports common machine learning algorithms and tasks, such
as
Binary classification Multiclass classification Regression
• It Automatically finds the best model using Amazon SageMaker Autopilot.
• Can use the massively parallel processing capabilities of Amazon
Redshift.
Security in Amazon Redshift
Security in Amazon Redshift
Amazon Redshift provides these features to manage security
• Sign-in credentials
• Access management using IAM
• Virtual Private Cloud (VPC)
• Cluster encryption
Redshift Security
• Cluster security groups
○ Default Lockdown:
When you provision an Amazon Redshift cluster, it is locked down by
default so nobody has access to it.
○ Inbound Access Control:
To grant other users inbound access to an Amazon Redshift cluster,
you associate the cluster with a security group.
• SSL connections to secure data in transit
• Load data encryption to encrypt the data during the load process
Redshift Security
• Data in transit
• Column-level access control
• Row-level security control
Access Control In Redshift
Access Control In Redshift
• You can either manage your users and groups within
○ Redshift
○ AWS IAM users
• The privileges to access specific objects are tightly coupled with the DB
engine itself.
Access Control In Redshift
Manage your users and groups within Redshift
User 1 User 2 User 3 User 4 User 5
Group A Group B
Access Control In Redshift
CREATE USER human_resource_1 PASSWORD 'xyzzy-1-XYZZY';
ALTER GROUP humanresource ADD USER human_resource_1;
CREATE USER operation_1 PASSWORD 'xyzzy-1-XYZZY';
ALTER GROUP operation ADD USER operation_1;
Access Control In Redshift
/* Creating the 2 groups */
CREATE GROUP humanresource;
CREATE GROUP operation;
/* Creating the 2 schemas with the data */
CREATE SCHEMA humanresource;
CREATE SCHEMA operation;
/* Give the groups USAGE rights on their schema, and read-only (SELECT) access to the
tables within the schema */
GRANT USAGE ON SCHEMA humanresource TO GROUP humanresource;
GRANT SELECT ON ALL TABLES IN SCHEMA operation TO GROUP operation;
Access Control In Redshift
A group CANNOT contain another group
User 1 User 2
Group A
Access Control In Redshift
AWS introduced RBAC (Role-based access control)
Role 1 GRANT PERMISSIONS
GRANT ROLES
User 1 User 2
Role 2
Access Control In Redshift
/* Creating role*/
CREATE ROLE humanresource;
/* Grant role to user*/
GRANT ROLE humanresource TO user_1;
/* Give the role USAGE rights on the schema, and read-only (SELECT) access to the
tables within the schema; roles can also be granted to other roles */
GRANT USAGE ON SCHEMA humanresource TO ROLE humanresource;
GRANT SELECT ON ALL TABLES IN SCHEMA operation TO ROLE finance;
GRANT ROLE finance TO ROLE humanresource;
Redshift Fine-grained access control
Fine-grained Redshift access control
• Used to control and manage access permissions for various users or roles.
• Contains detailed access policies.
Fine-grained access control
/* Column-level access control*/
GRANT SELECT (column1, column2) ON tablename TO username;
Fine-grained access control
/* Row-level access control */
CREATE RLS POLICY view_own_warehouse_inventory
WITH (warehouse_id INTEGER)
USING (
    warehouse_id IN (
        SELECT managed_warehouse_id
        FROM warehouse_managers
        WHERE manager_username = current_user
    )
);

Example inventory table:
item_id | item_name           | quantity | warehouse_id
101     | LED Light Bulb      | 120      | 1
102     | Electric Drill      | 85       | 2
103     | Hammer              | 75       | 1
104     | Nails (Pack of 100) | 150      | 3

Example warehouse_managers table:
manager_id | manager_username | managed_warehouse_id
1          | jsmith           | 1
2          | mbrown           | 2
3          | lwilson          | 3
Fine-grained access control
/* Masking Policy */
CREATE MASKING POLICY mask_email
WITH (email VARCHAR(256))
USING ('***'::TEXT);

Result (email column masked):
manager_id | manager_username | email
1          | jsmith           | ***
2          | mbrown           | ***
3          | lwilson          | ***
Fine-grained access control
/* Create Masking Policy */
CREATE MASKING POLICY mask_email
WITH (email VARCHAR(256))
USING ('***'::TEXT);

/* Attach Masking Policy */
ATTACH MASKING POLICY mask_email
ON employee_data(email)
TO ROLE role_hr
PRIORITY 20;

/* Detach Masking Policy */
DETACH MASKING POLICY mask_email
ON employee_data(email)
FROM ROLE role_hr;
Redshift Fine-grained access control
Row level security in Redshift
• Users can only access specific rows.
• Rows have criteria that defines which role can access the specific item
(row).
Access logging & monitoring in Redshift
• Failed and successful access attempts to Redshift data warehouses can be
logged using the system table STL_CONNECTION_LOG.
• Audit logs are not enabled by default.
Section 11:
Other Database Services
Amazon RDS
AWS RDS
What is RDS?
▪ Fully managed relational database service.
Main characteristics:
▪ Scalable, reliable, and cost-effective.
▪ Supports encryption at rest and in transit.
▪ Support for Multiple Database Engines:
o MySQL
o PostgreSQL
o MariaDB
o Oracle Database
o SQL Server
o Amazon Aurora
Security in RDS
1) AWS Key Management Service (AWS KMS):
Automatic integration for key management and envelope encryption.
2) Backup and Disaster Recovery:
Automated and manual backups, restoring from a snapshot.
3) Patch Management:
Automatic minor version upgrades for minor updates and patches
4) AWS backup:
Centralization and automation of data backup.
ACID compliance in RDS
Important set of properties for data reliability.
Definition:
○ Atomicity
○ Consistency
Data integrity and consistency
○ Isolation
○ Durability
ACID compliance in RDS
Locking mechanisms for concurrent data access and modification in multiple transactions:
1. Exclusive locks: prevent other transactions from reading or writing to the same data.
   Syntax: FOR UPDATE.
2. Shared locks: allow multiple transactions to read data at the same time without blocking.
   Syntax: FOR SHARE.
3. Tables & rows: these can be locked to keep data integrity and control.
Syntax and Deadlocks
▪ PostgreSQL command to lock a table:
LOCK TABLE table_name IN ACCESS EXCLUSIVE MODE;
▪ PostgreSQL command to acquire a shared lock:
SELECT * FROM table_name FOR SHARE;
▪ PostgreSQL command to acquire an exclusive lock:
SELECT * FROM table_name FOR UPDATE;
▪ Understanding Deadlocks
What Happens:
During concurrent transactions; lock resources and wait on each other.
Result:
No transaction can proceed, halting progress.
AWS RDS Basic Operational Guidelines.
o Monitor metrics for memory, CPU, replica lag, and storage via CloudWatch.
o Scale database instances to manage storage capacity efficiently.
o Enable automatic backups during periods of low write IOPS.
o Provision sufficient I/O capacity for your database workload.
o Set a Time-To-Live (TTL) value under 30 seconds.
o Conduct failover testing regularly.
Best Practices for DB engines in RDS
1) Allocate enough RAM so that your working set resides almost completely in memory.
2) Check the ReadIOPS metric constantly; the value should be small and stable.
3) Use Enhanced Monitoring to obtain real time metrics for the OS in
your DB instances
4) Use Performance Insights for RDS and Aurora
○ Simplifies database performance monitoring and tuning
○ Easy-to use dashboard
Example use cases:
➢ Detect performance issues
➢ Evaluate impact of SQL queries and optimize them (Dev & Test)
➢ Assess and tune performance when transitioning to the cloud
Best practices for different engines hosted in RDS
o MySQL: Ensure tables don't exceed the 16TiB size limit by partitioning
large tables.
o Oracle FlashGrid Cluster: Utilize virtual appliances to run self-managed
RAC and RAC extended clusters across AZs on EC2.
o RDS for PostgreSQL: Improve performance by optimizing data loading and
utilizing the autovacuum feature effectively.
o SQL Server Failover: Allocate sufficient provisioned IOPS to handle your
workload during failover.
Amazon Aurora
Amazon Aurora
• Relational database service
Purpose
• Fully compatible with MySQL and PostgreSQL
• Easy transition from MySQL and PostgreSQL
• Performance
Features
Offers up to 5x the performance of MySQL, 3x that of PostgreSQL.
• Scalability
Scales from 10GB to 128TB as needed.
• Read replicas:
Up to 15 replicas to extend read capacity.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon Aurora
Aurora Serverless • Automatic Scaling:
Auto-adjusts to the application's needs, no manual scaling required.
• On-Demand Usage:
Pay-for-what-you-use is ideal for sporadic or unpredictable workloads.
• Simple Setup:
No management of database instances; automatic capacity handling.
• Cost-Effective:
Bills based on Aurora Capacity Units (ACUs), suitable for fluctuating
workloads.
Amazon Aurora
Use Cases
• Web and mobile applications that need high throughput and availability
• SaaS and enterprise applications that require MySQL/PostgreSQL compatibility
• Variable or unpredictable workloads (with Aurora Serverless)
Amazon Aurora
Pricing
• Node Pricing: Charges based on type and number of nodes, varying by CPU,
memory, and network performance.
• Data Transfer Pricing: Costs for data transferred "in" and "out" of the service;
intra-region transfers typically not charged.
• Backup Pricing: Charges for backup storage beyond the free tier, priced per GB
per month.
• Reserved Instances: Available for long-term use at a significant discount, with
one or three-year commitments.
Amazon DocumentDB
Amazon DocumentDB
• Fully managed NoSQL database service.
Purpose
• Document-oriented
• Fully compatible with MongoDB.
• Serverless:
Scales automatically with needs; no infrastructure management.
Features
• Managed Service:
AWS handles provisioning, setup etc.
• High Availability & Durability:
Built on AWS infrastructure with multi-AZ replication.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon DocumentDB
Pricing
• Instance Hours: Charged based on instance run time.
• Storage & I/O: Fees for S3 storage and I/O operations.
• Backup Storage: Additional costs for backups beyond free retention limits.
Amazon Neptune
Neptune - Overview
• Fully managed graph database service
Purpose
• Optimized for highly connected datasets
• Knowledge graphs,
Use Cases • Fraud detection,
• Recommendation engines
• Property Graph: Vertices and edges model
Data Model
• RDF: Triple-based model
Neptune - Features
• Fully Managed: Provisioning, patching, backup by AWS.
• High Availability: Across 3 AZs, supports automatic failover.
• Secure: Uses AWS KMS for encryption, provides VPC isolation.
• Scalable: Auto-scales to 64 TB, up to 15 read replicas.
• Fast and Reliable: Handles billions of relations with millisecond latency.
Neptune – Integration & Performance
• AWS Ecosystem: Integrates with Lambda, S3, SageMaker.
Integration
• Open Standards: Complies with Gremlin and SPARQL.
• Query Optimization: Advanced techniques for graph traversal.
Performance
• Concurrency and Throughput: Optimized for high traffic.
Pricing • Model: Based on instance size and usage hours.
• Complexity: Demands graph database knowledge.
Summary
• Specialized: Focused on graph-specific applications.
Amazon Keyspaces
(for Apache Cassandra)
Amazon Keyspaces
• Fully managed NoSQL database service.
Purpose
• Fully compatible with Apache Cassandra.
• Serverless:
Scales automatically with needs; no infrastructure management.
Features
• Managed Service:
AWS handles provisioning, setup etc.
• High Availability & Durability:
Built on AWS infrastructure with multi-AZ replication.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon Keyspaces
Pricing
• On-demand capacity: Pay for throughput and storage.
• Provisioned capacity: For predictable workloads.
Amazon MemoryDB
for Redis
Amazon MemoryDB for Redis
• Fully managed in-memory database service
Purpose
• Fully compatible with Redis, supporting Redis APIs
and data structures.
• In-memory storage
for low-latency and high-throughput access
Features
• Scalability
Automatically scales to adapt to workload changes.
• Durability:
Ensures data persistence with snapshotting and replication across
multiple Availability Zones.
• Security:
IAM for authentication, supports encryption at rest and in transit.
Amazon MemoryDB for Redis
Use Cases
• Caching:
Ideal for high-performance caching
Reducing load and improving response times
• Leaderboards and Counting:
In gaming and social networks
• Session Store:
Session information for web applications, ensuring fast
retrieval and persistence.
Amazon MemoryDB for Redis
Pricing
• Node Pricing: Charges based on type and number of nodes, varying by CPU,
memory, and network performance.
• Data Transfer Pricing: Costs for data transferred "in" and "out" of the service;
intra-region transfers typically not charged.
• Backup Pricing: Charges for backup storage beyond the free tier, priced per GB
per month.
• Reserved Instances: Available for long-term use at a significant discount, with
one or three-year commitments.
Amazon Timestream
Amazon Timestream
o Fully managed, serverless time-series database
o Designed for high-performance real-time analytics
Use cases
- IoT applications
- Application logs
- DevOps monitoring
- Financial market data
Features
- Serverless
- Optimized for time-series data
- Time-series-specific functions
- High performance for time-series data
Amazon Timestream
Ingestion sources: Lambda, Amazon MSK, Kinesis Data Streams, Managed Service for Apache Flink, IoT Core
Output / consumers: Managed Grafana, QuickSight, SageMaker
Real-time processing with IoT devices
Use case
IoT devices (generate streaming events) → Kinesis Data Streams (transport) → Managed Service for Apache Flink (transform & insert metrics & logs) → Timestream (time-series database) → Managed Grafana (visualization)
(A second diagram shows the same pipeline with Kinesis Data Firehose added.)
Section 12:
Compute Services
Elastic Compute Cloud (EC2)
Elastic Compute Cloud (EC2)
• Is a web service that provides secure, resizable
compute capacity in the cloud
• Allows users to easily configure, launch, and manage
virtual servers, known as instances.
• It provides on-demand, scalable computing capacity.
• Offers a variety of instance types with different combinations of
CPU Memory Storage Networking
Elastic Compute Cloud (EC2)
• Users have full control over the configuration of
their EC2 instances.
• High Availability and Reliability.
• Auto Scaling.
• It seamlessly integrates with other AWS services.
EC2 Instance Types
EC2 Instance Types
General Purpose
• Provide a balance of compute, memory, and networking resources.
Compute Optimized
• Ideal for compute-bound applications that benefit from high performance processors.
EC2 Instance Types
Memory Optimized
• Designed for workloads that process large data sets in memory.
Accelerated Computing
• Use hardware accelerators, or co-processors, to perform functions.
EC2 Instance Types
Storage Optimized
• Designed for workloads that require high, sequential read and write access to very large data sets on local storage.
EC2 Instance Types
HPC Optimized
• Purpose built to offer the best price performance for
running HPC workloads at scale on AWS.
• Ideal for applications that benefit from high-performance
processors.
AWS Batch
AWS Batch When to use it?
● Batch jobs based on docker images
Lambda
❑ Lightweight, event-driven tasks
❑ Run code in response to events (ideal for real-time)
Glue
❑ Specialized ETL / Data Integration service
Batch
❑ General purpose, versatile (compute-intensive) batch
jobs
AWS Batch
● Batch jobs based on docker images
● Automated Scaling
● Job Scheduling
● Can be integrated with AWS Step Functions.
● Serverless:
○ Uses EC2 instances and Spot instances.
○ Can be used together with Fargate
○ No need to manage infrastructure.
● Pricing:
○ EC2 Instance / Fargate / Spot Instance costs.
AWS Batch – How It Works
Define Jobs → Submit Jobs to Job Queues → Job Scheduling → Running
AWS SAM
AWS SAM
● Framework for development processes.
● Simplified serverless application development.
○ AWS Lambda
○ Amazon API Gateway
○ Amazon DynamoDB
● It has YAML based template.
● Provides deployment ease.
● Provides local testing capabilities.
AWS SAM – CLI
● Provides an environment that mimics AWS.
● Test and debug your serverless applications.
● It has IDE extensions.
● sam build, sam package, sam deploy.
AWS SAM – Template & Deployment
1) Write Your Template
2) Build Your Application: sam build
3) Package Your Application: sam package
4) Deploy Your Application: sam deploy
Section 13:
Analytics
AWS Lake Formation
AWS Lake Formation – How It Works
Capabilities: data collection, data preparation, data cataloging, security & access, data sharing, analytics integration.
AWS Lake Formation
● Collects data from different sources.
● Organizes and prepares data.
● Integrates with other AWS services.
○ Redshift, Athena, Glue, S3
● Automates with blueprints.
AWS Lake Formation
● Secures the data.
(Diagram: on-premises and AWS storage services are crawled into the Data Catalog and loaded into the data lake.)
AWS Lake Formation – How It Works
1) Define Data Sources
2) Data Ingestion
3) Catalog and Organize
4) Clean and Transform
5) Set up Security
6) Use and Analyze
AWS Lake Formation – Security
● Centralized security management.
● Control access at database, table or column level.
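A minimal sketch of granting such fine-grained access with the Lake Formation API (the IAM role, database, table, and column names are hypothetical):

import boto3

lakeformation = boto3.client("lakeformation")

# Grant column-level SELECT on a cataloged table to an IAM role
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],
        }
    },
    Permissions=["SELECT"],
)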
Data Filtering - Security
● Data filtering provides fine-grained access control to the data:
Row-level security: restrict access to certain rows
Column-level security: restrict access to certain columns
Cell-level security: combines column & row restrictions
LF-tags – LF-TBAC
● LF-Tags can be used to define permissions based on attributes
Attach LF-Tag to resource: LF-Tags can be attached to Data Catalog resources
Assign permissions: permissions are assigned to resources using LF-Tags
AWS Lake Formation – Security
● Centralized security management.
● Control access at database, table or column level.
● Role based access control.
● Cross-account access.
AWS Lake Formation – Cross-Account
• Share Setup:
Use named resources or LF-Tags for easy database and table sharing across AWS
accounts.
• Granular Access:
Use data filters to control access at the row and cell levels in shared tables.
• Permissions Management:
Leverage AWS RAM to handle permissions and enable cross-account sharing.
• Resource Acceptance:
Once shared resources are accepted via AWS RAM, the recipient's data lake
administrator can assign further permissions.
• Resource Link:
Establish a resource link for querying shared resources via Athena and Redshift
Spectrum in the recipient account.
AWS Lake Formation – Cross-Account
Troubleshooting Points:
• Issues with Permissions (RAM issue)
• IAM role misconfiguration
Amazon EMR
Amazon EMR
● Fast, distributed data processes.
● Uses big data frameworks:
○ Apache Hadoop, Apache Spark
○ For petabyte scale processing
○ Glue is easy to use, but less suitable for heavy (petabyte-scale) workloads
○ Migration from existing
on-premise resources
Amazon EMR
● Fast, distributed data processes.
● Uses big data frameworks:
○ Apache Hadoop, Apache Spark
● Cluster based architecture.
● Pricing is based on EC2 usage and hourly service rates.
● Security:
○ IAM, VPC, KMS.
Amazon EMR
What Is Hadoop
● Distributed storage and processing framework.
○ 1) Hadoop Distributed File System (HDFS)
○ 2) MapReduce
EMR Cluster Structure
Instance options:
- x86-based instances: versatile & traditional choice
- Graviton-based instances: balance of compute & memory, roughly 20% cost savings
Cluster Nodes
Master Node: manages the cluster
Core Nodes: responsible for running tasks and storing data (HDFS)
Task Nodes: run tasks only; they do not store data
Apache Hive
Apache Hive
● Distributed data warehousing tool for querying and managing
datasets.
● Hive stores data in tables.
○ Text files, Parquet, ORC…
● Provides SQL-like queries also called Hive QL
● Compatible with the Hadoop Distributed File System (HDFS) and
Amazon S3.
(Diagram: Hive queries run against data described by the Hive Metastore.)
Apache Hive
Hive Metastore
● Amazon EMR is also capable of working with Hive.
○ Amazon EMR: simplifies running big data frameworks.
● Within EMR, it is easy to migrate your Hive system.
● Migration from the Hive Metastore to the AWS Glue Data Catalog is also possible.
● The AWS Glue Data Catalog has built-in compatibility with Hive.
● Sharing metadata between Hive and AWS Glue is possible.
● You can run Hive queries on cataloged data in AWS Glue (for example, from a Glue ETL job).
Amazon Managed Grafana
Amazon Managed Grafana
Fully managed Grafana service
• Open-source visualization tool
• Create dashboards for metrics, logs and traces
Use cases:
- Monitoring systems and application
- Dashboards for log or sensor data
- Real-time monitoring for IT infrastructure
- Use alerts
Managed Grafana interface
Amazon Managed Grafana Features
Workspace
A logically isolated Grafana server
Data sources
Integrates with AWS data sources that collect operational data
User Authentication
Integrates with identity providers that support SAML 2.0 and
AWS IAM Identity Center
Amazon Managed Grafana - How It Works
Common data sources: CloudWatch (very common), Timestream (IoT devices & time series), Elasticsearch/OpenSearch (data stored in Elasticsearch).
Workflow: Create Grafana workspaces → Connect to multiple data sources → Set up dashboards → Manage user access
AWS OpenSearch
AWS OpenSearch
● What is AWS OpenSearch?
○ Fully managed Search Engine
○ Built in Dashboards for real-time data analysis
▪ Use Cases
o Real-time Application Monitoring
o Business Intelligence (BI) and Data Analytics
o …
AWS OpenSearch
▪ Pricing
Pay as you go.
▪ Infrastructure:
Fully managed.
▪ Scalability:
Ability to scale up or down manually.
▪ Integration:
Integrates with other AWS services
▪ Availability:
Multi-AZ deployments, automated snapshots.
AWS OpenSearch Key Components
Key components: Documents, Types, and Indices.
(Diagram: example JSON documents organized into indices.)
AWS OpenSearch
Node, Cluster and Shard
▪ What is a Node:
Single running instance of OpenSearch.
▪ Data Nodes
▪ Master Nodes
▪ Client Nodes
▪ What is a Cluster ( ⇒ Domain):
Collection of one or more nodes.
▪ What is a Shard:
Partitions of an index’s data.
▪ Primary Shards
▪ Replica Shards
AWS OpenSearch – Infrastructure
(Diagram: a cluster of three nodes, each holding a mix of primary shards (P0–P2) and replica shards (R0–R2).)
AWS OpenSearch
Managing and Accessing Data
▪ Creating an Index: define its settings and mappings.

PUT /my-index-000001
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 2
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age": { "type": "integer" }
    }
  }
}
AWS OpenSearch
Managing and Accessing Data
▪ Adding Documents: index a document with its fields.

POST /my-index-000001/_doc/1
{
  "name": "John",
  "age": 30
}
AWS OpenSearch
Managing and Accessing Data
▪ Searching: search through indices.

GET /my-index-000001/_search
{
  "query": {
    "match": {
      "name": "John"
    }
  }
}
AWS OpenSearch - Configurations
▪ Domain:
Management of OpenSearch clusters.
▪ Master Node:
Responsible for cluster-wide operations.
▪ S3 Snapshot:
S3 snapshots can be used to restore your cluster.
AWS OpenSearch – What to Avoid
▪ OLTP and Ad-Hoc Queries:
Can lead to suboptimal performance.
▪ Over-Sharding:
Can lead to increased overhead and reduced
efficiency.
▪ OpenSearch as Primary Data Store:
Not designed for transactional data integrity.
AWS OpenSearch – Performance
Memory Pressure Management:
▪ Watch for JVM memory pressure errors;
▪ Balance shard allocations to avoid overloading
Shard Optimization:
▪ Reduce shard count to mitigate memory issues
Data Offloading:
▪ Delete older or unused indices;
▪ Consider archiving data to Amazon Glacier to improve
performance.
AWS OpenSearch – Security
▪ Authentication:
▪ Native Authentication: Users and Roles
▪ External Authentication: Active Directory, Kerberos,
SAML, and OpenID
▪ Authorization:
▪ Role-Based Access Control (RBAC): through users and roles
▪ Attribute-based access control (ABAC): Based on user
attributes
AWS OpenSearch – Security
▪ Encryption:
▪ In-transit encryption: TLS encryption.
▪ At-rest encryption: Third party applications.
▪ Audit Logging:
▪ Identifying potential security issues.
AWS OpenSearch – Dashboards
▪ Data Visualization:
▪ Line charts
▪ Bar graphs
▪ Pie charts
▪ Dashboard Creation:
Each dashboard is customizable; users are able to
create dashboards as per their needs.
AWS OpenSearch – Storage Types
Hot Storage
▪ Frequently accessed & instant retrieval
▪ Fastest performance
▪ Real-time analytics and recent log data
Warm Storage
▪ Less frequently accessed but still available
▪ Requires dedicated master nodes
▪ Not compatible with T2 or T3 data node instance types
UltraWarm Storage
▪ Cost-effective for read-only data
▪ S3 + caching
▪ Lower cost
▪ Not frequently written or queried
Cold Storage
▪ Rarely used (e.g., archival)
▪ Lowest cost
▪ Requires UltraWarm
▪ Uses Amazon S3, hence no compute overhead
AWS OpenSearch – Reliability and Efficiency
Cross-Cluster Replication:
▪ What It Is:
▪ Allows you to copy and synchronize data.
▪ Used for increasing data availability.
▪ Why It Is Important:
▪ Provides availability.
▪ Prevents hardware failure and network issues related
blocks.
AWS OpenSearch – Reliability and Efficiency
Index Management:
▪ What It Is:
▪ Automates the index managing process.
▪ Defines indices lifecycles.
▪ Why It Is Important:
▪ Provides cost efficiency.
▪ Provides performance improvements.
AWS OpenSearch – Reliability and Efficiency
Infrastructure Management:
▪ What It Is:
▪ Deciding disk scale.
▪ Deciding master node quantity.
▪ Why It Is Important:
▪ Determines your system's resilience and stability.
AWS OpenSearch – Serverless
● Serverless Flexibility:
Auto-scales based on demand, reducing management
overhead.
● Cost Efficiency:
Pay only for what you use, ideal for variable
workloads.
● Seamless AWS Integration:
Enhances capabilities with AWS services like
Lambda, S3, and Kinesis.
Amazon QuickSight
Amazon QuickSight
● What is QuickSight?
○ AWS’s visualization Service.
○ ML-Powered dashboards.
▪ Use Cases
o Data Exploration
o Anomaly Detection
o Forecasting
o …
Amazon QuickSight
▪ Scalability
Automatically scales up and down.
▪ Serverless:
Fully managed.
▪ Machine Learning Powered:
ML-driven insights such as anomaly detection and forecasting.
Amazon QuickSight
▪ Data Analysis
Workspace for creating visualizations
▪ Data Visualization:
Visuals (Charts)
▪ Dashboards:
Published version of an analysis
Amazon QuickSight
SPICE Super-fast, Parallel, In-memory Calculation Engine
▪ In memory engine
▪ 10GB SPICE per user
▪ Benefits:
▪ Speed and Performance
▪ Automatic Data Refreshes
▪ Synchronization
QuickSight – Dashboards
Features
▪ Automatic Refreshes:
Automatically refreshes dashboards.
▪ Collaboration and Sharing:
Can be shared with team members.
▪ Mobile Accessibility:
Mobile responsive dashboards.
QuickSight – Data Sources
▪ AWS Services:
o S3
o RDS
o Aurora
o Redshift
o Athena
▪ Data Pipeline Scenario:
S3 → Glue → Athena → QuickSight
QuickSight – Data Sources
▪ CSV, Excel, TXT, S3
▪ OpenSearch
▪ Aurora/RDS/Redshift
▪ Third-Party Database Services:
o PostgreSQL
o MySQL
o SQL Server
o Oracle
▪ ODBC/JDBC data sources
QuickSight – What to Avoid
▪ Overloading Dashboards
▪ Poor Data Security
▪ Using QuickSight as an ETL tool
Amazon QuickSight
Licensing
QuickSight – Licensing
Standard Edition
▪ Small group of users
▪ No advanced features needed
Pricing:
▪ Annual plan: $9/user/month
▪ Monthly plan: $12/user/month
Included SPICE Capacity:
▪ 10 GB/user
▪ $0.25/GB for additional capacity
QuickSight – Enterprise
▪ Advanced Features:
▪ RLS
▪ Hourly refreshes
▪ ML Insights
▪ Administrative features:
▪ User management
▪ Encryption
▪ Pricing
▪ Pay-per-session for readers
QuickSight – Enterprise
▪ Author License:
▪ Connect to Data
▪ Create Dashboards
▪ Share content
▪ Reader License:
▪ Explore Dashboards
▪ Get Reports
▪ Download Data
▪ SPICE
▪ $0.38 per GB/month
Pricing:
▪ Author: $24/month (month-to-month), $18/month (with annual commitment)
▪ Author with QuickSight Q: $34/month (month-to-month), $28/month (with annual commitment)
▪ Reader: $0.30/session, up to $5 max/month (month-to-month)
▪ Reader with QuickSight Q: $0.30/session, up to $10 max/month (month-to-month)
QuickSight – Enterprise
▪ Additional Features
▪ QuickSight Embedded
▪ Paginated Reports
▪ Alerts and Anomaly Detection
○ Author Pro & Reader Pro
Additional Generative BI capabilities
Amazon QuickSight
Security
QuickSight – Security
Access Control
▪ Resource Access Control
▪ IAM Integration
▪ Active Directory Connector
▪ MFA supported
▪ IAM Configuration
You are responsible for assigning the IAM role.
The role is necessary to access the data sources.
IAM permissions are necessary for encrypted data sources.
QuickSight – Security
Data in VPC
▪ Private VPC Access Through ENI
Data On-Premise
▪ Use AWS Direct Connect
Run QuickSight in VPC
▪ Full network isolation
▪ IP Restriction
QuickSight – Cross-Region/Account
Standard Edition
▪ Accessing QuickSight from Private Subnet not possible
▪ The security group attached to the data source (e.g., an RDS
instance in another account) must allow inbound connections from QuickSight
QuickSight – Cross-Region/Account
Enterprise Edition
[Diagram] QuickSight (private subnet, VPC – Account A) ⇔ VPC Peering ⇔ Data Source (private subnet, VPC – Account B)
▪ Use an Elastic Network Interface (ENI) within a VPC
Data sources like Amazon RDS/Redshift in private subnets
▪ VPC Peering
Connect them using VPC peering.
QuickSight – Cross-Region/Account
Enterprise Edition
[Diagram] QuickSight (private subnet, VPC – Account A) connects to a Data Source (private subnet, VPC – Account B) via AWS Transit Gateway, VPC Peering, AWS PrivateLink, or VPC Sharing
● AWS Transit Gateway
For managing connections at scale within the same region
● AWS PrivateLink:
Securely exposes services across accounts without exposing data
to the public internet.
● VPC Sharing:
Allows multiple AWS accounts to share a single VPC, facilitating
access to shared resources like QuickSight and databases.
Row-Level Security (RLS) Enterprise
● Customizable Access
Which data rows can be seen
● Dataset Filters
Applied using dataset filters
● Data Security
Users only see data relevant to their role
● Column-Level Security: Enterprise
Manage access to specific columns
Amazon Q in QuickSight
● Natural language queries for interacting with data.
● Ask data-related questions in plain English.
● Delivers answers as visualizations, summaries, or raw data.
● Enhances data analytics accessibility, no technical expertise
needed
Section 14:
Machine Learning
Amazon SageMaker
Amazon SageMaker
● Simplifies machine learning aspects.
• Supports various compute environments and frameworks
Build ⇒ Jupyter notebooks, TensorFlow & PyTorch
Train • Automatically manages infrastructure
Deploy • Provides deployment ease
Optimization • Offers automatic model tuning
Amazon SageMaker
Amazon SageMaker Studio
● It is an IDE with a unified interface.
● Provides collaboration and version control.
● Includes various tools and components:
○ Notebooks: Pre-configured Jupyter notebooks.
○ Experiments: Organize, track, and compare.
○ Debugger: Analyze and debug.
○ Autopilot: Automated model creation.
Amazon SageMaker
Access Management
● SageMaker-Specific Policies:
○ AmazonSageMakerFullAccess: Provides full access to SageMaker resources.
○ AmazonSageMakerReadOnly: Provides read-only access.
● Resource-Based Policies:
○ You can specify accesses based on SageMaker resources.
● Fine-Grained Access Control with Tags:
○ You can control access based on specific conditions.
● Integration with Other AWS Services:
○ Amazon VPC: Provide network-level access control.
○ AWS KMS: Manage encryption keys for used or produced data by SageMaker.
Amazon SageMaker – Feature Store
Feature Store
⇒ Store and manage data features in one place.
⇒ Supports Online and Offline usage.
• Online store ⇒ Real-time apps, low latency
• Offline store ⇒ Historical data analysis
⇒ You can group your features
(Feature table columns: Id, Time, Feature1, Feature2)
Amazon SageMaker – Feature Store
● Data can be ingested from many services
○ EMR, Glue, Kinesis, Kafka, Lambda, Athena…
Pipeline: Ingest Data → Configure Feature Store → Online store / Offline store → ML Model
Benefits
● Efficiency: Reduces the effort and complexity.
● Flexibility: Supports both real-time and batch data processes.
Amazon SageMaker – Lineage Tracking
● Manage and track lifecycle of your ML models.
● Provides better insights on your ML workflow.
● Useful for understanding model dependencies, auditing, and
reproducing results.
Lineage Tracking Entities
● Representation of every component in ML workflow.
● SageMaker automatically creates those entities.
Amazon SageMaker – Lineage Tracking
Lineage Tracking Entities
Trial Components ⇒ represent individual stages or steps
Trials ⇒ Trials help in evaluating different approaches or iterations
Experiments ⇒ Containers for organizing the trials, focus on specific problem
Contexts ⇒ Provide a logical grouping for other entities
Actions ⇒ Represent operations or activities involving artifacts
⇒ Data objects generated throughout the ML lifecycle: Datasets,
Artifacts model parameters…
Associations ⇒ Defines the relationships between entities
Amazon SageMaker – Lineage Tracking
Benefits of SageMaker ML Lineage Tracking:
● Reproducibility:
○ Enables to reproduce and trace results.
● Auditability:
○ Provides a detailed history of the ML workflow.
● Governance:
○ Enhances the governance of ML projects.
● Collaboration:
○ Makes it easier for teams to collaborate.
Amazon SageMaker – Data Wrangler
● It simplifies data preparation process.
Data Import → Data Preparation → Data Visualization → Feature Engineering → Export Data
• Data Import: S3, Redshift, EMR, Feature Store
• Data Preparation: Provides an interface; normalize data, clean data
• Data Visualization: Get insights; see data distributions
• Feature Engineering: Create and modify features
• Export Data: To SageMaker or other AWS Services
Amazon SageMaker – Data Wrangler
Quick Model
● You can quickly test your data.
● Automatically trains/tests the data and provides you insights:
○ Model summary
○ Feature summary
○ Confusion matrix
Section 15:
Application Integration
AWS Step Functions
AWS Step Functions
● Create multi-step workflows
⇒ Visual workflows in Visual Editor
● Use Cases
• Application Orchestration: Automates tasks across applications.
• Data Processing: Manages complex, conditional data workflows.
• Microservices Orchestration: Coordinates microservices with error management.
• Machine Learning Workflows: Streamlines the entire machine learning cycle.
⇒ Whenever we want to orchestrate & integrate multiple services
AWS Step Functions
● What is a State Machine?
State machine = Workflow
State = Step
Task = Unit of work
AWS Step Functions
● Different types of states ( = steps)
• Task State: Executes specific work like a Lambda function or API call.
• Choice State: Adds branching logic, directing execution based on conditions.
• Wait State: Pauses execution for a set duration or a specific time.
• Succeed State: Ends the execution successfully with the provided output.
• Fail State: Stops execution and marks it as failed.
• Parallel State: Multiple branches simultaneously, aggregating results.
• Map State: Iterates over a list, processing each item with a sub-workflow.
• Pass State: Passes input directly to output, with optional data transformation.
AWS Step Functions
● State Machine defined in ASL
ASL = Amazon States Language
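Not shown on the slides: a minimal sketch of what an ASL definition looks like and how a state machine could be created with boto3. The Lambda ARN, IAM role ARN, and state machine name below are placeholders, not values from the course.

import json
import boto3

# Minimal ASL definition: a single Task state that invokes a Lambda function.
# The Lambda and IAM role ARNs are placeholders.
definition = {
    "Comment": "Minimal example workflow",
    "StartAt": "HelloTask",
    "States": {
        "HelloTask": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:HelloFunction",
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="minimal-example",                       # assumed name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/StepFunctionsExecutionRole",
    type="STANDARD",                              # or "EXPRESS"
)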
AWS Step Functions
● Built-in controls
⇒ To examine and control the state of each step in the workflow, use:
• Retry
• Catch
• Timeouts
• Parallel
• Choice
• Wait
• Pass
AWS Step Functions
AWS SDK Integrations
⇒ Allow Step Functions to execute API actions across more than 200 AWS
services directly within a workflow.
Optimized Integrations
⇒ Optimized integrations are designed specifically for seamless
integration with popular AWS services, such as Lambda, S3, DynamoDB,
ECS, SNS, and SQS.
IAM permissions
○ State machine needs to have appropriate IAM permissions
■ E.g. execute a Lambda function
⇒ Lambda function might need permission to other services as well
Standard Workflows
● Used for long-running, durable, and auditable workflows.
● Reliable continuous operation.
● Provides detailed logging of all state transitions.
● Full integration / expensive.
● 2,000 per second execution rate / 4,000 per second state transition rate.
Express Workflows
● Ideal for high-volume, short-duration, event-processing workloads.
● High throughput.
● 100,000 per second execution rate / nearly unlimited state transition rate.
● Limited integration / cost-effective.
● Fast microservice orchestration.
Amazon EventBridge
Amazon EventBridge
EventBridge is a serverless service that helps you link different parts of your application using
events, making it easier to build apps that can scale and adapt quickly.
Event = JSON object (change in environment / state )
Components of EventBridge:
▪ Event Producers: Sources that generate events.
▪ Event Bus: This is the central hub where events are sent.
▪ Rules: These define the match criteria.
▪ Targets: Resources that receive the events
Amazon EventBridge
▪ Enables decoupled, event-driven architecture
⇒ services can interact efficiently without direct integration
Example: S3 Bucket → Event Bus → Rules → targets such as Lambda, an SNS notification, or Step Functions
Amazon EventBridge Rules
Two types of rules
Matching on event pattern
• Triggers based on a specific pattern
• Define an event pattern to filter events
• When matched ⇒ send event to target
• E.g. Lambda executes as a response to new data in a DynamoDB table
On a schedule
• Sends events to targets at regular intervals
• E.g. periodically run a Lambda function
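A hedged illustration of both rule types with boto3 (not from the slides; the bucket name, Lambda ARN, and rule names are placeholders):

import json
import boto3

events = boto3.client("events")

# Rule 1: match events by pattern (object-level S3 events on one bucket;
# EventBridge notifications must be enabled on the bucket).
events.put_rule(
    Name="s3-object-created",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-example-bucket"]}},
    }),
    State="ENABLED",
)

# Rule 2: run on a schedule (every 5 minutes).
events.put_rule(
    Name="periodic-lambda",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# Attach a Lambda function as the target of the pattern-based rule.
events.put_targets(
    Rule="s3-object-created",
    Targets=[{"Id": "1", "Arn": "arn:aws:lambda:us-east-1:111122223333:function:ProcessUpload"}],
)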
Amazon EventBridge – Event buses
Default Event Bus: Available by default; automatically receives events from AWS services
Custom Event Buses: Created by user; specify which events should be received
Partner Event Buses: Created by user; receive events from integrated SaaS partners
EventBridge
Schema Registry
Amazon EventBridge – Schema Registry
• Discover, manage, and evolve event schemas
• Includes details about the type of data each event can contain, in JSON format
• Features: Schema Discovery, Code Generation, Version Management, Schema Sharing
Amazon EventBridge Resource-Based Policy
Used to specify direct permissions on event buses.
Benefits:
• Decentralized Management
• Cross-Account Access
• Granular Access Control
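A minimal sketch of how such a resource-based policy could be added with boto3; the account ID and statement ID are assumptions:

import boto3

events = boto3.client("events")

# Allow another AWS account (placeholder ID) to send events to this
# account's default event bus; this writes a resource-based policy.
events.put_permission(
    EventBusName="default",
    Action="events:PutEvents",
    Principal="222233334444",
    StatementId="AllowAccountB",
)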
Scenario Overview
• Understand the scenario
• Solve different tasks
• Make use of the big picture
Scenario
Automating Retailer Data Lake
Practical Scenario Overview
• Demonstrate the concept
• Acknowledge the advanced nature of the concept
• Improve understanding by simplifying the explanation
Objective
Automate the data processing pipeline
Focus on daily retail sales
Workflow Overview
STEP 01: Data Ingestion
▪ CSV files uploaded to an S3 bucket
STEP 02: Data Validation with EventBridge
▪ Create an EventBridge rule
▪ Rule triggers the AWS Step Functions workflow
STEP 03: State Machine and Lambda Function
▪ Utilize an AWS Step Functions state machine
▪ Include a Lambda function
STEP 04: Choice State for Validation
▪ Implement a choice state
▪ If validation passes, proceed to the ETL job
▪ If validation fails, end the workflow
STEP 05: ETL Job Execution
▪ Successful validation triggers the execution
▪ ETL job performs further data transformation
Outcome
Automation of the workflow
Reduction of manual overhead
Quick generation of daily reports.
Ensured data quality
Implementation Steps
Step 1: Designing the Workflow
▪ Outline the sequence
▪ Visualize the workflow
Step 2: Setting up Data Ingestion
▪ Configure an S3 bucket for sales data uploads
▪ Define the structure for CSV files
Step 3: EventBridge and Rule Creation
▪ Create an EventBridge rule
▪ Set up the rule to trigger the AWS Step Functions workflow
Step 4: AWS Step Functions State Machine
▪ Create a state machine
▪ Define the logic and flow
Implementation Steps
Step 5: Lambda Function for Data Validation
▪ Develop a Lambda function
▪ Check for expected header names and handle success/failure conditions
Step 6: Choice State for Validation
▪ Implement a choice state in the state machine
▪ Define conditions for proceeding to ETL or ending the workflow
Step 7: ETL Job Configuration
▪ Set up the ETL job for further data transformation
▪ Ensure it is triggered upon successful validation
Step 8: Monitoring and Optimization
▪ Utilize CloudWatch for monitoring the workflow
▪ Optimize the workflow for efficiency and performance
AppFlow
AppFlow
• Secure Data Movement: Enables safe data transfer
Purpose between SaaS apps and AWS services.
• SaaS Integration:
Simplifies connections with popular SaaS applications
Salesforce, Snowflake, Slack, etc.
SaaS Integration • No-Code Interface:
Allows for easy setup without coding, thanks to a user-friendly UI.
• Pre-built Connectors: Offers connectors for quick integrations with
common SaaS platforms.
AppFlow
Source → AppFlow → Destination
AppFlow
• Data Transformation:
Features
Provides mapping and transformations for compatible data formats.
• Bi-Directional Sync:
Supports two-way data synchronization.
• Event-Driven Flows:
Can initiate transfers based on SaaS application events.
• Encryption & Security:
Ensures data is encrypted and managed with AWS IAM policies.
AppFlow
• Analytics & Insights:
Use Cases
Facilitates data aggregation for in-depth analytics.
• Customer Data Sync:
Provides a unified customer data view across systems.
• Workflow Automation:
Enables automated interactions between SaaS and AWS services.
Amazon SNS
Amazon SNS
▪ Fully-managed pub/sub messaging Service.
▪ Topics for high throughput
⇒ push-based, many-to-many messaging.
▪ Large number of subscriber endpoints
SQS queues, AWS Lambda functions, HTTP/s, Emails, etc.
Key features of Amazon SNS
▪ Topics:
Publish/subscribe mechanism that allows messages to be pushed to
subscribers. There are two types:
⇒ Standard topics:
o High throughput
o At least once delivery
o Unordered delivery
o Multiple subscribers
⇒ FIFO (first-in, first-out)
o Ordered delivery
o Exactly-once processing
o Message Grouping
o Limited throughput
o Deduplication
o Use with SQS FIFO Queues
Amazon SNS: Application-to-Application (A2A)
▪ A2A enhances flexibility and integration possibilities
▪ Fanout pattern
Publishers → Amazon SNS → subscribers such as Amazon Firehose and service providers (MongoDB, Datadog)
▪ Ideal for serverless and microservices
Amazon SNS: Application-to-Person (A2P)
▪ Directly delivered to people
Publishers → Amazon SNS → Mobile push, Mobile text (SMS), Email
▪ Filter policies
Amazon SNS How to Publish
Topic Publish
• Send a message to an SNS topic
• Topic acts as a channel
• Automatically gets delivered
Direct Publish • Targets a specific endpoint
• Push notifications
• Intended for a specific subscriber or device
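A minimal boto3 sketch of both publish styles; the topic ARN and phone number are placeholders:

import boto3

sns = boto3.client("sns")

# Topic publish: every subscriber of the topic (SQS, Lambda, email, ...)
# receives the message.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:111122223333:data-pipeline-alerts",
    Subject="ETL job finished",
    Message="The daily ETL job completed successfully.",
)

# Direct publish to a single endpoint, e.g. an SMS number.
sns.publish(
    PhoneNumber="+15555550123",
    Message="The daily ETL job completed successfully.",
)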
Amazon SQS
Amazon SQS
▪ Fully-managed message queuing service.
▪ Provides a reliable, highly scalable hosted queue.
▪ Allows to decouple and scale microservices.
▪ Enables asynchronous processing.
SQS Message Lifecycle
Producers send messages (A, B, C, D, E) → queue distributed on SQS servers → Consumers poll and receive messages
SQS Message Lifecycle
The consumer must delete the message after processing; the remaining messages stay in the queue.
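A minimal boto3 sketch of the lifecycle described above (send, receive, then delete); the queue URL is a placeholder:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/111122223333/my-queue"  # placeholder

# Producer side: send a message.
sqs.send_message(QueueUrl=queue_url, MessageBody="Message A")

# Consumer side: poll for messages, process them, then delete them.
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,   # long polling
)
for message in response.get("Messages", []):
    print("processing:", message["Body"])
    # Deleting is the consumer's responsibility; otherwise the message
    # becomes visible again after the visibility timeout.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])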
Standard Queues
● High Throughput.
● At-Least-Once Delivery.
● Best-Effort Ordering.
● Use Cases: Speed and high performance.
FIFO Queues
● Ordering Guarantee.
● No Duplicates.
● Standard Throughput.
● Use Cases: Order of operations is critical.
Amazon SQS
▪ Decouple application components.
▪ Data retention for up to 14 days.
▪ Scales automatically with the number of messages.
Amazon Kinesis Data Streams
▪ Real-time data streaming.
▪ Extendable retention for up to 365 days (24 hours by default).
▪ Provides ordering of records within a shard (partition).
Amazon API Gateway
Amazon API Gateway
▪ Fully managed service to create, publish, maintain,
monitor and secure APIs at scale.
REST APIs
▪ Standard REST architecture
▪ Integration with various endpoints
▪ Allows CRUD operations
WebSocket APIs
▪ Real time communication between client and server
▪ Ideal for chat apps, live updates and notifications
Amazon API Gateway
HTTP APIs
▪ Create RESTful APIs with lower latency and costs
▪ Support OpenID Connect and OAuth 2.0
▪ Built-in support for CORS
▪ Automatic deployments
CORS: a web browser security feature to control cross-origin requests.
Amazon API Gateway integration
▪ Direct integration with other AWS services like Lambda
DynamoDB, S3, SNS, etc.
▪ HTTP integration
Forward requests to endpoints
Suitable for microservices
▪ Mock integration for testing
▪ AWS Proxy integration
Passes requests in a JSON format
Throttling, Quotas, and Rate Limiting
Mechanisms used in API Gateway to control and manage traffic
effectively.
Throttling:
▪ Prevents API saturation due to traffic spikes
▪ Set limits on number of requests per second
Quotas:
▪ Limits requests for a specific period of time
▪ Requires usage plans and API keys for implementation
Rate Limiting:
▪ Combines throttling and quotas to control requests
acceptance
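A minimal boto3 sketch of throttling and quotas via a usage plan plus an API key; the API id, stage name, and limits are assumptions:

import boto3

apigw = boto3.client("apigateway")

# Usage plan that throttles requests and caps them per day.
plan = apigw.create_usage_plan(
    name="basic-plan",
    apiStages=[{"apiId": "abc123", "stage": "prod"}],
    throttle={"rateLimit": 100.0, "burstLimit": 200},   # requests/second, burst
    quota={"limit": 10000, "period": "DAY"},            # requests per day
)

# API keys associate individual callers with the plan.
key = apigw.create_api_key(name="customer-a", enabled=True)
apigw.create_usage_plan_key(
    usagePlanId=plan["id"], keyId=key["id"], keyType="API_KEY"
)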
API Gateway Endpoint Types
Edge Optimized
✓ Reduces latency for clients globally
✓ Ideal for worldwide reach
✓ URL: https://abc123.execute-api.amazonaws.com/prod
Regional Endpoint
✓ Does not use CloudFront caching
✓ Ideal for a specific region
✓ URL: https://abc123.execute-api.us-west-2.amazonaws.com/prod
Private Endpoint
✓ Accessible only within a VPC
✓ Prevents exposure to the public internet
✓ Public: https://vpce-12345-abc.execute-api.us-east-1.vpce.amazonaws.com/prod
✓ In-region endpoint: https://vpce-12345-abc.vpce.amazonaws.com/prod
Section 16:
Management, Monitoring,
Governance
AWS CloudFormation
AWS CloudFormation
Infrastructure Management • Allows you to define and provision your AWS infrastructure as code
• Use cases:
-Replicate infrastructure across regions
-Control and track changes to your infrastructure
-Simplify infrastructure management
CloudFormation interface
AWS CloudFormation Templates
Text files that describe the desired state of your AWS infrastructure.
YAML Template to create the “FirstS3Bucket” S3 Bucket
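The template itself appears only as an image on the slide; a minimal sketch of an equivalent template deployed with boto3 (the stack name is an assumption, and "FirstS3Bucket" is reused from the caption as the logical resource ID):

import json
import boto3

# Minimal CloudFormation template with one S3 bucket resource.
# The physical bucket name is left for CloudFormation to generate.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "FirstS3Bucket": {
            "Type": "AWS::S3::Bucket"
        }
    },
}

cfn = boto3.client("cloudformation")

# Create a stack from the template; the stack name is assumed.
cfn.create_stack(
    StackName="first-s3-bucket-stack",
    TemplateBody=json.dumps(template),
)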
AWS CloudFormation Stacks
A collection of AWS resources created and managed as a single unit
HOW IT WORKS
Template → CloudFormation → Stack (e.g., an S3 Bucket)
AWS CloudWatch
AWS CloudWatch
Main Monitoring Service
• AWS' proprietary service for monitoring applications and resources in real-time
• Metrics are displayed on a Dashboard
CloudWatch interface
AWS CloudWatch Metrics
A metric is a number to monitor over time
• Features:
o Namespaces: Serve as containers for CloudWatch metrics.
o Time Stamps: Every metric should be linked to a timestamp.
o Dimensions: Key/value pair belonging to a metric.
o Statistics: Aggregated data metrics over defined time intervals.
o Period: The length of time associated with a specific statistic.
o Resolution: Level of detail you can see in your data
-Standard Resolution: data has a one-minute granularity
-High Resolution: data has a granularity of one second
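A minimal boto3 sketch of publishing a custom metric with a namespace, dimension, and resolution; the namespace, metric name, and dimension values are made up for illustration:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one data point of a custom metric.
cloudwatch.put_metric_data(
    Namespace="DataPipeline",
    MetricData=[{
        "MetricName": "RowsProcessed",
        "Dimensions": [{"Name": "JobName", "Value": "daily-sales-etl"}],
        "Value": 15000,
        "Unit": "Count",
        "StorageResolution": 60,   # 60 = standard resolution, 1 = high resolution
    }],
)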
AWS CloudWatch Metric Streams
o Continuously stream metrics to both AWS & 3rd-party destinations near-real time.
o Kinesis Firehose is used to stream data to AWS destinations.
Setup options: Custom setup with Firehose, Quick S3 setup, Quick AWS partner setup
CloudWatch Metric Streams → Kinesis Data Firehose (ingests metrics) → sends metrics to destinations such as S3, Redshift, OpenSearch
AWS CloudWatch Alarms
Monitor metrics and trigger actions when defined thresholds are breached.
• Types of Alarms:
o Metric Alarm: Monitors a single metric
o Composite Alarm: Monitors the state of other alarms
AWS CloudWatch Alarms
● Alarm States:
OK - Metric is within the defined threshold
ALARM - Metric is over the defined threshold
INSUFFICIENT_DATA - Not enough data available to determine the alarm state
● Alarm Actions: Actions an alarm can take when it changes state
❑ Amazon SNS: Trigger email, SMS, or push alerts.
❑ EC2 Actions: Stop, terminate or reboot EC2 instance.
❑ Auto Scaling: Adjust instance count based on load.
❑ Lambda: Execute functions for automation.
❑ Incident Management: Create urgent incident tickets.
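A minimal boto3 sketch of a metric alarm with an SNS action; the instance ID and topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Notify an SNS topic when average EC2 CPU utilization exceeds 80%
# for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:alarm-notifications"],
)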
AWS CloudWatch Logs
Collects and consolidates logs from various sources
• Centralized Logging: From different services in one location
• Real-Time Monitoring: Real-time monitoring of log data
AWS CloudWatch Logs
Collects and consolidates logs from various sources
• Log streams: Sequences of log events from a single source
• Log groups: Logical containers for log streams
• Log events: Records of an activity logged by an application
• Retention Policy: Allows you to set the period a log is retained
• Log Insights: Interactive log analysis tool for querying and
visualizing log data stored in CloudWatch Logs
• Export: Logs can be sent to S3, Kinesis Data Streams/Firehose, and Lambda.
AWS CloudWatch Log Filtering Subscription
Filter log data using a Metric or Subscription filter before
sending it to a destination.
• Metric filter: Extract data from log events to create custom metrics
• Subscription filter: Filter log data being sent to other AWS services
CloudWatch Logs → Subscription Filter → Kinesis Data Streams / Kinesis Data Firehose
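A minimal boto3 sketch of a subscription filter forwarding log events to Kinesis Data Streams; the log group, stream ARN, and role ARN are placeholders:

import boto3

logs = boto3.client("logs")

# Forward ERROR log events from one log group to a Kinesis Data Stream.
# The role must allow CloudWatch Logs to write to the stream.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/daily-sales-etl",
    filterName="errors-to-kinesis",
    filterPattern="ERROR",
    destinationArn="arn:aws:kinesis:us-east-1:111122223333:stream/log-events",
    roleArn="arn:aws:iam::111122223333:role/CWLtoKinesisRole",
)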
Cross Accounts Access
1) Setup Data Stream in Destination Account
2) Create IAM role + Trust Policy in Destination Account to write to Stream
3) Setup Subscription Filter in Source Account
[Diagram] Source Account: CloudWatch Logs → Subscription Filter sends logs by assuming an IAM Role (with a Trust Policy) in the Destination Account → WRITE → Kinesis Data Streams (Resource Policy)
AWS CloudWatch Logs Agent
o EC2 does not send log data to CloudWatch by default; to send its logs to
CloudWatch, a logs agent is needed.
o A CloudWatch Logs Agent is a lightweight, standalone agent that can
be installed on EC2 instances/On-prem servers
o It collects and streams log data from EC2 instances/On-prem servers
to CloudWatch Logs in near real-time.
• Types of log agents:
o CloudWatch logs agent
o CloudWatch Unified logs agent
AWS CloudWatch Logs Agent
Logs Agent
o Older version with limited
capabilities
o Collects logs only
Unified CloudWatch agent
o Enhanced logs agent
o Collects logs as well as system-level metrics
o Collects RAM, CPU Utilization, Memory Usage, Disk Space, Network Traffic, and Swap Space metrics
Note: AWS mostly recommends the Unified CloudWatch agent
AWS CloudTrail
AWS CloudTrail
Audit and Governance Tool
• Records all activities that take place within your AWS account
• Activities are recorded as events
• Enabled by default
CloudTrail interface
AWS CloudTrail Events
Record(s) of activities
• Types of Events
o Management Events: Captures high-level operations
o Data Events: Captures data-level operations
o Insight Events: Captures unusual activity
Events History
view management events of the past
90 days
AWS CloudTrail Trails
Captures records of AWS activities and stores in S3
● Trail Types:
Multi-Region - Trail applies to all regions
Single Region - Trail applies to one region
Organizational - Logs events for all accounts in an organization
• Features:
o Multiple Trails per Region: you can create multiple trails within a single AWS region.
AWS CloudTrail Lake
A Managed data lake for AWS user and API activity
● Lake Channels:
Integration with outside sources
Used to ingest events from external
sources
Service Linked
AWS services create channels to receive
CloudTrail events
AWS CloudTrail Extras
o CloudTrail allows for deep analysis of events
o Create rules with EventBridge if needed
AWS Config
AWS Config
Centralized Configuration Management
• Assess, audit, and evaluate the configurations of your AWS resources
• Disabled by default
• Generates a configuration item for each resource.
Config interface
AWS Config Concepts
o Configuration Item is the current state of individual AWS resources
o Configuration Recorder stores configuration items for resources in your account.
o Configuration History is a historical record of configuration changes
o Configuration Snapshot is a collection of configuration items
o Configuration Stream is an automatically updated list of configuration items
for resources recorded by AWS Config.
o Conformance packs bundles Config rules, remediation actions, and required AWS
resource configurations into a single, reusable package.
o Discovery discovers resources in your AWS environment
o Advanced queries analyzes real-time and historical resource configurations
o Resource Relationship creates a map of relationships between AWS resources.
AWS Config Rules
Evaluates the compliance of your AWS resources against desired configuration
Evaluation results for a Config rule:
• Types of Rules
o AWS Config Managed Rule
o AWS Config Custom Rule
AWS Config Managed Rules
● Pre-defined and customizable rules created by AWS Config.
AWS Config Custom Rules
● Rules you create from scratch.
● Created with Lambda functions or Guard.
[Diagram] Configuration change on a resource (e.g., EC2, S3 bucket) → Config evaluates the resource and checks it against rules → history of changes recorded in S3
AWS Config Trigger Types
Determines when AWS Config evaluates the rules against your resources.
• Trigger Types
o Configuration changes: A configuration change is detected
o Periodic: Evaluates at specified intervals
o Hybrid: Evaluates resource configuration change and chosen frequency
AWS Config Evaluation Modes
● Define when and how resources are evaluated during the resource
provisioning process.
● Evaluation Modes:
Proactive
Assesses resource configurations before
they are deployed
Detective
Assesses resource configurations after
they have been deployed
AWS Config Multi-Account Multi-Region Data Aggregation
Aggregate and centrally manage AWS Config data across multiple AWS
accounts and regions
• Concepts
o Aggregator: Collect Config configuration and compliance data from
multiple source accounts and regions.
o Source Account: AWS accounts where AWS Config records configuration
changes and compliance data for resources
o Aggregator Account: Central hub for aggregating configuration and
compliance data from multiple source accounts
o Authorization: Permission granted to an aggregator Account to collect data.
AWS Well-Architected
Framework
AWS Well-Architected Framework
o Operational Excellence:
Manage operations to deliver business value and continuously improve processes.
o Security:
Protect data and systems; manage access, and respond to security events.
o Reliability:
Ensure systems perform as expected, handle changes in demand, and recover from disruptions.
o Performance Efficiency:
Use resources efficiently, adapt to changing needs, and leverage new technologies.
o Cost Optimization:
Reduce and control costs without sacrificing performance or capacity.
AWS Well-Architected Framework
o Sustainability
Minimize environmental impact by efficiently using resources and reducing carbon emissions.
AWS Well-Architected Tool
AWS Well-Architected Tool
Architecture Optimization Tool • Review workloads against the Well-Architected Framework.
• Questionnaire-Based Assessment
Well-Architected Tool interface
AWS Well-Architected Tool Features
o Workload = Collection of components that add to business value.
o Milestones = crucial stages in your architecture’s evolution throughout its lifecycle
o Lenses = Evaluate your architectures against best practices and identify areas of
improvement.
▪ Lens Catalog (created & maintained by AWS)
▪ Custom lenses (user-defined lenses)
o High-Risk Issues(HRIs) = Architectural and operational choices that may negatively
impact a business.
o Medium risk issues (MRIs) = Architectural and operational choices that may negatively
impact a business but not to the same degree as HRIs.
AWS Well-Architected Tool Extras
Use Cases
• Continuously improve architectures
• Get architectural guidance
• Enable consistent governance
How it Works
• Define the workload
• Review the workload
• Tool returns feedback
AWS
Identity and Access Management
(IAM)
AWS Identity Access Management (IAM)
Centrally manage access & permissions
Users Identities that we can attach permissions to
Groups Collections of users
Roles Collection of permissions that can be assumed by identities
Policies Definition of permissions
AWS Identity Access Management (IAM)
Centrally manage access & permissions
Users Identities that we can attach permissions to
▪ Principle of Least Privilege:
No access per default
Only grant specific access to what is needed
IAM - Users
Types of users:
▪ Root User:
Initial user with full access to all services
Intended for account setup and emergencies
▪ Standard IAM users:
Unique set of credentials
Direct access to AWS resources
▪ Federated users:
Authenticated through external identity providers:
IAM - Groups
Groups:
▪ A collection of users managed as a single entity
▪ Assign policies to group => all users inherit permissions
▪ A user can belong to multiple groups
▪ No credentials associated
IAM Groups
Example: Dev Group contains User A and User B; Test Group contains User C and User D
IAM - Roles
Roles:
▪ A combination of permissions that can be assumed
▪ Attach policies to role
▪ Role can then be assumed by identities
▪ Services need to assume roles to perform actions
Policies are attached to a Role; identities and services then assume the Role
IAM - Policies
Policies are documents that define permissions for IAM entities
Managed policies:
▪ Centrally managed standalone policies
AWS Managed policies:
Created and managed by AWS
Customer Managed policies:
Created and managed by users
IAM - Policies
Inline policies:
▪ Attached only to a single IAM user
▪ Non-reusable
IAM - Policies
Identity-based policies:
▪ Associated with IAM identities
▪ Determine what actions can be performed
▪ Effective to grant identity permissions
across different services and resources
Resource-based policies:
▪ Attached to a resource instead of IAM identity
▪ Grant or deny permissions on the resource
▪ Inline policy only
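A minimal boto3 sketch of a customer-managed, identity-based policy attached to a group; the bucket, policy, and group names are assumptions:

import json
import boto3

iam = boto3.client("iam")

# Customer-managed policy granting read-only access to one S3 bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-example-bucket",
            "arn:aws:s3:::my-example-bucket/*",
        ],
    }],
}

response = iam.create_policy(
    PolicyName="S3ReadOnlyExampleBucket",
    PolicyDocument=json.dumps(policy_document),
)

# Attach the policy to a group so all its users inherit the permission.
iam.attach_group_policy(
    GroupName="DataEngineers",                     # assumed group
    PolicyArn=response["Policy"]["Arn"],
)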
IAM – Trust Policy
Define which entities (accounts, users, or services) are allowed to assume a role.
▪ Type of resource-based policy for IAM roles
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::111122223333:root"
},
"Action": "sts:AssumeRole"
}
]
}
▪ Used for example for cross-account access
[Diagram] Trust policy attached to a Role → EC2 assumes the Role to get access
Section 17:
Containers
Introduction to Docker
Containers
What is Docker?
• Is a platform for developing, shipping, and running
applications in containers
• Packages applications and their dependencies into
standardized units called containers.
• Is used to quickly deploy and scale applications.
Why Use Docker Containers?
• Consistency : Applications run consistently across
different environments.
• Isolation : Containers are isolated from each other
• Portability : Containers can be easily moved
between different systems and environments.
• Scalability : Enables you to easily scale up or
down the number of containers as needed.
Docker Use Cases
• Microservices Architecture
• Continuous Integration and Continuous
Deployment (CI/CD)
• Hybrid Cloud Environments
• Big Data and Analytics
Docker Components
• Docker Engine
• Docker Image
• Docker Containers
• Docker Registry
Docker Registries
Docker Hub Amazon ECR
● Is the most popular container ● Amazon ECR is a fully managed
registry Docker container registry
provided by AWS.
● It hosts millions of pre-built
images for various software ● Integrates seamlessly with other
applications, libraries, and AWS services
frameworks.
● Amazon ECR supports both public
● Docker Hub offers both public and private repositories.
and private repositories.
Docker Processes
Dockerfile → (build) → Docker Image → (run) → Docker Container
Docker Images are pushed to and pulled from Docker Registries
Elastic Container Service
Elastic Container Service
• Is a fully managed container orchestration service.
• Simplifies the process of container
Deployment Management Scaling
• It provides high availability and scalability
• It offers built-in security features
• It's integrated with both AWS and third-party tools.
Elastic Container Service Terms
• Task Definition
• Is a blueprint for your application
• It encapsulates all the necessary configuration
parameters.
• Cluster
• Is a logical grouping of container instances
• Provides a centralized management point
Elastic Container Service Terms
• Task
• is ideal for short-running jobs
• It's an instantiation of a Task Definition.
• Tasks can be scheduled and terminated dynamically
based on workload demands.
Elastic Container Service Terms
• Service
• Is ideal for long-running applications.
• ECS automatically replaces failed tasks.
• It's an instantiation of a Task Definition.
• Container Agents
• Run on each EC2 instance within an ECS cluster.
• Serve as the communication bridge between the ECS
and the container instances.
Amazon ECS launch types
• EC2 launch type
• You must provision & maintain the infrastructure
• Suitable for large workloads that must be price
optimized.
• Enables you to use EC2 instances like spot instances
and custom instance types.
• Scaling does not come out of the box.
Amazon ECS launch types
• Fargate launch type
• You don't need to manage an EC2 infrastructure
• Requires less effort to set up
• AWS just runs ECS Tasks for you based on the CPU /
RAM you need
• handles scaling out your capacity
• External launch type
• Is used to run your containerized applications on
your on-premise server or virtual machine (VM)
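A minimal boto3 sketch of running a task with the Fargate launch type; the cluster, task definition, subnet, and security group values are placeholders:

import boto3

ecs = boto3.client("ecs")

# Run one task on Fargate; AWS provisions the underlying compute.
ecs.run_task(
    cluster="data-jobs",
    launchType="FARGATE",
    taskDefinition="etl-task:1",          # family:revision
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)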
Task placement strategies and constraints
• Applicable for the EC2 launch type only
Task placement strategies
• is an algorithm for selecting instances for task
placement or tasks for termination.
• Available strategies are
Binpack Spread Random
Task placement strategies and constraints
• Binpack
• Tasks are placed on instances that have the least
available CPU or memory capacity
• Helps minimize wasted resources.
• Beneficial for cost optimization and maximizing the
usage of resources.
Task placement strategies and constraints
• Spread
• Spreads tasks evenly across container instances
within the cluster
• Ensures that no single instance becomes overloaded.
• Suitable for ensuring even distribution of tasks
across instances.
Task placement strategies and constraints
• Random
• Randomly places tasks onto container instances within
the cluster
• Not suitable for applications with specific
performance or availability requirements.
• It's primarily used when you don't have specific
constraints or considerations for task placement.
• It is possible to create a task placement strategy that
uses multiple strategies.
Task placement strategies and constraints
• Task placement constraint
• These are rules that must be met in order to place a
task on a container instance.
• There are two types of constraints
Distinct Instance Member of
Task placement strategies and constraints
• Distinct Instance
• Place each task on a different container instance.
• Member of
• Place tasks on container instances that satisfy an
expression.
Task placement strategies and constraints
• When Amazon ECS places a task, it uses the following
process to select the appropriate EC2 Container
instance:
1. CPU, memory, and port requirements
2. Task Placement Constraints
3. Task Placement Strategies
IAM Roles for ECS
• Task Execution Role
• Is an IAM role that ECS uses to manage tasks on your
behalf.
• Task Role
• Enables containers within the task to access AWS
resources securely.
Elastic Container Registry
Amazon Elastic Container Registry
• Amazon ECR is a fully managed Docker container
registry provided by AWS.
• It enables you to store, manage, and deploy Docker
images securely.
• It's integrated with other AWS services.
• Is Secure
Amazon Elastic Container Registry
ECR (Docker Images) → EC2 instances in the ECS cluster pull images directly from ECR
ECS task: Specify the container images and deploy them as part of containerized applications on the ECS cluster.
Amazon Elastic Container Registry
• Features
• Lifecycle policies
• Image scanning
• Cross-Region and cross-account replication
• Versioning
• Tagging
Amazon Elastic Container Registry
Public Repository
• Are accessible to anyone on the internet.
• Special permission or credential is not required.
Private Repository
• Are accessible only to authorized users.
• Access to private repositories can be controlled
using AWS IAM (Identity and Access Management)
Elastic Kubernetes Service
Kubernetes
• Is an open-source platform designed to automate the
Deployment Management Scaling
of containerized applications.
• Google open-sourced the Kubernetes project in 2014
• Suitable for running and managing workloads of all sizes and styles
• It is cloud-agnostic and available on most cloud providers
Amazon Elastic Kubernetes Service
Managed Kubernetes service to run Kubernetes
in the AWS cloud and on-premises data centers.
• It's integrated with other AWS services.
• Provides High Availability
• Scalability
• Security
• Monitoring and Logging
Amazon EKS architecture
Control plane
• Consists of nodes that run the Kubernetes software.
• Manages and orchestrates various components of the
Kubernetes cluster.
• Is fully managed by AWS.
Amazon EKS architecture
Compute
• Contains worker machines called nodes.
• Amazon EKS offers the following primary node types.
AWS Fargate, Karpenter, Managed node groups, Self-managed nodes
Amazon EKS architecture
AWS Fargate
• Is a serverless compute engine.
• AWS manages the underlying infrastructure.
• You specify your application's resource needs, and
AWS handles the rest.
Amazon EKS architecture
Karpenter
• Best for running containers with a high availability
requirement.
• It launches right-sized compute resources in
response to changing application load.
Managed node groups
• Create and manage Amazon EC2 instances for you.
Amazon EKS architecture
Self-managed nodes
• Offer full control over your Amazon EC2 instances
within an Amazon EKS cluster.
• You are in charge of managing, scaling, and
maintaining the nodes.
• Suitable for users who need more control over their
nodes.
Section 18:
Migrations
AWS Snow Family
AWS Snow Family
● Data management and processing.
● Used for bypassing bandwidth limitations.
● Local computing capabilities:
○ Snowball Edge and Snowcone.
● Security:
○ Secure at rest and in transit.
○ Uses AWS KMS.
AWS Snow Family - Snowcone
● Small and lightweight.
● 8TB of storage.
● Built in DataSync agent.
● Data Processes:
○ AWS IoT Greengrass, AWS Lambda.
AWS Snow Family – Snowball Edge
● 1)Storage Optimized:
○ Storage focused.
○ 80 TB.
● 2)Compute Optimized:
○ Run applications, process data.
○ 42 TB.
● Move data into AWS.
● Data is encrypted.
● Used for remote areas and low bandwidth edge locations.
AWS Snow Family – Snowmobile
● Up to 100PB of data carried by a truck.
● Enterprise level large scale datasets.
● Consider over 10PB.
AWS Snow Family
Snow Family Process
Order → Transfer Data Locally → Send Back to AWS → AWS Uploads the Data
Snowcone vs Snowball vs Snowmobile
Storage Capacity:
▪ Snowcone: 8 TB HDD (14 TB SSD optional)
▪ Snowball Edge Storage Optimized: 80 TB
▪ Snowmobile: Up to 100 PB
Migration Size:
▪ Snowcone: Up to 24 TB, online and offline
▪ Snowball Edge Storage Optimized: Up to petabytes, offline
▪ Snowmobile: Up to exabytes, offline
Usable vCPUs:
▪ Snowcone: 2 vCPU
▪ Snowball Edge Storage Optimized: 40 vCPU (104 vCPU for Compute Optimized)
▪ Snowmobile: -
AWS Transfer Family
AWS Transfer Family
● Secure file transfer with SFTP, FTPS and FTP.
● Frequently and securely exchanged data.
● Fully managed.
● Integration with AWS Services:
○ Amazon S3
○ Amazon EFS
● Customization and Control:
○ Setting up DNS
○ IAM based authentication
AWS Transfer Family
● Simplified Migration:
○ No need to modify applications.
● Pricing:
○ Pay as you go.
Use Cases
● Secure data distribution
● Data backup and archiving
AWS DataSync
AWS DataSync
● Simplifies moving data.
● Data transfer on-premises to AWS storage services.
● Data transfer between different AWS storage services.
● File permissions and metadata preserved.
● AWS Snowcone includes AWS DataSync in it.
AWS DataSync
Key Features
▪ Speed:
▪ Up to 10 times faster.
▪ Schedule and Automation:
▪ Allows you to schedule transfers.
▪ It does not have a continuous sync option.
▪ Integration:
▪ Amazon S3, Amazon EFS, Amazon FSx for Windows
On-Premises
AWS DataSync – How It Works
Setting Up AWS DataSync
▪ Agent Installation:
▪ Install the agent on a server that has access to NFS or SMB file systems.
▪ Task Configuration:
▪ Define sources (NFS, SMB) and targets (S3, EFS, FSx).
▪ Data Transfer Options:
▪ Scheduling, bandwidth throttling
[Diagram] On-premises file systems (NFS or SMB) → DataSync Agent (on-premises) → TLS → AWS
AWS Database Migration
Service (DMS)
AWS DMS
● Helps you migrate databases.
● Supports various types of database engines.
● Continuous replication.
● Uses EC2 instances.
● Security:
○ Data encrypted during transit.
● Pricing:
○ Pay as you go
AWS DMS – Source & Target
SOURCES
- Amazon Aurora
- Oracle
- Microsoft SQL Server
- MySQL
- PostgreSQL
- MongoDB
- SAP ASE
TARGETS
- Amazon Aurora
- RDS
- Redshift
- DynamoDB
- DocumentDB
- S3
- Kinesis Data Streams
- Apache Kafka
AWS DMS – How It Works
Homogeneous Migration
Source DB → Source Endpoint → Replication Instance running the Replication Task (Full Data Load, CDC) → Target Endpoint → Target DB
AWS DMS – How It Works
Heterogeneous Migration
Source DB → SCT Server (Schema Conversion) → Target DB
Source DB → Source Endpoint → Replication Instance running the Replication Task (Full Data Load, CDC) → Target Endpoint → Target DB
AWS Application Discovery
Service
AWS Application Discovery Service
● Get insights of your on-premise servers and databases
● Useful for migrations to understand own resources
● Collects information about your applications.
○ Server specifications, dependencies, usage data
Agent-based data collection: Install a software agent
Agentless data collection: For VMs running in a VMware vCenter environment
● Analyze collected data.
● Plan your migration process with collected information.
○ Can integrate with AWS Migration Hub and AWS Application Migration Service
AWS Application Migration
Service
AWS Application Migration Service
● Simplifies migration process with automations.
Lift-and-Shift ⇒ Can replicate entire servers without significant downtime
Automates Migration ⇒ Migration of applications, databases, servers to AWS
Test Before Switching ⇒ Supports creating test environments
Compatibility ⇒ Supports most common environments, including Windows and Linux
AWS Application Migration Service
● Workflow for migration
Install ⇒ Install the AWS Replication Agent on the source server
Configure ⇒ Configure the launch settings for each server
Launch Test Instance ⇒ Test the migration of your source servers to AWS
Cutover ⇒ Cutover will migrate your source servers
Section 19:
VPCs
Networking
Virtual Private Cloud
Amazon VPC
Private, secure, isolated network within the AWS cloud to launch your resources.
⇒ Can be linked to on-premise infrastructure
⇒ Regional Service
Amazon VPC Subnets
A range of IP addresses in your VPC
⇒ Zonal Service
• Types of Subnets
o Public Subnets have access to the internet
o Private Subnets do not have direct access to the internet
o VPN-only Subnets are accessed via a VPN connection
o Isolated Subnets are only accessed by other resources in the same VPC
Subnet Routing
Route Tables are sets of rules that dictate how traffic is
routed in your VPC.
Networking
Components
Amazon VPC Networking Components
Internet Gateway: Allows communication between your VPC and the internet.
Egress-Only Internet Gateway: Allows outbound IPv6 communication from VPC instances to the internet, while blocking inbound IPv6 connections.
NAT Gateway/Instance: Allows resources in private subnets to connect to external destinations but prevents inbound connection requests.
VPNs & VPC Peering
Amazon VPC and Corporate Network
You can connect your VPC to your own corporate data center
Virtual Private
Network
• Secure connection between a
VPC and an on-premises
network over the internet.
Direct Connect
• Private connectivity between
corporate networks and AWS.
Amazon VPC Peering
• Enables direct communication between two VPCS
• Intra or Inter Region
Amazon VPC Transit Gateway
• Central hub interconnecting VPCs and on-
premises networks.
Security Groups &
NACLs
Amazon VPC Security
• Security Groups control inbound and outbound traffic at the resource level
• Network Access Control Lists (NACLs) control inbound and outbound traffic at the subnet level
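A minimal boto3 sketch of a security group ingress rule at the resource level; the group ID, port, and CIDR are assumptions:

import boto3

ec2 = boto3.client("ec2")

# Allow inbound Redshift traffic (port 5439) only from the VPC CIDR.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "10.0.0.0/16", "Description": "VPC-internal access"}],
    }],
)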
VPC
Additional Features
Amazon VPC Extra Features
VPC Flow Logs: Capture information about IP traffic going to and from network
interfaces.
Reachability Analyzer: Analyze network reachability between resources within your VPC
and external endpoints
Ephemeral Ports: Temporary ports for outbound communication
VPC Sharing: Share your VPC resources with other AWS accounts in the same AWS
Organization,
Section 20:
Security
AWS KMS
(Key Management Service)
KMS
Overview
• Manages Encryption Keys: To encrypt data in other AWS services
⇒ Used to encrypt & decrypt data
• Integration:
Integrates to other services (S3, databases, EBS volumes etc.)
• API calls:
Don't store secrets in code
• Cloud Trail integration:
Log use of your keys for auditing
Use Cases
• Encrypt data stored in S3 buckets
• Database credentials: Encrypt credentials instead of storing them in plain text
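A minimal boto3 sketch of encrypting and decrypting a small value with a KMS key; the key alias is a placeholder (KMS encrypts up to 4 KB directly; larger data typically uses envelope encryption with a data key):

import boto3

kms = boto3.client("kms")
key_id = "alias/app-secrets"   # placeholder key alias

# Encrypt a small secret.
encrypted = kms.encrypt(KeyId=key_id, Plaintext=b"db-password-123")
ciphertext = encrypted["CiphertextBlob"]

# Decrypt it again; KMS resolves the key from metadata in the ciphertext.
decrypted = kms.decrypt(CiphertextBlob=ciphertext)
print(decrypted["Plaintext"])   # b'db-password-123'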
Types of Keys
Symmetric Keys (default)
• One key for both encryption and decryption
• Suitable for high-volume data
• At rest & in transit
• Example: AES with 256-bit keys
Asymmetric Keys
• Uses a key pair
• Public Key: Encrypts data (can be downloaded)
• Private Key: Decrypts data
• Encrypted data must be shared safely
• Sign/Verify operations
• Example: RSA & ECC
AWS-Managed vs. Customer-Managed
AWS owned keys
• Controlled and managed by AWS
• No direct access and no lifecycle control
• Owned by the service
• Good choice unless you need to audit and manage the key
AWS managed keys
• Created & managed by AWS KMS, but customer-specific
• No control over usage, policies, rotation, etc.
• Good choice unless you need to control the encryption key
• Audit using CloudTrail
AWS-Managed vs. Customer-Managed
Customer managed keys
• You create, own, and manage
• Full control over these KMS keys
• Policies, rotation, encryption
KMS Key Management
Creating Keys:
▪ AWS Management Console, AWS CLI, or SDK
▪ Selecting the key type
▪ Key policies
Rotating Keys:
▪ Replaces old keys with new ones
▪ KMS handles the complexities
Managing Keys:
▪ Configuring key policies
Key Rotation
Automatic Rotation:
• For keys that AWS manages
• Automatically rotates the keys every year
Customer Managed:
• User's responsibility
Policies
Default Policies:
• Full access to the key for the root user
• Allows usage of IAM policies
Custom Policies:
• More complex requirements
• More granular control
• Regulated industries
AWS KMS
Pricing
• Pricing:
▪ $1.00 per customer-managed key per month;
▪ $0.03 per 10,000 API requests.
• Key Rotation:
• Automatic: Free for AWS-managed keys.
• Manual: No extra charge for customer-managed keys; requires setup.
• Cross-Region Requests: $0.01 per 10,000 requests for using a KMS key in a
different region.
Cross-region
Keys are bound to the region in which they are created
[Diagram] Region A: Volume → Snapshot (Default Key) → encrypted copy → Region B: Snapshot → Volume (Customer-managed Key)
Multi-Region keys in AWS KMS
• Use keys in different AWS Regions as if you had the same key
• Each set of related multi-Region keys has the same key material and key ID
• Manage each multi-Region key independently
• Create a multi-Region primary key ⇒ replicate it into Regions that you select
Use Cases
• Disaster recovery in a multi-region setup
• Data distributed in multiple regions
• Distributed signing applications
Cross-account
Keys can be shared across accounts
Configurable using policies
[Diagram] Account A: Volume → Snapshot encrypted with a customer-managed key; set key policies and snapshot permissions → Account B: gets snapshot access / copy → Volume
AWS Macie
AWS Macie
Purpose
• Automatically scans and classifies sensitive data in Amazon S3
• Machine Learning: Detects sensitive data
▪ Personally identifiable information (PII)
▪ Financial data
▪ Health information
▪ Anomalous access patterns
Features
• Automated alerts:
Detailed alerts when sensitive data or unusual access patterns are
detected; Integrates with CloudWatch and other services
• Comprehensive Dashboard: Overview of S3 environment
▪ Regulatory Compliance
Use Cases ▪ Security Monitoring
▪ Risk Assessment
AWS Secrets Manager
AWS Secrets Manager
Purpose
• Manage and retrieve secrets
▪ Database credentials (RDS & Redshift)
▪ API Keys
▪ Access Tokens
• Secrets Management & Storage:
Features
Encrypts and stores secrets with pre-built integration for other services
• Automatic Rotation: No need to change code in applications
• Retrieval: Secure retrieval via API calls
• Auditing: CloudTrail Integration
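A minimal boto3 sketch of retrieving a secret at runtime; the secret name and its JSON layout are assumptions:

import json
import boto3

secrets = boto3.client("secretsmanager")

# Retrieve database credentials at runtime instead of hard-coding them.
response = secrets.get_secret_value(SecretId="prod/redshift/etl-user")
credentials = json.loads(response["SecretString"])

print(credentials["username"])   # e.g. used to build a DB connection string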
Cross-region replication
Replicate secrets across multiple AWS Regions
[Diagram] Region A: Primary Secret (enable replication, add regions) → Region B: Secret Replica; Region C: Secret Replica (can be promoted to a standalone secret)
Cross-region replication
• ARN Consistency:
ARN remains the same
Primary: arn:aws:secretsmanager:Region1:123456789012:secret:MySecret-a1b2c3
Replica: arn:aws:secretsmanager:Region2:123456789012:secret:MySecret-a1b2c3
• Metadata and Data Replication:
Encrypted secret data, tags, and resource policies are replicated across
specified regions.
• Automatic Rotation:
If rotation is enabled on the primary secret, the updated secret values
automatically propagate to all replicas.
Cross-account
Secrets can be shared across accounts
Configurable using policies
[Diagram] Account A: Attach a resource policy to the secret and modify the access policy → Account B: gains secret access
AWS Shield
AWS Shield
Protect against Distributed Denial of Service (DDoS) attacks
• Automatically enabled & Free
AWS Shield Standard
• Protection against most common network and transport layer DDoS
attacks (96%)
• Attacks against network and transport layers (layer 3 and 4) and the
application layer (layer 7).
• E.g. Slow reads or volumetric attacks
• AWS Services Covered: Amazon CloudFront, Elastic Load Balancing (ELB),
Amazon Route 53, and more.
• Visibility and Reporting: Provides AWS CloudWatch metrics and AWS
Health Dashboard notifications during larger attacks.
AWS Shield
Protect against Distributed Denial of Service (DDoS)
attacks
AWS Shield Advanced • Enhanced DDoS Protection: Guards against complex DDoS attacks.
• Financial Shield: Protects from attack-related cost spikes.
• 24/7 Expertise: Access to AWS DDoS Response Team.
• Attack Insights: Immediate and detailed attack analysis.
• Custom Rules: Personalized protection with AWS WAF.
• Targeted Defense: Specific protection for key resources.
• Premium Service: Subscription model with advanced features.