
AWS DATA ENGINEERING INTERNSHIP

Internship-II report submitted in partial fulfilment

of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING-DATA SCIENCE

Submitted By

A. Jahnavi Dhana Sri 322103383005


U. Sri Sushma 322103383063
K. Sampath Vishal 323103383L02
Sk. Bashira 323103383L06

Under the Esteemed Guidance of

Course Mentor: Mr. D. Arun Kumar (Assistant Professor)
Course Coordinator: Dr. Ch. Sita Kumari (Associate Professor)

Department of Computer Science and Engineering-Data Science


Gayatri Vidya Parishad College of Engineering (Autonomous),
Visakhapatnam.
(2022-2026)

CERTIFICATE

This is to certify that this report on “AWS Data Engineering” is a bona fide record
of Internship-II work submitted by—

A. Jahnavi Dhana Sri 322103383005


U. Sri Sushma 322103383063
K. Sampath Vishal 323103383L02
Sk. Bashira 323103383L06

In their VI semester, in partial fulfilment of the requirements for the award of the degree of

Bachelor of Technology in Computer Science and Engineering-Data Science
During the academic year 2024-2025

Course Coordinator: Dr. Ch. Sita Kumari (Associate Professor)
Head of the Department: Dr. Y. Anuradha (Professor)
Course Mentor: Mr. D. Arun Kumar (Assistant Professor)

ACKNOWLEDGEMENTS

We would like to express our deep sense of gratitude to our esteemed institute Gayatri Vidya
Parishad College of Engineering (Autonomous), which has provided us an opportunity to fulfil
our cherished desire.
We express our sincere thanks to our Principal Dr. A.B. KOTESWARA RAO, Gayatri Vidya
Parishad College of Engineering (Autonomous) for his encouragement to us during this project.
We thank our Course Mentor, Mr. D. Arun Kumar, Assistant Professor, Department of
Computer Science and Engineering, for his kind suggestions and guidance towards the successful
completion of our internship.
We thank our Course Coordinator, Dr. CH. SITA KUMARI, Associate Professor, Department
of Computer Science and Engineering, for her valuable guidance and insightful suggestions that
greatly contributed to the successful completion of our internship.
We are highly indebted to Dr. Y. ANURADHA, Professor and Head of the Department of
Computer Science and Engineering-Data Science, Gayatri Vidya Parishad College of
Engineering (Autonomous), for giving us an opportunity to do the internship in college.

We thank our Internship Coordinators, Dr. CH. SITA KUMARI, Associate Professor,
Department of Computer Science and Engineering and Ms. T. TEJESWARI, Assistant
Professor, Department of Computer Science and Engineering for providing us with this
internship opportunity.

We are very thankful to AICTE and EduSkills for providing a comprehensive platform that
helped us resolve every issue regarding the internship.

Finally, we are indebted to the teaching and non-teaching staff of the Computer Science and
Engineering Department for all their support in completion of our project.

A. Jahnavi Dhana Sri 322103383005


U. Sri Sushma 322103383063
K. Sampath Vishal 323103383L02
Sk. Bashira 323103383L06

322103383005:

322103383063:

323103383L02:

323103383L06

ABSTRACT

Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud
computing platforms and APIs to individuals, companies, and governments, on a metered
pay-as-you-go basis. These cloud computing web services provide distributed computing
processing capacity and software tools via AWS server farms. One of these services is
Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual
cluster of computers, available all the time, through the Internet. AWS's virtual computers
emulate most of the attributes of a real computer, including hardware central processing units
(CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-
disk/SSD storage; a choice of operating systems; networking; and pre-loaded application
software such as web servers, databases, and customer relationship management (CRM).

AWS Academy Cloud Foundations is intended for those who seek an overall understanding
of cloud computing concepts, independent of specific technical roles. It provides a detailed
overview of cloud concepts, AWS core services, security, architecture, pricing, and support.
Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go
pricing. Instead of buying, owning, and maintaining physical data centres and servers, you
can access technology services, such as computing power, storage, and databases, on an
as-needed basis from a cloud provider like Amazon Web Services (AWS).

AWS Academy Data Engineering is designed to help students learn about and get hands-on
practice with the tasks, tools, and strategies that are used to collect, store, prepare, analyze,
and visualize data for use in analytics and machine learning (ML) applications. Throughout
the course, students will explore use cases from real-world applications, which will enable
them to make informed decisions while building data pipelines for their particular
applications.

INDEX

About AWS 15

Course-1: Cloud Foundations:

1. Cloud Concepts Overview 15


1.1 Introducing Cloud Architecting
1.2 Advantages of Cloud Computing
1.3 Introduction to Amazon Web Services
1.4 Moving to the AWS Cloud – The AWS Cloud Adoption Framework

2. Cloud Economics and Billing 17


2.1 Fundamentals of Pricing
2.2 Total Cost of Ownership
2.3 AWS Organizations
2.4 AWS Billing and Management
2.5 Technical Support

3. AWS Global Infrastructure Overview 20


3.1 AWS Global Infrastructure
3.2 AWS Services and Categories Overview
3.3 Summary

4. AWS Cloud Security 21


4.1 AWS Shared Responsibility Model
4.2 AWS Identity and Access Management
4.3 Securing a new AWS Account
4.4 Securing Accounts
4.5 Securing Data on AWS
4.6 Working to ensure compliance

5. Networking And Content Delivery 24


5.1 Networking Basics
5.2 Amazon VPC
5.3 VPC Networking
5.4 VPC Security
5.5 Amazon Route 53
5.6 Amazon CloudFront

6. Compute 28
6.1 Compute Services Overview
6.2 Amazon EC2
6.3 Amazon EC2 Cost Optimization
6.4 Container Services
6.5 Introduction to AWS Lambda
6.6 Introduction to AWS Elastic Beanstalk

7. Storage 31
7.1 Amazon Elastic Block Store
7.2 Amazon Simple Storage Service
7.3 Amazon Elastic File System
7.4 Amazon S3 Glacier

8. Databases 33
8.1 Amazon Relational Database Service
8.2 Amazon DynamoDB
8.3 Amazon Redshift
8.4 Amazon Aurora

9. Cloud Architecture 36
9.1 AWS Well-Architected Framework
9.2 Reliability and Availability
9.3 AWS Trusted Advisor

10. Auto Scaling and Monitoring 37


10.1 Elastic Load Balancing
10.2 Amazon CloudWatch
10.3 Amazon EC2 Auto Scaling

Course-2: AWS Data Engineering:

1. Welcome to AWS Data Engineering 39


1.1 Course Objectives

2. Data-Driven Organizations 40
2.1 Data-Driven Decisions
2.2 The Data Pipeline Infrastructure
2.3 Module Summary

3. The Elements of Data 41


3.1 Volume and Velocity
3.2 Variety - Data Types
3.3 Variety - Data Sources
3.4 Veracity and Value
3.5 Activities to Improve Veracity
3.6 Module Summary

4. Design Principles & Patterns for Data Pipelines 43


4.1 AWS Well-Architected Framework and Lenses
4.2 The Evolution of Data Architectures
4.3 Modern Data Architecture on AWS
4.4 Modern Data Architecture Pipeline – Ingestion and Storage
4.5 Modern Data Architecture Pipeline – Processing and Consumption
4.6 Streaming Analytics Pipeline
4.7 Module Summary

5. Securing and Scaling the Data Pipeline 45


5.1 Cloud Security Review
5.2 Security of Analytics Workloads
5.3 ML Security
5.4 Scaling – An Overview
5.5 Creating a Scalable Infrastructure
5.6 Creating Scalable Components
5.7 Module Summary

6. Ingesting and Preparing Data 47


6.1 ETL and ELT Comparison
6.2 Data Wrangling Introduction
6.3 Data Discovery
6.4 Data Structuring
6.5 Data Cleaning

6.6 Data Enriching
6.7 Data Validating
6.8 Data Publishing
6.9 Module Summary

7. Ingesting by Batch or by Stream 49


7.1 Comparing Batch and Stream Ingestion
7.2 Batch Ingestion Processing
7.3 Purpose-Built Ingestion Tools
7.4 AWS Glue for Batch Ingestion Processing
7.5 Scaling Considerations for Batch Processing
7.6 Kinesis for Stream Processing
7.7 Scaling Considerations for Stream Processing
7.8 Ingesting IoT Data by Stream
7.9 Module Summary

8. Storing and Organizing Data 52


8.1 Data Lake Storage
8.2 Data Warehouse Storage
8.3 Purpose-Built Databases
8.4 Storage in Support of the Pipeline
8.5 Securing Storage
8.6 Module Summary

9. Processing Big Data 54


9.1 Big Data Processing Concepts
9.2 Apache Hadoop
9.3 Apache Spark
9.4 Amazon EMR
9.5 Managing Your Amazon EMR Clusters
9.6 Apache Hudi
9.7 Module Summary

10. Processing Data for ML 56


10.1 ML Concepts
10.2 The ML Lifecycle
10.3 Framing the ML Problem to Meet the Business Goal
10.4 Collecting Data
10.5 Applying Labels to Training Data with Known Targets
10.6 Preprocessing Data
10.7 Feature Engineering
10.8 Developing a Model
10.9 Deploying a Model

10.10 ML Infrastructure on AWS
10.11 SageMaker
10.12 Introduction to Amazon CodeWhisperer
10.13 AI/ML Services on AWS
10.14 Module Summary

11. Analysing and Visualizing Data 60


11.1 Considering Factors that Influence Tool Selection
11.2 Comparing AWS Tools and Services
11.3 Selecting Tools for a Gaming Analytics Use Case
11.4 Module Summary

12. Automating the Pipeline 61


12.1 Automating Infrastructure Deployment
12.2 CI/CD
12.3 Automating with Step Functions
12.4 Module Summary

Case Study: 65
1. Aim
2. Description
3. AWS services used
4. Benefits of AWS services
5. Conclusion

Conclusion 69

References 70

AWS

Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand
cloud computing platforms and APIs to individuals, companies, and governments, on a
metered, pay-as-you-go basis. Clients will often use this in combination with
autoscaling (a process that allows a client to use more computing in times of high application
usage, and then scale down to reduce costs when there is less traffic). These cloud computing
web services provide various services related to networking, compute, storage, middleware,
IoT and other processing capacity, as well as software tools via AWS server farms. This
frees clients from managing, scaling, and patching hardware and operating systems. One of
the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users
to have at their disposal a virtual cluster of computers, with extremely high availability, which
can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS's
virtual computers emulate most of the attributes of a real computer, including hardware
central processing units (CPUs) and graphics processing units (GPUs) for processing;
local/RAM memory; Hard-disk (HDD)/SSD storage; a choice of operating systems;
networking; and pre-loaded application software such as web servers, databases, and
customer relationship management (CRM).

AWS services are delivered to customers via a network of AWS server farms located
throughout the world. Fees are based on a combination of usage (known as a "Pay-as-you-
go" model), hardware, operating system, software, and networking features chosen by the
subscriber requiring various degrees of availability, redundancy, security, and service options.
Subscribers can pay for a single virtual AWS computer, a dedicated physical computer, or
clusters of either. Amazon provides select portions of security for subscribers (e.g. physical
security of the data centres) while other aspects of security are the responsibility of the
subscriber (e.g. account management, vulnerability scanning, patching). AWS operates from
many global geographical regions including seven in North America.

Amazon markets AWS to subscribers as a way of obtaining large-scale computing capacity


more quickly and cheaply than building an actual physical server farm. All services are billed
based on usage, but each service measures usage in varying ways. As of 2023 Q1, AWS has a
31% market share for cloud infrastructure, while the next two competitors, Microsoft Azure
and Google Cloud, have 25% and 11% respectively, according to Synergy Research Group.

CLOUD FOUNDATIONS
MODULE 1
CLOUD CONCEPTS OVERVIEW

Section 1: Introduction to cloud computing:

Some key takeaways from this section of the module are:


• Cloud computing is the on-demand delivery of IT resources via the internet with
pay-as-you-go pricing.
• Cloud computing enables you to think of (and use) your infrastructure as software.
• There are three cloud service models: IaaS, PaaS, and SaaS.
• There are three cloud deployment models: cloud, hybrid, and on-premises or private cloud.
• There are many AWS service analogues for the traditional, on-premises IT space.

Section 2: Advantages of cloud computing:

The key takeaways from this section of the module include the six advantages of cloud
computing:
• Trade capital expense for variable expense
• Massive economies of scale
• Stop guessing capacity
• Increase speed and agility
• Stop spending money on running and maintaining data centres
• Go global in minutes

Fig 1.1: Applications of Cloud Computing

Section 3: Introduction to Amazon Web Services (AWS):

The key takeaways from this section of the module include:


• AWS is a secure cloud platform that offers a broad set of global cloud-based products
called services that are designed to work together.
• There are many categories of AWS services, and each category has many services to
choose from.
• Choose a service based on your business goals and technology requirements.
• There are three ways to interact with AWS services.
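
The three ways to interact with AWS services are the AWS Management Console, the AWS CLI, and the SDKs. As a minimal sketch of the SDK route, the Python snippet below uses boto3 to list the S3 buckets in an account; it assumes AWS credentials and a default Region are already configured locally.

# Minimal boto3 (AWS SDK for Python) sketch: list the S3 buckets in the account.
import boto3

s3 = boto3.client("s3")            # create a low-level S3 client
response = s3.list_buckets()       # call the ListBuckets API
for bucket in response["Buckets"]:
    print(bucket["Name"])          # print each bucket name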

Section 4: Moving to the AWS Cloud - The AWS Cloud Adoption Framework (AWS CAF):

The key takeaways from this section of the module include:


• Cloud adoption is not instantaneous for most organizations and requires a thoughtful,
deliberate strategy and alignment across the whole organization.
• The AWS CAF was created to help organizations develop efficient and effective plans for
their cloud adoption journey.
• The AWS CAF organizes guidance into six areas of focus, called perspectives.
• Perspectives consist of sets of business or technology capabilities that are the
responsibility of key stakeholders.

Module-1 Summary:

In summary, in this module we learned how to:


• Define different types of cloud computing
• Describe six advantages of cloud computing
• Recognize the main AWS service categories and core services
• Review the AWS Cloud Adoption Framework

MODULE 2
CLOUD ECONOMICS AND BILLING

Section 1: Fundamentals of pricing:

In summary, while the number and types of services offered by AWS have increased
dramatically, our philosophy on pricing has not changed. At the end of each month, you pay
only for what you use, and you can start or stop using a product at any time. No long-term
contracts are required.
The best way to estimate costs is to examine the fundamental characteristics for each AWS
service, estimate your usage for each characteristic, and then map that usage to the prices
that are posted on the AWS website. The service pricing strategy gives you the flexibility to
choose the services that you need for each project and to pay only for what you use.
There are several free AWS services, including:
• Amazon VPC
• Elastic Beanstalk
• AWS CloudFormation
• IAM
• Automatic scaling services
• AWS OpsWorks
• Consolidated Billing
While the services themselves are free, the resources that they provision might not be free.
In most cases, there is no charge for inbound data transfer or for data transfer between other
AWS services within the same AWS Region. There are some exceptions, so be sure to
verify data transfer rates before you begin to use the AWS service. Outbound data transfer
costs are tiered.

Section 2: Total cost of ownership:

It is difficult to compare an on-premises IT delivery model with the AWS Cloud. The two
are different because they use different concepts and terms. Using on-premises IT involves
a discussion that is based on capital expenditure, long planning cycles, and multiple
components to buy, build, manage, and refresh resources over time. Using the AWS Cloud
involves a discussion about flexibility, agility, and consumption-based costs.

Some of the costs that are associated with data centre management include:
• Server costs for both hardware and software, and facilities costs to house the equipment.
• Storage costs for the hardware, administration, and facilities.
• Network costs for hardware, administration, and facilities.
• And IT labour costs that are required to administer the entire solution.
Soft benefits include:

• Reusing service and applications that enable you to define (and redefine solutions) by
using the same cloud service
• Increased developer productivity

Fig 2.1: Cost of On-premises versus Cloud

Section 3: AWS Organizations:

AWS Organizations enables you to:


• Create service control policies (SCPs) that centrally control AWS services across multiple
AWS accounts.
• Create groups of accounts and then attach policies to a group to ensure that the correct
policies are applied across the accounts.
• Simplify account management by using application programming interfaces (APIs) to
automate the creation and management of new AWS accounts.
• Simplify the billing process by setting up a single payment method for all the AWS
accounts in your organization. With consolidated billing, you can see a combined view of
charges that are incurred by all your accounts, and you can take advantage of pricing
benefits from aggregated usage. Consolidated billing provides a central location to manage
billing across all of your AWS accounts, and the ability to benefit from volume discounts.

Section 4: AWS Billing and Cost Management:

AWS Billing and Cost Management is the service that you use to pay your AWS bill,
monitor your usage, and budget your costs. Billing and Cost Management enables you to
forecast and obtain a better idea of what your costs and usage might be in the future so that
you can plan ahead.
You can set a custom time period and determine whether you would like to view your data
at a monthly or daily level of granularity.
With the filtering and grouping functionality, you can further analyse your data using a variety
of available dimensions. The AWS Cost and Usage Report Tool enables you to identify
opportunities for optimization by understanding your cost and usage data trends and how you
are using your AWS implementation.
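
As a hedged illustration, the same cost and usage data can also be pulled programmatically through the Cost Explorer API; the sketch below (boto3) reports monthly unblended cost, with a placeholder date range.

# Sketch: query monthly unblended cost with the Cost Explorer API (boto3).
# The date range is a placeholder; Cost Explorer must be enabled on the account.
import boto3

ce = boto3.client("ce")
result = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},  # placeholder range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)
for period in result["ResultsByTime"]:
    amount = period["Total"]["UnblendedCost"]["Amount"]
    print(period["TimePeriod"]["Start"], amount)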

Section 5: Technical Support:

AWS provides a unique combination of tools and expertise:

• AWS Support
• AWS Support Plans
Support is provided for:
• Experimenting with AWS
• Production use of AWS
• Business-critical use of AWS
Proactive guidance:
• Technical Account Manager (TAM)
Best practices:
• AWS Trusted Advisor
Account assistance:
• AWS Support Concierge

Module-2 Summary:

In summary, in this module, we:


• Explored the fundamentals of AWS pricing
• Reviewed Total Cost of Ownership concepts
• Reviewed an AWS Pricing Calculator estimate.
Total Cost of Ownership is a concept to help you understand and compare the costs that are
associated with different deployments. AWS provides the AWS Pricing Calculator to assist
you with the calculations that are needed to estimate cost savings.
Use the AWS Pricing Calculator to:
• Estimate monthly costs
• Identify opportunities to reduce monthly costs
• Model your solutions before building them
AWS Billing and Cost Management provides you with tools to help you access, understand,
allocate, control, and optimize your AWS costs and usage. These tools include AWS Bills,
AWS Cost Explorer, AWS Budgets, and AWS Cost and Usage Reports.
These tools give you access to the most comprehensive information about your AWS costs
and usage including which AWS services are the main cost drivers. Knowing and
understanding your usage and costs will enable you to plan ahead and improve your AWS
implementation.

MODULE 3
AWS GLOBAL INFRASTRUCTURE OVERVIEW

Section 1: AWS Global Infrastructure:

Some key takeaways from this section of the module include:


• The AWS Global Infrastructure consists of Regions and Availability Zones.
• Your choice of a Region is typically based on compliance requirements or to reduce
latency.
• Each Availability Zone is physically separate from other Availability Zones and has
redundant power, networking, and connectivity.
• Edge locations and Regional edge caches improve performance by caching content closer
to users.

Fig 3.1: Components of Global Infrastructure
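
A minimal sketch of exploring this infrastructure programmatically with boto3: it lists the available Regions and the Availability Zones of the currently configured Region.

# Sketch: list AWS Regions and the Availability Zones of the current Region (boto3).
import boto3

ec2 = boto3.client("ec2")
regions = ec2.describe_regions()["Regions"]
print("Regions:", [r["RegionName"] for r in regions])

zones = ec2.describe_availability_zones()["AvailabilityZones"]
print("Availability Zones:", [z["ZoneName"] for z in zones])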

Section 2: AWS services and service category overview:

AWS categories of services:


Analytics, Application Integration, AR and VR, Blockchain, Business Applications,
Compute, Cost Management, Customer Engagement, Database, Developer Tools, End User
Computing, Game Tech, Internet of Things, Machine Learning, Management and
Governance, Media Services, Migration and Transfer, Mobile, Networking and Content
Delivery, Robotics, Satellite, Security Identity and Compliance, Storage, and more.

Module-3: Summary:

In summary, in this module we learned how to:


• Identify the difference between AWS Regions, Availability Zones, and edge locations
• Identify AWS service and service categories

MODULE – 4
AWS CLOUD SECURITY

Section 1: AWS shared responsibility model:

Some key takeaways from this section of the module include:


AWS and the customer share security responsibilities:
• AWS is responsible for security of the cloud
• Customer is responsible for security in the cloud
AWS is responsible for protecting the infrastructure—including hardware, software,
networking, and facilities—that run AWS Cloud services
→ For services that are categorized as infrastructure as a service (IaaS), the customer is
responsible for performing necessary security configuration and management tasks
• For example, guest OS updates and security patches, firewall, security group
configurations

Section 2: AWS Identity and Access Management (IAM):

Some key takeaways from this section of the module include:


IAM policies are constructed with JavaScript Object Notation (JSON) and define
permissions.
• IAM policies can be attached to any IAM entity.
• Entities are IAM users, IAM groups, and IAM roles.
An IAM user provides a way for a person, application, or service to authenticate to AWS.
An IAM group is a simple way to attach the same policies to multiple users.
An IAM role can have permissions policies attached to it, and can be used to delegate
temporary access to users or applications.

Fig 4.1: AWS identity and access management
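
To make the JSON structure of an IAM policy concrete, here is a minimal sketch of a least-privilege, read-only policy for a single bucket, created with boto3; the bucket name and policy name are hypothetical placeholders, not names from the course.

# Sketch: define a least-privilege IAM policy as JSON and create it with boto3.
# "example-reports-bucket" and "ReportReadOnly" are placeholder names.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="ReportReadOnly",
    PolicyDocument=json.dumps(policy_document),
)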

Section 3: Securing a new AWS account:

The key takeaways from this section of the module are all related to best practices for
securing an AWS account. Those best practice recommendations include:
• Secure logins with multi-factor authentication (MFA).
• Delete account root user access keys.
• Create individual IAM users and grant permissions according to the principle of least
privilege.
• Use groups to assign permissions to IAM users.
• Configure a strong password policy.
• Delegate using roles instead of sharing credentials.
• Monitor account activity using AWS CloudTrail.

Section 4: Securing accounts:

• AWS Organizations enables you to consolidate multiple AWS accounts so that you
centrally manage them.
• AWS Key Management Service (AWS KMS) is a service that enables you to create and
manage encryption keys, and to control the use of encryption across a wide range of AWS
services and your applications.
• Amazon Cognito provides solutions to control access to AWS resources from your
application. You can define roles and map users to different roles so your application can
access only the resources that are authorized for each user.
• AWS Shield is a managed distributed denial of service (DDoS) protection service that
safeguards applications that run on AWS. It provides always-on detection and automatic
inline mitigations that minimize application downtime and latency, so there is no need to
engage AWS Support to benefit from DDoS protection.

Section 5: Securing data on AWS:

AWS supports encryption of data at rest


• Data at rest = Data stored physically (on disk or on tape)
• You can encrypt data stored in any service that is supported by AWS KMS, including:
• Amazon S3
• Amazon EBS
• Amazon Elastic File System (Amazon EFS)
• Amazon RDS managed databases
• Tools and options for controlling access to S3 data include –
• Amazon S3 Block Public Access feature: Simple to use.
• IAM policies & Bucket policies: A good option when the user can authenticate using IAM
• Access control lists (ACLs): A legacy access control mechanism.
• AWS Trusted Advisor bucket permission check: A free feature.
Fig 4.2: AWS Security Infrastructure

Section 6: Working to ensure compliance:

Some key takeaways from this section of the module include:


• AWS security compliance programs provide information about the policies, processes, and
controls that are established and operated by AWS.
• AWS Config is used to assess, audit, and evaluate the configurations of AWS resources.
• AWS Artifact provides access to security and compliance reports.

Module-4: Summary:

In summary, in this module we learned how to:


• Recognize the shared responsibility model
• Identify the responsibility of the customer and AWS
• Recognize IAM users, groups, and roles
• Describe different types of security credentials in IAM
• Identify the steps to securing a new AWS account
• Explore IAM users and groups
• Recognize how to secure AWS data
• Recognize AWS compliance programs

Module-4: List of Labs:

Lab1 – Introduction to AWS IAM: This is a guided lab in which we followed the provided
instructions and were introduced to AWS IAM.

MODULE – 5
NETWORKING AND CONTENT DELIVERY

Section 1: Networking basics:


A computer network is two or more client machines that are connected together to share
resources.
Each client machine in a network has a unique Internet Protocol (IP) address that identifies
it. An IP address is a numerical label in decimal format. Machines convert that decimal
number to a binary format.
A 32-bit IP address is called an IPv4 address. IPv6 addresses, which are 128 bits, are also
available. IPv6 addresses can accommodate more user devices.
A common method to describe networks is Classless Inter-Domain Routing (CIDR). The
CIDR address is expressed as follows:
• An IP address (which is the first address of the network)
• Next, a slash character (/)
• Finally, a number that tells you how many bits of the routing prefix must be fixed or
allocated for the network identifier
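
As a small worked example of CIDR notation, Python's standard ipaddress module (no AWS dependency) shows how the prefix length determines the size of the network block; the address ranges below are illustrative.

# Worked CIDR example using the Python standard library.
import ipaddress

network = ipaddress.ip_network("10.0.0.0/16")   # 16 bits fixed for the network prefix
print(network.num_addresses)                    # 65536 addresses in the block
print(network.network_address)                  # 10.0.0.0, the first address

subnet = ipaddress.ip_network("10.0.1.0/24")    # a /24 carved out of the /16
print(subnet.num_addresses)                     # 256 addresses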
The Open Systems Interconnection (OSI) model is a conceptual model that is used to
explain how data travels over a network.

Fig 5.1: OSI Layer

Section 2: Amazon VPC:

Some key takeaways from this section of the module include:


• A VPC is a logically isolated section of the AWS Cloud.
• A VPC belongs to one Region and requires a CIDR block.
• A VPC is subdivided into subnets.
• A subnet belongs to one Availability Zone and requires a CIDR block.
• Route tables control traffic for a subnet.
• Route tables have a built-in local route.
• You add additional routes to the table.
• The local route cannot be deleted.

Section 3: VPC networking:

Some key takeaways from this section of the module include:


There are several VPC networking options, which include:
• Internet gateway: Connects your VPC to the internet
• NAT gateway: Enables instances in a private subnet to connect to the internet
• VPC endpoint: Connects your VPC to supported AWS services
• VPC peering: Connects your VPC to other VPCs
• VPC sharing: Allows multiple AWS accounts to create their application resources in
shared, centrally managed Amazon VPCs
• AWS Site-to-Site VPN: Connects your VPC to remote networks
• AWS Direct Connect: Connects your VPC to a remote network by using a dedicated
network connection
• AWS Transit Gateway: A hub-and-spoke connection alternative to VPC peering
You can use the VPC Wizard to implement your design.

Fig 5.2: Virtual Private Cloud
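
A minimal, hedged sketch of building a few of these components with boto3; the CIDR blocks are placeholders, and a real design would also add route tables, subnets in multiple Availability Zones, and security groups.

# Sketch: create a VPC, one subnet, and an internet gateway (boto3).
# CIDR blocks are placeholders; tagging and error handling are omitted for brevity.
import boto3

ec2 = boto3.client("ec2")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)  # VPC can now reach the internet

print(vpc_id, subnet_id, igw_id)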

Section 4: VPC security:

The key takeaways from this section of the module are:


• Build security into your VPC architecture.
• Security groups and network ACLs are firewall options that you can use to secure your

VPC.

Section 5: Amazon Route 53:

Some key takeaways from this section of the module include:


• Amazon Route 53 is a highly available and scalable cloud DNS web service that translates
domain names into numeric IP addresses.
• Amazon Route 53 supports several types of routing policies.
• Multi-Region deployment improves your application’s performance for a global audience.
• You can use Amazon Route 53 failover to improve the availability of your applications.

Section 6: Amazon CloudFront:

Some key takeaways from this section of the module include:


A CDN is a globally distributed system of caching servers that accelerates delivery of
content.
Amazon CloudFront is a fast CDN service that securely delivers data, videos, applications,
and APIs over a global infrastructure with low latency and high transfer speeds.
Amazon CloudFront offers many benefits, including:
• Fast and global
• Security at the edge
• Highly programmable
• Deeply integrated with AWS
• Cost-effective

Module-5: Summary:

In summary, in this module we learned how to:


• Recognize the basics of networking
• Describe virtual networking in the cloud with Amazon VPC
• Label a network diagram
• Design a basic VPC architecture
• Indicate the steps to build a VPC
• Identify security groups
• Create your own VPC and add additional components to it to produce a customized
network
• Identify the fundamentals of Amazon Route 53
• Recognize the benefits of Amazon CloudFront

Module-5: List of Labs:

Lab2 – Build your VPC and Launch a Web Server: In this lab, we have:
• Created an Amazon VPC.
• Created additional subnets.
• Created an Amazon VPC security group.
• Launched a web server instance on Amazon EC2.

MODULE – 6
COMPUTE

Section 1: Compute services overview:

Amazon Web Services (AWS) offers many compute services, such as Amazon EC2, Amazon
Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service (Amazon
ECS), AWS Elastic Beanstalk, AWS Lambda, Amazon Elastic Kubernetes Service (Amazon
EKS), and AWS Fargate.
Selecting the wrong compute solution for an architecture can lead to lower performance
efficiency.
• A good starting place—Understand the available compute options

Section 2: Amazon EC2:

Some key takeaways from this section of the module include:


• Amazon EC2 enables you to run Windows and Linux virtual machines in the cloud.
• You launch EC2 instances from an AMI template into a VPC in your account.
• You can choose from many instance types. Each instance type offers different
combinations of CPU, RAM, storage, and networking capabilities.
• You can configure security groups to control access to instances (specify allowed ports
and source).
• User data enables you to specify a script to run the first time that an instance launches.
• Only instances that are backed by Amazon EBS can be stopped.
• You can use Amazon CloudWatch to capture and review metrics on EC2 instances.
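
A hedged sketch of launching an instance with boto3, tying several of these points together; the AMI ID and security group ID are placeholders, and the user data script installs a web server the first time the instance boots.

# Sketch: launch one EC2 instance with a user data script (boto3).
# The AMI ID and security group ID below are placeholders, not real resources.
import boto3

user_data = """#!/bin/bash
yum update -y
yum install -y httpd
systemctl enable --now httpd
"""

ec2 = boto3.client("ec2")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
    UserData=user_data,                          # runs on first boot
)
print(response["Instances"][0]["InstanceId"])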

Section 3: Amazon EC2 cost optimization:

Some key takeaways from this section of the module are:


Amazon EC2 pricing models include On-Demand Instances, Reserved Instances, Spot
Instances, Dedicated Instances, and Dedicated Hosts. Per second billing is available for On-
Demand Instances, Reserved Instances, and Spot Instances that use only Amazon Linux and
Ubuntu.
Spot Instances can be interrupted with a 2-minute notification. However, they can offer
significant cost savings over On-Demand Instances.
The four pillars of cost optimization are–
• Right size
• Increase elasticity
• Optimal pricing model
• Optimize storage choices

Section 4: Container services:

Some key takeaways from this section include:


• Containers can hold everything that an application needs to run.
• Docker is a software platform that packages software into containers.
• A single application can span multiple containers.
• Amazon Elastic Container Service (Amazon ECS) orchestrates the running of Docker
containers.
• Kubernetes is open source software for container orchestration.
• Amazon Elastic Kubernetes Service (Amazon EKS) enables you to run Kubernetes on
AWS
• Amazon Elastic Container Registry (Amazon ECR) enables you to store, manage, and
deploy your Docker containers.

Section 5: Introduction to AWS Lambda:

Some key takeaways from this section of the module include:


• Serverless computing enables you to build and run applications and services without
provisioning or managing servers.
• AWS Lambda is a serverless compute service that provides built-in fault tolerance and
automatic scaling.
• An event source is an AWS service or developer-created application that triggers a
Lambda function to run.
• The maximum memory allocation for a single Lambda function is 10,240 MB.
• The maximum run time for a Lambda function is 15 minutes.

Fig 6.1: AWS Lambda
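
A minimal sketch of what a Python Lambda function handler looks like; the event shape depends on the event source, so the "name" field below is an assumption for illustration only.

# Sketch: the basic shape of a Python Lambda handler.
import json

def lambda_handler(event, context):
    # event: data from the event source; context: runtime information
    name = event.get("name", "world")   # "name" is an illustrative field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }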

Section 6: Introduction to AWS Elastic Beanstalk:

Some key takeaways from this section of the module include:


• AWS Elastic Beanstalk enhances developer productivity.
• Simplifies the process of deploying your application.
• Reduces management complexity.
• Elastic Beanstalk supports Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker.
• There is no charge for Elastic Beanstalk. Pay only for the AWS resources you use.

Module-6: Summary:

In summary, in this module, we learned how to:


• Provide an overview of different AWS compute services in the cloud
• Demonstrate why to use Amazon Elastic Compute Cloud (Amazon EC2)
• Identify the functionality in the Amazon EC2 console
• Perform basic functions in Amazon EC2 to build a virtual computing environment
• Identify Amazon EC2 cost optimization elements
• Demonstrate when to use AWS Elastic Beanstalk
• Demonstrate when to use AWS Lambda
• Identify how to run containerized applications in a cluster of managed servers

Module-6: List of Labs:

• Lab3 – Introduction to EC2: In this lab, we have:


1. Launched an instance that is configured as a web server
2. Viewed the instance system log
3. Reconfigured a security group
4. Modified the instance type and root volume size

MODULE – 7
STORAGE

Section 1: Amazon Elastic Block Store (Amazon EBS):

• Amazon EBS provides block-level storage volumes for use with Amazon EC2 instances.
Amazon EBS volumes are off-instance storage that persists independently from the life of
an instance. They are analogous to virtual disks in the cloud. Amazon EBS provides three
volume types: General Purpose SSD, Provisioned IOPS SSD, and magnetic.
• The three volume types differ in performance characteristics and cost, so you can choose
the right storage performance and price for the needs of your applications.
• Additional benefits include replication in the same Availability Zone, easy and transparent
encryption, elastic volumes, and backup by using snapshots.

Section 2: Amazon Simple Storage Service (Amazon S3):

• Amazon S3 is a fully managed cloud storage service.


• You can store a virtually unlimited number of objects.
• You pay for only what you use.
• You can access Amazon S3 at any time from anywhere through a URL.
• Amazon S3 offers rich security controls.
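
A minimal sketch of object storage in practice with boto3; the bucket name and object keys are placeholders.

# Sketch: upload an object to Amazon S3 and read it back (boto3).
# The bucket name and keys are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file("sales.csv", "example-data-bucket", "raw/sales.csv")   # local file -> S3 object

obj = s3.get_object(Bucket="example-data-bucket", Key="raw/sales.csv")
print(obj["Body"].read()[:100])   # first 100 bytes of the object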

Section 3: Amazon Elastic File System (Amazon EFS):

• Amazon EFS provides file storage over a network.


• Perfect for big data and analytics, media processing workflows, content management, web
serving, and home directories.
• Fully managed service that eliminates storage administration tasks.
• Accessible from the console, an API, or the CLI.
• Scales up or down as files are added or removed and you pay for what you use.

Section 4: Amazon S3 Glacier:

• Amazon S3 Glacier is a data archiving service that is designed for security, durability, and
an extremely low cost.
• Amazon S3 Glacier pricing is based on Region.
• Its extremely low-cost design works well for long-term archiving.
• The service is designed to provide 11 9s of durability for objects.

Module-7: Summary:

In summary, in this module, we learned how to:


• Identify the different types of storage
• Explain Amazon S3
• Identify the functionality in Amazon S3
• Explain Amazon EBS
• Identify the functionality in Amazon EBS
• Perform functions in Amazon EBS to build an Amazon EC2 storage solution
• Explain Amazon EFS
• Identify the functionality in Amazon EFS
• Explain Amazon S3 Glacier
• Identify the functionality in Amazon S3 Glacier
• Differentiate between Amazon EBS, Amazon S3, Amazon EFS, and Amazon S3 Glacier

Module-7: List of Labs:

• Lab4 – Working with EBS: In this lab, we have:


• Created an Amazon EBS volume
• Attached the volume to an instance
• Configured the instance to use the virtual disk
• Created an Amazon EBS snapshot
• Restored the snapshot

MODULE – 8
DATABASES

Section 1: Amazon Relational Database Service:

• With Amazon RDS, you can set up, operate, and scale relational databases in the cloud.
• Features –
• Managed service
• Accessible via the console, AWS Command Line Interface (AWS CLI), or application
programming interface (API) calls
• Scalable (compute and storage)
• Automated redundancy and backup are available
• Supported database engines:
• Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, Microsoft SQL Server

Section 2: Amazon DynamoDB:

Amazon DynamoDB:
• Runs exclusively on SSDs.
• Supports document and key-value store models.
• Replicates your tables automatically across your choice of AWS Regions.
• Works well for mobile, web, gaming, ad tech, and Internet of Things (IoT) applications.
• Is accessible via the console, the AWS CLI, and API calls.
• Provides consistent, single-digit millisecond latency at any scale.
• Has no limits on table size or throughput.
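
A hedged sketch of the key-value model with boto3; the table name "Games" and its partition key "player_id" are hypothetical examples, not part of the course material.

# Sketch: write and read a single item in a DynamoDB table (boto3 resource API).
# The table "Games" with partition key "player_id" is a hypothetical example.
import boto3

table = boto3.resource("dynamodb").Table("Games")

table.put_item(Item={"player_id": "p-100", "level": 7, "score": 4200})

item = table.get_item(Key={"player_id": "p-100"}).get("Item")
print(item)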

Section 3: Amazon Redshift:

• In summary, Amazon Redshift is a fast, fully managed data warehouse service. As a


business grows, you can easily scale with no downtime by adding more nodes. Amazon
Redshift automatically adds the nodes to your cluster and redistributes the data for
maximum performance.
• Amazon Redshift is designed to consistently deliver high performance. Amazon Redshift
uses columnar storage and a massively parallel processing architecture. These features
parallelize and distribute data and queries across multiple nodes. Amazon Redshift also
automatically monitors your cluster and backs up your data so that you can easily restore if
needed. Encryption is built in—you only need to enable it.

Section 4: Amazon Aurora:

• In summary, Amazon Aurora is a highly available, performant, and cost-effective managed
relational database.
• Aurora offers a distributed, high-performance storage subsystem. Using Amazon Aurora
can reduce your database costs while improving the reliability of the database.
• Aurora is also designed to be highly available. It has fault-tolerant and self-healing storage
built for the cloud. Aurora replicates multiple copies of your data across multiple Availability
Zones, and it continuously backs up your data to Amazon S3.
• Multiple levels of security are available, including network isolation by using Amazon
VPC;
encryption at rest by using keys that you create and control through AWS Key Management
Service (AWS KMS); and encryption of data in transit by using Secure Sockets Layer (SSL).
• The Amazon Aurora database engine is compatible with existing MySQL and PostgreSQL
open source databases, and adds compatibility for new releases regularly.
• Finally, Amazon Aurora is fully managed by Amazon RDS. Aurora automates database
management tasks such as hardware provisioning, software patching, setup, configuration,
and backups.

Fig 8.1: Amazon Aurora features

Module-8 Summary:

In summary, in this module, we learned how to:


• Explain Amazon Relational Database Service (Amazon RDS)
• Identify the functionality in Amazon RDS
• Explain Amazon DynamoDB
• Identify the functionality in Amazon DynamoDB
• Explain Amazon Redshift

• Explain Amazon Aurora
• Perform tasks in an RDS database, such as launching, configuring, and interacting

Module-8: List of Labs:

• Lab5 – Build a Database Server: In this lab, we have:


• Launched an Amazon RDS DB instance with high availability.
• Configured the DB instance to permit connections from your web server.
• Opened a web application and interacted with your database

MODULE – 9
CLOUD ARCHITECTURE

Section 1: AWS Well-Architected Framework:

Some key takeaways from this section of the module include:


• The AWS Well-Architected Framework provides a consistent approach to evaluate cloud
architectures and guidance to help implement designs.
• The AWS Well-Architected Framework documents a set of design principles and best
practices that enable you to understand if a specific architecture aligns well with cloud best
practices.
• The AWS Well-Architected Framework is organized into six pillars.

Section 2: Reliability and availability:

Some key takeaways from this section of the module include:


• Reliability is a measure of your system’s ability to provide functionality when desired by
the user, and it can be measured in terms of MTBF (mean time between failures).
• Availability is the percentage of time that a system is operating normally or correctly
performing the operations expected of it (or normal operation time over total time).
• Three factors that influence the availability of your applications are fault tolerance,
scalability, and recoverability.
• You can design your workloads and applications to be highly available, but there is a cost
trade-off to consider.
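
A short worked example of the availability definition above; the figures are made up for illustration, and MTTR (mean time to repair) is not covered in the module but is the usual companion to MTBF.

# Worked example: availability as normal operation time over total time.
# The MTBF and MTTR values are illustrative only.
mtbf_hours = 720.0   # mean time between failures
mttr_hours = 0.5     # mean time to repair

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"{availability:.5f}")                     # ~0.99931, i.e. about 99.93%

downtime_per_year = (1 - availability) * 365 * 24
print(f"{downtime_per_year:.1f} hours of downtime per year")   # ~6.1 hours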

Section 3: AWS Trusted Advisor:

Some key takeaways from this section of the module include:


• AWS Trusted Advisor is an online tool that provides real-time guidance to help you
provision your resources by following AWS best practices.
• AWS Trusted Advisor looks at your entire AWS environment and gives you real-time
recommendations in five categories.

Module-9: Summary:

In summary, in this module we learned how to:


• Describe the AWS Well-Architected Framework, including the six pillars
• Identify the design principles of the AWS Well-Architected Framework
• Explain the importance of reliability and high availability
• Identify how AWS Trusted Advisor helps customers
• Interpret AWS Trusted Advisor recommendations
MODULE – 10
AUTO SCALING AND MONITORING

Section 1: Elastic Load Balancing:

Some key takeaways from this section of the module include:


• Elastic Load Balancing distributes incoming application or network traffic across multiple
targets (such as Amazon EC2 instances, containers, IP addresses, and Lambda functions) in
one or more Availability Zones.
• Elastic Load Balancing supports three types of load balancers:
• Application Load Balancer
• Network Load Balancer
• Classic Load Balancer
• Elastic Load Balancing offers several monitoring tools for continuous monitoring and
logging for auditing and analytics.

Section 2: Amazon CloudWatch:

Some key takeaways from this section of the module include:


• Amazon CloudWatch helps you monitor your AWS resources—and the applications that
you run on AWS—in real time.
• CloudWatch enables you to –
• Collect and track standard and custom metrics.
• Set alarms to automatically send notifications to SNS topics or perform Amazon EC2
Auto Scaling or Amazon EC2 actions based on the value of the metric or expression relative
to a threshold over a number of time periods.
• Define rules that match changes in your AWS environment and route these events to
targets for processing.
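
A hedged sketch of creating such an alarm with boto3; the Auto Scaling group name and SNS topic ARN are placeholders.

# Sketch: create a CloudWatch alarm on average EC2 CPU utilization (boto3).
# The Auto Scaling group name and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="HighCpuAlarm",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "example-asg"}],
    Statistic="Average",
    Period=300,                  # 5-minute periods
    EvaluationPeriods=2,         # breach for 10 minutes before alarming
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:example-topic"],  # placeholder ARN
)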

Section 3: Amazon EC2 Auto Scaling:


Some key takeaways from this section of the module include:
• Scaling enables you to respond quickly to changes in resource needs.
• Amazon EC2 Auto Scaling helps you maintain application availability, and enables you to
automatically add or remove EC2 instances according to your workloads.
• An Auto Scaling group is a collection of EC2 instances.
• A launch configuration is an instance configuration template.
• You can implement dynamic scaling with Amazon EC2 Auto Scaling, Amazon
CloudWatch, and Elastic Load Balancing.
AWS Auto Scaling is a separate service that monitors your applications, and it
automatically adjusts capacity for the following resources:
• Amazon EC2 instances and Spot Fleets
• Amazon ECS tasks
• Amazon DynamoDB tables and indexes
• Amazon Aurora Replicas

Module-10: Summary:

In summary, in this module we learned how to:


• Indicate how to distribute traffic across Amazon Elastic Compute Cloud (Amazon EC2)
instances using Elastic Load Balancing.
• Identify how Amazon CloudWatch enables you to monitor AWS resources and
applications in real time
• Explain how Amazon EC2 Auto Scaling launches and releases servers in response to
workload changes.
• Perform scaling and load balancing tasks to improve an architecture.

Module-10: List of Labs:

• Lab6 – Scale and Load Balance your Architecture: In this lab, we have:
• Created an Amazon Machine Image (AMI) from a running instance.
• Created a load balancer.
• Created a launch configuration and an Auto Scaling group.
• Automatically scaled new instances within a private subnet.
• Created Amazon CloudWatch alarms and monitored performance of your infrastructure.

AWS DATA ENGINEERING
MODULE-1
WELCOME TO AWS DATA ENGINEERING

Section-1: Course Objectives:

• Summarize the role and value of data science in a data-driven organization.


• Recognize how the elements of data influence decisions about the
infrastructure of a data pipeline.
• Illustrate a data pipeline by using AWS services to meet a generalized use case.
• Identify the risks and approaches to secure and govern data at each step and
each transition of the data pipeline.
• Identify scaling considerations and best practices for building pipelines that
handle large-scale datasets.
• Design and build a data collection process while considering constraints such
as scalability, cost, fault tolerance, and latency.
• Select a data storage option that matches the requirements and constraints of
a given data analytics use case.
• Implement the steps to process structured, semi-structured, and
unstructured data formats in a data pipeline that is built with AWS.
• Explain the concept of MapReduce and how Amazon EMR is used in big data
pipelines.
• Differentiate the characteristics of an ML pipeline and its specific processing
steps.
• Analyse data by using AWS tools that are appropriate to a given use case.
• Implement a data visualization solution that is aligned to an audience and data
type.

MODULE 2
DATA-DRIVEN ORGANIZATIONS

Section-1: Data-Driven Decisions:

• Data-driven organizations use data science to inform decisions. This includes data
analytics and AI, along with its subfield of ML.
• The main distinction between data analytics and AI/ML is the way that the analysis is
performed. Data analytics relies on programming logic, while AI/ML applications can learn
from examples in data to make predictions. This makes AI/ML good for unstructured data
where the variables are complex.
• The rapidly increasing availability of data, coupled with decreasing costs of supporting
technology, increases the opportunity for organizations to make data-driven decisions.

Section 2: The Data Pipeline Infrastructure:

• The data pipeline provides the infrastructure for data-driven decisions and includes layers
to ingest, store, process, and analyse and visualize data.
• Data wrangling refers to the work to transform data to prepare it for analysis. Processing is
typically iterative. The business problem should drive the pipeline design.

Module-2 Summary:

This module introduced data-driven decision-making and how data analytics and AI/ML
can support decision-making. We learned that both data analytics and AI/ML can help with
predictions, but AI/ML does it by learning about the data rather than by using programmed
rules. We learned about the four main layers in a data pipeline: ingestion, storage,
processing, and analysis and visualization. We were also introduced to the types of actions
that are taken on data as it passes through the pipeline, for example, cleaning and
transformation. We came across the types of concerns that a data engineer or data scientist
might have when building out and using a data pipeline. And finally, we learned about the
modernize, unify, and innovate approach for building modern data infrastructures.

Module-2 Labs:

1. Accessing and Analysing Data by Using Amazon S3


In this lab, we have learnt how to:
• Create an AWS CloudFormation template and deploy it as a stack to create an S3
bucket.
• Load data into an S3 bucket.
• Use S3 Select to query a .csv file that is stored in an S3 bucket (see the sketch after this list).

• Modify the encryption properties and storage type for an S3 object.
• Upload a compressed version of a dataset to Amazon S3.
• Review an AWS Identity and Access Management (IAM) policy that is used by team
members to access the data.
• Test a team member's restricted access to the data by using the AWS Command Line
Interface (AWS CLI).
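
A hedged sketch of the S3 Select step from this lab, using boto3; the bucket, key, and "state" column are placeholder names rather than the actual lab dataset.

# Sketch: run an S3 Select SQL query against a CSV object (boto3).
# The bucket, key, and "state" column are placeholders for illustration.
import boto3

s3 = boto3.client("s3")
response = s3.select_object_content(
    Bucket="example-data-bucket",
    Key="lab/data.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.\"state\" = 'CA' LIMIT 10",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in response["Payload"]:          # the result arrives as an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))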

MODULE 3
THE ELEMENTS OF DATA

Section 1: Volume and Velocity:

• Volume is about how much data, and velocity is about the pace of data. Together,
they drive the scaling requirements for your pipeline.
• When we build a pipeline, we should consider volume and velocity at each layer. For
example, some data that arrives at high velocity doesn't need to be processed
immediately and can be stored and processed at a slower pace.
• Our choices should balance costs for throughput and storage against the required
time to answer and accuracy of the answer.

Section 2: Variety-data types:

• The three general types of data are structured, semi-structured, and unstructured.
• Structured data, such as a relational database, is easy to query and process but not
very flexible.
• Unstructured data, on the other hand, is very flexible but more difficult to work
with. Most of the data growth in recent years is of the unstructured type.

Section 3: Variety-data sources:

• Common data sources include an organization's own databases; public datasets;
and time-series data, such as events, IoT data, and sensor data.
• Combining these datasets in your pipeline can enrich your analysis and the decisions that
you can support. But complexity also increases because you must
manage the differences in structure and content between each data source.

Section-4: Veracity and Value:

• Veracity is about trusting your data, and without veracity you can't expect to get good
value from your data.
• For each data source, we need to discover its veracity, clean and transform what we can to
improve it, and then prevent unwanted changes after we have cleaned it.

Section-5: Activities to Improve Veracity:

• A data engineer or data scientist needs to question the data source before it ever enters the
pipeline.
• After we decide to use a data source, we need to determine how to clean it and have a
common definition of what clean means for that data source. Ensure that any transformation
does what we have intended.
• Save raw data in an immutable form so that we have details instead of only aggregated
values. This supports future insights and makes it easier to find errors.
• To protect cleaned data, the organization should implement processes and governance
strategies to manage the data in the systems.

Module-3 Summary:

The module introduced us to a basic vocabulary to think about the data sources that will feed
your pipeline. We learned about volume, velocity, variety, veracity, and value and how each
of them impacts your pipeline design. We also learned the importance of asking questions
about data sources before they enter your pipeline and protecting them after they are part of
your system.

MODULE 4
DESIGN PRINCIPLES & PATTERNS FOR DATA
PIPELINES

Section-1: AWS Well-Architected Framework and Lenses:

• The Well-Architected Framework provides best practices and design guidance, and its
lenses extend that guidance to focus on specific domains.
• We can use the Data Analytics Lens to guide the design of data pipelines to suit the
characteristics of the data that we need to process.

Section-2: The Evolution of Data Architectures:

• Data stores and architectures evolved to keep up with continually increasing volume,
variety, and velocity of the data being generated.
• Modern data architectures will continue to use different types of data stores to provide the
best fit for each use case, but they need to find a way to unify disparate sources to maintain
a single source of truth.

Section-3: Modern Data Architecture on AWS:

• A data lake provides centralized storage and is integrated with purpose-built data stores
and processing tools.
• Data moves into the data lake (outside in), from the data lake to other stores (inside
out), and might also move directly between purpose-built stores (around the perimeter).
• Amazon S3 provides the data lake, and Lake Formation and AWS Glue help to provide
seamless access.

Section-4: Modern Data Architecture Pipeline – Ingestion and Storage:

• The AWS modern data architecture uses purpose-built tools to ingest data based on
characteristics of the data.
• The storage layer includes two parts: a storage layer that uses Amazon Redshift as its data
warehouse and Amazon S3 for its data lake, and a catalog layer that uses AWS Glue and
Lake Formation.
• The catalog maintains metadata and also provides schemas-on-read, which Redshift
Spectrum uses to read data directly from Amazon S3.
• Normally, data within the lake is segmented into zones to represent different states of
processing. Data arrives in a landing zone and might move up to curated as it’s processed
for different types of consumption. We create zones by using prefixes or bucket names for
each zone in Amazon S3.
Section-5: Modern Data Architecture Pipeline – Processing and Consumption:

• The processing layer is where data is transformed for some type of consumption. The
modern data architecture supports three general types of processing: SQL-based ELT, big
data processing, and near real-time ETL.
• The consumption layer includes components that access the data and metadata in the
storage layer (including the data that is transformed by the processing layer). The
consumption layer supports three analysis methods: interactive SQL queries, BI dashboards,
and ML.

Section-6: Streaming Analytics Pipeline:

• Streaming analytics includes producers, which put data on the stream, and consumers,
which get data off the stream.
• A stream provides temporary storage to process incoming data in real time for delivery to
real-time applications, such as real-time dashboards.
• The results of streaming analytics might also be saved to more durable storage for
additional processing downstream.

Module-4 Summary:
This module introduced the design principles and patterns that we will use to build data pipelines.
We learned how the evolution of data stores and data architectures informed the design principles
of the modern data architecture. We used the Well-Architected Framework to find design principles
and recommendations related to building analytics pipelines. In addition to the modern
data architecture, we were introduced to key characteristics of streaming data pipelines.

Module-4 Labs:

1. Querying Data by Using Athena


In this lab, we have learnt how to:
• Use the Athena query editor to create an AWS Glue database and table.
• Define the schema for the AWS Glue database and associated tables by using the
Athena bulk add columns feature.
• Configure Athena to use a dataset that is located in Amazon S3.
• Optimize Athena queries against a sample dataset.
• Create views in Athena to simplify data analysis for other users.
• Create Athena named queries by using AWS CloudFormation.
• Review an AWS Identity and Access Management (IAM) policy that can be assigned to
users who intend to use Athena named queries.
• Confirm that a user can access an Athena named query by using the AWS Command
Line Interface (AWS CLI) in the AWS Cloud9 terminal.
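As a rough illustration of what the lab steps above do through the console, the following sketch submits a SQL query to Athena with boto3 and polls until it completes; the database, table, and results location are hypothetical.

import time
import boto3

athena = boto3.client("athena")

def run_athena_query(sql: str) -> list:
    # Submit the query; results are written to the given S3 location.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "example_glue_db"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

rows = run_athena_query("SELECT * FROM example_table LIMIT 10")  # hypothetical table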

Page | 45
MODULE 5
SECURING AND SCALING THE DATA PIPELINE

Section-1: Cloud security review:

•Access management consists of authentication and authorization; adhere to the principle of
least privilege with both.
•IAM integrates with most AWS services and helps us to securely share and control
individual and group access to our AWS resources.
•A key aspect of a data security plan is to secure data at rest and data in transit.
•Logging and monitoring can assist your organization to maintain compliance with local
laws and regulations.

Section-2: Security of analytics workloads:

•Honor the data classifications and protection policies that the owners of the source data
assigned.
•Secure access to the data in the analytics workload.
•Share data downstream in compliance with the source system's classification policies.
•Ensure the environment is accessible with the least permissions necessary; automate
auditing of environment changes, and alert in case of abnormal environment access.

Section-3: ML security:

•Apply the principle of least privilege throughout the ML lifecycle.
•Encrypt data in transit to and at rest in the compute and storage infrastructure.
•To reduce data exposure risks, store only the data that the business needs.
•To detect malicious inputs that might result in incorrect predictions, add protection inside
and outside of the deployed code.
•Ensure that data access logging is enabled, and audit for anomalous data access events.

Section-4: Scaling-An overview:

•Horizontal scaling adds additional instances; vertical scaling adds additional resources to
an instance.
•AWS Auto Scaling automatically scales Amazon EC2 capacity horizontally or vertically
based on scaling plans and predictive scaling.
•Application Auto Scaling automatically scales resources for individual AWS services
beyond Amazon EC2.

Page | 46
Section-5: Creating a scalable infrastructure:

•Traditional infrastructure deployments use manual processes that are time-consuming and
prone to error.
•IaC automates the process to create, update, and delete AWS infrastructure.
•IaC can use a declarative or imperative programming approach.
•You can use CloudFormation to automate AWS resource and pipeline provisioning.
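To make the IaC idea concrete, here is a minimal sketch (not from the course labs) that provisions a stack from a local template file with boto3 and waits for creation to finish; the stack name and template file are assumptions.

import boto3

cfn = boto3.client("cloudformation")

with open("pipeline-template.yaml") as f:  # hypothetical template file
    template_body = f.read()

cfn.create_stack(
    StackName="example-data-pipeline",      # hypothetical stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # needed if the template creates IAM resources
)

# Block until the stack finishes creating (raises if creation fails).
cfn.get_waiter("stack_create_complete").wait(StackName="example-data-pipeline")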

Section-6: Creating scalable components:

•Configure scalable components in your pipeline to ensure that it is optimized.
•Use CloudWatch and Lambda to provide automatic scaling for Kinesis data streams.
•Kinesis Data Streams on-demand mode is the recommended method for automatic scaling.
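As a brief illustration, the following sketch creates a stream in on-demand capacity mode with boto3 so that the service manages shard capacity automatically; the stream name is an assumption.

import boto3

kinesis = boto3.client("kinesis")

# On-demand mode scales throughput automatically, so no shard count is specified.
kinesis.create_stream(
    StreamName="example-clickstream",  # hypothetical stream name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Wait for the stream to become active before producers start writing.
kinesis.get_waiter("stream_exists").wait(StreamName="example-clickstream")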

Module-5 Summary:
This module has prepared us to do the following:
•Highlight how cloud security best practices apply to analytics and ML data pipelines.
•List AWS services that play key roles in securing a data pipeline.
•Cite factors that drive performance and scaling decisions across each layer of a data
pipeline.
•Describe how IaC supports the security and scalability of a data pipeline infrastructure.
• Identify the function of common CloudFormation template sections.

Page | 47
MODULE 6
INGESTING AND PREPARING DATA

Section-1: ETL and ELT comparison:

•Ingestion involves pulling data into the pipeline and applying transformations on the data.
•Ingestion by using an ETL flow works well with structured data that is destined for a data
warehouse or other structured store.
•ETL and ELT processing within a pipeline should evolve to optimize its value.

Section-2: Data wrangling introduction:

•Data wrangling describes the multi-step process of transforming large amounts of
unstructured data or sets of structured data from multiple sources for an analytics use case.
•Data wrangling is especially important for data scientists when building ML models.
•Data wrangling steps might overlap, be iterative, or not occur at all in some ingestion
processes.
•Data wrangling steps include discovery, structuring, cleaning, enriching, validating, and
publishing.

Section-3: Data discovery:

• Data discovery is an iterative process that each role involved in the business need should
perform.
• During data discovery, the data engineer should determine the relationships between
sources, how to filter the sources, and how to organize the data in the target storage.
• The discovery phase helps you to identify the tools and resources that you will need to
move forward.

Section-5: Data structuring:

•Data structuring is about mapping raw data from the source file or files into a format that
supports combining it and storing it with other data.
•Data structuring includes organizing storage and access, parsing the source file,
and mapping the source file to the target.
•Data structuring also includes strategies and methods to optimize file size, such as splitting
or compressing files.

Page | 48
Section-6: Data cleaning:

•Data cleaning is about preparing the source data for use in a pipeline.
•Cleaning is usually done for each source based on the characteristics of that source.
•Cleaning includes tasks such as removing unneeded, duplicate, or invalid data, and fixing
missing values and data types.
•How the cleaning tasks resolve issues depends on the role of the person who cleans the
data.
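To make these tasks concrete, here is a small, generic pandas sketch (not tied to any course dataset) that removes duplicates and unneeded fields, discards invalid rows, and fixes missing values and data types; the file and column names are assumptions.

import pandas as pd

# Hypothetical raw extract with typical quality issues.
df = pd.read_csv("orders_raw.csv")

df = df.drop_duplicates()                                   # remove duplicate records
df = df.drop(columns=["internal_notes"], errors="ignore")   # drop an unneeded field
df = df[df["order_id"].notna()]                             # discard rows missing the key
df["quantity"] = df["quantity"].fillna(0).astype(int)       # fix missing values and types
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df[df["order_date"].notna()]                           # drop rows with invalid dates

df.to_csv("orders_clean.csv", index=False)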

Section-7: Data enriching:

•Data enriching is about combining the data sources together and adding value to the data.
•During enriching, we might merge sources, add additional fields, or calculate new values.

Section-8: Data validating:

•Data validating is about ensuring the integrity of the dataset that you have
created.
•Validating tasks might overlap with cleaning tasks.
•This step might be iterative as you check the results of your work, find issues, and address
them.

Section-9: Data publishing:

•Data publishing is about making the dataset available for use.


•In this step, you move the data to the storage layer of the pipeline. The decisions that you
make during the structuring step about organization and file management are used to load
the data into storage.
•At this step, you also need to provide methods to update the dataset and monitor ongoing
ingestion.

Module-6 Summary:
This module prepared us to do the following:
•Distinguish between the processes of ETL and ELT.
•Define data wrangling in the context of ingesting data into a pipeline.
•Describe key tasks in each of these data wrangling steps:
•Discovery
•Structuring
•Cleaning
•Enriching
•Validating and Publishing

Page | 49
MODULE-7
INGESTING BY BATCH OR BY STREAM

Section-1: Comparing batch and stream ingestion:

•Batch jobs query the source, transform the data, and load it into the pipeline.
•Traditional ETL uses batch processing.
•With stream processing, producers put records on a stream where consumers get and
process them.
•Streams are designed to handle high-velocity data and real-time processing.

Section-2: Batch ingestion processing:

•Batch ingestion involves writing scripts and jobs to perform the ETL or ELT
process.
•Workflow orchestration helps you to handle interdependencies between jobs and manage
failures within a set of jobs.
•Key characteristics for pipeline design include ease of use, data volume and
variety, orchestration and monitoring, and scaling and cost management.

Section-3: Purpose-built ingestion tools:

•Choose purpose-built tools that match the type of data to be ingested and simplify the tasks
that are involved in ingestion.
•Amazon AppFlow, AWS DMS, and DataSync each simplify the ingestion of
specific data types.
•AWS Data Exchange provides a simplified way to find and subscribe to third-party
datasets.

Section-4: AWS Glue for batch ingestion processing:

•AWS Glue is a fully managed data integration service that simplifies ETL tasks.
•AWS Glue crawlers derive schemas from data stores and provide them to the centralized
AWS Glue Data Catalog.
•AWS Glue Studio provides visual authoring and job management tools.
• The AWS Glue Spark runtime engine processes jobs in a serverless environment.
• AWS Glue workflows provide ETL orchestration.
• CloudWatch provides integrated monitoring and logging for AWS Glue, including job run
insights.
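As a rough sketch of driving a batch ingestion job from code (outside of AWS Glue workflows), the snippet below starts an existing Glue job with boto3 and polls its run state; the job name and job argument are assumptions.

import time
import boto3

glue = boto3.client("glue")

run_id = glue.start_job_run(
    JobName="example-ingest-job",  # hypothetical Glue job
    Arguments={"--source_prefix": "s3://example-bucket/landing/"},  # hypothetical argument
)["JobRunId"]

# Poll until the job run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="example-ingest-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(f"Glue job finished with state: {state}")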

Page | 50
Section-5: Scaling considerations for batch processing:

•Performance goals should focus on what factors are most important for your batch
processing.
• Scale AWS Glue jobs horizontally by adding more workers.
• Scale AWS Glue jobs vertically by choosing a larger type of worker in the job
configuration.
• Large, splittable files let the AWS Glue Spark runtime engine run many jobs in
parallel with less overhead than processing many smaller files.

Section-6: Kinesis for stream processing:

•The stream is a buffer between the producers and the consumers of the stream.
• The KPL simplifies the work of writing producers for Kinesis Data Streams.
• Data is written to shards on the stream as a
sequence of data records.
• Data records include a sequence number, partition key, and data blob.
• Kinesis Data Firehose can deliver streaming data directly to storage, including Amazon S3
and Amazon Redshift.
• Kinesis Data Analytics is purpose built to perform real- time analytics on data as it passes
through the stream.
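As a minimal illustration of a producer (using the low-level boto3 API rather than the KPL), the following sketch writes JSON records to a stream, with the device ID as the partition key; the stream name and record fields are assumptions.

import json
import boto3

kinesis = boto3.client("kinesis")

def put_sensor_reading(device_id: str, temperature: float) -> None:
    record = {"device_id": device_id, "temperature": temperature}
    kinesis.put_record(
        StreamName="example-sensor-stream",  # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=device_id,  # records with the same key go to the same shard
    )

put_sensor_reading("device-001", 21.7)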

Section-7: Scaling considerations for stream processing:

•Kinesis Data Streams provides scaling options to manage the throughput of data
on the stream.
• We can scale how much data can be written to the stream, how long the data is
stored on the stream, and how much throughput each consumer gets.
• CloudWatch provides metrics that help you monitor how your stream handles the data that
is being written to and read from it.

Section-8: Ingesting IoT data by stream:

•With AWS IoT services, we can use MQTT and a pub/sub model to communicate with IoT
devices.
• We can use AWS IoT Core to securely connect, process, and act upon device data.
• The AWS IoT Core rules engine transforms and routes incoming messages to AWS
services.
• AWS IoT Analytics provides a complete pipeline to ingest and process data and
then make it available for analytics.

Page | 51
Module-7 Summary:

This module prepared us to do the following:

•List key tasks that the data engineer needs to perform when building an ingestion layer.
•Describe how purpose-built AWS services support ingestion tasks.
•Illustrate how the features of AWS Glue work together to support automating batch
ingestion.
•Describe Kinesis streaming services and features that simplify streaming ingestion.
•Identify configuration options in AWS Glue and Kinesis Data Streams that help you scale
your ingestion processing to meet your needs.
•Describe distinct characteristics of ingesting IoT data by using AWS IoT Core and AWS
IoT Analytics services.

Module-7 Labs:

1. Performing ETL on a Dataset by Using AWS Glue


In this lab, we have learnt how to:
• Access AWS Glue in the AWS Management Console and create a crawler.
• Create an AWS Glue database with tables and a schema by using a crawler.
• Query data in the AWS Glue database by using Athena.
• Create and deploy an AWS Glue crawler by using an AWS CloudFormation template.
• Review an AWS Identity and Access Management (IAM) policy for users to run an
AWS Glue crawler and query an AWS Glue database in Athena.
• Confirm that a user with the IAM policy can use the AWS Command Line Interface
(AWS CLI) to access the AWS Glue database that the crawler created.
• Confirm that a user can run the AWS Glue crawler when source data changes.

Page | 52
MODULE-8
STORING AND ORGANIZING DATA

Section-1: Data lake storage:

•Data lakes store data as-is. We don’t need to structure the data before we begin to run
analytics.
•Amazon S3 promotes data integrity through strong data consistency and multipart uploads.
•With Lake Formation, we can use governed tables to enable concurrent data inserts and
edits across tables.

Section-2:Data warehouse storage:

•Data warehouses consist of three tiers and can store structured, curated, or transformed
data.
•Amazon Redshift is a fully managed data warehouse service that uses computing
resources called nodes.
• We can use Redshift Spectrum to write SQL queries that combine data from both our data
lake and our data warehouse.

Section-3:Purpose-built databases:

• Our choice of database will affect what the application can handle, how it will perform,
and the operations that we are responsible for.
•When choosing your database, consider several factors:
•Application workload
•Data shape
•Performance
•Operations burden

Section-4:Storage in support of the pipeline:

•Storage plays an integral part in ELT and ETL pipelines.
•ETL pipelines transform data in buffered memory prior to loading data into a data lake or
data warehouse for storage.
•ELT pipelines extract and load data into a data lake or data warehouse for storage without
prior transformation; the data is transformed after it is loaded.

Page | 53
Section-5: Securing storage:

•Security for data lake storage is built upon the intrinsic security features of Amazon S3.
•Access policies provide a highly customizable way to provide access to resources in the
data lake.
•Data lakes that are built on AWS rely on server-side and client-side encryption.
•Amazon Redshift handles service security and database security as two distinct functions.

Module-8 summary:

This module prepared us to do the following:


•Define storage types that are found in a modern data architecture.
•Distinguish between data storage types.
•Select data storage options that match your storage needs.
•Implement secure storage practices for cloud-based data.

Module-8 Labs:
1. Storing and Analysing Data by Using Amazon Redshift
In this lab, we have learnt how to:
• Review an AWS Identity and Access Management (IAM) role's permissions to access
and configure Amazon Redshift.
• Create and configure a Redshift cluster.
• Create a security group for a Redshift cluster.
• Create a database, schemas, and tables with the Redshift cluster.
• Load data into tables in a Redshift cluster.
• Query data within a Redshift cluster by using the Amazon Redshift console.
• Query data within a Redshift cluster by using the API and AWS Command Line
Interface (AWS CLI) in AWS Cloud9.
• Review an IAM policy with permissions to run queries on a Redshift cluster.
• Confirm that a data science team member can run queries on a Redshift cluster.

Page | 54
MODULE-9
PROCESSING BIG DATA

Section-1: Big data processing concepts:

•Big data processing is generally divided into two categories, batch and streaming.
•Batch data typically involves cold data, with analytics workloads that involve longer
processing times.
•Streaming data involves many data sources providing data that must be processed
sequentially and incrementally.
•Batch processing and stream processing each benefit from specialized big data processing
frameworks.

Section-2: Apache Hadoop:

•Hadoop includes a distributed storage system, HDFS.
•When we store data in HDFS, Hadoop splits the data into smaller data blocks.
•MapReduce processes large datasets with a parallel, distributed algorithm on a cluster.

Section-3: Apache Spark:

•Apache Spark performs processing in-memory, reduces the number of steps in a job, and
reuses data across multiple parallel operations.
•Spark reuses data by using an in-memory cache to speed up ML algorithms.

Section-4: Amazon EMR:

• We can use Amazon EMR to perform automated installations of common big data projects.
•The Amazon EMR service architecture consists of four distinct layers:
•Storage layer
•Cluster resource management layer
•Data processing frameworks layer
•Applications and programs layer

Section-5: Managing your Amazon EMR clusters:

•Three methods are available to launch EMR clusters: interactive, command line, and API.
•EMR clusters are characterized as either long running or transient, based on their usage.
•External connections to EMR clusters can only be made through the main node.
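As an illustration of the API launch method, the following sketch starts a small transient cluster with boto3; the release label, instance types, roles, and log location are assumptions that would need to match the target account.

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="example-transient-cluster",  # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",         # assumed release label
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # assumed default instance profile
    ServiceRole="EMR_DefaultRole",      # assumed default service role
    LogUri="s3://example-emr-logs/",    # hypothetical log location
)

print("Cluster ID:", response["JobFlowId"])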

Page | 55
Section-6: Apache Hudi:

•Apache Hudi provides the ability to ingest and update data in near real time.
•Hudi maintains metadata of the actions that are performed to ensure those actions are both
atomic and consistent.

Module-9 summary:

This module prepared us to do the following:


•Compare and select the big data processing framework that best supports your workloads.
•Explain the principles of Apache Hadoop and Amazon EMR, and how they support data
processing in AWS.
•Launch, configure, and manage an Amazon EMR cluster to support big data processing.

Module-9 Labs:
1. Processing Logs by Using Amazon EMR
In this lab, we have learnt how to:
• Launch an EMR cluster through the AWS Management Console.
• Run Hive interactively.
• Use Hive commands to create tables from log data that is stored in Amazon S3.
• Use Hive commands to join tables and store the joined table in Amazon S3.
• Use Hive to query tables that are stored in Amazon S3.

2. Updating Dynamic Data in Place


In this lab, we have learnt how to:
• Create an AWS Glue job to run custom extract, transform, and load (ETL) scripts.
• Use Athena to run queries.
• Use the Apache Hudi Connector to perform in-place updates.

Page | 56
MODULE 10
PROCESSING DATA FOR ML

Section-1: ML concepts:

•An ML model is a function that is produced by training an algorithm on a dataset and is
used to predict outcomes.
•The three general types of machine learning include supervised, unsupervised and
reinforcement.
•Deep learning is a subcategory of machine learning that uses neural networks to develop
models.
•Generative AI is a subcategory of deep learning that can generate content and is trained on
large amounts of data.
•In machine learning, the label or target is what we are trying to predict, and the features are
attributes that can be used to predict the target.

Section-2: The ML lifecycle:

•The ML lifecycle starts with defining a business problem and framing it as an ML problem.
•The next step is to collect data and prepare it to use in an ML model.
•Model development takes the prepared data and then hands it off to be deployed to
production when the model is ready.
•The final phase in the lifecycle is to monitor the model in production.

Section-3: Framing the ML problem to meet the business goal:

•State the business goal in the form of a problem statement.
•Data scientists work with domain experts to determine an appropriate ML approach.
•The business problem drives whether ML is needed or if a simpler solution can meet the
need.

Section-4: Collecting data:

•Collecting data in the ML lifecycle is similar to an extract, load, and transform (ELT)
ingestion process.
•The data engineer and the data scientist ensure that they have enough of the correct data to
support training and testing the model.
•During the data collection phase, we might need to add labels to the training data.

Page | 57
Section-5: :Applying labels to training data with known targets:

•Labelling is the process of assigning useful labels to targets in the training data.
•Common types of labelling include computer vision, natural language processing, audio
processing, and tabular data.
•By using tools such as SageMaker Ground Truth, we can share labelling jobs with an
expert workforce.

Section-6: Preprocessing data:

•Data preprocessing puts data into the correct shape and quality for training.
•The data scientist performs preprocessing by using a combination of techniques and
expertise.
•Exploring and visualizing the data helps the data scientist get a feel for the data.
•Examples of preprocessing strategies include partitioning, balancing, and data formatting.

Section-7: Feature engineering:

•Feature engineering is about improving the existing features to improve their usefulness in
predicting outcomes.
•Feature creation and transformation focus on adding new information.
•Feature extraction and selection are about reducing dimensionality.

Section-8: Developing a model:

•Model development consists of model building, training, tuning, and evaluation.


•Training and tuning are iterative and continue until the accuracy rate is in line with the
business goal.
•Use unlabelled test data to validate the results from the training dataset.
•Training a model is resource intensive and might require the use of big data frameworks or
HPC systems.

Section-9: Deploying a model:

•The deployment infrastructure is typically quite different from the training infrastructure.
•Inference is the process of making predictions on the production data.
•Automating the ML lifecycle is an important step in operationalizing and scaling the ML
solution.
•ML Ops is a deployment approach that relies on a streamlined development lifecycle to
optimize resources.

Page | 58
Section-10: ML infrastructure on AWS:

•AWS ML infrastructure consists of a compute, network, and storage layer; a framework
layer; and a workflow services layer.
•Services in the compute, network, and storage layer are designed to handle the
compute power and speed that are required to train models and make predictions in real
time.
• The workflow services layer provides tools to simplify creating and managing ML
environments.

Section-11: SageMaker:

•SageMaker is a managed service that provides an integrated workbench for the ML lifecycle.
•Key components to prepare data and build models include SageMaker Studio, Data
Wrangler, Studio notebooks, and Processing.
•SageMaker provides deployment options for real-time or batch inference.
•SageMaker Canvas is a no-code option that business analysts can use to build models and
make predictions.

Section-12: Introduction to Amazon CodeWhisperer:

•Amazon CodeWhisperer is a generative AI tool that can help us develop our applications
faster by using an LLM.
•Prompt engineering is the practice of writing inputs or instructions to interact with a
generative AI model (LLM) that can generate an expected output.
•AWS has several generative AI offerings: Amazon CodeWhisperer, Amazon Bedrock,
AWS Inferentia, and Amazon SageMaker JumpStart.

Section-13: AI/ML services on AWS:

•AWS offers a growing number of purpose-built AI/ML services to handle common use
cases.
•Services include natural language processing, image and video recognition, and time-series
data predictions.
• We can incorporate these services into our data pipelines without the burden of building
custom ML solutions.

Page | 59
Module-10 Summary:

This module prepared us to do the following:


▪ Distinguish between labels, features, and samples in the context of ML.
▪ Describe each phase of the ML lifecycle and the roles involved in each.
▪ Cite questions to ask when framing an ML problem.
▪ Discuss key considerations and tasks during the data collection phase of the ML
lifecycle.
▪ Give examples of preprocessing and feature engineering data transformations.
▪ Compare the activities and resource considerations for model development and
model deployment.
▪ Categorize AWS ML infrastructure services based on how they might be used in the
ML lifecycle.
▪ Describe features of Amazon SageMaker that assist with each phase of the ML
lifecycle.
▪ Cite two AWS services that use ML for common use cases.

Page | 60
MODULE-11
ANALYZING AND VISUALIZING DATA

Section-1: Considering factors that influence tool selection:

•When we select analysis and visualization tools, we should consider the business needs,
data characteristics, and access to data.
• Consider the granularity and format of the insights based on business needs.
• Consider the volume, velocity, variety, veracity, and value of the data.
• Consider the functions of individuals who will access, analyse, and visualize the data.

Section-2: Comparing AWS tools and services:

•AWS tools and services that are commonly used to query and visualize data include
Athena, QuickSight, and OpenSearch
Service.
• Athena is used for interactive analysis with SQL.
•Decision-makers can use QuickSight to interact with data visually and get insight
quickly.
• OpenSearch Service is used for operational analytics to visualize data in near real time.

Section-3: Selecting tools for a gaming analytics use case:

•This use case showcased the granularity of visualized insights:


• Daily batch aggregates of client usage patterns (Athena)
• Consolidated aggregate KPIs for leadership (QuickSight)
• Continuous health and performance monitoring (OpenSearch Service)
• Keep the influencing factors in mind when we select AWS tools and services. Multiple
solutions exist to meet the business needs of data analysis and visualization.

Module-11 Summary:

This module prepared us to do the following:


•List factors to consider when selecting analysis and visualization tools.
•Compare available AWS tools and services for data analysis and visualization.
•Determine the appropriate AWS tools and services to analyse and visualize data based on
influencing factors (business needs, data characteristics, and access to data)

Page | 61
Module-11 Labs:

1. Analysing and Visualizing Streaming Data with Kinesis Data Firehose, OpenSearch
Service, and OpenSearch Dashboards
In this lab, we have learnt how to:

• Describe the lab infrastructure in the AWS Management Console.


• Ingest web server logs into Amazon Kinesis Data Firehose and Amazon
OpenSearch Service.
• Observe the data ingestion and transformation process by using Amazon
CloudWatch log events.
• Build an index for these logs in OpenSearch Service.
• Visualize data by using OpenSearch Dashboards, including:
o Create a pie chart that illustrates the operating systems and
browsers that visitors use to consume the website.
o Create a heat map that illustrates how users are referred to product
pages (either by the search page or the recommendations page).

Page | 62
MODULE 12
AUTOMATING THE PIPELINE

Section-1: Automating infrastructure deployment:

• Automating the environment can help us ensure that the system is stable, consistent, and efficient.
• Repeatability and reusability are two key benefits of infrastructure as code.

Section-2: CI/CD:

•CI/CD spans the develop and deploy stages of the software development lifecycle.
•Continuous delivery improves on continuous integration by helping teams gain a greater
level of certainty that their software will work in production.

Section-3: Automating with Step Functions:

•With Step Functions, we can use visual workflows to coordinate the components of
distributed applications and microservices.
• We define a workflow, which is also referred to as a state machine, as a series of steps and
transitions between each step.
• Step Functions is integrated with Athena to facilitate building workflows that include
Athena queries and data processing operations.
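As a minimal sketch of these ideas (not the lab solution), the following Python snippet defines a one-state workflow that runs an Athena query through the optimized Step Functions integration and then starts an execution; the state machine name, IAM role ARN, database, and output location are assumptions.

import json
import boto3

sfn = boto3.client("stepfunctions")

# A one-state workflow: run an Athena query and wait for it to finish (.sync).
definition = {
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM example_db.example_table",  # hypothetical
                "ResultConfiguration": {"OutputLocation": "s3://example-athena-results/"},
            },
            "End": True,
        }
    },
}

machine_arn = sfn.create_state_machine(
    name="example-etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-sfn-role",  # hypothetical role
)["stateMachineArn"]

sfn.start_execution(stateMachineArn=machine_arn, input=json.dumps({}))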

Module-12 Summary:

This module prepared us to do the following:


•Identify the benefits of automating your pipeline.
•Understand the role of CI/CD and how it applies to your pipeline.
•Examine the states of Step Functions.
•Use Step Functions to build and automate a data pipeline.

Module-12 Labs:

1. Building and Orchestrating ETL Pipelines by Using Athena and Step Functions
In this lab, we have learnt how to:
• Create and test a Step Functions workflow by using Step Functions Studio.
• Create an AWS Glue database and tables.
• Store data on Amazon S3 in Parquet format to use less storage space and to promote
faster data reads.
Page | 63
• Partition data that is stored on Amazon S3 and use Snappy compression to optimize
performance.
• Create an Athena view.
• Add an Athena view to a Step Functions workflow.
• Construct an ETL pipeline by using Step Functions, Amazon S3, Athena, and AWS
Glue.

Page | 64
CASE STUDY
BUILDING AN AIRLINE DATA PIPELINE – AWS SERVICES USED

1. AIM:
To develop an analytics visualization dashboard using AWS QuickSight for monitoring
real-time or scheduled airline flight data. The dashboard provides insights into airline
operations, flight volumes, and airport traffic, enabling data-driven decisions for business
and operations teams.
This project creates an end-to-end pipeline that collects flight data from the Aviation Stack
API, processes and stores it using AWS Lambda and S3, transforms and catalogs it using
AWS Glue, queries it using Athena, and visualizes it via QuickSight.

2. DESCRIPTION:
Dashboard Objectives:
▪ Data Visualization: Interactive visual charts, KPIs, and tables to view trends and stats
▪ Timely Insights: Fetch live flight information with scheduled Lambda invocations
▪ Scalability: Supports massive flight datasets using S3 and serverless Glue/Athena
▪ User-Friendly: QuickSight's no-code UI makes it accessible for all business users

Page | 65
Task 1: Ingest Flight Data Using Lambda

1. Create S3 Bucket:

We created an Amazon S3 bucket named airlines-raw-data to serve as the storage location
for all ingested flight data, as shown in Fig 1.1. The Lambda function writes each file in a structured,
date-based folder format (e.g., year=2025/month=07/day=12/). This partitioning strategy
not only keeps the data organized but also improves performance and reduces costs when
querying through Athena.

Fig 1.1 Setting up the S3 bucket for raw data

2. Set up Lambda Function:

We developed a Python script within an AWS Lambda function (Fig 1.2) to automate the
ingestion of flight data. The script connects to the AviationStack API, retrieves real-time
or scheduled flight records, and parses the JSON response.

Once processed, the data is stored in an Amazon S3 bucket using a structured folder
format organized by date (year/month/day). This ensures the data is efficiently partitioned
for querying and downstream processing.

Fig 1.2 Deploying the Python script for Data ingestion from the Data Source
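A condensed sketch of what such a handler might look like is shown below. It is illustrative rather than a copy of our deployed code: the endpoint format follows the AviationStack documentation, the access key is assumed to be stored in an environment variable, and the bucket name matches the airlines-raw-data bucket described above.

import json
import os
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "airlines-raw-data"

def lambda_handler(event, context):
    # Fetch the latest flight records from the AviationStack API.
    access_key = os.environ["AVIATIONSTACK_KEY"]  # assumed environment variable
    url = f"http://api.aviationstack.com/v1/flights?access_key={access_key}"
    with urllib.request.urlopen(url) as resp:
        payload = json.loads(resp.read())

    # Write the raw JSON to a date-partitioned prefix (year=/month=/day=).
    now = datetime.now(timezone.utc)
    key = (
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"flights_{now:%H%M%S}.json"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode("utf-8"))

    return {"statusCode": 200, "records": len(payload.get("data", []))}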
Page | 66
3. Schedule with EventBridge (if needed):
▪ Rule: rate(30 minutes) to fetch flight data every half hour
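If the schedule were created programmatically rather than through the console, a sketch like the following could define the rule and attach the Lambda function as its target; the function ARN is hypothetical, and EventBridge must also be granted permission to invoke the function.

import boto3

events = boto3.client("events")

events.put_rule(
    Name="fetch-flight-data-every-30-min",
    ScheduleExpression="rate(30 minutes)",
    State="ENABLED",
)

events.put_targets(
    Rule="fetch-flight-data-every-30-min",
    Targets=[{
        "Id": "flight-ingest-lambda",
        "Arn": "arn:aws:lambda:ap-south-1:123456789012:function:flight-ingest",  # hypothetical ARN
    }],
)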

Task 2: Prepare and Catalog Data Using AWS Glue

1. Glue Crawler (Raw Data): We configured a Glue crawler to scan the raw flight data
stored in the S3 bucket, as shown in Fig 2.1. The crawler automatically detects the structure
of the JSON files and infers their schema. Once complete, it creates a metadata table
named flights_raw inside the flights_db database, making the raw flight data ready for
querying through Athena.

Fig 2.1 Setting up the AWS Glue crawler
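The same crawler could also be created programmatically. The following is a minimal sketch with boto3; the crawler name and IAM role ARN are hypothetical, while the bucket and database names match the ones described above.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="flights-raw-crawler",                               # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/example-glue-role",  # hypothetical IAM role
    DatabaseName="flights_db",
    Targets={"S3Targets": [{"Path": "s3://airlines-raw-data/"}]},
)

# Run the crawler; it infers the schema and creates the flights_raw table.
glue.start_crawler(Name="flights-raw-crawler")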

2. Glue ETL Job (Optional - Flattened Data):

We created an AWS Glue ETL job to process and transform the raw flight data. This job
extracts nested fields—such as departure and arrival details—and flattens them into a
structured format.

3. Glue Crawler (Cleaned Data):

After the ETL job outputs the flattened flight data to S3, we set up a second Glue crawler
to scan that cleaned data path. The crawler analyzes the structure of the new files and
creates a metadata table named flights_cleaned in the flights_cleaned_db database (Fig
2.2). This makes the cleaned data ready for SQL queries in Athena and visualization in
QuickSight.

Page | 67
Fig 2.2 Locating the metadata table named flights_cleaned in the Data Catalog

Task 3: Querying Data in Amazon Athena

1. Validate the Data:

Using Amazon Athena, we executed a simple SQL query to preview the cleaned flight
data and confirm that it was correctly processed and stored (Fig 3.1). The query retrieves
the first 10 rows from the flights_cleaned table:

➢ SELECT * FROM flights_cleaned_db.flights_cleaned LIMIT 10;

This step helps validate the schema, data types, and overall data quality before building
visualizations.

Fig 3.1 Validating the Data in Amazon Athena through SQL Queries

Page | 68
Task 4: Build Dashboard in Amazon QuickSight

1. Enable Access:
• S3, Glue, and Athena permissions granted to the QuickSight service role (Fig 4.1)

Fig 4.1 Enabling Access for Data resources

2. Create Dataset:
• Open Amazon QuickSight and go to "Manage data."
• Choose "New dataset" and select Athena as the data source as Fig 4.2.
• From the list of available tables, select the flights_cleaned table, which
was created using the Glue crawler.

Fig 4.2 Establish a connection to the Athena database

Page | 69
3. Designing Dashboard:

Once the dataset was available in QuickSight, we began building the dashboard with a
mix of visual components to uncover key patterns in the flight data, as shown in the
following figures 4.4-4.6:

• KPI Cards to display high-level metrics such as total number of flights, distinct
airlines, and airports involved.
• Line Chart showing flight volume over time, helping track trends and peak periods.
• Bar Chart breaking down the number of flights operated by each airline.
• Table listing detailed flight routes, mapping each departure airport to its
corresponding arrival airport.
• Filters for interactive exploration based on flight_status, flight_date, airline_name,
and dep_airport.

Fig 4.4 Flight Status Distribution and Airport Frequency

Fig 4.5 Airline Companies and Their Arrival Destinations [Sankey diagram]

Page | 70
Fig 4.6 Airline Companies by Size [Tree Map] and Departure Activity [Stacked Bar Chart]

3. AWS Services Used

▪ AWS Lambda: Used to fetch flight data from the AviationStack API and store it in
S3 automatically. It runs serverless and can be scheduled by using Amazon EventBridge for
regular ingestion (e.g., every 30 minutes).
▪ Amazon S3: Acts as the central data lake for storing both raw and cleaned flight data in
JSON format. Data is partitioned by date to support efficient querying and integration
with other AWS services.
▪ AWS Glue: Crawlers automatically detect schema from S3 and create queryable tables
in the Glue Data Catalog. Optional ETL jobs flatten and transform nested data for easier
analytics.
▪ Amazon Athena: Allows you to run SQL queries directly on flight data stored in S3 using
the Glue catalog. It is serverless and ideal for exploring, filtering, and validating data with
no setup.
▪ Amazon QuickSight: Used to create interactive dashboards and charts based on flight
data queried through Athena. It supports real-time insights and can use SPICE for fast in-
memory visualizations.

4. Benefits of AWS Services over Traditional Analyses:

▪ Amazon S3:
• Scalability and Durability: Handles the dataset's storage needs effortlessly, with
high availability and durability, ensuring data integrity.
• Cost Efficiency: Provides a cost-effective solution for storing and managing data,
optimizing operational expenses.
• Integration Capabilities: Seamlessly integrates with other AWS services,
enhancing overall data management efficiency.
▪ Amazon QuickSight:
• Intuitive Interface: Offers a user-friendly interface that simplifies the creation of
complex visualizations and dashboards, promoting user adoption and engagement.
• Real-Time Insights: Enables instant access to real-time insights from the flight data,
supporting agile decision-making processes.
• Scalability and Flexibility: Scales effortlessly with growing data volumes and
user demands, ensuring consistent performance.

5. Conclusion:

This project showcases a real-time airline analytics platform using AWS. By leveraging AWS
Lambda, Amazon S3, AWS Glue, Athena, and QuickSight, we built a fully automated, scalable, and cost-
effective data pipeline. The final dashboard delivers powerful insights into flight patterns, airline
operations, and airport traffic — empowering aviation stakeholders to make data-driven decisions
efficiently.
This case study demonstrates the power of modern AWS data tools to transform external APIs
into insightful visual dashboards with minimal infrastructure management.

Page | 72
CONCLUSION

In today's rapidly evolving technological landscape, cloud computing has emerged as a
fundamental pillar of modern business operations. Among the plethora of cloud service providers,
AWS stands out for its comprehensive suite of services, robust security features, and unparalleled
scalability. AWS offers a competitive edge with its vast global infrastructure, which enables
businesses to deploy applications and services with high availability and low latency. The rise of
cloud computing has transformed how organizations manage and analyse their data, leading to
more informed decision-making and operational efficiency.

AWS provides a rich array of data engineering services designed to handle various aspects of data
processing, storage, and analysis. Amazon QuickSight is an intuitive business intelligence service
that allows users to create and share interactive dashboards. Amazon Athena offers a serverless
query service, enabling users to analyse data directly from Amazon S3 using standard SQL. AWS
Glue is a fully managed ETL service that simplifies the process of preparing and loading data for
analytics. Additionally, services such as Amazon Redshift for data warehousing, Amazon EMR
for big data processing, and AWS Data Pipeline for data workflow automation play crucial roles
in building a comprehensive data engineering ecosystem on AWS.

In conclusion, we have explored the fundamental AWS services essential for data engineering
and cloud computing, equipping us with the tools to generate valuable analyses and insights from
diverse data sources. By using AWS's powerful capabilities, we can make informed and
calculated decisions in this data-driven industry.

Page | 73
REFERENCES

AWS Cloud Foundations Course Link:
https://awsacademy.instructure.com/courses/81207

AWS Data Engineering Course Link:
https://awsacademy.instructure.com/courses/81208

Data Source:
https://aviationstack.com/

AWS Lambda:
https://ap-south-1.console.aws.amazon.com/lambda/home?region=ap-south-1

Amazon Athena:
https://aws.amazon.com/athena/

AWS Glue:
https://aws.amazon.com/glue/

Amazon S3:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html

Amazon QuickSight:
https://ap-south-1.quicksight.aws.amazon.com/sn/start/analyses

Page | 74
