AWS DATA ENGINEERING
An Internship-II Report
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING-DATA SCIENCE
Submitted By
CERTIFICATE
This is to certify that this report on “AWS Data Engineering” is a bona fide record
of Internship-II work submitted by—
ACKNOWLEDGEMENTS
We would like to express our deep sense of gratitude to our esteemed institute Gayatri Vidya
Parishad College of Engineering (Autonomous), which has provided us an opportunity to fulfil
our cherished desire.
We express our sincere thanks to our Principal Dr. A.B. KOTESWARA RAO, Gayatri Vidya
Parishad College of Engineering (Autonomous), for his encouragement to us during this project.
We thank our Course Mentor, Mr. D. Arun Kumar, Assistant Professor, Department of
Computer Science and Engineering, for his kind suggestions and guidance for the successful
completion of our internship.
We thank our Course Coordinator, Dr. CH. SITA KUMARI, Associate Professor, Department
of Computer Science and Engineering, for her valuable guidance and insightful suggestions that
greatly contributed to the successful completion of our internship.
We are highly indebted to Dr. Y. ANURADHA, Professor and Head of the Department of
Computer Science and Engineering-Data Science, Gayatri Vidya Parishad College of
Engineering (Autonomous), for giving us an opportunity to do the internship in college.
We thank our Internship Coordinators, Dr. CH. SITA KUMARI, Associate Professor,
Department of Computer Science and Engineering and Ms. T. TEJESWARI, Assistant
Professor, Department of Computer Science and Engineering for providing us with this
internship opportunity.
We are very thankful to AICTE and Edu-skills for giving us a comprehensive platform that
helped us to solve every issue regarding the internship.
Finally, we are indebted to the teaching and non-teaching staff of the Computer Science and
Engineering Department for all their support in completion of our project.
322103383005:
322103383063:
323103383L02:
323103383L06:
ABSTRACT
Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand cloud
computing platforms and APIs to individuals, companies, and governments, on a metered
pay-as-you-go basis. These cloud computing web services provide distributed computing
processing capacity and software tools via AWS server farms. One of these services is
Amazon Elastic Compute Cloud (EC2), which allows users to have at their disposal a virtual
cluster of computers, available all the time, through the Internet. AWS's virtual computers
emulate most of the attributes of a real computer, including hardware central processing units
(CPUs) and graphics processing units (GPUs) for processing; local/RAM memory; hard-
disk/SSD storage; a choice of operating systems; networking; and pre-loaded application
software such as web servers, databases, and customer relationship management (CRM).
AWS Academy Cloud Foundations is intended for those who seek an overall understanding
of cloud computing concepts, independent of specific technical roles. It provides a detailed
overview of cloud concepts, AWS core services, security, architecture, pricing, and support.
Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go
pricing. Instead of buying, owning, and maintaining physical data centres and servers, you
can access technology services, such as computing power, storage, and databases, on an as-
needed basis from a cloud provider like Amazon Web Services (AWS).
AWS Academy Data Engineering is designed to help students learn about and get hands-on
practice with the tasks, tools, and strategies that are used to collect, store, prepare, analyze,
and visualize data for use in analytics and machine learning (ML) applications. Throughout
the course, students will explore use cases from real-world applications, which will enable
them to make informed decisions while building data pipelines for their particular
applications.
INDEX
About AWS 15
6. Compute 28
6.1 Compute Services Overview
6.2 Amazon EC2
6.3 Amazon EC2 Cost Optimization
6.4 Container Services
6.5 Introduction to AWS Lambda
6.6 Introduction to AWS Elastic Beanstalk
7. Storage 31
7.1 Amazon Elastic Block Store
7.2 Amazon Simple Storage Service
7.3 Amazon Elastic File System
7.4 Amazon S3 Glacier
8. Databases 33
8.1 Amazon Relational Database Service
8.2 Amazon DynamoDB
8.3 Amazon Redshift
8.4 Amazon Aurora
9. Cloud Architecture 36
9.1 AWS Well-Architected Framework
9.2 Reliability and Availability
9.3 AWS Trusted Advisor
Course-2: AWS Data Engineering:
2. Data-Driven Organizations 40
2.1 Data-Driven Decisions
2.2 The Data Pipeline Infrastructure
2.3 Module Summary
6.6 Data Enriching
6.7 Data Validating
6.8 Data Publishing
6.9 Module Summary
10.10 ML Infrastructure on AWS
10.11 SageMaker
10.12 Introduction to Amazon CodeWhisperer
10.13 AI/ML Services on AWS
10.14 Module Summary
Case Study: 65
1. Aim
2. Description
3. AWS services used
4. Benefits of AWS services
5. Conclusion
Conclusion 69
References 70
AWS
Amazon Web Services, Inc. (AWS) is a subsidiary of Amazon that provides on-demand
cloud computing platforms and APIs to individuals, companies, and governments, on a
metered, pay-as-you-go basis. Clients will often use this in combination with
autoscaling (a process that allows a client to use more computing in times of high application
usage, and then scale down to reduce costs when there is less traffic). These cloud computing
web services provide various services related to networking, compute, storage, middleware,
IoT and other processing capacity, as well as software tools via AWS server farms. This
frees clients from managing, scaling, and patching hardware and operating systems. One of
the foundational services is Amazon Elastic Compute Cloud (EC2), which allows users
to have at their disposal a virtual cluster of computers, with extremely high availability, which
can be interacted with over the internet via REST APIs, a CLI or the AWS console. AWS's
virtual computers emulate most of the attributes of a real computer, including hardware
central processing units (CPUs) and graphics processing units (GPUs) for processing;
local/RAM memory; Hard-disk (HDD)/SSD storage; a choice of operating systems;
networking; and pre-loaded application software such as web servers, databases, and
customer relationship management (CRM).
AWS services are delivered to customers via a network of AWS server farms located
throughout the world. Fees are based on a combination of usage (known as a "pay-as-you-go"
model), hardware, operating system, software, and networking features chosen by the
subscriber requiring various degrees of availability, redundancy, security, and service options.
Subscribers can pay for a single virtual AWS computer, a dedicated physical computer, or
clusters of either. Amazon provides select portions of security for subscribers (e.g. physical
security of the data centres) while other aspects of security are the responsibility of the
subscriber (e.g. account management, vulnerability scanning, patching). AWS operates from
many global geographical regions including seven in North America.
CLOUD FOUNDATIONS
MODULE 1
CLOUD CONCEPTS OVERVIEW
The key takeaways from this section of the module include the six advantages of cloud
computing:
• Trade capital expense for variable expense
• Massive economies of scale
• Stop guessing capacity
• Increase speed and agility
• Stop spending money on running and maintaining data centres
• Go global in minutes
Section 3: Introduction to Amazon Web Services (AWS):
Section 4: Moving to the AWS Cloud -The AWS Cloud Adoption Framework (AWS
CAF):
Module-1 Summary:
MODULE 2
CLOUD ECONOMICS AND BILLING
In summary, while the number and types of services offered by AWS have increased
dramatically, our philosophy on pricing has not changed. At the end of each month, you pay
only for what you use, and you can start or stop using a product at any time. No long-term
contracts are required.
The best way to estimate costs is to examine the fundamental characteristics for each AWS
service, estimate your usage for each characteristic, and then map that usage to the prices
that are posted on the AWS website. The service pricing strategy gives you the flexibility to
choose the services that you need for each project and to pay only for what you use.
There are several free AWS services, including:
• Amazon VPC
• Elastic Beanstalk
• AWS CloudFormation
• IAM
• Automatic scaling services
• AWS OpsWorks
• Consolidated Billing
While the services themselves are free, the resources that they provision might not be free.
In most cases, there is no charge for inbound data transfer or for data transfer between other
AWS services within the same AWS Region. There are some exceptions, so be sure to
verify data transfer rates before you begin to use the AWS service. Outbound data transfer
costs are tiered.
It is difficult to compare an on-premises IT delivery model with the AWS Cloud. The two
are different because they use different concepts and terms. Using on-premises IT involves
a discussion that is based on capital expenditure, long planning cycles, and multiple
components to buy, build, manage, and refresh resources over time. Using the AWS Cloud
involves a discussion about flexibility, agility, and consumption-based costs.
Some of the costs that are associated with data centre management include:
• Server costs for both hardware and software, and facilities costs to house the equipment.
• Storage costs for the hardware, administration, and facilities.
• Network costs for hardware, administration, and facilities.
• IT labour costs that are required to administer the entire solution.
Soft benefits include:
• Reusing services and applications that enable you to define (and redefine) solutions by
using the same cloud services
• Increased developer productivity
AWS Billing and Cost Management is the service that you use to pay your AWS bill,
monitor your usage, and budget your costs. Billing and Cost Management enables you to
forecast and obtain a better idea of what your costs and usage might be in the future so that
you can plan ahead.
You can set a custom time period and determine whether you would like to view your data
at a monthly or daily level of granularity.
With the filtering and grouping functionality, you can further analyse your data using a variety
of available dimensions. The AWS Cost and Usage Report Tool enables you to identify
opportunities for optimization by understanding your cost and usage data trends and how you
are using your AWS implementation.
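The same cost and usage data can also be retrieved programmatically. The following is a minimal boto3 sketch (not part of the course labs) that queries the Cost Explorer API; the date range and grouping dimension are arbitrary illustrative values.

```python
import boto3

# Minimal sketch: month-level cost and usage, grouped by service.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},  # custom time period (example dates)
    Granularity="MONTHLY",                                    # or "DAILY" for daily granularity
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],        # analyse costs per AWS service
)

for period in response["ResultsByTime"]:
    print(period["TimePeriod"]["Start"])
    for group in period["Groups"]:
        print(" ", group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```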
Module-2 Summary:
MODULE 3
AWS GLOBAL INFRASTRUCTURE OVERVIEW
Module-3: Summary:
MODULE – 4
AWS CLOUD SECURITY
Section 3: Securing a new AWS account:
The key takeaways from this section of the module are all related to best practices for
securing an AWS account. Those best practice recommendations (a few of which are sketched
in code after this list) include:
• Secure logins with multi-factor authentication (MFA).
• Delete account root user access keys.
• Create individual IAM users and grant permissions according to the principle of least
privilege.
• Use groups to assign permissions to IAM users.
• Configure a strong password policy.
• Delegate using roles instead of sharing credentials.
• Monitor account activity using AWS CloudTrail.
• AWS Organizations enables you to consolidate multiple AWS accounts so that you
centrally manage them.
• AWS Key Management Service (AWS KMS) is a service that enables you to create and
manage encryption keys, and to control the use of encryption across a wide range of AWS
services and your applications.
• Amazon Cognito provides solutions to control access to AWS resources from your
application. You can define roles and map users to different roles so your application can
access only the resources that are authorized for each user.
• AWS Shield is a managed distributed denial of service (DDoS) protection service that
safeguards applications that run on AWS. It provides always-on detection and automatic
inline mitigations that minimize application downtime and latency, so there is no need to
engage AWS Support to benefit from DDoS protection.
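A minimal boto3 sketch of a few of these account-hardening practices follows. The user, group, and policy choices are hypothetical examples, not prescriptions from the course.

```python
import boto3

iam = boto3.client("iam")

# Use groups to assign permissions: create a group and attach a narrowly
# scoped managed policy (example policy only).
iam.create_group(GroupName="data-readers")
iam.attach_group_policy(
    GroupName="data-readers",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Create an individual IAM user and grant permissions through the group
# rather than attaching policies directly to the user.
iam.create_user(UserName="analyst1")
iam.add_user_to_group(GroupName="data-readers", UserName="analyst1")

# Configure a strong account password policy.
iam.update_account_password_policy(
    MinimumPasswordLength=14,
    RequireSymbols=True,
    RequireNumbers=True,
    RequireUppercaseCharacters=True,
    RequireLowercaseCharacters=True,
)
```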
Module-4: Summary:
Lab1 – Introduction to AWS IAM: This is a guided lab in which we follow the provided
instructions to get introduced to AWS IAM.
MODULE – 5
NETWORKING AND CONTENT DELIVERY
Module-5: Summary:
Module-5: List of Labs:
Lab2 – Build your VPC and Launch a Web Server: In this lab, we have done the following
(a rough boto3 equivalent appears after the list):
• Created an Amazon VPC.
• Created additional subnets.
• Created an Amazon VPC security group.
• Launched a web server instance on Amazon EC2.
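The lab itself is performed in the AWS Management Console; the sketch below only approximates the same sequence of steps with boto3. The CIDR blocks, names, and AMI ID are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Create a VPC and a subnet (placeholder CIDR blocks).
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/24")["Subnet"]["SubnetId"]

# Create a security group that allows inbound HTTP to the web server.
sg_id = ec2.create_security_group(
    GroupName="web-server-sg", Description="Allow HTTP", VpcId=vpc_id
)["GroupId"]
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)

# Launch a web server instance into the subnet (the AMI ID is a placeholder).
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId=subnet_id,
    SecurityGroupIds=[sg_id],
)
```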
MODULE – 6
COMPUTE
Amazon Web Services (AWS) offers many compute services, such as Amazon EC2, Amazon
Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service (Amazon
ECS), AWS Elastic Beanstalk, AWS Lambda, Amazon Elastic Kubernetes Service (Amazon
EKS), and AWS Fargate.
Selecting the wrong compute solution for an architecture can lead to lower performance
efficiency.
• A good starting place is to understand the available compute options.
Section 4: Container services:
Section 6: Introduction to AWS Elastic Beanstalk:
Module-6: Summary:
MODULE – 7
STORAGE
• Amazon EBS provides block-level storage volumes for use with Amazon EC2 instances.
Amazon EBS volumes are off-instance storage that persists independently from the life of
an instance. They are analogous to virtual disks in the cloud. Amazon EBS provides three
volume types: General Purpose SSD, Provisioned IOPS SSD, and magnetic.
• The three volume types differ in performance characteristics and cost, so you can choose
the right storage performance and price for the needs of your applications.
• Additional benefits include replication in the same Availability Zone, easy and transparent
encryption, elastic volumes, and backup by using snapshots.
• Amazon S3 Glacier is a data archiving service that is designed for security, durability, and
an extremely low cost.
• Amazon S3 Glacier pricing is based on Region.
• Its extremely low-cost design works well for long-term archiving.
• The service is designed to provide 11 9s of durability for objects.
Module-7 Summary:
MODULE – 8
DATABASES
• With Amazon RDS, you can set up, operate, and scale relational databases in the cloud.
• Features –
• Managed service
• Accessible via the console, AWS Command Line Interface (AWS CLI), or application
programming interface (API) calls
• Scalable (compute and storage)
• Automated redundancy and backup are available
• Supported database engines:
• Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, Microsoft SQL Server
Amazon DynamoDB:
• Runs exclusively on SSDs.
• Supports document and key-value store models.
• Replicates your tables automatically across your choice of AWS Regions.
• Works well for mobile, web, gaming, ad tech, and Internet of Things (IoT) applications.
• Is accessible via the console, the AWS CLI, and API calls (a minimal API example follows this list).
• Provides consistent, single-digit millisecond latency at any scale.
• Has no limits on table size or throughput.
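A minimal sketch of that API access using the boto3 DynamoDB resource; the table and attribute names are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Create a key-value table with on-demand capacity (hypothetical names).
table = dynamodb.create_table(
    TableName="GameScores",
    KeySchema=[{"AttributeName": "player_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "player_id", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Write and read back a single item.
table.put_item(Item={"player_id": "p-42", "high_score": 1337})
print(table.get_item(Key={"player_id": "p-42"})["Item"])
```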
Section 4: Amazon Aurora:
Module-8 Summary:
• Explain Amazon Aurora
• Perform tasks in an RDS database, such as launching, configuring, and interacting with it
MODULE – 9
CLOUD ARCHITECTURE
Module-9: Summary:
Module-10: Summary:
• Lab6 – Scale and Load Balance your Architecture: In this lab, we have:
• Created an Amazon Machine Image (AMI) from a running instance.
• Created a load balancer.
• Created a launch configuration and an Auto Scaling group.
• Automatically scaled new instances within a private subnet.
• Created Amazon CloudWatch alarms and monitored the performance of our infrastructure.
AWS DATA ENGINEERING
MODULE-1
WELCOME TO AWS DATA ENGINEERING
MODULE 2
DATA-DRIVEN ORGANISATIONS
• Data-driven organizations use data science to inform decisions. This includes data
analytics and AI, along with its subfield of ML.
•The main distinction between data analytics and AI/ML is the way that the analysis is
performed. Data analytics relies on programming logic while AI/ML applications can learn
from examples in data to make predictions. This makes AI/ML good for unstructured data
where the variables are complex.
• The rapidly increasing availability of data, coupled with decreasing costs of supporting
technology, increases the opportunity for organizations to make data-driven decisions.
•The data pipeline provides the infrastructure for data-driven decisions and includes layers
to ingest, store, process, and analyse and visualize data.
•Data wrangling refers to the work to transform data to prepare it for analysis. Processing is
typically iterative. The business problem should drive the pipeline design.
Module-2 Summary:
This module introduced data-driven decision-making and how data analytics and AI/ML
can support decision-making. We learned that both data analytics and AI/ML can help with
predictions, but AI/ML does it by learning about the data rather than by using programmed
rules. We learned about the four main layers in a data pipeline: ingestion, storage,
processing, and analysis and visualization. We were also introduced to the types of actions
that are taken on data as it passes through the pipeline; for example, cleaning and
transformation. We came across the types of concerns that a data engineer or data scientist
might have when building out and using a data pipeline. And finally, we learned about the
modernize, unify, and innovate approach for building modern data infrastructures.
Module-2 Labs:
• Modify the encryption properties and storage type for an S3 object.
• Upload a compressed version of a dataset to Amazon S3.
• Review an AWS Identity and Access Management (IAM) policy that is used by team
members to access the data.
• Test a team member's restricted access to the data by using the AWS Command Line
Interface (AWS CLI).
MODULE 3
THE ELEMENTS OF DATA
•Volume is about how much data, and velocity is about the pace of data. Together, they
drive the scaling requirements for your pipeline.
•When building a pipeline, consider volume and velocity at each layer. For example, some
data that arrives at high velocity doesn't need to be processed immediately and can be
stored and processed at a slower pace.
• Our choices should balance costs for throughput and storage against the required
time to answer and accuracy of the answer.
• The three general types of data are structured, semi-structured and unstructured.
•Structured data, such as a relational database, is easy to query and process but not
very flexible.
•Unstructured data on the other hand is very flexible but more difficult to work
with. Most of the data growth in recent years is of the unstructured type.
•Veracity is about trusting your data, and without veracity you can't expect to get good
value from your data.
•For each data source, we need to discover its veracity, clean and transform what we can to
improve it, and then prevent unwanted changes after we have cleaned it.
•A data engineer or data scientist needs to question the data source before it ever enters the
pipeline.
•After we decide to use a data source, we need to determine how to clean it and have a
common definition of what clean means for that source. Ensure that any transformation
does what we intended.
•Save raw data in an immutable form so that we have details instead of only aggregated
values. This supports future insights, and makes it easier to find errors.
•To protect cleaned data, the organization should implement processes and governance
strategies to manage the data in the systems.
Module-3 Summary:
The module introduced us to a basic vocabulary to think about the data sources that will feed
your pipeline. We learned about volume, velocity, variety, veracity, and value and how each
of them impacts your pipeline design. We also learned the importance of asking questions
about data sources before they enter your pipeline and protecting them after they are part of
your system.
MODULE 4
DESIGN PRINCIPLES & PATTERNS FOR DATA
PIPELINES
•The Well-Architected Framework provides best practices and design guidance, and its
lenses extend guidance to focus on specific domains.
•We can use the Data Analytics Lens to guide the design of data pipelines to suit the
characteristics of the data that we need to process.
•Data stores and architectures evolved to keep up with continually increasing volume,
variety, and velocity of the data being generated.
•Modern data architectures will continue to use different types of data stores to provide the
best fit for each use case, but they need to find a way to unify disparate sources to maintain
a single source of truth.
•A data lake provides centralized storage and is integrated with purpose - built data stores
and processing tools.
•Data moves into the data lake (outside in), from the data lake to other stores (inside
out), and might also move directly between purpose-built stores (around the perimeter).
•Amazon S3 provides the data lake, and Lake Formation and AWS Glue help to provide
seamless access.
•The AWS modern data architecture uses purpose-built tools to ingest data based on
characteristics of the data.
•The storage layer includes two layers: the first uses Amazon Redshift as its data
warehouse and Amazon S3 for its data lake; the second is a catalog layer that uses AWS
Glue and Lake Formation.
•The catalog maintains metadata and also provides schemas-on-read, which Redshift
Spectrum uses to read data directly from Amazon S3.
•Normally, data within the lake is segmented into zones to represent different states of
processing. Data arrives in a landing zone and might move up to curated as it’s processed
for different types of consumption. We create zones by using prefixes or bucket names for
each zone in Amazon S3, as sketched below.
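A small sketch of how such zones can be expressed purely as S3 key prefixes; the bucket name, prefixes, and object contents are illustrative only.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # illustrative bucket name

# Land raw data in the landing zone; the zone is just a key prefix.
s3.put_object(
    Bucket=bucket,
    Key="landing/2024/01/15/orders.json",
    Body=b'{"order_id": 1, "amount": 10.5}',
)

# After processing, promote the object to the curated zone under a different
# prefix (a real pipeline would write transformed output here instead).
s3.copy_object(
    Bucket=bucket,
    Key="curated/2024/01/15/orders.json",
    CopySource={"Bucket": bucket, "Key": "landing/2024/01/15/orders.json"},
)

# List everything currently sitting in the landing zone.
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="landing/").get("Contents", []):
    print(obj["Key"])
```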
Section-5: Modern Data Architecture Pipeline – Processing and Consumption:
•The processing layer is where data is transformed for some type of consumption. The
modern data architecture supports three general types of processing: SQL-based ELT, big
data processing, and near real-time ETL.
• The consumption layer includes components that access the data and metadata in the
storage layer (including the data that is transformed by the processing layer). The
consumption layer supports three analysis methods: interactive SQL queries, BI dashboards,
and ML.
•Streaming analytics includes producers who put things on the stream and consumers who
get things off the stream.
• A stream provides temporary storage to process incoming data in real time for delivery to
real-time applications, such as real-time dashboards.
•The results of streaming analytics might also be saved to more durable storage for
additional processing downstream.
Module-4 Summary:
The design principles and patterns that we will use to build data pipelines. We learned about
how the evolution of data stores and data architectures informed design principles of modern
data architecture. We used the Well-Architected Framework to find design principles and
recommendations that are related to building analytics pipelines. In addition to the modern
data architecture, you were introduced to key characteristics of streaming data pipelines.
Module-4 Labs:
MODULE 5
SECURING AND SCALING THE DATA PIPELINE
•Honor the data classifications and protection policies that the owners of the source data
assigned.
•Secure access to the data in the analytics workload.
•Share data downstream in compliance with the source system's classification policies.
•Ensure the environment is accessible with the least permissions necessary; automate
auditing of environment changes, and alert in case of abnormal environment access.
Section-3: ML security:
•Horizontal scaling adds additional instances; vertical scaling adds additional resources to
an instance.
•AWS Auto Scaling automatically scales Amazon EC2 capacity horizontally or vertically
based on scaling plans and predictive scaling.
•Application Auto Scaling automatically scales resources for individual AWS services
beyond Amazon EC2.
Section-5: Creating a scalable infrastructure:
Module-5 Summary:
This module has prepared us to do the following:
•Highlight how cloud security best practices apply to analytics and ML data pipelines.
•List AWS services that play key roles in securing a data pipeline.
•Cite factors that drive performance and scaling decisions across each layer of a data
pipeline.
•Describe how IaC supports the security and scalability of a data pipeline infrastructure.
• Identify the function of common CloudFormation template sections.
MODULE 6
INGESTING AND PREPARING DATA
•Ingestion involves pulling data into the pipeline and applying transformations on the data.
•Ingestion by using an ETL flow works well with structured data that is destined for a data
warehouse or other structured store.
•ETL and ELT processing within a pipeline should evolve to optimize its value.
• Data discovery is an iterative process that each role involved in the business need should
perform.
• During data discovery, the data engineer should determine the relationships between
sources, how to filter the sources, and how to organize the data in the target storage.
• The discovery phase helps you to identify the tools and resources that you will need to
move forward.
•Data structuring is about mapping raw data from the source file or files into a format that
supports combining it and storing it with other data.
•Data structuring includes organizing storage and access, parsing the source file,
and mapping the source file to the target.
•Data structuring also includes strategies and methods to optimize file size, such as splitting
or compressing files.
Section-6: Data cleaning:
•Data cleaning is about preparing the source data for use in a pipeline.
•Cleaning is usually done for each source based on the characteristics of that source.
•Cleaning includes tasks such as removing unneeded, duplicate, or invalid data, and fixing
missing values and data types (these steps are sketched in code after this list).
•How the cleaning tasks resolve issues depends on the role of the person who cleans the
data.
•Data enriching is about combining the data sources together and adding value to the data.
•During enriching, we might merge sources, add additional fields, or calculate new values.
•Data validating is about ensuring the integrity of the dataset that you have
created.
•Validating tasks might overlap with cleaning tasks.
•This step might be iterative as you check the results of your work, find issues, and address
them.
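These cleaning, enriching, and validating steps can be sketched with pandas; the columns, values, and exchange-rate enrichment below are invented purely for illustration.

```python
import pandas as pd

# Cleaning: remove duplicate and invalid rows, fix missing values and data types.
raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": ["10.5", "10.5", None, "7.0"],
    "country": ["IN", "IN", "US", None],
})
clean = (
    raw.drop_duplicates()
       .dropna(subset=["country"])                              # drop rows with no country
       .assign(amount=lambda d: d["amount"].fillna("0").astype(float))
)

# Enriching: merge in another source and calculate a new value.
fx = pd.DataFrame({"country": ["IN", "US"], "fx_rate": [83.0, 1.0]})
enriched = clean.merge(fx, on="country", how="left")
enriched["amount_usd"] = enriched["amount"] / enriched["fx_rate"]

# Validating: check the integrity of the dataset before publishing it.
assert enriched["order_id"].is_unique
assert enriched["amount_usd"].notna().all()
print(enriched)
```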
Module-6 Summary:
This module prepared us to do the following:
•Distinguish between the processes of ETL and ELT.
•Define data wrangling in the context of ingesting data into a pipeline.
•Describe key tasks in each of these data wrangling steps:
•Discovery
•Structuring
•Cleaning
•Enriching
•Validating and Publishing
MODULE-7
INGESTING BY BATCH OR BY STREAM
•Batch jobs query the source, transform the data, and load it into the pipeline.
•Traditional ETL uses batch processing.
•With stream processing, producers put records on a stream where consumers get and
process them.
•Streams are designed to handle high-velocity data and real-time processing.
•Batch ingestion involves writing scripts and jobs to perform the ETL or ELT
process.
•Workflow orchestration helps you to handle interdependencies between jobs and manage
failures within a set of jobs.
•Key characteristics for pipeline design include ease of use, data volume and
variety, orchestration and monitoring, and scaling and cost management.
•Choose purpose-built tools that match the type of data to be ingested and simplify the tasks
that are involved in ingestion.
•Amazon AppFlow, AWS DMS, and DataSync each simplify the ingestion of
specific data types.
•AWS Data Exchange provides a simplified way to find and subscribe to third-party
datasets.
•AWS Glue is a fully managed data integration service that simplifies ETL tasks (a short
example of driving it through the API appears after this list).
•AWS Glue crawlers derive schemas from data stores and provide them to the centralized
AWS Glue Data Catalog.
•AWS Glue Studio provides visual authoring and job management tools.
• The AWS Glue Spark runtime engine processes jobs in a serverless environment.
• AWS Glue workflows provide ETL orchestration.
• CloudWatch provides integrated monitoring and logging for AWS Glue, including job run
insights.
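A brief boto3 sketch of driving these AWS Glue components from code; the crawler and job names are hypothetical and would already have been defined in the Glue console or through infrastructure as code.

```python
import boto3

glue = boto3.client("glue")

# Run a crawler so the derived schema lands in the AWS Glue Data Catalog.
glue.start_crawler(Name="raw-sales-crawler")  # hypothetical crawler name

# Start an ETL job authored in AWS Glue Studio, passing a job argument
# that the job script can read at run time.
run = glue.start_job_run(
    JobName="sales-to-parquet",  # hypothetical job name
    Arguments={"--target_prefix": "s3://example-data-lake/curated/"},
)

# Poll the job state; CloudWatch holds the detailed logs and job run insights.
state = glue.get_job_run(JobName="sales-to-parquet", RunId=run["JobRunId"])
print(state["JobRun"]["JobRunState"])
```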
Section-5: Scaling considerations for batch processing:
•Performance goals should focus on what factors are most important for your batch
processing.
• Scale AWS Glue jobs horizontally by adding more workers.
• Scale AWS Glue jobs vertically by choosing a larger type of worker in the job
configuration.
• Large, splittable files let the AWS Glue Spark runtime engine run many jobs in
parallel with less overhead than processing many smaller files.
•The stream is a buffer between the producers and the consumers of the stream (a minimal
producer example appears after these points).
• The KPL simplifies the work of writing producers for Kinesis Data Streams.
• Data is written to shards on the stream as a
sequence of data records.
• Data records include a sequence number, partition key, and data blob.
• Kinesis Data Firehose can deliver streaming data directly to storage, including Amazon S3
and Amazon Redshift.
• Kinesis Data Analytics is purpose built to perform real- time analytics on data as it passes
through the stream.
•Kinesis Data Streams provides scaling options to manage the throughput of data
on the stream.
• We can scale how much data can be written to the stream, how long the data is
stored on the stream, and how much throughput each consumer gets.
• CloudWatch provides metrics that help you monitor how your stream handles the data that
is being written to and read from it.
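A minimal producer example for Kinesis Data Streams, written with plain boto3 rather than the KPL; the stream name and record contents are made up.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Put a single data record on the stream. The partition key determines which
# shard receives the record; the returned sequence number orders it within the shard.
record = {"sensor_id": "s-17", "temperature": 21.4}
response = kinesis.put_record(
    StreamName="sensor-stream",               # hypothetical stream name
    Data=json.dumps(record).encode("utf-8"),  # the data blob
    PartitionKey=record["sensor_id"],
)
print(response["ShardId"], response["SequenceNumber"])
```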
•With AWS IoT services, we can use MQTT and a pub/sub model to communicate with IoT
devices.
• We can use AWS IoT Core to securely connect, process, and act upon device data.
• The AWS IoT Core rules engine transforms and routes incoming messages to AWS
services.
• AWS IoT Analytics provides a complete pipeline to ingest and process data and
then make it available for analytics.
Module-7 Summary:
•List key tasks that the data engineer needs to perform when building an ingestion layer.
•Describe how purpose-built AWS services support ingestion tasks.
•Illustrate how the features of AWS Glue work together to support automating batch
ingestion.
•Describe Kinesis streaming services and features that simplify streaming ingestion.
•Identify configuration options in AWS Glue and Kinesis Data Streams that help you scale
your ingestion processing to meet your needs.
•Describe distinct characteristics of ingesting IoT data by using AWS IoT Core and AWS
IoT Analytics services.
Module-7 Labs:
MODULE-8
STORING AND ORGANIZING DATA
•Data lakes store data as-is. We don’t need to structure the data before we begin to run
analytics.
•Amazon S3 promotes data integrity through strong data consistency and multipart uploads.
•With Lake Formation, we can use governed tables to enable concurrent data inserts and
edits across tables.
•Data warehouses consist of three tiers and can store structured, curated, or transformed
data.
•Amazon Redshift is a fully managed data warehouse service that uses computing
resources called nodes.
• We can use Redshift Spectrum to write SQL queries that combine data from both our data
lake and our data warehouse.
Section-3: Purpose-built databases:
• Our choice of database will affect what our application can handle, how it will perform,
and the operations that we are responsible for.
•When choosing a database, consider several factors:
•Application workload
•Data shape
•Performance
•Operations burden
Section-5: Securing storage:
•Security for data lake storage is built upon the intrinsic security features of Amazon S3.
•Access policies provide a highly customizable way to provide access to resources in the
data lake.
•Data lakes that are built on AWS rely on server-side and client-side encryption.
•Amazon Redshift handles service security and database security as two distinct functions.
Module-8 Summary:
Module-8 Labs:
1. Storing and Analysing Data by Using Amazon Redshift
In this lab, we have learnt how to do the following (a small Redshift Data API sketch
appears after the list):
• Review an AWS Identity and Access Management (IAM) role's permissions to access
and configure Amazon Redshift.
• Create and configure a Redshift cluster.
• Create a security group for a Redshift cluster.
• Create a database, schemas, and tables with the Redshift cluster.
• Load data into tables in a Redshift cluster.
• Query data within a Redshift cluster by using the Amazon Redshift console.
• Query data within a Redshift cluster by using the API and AWS Command Line
Interface (AWS CLI) in AWS Cloud9.
• Review an IAM policy with permissions to run queries on a Redshift cluster.
• Confirm that a data science team member can run queries on a Redshift cluster.
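The lab uses the console and AWS Cloud9; as a complementary sketch, the same kind of query can be issued through the Redshift Data API with boto3. The cluster, database, user, and table identifiers below are placeholders, not the lab's actual resources.

```python
import time
import boto3

rsd = boto3.client("redshift-data")

# Submit a SQL statement to the cluster (identifiers are placeholders).
stmt = rsd.execute_statement(
    ClusterIdentifier="redshift-cluster-1",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT COUNT(*) FROM public.sales;",
)

# The Data API is asynchronous, so poll until the statement finishes.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

print(rsd.get_statement_result(Id=stmt["Id"])["Records"])
```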
MODULE-9
PROCESSING BIG DATA
•Big data processing is generally divided into two categories, batch and streaming.
•Batch data typically involves cold data, with analytics workloads that involve longer
processing times.
•Streaming data involves many data sources providing data that must be processed
sequentially and incrementally.
•Batch processing and stream processing each benefit from specialized big data processing
frameworks.
•Apache Spark performs processing in-memory, reduces the number of steps in a job, and
reuses data across multiple parallel operations.
•Spark reuses data by using an in-memory cache to speed up ML algorithms.
• We can use Amazon EMR to perform automated installations of common big data projects.
•The Amazon EMR service architecture consists of four distinct layers:
•Storage layer
•Cluster resource management layer
•Data processing frameworks layer
•Applications and programs layer
•Three methods are available to launch EMR clusters: interactive, command line, and API
(the API method is sketched after this list).
•EMR clusters are characterized as either long running or transient, based on their usage.
•External connections to EMR clusters can only be made through the main node.
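A sketch of the API launch method using boto3; the release label, instance types, and log bucket are assumptions, and the default EMR service roles are expected to already exist in the account.

```python
import boto3

emr = boto3.client("emr")

# Launch a small, transient EMR cluster with Hive and Spark installed.
response = emr.run_job_flow(
    Name="log-processing-cluster",          # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",              # assumed release label
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate when the steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",              # hypothetical log bucket
)
print("Cluster ID:", response["JobFlowId"])
```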
Section-6: Apache Hudi:
•Apache Hudi provides the ability to ingest and update data in near real time.
•Hudi maintains metadata of the actions that are performed to ensure those actions are both
atomic and consistent.
Module-9 summary:
Module-9 Labs:
1. Processing Logs by Using Amazon EMR
In this lab, we have learnt how to:
• Launch an EMR cluster through the AWS Management Console.
• Run Hive interactively.
• Use Hive commands to create tables from log data that is stored in Amazon S3.
• Use Hive commands to join tables and store the joined table in Amazon S3.
• Use Hive to query tables that are stored in Amazon S3.
MODULE 10
PROCESSING DATA FOR ML
Section-1: ML concepts:
•An ML model is a function that combines a trained dataset and an algorithm to predict
outcomes.
•The three general types of machine learning include supervised, unsupervised and
reinforcement.
•Deep learning is a subcategory of machine learning that uses neural networks to develop
models.
•Generative AI is a subcategory of deep learning that can generate content and is trained on
large amounts of data.
•In machine learning, the label or target is what we are trying to predict, and the features are
attributes that can be used to predict the target.
•Collecting data in the ML lifecycle is similar to an extract, load, and transform (ELT)
ingestion process.
•The data engineer and the data scientist ensure that they have enough of the correct data to
support training and testing the model.
•During the data collection phase, we might need to add labels to the training data.
Section-5: Applying labels to training data with known targets:
•Labelling is the process of assigning useful labels to targets in the training data.
•Common types of labelling include computer vision, natural language processing, audio
processing, and tabular data.
•By using tools such as SageMaker Ground Truth, we can share labelling jobs with an
expert workforce.
•Data preprocessing puts data into the correct shape and quality for training.
•The data scientist performs preprocessing by using a combination of techniques and
expertise.
•Exploring and visualizing the data helps the data scientist get a feel for the data.
•Examples of preprocessing strategies include partitioning, balancing, and data formatting.
•Feature engineering is about improving the existing features to improve their usefulness in
predicting outcomes.
•Feature creation and transformation focus on adding new information.
•Feature extraction and selection are about reducing dimensionality.
•The deployment infrastructure is typically quite different from the training infrastructure.
•Inference is the process of making predictions on the production data.
•Automating the ML lifecycle is an important step in operationalizing and scaling the ML
solution.
•ML Ops is a deployment approach that relies on a streamlined development lifecycle to
optimize resources.
Section-10: ML infrastructure on AWS:
Section-11: SageMaker:
•Amazon CodeWhisperer is a generative AI tool that can help us develop our application
faster using an LLM.
•Prompt engineering is the practice of writing inputs or instructions to interact with a
generative AI model (LLM) that can generate an expected output.
•AWS has several generative AI offerings: Amazon CodeWhisperer, Amazon Bedrock,
AWS Inferentia, and Amazon SageMaker JumpStart.
•AWS offers a growing number of purpose-built AI/ML services to handle common use
cases.
•Services include natural language processing, image and video recognition, and time-series
data predictions.
• We can incorporate these services into our data pipelines without the burden of building
custom ML solutions (see the example after this list).
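As one small example of calling such a service from a pipeline, the sketch below sends a piece of text to Amazon Comprehend for sentiment analysis; the sample text is invented.

```python
import boto3

comprehend = boto3.client("comprehend")

# Natural language processing without building or hosting a custom model.
text = "The flight was delayed, but the ground staff handled rebooking very well."
result = comprehend.detect_sentiment(Text=text, LanguageCode="en")

print(result["Sentiment"])       # e.g. POSITIVE, NEGATIVE, NEUTRAL, or MIXED
print(result["SentimentScore"])  # confidence score for each class
```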
Module-10 Summary:
MODULE-11
ANALYZING AND VISUALIZING DATA
•When we select analysis and visualization tools, we should consider the business needs,
data characteristics, and access to data.
• Consider the granularity and format of the insights based on business needs.
• Consider the volume, velocity, variety, veracity, and value of the data.
• Consider the functions of individuals who will access, analyse, and visualize the data.
•AWS tools and services that are commonly used to query and visualize data include
Athena, QuickSight, and OpenSearch Service.
• Athena is used for interactive analysis with SQL.
•Decision-makers can use QuickSight to interact with data visually and get insight
quickly.
• OpenSearch Service is used for operational analytics to visualize data in near real time.
Module-11 Summary:
Module-11 Labs:
MODULE 12
AUTOMATING THE PIPELINE
• Automating the environment can help us ensure that the system is stable, consistent, and efficient.
• Repeatability and reusability are two key benefits of infrastructure as code.
Section-2: CI/CD:
•CI/CD spans the develop and deploy stages of the software development lifecycle.
•Continuous delivery improves on continuous integration by helping teams gain a greater
level of certainty that their software will work in production.
•With Step Functions, we can use visual workflows to coordinate the components of
distributed applications and microservices.
• We define a workflow, which is also referred to as a state machine, as a series of steps and
transitions between each step.
• Step Functions is integrated with Athena to facilitate building workflows that include
Athena queries and data processing operations.
Module-12 Summary:
Module-12 Labs:
1. Building and Orchestrating ETL Pipelines by Using Athena and Step Functions
In this lab, we have learnt how to do the following (a sketch of such a workflow definition
appears after the list):
• Create and test a Step Functions workflow by using Step Functions Studio.
• Create an AWS Glue database and tables.
• Store data on Amazon S3 in Parquet format to use less storage space and to promote
faster data reads.
• Partition data that is stored on Amazon S3 and use Snappy compression to optimize
performance.
• Create an Athena view.
• Add an Athena view to a Step Functions workflow.
• Construct an ETL pipeline by using Step Functions, Amazon S3, Athena, and AWS
Glue.
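A hedged sketch of the kind of state machine built in this lab, expressed as code instead of Step Functions Studio. The query, database, results bucket, and IAM role are placeholders rather than the lab's actual resources.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# A one-step workflow that runs an Athena query synchronously; a full ETL
# pipeline would chain further states (Glue jobs, views, and so on).
definition = {
    "Comment": "Sketch: run an Athena query as one step of an ETL workflow",
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM example_table",
                "QueryExecutionContext": {"Database": "example_db"},
                "ResultConfiguration": {"OutputLocation": "s3://example-athena-results/"},
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="etl-pipeline-sketch",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsAthenaRole",  # placeholder role
)
```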
CASE STUDY
BUILDING AN AIRLINE DATA PIPELINE – AWS SERVICES USED
1. AIM:
To develop an analytics visualization dashboard using Amazon QuickSight for monitoring
real-time or scheduled airline flight data. The dashboard provides insights into airline
operations, flight volumes, and airport traffic, enabling data-driven decisions for business
and operations teams.
This project creates an end-to-end pipeline that collects flight data from the AviationStack
API, processes and stores it using AWS Lambda and S3, transforms and catalogs it using
AWS Glue, queries it using Athena, and visualizes it via QuickSight.
2. DESCRIPTION:
Dashboard Objectives:
▪ Data Visualization: Interactive visual charts, KPIs, and tables to view trends and stats
▪ Timely Insights: Fetch live flight information with scheduled Lambda invocations
▪ Scalability: Supports massive flight datasets using S3 and serverless Glue/Athena
▪ User-Friendly: QuickSight’s no-code UI makes it accessible for all business users
Task 1: Ingest Flight Data Using Lambda
1. Create S3 Bucket:
We developed a Python script within an AWS Lambda function (Fig 1.2) to automate the
ingestion of flight data. The script connects to the AviationStack API, retrieves real-time
or scheduled flight records, and parses the JSON response.
Once processed, the data is stored in an Amazon S3 bucket using a structured folder
format organized by date (year/month/day). This ensures the data is efficiently partitioned
for querying and downstream processing.
Fig 1.2 Deploying the Python script for Data ingestion from the Data Source
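The actual script is the one shown in Fig 1.2; the sketch below is a reconstruction of the same idea, with the bucket name treated as a placeholder and the request parameters assumed to follow the AviationStack documentation.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("BUCKET", "flight-data-raw")  # placeholder bucket name
API_KEY = os.environ["AVIATIONSTACK_KEY"]             # supplied as a Lambda environment variable


def lambda_handler(event, context):
    # Fetch the latest flight records from the AviationStack API.
    url = f"http://api.aviationstack.com/v1/flights?access_key={API_KEY}&limit=100"
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.loads(resp.read())

    # Store the raw JSON in S3 under a year/month/day prefix so the data is
    # partitioned for Glue crawling and Athena queries downstream.
    now = datetime.now(timezone.utc)
    key = f"raw/{now:%Y/%m/%d}/flights_{now:%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode("utf-8"))

    return {"statusCode": 200, "body": f"Stored {key}"}
```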
3. Schedule with EventBridge (if needed):
▪ Rule: rate(30 minutes) to fetch flight data every half hour
1. Glue Crawler (Raw Data): We configured a Glue crawler to scan the raw flight data
stored in the S3 bucket, as shown in Fig 2.1. The crawler automatically detects the structure
of the JSON files and infers their schema. Once complete, it creates a metadata table
named flights_raw inside the flights_db database, making the raw flight data ready for
querying through Athena.
We created an AWS Glue ETL job to process and transform the raw flight data. This job
extracts nested fields—such as departure and arrival details—and flattens them into a
structured format.
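The Glue job script itself is not reproduced in the report; the PySpark sketch below illustrates the kind of flattening it performs, assuming the nested field names returned by the AviationStack API and using placeholder S3 paths.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-flights").getOrCreate()

# Read the raw JSON files written by the Lambda function (placeholder path).
raw = spark.read.json("s3://flight-data-raw/raw/")

# Each response wraps the flight records in a "data" array, so explode it first,
# then pull the nested departure, arrival, and airline fields up to the top level.
flights = raw.select(explode(col("data")).alias("f"))
flat = flights.select(
    col("f.flight_date").alias("flight_date"),
    col("f.flight_status").alias("flight_status"),
    col("f.airline.name").alias("airline_name"),
    col("f.departure.airport").alias("dep_airport"),
    col("f.arrival.airport").alias("arr_airport"),
)

# Write the flattened data to the cleaned path for the second Glue crawler to catalog.
flat.write.mode("overwrite").parquet("s3://flight-data-cleaned/flights/")
```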
After the ETL job outputs the flattened flight data to S3, we set up a second Glue crawler
to scan that cleaned data path. The crawler analyzes the structure of the new files and
creates a metadata table named flights_cleaned in the flights_cleaned_db database
(Fig 2.2). This makes the cleaned data ready for SQL queries in Athena and visualization in
QuickSight.
Fig 2.2 Locating the metadata table named flights_cleaned in the Data Catalog
Using Amazon Athena, we executed a simple SQL query to preview the cleaned flight
data and confirm that it was correctly processed and stored (Fig 3.1). The query retrieves
the first 10 rows from the flights_cleaned table (see the sketch after Fig 3.1).
This step helps validate the schema, data types, and overall data quality before building
visualizations.
Fig 3.1 Validating the Data in Amazon Athena through SQL Queries
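The query in Fig 3.1 is a simple preview; the sketch below shows how the same validation can be run programmatically with boto3 (the results bucket is a placeholder).

```python
import time
import boto3

athena = boto3.client("athena")

# Preview the first 10 rows of the cleaned table to validate the schema.
query_id = athena.start_query_execution(
    QueryString="SELECT * FROM flights_cleaned LIMIT 10",
    QueryExecutionContext={"Database": "flights_cleaned_db"},
    ResultConfiguration={"OutputLocation": "s3://flight-athena-results/"},  # placeholder bucket
)["QueryExecutionId"]

# Athena runs queries asynchronously, so wait for this one to finish.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print each returned row (the first row holds the column headers).
for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([field.get("VarCharValue") for field in row["Data"]])
```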
Task 4: Build Dashboard in Amazon QuickSight
1. Enable Access:
• S3, Glue, and Athena permissions granted to the QuickSight service role (Fig 4.1)
2. Create Dataset:
• Open Amazon QuickSight and go to "Manage data."
• Choose "New dataset" and select Athena as the data source as Fig 4.2.
• From the list of available tables, select the flights_cleaned table, which
was created using the Glue crawler.
3. Designing Dashboard:
Once the dataset was available in QuickSight, we began building the dashboard with a
mix of visual components to uncover key patterns in the flight data, as shown in the
following figures 4.3-4.5:
• KPI Cards to display high-level metrics such as total number of flights, distinct
airlines, and airports involved.
• Line Chart showing flight volume over time, helping track trends and peak periods.
• Bar Chart breaking down the number of flights operated by each airline.
• Table listing detailed flight routes, mapping each departure airport to its
corresponding arrival airport.
• Filters for interactive exploration based on flight_status, flight_date, airline_name,
and dep_airport.
Fig 4.5 Airline Companies and Their Arrival Destinations [Sankey diagram]
Fig 4.6 Airline Companies by Size [Tree Map] and Departure Activity [Stacked Bar Chart]
3. AWS SERVICES USED:
▪ AWS Lambda: Used to fetch flight data from the AviationStack API and store it in
S3 automatically. It runs serverless and can be scheduled using Amazon EventBridge for
regular ingestion (e.g., every 30 minutes).
▪ Amazon S3: Acts as the central data lake for storing both raw and cleaned flight data in
JSON format. Data is partitioned by date to support efficient querying and integration
with other AWS services.
▪ AWS Glue: Crawlers automatically detect schema from S3 and create queryable tables
in the Glue Data Catalog. Optional ETL jobs flatten and transform nested data for easier
analytics.
▪ Amazon Athena: Allows you to run SQL queries directly on flight data stored in S3 using
the Glue catalog. It is serverless and ideal for exploring, filtering, and validating data with
no setup.
▪ Amazon QuickSight: Used to create interactive dashboards and charts based on flight
data queried through Athena. It supports real-time insights and can use SPICE for fast
in-memory visualizations.
4. BENEFITS OF AWS SERVICES:
▪ Amazon S3:
• Scalability and Durability: Handles the dataset's storage needs effortlessly, with
high availability and durability, ensuring data integrity.
• Cost Efficiency: Provides a cost-effective solution for storing and managing data,
optimizing operational expenses.
• Integration Capabilities: Seamlessly integrates with other AWS services,
enhancing overall data management efficiency.
▪ Amazon QuickSight:
• Intuitive Interface: Offers a user-friendly interface that simplifies the creation of
complex visualizations and dashboards, promoting user adoption and engagement.
• Real-Time Insights: Enables instant access to real-time insights from the flight
data, supporting agile decision-making processes.
• Scalability and Flexibility: Scales effortlessly with growing data volumes and
user demands, ensuring consistent performance.
5. Conclusion:
This project showcases a real-time airline analytics platform using AWS. By leveraging AWS
Lambda, Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight, we built a fully
automated, scalable, and cost-effective data pipeline. The final dashboard delivers powerful
insights into flight patterns, airline operations, and airport traffic, empowering aviation
stakeholders to make data-driven decisions efficiently.
This case study demonstrates the power of modern AWS data tools to transform external APIs
into insightful visual dashboards with minimal infrastructure management.
CONCLUSION
AWS provides a rich array of data engineering services designed to handle various aspects of data
processing, storage, and analysis. Amazon QuickSight is an intuitive business intelligence service
that allows users to create and share interactive dashboards. Amazon Athena offers a serverless
query service, enabling users to analyse data directly from Amazon S3 using standard SQL. AWS
Glue is a fully managed ETL service that simplifies the process of preparing and loading data for
analytics. Additionally, services such as Amazon Redshift for data warehousing, Amazon EMR
for big data processing, and AWS Data Pipeline for data workflow automation play crucial roles
in building a comprehensive data engineering ecosystem on AWS.
In conclusion, we have explored the fundamental AWS services essential for data engineering
and cloud computing, equipping us with the tools to generate valuable analyses and insights from
diverse data sources. By using AWS's powerful capabilities, we can make informed and
calculated decisions in this data-driven industry.
REFERENCES
https://awsacademy.instructure.com/courses/81207
https://awsacademy.instructure.com/courses/81208
Data Source:
https://aviationstack.com/
AWS Lambda:
https://ap-south-1.console.aws.amazon.com/lambda/home?region=ap-south-1
Amazon Athena:
https://aws.amazon.com/athena/?whats-new-cards.sortby=item.additionalFields.postDateTime&whats-newcards.sort-order=desc
AWS Glue:
https://aws.amazon.com/glue/
Amazon S3:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
Amazon QuickSight:
https://ap-south-1.quicksight.aws.amazon.com/sn/start/analyses