
Data Engineering Virtual Internship


An internship report submitted to
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR
ANANTHAPURAMU

in partial fulfilment of the requirements for


the award of the degree of

BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted by

Chowdam Anila (Roll no: 20121A0450)

Department of Electronics and Communication Engineering

SREE VIDYANIKETHAN ENGINEERING COLLEGE


(AUTONOMOUS)
Sree Sainath Nagar, A. Rangampet, Tirupathi - 517102.

(2020-2024)


Internship Details

Title of Internship: Data Engineering Virtual Internship

Name of the student: Chowdam Anila

Year and Semester: IV - II

Name of organization where the internship was undergone: AICTE - AWS Academy Virtual Internship

Duration of Internship: 10 Weeks

From date and to date: January – March 2024


Contents
Week      Topics Covered
Week 1    Overview of AWS Academy Data Engineering
Week 2    Data-Driven Organizations
Week 3    The Elements of Data & Design Principles and Patterns for Data Pipelines
Week 4    Securing and Scaling the Data Pipeline
Week 5    Ingesting and Preparing Data
Week 6    Ingesting by Batch or by Stream
Week 7    Storing and Organizing Data
Week 8    Processing Big Data & Data for ML
Week 9    Analyzing and Visualizing Data
Week 10   Automating the Pipeline


ABSTRACT
This whitepaper helps architects, data scientists, and developers understand the big data
analytics options available in the AWS cloud by providing an overview of services, with the
following information:
• Ideal usage patterns
• Cost model
• Performance
• Durability and availability
• Scalability and elasticity
• Interfaces
• Anti-patterns

This paper concludes with scenarios that showcase the analytics options in use, as well as
additional resources for getting started with big data analytics on AWS.
Objectives:

1. Data Integration and Centralization:
• Unify diverse datasets from educational institutions, encompassing student records, faculty information, academic performance, and administrative data.

2. Real-time Data Processing:
• Enable near-real-time processing and analysis of data, facilitating quick decision-making for educational administrators and policymakers.

3. Security and Compliance:
• Implement stringent security measures to safeguard sensitive student and faculty information, adhering to regulatory standards set forth by AICTE.

4. Scalable Infrastructure:
• Design and deploy a scalable data infrastructure on AWS, ensuring it can adapt to the growing data volumes from an expanding network of technical institutions.

5. Analytics and Reporting:
• Establish a comprehensive analytics framework for generating actionable insights, supporting informed decision-making at both institutional and regulatory levels.

6. Collaborative Data Ecosystem:
• Promote collaboration between educational institutions by facilitating secure data exchange and interoperability within the AWS environment.


TABLE OF CONTENTS
                                                                              Page no.
Abstract                                                                      v
List of Figures                                                               vii
Week 1   Overview of AWS Academy Data Engineering                             1-2
Week 2   Data-Driven Organizations                                            3
         2.1 Data-Driven Decisions                                            3
         2.2 The Data Pipeline - Infrastructure                               3
Week 3   Elements of Data, Design Principles & Patterns for Data Pipelines    4-5
         3.1 The Five Vs of Data: Volume, Velocity, Variety, Veracity & Value 4
         3.2 Variety of Data Types, Modern Data Architecture Pipeline         4
Week 4   Securing & Scaling the Data Pipeline                                 6
         4.1 Scaling: An Overview                                             6
         4.2 Creating a Scalable Infrastructure                               6
Week 5   Ingesting & Preparing Data                                           7
         5.1 ETL & ELT Comparison                                             7
Week 6   Ingesting by Batch or by Stream                                      8
         6.1 Comparing Batch & Stream Ingestion                               8
Week 7   Storing & Organizing Data                                            9
         7.1 Storage in the Modern Data Architecture                          9
Week 8   Processing Big Data & Data for ML                                    10-12
         8.1 Big Data Processing Concepts                                     10
Week 9   Analyzing and Visualizing Data                                       13
Week 10  Automating the Pipeline                                              14
Conclusion                                                                    15
References                                                                    16


LIST OF FIGURES
Figure     Description                            Page no.
Fig- 1.1   Code Generation                        1
Fig- 1.2   Benefits of Amazon CodeWhisperer       2
Fig- 2.1   Data Pipeline                          3
Fig- 3.1   Data Characteristics                   4
Fig- 4.1   Types of Scaling                       6
Fig- 4.2   Template Structure                     6
Fig- 4.3   AWS CloudFormation                     6
Fig- 5.1   ETL & ELT Comparison                   7
Fig- 6.1   Batch & Streaming Ingestion            8
Fig- 6.2   Purpose-Built Ingestion Tools          8
Fig- 7.1   Storage in the Modern Architecture     9
Fig- 8.1   Data Processing                        10
Fig- 8.2   Apache Hadoop                          10
Fig- 8.3   ML Models                              11
Fig- 8.4   ML Life Cycle                          11
Fig- 8.5   ML Framing                             12
Fig- 8.6   Collecting Data                        12
Fig- 9.1   Factors & Needs                        13
Fig- 9.2   QuickSight Example                     13
Fig- 10.1  Automating Infrastructure              14
Fig- 10.2  Step Functions                         14


COURSE MODULES
WEEK- 1:
OVERVIEW OF AWS ACADEMY DATA ENGINEERING
Course objectives:
This course prepares you to do the following:
• Summarize the role and value of data science in a data-driven organization.
• Recognize how the elements of data influence decisions about the infrastructure of a data pipeline.
• Illustrate a data pipeline by using AWS services to meet a generalized use case.
• Identify the risks and approaches to secure and govern data at each step and each transition of the data pipeline.
• Identify scaling considerations and best practices for building pipelines that handle large-scale datasets.
• Design and build a data collection process while considering constraints such as scalability, cost, fault tolerance, and latency.

Fig- 1.1: Code Generation

Open Code Reference Log

CodeWhisperer learns from open-source projects, and the code it suggests might occasionally
resemble code samples from the training data. With the reference log, you can view references
to code suggestions that are similar to the training data. When such occurrences happen,
CodeWhisperer notifies you and provides repository and licensing information. Use this
information to decide whether to use the code in your project and to properly attribute the
source code as desired.
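
For illustration, the short function below is the kind of suggestion a tool such as CodeWhisperer might produce from a comment prompt. It is a hand-written sketch, not actual CodeWhisperer output, and the bucket and key names in the usage comment are placeholders.

# Prompt comment: "write a function that uploads a local file to an Amazon S3 bucket"
import boto3
from botocore.exceptions import ClientError

def upload_file_to_s3(file_path: str, bucket: str, key: str) -> bool:
    """Upload a local file to S3 and report whether the upload succeeded."""
    s3 = boto3.client("s3")
    try:
        s3.upload_file(file_path, bucket, key)  # handles multipart uploads for large files
        return True
    except ClientError as error:
        print(f"Upload failed: {error}")
        return False

# Example usage (placeholder names):
# upload_file_to_s3("tickets.json", "example-bucket", "raw/tickets.json")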


Fig- 1.2: Benefits of Amazon CodeWhisperer

CodeWhisperer code generation offers many benefits for software development organizations.
It accelerates application development for faster delivery of software solutions. By automating
repetitive tasks, it optimizes the use of developer time, so developers can focus on more critical
aspects of the project. Additionally, code generation helps mitigate security vulnerabilities,
safeguarding the integrity of the codebase. CodeWhisperer also protects open-source intellectual
property by providing the open-source reference tracker. CodeWhisperer enhances code quality
and reliability, leading to robust and efficient applications, and it supports an efficient response
to evolving software threats, keeping the codebase up to date with the latest security practices.
CodeWhisperer has the potential to increase development speed, security, and the quality of
software.


WEEK- 2:
DATA DRIVEN ORGANIZATIONS
2.1 Data Driven Decisions:

How do organizations decide...


• Which of these customer transactions should be flagged as fraud?
• Which webpage design leads to the most completed sales?
• Which patients are most likely to have a relapse?
• Which type of online activity represents a security issue?
• When is the optimum time to harvest this year's crop?
2.2 The data pipeline – infrastructure for data-driven decisions:

Fig- 2.1: Data Pipeline

Another key characteristic of deriving insights by using your data pipeline is that the process
will almost always be iterative. You have a hypothesis about what you expect to find in the
data, and you need to experiment and see where it takes you. You might develop your
hypothesis by using BI tools to do initial discovery and analysis of data that has already been
collected. You might iterate within a pipeline segment, or you might iterate across the entire
pipeline. For example, in this illustration, the initial iteration (number 1) yielded a result that
was not as well defined as desired. Therefore, the data scientist refined the model and
reprocessed the data to get a better result (number 2). After reviewing those results, they
determined that additional data could improve the detail available in their result, so an
additional data source was tapped and ingested through the pipeline to produce the desired
result (number 3). A pipeline often has iterations of storage and processing. For example,
after the external data is ingested into pipeline storage, iterative processing transforms the
data into different levels of refinement for different needs.


WEEK- 3:

THE ELEMENTS OF DATA, DESIGN PRINCIPLES &


PATTERNS FOR DATA PIPELINES

3.1 The five Vs of data – volume, velocity, variety, veracity, and value:

Fig- 3.1: Data Characteristics


The evolution of data architectures:
So, which of these data stores or data architectures is the best one for your data pipeline?
The reality is that a modern architecture might include all of these elements. The key to a
modern data architecture is to apply the three-pronged strategy that you learned about earlier.
Modernize the technology that you are using. Unify your data sources to create a single source
of truth that can be accessed and used across the organization. And innovate to get higher
value analysis from the data that you have.
Variety of data types, modern data architecture on AWS:

The architecture illustrates the following other AWS purpose-built services that integrate
with Amazon S3 and map to each component that was described on the previous slide:
• Amazon Redshift is a fully managed data warehouse service.
• Amazon OpenSearch Service is a purpose-built data store and search engine that is optimized for real-time analytics, including log analytics.
• Amazon EMR provides big data processing and simplifies some of the most complex elements of setting up big data processing.
• Amazon Aurora provides a relational database engine that was built for the cloud.
• Amazon DynamoDB is a fully managed nonrelational database that is designed to run high-performance applications.
• Amazon SageMaker is an AI/ML service that democratizes access to the ML process.

3.2 Modern data architecture pipeline: Ingestion and storage:


Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it is first
cleaned and stored in the raw zone for permanent storage. Because data that is destined for
the data warehouse needs to be highly trusted and conformed to a schema, the data needs to
be processed further. Additional transformations would include applying the schema and
partitioning (structuring), as well as other transformations that are required to make the data
conform to requirements that are established for the trusted zone. Finally, the processing layer
prepares the data for the curated zone by modeling and augmenting it to be joined with other
datasets (enrichment) and then stores the transformed, validated data in the curated layer.
Datasets from the curated layer are ready to be ingested into the data warehouse to make them
available for low-latency access or complex SQL querying.
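
One simple way to lay out these zones is as key prefixes within the same S3 data lake bucket. The sketch below promotes an object from the raw zone to the trusted zone once it has been processed; the bucket and key names are assumed for illustration, and a real pipeline would normally transform the data (for example, with AWS Glue) rather than copy it unchanged.

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # placeholder data lake bucket

# Zone layout expressed as key prefixes: landing/ -> raw/ -> trusted/ -> curated/
raw_key = "raw/sales/2024/03/sales.csv"
trusted_key = "trusted/sales/2024/03/sales.csv"

# After the processing layer has applied the schema and other transformations,
# the conformed object is written into the trusted zone.
s3.copy_object(
    Bucket=BUCKET,
    Key=trusted_key,
    CopySource={"Bucket": BUCKET, "Key": raw_key},
)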
Streaming analytics pipeline:

Producers ingest records onto the stream. Producers are integrations that collect data from
a source and load it onto the stream. Consumers process records: they read data from the
stream and perform their own processing on it. The stream itself provides a temporary but
durable storage layer for the streaming solution. In the pipeline depicted here, Amazon
CloudWatch Events is the producer that puts event data onto the stream, Kinesis Data Streams
provides the storage, and the data is then available to multiple consumers.
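
As a minimal sketch of the producer side of such a pipeline, the code below assumes a Kinesis data stream named example-events already exists and uses the boto3 put_record call to place one JSON record on the stream for downstream consumers.

import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(event: dict, partition_key: str, stream_name: str = "example-events") -> None:
    """Write one JSON-encoded record onto the Kinesis data stream."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),  # records are sent as raw bytes
        PartitionKey=partition_key,              # determines which shard receives the record
    )

# Example usage: a producer publishing a sample event for consumers to process.
put_event({"source": "cloudwatch", "detail": "sample event"}, partition_key="sample")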


WEEK- 4:
SECURING & SCALING DATA PIPELINE
4.1 Scaling: An overview:

Fig- 4.1: Types of Scaling


4.2 Creating a scalable infrastructure:

Fig- 4.2: Template Structure

Fig- 4.3: AWS CloudFormation


AWS CloudFormation is a fully managed service that provides a common language for you
to describe and provision all of the infrastructure resources in your cloud environment.
CloudFormation creates, updates, and deletes the resources for your applications in
environments called stacks. A stack is a collection of AWS resources that are managed as a
single unit. CloudFormation is all about automated resource provisioning: it simplifies the
task of repeatedly and predictably creating groups of related resources that power your
applications. Resources are written in text files by using JSON or YAML format.
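
As a small sketch of that idea, the YAML template below (written inline as a Python string) declares a single S3 bucket, and the boto3 create_stack call provisions it as a stack; the stack name is a placeholder.

import boto3

# Minimal CloudFormation template: one S3 bucket, declared in YAML.
TEMPLATE_BODY = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
"""

cloudformation = boto3.client("cloudformation")

# CloudFormation creates, updates, and deletes the declared resources as one stack.
cloudformation.create_stack(
    StackName="example-data-pipeline-stack",  # placeholder stack name
    TemplateBody=TEMPLATE_BODY,
)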


WEEK- 5:
INGESTING & PREPARING DATA
5.1 ETL and ELT comparison:

Fig- 5.1: ETL & ELT Comparison

Data wrangling:

Data wrangling is the process of transforming large amounts of unstructured or structured raw
data from multiple sources with different schemas into a meaningful set of data that has value
for downstream processes or users.

Data Structuring:

For the scenario that was described previously, the structuring step includes exporting a .json
file from the customer support ticket system, loading the .json file into Excel, and letting
Excel parse the file. For the mapping step for the supp2 data, the data engineer would modify
the cust num field to match the customer id field in the data warehouse.

For this example, you would perform additional data wrangling steps before compressing
the file for upload to the S3 bucket.
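
The same structuring and mapping steps could also be scripted rather than done in Excel. The sketch below uses pandas to parse a hypothetical support-ticket export and rename a cust_num column to customer_id; the file name and column names are assumptions for illustration.

import pandas as pd

# Structuring step: parse the exported support-ticket file (hypothetical file name).
tickets = pd.read_json("supp2_tickets.json")

# Mapping step: align the source field with the data warehouse schema.
tickets = tickets.rename(columns={"cust_num": "customer_id"})

# Compress the structured file so it is ready for upload to the S3 bucket.
tickets.to_csv("supp2_tickets.csv.gz", index=False, compression="gzip")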
Data Cleaning:

Data cleaning includes the following steps (a short sketch follows the list):
• Remove unwanted data.
• Fill in missing data values.
• Validate or modify data types.
• Fix outliers.
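
A minimal pandas sketch of those cleaning steps, assuming a tickets DataFrame with hypothetical customer_id, priority, and resolution_hours columns:

import pandas as pd

def clean_tickets(tickets: pd.DataFrame) -> pd.DataFrame:
    """Apply basic cleaning steps to a hypothetical support-ticket dataset."""
    # Remove unwanted data: drop exact duplicates and rows missing the key column.
    cleaned = tickets.drop_duplicates().dropna(subset=["customer_id"])

    # Fill in missing data values with a sensible default.
    cleaned["priority"] = cleaned["priority"].fillna("normal")

    # Validate or modify data types.
    cleaned["customer_id"] = cleaned["customer_id"].astype(str)
    cleaned["resolution_hours"] = pd.to_numeric(cleaned["resolution_hours"], errors="coerce")

    # Fix outliers by capping values at the 99th percentile.
    cap = cleaned["resolution_hours"].quantile(0.99)
    cleaned["resolution_hours"] = cleaned["resolution_hours"].clip(upper=cap)
    return cleaned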


WEEK- 6:
INGESTING BY BATCH OR BY STREAM
6.1 Comparing batch and stream ingestion:

Fig- 6.1: Batch & Streaming Ingestion


To generalize the characteristics of batch processing, batch ingestion involves running batch
jobs that query a source, move the resulting dataset or datasets to durable storage in the
pipeline, and then perform whatever transformations are required for the use case. As noted
in the Ingesting and Preparing Data module, this could be just cleaning and minimally
formatting data to put it into the lake. Or, it could be more complex enrichment, augmentation,
and processing to support complex querying or big data and machine learning (ML)
applications. Batch processing might be started on demand, run on a schedule, or initiated by
an event. Traditional extract, transform, and load (ETL) uses batch processing, but extract,
load, and transform (ELT) processing might also be done by batch.

Batch Ingestion Processing:

Batch ingestion is the process of transporting data from one or more sources to a target site
for further processing and analysis. This data can originate from a range of sources, including
data lakes, IoT devices, on-premises databases, and SaaS apps, and end up in different target
environments, such as cloud data warehouses or data marts.
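
To make the batch pattern concrete, the sketch below queries a local SQLite source table and lands the result in S3 as a single object. The database, table, bucket, and key names are placeholders, and in practice the job would be run on a schedule or triggered by an event.

import sqlite3
import boto3
import pandas as pd

def run_batch_ingestion(db_path: str = "orders.db",
                        bucket: str = "example-data-lake",
                        key: str = "raw/orders/orders.csv") -> None:
    """Query the source system, then move the resulting dataset to durable storage."""
    # Batch job step 1: query the source.
    with sqlite3.connect(db_path) as connection:
        orders = pd.read_sql_query("SELECT * FROM orders", connection)

    # Batch job step 2: land the dataset in the pipeline's durable storage (S3).
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=orders.to_csv(index=False).encode("utf-8"),
    )

run_batch_ingestion()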
Purpose-Built Ingestion Tools:

Fig- 6.2: Purpose-Built Ingestion Tools

Use Amazon AppFlow to ingest data from a software as a service (SaaS) application. You
can do the following with Amazon AppFlow:
• Create a connector that reads from a SaaS source and includes filters.
• Map fields in each source object to fields in the destination and perform transformations.
• Perform validation on records to be transferred.
• Securely transfer data to Amazon S3 or Amazon Redshift. You can trigger an ingestion on demand, on an event, or on a schedule.
An example use case for Amazon AppFlow is to ingest customer support ticket data from
the Zendesk SaaS product.
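
Assuming a flow named zendesk-tickets-to-s3 has already been configured in AppFlow, an on-demand run could be triggered from code roughly as follows; the flow name is a placeholder, and the call shown is the boto3 AppFlow client's start_flow operation.

import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of a previously configured AppFlow flow.
response = appflow.start_flow(flowName="zendesk-tickets-to-s3")  # placeholder flow name
print("Started flow execution:", response.get("executionId"))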


WEEK- 7:
STORING AND ORGANIZING DATA
7.1 Storage in the modern data architecture:

Fig- 7.1: Storage in modern Architecture

Data in cloud object storage is handled as objects. Each object is assigned a key, which is a
unique identifier. When the key is paired with metadata that is attached to the objects, other
AWS services can use the information to unlock a multitude of capabilities. Thanks to
economies of scale, cloud object storage comes at a lower cost than traditional storage.
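
A short sketch of that key-plus-metadata model, using placeholder bucket, key, and metadata values:

import boto3

s3 = boto3.client("s3")

# Each object is addressed by a unique key and can carry user-defined metadata.
s3.put_object(
    Bucket="example-data-lake",
    Key="curated/sales/2024/03/sales.csv",       # the key uniquely identifies the object
    Body=b"order_id,amount\n1001,250\n",
    Metadata={"source": "pos-system", "zone": "curated"},  # metadata other services can read
)

# Reading the metadata back alongside the object's attributes.
head = s3.head_object(Bucket="example-data-lake", Key="curated/sales/2024/03/sales.csv")
print(head["Metadata"])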

Data Warehouse Storage:

• Provide a centralized repository
• Store structured and semi-structured data
• Store data in one of two ways: frequently accessed data in fast storage, and infrequently accessed data in cheap storage
• Might contain multiple databases that are organized into tables and columns
• Separate analytics processing from transactional databases
• Example: Amazon Redshift

Purpose-Built Databases:

Here are a few key points to summarize this section. Storage plays an integral part in ETL
and ELT pipelines, and data often moves in and out of storage numerous times, based on
pipeline type and workload type.
• ETL pipelines transform data in buffered memory prior to loading the data into a data lake or data warehouse for storage. Levels of buffered memory vary by service.
• ELT pipelines extract and load data into data lake or data warehouse storage without transformation; the transformation of the data is part of the target system's workload.
Securing Storage:

Security for a data warehouse in Amazon Redshift:

• Amazon Redshift database security is distinct from the security of the service itself.
• Amazon Redshift provides additional features to manage database security.
• Due to third-party auditing, Amazon Redshift can help to support applications that are required to meet international compliance standards.


WEEK- 8:
PROCESSING BIG DATA & DATA FOR ML
8.1 Big Data processing Concepts:

Fig- 8.1: Data Processing

Apache Hadoop:

Fig- 8.2: Apache Hadoop

Apache Spark:

Apache Spark has the following characteristics (a short PySpark sketch follows the list):
• Is an open-source, distributed processing framework
• Uses in-memory caching and optimized query processing
• Supports code reuse across multiple workloads
• Clusters consist of leader and worker nodes
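
A brief PySpark sketch of those characteristics (distributed processing with in-memory caching), assuming a CSV dataset is available at the placeholder S3 path shown:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Read a (placeholder) dataset; on Amazon EMR this would typically be an s3:// path.
events = spark.read.csv("s3://example-data-lake/raw/events/", header=True, inferSchema=True)

# Cache the DataFrame in memory because it is reused by more than one query.
events.cache()

# Distributed aggregation executed across the worker nodes.
counts = events.groupBy("event_type").agg(F.count("*").alias("total"))
counts.show()

spark.stop()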

Amazon EMR Characteristics:

• Managed cluster platform
• Big data solution for petabyte-scale data processing, interactive analytics, and machine learning
• Processes data for analytics and BI workloads using big data frameworks
• Transforms and moves large amounts of data into and out of AWS data stores
ML Concepts:

Fig- 8.3: ML models

ML Life Cycle:

Fig- 8.4: ML life cycle

Framing the ML problem to meet Business Goals:

Working backwards from the business problem to be solved


• What is the business problem?
• What pain is it causing?
• Why does this problem need to be resolved?
• What will happen if you don't solve this problem?
• How will you measure success?

Fig- 8.5: ML Framing

Collecting Data:

Fig-8.6: Collecting Data


WEEK- 9:
ANALYZING & VISUALIZING DATA
9.1 Consideration factors that influence tool selection:

Fig- 9.1: Factors & needs

Data characteristics:
• How much data is there?
• At what speed and volume does it arrive?
• How frequently is it updated?
• How quickly is it processed?
• What type of data is it?

9.2 Comparing AWS tools and services:

Data from multiple sources is put in Amazon S3, where Athena can be used for one-time
queries. Amazon EMR aggregates the data and stores the aggregates in S3, and Athena can
be used to query the aggregated datasets. From S3, the data can be loaded into Amazon
Redshift, where QuickSight can access the data to create visualizations.

Fig- 9.2: QuickSight Example


WEEK- 10:
AUTOMATING THE PIPELINE
Automating Infrastructure deployment:

Fig- 10.1: Automating Infrastructure

If you build infrastructure with code, you gain the benefits of repeatability and reusability
while you build your environments. In the example shown, a single template is used to deploy
Network Load Balancers and Auto Scaling groups that contain Amazon Elastic Compute
Cloud (Amazon EC2) instances. Network Load Balancers distribute traffic evenly across
targets.

CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested over a
series of stages (source, build, test, staging, and production), and then published as
production-ready code.

Automating with Step Functions:

Fig- 10.2: Step Functions

• With Step Functions, you can use visual workflows to coordinate the components of
distributed applications and microservices.
• You define a workflow, which is also referred to as a state machine, as a series of
steps and transitions between each step.
• Step Functions is integrated with Athena to facilitate building workflows that include Athena queries and data processing operations (a minimal sketch follows).
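
As a minimal sketch, the state machine definition below (Amazon States Language written as a Python dictionary) contains a single task state that runs an Athena query through the Step Functions Athena integration; the query, workgroup, state machine name, and IAM role ARN are placeholders.

import json
import boto3

# One-state workflow: start an Athena query and wait for it to finish (.sync).
DEFINITION = {
    "StartAt": "RunAthenaQuery",
    "States": {
        "RunAthenaQuery": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": "SELECT COUNT(*) FROM events",  # placeholder query
                "WorkGroup": "primary",
            },
            "End": True,
        }
    },
}

stepfunctions = boto3.client("stepfunctions")
stepfunctions.create_state_machine(
    name="example-athena-workflow",
    definition=json.dumps(DEFINITION),
    roleArn="arn:aws:iam::123456789012:role/ExampleStepFunctionsRole",  # placeholder role ARN
)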


CONCLUSION
Data engineering is a critical component in the modern data landscape, playing a crucial role
in the success of data-driven decision-making and analytics. As we draw conclusions about
data engineering, several key points come to the forefront:

Foundation for Data-Driven Insights:


Data engineering serves as the foundation for extracting, transforming, and loading (ETL) data
from diverse sources into a format suitable for analysis. This process is essential for generating
meaningful insights and facilitating informed decision-making within organizations.

Data Quality and Integrity:


Maintaining data quality and integrity is paramount in data engineering. Data engineers are
responsible for cleaning, validating, and ensuring the accuracy of data, contributing to the
reliability of downstream analyses and business processes.

Scalability and Performance:


With the increasing volume, velocity, and variety of data, data engineering solutions must be
scalable and performant. Scalability ensures that systems can handle growing amounts of data,
while performance optimization ensures timely processing and availability of data for analytics.

Integration of Diverse Data Sources:


Data engineering enables the integration of data from various sources, whether structured or
unstructured, providing a unified view of information. This integration is crucial for a
comprehensive understanding of business operations and customer behavior.


REFERENCES
[1] https://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/data-engineering-principles.html
[2] https://medium.com/@polystat.io/data-engineering-complete-reference-guide-from-a-z-2019-852c308b15ed
[3] https://aws.amazon.com/big-data/datalakes-and-analytics/
[4] https://www.preplaced.in/blog/an-actionable-guide-to-aws-glue-for-data-engineers
[5] https://awsacademy.instructure.com/courses/68699/modules/items/6115658
[6] https://aws.amazon.com/blogs/big-data/aws-serverless-data-analytics-pipeline-reference-architecture/
[7] https://dev.to/tkeyo/data-engineering-pipeline-with-aws-step-functions-codebuild-and-dagster-5290
