Overview of Data Engineering - Updated

The document provides an introduction to data engineering, covering key concepts such as the definition of data, its importance, and the data engineering lifecycle. It outlines the roles of data engineers, the ETL process, and different types of data storage solutions. Additionally, it discusses structured, unstructured, and semi-structured data, along with various technologies and tools used in data engineering.

Uploaded by

Jana Mohamed


Introduction to Data Engineering
Agenda
1. What is Data?

2. Why Is Data Important?

3. What is Data Engineering?

4. Data Engineering Lifecycle

5. Different Roles & Titles in Data Engineering

6. ETL (Extract, Transform, Load) Process

7. Types of Data

8. Storage in Data Engineering

9. OLTP vs. OLAP

10. ETL vs. ELT

What is Data?

Data is a collection of facts, information, and statistics. It can take various forms,
such as sound, images, or any other format.
What data might these companies have?

Facebook

Netflix

Spotify

Noon

Banks

Vodafone
Company | Types of Data Collected
Facebook (Meta) | User profiles, posts, likes, friends, ads, time spent
Netflix | Watch history, preferences, device used, search history, ratings
Spotify | Songs played, playlists, skips, search history, ads, subscriptions
Noon (E-commerce) | Browsing & purchase history, cart activity, payment, reviews
Banks | Transactions, loans, card usage, fraud detection, customer support
Seller Companies | Sales, customer details, payment trends, inventory
Data is the new oil
Why Is Data Important?
1) decision-making

2) problem solving

3) understanding

4) improving processes

5) understanding customers
What is Data Engineering?

Data engineering is the practice of designing and building systems for the
aggregation, storage and analysis of data at scale. Data engineers empower
organizations to get insights in real time from large datasets.
What Does a Data Engineer Do?

➢ Design & Build Data Pipelines

○ Collect, transform, and move data efficiently.
➢ Develop & Manage Databases
○ Store structured & unstructured data for easy access.
➢ Ensure Data Quality & Integrity
○ Validate data accuracy, consistency, and reliability.
➢ Collaborate with Teams
○ Work with data scientists and analysts.
➢ Optimize Data Workflows
○ Automate processes and improve efficiency.
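
As a rough illustration of the data quality point above, here is a minimal sketch of record validation in Python. The field names (`order_id`, `amount`, `country`) and the rules themselves are hypothetical, chosen only for this example; real checks depend on the dataset.

```python
from __future__ import annotations

from typing import Any


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in ("order_id", "amount", "country"):
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Accuracy: a numeric amount must not be negative.
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("negative amount")
    return problems


def find_duplicate_ids(records: list[dict[str, Any]]) -> set[Any]:
    """Consistency: report order_id values that appear more than once."""
    seen, duplicates = set(), set()
    for record in records:
        order_id = record.get("order_id")
        if order_id in seen:
            duplicates.add(order_id)
        seen.add(order_id)
    return duplicates
```

Returning a list of problems, rather than raising on the first one, lets a pipeline log every issue in a record before deciding whether to reject it.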
Data Engineering Lifecycle
Different Roles
Different Titles
ETL

ETL stands for extract, transform, and load and is a traditionally accepted way for
organizations to combine data from multiple systems into a single database, data store,
data warehouse, or data lake.

ETL is an important way to bring all relevant data together in one place to make it
actionable: to analyze it and enable executives, managers, and other stakeholders to
make informed business decisions based on it.
Extraction
Extraction is the process of retrieving
data from one or more sources—online,
on-premises, legacy, SaaS, or others.
After the retrieval, or extraction, is
complete, the data is loaded into a
staging area.
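
The extraction step described above can be sketched as follows. This is a minimal example under stated assumptions: the sources are a CSV export and a JSON export passed in as text, and the staging area is a local directory. The function names and the file layout are hypothetical.

```python
from __future__ import annotations

import csv
import io
import json
from pathlib import Path


def extract_csv(text: str) -> list[dict]:
    """Read rows from a CSV source (passed in as text for simplicity)."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text: str) -> list[dict]:
    """Read records from a JSON source that holds a list of objects."""
    return json.loads(text)


def stage(records: list[dict], staging_dir: Path, name: str) -> Path:
    """Write the raw records, unchanged, to a file in the staging area."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    path = staging_dir / f"{name}.json"
    path.write_text(json.dumps(records), encoding="utf-8")
    return path
```

Note that nothing is cleaned at this point: the staged file holds the data as it arrived, so the transformation step can be re-run without going back to the source.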
Data Sources have Several Forms
What is structured data?

Structured data is organized in a clear, predefined format. The standardized nature of
structured data makes it easily decipherable by data analytics tools, machine learning
algorithms and human users.

Structured data can include both quantitative data (such as prices or revenue figures)
and qualitative data (such as dates, names, addresses and credit card numbers).

For example, a financial report with company names, expense values and reporting
Example
What is unstructured data?

Unstructured data does not have a predefined format.

Unstructured data can contain both textual and non-textual data and both qualitative
(social media comments) and quantitative (figures embedded in text) data.

Examples of unstructured data from textual data sources include emails, text
documents, social media posts, call transcripts, and message text files, such as those
from Microsoft Teams or Slack.

Examples of non-textual unstructured data include image files (JPEG, GIF and PNG),
Example
Semi-structured
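
As an illustration, semi-structured data such as JSON carries its own field names but does not force every record into the same fixed schema. The sketch below uses made-up records and shows one possible way to flatten them into a fixed set of columns.

```python
import json

# Two hypothetical JSON records: both are self-describing,
# but they do not share the same set of fields.
raw = """
[
  {"id": 1, "name": "Sara", "email": "sara@example.com"},
  {"id": 2, "name": "Omar", "phones": ["0100", "0111"], "address": {"city": "Cairo"}}
]
"""

records = json.loads(raw)

# Collect every field name that appears in at least one record.
all_fields = sorted({key for record in records for key in record})

# Flatten into a fixed set of columns, filling gaps with None.
# This is one way to move semi-structured data toward a structured table.
rows = [[record.get(field) for field in all_fields] for record in records]
```

Nested values (the `address` object, the `phones` list) remain nested after this step; a real pipeline would need a rule for splitting or serializing them.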
ETL: Transformation

Transformation involves taking the extracted data, cleaning it, and putting it into a
common format, so it can be stored in the target database, data store, data warehouse,
or data lake. Cleaning typically involves taking out duplicate, incomplete, or obviously
erroneous records.
Example: Raw Data (Before Transformation)
Cleaned Data (After Transformation)
Transformation Steps

● Convert date formats to a standard format (YYYY-MM-DD)

● Remove duplicates and NULL values
● Convert negative amounts to absolute values (if applicable)
● Standardize country names
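
The four steps above can be sketched in plain Python. This is a minimal example, not a definitive implementation: the column names (`date`, `amount`, `country`), the accepted input date formats, and the country mapping are all assumptions made for illustration.

```python
from __future__ import annotations

from datetime import datetime

# Hypothetical input formats and country spellings, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")
COUNTRY_NAMES = {"usa": "United States", "u.s.": "United States", "uk": "United Kingdom"}


def standardize_date(value: str) -> str:
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def transform(rows: list[dict]) -> list[dict]:
    cleaned, seen = [], set()
    for row in rows:
        # Drop rows that contain NULL (None or empty) values.
        if any(v is None or v == "" for v in row.values()):
            continue
        country = row["country"].strip()
        record = {
            "date": standardize_date(row["date"]),
            # Negative amounts become absolute values (only if that suits the data).
            "amount": abs(float(row["amount"])),
            "country": COUNTRY_NAMES.get(country.lower(), country),
        }
        # Drop duplicates, compared after cleaning so that
        # "usa" and "United States" count as the same value.
        key = tuple(record.values())
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned
```

Deduplicating after standardization is a deliberate choice here: two rows that differ only in date format or country spelling would otherwise both survive.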
ETL: Load

Loading is the process of inserting that formatted data into the target database,
data store, data warehouse, or data lake.

Scenario

A company collects daily sales data from multiple branches. After extracting the
data and transforming it (cleaning, formatting), it needs to be loaded into a
centralized “database” for reporting.
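
The scenario above can be sketched with Python's built-in `sqlite3` module standing in for the centralized database. The table name `daily_sales`, its columns, and the choice of SQLite are assumptions for illustration; a production warehouse would use its own loader.

```python
from __future__ import annotations

import sqlite3


def load_sales(conn: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> int:
    """Insert (branch, sale_date, amount) rows and return the table's row count."""
    # Using the connection as a context manager commits on success
    # and rolls back if an error is raised.
    with conn:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS daily_sales (
                branch TEXT NOT NULL,
                sale_date TEXT NOT NULL,
                amount REAL NOT NULL,
                PRIMARY KEY (branch, sale_date)
            )
            """
        )
        # INSERT OR REPLACE keeps a re-run for the same branch and day
        # from creating duplicate rows.
        conn.executemany(
            "INSERT OR REPLACE INTO daily_sales (branch, sale_date, amount) VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

Keying the table on branch and date makes the load safe to repeat, which matters when a daily job is retried after a failure.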
Back to the lifecycle: Storage
Storage
Storage is the cornerstone of the data engineering lifecycle and underlies its major
stages—ingestion, transformation, and serving. Data gets stored many times as it
moves through the lifecycle. To paraphrase an old saying, it’s storage all the way
down.
Storage Components
Different Types of Storage
Data Engineering Storage Abstractions

● Data warehouses
● Data lakes
● Data lakehouses
Comparison

Feature | Data Warehouse | Data Lake | Data Lakehouse
Data Types | Structured data only | Structured, semi-structured, unstructured | Structured, semi-structured, unstructured
Storage Cost | High (optimized for structured data) | Low (scalable for large volumes) | Moderate (balances cost & performance)
Performance | Fast for structured queries | Slower without optimization | Optimized for both structured & unstructured data
OLTP Vs. OLAP
ETL Vs. ELT
Uber use-case
Technology & Tools
Programming: Python

Cloud: Microsoft Azure

Storage Tools: MySQL/PostgreSQL

Querying: SQL

Processing: Apache Spark, Hadoop.

Orchestration: Apache Airflow

Course Modules
Thank you
