Introduction to Data Engineering
Agenda
1. What is Data?
2. Why Is Data Important?
3. What is Data Engineering?
4. Data Engineering Lifecycle
5. Different Roles & Titles in Data Engineering
6. ETL (Extract, Transform, Load) Process
7. Types of Data
8. Storage in Data Engineering
9. OLTP vs. OLAP
10. ETL vs. ELT
What is Data?
Data is a collection of facts, information, and statistics. It can take various forms,
such as sound, images, or any other format.
What could be the data these companies have?
Facebook
Netflix
Spotify
Noon
Banks
Vodafone
Company             Types of Data Collected
Facebook (Meta)     User profiles, posts, likes, friends, ads, time spent.
Netflix             Watch history, preferences, device used, search history, ratings.
Spotify             Songs played, playlists, skips, search history, ads, subscriptions.
Noon (E-commerce)   Browsing & purchase history, cart activity, payment, reviews.
Banks               Transactions, loans, card usage, fraud detection, customer support.
Seller Companies    Sales, customer details, payment trends, inventory.
Data is the new oil
Why Is Data Important?
1) Decision-making
2) Problem solving
3) Understanding
4) Improving processes
5) Understanding customers
What is Data Engineering?
Data engineering is the practice of designing and building systems for the
aggregation, storage and analysis of data at scale. Data engineers empower
organizations to get insights in real time from large datasets.
What Does a Data Engineer Do?
➢ Design & Build Data Pipelines
○ Collect, transform, and move data efficiently.
➢ Develop & Manage Databases
○ Store structured & unstructured data for easy access.
➢ Ensure Data Quality & Integrity
○ Validate data accuracy, consistency, and reliability.
➢ Collaborate with Teams
○ Work with data scientists and analysts.
➢ Optimize Data Workflows
○ Automate processes and improve efficiency.
Data Engineering Lifecycle
Different Roles
Different Titles
ETL
ETL stands for extract, transform, and load
and is a traditionally accepted way for
organizations to combine data from multiple
systems into a single database, data store,
data warehouse, or data lake.
ETL is an important way to bring all relevant
data together in one place to make it
actionable—to analyze it and enable
executives, managers, and other
stakeholders to make informed business
decisions based on it.
Extraction
Extraction is the process of retrieving
data from one or more sources—online,
on-premises, legacy, SaaS, or others.
After the retrieval, or extraction, is
complete, the data is loaded into a
staging area.
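The extraction step above can be sketched in Python as follows. This is only a minimal illustration, not a production extractor: the two "sources" are small inline strings standing in for a CSV export and a JSON API response, the staging area is a plain list, and the names `extract_csv`, `extract_json`, `legacy_csv` and `saas_api` are hypothetical.

```python
import csv
import io
import json

# Hypothetical raw exports from two source systems, inlined so the sketch runs as-is.
CSV_EXPORT = "order_id,amount\n1,19.99\n2,5.00\n"
JSON_EXPORT = '[{"order_id": 3, "amount": 42.5}]'


def extract_csv(text, source):
    """Read rows from a CSV export and tag each row with its source system."""
    return [{"source": source, **row} for row in csv.DictReader(io.StringIO(text))]


def extract_json(text, source):
    """Read records from a JSON export and tag each record with its source system."""
    return [{"source": source, **record} for record in json.loads(text)]


def extract_to_staging():
    """Collect raw, untransformed records from every source into one staging list."""
    staging = []
    staging.extend(extract_csv(CSV_EXPORT, "legacy_csv"))
    staging.extend(extract_json(JSON_EXPORT, "saas_api"))
    return staging


staging_area = extract_to_staging()
```

Note that the staged records are deliberately left raw: the CSV amounts are still strings while the JSON amount is a number. Reconciling such differences belongs to the transformation step.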
Data Sources have Several Forms
What is structured data?
Structured data is organized in a clear, predefined format. The standardized nature of
structured data makes it easily decipherable by data analytics tools, machine learning
algorithms and human users.
Structured data can include both quantitative data (such as prices or revenue figures)
and qualitative data (such as dates, names, addresses and credit card numbers).
For example, a financial report with company names, expense values and reporting
Example
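As a rough sketch of what "a clear, predefined format" means in practice, the Python snippet below models a few rows of a financial report with a fixed set of fields. The class name, field names and sample values are all hypothetical and chosen only for illustration; in particular, `reporting_period` is an assumed column.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExpenseRecord:
    """One row of a hypothetical financial report with a fixed schema."""
    company_name: str      # qualitative field
    reporting_period: str  # qualitative field (assumed column)
    expense_value: float   # quantitative field


report = [
    ExpenseRecord("Acme Ltd", "2024-Q1", 1200.50),
    ExpenseRecord("Globex", "2024-Q1", 830.00),
]

# Because every row has the same fields, tools can address columns by name.
column_names = [f.name for f in fields(ExpenseRecord)]
total_expenses = sum(row.expense_value for row in report)
```

Because every record follows the same schema, an aggregate such as `total_expenses` needs no parsing or guessing about where the numbers live.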
What is unstructured data?
Unstructured data does not have a predefined format.
Unstructured data can contain both textual and non-textual data and both qualitative
(social media comments) and quantitative (figures embedded in text) data.
Examples of unstructured data from textual sources include emails, text documents,
social media posts, call transcripts, and message text files, such as those from
Microsoft Teams or Slack.
Examples of non-textual unstructured data include image files (JPEG, GIF and PNG).
Example
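To illustrate "figures embedded in text", the short Python sketch below pulls numbers out of a free-text comment with a regular expression. The comment text is made up, and the pattern is a simplifying assumption that only handles plain integers and decimals.

```python
import re

# Hypothetical social media comment: free text with no predefined fields.
comment = "Ordered 3 items for 49.99 USD, delivery took 5 days. Great service!"

# Match integers and simple decimals embedded in the text.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_figures(text):
    """Return every number found in the text as a float, in order of appearance."""
    return [float(match) for match in NUMBER_PATTERN.findall(text)]


figures = extract_figures(comment)
```

The numbers can be recovered, but nothing in the text says which one is a quantity, a price or a duration. That missing context is what a predefined format would otherwise provide.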
Semi-structured
ETL: Transformation
Transformation involves taking that data, cleaning it, and putting it into a common
format, so it can be stored in a targeted database, data store, data warehouse, or
data lake. Cleaning typically involves taking out duplicate, incomplete, or obviously
erroneous records.
Example: Raw Data (Before Transformation)
Cleaned Data (After Transformation)
Transformation Steps
● Convert date formats to a standard format (YYYY-MM-DD)
● Remove duplicates and NULL values
● Convert negative amounts to absolute values (if applicable)
● Standardize country names
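The four transformation steps above can be sketched in plain Python as follows. The raw rows, the accepted date formats (slashed dates are assumed to be day/month/year) and the country lookup table are hypothetical sample data, not taken from the slides.

```python
from datetime import datetime

# Hypothetical raw rows; formats and values are assumptions for illustration.
raw_rows = [
    {"order_id": 1, "date": "03/01/2024", "amount": -50.0, "country": "usa"},
    {"order_id": 1, "date": "03/01/2024", "amount": -50.0, "country": "usa"},  # duplicate
    {"order_id": 2, "date": "2024-01-04", "amount": 20.0, "country": "U.S.A."},
    {"order_id": 3, "date": "2024-01-05", "amount": None, "country": "Egypt"},  # NULL amount
    {"order_id": 4, "date": "2024/01/06", "amount": 75.5, "country": "EG"},
]

# Assumed input formats; "03/01/2024" is read as day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")
COUNTRY_NAMES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "eg": "Egypt",
    "egypt": "Egypt",
}


def standardize_date(value):
    """Try each known input format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


def transform(rows):
    """Apply the four cleaning steps and return the cleaned rows."""
    cleaned, seen = [], set()
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # remove rows containing NULL values
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # remove exact duplicates
        seen.add(key)
        country_key = row["country"].strip().lower()
        cleaned.append({
            "order_id": row["order_id"],
            "date": standardize_date(row["date"]),
            "amount": abs(row["amount"]),  # negative amounts become absolute values
            "country": COUNTRY_NAMES.get(country_key, row["country"]),
        })
    return cleaned


cleaned_rows = transform(raw_rows)
```

In this sketch, rows with NULL values are dropped rather than filled in, and unknown country spellings are passed through unchanged. Both are design choices that a real pipeline would settle with the data owners.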
ETL: Load
Loading is the process of inserting that formatted data into the target database,
data store, data warehouse, or data lake.
Scenario
A company collects daily sales data from multiple branches. After extracting the
data and transforming it (cleaning, formatting), it needs to be loaded into a
centralized “database” for reporting.
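A minimal sketch of the load step for this scenario is shown below. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the centralized database; the table name `daily_sales`, its columns and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical cleaned rows from two branches: (branch, sale_date, amount).
cleaned_sales = [
    ("Cairo", "2024-01-03", 50.0),
    ("Cairo", "2024-01-03", 20.0),
    ("Alexandria", "2024-01-03", 75.5),
]


def load_sales(conn, rows):
    """Create the target table if needed and insert all rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS daily_sales ("
            "branch TEXT NOT NULL, sale_date TEXT NOT NULL, amount REAL NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO daily_sales (branch, sale_date, amount) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the central database
load_sales(conn, cleaned_sales)

# A reporting query over the loaded data: total sales per branch.
totals = dict(
    conn.execute(
        "SELECT branch, SUM(amount) FROM daily_sales GROUP BY branch"
    ).fetchall()
)
```

Once the rows are loaded, reporting becomes an ordinary SQL query; here `totals` maps each branch to its summed sales for the day.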
Back to the lifecycle: Storage
Storage
Storage is the cornerstone of the data engineering lifecycle and underlies its major
stages—ingestion, transformation, and serving. Data gets stored many times as it
moves through the lifecycle. To paraphrase an old saying, it’s storage all the way
down.
Storage Components
Different Types of Storage
Data Engineering Storage Abstractions
● Data warehouses
● Data lakes
● Data lakehouses
Comparison
Feature        Data Warehouse                         Data Lake                                   Data Lakehouse
Data Types     Structured data only                   Structured, semi-structured, unstructured   Structured, semi-structured, unstructured
Storage Cost   High (optimized for structured data)   Low (scalable for large volumes)            Moderate (balances cost & performance)
Performance    Fast for structured queries            Slower without optimization                 Optimized for both structured & unstructured data
OLTP vs. OLAP
ETL vs. ELT
Uber use-case
Technology & Tools
Programming: Python
Cloud: Microsoft Azure
Storage Tools: MySQL/PostgreSQL
Querying: SQL
Processing: Apache Spark, Hadoop
Orchestration: Apache Airflow
Course Modules
Thank you