Overview of Data Engineering - Updated

The document provides an introduction to data engineering, covering key concepts such as the definition of data, its importance, and the data engineering lifecycle. It outlines the roles of data engineers, the ETL process, and different types of data storage solutions. Additionally, it discusses structured, unstructured, and semi-structured data, along with various technologies and tools used in data engineering.

Uploaded by

Jana Mohamed

Introduction to Data Engineering
Agenda
1. What is Data?

2. Why Is Data Important?

3. What is Data Engineering?

4. Data Engineering Lifecycle

5. Different Roles & Titles in Data Engineering

6. ETL (Extract, Transform, Load) Process

7. Types of Data

8. Storage in Data Engineering

9. OLTP vs. OLAP

10. ETL vs. ELT


What is Data?

Data is a collection of facts, information, and statistics. It can take various forms,
such as sound, images, or any other format.
What could be the data these companies have?

Facebook

Netflix

Spotify

Noon

Banks

Vodafone
Company           | Types of Data Collected
Facebook (Meta)   | User profiles, posts, likes, friends, ads, time spent.
Netflix           | Watch history, preferences, device used, search history, ratings.
Spotify           | Songs played, playlists, skips, search history, ads, subscriptions.
Noon (E-commerce) | Browsing & purchase history, cart activity, payment, reviews.
Banks             | Transactions, loans, card usage, fraud detection, customer support.
Seller Companies  | Sales, customer details, payment trends, inventory.
Data is the new oil
Why Is Data Important?
1) decision-making

2) problem solving

3) understanding

4) improving processes

5) understanding customers
What is Data Engineering?

Data engineering is the practice of designing and building systems for the
aggregation, storage and analysis of data at scale. Data engineers empower
organizations to get insights in real time from large datasets.
What Does a Data Engineer Do?

➢ Design & Build Data Pipelines
○ Collect, transform, and move data efficiently.
➢ Develop & Manage Databases
○ Store structured & unstructured data for easy access.
➢ Ensure Data Quality & Integrity
○ Validate data accuracy, consistency, and reliability.
➢ Collaborate with Teams
○ Work with data scientists and analysts.
➢ Optimize Data Workflows
○ Automate processes and improve efficiency.
Data Engineering Lifecycle
Different Roles
Different Titles
ETL

ETL stands for extract, transform, and load and is a traditionally accepted way for
organizations to combine data from multiple systems into a single database, data store,
data warehouse, or data lake.

ETL is an important way to bring all relevant data together in one place to make it
actionable: to analyze it and enable executives, managers, and other stakeholders to
make informed business decisions based on it.
Extraction
Extraction is the process of retrieving
data from one or more sources—online,
on-premises, legacy, SaaS, or others.
After the retrieval, or extraction, is
complete, the data is loaded into a
staging area.
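
A minimal sketch of the extraction step in Python. It assumes two hypothetical sources, a CSV export and a JSON payload from a SaaS API, both inlined as strings so the example is self-contained. The staging area is modeled as an in-memory list; the source names and field names are illustrative and do not come from the slides.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export from an on-premises system.
CSV_SOURCE = "order_id,amount\n1,100.50\n2,75.00\n"

# Hypothetical source 2: a JSON payload from a SaaS API.
JSON_SOURCE = '[{"order_id": 3, "amount": 20.0}]'


def extract_csv(text):
    """Read rows from CSV text as dictionaries (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Parse a JSON array of records."""
    return json.loads(text)


def extract_to_staging():
    """Copy raw records from every source into a staging list, untouched."""
    staging = []
    for source_name, rows in (
        ("csv_export", extract_csv(CSV_SOURCE)),
        ("saas_api", extract_json(JSON_SOURCE)),
    ):
        for row in rows:
            # Tag each record with its origin; no cleaning happens here.
            staging.append({"source": source_name, "raw": row})
    return staging


staging_area = extract_to_staging()
print(len(staging_area), "records staged")
```

Note that the staged records are deliberately left raw: the CSV values are still strings while the JSON values are numbers. Reconciling such differences belongs to the transformation step.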
Data Sources have Several Forms
What is structured data?

Structured data is organized in a clear, predefined format. The standardized nature of
structured data makes it easily decipherable by data analytics tools, machine learning
algorithms and human users.

Structured data can include both quantitative data (such as prices or revenue figures)
and qualitative data (such as dates, names, addresses and credit card numbers).
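
One way to picture a "predefined format" is a record type with fixed, typed fields. The sketch below uses a Python dataclass; the class name, field names, and values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExpenseRecord:
    """Every record has the same fields, in the same order, with the same types."""
    company: str        # qualitative
    report_date: date   # qualitative
    expense: float      # quantitative


# Hypothetical rows that all follow the predefined format.
records = [
    ExpenseRecord("Acme", date(2024, 1, 31), 1250.0),
    ExpenseRecord("Globex", date(2024, 1, 31), 980.5),
]

# Because the format is fixed, tools can aggregate a field without guessing.
total_expense = sum(r.expense for r in records)
print(total_expense)
```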

For example, a financial report with company names, expense values and reporting
Example
What is unstructured data?

Unstructured data does not have a predefined format.

Unstructured data can contain both textual and non-textual data and both qualitative
(social media comments) and quantitative (figures embedded in text) data.

Examples of unstructured data from textual data sources include emails, text
documents, social media posts, call transcripts, and message text files, such as those
from Microsoft Teams or Slack.

Examples of non-textual unstructured data include, Image files (JPEG, GIF and PNG),
Example
Semi-structured
ETL: Transformation

Transformation involves taking that data, cleaning it, and putting it into a common
format, so it can be stored in a targeted database, data store, data warehouse, or
data lake. Cleaning typically involves taking out duplicate, incomplete, or obviously
erroneous records.
Example: Raw Data (Before Transformation)
Cleaned Data (After Transformation)
Transformation Steps

● Convert date formats to a standard format (YYYY-MM-DD)
● Remove duplicates and NULL values
● Convert negative amounts to absolute values (if applicable)
● Standardize country names
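
The transformation steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the slides' own implementation: the sample rows, the two accepted input date formats, and the country-name mapping are all hypothetical. Taking the absolute value of an amount is only appropriate when a negative sign is known to be an entry error.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative.
RAW_ROWS = [
    {"id": 1, "date": "03/01/2024", "amount": -50.0, "country": "USA"},
    {"id": 1, "date": "03/01/2024", "amount": -50.0, "country": "USA"},  # duplicate
    {"id": 2, "date": "2024-01-04", "amount": 20.0, "country": "egypt"},
    {"id": 3, "date": None, "amount": 10.0, "country": "U.S."},  # NULL date
]

# Assumed input date formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

# Assumed mapping from known variants to one standard country name.
COUNTRY_NAMES = {"usa": "United States", "u.s.": "United States", "egypt": "Egypt"}


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def transform(rows):
    """Apply the four cleaning steps and return the cleaned rows."""
    cleaned, seen = [], set()
    for row in rows:
        # Drop rows that contain any NULL (None) value.
        if any(v is None for v in row.values()):
            continue
        country_key = row["country"].strip().lower()
        record = {
            "id": row["id"],
            "date": to_iso_date(row["date"]),
            "amount": abs(row["amount"]),
            "country": COUNTRY_NAMES.get(country_key, row["country"]),
        }
        # Drop exact duplicates of a row already kept.
        key = tuple(record.items())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned_rows = transform(RAW_ROWS)
for r in cleaned_rows:
    print(r)
```

Of the four input rows, two survive: the duplicate and the row with a NULL date are dropped, and the remaining rows carry ISO dates, non-negative amounts, and standardized country names.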
ETL: Load

Loading is the process of inserting that formatted data into the target database,
data store, data warehouse, or data lake.

Scenario

A company collects daily sales data from multiple branches. After extracting the
data and transforming it (cleaning, formatting), it needs to be loaded into a
centralized “database” for reporting.
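
A minimal sketch of this scenario using Python's built-in sqlite3 module. An in-memory SQLite database stands in for the centralized database; the table name, column names, branch names, and amounts are hypothetical.

```python
import sqlite3

# Hypothetical cleaned rows from two branches: (branch, sale_date, amount).
CLEANED_SALES = [
    ("Cairo", "2024-01-03", 120.0),
    ("Cairo", "2024-01-03", 80.0),
    ("Alexandria", "2024-01-03", 50.0),
]


def load_sales(conn, rows):
    """Create the target table if needed and insert all rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS daily_sales ("
            "branch TEXT NOT NULL, sale_date TEXT NOT NULL, amount REAL NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO daily_sales (branch, sale_date, amount) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the central database
load_sales(conn, CLEANED_SALES)

# A simple report: total sales per branch.
report = conn.execute(
    "SELECT branch, SUM(amount) FROM daily_sales GROUP BY branch ORDER BY branch"
).fetchall()
print(report)
```

Using parameterized inserts (`?` placeholders) keeps the load step safe against malformed values, and wrapping the inserts in one transaction means a failed load leaves no partial batch of rows in the table.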
Back to the lifecycle: Storage
Storage
Storage is the cornerstone of the data engineering lifecycle and underlies its major
stages—ingestion, transformation, and serving. Data gets stored many times as it
moves through the lifecycle. To paraphrase an old saying, it’s storage all the way
down.
Storage Components
Different Types of Storage
Data Engineering Storage Abstractions

● Data warehouses
● Data lakes
● Data lakehouses
Comparison

Feature      | Data Warehouse                       | Data Lake                                 | Data Lakehouse
Data Types   | Structured data only                 | Structured, semi-structured, unstructured | Structured, semi-structured, unstructured
Storage Cost | High (optimized for structured data) | Low (scalable for large volumes)          | Moderate (balances cost & performance)
Performance  | Fast for structured queries          | Slower without optimization               | Optimized for both structured & unstructured data
OLTP vs. OLAP
ETL vs. ELT
Uber use-case
Technology & Tools
Programming: Python

Cloud: Microsoft Azure

Storage Tools: MySQL/PostgreSQL

Querying: SQL

Processing: Apache Spark, Hadoop

Orchestration: Apache Airflow


Course Modules
Thank you
