Course Slides - Data Warehouse - the Ultimate Guide
o Receive orders
o React to complaints
o Fill up stock
Analytical: decision making
Operational: data keeping
Two requirements
✓ User friendly
Sales data
Data warehouse
CRM system
Understanding a data warehouse
ETL
Extract, Transform, Load (ETL)
[Diagram: Sales data, CRM system and other data sources -> ETL -> Data warehouse (centralized location for data)]
Goals of a data warehouse
Sales data
Meaningful insights
Better decisions
Raw data Transform
Data lake & data warehouse are BOTH used as centralized data storage

              Data Lake          Data Warehouse
Data          Raw                Processed
Technologies  Big data           Database
Structure     Unstructured       Structured
Usage         Not defined yet    Specific & ready to be used
Users         Data Scientists    Business users & IT

When to use what? Both!
Demos & Hands-on
Data Warehouse Layers
[Diagram: data sources (Department 1, Department 2) -> Staging -> Core / Access Layer / Data Warehouse -> Data Mart 1, Data Mart 2, Predictive Analytics. Labels on the diagram: ETL, Cleansing, centralized location for data.]
Data Warehouse Layers
ETL
o "Short time on the source systems"
o "Quickly extract"
o Move the data into a relational database
o Start transformations from there
data sources
Staging
[Diagram: data sources -> Staging. The source rows are copied unchanged into the staging table.]
id  date      product
1   1/2/2022  Fulltoss Tangy Tomato
2   1/2/2022  Chilli - Green, Organically Grown
3   1/2/2022  Masala Powder
4   1/2/2022  Cheese Cracker (Mcvities)
5   1/2/2022  Centre Filled Chocolate Cake
[Diagram: on the following load, new source rows dated 1/3/2022 (ids 6 and up) pass through Staging into Core / Access Layer / Data Warehouse and on to Predictive Analytics.]
Data Marts
o Subset of a DWH
o Dimensional model
o Can be further aggregated

o Usability + acceptance
o Performance

o Tools
o Departments
o Regions
o Use-cases
[Diagram: Core / Access Layer / Data Warehouse -> Data Mart 1, Data Mart 2, Predictive Analytics]
SELECT <column1>,
<column2>, ...
FROM <table_name>
Relational database: tables (relations)

Sales (id = primary key, customer_id = foreign key)
id  date      product                            customer_id
1   1/2/2022  Fulltoss Tangy Tomato              2
2   1/2/2022  Chilli - Green, Organically Grown  2
3   1/2/2022  Masala Powder                      5
4   1/2/2022  Cheese Cracker (Mcvities)          1
5   1/2/2022  Centre Filled Chocolate Cake       5

Customers (id = primary key)
id  name   city
1   Frank  New York
2   Sarah  Chicago
5   Marc   Dallas

Query (partially shown on the slide): ... customer_id, name FROM sales ...
Result with the customer name looked up through customer_id:
id  date      product                       customer_id  name
1   1/2/2022  Fulltoss Tangy Tomato         2            Sarah
4   1/2/2022  Cheese Cracker (Mcvities)     1            Frank
5   1/2/2022  Centre Filled Chocolate Cake  5            Marc

o 70s to 90s: building logic & improving performance
Examples: MySQL, Amazon Relational Database Service (RDS), Azure SQL databases
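The primary key / foreign key relationship between the sales and customer tables can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not the course's own setup: the table name customers and the reduced column list are assumptions, and the rows are taken from the sample data on the slides.

```python
import sqlite3

# In-memory database. The "customers" table name is an assumption;
# the rows follow the sample data on the slides.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,                       -- primary key
        date TEXT,
        product TEXT,
        customer_id INTEGER REFERENCES customers (id) -- foreign key
    );
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Frank"), (2, "Sarah"), (5, "Marc")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "1/2/2022", "Fulltoss Tangy Tomato", 2),
        (2, "1/2/2022", "Chilli - Green, Organically Grown", 2),
        (3, "1/2/2022", "Masala Powder", 5),
        (4, "1/2/2022", "Cheese Cracker (Mcvities)", 1),
        (5, "1/2/2022", "Centre Filled Chocolate Cake", 5),
    ],
)

# Follow the foreign key to add the customer name to every sale.
joined = conn.execute(
    """
    SELECT s.id, s.product, s.customer_id, c.name
    FROM sales AS s
    JOIN customers AS c ON c.id = s.customer_id
    ORDER BY s.id
    """
).fetchall()

for row in joined:
    print(row)
```

Each sale appears once in the result because customers.id is unique; that is what makes it usable as a join target.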
In-memory database
✓ Highly optimized for query performance
[Chart: response time, disc vs. in-memory, for a traditional database and an in-memory database]
o columnar storage,
o parallel query plans,
o and other techniques
o Durability: all information is lost when the device loses power or is reset
o Durability added through snapshots / images
o Cost factor
o Traditional DBs are also trying to reduce usage of disc
Products in RDBMS context
✓ SAP HANA
✓ Oracle In-Memory
✓ Amazon MemoryDB
OLAP Cubes
✓ Traditional DWH based on relational DBMS (ROLAP)
Customers
drill / slice & dice
Benefits
Recommendation
✓ Alternatives:
- Tabular models (SSAS)
- ROLAP
- columnar storage
ODS (Operational Data Store)
✓ (Near) real-time
✓ ODS: operational decisions
✓ DWH: analytical decisions
id  customer  total
1   Sarah     $334
2   Frank     $4234
3   Thomas    $544
4   Angela    $4332
5   Kate      $460
[Diagram: CRM system and other data sources feed the ODS ((near) real-time, operational decisions) and the DWH (analytical decisions).]
ODS - Sequential
[Diagram: the data sources feed the ODS first; the DWH is then loaded from the ODS.]
The different layers
Staging
✓ Landing zone
✓ Minimal transformation
✓ "Stage" the data in tables
Core
✓ Always there
✓ Business logic & single point of truth
✓ Can sometimes be the access layer
Mart
✓ Access layer
✓ Specific to one use-case
✓ Optimized for performance
What is dimensional modeling?
Dimensional modeling
✓ Facts: what is measured (e.g. profit)
✓ Dimensions: what the facts are analyzed by (e.g. profit by year, profit by category)
✓ Star schema: the facts in the center, surrounded by the dimensions
✓ Goals: performance and usability
Sample fact rows:
2 20220102 5 2 $12
3 20220102 6 5 $93
4 20220102 23 1 $23
5 20220102 16 5 $21
Facts
o Foundation of DWH
o Key measurements
o Aggregated and analyzed
Usually…
o Aggregatable (numerical values)
o Measurable vs. descriptive
o Event- or transactional data
o Date/time in a fact table

Star schema example
Sales (facts): sales_id, product_id, customer_id, units, price
Dim_Customer (dimension): customer_id, first name, last name, sex, city
Dim_Product (dimension): product_id, name, category, subcategory
Dim_Date (dimension): date_id, year, quarter, month, week, day, weekday, holiday_flag
Dimensions
o Categorizes facts
o Supportive & descriptive
o Filtering, Grouping & Labeling
Usually…
o Non-aggregatable
o Measurable vs. descriptive
o (More) static
Star schema

Normalized
o Technique to avoid redundancy
o Minimizes storage
o Performance (write / update)
o Many tables
o Many joins necessary

Denormalized
o There is data redundancy!
o Optimized to get data out
o Query performance (read)
o User experience

Facts (product_id and customer_id are foreign keys, FK)
sales_id  product_id  customer_id  units  price
1         3           23           1      2.99
2         5           13           1      1.99
3         2           7            2      3.49
4         3           16           1      2.29
5         3           13           5      1.49

Dimensions (product_id is the primary key, PK)
product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks
✓ Dimensions
product_id name category sub_category
1 Chili Herbs Spices
2 Garlic Fruits & Vegetables Vegetable
3 Banana Fruits & Vegetables Fruits
4 Chocolate Sweets & Snacks Sweets
5 Chips Sweets & Snacks Snacks
Snowflake schema
Additive facts

Facts
sales_id  product_id  date_id   units  price
1         3           20220101  1      2.99
2         5           20220102  1      1.99
3         2           20220102  2      3.49
4         3           20220103  1      2.29
5         3           20220104  5      1.49

Product dimension
product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks

Date dimension
date_id   Date        Day  Month
20220101  01/01/2022  1    1
20220102  02/01/2022  2    1
20220103  03/01/2022  3    1
20220104  04/01/2022  4    1
20220105  05/01/2022  5    1

Units can be added up across every dimension:

category             units
Herbs                0
Fruits & Vegetables  9
Sweets & Snacks      1

name       units
Chili      0
Garlic     2
Banana     7
Chocolate  0
Chips      1

Date        units
01/01/2022  1
02/01/2022  3
03/01/2022  1
04/01/2022  5
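An additive fact gives a meaningful total whichever dimension it is grouped by. The sketch below is a plain-Python illustration built on the sample rows from the slides; the helper name sum_units is hypothetical.

```python
from collections import defaultdict

# Fact rows from the slides: (sales_id, product_id, date_id, units, price)
sales = [
    (1, 3, 20220101, 1, 2.99),
    (2, 5, 20220102, 1, 1.99),
    (3, 2, 20220102, 2, 3.49),
    (4, 3, 20220103, 1, 2.29),
    (5, 3, 20220104, 5, 1.49),
]

# Product dimension: product_id -> (name, category)
products = {
    1: ("Chili", "Herbs"),
    2: ("Garlic", "Fruits & Vegetables"),
    3: ("Banana", "Fruits & Vegetables"),
    4: ("Chocolate", "Sweets & Snacks"),
    5: ("Chips", "Sweets & Snacks"),
}


def sum_units(group_key):
    """Sum the additive fact `units`, grouped by the key returned by group_key."""
    totals = defaultdict(int)
    for _sales_id, product_id, date_id, units, _price in sales:
        totals[group_key(product_id, date_id)] += units
    return dict(totals)


units_by_category = sum_units(lambda product_id, date_id: products[product_id][1])
units_by_product = sum_units(lambda product_id, date_id: products[product_id][0])
units_by_date = sum_units(lambda product_id, date_id: date_id)

print(units_by_category)
print(units_by_product)
print(units_by_date)
```

Groups without any fact rows (Herbs, Chili, Chocolate) simply do not appear in the dictionaries; a report would show them as 0 or null.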
Additivity
✓ Example: Balance
Non-additive facts
o Price
o Percentages
o Ratios

Facts
sales_id  product_id  date_id   units  price
1         3           20220101  1      2.99
2         5           20220102  1      1.99
3         2           20220102  2      3.49
4         3           20220103  1      2.29
5         3           20220104  5      1.49

Product dimension
product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks

Adding up the price per category does not give a meaningful number:
category             price
Herbs                $0
Fruits & Vegetables  $10.26
Sweets & Snacks      $1.99
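One common way to deal with a non-additive fact such as price is to store or derive an additive one (here units x price, called revenue in this sketch) and compute ratios only after aggregating. The sketch below uses the sample rows from the slides; the name revenue and the variable names are assumptions for illustration.

```python
# Fact rows from the slides: (sales_id, product_id, units, price)
sales = [
    (1, 3, 1, 2.99),
    (2, 5, 1, 1.99),
    (3, 2, 2, 3.49),
    (4, 3, 1, 2.29),
    (5, 3, 5, 1.49),
]

# Products 2 (Garlic) and 3 (Banana) belong to "Fruits & Vegetables".
fruit_veg_rows = [row for row in sales if row[1] in (2, 3)]

# Summing the price column reproduces the slide's number, which has no business meaning.
summed_price = round(sum(price for _id, _pid, _units, price in fruit_veg_rows), 2)

# Revenue (units * price) is additive, so its total is meaningful.
revenue = round(sum(units * price for _id, _pid, units, price in fruit_veg_rows), 2)
units_sold = sum(units for _id, _pid, units, _price in fruit_veg_rows)

# A ratio is calculated from the aggregated additive facts, not by summing ratios.
average_price_per_unit = round(revenue / units_sold, 2)

print(summed_price, revenue, units_sold, average_price_per_unit)
```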
Additivity
✓ Typically additive
✓ Typically additive
✓ No events = null or 0
Measures per week
week_id  revenue  sales  cost
1        323      123    12
2        541      322    31
3        242      108    12
4        352      212    51
5        312      198    25

Measures per day
day_id  no. calls  missed calls  duration
1       31         3             432
2       25         4             142
3       52         2             134
4       23         6             562
5       53         4             122
Order production
Characteristics
✓ Least common
Order production
Types of fact tables
Type:             Transactional | Periodic Snapshot | Accumulating Snapshot
Date dimensions:  1 transaction date | Snapshot date (end of period) | Multiple snapshot dates
Size:             Largest (most detailed grain) | Middle (less detailed grain) | Lowest (highest aggregation)
Other comparison rows: Grain, No. of dimensions, Facts, Performance
4 key decisions
Considering the
business needs Tables & columns
Steps to create a fact table
1) Identify business process for analysis
Example: Sales, Order processing
sales_id  date        Sales amount
1         2022-01-01  $41
2         2022-01-02  $15
3         2022-01-02  $24
4         2022-01-03  $13
5         2022-01-04  $52
2) Declare the grain
Example: Transaction, Order, Order lines, Daily, Daily + location
Factless fact table
✓ Records the occurrence of events
✓ No metrics
Examples: employee registration, promo events

Promo
promo_id  date_id   prod_id  channel_id  campaign_id
1         20220102  5        2           3
2         20220103  3        3           4
3         20220103  4        6           3
4         20220104  4        8           6
5         20220104  3        4           8
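A factless fact table has no measure column, so analysis means counting rows. A minimal sketch with the promo rows from the slides (the variable names are assumptions):

```python
from collections import Counter

# Promo rows from the slides: (promo_id, date_id, prod_id, channel_id, campaign_id)
promo_fact = [
    (1, 20220102, 5, 2, 3),
    (2, 20220103, 3, 3, 4),
    (3, 20220103, 4, 6, 3),
    (4, 20220104, 4, 8, 6),
    (5, 20220104, 3, 4, 8),
]

# The only "measure" is the number of events per dimension key.
events_per_date = Counter(date_id for _promo, date_id, _prod, _channel, _campaign in promo_fact)
events_per_campaign = Counter(campaign_id for *_other, campaign_id in promo_fact)

print(events_per_date)
print(events_per_campaign)
```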
Natural vs. Surrogate key
Natural keys
[Diagram: Products and Sales tables linked through natural keys]
Surrogate key
✓ Artificial keys
✓ Integer number
✓ _PK or _FK suffix
✓ Created by the database / ETL tool
Benefits
Corporate IT
E-Commerce company
✓ 3 websites
Data collection
✓ Warehouse data
Case study: E-Commerce
Goals
✓ Logistics in warehouse
✓ Maximizing profits
❖ Profit margin, sales volume, product cost, promotions, discounts
Case study: E-Commerce
Step 1 Identify Business process
Case study: E-Commerce
Step 3 Identify dimensions
❖ Customer
❖ Products
❖ Promotions
❖ Time/date
❖ Website
Case study: E-Commerce
Step 4 Identify facts for measurement
✓ Additive
❖ Discount absolute (yes?)
❖ Discount percentage (no?)
❖ Profit
Case study: E-Commerce
Result
Dimension tables
✓ Lookup table
Product_PK  Product_ID
1           P001
2           P002
3           P003
✓
Flattened dimension
Hierarchies in dimensions
Conformed dimension
✓ A dimension with shared attributes that is used by more than one fact table, e.g. Time/Date, Region
✓ Drill across: Sales Fact and Cost Fact are compared through the conformed dimension

✓ Sales fact
Sales_PK  Sales   Date_FK
1         $9,400  20220101
2         $7,300  20220101
3         $5,100  20220102

✓ Cost fact
Cost_PK  Cost    Date_FK
1        $7,200  20220101
2        $1,900  20220102
3        $2,800  20220103

Different FK possible! The cost fact can also reference the date at month level:
Cost_PK  Cost    DateMonth_FK
1        $7,200  2022-01
2        $1,900  2022-02
3        $2,800  2022-03
Conformed dimension
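Drill across can be pictured as aggregating each fact table separately by the shared date key and then lining the results up. The sketch below uses the sales and cost rows from the slides with a day-level Date_FK in both tables; the function names are hypothetical.

```python
from collections import defaultdict

# (Sales_PK, Sales, Date_FK) and (Cost_PK, Cost, Date_FK), as on the slides.
sales_fact = [(1, 9400, 20220101), (2, 7300, 20220101), (3, 5100, 20220102)]
cost_fact = [(1, 7200, 20220101), (2, 1900, 20220102), (3, 2800, 20220103)]


def total_by_date(fact_rows):
    """Aggregate one fact table by its date foreign key."""
    totals = defaultdict(int)
    for _pk, amount, date_fk in fact_rows:
        totals[date_fk] += amount
    return totals


def drill_across(sales_rows, cost_rows):
    """Combine both aggregates on the conformed date key."""
    sales = total_by_date(sales_rows)
    cost = total_by_date(cost_rows)
    all_dates = sorted(set(sales) | set(cost))
    return {d: {"sales": sales.get(d, 0), "cost": cost.get(d, 0)} for d in all_dates}


report = drill_across(sales_fact, cost_fact)
for date_key, values in report.items():
    print(date_key, values)
```

Aggregating before combining matters: joining the two fact tables row by row on the date would duplicate amounts whenever a date has several rows on either side.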
1. Eliminate them if they are not relevant (but what if they are relevant?)
2. Leave them as they are in the fact (long text values? table size?)
3. One flag => one dimension (very wide fact table?)
Note:
We call it "junk dimension" usually only internally.
Talking to business users we can refer to as
"transactional indicator dimension".
Junk dimensions
Flag_PK  Payment_Type  Incoming / Outbound  Is_Bonus
1        Wired         Incoming             Yes
2        Wired         Incoming             No
3        Wired         Outbound             Yes
4        Wired         Outbound             No
Junk dimensions
Payment_Type  Amount
Wired         $5350
Credit Card   $6553
Cash          $6754

Is_Bonus  Amount
Yes       $9350
No        $11857
Junk dimensions
Number of combinations
3 x 2 x 2 = 12
Transaction_PK Amount Payment_Type Incoming / Outbound Is_Bonus
1 $530 Wired Incoming Yes
2 $553 Credit Card Outbound No
3 $654 Cash Incoming No
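The 3 x 2 x 2 = 12 combinations can be generated up front, so that every fact row only stores one Flag_PK. A minimal sketch; the dictionary keys are written with underscores and the lookup structure is an assumption for illustration.

```python
from itertools import product

payment_types = ["Wired", "Credit Card", "Cash"]
directions = ["Incoming", "Outbound"]
bonus_flags = ["Yes", "No"]

# One junk-dimension row per combination, numbered with a surrogate key.
junk_dimension = [
    {"Flag_PK": key, "Payment_Type": p, "Incoming_Outbound": d, "Is_Bonus": b}
    for key, (p, d, b) in enumerate(product(payment_types, directions, bonus_flags), start=1)
]

# Lookup used while loading facts: flag combination -> Flag_PK.
flag_pk_by_combination = {
    (row["Payment_Type"], row["Incoming_Outbound"], row["Is_Bonus"]): row["Flag_PK"]
    for row in junk_dimension
}

print(len(junk_dimension))
print(flag_pk_by_combination[("Credit Card", "Outbound", "No")])
```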
Month (Orders received)  Products
January                  2500
February                 2700
…                        …

Month (Production started)  Products
January                     2650
February                    2450
…                           …
2. Business users + IT
✓ "Original"
UPDATE
✓ Very simple
UPDATE
UPDATE
More significant
Product_Key  Name                         Category
1            Sunglasses TR-7              Accessories
2            Chocolate bar 70% cacao      Sweets
3            Delicious Oat meal biscuits  Biscuits
Not so significant
Type 2: New row
Add row (instead of UPDATE)
Product_Key  Name                         Category
1            Sunglasses TR-7              Accessories
2            Chocolate bar 70% cacao      Sweets
3            Oat meal biscuits            Sweets
4            Delicious Oat meal biscuits  Biscuits
Type 2: New row
Respecting history
❑ No updates in fact
❑ From that moment new FK
Category     Amount
Accessories  $25
Sweets       $3
Biscuits     $4
Type 2: New row
Product_PK  Product_ID  Name                         Category
1           SG-TR7      Sunglasses TR-7              Accessories
2           CH-B70      Chocolate bar 70% cacao      Sweets
3           OT-BSC      Oat meal biscuits            Sweets
4           OT-BSC      Delicious Oat meal biscuits  Biscuits
No. of products?
Period in which values are valid
Instead of null, better use a date far in the future
✓ Necessary also in ETL to use correct FK
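A Type 2 change can be sketched as "expire the current row, add a new row". The code below is an illustration under assumptions, not the course's implementation: the column names Eff_Date and Exp_Date, the far-future date 2121-01-01, the change date and both function names are chosen for the example.

```python
from datetime import date

# Assumption: a date far in the future marks the currently valid row (instead of null).
FAR_FUTURE = date(2121, 1, 1)


def make_row(pk, product_id, name, category):
    return {"Product_PK": pk, "Product_ID": product_id, "Name": name,
            "Category": category, "Eff_Date": date(2021, 1, 1), "Exp_Date": FAR_FUTURE}


dim_product = [
    make_row(1, "SG-TR7", "Sunglasses TR-7", "Accessories"),
    make_row(2, "CH-B70", "Chocolate bar 70% cacao", "Sweets"),
    make_row(3, "OT-BSC", "Oat meal biscuits", "Sweets"),
]


def apply_type2_change(dim, product_id, new_values, change_date):
    """Expire the current version of product_id and append a new version."""
    current = next(r for r in dim
                   if r["Product_ID"] == product_id and r["Exp_Date"] == FAR_FUTURE)
    current["Exp_Date"] = change_date
    new_row = {**current, **new_values,
               "Product_PK": max(r["Product_PK"] for r in dim) + 1,
               "Eff_Date": change_date, "Exp_Date": FAR_FUTURE}
    dim.append(new_row)
    return new_row


def lookup_product_pk(dim, product_id, on_date):
    """Find the surrogate key that was valid on on_date (used when loading facts)."""
    for r in dim:
        if r["Product_ID"] == product_id and r["Eff_Date"] <= on_date < r["Exp_Date"]:
            return r["Product_PK"]
    raise KeyError((product_id, on_date))


new_version = apply_type2_change(
    dim_product, "OT-BSC",
    {"Name": "Delicious Oat meal biscuits", "Category": "Biscuits"},
    date(2022, 6, 1),
)
print(new_version["Product_PK"])
print(lookup_product_pk(dim_product, "OT-BSC", date(2022, 1, 15)))
print(lookup_product_pk(dim_product, "OT-BSC", date(2022, 7, 1)))
```

Counting products then has to use the natural key (Product_ID), because one product can own several surrogate keys.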
✓ Type 1 – Static
Type 3: Additional Attributes
Product_PK  Product_ID  Name                     Category     Prev_Category
1           SG-TR7      Sunglasses TR-7          Accessories  Accessories
2           CH-B70      Chocolate bar 70% cacao  Sweets       Sweets
3           OT-BSC      Oat meal biscuits        Biscuits     Sweets

Category     Amount
Accessories  $25
Sweets       $6
Biscuits     $8

Prev_Category  Amount
Accessories    $25
Sweets         $14
= ETL process
ETL-Tool
✓ Load data
Everything we need to build our DWH!
Extract, Transform, Load ETL
ETL-Setup
Building workflows…
✓ Staging workflow
ETL-Setup
Jobs …
✓ Run the workflows
✓ Understanding data
✓ Remember MAX(Sales_Key)
data sources
Core /
Staging Data Warehouse
Data Warehouse Layers
Insert / Update
o UPDATE
Product_PK  Name
1           Sunglasses TR-7
2           Chocolate bar 70% cacao
3           Oat meal biscuits
4           Chocolate bar 70% cacao
5           Oat meal biscuits
o DELETE (rows are flagged in a Deleted column)
Product_PK  Name                     Deleted
1           Sunglasses TR-7          No
2           Chocolate bar 70% cacao  Yes
3           Oat meal biscuits        No
4           Chocolate bar 70% cacao  No
5           Oat meal biscuits        No
Transform
Insert /
Update
data sources
Core /
Staging Data Warehouse
Main goals
Create a consolidated view of all data for
analysis purposes
Month          Amount
January-2022   $5030
February-2022  $6053
March-2022     $2455
Total          $13538
Kinds of transformations
Basic
▪ Deduplication
▪ Filtering (rows & columns)
▪ Cleaning & Mapping (Integration)
▪ Value Standardization (Integration)
▪ Key Generation
Advanced
▪ Joining
▪ Splitting
▪ Aggregating
▪ Deriving new values
Basic ▪ Deduplication

Store 1
product_id  name          category
P521        Almonds 150g  Nuts
P252        Garlic        Fruits & Vegetables
P533        Banana        Fruits & Vegetables
P684        Chocolate     Sweets & Snacks
P755        Spicy Chips   Sweets & Snacks

Store 2
product_id  name               category
P521        Almonds 150g       Nuts
P672        Orange Juice       Drinks
P423        Green Apples       Fruits & Vegetables
P564        Chocolate Cookies  Sweets & Snacks
P755        Spicy Chips        Sweets & Snacks

Product Dimension (after deduplication)
product_id  name               category
P521        Almonds 150g       Nuts
P252        Garlic             Fruits & Vegetables
P533        Banana             Fruits & Vegetables
P684        Chocolate          Sweets & Snacks
P755        Spicy Chips        Sweets & Snacks
P672        Orange Juice       Drinks
P423        Green Apples       Fruits & Vegetables
P564        Chocolate Cookies  Sweets & Snacks
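Deduplication across the two stores can be sketched as "keep the first row seen for each product_id". This is one simple strategy among several; the function name deduplicate is hypothetical and the rows are the sample data from the slides.

```python
store_1 = [
    ("P521", "Almonds 150g", "Nuts"),
    ("P252", "Garlic", "Fruits & Vegetables"),
    ("P533", "Banana", "Fruits & Vegetables"),
    ("P684", "Chocolate", "Sweets & Snacks"),
    ("P755", "Spicy Chips", "Sweets & Snacks"),
]
store_2 = [
    ("P521", "Almonds 150g", "Nuts"),
    ("P672", "Orange Juice", "Drinks"),
    ("P423", "Green Apples", "Fruits & Vegetables"),
    ("P564", "Chocolate Cookies", "Sweets & Snacks"),
    ("P755", "Spicy Chips", "Sweets & Snacks"),
]


def deduplicate(*sources):
    """Merge the sources and keep the first row seen for every product_id."""
    seen_ids = set()
    unique_rows = []
    for source in sources:
        for row in source:
            product_id = row[0]
            if product_id not in seen_ids:
                seen_ids.add(product_id)
                unique_rows.append(row)
    return unique_rows


product_dimension = deduplicate(store_1, store_2)
for row in product_dimension:
    print(row)
```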
Kinds of transformations
Basic ▪ Filtering rows
Filter out irrelevant rows

Before
Sales_Date  Name                     Amount  Type
2022-06-06  Sunglasses TR-7          $25     Sale
2022-06-06  Chocolate bar 70% cacao  $3      Sale
2022-06-06  Sunglasses TR-7          $-25    Refund
2022-06-07  Oat meal biscuits        $4      Sale
2022-06-07  Chocolate bar 70% cacao  $3      Sale
2022-06-08  Oat meal biscuits        $4      Sale

After (refund row filtered out)
Sales_Date  Name                     Amount  Type
2022-06-06  Sunglasses TR-7          $25     Sale
2022-06-06  Chocolate bar 70% cacao  $3      Sale
2022-06-07  Oat meal biscuits        $4      Sale
2022-06-07  Chocolate bar 70% cacao  $3      Sale
2022-06-08  Oat meal biscuits        $4      Sale
Kinds of transformations
Basic ▪ Filtering columns
Filter out irrelevant columns
Sales_Date  Name                     Amount  Type
2022-06-06  Sunglasses TR-7          $25     Sale
2022-06-06  Chocolate bar 70% cacao  $3      Sale
2022-06-07  Oat meal biscuits        $4      Sale
2022-06-07  Chocolate bar 70% cacao  $3      Sale
2022-06-08  Oat meal biscuits        $4      Sale
Mapping different values (M => Male)
Before
Name      Gender
Taylor    M
Isabella  Fe
Sofia     F
After
Name      Gender
Taylor    Male
Isabella  Female
Sofia     Female

Replacing missing values (null => 0)
Before
Day        Sales
Monday     $500
Tuesday    $760
Wednesday  null
After
Day        Sales
Monday     $500
Tuesday    $760
Wednesday  $0
Kinds of transformations
Basic ▪ Cleaning & Mapping (Integration)
Mapping different values
Month Sales
January 2022 $1500
February 2022 $4550
March 2022 $3321
Kinds of transformations
Basic ▪ Key Generation
Product Dimension
Product_PK  product_id  name               category
1           P521        Almonds 150g       Nuts
2           P252        Garlic             Fruits & Vegetables
3           P533        Banana             Fruits & Vegetables
4           P684        Chocolate          Sweets & Snacks
5           P755        Spicy Chips        Sweets & Snacks
6           P521        Almonds 150g       Nuts
7           P672        Orange Juice       Drinks
8           P423        Green Apples       Fruits & Vegetables
9           P564        Chocolate Cookies  Sweets & Snacks
10          P755        Spicy Chips        Sweets & Snacks
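Key generation can be sketched as continuing the integer sequence after the highest key already loaded, which is why the ETL setup notes say to remember the MAX of the key. In practice the database or ETL tool usually does this; the function below is a hypothetical stand-in.

```python
def assign_surrogate_keys(existing_rows, new_rows, key_column="Product_PK"):
    """Give each new row the next integer after the highest existing key."""
    max_key = max((row[key_column] for row in existing_rows), default=0)
    return [
        {key_column: max_key + offset, **row}
        for offset, row in enumerate(new_rows, start=1)
    ]


already_loaded = [
    {"Product_PK": 1, "product_id": "P521"},
    {"Product_PK": 2, "product_id": "P252"},
    {"Product_PK": 3, "product_id": "P533"},
    {"Product_PK": 4, "product_id": "P684"},
    {"Product_PK": 5, "product_id": "P755"},
]
incoming = [{"product_id": "P672"}, {"product_id": "P423"}, {"product_id": "P564"}]

keyed_rows = assign_surrogate_keys(already_loaded, incoming)
for row in keyed_rows:
    print(row)
```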
Kinds of transformations
Advanced ▪ Joining

Product Dimension
Product_PK  product_id  name          category
1           P521        Almonds 150g  Nuts
2           P252        Garlic        Fruits & Vegetables
3           P533        Banana        Fruits & Vegetables
4           P684        Chocolate     Sweets & Snacks
5           P755        Spicy Chips   Sweets & Snacks

Sales Fact (Product_FK looked up through product_id)
Sales_PK  product_id  Product_FK  Date
3         P533        3           2022-01-01
4         P252        2           2022-01-01
5         P755        5           2022-01-02
6         P684        4           2022-01-02
7         P755        5           2022-01-02
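The join that adds Product_FK to the sales rows is a key lookup: the natural key product_id is translated into the surrogate key of the product dimension. A minimal sketch with the sample rows (the variable names are assumptions):

```python
# Product dimension: (Product_PK, product_id)
product_dimension = [(1, "P521"), (2, "P252"), (3, "P533"), (4, "P684"), (5, "P755")]

# Staged sales rows: (Sales_PK, product_id, Date)
staged_sales = [
    (3, "P533", "2022-01-01"),
    (4, "P252", "2022-01-01"),
    (5, "P755", "2022-01-02"),
    (6, "P684", "2022-01-02"),
    (7, "P755", "2022-01-02"),
]

# Lookup from natural key to surrogate key.
product_pk_by_id = {product_id: pk for pk, product_id in product_dimension}

# Fact rows: (Sales_PK, product_id, Product_FK, Date)
sales_fact = [
    (sales_pk, product_id, product_pk_by_id[product_id], sale_date)
    for sales_pk, product_id, sale_date in staged_sales
]

for row in sales_fact:
    print(row)
```

A product_id that is missing from the dimension raises a KeyError here, which is one reason dimensions are loaded before facts.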
Kinds of transformations
Advanced ▪ Joining
Product Dimension
Product_PK product_id name category Eff_Date Exp_Date
1 P521 Almonds 150g Nuts 2021-01-01 2121-01-01
2 P252 Garlic Fruits & Vegetables 2021-01-01 2121-01-01
3 P533 Banana Fruits & Vegetables 2021-01-01 2121-01-01
4 P684 Chocolate Sweets & Snacks 2021-01-01 2121-01-01
5 P755 Spicy Chips Sweets & Snacks 2021-01-01 2121-01-01
Sales Fact
Sales_PK product_id Product_FK Date
3 P533 3 2022-01-01
4 P252 2 2022-01-01
5 P755 5 2022-01-02
6 P684 4 2022-01-02
7 P755 5 2022-01-02
Kinds of transformations
Advanced ▪ Joining
Product Table
Product_PK product_id name Category_id
1 P521 Almonds 150g 1
2 P252 Garlic 2
3 P533 Banana 2
4 P684 Chocolate 3
5 P755 Spicy Chips 3
Category table
Category_id  Category
1            Nuts
2            Fruits & Vegetables
3            Sweets
Kinds of transformations
Advanced ▪ Joining
Product Dimension
Product_PK product_id name category
1 P521 Almonds 150g Nuts
2 P252 Garlic Fruits & Vegetables
3 P533 Banana Fruits & Vegetables
4 P684 Chocolate Sweets & Snacks
5 P755 Spicy Chips Sweets & Snacks
Kinds of transformations
Advanced ▪ Splitting
• By length / position
• By delimiter
Store Dimension
Store_id Location
1 New York, NY 10011
2 Orland Park, IL 60462
3 Houston, TX 77002
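Both splitting styles can be tried on the Location column. In this sketch the city is split off by delimiter and the state and ZIP code are taken by position; the output column names City, State and Zip are assumptions, and the code expects the exact "City, ST 12345" layout of the sample rows.

```python
def split_location(location):
    """Split 'City, ST 12345' into city (by delimiter), state and ZIP (by position)."""
    city, state_and_zip = location.split(", ")
    state = state_and_zip[:2]   # first two characters
    zip_code = location[-5:]    # last five characters
    return {"City": city, "State": state, "Zip": zip_code}


stores = {
    1: "New York, NY 10011",
    2: "Orland Park, IL 60462",
    3: "Houston, TX 77002",
}

store_dimension = {store_id: split_location(loc) for store_id, loc in stores.items()}
for store_id, parts in store_dimension.items():
    print(store_id, parts)
```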
4. Transform + Load:
• Read from staging
• Transformation (Clean + Extract)
• Update/Insert
Processing Order
1. 2.
Extract Transform + Load
Transform + Load
Dimensions
Product Dimension
Product_PK  product_id  name          category
1           P521        Almonds 150g  Nuts
2           P252        Garlic        Fruits & Vegetables
3           P533        Banana        Fruits & Vegetables
4           P684        Chocolate     Sweets & Snacks
5           P755        Spicy Chips   Sweets & Snacks

Facts
Sales Fact
Sales_PK  product_id  Product_FK  Date
3         P533        3           2022-01-01
4         P252        2           2022-01-01
5         P755        5           2022-01-02
6         P684        4           2022-01-02
7         P755        5           2022-01-02
Processing Order for dimensions
Transform + Load
UPDATE
UPDATE
✓ Type 1 – Static
Case study:
Set up a complete ETL workflow
Plan of attack
3. Staging
1. Create DateKey
2. Include product_FK
3. Payment dimension
4. Additional columns:
• total_cost
• Add total_price
• Add profit
Plan of attack
1. Set up tables
Jobs or packages
External tool
In the ETL tool
(e.g. Windows Task scheduler or on server)
Guidelines
What are the requirements? How long does it take? What is a good time?
Night? Morning?
ETL tools
Enterprise Open-source Cloud-native Custom
Define responsibilities
Connectors
Capabilities
Ease of use/work
Reviews
Support/Extras
1-5
Total weighted score:
Choosing ETL tool
3. Test / Demo / Trial
Make a decision!
Choosing ETL tool
Enterprise Open-source Cloud-based
ETL
Data warehouse
Data Warehouse Layers
Transformations
applied when data
is moved
Predictive
data sources Analytics
Staging Core /
Data Warehouse
Data Warehouse Layers
E TL
Data Warehouse Layers
E LT
What is ELT?
Transform
Predictive
data sources Analytics
Staging Core /
Data Warehouse
Data Warehouse Layers
Empty
Delta Insert/
Update
data sources
Core /
Staging Data Warehouse
Processing order
ETL: 1. Extract  2. Transform + Load
ELT: 1. Extract + Load  2. Transform
Easy to use Data Quality Accessible Enables business users to analyze data
SELECT
product_id
FROM sales
WHERE customer_id = 5

Table scan (without an index every row is read)
1, P0494, 4, visa
2, P0221, 5, visa
3, P0625, 5, visa
4, P0432, 8, mastercard
6, P0058, 5, mastercard

Index on customer_id
Location  Value
1         4
2         5
5         8

✓ Indexes help to make data reads faster!
❖ Additional storage
Using index
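The difference between a table scan and an index lookup can be observed with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is a sketch on five sample rows: the index name idx_sales_customer and the payment column name are assumptions, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id TEXT, "
    "customer_id INTEGER, payment TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "P0494", 4, "visa"),
        (2, "P0221", 5, "visa"),
        (3, "P0625", 5, "visa"),
        (4, "P0432", 8, "mastercard"),
        (6, "P0058", 5, "mastercard"),
    ],
)

QUERY = "SELECT product_id FROM sales WHERE customer_id = 5"


def query_plan(connection):
    """Return SQLite's description of how it will execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the plan text


plan_without_index = query_plan(conn)  # a full scan of the table

# The index costs additional storage but lets SQLite jump to the matching rows.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
plan_with_index = query_plan(conn)

matching_products = sorted(row[0] for row in conn.execute(QUERY))

print(plan_without_index)
print(plan_with_index)
print(matching_products)
```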
✓ Different types of indexes for different situations
✓ Breaks data down into pages or blocks
✓ Should be used for high-cardinality (unique) columns
✓ Not entire table (costly in terms of storage)
[Diagram: index tree whose pages cover the key ranges AB, AC, AD, AE and split further into ABA…, ABB…, ACA…, ACB…]
❖ Bitmap index
✓ Particularly good for data warehouses
✓ Good for many repeating values (dimensionality)

Value       Bitmap
visa        11100
mastercard  00011

Value       1 2 3 4 5 6 7 8
mastercard  x x
visa        x x x
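A bitmap index keeps one bit string per distinct value, with one bit per row. The sketch below builds the two bitmaps for a payment column of five rows; the function names are hypothetical and a Python list of bits stands in for a real, compressed bitmap.

```python
def build_bitmap_index(values):
    """Return {distinct value: list of bits}, one bit per row position."""
    index = {}
    for position, value in enumerate(values):
        bits = index.setdefault(value, [0] * len(values))
        bits[position] = 1
    return index


def bitmap_as_text(bits):
    return "".join(str(bit) for bit in bits)


def rows_matching(index, value):
    """Row numbers (starting at 1) whose bit is set for the given value."""
    return [position + 1 for position, bit in enumerate(index.get(value, [])) if bit]


payment_column = ["visa", "visa", "visa", "mastercard", "mastercard"]
bitmap_index = build_bitmap_index(payment_column)

print(bitmap_as_text(bitmap_index["visa"]))
print(bitmap_as_text(bitmap_index["mastercard"]))
print(rows_matching(bitmap_index, "mastercard"))
```

With few distinct values there are only a few bitmaps to store, which is why this kind of index suits columns with many repeating values.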
Guidelines
✓ Storage layer What is the right ✓ Pay for what you use
✓ Compute layer
choice today?
✓ Managed service
✓ Software layer
✓ Optimized for scalable
Physical data center
analytics
Cloud vs. On-premises
On-premises Cloud
✓ Snowflake
✓ Amazon Redshift
✓ Azure Synapse
Traditional
Massive parallel processing (MPP)
Example
SELECT
*
FROM sales
WHERE customer_id = 5
Massive parallel processing (MPP)
Traditional
MPP
o "Shared nothing" architecture
o Independent resources for each node
o Work load is split up & processed individually (Task 1, Task 2, Task 3)
Massive parallel processing (MPP)
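SELECT
product_id
FROM sales

Row-based storage: the values of one row are stored together
3, P0625, 5, visa

Columnar storage: the values of one column are stored together
1, 2, 3, 4, 5
P0494, P0221, P0625, P0431, P0058
4, 5, 5, 8, 5
visa, visa, visa, mastercard, mastercard
18.29, 1.49, 5.89, 11.59, 12.39

✓ Less data needs to be processed!
✓ Better compression, less storage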
Columnar databases