
Course Slides - Data Warehouse - The Ultimate Guide

This document discusses the purposes and functions of operational and analytical data systems. It notes that operational systems are designed for inputting and accessing single records at a time for transaction processing, while analytical systems are intended for querying large volumes of historical data and making fact-based decisions. The document emphasizes that a data warehouse centralizes data from various sources and structures it for optimized querying and analysis to support business intelligence needs.
Copyright © All Rights Reserved

Two purposes

Analytical (decision making):
o What's the best category?
o How many sales compared to last month?
o What can be improved?
→ Evaluate performance
→ Decision-making

Operational (data keeping):
o Receive orders
o React to complaints
o Fill up stock
→ Turn the wheel

OLAP = Online Analytical Processing
OLTP = Online Transactional Processing
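The difference between the two purposes can be sketched in a few lines. This is only an illustration: the `orders` table and its rows are hypothetical and not taken from the slides. The operational query touches one record, while the analytical query scans and aggregates many.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, month TEXT, category TEXT, amount REAL)"
)
# Hypothetical sample rows, not taken from the slides.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2022-01", "Snacks", 10.0),
        (2, "2022-01", "Herbs", 5.0),
        (3, "2022-02", "Snacks", 7.5),
        (4, "2022-02", "Snacks", 2.5),
    ],
)

# Operational (OLTP-style): read or write one record at a time.
single_order = conn.execute("SELECT * FROM orders WHERE id = ?", (3,)).fetchone()

# Analytical (OLAP-style): scan many records and aggregate them.
sales_by_month = conn.execute(
    "SELECT month, SUM(amount) FROM orders GROUP BY month ORDER BY month"
).fetchall()

print(single_order)
print(sales_by_month)
```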


"Yes, we have a lot of data but we don't use it"
"Our data is very complicated and difficult to analyze"
"It's spread all over the different systems and difficult to access"

"I just want to see what is relevant!"

"We need to access data quick and easily"

"We want to make fact-based decisions!"


Two requirements

Analytical (decision making):
o Thousands of records at a time
o Fast query performance
o Historical context
o Usability
o Used for reporting and data analysis

Operational (data keeping):
o One record at a time
o Data input
o No long history

DWH is there to address those analytical data needs!


Data warehouse:
A database used and optimized for analytical
purposes.

✓ User friendly

✓ Fast query performance

✓ Enabling data analysis


Understanding a data warehouse

[Diagram: Sales data, CRM system and other data sources → ETL (Extract, Transform, Load) → Data warehouse (centralized location for data)]
Goals of a data warehouse

✓ Centralized and consistent location for data

✓ Data must be accessible fast (query performance)

✓ User-friendly (easy to understand)

✓ Must load data consistently and repeatedly (ETL)

✓ Reporting and data visualization built on top


Understanding a data warehouse

[Diagram: Sales data, CRM system and other data sources → Staging area → Data transformation → Production]
We create a data warehouse for Business Intelligence…

[Diagram: Raw data → Transform (data analysis on the data warehouse) → Meaningful insights → Better decisions]

Strategies, technologies, infrastructures:
o Data gathering
o Data storing
o Reporting
o Data visualization
o Data mining
o Predictive analytics
Data lake & data warehouse are BOTH used as centralized data storage

              Data Lake        Data Warehouse
Data          Raw              Processed
Technologies  Big data         Database
Structure     Unstructured     Structured
Usage         Not defined yet  Specific & ready to be used
Users         Data Scientists  Business users & IT

When to use what? Both!
Demos & Hands-on

✓ Demonstrations & Hands-on assignments

✓ Assignments & installations are optional

✓ Install ETL-Tool (Pentaho)


✓ Database Management System (PostgreSQL)
The layers of a Data Warehouse
Data Warehouse Layers

[Diagram, built up step by step. Labels: Department 1, Department 2, data sources, Staging, Cleansing, Core / Data Warehouse, Data Mart 1, Data Mart 2, Predictive Analytics. Flow: data sources → ETL → Staging → (cleansing) → Core / Data Warehouse → Data Mart 1, Data Mart 2 → Predictive Analytics]
The Staging Area
Data Warehouse Layers

[Diagram: data sources → (E) → Staging → (TL) → Core / Access Layer / Data Warehouse → Predictive Analytics]
Data Warehouse Layers

[Diagram: data sources → ETL → Staging]

o "Short time on the source systems"
o "Quickly extract"
o Move the data into relational database
o Start transformations from there
Data Warehouse Layers

Source table (data sources), extracted by the ETL job:

id  date      product                            customer_id  store_id
1   1/2/2022  Fulltoss Tangy Tomato              2            2
2   1/2/2022  Chilli - Green, Organically Grown  4            4
3   1/2/2022  Masala Powder                      7            5
4   1/2/2022  Cheese Cracker (Mcvities)          1            2
5   1/2/2022  Centre Filled Chocolate Cake       5            2
6   1/3/2022  Chocos - Magic Hearts              3            2
7   1/3/2022  Chocos Webs - Spiderman            4            3
8   1/3/2022  Breakfast Cereal - Chocolate       5            2
9   1/3/2022  Fiber Rich Chocolate               6            1
10  1/3/2022  Centre Filled Chocolate Cake       3            2

Staging table after the first extract (rows dated 1/2/2022):

id  date      product
1   1/2/2022  Fulltoss Tangy Tomato
2   1/2/2022  Chilli - Green, Organically Grown
3   1/2/2022  Masala Powder
4   1/2/2022  Cheese Cracker (Mcvities)
5   1/2/2022  Centre Filled Chocolate Cake

Temporary staging: staged rows are removed once they have been loaded onward, so after the second extract (rows dated 1/3/2022) staging holds only ids 6-10.

Persistent staging: staged rows are kept, so after the second extract staging holds ids 1-10.
The Staging Layer

✓ Staging Layer is the landing zone for extracted data

✓ Data in tables and on a separate database

✓ As little "touching" as possible

✓ We don't burden the source systems

✓ Temporary or Persistent Staging Layers
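The difference between a temporary and a persistent staging layer can be sketched as follows. This is a minimal illustration, not the course's implementation: the table name `staging_sales`, the helper names, and the choice to empty the table right before each load are assumptions. The rows are a subset of the slide's sample data.

```python
import sqlite3

# Two daily extracts, using a few rows from the slide's example.
extract_day1 = [
    (1, "1/2/2022", "Fulltoss Tangy Tomato"),
    (2, "1/2/2022", "Chilli - Green, Organically Grown"),
]
extract_day2 = [
    (6, "1/3/2022", "Chocos - Magic Hearts"),
    (7, "1/3/2022", "Chocos Webs - Spiderman"),
]


def load_staging(conn, rows, persistent):
    """Copy extracted rows into staging without transforming them."""
    if not persistent:
        # Temporary staging (assumed variant): empty the table before each load.
        conn.execute("DELETE FROM staging_sales")
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)


def staged_ids(persistent):
    """Run both daily loads and return the ids left in staging."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_sales (id INTEGER, date TEXT, product TEXT)")
    for extract in (extract_day1, extract_day2):
        load_staging(conn, extract, persistent)
    return [row[0] for row in conn.execute("SELECT id FROM staging_sales ORDER BY id")]


print(staged_ids(persistent=False))  # only the latest extract remains
print(staged_ids(persistent=True))   # every extract is kept
```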


Data Marts
Data Warehouse Layers

[Diagram: Core / Data Warehouse → Access Layer (Data Mart 1, Data Mart 2) → Predictive Analytics]

o Subset of a DWH
o Dimensional Model
o Can be further aggregated

o Usability + Acceptance
o Performance

o Tools
o Departments
o Regions
o Use-cases
Data Marts

✓ Data Mart = Small scale DWH?

→ Focus on the business problem

✓ Should you use a Data Mart or not?

→ Focus on the business problem


Relational Database
Relational database: tables (relations)

sales
id  date      product                            customer_id
1   1/2/2022  Fulltoss Tangy Tomato              2
2   1/2/2022  Chilli - Green, Organically Grown  2
3   1/2/2022  Masala Powder                      5
4   1/2/2022  Cheese Cracker (Mcvities)          1
5   1/2/2022  Centre Filled Chocolate Cake       5

    SELECT <column1>, <column2>, ...
    FROM <table_name>

Primary key: id. Foreign key: customer_id (refers to id in the customer table).

customer
id  name     city
1   Frank    New York
2   Sarah    Chicago
3   Sabrina  New Orleans
4   Maya     Los Angeles
5   Marc     Dallas
Joining the tables on the key columns (primary key customer.id, foreign key sales.customer_id):

    SELECT sales.id,
           product,
           customer_id,
           name
    FROM sales
    LEFT JOIN customer
    ON customer_id = customer.id

id  date      product                            customer_id  name
1   1/2/2022  Fulltoss Tangy Tomato              2            Sarah
2   1/2/2022  Chilli - Green, Organically Grown  2            Sarah
3   1/2/2022  Masala Powder                      5            Marc
4   1/2/2022  Cheese Cracker (Mcvities)          1            Frank
5   1/2/2022  Centre Filled Chocolate Cake       5            Marc
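The join from the slide can be run end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column types are guesses, and the `city` column is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    date TEXT,
    product TEXT,
    customer_id INTEGER REFERENCES customer(id)  -- foreign key
);
INSERT INTO customer VALUES
    (1, 'Frank'), (2, 'Sarah'), (3, 'Sabrina'), (4, 'Maya'), (5, 'Marc');
INSERT INTO sales VALUES
    (1, '1/2/2022', 'Fulltoss Tangy Tomato', 2),
    (2, '1/2/2022', 'Chilli - Green, Organically Grown', 2),
    (3, '1/2/2022', 'Masala Powder', 5),
    (4, '1/2/2022', 'Cheese Cracker (Mcvities)', 1),
    (5, '1/2/2022', 'Centre Filled Chocolate Cake', 5);
""")

# Same query as on the slide, plus ORDER BY for a stable row order.
rows = conn.execute("""
    SELECT sales.id, product, customer_id, name
    FROM sales
    LEFT JOIN customer ON customer_id = customer.id
    ORDER BY sales.id
""").fetchall()

names = [row[3] for row in rows]
for row in rows:
    print(row)
```

A LEFT JOIN keeps every sales row even if no matching customer exists; in that case `name` would come back as NULL (`None` in Python).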
o 70s to 90s: building logic & improving performance
o Operational systems: 1 table
o OLAP / Analysis: multiple tables (context)

Rise of relational databases => rise of OLAP / DWH
o Relational database management system (RDBMS)

  Oracle
  Microsoft SQL Server
  PostgreSQL
  MySQL
  Amazon Relational Database Service (RDS)
  Azure SQL databases
  (Snowflake)
✓ Highly optimized for query performance

✓ Good for Analytics / High query volume

✓ Usually used for data marts

✓ Relational and non-relational


Traditional database vs. in-memory database

[Chart: response time, disc vs. in-memory, for a traditional database and an in-memory database]

o columnar storage,
o parallel query plans,
o and other techniques

o Durability: Lose all information when device loses power or is reset
o Durability added through snapshots / images
o Cost-factor
o Traditional DBs also trying to reduce usage of disc
Products in RDBMS context

✓ SAP HANA

✓ MS SQL Server In-Memory Tables

✓ Oracle In-Memory

✓ Amazon MemoryDB
OLAP Cubes
✓ Traditional DWH based on relational DBMS (ROLAP)

✓ Data is organized non-relationally in a cube (MOLAP)


Cube = Multidimensional dataset

✓ Arrays instead of tables

✓ Main reason to use: Fast query performance

✓ Works well with many BI solutions


Benefits

✓ Precalculated (aggregated values)
✓ High performance
✓ Interactive tools to drill / slice & dice

✓ MDX
✓ Multidimensional DBs

[Figure: cube with a "Customers" axis]
Recommendation

✓ Built for a specific use-case (as data marts in general)

✓ More efficient & less complex with separate data marts

✓ Good for interactive queries with hierarchies

✓ Optional after star schema is built in relational DB


Alternatives

✓ Less important today with advancement of hardware

✓ Alternatives:
- Tabular models (SSAS)
- ROLAP
- columnar storage
ODS (Operational Data Store)
ODS

✓ Sometimes a little bit confusing

✓ Different understandings / definitions


ODS

✓ Operational decision making
✓ No need for long history
✓ Needs to be very current or real-time
✓ (Near) real-time
✓ Current state (Update logic)

[Diagram: Sales data, CRM system and other data sources → ETL → ODS]

id  customer  total
1   Sarah     $334
2   Frank     $4234
3   Thomas    $544
4   Angela    $4332
5   Kate      $460
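"Current state (Update logic)" means an ODS row is overwritten when a newer value arrives, rather than appended as history. A minimal sketch, assuming a hypothetical table `ods_customer_total` and a made-up later value of 400 for customer 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ods_customer_total (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
)


def apply_change(conn, row):
    """Insert a new customer row or overwrite the existing one (no history kept)."""
    conn.execute("INSERT OR REPLACE INTO ods_customer_total VALUES (?, ?, ?)", row)


# Initial state, taken from the slide's table.
for row in [(1, "Sarah", 334), (2, "Frank", 4234), (3, "Thomas", 544)]:
    apply_change(conn, row)

# A later change for customer 1 replaces the old total (hypothetical value).
apply_change(conn, (1, "Sarah", 400))

current_state = conn.execute("SELECT * FROM ods_customer_total ORDER BY id").fetchall()
print(current_state)
```

The table still has one row per customer after the update, which is why an ODS needs no long history.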


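ODS - Parallel

[Diagram: the sources feed the DWH (analytical decisions) and the ODS ((near) real-time, operational decisions) side by side]

ODS - Sequential

[Diagram: the sources feed the ODS ((near) real-time, operational decisions), which in turn feeds the DWH (analytical decisions)]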
ODS

✓ Getting less relevant


✓ Better performance (Faster ETL / DBs)
✓ Big data technologies (very fast / real-time)

✓ Don't get hung up with terminology!


Summary
Data Warehouse Layers

[Diagram: data sources → Staging (landing zone) → Core / Data Warehouse (single point of truth) → Data Mart 1, Data Mart 2 (access layer) → business applications such as Predictive Analytics]

The different layers

Staging  ✓ Landing zone
         ✓ Minimal transformation
         ✓ "Stage" the data in tables

Core     ✓ Always there
         ✓ Business Logic & Single Point of Truth
         ✓ Can be sometimes the access layer

Mart     ✓ Access Layer
         ✓ Specific to one use-case
         ✓ Optimized for performance
What is dimensional modeling?
Dimensional modeling

✓ Method of organizing data (in a data warehouse)

✓ Facts
o Measurement like profit

✓ Dimensions
o Context like category or period

Examples: profit by year, profit by category
Dimensional modeling

[Diagram: star schema with the facts in the center, surrounded by dimensions]
Dimensional modeling

✓ Unique technique of structuring data

✓ Commonly used in DWH

✓ Optimized for faster data retrieval

✓ Oriented around performance & usability

✓ Designed for reporting / OLAP


Why dimensional modeling?
Dimensional modeling

✓ Goal: Fast data retrieval

✓ Oriented around performance & usability

Starting point: one flat table

id  date      product                            category    customer_id  name   profit
1   1/2/2022  Fulltoss Tangy Tomato              Vegetables  2            Sarah  $23
2   1/2/2022  Chilli - Green, Organically Grown  Snacks      2            Sarah  $12
3   1/2/2022  Masala Powder                      Herbs       5            Marc   $93
4   1/2/2022  Cheese Cracker (Mcvities)          Snacks      1            Frank  $23
5   1/2/2022  Centre Filled Chocolate Cake       Snacks      5            Marc   $21

The descriptive columns move out of the table step by step: first the customer name, then product and category (replaced by product_id), then the date (replaced by date_id).

Profit Fact Table (product_id = FK)

id  date      product_id  customer_id  profit
1   1/2/2022  2           2            $23
2   1/2/2022  5           2            $12
3   1/2/2022  6           5            $93
4   1/2/2022  23          1            $23
5   1/2/2022  16          5            $21

Product Dim (product_id = PK)

product_id  product    category
1           product 1  Vegetables
2           product 2  Snacks
3           product 3  Herbs
4           product 4  Snacks
5           product 5  Snacks

Profit Fact Table with date_id

id  date_id   product_id  customer_id  profit
1   20220102  2           2            $23
2   20220102  5           2            $12
3   20220102  6           5            $93
4   20220102  23          1            $23
5   20220102  16          5            $21

Date Dim

date_id   weekday    month
20220102  Sunday     January
20220103  Monday     January
20220104  Tuesday    January
20220105  Wednesday  January
20220106  Thursday   January
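How a fact table and a dimension table answer a question like "profit by category" can be sketched with `sqlite3`. The table names `fact_profit` and `dim_product` are assumptions. The `product_id` values in the fact rows are also renumbered to 1-5 here so that every fact row matches one of the five dimension rows; the slide's fact table uses other ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product TEXT,
    category TEXT
);
CREATE TABLE fact_profit (
    id INTEGER PRIMARY KEY,
    date_id INTEGER,
    product_id INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER,
    profit INTEGER
);
INSERT INTO dim_product VALUES
    (1, 'product 1', 'Vegetables'), (2, 'product 2', 'Snacks'),
    (3, 'product 3', 'Herbs'), (4, 'product 4', 'Snacks'),
    (5, 'product 5', 'Snacks');
-- product_id renumbered to 1-5 for this sketch (assumption).
INSERT INTO fact_profit VALUES
    (1, 20220102, 1, 2, 23), (2, 20220102, 2, 2, 12),
    (3, 20220102, 3, 5, 93), (4, 20220102, 4, 1, 23),
    (5, 20220102, 5, 5, 21);
""")

# The fact supplies the measurement, the dimension supplies the context.
profit_by_category = conn.execute("""
    SELECT p.category, SUM(f.profit)
    FROM fact_profit AS f
    JOIN dim_product AS p ON f.product_id = p.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(profit_by_category)
```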


Dimensional modeling

✓ Goal: Fast data retrieval

✓ Oriented around performance & usability

Performance Usability

Preferred technique for a data warehouse!


Facts
Facts

o Foundation of DWH
o Key measurements
o Aggregated and analyzed

Usually…
o Aggregatable (numerical values)
o Measurable vs. descriptive
o Event- or transactional data
o Date/time in a fact table

[Diagram: star schema with the facts in the center, surrounded by dimensions]

Sales (facts): sales_id, product_id, customer_id, units, price
Dim_Customer: customer_id, first name, last name, sex, city
Dim_Product: product_id, name, category, subcategory, dimensions
Dim_Date: date_id, year, quarter, month, week, day, weekday, holiday_flag
Facts

✓ Fact table: PK, FK & Facts

✓ Grain: the most atomic level at which the facts are defined

id date_id region_id profit


1 20220102 1 $23

2 20220102 2 $12

3 20220102 2 $93

4 20220102 3 $23

5 20220102 16 $21

✓ Different types of facts


Dimensions
Dimensions

o Categorizes facts
o Supportive & descriptive
o Filtering, Grouping & Labeling

Usually…
o Non-Aggregatable
o Measurable vs. descriptive
o (More) static

[Diagram: star schema with the facts in the center, surrounded by dimensions]

Sales (facts): sales_id, product_id, customer_id, units, price
Dim_Customer: customer_id, first name, last name, sex, city
Dim_Product: product_id, name, category, subcategory, dimensions
Dim_Date: date_id, year, quarter, month, week, day, weekday, holiday_flag
Dimensions

✓ Dimension table: PK, Dimension, (FK)

✓ People, products, places, time

customer_id  first_name  last_name  email
1            Mike        Miller     [email protected]
2            Sofia       Snider     [email protected]
3            Marco       Steadman   [email protected]
4            Sarah       Griffith   [email protected]
5            Jennifer    Lovell     [email protected]

✓ Different types of dimensions


Star schema
Star schema

[Diagram: star schema with the Sales fact table in the center, surrounded by dimensions]

Sales (facts): sales_id, product_id, customer_id, units, price
Dim_Product: product_id, name, category, subcategory, dimensions

Normalized
o Technique to avoid redundancy
o Minimizes storage
o Performance (write / update)
o Many tables
o Many joins necessary

Denormalized (star schema)
o There is data redundancy!
o Optimized to get data out
o Query performance (read)
o User experience

Fact table (product_id = FK):

sales_id  product_id  customer_id  units  price
1         3           23           1      2.99
2         5           13           1      1.99
3         2           7            2      3.49
4         3           16           1      2.29
5         3           13           5      1.49

Dimension table (product_id = PK):

product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks
Star schema

✓ Most common schema in Data Mart

✓ Simplest form (vs. snowflake schema)

✓ Works best for specific needs


(simple set of queries vs complex queries)

✓ Usability + Performance for specific (read) use-case


Snowflake schema
Star schema

Facts:

sales_id  product_id  customer_id  units  price
1         3           23           1      2.99
2         5           13           1      1.99
3         2           7            2      3.49
4         3           16           1      2.29
5         3           13           5      1.49

Dimensions:

product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks

Snowflake schema

Facts: same sales table as above.

Dimensions ((more) normalized):

product_id  name       category_id  sub_category
1           Chili      1            Spices
2           Garlic     2            Vegetable
3           Banana     2            Fruits
4           Chocolate  3            Sweets
5           Chips      3            Snacks

category_id  category
1            Herbs
2            Fruits & Vegetables
3            Sweets & Snacks
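The cost of the extra normalization shows up in the query: units per category now need two joins instead of one. A sketch with `sqlite3`, using the slide's rows; the table names `sales`, `product` and `category` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES category(category_id),
    sub_category TEXT
);
CREATE TABLE sales (
    sales_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES product(product_id),
    customer_id INTEGER,
    units INTEGER,
    price REAL
);
INSERT INTO category VALUES
    (1, 'Herbs'), (2, 'Fruits & Vegetables'), (3, 'Sweets & Snacks');
INSERT INTO product VALUES
    (1, 'Chili', 1, 'Spices'), (2, 'Garlic', 2, 'Vegetable'),
    (3, 'Banana', 2, 'Fruits'), (4, 'Chocolate', 3, 'Sweets'),
    (5, 'Chips', 3, 'Snacks');
INSERT INTO sales VALUES
    (1, 3, 23, 1, 2.99), (2, 5, 13, 1, 1.99), (3, 2, 7, 2, 3.49),
    (4, 3, 16, 1, 2.29), (5, 3, 13, 5, 1.49);
""")

# Snowflake: fact -> product -> category (two joins).
units_by_category = conn.execute("""
    SELECT c.category, SUM(s.units)
    FROM sales AS s
    JOIN product AS p ON s.product_id = p.product_id
    JOIN category AS c ON p.category_id = c.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()

print(units_by_category)
```

In the star schema the category sits directly in the product dimension, so the second join disappears. Categories without any sale (Herbs here) do not appear because the query uses inner joins.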
Snowflake schema

Advantage
✓ Less space (storage cost)
✓ No (less) redundant data (easier to maintain/update, less risk of corrupted data)
✓ Solves write slow downs

Disadvantage
✓ More complex
✓ More joins (more complex SQL queries)
✓ Less performance (Data Marts / Cubes)
Snowflake schema

Data Mart
✓ Star schema

Core
✓ Star schema
✓ Maybe snowflake schema


Additivity in facts
Additivity

Additive
✓ Can be added across all dimensions
✓ Most flexible & useful

Semi-additive
✓ Can be added across a few dimensions

Non-additive
✓ Cannot be added across any dimension


Additive facts
Fact table:

sales_id  product_id  date_id   units  price
1         3           20220101  1      2.99
2         5           20220102  1      1.99
3         2           20220102  2      3.49
4         3           20220103  1      2.29
5         3           20220104  5      1.49

Product dimension:

product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks

Date dimension:

date_id   Date        Day  Month
20220101  01/01/2022  1    1
20220102  02/01/2022  2    1
20220103  03/01/2022  3    1
20220104  04/01/2022  4    1
20220105  05/01/2022  5    1

Units added across category:

category             units
Herbs                0
Fruits & Vegetables  9
Sweets & Snacks      1

Units added across product name:

name       Units
Chili      0
Garlic     2
Banana     7
Chocolate  0
Chips      1

Units added across date:

Date        units
01/01/2022  1
02/01/2022  3
03/01/2022  1
04/01/2022  5
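That `units` is fully additive can be checked by summing it along two different dimensions and comparing the grand totals. A small sketch with `sqlite3`, using the slide's fact rows; the helper `units_by` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sales_id INTEGER, product_id INTEGER, date_id INTEGER, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 3, 20220101, 1),
        (2, 5, 20220102, 1),
        (3, 2, 20220102, 2),
        (4, 3, 20220103, 1),
        (5, 3, 20220104, 5),
    ],
)


def units_by(column):
    """Sum units per value of one dimension key column."""
    # The column name only comes from the fixed calls below, never from user input.
    query = f"SELECT {column}, SUM(units) FROM sales GROUP BY {column} ORDER BY {column}"
    return conn.execute(query).fetchall()


by_product = units_by("product_id")
by_date = units_by("date_id")

# An additive fact yields the same grand total along every dimension.
print(sum(units for _, units in by_product), sum(units for _, units in by_date))
```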
Semi-additive facts
balance_id  portfolio_id  date_id   balance
1           1             20220101  $50
2           1             20220102  $100
3           1             20220103  $100
4           2             20220101  $120
5           2             20220102  $170
6           2             20220103  $60

Portfolio_id  Type
1             USD Cash
2             Stocks

Added across Types:

Date_id   balance
20220101  $170
20220102  $270
20220103  $160

Added across Date:

Type      balance
USD Cash  $250
Stocks    $350

Average across Date:

Type      balance
USD Cash  $83.33
Stocks    $116.67
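The two safe ways to aggregate the balance can be sketched with `sqlite3`: a sum across portfolios for each date, and an average (not a sum) across dates for each portfolio. The table name `balances` is an assumption; the rows come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (portfolio_id INTEGER, date_id INTEGER, balance INTEGER)")
conn.executemany(
    "INSERT INTO balances VALUES (?, ?, ?)",
    [
        (1, 20220101, 50), (1, 20220102, 100), (1, 20220103, 100),
        (2, 20220101, 120), (2, 20220102, 170), (2, 20220103, 60),
    ],
)

# Adding across portfolio types gives a sensible total per date.
total_per_date = conn.execute(
    "SELECT date_id, SUM(balance) FROM balances GROUP BY date_id ORDER BY date_id"
).fetchall()

# Across dates, an average is used instead of a sum.
avg_per_portfolio = conn.execute(
    "SELECT portfolio_id, ROUND(AVG(balance), 2) FROM balances "
    "GROUP BY portfolio_id ORDER BY portfolio_id"
).fetchall()

print(total_per_date)
print(avg_per_portfolio)
```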
Non-additive facts
o Price
o Percentages
o Ratios

sales_id  product_id  date_id   units  price
1         3           20220101  1      2.99
2         5           20220102  1      1.99
3         2           20220102  2      3.49
4         3           20220103  1      2.29
5         3           20220104  5      1.49

product_id  name       category             sub_category
1           Chili      Herbs                Spices
2           Garlic     Fruits & Vegetables  Vegetable
3           Banana     Fruits & Vegetables  Fruits
4           Chocolate  Sweets & Snacks      Sweets
5           Chips      Sweets & Snacks      Snacks

Price added across category:

category             price
Herbs                $0
Fruits & Vegetables  $10.26
Sweets & Snacks      $1.99
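One way to get a usable number out of a non-additive fact is to keep additive building blocks and derive the ratio at query time. The sketch below assumes that `price` is a unit price and that revenue is `units * price`; neither is stated on the slide.

```python
# (units, unit_price) for the three sales of product_id 3 (Banana) in the fact table.
banana_sales = [(1, 2.99), (1, 2.29), (5, 1.49)]

# Additive building blocks that can safely be stored and summed.
total_units = sum(units for units, _ in banana_sales)
total_revenue = sum(units * price for units, price in banana_sales)

# The non-additive value is derived from the additive ones when needed.
average_unit_price = total_revenue / total_units

# Summing the price column directly gives a number without business meaning.
summed_prices = sum(price for _, price in banana_sales)

print(total_units, round(total_revenue, 2), round(average_unit_price, 2))
```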
Additivity

Additive
✓ Can be added across all dimensions
✓ Most flexible & useful
✓ Most facts are fully additive

Semi-additive
✓ Can be added across a few dimensions
✓ Used carefully & less flexible
✓ Averaging might be an alternative
✓ Example: Balance

Non-additive
✓ Cannot be added across any dimension
✓ Limited analytical value
✓ Store underlying value
✓ Ratio, price etc.
Nulls in facts
Nulls in facts

Fact columns containing nulls:

balance_id  portfolio_id  balance  Incoming  Outgoing
1           1             $50      null      null
2           1             $100     $50       null
3           1             $100     null      null
4           2             $120     null      null
5           2             $170     $50       null
6           2             $60      null      $110

    SELECT AVG(Incoming),
           MIN(Incoming),
           SUM(Incoming)
    FROM balance_table

AVG  MIN  SUM
$50  $50  $100

Nulls replaced by 0:

balance_id  portfolio_id  balance  Incoming  Outgoing
1           1             $50      $0        $0
2           1             $100     $50       $0
3           1             $100     $0        $0
4           2             $120     $0        $0
5           2             $170     $50       $0
6           2             $60      $0        $110

AVG     MIN  SUM
$16.67  $0   $100

Nulls in a foreign key (portfolio_id):

balance_id  portfolio_id  balance  Incoming  Outgoing
1           1             $50      $0        $0
2           null          $100     $50       $0
3           1             $100     $0        $0
4           2             $120     $0        $0
5           null          $170     $50       $0
6           2             $60      $0        $110

Portfolio_id  Type
1             USD Cash
2             Stocks

Nulls in the foreign key replaced by a dummy value (999) that exists in the dimension:

balance_id  portfolio_id  balance  Incoming  Outgoing
1           1             $50      $0        $0
2           999           $100     $50       $0
3           1             $100     $0        $0
4           2             $120     $0        $0
5           999           $170     $50       $0
6           2             $60      $0        $110

Portfolio_id  Type
1             USD Cash
2             Stocks
999           Old types
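The effect of nulls versus zeros on the aggregates can be reproduced with `sqlite3`, whose aggregate functions skip NULL values. The helper `incoming_stats` is hypothetical, and only the `Incoming` column from the slide is modelled.

```python
import sqlite3


def incoming_stats(values):
    """Return (AVG, MIN, SUM) of the incoming column for the given values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE balance_table (balance_id INTEGER PRIMARY KEY, incoming INTEGER)"
    )
    conn.executemany(
        "INSERT INTO balance_table (incoming) VALUES (?)", [(v,) for v in values]
    )
    return conn.execute(
        "SELECT AVG(incoming), MIN(incoming), SUM(incoming) FROM balance_table"
    ).fetchone()


# None is stored as NULL and ignored by AVG, MIN and SUM.
with_nulls = incoming_stats([None, 50, None, None, 50, None])
# Zeros are real values and take part in every aggregate.
with_zeros = incoming_stats([0, 50, 0, 0, 50, 0])

print(with_nulls)
print(with_zeros)
```

SUM is the same in both cases, while AVG and MIN change. Whether null or 0 is right depends on what a missing measurement means for the fact in question.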
Year-to-Date facts
Year-to-Date facts

✓ Often requested by business users

✓ Tempted to store them in columns

✓ Month-to-Date, Quarter-to-Date, Fiscal-Year-to-Date etc.

✓ Better: store the underlying values at the defined grain (!)

✓ Instead, calculate all the to-date variations in the BI tool
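Deriving a year-to-date value from stored underlying values is a running sum. A minimal sketch in plain Python standing in for the BI tool; the monthly figures are hypothetical.

```python
from itertools import accumulate

# Underlying values stored at a monthly grain (hypothetical numbers).
monthly_revenue = [("2022-01", 100), ("2022-02", 150), ("2022-03", 50)]

months = [month for month, _ in monthly_revenue]
values = [value for _, value in monthly_revenue]

# Year-to-date is calculated on demand and never stored in the fact table.
year_to_date = list(zip(months, accumulate(values)))

print(year_to_date)
```

Because only the underlying values are stored, month-to-date, quarter-to-date or fiscal-year-to-date figures can be derived the same way without adding columns.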


Transactional fact table
Transactional fact table

✓ 1 row = measurement of 1 event / transaction

✓ Takes place at a specific time

✓ One transaction defines the lowest grain


Sales transactions (FK: product_id, date_id; measure: units)

sales_id  product_id  date_id   units
1         3           20220101  1
2         5           20220102  1
3         2           20220102  2
4         3           20220103  1
5         3           20220104  5

Calls (FK: emp_id, date_id, customer_id; measure: duration)

call_id  emp_id  date_id   customer_id  duration
1        3       20220101  1            43
2        5       20220102  1            12
3        2       20220102  2            134
4        3       20220103  1            62
5        3       20220104  5            22


Characteristics
✓ Most common and very flexible

✓ Typically additive

✓ Tend to have a lot of dimensions associated

✓ Can be enormous in size




Periodic snapshot fact table
Periodic snapshot fact table

✓ 1 row = summarizes measure of many events / transactions

✓ Summarized over a standard period (e.g. 1 day, 1 week etc.)

✓ Lowest period defines the grain


Sales transactions (all columns except week_id are measures)

week_id  revenue  sales  cost
1        323      123    12
2        541      322    31
3        242      108    12
4        352      212    51
5        312      198    25

Calls (all columns except day_id are measures)

day_id  no. calls  missed calls  duration
1       31         3             432
2       25         4             142
3       52         2             134
4       23         6             562
5       53         4             122


Characteristics
✓ Tend to be not as enormous in size

✓ Typically additive

✓ Tend to have a lot of facts and fewer dimensions associated

✓ No events = null or 0
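A periodic snapshot can be derived from a transactional fact table by grouping on the period. The sketch below builds a daily snapshot from the transactional calls table shown earlier in the deck; the result column names `no_calls` and `duration` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (call_id INTEGER, emp_id INTEGER, date_id INTEGER, "
    "customer_id INTEGER, duration INTEGER)"
)
# One row per call (transactional grain).
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?)",
    [
        (1, 3, 20220101, 1, 43),
        (2, 5, 20220102, 1, 12),
        (3, 2, 20220102, 2, 134),
        (4, 3, 20220103, 1, 62),
        (5, 3, 20220104, 5, 22),
    ],
)

# One row per day (periodic snapshot grain), summarizing many calls.
daily_snapshot = conn.execute("""
    SELECT date_id, COUNT(*) AS no_calls, SUM(duration) AS duration
    FROM calls
    GROUP BY date_id
    ORDER BY date_id
""").fetchall()

print(daily_snapshot)
```

A day without calls produces no row here. To store 0 or null for such days, the query would have to start from a date dimension and left-join the calls.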


Accumulating snapshot fact table
Accumulating snapshot fact table

✓ 1 row = summarizes measure of many events / transactions

✓ Summarized over the lifespan of 1 process (e.g. order fulfillment)

✓ Definite beginning & definite ending (& steps in between)


Order production (date FKs: Order Date, Production Start, Production End, Inspection Date, Shipping Date; measures: No. Products, Damaged products)

order_id  Order Date FK  No. Products  Product_FK  Production Start FK  Production End FK  Inspection Date FK  Shipping Date FK  Damaged products
1         20220102       100           32          20220103             20220110           20220112            20220113          3
2         20220103       100           32          20220104             20220112           20220113            20220113          4
3         20220103       100           32          20220103             20220112           20220113            20220114          1
4         20220104       100           32          20220106             20220110           20220112            20220113          0
5         20220104       100           32          20220108             20220117           20220119            20220120          6
Characteristics
✓ Least common

✓ Workflow or process analysis

✓ Multiple Date/Time foreign keys (for each process step)

✓ Date/Time keys associated with role-playing dimension


Types of fact tables
Type               Transactional                     Periodic Snapshot                                 Accumulating Snapshot
Grain              1 row = 1 transaction             1 row = 1 defined period (plus other dimensions)  1 row = lifetime of process / event
Date Dimensions    1 Transaction date                Snapshot date (end of period)                     Multiple snapshot dates
No. of dimensions  High                              Lower                                             Very high
Facts              Measures of transactions          Cumulative measures of transactions in period     Measures of process in lifespan
Size               Largest (most detailed grain)     Middle (less detailed grain)                      Lowest (highest aggregation)
Performance        Can be improved with aggregation  Better (less detailed)                            Good performance
Steps to create a fact table

What are the key decisions we need to make during the design?

4 key decisions

Considering the business needs
Tables & columns
Steps to create a fact table
1) Identify business process for analysis
Example: Sales, Order processing

sales_id  date        Sales amount
1         2022-01-01  $41
2         2022-01-02  $15
3         2022-01-02  $24
4         2022-01-03  $13
5         2022-01-04  $52

2) Declare the grain
Example: Transaction, Order, Order lines, Daily, Daily + location

3) Identify dimensions that are relevant
What, when, where, how and why
Filtering & grouping
Example: Time, locations, products, customers, …
"Soul" for analysis

4) Identify facts for measurement
Defined by the grain & not by specific use-case
Steps to create a fact table
1) Identify business process for analysis

sales_id  date        Sales amount
1         2022-01-01  $41
2         2022-01-02  $15
3         2022-01-02  $24
4         2022-01-03  $13
5         2022-01-04  $52

2) Declare the grain
Example: Transaction

3) Identify dimensions that are relevant
Example: Time, locations, products

Sales
sales_id  date      amount  prod_id  loc_id
1         20220101  $41     3        1
2         20220102  $15     4        5
3         20220102  $24     6        4
4         20220103  $13     1        3
5         20220104  $52     23       4

4) Identify facts for measurement
Example: Sales amount & order quantity
Factless fact table


✓ Facts are usually numeric

✓ Sometimes only dimensional aspects of an event are recorded

✓ Example: a new employee is registered

Employee registration (events, no metrics)
reg_id  Entry Date FK  dep_id  region_id  manager_id  Pos_id
1       20220102       1       2          3           10
2       20220103       3       3          4           112
3       20220103       4       6          3           202
4       20220104       4       8          6           110
5       20220104       3       4          8           17
Factless fact table

✓ How many employees have been registered last month?

✓ How many employees have been registered in a certain region?
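
Questions like these are answered by counting rows, since the table has no measures. A minimal sketch using an in-memory SQLite database follows; the table and column names (`employee_registration`, `entry_date_fk`, ...) are assumptions modelled on the example table, and "last month" is taken to be January 2022.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_registration ("
    "reg_id INTEGER PRIMARY KEY, entry_date_fk INTEGER, dep_id INTEGER, "
    "region_id INTEGER, manager_id INTEGER, pos_id INTEGER)"
)
# Rows copied from the employee registration example
conn.executemany(
    "INSERT INTO employee_registration VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 20220102, 1, 2, 3, 10),
        (2, 20220103, 3, 3, 4, 112),
        (3, 20220103, 4, 6, 3, 202),
        (4, 20220104, 4, 8, 6, 110),
        (5, 20220104, 3, 4, 8, 17),
    ],
)

# How many employees were registered in January 2022?
registered_in_january = conn.execute(
    "SELECT COUNT(*) FROM employee_registration "
    "WHERE entry_date_fk BETWEEN 20220101 AND 20220131"
).fetchone()[0]

# How many employees were registered per region?
by_region = dict(
    conn.execute(
        "SELECT region_id, COUNT(*) FROM employee_registration GROUP BY region_id"
    ).fetchall()
)

print(registered_in_january, by_region)
```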

Factless fact table

Promo (events, no metrics)
promo_id  date_id   prod_id  channel_id  campaign_id
1         20220102  5        2           3
2         20220103  3        3           4
3         20220103  4        6           3
4         20220104  4        8           6
5         20220104  3        4           8

Occurrence of events
Natural vs. Surrogate key

Natural keys

Products
product_id  name       category
PX30        Chili      Herbs
PT32        Garlic     Fruits & Vegetables
AX42        Banana     Fruits & Vegetables
DA24        Chocolate  Sweets & Snacks
PO20        Chips      Sweets & Snacks

Sales
sales_id  date        Sales amount
GXF-EFS   2022-01-01  $41
DOS-FWA   2022-01-02  $15
DSF-GWS   2022-01-02  $24
PTG-DWD   2022-01-03  $13
ERW-DWD   2022-01-04  $52
Natural vs. Surrogate key

Natural keys
✓ Come out of the source system

Surrogate key (artificial key)
✓ Integer number
✓ _PK or _FK suffix
✓ Created by the database / ETL tool

Product_PK  product_id  name       category
1           PX30        Chili      Herbs
2           PT32        Garlic     Fruits & Vegetables
3           AX42        Banana     Fruits & Vegetables
4           DA24        Chocolate  Sweets & Snacks
Surrogate key
Benefits

✓ Improve performance (less storage / better joins)

✓ Handle dummy values (nulls / missing values), e.g. 999 or -1

✓ Integrate multiple source systems

✓ Easier to administer / update

✓ Sometimes there are no natural keys available at all


Practical guidelines
Surrogate key

✓ Always use surrogate keys in tables as main PK and FK

✓ Both for Facts & Dimensions (except date dimension)

✓ Optionally keep the natural keys


Case study: E-Commerce
Case study: E-Commerce

Corporate IT

E-Commerce company

✓ 3 websites

✓ Each website operated independently by multiple departments


✓ ~ 1000 individual products

✓ Groceries, kitchen products, household products etc.


Case study: E-Commerce

Data collection

✓ Shopping cart check out

✓ Warehouse data
Case study: E-Commerce

Goals

✓ Logistics in warehouse

✓ Maximizing profits
❖ Profit margin, sales volume, product cost, promotions, discounts
Case study: E-Commerce
Step 1 Identify Business process

✓ Business process for first DWH?


Sales transactions
❖ Most critical for business
❖ Data availability, data quality
❖ Which products sold
❖ What is sales profit
❖ Sales of each website
❖ Performance on different days
❖ Sales over time
Case study: E-Commerce
Step 2 Declare the grain

✓ What level of detail?


❖ Most analytical value with atomic grain
Order + Order line
❖ Highest dimensionality
Case study: E-Commerce
Step 3 Identify dimensions

✓ Descriptive aspects of measures


Dimensions
❖ Naturally derived after grain defined

❖ Customer
❖ Products
❖ Promotions
❖ Time/date
❖ Website
Case study: E-Commerce
Step 4 Identify facts for measurement

✓ What facts are in the fact table?


Facts
❖ Must comply with the grain

✓ Additive
❖ Absolute discount (yes?)
❖ Discount percentage (no?)
❖ Profit
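
The question marks behind the two discount facts can be checked with a small calculation. The sketch below uses two made-up order lines (all numbers are hypothetical) to show that absolute discounts can be summed, while percentages cannot; an overall percentage has to be derived from additive components.

```python
# Two hypothetical order lines at the Order + Order line grain
order_lines = [
    {"list_price": 100.0, "discount_abs": 10.0},  # 10 % discount
    {"list_price": 20.0, "discount_abs": 10.0},   # 50 % discount
]
for line in order_lines:
    line["discount_pct"] = line["discount_abs"] / line["list_price"]

# Additive: the sum of absolute discounts is meaningful
total_discount = sum(line["discount_abs"] for line in order_lines)

# Not additive: summing percentages gives a meaningless 60 %
summed_pct = sum(line["discount_pct"] for line in order_lines)

# The overall percentage is derived from the additive facts instead
overall_pct = total_discount / sum(line["list_price"] for line in order_lines)

print(total_discount, round(summed_pct, 4), round(overall_pct, 4))
```

Storing the additive components (absolute discount, list price) in the fact table and computing the ratio at query time is one common way to handle this.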
Case study: E-Commerce
Result
[Diagram: a central fact table holding the facts, linked to the dimensions Website, Customer, Products and Date/time]


Dimension tables

✓ Always has a Primary Key (PK)

Product_ID  Name                     Category
P001        Sunglasses TR-7          Accessories
P002        Chocolate bar 70% cacao  Sweets
P003        Oat meal biscuits        Sweets

✓ Use surrogate key

Product_PK  Name                     Category
1           Sunglasses TR-7          Accessories
2           Chocolate bar 70% cacao  Sweets
3           Oat meal biscuits        Sweets

✓ Lookup table

Product_PK  Product_ID
1           P001
2           P002
3           P003
Dimension tables

Looking up the surrogate key via the natural key (Product_ID):

SELECT
    S.*,
    P.Product_PK
FROM Sales_Fact S
LEFT JOIN Product_Dim AS P
    ON P.Product_ID = S.Product_ID
Dimension tables

✓ Relatively few rows / many columns with descriptive attributes


Dimension tables

✓ Group & Filter ("slice & dice")


Date Dimension
Date Dimension
✓ One of the most common & most important dimensions

✓ Contains date related features


❖ Year, Month (name & number), Day, Quarter, Week,
Weekday (name & number), …

✓ Meaningful surrogate key YYYYMMDD


For example 2022-04-02 -> 20220402

✓ Extra row for no date/null (source) -> 1900-01-01 (dim)


Date Dimension
✓ Time is usually a separate dimension

✓ Can be populated in advance (e.g. for next 5 or 10 years)

Date features
▪ Numbers & Text (e.g. January, 1)
▪ Long & Abbreviated (Jan, January – Mon, Monday)
▪ Combinations of attributes (Q1, 2022-Q1)
▪ Fiscal dates (Fiscal Year etc.)
▪ Flags (Weekend, company holidays etc.)

Date_PK   Date        Month    Short Month  Year-Quarter  Year  Weekday   Is_Weekend
20220101  2022-01-01  January  Jan          2022-Q1       2022  Saturday  1
20220102  2022-01-02  January  Jan          2022-Q1       2022  Sunday    1
20220103  2022-01-03  January  Jan          2022-Q1       2022  Monday    0
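
Since the date dimension can be populated in advance, it is often generated by a small script. The following is a minimal sketch that produces the columns of the example table; the function name `build_date_dim` and the underscore column names are assumptions, and fiscal dates and holiday flags are left out.

```python
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]

def build_date_dim(start, end):
    """Return one row (dict) per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        quarter = (day.month - 1) // 3 + 1
        month_name = MONTHS[day.month - 1]
        rows.append({
            "Date_PK": int(day.strftime("%Y%m%d")),  # meaningful key YYYYMMDD
            "Date": day.isoformat(),
            "Month": month_name,
            "Short_Month": month_name[:3],
            "Year_Quarter": f"{day.year}-Q{quarter}",
            "Year": day.year,
            "Weekday": WEEKDAYS[day.weekday()],
            "Is_Weekend": 1 if day.weekday() >= 5 else 0,
        })
        day += timedelta(days=1)
    return rows

date_dim = build_date_dim(date(2022, 1, 1), date(2022, 12, 31))
print(date_dim[0])
```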
Nulls in dimensions
What we've learnt
✓ Nulls must be avoided in FKs

❖ Nulls in FKs break referential integrity!

❖ Rows with a null FK don't appear in (inner) joins
Nulls in dimensions
What we've learnt ✓ Nulls must be avoided in FKs

✓ Nulls can be present in Facts


Nulls in dimensions
Dimensions

✓ Replace nulls with descriptive values

✓ More understandable for business users

✓ Values appear in aggregations in BI tools
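
The effect can be reproduced with a tiny in-memory SQLite example. The sketch below assumes a dummy dimension row with key -1 and the label 'Unknown'; table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_pk INTEGER PRIMARY KEY, category TEXT);
INSERT INTO product_dim VALUES (-1, 'Unknown'), (1, 'Accessories'), (2, 'Sweets');

CREATE TABLE sales_staging (sales_id INTEGER, product_fk INTEGER, amount REAL);
INSERT INTO sales_staging VALUES (1, 1, 25), (2, 2, 3), (3, NULL, 4);
""")

QUERY = """
SELECT d.category, SUM(s.amount)
FROM sales_staging s
JOIN product_dim d ON d.product_pk = {fk}
GROUP BY d.category
ORDER BY d.category
"""

# Null FK: the third sale silently drops out of the join
with_nulls = conn.execute(QUERY.format(fk="s.product_fk")).fetchall()

# Null replaced by the dummy key: the sale shows up under 'Unknown'
with_dummy = conn.execute(QUERY.format(fk="COALESCE(s.product_fk, -1)")).fetchall()

print(with_nulls)
print(with_dummy)
```

In a real load the replacement would typically happen once in the ETL step rather than in every query.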


Hierarchies in dimensions
Source data ✓ Often normalized

✓ Snowflaked schema (should be avoided)


Hierarchies in dimensions

Source data ✓ Often normalized

Some professionals have the habit of normalizing the data

⇒ Bad for usability & performance!

⇒ We should not do that!


Hierarchies in dimensions

What we should do ✓ Denormalize / flattened


Flattened dimension
Hierarchies in dimensions

What we should do ✓ Consider combinations if helpful

Date        Year-Month  Year-Quarter
01-01-2022  Jan-2022    2022-Q1
02-01-2022  Jan-2022    2022-Q1
03-01-2022  Jan-2022    2022-Q1

Location_PK  City         State      City-State
1            Nashville    Tennessee  Nashville, Tennessee
2            Nashville    Indiana    Nashville, Indiana
3            Kansas City  Kansas     Kansas City, Kansas


Conformed dimensions

A conformed dimension is a dimension that is shared by multiple fact tables / stars.

It is used to compare facts across different fact tables.


Conformed dimension

[Diagram: Sales Fact and Cost Fact each have their own dimensions and share conformed dimensions (shared attributes) such as Time/Date and Region; the shared dimensions enable drill across]
Conformed dimension

✓ Conformed Date dim


Month Cost Sales
January $50,300 $67,300
February $55,300 $71,400
March $65,100 $79,400

✓ Conformed Region dim


Country Cost Sales
Spain $57,200 $69,800
Belgium $15,300 $21,900
wd $35,100 $29,400
Conformed dimension

✓ Sales fact
Sales_PK Sales Date_FK
1 $9,400 20220101
2 $7,300 20220101
3 $5,100 20220102

✓ Cost fact
Cost_PK Cost Date_FK
1 $7,200 20220101
2 $1,900 20220101
3 $2,800 20220101
Conformed dimension

✓ Sales fact
Sales_PK Sales Date_FK
1 $9,400 20220101
2 $7,300 20220101
3 $5,100 20220102

✓ Cost fact
Cost_PK Cost Date_FK
1 $7,200 20220101
2 $1,900 20220102
3 $2,800 20220103

Same granularity not necessary!


Conformed dimension

✓ Sales fact
Sales_PK Sales Date_FK
1 $9,400 20220101
2 $7,300 20220101
3 $5,100 20220102

✓ Cost fact
Cost_PK Cost DateMonth_FK
1 $7,200 20220101
2 $1,900 20220201
3 $2,800 20220301

Different FK possible!
Conformed dimension

✓ Sales fact
Sales_PK Sales Date_FK
1 $9,400 20220101
2 $7,300 20220101
3 $5,100 20220102

✓ Cost fact
Cost_PK Cost DateMonth_FK
1 $7,200 2022-01
2 $1,900 2022-02
3 $2,800 2022-03

Different FK possible!
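
A drill-across query can be sketched as: aggregate each fact table separately to the shared attribute, then join the results. The example below uses the sales and cost rows from the tables above in an in-memory SQLite database; the `year_month` attribute and all table names are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_pk INTEGER PRIMARY KEY, year_month TEXT);
INSERT INTO date_dim VALUES (20220101, '2022-01'), (20220102, '2022-01');

CREATE TABLE sales_fact (sales_pk INTEGER, sales REAL, date_fk INTEGER);
INSERT INTO sales_fact VALUES (1, 9400, 20220101), (2, 7300, 20220101), (3, 5100, 20220102);

-- The cost fact has a coarser (monthly) grain and a different FK
CREATE TABLE cost_fact (cost_pk INTEGER, cost REAL, datemonth_fk TEXT);
INSERT INTO cost_fact VALUES (1, 7200, '2022-01'), (2, 1900, '2022-02'), (3, 2800, '2022-03');
""")

drill_across = conn.execute("""
WITH sales_by_month AS (
    SELECT d.year_month, SUM(s.sales) AS sales
    FROM sales_fact s
    JOIN date_dim d ON d.date_pk = s.date_fk
    GROUP BY d.year_month
),
cost_by_month AS (
    SELECT datemonth_fk AS year_month, SUM(cost) AS cost
    FROM cost_fact
    GROUP BY datemonth_fk
)
SELECT c.year_month, c.cost, s.sales
FROM cost_by_month c
LEFT JOIN sales_by_month s ON s.year_month = c.year_month
ORDER BY c.year_month
""").fetchall()

for row in drill_across:
    print(row)
```

Months without sales rows come back with an empty sales value, which is why each fact is aggregated before the join instead of joining the two fact tables directly.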
Degenerate dimension

✓ Transactional Sales fact

Transaction_PK  Amount  Payment_FK
1               $530    234-032
2               $553    234-032
3               $654    234-033

Payment_PK  Header
234-032     Type A
234-033     Type A
234-034     Type B

✓ All relevant information has already been extracted (to other dimensions)
Degenerate dimension

✓ Transactional Sales fact

Transaction_PK  Amount  Payment_FK
1               $530    234-032
2               $553    234-032
3               $654    234-033

Payment_PK
234-032
234-033
234-034

✓ All relevant information has already been extracted (to other dimensions)

✓ Attribute can still be useful


Degenerate dimension

✓ Transactional Sales fact

Transaction_PK  Amount  Payment_DD
1               $530    234-032
2               $553    234-032
3               $654    234-033

✓ All relevant information has already been extracted (to other dimensions)

✓ Attribute can still be useful

✓ Indicate that it is a degenerate dimension (e.g. suffix _DD)


Degenerate dimension

✓ Degenerate dimension = a dimension key without an associated dimension table

Transaction_PK  Amount  Payment_DD
1               $530    234-032
2               $553    234-032
3               $654    234-033

Occurs mostly in transactional fact tables

Invoice no., billing no. or order_id are typically degenerate dimensions
Junk dimensions

Transaction_PK  Amount  Payment_Type  Incoming / Outbound  Is_Bonus
1               $530    Wired         Incoming             Yes
2               $553    Credit Card   Outbound             No
3               $654    Cash          Incoming             No

Options for such flags / indicators:
1. Eliminate them if they are not relevant (What if they are relevant?)
2. Leave them as they are in the fact (Long text values? Table size?)
3. One Flag => One dimension (Very wide fact table?)

Alternative: Junk dimension


Junk dimensions

What is a junk dimension?


Dimension with various flags / indicators
with low cardinality
Junk dimensions

What is a junk dimension?


Like a box where we store items that we need but that
have no separate storage location.
Junk dimensions

Note:
We usually call it "junk dimension" only internally.
When talking to business users, we can refer to it as a
"transactional indicator dimension".
Junk dimensions

Transaction_PK Amount Payment_Type Incoming / Outbound Is_Bonus


1 $530 Wired Incoming Yes
2 $553 Credit Card Outbound No
3 $654 Cash Incoming No

Transaction_PK Amount Transactional_Flag_FK


1 $530 1
2 $553 7
3 $654 12

Incoming /
Flag_PK Payment_Type Outbound Is_Bonus
1 Wired Incoming Yes
2 Wired Incoming No
3 Wired Outbound Yes
4 Wired Outbound No
Junk dimensions

Payment_Type  Amount
Wired         $5350
Credit Card   $6553
Cash          $6754

Is_Bonus  Amount
Yes       $9350
No        $11857
Junk dimensions

Number of combinations
3 x 2 x 2 = 12
Transaction_PK Amount Payment_Type Incoming / Outbound Is_Bonus
1 $530 Wired Incoming Yes
2 $553 Credit Card Outbound No
3 $654 Cash Incoming No

Flag_PK Payment_Type Incoming / Outbound Is_Bonus


1 Wired Incoming Yes
2 Credit Card Outbound No
3 Cash Incoming No

12 Cash Outbound No
Junk dimensions
Many dimensions?

9 indicators with 4 values each

Many combinations!
4^9 = 262144

1. Extract only the combinations that actually occur in the fact table

2. Two or more junk dimensions, e.g. 4^5 = 1024
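
As a sketch of both approaches, the code below first builds the full junk dimension as the cross product of all flag values (3 x 2 x 2 = 12 rows) and then shows the alternative of keeping only the combinations observed in the facts. The key assignment is simply the enumeration order here, which is an assumption of this sketch and not necessarily the numbering used in the slides.

```python
from itertools import product

payment_types = ["Wired", "Credit Card", "Cash"]
directions = ["Incoming", "Outbound"]
bonus_flags = ["Yes", "No"]

# Full junk dimension: one row per possible combination
junk_dim = [
    {"Flag_PK": pk, "Payment_Type": p, "Direction": d, "Is_Bonus": b}
    for pk, (p, d, b) in enumerate(
        product(payment_types, directions, bonus_flags), start=1
    )
]

# Lookup from flag combination to surrogate key
lookup = {
    (row["Payment_Type"], row["Direction"], row["Is_Bonus"]): row["Flag_PK"]
    for row in junk_dim
}

transactions = [
    (1, 530, "Wired", "Incoming", "Yes"),
    (2, 553, "Credit Card", "Outbound", "No"),
    (3, 654, "Cash", "Incoming", "No"),
]

# Fact rows carry a single FK instead of three flag columns
fact = [(t_id, amount, lookup[(p, d, b)]) for t_id, amount, p, d, b in transactions]

# Alternative for many indicators: keep only combinations that occur
observed = sorted({(p, d, b) for _, _, p, d, b in transactions})

print(len(junk_dim), fact, len(observed))
print(4 ** 9)  # number of combinations for 9 indicators with 4 values each
```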


Role-playing dimension
Role-playing dimension

What is a role-playing dimension?


Dimension that is referenced multiple times
by a fact
Role-playing dimension

order_id  Order Date FK  No. Products  Product_FK  Production Start FK
1         20220102       100           32          20220103
2         20220103       100           32          20220104
3         20220103       100           32          20220103
4         20220104       100           32          20220106
5         20220104       100           32          20220108

Role 1: Order Date FK (Date FK)
Role 2: Production Start FK (Date FK)
Measure: No. Products

Date_PK   Date        Month    Short Month  Year-Quarter  Year  Weekday   Is_Weekend
20220101  2022-01-01  January  Jan          2022-Q1       2022  Saturday  1
20220102  2022-01-02  January  Jan          2022-Q1       2022  Sunday    1
20220103  2022-01-03  January  Jan          2022-Q1       2022  Monday    0
Role-playing dimension

Month (Orders received)  Products
January                  2500
February                 2700
…                        …

Month (Production started)  Products
January                     2650
February                    2450
…                           …
Role-playing dimension

✓ In BI tools, for example, via active & inactive relationships

✓ For analysis in SQL you can create an additional view for each role

✓ No duplicated data, yet each view appears like a separate dimension
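
A minimal sketch of the view approach, run against an in-memory SQLite database: one physical date dimension, two views that rename its columns per role. The view and column names (`order_date_dim`, `production_start_date_dim`, ...) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_pk INTEGER PRIMARY KEY, weekday TEXT);
INSERT INTO date_dim VALUES
  (20220102, 'Sunday'), (20220103, 'Monday'), (20220104, 'Tuesday');

CREATE TABLE order_fact (
  order_id INTEGER, order_date_fk INTEGER,
  no_products INTEGER, production_start_fk INTEGER);
INSERT INTO order_fact VALUES
  (1, 20220102, 100, 20220103),
  (2, 20220103, 100, 20220104),
  (3, 20220103, 100, 20220103);

-- One view per role; the data is stored only once in date_dim
CREATE VIEW order_date_dim AS
  SELECT date_pk AS order_date_pk, weekday AS order_weekday FROM date_dim;
CREATE VIEW production_start_date_dim AS
  SELECT date_pk AS production_start_pk, weekday AS production_start_weekday
  FROM date_dim;
""")

rows = conn.execute("""
SELECT f.order_id, o.order_weekday, p.production_start_weekday
FROM order_fact f
JOIN order_date_dim o ON o.order_date_pk = f.order_date_fk
JOIN production_start_date_dim p ON p.production_start_pk = f.production_start_fk
ORDER BY f.order_id
""").fetchall()

for row in rows:
    print(row)
```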
Slowly changing dimensions
Slowly changing dimensions

Till now we have pretended dimensions never change…

… indeed they are rather static usually …

… but surprise… they do change in the real world…

Develop a strategy to handle changes in dimensions...


Slowly changing dimensions

1. Be proactive: Ask about potential changes

2. Business users + IT

3. Strategy for each changing attribute

Kimball introduced SCD in 1995 and distinguished between different types (1, 2, 3, …).
Type 0: Retain Original
Type 0: Retain Original

✓ There won't be any changes

✓ Date Table (except for holidays etc.)

✓ "Original"

✓ Very simple and easy to maintain


Type 1: Overwrite

✓ Old attributes are just overwritten

✓ Only current state is reflected

Product_Key  Name                     Category
1            Sunglasses TR-7          Accessories
2            Chocolate bar 70% cacao  Sweets
3            Oat meal biscuits        Sweets

UPDATE

Product_Key  Name                         Category
1            Sunglasses TR-7              Accessories
2            Chocolate bar 70% cacao      Sweets
3            Delicious Oat meal biscuits  Biscuits
Type 1: Overwrite

✓ Very simple

✓ No Fact table needs to be modified

Problem
❖ History is lost!
❖ Insignificant changes
❖ Might affect / break existing queries

Annotations on the updated row (Name and Category changed by the UPDATE):
More significant
Not so significant
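
The loss of history can be made visible with a few rows. The sketch below uses the product and sales examples from these slides in an in-memory SQLite database (table names are assumptions, spellings normalized) and compares the amount per category before and after the Type 1 overwrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
INSERT INTO product_dim VALUES
  (1, 'Sunglasses TR-7', 'Accessories'),
  (2, 'Chocolate bar 70% cacao', 'Sweets'),
  (3, 'Oat meal biscuits', 'Sweets');

CREATE TABLE sales_fact (sales_key INTEGER PRIMARY KEY, product_fk INTEGER, amount INTEGER);
INSERT INTO sales_fact VALUES (1, 1, 25), (2, 2, 3), (3, 3, 4), (4, 2, 3), (5, 3, 4);
""")

def amount_by_category():
    """Total sales amount per (current) product category."""
    return dict(conn.execute(
        "SELECT d.category, SUM(f.amount) FROM sales_fact f "
        "JOIN product_dim d ON d.product_key = f.product_fk "
        "GROUP BY d.category"
    ).fetchall())

before = amount_by_category()

# Type 1: overwrite the attributes in place, the fact table is untouched
conn.execute(
    "UPDATE product_dim SET name = 'Delicious Oat meal biscuits', "
    "category = 'Biscuits' WHERE product_key = 3"
)

after = amount_by_category()
print(before)
print(after)
```

All oat meal biscuit sales, including those made before the change, now count towards the new category.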
Type 2: New row
Type 2: New row

✓ Problem with Type 1: No history of dimensions!

✓ Only current state is reflected

Type 2: New row

✓ Problem with Type 1: No history of dimensions!

✓ Only current state is reflected

Sales_Key  Name                     Amount
1          Sunglasses TR-7          $25
2          Chocolate bar 70% cacao  $3
3          Oat meal biscuits        $4
4          Chocolate bar 70% cacao  $3
5          Oat meal biscuits        $4

Before
Category     Amount
Accessories  $25
Sweets       $14

After
Category     Amount
Accessories  $25
Sweets       $6
Biscuits     $8

Correctly representing history
Category     Amount
Accessories  $25
Sweets       $10
Biscuits     $4
Type 2: New row

✓ Type 2: Perfectly partitions history


Default strategy
✓ Changes are reflected with history
Type 2: New row

Product_Key  Name                     Category
1            Sunglasses TR-7          Accessories
2            Chocolate bar 70% cacao  Sweets
3            Oat meal biscuits        Sweets

Add Row

Product_Key  Name                         Category
1            Sunglasses TR-7              Accessories
2            Chocolate bar 70% cacao      Sweets
3            Oat meal biscuits            Sweets
4            Delicious Oat meal biscuits  Biscuits
Type 2: New row

Sales_Key  Name                     Product_FK  Amount
1          Sunglasses TR-7          1           $25
2          Chocolate bar 70% cacao  2           $3
3          Oat meal biscuits        3           $4
4          Chocolate bar 70% cacao  2           $3
5          Oat meal biscuits        4           $4

Product_Key  Name                         Category
1            Sunglasses TR-7              Accessories
2            Chocolate bar 70% cacao      Sweets
3            Oat meal biscuits            Sweets
4            Delicious Oat meal biscuits  Biscuits

Respecting history
Category     Amount
Accessories  $25
Sweets       $10
Biscuits     $4

❑ No updates in fact
❑ From that moment new FK
Type 2: New row

Product_PK  Product_ID  Name                         Category
1           SG-TR7      Sunglasses TR-7              Accessories
2           CH-B70      Chocolate bar 70% cacao      Sweets
3           OT-BSC      Oat meal biscuits            Sweets
4           OT-BSC      Delicious Oat meal biscuits  Biscuits

No. of products?
Count distinct Product_ID

All current products?

Administering Type 2 SCD

Product_PK  Product_ID  Name                         Category     Ef_Date     Ex_Date
1           SG-TR7      Sunglasses TR-7              Accessories  2022-01-01  2100-01-01
2           CH-B70      Chocolate bar 70% cacao      Sweets       2022-01-01  2100-01-01
3           OT-BSC      Oat meal biscuits            Sweets       2022-01-01  2022-05-31
4           OT-BSC      Delicious Oat meal biscuits  Biscuits     2022-06-01  2100-01-01

Ef_Date / Ex_Date: period in which the values are valid
Ex_Date: instead of null, better use a date far in the future

✓ Necessary also in ETL to use correct FK

✓ Requires Surrogate key instead of Natural key



Administering Type 2 SCD
Correct FK?

Product_PK  Product_ID  Name                         Category     Ef_Date     Ex_Date     Is_Current
1           SG-TR7      Sunglasses TR-7              Accessories  2022-01-01  2100-01-01  Yes
2           CH-B70      Chocolate bar 70% cacao      Sweets       2022-01-01  2100-01-01  Yes
3           OT-BSC      Oat meal biscuits            Sweets       2022-01-01  2022-05-31  No
4           OT-BSC      Delicious Oat meal biscuits  Biscuits     2022-06-01  2100-01-01  Yes

✓ Add row in the Dimension first

✓ Lookup in the Dimension with Natural key + Ef_/Ex_Date

✓ Additional column Is_Current
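
The administration steps can be sketched as two small functions: one that closes the current row and adds the new version, and one that looks up the correct FK via natural key plus validity period. This is an illustrative sketch on an in-memory SQLite database; the function names and lower-case column names are assumptions.

```python
import sqlite3
from datetime import date, timedelta

FAR_FUTURE = "2100-01-01"  # used instead of NULL for open-ended rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
  product_pk INTEGER PRIMARY KEY AUTOINCREMENT,
  product_id TEXT, name TEXT, category TEXT,
  ef_date TEXT, ex_date TEXT, is_current TEXT);
INSERT INTO product_dim (product_id, name, category, ef_date, ex_date, is_current) VALUES
  ('SG-TR7', 'Sunglasses TR-7', 'Accessories', '2022-01-01', '2100-01-01', 'Yes'),
  ('CH-B70', 'Chocolate bar 70% cacao', 'Sweets', '2022-01-01', '2100-01-01', 'Yes'),
  ('OT-BSC', 'Oat meal biscuits', 'Sweets', '2022-01-01', '2100-01-01', 'Yes');
""")

def apply_type2_change(product_id, name, category, change_date):
    """Expire the current row and insert the new version as a new row."""
    previous_day = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    conn.execute(
        "UPDATE product_dim SET ex_date = ?, is_current = 'No' "
        "WHERE product_id = ? AND is_current = 'Yes'",
        (previous_day, product_id),
    )
    conn.execute(
        "INSERT INTO product_dim (product_id, name, category, ef_date, ex_date, is_current) "
        "VALUES (?, ?, ?, ?, ?, 'Yes')",
        (product_id, name, category, change_date, FAR_FUTURE),
    )

def lookup_product_pk(product_id, event_date):
    """Find the surrogate key that was valid on the date of the fact."""
    row = conn.execute(
        "SELECT product_pk FROM product_dim "
        "WHERE product_id = ? AND ? BETWEEN ef_date AND ex_date",
        (product_id, event_date),
    ).fetchone()
    return row[0] if row else None

apply_type2_change("OT-BSC", "Delicious Oat meal biscuits", "Biscuits", "2022-06-01")

pk_may = lookup_product_pk("OT-BSC", "2022-05-15")
pk_june = lookup_product_pk("OT-BSC", "2022-06-10")
n_products = conn.execute("SELECT COUNT(DISTINCT product_id) FROM product_dim").fetchone()[0]
n_current = conn.execute("SELECT COUNT(*) FROM product_dim WHERE is_current = 'Yes'").fetchone()[0]
print(pk_may, pk_june, n_products, n_current)
```

The dimension row is added first, so that facts loaded afterwards can already resolve to the new key.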


Mixing Type 1 + 2

Product_PK  Product_ID  Name                         Category     Ef_Date     Ex_Date
1           SG-TR7      Sunglasses TR-7              Accessories  2022-01-01  2100-01-01
2           CH-B70      Chocolate bar 70% cacao      Sweets       2022-01-01  2100-01-01
3           OT-BSC      Oat meal biscuits            Sweets       2022-01-01  2100-01-01
4           OT-BSC      Delicious Oat meal biscuits  Biscuits     2022-06-01  2100-01-01

✓ Some attributes can be Type 1 and some Type 2

Mixing Type 1 + 2

✓ No set-in-stone rules; this needs to be defined with business users

✓ Not a technical decision


Type 3: Additional Attributes

Product_PK  Product_ID  Name                         Category     Ef_Date     Ex_Date
1           SG-TR7      Sunglasses TR-7              Accessories  2022-01-01  2100-01-01
2           CH-B70      Chocolate bar 70% cacao      Sweets       2022-01-01  2100-01-01
3           OT-BSC      Oat meal biscuits            Sweets       2022-01-01  2100-01-01
4           OT-BSC      Delicious Oat meal biscuits  Biscuits     2022-06-01  2100-01-01

✓ Type 2 – Default strategy to maintain history

✓ Type 1 – Static

✓ Type 3 – In-between: Switching back & forth between versions

Type 3: Additional Attributes

Product_PK  Product_ID  Name                     Category     Prev_Category
1           SG-TR7      Sunglasses TR-7          Accessories  Accessories
2           CH-B70      Chocolate bar 70% cacao  Sweets       Sweets
3           OT-BSC      Oat meal biscuits        Biscuits     Sweets

✓ Instead of adding a row – we add a column

Category     Amount
Accessories  $25
Sweets       $6
Biscuits     $8

Prev_Category  Amount
Accessories    $25
Sweets         $14
Type 3: Additional Attributes

✓ Instead of adding a row – we add a column

✓ Typically used for significant changes at a time (e.g. restructurings in organizations)
Type 3: Additional Attributes

Sales_Key  Name                     Region_FK  Amount
1          Sunglasses TR-7          1          $25
2          Chocolate bar 70% cacao  2          $3
3          Oat meal biscuits        3          $4
4          Chocolate bar 70% cacao  2          $3
5          Oat meal biscuits        3          $4

Reg_PK  Region  Prev_Region
1       North   North
2       West    West
3       South   West

✓ Instead of adding a row – we add a column

✓ Typically used for significant changes at a time (e.g. restructurings in organizations)

✓ Enables switching between historic / current view


Type 3: Additional Attributes

Sales_Key  Name                     Region_FK  Amount
1          Sunglasses TR-7          1          $25
2          Chocolate bar 70% cacao  2          $3
3          Oat meal biscuits        3          $4
4          Chocolate bar 70% cacao  2          $3
5          Oat meal biscuits        3          $4

Reg_PK  Region  Prev_Region
1       North   North
2       West    West
3       South   West
4       East    Not applicable

✓ New attributes => New rows

✓ It is possible to add multiple historic columns

Limitations
❖ Not suitable for frequent or unpredictable changes => better Type 2

❖ Minor changes => better Type 1

Least frequent type
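
Switching between the current and the historic view comes down to grouping by a different column. A minimal sketch with the product example (in-memory SQLite, assumed table names, spellings normalized):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_pk INTEGER PRIMARY KEY, name TEXT, category TEXT);
INSERT INTO product_dim VALUES
  (1, 'Sunglasses TR-7', 'Accessories'),
  (2, 'Chocolate bar 70% cacao', 'Sweets'),
  (3, 'Oat meal biscuits', 'Sweets');

CREATE TABLE sales_fact (sales_key INTEGER PRIMARY KEY, product_fk INTEGER, amount INTEGER);
INSERT INTO sales_fact VALUES (1, 1, 25), (2, 2, 3), (3, 3, 4), (4, 2, 3), (5, 3, 4);

-- Type 3: add a column instead of a row
ALTER TABLE product_dim ADD COLUMN prev_category TEXT;
UPDATE product_dim SET prev_category = category;
UPDATE product_dim SET category = 'Biscuits' WHERE product_pk = 3;
""")

def amount_by(column):
    """Total amount grouped by either the current or the previous category."""
    if column not in ("category", "prev_category"):
        raise ValueError("unsupported column")
    return dict(conn.execute(
        f"SELECT d.{column}, SUM(f.amount) FROM sales_fact f "
        f"JOIN product_dim d ON d.product_pk = f.product_fk GROUP BY d.{column}"
    ).fetchall())

current_view = amount_by("category")
historic_view = amount_by("prev_category")
print(current_view)
print(historic_view)
```

Only one previous version is kept per added column, which is why the approach does not suit frequent changes.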


Type 3: Additional Attributes

Limitations
✓ Not suitable for frequent or unpredictable changes => better Type 2
✓ It is possible to add multiple historic columns
✓ Minor changes => better Type 1

Least frequent type
(two versions needed)
What is an ETL?
What is an ETL?

✓ How to design dimensional model

✓ How to bring data from source to DWH

= ETL process
Data Warehouse Layers

Extract, Transform, Load

[Diagram: Sales data, CRM system and other data sources -> ETL -> Data warehouse (centralized location for data)]
Data Warehouse Layers

[Diagram: data sources -> Staging -> Core / Data Warehouse -> Data Mart 1, Data Mart 2 and Predictive Analytics]
Extract, Transform, Load ETL

ETL-Tool

Set of (built-in) tools to…


✓ Connect to different data sources

✓ Transform / Clean data

✓ Load data
Everything we need to build our DWH!
Extract, Transform, Load ETL

ETL-Setup

Building workflows…
✓ Staging workflow

✓ Core / Transformation workflow

✓ Data Mart workflow


Extract, Transform, Load ETL

ETL-Setup

Jobs …
✓ Run the workflows

✓ Are scheduled based on defined rules


Extract
Extracting

✓ Data is part of DWH

✓ Understanding data

✓ From here data is transformed

✓ Transient (most commonly)

✓ All data copied and then deleted

[Diagram: data sources -> Staging]
Extracting types

Initial Load        Delta Load
✓ First (real) run  ✓ Subsequent runs
✓ All data          ✓ Only additional data
Initial Load
Initial Load
✓ First initial extraction from source data

✓ After discussion with the business users + IT

o What data is needed


o When is a good time to load the data
(Night? Weekends?)

o Smaller extractions to test


Initial Load

✓ Initial Load to Core with Transformations

✓ After all the transformation steps have been designed

✓ Just done for all data from Staging (no filtering)


Delta Load
✓ Incremental periodic Extraction / Load

✓ Delta column for every table

Sales_Date  Name                     Amount
2022-06-06  Sunglasses TR-7          $25
2022-06-06  Chocolate bar 70% cacao  $3
2022-06-07  Oat meal biscuits        $4
2022-06-07  Chocolate bar 70% cacao  $3
2022-06-08  Oat meal biscuits        $4

✓ Transaction date, create_date, etc.


Delta Load
✓ Incremental periodic Extraction / Load

✓ Delta column for every table

Sales_Key  Name                     Amount
1          Sunglasses TR-7          $25
2          Chocolate bar 70% cacao  $3
3          Oat meal biscuits        $4
4          Chocolate bar 70% cacao  $3
5          Oat meal biscuits        $4

✓ Incrementing number (Suitable primary key)


Delta Load
✓ Incremental periodic Extraction / Load

✓ Remember MAX(Sales_Key)

✓ MAX(Sales_Key) -> Variable X

✓ Next run: Sales_Key > X
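
A sketch of this high-water-mark logic on an in-memory SQLite database follows. The table names `source_sales` and `staging_sales` and the function `delta_load` are assumptions; in a real setup the variable X would be persisted between job runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_sales  (sales_key INTEGER PRIMARY KEY, name TEXT, amount INTEGER);
CREATE TABLE staging_sales (sales_key INTEGER PRIMARY KEY, name TEXT, amount INTEGER);
INSERT INTO source_sales VALUES
  (1, 'Sunglasses TR-7', 25), (2, 'Chocolate bar 70% cacao', 3),
  (3, 'Oat meal biscuits', 4), (4, 'Chocolate bar 70% cacao', 3),
  (5, 'Oat meal biscuits', 4);
""")

def delta_load(last_key):
    """Extract rows with sales_key > last_key; return the new MAX(sales_key)."""
    rows = conn.execute(
        "SELECT sales_key, name, amount FROM source_sales "
        "WHERE sales_key > ? ORDER BY sales_key",
        (last_key,),
    ).fetchall()
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)
    return rows[-1][0] if rows else last_key

x = delta_load(0)  # first run: behaves like an initial load
conn.execute("INSERT INTO source_sales VALUES (6, 'Sunglasses TR-7', 25)")
x_next = delta_load(x)  # next run: only sales_key > x is extracted

staged = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(x, x_next, staged)
```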


What if there is no delta column?

✓ Some tools can automatically capture which data has already been loaded

✓ Otherwise do a full load every time and compare the data with the data that is already loaded

✓ Depending on the data volumes -> performance


Load (Insert/Update)
Data Warehouse Layers

[Diagram: data sources -> Staging -> Core / Data Warehouse, with the labels Delta, Empty and Insert / Update]
Data Warehouse Layers

Insert / Update into Core / Data Warehouse
o INSERT / APPEND
o UPDATE

Product_PK  Name
1           Sunglasses TR-7
2           Chocolate bar 70% cacao
3           Oat meal biscuits
4           Chocolate bar 70% cacao
5           Oat meal biscuits
Data Warehouse Layers

Insert / Update into Core / Data Warehouse
o DELETE
o Typically we don't delete data

Product_PK  Name                     Deleted
1           Sunglasses TR-7          No
2           Chocolate bar 70% cacao  Yes
3           Oat meal biscuits        No
4           Chocolate bar 70% cacao  No
5           Oat meal biscuits        No
Transform
Data Warehouse Layers

[Diagram: data sources -> Staging -> Transform -> Insert / Update -> Core / Data Warehouse]
Main goals
Create a consolidated view of all data for
analysis purposes

1. Consolidate (from multiple systems)

2. Reshape (for analysis purposes)


Main goals
1. Consolidate (from multiple systems)
Transaction_ID Amount Date
T1 $5030 10/1/2022
T2 $5053 11/1/2022
T3 $654 12/1/2022

Transaction_ID Amount (in thousands) Transaction_Date


T14 $5.345 10-1-2022
T15 $7.953 11-1-2022
T16 $9.654 12-1-2022

Making the data compatible & consistent!
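A minimal sketch of this consolidation step, assuming both systems write dates day-first and that the second system stores amounts in thousands. The function name `normalize` and the tuple layout are illustrative assumptions, not part of the course material.

```python
from datetime import datetime

# System A: full dollar amounts, dates written as D/M/YYYY (assumed day-first)
system_a = [("T1", "$5030", "10/1/2022"), ("T2", "$5053", "11/1/2022")]
# System B: amounts in thousands, dates written as D-M-YYYY (assumed day-first)
system_b = [("T14", "$5.345", "10-1-2022"), ("T15", "$7.953", "11-1-2022")]


def normalize(row, date_format, in_thousands):
    """Return (id, amount in whole dollars, ISO date) for one source row."""
    transaction_id, amount_text, date_text = row
    amount = float(amount_text.lstrip("$"))
    if in_thousands:
        amount *= 1000
    iso_date = datetime.strptime(date_text, date_format).date().isoformat()
    return transaction_id, round(amount), iso_date


consolidated = (
    [normalize(r, "%d/%m/%Y", in_thousands=False) for r in system_a]
    + [normalize(r, "%d-%m-%Y", in_thousands=True) for r in system_b]
)
print(consolidated)
```

After this step both sources share one unit and one date format, so they can be loaded into the same table.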


Main goals
2. Reshape according to business requirements
Transaction_ID Amount Date
T1 $5030 10/1/2022
T2 $5053 11/1/2022
T3 $654 12/1/2022

Transaction_PK Amount Date_FK


1 $5030 20220110
2 $5053 20220111
3 $654 20220112
Main goals
2. Reshape according to business requirements
Month January-2022 February-2022 March-2022 Total
Amount $5030 $6053 $2455 $13538

Month Amount
January-2022 $5030
February-2022 $6053
March-2022 $2455
Total $13538

Clean & reshape data
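The reshape from one column per month to one row per month can be sketched as follows. The dictionary layout is an assumption made for illustration; the total is derived from the rows instead of being stored as a row.

```python
# Wide layout: one column per month (values from the slide)
wide = {"January-2022": 5030, "February-2022": 6053, "March-2022": 2455}

# Long layout: one (Month, Amount) row per month
long_rows = [(month, amount) for month, amount in wide.items()]

# Totals can be computed on demand from the long layout
total = sum(amount for _, amount in long_rows)
print(long_rows, total)
```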


Kinds of transformations

Basic

▪ Deduplication
▪ Filtering (rows & columns)
▪ Cleaning & Mapping (Integration)
▪ Value Standardization (Integration)
▪ Key Generation

Advanced

▪ Joining
▪ Splitting
▪ Aggregating
▪ Deriving new values
Kinds of transformations
Basic ▪ Deduplication
Store 1
product_id name category
P521 Almonds 150g Nuts
P252 Garlic Fruits & Vegetables
P533 Banana Fruits & Vegetables
P684 Chocolate Sweets & Snacks
P755 Spicy Chips Sweets & Snacks

Store 2
product_id name category
P521 Almonds 150g Nuts
P672 Orange Juice Drinks
P423 Green Apples Fruits & Vegetables
P564 Chocolate Cookies Sweets & Snacks
P755 Spicy Chips Sweets & Snacks
Kinds of transformations
Basic ▪ Deduplication
Product Dimension
product_id name category
P521 Almonds 150g Nuts
P252 Garlic Fruits & Vegetables
P533 Banana Fruits & Vegetables
P684 Chocolate Sweets & Snacks
P755 Spicy Chips Sweets & Snacks
P521 Almonds 150g Nuts
P672 Orange Juice Drinks
P423 Green Apples Fruits & Vegetables
P564 Chocolate Cookies Sweets & Snacks
P755 Spicy Chips Sweets & Snacks
Kinds of transformations
Basic ▪ Deduplication
Product Dimension
product_id name category
P521 Almonds 150g Nuts
P252 Garlic Fruits & Vegetables
P533 Banana Fruits & Vegetables
P684 Chocolate Sweets & Snacks
P755 Spicy Chips Sweets & Snacks
P672 Orange Juice Drinks
P423 Green Apples Fruits & Vegetables
P564 Chocolate Cookies Sweets & Snacks
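A minimal deduplication sketch, assuming that `product_id` identifies a product across both stores and that the first occurrence wins. Only a few of the rows are reproduced here.

```python
store_1 = [
    ("P521", "Almonds 150g", "Nuts"),
    ("P252", "Garlic", "Fruits & Vegetables"),
    ("P755", "Spicy Chips", "Sweets & Snacks"),
]
store_2 = [
    ("P521", "Almonds 150g", "Nuts"),
    ("P672", "Orange Juice", "Drinks"),
    ("P755", "Spicy Chips", "Sweets & Snacks"),
]

seen = set()
product_dimension = []
for row in store_1 + store_2:
    product_id = row[0]
    if product_id not in seen:      # keep the first occurrence only
        seen.add(product_id)
        product_dimension.append(row)

print([row[0] for row in product_dimension])
```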
Kinds of transformations
Basic ▪ Filtering rows
Filter out irrelevant rows
Sales_Date Name Amount Type
2022-06-06 Sunglasses TR-7 $25 Sale
2022-06-06 Chocolate bar 70% cacao $3 Sale
2022-06-06 Sunglasses TR-7 $-25 Refund
2022-06-07 Oat meal biscuits $4 Sale
2022-06-07 Chocolate bar 70% cacao $3 Sale
2022-06-08 Oat meal biscuits $4 Sale
Kinds of transformations
Basic ▪ Filtering rows
Filter out irrelevant rows
Sales_Date Name Amount Type
2022-06-06 Sunglasses TR-7 $25 Sale
2022-06-06 Chocolate bar 70% cacao $3 Sale
2022-06-07 Oat meal biscuits $4 Sale
2022-06-07 Chocolate bar 70% cacao $3 Sale
2022-06-08 Oat meal biscuits $4 Sale
Kinds of transformations
Basic ▪ Filtering columns
Filter out irrelevant columns
Sales_Date Name Amount Type
2022-06-06 Sunglasses TR-7 $25 Sale
2022-06-06 Chocolate bar 70% cacao $3 Sale
2022-06-07 Oat meal biscuits $4 Sale
2022-06-07 Chocolate bar 70% cacao $3 Sale
2022-06-08 Oat meal biscuits $4 Sale
Kinds of transformations
Basic ▪ Filtering columns
Filter out irrelevant columns

Sales_Date Name Amount


2022-06-06 Sunglasses TR-7 $25
2022-06-06 Chocolate bar 70% cacao $3
2022-06-07 Oat meal biscuits $4
2022-06-07 Chocolate bar 70% cacao $3
2022-06-08 Oat meal biscuits $4
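Row and column filtering can be combined in one pass. The sketch below assumes rows are held as dictionaries and that only rows of type "Sale" are relevant; both are illustrative assumptions.

```python
sales = [
    {"Sales_Date": "2022-06-06", "Name": "Sunglasses TR-7", "Amount": 25, "Type": "Sale"},
    {"Sales_Date": "2022-06-06", "Name": "Sunglasses TR-7", "Amount": -25, "Type": "Refund"},
    {"Sales_Date": "2022-06-07", "Name": "Oat meal biscuits", "Amount": 4, "Type": "Sale"},
]
keep_columns = ("Sales_Date", "Name", "Amount")

filtered = [
    {column: row[column] for column in keep_columns}   # column filter
    for row in sales
    if row["Type"] == "Sale"                            # row filter
]
print(filtered)
```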
Kinds of transformations
Basic ▪ Cleaning & Mapping (Integration)
Mapping different values

Name Gender
Taylor M
Isabella F
Sofia F

Name Gender
Lydia Female
Naomi Female
Leon Male

M => Male
F => Female
Kinds of transformations
Basic ▪ Cleaning & Mapping (Integration)
Mapping different values

Name Gender
Taylor M
Isabella Fe
Sofia F

Name Gender
Lydia Female
Naomi Female
Leon Male

M => Male
F => Female
Kinds of transformations
Basic ▪ Cleaning & Mapping (Integration)
Mapping different values

Name Gender
Taylor Male
Isabella Female
Sofia Female

Name Gender
Lydia Female
Naomi Female
Leon Male

M => Male
F => Female
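A mapping like this is often implemented as a lookup table. In the sketch below the dictionary `GENDER_MAP` and the treatment of "Fe" as a dirty variant of "F" are assumptions; unknown codes are returned as `None` so they can be reviewed instead of guessed.

```python
# Assumed mapping table; "Fe" is treated as a dirty variant of "F"
GENDER_MAP = {"M": "Male", "F": "Female", "FE": "Female",
              "MALE": "Male", "FEMALE": "Female"}


def map_gender(value):
    """Return the standardized gender, or None for unknown codes."""
    return GENDER_MAP.get(value.strip().upper())


source = [("Taylor", "M"), ("Isabella", "Fe"), ("Sofia", "F"), ("Leon", "Male")]
cleaned = [(name, map_gender(gender)) for name, gender in source]
print(cleaned)
```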
Kinds of transformations
Basic ▪ Cleaning & Mapping (Integration)
Mapping different values

Day Sales
Monday $500
Tuesday $760
Wednesday null
null => 0
Day Sales
Monday $500
Tuesday $760
Wednesday $0
Kinds of transformations
Basic ▪ Cleaning & Mapping (Integration)
Mapping different values

Month Sales
January '22 $500
February '22 $760
March '22 $245

Month Sales
January 2022 $1500
February 2022 $450
March 2022 $321
Kinds of transformations
Basic ▪ Value Standardization (Integration)
Mapping different values

Month Sales
January 2022 $500
February 2022 $760
March 2022 $245

Month Sales in thsd
January 2022 $1.5
February 2022 $4.550
March 2022 $3.321

Month Sales
January 2022 $1500
February 2022 $4550
March 2022 $3321
Kinds of transformations
Basic ▪ Key Generation
Product Dimension
Product_PK product_id name category
1 P521 Almonds 150g Nuts
2 P252 Garlic Fruits & Vegetables
3 P533 Banana Fruits & Vegetables
4 P684 Chocolate Sweets & Snacks
5 P755 Spicy Chips Sweets & Snacks
6 P521 Almonds 150g Nuts
7 P672 Orange Juice Drinks
8 P423 Green Apples Fruits & Vegetables
9 P564 Chocolate Cookies Sweets & Snacks
10 P755 Spicy Chips Sweets & Snacks
Kinds of transformations
Advanced ▪ Joining
Product Dimension
Product_PK product_id name category
1 P521 Almonds 150g Nuts
2 P252 Garlic Fruits & Vegetables
3 P533 Banana Fruits & Vegetables
4 P684 Chocolate Sweets & Snacks
5 P755 Spicy Chips Sweets & Snacks
Sales Fact
Sales_PK product_id Date
3 P533 2022-01-01
4 P252 2022-01-01
5 P755 2022-01-02
6 P684 2022-01-02
7 P755 2022-01-02
Kinds of transformations
Advanced ▪ Joining
Product Dimension
Product_PK product_id name category
1 P521 Almonds 150g Nuts
2 P252 Garlic Fruits & Vegetables
3 P533 Banana Fruits & Vegetables
4 P684 Chocolate Sweets & Snacks
5 P755 Spicy Chips Sweets & Snacks
Sales Fact
Sales_PK product_id Product_FK Date
3 P533 3 2022-01-01
4 P252 2 2022-01-01
5 P755 5 2022-01-02
6 P684 4 2022-01-02
7 P755 5 2022-01-02
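The foreign-key lookup can be sketched as a join on the natural key `product_id`. The tuple layouts and the name `pk_by_product_id` are assumptions for illustration; an ETL tool or SQL join would do the same job.

```python
product_dimension = [
    (1, "P521"), (2, "P252"), (3, "P533"), (4, "P684"), (5, "P755"),
]  # (Product_PK, product_id)
sales_fact = [
    (3, "P533", "2022-01-01"),
    (4, "P252", "2022-01-01"),
    (5, "P755", "2022-01-02"),
]  # (Sales_PK, product_id, Date)

# Lookup from the natural key to the surrogate key
pk_by_product_id = {product_id: pk for pk, product_id in product_dimension}

# Add Product_FK to every fact row
sales_with_fk = [
    (sales_pk, product_id, pk_by_product_id[product_id], date)
    for sales_pk, product_id, date in sales_fact
]
print(sales_with_fk)
```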
Kinds of transformations
Advanced ▪ Joining
Product Dimension
Product_PK product_id name category Eff_Date Exp_Date
1 P521 Almonds 150g Nuts 2021-01-01 2121-01-01
2 P252 Garlic Fruits & Vegetables 2021-01-01 2121-01-01
3 P533 Banana Fruits & Vegetables 2021-01-01 2121-01-01
4 P684 Chocolate Sweets & Snacks 2021-01-01 2121-01-01
5 P755 Spicy Chips Sweets & Snacks 2021-01-01 2121-01-01
Sales Fact
Sales_PK product_id Product_FK Date
3 P533 3 2022-01-01
4 P252 2 2022-01-01
5 P755 5 2022-01-02
6 P684 4 2022-01-02
7 P755 5 2022-01-02
Kinds of transformations
Advanced ▪ Joining
Product Table
Product_PK product_id name Category_id
1 P521 Almonds 150g 1
2 P252 Garlic 2
3 P533 Banana 2
4 P684 Chocolate 3
5 P755 Spicy Chips 3

Category table
Category_id Category
1 Nuts
2 Fruits & Vegetables
3 Sweets
Kinds of transformations
Advanced ▪ Joining
Product Dimension
Product_PK product_id name category
1 P521 Almonds 150g Nuts
2 P252 Garlic Fruits & Vegetables
3 P533 Banana Fruits & Vegetables
4 P684 Chocolate Sweets & Snacks
5 P755 Spicy Chips Sweets & Snacks
Kinds of transformations
Advanced ▪ Splitting
• By length / position
• By Delimiter

Store Dimension
Store_id Location
1 New York, NY 10011
2 Orland Park, IL 60462
3 Houston, TX 77002

Store_id City Location
1 New York NY 10011
2 Orland Park IL 60462
3 Houston TX 77002

Store_id City State ZIP
1 New York NY 10011
2 Orland Park IL 60462
3 Houston TX 77002
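Both splitting styles can be shown on the Location column. The sketch assumes every value follows the pattern "City, ST ZIP" with a two-letter state code; real data would need validation first.

```python
locations = {1: "New York, NY 10011", 2: "Orland Park, IL 60462", 3: "Houston, TX 77002"}

stores = []
for store_id, location in locations.items():
    city, rest = location.split(", ")      # split by delimiter
    state, zip_code = rest[:2], rest[3:]   # split by length / position
    stores.append((store_id, city, state, zip_code))

print(stores)
```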
Kinds of transformations
Advanced ▪ Aggregations

• SUM
• COUNT
• DISTINCT COUNT
• AVERAGE

Sales_Date Name Amount
2022-06-06 Sunglasses TR-7 $25
2022-06-06 Chocolate bar 70% cacao $3
2022-06-07 Oat meal biscuits $4
2022-06-07 Chocolate bar 70% cacao $3
2022-06-08 Oat meal biscuits $4

Sales_Date No. of sales Amount
2022-06-06 2 $28
2022-06-07 2 $7
2022-06-08 1 $4
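The daily aggregation (COUNT and SUM per Sales_Date) can be sketched as follows. The tuple layout is an assumption; the values are taken from the slide.

```python
from collections import defaultdict

sales = [
    ("2022-06-06", 25), ("2022-06-06", 3),
    ("2022-06-07", 4), ("2022-06-07", 3),
    ("2022-06-08", 4),
]  # (Sales_Date, Amount)

count_by_date = defaultdict(int)
sum_by_date = defaultdict(int)
for sales_date, amount in sales:
    count_by_date[sales_date] += 1       # COUNT
    sum_by_date[sales_date] += amount    # SUM

daily = [(d, count_by_date[d], sum_by_date[d]) for d in sorted(sum_by_date)]
print(daily)
```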
Kinds of transformations
Advanced ▪ Deriving Values
Sales_Date Name Amount Tax
2022-06-06 Sunglasses TR-7 $25 17%
2022-06-06 Chocolate bar 70% cacao $3 6%
2022-06-07 Oat meal biscuits $4 6%
2022-06-07 Chocolate bar 70% cacao $3 6%
2022-06-08 Oat meal biscuits $4 6%

Sales_Date Name Amount Tax Tax amount


2022-06-06 Sunglasses TR-7 $25 17% $4.25
2022-06-06 Chocolate bar 70% cacao $3 6% $0.18
2022-06-07 Oat meal biscuits $4 6% $0.24
2022-06-07 Chocolate bar 70% cacao $3 6% $0.18
2022-06-08 Oat meal biscuits $4 6% $0.24
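The derived column is simply Amount x Tax. The sketch uses `Decimal` to avoid floating-point rounding on currency values; the tuple layout is an assumption.

```python
from decimal import Decimal

sales = [
    ("Sunglasses TR-7", Decimal("25"), Decimal("0.17")),
    ("Chocolate bar 70% cacao", Decimal("3"), Decimal("0.06")),
    ("Oat meal biscuits", Decimal("4"), Decimal("0.06")),
]  # (Name, Amount, Tax rate)

# Append the derived "Tax amount" column, rounded to cents
with_tax = [
    (name, amount, rate, (amount * rate).quantize(Decimal("0.01")))
    for name, amount, rate in sales
]
print([str(row[3]) for row in with_tax])
```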
Demo: Plan of attack

Extract Transform + Load

Source Staging Core

Add Surrogate key Clean data

Delta logic Add additional column


Demo: Plan of attack

1. Look at the problem and plan

2. Set up tables & schema

3. Output staging table (+ truncate)

4. Transform + Load:
• Read from staging
• Transformation (Clean + Extract)
• Update/Insert
Processing Order

In which order should the steps be processed?

Facts -> Dimensions? Dimensions -> Facts?

Consider dependencies and plan carefully!


Processing order

1. 2.
Extract Transform + Load

Source Staging Core

✓ One table at a time ✓ One table at a time


Processing order
Extract
Order doesn't really matter

Transform + Load

Dimensions -> Facts

Product Dimension
Product_PK product_id name category
1 P521 Almonds 150g Nuts
2 P252 Garlic Fruits & Vegetables
3 P533 Banana Fruits & Vegetables
4 P684 Chocolate Sweets & Snacks
5 P755 Spicy Chips Sweets & Snacks

Sales Fact
Sales_PK product_id Product_FK Date
3 P533 3 2022-01-01
4 P252 2 2022-01-01
5 P755 5 2022-01-02
6 P684 4 2022-01-02
7 P755 5 2022-01-02
Processing Order for dimensions
Transform + Load

Start Dimension 1 Dimension 2


✓ New values ✓ New values
✓ Transformations ✓ Transformations
✓ SCD Updates / Load ✓ SCD Updates / Load
Processing Order for facts
Transform + Load

Start Dimension 1 Dimension 2

Finish Fact 2 Fact 1


Processing Order for facts
Extract

Start Dimension 1 Dimension 2

Finish Fact 2 Fact 1
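One way to derive a safe processing order is a topological sort over the dependencies. The job names below are hypothetical; the only rule encoded is that a fact table depends on the dimensions it references.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical job names; each job maps to the jobs it depends on
dependencies = {
    "dim_product": set(),
    "dim_payment": set(),
    "fact_sales": {"dim_product", "dim_payment"},
}

# Dimensions come out first, the fact table last
load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)
```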


Type 1: Overwrite

✓ Old attributes are just overwritten

✓ Only current state is reflected


Product_Key Name Category
1 Sunglasses TR-7 Accessories
2 Chocolate bar 70% cacao Sweets
3 Oat meal biscuits Sweets

UPDATE

Product_Key Name Category


1 Sunglasses TR-7 Accessories
2 Chocolate bar 70% cacao Sweets
3 Delicious Oat meal biscuits Biscuits
Type 2: New row

✓ Problem with Type 1: No history of dimensions!

✓ Only current state is reflected

Product_Key Name Category


1 Sunglasses TR-7 Accessories
2 Chocolate bar 70% cacao Sweets
3 Oat meal biscuits Sweets

UPDATE

Product_Key Name Category


1 Sunglasses TR-7 Accessories
2 Chocolate bar 70% cacao Sweets
3 Delicious Oat meal biscuits Biscuits
Type 3: Additional Attributes
Product_PK Product_ID Name Category Ef_Date Ex_Date
1 SG-TR7 Sunglasses TR-7 Accessories 2022-01-01 2100-01-01
2 CH-B70 Chocolate bar 70% cacao Sweets 2022-01-01 2100-01-01
3 OT-BSC Oat meal biscuits Sweets 2022-01-01 2100-01-01
4 OT-BSC Delicious Oat meal biscuits Biscuits 2022-06-01 2100-01-01

✓ Type 2 – Default strategy to maintain history

✓ Type 1 – Static

✓ Type 3 – In-between: Switching back & forth between versions
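The table with Ef_Date and Ex_Date above shows a new row being added for a changed product, which is the Type 2 pattern. A minimal sketch follows; closing the old version by setting its Ex_Date to the change date is an assumption (the table does not show it), and the function name `apply_type_2` is hypothetical.

```python
OPEN_END = "2100-01-01"  # far-future expiry date, as in the table above

# (Product_PK, Product_ID, Name, Category, Ef_Date, Ex_Date)
dim = [
    [1, "SG-TR7", "Sunglasses TR-7", "Accessories", "2022-01-01", OPEN_END],
    [3, "OT-BSC", "Oat meal biscuits", "Sweets", "2022-01-01", OPEN_END],
]


def apply_type_2(rows, product_id, name, category, change_date):
    """Expire the current version of product_id and append a new row."""
    for row in rows:
        if row[1] == product_id and row[5] == OPEN_END:
            row[5] = change_date          # assumption: close the old version
    next_pk = max(row[0] for row in rows) + 1
    rows.append([next_pk, product_id, name, category, change_date, OPEN_END])


apply_type_2(dim, "OT-BSC", "Delicious Oat meal biscuits", "Biscuits", "2022-06-01")
print(dim)
```

Fact rows keep pointing at the Product_PK that was valid when they were loaded, which is how the history is preserved.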


Plan of attack

Case study:
Set up a complete ETL workflow
Plan of attack

1. Look at the problem and plan

2. Set up tables & schema

3. Staging

5. Core (dimension table)

6. Core (fact table)

7. Set up job & testing


Fact table design

1. Create DateKey

2. Include product_FK

3. Payment dimension

4. Additional columns:
• total_cost
• total_price
• profit
Plan of attack
1. Set up tables

2. Staging for sales fact

3. Create staging job

4. Core for dim_payment

5. Core for sales fact

6. Create core job


Processing order

Extract Transform + Load

Source Staging Core

Add Surrogate key Clean data

Delta logic Add additional column


Scheduling

Extract Transform + Load

Source Staging Core

Jobs or packages

Scheduling at specific times / frequencies


Scheduling

Can be done either…

In the ETL tool

External tool (e.g. Windows Task Scheduler or on server)
Guidelines

What are the requirements?
• 3 x / day?
• 1 x / day?
• Every 30 min?

How long does it take?
• 5 min?
• 1 h?

What is a good time?
• Initial Load vs. Delta Load
• Effect on productive system
• Short read access
• Night? Morning?
ETL tools
Enterprise (Commercial)
✓ Most mature
✓ Graphical interface
✓ Architectural needs
✓ Support

Open-source (Source code)
✓ Often free
✓ Graphical interface
Support?
Ease of use?

Cloud-native (Cloud technology)
Data already in cloud?
✓ Efficiency
Flexibility?

Custom (Own development)
Customized
Internal resources
Maintenance?
Training?


ETL tools
Enterprise
• Alteryx
• Informatica
• Oracle Data Integrator
• Microsoft SSIS

Open-source
• Talend Open Studio
• Pentaho Data Integration
• Hadoop

Cloud-native
• Azure Data Factory
• AWS Glue
• Google Cloud Dataflow
• Stitch


Choosing ETL tool
1. Evaluate current situation/needs

What do you want to improve?

Data sources & other tools?

Define your requirements!

Define responsibilities

Who are the users?


Choosing ETL tool
2. Evaluate tools

For each criterion: Must have? / K.O.?, Weight / Importance (1-5), Text, Rating (1-5)

Criteria:
• Cost
• Connectors
• Capabilities
• Ease of use/work
• Reviews
• Support/Extras

Total weighted score: 1-5
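One common way to arrive at a total on the same 1-5 scale is a weighted average of the ratings. The weights and ratings below are made-up example values, and the weighted-average formula is an assumption about how the total is meant to be calculated.

```python
# Hypothetical weights and ratings on a 1-5 scale: criterion -> (weight, rating)
scores = {
    "Cost": (5, 3),
    "Connectors": (4, 4),
    "Capabilities": (4, 5),
    "Ease of use/work": (3, 4),
    "Reviews": (2, 3),
    "Support/Extras": (2, 2),
}

total_weight = sum(weight for weight, _ in scores.values())
weighted_score = sum(weight * rating for weight, rating in scores.values()) / total_weight
print(round(weighted_score, 2))
```

A tool that fails a must-have (K.O.) criterion would be excluded before any score is calculated.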
Choosing ETL tool
3. Test / Demo / Trial

Make a decision!
Choosing ETL tool
Enterprise
• Informatica
• Oracle Data Integrator
• Microsoft SSIS
• IBM DataStage

Open-source
• Talend Open Studio
• Pentaho Data Integration
• Hadoop

Cloud-based
• Azure Data Factory
• AWS Glue
• Google Cloud Dataflow
• Stitch


What is ELT?
Extract, Transform, Load

ETL
Data warehouse
Data Warehouse Layers

Transformations applied when data is moved

[Diagram: data sources -> Staging -> Core / Data Warehouse -> Predictive Analytics]
Data Warehouse Layers

E TL
Data Warehouse Layers

E LT
What is ELT?

Extract + Load

Transform

Data warehouse

Leverage Database

Real-time

More flexible

Tools are the same!

SELECT
SUM(sales_amount),
category
FROM sales
GROUP BY category

SELECT
product_id
FROM sales
WHERE customer_id = 5
Is ETL obsolete?
NO!
But there are
different use cases!
ETL vs. ELT
ETL
✓ More stable with defined transformations
✓ More generic use-cases
✓ Security

ELT
✓ Requires high performance DB
✓ More flexible
✓ Transformations can be changed quickly
✓ Real-time
ETL vs. ELT
ETL
✓ Reporting
✓ Generic use cases
✓ Easy to use

ELT
✓ Data Science, ML
✓ Real-time requirements
✓ Big data


Processing order

1. 2.
Extract Transform + Load

Source Staging Core


✓ Truncate after each run
Processing order

1. 2.
Extract Load Transform

Source Staging Core

✓ Truncate after each run


Data Warehouse Use Cases
What's next?

What are the use cases?

Performance Integrated Strategic decisions Basis for reporting

Easy to use Data Quality Accessible Enables business users to analyze data

Continuous Training of Machine Learning Models Predictive Analytics

Aggregate & Filter Use Big Data


Using index

SELECT
product_id
FROM sales
WHERE customer_id = 5

Table scan: Read-inefficient

3, P0625, 5, visa
4, P0432, 8, mastercard
6, P0058, 5, mastercard
1, P0494, 4, visa
2, P0221, 5, visa


Using index

SELECT
product_id
FROM sales
WHERE customer_id = 5

Location Value
1 4
2 5
5 8

1, P0494, 4, visa
2, P0221, 5, visa
6, P0058, 5, mastercard
4, P0432, 8, mastercard
3, P0625, 5, visa
✓ Indexes help to make data reads faster!

❖ Slower data writes

❖ Additional storage

❖ B-tree Indexes ❖ Bitmap Indexes


Using index
✓ Different types of indexes for different situations

❖ B-tree Indexes ❖ Bitmap Indexes

Location Value
1 4
2 5
5 8

1, P0494, 4, visa
2, P0221, 5, visa
6, P0058, 5, mastercard
4, P0432, 8, mastercard
3, P0625, 5, visa
❖ B-tree Indexes

✓ Multi-level tree structure

✓ Breaks data down into pages or blocks

✓ Should be used for high-cardinality (unique) columns

✓ Not entire table (costly in terms of storage)

[Diagram: multi-level tree with root A, branches AB, AC, AD, AE and leaf pages ABA…, ABB…, ACA…, ACB…]
❖ Bitmap index

✓ Particularly good for data warehouses

✓ Large amounts of data + low-cardinality

✓ Very storage efficient

✓ More optimized for read & few DML-operations
❖ Bitmap index
✓ Particularly good for data warehouses

✓ Large amounts of data + low-cardinality

✓ Very storage efficient

Good for many repeating values (dimensionality)

Row_id Value Bit
1 visa 11100
4 mastercard 00011
❖ Bitmap index
✓ Particularly good for data warehouses

✓ Large amounts of data + low-cardinality

✓ Very storage efficient

Good for many repeating values (dimensionality)

Value 1 2 3 4 5 6 7 8
mastercard x x
visa x x x
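The idea behind a bitmap index can be sketched in plain Python: one bit string per distinct value, with a 1 at every row position that holds the value. This is a conceptual illustration only, not how a database engine stores its bitmaps.

```python
# Payment type per row, in row order (values from the slides)
payment_types = ["visa", "visa", "visa", "mastercard", "mastercard"]

bitmaps = {}
for position, value in enumerate(payment_types):
    bits = bitmaps.setdefault(value, ["0"] * len(payment_types))
    bits[position] = "1"             # mark the rows that hold this value

bitmap_index = {value: "".join(bits) for value, bits in bitmaps.items()}

# Row numbers (1-based) where payment type is visa
visa_rows = [i + 1 for i, bit in enumerate(bitmap_index["visa"]) if bit == "1"]
print(bitmap_index, visa_rows)
```

With few distinct values the index stays small, which is why low-cardinality columns suit this index type.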
Guidelines

B-tree Index
• Default index
• Unique columns (surrogate key, names)

Bitmap Index
• Slow to update
• Storage efficient
• Great read performance
Guidelines

Should we put index on every column?

No! They come with a cost! Storage + Create/Update time

Only when necessary!

Avoid full table reads

Small tables do not require indexes


Guidelines
On which columns? 1. Large tables

2. Columns that are used as filters


Guidelines

Fact tables
• B-tree on surrogate key
• Bitmap index on foreign keys

Dimension tables
• Size of table
• Are they used in searches a lot?
• Choose based on cardinality


Guidelines

CREATE INDEX index_name ON table_name [USING method]
(
    column_name [ASC | DESC],
    ...
);
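The effect of an index can be observed with SQLite from Python's standard library. This is a sketch under assumptions: SQLite builds B-tree indexes only and has no `USING method` clause, and the column names `sales_id` and `payment` as well as the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sales_id INTEGER PRIMARY KEY, product_id TEXT,"
    " customer_id INTEGER, payment TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "P0494", 4, "visa"), (2, "P0221", 5, "visa"), (3, "P0625", 5, "visa"),
     (4, "P0432", 8, "mastercard"), (6, "P0058", 5, "mastercard")],
)
query = "SELECT product_id FROM sales WHERE customer_id = 5"


def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(query)                          # full table scan
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id ASC)")
plan_after = plan(query)                           # search via the index
products = sorted(row[0] for row in conn.execute(query))
print(plan_before, "|", plan_after, "|", products)
```

The query result is the same before and after; only the access path changes.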
Cloud vs. On-premises
On-premises
What? Own local hardware
✓ Storage layer
✓ Compute layer
✓ Software layer
Physical data center

Cloud
What? Software-as-a-service
✓ Pay for what you use
✓ Managed service
✓ Optimized for scalable analytics

What is the right choice today?
Cloud vs. On-premises
On-premises

Benefits
✓ Full control
✓ Data governance & compliance

Problems
❖ Full responsibility
❖ High costs
❖ More internal resources
❖ Less flexible

Cloud

Benefits
✓ Fully managed
✓ Scalable
✓ Cost-efficient
✓ Managed security
✓ Availability
✓ Time to market

Problems
❖ Regulations
❖ Different providers?
Conclusion?
Which one to
choose?

✓ Cloud data warehouses are on the rise

✓ Most companies opt for cloud data warehouse

In most cases cloud data warehouse is the better choice nowadays!


Conclusion?
What are the
options?

✓ Snowflake

✓ Amazon Redshift

✓ Azure Synapse

✓ Google BigQuery



Massive parallel processing (MPP)

Traditional
Massive parallel processing (MPP)
Example

SELECT
*
FROM sales
WHERE customer_id = 5
Massive parallel processing (MPP)

Traditional

Task 1 Task 2 Task 3


500ms 500ms 500ms
Massive parallel processing (MPP)

Traditional

"Shared disk" architecture

MPP

Task 1
Task 2
Task 3
Massive parallel processing (MPP)

Traditional

"Shared nothing" architecture

Node: independent resources

Work load is split up & processed individually

MPP

Task 1
Task 2
Task 3
Massive parallel processing (MPP)

✓ Modern way of solving performance issues

✓ Millions of rows can be processed faster

✓ Many people can run queries at the same time with good performance

✓ Helpful with centralizing massive amounts of data
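The split-and-merge idea behind MPP can be imitated on a small scale. In this sketch threads in one process stand in for independent nodes, which is only an analogy; the row layout and the function name `scan_partition` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (sales_id, customer_id), spread over 3 "nodes"
rows = [(i, i % 10) for i in range(1, 31)]
partitions = [rows[n::3] for n in range(3)]


def scan_partition(partition):
    """Each node filters only its own share of the data."""
    return [row for row in partition if row[1] == 5]


# The three partitions are scanned side by side
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(scan_partition, partitions))

# The coordinator merges the partial results
result = sorted(row for part in partial_results for row in part)
print(result)
```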


Columnar databases
SELECT
product_id
FROM sales

Traditional Relational DB

3, P0625, 5, visa
4, P0432, 8, mastercard
6, P0058, 5, mastercard
1, P0494, 4, visa
2, P0221, 5, visa

All rows have to be scanned

Bad for fast data retrieval!
Good for transactional DB
Columnar databases

SELECT
product_id
FROM sales

1, 2, 3, 4, 5
P0494, P0221, P0625, P0431, P0058
4, 5, 5, 8, 5
visa, visa, visa, mastercard, mastercard
18.29, 1.49, 5.89, 11.59, 12.39

Less data needs to be processed!
Better compression, less storage
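The difference between the two layouts can be sketched with plain Python structures. The column names other than `product_id` and `customer_id` are assumptions, and real columnar engines add compression and on-disk formats that are not shown here.

```python
# Row-oriented: every row carries all columns (column names partly assumed)
row_store = [
    (1, "P0494", 4, "visa", 18.29),
    (2, "P0221", 5, "visa", 1.49),
    (3, "P0625", 5, "visa", 5.89),
]

# Column-oriented: one list per column
columns = ("sales_id", "product_id", "customer_id", "payment", "price")
column_store = {name: [row[i] for row in row_store] for i, name in enumerate(columns)}

# SELECT product_id FROM sales
from_rows = [row[1] for row in row_store]      # touches every full row
from_columns = column_store["product_id"]      # reads a single column
print(from_columns)
```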
Columnar databases

100 columns, but only 5 columns are needed -> 5% of data needs to be processed

✓ Important factor in improving analytical query performance
Guidelines

Index if you frequently want to retrieve less than about 15% of the rows in a large table

Index columns used for joins to improve join performance

Small tables do not require indexes

A primary key has an index by default

• There is a wide range of values (good for regular indexes).
• There is a small range of values (good for bitmap indexes).
