Star and snowflake schema
DATABASE DESIGN
Lis Sulmont
Curriculum Manager
Star schema
Dimensional modeling: star schema
Fact tables
Holds records of a metric
Changes regularly
Connects to dimensions via foreign keys
Dimension tables
Holds descriptions of attributes
Does not change as often
Example:
Supply books to stores in USA and Canada
Keep track of book sales
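As a rough illustration of the fact/dimension split for the book sales example, here is a minimal sketch using Python's built-in sqlite3 module. The table names fact_booksales and dim_*_star follow the queries later in this deck; the columns sales_id, title and country are assumptions made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes that change rarely.
CREATE TABLE dim_book_star  (book_id  INTEGER PRIMARY KEY, title TEXT, author TEXT);
CREATE TABLE dim_store_star (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_time_star  (time_id  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- Fact table: the metric (quantity) plus one foreign key per dimension.
CREATE TABLE fact_booksales (
    sales_id INTEGER PRIMARY KEY,  -- assumed surrogate key
    book_id  INTEGER REFERENCES dim_book_star(book_id),
    store_id INTEGER REFERENCES dim_store_star(store_id),
    time_id  INTEGER REFERENCES dim_time_star(time_id),
    quantity INTEGER
);
""")

# Column 2 of PRAGMA foreign_key_list is the referenced (parent) table.
fk_targets = sorted(
    row[2] for row in conn.execute("PRAGMA foreign_key_list(fact_booksales)")
)
print(fk_targets)
```

Every foreign key of the fact table points at a dimension table one step away, which is what gives the star its shape.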
Star schema example
Snowflake schema (an extension)
Same fact table, different dimensions
Star schemas: dimensions extend one level from the fact table
Snowflake schemas: dimensions extend over more than one level, because the dimension tables are normalized
What is normalization?
Database design technique
Divides tables into smaller tables and connects them via relationships
Goal: reduce redundancy and increase data integrity
Identify repeating groups of data and create new tables for them
Book dimension of the star schema
Most likely to have repeating values:
Author
Publisher
Genre
Book dimension of the snowflake schema
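One way to picture the snowflaked book dimension is to move a repeating attribute into its own table. The sketch below does this for the author only, using Python's sqlite3. The names dim_book_star, dim_book_sf and dim_author_sf match the queries later in this deck; the title column, the placeholder titles and 'Author B' are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_book_star (book_id INTEGER PRIMARY KEY, title TEXT, author TEXT);
INSERT INTO dim_book_star VALUES
    (1, 'Book A', 'Octavia E. Butler'),
    (2, 'Book B', 'Octavia E. Butler'),  -- the author value repeats
    (3, 'Book C', 'Author B');

-- Pull the repeating author values out into their own table.
CREATE TABLE dim_author_sf (author_id INTEGER PRIMARY KEY, author TEXT UNIQUE);
INSERT INTO dim_author_sf (author) SELECT DISTINCT author FROM dim_book_star;

-- The book table now stores a key instead of the author's name.
CREATE TABLE dim_book_sf (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT,
    author_id INTEGER REFERENCES dim_author_sf(author_id)
);
INSERT INTO dim_book_sf
SELECT b.book_id, b.title, a.author_id
FROM dim_book_star AS b
INNER JOIN dim_author_sf AS a ON a.author = b.author;
""")

author_rows = conn.execute("SELECT COUNT(*) FROM dim_author_sf").fetchone()[0]
book_rows = conn.execute("SELECT COUNT(*) FROM dim_book_sf").fetchone()[0]
print(author_rows, book_rows)
```

Publisher and genre could be split off the same way, each into its own table referenced by a key.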
Store dimension of the star schema
Most likely to have repeating values:
City
State
Country
Store dimension of the snowflake schema
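The store dimension normalizes into a chain rather than a single extra table: a store points to a city, the city to a state, and the state to a country. A minimal sketch in Python's sqlite3 follows. dim_store_sf and dim_city_sf appear in the queries later in this deck; dim_state_sf, dim_country_sf and all column names beyond store_id, city_id and city are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_country_sf (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_state_sf   (state_id INTEGER PRIMARY KEY, state TEXT,
                             country_id INTEGER REFERENCES dim_country_sf(country_id));
CREATE TABLE dim_city_sf    (city_id INTEGER PRIMARY KEY, city TEXT,
                             state_id INTEGER REFERENCES dim_state_sf(state_id));
CREATE TABLE dim_store_sf   (store_id INTEGER PRIMARY KEY,
                             city_id INTEGER REFERENCES dim_city_sf(city_id));

-- Each place name is stored once, however many stores share it.
INSERT INTO dim_country_sf VALUES (1, 'Canada');
INSERT INTO dim_state_sf   VALUES (1, 'British Columbia', 1);
INSERT INTO dim_city_sf    VALUES (1, 'Vancouver', 1);
INSERT INTO dim_store_sf   VALUES (1, 1), (2, 1);
""")

# Walking from a store to its country takes three joins.
rows = conn.execute("""
    SELECT s.store_id, c.city, st.state, co.country
    FROM dim_store_sf AS s
    INNER JOIN dim_city_sf    AS c  ON s.city_id = c.city_id
    INNER JOIN dim_state_sf   AS st ON c.state_id = st.state_id
    INNER JOIN dim_country_sf AS co ON st.country_id = co.country_id
    ORDER BY s.store_id
""").fetchall()
print(rows)
```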
Normalized and denormalized databases
Back to our book store example
Denormalized: star schema
Normalized: snowflake schema
Denormalized Query
Goal: get quantity of all Octavia E. Butler books sold in Vancouver in Q4 of 2018
SELECT
    SUM(quantity)
FROM
    fact_booksales
-- Join to get city
INNER JOIN dim_store_star ON fact_booksales.store_id = dim_store_star.store_id
-- Join to get author
INNER JOIN dim_book_star ON fact_booksales.book_id = dim_book_star.book_id
-- Join to get year and quarter
INNER JOIN dim_time_star ON fact_booksales.time_id = dim_time_star.time_id
WHERE
    dim_store_star.city = 'Vancouver'
    AND dim_book_star.author = 'Octavia E. Butler'
    AND dim_time_star.year = 2018
    AND dim_time_star.quarter = 4;
7600
Total of 3 joins
Normalized query
SELECT
    SUM(fact_booksales.quantity)
FROM
    fact_booksales
-- Join to get city
INNER JOIN dim_store_sf ON fact_booksales.store_id = dim_store_sf.store_id
INNER JOIN dim_city_sf ON dim_store_sf.city_id = dim_city_sf.city_id
-- Join to get author
INNER JOIN dim_book_sf ON fact_booksales.book_id = dim_book_sf.book_id
INNER JOIN dim_author_sf ON dim_book_sf.author_id = dim_author_sf.author_id
-- Join to get year and quarter
INNER JOIN dim_time_sf ON fact_booksales.time_id = dim_time_sf.time_id
INNER JOIN dim_month_sf ON dim_time_sf.month_id = dim_month_sf.month_id
INNER JOIN dim_quarter_sf ON dim_month_sf.quarter_id = dim_quarter_sf.quarter_id
INNER JOIN dim_year_sf ON dim_quarter_sf.year_id = dim_year_sf.year_id
WHERE
    dim_city_sf.city = 'Vancouver'
    AND dim_author_sf.author = 'Octavia E. Butler'
    AND dim_year_sf.year = 2018
    AND dim_quarter_sf.quarter = 4;
sum
7600
Total of 8 joins
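To see the join overhead on something small enough to run, the sketch below filters sales by city only, once against a star-style store table and once against a snowflake-style pair of tables. It uses Python's sqlite3; the table names come from the queries above, while the rows and quantities are made-up sample data, not the course dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_booksales (store_id INTEGER, quantity INTEGER);
CREATE TABLE dim_store_star (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_city_sf    (city_id  INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_store_sf   (store_id INTEGER PRIMARY KEY, city_id INTEGER);

-- Made-up sample data: stores 1 and 2 are in Vancouver, store 3 is not.
INSERT INTO fact_booksales VALUES (1, 10), (2, 5), (3, 7);
INSERT INTO dim_store_star VALUES (1, 'Vancouver'), (2, 'Vancouver'), (3, 'Toronto');
INSERT INTO dim_city_sf    VALUES (1, 'Vancouver'), (2, 'Toronto');
INSERT INTO dim_store_sf   VALUES (1, 1), (2, 1), (3, 2);
""")

# Denormalized: one join reaches the city name.
star_total = conn.execute("""
    SELECT SUM(f.quantity)
    FROM fact_booksales AS f
    INNER JOIN dim_store_star AS s ON f.store_id = s.store_id
    WHERE s.city = 'Vancouver'
""").fetchone()[0]

# Normalized: the same filter needs two joins.
snowflake_total = conn.execute("""
    SELECT SUM(f.quantity)
    FROM fact_booksales AS f
    INNER JOIN dim_store_sf AS s ON f.store_id = s.store_id
    INNER JOIN dim_city_sf  AS c ON s.city_id = c.city_id
    WHERE c.city = 'Vancouver'
""").fetchone()[0]

print(star_total, snowflake_total)
```

Both queries return the same total; the normalized version simply travels through more tables to get there.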
So, why would we want to normalize a database?
Normalization saves space
Denormalized databases enable data redundancy
Normalization eliminates data redundancy
Normalization ensures better data integrity
1. Enforces data consistency
Must respect naming conventions because of referential integrity, e.g., 'California', not 'CA' or 'california'
2. Safer updating, removing, and inserting
Less data redundancy = fewer records to alter
3. Easier to redesign by extending
Smaller tables are easier to extend than larger tables
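The consistency point can be tried out with a foreign key. In this sketch (Python's sqlite3), the tables dim_state and dim_store and their columns are hypothetical; the idea is only that a store row may use a state spelling that already exists in the parent table. SQLite does not enforce foreign keys unless the pragma is switched on, so the sketch enables it first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE dim_state (state TEXT PRIMARY KEY);
CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY,
    state    TEXT NOT NULL REFERENCES dim_state(state)
);
INSERT INTO dim_state VALUES ('California');
INSERT INTO dim_store (state) VALUES ('California');  -- accepted
""")

rejected = []
for spelling in ("CA", "california"):
    try:
        conn.execute("INSERT INTO dim_store (state) VALUES (?)", (spelling,))
    except sqlite3.IntegrityError:
        # No matching parent row, so the database refuses the insert.
        rejected.append(spelling)

print(rejected)
```

In a denormalized table with a free-text state column, all three spellings would be stored without complaint.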
Database normalization
Advantages
Normalization eliminates data redundancy: save on storage
Better data integrity: accurate and consistent data
Disadvantages
Complex queries require more CPU
Remember OLTP and OLAP?
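| OLTP                                           | OLAP                                   |
|------------------------------------------------|----------------------------------------|
| e.g., Operational databases                    | e.g., Data warehouses                  |
| Typically highly normalized                    | Typically less normalized              |
| Write-intensive                                | Read-intensive                         |
| Prioritize quicker and safer insertion of data | Prioritize quicker queries for analytics |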
Normal forms
Normalization
Identify repeating groups of data and create new tables for them
A more formal definition:
The goals of normalization are to:
Be able to characterize the level of redundancy in a relational schema
Provide mechanisms for transforming schemas in order to remove redundancy
1 Database Design, 2nd Edition by Adrienne Watt
Normal forms (NF)
Ordered from least to most normalized:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Elementary key normal form (EKNF)
Boyce-Codd normal form (BCNF)
Fourth normal form (4NF)
Essential tuple normal form (ETNF)
Fifth normal form (5NF)
Domain-key normal form (DKNF)
Sixth normal form (6NF)
1 https://en.wikipedia.org/wiki/Database_normalization
1NF rules
Each record must be unique - no duplicate rows
Each cell must hold one value
Initial data
| Student_id | Student_Email   | Courses_Completed                                 |
|------------|-----------------|---------------------------------------------------|
| 235        | [email protected] | Introduction to Python, Intermediate Python       |
| 455        | [email protected] | Cleaning Data in R                                |
| 767        | [email protected] | Machine Learning Toolbox, Deep Learning in Python |
In 1NF
| Student_id | Student_Email |
|------------|-----------------|
| 235 | [email protected] |
| 455 | [email protected] |
| 767 | [email protected] |
| Student_id | Completed |
|------------|--------------------------|
| 235 | Introduction to Python |
| 235 | Intermediate Python |
| 455 | Cleaning Data in R |
| 767 | Machine Learning Toolbox |
| 767 | Deep Learning in Python |
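The step from the initial data to the second 1NF table is mechanical: split the multi-valued Courses_Completed cell and emit one row per course. A small Python sketch, using the student ids and course names from the tables above (the email column is left out because it already holds one value per cell):

```python
# Initial data: one cell holds several courses, which violates 1NF.
initial = [
    (235, "Introduction to Python, Intermediate Python"),
    (455, "Cleaning Data in R"),
    (767, "Machine Learning Toolbox, Deep Learning in Python"),
]

# One (Student_id, Completed) row per course, so every cell holds one value.
completed = [
    (student_id, course.strip())
    for student_id, courses in initial
    for course in courses.split(",")
]

for row in completed:
    print(row)
```

Splitting on a comma only works here because no course name contains one; real data may need a safer delimiter or a proper parser.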
2NF
Must satisfy 1NF, AND
If the primary key is one column, then it automatically satisfies 2NF
If there is a composite primary key, then each non-key column must depend on the whole key, not on just part of it
Initial data
| Student_id (PK) | Course_id (PK) | Instructor_id | Instructor | Progress |
|-----------------|----------------|---------------|---------------|----------|
| 235 | 2001 | 560 | Nick Carchedi | .55 |
| 455 | 2345 | 658 | Ginger Grant | .10 |
| 767 | 6584 | 999 | Chester Ismay | 1.00 |
In 2NF
| Student_id (PK) | Course_id (PK) | Progress          |
|-----------------|----------------|-------------------|
| 235 | 2001 | .55 |
| 455 | 2345 | .10 |
| 767 | 6584 | 1.00 |
| Course_id (PK) | Instructor_id | Instructor |
|----------------|---------------|---------------|
| 2001 | 560 | Nick Carchedi |
| 2345 | 658 | Ginger Grant |
| 6584 | 999 | Chester Ismay |
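Why the instructor columns move out can be checked with a small dependency test: if part of the composite key already determines a non-key column, that column violates 2NF. The helper `determines` below is a hypothetical name written for this sketch, and the fourth row (a second student on course 2001) is made up so that the key parts actually repeat.

```python
def determines(rows, lhs, rhs):
    """Return True if equal values in the lhs columns always come with the same rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


rows = [
    {"Student_id": 235, "Course_id": 2001, "Instructor": "Nick Carchedi", "Progress": 0.55},
    {"Student_id": 455, "Course_id": 2345, "Instructor": "Ginger Grant", "Progress": 0.10},
    {"Student_id": 767, "Course_id": 6584, "Instructor": "Chester Ismay", "Progress": 1.00},
    # Hypothetical extra row: student 455 also takes course 2001.
    {"Student_id": 455, "Course_id": 2001, "Instructor": "Nick Carchedi", "Progress": 0.30},
]

# Instructor depends on Course_id alone: a partial dependency, so it leaves the table.
instructor_partial = determines(rows, ["Course_id"], "Instructor")

# Progress needs both key columns, so it stays with the composite key.
progress_by_course = determines(rows, ["Course_id"], "Progress")
progress_by_student = determines(rows, ["Student_id"], "Progress")
progress_by_full_key = determines(rows, ["Student_id", "Course_id"], "Progress")

print(instructor_partial, progress_by_course, progress_by_student, progress_by_full_key)
```

A check like this can only refute a dependency on sample rows; whether a dependency truly holds is a statement about the data's meaning, not about any one sample.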
3NF
Satisfies 2NF
No transitive dependencies: non-key columns can't depend on other non-key columns
Initial Data
| Course_id (PK) | Instructor_id | Instructor | Tech |
|----------------|---------------|---------------|--------|
| 2001 | 560 | Nick Carchedi | Python |
| 2345 | 658 | Ginger Grant | SQL |
| 6584 | 999 | Chester Ismay | R |
In 3NF
| Course_id (PK) | Instructor_id | Tech   |
|----------------|---------------|--------|
| 2001           | 560           | Python |
| 2345           | 658           | SQL    |
| 6584           | 999           | R      |
| Instructor_id | Instructor |
|---------------|---------------|
| 560 | Nick Carchedi |
| 658 | Ginger Grant |
| 999 | Chester Ismay |
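A decomposition into 3NF should lose nothing: joining the pieces has to give back the initial table. The sketch below checks that with Python's sqlite3, under the assumption that the course table keeps Instructor_id as its link to the instructor table. The table names courses and instructors are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instructors (Instructor_id INTEGER PRIMARY KEY, Instructor TEXT);
CREATE TABLE courses (
    Course_id     INTEGER PRIMARY KEY,
    Instructor_id INTEGER REFERENCES instructors(Instructor_id),
    Tech          TEXT
);
INSERT INTO instructors VALUES
    (560, 'Nick Carchedi'), (658, 'Ginger Grant'), (999, 'Chester Ismay');
INSERT INTO courses VALUES
    (2001, 560, 'Python'), (2345, 658, 'SQL'), (6584, 999, 'R');
""")

# Joining on Instructor_id rebuilds the columns of the initial table.
rebuilt = conn.execute("""
    SELECT c.Course_id, c.Instructor_id, i.Instructor, c.Tech
    FROM courses AS c
    INNER JOIN instructors AS i ON c.Instructor_id = i.Instructor_id
    ORDER BY c.Course_id
""").fetchall()

for row in rebuilt:
    print(row)
```

Each instructor's name now lives in exactly one row, so renaming an instructor is a single-row update.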
Data anomalies
What is risked if we don't normalize enough?
1. Update anomaly
2. Insertion anomaly
3. Deletion anomaly
Update anomaly
Data inconsistency caused by data redundancy when updating
| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected] | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected] | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected] | Arima Models in R       | David Stoffer       |
To update student 520's email:
Need to update more than one record, otherwise, there will be inconsistency
User updating needs to know about redundancy
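The update anomaly is easy to reproduce. In this sketch (Python's sqlite3), the table name enrollments and the example.com addresses are placeholders invented for illustration, since the real emails are not shown in the slide. An update that touches only one of student 520's rows leaves two different emails on record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollments (
    Student_ID INTEGER, Student_Email TEXT, Enrolled_in TEXT, Taught_by TEXT
);
INSERT INTO enrollments VALUES
    (230, 'student230@example.com', 'Cleaning Data in R',      'Maggie Matsui'),
    (367, 'student367@example.com', 'Data Visualization in R', 'Ronald Pearson'),
    (520, 'student520@example.com', 'Introduction to Python',  'Hugo Bowne-Anderson'),
    (520, 'student520@example.com', 'Arima Models in R',       'David Stoffer');
""")

# A careless update that changes only one of student 520's two rows.
conn.execute("""
    UPDATE enrollments
    SET Student_Email = 'new520@example.com'
    WHERE Student_ID = 520 AND Enrolled_in = 'Introduction to Python'
""")

emails = [
    row[0]
    for row in conn.execute("""
        SELECT DISTINCT Student_Email FROM enrollments
        WHERE Student_ID = 520
        ORDER BY Student_Email
    """)
]
print(emails)
```

With the email stored once in a separate students table, the same change would be a one-row update with no way to end up inconsistent.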
Insertion anomaly
Unable to add a record due to missing attributes
| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected] | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected] | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected] | Arima Models in R       | David Stoffer       |
Unable to insert a student who has signed up but not enrolled in any courses
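Whether the insert actually fails depends on the table's constraints. The sketch below (Python's sqlite3) assumes the course columns are declared NOT NULL and that (Student_ID, Enrolled_in) is the primary key; neither is stated in the slide. The table name enrollments, student 610 and the example.com address are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed constraints: every column is required.
CREATE TABLE enrollments (
    Student_ID    INTEGER NOT NULL,
    Student_Email TEXT    NOT NULL,
    Enrolled_in   TEXT    NOT NULL,
    Taught_by     TEXT    NOT NULL,
    PRIMARY KEY (Student_ID, Enrolled_in)
);
""")

# A student who has signed up but is not enrolled in any course yet.
try:
    conn.execute(
        "INSERT INTO enrollments (Student_ID, Student_Email) VALUES (?, ?)",
        (610, "student610@example.com"),
    )
    inserted = True
except sqlite3.IntegrityError:
    # Enrolled_in and Taught_by are missing, so the row is rejected.
    inserted = False

print(inserted)
```

Splitting students and enrollments into separate tables removes the problem: the student row no longer needs any course attributes.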
Deletion anomaly
Deletion of record(s) causes unintentional loss of data
| Student_ID | Student_Email   | Enrolled_in             | Taught_by           |
|------------|-----------------|-------------------------|---------------------|
| 230        | [email protected] | Cleaning Data in R      | Maggie Matsui       |
| 367        | [email protected] | Data Visualization in R | Ronald Pearson      |
| 520        | [email protected] | Introduction to Python  | Hugo Bowne-Anderson |
| 520        | [email protected] | Arima Models in R       | David Stoffer       |
If we delete Student 230, what happens to the data on Cleaning Data in R?
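The question can be answered by running the delete. In this sketch (Python's sqlite3), the table name enrollments and the example.com addresses are placeholders; the course and instructor names come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollments (
    Student_ID INTEGER, Student_Email TEXT, Enrolled_in TEXT, Taught_by TEXT
);
INSERT INTO enrollments VALUES
    (230, 'student230@example.com', 'Cleaning Data in R',      'Maggie Matsui'),
    (367, 'student367@example.com', 'Data Visualization in R', 'Ronald Pearson'),
    (520, 'student520@example.com', 'Introduction to Python',  'Hugo Bowne-Anderson'),
    (520, 'student520@example.com', 'Arima Models in R',       'David Stoffer');
""")

# Student 230 is the only student enrolled in this course.
conn.execute("DELETE FROM enrollments WHERE Student_ID = 230")

course_rows = conn.execute(
    "SELECT COUNT(*) FROM enrollments WHERE Enrolled_in = 'Cleaning Data in R'"
).fetchone()[0]
instructors = [
    row[0] for row in conn.execute("SELECT DISTINCT Taught_by FROM enrollments")
]
print(course_rows, "Maggie Matsui" in instructors)
```

Deleting the student also erased the only record that the course exists and who teaches it.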
Data anomalies
What is risked if we don't normalize enough?
1. Update anomaly
2. Insertion anomaly
3. Deletion anomaly
The more normalized the database, the less prone it will be to data anomalies
Don't forget the downsides of normalization from the last video