0% found this document useful (0 votes)
128 views

CS408: Data Warehousing: Welcome To Course

The document provides an introduction to the course CS408: Data Warehousing. It includes information about the course content, schedule, assessments, and recommended literature. The course covers topics such as the data warehousing lifecycle, storage architecture, modeling, indexes, optimization, OLAP queries, and building a data warehouse. It will involve both lectures and hands-on labs to design and implement a sample data warehouse using tools like SQL Server and SSAS.

Uploaded by

hassan499
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
128 views

CS408: Data Warehousing: Welcome To Course

The document provides an introduction to the course CS408: Data Warehousing. It includes information about the course content, schedule, assessments, and recommended literature. The course covers topics such as the data warehousing lifecycle, storage architecture, modeling, indexes, optimization, OLAP queries, and building a data warehouse. It will involve both lectures and hands-on labs to design and implement a sample data warehouse using tools like SQL Server and SSAS.

Uploaded by

hassan499
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 45

Welcome to course

CS408: Data Warehousing


Outlines
• About Me
• Course information
• Teaching team
• Learning outcomes
• Course contents and schedule
• Assessments
• Resources
• Lecture 01

2
Course Information

Course Title: Data Warehousing

Course Code: CS408

Database Systems, Data


Prerequisites:
Structures, Operating Systems

Level: 4

8
Why should you be here?
• Bad decisions can lead to disaster
– Data Warehousing is at
the base of decision
support systems

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 10
Why should you be here?
• Data Warehousing & OLAP are important
• They help to
– Understand the information hidden
within an organization’s data
• See data from different angles:
product,client,time,geographical area
• Get adequate statistics to get
your point of argumentation
across
• Get a glimpse of the future…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 11
Why should you be here?

12
Course Contents and Schedule
Week LECTURE LAB
Introduction to course, Introduction to Data
1 • Introduction with lab course, Introduction with Tool
Warehouses
• Database and SQL Revision
2 DW Life Cycle and Basic Architecture
• Lab task based on SQL queries
• PSQL introduction and lab tasks based on PSQL
3 Storage Architecture
• Lab task based on PSQL
• Will follow a Data set in in our labs
4 DW Modeling – Conceptual Model • In this step we will design data warehouse architecture
• Create staging Database in SQL Server 2012.
5 Logical Model • Create Production Data warehouse in SQL Server 2012.
• Data will be extracted from source data and load in staging tables
6 Physical Model
using SSIS
• Different transformations will be applied on the data and data will
7 Project Brief stored in production tables using SSIS and implementation of slowly
changing dimension.
Indexes • OLAP queries writing
8
• Query based tasks will be assigned to perform with in the lab.
Indexes (continued) • OLAP Query Optimizing
9
• Query based tasks will be assigned to perform with in the lab.
10 Optimisation • Multidimensional cube will be design using SSAS
11 Optimisation (continued) • Formulas, aggregation applied on the built cube
12 OLAP Queries • Data visualization and reporting based on Datawarehouse

13 OLAP Queries (continued) • Data visualization and reporting based on Cube

14 Building the DW • Reports and Dashboard management on Powerbi.com


Real-time Data Warehousing • Introduction to Hadoop and other Big Data tools and techniques
15
Project submission • DWH lab revision
16 Big Data – Challenges and Opportunities • Semester Project discussion 13
Assessments

Assessment type Marks weight

Lab Tasks 20%

Mid 15%

Final
35%

Project 30%

14
Recommended Literature

• Building the Data Warehouse


– William H. Inmon
– Wiley, ISBN 978-0-7645-9944-6
• The Data Warehouse Toolkit
– Ralph Kimball & Margy Ross
– Wiley, ISBN 978-1-118-53080-1

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 15
Lecture 01

Introduction to Data Warehouse

16
This week

1.What is a data warehouse?


2. Applications and users

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 17
What is a data warehouse?
• Basically a very large da ta ba se…
– Not all very large databases are data warehouses,
but all data warehouses are pretty large databases
– Nowadays a warehouse is considered
to start at around 800 GB and goes
up to severalTB
– It spans over several servers and
needs an impressive amount of
computing power

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 18
What is a data warehouse?
• More specific, a collective data repository
– Containing snapshots of the operational data
(history)
– Obtained through data cleansing (Extract-
Transform- Load)
– Useful for analytics

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 19
What is a data warehouse?
• Compared to other solutions it…
– Is suitable for tactical/strateg ic focus
– Implies a sm all num ber of transactions
– Implies large transactions spanning over a
long period of time

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 20
Some Definitions
• Experts say…
– Ralph Kimball:“a copy of transaction
data specifically structured for query and
analysis”
– Bill Inmon: “A data warehouse is a:
– Subject oriented
– Integrated
– Non-volatile
– Time variant
collection of data in support of management’s decisions.”

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 21
Inmon Definition (cont’d.)
• S ubject oriented
– The data in the data warehouse is organized so that
all the data elements relating to the same real-world
event or object are linked together
• Typical subject areas in DWs are
Customer,Product,Order,Claim,
Account,…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 22
Inmon Definition (cont’d.)

• S ubject oriented
– Example:customer as subject in a DW
• DW is organized in this case by the customer
• It may consist of 10,100 or more physical tables, all
related
CUSTOMER

Base customer
Data 2003 - 2005
Customer activity
Base customer
2001 - 2004
Data 2000 - 2002

Customer activity Customer activity


Detail 2002 - 2004 Detail 2005 - 2006

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 23
Inmon Definition (cont’d.)
• Integ ra ted
– The data warehouse contains data from most or all
of an organization's operational systems and this data is
made consistent
– E.g. gender,measurement,conflicting keys,
consistency,…
operational encoding DW
appl A – M,F M,F
appl B – 1,0
appl C – male,female

appl A – cm cm
appl B – inches

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 24
Inmon Definition (cont’d.)
• Non-vola tile
– Data in the data warehouse is never over-written
or deleted - once committed, the data is static, read-
only, and retained for future reporting
– Data is loaded,but not updated
– When subsequent changes occur,a new snapshot record is
written

Operational DW

insert change

delete access
access load
Record-by-record Mass load/
manipulation access of data access

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 25
Inmon Definition (cont’d.)
• T im e-va rying
– The changes to the data in the data warehouse are
tracked and recorded so that reports can be produced
showing changes over tim e
– Different environments have different tim e horizons
• associated
• While for operational systems a 60-
to-90 day time horizon is normal,
data warehouse has a
• 5-to-10 year horizon

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 26
General Definition
• More general,a DW is a
– Repository of an
org anization’s
electronically stored
data
– Desig ned to
facilitate
reporting and
analysis

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 27
Typical Features
• DW typically…
– Reside on computers dedicated to this function
– Run on DB MS such as Oracle, IBM DB2, Teradata or
Microsoft SQL Server
– Retain data for long periods of tim e
– Consolidate data obtained from a variety of
sources
– Are built around their own carefully desig ned data
m odel

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 28
Use case
• DW stands for big data volume,so lets take an
example of 2 big com pa nies,a retailer,say
Walmart and a RDBMS vendor,Sybase:
– Walmart CEO:I want to keep track of sales in
all my stores simultaneously
– Sybase consultant:You need our wonderful RDBMS software.
You can stuff data in as sales are rung up at cash registers and
simultaneously query data right in your office
– SoWalmart buys a $1 milion Sun E10000 multi-CPU
server,a $500 000 Sybase license,a book
“Database Design for Smarties”,and build
themselves a normalized SQL data model

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 29
Use case (cont’d.)
• After a few months of stuffing data into the tables…
aWalmart executive asks…
– I have noticed that there was a Colgate promotion recently,
directed to people who live in small towns.How much
toothpaste did we sell in those towns yesterday?
– Translation to a query:
select sum(sales.quantity_sold) from sales,products,product_categories, manufacturers,
stores,cities where manufacturer_name =‘Colgate’
and product_category_name =‘toothpaste’ and cities.population < 40 000
and trunc(sales.date_time_of_sale) = trunc(sysdate-1) and sales.product_id =
products.product_id
and sales.store_id = stores.store_id
and products.product_category_id = product_categories.product_category_id and
products.manufacturer_id = manufacturers.manufacturer_id
and stores.city_id = cities.city_id

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 30
Use case (cont’d.)

– The tables contain large volumes of data and the


query implies a 6 way join so it will take some
time to execute
– The tables are at the sam e tim e also updated
by new sales
– Soon after executive start their quest for marketing
information store employees notice that there are times
during the day when it is impossible to process a sale
Any attempt to update the database results in freezing the
computer up for 20 minutes

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 31
Use case (cont’d.)

• In minutes…theWalmart CEO calls Sybase


tech support
• Walmart CEO: WE TYPE IN THE TOOTHPASTE
QUERY AND OUR SYSTEM HANGS!!!
• Sybase support: Of course it does! You built an on-
line transaction processing (OLTP) system. You
can’t feed it a decision support system (DSS)
query and expect things to work!
– Walmart CEO:!@%$#.I thought this was the whole point of SQL
and your RDBMS…to query and insert simultaneously!!
– Sybase support:Uh,not exactly.If you’re reading from the database,
nobody can write to the database.If you’re writing to the database,
nobody can read from the database.So if you’ve got a query that takes
20 minutes to run and don’t specify special locking instructions,
nobody can update those tables for 20 minutes.

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 32
Use case (cont’d.)
– Walmart CEO: It sounds like a bug.
– Sybase support: Actually it is a feature. We call it
pessimistic locking.
– Walmart CEO: Can you fix your system so that it doesn’t
lock up???
– Sybase support: No. But we made this great loader tool
so that you can copy everything from your OLTP system
into a separate Data Warehouse system at 100 GB/hour
• After a while…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 33
OLTP vs. DW

• OLTP (OnL ine T ransaction P rocessing):


– Also known under the name of operational data,
it represents day-to-day operational business activities:
• Purchasing,sales,production distribution, …
– Typically for data entr y and retrieval
transaction processing
– Reflects only the current state
of the data

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 34
OLTP vs. DW (cont’d.)

• OLAP (OnL ine A nalytical P rocessing):


– Represents front-end analytics based on a
DW repository
– It provides information for activities like
• Resource planning,capital budgeting,marketing initiatives,…
– It is decision oriented

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 35
OLTP vs. DW (cont’d.)
• Properties
Operational DB DW
Mostly updates Mostly reads

Many small transactions Queries long, complex

MB-TB of data GB-PB of data

Raw data Summarized data

Clerical users Decision makers

Up-to-date data May be slightly outdated

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 36
OLTP vs. DW (cont’d.)

• Consider a norm a lized da ta ba se for a store


– The tables would look like this…
Shipment
Invoice Invoice
Invoice_line_item
Customer

Customer_ID (PK) Invoice_number (PK) Invoice_number (FK)


Name Date Item_seq_number
Address Customer_ID (FK) Product_ID (FK)
City Status_code (FK) Units
Postal_code Total Unit_cost
Sales_tax
Shipping_charge
Sales_unit

Sales_unit_D
(PK) Name
Product Address
Status City
Product_ID (PK) Postal_cod
Status_code (PK) Name e
Status Description Telephone_number
Cost Email
Sales_unit_ID (FK)

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 37
OLTP vs. DW (cont’d.)
• If we were to set up a DW for that store,we
would start by building the following star schema
Customer Product
Customer_ID (PK) Product_ID (PK)
Name Name
Address Description
City Cost
Postal_code Sales_facts

Product_ID (FK)
Customer_ID (FK)
Time_key (FK)
Sales_unit_ID (FK)
Sales_unit Total Time_period
Sales_tax
Sales_unit_ID (PK) Shipping_charge Time_key
Name (PK)
Address City Description
Postal_code Day
Telephone Fiscal_week
Fiscal_period
Fiscal_year

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 38
OLTP vs. DW (cont’d.)

• B a sic insig hts from comparing OLTP and DWs


– A DW is a separate (RDB MS ) installation that
contains copies of data from on-line systems
• Physically separate hardware may not be absolutely
necessary if you have lots of extra com puting power,
but it is recommended
– With an optim istic locking DBMS you might even
be able to get away for a while with keeping just one
copy of your data

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 39
OLTP vs. DW (cont’d.)

• There is an essentially different pattern of


ha rdwa re utiliz a tion between on-line
and analytical processing

Operational Data warehouse

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 40
Applications of DW
• Typica l questions which can be answered
with DW & OLAP
– How much did sales unitA earn in January?
– How much did sales unit B earn in February?
What was their combined sales amount for the first
quarter?
• Answering these questions with S QL -queries is
difficult
– Complex query formulation necessary
– Process is likely to be slow due to complex joins and
multiple scans

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 41
Applications of DW (cont’d.)
• Why such questions can be answered better with
a DW?
– Because in a DW tables are rearranged and
pre- ag g reg ated (known as computing cubes)
e.g. Years, weeks, days etc.
• The tables arrangement is subject oriented,usually some
sta r schem a

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 42
Applications of DW (cont’d.)
• A DW is the base repository for front-end
a na lytics
– OLAP
– KDD (Knowledge Discovery in Databases) a
data mining process
– Data visualization
– Reporting

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 43
Applications of DW (cont’d.)

• OLAP is a form of information processing and


thus needs to provide timely,accurate and
understandable information
– timely is however a relative term:
• In OLTP we expect an update to go
through in a matter of seconds
• In OLAP the time to answer a query
can take m inutes, hours or even longer
• There are many flavors of OLAP
– ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 44
Applications of DW (cont’d.)
• KDD (Data Mining)
– Constructs m odels of the data in question
• Models can be viewed as high level summaries of the
underlying data

ID Name ZIP Sex Age Income Children Car Spent

12 Peter 38106 M 35 € 55,000 2 Mini Van € 210.00


15 Gabriel 38100 M 32 € 56,000 0 SUV € 30.00
… … … … … … … … …
122 Claire 38106 F 21 € 42,000 0 Coupe € 50.00

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 45
Applications of DW (cont’d.)
– Based on this example a query returns the data that
fulfills the constraints
• SELECT * FROM CUSTOMER_TABLE WHERE
TOTAL_SPENT > €100;
– Data mining might return the following set of rules
for customers spending more than €100:
• IF AGE > 35 AND CAR =‘MINIVAN’ THEN TOTAL SPENT
> €100
• IF SEX =‘M’ AND ZIP = 38106 THEN TOTAL SPENT >
€100

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 46
Applications of DW (cont’d.)
– It answers questions like
• Which products or customers are more profitable
• Which outlets have sold the least this year
– In consequence it motivates decisions like
• Which products should have their production increased
• Which customers should be targeted for special
promotions
• Which outlets should be closed

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 47
Who is the user?
• Users of DW are called DS S a na lysts
and usually are business persons
– Their primary job is to define and discover
information used in corporate decision-m aking
– The way they think
• “Give me what I say I want,and then I can tell you what I
really want”
• They work in explorative manner

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 48
Who is the user? (cont’d.)
– Typical explorative line of work
• “Ah! Now that I see what the possibilities are,I can tell
what I really want to see.But until I know what the
possibilities are,I cannot describe exactly what I want…”

– This usage has fundamental effect on the way a DW


is developed
• The classical system developm ent life cycle
assumes that the requirements are known at the start of
design
• The DSS analyst starts with existing requirements,but
fa ctoring in new requirem ents with time

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 49
Summary
• Data Warehouse - Introduction
– DW Definitions
– OLTP vs. DW
– Applications of DW

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 50
Next lecture

• Data Warehouse Life Cycle

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 51

You might also like