Chapter 11: Data Warehousing
Modern Database Management
Jeffrey A. Hoffer, Mary B. Prescott, Fred R. McFadden
Objectives
• Define terms
• Explain the reasons for the gap between information needs and information availability
• Explain the reasons organizations need data warehousing
• Describe three levels of data warehouse architectures
• List four steps of data reconciliation
• Describe two components of a star schema
• Estimate fact table size
• Design a data mart
Definition
• Data Warehouse:
– A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
– Subject-oriented: e.g., customers, patients, students, products
– Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
– Time-variant: can study trends and changes
– Non-updatable: read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Need for Data Warehousing
• Integrated, company-wide view of high-quality information (from disparate databases)
• Separation of operational and informational systems and data (for improved performance)
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer Architecture
All involve some form of extraction, transformation, and loading (ETL)
Figure 11-2: Generic two-level data warehousing architecture
One, company-wide warehouse
Periodic extraction: data is not completely current in the warehouse
Figure 11-3: Independent data mart data warehousing architecture
Data marts: mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Figure 11-4: Dependent data mart with operational data store: a three-level architecture
ODS provides an option for obtaining current data
Single ETL for the enterprise data warehouse (EDW)
Simpler data access
Dependent data marts loaded from the EDW
Figure 11-5: Logical data mart and real-time data warehouse architecture
ODS and data warehouse are one and the same
Near real-time ETL for the data warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
Data Characteristics: Status vs. Event Data
Figure 11-7: Example of a DBMS log entry
Event = a database action (create/update/delete) that results from a transaction
Status = the state of a record; the log entry captures the record's status before and after the event
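As a rough illustration (not the book's notation), a single log entry pairing an event with the record's before and after status might look like this in Python:

# Illustrative sketch only: one log entry holding the event and the
# record's status before and after the change (field names are assumptions).
log_entry = {
    "transaction_id": 1001,
    "event": "update",                                      # create / update / delete
    "table": "Customer",
    "status_before": {"cust_id": 42, "city": "Dayton"},     # status prior to the event
    "status_after":  {"cust_id": 42, "city": "Columbus"},   # status after the event
}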
Data Characteristics: Transient vs. Periodic Data
Figure 11-8: Transient operational data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content
Data Characteristics: Transient vs. Periodic Data
Figure 11-9: Periodic warehouse data
Periodic data are never physically altered or deleted once they have been added to the store
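A minimal Python sketch of the difference, using illustrative in-memory tables and column names (assumptions, not the book's example data):

# Transient (operational): an update overwrites the old row, destroying history.
transient = {101: {"cust_id": 101, "city": "Dayton"}}
transient[101]["city"] = "Columbus"          # previous value "Dayton" is lost

# Periodic (warehouse): every change is appended with a timestamp; old rows remain.
periodic = [{"cust_id": 101, "city": "Dayton", "effective_date": "2024-01-05"}]
periodic.append({"cust_id": 101, "city": "Columbus", "effective_date": "2024-03-12"})
# Both versions are retained, so trends and changes can be studied.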
Other Data Warehouse Changes
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes
• Descriptive attributes become more refined
• Descriptive data are related to one another
• New sources of data
The Reconciled Data Layer
• Typical operational data is:
– Transient–not historical
– Not normalized (perhaps due to denormalization for performance)
– Restricted in scope–not comprehensive
– Sometimes poor quality–inconsistencies and errors
• After ETL, data should be:
– Detailed–not summarized yet
– Historical–periodic
– Normalized–3rd normal form or higher
– Comprehensive–enterprise-wide perspective
– Timely–data should be current enough to assist decision-making
– Quality controlled–accurate with full integrity
The ETL Process
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index
ETL = Extract, transform, and load
Figure 11-10: Steps in data reconciliation
Capture/Extract = obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
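A minimal Python sketch of the two extract styles, assuming an illustrative source table that carries a last_updated timestamp (names are assumptions):

from datetime import datetime

source_rows = [
    {"order_id": 1, "amount": 50.0, "last_updated": datetime(2024, 1, 10)},
    {"order_id": 2, "amount": 75.0, "last_updated": datetime(2024, 3, 2)},
]

def static_extract(rows):
    # Snapshot of the whole chosen subset at a point in time.
    return list(rows)

def incremental_extract(rows, last_extract_time):
    # Only the rows that changed since the last extract.
    return [r for r in rows if r["last_updated"] > last_extract_time]

snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2024, 2, 1))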
Figure 11-10: Steps in data reconciliation (cont.)
Scrub/Cleanse = uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
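A minimal Python sketch of rule-based cleansing, with assumed customer fields and a small lookup table of known misspellings; production scrubbing tools apply far richer pattern recognition:

import re

STATE_FIXES = {"Ohoi": "Ohio", "Texs": "Texas"}   # known misspellings (lookup table)

def cleanse(record):
    rec = dict(record)
    # Correct a known misspelling, otherwise keep the original value.
    rec["state"] = STATE_FIXES.get(rec.get("state"), rec.get("state"))
    # Flag an erroneous date rather than silently keeping it.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(rec.get("birth_date", ""))):
        rec["birth_date"] = None                   # mark as missing for later resolution
    return rec

raw = {"name": "Ann Lee", "state": "Ohoi", "birth_date": "13/40/1990"}
print(cleanse(raw))   # {'name': 'Ann Lee', 'state': 'Ohio', 'birth_date': None}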
Figure 11-10: Steps in data reconciliation (cont.)
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level functions:
Selection–data partitioning
Joining–data combining
Aggregation–data summarization
Field-level functions:
Single-field–from one field to one field
Multi-field–from many fields to one field, or from one field to many
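A minimal Python sketch of the record-level functions (selection, joining, aggregation), using illustrative sales rows:

from collections import defaultdict

sales = [
    {"store": "S1", "product": "P1", "amount": 10.0},
    {"store": "S1", "product": "P2", "amount": 25.0},
    {"store": "S2", "product": "P1", "amount": 5.0},
]
products = {"P1": "Widget", "P2": "Gadget"}

# Selection (data partitioning): keep only store S1's records.
s1_sales = [r for r in sales if r["store"] == "S1"]

# Joining (data combining): attach the product description to each sale.
joined = [{**r, "product_name": products[r["product"]]} for r in sales]

# Aggregation (data summarization): total sales amount per store.
totals = defaultdict(float)
for r in sales:
    totals[r["store"]] += r["amount"]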
Figure 11-10: Steps in data reconciliation (cont.)
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
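A minimal Python sketch of the two load modes, using an illustrative in-memory target keyed by a business key:

warehouse = {}   # target table, keyed by order_id

def refresh_load(warehouse, full_snapshot):
    # Refresh mode: discard and bulk-rewrite the target data.
    warehouse.clear()
    warehouse.update({row["order_id"]: row for row in full_snapshot})

def update_load(warehouse, changed_rows):
    # Update mode: write only the rows that changed in the source.
    for row in changed_rows:
        warehouse[row["order_id"]] = row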
Figure 11-11: Single-field transformation
In general, some transformation function translates data from the old form to the new form
Algorithmic transformation uses a formula or logical expression
Table lookup is another approach; it uses a separate table keyed by the source record code
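A minimal Python sketch contrasting the two single-field approaches (the unit conversion and region codes are illustrative assumptions):

# Algorithmic transformation: a formula converts the old form to the new form.
def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32) * 5 / 9

# Table lookup: a separate table keyed by the source record code.
REGION_LOOKUP = {"E": "East", "W": "West", "N": "North", "S": "South"}

def decode_region(code):
    return REGION_LOOKUP[code]

print(fahrenheit_to_celsius(212.0))   # 100.0
print(decode_region("W"))             # West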
Figure 11-12: Multifield transformation
M:1–from many source fields to one target field
1:M–from one source field to many target fields
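A minimal Python sketch of M:1 and 1:M field transformations, with assumed address and name fields:

# M:1 - many source fields combined into one target field.
def build_full_address(street, city, state, zip_code):
    return f"{street}, {city}, {state} {zip_code}"

# 1:M - one source field split into many target fields.
def split_full_name(full_name):
    first, last = full_name.split(" ", 1)
    return {"first_name": first, "last_name": last}

print(build_full_address("12 Elm St", "Dayton", "OH", "45402"))
print(split_full_name("Ann Lee"))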
Derived Data
• Objectives
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
– Ad-hoc query support
– Data mining capabilities
• Characteristics
– Detailed (mostly periodic) data
– Aggregate (for summary)
– Distributed (to departmental servers)
Most common data model = star schema (also called “dimensional model”)
Figure 11-13: Components of a star schema
Fact tables contain factual or quantitative data
Dimension tables contain descriptions about the subjects of the business
1:N relationship between dimension tables and fact tables
Dimension tables are denormalized to maximize performance
Excellent for ad-hoc queries, but bad for online transaction processing
Figure 11-14: Star schema example
Fact table provides statistics for sales broken down by product, period, and store dimensions
Figure 11-15: Star schema with sample data
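A minimal sketch of a product/period/store star schema like the one in Figures 11-14 and 11-15, written as SQLite DDL from Python; the column names are illustrative assumptions, not the book's exact attributes:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    description TEXT
);
CREATE TABLE period_dim (
    period_key INTEGER PRIMARY KEY,
    year INTEGER,
    quarter INTEGER
);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY,
    city TEXT
);
CREATE TABLE sales_fact (              -- one row per product/period/store combination
    product_key INTEGER REFERENCES product_dim,
    period_key INTEGER REFERENCES period_dim,
    store_key INTEGER REFERENCES store_dim,
    units_sold INTEGER,
    dollars_sold REAL
);
""")

# An ad-hoc query: total dollars sold by store city and year.
rows = conn.execute("""
    SELECT s.city, p.year, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN store_dim s ON f.store_key = s.store_key
    JOIN period_dim p ON f.period_key = p.period_key
    GROUP BY s.city, p.year
""").fetchall()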
Issues Regarding Star Schema
• Dimension table keys must be surrogate (non-intelligent and non-business related), because:
– Keys may change over time
– Length/format consistency
• Granularity of fact table–what level of detail do you want? (a rough size estimate is sketched after this list)
– Transactional grain–finest level
– Aggregated grain–more summarized
– Finer grain → better market basket analysis capability
– Finer grain → more dimension tables, more rows in fact table
• Duration of the database–how much history should be kept?
– Natural duration–13 months or 5 quarters
– Financial institutions may need longer duration
– Older data is more difficult to source and cleanse
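A rough Python sketch of the fact table size estimate mentioned in the granularity bullet above; every multiplier here is an assumed, illustrative value:

num_stores = 1000
num_products = 10000                  # distinct products carried
num_periods = 24                      # e.g., 24 months of history kept
pct_products_sold_per_period = 0.5    # not every product sells in every store each period

rows = int(num_stores * num_products * num_periods * pct_products_sold_per_period)

bytes_per_row = 40                    # keys plus a few numeric facts
size_gb = rows * bytes_per_row / 1024**3
print(f"{rows:,} rows, about {size_gb:.1f} GB")   # 120,000,000 rows, about 4.5 GB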
Figure 11-16: Modeling dates
Fact tables contain time-period data
Date dimensions are important
On-Line Analytical Processing (OLAP)
• The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation
• Multidimensional OLAP (MOLAP)
– Cube structure
• OLAP Operations (sketched below)
– Cube slicing – produce a two-dimensional view of the data
– Drill-down – go from summary to more detailed views
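A minimal Python sketch of slicing and drill-down on a tiny three-dimensional cube (product × period × store → units sold), with illustrative data:

cube = {
    ("Widget", "2024-Q1", "S1"): 120,
    ("Widget", "2024-Q1", "S2"): 80,
    ("Gadget", "2024-Q1", "S1"): 50,
    ("Widget", "2024-Q2", "S1"): 140,
}

# Slice: fix one dimension (period = 2024-Q1) to get a 2-D product x store view.
slice_q1 = {(prod, store): units
            for (prod, period, store), units in cube.items()
            if period == "2024-Q1"}

# Summary: total units by product (all periods, all stores).
summary = {}
for (prod, period, store), units in cube.items():
    summary[prod] = summary.get(prod, 0) + units

# Drill-down: break the Widget summary out by period for more detail.
widget_by_period = {}
for (prod, period, store), units in cube.items():
    if prod == "Widget":
        widget_by_period[period] = widget_by_period.get(period, 0) + units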
Figure 11-22: Slicing a data cube
Figure 11-24: Example of drill-down
Starting with summary data (the summary report), users can obtain details for particular cells
Drill-down with color added
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques:
– Case-based reasoning
– Rule discovery
– Signal processing
– Neural nets
– Fractals
• Data visualization – representing data in graphical/multimedia formats for analysis