Data Mining:
Concepts and
Techniques
— Chapter 3 —
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
August 10, 2009 Data Mining: Concepts and Techniques 1
Chapter 3: Data Warehousing, Data
Generalization, and On-line Analytical
Processing
Data warehouse: Basic concept
Data warehouse modeling: Data cube and
OLAP
Data warehouse architecture
Data warehouse implementation
Data generalization and concept
description
What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained
separately from the organization’s operational database
Support information processing by providing a solid
platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
Data Warehouse—Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line
transaction records
Data cleaning and data integration techniques
are applied.
Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
E.g., Hotel price: currency, tax, breakfast covered,
etc.
When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational
systems
Operational database: current value data
Data warehouse data: provide information from
a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or
implicitly
But the key of operational data may or may not
contain a “time element”
Data Warehouse—Nonvolatile
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment
Does not require transaction processing,
recovery, and concurrency control mechanisms
Requires only two operations in data
accessing:
initial loading of data and access of data
Data Warehouse vs. Heterogeneous
DBMS
Traditional heterogeneous DB integration: A query driven
approach
Build wrappers/mediators on top of heterogeneous
databases
When a query is posed to a client site, a meta-dictionary
is used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results
are integrated into a global answer set
Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in
advance and stored in the warehouse for direct querying and analysis
Data Warehouse vs. Operational
DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage | repetitive | ad-hoc
access | read/write; index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:
missing data: Decision support requires historical data
which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use inconsistent
data representations, codes and formats which have to
be reconciled
Note: There are more and more systems which perform
OLAP analysis directly on relational databases
From Tables and Spreadsheets to Data
Cubes
A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
Dimension tables, such as item (item_name, brand,
type), or time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and
keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a
base cuboid. The top most 0-D cuboid, which holds the
highest-level of summarization, is called the apex cuboid.
The lattice of cuboids forms a data cube.
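A minimal sketch of the lattice idea, assuming each dimension has a single level so that a cuboid is simply a subset of the dimensions. The function name `cuboids` and the four dimension names are illustrative choices, not part of the original slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every subset of the dimensions (every cuboid) by its dimensionality."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = cuboids(dims)

for k, level in lattice.items():
    # k = 0 is the apex cuboid, k = len(dims) is the base cuboid.
    print(f"{k}-D cuboids ({len(level)}): {level}")
```

With four dimensions this yields 1 + 4 + 6 + 4 + 1 = 16 cuboids, from the apex cuboid `()` down to the 4-D base cuboid.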
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Conceptual Modeling of Data
Warehouses
Modeling data warehouses: dimensions &
measures
Star schema: A fact table in the middle
connected to a set of dimension tables
Snowflake schema: A refinement of star
schema where some dimensional hierarchy is
normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact
constellation
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Example of Fact
Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Cube Definition Syntax (BNF) in
DMQL
Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as “cube definition”
define dimension <dimension_name> as
<dimension_name_first_time> in cube
<cube_name_first_time>
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day,
day_of_week, month, quarter, year)
define dimension item as (item_key, item_name,
brand, type, supplier_type)
define dimension branch as (branch_key,
branch_name, branch_type)
define dimension location as (location_key, street,
city, province_or_state, country)
Defining Snowflake Schema in
DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street,
city(city_key, province_or_state, country))
Defining Fact Constellation in
DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures of Data Cube: Three
Categories
Distributive: if the result derived by applying the
function to n aggregate values is the same as that
derived by applying the function on all the data
without partitioning
E.g., count(), sum(), min(), max()
Algebraic: if it can be computed by an algebraic
function with M arguments (where M is a bounded
integer), each of which is obtained by applying a
distributive aggregate function
E.g., avg(), min_N(), standard_deviation()
Holistic: if there is no constant bound on the
storage size needed to describe a subaggregate.
E.g., median(), mode(), rank()
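The three categories can be illustrated with a small sketch. The partitioned data below is made up for illustration; the point is only that sum() and count() combine across partitions, avg() can be derived from those two sub-aggregates, and median() cannot be recovered from per-partition medians.

```python
from statistics import median

# Hypothetical data, split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]
all_values = [v for part in partitions for v in part]

# Distributive: aggregating the partial results gives the global result.
total_sum = sum(sum(part) for part in partitions)
total_count = sum(len(part) for part in partitions)

# Algebraic: avg() is a function of two distributive sub-aggregates.
avg_from_parts = total_sum / total_count

# Holistic: the median of partition medians is generally not the global median.
median_of_medians = median(median(part) for part in partitions)
global_median = median(all_values)

print(total_sum, total_count, avg_from_parts)
print(median_of_medians, global_median)
```

In this example the combined sum, count, and average match the values computed over the unpartitioned data, while the median of medians differs from the true median.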
A Concept Hierarchy: Dimension
(location)
all all
region Europe ... North_America
country Germany ... Spain Canada ... Mexico
city Frankfurt ... Vancouver ... Toronto
office L. Chan ... M. Wind
View of Warehouses and
Hierarchies
Specification of hierarchies
Schema hierarchy
day < {month < quarter; week} < year
Set_grouping hierarchy
{1..10} < …
Multidimensional Data
Sales volume as a function of product,
month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month > Day, with Week > Day as an alternative path
[Figure: data cube with axes Product, Region, and Month]
A Sample Data Cube
[Figure: 3-D sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: (product), (date), (country)
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension
reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level
summary or detailed data, or introducing new
dimensions
Slice and dice: project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D
planes
Other operations
drill across: involving (across) more than one fact
table
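The operations listed above can be sketched on a tiny in-memory cube. Everything here is illustrative: the cells, the city-to-country mapping, and the function names are assumptions for the example, and a real OLAP server would do this over precomputed cuboids rather than Python dictionaries.

```python
from collections import defaultdict

# Hypothetical base cells: (city, quarter, item) -> units sold.
cells = {
    ("Vancouver", "Q1", "TV"): 10,
    ("Vancouver", "Q2", "PC"): 20,
    ("Toronto", "Q1", "TV"): 5,
    ("Chicago", "Q1", "PC"): 7,
    ("Chicago", "Q2", "TV"): 3,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up(cells, mapping):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(mapping[city], quarter, item)] += units
    return dict(out)

def drop_dimension(cells, axis):
    """Roll up by dimension reduction: sum over one dimension and remove it."""
    out = defaultdict(int)
    for key, units in cells.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)

def slice_(cells, axis, value):
    """Slice: select on a single dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}

def dice(cells, allowed):
    """Dice: select on several dimensions; allowed maps axis -> set of values."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

by_country = roll_up(cells, city_to_country)
without_quarter = drop_dimension(cells, 1)
q1_only = slice_(cells, 1, "Q1")
canada_tv = dice(by_country, {0: {"Canada"}, 2: {"TV"}})
```

Drill-down is the reverse of `roll_up`: it needs the finer-grained cells to be available, which is why it is not shown as a function of the rolled-up result.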
Fig. 3.10 Typical
OLAP Operations
A Star-Net Query Model
[Figure: star-net with radial lines, each representing a concept hierarchy for a dimension]
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Other lines: Customer, Promotion
Each circle is called a footprint
Design of Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data
warehouse
Top-down view
allows selection of the relevant information necessary
for the data warehouse
Data source view
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
consists of fact tables and dimension tables
Business query view
sees the perspectives of data in the warehouse from
the view of end-user
Data Warehouse Design
Process
Top-down, bottom-up approaches or a combination of both
Top-down: Starts with overall design and planning
(mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step
before proceeding to the next
Spiral: rapid generation of increasingly functional
systems, with short turnaround time
Typical data warehouse design process
Choose a business process to model, e.g., orders,
invoices, etc.
Choose the grain (atomic level of data) of the business
process
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture, left to right]
Data Sources: operational DBs, other sources
Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
Data Storage: data warehouse, data marts
OLAP Engine: OLAP server (Serve)
Front-End Tools: analysis, query, reports, data mining
Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects
spanning the entire organization
Data Mart
a subset of corporate-wide data that is of value to
specific groups of users. Its scope is confined
to specific, selected groups, such as marketing
data mart
Independent vs. dependent (directly from warehouse)
data mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may
be materialized
Data Warehouse
Development: A
Recommended Approach
[Figure: development path. Define a high-level corporate data model; build data marts and the enterprise data warehouse, each with model refinement; then distributed data marts; finally a multi-tier data warehouse]
Data Warehouse Back-End Tools and
Utilities
Data extraction
get data from multiple, heterogeneous, and
external sources
Data cleaning
detect errors in the data and rectify them when
possible
Data transformation
convert data from legacy or host format to
warehouse format
Load
sort, summarize, consolidate, compute views,
check integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to
the warehouse
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents
Operational meta-data
data lineage (history of migrated data and transformation
path), currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error
reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data
warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
OLAP Server Architectures
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middleware
Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and
services
Greater scalability
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
Flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers (e.g., Redbricks)
Specialized support for SQL queries over star/snowflake
schemas
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling,
pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and
presenting the mining results using visualization tools
Efficient Data Cube
Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one
cell
How many cuboids in an n-dimensional cube, where dimension i has L_i levels?
$$T = \prod_{i=1}^{n} (L_i + 1)$$
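A quick numeric check of the formula, as a sketch. The helper name `total_cuboids` is illustrative; the "+ 1" for each dimension accounts for the virtual top level "all".

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Four dimensions with one level each: the 16-cuboid lattice of a 4-D cube.
print(total_cuboids([1, 1, 1, 1]))

# Ten dimensions with four levels each: 5**10 cuboids.
print(total_cuboids([4] * 10))
```

The second case shows why full materialization quickly becomes impractical as dimensions and hierarchy levels are added.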
Materialization of data cube
Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
Selection of which cuboids to materialize
Cube Operation
Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
Need to compute the following Group-Bys
(city, item, year),
(city, item), (city, year), (item, year),
(city), (item), (year),
()
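What the cube by operator has to produce can be sketched naively: one group-by per subset of the listed dimensions. The rows and the function name `cube_by` are made up for illustration, and this brute-force version rescans the data for every group-by, which real cube algorithms avoid.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical SALES rows.
rows = [
    {"item": "TV", "city": "Vancouver", "year": 2004, "amount": 100},
    {"item": "TV", "city": "Toronto", "year": 2004, "amount": 150},
    {"item": "PC", "city": "Vancouver", "year": 2005, "amount": 200},
]
dims = ("item", "city", "year")

def cube_by(rows, dims, measure):
    """Compute SUM(measure) for every group-by over a subset of dims."""
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(totals)
    return result

cube = cube_by(rows, dims, "amount")
for group, totals in cube.items():
    print(group, totals)
```

For three dimensions this yields 2^3 = 8 group-bys, ending with the empty group-by `()` that holds the grand total.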
Iceberg Cube
Computing only the cuboid cells
whose count or other aggregates
satisfy a condition like
HAVING COUNT(*) >= minsup
Motivation
Only a small portion of cube cells may be “above
the water’’ in a sparse cube
Only calculate “interesting” cells—data above
certain threshold
Avoid explosive growth of the cube
Suppose 100 dimensions, only 1 base cell. How many
aggregate cells if count >= 1? What about count >=
2?
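The question above can be worked through on a small case, as a sketch. With a single base cell, every aggregate cell is obtained by replacing one or more of its values with "*", and each such cell has count 1. The function name `aggregate_cells` and the three-value base cell are illustrative assumptions.

```python
from itertools import product

def aggregate_cells(base_cell, min_count):
    """List aggregate cells of a cube built from ONE base cell that meet min_count."""
    cells = []
    for mask in product([False, True], repeat=len(base_cell)):
        if not any(mask):
            continue  # no "*" at all: this is the base cell itself, not an aggregate
        cell = tuple("*" if starred else value
                     for value, starred in zip(base_cell, mask))
        count = 1  # only the single base tuple falls into any of these cells
        if count >= min_count:
            cells.append(cell)
    return cells

small = ("a1", "a2", "a3")
print(len(aggregate_cells(small, 1)))  # 2**3 - 1
print(len(aggregate_cells(small, 2)))  # no cell reaches count 2

# The same reasoning for 100 dimensions, without enumerating anything:
cells_for_100_dims = 2 ** 100 - 1
```

So with 100 dimensions there are 2^100 - 1 aggregate cells under count >= 1 and none under count >= 2, which is the motivation for computing only the iceberg.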
Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the
value for the indexed column
not suitable for high cardinality domains
Base table
Cust | Region | Type
C1 | Asia | Retail
C2 | Europe | Dealer
C3 | Asia | Dealer
C4 | America | Retail
C5 | Europe | Dealer
Index on Region
RecID | Asia | Europe | America
1 | 1 | 0 | 0
2 | 0 | 1 | 0
3 | 1 | 0 | 0
4 | 0 | 0 | 1
5 | 0 | 1 | 0
Index on Type
RecID | Retail | Dealer
1 | 1 | 0
2 | 0 | 1
3 | 0 | 1
4 | 1 | 0
5 | 0 | 1
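The base table above can be indexed in a few lines, as a sketch. Here each bit vector is stored as a Python integer (bit i stands for RecID i + 1); the helper names `build_bitmap_index` and `rec_ids` are illustrative, not a real library API.

```python
# Base table from the slide: (Cust, Region, Type); RecID is position + 1.
rows = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct column value to a bit vector with one bit per row."""
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << position)
    return index

def rec_ids(bits):
    """Turn a bit vector back into a list of RecIDs."""
    return [position + 1 for position in range(bits.bit_length())
            if (bits >> position) & 1]

region_index = build_bitmap_index(rows, 1)
type_index = build_bitmap_index(rows, 2)

# Selections become bitwise operations on the vectors.
europe_dealers = rec_ids(region_index["Europe"] & type_index["Dealer"])
asia_or_america = rec_ids(region_index["Asia"] | region_index["America"])
```

Each distinct value needs its own vector of length equal to the table size, which is why the approach suits low-cardinality columns such as Region and Type.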
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
Traditional indices map the values to a list
of record ids
It materializes relational join in JI file
and speeds up relational join
In data warehouses, join index relates the
values of the dimensions of a star schema
to rows in the fact table.
E.g. fact table: Sales and two
dimensions city and product
A join index on city maintains for
each distinct city a list of R-IDs of the
tuples recording the Sales in the city
Join indices can span multiple
dimensions
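A minimal sketch of the Sales example, assuming a made-up fact table keyed by R-ID. The function `build_join_index` is hypothetical; passing one column gives a join index on a single dimension, passing two gives an index that spans both.

```python
from collections import defaultdict

# Hypothetical Sales fact table: R-ID -> (city, product, dollars_sold).
sales = {
    "R1": ("Vancouver", "TV", 300),
    "R2": ("Toronto", "TV", 250),
    "R3": ("Vancouver", "PC", 900),
    "R4": ("Vancouver", "TV", 310),
}
CITY, PRODUCT, DOLLARS = 0, 1, 2

def build_join_index(fact_table, columns):
    """For each combination of dimension values, list the R-IDs of matching fact rows."""
    index = defaultdict(list)
    for rid, row in fact_table.items():
        index[tuple(row[c] for c in columns)].append(rid)
    return dict(index)

city_index = build_join_index(sales, [CITY])
city_product_index = build_join_index(sales, [CITY, PRODUCT])

# Answering "total sales in Vancouver" touches only the indexed fact rows.
vancouver_total = sum(sales[rid][DOLLARS] for rid in city_index[("Vancouver",)])
```

The index is built once and reused, so the join between dimension value and fact rows is not recomputed per query.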
Efficient Processing OLAP Queries
Determine which operations should be performed on the available
cuboids
Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for OLAP
op.
Let the query to be processed be on {brand, province_or_state}
with the condition “year = 2004”, and there are 4 materialized
cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
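The selection among the four cuboids can be sketched as a level check. This assumes two concept hierarchies that the slide does not spell out here: item_name is finer than brand, and city < province_or_state < country. A cuboid is treated as usable when it stores every needed dimension at the same or a finer level; the names `HIERARCHIES`, `level_of`, and `can_answer` are illustrative.

```python
# Concept hierarchies, finest level first (assumed for this sketch).
HIERARCHIES = {
    "time": ["year"],
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
}

def level_of(attribute):
    """Return (dimension, level index) for an attribute; 0 is the finest level."""
    for dimension, levels in HIERARCHIES.items():
        if attribute in levels:
            return dimension, levels.index(attribute)
    raise KeyError(attribute)

def can_answer(cuboid, needed):
    """True if the cuboid stores each needed dimension at the same or a finer level."""
    stored = dict(level_of(a) for a in cuboid)
    for attribute in needed:
        dimension, level = level_of(attribute)
        if dimension not in stored or stored[dimension] > level:
            return False
    return True

# cuboid number -> (attributes, selection already applied when it was built)
materialized = {
    1: ({"year", "item_name", "city"}, None),
    2: ({"year", "brand", "country"}, None),
    3: ({"year", "brand", "province_or_state"}, None),
    4: ({"item_name", "province_or_state"}, ("year", 2004)),
}
group_by = {"brand", "province_or_state"}
selection = ("year", 2004)

usable = []
for number, (attributes, prefilter) in materialized.items():
    # A cuboid already restricted to year = 2004 does not need a year column.
    needed = set(group_by) if prefilter == selection else group_by | {selection[0]}
    if can_answer(attributes, needed):
        usable.append(number)

print(usable)
```

Under these assumptions cuboid 2 is rejected because country is coarser than province_or_state. Among the usable cuboids, a finer one such as cuboid 1 needs more aggregation work at query time, so a cost estimate would still be required to pick the best one.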
What is Concept Description?
Descriptive vs. predictive data mining
Descriptive mining: describes concepts or task-
relevant data sets in concise, summarative,
informative, discriminative forms
Predictive mining: Based on data and analysis,
constructs models for the database, and predicts
the trend and properties of unknown data
Concept description:
Characterization: provides a concise and
succinct summarization of the given collection of
data
Comparison: provides descriptions comparing
two or more collections of data
Data Generalization and Summarization-
based Characterization
Data generalization
A process which abstracts a large set of task-relevant
data in a database from low
conceptual levels to higher ones.
[Figure: conceptual levels 1 through 5]
Approaches:
Data cube approach(OLAP approach)
Attribute-oriented induction approach
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular
measures
How is it done?
Collect the task-relevant data (initial relation)
using a relational database query
Perform generalization by attribute removal or
attribute generalization
Apply aggregation by merging identical,
generalized tuples and accumulating their
respective counts
Interactive presentation with users
Basic Principles of Attribute-Oriented
Induction
Data focusing: task-relevant data, including
dimensions, and the result is the initial relation
Attribute-removal: remove attribute A if there is a
large set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
Attribute-generalization: If there is a large set of
distinct values for A, and there exists a set of
generalization operators on A, then select an
operator and generalize A
Attribute-threshold control: typical 2-8,
specified/default
Attribute-Oriented Induction: Basic
Algorithm
InitialRel: Query processing of task-relevant data,
deriving the initial relation.
PreGen: Based on the analysis of the number of
distinct values in each attribute, determine
generalization plan for each attribute: removal? or
how high to generalize?
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross
tabs, visualization presentations.
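The InitialRel and PrimeGen steps can be sketched as follows. This is a simplified illustration rather than the published algorithm: the generalization plan is written by hand instead of being derived in a PreGen step, the three student rows and the GPA cut-off are invented, and attributes left out of the plan (such as name) are treated as removed.

```python
from collections import Counter

# InitialRel: a hypothetical initial relation of task-relevant tuples.
students = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "gpa": 3.83},
]

def gpa_band(gpa):
    # Hypothetical cut-off, chosen only for this example.
    return "Excellent" if gpa >= 3.75 else "Very-good"

# Generalization plan: attribute -> generalization operator.
# Attributes absent from the plan are removed (attribute removal).
plan = {
    "gender": lambda v: v,
    "major": lambda v: {"CS": "Science", "Physics": "Science"}.get(v, "Other"),
    "birth_place": lambda v: "Canada" if v.endswith("Canada") else "Foreign",
    "gpa": gpa_band,
}

def prime_generalize(relation, plan):
    """PrimeGen: generalize each tuple, merge identical ones, accumulate counts."""
    counts = Counter()
    for row in relation:
        counts[tuple(op(row[attr]) for attr, op in plan.items())] += 1
    return counts

prime_relation = prime_generalize(students, plan)
for generalized_tuple, count in prime_relation.items():
    print(generalized_tuple, count)
```

The two CS students collapse into one generalized tuple with count 2, which is the merging step that makes the result small enough to present as a crosstab or rule.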
Example
DMQL: Describe general characteristics of
graduate students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major,
birth_place, birth_date, residence, phone#,
gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place,
birth_date, residence, phone#, gpa
from student
Class Characterization: An Example
Initial Relation
Name | Gender | Major | Birth-Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …
Generalization of each attribute
Name: removed; Gender: retained; Major: Sci, Eng, Bus; Birth-Place: country; Birth_date: age range; Residence: city; Phone #: removed; GPA: Excl, VG, ..
Prime Generalized Relation
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …
Crosstab of Gender by Birth_Region
Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62
Presentation of Generalized
Results
Generalized relation:
Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
Cross tabulation:
Mapping results into cross tabulation form (similar to
contingency tables).
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual
forms.
Quantitative characteristic rules:
Mapping generalized result into characteristic rules with
quantitative information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t : 53%] ∨ birth_region(x) = "foreign" [t : 47%].
Mining Class Comparisons
Comparison: Comparing two or more classes
Method:
Partition the set of relevant data into the target class and
the contrasting class(es)
Generalize both classes to the same high level concepts
Compare tuples with the same high level descriptions
Present for every tuple its description and two measures
support - distribution within single class
comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance Analysis:
Find attributes (features) which best distinguish different
classes
Quantitative Discriminant Rules
Cj = target class
qa = a generalized tuple that covers some tuples of the target class
but can also cover some tuples of the contrasting classes
d-weight
range: [0, 1]
$$d\text{-weight} = \frac{\mathrm{count}(q_a \in C_j)}{\sum_{i=1}^{m} \mathrm{count}(q_a \in C_i)}$$
quantitative discriminant rule form
∀ X, target_class(X) ⇐ condition(X) [d : d_weight]
Example: Quantitative Discriminant
Rule
Status | Birth_country | Age_range | Gpa | Count
Graduate | Canada | 25-30 | Good | 90
Undergraduate | Canada | 25-30 | Good | 210
Count distribution between graduate and undergraduate students for a generalized tuple
Quantitative discriminant rule
∀ X, graduate_student(X) ⇐
birth_country(X) = "Canada" ∧ age_range(X) = "25-30" ∧ gpa(X) = "good" [d : 30%]
where 90/(90 + 210) = 30%
Class Description
Quantitative characteristic rule
∀ X, target_class(X) ⇒ condition(X) [t : t_weight]
necessary
Quantitative discriminant rule
∀ X, target_class(X) ⇐ condition(X) [d : d_weight]
sufficient
Quantitative description rule
∀ X, target_class(X) ⇔
condition_1(X) [t : w_1, d : w′_1] ∨ ... ∨ condition_n(X) [t : w_n, d : w′_n]
necessary and sufficient
Example: Quantitative Description
Rule
Location/item | TV: Count, t-wt, d-wt | Computer: Count, t-wt, d-wt | Both_items: Count, t-wt, d-wt
Europe | 80, 25%, 40% | 240, 75%, 30% | 320, 100%, 32%
N_Am | 120, 17.65%, 60% | 560, 82.35%, 70% | 680, 100%, 68%
Both_regions | 200, 20%, 100% | 800, 80%, 100% | 1000, 100%, 100%
Crosstab showing associated t-weight, d-weight values and total number
(in thousands) of TVs and computers sold at AllElectronics in 1998
Quantitative description rule for target class
Europe
∀ X, Europe(X) ⇔
(item(X) = "TV") [t : 25%, d : 40%] ∨ (item(X) = "computer") [t : 75%, d : 30%]
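The weights in the crosstab can be recomputed from the raw counts, as a sketch. The t-weight divides a cell count by its class (row) total, and the d-weight divides it by the total for that condition across all classes (column total). The function names `t_weight` and `d_weight` are illustrative.

```python
# Counts (in thousands) from the crosstab: class -> item -> count.
counts = {
    "Europe": {"TV": 80, "computer": 240},
    "N_Am": {"TV": 120, "computer": 560},
}

def t_weight(counts, target, item):
    """Share of the target class that satisfies the condition (row-wise)."""
    return counts[target][item] / sum(counts[target].values())

def d_weight(counts, target, item):
    """Share of tuples satisfying the condition that belong to the target class (column-wise)."""
    return counts[target][item] / sum(row[item] for row in counts.values())

for region in counts:
    for item in counts[region]:
        print(region, item,
              f"t = {t_weight(counts, region, item):.2%}",
              f"d = {d_weight(counts, region, item):.2%}")
```

For Europe and TV this gives t = 80/320 = 25% and d = 80/200 = 40%, matching the rule above.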
Concept Description vs. Cube-Based
OLAP
Similarity:
Data generalization
Presentation of data summarization at multiple
levels of abstraction
Interactive drilling, pivoting, slicing and dicing
Differences:
OLAP has systematic preprocessing, query
independent, and can drill down to rather low
level
AOI has automated desired level allocation, and
may perform dimension relevance
analysis/ranking when there are many relevant
dimensions
From On-Line Analytical Processing
(OLAP)
to On Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned
data
Available information processing structure
surrounding data warehouses
ODBC, OLEDB, Web accessing, service
facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
Integration and swapping of multiple mining
functions, algorithms, and tasks
An OLAM System Architecture
[Figure: four-layer OLAM architecture, top to bottom]
Layer 4, User Interface: mining query in, mining result out, through the User GUI API
Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, above the Data Cube API
Layer 2, MDDB: multidimensional database and Meta Data, built by Filtering & Integration through the Database API
Layer 1, Data Repository: Databases and Data Warehouse, populated by data cleaning and data integration
Chapter 3: Data Warehousing, Data
Generalization, and On-line Analytical
Processing
Data warehouse: Basic concept
Data warehouse modeling: Data cube and OLAP
Data warehouse architecture
Data warehouse implementation
Data generalization and concept description
From data warehousing to data mining
Summary
Summary: Data Warehousing, Data Generalization, and On-line Analytical
Processing
Data generalization: Attribute-oriented induction
Data warehousing: A multi-dimensional model of a data
warehouse
Star schema, snowflake schema, fact constellations
A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Data warehouse architecture
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OLAP data: Bitmap index and join index
OLAP query processing
From OLAP to OLAM (on-line analytical mining)
References (I)
S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R.
Ramakrishnan, and S. Sarawagi. On the computation of multidimensional
aggregates. VLDB’96
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance
in data warehouses. SIGMOD’97
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional
databases. ICDE’97
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. ACM SIGMOD Record, 26:65-74, 1997
E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer
World, 27, July 1993.
J. Gray, et al. Data cube: A relational aggregation operator generalizing
group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery,
1:29-54, 1997.
A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations,
and Applications. MIT Press, 1999.
J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD
Record, 27:97-107, 1998.
References (II)
C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003
W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling. 2ed. John Wiley, 2002
P. O'Neil and D. Quass. Improved query performance with variant indexes.
SIGMOD'97
Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
A. Shoshani. OLAP and statistical databases: Similarities and differences.
PODS’00.
S. Sarawagi and M. Stonebraker. Efficient organization of large
multidimensional arrays. ICDE'94
OLAP council. MDAPI specification version 2.0. In
http://www.olapcouncil.org/research/apily.htm, 1998
E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems.
John Wiley, 1997
P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.