Concepts and Techniques: - Chapter 3

A data warehouse is a subject-oriented, integrated, time-variant collection of data. The time horizon for the data warehouse is significantly longer than that of operational systems. Data warehouses provide information from a historical perspective (e.g., past 5-10 years).

Data Mining:

Concepts and
Techniques

— Chapter 3 —

Jiawei Han and Micheline Kamber

Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
August 10, 2009 Data Mining: Concepts and Techniques 1
Chapter 3: Data Warehousing, Data
Generalization, and On-line Analytical
Processing
 Data warehouse: Basic concept
 Data warehouse modeling: Data cube and OLAP
 Data warehouse architecture
 Data warehouse implementation
 Data generalization and concept description
What is Data Warehouse?
 Defined in many different ways, but not rigorously.
 A decision support database that is maintained
separately from the organization’s operational database
 Support information processing by providing a solid
platform of consolidated, historical data for analysis.
 “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
 Data warehousing:
 The process of constructing and using data warehouses


Data Warehouse—Subject-Oriented

 Organized around major subjects, such as
customer, product, sales
 Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
 Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process

Data Warehouse—Integrated
 Constructed by integrating multiple,
heterogeneous data sources
 relational databases, flat files, on-line
transaction records
 Data cleaning and data integration techniques
are applied.
 Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
 E.g., Hotel price: currency, tax, breakfast covered,
etc.
 When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant

 The time horizon for the data warehouse is
significantly longer than that of operational
systems
 Operational database: current value data
 Data warehouse data: provide information from
a historical perspective (e.g., past 5-10 years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or
implicitly
 But the key of operational data may or may not
contain a “time element”
Data Warehouse—Nonvolatile
 A physically separate store of data transformed
from the operational environment
 Operational update of data does not occur in the
data warehouse environment
 Does not require transaction processing,
recovery, and concurrency control mechanisms
 Requires only two operations in data
accessing:
 initial loading of data and access of data


Data Warehouse vs. Heterogeneous
DBMS

 Traditional heterogeneous DB integration: A query driven

approach
 Build wrappers/mediators on top of heterogeneous
databases
 When a query is posed to a client site, a meta-dictionary
is used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results
are integrated into a global answer set
 Complex information filtering, compete for resources
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in
advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational
DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
                    OLTP                         OLAP
users               clerk, IT professional       knowledge worker
function            day to day operations        decision support
DB design           application-oriented         subject-oriented
data                current, up-to-date,         historical, summarized,
                    detailed, flat relational,   multidimensional,
                    isolated                     integrated, consolidated
usage               repetitive                   ad-hoc
access              read/write,                  lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction    complex query
# records accessed  tens                         millions
# users             thousands                    hundreds
DB size             100MB-GB                     100GB-TB
metric              transaction throughput       query throughput, response

Why Separate Data Warehouse?
 High performance for both systems
 DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
 Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
 Different functions and different data:
 missing data: Decision support requires historical data
which operational DBs do not typically maintain
 data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
 data quality: different sources typically use inconsistent
data representations, codes and formats which have to
be reconciled
 Note: There are more and more systems which perform
OLAP analysis directly on relational databases
From Tables and Spreadsheets to Data
Cubes
 A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and
viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand,
type), or time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and
keys to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a
base cuboid. The top most 0-D cuboid, which holds the
highest-level of summarization, is called the apex cuboid.
The lattice of cuboids forms a data cube.
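
The relationship between the base cuboid, the apex cuboid, and the lattice in between can be illustrated with a minimal Python sketch. It assumes that every subset of the dimension list is one cuboid (no concept hierarchies); the function name `cuboid_lattice` and the four dimension names are chosen for illustration only.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return {k: [all k-D cuboids]}, one cuboid per subset of `dimensions`."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }

dims = ["time", "item", "location", "supplier"]   # illustrative dimensions
lattice = cuboid_lattice(dims)

apex_cuboid = lattice[0][0]            # the 0-D cuboid: highest summarization
base_cuboid = lattice[len(dims)][0]    # the n-D cuboid: all dimensions kept
total_cuboids = sum(len(level) for level in lattice.values())
```

Without hierarchies, n dimensions give 2^n cuboids, so the four dimensions here yield 16.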
Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier

Conceptual Modeling of Data
Warehouses
 Modeling data warehouses: dimensions &
measures
 Star schema: A fact table in the middle
connected to a set of dimension tables
 Snowflake schema: A refinement of star
schema where some dimensional hierarchy is
normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
 Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact
constellation
Example of Star Schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, state_or_province, country)

Example of Snowflake Schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key)
supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, state_or_province, country)

Example of Fact
Constellation
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
shipper (shipper_key, shipper_name, location_key, shipper_type)
Cube Definition Syntax (BNF) in
DMQL
 Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
 Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
 Special Case (Shared Dimension Tables)
 First time as “cube definition”
 define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars),
    avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
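
To make the star schema concrete, here is a rough sketch of how a cut-down version of it might look as relational tables, using Python's built-in `sqlite3`. This is not DMQL and not the authors' implementation: only two dimensions are kept, the table name `time_dim`, the trimmed column lists, and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two dimension tables (branch and location omitted for brevity)
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    -- Fact table: one foreign key per dimension plus the raw measure
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item(item_key),
        sales_in_dollars REAL
    );
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2004), (2, "Q2", 2004)])
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 [(1, "TV-A", "BrandX"), (2, "PC-B", "BrandY")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 400.0), (1, 1, 600.0), (2, 1, 500.0), (2, 2, 900.0)])

# The three measures from the cube definition, grouped by two dimension attributes
rows = conn.execute("""
    SELECT t.quarter, i.brand,
           SUM(s.sales_in_dollars) AS dollars_sold,
           AVG(s.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales s
    JOIN time_dim t ON t.time_key = s.time_key
    JOIN item i     ON i.item_key = s.item_key
    GROUP BY t.quarter, i.brand
    ORDER BY t.quarter, i.brand
""").fetchall()
```

Each fact row joins to exactly one row per dimension table, which is what gives the schema its star shape.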
Defining Snowflake Schema in
DMQL

define cube sales_snowflake [time, item, branch, location]:


Defining Fact Constellation in
DMQL

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as
location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures of Data Cube: Three
Categories

 Distributive: if the result derived by applying the

function to n aggregate values is the same as that
derived by applying the function on all the data
without partitioning
 E.g., count(), sum(), min(), max()
 Algebraic: if it can be computed by an algebraic
function with M arguments (where M is a bounded
integer), each of which is obtained by applying a
distributive aggregate function
 E.g., avg(), min_N(), standard_deviation()
 Holistic: if there is no constant bound on the
storage size needed to describe a subaggregate.
 E.g., median(), mode(), rank()
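
The three categories can be contrasted on a small partitioned data set. The sketch below uses made-up numbers and only the Python standard library: sum() and count() combine across partitions, avg() is derived from those two, and the median cannot be rebuilt from per-partition medians.

```python
from statistics import median

partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]     # hypothetical data, split in three
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition sums/counts gives the global result.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Algebraic: avg() needs a bounded number (here 2) of distributive aggregates.
average = total_sum / total_count

# Holistic: a fixed-size summary per partition is not enough in general;
# the median of the partition medians differs from the true median here.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
```

This is why distributive and algebraic measures are cheap to maintain in a cube, while holistic ones usually require access to the underlying data or an approximation.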
A Concept Hierarchy: Dimension
(location)

all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind

View of Warehouses and
Hierarchies

Specification of hierarchies
 Schema hierarchy
day < {month < quarter; week} < year
 Set_grouping hierarchy
{1..10} <
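
The schema hierarchy day < {month < quarter; week} < year is a partial order rather than a single chain: week and month are both above day, but neither is above the other. A small sketch, assuming the hierarchy is stored as a hypothetical adjacency map from each level to its next coarser level(s):

```python
# Edges point from a finer level to the next coarser level(s).
time_hierarchy = {
    "day": ["month", "week"],
    "month": ["quarter"],
    "quarter": ["year"],
    "week": ["year"],
    "year": [],
}

def can_roll_up(hierarchy, finer, coarser):
    """True if `coarser` can be reached from `finer` by climbing the hierarchy."""
    stack, seen = [finer], set()
    while stack:
        level = stack.pop()
        if level == coarser:
            return True
        if level not in seen:
            seen.add(level)
            stack.extend(hierarchy[level])
    return False
```

For example, `can_roll_up(time_hierarchy, "day", "year")` holds, whereas weeks cannot be rolled up into quarters because a week may straddle a quarter boundary.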
Multidimensional Data

 Sales volume as a function of product,
month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
  Product: Industry > Category > Product
  Location: Region > Country > City > Office
  Time: Year > Quarter > Month > Day, with Week as an alternative level above Day
[Figure: cube with axes Product, Region, Month]
A Sample Data Cube

[Figure: a 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell holds the total annual sales of TV in U.S.A.]

Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country

Browsing a Data Cube

 Visualization
 OLAP capabilities
 Interactive manipulation
Typical OLAP Operations
 Roll up (drill-up): summarize data
 by climbing up hierarchy or by dimension
reduction
 Drill down (roll down): reverse of roll-up
 from higher level summary to lower level
summary or detailed data, or introducing new
dimensions

 Slice and dice: project and select
 Pivot (rotate):
 reorient the cube, visualization, 3D to series of 2D
planes
 Other operations
 drill across: involving (across) more than one fact table
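
The operations above can be sketched on a tiny cube held in a Python dictionary. Everything here is an illustrative assumption: the cell layout `(quarter, city, item) -> units`, the city-to-country mapping, and the function names are not part of any OLAP product's API.

```python
from collections import defaultdict

# Hypothetical base cells: (quarter, city, item) -> units sold
cells = {
    ("Q1", "Vancouver", "TV"): 10,
    ("Q1", "Toronto", "TV"): 20,
    ("Q1", "Frankfurt", "PC"): 5,
    ("Q2", "Vancouver", "PC"): 7,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}

def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), value in cells.items():
        out[(quarter, city_to_country[city], item)] += value
    return dict(out)

def drop_dimension(cells, axis):
    """Roll up by dimension reduction: aggregate one dimension away."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

def slice_cube(cells, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}

def dice_cube(cells, allowed):
    """Dice: select on several dimensions; `allowed` maps axis -> set of values."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

def pivot(cells, order):
    """Pivot: reorder the axes of every cell key."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}
```

Drill-down is the reverse of `roll_up_location` and needs the finer-grained cells to still be available, which is one reason warehouses keep detailed data.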
Fig. 3.10 Typical
OLAP Operations


A Star-Net Query Model
[Figure: star-net query model. Radial lines represent dimensions, and the points along each line are its abstraction levels.]
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Customer
Promotion
Each circle is called a footprint
Design of Data Warehouse: A
Business Analysis Framework
 Four views regarding the design of a data
warehouse
 Top-down view

allows selection of the relevant information necessary
for the data warehouse
 Data source view

exposes the information being captured, stored, and
managed by operational systems
 Data warehouse view

consists of fact tables and dimension tables
 Business query view
 sees the perspectives of data in the warehouse from
the view of end-user
Data Warehouse Design
Process
 Top-down, bottom-up approaches or a combination of both
 Top-down: Starts with overall design and planning
(mature)
 Bottom-up: Starts with experiments and prototypes (rapid)
 From software engineering point of view
 Waterfall: structured and systematic analysis at each step
before proceeding to the next
 Spiral: rapid generation of increasingly functional
systems, with short turnaround time between releases
 Typical data warehouse design process
 Choose a business process to model, e.g., orders,
invoices, etc.
 Choose the grain (atomic level of data) of the business
process
Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture, left to right]
Data Sources: operational DBs and other sources, fed through Extract, Transform, Load, Refresh
Data Storage: data warehouse and data marts, with a metadata repository and a monitor & integrator
OLAP Engine: OLAP server(s) that serve the front end
Front-End Tools: analysis, query, reports, data mining
Three Data Warehouse Models
 Enterprise warehouse
 collects all of the information about subjects

spanning the entire organization

 Data Mart
 a subset of corporate-wide data that is of value to
a specific group of users. Its scope is confined
to specific, selected groups, such as a marketing
data mart
 Independent vs. dependent (directly from warehouse)
data mart
 Virtual warehouse
 A set of views over operational databases

 Only some of the possible summary views may
be materialized
Data Warehouse
Development: A
Recommended Approach
[Figure: development path, bottom to top]
Define a high-level corporate data model
-> Data Marts and Enterprise Data Warehouse (each with model refinement)
-> Distributed Data Marts
-> Multi-Tier Data Warehouse
Data Warehouse Back-End Tools and
Utilities
 Data extraction
 get data from multiple, heterogeneous, and

external sources
 Data cleaning
 detect errors in the data and rectify them when

possible
 Data transformation
 convert data from legacy or host format to

warehouse format
 Load
 sort, summarize, consolidate, compute views,
check integrity, and build indices and partitions
 Refresh
 propagate the updates from the data sources to
the warehouse
Metadata Repository
 Meta data is the data defining warehouse objects. It stores:
 Description of the structure of the data warehouse
 schema, view, dimensions, hierarchies, derived data defn,
data mart locations and contents
 Operational meta-data
 data lineage (history of migrated data and transformation
path), currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error
reports, audit trails)
 The algorithms used for summarization
 The mapping from operational environment to the data
warehouse
 Data related to system performance
 warehouse schema, view and derived data definitions

 Business data
OLAP Server Architectures

 Relational OLAP (ROLAP)

 Use relational or extended-relational DBMS to store and
manage warehouse data and OLAP middleware
 Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and
services
 Greater scalability
 Multidimensional OLAP (MOLAP)
 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
 Flexibility, e.g., low level: relational, high-level: array
 Specialized SQL servers (e.g., Redbricks)
 Specialized support for SQL queries over star/snowflake
schemas
Data Warehouse Usage
 Three kinds of data warehouse applications
 Information processing

supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
 Analytical processing

multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling,
pivoting
 Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models,
performing classification and prediction, and
presenting the mining results using visualization tools
Efficient Data Cube
Computation
 Data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one
cell
 How many cuboids are there in an n-dimensional cube
where dimension i has L_i levels?
T = \prod_{i=1}^{n} (L_i + 1)

 Materialization of data cube

 Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
 Selection of which cuboids to materialize
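
The cuboid-count formula is easy to evaluate directly. In the sketch below, `levels_per_dimension` is a hypothetical list holding L_i for each dimension; the +1 accounts for the virtual top level "all" of each dimension.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

no_hierarchies = total_cuboids([1, 1, 1])      # 3 dimensions, 1 level each
ten_dims_four_levels = total_cuboids([4] * 10) # 10 dimensions, 4 levels each
```

With one level per dimension the formula reduces to 2^n, and the count grows quickly once hierarchies are added, which motivates partial materialization.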
Cube Operation

 Cube definition and computation in DMQL

define cube sales[item, city, year]:
sum(sales_in_dollars)
compute cube sales
 Transform it into a SQL-like language (with a new
operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
 Need compute the following Group-Bys
(city, item, year),
(city, item), (city, year), (item, year),
(city), (item), (year)
()
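
What CUBE BY asks for can be sketched naively in Python: run one GROUP BY per subset of the listed dimensions. The rows and the function name `cube_by` are hypothetical, and this brute-force loop is only meant to show the result shape, not an efficient cube computation method.

```python
from collections import defaultdict
from itertools import combinations

rows = [  # hypothetical SALES rows
    {"item": "TV", "city": "Vancouver", "year": 2004, "amount": 100},
    {"item": "TV", "city": "Toronto", "year": 2004, "amount": 150},
    {"item": "PC", "city": "Vancouver", "year": 2005, "amount": 200},
]

def cube_by(rows, dims, measure):
    """Return {group_by_dims: {group_values: SUM(measure)}} for every subset of dims."""
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(totals)
    return result

cube = cube_by(rows, ("item", "city", "year"), "amount")
```

The empty group-by `()` is the apex cuboid holding the grand total; three dimensions produce 2^3 = 8 group-bys.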
Iceberg Cube
 Computing only the cuboid cells
whose count or other aggregates
satisfying the condition like
HAVING COUNT(*) >= minsup
 Motivation
 Only a small portion of cube cells may be “above

the water’’ in a sparse cube

 Only calculate “interesting” cells—data above

certain threshold
 Avoid explosive growth of the cube


Suppose 100 dimensions, only 1 base cell. How many
aggregate cells if count >= 1? What about count >=
2?
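
The closing question can be explored on a small scale. In this sketch (function names and cell values are illustrative), an aggregate cell is a base cell with at least one dimension generalized to `*`. A single base cell in n dimensions therefore produces 2^n - 1 aggregate cells, all with count 1, so the threshold count >= 2 leaves none of them.

```python
from collections import Counter
from itertools import product

def aggregate_cell_counts(base_cells):
    """Count how many base cells roll up into each aggregate cell ('*' = all)."""
    counts = Counter()
    for cell in base_cells:
        # Each dimension either keeps its value or is generalized to '*'.
        for mask in product([False, True], repeat=len(cell)):
            if any(mask):  # at least one '*', i.e. an aggregate (non-base) cell
                counts[tuple("*" if m else v for v, m in zip(cell, mask))] += 1
    return counts

def iceberg(base_cells, minsup):
    """Keep only aggregate cells whose count reaches the threshold."""
    return {c: n for c, n in aggregate_cell_counts(base_cells).items() if n >= minsup}

one_cell = [("a1", "b1", "c1")]
cells_count_ge_1 = len(iceberg(one_cell, 1))   # 2**3 - 1
cells_count_ge_2 = len(iceberg(one_cell, 2))   # nothing reaches count 2

# By the same argument, 100 dimensions with one base cell give this many cells:
aggregate_cells_100_dims = 2 ** 100 - 1
```

Enumerating the 100-dimensional case is infeasible, which is the point of the exercise: the iceberg condition avoids computing cells that cannot pass the threshold.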
Indexing OLAP Data: Bitmap Index
 Index on a particular column
 Each value in the column has a bit vector: bit-op is fast
 The length of the bit vector: # of records in the base table
 The i-th bit is set if the i-th row of the base table has the
value for the indexed column
 not suitable for high cardinality domains

Base table               Index on Region                 Index on Type
Cust  Region   Type      RecID  Asia  Europe  America    RecID  Retail  Dealer
C1    Asia     Retail    1      1     0       0          1      1       0
C2    Europe   Dealer    2      0     1       0          2      0       1
C3    Asia     Dealer    3      1     0       0          3      0       1
C4    America  Retail    4      0     0       1          4      1       0
C5    Europe   Dealer    5      0     1       0          5      0       1
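
The bitmap index over the base table above can be sketched with Python integers serving as bit vectors. The helper names are illustrative; bit i (counting from 0) stands for RecID i + 1.

```python
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose i-th bit is set if row i has it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rec_ids(bitmap):
    """Decode a bit vector into 1-based RecIDs."""
    return [i + 1 for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# "Region = Asia AND Type = Dealer" becomes a single bitwise AND.
asia_dealers = matching_rec_ids(region_index["Asia"] & type_index["Dealer"])
```

Selections on several columns reduce to bitwise AND/OR over the vectors, which is why bit operations make this index fast for low-cardinality columns.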
Indexing OLAP Data: Join Indices
 Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
 Traditional indices map the values to a list
of record ids
 It materializes relational join in JI file

and speeds up relational join

 In data warehouses, join index relates the
values of the dimensions of a star schema
to rows in the fact table.
 E.g. fact table: Sales and two

dimensions city and product


A join index on city maintains for
each distinct city a list of R-IDs of the
tuples recording the Sales in the city
 Join indices can span multiple

dimensions
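
A join index for the Sales example can be sketched as a dictionary from dimension values to fact-table row identifiers. The fact rows, the R-IDs, and the helper name below are made up for illustration.

```python
from collections import defaultdict

sales = [  # hypothetical fact rows: (R-ID, city, product, dollars)
    ("T57", "Vancouver", "TV", 400),
    ("T238", "Vancouver", "TV", 350),
    ("T884", "Toronto", "PC", 900),
    ("T459", "Vancouver", "PC", 1200),
]
COLUMNS = {"city": 1, "product": 2}

def build_join_index(fact_rows, dimensions):
    """Map each combination of dimension values to the R-IDs of matching fact rows."""
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[COLUMNS[d]] for d in dimensions)
        index[key].append(row[0])
    return dict(index)

city_index = build_join_index(sales, ["city"])
# A join index spanning two dimensions
city_product_index = build_join_index(sales, ["city", "product"])
```

Looking up `city_index[("Vancouver",)]` returns the R-IDs directly, so the join between the dimension and the fact table does not have to be recomputed at query time.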
Efficient Processing OLAP Queries
 Determine which operations should be performed on the available
cuboids
 Transform drill, roll, etc. into corresponding SQL and/or OLAP
operations, e.g., dice = selection + projection
 Determine which materialized cuboid(s) should be selected for OLAP
op.
 Let the query to be processed be on {brand, province_or_state}
with the condition “year = 2004”, and there are 4 materialized
cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
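
One way to reason about the four cuboids is to check, for each one, whether it holds every queried attribute at the same or a finer level and whether the year condition can still be applied. The sketch below encodes that test under stated assumptions: the hierarchies item_name < brand and city < province_or_state < country, and the reading that cuboid 4 was materialized with "year = 2004" already applied. It checks feasibility only and says nothing about which candidate is cheapest.

```python
# Assumed concept hierarchies: lower rank = finer level.
HIERARCHIES = {
    "item": {"item_name": 0, "brand": 1},
    "location": {"city": 0, "province_or_state": 1, "country": 2},
    "time": {"year": 0},
}
DIMENSION_OF = {attr: dim for dim, levels in HIERARCHIES.items() for attr in levels}

def covers(cuboid_attrs, needed_attr):
    """True if the cuboid holds `needed_attr` or a finer level of the same dimension."""
    dim = DIMENSION_OF[needed_attr]
    return any(
        DIMENSION_OF[a] == dim and HIERARCHIES[dim][a] <= HIERARCHIES[dim][needed_attr]
        for a in cuboid_attrs
    )

def usable(cuboid, query_attrs, condition):
    """A cuboid is a candidate if the condition and every query attribute are covered."""
    attrs, preselected = cuboid
    condition_ok = preselected == condition or covers(attrs, condition[0])
    return condition_ok and all(covers(attrs, a) for a in query_attrs)

cuboids = {
    1: ({"year", "item_name", "city"}, None),
    2: ({"year", "brand", "country"}, None),
    3: ({"year", "brand", "province_or_state"}, None),
    4: ({"item_name", "province_or_state"}, ("year", 2004)),  # selection pre-applied
}
query_attrs = {"brand", "province_or_state"}
condition = ("year", 2004)

candidates = [k for k, c in cuboids.items() if usable(c, query_attrs, condition)]
```

Under these assumptions cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates would depend on their sizes and available indices.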
What is Concept Description?
 Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-

relevant data sets in concise, summarative,

informative, discriminative forms
 Predictive mining: Based on data and analysis,

constructs models for the database, and predicts

the trend and properties of unknown data
 Concept description:

 Characterization: provides a concise and

succinct summarization of the given collection of

data
 Comparison: provides descriptions comparing
two or more collections of data
Data Generalization and Summarization-
based Characterization
 Data generalization
 A process which abstracts a large set of task-relevant
data in a database from low
conceptual levels to higher ones.
[Figure: conceptual levels numbered 1 to 5]
 Approaches:

Data cube approach(OLAP approach)
 Attribute-oriented induction approach


Attribute-Oriented Induction

 Proposed in 1989 (KDD ‘89 workshop)

 Not confined to categorical data nor particular
measures
 How is it done?
 Collect the task-relevant data (initial relation)
using a relational database query
 Perform generalization by attribute removal or
attribute generalization
 Apply aggregation by merging identical,
generalized tuples and accumulating their
respective counts
 Interactive presentation with users
Basic Principles of Attribute-Oriented
Induction
 Data focusing: task-relevant data, including
dimensions, and the result is the initial relation
 Attribute-removal: remove attribute A if there is a
large set of distinct values for A but (1) there is no
generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
 Attribute-generalization: If there is a large set of
distinct values for A, and there exists a set of
generalization operators on A, then select an
operator and generalize A
 Attribute-threshold control: typical 2-8,
specified/default
Attribute-Oriented Induction: Basic
Algorithm
 InitialRel: Query processing of task-relevant data,
deriving the initial relation.
 PreGen: Based on the analysis of the number of
distinct values in each attribute, determine
generalization plan for each attribute: removal? or
how high to generalize?
 PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
 Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross
tabs, visualization presentations.
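
The steps above can be sketched in a few lines of Python. This is a simplified, single-pass illustration rather than the published algorithm: each attribute is generalized at most one level, the concept hierarchies are hypothetical lookup tables, and the relation, threshold, and function name are made up.

```python
from collections import Counter

ATTRIBUTES = ("name", "major", "birth_country")
students = [  # hypothetical initial relation
    ("Jim", "CS", "Canada"),
    ("Scott", "CS", "Canada"),
    ("Laura", "Physics", "USA"),
    ("Ana", "Accounting", "Canada"),
]
# Generalization operators (one-level concept hierarchies); `name` has none.
GENERALIZE = {
    "major": {"CS": "Science", "Physics": "Science", "Accounting": "Business"},
    "birth_country": {"Canada": "Canada", "USA": "Foreign"},
}

def attribute_oriented_induction(rows, attributes, generalize, threshold):
    plan = []  # PreGen: decide per attribute whether to keep, generalize, or remove
    for i, attr in enumerate(attributes):
        distinct = {row[i] for row in rows}
        if len(distinct) <= threshold:
            plan.append((i, None))               # few distinct values: keep as is
        elif attr in generalize:
            plan.append((i, generalize[attr]))   # attribute generalization
        # otherwise: attribute removal (many values, no generalization operator)
    counts = Counter()  # PrimeGen: merge identical generalized tuples
    for row in rows:
        counts[tuple(m[row[i]] if m else row[i] for i, m in plan)] += 1
    return [attributes[i] for i, _ in plan], dict(counts)

kept_attributes, prime_relation = attribute_oriented_induction(
    students, ATTRIBUTES, GENERALIZE, threshold=2
)
```

Here `name` is removed, `major` is generalized, and `birth_country` is kept unchanged because it already has no more than `threshold` distinct values.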
Example

 DMQL: Describe general characteristics of

graduate students in the Big-University database

use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major,
birth_place, birth_date, residence, phone#,
gpa
from student
where status in “graduate”
 Corresponding SQL statement:

Select name, gender, major, birth_place,

birth_date, residence, phone#, gpa
from student
Class Characterization: An Example

Initial Relation
Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization applied to each attribute
Name: Removed
Gender: Retained
Major: Sci, Eng, Bus
Birth-Place: Country
Birth_date: Age range
Residence: City
Phone #: Removed
GPA: Excl, VG, ..

Prime Generalized Relation
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Crosstab of Gender by Birth_Region
Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

Presentation of Generalized
Results
 Generalized relation:
 Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
 Cross tabulation:
 Mapping results into cross tabulation form (similar to
contingency tables).
 Visualization techniques:
 Pie charts, bar charts, curves, cubes, and other visual
forms.
 Quantitative characteristic rules:
 Mapping generalized result into characteristic rules with
quantitative information associated with it, e.g.,
grad(x) ∧ male(x) ⇒
birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%].
Mining Class Comparisons

 Comparison: Comparing two or more classes

 Method:
 Partition the set of relevant data into the target class and
the contrasting class(es)
 Generalize both classes to the same high level concepts
 Compare tuples with the same high level descriptions
 Present for every tuple its description and two measures
 support - distribution within single class
 comparison - distribution between classes
 Highlight the tuples with strong discriminant features
 Relevance Analysis:
 Find attributes (features) which best distinguish different
classes
Quantitative Discriminant Rules

 Cj = target class
 qa = a generalized tuple that covers some tuples of the target class
 but can also cover some tuples of contrasting classes
 d-weight
 range: [0, 1]
d\_weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}
 quantitative discriminant rule form
∀ X, target_class(X) ⇐ condition(X) [d : d_weight]

Example: Quantitative Discriminant
Rule

Status Birth_country Age_range Gpa Count

Graduate Canada 25-30 Good 90
Undergraduate Canada 25-30 Good 210

Count distribution between graduate and undergraduate students for a generalized tuple
 Quantitative discriminant rule
∀ X, graduate_student(X) ⇐
birth_country(X) = "Canada" ∧ age_range(X) = "25-30" ∧ gpa(X) = "good" [d : 30%]

 where 90/(90 + 210) = 30%
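
The 30% figure is a direct application of the d-weight definition. A minimal sketch, where the dictionary layout and function name are illustrative:

```python
def d_weight(counts_by_class, target_class):
    """Share of a generalized tuple's total count that falls in the target class."""
    return counts_by_class[target_class] / sum(counts_by_class.values())

# Counts of the generalized tuple (Canada, 25-30, Good) in each class
counts = {"Graduate": 90, "Undergraduate": 210}
graduate_d = d_weight(counts, "Graduate")
undergraduate_d = d_weight(counts, "Undergraduate")
```

A low d-weight for the target class means the same description mostly covers the contrasting class, so on its own it is a weak discriminator for graduate students.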

Class Description

 Quantitative characteristic rule

∀ X, target_class(X) ⇒ condition(X) [t : t_weight]
necessary


 Quantitative discriminant rule

∀ X, target_class(X) ⇐ condition(X) [d : d_weight]
sufficient


 Quantitative description rule

∀ X, target_class(X) ⇔
condition 1(X) [t : w1, d : w ′1] ∨ ... ∨ conditionn(X) [t : wn, d : w ′n]
 necessary and sufficient


Example: Quantitative Description
Rule
Location/item  TV                    Computer              Both_items
               Count  t-wt    d-wt   Count  t-wt    d-wt   Count  t-wt  d-wt
Europe         80     25%     40%    240    75%     30%    320    100%  32%
N_Am           120    17.65%  60%    560    82.35%  70%    680    100%  68%
Both_regions   200    20%     100%   800    80%     100%   1000   100%  100%

Crosstab showing associated t-weight, d-weight values and total number
(in thousands) of TVs and computers sold at AllElectronics in 1998

 Quantitative description rule for target class Europe
∀ X, Europe(X) ⇔
(item(X) = "TV") [t : 25%, d : 40%] ∨ (item(X) = "computer") [t : 75%, d : 30%]
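
The weights in the crosstab follow from the counts alone: the t-weight normalizes a cell within its class (row), and the d-weight normalizes it across classes (column). A short sketch using the counts from the table, with illustrative function names:

```python
counts = {  # units sold, in thousands, from the crosstab
    "Europe": {"TV": 80, "computer": 240},
    "N_Am": {"TV": 120, "computer": 560},
}

def t_weight(counts, cls, item):
    """Fraction of the class's tuples that the condition covers (row-wise)."""
    return counts[cls][item] / sum(counts[cls].values())

def d_weight(counts, cls, item):
    """Fraction of the condition's tuples that belong to the class (column-wise)."""
    return counts[cls][item] / sum(row[item] for row in counts.values())

europe_tv = (t_weight(counts, "Europe", "TV"), d_weight(counts, "Europe", "TV"))
europe_computer = (
    t_weight(counts, "Europe", "computer"),
    d_weight(counts, "Europe", "computer"),
)
```

The two pairs correspond to the [t, d] annotations in the description rule for Europe.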

Concept Description vs. Cube-Based
OLAP
 Similarity:
 Data generalization
 Presentation of data summarization at multiple
levels of abstraction
 Interactive drilling, pivoting, slicing and dicing
 Differences:
 OLAP has systematic preprocessing, query
independent, and can drill down to a rather low
level
 AOI automates allocation of the desired level, and
may perform dimension relevance
analysis/ranking when there are many relevant
dimensions
From On-Line Analytical Processing
(OLAP)
to On-Line Analytical Mining (OLAM)
 Why online analytical mining?
 High quality of data in data warehouses
 DW contains integrated, consistent, cleaned data
 Available information processing structure surrounding data warehouses
 ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
 OLAP-based exploratory data analysis
 Mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions


Integration and swapping of multiple mining functions, algorithms, and tasks
An OLAM System Architecture
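[Figure: OLAM system architecture, four layers]
Layer 4, User Interface: user GUI API (mining query in, mining result out)
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, on top of the data cube API
Layer 2, MDDB: multidimensional database with metadata, built through filtering & integration over the database API
Layer 1, Data Repository: databases and data warehouse, fed by data cleaning and data integration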
Chapter 3: Data Warehousing, Data
Generalization, and On-line Analytical
Processing
 Data warehouse: Basic concept

 Data warehouse modeling: Data cube and OLAP

 Data warehouse architecture

 Data warehouse implementation

 Data generalization and concept description

 From data warehousing to data mining

 Summary
Summary: Data Generalization, Data Warehousing, and On-line Analytical
Processing
 Data generalization: Attribute-oriented induction
 Data warehousing: A multi-dimensional model of a data
warehouse
 Star schema, snowflake schema, fact constellations
 A data cube consists of dimensions & measures
 OLAP operations: drilling, rolling, slicing, dicing and pivoting
 Data warehouse architecture
 OLAP servers: ROLAP, MOLAP, HOLAP
 Efficient computation of data cubes
 Partial vs. full vs. no materialization
 Indexing OLAP data: Bitmap index and join index
 OLAP query processing
 From OLAP to OLAM (on-line analytical mining)
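To make the OLAP operations named above concrete, here is a small pure-Python sketch of roll-up, slice, and dice over a two-dimensional cube. The dictionary representation and the helper names (`roll_up`, `slice_`, `dice`) are assumptions chosen for illustration; a real OLAP server would work on materialized cuboids rather than an in-memory dict. The cell values reuse the AllElectronics counts (in thousands) from the crosstab earlier in the chapter.

```python
from collections import defaultdict

# Base cuboid: (location, item) -> units sold, in thousands.
DIMS = ("location", "item")
cube = {
    ("Europe", "TV"): 80,
    ("Europe", "computer"): 240,
    ("N_Am", "TV"): 120,
    ("N_Am", "computer"): 560,
}


def roll_up(cube, keep):
    """Aggregate (sum) away every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, measure in cube.items():
        out[tuple(key[i] for i in idx)] += measure
    return dict(out)


def slice_(cube, dim, value):
    """Fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, allowed):
    """Keep cells whose value on each listed dimension is in the allowed set."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


by_location = roll_up(cube, ("location",))   # totals per location
grand_total = roll_up(cube, ())              # apex cuboid: one cell
tv_only = slice_(cube, "item", "TV")
europe_tv = dice(cube, {"location": {"Europe"}, "item": {"TV"}})
print(by_location, grand_total, tv_only, europe_tv)
```

Drill-down is the inverse of `roll_up`: it returns to a cuboid with more dimensions (or finer levels), which requires the finer-grained data to be available.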
