Datamining Lecture 2
Data warehouse
…
• Subject Oriented:
• A data warehouse is subject oriented because it provides
information organized around a subject rather than around the
organization's ongoing operations.
• These subjects can be product, customers, suppliers, sales, revenue,
etc.
• The data warehouse does not focus on ongoing operations;
rather, it focuses on the modelling and analysis of data for
decision-making.
[Figure: the subject "Customer" assembled from customer data and
customer activity records covering 1985 - 1987, 1986 - 1989 and
1988 - 1990]
…
• Integrated:
• A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat
files, etc.
• This integration enables more effective analysis of the data.
• Data preprocessing is applied to ensure consistency in
naming conventions, encoding structures, attribute
measures, and so on.
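As a minimal sketch of the preprocessing described above, the following Python snippet merges records from two sources into one consistent form. The field names, the gender codes, and the inch-to-centimetre conversion are all hypothetical, chosen only to illustrate differences in naming conventions, encoding structures, and attribute measures.

```python
# Hypothetical records from two heterogeneous sources (illustrative only).
relational_rows = [
    {"cust_id": 1, "gender": "M", "height_cm": 180.0},
    {"cust_id": 2, "gender": "F", "height_cm": 165.0},
]
flat_file_rows = [
    {"CustomerID": "3", "sex": "1", "height_in": "70"},
    {"CustomerID": "4", "sex": "0", "height_in": "64"},
]

# Assumed encoding used by the flat file: "1" = male, "0" = female.
GENDER_CODES = {"1": "M", "0": "F"}


def from_flat_file(row):
    """Map a flat-file record onto the warehouse naming, encoding and units."""
    return {
        "cust_id": int(row["CustomerID"]),                       # naming convention
        "gender": GENDER_CODES[row["sex"]],                      # encoding structure
        "height_cm": round(float(row["height_in"]) * 2.54, 1),   # attribute measure
    }


# After conversion, both sources share one schema and can be analysed together.
warehouse_rows = relational_rows + [from_flat_file(r) for r in flat_file_rows]
```

Once every source is mapped to the same schema, an analysis only has to deal with one naming scheme and one unit per attribute.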
DM BY Basha K, 2022 Data Mining 6
…
• Time Variant:
• The data collected in a data warehouse is identified with
a particular time period.
• The data in a data warehouse provides information from
a historical point of view, e.g. the past 5-10 years.
• A data warehouse stores historical data.
…
• Non-volatile:
• Non-volatile means that previous data is not removed when new data is
added.
• The data warehouse is kept separate from the operational database;
therefore, frequent changes in the operational database are not reflected
in the data warehouse.
• Data, once recorded, cannot be updated.
• A data warehouse requires only two operations:
• Initial loading of data
• Access of data
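The two operations above can be sketched as a small append-only store. The class name `AppendOnlyWarehouse` and its methods are hypothetical, introduced only for illustration; the point is that the interface offers loading and read access but no update or delete.

```python
class AppendOnlyWarehouse:
    """Toy non-volatile store: rows can be loaded and read, never changed."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Loading of data: new rows are appended, old rows are kept."""
        self._rows.extend(dict(r) for r in rows)

    def query(self, predicate=lambda row: True):
        """Access of data: return copies so callers cannot modify stored rows."""
        return [dict(r) for r in self._rows if predicate(r)]


warehouse = AppendOnlyWarehouse()
warehouse.load([{"year": 2020, "sales": 100}])
warehouse.load([{"year": 2021, "sales": 120}])  # the earlier load is not removed

history = warehouse.query()
recent = warehouse.query(lambda row: row["year"] >= 2021)
```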
Steps of KDD:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods
are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness
measures)
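The six steps listed above can be sketched as a chain of small functions. Everything here is an illustrative assumption: the toy transaction data, the function names, the choice of item-pair counting as the mining method, and support as the interestingness measure.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical sources of (transaction id, item) records.
store_a = [(1, "bread"), (1, "milk"), (2, "bread"), (2, None)]
store_b = [(3, "bread"), (3, "milk"), (4, "milk"), (4, "eggs")]


def clean(records):
    """1. Data cleaning: drop records whose item is missing."""
    return [(t, i) for t, i in records if i is not None]


def integrate(*sources):
    """2. Data integration: combine several sources into one list."""
    return [rec for src in sources for rec in src]


def select(records, items):
    """3. Data selection: keep only items relevant to the analysis task."""
    return [(t, i) for t, i in records if i in items]


def transform(records):
    """4. Data transformation: consolidate into one basket per transaction."""
    baskets = {}
    for t, i in records:
        baskets.setdefault(t, set()).add(i)
    return list(baskets.values())


def mine(baskets):
    """5. Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(counts, n_baskets, min_support):
    """6. Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}


records = integrate(clean(store_a), clean(store_b))
records = select(records, {"bread", "milk", "eggs"})
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
```

With this toy data, only the pair bread and milk appears in at least half of the four baskets, so it is the only pattern that survives evaluation.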
…
Comparing the star model with the snowflake model:
The star model is generally preferred, but the snowflake model is the
normalized form, which reduces redundancy.
Advantages of the snowflake model:
- Easy to maintain.
- Saves storage space.
Disadvantages of the snowflake model:
- Reduces the effectiveness of browsing.
- More joins are needed to execute a query.
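The extra joins can be seen in a small sketch using Python's built-in `sqlite3` module. The table names `star_location`, `snow_location`, `city` and `sales`, and the sample rows, are assumptions made for illustration. The same question (total dollars sold per country) needs one join against the denormalized star dimension and two joins against the normalized snowflake dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Star: a single denormalized location dimension (city data is repeated).
CREATE TABLE star_location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);

-- Snowflake: city attributes are normalized into their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE snow_location (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city(city_key));

-- Fact table shared by both variants.
CREATE TABLE sales (location_key INTEGER, dollars_sold REAL);

INSERT INTO star_location VALUES
    (1, '1 Main St', 'Toronto', 'Canada'),
    (2, '2 Oak Ave', 'Toronto', 'Canada');
INSERT INTO city VALUES (10, 'Toronto', 'Canada');
INSERT INTO snow_location VALUES (1, '1 Main St', 10), (2, '2 Oak Ave', 10);
INSERT INTO sales VALUES (1, 100.0), (2, 250.0);
""")

# Star schema: one join from the fact table to the dimension.
star_total = con.execute("""
    SELECT l.country, SUM(s.dollars_sold)
    FROM sales s
    JOIN star_location l ON s.location_key = l.location_key
    GROUP BY l.country
""").fetchall()

# Snowflake schema: a second join is needed to reach the country.
snow_total = con.execute("""
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales s
    JOIN snow_location l ON s.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.country
""").fetchall()
```

Both queries return the same total; the snowflake variant stores 'Toronto' and 'Canada' once instead of once per street, at the cost of the additional join.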
[Figure: snowflake schema fragment]
Fact table: branch_key, location_key; measures: units_sold,
dollars_sold, avg_sales
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
Cont..
• Concept hierarchy for time: day < month < quarter < year.
• Navigates from the level of month down to the level of day,
• i.e. in descending order of the hierarchy.
• Additional dimension: adding a new dimension to the
given cube.
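Moving between the levels of the hierarchy day < month < quarter < year can be sketched as re-aggregating the same measurements at a chosen level. The daily figures and the helper names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales, keyed by ISO date (the most detailed level).
daily_sales = {
    "2022-01-15": 10,
    "2022-01-20": 5,
    "2022-02-03": 7,
    "2022-04-09": 4,
}


def month_of(day):
    return day[:7]                      # "2022-01-15" -> "2022-01"


def quarter_of(day):
    year, month = day[:4], int(day[5:7])
    return f"{year}-Q{(month - 1) // 3 + 1}"


def year_of(day):
    return day[:4]


def aggregate(sales, level):
    """Sum the daily figures at the given level of the time hierarchy."""
    totals = defaultdict(int)
    for day, amount in sales.items():
        totals[level(day)] += amount
    return dict(totals)


by_month = aggregate(daily_sales, month_of)      # one level above day
by_quarter = aggregate(daily_sales, quarter_of)  # two levels above day
by_year = aggregate(daily_sales, year_of)        # top of the hierarchy
```

Going from `by_year` towards `daily_sales` moves down the hierarchy to more detailed data; going the other way summarizes it.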
Cont..
• Slice and dice:
• The slice operation performs a selection on one dimension of a
given cube.
• Example: time = q1.
• The dice operation performs a selection on two or more dimensions.
• Example: location = l1 or l2, time = t1 or t2.
• Pivot (rotate):
• Visualization operation that rotates the data axes in view in
order to provide an alternative presentation of data.
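As an illustrative sketch, a cube can be held as a dictionary keyed by (time, location, item) and the three operations written as small functions. The cell values, the dimension names, and the function names are assumptions; note that this simplified slice keeps the selected dimension in each key rather than dropping it.

```python
# Hypothetical cube: (time, location, item) -> units sold.
DIMS = ("time", "location", "item")
cube = {
    ("Q1", "Toronto", "phone"): 10,
    ("Q1", "Vancouver", "phone"): 6,
    ("Q2", "Toronto", "laptop"): 3,
    ("Q3", "Vancouver", "laptop"): 8,
}


def dice(cube, **allowed):
    """Select on two or more dimensions: keep cells whose values are allowed."""
    def keep(cell):
        return all(cell[DIMS.index(dim)] in values
                   for dim, values in allowed.items())
    return {cell: v for cell, v in cube.items() if keep(cell)}


def slice_(cube, dim, value):
    """Select on a single dimension (a dice with one fixed value)."""
    return dice(cube, **{dim: {value}})


def pivot(cube, order):
    """Rotate the axes: same cells, dimensions presented in a new order."""
    idx = [DIMS.index(dim) for dim in order]
    return {tuple(cell[i] for i in idx): v for cell, v in cube.items()}


q1_slice = slice_(cube, "time", "Q1")
diced = dice(cube, time={"Q1", "Q2"}, location={"Toronto"})
rotated = pivot(cube, ("location", "time", "item"))
```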
…
- Information may be documented at various
levels of detail and accuracy.
3. Data warehouse view
- Includes fact tables and dimension tables.
- Represents the information that is stored
inside the data warehouse.
4. Business query view
- The perspective of the data in the warehouse from the
viewpoint of the end user.
Three-Tiered Architecture
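[Figure: operational DBs and other sources are extracted, transformed,
loaded and refreshed through a monitor and integrator into the data
warehouse and data marts, which are described by metadata; an OLAP
server serves the front-end tools for analysis, query, reports and
data mining]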
Recommended approach
• Enterprise warehouse
• collects all of the information about subjects spanning the entire
organization
• Data Mart
• a subset of corporate-wide data that is of value to specific
groups of users. Its scope is confined to specific, selected
groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
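The difference between a plain view and a materialized summary can be sketched with Python's built-in `sqlite3` module. The table and view names are hypothetical, and because SQLite has no materialized views, the materialized summary is emulated here by copying the view's result into a table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'ann', 20.0), (2, 'bob', 5.0), (3, 'ann', 7.5);

-- Virtual warehouse: a summary view computed from the operational data on demand.
CREATE VIEW sales_by_customer AS
    SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;

-- Emulated materialized view: the summary is stored as of this moment.
CREATE TABLE sales_by_customer_mat AS SELECT * FROM sales_by_customer;
""")

# A new operational row arrives after the summary was materialized.
con.execute("INSERT INTO orders VALUES (4, 'bob', 10.0)")

live = dict(con.execute("SELECT customer, total FROM sales_by_customer"))
materialized = dict(con.execute("SELECT customer, total FROM sales_by_customer_mat"))
```

The plain view always reflects the operational database but is recomputed on every query; the stored copy answers quickly but stays at its old values until it is refreshed.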
[Figure: multi-tier data warehouse and distributed data marts]
Types of OLAP
Metadata Repository
• Metadata is the data defining warehouse objects. It includes the
following kinds:
• Description of the structure of the warehouse
• schema, views, dimensions, hierarchies, derived data definitions, data
mart locations and contents
• Operational metadata
• data lineage (history of migrated data and transformation path), currency of
data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
• The algorithms used for summarization (aggregation, reports, etc.)
• Business metadata (policy)
OLAM ARCHITECTURE
• The OLAM and OLAP servers both accept on-line queries via a
graphical user interface API.
• Both work with the data cube via a cube API.
• Metadata guides access to the data cube.
• The data cube is constructed by accessing and integrating multiple
databases via an MDDB API,
• or by filtering a data warehouse via a database API.
Cont..
• OLAM performs data mining tasks such as classification,
prediction, clustering, and concept description.
• An OLAM server is more sophisticated than an OLAP server.
• It works on the high-quality data held in the data warehouse.
An OLAM Architecture
[Figure: four-layer OLAM architecture]
Layer 4, User Interface: mining query in, mining result out (User GUI API)
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine
Layer 2, MDDB: multidimensional database with metadata
(filtering and integration; database API; filtering)
Layer 1, Data Repository: databases and data warehouse
(data cleaning, data integration)