Data Mining: Concepts and Techniques (3rd ed.)
Chapter 1
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
Chapter 1. Introduction
Summary
Evolution of Sciences
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g., empirical, theoretical, and computational ecology, physics, or linguistics).
The Internet and computing Grid that make all these archives universally accessible
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online
Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
1960s:
1970s:
1980s:
1990s:
2000s:
Alternative names
[Figure labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data]
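The stages listed above (data cleaning, data integration, data warehouse, selection, task-relevant data) can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' method: the two toy sources, the field names (`id`, `age`, `city`, `spend`), and the helper names `clean`, `integrate`, and `select` are all hypothetical.

```python
# Hypothetical toy sources; all field names are assumptions for illustration.
customers = [
    {"id": 1, "age": 34, "city": "Urbana"},
    {"id": 2, "age": None, "city": "Burnaby"},   # record with a missing value
    {"id": 3, "age": 51, "city": "Urbana"},
    {"id": 3, "age": 51, "city": "Urbana"},      # exact duplicate
]
purchases = [
    {"id": 1, "spend": 120.0},
    {"id": 3, "spend": 80.0},
    {"id": 3, "spend": 20.0},
]

def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))  # sorts by field name (names are unique)
        if None in r.values() or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def integrate(customers, purchases):
    """Data integration: join both sources on id, summing spend per customer."""
    totals = {}
    for p in purchases:
        totals[p["id"]] = totals.get(p["id"], 0.0) + p["spend"]
    return [dict(c, spend=totals.get(c["id"], 0.0)) for c in customers]

def select(warehouse, city):
    """Selection: keep only the rows and columns relevant to the task."""
    return [{"age": r["age"], "spend": r["spend"]}
            for r in warehouse if r["city"] == city]

warehouse = integrate(clean(customers), purchases)   # stands in for the warehouse
task_relevant = select(warehouse, "Urbana")
print(task_relevant)
```

Each function maps to one stage, so the output of `select` plays the role of the task-relevant data that a mining step would consume.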
Data cleaning
Data mining
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles (as labeled in the figure): End User, Business Analyst, Data Analyst, DBA
Input Data
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
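The flow from input data through pre-processing, mining, and post-processing can be sketched with one representative step per stage. This is a minimal sketch, not a prescribed pipeline: min-max scaling stands in for normalization, a z-score test stands in for outlier analysis, and a plain-text report stands in for pattern interpretation. The function names, the threshold of 2.0, and the input values are assumptions.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Pre-processing: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_outliers(values, threshold=2.0):
    """Mining (outlier analysis): indices whose z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def describe(indices, raw):
    """Post-processing: turn raw indices into readable findings."""
    return [f"record {i}: value {raw[i]} is unusual" for i in indices]

raw = [10, 12, 11, 13, 12, 11, 10, 95]   # hypothetical input data
normalized = min_max_normalize(raw)
outliers = z_score_outliers(normalized)
for line in describe(outliers, raw):
    print(line)
```

Keeping each stage as a separate function mirrors the figure: any one step can be swapped (for example, a different outlier test) without touching the others.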
Data to be mined
Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web,
multi-media, graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance,
etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.
Object-relational databases
Multimedia databases
Text databases
Typical methods
Typical applications:
Outlier analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships
(edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends,
family, classmates, etc.
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, etc.
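To make the PageRank reference concrete, the following is a minimal sketch of the idea using power iteration on a tiny link graph. It is a textbook-style simplification, not a description of any production system. The page names, the damping factor of 0.85, and the fixed iteration count are assumptions, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform ranking
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page Web graph (nodes are pages, edges are hyperlinks).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

Page `C` receives links from three pages and ends up ranked highest, while `D`, which nothing links to, keeps only the baseline share. This shows how link structure alone carries ranking information.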
Evaluation of Knowledge
Coverage
Accuracy
Timeliness
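One common way to make coverage and accuracy concrete is to measure them for a single IF-THEN rule: coverage is the fraction of records the rule's condition applies to, and accuracy is the fraction of those covered records that the rule labels correctly. The sketch below assumes that interpretation; the records, the `label` field, and the function name `rule_coverage_accuracy` are hypothetical.

```python
def rule_coverage_accuracy(records, condition, predicted_label):
    """Return (coverage, accuracy) of the rule 'IF condition THEN predicted_label'."""
    covered = [r for r in records if condition(r)]
    correct = [r for r in covered if r["label"] == predicted_label]
    coverage = len(covered) / len(records) if records else 0.0
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical labeled records.
records = [
    {"age": 25, "label": "buys"},
    {"age": 30, "label": "buys"},
    {"age": 28, "label": "skips"},
    {"age": 45, "label": "skips"},
    {"age": 52, "label": "skips"},
]

# Rule under evaluation: IF age <= 30 THEN label = "buys".
coverage, accuracy = rule_coverage_accuracy(
    records, lambda r: r["age"] <= 30, "buys"
)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Here the rule covers three of the five records and is right on two of them. The two measures can pull in opposite directions, so a discovered pattern is usually judged on both.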
[Figure labels: Data Mining, Applications, Algorithm, Pattern Recognition, Database Technology, Statistics, Visualization, High-Performance Computing]
High-dimensionality of data
Mining Methodology
User Interaction
Interactive mining
KDD Conferences
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, etc.
PR conferences: CVPR, etc.
Journals
KDD Explorations
Statistics
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Visualization
Summary
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. 3rd ed., Morgan Kaufmann, 2011.
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer-Verlag, 2009.
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed., Morgan Kaufmann, 2005.