
CD-ROM INCLUDED

[Cover illustration: Data Warehouse, Data Mart, Data Mining, Business User, Action, Customer Response]
Insight into Data Mining
Theory and Practice

K.P. Soman
Head and Professor
Centre for Excellence in Computational Engineering and Networking
Amrita Vishwa Vidyapeetham, Coimbatore

Shyam Diwakar
Assistant Professor
School of Biotechnology, Amritapuri Campus
Amrita Vishwa Vidyapeetham

V. Ajay
Research Scholar
Purdue University
USA

Delhi-110092
2009
INSIGHT INTO DATA MINING: THEORY AND PRACTICE (with CD-ROM)
K.P. Soman, Shyam Diwakar, and V. Ajay

© 2006 by PHI Learning Private Limited, Delhi. All rights reserved. No part of this book may be
reproduced in any form, by mimeograph or any other means, without permission in writing from
the publisher.

The authors and publisher make no warranty of any kind, expressed or implied, with regard to the usage of
softwares as exhibited in the CD-ROM. The authors and publisher shall not be liable in any event for incidental or
consequential damages in connection with, or arising out of, the furnishing, performance, or use of the softwares.

ISBN-978-81-203-2897-6

The export rights of this book are vested solely with the publisher.

Seventh Printing: July, 2014

Published by Asoke K. Ghosh, PHI Learning Private Limited, Rimjhim House, 111, Patparganj
Industrial Estate, Delhi-110092 and Printed by Rajkamal Electric Press, Plot No. 2, Phase IV, HSIDC,
Kundli-131028, Sonepat, Haryana.

To
Our Beloved Amma
Mata Amritanandamayi Devi

Contents

Preface xi
Acknowledgements xv

1. DATA MINING 1–19


1.1 Introduction 1
1.1.1 Data Mining and Knowledge Discovery 1
1.1.2 Data Mining Vs. Data Analysis 2
1.1.3 Data Mining and Statistics 3
1.1.4 Data Mining and Machine Learning 3
1.2 Data Mining—Success Stories 4
1.3 Main Reason for Growth of Data Mining Research 12
1.4 Recent Research Achievements 12
1.4.1 Graphical Models and Hierarchical Probabilistic Representations 14
1.5 New Applications 15
1.6 Trends that Affect Data Mining 16
1.7 Research Challenges 17
1.8 Testbeds and Infrastructure 18
References 18

2. DATA MINING FROM A BUSINESS PERSPECTIVE 20–34


2.1 Introduction 20
2.2 From Data Mining Tools to Solutions 22
2.3 Evolution of Data Mining Systems 23
2.4 Knowledge Discovery Process 25
2.5 Data Mining Supporting Technologies Overview 25
2.5.1 Data Mining: Verification Vs. Discovery 26
2.5.2 Decision Support Systems 26
2.5.3 OLAP 27
2.5.4 Desktop DSS 27
2.5.5 Data Warehouse 28
2.5.6 Data Mining Process 30
2.6 Data Mining Techniques 32
References 34

3. DATA TYPES, INPUT AND OUTPUT OF DATA MINING ALGORITHMS 35–52


3.1 Introduction 35
3.2 Instances and Features 36
3.3 Different Types of Features (Data) 37
3.4 Concept Learning and Concept Description 38
3.5 Output of Data Mining—Knowledge Representation 41
3.5.1 Knowledge Output from Classification Learners 41
3.5.2 Output of Cluster Learning Algorithms 45
3.5.3 Output of Association Rules 48
3.5.4 Output of Trees for Numeric Prediction 48
3.5.5 Instance-based Learning and Knowledge Representation 50
References 52

4. DECISION TREES—CLASSIFICATION AND REGRESSION TREES 53–112


4.1 Introduction 53
4.2 Constructing Classification Trees 55
4.2.1 ID3 Algorithm for Nominal Attributes 55
4.2.2 Information Theory and Information Entropy 56
4.2.3 Building the Tree 60
4.2.4 Highly-branching Attributes 65
4.2.5 ID3 to C4.5 67
4.2.6 Understanding ID3 and C4.5 Algorithm Pictorially 67
4.3 CHAID (Chi-square Automatic Interaction Detection) 68
4.3.1 Mathematical Tools of CHAID 71
4.3.2 Types of CHAID Variables 71
4.3.3 CHAID Algorithm 71
4.3.4 Description of CHAID Algorithm 72
4.3.5 CHAID Applied on Weather Data 73
4.3.6 Merging of Predictor Levels When the Variable is Monotonic 76
4.4 CART (Classification and Regression Trees) 77
4.4.1 Impurity Measures Used in CART 77
4.4.2 Gini Index 77
4.4.3 Using Gini Index—An Example 78
4.4.4 Twoing Index 80
4.4.5 Ordered Twoing 81
4.4.6 Steps in the CART Analysis 81
4.5 Regression Trees 81
4.5.1 An Example of Regression Tree 81
4.5.2 Tree-based Regression 83
4.5.3 Least Squares Regression Trees 85
4.5.4 Efficient Growth of LS Regression Trees 86
4.5.5 Splits on Continuous Variables 88
4.5.6 Splits on Discrete Variables 89
4.5.7 Model Trees 91
4.6 General Problems in Prediction of Classes for Data with Unknown Class Value 93
4.7 Pruning — Introduction 95


4.7.1 Pruning Algorithms 98
4.8 Model Estimation 103
4.8.1 Cross-validation: Hold Out Methods 104
4.8.2 Model Comparison 106
4.8.3 Cost-sensitive Learning 107
Exercises 107
References 111

5. PREPROCESSING AND POSTPROCESSING IN DATA MINING 113–136


5.1 Introduction 113
5.2 Steps in Preprocessing 113
5.3 Discretization 115
5.3.1 Manual Approach 116
5.3.2 Binning 117
5.3.3 Entropy-based Discretization 117
5.3.4 Other Simple Methods of Finding Split Points 119
5.4 Feature Extraction, Selection and Construction 122
5.4.1 Feature Extraction 123
5.4.2 Feature Selection 125
5.4.3 Feature Construction 126
5.5 Missing Data and Methodological Techniques for Dealing with It 126
5.5.1 What are Missing Data? 127
5.5.2 What are the Major Reasons for Missing Data? 127
5.5.3 What are the Missing Data Mechanisms? 127
5.5.4 Missing Data Mechanisms—An Artificial Example 128
5.6 Example of Dealing with Missing Data in Decision Tree Induction 129
5.7 Postprocessing 133
References 134

6. DATASETS 137–159
6.1 Introduction 137
6.2 Contact Lenses 137
6.3 Iris Plants Database 140
6.4 Breast Cancer Database 143
6.5 Wage Data 146
6.6 Credit Database 148
6.7 Housing Database 149
6.8 1985 Auto Imports Database 152
6.9 Badge Problem 156
6.9.1 Presentation of the Problem 156
6.9.2 Partial List of DataSet 157

7. ASSOCIATION RULE MINING 160–173


7.1 Introduction 160
7.2 Automatic Discovery of Association Rules in Transaction Databases 160
7.2.1 Support and Confidence 161
7.3 The Apriori Algorithm 164


7.4 Shortcomings 171
Exercises 171
References 173

8. MACHINE LEARNING WITH OPEN SOURCE AND COMMERCIAL SOFTWARE 174–199

8.1 Machine Learning with Weka 174
8.1.1 Getting Started 175
8.1.2 Loading the Data 177
8.1.3 Selecting or Filtering Attributes 179
8.1.4 Discretization 180
8.1.5 Association Rule Mining 186
8.1.6 Classification 188
8.1.7 Clustering 194
8.2 XLMiner 198
8.2.1 Sample DataSets with XLMiner 198
References 199

9. ALGORITHMS FOR CLASSIFICATION AND REGRESSION 200–237


9.1 Introduction 200
9.2 Naïve Bayes 200
9.2.1 Problem of Zero Frequency in Naïve Bayes 202
9.2.2 Missing Values and Numeric Attributes 203
9.3 Multiple Regression Analysis 205
9.3.1 What is Regression Analysis? 205
9.3.2 Simple and Multiple Regression Analysis 205
9.3.3 Applications in Marketing 206
9.3.4 Methodology 206
9.3.5 Multiple Regression Analysis Using Excel 206
9.3.6 Input Data 207
9.3.7 Regression Output 209
9.4 Logistic Regression 210
9.5 k-Nearest Neighbour Classification 214
9.5.1 k-Nearest Neighbour Prediction 217
9.5.2 Shortcomings of k-NN Algorithms 217
9.6 GMDH (Group Method of Data Handling) 218
9.6.1 Introduction 218
9.6.2 The Background of Group Method of Data Handling 219
9.6.3 Construction of Decision Rules 221
9.6.4 Results of Experiments 225
9.6.5 Discussion and Conclusion 225
9.7 Evolutionary Computing and Genetic Algorithms 226
9.7.1 Evolution Theory 226
9.7.2 Genetic Algorithm 231
9.7.3 Machine Learning Using Genetic Algorithms 233
Exercises 234
References 237

10. SUPPORT VECTOR MACHINES 238–270


10.1 Introduction 238
10.2 Basic Idea Behind Linear Support Vector Machines 242
10.3 SVM with Soft Margin: Linear Kernel 244
10.3.1 Linear Programming Formulation of Linear SVM 247
10.3.2 SVM with Training Error: Nonlinear Kernel 248
10.4 Proximal Support Vector Machines 248
10.4.1 Dealing with Nonlinear Kernels 257
10.5 Generating Datasets 262
10.5.1 Spiral Dataset Generator 262
10.5.2 Checker Board Dataset 264
10.5.3 Normally Distributed Multivariate Dataset Generator 264
10.6 Problems and Solutions 268
Exercises 269
References 269

11. CLUSTER ANALYSIS 271–378


11.1 Introduction 271
11.1.1 Similarity and Its Measurement 273
11.1.2 Basic Types of Clustering 283
11.2 Partitional Clusterings 299
11.3 k-medoids 303
11.4 Modern Clustering Methods 304
11.5 BIRCH 306
11.6 DBSCAN 310
11.6.1 Concepts Required for DBSCAN Algorithm 311
11.6.2 Basic Concepts and Algorithm of DBSCAN 312
11.6.3 Algorithm 312
11.6.4 Advantages of DBSCAN Algorithm 314
11.7 OPTICS (Ordering Points To Identify the Clustering Structure) 315
11.7.1 Introduction 315
11.7.2 Motivation for OPTICS 315
11.7.3 Concepts Used in OPTICS 315
11.7.4 OPTICS Algorithm 316
11.7.5 Reachability Plots 325
11.7.6 Advantages 326
11.7.7 Disadvantages 326
11.8 Clustering Based on Graph Partitioning 326
11.8.1 Weighted Graph Partitioning (GP) 326
11.8.2 Balanced Graph Partitioning—Basic Principle 328
11.8.3 k-way Partitioning 332
11.9 CHAMELEON: A Two-phase Clustering Algorithm 332
11.9.1 Modelling the Data 333
11.9.2 Modelling the Cluster Similarity 333
11.9.3 Two Phases of CHAMELEON 335
11.9.4 Illustration of CHAMELEON with an Example 337
11.10 The COBWEB Conceptual Clustering Algorithm 340


11.10.1 COBWEB Algorithm 340
11.10.2 COBWEB: A Simple Illustration 343
11.11 GCLUTO—Graphical Clustering Toolkit 351
11.11.1 Overview 352
11.11.2 Options Available in GCLUTO 361
11.11.3 Text Mining Using GCLUTO 368
Exercises 371
References 378

12. VISUALIZATION OF MULTIDIMENSIONAL DATA 379–388


12.1 Introduction 379
12.2 Diagrams for Multidimensional Visualization 381
12.2.1 Kiviat Diagrams 381
12.2.2 Parallel Coordinates 383
12.2.3 3D Scattergram 384
12.2.4 3D Line Graph 384
12.2.5 Volume Rendering 385
12.2.6 Floors and Walls 385
12.2.7 Chernoff Faces 387
12.3 Visual Data Mining 387
12.3.1 Animation 388
References 388

Appendix A SVM Formulation: Linear Classifier Assuring Complete Separability 389


Appendix B Matrix Formulation of Graph Partitioning 395

Index 399–403