Overview

Data Mining for Business Intelligence

Shmueli, Patel & Bruce
Galit Shmueli and Peter Bruce 2010

Core Ideas in Data Mining

Classification
Prediction
Association Rules
Data Reduction
Data Exploration
Visualization

Supervised Learning
Goal: Predict a single "target" or "outcome" variable
Training data, where the target value is known
Score data where the value is not known

Methods: Classification and Prediction

Unsupervised Learning
Goal: Segment data into meaningful segments; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization

Supervised: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)

Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
  Each row is a case (customer, tax return, applicant)
  Each column is a variable
Taken together, classification and prediction constitute "predictive analytics"

Unsupervised: Association Rules

Goal: Produce rules that define "what goes with what"
Example: "If X was purchased, Y was also purchased"
Rows are transactions
Used in recommender systems: "Our records show you bought X, you may also like Y"
Also called "affinity analysis"
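A minimal sketch of how a rule of the form "if X was purchased, Y was also purchased" could be checked against transaction rows. The function name `rule_confidence`, the measure it computes (the share of X-transactions that also contain Y, commonly called confidence) and the tiny basket data are illustrative assumptions, not taken from the slides.

```python
def rule_confidence(transactions, x, y):
    """Share of the transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0  # x never purchased, so the rule has no evidence
    return sum(1 for t in with_x if y in t) / len(with_x)


# Each row is one transaction (invented data).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

print(rule_confidence(transactions, "bread", "butter"))
```

A rule with a high share is a candidate for a "you may also like Y" recommendation.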

Unsupervised: Data Reduction

Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)

Unsupervised: Data Visualization

Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables

Data Exploration
Data sets are typically large, complex & messy
Need to review the data to help refine the task
Use techniques of Reduction and Visualization

The Process of Data Mining

Steps in Data Mining

1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the techniques (regression, CART, neural networks, etc.)
7. Iterative implementation and "tuning"
8. Assess results - compare models
9. Deploy best model

Obtaining Data: Sampling

Data mining typically deals with huge databases
Algorithms and models are typically applied to a sample from a database, to produce statistically valid results
XLMiner, e.g., limits the "training" partition to 10,000 records
Once you develop and select a final model, you use it to "score" the observations in the larger database
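As a rough sketch of the sampling step, the snippet below draws a simple random sample without replacement and caps it at 10,000 rows, the XLMiner limit mentioned above. The function name `draw_sample`, the seed, and the stand-in "database" of record ids are assumptions for illustration.

```python
import random


def draw_sample(records, max_rows=10_000, seed=7):
    """Simple random sample without replacement, capped at max_rows."""
    if len(records) <= max_rows:
        return list(records)  # small enough to use in full
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(records, max_rows)


database = range(1_000_000)  # stand-in for record ids in a large database
sample = draw_sample(database)
print(len(sample))
```

The model is then developed on `sample`, and the selected model is used to score every record in `database`.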

Rare event oversampling

Often the event of interest is rare
Examples: response to mailing, fraud in taxes, ...
Sampling may yield too few "interesting" cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, need to adjust results for the oversampling
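The slides do not say which oversampling scheme is meant. The sketch below shows one common variant, assumed here for illustration: rare rows are resampled with replacement until they make up a chosen share of the training set. The name `oversample_rare`, the 50% target and the invented fraud data are hypothetical.

```python
import random


def oversample_rare(records, is_rare, rare_share=0.5, seed=1):
    """Return a training set in which rare cases make up about rare_share (< 1)."""
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    if not rare or not common:
        return list(records)  # nothing to balance
    # Rare rows needed so that rare / (rare + common) is about rare_share.
    target = round(rare_share * len(common) / (1 - rare_share))
    # Duplicate randomly chosen rare rows to reach the target count.
    extra = [rng.choice(rare) for _ in range(max(0, target - len(rare)))]
    sample = common + rare + extra
    rng.shuffle(sample)
    return sample


# Invented data: 5 fraud cases among 100 records.
records = [{"id": i, "fraud": i < 5} for i in range(100)]
balanced = oversample_rare(records, lambda r: r["fraud"])
print(sum(r["fraud"] for r in balanced), "fraud rows out of", len(balanced))
```

Because the balanced set overstates how often the rare event occurs, predicted rates from a model trained on it have to be adjusted back to the original proportions, as the last bullet notes.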

Pre-processing Data

Types of Variables
Determine the types of pre-processing needed, and algorithms used
Main distinction: Categorical vs. numeric
Numeric
  Continuous
  Integer
Categorical
  Ordered (low, medium, high)
  Unordered (male, female)

Variable handling
Numeric
  Most algorithms in XLMiner can handle numeric data
  May occasionally need to "bin" into categories
Categorical
  Naive Bayes can use as-is
  In most other algorithms, must create binary dummies (number of dummies = number of categories - 1)
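A small sketch of the "number of categories - 1" rule: a categorical column with m distinct values becomes m - 1 binary columns, and the omitted category is represented by all zeros. The helper `make_dummies` and the sample column are illustrative assumptions; which category is dropped (here the alphabetically first) is an arbitrary choice.

```python
def make_dummies(values, name):
    """Encode a categorical column as m - 1 binary dummy columns."""
    categories = sorted(set(values))
    # The first category is the reference level: all of its dummies are 0.
    kept = categories[1:]
    return [{f"{name}_{c}": int(v == c) for c in kept} for v in values]


sizes = ["low", "high", "medium", "low"]  # 3 categories -> 2 dummies
for row in make_dummies(sizes, "size"):
    print(row)
```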

Detecting Outliers
An outlier is an observation that is "extreme", being distant from the rest of the data (definition of "distant" is deliberately vague)
Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine if it is an error, or truly extreme.
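Since "distant" is left vague, any concrete rule is a judgment call. One possible working definition, assumed here and not prescribed by the slides, is to flag values more than three sample standard deviations from the mean. The name `flag_outliers`, the threshold and the age data are illustrative.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return []  # constant column: nothing stands out
    return [v for v in values if abs(v - m) / s > threshold]


# Invented ages; 250 is probably a data-entry error.
ages = [23, 25, 31, 28, 35, 29, 27, 30, 26, 24, 32, 33, 250]
print(flag_outliers(ages))
```

The rule only surfaces candidates; deciding whether a flagged value is an error or truly extreme still takes domain knowledge.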

Detecting Outliers (cont.)
In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called "anomaly detection".

Handling Missing Data

Most algorithms will not process records with missing values. Default is to drop those records.
Solution 1: Omission
  If a small number of records have missing values, can omit them
  If many records are missing values on a small set of variables, can drop those variables (or use proxies)
  If many records have missing values, omission is not practical
Solution 2: Imputation
  Replace missing values with reasonable substitutes
  Lets you keep the record and use the rest of its (non-missing) information
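The two solutions can be contrasted in a few lines. The slides do not say what counts as a "reasonable substitute"; the column median used below is one common choice, assumed here for illustration, and the helper names and records are hypothetical.

```python
from statistics import median


def omit_incomplete(rows):
    """Solution 1: drop every record that has a missing (None) value."""
    return [r for r in rows if None not in r.values()]


def impute_median(rows, column):
    """Solution 2: replace None in `column` with the median of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


rows = [
    {"income": 40, "age": 30},
    {"income": None, "age": 41},
    {"income": 60, "age": None},
    {"income": 50, "age": 35},
]

print(len(omit_incomplete(rows)))          # records left after omission
print(impute_median(rows, "income")[1])    # second record, kept and filled in
```

Omission loses half of this small data set, while imputation keeps every record along with its non-missing values.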

Normalizing (Standardizing) Data

Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by standard deviation (used in XLMiner)
Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
  Useful when the data contain dummies and numeric variables
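Both functions described above are short enough to write out. This is a sketch with hypothetical names `standardize` and `min_max`; it uses the sample standard deviation, which is an assumption because the slides do not say which variant XLMiner applies, and it assumes the column is not constant.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Rescale to the 0-1 range: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [10, 20, 30, 40, 50]  # invented column
print(standardize(incomes))
print(min_max(incomes))
```

The 0-1 version puts numeric columns on the same footing as 0/1 dummies, which is why it suits mixed data.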

The Problem of Overfitting

Statistical models can produce highly complex explanations of relationships between variables
The "fit" may be excellent
When used with new data, models of great complexity do not do so well.

100% fit - not useful for new data

[Figure: Revenue (vertical axis, 0 to 1600) plotted against Expenditure (horizontal axis, 0 to 1000)]

Overfitting (cont.)
Causes:
  Too many predictors
  A model with too many parameters
  Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.

Partitioning the Data

Problem: How well will our model perform with new data?
Solution: Separate data into two parts
  Training partition to develop the model
  Validation partition to implement the model and evaluate its performance on "new" data
Addresses the issue of overfitting

Test Partition
When a model is developed on training data, it can overfit the training data (hence need to assess on validation)
Assessing multiple models on same validation data can overfit validation data
Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data
Solution: final selected model is applied to a test partition to give unbiased estimate of its performance on new data
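A sketch of random partitioning into training, validation and test sets. The 50/30/20 proportions, the seed and the function name `partition` are assumptions chosen for illustration; the slides do not prescribe them.

```python
import random


def partition(records, train=0.5, valid=0.3, seed=12345):
    """Randomly split records into training, validation and test partitions."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],                   # develop the model
        shuffled[n_train:n_train + n_valid],  # compare and tune models
        shuffled[n_train + n_valid:],         # final, untouched estimate
    )


rows = list(range(1000))  # stand-in for record ids
train_rows, valid_rows, test_rows = partition(rows)
print(len(train_rows), len(valid_rows), len(test_rows))
```

The test partition is used once, after the final model has been chosen, so its error estimate is not influenced by model selection.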

Example: Linear Regression, Boston Housing Data

CRIM   ZN    INDUS  CHAS  NOX   RM    AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV  CAT. MEDV
0.006  18    2.31   0     0.54  6.58  65.2  4.09  1    296  15.3     397  5      24    0
0.027  0     7.07   0     0.47  6.42  78.9  4.97  2    242  17.8     397  9      21.6  0
0.027  0     7.07   0     0.47  7.19  61.1  4.97  2    242  17.8     393  4      34.7  1
0.032  0     2.18   0     0.46  7.00  45.8  6.06  3    222  18.7     395  3      33.4  1
0.069  0     2.18   0     0.46  7.15  54.2  6.06  3    222  18.7     397  5      36.2  1
0.030  0     2.18   0     0.46  6.43  58.7  6.06  3    222  18.7     394  5      28.7  0
0.088  12.5  7.87   0     0.52  6.01  66.6  5.56  5    311  15.2     396  12     22.9  0
0.145  12.5  7.87   0     0.52  6.17  96.1  5.95  5    311  15.2     397  19     27.1  0
0.211  12.5  7.87   0     0.52  5.63  100   6.08  5    311  15.2     387  30     16.5  0
0.170  12.5  7.87   0     0.52  6.00  85.9  6.59  5    311  15.2     387  17     18.9  0

CRIM     per capita crime rate by town
ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS    proportion of non-retail business acres per town
CHAS     Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX      nitric oxides concentration (parts per 10 million)
RM       average number of rooms per dwelling
AGE      proportion of owner-occupied units built prior to 1940
DIS      weighted distances to five Boston employment centres
RAD      index of accessibility to radial highways
TAX      full-value property-tax rate per $10,000
PTRATIO  pupil-teacher ratio by town
B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT    % lower status of the population
MEDV     Median value of owner-occupied homes in $1000

Partitioning the data

Using XLMiner for Multiple Linear Regression

Specifying Output

Prediction of Training Data

Row Id.  Predicted Value  Actual Value  Residual
1        30.24690555      24            -6.246905549
4        28.61652272      33.4          4.783477282
5        27.76434086      36.2          8.435659135
6        25.6204032       28.7          3.079596801
9        11.54583087      16.5          4.954169128
10       19.13566187      18.9          -0.235661871
12       21.95655773      18.9          -3.05655773
17       20.80054199      23.1          2.299458015
18       16.94685562      17.5          0.553144385

Prediction of Validation Data

Row Id.  Predicted Value  Actual Value  Residual
2        25.03555247      21.6          -3.435552468
3        30.1845219       34.7          4.515478101
7        23.39322259      22.9          -0.493222593
8        19.58824389      27.1          7.511756109
11       18.83048747      15            -3.830487466
13       21.20113865      21.7          0.498861352
14       19.81376359      20.4          0.586236414
15       19.42217211      18.2          -1.222172107
16       19.63108414      19.9          0.268915856

Summary of errors
Training Data scoring - Summary Report
  Total sum of squared errors: 6977.106
  RMS Error: 4.790720883
  Average Error: 3.11245E-07

Validation Data scoring - Summary Report
  Total sum of squared errors: 4251.582211
  RMS Error: 4.587748542
  Average Error: -0.011138034

RMS error
Error = actual - predicted
RMS = Root-mean-squared error = Square root of average squared error
In previous example, sizes of training and validation sets differ, so only RMS Error and Average Error are comparable
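The three summary measures follow directly from the definitions above. In this sketch the function name `error_summary` is hypothetical, and the inputs are only the first three validation rows shown earlier, so the printed figures will not match the full-partition summary report.

```python
from math import sqrt


def error_summary(actual, predicted):
    """Total squared error, RMS error and average error (actual - predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    n = len(errors)
    return {
        "sse": sse,             # grows with the number of records
        "rms": sqrt(sse / n),   # square root of the average squared error
        "average": sum(errors) / n,
    }


# First three validation rows from the table above (Row Id. 2, 3, 7).
actual = [21.6, 34.7, 22.9]
predicted = [25.03555247, 30.1845219, 23.39322259]
print(error_summary(actual, predicted))
```

Because the total sum of squared errors accumulates over records, it is larger for the larger partition regardless of model quality; dividing by the number of records is what makes RMS and average error comparable across partitions.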

Using Excel and XLMiner for Data Mining

Excel is limited in data capacity
However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner
Models can then be used to score larger databases
XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)

Summary
Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
Before algorithms can be applied, data must be characterized and pre-processed
To evaluate performance and to avoid overfitting, data partitioning is used
Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database
