Overview
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
Galit Shmueli and Peter Bruce 2010
Core Ideas in Data Mining
Classification
Prediction Association Rules
Data Reduction
Data Exploration Visualization
Supervised Learning
Goal: Predict a single target or outcome
variable
Training data, where target value is known Score to data where value is not known
Methods: Classification and Prediction
Unsupervised Learning
Goal: Segment data into meaningful segments;
detect patterns
There is no target (outcome) variable to predict or
classify
Methods: Association rules, data reduction &
exploration, visualization
Supervised: Classification
Goal: Predict categorical target (outcome)
variable Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy Each row is a case (customer, tax return, applicant) Each column is a variable Target variable is often binary (yes/no)
Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification: Each row is a case (customer, tax return,
applicant) Each column is a variable Taken together, classification and prediction constitute predictive analytics
Unsupervised: Association Rules
Goal: Produce rules that define what goes with
what Example: If X was purchased, Y was also purchased Rows are transactions Used in recommender systems Our records show you bought X, you may also like Y Also called affinity analysis
Unsupervised: Data Reduction
Distillation of complex/large data into
simpler/smaller data Reducing the number of variables/columns (e.g., principal components) Reducing the number of records/rows (e.g., clustering)
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots Especially useful to examine relationships
between pairs of variables
Data Exploration
Data sets are typically large, complex & messy
Need to review the data to help refine the task Use techniques of Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1.
2.
3. 4.
5.
6. 7.
8.
9.
Define/understand purpose Obtain data (may involve random sampling) Explore, clean, pre-process data Reduce the data; if supervised DM, partition it Specify task (classification, clustering, etc.) Choose the techniques (regression, CART, neural networks, etc.) Iterative implementation and tuning Assess results compare models Deploy best model
Obtaining Data: Sampling
Data mining typically deals with huge databases
Algorithms and models are typically applied to a
sample from a database, to produce statisticallyvalid results XLMiner, e.g., limits the training partition to 10,000 records Once you develop and select a final model, you use it to score the observations in the larger database
Rare event oversampling
Often the event of interest is rare
Examples: response to mailing, fraud in taxes, Sampling may yield too few interesting cases to
effectively train a model A popular solution: oversample the rare cases to obtain a more balanced training set Later, need to adjust results for the oversampling
Pre-processing Data
Types of Variables
Determine the types of pre-processing
needed, and algorithms used Main distinction: Categorical vs. numeric
Numeric
Continuous Integer
Categorical Ordered (low, medium, high) Unordered (male, female)
Variable handling
Numeric
Most algorithms in XLMiner can handle numeric
data May occasionally need to bin into categories
Categorical
Nave Bayes can use as-is In most other algorithms, must create binary
dummies (number of dummies = number of categories 1)
Detecting Outliers
An outlier is an observation that is extreme,
being distant from the rest of the data (definition of distant is deliberately vague) Outliers can have disproportionate influence on models (a problem if it is spurious) An important step in data pre-processing is detecting outliers Once detected, domain knowledge is required to determine if it is an error, or truly extreme.
Detecting Outliers
In some contexts, finding outliers is the purpose
of the DM exercise (airport security screening). This is called anomaly detection.
Handling Missing Data
Most algorithms will not process records with
missing values. Default is to drop those records. Solution 1: Omission
If a small number of records have missing values, can
omit them If many records are missing values on a small set of variables, can drop those variables (or use proxies) If many records have missing values, omission is not practical
Solution 2: Imputation Replace missing values with reasonable substitutes Lets you keep the record and use the rest of its (nonmissing) information
Normalizing (Standardizing) Data
Used in some techniques when variables with the
largest scales would dominate and skew results Puts all variables on same scale Normalizing function: Subtract mean and divide by standard deviation (used in XLMiner) Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
Useful when the data contain dummies and numeric
The Problem of Overfitting
Statistical models can produce highly complex
explanations of relationships between variables The fit may be excellent When used with new data, models of great complexity do not do so well.
100% fit not useful for new data
1600 1400 1200
1000 Revenue
800
600
400
200
0 0 100 200 300 400 500 600 700 800 900 1000
Expenditure
Overfitting (cont.)
Causes:
Too many predictors A model with too many parameters Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.
Partitioning the Data
Problem: How well will our model perform with new data? Solution: Separate data into two parts
Training partition to develop the
model Validation partition to implement the model and evaluate its performance on new data
Addresses the issue of overfitting
Test Partition
When a model is developed on
training data, it can overfit the training data (hence need to assess on validation) Assessing multiple models on same validation data can overfit validation data Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data Solution: final selected model is applied to a test partition to give unbiased estimate of its performance on new
Example Linear Regression Boston Housing Data
A CRIM 0.006 0.027 0.027 0.032 0.069 0.030 0.088 0.145 0.211 0.170 B C D E NOX 0.54 0.47 0.47 0.46 0.46 0.46 0.52 0.52 0.52 0.52 F RM 6.58 6.42 7.19 7.00 7.15 6.43 6.01 6.17 5.63 6.00 G AGE 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 H I J K L M N O ZN INDUS CHAS 18 0 0 0 0 0 12.5 12.5 12.5 12.5 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 0 0 0 0 0 0 0 0 0 0 DIS RAD 4.09 4.97 4.97 6.06 6.06 6.06 5.56 5.95 6.08 6.59 1 2 2 3 3 3 5 5 5 5 TAX PTRATIO 296 242 242 222 222 222 311 311 311 311 CAT. B LSTAT MEDV MEDV 5 9 4 3 5 5 12 19 30 17 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 0 0 1 1 1 0 0 0 0 0 15.3 397 17.8 397 17.8 393 18.7 395 18.7 397 18.7 394 15.2 396 15.2 397 15.2 387 15.2 387
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX B MEDV
per capita crime rate by town proportion of residential land zoned for lots over 25,000 sq.ft. proportion of non-retail business acres per town. Charles River dummy variable (1 if tract bounds river; 0 otherwise) nitric oxides concentration (parts per 10 million) average number of rooms per dwelling proportion of owner-occupied units built prior to 1940 weighted distances to five Boston employment centres index of accessibility to radial highways full-value property-tax rate per $10,000 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town Median value of owner-occupied homes in $1000
PTRATIO pupil-teacher ratio by town LSTAT % lower status of the population
Partitioning the data
Using XLMiner for Multiple Linear Regression
Specifying Output
Prediction of Training Data
Row Id. 1 4 5 6 9 10 12 17 18 Predicted Value 30.24690555 28.61652272 27.76434086 25.6204032 11.54583087 19.13566187 21.95655773 20.80054199 16.94685562 Actual Value Residual
24 -6.246905549 33.4 4.783477282 36.2 8.435659135 28.7 3.079596801 16.5 4.954169128 18.9 -0.235661871 18.9 -3.05655773 23.1 2.299458015 17.5 0.553144385
Prediction of Validation Data
Row Id. 2 3 7 8 11 13 14 15 16 Predicted Value 25.03555247 30.1845219 23.39322259 19.58824389 18.83048747 21.20113865 19.81376359 19.42217211 19.63108414 Actual Value 21.6 34.7 22.9 27.1 15 21.7 20.4 18.2 19.9 Residual -3.435552468 4.515478101 -0.493222593 7.511756109 -3.830487466 0.498861352 0.586236414 -1.222172107 0.268915856
Summary of errors
Training Data scoring - Summary Report
Total sum of squared errors 6977.106
RMS Error Average Error 4.790720883 3.11245E-07
Validation Data scoring - Summary Report
Total sum of squared errors 4251.582211
RMS Error Average Error 4.587748542 -0.011138034
RMS error
Error = actual - predicted RMS = Root-mean-squared error = Square root of average squared error
In previous example, sizes of training and validation sets differ, so only RMS Error and Average Error are comparable
Using Excel and XLMiner for Data Mining
Excel is limited in data capacity However, the training and validation of DM
models can be handled within the modest limits of Excel and XLMiner Models can then be used to score larger databases XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)
Summary
Data Mining consists of supervised methods
(Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization) Before algorithms can be applied, data must be characterized and pre-processed To evaluate performance and to avoid overfitting, data partitioning is used Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database