Concepts - Model Evaluation (Data Mining Fundamentals)
Data Mining & Methodology
Data Preprocessing
Precision, bias and accuracy
• Type of prediction
  o Interval / numerical
  o Nominal / categorical
Model Quality Measurement
• Estimates
  o Variance/Errors (examples: ASE/MSE, RASE)
  o Fit (examples: Coefficient of Determination, R²)
  o Methods: Statistics
• Classification (Decisions)
  o Error rates (examples: Misclassification)
  o Accuracy (examples: Accuracy, Precision, Specificity)
  o Methods: Confusion Matrix, Gain, Lift, ROC
Errors for Estimation
• These measures indicate how close a regression line is to a set of points (actual values): they take the distances (i.e. errors) from the points to the regression line and square them.
• The error is the difference between the estimated value and the actual value.
• Mean squared error (MSE) = SSE / DFE
• Average squared error (ASE) = SSE / N
where SSE = sum of squared errors, N = total sample size, and DFE = degrees of freedom for error.
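A minimal computational sketch of these measures, assuming the actual and predicted values are available as Python lists; the numbers and the parameter count used for DFE are invented for illustration:

    # Sketch: SSE, ASE, MSE and RASE for an estimation (interval target) model.
    # The actual/predicted values are illustrative only.
    actual    = [20.0, 15.0, 18.0, 11.0, 16.0]
    predicted = [18.5, 14.0, 19.0, 12.5, 15.0]

    n = len(actual)                         # total sample size N
    k = 3                                   # assumed number of estimated parameters, so DFE = N - k
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # sum of squared errors
    ase = sse / n                           # average squared error = SSE / N
    mse = sse / (n - k)                     # mean squared error   = SSE / DFE
    rase = ase ** 0.5                       # root average squared error (RASE)
    print(sse, ase, mse, rase)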
Model Evaluation for Regressions
The regression model has an adjusted R-square value of 0.1525, meaning the input variables in the model explain 15.25% of the variation in the target variable, i.e. the Result.
Model presentation:
result = 18.7833 + 0.7717(Medu=0) – 0.9251(Medu=1) – 0.7789(Medu=2) +
0.2555(Medu=3) – 0.4689(age) – 0.3951(goout) + 0.8672(studytime)
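As an illustration only, the fitted formula above can be used to score a new record. The student profile below is made up, and Medu is assumed to be dummy-coded, with any level not listed in the formula acting as the reference level:

    # Sketch: scoring one record with the fitted regression formula shown above.
    # The student profile is invented; Medu levels 0-3 enter as dummy variables
    # (a record at the reference Medu level has all four dummies equal to 0).
    def predict_result(medu, age, goout, studytime):
        return (18.7833
                + 0.7717 * (medu == 0)
                - 0.9251 * (medu == 1)
                - 0.7789 * (medu == 2)
                + 0.2555 * (medu == 3)
                - 0.4689 * age
                - 0.3951 * goout
                + 0.8672 * studytime)

    print(predict_result(medu=2, age=16, goout=3, studytime=2))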
Decision Tree Evaluation: Performance and Complexity
Misclassification/error in both training and validation keeps decreasing as more model training is performed (i.e. as the tree grows).
Optimal tree: a compromise at 5 leaves.
Observing Decision Tree Rules
Node ID: 6 (best rule / highest % for identifying B = 1)
IF Replacement: Gift Count 36 Months >= 2.5 or MISSING
AND Replacement: Gift Amount Last < 7.5
THEN there is a 64% chance that B = 1 (i.e. a 36% chance that B = 0).
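A small sketch of how this single rule could be applied to a record; the function name and the way the two inputs are represented (a missing gift count as None) are assumptions for illustration:

    # Sketch: applying the node-6 rule to one record.
    # gift_count_36m stands for "Replacement: Gift Count 36 Months" (None = missing value),
    # gift_amount_last stands for "Replacement: Gift Amount Last".
    def node6_rule(gift_count_36m, gift_amount_last):
        fires = (gift_count_36m is None or gift_count_36m >= 2.5) and gift_amount_last < 7.5
        # Within this node, about 64% of records have B = 1 and 36% have B = 0.
        return 0.64 if fires else None   # None: the rule does not apply to this record

    print(node6_rule(gift_count_36m=3, gift_amount_last=5.0))   # 0.64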
Validation dataset (Positive target response is Churn = “Y”)
Recall vs Precision
What are the chances that John will be able to recall all the terrorists precisely? Say John names 15 suspects, which include all 10 actual terrorists.
Then John's recall (the ability to find all relevant (positive) targets) is 100%, but his precision (the ability to correctly predict the 'true' targets out of all predictions) is only 10/15 = 67%.
Calculate the rates for a validation confusion matrix with TP = 100, FP = 10, FN = 5, TN = 50 (total = 165):
Accuracy:
Misclassification Rate:
Precision positive (Positive Predictive Value):
Recall positive (Sensitivity):
Recall negative (Specificity):
Got it?
• Accuracy = (100 + 50) / 165 = 0.9091
• Misclassification rate = (10 + 5) / 165 = 0.0909
• Precision = 100 / (10 + 100) = 0.9091
• Sensitivity = 100 / (100 + 5) = 0.9524
• Specificity = 50 / (50 + 10) = 0.8333
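These rates can also be reproduced in a few lines of Python; the cell counts below are taken directly from the worked example (TP = 100, TN = 50, FP = 10, FN = 5):

    # Sketch: binary classification rates from confusion-matrix counts.
    tp, tn, fp, fn = 100, 50, 10, 5
    total = tp + tn + fp + fn                      # 165

    accuracy          = (tp + tn) / total          # 0.9091
    misclassification = (fp + fn) / total          # 0.0909
    precision         = tp / (tp + fp)             # 0.9091
    sensitivity       = tp / (tp + fn)             # recall of the positive class, 0.9524
    specificity       = tn / (tn + fp)             # recall of the negative class, 0.8333
    print(accuracy, misclassification, precision, sensitivity, specificity)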
ROC Chart
• A receiver operating characteristic (ROC) curve provides an assessment of one or more binary classification models.
• Usually a diagonal line is plotted as a baseline, that is, where a random prediction would lie.
• For classification models that generate a single value, a single point can be plotted on the chart. A point above the diagonal line indicates a degree of accuracy that is better than a random prediction.
• Conversely, a point below the line indicates that the prediction is worse than a random prediction. The closer the point is to the upper-left corner of the chart, the better the prediction.
ROC Chart and Index
[Figure: example ROC curves; both axes run from 0.0 to 1.0. A weak model has an ROC index < 0.6; a strong model has an ROC index > 0.7.]
• The closer the curve follows the left-hand border and then the top border of
the ROC space, the more accurate the test.
• The closer the curve comes to the 45-degree diagonal of the ROC space, the
less accurate the test.
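A minimal sketch of how ROC points and the ROC index (area under the curve) can be computed from predicted probabilities, using only numpy; the labels and scores are invented for illustration:

    import numpy as np

    # Sketch: ROC points and ROC index (AUC) from predicted probabilities.
    # y_true holds actual binary targets; y_score holds predicted probabilities
    # for the positive class. Both arrays are invented for illustration.
    y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
    y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])

    # Sort by descending score and accumulate true/false positives as the cut-off is lowered.
    order = np.argsort(-y_score)
    y_sorted = y_true[order]
    tpr = np.cumsum(y_sorted) / y_sorted.sum()              # sensitivity at each cut-off
    fpr = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()    # 1 - specificity at each cut-off

    # ROC index = area under the (fpr, tpr) curve, via the trapezoidal rule.
    fpr_full = np.concatenate(([0.0], fpr))
    tpr_full = np.concatenate(([0.0], tpr))
    roc_index = np.sum(np.diff(fpr_full) * (tpr_full[1:] + tpr_full[:-1]) / 2)
    print(roc_index)    # about 0.68 for this toy data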
SAS Enterprise Miner – Model Comparison
Contingency Table-based Measures
Table (rows = predicted class, columns = actual class):

                        Actual
                  Orange (c1)  Apple (c2)  Plum (c3)  Total predicted
Predicted Orange (c1)   10          0          0        p1 = 10
Predicted Apple (c2)     0          7          5        p2 = 12
Predicted Plum (c3)      0          3          5        p3 = 8
Total actual         a1 = 10    a2 = 10    a3 = 10

Accuracy = all correct predictions / total = 22/30 = 0.733
Precision (c1) = CP1/p1 = 10/10 = 1
Precision (c2) = CP2/p2 = 7/12 = 0.583
Precision (c3) = CP3/p3 = 5/8 = 0.625
Recall (c1) = CP1/a1 = 10/10 = 1
Recall (c2) = CP2/a2 = 7/10 = 0.7
Recall (c3) = CP3/a3 = 5/10 = 0.5
F(c1) = (2*CP1)/(a1+p1) = (2*10)/(10+10) = 20/20 = 1
F(c2) = (2*CP2)/(a2+p2) = (2*7)/(10+12) = 14/22 = 0.636
F(c3) = (2*CP3)/(a3+p3) = (2*5)/(10+8) = 10/18 = 0.556
Overall F-measure = (1/number of classes) × (sum of all class F values) = (1/3)(1 + 0.636 + 0.556) = 0.731
where CPi = correct predictions for class i, pi = total predicted as class i, ai = total actually in class i.
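A short sketch that reproduces the per-class measures above from the 3×3 contingency table, using plain Python:

    # Sketch: per-class precision, recall and F-measure from a contingency table.
    # Rows are predicted classes, columns are actual classes (same layout as above).
    classes = ["Orange", "Apple", "Plum"]
    table = [
        [10, 0, 0],   # predicted Orange
        [0, 7, 5],    # predicted Apple
        [0, 3, 5],    # predicted Plum
    ]

    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(classes)))
    accuracy = correct / total                       # 22/30 = 0.733

    f_scores = []
    for i, name in enumerate(classes):
        cp = table[i][i]                             # correct predictions for class i (CPi)
        p = sum(table[i])                            # total predicted as class i (pi)
        a = sum(row[i] for row in table)             # total actually class i (ai)
        precision = cp / p
        recall = cp / a
        f = 2 * cp / (a + p)                         # same as 2PR/(P+R)
        f_scores.append(f)
        print(name, round(precision, 3), round(recall, 3), round(f, 3))

    print("Accuracy:", round(accuracy, 3))
    print("Overall (macro) F:", round(sum(f_scores) / len(f_scores), 3))   # 0.731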
Example: Binary Classification Models Comparison
Analysis & Interpretation:
• The overall accuracy and error rate of the models are summarized in the accuracy and error metrics. In general, model C is the most accurate, followed by model B, and then model A.
• The metrics also assess how well the models specifically predict positives, with model B performing the best based on the sensitivity score.
• Model C has the highest specificity score, indicating that this model is the best of the three at predicting negatives.
Note: These different metrics are used in different situations, depending on the goal of the specific project.
Generalization and Overfitting
• Generalization is the property of a model whereby the model applies to data that were not used to build it.
• It means applying models not just to the exact training set but to the general population from which the training data came (including data beyond the training set).
• Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
• There is no single choice/procedure that will eliminate overfitting. The best strategy is to recognize overfitting and manage complexity in a principled way.
• The accuracy of a model depends on how complex we allow it to be, and a model can be complex in different ways.
• Generally, there will be more overfitting as one allows the model to be more complex.
Underfitting and Overfitting
[Figure: error vs. number of nodes (model complexity) for training data and validation data; underfitting on the left, overfitting on the right, with a good compromise in between]
Underfitting: when the model is too simple, both training and validation errors are large.
Overfitting: the validation error rate begins to increase as the training error rate continues to decrease.
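A minimal sketch of how such a curve can be produced in practice, here with scikit-learn's decision tree on a synthetic dataset; the dataset and the complexity grid are illustrative assumptions, not the models used in these slides:

    # Sketch: training vs. validation misclassification as tree complexity grows.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

    for leaves in [2, 5, 10, 25, 50, 100, 200]:          # model complexity (number of leaves)
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=1)
        tree.fit(X_train, y_train)
        train_err = 1 - tree.score(X_train, y_train)     # training misclassification rate
        valid_err = 1 - tree.score(X_valid, y_valid)     # validation misclassification rate
        print(leaves, round(train_err, 3), round(valid_err, 3))
    # Training error keeps falling as leaves are added; validation error eventually
    # flattens or rises, which is the overfitting signal described above.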
Over-fitting and Under-fitting
• Under-fitting
  o When the model is too simple, both the training and validation errors/misclassification rates are large.
• Over-fitting
  o Occurs when the learned function performs well on the data used during training but poorly on new data (i.e. the validation data).
  o The validation error rate begins to increase as the training error rate continues to decrease.
  o An issue for all modelling algorithms.
More Training ≠ Performance
Complexity ≠ Performance
Model A: ASE = 11.311 (validation)
Model B: ASE = 5.702 (validation)
Complexity and Performance Evaluation
• Choose the right metrics
• Different modelling techniques may have different performance metrics:
  o R-square (e.g. linear regressions)
  o ASE for estimation (interval target)
  o Confusion matrix, ROC, misclassification/accuracy for classification (nominal target)
• Different modelling techniques may pose different characteristics in complexity:
  o Decision trees: depth/branches, leaf size
  o Regressions: number of attributes
Adjustments for Improving Model Performance/Complexity (and Reducing Bias)
Data preparation
• Detect highly correlated attributes, i.e. collinearity
• Data transformation on variables with high data sparsity / outliers
• Imputation – replacing missing values
• Data deletion – missing values/outliers/duplication
• Variable importance / selection
• Replacing misclassified values
NOTE: Do not be surprised if you do not get better performance after applying certain data preparation techniques!
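A brief pandas sketch of two of the preparation steps listed above (collinearity detection and median imputation); the DataFrame contents and the 0.9 correlation cut-off are assumptions for illustration:

    import pandas as pd

    # Sketch: two common data-preparation steps on an assumed DataFrame `df`.
    df = pd.DataFrame({
        "age": [16, 17, None, 15, 18],
        "goout": [3, 4, 2, None, 5],
        "studytime": [2, 1, 3, 2, 1],
    })

    # 1) Detect highly correlated (collinear) numeric attributes.
    corr = df.corr().abs()
    for col in corr.columns:
        for other in corr.columns:
            if col < other and corr.loc[col, other] > 0.9:   # illustrative cut-off
                print("Highly correlated:", col, other)

    # 2) Imputation: replace missing numeric values with the column median.
    df_imputed = df.fillna(df.median(numeric_only=True))
    print(df_imputed)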
Modeling
• Different modelling techniques have different test settings
• Variable selection methods (backward/forward/stepwise)
• Tree pruning (for decision tree)
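A hedged sketch of backward variable selection for a regression, using statsmodels and a p-value threshold of 0.05; the synthetic DataFrame, the target name "result" and the threshold are assumptions (tools such as SAS Enterprise Miner offer backward/forward/stepwise selection as built-in regression options):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Sketch: backward elimination on an assumed DataFrame with target "result".
    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["age", "goout", "studytime", "noise"])
    df["result"] = 18 - 0.5 * df["age"] + 0.9 * df["studytime"] + rng.normal(size=200)

    predictors = ["age", "goout", "studytime", "noise"]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df["result"], X).fit()
        pvalues = fit.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] <= 0.05:            # all remaining predictors are significant
            break
        predictors.remove(worst)              # drop the least significant predictor
    print("Selected predictors:", predictors)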
Model Presentation and Interpretation
Model presentation and explanation
• Decision tree modelling technique:
  o Present the complete rules or tree of the decision tree
  o Explain the best rule for a positive and a negative target
    ▪ The rule description
    ▪ An explanation of the positive and negative target purity ratio
• Regression modelling technique:
  o Present the regression formula
  o Interpret the regression formula
Model complexity
• Present the number of input attributes used as predictors in the selected model
• Present the names of the input attributes included in the selected model
Model performance
• Indicate whether accuracy (classification) or an estimation prediction is being measured
• Present the value of the measurement used for model evaluation
• Indicate that the "validation" partition is used as the model selection criterion