
Model Evaluation

Data Mining & Methodology

• Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.
• A generic data mining process methodology:

The Model Evaluation Phase

• The model evaluation phase of a data mining process assesses how well a model fits the data and generalizes to new data.
• This phase is an essential step in the data mining process, as it helps to ensure that the model is accurate and reliable.

Precision, bias and accuracy

The quality of a measurement process and the resulting data is assessed by:
• Precision – the closeness of repeated measurements to one another
• Bias – a systematic variation of the measurements from the quantity being measured
• Accuracy – the closeness of the measurements to the true value of the quantity being measured
Choosing Measurement Criteria

• Data type of the target variable (prediction)
  • Interval / numerical
  • Nominal / categorical
• Type of prediction
  • Estimate – interval / numerical
  • Decision / classification – nominal / categorical

Model Quality Measurement
• Estimates
  • Variance/errors (examples: ASE/MSE, RASE)
  • Fit (examples: coefficient of determination, R2)
  • Methods: statistics

• Classification (decisions)
  • Error rates (examples: misclassification)
  • Accuracy (examples: accuracy, precision, specificity)
  • Methods: confusion matrix, gain, lift, ROC

Errors for Estimation

• Error measures indicate how close a regression line is to a set of points (actual values), by taking the distances (i.e. errors) from the points to the regression line and squaring them.
• The error is the difference (i.e. variance) between the estimated value and the actual value.
• Mean squared error (MSE) = SSE / DFE
• Average squared error (ASE) = SSE / N
  where SSE = sum of squared errors, N = total sample size, DFE = degrees of freedom for error
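
A minimal sketch (Python, with made-up actual and predicted values and an assumed two-parameter model) of how SSE, MSE and ASE relate:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 9.0])      # hypothetical actual values
predicted = np.array([2.5, 5.5, 7.0, 9.5])   # hypothetical model estimates
n_params = 2                                  # assumed: intercept + 1 input attribute

sse = np.sum((actual - predicted) ** 2)       # SSE: sum of squared errors
n = len(actual)                               # N: total sample size
dfe = n - n_params                            # DFE: degrees of freedom for error

mse = sse / dfe                               # mean squared error = SSE / DFE
ase = sse / n                                 # average squared error = SSE / N
print(f"SSE={sse:.3f}, MSE={mse:.3f}, ASE={ase:.3f}")
```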
Model Evaluation for Regressions

Regression models can be evaluated and analysed based on several measurements:
• Errors (e.g. ASE or other error measures)
• F-test
• R-squared
• Misclassification
• ROC
R2 (R-Squared) for Linear Regression
• R2 is a statistical measure of how close the data are to the fitted regression line – the goodness of fit of the regression.
• R2 is also known as the coefficient of determination.
• R2 indicates how much better the function predicts the dependent variable than simply using the mean value of the dependent variable.
• The adjusted R2 is a statistic adjusted for the number of parameters in the equation and the number of data observations. It is a more conservative estimate of the percentage of variance explained.
• R2 measures the strength of the relationship between a model and the input attributes.
• R2 is not a formal significance test for the relationship. The F-test of overall significance is the hypothesis test for this relationship. If the overall F-test is significant, it can be concluded that the correlation between the model and the input attributes is statistically significant.
• R2 is always between 0 and 100% (or between 0 and 1):
  • 0% indicates that the model explains none of the variability of the target attribute data.
  • 100% indicates that the model explains all of the variability of the target attribute data.
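
A minimal sketch (Python with scikit-learn, synthetic data) of computing R2 and the adjusted R2, which corrects R2 for the number of parameters and observations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # hypothetical input attributes
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))                         # coefficient of determination

n, p = X.shape                                             # observations, number of inputs
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)              # adjusted R2
print(f"R-squared={r2:.3f}, adjusted R-squared={adj_r2:.3f}")
```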
Chi-Square for Logistic Regressions
• The chi-square statistic tests the overall significance of the regression model.
• The overall significance indicates whether the regression model provides a better fit to the data than a model that contains no input attributes (i.e. predicting with the mean/mode alone).
• If the p-value is less than the significance level (usually 0.01 or 0.05), the sample data provide sufficient evidence to conclude that the regression model fits the data statistically better than the model with no input attributes.
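
A minimal sketch (Python, synthetic data; not SAS Enterprise Miner output) of the likelihood-ratio chi-square test, comparing the fitted logistic model's log-likelihood against an intercept-only model that predicts the base rate:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                                    # hypothetical input attributes
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)  # hypothetical binary target

model = LogisticRegression().fit(X, y)
p1 = model.predict_proba(X)[:, 1]
ll_model = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))     # log-likelihood, full model

p0 = y.mean()                                                    # intercept-only model: predict the base rate
ll_null = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))      # log-likelihood, no-input model

lr_stat = 2 * (ll_model - ll_null)                               # likelihood-ratio chi-square statistic
p_value = chi2.sf(lr_stat, df=X.shape[1])                        # df = number of input attributes
print(f"chi-square={lr_stat:.2f}, p-value={p_value:.4f}")
```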
F-value and P-value
• The F-value is a ratio of the variances of the predicted values and the residuals, while the p-value is a probability
value that indicates the likelihood of obtaining the observed results by chance.
• F-value: The F-value is a measure of the overall significance of the model. A large F-value indicates that the model
is able to explain a significant amount of the variation in the dependent variable. An F-value that is greater than the
critical value at a given significance level (typically 0.05 or 0.01) indicates that the model is statistically significant.
• P-value: The p-value is a measure of the probability of obtaining the observed results by chance. A small p-value
indicates that the results are unlikely to have occurred by chance. A p-value that is less than the significance level
(typically 0.05 or 0.01) indicates that the results are statistically significant.
• In general, a linear regression model with a large F-value and a small p-value is considered a good fit for the data. However, the F-value and p-value are only two measures of the model's fit and should not be used in isolation; other factors, such as the R-squared value, should also be considered when evaluating the model.
• The F-value and p-value do not provide any information about the significance of individual variables in the model. To assess the significance of individual variables, look at the t-values and p-values for those variables.
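
A minimal sketch (Python, synthetic data) of the overall F-test: the ratio of the variance explained by the model to the residual variance, with the p-value taken from the F distribution:

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))                          # hypothetical input attributes
y = 1.2 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(scale=1.0, size=120)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

n, p = X.shape
ssr = np.sum((y_hat - y.mean()) ** 2)                  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)                         # residual (unexplained) variation

f_value = (ssr / p) / (sse / (n - p - 1))              # F = explained-variance ratio
p_value = f_dist.sf(f_value, p, n - p - 1)             # probability of a result this extreme by chance
print(f"F-value={f_value:.2f}, p-value={p_value:.4g}")
```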
Example Outputs of a Linear Regression Model
Example Outputs of a Logistic Regression Model
Example of Linear Regression Model Interpretation

The regression model has an adjusted R-squared value of 0.1525, meaning the input variables in the model explain 15.25% of the variability of the target variable, i.e. the Result.

Holding the other input variables constant, each coefficient gives the expected change in the result for a one-unit change in the corresponding input.

Model presentation:
result = 18.7833 + 0.7717(Medu=0) – 0.9251(Medu=1) – 0.7789(Medu=2) + 0.2555(Medu=3) – 0.4689(age) – 0.3951(goout) + 0.8672(studytime)
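
A small sketch applying the formula above to one student; the input values below are hypothetical and chosen only to illustrate how the coefficients are read:

```python
# Evaluate the regression formula above for one (hypothetical) student.
def predict_result(medu0, medu1, medu2, medu3, age, goout, studytime):
    return (18.7833 + 0.7717 * medu0 - 0.9251 * medu1 - 0.7789 * medu2
            + 0.2555 * medu3 - 0.4689 * age - 0.3951 * goout + 0.8672 * studytime)

# Hypothetical example: a 16-year-old with Medu = 3, goout = 2, studytime = 3
print(predict_result(medu0=0, medu1=0, medu2=0, medu3=1, age=16, goout=2, studytime=3))
```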
Decision Tree Evaluation: Performance and Complexity
• Misclassification/error in both training and validation keeps decreasing while more model training is performed.
• Once additional training no longer improves performance (i.e. the validation misclassification no longer decreases noticeably), further training leads to overfitting: the training misclassification/error keeps decreasing while the validation misclassification/error increases.

Optimal tree: a compromise with 5 leaves.
Observing Decision Tree Rules

Node id: 6 (best rule / highest % to identify B=1):
if Replacement: Gift Count 36 Months >= 2.5 or MISSING
AND Replacement: Gift Amount Last < 7.5
then there is a 64% chance B is 1 (i.e. a 36% chance B is 0).

Node id: 23 (best rule / highest % to identify B=0):
if Replacement: Time Since Last Gift >= 17.5 or MISSING
AND Replacement: Gift Count 36 Months >= 2.5 or MISSING
AND Replacement: Gift Amount Last >= 7.5 or MISSING
AND Replacement: Gift Amount Average Card 36 Months >= 14.415 or MISSING
then there is a 59% chance B is 0 (conversely a 41% chance B is 1).
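
A minimal sketch (Python with scikit-learn and a bundled sample dataset, not the fundraising data above) of how such if/then rules can be read off a fitted decision tree:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree (5 leaves, as in the optimal tree above) on a sample dataset.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0).fit(data.data, data.target)

# Each printed path is an if/then rule; the class at the leaf plays the role of B=0 / B=1.
print(export_text(tree, feature_names=list(data.feature_names)))
```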

Validation dataset (positive target response is Churn = "Y")

ID   Age   Gender   Churn (Actual value)   Churn (Prediction)
 1    18     M              Y                      Y
 2    21     M              N                      Y
 3    30     F              N                      N
 4    25     M              Y                      Y
 5    50     F              N                      N
 6    28     F              Y                      Y
 7    22     F              Y                      N
 8    40     M              N                      Y
 9    32     F              N                      N
10    60     M              N                      N
Validation dataset (positive target response is Churn = "Y")

ID   Age   Gender   Churn (Actual value)   Churn (Prediction)   Outcome
 1    18     M              Y                      Y             Positive (TRUE)  = True Positive
 2    21     M              N                      Y             Positive (FALSE) = False Positive
 3    30     F              N                      N             Negative (TRUE)  = True Negative
 4    25     M              Y                      Y             Positive (TRUE)  = True Positive
 5    50     F              N                      N             Negative (TRUE)  = True Negative
 6    28     F              Y                      Y             Positive (TRUE)  = True Positive
 7    22     F              Y                      N             Negative (FALSE) = False Negative
 8    40     M              N                      Y             Positive (FALSE) = False Positive
 9    32     F              N                      N             Negative (TRUE)  = True Negative
10    60     M              N                      N             Negative (TRUE)  = True Negative
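
A minimal sketch (Python with scikit-learn) that reproduces the confusion-matrix counts from the ten validation records above:

```python
from sklearn.metrics import confusion_matrix

actual    = ["Y", "N", "N", "Y", "N", "Y", "Y", "N", "N", "N"]   # Churn (actual value)
predicted = ["Y", "Y", "N", "Y", "N", "Y", "N", "Y", "N", "N"]   # Churn (prediction)

# With labels=["N", "Y"], rows/columns are ordered negative then positive,
# so ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=["N", "Y"]).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")   # TP=3, FP=2, TN=4, FN=1
```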
Evaluating Binary Models’ Predictive Accuracy
Prediction models with a binary response are assessed using four counts:
• True positive (TP): the number of observations predicted to be positive (1) that are in fact positive (1).
• True negative (TN): the number of observations predicted to be negative (0) that are in fact negative (0).
• False positive (FP): the number of observations incorrectly predicted to be positive (1) that are in fact negative (0).
• False negative (FN): the number of observations incorrectly predicted to be negative (0) that are in fact positive (1).

These four alternatives are illustrated in the confusion matrix.


Confusion Matrix and Measures

• Misclassification rate: (FP+FN)/Total – overall, how often is the model wrong?
• Accuracy: (TP+TN)/Total – overall, how often is the model correct?
• Precision (positive): TP/(TP+FP) – when it predicts yes, how often is it correct? (How often is the predicted 'true' target correct?)
• Precision (negative): TN/(TN+FN) – when it predicts no, how often is it correct? (How often is the predicted 'false' target correct?)
• Sensitivity (recall positive): TP/(TP+FN) – when it is actually yes, how often does it predict yes? The ability to find all relevant (positive) targets in a dataset.
• Specificity (recall negative): TN/(TN+FP) – when it is actually no, how often does it predict no? The ability to find all relevant (negative) targets in a dataset.
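
A minimal sketch expressing the measures above as plain functions of the four counts, applied to the churn example (TP=3, TN=4, FP=2, FN=1):

```python
def rates(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        "accuracy":           (tp + tn) / total,   # overall, how often is the model correct?
        "misclassification":  (fp + fn) / total,   # overall, how often is the model wrong?
        "precision_positive": tp / (tp + fp),      # when it predicts yes, how often is it correct?
        "precision_negative": tn / (tn + fn),      # when it predicts no, how often is it correct?
        "sensitivity":        tp / (tp + fn),      # when it's actually yes, how often does it predict yes?
        "specificity":        tn / (tn + fp),      # when it's actually no, how often does it predict no?
    }

print(rates(tp=3, tn=4, fp=2, fn=1))   # counts from the churn validation dataset above
```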
Recall vs Precision
John witnessed an incident in which 10 terrorists attacked customers in a café.

What are the chances that John will be able to recall all of the terrorists precisely?

Say John names 15 suspects in order to finally cover the 10 correct terrorists.

Recall vs Precision
Then John's recall (the ability to find all relevant (positive) targets) will be 100%, but his precision (the ability to correctly identify 'true' targets among all his predictions) will only be 10/15 = 67%.
Calculate the rates, given TP = 100, FP = 10, FN = 5, TN = 50 (total = 165):

• Accuracy:
• Misclassification rate:
• Precision positive:
• Recall positive (Sensitivity):
• Recall negative (Specificity):
Got it?

• Accuracy = (100 + 50) / 165 = 0.9091
• Misclassification rate = (10 + 5) / 165 = 0.0909
• Precision = 100 / (100 + 10) = 0.9091
• Sensitivity = 100 / (100 + 5) = 0.9524
• Specificity = 50 / (50 + 10) = 0.8333

ROC Chart
• A receiver operating characteristic (ROC) curve provides an assessment of one or more binary classification models.
• Usually a diagonal line is plotted as a baseline, that is, where a random prediction would lie.
• For classification models that generate a single value, a single point can be plotted on the chart. A point above the diagonal line indicates a degree of accuracy that is better than a random prediction.
• Conversely, a point below the line indicates that the prediction is worse than a random prediction. The closer the point is to the top-left corner of the chart, the better the prediction.

ROC Chart and Index
[Figure: ROC curves plotted on axes from 0.0 to 1.0 – a weak model (ROC index < 0.6) stays near the diagonal, while a strong model (ROC index > 0.7) bows towards the top-left corner.]

• The closer the curve follows the left-hand border and then the top border of
the ROC space, the more accurate the test.
• The closer the curve comes to the 45-degree diagonal of the ROC space, the
less accurate the test.
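
A minimal sketch (Python with scikit-learn, using a bundled sample dataset and an assumed 70:30 split) of computing the ROC curve points and the ROC index (area under the curve) on validation data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = model.predict_proba(X_va)[:, 1]        # predicted probability of the positive class

fpr, tpr, _ = roc_curve(y_va, scores)           # ROC curve: false positive rate vs true positive rate
roc_index = roc_auc_score(y_va, scores)         # ROC index (AUC); 0.5 corresponds to the diagonal baseline
print(f"ROC index (AUC) = {roc_index:.3f}")
```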
SAS Enterprise Miner – Model Comparison

[Figures: example model comparison outputs from SAS Enterprise Miner.]
Contingency Table-based Measures
                           Actual
                  Orange (c1)   Apple (c2)   Plum (c3)
Predicted
  Orange (c1)         10            0            0        p1 = 10
  Apple (c2)           0            7            5        p2 = 12
  Plum (c3)            0            3            5        p3 = 8
                   a1 = 10      a2 = 10      a3 = 10
Accuracy = all correct predictions / total = 22/30 = 0.733
(CPi = correct predictions for class ci; pi = total predicted as ci; ai = total actual ci)
Precision (c1): CP1/p1 = 10/10 = 1
Precision (c2): CP2/p2 = 7/12 = 0.583
Precision (c3): CP3/p3 = 5/8 = 0.625
Recall (c1): CP1/a1 = 10/10 = 1
Recall (c2): CP2/a2 = 7/10 = 0.7
Recall (c3): CP3/a3 = 5/10 = 0.5
F(c1): (2*CP1)/(a1+p1) = (2*10)/(10+10) = 20/20 = 1
F(c2): (2*CP2)/(a2+p2) = (2*7)/(10+12) = 14/22 = 0.636
F(c3): (2*CP3)/(a3+p3) = (2*5)/(10+8) = 10/18 = 0.556
Overall F-measure (macro average): (1/number of classes) × (sum of all classes' F) = (1/3)(1 + 0.636 + 0.556) = 0.731
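
A minimal sketch (Python with scikit-learn) that rebuilds the orange/apple/plum table from label arrays with the same counts and recomputes the per-class precision, recall and F-measure, plus the macro-averaged overall F:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = ["orange", "apple", "plum"]
# Arrays chosen so the pairwise counts match the contingency table above.
actual    = ["orange"] * 10 + ["apple"] * 7 + ["plum"] * 5 + ["apple"] * 3 + ["plum"] * 5
predicted = ["orange"] * 10 + ["apple"] * 7 + ["apple"] * 5 + ["plum"] * 3 + ["plum"] * 5

# Note: sklearn puts actual classes in rows and predictions in columns
# (the transpose of the table above, which has predictions in rows).
print(confusion_matrix(actual, predicted, labels=labels))

precision, recall, f1, _ = precision_recall_fscore_support(actual, predicted, labels=labels)
print("precision:", precision.round(3))           # [1.    0.583 0.625]
print("recall:   ", recall.round(3))              # [1.    0.7   0.5  ]
print("F:        ", f1.round(3))                  # [1.    0.636 0.556]
print("overall (macro) F:", f1.mean().round(3))   # 0.731
```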
Example: Binary Classification Models Comparison
Analysis & Interpretation:
• The overall accuracy and error rate of the models are summarized in the accuracy and error metrics. In general, model C is the most accurate, followed by model B and then model A.
• The metrics also assess how well the models specifically predict positives, with model B performing the best based on the sensitivity score.
• Model C has the highest specificity score, indicating that this model is the best of the three at predicting negatives.

Note: these different metrics are used in different situations, depending on the goal of the specific project.
Generalization and Overfitting
• Generalization is the property of a model whereby the model applies to data that were not used to build it.
• Models are applied not just to the exact training set but to the general population (including data beyond the training set) from which the training data came.
• Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
• There is no single choice or procedure that will eliminate overfitting. The best strategy is to recognize overfitting and manage complexity in a principled way.
• The accuracy of a model depends on how complex we allow it to be. A model can be complex in different ways.
• Generally, there will be more overfitting as one allows the model to become more complex.
Underfitting and Overfitting
[Figure: training and validation error versus the number of nodes (model complexity); the underfitting region, the good compromise, and the overfitting region are marked.]

• Underfitting: when the model is too simple, both training and test errors are large.
• Overfitting: the test error rate begins to increase as the training error rate continues to decrease.
Over-fitting and Under-fitting
• Under-fitting
  • When the model is too simple, both training and validation errors/misclassification are large
• Over-fitting
  • Occurs when the learned function performs well on the data used during training but poorly on new data (i.e. validation)
  • Validation error rate begins to increase as the training error rate continues to decrease
  • An issue for all modelling algorithms
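
A minimal sketch (Python with scikit-learn, a sample dataset and an assumed 70:30 partition) of the under-/over-fitting pattern: training error keeps falling as tree depth grows, while validation error eventually turns back up:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in range(1, 11):                      # model complexity = maximum tree depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)      # training misclassification rate
    valid_err = 1 - tree.score(X_va, y_va)      # validation misclassification rate
    print(f"depth={depth:2d}  train error={train_err:.3f}  validation error={valid_err:.3f}")
```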

More Training ≠ Performance

Decision Tree Model 1: Partition = 50:50, Misclassification = 0.428
Decision Tree Model 2: Partition = 70:30, Misclassification = 0.417
Decision Tree Model 3: Partition = 90:10, Misclassification = 0.432

Complexity ≠ Performance

Model A: ASE = 11.311 (Validation)
Model B: ASE = 5.702 (Validation)

Complexity and Performance Evaluation
• Choose the right metrics
  • Different modelling techniques may have different performance metrics:
    • R-squared (e.g. linear regressions)
    • ASE for estimation (interval target)
    • Confusion matrix, ROC, misclassification/accuracy for classification (nominal target)
• Different modelling techniques may exhibit different characteristics of complexity:
  • Decision trees: depth/branches, leaf size
  • Regressions: number of attributes

Adjustments for Improving Model Performance/Complexity (and Reducing Bias)

Data preparation (two of these adjustments are sketched after the Modeling list below):
• Detect highly correlated attributes, i.e. collinearity
• Data transformation on variables with high data sparsity / outliers
• Imputation – replacing missing values
• Data deletion – missing values / outliers / duplication
• Variable importance / selection
• Replacing misclassified values
NOTE: Do not be surprised if you do not get better performance after applying certain data preparation techniques!
Modeling
• Different modelling techniques have different test settings
• Variable selection methods (backward/forward/stepwise)
• Tree pruning (for decision tree)
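
A minimal sketch (Python with pandas/scikit-learn, hypothetical column names and values) of two of the data preparation adjustments above: flagging highly correlated attribute pairs and imputing missing values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values.
df = pd.DataFrame({
    "income":  [50, 60, np.nan, 80, 75],
    "expense": [48, 59, 62, 79, np.nan],
    "age":     [25, 32, 41, np.nan, 52],
})

# Collinearity check: report attribute pairs whose absolute correlation exceeds 0.9.
corr = df.corr().abs()
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > 0.9:
            print(f"highly correlated: {col_a} vs {col_b} ({corr.loc[col_a, col_b]:.2f})")

# Imputation: replace missing values with the column mean (median/mode are alternatives).
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)
print(imputed)
```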
Model Presentation and Interpretation
Model presentation and explanation
• Decision tree modelling technique:
  o Present the complete rules or tree of the decision tree
  o Explain the best rule for a positive and a negative target:
    ▪ the rule description
    ▪ an explanation of the positive and negative target purity ratios
• Regression modelling technique:
  o Present the regression formula
  o Interpret the regression formula
Model complexity
• Present the number of input attributes used as predictors in the selected model
• Present the names of the input attributes included in the selected model
Model performance
• Indicate whether accuracy (classification) or estimation error is reported
• Present the value of the measurement used for model evaluation
• Indicate that the "validation" partition is used as the model selection criterion
