
Computing Classification Evaluation Metrics in R

Last Updated : 25 Apr, 2025

Classification evaluation metrics help us understand how well a model performs in assigning instances to predefined categories. These metrics provide both general and class-specific insights, guiding us in tuning models and interpreting their effectiveness.

Confusion Matrix

The confusion matrix summarizes a model's prediction results in a tabular format. In the common convention, rows represent actual classes and columns show predicted classes; note that caret's confusionMatrix(), used later in this article, places predictions in rows and the actual (reference) labels in columns. Each cell contains the count of predictions for a particular actual/predicted combination, helping us identify errors like false positives and false negatives. From the matrix, we derive four core counts, illustrated with a small R sketch after the list:

  • True Positives (TP): Correctly predicted positives
  • True Negatives (TN): Correctly predicted negatives
  • False Positives (FP): Incorrectly predicted positives
  • False Negatives (FN): Incorrectly predicted negatives
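
The snippet below is a minimal sketch: it builds a 2x2 confusion matrix with base R's table() from two short made-up label vectors (actual and predicted are hypothetical) and reads off the four counts. These variables are reused in the metric examples that follow.

R
# Hypothetical binary labels; "pos" is treated as the positive class
actual    <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"), levels = c("pos", "neg"))
predicted <- factor(c("pos", "neg", "neg", "pos", "pos", "neg"), levels = c("pos", "neg"))

# Rows = predicted class, columns = actual class (matching caret's layout)
conf <- table(Predicted = predicted, Actual = actual)
conf

TP <- conf["pos", "pos"]  # correctly predicted positives
TN <- conf["neg", "neg"]  # correctly predicted negatives
FP <- conf["pos", "neg"]  # actual negatives predicted as positive
FN <- conf["neg", "pos"]  # actual positives predicted as negative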

1. Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

Accuracy measures the overall correctness of a model's predictions. It’s easy to interpret but can be misleading on imbalanced datasets. A model predicting only the majority class may still show high accuracy, even if it fails to capture minority classes. Accuracy is more reliable when class distribution is balanced and all types of errors are equally important.
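
Using the hypothetical TP, TN, FP and FN counts from the sketch above, accuracy is a direct translation of the formula:

R
accuracy <- (TP + TN) / (TP + TN + FP + FN)
accuracy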

2. Precision

Formula: TP / (TP + FP)

Precision focuses on the quality of positive predictions. It tells us how many of the predicted positives are actually correct. This is especially useful when false positives are costly—like in fraud detection or medical diagnoses, where incorrect alerts can cause significant disruptions or unnecessary treatments.
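
With the same hypothetical counts, precision follows the formula directly:

R
precision <- TP / (TP + FP)
precision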

3. Recall (Sensitivity)

Formula: TP / (TP + FN)

Recall measures the model’s ability to detect all actual positives. It's critical in cases where missing a positive case (false negative) is riskier than a false alarm—like detecting diseases or rare faults in systems. High recall means fewer important cases are missed.
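
Again using the counts from the earlier sketch:

R
recall <- TP / (TP + FN)
recall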

4. F1 Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

The F1 score balances precision and recall into a single metric, which is especially helpful when dealing with class imbalance. It's useful when both false positives and false negatives are important and neither precision nor recall alone gives the full picture.
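
Reusing the precision and recall values computed in the sketches above:

R
f1 <- 2 * (precision * recall) / (precision + recall)
f1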

5. Specificity

Formula: TN / (TN + FP)

Specificity, or the true negative rate, tells us how well a model identifies actual negatives. It's key in scenarios where false positives are costly, such as medical screenings, where we want to minimize mislabeling healthy individuals as sick.
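
With the same hypothetical counts:

R
specificity <- TN / (TN + FP)
specificity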

6. Kappa Score

Formula: (Observed Accuracy - Expected Accuracy) / (1 - Expected Accuracy)

Kappa measures agreement between predicted and actual labels, adjusted for chance. It's particularly useful for imbalanced datasets, giving a more reliable view of model performance than accuracy alone. A score of 1 indicates perfect agreement, 0 means chance-level performance and negative values suggest worse-than-random predictions.
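
A minimal sketch of the kappa calculation, assuming the conf table from the confusion matrix example above; the expected accuracy is the chance agreement implied by the row and column totals:

R
n <- sum(conf)
observed <- sum(diag(conf)) / n                       # observed accuracy
expected <- sum(rowSums(conf) * colSums(conf)) / n^2  # chance agreement
kappa <- (observed - expected) / (1 - expected)
kappa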

Implementation of Evaluation Metrics in R

We will use the "iris" dataset for the programming example. It contains three categories of 50 instances each. We will train a random forest model to classify the data into these categories and then print the evaluation metrics discussed above in R.

Step 1: Loading the necessary package

We will install the caret package, which stands for Classification And Regression Training. The "caret" package is a comprehensive framework in R for performing machine learning tasks, including data preprocessing, model training and evaluation. Load the "caret" package into the R environment so that its functions and capabilities can be used.

R
install.packages("caret")
library(caret)

Step 2: Loading Dataset

For this example, we use the "iris" dataset. Load the "iris" dataset, which is a famous dataset included in R. It contains measurements of different features of iris flowers (sepal length, sepal width, petal length, petal width) along with their corresponding species (setosa, versicolor, virginica).

Print a summary of the "iris" dataset, providing descriptive statistics such as mean, median, minimum, maximum and quartiles for each feature.

R
data(iris)
summary(iris)

Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

This code block creates a scatter plot matrix of the "iris" dataset. It shows pairwise scatter plots of each feature against other features, allowing us to visualize the relationships and distributions between variables.

R
plot(iris)

Output:

Pair plot of the iris dataset

Step 3: Splitting the Dataset

We will split the dataset into training and testing sets using createDataPartition from the caret package. We reserve 80% of the data for training. We are creating two subsets: trainData to build the model and testData to evaluate it.

R
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

Step 4: Model training

We train a classification model using the Random Forest algorithm with train() from caret. The formula Species ~ . tells R to use all other columns to predict the Species.

R
install.packages("randomForest")
library(randomForest)

model <- train(Species ~ ., data = trainData, method = "rf")

Step 5: Evaluating Metrics

We make predictions on the test data and use confusionMatrix() to compute accuracy, precision, recall, F1 score and more. We are comparing the predicted class labels with the actual values in the test set. The result is stored in cm, which contains detailed performance metrics.

R
predictions <- predict(model, newdata = testData)

cm <- confusionMatrix(predictions, testData$Species)
cm

Output:

Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 10 0 0
versicolor 0 10 1
virginica 0 0 9
Overall Statistics

Accuracy : 0.9667
95% CI : (0.8278, 0.9992)
No Information Rate : 0.3333
P-Value [Acc > NIR] : 2.963e-13

Kappa : 0.95

Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 0.9000
Specificity 1.0000 0.9500 1.0000
Pos Pred Value 1.0000 0.9091 1.0000
Neg Pred Value 1.0000 1.0000 0.9524
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3333 0.3000
Detection Prevalence 0.3333 0.3667 0.3000
Balanced Accuracy 1.0000 0.9750 0.9500

The confusion matrix results are now stored in the variable cm. The overall metrics, such as accuracy and kappa, are computed on the whole test set rather than per class, so they summarize performance across all categories at once.

To see the overall evaluation metrics, use $overall; to see the class-wise evaluation metrics, use $byClass.

R
cm$byClass

Output:

Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: setosa 1.0 1.00 1.0000000 1.000000 1.0000000
Class: versicolor 1.0 0.95 0.9090909 1.000000 0.9090909
Class: virginica 0.9 1.00 1.0000000 0.952381 1.0000000
Recall F1 Prevalence Detection Rate Detection Prevalence
Class: setosa 1.0 1.0000000 0.3333333 0.3333333 0.3333333
Class: versicolor 1.0 0.9523810 0.3333333 0.3333333 0.3666667
Class: virginica 0.9 0.9473684 0.3333333 0.3000000 0.3000000
Balanced Accuracy
Class: setosa 1.000
Class: versicolor 0.975
Class: virginica 0.950

We get all the class-wise evaluation scores. From this, we can observe the sensitivity, specificity, precision, recall and F1 score for each class.

R
cm$overall

Output:

Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
9.666667e-01 9.500000e-01 8.278305e-01 9.991564e-01 3.333333e-01
AccuracyPValue McnemarPValue
2.962731e-13 NaN

The cm$byClass line retrieves the per-class performance metrics (such as sensitivity, specificity, precision, recall and F1 score) for each class separately.

The cm$overall line provides overall performance metrics, such as accuracy and kappa, computed across all classes.
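
Individual values can be pulled out by name. The lines below are a small example that assumes the cm object created in Step 5 is still in memory:

R
cm$overall["Accuracy"]                     # overall accuracy
cm$overall["Kappa"]                        # kappa score
cm$byClass[, "F1"]                         # F1 score for each class
cm$byClass["Class: virginica", "Recall"]   # recall for a single class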

In this article, we used the caret package to load data, train a model, make predictions and calculate a full set of classification evaluation metrics in R.

