Multinomial Naive Bayes Classifier in R

Last Updated : 17 Jul, 2024

The Multinomial Naive Bayes (MNB) classifier is a popular machine learning algorithm, especially useful for text classification tasks such as spam detection, sentiment analysis, and document categorization. In this article, we cover the basics of the MNB classifier and how to implement it in R.

What is Naive Bayes?

Naive Bayes is a family of simple probabilistic classifiers based on Bayes' theorem with the "naive" assumption that every pair of features is conditionally independent given the class. Despite its simplicity, it often performs surprisingly well for many tasks, particularly those involving text.
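
As a compact statement of this idea, write c for a class and x_1, ..., x_n for the features (this notation is introduced here for illustration and is not part of the original article). Under the independence assumption, the posterior probability factorises as follows, and the predicted class is the one with the largest score:

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```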

Multinomial Naive Bayes

The Multinomial Naive Bayes classifier is specifically designed for handling discrete data. It is most commonly used for document classification problems, where the frequency of each word (i.e., a discrete count of the words) is used as a feature.

How Does Multinomial Naive Bayes Work?

The algorithm calculates the probability of each word given each class from the training data. It also calculates the prior probability of each class.
For a new document, it calculates the product of the probabilities of the words in the document given each class and the prior probability of each class. The class with the highest product is chosen as the predicted class.
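
The two steps above can be sketched on a toy example in base R. The word counts, document counts, and variable names below are made up for illustration. The sketch also assumes Laplace smoothing with alpha = 1 and works with logarithms, a common way to avoid numerical underflow when many small probabilities are multiplied.

```r
# Toy word counts per class (rows = classes, columns = vocabulary); made-up numbers
word_counts <- rbind(
  spam = c(free = 3, win = 2, meeting = 0),
  ham  = c(free = 1, win = 0, meeting = 4)
)
doc_counts <- c(spam = 2, ham = 3)   # number of training documents per class

# Prior P(class) and Laplace-smoothed likelihood P(word | class)
alpha  <- 1
priors <- doc_counts / sum(doc_counts)
likelihood <- (word_counts + alpha) /
  (rowSums(word_counts) + alpha * ncol(word_counts))

# New document as a count vector over the same vocabulary
new_doc <- c(free = 1, win = 1, meeting = 0)

# Log space: log P(class) + sum over words of count * log P(word | class)
log_scores <- log(priors) + as.vector(log(likelihood) %*% new_doc)
predicted  <- names(which.max(log_scores))

print(exp(log_scores))   # unnormalised products, one per class
print(predicted)
```

Here the product for "spam" is 0.4 * 4/8 * 3/8 = 0.075 and for "ham" is 0.6 * 2/8 * 1/8 = 0.01875, so the toy document is assigned to "spam".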

Next, we walk through a step-by-step implementation of a Multinomial Naive Bayes classifier in the R programming language.

Dataset link: Spam

Step 1: Load Necessary Packages and Dataset

e1071: Provides functions for Naive Bayes classification (naiveBayes function).
tm: Used for text mining tasks, such as creating a corpus and preprocessing text data.
caret: Provides functions for data splitting (createDataPartition) and model evaluation (confusionMatrix).
Loads the dataset from the specified path.

library(e1071)
library(tm)
library(caret)

# Adjust the path to wherever spam.csv is stored on your machine
sms_data <- read.csv("C:\\Users\\Tonmoy\\Downloads\\Dataset\\spam.csv",
                     stringsAsFactors = FALSE)

Step 2: Prepare the Data

Renames the columns of the dataset to "type" and "text" for easier reference.

colnames(sms_data) <- c("type", "text")

Step 3: Text Preprocessing

Re-encodes the "text" column from "latin1" to "UTF-8", dropping any characters that cannot be converted (sub = "").
Create Corpus: Converts the "text" column into a text corpus using tm::Corpus.
Preprocess Text: Applies several transformations (tolower, removePunctuation, removeNumbers, removeWords, stripWhitespace) to clean and standardize the text data for analysis.
Converts the preprocessed corpus into a Document-Term Matrix (DTM) where rows represent documents (text messages) and columns represent terms (words).
Converts the Document-Term Matrix (DTM) into a regular matrix (dtm_matrix) suitable for modeling purposes.

sms_data$text <- iconv(sms_data$text, from = "latin1", to = "UTF-8", sub = "")

corpus <- Corpus(VectorSource(sms_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
dtm_matrix <- as.matrix(dtm)
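
To make the structure of a Document-Term Matrix concrete, here is a minimal base R sketch that builds one by hand from two made-up messages. It only illustrates the rows-as-documents, columns-as-terms layout and is not how the tm package works internally. The names messages, tokens, vocab, and toy_dtm are illustrative.

```r
# Two made-up, already-cleaned messages (illustrative only)
messages <- c("free prize win free", "meeting moved tomorrow")

# Split each message into words on whitespace
tokens <- strsplit(messages, "\\s+")

# Vocabulary = sorted unique words across all messages
vocab <- sort(unique(unlist(tokens)))

# One row per message, one column per word; each cell holds a word count
toy_dtm <- t(vapply(tokens, function(words) {
  as.integer(table(factor(words, levels = vocab)))
}, integer(length(vocab))))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(messages)), vocab)

print(toy_dtm)
```

Most cells in a real DTM are zero, because each message uses only a small part of the vocabulary.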

Step 4: Split the Data

Splits the dataset into training and testing sets using caret::createDataPartition.
Selects rows from dtm_matrix based on train_index to create train_data and test_data.
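Converts train_labels and test_labels to factors with explicitly shared levels, so both sets have the same levels even if a class happens to be missing from one of them.

set.seed(123)
train_index <- createDataPartition(sms_data$type, p = 0.75, list = FALSE)
train_data <- dtm_matrix[train_index, ]
test_data <- dtm_matrix[-train_index, ]
# Use one set of levels for both label vectors
label_levels <- sort(unique(sms_data$type))
train_labels <- factor(sms_data$type[train_index], levels = label_levels)
test_labels <- factor(sms_data$type[-train_index], levels = label_levels)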
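
createDataPartition samples within each class, so the training and test sets keep roughly the same ham/spam ratio as the full data. The base R sketch below illustrates that idea on made-up labels. The helper stratified_index is a hypothetical name and is not caret's actual implementation.

```r
# Hypothetical helper: sample a fraction p of row indices within each class
stratified_index <- function(labels, p) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
labels <- rep(c("ham", "spam"), times = c(80, 20))   # made-up, imbalanced labels
train_idx <- stratified_index(labels, p = 0.75)

# Class proportions are preserved in both subsets
print(prop.table(table(labels[train_idx])))
print(prop.table(table(labels[-train_idx])))
```

A plain random split could, by chance, leave very few "spam" messages in one of the subsets; stratifying avoids that.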

Step 5: Train the Multinomial Naive Bayes Classifier

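Trains a Naive Bayes classifier (mnb_model) with e1071::naiveBayes using the training data (train_data) and corresponding labels (train_labels). Note that naiveBayes treats numeric columns, such as the word counts here, as Gaussian-distributed within each class, and the Laplace smoothing parameter (laplace = 1) only applies to categorical (factor) predictors, so it has no effect on this numeric matrix. The fitted model is therefore a Gaussian approximation rather than a true multinomial model.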

mnb_model <- naiveBayes(train_data, train_labels, laplace = 1)
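
For comparison, the multinomial variant described earlier can be written in a few lines of base R. This is a minimal sketch under stated assumptions, not part of e1071 or any other package: train_mnb and predict_mnb are hypothetical helper names, the input is assumed to be a numeric document-term count matrix, and the toy data is made up.

```r
# Hypothetical helper: fit multinomial NB on a document-term count matrix
train_mnb <- function(x, y, alpha = 1) {
  y <- factor(y)
  # Sum word counts over the documents of each class (classes x terms)
  class_counts <- rowsum(x, group = y)
  # Laplace-smoothed log P(word | class)
  log_lik <- log((class_counts + alpha) /
                   (rowSums(class_counts) + alpha * ncol(x)))
  # log P(class) from class frequencies
  log_prior <- log(as.vector(table(y)) / length(y))
  names(log_prior) <- levels(y)
  list(log_prior = log_prior, log_lik = log_lik)
}

# Hypothetical helper: pick the class with the highest log posterior score
predict_mnb <- function(model, newdata) {
  # documents x classes matrix of log scores
  scores <- newdata %*% t(model$log_lik)
  scores <- sweep(scores, 2, model$log_prior, "+")
  factor(colnames(scores)[max.col(scores, ties.method = "first")],
         levels = names(model$log_prior))
}

# Made-up counts: columns are terms, rows are documents
toy_x <- rbind(c(2, 1, 0), c(1, 2, 0), c(0, 0, 3), c(0, 1, 2))
colnames(toy_x) <- c("free", "win", "meeting")
toy_y <- c("spam", "spam", "ham", "ham")

toy_model <- train_mnb(toy_x, toy_y)
toy_pred  <- predict_mnb(toy_model, rbind(c(1, 1, 0), c(0, 0, 2)))
print(toy_pred)
```

On the toy data, the first new document (one "free", one "win") is assigned to "spam" and the second (two "meeting") to "ham". Applying these helpers to train_data and test_data is possible in principle, but the results would differ from the output shown in this article.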

Step 6: Make Predictions

Uses the trained model (mnb_model) to make predictions (test_pred) on the test data (test_data).

test_pred <- predict(mnb_model, test_data)

Step 7: Evaluate the Model

Computes the confusion matrix (conf_matrix) to evaluate the performance of the Naive Bayes classifier on the test data (test_pred vs. test_labels).
Prints the confusion matrix together with metrics such as accuracy, Kappa, sensitivity, specificity, and the positive and negative predictive values.

conf_matrix <- confusionMatrix(test_pred, test_labels)
print(conf_matrix)

Output:

Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham     0    0
      spam 1206  186
                                          
               Accuracy : 0.1336          
                 95% CI : (0.1162, 0.1526)
    No Information Rate : 0.8664          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.1336          
             Prevalence : 0.8664          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : ham

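With "ham" as the positive class, the model predicted "spam" for every message. It correctly predicted 0 instances of "ham" (true positives) and correctly predicted all 186 instances of "spam" (true negatives), but it incorrectly predicted 1206 "ham" messages as "spam" (false negatives). There were no false positives.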

Accuracy: 13.36% of all predictions were correct (186 of 1392).
95% Confidence Interval (CI): True accuracy likely falls between 11.62% and 15.26%.
No Information Rate (NIR): A model would achieve 86.64% accuracy by always predicting the majority class ("ham").
Kappa Statistic: Indicates no agreement beyond chance (Kappa = 0).
Sensitivity (True Positive Rate): 0% sensitivity indicates that no "ham" message (the positive class) was correctly identified.
Specificity (True Negative Rate): 100% specificity indicates that all "spam" messages were correctly identified.
Positive Predictive Value (Precision): Not a Number (NaN) because the model made no positive ("ham") predictions.
Negative Predictive Value (NPV): 13.36% of the messages predicted as "spam" were actually spam.
Prevalence: 86.64% of messages were "ham".
Detection Rate: 0% of all messages were correctly identified as "ham".
Balanced Accuracy: 50%, the average of sensitivity and specificity, suggests the classifier performs no better than a constant guess.
Positive Class: "ham" was treated as the positive class when calculating the metrics.
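
These statistics can be recomputed from the four cells of the confusion matrix. The base R sketch below uses the counts from the output above with "ham" as the positive class. The variable names are illustrative.

```r
# Cells of the confusion matrix above, with "ham" as the positive class
tp <- 0      # ham predicted as ham
fp <- 0      # spam predicted as ham
fn <- 1206   # ham predicted as spam
tn <- 186    # spam predicted as spam
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # recall for "ham"
specificity <- tn / (tn + fp)   # recall for "spam"
ppv         <- tp / (tp + fp)   # NaN here: no "ham" predictions (0 / 0)
npv         <- tn / (tn + fn)
prevalence  <- (tp + fn) / n
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, npv = npv,
              prevalence = prevalence, balanced = balanced), 4))
```

Because the model never predicts "ham", accuracy equals the share of "spam" in the test set (186 / 1392).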

Conclusion

The Multinomial Naive Bayes classifier provides a straightforward approach to text classification tasks such as spam detection in R, but its performance depends heavily on the quality and balance of the dataset and on how well the chosen implementation matches the data. In this case, the classifier labelled every message as "spam" and so misclassified all "ham" messages. Class imbalance may play a part, but a likely contributor is that e1071::naiveBayes models the sparse word counts as Gaussian rather than multinomial. Improving its effectiveness might involve using a true multinomial model, addressing the imbalance, refining text preprocessing techniques, or exploring more advanced algorithms.

tmishra2001