
Multinomial Naive Bayes Classifier in R

Last Updated : 17 Jul, 2024

The Multinomial Naive Bayes (MNB) classifier is a popular machine learning algorithm, especially useful for text classification tasks such as spam detection, sentiment analysis, and document categorization. In this article, we discuss the basics of the MNB classifier and how to implement it in R.

What is Naive Bayes?

Naive Bayes is a family of simple probabilistic classifiers based on Bayes' theorem with the "naive" assumption that every pair of features is independent given the class. Despite its simplicity, it often performs surprisingly well for many tasks, particularly those involving text.

Multinomial Naive Bayes

The Multinomial Naive Bayes classifier is specifically designed for handling discrete data. It is most commonly used for document classification problems, where the frequency of each word (i.e., a discrete count of the words) is used as a feature.

How Does Multinomial Naive Bayes Work?

  • The algorithm calculates the probability of each word given each class from the training data. It also calculates the prior probability of each class.
  • For a new document, it calculates the product of the probabilities of the words in the document given each class and the prior probability of each class. The class with the highest product is chosen as the predicted class.
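
The two steps above can be sketched in base R on a tiny example. This is a minimal illustration rather than a production implementation: the words, counts, and variable names (train_counts, new_doc, alpha) are invented for the example. It works with log probabilities, since multiplying many small probabilities can underflow, and uses add-one (Laplace) smoothing.

```r
# Toy training data: rows are documents, columns are word counts.
# The words and counts are invented purely for illustration.
train_counts <- matrix(
  c(2, 1, 0,
    1, 0, 0,
    0, 1, 3,
    0, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("free", "win", "meeting"))
)
train_class <- factor(c("spam", "spam", "ham", "ham"))

# Prior probability of each class: share of training documents per class.
log_prior <- log(table(train_class) / length(train_class))

# Total count of each word within each class (classes x words).
word_totals <- rowsum(train_counts, group = train_class)

# Word probabilities per class with add-one (Laplace) smoothing,
# so a word unseen in a class never gets probability zero.
alpha <- 1
log_lik <- log((word_totals + alpha) /
               (rowSums(word_totals) + alpha * ncol(word_totals)))

# Score a new document: log prior + sum over words of count * log P(word | class).
new_doc <- c(free = 1, win = 1, meeting = 0)
scores <- as.vector(log_prior[rownames(log_lik)]) + as.vector(log_lik %*% new_doc)
names(scores) <- rownames(log_lik)

# The class with the highest score is the prediction.
predicted <- names(which.max(scores))
print(predicted)
```

Summing logs gives the same ranking of classes as the product of probabilities described above.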

The following steps walk through an implementation of a Naive Bayes spam classifier in the R programming language.

Dataset: Spam (spam.csv)

Step 1: Load Necessary Packages and Dataset

  • e1071: Provides functions for Naive Bayes classification (naiveBayes function).
  • tm: Used for text mining tasks, such as creating a corpus and preprocessing text data.
  • caret: Provides functions for data splitting (createDataPartition) and model evaluation (confusionMatrix).
  • read.csv: Loads the dataset from the specified path. Adjust the path to wherever spam.csv is saved on your machine.
R
library(e1071)
library(tm)
library(caret)

# Keep text as character strings rather than factors
sms_data <- read.csv("C:\\Users\\Tonmoy\\Downloads\\Dataset\\spam.csv",
                     stringsAsFactors = FALSE)

Step 2: Prepare the Data

Renames the columns of the dataset to "type" and "text" for easier reference.

R
colnames(sms_data) <- c("type", "text")

Step 3: Text Preprocessing

  • Fix Encoding: Re-encodes the "text" column from "latin1" to "UTF-8" with iconv, dropping any characters that cannot be converted (sub = "").
  • Create Corpus: Converts the "text" column into a text corpus using tm::Corpus.
  • Preprocess Text: Applies several transformations (tolower, removePunctuation, removeNumbers, removeWords with the English stop-word list, stripWhitespace) to clean and standardize the text data for analysis.
  • Build the Document-Term Matrix (DTM): Converts the preprocessed corpus into a DTM where rows represent documents (text messages), columns represent terms (words), and each cell holds a term count.
  • Convert to a Matrix: Converts the DTM into a regular matrix (dtm_matrix) suitable for modeling. This dense copy can use a lot of memory when the vocabulary is large.
R
sms_data$text <- iconv(sms_data$text, from = "latin1", to = "UTF-8", sub = "")

corpus <- Corpus(VectorSource(sms_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
dtm_matrix <- as.matrix(dtm)
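
To make the shape of a Document-Term Matrix concrete, here is a small base R sketch that builds one by hand. The three messages and the names docs, vocab, and toy_dtm are invented for illustration. It only approximates what tm produces, since tm applies its own tokenization and term-length rules.

```r
# Three invented messages standing in for sms_data$text.
docs <- c("free prize win now", "meeting at noon", "win a free free ticket")

# Split each message into lowercase tokens on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: sorted unique terms across all messages.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each message.
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Rows are documents, columns are terms, cells are counts.
print(toy_dtm)
```

Most cells are zero even in this tiny example, which is typical of text data.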

Step 4: Split the Data

  • Splits the dataset into training and testing sets using caret::createDataPartition.
  • Selects rows from dtm_matrix based on train_index to create train_data and test_data.
  • Converts train_labels and test_labels to factors to ensure they have the same levels for classification.
R
set.seed(123)
train_index <- createDataPartition(sms_data$type, p = 0.75, list = FALSE)
train_data <- dtm_matrix[train_index, ]
test_data <- dtm_matrix[-train_index, ]
train_labels <- as.factor(sms_data$type[train_index])
test_labels <- as.factor(sms_data$type[-train_index])
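
For a factor outcome, createDataPartition is documented to sample within each class, so the training and test sets keep similar class proportions. The base R sketch below illustrates that idea with a hypothetical helper, stratified_index, and invented labels. It is not caret's implementation.

```r
set.seed(123)

# Invented labels, mostly "ham", to mimic an imbalanced spam dataset.
labels <- factor(c(rep("ham", 87), rep("spam", 13)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(labels, p = 0.75)

# Class proportions should be nearly identical in both subsets.
print(prop.table(table(labels[train_idx])))
print(prop.table(table(labels[-train_idx])))
```

Running the same prop.table(table(...)) check on train_labels and test_labels is a quick way to confirm the split kept the ham/spam ratio.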

Step 5: Train the Multinomial Naive Bayes Classifier

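Trains a Naive Bayes classifier (mnb_model) using the training data (train_data) and corresponding labels (train_labels), with the Laplace smoothing parameter set to 1 (laplace = 1). Note that, according to the e1071 documentation, naiveBayes models numeric predictors such as these word counts with a per-class Gaussian distribution rather than a multinomial one, and Laplace smoothing applies only to categorical predictors. This call therefore approximates Multinomial Naive Bayes rather than implementing it exactly.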

R
mnb_model <- naiveBayes(train_data, train_labels, laplace = 1)
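
The e1071 documentation states that naiveBayes assumes a Gaussian distribution for numeric predictors. The base R sketch below, with invented counts and assumed totals, shows why a Gaussian fits sparse word counts poorly compared with a smoothed multinomial estimate. It illustrates the general mismatch only: e1071 applies its own thresholds internally, so its exact numbers will differ.

```r
# Invented counts of one term across six messages; the term never appears in "ham".
term_count <- c(0, 0, 0, 0, 2, 1)
class_lab  <- factor(c("ham", "ham", "ham", "ham", "spam", "spam"))

# Per-class mean and standard deviation, as a Gaussian model would estimate them.
class_mean <- tapply(term_count, class_lab, mean)
class_sd   <- tapply(term_count, class_lab, sd)
print(class_sd)  # "ham" has zero spread for this term

# Gaussian density of seeing the term once under each class.
# With sd = 0, dnorm() treats the class as a point mass at its mean.
dens_once <- dnorm(1, mean = class_mean, sd = class_sd)
print(dens_once)

# A smoothed multinomial estimate gives "ham" a small but usable probability instead.
vocab_size <- 3   # assumed vocabulary size for this toy example
ham_total  <- 10  # assumed total word count across "ham" messages
p_term_ham <- (0 + 1) / (ham_total + vocab_size)
print(p_term_ham)
```

A single zero density wipes out a class's whole product of probabilities, which is the situation Laplace smoothing is meant to prevent in the multinomial model.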

Step 6: Make Predictions

Uses the trained model (mnb_model) to make predictions (test_pred) on the test data (test_data).

R
test_pred <- predict(mnb_model, test_data)

Step 7: Evaluate the Model

  • Computes the confusion matrix (conf_matrix) to evaluate the performance of the Naive Bayes classifier on the test data (test_pred vs. test_labels).
  • Prints the confusion matrix along with metrics such as accuracy, sensitivity, specificity, and the positive and negative predictive values.
R
conf_matrix <- confusionMatrix(test_pred, test_labels)
print(conf_matrix)

Output:

Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham     0    0
      spam 1206  186

               Accuracy : 0.1336
                 95% CI : (0.1162, 0.1526)
    No Information Rate : 0.8664
    P-Value [Acc > NIR] : 1

                  Kappa : 0

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.0000
            Specificity : 1.0000
         Pos Pred Value : NaN
         Neg Pred Value : 0.1336
             Prevalence : 0.8664
         Detection Rate : 0.0000
   Detection Prevalence : 0.0000
      Balanced Accuracy : 0.5000

       'Positive' Class : ham

With "ham" as the positive class, the model predicted "spam" for every test message. It correctly labeled 0 "ham" messages (true positives) and all 186 "spam" messages (true negatives), and it mislabeled all 1206 "ham" messages as "spam" (false negatives).

  • Accuracy: 13.36% of all predictions were correct (186 of 1392), which is consistent with the confidence interval below.
  • 95% Confidence Interval (CI): True accuracy likely falls between 11.62% and 15.26%.
  • No Information Rate (NIR): A model would achieve 86.64% accuracy by always predicting the majority class ("ham").
  • Kappa Statistic: Indicates no agreement beyond chance (Kappa = 0).
  • Sensitivity (True Positive Rate): 0% sensitivity indicates that no "ham" message was correctly identified.
  • Specificity (True Negative Rate): 100% specificity indicates that all "spam" messages were correctly identified.
  • Positive Predictive Value (Precision): Not a Number (NaN) because the model made no positive ("ham") predictions.
  • Negative Predictive Value (NPV): 13.36% of the messages predicted as "spam" were actually "spam".
  • Prevalence: 86.64% of messages were "ham".
  • Detection Rate: 0% of all messages were correctly identified as "ham".
  • Balanced Accuracy: 50%, the average of sensitivity and specificity, suggests the classifier performs no better than random chance.
  • Positive Class: "ham" was considered as the positive class for metrics calculation.
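
As a check, the headline metrics can be recomputed from the four cells of the confusion matrix. This base R sketch uses the standard definitions with "ham" as the positive class. The variable names are chosen for the example, and the accuracy comes out at 0.1336, which falls inside the reported confidence interval.

```r
# Confusion matrix from the output: rows are predictions, columns are reference labels.
cm <- matrix(c(0, 1206, 0, 186), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

# "ham" is the positive class, matching the caret output.
tp <- cm["ham", "ham"]    # ham predicted as ham
fn <- cm["spam", "ham"]   # ham predicted as spam
fp <- cm["ham", "spam"]   # spam predicted as ham
tn <- cm["spam", "spam"]  # spam predicted as spam

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
npv         <- tn / (tn + fn)
prevalence  <- (tp + fn) / sum(cm)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, npv = npv,
              prevalence = prevalence, balanced = balanced), 4))
```

The positive predictive value, tp / (tp + fp), is 0 / 0 here, which is why caret reports NaN.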

Conclusion

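The Multinomial Naive Bayes classifier provides a straightforward approach for text classification tasks like spam detection, but its performance depends heavily on how well the model matches the data and on the quality and balance of the dataset. In this case, the classifier labeled every test message as "spam", so it failed to identify any "ham" message and scored well below the no-information rate. The class imbalance may play a part, but a likely factor is that e1071's naiveBayes treats the word counts as Gaussian-distributed numeric features, which suits sparse count data poorly. Improving its effectiveness might involve converting the counts to categorical features or using a true multinomial implementation, refining text preprocessing techniques, or exploring more advanced algorithms.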

