Multinomial Naive Bayes Classifier in R
Last Updated: 17 Jul, 2024
The Multinomial Naive Bayes (MNB) classifier is a popular machine learning algorithm, especially useful for text classification tasks such as spam detection, sentiment analysis, and document categorization. In this article, we discuss the basics of the MNB classifier and how to implement it in R.
What is Naive Bayes?
Naive Bayes is a family of simple probabilistic classifiers based on Bayes' theorem with the "naive" assumption that features are conditionally independent of one another given the class. Despite its simplicity, it often performs surprisingly well for many tasks, particularly those involving text.
Multinomial Naive Bayes
The Multinomial Naive Bayes classifier is specifically designed for handling discrete data. It is most commonly used for document classification problems, where the frequency of each word (i.e., a discrete count of the words) is used as a feature.
How Does Multinomial Naive Bayes Work?
- The algorithm calculates the probability of each word given each class from the training data. It also calculates the prior probability of each class.
- For a new document, it calculates the product of the probabilities of the words in the document given each class and the prior probability of each class. The class with the highest product is chosen as the predicted class.
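The two steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation used by any package: the toy counts, the labels, and the helper names train_mnb and predict_mnb are all hypothetical. It works with sums of log probabilities rather than raw products, which ranks the classes identically while avoiding numerical underflow on long documents.

```r
# Toy word-count matrix: rows are documents, columns are vocabulary terms.
# All values and names here are made up for illustration.
counts <- matrix(
  c(3, 0, 1,
    2, 1, 0,
    0, 4, 1,
    0, 3, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("free", "meeting", "now"))
)
labels <- factor(c("spam", "spam", "ham", "ham"))

# Training: class priors plus add-alpha (Laplace) smoothed P(word | class).
train_mnb <- function(x, y, alpha = 1) {
  class_counts <- rowsum(x, group = y)  # classes x terms, total count per class
  word_prob <- (class_counts + alpha) /
    (rowSums(class_counts) + alpha * ncol(x))
  prior <- table(y)[rownames(class_counts)] / length(y)
  list(classes = rownames(class_counts),
       log_prior = log(as.numeric(prior)),
       log_word_prob = log(word_prob))
}

# Prediction: log prior + sum over words of count * log P(word | class).
predict_mnb <- function(model, newx) {
  scores <- newx %*% t(model$log_word_prob)          # documents x classes
  scores <- sweep(scores, 2, model$log_prior, "+")   # add the class priors
  model$classes[max.col(scores, ties.method = "first")]
}

model <- train_mnb(counts, labels)
new_docs <- matrix(c(2, 0, 1,
                     0, 2, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, colnames(counts)))
pred <- predict_mnb(model, new_docs)
print(pred)
```

The smoothing term alpha keeps a word that never appeared in a class from forcing that class's probability to zero.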
The following steps walk through an implementation of a Naive Bayes classifier for text in the R programming language.
Dataset link: Spam
Step 1: Load Necessary Packages and Dataset
- e1071: Provides functions for Naive Bayes classification (naiveBayes function).
- tm: Used for text mining tasks, such as creating a corpus and preprocessing text data.
- caret: Provides functions for data splitting (createDataPartition) and model evaluation (confusionMatrix).
- read.csv: Loads the dataset from the specified path (adjust the path to wherever spam.csv is stored on your machine).
R
library(e1071)
library(tm)
library(caret)
sms_data <- read.csv("C:\\Users\\Tonmoy\\Downloads\\Dataset\\spam.csv",
                     stringsAsFactors = FALSE)
Step 2: Prepare the Data
Renames the columns of the dataset to "type" and "text" for easier reference.
R
colnames(sms_data) <- c("type", "text")
Step 3: Text Preprocessing
- Re-encodes the "text" column from "latin1" to "UTF-8", dropping any characters that cannot be converted (sub = "").
- Create Corpus: Converts the "text" column into a text corpus using tm::Corpus.
- Preprocess Text: Applies several transformations (tolower, removePunctuation, removeNumbers, removeWords, stripWhitespace) to clean and standardize the text data for analysis.
- Converts the preprocessed corpus into a Document-Term Matrix (DTM) where rows represent documents (text messages) and columns represent terms (words).
- Converts the Document-Term Matrix (DTM) into a regular matrix (dtm_matrix) suitable for modeling purposes.
R
sms_data$text <- iconv(sms_data$text, from = "latin1", to = "UTF-8", sub = "")
corpus <- Corpus(VectorSource(sms_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)
dtm_matrix <- as.matrix(dtm)
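To make the shape of the resulting matrix concrete, here is a small sketch that runs the same tm transformations on three made-up messages. The toy_text vector and the toy_ variable names are hypothetical and not part of the dataset.

```r
library(tm)

# Three invented messages, used only to show what the DTM looks like.
toy_text <- c("Free prize now!!!",
              "Are we meeting at 5?",
              "Free free FREE entry")

toy_corpus <- Corpus(VectorSource(toy_text))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, removeNumbers)
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))
toy_corpus <- tm_map(toy_corpus, stripWhitespace)

# One row per message, one column per remaining term, cells are term counts.
toy_dtm <- as.matrix(DocumentTermMatrix(toy_corpus))
print(toy_dtm)
```

Each cell holds how many times a term occurs in a message, so most cells in a real DTM are zero. Note that DocumentTermMatrix by default keeps only terms of at least three characters.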
Step 4: Split the Data
- Splits the dataset into training and testing sets using caret::createDataPartition.
- Selects rows from dtm_matrix based on train_index to create train_data and test_data.
- Converts train_labels and test_labels to factors to ensure they have the same levels for classification.
R
set.seed(123)
train_index <- createDataPartition(sms_data$type, p = 0.75, list = FALSE)
train_data <- dtm_matrix[train_index, ]
test_data <- dtm_matrix[-train_index, ]
train_labels <- as.factor(sms_data$type[train_index])
test_labels <- as.factor(sms_data$type[-train_index])
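createDataPartition samples within each class, so the class proportions should be roughly the same in both subsets. The sketch below checks this on made-up labels; toy_labels and its 87/13 class counts are assumptions chosen only to mimic an imbalanced dataset.

```r
library(caret)

set.seed(123)
# Hypothetical imbalanced labels: 87 "ham" and 13 "spam".
toy_labels <- factor(rep(c("ham", "spam"), times = c(87, 13)))

# Stratified 75/25 split, as in the step above.
toy_index <- createDataPartition(toy_labels, p = 0.75, list = FALSE)

# Class proportions in the training and test subsets.
train_props <- prop.table(table(toy_labels[toy_index]))
test_props <- prop.table(table(toy_labels[-toy_index]))
print(train_props)
print(test_props)
```

The same check can be run on the real split with prop.table(table(train_labels)) and prop.table(table(test_labels)).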
Step 5: Train the Multinomial Naive Bayes Classifier
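Fits a Naive Bayes model (mnb_model) with e1071::naiveBayes using the training data (train_data) and corresponding labels (train_labels), passing laplace = 1. Note that naiveBayes assumes a Gaussian distribution for numeric predictors and applies the laplace argument only to categorical (factor) predictors, so on a numeric count matrix this call does not fit a true multinomial model and the smoothing has no effect.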
R
mnb_model <- naiveBayes(train_data, train_labels, laplace = 1)
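One way to see what naiveBayes estimates for numeric predictors is to inspect the fitted tables on a tiny matrix. The sketch below uses made-up counts (toy_x and toy_y are hypothetical names). For numeric columns each table holds a per-class mean and standard deviation, which are Gaussian parameters rather than smoothed word probabilities.

```r
library(e1071)

# Invented word counts for four messages and two terms.
toy_x <- matrix(c(3, 0,
                  2, 1,
                  0, 4,
                  0, 3),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("free", "meeting")))
toy_y <- factor(c("spam", "spam", "ham", "ham"))

toy_fit <- naiveBayes(toy_x, toy_y, laplace = 1)

# For a numeric predictor: column 1 is the class mean, column 2 the class sd.
print(toy_fit$tables$free)
```

A term that never occurs in one class gets a mean and standard deviation of zero for that class, which is one reason Gaussian Naive Bayes can behave badly on sparse document-term matrices.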
Step 6: Make Predictions
Uses the trained model (mnb_model) to make predictions (test_pred) on the test data (test_data).
R
test_pred <- predict(mnb_model, test_data)
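For comparison, a count-based multinomial model can be fitted with the separate naivebayes package, which the article itself does not use. The sketch below is an assumption-laden illustration on made-up counts: it assumes naivebayes is installed, and toy_counts, toy_class, and toy_new are hypothetical names.

```r
library(naivebayes)  # assumption: installed separately; not used elsewhere in this article

# Invented word counts: rows are messages, columns are terms.
toy_counts <- matrix(c(3, 0, 1,
                       2, 1, 0,
                       0, 4, 1,
                       0, 3, 2),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(NULL, c("free", "meeting", "now")))
toy_class <- factor(c("spam", "spam", "ham", "ham"))

# Multinomial Naive Bayes with add-one (Laplace) smoothing on the counts.
toy_mnb <- multinomial_naive_bayes(x = toy_counts, y = toy_class, laplace = 1)

# Classify a new message with the same term columns.
toy_new <- matrix(c(2, 0, 1), nrow = 1,
                  dimnames = list(NULL, colnames(toy_counts)))
toy_pred <- predict(toy_mnb, newdata = toy_new, type = "class")
print(toy_pred)
```

With the real data, the same two calls would take train_data and train_labels for fitting and test_data for prediction, provided the matrices keep their term column names.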
Step 7: Evaluate the Model
- Computes the confusion matrix (conf_matrix) to evaluate the performance of the Naive Bayes classifier on the test data (test_pred vs. test_labels).
- Prints the confusion matrix along with metrics such as accuracy, kappa, sensitivity, specificity, and the positive and negative predictive values.
R
conf_matrix <- confusionMatrix(test_pred, test_labels)
print(conf_matrix)
Output:
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham     0    0
      spam 1206  186

               Accuracy : 0.1336
                 95% CI : (0.1162, 0.1526)
    No Information Rate : 0.8664
    P-Value [Acc > NIR] : 1

                  Kappa : 0

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.0000
            Specificity : 1.0000
         Pos Pred Value : NaN
         Neg Pred Value : 0.1336
             Prevalence : 0.8664
         Detection Rate : 0.0000
   Detection Prevalence : 0.0000
      Balanced Accuracy : 0.5000

       'Positive' Class : ham
The model labeled every test message as "spam". With "ham" as the positive class, 0 "ham" messages were correctly predicted as "ham" (true positives), 186 "spam" messages were correctly predicted as "spam" (true negatives), and 1206 "ham" messages were incorrectly predicted as "spam" (false negatives).
- Accuracy: 13.36% of all predictions were correct (186 of 1392).
- 95% Confidence Interval (CI): True accuracy likely falls between 11.62% and 15.26%.
- No Information Rate (NIR): A model that always predicts the majority class ("ham") would achieve 86.64% accuracy.
- Kappa Statistic: Indicates no agreement beyond chance (Kappa = 0).
- Sensitivity (True Positive Rate): 0% sensitivity means none of the "ham" messages (the positive class) were correctly identified.
- Specificity (True Negative Rate): 100% specificity means all "spam" messages were correctly identified.
- Positive Predictive Value (Precision): Not a Number (NaN), because the model made no positive ("ham") predictions.
- Negative Predictive Value (NPV): 13.36% of the messages predicted as "spam" were actually "spam".
- Prevalence: 86.64% of the test messages were "ham".
- Detection Rate: 0% of the test messages were correctly identified as "ham".
- Balanced Accuracy: 50%, the average of sensitivity and specificity, which is the value obtained by a classifier that always predicts a single class.
- Positive Class: "ham" was treated as the positive class for the metric calculations.
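These figures can be recomputed by hand from the four counts in the confusion matrix. The base R sketch below does so, with "ham" as the positive class as in the output, and then adds precision, recall, and F1 with "spam" as the positive class. The variable names are illustrative.

```r
# Counts taken from the confusion matrix above (rows: prediction, columns: reference).
cm <- matrix(c(0, 1206,
               0, 186),
             nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference = c("ham", "spam")))

accuracy <- sum(diag(cm)) / sum(cm)                     # 186 / 1392
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])      # share of ham found
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])   # share of spam found
npv <- cm["spam", "spam"] / sum(cm["spam", ])           # predicted spam that is spam
prevalence <- sum(cm[, "ham"]) / sum(cm)                # share of ham in the test set
balanced_accuracy <- (sensitivity + specificity) / 2

# Precision, recall, and F1 when "spam" is treated as the positive class.
spam_precision <- cm["spam", "spam"] / sum(cm["spam", ])
spam_recall <- cm["spam", "spam"] / sum(cm[, "spam"])
spam_f1 <- 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, npv = npv,
              prevalence = prevalence,
              balanced_accuracy = balanced_accuracy), 4))
```

Perfect recall for "spam" here is not a sign of a good model: it follows directly from predicting "spam" for every message.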
Conclusion
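The Multinomial Naive Bayes classifier provides a straightforward approach to text classification tasks like spam detection, but its performance depends heavily on how the model is fitted and on the quality and balance of the dataset. In this case, the classifier labeled every test message as "spam" and so failed to identify any "ham" messages. Class imbalance is unlikely to be the whole explanation, since the model predicted the minority class. A more likely contributor is that e1071::naiveBayes models numeric count columns with a Gaussian distribution, which suits sparse word counts poorly. Improving its effectiveness might involve fitting a true multinomial model, converting the counts to categorical features, refining the text preprocessing, or exploring more advanced algorithms.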