
Naive Bayes Text Classification Guide

This document summarizes the process of building a Naive Bayes classifier to perform text classification on the 20 Newsgroups dataset. It involves preprocessing the text by converting to lowercase, removing punctuation, numbers and stopwords. Then it creates document-term matrices for training and test data, builds a Naive Bayes model on the training data, makes predictions on the test data and evaluates the model using a confusion matrix.

Uploaded by

brahmesh_sm

# Text classification using a Naive Bayes scheme
# Data : 20 Newsgroups
# Download link : [Link]

# Load all the required libraries. Note : Packages need to be installed first.
library(dplyr)
library(caret)
library(tm)
library(RTextTools)
library(doMC)
library(e1071)
registerDoMC(cores=detectCores())
# Load data.
# We will use the 'train-all-terms' file, which contains roughly 11300 messages.
# Read the file as a data frame with two columns: topic and text.
[Link] <- read.table("[Link]", header=FALSE, sep="\t", quote="",
                     stringsAsFactors=FALSE, col.names = c("topic", "text"))

# Preview the dataframe
# head([Link]) # or use View([Link])
# How many messages does each of the 20 categories contain?
table([Link]$topic)
# Convert the topic variable to a factor
[Link]$topic <- as.factor([Link]$topic)

# Randomize: shuffle the rows, using a fixed seed so the shuffle is reproducible.
set.seed(2016)
[Link] <- [Link][sample(nrow([Link])), ]
[Link] <- [Link][sample(nrow([Link])), ]
# Create corpus of the entire text
corpus <- Corpus(VectorSource([Link]$text))

# Total size of the corpus


length(corpus)

# Inspect the corpus


inspect(corpus[1:5])
# Tidy up the corpus using the 'tm_map' function. Apply the following transformations:
# convert to lower case, remove punctuation and numbers, remove common English stop
# words (such as "his" and "our") using the stopwords() function, and collapse extra
# white space.
# Note: punctuation is stripped before stop words are removed, so contractions such as
# "hadn't" or "couldn't" become "hadnt" or "couldnt" and no longer match the stop-word list.
# Use the 'dplyr' package's pipe operator (%>%) to chain the steps neatly.
[Link] <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords(kind="en")) %>%
  tm_map(stripWhitespace)
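# A minimal sketch, using base R only, of roughly what the cleaning steps above do to
# a single string. The toy sentence and the short stop-word list are hypothetical, the
# gsub() patterns only approximate the 'tm' transformations, and the final trimws()
# is added here for readability (it is not part of the pipeline above).
toy_text <- "He hadn't seen 42 of OUR files!"
toy_stopwords <- c("he", "of", "our", "hadn't")
toy_clean <- tolower(toy_text)                               # lower case
toy_clean <- gsub("[[:punct:]]+", "", toy_clean)             # drop punctuation
toy_clean <- gsub("[[:digit:]]+", "", toy_clean)             # drop numbers
toy_pattern <- sprintf("\\b(%s)\\b", paste(toy_stopwords, collapse = "|"))
toy_clean <- gsub(toy_pattern, "", toy_clean, perl = TRUE)   # drop stop words
toy_clean <- trimws(gsub("[[:space:]]+", " ", toy_clean))    # collapse white space
print(toy_clean)  # the contraction survives as "hadnt" because its apostrophe was removed first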
# Create document term matrix
dtm <- DocumentTermMatrix([Link])
dim(dtm)
# Create a 75:25 data partition: the first 8470 rows are used for training and the
# remaining 2823 rows (8471 to 11293) for testing.

[Link] <- [Link][1:8470,]
[Link] <- [Link][8471:11293,]

[Link] <- dtm[1:8470,]
[Link] <- dtm[8471:11293,]
dim([Link])
[Link] <- [Link][1:8470]
[Link] <- [Link][8471:11293]
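# A hedged sketch of how the same 75:25 cut-off could be derived from the row count
# instead of being hard-coded. 'toy_n' is an assumption taken from the indices used
# above, and all 'toy_' names are hypothetical.
toy_n <- 11293
toy_n_train <- round(0.75 * toy_n)            # number of training rows
toy_train_idx <- seq_len(toy_n_train)         # rows 1 .. toy_n_train
toy_test_idx <- seq(toy_n_train + 1, toy_n)   # the remaining rows
c(train = length(toy_train_idx), test = length(toy_test_idx))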
# Find frequent words which appear five times or more

fivefreq <- findFreqTerms([Link], 5)


length(fivefreq)
dim([Link])
# Build DTMs using the fivefreq terms only, which reduces the number of features to
# length(fivefreq). system.time() reports how long each step takes.
system.time( [Link] <- DocumentTermMatrix([Link], control = list(dictionary=fivefreq)) )
system.time( [Link] <- DocumentTermMatrix([Link], control = list(dictionary=fivefreq)) )
# Convert word counts (0 or more) to presence or absence ("Yes" or "No") for each word
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
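# A minimal sketch of what convert_count does, applied to a hypothetical toy vector of
# word counts. The fallback definition below only exists so that this snippet can run
# on its own; it mirrors the function defined above.
if (!exists("convert_count")) {
  convert_count <- function(x) {
    factor(ifelse(x > 0, 1, 0), levels=c(0,1), labels=c("No", "Yes"))
  }
}
toy_counts <- c(0, 3, 1, 0)               # hypothetical counts of one word in four documents
toy_flags <- convert_count(toy_counts)    # any positive count becomes "Yes"
print(toy_flags)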
# Apply yes/no function to get final training and testing dtms
system.time( [Link] <- apply([Link], 2, convert_count) )
system.time( [Link] <- apply([Link], 2, convert_count) )
# Build the NB classifier
system.time( ngclassifier <- naiveBayes([Link], [Link]$topic) )
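# A hedged, hand-rolled sketch of the calculation a Naive Bayes classifier performs on
# yes/no word features: posterior is proportional to prior times the product of the
# per-word likelihoods. The two classes, two words and all probabilities are made-up
# toy values, and this is not the 'e1071' implementation.
toy_prior <- c(sci = 0.5, rec = 0.5)                   # P(class)
toy_p_yes <- rbind(sci = c(orbit = 0.8, game = 0.1),   # P(word present | class)
                   rec = c(orbit = 0.2, game = 0.7))
toy_doc <- c(orbit = TRUE, game = FALSE)               # words present in one document
# Likelihood per class: use p where the word is present, 1 - p where it is absent
toy_lik <- apply(toy_p_yes, 1, function(p) prod(ifelse(toy_doc, p, 1 - p)))
toy_post <- toy_prior * toy_lik / sum(toy_prior * toy_lik)   # normalised posterior
print(round(toy_post, 3))
names(which.max(toy_post))   # the predicted class is the one with the largest posterior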

# Make predictions on the test set
system.time( predictions <- predict(ngclassifier, newdata=[Link]) )
predictions
# Evaluate the predictions against the true topics using a confusion matrix
cm <- confusionMatrix(predictions, [Link]$topic)
cm
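# A minimal base-R sketch of how overall accuracy follows from a confusion matrix:
# correct predictions sit on the diagonal. The toy labels below are hypothetical.
# For the caret object above, the same figure should be available as
# cm$overall["Accuracy"].
toy_levels <- c("a", "b", "c")
toy_pred <- factor(c("a", "b", "a", "c", "b", "c"), levels = toy_levels)
toy_true <- factor(c("a", "b", "c", "c", "b", "a"), levels = toy_levels)
toy_cm <- table(Prediction = toy_pred, Reference = toy_true)
toy_accuracy <- sum(diag(toy_cm)) / sum(toy_cm)   # share of correct predictions
print(toy_cm)
print(toy_accuracy)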
