Naive Bayes Text Classification Guide

This document summarizes the process of building a Naive Bayes classifier to perform text classification on the 20 Newsgroups dataset. It involves preprocessing the text by converting to lowercase, removing punctuation, numbers and stopwords. Then it creates document-term matrices for training and test data, builds a Naive Bayes model on the training data, makes predictions on the test data and evaluates the model using a confusion matrix.
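The preprocessing and document-term-matrix steps summarized above can be sketched in base R on a toy corpus. This is only an illustration under stated assumptions: the three documents, the four-word stopword list, and the names `clean_text` and `dtm_toy` are made up here, and the script itself relies on the `tm` package rather than on this hand-rolled code.

```r
# Toy corpus; the documents and the stopword list are invented for illustration.
docs <- c("The cat sat on the mat in 2016!",
          "Dogs and cats: 2 pets, 1 home.",
          "The dog chased the cat.")
stop_words <- c("the", "on", "in", "and")  # tiny stand-in for a full English list

# Hypothetical helper: lower-case, strip punctuation and digits, drop stopwords.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                        # change to lower case
  x <- gsub("[[:punct:]]", " ", x)       # replace punctuation with spaces
  x <- gsub("[[:digit:]]", " ", x)       # replace digits with spaces
  tokens <- strsplit(x, "[[:space:]]+")  # split on runs of white space
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stop_words)])
}

tokens <- clean_text(docs, stop_words)

# Document-term matrix: one row per document, one column per vocabulary term,
# each cell holding how often the term occurs in the document.
vocab <- sort(unique(unlist(tokens)))
dtm_toy <- t(vapply(tokens,
                    function(tok) as.integer(table(factor(tok, levels = vocab))),
                    integer(length(vocab))))
dimnames(dtm_toy) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm_toy)
```

Each row of `dtm_toy` plays the role of one message and each column the role of one feature, which is the shape the classifier is trained on later.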

Uploaded by

brahmesh_sm

# Text classification using a Naive Bayes scheme
# Data : 20 Newsgroups
# Download link : [Link]

# Load all the required libraries. Note : Packages need to be installed first.
library(dplyr)
library(caret)
library(tm)
library(RTextTools)
library(doMC)
library(e1071)
registerDoMC(cores=detectCores())
# Load data.
# We will use the 'train-all-terms' file, which contains about 11,300 messages.
# Read file as a dataframe
[Link] <- [Link]("[Link]", header=FALSE, sep="\t", quote="",
stringsAsFactors=FALSE, [Link] = c("topic", "text"))

# Preview the dataframe

# head([Link]) # or use View([Link])
# How many messages does each of the 20 categories contain?
table([Link]$topic)
# Read topic variable as a factor variable
[Link]$topic <- [Link]([Link]$topic)

# Randomize : Shuffle rows randomly.

[Link](2016)
[Link] <- [Link][sample(nrow([Link])), ]
[Link] <- [Link][sample(nrow([Link])), ]
# Create corpus of the entire text
corpus <- Corpus(VectorSource([Link]$text))

# Total size of the corpus

length(corpus)

# Inspect the corpus

inspect(corpus[1:5])
# Tidy up the corpus using the 'tm_map' function. Make the following transformations on the corpus:
# change to lower case, remove numbers, punctuation and extra white space. We also eliminate common
# English stop words like "his", "our", "hadn't", "couldn't", etc. using the stopwords() function.
# Use 'dplyr' package's excellent pipe utility to do this neatly
[Link] <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords(kind="en")) %>%
  tm_map(stripWhitespace)
# Create document term matrix
dtm <- DocumentTermMatrix([Link])
dim(dtm)
# Create a 75:25 data partition: rows 1-8470 for training and rows 8471-11293 for testing.
# Note : 5000 (~50% of the entire set) messages were used for this analysis.

[Link] <- [Link][1:8470,]

[Link] <- [Link][8471:11293,]

[Link] <- dtm[1:8470,]

[Link] <- dtm[8471:11293,]
dim([Link])
[Link] <- [Link][1:8470]
[Link] <- [Link][8471:11293]
# Find frequent words which appear five times or more

fivefreq <- findFreqTerms([Link], 5)

length(fivefreq)
dim([Link])
# Build dtm using fivefreq words only. Reduce number of features to length(fivefreq)
[Link]( [Link] <- DocumentTermMatrix([Link], control =
list(dictionary=fivefreq)) )
[Link]( [Link] <- DocumentTermMatrix([Link], control =
list(dictionary=fivefreq)) )
# Convert word counts (0 or more) to presence or absence (yes or no) for each word
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}
# Apply yes/no function to get final training and testing dtms
[Link]( [Link] <- apply([Link], 2, convert_count) )
[Link] ( [Link] <- apply([Link], 2, convert_count) )
# Build the NB classifier
[Link] (ngclassifier <- naiveBayes([Link], [Link]$topic))

# Make predictions on the test set

[Link]( predictions <- predict(ngclassifier, newdata=[Link]) )
predictions
cm <- confusionMatrix(predictions, [Link]$topic )
cm
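
The last steps of the script above (presence/absence conversion, Naive Bayes training, prediction, and the confusion matrix) can be illustrated with a small base R sketch. Everything in it is an assumption made for illustration: the toy counts, the two class labels, the helper names `train_nb` and `predict_nb`, the use of numeric 0/1 features instead of "No"/"Yes" factors, and Laplace smoothing of 1 (the `naiveBayes()` call in the script uses that function's default settings). It is not the `e1071` or `caret` implementation.

```r
# Toy word counts (invented): rows are documents, columns are terms.
train_counts <- matrix(c(3, 0, 1,
                         1, 2, 0,
                         0, 1, 4,
                         0, 0, 2),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(NULL, c("hockey", "team", "orbit")))
train_y <- factor(c("sport", "sport", "space", "space"))

# Presence/absence: any count above zero becomes 1, otherwise 0.
train_x <- (train_counts > 0) * 1

# Hypothetical trainer: class priors plus smoothed P(term present | class).
train_nb <- function(x, y, laplace = 1) {
  classes <- levels(y)
  prior <- as.numeric(table(y)) / length(y)
  names(prior) <- classes
  p_yes <- t(sapply(classes, function(cl) {
    rows <- x[y == cl, , drop = FALSE]
    (colSums(rows) + laplace) / (nrow(rows) + 2 * laplace)
  }))
  list(classes = classes, prior = prior, p_yes = p_yes)
}

# Hypothetical predictor: pick the class with the largest log posterior,
# treating terms as independent given the class (the "naive" assumption).
predict_nb <- function(model, x) {
  log_post <- vapply(model$classes, function(cl) {
    p <- model$p_yes[cl, ]
    as.vector(log(model$prior[[cl]]) + x %*% log(p) + (1 - x) %*% log(1 - p))
  }, numeric(nrow(x)))
  log_post <- matrix(log_post, nrow = nrow(x),
                     dimnames = list(NULL, model$classes))
  factor(model$classes[max.col(log_post, ties.method = "first")],
         levels = model$classes)
}

model <- train_nb(train_x, train_y)

# Toy test documents, already in presence/absence form, with made-up labels.
test_x <- matrix(c(1, 1, 0,
                   0, 0, 1,
                   1, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, colnames(train_x)))
test_y <- factor(c("sport", "space", "space"), levels = levels(train_y))

pred <- predict_nb(model, test_x)

# Confusion matrix: predictions in rows, reference labels in columns.
cm_toy <- table(Prediction = pred, Reference = test_y)
accuracy <- sum(diag(cm_toy)) / sum(cm_toy)
print(cm_toy)
print(accuracy)
```

The diagonal of `cm_toy` counts correct predictions, so overall accuracy is the diagonal sum divided by the total number of test documents.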
