A Project Report
On
EMAIL SPAM CLASSIFICATION
Submitted By:
Aayush Paudel (10074/073)
Bishal Chapagain (10086/073)
Giriraj Khanal (10091/073)
Satish Kandel (10122/073)
Submitted To:
Department of Computer Science and Information Technology
Amrit Science Campus
Lainchaur, Kathmandu, Nepal
April 2021
EMAIL SPAM CLASSIFICATION
Submitted By:
Aayush Paudel (10074/073)
Bishal Chapagain (10086/073)
Giriraj Khanal (10091/073)
Satish Kandel (10122/073)
April 2021
ACKNOWLEDGEMENTS
Firstly, we would like to express our sincere gratitude to our supervisor, Mr. Balkrishna Subedi, for his continuous support throughout this project. His guidance and motivation helped us throughout the study and research process.
A very special thanks goes to our Head of Department, Mr. Hikmat Rokaya, for giving us this opportunity to undertake this project and for believing in us.
Last but not least, we would like to express our sincere thanks to all our friends who helped us either directly or indirectly during this project. The quality time spent with our teachers and friends will remain forever in our hearts.
Group Members:
Aayush Paudel (10074/073)
Bishal Chapagain (10086/073)
Giriraj Khanal (10091/073)
Satish Kandel (10122/073)
ABSTRACT
Spam email is one of the biggest problems in today's world of the Internet. Spam emails not only affect organizations financially but are also a major problem for individual email users. An email labeled as spam is nothing but an advertisement for some company or product, or some kind of virus, received in the email client's mailbox without the user's prior expectation. Another aspect of the problem is that, due to the large volume of incoming emails, it is very difficult to separate important emails from spam. To support ease of access, emails need to be categorized based on the type of information they contain, which lets the user know the content before even opening the email. To solve this problem, different spam email filtering techniques and algorithms are used to protect our mailboxes from spam emails.
In this project, we use a Naive Bayes classifier for spam email classification, enhanced with Laplace smoothing for numerical stability. The Naive Bayes classifier is a very simple and efficient method for spam classification. We use a dataset from the Apache SpamAssassin site for the classification of spam and non-spam email. The frequency of words, arranged in a full-matrix table, is used as the feature set for the model, with the aim of improving the accuracy of the resulting system.
TABLE OF CONTENTS
Acknowledgement..........................................................................................................i
ABSTRACT..................................................................................................................ii
List of Figures................................................................................................................v
List of Tables................................................................................................................vi
List of Abbreviations...................................................................................................vii
Chapter 1: Introduction..................................................................................................1
1.1 Background..............................................................................................................1
1.3 Objective..................................................................................................................2
1.5 Applications.............................................................................................................2
2.1.2 Tokenization..................................................................................................3
3.3.1 Technical Feasibility.....................................................................................9
4.1 Flowchart...............................................................................................................11
4.3 Pre-processing........................................................................................................12
Chapter 5: Implementation..........................................................................................15
5.2 Testing...................................................................................................................16
5.3 Dataset...................................................................................................................17
6. Experimental Result.................................................................................................20
References....................................................................................................................24
Appendices..................................................................................................................25
List of Figures
List of Tables
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 Background
Due to the wide use of the internet for communication, it has become a popular medium for advertising and marketing. Although sending emails through the network is quick and cost-effective, it has given rise to another problem in today's internet world: sending bulk or unsolicited emails to numerous users [1]. Email spam is a form of electronic spam in which bulk, unnecessary or junk mail consisting of duplicate emails is sent to recipients who did not ask for it, while the identity of the sender is hidden.
Email spam has three defining properties: anonymity, mass mailing and unsolicited email.
Anonymity is the property of hiding the identity and whereabouts of the email sender. Mass mailing is the sending of identical emails in bulk to a large number of groups, and unsolicited emails are emails sent to recipients who did not request them [2].
It is estimated that 70 percent of all emails sent globally are spam, and the volume of spam continues to grow because spam remains a lucrative business [2]. It is important to stop as much spam as we can, to protect the network from the many risks it carries: viruses, phishing attacks, compromised web links and other malicious content. Spam filters also protect our servers from being overloaded with non-essential emails, and from the worse problem of being infected with spamming software that may turn them into spam servers themselves. We have chosen this project of building a spam classifier, which acts as a key first line of defense.
1.3 Objective
The goal of this project is to construct an email spam filter using a machine learning technique: the Naive Bayes classifier.
This project considers the email body and uses a bag-of-words approach to classify the email. It looks at each word in isolation and keeps track of the frequency of each word, which is used as the feature for the Naive Bayes algorithm during training. Although the classifier does a good job of classifying emails, the context is lost while constructing the features. Also, since the header of the email is stripped out, the classifier does not consider other signals that an email may be spam, such as the routing information, including the sender's IP address.
1.5 Applications
CHAPTER 2
LITERATURE REVIEW
There has been much research and discussion on the topic of email spam classification. Many research papers have been published and numerous algorithms have been developed to resolve the issue of spam email adulterating the inboxes of users. Algorithms such as K-means clustering, Logistic Regression and Decision Trees have been used to classify received emails.
In the paper [3], the authors highlight several features contained in the email header which can be used to identify and classify spam messages efficiently. Those features are selected based on their performance in detecting spam messages. Although header information provides a lot of useful insight for detecting spam emails, it does not take into consideration the other part of the email, which is the email body.
Spam filtering, when implemented in the recipient's mailbox, can significantly improve the user's performance in terms of time and effort, as it eliminates the need to manually mark emails as spam. Using a simple probabilistic approach such as the Naive Bayes rule, we can automatically infer the class of an email as spam or ham with high accuracy. Our project is based on this very approach and aims to separate spam emails from ham emails.
Since the email body contains a lot of HTML tags which do not provide additional information about the content of the email, we use the BeautifulSoup package in Python to remove all the HTML tags and get a clean email body.
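A minimal sketch of this step (the function name and the choice of parser are ours; the project's actual call may differ slightly):

from bs4 import BeautifulSoup

def strip_html(raw_body):
    # Parse the email body and keep only the visible text, dropping every HTML tag
    soup = BeautifulSoup(raw_body, 'html.parser')
    return soup.get_text()

print(strip_html('<html><body><p>Win a <b>FREE</b> prize!</p></body></html>'))
# -> Win a FREE prize!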
2.1.2 Tokenization
Tokenization is a way of separating a piece of text into smaller units called tokens.
Here, tokens can be either words, characters, or sub-words. Hence, tokenization can
be broadly classified into 3 types: word, character, and subword (n-gram characters)
tokenization.
The most common way of forming tokens is based on space. Assuming space as a delimiter, tokenizing the sentence "Computer Science Domain" results in 3 tokens: Computer, Science and Domain. As each token is a word, this is an example of word tokenization.
As tokens are the building blocks of Natural Language, the most common way of
processing the raw text happens at the token level. Tokenization is the foremost step
while modeling text data. Tokenization is performed on the corpus to obtain tokens.
These tokens are then used to prepare a vocabulary. Vocabulary refers to the
set of unique tokens in the corpus. Vocabulary can be constructed by considering
each unique token in the corpus or by considering the top K Frequently Occurring
Words.
The major drawback of word tokenization appears when dealing with Out Of Vocabulary (OOV) words, i.e., words that appear in new emails but were never seen in the training corpus.
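As a small illustrative sketch of word tokenization with NLTK (using the example sentence above):

from nltk.tokenize import word_tokenize

# Requires a one-time download of the tokenizer models: nltk.download('punkt')
sentence = "Computer Science Domain"
tokens = word_tokenize(sentence)
print(tokens)  # ['Computer', 'Science', 'Domain']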
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search
engine has been programmed to ignore, both when indexing entries for searching and
when retrieving them as the result of a search query [4].
We would not want these words to take up space in our database or to take up valuable processing time. We can remove them easily by storing a list of the words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stop words stored for 16 different languages.
Stop words do not contribute much to determining whether an email is spam or not. Thus, it is desirable to remove them from the email body during preprocessing. We have used the NLTK package to remove the stop words of the English language.
Furthermore, we have also removed punctuation from the email body, as it carries little importance for us.
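A minimal sketch of these two steps (the sample text is our own):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires one-time downloads: nltk.download('stopwords'); nltk.download('punkt')
stop_words = set(stopwords.words('english'))
tokens = word_tokenize("this is an example of a spam email body !".lower())
# Keep only alphabetic tokens that are not stop words (this also drops punctuation)
filtered = [t for t in tokens if t not in stop_words and t.isalpha()]
print(filtered)  # ['example', 'spam', 'email', 'body']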
Stemming is the process of reducing inflection in words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language [5].
The stem (root) is the part of the word to which inflectional (changing/deriving) affixes such as -ed, -ize, -s, de- and mis- are added. So, stemming a word or sentence may result in words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.
Here we have used Python's NLTK package to stem the words. Specifically, we have used the PorterStemmer class, which uses suffix stripping to produce stems. PorterStemmer is known for its simplicity and speed.
Example: Connections, Connecting, Connection and Connected are all mapped to the stem word Connect.
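A minimal sketch of this step with NLTK's PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connections", "connecting", "connection", "connected"]:
    print(word, "->", stemmer.stem(word))  # each of these prints '... -> connect'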
Application of Stemming:
Initially, for the collection of stemmed words in each document, we count the occurrences of each unique word and create a sparse matrix, which is a pandas DataFrame with the columns Document_ID, Word_ID, Label and Occurrences.
After this, we create a full matrix, which is also a pandas DataFrame consisting of Document_ID, Label and the 2500 most frequent Word_IDs as columns. Each cell value below a Word_ID holds the number of occurrences of the corresponding word in the given Document_ID.
We do this for both the training data (70% of the total emails) and the test data (the remaining 30%), as sketched below.
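A minimal sketch of such a split, assuming scikit-learn is available (the tiny full_matrix below is only a stand-in so the snippet runs on its own; the project builds the real matrix from the corpus):

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in full matrix: a CATEGORY label plus two word-ID columns
full_matrix = pd.DataFrame({'CATEGORY': [1, 0, 1, 0, 1, 0],
                            0: [3, 0, 1, 0, 2, 0],
                            1: [0, 2, 0, 1, 0, 3]})
X = full_matrix.drop(columns=['CATEGORY'])
y = full_matrix['CATEGORY']
# 70% of the emails for training, the remaining 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)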
CHAPTER 3
REQUIREMENT ANALYSIS AND FEASIBILITY STUDY
The corpus obtained from the Apache SpamAssassin site is used for both the training and testing phases of model formation.
The DFD is one of the popular system analysis and design tools. It is a graphical representation of the flow of data in a system. Processes, data flows, data stores and external entities are the major components used in a DFD to represent a system.
Figure 3.2.1: Level-0 DFD
Figure 3.2.3: Level-2 DFD
3.3 Feasibility Study
3.3.1 Technical Feasibility
The system we are trying to develop is technologically feasible. We are using Python as the programming language. We use PyCharm CE as the IDE, mainly for testing and deployment. For the majority of the tasks, we use Jupyter Notebook, until the final machine learning model is developed.
The user-friendly interface provided by WTForms, HTML and CSS makes it easy even for a non-technical user to test an email using any web browser. Also, the response is quick.
3.3.3 Economical Feasibility
The system we are trying to build is also economically feasible. All the tools required and the data collected are publicly available and open source. Since this is a small project, we deploy our model locally on our machine using the Flask package. Flask is a lightweight web application microframework.
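As an illustrative sketch only (the route, the form field and the classify_email helper below are hypothetical, not the project's actual code, and a plain HTML form is used here instead of WTForms for brevity), a minimal Flask app for checking an email might look like this:

from flask import Flask, request, render_template_string

app = Flask(__name__)

def classify_email(body):
    # Placeholder for the trained Naive Bayes model; always answers "ham" here
    return 0

@app.route('/', methods=['GET', 'POST'])
def predict():
    if request.method == 'POST':
        email_body = request.form.get('email_body', '')
        label = 'Spam' if classify_email(email_body) == 1 else 'Not Spam'
        return render_template_string('<p>Result: {{ label }}</p>', label=label)
    return render_template_string(
        '<form method="post">'
        '<textarea name="email_body"></textarea>'
        '<button type="submit">Check</button></form>')

if __name__ == '__main__':
    app.run(debug=True)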
We followed the schedule below during our project planning and implementation:
Figure 3.4.1: Gantt chart showing activities and time duration of project
CHAPTER 4
SYSTEM DESIGN
4.1 Flowchart
5. Put all the pieces together to build a standalone classifier and test it on fresh email data.
4.3 Pre-processing
Data in the real world is never clean, complete or in the desired format. Thus, the first step is to preprocess the data, which involves extracting the email body from the entire email and removing blank emails and emails containing bad data.
Then, for each clean email body, we convert it to lowercase, tokenize the sentences to obtain words, remove the stop words, stem the words and remove the punctuation to get the data into the desired format for further analysis.
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The gist of the classifier is based on Bayes' theorem.
The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes uses the bag-of-words approach: every word in the email is treated independently (each word is looked at in isolation). This method does not give importance to the sequence of words, so the context is lost. Hence the name: Naive.
The Naive Bayes model is easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to perform well even when compared with far more sophisticated classification methods.
Using this algorithm and assuming that the words are independent, we calculate the joint probability of the words in an email or document.
The conditional probability of each word in the email is used to determine whether the word is more likely to appear in spam or in ham; a standard way to write it is given below.
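In the notation used here (ours, following Bayes' rule with the Laplace smoothing mentioned in the abstract; the report's own expression may be written slightly differently):

P(Spam|Word) = P(Word|Spam) * P(Spam) / P(Word)

where, with smoothing constant alpha (typically 1) and vocabulary size V,

P(Word|Spam) = (occurrences of Word in spam emails + alpha) / (total words in spam emails + alpha * V)

and P(Ham|Word) is computed in the same way from the ham emails.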
We then make the decision based on the conditional probability and joint probability: if the joint probability of the email being spam is greater than the joint probability of it being ham, the email is classified as spam.
For example, consider an email whose body, after preprocessing, reduces to the tokens:
[Hello, Strange]
We then make the feature matrix out of these words by counting the occurrences of these words in the spam and ham email contexts.
Finally, we calculate the joint probability of the email being spam as:
P(Spam|Hello) * P(Spam|Strange)
and the joint probability of the email being ham as:
P(Ham|Hello) * P(Ham|Strange)
If the former is greater, this email is more likely to be spam than ham.
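A minimal sketch of this decision (the counts and priors below are invented purely for illustration, and the snippet is written in the usual P(word|class)-with-priors form with Laplace smoothing):

# Invented training counts, for illustration only
spam_counts = {'hello': 30, 'strange': 120}    # word occurrences in spam emails
ham_counts = {'hello': 300, 'strange': 10}     # word occurrences in ham emails
total_spam, total_ham, vocab_size, alpha = 10000, 20000, 2500, 1
prior_spam, prior_ham = 0.3, 0.7               # prior class probabilities

def word_prob(word, counts, total_words):
    # P(word | class) with Laplace smoothing
    return (counts.get(word, 0) + alpha) / (total_words + alpha * vocab_size)

tokens = ['hello', 'strange']
joint_spam, joint_ham = prior_spam, prior_ham
for t in tokens:
    joint_spam *= word_prob(t, spam_counts, total_spam)
    joint_ham *= word_prob(t, ham_counts, total_ham)

print('Spam' if joint_spam > joint_ham else 'Ham')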
CHAPTER 5
IMPLEMENTATION
5.1 Feature Extraction
For each email we want to classify as either spam or ham, the email has to go through a series of modules, after which we get the features that we can feed into the algorithm to make the prediction.
The full matrix is the final representation of the features for our data. Below is the Python code, which takes the sparse matrix as a parameter along with the number of tokens and the column indices of the document ID, word ID and frequency.
def make_full_matrix(sparse_matrix, nr_words, doc_idx=0, word_idx=1, freq_idx=3):
    # sparse_matrix: numpy array of rows [DOC_ID, WORD_ID, LABEL, OCCURRENCE]
    column_names = ['DOC_ID'] + list(range(nr_words))
    full_matrix = pd.DataFrame(0, index=sorted(set(sparse_matrix[:, doc_idx])),
                               columns=column_names)
    for i in range(sparse_matrix.shape[0]):
        doc_id = sparse_matrix[i][doc_idx]
        word_id = sparse_matrix[i][word_idx]
        occurrence = sparse_matrix[i][freq_idx]
        full_matrix.at[doc_id, 'DOC_ID'] = doc_id
        # full_matrix.at[doc_id, 'CATEGORY'] = label
        full_matrix.at[doc_id, word_id] = occurrence
    full_matrix.set_index('DOC_ID', inplace=True)
    return full_matrix
Using this feature matrix, we can easily calculate the conditional probability of each token appearing in spam and in ham, which forms the basis for the joint probability calculation.
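A minimal sketch of that calculation (the tiny full_train frame here is a stand-in; in the project the full matrix carries the 2500 most frequent word IDs as columns and a CATEGORY label per document):

import pandas as pd

# Stand-in training full matrix: CATEGORY (1 = spam, 0 = ham) plus two word-ID columns
full_train = pd.DataFrame({'CATEGORY': [1, 1, 0], 0: [2, 1, 0], 1: [0, 1, 3]})

alpha = 1  # Laplace smoothing constant
word_cols = [c for c in full_train.columns if c != 'CATEGORY']
spam_words = full_train[full_train.CATEGORY == 1][word_cols].sum()
ham_words = full_train[full_train.CATEGORY == 0][word_cols].sum()

# P(word | class) with Laplace smoothing, one value per word column
prob_word_spam = (spam_words + alpha) / (spam_words.sum() + alpha * len(word_cols))
prob_word_ham = (ham_words + alpha) / (ham_words.sum() + alpha * len(word_cols))
print(prob_word_spam, prob_word_ham)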
5.2 Testing
Supervised machine learning algorithms work in three major phases. The first is the training phase, where we teach the algorithm about the data using the features and the labels present in the training dataset. Once the algorithm has learned from the training phase, we use part of the remaining data to evaluate the model's performance. This is called the validation phase, and the data being used is called the validation dataset. The validation dataset is kept separate from the training dataset, so it contains data the model has not seen during training. This helps to prevent our model from overfitting the training data and gives us a sense of how the model performs on previously unseen data. The labels present in the validation dataset are used to calculate the evaluation metrics of the model.
The final phase is called testing, where we give the trained model real-world data and let it make predictions on its own. We train the model so as to ensure that it gives sensible results during the testing phase and achieves the objective of actually separating spam email from ham email.
5.3 Dataset
A dataset is simply a collection of data. In other words, a dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the dataset in question.
In the context of spam classification, these datasets are called corpora. A corpus is a large and structured set of texts; to put it simply, it is the set of all documents. There are two categories of data being used in this project: spam data and non-spam (ham) data.
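A minimal sketch of how such a corpus can be read and labeled (the spam/ and ham/ directory names are illustrative; the SpamAssassin archives use their own folder names):

import os
import pandas as pd

def load_corpus(directory, label):
    # Read every file in the directory as one email and attach the 1/0 label
    rows = []
    for name in os.listdir(directory):
        with open(os.path.join(directory, name), encoding='latin-1') as f:
            rows.append({'MESSAGE': f.read(), 'CATEGORY': label})
    return pd.DataFrame(rows)

data = pd.concat([load_corpus('spam', 1), load_corpus('ham', 0)], ignore_index=True)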
The spam dataset is the collection of all the spam emails, which are labeled as 1 to indicate that they are spam. They are undesirable to the recipient, so we want to separate them from the legitimate email. A sample spam email from the corpus (shown with its raw escape characters) looks like this:
Retail Value!\n\n\n\nYOURS for Only $29.99! <Includes FREE Shipping!
>\n\n\n\nDon\'t fall prey to destructive viruses or hackers!\n\nProtect your computer
and your valuable information!\n\n\n\n\n\nSo don\'t delay...get your copy
TODAY!\n\n\n\n\n\nhttp://euro.specialdiscounts4u.com/\n\n++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+\n\nThis email has been screened and filtered by our in house ""OPT-OUT"" system
in \n\ncompliance with state laws. If you wish to "OPT-OUT" from this mailing as
well \n\nas the lists of thousands of other email providers please visit
\n\n\n\nhttp://dvd.specialdiscounts4u.com/optoutd.html\n\n+++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+\n\n\n\n\n'
The ham dataset is the collection of all the ham emails, which are labeled as 0 to indicate that they are non-spam. They are important to the recipient, so we want to separate them from the spam emails and keep them in the user's inbox. A sample ham email from the corpus looks like this:
'Once upon a time, Manfred wrote :\n\n\n\n> I would like to install RPM itself. I have
tried to get the information\n\n> by visiting www.rpm.org <http://www.rpm.org>
and the related links they\n\n> give but they all seems to assume that RPM already is
installed.\n\n> I have a firewall based on linux-2.2.20 (Smoothwall) for private
use.\n\n> I would like to install the RPM package/program but there is no\n\n>
information how to do this from scratch.\n\n> Found this site and hopefully some
have the knowledge.\n\n> Best regards Manfred Grobosch\n\n\n\nWell, you can
simply use an rpm tarball (or extract one from a source rpm\n\non a machine that has
rpm scripts install "rpm2cpio <file.src.rpm> | cpio\n\n-dimv" and "./configure &&
make && make install" as usual. You need db3 or\n\ndb4 development files at least,
and once everything installed you\'ll need\n\nto initialize your rpm database.\n\n\n\nIf
you need more help, I suggest you join the [email protected] by\n\nsubscribing at
https://listman.redhat.com/\n\n\n\nMatthias\n\n\n\n-- \n\nMatthias Saou World Trade
Center\n\n------------- Edificio Norte 4 Planta\n\nSystem and Network Engineer
08039 Barcelona, Spain\n\nElectronic Group Interactive Phone : +34 936
00 23 23\n\n\n\n_______________________________________________\n\nRPM-
List mailing list <RPM-
[email protected]>\n\nhttp://lists.freshrpms.net/mailman/listinfo/rpm-list\n\n\n\n\n'
CHAPTER 6
EXPERIMENTAL RESULT
Naive Bayes email spam classification is a binary classification problem. The result of the experiment is in the form of 1 and 0: 1 indicates that the given email is spam and 0 indicates that it is ham.
Figure 6.2: Prediction Page Displaying the Result that the email is not spam
The following confusion matrix underlies the evaluation metrics used for the Naive Bayes algorithm in this project.

Actual vs Predicted     Predicted T (Spam)     Predicted F (Ham)
Actual T (Spam)         TP                     FN
Actual F (Ham)          FP                     TN
Table 6.1: Confusion Matrix
Accuracy is the proportion of all emails, spam and ham alike, that the model classifies correctly.
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Recall is the proportion of actual spam emails that the model correctly identifies as spam.
Recall = TP/(TP+FN)
Precision is the proportion of emails predicted as spam that really are spam.
Precision = TP/(TP+FP)
F1-Score is the harmonic mean of recall and precision.
F1-Score = 2*Recall*Precision/(Recall+Precision)
Evaluation on Validation set:
True Positive: 547
True Negative: 1126
False Positive: 9
False Negative: 41
Accuracy: 97.10%
Precision Score: 98.38%
Recall Score: 93.03%
F1-Score: 95.63%
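A quick sketch that reproduces these figures from the confusion-matrix counts above:

TP, TN, FP, FN = 547, 1126, 9, 41

accuracy = (TP + TN) / (TP + TN + FP + FN)          # ~0.9710
precision = TP / (TP + FP)                          # ~0.9838
recall = TP / (TP + FN)                             # ~0.9303
f1 = 2 * recall * precision / (recall + precision)  # ~0.9563

print(f"Accuracy: {accuracy:.2%}, Precision: {precision:.2%}, "
      f"Recall: {recall:.2%}, F1-Score: {f1:.2%}")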
CHAPTER 7
CONCLUSION AND FUTURE WORK
Taking into account the importance of electronic messages, in this report we have proposed an approach to email classification which takes the content of the email body as tokens and uses Naive Bayes to classify each email as either spam or ham. The main advantage of this approach is that it is fast and robust and gives good classification accuracy. The algorithm was implemented using the Python programming language in the PyCharm IDE. The implementation results were studied and the system was tested on new instances of email. The results show that the approach used is reasonable and takes away the manual load on users of separating spam email from legitimate email.
Future work could focus on: (1) utilizing email header information to further strengthen the system, (2) making the algorithm self-learn from misclassified data, and (3) making the algorithm classify emails written in different languages.
REFERENCES
[1] Cheruvu, A. (2012) Email Spam Detector: A Tool to Monitor and Detect Spam
Attack. [online] Available at: http://sci.tamucc.edu/~cams/projects/410.pdf
[Accessed Sept. 2020]
[2] Tope, M. (2019) Email Spam Detection using Naive Bayes Classifier. [online]
Available at: https://www.ijsdr.org/papers/IJSDR1906001.pdf [Accessed Sept.
2020]
[3] What is Tokenization in NLP?. [Blog] Available at:
https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/
[Accessed Sept. 2020]
[4] Removing stop words with NLTK in python. [Blog] Available at:
https://www.geeksforgeeks.org/removing-stop-words-nltk-python/ [Accessed
Sept. 2020]
[5] Stemming with python nltk package. [Blog] Available at:
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
[Accessed Sept. 2020]
[6] Naive Bayes Classifier. [Blog] Available at:
https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c [Accessed
Oct. 2020]
APPENDICES
The following Python file performs the data preprocessing and part of the feature extraction.
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires one-time downloads: nltk.download('punkt'); nltk.download('stopwords')

def clean_message(msg_no_html):
    # Lowercase the HTML-stripped body, tokenize it, drop stop words and
    # punctuation, and stem the remaining words
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(msg_no_html.lower())
    filtered_words = [stemmer.stem(word) for word in words
                      if word not in stop_words and word.isalpha()]
    return filtered_words

def make_dataframe(word_list):
    # One row per cleaned email, one (stemmed) token per column
    word_columns_df = pd.DataFrame.from_records(word_list)
    return word_columns_df
def make_sparse_matrix(df, indexed_words, labels):
    # df: one row per email, one stemmed token per cell; indexed_words: pd.Index of
    # the vocabulary; labels: Series of 1/0 (spam/ham) values indexed by DOC_ID
    nr_rows = df.shape[0]
    nr_cols = df.shape[1]
    word_set = set(indexed_words)
    dict_list = []
    for i in range(nr_rows):
        for j in range(nr_cols):
            word = df.iat[i, j]
            if word in word_set:
                doc_id = df.index[i]
                word_id = indexed_words.get_loc(word)
                category = labels.at[doc_id]
                dict_list.append({'DOC_ID': doc_id, 'WORD_ID': word_id,
                                  'LABEL': category, 'OCCURRENCE': 1})
    sparse_df = pd.DataFrame(dict_list)
    # Summing collapses repeated (DOC_ID, WORD_ID) pairs into occurrence counts
    return sparse_df.groupby(['DOC_ID', 'WORD_ID', 'LABEL']).sum().reset_index()