Lab Manual
Natural Language Processing with
Python
Dr. G N V G Sirisha
Smt. A. L. Lavanya
This lab manual is intended to aid second-year undergraduate Artificial Intelligence and
Machine Learning students in their course Natural Language Processing with Python Lab
[B20AM2206].
G N V G Sirisha: She received her Ph.D., M.Tech. and B.Tech. degrees from Andhra University.
Presently she is working as an associate professor in the Department of Computer Science and
Engineering, SRKR Engineering College, Bhimavaram, India. Her research interests include
Data Science, Information Retrieval and Machine Learning.
A L Lavanya: She is currently pursuing her Ph.D. at GIET University and received her M.Tech. and MCA degrees
from Andhra University. Presently she is working as an assistant professor in the Department
of Computer Science and Engineering, SRKR Engineering College, Bhimavaram, India.
Preface
Natural Language Processing is a dynamic and rapidly evolving field that sits at the
intersection of computer science, linguistics, and artificial intelligence. With the increasing
importance of language-based technologies in our daily lives, ranging from chatbots and
virtual assistants to language translation and sentiment analysis, NLP has become an essential
skill for anyone interested in the world of data science and machine learning.
In this lab, we will leverage the expressive and versatile programming language, Python,
along with popular libraries and tools such as NLTK (Natural Language Toolkit), SpaCy, and
scikit-learn.
Dr. G N V G Sirisha
Evaluation Scheme
Examination Marks
Exercise Programs 5
Record 5
Internal Exam 5
External Exam 35
Course Objectives
1. The main objective of the course is to understand the various concepts of natural
language processing along with their implementation using Python
Course Outcomes
CO2. Learn various techniques for implementing NLP including parsing & text
processing
List of Experiments
S. No. Experiment
1. Demonstrate Noise Removal for any textual data and remove regular expression pattern such as hashtag from textual data
2. Perform lemmatization and stemming using python library nltk
3. Demonstrate object standardization such as replacing social media slangs from
a text
4. Perform the part of speech tagging on any textual data.
5. Implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
Session # 1
Learning Objective
To demonstrate noise removal for any textual data and remove regular expression
patterns such as hashtags from textual data.
Learning Outcomes
After the completion of this experiment, students will be able to
• Understand different approaches to noise removal
• Develop programs for noise removal using dictionaries and regular expressions
Learning Context
Any piece of text which is not relevant to the context of the data and the end output
can be treated as noise.
For example – language stopwords (commonly used words of a language – is, am, the,
of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation and
industry-specific words. This step deals with the removal of all such noisy entities
present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities, and
iterate the text object by tokens (or by words), eliminating those tokens which are
present in the noise dictionary.
Another approach is to use regular expressions to match special patterns of noise,
such as hashtags, and strip them from the input text.
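A minimal sketch of both approaches is given below; the noise dictionary, the regex pattern and the sample sentences are illustrative:

import re

# Approach 1: dictionary of noisy entities (illustrative stop-word list)
noise_words = {"is", "am", "the", "of", "in", "this"}

def remove_noise_words(text):
    """Drop every token that appears in the noise dictionary."""
    return " ".join(w for w in text.split() if w.lower() not in noise_words)

# Approach 2: regular expression for a special pattern such as hashtags
def remove_regex_pattern(text, pattern):
    """Remove every substring that matches the given regex pattern."""
    return re.sub(pattern, "", text)

print(remove_noise_words("this is a sample text in the corpus"))
print(remove_regex_pattern("I love #NLP and #Python", r"#\w+"))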
Exercise
Demonstrate noise removal for any textual data and remove regular expression
patterns such as hashtags from textual data
Additional Exercise
a. Create a function to remove all URLs (web links) from a given text. Test the function on sample text that contains URLs.
b. Identify and extract emoticons from the text using regular expressions.
Solutions
Session # 2
Learning Objective
To perform lemmatization and stemming using the Python library NLTK.
Learning Outcomes
After the completion of this experiment, students will be able to
• Understand the concepts of lemmatization and stemming
• Apply stemming and lemmatization using Python library nltk
Learning Context
For example – “play”, “player”, “played”, “plays” and “playing” are different
variations of the word “play”. Though they mean different things, contextually they are all
similar. This step converts all such variants of a word into their normalized form (also
known as the lemma). Normalization is a pivotal step for feature engineering with text as
it converts high-dimensional features (N different features) into a low-dimensional
space (1 feature), which is ideal for any ML model.
Instructions
For example, noise removal, tokenization, and lemmatization can change the string
“Who was partying?” into a list with the words “who”, “be”, and “party”.
I. Tokenization
text = "This is a text to tokenize"
tokenized = nltk.word_tokenize(text)   # requires: import nltk and nltk.download('punkt')
print(tokenized)
# ['This', 'is', 'a', 'text', 'to', 'tokenize']
If the text is not already in tokens, we first convert it into tokens. Once the strings
of text have been converted into tokens, we can reduce the word tokens to their
root form. There are mainly three algorithms for stemming: the Porter
Stemmer, the Snowball Stemmer and the Lancaster Stemmer. The Porter Stemmer is
the most common among them; the sketch below compares the three.
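A minimal sketch that compares the three stemmers on a few illustrative words:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["playing", "played", "player", "generously", "fairly"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Print each word with its Porter, Snowball and Lancaster stems side by side
for w in words:
    print(w, "->", porter.stem(w), snowball.stem(w), lancaster.stem(w))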
b. Lemmatization
In natural language processing, lemmatization is the text preprocessing
normalization task concerned with bringing words down to their root forms. The
word “Lemmatization” is itself derived from the base word “Lemma”. In
Linguistics (a field of study on which NLP is based) a lemma is a meaningful base
word or a root word that forms the basis for other words. For example, the lemma
of the words “playing” and “played” is play.
c. POS Tagging
POS Tagging (Parts of Speech Tagging) is the process of marking up each word in a text
with its part of speech, based on the word's definition and its context. Each
token is assigned a part of speech. It is also called grammatical tagging.
Let's learn with an NLTK part-of-speech example:
Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
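The same example can be reproduced with nltk.pos_tag; the exact tags may vary slightly with the tagger version:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Everything to permit us.")
print(nltk.pos_tag(tokens))
# e.g. [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]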
Exercise
1. Perform lemmatization and stemming using python library nltk
Additional Exercise
b. Tailor the lemmatization process for a specific domain, such as medical or legal
text. Identify domain-specific terms and ensure accurate lemmatization for these
terms.
Solutions
text = 'data science uses scientific methods algorithms and many types of processes'
print(stem_words(text))
Output:
['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of', 'process']
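A minimal sketch of the stem_words helper called above, based on NLTK's Porter stemmer (which matches the stemmed forms in the output):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def stem_words(text):
    """Return the Porter-stemmed form of every token in the text."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in word_tokenize(text)]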
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)   # default to noun for unmapped tags

# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize a word and a whole sentence with POS tags
#    (example sentence chosen to match the output below)
sentence = "The striped bats are hanging on their feet for best"
words = nltk.word_tokenize(sentence)
print(lemmatizer.lemmatize("feet", get_wordnet_pos("feet")))
print([get_wordnet_pos(w) for w in words])
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words])

Output:
foot
['n', 'v', 'n', 'v', 'v', 'n', 'n', 'n', 'n', 'a']
['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
Session # 3
Learning Objective
To demonstrate object standardization such as replacing social media slangs from a
text.
Learning Outcomes
After the completion of this experiment, students will be able to
• Understand the different social media slang
• Develop programs for removing social media slangs from given text
Learning Context
Text data often contains words or phrases that are not present in any standard
lexical dictionary. These pieces are not recognized by search engines and models.
Some examples are acronyms, hashtags with attached words, and colloquial slang.
With the help of regular expressions and manually prepared data dictionaries,
this type of noise can be fixed. A common approach is a dictionary lookup that
maps each slang term to its standard form.
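A minimal sketch of such a dictionary lookup; the slang entries and the sample tweet are illustrative:

# Illustrative slang dictionary; extend it with project-specific entries
slang_dict = {"rt": "retweet", "dm": "direct message", "awsm": "awesome",
              "luv": "love", "brb": "be right back"}

def standardize_words(text):
    """Replace known slang tokens with their standard forms."""
    return " ".join(slang_dict.get(word.lower(), word) for word in text.split())

print(standardize_words("rt this awsm tweet and dm me"))
# retweet this awesome tweet and direct message me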
Exercise
Demonstrate object standardization such as replacing social media slangs from a text
Additional Exercise
a. Develop a function that analyzes a given text and provides statistics on the
number of social media slangs replaced. Include information such as the most
common slangs and their frequencies.
Solutions
Session # 4
Learning Objective
To perform the part of speech tagging on any textual data.
Learning Outcomes
After the completion of this experiment, students will be able to
• Define and explain the concept of Part-of-Speech (POS) and its importance in
natural language processing.
• Recognize and differentiate between common POS tags such as nouns, verbs,
adjectives, adverbs, pronouns, and prepositions.
• Utilize the NLTK (Natural Language Toolkit) library in Python to perform Part-
of-Speech tagging on textual data.
• Integrate POS tagging with other natural language processing techniques, such
as tokenization, lemmatization, and named entity recognition, to build more
advanced language processing pipelines
Learning Context
To analyze preprocessed data, it needs to be converted into features. Depending upon
the usage, text features can be constructed using various techniques – Syntactical
Parsing, Entities / N-grams / word-based features, Statistical features, and word
embeddings. Part-of-speech (POS) tagging is the process of assigning a word to its
grammatical category, in order to understand its role within the sentence. Traditional
parts of speech are nouns, verbs, adverbs, conjunctions, etc.
Part of speech tagging – Apart from grammatical relations, every word in a sentence
is also associated with a part of speech (POS) tag (noun, verb, adjective, adverb, etc.).
The POS tags define the usage and function of a word in the sentence. Here is the list of
POS tags defined by the Penn Treebank project (University of Pennsylvania).
Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
POS tags also improve word-based features: in the sentence “book my flight, I will read this book”, the word “book” is used in two different senses, and attaching the POS tag to each token keeps the two usages apart.
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1),
(“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
C. Normalization and Lemmatization: POS tags are the basis of the lemmatization
process for converting a word to its base form (lemma).
D. Efficient stopword removal: POS tags are also useful in the efficient removal of
stopwords. For example, some tags always identify the low-frequency / less important
words of a language: (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”),
(MD – “may”, “must”, etc.).
Exercise
Perform the part of speech tagging on any textual data.
Additional Exercise
b. Develop programs for parts of speech tagging using other libraries like spaCy,
TextBlob, StanfordNLP, Flair, Polyglot and Transformers (HuggingFace)
Solutions
The following programs demonstrate parts of speech tagging using NLTK's pos_tag and
custom POS tagging with Penn Treebank tags.
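A minimal sketch of both variants; the sample sentence, the choice of NLTK's UnigramTagger for the custom tagger, and the size of the Treebank training slice are illustrative:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('treebank')
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

text = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(text)

# 1. Off-the-shelf tagger (Penn Treebank tag set)
print(nltk.pos_tag(tokens))

# 2. Custom tagger: a unigram tagger trained on the Penn Treebank sample shipped with NLTK;
#    words never seen in the training slice are tagged None
custom_tagger = UnigramTagger(treebank.tagged_sents()[:3000])
print(custom_tagger.tag(tokens))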
import nltk
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# WordNet POS class of every word in an example sentence
# (sentence chosen to match the output below)
sentence = "The striped bats are hanging on their feet for best"
print([get_wordnet_pos(w) for w in nltk.word_tokenize(sentence)])

Output:
['n', 'v', 'n', 'v', 'v', 'n', 'n', 'n', 'n', 'a']
Session # 5
Learning Objective
To implement topic modeling using Latent Dirichlet Allocation (LDA) in Python.
Learning Outcomes
After the completion of this experiment, students will be able to
• Understand the concept of topic modeling and its significance in uncovering
hidden thematic structures within a collection of documents.
• Use Python libraries such as Gensim or Scikit-Learn to implement the LDA
algorithm on a corpus of documents.
• Apply topic modeling techniques to real-world scenarios, such as analyzing
customer reviews, news articles, or social media content, to extract meaningful
insights.
Learning Context
LDA stands for Latent Dirichlet Allocation, which is a probabilistic topic modelling
technique used in natural language processing (NLP) and machine learning. It is a
generative statistical model that allows for the discovery of underlying topics in a
collection of documents. The basic idea behind LDA is that documents are assumed
to be composed of a mixture of different topics, and each topic is represented by a
distribution of words.
LDA assigns a specific topic to each word in the document. The goal of LDA is to learn
the latent (hidden) topic structure from the observed word occurrences in the
documents. It does this by estimating the parameters of the topic and word
distributions that best explain the observed data. This process involves iteratively
updating the topic and word assignments until a convergence criterion is met.
Steps:
1. Import Libraries
2. Text Preprocessing
• Converting Text to Lowercase
• Split Text into Words
• Remove the Stop Words
• Removing Punctuation & Special Characters
• Normalize the Words
3. Converting Text to Numerical Representation
4. Implementation Of LDA
5. Retrieve The Topics
6. Assigning the topic to the documents
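These steps can be sketched with scikit-learn as follows; the three sample documents, the number of topics and the variable names are illustrative choices:

import re
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer    # requires nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 1-2. A tiny illustrative corpus and its preprocessing
documents = ["I want to watch a movie this weekend.",
             "I watched cricket on Amazon Prime, not Netflix.",
             "New Zealand won the test series, beating the world champions."]
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    doc = doc.lower()                                   # lowercase
    words = re.findall(r"[a-z]+", doc)                  # split into words, drop punctuation
    words = [w for w in words if w not in stop_words]   # remove stop words
    return " ".join(lemmatizer.lemmatize(w) for w in words)  # normalize

clean_docs = [preprocess(d) for d in documents]

# 3. Convert text to a numerical (TF-IDF) representation
vectorizer = TfidfVectorizer()
tf_idf_arr = vectorizer.fit_transform(clean_docs).toarray()

# 4. Fit LDA
lda_model = LatentDirichletAllocation(n_components=3, random_state=42)
lda_model.fit(tf_idf_arr)

# 5. Retrieve the top words of each topic
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda_model.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]   # 5 highest-weighted terms
    print("Topic", i + 1, top)

# 6. Assign the most probable topic to each document
for n, dist in enumerate(lda_model.transform(tf_idf_arr)):
    print("Document", n + 1, "-- Topic:", dist.argmax())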
Exercise
To implement topic modeling using Latent Dirichlet Allocation (LDA ) in Python.
Additional Exercises
Solutions
1. Import Libraries
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
# imports inferred from the later steps (TF-IDF vectorization and LDA)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
2. Text Preprocessing
Steps to preprocess text data:
Outputs
Topic 1 ['weekend' 'want' 'watch' 'movie' 'good']
Topic 2 ['watch' 'amazon' 'cricket' 'don’t' 'netflix']
Topic 3 ['book' 'zealand' 'test' 'beating' 'world']
Topic 4 ['chill' 'nice' 'would' 'however' 'like']
Topic 5 ['good' 'movie' 'book' 'watch' 'test']
Topic 6 ['good' 'movie' 'book' 'watch' 'test']
doc_topic = lda_model.transform(tf_idf_arr)
print(doc_topic)
# iterating over every document
for n in range(doc_topic.shape[0]):
    topic_doc = doc_topic[n].argmax()   # index of the most probable topic
    # document numbering starts at 1
    print("Document", n + 1, " -- Topic:", topic_doc)
Session # 6
Learning Objective
a) To demonstrate Term Frequency – Inverse Document Frequency (TF – IDF)
using python
b) To demonstrate word embeddings using word2vec
Learning Outcomes
After the completion of this experiment, students will be able to
Learning Context
The term frequency-inverse document frequency (tf-idf) is a widely used method for
weighting terms in information retrieval and text mining. The tf-idf score measures
the importance of a term in a document relative to its importance in the entire corpus.
First we split each document in the corpus into a list of words and place the words in a set
named words_set using the union() method. The union() method returns a set that contains
all unique items from the specified set(s).
Term Frequency: The term frequency is the number of times a term appears in a
document
tf(t,d) = count of t in d / number of words in d
Inverse Document Frequency: Inverse document frequency (idf) is a measure of
the rarity of a term in a corpus of documents
idf(t) = log(N / df(t))
where:
• t is a term
• df(t) is the number of documents in the corpus that contain the term t
• N is the total number of documents in the corpus
• log is the natural logarithm
TF-IDF score: The tf-idf score is calculated by multiplying two factors: the term
frequency (tf) and the inverse document frequency (idf).
tf-idf(t, d) = tf(t, d) * idf(t)
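As a worked illustration of these formulas, the sketch below computes tf-idf scores by hand for a tiny, made-up corpus:

import math

# Illustrative corpus of three short documents
corpus = ["data science uses scientific methods",
          "machine learning is a part of data science",
          "natural language processing is fun"]

docs = [doc.split() for doc in corpus]

# Build the shared vocabulary as the union of all document word sets
words_set = set()
for words in docs:
    words_set = words_set.union(set(words))

N = len(docs)

def tf(t, words):
    return words.count(t) / len(words)

def idf(t):
    df = sum(1 for words in docs if t in words)
    return math.log(N / df)

# tf-idf(t, d) = tf(t, d) * idf(t)
for i, words in enumerate(docs):
    scores = {t: round(tf(t, words) * idf(t), 3) for t in words_set if t in words}
    print("Document", i + 1, scores)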
Word Embedding is a word representation type that allows machine learning
algorithms to understand words with similar meanings. It is a language modeling and
feature learning technique to map words into vectors of real numbers using neural
networks, probabilistic models, or dimension reduction on the word co-occurrence
matrix.
Applications of Word Embeddings:
Compute similar words: Word embedding is used to suggest similar words to the
word being subjected to the prediction model.
Create a group of related words: It is used for semantic grouping, which groups things
with similar characteristics together and keeps dissimilar things far apart.
Feature for text classification: Text is mapped into arrays of vectors which are fed to
the model for training as well as prediction.
Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF)
represent each word in a document with a single value, typically a count or a
weight, respectively. Word2Vec, on the other hand, represents each word as a
vector of values, typically with hundreds of dimensions.
Both Bag of Words and TF-IDF ignore the order of words and the context of words;
word embeddings were developed to overcome this limitation.
There are two architectures used by Word2vec:
In CBOW, the current word is predicted from a window of surrounding context words.
For example, if w(i-1), w(i-2), w(i+1), w(i+2) are the given context words, this
model will predict w(i).
Skip-Gram performs the opposite of CBOW: it predicts the surrounding context from the
given word. You can reverse the example to understand it. If w(i) is given, this model
will predict the context w(i-1), w(i-2), w(i+1), w(i+2).
We perform dimensionality reduction using PCA so that we can plot the points; we
can observe that similar words are plotted close together. We can also find the similarity
between various words in the corpus using the wv.similarity() function, as in the sketch below.
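A minimal sketch with gensim (4.x API assumed); the toy corpus and the training parameters vector_size, window and epochs are illustrative:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Illustrative tokenized corpus
sentences = [["i", "love", "natural", "language", "processing"],
             ["word", "embeddings", "capture", "word", "meaning"],
             ["i", "love", "machine", "learning"],
             ["language", "models", "learn", "word", "meaning"]]

# sg=0 -> CBOW, sg=1 -> Skip-Gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

# Similarity between two words in the vocabulary
print(model.wv.similarity("love", "learning"))

# Project the embeddings to 2-D with PCA and plot them
words = list(model.wv.index_to_key)
points = PCA(n_components=2).fit_transform(model.wv[words])
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()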
Exercise
a) Demonstrate Term Frequency – Inverse Document Frequency (TF-IDF) using Python.
b) Demonstrate word embeddings using word2vec.
Additional Exercise
Solutions
6. a.
OUTPUT:
6. b.
OUTPUT:
OUTPUT:
Session # 7
Learning Objective
To implement text classification using Naïve Bayes Classifier and Text Blob library
Learning Outcomes
After the completion of this experiment, students will be able to
• Understand the concept of the Naïve Bayes Classifier and its suitability for text
classification
• Apply the Naïve Bayes Classifier to real-world scenarios, such as sentiment
analysis, spam detection, or topic categorization, showcasing its versatility in
different text classification tasks.
• Understand the strengths and limitations of TextBlob for natural language
processing.
Learning Context
Text classification is the process of assigning predefined categories or labels to a given
text document.
Naive Bayes Classifier is a probabilistic algorithm based on Bayes' theorem that can
be used for text classification. It assumes that the presence of a particular feature in a
class is independent of the presence of other features in the same class. Naive Bayes
Classifier is fast and efficient and works well with high-dimensional data such as text
data. Bayes’ Theorem finds the probability of an event occurring given the probability
of another event that has already occurred. Bayes’ theorem is stated mathematically
as follows:
P(A|B) = P(B|A) * P(A) / P(B)
Given a data matrix X and a target vector y, we state our problem as:
P(y|X) = P(X|y) * P(y) / P(X)
where y is the class variable and X is a dependent feature vector with dimension d, i.e.
X = (x1, x2, ..., xd), where d is the number of variables/features of the sample.
P(y|X) is the probability of observing the class y given the sample X = (x1, x2, ..., xd).
The denominator remains constant for the given input, so we can remove that term.
Finally, to find the class of a given sample, we just need to pick the value of the class
variable y with the maximum probability:
y = argmax over y of P(y) * P(x1|y) * P(x2|y) * ... * P(xd|y)
TextBlob is a Python library built on NLTK for natural language processing (NLP). It
simplifies NLP tasks with a user-friendly interface. Key features include:
• Text processing, tagging, extraction, sentiment analysis, and translation.
• Sentiment analysis determines positive, negative, or neutral sentiment
in text (see the sketch below).
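A short illustration of the sentiment interface; the sample sentences are made up and the exact scores depend on the TextBlob version:

from textblob import TextBlob

# polarity ranges from -1 (negative) to +1 (positive); subjectivity from 0 (objective) to 1 (subjective)
for sentence in ["I love this place!",
                 "This movie was terrible.",
                 "The meeting starts at 5 pm."]:
    blob = TextBlob(sentence)
    print(sentence, "->", blob.sentiment.polarity, blob.sentiment.subjectivity)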
Exercise
1. Implement Text classification using naïve bayes classifier and text blob library
Additional Exercise
b. Evaluate the performance of the Naïve Bayes classifier using metrics like
accuracy, precision, recall, and F1 score.
Solutions
import nltk
nltk.download('punkt')
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
('I am exhausted of this work.', 'Class_B'),
("I can't cooperate with this", 'Class_B'),
('He is my badest enemy!', 'Class_B'),
('My management is poor.', 'Class_B'),
('I love this burger.', 'Class_A'),
('This is an brilliant place!', 'Class_A'),
('I feel very good about these dates.', 'Class_A'),
('This is my best work.', 'Class_A'),
("What an awesome view", 'Class_A'),
('I do not like this dish', 'Class_B')]
test_corpus = [
("I am not feeling well today.", 'Class_B'),
("I feel brilliant!", 'Class_A'),
('Gary is a friend of mine.', 'Class_A'),
("I can't believe I'm doing this.", 'Class_B'),
('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]
model = NBC(training_corpus)
print(model.classify("Their codes are amazing."))
print(model.classify("I don't like their computer."))
print(model.accuracy(test_corpus))
Output: Class_A
Class_B
0.8333333333333334
Session # 8
Learning Objective
To implement text classification using a Support Vector Machine (SVM).
Learning Outcomes
After the completion of this experiment, students will be able to
• Explain the basic principles of Support Vector Machines and how they are
applied to text classification tasks.
• Utilize Python libraries, such as scikit-learn, to implement a Support Vector
Machine model for text classification. Understand the different kernel functions
and parameters available in scikit-learn's SVM implementation.
Learning Context
Support vector machines (SVMs) are supervised machine learning algorithms which
are used both for classification and regression. But generally, they are used in
classification problems such as text classification. Text Classification is the process of
labeling text data.
SVM finds a hyperplane that creates a boundary between two classes of data
to classify them. These algorithms work best on smaller, complex datasets.
To implement SVM for text classification, you can use libraries such as scikit-learn in
Python. The steps involved include data preprocessing such as tokenization and word
stemming/lemmatization , converting the text into feature vectors using techniques
like TF-IDF, splitting the data into training and test sets, and training the SVM
classifier on the training data. Finally, you can evaluate the performance of the
classifier on the test data.
The classification report consists of the following:
True positive measures the extent to which the model correctly predicts the positive
class. False positives occur when the model predicts that an instance belongs to a class
that it actually does not. True negatives are the outcomes that the model correctly
predicts as negative.
Precision is the ratio between the True Positives and all the Positives. Mathematically:
Precision = TP / (TP + FP)
Support in the classification report is the number of actual occurrences of each class in
the test set; it indicates how many samples each per-class score is based on.
Accuracy is the ratio of the total number of correct predictions to the total number
of predictions. Mathematically:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The macro-average is computed using the arithmetic mean (aka unweighted mean) of
all the per-class values.
The weighted average is the sum of each singular value multiplied by its
corresponding weight.
Exercise
Implement text classification using a Support Vector Machine (SVM).
Additional Exercise
b. Implement SVM using sklearn, apply different kernel functions and verify
which kernel function performs better on IMDB dataset
Solutions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm
training_corpus = [
('I am exhausted of this work.', 'Class_B'),
("I can't cooperate with this", 'Class_B'),
('He is my badest enemy!', 'Class_B'),
('My management is poor.', 'Class_B'),
('I love this burger.', 'Class_A'),
('This is an brilliant place!', 'Class_A'),
('I feel very good about these dates.', 'Class_A'),
('This is my best work.', 'Class_A'),
("What an awesome view", 'Class_A'),
('I do not like this dish', 'Class_B')]
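A sketch of the remaining steps is given below; the test_corpus is assumed to mirror the one used in Session 7 and the linear kernel is an illustrative choice, so the scores need not match the output that follows:

test_corpus = [
    ("I am not feeling well today.", 'Class_B'),
    ("I feel brilliant!", 'Class_A'),
    ('Gary is a friend of mine.', 'Class_A'),
    ("I can't believe I'm doing this.", 'Class_B'),
    ('The date was good.', 'Class_A'),
    ('I do not enjoy my job', 'Class_B')]

train_texts = [text for text, label in training_corpus]
train_labels = [label for text, label in training_corpus]
test_texts = [text for text, label in test_corpus]
test_labels = [label for text, label in test_corpus]

# Convert text to TF-IDF feature vectors
vectorizer = TfidfVectorizer(min_df=1)
train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

# Train a linear-kernel SVM and evaluate it on the test set
classifier = svm.SVC(kernel='linear')
classifier.fit(train_vectors, train_labels)
predictions = classifier.predict(test_vectors)

print(predictions)
print(classification_report(test_labels, predictions))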
OUTPUT:
['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']
accuracy 0.50 6
Session # 9
Learning Objective
To convert text to vectors (using term frequency) and apply cosine similarity to provide
closeness among two texts.
Learning Outcomes
After the completion of this experiment, students will be able to
• Explain the importance of converting textual data into numerical vectors for
various natural language processing (NLP) tasks.
• Implement cosine similarity in Python to measure the similarity between two
text vectors.
• Interpret cosine similarity scores and understand how values close to 1 indicate
high similarity, while values close to 0 suggest dissimilarity.
• Recognize real-world applications of text vectorization and cosine similarity,
such as document similarity, plagiarism detection, and information retrieval.
Learning Context
The process of converting text into a vector is called vectorization. Vectorization can be
done using many methods; one of the methods is term frequency.
Formula:
tf(t, d) = count of t in d / number of words in d
Cosine Similarity
The Cosine similarity of two documents will range from 0 to 1. If the Cosine similarity
score is 1, it means two vectors have the same orientation. The value closer to 0
indicates that the two documents have less similarity.
Formula:
The mathematical equation of cosine similarity between two non-zero vectors A and B is:
cos(A, B) = (A · B) / (||A|| * ||B||) = Σ (Ai * Bi) / (sqrt(Σ Ai^2) * sqrt(Σ Bi^2))
Exercise
To convert text to vectors (using term frequency) and apply cosine similarity to
provide closeness among two texts.
Additional Exercise
Solutions
import math
from collections import Counter

def get_cosine(vec1, vec2):
    print(vec1.keys())
    print(vec2.keys())
    common = set(vec1.keys()) & set(vec2.keys())
    print(common)
    numerator = sum([vec1[x] * vec2[x] for x in common])
    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = 'This is an article on analytics vidhya'
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
print(vector1)
print(vector2)

cosine = get_cosine(vector1, vector2)
print(cosine)
Output:
Counter({'This': 1, 'is': 1, 'an': 1, 'article': 1, 'on': 1, 'analytics': 1, 'vidhya': 1})
Counter({'article': 1, 'on': 1, 'analytics': 1, 'vidhya': 1, 'is': 1, 'about': 1, 'natural': 1, 'language': 1,
'processing': 1})
dict_keys(['This', 'is', 'an', 'article', 'on', 'analytics', 'vidhya'])
dict_keys(['article', 'on', 'analytics', 'vidhya', 'is', 'about', 'natural', 'language', 'processing'])
{'article', 'analytics', 'on', 'is', 'vidhya'}
0.629940788348712
Session # 10
Learning Objective
Case study 1: Identify the sentiment of tweets
In this problem, you are provided with tweet data and must predict the sentiment of
netizens towards electronic products.
Learning Outcomes
After the completion of this experiment, students will be able to
• Recognize the significance of sentiment analysis in understanding public
opinions about electronic products.
• Implement text preprocessing techniques to clean and prepare tweet text for
sentiment analysis.
• Handle challenges such as removing special characters, handling emojis, and
addressing variations in language.
Learning Context
Twitter sentiment analysis identifies negative and positive emotions within the text
of a tweet. It is a text analysis technique that uses Natural Language Processing (NLP)
and machine learning. It identifies and extracts subjective information from the original
data, providing a company with a better understanding of the social sentiment around its
brand and products.
A Twitter sentiment analysis is the process of determining the emotional tone behind
a series of words, specifically on Twitter. A sentiment analysis tool is an automated
technique that extracts meaningful information about users' attitudes, emotions,
and opinions.
7. Interpretation: Once the model is trained and evaluated, you can interpret the
results to gain insights into the sentiment of the tweets.
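A minimal end-to-end sketch of such a pipeline; the dataset path tweets.csv and the column names 'tweet' and 'label' are hypothetical, and TF-IDF features with logistic regression are just one reasonable choice:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical dataset: a CSV with a 'tweet' text column and a 0/1 'label' column
df = pd.read_csv("tweets.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["label"], test_size=0.2, random_state=42)

# Vectorize the tweets and train a simple classifier
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

predictions = clf.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, predictions) * 100, "%")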
Exercise
Develop a model to identify the sentiment of tweets about electronic products.
Solutions
STOP_WORDS = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'also', 'am', 'an', 'and', 'any',
    'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both',
    'but', 'by', 'can', "can't", 'cannot', 'com', 'could', "couldn't", 'did', "didn't", 'do', 'does',
    "doesn't", 'doing', "don't", 'down', 'during', 'each', 'else', 'ever', 'few', 'for', 'from', 'further',
    'get', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's",
    'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'however', 'http',
    'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself',
    'just', 'k', "let's", 'like', 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not',
    'of', 'off', 'on', 'once', 'only', 'or', 'other', 'otherwise', 'ought', 'our', 'ours', 'ourselves',
    'out', 'over', 'own', 'r', 'same', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should',
    "shouldn't", 'since', 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them',
    'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've",
    'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we',
    "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where',
    "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would',
    "wouldn't", 'www', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']
Accuracy: 86.36363636363636 %
Session # 11
Learning Objective
Case study 2: Detect hate speech in tweets.
The objective of this task is to detect hate speech in tweets. For the sake of simplicity,
we say a tweet contains hate speech if it has a racist or sexist sentiment associated
with it. So, the task is to classify racist or sexist tweets from other tweets
Learning Outcomes
After the completion of this experiment, students will be able to
• Define hate speech detection within the context of social media and recognize its
significance in curbing offensive and harmful content.
• Utilize machine learning algorithms (e.g., Logistic Regression, Support Vector
Machines) for the classification of tweets into categories of racist or sexist hate
speech and non-hate speech.
Learning Context
Detecting hate speech in tweets is a challenging task, as it requires understanding the
nuances of language and the context in which the words are used. Here are some steps
that can help you in detecting hate speech in tweets:
1. Collect a dataset: The first step is to collect a dataset of tweets that contain racist
or sexist sentiments. There are several publicly available datasets that you can
use, such as the Hate Speech and Offensive Language Identification Dataset
(https://github.com/t-davidson/hate-speech-and-offensive-language), which
contains tweets labeled as hate speech or not hate speech.
2. Preprocess the data: Preprocess the data by removing noise such as URLs,
mentions, and special characters. You can also normalize the data by
converting all the text to lowercase and removing stop words and by applying
stemming.
3. Feature extraction: Extract features from the preprocessed data. Some popular
feature extraction techniques for text classification include Bag of Words, TF-
IDF, and word embeddings.
4. Train a classifier: Use the extracted features to train a classifier. You can use
machine learning algorithms such as Naive Bayes, Support Vector Machines,
or Logistic Regression.
5. Evaluate the classifier: Evaluate the performance of the classifier using metrics
such as accuracy, precision, recall, and F1-score. You can also use techniques
such as cross-validation and grid search to fine-tune the hyperparameters of the
classifier.
6. Classify new tweets: Once the classifier is trained, you can use it to classify new
tweets as either containing hate speech or not.
Exercise
Develop a model to detect hate speech in tweets
Solution
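The listings below assume the imports and train/test split sketched here; the dataset path train.csv and the column names 'tweet' and 'label' are assumptions based on typical hate-speech tweet datasets:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, model_selection
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical dataset: 'tweet' holds the text, 'label' holds 0/1 (1 = racist/sexist)
df = pd.read_csv("train.csv")

train_x, test_x, train_y, test_y = model_selection.train_test_split(
    df["tweet"].tolist(), df["label"].tolist(), test_size=0.2, random_state=42)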
vectorizer = CountVectorizer()
vectorizer.fit(train_x)
x_train_count = vectorizer.transform(train_x)
x_test_count = vectorizer.transform(test_x)
x_train_count.toarray()
tf_idf_word_vectorizer = TfidfVectorizer()
tf_idf_word_vectorizer.fit(train_x)
x_train_tf_idf_word = tf_idf_word_vectorizer.transform(train_x)
x_test_tf_idf_word = tf_idf_word_vectorizer.transform(test_x)
x_train_tf_idf_word.toarray()
# keep separate names for the two models so that the count-vector model (log_model)
# remains available for the ROC curve and the prediction at the end
log = linear_model.LogisticRegression()
log_model = log.fit(x_train_count, train_y)
accuracy = model_selection.cross_val_score(log_model, x_test_count, test_y, cv=20)
print(accuracy)
mean = np.mean(accuracy)
print("\nLogistic regression model with 'count-vectors' method")
print("Accuracy ratio: ", mean)

log_tfidf = linear_model.LogisticRegression()
log_model_tfidf = log_tfidf.fit(x_train_tf_idf_word, train_y)
accuracy = model_selection.cross_val_score(log_model_tfidf,
                                           x_test_tf_idf_word,
                                           test_y,
                                           cv=20).mean()
print(accuracy)
mean = np.mean(accuracy)
print("\nLogistic regression model with 'TF-IDF vectors' method")
print("Accuracy ratio: ", mean)
y = train_y
X = x_train_count.astype("float64")
logit_roc_auc = roc_auc_score(y, log_model.predict(X))
fpr, tpr, thresholds = roc_curve(y, log_model.predict_proba(X)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='AUC (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()
print(train_x[0])
new_tweet = train_x[0]
new_tweet_features = vectorizer.transform([new_tweet])
log_model.predict(new_tweet_features)