NLP Lab Manual - 1
(AUTONOMOUS)
Affiliated to JNTUH, approved by AICTE, Accredited by NAAC with A++ Grade
ISO 9001:2015 Certified
Kacharam, Shamshabad, Hyderabad – 501218, Telangana, India
Laboratory Manual
Natural Language Processing
(III B. Tech- I SEMESTER)
(VCE-R22)
Course Code-A8708
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AI & ML)
CO-PO/PSO Mapping (PO1-PO12, PSO1, PSO2):
A8708.1  2 2 2 2
A8708.2  3 2 2 2 2
A8708.3  3 2 2 2 2
A8708.4  3 2 2 2 2
A8708.5  3 2 3 2 2
Note: 1-Low, 2-Medium, 3-High
LIST OF PROGRAMS FOR PRACTICE:

Tools and Techniques (common to all experiments): 1. A Computer System with Ubuntu Operating System; 2. Python 3.x or above version; 3. Jupyter Notebook or Pycharm IDE.

3. Write a program to Tokenize and tag the given sentence using Morphological Analysis in NLP.
   Expected Skills/Ability: Morphological Analysis in NLP.
4. a) Write a program to get Synonyms from WordNet.
   b) Write a program to get Antonyms from WordNet.
   Expected Skills/Ability: Synonyms and Antonyms from WordNet.
5. a) Write a program to show the difference in the results of Stemming and Lemmatization.
   b) Write a program to Lemmatize words using WordNet.
   Expected Skills/Ability: Stemming and Lemmatization using NLTK and WordNet.
6. a) Write a program to print all stop words in NLP.
   b) Write a program to remove all stop words from a given text.
   Expected Skills/Ability: Removing stop words from a given text using NLTK.
11. Implement a case study of NLP application (Course end project).
    Expected Skills/Ability: Application of NLP.
EVALUATION METHOD, ASSESSMENT TOOL AND MAX. MARKS:

1. Continuous Internal Evaluation (CIE) - Total: 40 Marks
   Internal practical examination-I: 10
   Day to day evaluation: 10
   Viva-Voce: 10
   Course End Project: 10

2. Semester End Examination (SEE) - Total: 60 Marks
   Write-up: 20
   Experiment/program: 10
   Evaluation of results: 10
   Project Presentation on another experiment/program: 10
   Viva-Voce: 10
Experiments with Course Outcome (CO) and BLOOM'S LEVEL:

1. a) Write a program to Tokenize Text to words using NLTK.
   b) Write a program to Tokenize Text to Sentences using NLTK.
   CO-1, L-3
2. a) Write a program to remove numbers, punctuations, and whitespaces in a file.
   b) Write a program to Count Word Frequency in a file.
   CO-1, L-3
3. Write a program to Tokenize and tag the given sentence using Morphological Analysis in NLP.
   CO-1, L-3
6. a) Write a program to print all stop words in NLP.
   b) Write a program to remove all stop words from a given text.
   CO-1, L-3
To run the Python programs below, the Natural Language Toolkit (NLTK) has to be installed on your system.
The NLTK module is a comprehensive toolkit aimed at supporting the entire Natural Language Processing (NLP) workflow.
In order to install NLTK, run the following command in your terminal:
sudo pip install nltk
Then enter the Python shell in your terminal by typing python, and run:
import nltk
nltk.download('all')
The download will take quite some time because of the large number of tokenizers, chunkers, other algorithms, and corpora that have to be fetched.
Some terms that will be frequently used are :
# punctuation marks
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''

def Punctuation(string):
    # keep only the characters that are not punctuation marks
    no_punct = "".join(char for char in string if char not in punctuations)
    print(no_punct)

# Driver program
string = "Welcome???@@##$ to#$% NLP%$^$%^&LAB"
Punctuation(string)
# Python3 code to remove whitespace
def remove(string):
    return string.replace(" ", "")

# Driver Program
string = ' N L P '
print(remove(string))
# initialising string
ini_string = "AI123for127NLP"

# remove all digits from the string
res = ''.join(char for char in ini_string if not char.isdigit())

# printing result
print("final string : ", res)
b)
First, we create a text file whose words we want to count. Let this file be sample.txt with the following contents:
Mango banana apple pear
Banana grapes strawberry
Apple pear mango banana
Kiwi apple mango strawberry
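A minimal sketch for the word-count task, assuming the file is saved as sample.txt in the working directory (counting is case-insensitive here):

from collections import Counter

# read the file, lowercase it, and split it into words
with open("sample.txt") as f:
    words = f.read().lower().split()

# count how many times each word occurs
word_counts = Counter(words)
for word, count in word_counts.items():
    print(word, count)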
Tagging is typically the second step in the NLP pipeline, following tokenization.
The Universal tagset shown below is a simplified POS tagset; NLTK also provides other tagsets, such as the Penn Treebank (wsj) and Brown tagsets.
NOUN (nouns)
VERB (verbs)
ADJ (adjectives)
ADV (adverbs)
PRON (pronouns)
DET (determiners and articles)
ADP (prepositions and postpositions)
NUM (numerals)
CONJ (conjunctions)
PRT (particles)
. (punctuation marks)
X (a catch-all for other categories such as abbreviations or foreign words)
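A minimal sketch of tokenizing and tagging a sentence with NLTK's simplified universal tagset (the sample outputs that follow use Universal Dependencies style labels such as AUX and PUNCT, so the labels printed here may differ slightly):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
from nltk import word_tokenize, pos_tag

# tokenize the sentence, then map each token to a coarse POS category
sentence = "The weather is sunny."
print(pos_tag(word_tokenize(sentence), tagset='universal'))

Sample tagged sentences: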
What PRON
is AUX
the DET
weather NOUN
like ADP
today NOUN
? PUNCT
The DET
weather NOUN
is AUX
sunny ADJ
. PUNCT
I PRON
went VERB
to ADP
the DET
store NOUN
, PUNCT
but CCONJ
they PRON
were VERB
closed VERB
, PUNCT
so CCONJ
I PRON
had VERB
to PART
go VERB
to ADP
another DET
store NOUN
. PUNCT
Week-4
a. Write a program to get Synonyms from WordNet.
b. Write a program to get Antonyms from WordNet.
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
print(set(synonyms))
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(set(antonyms))
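Week-5
a. Write a program to show the difference in the results of Stemming and Lemmatization.
b. Write a program to Lemmatize words using WordNet.

A minimal sketch contrasting Porter stemming with WordNet lemmatization; the word list is an illustrative assumption chosen to match the stemmed output shown below:

import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["playing", "plays", "played", "play", "cries"]
# stemming chops suffixes, so "cries" becomes "cri"
for w in words:
    print(stemmer.stem(w))
# lemmatization maps words to dictionary forms, so "cries" becomes "cry"
for w in words:
    print(lemmatizer.lemmatize(w, pos='v'))

Program Output/ Expected Output (stemming):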
play
play
play
play
cri
Week-6
a. Write a program to print all stop words in NLP.
b. Write a program to remove all stop words from a given text.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(stop_words)
print(len(stop_words))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# converts the words in word_tokens to lower case and then checks whether
# they are present in stop_words or not
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]

# with no lower case conversion
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
Program Output/ Expected Output
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from',
'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than',
'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
"didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
"isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan',
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't",
'wouldn', "wouldn't"]
179
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop',
'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration',
'.']
Week-7
Write a Python program to apply Collocation extraction to word combinations in the text. Collocation examples are "break the rules," "free time," "draw a conclusion," "keep in mind," "get ready," and so on.
Collocations are two or more words that tend to appear frequently together, for example
– United States. There are many other words that can come after United, such as the
United Kingdom and United Airlines. As with many aspects of natural language
processing, context is very important. And for collocations, context is everything. In the
case of collocations, the context will be a document in the form of a list of words.
Discovering collocations in this list of words means finding common phrases that occur frequently throughout the text.
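The snippets below operate on a pre-built list called words. A minimal sketch of that setup using NLTK's webtext corpus; the choice of the Monty Python 'grail.txt' script is an assumption based on the sample output shown further down:

# load a lowercased word list for collocation discovery
import nltk
nltk.download('webtext')
nltk.download('stopwords')
from nltk.corpus import webtext

words = [w.lower() for w in webtext.words('grail.txt')]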
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.corpus import stopwords

bigram_collocation = BigramCollocationFinder.from_words(words)
bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15)

# filter out stop words and words shorter than three characters
stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset
bigram_collocation.apply_word_filter(filter_stops)
bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15)
# Loading Libraries
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
trigram_collocation = TrigramCollocationFinder.from_words(words)
trigram_collocation.apply_word_filter(filter_stops)
trigram_collocation.apply_freq_filter(3)
trigram_collocation.nbest(TrigramAssocMeasures.likelihood_ratio, 15)
[('black', 'knight'),
('clop', 'clop'),
('head', 'knight'),
('mumble', 'mumble'),
('squeak', 'squeak'),
('saw', 'saw'),
('holy', 'grail'),
('run', 'away'),
('french', 'guard'),
('cartoon', 'character'),
('iesu', 'domine'),
('pie', 'iesu'),
('round', 'table'),
('sir', 'robin'),
('clap', 'clap')]
[('clop', 'clop', 'clop'),
('mumble', 'mumble', 'mumble'),
('squeak', 'squeak', 'squeak'),
('saw', 'saw', 'saw'),
('pie', 'iesu', 'domine'),
('clap', 'clap', 'clap'),
('dona', 'eis', 'requiem'),
('brave', 'sir', 'robin'),
('heh', 'heh', 'heh'),
('king', 'arthur', 'music'),
('hee', 'hee', 'hee'),
('holy', 'hand', 'grenade'),
('boom', 'boom', 'boom'),
('...', 'dona', 'eis'),
('already', 'got', 'one')]
Week-8
Write a Python program to extract relationships, which allows obtaining structured information from unstructured sources such as raw text. Strictly stated, it is identifying relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence "Mark and Emily married yesterday," we can extract the information that Mark is Emily's husband.
The overwhelming amount of unstructured text data available today from traditional media sources as well as
newer ones, like social media, provides a rich source of information if the data can be structured. Named
Entity Extraction forms a core subtask to build knowledge from semi-structured and unstructured text sources.
Some of the first researchers working to extract information from unstructured texts recognized the
importance of “units of information” like names (such as person, organization, and location names) and
numeric expressions (such as time, date, money, and percent expressions). They coined the term “Named
Entity” in 1996 to represent these.
Considering recent increases in computing power and decreases in the costs of data storage, data scientists and
developers can build large knowledge bases that contain millions of entities and hundreds of millions of facts
about them. These knowledge bases are key contributors to intelligent computer behavior. Not surprisingly,
Named Entity Extraction operates at the core of several popular technologies such as smart assistants
(Siri, Google Now), machine reading, and deep interpretation of natural language.
This section explores how to perform Named Entity Extraction, formally known as "Named Entity Recognition and Classification" (NERC). In addition, it surveys open-source NERC tools that work with Python and compares the results obtained using them against hand-labeled data. The workflow involves:
Preparing semi-structured natural language data for ingestion using regular expressions; creating a custom
corpus in the Natural Language Toolkit
Using a suite of open source NERC tools to extract entities and store them in JSON format
Comparing the performance of the NERC tools
Implementing a simplistic ensemble classifier
The information extraction concepts and tools in this article constitute a first step in the overall process of
structuring unstructured data. They can be used to perform more complex natural language processing to
derive unique insights from large collections of unstructured data.
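A minimal sketch of named entity chunking with NLTK's built-in chunker; the sentence is taken from the expected output shown below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk

# tokenize, POS-tag, then chunk named entities such as PERSON and GPE
sentence = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(sentence))))

Program Output/ Expected Output: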
(S
(PERSON Bill/NNP)
works/VBZ
for/IN
Apple/NNP
so/IN
he/PRP
went/VBD
to/TO
(GPE Boston/NNP)
for/IN
a/DT
conference/NN
./.)
Week-9
Write a program to print the POS tags and parse tree of a given text.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser

# A pre-tagged example sentence as (word, POS) pairs
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Example text
sample_text = "The quick brown fox jumps over the lazy dog"
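The chunk grammar and the parsing call are not shown in the listing above; a minimal sketch follows. The grammar rules here are an assumption chosen to approximate the expected output, and the exact chunks depend on the tagger:

# define a simple chunk grammar and parse the POS-tagged sample text
grammar = r"""
  NP: {<DT>?<JJ>*<NN>}   # noun phrase
  V:  {<V.*>}            # verb
  VP: {<V>}              # verb phrase wrapping the verb chunk
  P:  {<IN>}             # preposition
"""
chunker = RegexpParser(grammar)
tagged = pos_tag(word_tokenize(sample_text))
print("After Extracting")
print(chunker.parse(tagged))
# chunker.parse(tagged).draw()  # optionally display the parse tree in a window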
After Extracting
(S
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(VP (V jumps/VBZ))
(P over/IN)
(NP the/DT lazy/JJ dog/NN))
Week-10
Write a program to print the bigrams and trigrams of a given text.
#Write a program to print Unigram, bigram, Trigram list from a given document.
from nltk.util import ngrams

sentence = 'Hello everyone. Welcome to class. You are studying language modelling article'

# unigrams
n = 1
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)

# bigrams
n = 2
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)

# trigrams
n = 3
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)
Program Output/ Expected Output
('Hello',)
('everyone.',)
('Welcome',)
('to',)
('class.',)
('You',)
('are',)
('studying',)
('language',)
('modelling',)
('article',)
('Hello', 'everyone.')
('everyone.', 'Welcome')
('Welcome', 'to')
('to', 'class.')
('class.', 'You')
('You', 'are')
('are', 'studying')
('studying', 'language')
('language', 'modelling')
('modelling', 'article')
Week-11
Implement a case study of an NLP application: SMS spam detection.

In today's society, practically everyone has a mobile phone, and they all regularly receive communications (SMS/email) on their phone. The essential point is that the majority of the messages received will be spam, with only a few being ham, i.e., necessary communications. Scammers create fraudulent text messages to deceive you into giving them your personal information, such as your password, account number, or Social Security number. If they have such information, they may be able to gain access to your email, bank, or other accounts.
In this case study, we are going to develop deep learning models using TensorFlow for SMS spam detection and also analyze the performance metrics of the different models.
We will be using the SMS Spam Collection Dataset, which contains SMS text and the corresponding label (ham or spam).
The dataset can be downloaded from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Reading the data
df = pd.read_csv("/content/spam.csv",encoding='latin-1')
df.head()
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)
df = df.rename(columns={'v1':'label','v2':'Text'})
df['label_enc'] = df['label'].map({'ham':0,'spam':1})
df.head()
sns.countplot(x=df['label'])
plt.show()
# Find average number of tokens in all sentences
avg_words_len=round(sum([len(i.split()) for i in df['Text']])/len(df['Text']))
print(avg_words_len)
# Finding Total no of unique words in corpus
s = set()
for sent in df['Text']:
for word in sent.split():
s.add(word)
total_words_length=len(s)
print(total_words_length)
# Splitting data for Training and testing
from sklearn.model_selection import train_test_split
X, y = np.asanyarray(df['Text']), np.asanyarray(df['label_enc'])
new_df = pd.DataFrame({'Text': X, 'label': y})
X_train, X_test, y_train, y_test = train_test_split(
new_df['Text'], new_df['label'], test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,accuracy_score
tfidf_vec = TfidfVectorizer().fit(X_train)
X_train_vec,X_test_vec = tfidf_vec.transform(X_train),tfidf_vec.transform(X_test)
baseline_model = MultinomialNB()
baseline_model.fit(X_train_vec,y_train)
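A quick sanity check of the Naive Bayes baseline on the held-out split, using the metrics imported above (a sketch; the exact reporting used in the original case study is not shown):

# evaluate the baseline on the test split
nb_preds = baseline_model.predict(X_test_vec)
print(accuracy_score(y_test, nb_preds))
print(classification_report(y_test, nb_preds))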
from tensorflow.keras.layers import TextVectorization
MAXTOKENS=total_words_length
OUTPUTLEN=avg_words_len
text_vec = TextVectorization(
max_tokens=MAXTOKENS,
standardize='lower_and_strip_punctuation',
output_mode='int',
output_sequence_length=OUTPUTLEN
)
text_vec.adapt(X_train)
embedding_layer = layers.Embedding(
input_dim=MAXTOKENS,
output_dim=128,
embeddings_initializer='uniform',
input_length=OUTPUTLEN
)
input_layer = layers.Input(shape=(1,), dtype=tf.string)
vec_layer = text_vec(input_layer)
embedding_layer_model = embedding_layer(vec_layer)
x = layers.GlobalAveragePooling1D()(embedding_layer_model)
x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)
output_layer = layers.Dense(1, activation='sigmoid')(x)
model_1 = keras.Model(input_layer, output_layer)
model_1.compile(optimizer='adam', loss=keras.losses.BinaryCrossentropy(
label_smoothing=0.5), metrics=['accuracy'])
from sklearn.metrics import precision_score, recall_score, f1_score

def compile_model(model):
    '''
    compile the model with the adam optimizer, binary cross-entropy loss,
    and accuracy as the tracked metric
    '''
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    return model
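Training and evaluation of model_1 are not shown in the listing above; a minimal sketch follows (the epoch count and the helper name evaluate_model are assumptions):

# train the embedding model on the raw text and validate on the test split
model_1.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

def evaluate_model(model, X, y):
    # predicted probabilities -> binary labels at a 0.5 threshold
    y_pred = (model.predict(X) > 0.5).astype(int).ravel()
    return {
        'accuracy': accuracy_score(y, y_pred),
        'precision': precision_score(y, y_pred),
        'recall': recall_score(y, y_pred),
        'f1': f1_score(y, y_pred),
    }

print(evaluate_model(model_1, X_test, y_test))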
Description
Possible viva questions, with answers, are provided below. Students can use them for practice.
1. What is the full form of NLP?
Explanation: The input and output of an NLP system can be speech or written text.
A. 2
B. 3
C. 4
D. 5
View Answer
Ans : A
Explanation: Enormous ambiguity exists when processing natural language.
A. Discourse Analysis
B. Automatic Summarization
C. Machine Translation
D. All of the above
View Answer
Ans : D
9. Which of the following is used to map the sentence plan into sentence structure?
A. Text planning
B. Sentence planning
C. Text Realization
D. None of the Above
View Answer
Ans : C
10. Which of the following is the study of the construction of words from primitive meaningful units?
A. Phonology
B. Morphology
C. Morpheme
D. Shonology
View Answer
Ans : B
11. In general, how many steps are there in Natural Language Processing?
A. 3
B. 4
C. 5
D. 6
View Answer
Ans : C
Explanation: There are in general five steps: Lexical Analysis, Syntactic Analysis, Semantic Analysis, Discourse Integration, and Pragmatic Analysis.
12. Parts-of-Speech tagging determines ___________
13. In linguistic morphology _____________ is the process for reducing inflected words to their root
form.
A. Rooting
B. Stemming
C. Text-Proofing
D. Both Rooting & Stemming
View Answer
Ans : B
Explanation: In linguistic morphology, Stemming is the process for reducing inflected words to their root form.
14. Many words have more than one meaning; we have to select the meaning which makes the
most sense in context. This can be resolved by ____________
A. Fuzzy Logic
B. Shallow Semantic Analysis
C. Word Sense Disambiguation
D. All of the above
View Answer
Ans : C
15. Which of the following are demerits of a Top-Down Parser?
A. It is hard to implement.
B. Slow speed
C. Inefficient
D. Both B and C
View Answer
Ans : D
Explanation: It is inefficient, as the search process has to be repeated if an error occurs, and it is slow; these are the demerits of a Top-Down Parser.
Explanation: Being the simplest style of grammar, and therefore the most widely used, is a merit of Context-Free Grammar.
17. "He lifted the beetle with red cap." contain which type of ambiguity ?
A. Lexical ambiguity
B. Syntax Level ambiguity
C. Referential ambiguity
D. None of the Above
View Answer
Ans : B
A. Lexical ambiguity
B. Syntax Level ambiguity
C. Semantic ambiguity
D. None of the Above
View Answer
Ans : D
19. Given a sound clip of a person or people speaking, determine the textual representation of the
speech.
A. Text-to-speech
B. Speech-to-text
C. Both A and B
D. None of the Above
View Answer
Ans : B