
VARDHAMAN COLLEGE OF ENGINEERING

(AUTONOMOUS)
Affiliated to JNTUH, approved by AICTE, Accredited by NAAC with A++ Grade
ISO 9001:2015 Certified
Kacharam, Shamshabad, Hyderabad – 501218, Telangana, India

Laboratory Manual
Natural Language Processing
(III B. Tech- I SEMESTER)
(VCE-R22)
Course Code-A8708
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AI & ML)

PROGRAM OUTCOMES (POs)
PO1: Engineering Knowledge: Apply knowledge of mathematics, science, engineering fundamentals
and an engineering specialization to the solution of complex engineering problems.
PO2: Problem Analysis: Identify, formulate, research literature and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences
and engineering sciences.
PO3: Design/Development of Solutions: Design solutions for complex engineering problems and
design system components or processes that meet specified needs with appropriate consideration
for public health and safety, and for cultural, societal and environmental concerns.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data and synthesis of
information to provide valid conclusions.
PO5: Modern Tool Usage: Create, select and apply appropriate techniques, resources and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
PO6: The Engineer and Society: Apply reasoning informed by contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
professional engineering practice.
PO7: Environment and Sustainability: Understand the impact of professional engineering solutions
in societal and environmental contexts and demonstrate knowledge of and need for sustainable
development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of engineering practice.
PO9: Individual and Team Work: Function effectively as an individual, and as a member or leader in
diverse teams and in multi-disciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations and give and receive
clear instructions.
PO11: Project Management and Finance: Demonstrate knowledge and understanding of
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long Learning: Recognize the need for, and have the preparation and ability to engage in,
independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)


PSO1: To collect requirements, and to analyze, design, implement and test software systems.
PSO2: To analyze errors and debug them in minimal time.
COURSE OVERVIEW:
Natural Language Processing is the art of extracting information from unstructured text. This course
covers the basics of Natural Language Processing, regular expressions, and text sentiment analysis
using machine learning. Natural Language Processing (NLP) is essentially about teaching machines to
understand human languages and extract meaning from text. The course covers the phases of NLP
processing and uses NLP libraries to analyse the given text document.
COURSE OBJECTIVE
The ultimate aim of NLP is to read, understand, and decode human words in a valuable manner.
Most NLP techniques depend on machine learning to obtain meaning from human languages.

COURSE OUTCOMES (COs)


After the completion of the course, the student will be able to:
CO# Course Outcomes
A8708.1 Identify the structure of words and documents for text preprocessing.
A8708.2 Choose an approach to parse the given text document.
A8708.3 Make use of semantic parsing to capture real meaning of text.
A8708.4 Select a language model to predict the probability of a sequence of words.
A8708.5 Examine the various applications of NLP.
BLOOM’S LEVEL OF THE COURSE OUTCOMES

CO#       Remember  Understand  Apply  Analyze  Evaluate  Create
          (L1)      (L2)        (L3)   (L4)     (L5)      (L6)
A8708.1                         ✔
A8708.2                         ✔
A8708.3                         ✔
A8708.4                         ✔
A8708.5                                ✔

COURSE ARTICULATION MATRIX

CO#/POs   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2

A8708.1   2 2 2 2
A8708.2   3 2 2 2 2
A8708.3   3 2 2 2 2
A8708.4   3 2 2 2 2
A8708.5   3 2 3 2 2
Note: 1-Low, 2-Medium, 3-High
LIST OF PROGRAMS FOR PRACTICE:

Tools and Techniques (common to all experiments):
1. A computer system with Ubuntu operating system
2. Python 3.x or above
3. Jupyter Notebook or PyCharm IDE

No.  Title of the Experiment / Expected Skills/Ability

1. a) Write a program to Tokenize Text to word using NLTK.
   b) Write a program to Tokenize Text to Sentence using NLTK.
   Expected skills: sentence and word tokenization using NLTK.

2. a) Write a program to remove numbers, punctuations, and whitespaces in a file.
   b) Write a program to Count Word Frequency in a file.
   Expected skills: removing numbers, punctuation, and whitespace from text; word count using NLTK.

3. Write a program to Tokenize and tag the given sentence using Morphological Analysis in NLP.
   Expected skills: morphological analysis in NLP.

4. a) Write a program to get Synonyms from WordNet.
   b) Write a program to get Antonyms from WordNet.
   Expected skills: synonyms and antonyms from WordNet.

5. a) Write a program to show the difference in the results of Stemming and Lemmatization.
   b) Write a program to Lemmatize Words Using WordNet.
   Expected skills: stemming and lemmatization using NLTK and WordNet.

6. a) Write a program to print all stop words in NLP.
   b) Write a program to remove all stop words from a given text.
   Expected skills: stop words from a given text using NLTK.

7. Write a Python program to apply Collocation extraction to word combinations in the text.
   Collocation examples are "break the rules," "free time," "draw a conclusion," "keep in mind,"
   "get ready," and so on.
   Expected skills: collocation extraction using NLTK.

8. Write a Python program to extract relationships, which allows obtaining structured information
   from unstructured sources such as raw text. Strictly stated, it is identifying relations (e.g.,
   acquisition, spouse, employment) among named entities (e.g., people, organizations, locations).
   For example, from the sentence "Mark and Emily married yesterday," we can extract the
   information that Mark is Emily's husband.
   Expected skills: entity relationship extraction using NLTK.

9. Write a program to print the POS and parse tree of a given text.
   Expected skills: drawing the parse tree and extracting POS using NLTK.

10. Write a program to print the bigrams and trigrams of a given text.
    Expected skills: n-grams using NLTK.

11. Implement a case study of an NLP application.
    Expected skills: application of NLP; course end project.
ASSESSMENT SCHEME R22

S.No  Evaluation Method                      Assessment Tool                                      Marks  Total
1     Continuous Internal Evaluation (CIE)   Internal practical examination-I                     10
                                             Day-to-day evaluation                                10     40
                                             Viva-Voce                                            10
                                             Course End Project                                   10
2     Semester End Examination (SEE)         Write-up                                             20
                                             Experiment/program                                   10
                                             Evaluation of results                                10     60
                                             Project presentation on another experiment/program   10
                                             Viva-Voce                                            10
No.  Title of the Experiment / CO / Bloom's Level

1. a) Write a program to Tokenize Text to word using NLTK.
   b) Write a program to Tokenize Text to Sentence using NLTK.
   CO-1, L-3

2. a) Write a program to remove numbers, punctuations, and whitespaces in a file.
   b) Write a program to Count Word Frequency in a file.
   CO-1, L-3

3. Write a program to Tokenize and tag the given sentence using Morphological Analysis in NLP.
   CO-1, L-3

4. a) Write a program to get Synonyms from WordNet.
   b) Write a program to get Antonyms from WordNet.
   CO-1, L-3

5. a) Write a program to show the difference in the results of Stemming and Lemmatization.
   b) Write a program to Lemmatize Words Using WordNet.
   CO-1, L-3

6. a) Write a program to print all stop words in NLP.
   b) Write a program to remove all stop words from a given text.
   CO-1, L-3

7. Write a Python program to apply Collocation extraction to word combinations in the text.
   Collocation examples are "break the rules," "free time," "draw a conclusion," "keep in mind,"
   "get ready," and so on.
   CO-2, L-3

8. Write a Python program to extract relationships, which allows obtaining structured information
   from unstructured sources such as raw text. Strictly stated, it is identifying relations (e.g.,
   acquisition, spouse, employment) among named entities (e.g., people, organizations, locations).
   For example, from the sentence "Mark and Emily married yesterday," we can extract the
   information that Mark is Emily's husband.
   CO-2, L-3

9. Write a program to print the POS and parse tree of a given text.
   CO-4, L-3

10. Write a program to print the bigrams and trigrams of a given text.
    CO-4, L-3

11. Implement a case study of an NLP application.
    CO-5, L-4

LAB SESSION PLAN

Week-1    a) Write a program to Tokenize Text to word using NLTK.
          b) Write a program to Tokenize Text to Sentence using NLTK.

Program Code / Snippet / Algorithm / Description

To run the below python program, (NLTK) natural language toolkit has to be installed in
your system.
The NLTK module is a massive tool kit, aimed at helping you with the entire Natural
Language Processing (NLP) methodology.
In order to install NLTK run the following commands in your terminal.
 sudo pip install nltk
 Then, enter the python shell in your terminal by simply typing python
 Type import nltk
 nltk.download('all')
The above installation will take quite some time due to the massive amount of tokenizers,
chunkers, other algorithms, and all of the corpora to be downloaded.
Some terms that will be frequently used are:

 Corpus – Body of text, singular. Corpora is the plural of this.
 Lexicon – Words and their meanings.
 Token – Each "entity" that is a part of whatever was split up based on rules. For
example, each word is a token when a sentence is "tokenized" into words. Each
sentence can also be a token if you tokenize the sentences out of a paragraph.
So, basically, tokenizing involves splitting sentences and words out of the body of
the text.

!pip install nltk

import nltk
nltk.download('punkt')

from nltk.tokenize import WordPunctTokenizer, sent_tokenize

text = "Hello everyone. Welcome to class. You are studying NLP article"

# Word tokenization
tokenizer1 = WordPunctTokenizer()
print(tokenizer1.tokenize(text))

# Sentence tokenization
print(sent_tokenize(text))
Program Output/ Expected Output
Requirement already satisfied: nltk in c:\users\lipu\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: joblib in c:\users\lipu\anaconda3\lib\site-packages (from
nltk) (0.17.0)
Requirement already satisfied: regex in c:\users\lipu\anaconda3\lib\site-packages (from nltk)
(2020.10.15)
Requirement already satisfied: tqdm in c:\users\lipu\anaconda3\lib\site-packages (from nltk)
(4.50.2)
Requirement already satisfied: click in c:\users\lipu\anaconda3\lib\site-packages (from nltk)
(7.1.2)

[nltk_data] Downloading package punkt to


[nltk_data] C:\Users\Lipu\AppData\Roaming\nltk_data...
[nltk_data] Package punkt is already up-to-date!
['Hello',
'everyone',
'.',
'Welcome',
'to',
'class',
'.',
'You',
'are',
'studying',
'NLP',
'article']

['Hello everyone.', 'Welcome to class.', 'You are studying NLP article']


Week-2    a) Write a program to remove numbers, punctuations, and whitespaces in a file.
          b) Write a program to Count Word Frequency in a file.

Program Code / Snippet / Algorithm / Description


1. Initialize the input string.
2. Check whether each character in the string is a punctuation mark.
3. If a character is a punctuation mark, replace it with the empty string.
4. Print the output string, which will be free of any punctuation.

# Python program to remove punctuation from a given string

# Function to remove punctuation
def Punctuation(string):

    # punctuation marks
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''

    # traverse the given string and if any punctuation
    # mark occurs, replace it with the empty string
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")

    # Print string without punctuation
    print(string)

# Driver program
string = "Welcome???@@##$ to#$% NLP%$^$%^&LAB"
Punctuation(string)
# Python3 code to remove whitespace
def remove(string):
    return string.replace(" ", "")

# Driver Program
string = ' N L P '
print(remove(string))

# Python code to demonstrate
# how to remove numeric digits from a string
# using join and isdigit

# initialising string
ini_string = "AI123for127NLP"

# printing initial ini_string
print("initial string : ", ini_string)

# using join and isdigit
# to remove numeric digits from the string
res = ''.join([i for i in ini_string if not i.isdigit()])

# printing result
print("final string : ", res)

b)

First, we create a text file in which we want to count the words. Let this file
be sample.txt with the following contents:
Mango banana apple pear
Banana grapes strawberry
Apple pear mango banana
Kiwi apple mango strawberry

text = open("test.txt", "r")


# Create an empty dictionary
d = dict()
# Loop through each line of the file
for line in text:
# Remove the leading spaces and newline character
line = line.strip()

# Convert the characters in line to


# lowercase to avoid case mismatch
line = line.lower()

# Split the line into words


words = line.split(" ")
# Iterate over each word in line
for word in words:
# Check if the word is already in dictionary
if word in d:
# Increment count of word by 1
d[word] = d[word] + 1
else:
# Add the word to dictionary with count 1
d[word] = 1

# Print the contents of dictionary


for key in list(d.keys()):
print(key, ":", d[key])

Program Output/ Expected Output

Welcome to NLP LAB

NLP

initial string : AI123for127NLP


final string : AIforNLP
mango : 3
banana : 3
apple : 3
pear : 2
grapes : 1
strawberry : 2
kiwi : 1
Week-3    Write a program to Tokenize and tag the given sentence using Morphological Analysis in NLP.

Program Code / Snippet / Algorithm / Description


import spacy
# The spaCy library is a popular library for natural language processing (NLP) in Python.
# It provides a wide range of capabilities for text processing, including tokenization,
# POS tagging, named entity recognition, and more. In this program, we use it for
# morphological analysis, which is the study of word structure and forms.

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Get input from the user for various sentences
interrogative_sentence = "What is the weather like today?"  # or interrogative_sentence = input("Enter an interrogative sentence: ")
declarative_sentence = "The weather is sunny."  # or declarative_sentence = input("Enter a declarative sentence: ")
complex_sentence = "I went to the store, but they were closed, so I had to go to another store."  # or complex_sentence = input("Enter a complex sentence using a conjunction: ")

# Process the sentences with spaCy
interrogative_doc = nlp(interrogative_sentence)
declarative_doc = nlp(declarative_sentence)
complex_doc = nlp(complex_sentence)

# Print the morphological analysis for the interrogative sentence
for token in interrogative_doc:
    print(token.text, token.pos_)
print("\n")

# Print the morphological analysis for the declarative sentence
for token in declarative_doc:
    print(token.text, token.pos_)
print("\n")

# Print the morphological analysis for the complex sentence
for token in complex_doc:
    print(token.text, token.pos_)

Part-of-speech (POS) tagging

POS tagging is the process of classifying and labelling words in a text into their parts of speech - noun,
adjective, determiner, etc.

Tagging is typically the second step in the NLP pipeline, following tokenisation.

The Universal tagset shown below is a simplified POS tagset; other NLTK tagsets include wsj and brown.

 NOUN (nouns)
 VERB (verbs)
 ADJ (adjectives)
 ADV (adverbs)
 PRON (pronouns)
 DET (determiners and articles)
 ADP (prepositions and postpositions)
 NUM (numerals)
 CONJ (conjunctions)
 PRT (particles)
 . (punctuation marks)
 X (a catch-all for other categories such as abbreviations or foreign words)

Program Output/ Expected Output

What PRON
is AUX
the DET
weather NOUN
like ADP
today NOUN
? PUNCT

The DET
weather NOUN
is AUX
sunny ADJ
. PUNCT

I PRON
went VERB
to ADP
the DET
store NOUN
, PUNCT
but CCONJ
they PRON
were VERB
closed VERB
, PUNCT
so CCONJ
I PRON
had VERB
to PART
go VERB
to ADP
another DET
store NOUN
. PUNCT
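
As a supplement to the Universal tagset discussion above, here is a minimal sketch (not part of the original manual) of POS tagging with NLTK; the sentence and the tag output shown are illustrative.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

from nltk import word_tokenize, pos_tag

# tagset='universal' maps the fine-grained Penn Treebank tags onto the 12-tag Universal set
tokens = word_tokenize("The weather is sunny today.")
print(pos_tag(tokens, tagset='universal'))
# e.g. [('The', 'DET'), ('weather', 'NOUN'), ('is', 'VERB'), ('sunny', 'ADJ'), ('today', 'NOUN'), ('.', '.')]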
Week-4    a. Write a program to get Synonyms from WordNet.
          b. Write a program to get Antonyms from WordNet.

Program Code / Snippet / Algorithm / Description

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
print(set(synonyms))

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(antonyms))

Program Output/ Expected Output


{'well', 'undecomposed', 'right', 'goodness', 'dependable', 'respectable',
'adept', 'serious', 'expert', 'effective', 'thoroughly', 'beneficial',
'trade_good', 'just', 'soundly', 'in_force', 'upright', 'good', 'proficient',
'unspoiled', 'in_effect', 'safe', 'skillful', 'commodity', 'salutary', 'sound',
'honest', 'dear', 'near', 'full', 'estimable', 'ripe', 'unspoilt', 'skilful',
'practiced', 'secure', 'honorable'}
{'badness', 'ill', 'evil', 'evilness', 'bad'}
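
A small supplementary sketch (not in the original manual): every WordNet synset also carries a definition and usage examples, which help in picking the right sense of a word.

from nltk.corpus import wordnet

syn = wordnet.synsets("good")[0]
print(syn.name())        # e.g. 'good.n.01'
print(syn.definition())  # e.g. 'benefit'
print(syn.examples())    # e.g. ['for your own good', ...]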
Week-5    a. Write a program to show the difference in the results of Stemming and Lemmatization.
          b. Write a program to Lemmatize Words Using WordNet.

Program Code / Snippet / Algorithm / Description

from nltk.stem import WordNetLemmatizer


# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))
print(lemmatizer.lemmatize("crying", 'v'))

from nltk.stem import PorterStemmer


# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))
print(porter.stem("crying"))

Program Output/ Expected Output


play
play
play
play
cry

play
play
play
play
cri
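
A supplementary note (not in the original manual): WordNetLemmatizer assumes the noun POS ('n') by default, which is why passing the correct part of speech matters.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Without a POS hint the lemmatizer treats the word as a noun, so verb forms pass through unchanged
print(lemmatizer.lemmatize("playing"))       # playing
print(lemmatizer.lemmatize("playing", 'v'))  # play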
Week-6    a. Write a program to print all stop words in NLP.
          b. Write a program to remove all stop words from a given text.

Program Code / Snippet / Algorithm / Description

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(stop_words)
print(len(stop_words))

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Version 1: convert the words in word_tokens to lower case and then check
# whether they are present in stop_words or not
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]

# Version 2: with no lower case conversion ("This" is kept, since it is not in stop_words)
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
Program Output/ Expected Output
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll",
'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from',
'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once',
'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than',
'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now',
'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
"didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
"isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan',
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't",
'wouldn', "wouldn't"]
179

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop',
'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration',
'.']
Week-7    Write a Python program to apply Collocation extraction to word combinations in the text.
          Collocation examples are "break the rules," "free time," "draw a conclusion," "keep in
          mind," "get ready," and so on.

Program Code / Snippet / Algorithm / Description

Collocations are two or more words that tend to appear frequently together, for example
"United States". Many other words can follow "United", as in "United Kingdom" and "United
Airlines". As with many aspects of natural language processing, context is very important,
and for collocations, context is everything. In the case of collocations, the context is a
document in the form of a list of words. Discovering collocations in this list of words
means finding common phrases that occur frequently throughout the text.

from nltk.corpus import webtext

# used to find bigrams, which are pairs of words
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Loading the data
words = [w.lower() for w in webtext.words(
    'C:\\Geeksforgeeks\\python_and_grail.txt')]

bigram_collocation = BigramCollocationFinder.from_words(words)
print(bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15))

from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset

bigram_collocation.apply_word_filter(filter_stops)
print(bigram_collocation.nbest(BigramAssocMeasures.likelihood_ratio, 15))

# Loading Libraries
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

# Loading data - text file
words = [w.lower() for w in webtext.words(
    'C:\\Geeksforgeeks\\python_and_grail.txt')]

trigram_collocation = TrigramCollocationFinder.from_words(words)
trigram_collocation.apply_word_filter(filter_stops)
trigram_collocation.apply_freq_filter(3)

print(trigram_collocation.nbest(TrigramAssocMeasures.likelihood_ratio, 15))

Program Output/ Expected Output


[("'", 's'),
('arthur', ':'),
('#', '1'),
("'", 't'),
('villager', '#'),
('#', '2'),
(']', '['),
('1', ':'),
('oh', ','),
('black', 'knight'),
('ha', 'ha'),
(':', 'oh'),
("'", 're'),
('galahad', ':'),
('well', ',')]

[('black', 'knight'),
('clop', 'clop'),
('head', 'knight'),
('mumble', 'mumble'),
('squeak', 'squeak'),
('saw', 'saw'),
('holy', 'grail'),
('run', 'away'),
('french', 'guard'),
('cartoon', 'character'),
('iesu', 'domine'),
('pie', 'iesu'),
('round', 'table'),
('sir', 'robin'),
('clap', 'clap')]
[('clop', 'clop', 'clop'),
('mumble', 'mumble', 'mumble'),
('squeak', 'squeak', 'squeak'),
('saw', 'saw', 'saw'),
('pie', 'iesu', 'domine'),
('clap', 'clap', 'clap'),
('dona', 'eis', 'requiem'),
('brave', 'sir', 'robin'),
('heh', 'heh', 'heh'),
('king', 'arthur', 'music'),
('hee', 'hee', 'hee'),
('holy', 'hand', 'grenade'),
('boom', 'boom', 'boom'),
('...', 'dona', 'eis'),
('already', 'got', 'one')]
Week-8    Write a Python program to extract relationships, which allows obtaining structured
          information from unstructured sources such as raw text. Strictly stated, it is identifying
          relations (e.g., acquisition, spouse, employment) among named entities (e.g., people,
          organizations, locations). For example, from the sentence "Mark and Emily married
          yesterday," we can extract the information that Mark is Emily's husband.

Program Code / Snippet / Algorithm / Description

The overwhelming amount of unstructured text data available today from traditional media sources as well as
newer ones, like social media, provides a rich source of information if the data can be structured. Named
Entity Extraction forms a core subtask to build knowledge from semi-structured and unstructured text sources.
Some of the first researchers working to extract information from unstructured texts recognized the
importance of “units of information” like names (such as person, organization, and location names) and
numeric expressions (such as time, date, money, and percent expressions). They coined the term “Named
Entity” in 1996 to represent these.

Considering recent increases in computing power and decreases in the costs of data storage, data scientists and
developers can build large knowledge bases that contain millions of entities and hundreds of millions of facts
about them. These knowledge bases are key contributors to intelligent computer behavior. Not surprisingly,
Named Entity Extraction operates at the core of several popular technologies such as smart assistants
(Siri, Google Now), machine reading, and deep interpretation of natural language.

This experiment explores how to perform Named Entity Extraction, formally known as "Named Entity Recognition
and Classification" (NERC). In addition, it surveys open-source NERC tools that work with Python
and compares the results obtained using them against hand-labeled data.

The specific steps include:

 Preparing semi-structured natural language data for ingestion using regular expressions; creating a custom
corpus in the Natural Language Toolkit
 Using a suite of open source NERC tools to extract entities and store them in JSON format
 Comparing the performance of the NERC tools
 Implementing a simplistic ensemble classifier

The information extraction concepts and tools in this article constitute a first step in the overall process of
structuring unstructured data. They can be used to perform more complex natural language processing to
derive unique insights from large collections of unstructured data.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import word_tokenize, pos_tag, ne_chunk

input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))
Program Output/ Expected Output

(S
(PERSON Bill/NNP)
works/VBZ
for/IN
Apple/NNP
so/IN
he/PRP
went/VBD
to/TO
(GPE Boston/NNP)
for/IN
a/DT
conference/NN
./.)
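
A minimal supplementary sketch (not part of the original manual): the tree returned by ne_chunk can be walked to collect the named entities as structured (text, label) pairs, which is the first step toward relation extraction.

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

def extract_entities(text):
    entities = []
    for node in ne_chunk(pos_tag(word_tokenize(text))):
        # Named entities appear as subtrees whose label is the entity type
        if isinstance(node, Tree):
            entities.append((" ".join(tok for tok, tag in node.leaves()), node.label()))
    return entities

print(extract_entities("Bill works for Apple so he went to Boston for a conference."))
# e.g. [('Bill', 'PERSON'), ('Boston', 'GPE')]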
Week-9    Write a program to print the POS and parse tree of a given Text.

Program Code / Snippet / Algorithm / Description

import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"),
            ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = "NP: {<DT>?<JJ>*<NN>}"

NPChunker = nltk.RegexpParser(pattern)
result = NPChunker.parse(sentence)
result.draw()

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser

# Example text
sample_text = "The quick brown fox jumps over the lazy dog"

# Find all parts of speech in above sentence


tagged = pos_tag(word_tokenize(sample_text))

#Extract all parts of speech from any text


chunker = RegexpParser("""
NP: {<DT>?<JJ>*<NN>} #To extract Noun Phrases
P: {<IN>} #To extract Prepositions
V: {<V.*>} #To extract Verbs
PP: {<p> <NP>} #To extract Prepositional Phrases
VP: {<V> <NP|PP>*} #To extract Verb Phrases
""")

# Print all parts of speech in above sentence


output = chunker.parse(tagged)
print("After Extracting\n", output)
Program Output/ Expected Output

After Extracting
(S
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(VP (V jumps/VBZ))
(P over/IN)
(NP the/DT lazy/JJ dog/NN))
Week-10 Write a program to print bigram and Trigram of a given Text.

Program Code / Snippet / Algorithm / Description

# Write a program to print the unigram, bigram, and trigram lists from a given document.

from nltk import ngrams

sentence = 'Hello everyone. Welcome to class. You are studying language modelling article'

# Unigrams
n = 1
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)

# Bigrams
n = 2
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)

# Trigrams
n = 3
x = ngrams(sentence.split(), n)
for grams in x:
    print(grams)
Program Output/ Expected Output

('Hello',)
('everyone.',)
('Welcome',)
('to',)
('class.',)
('You',)
('are',)
('studying',)
('language',)
('modelling',)
('article',)

('Hello', 'everyone.')
('everyone.', 'Welcome')
('Welcome', 'to')
('to', 'class.')
('class.', 'You')
('You', 'are')
('are', 'studying')
('studying', 'language')
('language', 'modelling')
('modelling', 'article')

('Hello', 'everyone.', 'Welcome')
('everyone.', 'Welcome', 'to')
('Welcome', 'to', 'class.')
('to', 'class.', 'You')
('class.', 'You', 'are')
('You', 'are', 'studying')
('are', 'studying', 'language')
('studying', 'language', 'modelling')
('language', 'modelling', 'article')
Week-11   Implement a case study of an NLP application.

Program Code / Snippet / Algorithm / Description


Spam SMS Classification
Spam SMS classification is an important application of Natural Language Processing (NLP). The
goal is to create a system that can accurately classify SMS messages as spam or ham. With the
growing number of unsolicited messages that people receive on their phones, such a system
improves SMS communication by filtering out unwanted and potentially dangerous messages.
Using NLP techniques, we develop a model that can correctly differentiate between spam and
legitimate messages. The significance of this project lies in its contribution to individual
security and privacy. Here is a basic approach to SMS spam detection in Python.

In today’s society, practically everyone has a mobile phone, and they all get communications
(SMS/email) on their phone regularly. But the essential point is that the majority of the messages
received will be spam, with only a few being ham, i.e. necessary communications. Scammers create
fraudulent text messages to deceive you into giving them your personal information, such as your
password, account number, or Social Security number. With such information, they may be able to
gain access to your email, bank, or other accounts.
In this case study, we develop various deep learning models using TensorFlow for SMS spam
detection and also analyze the performance metrics of the different models.
We will be using the SMS Spam Detection Dataset, which contains SMS text and the corresponding
label (ham or spam). The dataset can be downloaded from
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Reading the data
df = pd.read_csv("/content/spam.csv", encoding='latin-1')
df.head()
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df = df.rename(columns={'v1': 'label', 'v2': 'Text'})
df['label_enc'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

sns.countplot(x=df['label'])
plt.show()

# Find the average number of tokens in all sentences
avg_words_len = round(sum([len(i.split()) for i in df['Text']]) / len(df['Text']))
print(avg_words_len)

# Find the total number of unique words in the corpus
s = set()
for sent in df['Text']:
    for word in sent.split():
        s.add(word)
total_words_length = len(s)
print(total_words_length)

# Splitting data for training and testing
from sklearn.model_selection import train_test_split

X, y = np.asanyarray(df['Text']), np.asanyarray(df['label_enc'])
new_df = pd.DataFrame({'Text': X, 'label': y})
X_train, X_test, y_train, y_test = train_test_split(
    new_df['Text'], new_df['label'], test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

# Baseline: TF-IDF features with a Multinomial Naive Bayes classifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

tfidf_vec = TfidfVectorizer().fit(X_train)
X_train_vec, X_test_vec = tfidf_vec.transform(X_train), tfidf_vec.transform(X_test)

baseline_model = MultinomialNB()
baseline_model.fit(X_train_vec, y_train)

# Text vectorization and embedding layers for the neural models
from tensorflow.keras.layers import TextVectorization

MAXTOKENS = total_words_length
OUTPUTLEN = avg_words_len

text_vec = TextVectorization(
    max_tokens=MAXTOKENS,
    standardize='lower_and_strip_punctuation',
    output_mode='int',
    output_sequence_length=OUTPUTLEN
)
text_vec.adapt(X_train)

embedding_layer = layers.Embedding(
    input_dim=MAXTOKENS,
    output_dim=128,
    embeddings_initializer='uniform',
    input_length=OUTPUTLEN
)

# Model 1: dense model over averaged embeddings
input_layer = layers.Input(shape=(1,), dtype=tf.string)
vec_layer = text_vec(input_layer)
embedding_layer_model = embedding_layer(vec_layer)
x = layers.GlobalAveragePooling1D()(embedding_layer_model)
x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)
output_layer = layers.Dense(1, activation='sigmoid')(x)
model_1 = keras.Model(input_layer, output_layer)

model_1.compile(optimizer='adam', loss=keras.losses.BinaryCrossentropy(
    label_smoothing=0.5), metrics=['accuracy'])
from sklearn.metrics import precision_score, recall_score, f1_score

def compile_model(model):
    '''
    Simply compile the model with the Adam optimizer.
    '''
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])

def fit_model(model, epochs, X_train=X_train, y_train=y_train,
              X_test=X_test, y_test=y_test):
    '''
    Fit the model with the given epochs, train
    and test data.
    '''
    history = model.fit(X_train,
                        y_train,
                        epochs=epochs,
                        validation_data=(X_test, y_test),
                        validation_steps=int(0.2 * len(X_test)))
    return history

def evaluate_model(model, X, y):
    '''
    Evaluate the model and return accuracy,
    precision, recall and f1-score.
    '''
    y_preds = np.round(model.predict(X))
    accuracy = accuracy_score(y, y_preds)
    precision = precision_score(y, y_preds)
    recall = recall_score(y, y_preds)
    f1 = f1_score(y, y_preds)

    model_results_dict = {'accuracy': accuracy,
                          'precision': precision,
                          'recall': recall,
                          'f1-score': f1}

    return model_results_dict

# Model 2: bidirectional LSTM
input_layer = layers.Input(shape=(1,), dtype=tf.string)
vec_layer = text_vec(input_layer)
embedding_layer_model = embedding_layer(vec_layer)
bi_lstm = layers.Bidirectional(layers.LSTM(
    64, activation='tanh', return_sequences=True))(embedding_layer_model)
lstm = layers.Bidirectional(layers.LSTM(64))(bi_lstm)
flatten = layers.Flatten()(lstm)
dropout = layers.Dropout(.1)(flatten)
x = layers.Dense(32, activation='relu')(dropout)
output_layer = layers.Dense(1, activation='sigmoid')(x)
model_2 = keras.Model(input_layer, output_layer)

compile_model(model_2)  # compile the model
history_2 = fit_model(model_2, epochs=5)  # fit the model
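
A short usage sketch (assumed, not part of the original code): the helpers defined above can be used to compare the Naive Bayes baseline against the trained LSTM model.

# Evaluate the TF-IDF + Naive Bayes baseline (all names as defined above)
baseline_results = evaluate_model(baseline_model, X_test_vec, y_test)
print(baseline_results)

# Evaluate the bidirectional LSTM model directly on the raw text inputs
model_2_results = evaluate_model(model_2, X_test, y_test)
print(model_2_results)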

Program Output/ Expected Output


POSSIBLE VIVA QUESTIONS

Description

Possible viva questions are provided below with answers. Students can use them for practice.
1. What is the full form of NLP?

A. Natural Language Processing


B. Nature Language Processing
C. Natural Language Process
D. Natural Language pages
Ans : A

Explanation: Natural Language Processing (NLP) refers to the AI method of communicating with
intelligent systems using a natural language such as English.

2. What are the input and output of an NLP system?

A. Speech and noise


B. Speech and Written Text
C. Noise and Written Text
D. Noise and value
Ans : B

Explanation: The input and output of an NLP system can be speech or written text.

3. How many Components of NLP are there?

A. 2
B. 3
C. 4
D. 5
Ans : A

Explanation: There are 2 components of NLP: NLU (Natural Language Understanding) and NLG
(Natural Language Generation).

4. What is the full form of NLU?

A. Nature Language Understanding


B. Natural Long Understanding
C. Natural Language Understanding
D. None of the Above
Ans : C

Explanation: Natural Language Understanding is the full form of NLU.

5. What is the full form of NLG?

A. Natural Language Generation


B. Natural Language Genes
C. Natural Language Growth
D. Natural Language Generator
Ans : A

Explanation: Natural Language Generation is the full form of NLG.

6. What is the main challenge/s of NLP?

A. Handling Ambiguity of Sentences


B. Handling Tokenization
C. Handling POS-Tagging
D. All of the above
Ans : A

Explanation: Enormous ambiguity exists when processing natural language.

7. Which of the following includes major tasks of NLP?

A. Discourse Analysis
B. Automatic Summarization
C. Machine Translation
D. All of the above
Ans : D

Explanation: There is an even bigger list of NLP tasks.

8. What is Morphological Segmentation?


A. Does Discourse Analysis
B. is an extension of propositional logic
C. Separate words into individual morphemes and identify the class of the morphemes
D. None of the Above
Ans : C

Explanation: Morphological segmentation separates words into individual morphemes and
identifies the class of the morphemes.

9. Which of the following is used to map the sentence plan into sentence structure?

A. Text planning
B. Sentence planning
C. Text Realization
D. None of the Above
Ans : C

Explanation: Text realization maps the sentence plan into sentence structure.

10. Which of the following is the study of the construction of words from primitive meaningful units?

A. Phonology
B. Morphology
C. Morpheme
D. Shonology
Ans : B

Explanation: Morphology is the study of the construction of words from primitive meaningful units.

11. How many steps of NLP are there?

A. 3
B. 4
C. 5
D. 6
Ans : C

Explanation: There are in general five steps: Lexical Analysis, Syntactic Analysis, Semantic Analysis,
Discourse Integration, and Pragmatic Analysis.
12. Parts-of-Speech tagging determines ___________

A. part-of-speech for each word dynamically as per meaning of the sentence


B. part-of-speech for each word dynamically as per sentence structure
C. all part-of-speech for a specific word given as input
D. All of the above
Ans : D

Explanation: POS tagging can determine the part of speech of a word from the meaning of the
sentence, from the sentence structure, and can list all parts of speech for a given word.

13. In linguistic morphology _____________ is the process for reducing inflected words to their root
form.

A. Rooting
B. Stemming
C. Text-Proofing
D. Both Rooting & Stemming
Ans : B

Explanation: In linguistic morphology Stemming is the process for reducing inflected words to their
root form.

14. Many words have more than one meaning; we have to select the meaning which makes the
most sense in context. This can be resolved by ____________

A. Fuzzy Logic
B. Shallow Semantic Analysis
C. Word Sense Disambiguation
D. All of the above
Ans : C

Explanation: Word Sense Disambiguation selects the meaning which makes the most sense in
context; shallow semantic analysis does not cover it.

15. Which of the following is demerits of Top-Down Parser?

A. It is hard to implement.
B. Slow speed
C. inefficient
D. Both B and C
Ans : D

Explanation: A top-down parser is slow and inefficient, as the search process has to be repeated
whenever an error occurs; these are its demerits.

16. Which of the following is merits of Context-Free Grammar?

A. simplest style of grammar


B. They are highly precise.
C. High speed
D. All of the above
Ans : A

Explanation: Context-Free Grammar is the simplest style of grammar and is therefore widely used.

17. "He lifted the beetle with red cap." contain which type of ambiguity ?

A. Lexical ambiguity
B. Syntax Level ambiguity
C. Referential ambiguity
D. None of the Above
Ans : B

Explanation: The phrase "with red cap" can attach either to "lifted" or to "the beetle", so the
sentence has syntax-level (attachment) ambiguity.

18. "I am tired." Contain which type of ambiguity ?

A. Lexical ambiguity
B. Syntax Level ambiguity
C. Sementic ambiguity
D. None of the Above
View Answer
Ans : D

Explanation: It contain Referential ambiguity.

19. Given a sound clip of a person or people speaking, determine the textual representation of the
speech.
A. Text-to-speech
B. Speech-to-text
C. Both A and B
D. None of the Above
Ans : B

Explanation: Speech-to-text converts a sound clip of speech into its textual representation.

20. What is Machine Translation?

A. Converts one human language to another


B. Converts human language to machine language
C. Converts any human language to English
D. Converts Machine language to human language
Ans : A

Explanation: The best-known example of machine translation is Google Translate.
