Jal Patel NLP
Name: Jal Patel
Branch: CSE(AI-ML)
Batch: A4
Semester: 7th
LAB MANUAL
Sr. No. | Practical Name | Date of Submission | Faculty Signature
1. Write a program to implement word Tokenizer, Sentence and Paragraph Tokenizers.
2. Write a python program to eliminate stopwords using NLTK.
3. Write a python program to perform Parts of Speech tagging using NLTK.
4. Write a python program to perform lemmatization using NLTK.
5. Write a python program for chunking using NLTK.
6. Write a python program to perform stemming using NLTK.
7. Write a python program to perform Named Entity Recognition using NLTK.
8. Write a program for different feature extraction techniques used in NLP.
9. Write a program to implement all the NLP Pre-Processing Techniques required to perform further NLP tasks.
10. Write a program to implement both user-defined and pre-defined functions to generate: Unigrams, Bigrams, Trigrams, N-grams.
11. Write a program to identify all antonyms and synonyms of a word.
12. Write a program to find hyponymy, homonymy, polysemy for a given word.
13. Write a program to calculate the score and polarity of text data using the VADER analyzer and TextBlob.
Practical 1
Write a program to implement word Tokenizer, Sentence and Paragraph
Tokenizers
• Tokenization:
A. Sentence Tokenization
Sentence tokenization refers to splitting a given text into individual sentences. This is useful
for analysing sentence structures and is a typical first step in NLP applications like sentiment
analysis, text summarization, and translation. Python's nltk library provides a simple function
sent_tokenize() for this task.
Code:
# Step 1: Import the necessary libraries
import nltk
nltk.download('punkt')  # Download necessary tokenizer models
from nltk.tokenize import sent_tokenize

# Step 2: Split example text (illustrative) into sentences
text = "NLP is a fascinating field. It allows computers to understand human language. Tokenization is usually the first step."
print(sent_tokenize(text))
Output:
B. Word Tokenization
Word tokenization splits a text into individual words, which is essential for tasks like frequency
analysis, text classification, and language modeling. nltk provides the word_tokenize()
function for this.
Code:
# Step 1: Import necessary libraries
import nltk
nltk.download('punkt')  # Download tokenizer models
from nltk.tokenize import word_tokenize

# Step 2: Split example text (illustrative) into individual words
text = "Word tokenization breaks a sentence into its individual words and punctuation."
print(word_tokenize(text))
Output:
C. Paragraph Tokenization
Paragraph tokenization involves dividing large bodies of text into separate paragraphs. This
method is useful in organizing large documents for structured analysis. Though there is no
direct function in nltk for paragraph tokenization, we can achieve it by splitting the text on blank lines ('\n\n').
Code:
# Step 1: Define the paragraph tokenization function
def tokenize_paragraphs(text):
    paragraphs = text.split('\n\n')  # Split text on blank lines
    return paragraphs

# Step 2: Example text containing multiple paragraphs
text = """Organizations today have large volumes of voice and text data from various communication channels
like emails, text messages, social media newsfeeds, video, audio, and more.

They use NLP software to automatically process this data, analyze the intent or sentiment in the
message, and respond in real time to human communication."""

# Step 3: Split the text into paragraphs and print them
for paragraph in tokenize_paragraphs(text):
    print(paragraph)
    print('-' * 40)
Output:
Practical 2
Write a python program to eliminate stopwords using NLTK
• Introduction
Stopwords are common words in a language (like "the", "is", "in", etc.) that often do not add
significant meaning to a sentence and can be ignored during text analysis. In Natural Language
Processing (NLP), removing stopwords is an important preprocessing step to focus on
meaningful words that contribute to the context of the text.
Python's Natural Language Toolkit (NLTK) provides a ready-to-use list of stopwords for
different languages. By using the stopwords module, we can easily filter out these words from
the input text.
Code:
# Step 1: Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Step 2: Remove English stopwords from the tokenized text
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    filtered_text = [word for word in word_tokenize(text) if word.lower() not in stop_words]
    return filtered_text
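A short usage sketch for the function above (the sample sentence is illustrative; the downloads fetch the tokenizer models and stopword list on a fresh install):
nltk.download('punkt')
nltk.download('stopwords')

text = "This is a simple sentence written to show how the common stopwords are removed."
print(remove_stopwords(text))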
Output:
Practical 3
Write a python program to perform Parts of Speech tagging using NLTK.
• Introduction
Parts of Speech (POS) tagging is a technique in Natural Language Processing (NLP) that
assigns a specific part of speech (noun, verb, adjective, etc.) to each word in a sentence. This
helps in understanding the syntactic structure of a sentence. The Natural Language Toolkit
(NLTK) in Python provides a convenient method for performing POS tagging using the
pos_tag() function.
Code:
# Step 1: Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize

# Step 2: Tag each token with its part of speech
def pos_tag_text(text):
    tagged_words = nltk.pos_tag(word_tokenize(text))
    return tagged_words
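A short usage sketch (the sample sentence is illustrative; the downloads fetch the tokenizer and tagger models that word_tokenize and pos_tag depend on):
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "NLTK makes part-of-speech tagging straightforward for English sentences."
print(pos_tag_text(text))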
Output:
Practical 4
Write a python program to perform lemmatization using NLTK
• Introduction
Lemmatization is a process in Natural Language Processing (NLP) where words are reduced
to their base or dictionary form, known as the lemma. Unlike stemming, lemmatization takes
into account the morphological analysis of words, ensuring that the base form is a valid word
in the language. For example, "running" becomes "run", and "better" becomes "good". This
helps in standardizing words to improve text analysis and retrieval.
NLTK provides the WordNet Lemmatizer, which is used for lemmatization based on WordNet,
a lexical database of the English language.
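Note that mappings such as "better" → "good" only appear when the lemmatizer is told the word's part of speech; a minimal illustration (assuming the WordNet corpus is already downloaded):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos='v'))  # -> run
print(lemmatizer.lemmatize("better", pos='a'))   # -> good
print(lemmatizer.lemmatize("better"))            # default noun POS leaves the word unchanged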
Code:
# Step 1: Import necessary libraries
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Step 2: Download the required resources
nltk.download('punkt')
nltk.download('wordnet')

# Step 3: Find and print the path of the downloaded wordnet resource
try:
    wordnet_path = nltk.data.find('corpora/wordnet.zip')
    print(f"WordNet is located at: {wordnet_path}")
except LookupError:
    print("WordNet not found, please make sure it's downloaded correctly.")

# Step 4: Lemmatize every token in the text and rebuild the string
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    lemmatized_text = ' '.join(lemmatizer.lemmatize(word) for word in words)
    return lemmatized_text
# Step 5: Apply lemmatization to example text (illustrative)
text = "The cats are running and the leaves were falling from the trees."
lemmatized_text = lemmatize_text(text)
print(lemmatized_text)
Output:
Practical 5
Write a python program for chunking using NLTK
• Introduction
Chunking, also known as shallow parsing, is a process in Natural Language Processing (NLP)
that divides a text into syntactically correlated parts like noun phrases (NP), verb phrases (VP),
and more. The primary goal of chunking is to label segments of the sentence and group them
together based on their syntactical roles. It can be considered a higher-level task than
tokenization or part-of-speech (POS) tagging. Chunking provides us with more meaningful
groups of words that help in understanding the structure of a sentence.
Chunking is useful in various NLP tasks such as information extraction, named entity recognition, and question answering; the basic pipeline is outlined below.
• Steps in Chunking:
1. Tokenization: Splitting the sentence into individual words.
2. POS Tagging: Assigning part-of-speech tags (like noun, verb, adjective) to each token.
3. Chunking: Grouping words into meaningful phrases such as noun phrases (NP), verb
phrases (VP), etc.
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
# Define a chunker: here a noun phrase (NP) is an optional determiner,
# any number of adjectives, and a noun (a simple, commonly used grammar)
def chunk_sentence(sentence):
    tagged_words = pos_tag(word_tokenize(sentence))   # Steps 1 and 2: tokenize and POS-tag
    chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")    # Step 3: group words into NP chunks
    chunked_sentence = chunker.parse(tagged_words)
    return chunked_sentence
# Example sentence
sentence = "The quick brown fox jumps over the lazy dog"
Output:
Practical 6
Write a python program to perform stemming using NLTK.
• Introduction
The Porter Stemmer, developed by Martin Porter, is one of the most widely used stemming
algorithms in NLP tasks. It applies a set of rules to remove common morphological and
inflexional endings from words in English.
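Because these rules are purely mechanical, the stems produced are not always dictionary words; a quick check (the word list is illustrative):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# e.g. 'studies' -> 'studi' and 'happily' -> 'happili'
print([stemmer.stem(w) for w in ["running", "flies", "studies", "happily"]])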
Code:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Download NLTK tokenizer
nltk.download('punkt')
def stem_text(text):
    # Initialize the Porter Stemmer
    porter_stemmer = PorterStemmer()
    # Tokenize the text into words
    words = word_tokenize(text)
    # Apply stemming to each word
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    # Join the stemmed words back into a single string
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text
# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."
# Perform stemming
stemmed_text = stem_text(text)
# Print the stemmed text
print(stemmed_text)
Output:
Practical 7
Write a python program to perform Named Entity Recognition using NLTK
• Introduction
Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP),
where the goal is to locate and classify named entities in text into predefined categories such
as person names, organizations, locations, dates, and more.
The NLTK library in Python provides tools to perform NER using its built-in ne_chunk()
method, which uses part-of-speech tagged words to identify named entities.
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
# Tokenize, POS-tag, and chunk the named entities in the text
def extract_named_entities(text):
    tagged_words = pos_tag(word_tokenize(text))
    named_entities = ne_chunk(tagged_words)
    return named_entities
# Example text
text = "Apple is a company based in California, United States. Steve Jobs was one of its founders."
Output:
Practical 8
Write a program for different feature extraction techniques used in NLP.
• Introduction
Feature extraction is a critical step in Natural Language Processing (NLP) that transforms raw
text into a structured format that can be easily analyzed by machine learning algorithms.
Several techniques can be used to extract features from text data, including Bag of Words,
Count Vectorization, TF-IDF, and Word2Vec.
A. Bag of Words
Bag of Words is one of the simplest and most common techniques used in NLP for feature extraction. It represents text data as a collection of words without considering the order of words. Each unique word in the corpus is assigned a token, and the presence of a word in a document is typically represented as a binary indicator (1 for present, 0 for absent) or as a count of occurrences.
B. Count Vectorizer
Count Vectorizer is a specific implementation of the Bag of Words model that converts a
collection of text documents to a matrix of token counts. It counts the number of occurrences
of each word in the documents, resulting in a document-term matrix.
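C. TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) extends simple counting by weighting each term by how often it occurs in a document and down-weighting terms that occur in many documents. Words that are frequent in one document but rare across the corpus therefore receive the highest scores, which makes TF-IDF better than raw counts at highlighting distinctive terms.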
D. Word2Vec
Word2Vec is a more advanced technique that represents words as dense vectors in a continuous
vector space. Unlike the previous methods, Word2Vec captures semantic meaning and
relationships between words. It uses neural networks to learn word representations, allowing
words with similar meanings to be closer in the vector space.
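Each of the four techniques maps onto a standard library call; a compact sketch using recent scikit-learn and gensim (the library choice, toy corpus, and parameter values are illustrative, not prescribed by this manual):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

corpus = [
    "NLP transforms raw text into features",
    "Feature extraction turns text into numbers",
    "Word2Vec learns dense word vectors from text",
]

# A/B. Bag of Words via CountVectorizer: a document-term matrix of token counts
count_vec = CountVectorizer()
print(count_vec.fit_transform(corpus).toarray())
print(count_vec.get_feature_names_out())

# C. TF-IDF: counts reweighted by how rare each term is across the corpus
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray().round(2))

# D. Word2Vec: dense vectors learned from tokenized sentences
tokenized = [doc.lower().split() for doc in corpus]
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1, epochs=100)
print(w2v.wv["text"][:5])  # first five dimensions of the embedding for 'text'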
Output:
Output:
Output:
Practical 9
Write a program to implement all the NLP Pre-Processing Techniques
required to perform further NLP tasks.
• Introduction
NLP pre-processing is an essential step in preparing text data for further analysis or modeling.
Properly pre-processed data can lead to better model performance, as it helps reduce noise and
irrelevant information in the text. Different tasks in NLP may require specific pre-processing
techniques depending on the nature of the data and the analysis objectives.
Code:
# Import necessary libraries
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk
# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example text (illustrative)
text = "Natural Language Processing helps computers understand, interpret, and generate human language!"

# Tokenize text
tokens = word_tokenize(text)
# Convert to lowercase
tokens = [word.lower() for word in tokens]
# Remove punctuation
tokens = [word for word in tokens if word not in string.punctuation]
# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]
# Apply stemming
stemmed_tokens = [stemmer.stem(word) for word in tokens]
# Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
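A short driver that prints each stage and exercises the pos_tag import (it assumes the punkt, stopwords, wordnet, and averaged_perceptron_tagger resources have already been downloaded):
print("Cleaned tokens:", tokens)
print("Stemmed tokens:", stemmed_tokens)
print("Lemmatized tokens:", lemmatized_tokens)
print("POS tags:", pos_tag(tokens))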
Output:
Practical 10
Write a program to implement both user-defined and pre-defined functions
to generate: Unigrams, Bigrams, Trigrams, N-grams.
• Introduction:
In Natural Language Processing (NLP), N-grams are contiguous sequences of 'n' items from a
given text or speech. These items can be words, letters, or syllables depending on the context
of application. N-grams play an important role in text preprocessing for tasks like text
generation, machine learning, sentiment analysis, and more.
• Theory:
A unigram is a single token, a bigram is a pair of consecutive tokens, and a trigram is a run of three; in general, an N-gram is any window of N consecutive tokens taken from the text, so larger N captures more local word order at the cost of sparser counts.
Code:
import nltk
from nltk.util import ngrams

nltk.download('punkt')

def generate_ngrams_predefined(text, n):
    """
    Generate n-grams using NLTK's pre-defined ngrams() utility.

    Parameters:
    text: str - The input text to generate n-grams from
    n: int - The number of n-grams (1 for uni-grams, 2 for bi-grams, etc.)

    Returns:
    list - List of n-grams
    """
    tokens = nltk.word_tokenize(text)
    return list(ngrams(tokens, n))

def generate_ngrams_userdefined(text, n):
    """
    Generate n-grams with a user-defined sliding window over the tokens.

    Parameters:
    text: str - The input text to generate n-grams from
    n: int - The number of n-grams (1 for uni-grams, 2 for bi-grams, etc.)

    Returns:
    list - List of n-grams
    """
    tokens = nltk.word_tokenize(text)
    ngrams_list = []
    for i in range(len(tokens) - n + 1):
        ngrams_list.append(tuple(tokens[i:i + n]))
    return ngrams_list

# Example text (illustrative)
Text = "N-grams capture the local word order of a sentence"

# Demonstration for Uni-grams, Bi-grams, and Tri-grams using Pre-defined and User-defined functions
for i in range(1, 4):  # Loop for 1-grams, 2-grams, and 3-grams
    print(f"\n{'='*20} {i}-grams (Pre-defined) {'='*20}")
    ngrams_predefined = generate_ngrams_predefined(Text, i)
    for grams in ngrams_predefined:
        print(grams)

    print(f"\n{'='*20} {i}-grams (User-defined) {'='*20}")
    ngrams_userdefined = generate_ngrams_userdefined(Text, i)
    for grams in ngrams_userdefined:
        print(grams)
Output:
Practical 11
Write a program to identify all antonyms and synonyms of a word.
• Introduction
In Natural Language Processing (NLP), understanding synonyms (words with similar
meanings) and antonyms (words with opposite meanings) is essential for various applications,
including sentiment analysis, information retrieval, and text summarization. Synonyms can
help broaden the understanding of a text, while antonyms provide a contrast that can clarify
meanings and enhance descriptions.
• Theory
• Synonyms: Words that have similar meanings. For example, "happy" and "joyful" are
synonyms.
• Antonyms: Words that have opposite meanings. For example, "happy" and "sad" are
antonyms.
To find synonyms and antonyms programmatically, we can use the WordNet lexical database,
which is part of the Natural Language Toolkit (NLTK) in Python. WordNet provides a rich
vocabulary database and allows us to explore relationships between words.
Code:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')
def find_synonyms_antonyms(word):
    """
    This function finds synonyms and antonyms of a given word using WordNet.

    Parameters:
    word: str - The input word to find synonyms and antonyms for

    Returns:
    dict - A dictionary containing synonyms and antonyms
    """
    synonyms = set()
    antonyms = set()

    # Walk every synset and lemma of the word recorded in WordNet
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())

    return {
        'synonyms': list(synonyms),
        'antonyms': list(antonyms)
    }
# Example usage
if __name__ == "__main__":
    word_to_lookup = input("Enter a word to find its synonyms and antonyms: ")
    result = find_synonyms_antonyms(word_to_lookup)
    print("Synonyms:", result['synonyms'])
    print("Antonyms:", result['antonyms'])
Output:
Practical 12
Write a program to find hyponymy, homonymy, polysemy for a given word.
• Introduction
In Natural Language Processing (NLP), understanding the relationships between words is
crucial for tasks such as semantic analysis and language generation. Three important concepts
in this context are:
1. Hyponymy: This is a relationship between words where one word (the hyponym) is a
more specific term than another (the hypernym). For example, "sparrow" is a hyponym
of "bird."
2. Homonymy: This refers to two or more words that sound the same (homophones) or
are spelled the same (homographs) but have different meanings. For example, "bat" can
refer to a flying mammal or a piece of sports equipment.
3. Polysemy: This describes a single word that has multiple meanings. For instance, the
word "bank" can mean a financial institution or the side of a river.
The Natural Language Toolkit (NLTK) library in Python provides access to WordNet, a lexical
database that can be used to explore these relationships. This program demonstrates how to
find hyponyms, hypernyms, and calculate polysemy for given words.
Code:
import nltk
from nltk.corpus import wordnet
# Hypernyms are more general terms; hyponyms are more specific ones (first sense only)
def find_hypernyms(word):
    synsets = wordnet.synsets(word)
    return synsets[0].hypernyms() if synsets else []

def find_hyponyms(word):
    synsets = wordnet.synsets(word)
    return synsets[0].hyponyms() if synsets else []

word = 'dog'  # illustrative example word

# Finding hypernyms
hypernyms = find_hypernyms(word)
print(f"Hypernyms of '{word}': {[hypernym.name() for hypernym in hypernyms]}")

# Finding hyponyms
hyponyms = find_hyponyms('fruit')
print(f"Hyponyms of 'fruit': {[hyponym.name() for hyponym in hyponyms]}")
Output:
Practical 13
Write a program to calculate the score and polarity of text data using the VADER analyzer and TextBlob.
• Introduction
Sentiment analysis is a crucial task in Natural Language Processing (NLP) that involves
determining the emotional tone behind a body of text. This can be particularly useful in various
applications, such as social media monitoring, customer feedback analysis, and market
research.
Two popular libraries for sentiment analysis in Python are VADER (Valence Aware Dictionary
and sEntiment Reasoner) and TextBlob.
• VADER is specifically designed for sentiment analysis in social media texts, such as
tweets. It uses a lexicon of words that are assigned sentiment scores, which makes it
highly effective for short and informal text.
• TextBlob is a more general-purpose NLP library that provides simple API access to
common natural language processing tasks, including sentiment analysis. It uses a
predefined lexicon and the concept of "polarity" and "subjectivity" to gauge sentiment.
Code:
# Install the required libraries if you haven't already
# !pip install vaderSentiment textblob
from textblob import TextBlob

# TextBlob sentiment: polarity in [-1, 1] and subjectivity in [0, 1]
def textblob_scores(text):
    blob = TextBlob(text)
    return {
        'polarity': blob.sentiment.polarity,
        'subjectivity': blob.sentiment.subjectivity
    }
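The VADER half of the practical follows the same pattern; a minimal sketch (the sample sentence is illustrative):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = "The movie was absolutely wonderful, though the ending felt a little rushed."
print("VADER:", analyzer.polarity_scores(text))  # neg/neu/pos proportions plus a compound score
print("TextBlob:", textblob_scores(text))        # polarity and subjectivity from the function above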
Output: