PPT for Assignment-10 (Machine Learning With Python_NLP-2)
Python Libraries
• NumPy – Numerical computing, arrays
• Pandas – Data manipulation
• Matplotlib – Data visualization
• Seaborn – Statistical data visualization
• Scikit-Learn – Machine learning algorithms
• TensorFlow – Deep learning, neural networks
• Keras – High-level API for deep learning
• PyTorch – Deep learning (research-focused)
• XGBoost – Gradient boosting for structured data
• LightGBM – Fast boosting algorithm
• OpenCV – Computer vision and image processing
• NLTK – Natural language processing
scikit-learn
• scikit-learn (sklearn) is a powerful machine learning library in Python that provides tools for:
Data Preprocessing (handling missing data, scaling, encoding)
Feature Extraction (Bag of Words, TF-IDF, PCA) – see the short sketch after this list
Supervised Learning (Regression & Classification models)
Unsupervised Learning (Clustering, Anomaly Detection)
Model Selection & Evaluation (Cross-validation, Hyperparameter tuning)
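A minimal sketch (not from the slides) of the Bag of Words and TF-IDF feature extraction mentioned above; the two example sentences are made up:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat", "the dog sat on the log"]
bow = CountVectorizer() # Bag of Words: raw term counts
print(bow.fit_transform(docs).toarray()) # document-term count matrix
print(bow.get_feature_names_out()) # vocabulary learned from the corpus
tfidf = TfidfVectorizer() # TF-IDF: counts reweighted by how rare each term is
print(tfidf.fit_transform(docs).toarray().round(2)) # document-term TF-IDF matrix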
• Task 1: Load & Explore a Dataset
import pandas as pd
df = pd.read_csv('data.csv') # Load dataset
print(df.head()) # Show first 5 rows
print(df.info()) # Dataset summary
print(df.describe()) # Statistical summary
• Task 2: Train-Test Split
from sklearn.model_selection import train_test_split
X = df.drop('Target', axis=1) # Features
y = df['Target'] # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The random_state parameter ensures that the data split is reproducible. It controls the randomness of the train-test split, meaning:
Same random_state → Same split every time
Different random_state → Different split every time
• Task 3: Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train) # Train model
y_pred = model.predict(X_test) # Make predictions
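The slides list model evaluation among scikit-learn's tools but do not show it for Task 3; a minimal sketch that scores the predictions above, assuming the y_test and y_pred variables already defined:
from sklearn.metrics import mean_squared_error, r2_score
print('MSE:', mean_squared_error(y_test, y_pred)) # average squared prediction error
print('R2:', r2_score(y_test, y_pred)) # proportion of variance explained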
• Task 4: Logistic Regression
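The slide lists Task 4 without code; a minimal sketch, assuming the Target column from Task 2 is categorical so that classification applies:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000) # larger max_iter helps the solver converge
clf.fit(X_train, y_train) # Train model
y_pred = clf.predict(X_test) # Make predictions
print(clf.score(X_test, y_test)) # Mean accuracy on the test set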
Python NLP Libraries
• spaCy
• TextBlob
• Gensim
• Polyglot
Text Preprocessing in Python
• Text Cleaning/Tokenization using Python RegEx Module
• Regular Expressions - Sequence of characters that defines a search
pattern. It is commonly used for:
• Finding specific patterns in text (e.g., emails, dates, phone numbers).
• Replacing or cleaning text (e.g., removing special characters).
• Splitting text into meaningful components.
• Python has a built-in module named "re" for working with regular expressions.
RegEx - Example
import re
s = "CognitiveComputing: A computer science subject for geeks"
match = re.search('subject', s)
print('Start Index:', match.start())
print('End Index:', match.end())
Output:
Start Index: 39
End Index: 46
re.findall() – finds all matching occurrences and returns them as a list
import re
string = """Hello my Number is 987654321 and
my friend's number is 123456789"""
regex = r'\d+'
match = re.findall(regex, string)
print(match)
Output:
['987654321', '123456789']
Here the r prefix (as in r'\d+') stands for raw string, not regex. A raw string is slightly different from a regular string: it won't interpret the \ character as an escape character. This is because the regular expression engine uses the \ character for its own escaping purposes.
Other Regex Functions
• re.compile() – Regular expressions are compiled into pattern objects
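A minimal sketch (not from the slides) of a compiled pattern plus re.sub() and re.split(), which cover the replacing and splitting use cases listed earlier; the sample strings are made up:
import re
pattern = re.compile(r'\d+') # compiled pattern object that can be reused
print(pattern.findall('Order 66 shipped on 2024-01-15')) # ['66', '2024', '01', '15']
print(re.sub(r'[^A-Za-z0-9 ]', '', 'Hello!!! World???')) # remove special characters -> 'Hello World'
print(re.split(r'[;,\s]+', 'apples, oranges; bananas grapes')) # ['apples', 'oranges', 'bananas', 'grapes']
Word Tokenization
The token list below looks like the output of NLTK's word_tokenize; the code is not on the slide, so this is a reconstructed sketch with the input sentence inferred from the tokens:
from nltk.tokenize import word_tokenize
text = """There are multiple ways we can perform tokenization on given text
data. We can choose any method based on language, library and purpose of modeling."""
print(word_tokenize(text))
Output: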
['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text',
'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library',
'and', 'purpose', 'of', 'modeling', '.']
Sentence Tokenization
from nltk.tokenize import sent_tokenize
text = """Characters like periods, exclamation point and newline char
are used to separate the sentences. But one drawback with split()
method, that we can only use one separator at a time! So sentence
tokenization wont be foolproof with split() method."""
sent_tokenize(text)
['Characters like periods, exclamation point and newline char are used to separate the sentences.',
'But one drawback with split() method, that we can only use one separator at a time!',
'So sentence tokenization wont be foolproof with split() method.']
split() for sentence tokenization
text = """Characters like periods, exclamation point and newline char
are used to separate the sentences. But one drawback with split()
method, that we can only use one separator at a time! So sentence
tokenization wont be foolproof with split() method."""
text.split(". ") # The space after the full stop makes sure we don't get an empty element at the end of the list.
['Characters like periods, exclamation point and newline char are used to separate the sentences',
'But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization wont be foolproof with split() method.']
Stemming
• RegexpStemmer – custom stemming rules using regular expressions (regex)
• PorterStemmer
• LancasterStemmer
• SnowballStemmer – supports multiple languages
PorterStemmer
Output:
['runn', 'fli', 'studi', 'happiness', 'play', 'jumps']
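The slide shows only the output; the code and word list are not included. A minimal, reconstructed sketch, assuming the words ['running', 'flies', 'studies', 'happiness', 'playing', 'jumps']. Note that the printed list above matches a RegexpStemmer rule such as 'ing$|es$' more closely than PorterStemmer, whose actual output is shown in the comments:
from nltk.stem import PorterStemmer, RegexpStemmer
words = ["running", "flies", "studies", "happiness", "playing", "jumps"] # assumed word list
regexp_stemmer = RegexpStemmer('ing$|es$', min=4) # strip 'ing' or 'es' suffixes
print([regexp_stemmer.stem(w) for w in words]) # ['runn', 'fli', 'studi', 'happiness', 'play', 'jumps']
porter = PorterStemmer() # rule-based suffix stripping
print([porter.stem(w) for w in words]) # ['run', 'fli', 'studi', 'happi', 'play', 'jump']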
Lemmatization
• The WordNetLemmatizer in NLTK uses the WordNet lexical database
to find the base form of words.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "studies", "better", "happily", "geese"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
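Run as written, lemmatize() treats every word as a noun (the default pos='n'), so the output should be roughly ['running', 'fly', 'study', 'better', 'happily', 'goose']; the WordNet data must be downloaded first with nltk.download('wordnet'). Passing a part-of-speech tag changes the result; a small follow-up sketch:
print(lemmatizer.lemmatize("running", pos="v")) # 'run' (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a")) # 'good' (treated as an adjective)
print(lemmatizer.lemmatize("geese")) # 'goose' (default pos='n')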