PPT for Assignment-10 (Machine Learning With Python_NLP-2)

The document provides an overview of various Python libraries and techniques for machine learning (ML) and natural language processing (NLP), including data manipulation, visualization, and model training using libraries like NumPy, Pandas, Scikit-Learn, and TensorFlow. It covers essential tasks such as data preprocessing, feature extraction, and text tokenization, along with examples of using regular expressions for text manipulation. Additionally, it discusses advanced topics like sentiment analysis, text generation with Keras, and model building using LSTM.


ML and NLP with Python
Python Libraries
• NumPy: Numerical computing, arrays
• Pandas: Data manipulation
• Matplotlib: Data visualization
• Seaborn: Statistical data visualization
• Scikit-Learn: Machine learning algorithms
• TensorFlow: Deep learning, neural networks
• Keras: High-level API for deep learning
• PyTorch: Deep learning (research-focused)
• XGBoost: Gradient boosting for structured data
• LightGBM: Fast boosting algorithm
• OpenCV: Computer vision and image processing
• NLTK: Natural language processing
scikit-learn
• scikit-learn (sklearn) is a powerful machine learning library in Python
that provides tools for:
Data Preprocessing (handling missing data, scaling, encoding)
Feature Extraction (Bag of Words, TF-IDF, PCA)
Supervised Learning (Regression & Classification models)
Unsupervised Learning (Clustering, Anomaly Detection)
Model Selection & Evaluation (Cross-validation, Hyperparameter
tuning)
• Task 1: Load & Explore a Dataset
import pandas as pd
df = pd.read_csv('data.csv') # Load dataset
print(df.head()) # Show first 5 rows
print(df.info()) # Dataset summary
print(df.describe()) # Statistical summary
• Task 2: Train-Test Split
from sklearn.model_selection import train_test_split
X = df.drop('Target', axis=1) # Features
y = df['Target'] # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The random_state parameter ensures that the data split is reproducible. It controls the randomness of the train-test split, meaning:
Same random_state → Same split every time
Different random_state → Different split every time

• Task 3: Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train) # Train model
y_pred = model.predict(X_test) # Make predictions
• Task 4: Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
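The capability list above also mentions model selection and evaluation. As a minimal sketch (the 5-fold setting is an illustrative choice), scikit-learn's cross_val_score can score the classifier across several train-test splits:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)  # accuracy on each of 5 folds
print("CV accuracy per fold:", scores)
print("Mean CV accuracy:", scores.mean())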
NLP Libraries in Python
• Python has a number of libraries for NLP that perform tokenization, sentiment analysis, machine translation, text summarization, and more.
• NLTK (Natural Language Toolkit)

• spaCy

• TextBlob

• Transformers (by Hugging Face)

• Gensim

• Tesseract OCR (for Text Extraction from Images)

• Polyglot
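As a quick illustration, a minimal spaCy tokenization sketch (it uses a blank English pipeline so no model download is needed; loading en_core_web_sm instead would add tagging, parsing, and named entities):
import spacy
nlp = spacy.blank("en")  # tokenizer-only English pipeline
doc = nlp("spaCy makes tokenization easy, doesn't it?")
print([token.text for token in doc])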
Text Preprocessing in Python
• Text Cleaning/Tokenization using Python RegEx Module
• Regular Expressions - Sequence of characters that defines a search
pattern. It is commonly used for:
• Finding specific patterns in text (e.g., emails, dates, phone numbers).
• Replacing or cleaning text (e.g., removing special characters).
• Splitting text into meaningful components.
• Python has a built-in module named "re" that provides regular expression support.
RegEx - Example
import re
s = "CognitiveComputing: A computer science subject for geeks"
match = re.search('subject', s)
print('Start Index:', match.start())
print('End Index:', match.end())

Output:
Start Index: 39
End Index: 46
re.findall() - finds and returns all
matching occurrences in a list
import re
string = """Hello my Number is 987654321 and
my friend's number is 123456789"""
regex = r'\d+'
match = re.findall(regex, string)
print(match)
Output:
['987654321', '123456789']

Note: The r prefix (as in r'\d+') stands for "raw", not regex. A raw string is slightly different from a regular string: it won't interpret the \ character as an escape character. This matters because the regular expression engine uses the \ character for its own escaping purposes.
Other Regex Functions
re.compile() - Compiles a regular expression into a reusable pattern object
re.split() - Splits a string by the occurrences of a character or a pattern
re.sub() - Replaces all occurrences of a character or pattern with a replacement string
re.escape() - Escapes special characters
re.search() - Searches for the first occurrence of a character or pattern

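A minimal sketch of these functions (the patterns and sample strings are illustrative):
import re
pattern = re.compile(r'\d+')                        # re.compile(): build a reusable pattern object
print(pattern.findall("Rooms 101 and 202"))         # ['101', '202']
print(re.split(r'[;,]\s*', "a, b; c"))              # re.split(): ['a', 'b', 'c']
print(re.sub(r'[^A-Za-z0-9 ]', '', "Hello, NLP!"))  # re.sub(): strips special characters -> 'Hello NLP'
print(re.escape("price (USD): $5.99"))              # re.escape(): backslash-escapes regex metacharacters
print(re.search(r'\d+', "Order 42 shipped"))        # re.search(): first match object (or None)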
Split() for word tokenization
text = "There are multiple ways we can perform tokenization on given
text data. We can choose any method based on langauge, library and
purpose of modeling."
# Split text by whitespace
tokens = text.split()
print(tokens)
['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given',
'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge,', 'library',
'and', 'purpose', 'of', 'modeling.']
Tokenization using NLTK Tokenizer
• NLTK provides several built-in tokenizers for different NLP tasks.
1. Sentence Tokenization (sent_tokenize)
2. Word Tokenization (word_tokenize)
3. Regular Expression Tokenizer (RegexpTokenizer) - Custom regex-based
tokenization
4. White Space Tokenizer (WhitespaceTokenizer)
5. WordPunct Tokenizer (WordPunctTokenizer)
6. Tweet Tokenizer (TweetTokenizer)
7. SyllableTokenizer
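Most of these are drop-in classes. A brief sketch of two of them (the sample tweet is illustrative):
from nltk.tokenize import WordPunctTokenizer, TweetTokenizer
tweet = "Loving #NLP with @nltk_org :-) Don't miss it!!!"
print(WordPunctTokenizer().tokenize(tweet))  # splits on punctuation boundaries
print(TweetTokenizer().tokenize(tweet))      # keeps hashtags, handles and emoticons intact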
Steps before using NLTK in Jupyter Notebook
%pip install nltk

import nltk
print(nltk.__version__)

import os
print(os.getcwd())

nltk_path = os.path.expanduser("drive/nltk_data/tokenizers")
os.makedirs(nltk_path, exist_ok=True)
print(f"Created directory: {nltk_path}")

nltk.data.path.append(os.path.expanduser("drive/nltk_data"))
print("NLTK path updated!")

import zipfile
with zipfile.ZipFile("punkt.zip", "r") as zip_ref:  # NLTK's tokenizers require the punkt model
    zip_ref.extractall(nltk_path)
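If the notebook has internet access, the punkt model can also be fetched with NLTK's own downloader instead of unzipping it manually (newer NLTK releases may additionally require the punkt_tab resource):
import nltk
nltk.download('punkt')  # downloads the punkt tokenizer models into an nltk_data directory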
Word Tokenization
from nltk.tokenize import word_tokenize
text = """There are multiple ways we can perform tokenization on
given text data. We can choose any method based on language,
library and purpose of modeling."""
tokens = word_tokenize(text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text',
'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library',
'and', 'purpose', 'of', 'modeling', '.']
Sentence Tokenization
from nltk.tokenize import sent_tokenize
text = """Characters like periods, exclamation points and newline characters
are used to separate the sentences. But one drawback with the split()
method is that we can only use one separator at a time! So sentence
tokenization will not be foolproof with the split() method."""
sent_tokenize(text)

['Characters like periods, exclamation points and newline characters are used to separate the sentences.',
'But one drawback with the split() method is that we can only use one separator at a time!',
'So sentence tokenization will not be foolproof with the split() method.']
Split() for sentence tokenization
text = """Characters like periods, exclamation point and newline char
are used to separate the sentences. But one drawback with split()
method, that we can only use one separator at a time! So sentence
tonenization wont be foolproof with split() method."""
text.split(". ") # Note the space after the full stop makes sure that we
dont get empty element at the end of list.

['Characters like periods, exclamation point and newline char are used to separate the
sentences', 'But one drawback with split() method, that we can only use one separator
at a time! So sentence tonenization wont be foolproof with split() method.']
Stemming
• RegexpStemmer - custom stemming rules using regular expressions
(regex).
• PorterStemmer
• LancasterStemmer
• SnowballStemmer – Supports multiple languages
PorterStemmer

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
for w in words:
    print(w, ":", ps.stem(w))

Programmers : programm
program : program
with : with
programming : program
languages : languag
RegexpStemmer
from nltk.stem import RegexpStemmer
# Define a regex pattern to remove common suffixes like "ing", "ed", "es"
regexp_stemmer = RegexpStemmer(r"ing$|ed$|es$")
words = ["running", "flies", "studies", "happiness", "played", "jumps"]
stemmed_words = [regexp_stemmer.stem(word) for word in words]
print(stemmed_words)

Output:
['runn', 'fli', 'studi', 'happiness', 'play', 'jumps']
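The SnowballStemmer listed earlier works the same way and supports several languages; a minimal sketch (the word list is illustrative):
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)  # languages supported by the Snowball stemmer
snowball = SnowballStemmer("english")
print([snowball.stem(w) for w in ["running", "flies", "studies", "happily", "generously"]])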
Lemmatization
• The WordNetLemmatizer in NLTK uses the WordNet lexical database
to find the base form of words.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "studies", "better", "happily", "geese"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

Output: ['running', 'fly', 'study', 'better', 'happily', 'goose']

TRY YOURSELF: Lemmatization with POS (Part of Speech) Tags


Lemmatization
# import these modules
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))
rocks : rock
corpora : corpus
better : good
StopWord Removal
from nltk.corpus import stopwords
# Get English stopwords (requires the NLTK 'stopwords' corpus)
stop_words = set(stopwords.words("english"))
print(stop_words)  # Display the stopwords

stop_words.add("example")  # Adding "example" to the stopword set
stop_words.remove("not")   # Removing "not" (if negation is important)
Removing StopWord from Sentence
from nltk.tokenize import word_tokenize
text = "This is a simple example to demonstrate the removal of stopwords in NLP."
# Tokenize the text
tokens = word_tokenize(text)
# Remove stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
Do it Yourself!
• Singularize and Pluralize text using TextBlob
• TextBlob: Translate a sentence from Spanish to English
NLP-II
BoW in Python
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love machine learning", "Machine learning is amazing", "I love coding"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)  # Learn the vocabulary dictionary and return the document-term matrix
print(vectorizer.get_feature_names_out())
print(bow.toarray())
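With scikit-learn's default tokenizer, which keeps only tokens of two or more characters (so "I" is dropped), this should print the vocabulary and counts along these lines:
['amazing' 'coding' 'is' 'learning' 'love' 'machine']
[[0 0 0 1 1 1]
[1 0 1 1 0 1]
[0 1 0 0 1 0]]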
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> vectorizer2.get_feature_names_out()
array(['and this', 'document is', 'first document', 'is the', 'is this', 'second document',
'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the'], ...)
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
[0 1 0 1 0 1 0 1 0 0 1 0 0]
[1 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 1 0 1 0 1 0 0 0 0 0 1]]
TF-IDF in Python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
print(tfidf.get_feature_names_out())
print(X.toarray())

['amazing' 'coding' 'is' 'learning' 'love' 'machine']


[[0. 0. 0. 0.57735027 0.57735027 0.57735027]
[0.5628291 0. 0.5628291 0.42804604 0. 0.42804604]
[0. 0.79596054 0. 0. 0.60534851 0. ]]
Similarity in Texts
text1 = set("machine learning is fun".split())
text2 = set("learning about machine intelligence".split())
jaccard = len(text1 & text2) / len(text1 | text2)
print("Jaccard Similarity:", jaccard)

Jaccard Similarity: 0.3333333333333333


Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
vecs = tfidf_vec.fit_transform(["machine learning is fun",
                                "learning about machine intelligence"])
cos_sim = cosine_similarity(vecs[0:1], vecs[1:2])
print("Cosine Similarity:", cos_sim[0][0])

Cosine Similarity: 0.3360969272762575

Jaccard compares token sets; cosine compares vector angles (good for longer texts).
Sentiment Analysis
from textblob import TextBlob
review = "The service was excellent and the staff was friendly."
blob = TextBlob(review)
print("Polarity:", blob.sentiment.polarity)
print("Subjectivity:", blob.sentiment.subjectivity)
Word Cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "Python is simple and powerful. I love Python programming!"
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Text Generation using Keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
text = "Machine learning is fun and exciting to learn"
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
sequences = []
words = text.split()
for i in range(1, len(words)):
    seq = words[:i+1]
    tokenized_seq = tokenizer.texts_to_sequences([' '.join(seq)])[0]
    sequences.append(tokenized_seq)
# Pad the sequences
padded = pad_sequences(sequences)
print(padded)
Build a Model (LSTM Example)
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
model = Sequential()
model.add(Embedding(input_dim=50, output_dim=10,
input_length=padded.shape[1]))
model.add(LSTM(50))
model.add(Dense(50, activation='relu'))
model.add(Dense(len(tokenizer.word_index) + 1, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# Normally you'd train the model with model.fit(), then use it to predict.
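A minimal sketch of how training and next-word prediction could look for this toy example (the X_seq/y_seq split, epoch count and seed phrase are illustrative assumptions; with this split the Embedding layer's input_length should be padded.shape[1] - 1):
import numpy as np
X_seq, y_seq = padded[:, :-1], padded[:, -1]  # predict the last word of each sequence from its prefix
model.fit(X_seq, y_seq, epochs=200, verbose=0)
seed = tokenizer.texts_to_sequences(["Machine learning is"])[0]
seed = pad_sequences([seed], maxlen=X_seq.shape[1])
next_id = int(np.argmax(model.predict(seed), axis=-1)[0])
index_to_word = {i: w for w, i in tokenizer.word_index.items()}
print("Next word:", index_to_word.get(next_id, "?"))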
• LSTM stands for Long Short-Term Memory. It is a type of Recurrent Neural Network (RNN) specially designed to remember long sequences and patterns in data, which makes it especially useful in Natural Language Processing (NLP), time series, and speech.
