PPT for Assignment-10 (Machine Learning With Python_NLP-2)
Python Libraries
• NumPy – Numerical computing, arrays
• Pandas – Data manipulation
• Matplotlib – Data visualization
• Seaborn – Statistical data visualization
• Scikit-Learn – Machine learning algorithms
• TensorFlow – Deep learning, neural networks
• Keras – High-level API for deep learning
• PyTorch – Deep learning (research-focused)
• XGBoost – Gradient boosting for structured data
• LightGBM – Fast boosting algorithm
• OpenCV – Computer vision and image processing
• NLTK – Natural language processing
scikit-learn
• scikit-learn (sklearn) is a powerful machine learning library in Python that provides tools for:
Data Preprocessing (handling missing data, scaling, encoding)
Feature Extraction (Bag of Words, TF-IDF, PCA) – see the short sketch after this list
Supervised Learning (Regression & Classification models)
Unsupervised Learning (Clustering, Anomaly Detection)
Model Selection & Evaluation (Cross-validation, Hyperparameter tuning)
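A minimal sketch (not from the slides) of the Bag of Words and TF-IDF feature extraction mentioned above; the two example sentences are made up:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat", "the dog sat on the log"]
bow = CountVectorizer() # Bag of Words: raw term counts
print(bow.fit_transform(docs).toarray()) # document-term count matrix
print(bow.get_feature_names_out()) # vocabulary learned from the corpus
tfidf = TfidfVectorizer() # TF-IDF: counts reweighted by how rare each term is
print(tfidf.fit_transform(docs).toarray().round(2)) # document-term TF-IDF matrix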
• Task 1: Load & Explore a Dataset
import pandas as pd
df = pd.read_csv('data.csv') # Load dataset
print(df.head()) # Show first 5 rows
print(df.info()) # Dataset summary
print(df.describe()) # Statistical summary
• Task 2: Train-Test Split
from sklearn.model_selection import train_test_split
X = df.drop('Target', axis=1) # Features
y = df['Target'] # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The random_state parameter ensures that the data split is reproducible. It controls the randomness of the train-test split, meaning:
Same random_state → Same split every time
Different random_state → Different split every time
• Task 3: Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train) # Train model
y_pred = model.predict(X_test) # Make predictions
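The slides list model evaluation among scikit-learn's tools but do not show it for Task 3; a minimal sketch that scores the predictions above, assuming the y_test and y_pred variables already defined:
from sklearn.metrics import mean_squared_error, r2_score
print('MSE:', mean_squared_error(y_test, y_pred)) # average squared prediction error
print('R2:', r2_score(y_test, y_pred)) # proportion of variance explained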
• Task 4: Logistic Regression
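The slide lists Task 4 without code; a minimal sketch, assuming the Target column from Task 2 is categorical so that classification applies:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000) # larger max_iter helps the solver converge
clf.fit(X_train, y_train) # Train model
y_pred = clf.predict(X_test) # Make predictions
print(clf.score(X_test, y_test)) # Mean accuracy on the test set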
Python NLP Libraries
• spaCy
• TextBlob
• Gensim
• Polyglot
Text Preprocessing in Python
• Text Cleaning/Tokenization using Python RegEx Module
• Regular Expressions - Sequence of characters that defines a search
pattern. It is commonly used for:
• Finding specific patterns in text (e.g., emails, dates, phone numbers).
• Replacing or cleaning text (e.g., removing special characters).
• Splitting text into meaningful components.
• Python has a built-in module named "re" for working with regular expressions.
RegEx - Example
import re
s = "CognitiveComputing: A computer science subject for geeks"
match = re.search('subject', s)
print('Start Index:', match.start())
print('End Index:', match.end())
Output:
Start Index: 39
End Index: 46
re.findall() – finds all matching occurrences and returns them as a list
import re
string = """Hello my Number is 987654321 and
my friend's number is 123456789"""
regex = r'\d+'
match = re.findall(regex, string)
print(match)
Output:
['987654321', '123456789']
Here the r prefix (as in r'\d+') stands for raw string, not regex. A raw string is slightly different from a regular string: it won't interpret the \ character as an escape character. This is because the regular expression engine uses the \ character for its own escaping purposes.
Other Regex Functions
• re.compile() – Regular expressions are compiled into pattern objects
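A minimal sketch (not from the slides) of a compiled pattern plus re.sub() and re.split(), which cover the replacing and splitting use cases listed earlier; the sample strings are made up:
import re
pattern = re.compile(r'\d+') # compiled pattern object that can be reused
print(pattern.findall('Order 66 shipped on 2024-01-15')) # ['66', '2024', '01', '15']
print(re.sub(r'[^A-Za-z0-9 ]', '', 'Hello!!! World???')) # remove special characters -> 'Hello World'
print(re.split(r'[;,\s]+', 'apples, oranges; bananas grapes')) # ['apples', 'oranges', 'bananas', 'grapes']
Word Tokenization
The token list below looks like the output of NLTK's word_tokenize; the code is not on the slide, so this is a reconstructed sketch with the input sentence inferred from the tokens:
from nltk.tokenize import word_tokenize
text = """There are multiple ways we can perform tokenization on given text
data. We can choose any method based on language, library and purpose of modeling."""
print(word_tokenize(text))
Output: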
['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text',
'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library',
'and', 'purpose', 'of', 'modeling', '.']
Sentence Tokenization
from nltk.tokenize import sent_tokenize
text = """Characters like periods, exclamation point and newline char
are used to separate the sentences. But one drawback with split()
method, that we can only use one separator at a time! So sentence
tokenization wont be foolproof with split() method."""
sent_tokenize(text)
['Characters like periods, exclamation point and newline char are used to separate the sentences.',
'But one drawback with split() method, that we can only use one separator at a time!',
'So sentence tokenization wont be foolproof with split() method.']
split() for sentence tokenization
text = """Characters like periods, exclamation point and newline char
are used to separate the sentences. But one drawback with split()
method, that we can only use one separator at a time! So sentence
tokenization wont be foolproof with split() method."""
text.split(". ") # The space after the full stop makes sure we don't get an empty element at the end of the list.
['Characters like periods, exclamation point and newline char are used to separate the sentences',
'But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization wont be foolproof with split() method.']
Stemming
• RegexpStemmer – custom stemming rules using regular expressions (regex)
• PorterStemmer
• LancasterStemmer
• SnowballStemmer – supports multiple languages
PorterStemmer
Output:
['runn', 'fli', 'studi', 'happiness', 'play', 'jumps']
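The slide shows only the output; the code and word list are not included. A minimal, reconstructed sketch, assuming the words ['running', 'flies', 'studies', 'happiness', 'playing', 'jumps']. Note that the printed list above matches a RegexpStemmer rule such as 'ing$|es$' more closely than PorterStemmer, whose actual output is shown in the comments:
from nltk.stem import PorterStemmer, RegexpStemmer
words = ["running", "flies", "studies", "happiness", "playing", "jumps"] # assumed word list
regexp_stemmer = RegexpStemmer('ing$|es$', min=4) # strip 'ing' or 'es' suffixes
print([regexp_stemmer.stem(w) for w in words]) # ['runn', 'fli', 'studi', 'happiness', 'play', 'jumps']
porter = PorterStemmer() # rule-based suffix stripping
print([porter.stem(w) for w in words]) # ['run', 'fli', 'studi', 'happi', 'play', 'jump']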
Lemmatization
• The WordNetLemmatizer in NLTK uses the WordNet lexical database
to find the base form of words.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "studies", "better", "happily", "geese"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
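Run as written, lemmatize() treats every word as a noun (the default pos='n'), so the output should be roughly ['running', 'fly', 'study', 'better', 'happily', 'goose']; the WordNet data must be downloaded first with nltk.download('wordnet'). Passing a part-of-speech tag changes the result; a small follow-up sketch:
print(lemmatizer.lemmatize("running", pos="v")) # 'run' (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a")) # 'good' (treated as an adjective)
print(lemmatizer.lemmatize("geese")) # 'goose' (default pos='n')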