AI LAB FINAL
The lab begins by importing Python libraries commonly used for data processing, visualization,
and machine learning tasks. Here are the details:
1. Libraries Imported:
o pandas (pd): Essential for data processing and reading CSV files.
o nltk: The Natural Language Toolkit for working with human language data.
2. Purpose:
o These libraries provide essential tools for various tasks, such as data analysis,
machine learning model building, and text processing.
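A minimal sketch of these two imports (the pd alias follows the usual convention and is assumed here):

import pandas as pd   # data processing and reading CSV files
import nltk           # Natural Language Toolkit for text/NLP tasks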
Next, we import the NLTK (Natural Language Toolkit) and explore its stopwords and WordNet corpora (a combined usage sketch follows this list):
1. Stopwords:
o Definition: Stopwords are common words (such as “the,” “and,” “is,” etc.) that
appear frequently in a language but typically do not carry significant meaning on
their own.
o Purpose: They are often removed from text during natural language processing
(NLP) tasks to improve efficiency and focus on more meaningful content.
o Example: In English, stopwords include words like “a,” “an,” “the,” “in,” “of,” and
“and.”
o Import: from nltk.corpus import stopwords
2. WordNet:
o Definition: WordNet is a lexical database that organizes words into synsets (sets of
synonyms) and provides semantic relationships between words.
3. from nltk.corpus import stopwords:
o This line imports the stopwords dataset from the Natural Language Toolkit (NLTK)
library.
o nltk: Refers to the NLTK library, which is a powerful Python package for natural
language processing (NLP) tasks.
o corpus: Within NLTK, corpus is a module that provides access to various text corpora
and lexical resources.
o stopwords: Specifically, this refers to a collection of common words in a language
(like "the", "is", "in") that are often removed from text data because they typically do
not contribute much to the meaning of a sentence.
4. from nltk.stem.porter import PorterStemmer:
o This line imports the PorterStemmer class from the porter submodule of the stem
module within NLTK.
o stem: Within NLTK, stem is a module that contains implementations of different
stemming algorithms.
o porter: Specifically, porter is a submodule within the stem module that contains the
implementation of the Porter stemming algorithm.
o PorterStemmer: The PorterStemmer class is used to apply the Porter stemming
algorithm, which reduces words to their root or base form. For example, "running"
would be stemmed to "run".
5. WordNetLemmatizer:
o Purpose: The WordNetLemmatizer is a class in NLTK used for lemmatization.
o Usage: Lemmatization is the process of reducing words to their base or root form
(called a lemma). Unlike stemming, which simply chops off prefixes or suffixes to
obtain the root word, lemmatization applies linguistic rules to find the lemma of a
word.
o Example: The lemmatized form of "running" would be "run".
6. wordnet:
• Purpose: This is a corpus reader for the WordNet lexical database.
• Usage: WordNet is a large lexical database of English. It groups words into sets of
synonyms called synsets and provides short definitions, usage examples, and
information on semantic relationships between words.
• Example: You can use WordNet to find synonyms, antonyms, hypernyms, hyponyms,
and more for a given word.
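The following is a small usage sketch of these NLTK tools. The download calls and the example words are illustrative assumptions, not taken from the lab code:

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# The corpora must be downloaded once before first use.
nltk.download('stopwords')
nltk.download('wordnet')

STOPWORDS = set(stopwords.words('english'))      # e.g. {'the', 'is', 'in', ...}

stemmer = PorterStemmer()
print(stemmer.stem('running'))                   # 'run'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' (verb POS tag assumed)

# WordNet: collect synonyms of a word from its synsets.
synonyms = {lemma.name() for syn in wordnet.synsets('happy') for lemma in syn.lemmas()}
print(synonyms)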
The next imports come from the scikit-learn library (sklearn). The two vectorizers are from its feature_extraction.text module, and the remaining classes come from the submodules noted under each item (a combined usage sketch follows this list).
1. CountVectorizer:
o Purpose: CountVectorizer is used to convert a collection of text documents into a
matrix of token counts. It essentially tokenizes text, builds a vocabulary of known
words, and generates a document-term matrix where each row represents a
document and each column represents the count of a word in that document.
o Usage: CountVectorizer is commonly used in natural language processing (NLP)
tasks such as text classification, clustering, and information retrieval.
2. TfidfVectorizer:
o Purpose: TfidfVectorizer is similar to CountVectorizer but it converts a collection of
raw documents to a matrix of TF-IDF features. TF-IDF stands for Term Frequency-
Inverse Document Frequency, which measures the importance of a word in a
document relative to a collection of documents. It helps in highlighting words that are
unique to a document while downweighting common words across documents.
o Usage: TfidfVectorizer is commonly used in text mining and information retrieval
tasks to represent text data in a numerical format suitable for machine learning
algorithms.
3. KFold:
o Purpose: Used for cross-validation, splitting the dataset into k consecutive folds.
o Usage: Typically used in evaluating model performance and tuning hyperparameters.
o Import: from sklearn.model_selection import KFold
4. MultinomialNB:
o Purpose: A naive Bayes classifier suitable for classification with discrete features.
o Usage: Commonly used for text classification tasks.
o Import: from sklearn.naive_bayes import MultinomialNB
5. LogisticRegression:
o Purpose: A linear model for binary classification.
o Usage: Commonly used as a baseline model for binary classification problems.
o Import: from sklearn.linear_model import LogisticRegression
6. DecisionTreeClassifier:
o Purpose: A non-parametric supervised learning method used for classification and
regression tasks.
o Usage: Known for its simplicity, interpretability, and ability to handle both numerical
and categorical data.
o Import: from sklearn.tree import DecisionTreeClassifier
7. LinearSVC:
o Purpose: A linear support vector classifier, similar to SVC with a linear kernel but
implemented with the liblinear library.
o Usage: Suitable for large-scale classification problems, especially with high-dimensional
data.
o Import: from sklearn.svm import LinearSVC
8. BaggingClassifier:
o Purpose: An ensemble method that improves stability and accuracy by combining
predictions from multiple base models trained on different subsets of the training
data.
o Usage: Applies bagging to any base classifier.
o Import: from sklearn.ensemble import BaggingClassifier
9. RandomForestClassifier:
o Purpose: An ensemble method that constructs multiple decision trees during
training and outputs the class that is the mode of the classes (classification) or the
mean prediction (regression) of the individual trees.
o Usage: Robust to overfitting and widely used for classification tasks.
o Import: from sklearn.ensemble import RandomForestClassifier
10. ExtraTreesClassifier:
o Purpose: An ensemble learning method similar to RandomForestClassifier, except that
split thresholds are drawn at random rather than searched for exhaustively.
o Usage: Typically faster to train than RandomForestClassifier and often generalizes
comparably well.
o Import: from sklearn.ensemble import ExtraTreesClassifier
11. MLPClassifier:
o Purpose: A feedforward artificial neural network model for classification tasks.
o Usage: Learns non-linear relationships between features and target labels.
o Import: from sklearn.neural_network import MLPClassifier
12. KNeighborsClassifier:
o Purpose: A simple and effective classification algorithm based on nearest neighbors
in the feature space.
o Usage: Especially useful when the decision boundary is not well-defined or when the
data is noisy.
o Import: from sklearn.neighbors import KNeighborsClassifier
13. SGDClassifier:
o Purpose: A linear classifier that uses stochastic gradient descent to optimize the loss
function.
o Usage: Efficient and works well with large-scale datasets and sparse feature
representations.
o Import: from sklearn.linear_model import SGDClassifier
14. train_test_split:
o Purpose: Splits the dataset into training and testing subsets.
o Usage: Commonly used for evaluating the performance of machine learning models
by training on a portion of the data and testing on the rest.
o Import: from sklearn.model_selection import train_test_split
15. metrics:
o Purpose: Contains functions for evaluating the performance of machine learning
models.
o Usage: Provides various metrics such as accuracy, precision, recall, F1 score, etc.
o Import: from sklearn import metrics
16. metrics.confusion_matrix:
o Purpose: Computes the confusion matrix to evaluate the accuracy of a
classification.
o Usage: Helps in understanding the performance of the classification algorithm by
comparing actual and predicted values.
o Import: from sklearn.metrics import confusion_matrix
17. classification_report:
o Purpose: Generates a detailed classification report that includes precision, recall,
F1-score, and support for each class.
o Usage: Useful for gaining insight into the performance of a classification model
across different classes.
o Import: from sklearn.metrics import classification_report
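The snippet below is a minimal sketch of how these pieces are typically combined. The toy texts, labels, and parameter values are illustrative assumptions and are not taken from the lab's dataset; any of the classifiers listed above could be substituted for MultinomialNB:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Toy corpus and labels, for illustration only.
texts = ["great product works well", "terrible service very slow",
         "excellent quality", "bad experience would not recommend"]
labels = [1, 0, 1, 0]

# Convert raw text into a TF-IDF feature matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Fit a classifier and report standard evaluation metrics.
clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, pred))
print(metrics.confusion_matrix(y_test, pred))
print(metrics.classification_report(y_test, pred, zero_division=0))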
Python commands:
Step 1: Loading and Previewing Dataset
• This line reads the CSV file 'dataset (1).csv' into a pandas DataFrame, using the 'ISO-8859-1'
encoding to handle special characters.
• head() displays the first five rows of the dataset.
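A sketch of this step, assuming the file name given above:

import pandas as pd

# Read the CSV with Latin-1 style encoding to handle special characters,
# then preview the first five rows.
data = pd.read_csv('dataset (1).csv', encoding='ISO-8859-1')
print(data.head())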
Output:
• data.isnull(): This method returns a DataFrame of the same shape as data, but with boolean
values indicating whether each value is NaN (missing).
• .sum(): When applied to the DataFrame, this sums up the True values (which are treated as
1) for each column, resulting in a Series that shows the total count of missing values for each
column.
• print(...): Outputs the Series to the console, displaying the number of missing values per
column.
• data.info(): This method prints a concise summary of the DataFrame, which includes:
✓ The total number of entries (rows).
✓ The number of non-null entries in each column.
✓ The data type of each column (e.g., integer, float, object).
✓ Memory usage of the DataFrame.
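Reconstructed from the descriptions above, these two inspection calls would look roughly like this:

# Per-column count of missing values, then a concise structural summary.
print(data.isnull().sum())
data.info()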
Output:
This line fills any missing values (NaN) in the 'Event Class' column with the mode (most frequently
occurring value) of that column.
• data['Event Class'].mode()[0]: Computes the mode of the 'Event Class' column and
takes the first mode value if there are multiple.
• .fillna(...): Fills all NaN values in the 'Event Class' column with the mode value.
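A sketch of the imputation step (assigning the result back to the column is assumed; the lab code may instead use inplace=True):

# Replace missing 'Event Class' values with the column's most frequent value.
data['Event Class'] = data['Event Class'].fillna(data['Event Class'].mode()[0])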
print(data.isnull().sum())
This line calculates and prints the number of missing values in each column of the DataFrame.
• data.isnull(): Creates a DataFrame of the same shape as data with boolean values
indicating whether each entry is NaN.
• .sum(): Sums the boolean values for each column, resulting in a count of missing values
per column.
Output:
Step 4: Checking for and counting duplicate rows:
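The exact command is not shown here; a common way to do this check is:

# Count rows that are exact duplicates of an earlier row.
print(data.duplicated().sum())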
Output:
• data['lower']: This part creates a new column named 'lower' in the DataFrame data.
• =: The assignment operator assigns values to the new column.
• data['Event Description'].str.lower(): This is the expression that provides the values for the
new column. It contains the lowercase versions of the values from the 'Event Description'
column.
• stopwords.words('english'): This function from the NLTK library retrieves a list of English
stopwords.
• def remove_stopwords(text): This defines a custom function named remove_stopwords
which takes a text input and removes stopwords from it.
• [word for word in str(text).split() if word not in STOPWORDS]: This list comprehension
iterates over each word in the text, checking if it's not in the set of stopwords. If it's not a
stopword, it includes the word in the list.
• " ".join(...): This joins the list of words back into a single string, separated by spaces.
• data["punc_removed"]: This accesses the column 'punc_removed' in the DataFrame data,
which presumably contains the text data with punctuation removed.
• .apply(lambda text: remove_stopwords(text)): This applies the remove_stopwords
function to each element (text) in the 'punc_removed' column using the .apply() method. It
effectively removes stopwords from each text entry.
• data["stopwords_removed"]: This creates a new column named 'stopwords_removed' in
the DataFrame data and assigns the text data with stopwords removed to it.
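Putting these preprocessing steps together, a sketch of the described code is (the creation of the 'punc_removed' column is assumed to have happened in an earlier, unshown step):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

# Lowercase the event descriptions into a new 'lower' column.
data['lower'] = data['Event Description'].str.lower()

def remove_stopwords(text):
    # Keep only the words that are not in the stopword set.
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

# Remove stopwords from the punctuation-free text.
data["stopwords_removed"] = data["punc_removed"].apply(lambda text: remove_stopwords(text))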
Output:
Sentiment Analysis with NLTK and VADER
Sentiment analysis, a subset of natural language processing (NLP), aims to determine the
emotional tone behind a body of text. It is widely used in fields like marketing, customer service,
and social media monitoring to gauge public sentiment. One effective tool for sentiment
analysis is VADER (Valence Aware Dictionary and sEntiment Reasoner), which is particularly
adept at handling social media text.
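A minimal VADER sketch (the example sentence is illustrative only):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # lexicon required by VADER
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The event was handled quickly and professionally."))
# Returns a dict with 'neg', 'neu', 'pos', and 'compound' scores.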
Step 2:
data['Result'] is the syntax used to create a new column called Result in the existing
DataFrame data.
np.array([...]) converts the list of sentiment scores generated by the list comprehension into
a NumPy array. This ensures that the resulting series can be directly assigned to the
DataFrame column data['Result'].
The NumPy array of sentiment scores is assigned to the new column Result in the DataFrame
data.
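A sketch of this step, assuming data is the DataFrame from the earlier steps; the choice of the 'compound' score and of 'stopwords_removed' as the input column are assumptions:

import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# One polarity score per cleaned description, stored in a new 'Result' column.
data['Result'] = np.array([
    sia.polarity_scores(text)['compound']
    for text in data['stopwords_removed']
])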
Step 4:
Step 5:
Output:
1. Selecting Columns:
data[["stopwords_removed", "Result"]] selects two columns from the DataFrame data:
"stopwords_removed" and "Result".
The double square brackets [[...]] are used to specify a list of column names to be selected.
The selected columns are "stopwords_removed", which presumably contains the preprocessed
text data, and "Result", which contains the sentiment analysis results; the selection is
assigned to a new DataFrame called cls.
2. Displaying the First Few Rows:
cls.head() displays the first few rows of the DataFrame cls, which contains only the selected
columns.
This allows you to inspect the selected columns and their corresponding sentiment analysis
results.
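A sketch of this final step:

# Keep only the cleaned text and its sentiment result, then preview the first rows.
cls = data[["stopwords_removed", "Result"]]
print(cls.head())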