AI LAB FINAL

The document outlines various Python libraries essential for data processing, visualization, and machine learning, including numpy, pandas, matplotlib, and sklearn. It discusses specific functionalities of libraries like NLTK for natural language processing, including stopwords and WordNet, as well as various classifiers and techniques from sklearn for model evaluation and text vectorization. Additionally, it details steps for data preprocessing, such as handling missing values, removing duplicates, and encoding categorical labels.


Python Libraries

This section covers Python libraries commonly used for data processing, visualization, and machine learning tasks.
Here are the details:

1. Libraries Imported:

o numpy (np): Used for linear algebra operations.

o pandas (pd): Essential for data processing and reading CSV files.

o matplotlib.pyplot (plt): Typically used for plotting and visualization.

o sklearn: A machine learning library.

o string: For string manipulation.

o nltk: The Natural Language Toolkit for working with human language data.

2. Purpose:

o These libraries provide essential tools for various tasks, such as data analysis,
machine learning model building, and text processing. A minimal import sketch follows.
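
A minimal sketch of the corresponding import block, using the conventional aliases named above:

import numpy as np                # linear algebra
import pandas as pd               # data processing, CSV file I/O
import matplotlib.pyplot as plt   # plotting and visualization
import sklearn                    # machine learning
import string                     # punctuation constants and string manipulation
import nltk                       # natural language toolkit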

NLTK (the Natural Language Toolkit) and its stopwords and WordNet corpora:

1. Stopwords:

o Definition: Stopwords are common words (such as “the,” “and,” “is,” etc.) that
appear frequently in a language but typically do not carry significant meaning on
their own.

o Purpose: They are often removed from text during natural language processing
(NLP) tasks to improve efficiency and focus on more meaningful content.

o Example: In English, stopwords include words like “a,” “an,” “the,” “in,” “of,” and
“and.”
o Import: from nltk.corpus import stopwords

2. WordNet:

o Definition: WordNet is a lexical database that organizes words into synsets (sets of
synonyms) and provides semantic relationships between words.

o Purpose: It is widely used for natural language understanding, word sense
disambiguation, and semantic similarity calculations.

o Import: from nltk.corpus import wordnet

1. from nltk.corpus import stopwords:

o This line imports the stopwords dataset from the Natural Language Toolkit (NLTK)
library.
o nltk: Refers to the NLTK library, which is a powerful Python package for natural
language processing (NLP) tasks.
o corpus: Within NLTK, corpus is a module that provides access to various text corpora
and lexical resources.
o stopwords: Specifically, this refers to a collection of common words in a language
(like "the", "is", "in") that are often removed from text data because they typically do
not contribute much to the meaning of a sentence.

2. from nltk.stem.porter import PorterStemmer:

o This line imports the PorterStemmer class from the porter submodule of the stem
module within NLTK.
o stem: Within NLTK, stem is a module that contains implementations of different
stemming algorithms.
o porter: Specifically, porter is a submodule within the stem module that contains the
implementation of the Porter stemming algorithm.
o PorterStemmer: The PorterStemmer class is used to apply the Porter stemming
algorithm, which reduces words to their root or base form. For example, "running"
would be stemmed to "run".
3. WordNetLemmatizer:
o Purpose: The WordNetLemmatizer is a class in NLTK used for lemmatization.
o Usage: Lemmatization is the process of reducing words to their base or root form
(called a lemma). Unlike stemming, which simply chops off prefixes or suffixes to
obtain the root word, lemmatization applies linguistic rules to find the lemma of a
word.
o Example: The lemmatized form of "running" would be "run".

4. wordnet:
• Purpose: This is a corpus reader for the WordNet lexical database.
• Usage: WordNet is a large lexical database of English. It groups words into sets of
synonyms called synsets and provides short definitions, usage examples, and
information on semantic relationships between words.
• Example: You can use WordNet to find synonyms, antonyms, hypernyms, hyponyms,
and more for a given word.
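
The short sketch below illustrates these four imports in use; it assumes the relevant NLTK data packages have been downloaded (the nltk.download calls handle that on first run):

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')   # one-time download of the stopword lists
nltk.download('wordnet')     # one-time download of the WordNet database
nltk.download('omw-1.4')     # needed by some NLTK versions for WordNet lemmas

print(stopwords.words('english')[:10])            # first ten English stopwords

stemmer = PorterStemmer()
print(stemmer.stem('running'))                    # 'run'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))   # 'run'

# Synonyms of 'happy' collected from its WordNet synsets
print({lemma.name() for syn in wordnet.synsets('happy') for lemma in syn.lemmas()})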

From Sklearn library:

The first two imports below come from the feature_extraction.text module of the scikit-learn library (sklearn); the remaining classes and functions come from other sklearn modules, as noted in each entry.

1. CountVectorizer:
o Purpose: CountVectorizer is used to convert a collection of text documents into a
matrix of token counts. It essentially tokenizes text, builds a vocabulary of known
words, and generates a document-term matrix where each row represents a
document and each column represents the count of a word in that document.
o Usage: CountVectorizer is commonly used in natural language processing (NLP)
tasks such as text classification, clustering, and information retrieval.

2. TfidfVectorizer:
o Purpose: TfidfVectorizer is similar to CountVectorizer but it converts a collection of
raw documents to a matrix of TF-IDF features. TF-IDF stands for Term Frequency-
Inverse Document Frequency, which measures the importance of a word in a
document relative to a collection of documents. It helps in highlighting words that are
unique to a document while downweighting common words across documents.
o Usage: TfidfVectorizer is commonly used in text mining and information retrieval
tasks to represent text data in a numerical format suitable for machine learning
algorithms.
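
As an illustration of the difference, here is a small sketch on a made-up two-document corpus (the sentences are placeholders, not from the lab dataset):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]   # placeholder corpus

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)        # document-term matrix of raw counts
print(count_vec.get_feature_names_out())      # vocabulary learned from the corpus
print(counts.toarray())

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)         # TF-IDF weighted document-term matrix
print(tfidf.toarray().round(2))               # shared words like 'the' get lower weight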

3. KFold:
o Purpose: Used for cross-validation, splitting the dataset into k consecutive folds.
o Usage: Typically used in evaluating model performance and tuning hyperparameters.
o Import: from sklearn.model_selection import KFold

4. MultinomialNB:
o Purpose: A naive Bayes classifier suitable for classification with discrete features.
o Usage: Commonly used for text classification tasks.
o Import: from sklearn.naive_bayes import MultinomialNB

5. LogisticRegression:
o Purpose: A linear model for binary classification.
o Usage: Commonly used as a baseline model for binary classification problems.
o Import: from sklearn.linear_model import LogisticRegression

6. DecisionTreeClassifier:
o Purpose: A non-parametric supervised learning method used for classification and
regression tasks.
o Usage: Known for its simplicity, interpretability, and ability to handle both numerical
and categorical data.
o Import: from sklearn.tree import DecisionTreeClassifier

7. LinearSVC:
o Purpose: A linear support vector classifier, similar to SVC with a linear kernel but
implemented with the liblinear library, which scales better to large datasets.
o Usage: Suitable for large-scale classification problems, especially with high-
dimensional data.
o Import: from sklearn.svm import LinearSVC
8. BaggingClassifier:
o Purpose: An ensemble method that improves stability and accuracy by combining
predictions from multiple base models trained on different subsets of the training
data.
o Usage: Applies bagging to any base classifier.
o Import: from sklearn.ensemble import BaggingClassifier

9. RandomForestClassifier:
o Purpose: An ensemble method that constructs multiple decision trees during
training and outputs the class that is the mode of the classes (classification) or the
mean prediction (regression) of the individual trees.
o Usage: Robust to overfitting and widely used for classification tasks.
o Import: from sklearn.ensemble import RandomForestClassifier

10. ExtraTreesClassifier:
o Purpose: An ensemble learning method similar to RandomForestClassifier but with
a few differences in the way the trees are built.
o Usage: Computationally efficient; the extra randomization in split selection can
sometimes improve generalization compared to RandomForestClassifier.
o Import: from sklearn.ensemble import ExtraTreesClassifier

11. MLPClassifier:
o Purpose: A feedforward artificial neural network model for classification tasks.
o Usage: Learns non-linear relationships between features and target labels.
o Import: from sklearn.neural_network import MLPClassifier

12. KNeighborsClassifier:
o Purpose: A simple and effective classification algorithm based on nearest neighbors
in the feature space.
o Usage: Especially useful when the decision boundary is not well-defined or when the
data is noisy.
o Import: from sklearn.neighbors import KNeighborsClassifier

13. SGDClassifier:
o Purpose: A linear classifier that uses stochastic gradient descent to optimize the loss
function.
o Usage: Efficient and works well with large-scale datasets and sparse feature
representation
o Import: from sklearn.linear_model import SGDClassifier
14. train_test_split:
o Purpose: Splits the dataset into training and testing subsets.
o Usage: Commonly used for evaluating the performance of machine learning models
by training on a portion of the data and testing on the rest.
o Import: from sklearn.model_selection import train_test_split

15. metrics:
o Purpose: Contains functions for evaluating the performance of machine learning
models.
o Usage: Provides various metrics such as accuracy, precision, recall, F1 score, etc.
o Import: from sklearn import metrics

16. metrics.confusion_matrix:
o Purpose: Computes the confusion matrix to evaluate the accuracy of a
classification.
o Usage: Helps in understanding the performance of the classification algorithm by
comparing actual and predicted values.
o Import: from sklearn.metrics import confusion_matrix

17. accuracy_score, recall_score, precision_score, f1_score:


o Purpose: Computes different evaluation metrics for classification tasks.
o Usage: Useful for assessing the performance of classifiers based on various aspects
such as accuracy, precision, recall, and F1 score.
o Import: from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

18. classification_report:
o Purpose: Generates a detailed classification report that includes precision, recall,
F1-score, and support for each class.
o Usage: Useful for gaining insight into the performance of a classification model
across different classes.
o Import: from sklearn.metrics import classification_report
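
To connect the classifiers and metrics listed above, here is a compact sketch that trains MultinomialNB on an invented toy corpus and evaluates it with the metrics described; the texts and labels are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

# Invented example data: short texts with binary labels (1 = positive, 0 = negative)
texts = ["good service", "bad service", "great product", "terrible product",
         "good product", "bad experience", "great experience", "terrible service"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

x = CountVectorizer().fit_transform(texts)
x_train, x_test, y_train, y_test = train_test_split(x, labels, test_size=0.25, random_state=42)

clf = MultinomialNB().fit(x_train, y_train)
pred = clf.predict(x_test)

print(accuracy_score(y_test, pred))
print(precision_score(y_test, pred, zero_division=0))
print(recall_score(y_test, pred, zero_division=0))
print(f1_score(y_test, pred, zero_division=0))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred, zero_division=0))
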
Python commands:
Step 1: Loading and Previewing Dataset

• This line reads the CSV file 'dataset (1).csv' into a pandas DataFrame, using the 'ISO-8859-1'
encoding to handle special characters.
• head() displays the first five rows of the dataset (see the sketch below).
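
A sketch of the code this step describes, assuming the lab's CSV file 'dataset (1).csv' is present in the working directory:

import pandas as pd

data = pd.read_csv('dataset (1).csv', encoding='ISO-8859-1')   # handle special characters
print(data.head())                                             # first five rows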

Output:

Step 2: Checking for Null values

• data.isnull(): This method returns a DataFrame of the same shape as data, but with boolean
values indicating whether each value is NaN (missing).
• .sum(): When applied to the DataFrame, this sums up the True values (which are treated as
1) for each column, resulting in a Series that shows the total count of missing values for each
column.
• print(...): Outputs the Series to the console, displaying the number of missing values per
column.
• data.info(): This method prints a concise summary of the DataFrame, which includes:
✓ The total number of entries (rows).
✓ The number of non-null entries in each column.
✓ The data type of each column (e.g., integer, float, object).
✓ Memory usage of the DataFrame.
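
Roughly, the commands this step describes (continuing with the DataFrame data from Step 1):

print(data.isnull().sum())   # count of missing values per column
data.info()                  # row count, non-null counts, dtypes, memory usage
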
Output:

Step 3: Filling Null values

This line fills any missing values (NaN) in the 'Event Class' column with the mode (most frequently
occurring value) of that column.

• data['Event Class'].mode()[0]: Computes the mode of the 'Event Class' column and
takes the first mode value if there are multiple.
• .fillna(...): Fills all NaN values in the 'Event Class' column with the mode value.

print(data.isnull().sum())
This line calculates and prints the number of missing values in each column of the DataFrame.

• data.isnull(): Creates a DataFrame of the same shape as data with boolean values
indicating whether each entry is NaN.
• .sum(): Sums the boolean values for each column, resulting in a count of missing values
per column.
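
A sketch of the fill-and-verify code described above; the column name 'Event Class' is taken from the lab text:

data['Event Class'] = data['Event Class'].fillna(data['Event Class'].mode()[0])
print(data.isnull().sum())   # confirm the missing values have been filled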

Output:
Step 4: Checking and counting the duplicate rows:

print(data.duplicated()) checks for duplicate rows in the DataFrame.

• data.duplicated(): Returns a Series of boolean values indicating whether each row is a
duplicate of a previous row in the DataFrame.
Result: Prints the Series to the console, where True indicates a duplicate row, and False
indicates a unique row.
data[data.duplicated()].shape[0] counts the number of duplicate rows in the DataFrame.
• data[data.duplicated()]: Filters the DataFrame to include only the rows that are
duplicates.
• .shape[0]: Returns the number of rows in the filtered DataFrame, which is the count of
duplicate rows.
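
The corresponding commands, roughly:

print(data.duplicated())                  # True marks a duplicate of an earlier row
print(data[data.duplicated()].shape[0])   # number of duplicate rows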

Output:

Step 5: Dropping Duplicates and resetting the index

data.drop_duplicates(inplace=True, keep='first') removes duplicate rows from the DataFrame.

• inplace=True: Modifies the original DataFrame without creating a new copy.


• keep='first': Keeps the first occurrence of each duplicate row and removes the subsequent
duplicates.
Result: The DataFrame data will no longer contain any duplicate rows.
data.reset_index(inplace=True, drop=True) resets the index of the DataFrame, creating a new
index from 0 to n-1.

• inplace=True: Modifies the original DataFrame without creating a new copy.


• drop=True: Drops the old index instead of adding it as a new column.
Result: The DataFrame data will have a new integer index starting from 0, making it easier to work
with and ensuring sequential indexing after dropping duplicates.
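
A sketch of the two calls this step describes:

data.drop_duplicates(inplace=True, keep='first')   # keep only the first occurrence of each row
data.reset_index(inplace=True, drop=True)          # new 0..n-1 index; old index is discarded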

Step 6: Adding a Lowercase Column to DataFrame

• data['lower']: This part creates a new column named 'lower' in the DataFrame data.
• =: The assignment operator assigns values to the new column.
• data['Event Description'].str.lower(): This is the expression that provides the values for the
new column. It contains the lowercase versions of the values from the 'Event Description'
column.
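
In code, roughly:

data['lower'] = data['Event Description'].str.lower()   # lowercase copy of the text column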

Step 7: Removing Punctuation from Text Data

• string.punctuation: This is a string constant provided by Python's string module, containing
all punctuation characters. It includes characters like '.', ',', '!', etc.
• def remove_punctuation(text): This defines a custom function named remove_punctuation
which takes a text input and removes punctuation characters from it.
• text.translate(str.maketrans('', '', PUNCT_TO_REMOVE)): This line utilizes the str.translate()
method to remove punctuation characters from the input text. The str.maketrans() method
creates a translation table where punctuation characters are mapped to None, effectively
removing them from the text.
• data["lower"]: This accesses the column 'lower' in the DataFrame data, which presumably
contains the text data with lowercase characters.
• .apply(lambda text: remove_punctuation(text)): This applies the remove_punctuation
function to each element (text) in the 'lower' column using the .apply() method. It effectively
removes punctuation from each text entry.
• data["punc_removed"]: This creates a new column named 'punc_removed' in the
DataFrame data and assigns the text data with punctuation removed to it.
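
A sketch of the punctuation-removal code described above (PUNCT_TO_REMOVE is the name used in the lab text for string.punctuation):

import string

PUNCT_TO_REMOVE = string.punctuation   # all ASCII punctuation characters

def remove_punctuation(text):
    # Map every punctuation character to None, i.e. delete it
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

data["punc_removed"] = data["lower"].apply(lambda text: remove_punctuation(text))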

Step 8: Removing Stopwords from Text Data

• stopwords.words('english'): This function from the NLTK library retrieves a list of English
stopwords.
• def remove_stopwords(text): This defines a custom function named remove_stopwords
which takes a text input and removes stopwords from it.
• [word for word in str(text).split() if word not in STOPWORDS]: This list comprehension
iterates over each word in the text, checking if it's not in the set of stopwords. If it's not a
stopword, it includes the word in the list.
• " ".join(...): This joins the list of words back into a single string, separated by spaces.
• data["punc_removed"]: This accesses the column 'punc_removed' in the DataFrame data,
which presumably contains the text data with punctuation removed.
• .apply(lambda text: remove_stopwords(text)): This applies the remove_stopwords
function to each element (text) in the 'punc_removed' column using the .apply() method. It
effectively removes stopwords from each text entry.
• data["stopwords_removed"]: This creates a new column named 'stopwords_removed' in
the DataFrame data and assigns the text data with stopwords removed to it.
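
A sketch of the stopword-removal code described above (STOPWORDS is the name used in the lab text):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))   # set membership tests are fast

def remove_stopwords(text):
    # Keep only the words that are not in the English stopword list
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

data["stopwords_removed"] = data["punc_removed"].apply(lambda text: remove_stopwords(text))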

Step 9: Encoding Categorical Labels with LabelEncoder


• from sklearn.preprocessing import LabelEncoder imports the LabelEncoder class
from the sklearn.preprocessing module. The LabelEncoder is used to convert categorical
labels into numerical form.
• le = LabelEncoder() : This line initializes a LabelEncoder object named le.
• data['Target']: This accesses the column 'Target' in the DataFrame data, which presumably
contains categorical labels.
• le.fit_transform(data['Target']): This applies the fit_transform() method of the LabelEncoder
object to the 'Target' column. It fits the label encoder to the unique values in the 'Target'
column and transforms them into numerical labels.
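
A sketch of this step, assuming (as the text implies) that the encoded labels are written back into the 'Target' column:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Target'] = le.fit_transform(data['Target'])   # categorical labels -> integer codes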

Step 10: Vectorizing Text Data with CountVectorizer

• from sklearn.feature_extraction.text import CountVectorizer: This line imports the
CountVectorizer class, which is used to convert a collection of text documents into a matrix
of token counts.
• vectorizer = CountVectorizer(): This line initializes a CountVectorizer object named
vectorizer.
• vectorizer.fit_transform(data['stopwords_removed']): This line applies the fit_transform()
method of the CountVectorizer object to the 'stopwords_removed' column of the DataFrame
data. It fits the vectorizer to the text data and transforms it into a matrix of token counts.
• y = data['Target']: This line extracts the target variable, typically denoted as 'y', from the
DataFrame data.
• print(x):This line prints the transformed matrix x, which represents the token counts of the
text data.
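
In code, roughly:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(data['stopwords_removed'])   # sparse document-term count matrix
y = data['Target']                                        # target variable
print(x)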

Step 11: Splitting Data into Training and Test Sets


• from sklearn.model_selection import train_test_split: This line imports the
train_test_split function, which is used to split datasets into random train and test subsets.
• train_test_split(x, y, test_size=0.3, random_state=42): This line splits the input features x
and the target variable y into training and test sets.
• test_size=0.3: This parameter specifies the proportion of the dataset to include in the test
split. Here, it's set to 30%.
• random_state=42: This parameter sets the random seed for reproducibility. It ensures that
the split is deterministic and reproducible across multiple runs.
• print('Training Data ', x_train.shape, y_train.shape) print('Test Data ', x_test.shape,
y_test.shape): These lines print the shapes of the training and test data arrays to verify the
split.
• x_train.shape and y_train.shape represent the dimensions of the training data (input
features and target variable).
• x_test.shape and y_test.shape represent the dimensions of the test data (input features
and target variable).
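
A sketch of the split and the shape checks described above:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
print('Training Data ', x_train.shape, y_train.shape)
print('Test Data ', x_test.shape, y_test.shape)
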
Output:
Step 12: TF-IDF Vectorization and Data Splitting

• from sklearn.feature_extraction.text import TfidfVectorizer: This line imports the
TfidfVectorizer class, which is used to convert a collection of raw documents to a matrix of
TF-IDF features.
• tfidf = TfidfVectorizer(): This line initializes a TfidfVectorizer object named tfidf.
• x = tfidf.fit_transform(data['stopwords_removed'].values): This line applies the
fit_transform() method of the TfidfVectorizer object to the 'stopwords_removed' column of
the DataFrame data. It fits the vectorizer to the text data and transforms it into a matrix of
TF-IDF features.
• y = data['Target']: This line extracts the target variable, typically denoted as 'y', from the
DataFrame data.
• train_test_split(x, y, test_size=0.3, random_state=42): This line splits the TF-IDF features
(x) and the target variable (y) into training and test sets.
• test_size=0.3: This parameter specifies the proportion of the dataset to include in the test
split. Here, it's set to 30%.
• random_state=42: This parameter sets the random seed for reproducibility. It ensures that
the split is deterministic and reproducible across multiple runs.
• print('Training Data ', x_train.shape, y_train.shape) print('Test Data ', x_test.shape,
y_test.shape): These lines print the shapes of the training and test data arrays to verify the
split.
• x_train.shape and y_train.shape represent the dimensions of the training data (TF-IDF
features and target variable).
• x_test.shape and y_test.shape represent the dimensions of the test data (TF-IDF features
and target variable).
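
A sketch of the TF-IDF vectorization and split described above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer()
x = tfidf.fit_transform(data['stopwords_removed'].values)   # TF-IDF feature matrix
y = data['Target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
print('Training Data ', x_train.shape, y_train.shape)
print('Test Data ', x_test.shape, y_test.shape)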

Step 13: Logistic Regression Model Evaluation


• model = LogisticRegression(): This line initializes a Logistic Regression model.
• clf = model.fit(x_train, y_train):This line trains the Logistic Regression model using the
training data (x_train and y_train).
• predictions = model.predict(x_test): This line uses the trained model to make predictions
on the test data (x_test).
• metrics.classification_report(y_test, predictions): This line prints a classification report,
which includes precision, recall, F1-score, and support for each class, as well as the average
metrics.
• metrics.confusion_matrix(y_test, predictions): This line prints the confusion matrix,
which is a table showing the counts of true positive, true negative, false positive, and false
negative predictions.
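
A sketch of the model training and evaluation code described above:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
clf = model.fit(x_train, y_train)     # train on the training split
predictions = model.predict(x_test)   # predict on the held-out test split
print(metrics.classification_report(y_test, predictions))
print(metrics.confusion_matrix(y_test, predictions))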

Output:
Sentiment Analysis with NLTK and VADER

Sentiment analysis, a subset of natural language processing (NLP), aims to determine the
emotional tone behind a body of text. It is widely used in fields like marketing, customer service,
and social media monitoring to gauge public sentiment. One effective tool for sentiment
analysis is VADER (Valence Aware Dictionary and sEntiment Reasoner), which is particularly
adept at handling social media text.

• Downloading the Lexicon:
nltk.download('vader_lexicon'): Downloads the VADER lexicon needed for sentiment
analysis.
• Importing the Analyzer:
from nltk.sentiment.vader import SentimentIntensityAnalyzer: Imports the
SentimentIntensityAnalyzer class from NLTK.
• Creating an Analyzer Instance:
sia = SentimentIntensityAnalyzer(): Initializes the sentiment analyzer and assigns it to the
variable sia.
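
Put together, the setup looks roughly like this; the sample sentence at the end is only a quick sanity check:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')            # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I'm Happy"))   # dict of neg/neu/pos/compound scores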

Step 1: Sentiment Analysis with TextBlob


• import textblob: imports the TextBlob library, which provides a simple API for diving into
common natural language processing (NLP) tasks.
• from textblob import TextBlob: This specifically imports the TextBlob class from the
textblob module. This class is used to create a TextBlob object, which provides a range of NLP
functionalities, including sentiment analysis.
• def analyze_polarity(stopwords_removed): defines a function named analyze_polarity that
takes one argument, stopwords_removed. This argument is expected to be a string from
which stopwords (common words that don't carry significant meaning, like "and", "the", "is")
have been removed.
• result = TextBlob(stopwords_removed) creates a TextBlob object from the input string. This
object, result, allows us to access various NLP features, including sentiment analysis.
• if result.sentiment.polarity > 0: checks the polarity score of the sentiment analysis. Polarity
is a float within the range [-1.0, 1.0], where negative values indicate negative sentiment,
positive values indicate positive sentiment, and values around zero indicate neutral
sentiment.
• If the polarity is greater than 0, the function returns 1, indicating positive sentiment.
• elif result.sentiment.polarity == 0: checks if the polarity is exactly zero. If so, the function
returns 0, indicating neutral sentiment.
else: covers the case where polarity is less than 0. The function returns -1, indicating negative
sentiment.

Testing the analyze_polarity Function:


s = "I'm Happy" assigns the string "I'm Happy" to the variable s. This string is a positive
sentiment statement.
print(analyze_polarity(s)) calls the analyze_polarity function with s as the argument and
prints the returned value.
The function analyzes the polarity of "I'm Happy", which has a positive sentiment. Therefore,
the expected output is 1.
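
A sketch of the analyze_polarity function and the quick test described above:

from textblob import TextBlob

def analyze_polarity(stopwords_removed):
    # Return 1 for positive, 0 for neutral, -1 for negative polarity
    result = TextBlob(stopwords_removed)
    if result.sentiment.polarity > 0:
        return 1
    elif result.sentiment.polarity == 0:
        return 0
    else:
        return -1

s = "I'm Happy"
print(analyze_polarity(s))   # expected output: 1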

Step 2:

1. Creating a New Column in the DataFrame:

data['Result'] is the syntax used to create a new column called Result in the existing
DataFrame data.

2. List Comprehension to Apply the Function:


[analyze_polarity(stopwords_removed) for stopwords_removed in
data['stopwords_removed']] is a list comprehension that iterates over each element in the
stopwords_removed column of the DataFrame data.
for stopwords_removed in data['stopwords_removed'] iterates over each value (each string
of text with stopwords removed) in the stopwords_removed column.
analyze_polarity(stopwords_removed) calls the analyze_polarity function for each value in
the stopwords_removed column. This function analyzes the sentiment of the text and returns
1 for positive sentiment, 0 for neutral sentiment, and -1 for negative sentiment.

3. Converting the List to a NumPy Array:

np.array([...]) converts the list of sentiment scores generated by the list comprehension into
a NumPy array. This ensures that the resulting series can be directly assigned to the
DataFrame column data['Result'].

4. Assigning the NumPy Array to the New Column:

The NumPy array of sentiment scores is assigned to the new column Result in the DataFrame
data.
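
In code, roughly (analyze_polarity is the function defined in Step 1):

import numpy as np

data['Result'] = np.array([analyze_polarity(stopwords_removed)
                           for stopwords_removed in data['stopwords_removed']])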

Step 3: Classifying Text Based on Sentiment Analysis Results

1. Positive Sentiment Classification:


o List Comprehension:
[head for index, head in enumerate(data['stopwords_removed']) if data['Result'][index] > 0.5]
creates a list of text entries from the stopwords_removed column that have a positive
sentiment score.
o Enumerate:
enumerate(data['stopwords_removed']) provides both the index and the value (head) of each
entry in the stopwords_removed column.
o Condition:
if data['Result'][index] > 0.5 checks if the sentiment score in the Result column at the current
index is greater than 0.5, indicating positive sentiment.
2. Unbiased Sentiment Classification:
o List Comprehension:
[head for index, head in enumerate(data['stopwords_removed']) if data['Result'][index] ==
0.5] creates a list of text entries from the stopwords_removed column that have a neutral
sentiment score.
o Condition:
if data['Result'][index] == 0.5 checks if the sentiment score in the Result column at the current
index is equal to 0.5, indicating neutral sentiment.
3. Negative Sentiment Classification:
o List Comprehension:
[head for index, head in enumerate(data['stopwords_removed']) if data['Result'][index] < 0.5]
creates a list of text entries from the stopwords_removed column that have a negative
sentiment score.
o Condition:
if data['Result'][index] < 0.5 checks if the sentiment score in the Result column at the current
index is less than 0.5, indicating negative sentiment.
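
A sketch of the three list comprehensions, with the variable names (positive, unbiased, negative) taken from the next step. Note that because analyze_polarity returns -1, 0, or 1, the == 0.5 comparison as written never matches, so strictly neutral entries end up in the < 0.5 branch; the thresholds are kept exactly as described above:

positive = [head for index, head in enumerate(data['stopwords_removed'])
            if data['Result'][index] > 0.5]
unbiased = [head for index, head in enumerate(data['stopwords_removed'])
            if data['Result'][index] == 0.5]
negative = [head for index, head in enumerate(data['stopwords_removed'])
            if data['Result'][index] < 0.5]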

Step 4 :

1. Calculating Percentage of Positive Data:


len(positive) calculates the number of entries classified as positive sentiment.
len(data['stopwords_removed']) calculates the total number of entries in the
stopwords_removed column of the DataFrame data.
The expression len(positive) * 100 / len(data['stopwords_removed']) computes the
percentage of positive data by dividing the count of positive entries by the total count
of entries and then multiplying by 100 to obtain the percentage.
2. Calculating Percentage of Negative Data:
len(negative) calculates the number of entries classified as negative sentiment.
len(data['stopwords_removed']) calculates the total number of entries in the
stopwords_removed column of the DataFrame data.
The expression len(negative) * 100 / len(data['stopwords_removed']) computes the
percentage of negative data by dividing the count of negative entries by the total count
of entries and then multiplying by 100 to obtain the percentage.
3. Calculating Percentage of Unbiased Data:
len(unbiased) calculates the number of entries classified as unbiased sentiment.
len(data['stopwords_removed']) calculates the total number of entries in the
stopwords_removed column of the DataFrame data.
The expression len(unbiased) * 100 / len(data['stopwords_removed']) computes the
percentage of unbiased data by dividing the count of unbiased entries by the total
count of entries and then multiplying by 100 to obtain the percentage.
Printing Results:
This line prints the percentages of positive, negative, and unbiased data.
str(o_pos), str(o_neg), and str(o_un) convert the calculated percentages to strings so that
they can be concatenated with the descriptive text in the print statement.
\n is used for newline characters to format the output nicely.
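
A sketch of the percentage calculations and the print statement; the exact label wording in the print call is a guess, but the variable names o_pos, o_neg, and o_un come from the text above:

o_pos = len(positive) * 100 / len(data['stopwords_removed'])
o_neg = len(negative) * 100 / len(data['stopwords_removed'])
o_un = len(unbiased) * 100 / len(data['stopwords_removed'])

print("Positive data: " + str(o_pos) + "%\n" +
      "Negative data: " + str(o_neg) + "%\n" +
      "Unbiased data: " + str(o_un) + "%")
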
Output:

Step 5:

1. Importing NLTK and VADER:


import nltk imports the NLTK library, which provides various tools and datasets for natural
language processing tasks.
from nltk.sentiment.vader import SentimentIntensityAnalyzer imports the
SentimentIntensityAnalyzer class from the VADER module of NLTK. VADER is specifically
designed for sentiment analysis and provides a pre-trained model for analyzing sentiment in
text.
2. Defining the sentiment_analysis Function:
def sentiment_analysis(text): defines a function named sentiment_analysis that takes a
single argument text, which represents the text to be analyzed.
sia = SentimentIntensityAnalyzer() creates an instance of the SentimentIntensityAnalyzer
class, which will be used to perform sentiment analysis.
sentiment_scores = sia.polarity_scores(text) calls the polarity_scores method of the
SentimentIntensityAnalyzer instance sia to analyze the sentiment of the input text. This
method returns a dictionary containing the sentiment scores (positive, negative, neutral, and
compound).
return sentiment_scores returns the sentiment scores dictionary.
3. Applying Sentiment Analysis to DataFrame:
data['Result'] = np.array([sentiment_analysis(stopwords_removed) for stopwords_removed
in data['stopwords_removed']]) applies the sentiment_analysis function to each entry in the
stopwords_removed column of the DataFrame data. The sentiment analysis results are
stored in a new column named Result.
data.head() displays the first few rows of the DataFrame data, including the newly added
Result column.
print(data['Result'])
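
A sketch of the VADER-based function and its application to the DataFrame, as described above:

import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def sentiment_analysis(text):
    # Return VADER's dict of neg/neu/pos/compound scores for the given text
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = sia.polarity_scores(text)
    return sentiment_scores

# Each entry in 'Result' is now a score dictionary, replacing the earlier -1/0/1 values
data['Result'] = np.array([sentiment_analysis(stopwords_removed)
                           for stopwords_removed in data['stopwords_removed']])
data.head()
print(data['Result'])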

Output:

Step 6: Selecting Columns for Analysis

1. Selecting Columns:
data[["stopwords_removed", "Result"]] selects two columns from the DataFrame data:
"stopwords_removed" and "Result".
The double square brackets [[...]] are used to specify a list of column names to be selected.
The selected columns are "stopwords_removed" which presumably contains preprocessed
text data and "Result" which contains the sentiment analysis results.
2. Displaying the First Few Rows:
cls.head() displays the first few rows of the DataFrame cls, which contains only the selected
columns.
This allows you to inspect the selected columns and their corresponding sentiment analysis
results.
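
In code, roughly (cls is the variable name used in the text above):

cls = data[["stopwords_removed", "Result"]]   # keep only the text and its sentiment result
print(cls.head())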
