AI LAB FINAL
The lab begins by importing Python libraries commonly used for data processing, visualization,
and machine learning tasks. Here are the details:
1. Libraries Imported:
o pandas (pd): Essential for data processing and reading CSV files.
o nltk: The Natural Language Toolkit for working with human language data.
2. Purpose:
o These libraries provide essential tools for various tasks, such as data analysis,
machine learning model building, and text processing.
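A minimal sketch of these two imports (the pd alias follows the usual convention and is assumed here):

import pandas as pd   # data processing and reading CSV files
import nltk           # Natural Language Toolkit for text/NLP tasks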
Next, we import the NLTK (Natural Language Toolkit) and explore its stopwords and WordNet corpora (a combined usage sketch follows this list):
1. Stopwords:
o Definition: Stopwords are common words (such as “the,” “and,” “is,” etc.) that
appear frequently in a language but typically do not carry significant meaning on
their own.
o Purpose: They are often removed from text during natural language processing
(NLP) tasks to improve efficiency and focus on more meaningful content.
o Example: In English, stopwords include words like “a,” “an,” “the,” “in,” “of,” and
“and.”
o Import: from nltk.corpus import stopwords
2. WordNet:
o Definition: WordNet is a lexical database that organizes words into synsets (sets of
synonyms) and provides semantic relationships between words.
3. from nltk.corpus import stopwords:
o This line imports the stopwords dataset from the Natural Language Toolkit (NLTK)
library.
o nltk: Refers to the NLTK library, which is a powerful Python package for natural
language processing (NLP) tasks.
o corpus: Within NLTK, corpus is a module that provides access to various text corpora
and lexical resources.
o stopwords: Specifically, this refers to a collection of common words in a language
(like "the", "is", "in") that are often removed from text data because they typically do
not contribute much to the meaning of a sentence.
4. from nltk.stem.porter import PorterStemmer:
o This line imports the PorterStemmer class from the porter submodule of the stem
module within NLTK.
o stem: Within NLTK, stem is a module that contains implementations of different
stemming algorithms.
o porter: Specifically, porter is a submodule within the stem module that contains the
implementation of the Porter stemming algorithm.
o PorterStemmer: The PorterStemmer class is used to apply the Porter stemming
algorithm, which reduces words to their root or base form. For example, "running"
would be stemmed to "run".
5. WordNetLemmatizer:
o Purpose: The WordNetLemmatizer is a class in NLTK used for lemmatization.
o Usage: Lemmatization is the process of reducing words to their base or root form
(called a lemma). Unlike stemming, which simply chops off prefixes or suffixes to
obtain the root word, lemmatization applies linguistic rules to find the lemma of a
word.
o Example: The lemmatized form of "running" would be "run".
6. wordnet:
• Purpose: This is a corpus reader for the WordNet lexical database.
• Usage: WordNet is a large lexical database of English. It groups words into sets of
synonyms called synsets and provides short definitions, usage examples, and
information on semantic relationships between words.
• Example: You can use WordNet to find synonyms, antonyms, hypernyms, hyponyms,
and more for a given word.
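The following is a small usage sketch of these NLTK tools. The download calls and the example words are illustrative assumptions, not taken from the lab code:

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# The corpora must be downloaded once before first use.
nltk.download('stopwords')
nltk.download('wordnet')

STOPWORDS = set(stopwords.words('english'))      # e.g. {'the', 'is', 'in', ...}

stemmer = PorterStemmer()
print(stemmer.stem('running'))                   # 'run'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' (verb POS tag assumed)

# WordNet: collect synonyms of a word from its synsets.
synonyms = {lemma.name() for syn in wordnet.synsets('happy') for lemma in syn.lemmas()}
print(synonyms)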
The next imports come from the scikit-learn library (sklearn). The two vectorizers are from its feature_extraction.text module, and the remaining classes come from the submodules noted under each item (a combined usage sketch follows this list).
1. CountVectorizer:
o Purpose: CountVectorizer is used to convert a collection of text documents into a
matrix of token counts. It essentially tokenizes text, builds a vocabulary of known
words, and generates a document-term matrix where each row represents a
document and each column represents the count of a word in that document.
o Usage: CountVectorizer is commonly used in natural language processing (NLP)
tasks such as text classification, clustering, and information retrieval.
2. TfidfVectorizer:
o Purpose: TfidfVectorizer is similar to CountVectorizer but it converts a collection of
raw documents to a matrix of TF-IDF features. TF-IDF stands for Term Frequency-
Inverse Document Frequency, which measures the importance of a word in a
document relative to a collection of documents. It helps in highlighting words that are
unique to a document while downweighting common words across documents.
o Usage: TfidfVectorizer is commonly used in text mining and information retrieval
tasks to represent text data in a numerical format suitable for machine learning
algorithms.
3. KFold:
o Purpose: Used for cross-validation, splitting the dataset into k consecutive folds.
o Usage: Typically used in evaluating model performance and tuning hyperparameters.
o Import: from sklearn.model_selection import KFold
4. MultinomialNB:
o Purpose: A naive Bayes classifier suitable for classification with discrete features.
o Usage: Commonly used for text classification tasks.
o Import: from sklearn.naive_bayes import MultinomialNB
5. LogisticRegression:
o Purpose: A linear model for binary classification.
o Usage: Commonly used as a baseline model for binary classification problems.
o Import: from sklearn.linear_model import LogisticRegression
6. DecisionTreeClassifier:
o Purpose: A non-parametric supervised learning method used for classification and
regression tasks.
o Usage: Known for its simplicity, interpretability, and ability to handle both numerical
and categorical data.
o Import: from sklearn.tree import DecisionTreeClassifier
7. LinearSVC:
o Purpose: A linear support vector classifier, similar to SVC with a linear kernel but
implemented with the liblinear library.
o Usage: Suitable for large-scale classification problems, especially with high-dimensional
data.
o Import: from sklearn.svm import LinearSVC
8. BaggingClassifier:
o Purpose: An ensemble method that improves stability and accuracy by combining
predictions from multiple base models trained on different subsets of the training
data.
o Usage: Applies bagging to any base classifier.
o Import: from sklearn.ensemble import BaggingClassifier
9. RandomForestClassifier:
o Purpose: An ensemble method that constructs multiple decision trees during
training and outputs the class that is the mode of the classes (classification) or the
mean prediction (regression) of the individual trees.
o Usage: Robust to overfitting and widely used for classification tasks.
o Import: from sklearn.ensemble import RandomForestClassifier
10. ExtraTreesClassifier:
o Purpose: An ensemble learning method similar to RandomForestClassifier, except that
split thresholds are drawn at random rather than searched for exhaustively.
o Usage: Typically faster to train than RandomForestClassifier and often generalizes
comparably well.
o Import: from sklearn.ensemble import ExtraTreesClassifier
11. MLPClassifier:
o Purpose: A feedforward artificial neural network model for classification tasks.
o Usage: Learns non-linear relationships between features and target labels.
o Import: from sklearn.neural_network import MLPClassifier
12. KNeighborsClassifier:
o Purpose: A simple and effective classification algorithm based on nearest neighbors
in the feature space.
o Usage: Especially useful when the decision boundary is not well-defined or when the
data is noisy.
o Import: from sklearn.neighbors import KNeighborsClassifier
13. SGDClassifier:
o Purpose: A linear classifier that uses stochastic gradient descent to optimize the loss
function.
o Usage: Efficient and works well with large-scale datasets and sparse feature
representations.
o Import: from sklearn.linear_model import SGDClassifier
14. train_test_split:
o Purpose: Splits the dataset into training and testing subsets.
o Usage: Commonly used for evaluating the performance of machine learning models
by training on a portion of the data and testing on the rest.
o Import: from sklearn.model_selection import train_test_split
15. metrics:
o Purpose: Contains functions for evaluating the performance of machine learning
models.
o Usage: Provides various metrics such as accuracy, precision, recall, F1 score, etc.
o Import: from sklearn import metrics
16. metrics.confusion_matrix:
o Purpose: Computes the confusion matrix to evaluate the accuracy of a
classification.
o Usage: Helps in understanding the performance of the classification algorithm by
comparing actual and predicted values.
o Import: from sklearn.metrics import confusion_matrix
17. classification_report:
o Purpose: Generates a detailed classification report that includes precision, recall,
F1-score, and support for each class.
o Usage: Useful for gaining insight into the performance of a classification model
across different classes.
o Import: from sklearn.metrics import classification_report
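The snippet below is a minimal sketch of how these pieces are typically combined. The toy texts, labels, and parameter values are illustrative assumptions and are not taken from the lab's dataset; any of the classifiers listed above could be substituted for MultinomialNB:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Toy corpus and labels, for illustration only.
texts = ["great product works well", "terrible service very slow",
         "excellent quality", "bad experience would not recommend"]
labels = [1, 0, 1, 0]

# Convert raw text into a TF-IDF feature matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Fit a classifier and report standard evaluation metrics.
clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, pred))
print(metrics.confusion_matrix(y_test, pred))
print(metrics.classification_report(y_test, pred, zero_division=0))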
Python commands:
Step 1: Loading and Previewing Dataset
• This line reads the CSV file 'dataset (1).csv' into a pandas DataFrame, using the 'ISO-8859-1'
encoding to handle special characters.
• head() displays the first five rows of the dataset.
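A sketch of this step, assuming the file name given above:

import pandas as pd

# Read the CSV with Latin-1 style encoding to handle special characters,
# then preview the first five rows.
data = pd.read_csv('dataset (1).csv', encoding='ISO-8859-1')
print(data.head())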
Output:
• data.isnull(): This method returns a DataFrame of the same shape as data, but with boolean
values indicating whether each value is NaN (missing).
• .sum(): When applied to the DataFrame, this sums up the True values (which are treated as
1) for each column, resulting in a Series that shows the total count of missing values for each
column.
• print(...): Outputs the Series to the console, displaying the number of missing values per
column.
• data.info(): This method prints a concise summary of the DataFrame, which includes:
✓ The total number of entries (rows).
✓ The number of non-null entries in each column.
✓ The data type of each column (e.g., integer, float, object).
✓ Memory usage of the DataFrame.
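Reconstructed from the descriptions above, these two inspection calls would look roughly like this:

# Per-column count of missing values, then a concise structural summary.
print(data.isnull().sum())
data.info()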
Output:
This line fills any missing values (NaN) in the 'Event Class' column with the mode (most frequently
occurring value) of that column.
• data['Event Class'].mode()[0]: Computes the mode of the 'Event Class' column and
takes the first mode value if there are multiple.
• .fillna(...): Fills all NaN values in the 'Event Class' column with the mode value.
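A sketch of the imputation step (assigning the result back to the column is assumed; the lab code may instead use inplace=True):

# Replace missing 'Event Class' values with the column's most frequent value.
data['Event Class'] = data['Event Class'].fillna(data['Event Class'].mode()[0])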
print(data.isnull().sum())
This line calculates and prints the number of missing values in each column of the DataFrame.
• data.isnull(): Creates a DataFrame of the same shape as data with boolean values
indicating whether each entry is NaN.
• .sum(): Sums the boolean values for each column, resulting in a count of missing values
per column.
Output:
Step 4: Checking for and counting duplicate rows:
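The exact command is not shown here; a common way to do this check is:

# Count rows that are exact duplicates of an earlier row.
print(data.duplicated().sum())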
Output:
• data['lower']: This part creates a new column named 'lower' in the DataFrame data.
• =: The assignment operator assigns values to the new column.
• data['Event Description'].str.lower(): This is the expression that provides the values for the
new column. It contains the lowercase versions of the values from the 'Event Description'
column.
• stopwords.words('english'): This function from the NLTK library retrieves a list of English
stopwords.
• def remove_stopwords(text): This defines a custom function named remove_stopwords
which takes a text input and removes stopwords from it.
• [word for word in str(text).split() if word not in STOPWORDS]: This list comprehension
iterates over each word in the text, checking if it's not in the set of stopwords. If it's not a
stopword, it includes the word in the list.
• " ".join(...): This joins the list of words back into a single string, separated by spaces.
• data["punc_removed"]: This accesses the column 'punc_removed' in the DataFrame data,
which presumably contains the text data with punctuation removed.
• .apply(lambda text: remove_stopwords(text)): This applies the remove_stopwords
function to each element (text) in the 'punc_removed' column using the .apply() method. It
effectively removes stopwords from each text entry.
• data["stopwords_removed"]: This creates a new column named 'stopwords_removed' in
the DataFrame data and assigns the text data with stopwords removed to it.
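Putting these preprocessing steps together, a sketch of the described code is (the creation of the 'punc_removed' column is assumed to have happened in an earlier, unshown step):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

# Lowercase the event descriptions into a new 'lower' column.
data['lower'] = data['Event Description'].str.lower()

def remove_stopwords(text):
    # Keep only the words that are not in the stopword set.
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

# Remove stopwords from the punctuation-free text.
data["stopwords_removed"] = data["punc_removed"].apply(lambda text: remove_stopwords(text))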
Output:
Sentiment Analysis with NLTK and VADER
Sentiment analysis, a subset of natural language processing (NLP), aims to determine the
emotional tone behind a body of text. It is widely used in fields like marketing, customer service,
and social media monitoring to gauge public sentiment. One effective tool for sentiment
analysis is VADER (Valence Aware Dictionary and sEntiment Reasoner), which is particularly
adept at handling social media text.
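A minimal VADER sketch (the example sentence is illustrative only):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # lexicon required by VADER
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The event was handled quickly and professionally."))
# Returns a dict with 'neg', 'neu', 'pos', and 'compound' scores.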
Step 2:
data['Result'] is the syntax used to create a new column called Result in the existing
DataFrame data.
np.array([...]) converts the list of sentiment scores generated by the list comprehension into
a NumPy array. This ensures that the resulting series can be directly assigned to the
DataFrame column data['Result'].
The NumPy array of sentiment scores is assigned to the new column Result in the DataFrame
data.
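A sketch of this step, assuming data is the DataFrame from the earlier steps; the choice of the 'compound' score and of 'stopwords_removed' as the input column are assumptions:

import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# One polarity score per cleaned description, stored in a new 'Result' column.
data['Result'] = np.array([
    sia.polarity_scores(text)['compound']
    for text in data['stopwords_removed']
])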
Step 4:
Step 5:
Output:
1. Selecting Columns:
data[["stopwords_removed", "Result"]] selects two columns from the DataFrame data:
"stopwords_removed" and "Result".
The double square brackets [[...]] are used to specify a list of column names to be selected.
The selected columns are "stopwords_removed", which presumably contains the preprocessed
text data, and "Result", which contains the sentiment analysis results; the selection is
assigned to a new DataFrame called cls.
2. Displaying the First Few Rows:
cls.head() displays the first few rows of the DataFrame cls, which contains only the selected
columns.
This allows you to inspect the selected columns and their corresponding sentiment analysis
results.
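A sketch of this final step:

# Keep only the cleaned text and its sentiment result, then preview the first rows.
cls = data[["stopwords_removed", "Result"]]
print(cls.head())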