Sentiment Analysis
March 13, 2025
[1]: import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
# Load dataset
data = pd.read_csv('kindle_review.csv')
df = data[['reviewText', 'rating']].copy()
[2]: df
[2]: reviewText rating
0 Jace Rankin may be short, but he's nothing to … 3
1 Great short read. I didn't want to put it dow… 5
2 I'll start by saying this is the first of four… 3
3 Aggie is Angela Lansbury who carries pocketboo… 3
4 I did not expect this type of book to be in li… 4
… … …
11995 Valentine cupid is a vampire- Jena and Ian ano… 4
11996 I have read all seven books in this series. Ap… 5
11997 This book really just wasn't my cuppa. The si… 3
11998 tried to use it to charge my kindle, it didn't… 1
11999 Taking Instruction is a look into the often hi… 3
[12000 rows x 2 columns]
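Before collapsing the five-point ratings into binary labels, it can be worth checking their distribution, since the resulting positive/negative classes are likely to be imbalanced. A small optional check (not part of the original run):

print(df['rating'].value_counts().sort_index())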
[3]: # Convert rating to binary (0: negative, 1: positive)
df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)
# Lowercase
df['reviewText'] = df['reviewText'].str.lower()
# Remove URLs first, then strip the remaining special characters
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'http\S+', '', str(x)))
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
# Download NLTK resources
nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'])
# Tokenization
df['tokens'] = df['reviewText'].apply(word_tokenize)
# Remove stopwords
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
# Stemming
stemmer = PorterStemmer()
df['stemmed'] = df['tokens'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])
# POS Tagging
df['pos_tags'] = df['tokens'].apply(pos_tag)
# Lemmatization with POS
lemmatizer = WordNetLemmatizer()
def lemmatize_with_pos(tagged_tokens):
    # Map Penn Treebank tags to WordNet POS codes; adjectives are tagged 'JJ',
    # so the leading letter must be translated to 'a' rather than used directly.
    tag_map = {'J': 'a', 'R': 'r', 'N': 'n', 'V': 'v'}
    lemmatized = []
    for word, tag in tagged_tokens:
        pos = tag_map.get(tag[0], 'n')  # default to noun
        lemmatized.append(lemmatizer.lemmatize(word, pos))
    return lemmatized
df['lemmatized'] = df['pos_tags'].apply(lemmatize_with_pos)
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\itzsh\AppData\Roaming\nltk_data…
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\itzsh\AppData\Roaming\nltk_data…
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] C:\Users\itzsh\AppData\Roaming\nltk_data…
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] C:\Users\itzsh\AppData\Roaming\nltk_data…
[nltk_data] Package averaged_perceptron_tagger is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data] C:\Users\itzsh\AppData\Roaming\nltk_data…
[nltk_data] Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data] C:\Users\itzsh\AppData\Roaming\nltk_data…
[nltk_data] Package words is already up-to-date!
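Although the POS tags are reused below, the imported ne_chunk (and the maxent_ne_chunker / words resources downloaded above) is never actually called. A minimal sketch of how it could be applied to one review follows; because the text has already been lowercased and stripped of stopwords, the chunker will find few entities, so this is purely illustrative.

# Hypothetical example, not executed above: named-entity chunking on one review
sample_tags = df['pos_tags'].iloc[0]            # (word, POS) pairs from pos_tag
tree = ne_chunk(sample_tags)                    # nltk Tree with NE subtrees
entities = [' '.join(tok for tok, _ in subtree.leaves())
            for subtree in tree if hasattr(subtree, 'label')]
print(entities)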
[4]: from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np
# Bag of Words (BOW)
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(df['lemmatized'].apply(' '.join))
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['lemmatized'].apply(' '.join))
# Word2Vec
# Train Word2Vec model
sentences = df['lemmatized'].tolist()
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Convert sentences to vectors by averaging word vectors
def sentence_vector(sentence):
    # Average the word vectors; fall back to a zero vector if no token is in the vocabulary
    vectors = [w2v_model.wv[word] for word in sentence if word in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.vector_size)
X_w2v = np.array([sentence_vector(sentence) for sentence in sentences])
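As a quick qualitative check of the Word2Vec model (not part of the original run), the nearest neighbours of a frequent token can be inspected; 'book' is assumed to occur in the lemmatized corpus, which is virtually certain for Kindle reviews.

# Illustrative: five most similar tokens to 'book' in the learned embedding space
print(w2v_model.wv.most_similar('book', topn=5))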
[5]: from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Split data; a fixed random_state keeps the three feature matrices aligned with the same y_train / y_test
X_train_bow, X_test_bow, y_train, y_test = train_test_split(X_bow, df['rating'], test_size=0.2, random_state=42)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(X_tfidf, df['rating'], test_size=0.2, random_state=42)
X_train_w2v, X_test_w2v, _, _ = train_test_split(X_w2v, df['rating'], test_size=0.2, random_state=42)
# Train models
nb_bow = GaussianNB().fit(X_train_bow.toarray(), y_train)
nb_tfidf = GaussianNB().fit(X_train_tfidf.toarray(), y_train)
nb_w2v = GaussianNB().fit(X_train_w2v, y_train)
# Evaluate
y_pred_bow = nb_bow.predict(X_test_bow.toarray())
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf.toarray())
y_pred_w2v = nb_w2v.predict(X_test_w2v)
print("BOW Accuracy:", accuracy_score(y_test, y_pred_bow))
print("TF-IDF Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print("Word2Vec Accuracy:", accuracy_score(y_test, y_pred_w2v))
BOW Accuracy: 0.57125
TF-IDF Accuracy: 0.4525
Word2Vec Accuracy: 0.5091666666666667
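Accuracy alone can be misleading when the binary labels are imbalanced, as they likely are here. A hedged sketch of a fuller evaluation for the BOW model:

from sklearn.metrics import classification_report, confusion_matrix
# Per-class precision/recall/F1 and the confusion matrix for the BOW Naive Bayes model
print(classification_report(y_test, y_pred_bow, target_names=['negative', 'positive']))
print(confusion_matrix(y_test, y_pred_bow))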
[6]: # CNN Model using existing Word2Vec embeddings
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from sklearn.model_selection import train_test_split
# Convert lemmatized tokens to sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['lemmatized'].apply(lambda x: ' '.join(x)))
sequences = tokenizer.texts_to_sequences(df['lemmatized'].apply(lambda x: ' '.join(x)))
# Pad sequences
max_len = max(len(s) for s in sequences)
X_cnn = pad_sequences(sequences, maxlen=max_len, padding='post')
y = df['rating']
# Split data with fixed random_state
X_train_cnn, X_test_cnn, y_train_cnn, y_test_cnn = train_test_split(X_cnn, y, test_size=0.2, random_state=42)
# Load existing Word2Vec embeddings
embedding_dim = 100
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
if word in w2v_model.wv:
embedding_matrix[i] = w2v_model.wv[word]
# Build CNN model
model_cnn = Sequential()
model_cnn.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length=max_len, trainable=False))
model_cnn.add(Conv1D(128, 5, activation='relu'))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(1, activation='sigmoid'))
model_cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train and evaluate
model_cnn.fit(X_train_cnn, y_train_cnn, epochs=20, validation_data=(X_test_cnn, y_test_cnn))
loss, accuracy = model_cnn.evaluate(X_test_cnn, y_test_cnn)
print(f"CNN Accuracy: {accuracy:.4f}")
C:\Users\itzsh\AppData\Local\Programs\Python\Python312\Lib\site-packages\keras\src\layers\core\embedding.py:90: UserWarning: Argument `input_length` is deprecated. Just remove it.
  warnings.warn(
Epoch 1/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 63ms/step - accuracy: 0.7099 - loss: 0.5623 - val_accuracy: 0.7725 - val_loss: 0.4640
Epoch 2/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 19s 62ms/step - accuracy: 0.7932 - loss: 0.4413 - val_accuracy: 0.7887 - val_loss: 0.4366
Epoch 3/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 66ms/step - accuracy: 0.8290 - loss: 0.3919 - val_accuracy: 0.7171 - val_loss: 0.6283
Epoch 4/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 22s 73ms/step - accuracy: 0.8139 - loss: 0.3938 - val_accuracy: 0.7871 - val_loss: 0.4415
Epoch 5/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.8338 - loss: 0.3696 - val_accuracy: 0.7742 - val_loss: 0.4720
Epoch 6/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 21s 71ms/step - accuracy: 0.8592 - loss: 0.3223 - val_accuracy: 0.7942 - val_loss: 0.4341
Epoch 7/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.8816 - loss: 0.2871 - val_accuracy: 0.7967 - val_loss: 0.4406
Epoch 8/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.8980 - loss: 0.2596 - val_accuracy: 0.7942 - val_loss: 0.4483
Epoch 9/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.9035 - loss: 0.2508 - val_accuracy: 0.7887 - val_loss: 0.4552
Epoch 10/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 21s 68ms/step - accuracy: 0.9062 - loss: 0.2372 - val_accuracy: 0.7896 - val_loss: 0.4663
Epoch 11/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 68ms/step - accuracy: 0.9428 - loss: 0.1871 - val_accuracy: 0.7833 - val_loss: 0.5050
Epoch 12/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.9312 - loss: 0.1930 - val_accuracy: 0.7804 - val_loss: 0.4908
Epoch 13/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.9480 - loss: 0.1580 - val_accuracy: 0.7862 - val_loss: 0.5018
Epoch 14/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 68ms/step - accuracy: 0.9576 - loss: 0.1441 - val_accuracy: 0.7725 - val_loss: 0.5382
Epoch 15/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 21s 69ms/step - accuracy: 0.9662 - loss: 0.1268 - val_accuracy: 0.7754 - val_loss: 0.5315
Epoch 16/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.9718 - loss: 0.1106 - val_accuracy: 0.7808 - val_loss: 0.5851
Epoch 17/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.9747 - loss: 0.1027 - val_accuracy: 0.7754 - val_loss: 0.5782
Epoch 18/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 67ms/step - accuracy: 0.9827 - loss: 0.0863 - val_accuracy: 0.7717 - val_loss: 0.6004
Epoch 19/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 68ms/step - accuracy: 0.9826 - loss: 0.0792 - val_accuracy: 0.7783 - val_loss: 0.6086
Epoch 20/20
300/300 ━━━━━━━━━━━━━━━━━━━━ 20s 68ms/step - accuracy: 0.9850 - loss: 0.0752 - val_accuracy: 0.7650 - val_loss: 0.6401
75/75 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.7521 - loss: 0.6422
CNN Accuracy: 0.7650
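The trained CNN can also be used to score unseen text by repeating the same preprocessing steps. The function below is an illustrative sketch (not executed above) that reuses the fitted tokenizer, max_len, and lemmatize_with_pos from earlier cells; the example review string is made up.

def score_review(text):
    # Apply (approximately) the same preprocessing as the training data
    text = re.sub(r'http\S+', '', str(text).lower())
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    tokens = [w for w in word_tokenize(text) if w not in stop_words]
    lemmas = lemmatize_with_pos(pos_tag(tokens))
    seq = pad_sequences(tokenizer.texts_to_sequences([' '.join(lemmas)]),
                        maxlen=max_len, padding='post')
    return float(model_cnn.predict(seq)[0][0])   # estimated probability of a positive review

print(score_review("Could not put this book down, absolutely loved it!"))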