
A MAJOR PROJECT REPORT

ON

E-COMMERCE REVIEW ANALYSIS


SUBMITTED IN PARTIAL FULFILLMENT FOR THE AWARD OF DEGREE OF

BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION
ENGINEERING

Submitted by:
ANSHU PRIYA (9916102068)
ABHISHEK KUMAR (9916102193)

Under the Guidance of:
DR. YOGESH KUMAR

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

JAYPEE INSTITUTE OF INFORMATION TECHNOLOGY, NOIDA (U.P.)

May, 2020
CERTIFICATE

This is to certify that the major project report entitled, “E-COMMERCE REVIEW
ANALYSIS” submitted by ABHISHEK KUMAR and ANSHU PRIYA in partial
fulfillment of the requirements for the award of Bachelor of Technology Degree in
Electronics and Communication Engineering of the Jaypee Institute of Information
Technology, Noida is an authentic work carried out by them under my supervision and
guidance. The matter embodied in this report is original and has not been submitted for
the award of any other degree.

Dr. Yogesh Kumar


ECE Department,
JIIT, Sec-128,
Noida-201304

Dated:

DECLARATION
We hereby declare that this written submission represents our own ideas in our own
words, and that where others’ ideas or words have been included, we have adequately
cited and referenced the original sources. We also declare that we have adhered to all
principles of academic honesty and integrity and have not misrepresented, fabricated or
falsified any idea/data/fact/source in our submission.

Place: NOIDA Name: ABHISHEK KUMAR

Date: Enrollment: 9916102193

Name: ANSHU PRIYA

Enrollment: 9916102068

ABSTRACT
Over the last few years, online marketplaces have been growing at a rapid rate. To sustain
this growth, sellers increasingly ask their buyers what they think about the products they
purchase. As a result, enormous numbers of ratings and reviews are published every day,
which leaves prospective customers in a dilemma: they struggle to decide whether a
product is worth their money. Product managers face a similar problem, because the sheer
volume of comments generated daily makes them difficult to analyse. The project
presented here addresses this problem: we categorize comments or reviews according to
the sentiment they express. For example, "good" is a positive word and "bad" is a negative
word, and such words determine the polarity of a review. All of this is explained in detail
in this report, and a study of Amazon product reviews is carried out to demonstrate how
the approach works.

ACKNOWLEDGEMENT
It is our privilege to express our sincerest regards to our major project supervisor, Dr.
Yogesh Kumar for his valuable inputs, guidance, encouragement, whole-hearted

cooperation and constructive criticism throughout the duration of our project.

We deeply express our sincere thanks to our Head of Department for encouraging and
allowing us to present the project on the topic “E-COMMERCE REVIEW ANALYSIS”
at our department premises in partial fulfillment of the requirements leading to the award
of the B.Tech degree. We take this opportunity to thank all our professors who have
directly or indirectly helped us with our project.

We pay our respect and love to our parents and all other family members and friends for

their love and encouragement throughout our career. Last but not the least we express our

thanks to our friends for their cooperation and support.

Signature:

Names:
ABHISHEK KUMAR (9916102193)
ANSHU PRIYA (9916102068)

TABLE OF CONTENTS

TOPIC

Certificate
Declaration
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables

CHAPTER 1 - INTRODUCTION
1.1 Sentiment Analysis
1.2 E-Commerce
1.3 Machine Learning in E-Commerce
1.4 Machine Learning in Business

CHAPTER 2 - LITERATURE SURVEY
2.1 Works Carried Out Earlier
2.1.1 Artificial Intelligence Based Bank Cheque Signature Verification System
2.1.2 NLP Using Artificial Intelligence
2.1.3 Matrix Factorization Method
2.1.4 Shallow Window-Based Method
2.2 Applications of Machine Learning

CHAPTER 3 - INFORMATION ABOUT THE DATASET
3.1 Data Preprocessing
3.2 Data Resampling
3.3 Features

CHAPTER 4 - METHODS
4.1 Naive Bayes
4.1.1 Multinomial Naive Bayes
4.2 Linear Support Vector Machine
4.3 K-nearest Neighbor
4.4 Long Short-Term Memory
4.4.1 Core Idea Behind LSTM
4.5 GloVe (Global Vectors for Word Representation)

CHAPTER 5 - RESULTS AND DISCUSSION
5.1 Word Cloud
5.1.1 Different Areas Where Word Clouds Can Be Used

CHAPTER 6 - CONCLUSION AND FUTURE SCOPE
6.1 Conclusion
6.2 Future Scope of Sentiment Analysis

REFERENCES

APPENDIX
LIST OF FIGURES
FIGURE

Fig. 1.1: E-Commerce illustration
Fig. 1.2: AI/ML investment areas
Fig. 3.1: A few entries of the Consumer Reviews of Amazon Products dataset
Fig. 4.1: A simple representation of how multinomial naive Bayes works
Fig. 4.2: Support vectors
Fig. 4.3: Before and after applying K-NN
Fig. 4.4: Representation of LSTM
Fig. 4.5: A part of the working of LSTM
Fig. 4.6: The sigmoid layer
Fig. 5.1: Comparison graph of various models based on accuracy
Fig. 5.2: Comparison of one product based on positive and negative reviews
Fig. 5.3: Word cloud of positive words
Fig. 5.4: Word cloud of negative words
LIST OF TABLES
TABLE

Table 1.1: Names of ML chatbots
Table 1.2: Artificial Intelligence Assistants
Table 5.1: Performance of different models
CHAPTER 1: INTRODUCTION

1.1 Sentiment Analysis

In the last few years, researchers and students have put a huge amount of effort into
understanding the emotions, feelings and opinions expressed in textual data. From the
material available across the internet, it is clear that this field has seen a surge of interest
recently. The research area is very large, and the part of it we are concerned with is known
as sentiment analysis or opinion mining. The idea is that, given a collection of words as a
dataset, we can study customers' opinions and emotions: how they think about a certain
product and its features, whether the product makes them happy, and whether they are
likely to buy it again in the future. This technique has diverse uses.

Consider an example: most of the time, businesses want to know their customers'
opinions and reviews about the products and services they provide. They want to know
whether their services make their customers happy, so that they can make the necessary
changes. On the other hand, potential buyers also want to know the thoughts of people
who have already bought the product they are interested in. Hence, this helps both
communities to a great extent. Moreover, researchers use this data to analyse trends in
e-commerce in depth, which could ultimately make life easier for all of us. [2]

However, this is not an easy task, as there are millions of websites on the internet with a
lot of products to sell, and therefore an even larger number of reviews and opinions.
Every such site contains a large amount of opinionated text that is not easy to decode, and
maintaining a dataset of so many texts is even more difficult. An average human would
have a lot of trouble identifying such sites and extracting all the information and reviews
they contain. Apart from that, teaching a computer to recognize sarcasm is a really
difficult and challenging task; no matter how smart AI has become, computers still cannot
think like a living being. [5]

So in this project, the challenge is to decide whether a comment made about a certain
product falls into the helpful (positive) bucket or the unfavourable (negative) bucket.
Once this distinction is made, we need to build a supervised learning model to assign
polarity to the large number of comments. Along with traditional algorithms such as
naive Bayes, K-nearest neighbor and linear support vector machines, these models can
also incorporate deep learning techniques such as convolutional neural networks and
recurrent neural networks. These models are then analysed and their accuracies compared
with one another, which helps us better understand the polarized texts and how they
describe a product. [3]

1.2 E-Commerce

The sale and purchase of commodities or products over the web is termed electronic
business or web-based business; another expression for it is e-business. Examples of
e-commerce sites are Flipkart, eBay, Amazon and so on. Web-based business offers
unique features such as cashless payment, 24x7 service availability and improved sales.
According to a report by Kleiner Perkins Caufield Byers, a venture capital firm, Amazon
India is eventually going to dominate the country's online market business. [1]

Given below are a few internet business types:


 Business to business model
 Government to citizen model
 Consumer to business model
 Consumer to consumer model
 Business to consumer model
 Government to business model

Fig. 1.1: E-Commerce illustration

1.3 Machine Learning in E-Commerce

Artificial intelligence helps e-commerce businesses understand their buyers better.

With the facilities of AI and machine learning, businesses today can use large datasets to
learn about customer opinions and buying patterns. Machine learning has the ability to
learn on its own, and these algorithms are useful for offering customizable goods,
commodities or gifts to the many customers on a site. Some of the important applications
in web-based business are given below:

 Customers are able to choose products in real time

These days, web-based businesses try to give buyers the same feeling as shopping in a
supermarket. Even though they are buying things online, they do not have to go through
any inconvenience and can easily look for products in no time. With the help of ML,
buyers are offered customizable items such as a luxury item or a gift for their loved ones.

 Searching with the help of pictures

Many software packages include techniques for recognizing a picture. With such
techniques, people browsing online stores can search using pictures they already have:
searches that were earlier done with words are now being done with images. The
technique matches the provided photo against content on the internet and shows the
results to the viewer with very little effort. Pinterest is a website and phone application
that uses a similar method: people can upload an image, the site matches it against its own
content and displays the results, and viewers are then free to choose whatever they like.

 Recruitment with the help of AI

The artificial intelligence revolution is used by HR divisions in many ways: conference
meetings, online networks such as LinkedIn, screening activities, and discovering interns
and students looking for jobs are a few examples. Hence, the workload of HR is reduced
to a great extent with the help of AI techniques and algorithms, as the most qualified
candidate gets the job with very little effort.

 Voice Powered Search

In online shopping, text-based search is being replaced by voice search. Voice recognition
accuracy is much better than it used to be; in practice, about 69% of requests to the
Google Assistant are already phrased conversationally. Some smart devices can be
controlled entirely by voice. One example is Apple's HomePod, which is controlled by
Siri; Amazon's Echo, powered by Alexa, is another. Voice searches through Alexa can be
used to order a product to be shipped from Amazon. According to a ComScore study,
half of all searches were expected to be voice-based by 2020.

 Conversational commerce

Platforms that support a chat feature can help buyers purchase a product by texting; this
relies on natural language processing. Online transactions are increasingly facilitated by
chatbots for large companies: Taco Bell uses TacoBot on Slack, and H&M uses a bot on
the Kik messenger. Popular companies such as Tommy Hilfiger launched a fashion
chatbot on Facebook Messenger during New York Fashion Week in 2016, becoming the
first company to use a messenger application for selling a range of goods. Instances of a
few chatbot-based applications are listed in Table 1.1 below:

Table 1.1: Names of ML chatbots

Field                      Name of the ChatBot    Platform

Web-based business         Operator               iPhone Operating System
Money matters              Chip                   iPhone Operating System as well as Android
Medicine                   Babylon Health         iPhone Operating System as well as Android
Food and outings           Reserve                iPhone Operating System as well as Android
Schooling                  Duolingo               iPhone Operating System as well as Android

 Services provided to the customer

Artificial intelligence impacts customer assistance through chatbots. Chatbots are
computer programs built to converse with users. Conversational platforms like chatbots
communicate in natural language to give the client individualized, satisfying customer
care. They enable advertisers to communicate with the client in real time and find out
what the client really wants, so that they have a direction to look into. Eva, the AI-based
banking chatbot built by HDFC Bank, is one such example in our country, India. Client
questions can be answered in a very short time frame across a large number of channels.
Yatra is a website that uses a similar method for online travel within and outside India;
its chatbots are smart enough to let purchasers check flight details and book their tickets
directly through the messenger chatbot.

 Near-enough private assistants

This feature has helped a lot of individuals learn to make intelligent choices when
spending money on what they shop for. For example, Ping was a chat facility developed
by Flipkart; it worked well in providing assistance to shoppers but had to be shut down
recently. Ping was powered by AI so that customers could be shown the items they
desired in no time. Alexa, Amazon's home assistant, is another AI development that
provides customers with a virtual shopping aide: customers receive trending fashions and
experiences, all through a recognized individual's voice. Mona is another such assistant.

 Virtual Assistant

E-commerce virtual assistants are programs that act intelligently when it comes to
keeping up business supplies and other technical services; they can also do certain
favours for personal use. A chatbot is one kind of virtual assistant. Recently Lenovo has
also brought its own virtual assistant to customers, which is expected to compete with
Google Assistant and Cortana. Another artificial intelligence-powered deep learning
assistant is CAVA; it has many features, including face and voice recognition, that help
in the management of data and other information.

 Detection of fake comments

Shopper comments and ratings are essential these days, because businesses rely on
customer trust when selling online. Recent studies state that 90 percent of shoppers who
responded said that positive reviews on a product influenced them to buy it. However,
there is a high chance of fake reviews, which can affect the buying decisions of a
potential customer. Even this issue can be managed with the help of AI. Amazon is one
of the large organizations that use artificial intelligence to fight fake reviews: Amazon's
AI system gives priority to comments from verified purchasers and to comments that
other shoppers mark as genuine.

 Sale procedures constructed by AI

AI combined with a client relationship system is a successful answer for managing sales.
This AI-empowered method permits a CRM framework to answer shoppers' questions,
tackle their problems and issues, and even create new opportunities for the sales team.
Clients will no longer see irrelevant options while purchasing on the web.

 Commercials centred on the customer

Artificial intelligence methods can be used to deliver customer-centric merchandise.

There are many other fields where AI can be used in business; some of them are stated
here:

o Formation of goods and categorizing them
o Segmenting customers for better understanding
o Sentiment understanding
o Anticipative merchandising [4]

The figure 1.2 below shows the AI/ML investment areas.

Fig. 1.2: AI/ML Investment Areas

1.4 Machine Learning in Business

Avanade conducted a study: a survey of almost 500 business and IT heads from around
the world revealed that they expect about a 33% increase in earnings with the help of
intelligent technologies. Avanade is a joint venture between Accenture and Microsoft
that supports Windows' Cortana along with other frameworks providing predictive
analytics and data-based insights. Another instance is Apptus eSales, which is built to
anticipate customers' wants and needs beforehand. Big data and machine learning are
combined in this software to find out which products might look interesting to potential
buyers as they search online or receive product recommendations. ML can predict and
spontaneously display commodities by looking at the phrases that have been searched
when a buyer visits online shops powered by Apptus eSales. Even a company as big as
Google showed its interest in machine learning when it bought DeepMind, an AI
company. Below are some of the main features of artificial intelligence that can be
utilized by web-based businesses:

o Data Mining
o NLP (Natural Language Processing)
o Machine Learning

The table 1.2 below shows the different business areas and the AI tools in each area.

Table 1.2: Artificial Intelligence Assistants

Business Area                        AI Tool

Customer Relationship                Jetlore, Taktt, Kasisto, DataFox
Retail                               Crystals, Datorama, Albert, AirPR
Sales                                6sense, Clari, Spins, Aviso, Tethr
Market Research                      Quid, Mattermark, Tracxn, Enigma, Bottlenose
Support for Consumers                Clarabridge, Brain, Aron, Digital Genius
Business Intelligence & Analytics    Ayasdi, DataRobot, Sundown, Arimo

CHAPTER 2: LITERATURE SURVEY

2.1 Works Carried Out Earlier

A number of research papers related to product reviews, sentiment analysis and opinion
mining have already been published. A few of them are described in the following
subsections.

2.1.1 Artificial Intelligence Based Bank Cheque Signature Verification System [4]

In this paper, a new plagiarism-detection procedure based on the K-Nearest Neighbors
technique was applied. The strategy groups a set of texts, which is then matched against
neighboring texts or words. A counter keeps track of how many strings match while the
records are compared, and each document is then compared against the existing
collection of records. After matching the collections of words, the copied texts are
obtained as the output. The technique determines how many times a copied word appears
in a record and additionally computes the percentage of matched copied words.

2.1.2 Natural Language Processing Using Artificial Intelligence [7]

In this research paper, the authors proposed using recursive neural networks to achieve a
better understanding of compositionality in tasks such as sentiment detection. The paper
examines the idea of natural language processing, which is one of the applications of
artificial intelligence: it is used to make computers comprehend human language. The
steps involved in NLP are morphological analysis, syntactic analysis, semantic analysis,
discourse integration and pragmatic analysis.

Some other papers are also summarized briefly here. In one of the papers we studied, the
researchers conducted surveys from which opinions were extracted; the outcome was
then analysed so that a plan of action could be developed, and very high precision was
reported. Multinomial Naive Bayes (MNB) was primarily utilized, and the baseline
classifiers were supported with a support vector machine.

In another work, an extension of existing work in the area of natural language processing
was proposed. Several important classifiers and methods were utilized to determine
whether a given review was positive or negative.

2.1.3 Matrix Factorization Method


Matrix factorization techniques have roots that extend back to LSA for producing
low-dimensional word representations. Such strategies use low-rank approximations to
decompose huge matrices in order to capture statistical information about a corpus. The
particular kind of information captured by these matrices varies from one application to
another. The matrices in LSA are of the "term-document" type: the rows correspond to
words and the columns correspond to the different documents in the corpus. In contrast,
matrices of the "term-term" type are used by HAL (1996): rows and columns both
correspond to words, while the entries correspond to the number of times a given word
occurs in the context of another given word. The fundamental problem with the
Hyperspace Analogue to Language and similar strategies is that the most frequent words
contribute a disproportionate amount to the similarity measure: the number of times two
words co-occur with a very common word, for instance, affects their similarity heavily
despite conveying relatively little about their semantic relatedness.

2.1.4 Shallow Window-Based Method


Another technique for learning word representations makes predictions within local
context windows. For instance, Bengio et al. (2003) presented a model that learns word
vector representations as part of a simple neural network architecture for language
modeling. Collobert and Weston (2008) decoupled word vector training from downstream
objectives, which prepared the way for Collobert et al. (2011) to use the full context of a
word so that word representations could be learned, instead of using only the preceding
context as in language models. Through evaluation on word-related tasks, these models
and strategies demonstrated the capacity to capture linguistic patterns as linear
relationships between word vectors. Unlike matrix factorization systems, however,
shallow window-based procedures suffer from the disadvantage that they do not operate
directly on the co-occurrence statistics of the corpus. Rather, these models scan context
windows across the entire corpus, which fails to exploit the enormous amount of
repetition in the data.
In this project, we used both standard algorithms, including K-nearest neighbor, SVM and
naive Bayes, and deep learning techniques. By looking at how accurate the models are,
we can get a better picture of how these algorithms behave in tasks such as sentiment
analysis.

2.2 Applications of Machine Learning

Artificial intelligence adoption can be seen in numerous areas. A few examples follow:

 Gaming: Machines can now compete with people in games thanks to machine
learning. ML is used in many strategic games, for example poker, chess, tic-tac-toe
and so on, where machines are given the capacity to consider numerous positions
based on heuristic knowledge. Deep Blue, built by IBM, was the very first computer
to play chess at this level. Another example is AlphaGo by Google.

 Banking: Machine learning is also applied in anti-money laundering (AML). Money
launderers conceal their activities to make their illicit cash appear legitimately earned.
The banking industry, one of the leading industries in the world, is moving from AML
systems based on conventional rule-based detection to frameworks based on machine
reasoning.

 Expert Systems: Expert systems are developed, together with AI, to take care of
difficult issues in a specific domain. The purpose of an expert system is to advise,
foresee results, propose alternative solutions and help humans in decision making.
The three pillars of an ES are the knowledge base, the inference engine and the user
interface. Instances of expert systems are CALEX, GERMWATCHER, LEY,
AGREX and so forth.

 Medicine and Healthcare Services: The machine learning approach has shown its
usefulness in many sectors, one of which is healthcare. It is applied in medical
diagnosis, risk prediction and drug discovery. For instance, in the treatment of skin
cancer, Sebastian Thrun's laboratory uses AI-based computation to recognize skin
cancer with high precision.

 Robots with smart brains: Robots fitted with sensors, for example for impact,
pressure, heat, light and temperature, can perceive physical information and carry out
instructions given by a human. They have efficient processors and large memories,
which let them make smart choices and display intelligence. Smart robots are
additionally capable of learning from their mistakes. [10]

CHAPTER 3: INFORMATION ABOUT THE DATASET

3.1 Data Preprocessing


The dataset selected for this project consists of customer reviews and ratings of products
sold by Amazon. It contains 34,660 entries in total. Every entry includes the product
category and name as well as the text review and the rating of the item. To use the data,
we first separate the rating and review columns, since these two are the essential parts of
this work. We then found that a few entries have no rating attached to the review; after
removing those entries, 34,627 entries remain, which is still a large dataset.

To get a concise overview of the dataset, we also plotted the distribution of the ratings.
The dataset is partitioned into 5 classes on a scale of 1 to 5. The classes are not evenly
distributed: classes 1 and 2 contain comparatively few entries, while class 5 contains
more than 20,000 entries. [7] One example review from the dataset reads: 'The item here
has not frustrated yet. The kids love utilizing it. In fact, even I admire the capacity to
screen control the content my kids are exposed to, effortlessly.'

3.2 Data Resampling


Since our dataset is imbalanced, we attempted data resampling in a portion of our
experiments. Resampling is a popular way of dealing with imbalanced data. In this
project, we tried to oversample the data of classes 1, 2 and 3 by repeatedly sampling those
reviews, because these classes contain fewer entries. In our training set, we repeated the
reviews labelled 1, 2 or 3 many times; this is also where overfitting can occur. [8]
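A minimal sketch of this oversampling step is given below, assuming the training split is
a pandas DataFrame with a reviews.rating column; the helper name is ours and the
repetition factor is illustrative.

import pandas as pd
from sklearn.utils import resample

def oversample_minority(train_df, label_col="reviews.rating", minority=(1, 2, 3)):
    # Repeat the reviews labelled 1, 2 or 3 until each class matches the largest class size.
    majority_size = train_df[label_col].value_counts().max()
    parts = []
    for rating, group in train_df.groupby(label_col):
        if rating in minority:
            group = resample(group, replace=True, n_samples=majority_size, random_state=0)
        parts.append(group)
    # Shuffle so the repeated minority reviews are mixed back in with the rest.
    return pd.concat(parts).sample(frac=1, random_state=0)

Oversampling is applied only to the training split; duplicating reviews in the validation or
test sets would make the accuracy figures meaningless.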

3.3 Features
Two sorts of features are used in this project. The first sort follows the conventional
approach: a vocabulary is built by listing every word that appears in the corpus, a
threshold of 6 occurrences is set for a word to be kept, and the resulting vocabulary from
the dataset contains 4,223 words. Each review is then transformed into a vector in which
every entry represents the frequency of the corresponding word. We also experimented
with changing the threshold and the resulting vocabulary length, and found that extending
the vocabulary length does not have a noticeable impact on the precision.
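A sketch of this bag-of-words step using scikit-learn is shown below; note that min_df
thresholds on document frequency, which is a close but not identical approximation of
the occurrence threshold described above.

from sklearn.feature_extraction.text import CountVectorizer

# Keep only words that appear in at least 6 reviews; with a threshold of this order
# the vocabulary built from our dataset contained about 4,223 words.
vectorizer = CountVectorizer(min_df=6)
X_counts = vectorizer.fit_transform(data["reviews.text"].astype(str))

print(X_counts.shape)  # (number of reviews, vocabulary size)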

The second sort of feature we utilized is the 50-dimensional GloVe dictionary pretrained
on Wikipedia. Here we want to exploit the meaning of each word: each review is
represented by the mean of the 50-d GloVe vectors of all the words that make up the
comment. [9]
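A minimal sketch of this representation is given below, assuming the standard Stanford
glove.6B.50d.txt file is available locally; both helper names are ours.

import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    # Load the pretrained 50-d GloVe vectors into a word -> vector dictionary.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def review_to_mean_vector(text, vectors, dim=50):
    # Represent a review by the mean of the GloVe vectors of its known words.
    words = [w for w in str(text).lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype="float32")
    return np.mean([vectors[w] for w in words], axis=0)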

The figure 3.1 below is a snapshot of the entries of the dataset containing the reviews.
Not all the entries can be shown here because of the huge size of the file. It contains the
user name, brand, rating, purchase details and more. Some of the columns of the dataset
are explained below:

reviews.username - the unique username used by the buyer on Amazon.
reviews.title - the title of the comment made on a specific product.
reviews.text - the text of the actual comment made on the product bought.
reviews.sourceURLs - the URLs of the specific product on which the comments were made.
reviews.rating - the rating given to the product by the customer; ratings are mostly accompanied by reviews.
manufacturer - the manufacturer of the product; in this case, Amazon.
categories - the category in which the reviewed and rated product lies.

Fig. 3.1: Few of the entries of the dataset of Consumer Reviews of Amazon Products

CHAPTER 4: METHODS

4.1 Naive Bayes (NB)

Naive Bayes is a widely used learning algorithm for classification problems. Its key
assumption is that the features x_i are conditionally independent given the label y:

p(x \mid y) = \prod_{i=1}^{n} p(x_i \mid y)        (4.1)

For the model to work well, we additionally incorporated Laplace smoothing. The
prediction of the model is given by the following equation:

\hat{y} = \arg\max_{y} \; p(y) \prod_{i=1}^{n} p(x_i \mid y)        (4.2)

With the first way of representing the review texts, the inputs are non-negative whole
numbers, and p(x_i \mid y) is modelled with a multinomial distribution. With the second
way of representing the review texts, using the GloVe dictionary, the inputs are
real-valued, so p(x_i \mid y) is modelled with a Gaussian distribution. [4]

4.1.1 Multinomial Naive Bayes


Multinomial NB applies the naive Bayes technique to multinomially distributed data and
is one of the two classic naive Bayes variants used for text classification (word count
vectors are used to represent the data, although other vector representations are also
known to work well in practice). The distribution is parameterised by vectors
\theta_a = (\theta_{a1}, \ldots, \theta_{am}) for each class a, where m is the number of
traits (the size of the vocabulary in the case of text classification) and \theta_{ai} is the
probability P(x_i \mid a) of trait i appearing in a sample belonging to class a.

The parameters \theta_a are estimated by a smoothed version of maximum likelihood,
i.e. relative frequency counting:

\hat{\theta}_{ai} = \frac{N_{ai} + \alpha}{N_a + \alpha m}        (4.3)

where N_{ai} = \sum_{x \in T} x_i is the number of times trait i appears in the samples of
class a in the training set T,

and N_a = \sum_{i=1}^{m} N_{ai} is the cumulative sum of all traits for class a.

The smoothing term \alpha \geq 0 accounts for traits that are not present in the training
data and prevents zero probabilities in further calculations. Setting \alpha = 1 is called
Laplace smoothing, while \alpha < 1 is known as Lidstone smoothing. The figure 4.1
underneath shows in a simple manner how a multinomial naive Bayes classifier works.

Fig. 4.1: A simple representation of how multinomial naïve bayes works.
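A short scikit-learn sketch of this classifier is given below; train_texts, train_labels,
test_texts and test_labels are placeholders standing in for the splits built in Chapter 3.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words counts serve as the multinomial features.
count_vect = CountVectorizer(min_df=6)
X_train = count_vect.fit_transform(train_texts)
X_test = count_vect.transform(test_texts)

# alpha=1.0 corresponds to the Laplace smoothing described above.
mnb = MultinomialNB(alpha=1.0)
mnb.fit(X_train, train_labels)
print("Test accuracy:", mnb.score(X_test, test_labels))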

4.2 Linear Support Vector Machine


A linear SVM is a method for creating a classifier (a weight vector) that distinguishes
between labeled datasets. Geometrically, given two types of points in a space (circles and
x's), it maximizes the minimum separation from the data points of one class to those of
the other; in other words, it tries to maximize the margin. The SVM solves the
optimization problem given below:

\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^{\top} x_i + b) \geq 1 \;\; \forall i        (4.4)

It tries to find the w that satisfies the separability constraint while solving the
maximum-margin problem. For data that is not linearly separable, the SVM can use the
kernel trick to transform the data and then find an optimal boundary between the possible
outputs based on these transformations. [3]

The figure 4.2 shown below represents the support vectors, which are the data points
closest to the separating boundary; the SVM chooses the hyper-plane (or line) that best
separates the two classes.

Fig 4.2: Support vectors
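A minimal scikit-learn sketch of a linear SVM text classifier is shown below; the TF-IDF
front end and the value of C are illustrative choices, and the variable names are
placeholders for the splits from Chapter 3.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Linear SVM on TF-IDF features; C trades off a wide margin against
# classifying every training review correctly.
svm_clf = make_pipeline(TfidfVectorizer(min_df=6), LinearSVC(C=1.0))
svm_clf.fit(train_texts, train_labels)
print("Test accuracy:", svm_clf.score(test_texts, test_labels))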

4.3 K-nearest Neighbor


K-nearest Neighbor (KNN) is one of the most widely used nonparametric classification
techniques and has been broadly utilized in recent years. When making a prediction, the
technique first searches for the K = n nearest neighbors of the input, and then assigns the
majority label of those n neighboring data points. The Euclidean distance between points
is used to quantify the similarity between data points: [4]

d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}        (4.5)

The equation above is the mathematical form of the Euclidean distance used in the KNN
calculation. The general idea of KNN is that if the inputs are similar to one another, then
the outputs should be the same. The figure 4.3 below shows the arrangement of data
before and after applying K-NN.

Fig 4.3: Before and after applying K-NN method

When K-NN is applied, assume there are two categories, Category X and Category Y,
and we have a new data point x1; the question is which of these classes the point belongs
to. A K-NN calculation takes care of exactly this sort of problem: with its assistance, we
can easily determine the category or class of a specific data point.
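A short sketch using scikit-learn is given below; X_train and X_test stand for the feature
matrices built in Chapter 3, and K = 5 is the value that performed best in our comparison.

from sklearn.neighbors import KNeighborsClassifier

# K-NN with the default Euclidean distance; the report compares K = 4, 5 and 6.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, train_labels)
print("Test accuracy:", knn.score(X_test, test_labels))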

4.4 Long Short-Term Memory


LSTMs, also known as Long Short-Term Memory networks, are a special kind of RNN
capable of learning long-term dependencies. They were introduced by Hochreiter &
Schmidhuber back in 1997, and were later refined and popularized in subsequent work.
They function very efficiently on a vast array of problems and are widely used nowadays.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for a substantial period of time is their default behaviour for all intents and
purposes; that is exactly what they are meant to do, not something they struggle to learn.

All recurrent neural networks have the form of a chain of repeating modules of neural
network. In standard RNNs, this repeating module has a very simple structure, such as a
single tanh layer.

LSTMs also have this chain-like structure, but the repeating module has a different
structure: instead of a single neural network layer, there are four, interacting in a very
particular way.

Fig. 4.4: Representation of LSTM

In the above figure 4.4, each line carries an entire vector from the output of one node to
the inputs of others. The pink circles represent pointwise operations, like vector addition,
and the yellow boxes are learned neural network layers. Merging lines denote
concatenation, while a forking line denotes its content being copied, with the copies
going to different locations.

4.4.1 Core Idea Behind LSTM


The cell state is the key to LSTMs. It is represented by the horizontal line shown in the
figure 4.5 below.

The cell state is a bit like a conveyor belt. It runs straight down the entire chain, with only
a few minor linear interactions, so it is very easy for information to simply flow along it
unchanged.

Fig. 4.5: A part of working of LSTM

The above figure shows that the LSTM has the ability to remove or add information to
the cell state, carefully regulated by structures called gates.

Gates are a way of optionally letting information through. They are composed of a
sigmoid neural net layer and a pointwise multiplication operation.

Fig. 4.6: The sigmoid layer

The sigmoid layer shown in figure 4.6 above outputs numbers between zero and one,
describing how much of each component should be let through. A value of zero means
"let nothing through," while a value of one means "let everything through!"

Gates are needed to protect and control the cell state, and an LSTM has three of them.
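The report does not fix a particular framework for the LSTM model, so the sketch below
uses Keras as one possible implementation; the vocabulary size, sequence length and
layer sizes are illustrative, and train_texts / train_labels are placeholders for the training
split.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Turn each review into a padded sequence of word indices.
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_texts)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=100)

# A small LSTM classifier: embedding -> LSTM -> softmax over the five rating classes.
model = Sequential([
    Embedding(input_dim=5000, output_dim=50),
    LSTM(64),
    Dense(5, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Ratings 1-5 are shifted to class indices 0-4 before training.
model.fit(X_train_seq, np.asarray(train_labels) - 1, epochs=3, validation_split=0.2)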

4.5 GloVe (Global Vectors for Word Representation)

Word co-occurrence statistics in a corpus are the primary source of information available
to all unsupervised machine learning methods for learning word representations.
Although many such methods now exist, questions remain about how meaning arises
from these statistics and how the resulting word vectors represent that meaning. These
questions are what we try to answer here. Such insights underlie the word representation
model known as GloVe (Global Vectors), so named because the model directly captures
global corpus statistics.

A simple example shows how certain aspects of meaning can be extracted directly from
co-occurrence probabilities. Let a and b be two words that exhibit a particular attribute of
interest; for instance, suppose we are interested in the concept of thermodynamic phase,
and take a = ice and b = steam. How these words are related can be determined by
examining the ratio of their co-occurrence probabilities with various probe words c. For
words related to ice but not steam, say c = solid, the ratio P_{ac}/P_{bc} is expected to be
very large. Conversely, for words related to steam but not ice, say c = gas, the ratio is
expected to be quite small. For words c such as water or heat, which are related to both
ice and steam, or to neither of them, the ratio should be close to one. Compared with the
raw probabilities, this ratio is much better at distinguishing relevant words (solid and gas)
from irrelevant words (water and heat), and it also discriminates well between the two
relevant words. Hence, the natural starting point for learning word vectors should be
ratios of co-occurrence probabilities rather than the probabilities themselves. Noting that
the ratio P_{ac}/P_{bc} depends on three words a, b and c, the most general form of the
model is

F(w_a, w_b, \tilde{w}_c) = \frac{P_{ac}}{P_{bc}}        (4.6)

where w \in \mathbb{R}^d are word vectors and \tilde{w} \in \mathbb{R}^d are separate
context word vectors.
GloVe is trained so that the dot product of two word vectors equals the logarithm of the
words' probability of co-occurrence. Because the logarithm of a ratio equals the
difference of logarithms, this objective associates ratios of co-occurrence probabilities
with vector differences in the word vector space. Since these ratios can encode some
form of meaning, that meaning also gets encoded as vector differences. This is why the
resulting word vectors perform very well on word analogy tasks, such as those in the
word2vec package.
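For reference, the weighted least-squares objective that this training goal corresponds to
in the original GloVe paper (Pennington et al., 2014) can be written as follows, where
X_{ij} is the co-occurrence count of words i and j, f is a weighting function, and b_i,
\tilde{b}_j are bias terms:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2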

CHAPTER 5: RESULTS AND DISCUSSION
The complete dataset of 34,627 reviews was split into three parts: a training set of 21,000
reviews (roughly 60% of the dataset), a test set of 6,813 reviews (20%) and a validation
set of 6,814 reviews (20%).
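One way to produce such a 60/20/20 split with scikit-learn is sketched below; the
random_state and stratification are illustrative choices rather than the exact procedure
used.

from sklearn.model_selection import train_test_split

# First hold out 40% of the data, then cut that portion in half for validation and test.
train_df, rest_df = train_test_split(data, test_size=0.4, random_state=42,
                                     stratify=data["reviews.rating"])
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42,
                                   stratify=rest_df["reviews.rating"])
print(len(train_df), len(val_df), len(test_df))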
We implemented several methods: multinomial naive Bayes, SVM with an RBF kernel,
SVM with a linear kernel, KNN with K = 4, 5 and 6, and long short-term memory, all
using the 4,223-dimensional input features representing the review text.

KNN-5 performs noticeably better than the other two KNN models. Similarly, the SVM
with a linear kernel is slightly better than the RBF-kernel SVM. The linear SVM has an
overfitting problem, which can be seen from the remarkable gap between its training set
accuracy and its test set accuracy. In terms of test set accuracy, the LSTM gives the best
performance.

We also ran Gaussian naive Bayes, the linear-kernel SVM, LSTM and KNN with K = 4,
5 and 6 using the 50-dimensional input features from the GloVe dictionary. Here again,
KNN-5 performs better than the other two KNN variants. We also tried resampling the
data for the LSTM model, but the test accuracy showed no improvement because of the
overfitting problem. Once again, out of all the models, the LSTM gives the best results.
Table 5.1 below gives the training and test accuracies for all the models.

Table 5.1: Performance of different models

Algorithm                              Accuracy of the training set    Accuracy of the test set

Long short-term memory with GloVe      85.6%                           65.6%
Linear support vector machine          84.1%                           69.1%
KNN                                    61.5%                           61.7%
Multinomial naive Bayes                76.2%                           71.5%
Linear SVM with GloVe                  68.7%                           68.6%
Gaussian naive Bayes with GloVe        52.2%                           52.4%
LSTM                                   72.9%                           72.1%

In general, the models that use the conventional input features perform better than those
that use the GloVe-based input features. Among the models compared here, long
short-term memory provided the most accurate analysis.

We therefore conclude that the classification model for a sentiment analysis system needs
to be chosen with great care, because this decision has a huge impact on the accuracy of
the final result. With the help of the overall sentiment and count-based metrics, we can
obtain feedback on products and organizations from customers. Companies have been
holding on to the power of data for a while, but to extract the most important insights
from it, one also needs to rely on the weight provided by AI and deep learning.

Fig. 5.1: Comparison graph of various models based on accuracy

Fig. 5.2: Comparison of one product based on positive and negative reviews

5.1 Word Cloud


Word clouds (or tag clouds) are graphical depictions of word frequency that give greater
prominence to words appearing more frequently in a source text. In other words, a word
cloud, as the name suggests, depicts a set of words in the form of a cloud: the more often
a word appears, the bigger it is drawn. Hence, just by glancing at the cloud, you can
identify the biggest words and therefore the trending topics.
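A small sketch of how such a cloud is generated from the positive reviews is shown
below, reusing the senti DataFrame and Summary_Clean column built in the Appendix.

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Join all positive reviews into one string; more frequent words are drawn larger.
text = " ".join(senti.loc[senti["senti"] == "pos", "Summary_Clean"])
cloud = WordCloud(background_color="white", stopwords=set(STOPWORDS), max_words=300).generate(text)

plt.imshow(cloud)
plt.axis("off")
plt.show()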

5.1.1 Different Areas where Word Cloud can be used

Word clouds have a diverse usage in a lot of fields. Some of them are:

1. Trending headlines on social media: to classify and organise tweets under sections
that are in demand, we can extract the top words from the reviews or tweets that users
send out and use them in the trending section.

2. Hyped topics in the news: after analysing the words and headlines of a variety of
articles and news reports, we can find the trending words among them and learn which
news topics are the most talked about in a city, a country, or the entire world.

3. Navigation on websites: many websites are driven by keywords or hashtags. We can
create a word cloud that helps users jump directly to any subject of interest, whether it
is shopping or gaming, making navigation more relevant for them.

The two figures given below are the word cloud (positive word cloud as well as negative
word cloud) generated from the dataset that we used in this project.

Fig. 5.3: Word cloud of positive words

Fig. 5.4: Word cloud of negative words

CHAPTER 6: CONCLUSION AND FUTURE SCOPE

6.1 Conclusion
From all the work and results obtained in this project, we conclude that, in terms of
complexity, KNN needed much higher computational effort than the naive Bayes
algorithm and the SVM during training. In the KNN algorithm, the distances between all
the training data points and all the evaluation data points have to be computed, which
requires a lot of time.

We also noticed that the accuracy was not much affected even when the length of the
dictionary was increased. One reason might be that when the occurrence threshold of the
dictionary was decreased, the dictionary length increased, while the number of reviews
we have is less than 40,000. In that situation, the dimension of the feature space is
significantly larger than the number of data points, so the curse of dimensionality might
become an issue.
We also realized that the plain word-count method gave much better results than the
GloVe mean-vector method. The reason might be that averaging weakens the individual
word features, so the separation between different reviews is no longer accurate.

Finally, the LSTM gives much better results than the other machine learning approaches
used in this project, which may be due to its large number of parameters. From Table 5.1
we can see that the training accuracy of the LSTM with GloVe was about 85.6% after
resampling, while its test accuracy was only 65.6%; this shows that the model has
overfitted on the resampled data, because many instances are repeated there.

6.2 Future Scope of Sentiment Analysis


Sentiment analysis has a bright future, as it is growing really fast and aiming high. It has
already moved beyond counting likes, comments and shares, and now aims to fully
understand why connections on social media matter and what customers need and want.
There are broader areas where sentiment analysis may be required; top brands will try
their best to hold on to this tool, and so will customers and consumers, because it is
favourable to all parties, whether profit or non-profit organizations. Its most important
feature is that it understands the feelings and emotions of a person and how they think
about a specific brand or organization. Hence, audiences will experience a much better
response from their favorite brand or organization and can have their needs personalized.
Organizations can further segment their audience based on how consumers actually feel
about the commodities they are offered or what they browse on social media, instead of
relying on surface factors such as age, gender or income. Ultimately, sentiment analysis
is going to play a huge role in building a better understanding between providers and
consumers, and this relationship will be strengthened.

REFERENCES

[1] Avaneet Pannu, "Artificial Intelligence and its Application in Different Areas",
International Journal of Engineering and Innovative Technology (IJEIT), Vol. 4, Issue 10,
April 2015.

[2] Dhiraj Kapoor and R. K. Gupta, "Software Cost Estimation using Artificial
Intelligence Technique", International Journal of Research and Development in Applied
Science and Engineering, Vol. 9, Issue 1, February 2016.

[3] Mausaami Sahu, "Plagiarism Detection Using Artificial Intelligence", International
Journal of Scientific & Technology Research (IJSTR), Vol. 5, Issue 04, April 2017.

[4] Ashish A. Dongare, Prof. R. D. Ghongade, "Artificial Intelligence Based Bank Cheque
Signature Verification System", International Research Journal of Engineering and
Technology (IRJET), Vol. 03, Issue 01, January 2016.

[5] Siddharth Gupta, Deep Borkar, Chevelyn De Mello, Saurabh Patil, "An E-Commerce
Website based Chatbot", International Journal of Computer Science and Information
Technologies (IJCSIT), Vol. 6 (2), 2015.

[6] Saloni Shukla, Joel Rebello, "Threat of automation: Robotics and artificial
intelligence to reduce job opportunities", http://economictimes.indiatimes.com

[7] Unnati Dhavare, Prof. Umesh Kulkarni, "Natural language processing using artificial
intelligence", International Journal of Emerging Trends & Technology in Computer
Science (IJETTCS), Vol. 4, Issue 2, March-April 2015.

[8] https://www.avanade.com/en/about/avanade/partnerships/accenture-avanade-microsoft-alliance

[9] Dirican, Cüneyt, "The impacts of robotics, artificial intelligence on business and
economics", Procedia - Social and Behavioral Sciences 195 (2015): 564-573.

[10] Min-Yuan Cheng, Denny Kusoemo, Richard Antoni Gosno, "Text mining-based
construction site accident classification using hybrid supervised machine learning",
Automation in Construction, 2020.

[11] http://www.businessinsider.in

[12] http://www.ethesis.nitrkl.ac.in

[13] R. Subash, R. Jebakumar, Yash Kamdar, Nishit Bhatt, "Automatic Image Captioning
Using Convolution Neural Networks and LSTM", Journal of Physics: Conference Series,
2019.

[14] http://www.towardsdatascience.com

[15] http://dssresearchjournal.com
APPENDIX
# This Python 3 environment comes with many helpful analytics libraries installed

# For example, here's several helpful packages to load in

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.

import pandas as pd

import matplotlib.pyplot as plt

import matplotlib as mpl

import nltk.classify.util

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix

from sklearn import metrics

from sklearn.metrics import roc_curve, auc

from nltk.classify import NaiveBayesClassifier

import numpy as np

import re

import string

import nltk

%matplotlib inline

temp = pd.read_csv(r"1429_1.csv")

temp.head()

permanent = temp[['reviews.rating' , 'reviews.text' , 'reviews.title' , 'reviews.username']]

print(permanent.isnull().sum()) #Checking for null values

permanent.head()

check = permanent[permanent["reviews.rating"].isnull()]

check.head()

senti= permanent[permanent["reviews.rating"].notnull()]

permanent.head()

senti["senti"] = senti["reviews.rating"]>=4

senti["senti"] = senti["senti"].replace([True , False] , ["pos" , "neg"])

senti["senti"].value_counts().plot.bar()

import nltk.classify.util

from nltk.classify import NaiveBayesClassifier

import numpy as np

import re

import string

import nltk

cleanup_re = re.compile('[^a-z]+')

def cleanup(sentence):
    # Lower-case the review and strip everything except letters.
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    # sentence = " ".join(nltk.word_tokenize(sentence))
    return sentence

senti["Summary_Clean"] = senti["reviews.text"].apply(cleanup)

check["Summary_Clean"] = check["reviews.text"].apply(cleanup)

split = senti[["Summary_Clean" , "senti"]]

train=split.sample(frac=0.8,random_state=200)

test=split.drop(train.index)

def word_feats(words):
    # Build the {word: True} feature dictionary expected by the NLTK classifier.
    features = {}
    for word in words:
        features[word] = True
    return features

train["words"] = train["Summary_Clean"].str.lower().str.split()

test["words"] = test["Summary_Clean"].str.lower().str.split()

check["words"] = check["Summary_Clean"].str.lower().str.split()

train.index = range(train.shape[0])

test.index = range(test.shape[0])

check.index = range(check.shape[0])

prediction = {} ## For storing results of different classifiers

train_naive = []

test_naive = []

check_naive = []

for i in range(train.shape[0]):
    train_naive = train_naive + [[word_feats(train["words"][i]), train["senti"][i]]]

for i in range(test.shape[0]):
    test_naive = test_naive + [[word_feats(test["words"][i]), test["senti"][i]]]

for i in range(check.shape[0]):
    check_naive = check_naive + [word_feats(check["words"][i])]

classifier = NaiveBayesClassifier.train(train_naive)

print("NLTK Naive bayes Accuracy : {}".format(nltk.classify.util.accuracy(classifier ,


test_naive)))

classifier.show_most_informative_features(5)

y = []

only_words = [test_naive[i][0] for i in range(test.shape[0])]

for i in range(test.shape[0]):
    y = y + [classifier.classify(only_words[i])]

prediction["Naive"] = np.asarray(y)

y1 = []

for i in range(check.shape[0]):
    y1 = y1 + [classifier.classify(check_naive[i])]

check["Naive"] = y1

from wordcloud import STOPWORDS

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.feature_extraction.text import CountVectorizer

stopwords = set(STOPWORDS)

stopwords.remove("not")

count_vect = CountVectorizer(min_df=2 ,stop_words=stopwords , ngram_range=(1,2))

tfidf_transformer = TfidfTransformer()

X_train_counts = count_vect.fit_transform(train["Summary_Clean"])

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_new_counts = count_vect.transform(test["Summary_Clean"])

X_test_tfidf = tfidf_transformer.transform(X_new_counts)

checkcounts = count_vect.transform(check["Summary_Clean"])

checktfidf = tfidf_transformer.transform(checkcounts)

from sklearn.naive_bayes import MultinomialNB

model1 = MultinomialNB().fit(X_train_tfidf , train["senti"])

prediction['Multinomial'] = model1.predict_proba(X_test_tfidf)[:,1]

print("Multinomial Accuracy : {}".format(model1.score(X_test_tfidf , test["senti"])))

check["multi"] = model1.predict(checktfidf)## Predicting Sentiment for Check which


was Null values for rating

from sklearn.naive_bayes import BernoulliNB

model2 = BernoulliNB().fit(X_train_tfidf,train["senti"])

prediction['Bernoulli'] = model2.predict_proba(X_test_tfidf)[:,1]

print("Bernoulli Accuracy : {}".format(model2.score(X_test_tfidf , test["senti"])))

check["Bill"] = model2.predict(checktfidf)## Predicting Sentiment for Check which was


Null values for rating

from sklearn import linear_model

logreg = linear_model.LogisticRegression(solver='lbfgs' , C=1000)

logistic = logreg.fit(X_train_tfidf, train["senti"])

prediction['LogisticRegression'] = logreg.predict_proba(X_test_tfidf)[:,1]

print("Logistic Regression Accuracy : {}".format(logreg.score(X_test_tfidf ,


test["senti"])))

check["log"] = logreg.predict(checktfidf)## Predicting Sentiment for Check which was


Null values for rating

words = count_vect.get_feature_names()

feature_coefs = pd.DataFrame(
    data=list(zip(words, logistic.coef_[0])),
    columns=['feature', 'coef'])

feature_coefs.sort_values(by="coef")

def formatt(x):
    # Map the text labels to 0/1 so they can be compared with predicted probabilities.
    if x == 'neg':
        return 0
    if x == 0:
        return 0
    return 1

vfunc = np.vectorize(formatt)

cmp = 0

colors = ['b', 'g', 'y', 'm', 'k']

for model, predicted in prediction.items():
    if model not in 'Naive':
        false_positive_rate, true_positive_rate, thresholds = roc_curve(test["senti"].map(vfunc), predicted)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        plt.plot(false_positive_rate, true_positive_rate, colors[cmp], label='%s: AUC %0.2f' % (model, roc_auc))
        cmp += 1

plt.title('Classifiers comparison with ROC')
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

test.senti = test.senti.replace(["pos" , "neg"] , [True , False] )

keys = prediction.keys()

for key in ['Multinomial', 'Bernoulli', 'LogisticRegression']:
    print(" {}:".format(key))
    print(metrics.classification_report(test["senti"], prediction.get(key) > .5, target_names=["positive", "negative"]))
    print("\n")

def test_sample(model, sample):
    # Vectorize a single review, predict its sentiment and print the class probabilities.
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = model.predict(sample_tfidf)[0]
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f" % (result.upper(), prob[0], prob[1]))

test_sample(logreg, "The product was good and easy to use")

test_sample(logreg, "the whole experience was horrible and product is worst")

test_sample(logreg, "product is not good")

check.head(10)

from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)

mpl.rcParams['font.size']=12 #10

mpl.rcParams['savefig.dpi']=100 #72

mpl.rcParams['figure.subplot.bottom']=.1

def show_wordcloud(data, title=None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=300,
        max_font_size=40,
        scale=3,
        random_state=1  # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(15, 15))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(senti["Summary_Clean"])
show_wordcloud(senti["Summary_Clean"][senti.senti == "pos"] , title="Postive Words")

show_wordcloud(senti["Summary_Clean"][senti.senti == "neg"] , title="Negitive words")
