E-COMMERCE REVIEW ANALYSIS
A minor project report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
May, 2020
CERTIFICATE
This is to certify that the minor project report entitled “E-COMMERCE REVIEW ANALYSIS”, submitted by ABHISHEK KUMAR and ANSHU PRIYA in partial fulfillment of the requirements for the award of the Bachelor of Technology degree in Electronics and Communication Engineering of the Jaypee Institute of Information Technology, Noida, is an authentic work carried out by them under my supervision and guidance. The matter embodied in this report is original and has not been submitted for the award of any other degree.
Dated:
DECLARATION
We hereby declare that this written submission represents our own ideas in our own words and that, where others’ ideas or words have been included, we have adequately cited and referenced the original sources. We also declare that we have adhered to all principles of academic honesty and integrity and have not misrepresented, fabricated or falsified any idea/data/fact/source in our submission.
ABHISHEK KUMAR (Enrollment: 9916102193)
ANSHU PRIYA (Enrollment: 9916102068)
ABSTRACT
In the last few years, electronic marketplaces, that is, online places for shopping and trading, have been growing at a fast rate. To sustain this growth, sellers ask their buyers and loyal customers what they think about the products they purchase. As a result, huge numbers of ratings and reviews are published almost every day, leaving prospective customers in a dilemma: they cannot easily decide whether a commodity is worth their money. Product managers face a similar difficulty, since the large volume of comments and reviews generated daily makes them hard to analyze. The project built here is a solution to this problem. We categorize comments and reviews by the sentiment class they belong to; for example, “good” is a positive word and “bad” is a negative word, and aggregating such cues yields a polarity for each review. All of this is explained in detail in the chapters that follow, where a study of Amazon reviews is used to demonstrate how the project works.
ACKNOWLEDGEMENT
It is our privilege to express our sincerest regards to our project supervisor, Dr. Yogesh Kumar, for his valuable inputs, guidance, encouragement and whole-hearted support throughout this project.
We deeply express our sincere thanks to our Head of Department for encouraging and allowing us to present the project on the topic “E-COMMERCE REVIEW ANALYSIS” at our department premises for the partial fulfillment of the requirements leading to the award of the B.Tech degree. We also take this opportunity to thank all those who helped us during the course of this work.
We pay our respect and love to our parents and all other family members and friends for their love and encouragement throughout our career. Last but not the least, we express our gratitude to all our well-wishers.
Signature:
Names:
ABHISHEK KUMAR (9916102193)
ANSHU PRIYA (9916102068)
TABLE OF CONTENTS
Certificate
Declaration
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables
CHAPTER 1- INTRODUCTION
CHAPTER 2- LITERATURE SURVEY
CHAPTER 3- INFORMATION ABOUT THE DATASET
CHAPTER 4- METHODS
4.1 Naive Bayes
4.1.1 Multinomial Naive Bayes
4.2 K-nearest Neighbor
4.3 Linear Support Vector Machine
4.4 Long Short-Term Memory
4.4.1 Core Idea Behind LSTM
4.5 GloVe (Global Vectors for Word Representation)
CHAPTER 5- RESULTS AND DISCUSSION
CHAPTER 6- CONCLUSION AND FUTURE SCOPE
6.1 Conclusion
6.2 Future Scope of Sentiment Analysis
REFERENCES
APPENDIX
LIST OF FIGURES
Fig. 1.2: AI/ML Investment Areas
Fig. 3.1: Few of the entries of the dataset of Consumer Reviews of Amazon Products
Fig. 4.2: Representation of the support vectors
Fig. 4.3: Before and after applying the K-NN method
Fig. 4.4: Representation of LSTM
Fig. 5.2: Comparison of one product based on positive and negative reviews
LIST OF TABLES
1. Names of ML chatbots
2. Artificial Intelligence Assistants
3. Performance of different models
CHAPTER 1: INTRODUCTION
In the last few years, a huge amount of effort has been put in by researchers and students into understanding the emotions, feelings and opinions contained in textual data and resources. From the data and knowledge available all over the internet, we can see that there has been a surge in this field recently. This research area is really large, and the part of it we study here is known as sentiment analysis or opinion mining. What this means is that, given a bunch or cluster of words as a dataset, we can systematically study customers’ opinions and emotions: how they think about a certain product and its features, whether the product makes them happy or not, and whether they are going to buy it again in the future. This technique has diverse uses.
Let us take an example. Most of the time, businesses want to know the opinions and reviews of their customers about the products and services they are provided with. They want to know whether their services are making their customers happy, in order to make the necessary changes. On the other hand, potential buyers also want to know the thoughts of buyers who have already bought the product they are interested in. Hence, this helps both communities to a great extent. It is also worth noting that researchers use this data to perform deep analyses of the trends followed in e-commerce, which could ultimately make life easier for all of us. [2]
However, it is not an easy task, as there are millions of websites on the internet with a lot of products to sell, which means an even larger number of reviews and opinions. Every such site contains a large amount of opinionated text that is not easy to decode, and maintaining a dataset of so many texts is more difficult still. An average human would have a lot of trouble identifying such sites and extracting all the information and reviews they contain. Apart from that, teaching a computer to recognize sarcasm is a really difficult and challenging task; no matter how smart AI has become, computers still cannot think like a living being. [5]
So in this project, the challenge is to differentiate whether a comment made on a certain good or commodity falls into the helpful (positive) bucket or into the denial (negative) bucket. Once this differentiation is defined, we need to build supervised learning models to polarise the large number of comments. Along with traditional algorithms such as Naive Bayes, K-nearest neighbor and linear support vector machines, these models also incorporate deep learning techniques such as convolutional neural networks and recurrent neural networks. These models are then analyzed and their accuracies compared with one another, which helps us better understand the polarized texts and how they describe a product. [3]
1.2 E-Commerce
The sale and purchase of commodities or products over the web is termed electronic business or web-based business; another expression for it is e-commerce. Instances of e-commerce destinations are Flipkart, eBay, Amazon and so on. Web-based business provides unique features such as non-cash payment, 24x7 service availability and improved sales. As per a report by Kleiner Perkins Caufield Byers, a venture-capital firm, Amazon India is eventually going to take the whole country by storm in the field of online or electronic market business. [1]
1.3 Machine Learning in E-Commerce
These days, web-based businesses are trying to give buyers the same feeling as shopping in a supermarket: even though they are buying things online, they do not have to go through any inconvenience and can easily look for commodities in no time. With the help of ML, buyers are offered customizable things such as a luxury item or a gift for their loved ones.
There are many software packages with techniques that help in recognizing a picture. With techniques like these, people who browse online stores are able to search using pictures they already have: searches which were earlier done with words are now being done with images. The technique matches the provided photo with content on the internet and shows the results to the viewer with very little effort. Pinterest, a website and phone application, uses a similar method: people can upload an image, the site matches it against the content it hosts and displays the results, and the viewers are then free to choose whatever they like.
The artificial intelligence revolution is used by HR divisions in many ways: conference meetings, online connections such as LinkedIn, screening activities, and discovering interns and students looking for jobs are a few examples. Hence, the workload of HR is reduced to a great extent with the help of AI techniques and algorithms, as the most qualified candidate gets the job with very little effort.
Voice Powered Search
In online shopping, text-based search is being replaced by voice search. Voice recognition accuracy has become much better than it used to be; practically, about 69% of requests to the Google Assistant are now phrased as conversational speech. A few smart gadgets offer voice-controlled features: one such example is Apple’s HomePod, which is under the control of Siri, and Echo, a product of Amazon fueled by Alexa, is another such model. Voice searches through Alexa can be used to place an order for a product to be shipped from Amazon. As per a study by ComScore, half of all searches were expected to be voice-based by 2020.
Conversational Commerce
Platforms that support a chat feature can help buyers purchase a product through texting; this is natural language processing at work. Electronic exchanges at large companies are being encouraged with the help of chatbots. One example is TacoBot, which uses Slack as its messenger, and another is the Kik messenger bot used by H&M. Popular companies have joined in as well: Tommy Hilfiger, for example, launched a fashion chatbot on Facebook Messenger during the 2016 Fashion Week in New York, becoming the very first company to use a messenger application for selling a mixture of goods. Instances of a few applications constructed with the help of chatbots are mentioned here:
For example (from Table 1, Names of ML chatbots): in the education domain, Duolingo runs on both the iPhone Operating System (iOS) and Android.
This feature has helped a lot of individuals learn how to make intelligent choices when it comes to spending money on what they shop for. Let us take an example: Ping was a chatting facility developed by Flipkart. The facility really worked up to the mark in providing aid to shoppers, but had to be closed recently. Ping was fuelled by AI so that customers were provided with the items they desired in no time. Alexa, which is developed by Amazon and is its home assistant, is also a development of AI and another assistant that provides customers with virtual shopping aid: customers receive trending fashion tips and experiences, all delivered in a confirmed human voice. Mona is another such collaborator.
Virtual Assistant
E-commerce virtual assistants are programs that act smartly when it comes to keeping up business supplies and other work or services that might be technical; certain favours can be done for personal use with their help as well. A chatbot is also a kind of virtual assistant. Recently, Lenovo has also brought its own virtual assistant to customers, which is expected to compete with Google’s and Cortana. Another artificial-intelligence-powered deep learning assistant is CAVA, which has many features, including face and voice recognition, that help in the management of data and other information.
Shopper comments and ratings are very essential these days, as businesses rely on customer trust when it comes to buying things electronically. Recent studies state that 90 percent of shoppers who responded said that positive reviews on a certain product influenced them to buy the product they were interested in. But there is a high chance of fake reviews, which can affect the buying decisions of a potential customer. Even this issue can be managed with the help of AI: Amazon is one of the large organizations that uses artificial intelligence to fight fake reviews on its products. Amazon’s AI system makes sure that comments from verified shoppers, and comments which other shoppers mark as genuine, are given more importance.
AI combined with a client-relationship (CRM) system is an effective solution for managing sales. This AI-empowered method permits a CRM framework to answer shoppers’ questions, tackle their problems and issues, and even create and open new doors for the business group. Clients will no longer see inappropriate options while purchasing on the web.
There are many different fields where AI can be used in business; some of them are stated here. Figure 1.2 shows the AI/ML investment areas.
Fig. 1.2: AI/ML Investment Areas
Avanade conducted a study, a survey of almost 500 business and IT heads from around the world, which revealed that they expect about a 33% increase in earnings with the help of intelligent technologies. Avanade is a joint venture between Accenture and Microsoft that supports Windows’ Cortana along with different frameworks that provide predictive analytics and data-based insights. Another instance is Apptus eSales, which is built to anticipate customers’ wants and needs beforehand. Big data and machine learning are combined in this software to find out which products might look interesting to a potential buyer as he or she searches online or is advised about a product. In electronic shops powered by Apptus eSales, ML can predict and spontaneously display commodities by looking at the phrases a buyer has searched for. Even a company as big as Google showed its interest in machine learning when it bought DeepMind, an AI company. Below are some of the main features of artificial intelligence that can be utilized by web-based businesses:
o Data Mining
o NLP (Natural Language Processing)
o Machine Learning
Table 1.2 shows the different business areas and the AI tools in each area.
Business Intelligence & Analytics: Ayasdi, DataRobot, Sundown, Arimo
CHAPTER 2: LITERATURE SURVEY
Some other papers are also summarized briefly here. In one of the papers we studied, the researchers ran surveys from which opinions were extracted; the outcome was then broken down so that a plan of action could be developed. A truly huge precision was claimed by the researchers. Multinomial Naive Bayes (MNB) was primarily utilized, with support vector machines used alongside as the fundamental classifiers.
In another method, a proposal was made to expand the present work in the area of natural language processing. Some important classifiers and methods were utilized to determine whether a given review is positive or negative.
AI adoption can be seen in numerous areas. A few examples follow:
Gaming: Machines can now compete with people in games thanks to machine learning. ML execution can be seen in many strategic games, for example poker, chess, tic-tac-toe, and so on. Machines are endowed with the capacity to consider numerous positions based on heuristic knowledge. Deep Blue, created by IBM, was the very first computer to play chess at this level; another model is AlphaGo by Google.
Expert Systems: Expert systems are created, with the help of AI, to take care of difficult problems in a specific domain. The purpose of expert systems is to advise, foresee results, propose alternative arrangements and help humans in decision making. The three pillars of an ES are the knowledge base, the inference engine, and the user interface. Instances of expert systems are CALEX, GERMWATCHER, LEY, AGREX and so forth.
Medicine and Healthcare Services: The machine learning approach has shown its application in many sectors, one of which is healthcare. The approach is applied in medical diagnosis, prediction of risk, and drug discovery. For instance, in the treatment of skin cancer, Sebastian Thrun’s laboratory uses AI-based calculation to recognize skin cancer with high precision.
Robots with smart brains: Sensors installed in robots, for example for knocks, pressure, heat, light and temperature, can capture physical information, and the robots carry out the directions given by a human. They have proficient processors and colossal memory to make shrewd choices and display knowledge. Smart robots are additionally able to learn from their mistakes. [10]
CHAPTER 3: INFORMATION ABOUT THE DATASET
Also, to give a concise overview of the dataset, we have plotted the distribution of the ratings. The dataset is partitioned into 5 sections on a scale of 1 to 5. The classes are not evenly balanced: the 1st and 2nd classes contain comparatively few data values, whereas the 5th class contains more than 20,000. [7] One example entry from the dataset reads: Review text: 'The item here has not disappointed yet. The kids love using it. In fact, even I admire the ability to easily monitor and control the content my kids are exposed to.'
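A minimal sketch of how this rating distribution can be reproduced with pandas is given below; the file name matches the one loaded in the Appendix, and the column name follows the dataset description later in this chapter.

import pandas as pd
import matplotlib.pyplot as plt

# Load the consumer-review dataset (same file as in the Appendix).
df = pd.read_csv("1429_1.csv")

# Count how many reviews fall into each of the five rating classes
# and plot the uneven distribution described above.
counts = df["reviews.rating"].value_counts().sort_index()
counts.plot.bar()
plt.xlabel("Rating (1 to 5)")
plt.ylabel("Number of reviews")
plt.show()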
3.3 Features
Two sorts of features are used in this project, of which the first falls under the traditional technique. Essentially, a vocabulary is built in this step: every regular word in the corpus is listed out. The occurrence threshold is set to 6, that is, a word must appear at least 6 times to be kept, which yields a vocabulary of 4,223 words from the dataset. Each review is then transformed into a vector in which every entry symbolizes the frequency of the corresponding word. We also varied the threshold, and hence the length of the dictionary; the result showed that extending the dictionary length does not have a noticeable impact on the precision.
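A minimal sketch of this count-based feature construction, assuming scikit-learn's CountVectorizer: on the full dataset, min_df=6 (the occurrence threshold above) leaves the 4,223-word vocabulary, while the toy corpus below uses min_df=1 just to show the mechanics.

from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the cleaned review texts; the project applies this
# to the full "reviews.text" column of the dataset.
reviews = [
    "the product is good and works great",
    "bad product, very disappointed",
    "good value, the kids love this product",
]

vectorizer = CountVectorizer(min_df=1)  # min_df=6 on the full dataset
X = vectorizer.fit_transform(reviews)   # one count vector per review
print(vectorizer.get_feature_names_out())
print(X.toarray())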
The other sort of feature we utilized is the 50-d GloVe dictionary pretrained on Wikipedia. For this part, we essentially exploit the meaning of each word: we represent each review by the mean of the 50-d GloVe vectors of every single word that makes up the comment. [9]
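A sketch of this mean-vector representation, assuming the standard pretrained file glove.6B.50d.txt from the Stanford GloVe release has been downloaded:

import numpy as np

# Load the 50-d GloVe vectors pretrained on Wikipedia.
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def review_vector(text, dim=50):
    # Represent a review by the mean of the GloVe vectors of its words;
    # words missing from the dictionary are simply skipped.
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(review_vector("the kids love using it"))  # one 50-d feature vector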
Figure 3.1 is a snapshot of the entries of the dataset containing the reviews. Not all the entries can be shown here because of the huge size of the file. It consists of the user name, brand, rating, purchase information and much more. Some of the columns of the dataset are explained below:
reviews.username contains the unique usernames used by the buyers on Amazon.
reviews.title contains the title of the comment made on a specific product.
reviews.text contains the text of the actual comment made on the purchased product.
reviews.sourceURLs contains the URLs of the specific product on which the comments were made.
reviews.rating contains the rating given to the product bought by the customer; ratings are usually accompanied by reviews.
manufacturer contains, as the name suggests, the manufacturer of the product; in this case it will be Amazon.
categories contains the category in which the reviewed and rated product lies.
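As a quick illustration, the columns described above can be inspected directly with pandas (a sketch; the file name matches the Appendix):

import pandas as pd

df = pd.read_csv("1429_1.csv")

# Peek at the review-related columns described above.
cols = ["reviews.username", "reviews.title", "reviews.text",
        "reviews.rating", "reviews.sourceURLs", "manufacturer", "categories"]
print(df[cols].head())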
Fig. 3.1: Few of the entries of the dataset of Consumer Reviews of Amazon Products
CHAPTER 4: METHODS
4.1 Naive Bayes
Naive Bayes assumes that the features of a review are conditionally independent given its class, so the posterior probability of a class y for a review x = (x_1, ..., x_n) is

$p(y \mid x_1, \ldots, x_n) \propto p(y) \prod_{i=1}^{n} p(x_i \mid y)$   (4.1)

For our model to work nicely, we additionally fused Laplace smoothing into it. The prediction of the model is given by the following equation:

$\hat{y} = \arg\max_{y} \; p(y) \prod_{i=1}^{n} p(x_i \mid y)$   (4.2)
4.1.1 Multinomial Naive Bayes
With the first method of representing review texts, the input is a vector of non-negative whole numbers (word counts), and p(x_i | y) is modeled with a multinomial distribution. With the second method of representing review texts, using the GloVe word dictionary, the inputs are real-valued, hence p(x_i | y) was modeled with a Gaussian distribution. [4]
The Laplace-smoothed estimate of the class-conditional probabilities is

$\hat{p}(x_i = t \mid y = a) = \dfrac{N_{ta} + 1}{N_a + |V|}$   (4.3)

where $N_{ta}$ gives the number of times term t appears in the training set T in a review of class a, $N_a$ is the total number of term occurrences in class a, and $|V|$ is the vocabulary size.
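A minimal sketch of this classifier with scikit-learn, on toy data rather than the project's actual pipeline; alpha=1.0 corresponds to the Laplace smoothing of equation (4.3).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled reviews (1 = positive, 0 = negative).
texts = ["good product works great", "bad quality very disappointed",
         "great value the kids love it", "terrible waste of money"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)      # count features, as in equation (4.1)
clf = MultinomialNB(alpha=1.0)    # alpha=1.0 -> Laplace smoothing, eq. (4.3)
clf.fit(X, labels)

# Prediction implements the argmax of equation (4.2).
print(clf.predict(vec.transform(["good product"])))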
The linear SVM tries to find the weight vector w that satisfies the separability constraint and also solves the maximum-margin problem:

$\min_{w, b} \; \tfrac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^{\top} x_i + b) \ge 1 \;\; \text{for all } i$   (4.4)

The following figure shows the representation of the support vectors.
It uses a technique known as the kernel trick to transform the data, and based on these transformations it finds an optimal boundary between the possible outputs. [3]
Figure 4.2, shown below, represents the support vectors, which are the coordinates of individual observations; a Support Vector Machine finds the frontier (hyperplane/line) that best separates the two classes.
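A sketch of both variants compared later in Chapter 5, using scikit-learn's LinearSVC and the RBF-kernel SVC; the data here is a random stand-in for the review feature vectors.

import numpy as np
from sklearn.svm import SVC, LinearSVC

# Random stand-in for the review feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Linear SVM: finds the maximum-margin hyperplane of equation (4.4).
linear_svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# RBF-kernel SVM: the kernel trick maps the data implicitly and an
# optimal boundary is found in the transformed space.
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

print(linear_svm.score(X, y), rbf_svm.score(X, y))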
Fig 4.3: Before and after applying K-NN method
When K-NN is applied, suppose there are two categories, i.e., Category X and Category Y, and we have a new data point x1: which of these classes does the point belong to? We need a K-NN calculation to take care of this sort of problem. With the assistance of K-NN, we can easily determine the category or class of a particular data point.
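A sketch of this with scikit-learn's KNeighborsClassifier; k=5 is the variant that performed best in Chapter 5, and the points below are toy stand-ins for the two categories.

from sklearn.neighbors import KNeighborsClassifier

# Two toy categories (X and Y in the text), as 2-d points.
points = [[1, 1], [1, 2], [2, 1],      # Category X
          [6, 6], [6, 7], [7, 6]]      # Category Y
labels = ["X", "X", "X", "Y", "Y", "Y"]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(points, labels)

# The new data point x1: K-NN assigns it the majority class among
# its 5 nearest neighbours.
print(knn.predict([[2, 2]]))  # -> ['X']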
All recurrent neural networks have a kind of chain connectivity, a chain of repeating modules of neural network. In standard RNNs, the repeating module has an exceptionally straightforward structure, for example one solitary tanh layer.
LSTMs likewise have this chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting in a very particular way.
Fig. 4.4: Representation of LSTM
In figure 4.4 above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, such as vector addition, and the yellow boxes are learned neural network layers. Merging lines denote concatenation, while a forking line signifies its data being copied, with the copies going to different locations.
The cell state can be said to be a bit like a conveyor belt. It runs straight down the entire chain, with only a few linear interactions, so it is exceptionally simple for data to stream along it unaltered.
The figure also shows that the LSTM has the ability to remove or add data to the cell state, carefully regulated and balanced by structures called gates.
Gates are a way of optionally letting data through. They are made of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer gives numbers in the range of zero and one, depicting how much of each component should be let through: a zero signifies “do not let anything enter,” while a one signifies “let everything enter!” Gates are required to secure and control the cell state, and an LSTM requires three of them.
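A minimal sketch of an LSTM sentiment classifier in Keras, mirroring the setup above; the layer sizes and sequence length here are illustrative assumptions, not the exact configuration used in the project.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 4223   # vocabulary size from Section 3.3

# Embedding -> LSTM -> sigmoid output for positive/negative polarity.
model = Sequential([
    Embedding(vocab_size, 50),        # 50-d learned word vectors
    LSTM(64),                         # gated recurrent layer described above
    Dense(1, activation="sigmoid"),   # probability that the review is positive
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Random stand-in for padded, integer-encoded reviews
# (batch of 32, each padded to 100 tokens).
X = np.random.randint(1, vocab_size, size=(32, 100))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, verbose=0)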
An easy instance is considered here, through which we can see how some features and characteristics of meaning can be drawn out straightforwardly from co-occurrence probabilities. Let us take two words a and b that display a particular attribute we are interested in; for instance, say the concept of thermodynamic phase is what we are interested in, and consider a = ice and b = steam. How these words are related can be determined by examining the ratio of their probabilities of co-occurrence with various probe words c. For a word that is ice-related but not steam-related, say c = solid, the ratio P_ac/P_bc can be expected to be very large. Again, for a word that is steam-related but not ice-related, say c = gas, the ratio is expected to be quite small. There might be words c, say water or heat, that could be related to both ice and steam, or to neither of them; in such cases the ratio should be close to one. Compared with the raw probabilities, this ratio can distinguish relevant words (solid and gas, for example) from irrelevant words (water and heat) to a much greater extent, and it also learns to differentiate between the two relevant words themselves. Hence, from the conditions above, it can be deduced that the actual starting point for learning word vectors should be the ratios of co-occurrence probabilities and not the probabilities themselves. Noting that the ratio P_ac/P_bc depends on three words, a, b and c, the most general form of the model is

$F(w_a, w_b, \tilde{w}_c) = \dfrac{P_{ac}}{P_{bc}}$   (4.6)
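A small numeric sketch of this ratio test; the probabilities below are illustrative values, not measured co-occurrence statistics.

# Hypothetical co-occurrence probabilities P(c | word) for a = "ice",
# b = "steam" and several probe words c (illustrative numbers only).
P_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
P_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

for c in P_ice:
    ratio = P_ice[c] / P_steam[c]   # the ratio P_ac / P_bc of equation (4.6)
    print(f"{c:8s} P_ice/P_steam = {ratio:6.2f}")
# solid -> large (ice-related), gas -> small (steam-related),
# water/fashion -> near 1 (related to both, or to neither)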
CHAPTER 5: RESULTS AND DISCUSSION
The complete dataset of 34,627 comments and texts has been divided into three parts. The training set has size 21,000, which contributes 60% of the dataset; the test set has size 6,813, equal to 20% of the dataset; and the validation set has size 6,814, the remaining 20%.
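A minimal sketch of this 60/20/20 split with scikit-learn; the fixed random seed and the dropping of unlabeled rows are our assumptions, since the report does not state them.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("1429_1.csv").dropna(subset=["reviews.text",
                                              "reviews.rating"])

# First carve out the 60% training portion, then split the remaining
# 40% evenly into test and validation sets (20% each).
train, rest = train_test_split(df, train_size=0.6, random_state=42)
test, val = train_test_split(rest, train_size=0.5, random_state=42)
print(len(train), len(test), len(val))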
Here we implemented several methods: the Multinomial NB method, SVM with an RBF kernel, KNN with k = 4, 5 and 6, Long Short-Term Memory, and SVM with a linear kernel, all using the 4,223-d input features representing the review text.
KNN-5 performs better than the other two KNN models. In a similar fashion, the SVM with a linear kernel is slightly better than the RBF-kernel SVM. The linear SVM shows an overfitting problem, which can be seen from the remarkable gap between its training set accuracy and its test set accuracy. When it comes to test set accuracy, the LSTM gives the best performance.
Using the 50-d input features from the GloVe dictionary, we ran the Gaussian NB method, the linear-kernel SVM, Long Short-Term Memory, and KNN with k = 4, 5 and 6. Here again, KNN-5 performs better than the two other K-nearest neighbor variants. Resampling the data was also tried on the Long Short-Term Memory model, but the test accuracy unfortunately showed no improvement because of the overfitting problem. And yet again, out of all the models, the LSTM gives the best results.
Table 5.1 below gives in detail the training and test accuracy of all the models.
Table 5.1: Performance of different models (training and test accuracy)
We found that, in general, the models using the conventional count-based input attributes perform much better than those using the GloVe-based input attributes. We also found that Long Short-Term Memory provided the most precise analysis compared with the other models used here.
Hence we come to the conclusion that the classification model for a sentiment analysis system needs to be chosen with great care and precision, because this decision has a huge impact on the accuracy of the work and the final result. With the help of overall sentiment and count-based metrics, we can obtain customer feedback on products and organizations. Companies these days hold on to the power of data, but to extract the most important insights from it, one also needs to hold on to the weightage provided by AI and deep learning.
Word clouds have diverse usage in a lot of fields. Some of them are:
1. Trendy headlines on social media: To classify and organize tweets under sections which are in demand, we can extract the top words out of the review or tweet texts that users send out and use them in the trending section.
2. Hyped topics in the news: After analysing the words, headings and texts of a variety of news articles and reports, we can find the trending and top words among them and thus learn which news topics are most hyped in a city, a country, or the entire world.
3. Website navigation: Many websites on the internet are driven by keywords or hashtags. We can create a word cloud which then helps users jump directly to any subject of interest, be it shopping or gaming, keeping the content relevant for them.
The two figures given below are the word clouds (a positive word cloud as well as a negative one) generated from the dataset used in this project.
CHAPTER 6: CONCLUSION AND FUTURE SCOPE
6.1 Conclusion
From all the work and results obtained in this project, we concluded that, in terms of complexity, KNN needs much higher computational effort than the Naive Bayes algorithm and the SVM when classifying: in the KNN algorithm, the separation between every evaluation data point and all the training data points must be computed, which requires a lot of time.
We also noticed that the accuracy was not much affected even when the length of the dictionary was increased. One reason might be that when the occurrence threshold of the dictionary was decreased, the dictionary length increased; but the number of reviews we have is less than 40,000, so the dimension of the feature space becomes significantly larger than the number of data points. Hence we conclude that the curse of dimensionality might be an issue here.
We also realized that the method of simply counting words gave much better results than the GloVe-mean method. Why did this happen? The reason might be that the features of individual words are weakened when we take the average, so the separation between different words or reviews is no longer accurate.
We realized that the LSTM gives much better results than the other machine learning approaches used in this project, possibly because of the large number of parameters it can fit to the dataset. From table 5.1 we can see that the training accuracy of the LSTM with GloVe reached an estimated 85.6% after resampling, but the accuracy on the test data is only 65.6%; from this result we can see that the model has overfitted on the resampled data, because many instances were repeated.
6.2 Future Scope of Sentiment Analysis
Sentiment analysis will keep evolving to the point that it understands the feelings, the emotions and how a person thinks about a specific brand or organization. Audiences will then experience a much better response from their favorite brands and organizations, and can get their needs personalized. Organizations can further segment their audience based on how consumers actually feel about the commodities they are provided with, or what they browse on social media, instead of categories based on a customer's age, gender, income or other surface factors. Hence, sentiment analysis is ultimately going to play a huge role in contributing to a better understanding between providers and consumers, and this relation will be strengthened.
REFERENCES
[1] Avaneet Pannu, “Artificial Intelligence and its Application in Different Areas”, International Journal of Engineering and Innovative Technology (IJEIT), Vol. 4, Issue 10, April 2015
[2] Dhiraj Kapoor and R. K. Gupta, “Software Cost Estimation using Artificial Intelligence Technique”, International Journal of Research and Development in Applied Science and Engineering, Vol. 9, Issue 1, February 2016
[4] Ashish A. Dongare, Prof. R. D. Ghongade, “Artificial Intelligence Based Bank Cheque Signature Verification System”, International Research Journal of Engineering and Technology (IRJET), Vol. 3, Issue 1, January 2016
[5] Siddharth Gupta, Deep Borkar, Chevelyn De Mello, Saurabh Patil, “An E-Commerce Website based Chatbot”, International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 6 (2), 2015
[6] Saloni Shukla, Joel Rebello, “Threat of automation: Robotics and artificial intelligence to reduce job opportunities”, http://economictimes.indiatimes.com
[7] Unnati Dhavare, Prof. Umesh Kulkarni, “Natural language processing using artificial intelligence”, International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Vol. 4, Issue 2, March-April 2015
[8] https://www.avanade.com/en/about/avanade/partnerships/accenture-avanade-microsoft-alliance
[9] Dirican, Cüneyt, “The impacts of robotics, artificial intelligence on business and economics”, Procedia - Social and Behavioral Sciences 195 (2015): 564-573
[10] Min-Yuan Cheng, Denny Kusoemo, Richard Antoni Gosno, “Text mining-based construction site accident classification using hybrid supervised machine learning”, Automation in Construction, 2020
[11] http://www.businessinsider.in
[12] http://www.ethesis.nitrkl.ac.in
[13] R. Subash, R. Jebakumar, Yash Kamdar, Nishit Bhatt, “Automatic Image Captioning Using Convolution Neural Networks and LSTM”, Journal of Physics: Conference Series, 2019
[14] http://www.towardsdatascience.com
[15] http://dssresearchjournal.com
APPENDIX
# This Python 3 environment comes with many helpful analytics libraries installed
import re
import string
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
%matplotlib inline

# Load the Amazon consumer-review dataset.
temp = pd.read_csv(r"1429_1.csv")
temp.head()
# 'permanent' is the working frame used below; the original column
# selection step was missing, so a minimal one is assumed here.
permanent = temp[["reviews.rating", "reviews.text"]].copy()
print(permanent.isnull().sum())  # checking for null values
permanent.head()

# Rows with a missing rating are set aside for later prediction ('check');
# rows with a rating become the sentiment-labelled data ('senti').
check = permanent[permanent["reviews.rating"].isnull()]
check.head()
senti = permanent[permanent["reviews.rating"].notnull()].copy()

# A review is labelled positive when its rating is 4 or 5.
senti["senti"] = senti["reviews.rating"] >= 4
senti["senti"].value_counts().plot.bar()
cleanup_re = re.compile('[^a-z]+')

def cleanup(sentence):
    # Lower-case the text and replace anything that is not a letter
    # with a single space.
    sentence = str(sentence).lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    return sentence
senti["Summary_Clean"] = senti["reviews.text"].apply(cleanup)
check["Summary_Clean"] = check["reviews.text"].apply(cleanup)
train=split.sample(frac=0.8,random_state=200)
test=split.drop(train.index)
def word_feats(words):
features = {}
return features
train["words"] = train["Summary_Clean"].str.lower().str.split()
test["words"] = test["Summary_Clean"].str.lower().str.split()
check["words"] = check["Summary_Clean"].str.lower().str.split()
train.index = range(train.shape[0])
test.index = range(test.shape[0])
check.index = range(check.shape[0])
train_naive = []
test_naive = []
check_naive = []
for i in range(train.shape[0]):
    train_naive = train_naive + [[word_feats(train["words"][i]), train["senti"][i]]]
for i in range(test.shape[0]):
    test_naive = test_naive + [word_feats(test["words"][i])]
for i in range(check.shape[0]):
    check_naive = check_naive + [word_feats(check["words"][i])]

# Train the NLTK Naive Bayes classifier on the labelled features.
classifier = NaiveBayesClassifier.train(train_naive)
classifier.show_most_informative_features(5)

# Predict sentiment for the test set.
prediction = pd.DataFrame()  # collects each model's test-set predictions
only_words = test_naive
y = []
for i in range(test.shape[0]):
    y = y + [classifier.classify(only_words[i])]
prediction["Naive"] = np.asarray(y)

# Predict sentiment for the reviews that had no rating.
y1 = []
for i in range(check.shape[0]):
    y1 = y1 + [classifier.classify(check_naive[i])]
check["Naive"] = y1
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from wordcloud import STOPWORDS, WordCloud

# Keep "not" out of the stopword list: it flips sentiment.
stopwords = set(STOPWORDS)
stopwords.remove("not")

# The vectorizer's definition was missing; a minimal one is assumed.
count_vect = CountVectorizer(stop_words=list(stopwords))
tfidf_transformer = TfidfTransformer()

X_train_counts = count_vect.fit_transform(train["Summary_Clean"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_new_counts = count_vect.transform(test["Summary_Clean"])
X_test_tfidf = tfidf_transformer.transform(X_new_counts)

checkcounts = count_vect.transform(check["Summary_Clean"])
checktfidf = tfidf_transformer.transform(checkcounts)

# Multinomial Naive Bayes on the TF-IDF features.
model1 = MultinomialNB().fit(X_train_tfidf, train["senti"])
prediction['Multinomial'] = model1.predict_proba(X_test_tfidf)[:, 1]
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

model2 = BernoulliNB().fit(X_train_tfidf, train["senti"])
prediction['Bernoulli'] = model2.predict_proba(X_test_tfidf)[:, 1]

# Logistic regression baseline (its definition was missing; assumed here).
logreg = LogisticRegression().fit(X_train_tfidf, train["senti"])
prediction['LogisticRegression'] = logreg.predict_proba(X_test_tfidf)[:, 1]

# Inspect which words push a review positive or negative.
words = count_vect.get_feature_names_out()
feature_coefs = pd.DataFrame(
    data=list(zip(words, logreg.coef_[0])), columns=["word", "coef"])
feature_coefs = feature_coefs.sort_values(by="coef")
def formatt(x):
    # Map labels to 0/1 so predictions can be scored against the truth.
    if x == 'neg':
        return 0
    if x == 0:
        return 0
    return 1

vfunc = np.vectorize(formatt)

# ROC curve for every model collected in `prediction`
# (the plotting loop body was missing and is reconstructed here).
from sklearn.metrics import auc, roc_curve
colors = ['b', 'g', 'y', 'm', 'k']
cmp = 0
for model in prediction.columns:
    scores = prediction[model].astype(float)
    false_pos, true_pos, _ = roc_curve(vfunc(test["senti"]), scores)
    plt.plot(false_pos, true_pos, colors[cmp],
             label='%s: AUC %0.2f' % (model, auc(false_pos, true_pos)))
    cmp += 1

plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.show()
def test_sample(model, sample):
    # Score one raw review with a fitted sklearn model (the function
    # wrapper was missing; only its body survived and is kept here).
    sample_counts = count_vect.transform([sample])
    sample_tfidf = tfidf_transformer.transform(sample_counts)
    result = "pos" if model.predict(sample_tfidf)[0] else "neg"
    prob = model.predict_proba(sample_tfidf)[0]
    print("Sample estimated as %s: negative prob %f, positive prob %f"
          % (result.upper(), prob[0], prob[1]))

for key in prediction.keys():
    print(" {}:".format(key))

test_sample(model1, "the kids love using it")  # illustrative sample review
check.head(10)
stopwords = set(STOPWORDS)
mpl.rcParams['font.size'] = 12        # default 10
mpl.rcParams['savefig.dpi'] = 100     # default 72
mpl.rcParams['figure.subplot.bottom'] = .1

def show_wordcloud(data, title=None):
    # Build and display a word cloud from a series of cleaned reviews
    # (the function signature was missing and is reconstructed here).
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=300,
        max_font_size=40,
        scale=3,
    ).generate(str(data))
    fig = plt.figure(figsize=(12, 12))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

show_wordcloud(senti["Summary_Clean"])

# senti["senti"] is boolean, so select with True/False rather than "pos".
show_wordcloud(senti["Summary_Clean"][senti.senti == True], title="Positive Words")
show_wordcloud(senti["Summary_Clean"][senti.senti == False], title="Negative Words")