VEHICULAR SPEED RESTRAINTS IN SPECIAL
ZONES
PROJECT REPORT
submitted by
HRISHIC ULLAS (20618035)
ANANDHURAJ (20618015)
AMAL MALIK KA (20618011)
MUHAMMED JABIR KN (20618041)
SUMINLAL CM (20618061)
ASLAM PA (20618021)
to
Cochin University of Science and Technology
in partial fulfillment of the requirements for the award of B.Tech Degree in
Safety and Fire Engineering
Division Of Safety And Fire Engineering
School Of Engineering
Cochin University Of Science And Technology
MAY 2022
DIVISION OF SAFETY AND FIRE ENGINEERING
SCHOOL OF ENGINEERING
COCHIN UNIVERSITY OF SCIENCE AND
TECHNOLOGY
CERTIFICATE
Certified that this report entitled “VEHICULAR SPEED
RESTRAINTS IN SPECIAL ZONES” is the report of project
presented by
Hrishic Ullas (20618035)
AnandhuRaj (20618015)
Amal Malik KA (20618011)
Muhammed Jabir KN (20618041)
Suminlal CM (20618061)
Aslam PA (20618021)
during 2018-2022 in partial fulfillment of the requirements for the
award of the Degree of Bachelor of Technology in Safety and Fire
Engineering of Cochin University of Science and Technology.
PROJECT GUIDE
Nithya Gopinath
Assistant Professor
Division of Safety and Fire
Engineering
School of Engineering
Acknowledgement
We take this opportunity to thank the supreme being, the source
of all knowledge whose blessings are our guiding light in any venture we
take up. We are short of words to express our gratitude to Ms. Jithi
P V, our project guide who guided us and helped us constantly with her
inputs and suggestions, without which we could not have implemented
this project the way it works today. We are highly indebted to Ms.
Faheena and Mr. Damodharan V, our project coordinators, for their
constant supervision and support in completing this project. We also
express our heartfelt gratitude to Mr. Pramod Pavithran for all kinds of
encouragement extended to us. We also thank Dr. David Peter, Head of
Division, Computer Science and Engineering. Our sincere appreciation
to all the staff members who have helped us during this course of work.
Sneha Jayasankar (20218092)
Muhammed Aslam (20218057)
Safvan M P (20218080)
Sagar Krishna (20218081)
Declaration
We, Sneha Jayasankar , Muhammed Aslam , Sagar Krishna and Safvan
M P hereby declare that this major project is the record of authentic work
carried out by us during the academic year 2021 - 2022 and has not been
submitted to any other University or Institute towards the award of any
degree.
Abstract
Personality and character have major effects on certain behavioural
outcomes. As technology advances, more people are using social media
such as Facebook, Twitter and Instagram. With the growing popularity of
social media, behaviour types are now easier to group and study. Knowing
how users behave on social networks makes it possible to analyse
similarities between behaviour types and to predict what users post,
comment on, share and like on social networking sites. Although
some researchers collect demographic information on users’ gender on
Facebook, others on Twitter do not. This lack of demographic data,
which is typically available in more traditional sources such as surveys,
has created a new focus on developing methods to infer these traits
as a means of expanding Big Data research. Therefore, the purpose of
this project is to collect data from previous research and to analyse the
methods used. We aim to
• Use demographic, psychographic, attitudinal, and behavioural data
to refine consumer targets and inform engagement strategies.
• Bring consumer targets to life with vivid and complete profiles,
including lifestyles.
• Generate geographic nuances of consumers in all of the nation’s media
markets, including purchase behaviour, attitudes, lifestyles and much
more.
• Develop a unique, media-neutral machine learning metric for
detecting consumer trends regarding planning, buying and selling.
To demonstrate this project, our main focus is to compare the consumer
behaviour of Millennials, Gen Z and other generations, as well as
gender-based consumer behaviour. The project will also demonstrate the
trend of each individual category. The results of the analysis would
provide businesses with information on social media users’ purchasing
behaviour and sentiment, allowing them to adopt more appropriate
strategies to enhance their competitiveness.
Contents
1 Introduction
2 Literature Review
2.1 Newsvendor model with strategic consumers
2.2 Multi-stage sales model
2.3 Consumer’s psychological satisfaction
3 Proposed System
3.0.1 Problem Definition
3.0.2 Existing Systems
3.0.3 System Proposed
4 System Study
4.1 Software Requirements Specification
4.1.1 Purpose
4.1.2 Product Scope
4.1.3 Product Perspective
4.1.4 Product Features
4.1.5 Project Overview
4.1.6 Functional Requirements
4.1.7 Non Functional Requirements
4.2 Hardware and Software Requirements
4.2.1 Hardware Requirements
4.3 Software Requirements
5 System Design
5.1 Introduction
5.2 Data Preprocessing
5.2.1 Apriori Algorithm Data Preprocessing
5.2.2 Extraction
5.2.3 Transform
5.2.4 Load
5.3 Algorithms
5.3.1 Kmeans Algorithm
5.3.2 Apriori Algorithm
5.3.3 Latent Dirichlet Allocation
6 System Implementation
6.1 Platform
6.2 Apriori Analysis
6.3 Segmentation Analysis
6.4 Topic Analysis
7 Result
8 Conclusion
Citations
List of Figures
5.1 Cleaned table data
5.2 Work Flow Chart
6.1 Product count per itemset distribution
6.2 Cluster interpretation
6.3 Customer Segmentation based on product category
6.4 LDA analysis
6.5 Topic Interpretation based 1 star and 5 star reviews
Chapter 1
Introduction
With the development of the economy, competition in product sales
is becoming more and more intense. In order to occupy more market
share, retailers adopt discounts and other promotional measures to at-
tract consumers to buy. This phenomenon is widespread in physical
stores. E-commerce platforms such as Amazon and Flipkart have also
increased their trading volume by various price-reduction promotions on
festivals. Retailers’ promotions make consumers more “smart”. Con-
sumers will search the price information of products in different periods,
and predict the price trend to determine their purchase time. In order
to maximise their own utility, such strategic consumers usually choose
to wait until mark-down promotions. In revenue management research,
consumers’ strategic behaviour has become a common economic
phenomenon that has attracted extensive attention in industry and
academia. Understanding consumer behaviour is essential for a
company to find success for its current products as well as new product
launches. Every consumer has a different thought process and attitude
towards buying a particular product. If a company fails to understand
the reaction of a consumer towards a product, there are high chances of
product failure.
Due to the changing fashion, technology, trends, living style, dispos-
able income, and similar other factors, consumer behaviour also changes.
A marketer has to understand the factors that are changing so that the
marketing efforts can be aligned accordingly. “Know Your Customer”
is currently the success mantra for the eCommerce sector as it adopts
innovative ways of tracking the customer’s online journey and serving
customers more efficiently.
With the rapid development of information and network technologies,
online transactions are gradually replacing traditional face-to-face trans-
actions and have become the most common trading method for con-
sumers. It is necessary for e-commerce platforms to provide a channel
for consumers to express their opinions—which are usually called prod-
uct reviews—after completing transactions. When consumers decide the
product they wish to purchase and the site from which they want to
purchase, they tend to retrieve information about alternative products
to help make decisions.
Generally, this information comes from two main sources. The first
source is from sellers who demonstrate their products with descriptions
on platforms. The second source is from the consumers who have bought
similar products, which is known as Word-Of-Mouth (WOM). However,
driven by economic interests, sellers always try to conceal unfavorable in-
formation or even exaggerate and fabricate favorable information about
products. Consequently, the information asymmetry between sellers and
consumers is enhanced until consumers receive real products. Consumers
are often disappointed because the product they receive is different from
their expectations. Fortunately, after receipt confirmation, consumers
have an opportunity to review their transactions from several aspects,
such as product quality or promptness of service. Reviews on most plat-
forms comprise numerical ratings and textual comments. Numerical rat-
ings usually range from 1 to 5 stars. Some platforms even provide more
than one numerical rating interface for different aspects. For instance, at
Tmall, which is the largest B2C ecommerce platform in China, numerical
ratings exist for description matching, service attitude, delivery speed,
and logistics speed. With textual comments, consumers can remark on any
aspect of purchased products and any detail of their trading experience.
Customer reviews are public: the comments posted by customers can be
viewed by anyone on the platform, including sellers and other customers.
Apart from the numerical rating, consumers do not have a unified form
they can use for review. Reviews could be recommendations with ex-
tremely positive attitude or a detailed description of product usage to re-
mind others of its weaknesses. More specifically, owing to different char-
acteristics of consumers—such as character, attitude, experience, expert
knowledge, responsibility, economic profits, social needs, and individual
worth—different consumers can have different preferences on products.
Even on similar products, they can hold divergent opinions. Similarly,
when reading reviews, different consumers get varying levels of insight.
As a result, analytical technologies like Cloud data analytics for eCom-
merce are being deployed to get a deeper understanding of customer be-
haviour. In recent years, consumer behaviour analysis in the eCommerce
industry has emerged as an effective analytical tool for knowing how any
online shopper interacts with the eCommerce website. eCommerce retail-
ers can derive multiple benefits from the insights taken from consumer
behaviour analysis tools. These valuable insights can lead to a more
personalised approach to customer needs that can increase their lifetime
value to the business. Besides that, behavioural analysis in eCommerce
can reduce customer acquisition costs, improve brand recommendations,
and improve the lead generation process.
Chapter 2
Literature Review
This section discusses pre-existing systems in this field.
2.1 Newsvendor model with strategic consumers
Whether strategic consumers decide to buy usually depends on their ex-
pectations of the future price of the products. For instance, it was found
that consumers’ expectations of the future price of a brand play a
crucial role in the decision to buy now or later. This forward-looking
behaviour of consumers has received wide attention in the consumer
behaviour literature. In the field of economics, Coase first studied
strategic consumer behaviour in the durable goods market. The research
showed that if the price of the product is higher than the marginal
production cost, strategic consumers will anticipate the price trend
and wait for the product price to be reduced, which will eventually
lead durable goods monopolists to set the price at the marginal
production cost and obtain zero profit. In the operations management literature
concerning strategic consumers, the studies focus on the impact of con-
sumers’ strategic buying behaviour on retailer’s pricing and inventory
decisions. In market transactions, the demand for products is random,
and retailers need to make optimal decisions under the random needs of
consumers.
The newsvendor model is a basic optimal ordering model under
stochastic demand. Based on the classical newsvendor model, strategic
consumer behaviour was introduced and the rational expectation hypoth-
esis was utilised to study the strategic equilibrium between the retailer
and consumers. Another study examined the pricing strategies of a
budget-constrained seller facing two types of strategic consumers with
different search costs, and proposed three pricing strategies to motivate
all consumers to visit the shop. It found that the selection of the optimal strategy is
independent of the composition of consumers but is dependent on the
seller’s budget level and the difference between the two search costs. In
the model, consumers have different valuations of commodities and take
strategic actions to measure the expected payoffs of immediate purchase
and delayed purchase by analysing the retailer’s optimal inventory deci-
sions. The results show that limited inventory can alleviate the loss of
profits caused by strategic consumers, but it cannot completely eliminate
this negative impact.
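As a point of reference for the classical model mentioned above, the following minimal sketch computes the newsvendor order quantity from the standard critical-ratio rule. It covers only the textbook model without strategic consumers; the function name, the parameter names, and the assumption of normally distributed demand are illustrative and do not come from the studies reviewed here.

```python
from statistics import NormalDist


def newsvendor_order_quantity(price, cost, mean_demand, sd_demand, salvage=0.0):
    """Classical newsvendor quantity, assuming normally distributed demand.

    Illustrative sketch only: strategic consumer behaviour is NOT modelled.
    """
    if not salvage <= cost < price:
        raise ValueError("expected salvage <= cost < price")
    underage = price - cost    # margin lost per unit of unmet demand
    overage = cost - salvage   # loss per unit left unsold
    critical_ratio = underage / (underage + overage)
    # Order up to the demand quantile given by the critical ratio.
    return NormalDist(mean_demand, sd_demand).inv_cdf(critical_ratio)


# Hypothetical numbers: equal underage and overage cost give the median demand.
base_quantity = newsvendor_order_quantity(10.0, 5.0, 100.0, 20.0)
high_margin_quantity = newsvendor_order_quantity(20.0, 5.0, 100.0, 20.0)
```

With equal underage and overage costs the rule orders the median demand; a higher margin pushes the order above it.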
2.2 Multi-stage sales model
In order to alleviate the loss of profits caused by strategic con-
sumer behaviour, retailers need to adjust the price of products with
time and make dynamic pricing of goods. In the literature of opera-
tions management, there have been many studies on dynamic pricing.
Strategic consumer behaviour was first included in dynamic pricing
research on the basis of the maximisation of consumers’ intertemporal
utility, in an analysis of the intertemporal pricing of new products
sold in monopoly markets. Contrary to intuition, it was found that
strategic waiting by customers may sometimes benefit the seller, because
when low-value customers wait, they compete for availability with
high-value customers and thus increase their willingness to pay. The
classical dynamic pricing problem for a single product was also studied,
with the demand intensity on the scale of the external market described
by a stochastic process. The results showed that simple dynamic pricing
rules can perform well for the widely occurring Gaussian distribution.
An inventory system for perishable items with limited replenishment
capacity and inventory-level-dependent demand was also introduced. With
the goal of profit maximisation, the optimal joint dynamic pricing and
replenishment policy is obtained by solving an optimization problem with
Pontryagin’s maximum principle.
2.3 Consumer’s psychological satisfaction
Consumers can predict results when making decisions, but because
of uncertainties, actual results often differ from their expectations.
The gap between actual and expected results leads to different levels
of psychological satisfaction. The literature on psychology and
behavioural decision making has studied this psychological perception
of individuals. According to these theories, when the actual results
exceed their expectations, consumers may feel elation, and their
psychological satisfaction is positive. When the actual results are
worse than their expectations, consumers may be disappointed, and their
psychological satisfaction is negative. Retailers’ pricing and inventory
decisions have been studied for the case of disappointment-averse
consumers and perishable products. The results showed that, over
repeated sales cycles, the interaction between retailers’ inventory and
pricing decisions and consumers’ purchase behaviour forms a reference
distribution of information that leads to optimal decisions in
equilibrium.
From the above models it was identified that the primary factors that
affect consumer behaviour are
• Psychological
– This is considered to be the most important factor that affects
consumer behaviour. Traits like perception, motivation, person-
ality, beliefs and attitude are important to decide why a consumer
would buy a product.
• Personal
– These are characteristics that are applicable to individuals and
may not relate to other people in a group. These factors can
include age, occupation, financial situation and lifestyle.
• Social
– Social characteristics play an important role in consumer
behaviour and can include family, communities and social
interaction. These factors are difficult to assess while preparing
marketing plans.
• Geographical
– The location of consumers also plays a role in how they purchase
products. For example, a person living in warmer weather would
be less likely to purchase winter clothing compared to someone
living in temperate climates.
The four key metrics for measuring consumer behaviour were found
to be
• Average Session Time
The average session time is a good indicator of how long consumers
spend on your website. Longer session times indicate a higher
likelihood that the session will end with an online purchase. Online
shoppers interested in your store tend to spend more time browsing
through products, reading product reviews, and interacting with your
customer support executives.
• Pages per visit
This metric measures the number of pages (or content items) viewed
by shoppers in each visit. Based on this metric, you can identify the
most (or least) viewed website pages and work on their respective
strengths and weaknesses. For the least viewed pages, a pages-per-visit
measure of less than 2 would be insufficient for a conversion.
Conversely, a high pages-per-visit measure helps boost click-through
rate (CTR) and overall conversions.
• Traffic flow
The online traffic flow is a useful metric for monitoring how
consumers move (or navigate) through your store pages. Traffic flow
can indicate the online store pages that are most attractive to
shoppers. With this data, you can design the best navigation path for
shoppers to reach your most popular products or product categories.
Traffic-related metrics can also help simplify the checkout process
and deliver the right marketing message to the target audience.
• Customer Loyalty
Whether it is earned through free shipping or freebies, customer
loyalty is an excellent yardstick for observing customer behavior.
Customer loyalty metrics can help you track the buying habits of each
shopper and understand the merchandise preferred by each demographic
group. Apart from tailoring the shopping experience of loyal customers,
customer loyalty programs can be used to monitor purchase behavior
after the sale is made, for example whether the customer has made an
additional purchase (complementing the previous purchase) or has
made a product return.
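The first two metrics above can be computed directly from a session log. The sketch below is a minimal illustration; the log layout (session id, pages viewed, duration in seconds) and the sample values are hypothetical and are not taken from the project data.

```python
from statistics import mean

# Hypothetical session log: (session_id, pages_viewed, duration_seconds)
sessions = [
    ("s1", 1, 30),
    ("s2", 5, 420),
    ("s3", 3, 180),
    ("s4", 7, 600),
]


def average_session_time(log):
    """Mean time spent per session, in seconds."""
    return mean(duration for _, _, duration in log)


def pages_per_visit(log):
    """Mean number of pages viewed per session."""
    return mean(pages for _, pages, _ in log)


avg_time = average_session_time(sessions)
avg_pages = pages_per_visit(sessions)
```

Sessions whose page count falls below the threshold of 2 mentioned above could then be flagged for closer inspection.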
Chapter 3
Proposed System
3.0.1 Problem Definition
To gain insight into how big data can offer ecommerce companies
greater opportunity to drive sales and revenue, we leveraged machine
learning algorithms to analyse Amazon customer data. The main aim of
this project is to answer the three questions below:
– Can we identify customer segments based on the purchased prod-
uct categories to better target marketing campaigns?
– Can we identify which products a customer will most likely pur-
chase together?
– Can we extract key topics within product reviews to help com-
panies analyse and interpret customer feedback?
The objective of our project is to investigate eCommerce consumer
behaviour.
– We chose these research topics because we believe data analysis
is key to strategic and well-informed decision making.
– We believe it helps target customer segments to upsell products,
increase conversion rates, grow sales, and better target marketing
campaigns.
3.0.2 Existing Systems
As one of the most traditional and influential ways of communication,
word of mouth (WOM) allows consumers to exchange opinions
and information about products, brands, and services. Researchers
have demonstrated that WOM not only influences consumers’ choices
and decisions, but also affects their expectations and perceptions of
products. Existing literature has also shown that there is a close
relationship between WOM and sales: a good product triggers positive
WOM, and positive WOM in turn promotes sales. With the development of
the internet, electronic WOM, mainly known as online reviews, can be
produced more easily by consumers and spreads faster and wider than
ever before; consequently, reviews have a more powerful influence on
consumers and markets. To uncover the influence mechanism of reviews,
research efforts are mainly devoted to three streams.
The first stream is to explore motives of consumers to be en-
gaged in online reviews articulation. Existing research on traditional
WOM could provide valuable insights because a new online form may
not change its function to be a potential driver of consumer actions.
Dichter identified four dimensions of WOM involvement—product,
self, others, and message. Compared with Dichter, Hennig-Thurau
et al. suggested consumers’ desire for social interaction, desire for
economic incentives, concern for other consumers, and self-worth en-
hancement. Sundaram et al. found that expressing positive feelings
can be triggered by positive experience. Such feelings can also be
classified into the aspect of self in Dichter’s work. Anderson
developed a utility-based model of WOM to predict whether WOM
activity should increase as satisfaction or dissatisfaction
increases. The findings support the proposed asymmetric U-shape
for the relationship between consumer satisfaction and WOM activity.
Specifically, extremely dissatisfied consumers engage in more WOM
activity than satisfied ones.
Chung and Darke suggested that consumers are more likely to review
products that are relevant to their self-concept than products that
are utilitarian, and that consumers tend to exaggerate the benefits
of self-relevant products. Tong et al. modeled a set of motivating and
inhibiting factors that could influence consumers’ intention to
contribute product reviews. Their experiments show that perceived
satisfaction is associated with helping others and influencing merchants,
the probability of enhancing self-image, and perceived executional costs.
In addition, the presence of an economic reward mechanism can
promote the contribution of reviews under certain conditions.
The second stream studies the factors that influence the perceived
helpfulness of reviews to consumers. In related studies, reviewer
identity, review valence, product type, and characteristics of the
review text (including depth, subjectivity, readability, and spelling
errors) are commonly examined. Review valence generally covers
positive, negative, and neutral experiences. In a typical five-star
numerical rating, one star indicates an extremely negative view, five
stars indicate an extremely positive view, and three stars indicate a
moderate view.
In general, products are divided into two basic types, search
products and experience products, based on whether consumers can
easily obtain accurate, measurable, objective attributes and
information about the product prior to purchase. Search products such
as cameras, cellphones, and computers can be assessed from detailed
published specifications before personal use. Experience products
such as books, music, and food can only be truly understood through
real experience.
Review depth, which is usually represented by the number of words
in the comment text, can increase information diagnosticity and help
consumers obtain information without additional search cost. Longer
reviews are believed to contain more information than short reviews.
Incremental information can increase the confidence of decision makers
and is regarded as more convincing. In addition, a longer review
implies greater involvement of the reviewer and a greater likelihood of
a detailed description of how and where the product was used in a
specific context. Review depth has a different effect on review
helpfulness for different types of products. Specifically, review depth
has a greater influence on the helpfulness of reviews of search goods
than of experience goods. Prior studies consider more than one factor
simultaneously. A study conducted by Mudambi and Schuff indicated that
review extremity, review depth, and product type affect the helpfulness
of reviews on Amazon. Product type can moderate the effect of review
extremity. Review depth has a different effect on the
helpfulness of a review in different types of products.
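The two review features discussed above, depth and valence, can be derived from a raw review as in the minimal sketch below. The function names are hypothetical, and the handling of two-star and four-star ratings is an assumption made here for illustration; the text above only fixes the meaning of one, three, and five stars.

```python
def review_depth(text):
    """Review depth as the number of whitespace-separated words."""
    return len(text.split())


def review_valence(stars):
    """Map a 1-5 star rating to a coarse valence label.

    Assumption: 2 stars count as negative and 4 stars as positive;
    the source only describes 1, 3 and 5 stars.
    """
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Hypothetical review used only to exercise the two helpers.
depth = review_depth("Great camera, sharp photos")
valence = review_valence(5)
```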
Pan and Zhang revealed that review valence and review length have
positive effects on the helpfulness of a review; however, product type
moderates these effects. In addition, they established a curvilinear
relationship between reviewer innovativeness and helpfulness. Zhang et
al. discovered that consumers with promotion consumption goals perceive
positive reviews as more persuasive than negative ones. In contrast,
consumers with prevention consumption goals perceive negative reviews
as more persuasive. Liu et al. developed models and algorithms to
predict the helpfulness of a review using three important factors:
reviewer expertise, the writing style of the review, and the timeliness
of the review. Some works find that extreme ratings are more influential
than moderate ones.
Ghose and Ipeirotis explored several aspects of review text and
reviewers, including text-level features such as subjectivity levels,
readability, and spelling errors, and reviewer-level features such as
average usefulness of past reviews and self-disclosed identity measures
of reviewers, and performed an econometric analysis to reveal relations
between these aspects and helpfulness. Korfiatis et al. investigated
the interplay between review helpfulness, review score, and review text,
which are quantified by conformity, understandability, and
expressiveness. They also found that review readability has a greater
effect on helpfulness than review length, and that extremely helpful
reviews received higher scores than those deemed less helpful. By
comparing review data from four national Amazon sites (USA, UK, Germany,
and Japan), Danescu-Niculescu-Mizil et al. noted national differences
between the sites in terms of review variance and review helpfulness.
The third stream explores a review’s influence on providers with
respect to marketing activities, and on consumers with respect to
purchase decision-making. The WOM mechanism lets consumers share
opinions and experiences on products, companies, and services.
Specifically, e-WOM is a lower-cost and more effective channel that
enables consumers to express their opinions and be heard. Opinions on
products can influence consumers’ decision-making and subsequently
product sales. In order to persuade consumers to buy their products,
companies have to pay attention to WOM to understand consumers’
reactions to their products, such as attitudes to certain product
attributes, or different market demand situations in different regions.
According to Simon’s classic work, a decision process contains
three distinct phases: intelligence, design, and choice. The first
phase is to recognize the problem and gather information about it.
The second phase is to structure the problem, develop criteria, and
identify alternative solutions. The last phase is to make a final
decision by choosing the alternative that best meets the criteria.
Subsequently, the decision-maker evaluates how well the process was
executed using the feedback obtained from the results, which can inform
the intelligence phase of future decisions. Based on Simon’s
decision-making model, Kohli et al. explained that factors such as
consumers’ cost and time savings lead to consumer satisfaction with the
online channel, where the wealth of reviews emerges. Their research
gives guidance on attracting and retaining buyers by providing
capabilities or tools, such as feature and price comparison and item
recommendation, that support the buyer’s decision-making process. Kotler
and Keller divided the purchase decision process into six stages: need
recognition, information search, evaluation of alternatives, purchase
decision, purchase act, and post-purchase evaluation. According to this
division, posting comments or reviews on the website belongs to the last
stage of a purchase. However, the effects extend far beyond this stage:
reviews read by other consumers influence their next purchase decision
process. Therefore, websites should help consumers explore valuable
information more easily so that they can make better purchase decisions.
Liu compared the dynamic patterns of WOM during movie pre-release
and opening week using WOM data from the Yahoo Movies website and found
that the volume of movie WOM significantly explains box office sales,
both in aggregate and in the early weeks. This finding highlights the
need to observe and respond to WOM communications actively, especially
during the early weeks after release, when most of the revenue
is produced.
In addition, reviews can reveal a product’s advantages and
disadvantages compared with competitors’ products, allowing product
quality to be improved in time. Purchase intention can also be
extracted from overall review valence to predict future sales. Based on
such predictions, manufacturers can arrange production plans and supply
chain management flexibly to satisfy consumers’ demand, lower costs,
and maximize profitability [33]. The existing research described above
provides valuable insights on review analysis; however, most of it
investigates reviews at a high level. Consumers’ individual preferences
or characteristics on different product categories are seldom discussed
and examined. Therefore, this system focuses on reviews on different
categories generated by different reviewers to reveal some
meaningful implications, which offer guidance to sellers, producers,
and consumers themselves to improve their online activities.
System Proposed
We intend to develop a study that helps gain insight into how big data
can offer ecommerce companies greater opportunity to drive sales and
revenue. We shall use machine learning algorithms to analyse Amazon
customer data. For the initial implementation, reviews from a number
of product categories were selected from the list of publicly available
datasets. The intended findings are items frequently bought together,
key topics within product reviews, and customer segmentation based
on product categories. Insights are obtained from different
analyses, such as K-means cluster analysis and Eclat analysis, and these
are used to draw conclusions from the data.
Chapter 4
System Study
This section presents the SRS, system objectives and software tool
requirements.
4.1 Software Requirements Specification
A software requirements specification (SRS) is a document that captures
a complete description of how the system is expected to perform.
Purpose
In this document the software requirements are presented. The
purpose of this document is to lay out the functional and non-functional
requirements for the application. It also provides a detailed overview of
our product. The document clearly describes each one of the product’s
parameters and goals. It also briefly outlines the project’s target audi-
ence, its user interface, and software requirements. With this document,
a clear idea can be obtained on every feature and function of the appli-
cation and the software requirements necessary to create the same.
Product Scope
This project was developed to gain insight into how big
data can offer ecommerce companies greater opportunity to drive sales
and revenue, using machine learning algorithms to analyze Amazon
customer data. The project is also exploratory in the sense
that methods beyond those used in the test system will be researched
and evaluated. The product provides the basic functionality of any consumer
behaviour analysis system, with added accessibility, control over
the process flow, and greater accuracy. A website is also provided
so that users can access the system from any device, irrespective of the
operating system used, and understand the analysis their company data
is subjected to and the insights gained.
Product Perspective
The goal of developing the system is to provide a platform for analysing
how big data can offer ecommerce companies greater opportunity to
drive sales and revenue. The project aims to equip its users with a safe
and secure platform that enables companies to analyse their user
data, including content such as user reviews, items bought by
consumers, and items frequently bought together.
Product Features
Our research had 3 goals:
• Develop a list of items frequently bought together
• Create customer segments based on product categories purchased
• Build a model to identify main topics included in the customer re-
views of a product
Project Overview
The project has the following functions:
Functional Requirements
• Items frequently bought together
– Purpose: For mining frequent item sets and relevant association
rules from relational databases
– Input: Preprocessed datasets of customer reviews as a proxy to
customer purchase
– Output: Frequency of Apriori Recommendations By Number of
Products
• Customer segmentation based on product categories
– Purpose: To identify customer segments based on the purchased
product categories to better target marketing campaigns
– Input: Preprocessed datasets of customer reviews as a proxy to
customer purchase
– Output: customer segmentation based on product categories
• Map main topics from customer reviews
– Purpose: To extract key topics within product reviews to help
companies analyze customer feedback
– Input: Preprocessed datasets of customer reviews as a proxy to
customer purchase
– Output: customer reviews topic analysis
Non Functional Requirements
• Performance.
• Accuracy: The analysis and results generated should be accurate.
• Efficiency: The software should be able to analyse real-time data and
give accurate results efficiently.
4.2 Hardware and Software Requirements
Hardware Requirements
The system does not require any particular hardware. The web interface works
on any mobile device or computer, provided it has an active internet
connection.
4.3 Software Requirements
• Front End
– HTML
– CSS
– Javascript
∗ HTML / CSS / JavaScript: HTML provides the basic structure
of sites, which is enhanced and modified by other technologies
like CSS and JavaScript. CSS is used to control presentation,
formatting, and layout. JavaScript is used to control the
behaviour of different elements.
• Backend
– Python 3.7
∗ Python is an interpreted high-level programming language for
general-purpose programming.
– Postgres
∗ PostgreSQL is a powerful, open source object-relational database
system that uses and extends the SQL language combined with
many features that safely store and scale the most complicated
data workloads.
– PySpark
∗ PySpark is the Python API for Apache Spark, an open source,
distributed computing framework and set of libraries for real-
time, large-scale data processing.
– Scikit-learn
∗ Scikit-learn is a free software machine learning library for the
Python programming language. It features various classifica-
tion, regression and clustering algorithms including support
vector machines, random forests, gradient boosting, k-means
and DBSCAN, and is designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.
Chapter 5
System Design
5.1 Introduction
Designing requires careful planning and thinking on the part of the system designer. Designing
a system means planning how its various parts are going to achieve the desired goal. After the
software requirements have been analysed and specified, design is the first of the three technical
activities. The system design describes how the functional and non-functional requirements are addressed.
It describes design goals and considerations, provides a high-level overview of the system architecture,
and describes the data design associated with the system, as well as the human-machine interface and
operational scenarios.
5.2 Data Preprocessing
There are three steps in general needed for preprocessing the data in this system. They are as follows:
• Extract
• Transform
• Load
Apriori Algorithm Data Preprocessing
Extraction
For development, 8 different product segments from the Amazon data were selected.
Each segment has the same data schema.
Transform
• Load Amazon product segment into PySpark DataFrame
• Perform preliminary cleaning
– Drop unnecessary columns
– Filter data to present only verified purchases
– Drop the verified purchase column after filtering
• Create Apriori Analysis dataframe
– Drop additional unnecessary columns in preparation for Apriori Analysis
• Repeat this process with various product segments.
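The Transform steps above can be illustrated with a minimal sketch. It is a plain-Python stand-in and not the PySpark code used in the project: rows are modelled as dictionaries rather than a PySpark DataFrame, and the column names (customer_id, product_id, star_rating, verified_purchase) and the "Y" flag value are assumptions, since the schema is not reproduced in this report.

```python
# Plain-Python stand-in for the Transform steps. The project performs these
# steps on PySpark DataFrames; all column names below are assumptions.
KEEP_COLUMNS = ("customer_id", "product_id", "verified_purchase")


def clean_segment(rows):
    """Drop unneeded columns, keep verified purchases, then drop the flag."""
    cleaned = []
    for row in rows:
        # Drop unnecessary columns by keeping only the ones we need.
        slim = {name: row[name] for name in KEEP_COLUMNS}
        # Filter the data to verified purchases only ("Y" is an assumed value).
        if slim["verified_purchase"] != "Y":
            continue
        # Drop the verified purchase column after filtering.
        del slim["verified_purchase"]
        cleaned.append(slim)
    return cleaned


# Hypothetical sample rows for one product segment.
sample_rows = [
    {"customer_id": "c1", "product_id": "p1", "star_rating": 5, "verified_purchase": "Y"},
    {"customer_id": "c2", "product_id": "p2", "star_rating": 1, "verified_purchase": "N"},
    {"customer_id": "c1", "product_id": "p3", "star_rating": 4, "verified_purchase": "Y"},
]
cleaned_rows = clean_segment(sample_rows)
```

The same function can be applied to each product segment in turn, which corresponds to the final "repeat" step of the list.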
Load
• Download the Postgres driver that allows Spark to interact with PostgreSQL
• Configure settings for PostgreSQL
• Write the cleaned table into PostgreSQL.
Figure 5.1: Cleaned table data
5.3 Algorithms
Kmeans Algorithm
The Kmeans algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined,
distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries
to make the intra-cluster data points as similar as possible while keeping the clusters as different
(far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between
the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that
cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar)
the data points are within the same cluster.
The Kmeans algorithm works as follows:
1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without replacement.
3. Keep iterating the following steps until there is no change to the centroids, i.e. the assignment of
data points to clusters is no longer changing:
– Compute the sum of the squared distance between data points and all centroids.
– Assign each data point to the closest cluster (centroid).
– Compute the centroids for the clusters by taking the average of all the data points that belong
to each cluster.
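The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the seed and max_iter parameters, and the sample points are all hypothetical, and the report's software requirements list Scikit-learn, whose own K-means implementation would normally be used instead.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Step 2: choose K data points at random, without replacement.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Compute squared distances and assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignment of points to clusters no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

The cluster numbers themselves are arbitrary; only the grouping of points is meaningful.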
Apriori Algorithm
The Apriori algorithm is a sequence of steps followed to find the frequent itemsets in a
given database. This data mining technique applies the join and prune steps iteratively until no
further frequent itemsets can be found. A minimum support threshold is given in the problem or
assumed by the user.
• In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm
counts the occurrences of each item.
• Let there be some minimum support, minsup (e.g. 2). The set of 1-itemsets whose occurrence
satisfies minsup is determined. Only those candidates whose count is greater than or equal
to minsup are taken ahead to the next iteration; the others are pruned.
• Next, the frequent 2-itemsets are discovered. In the join step, the 2-itemset candidates
are generated by combining the frequent 1-itemsets with themselves.
• The 2-itemset candidates are pruned using the minsup threshold. The table now contains only
the 2-itemsets that meet minsup.
• The next iteration forms 3-itemsets using the join and prune steps. This iteration relies on the
antimonotone property: every 2-itemset subset of a frequent 3-itemset must itself meet minsup.
If all 2-itemset subsets of a candidate are frequent, the candidate is kept and its support is counted;
otherwise it is pruned.
• The next step forms 4-itemsets by joining the 3-itemsets with themselves and pruning any candidate
with a subset that does not meet the minsup criterion. The algorithm stops when no further frequent
itemsets can be generated.
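The join and prune iterations described above can be sketched as follows. This is a minimal plain-Python illustration, not the code used in the project; the function name apriori, the sample transactions and the minsup value of 3 are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset found in at least minsup transactions."""
    transactions = [frozenset(t) for t in transactions]
    # First iteration: every single item is a 1-itemset candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the occurrences of each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep only the candidates that meet the minimum support.
        level = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (antimonotone property): drop any candidate that has
        # a (k-1)-item subset which is not frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical baskets, one set of items per customer.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(transactions, minsup=3)
```

With these baskets the loop ends after the 3-itemset level, because the only surviving candidate {bread, milk, diapers} appears in just two transactions.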
Feature Selection:
• Understanding which items are bought by the same customer can increase conversion rates in ecommerce and
drive revenue growth (cross-sell)
• The Apriori algorithm is popular for this type of analysis
• Apriori gives a confidence level for each recommendation, which helps data analysts decide the right
threshold for website recommended products
Benefits:
• Simplest algorithm among association rule learning methods
• Broadly adopted for basket analysis
• Easy to understand and interpret
• Exhaustive: finds all rules with confidence levels
Latent Dirichlet Allocation
LDA assumes that documents are composed of words that help determine the topics, and it maps
documents to a list of topics by assigning each word in the document to different topics. The assignment
is expressed as conditional probability estimates arranged in a word-topic table. The value in each cell
indicates the probability of a word wj belonging to topic tk, where ‘j’ and ‘k’ are the word and topic
indices respectively. It is important
to note that LDA ignores the order of occurrence of words and the syntactic information. It treats
documents just as a collection of words, or a bag of words. Once the probabilities are estimated,
finding the collection of words that represents a given topic
can be done either by picking the top ‘r’ probabilities of words or by setting a threshold for probability
and picking only the words whose probabilities are greater than or equal to the threshold value. For
instance, if we focus on topic-1 and pick the top 4 probabilities, assuming that the probabilities of the
words not shown in the table are less than 0.012, then topic-1 can be represented by its four
highest-probability words. If these words (word-k, word1, word3 and word2)
are respectively trees, mountains, rivers and streams, then topic-1 could correspond to ‘nature’.
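The two ways of reading a topic out of the estimated probabilities can be sketched as follows. The probabilities below are purely illustrative values and do not come from the project's results; the function names are likewise hypothetical. Only the four example words and the 0.012 cut-off are taken from the text above.

```python
# Illustrative word probabilities for one topic (assumed values, not results).
topic_1 = {
    "trees": 0.090,
    "mountains": 0.045,
    "rivers": 0.030,
    "streams": 0.012,
    "prices": 0.004,
    "delivery": 0.002,
}


def ranked_words(word_probs):
    """Word/probability pairs sorted from most to least probable."""
    return sorted(word_probs.items(), key=lambda pair: pair[1], reverse=True)


def top_r_words(word_probs, r):
    """Represent a topic by its r most probable words."""
    return [word for word, _ in ranked_words(word_probs)[:r]]


def words_above_threshold(word_probs, threshold):
    """Represent a topic by all words with probability >= threshold."""
    return [word for word, prob in ranked_words(word_probs) if prob >= threshold]


by_rank = top_r_words(topic_1, 4)
by_threshold = words_above_threshold(topic_1, 0.012)
```

Here both approaches select the same four words, which a reader might label as a 'nature' topic; with other data the two selections can differ.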
Feature Selection
• Topic discovery for customer reviews to gather feedback and identify themes: product qualities
and what has to be improved
• Find ‘relevant’ topics and identify trends
• Topic modelling is an unsupervised approach used for finding and observing groups of words
(called “topics”) in large clusters of texts
Benefits
• Largely used for topic discovery
• Simple to implement
• Runs relatively quickly
• Probabilistic model
Figure 5.2: Work Flow Chart
Chapter 6
System Implementation
6.1 Platform
The entire code was implemented on Google Colab. The UI was created using HTML, CSS and
JavaScript. The data preprocessing was done using PostgreSQL. The final code is built on Windows-based
systems, as they provide more familiarity and a better range of browsers to test.
6.2 Apriori Analysis
Figure 6.1: Product count per itemset distribution
• The histogram depicts the frequency of Apriori associations by itemset size
• The highest number of associations is for 3-product itemsets, with about 5,909 associations
• The lowest number of associations is for 8-product itemsets, with only 9 associations
• The total number of recommendations gathered by the Apriori analysis was 23,590
The Apriori algorithm is used for mining frequent item sets and relevant association rules from
relational databases. It uses the parameters “support” and “confidence”: support is an itemset’s
frequency of occurrence, and confidence is a conditional probability. The goal of the analysis is to
identify items bought together and show them on the ecommerce website to increase cross-sell and
sales.
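As a small worked illustration of the two parameters, the sketch below computes support as the fraction of baskets containing an itemset and confidence as the conditional probability of the consequent given the antecedent. The function names and the sample baskets are assumptions made for this example and are not taken from the project's data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets, one per customer.
baskets = [
    {"watch", "strap"},
    {"watch", "strap", "battery"},
    {"watch"},
    {"battery"},
]
strap_support = support(baskets, {"watch", "strap"})
strap_given_watch = confidence(baskets, {"watch"}, {"strap"})
```

In this toy data the pair appears in half of the baskets, and two of the three customers who bought a watch also bought a strap, so a rule "watch, therefore strap" would have a confidence of about 0.67.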
6.3 Segmentation Analysis
Based on data from eight different product categories (apparel, furniture, music, watches, personal
care, office products, video and video games), the data was consolidated by the product quantities
bought from each category by each customer. The Kmeans algorithm was the machine learning model used, since
it is an unsupervised model that groups data into clusters or, in this case, customer segments.
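The consolidation step can be pictured as building one row per customer with the quantity bought in each of the eight categories; such rows are the kind of input a clustering algorithm expects. The sketch below is a plain-Python illustration under assumptions: the function name, the (customer, category) input format and the sample purchases are hypothetical, and the project itself lists PySpark for its data processing.

```python
from collections import Counter, defaultdict

# The eight product categories named in this section.
CATEGORIES = [
    "apparel", "furniture", "music", "watches",
    "personal care", "office products", "video", "video games",
]


def build_customer_vectors(purchases):
    """Map each customer to a list of purchase counts, one entry per category."""
    counts = defaultdict(Counter)
    for customer_id, category in purchases:
        counts[customer_id][category] += 1
    # A Counter returns 0 for categories the customer never bought from.
    return {
        customer_id: [per_category[name] for name in CATEGORIES]
        for customer_id, per_category in counts.items()
    }


# Hypothetical purchase records as (customer, category) pairs.
purchases = [
    ("c1", "furniture"),
    ("c1", "furniture"),
    ("c2", "music"),
    ("c1", "video"),
]
vectors = build_customer_vectors(purchases)
```

Each resulting list has eight entries in the order of CATEGORIES, so customers who buy from similar categories end up close to each other when the lists are treated as points.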
Figure 6.2: Cluster interpretation
Based on the interpretations:
• 44% of customers buy products from multiple categories, making this the largest segment
• Cluster 2 (furniture): priority for a marketing campaign, since there are current customers from
Cluster 4 who buy furniture and other products
• Create additional campaigns for clusters 0, 1 and 3: give discounts in other product categories
to incentivise product mix and sales
Figure 6.3: Customer Segmentation based on product category
6.4 Topic Analysis
For this analysis, one specific product was selected. The Latent Dirichlet Allocation (LDA) machine
learning model was used to identify topics within the customer reviews. To better interpret the data,
the analysis was split into bad (1-star) and good (5-star) reviews.
Figure 6.4: LDA analysis
The bubble charts represent the output of the analysis. Each bubble represents a different
topic; the larger the bubble, the higher the percentage of reviews in the corpus that are about that
topic. The blue bars show the overall frequency of each word in the corpus; if no topic is selected, the
blue bars display the most frequently used words. The red bars give the estimated number of times a
given term was generated by a given topic. The further the bubbles are from each other, the more
different they are.
• Similar words appear between topics for good and bad reviews, with different connotations
• The analysis can be biased by the person interpreting the outputs; it is hard to extract the meaning of topics
• It is hard to identify different topics because of similar words and feedback; recommended only for a
superficial analysis
• The corpus needs to be improved to combine words for a more accurate analysis
Figure 6.5: Topic Interpretation based 1 star and 5 star reviews
Chapter 7
Result
The project succeeded in building a reliable and accessible consumer behaviour analysis software.
The application can develop a list of items frequently bought together, create customer segments
based on product categories purchased, and build a model to identify the main topics included in the
customer reviews of a product. The web application can be accessed from any device, and no credentials or
subscriptions are required, ensuring ease of access for all users and companies.
Companies can now analyse their user data with better accuracy and design marketing campaigns as
required from the insights gained.
Chapter 8
Conclusion
Whether a company is large or small, keeping track of the decision-making behavior and buying
habits of its customers will ensure that the company stays on track in terms of generating experiences
that boost its customer base and retain customers on a long-term basis.
As we steer towards a data-driven era, platforms will find it difficult to sustain themselves if they fail
to keep track of the preferences of their customers. In the absence of behavioral analytics, teams are
stuck using insufficiently detailed demographic data and so-called vanity metrics. This analysis will
allow companies to stay informed and ensure that their customers encounter a satisfying experience.
References
[1] R. Jacobson and C. Obermiller, “The formation of expected future price: A reference price for
forward-looking consumers,” J. Consum. Res., vol. 16, no. 4, pp. 420–432, 1990.
[2] T. Ye and H. Sun, “Price-setting newsvendor with strategic consumers,” Omega, vol. 63, pp.
103–110, Sep. 2016.
[3] Y. Song and X. Zhao, “A newsvendor problem with boundedly rational strategic customers,” Int.
J. Prod. Res., vol. 55, no. 1, pp. 228–243, Jan. 2018.
[4] D. Besanko and W. L. Winston, “Optimal price skimming by a monopolist facing rational
consumers,” Manage. Sci., vol. 36, no. 5, pp. 555–567, May 2011.
[5] Q. Liu and S. Shum, “Pricing and capacity rationing with customer disappointment aversion,”
Prod. Oper. Manage., vol. 22, no. 5, pp. 1269–1286, Sep. 2013.
[6] O. Baron, M. Hu, S. Najafi-Asadolahi, and Q. Qian, “Newsvendor selling to loss-averse consumers
with stochastic reference points,” Manuf. Service Oper. Manage., vol. 17, no. 4, pp. 456–469, 2015.
[7] J. Quan, X. Wang, C. Li, and D. Xia, “Quantity commitment strategy and effectiveness analysis
with disappointment aversion strategic consumers,” IEEE Access, vol. 7, pp. 67094–67106, 2019.