Seminar Report Maddu Ravindra 19103335 - Ravindra Babu

The document is a report on an internship as a Data Science Intern at Personifwy, detailing the objectives, methodologies, and findings related to customer segmentation using hierarchical clustering. It emphasizes the importance of data science in business, particularly in enhancing customer engagement and revenue generation through targeted marketing strategies. The project analyzes customer data to identify segments based on spending scores and annual income, aiming to improve business outcomes.

Uploaded by

Pritam kumar

REPORT ON INTERNSHIP

DATA SCIENCE INTERN


UNDER THE COMPANY:

“ PERSONIFWY ”
Surakshaa Fairview Apartments, Belathur,
Bengaluru, Karnataka 560067

SUBMITTED BY:

M Ravindra
19103335

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

SCHOOL OF STUDIES OF ENGINEERING & TECHNOLOGY


GURU GHASIDAS VISHWAVIDYALAYA
(CENTRAL UNIVERSITY)
BILASPUR, CHHATTISGARH, INDIA

OCTOBER, 2022

DECLARATION

I hereby declare that all the work presented in this report in the partial
fulfilment of the requirement for the award of the degree of Bachelor of
Technology in Computer Science & Engineering, Institute of Technology,
Guru Ghasidas Vishwavidyalaya, Central University, Bilaspur,
Chhattisgarh, is an authentic record of the work done during the
vocational training under Internshala.

STUDENT NAMES:

M Ravindra (19103335)

ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task
would be incomplete without the mention of the people whose ceaseless
cooperation made it possible, and whose constant guidance and
encouragement crown all efforts with success. I would like to express my
gratitude and appreciation to Liveasy for being my guide, and to all those
who gave me the opportunity to complete this training. I am also deeply
thankful to all the sources from which I have cited information. I do not
know the names of all the people behind them, but I want to acknowledge
their help during my training.

Date:.......-11-2022

STUDENT NAMES:

M Ravindra (19103335)

ABSTRACT
Many customers buy products from the mall. To generate more revenue, the
mall authorities need to attract these customers, which requires a large
amount of capital. Even after advertising, the output is only around
30-40%. This is where customer segmentation comes into the picture.

Customer Segmentation is a popular application of unsupervised learning.
By using this technique, we focus only on the potential customers
(customers whose probability of buying the product is very high). With
this technique, the output is expected to increase drastically, to 90-95%.

Our project aims to build clusters of customers based on their Spending
Score and Annual Income. The algorithm used in this project is
Hierarchical clustering.

CERTIFICATE

TABLE OF CONTENTS

1. Introduction
1.1 About the Company
1.2 Vision
1.3 Mission
1.4 Internship Objectives

2. Data Science
2.1 Definition of Data Science
2.2 Importance of Data Science
2.3 Prerequisites for Data Science
2.4 Benefits of Data Science in Business
2.5 What Is Data Science Used For?
2.6 Applications of Data Science
2.7 Data Science process

3. Mall Customer Segmentation
3.1 General
3.2 Customer Segmentation
3.3 Dataset

4. Proposed Method and Architecture
4.1 Data Science Project Architecture

5. Methodology
5.1 Clustering
5.2 Hierarchical Clustering Segmentation
5.3 Divisive Clustering

6. Implementation and Analysis
6.1 Dendrogram

7. Conclusion
7.1 Annual Income Vs Spending Score Analysis

8. References

1. Introduction

1.1 About the Company

Personifwy is an advanced analytics-based enterprise SaaS platform that helps
organizations drive employee engagement for business success. Personifwy makes
it easy to use workforce experience as a competitive advantage. The 3 main use
cases offered are:
1. Pre-boarding: Offers a dedicated digital engagement assistant to virtually
preboard talent, predict no-shows and gather insights that can improve
offer-to-joining ratios.
2. Onboarding: Understands employee pulse once onboarded and enhances their
journey into the organization through personalised learning and
recommendations in the first 180 days. Offers a dedicated assessment to
comprehend and reduce infant mortality (attrition of newly joined employees).
3. Employee Engagement: Helps (a) understand workforce engagement in real
time and enable employees through personalized learning content, and (b) track
and measure their success by enabling OKRs (Objectives & Key Results), thus
promoting action on insights to reduce attrition for the organization.

1.2 Vision

Our vision is to bring a technology-oriented, career-driven industrial
experience of great value into the aspirant's career.

1.3 Mission

Our core values not only guide our behaviour; they are central to our thinking
and our culture. They are exemplified by everyone at our organization:
● Customer Focus: Our customers are at the centre of everything we do.
● Employee Centricity: We champion employee experience to ensure we
build a workplace for the future.
● Shared Ambition: We learn, we grow, and we win together. We are
collectively accountable and empowered.
● Agility: We embrace innovation, we believe in transformation, we don't fear
change and we value adaptability.

1.4 Internship Objectives

1. To identify the goals and objectives of the company.
2. To learn and apply theoretical knowledge practically in the workplace.
3. To develop interpersonal, managerial, and communication skills.
4. To come up with possible strategies to gain a competitive advantage.
5. To learn about professional ethics and the working of the corporate world.

2. Data Science
2.1 Definition of Data Science
Data science is the process of using tools and techniques to draw actionable
information out of huge volumes of noisy data. It is used for everything
from business decision making to sports analytics to insurance risk assessment.

2.2 Importance of Data Science

Data science plays an important role in virtually all aspects of business operations
and strategies.

For example, it provides information about customers that helps companies create
stronger marketing campaigns and targeted advertising to increase product sales.

2.3 Prerequisites for Data Science

2.3.1 Statistics

Data science relies on statistics to capture data patterns and transform them into
usable evidence, often through the use of complex machine learning techniques.

2.3.2 Programming

Python, R, and SQL are the most common programming languages. To successfully
execute a data science project, it is important to have some level of programming
knowledge.

2.3.3 Machine Learning

Machine Learning, a crucial component of data science, makes accurate forecasts
and estimates possible. A firm understanding of machine learning is needed to
succeed in the field of data science.

2.3.4 Databases

A clear understanding of how databases work, and the skills to manage and
extract data, are a must in this domain.

2.3.5 Modeling

Mathematical models make it possible to calculate and predict quickly, based on
the data that is already known. Modeling also involves determining which algorithm
is best suited to a given problem and how to train these models.

2.4 Benefits of Data Science in Business
● Improves business predictions
● Interpretation of complex data
● Better decision making
● Product innovation
● Improves data security
● Development of user-centric products

2.5 What Is Data Science Used For?

2.5.1 Descriptive Analysis

It helps in accurately displaying the data points so that patterns in the data
become visible. In other words, it involves organizing, ordering, and
manipulating data to produce insightful information about the supplied data.
It also involves converting raw data into a form that is simple to grasp and
interpret.

2.5.2 Predictive Analysis

It is the process of using historical data along with various techniques like data
mining, statistical modeling, and machine learning to forecast future results.
Utilizing trends in this data, businesses use predictive analytics to spot dangers and
opportunities.

2.5.3 Diagnostic Analysis

It is an in-depth examination to understand why something happened. Techniques
like drill-down, data discovery, data mining, and correlations are used to carry it out.
With each of these techniques, multiple data operations and transformations may be
performed on a given data set to discover unique patterns.

2.5.4 Prescriptive Analysis

Prescriptive analysis advances the use of predictive data. It not only foresees what is
most likely to occur but also offers the best course of action for dealing with that
result. It can assess the probable effects of various decisions and suggest the optimal
course of action. It makes use of machine learning recommendation engines,
complex event processing, neural networks, graph analysis, and simulation.

2.6 Applications of Data Science

2.6.1 Product Recommendation

The product recommendation technique can influence customers to buy similar
products. For example, a salesperson at Big Bazaar is trying to increase the store's
sales by bundling products together and giving discounts. So he bundles
shampoo and conditioner together and gives a discount on them. As a result,
customers buy them together for a discounted price.

2.6.2 Future Forecasting

It is one of the most widely applied techniques in Data Science. Weather forecasting
and other forecasts of the future are made on the basis of various types of data
collected from various sources.

2.6.3 Fraud and Risk Detection

It is one of the most logical applications of Data Science. Since online transactions
are booming, there is a real risk of your data being stolen. For example, credit card
fraud detection depends on the amount, merchant, location, time, and other
variables. If any of them looks unnatural, the transaction is automatically canceled,
and the card is blocked for 24 hours or more.

2.6.4 Self-Driving Car

The self-driving car is one of the most successful inventions in today's world. We
train the car to make decisions independently based on previous data. In this
process, we can penalize the model if it does not perform well. The car becomes more
intelligent with time as it learns from its real-time experiences.

2.6.5 Image Recognition

When you want to recognize images, data science can detect an object and
classify it. The most famous example of image recognition is face recognition: if you
tell your smartphone to unlock, it will scan your face. First, the system
detects the face, then classifies it as a human face, and after that, it decides
whether the phone belongs to the actual owner or not.

2.6.6 Speech-to-Text Conversion

Speech recognition is the process by which a computer understands natural
language. We are quite familiar with virtual assistants like Siri, Alexa, and
Google Assistant.

2.6.7 Healthcare

Data Science helps in various branches of healthcare, such as medical image analysis,
development of new drugs, genetics and genomics, and providing virtual assistance
to patients.

2.6.8 Search Engines

Google, Yahoo, Bing, Ask, etc. provide us with a lot of results within a fraction of a
second. This is made possible by various data science algorithms.

Fig.1 Applications of Data Science

2.7 Data Science process

2.7.1 Obtaining the data

The first step is to identify what type of data needs to be analyzed. This data
then needs to be exported to an Excel or a CSV file.

2.7.2 Scrubbing the data

This step is essential because, before you can analyze the data, you must ensure
it is in a perfectly readable state, with no mistakes and no missing or wrong values.

2.7.3 Exploratory Analysis

The data is analyzed by visualizing it in various ways and identifying
patterns to spot anything out of the ordinary. To analyze the data, you must have
excellent attention to detail to identify whether anything is out of place.

2.7.4 Modeling or Machine Learning

A data engineer or scientist writes down instructions for the Machine Learning
algorithm to follow, based on the data that has to be analyzed. The algorithm
uses these instructions iteratively to arrive at the correct output.

2.7.5 Interpreting the data

In this step, you uncover your findings and present them to the organization. The
most critical skill here is the ability to explain your results.

3. Mall Customer Segmentation

3.1 General

Over the years, increasing competition between businesses and the availability of
large-scale historical data have resulted in the extensive use of data mining
techniques to discover important and strategic information hidden in
organizations' data. Data mining is the process of extracting logical information
from a dataset and presenting it in a human-accessible way for decision support.
Data mining techniques draw on areas such as statistics, artificial intelligence,
machine learning and database systems. Data mining applications include, but are
not limited to, bioinformatics, weather forecasting, fraud detection, financial
analysis and customer segmentation.

The aim of this report is to identify customer segments in a commercial business
using a data mining method. Customer segmentation is the division of the customer
base of the business into groups called customer segments, such that each customer
segment consists of customers who share similar market characteristics.

These distinctions are based on factors that can directly or indirectly influence the
market or business, such as product preferences or expectations, locations, behavior
and so on. Customer segmentation is important because it allows a business to:
● customize marketing plans that are appropriate for each segment of its customers;
● support business decisions in risky situations, such as credit relationships with
its customers;
● identify the products associated with each segment and manage demand and supply;
● reveal interdependence and interaction between customers, between products, or
between customers and products that the business may not be aware of;
● predict customer decline (churn) and which customers are most likely to have
problems;
● raise further market research questions and provide clues to finding solutions.

Two features are considered in the clustering: the customer's Annual Income and
Spending Score. From the dataset, five customer groups or categories are formed
and labeled as follows: cluster_1, cluster_2, cluster_3, cluster_4, cluster_5.

3.2 Customer Segmentation

Finding the clusters of potential customers of the mall, and from them the
appropriate measures to increase the mall's revenue, is one of the prevailing
applications of unsupervised learning.

For example, a group of customers may have a high income but a low spending score
(amount spent in the mall). Based on the analysis, we can try to convert such
customers into potential customers (whose spending score is high) by using
strategies like better advertising, accepting feedback and improving the quality of
products.

To identify such customers, this project analyses the data and forms clusters based
on criteria that are discussed in the following sections.

3.3 Dataset

The dataset, 'Mall_Customers.csv', consists of 5 columns, of which two are used in
this project: Annual Income (k$) and Spending Score (1-100). Both features are
numeric.

Fig.2 Snapshot of Dataset

After selecting these two features, the size of the dataset is (200, 2), i.e. 200 rows
and 2 columns.
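
The feature selection described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `load_features` and the three sample rows are hypothetical, the `CustomerID` column is included only to show that unused columns are dropped, and only the two feature column names are taken from the report.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Mall_Customers.csv.
SAMPLE_CSV = """CustomerID,Annual Income (k$),Spending Score (1-100)
1,15,39
2,15,81
3,16,6
"""

# The two features named in the report.
FEATURES = ["Annual Income (k$)", "Spending Score (1-100)"]


def load_features(source):
    """Read the CSV and return only the two clustering features as an array."""
    df = pd.read_csv(source)
    return df[FEATURES].to_numpy()


X = load_features(io.StringIO(SAMPLE_CSV))
print(X.shape)
```

With the real file, `load_features("Mall_Customers.csv")` would be expected to return an array of shape (200, 2), matching the size stated in the report.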

4. Proposed Method and Architecture

Fig.3 Data Science Project Architecture

4.1 Data Science Project Architecture

4.1.1 Problem Statement

Customer Segmentation is a popular application of unsupervised learning. Using
clustering, we identify segments of customers in order to target the potential user
base. Businesses divide customers into groups according to common characteristics,
such as interests and spending habits, so they can market to each group effectively.
The task is to use Hierarchical clustering, visualize the Dendrogram, and then
analyze the customers' annual incomes and spending scores.

4.1.2 Data
The size of the dataset is (200, 2), i.e. 200 rows and 2 columns. The dataset
does not contain any NULL or NaN values.

4.1.3 Algorithms
An unsupervised learning algorithm is used in this project to analyze and form
clusters of customers based on their income and spending score features.

4.1.4 Model
A Hierarchical clustering model is used, with hyperparameters such as
n_clusters=5 set through the cluster module of the scikit-learn library.
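
As a rough sketch of what such a model looks like in scikit-learn, the block below fits `AgglomerativeClustering` with n_clusters=5 and Ward linkage. The input data is synthetic and made up for illustration (five well-separated groups standing in for the income and spending score features); it is not the mall dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the (income, spending score) features: five
# well-separated groups of 20 points each. These values are made up.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
X = np.vstack([c + rng.normal(scale=3.0, size=(20, 2)) for c in centers])

# Ward linkage always uses Euclidean distance, so the distance argument
# (named `affinity` in older scikit-learn releases) is left at its default.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = model.fit_predict(X)  # one cluster label (0-4) per row of X
```

The distance argument is omitted on purpose so that the sketch does not depend on a particular scikit-learn version.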

4.1.5 Programming and Environment

Programming Language: Python 3.6

Environment (Libraries and Technologies): Numpy, Pandas, Matplotlib,
Seaborn, Jupyter Notebook, Google Colab.

5. Methodology

The Data Science Methodology aims to answer basic questions in a prescribed
sequence that covers the five main aspects of data science projects. These aspects are:
● From Problem to Approach
● From Requirements to Collection
● From Understanding to Preparation
● From Modelling to Evaluation
● From Deployment to Feedback

In this project, the prescribed sequence is:
● Creating an approach to solve the given problem statement
● Exploring the dataset and obtaining useful insights from it
● Cleaning the dataset by handling NaN values, removing duplicate records, etc.
● Data Visualization, used to obtain important information from the data
● Data Preprocessing, performed to make the data ready to fit the model; this
includes feature scaling, splitting the dataset into features and labels, etc.
● Model Building
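
The cleaning and preprocessing steps listed above could look roughly like the sketch below. It is an assumption-laden illustration: the helper `clean_and_scale` and the sample values are hypothetical, and standardization (zero mean, unit variance) is used as one common form of feature scaling, since the report does not say which scaling was applied.

```python
import numpy as np
import pandas as pd


def clean_and_scale(df):
    """Drop rows with NaN values and duplicate records, then standardize each column."""
    cleaned = df.dropna().drop_duplicates()
    values = cleaned.to_numpy(dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (values - mean) / std


# Hypothetical raw data with one missing value and one duplicated record.
raw = pd.DataFrame({
    "Annual Income (k$)": [15, 15, 16, np.nan, 16],
    "Spending Score (1-100)": [39, 81, 6, 77, 6],
})
scaled = clean_and_scale(raw)
print(scaled.shape)
```

Scaling matters for distance-based methods such as hierarchical clustering, because a feature with a larger numeric range would otherwise dominate the Euclidean distance.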

5.1 Clustering

Clustering is one of the most common methods used in exploring data to obtain a
clear understanding of the data structure. It can be characterized as the task of
finding subgroups in the complete dataset, such that similar data points fall into
the same subgroup. A cluster is a collection of data points grouped together because
of some similarity. Clustering is used in market basket analysis to segment
customers based on their behaviours and transactions.

5.2 Hierarchical Clustering Segmentation

➢ Hierarchical Clustering is also called Hierarchical Cluster Analysis (HCA).
➢ Hierarchical Clustering is an unsupervised learning algorithm, and it is
one of the most popular clustering techniques in Machine Learning.
➢ There are 2 types of Hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive Clustering

5.2.1 Agglomerative Clustering
➢ This is a 'bottom-up' approach: each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy.

5.2.2 Steps involved in Agglomerative Clustering

1. Make each data point a single-point cluster, which forms n clusters.
2. Take the 2 closest data points and make them one cluster, leaving n-1 clusters.
3. Take the 2 closest clusters and make them one cluster, leaving n-2 clusters.
4. Repeat step 3 until only one cluster remains.
➢ The dendrogram is known as the memory, or graphical representation, of the
Hierarchical Clustering.
➢ In the dendrogram, the data points are on the x-axis and the Euclidean distance
is on the y-axis.
➢ The distance between clusters can be found in the following ways:
1. Closest point
2. Farthest point
3. Average distance
4. Distance between centroids
➢ The default distance measure is the Euclidean distance.
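
The steps and the four cluster-distance options above can be sketched in plain Python as follows. This is a small teaching sketch, not the library implementation used in the project: the function names and the sample points are hypothetical, and the brute-force search is only suitable for tiny datasets.

```python
import math
from itertools import combinations


def single_link(a, b):
    """Closest point: smallest distance between any pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Farthest point: largest distance between any pair of points."""
    return max(math.dist(p, q) for p in a for q in b)


def average_link(a, b):
    """Average distance over all pairs of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_link(a, b):
    """Distance between the centroids of the two clusters."""
    def centroid(c):
        return [sum(coords) / len(c) for coords in zip(*c)]
    return math.dist(centroid(a), centroid(b))


def agglomerate(points, n_clusters=1, distance=single_link):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # step 1: n single-point clusters
    merges = []  # merge heights, as a dendrogram would record them
    while len(clusters) > n_clusters:
        # steps 2-3: find the closest pair of clusters and merge it
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merges.append(distance(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters, merges


points = [(1, 1), (1, 2), (8, 8), (9, 8), (5, 20)]
clusters, merges = agglomerate(points, n_clusters=3)
print(clusters)
```

Passing `distance=complete_link`, `average_link` or `centroid_link` switches between the other ways of measuring the distance between clusters; running with the default `n_clusters=1` performs every merge, which is the information a dendrogram displays.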

5.3 Divisive Clustering

➢ This is a "top-down" approach: all observations start in one cluster, and splits
are performed recursively as one moves down the hierarchy.

5.3.1 Steps of Divisive Clustering

1. Initially, all points in the dataset belong to one single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively to form new clusters until the desired number of clusters
is obtained.
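
A minimal sketch of these steps is given below. The report does not say how a cluster should be partitioned, so the splitting rule here is an assumption chosen for simplicity: the widest cluster is split by seeding the two halves with its two farthest-apart points. The function names and sample points are hypothetical.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest distance between any two points of a cluster (0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def split(cluster):
    """Split a cluster in two, seeding each half with one of its two farthest points."""
    a, b = max(combinations(cluster, 2), key=lambda pq: math.dist(*pq))
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right


def divide(points, n_clusters):
    """Recursively split the widest cluster until n_clusters are obtained."""
    clusters = [list(points)]  # step 1: everything starts in one cluster
    while len(clusters) < n_clusters:
        widest = max(clusters, key=diameter)  # least compact cluster
        if diameter(widest) == 0:
            break  # nothing left that can be split
        clusters.remove(widest)
        clusters.extend(split(widest))  # steps 2-3: partition and repeat
    return clusters


points = [(1, 1), (1, 2), (8, 8), (9, 8), (5, 20)]
result = divide(points, n_clusters=3)
print(sorted(result))
```

Other splitting rules are possible; the seed-based rule is used here only because it keeps the example short.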

6. Implementation and Analysis

We will find the optimal number of clusters for our model using the Dendrogram. For
this, we use the scipy library, as it provides a function that directly returns
the dendrogram for our data.

Fig.4 Dendrogram

Using this Dendrogram, we now determine the optimal number of clusters for
our model. To do so, we find the longest vertical distance that is not cut by any
horizontal bar. Consider the diagram below:

Fig.5 Dendrogram to find clusters

In the above diagram, we have shown the vertical distances that do not cut their
horizontal bars. As we can see, the 4th distance looks the largest, so
according to this, the number of clusters will be 5 (the number of vertical lines in
this range). We could also take the 2nd distance, as it is approximately equal to the
4th, but we will use 5 clusters because that is the same number we calculated with
the K-means algorithm.

So, the optimal number of clusters is 5, and we will train the model in the next
step using this value.
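
The rule of picking the longest uncut vertical distance can also be approximated numerically, as sketched below with scipy. This is an illustration under assumptions: the helper `clusters_from_largest_gap` is hypothetical, Ward linkage is assumed because it is the linkage named in the Conclusion, and the data is synthetic (three made-up groups), not the mall dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def clusters_from_largest_gap(X):
    """Cut the hierarchy inside the largest gap between consecutive merge heights."""
    Z = linkage(X, method="ward")  # Euclidean distance is the default metric
    heights = Z[:, 2]              # merge heights, in increasing order
    gaps = np.diff(heights)        # vertical distance between successive merges
    i = int(np.argmax(gaps))       # largest gap lies between merges i and i + 1
    threshold = (heights[i] + heights[i + 1]) / 2
    labels = fcluster(Z, t=threshold, criterion="distance")
    return len(np.unique(labels)), labels


# Synthetic data: three well-separated groups of 15 points each (made-up values).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [50, 0], [0, 50]])
X = np.vstack([c + rng.normal(size=(15, 2)) for c in centers])

k, labels = clusters_from_largest_gap(X)
print(k)
```

The same linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the plot. As the discussion above shows, two gaps can be nearly equal, so the numeric rule is a guide rather than a definitive answer.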

7. Conclusion
For this project, the Hierarchical clustering algorithm is used, with the best
results obtained using n_clusters = 5, affinity='euclidean' and linkage='ward'.
The output after applying the clustering algorithm to the dataset is shown below.

Fig.6 Annual Income Vs Spending Score after Clustering

7.1 Analysis

a. High Income, High Spending Score (Cluster 3) - Target these customers by
sending new product alerts, which would increase the revenue collected
by the mall, as they are loyal customers.

b. High Income, Low Spending Score (Cluster 1) - Target these customers by asking
for feedback and advertising the products in a better way, to convert them into
Cluster 3 customers.

c. Average Income, Average Spending Score (Cluster 2) - May or may not target these
groups of customers, based on the policy of the mall.

d. Low Income, High Spending Score (Cluster 4) - Can target this set of customers
by providing them with low-cost EMIs, etc.

e. Low Income, Low Spending Score (Cluster 5) - Don't target these customers, since
they have less income and need to save money.

8. References

[1] GitHub - https://github.com/kouluribabu12/Custemer_Segmentation
[2] https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
[3] https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
[4] https://www.researchgate.net/publication/349714847_MALL_CUSTOMER_SEGMENTATION_USING_CLUSTERING_ALGORITHM
