REPORT ON INTERNSHIP
DATA SCIENCE INTERN
UNDER THE COMPANY:
“ PERSONIFWY ”
Surakshaa Fairview Apartments, Belathur,
Bengaluru, Karnataka 560067
SUBMITTED BY:
M Ravindra
19103335
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCHOOL OF STUDIES OF ENGINEERING & TECHNOLOGY
GURU GHASIDAS VISHWAVIDYALAYA
(CENTRAL UNIVERSITY)
BILASPUR, CHHATTISGARH, INDIA
OCTOBER, 2022
DECLARATION
I hereby declare that all the work presented in this report in the partial
fulfilment of the requirement for the award of the degree of Bachelor of
Technology in Computer Science & Engineering, Institute of Technology,
Guru Ghasidas Vishwavidyalaya, Central University, Bilaspur,
Chhattisgarh, is an authentic record of the work done during the
vocational training under Internshala.
STUDENT NAME:
M Ravindra (19103335)
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task
would be incomplete without the mention of people whose ceaseless
cooperation made it possible, whose constant guidance and
encouragement crown all efforts with success. I would like to express my
gratitude and appreciation to Liveasy for being my guide and to all those
who gave me the opportunity to complete this training. I am also deeply
thankful to all the sources from which I have cited information. I do not
know the names of all the people behind them, but I want to acknowledge
their help during my training.
Date:.......-11-2022
STUDENT NAME:
M Ravindra (19103335)
ABSTRACT
Many customers buy products from the mall. To generate more revenue,
the mall authorities need to attract these customers, and this requires
a large amount of capital. After advertising to all customers, the
output is only around 30-40%. This is where customer segmentation comes
into the picture.
Customer segmentation is a popular application of unsupervised
learning. By using this technique, we focus only on the potential
customers (customers whose probability of buying the product is very
high). With this technique, the output will drastically increase to
90-95%.
Our project aims to build clusters of customers based on their Spending
Score and Annual Income. The algorithm used in this project is
Hierarchical clustering.
CERTIFICATE
TABLE OF CONTENTS
TOPICS
1. Introduction
1.1 About the Company
1.2 Vision
1.3 Mission
1.4 Internship Objectives
2. Data Science
2.1 Definition of Data Science
2.2 Importance of Data Science
2.3 Prerequisites for Data Science
2.4 Benefits of Data Science in Business
2.5 What Is Data Science Used For?
2.6 Applications of Data Science
2.7 Data Science process
3. Mall Customer Segmentation
3.1 General
3.2 Customer Segmentation
3.3 Dataset
4. Proposed Method and Architecture
4.1 Data Science Project Architecture
5. Methodology
5.1 Clustering
5.2 Hierarchical Clustering Segmentation
5.3 Divisive Clustering
6. Implementation and Analysis
6.1 Dendrogram
7. Conclusion
7.1 Annual Income Vs Spending Score Analysis
8. References
1. Introduction
1.1 About the Company
Personifwy is an advanced analytics-based Enterprise SAAS platform, that helps
organizations to drive their employee engagement for business success. Personifwy
makes it easy to use workforce experience as a competitive advantage to ensure
business success. The 3 main use cases offered are:
1. Pre-boarding: Offers a dedicated digital engagement assistant to virtually
preboard talent, predict no-shows and gather insights that can improve
offer-to-joining ratios.
2. Onboarding: Understands employee pulse once onboarded and enhances their
journey into the organization through personalised learning and recommendations
in the first 180 days. Offers a dedicated assessment to comprehend and reduce
infant mortality.
3. Employee Engagement: Helps (a) understand your workforce engagement in real
time and enable employees through personalized learning content, and (b) track
and measure their success by enabling OKRs (Objectives & Key Results), thus
promoting action on insights to reduce attrition for the organization.
1.2 Vision
Our Vision lies to bring in a technology-oriented career-driven Industrial Experience
into the aspirant’s career with Great Value.
1.3 Mission
Our core values not only guide our behaviour, but these are core to our thinking and
our culture. These are exemplified by everyone at our organization
● Customer Focus: Our customers are at the centre of everything we do.
● Employee Centricity: We champion employee experience to ensure we
build a workplace for future
● Shared Ambition: We learn, we grow, and we win together. We are
collectively accountable and empowered
● Agility: We embrace innovation, we believe in transformation, we don’t fear
change and we value adaptability.
1.4 Internship Objectives
1. To identify the goals and objectives of the company.
2. To learn and apply theoretical knowledge practically in the workplace.
3. To develop interpersonal, managerial, and communication skills.
4. To come up with possible strategies to gain a competitive advantage.
5. To learn about professional ethics and the working of the corporate world.
2. Data Science
2.1 Definition of Data Science
Data science is the process of using tools and techniques to draw actionable
information out of huge volumes of noisy data. It is used for everything from
business decision-making to sports analytics to insurance risk assessment.
2.2 Importance of Data Science
Data science plays an important role in virtually all aspects of business operations
and strategies.
For example, it provides information about customers that helps companies create
stronger marketing campaigns and targeted advertising to increase product sales.
2.3 Prerequisites for Data Science
2.3.1.Statistics
Data science relies on statistics to capture and transform data patterns into usable
evidence through the use of complex machine learning techniques.
2.3.2.Programming
Python, R, and SQL are the most common programming languages. To successfully
execute a data science project, it is important to have some level of programming
knowledge.
2.3.3.Machine Learning
Making accurate forecasts and estimates is made possible by Machine Learning,
which is a crucial component of data science. You must have a firm understanding of
machine learning if you want to succeed in the field of data science.
2.3.4.Databases
A clear understanding of the functioning of Databases, and skills to manage and
extract data is a must in this domain.
2.3.5.Modeling
You may quickly calculate and predict using mathematical models based on the data
you already know. Modeling helps in determining which algorithm is best suited to
handle a certain issue and how to train these models.
2.4 Benefits of Data Science in Business
● Improves business predictions
● Interpretation of complex data
● Better decision making
● Product innovation
● Improves data security
● Development of user-centric products
2.5 What Is Data Science Used For?
2.5.1.Descriptive Analysis
It helps in accurately displaying data points so that patterns in the data become
visible. In other words, it involves organizing, ordering, and manipulating data to
produce insightful information about the supplied data. It also involves converting
raw data into a form that is simple to grasp and interpret.
2.5.2.Predictive Analysis
It is the process of using historical data along with various techniques like data
mining, statistical modeling, and machine learning to forecast future results.
Utilizing trends in this data, businesses use predictive analytics to spot dangers and
opportunities.
2.5.3.Diagnostic Analysis
It is an in-depth examination to understand why something happened. Techniques
like drill-down, data discovery, data mining, and correlations are used to carry it
out. In each of these techniques, multiple data operations and transformations may
be performed on a given data set to discover unique patterns.
2.5.4.Prescriptive Analysis
Prescriptive analysis advances the use of predictive data. It not only foresees what is
most likely to occur but also offers the best course of action for dealing with that
result. It can assess the probable effects of various decisions and suggest the optimal
course of action. It makes use of machine learning recommendation engines,
complex event processing, neural networks, simulation, and graph analysis.
2.6 Applications of Data Science
2.6.1.Product Recommendation
The product recommendation technique can influence customers to buy similar
products. For example, a salesperson of Big Bazaar is trying to increase the store's
sales by bundling products together and giving discounts. So he bundles
shampoo and conditioner together and gives a discount on them. As a result,
customers buy them together for a discounted price.
2.6.2.Future Forecasting
It is one of the most widely applied techniques in Data Science. Weather forecasting
and other kinds of future forecasting are done on the basis of various types of data
collected from various sources.
2.6.3.Fraud and Risk Detection
It is one of the most natural applications of Data Science. Since online transactions
are booming, losing your data is possible. For example, credit card fraud detection
depends on the amount, merchant, location, time, and other variables. If any of them
looks unusual, the transaction may be automatically canceled and the card blocked
for 24 hours or more.
2.6.4.Self-Driving Car
The self-driving car is one of the most successful inventions in today’s world. We
train our car to make decisions independently based on the previous data. In this
process, we can penalize our model if it does not perform well. The car becomes more
intelligent with time when it starts learning through all the real-time experiences.
2.6.5.Image Recognition
When you want to recognize objects in images, data science can detect the object and
classify it. The most famous example of image recognition is face recognition: when
you ask your smartphone to unlock, it scans your face. First, the system detects the
face, then classifies it as a human face, and after that it decides whether the phone
belongs to the actual owner or not.
2.6.6.Speech-to-Text Conversion
Speech recognition is a process of understanding natural language by the computer.
We are quite familiar with virtual assistants like Siri, Alexa, and Google Assistant.
2.6.7.Healthcare
Data Science helps in various branches of healthcare such as Medical Image Analysis,
Development of new drugs, Genetics and Genomics, and providing virtual assistance
to patients.
2.6.8.Search Engines
Google, Yahoo, Bing, Ask, etc. provide us with a lot of results within a fraction of a
second. This is made possible by various data science algorithms.
Fig.1 Applications of Data Science
2.7 Data Science process
2.7.1.Obtaining the data
The first step is to identify what type of data needs to be analyzed, and this data
needs to be exported to an Excel or a CSV file.
2.7.2.Scrubbing the data
This step is essential: before you can analyze the data, you must ensure it is in a
perfectly readable state, with no mistakes and no missing or wrong values.
2.7.3.Exploratory Analysis
Analyzing the data is done by visualizing the data in various ways and identifying
patterns to spot anything out of the ordinary. To analyze the data, you must have
excellent attention to detail to identify if anything is out of place.
2.7.4.Modeling or Machine Learning
A data engineer or scientist writes down instructions for the Machine Learning
algorithm to follow based on the Data that has to be analyzed. The algorithm
iteratively uses these instructions to come up with the correct output.
2.7.5.Interpreting the data
In this step, you uncover your findings and present them to the organization. The
most critical skill in this would be your ability to explain your results.
3. Mall Customer Segmentation
3.1 General
Over the years, the increasing competition between businesses and the availability of
large-scale historical data have resulted in the extensive use of data mining techniques
to discover important and strategic information that is hidden in the data of
organizations. Data mining is the process of extracting logical information from a
dataset and presenting it in a human-accessible way for decision support. Data
mining techniques draw on areas such as statistics, artificial intelligence, machine
learning and database systems. Data mining applications include but are not limited to
bioinformatics, weather forecasting, fraud detection, financial analysis and customer
segmentation.
The key to this paper is to identify customer segments in the commercial business
using a data mining method. Customer segmentation is the division of the customer
base of the business into groups called customer segments, such that each customer
segment consists of customers who share similar market characteristics.
These distinctions are based on factors that can directly or indirectly influence the
market or business, such as product preferences or expectations, locations, behavior
and so on. The importance of customer segmentation includes, inter alia:
● the ability of a business to customize market plans that are appropriate for each
segment of its customers;
● support for business decisions in a risky environment, such as credit
relationships with its customers;
● identification of the products associated with individual segments, and how to
manage the forces of demand and supply;
● revealing interdependencies and interactions between customers, between
products, or between customers and products that the business may not be
aware of;
● the ability to predict customer decline (churn) and which customers are most
likely to have problems;
● raising further market research questions and providing clues to finding
solutions.
Two factors are considered in the clustering: the number of goods purchased by
the customer per month and the average number of customer visits per month. From
the dataset, five customer groups or categories are formed and labeled as follows:
cluster_1, cluster_2, cluster_3, cluster_4, cluster_5.
3.2 Customer Segmentation
Finding clusters of potential customers of the mall, and thus finding appropriate
measures to increase the revenue of the mall, is one of the prevailing applications
of unsupervised learning.
For example, a group of customers may have a high income but a low spending score
(amount spent in the mall). From the analysis, we can convert such customers into
potential customers (whose spending score is high) by using strategies like better
advertising, accepting feedback and improving the quality of products.
To identify such customers, this project analyses the data and forms clusters based
on different criteria, which are discussed in the following sections.
3.3 Dataset
The dataset, ‘Mall_Customers.csv’, consists of 5 columns, of which two are used in
this project: Annual Income (k$) and Spending Score (1-100). Both features are
numeric.
Fig.2 Snapshot of Dataset
The size of the dataset used for clustering is (200, 2), that is, 200 rows and 2 columns.
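The feature selection described above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual notebook: the helper name `select_features`, the `CustomerID` column and all values in the small in-memory table are assumptions made so that the example runs without the CSV file. Only the two feature names come from the dataset description.

```python
import pandas as pd

# The two numeric columns named in the dataset description.
FEATURES = ["Annual Income (k$)", "Spending Score (1-100)"]


def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the two columns used for clustering."""
    return df[FEATURES].copy()


# Stand-in for pd.read_csv("Mall_Customers.csv"); the values are made up.
sample = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4],  # assumed identifier column
        "Annual Income (k$)": [15, 16, 78, 120],
        "Spending Score (1-100)": [39, 81, 17, 79],
    }
)

X = select_features(sample)
print(X.shape)  # (number of rows, 2)
```

Applied to the full file, the same selection would give the (200, 2) shape stated above.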
4. Proposed Method and Architecture
Fig.3 Data Science Project Architecture
4.1 Data Science Project Architecture
4.1.1 Problem Statement
Customer segmentation is a popular application of unsupervised learning. Using
clustering, businesses identify segments of customers in order to target the potential
user base. They divide customers into groups according to common characteristics,
such as interests and spending habits, so that they can market to each group
effectively. In this project, hierarchical clustering is used and the dendrogram is
visualized; the annual incomes and spending scores of the clusters are then analyzed.
4.1.2 Data
The size of the dataset is (200, 2) which is 200 rows and 2 columns. Also the dataset
does not contain any NULL or NaN values.
4.1.3 Algorithms
Unsupervised Learning algorithm is used in this project to analyze and form clusters
of customers based on their income and spending score features.
4.1.4 Model
A hierarchical clustering model is used, with parameters such as n_clusters=5 set
using the cluster module of the scikit-learn library.
4.1.5 Programming and Environment
Programming Language: Python 3.6
Environment (Libraries and Technologies): NumPy, Pandas, Matplotlib,
Seaborn, Jupyter Notebook, Google Colab.
5. Methodology
The Data Science Methodology aims to answer basic questions in a prescribed
sequence, that cover the five main aspects of data science projects. These aspects are:
● From Problem to Approach
● From Requirements to Collection
● From Understanding to Preparation
● From Modelling to Evaluation
● From Deployment to Feedback
In this project, the prescribed sequence is:
● Creating an approach to solve the given problem statement
● Exploring the dataset and obtaining useful insights from it
● Cleaning the dataset by handling NaN values, removing duplicate records, etc.
● Visualizing the data to obtain important information from it
● Preprocessing the data to make it ready to fit the model; this includes
feature scaling, splitting the dataset into features and labels, etc.
● Building the model
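The cleaning and preprocessing steps listed above might look like the following sketch. It assumes pandas for cleaning and scikit-learn's `StandardScaler` for feature scaling; the report does not state which scaler, if any, was used, so that choice, the function name `clean_and_scale` and the sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_and_scale(df: pd.DataFrame) -> np.ndarray:
    """Drop rows with missing values and exact duplicates, then standardize."""
    cleaned = df.dropna().drop_duplicates()
    # StandardScaler rescales each column to zero mean and unit variance.
    return StandardScaler().fit_transform(cleaned)


# Made-up records: one duplicate row and one row with a missing value.
raw = pd.DataFrame(
    {
        "Annual Income (k$)": [15.0, 15.0, 78.0, np.nan, 120.0],
        "Spending Score (1-100)": [39.0, 39.0, 17.0, 50.0, 79.0],
    }
)

X_scaled = clean_and_scale(raw)
print(X_scaled.shape)
```

Scaling matters for distance-based methods such as clustering, because a feature with a larger numeric range would otherwise dominate the Euclidean distance.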
5.1 Clustering
Clustering is one of the most common methods used in exploring data to obtain a
clear understanding of the data structure. It can be characterized as the task of
finding subgroups in the complete dataset, such that similar data points are placed
in the same subgroup. A cluster refers to a collection of data points aggregated
because of some similarity. Clustering is used in market basket analysis to segment
customers based on their behaviours and transactions.
5.2 Hierarchical Clustering Segmentation
➢ Hierarchical clustering is also called hierarchical cluster analysis (HCA).
➢ Hierarchical clustering is an unsupervised learning algorithm and is
one of the most popular clustering techniques in machine learning.
➢ There are 2 types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering
5.2.1 Agglomerative Clustering
➢ This is a ‘bottom-up’ approach: each observation starts in its own cluster, and
pairs of clusters are merged as one moves up the hierarchy.
5.2.2 Steps involved in Agglomerative Clustering
1. Make each data point a single-point cluster, which forms n clusters.
2. Take the 2 closest data points and make them one cluster, leaving n-1 clusters.
3. Take the 2 closest clusters and make them one cluster, leaving n-2 clusters.
4. Repeat step 3 until only one cluster remains.
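The four steps above can be written out directly in a few lines of NumPy. This is a teaching sketch only: it uses the closest-point (single linkage) distance between clusters, which is an assumption made for simplicity, whereas the model in this project uses Ward linkage from scikit-learn. The function name `agglomerative_merges` and the four sample points are hypothetical.

```python
import numpy as np


def agglomerative_merges(points: np.ndarray) -> list:
    """Merge the two closest clusters until one remains (single linkage).

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    # Step 1: every point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        # Steps 2-3: find the pair of clusters with the smallest distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        # Step 4: replace the two clusters by their union and repeat.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]])
merges = agglomerative_merges(pts)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

For n points the history always contains n-1 merges, and the recorded distances are exactly the heights at which branches join in a dendrogram.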
➢ A dendrogram is known as the memory, or the graphical representation, of the
hierarchical clustering.
➢ In a dendrogram, the data points are on the x-axis.
➢ The Euclidean distance between merged clusters is on the y-axis.
➢ The distance between clusters can be measured in the following ways:
1. Closest points
2. Farthest points
3. Average distance
4. Distance between centroids
➢ The default distance measure is the Euclidean distance.
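To make the four cluster-distance options concrete, the sketch below computes each of them for two small clusters. It assumes SciPy's `cdist` (which uses the Euclidean metric by default) for the pairwise distances; the function name `cluster_distances` and the sample points are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def cluster_distances(a: np.ndarray, b: np.ndarray) -> dict:
    """Compute four common distances between clusters a and b."""
    pairwise = cdist(a, b)  # Euclidean distance between every pair of points
    return {
        "closest point": float(pairwise.min()),
        "farthest point": float(pairwise.max()),
        "average distance": float(pairwise.mean()),
        "centroid distance": float(
            np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
        ),
    }


cluster_a = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_b = np.array([[3.0, 0.0], [3.0, 2.0]])
distances = cluster_distances(cluster_a, cluster_b)
for name, value in distances.items():
    print(name, round(value, 3))
```

The choice among these options changes which clusters are considered closest at each merge step, and therefore the shape of the resulting hierarchy.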
5.3 Divisive Clustering
➢ This is a “top-down” approach: all observations start in one cluster, and splits
are performed recursively as one moves down the hierarchy.
5.3.1 Steps of Divisive Clustering
1. Initially, all points in the dataset belong to one single cluster.
2. Partition the cluster into the two least similar clusters.
3. Proceed recursively to form new clusters until the desired number of clusters
is obtained.
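The divisive steps above can be sketched as follows. Note the swapped-in technique: each split is done with 2-means (scikit-learn's `KMeans` with two clusters), which is one common way to bisect a cluster and not something prescribed by this report. Always splitting the largest cluster is also an assumption, as are the function name `divisive_clustering` and the synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans


def divisive_clustering(X: np.ndarray, n_clusters: int) -> np.ndarray:
    """Top-down clustering: repeatedly split the largest cluster in two."""
    # Step 1: all points belong to a single cluster with label 0.
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, n_clusters):
        # Assumption: the cluster with the most points is split next.
        largest = np.bincount(labels).argmax()
        idx = np.flatnonzero(labels == largest)
        # Step 2: partition that cluster into two groups with 2-means.
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        labels[idx[halves == 1]] = new_label
    # Step 3: the loop stops once n_clusters groups exist.
    return labels


# Synthetic data: three tight groups of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + 0.2 * rng.standard_normal((10, 2)) for c in centers])

divisive_labels = divisive_clustering(X_demo, n_clusters=3)
print(np.bincount(divisive_labels))
```

The sketch assumes every cluster chosen for splitting has at least two points.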
6. Implementation and Analysis
We will find the optimal number of clusters for our model using the dendrogram. For
this, we use the SciPy library, as it provides a function that directly returns
the dendrogram for our data.
Fig.4 Dendrogram
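A dendrogram of this kind can be produced with `scipy.cluster.hierarchy`, roughly as sketched below. The Ward linkage method matches the setting reported in the Conclusion; the function name `plot_dendrogram`, the axis labels, the off-screen `Agg` backend and the random stand-in data are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import scipy.cluster.hierarchy as sch


def plot_dendrogram(X: np.ndarray):
    """Build the Ward linkage matrix for X and draw its dendrogram."""
    Z = sch.linkage(X, method="ward")  # one row per merge: (a, b, distance, size)
    fig, ax = plt.subplots(figsize=(8, 4))
    sch.dendrogram(Z, ax=ax)
    ax.set_title("Dendrogram")
    ax.set_xlabel("Customers")
    ax.set_ylabel("Euclidean distance")
    return Z, fig


# Random stand-in for the (income, spending score) feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(30, 2))

Z, fig = plot_dendrogram(X_demo)
print(Z.shape)
```

In a notebook, the figure is shown inline; elsewhere it can be written to disk with `fig.savefig(...)`.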
Using this Dendrogram, we will now determine the optimal number of clusters for
our model. For this, we will find the maximum vertical distance that does not cut any
horizontal bar. Consider the below diagram:
Fig.5 Dendrogram to find clusters
In the above diagram, we have shown the vertical distances that do not cut their
horizontal bars. As we can see, the 4th distance looks the largest, so
according to this, the number of clusters will be 5 (the number of vertical lines in
this range). We could also take the 2nd distance, as it approximately equals the 4th,
but we will use 5 clusters because that is the same number we obtained with the
K-means algorithm.
So, the optimal number of clusters is 5, and we will train the model in the next
step using this value.
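The visual rule described above (find the longest vertical stretch that no horizontal bar crosses) can also be approximated numerically: the merge heights stored in the SciPy linkage matrix are sorted, and the largest gap between two consecutive heights marks where to cut. The sketch below is a heuristic under that assumption, with a hypothetical function name and synthetic data; it is not how the value 5 was obtained in this report, which read it from the figure.

```python
import numpy as np
import scipy.cluster.hierarchy as sch


def clusters_from_largest_gap(X: np.ndarray) -> int:
    """Suggest a cluster count from the largest gap between merge heights."""
    Z = sch.linkage(X, method="ward")
    heights = Z[:, 2]          # merge distances, in increasing order
    gaps = np.diff(heights)    # vertical room between consecutive merges
    i = int(np.argmax(gaps))   # cut between merge i and merge i + 1
    # After the first i + 1 merges, n - (i + 1) clusters remain,
    # and n - 1 == len(heights), so this equals len(heights) - i.
    return len(heights) - i


# Synthetic data: three tight, well-separated groups of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + 0.2 * rng.standard_normal((10, 2)) for c in centers])

suggested_k = clusters_from_largest_gap(X_demo)
print(suggested_k)
```

As in the text above, two gaps can be of similar size on real data, so the suggested count is best checked against the dendrogram itself.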
7. Conclusion
For this project, the Hierarchical clustering algorithm is used and performs the best
(with n_clusters = 5 and affinity='euclidean', linkage='ward'). After the clustering
algorithm is applied to the dataset, this is the output.
Fig.6 Annual Income Vs Spending Score after Clustering
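The model described above can be fitted roughly as in the sketch below, using `AgglomerativeClustering` from scikit-learn with n_clusters=5 and Ward linkage. Ward linkage only works with Euclidean distance, so the distance argument is left at its default here rather than passed explicitly. The function name `fit_hierarchical` and the synthetic five-group data are assumptions standing in for the real feature matrix.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def fit_hierarchical(X: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Fit agglomerative clustering with Ward linkage and return the labels."""
    # Ward linkage always uses Euclidean distance, so no metric is passed.
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return model.fit_predict(X)


# Synthetic (income, spending score) data: five made-up groups of 20 points.
rng = np.random.default_rng(0)
centers = np.array(
    [[25.0, 20.0], [25.0, 80.0], [55.0, 50.0], [88.0, 17.0], [87.0, 82.0]]
)
X_demo = np.vstack([c + 3.0 * rng.standard_normal((20, 2)) for c in centers])

cluster_labels = fit_hierarchical(X_demo)
print(np.bincount(cluster_labels))
```

A scatter plot like the figure above can then be drawn with Matplotlib, for example `plt.scatter(X_demo[:, 0], X_demo[:, 1], c=cluster_labels)`.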
7.1 Analysis
a. High Income, High Spending Score (Cluster 3) - Target these customers by
sending new product alerts, which would lead to an increase in the revenue collected
by the mall, as they are loyal customers.
b. High Income, Low Spending Score (Cluster 1) - Target these customers by asking
for feedback and advertising the products in a better way to convert them into
Cluster 3 customers.
c. Average Income, Average Spending Score (Cluster 2) - May or may not target these
groups of customers, based on the policy of the mall.
d. Low Income, High Spending Score (Cluster 4) - These customers can be targeted
by providing them with low-cost EMIs, etc.
e. Low Income, Low Spending Score (Cluster 5) - Do not target these customers, since
they have less income and need to save money.
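Descriptions such as "High Income, Low Spending Score" can be derived by averaging the two features within each cluster. The following pandas sketch shows one way to do this; the function name `profile_clusters`, the four sample customers and their labels are made up for illustration.

```python
import pandas as pd

INCOME = "Annual Income (k$)"
SCORE = "Spending Score (1-100)"


def profile_clusters(df: pd.DataFrame, labels) -> pd.DataFrame:
    """Return the mean income and spending score for each cluster label."""
    return df.assign(cluster=labels).groupby("cluster")[[INCOME, SCORE]].mean()


# Made-up customers: two low-income/high-spending, two high-income/low-spending.
customers = pd.DataFrame({INCOME: [20, 22, 90, 88], SCORE: [80, 78, 15, 17]})
profile = profile_clusters(customers, [0, 0, 1, 1])
print(profile)
```

The label numbers returned by a clustering algorithm are arbitrary, so the mapping from a label to a description (and to the cluster numbers used above) has to be read from a table like this one after each run.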
8. References
[1] GitHub - https://github.com/kouluribabu12/Custemer_Segmentation
[2] https://www.javatpoint.com/hierarchical-clustering-in-machine-learning
[3] https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python
[4] https://www.researchgate.net/publication/349714847_MALL_CUSTOMER_SEGMENTATION_USING_CLUSTERING_ALGORITHM