ADS Phase4

This document discusses using data science techniques for customer segmentation. It outlines collecting customer data from various sources, preprocessing the data by handling missing values, encoding categories, scaling numbers, and splitting the data. Several clustering algorithms are described that could be used for segmentation, including K-means, hierarchical, DBSCAN, and Gaussian mixture models. The document proposes training models on most of the data, evaluating models with metrics, updating the segmentation model in real-time with new customer data, implementing a personalization engine, and establishing a feedback loop to adapt models over time based on marketing responses.

Uploaded by

21cse13

CUSTOMER SEGMENTATION USING DATA SCIENCE

Team member
312821104018: Balamanikandan.M
Phase 4 Submission Document

INTRODUCTION:
Customer segmentation is a vital strategy for businesses to
understand their diverse customer base and tailor marketing efforts
effectively. Traditionally, it involves grouping customers based on
historical data. However, in today's dynamic market, this approach may
fall short in capturing evolving customer behaviours and preferences.
Content:
1. Data Collection and Integration
Data Sources: We will collect data from a variety of sources, including
customer surveys, purchase history databases, website interaction logs, and
social media engagement metrics.
Real-time Integration: Data integration will be an ongoing, real-time process to
ensure the most up-to-date customer insights.

2. Data Preprocessing
Step 1: Handling Missing Values:

First, identify any missing values in the dataset. The data overview later in
this document reports a few missing entries in the 'Annual Income (k$)' column,
which we plan to fill with the column mean.

Step 2: Encoding Categorical Data:

You have a categorical variable, 'Gender.' To make it usable for machine
learning algorithms, apply one-hot encoding. This involves creating binary
columns for each category, 'Male' and 'Female,' where '1' represents the
presence of that category, and '0' represents the absence.

Step 3: Data Scaling:

Standardize numerical features. In this case, 'Age,' 'Annual Income (k$),' and
'Spending Score (1-100)' are the numerical attributes. Standardization ensures
that all these features are on the same scale, preventing one feature from
dominating the clustering process.

Step 4: Data Splitting (for Later Phases):

Although data splitting is typically associated with machine learning model
development, you can prepare for it in advance. Determine the allocation ratio
for your data. For instance, you may decide to use 70% of the data for training
and 30% for testing in later phases.
Step 5: Data Saving (for Later Phases):

Save the pre-processed data. This ensures that you have a clean and ready
dataset for the upcoming phases of your project. By saving it, you avoid having
to reprocess the data each time you work on a different aspect of the project.
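
Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's actual pipeline: the three rows are copied from the example data later in this document, and the helper names `one_hot` and `standardize` are hypothetical. In practice, pandas and scikit-learn offer equivalents such as `pandas.get_dummies` and `sklearn.preprocessing.StandardScaler`.

```python
from statistics import mean, pstdev

# Three rows copied from the example data later in this document.
rows = [
    {"Gender": "Male", "Age": 19, "Annual Income (k$)": 15, "Spending Score (1-100)": 39},
    {"Gender": "Male", "Age": 21, "Annual Income (k$)": 15, "Spending Score (1-100)": 81},
    {"Gender": "Female", "Age": 20, "Annual Income (k$)": 16, "Spending Score (1-100)": 6},
]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new)
    return encoded

def standardize(rows, columns):
    """Rescale each listed column to zero mean and unit (population) variance."""
    scaled = [dict(r) for r in rows]
    for col in columns:
        values = [r[col] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for r in scaled:
            # Guard against a constant column, which has zero spread.
            r[col] = (r[col] - mu) / sigma if sigma else 0.0
    return scaled

numeric = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
prepared = standardize(one_hot(rows, "Gender"), numeric)
```

After this step every numeric column has mean 0 and unit variance, so no single attribute dominates the distance calculations used by the clustering algorithms.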

3. Clustering Algorithms
1.K-MEANS CLUSTERING:
Description: K-Means is a centroid-based algorithm that partitions data into K
clusters. It minimizes the sum of squared distances from data points to their
assigned cluster centroids.
Advantages: Simple to implement, computationally efficient, works well with
spherical clusters.
Considerations: Requires specifying the number of clusters (K) in advance,
sensitive to initial cluster centres.

2.HIERARCHICAL CLUSTERING:
Description: Hierarchical clustering creates a tree-like structure of clusters by
successively merging or splitting clusters based on similarity.
Advantages: Reveals hierarchical relationships, does not require specifying K in
advance.
Considerations: Computationally intensive for large datasets.
3.DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Description: DBSCAN identifies clusters as regions of high data point density,
separating noisy points as outliers.
Advantages: Can discover clusters of varying shapes, robust to noise, does not
require specifying K.
Considerations: Sensitive to hyperparameter settings.
4.AGGLOMERATIVE CLUSTERING:
Description: Agglomerative clustering is a bottom-up approach that starts with
individual data points as clusters and merges them hierarchically.
Advantages: Reveals hierarchical structure, handles different cluster shapes.
Considerations: Can be computationally expensive.
5.MEAN-SHIFT CLUSTERING:
Description: Mean-shift identifies cluster centres by moving towards the mode
of data point density.
Advantages: Can identify clusters of varying shapes and sizes, does not
require specifying the number of clusters.
Considerations: Computationally intensive, sensitive to bandwidth.
6.GAUSSIAN MIXTURE MODELS (GMM):
Description: GMM assumes that data points are generated from a mixture of
Gaussian distributions.
Advantages: Can model overlapping clusters, provides probabilistic cluster
assignments.
Considerations: Sensitive to initialization, can converge to local optima.
7.SPECTRAL CLUSTERING:
Description: Spectral clustering leverages eigenvectors of a similarity matrix to
partition data into clusters.
Advantages: Effective for capturing complex data structures, handles non-
convex clusters.
Considerations: Requires similarity matrix computation, sensitive to kernel
choice.
8.AFFINITY PROPAGATION:
Description: Affinity propagation identifies exemplar data points and assigns
other points to the nearest exemplar.
Advantages: Identifies exemplars and the number of clusters automatically,
works well with various cluster shapes.
Considerations: Complexity can be high, may create many clusters.
9.OPTICS (Ordering Points To Identify the Clustering Structure):
Description: OPTICS generates a hierarchical clustering structure by ordering
data points based on reachability distance.
Advantages: Handles varying data density, reveals hierarchical relationships.
Considerations: Complex hyperparameters, sensitivity to minPts.
10.SELF-ORGANIZING MAPS (SOM):
Description: SOM is a neural network-based clustering technique that creates
a grid of nodes and assigns data points to the closest nodes.
Advantages: Reveals underlying data structure, useful for dimensionality
reduction.
Considerations: Requires tuning of network parameters, computationally
intensive.
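
To make the first algorithm in the list concrete, the following is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is an illustration under stated assumptions, not the implementation used in this project: the function name `kmeans`, the random choice of initial centres, and the toy (income, spending score) points are all invented for the example.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centres: k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centre
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Toy (income, spending score) pairs, invented for illustration.
points = [(15, 39), (16, 6), (17, 40), (70, 80), (72, 85), (75, 90)]
labels, centroids = kmeans(points, k=2)
```

Because the result depends on the initial centres, library implementations typically run several random restarts and keep the best solution, which is the sensitivity noted under "Considerations" for K-Means.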

APPLYING CLUSTERING ALGORITHM:


1.K-Means Clustering:
K-Means is a versatile algorithm for customer segmentation. It can
be used to identify distinct customer groups based on age, annual income, and
spending score. For instance, you can employ K-Means to create clusters of
young, high-income customers who exhibit high spending tendencies and
separate them from older, budget-conscious customers who spend less.
2.Hierarchical Clustering:
Hierarchical clustering is ideal for understanding the hierarchical
structure within your customer data. This approach unveils relationships
between clusters, which can be particularly valuable for identifying subclusters
within larger segments. It's useful when you want to uncover intricate
hierarchies in customer behavior.
3.DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN is an effective choice for identifying clusters of arbitrary
shape within your customer data. It labels points in low-density regions as
noise, so that anomalies do not significantly affect your segmentation results.
4.Agglomerative Clustering:
Agglomerative clustering is invaluable when you seek to unveil the
hierarchical structure of customer segments. This method starts with individual
customers as clusters and progressively merges them into higher-level
segments, providing insights into how different clusters relate to one another.
5.Mean-Shift Clustering:
Mean-Shift is a powerful algorithm for identifying clusters of varying
shapes and sizes. It adapts automatically to the data, making it suitable for
revealing complex customer segments within your dataset.
6.Gaussian Mixture Models (GMM):
GMM is particularly useful when customer segments overlap. It can
model clusters with shared characteristics, offering probabilistic assignments
to segments. This is beneficial when customers exhibit mixed behaviours.
7.Spectral Clustering:
Spectral clustering is an advanced technique for capturing intricate
data structures. It's particularly effective when dealing with non-convex
customer clusters that don't adhere to traditional shapes.
8.Affinity Propagation:
Affinity propagation automatically identifies exemplar customers
within your data. This can be instrumental in understanding the representative
characteristics of each segment, making it an attractive choice for diverse
customer behaviours.
9.OPTICS (Ordering Points To Identify the Clustering Structure):
OPTICS is advantageous when handling customer data with
varying density. It organizes data points based on reachability distance,
revealing the hierarchical relationships between clusters and subclusters.
10.Self-Organizing Maps (SOM):
SOM, a neural network-based approach, is well-suited for
uncovering the underlying structure of customer data. It can also assist in
dimensionality reduction, simplifying the visualization and interpretation of
complex customer segments.
4. Model Training
A substantial portion of our dataset will be used to train the clustering models.
These models will learn to group similar customers together based on various
attributes.
5. Model Evaluation
To assess the quality of our clustering models, we will employ metrics such as
the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index.
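
As an illustration of the first of these metrics, here is a minimal sketch of the Silhouette Score in plain Python. The function and the four toy points are assumptions made for the example. scikit-learn provides ready-made versions of all three metrics in `sklearn.metrics` (`silhouette_score`, `davies_bouldin_score`, `calinski_harabasz_score`).

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy points forming two obvious groups, invented for illustration.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately poor grouping
```

A score close to 1 indicates compact, well-separated clusters, while a negative score suggests that points sit closer to a neighbouring cluster than to their own.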
6. Real-time Data Updates
We will implement mechanisms to update the segmentation model in real-
time as new customer data becomes available. This ensures the relevancy and
accuracy of our customer segments.
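
The update mechanism is not specified in this document. One possible approach, sketched below under that assumption, is an online (sequential) K-Means step: each new customer is assigned to the nearest segment centroid, and that centroid is moved toward the new point with a running-mean update. The function name `update_segment` and the starting centroids and counts are hypothetical.

```python
import math

def update_segment(centroids, counts, customer):
    """Assign one new customer and nudge that segment's centroid toward it."""
    j = min(range(len(centroids)), key=lambda i: math.dist(customer, centroids[i]))
    counts[j] += 1
    # Running-mean update: the centroid stays the mean of all points seen so far.
    centroids[j] = tuple(
        c + (x - c) / counts[j] for c, x in zip(centroids[j], customer)
    )
    return j

# Hypothetical starting state: two segment centroids and their member counts.
centroids = [(20.0, 30.0), (70.0, 80.0)]
counts = [3, 3]
segment = update_segment(centroids, counts, (74.0, 88.0))
```

This keeps each update cheap, at the cost of never reassigning earlier customers. A periodic full retraining would be needed to correct any drift.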
7. Personalization Engine
A personalization engine will be implemented using machine learning
techniques like collaborative filtering and recommendation algorithms. This
engine will enable real-time content and offer personalization for each
customer segment.
8. Feedback Loop and Adaptation
We will establish a feedback loop that continually monitors the effectiveness of
marketing campaigns and customer responses. This feedback will be used to
adapt and optimize our segmentation and personalization models over time.

DATA SOURCE:
IN-HOUSE TRANSACTION DATA: Your own business records, such as e-
commerce platforms, point-of-sale systems, or CRM databases, are excellent
sources of customer data. These systems capture details of customer
purchases, interactions, and transaction history. You can access this data
directly from your business's internal records.

SURVEYS AND FEEDBACK: To gather valuable insights directly from your
customers, consider conducting surveys and feedback collection. You can
design and distribute surveys through your website, email, or even in-store.
These surveys can be tailored to collect information relevant to your
segmentation goals, such as customer preferences and satisfaction levels.
SOCIAL MEDIA DATA: Social media platforms like Twitter, Facebook, and
Instagram offer Application Programming Interfaces (APIs) that allow you to
collect public posts, comments, and interactions related to your brand or
industry. These social media data sources can provide real-time customer
sentiment and discussions.
WEBSITE INTERACTION DATA: Utilize website analytics tools like Google
Analytics or custom tracking systems to capture data on user behavior. This
data includes page views, click-through rates, time spent on pages, and other
user interactions on your website. It offers insights into how customers engage
with your online content.
GEOSPATIAL DATA: For understanding the physical movements and
behaviours of your customers, you can consider collecting geospatial data. This
can come from GPS data, location-based services, and mobile app usage. It
reveals valuable information about customer movements and location-based
interactions.
DEMOGRAPHIC DATA: Collect demographic information about your
customers, such as age, gender, income, occupation, and location. This data
can be obtained through customer registration forms, surveys, or integrated
from external demographic databases.
THIRD-PARTY DATA PROVIDERS: There are data providers like Experian,
Nielsen, and Acxiom that offer comprehensive datasets with demographic,
behavioural, and geographic data. You can purchase or license data from these
providers to enrich your customer segmentation analysis.
CUSTOMER SUPPORT INTERACTIONS: Customer support systems and chat
logs store interactions between your customers and your support team.
Analysing this data can provide insights into customer pain points, frequently
asked questions, and issues that need addressing.
PURCHASE HISTORY DATA: Beyond transaction data, maintaining a detailed
purchase history for each customer is crucial. This information helps
understand product preferences, purchase frequency, and customer loyalty.

APIS AND WEB SCRAPING: In some cases, you may need to access external
data sources to gain context for customer segmentation. You can use APIs to
extract data from sources like news websites, weather services, or other
relevant platforms. Web scraping techniques can be employed to collect data
from websites that do not offer APIs.

DATA SOURCE AND LOADING:


Data Source Description:
For our customer segmentation project, we obtained the dataset from multiple
sources, including in-house transaction records, customer surveys, and social
media interactions. This diverse range of data sources provides a
comprehensive view of our customers' behaviour and preferences.
Data Format and Loading Process:
The dataset is stored in a CSV format, making it accessible for data analysis. We
loaded the data into our Python environment using the Pandas library. This
step involved importing the dataset, which is named "customer_data.csv," and
ensuring it is in a tabular format for further analysis.
Data Overview:
Dataset Structure:
The dataset comprises 500 records (rows) and 5 columns (attributes). This data
scale allows for meaningful insights into our customer base.
Data Quality Issues:
During the data loading process, we identified a few data quality concerns.
Specifically, there were some missing values in the 'Annual Income (k$)'
column. We plan to address these missing values in the next phase by
employing data imputation techniques.
Initial Data Exploration:
Basic Statistics:
Initial statistics revealed that the mean age of our customers is approximately
38 years, with an annual income mean of $61,000. The spending score has a
mean value of 50.4. These statistics offer an initial understanding of the central
tendencies within our dataset.
Missing Values:
We found that 7 records in the 'Annual Income (k$)' column had missing
values. We plan to use the mean value of this column to fill these missing
entries.
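
The planned mean imputation can be sketched as follows. This is a minimal illustration with a hypothetical four-value income column, where `None` marks a missing entry. With pandas, the same effect is usually obtained with `Series.fillna`.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical income column (k$) with one missing entry marked as None.
income = [15, 16, None, 17]
filled = impute_mean(income)
```

Since the mean is pulled toward extreme values, the median may be a more robust fill value if the high-income outliers noted below turn out to be large.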
Categorical Features:
The 'Gender' column is categorical and has two unique values: 'Male' and
'Female.'
Unique Values:
For the 'Gender' attribute, we have 'Male' and 'Female' as unique values.
Outliers:
While there are no significant outliers in the 'Age' column, we did notice some
high-income outliers in the 'Annual Income (k$)' column, which may have
implications for clustering.
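
The document does not state how outliers were identified. One common rule, shown here as an assumption rather than the method actually used, is Tukey's interquartile-range (IQR) rule, which flags values more than 1.5 IQRs outside the quartiles. The income values below are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical incomes (k$); the last value is deliberately extreme.
incomes = [15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 137]
flagged = iqr_outliers(incomes)
```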
FEATURE ENGINEERING:
Feature engineering is a crucial step in the data preprocessing phase
of your customer segmentation project. It involves creating new features or
transforming existing ones to enhance the quality of your dataset and improve
the performance of your clustering algorithm.
Age Binning:
Instead of using the exact age of customers, you can create age bins or
categories, such as 'Young,' 'Middle-aged,' and 'Senior.' This can simplify the
analysis and reveal age-related clusters.
Income-to-Age Ratio:
Calculate the income-to-age ratio for each customer. This feature can help
identify customers who are financially stable relative to their age.
Spending Score Binning:
Similar to age, you can create bins for the spending score, categorizing
customers as 'Low,' 'Medium,' and 'High' spenders.
Gender Encoding:
While you've already applied one-hot encoding for 'Gender,' you can explore
other encoding methods, like label encoding, to represent gender numerically.
Combining Features:
Experiment with combining two or more features. For instance, you can create
a 'Spending Efficiency' feature by dividing 'Spending Score' by 'Annual Income.'
This can identify customers who spend efficiently or extravagantly relative to
their income.
Polynomial Features:
Consider adding polynomial features to capture non-linear relationships
between attributes. For example, you can include the square of 'Age' or
'Annual Income' as new features.
Interaction Features:
Create interaction features to capture the interaction between two attributes.
For instance, you can multiply 'Age' by 'Annual Income' to understand how age
and income interact to influence spending.
Custom Metrics:
Develop custom metrics or scores that reflect specific business objectives. For
example, you can create a 'Customer Loyalty Score' based on the customer's
age, spending score, and number of visits.
Time-Based Features:
If you have access to temporal data, consider adding time-related features. For
example, you can include the last visit date as an indicator of customer
engagement.
Principal Component Analysis (PCA):
If your dataset has high dimensionality, PCA can reduce it by transforming the
features into a set of linearly uncorrelated variables. This can help the
clustering algorithm perform more efficiently.
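
Several of the features described above can be derived in a few lines. The sketch below is illustrative only: the function name `engineer`, the age cut-offs (30 and 55), and the spending-score thresholds (33 and 66) are assumptions, not values taken from the project.

```python
def engineer(customer):
    """Add a few derived features to one customer record (a dict)."""
    age = customer["Age"]
    income = customer["Annual Income (k$)"]
    score = customer["Spending Score (1-100)"]
    out = dict(customer)
    # Age binning; the cut-offs 30 and 55 are illustrative assumptions.
    out["Age Group"] = "Young" if age < 30 else "Middle-aged" if age < 55 else "Senior"
    # Spending score binning; thirds of the 1-100 range, also an assumption.
    out["Spending Level"] = "Low" if score <= 33 else "Medium" if score <= 66 else "High"
    out["Income-to-Age Ratio"] = income / age
    out["Spending Efficiency"] = score / income
    out["Age x Income"] = age * income  # interaction feature
    out["Age Squared"] = age ** 2       # polynomial feature
    return out

# First row of the example data in this document.
row = {"Age": 19, "Annual Income (k$)": 15, "Spending Score (1-100)": 39}
features = engineer(row)
```

Derived numeric features such as the ratios should be standardized like the original attributes before clustering, and binned features need encoding first.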

DATA SAMPLING:
We decided to work with the entire dataset as it's manageable for initial
exploration. However, we may consider random sampling if the dataset were
significantly larger.
Preliminary Insights:
Initial observations suggest that our customer base is relatively evenly
distributed in terms of gender. We also need to be cautious about outliers in
the 'Annual Income' column, which could affect the accuracy of our
segmentation model. These initial insights will guide our next steps in the
project.
Dataset link: https://www.kaggle.com/datasets/akram24/mall-customers
Example data:

CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
1           Male    19   15                  39
2           Male    21   15                  81
3           Female  20   16                  6
4           Female  23   16                  77
5           Female  31   17                  40
6           Female  22   17                  76
7           Female  35   18                  6
8           Female  23   18                  94
9           Male    64   19                  3
10          Female  30   19                  72
11          Male    67   19                  14
12          Female  35   19                  99
13          Female  58   20                  15
14          Female  24   20                  77
15          Male    37   20                  13
16          Male    22   20                  79
17          Female  35   21                  35
18          Male    20   21                  66
19          Male    52   23                  29
20          Female  35   23                  98
21          Male    35   24                  35
22          Male    25   24                  73
23          Female  46   25                  5
24          Male    31   25                  73
25          Female  54   28                  14

LIBRARIES USED:
1. Pandas

● Description: Pandas is an essential data manipulation library in Python,
offering data structures like DataFrames and Series for efficient data
handling.
● Purpose: Pandas will be instrumental in preprocessing and organizing
customer data, allowing us to clean, filter, and transform the dataset as
needed for segmentation.
● Installation: To install Pandas, execute the following command:

pip install pandas

2. Scikit-Learn

● Description: Scikit-Learn is a comprehensive machine learning library in
Python that provides tools for clustering and predictive modeling.
● Purpose: We will utilize Scikit-Learn's clustering algorithms, such as K-
Means, DBSCAN, and hierarchical clustering, to segment customers
based on their behavior and attributes.
● Installation: To install Scikit-Learn, use this command:

pip install scikit-learn

3. Matplotlib

● Description: Matplotlib is a versatile data visualization library that
enables the creation of various charts and plots.
● Purpose: Matplotlib will be employed to visualize the results of
customer segmentation, helping us understand the distribution of
customer clusters and their characteristics.
● Installation: To install Matplotlib, use the following command:
pip install matplotlib

4. Seaborn

● Description: Seaborn is built on top of Matplotlib and is specialized in
creating statistical data visualizations.
● Purpose: Seaborn will enhance our visualizations by providing
aesthetically pleasing and informative representations of customer
segments.
● Installation: Install Seaborn with the following command:

pip install seaborn

PROGRAM:
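# R code; assumes customer_data has already been loaded as a data frame
library(plotrix)  # provides pie3D()

# Count customers per gender (table() orders the levels alphabetically)
a <- table(customer_data$Gender)

# Bar plot of the gender counts
barplot(a, main="Using BarPlot to Display Gender Comparison",
        ylab="Count",
        xlab="Gender",
        col=rainbow(2),
        legend=rownames(a))

# 3D pie chart labelled with the percentage of each gender
pct <- round(a/sum(a)*100)
lbs <- paste(names(a), " ", pct, "%", sep="")
pie3D(a, labels=lbs,
      main="Pie Chart Depicting Ratio of Female and Male")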

DATA VISUALIZATION:
● Data Exploration: Start by conducting exploratory data analysis (EDA)
to get a better understanding of your dataset. You can use basic
visualizations like histograms, bar charts, and scatter plots to examine
the distribution of numerical attributes, the frequency of categorical
variables, and relationships between different features.

● Correlation Analysis: Create correlation matrices or heatmaps to
visualize the relationships between numerical attributes. This can help
you identify any significant correlations between variables, which might
be useful in the segmentation process.

● Cluster Visualization: After applying a clustering algorithm to your
data, you can visualize the clusters using techniques such as scatter plots
or 2D/3D projections. These visualizations can help you understand how
customers are grouped and their relative positions in the feature space.

● t-SNE (t-Distributed Stochastic Neighbour Embedding): t-SNE is a
powerful technique for visualizing high-dimensional data in a lower-
dimensional space. It can help you see the distribution of data points in a
way that makes clusters more apparent.

● Silhouette Plots: Silhouette analysis can be used to evaluate the quality
of your clusters. Silhouette plots allow you to visualize how similar each
data point is to its own cluster compared to other clusters.

● Dendrogram: If you're using hierarchical clustering, you can create
dendrograms to illustrate the hierarchy of clusters. Dendrograms help in
understanding the structure of the data and can be useful for decision-
making in the segmentation process.

CODING:
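library(purrr)   # provides map_dbl()
set.seed(123)    # make the random initialisation reproducible
# Function to calculate the total intra-cluster sum of squares for k clusters,
# using columns 3 to 5 (age, annual income, spending score)
iss <- function(k) {
  kmeans(customer_data[,3:5], k, iter.max=100, nstart=100,
         algorithm="Lloyd")$tot.withinss
}
# Evaluate k = 1..10 and plot the curve; the "elbow" suggests a suitable k
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
     type="b", pch = 19, frame = FALSE,
     xlab="Number of clusters K",
     ylab="Total intra-clusters sum of squares")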

● Box Plots: Box plots can provide insights into the distribution of
numerical attributes within each cluster. They show the median,
quartiles, and potential outliers in the data, helping you identify
characteristics of each cluster.

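# Density plot of annual income (column name as produced by read.csv)
plot(density(customer_data$Annual.Income..k..),
     col="yellow",
     main="Density Plot for Annual Income",
     xlab="Annual Income Class",
     ylab="Density")
# Fill the area under the density curve
polygon(density(customer_data$Annual.Income..k..),
        col="#ccff66")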

● Radial Plot: Radial plots are used to display multivariate data in a
circular format. They can be helpful for visualizing how different
variables relate to each other within clusters.

● Parallel Coordinates: Parallel coordinate plots allow you to visualize
data with multiple numerical attributes. Each axis represents a different
variable, and each line traces one data point across the axes, showing how
data points relate to each other.

● Custom Visualizations: Depending on your specific data and objectives,
you may need to create custom visualizations. Tools like Matplotlib and
Seaborn in Python can help you generate custom charts and plots to
represent your findings effectively.

● Interactive Dashboards: Consider creating interactive dashboards
using tools like Tableau, Power BI, or Plotly. These dashboards enable
stakeholders to explore data and insights in a dynamic and user-friendly
manner.
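
The correlation analysis mentioned in the list above rests on a simple calculation, sketched here in plain Python as a hedged illustration. The `pearson` helper is written for this example, and the input is the first five rows of the example data in this document, far too few for real conclusions. The resulting nested dictionary holds the values a heatmap would display.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# First five rows of the example data in this document.
columns = {
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
}
# Correlation matrix as a nested dict: matrix[a][b] is the correlation of a and b.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
```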

INTERPRETATION:
LEVERAGING CLUSTERS FOR STRATEGIC INSIGHTS
In this phase, the goal is to make sense of the clusters
identified by your segmentation model and translate them into actionable
insights. The clusters represent different groups of customers, each with its
own unique traits and preferences.
Here's how you can interpret these clusters and use the insights to drive
business strategies:
Understanding Customer Personas:
Analyse each cluster's characteristics thoroughly. For example, you
might find that Cluster 1 is composed of young, high-income customers with a
preference for luxury products, while Cluster 2 consists of older, budget-
conscious shoppers. Understanding these personas is crucial as it allows you to
create a clear picture of who your customers are.
Tailored Marketing Strategies:
With well-defined personas, you can now create tailored marketing
strategies. For Cluster 1, your marketing campaigns could highlight exclusivity
and luxury, targeting them with high-end product offerings. In contrast, for
Cluster 2, the focus might be on affordability and deals. These tailored
approaches increase the relevance of your marketing efforts.
Product Development and Inventory Management:
Insights from customer personas can guide product development. If
Cluster 1 craves innovation, you can prioritize R&D for cutting-edge
products. Meanwhile, Cluster 2 may appreciate timeless, classic designs.
These insights can optimize your product lineup.
Customer Engagement and Retention:
Understanding customer personas helps improve engagement and
retention strategies. Cluster 3 might benefit from loyalty programs, while
Cluster 4 could respond well to personalized recommendations. By speaking
directly to their needs, you enhance their loyalty.

Performance Evaluation:
Continuously assess the performance of these strategies. Track
conversion rates, customer satisfaction, and retention. Adjust your
strategies as needed to ensure they align with the personas' evolving
preferences.
Data-Driven Decision Making:
Emphasize the importance of data-driven decision-making within your
organization. By demonstrating the impact of segmentation on your marketing
strategies and bottom line, you foster a culture of informed choices.
In summary, the interpretation phase is about converting data into strategic
wisdom. It provides the groundwork for customer-centric strategies that
enhance marketing, product development, and customer satisfaction, all while
driving business growth.

CONCLUSION:
In conclusion, our customer segmentation project is poised to provide
invaluable insights into our customer base. We've meticulously collected and
processed data from various sources, and our choice of the K-Means clustering
algorithm ensures efficient and accurate segmentation. With a well-defined
model training and testing approach and comprehensive visualization
techniques, we're on track to shape more targeted marketing campaigns and
product strategies.

What sets our project apart is its innovative integration of diverse data sources
and cutting-edge clustering algorithms, promising data-driven decision-making
that aligns with our commitment to understanding and serving our customers
better. This project document presents a clear roadmap and methodology,
with the potential to transform our business's future success.
