ADS Phase4
Team member
312821104018: Balamanikandan.M
Phase 4 Submission Document
INTRODUCTION:
Customer segmentation is a vital strategy for businesses to
understand their diverse customer base and tailor marketing efforts
effectively. Traditionally, it involves grouping customers based on
historical data. However, in today's dynamic market, this approach may
fall short in capturing evolving customer behaviours and preferences.
Content:
1. Data Collection and Integration
Data Sources: We will collect data from a variety of sources, including
customer surveys, purchase history databases, website interaction logs, and
social media engagement metrics.
Real-time Integration: Data integration will be an ongoing, real-time process to
ensure the most up-to-date customer insights.
2. Data Preprocessing
Step 1: Handle Missing Values:
First, identify any missing values in the dataset. The provided data does not
appear to contain any, which is great.
Step 2: Standardize Numerical Features:
In this case, 'Age', 'Annual Income (k$)', and 'Spending Score (1-100)' are the
numerical attributes. Standardization puts all these features on the same scale,
preventing any one feature from dominating the clustering process.
Step 3: Save the Pre-processed Data:
This gives you a clean, ready dataset for the upcoming phases of the project.
Saving it avoids reprocessing the data each time you work on a different
aspect of the project.
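The three steps above can be sketched in Python with Pandas and Scikit-Learn (two of the project's listed libraries). The column names follow the Kaggle Mall Customers file; the inline rows are illustrative stand-ins for the real CSV, and the output filename is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for pd.read_csv("Mall_Customers.csv"); rows are illustrative only.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Female"],
    "Age": [19, 21, 20, 23],
    "Annual Income (k$)": [15, 15, 16, 16],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

# Step 1: check for missing values
print(df.isna().sum())

# Step 2: standardize the numerical attributes to zero mean, unit variance
num_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Step 3: save the pre-processed data for the later phases
df.to_csv("mall_customers_scaled.csv", index=False)
```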
3. Clustering Algorithms
1.K-MEANS CLUSTERING:
Description: K-Means is a centroid-based algorithm that partitions data into K
clusters. It minimizes the sum of squared distances from data points to their
assigned cluster centroids.
Advantages: Simple to implement, computationally efficient, works well with
spherical clusters.
Considerations: Requires specifying the number of clusters (K) in advance,
sensitive to initial cluster centres.
2.HIERARCHICAL CLUSTERING:
Description: Hierarchical clustering creates a tree-like structure of clusters by
successively merging or splitting clusters based on similarity.
Advantages: Reveals hierarchical relationships, does not require specifying K in
advance.
Considerations: Computationally intensive for large datasets.
3.DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Description: DBSCAN identifies clusters as regions of high data point density,
separating noisy points as outliers.
Advantages: Can discover clusters of varying shapes, robust to noise, does not
require specifying K.
Considerations: Sensitive to hyperparameter settings.
4.AGGLOMERATIVE CLUSTERING:
Description: Agglomerative clustering is a bottom-up approach that starts with
individual data points as clusters and merges them hierarchically.
Advantages: Reveals hierarchical structure, handles different cluster shapes.
Considerations: Can be computationally expensive.
5.MEAN-SHIFT CLUSTERING:
Description: Mean-shift identifies cluster centres by moving towards the mode
of data point density.
Advantages: Can identify clusters of varying shapes and sizes, adaptive
bandwidth.
Considerations: Computationally intensive, sensitive to bandwidth.
6.GAUSSIAN MIXTURE MODELS (GMM):
Description: GMM assumes that data points are generated from a mixture of
Gaussian distributions.
Advantages: Can model overlapping clusters, provides probabilistic cluster
assignments.
Considerations: Sensitive to initialization, can converge to local optima.
7.SPECTRAL CLUSTERING:
Description: Spectral clustering leverages eigenvectors of a similarity matrix to
partition data into clusters.
Advantages: Effective for capturing complex data structures, handles non-
convex clusters.
Considerations: Requires similarity matrix computation, sensitive to kernel
choice.
8.AFFINITY PROPAGATION:
Description: Affinity propagation identifies exemplar data points and assigns
other points to the nearest exemplar.
Advantages: Identifies cluster examples automatically, works well with various
cluster shapes.
Considerations: Complexity can be high, may create many clusters.
9.OPTICS (Ordering Points To Identify the Clustering Structure):
Description: OPTICS generates a hierarchical clustering structure by ordering
data points based on reachability distance.
Advantages: Handles varying data density, reveals hierarchical relationships.
Considerations: Complex hyperparameters, sensitivity to minPts.
10.SELF-ORGANIZING MAPS (SOM):
Description: SOM is a neural network-based clustering technique that creates
a grid of nodes and assigns data points to the closest nodes.
Advantages: Reveals underlying data structure, useful for dimensionality
reduction.
Considerations: Requires tuning of network parameters, computationally
intensive.
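As a concrete instance of the first algorithm in the list (and the one adopted in the conclusion), here is a minimal K-Means sketch with scikit-learn. The two-blob synthetic data is illustrative only, not the project's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs of 2-D points
pts = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
])

# K-Means partitions the points into K=2 clusters by minimizing
# the sum of squared distances to the cluster centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.cluster_centers_)
```

Note the algorithm's main limitation from the list above in action: `n_clusters` must be chosen in advance, and `n_init` reruns the algorithm to reduce sensitivity to the initial centres.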
DATA SOURCE:
IN-HOUSE TRANSACTION DATA: Your own business systems, such as e-commerce
platforms, point-of-sale systems, or CRM databases, are excellent sources of
customer data. These systems capture details of customer purchases,
interactions, and transaction history. You can access this data directly from
your business's internal records.
APIS AND WEB SCRAPING: In some cases, you may need to access external
data sources to gain context for customer segmentation. You can use APIs to
extract data from sources like news websites, weather services, or other
relevant platforms. Web scraping techniques can be employed to collect data
from websites that do not offer APIs.
DATA SAMPLING:
We decided to work with the entire dataset, as it is manageable for initial
exploration; we would consider random sampling if the dataset were
significantly larger.
Preliminary Insights:
Initial observations suggest that our customer base is relatively evenly
distributed in terms of gender. We also need to be cautious about outliers in
the 'Annual Income' column, which could affect the accuracy of our
segmentation model. These initial insights will guide our next steps in the
project.
By applying these explanations to your project, you will have a comprehensive
Phase 4 section, demonstrating your approach to data loading and initial
exploration, including addressing data quality issues and sharing preliminary
insights.
Dataset link: https://www.kaggle.com/datasets/akram24/mall-customers
Example data:
CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
1           Male    19   15                  39
2           Male    21   15                  81
3           Female  20   16                  6
4           Female  23   16                  77
5           Female  31   17                  40
6           Female  22   17                  76
7           Female  35   18                  6
8           Female  23   18                  94
9           Male    64   19                  3
10          Female  30   19                  72
11          Male    67   19                  14
12          Female  35   19                  99
13          Female  58   20                  15
14          Female  24   20                  77
15          Male    37   20                  13
16          Male    22   20                  79
17          Female  35   21                  35
18          Male    20   21                  66
19          Male    52   23                  29
20          Female  35   23                  98
21          Male    35   24                  35
22          Male    25   24                  73
23          Female  46   25                  5
24          Male    31   25                  73
25          Female  54   28                  14
LIBRARIES USED:
1. Pandas
2. Scikit-Learn
3. Matplotlib
4. Seaborn
(The R snippets below additionally use the plotrix and purrr packages.)
PROGRAM:
# Tabulate gender counts first, since both charts use the table 'a'
a <- table(customer_data$Gender)

# 3D pie chart of the gender ratio
pct <- round(a / sum(a) * 100)
lbs <- paste(c("Female", "Male"), " ", pct, "%", sep="")
library(plotrix)
pie3D(a, labels=lbs,
      main="Pie Chart Depicting Ratio of Female and Male")

# Bar plot comparing gender counts
barplot(a, main="Using BarPlot to display Gender Comparison",
        ylab="Count",
        xlab="Gender",
        col=rainbow(2),
        legend=rownames(a))
DATA VISUALIZATION:
Data Exploration: Start by conducting exploratory data analysis (EDA)
to get a better understanding of your dataset. You can use basic
visualizations like histograms, bar charts, and scatter plots to examine
the distribution of numerical attributes, the frequency of categorical
variables, and relationships between different features.
Correlation Analysis: Create correlation matrices or heatmaps to
visualize the relationships between numerical attributes. This can help
you identify any significant correlations between variables, which might
be useful in the segmentation process.
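The correlation-analysis step can be sketched with Pandas; the few inline rows stand in for the real dataset, and the seaborn heatmap call is left as a comment since it only draws what the matrix already contains.

```python
import pandas as pd

# Stand-in rows for the numeric attributes of the Mall Customers data
df = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31, 22],
    "Annual Income (k$)": [15, 15, 16, 16, 17, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40, 76],
})

# Pairwise Pearson correlations between the numerical attributes
corr = df.corr()
print(corr.round(2))

# With seaborn installed, the same matrix renders as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```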
CODING:
library(purrr)
set.seed(123)
# Function to calculate the total intra-cluster sum of squares for a given k
iss <- function(k) {
  kmeans(customer_data[,3:5], k, iter.max=100, nstart=100,
         algorithm="Lloyd")$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
# Elbow plot: look for the k at which the curve bends
plot(k.values, iss_values,
     type="b", pch=19, frame=FALSE,
     xlab="Number of clusters K",
     ylab="Total intra-cluster sum of squares")
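The elbow plot's choice of K can be cross-checked with silhouette scores, which peak at the K giving the tightest, best-separated clusters. A sketch with scikit-learn on synthetic three-blob data (not the project's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs, so the "right" K is 3 by construction
pts = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                 for c in [(0, 0), (5, 0), (0, 5)]])

# Score each candidate K; silhouette is only defined for K >= 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pts)
    scores[k] = silhouette_score(pts, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```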
Box Plots: Box plots can provide insights into the distribution of
numerical attributes within each cluster. They show the median,
quartiles, and potential outliers in the data, helping you identify
characteristics of each cluster.
# Density plot showing the distribution of Annual Income
plot(density(customer_data$Annual.Income..k..),
     col="yellow",
     main="Density Plot for Annual Income",
     xlab="Annual Income Class",
     ylab="Density")
polygon(density(customer_data$Annual.Income..k..),
        col="#ccff66")
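The per-cluster box-plot summary described above can also be computed numerically with Pandas; the cluster labels and scores here are made-up illustrative values, and the seaborn call that would draw the actual box plot is left as a comment.

```python
import pandas as pd

# Illustrative cluster assignments and spending scores
df = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1],
    "Spending Score (1-100)": [39, 81, 77, 6, 14, 15],
})

# Quartiles per cluster: the same numbers a box plot displays
summary = (df.groupby("Cluster")["Spending Score (1-100)"]
             .quantile([0.25, 0.5, 0.75]))
print(summary)

# import seaborn as sns
# sns.boxplot(data=df, x="Cluster", y="Spending Score (1-100)")
```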
INTERPRETATION:
LEVERAGING CLUSTERS FOR STRATEGIC INSIGHTS
In this phase, the goal is to make sense of the clusters
identified by your segmentation model and translate them into actionable
insights. The clusters represent different groups of customers, each with its
own unique traits and preferences.
Here's how you can interpret these clusters and use the insights to drive
business strategies:
Understanding Customer Personas:
Analyse each cluster's characteristics thoroughly. For example, you
might find that Cluster 1 is composed of young, high-income customers with a
preference for luxury products, while Cluster 2 consists of older, budget-
conscious shoppers. Understanding these personas is crucial as it allows you to
create a clear picture of who your customers are.
Tailored Marketing Strategies:
With well-defined personas, you can now create tailored marketing
strategies. For Cluster 1, your marketing campaigns could highlight exclusivity
and luxury, targeting them with high-end product offerings. In contrast, for
Cluster 2, the focus might be on affordability and deals. These tailored
approaches increase the relevance of your marketing efforts.
Product Development and Inventory Management:
Insights from customer personas can guide product development. If
Cluster 1 craves innovation, you can prioritize R&D for cutting-edge
products. Meanwhile, Cluster 2 may appreciate timeless, classic designs.
These insights can optimize your product lineup.
Customer Engagement and Retention:
Understanding customer personas helps improve engagement and
retention strategies. Cluster 3 might benefit from loyalty programs, while
Cluster 4 could respond well to personalized recommendations. By speaking
directly to their needs, you enhance their loyalty.
Performance Evaluation:
Continuously assess the performance of these strategies. Track
conversion rates, customer satisfaction, and retention. Adjust your
strategies as needed to ensure they align with the personas' evolving
preferences.
Data-Driven Decision Making:
Emphasize the importance of data-driven decision-making within your
organization. By demonstrating the impact of segmentation on your marketing
strategies and bottom line, you foster a culture of informed choices.
In summary, the interpretation phase is about converting data into strategic
wisdom. It provides the groundwork for customer-centric strategies that
enhance marketing, product development, and customer satisfaction, all while
driving business growth.
CONCLUSION:
In conclusion, our customer segmentation project is poised to provide
invaluable insights into our customer base. We've meticulously collected and
processed data from various sources, and our choice of the K-Means clustering
algorithm ensures efficient and accurate segmentation. With a well-defined
model training and testing approach and comprehensive visualization
techniques, we're on track to shape more targeted marketing campaigns and
product strategies.
What sets our project apart is its innovative integration of diverse data sources
and cutting-edge clustering algorithms, promising data-driven decision-making
that aligns with our commitment to understanding and serving our customers
better. This project document presents a clear roadmap and methodology,
with the potential to transform our business's future success.