Segmentation Analytics
Segmentation is dividing a market into well-defined slices or groups of customers who share a similar set of needs and wants.
This definition assumes natural segmentation, i.e., distinct market segments exist and the marketer only needs to find them. Alternatively, natural segments do not exist, but there is a structure to the market units which is reproducible.
In the real world, there are no natural segments, nor is the unit-level data structured in any manner.
In such cases, segments are constructed by analyzing consumer data. This is called constructive segmentation.
• Constructive segmentation, in simple words, is grouping objects (consumers in our case) into classes / categories using certain variables
• Grouping objects using one or two variables is easy
• However, the problem gets complicated when multiple variables are used to group objects
• In practical marketing scenarios, segmentation is a multivariate problem
Segmentation Analytics could be based on Secondary or Primary Data
• Secondary Data: internally collected data from CRM or MIS, and external sources such as Census data, Retail panel data, Monthly Per Capita Expenditure, etc.
• Primary Data: collected through surveys
Steps in Primary Data-based Segmentation
• Segment the Market: Basis Variables (essential characteristics) – K-Means Clustering
• Decide the Target Segment(s): Target Variables (attractiveness criteria) – ANOVA to validate segments
• Develop Specific Solutions: Descriptor Variables (detailed segment profiling for appropriate solutions) – ANOVA, Discriminant Analysis
Some Important Considerations for Segmentation Analysis
• If there are n objects, the entire set of n could be grouped into one cluster
(mass targeting), or each individual object could be treated as a cluster
(customization)
• Neither solution is managerially or statistically feasible
• The ideal solution lies between 1 and n, say K clusters
• Each of the K clusters should show homogeneity within the cluster and
heterogeneity between the clusters
• In other words, a diverse set of customers should be clustered in such a way that customers within one group are more similar to others in the group than to those outside the group
Understanding K-Means Clustering
Consider data from 11 respondents who have provided their assessment of:
• Quality of the product
• Purchase intention for the product
[Scatter plot of the data]
What defines how different respondents are grouped? The Euclidean or Squared Euclidean distance!
Proximity Matrix
Squared Euclidean Distance
Case 1:R1 2:R2 3:R3 4:R4 5:R5 6:R6 7:R7 8:R8 9:R9 10:R10 11:R11
1:R1 0.000 1.000 4.000 64.000 50.000 37.000 53.000 37.000 85.000 85.000 89.000
2:R2 1.000 0.000 1.000 49.000 37.000 26.000 50.000 36.000 74.000 72.000 74.000
3:R3 4.000 1.000 0.000 36.000 26.000 17.000 49.000 37.000 65.000 61.000 61.000
4:R4 64.000 49.000 36.000 0.000 2.000 5.000 85.000 85.000 53.000 37.000 25.000
5:R5 50.000 37.000 26.000 2.000 0.000 1.000 61.000 61.000 37.000 25.000 17.000
6:R6 37.000 26.000 17.000 5.000 1.000 0.000 52.000 50.000 36.000 26.000 20.000
7:R7 53.000 50.000 49.000 85.000 61.000 52.000 0.000 2.000 16.000 26.000 40.000
8:R8 37.000 36.000 37.000 85.000 61.000 50.000 2.000 0.000 26.000 36.000 50.000
9:R9 85.000 74.000 65.000 53.000 37.000 36.000 16.000 26.000 0.000 2.000 8.000
10:R10 85.000 72.000 61.000 37.000 25.000 26.000 26.000 36.000 2.000 0.000 2.000
11:R11 89.000 74.000 61.000 25.000 17.000 20.000 40.000 50.000 8.000 2.000 0.000
This is a dissimilarity matrix
The Proximity Matrix indicates that there are four “groups” or “clusters” of
respondents.
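As a side sketch (NumPy is an assumption here; the notes themselves work through SPSS), the matrix above can be reproduced from the (QLTY, PURINT) ratings of R1–R11 that appear later in these notes:

```python
import numpy as np

# (QLTY, PURINT) ratings of respondents R1..R11, taken from the table further below
X = np.array([
    [9, 9], [9, 8], [9, 7],   # R1, R2, R3
    [9, 1], [8, 2], [8, 3],   # R4, R5, R6
    [2, 7], [3, 8],           # R7, R8
    [2, 3], [3, 2], [4, 1],   # R9, R10, R11
])

# Squared Euclidean distance between every pair of respondents:
# d(i, j) = (QLTY_i - QLTY_j)^2 + (PURINT_i - PURINT_j)^2
diff = X[:, None, :] - X[None, :, :]
proximity = (diff ** 2).sum(axis=-1)

print(proximity[0, 3])   # 64, the R1-R4 entry of the matrix above
```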
• The K-Means algorithm randomly picks 4 Initial Cluster Centers, or Initial Seed Points.
• Assume it picks the seeds R1, R6, R8 and R11. The algorithm will then consider only these four columns of the proximity matrix.
• Based on distance, each observation will be assigned to the nearest Initial Cluster Center.

Squared Euclidean Distance to the initial seeds
Case     1:R1   6:R6   8:R8   11:R11
1:R1        0     37     37       89
2:R2        1     26     36       74
3:R3        4     17     37       61
4:R4       64      5     85       25
5:R5       50      1     61       17
6:R6       37      0     50       20
7:R7       53     52      2       40
8:R8       37     50      0       50
9:R9       85     36     26        8
10:R10     85     26     36        2
11:R11     89     20     50        0
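Continuing the NumPy sketch from above, the assignment of each respondent to the nearest seed can be read off those four columns (the variable names are my own):

```python
# Keep only the four seed columns (R1, R6, R8, R11) and assign each
# respondent to the closest seed.
seeds = [0, 5, 7, 10]                      # row indices of R1, R6, R8, R11 in X
dist_to_seeds = proximity[:, seeds]        # the four columns shown above
assignment = dist_to_seeds.argmin(axis=1)  # 0 = R1's group, 1 = R6's, 2 = R8's, 3 = R11's
print(assignment)                          # [0 0 0 1 1 1 2 2 3 3 3]
```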
• Since each Initial Cluster Center is no longer a single observation but a group of observations, the groups can no longer be represented by the coordinates of the initial seeds.
• The algorithm instead represents the groups by their CLUSTER CENTROIDS – C1, C2, C3 and C4.

Case    QLTY      PURINT
R1      9         9
R2      9         8
R3      9         7
C1      9         8
R4      9         1
R5      8         2
R6      8         3
C2      8.33333   2
R7      2         7
R8      3         8
C3      2.5       7.5
R9      2         3
R10     3         2
R11     4         1
C4      3         2
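Continuing the same sketch, the centroids C1–C4 are simply the mean ratings of each group's members:

```python
# Each group is now represented by its centroid: the mean (QLTY, PURINT)
# of the respondents assigned to it.
centroids = np.array([X[assignment == k].mean(axis=0) for k in range(4)])
print(centroids)   # C1 = (9, 8), C2 = (8.33, 2), C3 = (2.5, 7.5), C4 = (3, 2)
```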
The algorithm recomputes the distances, with C1, C2, C3 and C4 as the new reference points.

Proximity Matrix – Squared Euclidean Distance to the centroids
Case     12:C1    13:C2    14:C3    15:C4
1:R1      1.000   49.444   44.500   85.000
2:R2      0.000   36.444   42.500   72.000
3:R3      1.000   25.444   42.500   61.000
4:R4     49.000    1.444   84.500   37.000
5:R5     37.000    0.111   60.500   25.000
6:R6     26.000    1.111   50.500   26.000
7:R7     50.000   65.111    0.500   26.000
8:R8     36.000   64.444    0.500   36.000
9:R9     74.000   41.111   20.500    2.000
10:R10   72.000   28.444   30.500    0.000
11:R11   74.000   19.778   44.500    2.000
This is a dissimilarity matrix

Are there any changes in cluster membership? NO
Will the cluster centroid change? NO
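The convergence check above can be expressed as a single comparison (still continuing the NumPy sketch):

```python
# Recompute squared distances to the centroids and check whether any
# observation changes its cluster membership.
dist_to_centroids = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
new_assignment = dist_to_centroids.argmin(axis=1)
print(np.array_equal(new_assignment, assignment))   # True -> membership unchanged, so converged
```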
• The K-Means Clustering Algorithm is an iterative process.
• Analysts need to specify the number of clusters (K) that they wish to form.
• The algorithm picks K seeds randomly (in our illustration, it was 4 seeds).
• Every run of the algorithm has three steps:
  • Computing the distance between the seeds and the individual observations
  • Assigning each observation to the closest seed
  • Computing the cluster centroids
• The algorithm iterates this process till it reaches CONVERGENCE – no change in cluster centers (a library-based sketch of the full process follows below).
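For completeness, a minimal scikit-learn version of this loop is sketched here (scikit-learn is an assumption; the notes use SPSS, and X is the rating array from the earlier sketch):

```python
from sklearn.cluster import KMeans

# Fit K-Means with K = 4 on the (QLTY, PURINT) data; the library repeats the
# assign-and-recompute loop internally until convergence.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster membership of R1..R11
print(km.cluster_centers_)  # final cluster centroids
```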
Adopting a Data-centric Approach for Segmentation
Segmentation Analytics using SPSS
• Basis Variables (essential characteristics): variables that help to segment the market based on consumer needs, concerns, wants, preferences – K-Means Clustering
• Target Variables (attractiveness criteria): variables to validate the segments – ANOVA to validate segments
• Descriptor Variables (detailed segment profiling for appropriate solutions): variables that help to describe the segments – ANOVA, Discriminant Analysis
Shopping Mall Data
Variable Name Variable Label Type of Variable
V1 Buying from these big stores intimidates me
V2 Buying from these stores intimidates me
V3 I combine buying with other jobs
V4 I always buy the best combination of price and quality
V5 I don’t care where I am buying from
V6 I do compare prices to save money
Income Income in Rs.
Store_visit Frequency of visiting big stores
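As a hedged illustration of how this dataset could be clustered outside SPSS, the sketch below assumes the Shopping Mall data have been exported to a CSV file (the file name, the choice of three clusters, and the standardization step are all assumptions, not part of these notes); V1–V6 serve as the basis variables and Income and Store_visit as descriptors:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

mall = pd.read_csv("shopping_mall.csv")              # hypothetical export of the dataset
basis_vars = ["V1", "V2", "V3", "V4", "V5", "V6"]    # basis variables from the table above

# Standardize the basis variables, then form segments with K-Means (K = 3 is illustrative)
scaled = StandardScaler().fit_transform(mall[basis_vars])
mall["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# Profile the segments on the descriptor variables
print(mall.groupby("segment")[["Income", "Store_visit"]].mean())
```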
Deciding the number of Segments – Variance Ratio Criterion
• Data-based determination of the appropriate number of segments is considered more rigorous than heuristics
• Clusters are supposed to have greater between-cluster variability than within-cluster variability
• Therefore, a clustering solution that maximizes between-cluster variability relative to within-cluster variability is considered the best
• The Variance Ratio Criterion (VRC) is used to determine the appropriate number of clusters in cluster analysis
• ANOVA provides the F-value, the ratio of the mean sum of squares between clusters to the mean sum of squares within clusters, which is nothing but the VRC: VRC = F = MS(between clusters) / MS(within clusters)
• The F-values thus obtained for each clustering variable are summed to get the Pooled Variance Ratio for a cluster solution
• Calculate the Pooled Variance Ratio for each candidate cluster solution
• Calculate the difference in Pooled Variance Ratio as you move from one cluster solution to the next (the ω statistic)
• The cluster solution where the omega (ω) statistic is the lowest gives the appropriate number of clusters
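A rough sketch of the pooled VRC and ω computation is given below; the second-difference form of ω, the library choices, and the placeholder data are assumptions layered on top of these notes:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

def pooled_vrc(data, k, random_state=0):
    """Sum of one-way ANOVA F-values over the clustering variables for a K-cluster solution."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(data)
    f_total = 0.0
    for j in range(data.shape[1]):                       # one ANOVA per basis variable
        groups = [data[labels == c, j] for c in range(k)]
        f_value, _ = stats.f_oneway(*groups)
        f_total += f_value
    return f_total

rng = np.random.default_rng(0)
basis = rng.normal(size=(200, 6))                        # placeholder for six basis variables

vrc = {k: pooled_vrc(basis, k) for k in range(2, 7)}     # pooled VRC for K = 2..6
# omega_k = (VRC_{k+1} - VRC_k) - (VRC_k - VRC_{k-1}); the K with the lowest
# omega is taken as the appropriate number of clusters
omega = {k: (vrc[k + 1] - vrc[k]) - (vrc[k] - vrc[k - 1]) for k in range(3, 6)}
print(min(omega, key=omega.get))
```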