Cluster Analysis: Classification Analysis, or Numerical Taxonomy
Cluster Analysis

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis or numerical taxonomy. Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership of each object or case included, in order to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership of any of the objects. Groups or clusters are suggested by the data, not defined a priori.
Advantages of cluster analysis: Cluster analysis is used in marketing for several purposes:
- Segmenting the market: Consumers may be clustered or grouped on the basis of the benefits derived from the purchase of a product. Each cluster would consist of consumers who are relatively homogeneous in terms of the benefits they seek.
- Understanding buyer behavior: Cluster analysis identifies homogeneous groups of buyers. The buying behavior of each group can then be examined to develop a suitable marketing strategy.
- Identifying new product opportunities: In a competitive environment, clustering brands and products can reveal potential new product opportunities.
- Selecting test markets: Cities or regions can be grouped into homogeneous clusters so that different marketing strategies can be tested.
- Reducing data: Cluster analysis condenses individual observations into a smaller, more manageable set of groups; other multivariate techniques, such as multiple discriminant analysis, can then be applied to the clusters to describe differences in consumers' product usage behavior.
Statistics Associated with Cluster Analysis

Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
Cluster centroid. The cluster centroid is the mean value of the variables for all the cases or objects in a particular cluster.
Cluster centers. The cluster centers are the initial starting points in nonhierarchical clustering. Clusters are built around these centers, or seeds.
Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.
Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right.
[Figures: scatter plots of objects on Variable 1 versus Variable 2, illustrating clustering situations]
Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.
Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangle matrix containing pairwise distances between objects or cases.
Conducting cluster analysis involves the following steps: formulate the problem, select a distance measure, select a clustering procedure, decide on the number of clusters, interpret and profile the clusters, and assess the validity of clustering.
Illustration: An internet cafe company wants to know attitudes towards internet surfing. With the help of its marketing research team, the company identified six attitude variables. Twenty respondents were asked to express their degree of agreement with the following statements on a seven-point scale (1 = disagree, 7 = agree). Conduct a cluster analysis using SPSS to identify homogeneous customer groups, on the basis of which a suitable marketing strategy can be adopted by the company.

V1 - Internet surfing is fun
V2 - Surfing is bad for your budget
V3 - I combine surfing with music and games
V4 - I try to get the best information I want while surfing
V5 - I don't waste time in surfing
V6 - You can get a lot of information from various sources

The data obtained are shown below and are given as input to the SPSS software, selecting the Cluster Analysis option.

Attitudinal Data for Clustering
[Table: ratings of the 20 respondents on V1-V6; only the V5 column survived extraction: 2 5 1 3 6 3 3 1 6 4 5 2 4 4 1 4 2 4 2 7]
Formulate the Problem
Perhaps the most important part of formulating the clustering problem is selecting the variables on which the clustering is based. Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution. Basically, the set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem. The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested. In exploratory research, the researcher should exercise judgment and intuition.
Select a Distance or Similarity Measure
The most commonly used measure of similarity is the Euclidean distance or its square. The Euclidean distance is the square root of the sum of the squared differences in values for each variable. Other distance measures are also available. The city-block or Manhattan distance between two objects is the sum of the absolute differences in values for each variable. The Chebychev distance between two objects is the maximum absolute difference in values for any variable.
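As a sketch, the three measures can be computed in a few lines of Python (the function names and the vectors a and b are made-up examples, not data from this study):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared differences per variable
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # city-block: sum of absolute differences per variable
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # maximum absolute difference over any variable
    return max(abs(a - b) for a, b in zip(x, y))

a, b = [1, 2, 3], [4, 6, 3]
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7
print(chebyshev(a, b))  # 4
```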
If the variables are measured in vastly different units, the clustering solution will be influenced by the units of measurement. In these cases, before clustering respondents, we must standardize the data by rescaling each variable to have a mean of zero and a standard deviation of one. It is also desirable to eliminate outliers (cases with atypical values). Use of different distance measures may lead to different clustering results. Hence, it is advisable to use different measures and compare the results.
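A minimal standardization sketch in Python (the `standardize` helper and the sample rows are illustrative; this is the same z-score rescaling SPSS offers when saving standardized values):

```python
import math

def standardize(data):
    """Rescale each variable (column) to mean 0 and standard deviation 1."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    # population standard deviation per column
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]

# Hypothetical data: income in dollars would dwarf age in years until standardized
raw = [[25, 20000], [35, 50000], [45, 80000]]
z = standardize(raw)
```

After standardization both columns contribute comparably to any distance computation, regardless of their original units.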
A Classification of Clustering Procedures:
- Hierarchical: agglomerative (linkage methods - single, complete, and average linkage; variance methods - Ward's procedure; centroid methods) and divisive
- Nonhierarchical: sequential threshold, parallel threshold, and optimizing partitioning
Select a Clustering Procedure - Hierarchical
Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be agglomerative or divisive. Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process continues until all objects are members of a single cluster. Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster. Agglomerative methods are commonly used in marketing research. They consist of linkage methods, error sums of squares or variance methods, and centroid methods.
The single linkage method is based on minimum distance, or the nearest-neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points. The complete linkage method is similar to single linkage, except that it is based on the maximum distance, or the furthest-neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points. The average linkage method works similarly; however, in this method the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is taken from each cluster.
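The three linkage rules differ only in how the distance between two clusters is computed from the pairwise object distances. A naive, illustrative Python sketch (function names are our own; real packages use far more efficient algorithms):

```python
def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters under a given linkage rule."""
    pair_d = [dist(x, y) for x in c1 for y in c2]
    if linkage == "single":      # nearest neighbor: closest pair of points
        return min(pair_d)
    if linkage == "complete":    # furthest neighbor: furthest pair of points
        return max(pair_d)
    return sum(pair_d) / len(pair_d)  # average linkage: mean over all pairs

def agglomerate(points, k, dist, linkage="single"):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]          # start with each object alone
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], dist, linkage),
        )
        clusters[i].extend(clusters.pop(j))   # merge the closest pair
    return clusters

# One-dimensional made-up scores, absolute difference as the distance
groups = agglomerate([1.0, 1.2, 5.0, 5.3, 9.0], 2, lambda a, b: abs(a - b))
```

Run to completion (k = 1), the successive merges would reproduce an agglomeration schedule like the one shown later for the SPSS output.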
[Figure: average linkage - the distance between Cluster 1 and Cluster 2 is the average distance between all pairs of objects, one from each cluster]
Select a Clustering Procedure - Variance Methods
The variance methods attempt to generate clusters that minimize the within-cluster variance. A commonly used variance method is Ward's procedure. For each cluster, the means of all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated. These distances are summed over all the objects. At each stage, the two clusters whose merger yields the smallest increase in the overall sum of squared within-cluster distances are combined. In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables). Every time objects are grouped, a new centroid is computed. Of the hierarchical methods, average linkage and Ward's method have been shown to perform better than the other procedures.
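Ward's merging criterion can be illustrated directly: the cost of a merge is the increase in the total within-cluster sum of squares. A small sketch (helper names and example points are hypothetical):

```python
def sse(cluster):
    """Sum of squared Euclidean distances from each object to the cluster mean."""
    n = len(cluster)
    centroid = [sum(col) / n for col in zip(*cluster)]
    return sum(sum((v - m) ** 2 for v, m in zip(p, centroid)) for p in cluster)

def ward_cost(c1, c2):
    """Increase in total within-cluster sum of squares if c1 and c2 were merged."""
    return sse(c1 + c2) - sse(c1) - sse(c2)

# Merging two far-apart clusters costs much more than merging two nearby ones,
# so Ward's procedure merges the nearby pair first
far = ward_cost([(0, 0), (0, 2)], [(10, 0), (10, 2)])   # 100.0
near = ward_cost([(0, 0), (0, 2)], [(1, 0), (1, 2)])    # 1.0
```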
Select a Clustering Procedure - Nonhierarchical
The nonhierarchical clustering methods are frequently referred to as k-means clustering. These methods include sequential threshold, parallel threshold, and optimizing partitioning. In the sequential threshold method, a cluster center is selected and all objects within a prespecified threshold value of the center are grouped together. Then a new cluster center, or seed, is selected, and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is no longer considered for clustering with subsequent seeds. The parallel threshold method operates similarly, except that several cluster centers are selected simultaneously and objects within the threshold level are grouped with the nearest center. The optimizing partitioning method differs from the two threshold procedures in that objects can later be reassigned to clusters to optimize an overall criterion, such as the average within-cluster distance for a given number of clusters.
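The optimizing-partitioning idea is essentially Lloyd's k-means algorithm: assign each object to its nearest center, recompute the centers, and repeat. A minimal sketch (the points and starting centers are made up):

```python
def kmeans(points, centers, iters=10):
    """k-means: reassign objects to the nearest center, then recompute centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[nearest].append(p)
        # new center = mean of its group (keep the old center if the group is empty)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else list(c)
                   for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = kmeans([(1, 1), (2, 1), (8, 8), (9, 9)], [(0, 0), (10, 10)])
```

Unlike the two threshold procedures, every object is reconsidered on each pass, so early assignments can be corrected.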
It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters and the cluster centroids so obtained are then used as inputs to the optimizing partitioning method. The choice of a clustering method and the choice of a distance measure are interrelated. For example, squared Euclidean distances should be used with Ward's and the centroid methods. Several nonhierarchical procedures also use squared Euclidean distances.
Agglomeration Schedule

Stage   Clusters Combined        Coefficient
        Cluster 1   Cluster 2
1       14          16           1.000000
2       6           7            2.000000
3       2           13           3.500000
4       5           11           5.000000
5       3           8            6.500000
6       10          14           8.160000
7       6           12           10.166667
8       9           20           13.000000
9       4           10           15.583000
10      1           6            18.500000
11      5           9            23.000000
12      4           19           27.750000
13      1           17           33.100000
14      1           15           41.333000
15      2           5            51.833000
16      1           3            64.500000
17      4           18           79.667000
18      2           4            172.662000
19      1           2            328.600000

[The "Stage Cluster First Appears" and "Next Stage" columns of the SPSS output were too garbled to reconstruct and are omitted.]
Decide on the Number of Clusters
Theoretical, conceptual, or practical considerations may suggest a certain number of clusters. In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram. In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow or a sharp bend occurs indicates an appropriate number of clusters. The relative sizes of the clusters should be meaningful.
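The elbow criterion can be illustrated by computing the within-cluster sum of squares for partitions of increasing k (the data and partitions here are invented for illustration):

```python
def within_ss(clusters):
    """Total within-cluster sum of squares for a given partition."""
    total = 0.0
    for c in clusters:
        centroid = [sum(col) / len(c) for col in zip(*c)]
        total += sum(sum((v - m) ** 2 for v, m in zip(p, centroid)) for p in c)
    return total

# Made-up data with two obvious groups: within-SS drops sharply when k goes
# from 1 to 2, then barely improves at k = 3 -- the elbow is at k = 2
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
k1 = within_ss([data])
k2 = within_ss([data[:3], data[3:]])
k3 = within_ss([data[:3], data[3:5], data[5:]])
```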
Interpreting and Profiling the Clusters
Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label. It is often helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.
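Computing the centroids used for profiling is straightforward; this sketch (function name is our own) averages each variable within each cluster, and the same function can profile the clusters on variables that were not used for clustering:

```python
def centroids(data, membership, k):
    """Mean value of each variable within each of k clusters --
    the basis for naming and profiling the clusters."""
    sums = [[0.0] * len(data[0]) for _ in range(k)]
    counts = [0] * k
    for row, c in zip(data, membership):
        counts[c] += 1
        sums[c] = [s + v for s, v in zip(sums[c], row)]
    return [[s / counts[c] for s in sums[c]] for c in range(k)]

# Toy example: three cases on two variables, assigned to two clusters
cents = centroids([[7, 1], [5, 3], [2, 6]], [0, 0, 1], 2)  # [[6.0, 2.0], [2.0, 6.0]]
```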
Cluster Centroids

Cluster   V1      V2      V3      V4      V5      V6
1         5.750   3.625   6.000   3.125   1.750   3.875
2         1.667   3.000   1.833   3.500   5.500   3.333
3         3.500   5.833   3.333   6.000   3.500   6.000
The above table gives the centroid, or mean, values for each cluster. Cluster 1 has relatively high values on the variables V1 (Internet surfing is fun) and V3 (I combine surfing with music and games), and a low value on V5 (I don't waste time in surfing). Hence Cluster 1 can be labeled "surf loving and concentrated"; this cluster consists of cases 1, 3, 6, 7, 8, 12, 15, and 17. Cluster 2 is just the opposite, with low values on V1 and V3 and a high value on V5, and can be labeled "apathetic surfers"; it consists of cases 2, 5, 9, 11, 13, and 20. Cluster 3 has high values on V2 (Surfing is bad for your budget), V4 (I try to get the best information I want while surfing), and V6 (You can get a lot of information from various sources). Thus this cluster can be labeled "economical surfers"; it consists of cases 4, 10, 14, 16, 18, and 19.
Assess Reliability and Validity
- Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions.
- Use different methods of clustering and compare the results.
- Split the data randomly into halves. Perform clustering separately on each half. Compare the cluster centroids across the two subsamples.
- Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables.
- In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orders of cases until the solution stabilizes.
Results of Nonhierarchical Clustering

Initial Cluster Centers

        Cluster 1   Cluster 2   Cluster 3
V1      4           2           7
V2      6           3           2
V3      3           2           6
V4      7           4           4
V5      2           7           1
V6      7           2           3

Iteration History

[Table: change in cluster centers at each iteration; only the column for cluster 3 survived extraction - iteration 1: 2.550, iteration 2: 0.000]
Convergence was achieved due to no or small change in cluster centers. The maximum distance by which any center has changed is 0.000. The current iteration is 2. The minimum distance between initial centers is 7.746.
Cluster Membership

Case    Cluster   Distance
1       3         1.414
2       2         1.323
3       3         2.550
4       1         1.404
5       2         1.848
6       3         1.225
7       3         1.500
8       3         2.121
9       2         1.756
10      1         1.143
11      2         1.041
12      3         1.581
13      2         2.598
14      1         1.404
15      3         2.828
16      1         1.624
17      3         2.598
18      1         3.555

[The rows for cases 19 and 20 were lost in extraction.]
Distances between Final Cluster Centers

           Cluster 1   Cluster 2   Cluster 3
Cluster 1  -
Cluster 2  5.568       -
Cluster 3  5.698       6.928       -
It is interesting to note that the clusters identified by the hierarchical method are the same as those identified by the nonhierarchical method, except for a change in the order of the cluster labels. The distances between the final cluster centers indicate that the pairs of clusters are well separated. Hierarchical clustering includes methods such as single linkage, complete linkage, and average linkage. We need not specify in advance how many clusters are to be extracted; the software provides a range of solutions, from a 1-cluster solution to an n-cluster solution. In nonhierarchical clustering, by contrast, you have to specify in advance how many clusters are required. The specified number of seeds, and the points closest to them, are used to form the initial clusters; through iterative reassignment, the final k clusters are determined by the package.