
Hierarchical Cluster Analysis: Comparison of Three Linkage Measures and Application to Psychological Data

Odilia Yim a, Kylee T. Ramdeen a, b, c
a School of Psychology, University of Ottawa
b Laboratoire de Psychologie et Neurocognition, Université de Savoie
c Centre National de la Recherche Scientifique, Laboratoire de Psychologie et Neurocognition, Unité Mixte de Recherche 5105, Grenoble, France

Abstract: Cluster analysis refers to a class of data reduction methods used for sorting cases, observations, or variables of a given dataset into homogeneous groups that differ from each other. The present paper focuses on hierarchical agglomerative cluster analysis, a statistical technique in which groups are sequentially created by systematically merging similar clusters together, as dictated by the distance and linkage measures chosen by the researcher. Specific distance and linkage measures are reviewed, including a discussion of how these choices can influence the clustering process by comparing three common linkage measures (single linkage, complete linkage, average linkage). The tutorial guides researchers in performing a hierarchical cluster analysis using the SPSS statistical software. Through an example, we demonstrate how cluster analysis can be used to detect meaningful subgroups in a sample of bilinguals by examining various language variables.

Keywords: Cluster analysis; hierarchical cluster analysis; agglomerative linkage; SPSS

Corresponding author: [email protected]

Introduction

In everyday life, we try to sort similar items together and classify them into different groups, a natural and fundamental way of creating order among chaos. Among many scientific disciplines, it is also essential to uncover similarities within data to construct meaningful groups. The purpose of cluster analysis is to discover a system of organizing observations where members of the group share specific properties in common. Cluster analysis is a class of techniques that classifies cases into groups that are relatively homogeneous within themselves and relatively heterogeneous between each other (Landau & Chis Ster, 2010; Norusis, 2010). Cluster analysis has a simple goal of grouping cases into homogeneous clusters, yet the choice in algorithms and measures that dictates the successive merging of similar cases into different clusters makes it a complex process. Although an appealing technique, cluster solutions can be easily misinterpreted if the researcher does not fully understand the procedures of cluster analysis. Most importantly, one must keep in mind that cases will always be grouped into clusters regardless of the true nature of the data. Therefore, the present paper aims to provide researchers a background to hierarchical cluster analysis and a tutorial in SPSS using an example from psychology.

Cluster analysis is a type of data reduction technique. Data reduction analyses, which also include factor analysis and discriminant analysis, essentially reduce data. They do not analyze group differences based on independent and dependent variables. For example, factor analysis reduces the number of factors or variables within a model and discriminant analysis classifies new cases into groups that have been previously identified based on specific criteria. Cluster analysis is unique among these techniques because its goal is to reduce the number of cases or observations¹ by classifying them into homogeneous clusters, identifying groups without previously knowing group membership or the number of possible groups. Cluster analysis also allows for many options regarding the algorithm for combining groups, with each choice resulting in a different grouping structure. Therefore, cluster analysis can be a convenient statistical tool for exploring underlying structures in various kinds of datasets.

1 The present paper focuses only on the grouping of cases or observations, but cluster analysis can also be used to reduce the number of variables in a dataset.

Cluster analysis was initially used within the disciplines of biology and ecology (Sokal & Sneath, 1963). Although this technique has been employed in the social sciences, it has not gained the same widespread popularity as in the natural sciences. A general interest in cluster analysis increased in the 1960s, resulting in the development of several new algorithms that expanded possibilities of analysis. It was during this period that researchers began utilizing various innovative tools in their statistical analyses to uncover underlying structures in datasets. Within a decade, the growth of cluster analysis and its algorithms reached a high point. By the 1970s, the focus shifted to integrating multiple algorithms to form a cohesive clustering protocol (Wilmink & Uytterschaut, 1984). In recent decades, there has been a gradual incorporation of cluster analysis into other areas, such as the health and social sciences. However, the use of cluster analysis within the field of psychology continues to be infrequent (Borgen & Barnett, 1987).

The general technique of cluster analysis will first be described to provide a framework for understanding hierarchical cluster analysis, a specific type of clustering. The multiple parameters that must be specified prior to performing hierarchical clustering will be examined in detail. A particular focus will be placed on the relative impact of three common linkage measures. The second part of this paper will illustrate how to perform a hierarchical cluster analysis in SPSS by applying the technique to differentiate subgroups within a group of bilinguals. This paper will discuss the statistical implications of hierarchical clustering and how to select the appropriate parameters in SPSS to allow researchers to uncover the grouping structure that most accurately describes their multivariate dataset.

Hierarchical Cluster Analysis

Due to the scarcity of psychological research employing the general technique of cluster analysis, researchers may not fully understand the utility of cluster analysis and the application of the clustering technique to their data. There are two main methods: hierarchical and non-hierarchical cluster analysis. Hierarchical clustering combines cases into homogeneous clusters by merging them together one at a time in a series of sequential steps (Blei & Lafferty, 2009). Non-hierarchical techniques (e.g., k-means clustering) first establish an initial set of cluster means and then assign each case to the closest cluster mean (Morissette & Chartier, 2013). The present paper focuses on hierarchical clustering, though both clustering methods have the same goal of increasing within-group homogeneity and between-groups heterogeneity. At each step in the hierarchical procedure, either a new cluster is formed or one case joins a previously grouped cluster. Each step is irreversible, meaning that cases cannot be subsequently reassigned to a different cluster. This makes the initial clustering steps highly influential because the first clusters generated will be compared to all of the remaining cases. The alternate method of non-hierarchical clustering requires the researcher to establish a priori the number of clusters in the final solution. If there is uncertainty about the total number of clusters in the dataset, the analysis must be re-run for each possible solution. In this situation, hierarchical clustering is preferred as it inherently allows one to compare the clustering result with an increasing number of clusters; no decision about the final number of clusters needs to be made a priori.

Hierarchical cluster analysis can be conceptualized as being agglomerative or divisive. Agglomerative hierarchical clustering separates each case into its own individual cluster in the first step so that the initial number of clusters equals the total number of cases (Norusis, 2010). At successive steps, similar cases (or clusters) are merged together, as described above, until every case is grouped into one single cluster. Divisive hierarchical clustering works in the reverse manner, with every case starting in one large cluster and gradually being separated into groups of clusters until each case is in an individual cluster. This latter technique, divisive clustering, is rarely utilized because of its heavy computational load (for a discussion on divisive methods, see Wilmink & Uytterschaut, 1984). The focus of the present paper is on the method of hierarchical agglomerative cluster analysis, and this method is defined by two choices: the measurement of distance between cases and the type of linkage between clusters (Bratchell, 1989).
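To make the sequential, irreversible merging described above concrete, the following sketch implements a bare-bones agglomerative procedure in Python. It is not the authors' SPSS workflow, only a minimal illustration on invented toy data; the distance and linkage choices it hard-codes (squared Euclidean distance, single linkage) are discussed in the next two sections.

import numpy as np

def sq_euclidean(a, b):
    # Squared Euclidean distance between two cases (rows of variable scores).
    return float(np.sum((a - b) ** 2))

def naive_agglomerative(data):
    # Merge the two closest clusters one step at a time (single linkage).
    # Returns the merge history; each step is irreversible, as in the text.
    clusters = [[i] for i in range(len(data))]   # start: one case per cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: smallest pairwise distance between the clusters.
                d = min(sq_euclidean(data[p], data[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return history

# Toy data: five cases measured on two variables (purely illustrative).
toy = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])
for stage, (c1, c2, d) in enumerate(naive_agglomerative(toy), start=1):
    print(f"Stage {stage}: merge {c1} and {c2} at distance {d:.3f}")

Real implementations avoid recomputing every pairwise distance from scratch at each stage, but the loop mirrors what SPSS reports in its agglomeration schedule: one merge per stage until a single cluster remains.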

Distance Measure

The definition of cluster analysis states it is a technique used for the identification of homogeneous subgroups. Therefore, cluster analysis is inherently linked to the concept of similarity. The first step a researcher must take is to determine the statistic that will be used to calculate the distance or similarity between cases. Both measures may be thought to mirror one another; as the distance between two cases decreases, their similarity should respectively increase. However, an important distinction must be made: whereas both measures reflect the pattern of scores of the chosen variables, only the distance measure takes into account the elevation of those scores (Clatworthy, Buick, Hankins, Weinman, & Horne, 2005). For example, if we wish to separate bilinguals who switch between their two languages frequently from those who do not switch languages often, the difference in the actual scores on multiple language measures must be taken into account. In this case, a distance measure must be selected. However, if we wish to assess the efficacy of a language intervention program, then the actual language scores may not be of importance. In this case, we would be assessing the pattern of language scores over time (i.e., from before to after intervention) to identify the clusters of people that improved, worsened, or did not change their language skills after intervention. In this situation, a similarity measure such as the Pearson correlation would be sufficient to assess the pattern of scores before and after intervention while ignoring the raw language scores. An added difficulty of using a correlation coefficient is that it is easy to interpret when there are only one or two variables, but as the number of variables increases the interpretation becomes unclear. It is for these reasons that distance measures are more commonly used in cluster analysis because they allow for an assessment of both the pattern and elevation of the scores in question.

Of course, there is not only one statistic that can be used as a distance measure in cluster analysis. The choice of the distance measure will depend primarily on whether the variables are continuous or dichotomous in nature. Many chapters on cluster analysis simply overlook this question and discuss measures applicable to continuous variables only. Although this paper will focus on applying cluster analysis to continuous data, it is important to note that at least four measures exist for calculating distance with dichotomous data (see Finch, 2005).

The most commonly used distance measure for continuous variables is the squared Euclidean distance, $d_{ab}^{2} = \sum_{j=1}^{k} (a_j - b_j)^2$. In the equation, a and b refer to the two cases being compared on the j-th variable, where k is the total number of variables included in the analysis (Blei & Lafferty, 2009). This algorithm allows for the distance between two cases to be calculated across all variables and reflected in a single distance value. At each step in the procedure, the squared Euclidean distance between all pairs of cases and clusters is calculated and shown in a proximity matrix (discussed below). At each step, the pair of cases or clusters with the smallest squared Euclidean distance will be joined with one another. This makes hierarchical clustering a lengthy process because after each step, the full proximity matrix must once again be recalculated to take into account the recently joined cluster. The squared Euclidean distance calculation is straightforward when there is only one case per cluster. However, an additional decision must be made as to how best to calculate the squared Euclidean distance when there is more than one case per cluster. This is referred to as the linkage measure, and the researcher must determine how to best calculate the link between two clusters.
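As an illustration of the distance measure just defined, the short Python sketch below computes squared Euclidean distances between a few cases and arranges them as a proximity matrix like the one SPSS prints. The data values and the number of variables are invented for the example; scipy's pdist and squareform are used only as a convenient stand-in for the calculation described in the text.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four hypothetical cases measured on k = 3 variables (invented values).
cases = np.array([
    [0.90, 0.85, 0.70],
    [0.95, 0.80, 0.75],
    [0.40, 0.30, 0.95],
    [0.35, 0.35, 0.90],
])

# d^2(a, b) = sum over the k variables of (a_j - b_j)^2
condensed = pdist(cases, metric="sqeuclidean")

# Square, symmetric proximity matrix with zeros on the diagonal,
# analogous to the Proximity Matrix table in the SPSS output.
proximity = squareform(condensed)
print(np.round(proximity, 3))

The smallest off-diagonal entry identifies the first pair of cases to be merged; after every merge the matrix is, conceptually, recomputed with the new cluster standing in for its members.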

Linkage Measure

The problem that arises when a cluster contains more than one case is that the squared Euclidean distance can only be calculated between a pair of scores at a time and cannot take into account three or more scores simultaneously. In line with the proximity matrix, the goal is still to calculate the difference in scores between pairs of clusters; however, in this case the clusters do not contain one single value per variable. This suggests that one must find the best way to calculate an accurate distance measure between pairs of clusters for each variable when one or both of the clusters contains more than one case. Once again, the goal is to find the two clusters that are nearest to each other in order to merge them together. There exist many different linkage measures that define the distance between pairs of clusters in their own way. Some measures define the distance between two clusters based on the smallest or largest distance that can be found between pairs of cases (single and complete linkage, respectively) in which each case is from a different cluster (Mazzocchi, 2008). Average linkage averages all distance values between pairs of cases from different clusters. Single linkage, complete linkage, and average linkage will each be fully detailed in turn.

Single linkage. Also referred to as nearest neighbour or minimum method. This measure defines the distance between two clusters as the minimum distance found between one case from the first cluster and one case from the second cluster (Florek, Lukaszewiez, Perkal, Steinhaus, & Zubrzchi, 1951; Sneath, 1957). For example, if cluster 1 contains cases a and b, and cluster 2 contains cases c, d, and e, then the distance between cluster 1 and cluster 2 would be the smallest distance found between the following pairs of cases: (a, c), (a, d), (a, e), (b, c), (b, d), and (b, e). A concern of using single linkage is that it can sometimes produce chaining amongst the clusters. This means that several clusters may be joined together simply because one of their cases is within close proximity of a case from a separate cluster. This problem is specific to single linkage due to the fact that the smallest distance between pairs is the only value taken into consideration. Because the steps in agglomerative hierarchical clustering are irreversible, this chaining effect can have disastrous effects on the cluster solution.

Complete linkage. Also referred to as furthest neighbour or maximum method. This measure is similar to the single linkage measure described above, but instead of searching for the minimum distance between pairs of cases, it considers the furthest distance between pairs of cases (Sokal & Michener, 1958). Although this solves the problem of chaining, it creates another problem. Imagine that in the above example cases a, b, c, and d are within close proximity to one another based upon the pre-established set of variables; however, if case e differs considerably from the rest, then cluster 1 and cluster 2 may no longer be joined together because of the difference in scores between (a, e) and (b, e). In complete linkage, outlying cases prevent close clusters from merging together because the measure of the furthest neighbour exacerbates the effects of outlying data.

Average linkage. Also referred to as the Unweighted Pair-Group Method using Arithmetic averages (UPGMA).² To overcome the limitations of single and complete linkage, Sokal and Michener (1958) proposed taking an average of the distance values between pairs of cases. This method is supposed to represent a natural compromise between the linkage measures to provide a more accurate evaluation of the distance between clusters. For average linkage, the distances between each case in the first cluster and every case in the second cluster are calculated and then averaged. This means that in the previous example, the distance between cluster 1 and cluster 2 would be the average of all distances between the pairs of cases listed above: (a, c), (a, d), (a, e), (b, c), (b, d), and (b, e). Incorporating information about the variance of the distances renders the average distance value a more accurate reflection of the distance between two clusters of cases.

2 The average linkage presented here is referred to as average linkage between groups in SPSS and other resources. It should not be confused with an alternate method, average linkage within groups, which takes into account the variability found within each cluster. For a contrast between linkage measures, see Everitt, Landau, Leese, and Stahl (2011).
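The worked example with cluster 1 = {a, b} and cluster 2 = {c, d, e} can be reproduced in a few lines of Python. The sketch below is only an illustration with made-up scores: it lists the six between-cluster pairs and shows how single, complete, and average linkage reduce the same six squared Euclidean distances to one value each.

import numpy as np
from itertools import product

# Made-up scores on two variables for cases a, b (cluster 1) and c, d, e (cluster 2).
cluster1 = {"a": np.array([1.0, 2.0]), "b": np.array([1.5, 2.2])}
cluster2 = {"c": np.array([4.0, 4.5]), "d": np.array([4.2, 5.0]), "e": np.array([8.0, 0.5])}

# Squared Euclidean distance for every between-cluster pair: (a,c), (a,d), ..., (b,e).
pair_dist = {
    (n1, n2): float(np.sum((x1 - x2) ** 2))
    for (n1, x1), (n2, x2) in product(cluster1.items(), cluster2.items())
}
values = list(pair_dist.values())

print("single (minimum):  ", round(min(values), 2))
print("complete (maximum):", round(max(values), 2))
print("average (mean):    ", round(sum(values) / len(values), 2))

Because case e was placed far from everything else in this toy example, the three summaries disagree sharply: the minimum ignores e, the maximum is dominated by it, and the mean sits in between, which is exactly the chaining/outlier trade-off described above.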

Each linkage measure defines the distance between two clusters in a unique way. The selected linkage measure will have a direct impact on the clustering procedure and the way in which clusters are merged together (Mazzocchi, 2008). This will subsequently impact the final cluster solution. In the next section, a hierarchical cluster analysis will be performed on a previously published dataset using SPSS.

SPSS Tutorial on Hierarchical Cluster Analysis

The following tutorial will outline a step-by-step process to perform a hierarchical cluster analysis using SPSS statistical software (version 21.0) and how to interpret the subsequent analysis results. The research data in the following example was part of a larger research dataset from Yim and Bialystok (2012) which examined bilinguals and their language use. The present example includes 67 Cantonese-English bilingual young adults. The participants completed language proficiency tests in both languages and questionnaires regarding their daily language use. Participants had to indicate how often they use both English and Cantonese daily ("I use English and Cantonese daily") on a scale from 0 (none of the time) to 100 (all of the time). Language proficiency was assessed using the Peabody Picture Vocabulary Test-III (PPVT-III; Dunn & Dunn, 1997) in both Cantonese and English, measuring receptive vocabulary. This sample was chosen because it is an apt example to demonstrate the applicability of cluster analysis on psychological data. Bilinguals are loosely defined as individuals who regularly use two (or more) languages, yet many issues remain in the research field; for example, there is still no consensus as to what criteria determine that someone is bilingual (Grosjean, 1998). High proficiency bilinguals are often viewed as a homogeneous population; however, there can be within-group differences in language usage and language proficiency. The goal of a hierarchical cluster analysis on this data is to examine possible subgroups in a sample of highly proficient bilinguals.³

3 To practice with a dataset, please contact the corresponding author.

Step 1: Choosing Cluster Variables

The researcher first has to identify the variables that will be included for analysis. Any number of variables can be included, but it is best to include variables that are meaningful to the research question. In this example, we use three variables for the cluster analysis: bilinguals' proficiency scores in both languages and their self-report of their daily use of both languages. These three variables target proficiency and daily use, two dimensions commonly used to assess bilingualism. The variables included in this example are all continuous variables.

Step 2: Selecting Cluster Method

To run a hierarchical cluster analysis in SPSS, click on Analyze, then Classify, and then Hierarchical Cluster (Figure 1). A new dialog box labelled Hierarchical Cluster Analysis will then appear. Among the list of variables presented in the left panel, select the variables that will be included in the analysis and move them to the Variables box on the right. As shown in Figure 1, the three selected language variables have been moved into the Variables box. There is also an option to label cases. If a researcher has a variable which can be used to identify the individual cases, the variable can be brought over to the box named Label Cases By. This can be helpful in reading the output as it will allow for each case to be easily referenced. In our example, we do not assign a variable to label cases because the participant ID numbers correspond with the row numbers in SPSS. If no variable is chosen to label the cases, the output will use the SPSS row numbers to identify the cases.

Figure 1. Running a hierarchical cluster analysis.

Figure 2. Choosing statistics.

Step 3: Specifying Parameters

After selecting the variables to include in the analysis, it is important to request certain items to be included in the data output. On the right side of the window, the Statistics button will allow for the researcher to request a proximity matrix to be produced (Figure 2). Otherwise, the SPSS output will only include the agglomeration schedule, a table that details each step of the clustering procedure.

Under Statistics, the Plots button allows the researcher to select the visual outputs that will illustrate the cluster solution. The researcher can choose to produce a dendrogram, a visual tree graph that displays the clustering procedure. Dendrograms are very helpful in determining where the hierarchical clustering procedure should be stopped because the ultimate goal is not to continue the clustering until each case is combined into one large cluster. In Plots, the box marked Dendrograms must be selected, otherwise SPSS will not generate it automatically (Figure 3). Also, the researcher can modify the presentation of the icicle plot by changing its orientation.

Figure 3. Selecting plot options.

Next, it is important to set the specific parameters for the cluster analysis, namely choosing the distance and linkage measures that will be used. By clicking on the Method button, a new dialog box will open where these options will be listed (Figure 4). The Cluster Method refers to the linkage measure. In SPSS, the default is the between-groups linkage, which is equivalent to average linkage between groups. In the drop-down menu, single linkage and complete linkage are also available along with four other measures. Under the Measure options, Interval specifies the distance measure for the cluster analysis. The default option is the squared Euclidean distance as it is the most common and some linkage measures specifically require this distance measure. SPSS provides eight distance options, including Euclidean distance and Pearson correlation.

Figure 4. Specifying cluster measures.

Under Transform Values, there are options to standardize the variables selected for clustering. No standardization is specified by default, but the two most common transformation options are z-scores or using a range of -1 to 1. In our example, the values will need to be transformed as the three variables were not measured on the same scale. Now that all the parameters have been set and the output options have been chosen, the analysis is ready to be run in SPSS. The SPSS syntax for the tutorial can also be found in the Appendix.
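For readers who want to mirror these parameter choices outside of SPSS, the sketch below reproduces the same specification (standardized variables, squared Euclidean distance, average linkage between groups) with scipy. It is an illustrative equivalent, not part of the original tutorial; the variable layout and the synthetic stand-in data are invented.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Stand-in for the three clustering variables (daily use, Cantonese PPVT, English PPVT).
# Values and scales are arbitrary; replace with the real columns of your dataset.
data = np.column_stack([
    rng.uniform(50, 100, size=67),    # daily use of both languages (0-100)
    rng.normal(180, 15, size=67),     # Cantonese proficiency score
    rng.normal(165, 15, size=67),     # English proficiency score
])

# Standardize each variable so that no single scale dominates the distances
# (the tutorial uses SPSS's Transform Values options for the same purpose).
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Squared Euclidean distance with average linkage ("between-groups" in SPSS).
Z = linkage(z, method="average", metric="sqeuclidean")

# Z plays the role of the agglomeration schedule: each row holds the two clusters
# joined, the distance (coefficient) at which they were joined, and the new cluster size.
dendrogram(Z, orientation="right", no_labels=True)
plt.xlabel("Distance")
plt.show()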

Step 4: Interpreting the Output

Similar to other analyses, SPSS will first produce a Case Processing Summary which lists the number of valid cases, the number of missing cases, and also the distance measure that was chosen (i.e., the squared Euclidean distance). The Proximity Matrix is the second table in the output, if requested. The matrix lists the squared Euclidean distance that was calculated between all pairs of cases in the first step of the cluster procedure. Table 1 is a truncated version of the matrix that shows the distances between cases 50-60; in the example, cases 55 and 56 had the smallest squared Euclidean distance (approximately .000) and were therefore the first two cases to be joined together. The full proximity matrix is recalculated after each step but is not shown in the output to save space. Nonetheless, the repeated calculation of the proximity matrix is used to determine the successive merging of cases illustrated in the remaining outputs.

Table 1. Proximity Matrix (squared Euclidean distances between cases 50-60)

Case    50     51     52     53     54     55     56     57     58     59     60
50    .000   .499   .145   .504  1.404   .162   .162   .114   .222   .933   .132
51    .499   .000   .362   .390   .360   .278   .278   .256   .252   .116   .270
52    .145   .362   .000   .740  1.360   .028   .028   .282   .115   .867   .058
53    .504   .390   .740   .000   .842   .753   .753   .329   .803   .379   .703
54   1.404   .360  1.360   .842   .000  1.104  1.104   .760   .900   .140  1.022
55    .162   .278   .028   .753  1.104   .000   .000   .208   .029   .737   .009
56    .162   .278   .028   .753  1.104   .000   .000   .208   .029   .737   .009
57    .114   .256   .282   .329   .760   .208   .208   .000   .176   .492   .144
58    .222   .252   .115   .803   .900   .029   .029   .176   .000   .660   .015
59    .933   .116   .867   .379   .140   .737   .737   .492   .660   .000   .700
60    .132   .270   .058   .703  1.022   .009   .009   .144   .015   .700   .000

The Agglomeration Schedule (Table 2) follows the proximity matrix in the output. The agglomeration schedule displays how the hierarchical cluster analysis progressively clusters the cases or observations. Each row in the schedule shows a stage at which two cases are combined to form a cluster, using an algorithm dictated by the distance and linkage selections. The agglomeration schedule lists all of the stages in which the clusters are combined until there is only one cluster remaining after the last stage. The number of stages in the agglomeration schedule is one less than the number of cases in the data being clustered. In this example, there are 66 stages because the sample consists of 67 bilinguals. The coefficients at each stage represent the distance of the two clusters being combined. As shown in Table 2, cases 55 and 56 are combined at the first stage because the squared Euclidean distance between them is the smallest out of all the pairs. In fact, the coefficients are very small (approximately .000) for the first several stages and slowly increase as the schedule progresses. The increase in coefficients indicates that the clusters being combined at a given stage are more heterogeneous than previous combinations. (The agglomeration schedule shown in Table 2 has been cropped. Only the top and the bottom of the schedule are shown as it becomes quite long with a large number of cases.)

The purpose of the agglomeration schedule is to assist the researcher in identifying at what point two clusters being combined are considered too different to form a homogeneous group, as evidenced by the first large increase in coefficient values. When there is a large difference between the coefficients of two consecutive stages, this suggests that the clusters being merged are increasing in heterogeneity and that it would be ideal to stop the clustering process before the clusters become too dissimilar. In Table 2, there is a jump in the coefficient values between stages 63 and 64. With a difference of approximately .201, this is the first noticeable increase that we encounter as we move down the list of coefficients in the agglomeration schedule. Therefore, we can choose to stop the clustering after stage 63.

It can be difficult to calculate the differences of the coefficients. An easy solution is to plot the coefficient values by stage in a scree plot. A scree plot is simply a line graph, a visual representation of the agglomeration schedule. Although SPSS does not produce the scree plot in its output, it can be made in Microsoft Excel by copying the values in the stage and coefficients columns. In Figure 5, the scree plot shows a large increase in the coefficients after stage 63.
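The jump in coefficients can also be located programmatically rather than by scanning the table or the scree plot. Assuming a linkage matrix Z as in the earlier sketch (its third column holds the coefficients), a few lines of Python reproduce the difference calculation described above; the data here are again a synthetic stand-in.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
z_scores = rng.normal(size=(67, 3))      # stand-in for the standardized variables
Z = linkage(z_scores, method="average", metric="sqeuclidean")

coefficients = Z[:, 2]                   # one agglomeration coefficient per stage
increases = np.diff(coefficients)        # increase from each stage to the next

# Inspect the increases over the final stages, mirroring a walk down the
# bottom of the agglomeration schedule (stages are 1-based in the SPSS output).
for i, step in list(enumerate(increases, start=1))[-6:]:
    print(f"stage {i} -> stage {i + 1}: increase of {step:.3f}")

In the tutorial data the first clearly noticeable increase is the .201 step from stage 63 to stage 64 (.303 to .504), which is why the clustering is stopped after stage 63; an automated check should therefore look for the first large jump rather than simply the largest one.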

Table 2. Agglomeration Schedule (cropped; middle stages omitted)

                Cluster Combined                      Stage Cluster First Appears
Stage     Cluster 1    Cluster 2    Coefficients    Cluster 1    Cluster 2    Next Stage
  1          55           56           .000             0            0           25
  2           9           32           .000             0            0           11
  3          12           43           .000             0            0            6
  4           3           46           .000             0            0           57
  5          25           31           .000             0            0            8
  6          12           19           .000             3            0           11
  7          20           50           .001             0            0           17
  8          25           28           .001             5            0           31
  9          10           62           .001             0            0           18
 10          21           24           .001             0            0           13
 ...         ...          ...           ...            ...          ...          ...
 55           6           40           .082            41           45           59
 56           4           13           .089            49           35           61
 57           3            8           .092             4           52           63
 58           1            2           .109            51           27           65
 59           5            6           .116            34           55           61
 60          15           51           .133            54            0           62
 61           4            5           .167            56           59           64
 62          15           18           .262            60           53           65
 63           3            7           .303            57           50           64
 64           3            4           .504            63           61           66
 65           1           15           .603            58           62           66
 66           1            3           .887            65           64            0

Figure 5. Scree plot of coefficients by stage.

The first figure included in the SPSS output is the Icicle Plot (Figure 6). Like the agglomeration schedule, this plot displays the similarity between two cases. The icicle plot is easier to interpret when examining it from the bottom to the top. Each of the dark grey bars in the plot represents one case. However, it is important to note the areas between cases and when they become shaded. The point at which the space between two cases becomes shaded represents when the cases were joined together. For example in Figure 6, near the midpoint of the plot, the section between two dark bars is shaded immediately, suggesting that those two cases were clustered together at the onset of the clustering procedure. Inspecting the plot closely, we discover that those two cases correspond to case 55 and case 56, which were combined at the first stage of the agglomeration schedule. (In the SPSS output, the bars on the icicle plot are all shaded in the same colour. We have changed the bars representing the cases into a darker colour to differentiate them more easily.)

Figure 6. Icicle plot.

As mentioned previously, a hierarchical cluster analysis is best illustrated using a dendrogram, a visual display of the clustering process (Figure 7). It appears at the very end of the SPSS output.

Examining the dendrogram from left to right, clusters that are more similar to each other are grouped together earlier. The vertical lines in the dendrogram represent the grouping of clusters or the stages of the agglomeration schedule. They also indicate the distance between two joining clusters (as represented by the x-axis, located above the plot). As the clusters being merged become more heterogeneous, the vertical lines will be located farther to the right side of the plot, as they represent larger distance values. While the vertical lines are indicative of the distance between clusters, the horizontal lines represent the differences of these distances. The horizontal lines also connect all cases that are a part of one cluster, which is important when determining the final number of clusters after the stopping decision is made. Upon visually inspecting the dendrogram, the longest horizontal lines represent the largest differences. Therefore, a long horizontal line indicates that two clusters (which are dissimilar to each other) are being combined and identifies where it is optimal to stop the clustering procedure. Similar to the agglomeration schedule, if the vertical and horizontal lines are close to one another, then this would suggest that the level of homogeneity of the clusters merged at those stages is relatively stable. The cut-off should thus be placed where there are no closely plotted lines while eliminating the vertical lines with large values.

As there is no formal stopping rule for hierarchical cluster analysis, a cut-off needs to be determined from the dendrogram to signify when the clustering process should be stopped (Bratchell, 1989). The best approach to determine the number of clusters in the data is to incorporate information from both the agglomeration schedule and the dendrogram. Figure 7 illustrates the dendrogram generated by SPSS with an added line indicating the optimal stopping point of the clustering procedure. From the agglomeration schedule, we had concluded that it would be best to stop the cluster analysis after the 63rd stage, eliminating the last three stages (stages 64, 65, and 66). This decision is reflected in the dendrogram where the last three vertical lines (representing the last three stages in the agglomeration schedule) were cut from the cluster solution. By stopping the clustering at this point, four clusters are revealed within the dataset as the cut-off line crosses four horizontal lines. The interpretation of these clusters will be discussed following the tutorial.

Figure 7. Dendrogram with added line indicating suggested stopping location.
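The same stopping decision can be visualized outside of SPSS by drawing a cut line on a dendrogram. The sketch below, built on the same kind of synthetic stand-in data as the earlier sketches, draws a left-to-right dendrogram and a vertical cut at an assumed threshold; both the data and the chosen threshold are illustrative only.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
z_scores = rng.normal(size=(67, 3))          # stand-in for the standardized variables
Z = linkage(z_scores, method="average", metric="sqeuclidean")

# Cut just above the coefficient of the last stage we want to keep.
# Here that is the fourth-from-last merge, which leaves four clusters.
cut = (Z[-4, 2] + Z[-3, 2]) / 2

dendrogram(Z, orientation="right", no_labels=True, color_threshold=cut)
plt.axvline(cut, linestyle="--")             # the added stopping line
plt.xlabel("Distance (agglomeration coefficient)")
plt.title("Dendrogram with suggested cut")
plt.show()

Cutting the tree at a given distance undoes every merge whose coefficient exceeds that distance, so placing the line between the coefficients of two consecutive stages fixes the number of clusters retained.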

Step 5: Organizing Data into Subgroups

Once the number of clusters has been decided, the data can be organized into the subgroups specified by the analysis. This can be easily accomplished by re-running the hierarchical cluster analysis, but with one additional step. In the Hierarchical Cluster Analysis window, click on the button on the right called Save (under Method where we chose the cluster measures). As seen in Figure 8, the researcher can dictate the cluster membership to have a single solution (fixed number of clusters) or a range of solutions (a range of clusters). By default, SPSS does not specify cluster membership because it contradicts the objective of hierarchical clustering (i.e., not requiring a known number of clusters beforehand). Since we have determined the number of clusters in the data, we are able to request a specific number of clusters. In our example, four clusters were identified. By specifying the number of clusters in this Save window, SPSS will generate a new variable in the Data View window which assigns each case into one of the four clusters. This can also be accomplished by inserting a Save Cluster instruction in the SPSS syntax (see Appendix for syntax). Once the analysis is complete, the researcher is able to use the cluster variable to analyze the different clusters, for example, examining descriptive statistics and how the clusters may differ according to the variables used in the analysis.

Figure 8. Creating a cluster filter variable.

Both windows allow the researcher to specify cluster membership, but it is only in the Save option where a cluster filter variable will be generated. Also, if the researcher is uncertain about the number of clusters in the data and wishes to look at two or more options, inputting a range of solutions can be used to generate a new variable for each of the cluster membership options.
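Outside of SPSS, this "save cluster membership" step corresponds to cutting the linkage tree at a fixed number of clusters and attaching the labels to the data. The sketch below, again built on invented stand-in data and hypothetical column names, assigns a four-cluster solution and summarizes each subgroup.

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Hypothetical stand-in for the tutorial's three variables and 67 participants.
df = pd.DataFrame({
    "daily_use": rng.uniform(50, 100, size=67),
    "cantonese_ppvt": rng.normal(180, 15, size=67),
    "english_ppvt": rng.normal(165, 15, size=67),
})

z = (df - df.mean()) / df.std(ddof=0)                    # standardize the variables
Z = linkage(z.to_numpy(), method="average", metric="sqeuclidean")

# Equivalent of saving a single solution with a fixed number of clusters.
df["cluster"] = fcluster(Z, t=4, criterion="maxclust")

# Descriptive statistics per cluster: group sizes and variable means.
print(df.groupby("cluster").size())
print(df.groupby("cluster").mean().round(1))

From here the cluster variable can be used exactly as described above, for example to test how the subgroups differ on the clustering variables or on external measures.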

Discussion

Cluster analysis allows the researcher to make many decisions about the measures used in the analysis. However, this can be a problem as it places greater weight on the researcher being knowledgeable enough to select the appropriate measures. The tutorial demonstrated that it is often difficult to determine the exact number of clusters in a dataset and that this decision is dependent on a numerical and visual inspection of the output figures, which can sometimes be subjective and ambiguous. The underlying structure of the cluster solution can change greatly by simply modifying one of the chosen measures, such as linkage. The following section will review how employing three different linkage measures (single linkage, complete linkage, and average linkage) can result in three vastly different analyses and clustering results, as evidenced in visual plots such as dendrograms. Additionally, upon choosing a linkage measure, we interpret the results of the cluster solution and the meaning of the subgroups.

As mentioned previously, the linkage measure determines how to calculate the distance between pairs of clusters with two or more cases. Figure 9 displays three dendrograms from three analyses, each using a different linkage measure. Although all three analyses were run on the same data (from the SPSS tutorial), the differences between the dendrograms are easily observable upon visual inspection. First, the analysis using the single linkage measure is shown on the left. Using the process outlined in the tutorial, three clusters can be identified in the data. However, the dendrogram clearly shows how single linkage can produce chaining because the majority of cases were grouped together into a large cluster, with minimal distance between clusters. As the smallest distance between pairs is the only value taken into consideration, cases that are close in distance but from different clusters may drive their respective groups to merge despite the proximity of the rest of the cases. The dendrogram in the center shows the analysis using complete linkage where the opposite problem can be observed. Five clusters were derived from this analysis. Complete linkage does not necessarily merge groups that are close together due to outlying cases that may be far apart.

Figure 9. Three dendrograms from a hierarchical cluster analysis with single linkage (left), complete linkage (center), and average linkage (right).

Average linkage represents a natural compromise between single linkage and complete linkage, as it is sensitive to the shape and size of clusters. Single linkage is sensitive to outliers, but it is impervious to differences in the density of the clusters; in contrast, complete linkage can break down large clusters though it is highly influenced by outliers (Almeida, Barbosa, Pais, & Formosinho, 2007). As seen by the visual comparison, the average linkage method was a compromise between the single and complete methods as well. (The dendrogram on the right in Figure 9 is the same as Figure 7.) However, the number of clusters obtained using average linkage is not always the average between the single and complete linkage solutions, as was the case in this example.
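A comparison of this kind is easy to re-run programmatically: the only change between the three analyses is the linkage method. The sketch below, once more on synthetic stand-in data rather than the bilingual dataset, fits all three linkage measures to the same distance matrix and reports how many clusters an arbitrary, illustrative cut produces for each.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
z_scores = rng.normal(size=(67, 3))              # stand-in for standardized variables
d = pdist(z_scores, metric="sqeuclidean")        # one distance matrix for all methods

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)
    # Cut each tree at half of its maximum merge height (an arbitrary,
    # illustrative rule) and count the resulting clusters.
    labels = fcluster(Z, t=0.5 * Z[:, 2].max(), criterion="distance")
    print(f"{method:8s}: {labels.max()} clusters at the illustrative cut")

On the tutorial data the three dendrograms led to three, five, and four clusters respectively; with other data, and with other cut rules, the counts will differ, which is precisely why comparing linkage measures before settling on one is worthwhile.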

Average linkage was the most appropriate option for the data used in this example; however, the procedures and solution of a cluster analysis will be unique to each dataset. Bratchell (1989) suggests that there is no best choice and researchers may need to employ different techniques and compare their results.

It was demonstrated above that average linkage was the best linkage measure for the bilingual data in the present example. At this point, it is important to take a closer look at the cluster groups generated by the hierarchical cluster analysis and how these groups may be meaningful. The analysis resulted in creating four subgroups. As seen in Table 3, the analysis resulted in four distinct clusters that vary according to two dimensions, the daily use of both languages and Cantonese proficiency, as measured by the PPVT-III. Clusters A and B represent bilinguals who use Cantonese and English every day, while Clusters C and D are those who use both languages to a moderate degree only. When examining Cantonese proficiency, it is noteworthy that despite all bilinguals being communicatively competent in Cantonese, there is a split among them on this measure. The bilinguals in Clusters A and D obtained higher scores compared to those in Clusters B and C. Therefore, four meaningful subgroups were detected: (i) Cluster A, frequent language users with high Cantonese proficiency; (ii) Cluster B, frequent language users with intermediate Cantonese proficiency; (iii) Cluster C, moderate language users with intermediate Cantonese proficiency; and (iv) Cluster D, moderate language users with high Cantonese proficiency. The results of the cluster analysis confirmed that there are meaningful subgroups within this group of high proficiency bilinguals. Although bilinguals are generally considered to be a homogeneous group, there exist fine differences among them and distinguishing these within-group differences can be significant for bilingual research.

Table 3. Means and standard deviations of daily language use and proficiency scores in Cantonese and English by cluster group.

Cluster    n    Daily Use of Both Languages    Cantonese Proficiency    English Proficiency
A         24            99.2 (2.5)                  194.2 (4.3)             160.0 (19.2)
B         27            97.1 (5.0)                  165.0 (9.0)             174.7 (10.2)
C          6            65.3 (5.5)                  168.8 (7.5)             182.5 (4.2)
D         10            59.6 (7.5)                  195.8 (4.5)             148.1 (17.5)

After identifying a set of meaningful subgroups, there is still a final step that one can take to further validate the cluster solution by performing a split-sample validation.⁴ The sample was randomly split to create two sub-samples, which were then used for comparison regarding the number of clusters and each of the cluster profiles (Everitt et al., 2011). The first sub-sample (n = 34) and the second sub-sample (n = 33) both generated the same four cluster groups as the original full-sample solution. The clustering pattern was maintained across the four subgroups: bilinguals who used Cantonese and English every day (Clusters A and B) represented a larger proportion of both sub-samples than bilinguals who used their languages moderately (Clusters C and D). Importantly, the cluster solution was replicated within each of the four subgroups; that is, cases which were merged together in a cluster were also combined together in both of the sub-samples. Therefore, the split-sample validation technique complements our visual inspection of the cluster analysis and allows us to reliably conclude the existence of four meaningful subgroups within the dataset.

4 There are no cluster validation methods in SPSS; however, other validation techniques are available in different statistical software packages (SAS Institute, 1983; Wilkinson, Engelman, Corter, & Coward, 2000).
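A rough version of this split-sample check can be scripted as well. The sketch below randomly halves a stand-in dataset, repeats the same clustering in each half, and compares the resulting cluster profiles; it is a simplified illustration of the idea rather than the validation procedure of any particular package.

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "daily_use": rng.uniform(50, 100, size=67),
    "cantonese_ppvt": rng.normal(180, 15, size=67),
    "english_ppvt": rng.normal(165, 15, size=67),
})

def cluster_profiles(subset, n_clusters=4):
    # Cluster one sub-sample with the tutorial's settings and return cluster means.
    z = (subset - subset.mean()) / subset.std(ddof=0)
    Z = linkage(z.to_numpy(), method="average", metric="sqeuclidean")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return subset.groupby(labels).mean()

# Random split into two halves (n = 34 and n = 33 for a sample of 67).
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
half_a, half_b = shuffled.iloc[:34], shuffled.iloc[34:]

print("Sub-sample A profiles:\n", cluster_profiles(half_a).round(1))
print("Sub-sample B profiles:\n", cluster_profiles(half_b).round(1))

If the two sets of profiles recover clusters with similar sizes and similar means on the clustering variables, that is informal evidence the solution is not an artifact of one particular half of the sample.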

Deciding upon the most accurate cluster solution and its interpretation may in itself pose a limitation because of the freedom that is given to the researcher. As with any other statistical analysis, there are situations in which hierarchical cluster analysis does not perform optimally. As explained above, the full proximity matrix must be computed at each step in the clustering procedure. If the sample is very large, more time will be needed to produce the proximity matrix at each step. Moreover, each step is irreversible and cases cannot be reassigned to a different cluster later on in the process. The sequential and inflexible nature of hierarchical clustering makes the initial partitions more influential than those at a later point. However, these potential limitations inherent to the nature of hierarchical cluster analysis are minimal and the benefits of this otherwise flexible method are broad and encouraging for its use as a statistical tool in the field of psychology.

Cluster analysis is not a data mining technique used for creating a structure within a dataset that is not meaningful. Hierarchical clustering will always provide a series of cluster solutions, from one possible cluster to n possible clusters. The present paper does not consider comprehensively all the parameters associated with hierarchical cluster analysis; there are many specific techniques and models that have not been addressed. (We recommend the fifth edition of Cluster Analysis by Everitt et al., 2011, as further reading. It is a comprehensive and essential resource for researchers who are interested in this statistical technique.) It is the responsibility of the researcher to ensure that the distance and linkage measures have been appropriately selected and that the clustering process is stopped at the most logical point. As specified in the SPSS tutorial, the investigator must examine various outputs to determine the most appropriate number of clusters. There is no correct or incorrect solution to cluster analysis; it is up to the researcher to select the appropriate parameters to reveal the most accurate underlying structure of the data.

Conclusion

Cluster analysis is a statistical tool that offers a wide range of options for the researcher, allowing for the analysis to be uniquely tailored to the data and the objectives of the study. Although the practice of using this class of techniques is not yet common in the field of psychology, there are clear advantages to offering various options in setting the parameters of the analysis. Hierarchical cluster analysis is suggested as a practical method for identifying meaningful clusters within samples that may superficially appear homogeneous. The present paper presented a theoretical background to hierarchical clustering, specifically outlining the three common linkage measures used, and a tutorial outlining the steps in the analysis, guiding researchers to discover underlying structures and subgroups on their own. With increased practice and when utilized appropriately, cluster analysis is a powerful tool that can be implemented on diverse sets of psychological data.

Authors' notes and acknowledgments

We would like to thank Sylvain Chartier for his suggestions on an earlier version of this paper and Ellen Bialystok for her guidance during the collection of the data used in the tutorial.

Address for correspondence: Odilia Yim, University of Ottawa, 136 Jean Jacques Lussier, Vanier 5045, Ottawa, ON, Canada K1N 6N5.

References

Almeida, J. A. S., Barbosa, L. M. S., Pais, A. A. C. C., & Formosinho, S. J. (2007). Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering. Chemometrics and Intelligent Laboratory Systems, 87, 208-217.
Blei, D. & Lafferty, J. (2009). Topic models. In A. Srivastava & M. Sahami (Eds.), Text Mining: Classification, Clustering, and Applications (pp. 71-94). Boca Raton, FL: Taylor & Francis Group.
Borgen, F. H. & Barnett, D. C. (1987). Applying cluster analysis in counselling psychology research. Journal of Counseling Psychology, 34(4), 456-468.
Bratchell, N. (1989). Cluster analysis. Chemometrics and Intelligent Laboratory Systems, 6, 105-125.
Clatworthy, J., Buick, D., Hankins, M., Weinman, J., & Horne, R. (2005). The use and reporting of cluster analysis in health psychology: A review. British Journal of Health Psychology, 10(3), 329-358.
Dunn, L. M. & Dunn, L. M. (1997). Peabody Picture Vocabulary Test - III. Circle Pines, MN: American Guidance Service.
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster Analysis (5th edition). Chichester, UK: John Wiley & Sons, Ltd.
Finch, H. (2005). Comparison of distance measures in cluster analysis with dichotomous data. Journal of Data Science, 3(1), 85-100.
Florek, K., Lukaszewiez, J., Perkal, J., Steinhaus, H., & Zubrzchi, S. (1951). Sur la liaison: Division des points d'un ensemble fini. Colloquium Mathematicum, 2, 282-285.
Grosjean, F. (1998). Studying bilinguals: Methodological and conceptual issues. Bilingualism: Language and Cognition, 1(2), 131-149.
Landau, S. & Chis Ster, I. (2010). Cluster analysis: Overview. In P. Peterson, E. Baker, & B. McGaw (Eds.), International Encyclopedia of Education (3rd edition, pp. 72-83). Oxford, UK: Elsevier Ltd.
Mazzocchi, M. (2008). Statistics for Marketing and Consumer Research. London, UK: Sage Publications Ltd.
Morissette, L., & Chartier, S. (2013). The k-means clustering technique: General considerations and implementation in Mathematica. Tutorials in Quantitative Methods for Psychology, 9(1), 15-24.
Norusis, M. J. (2010). Chapter 16: Cluster analysis. In PASW Statistics 18 Statistical Procedures Companion (pp. 361-391). Upper Saddle River, NJ: Prentice Hall.
SAS Institute (1983). SAS Technical Report A-108: Cubic Clustering Criterion. Cary, NC: SAS Institute Inc. Retrieved from https://support.sas.com/documentation/onlinedoc/v82/techreport_a108.pdf
Sneath, P. H. A. (1957). The application of computers to taxonomy. Journal of General Microbiology, 17, 201-226.
Sokal, R. R. & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. The University of Kansas Scientific Bulletin, 38, 1409-1438.
Sokal, R. R., & Sneath, P. H. A. (1963). Principles of Numerical Taxonomy. San Francisco: W. H. Freeman.
Wilkinson, L., Engelman, L., Corter, J., & Coward, M. (2000). Cluster analysis. In L. Wilkinson (Ed.), Systat 10 - Statistics I (pp. 65-124). Chicago, IL: SPSS Inc.
Wilmink, F. W. & Uytterschaut, H. T. (1984). Cluster analysis, history, theory and applications. In G. N. van Vark & W. W. Howells (Eds.), Multivariate Statistical Methods in Physical Anthropology (pp. 135-175). Dordrecht, The Netherlands: D. Reidel Publishing Company.
Yim, O. & Bialystok, E. (2012). Degree of conversational code-switching enhances verbal task switching in Cantonese-English bilinguals. Bilingualism: Language and Cognition, 15(4), 873-883.

Appendix: SPSS Syntax for Hierarchical Cluster Analysis
The steps outlined in the tutorial can be performed using the following syntax in SPSS. The entry
'C:\Users\cluster.tmp' is the location of a temporary file for the analysis (suitable for both PC and Mac operating
systems). The user can substitute this line with a specific location on their computer if they wish and it can be
deleted after the analysis is complete. The comment lines beginning with an asterisk can be copied into the SPSS
syntax window for reference.

*PROXIMITIES: Substitute with your own variable names in your datafile (as many as desired).
*MEASURE: Distance measure (e.g., Squared Euclidean).
*STANDARDIZE: Transformation applied to the variables (e.g., Range -1 to 1).
*METHOD: Linkage measure (e.g., between-groups average; others are SINGLE or COMPLETE).
*To generate a cluster variable in the Data View window, add the following line under
* the CLUSTER command: /SAVE CLUSTER (number of clusters desired).

PROXIMITIES Variable1 Variable2 Variable3
  /MATRIX OUT ('C:\Users\cluster.tmp')
  /VIEW=CASE
  /MEASURE=SEUCLID
  /PRINT NONE
  /STANDARDIZE=VARIABLE RANGE.

CLUSTER
  /MATRIX IN ('C:\Users\cluster.tmp')
  /METHOD BAVERAGE
  /PRINT SCHEDULE
  /PRINT DISTANCE
  /PLOT DENDROGRAM VICICLE.

Citation

Yim, O., & Ramdeen, K. T. (2015). Hierarchical Cluster Analysis: Comparison of Three Linkage Measures and
Application to Psychological Data. The Quantitative Methods for Psychology, 11 (1), 8-21.

Copyright © 2015 Yim and Ramdeen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use,
distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is
cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Received: 16/08/14 ~ Accepted: 22/10/14
