Lecture 6
Cluster Analysis
Clustering, Distance Methods, and Ordination
  ➢ Grouping, or clustering, is distinct from classification methods.
  ➢ Classification pertains to a known number of groups, and the operational
    objective is to assign new observations to one of these groups.
  ➢ Cluster analysis is a more primitive technique in that no assumptions are
    made concerning the number of groups or the group structure.
  ➢ Grouping is done on the basis of similarities or distances (dissimilarities).
  ➢ The inputs required are similarity measures or data from which
    similarities can be computed.
  ➢ To illustrate the nature of the difficulty in defining a natural grouping,
    consider sorting the 16 face cards in an ordinary deck of playing cards into
    clusters of similar objects.
  ➢ Some groupings are illustrated in Figure 12.1. It is immediately clear that
    meaningful partitions depend on the definition of "similar."
➢ In most practical applications of cluster analysis, the investigator knows
  enough about the problem to distinguish "good" groupings from "bad"
  groupings.
➢ Why not enumerate all possible groupings and select the "best" ones for
  further study?
➢ For the playing-card example, there is one way to form a single group of 16
  face cards, there are 32,767 ways to partition the face cards into two groups
  (of varying sizes), there are 7,141,686 ways to sort the face cards into three
  groups (of varying sizes), and so on (these counts are checked in the sketch
  after this list).
  ➢ Obviously, time constraints make it impossible to determine the best
    groupings of similar objects from a list of all possible structures.
  ➢ Even fast computers are easily overwhelmed by the typically large number
    of cases, so one must settle for algorithms that search for good, but not
    necessarily the best, groupings.
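➢ These partition counts are the Stirling numbers of the second kind, S(n, k),
  which count the ways to sort n labeled items into k nonempty groups; the
  short Python sketch below reproduces them.

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind: number of ways to partition
    n labeled items into k nonempty, unlabeled groups."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

# Reproduce the counts quoted above for the 16 face cards.
for k in (1, 2, 3):
    print(k, stirling2(16, k))   # 1, 32767, 7141686
```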
Similarity Measures
  ➢ Most efforts to produce a rather simple group structure from a complex
    data set require a measure of "closeness," or "similarity."
  ➢ There is often a great deal of subjectivity involved in the choice of a
    similarity measure.
  ➢ Important considerations include the nature of the variables (discrete,
    continuous, binary), scales of measurement (nominal, ordinal, interval,
    ratio), and subject matter knowledge.
  ➢ When items (units or cases) are clustered, proximity is usually indicated by
    some sort of distance.
  ➢ By contrast, variables are usually grouped on the basis of correlation
    coefficients or like measures of association.
Distances and Similarity Coefficients for Pairs of Items
   ➢ The Euclidean (straight-line) distance between two p-dimensional
     observations (items) $x' = [x_1, x_2, \ldots, x_p]$ and
     $y' = [y_1, y_2, \ldots, y_p]$ is

       $$ d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}
                  = \sqrt{(x - y)'(x - y)} $$

   ➢ The statistical distance between the same two observations is of the form

       $$ d(x, y) = \sqrt{(x - y)' A (x - y)} $$

        ✓ Ordinarily, $A = S^{-1}$, where $S$ contains the sample variances and
          covariances.
       ✓ However, without prior knowledge of the distinct groups, these
         sample quantities cannot be computed.
       ✓ For this reason, Euclidean distance is often preferred for clustering.
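➢ As a concrete illustration, the following minimal NumPy sketch computes both
  distances; the observations x and y and the covariance matrix S are made-up
  placeholders.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance: sqrt((x - y)'(x - y))."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.sqrt(diff @ diff)

def statistical_distance(x, y, S):
    """Statistical distance sqrt((x - y)' A (x - y)) with A = S^{-1}."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    # Solve S z = (x - y) rather than inverting S explicitly.
    return np.sqrt(diff @ np.linalg.solve(S, diff))

# Made-up two-dimensional observations and sample covariance matrix.
x, y = [1.0, 2.0], [4.0, 6.0]
S = np.array([[4.0, 1.0],
              [1.0, 9.0]])
print(euclidean_distance(x, y))      # 5.0
print(statistical_distance(x, y, S))
```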
➢ Another distance measure is the Minkowski metric

      $$ d(x, y) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m} $$
➢ For m = 1, d(x,y) measures the "city-block" distance between two points in p
  dimensions.
➢ For m = 2, d(x, y) becomes the Euclidean distance.
➢ In general, varying m changes the weight given to larger and smaller
  differences.
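➢ A small sketch (made-up points, NumPy assumed) makes the effect of m
  concrete.

```python
import numpy as np

def minkowski(x, y, m):
    """Minkowski metric: (sum_i |x_i - y_i|^m)^(1/m)."""
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return np.sum(diff ** m) ** (1.0 / m)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))   # city-block distance: 3 + 2 + 0 = 5.0
print(minkowski(x, y, 2))   # Euclidean distance: sqrt(13) ≈ 3.61
print(minkowski(x, y, 4))   # larger m stresses the largest difference more
```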
➢ Two additional popular measures of "distance" or dissimilarity are given
  by the Canberra metric and the Czekanowski coefficient.
➢ Both of these measures are defined for nonnegative variables only. We have
      Canberra metric:          $$ d(x, y) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{x_i + y_i} $$

      Czekanowski coefficient:  $$ d(x, y) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)} $$
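➢ Both measures are simple to compute; here is a minimal sketch, assuming
  NumPy and strictly positive inputs so that no term divides by zero.

```python
import numpy as np

def canberra(x, y):
    """Canberra metric: sum_i |x_i - y_i| / (x_i + y_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) / (x + y))

def czekanowski(x, y):
    """Czekanowski coefficient: 1 - 2 * sum_i min(x_i, y_i) / sum_i (x_i + y_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - 2.0 * np.sum(np.minimum(x, y)) / np.sum(x + y)

x, y = [1.0, 3.0, 5.0], [2.0, 3.0, 1.0]
print(canberra(x, y))       # 1/3 + 0 + 4/6 = 1.0
print(czekanowski(x, y))    # 1 - 2*(1 + 3 + 1)/(3 + 6 + 6) = 1/3
```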
➢ When items cannot be represented by meaningful p-dimensional
  measurements, pairs of items are often compared on the basis of the presence
  or absence of certain characteristics.
➢ Similar items have more characteristics in common than do dissimilar
  items.
➢ The presence or absence of a characteristic can be described mathematically
  by introducing a binary variable, which assumes the value 1 if the
  characteristic is present and the value 0 if the characteristic is absent.
➢ For p = 5 binary variables, for instance, the "scores" for two items i and k
  might be arranged as follows:
                                  Variables
                              1    2    3    4    5
                   Item i     1    0    0    1    1
                   Item k     1    1    0    1    0
➢ In this case, there are two 1-1 matches, one 0-0 match, and two mismatches.
➢ Let $x_{ij}$ be the score (1 or 0) of the jth binary variable on the ith item
  and $x_{kj}$ be the score (again, 1 or 0) of the jth variable on the kth item,
  $j = 1, 2, \ldots, p$.
➢ Consequently,

      $$ (x_{ij} - x_{kj})^2 =
         \begin{cases} 0 & \text{if } x_{ij} = x_{kj} = 1 \text{ or } x_{ij} = x_{kj} = 0 \\
                       1 & \text{if } x_{ij} \neq x_{kj} \end{cases} \qquad (1) $$

  and the squared Euclidean distance, $\sum_{j=1}^{p} (x_{ij} - x_{kj})^2$, provides
  a count of the number of mismatches.
➢ A large distance corresponds to many mismatches; that is, to dissimilar items.
➢ From the preceding display, the square of the distance between items i and k
  would be

      $$ \sum_{j=1}^{5} (x_{ij} - x_{kj})^2
         = (1 - 1)^2 + (0 - 1)^2 + (0 - 0)^2 + (1 - 1)^2 + (1 - 0)^2 = 2 $$
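➢ The same count can be reproduced in a couple of lines (a NumPy sketch using
  the scores displayed above).

```python
import numpy as np

item_i = np.array([1, 0, 0, 1, 1])
item_k = np.array([1, 1, 0, 1, 0])

# Each squared difference is 1 for a mismatch and 0 for a match, so the
# squared Euclidean distance counts the mismatches.
print(np.sum((item_i - item_k) ** 2))   # 2
```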
➢ Although a distance based on (1) might be used to measure similarity, it
  suffers from weighting the 1-1 and 0-0 matches equally.
➢ In some cases, a 1-1 match is a stronger indication of similarity than a 0-0
  match.
  ➢ For instance, in grouping people, the evidence that two persons both read
    ancient Greek is stronger evidence of similarity than the absence of this
    ability.
  ➢ Thus, it might be reasonable to discount the 0-0 matches or even disregard
    them completely.
  ➢ To allow for differential treatment of the 1-1 matches and the 0-0 matches,
    several schemes for defining similarity coefficients have been suggested.
  ➢ To introduce these schemes, let us arrange the frequencies of matches and
    mismatches for items i and k in the form of a contingency table:
                                     Item k
                                  1        0        Totals
             Item i      1        a        b        a + b
                         0        c        d        c + d
             Totals             a + c    b + d      p = a + b + c + d
  ➢    In this table, a represents the frequency of 1-1 matches, b is the frequency of
       1-0 matches, and so forth.
  ➢    Given the foregoing five pairs of binary outcomes, a = 2 and b = c = d = 1.
   ➢    The table below lists common similarity coefficients defined in terms of
        the frequencies in the contingency table above.
  ➢    A short rationale follows each definition.
Table: Similarity Coefficients for Clustering Items

     Coefficient                        Rationale
  1. (a + d)/p                          Equal weights for 1-1 matches and 0-0 matches.
  2. 2(a + d)/[2(a + d) + b + c]        Double weight for 1-1 matches and 0-0 matches.
  3. (a + d)/[a + d + 2(b + c)]         Double weight for unmatched pairs.
  4. a/p                                No 0-0 matches in numerator.
  5. a/(a + b + c)                      No 0-0 matches in numerator or denominator.
                                        (The 0-0 matches are treated as irrelevant.)
  6. 2a/(2a + b + c)                    No 0-0 matches in numerator or denominator.
                                        Double weight for 1-1 matches.
  7. a/[a + 2(b + c)]                   No 0-0 matches in numerator or denominator.
                                        Double weight for unmatched pairs.
  8. a/(b + c)                          Ratio of matches to mismatches with 0-0
                                        matches excluded.
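➢ As a closing illustration, the sketch below tallies a, b, c, and d for the
  two items scored earlier and evaluates a few coefficients from the table;
  the helper name match_counts is my own. Coefficients 5 and 6 are the forms
  widely known as the Jaccard and Dice coefficients, respectively.

```python
import numpy as np

def match_counts(xi, xk):
    """Frequencies a (1-1), b (1-0), c (0-1), d (0-0) for two binary items."""
    xi, xk = np.asarray(xi), np.asarray(xk)
    a = int(np.sum((xi == 1) & (xk == 1)))
    b = int(np.sum((xi == 1) & (xk == 0)))
    c = int(np.sum((xi == 0) & (xk == 1)))
    d = int(np.sum((xi == 0) & (xk == 0)))
    return a, b, c, d

a, b, c, d = match_counts([1, 0, 0, 1, 1], [1, 1, 0, 1, 0])
p = a + b + c + d                   # a = 2, b = c = d = 1, p = 5

print((a + d) / p)                  # coefficient 1: 3/5 = 0.6
print(a / (a + b + c))              # coefficient 5 (Jaccard): 2/4 = 0.5
print(2 * a / (2 * a + b + c))      # coefficient 6 (Dice): 4/6 ≈ 0.67
print(a / (b + c))                  # coefficient 8: 2/2 = 1.0
```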