Multivariate Analysis (Minitab)
Table Of Contents

Multivariate Analysis
   Overview
   Principal Components
   Factor Analysis
   Cluster Observations
   Cluster Variables
   Cluster K-Means
   Discriminant Analysis
   Simple Correspondence Analysis
   Multiple Correspondence Analysis
Multivariate Analysis
Overview
Multivariate Analysis Overview
Use Minitab's multivariate analysis procedures to analyze your data when you have made multiple measurements on items or subjects. You can choose to:
- Analyze the data covariance structure to understand it or to reduce the data dimension
- Assign observations to groups
- Explore relationships among categorical variables
Because Minitab does not compare tests of significance for multivariate procedures, interpreting the results is somewhat subjective. However, you can make informed conclusions if you are familiar with your data.
Grouping observations
Minitab offers three cluster analysis methods and discriminant analysis for grouping observations:
- Cluster Observations groups or clusters observations that are "close" to each other when the groups are initially unknown. This method is a good choice when no outside information about grouping exists. The choice of final grouping is usually made according to what makes sense for your data after viewing clustering statistics.
- Cluster Variables groups or clusters variables that are "close" to each other when the groups are initially unknown. The procedure is similar to clustering of observations. You may want to cluster variables to reduce their number.
- Cluster K-Means, like clustering of observations, groups observations that are "close" to each other. K-means clustering works best when sufficient information is available to make good starting cluster designations.
- Discriminant Analysis classifies observations into two or more groups if you have a sample with known groups. You can use discriminant analysis to investigate how the predictors contribute to the groupings.
Correspondence Analysis
Minitab offers two methods of correspondence analysis to explore the relationships among categorical variables:
- Simple Correspondence Analysis explores relationships in a 2-way classification. You can use this procedure with 3-way and 4-way tables because Minitab can collapse them into 2-way tables. Simple correspondence analysis decomposes a contingency table similar to how principal components analysis decomposes multivariate continuous data. It performs an eigen analysis of the data, breaks down the variability into underlying dimensions, and associates variability with rows and/or columns.
- Multiple Correspondence Analysis extends simple correspondence analysis to the case of 3 or more categorical variables. Multiple correspondence analysis performs a simple correspondence analysis on an indicator variables matrix in which each column corresponds to a level of a categorical variable. Rather than a 2-way table, the multi-way table is collapsed into 1 dimension.
Multivariate
Stat > Multivariate allows you to perform principal components analysis, factor analysis, cluster analysis, discriminant analysis, and correspondence analysis. Select one of the following options:
- Principal Components performs principal components analysis
- Factor Analysis performs factor analysis
- Cluster Observations performs agglomerative hierarchical clustering of observations
- Cluster Variables performs agglomerative hierarchical clustering of variables
- Cluster K-Means performs K-means non-hierarchical clustering of observations
- Discriminant Analysis performs linear and quadratic discriminant analysis
- Simple Correspondence Analysis performs simple correspondence analysis on a two-way contingency table
- Multiple Correspondence Analysis performs multiple correspondence analysis on three or more categorical variables

Minitab offers the following additional multivariate analysis options:
- Balanced MANOVA
- General MANOVA
- Multivariate control charts
References
[10] G. W. Milligan (1980). "An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms," Psychometrika, 45, 325-342.
[11] S. J. Press and S. Wilson (1978). "Choosing Between Logistic Regression and Discriminant Analysis," Journal of the American Statistical Association, 73, 699-705.
[12] A. C. Rencher (1995). Methods of Multivariate Analysis, John Wiley & Sons.
Principal Components
Principal Components Analysis
Stat > Multivariate > Principal Components Use principal component analysis to help you to understand the underlying data structure and/or form a smaller number of uncorrelated variables (for example, to avoid multicollinearity in regression). An overview of principal component analysis can be found in most books on multivariate analysis, such as [5].
Nonuniqueness of Coefficients
The coefficients are unique (except for a change in sign) if the eigenvalues are distinct and not zero. If an eigenvalue is repeated, then the "space spanned" by all the principal component vectors corresponding to the same eigenvalue is unique, but the individual vectors are not. Therefore, the coefficients that Minitab prints and those in a book or another program may not agree, though the eigenvalues (variances) will always be the same. If the covariance matrix has rank r < p, where p is the number of variables, then there will be p - r eigenvalues equal to zero. Eigenvectors corresponding to these eigenvalues may not be unique. This can happen if the number of observations is less than p or if there is multicollinearity.
Scores: Enter the storage columns for the principal components scores. Scores are linear combinations of your data using the coefficients. The number of columns specified must be less than or equal to the number of principal components calculated.
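A minimal sketch in Python (NumPy), outside Minitab, of how the coefficients and scores relate: the coefficients are eigenvectors of the correlation matrix, the scores are linear combinations of the standardized data, and flipping an eigenvector's sign gives an equally valid component (the nonuniqueness noted above). The data here are placeholders, not the census example below.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))  # placeholder: 20 observations on 5 variables

R = np.corrcoef(X, rowvar=False)          # 5 x 5 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, coefs = eigvals[order], eigvecs[:, order]

# Sign is arbitrary: -coefs[:, 0] is an equally valid PC1, which is why
# coefficients printed by different programs can disagree in sign.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize the data
scores = Z @ coefs                                 # one score column per component
print(np.round(eigvals / eigvals.sum(), 3))        # proportion of variance
```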
Session window output Principal Component Analysis: Pop, School, Employ, Health, Home
Eigenanalysis of the Correlation Matrix

Eigenvalue   3.0289   1.2911   0.5725   0.0954   0.0121
Proportion    0.606    0.258    0.114    0.019    0.002
Cumulative    0.606    0.864    0.978    0.998    1.000

Variable     PC1      PC2      PC3      PC4      PC5
Pop       -0.558   -0.131    0.008    0.551   -0.606
School    -0.313   -0.629   -0.549   -0.453    0.007
Employ    -0.568   -0.004    0.117    0.268    0.769
Health    -0.487    0.310    0.455   -0.648   -0.201
Home       0.174   -0.701    0.691    0.015    0.014
Factor Analysis
Factor Analysis
Stat > Multivariate > Factor Analysis Use factor analysis, like principal components analysis, to summarize the data covariance structure in a few dimensions of the data. However, the emphasis in factor analysis is the identification of underlying "factors" that might explain the dimensions associated with large data variability.
- None: Choose not to rotate the initial solution.
- Equimax: Choose to perform an equimax rotation of the initial solution (gamma = number of factors / 2).
- Varimax: Choose to perform a varimax rotation of the initial solution (gamma = 1).
- Quartimax: Choose to perform a quartimax rotation of the initial solution (gamma = 0).
- Orthomax with gamma: Choose to perform an orthomax rotation of the initial solution, then enter a value for gamma between 0 and 1.
<Options> <Graphs> <Storage> <Results>
The typical case is to use raw data. Set up your worksheet so that a row contains measurements on a single item or subject. You must have two or more numeric columns, with each column representing a different measurement (response). Minitab automatically omits rows with missing data from the analysis. Usually the factor analysis procedure calculates the correlation or covariance matrix from which the loadings are calculated. However, you can enter a matrix as input data. You can also enter both raw data and a matrix of correlations or covariances. If you do, Minitab uses the matrix to calculate the loadings. Minitab then uses these loadings and the raw data to calculate storage values and generate graphs. See To perform factor analysis with a correlation or covariance matrix. If you store initial factor loadings, you can later input these initial loadings to examine the effect of different rotations. You can also use stored loadings to predict factor scores of new data. See To perform factor analysis with stored loadings.
3 Do one of the following, and then click OK:
- To examine the effect of a different rotation method, choose an option under Type of Rotation. See Rotating the factor loadings for a discussion of the various rotations.
- To predict factor scores with new data, in Variables, enter the columns containing the new data.
Number of factors
The choice of the number of factors is often based upon the proportion of variance explained by the factors, subject matter knowledge, and reasonableness of the solution [6]. Initially, try using the principal components extraction method without specifying the number of components. Examine the proportion of variability explained by different factors and narrow down your choice of how many factors to use. A Scree plot may be useful here in visually assessing the importance of factors. Once you have narrowed this choice, examine the fits of the different factor analyses. Communality values, the proportion of variability of each variable explained by the factors, may be especially useful in comparing fits. You may decide to add a factor if it contributes to the fit of certain variables. Try the maximum likelihood method of extraction as well.
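A minimal sketch in Python (NumPy), outside Minitab, of the screening numbers this step relies on: the eigenvalues of the correlation matrix and the cumulative proportion of variance, which are the quantities a scree plot displays. The data are placeholders.

```python
import numpy as np

def variance_table(X):
    """Eigenvalues of the correlation matrix with (cumulative) proportions."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
    prop = eigvals / eigvals.sum()
    return eigvals, prop, np.cumsum(prop)

X = np.random.default_rng(2).normal(size=(50, 5))  # placeholder data
for k, (ev, p, cp) in enumerate(zip(*variance_table(X)), start=1):
    print(f"factor {k}: eigenvalue {ev:.4f}  proportion {p:.3f}  cumulative {cp:.3f}")
```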
Rotation
Once you have selected the number of factors, you will probably want to try different rotations. Johnson and Wichern [6] suggest the varimax rotation. A similar result from different methods can lend credence to the solution you have selected. At this point you may wish to interpret the factors using your knowledge of the data. For more information see Rotating the factor loadings.
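A minimal sketch of a varimax rotation using scikit-learn, not Minitab's routine (FactorAnalysis accepts rotation="varimax" in recent versions); the data are placeholders. One check worth noting: communalities are unchanged by an orthogonal rotation, as the example output further below also shows.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.default_rng(3).normal(size=(100, 5))  # placeholder data

unrotated = FactorAnalysis(n_components=2).fit(X)
rotated = FactorAnalysis(n_components=2, rotation="varimax").fit(X)

# components_ holds the loadings, one row per factor; communalities (row sums
# of squared loadings per variable) are unchanged by an orthogonal rotation.
print(np.round((unrotated.components_.T ** 2).sum(axis=1), 3))
print(np.round((rotated.components_.T ** 2).sum(axis=1), 3))
```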
Loadings for Initial Solution
- Compute from variables: Choose to compute loadings from the raw data.
- Use loadings: Choose to use loadings that were previously calculated, then specify the columns containing the loadings. You must specify one column for each factor calculated. See To perform factor analysis with stored loadings.
Maximum Likelihood Extraction
- Use initial communality estimates in: Choose the column containing data to be used as the initial values for the communalities. The column should contain one value for each variable.
- Max iterations: Enter the maximum number of iterations allowed for a solution (default is 25).
- Convergence: Enter the criterion for convergence, which occurs when the uniqueness values no longer change very much. This number is the size of the smallest change (default is 0.005).
Session window output Factor Analysis: Pop, School, Employ, Health, Home
Maximum Likelihood Factor Analysis of the Correlation Matrix

* NOTE * Heywood case

Unrotated Factor Loadings and Communalities

Variable  Factor1  Factor2  Communality
Pop         0.971    0.160        0.968
School      0.494    0.833        0.938
Employ      1.000    0.000        1.000
Health      0.848   -0.395        0.875
Home       -0.249    0.375        0.202

Variance   2.9678   1.0159       3.9837
% Var       0.594    0.203        0.797
Rotated Factor Loadings and Communalities
Varimax Rotation

Variable  Factor1  Factor2  Communality
Pop         0.718    0.673        0.968
School     -0.052    0.967        0.938
Employ      0.831    0.556        1.000
Health      0.924    0.143        0.875
Home       -0.415    0.173        0.202

Variance   2.2354   1.7483       3.9837
% Var       0.447    0.350        0.797
Sorted Rotated Factor Loadings and Communalities

Variable  Factor1  Factor2  Communality
Health      0.924    0.143        0.875
Employ      0.831    0.556        1.000
Pop         0.718    0.673        0.968
Home       -0.415    0.173        0.202
School     -0.052    0.967        0.938

Variance   2.2354   1.7483       3.9837
% Var       0.447    0.350        0.797
Factor Score Coefficients

Variable  Factor1  Factor2
Pop        -0.165    0.246
School     -0.528    0.789
Employ      1.150    0.080
Health      0.116   -0.173
Home       -0.018    0.027
Session window output Factor Analysis: Pop, School, Employ, Health, Home
Principal Component Factor Analysis of the Correlation Matrix

Unrotated Factor Loadings and Communalities

Variable  Factor1  Factor2  Factor3  Factor4  Factor5  Communality
Pop        -0.972   -0.149    0.006    0.170   -0.067        1.000
School     -0.545   -0.715   -0.415   -0.140    0.001        1.000
Employ     -0.989   -0.005    0.089    0.083    0.085        1.000
Health     -0.847    0.352    0.344   -0.200   -0.022        1.000
Home        0.303   -0.797    0.523    0.005    0.002        1.000

Variance   3.0289   1.2911   0.5725   0.0954   0.0121       5.0000
% Var       0.606    0.258    0.114    0.019    0.002        1.000
Sorted Unrotated Factor Loadings and Communalities

Variable  Factor1  Factor2  Factor3  Factor4  Factor5  Communality
Employ     -0.989   -0.005    0.089    0.083    0.085        1.000
Pop        -0.972   -0.149    0.006    0.170   -0.067        1.000
Health     -0.847    0.352    0.344   -0.200   -0.022        1.000
Home        0.303   -0.797    0.523    0.005    0.002        1.000
School     -0.545   -0.715   -0.415   -0.140    0.001        1.000

Variance   3.0289   1.2911   0.5725   0.0954   0.0121       5.0000
% Var       0.606    0.258    0.114    0.019    0.002        1.000
Factor Score Coefficients

Variable  Factor1  Factor2  Factor3  Factor4  Factor5
Pop        -0.321   -0.116    0.011    1.782   -5.511
School     -0.180   -0.553   -0.726   -1.466    0.060
Employ     -0.327   -0.004    0.155    0.868    6.988
Health     -0.280    0.272    0.601   -2.098   -1.829
Home        0.100   -0.617    0.914    0.049    0.129
Cluster Observations
Cluster Observations
Stat > Multivariate > Cluster Observations Use clustering of observations to classify observations into groups when the groups are initially not known. This procedure uses an agglomerative hierarchical method that begins with all observations being separate, each forming its own cluster. In the first step, the two observations closest together are joined. In the next step, either a third observation joins the first two, or two other observations join together into a different cluster. This process continues until all clusters are joined into one; however, this single cluster is not useful for classification purposes. Therefore, you must decide how many groups are logical for your data and classify accordingly. See Determining the final cluster grouping for more information.
Deciding Which Distance Measures and Linkage Methods to Use Cluster Observations
Distance Measures
If you do not supply a distance matrix, Minitab's first step is to calculate an n x n distance matrix, D, where n is the number of observations. The matrix entry d(i, j), in row i and column j, is the distance between observations i and j. Minitab provides five methods to measure distance; choose the measure according to the properties of your data.
- The Euclidean method is the standard mathematical measure of distance (square root of the sum of squared differences).
- The Pearson method is the square root of the sum of squared differences divided by the variances; use it to standardize for differing scales.
- The Manhattan method is the sum of absolute differences, so outliers receive less weight than they would with the Euclidean method.
- The squared Euclidean and squared Pearson methods use the squares of the Euclidean and Pearson distances, respectively. Therefore, distances that are large under the Euclidean and Pearson methods will be even larger under the squared methods.
If you choose Average, Centroid, Median, or Ward as the linkage method, it is generally recommended [9] that you use one of the squared distance measures.
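A minimal sketch in Python (SciPy), outside Minitab, of these distance measures on placeholder data. Treating the Pearson distance as Euclidean distance on variance-scaled data is an interpretation of the description above, not Minitab's documented formula.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(4).normal(size=(8, 3))  # placeholder: 8 observations

d_euclidean = squareform(pdist(X, metric="euclidean"))
d_manhattan = squareform(pdist(X, metric="cityblock"))    # sum of absolute differences
d_sq_euclid = squareform(pdist(X, metric="sqeuclidean"))
# "Pearson" here: Euclidean distance after dividing each variable by its std. dev.
d_pearson = squareform(pdist(X / X.std(axis=0, ddof=1), metric="euclidean"))

print(d_euclidean.shape)  # the n x n matrix D described above
```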
Linkage methods
The linkage method that you choose determines how the distance between two clusters is defined. At each amalgamation stage, the two closest clusters are joined. At the beginning, when each observation constitutes a cluster, the distance between clusters is simply the inter-observation distance. Subsequently, after observations are joined together, a linkage rule is necessary for calculating inter-cluster distances when there are multiple observations in a cluster. You may wish to try several linkage methods and compare results; depending on the characteristics of your data, some methods may provide "better" results than others.
- With single linkage, or "nearest neighbor," the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. Single linkage is a good choice when clusters are clearly separated. When observations lie close together, single linkage tends to identify long chain-like clusters that can have a relatively large distance separating observations at either end of the chain [6].
- With average linkage, the distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster. Whereas the single or complete linkage methods group clusters based upon single pair distances, average linkage uses a more central measure of location.
- With centroid linkage, the distance between two clusters is the distance between the cluster centroids or means. Like average linkage, this method is another averaging technique.
- With complete linkage, or "furthest neighbor," the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. This method ensures that all observations in a cluster are within a maximum distance and tends to produce clusters with similar diameters. The results can be sensitive to outliers [10].
- With median linkage, the distance between two clusters is the median distance between an observation in one cluster and an observation in the other cluster. This is another averaging technique that uses the median rather than the mean, thus downweighting the influence of outliers.
- With McQuitty's linkage, when two clusters are joined, the distance of the new cluster to any other cluster is calculated as the average of the distances of the soon-to-be-joined clusters to that other cluster. For example, if clusters 1 and 3 are to be joined into a new cluster, say 1*, then the distance from 1* to cluster 4 is the average of the distances from 1 to 4 and from 3 to 4. Here, distance depends on a combination of clusters rather than on individual observations in the clusters.
- With Ward's linkage, the distance between two clusters is the sum of squared deviations from points to centroids. The objective of Ward's linkage is to minimize the within-cluster sum of squares. It tends to produce clusters with similar numbers of observations, but it is sensitive to outliers [10]. In Ward's linkage, it is possible for the distance between two clusters to be larger than dmax, the maximum value in the original distance matrix. If this happens, the similarity will be negative.
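The following Python (SciPy) sketch, outside Minitab, runs the same placeholder data through several of these linkage rules for comparison; SciPy's "weighted" method corresponds to McQuitty's linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(5).normal(size=(12, 4))  # placeholder data
d = pdist(X)  # Euclidean; in Minitab you would pick the distance measure in the dialog

for method in ("single", "complete", "average", "weighted",  # weighted = McQuitty
               "centroid", "median", "ward"):
    Z = linkage(d, method=method)
    print(f"{method:9s} final amalgamation distance: {Z[-1, 2]:.3f}")
```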
How do you know where to cut the dendrogram? You might first execute cluster analysis without specifying a final partition. Examine the similarity and distance levels in the Session window results and in the dendrogram. You can view the similarity levels by placing your mouse pointer over a horizontal line in the dendrogram. The similarity level at any step is 100 * (1 - d / dmax), where d is the distance at that step and dmax is the maximum inter-observation distance in the data. The pattern of how similarity or distance values change from step to step can help you to choose the final grouping. The step where the values change abruptly may identify a good point for cutting the dendrogram, if this makes sense for your data. After choosing where you wish to make your partition, rerun the clustering procedure, using either Number of clusters or Similarity level to give you either a set number of groups or a similarity level for cutting the dendrogram. Examine the resulting clusters in the final partition to see if the grouping seems logical. Looking at dendrograms for different final groupings can also help you to decide which one makes the most sense for your data.

Note: For some data sets, the average, centroid, median, and Ward's methods may not produce a hierarchical dendrogram. That is, the amalgamation distances do not always increase with each step. In the dendrogram, such a step will produce a join that goes downward rather than upward.
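A minimal sketch in Python (SciPy), outside Minitab, of that similarity computation and of cutting the tree at a chosen number of clusters; the data are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(6).normal(size=(12, 4))  # placeholder data
d = pdist(X)
Z = linkage(d, method="complete")

# Similarity at each amalgamation step: 100 * (1 - d_step / d_max),
# where d_max is the maximum inter-observation distance in the data.
similarity = 100 * (1 - Z[:, 2] / d.max())
print(np.round(similarity, 1))

labels = fcluster(Z, t=4, criterion="maxclust")  # final partition with 4 clusters
print(labels)
```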
Session window output Cluster Analysis of Observations: Protein, Carbo, Fat, Calories, VitaminA
Standardized Variables, Squared Euclidean Distance, Complete Linkage

Amalgamation Steps

      Number of  Similarity  Distance   Clusters    New      Number of obs.
Step  clusters   level       level      joined      cluster  in new cluster
  1      11          ...         ...     5  12         5            2
  2      10          ...         ...     3   5         3            3
  3       9       98.792      0.4347     3  11         3            4
  4       8       94.684      1.9131     6   8         6            2
  5       7       93.406      2.3730     2   3         2            5
  6       6       87.329      4.5597     7   9         7            2
  7       5       86.189      4.9701     1   4         1            2
  8       4       80.601      6.9810     2   6         2            7
  9       3       68.079     11.4873     2   7         2            9
 10       2       41.409     21.0850     1   2         1           11
 11       1        0.000     35.9870     1  10         1           12
Final Partition
Number of clusters: 4

          Number of     Within cluster  Average distance  Maximum distance
          observations  sum of squares  from centroid     from centroid
Cluster1        2           2.48505          1.11469           1.11469
Cluster2        7           8.99868          1.04259           1.76922
Cluster3        2           2.27987          1.06768           1.06768
Cluster4        1           0.00000          0.00000           0.00000
Cluster Centroids

Variable    Cluster1    Cluster2   Cluster3   Cluster4  Grand centroid
Protein      1.92825   -0.333458   -0.20297   -1.11636       0.0000000
Carbo       -0.75867    0.541908    0.12645   -2.52890       0.0000000
Fat          0.33850   -0.096715    0.33850   -0.67700       0.0000000
Calories     0.28031    0.280306    0.28031   -3.08337      -0.0000000
VitaminA    -0.63971   -0.255883    2.04707   -1.02353      -0.0000000
Distances Between Cluster Centroids

          Cluster1  Cluster2  Cluster3  Cluster4
Cluster1   0.00000   2.67275   3.54180   4.98961
Cluster2   2.67275   0.00000   2.38382   4.72050
Cluster3   3.54180   2.38382   0.00000   5.44603
Cluster4   4.98961   4.72050   5.44603   0.00000
Cluster Variables
Cluster Variables
Stat > Multivariate > Cluster Variables Use Clustering of Variables to classify variables into groups when the groups are initially not known. One reason to cluster variables may be to reduce their number. This technique may give new variables that are more intuitively understood than those found using principal components. This procedure is an agglomerative hierarchical method that begins with all variables separate, each forming its own cluster. In the first step, the two variables closest together are joined. In the next step, either a third variable joins the first two, or two other variables join together into a different cluster. This process will continue until all clusters are joined into one, but you must decide how many groups are logical for your data. See Determining the final grouping.
Deciding Which Distance Measures and Linkage Methods to Use Cluster Variables
Distance Measures
You can use correlations or absolute correlations for distance measures. With the correlation method, the (i, j) entry of the distance matrix is d_ij = 1 - r_ij, and for the absolute correlation method, d_ij = 1 - |r_ij|, where r_ij is the (Pearson product moment) correlation between variables i and j. Thus, the correlation method will give distances between 0 and 1 for positive correlations, and between 1 and 2 for negative correlations. The absolute correlation method will always give distances between 0 and 1. If it makes sense to consider negatively correlated data to be farther apart than positively correlated data, then use the correlation method. If you think that the strength of the relationship is important in considering distance, and not the sign, then use the absolute correlation method.
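A minimal sketch in Python (NumPy/SciPy), outside Minitab, of both correlation-based distances feeding a hierarchical clustering of variables; the data are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

X = np.random.default_rng(7).normal(size=(30, 6))  # 30 observations on 6 variables
R = np.corrcoef(X, rowvar=False)

d_corr = 1 - R           # 0..2: negative correlations count as far apart
d_abs = 1 - np.abs(R)    # 0..1: only the strength of the relationship matters

# squareform converts the symmetric matrix to the condensed form linkage expects
Z = linkage(squareform(d_corr, checks=False), method="average")
print(np.round(Z[:, 2], 4))  # amalgamation distances, as in the output below
```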
Linkage methods
The linkage method that you choose determines how the distance between two clusters is defined. At each amalgamation stage, the two closest clusters are joined. At the beginning, when each variable constitutes a cluster, the distance between clusters is simply the inter-variable distance. Subsequently, after variables are joined together, a linkage rule is necessary for calculating inter-cluster distances when there are multiple variables in a cluster. You may wish to try several linkage methods and compare results; depending on the characteristics of your data, some methods may provide "better" results than others.
- With single linkage, or "nearest neighbor," the distance between two clusters is the minimum distance between a variable in one cluster and a variable in the other cluster. Single linkage is a good choice when clusters are clearly separated. When variables lie close together, single linkage tends to identify long chain-like clusters that can have a relatively large distance separating variables at either end of the chain [6].
- With average linkage, the distance between two clusters is the mean distance between a variable in one cluster and a variable in the other cluster. Whereas the single or complete linkage methods group clusters based upon single pair distances, average linkage uses a more central measure of location.
- With centroid linkage, the distance between two clusters is the distance between the cluster centroids or means. Like average linkage, this method is another averaging technique.
- With complete linkage, or "furthest neighbor," the distance between two clusters is the maximum distance between a variable in one cluster and a variable in the other cluster. This method ensures that all variables in a cluster are within a maximum distance and tends to produce clusters with similar diameters. The results can be sensitive to outliers [10].
- With median linkage, the distance between two clusters is the median distance between a variable in one cluster and a variable in the other cluster. This is another averaging technique that uses the median rather than the mean, thus downweighting the influence of outliers.
- With McQuitty's linkage, when two clusters are joined, the distance of the new cluster to any other cluster is calculated as the average of the distances of the soon-to-be-joined clusters to that other cluster. For example, if clusters 1 and 3 are to be joined into a new cluster, say 1*, then the distance from 1* to cluster 4 is the average of the distances from 1 to 4 and from 3 to 4. Here, distance depends on a combination of clusters rather than on individual variables in the clusters.
- With Ward's linkage, the distance between two clusters is the sum of squared deviations from points to centroids. The objective of Ward's linkage is to minimize the within-cluster sum of squares. It tends to produce clusters with similar numbers of variables, but it is sensitive to outliers [10]. In Ward's linkage, it is possible for the distance between two clusters to be larger than dmax, the maximum value in the original distance matrix. If this happens, the similarity will be negative.
Session window output Cluster Analysis of Variables: Age, Years, Weight, Height, Chin, Forearm, ...
Correlation Coefficient Distance, Average Linkage

Amalgamation Steps

      Number of  Similarity  Distance   Clusters    New      Number of obs.
Step  clusters   level       level      joined      cluster  in new cluster
  1       9          ...          ...    6   7         6            2
  2       8          ...          ...    1   2         1            2
  3       7          ...          ...    5   6         5            3
  4       6          ...          ...    3   9         3            2
  5       5          ...          ...    3  10         3            3
  6       4          ...          ...    3   5         3            6
  7       3       61.3391     0.773218   3   8         3            7
  8       2       56.5958     0.868085   1   3         1            9
  9       1       55.4390     0.891221   1   4         1           10
Cluster K-Means
Cluster K-Means
Stat > Multivariate > Cluster K-Means Use K-means clustering of observations, like clustering of observations, to classify observations into groups when the groups are initially unknown. This procedure uses non-hierarchical clustering of observations according to MacQueen's algorithm [6]. K-means clustering works best when sufficient information is available to make good starting cluster designations.
Standardize variables: Check to standardize all variables by subtracting the means and dividing by the standard deviation before the distance matrix is calculated. This is a good idea if the variables are in different units and you wish to minimize the effect of scale differences. If you standardize, cluster centroids and distance measures are in standardized variable space. <Storage>
Unlike hierarchical clustering of observations, it is possible for two observations to be split into separate clusters after they are joined together. K-means procedures work best when you provide good starting points for clusters [10]. There are two ways to initialize the clustering process: specifying a number of clusters or supplying an initial partition column that contains group codes. You may be able to initialize the process when you do not have complete information to initially partition the data. Suppose you know that the final partition should consist of three groups, and that observations 2, 5, and 9 belong in each of those groups, respectively. Proceeding from here depends upon whether you specify the number of clusters or supply an initial partition column. If you specify the number of clusters, you must rearrange your data in the Data window to move observations 2, 5 and 9 to the top of the worksheet, and then specify 3 for Number of clusters. If you enter an initial partition column, you do not need to rearrange your data in the Data window. In the initial partition worksheet column, enter group numbers 1, 2, and 3, for observations 2, 5, and 9, respectively, and enter 0 for the other observations.
The final partition will depend to some extent on the initial partition that Minitab uses. You might try different initial partitions. According to Milligan [10], K-means procedures may not perform as well when the initializations are done arbitrarily. However, if you provide good starting points, K-means clustering may be quite robust.
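A minimal sketch of the seeded approach using scikit-learn's KMeans rather than MacQueen's algorithm as implemented in Minitab; the data are placeholders, and the rows of the known members serve as the initial cluster centers.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(8).normal(size=(20, 4))  # placeholder data
seeds = X[[1, 4, 8]]  # observations 2, 5, 9 (0-based indexing) seed the 3 groups

km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
print(km.labels_)  # final cluster membership for every observation
```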
10 Check Standardize variables.
11 Click Storage. In Cluster membership column, enter BearSize.
12 Click OK in each dialog box.
Session window output K-means Cluster Analysis: Head.L, Head.W, Neck.G, Length, Chest.G, Weight
Standardized Variables

Final Partition
Number of clusters: 3

          Number of     Within cluster  Average distance  Maximum distance
          observations  sum of squares  from centroid     from centroid
Cluster1       41           63.075           1.125             2.488
Cluster2       67           78.947           0.997             2.048
Cluster3       35           65.149           1.311             2.449

Cluster Centroids

Variable   Cluster1  Cluster2  Cluster3  Grand centroid
Head.L      -1.0673    0.0126    1.2261        -0.0000
Head.W      -0.9943   -0.0155    1.1943         0.0000
Neck.G      -1.0244   -0.1293    1.4476        -0.0000
Length      -1.1399    0.0614    1.2177         0.0000
Chest.G     -1.0570   -0.0810    1.3932        -0.0000
Weight      -0.9460   -0.2033    1.4974        -0.0000
Distances Between Cluster Centroids

          Cluster1  Cluster2  Cluster3
Cluster1    0.0000    2.4233    5.8045
Cluster2    2.4233    0.0000    3.4388
Cluster3    5.8045    3.4388    0.0000
Discriminant Analysis
Discriminant Analysis
Stat > Multivariate > Discriminant Analysis Use discriminant analysis to classify observations into two or more groups if you have a sample with known groups. Discriminant analysis can also be used to investigate how variables contribute to group separation. Minitab offers both linear and quadratic discriminant analysis. With linear discriminant analysis, all groups are assumed to have the same covariance matrix. Quadratic discrimination does not make this assumption, but its properties are not as well understood. In the case of classifying new observations into one of two categories, logistic regression may be superior to discriminant analysis [3], [11].
Cross-Validation
Cross-validation is one technique used to compensate for an optimistic apparent error rate. The apparent error rate is the percent of misclassified observations. This number tends to be optimistic because the data being classified are the same data used to build the classification function. The cross-validation routine works by omitting each observation one at a time, recalculating the classification function using the remaining data, and then classifying the omitted observation. Computation takes approximately four times longer with this procedure. When cross-validation is performed, Minitab displays an additional summary table. Another technique that you can use to calculate a more realistic error rate is to split your data into two parts. Use one part to create the discriminant function, and the other part as a validation set. Predict group membership for the validation set and calculate the error rate as the percent of these data that are misclassified.
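A minimal sketch of this leave-one-out idea using scikit-learn, outside Minitab; the two-group data are simulated, and LinearDiscriminantAnalysis stands in for Minitab's linear discriminant routine.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.repeat([0, 1], 30)  # two known groups

lda = LinearDiscriminantAnalysis()
apparent = lda.fit(X, y).score(X, y)                     # same data builds and tests
loo = cross_val_score(lda, X, y, cv=LeaveOneOut()).mean()
print(f"apparent error rate:        {1 - apparent:.3f}")
print(f"cross-validated error rate: {1 - loo:.3f}")
```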
Prior Probabilities
Sometimes items or subjects from different groups are encountered according to different probabilities. If you know or can estimate these probabilities a priori, discriminant analysis can use these so-called prior probabilities in calculating the posterior probabilities, or probabilities of assigning observations to groups given the data. With the assumption that the
data have a normal distribution, the linear discriminant function is increased by ln(pi), where pi is the prior probability of group i. Because observations are assigned to groups according to the smallest generalized distance, or equivalently the largest linear discriminant function, the effect is to increase the posterior probabilities for a group with a high prior probability. Now suppose we have priors and suppose fi(x) is the joint density for the data in group i (with the population parameters replaced by the sample estimates). The posterior probability is the probability of group i given the data and is calculated by
$$P(\text{group } i \mid x) = \frac{p_i f_i(x)}{\sum_j p_j f_j(x)}$$

The largest posterior probability is equivalent to the largest value of $\ln[p_i f_i(x)]$. If $f_i(x)$ is the normal density with mean vector $m_i$ and pooled covariance matrix $S$, then

$$\ln[p_i f_i(x)] = -\frac{1}{2}\left[(x - m_i)' S^{-1} (x - m_i) - 2 \ln p_i\right] - \text{(a constant)}$$

The term in square brackets is called the generalized squared distance of $x$ to group $i$ and is denoted by $d_i^2(x)$. Notice,

$$-\frac{1}{2} d_i^2(x) = \left[m_i' S^{-1} x - \frac{1}{2} m_i' S^{-1} m_i + \ln p_i\right] - \frac{1}{2} x' S^{-1} x$$

The term in square brackets is the linear discriminant function. The only difference from the non-prior case is a change in the constant term. Notice, the largest posterior is equivalent to the smallest generalized distance, which is equivalent to the largest linear discriminant function.
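A minimal numerical sketch in Python (NumPy) of the formulas above; the group means, pooled covariance, and priors are made-up values, not Minitab output.

```python
import numpy as np

def linear_discriminant_scores(x, means, pooled_cov, priors):
    """m_i' S^-1 x - 0.5 m_i' S^-1 m_i + ln(p_i) for each group i."""
    Sinv = np.linalg.inv(pooled_cov)
    return np.array([m @ Sinv @ x - 0.5 * m @ Sinv @ m + np.log(p)
                     for m, p in zip(means, priors)])

# made-up group means, pooled covariance, and prior probabilities
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pooled = np.eye(2)
L = linear_discriminant_scores(np.array([1.0, 0.5]), means, pooled, priors=[0.7, 0.3])

posterior = np.exp(L) / np.exp(L).sum()  # the shared -0.5 x'S^-1 x term cancels
print("assigned group:", L.argmax(), " posteriors:", np.round(posterior, 3))
```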
- Above plus mean, std. dev., and covariance summary: Choose to display the classification matrix, the squared distance between group centers, the linear discriminant function, a summary of misclassified observations, and the means, standard deviations, and covariance matrices for each group and pooled.
- Above plus complete classification summary: Choose to display all of the above plus a summary of how all observations were classified. Minitab marks misclassified observations with two asterisks beside the observation number.
Summary of classification

                  True Group
Put into Group   Alaska  Canada
Alaska               44       1
Canada                6      49
Total N              50      50
N correct            44      49
Proportion        0.880   0.980

N = 100    N Correct = 93    Proportion Correct = 0.930
Squared Distance Between Groups

          Alaska   Canada
Alaska   0.00000  8.29187
Canada   8.29187  0.00000
Linear Discriminant Function for Groups

              Alaska  Canada
Constant     -100.68  -95.14
Freshwater      0.37    0.50
Marine          0.38    0.33
Summary of Misclassified Observations

Observation  True Group  Pred Group  Group   Squared Distance  Probability
   1**       Alaska      Canada      Alaska        3.544           0.428
                                     Canada        2.960           0.572
 ...**       Alaska      Canada      Alaska        8.1131          0.019
                                     Canada        0.2729          0.981
 ...**       Alaska      Canada      Alaska        4.7470          0.118
                                     Canada        0.7270          0.882
 ...**       Alaska      Canada      Alaska        4.7470          0.118
                                     Canada        0.7270          0.882
 ...**       Alaska      Canada      Alaska        3.230           0.289
                                     Canada        1.429           0.711
 ...**       Alaska      Canada      Alaska        2.271           0.464
                                     Canada        1.985           0.536
 ...**       Canada      Alaska      Alaska        2.045           0.948
                                     Canada        7.849           0.052
<Storage>
Supplementary data
When performing a simple correspondence analysis, you have a main classification set of data on which you perform your analysis. However, you may also have additional or supplementary data in the same form as the main set, and you may want to see how these supplementary data are "scored" using the results from the main set. These supplementary data may be further information from the same study, information from other studies, or target profiles [4]. Minitab does not include these data when calculating the components, but you can obtain a profile and display supplementary data in graphs. You can have row supplementary data or column supplementary data. Row supplementary data constitute one or more additional rows of the contingency table, while column supplementary data constitute one or more additional columns. Supplementary data must be entered in contingency table form. Therefore, each worksheet column of these data must contain c entries (where c is the number of contingency table columns) or r entries (where r is the number of contingency table rows).
If you have three or four categorical variables, you must cross some variables before entering data as shown above. See Crossing variables to create a two-way table.
If you like, use any dialog box options, then click OK.
Crossing variables allows you to use simple correspondence analysis to analyze three-way and four-way contingency tables. You can cross the first two variables to form rows and/or the last two variables to form columns. You must enter three categorical variables to perform one cross, and four categorical variables to perform two crosses. In order to cross columns, you must choose Categorical variables for Input Data rather than Columns of a contingency table in the main dialog box. If you want to cross for either just the rows or for just the columns of the contingency table, you must enter three worksheet columns in the Categorical variables text box. If you want to cross both the rows and the columns of the table, you must specify four worksheet columns in this text box.
- Column profiles: Check to display a table of column profiles and column masses.
- Expected frequencies: Check to display a table of the expected frequency in each cell of the contingency table.
- Observed - expected frequencies: Check to display a table of the observed minus the expected frequency in each cell of the contingency table.
- Chi-square values: Check to display a table of the χ² value in each cell of the contingency table.
- Inertias: Check to display a table of the relative inertia in each cell of the contingency table.
In all plots, row points are plotted with red circles: solid circles for regular points and open circles for supplementary points. Column points are plotted with blue squares: solid squares for regular points and open squares for supplementary points.
A row plot is a plot of row principal coordinates. A column plot is a plot of column principal coordinates. A symmetric plot is a plot of row and column principal coordinates in a joint display. An advantage of this plot is that the profiles are spread out for better viewing of distances between them. The row-to-row and column-to-column distances are approximate distances between the respective profiles. However, this same interpretation cannot be made for row-to-column distances: because the rows and columns are scaled in two different mappings, you must interpret these plots carefully [4].
An asymmetric row plot is a plot of row principal coordinates and of column standardized coordinates in the same plot. Distances between row points are approximate distances between the row profiles. Choose the asymmetric row plot over the asymmetric column plot if rows are of primary interest. An asymmetric column plot is a plot of column principal coordinates and row standardized coordinates. Distances between column points are approximate distances between the column profiles. Choose an asymmetric column plot over an asymmetric row plot if columns are of primary interest. An advantage of asymmetric plots is that there can be an intuitive interpretation of the distances between row points and column points, especially if the two displayed components represent a large proportion of the total inertia [4]. Suppose you have an asymmetric row plot, as shown in Example of simple correspondence analysis. This graph plots both the row profiles and the column vertices for components 1 and 2. The closer a row profile is to a column vertex, the higher the row profile is with respect to the column category. In this example, of the row points, Biochemistry is closest to column category E, implying that biochemistry as a discipline has the highest percentage of unfunded researchers in this study. A disadvantage of asymmetric plots is that the profiles of interest are often bunched in the middle of the graph [4], as happens with the asymmetric plot of this example.
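For readers who want to see the decomposition these plots are built from, here is a minimal sketch in Python (NumPy), outside Minitab, of simple correspondence analysis via an SVD of the standardized residuals. The small contingency table is made up for illustration; row principal and column standardized coordinates are computed as described above.

```python
import numpy as np

# made-up 3 x 3 two-way contingency table
N = np.array([[20.0, 10.0,  5.0],
              [10.0, 15.0, 10.0],
              [ 5.0, 10.0, 20.0]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses

S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

inertia = sv ** 2  # total inertia = chi-square statistic / n
print("proportion per axis:", np.round(inertia / inertia.sum(), 4))

row_principal = (U * sv) / np.sqrt(r)[:, None]  # coordinates used in row plots
col_standard = Vt.T / np.sqrt(c)[:, None]       # coordinates used in asymmetric plots
print(np.round(row_principal[:, :2], 3))
```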
Session window output Simple Correspondence Analysis: CT1, CT2, CT3, CT4, CT5
Row Profiles

                   A      B      C      D      E   Mass
Geology        0.035  0.224  0.459  0.165  0.118  0.107
Biochemistry   0.034  0.069  0.448  0.034  0.414  0.036
Chemistry      0.046  0.192  0.377  0.162  0.223  0.163
Zoology        0.025  0.125  0.342  0.292  0.217  0.151
Physics        0.088  0.193  0.412  0.079  0.228  0.143
Engineering    0.034  0.125  0.284  0.170  0.386  0.111
Microbiology   0.027  0.162  0.378  0.135  0.297  0.046
Botany         0.000  0.140  0.395  0.198  0.267  0.108
Statistics     0.069  0.172  0.379  0.138  0.241  0.036
Mathematics    0.026  0.141  0.474  0.103  0.256  0.098

Mass           0.039  0.161  0.389  0.162  0.249
Analysis of Contingency Table

Axis   Inertia  Proportion  Cumulative  Histogram
1       0.0391      0.4720      0.4720  ******************************
2       0.0304      0.3666      0.8385  ***********************
3       0.0109      0.1311      0.9697  ********
4       0.0025      0.0303      1.0000  *
Total   0.0829
Row Contributions

                                            Component 1
ID  Name           Qual   Mass   Inert   Coord   Corr   Contr
 1  Geology       0.916  0.107  0.137  -0.076  0.055   0.016
 2  Biochemistry  0.881  0.036  0.119  -0.180  0.119   0.030
 3  Chemistry     0.644  0.163  0.021  -0.038  0.134   0.006
 4  Zoology       0.929  0.151  0.230   0.327  0.846   0.413
 5  Physics       0.886  0.143  0.196  -0.316  0.880   0.365
 6  Engineering   0.870  0.111  0.152   0.117  0.121   0.039
 7  Microbiology  0.680  0.046  0.010  -0.013  0.009   0.000
 8  Botany        0.654  0.108  0.067   0.179  0.625   0.088
 9  Statistics    0.561  0.036  0.012  -0.125  0.554   0.014
10  Mathematics   0.319  0.098  0.056  -0.107  0.240   0.029

                   Component 2
ID  Name           Coord   Corr   Contr
 1  Geology       -0.303  0.861   0.322
 2  Biochemistry   0.455  0.762   0.248
 3  Chemistry     -0.073  0.510   0.029
 4  Zoology       -0.102  0.083   0.052
 5  Physics       -0.027  0.006   0.003
 6  Engineering    0.292  0.749   0.310
 7  Microbiology   0.110  0.671   0.018
 8  Botany         0.039  0.029   0.005
 9  Statistics    -0.014  0.007   0.000
10  Mathematics    0.061  0.079   0.012
Supplementary Rows

                                         Component 1             Component 2
ID  Name     Qual   Mass   Inert   Coord   Corr  Contr    Coord   Corr  Contr
 1  Museums  0.556  0.067  0.353   0.314  0.225  0.168   -0.381  0.331  0.318
 2  MathSci  0.559  0.134  0.041  -0.112  0.493  0.043    0.041  0.066  0.007
Column Contributions
             Component 1             Component 2
ID  Name   Coord   Corr  Contr    Coord   Corr  Contr
 1  A     -0.478  0.574  0.228   -0.072  0.013  0.007
 2  B     -0.127  0.286  0.067   -0.173  0.531  0.159
 3  C     -0.083  0.341  0.068   -0.050  0.124  0.032
 4  D      0.390  0.859  0.632   -0.139  0.109  0.103
 5  E      0.032  0.012  0.006    0.292  0.978  0.699
Analysis of Contingency Table. Of the total inertia, 47.2% is accounted for by the first component, 36.66% by the second component, and so on. Here, 65.972 is the χ² statistic you would obtain if you performed a χ² test of association with this contingency table.

Row Contributions. You can use the third table to interpret the different components. Since the number of components was not specified, Minitab calculates 2 components. The column labeled Qual, or quality, is the proportion of the row inertia represented by the two components. The rows Zoology and Geology, with quality = 0.929 and 0.916, respectively, are best represented among the rows by the two-component breakdown, while Mathematics has the poorest representation, with a quality value of 0.319. The column labeled Mass has the same meaning as in the Row Profiles table: the proportion of the class in the whole data set. The column labeled Inert is the proportion of the total inertia contributed by each row. Thus, Geology contributes 13.7% to the total χ² statistic. The column labeled Coord gives the principal coordinates of the rows. The column labeled Corr represents the contribution of the component to the inertia of the row. Thus, Component 1 accounts for most of the inertia of Zoology and Physics (Corr = 0.846 and 0.880, respectively), but explains little of the inertia of Microbiology (Corr = 0.009). Contr, the contribution of each row to the axis inertia, shows that Zoology and Physics contribute the most, with Botany contributing to a smaller degree, to Component 1. Geology, Biochemistry, and Engineering contribute the most to Component 2.
Next, Minitab displays information for each of the two components (axes).
Supplementary rows. You can interpret this table in a similar fashion as the row contributions table.

Column Contributions. The fifth table shows that two components explain most of the variability in funding categories B, D, and E. The funded categories A, B, C, and D contribute most to Component 1, while the unfunded category, E, contributes most to Component 2.

Row Plot. This plot displays the row principal coordinates. Component 1, which best explains Zoology and Physics, shows these two classes well removed from the origin, but with opposite sign. Component 1 might be thought of as contrasting the biological sciences Zoology and Botany with Physics. Component 2 might be thought of as contrasting Biochemistry and Engineering with Geology.

Asymmetric Row Plot. Here, the rows are scaled in principal coordinates and the columns are scaled in standard coordinates. Among funding classes, Component 1 contrasts levels of funding, while Component 2 contrasts being funded (A to D) with not being funded (E). Among the disciplines, Physics tends to have the highest funding level and Zoology the lowest. Biochemistry tends to be in the middle of the funding level, but highest among unfunded researchers. Museums tend to be funded, but at a lower level than academic researchers.
<Results> <Graphs> <Storage>
Supplementary data
When performing a multiple correspondence analysis, you have a main classification set of data on which you perform your analysis. However, you may also have additional or supplementary data in the same form as the main set, and you might want to see how these supplementary data are "scored" using the results from the main set. These supplementary data are typically a classification of your variables that can help you to interpret the results. Minitab does not include these data when calculating the components, but you can obtain a profile and display supplementary data in graphs. Set up your supplementary data in your worksheet using the same form, either raw data or indicator variables, as you did for the input data. Because your supplementary data provide additional information about your observations, your supplementary data column(s) must be the same length as your input data.
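As a rough illustration of the indicator-matrix form described in the overview (one 0/1 column per level of each categorical variable), here is a sketch in Python (pandas), outside Minitab. The variable names echo the example that follows, but the rows are made up.

```python
import pandas as pd

# four categorical variables echoing the example below; rows are made up
raw = pd.DataFrame({
    "CarWt":    ["Small", "Standard", "Small", "Standard"],
    "DrEject":  ["NoEject", "NoEject", "Eject", "NoEject"],
    "AccType":  ["Collis", "Rollover", "Collis", "Collis"],
    "AccSever": ["Severe", "NoSevere", "Severe", "NoSevere"],
})

indicator = pd.get_dummies(raw)  # one 0/1 column per level of each variable
print(indicator.columns.tolist())
print(indicator.to_numpy().astype(int))
```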
Session window output Multiple Correspondence Analysis: CarWt, DrEject, AccType, AccSever
Analysis of Indicator Matrix

Axis   Inertia  Proportion  Cumulative  Histogram
1       0.4032      0.4032      0.4032  ******************************
2       0.2520      0.2520      0.6552  ******************
3       0.1899      0.1899      0.8451  **************
4       0.1549      0.1549      1.0000  ***********
Total   1.0000
Column Contributions

                                             Component 1             Component 2
ID  Name       Qual   Mass   Inert   Coord   Corr  Contr    Coord   Corr  Contr
 1  Small     0.965  0.042  0.208   0.381  0.030  0.015   -2.139  0.936  0.771
 2  Standard  0.965  0.208  0.042  -0.078  0.030  0.003    0.437  0.936  0.158
 3  NoEject   0.474  0.213  0.037  -0.284  0.472  0.043   -0.020  0.002  0.000
 4  Eject     0.474  0.037  0.213   1.659  0.472  0.250    0.115  0.002  0.002
 5  Collis    0.613  0.193  0.057  -0.426  0.610  0.087    0.034  0.004  0.001
 6  Rollover  0.613  0.057  0.193   1.429  0.610  0.291   -0.113  0.004  0.003
 7  NoSevere  0.568  0.135  0.115  -0.652  0.502  0.143   -0.237  0.066  0.030
 8  Severe    0.568  0.115  0.135   0.769  0.502  0.168    0.280  0.066  0.036
Next, Minitab displays information for each of the two components (axes).
Contr, the contribution of each column to the axis inertia, shows Eject and Rollover contributing the most to Component 1 (Contr = 0.250 and 0.291, respectively). Component 2, on the other hand, accounts for 93.6% of the inertia of the car-size categories, with Small contributing 77.1% of the axis inertia.
Column Plot. As the contribution values for Component 1 indicate, Eject and Rollover are most distant from the origin. This component contrasts Eject and Rollover, and to some extent Severe, with NoSevere. Component 2 separates Small from the other categories. Two components may not adequately explain the variability of these data, however.