Methods in Psychology
journal homepage: www.sciencedirect.com/journal/methods-in-psychology
ARTICLE INFO

Keywords: Discriminant correspondence analysis; DICA; MUDICA; Mental health literacy

ABSTRACT

Psychological research often involves complex datasets that cannot easily be analyzed using traditional statistical methods. Multiblock Discriminant Correspondence Analysis (multiblock DICA, also called MUDICA) examines group differences in large, structured categorical datasets and identifies blocks of variables that contribute to these differences. Data for this illustration were obtained from a study on mental health literacy (N = 648) that included 33 questions that were arranged into four blocks: etiology, symptoms, treatment, and general knowledge of psychological disorders. With non-parametric inference tests and results displayed as intuitive maps, MUDICA revealed differences in performance across groups not readily detectable using standard methods.
Psychological research often involves the simultaneous examination of a large number of behavioral, physiological, and demographic variables. These variables might be quantitative or qualitative and could individually or collectively be associated with differences among populations of interest. Often, such variables are analyzed with specific statistical methods whose main goals are to: (1) determine if there are reliable group differences (e.g., clinical versus control groups); (2) predict information for new individuals (e.g., group assignment); and/or (3) examine relationships between different variables (e.g., independence of attributes).

1. Analyzing quantitative and qualitative data

Quantitative data include variables that represent amounts and may be discrete or continuous. Such data are analyzed (usually one variable at a time) via two seemingly different yet statistically equivalent methods: Analysis of Variance (ANOVA) and Regression. However, both ANOVA and regression can only handle datasets with one quantitative dependent variable at a time, irrespective of the number of quantitative or qualitative independent variables or factors. When datasets contain more than one quantitative dependent variable (i.e., multivariate datasets), ANOVA or regression are often performed on each dependent variable separately and are followed by corrections for multiple comparisons, a procedure that in turn can result in low statistical power. Alternatively, multivariate datasets can be analyzed with methods such as multivariate ANOVA (MANOVA) or linear discriminant analysis (LDA). However, such methods can only handle datasets with many more observations than variables and the variables themselves cannot be multicollinear (i.e., the variables cannot be linearly related).

Qualitative data include variables that describe observations (e.g., demographic variables, survey responses) and may be categorical (also called nominal) or ordinal. Such data are analyzed with methods that examine the association between variables (e.g., χ² test of independence) or predict group assignment (e.g., binomial or multinomial logistic regression). However, the χ² test of independence can only examine the association between two categorical variables, while logistic regression can only be applied to datasets with many more observations than (non-collinear) variables. In addition, most methods that analyze qualitative data cannot examine fine-grained relationships among observations and variables (e.g., main effects and interactions).

Almost all the traditional methods mentioned above (that analyze either quantitative or qualitative data) predominantly use parametric
Abbreviations: ANOVA, Analysis of Variance; BADA, Barycentric Discriminant Analysis; CA, Correspondence Analysis; DICA, Discriminant Correspondence
Analysis; MANOVA, Multivariate Analysis of Variance; MCA, Multiple Correspondence Analysis; MUDICA, Multiblock Discriminant Correspondence Analysis; PCA,
Principal Component Analysis.
* Corresponding author. 2900 Bedford Avenue, Brooklyn, NY, 11210, USA.
E-mail address: [email protected] (A. Krishnan).
https://doi.org/10.1016/j.metip.2022.100100
Received 3 April 2022; Received in revised form 21 September 2022; Accepted 21 September 2022
Available online 27 September 2022
2590-2601/© 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
A. Krishnan et al. Methods in Psychology 7 (2022) 100100
inferential tests, which depend on specific assumptions about the data (e.g., normality). Furthermore, results from most traditional methods are often presented in a counter-intuitive manner and can be difficult to interpret (Wasserstein et al., 2019), particularly for very large datasets that require multiple levels of analysis (e.g., hierarchical linear regression).

Though there exist alternative methods that can address the aforementioned limitations of traditional methods and handle large, qualitative (i.e., categorical) datasets, these alternative methods are not widely applied in psychological research. One such method is correspondence analysis [CA; Cordier (1965)], which is a multivariate method specifically developed to handle categorical data. In its simplest form, CA (1) analyzes a contingency table that cross tabulates the frequency of observations according to two categorical variables, and (2) represents the relationships of these variables by maps where the proximity between the levels of the variables expresses their association (Hair et al., 2009). Since its introduction in the 1960s, the CA family has grown to include other variants such as multiple correspondence analysis [MCA; Lebart et al. (1984); see Guttman (1941) for an earlier version], which simultaneously examines more than two categorical variables at a time, and discriminant correspondence analysis [DICA; Abdi (2007); Saporta and Keita (2006)], which evaluates group differences.

1.1. Current work

In this work, we present advances in the application of a particular variant of discriminant correspondence analysis: multiblock discriminant correspondence analysis (also called MUDICA), a method that can handle large, structured categorical datasets with multicollinear variables that are arranged as blocks (e.g., academic variables, demographic variables). We expand upon the original presentation of multiblock DICA, where the method was first introduced to analyze the relationship between types of social communication patterns among individuals with Dementia of the Alzheimer’s Type (DAT) and their spouses. Specifically, the authors evaluated group differences between patients with varying severity of DAT and examined the contribution of variables (i.e., the number of occurrences of a given communication pattern) that were arranged into two specific blocks (i.e., patient-initiated and spouse-initiated communication patterns). Williams et al. (2010) showed how MUDICA can include hypothesis testing to address clinical research questions involving categorical variables even when the data comprise many more variables than observations (a pattern often called the “N ≪ P” problem).

While MUDICA was first used more than a decade ago (albeit in the field of communication disorders), the method has largely remained on the sidelines and has not been fully applied in psychological research. With the availability of better data visualization tools and customizable statistical software, we present an up-to-date account of MUDICA including how to normalize (i.e., scale) variable blocks, partial out a confounding variable, examine main effects and interactions, and evaluate a posteriori group differences with non-parametric inferential tests.

We use MUDICA to analyze data obtained from a recent study (Miles et al., 2020), to examine age and gender effects on mental health literacy, a concept that refers to knowledge and beliefs that facilitate the identification and management of psychological disorders (Jorm et al., 1997). With MUDICA, we simultaneously assess how different participant groups respond to questions on various topics in mental health literacy such as etiology, symptoms, treatment, and general knowledge of disorders. In doing so, we identify problematic topics for specific groups, which will provide a direction for mental health literacy education.

2. Background

There are various ways in which MUDICA can be used to extract information from a large dataset. In this paper, we provide an illustrative example that incorporates all the steps of MUDICA with an emphasis on how to interpret the numerous maps that are generated by the analysis. The underlying technique for MUDICA used in this article is MCA (Lebart et al., 1984; also see Guttman, 1941, for an earlier version), a method derived from CA (Cordier, 1965) that extends principal component analysis [PCA; Hotelling (1933)] to analyze categorical data. MCA can handle more than two categorical variables at a time and shares the geometric properties that are characteristic of CA (Husson et al., 2017), such as analyzing observation profiles rather than absolute frequencies across variables. However, increasing the number of variables in MCA makes the geometric representation of profiles for MCA less intuitive than CA (for more details on CA geometry see Abdi and Williams, 2022; Husson et al., 2017; Phillips and Phillips, 2009). In fact, the application of MCA for categorical data is similar to how PCA is applied for quantitative data (see Abdi and Williams, 2010 for a detailed illustration of PCA). Specifically, as with PCA, MCA condenses information from a large dataset by combining the originally correlated categorical variables into new, uncorrelated quantitative variables called factors, components, or dimensions. These dimensions reveal how observations differ from each other and which variables contribute to the differences. In addition, the relationship between individual observations and variables can be displayed on a single map, a unique feature of the CA family of methods (Lebart and Saporta, 2014).

DICA, a particular variant of CA and MCA, analyzes differences between categories of observations that are evaluated on multiple variables (Abdi, 2007; Saporta and Keita, 2006), while MUDICA (Williams et al., 2010), a particular variant of DICA, analyzes differences between categories of observations that are evaluated on multiple blocks of variables. MUDICA offers numerous advantages to analyze large, structured categorical datasets from different perspectives. First, fine-grained analyses that expose complex relationships between observations, variables, groups, and blocks can be simultaneously performed. Second, relationships between the observations, variables, groups, and blocks are presented in the form of intuitive maps that reveal underlying patterns in the data. Third, while early variants of CA were exploratory and required experienced visual interpretation of the maps, now, with superior computing power, relevant non-parametric inferential testing procedures generate objective or confirmatory results that can also be displayed on the same maps. Fourth, because these non-parametric testing procedures do not rely on parametric assumptions of standard statistical models (e.g., normality), traditional hypothesis testing is still possible with MUDICA. Fifth, MUDICA is not affected by multicollinearity because it does not rely on matrix inversion, which is the basis of traditional methods such as linear or logistic regression. In summary, MUDICA is an ideal tool to analyze complex datasets where traditional methods cannot ordinarily be employed.

We used MUDICA for our data to: (1) examine the variability in mental health literacy among participants based on age and gender; (2) identify variability among participant groups due to differences in the types of mental health literacy questions; and (3) display descriptive and inferential results in the form of intuitive maps that are easily interpreted. In addition, we used a new conditioned version of MUDICA to partial out the main effect of gender and separately control for the effect of clinical coursework on mental health literacy.

3. Methods

Data for this illustration were obtained from a study involving 663 undergraduate students who answered 33 multiple choice questions on various topics related to mental health literacy (Miles et al., 2020; Rabin et al., 2021). Based on previous literature on age differences in mental health literacy (Farrer et al., 2008), participants in this study were divided into two groups: (1) traditional college students (≤ 24 years) and (2) non-traditional college students or adult learners (≥ 25 years). Furthermore, taking into account known gender differences in mental health literacy (Wong, 2016), the two age groups were stratified by male and female gender, a process resulting in a total of four participant
groups. We excluded from the analysis a total of 15 participants with missing or inadequate data (seven who did not report their age, three who did not report their gender, and five who had answered fewer than four questions correctly). The final sample of 648 participants included 303 females and 208 males in the ≤ 24 age group and 94 females and 43 males in the ≥ 25 age group, respectively.

The 33 multiple choice questions were further classified into four blocks: etiology (4 questions), treatment (5 questions), general knowledge (8 questions), symptoms (16 questions). Based on content considerations, questions were further identified by specific domains (e.g., suicide, childhood disorders, medication). Each multiple choice question had five answer choices and only one possible correct answer (see Rabin et al., 2021, for sample questions). The main variables of interest for this paper were responses to each question (correct or incorrect), age (≤ 24 years or ≥ 25 years) and gender (male or female), while clinical coursework (yes or no) was a supplementary variable of interest.

3.1. Data recoding

Often, datasets represent quantities (e.g., score on a test, number of correct questions) that are, in fact, qualitative or an aggregate of qualitative variables. In our example, a participant could obtain a score between 0 and 33 depending on how many questions were answered correctly. The total score (e.g., 18 out of 33) is a quantitative variable, but the response to each question (i.e., correct versus incorrect) is a qualitative variable. There are two approaches to analyze such datasets. One approach is to examine the differences in participant groups by analyzing the absolute quantity (e.g., an independent sample t-test between males and females with total score as a single quantitative dependent variable). Another approach is to examine the differences between participant groups by analyzing the pattern of responses (e.g., with responses to each question as multiple qualitative dependent variables). In our example, we examined such patterns with MUDICA by representing participant responses with complete disjunctive coding [Nakache (1973); see Data analysis section below for details], where each possible categorical response level (i.e., correct or incorrect) of the qualitative variable was uniquely expressed in the analysis.

In addition, large datasets often have variables that are organized into blocks, where the blocks collectively offer more information than individual variables. Ideally, to examine group differences based on blocks of variables, as in our example, the analysis should give the same importance to a block of four variables (e.g., the etiology block) as to a block of sixteen variables (e.g., the symptoms block). If the size of the block is ignored, then the block with sixteen variables is more likely to influence the results as compared to the block with four variables. So, to prevent the analysis from being driven by the number of variables within a block, we incorporated block normalization with MUDICA such that all blocks were given an equivalent importance (see Data analysis section below for details).

3.2. Data analysis

Below we describe the different steps of MUDICA (see Fig. 1 for a schematic diagram and the Appendix for mathematical details). In addition, the Results section highlights how these different steps were implemented for our example dataset with an emphasis on how to interpret the different maps generated by MUDICA. All statistical analyses were performed in the R programming language (R Core Team, 2020) using the TExPosition (Beaton et al., 2014a, 2014b) and the ggplot2 (Wickham, 2016) packages (for additional R code for CA and MCA, see Husson et al., 2017).

3.2.1. Step 1: Data organization

The original categorical data are appropriately recoded for analysis. The recoded data are arranged with observations (identified by their groups) on the rows and variables (normalized within blocks) on the columns (see Fig. 2 for specific details on data recoding and normalization).

In our example, a question was either answered correctly (i.e., right, R) or incorrectly (i.e., wrong, W), so each question was described as {R, W}, where ‘R’ and ‘W’ indicated a particular response for each question. Numerically, each question was coded with 1s and 0s, where 1 indicated the presence of a particular response and 0 indicated the absence of a particular response. Specifically, if a participant answered a question correctly, then the question was coded as {1, 0}, and if the participant answered a question incorrectly, then the question was coded as {0, 1}. The final dataset contained 648 participants whose responses to 33 questions could either be correct or incorrect, thus creating a table with
Fig. 1. Schematic diagram of the steps for MUDICA [adapted from Williams et al. (2010)]: (1) The original data are recoded for analysis; (2) DICA is performed on a group × variable contingency table and the resulting dimensions are displayed as maps; (3) Contributions of blocks of variables are quantified in the dimensional space; (4) Inference tests are conducted to examine group differences, determine reliability of dimensions, and predict group assignment.
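As a rough illustration of steps (2) and (4) of the schematic, the sketch below collapses observations into a group × variable table, computes the total inertia of that table (the chi-square statistic divided by the grand total, which equals the sum of the eigenvalues of a plain CA of the table), and runs a permutation test on it. This is a minimal sketch in Python under stated assumptions, not the R code used for the analyses reported here: the function names are hypothetical, block normalization and the MUDICA-specific weights are left out, and the reordering is implemented by shuffling group labels. Under that shuffling the observation-level total variance does not change, so ranking permutations by between-group inertia should give the same ordering as ranking them by an R² defined as between-group over total variance.

```python
import random
from typing import List, Sequence, Tuple


def group_table(rows: Sequence[Sequence[float]], labels: Sequence[str]) -> List[List[float]]:
    """Sum observation rows within each group: one row per group."""
    groups = sorted(set(labels))
    sums = {g: [0.0] * len(rows[0]) for g in groups}
    for row, g in zip(rows, labels):
        for j, value in enumerate(row):
            sums[g][j] += value
    return [sums[g] for g in groups]


def inertia(table: Sequence[Sequence[float]]) -> float:
    """Total inertia of a contingency table (chi-square divided by the grand total)."""
    total = sum(sum(r) for r in table)
    row_mass = [sum(r) / total for r in table]
    col_mass = [sum(c) / total for c in zip(*table)]
    value = 0.0
    for i, r in enumerate(table):
        for j, x in enumerate(r):
            expected = row_mass[i] * col_mass[j]
            if expected > 0:
                value += (x / total - expected) ** 2 / expected
    return value


def permutation_p_value(rows, labels, n_perm: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Shuffle group labels (an assumption about how the data are reordered)
    and count how often the permuted inertia reaches the observed one."""
    rng = random.Random(seed)
    observed = inertia(group_table(rows, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if inertia(group_table(rows, shuffled)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): two questions, disjunctively coded as {R, W}.
toy_rows = [[1, 0, 1, 0]] * 5 + [[0, 1, 0, 1]] * 5
toy_labels = ["A"] * 5 + ["B"] * 5
observed_inertia, p_value = permutation_p_value(toy_rows, toy_labels, n_perm=200)
```

In this toy table the two groups are perfectly separated, so the observed inertia is 1.0, the maximum for a table with two rows; only label shuffles that reproduce the same split can match it, which is why the permutation p-value comes out small.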
Fig. 2. Data organization for MUDICA: (1) All variables are represented by their levels of possible responses; (2) Original data are re-coded disjunctively; (3) Variables
are normalized within blocks such that for each row, the responses for all questions in a particular block sum to 1.
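The recoding and normalization summarized in Fig. 2 can be sketched in a few lines. The example below is a minimal sketch in Python (not the R code used for the analyses reported here), and the function names `disjunctive_code`, `normalize_blocks`, and `block_sums` are hypothetical. It uses the seven-variable layout of Fig. 2, with blocks of one, two, and four variables.

```python
from typing import List, Sequence


def disjunctive_code(responses: Sequence[bool]) -> List[int]:
    """Code each answer as the pair {R, W}: correct -> 1, 0 and incorrect -> 0, 1."""
    coded: List[int] = []
    for correct in responses:
        coded.extend([1, 0] if correct else [0, 1])
    return coded


def normalize_blocks(coded: Sequence[float], block_sizes: Sequence[int]) -> List[float]:
    """Divide the columns of each block by the number of variables in that block."""
    normalized: List[float] = []
    start = 0
    for n_variables in block_sizes:
        width = 2 * n_variables  # two columns (R, W) per variable
        normalized.extend(v / n_variables for v in coded[start:start + width])
        start += width
    return normalized


def block_sums(values: Sequence[float], block_sizes: Sequence[int]) -> List[float]:
    """Row sum within each block, used to check the effect of normalization."""
    sums: List[float] = []
    start = 0
    for n_variables in block_sizes:
        width = 2 * n_variables
        sums.append(sum(values[start:start + width]))
        start += width
    return sums


# One hypothetical participant, seven questions in blocks B1, B2, B3.
sizes = [1, 2, 4]
answers = [True, False, True, True, False, True, True]
coded_row = disjunctive_code(answers)            # 14 columns
normalized_row = normalize_blocks(coded_row, sizes)
sums_before = block_sums(coded_row, sizes)       # one entry per block
sums_after = block_sums(normalized_row, sizes)
```

Before normalization the block sums equal the block sizes (1, 2, and 4), and after normalization every block sums to 1 regardless of which questions were answered correctly, which is the property the caption describes.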
648 rows and 66 columns (i.e., number of questions × levels of response). While this type of complete disjunctive coding ensures that all levels of a variable are represented in the analysis, multicollinearity, by definition, is automatically introduced into the dataset. This is because one level of each variable can be derived directly from the other level of this variable. For example, if a participant answers a question correctly, it automatically implies that the question was not answered incorrectly. However, with complete disjunctive coding both responses (i.e., 1 = presence of correct response and 0 = absence of incorrect response) are represented in the analysis, making each variable (with its levels) a multicollinear set. Fortunately, MUDICA is not affected by multicollinearity because the analysis does not involve a matrix inversion step that is necessary for other methods such as linear or logistic regression and linear discriminant analysis (see Härdle & Simar, 2019).

3.2.1.1. Block normalization. There are different types of block normalization procedures (see Abdi et al., 2012b, for examples) whose goal is to ensure that each variable within a block is given equal importance and that all blocks in the analysis are also given an appropriate importance. For example, consider a dataset with seven variables, each with two possible responses (i.e., whether the question was answered correctly or incorrectly). These seven variables are organized into three blocks, where the first block (B1) contains one variable, the second block (B2) contains two variables, and the third block (B3) contains four variables (see Fig. 2, Step 1). When these variables are disjunctively coded, each variable is represented as {R, W} and numerically coded as {1, 0} for a right answer and {0, 1} for a wrong answer (see Fig. 2, Step 2). With this coding schema, across each row, the sum of values within a block indicates the number of variables in that block. So, for B1, the sum across each row is 1 (i.e., one variable in the block), for B2 the sum across each row is 2 (i.e., two variables in the block), and for B3 the sum across each row is 4 (i.e., four variables in the block). Ideally, despite the differences in the number of variables in each block, B1, B2, and B3 should contribute equally to the analysis. Therefore, to ensure this equal contribution, we normalize (i.e., scale) the blocks by dividing the disjunctively coded variables within a block by the total number of variables in that block. With this approach, the sum across each row for B1, B2, and B3 is 1 (see Fig. 2, Step 3), a configuration indicating that each block as a whole will contribute equally to the analysis irrespective of how many variables are present in that block.

In our example, there were four blocks with a different number of questions per block: etiology (4 questions), treatment (5 questions), general knowledge (8 questions), symptoms (16 questions). After the responses were disjunctively coded, each block was normalized by the number of questions in this block (i.e., the etiology block by 4, the treatment block by 5, the general knowledge block by 8, and the symptoms block by 16). In this way, we ensured that no particular block preferentially influenced the analysis just by its number of questions.

3.2.2. Step 2: Discriminant correspondence analysis

Observations are collapsed into group barycenters (i.e., summed within groups), so there are as many rows as there are groups. The data table itself becomes a contingency table that contains the frequency of occurrence of every level of each variable for each group of observations. This contingency table is the input for correspondence analysis, which in turn transforms the table into two sets of factors (also called dimensions): one set for the groups (and observations) and one set for the variables. The first dimension explains the largest possible variance in the data. The second dimension, which is uncorrelated (i.e., orthogonal) to the first dimension, explains the second largest possible variance in the data. All subsequent dimensions are computed as such, each with a decreasing amount of variance explained. Pairs of dimensions are geometrically represented on a map with each dimension as an axis and group barycenters (and observations) and variables as points on these maps, where points close to each other are similar and points far away from each other are dissimilar [Abdi and Williams (2022); see the Appendix for mathematical details].

3.2.2.1. Conditioned analyses. Conditioned analyses (not displayed in Fig. 1) are used to partial out the effect of a single categorical variable that might contribute to the variability in the data but might not be directly relevant to the overall analysis. For conditioned MCA (Escofier, 1988), such an effect is algebraically removed from the dataset prior to performing MCA, and the resulting dimensions are interpreted in the same way as in a plain MCA (see the Appendix for mathematical details). Conditioned MUDICA extends conditioned MCA where the effect of a single categorical variable is removed before performing the MUDICA. Conditioned analyses can be used for various purposes such as to examine experimental effects in the absence of an interfering or confounding factor or to examine interaction effects in the absence of an overshadowing main effect.

In our example, participants with previous coursework in clinical psychology have been previously shown to have an advantage in a mental health literacy assessment because of their exposure to such topics in their curriculum (Miles et al., 2020). With conditioned MUDICA, the effect of age and gender on performance in the mental health literacy assessment can be examined after partialling out the effect of clinical coursework, which, by itself, is not one of the primary variables of interest.

3.2.2.2. Supplementary data. Supplementary data can be any data that were not included in the original analysis. Supplementary observations are observations described by the same variables as the original dataset (e.g., new participants who take the same mental health literacy assessment) and supplementary variables are variables that are measured on the same observations in the original dataset (e.g., clinical coursework, college major). These supplementary data are simply projected onto the dimensions generated by MUDICA or conditioned MUDICA,
and therefore, these supplementary data do not influence the analysis but can be useful to better interpret the dimensions and perform inferential analyses (described in more detail below).

3.2.3. Step 3: Block contributions

The relationship between the normalized blocks of variables and groups of observations is displayed in the dimension space (see Results section and the Appendix for more details). The effect of each normalized block is separately quantified to reveal how these blocks contribute to group differences on each dimension (Williams et al., 2010).

3.2.4. Step 4: Inferential tests

There are different inferential analysis steps for MUDICA. The first step is to evaluate whether the variance explained in the sample reflects the real variance explained in the population (akin to a null hypothesis test). For this, MUDICA uses permutation tests (Berry et al., 2011) to (1) evaluate whether there is an overall difference between groups, and, if such a difference exists, (2) identify the dimensions responsible for these group differences. The second step is to examine the stability of group differences and identify variables that reliably contribute to these differences, for which MUDICA uses bootstrap tests (Efron and Tibshirani, 1993; Hesterberg, 2011). The specific implementation of these tests for our example data is further elaborated in the Results section.

In addition, MUDICA uses cross-validation analyses such as the leave-one-out (LOO) procedure to evaluate the quality of group assignment. In the LOO procedure, each observation is excluded from the dataset one at a time and the left-out observation is then projected as a supplementary observation onto the dimensions generated by the MUDICA model (which was created with the other observations). Then, the distance of the projected observation from each of the group barycenters is computed and the observation is assigned to the closest group.

4. Results

In this paper, we illustrate how MUDICA can be used to examine group differences in mental health literacy based on age and gender, and we identify blocks of variables that drive these group differences. Below, we present each analysis in detail along with its methodological relevance, with an emphasis on how to interpret the numerous maps generated by MUDICA [for additional interpretation on specific analyses, see Williams et al. (2010); for mathematical details see the Appendix].

4.1. What do dimension-based methods give us?

Dimension-based methods extract new, uncorrelated variables, also called dimensions, from datasets. Each dimension explains a specific amount of variance (called an eigenvalue and denoted by λ), and the sum of all the eigenvalues gives the overall variance of the dataset. The proportion of variance explained (denoted by τ) by each dimension is the ratio of the variance explained by this dimension to the total variance. The dimensions are viewed as maps, where the observations, groups, variables, and blocks are plotted as points on this map so that the variability in the dataset can be visually inspected and interpreted based on proximity of the points on the map.

In order to better understand the total variance explained in our example, we first performed an MCA (see the Appendix for mathematical details) on the overall dataset (i.e., 648 rows by 66 columns). The MCA generated nine dimensions of which the first dimension revealed that differences in total scores explained most of the variance (i.e., τ1 = 99%, λ1 = 0.022). To examine the variance in responses on Dimension 1 (Fig. 3, read from left to right on the horizontal axis), we identified participants based on their total scores (min = 4 and max = 33) and split the distribution into three groups and identified them by colors [e.g., scores between 0 and 11 (red), 12–22 (orange), and 23–33 (green)]. Dimension 1 can be interpreted as differences in total scores, with participants who were likely to answer most questions correctly represented on the left side of Dimension 1 and participants who were likely to answer most questions incorrectly represented on the right side of Dimension 1. It should be noted that the color-coded arrow depicted in Fig. 3 will be used for subsequent figures to indicate the direction of maximum variance from high scores (in green) on the left to low scores (in red) on the right.

4.2. Multiblock discriminant correspondence analysis

For MUDICA, the 648 observations were categorized into four groups stratified by two genders (males and females) and two age-groups (≤ 24 and ≥ 25 years), and the 66 variables were organized into four blocks. To prevent any particular block from dominating the analysis, each block was normalized by the number of questions within the block: etiology (4 questions), treatment (5 questions), general knowledge (8 questions), symptoms (16 questions).

When observations are categorized in a priori groups, MUDICA uses this group information to extract dimensions that maximize the variance between groups (and so optimizes group assignment). These dimensions are represented as a map with the group barycenters (or means) as points on this map. The similarity between two groups is interpreted based on the proximity of points on the maps: the closer the points, the more similar the groups and the farther the points, the more dissimilar the groups. Individual observations are also represented as points on these maps, and, to predict group assignment, a boundary is drawn around all the participants from a particular group anchored by the respective group barycenter. The boundary, called a convex hull, connects the outermost participants for this group and is sensitive to outliers (Greenacre and Blasius, 2006). Often, the convex hulls are peeled to only contain a given proportion (e.g., 95%) of the participants within the group and are known as peeled convex hulls. When peeled convex hulls are drawn around participants included in the original analysis (i.e., a fixed effect model), these hulls are called tolerance intervals.

In our example, MUDICA generated a total of three dimensions that, together, depicted the differences in performance of the four participant groups within the dimensional space (Fig. 4a). The most discriminant dimension, Dimension 1, had λ1 = 0.013 and explained τ1 = 81% of the total variance, the second-most discriminant dimension, Dimension 2, had λ2 = 0.002 and explained τ2 = 12% of the total variance, and the least discriminant dimension, Dimension 3, had λ3 = 0.001 and explained τ3 = 7% of the total variance (the total variance is given by λ1 + λ2 + λ3 = 0.016). As the first two dimensions together accounted for
A. Krishnan et al. Methods in Psychology 7 (2022) 100100
Fig. 5. Block normalization: Top panel shows the respective contribution of each block with (left) and without (right) block normalization. Bottom panel shows the
partial effect of the four blocks (with block normalization) for each participant group, where the direction of the lines (i.e., towards the left or right) indicates the
performance on questions within a particular block.
whereas in Fig. 7d, a symptom question on dementia was more likely to be answered correctly by ≥ 25 year old females than by any of the other groups.

4.3. Inference procedures

The overall explained variance from a MUDICA is used to compute an R² statistic, which is the ratio of the between-group variance to the total variance (Beaton et al., 2014a). To determine whether the explained variance reflects a true difference in the population (akin to a null hypothesis test), MUDICA uses a permutation procedure, where the original dataset is reordered so that the inherent relationship between observations and variables is broken. For this reordered (or permuted) dataset, a new R² statistic is computed. This procedure is repeated a large number of times (e.g., 1000) and a distribution of the R² statistic is generated and used to determine the probability of obtaining the original R² under the assumptions of the null hypothesis (separate tests are conducted for the whole dimensional space and for each dimension). If this probability is
Fig. 6. Variable contributions: Top panel shows variable contributions for Dimension 1 and bottom panel shows variable contributions for Dimension 2 (threshold is
set at ~1.5%).
smaller than .05 or 5%, then the variance explained is considered to reflect a true value different from 0. In our example, the overall R² was .07 (p < .001), a probability value small enough to indicate a statistically significant, albeit small, effect.
Further, to illustrate the quality of group assignment, MUDICA generates prediction intervals, which are the random effect version of the tolerance intervals mentioned earlier. Prediction intervals are computed using the LOO procedure (see Methods section), where each observation is
Fig. 7. Variables by block: Variables are represented by factor scores in the multivariate space and can either be displayed all at once or separately by blocks. Panels identify variables within, respectively, blocks of (a) etiology, (b) treatment, (c) symptoms and general knowledge, and (d) questions that contributed to the differences in scores between participant groups.
excluded from the dataset and is projected onto its LOO subspace (i.e., the multivariate space created by a MUDICA using the other observations). Next, this observation is reconstituted from its LOO subspace projections. Then, the reconstituted observation is projected as a supplementary observation onto the multivariate space of the original MUDICA (for specifics, see Equations 19 to 24, page 1391ff in Williams et al., 2010). Finally, a peeled convex hull is drawn around a given proportion (e.g., 95%) of the reconstituted observations (i.e., a random effect model) for a particular group, and this hull is called a prediction interval. In our example, Fig. 8a shows prediction intervals for each group, which almost completely overlap with each other, indicating a low accuracy in group assignment (i.e., we cannot accurately determine age or gender of the participants based on their performance on the assessment).
MUDICA uses a bootstrap procedure to determine the stability of group differences. In the bootstrap procedure, the original dataset is assumed to represent the entire population of interest and is therefore used to recreate samples that are similar to samples drawn from the original population. For the bootstrap procedure, observations from the original dataset are resampled with replacement to generate a new sample called a bootstrap sample, and group means are computed for this bootstrap sample. This process is repeated a large number of times (e.g., 1000), and a set of group means is generated for each of these bootstrap samples, which are each projected as supplementary observations onto the multivariate space of the original MUDICA. An ellipsoid is then drawn around a given proportion (e.g., 95%) of the bootstrapped group means and represents the confidence interval for each group mean. If the ellipsoids around the group means do not overlap, this indicates that the groups (i.e., the group means) reliably differ in the dimensional space. If the ellipsoids around two group means do overlap, this indicates that the groups do not reliably differ for the given dimensions. In our example (Fig. 8b, results provided in Table 1), the confidence intervals for the ≤ 24 year old males and females do not overlap, a configuration that indicates that the pattern of responses from these two groups reliably differed. However, the confidence intervals for the ≥ 25 year old males
Table 1
Normalized multiblock DICA category-level statistics.
Category N Dimension 1 Dimension 2
Factor Scores Contributions (%) Bootstrap Ratios Factor Scores Contributions (%) Bootstrap Ratios
Note: Bootstrap ratios above/below ± 1.96 are considered reliable at p < .05 and are shown in bold face.
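The permutation and bootstrap procedures described in Section 4.3 can be sketched in a few lines of Python. This is a simplified illustration and not the authors' implementation: it assumes one numeric score per participant in place of the multivariate factor scores, and it assumes the usual definition of a bootstrap ratio (the mean of the bootstrapped values divided by their standard deviation). All function names and data values are hypothetical.

```python
import random
from statistics import mean, stdev


def r_squared(scores, groups):
    """Between-group sum of squares divided by the total sum of squares."""
    grand = mean(scores)
    total = sum((s - grand) ** 2 for s in scores)
    between = 0.0
    for label in set(groups):
        members = [s for s, g in zip(scores, groups) if g == label]
        between += len(members) * (mean(members) - grand) ** 2
    return between / total


def permutation_p_value(scores, groups, n_perm=1000, seed=0):
    """Shuffle the group labels to break the link between scores and groups."""
    rng = random.Random(seed)
    observed = r_squared(scores, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so that re-orderings of the same partition count as ties.
        if r_squared(scores, shuffled) >= observed - 1e-12:
            hits += 1
    # The observed statistic is counted as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


def bootstrap_ratio(values, n_boot=1000, seed=0):
    """Mean of the bootstrapped means divided by their standard deviation."""
    rng = random.Random(seed)
    boot_means = [mean(rng.choices(values, k=len(values))) for _ in range(n_boot)]
    return mean(boot_means) / stdev(boot_means)


# Hypothetical toy data: two clearly separated groups of five participants.
scores = [0.9, 1.1, 1.0, 1.2, 0.8, -1.0, -0.9, -1.1, -1.2, -0.8]
groups = ["A"] * 5 + ["B"] * 5
observed_r2, p_value = permutation_p_value(scores, groups)

ratio_far_from_zero = bootstrap_ratio(scores[:5])
ratio_near_zero = bootstrap_ratio([0.3, -0.2, 0.1, -0.3, 0.1])
```

A bootstrap ratio is then compared with the ± 1.96 threshold given in the note to Table 1, in the same way that a t statistic would be.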
Fig. 9. Bootstrap ratio tests: Top panel shows the bootstrap ratio values for Dimension 1 and the bottom panel shows the bootstrap ratio values for Dimension 2
(threshold at p < .05, which corresponds to a bootstrap ratio value of ~± 2, akin to a t-test).
main effect of gender was 0.003, a process resulting in an 81.25% reduction in the total variance of the dataset, a reduction which indicates a very large effect of gender on performance. When the main effect of gender was removed, the effect of the interaction between age and gender (previously overshadowed by the main effect of gender) was clearly revealed. The interaction effect showed that differences in performance between males and females in the ≤ 24 year old age group were no longer reliable, an effect implying that performance was less likely to be driven by gender and more likely to be driven by age. In addition, as mentioned earlier, when the main effect of gender was present, females were more likely than males to answer etiology questions correctly (see Fig. 5b), but once the gender effect was removed, knowledge of etiology was no longer important for identifying differences in performance between the groups (Fig. 11b).
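The variance figures reported in the text can be checked with simple arithmetic. The sketch below uses only the rounded values printed in this paper (eigenvalues of 0.013, 0.002, and 0.001; total variances of 0.016, 0.003, and 0.010); the function names are hypothetical. Because the printed eigenvalues are rounded to three decimals, the recomputed proportions can differ slightly from the reported τ values.

```python
def proportion_explained(eigenvalues):
    """tau for each dimension: its eigenvalue divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]


def variance_reduction(total_before, total_after):
    """Share of the total variance that disappears once an effect is removed."""
    return (total_before - total_after) / total_before


# Eigenvalues of the original analysis as printed in the text.
tau = proportion_explained([0.013, 0.002, 0.001])

# Total variance before (0.016) and after removing the main effect of gender (0.003).
reduction_gender = variance_reduction(0.016, 0.003)

# Total variance before (0.016) and after removing clinical coursework (0.010).
reduction_coursework = variance_reduction(0.016, 0.010)

print([round(100 * t, 2) for t in tau])
print(round(100 * reduction_gender, 2), round(100 * reduction_coursework, 2))
```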
Table 2
Normalized multiblock DICA variable-level statistics.
Block             Content Domain            Response    N         Dimension 1                   Dimension 2
                                                             Factor  Contrib. Bootstrap    Factor  Contrib. Bootstrap
                                                             scores       (%)    ratios    scores       (%)    ratios
Treatment         Anxiety Disorders         R         336     −0.12      2.80     −3.05     −0.04      2.61      2.80
                  Anxiety Disorders         W         312      0.13      3.01      3.03      0.05      2.81      3.01
                  Dementia                  R         537     −0.06      1.06     −3.16     −0.01      0.08      1.06
                  Dementia                  W         111      0.28      5.15      3.22      0.03      0.39      5.15
                  General                   R         406     −0.03      0.19     −0.94      0.03      1.71      0.19
                  General                   W         242      0.05      0.32      0.94     −0.05      2.86      0.32
                  Schizophrenia             W         205      0.29      9.64      4.78      0.04      1.44      9.64
                  Schizophrenia             R         443     −0.13      4.46     −4.64     −0.02      0.67      4.46
                  Suicide                   W         453      0.05      0.55      1.84      0.02      0.81      0.55
                  Suicide                   R         195     −0.11      1.28     −1.85     −0.05      1.89      1.28
General Knowledge Anxiety Disorders         W         305      0.10      1.14      2.53      0.01      0.10      1.14
                  Anxiety Disorders         R         343     −0.09      1.02     −2.55     −0.01      0.09      1.02
                  Anxiety Medication (1)    W         360      0.09      0.95      2.32     −0.01      0.08      0.95
                  Anxiety Medication (1)    R         288     −0.11      1.19     −2.36      0.01      0.10      1.19
                  Anxiety Medication (2)    R         436     −0.01      0.02     −0.41      0.03      1.06      0.02
                  Anxiety Medication (2)    W         212      0.02      0.03      0.41     −0.06      2.17      0.03
                  Eating Disorders          R         145     −0.18      1.69     −2.76      0.25     23.23      1.69
                  Eating Disorders          W         503      0.05      0.49      2.71     −0.07      6.70      0.49
                  General (1)               W         371      0.08      0.80      2.34      0.01      0.03      0.80
                  General (1)               R         277     −0.10      1.07     −2.33     −0.01      0.04      1.07
                  General (2)               R         419     −0.07      0.64     −2.18     −0.04      1.57      0.64
                  General (2)               W         229      0.12      1.17      2.18      0.07      2.87      1.17
                  Substance Use Disorder    R         233     −0.06      0.33     −1.14     −0.10      5.22      0.33
                  Substance Use Disorder    W         415      0.04      0.19      1.14      0.05      2.93      0.19
                  Suicide                   R         553     −0.01      0.03     −0.75     −0.01      0.13      0.03
                  Suicide                   W          95      0.07      0.17      0.74      0.06      0.76      0.17
Symptoms          Anxiety Disorders         W         157      0.11      0.34      1.54     −0.02      0.08      0.34
                  Anxiety Disorders         R         491     −0.03      0.11     −1.53      0.01      0.03      0.11
                  Anxiety Disorders         R         485     −0.06      0.36     −2.75      0.04      0.94      0.36
                  Anxiety Disorders         W         163      0.19      1.07      2.78     −0.12      2.80      1.07
                  Bipolar                   R         238     −0.14      0.81     −2.92      0.00      0.00      0.81
                  Bipolar                   W         410      0.08      0.47      2.89      0.00      0.00      0.47
                  Bipolar                   W         435      0.06      0.30      2.25      0.00      0.00      0.30
                  Bipolar                   R         213     −0.13      0.62     −2.27     −0.01      0.01      0.62
                  Childhood Disorders       R         343     −0.03      0.06     −0.84      0.04      0.54      0.06
                  Childhood Disorders       W         305      0.04      0.07      0.84     −0.04      0.61      0.07
                  Dementia                  R         168     −0.19      1.09     −3.10     −0.21      9.43      1.09
                  Dementia                  W         480      0.07      0.38      3.05      0.07      3.30      0.38
                  Depression                R         365     −0.16      1.67     −4.57      0.04      0.66      1.67
                  Depression                W         283      0.21      2.16      4.61     −0.05      0.85      2.16
                  Eating Disorders          W         347      0.15      1.46      4.16      0.02      0.11      1.46
                  Eating Disorders          R         301     −0.18      1.68     −4.15     −0.02      0.13      1.68
                  Gender                    W         178      0.13      0.52      2.10     −0.07      0.96      0.52
                  Gender                    R         470     −0.05      0.20     −2.09      0.02      0.36      0.20
                  OCD                       W         463      0.10      0.80      3.91      0.00      0.00      0.80
                  OCD                       R         185     −0.24      1.99     −4.07      0.00      0.01      1.99
                  Personality Disorder (1)  R         297     −0.08      0.32     −1.82      0.05      0.90      0.32
                  Personality Disorder (1)  W         351      0.07      0.27      1.82     −0.04      0.76      0.27
                  Personality Disorder (2)  R         382     −0.09      0.57     −2.81     −0.04      0.88      0.57
                  Personality Disorder (2)  W         266      0.13      0.82      2.83      0.06      1.26      0.82
                  PTSD                      W         391      0.09      0.56      2.95      0.00      0.00      0.56
                  PTSD                      R         257     −0.14      0.85     −2.99      0.00      0.00      0.85
                  Schizophrenia             R         174     −0.08      0.18     −1.20     −0.10      1.98      0.18
                  Schizophrenia             W         474      0.03      0.07      1.20      0.04      0.73      0.07
                  Sexual Disorder           R         438     −0.08      0.45     −2.70     −0.04      0.92      0.45
                  Sexual Disorder           W         210      0.16      0.94      2.74      0.09      1.91      0.94
                  Somatic Symptom Disorder  R         431     −0.06      0.27     −2.08      0.01      0.08      0.27
                  Somatic Symptom Disorder  W         217      0.12      0.54      2.08     −0.02      0.16      0.54
Note: Bootstrap ratios above/below ± 1.96 are considered reliable at p < .05 and are shown in bold face. R = correct (i.e., right) response; W = incorrect (i.e., wrong) response. Questions within each block of variables are identified by content domain.
Fig. 10. Supplementary data: Top panel shows the effect of having taken a clinical course, middle panel shows finer grained age effects with categories not included in the analysis, and bottom panel shows finer grained age and gender categories to illustrate the interaction effect.
Fig. 11. Partialling out the main effect of gender: Dimension 1 represents age differences in performance and Dimension 2 represents the specific difference in performance between ≥ 25 males and females, which was overshadowed by the strong main effect of gender (top panel). Bottom panel shows the contribution of each block, which reveals that in the absence of a gender effect, questions on etiology no longer drive differences between participant scores along Dimension 1.
4.5.2. Partialling out the effect of a particular confound
As noted above, taking clinical coursework could be examined as a supplementary variable, where results showed that there were reliable group differences in performance based on whether or not participants had clinical coursework. Therefore, in order to examine age and gender differences on mental health literacy in the absence of clinical coursework, we used conditioned MUDICA to remove the effect of clinical coursework from the dataset and then examined the effect of age and gender on performance. After accounting for clinical coursework, Dimension 1 had λ1 = 0.007, and τ1 = 69%, Dimension 2 had λ2 = 0.002, and τ2 = 20%, and Dimension 3 had λ3 = 0.001, and τ3 = 11%; together these three dimensions accounted for 100% of the total variance (for this analysis the total variance was λ1 + λ2 + λ3 = 0.010).
The total variance before removing the effect of clinical coursework was 0.016 (as mentioned earlier), and the total variance after removing the effect of clinical coursework was 0.010, resulting in a 37.5% reduction of the total variance in the dataset. When the effect of taking a clinical course was removed (Fig. 12a; supplementary age and gender categories displayed), the response patterns for ≥ 25 year old males were no longer similar to the response patterns for the ≥ 25 year old females. Instead, the response patterns for ≥ 25 year old males now appeared more similar to the response patterns for the ≤ 24 year old males, implying that, in the absence of clinical coursework, males in general were more likely to have lower levels of mental health literacy than females, as indicated by the strong main effect of gender described earlier. Fig. 12b shows the important variable contributions for Dimensions 1 and 2, where (compared to Fig. 6a and b) the absence of a clinical psychology course resulted in lower contributions of questions on specific disorders and higher contributions of general questions on disorders.

5. Discussion

Psychological research involves datasets with multiple data types (e.g., survey measures, physiological measures, behavioral measures). However, most traditional statistical methods such as ANOVA or regression limit the scope of the analysis to one or few questions about the data and, so, often ignore the richness of the datasets. MUDICA has previously been shown to better handle categorical datasets and extract from the data findings that were not readily detectable using traditional statistical methods (Williams et al., 2010). Here, we present new advances in MUDICA including: (1) examining main effects and interactions and predicting group assignment; (2) quantifying the contributions of blocks of
Fig. 12. Partialling out the confounding effect of clinical coursework: Partialling out clinical coursework has a specific effect on a sub-group of older male participants (i.e., 30+ years), where the effects are better seen with supplementary group means (top panel). Bottom panel shows the important variables that contribute to differences in scores between participant groups along Dimensions 1 and 2.
variables and categories of observations; and (3) partialling out the effect of a single confounding variable to reveal other weaker, yet important, underlying effects.
For this paper, we used data from a mental health literacy research study to illustrate the application of MUDICA to examine 33 variables, arranged into four blocks, that were collected on 648 participants who, in turn, were classified into four groups. The goal of the analysis was to examine group differences based on age and gender in mental health literacy and identify variables that contributed to these differences. However, the dataset contained categorical data that were multicollinear, a configuration precluding the use of traditional methods of analyses. MUDICA offered a middle ground, where the elegance of traditional methods such as ANOVA or regression was preserved (i.e., to examine main effects and interaction or predict group assignment), but in a single model and without the need for corrections for multiple comparisons. MUDICA also went a step further by employing non-parametric inference testing methods that do not rely on the assumptions required for ANOVA or regression (e.g., normality), and included additional approaches to examine specific interaction effects, supplementary data, and contributions of blocks of variables.
In our example, MUDICA revealed finer details about age and gender effects on mental health literacy, and identified the individual variables and blocks of variables that contributed to the differences between groups. Specifically, while there was a strong gender effect, with female participants performing better than male participants overall, there was also a clear interaction effect where the differences between males and females were driven by the differences in the ≥ 25 year age-group. This difference between males and females in the ≥ 25 year age-group was amplified when controlling for variance due to clinical coursework, an effect implying that the gender difference in mental health literacy is affected by clinical coursework offered at the college level.

6. Limitations

While MUDICA has many advantages, it also has limitations to be
considered. A first limitation is that, for datasets with a large number of variables, MUDICA could identify dimensions on which almost all variables contribute, making such dimensions difficult to interpret. However, when data are structured into blocks, examining block contributions could facilitate a more effective interpretation of a dimension. There is also ongoing research on how to effectively simplify interpretation of dimensions by forcing variables to either have very high or very low contributions on particular dimensions (i.e., sparse solutions), where such dimensions can be optimized to explain as much variance as possible while retaining a simple dimensional structure (Guillemot et al., 2019; Yu, 2021; Yu et al., in press 2022).
A second, often reported, limitation of methods such as MCA and MUDICA is that they only analyze categorical variables and therefore, quantitative variables have to be appropriately transformed (e.g., binned or categorized) if they are to be examined along with other categorical variables. While methods such as Barycentric Discriminant Analysis (Abdi et al., 2017) and Multiblock Barycentric Discriminant Analysis (Abdi et al., 2012a) exist for quantitative variables, and deliver the same advantages as DICA and MUDICA deliver for categorical variables, research is currently underway to better integrate information from both quantitative and categorical variables within the same dimension-based model (Beaton et al., 2019a, 2019b). However, if the goal of the analysis is to generalize a pattern of responses as opposed to studying individual differences, and the loss of statistical power is minimal, the conversion of some types of quantitative variables into categorical variables (e.g., via binning or using domain-specific cut-offs) is acceptable (Benzécri, 1973).
A third limitation of dimension-based methods relates to missing data. In general, if there are only a few missing responses for any variable, then one approach is to replace the missing values by the profile of this variable (i.e., the average probability of responses after excluding the missing data). Another approach is to predict plausible values for the missing data while taking into account similarities between the observations and the relationship between variables (Josse and Husson, 2016). A third approach is to impute missing responses based on the original disjunctively coded dataset and iteratively reconstruct the data until convergence (see Husson et al., 2017, for more details). When there is a large number of missing responses, a common practice is to include an additional level for any variable where multiple observations have missing values, and this level is included in the analysis (Husson et al., 2017). An examination of such data will reveal whether the responses were missing randomly or systematically. Often, participants with randomly missing responses are removed from the analysis. Other approaches to impute data such as regularized iterative imputations that have been developed for MCA (Josse et al., 2012) can also be applied to MUDICA.
A fourth limitation is that, with conditioned MUDICA, only one effect (e.g., a main effect, confounding effect) can be removed at a time, particularly if the effects are not orthogonal (i.e., uncorrelated) with the other effects. This limitation is being addressed in recent work by combining the underlying method for MUDICA with other techniques such as partial least squares regression (Beaton et al., 2019a, 2019b), where layers of the dataset can be systematically removed in order to study other smaller, yet important, effects within the data (Escofier and Pagès, 2016).
Finally, methods such as MUDICA generate numerous maps that are designed to intuitively reveal patterns of information from large datasets. However, to be able to accurately interpret these maps requires substantial practice in reading such maps along with an adequate understanding of the dataset and relevant domain expertise (e.g., mental health literacy for this paper).

7. Software

MUDICA is based on correspondence analysis, which is available in most proprietary software programs including SAS [PROC CORRESP; Inc (2017)], SPSS [module Categories; Meulman and Heiser (2011)], and Matlab (from author HA’s homepage at: https://personal.utdallas.edu/herve/). In addition, CA and MCA are also available in various packages from open-source software languages such as Python (MCA 1.0.3, n.d.) and R (e.g., FactoMineR, ade4, ExPosition). In fact, R also incorporates aspects of DICA in some of its packages. The version of CA used in this article was implemented in R using the TExPosition package (Beaton et al., 2014b) and the maps were created with the ggplot2 package (Wickham, 2016). However, none of the above software programs have specific code to perform MUDICA or any of its various steps (e.g., block projections, conditioned analysis), which have to be specially coded (see Williams et al., 2010 for Matlab code for MUDICA). The step-by-step R code (with the example dataset used in this paper) is available from author AK’s GitHub webpage at: https://github.com/anjkrishnan/multiblockDICA, and the R code for the specific analyses and creation of the maps is available from author HA’s GitHub webpage at: https://github.com/HerveAbdi/PTCA4CATA and https://github.com/HerveAbdi/data4PCCAR.

8. Conclusion

We illustrated the application of MUDICA in a mental health literacy study where the goal was to identify differences between groups of participants (i.e., ≤ 24 males, ≤ 24 females, ≥ 25 males, ≥ 25 females) based on 33 questions from a mental health literacy questionnaire that were arranged in blocks (i.e., etiology, symptoms, treatment, general knowledge). Results from MUDICA were displayed as maps representing different aspects of the data: group differences, group assignment, individual variable contributions, contributions of blocks, and underlying interaction effects in the absence of a main effect or a confound, along with relevant inferential testing procedures. For our example, MUDICA revealed that while a strong gender effect exists in mental health literacy (overall, females have higher mental health literacy than males), this gender effect masks underlying interactions between age and gender. Specifically, both males and females in the ≤ 24 age group are more likely to have low mental health literacy, and males in the ≥ 25 age-group are more likely to have lower mental health literacy in the absence of any college coursework in clinical psychology. This difference in coursework is less likely to affect females across all age-groups.
In conclusion, multiblock discriminant correspondence analysis (MUDICA) is a versatile dimension-based method that is well suited to analyze large, structured categorical datasets. MUDICA generates easy-to-interpret maps that represent the relationship between groups of observations and blocks of variables. The reliability of the maps and stability of the variables are tested through non-parametric inferential procedures such as permutation and bootstrap procedures. In this paper, we introduced conditioned MUDICA where one specific effect from the dataset can be removed so that weaker underlying effects are clearly revealed. Thus, much like how a sketch artist creates a composite picture that represents the likely image of a person based on information from a witness, so too, MUDICA creates a composite picture of the relationship between observations and variables based on information from large and complex datasets.

Credit author statement

The authors made the following contributions. Anjali Krishnan: Conceptualization - Equations and code, Data curation - Analysis, figures, and tables, Writing – original draft preparation, review and editing; Ju-Chi Yu: Conceptualization - Equations and code, Writing - Drafting mathematical appendices, review and editing; Rona Miles: Conceptualization - Example study design, Writing - Drafting interpretation of clinical results, review and editing; Derek Beaton: Conceptualization - Equations and code, Writing – review and editing; Laura A. Rabin: Conceptualization - Example study design, Writing – review and editing; Hervé Abdi: Conceptualization - Equations and code, Writing –
Ethics approval and consent to participate

The example dataset used in this paper was obtained from a study that was approved by the Institutional Review Board (IRB) of Brooklyn College of the City University of New York (IRB reference number: 2016–1018), and the consent procedure was also approved by the IRB. No personal identification data were collected.

Funding

This work was supported by a grant awarded to authors RM and LAR from the John Cleaver Kelly (JCK) Foundation [2016–2021] and the Professional Staff Congress and The City University of New York Grant #63184–00 51 awarded to author RM.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data and code are publicly available (see Software section for details).

Acknowledgements

The authors would like to acknowledge Dr. Amy Boggan, Dr. Soudeh Khoubrouy, and Mr. Brendon Mizener for helpful comments on previous versions of the manuscript.
Appendix
This Appendix describes the main steps for MUDICA including the new features of block normalization and conditioned analyses. For a more formal
presentation of MUDICA, see Appendix C, page 1390ff in Williams et al. (2010).
Notations
Matrices are shown in bold face upper case letters (e.g., X), vectors are shown in bold face lower case letters (e.g., x), and numbers are shown in
italic upper case letters (e.g., I). The diag{X} operator transforms the elements on the diagonal of matrix X into a vector, while the diag{x} operator
transforms the vector x into a diagonal matrix. The transpose of a matrix (e.g., X) is represented with a superscript T as $\mathbf{X}^{\mathsf{T}}$.
For simple correspondence analysis, X is a contingency table with counts for levels of one categorical variable on the rows and another categorical
variable on the columns. This contingency table is then analyzed with a generalized singular value decomposition (GSVD). Multiple correspondence
analysis (MCA) generalizes correspondence analysis to analyze multiple categorical variables that are disjunctively coded (i.e., scores coded as 0s and
1s).
For MCA, matrix X has I observations and $J_k$ levels for each of the K variables (the total number of all levels for all variables is J). Matrix X is coded with 0s and 1s and the sum of all the 1s is N. The first step in MCA is to compute the probability matrix:
$$\mathbf{Z} = N^{-1}\mathbf{X} \tag{1}$$
The next step is to compute two vectors that contain the row totals (r) and the column totals (c) for X. These row and column total vectors are diagonalized ($\mathbf{D}_r = \mathrm{diag}\{\mathbf{r}\}$ and $\mathbf{D}_c = \mathrm{diag}\{\mathbf{c}\}$). Then, the $\chi^2$ (chi-square) distance from the probability matrix Z is computed as:
$$\mathbf{R} = \mathbf{Z} - \mathbf{r}\mathbf{c}^{\mathsf{T}} \tag{2}$$
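The matrix R is then decomposed with the GSVD as:
$$\mathbf{R} = \mathbf{P}\boldsymbol{\Delta}\mathbf{Q}^{\mathsf{T}} \tag{3}$$
with the constraints $\mathbf{P}\mathbf{D}_r^{-\frac{1}{2}}\mathbf{P}^{\mathsf{T}} = \mathbf{Q}\mathbf{D}_c^{-\frac{1}{2}}\mathbf{Q}^{\mathsf{T}} = \mathbf{I}$. Matrix $\boldsymbol{\Delta}$ is an L × L diagonal matrix with L singular values as the diagonal elements; P is the matrix of left singular vectors, and Q is the matrix of right singular vectors. The rows and columns of R are multiplied by P and Q, respectively, to generate matrices of factor scores:
$$\mathbf{F} = \mathbf{D}_r^{-\frac{1}{2}}\mathbf{P}\boldsymbol{\Delta} \tag{4}$$
and
$$\mathbf{G} = \mathbf{D}_c^{-\frac{1}{2}}\mathbf{Q}\boldsymbol{\Delta} \tag{5}$$
The columns of F and G represent the dimensions and reveal how the observations and variables differ from each other.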
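The first steps of MCA, Equations (1) and (2), can be sketched with plain Python lists. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the toy data are hypothetical, and the row and column totals are taken from the probability matrix Z so that each set sums to 1. The sketch stops before the decomposition and instead computes the total variance as the sum of the squared entries of $\mathbf{D}_r^{-\frac{1}{2}}\mathbf{R}\mathbf{D}_c^{-\frac{1}{2}}$, which is the quantity that the eigenvalues add up to.

```python
def mca_deviations(X):
    """Return Z, r, c and R = Z - r c^T for a 0/1 indicator matrix X (list of rows)."""
    N = sum(sum(row) for row in X)             # sum of all the 1s
    Z = [[x / N for x in row] for row in X]    # Equation (1): Z = X / N
    r = [sum(row) for row in Z]                # row totals (assumed here: totals of Z)
    c = [sum(col) for col in zip(*Z)]          # column totals (assumed here: totals of Z)
    R = [[Z[i][j] - r[i] * c[j] for j in range(len(c))]
         for i in range(len(r))]               # Equation (2): R = Z - r c^T
    return Z, r, c, R


def total_variance(R, r, c):
    """Sum of squared entries of D_r^(-1/2) R D_c^(-1/2)."""
    return sum(R[i][j] ** 2 / (r[i] * c[j])
               for i in range(len(r)) for j in range(len(c)))


# Hypothetical toy data: 4 participants, 2 questions, each coded as
# two columns (right, wrong), so every row contains exactly two 1s.
X = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
Z, r, c, R = mca_deviations(X)
inertia = total_variance(R, r, c)
print(inertia)  # total variance of this toy table: 1.0
```

An ordinary singular value decomposition of the rescaled matrix would then give singular vectors and singular values from which factor scores of the form shown in Equations (4) and (5) can be built.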
DICA is an extension of CA and MCA that examines group differences by maximizing between-group variance. In DICA, the I rows of X are categorized into groups, which are represented in an I × O design matrix Y, where O is the number of groups. A contingency table is then computed as $\mathbf{R} = \mathbf{Y}^{\mathsf{T}}\mathbf{X}$, and a GSVD is then performed on this contingency table (following the same steps above). The factor scores are generated in the same way as in CA and
MCA, and represent the differences between categories (as opposed to individual observations).
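The group-by-variable contingency table that DICA analyzes is the product of the transposed design matrix and the indicator matrix. The sketch below illustrates that one step; the function name and the toy data are hypothetical and are not taken from the study.

```python
def group_contingency(Y, X):
    """Compute R = Y^T X, where Y (I x O) codes group membership and X (I x J) codes responses."""
    n_groups, n_levels = len(Y[0]), len(X[0])
    R = [[0] * n_levels for _ in range(n_groups)]
    for y_row, x_row in zip(Y, X):
        for g, member in enumerate(y_row):
            for j, x in enumerate(x_row):
                R[g][j] += member * x   # add this participant's responses to their group
    return R


# Hypothetical toy data: 4 participants, 2 questions coded as (right, wrong) pairs.
X = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
# Participants 1-2 belong to the first group, participants 3-4 to the second.
Y = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]
R = group_contingency(Y, X)
print(R)  # [[2, 0, 1, 1], [0, 2, 1, 1]]
```

Each row of the result counts how often the members of one group gave each response, which is why the factor scores describe categories instead of individual observations.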
Conditioned Multiple Correspondence Analysis
A conditioned MCA (Escofier, 1988) is used to control for a specific effect described by a single categorical variable that was not included in the original analysis. If e is the additional variable and $\mathbf{Y}_e$ is the design matrix for that variable, then $\mathbf{C} = \mathbf{X}^{\mathsf{T}}\mathbf{Y}_e$ represents how e contributes to the variables of X. This matrix C is transformed into probabilities as:
$$\widehat{\mathbf{C}} = \mathbf{C}\left(\mathbf{1}\mathbf{Y}_e\right)^{-1} \tag{6}$$
In order to get the contribution of each observation to the levels of e, the design matrix $\mathbf{Y}_e$ is normalized by the total number of observations (N) as: $N^{-1}\mathbf{Y}_e$. This conditioned contribution indicates the proportion of the data predicted by e and is computed as:
$$\mathbf{O}_e = N\left(N^{-1}\mathbf{Y}_e\widehat{\mathbf{C}}\right) = \mathbf{Y}_e\widehat{\mathbf{C}} \tag{7}$$
To perform a conditioned MCA, a GSVD is conducted on $\mathbf{D}_r^{-\frac{1}{2}}\left(\mathbf{R} - \mathbf{O}_e + \mathbf{r}\mathbf{c}^{\mathsf{T}}\right)\mathbf{D}_c^{-\frac{1}{2}}$, where R is the $\chi^2$ distance from the probability matrix Z (see MCA section above).
To perform a conditioned DICA, R is now the group matrix (i.e., $\mathbf{Y}^{\mathsf{T}}\mathbf{X}$), and this matrix R is used to compute $\mathbf{O}_e$. Factor scores of conditioned MCA and conditioned DICA are computed and interpreted in the same way as ordinary MCA and DICA.
Supplementary Data
Observations or variables that are not included in the original analysis can be examined as supplementary data (i.e., they are not used to generate
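the multivariate space). For a supplementary row ($\mathbf{i}_{\text{sup}}^{\mathsf{T}}$), the factor scores ($\mathbf{g}_{\text{sup}}$) are obtained as:
$$\mathbf{g}_{\text{sup}} = \left(\mathbf{i}_{\text{sup}}^{\mathsf{T}}\mathbf{1}\right)^{-1}\mathbf{i}_{\text{sup}}^{\mathsf{T}}\mathbf{G}\boldsymbol{\Delta}^{-1} \tag{8}$$
For a supplementary column ($\mathbf{j}_{\text{sup}}$), the factor scores ($\mathbf{f}_{\text{sup}}$) are obtained as:
$$\mathbf{f}_{\text{sup}} = \left(\mathbf{j}_{\text{sup}}^{\mathsf{T}}\mathbf{1}\right)^{-1}\mathbf{j}_{\text{sup}}^{\mathsf{T}}\mathbf{F}\boldsymbol{\Delta}^{-1} \tag{9}$$
where $\left(\mathbf{i}_{\text{sup}}^{\mathsf{T}}\mathbf{1}\right)^{-1}$ and $\left(\mathbf{j}_{\text{sup}}^{\mathsf{T}}\mathbf{1}\right)^{-1}$ are first used to scale $\mathbf{i}_{\text{sup}}^{\mathsf{T}}$ and $\mathbf{j}_{\text{sup}}$ so that the sum of the elements of $\mathbf{i}_{\text{sup}}$ and $\mathbf{j}_{\text{sup}}$ are equal to 1.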
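The projection in Equation (8) amounts to three steps: rescale the supplementary row so that it sums to 1, multiply it by the column factor scores, and divide each dimension by its singular value. The sketch below illustrates this; the function name is hypothetical, and the factor scores and singular values are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
def project_supplementary(counts, factor_scores, singular_values):
    """Project a supplementary row onto existing dimensions, following Equation (8)."""
    total = sum(counts)
    profile = [v / total for v in counts]   # rescale the row so that it sums to 1
    projected = []
    for dim, delta in enumerate(singular_values):
        weighted = sum(p * row[dim] for p, row in zip(profile, factor_scores))
        projected.append(weighted / delta)   # multiply by the inverse singular value
    return projected


# Made-up column factor scores G (4 response levels x 2 dimensions)
# and singular values for the two dimensions.
G = [
    [0.5, 0.1],
    [-0.5, 0.1],
    [0.2, -0.3],
    [-0.2, 0.3],
]
singular_values = [0.5, 0.25]

# Supplementary row: response counts for a group that was not part of the analysis.
i_sup = [2, 0, 1, 1]
g_sup = project_supplementary(i_sup, G, singular_values)
print([round(v, 3) for v in g_sup])  # [0.5, 0.2]
```

A supplementary column is handled in the same way by swapping the roles of the row and column factor scores, as in Equation (9).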
MUDICA (Williams et al., 2010) is used to examine the contributions of blocks of variables to the overall variance. Here, the data table has J columns as before, but these J columns are now arranged into H a priori blocks (i.e., $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_H]$). These blocks are normalized such that each block contributes equally to the analysis (see below for block normalization).
Each of the H blocks can be projected into the DICA multivariate space. First, the GSVD of the group matrix R is rewritten as:
where $\mathbf{Q}_h$ is the hth block of X. Then, the factor scores for the hth block are computed as (with $\mathbf{W}_h$ being a diagonal weight matrix for the hth block):
$$\mathbf{F}_h = H\,\mathbf{X}_h\mathbf{W}_h\mathbf{Q}_h \tag{11}$$
Block Normalization
When variables are arranged into blocks, the variables within a particular block can be normalized such that, for each observation, the sum of
responses to variables within this block is equal to 1. Specifically, if the variables in the hth block (i.e., Xh) are disjunctively coded (i.e., with 1s and 0s)
and $\mathbf{r}_h$ represents the vector of row totals for block $\mathbf{X}_h$, block $\mathbf{X}_h$ is normalized as:
$$\widetilde{\mathbf{X}}_h = \mathbf{X}_h\,\mathrm{diag}\{\mathbf{r}_h\}^{-1} \tag{12}$$
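Block normalization as described in the text, where each observation's responses within a block are rescaled to sum to 1, can be sketched as follows. The function name and the toy block are hypothetical, and leaving all-zero rows unchanged is an assumption made here only to avoid a division by zero.

```python
def normalize_block(block):
    """Divide every row of a block by its row total so that each row sums to 1."""
    normalized = []
    for row in block:
        total = sum(row)
        if total == 0:
            # Assumption: a row with no responses in this block is left as it is.
            normalized.append(list(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


# Hypothetical block with two questions, each coded as (right, wrong),
# so every participant has two 1s in this block.
X_h = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
X_h_normalized = normalize_block(X_h)
print(X_h_normalized[0])  # [0.5, 0.0, 0.5, 0.0]
```

After this step a block with many questions no longer outweighs a block with few questions simply because it contributes more 1s to each row.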
References
Abdi, H., 2007. Discriminant correspondence analysis. In: Salkind, N. (Ed.), Encyclopedia of Measurement and Statistics. Sage Publications. https://doi.org/10.4135/9781412952644.n140.
Abdi, H., Williams, L., 2022. Correspondence analysis. In: Frey, B. (Ed.), The SAGE Encyclopedia of Research Design. Sage Publications, pp. 327–339. https://doi.org/10.4135/9781071812082.n124.
Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdiscipl. Rev.: Comput. Stat. 2 (4), 433–459. https://doi.org/10.1002/wics.101.
Abdi, H., Williams, L.J., Beaton, D., Posamentier, M.T., Harris, T.S., Krishnan, A., Devous Sr, M.D., 2012a. Analysis of regional cerebral blood flow data to discriminate among Alzheimer’s disease, frontotemporal dementia, and elderly controls: a multi-block barycentric discriminant analysis (MUBADA) methodology. J. Alzheim. Dis. 31 (s3), S189–S201. https://doi.org/10.3233/JAD-2012-112111.
Abdi, H., Williams, L.J., Béra, M., 2017. Barycentric discriminant analysis. In: Alhajj, R., Rokne, J. (Eds.), Encyclopedia of Social Network Analysis and Mining. Springer, New York, pp. 1–20. https://doi.org/10.1007/978-1-4939-7131-2_110192.
Abdi, H., Williams, L.J., Valentin, D., Bennani-Dosse, M., 2012b. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdiscipl. Rev.: Comput. Stat. 4 (2), 124–167. https://doi.org/10.1002/wics.198.
Beaton, D., Abdi, H., Filbey, F.M., 2014a. Unique aspects of impulsive traits in substance use and overeating: specific contributions of common assessments of impulsivity. Am. J. Drug Alcohol Abuse 40 (6), 463–475. https://doi.org/10.3109/00952990.2014.937490.
Beaton, D., Fatt, C.R.C., Abdi, H., 2014b. An ExPosition of multivariate analysis with the singular value decomposition in R. Comput. Stat. Data Anal. 72, 176–189. https://doi.org/10.1016/j.csda.2013.11.006.
Beaton, D., Saporta, G., Abdi, H., 2019a. A Generalization of Partial Least Squares Regression and Correspondence Analysis for Categorical and Mixed Data: an Application with the ADNI Data. others. bioRxiv, 598888. https://doi.org/10.1101/598888.
Beaton, D., Sunderland, K.M., Levine, B., Mandzia, J., Masellis, M., Swartz, R.H., Troyer, A.K., Binns, M.A., Abdi, H., Strother, S.C., 2019b. Generalization of the minimum covariance determinant algorithm for categorical and mixed data types. others bioRxiv, 333005. https://doi.org/10.1101/333005.
Benzécri, J.-P., 1973. L’Analyse des Données 1–2. Dunod.
Berry, K.J., Johnston, J.E., Mielke Jr., P.W., 2011. Permutation methods. Wiley Interdiscipl. Rev.: Comput. Stat. 3 (6), 527–542. https://doi.org/10.1002/wics.177.
Cordier, B., 1965. L’analyse des correspondances [PhD thesis]. University of Rennes.
Efron, B., Tibshirani, R.J., 1993. An introduction to the bootstrap. Monogr. Stat. Appl. Probab. 57, 1–436. https://doi.org/10.1007/978-1-4899-4541-9.
Escofier, B., 1988. Analyse des correspondances multiples conditionnelle. In: Diday, E. (Ed.), Data Analysis and Informatics: International Symposium Proceedings: 5th. Amsterdam: North Holland.
Escofier, B., Pagès, J., 2016. Analyses factorielles simples et multiples. Dunod.
Farrer, L., Leach, L., Griffiths, K.M., Christensen, H., Jorm, A.F., 2008. Age differences in mental health literacy. BMC Publ. Health 8 (1), 1–8. https://doi.org/10.1186/1471-2458-8-125.
Greenacre, M., Blasius, J., 2006. Multiple Correspondence Analysis and Related Methods. Chapman and Hall/CRC.
Guillemot, V., Beaton, D., Gloaguen, A., Löfstedt, T., Levine, B., Raymond, N., Tenenhaus, A., Abdi, H., 2019. A constrained singular value decomposition method that integrates sparsity and orthogonality. PLoS One 14 (3), e0211463. https://doi.org/10.1371/journal.pone.0211463.
Guttman, L., 1941. The quantification of a class of attributes: a theory and method of scale construction. In: Horst, P. (Ed.), The Prediction of Personal Adjustment. Social Science Council, pp. 318–348.
Hair, J., Black, W., Babin, J., Anderson, R., Tatham, R., 2009. Analyzing nominal data with correspondence analysis. In: Hair, J., Black, W., Babin, J., Anderson, R., Tatham, R. (Eds.), Multivariate Data Analysis. Prentice-Hall, pp. 595–603.
Härdle, W.K., Simar, L., 2019. Applied Multivariate Statistical Analysis. Springer Nature.
Hesterberg, T., 2011. Bootstrap. Wiley Interdiscipl. Rev.: Comput. Stat. 3 (6), 497–526. https://doi.org/10.1002/wics.182.
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24 (6), 417. https://doi.org/10.1037/h0071325.
Husson, F., Lê, S., Pagès, J., 2017. Exploratory Multivariate Analysis by Example Using R, second ed. CRC Press, Boca Raton.
Inc, S., 2017. SAS/STAT 14.3 User’s Guide: the Corresp Procedure. Cary: SAS Institute Inc.
Jorm, A.F., Korten, A.E., Jacomb, P.A., Christensen, H., Rodgers, B., Pollitt, P., 1997. “Mental health literacy”: a survey of the public’s ability to recognise mental disorders and their beliefs about the effectiveness of treatment. Med. J. Aust. 166 (4), 182–186. https://doi.org/10.5694/j.1326-5377.1997.tb140071.x.
Josse, J., Chavent, M., Liquet, B., Husson, F., 2012. Handling missing values with regularized iterative multiple correspondence analysis. J. Classif. 29 (1), 91–116. https://doi.org/10.1007/s00357-012-9097-0.
Josse, J., Husson, F., 2016. missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Software 70, 1–31. https://doi.org/10.18637/jss.v070.i01.
Lebart, L., Morineau, A., Warwick, K.M., 1984. Multivariate Descriptive Statistical Analysis; Correspondence Analysis and Related Techniques for Large Matrices. New York (USA) Wiley.
Lebart, L., Saporta, G., 2014. Historical elements of correspondence analysis and multiple correspondence analysis. In: Greenacre, M., Blasius, J. (Eds.), Visualization and Verbalization of Data. CRC Press, Chapman & Hall, pp. 31–44.
MCA 1.0.3. Python software foundation (n.d.) Retrieved September 3, 2022, from https://pypi.org/project/mca/.
McIntosh, A.R., Lobaugh, N.J., 2004. Partial least squares analysis of neuroimaging data: applications and advances. Neuroimage 23, S250–S263. https://doi.org/10.1016/j.neuroimage.2004.07.020.
Meulman, J.J., Heiser, W.J., 2011. IBM SPSS Categories 20. SPSS Inc., USA, p. 313.
Miles, R., Rabin, L., Krishnan, A., Grandoit, E., Kloskowski, K., 2020. Mental health literacy in a diverse sample of undergraduate students: demographic, psychological, and academic correlates. BMC Publ. Health 20 (1), 1–13. https://doi.org/10.1186/s12889-020-09696-0.
Nakache, J.-P., 1973. Influence du codage des données en analyse factorielle des correspondances étude d’un exemple pratique médical. Rev. Stat. Appl. 21 (2), 57–70.
Phillips, D., Phillips, J., 2009. Visualising types: the potential of correspondence analysis. In: Byrne, D., Ragin, C.C. (Eds.), Sage Handbook of Case-Based Methods. Sage Publications, pp. 148–168.
R Core Team, 2020. R: A Language and Environment for Statistical Computing. https://www.R-project.org/.
Rabin, L.A., Miles, R.T., Kamata, A., Krishnan, A., Elbulok-Charcape, M., Stewart, G., Compton, M.T., 2021. Development, item analysis, and initial reliability and validity of three forms of a multiple-choice mental health literacy assessment for college students (MHLA-c). Psychiatr. Res. 300, 113897. https://doi.org/10.1016/j.psychres.2021.113897.
Saporta, G., Keita, N.N., 2006. Correspondence analysis and classification. In: Greenacre, M.J., Blasius, J. (Eds.), Multiple Correspondence Analysis and Related Methods, pp. 371–392. https://doi.org/10.1201/9781420011319-19. Chapman and Hall/CRC.
Statistical inference in the 21st century: a world beyond p < 0.05 [special issue]. In: Wasserstein, R.L., Schirm, A.L., Lazar, N.A. (Eds.), Am. Statistician 73. https://doi.org/10.1080/00031305.2019.1583913.
Wickham, H., 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag.
Williams, L.J., Abdi, H., French, R., Orange, J.B., 2010. A tutorial on multiblock discriminant correspondence analysis (MUDICA): a new method for analyzing discourse data from clinical populations. J. Speech Lang. Hear. Res. 53 (5), 1372–1393. https://doi.org/10.1044/1092-4388(2010/08-0141.
Wong, K., 2016. Gender differences in mental health literacy of university students. West. Undergrad. Psychol. J. 4 (1).
Yu, J.-C., 2021. Sparse Partial Least Square Correspondence Analysis (SPLS-CA): Applications To Genetics and Behavioral Studies [PhD Thesis].
Yu, J.-C., Gómez–Corona, C., Abdi, H., Guillemot, V., 2022. Sparse MFA, sparse STATIS, and sparse DiSTATIS with an application to sensory evaluation. J. Chemometr. https://doi.org/10.1002/cem.3443 (in press).