
5 Survey and Interview Methods

DOUGLAS B. SAMUEL, MEREDITH A. BUCHER, AND TAKAKUNI SUZUKI

As discussed in the preceding chapter, the measurement of a psychological construct is a sine qua non of clinical psychological science. Quite literally, without adequate measurement of diagnostic, outcome, or functional constructs there is no way to conduct reliable and valid research. This volume highlights and expands on a number of methods and approaches to measuring those constructs, but none has as lengthy a history as simply asking an individual to report on their experiences, thoughts, feelings, and behaviors. The ubiquitous use of survey and interview methods has, at times, garnered criticism (Haeffel & Howard, 2010; Nisbett & Wilson, 1977), and many a reviewer or author has lamented the reliance on questionnaire methods. Indeed, these methods – like any other – do have limitations. Nonetheless, there are reasons, beyond simple convenience, that they have proven so popular. Survey and interview methods represent the measurement approach with by far the largest empirical background, and they typically offer well-articulated psychometric properties and available evidence for construct validity.

In this chapter, we articulate the considerations that go into choosing a method and highlight those factors that make surveys and/or interviews well suited for a given question. We also detail those research questions or constructs for which these methods are not recommended. Next, we highlight the similarities and differences between the survey and interview methods and make suggestions about how a researcher might choose between the two. Finally, we recommend best practices for scale development and a series of decision points in the use of these methods.

Throughout this chapter, it is important to note that we consider the merit of survey and interview methods separately from the source from whom the information is gathered. Although in many cases surveys (and interviews) are completed by the participant about their own experiences or symptoms, there is also a rich history of using these methods to assess the opinions of an informant (i.e., a spouse, co-worker, parent, or even a clinician) about the target. Once we have covered the general pros and cons of the methods, we will briefly turn to the question of to whom the instruments are administered.

WHEN ARE SURVEYS AND INTERVIEWS USEFUL?

The central hypothesis underlying the use of any method that asks a participant to provide a linguistic answer to a question (whether a survey or interview) is that the information being sought is available to conscious awareness. Thus, a central consideration in using a survey or interview is whether the construct is one the participant can reasonably be expected to know about and articulate verbally. For example, a research question aimed at understanding neural connectivity or any other explicitly internal biological process is not particularly amenable to survey or interview methods. Similarly, a psychological construct that is defined as being outside of conscious awareness, such as a defense mechanism, would be less suitable for a survey (i.e., a method that presumes the source of the ratings is knowledgeable and well positioned to have access to the relevant information). Nonetheless, there remains a multitude of psychological constructs relevant to clinical research that fit comfortably within these boundaries, making them amenable to assessment via survey or interview. Particularly in clinical diagnosis, conscious, subjective experience, such as distress, is often the central construct of interest (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition ‒ DSM-5). Such constructs, by definition, are readily assessed via surveys or interviews.

Another central consideration in choosing a method is the physical distance between the researcher and the participant. In this regard, surveys and interviews are particularly well suited to efficiently and effectively collecting a sample from across a broad geographic area, regardless of the location of the assessor. Although rapid advances in technology have closed the gap (e.g., wearable technology; Trull & Ebner-Priemer, 2013), surveys and interviews have the distinct advantage that they can be administered to a participant anywhere on the globe: surveys can be completed by mail or online, and interviews can be conducted by phone or video call.


In contrast, many other methods (e.g., neurobiological assessments or experimental paradigms) require a physical visit to a laboratory. This places a significant burden on the participant (and likely limits the geographic range from which a study can recruit participants) as well as on the experimenter. Furthermore, the intensive nature of other methods often limits participation to only a single individual at a given time, making it inefficient to collect large samples rapidly.

Differences between Survey and Interview Methods

Having outlined the benefits shared by surveys and interviews, choosing between the two becomes the next pivotal question. The clear similarity between surveys and interviews is that they both ask direct questions of the subject, who then provides an answer. In this way, both methods are predominantly scored on the basis of subjects' responses. Surveys often (but not necessarily) instruct a participant to select from among a list of potential response options, whereas interviews will often allow for a more elaborated or open-ended response. Nonetheless, there is no a priori reason why either should be restricted in this way. A survey could be crafted so that the subject provides an open-ended written response (e.g., Sentence Completion Test; Loevinger, 1979) just as easily as an interview item could ultimately be answered as true versus false, although this is not typical.

The key distinctions between these methods concern how the test stimuli are presented and whose opinion determines the score. In the case of a self-report questionnaire survey, the individual is free to answer each item as they desire and the researcher has limited ability to alter the scoring. This makes a survey a direct and unfiltered report from the target. In contrast, items from an interview are scored by the researcher/interviewer based on their opinion of the individual's response as well as additional information. For example, consider an item asking about a person's humility. In an interview the interviewee could reply to the item by saying: "Oh yes, I am extremely humble. Probably the most humble person you will ever meet. I've had some amazing successes in my life, but I've never let it go to my head. People always comment on how remarkably humble I am given my impressive accomplishments." In such a case the interviewer would be free to score the item as low on humility despite the stated answer, whereas the interviewee would likely endorse a survey item as indicating high humility. Therefore, careful consideration of the construct and the method of administration is crucial in deciding between these two methods.

The other primary difference is in the presentation of stimuli. Survey questions are presented to the respondent in written format, typically in a standardized order. The result is an equivalent stimulus for all participants, given the required level of reading comprehension. The hallmark of the interview method is that a member of the research team administers the items to the respondent. There is considerably more variability in the sequencing and standardization of the items on interviews than on surveys, depending primarily on the nature of the interview. Interviews can range from completely structured, where the interviewer reads items verbatim and asks specific follow-up questions depending on the answer in a way that resembles the survey format (e.g., the Wechsler intelligence scales), to completely unstructured, where the interviewer asks whatever questions they see fit to arrive at the desired scoring or diagnosis. The latter are standard in clinical practice (Perry, 1992) but only rarely employed in research settings, which may complicate the translation of research into practice (Samuel, Suzuki, & Griffin, 2016). The most typical interview methods are "semistructured" in that they provide a set of standardized prompts that are asked of all participants, but then also permit the interviewer to follow up on the responses or ask more questions as they see fit (First et al., 2014).

SELECTING AN INTERVIEW OR SURVEY METHOD

In light of the considerable differences between interview and survey methods, it is useful to examine the costs and benefits of each method so that researchers can determine which is the best fit for a given study.

Pros of Surveys

A major benefit of surveys – and indeed likely a major reason they are commonly used in psychological research – is their incredible efficiency. Surveys allow large samples to be collected with little researcher effort. Historically, this meant "mass testing" situations, in which large numbers of participants completed a paper-and-pencil survey at the same time in a given room, or mailing out copies of the survey for participants to complete on their own time, greatly reducing the ratio of experimenter effort per data point. However, advances in the implementation of surveys via the web (e.g., Qualtrics) have dramatically decreased the effort even further, such that it is now possible to collect samples that are limited in size only by the researcher's budget or the population of interest. Additionally, advances in overall computer technology allow for the use of computerized adaptive testing, in which the test adapts the items administered based on participants' responses (Zetin & Glenn, 1999). This can further reduce administration time by removing unnecessary, or extra, questions without sacrificing test validity.
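
To make the logic of computerized adaptive testing concrete, the following Python sketch selects, at each step, the unadministered item that is most informative at the respondent's current trait estimate under a two-parameter logistic (2PL) model. This is a minimal illustration rather than a description of any particular testing platform; the item bank, the three-item stopping rule, and the grid-based scoring are all invented for the example.

    import numpy as np

    # Hypothetical 2PL item bank: each row holds (discrimination a, difficulty b).
    item_bank = np.array([[1.2, -1.0], [0.8, 0.0], [1.5, 0.5], [1.0, 1.5], [2.0, -0.5]])

    def p_keyed(theta, a, b):
        """2PL probability of answering an item in the keyed direction."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def item_information(theta, a, b):
        """Fisher information of a 2PL item at trait level theta."""
        p = p_keyed(theta, a, b)
        return a ** 2 * p * (1.0 - p)

    def next_item(theta_hat, administered):
        """Choose the unadministered item with maximum information at theta_hat."""
        candidates = [i for i in range(len(item_bank)) if i not in administered]
        return max(candidates, key=lambda i: item_information(theta_hat, *item_bank[i]))

    def update_theta(responses):
        """Crude maximum-likelihood update over a coarse grid of trait values."""
        grid = np.linspace(-3, 3, 121)
        log_lik = np.zeros_like(grid)
        for item, resp in responses:  # resp is 1 (keyed response) or 0
            a, b = item_bank[item]
            p = p_keyed(grid, a, b)
            log_lik += resp * np.log(p) + (1 - resp) * np.log(1 - p)
        return grid[np.argmax(log_lik)]

    # Simulate a short adaptive administration for one respondent.
    rng = np.random.default_rng(0)
    true_theta, theta_hat, responses = 0.8, 0.0, []
    for _ in range(3):  # stop after three items, purely for brevity
        item = next_item(theta_hat, [i for i, _ in responses])
        resp = int(rng.random() < p_keyed(true_theta, *item_bank[item]))
        responses.append((item, resp))
        theta_hat = update_theta(responses)
    print("administered:", [i for i, _ in responses], "trait estimate:", round(theta_hat, 2))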


The standardized format and direct ratings by the subject also make surveys very easy to use. Using an existing survey requires only that researchers administer it in accordance with the instructions, leaving very little chance for administration errors or variation across participants. This feature also makes the assessment process easy for the participant, as it is entirely self-paced and thus the survey need take only as long as the individual completing it requires. Taken together, this level of standardization provides some assurance that the items completed (and the responses to those items) are comparable across participants.

Surveys also offer the potential benefit of anonymity for the respondent, which may maximize honesty about illegal or sensitive topics (e.g., substance use, sexual behaviors). For example, Ong and Weiss (2000) found that the rate of admitted cheating among undergraduate students was 74 percent when participants were guaranteed anonymity, but only 25 percent when the survey was merely confidential. Even if not anonymous, many people will find it much easier to report personal matters, undesirable characteristics, or embarrassing behaviors when it involves clicking a button on a keyboard rather than detailing them to another person in a face-to-face setting (Newman et al., 2002) – although some exceptions have been noted (e.g., Poulin, 2010).

A final benefit of surveys, as commonly used, is the elimination of scoring errors on the part of the researcher. Because most surveys ‒ although not all (e.g., patient satisfaction surveys) ‒ rely on closed-ended questions with limited response options and a specified scoring algorithm, the scoring process can be automated to the point that scoring errors can only stem from incorrectly entering the items or scoring metrics. This eliminates the need to calculate interrater reliability.

Cons of Surveys

One frequently articulated concern about survey research is that the answering process is a black box, in that researchers have no way of knowing why a respondent chose a particular answer. This yields additional concerns about a myriad of threats to the validity of the observed score on a given item or scale. These threats include lack of reading ability or comprehension, lack of insight or knowledge, attentional limitations, response styles, demographic biases, or even deliberate faking on the part of the respondent. It goes without saying that a participant who lacks the ability to read and comprehend the items on a survey cannot provide an answer that accurately reflects the true level of the construct that the items represent. Further, even if they are able to comprehend the items, there is also the possibility that the individual lacks the insight or knowledge to report accurately on a given item. For example, on a self-report questionnaire, an individual asked about relationship conflicts may be prone to underestimate their role in those conflicts due to lack of insight. Further, a spouse or other informant who is asked to infer the mental state of a target may simply lack the requisite information to make such a rating (e.g., Vazire, 2010). Even in situations where the target has the knowledge and insight to make valid ratings on a survey, there may still be a host of factors that threaten the validity of their scores.

The attention span of the respondent is one such factor that can limit validity when surveys are lengthy. There is also a broad literature on how response styles can impact the validity of surveys (Van Vaerenbergh & Thomas, 2013). The specifics of these response styles are beyond the scope of this chapter, but they are thought of as systematic tendencies toward answering items in a given way. These styles, or biases, can include nay-saying (denying attributes), acquiescence (answering affirmatively), extreme responding, midpoint responding, or social desirability, and they are considered mostly nondeliberate. In contrast, an additional issue that can be particularly relevant in clinical research is deliberate faking, whereby participants distort their responses to make the target seem better than they are (i.e., "faking good") or to appear more impaired than is accurate (i.e., "faking bad"). Although there are ways to minimize such biases or detect such attempts, which we will detail later, researchers and clinicians should continue to be aware of the possibility of response biases.

In addition to these validity threats, there are also a few issues that are tradeoffs of the way in which surveys are typically used. A downside to the standardized stimuli that characterize surveys is that there is no flexibility to follow up on responses, ask a question in another way, or clarify the participant's response. While such a strategy is possible using survey branching or contingent item sets within computerized surveys, there are still limits to the range of follow-up questions that can be accomplished via survey. A final potential downside of surveys is again a tradeoff for their efficiency, in that the researcher cedes control of the administration setting. This is not an inherent limitation of surveys, as it is possible for them to be completed in the laboratory. However, in practice this is rarely done unless a visit is required for the completion of other methods. The result is that, practically, survey respondents can complete items in any number of contexts (home, work, school, movie theater, etc.) or states (happy, angry, distracted, intoxicated, etc.) that may confound score validity.

Pros of Interviews

It should be clarified again that there is considerable variability in the amount of structure built into interviews. At one extreme, a fully structured interview (one that allows no probes) is, in reality, simply a survey that is administered verbally. Such a case is likely quite infrequent, as even structured interviews typically allow for follow-up questions – or probes – to better understand the nature of the response and flesh out details.


Thus, a real strength of interviews is the ability of both the interviewer and the interviewee to seek clarification. Participants in an interview are able to address issues of comprehension by asking for items to be restated or reworded. Further, they are able to elaborate on their responses in order to communicate nuances and/or indicate the rationale behind their answers.

The ability of the interviewer to probe responses and seek elaboration on answers is the primary advantage of the interview method. In doing so, the interviewer has the ability to build from existing items (in the case of a semistructured interview) or ask whatever questions he or she sees fit (in the case of an unstructured interview) to garner the information needed to arrive at a score for each item. In this way, the score on an interview represents the integration of multiple points of information (e.g., nonverbal behavior, pace of speech, affective intensity, etc.) that go well beyond the interviewee's stated response.

In addition to the opportunity to seek additional information, the scores on an interview are usually provided by a researcher who has been well trained in the scoring protocol and is a (presumably) neutral third party. This makes interview scores potentially less vulnerable to the response biases, lack of insight, or other validity threats that can hamper surveys. Finally, the administration setting for an interview can be standardized such that each interviewee completes the interview in a laboratory setting that is removed from everyday life in a way that can free them from distractions and potential affective interference.

Cons of Interviews

We hope it has been clear to this point that a major feature of interviews is the intensive oversight of the information being gathered on the part of the trained interviewer. Of course, this also represents their greatest downside, as such a procedure is incredibly costly in terms of researcher time, since it requires an interviewer to be present one-on-one with each research participant. Not only does this greatly impede the efficiency of data collection, but it also results in a much greater per-participant financial cost. As such, the advantages of interviews need to be weighed carefully against this significant cost ‒ perhaps even to the point that some demonstrable benefit must recommend interviews, rather than treating them as the default gold standard.

Another factor to consider is that the one-on-one nature of interviews also makes it all the more challenging to find a suitable time for an individual to participate. Scheduling a participant necessitates the researcher and participant agreeing on a mutually workable time and then coordinating any logistical details that might go along with it (e.g., a physical space for an in-person interview or the calling details for a phone interview). To manage this successfully, a great degree of flexibility is required on the part of the research team in order to accept all interested subjects who might have nontraditional work schedules or caretaking responsibilities. Regardless of the level of flexibility on the part of the assessor, it remains probable that the nature of some pathologies may actually be related to greater difficulty in scheduling and completing an interview, making missing data points more problematic.

For any interview to be valid, all interviewers must be interchangeable with one another. To achieve (or approximate) this goal, extensive training is often necessary before a research staff member is ready to conduct an interview. This training can be quite time-consuming and expensive. For example, the training listed on the website for the Structured Clinical Interview for DSM-IV Axis I Disorders (First et al., 1996) suggested that each interviewer first watch an 11-hour DVD training program before embarking on any onsite training. Even once initial training has been successful (i.e., each interviewer has been trained to some preestablished criterion), ongoing training is necessary to guard against rater drift (Rogers, 2001). This often requires regularly scheduled meetings in which all interviewers discuss difficult coding decisions and receive ongoing fidelity assessments (Widiger & Samuel, 2005).

Even when interview training is thorough, a routine best practice for the valid administration of interviews is estimating the reliability across raters. This is because, as noted above, the interviewer introduces a new source of error, in that two different interviewers may yield different scores for the same interviewee. This can occur for a variety of reasons but can basically be distilled into three major categories: idiosyncratic administration, interviewer stimulus effects, or unreliable scoring (Rogers, 2001). Aside from fully structured interviews, a feature of this method is the ability of the interviewer to use follow-up probes or ask additional questions to clarify an answer for scoring purposes. Although this is designed to increase validity, it can introduce unreliability. For example, if two interviewers differ in terms of the frequency, style, or type of question asked, then this idiosyncratic administration can elicit different information from the interviewee. It may also be the case that idiosyncratic features of the interviewer (or the similarity with the interviewee) differentially elicit information from the participant. These can take the form of demographic or physical features of the interviewer (e.g., a female participant may be more willing to disclose details of their sexual history to another female than to a male) as well as more state-based features (e.g., an interviewer who is particularly chipper on a given day versus tired on another may elicit different responses from the participant). Finally, even if interviewers elicit the same responses or information from participants, they still might assign different scores. In such a case, this idiosyncratic scoring might be due to biases on the part of the interviewer (e.g., gender biases) or inaccurate encoding of the available information when formulating a final score (e.g., Morey & Benson, 2016).


There may also be halo effects, in which scores on specific items bleed over and color the interviewer-assigned ratings more broadly.

In sum, a central concern about interviews – beyond their intensive nature – is the possibility of unreliability across raters (Samuel, 2015). Researchers often take a number of steps to minimize unreliability, including training, as well as analyses that examine interrater reliability. The typical form of interrater reliability utilized in most diagnostic interview settings is to have another interviewer listen to an audio or video recording of the session and make ratings, which can be compared to those of the original interviewer. Importantly, this method can only examine the reliability of scoring assignments. As Chmielewski and colleagues (2015) have noted, this is a less than perfect approach as it does not take into account potential differences related to the administration of the interview or the stimulus value of the interviewer. To do this, the interview should be independently administered by two separate researchers to see how well scores match.
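
As a simple illustration of how such cross-rater agreement is typically quantified (not a procedure prescribed by the chapter), the short Python sketch below computes Cohen's kappa for two raters' hypothetical categorical diagnoses using scikit-learn; for dimensional interview scores, an intraclass correlation would play the analogous role.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical presence/absence diagnoses (1 = disorder present) assigned by
    # two raters to the same ten interviewees.
    rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # original interviewer
    rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # second rater scoring the recording

    # Chance-corrected agreement; values near 1 indicate strong interrater reliability.
    print("Cohen's kappa =", round(cohen_kappa_score(rater_a, rater_b), 2))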

CHOOSING AN EXISTING MEASURE OR DEVELOPING A NEW ONE

Once the researcher has determined that either the survey or the interview method will likely suit their needs, the next step is to determine whether there is an existing measure that will assess the construct of interest or whether a new one must be developed. In either case, a thorough review of the literature is an imperative first step. This will reveal measures of similar or identical constructs, and the researcher can determine if any meet the needs of the particular study. Ideally, any existing measure chosen would have been vetted so that its psychometric properties and validity are known, as opposed to using a measure that is developed and used ad hoc in a given study. If the researcher determines that no existing measure provides the type of assessment they seek, then a new measure can be created.

Principles of Scale Construction

A number of wonderful resources exist on the topic of scale construction (e.g., Clark & Watson, 1995) and it is well beyond the scope of this chapter to reproduce them in their entirety. Nonetheless, it is worth summarizing some of the basics here. Most notable is that this process is iterative and self-correcting, in that results from one step provide feedback on prior and subsequent steps. Following a thorough literature review to determine that no suitable assessment exists, the researcher should create an operational definition of the construct of interest. This definition should specify both the breadth of the construct (i.e., what it does include) as well as its boundaries (i.e., what it does not include). An operational definition should also specify any lower-order components of that construct. Moreover, this operational definition should offer a conceptual and theoretical account of what the construct is and how it fits into the larger context (Loevinger, 1957); see Chapter 2, by Zachar and colleagues, for more details.

Following the literature review and construct operationalization, the scholar next creates a large list of items that strategically samples all the content within the defined construct (including subconstructs), as well as content that is at the edges of, and even past, the boundaries. The idea behind this process is that irrelevant items will be identified in subsequent validation steps and discarded, whereas relevant content cannot be added back in at a later point. Thus, the goal of this item development process is one of overinclusiveness (Clark & Watson, 1995). In addition, items should be worded as simply as possible to capture the intended meaning, so as to improve readability and avoid idiosyncratic interpretations. Items are also best suited to detect a construct when they reflect only that construct. That is, it is advisable to avoid items that represent a blend of two different constructs.

In the next stage, this large pool of items is administered to a sample of the population the scale is intended to measure. This point is both obvious and crucial, as it would be of little value to examine item performance in a sample irrelevant to the population of interest. It is worth mentioning here that this step can look quite different for questionnaires versus interviews. Indeed, developing a large (500+) item pool and administering it to several hundred participants is readily achievable for questionnaire development (e.g., Simms et al., 2013), but it would be quite onerous to do this in the development of a structured interview. This provides some inherent psychometric advantages for questionnaires over interviews, given how each is typically developed.

The properties of items in this development pool can then be evaluated to eliminate items that show uniform endorsement (all participants answer the question the same way), redundancy (typically items correlating above .50), or poor internal consistency (i.e., items that do not correlate with the others). Most of these procedures stem from classical test theory (CTT), which emphasizes psychometric indicators of observed items. A related alternative is item response theory (IRT; Embretson & Reise, 2000), which examines latent item properties such as the discrimination parameter (i.e., how well the item differentiates individuals at various levels of the latent trait) and the difficulty parameter (i.e., the point along the latent trait continuum where the item best discriminates among individuals). Although sometimes pitched as competitors, these approaches are similar and are perhaps best thought of as complementary and overlapping. One area where they may well differ, though, is in their treatment of infrequently endorsed items. As noted above, a typical approach within CTT is to remove infrequent items, as they have less information value. However, within IRT, the test developer may actually prefer such items because they provide information at different levels of the latent trait and, as such, maximize the coverage of that trait by the scale. This property has proven quite useful on high-stakes tests predicting academic achievement.
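
To make these screening criteria concrete, the following Python sketch computes, for a simulated development pool, the item-level statistics named above: endorsement rates (to flag near-uniform endorsement), inter-item correlations above .50 (to flag redundancy), and corrected item-total correlations (to flag items that do not hang together with the rest). The simulated data and the specific cutoffs are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n_persons, n_items = 300, 8
    # Simulated dichotomous responses for a hypothetical development sample.
    trait = rng.normal(size=(n_persons, 1))
    difficulty = np.linspace(-1.5, 1.5, n_items)
    responses = (trait + rng.normal(size=(n_persons, n_items)) > difficulty).astype(float)

    # 1. Endorsement rates: items answered the same way by nearly everyone carry
    #    little information under CTT.
    endorsement = responses.mean(axis=0)
    uniform_items = np.where((endorsement < 0.05) | (endorsement > 0.95))[0]

    # 2. Redundancy: item pairs correlating above ~.50 may be near-duplicates.
    r = np.corrcoef(responses, rowvar=False)
    redundant_pairs = [(i, j) for i in range(n_items)
                       for j in range(i + 1, n_items) if r[i, j] > 0.50]

    # 3. Corrected item-total correlations: items that fail to correlate with the
    #    sum of the remaining items are candidates for removal.
    item_total = np.array([
        np.corrcoef(responses[:, i],
                    responses[:, np.arange(n_items) != i].sum(axis=1))[0, 1]
        for i in range(n_items)])
    weak_items = np.where(item_total < 0.20)[0]

    print("endorsement rates:", endorsement.round(2))
    print("near-uniform items:", uniform_items)
    print("redundant pairs (r > .50):", redundant_pairs)
    print("items with low item-total correlations:", weak_items)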


Regardless of the method used to derive the item pool, there is general consensus that the resulting instrument should have homogeneous scales (Smith, McCarthy, & Zapolski, 2009). Although an instrument can have multiple scales, each individual scale should measure a unitary construct rather than some amalgam. This again emphasizes the need to have items that are themselves univocal.

A central step in any scale's construction is the examination of construct validity. Construct validation is again a very broad concept that has been the subject of many seminal works (Cronbach & Meehl, 1955) and prior reviews (Strauss & Smith, 2009), and it is discussed by Furr in Chapter 6. For all those reasons, we will not repeat it in depth here, but rather summarize briefly the key issues of construct validation. A baseline requirement for construct validation is reliability. This term can connote somewhat different meanings, but typically it represents the degree to which the scale is internally consistent. This should not rely exclusively on Cronbach's alpha, as alpha varies considerably based on the number of items (Lance, Butts, & Michels, 2006). Instead, one should prefer metrics like the average interitem correlation or McDonald's omega (McDonald, 1999), which reflects the degree to which variance in the items is attributable to a single factor. In cases where the construct is conceptualized as traitlike, such as personality, the degree of stability over brief intervals (i.e., dependability; Watson, 2004) can also be an indicator of transient error, or unreliability. The idea behind each of these metrics is that the reliability of a measure places a cap on its validity.
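
As a concrete illustration of these internal-consistency indices (with simulated data rather than a prescribed workflow), the Python sketch below computes Cronbach's alpha and the average interitem correlation for a persons-by-items response matrix; McDonald's omega would additionally require fitting a single-factor model to obtain loadings and uniquenesses.

    import numpy as np

    def cronbach_alpha(items):
        """items: persons x items matrix of scored responses."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    def average_interitem_r(items):
        """Mean of the off-diagonal inter-item correlations."""
        r = np.corrcoef(items, rowvar=False)
        return r[np.triu_indices_from(r, k=1)].mean()

    # Simulated scale: five Likert-type items driven by one latent trait.
    rng = np.random.default_rng(2)
    trait = rng.normal(size=(500, 1))
    scores = np.clip(np.round(3 + trait + rng.normal(size=(500, 5))), 1, 5)

    print("alpha =", round(cronbach_alpha(scores), 2))
    print("average interitem r =", round(average_interitem_r(scores), 2))
    # Alpha rises with the number of items even when the average interitem
    # correlation stays constant, which is one reason the latter (or omega)
    # is often the better summary of internal consistency.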

Validity is itself a multifaceted idea, with a variety of specific types of validity subsumed under the umbrella of construct validity, which indicates the degree to which a given scale actually measures the latent construct it intends to operationalize. This can include content validity, which indicates the degree to which the item or scale is deemed to represent the content of the operationalized construct definition and is typically assessed by consulting expert raters, but may also be evidenced by simple face validity (i.e., the item appears to measure the intended construct). Most typically, this includes an evaluation of criterion-related validity, which can be either concurrent or predictive. In a typical construct validation study, the researcher will choose a set of constructs assessed by existing measures that range in their degree of conceptual linkage with the target construct. For example, a novel measure of intelligence might reference existing measures of intelligence as well as proximal constructs, such as academic achievement, and more distally related (or even potentially unrelated) concepts like openness to experience. The goal is to utilize these measures to situate the newly created measure in the "nomological network" (Cronbach & Meehl, 1955). Ideally, the new measure should correlate most highly with the most conceptually related constructs (i.e., convergent validity) and fail to correlate with conceptually unrelated constructs (i.e., discriminant validity).

During construct validation, the test developer is also encouraged to attend to the issue of shared method variance. It is well known that measures from the same method (e.g., self-report questionnaires) will often correlate more highly than measures from different methods (Campbell & Fiske, 1959). As such, it is advised to select not only an array of related and unrelated constructs, but also to vary the methods in order to create the multitrait, multimethod matrix.

Finally, the newly developed measure should be administered to a new sample to examine its psychometric properties and construct validity. Often the newly constructed measure is optimized for the initial sample. However, due to sampling error, the properties of the measure should be examined in an independent sample to ensure the findings are replicable and generalizable.

The goal of many assessment measures is to make predictions about future states or behaviors (e.g., risk of violence, job or academic performance, illness trajectories), and so another important aspect is predictive validity. Their longitudinal nature makes such tests less common in the psychometric literature, but all the more informative when available. No test can be said to be validated, or valid, in an absolute sense; rather, a test has validity support for a given purpose.

OTHER CONSIDERATIONS FOR CONSTRUCTING AND EVALUATING SURVEYS OR INTERVIEWS

Construct Specification

Within the overarching theme of construct specification, much attention is rightly drawn to the issue of operationally defining the construct and specifying how it should relate to other measures. However, there are other features of construct specification that are less commonly considered. A necessary step in defining a construct is to hypothesize explicitly about its temporal features; that is, should the target be considered a state that will shift over time, or something more traitlike and durable across time and situation? This has major implications for the proposed nomological network as well as for the type of items that are written and the timeframe specified by the question. It would be ideal for the construct specification phase to hypothesize the expected test-retest stability over various intervals (e.g., a measure of the emotion of surprise would be expected to have very low stability even over short intervals).

As noted earlier, it is preferable for a given scale to be homogeneous and assess a single latent construct. That said, it is quite routine for an instrument to include multiple scales, including some that may be conceptually orthogonal and others that are highly intercorrelated, but separable (e.g., facets within personality models).


For this reason, test developers should specify a priori how they believe the scale and the overarching instrument will be structured. There are a variety of possibilities (unidimensional, correlated higher-order factors, bifactor, etc.), which might all be reasonable, but it is incumbent upon the developer to offer a falsifiable hypothesis about how it will be structured. Furthermore, the iterative process of scale construction and validation will often feed back on itself, such that subsequent examinations of structure may well result in modifications to the theory. For example, two scales that were thought to be distinct may be combined based on tests of latent structure, or a scale thought to load on one higher-order domain may shift to another. In other cases, however, the departure of latent structure analyses from a priori expectation may suggest the need to revise the scale's content (Clark & Watson, 1995).

A final point, which in our estimation is often neglected, is the issue of continuum specification. Continuum specification refers to the hypothesized nature of a construct's distribution as well as the number and nature of its end poles. For example, there has been a great deal of work concerning whether the emotions of sadness and happiness represent opposite poles of the same continuum (i.e., bipolar) versus two distinct (i.e., unipolar) concepts (Tay & Kuykendall, 2017). When defining and operationalizing a construct, the developer should specify the ends of the continuum and note whether a low standing reflects simply the absence of the construct or an elevated level of an opposite construct. This property of constructs has major implications for a test developer, as it determines the breadth of item content that should be included. To be clear, not every measure should aim to assess all possible levels of the continuum. For example, a scale intended to measure depression will reasonably focus its assessment on extreme levels of sadness, but it would also be helpful to specify whether low scores reflect a lack of depression (e.g., neutral mood), happiness, or even risk for mania.

Practical Considerations in Scale Development

The preceding sections have focused primarily on the conceptual issues in developing survey and interview instruments. This final section concerns primarily practical considerations in developing an instrument. A broad issue that we return to at this point is the source of the information, or the person who will actually be completing the measure. In many cases, particularly with surveys, there is an expectation that the questionnaire will be completed by the individual who is being assessed (i.e., the target). Indeed, in many cases the target is the best source of information about their own thoughts, feelings, behaviors, and abilities. Yet there is also a broad literature demonstrating the value of considering informant sources for assessing psychopathology (Rescorla et al., 2016), job performance (Barrick, Mount, & Judge, 2001), and minors (Tackett et al., 2013). These informants might have any variety of relations to the target, including parents, peers, spouses, teachers, supervisors, observers, or healthcare providers. Test developers should carefully consider the source, or sources, that might provide the most useful information about the target construct and create the instrument accordingly (Alexander et al., 2017). For example, a self-report scale assessing advanced dementia would seem less than ideal in isolation, as would an informant report of a target's internal mental states.

Another practical consideration is the tendency for people to answer items (whether on surveys or in interviews) in ways that may reflect an idiosyncratic response style, response bias, or even intentional distortion that might subvert the validity of their scores. There are a number of methods that have been implemented to deal with these possibilities. At the most extreme level, many instruments – particularly for psychopathology – contain a set of validity scales to assess for these response styles (e.g., the Minnesota Multiphasic Personality Inventory [MMPI]; Ben-Porath, 2012). These typically include extremely rare or impossible items (e.g., "I was born on the moon") to detect lapses in attention, random responding, excessive endorsement, defensiveness, denial, faking, or variable responding across the instrument. In each case, these scales are used solely to screen out invalid responses, as attempts to utilize validity metrics to "correct" scores on substantive scales have been shown to reduce, rather than improve, score validity (e.g., the "K-scale correction" on the MMPI).

It is notable here that response bias concerns and validity scales have been exclusive to self-report sources, yet there is no compelling evidence that whatever problem exists is restricted to self-report. Indeed, most interviews or informant survey measures that are completed by spouses, parents, peers, or clinicians do not have validity scales, yet we are unaware of any evidence to suggest these sources have any fewer issues in this regard. Obviously, test developers should consider carefully how they might handle response biases, and future research should more clearly articulate the nature and magnitude of concerns from multiple sources.
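
The kind of screening such validity indicators perform can be sketched in a few lines of Python. The example below flags respondents who endorse improbable "infrequency" items or who give the same response across a long run of consecutive items (a common "long-string" index of careless responding); the items, sample, and cutoffs are all hypothetical.

    import numpy as np

    rng = np.random.default_rng(3)
    responses = rng.integers(1, 6, size=(200, 40))  # 200 respondents, 40 Likert items
    responses[5, :] = 3                             # simulate one straight-lining respondent
    infrequency_items = [9, 24, 39]                 # e.g., "I was born on the moon"

    # Flag anyone who endorses (rates 4 or 5) two or more of the impossible items.
    infrequency_hits = (responses[:, infrequency_items] >= 4).sum(axis=1)
    flag_infrequency = infrequency_hits >= 2

    # Long-string index: the longest run of identical consecutive responses per person.
    def longest_run(row):
        best = current = 1
        for prev, cur in zip(row[:-1], row[1:]):
            current = current + 1 if cur == prev else 1
            best = max(best, current)
        return best

    long_string = np.array([longest_run(row) for row in responses])
    flag_straightline = long_string >= 20           # illustrative cutoff

    invalid = np.where(flag_infrequency | flag_straightline)[0]
    print("respondents flagged for screening:", invalid)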


There are also more subtle techniques designed to account for some of these response styles. For example, reverse-keyed items can be used to account for individuals who simply endorse all items affirmatively. That said, it has not always been clear that reverse-keyed items actually solve this problem in practice. It is not atypical for reversed items to hang together, perhaps even forming their own (sub)factor, and they generally have inferior psychometric properties relative to positively worded items (Rodebaugh, Woods, & Heimberg, 2007; van Sonderen, Sanderman, & Coyne, 2013). This appears to be due to the increased comprehension demands they place on the test-taker, potentially amplifying problems for groups with lower reading levels or cognitive decline and for translated measures. Thus, although their usage remains fairly common ‒ we have used them ourselves (Samuel et al., 2012) ‒ test developers should carefully consider their pros and cons when developing new surveys or interviews.

Another central consideration that often gets less attention from test developers is carefully choosing the response options/anchors for items as well as the item probes. This entails the instructions to examinees about what they are rating. Most typically, mental health questionnaires utilize an agreement metric, but there are also a number of examples of tests where the answer is based on the frequency with which a certain symptom is exhibited. This can create confusion, particularly when writing items about relatively infrequent, but intense, affective states. An item such as "Please rate the extent to which you have experienced an urge to self-harm in the past week" can be quite confusing for respondents. Consider a person who has had a single episode of intense desire to self-harm: should they respond by saying "a little," based on the fact that it only happened once in the past week? Or should they respond "quite a bit," based on how intensely they felt in that one instance? Clear instructions on how to rate the frequency or the intensity (or perhaps even having respondents rate both, if that information is valuable) are key.

The number and type of response options are also areas that rely primarily on anecdote and tradition rather than empirical findings. Typically, most scales utilize either a true-false format or a Likert-type scale. Although Likert-type scales typically use between four and five response options (with intense debate about the utility and meaning of a "middle" option), there is a wide variety in use, including visual analog scales that have a nearly infinite number. Unfortunately, the amount of research on these considerations is quite slim (e.g., Froman, 2014; Preston & Colman, 2000). Ideally, a test developer would both consider this issue carefully and pilot test a variety of candidates before settling on the set of response options that maximizes the reliability, validity, and utility of the scale.

A final consideration is the length of a scale. Scales run the gamut from several hundred items to some "instruments" that claim to assess complex constructs with only a single item. In practice, the advice we provide is to start with an absolute minimum of three items per construct and move up from there based on a cost-benefit analysis of informational value versus time cost. The minimum of three items provides the ability to call it a "scale" and makes it amenable to latent variable modeling. Ideally, any number above that point will be based on a number of factors, including construct breadth and depth, assessment time, and precision.

In closing, we hope that this chapter has provided a summary of the pros and cons of the survey and interview methods of clinical research that allows the reader to make an informed decision regarding the measurement they use. Despite criticisms, these remain – by far – the most common methods of assessing clinical constructs in the research literature. This is driven in no small part by their efficiency and ease of use. Nonetheless, this ease of use has also allowed these methods to develop a massive literature on their psychometric properties. One can only hope that alternative methods of assessment are ultimately held to the same psychometric standards as these methods.

REFERENCES

Alexander, L. A., McKnight, P. E., Disabato, D. J., & Kashdan, T. B. (2017). When and How to Use Multiple Informants to Improve Clinical Assessments. Journal of Psychopathology and Behavioral Assessment, 39(4), 669–679.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and Performance at the Beginning of the New Millennium: What Do We Know and Where Do We Go Next? International Journal of Selection and Assessment, 9(1‒2), 9–30.
Ben-Porath, Y. S. (2012). Interpreting the MMPI-2-RF. Minneapolis, MN: University of Minnesota Press.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. Psychological Bulletin, 56(2), 81–105.
Chmielewski, M., Clark, L. A., Bagby, R. M., & Watson, D. (2015). Method Matters: Understanding Diagnostic Reliability in DSM-IV and DSM-5. Journal of Abnormal Psychology, 124(3), 764–769.
Clark, L. A., & Watson, D. (1995). Constructing Validity: Basic Issues in Objective Scale Development. Psychological Assessment, 7(3), 309–319.
Cronbach, L. J., & Meehl, P. E. (1955). Construct Validity in Psychological Tests. Psychological Bulletin, 52(4), 281–302.
Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum.
First, M. B., Bhat, V., Adler, D., Dixon, L., Goldman, B., Koh, S., . . . Siris, S. (2014). How Do Clinicians Actually Use the Diagnostic and Statistical Manual of Mental Disorders in Clinical Practice and Why We Need to Know More. Journal of Nervous and Mental Disease, 202(12), 841–844.
First, M. B., Spitzer, R. L., Gibbon, M., & Williams, J. B. W. (1996). Structured Clinical Interview for DSM-IV Axis I Disorders. Washington, DC: American Psychiatric Press.
Froman, R. D. (2014). The Ins and Outs of Self-Report Response Options and Scales. Research in Nursing & Health, 37(6), 447–451.
Haeffel, G. J., & Howard, G. S. (2010). Self-Report: Psychology's Four-Letter Word. American Journal of Psychology, 123(2), 181–188.
Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The Sources of Four Commonly Reported Cutoff Criteria: What Did They Really Say? Organizational Research Methods, 9(2), 202–220.
Loevinger, J. (1957). Objective Tests as Instruments of Psychological Theory. Psychological Reports, 3(4), 635–694.
Loevinger, J. (1979). Construct Validity of the Sentence Completion Test of Ego Development. Applied Psychological Measurement, 3, 281–311.
McDonald, R. P. (1999). Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum.
Morey, L. C., & Benson, K. T. (2016). An Investigation of Adherence to Diagnostic Criteria, Revisited: Clinical Diagnosis of the DSM-IV/DSM-5 Section II Personality Disorders. Journal of Personality Disorders, 30(1), 130–144.


Newman, J. C., Jarlais, D., Turner, C. F., Gribble, J., Cooley, P., & Paone, D. (2002). The Differential Effects of Face-to-Face and Computer Interview Modes. American Journal of Public Health, 92(2), 294–297.
Nisbett, R. E., & Wilson, T. D. (1977). Telling More Than We Can Know ‒ Verbal Reports on Mental Processes. Psychological Review, 84(3), 231–259.
Ong, A. D., & Weiss, D. J. (2000). The Impact of Anonymity on Responses to Sensitive Questions. Journal of Applied Social Psychology, 30(8), 1691–1708.
Perry, J. C. (1992). Problems and Considerations in the Valid Assessment of Personality Disorders. American Journal of Psychiatry, 149(12), 1645–1653.
Poulin, M. (2010). Reporting on First Sexual Experience: The Importance of Interviewer-Respondent Interaction. Demographic Research, 22, 237–287.
Preston, C. C., & Colman, A. M. (2000). Optimal Number of Response Categories in Rating Scales: Reliability, Validity, Discriminating Power, and Respondent Preferences. Acta Psychologica, 104(1), 1–15.
Rescorla, L. A., Achenbach, T. M., Ivanova, M. Y., Turner, L. V., Arnadottir, H., Au, A., . . . Zasepa, E. (2016). Collateral Reports and Cross-Informant Agreement about Adult Psychopathology in 14 Societies. Journal of Psychopathology and Behavioral Assessment, 38(3), 381–397.
Rodebaugh, T. L., Woods, C. M., & Heimberg, R. G. (2007). The Reverse of Social Anxiety Is Not Always the Opposite: The Reverse-Scored Items of the Social Interaction Anxiety Scale Do Not Belong. Behavior Therapy, 38(2), 192–206.
Rogers, R. (2001). Handbook of Diagnostic and Structured Interviewing. New York: Guilford Press.
Samuel, D. B. (2015). A Review of the Agreement Between Clinicians' Personality Disorder Diagnoses and Those From Other Methods and Sources. Clinical Psychology: Science and Practice, 22(1), 1–19.
Samuel, D. B., Riddell, A. D. B., Lynam, D. R., Miller, J. D., & Widiger, T. A. (2012). A Five-Factor Measure of Obsessive-Compulsive Personality Traits. Journal of Personality Assessment, 94(5), 456–465.
Samuel, D. B., Suzuki, T., & Griffin, S. A. (2016). Clinicians and Clients Disagree: Five Implications for Clinical Science. Journal of Abnormal Psychology, 125(7), 1001–1010.
Simms, L. J., Goldberg, L. R., Watson, D., Roberts, J., & Welte, J. (2013). The CAT-PD Project: Introducing an Integrative Model & Efficient Measure of Personality Disorder Traits. Paper presented at the Society for Research in Psychopathology, Oakland, CA.
Smith, G. T., McCarthy, D. M., & Zapolski, T. C. B. (2009). On the Value of Homogeneous Constructs for Construct Validation, Theory Testing, and the Description of Psychopathology. Psychological Assessment, 21(3), 272–284.
Strauss, M. E., & Smith, G. T. (2009). Construct Validity: Advances in Theory and Methodology. Annual Review of Clinical Psychology, 5, 1–25.
Tackett, J. L., Herzhoff, K., Reardon, K. W., Smack, A. J., & Kushner, S. C. (2013). The Relevance of Informant Discrepancies for the Assessment of Adolescent Personality Pathology. Clinical Psychology: Science and Practice, 20(4), 378–392.
Tay, L., & Kuykendall, L. (2017). Why Self-Reports of Happiness and Sadness May Not Necessarily Contradict Bipolarity: A Psychometric Review and Proposal. Emotion Review, 9, 146–154.
Trull, T. J., & Ebner-Priemer, U. (2013). Ambulatory Assessment. Annual Review of Clinical Psychology, 9, 151–176.
Van Sonderen, E., Sanderman, R., & Coyne, J. C. (2013). Ineffectiveness of Reverse Wording of Questionnaire Items: Let's Learn from Cows in the Rain. PLoS One, 8(7), e68967.
Van Vaerenbergh, Y., & Thomas, T. D. (2013). Response Styles in Survey Research: A Literature Review of Antecedents, Consequences, and Remedies. International Journal of Public Opinion Research, 25(2), 195–217.
Vazire, S. (2010). Who Knows What About a Person? The Self-Other Knowledge Asymmetry (SOKA) Model. Journal of Personality and Social Psychology, 98(2), 281–300.
Watson, D. (2004). Stability versus Change, Dependability versus Error: Issues in the Assessment of Personality over Time. Journal of Research in Personality, 38(4), 319–350.
Widiger, T. A., & Samuel, D. B. (2005). Evidence-Based Assessment of Personality Disorders. Psychological Assessment, 17(3), 278–287.
Zetin, M., & Glenn, T. (1999). Development of a Computerized Psychiatric Diagnostic Interview for Use by Mental Health and Primary Care Clinicians. CyberPsychology & Behavior, 2(3), 223–229.

