INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:
A FOUNDATIONAL STUDY
By
Matthew Vail Smith
December 2007
© 2007 Matthew Vail Smith
All Rights Reserved
ACKNOWLEDGMENTS
Dr. Joel Hutchinson, who helped me to overcome the mental blocks I struggled with.
Lisa Cahill, ASU Polytechnic Writing Center, for her constructive criticism and helpful
suggestions.
Professors Merrill Karp and Jim Anderson for introducing me to the PCATD.
Greg and David, the Lab Assistants who taught me how to use the PCATD.
The four flight instructors who took time out of their busy schedules to watch three hours
of footage.
Committee member Dr. William McCurry for his guidance and suggestions.
And very special thanks to my committee chair, Dr. Mary Niemczyk, without whose
unwavering faith and support, I could never have accomplished this project and
graduated.
TABLE OF CONTENTS

CHAPTER
1 INTRODUCTION
      Scope
      Assumptions
      Limitations
      Equipment Used
2 LITERATURE REVIEW
      Background
      Cohen's Kappa
3 METHOD
      Flight Pattern
      Experiment Execution
4 RESULTS
      Raw Scores
      Summary of Results
5 DISCUSSION
      Technical Improvements
      Summary
REFERENCES
APPENDIX
      D Score Sheet
LIST OF TABLES
9. Summary of Results
LIST OF FIGURES
3. Pattern D
CHAPTER 1
INTRODUCTION
Flight schools exist to train student pilots. As part of the regular curriculum, students must attend ground school and engage
in the required number of flight training hours. Ground school and written exams issued
by the Federal Aviation Administration (FAA) are standardized, as are the required
flight syllabi. However, training from school to school is not identical, even though fully
compliant with FAA regulations. Even in a flight school that has very exacting
standards, training may be different under different instructors for any number of reasons
such as the instructors’ abilities and interests. Some pilots dislike instructing and only do
it to build hours and to put experience on a resume. Others do it because they enjoy
sharing their love of flying with others. All instructors regardless of their personal
characteristics must do one thing: evaluate student performance. And yet, because of
differences in perception, instructors' evaluations of the same performance may differ from
one another. The reasons for differences in instructor perception of student performance are
many; one simply cannot catalog another's motives, but one can see the result of the instructors'
perceptions: difference.
When scoring a student pilot, there is the student pilot’s performance, which is
objective, and the instructor pilot’s perception of that performance, which is subjective.
In the best of circumstances, the performance and the recorded perception of that
performance share a high degree of similarity. That is, the instructor ought always to
record a score that accurately and precisely reflects the student’s performance. However,
this is not always the case. Some perceptions of performance are too forgiving, while
others are overly critical. In other words, the same student pilot can receive a passing
score from an overly forgiving instructor and a failing score from an overly critical one, leaving
the student confused and frustrated. There is a problem ensuring that all student pilots receive standardized scores
that reflect the student pilot’s performance with a high degree of reliability.
Students, as well as the other stake-holders of flight schools, must be sure that the
scoring system is such that the scores are a meaningful indicator of the student's performance.
Furthermore, the scores should be consistent from one instructor to another. This consistency is
known as inter-rater reliability, which is "used to assess the degree to which different
raters/observers give consistent estimates of the same phenomenon" (Trochim, 2001, p. 96).
This investigation, then, seeks to offer
any flight school a method to determine the inter-rater reliability of its instructor pilots.
Chapter 1 introduces the problem and sets the parameters of this investigation.
Chapter 2, the literature review, examines pertinent inter-rater reliability literature dealing
with both the statistical theory and application of inter-rater reliability. The literature
reviewed in this study does not come from aviation sources because, after an exhaustive
search of reputable science journal databases, the researcher could not find aviation inter-
rater reliability studies. Instead, the literature reviewed comes from other fields such as
sports, psychology, health care and education, where inter-rater reliability studies are
used extensively. Many lessons learned from these fields may be applied to aviation,
especially in the sub-fields of aviation human factors and flight training/pilot education.
Chapter 3 discusses the methodology used to plan, design and execute the project and to
analyze the data. Chapter 4 examines the results. Chapter 5 discusses two possible ways to
improve inter-rater reliability at the flight school, describes technical improvements identified
while executing the project, offers a commercial application, and makes suggestions for further
research.
Statement of Purpose
The purpose of this investigation is to provide flight schools with a method for
determining the inter-rater reliability of scores of student pilot performance between instructor
pilots. In order to accomplish this task, this investigation reviewed how inter-rater reliability
studies are designed and analyzed in other fields, recorded sample flights for instructor pilots to
score, and applied Cohen's kappa coefficient to measure the raters' reliability.
Scope
This investigation is limited to establishing a foundational method for determining inter-rater
reliability. Four instructor pilots were asked to watch the flight performances of ten
students flying the same instrument flight pattern as recorded on a DVD. The testing of
the raters took place throughout the course of a single afternoon in a controlled environment.
Assumptions
This investigation assumes that there may be a difference between the raters in
terms of their evaluation of student performance that is worth examining and that the
traditional methods for determining inter-rater reliability, such as the kappa coefficient,
are sound. Furthermore, it assumes that the principles of inter-rater reliability are transferable from fields such as education, psychology and health care to flight training.
Limitations
This investigation has a few limitations. First, this study does not—indeed,
cannot—presume to act as a predictive model. It measures what exists now, but cannot
definitively state that raters will evaluate in this way or that. This study does not consider
questions of gender, racial or other forms of favoritism or bias because bias is an error
that causes a rater to be unreliable. This study does not seek to answer why the raters are
reliable or not, but only to establish a repeatable method for determining inter-rater reliability.
Finally, this is a foundational study that seeks only to show that inter-rater reliability studies can be adapted from other
fields and made useful for aviation research, and it uses the instructors of the flight school
as test subjects.
It cannot be over-emphasized that this study investigates neither the student pilots
nor their performance. The student pilots and their performance are only means to the
end of examining inter-rater reliability. Whether a student pilot is a good pilot or a poor
pilot is entirely moot. This study investigates how reliably the raters rate the flight
performances, not the flight performances or the students who flew them.
Finally, there were budgetary limitations. This study was funded entirely by the
researcher. Much of the equipment used, as listed below, belonged to the flight school.
However, the researcher paid for the video camera, its accessories and the computer used to edit the footage and produce the DVDs.
Equipment Used
• an Elite PCATD, with the projector and movie screen used to display the simulated
instrument panel during the flights;
• a digital video camera with “fire wire” output in order to transfer the recorded footage to the hard
drive of a computer;
• an Apple iMac running the iMovie HD and iDVD applications, used to edit
the recorded footage and create DVDs for the raters (instructor pilots) to
view; and
• a PC, projector and movie screen for showing the DVDs to the raters.
Chapter Summary
In order to ensure that students are scored fairly and consistently, flight schools
must consider the inter-rater reliability of their instructor pilots. This study describes the
method for testing the inter-rater reliability of flight school instructors that the researcher
developed and carried out.
CHAPTER 2
LITERATURE REVIEW
This chapter begins with the statistical background of inter-rater reliability. The principal coefficient discussed, kappa, is the one used to analyze the data in this study. The rest of
the chapter focuses on how, in the absence of inter-rater reliability studies in aviation,
inter-rater reliability studies have been used in other fields, such as sports, psychology, health care and education.
Background
Reliability describes the relationship among the phenomenon being measured, the
rating system, and those who use it (DeVellis, 2005; Trochim, 2001). Since this study
seeks to establish the inter-rater reliability of instructor pilots, it is helpful to have some
background in reliability theory. In a brief overview of the subject, DeVellis managed to pack
extensive information into a few short pages. DeVellis reports
that there are two influences at work in the process of measuring scores: “(1) the true
score of the object, person, event, or other phenomenon being measured, and (2) error
(i.e. everything other than the true score of the phenomenon of interest)” (p. 315). In
Chapter One, Introduction, true score was referred to as objective performance. Error can come
from many sources, and raters themselves are susceptible to error, thus the disconnect between
the true score (objective performance) and the obtained score (the recorded perception). Error can be
dealt with through statistical processes and analysis. This investigation seeks to measure
rater error. It does not study what errors are, why errors exist, or the moral implications
of error.
The purpose of the kappa statistic is to account for and eliminate agreement by
chance—chance being a type of error—so that the researcher can get a clearer idea of
how much agreement there really is between raters. The coefficient, then, distinguishes true
agreement from agreement by chance. In general terms, the quantified possible error becomes
the denominator, while the quantified true score is the numerator: reliability is the proportion of
variance "ascribable to the true score relative to the total variability of the obtained score"
(DeVellis, 2005). Or, in the terms chosen for this investigation, it is the ratio of the student
pilot's objective performance to the instructor's recorded perception of that performance. In this
study, it is assumed that any disconnect in the relationship between the pilot's performance (true
score) and the instructors' recorded perception (obtained score) is attributable to rater error.
The way to find this coefficient, then, is to measure rater against rater rather than
pilot against rater. Each rater observed the exact same flight performances. Therefore,
the raters ought to record identical scores. In practice they may or may not. This is why
one performs an inter-rater reliability study: to discover the discrepancies between true scores and obtained scores.
Cohen’s Kappa
In the late 1950s and throughout the 1960s, Jacob Cohen conducted seminal research on
rater agreement. That work established his coefficient, named for the Greek letter kappa (κ), as
the standard coefficient for inter-rater reliability, with κ ≥ .70 being considered reliable. This is
not merely a 70% agreement, because agreement can happen by chance. Instead, kappa
accommodates the expected frequency of ratings; thus agreement that could be expected by
chance alone is removed from the coefficient. Cohen's (1960) article introduces the kappa
coefficient and raises three points that are foundational to inter-rater reliability; chief among
them, for this study, is the requirement that raters score the same cases independently of one another.
Dr. Kilem Gwet's paper explaining Cohen's kappa gave additional information
not presented in Cohen’s article, such as explaining how to use Cohen’s kappa step-by-
step. Gwet’s work gave much inspiration to this investigation and the methodology he
describes has been adapted for use in this project. What follows is a brief paraphrasing of Gwet's (2002b) example.
Two raters observe three species of turtles. They are told to identify the species to
which each turtle belongs (y, r or c). Thirty-six turtles are observed and each pair of decisions is tallied in a contingency table.
If Rater 1 claims “Y” and Rater 2 claims “R,” then the tally goes in the box that
corresponds with Y/R: first column, second row. If both raters claim “R,” then the tally
goes into the R/R box in the middle of the table: second column, second row. And so on.
The row and column tallies were then totaled in order to ensure that the correct number of
observations, 36, was recorded. The total number of agreements is calculated, “by
summing the values of the diagonal cells of the table Σa= 9 + 8 + 6 = 23.” (Gwet, 2002b)
Figure 1 shows Gwet's contingency table. The cells showing agreement (Y/Y, R/R and
C/C) lie along the diagonal of the table.

Figure 1. Gwet's contingency table (columns: Rater 1's decisions; rows: Rater 2's decisions).

Decision          Y     R     C    Row totals
Y                 9     3     1        13
R                 4     8     2        14
C                 2     1     6         9
Column totals:   15    12     9     Σa = 23
Out of the thirty-six turtles observed, the raters agreed on 23 decisions, thus
making the agreement level 64%. That is not good enough because some of that agreement may
have happened by chance.
In order to account for chance agreement, one must compute the expected
frequency (ef) by dividing the product of the row and column totals by the number of
samples (N). Figure 2 shows that by dividing the products of the row and column totals for each
diagonal cell by N, one obtains the frequency of agreement expected by chance; for the Y/Y
cell, for example, (13 × 15) / 36 = 5.42.
Figure 2. Expected frequencies (in parentheses) for the agreement cells (columns: Rater 1; rows: Rater 2).

Decision          Y            R            C           Row totals
Y                 9 (5.42)     3            1               13
R                 4            8 (4.67)     2               14
C                 2            1            6 (2.25)         9
Column totals:   15           12            9           Σa = 23 (Σef = 12.34)
To find kappa, then, one divides the difference of Σa minus Σef by the difference of N minus
Σef: κ = (Σa - Σef) / (N - Σef) = (23 - 12.34) / (36 - 12.34) = .45. A kappa of .70 or greater is
considered satisfactory; less than .70 is not. This example has a kappa of .45, denoting
unsatisfactory inter-rater reliability.
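To make the arithmetic concrete, the short Python sketch below reproduces the turtle example: it takes the contingency table from Figure 1, derives the expected frequencies shown in Figure 2, and applies the kappa formula. The script is an illustration added for clarity; it is not part of Gwet's or Cohen's original material.

```python
# Cohen's kappa for the turtle example (Figures 1 and 2).
# Rows are Rater 2's decisions, columns are Rater 1's decisions.
table = [
    [9, 3, 1],  # Rater 2 said Y
    [4, 8, 2],  # Rater 2 said R
    [2, 1, 6],  # Rater 2 said C
]

n = sum(sum(row) for row in table)              # 36 turtles observed
row_totals = [sum(row) for row in table]        # 13, 14, 9
col_totals = [sum(col) for col in zip(*table)]  # 15, 12, 9

# Agreements lie on the diagonal: 9 + 8 + 6 = 23.
sum_a = sum(table[i][i] for i in range(3))

# Expected frequency of chance agreement for each diagonal cell:
# (row total x column total) / N, e.g. (13 x 15) / 36 = 5.42.
sum_ef = sum(row_totals[i] * col_totals[i] / n for i in range(3))

kappa = (sum_a - sum_ef) / (n - sum_ef)
print(f"Sum of ef = {sum_ef:.2f}, kappa = {kappa:.2f}")  # about 12.33 and 0.45
```

Rounding each expected frequency to two decimals before summing, as Figure 2 does, gives 12.34 rather than 12.33; either way kappa rounds to .45.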
In this case, Gwet’s recommendation was to retrain the raters to recognize the
species better. Specifically, the raters had trouble with two species in particular, thus
Gwet recommended that the raters "focus on correctly discriminating between these two types" of turtle (Gwet, 2002b).
Gwet’s explanation of Cohen’s kappa showed two raters with thirty-six samples
of three species. The current inter-rater reliability study has four raters judging ten flight
performances on a five-point scale. Gwet's explanation of Cohen's kappa (2002b) was
later followed by a second article on why kappa is insufficient (Gwet, 2002a). However
compelling the arguments for his alternative coefficient's merits, the researcher did not find Gwet's alternative method in literature
other than his own, whereas the researcher found Cohen’s kappa coefficient used
extensively. Therefore, Gwet’s criticism of kappa is mentioned here only to make the
reader aware that there are other means (other coefficients) of determining inter-rater
reliability. This study uses Cohen’s kappa, since it is widely accepted, while Gwet’s new
coefficient is not.
Flying and sports are related activities in that they are both simultaneously
physical and mental, or psychomotor, a term that denotes the inseparability of the physical
and mental aspects. Being physical acts, they can be measured. And being measurable, they
can be rated. One study, Development of an Instrument to Assess Jump-Shooting Form in
Basketball (Lindeman, Libkuman, King, & Kruse, 2000), examined the physical form
and movements of a jump shot. Basketball coaches have written books that discuss what
proper shooting form is, and the study used that information to create an instrument for
assessing jump-shots. Four raters then viewed video tapes of 32 shooters and rated the
shooters’ form and movement according to the instrument developed. The conclusion
was that the instrument may help discern a correlation between the shooter's form and the
success of the shot.
The jump shot study shows the validity of an inter-rater reliability study when observing
a psychomotor activity. Such a study should be equally
valid when observing flight performances, because it, too, observes psychomotor activity.
In psychology, inter-rater reliability studies have verified that rating scales and other methods of measuring patient behavior are reliable means of assessment. These
studies have been used to assess rating scales and assessment methods related to sleep
disorders (Ferri, Bruni, Miano, Smerieri, Spruyt & Terzano, 2005), mental capacity
(Schmidt, Salas, Bernert & Schatschneider, 2005), delusions (Bell, Halligan & Ellis,
2006 and Meyers, English, Gabriele, Peasley-Milkus, Heo, Flint, et al., 2006), social
dysfunction (Monroe-Blum, Collins, McCleary, & Nuttall, 1996), and other means of rating
psychological disorders (Drake, Haddock, Terrier, Bentall, & Lewis, 2007). Nor is the use of
inter-rater reliability studies confined to the United States. It is used in China (Leung & Tsang, 2006), Korea (Joo, Joo, Hong,
Hwang, Maeng, Han, et al., 2004), Japan (Kaneda, Ohmoria & Fujii, 2001), in the Arabic
language (Kadri, Agoub, El Gnaoui, Mchichi Alami, Hergueta & Moussaoui, 2005),
Turkey (Tural, Fidaner, Alkin & Bandelow, 2002), Greece (Papavasiliou, Rapidi, Rizou,
Petrapoulou & Tzavara, 2007 and Kolaitas, Korpa, Kolvin & Tsiantis, 2003), and France
(Thuile, Even, Friedman & Guelfi, 2005). In all of these articles, scales or other methods
and methods of assessment were tested and validated using inter-rater reliability studies.
It seems, then, that inter-rater reliability studies serve a very useful purpose in
determining the validity of scoring or rating rubrics. Thus, one can surmise that an inter-
rater reliability study may be very useful to a flight school that needs to measure the consistency of its instructor pilots' scoring.
Training health care practitioners also has parallels to training pilots. Both health
care practice and the practice of flying require both mental aptitude and the physical
skills to carry out their mentally-driven tasks. This fact is true for the entire gamut of
health care practitioners from nurses to surgeons and the gamut of pilots from the simple
sport (ultra-light) pilot to captains of 747s. All of the individuals in these vast and
diverse groups require a level of mental and physical harmony that demands high-level training
and evaluation.
Research regarding nursing in triage units verified that "live" experiments may be more
reliable than paper-based ones. A
comparison of live versus paper case scenarios (Worster, Sardo, Fernandes, Eva, &
Upadhy, 2007) shows that the kappa was acceptable in both live and paper cases,
however, the correlation in live cases was much higher (.90 live, versus .76 on paper).
Therefore it seems that it is better to test the inter-rater reliability of instructor pilots with a live, or at least recorded, performance than with a paper-based scenario.
Paper-based scenarios would have been easy enough to create for the
instructor/raters being investigated, but as this triage nursing study makes clear, live is
more desirable because it is more reliable. The researcher did not conduct this present study with the raters observing the flights live.
Instead, the performances that the raters observed were captured on video for viewing at
another place and time, which is consistent with other studies reviewed in this chapter.
Bann, Davis, Moorthy, Munz, Hernandez, Khan, Datta, and Darzi (2005) studied
11 surgical trainees and put them through a 15 minute, six-station rotation of basic
surgical tasks. Each trainee performed the six-station rotation on five separate occasions
for a total of 90 minutes of observation. All of the trainees’ performances were video
recorded for later review. The six tasks each had criteria determining what makes a
trainee competent or not at that task. For example, in the suturing task, trainees were
rated on the “time taken and total number of movements” used to complete the task
(Bann, et al., 2005). The trainees were further rated on the quality of the suture, based on
the squareness and orientation of the knots. The authors emphasized the objectivity of their measuring criteria.
The researchers used the Spearman correlation coefficient (rho) in their statistical
analysis to measure improvement across the repeated sessions (Bann, Davis,
Moorthy, Munz, Hernandez, Khan, Datta, & Darzi, 2005). (Since neither the pilots nor
the raters sit for their part of the study more than once, there will not be any improvement
to measure. Therefore, rho is not necessary to this study.) On the other hand, the
researchers used Cronbach’s alpha coefficient “to test a number of internal consistency;
these included the inter-rater reliability of video assessment and intra-task reliability.”
(Bann, et al., 2005). The result of this experiment was that video assessment is indeed a
reliable means of assessing performance. Yet another study concluded that inter-rater
reliability of videotaped cases was excellent, having a coefficient of .93 (Hulsman, Mollema,
Oort, Hoos, & de Haes, 2006).
Michelson (2006) discusses parallels between simulator training in
medicine and aviation. Moreover, Michelson specifically cites the usefulness and
ubiquity of simulator training in aviation, and suggests that more and better simulators be
developed in the training of orthopedic surgeons. Michelson cites other studies that
suggest “good, but not perfect, correlation” (Michelson, 2006) and later suggests that
simulator-based competency standards be developed, and predicts that such standards will likely
come built in to future simulators. One advantage of
simulators is that they are asynchronous. That is, a resident doctor need not have a
supervisor present during training if using a simulator. Furthermore, the data collected
during the simulation can be reviewed by more than one supervisor or rater
independently, meeting Cohen’s third requirement that raters perform their duties
Inter-rater reliability studies are not used solely in the training of health care
professionals, but also to verify the rubrics for various cases such as rating the presence of
witnessed collapse and bystander CPR (Rittenberger, Martin, Kelly, Roth, Hostler, &
Callaway, 2006) and for rating the severity of rosacea (Bamford, Gessert, & Renier,
2004). The authors of the rosacea article admitted that when the scale ranged from 1 to
10, the inter-rater reliability coefficient indicated unreliable rating. But when the scale
was reduced to a range from 1 to 5, the inter-rater reliability coefficient was much better.
Holey and Watson (1995) provided a stark example of the necessity for kappa rather than simple
percentage agreement: in some of their cases the percentage of agreement was 100%, while the
kappa coefficient, which accounts for chance agreement, was only 0.01, indicating essentially no
agreement beyond what chance alone would produce.
Kappa has also been found useful in determining inter-rater reliability in other
studies. A study conducted by Kolt, Brewer, Pizzari, Schoo, & Garrett (2007) combined
two inter-rater reliability studies, one in which six physiotherapists and physiotherapy
students examined videotaped cases, the other compared two live clinical sessions. The
results were that the inter-rater reliability of the first study was very high (κ= .87 to .93)
and the second study reliability varied from very good to good (κ = .76 to .89 and .63 to
.76). Dionne, Bybee, & Tomaka (2006) used kappa to establish moderate reliability (κ =
.55) in a study using 20 patients and 54 trained clinicians. Fifty-four raters is the greatest
number of raters found among the studies reviewed for this chapter.
Goodwin and Goodwin analyzed the statistical techniques used in the Journal of
Educational Psychology, 1979-1983 (1985) in order to discern the methods most commonly
employed in educational research. Because the journal is well regarded, it is understood that the
statistical methods used by its contributors are useful and sound; among those methods were
inter-rater reliability studies. Inter-rater reliability studies comprised nearly half of the studies,
by far the greatest percentage. Considering how commonly researchers use inter-rater
reliability studies in educational research, the Goodwin and Goodwin article indicates that
performing an inter-rater reliability study at flight schools, which are themselves educational
institutions, is a natural extension.
Inter-rater reliability studies are also used in the assessment of student writing. The
question of what constitutes good or bad writing cannot be answered with an inter-rater
reliability study. Instead, much like the rubrics used to rate medical observations or the
jump-shot as discussed previously, the rubrics for scoring essays must be created first by
an expert or group of collaborating experts who know what good writing is. Qualitative
characteristics must be sorted and presented in such a way that raters can quantify their
observations and opinions of the writing samples. Lee (2004) noticed that, given a
holistic scoring rubric, raters scored computer-based writing samples provided by English
as Second Language (ESL) students far more reliably than when using paper-based—that
is to say, handwritten—writing samples. The holistic rubric included several criteria that
accounted not only for the quality of content, but also quality of expression, as
determined by the writing experts. Lee suggests that the raters may need to learn how not
to discriminate against messy handwriting, and that correcting that bias may help to make the scoring of handwritten samples as reliable as the scoring of computer-based samples.
Penny, Johnson and Gordon (2000) introduced the idea of augmenting a holistic
rubric with benchmark writing samples. Writing, like many other human activities, is
performed on a continuum. That is, one cannot easily discern discrete moments, but
rather observes ability over the passage of time. Assigning an integer to rate a continuous
performance requires a 'snapshot', or a discrete variable. In many cases, this means assigning a rating
from 1 to 5. Inter-rater reliability studies show whether the quality of writing (or whatever
act is being rated) is being accurately translated into a quantity, which can then be
measured. Introducing benchmark papers helped those charged with assessing writing
samples to more accurately rate the quality of writing because each integer had an
exemplar to which the raters could refer. Thus, the inter-rater reliability was increased when benchmark samples augmented the holistic rubric.
Chapter Summary
The literature reviewed in this chapter provided the statistical foundation for this study
and many examples of how to design and execute inter-rater reliability studies. The
articles featured in this study were chosen because the fields of study all involved training
and featured psychomotor skills that are analogous to and transferable to evaluating
pilot training.
CHAPTER 3
METHOD
This chapter describes the method used to determine the inter-rater reliability of instructor
pilots when observing flights performed by student pilots. This study included video recordings
of ten student pilots flying the same instrument flight rules (IFR) pattern. The researcher transferred the footage to a DVD. Four
instructor pilots reviewed DVDs of the flight performance footage and scored the student
pilots’ performances on a scale of 1 to 5. The researcher then analyzed the scores using
Cohen’s kappa coefficient. The resulting coefficients are discussed in Chapter Four,
Results.
Flight Pattern
In The Pilot’s Manual: Instrument Flying (Kirshner, 1990) there are several flight
patterns to choose from. The pattern used for this investigation is referred to as Pattern
D. It was chosen because it is long enough to give the raters something substantial to evaluate.
The pattern appears in Figure 3.
Pilot Participants
Ten student pilots from the flight school volunteered and participated by flying the aforementioned flight pattern using a PCATD. The researcher
explained to the students that they were being videotaped for the purpose of investigating
inter-rater reliability. They were assured that these scores, good or bad, would not figure
into their course average. Their identities were protected by preventing any
distinguishing features from being recorded on video. Also, the order in which the flight
performances were viewed was different from the order they were recorded. Thus, the
flight recorded first on the day of recording might actually have been
the last flight viewed by the raters. The researcher did not collect or record any
demographic data about the student pilot participants, in order to abide by the limitations set out in Chapter One.
Rater Participants
The rater-participants were selected from the pool of instructor pilots at the flight
school. All instructor pilots were offered a chance to participate and the researcher
enlisted the help of four volunteers. These instructor pilots watched and scored the
flights that are contained on the DVDs. They are the raters, whose reliability this study
investigates. Just as the student pilots who flew the pattern were assured that their
participation would not affect their scores in school, the raters were assured of their
anonymity and that their performance in this study would not impact their employment at
the flight school. Also just as with the student pilot participants, the researcher did not
collect or record any demographic data about the rater participants, again in order to abide by those limitations.
Scoring Rubric
The raters scored the flights using the flight school's existing scoring
rubric. The flight school at which this study was performed already has a scoring rubric in
place, and it was adopted unchanged for this investigation. The reader is referred to the scoring
rubric in Appendix A, which explains what each numerical score signifies. A flight performance
is fundamentally a matter of quality rather than quantity, yet in studies such as this and those in the social sciences, medical science,
and education, researchers must change qualitative performance into quantitative data in
order to perform statistical analysis. One cannot average words or put words into a
formula. Thus, words (qualities) must be transformed into numbers (quantities). There is
no analyzing poor, good, or great, but one can analyze scores of 1, 3, or 5. This is
precisely the reason for this inter-rater reliability study: to determine whether the student pilots'
qualitative performances are being converted into scores consistently from rater to rater.
Before each student pilot sat at the PCATD, the researcher briefed him or her (Appendix B). The
pattern is rather complex, and depending on the skill of the student pilot, the researcher
gave oral instructions, if necessary. As stated in Chapter One, Introduction, this study is
not investigating the student pilots. Therefore, the student pilots’ ability to perform the
flight pattern well or poorly is immaterial. What this study investigates is whether the
raters agree about the student pilots’ performance. Therefore helping a lesser skilled
student pilot complete the pattern does not affect the inter-rater reliability. The raters
were entirely unaware of which students referred to the pattern diagram and which students
performed the pattern from memory. After each flight was finished, there was a short debriefing
(Appendix B).
Experiment Execution
After the flight patterns were recorded, it was time to test the raters. The raters
were seated so that they could not communicate with the
other raters, just as specified by Cohen (1960). The raters were asked to score the
student pilots' performances according to the scoring rubric. After the raters scored the flights,
the researcher collected the score sheets for analysis.
Cohen’s coefficient kappa is derived using only two raters. Several studies cited
in Chapter Two, Literature Review, used only two raters, some four, some more. After a
very thorough search, the researcher could not find any research that suggests an optimal
number of raters for inter-rater reliability studies. In this study, there are four raters
because the researcher looked for an even number of raters, as most of the other studies
had, and four instructor pilots made themselves available for testing purposes.
The researcher used six contingency tables similar to the tables described in
Chapter Two, Literature Review, but adapted the table to provide the resultant
information in conformity to the APA style manual. (Gwet’s contingency tables do not
conform to the APA manual.) The example table (Table 1) shows hypothetical Rater X
versus hypothetical Rater Y. The numbers 1 through 5 indicate the scores which raters may
assign. Pairs of scores from the ten flights (A through J) were tallied in the table according to the rules
as described by Gwet (2002b) in Chapter Two, Literature Review. That is, if rater 1
gives a score of “3” and rater 2 gives a score of “3” then one point will be tallied in the
cell (3, 3). (In Table 1 below, the cells contain zeros because this is only an illustrative example.)
Table 1
Example Contingency Table for Hypothetical Raters X and Y (columns: Rater X; rows: Rater Y)

Score             1    2    3    4    5    Row total     a      ef
1                 0    0    0    0    0        0         0      0
2                 0    0    0    0    0        0         0      0
3                 0    0    0    0    0        0         0      0
4                 0    0    0    0    0        0         0      0
5                 0    0    0    0    0        0         0      0
Column totals     0    0    0    0    0     N = 10    Σa = 0   Σef = 0
The six tables account for every possible pairing of the four raters without replicating pairs.
After the result of each table is tallied according to Cohen’s kappa method, the resultant
coefficients will then be analyzed to determine the inter-rater reliability of the instructor
pilots in comparison with each other. Each column and row should add up to 10, which
is the N, the only constant in the equation. Column a is the number of agreements. This
number is simply the cells showing agreement (e.g. 1, 1; 2, 2, etc.) brought over to a
single column. Column ef is the expected frequency. (The method to derive the ef was
discussed earlier.) At the bottom of column a and column ef is the sum of a (Σa) and the
sum of ef (Σef). In the next chapter, these tables will have beneath them the kappa calculation for the corresponding rater pair.
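For readers who want to see the tallying and kappa arithmetic automated, the sketch below shows one possible implementation in Python. It is only an illustration of the procedure described above, not software used in this study; the rater names and score lists in the example are hypothetical placeholders.

```python
from itertools import combinations

def cohens_kappa(scores_x, scores_y, levels=range(1, 6)):
    """Tally a contingency table for two raters and return Cohen's kappa."""
    n = len(scores_x)
    # Cell (y, x) counts flights where rater Y gave y while rater X gave x.
    table = {(y, x): 0 for y in levels for x in levels}
    for x, y in zip(scores_x, scores_y):
        table[(y, x)] += 1
    row_totals = {y: sum(table[(y, x)] for x in levels) for y in levels}
    col_totals = {x: sum(table[(y, x)] for y in levels) for x in levels}
    sum_a = sum(table[(s, s)] for s in levels)                       # agreements
    sum_ef = sum(row_totals[s] * col_totals[s] / n for s in levels)  # chance agreement
    return (sum_a - sum_ef) / (n - sum_ef)

# Hypothetical score sheets (flights A through J) for four raters.
scores = {
    "Rater W": [3, 4, 2, 5, 1, 3, 4, 2, 5, 1],
    "Rater X": [3, 4, 1, 5, 1, 3, 4, 2, 4, 1],
    "Rater Y": [2, 4, 2, 4, 1, 3, 3, 2, 5, 2],
    "Rater Z": [3, 5, 2, 5, 1, 4, 4, 2, 5, 1],
}

# One contingency table and kappa for each of the six rater pairings.
for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: kappa = {cohens_kappa(scores[a], scores[b]):.2f}")
```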
Chapter Summary
The researcher recruited ten student pilots as volunteers to fly Instrument Pattern D using the Elite PCATD. A video camera
recorded the image of the simulated instrument panel on the movie screen during the
flights. After recording the student pilots’ flights, the researcher transferred the footage
onto DVDs for easier viewing. Each flight was assigned a letter, A through J. The
researcher then enlisted the help of four instructor pilots to be the rater participants. The
instructor pilots watched and scored the flights in a controlled environment. Upon
finishing their task, the researcher collected their score sheets and placed the scores into
the contingency tables. The researcher then took the pertinent numbers from the table
(those that indicate agreement) and put them into the kappa formula.
If the coefficient, kappa, is .70 or greater, a rater pair can be said to exhibit
satisfactory reliability; if it is less than .70, the pair exhibits unsatisfactory
reliability. The next chapter will discuss the results of this experiment.
CHAPTER 4
RESULTS
The experiment was conducted in a classroom equipped with a PC, projector and
movie screen. The four raters sat in the same room, but were seated far apart to prevent
communication between raters. They were given instructions and a score sheet
(Appendix C and D, respectively) and were briefed by the researcher about how to
behave during the test (i.e. no talking, gesturing, or using other means of communicating
during flights, no talking about the flights during break times, etc.). It took three hours to
watch all of the flights, including two short restroom breaks and one longer break time
during which the researcher switched from the first to the second DVD.
Raw Scores
The raters watched the flights and marked the scores on the score sheet that was
provided. The researcher collected the score sheets and the raw scores are in Table 2
below.
Table 2
Raw Scores by Rater and Sample Flight

             Sample flight
Rater     A    B    C    D    E    F    G    H    I    J
1         4    5    2    1    4    3    2    3    1    5
2         4    5    1    1    4    4    2    4    1    3
3         3    3    1    1    3    4    1    4    1    2
4         3    5    1    1    3    3    2    4    1    4
At first glance, these scores appear to show good agreement, especially in sample
flights C, D, G, H and I. A brief examination of the raw scores also reveals that Rater 1
distributed the scores evenly across the scale, the only rater to do so. Raters 2 and 4 had very
similar results, with the only disagreements being between scores of 3 and 4. Rater 3 gave the
most scores of 1, and gave no scores of 5. However, to properly analyze the data for inter-rater
reliability, the scores must be sorted into contingency tables and the kappa coefficient computed
for each rater pair.
Contingency Tables
The scores were placed into contingency tables like the example (Table 1) illustrated in Chapter Three. Tables 3 through 8 below are the contingency tables that were
used to sort and analyze the data. These tables were adapted from Gwet (2002b) in order
to conform to APA standards and to show data without the redundancy of tables as in
Gwet (2002b). Beneath each contingency table is the mathematical work used to derive the kappa coefficient for that rater pair.
Table 3
Raters 1 and 2 (columns: Rater 1's scores; rows: Rater 2's scores)

Score             1    2    3    4    5    Row total     a      ef
1                 2    1    0    0    0        3         2      .6
2                 0    1    0    0    0        1         1      .2
3                 0    0    0    0    1        1         0      .2
4                 0    0    2    2    0        4         2      .8
5                 0    0    0    0    1        1         1      .2
Column totals     2    2    2    2    2     N = 10    Σa = 6   Σef = 2

κ = (Σa - Σef) / (N - Σef) = (6 - 2) / (10 - 2) = .50
Table 4
Raters 1 and 3 (columns: Rater 1's scores; rows: Rater 3's scores)

Score             1    2    3    4    5    Row total     a      ef
1                 2    2    0    0    0        4         2      .8
2                 0    0    0    0    1        1         0      .2
3                 0    0    0    2    1        3         0      .6
4                 0    0    2    0    0        2         0      .4
5                 0    0    0    0    0        0         0      0
Column totals     2    2    2    2    2     N = 10    Σa = 2   Σef = 2

κ = (Σa - Σef) / (N - Σef) = (2 - 2) / (10 - 2) = 0
Table 5
Raters 1 and 4 (columns: Rater 1's scores; rows: Rater 4's scores)

Score             1    2    3    4    5    Row total     a      ef
1                 2    1    0    0    0        3         2      .6
2                 0    1    0    0    0        1         1      .2
3                 0    0    0    0    1        1         0      .2
4                 0    0    2    2    0        4         2      .8
5                 0    0    0    0    1        1         1      .2
Column totals     2    2    2    2    2     N = 10    Σa = 6   Σef = 2

κ = (Σa - Σef) / (N - Σef) = (6 - 2) / (10 - 2) = .50
Table 6
Raters 2 and 3 (columns: Rater 2's scores; rows: Rater 3's scores)

Score             1    2    3    4    5    Row total     a      ef
1                 3    1    0    0    0        4         3      1.2
2                 0    0    1    0    0        1         0      .1
3                 0    0    0    2    1        3         0      .1
4                 0    0    0    2    0        2         2      .8
5                 0    0    0    0    0        0         0      0
Column totals     3    1    1    4    1     N = 10    Σa = 5   Σef = 2.2

κ = (Σa - Σef) / (N - Σef) = (5 - 2.2) / (10 - 2.2) = 2.8 / 7.8 = .36
Table 7
Raters 2 and 4 (columns: Rater 2's scores; rows: Rater 4's scores)

Score             1    2    3    4    5    Row total     a      ef
1                 3    0    0    0    0        3         3      .9
2                 0    1    0    0    0        1         1      .2
3                 0    0    0    3    0        3         0      .3
4                 0    0    1    1    0        2         1      .8
5                 0    0    0    0    1        1         1      .2
Column totals     3    1    1    4    1     N = 10    Σa = 6   Σef = 2.4

κ = (Σa - Σef) / (N - Σef) = (6 - 2.4) / (10 - 2.4) = 3.6 / 7.6 = .47
Table 8
Raters 3 and 4 (columns: Rater 3's scores; rows: Rater 4's scores)

Score             1    2    3    4    5    Row total     a      ef
1                 3    0    0    0    0        3         3      1.2
2                 1    0    0    0    0        1         0      .1
3                 0    0    2    1    0        3         2      .9
4                 0    1    0    1    0        2         1      .6
5                 0    0    1    0    0        1         0      .1
Column totals     4    1    3    3    1     N = 10    Σa = 6   Σef = 2.9

κ = (Σa - Σef) / (N - Σef) = (6 - 2.9) / (10 - 2.9) = 3.1 / 7.1 = .44
Summary of Results
The scores have been tallied and the kappa for each rater pair calculated. As
stated previously throughout this study, the minimum desirable kappa coefficient is .70.
Table 9
Summary of Results

Rater pair          Kappa
Raters 1 and 2       .50
Raters 1 and 3       .00
Raters 1 and 4       .50
Raters 2 and 3       .36
Raters 2 and 4       .47
Raters 3 and 4       .44
Average              .38

The best kappa was .50, and the worst, 0. The average kappa coefficient was .38, just over half
of the desired minimum of .70.
Although all of the rater pairings in this study fell far below .70, one rater, Rater
3, seemed the least reliable of the four. The three pairings in which Rater 3 was involved
were the least reliable, one of which had a kappa of 0, entirely unreliable. Rater 1, with
whom Rater 3 shared the kappa of 0, enjoyed the two highest reliability scores, .50, with
Raters 2 and 4.
Each rater was paired three times. When each rater’s three pairings were
averaged, Rater 1 scored a .33; Rater 2, .44; Rater 3, .27; and Rater 4, .47. However,
removing Rater 3 from the averages, so that each rater was only paired twice, Rater 1’s
average rose to .50, Rater 2 to .48 and Rater 4 to .48. Among Raters 1, 2 and 4, the
scores are extremely similar (pair 1 & 2 .50, pair 1 & 4 .50 and pair 2 & 4 .47). Thus it
seems that removing Rater 3 improved the inter-rater reliability in this study. Without
Rater 3 the overall average reliability increased from .38 to .49. This is still well below the desired .70.
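The averaging described in this paragraph is easy to reproduce. The small illustrative script below is not part of the original analysis; it simply takes the six pairwise kappas from Table 9 and recomputes each rater's average with and without Rater 3's pairings.

```python
# Pairwise kappa coefficients as summarized in Table 9.
pair_kappa = {
    (1, 2): 0.50, (1, 3): 0.00, (1, 4): 0.50,
    (2, 3): 0.36, (2, 4): 0.47, (3, 4): 0.44,
}

def rater_average(rater, excluding=None):
    """Average kappa over every pairing involving `rater`,
    optionally skipping pairings that include `excluding`."""
    ks = [k for pair, k in pair_kappa.items()
          if rater in pair and (excluding is None or excluding not in pair)]
    return sum(ks) / len(ks)

for r in (1, 2, 3, 4):
    print(f"Rater {r} average: {rater_average(r):.2f}")

# Re-average Raters 1, 2 and 4 after dropping every pairing with Rater 3.
for r in (1, 2, 4):
    print(f"Rater {r} without Rater 3: {rater_average(r, excluding=3):.2f}")

remaining = [(1, 2), (1, 4), (2, 4)]
overall = sum(pair_kappa[p] for p in remaining) / len(remaining)
print(f"Overall average without Rater 3: {overall:.2f}")
```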
The next chapter will discuss two methods to improve inter-rater reliability at the
flight school and recommendations for improving the execution of the study and further
research. The next chapter also includes a commercial application of this study.
CHAPTER 5
DISCUSSION
The resultant coefficients are such that the study did not yield good inter-rater
reliability. There must be some way to improve inter-rater reliability at the flight school.
Two suggestions are to engage in extensive recurrent training and to improve the scoring
rubric. There are also some ways to improve the technical aspects of the study and to do
further research. Finally, the researcher proposes a commercial application for this inter-rater reliability testing method.
Recurrent Training
The previous chapter described the raw scores and the resultant kappa coefficients
for the four raters. These scores show low inter-rater reliability which may indicate the
need for recurrent training, which may help the flight school reinforce the scoring
criteria. In the case of Rater 3, more training would be required than for Raters 1, 2 and
4. In sample B, while Raters 1, 2 and 4 agreed upon a score of 5, Rater 3 awarded a score
of 3. In sample G where all others gave a score of 2, Rater 3 gave a 1. And in Sample J,
where there was no agreement among any raters, Rater 3 gave the low score of 2. After
examining the raw scores, it is evident that the most common disagreement was between
the scores 3 and 4. It may be that Raters 1, 2 and 4 need to review the standards to help
them differentiate between performances that rate a 3 rather than a 4, while Rater 3 needs
a greater amount of training to align that rater's expectations of student performance with the school's standards.
It may also be helpful to start training instructor pilots how to interpret the
standards used to score student pilot performance first using simple maneuvers and
working their way up to complex patterns, just as the students themselves must work
their way up from simple maneuvers to complex patterns. This recurrent training may be
of little use unless the standards are better defined through an improved scoring rubric.
Improving the Scoring Rubric
It could be that the scoring rubric needs improving. Referring again to Appendix A, the
rubric defines each grade in qualitative terms rather than in
quantifiable data. For example, “An ‘Excellent’ (5) grade will be issued when a student’s
performance far exceeds and is well above the completion standards.” Unfortunately,
there is little to define exactly what makes a performance far exceed or well above the
completion standards. The same can be said for scores 4, 3, 2, and 1. These definitions are too vague to produce consistent scores.
The scoring sheet (Appendix D) offered the rater the completion standards from
the lesson in which Pattern D is taught. The altitude standard asks only that a student
pilot remain within plus or minus 200 feet of the starting altitude. This standard is very
broadly defined and leaves too much open to interpretation by individual instructor pilots
and hence affects inter-rater reliability. As examples of how to fine-tune the altitude standard (a brief illustrative sketch follows this list):
• a score of 5 should require the student remain within plus or minus 50 feet;
• a 1 indicates that the student violated the 200 foot limit in either direction, and
therefore is unsatisfactory.
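To show how such a fine-tuned standard removes interpretation, the sketch below expresses it as a simple lookup from the largest altitude deviation to a score. Only the 50-foot band for a 5 and the 200-foot limit for a 1 come from the bullets above; the intermediate 100-foot and 150-foot bands are hypothetical values chosen purely for illustration.

```python
def altitude_score(max_deviation_ft: float) -> int:
    """Convert the largest altitude deviation (in feet) into a 1-5 score.

    Only the 50 ft band for a 5 and the 200 ft limit for a 1 come from the
    proposed standard; the 100 ft and 150 ft bands are hypothetical values
    used here for illustration.
    """
    deviation = abs(max_deviation_ft)
    if deviation <= 50:
        return 5
    if deviation <= 100:      # hypothetical band
        return 4
    if deviation <= 150:      # hypothetical band
        return 3
    if deviation <= 200:
        return 2
    return 1                  # violated the +/-200 ft completion standard

print(altitude_score(35))    # 5
print(altitude_score(180))   # 2
print(altitude_score(-250))  # 1
```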
The other standards, heading, bank angle and airspeed, could also be redefined to
more precisely indicate how skilled the student is, rather than leaving a broad range that is open
to each instructor's interpretation. Requiring
the instructor pilots to be retrained in these newer, more precisely defined standards
would help to improve inter-rater reliability. Fine-tuning these standards may require
further research.
Technical Improvements
Beyond changes at the flight school, there could also be improvements made to how the experiment is executed on a technical level. This project
was the researcher’s first attempt to record video footage from a PCATD and then
transfer that footage to DVD. While the footage was usable, the quality could be
improved by recording the footage directly from the PCATD rather than through several intervening
media. The footage had to travel through a few steps of media: from the PCATD to the
projector, to the screen, to the video camera, to the iMac, to the iMovie HD application,
to the iDVD application, to actual DVDs. The transfer from camera to the digital movie
applications iMovie HD and iDVD is not problematic because there is no noticeable
degradation of footage from one digital source to another. Thus, removing the projector,
movie screen, and video camera from the middle, would likely produce higher quality
images, making the footage easier to watch clearly. Since the raters all watched the same
footage, the footage quality does not affect the inter-rater reliability. It would only affect
inter-rater reliability if some raters watched one set of footage and other raters watched another.
A future researcher would also benefit from
learning how to use all of the features of the iMovie HD and iDVD applications to their
fullest extent. There are other high-end software applications for video editing such as
Final Cut that should also be considered, provided the future researcher has the budget for them.
Further Research
This study also provides an opportunity to discuss the future for which this project is the foundation. As stated in
Chapter One, Introduction, this project was a foundational study, meant to lay the
groundwork and establish a method to study inter-rater reliability at flight schools that
can be used at any flight school that has the resources to carry out the experiment.
A future researcher could increase the number of sample flights, the number of
raters, or both. This researcher would also encourage a future researcher to test other
means of measuring inter-rater reliability. Chapter Two, Literature Review, cited studies
which used alpha and rho. In the interest of finding the best analytical method, alpha,
rho, and other coefficients should be tested along with the increase in samples and raters
suggested above. A future researcher could also
begin testing particular maneuvers such as shallow, medium and steep turns, ascending
and descending turns, or constant airspeed climbs. These are just examples, and a future
researcher could experiment with particular maneuvers rather than entire patterns. At the
same time, one could also consider choosing from a catalog of other instrument patterns.
These recommendations do not cast doubt on the methodology of this
study. Adding more raters might lead to more agreement, but it might also lead to more
disagreement. Likewise, adding more sample flights may or may not cause lesser or
greater reliability. What must be avoided at all costs is designing a study that is
structured to create agreement. Testing particular maneuvers rather than patterns is not
necessarily better because doing maneuvers is just one part of flight training and the goal
of flight training is not to make a pilot proficient at doing maneuvers, but to make a pilot
have such a depth of understanding and technical ability that he or she can take the
maneuvers learned through the years of training and spontaneously serialize or combine
discrete maneuvers into an organic flight that has unity from takeoff to landing. So
testing only maneuvers versus testing patterns or testing spontaneous flights is not
necessarily better. However, more samples, raters, other patterns, and other statistical
methods all deserve to be tested for the sake of expanding our body of knowledge and for
perfecting a method that one day could become “tried and true.” In short, researchers
must trust in the scientific method to continually develop better means of testing and measurement.
Upon doing further research, fine-tuning the standards and processing instructors
through updated training, one may find that the method can be adapted for commercial
use.
Commercial Application
Upon testing and re-testing this experiment such that the results can be replicated
and are consistent, and the method deemed valid by a panel of experts in related fields,
this study can be developed into an instructor training program that may be created for
commercial distribution. Such a
program could begin by evaluating maneuvers and testing reliability. Upon reaching a
kappa of .70 or greater, the instructor trainee can move on to the next phase learning how
to evaluate simple patterns, and then moving onto learning how to evaluate complex
patterns, and finally how to reliably rate IFR check rides. The training need not happen
only using a PCATD. The method and training system must be such that as the training
progresses, the footage from the PCATD is replaced by footage from a full simulator, and
the full simulator eventually replaced by footage from an actual aircraft, because the
instructors and their students will experience training in all three media.
To build such a training program, the developer
must establish baseline flights, just as Penny, et al. (2000) established benchmark essays for
scoring writing samples. For example, a future researcher may find that a particular
flight has been viewed by raters and they have consensus that the flight is a 3. A
researcher for a commercial developer or flight school must build up a catalog of baseline
flights that have all been tested and create a test in which the established baseline scores
are entered into the contingency table as Rater 1, while the rater currently being tested
becomes Rater 2. Thus, the future researcher or tester will place the New Rater versus
the baseline scores. A kappa of .70 or greater shows that the New Rater can score flights
reliably, while a kappa less than .70 will indicate that the New Rater needs further
instruction before being allowed to rate actual flights. The result may be that flight
schools can effectively and economically screen potential flight instructors or maintain the
reliability of the instructors they already employ; a brief sketch of such a screening check follows.
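The sketch below illustrates how the screening step just described might be scored. It assumes the baseline catalog and the candidate's score sheet are simple lists of 1-to-5 scores for the same flights; the function names, the flight scores, and the lists themselves are hypothetical, and only the kappa formula and the .70 criterion come from this study.

```python
def cohens_kappa(baseline, candidate, levels=range(1, 6)):
    """Cohen's kappa between baseline scores and a candidate rater's scores."""
    n = len(baseline)
    agreements = sum(1 for b, c in zip(baseline, candidate) if b == c)
    # Expected chance agreement from the two raters' marginal score counts.
    expected = sum(
        (sum(1 for b in baseline if b == s) * sum(1 for c in candidate if c == s)) / n
        for s in levels
    )
    return (agreements - expected) / (n - expected)

def screen_new_rater(baseline, candidate, threshold=0.70):
    """Compare a candidate rater's scores against the baseline catalog."""
    k = cohens_kappa(baseline, candidate)
    verdict = "may rate actual flights" if k >= threshold else "needs further instruction"
    return k, verdict

# Hypothetical baseline catalog and a hypothetical candidate's scores (flights A-J).
baseline = [3, 5, 1, 2, 4, 3, 2, 4, 1, 5]
candidate = [3, 5, 1, 2, 4, 4, 2, 4, 1, 5]
kappa, verdict = screen_new_rater(baseline, candidate)
print(f"kappa = {kappa:.2f}: the candidate {verdict}")
```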
Summary
“The search for valid, reliable, feasible, and fair assessments of cognitive and
human performance is, in many ways, at the very heart of educational measurement”
(Penny, et al., 2000). In a very real way, instructor pilots are educators, and their assessments
of student pilots are educational measurements. This study set out
to find research in scientific and educational journals that would help to lay the
foundation of inter-rater reliability studies in flight training. To that end, four flight
school instructors (raters) were tested according to the methodology inspired by the
literature reviewed and statistical analysis based upon Cohen's kappa coefficient. This
coefficient is used widely in
education, social science, psychology, medicine and even sports. It is used quite often in
training situations. In this study, kappa was applied to flight training, specifically testing
instructor pilots for inter-rater reliability. Ultimately, the study indicated that the inter-
rater reliability was low, having an average kappa of .38, well below the desired .70.
Nevertheless, this study was successful in that it showed a usable method for testing
inter-rater reliability in flight training and provides the basis for further research and
commercial development.
REFERENCES
Bamford, J.T.M., Gessert, C.E., & Renier, C.M. (2004). Measurement of the severity of
rosacea. [Electronic version]. Journal of the American Academy of Dermatology,
51(5), 697-703.
Bann, S., Davis, I.M., Moorthy, K., Munz, Y., Hernandez, J., Khan, M., Datta, V., &
Darzi, A. (2005). The Reliability of multiple objective measures of surgery and
the role of human performance. [Electronic version]. The American Journal of
Surgery, 189, 747-752.
Bell, V., Halligan P.W., & Ellis, H.D. (2006). Diagnosing Delusions: A review of inter-
rater reliability. [Electronic version]. Schizophrenia Research, 86, 76-79.
Dionne, C.P., Bybee, R.F., & Tomaka, J. (2006). Inter-rater reliability of McKenzie
assessment in patients with neck pain. [Electronic version]. Physiotherapy, 92,
75-82.
Drake, R., Haddock, G., Terrier, N., Bentall, R., & Lewis, S. (2007). The Psychotic
Symptom Rating Scales (PSYRATS): Their usefulness and properties in first
episode psychosis. [Electronic version]. Schizophrenia Research, 89, 119-122.
Ferri, R., Bruni, O., Miano, S., Smerieri, A., Spruyt, K., & Terzano, M. (2005). Inter-
rater reliability of sleep cyclic alternating pattern (CAP) scoring and validation of
a new computer-assisted CAP scoring method. [Electronic version]. Clinical
Neurophysiology, 116, 696-707.
Goodwin, L.D. & Goodwin, W.L. (1985). An Analysis of Statistical Techniques Used in
the Journal of Educational Psychology, 1979-1983. [Electronic version].
Educational Psychologist, 20(1), 13-21.
Gwet, K. (2002a) Kappa statistic is not satisfactory for assessing the extent of agreement
between raters. Retrieved December 15, 2006, from
http://www.stataxis.com/files/articles/kappa_statistic_is_not_satisfactory.pdf.
Gwet, K. (2002b) Cohen’s Kappa. Retrieved December 15, 2006, from http://www-
class.unl.edu/psycrs/handcomp/hckappa.pdf.
Holey, L.A., & Watson, M.J. (1995) Inter-rater reliability of connective tissue zones
recognition. [Electronic version]. Physiotherapy, 61(7), 369-372.
Hulsman, R.L., Mollema, E.D., Oort, F.J., Hoos, A.M., & de Haes, J.C.J.M. (2006) Using
standardized video cases for assessment of medical communication skills:
Reliability of an objective structured video examination by computer. [Electronic
version]. Patient Education and Counseling, 60, 24-31.
Joo, E.-J., Joo, Y.-H., Hong, J.-P., Hwang, S., Maeng, S.-J., Han J.-H., Yang, B.-H., Lee,
Y.-S., & Kim, Y.-S. (2004). Korean Version of the Diagnostic Interview for
Genetic Studies: Validity and Reliability. [Electronic version]. Comprehensive
Psychiatry, 45(3), 225-229.
Kadri, N., Agoub, M., El Gnaoui, S., Mchichi Alami, Kh., Hergueta, T., & Moussaoui, D.
(2005). Moroccan colloquial Arabic version of the Mini International
Neuropsychiatric Interview (MINI): qualitative and quantitative validation.
[Electronic Version]. European Psychiatry, 20, 193-195.
Kaneda, Y., Ohmoria, T., & Fujii, A. (2001). The serotonin syndrome: investigation
using the Japanese version of the Serotonin Syndrome Scale. [Electronic version].
Psychiatry Research, 105, 135-142.
Kirshner, W.K. (1990) The Pilot’s Manual: Instrument Flying (4th ed.). Ames, IA: Iowa
State Press
Kolaitas, J., Korpa, T., Kolvin, I., & Tsiantis, J. (2003). Letter to the Editor. [Electronic
version]. European Psychiatry, 18, 374-375.
Kolt, G.S., Brewer, B.W., Pizzari, T., Schoo, A.M.M., & Garrett, N. (2006). The Sport
Injury Rehabilitation Adherence Scale: a reliable scale for use in clinical
physiotherapy. [Electronic version]. Physiotherapy 93(1), 17-22.
Lee, H.K. (2004). A comparative study of ESL writers’ performance in a paper-based and
a computer-delivered writing test. [Electronic version]. Assessing Writing, 9, 4-
26.
Leung, T.K.S. & Tsang H.W.H. (2006). Chinese version of the Assessment of
Interpersonal Problem Solving Skills. [Electronic version]. Psychiatry Research
143, 189-197.
Lindeman, B., Libkuman, T., King, D., & Kruse B. (2000). Development of an
Instrument to Assess Jump-Shooting Form in Basketball. [Electronic version].
Journal of Sports Behavior. 23(4), 335-348.
Meyers, B.S., English, J., Gabriele, M., Peasley-Miklus, C., Heo, M., Flint, A.J., Mulsant,
B.H., & Rothschild, A.J. (2006). A Delusion Assessment Scale for Psychotic
Major Depression: Reliability, Validity, and Utility. Biological Psychiatry, 60,
1336-1342.
Monroe-Blum, H., Collins, E., McCleary, L., & Nuttall, S. (1996). The social dysfunction
index (SDI) for patients with schizophrenia and related disorders. [Electronic
version]. Schizophrenia Research. 20, 211-219.
Papavasilou, A.S., Rapidi, C.A., Rizou, C., Petrapoulou, K., & Tzavara, Ch. (2006).
Reliability of Greek version Gross Motor Function Classification System.
[Electronic version]. Brain & Development, 29, 79-82.
Penny, J., Johnson, R.L., & Gordon, B. (2000) The effect of rating augmentation on inter-
rater reliability: an empirical study of a holistic rubric. [Electronic version].
Assessing Writing, 7,143-164.
Raymont, V., Buchanan, A., David, A.S., Hayward, P., Wessley, S., & Hotopf, M.
(2006). The inter-rater reliability of mental capacity assessments. [Electronic
version]. Law and Psychiatry, 30, 112-117
Rittenberger, J.C., Martin, J.R., Kelly, L.J., Roth, R.N., Hostler, D., & Callaway, C.W.
(2006). Inter-rater reliability for witnessed collapse and presence of bystander
CPR. [Electronic version]. Resuscitation, 70, 410-415.
Schmidt, N.B., Salas, D., Bernert, R., & Schatschneider, C. (2005). Diagnosing
agoraphobia in the context of panic disorder: examining the effect of the DSM-IV
criteria on diagnostic decision-making. [Electronic version]. Behavior Research
and Therapy, 43, 1219-1229.
Thuile, J., Even, C., Friedman, S., & Guelfi, J.-D. (2005). Inter-rater reliability of the
French version of the core index for melancholia. [Electronic version]. Journal
of Affective Disorders, 88, 193-208.
Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd ed.). Mason, OH:
Thomson
Tural, U., Fidaner, H., Alkin, T. & Bandelow, B. (2002). Assessing the severity of panic
disorder and agoraphobia: Validity, reliability and objectivity of the Turkish
translation of the Panic and Agoraphobia Scale (P & A). [Electronic version].
Journal of Anxiety Disorders, 16, 331-340.
Worster, A., Sardo, A.A., Fernandes C.M.B., Eva, K., & Upadhy, S. (2007). Triage tool
inter-rater reliability: a comparison of live versus paper case scenarios.
[Electronic version]. Journal of Emergency Nursing, 33(4), 319-323.
APPENDIX A
SCORING RUBRIC
APPENDIX B
Brief
Thank you for participating in this inter-rater reliability study. You are not being
tested. Your upcoming flight will be scored by instructors for research purposes only.
Your performance here today will not have any effect on your scores in school. Your
name is not being recorded. Even I, the researcher, am not keeping a record of your identity.
During this flight you will be asked to fly Pattern D from The Pilot’s Manual:
Instrument Flying. Whether you have a passing or thorough knowledge of this flight
pattern is not important. I will talk you through the flight, if necessary. I will not keep
track of the time for you. I will, however, give you ample time before the next maneuver.
Remember, you are not the one being tested. This flight is being used to test your
instructors. Even though your performance is not being tested, I ask that you still try
your best just as you would in a real plane with a real instructor pilot.
Instructions
This flight will begin with you already airborne. You are flying at 6000 feet,
straight and level, heading 360, at 130 knots cruising speed. The flight will end with you back in straight and level flight.
1. Begin…now. Keep the aircraft straight and level for one minute.
3. When you come to heading 315, fly straight and level for one minute.
5. When you reach heading 135, fly straight and level for 30 seconds.
7. When you reach heading 180, fly straight and level for 2 minutes.
11. When you reach heading 360, fly straight and level for 2 minutes.
14. Turn right 180 degrees to heading 360. Fly straight and level for 2 minutes.
Debrief
Thank you for flying this pattern. Your flight is one of many that will be used to
help us test the reliability of the instructor pilots. Although a recording of your flight has
been made, no information about you has been kept, and thus no information about you can be
connected to the recording.
APPENDIX C
INSTRUCTIONS TO RATERS
Brief
Thank you for being kind enough to participate in this study. We are soon going
to watch DVDs containing 10 sample flights. Before we watch these flights, I must lay out a few ground rules:
2. Score each flight at the end of the flight. Do not wait until all flights are over to record your scores.
3. You have been given a copy of Pattern D and the scoring rubric, which you may
refer to throughout this process. On the score sheet, there is also a brief summary of the completion standards.
4. You must not communicate with each other while watching the flights. This includes talking, gesturing, or using any other means of communication.
5. We will take short breaks after every two videos, and a long break at the end of the first DVD while I switch to the second DVD.
6. You may talk during the break times, but you must refrain from talking about the flights.
7. At the end of the viewing, after I have collected your score sheets, we may then
discuss any flights. You will not have the ability to change your scores.
APPENDIX D
SCORING SHEET
Note: Standards are taken directly from a lesson pertinent to Pattern D. The grading scale runs from 1 to 5.
Flight:   A   B   C   D   E   F   G   H   I   J