
INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:

A FOUNDATIONAL STUDY

By

Matthew Vail Smith

An Applied Project Presented in Partial Fulfillment


of the Requirements for the Degree
Master of Science in Technology

ARIZONA STATE UNIVERSITY

December 2007
© 2007 Matthew Vail Smith
All Rights Reserved
INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:

A FOUNDATIONAL STUDY

by

Matthew Vail Smith

has been approved

December 2007

Graduate Supervisory Committee:

Mary Niemczyk, Chair


William McCurry

ACCEPTED BY THE GRADUATE COLLEGE


ACKNOWLEDGMENTS

I would like to acknowledge the help of several people:

Dr. Joel Hutchinson, who helped me to overcome the mental blocks I struggled with.

Lisa Cahill, ASU Polytechnic Writing Center, for her constructive criticism and helpful

suggestions.

Professors Merrill Karp and Jim Anderson for introducing me to the PCATD and

explaining the possibilities it offered.

Greg and David, the Lab Assistants who taught me how to use the PCATD.

The student volunteers who flew the sample flights.

The four flight instructors who took time out of their busy schedules to watch three hours

of footage.

Committee member Dr. William McCurry for his guidance and suggestions.

And very special thanks to my committee chair, Dr. Mary Niemczyk, without whose

unwavering faith and support, I could never have accomplished this project and

graduated.

TABLE OF CONTENTS

Page

LIST OF TABLES............................................................................................................ vii

LIST OF FIGURES ......................................................................................................... viii

CHAPTER

1 INTRODUCTION .......................................................................................1

Statement of Purpose .......................................................................3

Scope................................................................................................3

Assumptions.....................................................................................4

Limitations .......................................................................................4

Equipment Used...............................................................................5

Chapter Summary ............................................................................5

2 LITERATURE REVIEW ............................................................................7

Background ......................................................................................7

Cohen’s Kappa.................................................................................9

Inter-rater reliability in Sports .......................................................12

Inter-rater reliability in Psychology ...............................................13

Inter-rater reliability in Health Care ..............................................14

Inter-rater reliability in Education .................................................17

Chapter Summary ..........................................................................19

3 METHOD ..................................................................................................20

Flight Pattern..................................................................................20

Pilot Participants ............................................................................20


Rater Participants ...........................................................................21

Scoring Rubric ...............................................................................23

Flying the Pattern...........................................................................23

Experiment Execution....................................................................24

Chapter Summary ..........................................................................26

4 RESULTS ..................................................................................................27

Raw Scores.....................................................................................27

Contingency Tables .......................................................................28

Summary of Results.......................................................................32

5 CONCLUSIONS AND RECOMMENDATIONS ....................................34

Recurrent Training .........................................................................34

Scoring Rubric Improvements .......................................................35

Technical Improvements................................................................36

Recommendations for Further Research........................................37

Commercial Application of this Study ..........................................39

Summary ........................................................................................40

REFERENCES ..................................................................................................................41

APPENDIX

A Scoring Rubric ...........................................................................................45

B Briefing and Script.....................................................................................47

C Instructor Instructions ................................................................................50

D Score Sheet.................................................................................................52

LIST OF TABLES

Table Page

1. Example of the inter-rater reliability Contingency Table

used in this Experiment........................................................................25

2. Raters Raw Scores ...................................................................................................27

3. Rater 1 versus Rater 2 ..............................................................................................29

4. Rater 1 versus Rater 3 ..............................................................................................29

5. Rater 1 versus Rater 4 ..............................................................................................30

6. Rater 2 versus Rater 3 ..............................................................................................30

7. Rater 2 versus Rater 4 ..............................................................................................31

8. Rater 3 versus Rater 4 ..............................................................................................31

9. Summary of Results.................................................................................................32

LIST OF FIGURES

Figure Page

1. Contingency Table Highlighting Agreement Cells..................................................10

2. Contingency Table with Chance-Corrected Agreement ..........................................11

3. Pattern D ..................................................................................................................22

CHAPTER 1

INTRODUCTION

Several educational institutions exist to train students to become professional

pilots. As part of the regular curriculum, students must attend ground school and engage

in the required number of flight training hours. Ground school and written exams issued

by the Federal Aviation Administration (FAA) are standardized, as are the required flight syllabi. However, training from school to school is not identical, even though fully

compliant with FAA regulations. Even in a flight school that has very exacting

standards, training may be different under different instructors for any number of reasons

such as the instructors’ abilities and interests. Some pilots dislike instructing and only do

it to build hours and to put experience on a resume. Others do it because they enjoy

sharing their love of flying with others. All instructors, regardless of their personal characteristics, must do one thing: evaluate student performance. And yet, because of

their personal characteristics, instructors perceive student performances differently from

one another. The reasons for differences in instructor perception of student performance

can be systematic or arbitrary, conscious or subconscious, innocuous or malicious; one

simply cannot catalog another’s motives, but one can see the result of the instructors’

perceptions: difference.

When scoring a student pilot, there is the student pilot’s performance, which is

objective, and the instructor pilot’s perception of that performance, which is subjective.

In the best of circumstances, the performance and the recorded perception of that

performance share a high degree of similarity. That is, the instructor ought always to

record a score that accurately and precisely reflects the student’s performance. However,

this is not always the case. Some perceptions of performance are too forgiving, while

others are overly critical. In other words, the same student pilot can receive a passing

score from an overly forgiving instructor and a failing score from an overly critical

instructor for an identical or near-identical performance, leaving the student confused or

frustrated. There is a problem ensuring that all student pilots receive standardized scores

that reflect the student pilot’s performance with a high degree of reliability.

Students, as well as the other stakeholders of flight schools, must be sure that the

scoring system is such that the scores are a meaningful indicator of the student’s

performance rather than an arbitrary indicator of the instructor’s perception.

Furthermore, the scores should be consistent from one instructor to another. This

problem can be examined with an inter-rater reliability study. Inter-rater reliability is

“used to assess the degree to which different raters/observers give consistent estimates of

the same phenomenon” (Trochim, 2001, p.96). This investigation, then, seeks to offer

any flight school a method to determine the inter-rater reliability of its instructor pilots.

Chapter 1 introduces the problem and sets the parameters of this investigation.

Chapter 2, the literature review, examines pertinent inter-rater reliability literature dealing

with both the statistical theory and application of inter-rater reliability. The literature

reviewed in this study does not come from aviation sources because, after an exhaustive

search of reputable science journal databases, the researcher could not find aviation inter-

rater reliability studies. Instead, the literature reviewed comes from other fields such as

sports, psychology, health care and education, where inter-rater reliability studies are

used extensively. Many lessons learned from these fields may be applied to aviation,

especially in the sub-fields of aviation human factors and flight training/pilot education.

Chapter 3 discusses the methodology used to plan, design and execute the project and to

analyze the data. Chapter 4 examines the results. Chapter 5 discusses two possible ways

to improve inter-rater reliability at the flight school, suggests technical improvements

while executing the project, offers a commercial application, suggestions for further

research and summarizes the project.

Statement of Purpose

The purpose of this investigation, then, is to determine the reliability of rating

student pilot performance between instructor pilots. In order to accomplish this task, this

investigation:

• defines inter-rater reliability and discusses its application to pilot training;

• reviews literature regarding inter-rater reliability;

• describes the method (experiment) that was used to assess inter-rater

reliability;

• analyzes the data as collected from the performed experiment using

Cohen’s kappa coefficient;

• discusses the results;

• makes recommendations for corrective action;

• suggests a commercial application for this research; and

• suggests areas for further research.

Scope

The scope of this investigation is a foundational study in which the rating

performances of a cross-section of instructors are analyzed to determine inter-rater



reliability. Four instructor pilots were asked to watch the flight performances of ten

students flying the same instrument flight pattern as recorded on a DVD. The testing of

the raters took place throughout the course of a single afternoon in a controlled

environment, under the supervision of the researcher.

Assumptions

This investigation assumes that there may be a difference between the raters in

terms of their evaluation of student performance that is worth examining and that the

traditional methods for determining inter-rater reliability, such as the kappa coefficient,

are sound. Furthermore, it assumes that the principles of inter-rater reliability are

transferable from one field to another.

Limitations

This investigation has a few limitations. First, this study does not—indeed,

cannot—presume to act as a predictive model. It measures what exists now, but cannot

definitively state that raters will evaluate in this way or that. This study does not consider

questions of gender, racial or other forms of favoritism or bias because bias is an error

that causes a rater to be unreliable. This study does not seek to answer why the raters are

reliable or not, but only to establish a repeatable method for determining inter-rater

reliability. Therefore, this study does not claim to be exhaustive. It is a foundational

study that seeks only to show that inter-rater reliability studies can be adapted from other

fields and made useful for aviation research, and it uses the instructors of the flight school

as test subjects.

It cannot be over-emphasized that this study investigates neither the student pilots

nor their performance. The student pilots and their performance are only means to the

end of examining inter-rater reliability. Whether a student pilot is a good pilot or a poor

pilot is entirely moot. This study investigates how reliably the raters rate the flight

performances, not the flight performances or the students who flew them.

Finally, there were budgetary limitations. This study was funded entirely by the

researcher. Much of the equipment used, as listed below, belonged to the flight school.

However, the researcher paid for the video camera, accessories and the computer used to

transfer the footage to DVD from personal funds.

Equipment Used

The following equipment was used to complete this project:

• Elite brand Personal Computer-based Aviation Training Device (PCATD);

• a computer projector and a movie screen;

• a Sony DCR-HC28 video camera, used to record the flight instruments

(computer simulated instrument panel); the camera was equipped with a

FireWire output in order to transfer the recorded footage to the hard

drive of a computer;

• an iMac personal computer with iMovie HD and iDVD, used to organize

the recorded footage and create DVDs for the raters (instructor pilots) to

view; and

• a PC, projector and movie screen for showing the DVDs to the raters.

Chapter Summary

In order to ensure that students are scored fairly and consistently, flight schools

must consider the inter-rater reliability of their instructor pilots. This study describes the

method for testing inter-rater reliability of flight school instructors that the researcher

developed and discusses the research on which this method is based.



CHAPTER 2

LITERATURE REVIEW

This chapter is a review of literature related to inter-rater reliability. The chapter

begins by establishing the background of inter-rater reliability: explaining what inter-rater

reliability is and discussing a coefficient used to measure inter-rater reliability. The

coefficient discussed, kappa, is the one used to analyze the data in this study. The rest of

the chapter focuses on how, in the absence of inter-rater reliability studies in aviation,

inter-rater reliability studies have been used in other fields, such as sports, psychology,

health care and education.

Background

Inter-rater reliability measures the extent of agreement between two or more

individual raters. Inter-rater reliability is used to measure the consistency of a scoring or

rating system and of those who apply it (DeVellis, 2005; Trochim, 2001). Since this study

seeks to establish the inter-rater reliability of instructor pilots, it is helpful to have some

background on inter-rater reliability and how it has been used.

In his 2005 entry into the Encyclopedia of Social Measurement, Robert F.

DeVellis managed to pack extensive information into a few short pages. DeVellis reports

that there are two influences at work in the process of measuring scores: “(1) the true

score of the object, person, event, or other phenomenon being measured, and (2) error

(i.e. everything other than the true score of the phenomenon of interest)” (p. 315). In

Chapter One, Introduction, true score was referred to as objective performance. Error can

be influenced by the instructor’s perception. Or, rather, the instructor’s perception is

susceptible to error, thus the disconnect between the true score (objective performance)

and the recorded score (instructor’s perception). Error is simply a phenomenon to be

dealt with through statistical processes and analysis. This investigation seeks to measure

rater error. It does not study what errors are, why errors exist, or the moral implications

of error.

The purpose of the kappa statistic is to account for and eliminate agreement by

chance—chance being a type of error—so that the researcher can get a clearer idea of

how much agreement there really is between raters. The coefficient, then, distinguishes

between purposeful agreement and accidental agreement. In a reliability formula, the

variability attributable to the true score forms the numerator, while the total variability of the obtained score (true score plus error) forms the denominator. Thus, whatever reliability coefficient is used, it is the “ratio of variability ascribable to the true score relative to the total variability of the obtained score” (DeVellis, 2005). Or, in the terms chosen for this investigation, it is the ratio of the pilot’s objective performance to the instructors’ recorded perception of that

performance. In this study, it is assumed that any disconnect in the relationship between

the pilot’s performance (true score) and the instructors’ recorded perception (obtained

score) is due to the raters, not the pilot.
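Expressed symbolically (this notation is added here for clarity and does not appear in DeVellis's entry):

    reliability = true-score variability ÷ (true-score variability + error variability)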

The way to find this coefficient, then, is to measure rater against rater rather than

pilot against rater. Each rater observed the exact same flight performances. Therefore,

the raters ought to record identical scores. In practice they may or may not. This is why

one performs an inter-rater reliability study, to discover these discrepancies between true

score and obtained score, should discrepancy (error) exist.



Cohen’s Kappa

In the late 1950s and throughout the 1960s, Jacob Cohen conducted seminal

research focusing on inter-rater reliability. Cohen proposed a coefficient represented by

the Greek letter kappa (κ), as the standard coefficient for inter-rater reliability, with κ≥.70

being considered reliable. This is not merely a 70% agreement, because agreement can

happen by chance. Instead, kappa accounts for the expected frequency of ratings, thus eliminating mere chance agreement (Cohen, 1960; Gwet, 2002b).

Cohen’s original article, A coefficient of agreement for nominal scales, which appeared in Educational and Psychological Measurement, explains the kappa coefficient and raises three points that are foundational to inter-rater reliability:

1. The units are independent.

2. The categories of the nominal scale are independent, mutually exclusive, and exhaustive.

3. The judges operate independently. (Cohen, 1960, p. 37)

Dr. Kilem Gwet’s paper explaining Cohen’s kappa gave additional information not presented in Cohen’s article, such as explaining how to use Cohen’s kappa step by step. Gwet’s work gave much inspiration to this investigation, and the methodology he

describes has been adapted for use in this project. What follows is a brief paraphrasing of

the methodology provided by Cohen, as explained by Gwet (2002b).

Two raters observe three species of turtles. They are told to identify the species to

which each turtle belongs (Y, R, or C). Thirty-six turtles are observed, and the raters tally their judgments in a three-by-three table (three, because there are three species: Y, R, and C).

If Rater 1 claims “Y” and Rater 2 claims “R,” then the tally goes in the box that corresponds with Y/R: first column, second row. If both raters claim “R,” then the tally goes into the R/R box in the middle of the table: second column, second row. And so on. The row and column tallies were then totaled in order to ensure that the correct number of observations, 36, was recorded. The total number of agreements is calculated “by summing the values of the diagonal cells of the table”: Σa = 9 + 8 + 6 = 23 (Gwet, 2002b).

Figure 1 shows Gwet’s contingency table. The cells showing agreement (Y/Y, R/R and

C/C) are shaded.

Figure 1 Contingency Table Highlighting Agreement Cells (Gwet, 2002b)

                          Rater 1
                   Y      R      C      Row totals
  Rater 2    Y     9*     3      1         13
             R     4      8*     2         14
             C     2      1      6*         9
  Column totals:   15     12     9      Σa = 23

  * Agreement cells (Y/Y, R/R, C/C), shaded in the original figure.

Out of the thirty-six turtles observed, the raters agreed on 23 decisions, thus

making the agreement level 64%. That is not good enough because some of the

agreements may have been mere chance agreements.

In order to account for chance agreement, one must compute the expected

frequency (ef) by dividing the product of the row and column totals by the number of

samples (N). Figure 2 shows that, when each agreement cell is replaced by the product of its row and column totals divided by N, Σa = 9 + 8 + 6 = 23 becomes Σef = 5.42 + 4.67 + 2.25 = 12.34. This is the expected frequency of agreement by chance.
Figure 2 Contingency Table with Chance-Corrected Agreement (Gwet, 2002b)

                          Rater 1
                   Y           R           C           Row totals
  Rater 2    Y     9 (5.42)    3           1              13
             R     4           8 (4.67)    2              14
             C     2           1           6 (2.25)        9
  Column totals:   15          12          9           Σa = 23 (Σef = 12.34)

To find kappa, then, one divides the difference of Σa minus Σef by the difference

of N (number of samples) minus Σef (sum of expected frequency). That is:

κ = (Σa - Σef) / (N - Σef) = (23 - 12.34) / (36 - 12.34) = .45

Kappa is evaluated next. As was stated above, a kappa of .70 or greater is

considered satisfactory; less than .70 is not. This example has a kappa of .45, denoting

rather weak inter-rater reliability.
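To make this arithmetic concrete, the short Python sketch below (added for illustration; it is not part of Cohen's or Gwet's work) computes kappa from a square contingency table and reproduces the .45 result of the turtle example.

    def cohens_kappa(table):
        # table[i][j] = number of samples that Rater 2 placed in category i
        # and Rater 1 placed in category j (rows = Rater 2, columns = Rater 1).
        n = sum(sum(row) for row in table)                    # total observations (N)
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        sum_a = sum(table[i][i] for i in range(len(table)))   # observed agreements (Σa)
        sum_ef = sum(row_totals[i] * col_totals[i] / n        # chance agreements (Σef)
                     for i in range(len(table)))
        return (sum_a - sum_ef) / (n - sum_ef)

    # Gwet's turtle data from Figure 1.
    turtles = [[9, 3, 1],
               [4, 8, 2],
               [2, 1, 6]]
    print(round(cohens_kappa(turtles), 2))  # prints 0.45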

In this case, Gwet’s recommendation was to retrain the raters to recognize the

species better. Specifically, the raters had trouble with two species in particular; thus, Gwet recommended that the raters “focus on correctly discriminating between these two types

of turtles, in order to improve inter-rater reliability” (Gwet, 2002b).

Gwet’s explanation of Cohen’s kappa showed two raters with thirty-six samples

of three species. The current inter-rater reliability study has four raters judging ten flight

performance samples on a scale of 1 to 5. Chapter 3, Method, will discuss the application

of Cohen’s kappa to this project.

Remarkably enough, Gwet’s article explaining Cohen’s kappa coefficient was

later followed by a second article on why kappa is insufficient (Gwet, 2002a). However

interesting Gwet’s argument is regarding kappa’s insufficiency and his alternative



coefficient’s merits, the researcher did not find Gwet’s alternative method in literature

other than his own, whereas the researcher found Cohen’s kappa coefficient used

extensively. Therefore, Gwet’s criticism of kappa is mentioned here only to make the

reader aware that there are other means (other coefficients) of determining inter-rater

reliability. This study uses Cohen’s kappa, since it is widely accepted, while Gwet’s new

coefficient is not.

Inter-rater reliability in Sports

Flying and sports are related activities in that they are both simultaneously

physical and mental, or psychomotor, to denote the inseparability between the physical

and mental aspects. Being physical acts, they can be measured. And being measurable,

they can be used in an inter-rater reliability study.

One such study, Development of an Instrument to Assess Jump-Shooting Form in

Basketball (Lindeman, Libkuman, King, & Kruse, 2000), examined the physical form

and movements of a jump shot. Basketball coaches have written books that discuss what

proper shooting form is, and the study used that information to create an instrument for

assessing jump-shots. Four raters then viewed video tapes of 32 shooters and rated the

shooters’ form and movement according to the instrument developed. The conclusion

was that the instrument may help discern a correlation between the shooter’s form and the

shooter’s success rate.

The jump shot study shows the validity of an inter-rater reliability study when

observing psychomotor activity. By analogy, then, an inter-rater reliability study is likely

valid when observing flight performances, because it, too, observes psychomotor activity.

Inter-rater reliability in Psychology

Inter-rater reliability studies are often used in psychology to determine if scales

and other methods of measuring patient behavior are reliable means of assessment. These

studies have been used to assess rating scales and assessment methods related to sleep

disorders (Ferri, Bruni, Miano, Smerieri, Spruyt & Terzano, 2005), mental capacity

(Raymont, Buchanan, David, Hayward, Wessley & Hotopf, 2006), agoraphobia

(Schmidt, Salas, Bernert & Schatschneider, 2005), delusions (Bell, Halligan & Ellis,

2006 and Meyers, English, Gabriele, Peasley-Milkus, Heo, Flint, et al., 2006), social

dysfunction in schizophrenia and related illnesses (Monroe-Blum, Collins, McCleary, &

Nuttall, 1996), and other means of rating psychological disorders (Drake, Haddock,

Terrier, Bentall & Lewis, 2007).

Using inter-rater reliability studies to validate psychological testing is not limited

to the United States. It is used in China (Leung & Tsang, 2006), Korea (Joo, Joo, Hong,

Hwang, Maeng, Han, et al., 2004), Japan (Kaneda, Ohmoria & Fujii, 2001), in the Arabic

language (Kadri, Agoub, El Gnaoui, Mchichi Alami, Hergueta & Moussaoui, 2005),

Turkey (Tural, Fidaner, Alkin & Bandelow, 2002), Greece (Papavasiliou, Rapidi, Rizou,

Petrapoulou & Tzavara, 2007 and Kolaitas, Korpa, Kolvin & Tsiantis, 2003), and France

(Thuile, Even, Friedman & Guelfi, 2005). In all of these articles, scales or other methods

of assessment were tested, or foreign language translations of English language scales

and methods of assessment were tested and validated using inter-rater reliability studies.

It seems, then, that inter-rater reliability studies serve a very useful purpose in

determining the validity of scoring or rating rubrics. Thus, one can surmise that an inter-

rater reliability study may be very useful to a flight school that needs to measure the

reliability of its raters and scoring rubrics.

Inter-rater reliability in Health Care

Training health care practitioners also has parallels to training pilots. Both health

care practice and the practice of flying require mental aptitude and the physical

skills to carry out their mentally-driven tasks. This fact is true for the entire gamut of

health care practitioners from nurses to surgeons and the gamut of pilots from the simple

sport (ultra-light) pilot to captains of 747s. All of the individuals in these vast and

diverse groups require a level of mental and physical harmony that demands high-level

training. This training regimen is ready-made for inter-rater reliability studies.

Research regarding nursing in triage units suggested that “live” experiments may be more reliable than paper-based exercises. Triage tool inter-rater reliability: a comparison of live versus paper case scenarios (Worster, Sardo, Fernandes, Eva, & Upadhy, 2007) shows that the kappa was acceptable in both live and paper cases; however, the correlation in live cases was much higher (.90 live versus .76 on paper).

Therefore it seems that it is better to test inter-rater reliability of instructor pilots with a

live flight scenario rather than a paper-based scenario.

Paper-based scenarios would have been easy enough to create for the

instructor/raters being investigated, but as this triage nursing study makes clear, live is

more desirable because it is more reliable. The researcher did not conduct this present

investigation “live” due to physical constraints of aircraft and budgetary constraints.

Instead, the performances that the raters observed were captured on video for viewing at

another place and time, which is consistent with other studies reviewed in this chapter.

Bann, Davis, Moorthy, Munz, Hernandez, Khan, Datta, and Darzi (2005) studied

11 surgical trainees and put them through a 15 minute, six-station rotation of basic

surgical tasks. Each trainee performed the six-station rotation on five separate occasions

for a total of 90 minutes of observation. All of the trainees’ performances were video

recorded for later review. The six tasks each had criteria determining what makes a

trainee competent or not at that task. For example, in the suturing task, trainees were

rated on the “time taken and total number of movements” used to complete the task

(Bann, et al., 2005). The trainees were further rated on the quality of the suture, based on

the squareness and orientation of the knots. The authors emphasized that their measuring

instrument was able to discern both quantity and quality of work.

The researchers used the Spearman correlation coefficient (rho) in their statistical

analysis, which is “used to examine correlations between sittings.” (Bann, Davis,

Moorthy, Munz, Hernandez, Khan, Datta, & Darzi, 2005). (Since neither the pilots nor

the raters sit for their part of the study more than once, there will not be any improvement

to measure. Therefore, rho is not necessary to this study.) On the other hand, the

researchers used Cronbach’s alpha coefficient “to test a number of internal consistency;

these included the inter-rater reliability of video assessment and intra-task reliability.”

(Bann, et al., 2005). The result of this experiment was that video assessment is indeed a

reliable means of assessing performance. Yet another study concluded that inter-rater

reliability of videotaped cases was excellent, having a coefficient of .93 (Hulsman, Mollema, Oort, Hoos & de Haes, 2006).

In a rare example, James D. Michelson, MD drew a direct parallel between

medicine and aviation. Moreover, Michelson specifically cites the usefulness and

ubiquity of simulator training in aviation, and suggests that more and better simulators be

developed in the training of orthopedic surgeons. Michelson cites other studies that

suggest “good, but not perfect, correlation” (Michelson, 2006) and later suggests that

simulator-based competency standards be developed and will likely come built-in to the

software packages of off-the-shelf simulators in the future. One benefit of using

simulators is that they are asynchronous. That is, a resident doctor need not have a

supervisor present during training if using a simulator. Furthermore, the data collected

during the simulation can be reviewed by more than one supervisor or rater

independently, meeting Cohen’s third requirement that raters perform their duties

independently (Cohen, 1960).

Inter-rater reliability studies are not used solely in the training of health care

professionals, but also to verify the rubrics for various cases such as rating the

effectiveness of out-of-hospital CPR (Rittenberger, Martin, Kelly, Roth, Hostler, &

Callaway, 2006) and for rating the severity of rosacea (Bamford, Gessert, & Renier,

2004). The authors of the rosacea article admitted that when the scale ranged from 1 to

10, the inter-rater reliability coefficient indicated unreliable rating. But when the scale

was reduced to a range from 1 to 5, the inter-rater reliability coefficient was much

greater, indicating reliability.

Inter-rater reliability is also used a great deal in physiotherapy. Holey and

Watson (1995) provided a stark example of the necessity for kappa rather than using

mere percentage of agreement when performing an inter-rater reliability study. In some

cases the percentage of agreement was 100%, while the kappa coefficient, which

accounts for chance agreement, was 0.01, indicating virtually no agreement beyond chance.

Kappa has also been found useful in determining inter-rater reliability in other

studies. A study conducted by Kolt, Brewer, Pizzari, Schoo, & Garrett (2007) combined

two inter-rater reliability studies, one in which six physiotherapists and physiotherapy

students examined videotaped cases, the other compared two live clinical sessions. The

results were that the inter-rater reliability of the first study was very high (κ= .87 to .93)

and the second study reliability varied from very good to good (κ = .76 to .89 and .63 to

.76). Dionne, Bybee, & Tomaka (2006) used kappa to establish moderate reliability (κ =

.55) in a study using 20 patients and 54 trained clinicians. Fifty-four raters is the greatest

number of raters seen in the entire literature review.

Inter-rater reliability in Education

Laura D. and William L. Goodwin wrote An analysis of statistical techniques

used in the Journal of Educational Psychology, 1979-1983 (1985) in order to discern the

most popularly used statistical methods in educational psychology. The Journal of

Educational Psychology (JEP) is a long-established, peer-reviewed journal. Therefore, it

is understood that the statistical methods used by its contributors are useful and

appropriate for anyone doing research in a field related to educational psychology,

including this investigation.

From 1979-1983, 40 out of 92 reliability studies in the JEP were inter-rater

reliability studies. Inter-rater reliability studies comprised nearly half of the studies—by

far the greatest percentage. Considering how commonly researchers use inter-rater

reliability studies to establish or verify reliability in an educational setting, the Goodwins’

article indicates that performing an inter-rater reliability study at flight schools, which are

rightly considered educational institutions, is a legitimate pursuit.



A common use of inter-rater reliability studies in education assesses writing. The

question of what constitutes good or bad writing cannot be answered with an inter-rater

reliability study. Instead, much like the rubrics used to rate medical observations or the

jump-shot as discussed previously, the rubrics for scoring essays must be created first by

an expert or group of collaborating experts who know what good writing is. Qualitative

characteristics must be sorted and presented in such a way that raters can quantify their

observations and opinions of the writing samples. Lee (2004) noticed that, given a

holistic scoring rubric, raters scored computer-based writing samples provided by English

as Second Language (ESL) students far more reliably than when using paper-based—that

is to say, handwritten—writing samples. The holistic rubric included several criteria that

accounted not only for the quality of content, but also quality of expression, as

determined by the writing experts. Lee suggests that the raters may need to learn how not

to discriminate against messy handwriting, and that correcting that bias may help to make

the scores that the raters awarded more reliable.

Penny, Johnson and Gordon (2000) introduced the idea of augmenting a holistic

rubric with benchmark writing samples. Writing, like many other human activities, is

performed on a continuum. That is, one cannot easily discern discrete moments, but

rather observe ability over the passage of time. Assigning an integer to rate a

performance—that is, shifting from a qualitative to quantitative measuring system—

requires a ‘snapshot’, or a discrete variable. In many cases, this means assigning a rating from 1 to 5. Inter-rater reliability studies show whether the quality of writing (or whatever act is being rated) is being accurately translated into a quantity, which can then be

measured. Introducing benchmark papers helped those charged with assessing writing

samples to more accurately rate the quality of writing because each integer had an

exemplar to which the raters could refer. Thus, the inter-rater reliability was increased,

and may also have led to a greater external validity.

Chapter Summary

Inter-rater reliability literature is plentiful and offers researchers several methods

and many examples of how to design and execute inter-rater reliability studies. The

articles featured in this study were chosen because the fields of study all involved training

and featured psychomotor skills that are analogous to and transferable to evaluating

pilot training.

CHAPTER 3

METHOD

This investigation was designed to assess inter-rater reliability between instructor

pilots when observing flights performed by student pilots. This study included

videotaping the performance of student pilots flying an industry standard instrument

flight rules (IFR) pattern. The researcher transferred the footage to a DVD. Four

instructor pilots reviewed DVDs of the flight performance footage and scored the student

pilots’ performances on a scale of 1 to 5. The researcher then analyzed the scores using

Cohen’s kappa coefficient. The resulting coefficients are discussed in Chapter Four,

Results.

Flight Pattern

In The Pilot’s Manual: Instrument Flying (Kirshner, 1990) there are several flight

patterns to choose from. The pattern used for this investigation is referred to as Pattern

D. It was chosen because it is long enough to give the raters something substantial to

score, yet not so time-consuming as to prove burdensome. An illustration of the pattern

appears in Figure 3.

Pilot Participants

Student pilots enrolled in a flight program at a four-year research university

participated by flying the aforementioned flight pattern using a PCATD. The researcher

explained to the students that they were being videotaped for the purpose of investigating

inter-rater reliability. They were assured that these scores, good or bad, would not figure

into their course average. Their identities were protected by preventing any

distinguishing features from being recorded on video. Also, the order in which the flight

performances were viewed was different from the order they were recorded. Thus, the

student who flew the first flight on the day of recording might actually have flown

the last flight viewed by the raters. The researcher did not collect or record any

demographic data about the student pilot participants in order to abide by the limitations

as discussed in Chapter One, Introduction.

Rater Participants

The rater-participants were selected from the pool of instructor pilots at the flight

school. All instructor pilots were offered a chance to participate and the researcher

enlisted the help of four volunteers. These instructor pilots watched and scored the

flights that are contained on the DVDs. They are the raters, whose reliability this study

investigates. Just as the student pilots who flew the pattern were assured that their

participation would not affect their scores in school, the raters were assured of their

anonymity and that their performance in this study would not impact their employment at

the flight school. Also just as with the student pilot participants, the researcher did not

collect or record any demographic data about the rater participants in order to abide by

the limitations as discussed in Chapter One, Introduction.



Figure 3 Pattern D (Kirshner, 1990)



Scoring Rubric

In order to measure inter-rater reliability, there must be an established scoring

rubric. The flight school at which this study was performed already has a scoring

rubric—an explanation of how scores are determined—which was used in this

investigation. The reader is referred to the scoring rubric in Appendix A, which explains

what the scores represent.

As stated in Chapter Two, Literature Review, there is a difference between quality

and quantity, yet in studies such as this and those in the social sciences, medical science,

and education, researchers must change qualitative performance into quantitative data in

order to perform statistical analysis. One cannot average words or put words into a

formula. Thus, words (qualities) must be transformed into numbers (quantities). There is

no analyzing poor, good, or great, but one can analyze scores of 1, 3, or 5. This is

precisely the reason for this inter-rater reliability study: to determine if the student pilot

flight performance is being accurately transformed into a quantitative score according to

the scoring rubric.

Flying the Pattern

Before the student pilots sat at the PCATD, the researcher briefed them. The pattern is rather complex, so, depending on the skill of each student pilot, the researcher gave oral instructions as necessary. As stated in Chapter One, Introduction, this study is

not investigating the student pilots. Therefore, the student pilots’ ability to perform the

flight pattern well or poorly is immaterial. What this study investigates is whether the

raters agree about the student pilots’ performance. Therefore helping a lesser skilled

student pilot complete the pattern does not affect the inter-rater reliability. The raters

were entirely unaware of which students referred to the pattern and which students

performed the pattern from memory. After the flight was finished, there was a

debriefing. See the script in Appendix B.

Experiment Execution

After the flight patterns were recorded, it was time to test the raters. The raters

viewed the DVDs in a controlled environment so as not to influence or be influenced by

other raters, just as specified by Cohen (1960). Then raters were asked to score the

student pilots’ performances according to the scoring rubric. After the raters scored the

student pilots’ performances, the researcher analyzed the data.

Cohen’s coefficient kappa is derived using only two raters. Several studies cited

in Chapter Two, Literature Review, used only two raters, some four, some more. After a

very thorough search, the researcher could not find any research that suggests an optimal

number of raters for inter-rater reliability studies. In this study, there are four raters

because the researcher looked for an even number of raters, as most of the other studies

had, and four instructor pilots made themselves available for testing purposes.

The researcher used six contingency tables similar to the tables described in

Chapter Two, Literature Review, but adapted the table to provide the resultant

information in conformity to the APA style manual. (Gwet’s contingency tables do not

conform to the APA manual.) The example table (Table 1) shows hypothetical Rater X

versus hypothetical Rater Y. The numbers 1 through 5 indicate the scores which raters

can give to student pilots. A score of 1 represents an unsatisfactory performance; a 2,

marginal; a 3, good; a 4, very good; and a 5, excellent, as described in Appendix A. Pairs

of scores from the ten flights (A through J) were tallied in the table according to the rules

as described by Gwet (2002b) in Chapter Two, Literature Review. That is, if rater 1

gives a score of “3” and rater 2 gives a score of “3” then one point will be tallied in the

cell (3, 3). (In Table 1 below, the numbers “0” denote nothing, as this is only an

example. N is 10 because the number of samples is already known.)

Table 1

Example of the inter-rater reliability Contingency Table used in this Experiment

                               Rater X
            Score      1    2    3    4    5   Row Totals      a      ef
              1        0    0    0    0    0       0           0      0
              2        0    0    0    0    0       0           0      0
  Rater Y     3        0    0    0    0    0       0           0      0
              4        0    0    0    0    0       0           0      0
              5        0    0    0    0    0       0           0      0
  Column Totals:       0    0    0    0    0    N = 10      Σa = 0  Σef = 0

The tables will account for each possible rater pairing without replicating pairs.

After the result of each table is tallied according to Cohen’s kappa method, the resultant

coefficients will then be analyzed to determine the inter-rater reliability of the instructor

pilots in comparison with each other. Each column and row should add up to 10, which

is the N, the only constant in the equation. Column a is the number of agreements. This

number is simply the cells showing agreement (e.g. 1, 1; 2, 2, etc.) brought over to a

single column. Column ef is the expected frequency. (The method to derive the ef was

discussed earlier.) At the bottom of column a and column ef is the sum of a (Σa) and the

sum of ef (Σef). In the next chapter, these tables will have beneath them the kappa

equation worked out, resulting in the kappa coefficient.
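As an illustration of this tallying procedure, the Python sketch below (added here for illustration; the scores shown are hypothetical and are not the data collected in this study) builds the contingency table totals from two raters' score lists and applies the kappa formula described in Chapter Two.

    from collections import Counter

    def kappa_from_scores(scores_x, scores_y, categories=(1, 2, 3, 4, 5)):
        # Tally paired scores and return Cohen's kappa for the two raters.
        n = len(scores_x)
        pairs = Counter(zip(scores_x, scores_y))           # (Rater X, Rater Y) score pairs
        col_totals = Counter(scores_x)                     # Rater X column totals
        row_totals = Counter(scores_y)                     # Rater Y row totals
        sum_a = sum(pairs[(c, c)] for c in categories)     # Σa: diagonal (agreement) cells
        sum_ef = sum(row_totals[c] * col_totals[c] / n     # Σef: expected chance agreement
                     for c in categories)
        return (sum_a - sum_ef) / (n - sum_ef)

    # Hypothetical scores for ten flights (A through J); not the study's data.
    rater_x = [3, 4, 2, 5, 1, 3, 4, 2, 5, 3]
    rater_y = [3, 4, 1, 5, 2, 3, 4, 2, 4, 3]
    print(round(kappa_from_scores(rater_x, rater_y), 2))   # 0.62 for these made-up scores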

Chapter Summary

In summary, the methodology is as follows. The researcher enlisted student pilots

as volunteers to fly Instrument Pattern D using the Elite PCATD. A video camera

recorded the image of the simulated instrument panel on the movie screen during the

flights. After recording the student pilots’ flights, the researcher transferred the footage

onto DVDs for easier viewing. Each flight was assigned a letter, A through J. The

researcher then enlisted the help of four instructor pilots to be the rater participants. The

instructor pilots watched and scored the flights in a controlled environment. Upon

finishing their task, the researcher collected their score sheets and placed the scores into

the contingency tables. The researcher then took the pertinent numbers from the table

(those that indicate agreement) and put them into the kappa formula.

If the coefficient, kappa, is .70 or greater, the rater pairs can be said to exhibit

greater reliability; if less than .70, then the rater pairs may be said to exhibit lesser

reliability. The next chapter will discuss the results of this experiment.

CHAPTER 4

RESULTS

The experiment was conducted in a classroom equipped with a PC, projector and

movie screen. The four raters sat in the same room, but were seated far apart to prevent

communication between raters. They were given instructions and a score sheet

(Appendix C and D, respectively) and were briefed by the researcher about how to

behave during the test (i.e. no talking, gesturing, or using other means of communicating

during flights, no talking about the flights during break times, etc.). It took three hours to

watch all of the flights, including two short restroom breaks and one longer break time

during which the researcher switched from the first to the second DVD.

Raw Scores

The raters watched the flights and marked the scores on the score sheet that was

provided. The researcher collected the score sheets; the raw scores appear in Table 2

below.

Table 2

Raters’ Raw Scores

                          Sample Flight
  Rater     A    B    C    D    E    F    G    H    I    J
    1       4    5    2    1    4    3    2    3    1    5
    2       4    5    1    1    4    4    2    4    1    3
    3       3    3    1    1    3    4    1    4    1    2
    4       3    5    1    1    3    3    2    4    1    4

At first glance, these scores appear to show good agreement, especially in sample

flights C, D, G, H and I. A brief examination of the raw scores also reveals that Rater 1

evenly distributed the scores and was the only rater to do so. Raters 2 and 4 had very similar results, with their only disagreements being between scores of 3 and 4. Rater 3 gave the most

scores of 1, and gave no scores of 5. However, to properly analyze the data for inter-rater

reliability, these raw scores must be tallied in the contingency tables.

Contingency Tables

To analyze the data, the researcher created a series of contingency tables as

illustrated by Table 1. Tables 3 through 8 below are the contingency tables that were used to sort and analyze the data. These tables were adapted from Gwet (2002b) in order to conform to APA standards and to present the data without the redundancy of Gwet’s separate tables. Beneath each contingency table is the mathematical work used to derive

the kappa coefficient.



Table 3

Rater 1 versus Rater 2

                               Rater 1
            Score      1    2    3    4    5   Row Totals      a      ef
              1        2    1    0    0    0       3           2      .6
              2        0    1    0    0    0       1           1      .2
  Rater 2     3        0    0    0    0    1       1           0      .2
              4        0    0    2    2    0       4           2      .8
              5        0    0    0    0    1       1           1      .2
  Column Totals:       2    2    2    2    2    N = 10      Σa = 6  Σef = 2

Given: N = 10, Σa = 6, Σef = 2

κ = (Σa - Σef) ÷ (N - Σef) = (6 – 2) ÷ (10 – 2) = 4 ÷ 8 = .50
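As a cross-check (added for illustration, using the same Python approach as the sketch in Chapter Two), the κ = .50 above can be reproduced directly from the Rater 1 and Rater 2 raw scores in Table 2.

    from collections import Counter

    # Rater 1 and Rater 2 raw scores from Table 2, flights A through J.
    r1 = [4, 5, 2, 1, 4, 3, 2, 3, 1, 5]
    r2 = [4, 5, 1, 1, 4, 4, 2, 4, 1, 3]

    n = len(r1)
    sum_a = sum(a == b for a, b in zip(r1, r2))              # observed agreements: Σa = 6
    c1, c2 = Counter(r1), Counter(r2)
    sum_ef = sum(c1[s] * c2[s] / n for s in range(1, 6))     # expected chance agreement: Σef = 2
    print((sum_a - sum_ef) / (n - sum_ef))                   # prints 0.5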

Table 4

Rater 1 versus Rater 3

                               Rater 1
            Score      1    2    3    4    5   Row Totals      a      ef
              1        2    2    0    0    0       4           2      .8
              2        0    0    0    0    1       1           0      .2
  Rater 3     3        0    0    0    2    1       3           0      .6
              4        0    0    2    0    0       2           0      .4
              5        0    0    0    0    0       0           0      0
  Column Totals:       2    2    2    2    2    N = 10      Σa = 2  Σef = 2

Given: N = 10, Σa = 2, Σef = 2

κ = (Σa - Σef) ÷ (N - Σef) = (2 – 2) ÷ (10 – 2) = 0 ÷ 8 = 0



Table 5

Rater 1 versus Rater 4

                               Rater 1
            Score      1    2    3    4    5   Row Totals      a      ef
              1        2    1    0    0    0       3           2      .6
              2        0    1    0    0    0       1           1      .2
  Rater 4     3        0    0    0    0    1       1           0      .2
              4        0    0    2    2    0       4           2      .8
              5        0    0    0    0    1       1           1      .2
  Column Totals:       2    2    2    2    2    N = 10      Σa = 6  Σef = 2

Given: N = 10, Σa = 6, Σef = 2

κ = (Σa - Σef) ÷ (N - Σef) = (6 – 2) ÷ (10 – 2) = 4 ÷ 8 = .50

Table 6

Rater 2 versus Rater 3

                               Rater 2
            Score      1    2    3    4    5   Row Totals      a      ef
              1        3    1    0    0    0       4           3      1.2
              2        0    0    1    0    0       1           0      .1
  Rater 3     3        0    0    0    2    1       3           0      .1
              4        0    0    0    2    0       2           2      .8
              5        0    0    0    0    0       0           0      0
  Column Totals:       3    1    1    4    1    N = 10      Σa = 5  Σef = 2.2

Given: N = 10, Σa = 5, Σef = 2.2

κ = (Σa - Σef) ÷ (N - Σef) = (5 – 2.2) ÷ (10 – 2.2) = 2.8 ÷ 7.8 = .38



Table 7

Rater 2 versus Rater 4

                               Rater 2
            Score      1    2    3    4    5   Row Totals      a      ef
              1        3    0    0    0    0       3           3      .9
              2        0    1    0    0    0       1           1      .2
  Rater 4     3        0    0    0    3    0       3           0      .3
              4        0    0    1    1    0       2           1      .8
              5        0    0    0    0    1       1           1      .2
  Column Totals:       3    1    1    4    1    N = 10      Σa = 6  Σef = 2.4

Given: N = 10, Σa = 6, Σef = 2.4

κ = (Σa - Σef) ÷ (N - Σef) = (6 – 2.4) ÷ (10 – 2.4) = 3.6 ÷ 7.6 = .47

Table 8

Rater 3 versus Rater 4

                               Rater 3
            Score      1    2    3    4    5   Row Totals      a      ef
              1        3    0    0    0    0       3           3      1.2
              2        1    0    0    0    0       1           0      .1
  Rater 4     3        0    0    2    1    0       3           2      .9
              4        0    1    0    1    0       2           1      .6
              5        0    0    1    0    0       1           0      .1
  Column Totals:       4    1    3    3    1    N = 10      Σa = 6  Σef = 2.9

Given: N = 10, Σa = 6, Σef = 2.9

κ = (Σa - Σef) ÷ (N - Σef) = (6 – 2.9) ÷ (10 – 2.9) = 3.1 ÷ 7.1 = .44



Summary of Results

The scores have been tallied and the kappa for each rater pair calculated. As

stated previously throughout this study, the minimum desirable kappa coefficient is .70.

The results in this study were markedly lower.

Table 9

Summary of Results

Rater Pair                 Kappa

Rater 1 vs. Rater 2         .50
Rater 1 vs. Rater 3         .00
Rater 1 vs. Rater 4         .50
Rater 2 vs. Rater 3         .38
Rater 2 vs. Rater 4         .47
Rater 3 vs. Rater 4         .44
Average                     .38

The best kappa was .50, and the worst, 0. The average kappa coefficient was .38—just

over half of the desired .70.

Although all of the rater pairings in this study fell far below .70, one rater, Rater

3, seemed the least reliable of the four. The three pairings in which Rater 3 was involved

were the least reliable, one of which had a kappa of 0, entirely unreliable. Rater 1, with

whom Rater 3 shared the kappa of 0, enjoyed the two highest reliability scores, .50, with

Raters 2 and 4.

Each rater was paired three times. When each rater’s three pairings were

averaged, Rater 1 scored .33; Rater 2, .45; Rater 3, .27; and Rater 4, .47. However,

removing Rater 3 from the averages, so that each rater was only paired twice, Rater 1’s

average rose to .50, Rater 2 to .48 and Rater 4 to .48. Among Raters 1, 2 and 4, the

scores are extremely similar (pair 1 & 2 .50, pair 1 & 4 .50 and pair 2 & 4 .47). Thus it

seems that removing Rater 3 improved the inter-rater reliability in this study. Without

Rater 3 the overall average reliability increased from .38 to .49. This is still well below

.70, but much better.
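The averages quoted above can be reproduced directly from the pairwise kappas in Table 9; the short sketch below (added for illustration) shows the arithmetic.

    # Pairwise kappa coefficients reported in Table 9, keyed by rater pair.
    kappas = {(1, 2): .50, (1, 3): .00, (1, 4): .50,
              (2, 3): .38, (2, 4): .47, (3, 4): .44}

    overall = sum(kappas.values()) / len(kappas)
    without_rater_3 = [k for pair, k in kappas.items() if 3 not in pair]

    print(round(overall, 2))                                        # 0.38
    print(round(sum(without_rater_3) / len(without_rater_3), 2))    # 0.49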

The next chapter will discuss two methods to improve inter-rater reliability at the

flight school, recommendations for improving the execution of the study, and suggestions for further research. The next chapter also includes a commercial application of this study.

CHAPTER 5

CONCLUSIONS AND RECOMMENDATIONS

The resultant coefficients show that the study did not find good inter-rater reliability. There are, however, ways to improve inter-rater reliability at the flight school.

Two suggestions are to engage in extensive recurrent training and to improve the scoring

rubric. There are also some ways to improve the technical aspects of the study and to do

further research. Finally, the researcher proposes a commercial application for this inter-

rater reliability study.

Recurrent Training

The previous chapter described the raw scores and the resultant kappa coefficients

for the four raters. These scores show low inter-rater reliability, which may indicate the need for recurrent training to help the flight school reinforce the scoring

criteria. In the case of Rater 3, more training would be required than for Raters 1, 2 and

4. In sample B, while Raters 1, 2 and 4 agreed upon a score of 5, Rater 3 awarded a score of 3. In sample G, where all others gave a score of 2, Rater 3 gave a 1. And in Sample J,

where there was no agreement among any raters, Rater 3 gave the low score of 2. After

examining the raw scores, it is evident that the most common disagreement was between

the scores 3 and 4. It may be that Raters 1, 2 and 4 need to review the standards to help

them differentiate between performances that rate a 3 rather than a 4, while Rater 3 needs

a greater amount of training to align that rater’s expectations of student performance with

flight school standards.

It may also be helpful to train instructor pilots to interpret the standards used to score student pilot performance first on simple maneuvers and then on progressively more complex patterns, just as the students themselves must work their way up from simple maneuvers to complex patterns. This recurrent training may be

of little use unless the standards are better defined through an improved scoring rubric.

Scoring Rubric Improvements

It could be that the scoring rubric needs improving. Referring again to Appendix

A, there is a disconnect between the description of the quality of performance and

quantifiable data. For example, “An ‘Excellent’ (5) grade will be issued when a student’s

performance far exceeds and is well above the completion standards.” Unfortunately,

there is little to define exactly what makes a performance “far exceed” or sit “well above” the completion standards. The same can be said for scores 4, 3, 2, and 1. The definitions of the scores are too broad.

The scoring sheet (Appendix D) offered the rater the completion standards from

the lesson in which Pattern D is taught. The altitude standard asks only that a student

pilot remain within plus or minus 200 feet of the starting altitude. This standard is very

broadly defined and leaves too much open to interpretation by individual instructor pilots

and hence affects inter-rater reliability. An example of how to fine-tune the altitude standard could include the following scores (a brief sketch of this mapping follows the list):

• a score of 5 should require the student remain within plus or minus 50 feet;

• a 4, plus or minus 100 feet;

• a 3, plus or minus 150 feet;

• a 2, plus or minus 200 feet; and



• a 1 indicates that the student violated the 200-foot limit in either direction, and

therefore is unsatisfactory.
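The mapping proposed above could be written as a simple function. The Python sketch below is illustrative only; the thresholds are the hypothetical values suggested in the list, not standards currently used by the flight school.

    def altitude_score(max_deviation_ft):
        # Map the largest altitude deviation observed during the pattern (in feet)
        # to the proposed 1-5 score. Thresholds are the hypothetical values above.
        if max_deviation_ft <= 50:
            return 5   # excellent
        elif max_deviation_ft <= 100:
            return 4   # very good
        elif max_deviation_ft <= 150:
            return 3   # good
        elif max_deviation_ft <= 200:
            return 2   # marginal
        else:
            return 1   # unsatisfactory: exceeded the 200-foot limit

    print(altitude_score(120))  # prints 3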

The other standards (heading, bank angle, and airspeed) could also be redefined to

more precisely indicate how skilled the student is, rather than leaving a broad range that

is susceptible to loose interpretation. Perhaps by fine-tuning the standards and requiring

the instructor pilots to be retrained in these newer, more precisely defined, standards

would help to improve inter-rater reliability. Fine-tuning these standards may require

further research.

Technical Improvements

Although the researcher is confident in the methodology, improvements can be made to how the experiment is executed on a technical level. This project was the researcher's first attempt to record video footage from a PCATD and then transfer that footage to DVD. While the footage was usable, the quality could be improved by recording the footage directly from the PCATD rather than through intermediate media. The footage had to travel through several steps: from the PCATD to the projector, to the screen, to the video camera, to the iMac, to the iMovie HD application, to the iDVD application, to the actual DVDs. The transfer from camera to the digital movie applications iMovie HD and iDVD is not problematic because there is no noticeable degradation of footage from one digital source to another. Thus, removing the projector, movie screen, and video camera from the middle would likely produce higher quality images, making the footage easier to view clearly. Since the raters all watched the same footage, the footage quality does not affect the inter-rater reliability; it would only affect inter-rater reliability if some raters watched one set of footage while other raters watched an improved version.

In summary, the technical execution of the project could be improved simply by learning how to use all of the features of the iMovie HD and iDVD applications to their fullest extent. Other high-end video editing applications, such as Final Cut, should also be considered, provided the future researcher has the budget to make these technological upgrades.

Recommendations for Further Research

With the technical improvement recommendations out of the way, this is an opportunity to discuss the future for which this project is the foundation. As stated in Chapter One, Introduction, this project was a foundational study, meant to lay the groundwork and establish a method for studying inter-rater reliability that can be used at any flight school with the resources to carry out the experiment.

The first recommendation is to expand the number of samples, the number of raters, or both. This researcher would also encourage a future researcher to test other means of measuring inter-rater reliability. Chapter Two, Literature Review, cited studies that used alpha and rho. In the interest of finding the best analytical method, alpha, rho, and other coefficients should be tested along with the increase in samples and raters until an agreed-upon method is derived.
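
As a rough sketch of what such a comparison might look like, the listing below computes pairwise Cohen's kappa, Spearman's rho, and raw percent agreement using the widely available scikit-learn and SciPy implementations. The rater scores are invented for illustration and are not the scores reported in Chapter 4; a coefficient such as Krippendorff's alpha would require an additional library and is omitted here.

    from itertools import combinations

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 1-5 grades from four raters across ten sample flights.
    ratings = {
        "Rater 1": [3, 4, 5, 2, 3, 4, 2, 3, 4, 3],
        "Rater 2": [3, 3, 5, 2, 4, 4, 2, 3, 4, 4],
        "Rater 3": [2, 3, 3, 1, 3, 3, 1, 2, 3, 2],
        "Rater 4": [3, 4, 5, 2, 3, 4, 2, 4, 4, 3],
    }

    for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
        kappa = cohen_kappa_score(a, b)              # chance-corrected agreement
        rho, _ = spearmanr(a, b)                     # rank-order correlation
        exact = np.mean(np.array(a) == np.array(b))  # raw percent agreement
        print(f"{name_a} vs {name_b}: kappa={kappa:.2f}, rho={rho:.2f}, "
              f"exact agreement={exact:.0%}")

Averaging the pairwise kappa values would give a single summary figure comparable to the average kappa of .38 reported for this study.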

The second recommendation is to choose different patterns. One suggestion is to

begin testing particular maneuvers such as shallow, medium and steep turns, ascending

and descending turns, or constant airspeed climbs. These are just examples, and a future

researcher could experiment with particular maneuvers rather than entire patterns. At the

same time, one could also consider choosing from a catalog of other instrument patterns,

more or less challenging than Pattern D.

Recommendations one and two do not cast doubt on the methodology of this study. Adding more raters might lead to more agreement, but it might also lead to more disagreement. Likewise, adding more sample flights may or may not produce lesser or greater reliability. What must be avoided at all costs is designing a study that is structured to create agreement. Testing particular maneuvers rather than patterns is not necessarily better, because performing maneuvers is just one part of flight training. The goal of flight training is not to make a pilot proficient at discrete maneuvers, but to give a pilot such depth of understanding and technical ability that he or she can take the maneuvers learned over years of training and spontaneously serialize or combine them into an organic flight that has unity from takeoff to landing. So testing only maneuvers, rather than patterns or spontaneous flights, is not necessarily better. However, more samples, more raters, other patterns, and other statistical methods all deserve to be tested for the sake of expanding our body of knowledge and perfecting a method that one day could become "tried and true." In short, researchers must trust in the scientific method to continually develop better means of testing and never rest content with existing research.

With further research, fine-tuning of the standards, and updated instructor training, one may find that the method can be adapted for commercial use.

Commercial Application of this Study

Once this experiment has been tested and re-tested such that the results can be replicated and are consistent, and the method has been deemed valid by a panel of experts in related fields, this study can be developed into an instructor training program for the commercial market and sold to flight schools.

Following some of the recommendations above, the instructor training program could begin by evaluating maneuvers and testing reliability. Upon reaching a kappa of .70 or greater, the instructor trainee can move on to the next phase, learning how to evaluate simple patterns, then how to evaluate complex patterns, and finally how to reliably rate IFR check rides. The training need not happen only using a PCATD. The method and training system should be designed so that, as the training progresses, the footage from the PCATD is replaced by footage from a full simulator, and the full simulator footage is eventually replaced by footage from an actual aircraft, because the instructors and their students will experience training in all three media.

Future researchers who wish to take this project toward a commercial application must establish baseline flights, just as Penny et al. (2000) established benchmark essays for scoring writing samples. For example, a future researcher may find that a particular flight has been viewed by raters and that they have reached consensus that the flight is a 3. A researcher for a commercial developer or flight school must build up a catalog of baseline flights that have all been tested, and then create a test in which the established baseline scores are entered into the contingency table as Rater 1 while the rater currently being tested becomes Rater 2. Thus, the future researcher or tester will place the New Rater against the baseline scores. A kappa of .70 or greater shows that the New Rater can score flights reliably, while a kappa of less than .70 indicates that the New Rater needs further instruction before being allowed to rate actual flights. The result may be that flight schools can effectively and economically screen potential flight instructors or maintain standards with current instructors.
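
A minimal sketch of such a screening test, assuming a hypothetical baseline catalog of consensus scores and the .70 pass/fail threshold proposed above, might look like the following (the scores and helper names are invented for illustration):

    from sklearn.metrics import cohen_kappa_score, confusion_matrix

    KAPPA_THRESHOLD = 0.70  # pass/fail criterion proposed above

    # Hypothetical consensus grades for ten benchmark flights (the "Rater 1" column).
    baseline_scores = [3, 4, 5, 2, 3, 4, 2, 3, 4, 3]

    def screen_new_rater(new_rater_scores):
        # Compare a candidate instructor's grades ("Rater 2") against the baseline.
        table = confusion_matrix(baseline_scores, new_rater_scores,
                                 labels=[1, 2, 3, 4, 5])
        kappa = cohen_kappa_score(baseline_scores, new_rater_scores)
        return table, kappa, kappa >= KAPPA_THRESHOLD

    # Example: a candidate who matches the baseline on nine of ten flights.
    candidate = [3, 4, 5, 2, 3, 4, 2, 3, 4, 4]
    table, kappa, passed = screen_new_rater(candidate)
    print(table)  # 5-by-5 contingency table of baseline versus candidate grades
    print(f"kappa = {kappa:.2f} -> "
          f"{'reliable' if passed else 'needs further instruction'}")

Because the baseline column is fixed, the same catalog could be reused to screen any number of candidate instructors or to periodically re-check current instructors.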

Summary

“The search for valid, reliable, feasible, and fair assessments of cognitive and

human performance is, in many ways, at the very heart of educational measurement”

(Penny et al., 2000). In a very real way, instructor pilots are educators, and their evaluations of student performance are educational measurements. The researcher sought out research in scientific and educational journals that would help to lay the foundation for inter-rater reliability studies in flight training. To that end, four flight school instructors (raters) were tested according to a methodology inspired by the literature reviewed, with statistical analysis based upon Cohen's kappa coefficient. This coefficient is commonly used in inter-rater reliability studies in fields ranging from education, social science, and psychology to medicine and even sports, and it is used quite often in training situations. In this study, kappa was applied to flight training, specifically to testing instructor pilots for inter-rater reliability. Ultimately, the study indicated that inter-rater reliability was low, with an average kappa of .38, well below the desired .70. Nevertheless, this study was successful in that it demonstrated a usable method for testing inter-rater reliability in flight training and provides the basis for further research and commercial development.

REFERENCES

Bamford, J.T.M., Gessert, C.E., & Renier, C.M. (2004). Measurement of the severity of rosacea. [Electronic version]. Journal of the American Academy of Dermatology, 51(5), 697-703.

Bann, S., Davis, I.M., Moorthy, K., Munz, Y., Hernandez, J., Khan, M., Datta, V., &
Darzi, A. (2005). The Reliability of multiple objective measures of surgery and
the role of human performance. [Electronic version]. The American Journal of
Surgery, 189, 747-752.

Bell, V., Halligan P.W., & Ellis, H.D. (2006). Diagnosing Delusions: A review of inter-
rater reliability. [Electronic version]. Schizophrenia Research, 86, 76-79.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

DeVellis, R.F. (2005). Inter-Rater Reliability. [Electronic version]. In Encyclopedia of Social Measurement (Vol. 2, pp. 317-322). New York: Elsevier Inc.

Dionne, C.P., Bybee, R.F., & Tomaka, J. (2006). Inter-rater reliability of McKenzie
assessment in patients with neck pain. [Electronic version]. Physiotherapy, 92,
75-82.

Drake, R., Haddock, G., Terrier, N., Bentall, R., & Lewis, S. (2007). The Psychotic
Symptom Rating Scales (PSYRATS): Their usefulness and properties in first
episode psychosis. [Electronic version]. Schizophrenia Research, 89, 119-122.

Ferri, R., Bruni, O., Miano, S., Smerieri, A., Spruyt, K., & Terzano, M. (2005). Inter-
rater reliability of sleep cyclic alternating pattern (CAP) scoring and validation of
a new computer-assisted CAP scoring method. [Electronic version]. Clinical
Neurophysiology, 116, 696-707.

Goodwin, L.D. & Goodwin, W.L. (1985). An Analysis of Statistical Techniques Used in
the Journal of Educational Psychology, 1979-1983. [Electronic version].
Educational Psychologist, 20(1), 13-21.

Gwet, K. (2002a) Kappa statistic is not satisfactory for assessing the extent of agreement
between raters. Retrieved December 15, 2006, from
http://www.stataxis.com/files/articles/kappa_statistic_is_not_satisfactory.pdf.

Gwet, K. (2002b) Cohen’s Kappa. Retrieved December 15, 2006, from http://www-
class.unl.edu/psycrs/handcomp/hckappa.pdf.

Holey, L.A., & Watson, M.J. (1995) Inter-rater reliability of connective tissue zones
recognition. [Electronic version]. Physiotherapy, 61(7), 369-372.

Hulsman, R.L., Mollema, E.D., Oort, F.J., Hoos, A.M., & de Haes, J.C.J.M. (2006) Using
standardized video cases for assessment of medical communication skills:
Reliability of an objective structured video examination by computer. [Electronic
version]. Patient Education and Counseling, 60, 24-31.

Joo, E.-J., Joo, Y.-H., Hong, J.-P., Hwang, S., Maeng, S.-J., Han J.-H., Yang, B.-H., Lee,
Y.-S., & Kim, Y.-S. (2004). Korean Version of the Diagnostic Interview for
Genetic Studies: Validity and Reliability. [Electronic version]. Comprehensive
Psychiatry, 45(3), 225-229.

Kadri, N., Agoub, M., El Gnaoui, S., Mchichi Alami, Kh., Hergueta, T., & Moussaoui, D. (2005). Moroccan colloquial Arabic version of the Mini International Neuropsychiatric Interview (MINI): Qualitative and quantitative validation. [Electronic version]. European Psychiatry, 20, 193-195.

Kaneda, Y., Ohmoria, T., & Fujii, A. (2001). The serotonin syndrome: investigation
using the Japanese version of the Serotonin Syndrome Scale. [Electronic version].
Psychiatry Research, 105, 135-142.

Kirshner, W.K. (1990). The Pilot's Manual: Instrument Flying (4th ed.). Ames, IA: Iowa State Press.

Kolaitas, J., Korpa, T., Kolvin, I., & Tsiantis, J. (2003). Letter to the Editor. [Electronic
version]. European Psychiatry, 18, 374-375.

Kolt, G.S., Brewer, B.W., Pizzari, T., Schoo, A.M.M., & Garrett, N. (2006). The Sport
Injury Rehabilitation Adherence Scale: a reliable scale for use in clinical
physiotherapy. [Electronic version]. Physiotherapy 93(1), 17-22.

Lee, H.K. (2004). A comparative study of ESL writers’ performance in a paper-based and
a computer-delivered writing test. [Electronic version]. Assessing Writing, 9, 4-
26.

Leung, T.K.S. & Tsang H.W.H. (2006). Chinese version of the Assessment of
Interpersonal Problem Solving Skills. [Electronic version]. Psychiatry Research
143, 189-197.

Lindeman, B., Libkuman, T., King, D., & Kruse B. (2000). Development of an
Instrument to Assess Jump-Shooting Form in Basketball. [Electronic version].
Journal of Sports Behavior. 23(4), 335-348.

Meyers, B.S., English, J., Gabriele, M., Peasley-Miklus, C., Heo, M., Flint, A.J., Mulsant, B.H., & Rothschild, A.J. (2006). A Delusion Assessment Scale for Psychotic Major Depression: Reliability, Validity, and Utility. Biological Psychiatry, 60, 136-1342.

Michelson, J.D. (2006). Simulation in Orthopaedic Education: An Overview of Theory and Practice. [Electronic version]. The Journal of Bone & Joint Surgery, 88-A(6), 1405-1411.

Monroe-Blum, H., Collins, E., McCleary, L., & Nuttall, S. (1996). The social dysfunction
index (SDI) for patients with schizophrenia and related disorders. [Electronic
version]. Schizophrenia Research. 20, 211-219.

Papavasilou, A.S., Rapidi, C.A., Rizou, C., Petrapoulou, K., & Tzavara, Ch. (2006). Reliability of Greek version Gross Motor Function Classification System. [Electronic version]. Brain & Development, 29, 79-82.

Penny, J., Johnson, R.L., & Gordon, B. (2000). The effect of rating augmentation on inter-rater reliability: An empirical study of a holistic rubric. [Electronic version]. Assessing Writing, 7, 143-164.

Raymont, V., Buchanan, A., David, A.S., Hayward, P., Wessley, S., & Hotopf, M.
(2006). The inter-rater reliability of mental capacity assessments. [Electronic
version]. Law and Psychiatry, 30, 112-117

Rittenberger, J.C., Martin, J.R., Kelly, L.J., Roth, R.N., Hostler, D., & Callaway, C.W.
(2006). Inter-rater reliability for witnessed collapse and presence of bystander
CPR. [Electronic version]. Resuscitation, 70, 410-415.

Schmidt, N.B., Salas, D., Bernert, R., & Schatschneider, C. (2005). Diagnosing
agoraphobia in the context of panic disorder: examining the effect of the DSM-IV
criteria on diagnostic decision-making. [Electronic version]. Behavior Research
and Therapy, 43, 1219-1229.

Thuile, J., Even, C., Friedman, S., & Guelfi, J.-D. (2005). Inter-rater reliability of the French version of the core index for melancholia. [Electronic version]. Journal of Affective Disorders, 88, 193-208.

Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd ed.). Mason, OH:
Thomson

Tural, U., Fidaner, H., Alkin, T. & Bandelow, B. (2002). Assessing the severity of panic
disorder and agoraphobia: Validity, reliability and objectivity of the Turkish
translation of the Panic and Agoraphobia Scale (P & A). [Electronic version].
Journal of Anxiety Disorders, 16, 331-340.

Worster, A., Sardo, A.A., Fernandes C.M.B., Eva, K., & Upadhy, S. (2007). Triage tool
inter-rater reliability: a comparison of live versus paper case scenarios.
[Electronic version]. Journal of Emergency Nursing, 33(4), 319-323.
APPENDIX A

SCORING RUBRIC
APPENDIX B

BRIEFING AND SCRIPT



Brief

Thank you for participating in this inter-rater reliability study. You are not being

tested. Your upcoming flight will be scored by instructors for research purposes only.

Your performance here today will not have any effect on your scores in school. Your

name is not being recorded. Even I, the researcher, am not keeping a record of your

name or any information about you.

During this flight you will be asked to fly Pattern D from The Pilot’s Manual:

Instrument Flying. Whether you have a passing or thorough knowledge of this flight

pattern is not important. I will talk you through the flight, if necessary. I will not keep

track of the time for you. I will, however, give you ample time before the next maneuver.

Remember, you are not the one being tested. This flight is being used to test your

instructors. Even though your performance is not being tested, I ask that you still try

your best just as you would in a real plane with a real instructor pilot.

Do you have any questions?

Instructions

This flight will begin with you already airborne. You are flying at 6000 feet,

straight and level, heading 360, at 130 knots cruising speed. The flight will end with you

airborne as well. Do you have any questions before we begin?

1. Begin…now. Keep the aircraft straight and level for one minute.

2. At the one minute mark, turn left to heading 315.

3. When you come to heading 315, fly straight and level for one minute.

4. Turn right, 180 degrees to heading 135.



5. When you reach heading 135, fly straight and level for 30 seconds.

6. Turn right 45 degrees to heading 180.

7. When you reach heading 180, fly straight and level for 2 minutes.

8. At the two-minute mark, turn right thirty degrees to heading 210.

9. Fly straight and level for 45 seconds.

10. Turn left 210 degrees to heading 360.

11. When you reach heading 360, fly straight and level for 2 minutes.

12. Turn right 180 degrees to heading 180.

13. Fly straight and level for 2 minutes.

14. Turn right 180 degrees to heading 360. Fly straight and level for 2 minutes.

15. You have finished the flight. Please stop.

Debrief

Thank you for flying this pattern. Your flight is one of many that will be used to

help us test the reliability of the instructor pilots. Although a recording of your flight has

been made, no information about you has been kept, and thus no information about you

can or will be shared.

Do you have any questions before you go?


APPENDIX C

INSTRUCTIONS TO RATERS

Brief

Thank you for being kind enough to participate in this study. We are soon going

to watch DVDs containing 10 sample flights. Before we watch these flights, I must lay

out some ground rules:

1. We will watch each flight only once.

2. Score each flight at the end of the flight. Do not wait until all flights are over to

score them all. Take each flight as it is.

3. You have been given a copy of Pattern D and the scoring rubric, which you may

refer to throughout this process. On the score sheet, there is also a brief summary

of the standards and of the scoring rubric.

4. You must not communicate with each other while watching the flights. This

includes talking, nodding, winking, gesturing, making faces, etc.

5. We will take short breaks after every two videos, and a long break at the end of

the first DVD.

6. You may talk during the break times, but you must refrain from talking about the

flights. Please keep the conversation to unrelated topics.

7. At the end of the viewing, after I have collected your score sheets, we may then

discuss any flights. You will not have the ability to change your scores.

Do you have any questions before we begin?


APPENDIX D

SCORING SHEET

Standards:                      Grading Scale:

Altitude:     +/- 200 feet      5   Excellent
Heading:      +/- 15°           4   Very Good
Bank angle:   +/- 10°           3   Good
Airspeed:     +/- 15 KIAS       2   Marginal
                                1   Unsatisfactory

Note: Standards are taken directly from a lesson pertinent to Pattern D. The grading scale

is the same as described in the scoring rubric.

Give a score of 1 – 5 for each sample.

A B C D E F G H I J
