
NASA Safety Culture Analysis

This document describes a new approach to modeling organizational safety culture and decision-making using system dynamics. The authors applied this approach to NASA's manned space program to better understand factors in the Columbia accident and analyze the new Independent Technical Authority structure. The approach involves creating executable models of the organizational safety control structure and decision-making processes to identify risks, evaluate changes, and perform causal analysis of accidents.


Proceedings of the 2005 Winter Simulation Conference

M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and J. A. Joines, eds.

USING SYSTEM DYNAMICS FOR SAFETY AND RISK MANAGEMENT IN COMPLEX ENGINEERING SYSTEMS

Nicolas Dulac
Nancy Leveson
David Zipkin
Stephen Friedenthal
Joel Cutcher-Gershenfeld
John Carroll
Betty Barrett

Massachusetts Institute of Technology


Cambridge, MA 02139, U.S.A.

ABSTRACT

This paper presents a new approach to modeling and analyzing organizational culture, particularly safety culture. We have been experimentally applying it to the NASA manned space program as part of our goal to create a powerful new approach to risk management in complex systems. We describe the approach and give sample results of its applications to understand the factors involved in the Columbia accident and to perform a risk analysis of the new Independent Technical Authority (ITA) structure for NASA, which was introduced to improve safety-related decision-making.

1 THE PROBLEM

Traditionally accidents are treated as resulting from an initiating (root cause) event in a chain of directly related failure events. This traditional approach, however, has limited applicability to complex systems, where interactions among components, none of which may have failed, often lead to accidents. The chain-of-events model also does not include the systemic factors in accidents such as safety culture and flawed decision-making. Technical risk management requires more than simply looking at the technical parts of systems. A new, more inclusive approach is needed that encompasses the technical aspects, as well as the managerial, organizational, social, and political aspects of the system and its environment.

To accomplish this goal, we use a new foundational model of accident causation plus formal modeling and analysis of both the physical and organizational aspects of systems. The modeling involves various types of executable and analyzable models, including system dynamics models.

Our approach rests on the hypothesis that safety culture can be modeled, formally analyzed, and engineered. Models of the organizational safety control structure and dynamic decision-making and review processes can potentially be used for: (1) designing and validating improvements to the risk management and safety culture; (2) evaluating and analyzing risk; (3) detecting when risk is increasing to unacceptable levels (a virtual “canary in the coal mine”); (4) evaluating the potential impact of changes and policy decisions on risk; (5) performing “root cause” (perhaps better labeled as systemic factors or causal dynamics) analysis; and (6) determining the information each decision-maker needs to manage risk effectively and the communication requirements for coordinated decision-making across large projects.

2 INTRODUCTION TO STAMP

STAMP (System Theoretic Accident Model and Processes) views accidents as the result of flawed processes involving interactions among people, societal and organizational structures, engineering activities, and physical system components (Leveson 2004). Safety is treated as a control problem: accidents occur when component failures, external disturbances, and/or dysfunctional interactions among system components are not adequately handled. In the Space Shuttle Challenger loss, for example, the O-rings did not adequately control propellant gas release by sealing a tiny gap in the field joint. In the Mars Polar Lander loss, the software did not adequately control the descent speed of the spacecraft: it misinterpreted noise from a Hall effect sensor as an indication the spacecraft had reached the surface of the planet.

Accidents such as these, involving engineering design errors, may in turn stem from inadequate control over the


development process, i.e., risk is not adequately managed in the design, implementation, and manufacturing processes. Control is also imposed by the management functions in an organization (the Challenger accident involved inadequate controls in the launch-decision process, for example) and by the social and political system within which the organization exists. The role of all of these factors must be considered in hazard and risk analysis.

Note that the use of the term “control” does not imply a strict military-style command and control structure. Behavior is controlled or influenced not only by direct management intervention, but also indirectly by policies, procedures, shared values, and other aspects of the organizational culture. All behavior is influenced and at least partially “controlled” by the social and organizational context in which the behavior occurs. The connotation of control systems in engineering centers on the concept of feedback and adjustment (such as in a thermostat), which is an important part of the way we use the term here. Engineering this context can be an effective way of creating and changing a safety culture.

Systems are viewed in STAMP as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system is not treated as a static design, but as a dynamic process that is continually adapting to achieve its ends and to react to changes in itself and its environment. The original design must not only enforce appropriate constraints on behavior to ensure safe operation, but must continue to operate safely as changes and adaptations occur over time. Accidents, then, are considered to result from dysfunctional interactions among the system components (including both the physical system components and the organizational and human components) that violate the system safety constraints. The process leading up to an accident can be described in terms of an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex set of goals and values. The accident or loss itself results not simply from component failure (which is treated as a symptom of the problems) but from inadequate control of safety-related constraints on the development, design, construction, and operation of the socio-technical system.

While events reflect the effects of dysfunctional interactions and inadequate enforcement of safety constraints, the inadequate control itself is only indirectly reflected by the events; the events are the result of the inadequate control. The system control structure, therefore, must be examined to determine how unsafe events might occur and if the controls are adequate to maintain the required constraints on safe behavior.

A STAMP modeling and analysis effort involves creating a model of the system safety control structure: the safety requirements and constraints that each component (both technical and organizational) is responsible for maintaining; controls and feedback channels; process models representing the view of the controlled process by those controlling it; and a model of the dynamics and pressures that can lead to degradation of this structure over time. These models and the analysis procedures defined for them can be used (1) to investigate accidents and incidents to determine the role played by the different components of the safety control structure and learn how to prevent related accidents in the future, (2) to proactively perform hazard analysis and design to reduce risk throughout the life of the system, and (3) to support a continuous risk management program where risk is monitored and controlled.

STAMP uses two types of models: static control structure models and dynamic behavior (system dynamics) models. A new aspect with respect to system dynamics is the tying together of system dynamics and static structure modeling. We believe this will lead to easier construction of system dynamics models and to more complete or expanded modeling and system understanding. The following section provides a short description of our first attempts at using system dynamics to model safety culture and decision-making in the NASA space shuttle program.

3 MODELING SAFETY CULTURE AND DECISION-MAKING AT NASA

3.1 Initial High-Level Diagram

Our interest in using system dynamics as an integral part of a STAMP Risk Analysis process started with a very simple causal loop diagram drawn in the aftermath of the Columbia accident. The objective was to describe the fundamental system adaptation modes responsible for the erosion of safety at NASA (see Figure 1).

Figure 1: Simplified Model of the Dynamics Behind the Shuttle Columbia Loss

The reinforcing feedback loop labeled R1, or Pushing the Limit, shows how, as external pressures increased, performance pressure increased, which led to increased launch rates and thus success in meeting the launch rate expectations, which in turn led to increased expectations and increasing performance pressures. This, of course, is an unstable system and cannot be maintained indefinitely; note the larger balancing loop, B1, in which this loop is embedded, labeled Limits to Success. The upper left loop represents part of the safety program. The external influences of budget cuts and increasing performance pressures that reduced the priority of safety procedures led to a decrease in system safety efforts. The combination of this decrease along with loop B2, in which fixing problems increased complacency, which also contributed to reduction of system safety efforts, eventually led to a situation of (unrecognized) high risk. While reduction in safety efforts and lower prioritization of safety concerns may lead to accidents, accidents usually do not occur for a considerable time period (years), so false confidence is created that the reductions are having no impact on safety, and therefore pressures increase to reduce safety efforts and priority even further as the external performance pressures mount.

A simple system dynamics model was created out of the causal loop diagram. Model analysis indicated an inherently oscillating behavior where risk is allowed to creep up undetected (see Figure 2) as safety efforts diminish under safety budget cuts and increasing complacency associated with a program perceived to be safe and operational. The major counter-intuitive finding associated with the initial model was that safety should be examined carefully at the point when the program seems to be highly successful. The model generated a lot of enthusiasm at NASA colloquiums, but its simplicity limits the quality of insights that can be extracted from its analysis. Therefore, we decided to create a more complete system dynamics model of the NASA Space Shuttle safety decision-making.

Figure 2: Oscillating Dynamics from Simplified Model

3.2 Detailed Model of Safety Decision-Making in the NASA Manned Space Program

The larger model was created to understand the factors in the Shuttle safety culture and decision-making that contributed to the Columbia loss. The original model was constructed using both Leveson’s personal long-term association with NASA as well as interviews with current and former employees, books on NASA's safety culture, such as Inside NASA (McCurdy 1994), books on the Challenger and Columbia accidents, NASA mishap reports including: CAIB (Gehman 2003), Mars Polar Lander (Young 2000), Mars Climate Orbiter (Stephenson 1999), WIRE (Branscome 1999), SOHO (NASA/ESA 1998), Huygens (Link 2000), other NASA reports on the manned space program such as SIAT (McDonald 2000) and others, as well as many of the better researched magazine and newspaper articles. A detailed documentation of the original model cannot be provided in this paper, but is available upon request from the author. The initial results from our modeling efforts provided some interesting insights and reinforced our belief that system dynamics modeling should be an integral part of a STAMP analysis. Among the scenarios investigated, a contractor analysis was performed to understand the effect of different levels of contracting on system risk. We found that increased contracting did not significantly change the level of risk until a “tipping point” was reached where NASA was not able to perform the integration and safety oversight that is its responsibility. After that point, risk escalates substantially (see Figure 3).

Figure 3: Contractor Scenario Analysis Results

Another scenario investigated the impact on the model behavior of increasing the independence of safety decision


makers through an organizational change like the Independent Technical Authority (ITA). This analysis approximated the effect of the ITA by modifying parameters in the system such as: better reporting, better safety reviews, and more power and authority to safety decision-makers. The results show that significantly higher risk mitigation potential could be achieved by a successful implementation of the ITA program (Figure 4).

Figure 4: Simplified ITA Scenario Analysis Results

Based on this first attempt at performing detailed modeling of safety culture and decision-making in the manned space program, we were asked by NASA to assist in a planned assessment of the new ITA by using our modeling approach to identify metrics and measures of effectiveness for the assessment. To accomplish this goal, we modified the original model to include a structure that better captures the effects of the ITA program. The objective was to perform a structured analysis of the risks associated with the implementation of the new NASA ITA program. The model was created based on information we obtained from the ITA Implementation Plan and our personal experiences at NASA. The remainder of this paper discusses the entire risk analysis process, as well as the system dynamics model, analysis and results, using the ITA program implementation risk analysis as an example.

4 THE STAMP-BASED RISK ANALYSIS PROCESS

We followed a traditional system engineering and system safety engineering approach (see Figure 5), but adapted to the task at hand (organizational risk analysis).

The first step in a STAMP-based risk analysis is to identify the high-level hazard(s) that the independent technical authority was designed to control and then the general requirements and constraints necessary to eliminate the hazard(s). For ITA:

System Hazard: Poor engineering and management decision-making leading to an accident (loss)

System Safety Requirements and Constraints:

1. Safety considerations must be first and foremost in technical decision-making.
2. Safety-related technical decision-making must be done by eminently qualified experts, with broad participation of the full workforce.
3. Technical decision-making must be credible (executed using credible personnel, with safety analyses available and used throughout the system lifecycle).
4. The Agency must provide avenues for the full expression of technical conscience (for safety-related technical concerns) and provide a process for full and adequate resolution of technical conflicts as well as conflicts between programmatic and technical concerns.

Each of these high-level requirements was then refined into more detailed requirements.

Figure 5: STAMP-Based Risk Analysis Process

The next step was to create a structural model of the safety control structure in the NASA manned space program, augmented with the independent technical authority


as designed. This model includes the roles and responsibilities of each organizational component with respect to safety. We then traced each of the above system safety requirements and constraints to those components responsible for their implementation and enforcement. In this process, we identified some omissions in the organizational design and places where overlapping control responsibilities could lead to conflicts or require careful coordination and communication.

We next performed a hazard analysis on the safety control structure, using a new hazard analysis technique based on STAMP. A STAMP hazard analysis works on both the technical (physical) and the organizational (social) aspects of systems. There are four general types of risks in the ITA concept:

1. Unsafe decisions are made or approved by the ITA.
2. Safe decisions are disallowed (i.e., overly conservative decision-making that undermines the goals of NASA and long-term support for the ITA).
3. Decision-making takes too long, minimizing impact and also reducing support for the ITA.
4. Good decisions are made by the ITA, but they do not have adequate impact on system design, construction, and operation.

The hazard analysis applied each of these types of risks to the NASA organizational components and functions involved in safety-related decision-making and identified the risks (inadequate control) associated with each. The resulting list of risks was quite long (250), but most appeared to be important and not easily dismissed. To reduce the list to one that could be feasibly assessed, we categorized each risk as either an immediate and substantial concern, a longer-term concern, or capable of being handled through standard processes and not needing a special assessment.

We then used our system dynamics models to identify which risks were the most important to measure and assess, i.e., which provide the best measure of the current level of system risk and are the most likely to detect increasing risk early enough to prevent significant losses. This analysis led to a list of the best leading indicators of increasing and unacceptable risk.

The analysis also pointed to structural changes and planned evolution of the safety-related decision-making structure over time that could strengthen the efforts to avoid migration to unacceptable levels of organizational risk and avoid flawed management and engineering decision-making leading to an accident. The following section provides a description of the contribution of system dynamics modeling to the entire ITA risk analysis.

5 DYNAMIC RISK ANALYSIS OF THE INDEPENDENT TECHNICAL AUTHORITY

5.1 Model Description

One of the significant challenges associated with modeling a socio-technical system as complex as the Shuttle program is creating a model that captures the critical intricacies of the real-life system, but is not so complex that it cannot be readily understood. To be accepted and therefore useful to decision makers, a model must have the confidence of the users, and that confidence will be limited if the users cannot understand what has been modeled. We addressed this problem by breaking the overall system dynamics model into nine logical subsystem models, each of an intellectually manageable size and complexity (see Figure 6).

Figure 6: The Nine Subsystem Models and their Interactions (Launch Rate; System Safety Resource Allocation; Perceived Success by Administration; ITA; Shuttle Aging and Maintenance; System Safety Efforts & Efficacy; Incident Learning & Corrective Action; Risk; System Safety Knowledge, Skills & Staffing)

The subsystem models were built and tested independently. Extensive partial model testing was used in order to increase our confidence that the model behavior would be accurate. It was also verified that the behavior of each subsystem module passed the intent rationality test (Morecroft 1985, Sterman 2000). The behavior of each subsystem model was shown to be in accordance with the open-loop behavior rationally expected when critical feedback loops are removed. For example, in the absence of external pressures to modify the resources allocated to safety efforts (e.g., schedule and budget pressures), the System Safety Resource Allocation model should output a constant level of safety resources. Once each subsystem model had been validated and confidence in its behavior established, subsystem models were connected to one another so that important information could flow between them and emergent properties arising from their interactions could be included in the analysis. The model was built in a modular


fashion, which made it easy to test and modify individual subsystem models independently, and then re-integrate them.

The following description provides a high-level listing of some key variables and concepts contained in each subsystem model. A detailed description of the content of each model is impossible given paper length constraints; however, interested readers are invited to request it from the author.

Risk: Incident and accident occurrence, effective vehicle age, quantity and quality of inspections, proactive hazard analysis and mitigation efforts, response of the program to anomalies (symptom fix vs. systemic factor fix response).

System Safety Resource Allocation: Level of resources allocated to system safety, priority of safety program, priority of launch performance, NASA safety history, performance expectations, schedule pressure, budget pressure.

System Safety Knowledge, Skills, and Staffing: NASA and contractors’ system safety knowledge and skills, ability to oversee contractor safety activities, number of NASA system safety employees, number of contractor system safety employees, aggregate experience of NASA employees, aggregate experience of contractor employees, age of NASA employees, portion of work contracted out, stability of funding, hiring rate, attrition rate, experience at hire, learning rate.

Shuttle Aging and Maintenance: Age of the shuttle vehicles (in launches), amount of maintenance, refurbishments, and safety upgrades, resources available for maintenance, maintenance requirements, original design lifetime, uncertainty in remaining system life.

Launch Rate: Perception of success by management, performance expectations from management, schedule pressure, launch commitment, launch backlog, launch delays.

System Safety Efforts and Efficacy: Availability and adequacy of system safety resources, availability and effectiveness of safety processes and standards, system safety staff characteristics (number, knowledge, experience, skills, motivation, and commitment), ability of NASA to oversee and integrate contractor safety efforts, quantity and quality of lessons learned.

Incident Learning and Corrective Action: Number of safety-related incidents, fraction of safety problems reported depending on the effectiveness of the reporting process, employee sensitization to safety problems, fear of reporting problems and concerns, risk perceived by engineers and technical workers, fraction of safety problems investigated, thoroughness of investigation process, fraction of problems resulting in no action, fraction of corrective actions that only address the symptoms of the problem, fraction of corrective actions that address the systemic factors that led to the problem, waiver issuance rate, fraction of corrective actions rejected at safety review, quality of lessons learned.

Perceived Success by Management: Accumulation of successful launches, NASA recent safety history, occurrence of serious events and accidents.

Independent Technical Authority: Effectiveness and Credibility of ITA, quality and thoroughness of safety analyses, workload of ITA designees, attractiveness of being a Technical Warrant Holder (TWH), TWH resources and training, ability to attract knowledgeable trusted agents, trusted agent training adequacy, ITA influence and prestige, ability to attract highly skilled and well-respected technical leaders, ITA power and authority.

5.2 Model Analysis

Once the models were thoroughly tested, three types of analyses were performed: (1) sensitivity analyses to investigate the impact of various ITA program parameters on the system dynamics and on risk, (2) system behavior mode analyses, and (3) metrics identification and evaluation.

5.2.1 ITA Model Sensitivity Analysis

In order to investigate the effect of ITA parameters on the system-level dynamics, a 200-run Monte-Carlo sensitivity analysis was performed. Random variations representing ±30% of the baseline ITA exogenous parameter values were used in the analysis. Figures 7 and 8 show the results of the 200 individual traces for the variables ITA Effectiveness and Credibility and System Technical Risk.

The initial sensitivity analysis shows that at least two qualitatively different system behavior modes can occur. The first behavior mode (behavior mode #1 in Figure 7) is representative of a successful ITA program implementation where risk is adequately mitigated for a relatively long period of time (behavior mode #1 in Figure 8). More than 75% of the runs fall in that category. The second behavior mode (behavior mode #2 in Figure 7) is representative of a rapid rise and then collapse in ITA effectiveness associated with an unsuccessful ITA program implementation. In this mode, risk increases rapidly, resulting in frequent hazardous events (serious incidents) and accidents (behavior mode #2 in Figure 8).
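The mechanics of the Monte-Carlo sensitivity analysis described in Section 5.2.1 can be illustrated with a minimal sketch. The paper's ITA model and its parameter values are not reproduced here, so everything below is an assumption made for illustration: a hypothetical one-stock toy model with a tipping point stands in for the full model, and the names `BASELINE`, `simulate`, `sample_params`, and `run_sensitivity` are invented. Only the procedure (200 runs, each exogenous parameter drawn uniformly within ±30% of its baseline, runs sorted into two behavior modes) follows the text.

```python
import random

# Hypothetical baseline values; the paper does not publish its parameters.
BASELINE = {
    "initial_effectiveness": 0.5,  # starting effectiveness and credibility (0..1)
    "tipping_point": 0.4,          # level below which the loop drives a collapse
    "gain": 0.3,                   # strength of the reinforcing loop per month
}


def simulate(params, months=600):
    """Euler-integrate a one-stock toy model with a tipping point."""
    e = params["initial_effectiveness"]
    trace = [e]
    for _ in range(months):
        # Reinforcing loop: above the tipping point effectiveness feeds on
        # itself; below it, the same loop pushes the stock toward zero.
        de = params["gain"] * e * (1.0 - e) * (e - params["tipping_point"])
        e = min(1.0, max(0.0, e + de))
        trace.append(e)
    return trace


def sample_params(rng, spread=0.30):
    """Draw each parameter uniformly within +/- spread of its baseline."""
    return {
        name: value * rng.uniform(1.0 - spread, 1.0 + spread)
        for name, value in BASELINE.items()
    }


def run_sensitivity(n_runs=200, seed=0):
    """Run the Monte-Carlo sweep and label each run with a behavior mode."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        params = sample_params(rng)
        final = simulate(params)[-1]
        # Mode 1: the stock ends above its tipping point (sustained success).
        # Mode 2: the stock ends below it (collapse).
        mode = 1 if final > params["tipping_point"] else 2
        runs.append({"params": params, "final": final, "mode": mode})
    return runs


if __name__ == "__main__":
    runs = run_sensitivity()
    share_mode_1 = sum(r["mode"] == 1 for r in runs) / len(runs)
    print(f"{len(runs)} runs, share in mode 1: {share_mode_1:.2f}")
```

In this toy model the mode of a run is decided entirely by whether the sampled starting level lies above the sampled tipping point, which is far simpler than the interacting subsystem dynamics the paper describes; the share of runs in each mode printed here therefore says nothing about the paper's reported split.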


Figure 7: Sensitivity Results for Effectiveness and Credibility of ITA

Figure 8: Sensitivity Results for System Technical Risk

5.2.2 System Behavior Mode Analysis

Because the results of the initial ITA sensitivity analysis showed two qualitatively different behavior modes, we performed detailed analysis of each to better understand the parameters involved. Using this information, we were able to identify some potential metrics and indicators of increasing risk as well as potential risk mitigation strategies.

Behavior Mode #1: Successful ITA Implementation: Behavior mode 1, successful ITA program implementation, includes a short-term initial transient where all runs quickly reach the maximum Effectiveness and Credibility of ITA. This behavior is representative of the initial excitement phase, where the ITA is implemented and shows great promise to reduce the level of risk. After a period of very high success, the Effectiveness and Credibility of ITA slowly starts to decline. This decline is mainly due to the effects of complacency: the quality of safety analyses starts to erode as the program is highly successful and safety is increasingly seen as a solved problem. When this decline occurs, resources are reallocated to more urgent performance-related matters and safety efforts start to suffer.

In this behavior mode, the Effectiveness and Credibility of ITA declines, then stabilizes and follows the Quality of Safety Analyses coming from the System Safety Efforts and Efficacy model. A discontinuity occurs around month 850 (denoted by the arrow on the x-axis of Figure 9), when a serious incident or accident shocks the system despite sustained efforts by the TA and TWHs (at this point of the system lifecycle, time-related parameters such as vehicle and infrastructure aging and deterioration create problems that are difficult to eliminate).

Figure 9 shows normalized key variables of a sample simulation representative of behavior mode #1, where the ITA program implementation is successful in providing effective risk management throughout the system lifecycle. This behavior mode is characterized by an extended period of nearly steady-state equilibrium where risk remains at very low levels.

Figure 9: Key Variables for Behavior Mode #1

Behavior Mode #2: Unsuccessful ITA Implementation: In the second behavior mode (behavior mode #2 in Figure 7), Effectiveness and Credibility of ITA increases in the initial transient, then quickly starts to decline and eventually reaches bottom. This behavior mode represents cases where a combination of parameters (insufficient resources, support, staff…) creates conditions where the ITA structure is unable to have a sustained effect on the system. As ITA decline reaches a tipping point, the reinforcing dynamics act in the negative direction and the system migrates toward a high-risk state where accidents and serious incidents occur frequently (at the arrows on the x-axis in Figure 10).

The key normalized variables for a sample simulation run representative of the second behavior mode are shown in Figure 10. This behavior mode represents an unsuccessful implementation of the ITA program. As risk increases,


accidents start to occur and create shock changes in the system. Safety is increasingly perceived as an urgent problem and more resources are allocated for safety analyses, which increases System Safety Efforts and Efficacy, but by this point the TA and TWHs have lost so much credibility that they are not able to significantly contribute to risk mitigation anymore. As a result, risk increases dramatically, and the ITA personnel and safety staff become overwhelmed with safety problems and start to issue a large number of waivers in order to continue flying. This behavior mode includes many discontinuities created by the frequent hazardous events and accidents and provides much useful information for selection of metrics to measure the effectiveness of ITA and to provide early indication of the system migrating toward a state of increased risk.

Figure 11: Risk and Requirement Waivers Accumulation

Figure 12: Risk and Incidents/Problems under Investigation
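
As a rough illustration of the second behavior mode described above, the following Python sketch integrates a toy stock-and-flow model in which ITA credibility is governed by a reinforcing loop with a tipping point, risk follows credibility with a delay, and waivers accumulate only once risk is already high. This is not the model used in the paper: the function names, the `resource_adequacy` parameter, the initial values, and every rate constant are assumptions chosen only to reproduce the qualitative shape of the behavior.

```python
def simulate_behavior_mode(resource_adequacy, months=200):
    """Toy stock-and-flow run; every constant here is an illustrative assumption."""
    credibility = 0.5  # normalized ITA effectiveness and credibility (0..1)
    risk = 0.2         # normalized system risk (0..1)
    waivers = 0.0      # accumulated requirement waivers (arbitrary units)
    # Assumed relationship: better resourcing lowers the tipping point.
    tipping_point = 1.0 - resource_adequacy
    history = {"credibility": [], "risk": [], "waivers": []}
    for _ in range(months):
        # Reinforcing loop: credibility grows above the tipping point
        # and collapses below it.
        d_cred = 0.5 * credibility * (1.0 - credibility) * (credibility - tipping_point)
        # Risk relaxes toward (1 - credibility) with a 12-month delay.
        d_risk = ((1.0 - credibility) - risk) / 12.0
        # Waivers are issued only once risk is already above 0.5.
        d_waivers = 2.0 * max(0.0, risk - 0.5)
        credibility += d_cred
        risk += d_risk
        waivers += d_waivers
        history["credibility"].append(credibility)
        history["risk"].append(risk)
        history["waivers"].append(waivers)
    return history


def first_month(series, predicate):
    """Return the first month index at which predicate(value) holds, else None."""
    for month, value in enumerate(series):
        if predicate(value):
            return month
    return None


success = simulate_behavior_mode(resource_adequacy=0.8)
failure = simulate_behavior_mode(resource_adequacy=0.3)

# Compare when each candidate indicator first signals trouble in the failing run.
cred_warning = first_month(failure["credibility"], lambda c: c < 0.4)
waiver_warning = first_month(failure["waivers"], lambda w: w > 1.0)
print("final risk (well resourced):", round(success["risk"][-1], 3))
print("final risk (under resourced):", round(failure["risk"][-1], 3))
print("credibility warning month:", cred_warning)
print("waiver warning month:", waiver_warning)
```

In this sketch the under-resourced run drifts to a high-risk state while the well-resourced run does not, and the waiver count in the failing run starts to move only after credibility has already dropped. The thresholds used to define a "warning" (0.4 and 1.0) are arbitrary and would need to be calibrated against a real model.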

Figure 10: Key Variables for Behavior Mode #2

5.2.3 Metrics Identification and Evaluation

Our models indicate that many good indicators of increasing risk are available. However, many of these indicators become useful only after a significant risk increase has occurred, i.e., they are lagging rather than leading indicators. The requirements waiver accumulation pattern, for example, is a good indicator, but only becomes significant when risk starts to rapidly increase (Figure 11), thus casting doubt on its usefulness as an effective early warning.

Alternatively, the number of incidents/problems under ITA investigation appears to be a more responsive measure of the system heading toward a state of higher risk (see Figure 12). A large number of incidents under investigation results in a high workload for trusted agents, who are already busy working on project-related tasks. Initially, the dynamics are balancing, as ITA personnel are able to increase their incident investigation rate to accommodate the increased investigation requirements.

As the investigation requirements become higher, corners may be cut to compensate, resulting in lower quality investigation resolutions and less effective corrective actions. If investigation requirements continue to increase, the TWHs and trusted agents become saturated and simply cannot attend to each investigation in a timely manner. A bottleneck effect is created that makes things worse through a fast acting, negative-polarity reinforcing loop (see Figure 13). This potential bottleneck points to the utility of more distributed technical decision-making.

Using the number of problems being worked is not without its own limitations. For a variety of reasons, the technical warrant holders may simply not be getting information about existing problems. Independent metrics (e.g., using the PRACA database) may have increased accuracy here. It is unlikely that a single metric will provide the information required; a combination of complementary metrics is almost surely going to be required.

Because of its deep structural impact on the system, the health of ITA may be the most effective early indicator of increasing risk. There is a high correlation between the Effectiveness and Credibility of ITA and the location of the tipping point at which risk starts to rapidly increase. However, because the Effectiveness and Credibility of ITA cannot be measured directly, we must seek proxy measures of ITA health. One of the most promising leading indicators of ITA health is the ability to continually recruit the
“best of the best”. Employees in the organization have an acute perception of which assignments are regarded as prestigious and important. As long as ITA is able to retain its influence, prestige, power and credibility, it should be able to attract the best, highly experienced technical personnel with leadership qualities. By monitoring the quality of ITA personnel (Technical Warrant Holders and Trusted Agents) over time along with turnover and job application data, it should be possible to have a good indication of ITA health and to correct the situation before risk starts to increase rapidly.

Figure 13: Balancing Loop Becomes Reinforcing as ITA Workload Keeps Increasing (causal loop diagram of the "Investigation Bottleneck" loop, marked "B, then R"; variables: Project Work Requirement, ITA Workload, ITA Investigation Capacity, Incidents under ITA Investigation, Incident Resolution Rate, Incoming Incidents and Problems requiring Investigation)

In addition to using system dynamics as a tool to identify and evaluate metrics and leading indicators of safety drift, we used the models to perform a first-order assessment of the risks identified in the previous steps of the STAMP-Based Risk Analysis Process. The list of risks identified included 250 items, approximately 75% of which were related to variables in the system dynamics models. This correlation between risks and model variables allowed us to prioritize risks according to their sensitivity to other model parameters. In order to determine the sensitivity of specific variables, a sensitivity analysis simulation was performed that covered a range of cases, including cases where the ITA is highly successful and self-sustained, and cases where the ITA quickly loses its effectiveness. A Low, Medium or High sensitivity rating was assigned depending on the normalized variation percentage of specific model variables during the sensitivity analysis. Figure 7 provides an example of a variable with a high sensitivity to model parameters due to the reinforcing ITA dynamics described above. Figure 14 provides an example of a variable with lower sensitivity to model parameters. This variable provides a measure of the shuttle age relative to its design lifetime. The effective shuttle age is higher at the end of the system lifecycle if the ITA program is successful because risk has been effectively mitigated throughout the system lifecycle, resulting in higher launch rates, especially in the second half of the system life.

Figure 14: Variable Exhibiting Low Sensitivity to Model Parameters (y-axis: Relative Effective Shuttle Age, 0 to 4; x-axis: Time (Month), 0 to 1000)

6 CONCLUSION: USING SYSTEM DYNAMICS IN SAFETY AND RISK ANALYSIS

The objective of the system dynamics part of a STAMP-Based Risk Analysis is to provide insight into the dynamic reasons for the adaptation of the safety control structure, or the drift toward an unsafe system state. Once these adaptation mechanisms are better understood, the system dynamics models can be used to help design monitoring systems that will act as a virtual “canary-in-the-mine”, alerting decision-makers that the system has reached, or is heading toward, an unsafe state.

During the ITA risk analysis process, the system dynamics models contributed many insights that may not have been identified using the static safety control structure alone. For example, the requirements waiver accumulation is often considered a sign of risk increase, but the dynamic modeling provided hints that it may be a lagging indicator with limited effectiveness for early warning. The analysis provided many other candidate indicators (incidents under investigation, quality of ITA designees, …) that could be used as leading indicators of safety drift. More work will be required to evaluate the effectiveness of these indicators on the real system, but the model analysis pointed to candidate indicators that may not have been identified otherwise.

Another interesting insight from the system dynamics analysis is that the performance of ITA is highly dependent on the quality of safety analyses produced by system safety employees at NASA and contractor offices.

In addition, while Technical Warrant Holders (TWHs) are shielded in the design of the ITA from programmatic budget and schedule pressures through independent management chains and budgets, Trusted Agents are not. They have dual responsibility for working both on the project and on TWH assignments, which can lead to obvious conflicts. Good information is key to good decision-making. Having that information produced by employees not under
the ITA umbrella reduces the effective independence of ITA. In addition to conflicts of interest, increases in Trusted Agent workload due either to project and/or TWH assignments or to other programmatic pressures can reduce their sensitivity to safety problems. A long list of similar insights was generated from the system dynamics part of the STAMP risk analysis.

The results of our analysis are very encouraging and illustrate the potential for a STAMP-based risk analysis process augmented with system dynamics modeling to significantly improve risk management in complex socio-technical systems.

7 FUTURE WORK

While the process for generating the safety control structure part of the STAMP risk analysis is mature and well documented, the process for creating and using system dynamics models based on it currently requires much effort and domain expertise. Future work will address the creation of system dynamics models based on the STAMP safety control structure and existing system safety archetypes. The model validation (or confidence increase) and insight generation processes will also be addressed in greater detail.

ACKNOWLEDGMENTS

This research was partially supported by a grant from the USRA Center for Program/Project Management Research (CPMR), which is funded by NASA APPL and by the ITA Program within the NASA Chief Engineer’s Office.

REFERENCES

Branscome, D.R. (Chair). 1999. WIRE Mishap investigation board report, NASA.
Gehman, Harold (Chair). 2003. Columbia accident investigation report.
Leveson, Nancy. 2004. A new accident model for engineering safer systems. Safety Science 42 (4): 237–270.
Link, D.C.R. 2000. Report of the Huygens communications system inquiry board, NASA.
McCurdy, Howard. 1994. Inside NASA: High technology and organizational change in the U.S. space program. Johns Hopkins University Press.
McDonald, Harry. 2000. Shuttle independent assessment team (SIAT) report.
Morecroft, J.D.W. 1985. Rationality in the analysis of behavioral simulation models. Management Science 31 (7): 900–916.
NASA/ESA investigation board. 1998. SOHO Mission Interruption, NASA.
Stephenson, A. (Chair). 1999. Mars Climate Orbiter mishap investigation board report, NASA.
Sterman, John. 2000. Business Dynamics: Systems thinking and modeling for a complex world, McGraw-Hill.
Young, Tom (Chair). 2000. Mars program independent investigation board report, NASA.

AUTHOR BIOGRAPHIES

NICOLAS DULAC is a doctoral candidate in the department of Aeronautics and Astronautics at MIT. His current research interests include system safety, system engineering, visualization of complex systems, hazard analysis in socio-technical systems, organizational safety culture, and dynamic risk analysis.

NANCY LEVESON is Professor of Aeronautics and Astronautics and also Professor of Engineering Systems at MIT. She is a member of the National Academy of Engineering. Her research interests include system engineering, system safety, human-computer interaction and software engineering.

DAVID ZIPKIN is a graduate of MIT’s Technology and Policy (TPP) Program.

STEPHEN FRIEDENTHAL is a student in MIT’s System Design and Management (SDM) Program.

JOEL CUTCHER-GERSHENFELD is a Senior Research Scientist in MIT's Sloan School of Management and Executive Director of MIT's Engineering Systems Learning Center.

JOHN S. CARROLL is Professor of Behavioral and Policy Sciences at the MIT Sloan School of Management and the MIT Engineering Systems Division. His research focuses on the relationships among individual and group decision making and communication, and organizational learning, change, and culture.

BETTY BARRETT is a research scientist and associate director of the Engineering Systems Learning Center in the Engineering Systems Division at MIT.