Integrating Testing With Reliability
SUMMARY
Software testing and reliability activities are integrated to demonstrate how the two interact in
achieving testing efficiency and the reliability that results from the tests.
Integrating means modeling the execution of a variety of tests on a directed graph representation of an
example program. A complexity metric is used to construct the nodes, edges, and paths of the example
program. Models are developed to represent the efficiency and achieved reliability of black box and white
box tests. Evaluations are made of path, independent path, node, program construct, and random tests
to ascertain which, if any, is superior with respect to efficiency and reliability. Overall, path testing has
the edge in test efficiency. The results depend on the nature of the directed graph in relation to the type
of test. Although there is no dominant method, in most cases the tests that provide detailed coverage are
better. For example, path testing discovers more faults than independent path testing. Predictions are
made of the reliability and fault correction that results from implementing various test strategies. It is
believed that these methods can be used by researchers and practitioners to evaluate the efficiency and
reliability of other programs.
KEY WORDS: test efficiency; software reliability; modeling efficiency and reliability
1. INTRODUCTION
Software is a complex intellectual product. Inevitably, some errors are made during requirements
formulation as well as during designing, coding, and testing the product. State-of-the-practice software
development processes to achieve high-quality software include measures that are intended
to discover and correct faults resulting from these errors, including reviews, audits, screening by
language-dependent tools, and several levels of tests. Managing these errors involves describing,
classifying, and modeling the effects of the remaining faults in the delivered product and thereby
helping to reduce their number and criticality [1].
One approach to achieving high-quality software is to investigate the relationship between testing
and reliability. Thus, the problem that this research addresses is the comprehensive integration of
testing and reliability methodologies. Although other researchers have addressed bits and pieces of
the relationship between testing and reliability, it is believed that this is the first research to integrate
testing efficiency, the reliability resulting from tests, modeling the execution of tests with directed
graphs, using complexity metrics to represent the graphs, and evaluations of path, independent path,
node, random node, white box, and black box tests.
One of the reasons for advocating the integration of testing with reliability is that, as recommended
by Hamlet [2], the risk of using software can be assessed based on reliability information. He states
that the primary goal of testing should be to measure the reliability of tested software. Therefore,
it is undesirable to consider testing and reliability prediction as disjoint activities.
When integrating testing and reliability, it is important to know when there has been enough
testing to achieve reliability goals. Thus, determining when to stop a test is an important management
decision. Several stopping criteria have been proposed, including the probability that the software
has a desired reliability and the expected cost of remaining faults [3]. The probabilities associated
with path and node testing in a directed graph are used to estimate how closely the desired reliability
of 1.0 can be approached. To address the cost issue, the cost of remaining faults is estimated
explicitly in monetary units and implicitly by the number of remaining faults compared with the
total number of faults in the directed graph of a program.
Given that it cannot be shown that there are no more errors in the program, use heuristic arguments
based on thoroughness and sophistication of testing effort and trends in the resulting discovery
of faults to argue the plausibility of the lower risk of remaining faults [4]. The progress in fault
discovery and removal is used as a heuristic metric for judging when testing is ‘complete’. At each
stage of testing, reliability is estimated to gauge the efficiency of the various testing methods: path,
independent path, random path, node, and random node.
A further complication involves the dynamic nature of programs. If a failure occurs during
preliminary testing and the code is changed, the software may now work for a test case that did
not work previously. But the code’s behavior on preliminary testing can no longer be guaranteed.
To account for this possibility, testing should be restarted. The expense of doing this is often
prohibitive [6]. It would be possible to model this effect, but at the cost of unmanageable model
complexity engendered by restarting the testing. It appears that this effect would best be modeled
by simulation.
The analysis starts with the notations that are used in the integrated testing and reliability
approach to achieving high-quality software. Refer to these notations when reading the equations
and analyses.
1.2.2.2. Probabilities.
p( j): probability of traversing path j;
p(n): probability of traversing node n.
1.2.2.4. Reliabilities.
R_n: empirical reliability at node n prior to fault removal;
R_p: empirical reliability of the program prior to fault removal;
U_n: empirical unreliability at node n prior to fault removal;
U(j): empirical unreliability on path j prior to fault removal;
R(j): empirical reliability of path j prior to fault removal;
R(c, k): empirical reliability achieved by testing construct c during test k, after fault removal;
r_e(n): empirical number of remaining faults at node n prior to fault removal.
2. TEST STRATEGIES
There are two major types of tests, white box and black box, each with its own type of test
case [7]:
White box testing is based on the knowledge about the internal structure of the software under test
(e.g. knowledge of the structure of decision statements). The adequacy of test cases is assessed
in terms of the level of coverage of the structure they reach (e.g. comprehensiveness of covering
nodes, edges, and paths in a directed graph) [8].
White box test case: A set of test inputs, execution conditions, and expected results developed for
a particular objective, such as exercising a particular program path or verifying compliance with a
specific requirement. For example, exercise particular program paths with the objective of achieving
high reliability by discovering multiple faults on these paths.
In black box testing, it may be easier to derive tests at higher levels of abstraction. Information
about the final implementation is then introduced in stages, so that the additional tests required by
increased knowledge of the structure arrive in small, manageable amounts; this greatly simplifies
structural, or white box, testing. However, it is not clear whether black box testing (e.g. testing
If Then Else statements) preceding or following white box testing (e.g. identifying If Then Else
paths) affects test effectiveness.
Black box test case: Specifications of inputs, predicted results, and a set of execution conditions
for a test item. In addition, because only the functionality of the software is of concern in black box
testing, this testing method emphasizes executing functions and examining their input and output
data [9]. For example, specify functions to force the execution of program constructs (e.g. While
Do) that are expected to cause an entire set of faults to be encountered and removed. No inputs
and outputs are specified because the example C++ program executes paths independently of its
inputs; path execution depends only on function probabilities. The inputs specify parameters and
variables used in program computations, not path execution probabilities, and the program produces
a standard output that depends only on the function probabilities.
A variant of black box testing captures the values of and changes in variables during a test. In
regression testing, for example, this approach can be used to determine whether a program, modified
by removing faults, behaves correctly [10]. However, in this research, rather than observing variable
changes, black box testing is conducted by observing the results of executing decision statements
and determining whether the correct decision is made.
3. TESTING PROCESS
Four types of testing are used, as described below. Path testing involves visiting both nodes and
paths in testing a program, whereas node testing involves only testing nodes. For example, in
testing an If Then Else construct, the If Then and Else components are visited in path testing,
whereas in node testing, only the If component is visited. Recognize the limitations of using
a directed graph for the purpose of achieving complete test coverage. For example, although it
is possible to represent initial conditions and boundary conditions [11] in a directed graph, the
amount of detail involved could make its use unwieldy. Therefore, it is better to represent only
the decision and sequence constructs in the graph. However, this is not a significant limitation
because the decision constructs account for the majority of complexity in most programs and high
complexity leads to low reliability. For illustrative purposes, a short program is used in Figure 1. This
program may appear to be simple. In fact, it is complex because of the iterative decision constructs.
Of course, only short programs are amenable to manual analysis. However, McCabe and associates
developed tools for converting program language representations to directed graphs for large
programs [12].
The following is an outline of the characteristics of the various testing schemes that were consid-
ered. First, identify program constructs: If Then, If Then Else, While Do, and Sequence in the
program to be tested. Then perform the following white box tests.
In path testing, it is desired to distinguish the independent paths from the non-independent paths.
Therefore, as the McCabe complexity metric [13] represents the number of independent paths to
test, faults are randomly planted at nodes of a directed graph that is constructed with edges and
nodes based on this metric. This process provides a random number of faults that are encountered
as each path is traversed. Note that in path testing the selection of paths is pre-determined.
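To make the fault-planting step concrete, the following minimal C++ sketch seeds a random number of faults at each of the eight nodes of the example graph. The generator, seed, and per-node fault range are illustrative assumptions, not the exact procedure used to produce Figure 2.

    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const int numNodes = 8;                           // nodes n = 1..8, as in Figure 1
        std::mt19937 gen(42);                             // fixed seed so the run is repeatable
        std::uniform_int_distribution<int> faults(1, 7);  // assumed per-node fault range

        std::vector<int> f(numNodes + 1, 0);              // f[n] = faults planted at node n
        int nf = 0;                                       // n_f: total faults in the program
        for (int n = 1; n <= numNodes; ++n) {
            f[n] = faults(gen);
            nf += f[n];
            std::printf("node %d: %d faults\n", n, f[n]);
        }
        std::printf("total faults n_f = %d\n", nf);
        return 0;
    }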
[Figure 1. Directed graph of the example program: nodes n = 1, ..., 8, edges i = 1, ..., 10, and the program constructs (If Then Else, While Do, If Then, and sequence) from the start node to node n = 8.]
As opposed to path testing, which uses pre-determined paths, random path testing produces a random
selection of paths. Thus, using the directed graph based on the McCabe metric, a random selection
of path execution sequences and the same random distribution of faults at nodes as in path testing,
a different sequence of fault encounters at the nodes will occur, compared with path testing.
Using the directed graph based on the McCabe metric and the same distribution of faults as before,
node testing randomly encounters faults as only the nodes are visited.
Using the directed graph based on the McCabe metric and a different random distribution of faults
than is used in the other tests, random node testing encounters a different set of faults, compared
with node testing. A different random distribution of faults is used because otherwise the same
result would be achieved as in node testing.
After the four types of white box tests have been conducted, perform the following steps:
Conduct black box testing: force function execution and observe resulting fault encounters and
removals [9].
Conduct white box testing: observe the response of a program to path and node testing [9].
Make reliability predictions using Schneidewind single parameter model (SSPM) [14], with
randomly generated fault data. Faults are generated randomly so that there will be no bias in the
fault distribution. Therefore, the fault distribution is not intended to be representative of a particular
environment. Rather, it is designed to be generic.
Predict the number of remaining faults and reliability with SSPM and compare with the empirical
test values.
Compare reliability predictions with results of black box and white box testing.
4. ASSUMPTIONS
Recognize that the following assumptions impose limitations on the integrated testing and relia-
bility approach. However, all models are abstractions of reality. Therefore, the assumptions do not
significantly detract from addressing the research questions below.
When faults are discovered and removed, no new faults are introduced. This assumption overstates
the reliability resulting from the various testing methods, but its effect will be experienced by all
the testing methods.
The probability of traversing a node is independent of the probability of traversing other nodes.
This is the case in the directed graph, which is used in the example program. It would not be the
case in all programs.
No faults are removed until a given test is complete. Therefore, as path testing visits some of
the same nodes on different tests, the expected number of faults encountered can exceed the actual
number of faults.
5. RESEARCH QUESTIONS
The following questions seem to be important in evaluating the efficiency of test methods and their
integration with reliability:
1. Does an independent path testing strategy lead to higher efficiency and reliability than path
and random path testing?
2. Does a node testing strategy lead to higher reliability and efficiency than random node testing?
3. Does the McCabe complexity metric [13] assist in organizing software tests?
4. Which of the testing strategies yields the highest reliability prior to or after fault removal?
5. Do reliability metrics, using SSPM, produce more accurate reliability assessments than node
and random node testing?
6. Which testing method, white box or black box, provides more efficient testing and higher
reliability?
The following equations are used to implement the testing strategies and reliability predictions.
The expected number of faults encountered at node n is given by the following equation:

E(n) = p(n) f(n)    (1)

where p(n) is determined by the branch probabilities in Figure 1 and f(n) is determined by
randomly generating the number of faults at each node.
The probability of traversing path j is given by the following equation:

p(j) = \prod_{n=1}^{n_{n_j}} p(n)    (2)

where n_{n_j} is the number of nodes on path j. Then, using Equations (1) and (2) yields the
following equation for the expected number of faults encountered on path j:

E(j) = \left[ \sum_{n=1}^{n_{n_j}} p(n) f(n) \right] p(j)    (3)

Furthermore, summing Equation (3) over the number of paths n_j in a program yields the expected
number of faults in a program, based on path testing, in the following equation:

E_p = \sum_{j=1}^{n_j} E(j)    (4)

According to Equation (1), the empirical reliability at node n prior to fault removal is given in the
following equation:

R_n = 1 - \frac{p(n) f(n)}{\sum_{n=1}^{n_n} p(n) f(n)}    (5)
Now, the empirical unreliability at node n, according to Equation (5), is given by the following
equation:

U_n = 1 - R_n    (6)

Then, using Equations (5) and (6), the unreliability on path j prior to fault removal is given by
the following equation:

U(j) = [p(j)] \sum_{n=1}^{n_{n_j}} U_n = [p(j)] \sum_{n=1}^{n_{n_j}} \frac{p(n) f(n)}{\sum_{n=1}^{n_n} p(n) f(n)}    (7)

Then, using Equation (7), the reliability of path j prior to fault removal is given by the following
equation:

R(j) = 1 - U(j)    (8)

Finally, the reliability of the program R_p is limited by the minimum of the path reliabilities computed
in Equation (8). Thus, the following equation is obtained:

R_p = \min_j R(j)    (9)

Continuing the analysis, the empirical number of remaining faults at node n, prior to fault removal,
is found according to the following equation:

r_e(n) = n_f - \sum_{n=1}^{n_n} p(n) f(n)    (10)
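To make Equations (1)–(10) concrete, the following minimal C++ sketch computes them for a small hypothetical graph; the node probabilities, fault counts, path definitions, and fault total are illustrative assumptions, not the data of Figures 1 and 2.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // Assumed data: p[n] = probability of traversing node n, f[n] = faults at node n.
        std::vector<double> p = {0.6, 0.4, 0.5, 1.0};
        std::vector<double> f = {3.0, 2.0, 4.0, 1.0};
        std::vector<std::vector<int>> paths = {{0, 2, 3}, {1, 2, 3}};  // two assumed paths
        const double nf = 10.0;                       // n_f: total faults (assumed)

        double sumPF = 0.0;                           // denominator of Eqs (5) and (7)
        for (std::size_t n = 0; n < p.size(); ++n) sumPF += p[n] * f[n];

        double Ep = 0.0, Rp = 1.0;
        for (const auto& path : paths) {
            double pj = 1.0, pathPF = 0.0;
            for (int n : path) { pj *= p[n]; pathPF += p[n] * f[n]; }  // Eqs (1), (2)
            double Ej = pathPF * pj;                  // Eq (3): expected faults on path j
            Ep += Ej;                                 // Eq (4): accumulate over paths
            double Uj = 0.0;
            for (int n : path) Uj += pj * (p[n] * f[n] / sumPF);       // Eq (7)
            double Rj = 1.0 - Uj;                     // Eq (8)
            Rp = std::min(Rp, Rj);                    // Eq (9): weakest path bounds R_p
            std::printf("p(j)=%.3f E(j)=%.3f R(j)=%.3f\n", pj, Ej, Rj);
        }
        double re = nf - sumPF;                       // Eq (10): remaining faults
        std::printf("E_p=%.3f R_p=%.3f r_e=%.3f\n", Ep, Rp, re);
        for (std::size_t n = 0; n < p.size(); ++n)    // Eqs (5), (6) per node
            std::printf("node %zu: R_n=%.3f\n", n + 1, 1.0 - p[n] * f[n] / sumPF);
        return 0;
    }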
To obtain an operational profile of how program space was used, Horgan and colleagues [15]
identified the possible functions of their program and generated a graph capturing the connectivity
of these functions. Each node in the graph represented a function. Two nodes, A and B, were
connected if control could flow from function A to function B. There was a unique start and end
node representing functions at which execution began and was terminated, respectively. A path
through the graph from the start node to the end node represents one possible program execution.
In 1976, Thomas McCabe proposed a complexity metric based on the idea of the directed graph
as a representation of the complexity of a program. The directed graph can be based on functions,
as in the case of Horgan’s approach, or program statements that are used in this paper. McCabe
proposed that his metric be a basis for developing a testing strategy [13]. The McCabe complexity
metric is used as the basis for constructing the example directed graph that is used to illustrate the
integration of testing and reliability [16]. There are various definitions of this metric. The one that
is used here is given in the following equation [16]:

M_c = (n_e - n_n) + 1    (11)
[Figure 2. Directed graph of the example program with the number of faults planted at each node. Bold = number of faults planted for path testing (24 faults total). Italics = number of faults planted for random node testing (27 faults total).]
Here, n_n = -M_c + (n_e + 1), where n_n is the number of nodes representing program statements
(e.g. If Then Else) and conditions (e.g. Then, Else), and n_e is the number of edges representing
program control flow transitions, as depicted in Figure 1.
This definition is convenient to use for the testing strategy because it corresponds to the number
of independent paths and the number of independent circuits (‘window panes’) in a directed graph.
See Figures 1 and 2. Strategy means that paths are emphasized in the test plan.
The approach is to specify M_c and n_e and compute n_n from Equation (11). Then, knowing the
number of edges and nodes in a directed graph for a given complexity, the information is in hand
to represent a program. In the case of the While Do construct, only one iteration is counted in
computing M_c.
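As a check on Equation (11): reading the counts from Figure 1, the example program has n_e = 10 edges and n_n = 8 nodes, so M_c = (10 - 8) + 1 = 3, which matches the three independent paths identified later in Table I.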
The directed graph of the program shown in Figure 1 is based on a C++ program that was
written for a software reliability model [17]. The program has 420 C++ statements and computes
the cumulative number of failures, actual reliability, predicted reliability, rate of change of
predicted reliability, mean number of failures, fault correction rate, mean fault correction rate, fault
correction delay, prediction errors, and maximum likelihood parameter estimation. The directed
graph represents the decision logic of the model.
8. TESTING STRATEGIES
As pointed out by Voas and McGraw, some people erroneously believe that testing involves only
software. In fact, according to them, testing also involves specifications [18]. This research supports
their view; in fact, the While Do loop in Figure 1 represents the condition for being able to make
reliability predictions if the specified boundary condition on the equations is satisfied in the C++
program. Therefore, in all of the following testing strategies that are implemented in the example
program, it is implicit that testing encompasses both specifications and software.
In black box testing, the tester is unconcerned with the internals of the program being tested.
Rather, the tester is interested in how the program behaves according to its specifications. Test data
are derived from its specifications [9]. In contrast, in white box testing the tester is interested in how
the program will behave according to its internal logic. Test data are derived based on the internal
workings of the program [9]. Some authors propose that black and white box testing be integrated
in order to improve test efficiency and reduce test cost [19]. This would not be wise because each
strategy has objectives that are fundamentally different.
In white box testing, based on the test data, a program can be forced to follow certain paths.
Research indicates that applying one or more white box testing methods in conjunction with func-
tional testing can increase program reliability when the following two-step procedure is used:
1. Evaluate the adequacy of test data constructed using functional testing.
2. Enhance these test data to satisfy one or more criteria provided by white box testing
methods [20].
In this research, the approach adapts step 2: test data are generated to satisfy path coverage criteria
in order to find additional faults.
In black box testing the program is forced to execute constructs (e.g. While Do) that are associated
with the functional specifications. For example, continue to compute functions while there is input
data. In white box testing, test nodes and paths are associated with the detailed logic and functions of
the program. For example, a path would involve computing the error in predicting reliability metrics.
Related to black box testing is the concept of the operational profile, wherein the functions of
the program, the occurrence rates of the functions, and the occurrence probabilities are listed [21].
Here the functions are the program constructs (e.g. If Then Else), as shown in Figure 1. In the
example program, the occurrence rates of all constructs are 100%. Thus, rather than using
occurrence rates, the importance of the constructs is more relevant (e.g. While Do is more important
than If Then).
In the study of system functionality reported in [7], the authors began with the assumption that
coding errors tend to be regional. Analysis of the results of testing the 53 system tasks within the
six functional categories supported this assumption. The data indicate that tasks and categories
with high execution quantities had more field deficiencies. These tasks and categories were more
complex, containing a broader range of functions made possible through additional lines of code.
Owing to this complexity, these areas were more susceptible to errors.
These results suggest that there should be a focus in the testing effort on complex high payoff
areas of a program such as the While Do construct and associated constructs (see Figure 2), where
there is a concentration of faults. These constructs experience high probabilities of path execution
(i.e. ‘execution quantities’).
According to AT&T [22], the earlier a problem is discovered, the easier and less expensive it is to
fix, making software development more cost-effective. AT&T uses a ‘break it early’ strategy. The
use of independent path testing attempts to implement this approach because it is comparatively
easy and quick to expose the faults of these tests, with the expectation of revealing a large number
of faults early in the testing process.
As stated in [23], one of the principles of testing is the following: define test completion criteria.
The test effort has specific, quantifiable goals. Testing is completed only when the goals have been
reached (e.g. testing is complete when the tests that address 100% functional coverage of the system
have all been executed successfully). Although this is a noble goal, and it is achieved in the small
example program, it is infeasible to achieve 100% coverage in a large program with multiple iterations.
Such an approach would be unwise owing to the high cost of achieving fault removal at the margin.
Another principle stated in [23] is to verify test coverage: track the amount of functional coverage
achieved by the successful execution of each test. This principle is implemented as part of the black
box testing approach, where the discovery and removal of faults is tracked as each construct
(e.g. If Then Else) is executed (see Table II).
One metric of test effectiveness is the ratio of the number of paths traversed to the total number
of paths in the program [24]. This is a good beginning, but it is only one characteristic of an
effective metric. In addition, it is important to consider the presence of faults on the paths. This is
the approach described below.
In order to evaluate the effectiveness of testing strategies, compute the fault coverage by two
means: path coverage and edge coverage. Recall that the number of faults encountered during path
testing can exceed the actual number of faults. Therefore, path testing must take this factor into
account. Path testing efficiency is implemented by using the following equation that imposes the
constraint that the sum of faults found on paths must not exceed the number of faults in the program:
e(j) = \sum_{j=1}^{n_j} E(j) \Big/ n_f = \sum_{j=1}^{n_j} \left[ \sum_{n=1}^{n_{n_j}} p(n) f(n) \right] p(j) \Big/ n_f    (12)

for \sum_{j=1}^{n_j} E(j) \le n_f.

As long as the constraint is satisfied, path testing is efficient because no more testing is done
than is necessary to find all of the faults. However, for \sum_{j=1}^{n_j} E(j) > n_f, path testing is inefficient
because more testing is done than is necessary to find all of the faults.
For independent path testing, use Equation (12) just for the independent paths and compare the
result with that obtained using all paths.
Another metric of efficiency is \sum_{j=1}^{n_j} E(j) compared with n_f. This metric is computed using
only the independent paths. The computations are then compared to see which testing method produces
the greater fault coverage in relation to the number of faults in the program.
The final metric of testing efficiency is node testing efficiency, given in the following equation:

e(n) = \sum_{n=1}^{n_n} p(n) f(n) \Big/ n_f    (13)
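The following minimal C++ sketch computes the cumulative path testing efficiency of Equation (12) and flags the point at which the constraint is violated; the per-path values E(j) are illustrative assumptions, not the data behind Figure 3.

    #include <cstdio>
    #include <vector>

    int main() {
        // Assumed E(j) values per path from Equation (3), not the paper's data.
        std::vector<double> E = {4.2, 1.1, 3.3, 2.0, 5.5, 4.0, 4.1, 2.8};
        const double nf = 24.0;              // n_f: faults in the program (Figure 2)

        double cumE = 0.0;
        for (std::size_t j = 0; j < E.size(); ++j) {
            cumE += E[j];
            double ej = cumE / nf;           // Eq (12): cumulative efficiency after path j
            std::printf("after path %zu: e(j) = %.3f (%s)\n", j + 1, ej,
                        cumE <= nf ? "efficient" : "inefficient");
        }
        // Eq (13), node testing efficiency, is the same ratio computed over the
        // nodes instead of the paths: e(n) = sum p(n) f(n) / n_f.
        return 0;
    }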
An important point about test strategy evaluation is the effects on testing efficiency of the order of
path testing because the number of faults encountered could vary with sequence. Take the approach
in Figures 1 and 2 that path sequence is top down, testing the constructs such as If Then Else in
order.
First, note that these are limited experiments in terms of the number of examples it is feasible to use.
Therefore, there is no claim that these results can be extrapolated to the universe of the integration
of testing with reliability strategies. However, it is suggested that researchers and practitioners can
use these methods as a template for this type of research. In addition, the directed graph in the
program example is small. However, this is not a limitation of the approach because large programs
are modular (or should be). Thus, a large program can be represented by a set of directed graphs
for modules and the methods could be applied to each one.
Table I shows the results from developing the path and node connection matrix corresponding to
the directed graph in Figure 1, which shows the independent circuits and lists the paths: independent
and non-independent. In this table a ‘1’ indicates connectivity and a ‘0’ indicates no connectivity.
‘Path number’ identifies paths that are used in the plots below. A path is defined as the sequence
of nodes that are connected as indicated by a ‘1’ in the table. This table allows us to identify the
independent paths that provide a key feature of the white box testing strategy. These paths are
italicized in Table I: paths 1, 3, and 4. The definition of an independent path is that it cannot be
formed by a combination of other paths in the directed graph [13].
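A connection matrix such as Table I can be turned into the path list by a depth-first traversal from the start node to the end node. The following C++ sketch does this for a small assumed matrix (not the actual matrix of the example program), with loops such as While Do unrolled to one iteration so the graph is acyclic.

    #include <cstdio>
    #include <vector>

    const int N = 5;                       // assumed nodes 0..4; 0 = start, 4 = end
    bool adj[N][N] = {                     // a '1' indicates connectivity, as in Table I
        {0, 1, 1, 0, 0},
        {0, 0, 0, 1, 0},
        {0, 0, 0, 1, 1},
        {0, 0, 0, 0, 1},
        {0, 0, 0, 0, 0},
    };

    void dfs(int node, std::vector<int>& path) {
        path.push_back(node);
        if (node == N - 1) {               // reached the end node: report one path
            for (int n : path) std::printf("%d ", n + 1);
            std::printf("\n");
        } else {
            for (int next = 0; next < N; ++next)
                if (adj[node][next]) dfs(next, path);
        }
        path.pop_back();                   // backtrack and try the next branch
    }

    int main() {
        std::vector<int> path;
        dfs(0, path);                      // enumerate every start-to-end path
        return 0;
    }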
[Figure 3. Path testing efficiency e(j) vs test number (path) j.]
Figure 2 shows how faults are randomly seeded in the nodes of the directed graph in order to
evaluate the various tests. It also shows the minimum reliability path, which is the reliability of the
program because the reliability of the program cannot be greater than the reliability of the weakest
path. The minimum reliability path is also noted in Figure 7, where R( j) = 0.7292.
Noting that in Figure 3 paths are equivalent to the number of tests, this figure indicates that
path testing is efficient only for the first seven paths; after that there is more testing than necessary
to achieve efficiency because tests 1, . . . , 7 have removed all the faults. All independent paths are
efficient based on the fact that these paths were identified in Table I. However, Figure 4 tells another
story: Here the expected number of faults that is found in both path testing and independent path
testing is compared with the number of actual faults. Although independent path testing is efficient,
it accounts for only 35.42% of the faults in the program. This result dramatically shows that it is
unwise to rely on independent path testing alone to achieve high reliability.
In Figure 5, recognizing that the number of nodes is equivalent to the number of tests, it is seen
that, with node testing, the tests do not cover all the faults in the program (i.e. efficiency = 0.7917).
Of the three testing strategies, path testing provides the best coverage. It finds all of the faults but
at the highest cost. The best method depends on the application, with path testing advisable for
mission critical applications, and independent path and node testing appropriate for commercial
applications because of their lower cost.
Up to this point, the testing strategies have been static. That is, path testing, independent path testing,
and node testing have been conducted, considering the number of tests, but without considering
[Figure 4. The expected number of cumulative faults encountered (and removed) sum E(j) vs test number j. Independent paths j = 1, 3, and 5 find 8.500 faults, 35.42% of the n_f = 24 faults in the program; the efficient region is sum E(j) <= n_f.]
[Figure 5. Node testing cumulative expected number of faults found sum [p(n) f(n)] vs number of nodes n.]
[Figure 6. Predicted remaining failures at test time T for SSPM (MRE = 0.4510) and the Yamada S-shaped model (MRE = 0.8753), and actual remaining failures ra(T), vs test time T.]
test time. Of course, time does not stand still in testing. With each node and edge traversal, there is
an elapsed time. Now bring time into the analysis so that a software reliability model can be used
to predict the reliability of the program in the example directed graph.
There are a number of reliability predictions that can be made to answer the question ‘when
to stop testing?’ Among these are remaining failures and reliability [20], which are predicted below
using the SSPM [14]. In order to consider more than one model for the analysis, the remaining
failures were also predicted using the Yamada S-shaped model [25], and its mean relative prediction
error was compared with that of SSPM. The result was that SSPM has the lower mean relative error
(MRE = 0.4510 vs MRE = 0.8753 for Yamada), as shown in Figure 6, which compares the actual
remaining failures with the predicted values for SSPM and Yamada.
In addition, the mean number of failures in the test intervals was predicted for both models
and their MREs were compared. For SSPM the value was 0.3865 and for Yamada the value was
0.4572. Thus, because of better prediction accuracy, SSPM predictions are compared with the results
obtained with node and random node testing.
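The MRE comparison can be reproduced with a few lines of C++, assuming the conventional definition of relative error; the prediction and actual series below are illustrative assumptions, not the data behind Figure 6.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Mean relative error: average of |predicted - actual| / actual over the test times.
    double mre(const std::vector<double>& predicted, const std::vector<double>& actual) {
        double sum = 0.0;
        for (std::size_t t = 0; t < actual.size(); ++t)
            sum += std::fabs(predicted[t] - actual[t]) / actual[t];
        return sum / actual.size();
    }

    int main() {
        std::vector<double> actual = {24, 20, 15, 11, 8, 5, 3, 1};  // assumed ra(T)
        std::vector<double> sspm   = {22, 18, 16, 12, 9, 6, 3, 2};  // assumed R(T)
        std::printf("MRE = %.4f\n", mre(sspm, actual));
        return 0;
    }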
The first step in applying SSPM is to estimate the single parameter α from the randomly generated
faults present at the directed graph nodes. (The parameter α is defined as the rate of change of the
failure rate, and t is the program test time.) The faults are seeded randomly in the directed graph
using the Excel random number generator.
Now, in preparing to develop the equation for predicting the remaining failures, the cumulative
number of failures predicted to occur at test interval T is computed as follows [14]:

F(T) = \int_0^T e^{-\alpha t} \, dt = (1/\alpha)[1 - e^{-\alpha T}]    (14)
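A minimal sketch of Equation (14), with the parameter value alpha = 0.3 as an illustrative assumption rather than an estimate from the seeded fault data:

    #include <cmath>
    #include <cstdio>

    int main() {
        const double alpha = 0.3;   // assumed rate of change of the failure rate
        for (int T = 1; T <= 8; ++T) {
            // Eq (14): cumulative failures predicted to occur by test interval T
            double F = (1.0 / alpha) * (1.0 - std::exp(-alpha * T));
            std::printf("T = %d  F(T) = %.3f\n", T, F);
        }
        return 0;
    }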
For the purpose of the testing model, consider black box testing to be composed of successive tests,
each one exercising a program construct, encountering faults in the construct, and removing them.
Thus, formulate the reliability based on test k of construct c as follows:
R(c, k) = \sum_k n(c, k) \Big/ n_f    (18)
where n(c, k) is the number of faults removed on test k and n f is the number of faults in the
program in Equation (17). Thus, fault removals are accumulated with each test, until as many faults
as possible have been removed. The number of faults removed is limited by the number of faults
associated with the constructs in the program.
In addition to reliability, the efficiency of black box testing is evaluated in Equation (19) as

e(c, k) = \sum_k n(c, k) \Big/ k    (19)

The meaning of Equation (19) is that e(c, k) computes the cumulative faults removed divided by the
test number k, which is equal to the number of tests.
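The following C++ sketch traces Equations (18) and (19) over a series of black box tests; the per-test fault removals n(c, k) are illustrative assumptions, not the values of Table II.

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> removed = {8, 6, 4, 3, 2, 1};  // assumed n(c,k) for tests k = 1..6
        const double nf = 24.0;                         // n_f: faults in the program

        int cumulative = 0;
        for (std::size_t k = 1; k <= removed.size(); ++k) {
            cumulative += removed[k - 1];
            double R = cumulative / nf;                      // Eq (18): reliability after test k
            double e = static_cast<double>(cumulative) / k;  // Eq (19): efficiency after test k
            std::printf("test %zu: R(c,k) = %.3f  e(c,k) = %.2f\n", k, R, e);
        }
        return 0;
    }

With these assumed removals, R(c, k) rises toward 1.0 at a decreasing rate as the removals taper off with each successive test.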
1. Does an independent path testing strategy lead to higher efficiency and reliability than path
testing and random path testing? Independent path testing alone will not uncover nearly all of
the faults in a program. In the experiments, the results were even worse when paths were
selected randomly, because random paths can be duplicated, rendering random path testing
inefficient. This fact made it difficult to compare random testing efficiency with path testing
because not every path was tested with random testing. Instead of comparing individual path
efficiencies, the coefficients of variation for random path and path testing efficiency were
computed to gain a sense of the variation in this metric. The values are 0.5544 and 0.5680 for
random path testing and path testing, respectively. Thus, there is little to choose between them
in terms of variability of efficiency.
Note that because path testing traverses all nodes and edges, theoretically, path testing
would yield a reliability of 1.0 after fault removal; this high reliability cannot be obtained
with independent path testing and random path testing.
2. Does a node testing strategy lead to higher reliability and efficiency than random node testing?
As shown in Figure 9, node testing provides higher prediction accuracy.
3. Does the McCabe complexity metric [13] assist in organizing software tests? Yes, even though,
as has been shown, independent path testing lacks complete fault coverage. Nevertheless,
the metric is useful for identifying the major components of a program to test.
4. Which testing strategy yields the highest reliability prior to fault removal? This question is
addressed in Figure 7, which shows the superiority of path testing in early tests, with node
testing and random node testing catching up in later tests. Thus, overall, path testing is superior.
This is to be expected because path testing exercises both nodes and edges.
5. Do reliability metrics, using SSPM, produce more accurate reliability assessments than node
and random node testing? The answer for the remaining failure predictions is ‘no’, as Figure 8
demonstrates. The answer for reliability predictions is also ‘no’ as shown in Figure 9 where
node testing produces minimum error. These results reinforce the idea that testing can produce
reliability assessment accuracy that a reliability model may not be able to achieve.
6. Which testing method, white box or black box, provides more efficient testing and higher
reliability? This question is addressed in Table II, which shows the results of the black box
testing strategy. See Figure 2 to understand the fault removal process by noting how many
faults are planted at each construct. Because black box (Equation (19)) and white box testing
(Equation (13)) efficiency are computed differently, it is necessary to compare them on the
basis of cumulative faults removed, as a function of test number. When black box in Table II
is compared with path testing (i.e. white box testing) in Figure 4, it is seen that for the same
[Figure 7. Reliability obtained prior to fault removal by path testing R(j), node testing Rn, and random node testing rn vs test number (j, n). Path tests: j = 1, ..., 12; node tests: n = 1, ..., 8. Rn catches up to R(j) at n = 6; rn catches up to R(j) at n = 8; the minimum reliability path has R(j) = 0.7292.]
number of tests, black box is superior (removes more faults). The reason is that this particular
type of black box testing exercises complete program constructs, finding and removing a large
number of faults during each test.
Now, comparing the black box testing of Table II with the white box testing of Figure 7, it is
seen that white box yields the higher reliability. This is to be expected because white box testing
provides a more detailed coverage of a program’s faults.
Thus far, it has been assumed that faults encountered in traversing a directed graph
representation of a program have been removed (i.e. corrected). In reality, this may not be the
case unless fault correction is explicitly considered. There are several software reliability models
that include fault correction in addition to reliability prediction. These models are advantageous
because the results of tests, based on fault correction, are used in reliability prediction to improve
the accuracy of prediction. One such model [26,27] is used to make predictions based on fault
correction. It would not make sense to compare test efficiency of the fault correction model with, for
example, that of the path testing model because, as explained, the former includes fault correction
[Figure 8. Remaining failures vs test time t (SSPM predictions compared with testing results).]
but the latter does not. However, insight into the effectiveness of fault correction can be obtained
by evaluating, for example, fault correction delay time over a series of test time intervals.
It was shown in [26,27], using Shuttle flight software failure data, that the cumulative number of
faults corrected by test time T, C(T ), is related to the cumulative number of failures F(T ) detected
by time T . In addition, in the case of the Shuttle data, the number of faults is equal to the number
of failures. This is assumed to be the case in the hypothetical fault data of Figure 2, which is
used in the predictions that follow. C(T) and F(T) are related by the delay time d_T, the time
between fault detection and completion of fault correction. Recalling that for SSPM, F(T) is given
in Equation (20), C(T) can be expressed in Equation (21):

C(T) = F(T - d_T)    (21)
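A minimal sketch of the delay-time relation, with alpha and d_T as illustrative assumptions rather than values fitted to the Shuttle data:

    #include <cmath>
    #include <cstdio>

    // Eq (14): cumulative failures detected by test time T under SSPM.
    double F(double T, double alpha) {
        return (1.0 / alpha) * (1.0 - std::exp(-alpha * T));
    }

    int main() {
        const double alpha = 0.3;   // assumed SSPM parameter
        const double dT = 1.5;      // assumed delay between detection and completed correction
        for (int T = 1; T <= 8; ++T) {
            double detected  = F(T, alpha);
            double corrected = (T > dT) ? F(T - dT, alpha) : 0.0;  // Eq (21): C(T) = F(T - dT)
            std::printf("T = %d  F(T) = %.3f  C(T) = %.3f\n", T, detected, corrected);
        }
        return 0;
    }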
[Figure 9. Reliability predictions vs test time t; node testing is the most accurate.]
Important aspects of fault correction and testing that are not covered by models such as the above
are the fault correction efficiencies in the various phases of software development, which must be
provided by empirical evidence. In a Hewlett-Packard division application, 31% of the requirements
faults were eliminated in the requirements phase, 30% in preliminary design, 15% during detailed
design, and 24% during testing. Additionally,
[Figure 10. SSPM predicted cumulative failures F(T), correction delay dT, cumulative faults corrected C(T), remaining faults R(T), and fault correction rate CR(T) vs test time interval T.]
51% of the detailed design faults slipped into the testing phase. The other important aspect of
efficiency is the effort required to remove the faults. This investigation confirms that it is costly to
wait. The total effort expended to remove 236 intra-phase faults was 250.5 h while it took 1964.8 h
to remove the 248 faults that were corrected in later phases. Faults undetected within the originating
phase took approximately eight times more effort to correct. In fact, the problem does not get better
as time passes. Faults found in the field are at least an order of magnitude more expensive to
fix than those found during testing. Faults that propagate to later phases of development produce a
nearly exponential increase in the effort, and thus in the cost, of fixing those faults [29].
A confirming example is provided by the Praxis critical systems development of the Certification
Authority for the Multos smart card scheme on behalf of Mondex International. The authors claim
that correctness by construction is possible and practical. It demands a development process that
builds correctness into every step. It demands rigorous requirements definition, precise system-
behavior specification, solid and verifiable design, and code whose behavior is precisely understood.
It demands defect removal and prevention at every step. The number of system faults is low
compared with systems developed using less formal approaches. The distribution of effort clearly
shows that fault fixing constituted a relatively small part of the effort (6%) [30]; this contrasts
with many critical projects where fixing of late-discovered faults takes a large proportion of project
resources, as in the Hewlett–Packard example.
Experiences like these should lead the software engineering community to adopt (1) phase-
dependent predictions in reliability and testing models and (2) defect removal and fault correction
in all phases of the development process.
14. CONCLUSIONS
For white box testing, path testing was the most efficient overall. This is not surprising because
path testing exercises all components of a program—statements and transitions among statements.
Although not surprising, it was comforting to find that the law of diminishing returns has not been
overturned by the black box testing result in Table II, where both testing efficiency and reliability
increase at a decreasing rate. Results such as these can be used as a stopping rule to prevent an
organization’s testing budget from being exceeded.
An interesting result, based on Table II and Figure 3, is the superiority of black box testing
over white box testing in finding and removing faults, due to its coverage of complete program
constructs. On the other hand, the application of white box testing yields higher reliability than
black box testing because the former, using path testing, for example, mirrors a program’s state
transitions that are related to complexity, and complexity is highly related to reliability [31].
It is not clear whether these results would hold up if different programs and directed graphs
were used. A fertile area for future research would be to experiment with the test strategies on
different programs. Because it is clear that models are insufficient for capturing pertinent details of
the reliability and testing process, it is important to include empirical evidence in evaluating testing
strategies. Therefore, a promising area for future research would be to incorporate empirical data,
such as the data in the previous section, into integrated testing and reliability models to see whether testing
efficiency is improved.
REFERENCES
1. IEEE/AIAA P1633TM /Draft 13. Draft Standard for Software Reliability Prediction, November 2007.
2. Hamlet D. Foundations of software testing: Dependability theory. Proceedings of the Second ACM SIGSOFT Symposium
on Foundations of Software Engineering, 1994; 128–139.
3. Prowell SJ. A cost-benefit stopping criterion for statistical testing. Proceedings of the 37th Annual Hawaii International
Conference on System Sciences (HICSS’04)—Track 9, 2004; 90304b.
4. Hailpern B, Santhanam P. Software debugging, testing, and verification. IBM Systems Journal 2002; 41(1).
5. Beizer B. Software Testing Techniques (2nd edn). Van Nostrand Reinhold: New York, 1990.
6. Reliable Software Technologies Corporation. http://www.cigital.com/.
7. Chen TY, Yu YT. On the expected number of failures detected by subdomain testing and random testing. IEEE
Transactions on Software Engineering 1996; 22(2):109–119.
8. Tonella P, Ricca F. A 2-layer model for the white-box testing of Web applications. Sixth IEEE International Workshop
on Web Site Evolution (WSE’04), 2004; 11–19.
9. Howden WE. Functional Program Testing and Analysis. McGraw-Hill: New York, 1987.
10. Xie T, Notkin D. Checking inside the black box: Regression testing by comparing value spectra. IEEE Transactions on
Software Engineering 2005; 31(10):869–883.
11. Myers G. The Art of Software Testing. Wiley: New York, 1979.
12. http://www.mccabe.com/.
13. McCabe TJ. A complexity measure. IEEE Transactions on Software Engineering 1976; SE-2(4):308–320.
14. Schneidewind NF. A new software reliability model. The R&M Engineering Journal 2006; 26(3):6–22.
15. Wong WE, Horgan JR, Mathur AP, Pasquini A. Test set size minimization and fault detection effectiveness: A case study
in a space application. COMPSAC’97—21st International Computer Software and Applications Conference, 1997; 522.
16. Fenton NF, Pfleeger SL. Software Metrics: A Rigorous & Practical Approach (2nd edn). PWS Publishing Company:
Boston, 1997.
17. Schneidewind NF. Reliability modeling for safety critical software. IEEE Transactions on Reliability 1997; 46(1):88–98.
18. Voas JM, McGraw G. Software Fault Injection: Inoculating Programs Against Errors. Wiley: New York, 1998.
19. Beydeda S, Gruhn V, Stachorski M. A graphical class representation for integrated black- and white-box testing.
Seventeenth IEEE International Conference on Software Maintenance (ICSM’01), 2001; 706.
20. Horgan JR, Mathur AP. Assessing testing tools in research and education. IEEE Software 1992; 9(3):61–69.
21. Musa JD. Software Reliability Engineering: More Reliable Software, Faster and Cheaper (2nd edn). Authorhouse: 2004.
22. General Accounting Office (GAO). Best Practices: A More Constructive Test Approach is Key to Better Weapon System
Outcomes. GAO: Washington, 2000.
23. Mogyorodi GE (Bloodworth Integrated Technology, Inc.). What is requirements-based testing? CrossTalk, March 2003.
24. Schick GJ, Wolverton RW. A History of Software Reliability Modeling. University of Southern California and Thompson
Ramo Wooldridge Corporation (undated).
25. Xie M. Software Reliability Modelling. World Scientific: Singapore, 1991.
26. Schneidewind NF. Modeling the fault correction processes, part 2. The R&M Engineering Journal 2004; 24(Part 2(1)):
6–14; ISSN 0277-9633.
27. Schneidewind NF. Modeling the fault correction processes, part 1. The R&M Engineering Journal 2003; 23(Part 1(4)):
6–15; ISSN 0277-9633.
28. Zage D, Zage W. An analysis of the fault correction process in a large-scale SDL production model. Twenty-fifth
International Conference on Software Engineering (ICSE’03), 2003; 570.
29. Runeson P, Holmstedt Jönsson M, Scheja F. Are found defects an indicator of software correctness? An investigation
in a controlled case study. Fifteenth International Symposium on Software Reliability Engineering (ISSRE’04), 2004;
91–100.
30. Hall A, Chapman R. Correctness by construction: Developing a commercial secure system. IEEE Software 2002;
19(1):18–25.
31. Khoshgoftaar TM, Munson JC. Predicting software development errors using software complexity metrics. IEEE Journal
on Selected Areas in Communications 1990; 8(2):253–261.
32. Keller T, Schneidewind NF, Thornton PA. Predictions for increasing confidence in the reliability of the space shuttle
flight software. Proceedings of the AIAA Computing in Aerospace 10, San Antonio, TX, 28 March 1995; 1–8.