Goals of evaluation and types of evidence
Marielle Berriet-Solliec
National Institute of Agronomy, Food and Environment (Agrosup Dijon), France
Pierre Labarthe
French National Institute for Agricultural Research (INRA), France
Catherine Laurent
French National Institute for Agricultural Research (INRA), France
Abstract
All stakeholders are urged to pay more attention to the quality of evidence used and produced
during the evaluation process in order to select appropriate evaluation methods. A ‘theory
of evidence for evaluation’ is needed to better address this issue. This article discusses the
relationships between the three main goals of evaluation (to learn, measure and understand) and
the various types of evidence (evidence of presence, of difference-making, of mechanism) which
are produced and/or used in the evaluation process. It argues for the need to clearly distinguish
between this approach and that of levels of evidence, which is linked to data collection and
processing methods (e.g. single case observations, difference methods, randomized controlled
trials…). The analysis is illustrated by examples in the field of agro-environmental policymaking
and farm advisory services.
Keywords
agricultural extension, agri-environment, agricultural policies, evaluation, evidence, evidence-
based decision, knowledge
Recently there has been a resurgence of research into the effects of knowledge characteristics
on the dynamics of collective decision making in public or private organizations. For years,
studies have stressed and modelled the diversity of the sources and types of knowledge used
in decision making (e.g. expertise, theories on causal relations, traditional knowledge, etc.).
More recently, some theoretical developments, such as research around ‘evidence-based
decisions’, have merged learning from various disciplinary standpoints (e.g. philosophy of
science, medical studies, economics, ecology) and opened new debates on ‘empirical evi-
dence for use’ (Cartwright, 2011). Regarding evaluation, a proposal to rank evaluations
according to a hierarchy of evidence (e.g. Lipsey, 2007) has caused heated discussions about
what counts as evidence (Donaldson, 2008). These debates are calling on decision makers to
pay more attention to the quality of evidence when selecting appropriate methods of evalua-
tion and assessing their conclusions. There are also calls for a ‘theory of evidence for evalu-
ation’ (Schwandt, 2008).
This article aims to contribute to the building of such a theory. We analyse the relationships
between goals of evaluation and types of evidence (i.e. what is the object of evidence in dif-
ferent types of evidence). We demonstrate how the resulting theoretical advances help to bet-
ter analyse the trade-offs involved in the use of alternative types of evidence. To illustrate this,
we focus on the ex-post evaluation of public programs in agriculture: specifically, for advisory
services and agri-environmental policies.
Goal 1: To measure – the evaluation is designed to assess the effects of a program. A first group of studies focuses on the quantification of program impacts, often using micro-economic techniques (Rossi et al., 2004), in line with the work of Heckman. An emblematic principle of this type
of research is the identification of an experimental or quasi-experimental situation in which
systematic reference to a counterfactual can be used to identify outcomes which are specific
to the program under evaluation (Banerjee and Duflo, 2009; Shadish et al., 2002). This first
group of studies seeks to assess if a public intervention works (a measure of effect usually
referred to as ‘impact assessment’). A second group of studies aims at measuring efficiency.
This involves measuring the value of goods or services produced through public programs
against the cost of their production. The goal is then to determine whether an organization or
initiative has produced as many benefits as possible given the resources it has at its disposal;
this approach takes into account a combination of factors such as costs, quality, use of
resources, appropriateness and whether deadlines were met.
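Schematically, this second group of studies relates the value produced to the resources consumed; the ratio below is our own shorthand for that comparison, not a formula drawn from the literature cited here:
\[
\text{Efficiency} \;=\;
\frac{\text{value of the goods and services produced by the program}}
     {\text{cost of the resources mobilized to produce them}}
\]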
Goal 2: To understand – the evaluation identifies and analyses the mechanisms by which the program
under evaluation can produce the expected outcomes or may create adverse effects. This second goal
is the basis of studies of the theories underlying public programs and analysis of the specific
mechanisms by which these programs have made an impact. Chen (1990), Chen and Rossi
(1983) and Shadish et al. (1991) introduced the debate in the 1980s and 90s, and several theo-
retical works were recently published on these issues (Donaldson, 2007; Donaldson et al.,
2008; Jordan et al., 2008; Shadish et al., 2002; Stame, 2004). In practice, this raises the ques-
tion of what knowledge can be used to provide a reliable empirical basis to implement these
approaches (Pawson, 2002, 2006; Pawson and Tilley, 1997; Schwandt, 2003) but also of what
credible claim of the contribution of an intervention to a change can be made in the absence of
experimental approaches (Mayne, 2012).
Goal 3: To learn – the evaluation is designed as a collective learning process. Many studies emphasize
the importance of elements that support the use of evaluation; these are intended to facilitate
the implementation of adequate methods and the appropriation of evaluation findings by dif-
ferent types of users (Patton, 2008). Evaluation is considered an operational approach intended
to improve public action and decisions. Emphasis is placed on its instrumental dimension (as
a response to an institutional demand) and on the role played by evaluation approaches as an
organizational learning process. This goal can lead to the idea of a ‘learning society’ (Schwandt,
2003) and to a new conception of evaluation as a form of inquiry involving pedagogical
engagement with real practice. Using diverse participatory methods (e.g. stakeholder-based,
democratic, collaborative, pluralist, responsive) (Cousins and Whitmore, 1998; Mertens,
1999), this ‘learning’ objective can be paired with the goal of empowerment (Fetterman, 1996;
Fetterman and Wandersman, 2005).
This plurality of goals generates an initial question: should we consider the quality of evidence in the same way for all these cases? Whatever the goal pursued, the evidence used and produced in an evaluation should ideally:
a) be socially relevant to those concerned and consider negative as well as positive effects;
b) be based on adequate types of evidence (in line with what the evaluation entails); and
c) be reliable (produced using rigorous methods, to ensure the highest degree of probative force).
Types of evidence
Broadly speaking, the following three types of empirical evidence are necessary to evaluate
public policies:
1) Evidence of presence. This type of evidence aims at the description and verification of a thing which exists on the ground (e.g. species observed while building a botanical inventory to describe biodiversity). It is used to build an agreement among different stakeholders on the state of the world (before and after the program). This can be approached through a proxy (e.g. the number of footprints of individuals belonging to certain species).
2) Evidence of difference-making. This type of evidence shows that a factor makes a difference to an outcome in a given population, i.e. that the outcome would not have been the same in its absence (e.g. evidence that an increase in fertilizer [C = cause] is followed by an increase in crop yield [O] in the farms observed, whatever the underlying biochemical processes).
3) Evidence of a mechanism for a phenomenon. This is produced when the entities and activities that make up a mechanism, and the organization of these entities and activities through which they produce the phenomenon, are known (e.g. the bio-chemical reactions needed for an increase in fertilizer [C = cause] to increase crop yield [O] in a controlled environment).
This type of evidence may confirm a relationship of cause and effect, all other things being
equal. It provides information on the causal pathway to intervene upon for the goals of a pub-
lic program to be achieved. However, in real life conditions, evaluators are always confronted
with complex causal structures in which various mechanisms interfere. In that respect, follow-
ing Cartwright (2011), a probabilistic theory of causality can be adopted.
For each effect-type at a time t, O_t, and for each time t' before t, there is a set of factors {C^1_{t'},…,C^n_{t'}} – the causes at t' of O at t – whose values in combination fix the objective chance at t' that O takes value o for any o in its allowed range. A causal structure, CS_{t'}(O_t), for O_t is such a set along with the related objective chances for all values of O_t for all combinations of allowed values, L^j_{t'}, of the causes in the set: Prob(O_t = o | L^j_{t'}). For simplicity I will usually suppress time and other indices and also restrict attention to two-valued variables. So a causal structure looks like this:
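Cartwright's own illustrative table is not reproduced here; as a rendering in the same notation (the layout and the placeholder probabilities p1,…,p4 are ours), a causal structure for two binary causes C^1 and C^2 can be written as:
\[
CS_{t'}(O_t):\quad
\begin{array}{cc|c}
C^1_{t'} & C^2_{t'} & \mathrm{Prob}(O_t = 1 \mid L^j_{t'}) \\
\hline
1 & 1 & p_1 \\
1 & 0 & p_2 \\
0 & 1 & p_3 \\
0 & 0 & p_4
\end{array}
\]
where each row corresponds to one combination L^j_{t'} of allowed values of the causes.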
In practice, full knowledge of the causal structure involved in a public program is generally
unreachable. It is therefore useful to develop hypotheses on the mechanisms that will play an
important role, in order to design a program and have an effect on ‘manipulable’ factors
(Shadish et al., 2002) or to analyse whether an intervention is a contributory cause to a change
(Mayne, 2012). Here, evaluation usually involves the production of both evidence of mecha-
nism and evidence of difference-making, a combination which provides information about
causal pathways. In certain cases, however, an evaluation is based exclusively on evidence of
difference-making and therefore says little or nothing about underlying causality if the causal
structure is complex.
In practice, the evaluation may bear either on:
•• the production of the expected mechanism, by observing changes which occur at each stage (e.g. whether a financial incentive has led to a shift in practices which has in turn led to the use of a fertilizer that has an impact on crops). In this case, evidence of mechanism will be combined with evidence of difference-making to help clarify causal relationships;
•• measurement only of the produced effects (e.g. has income support increased production levels?), without hypothesizing about the causal chain involved (purchasing of consulting services, purchasing of inputs, reduction of risk aversion, etc.). Here, evidence of difference-making provides little information on the causal relationships which need to be studied in order to judge how generic the results obtained are.
Disentangling various types of evidence highlights the ambiguous relationship between evidence of difference-making and causality: in certain cases, these types of evidence reveal nothing about causal pathways. This remains true even when such evidence is produced using
methods (such as randomized controlled trials) which may confer a high level of proof. Types
of evidence and level of evidence are two independent dimensions of the quality of evidence.
Levels of evidence
The assessment of levels of empirical evidence is usually considered a major issue. Whatever
the type of evidence, not all findings have the same probative force: they cannot be ranked at
the same ‘level of evidence’. In the field of agriculture, for example, levels of evidence of effectiveness are often classified in the following order, from the lowest to the highest quality, according to the methodology of data collection: opinions of respected authorities; single case observations; observations on wider samples of situations; quasi-experimental difference methods; and randomized controlled trials.
But there is not ‘one’ methodology (as is commonly argued for RCTs) that could be considered as the gold standard for all situations. Other types of ranking are possible. For instance, if research aims at understanding a mechanism (e.g. the reasons why children or parents will accept a treatment, depending on individual behaviours), then in-depth qualitative studies including single case observations provide a higher level of evidence than results of cohort studies based on probabilistic models (Petticrew and Roberts, 2003).
In addition, the apparent simplicity of the above classification should not conceal the fact that the assessment of the quality of evidence produced at each level can be based on different criteria (e.g. study design, quality of study conduct, consistency of results) (Liberati et al., 2001). Nor should it conceal the numerous questions that arise when several types of evidence are involved and need to be combined and/or are in competition (Laurent and Trouvé, 2011). In other words, the criteria for assessing the level of evidence must be chosen according to the objectives of this assessment.
Invoking the argument that there is no universal rule by which to rank the level of evidence,
some authors reject this very principle and argue in favour of a symmetry of knowledge, put-
ting on the same level opinions from various stakeholders, traditional knowledge gained from
experience, empirical evidence resulting from systematic investigation, etc. Such a renuncia-
tion may generate significant adverse effects when it comes to action. In a large number of real
evaluation settings, stakeholders want information that is as robust as possible to help them
meet their objectives. This is the case in many areas of public intervention, such as agriculture, which involve both private and public organizations, and actors who consider that it makes sense to look for the best possible level of evidence to inform their decisions (Labarthe and Laurent, 2013).
Therefore, both empirical observations and progress in the theory of evidence invite the abandonment of two equally unproductive claims: the claim that there is a unique methodology for ranking the level of evidence; and the claim that rejects the very principle of assessing the probative force of evidence. Instead, they emphasize the need to define clear principles that will enable various stakeholders to assess the level of available evidence, using the criteria that are relevant for their particular objectives.
In agriculture, there is a long tradition of evaluation in which stakeholders with different backgrounds attempt to find common analytical frameworks through which to assess the relevance of alternative evaluation methods.
This tradition thrives all the more because intervention in agriculture (e.g. financial public
support, regulatory measures, technical support) is subject to decisions taken jointly at the
international level, whether it involves policy frameworks (e.g. the Common Agricultural
Policy), health and environmental standards or economic support for production and advisory
services. In addition, over the last two decades, evaluation has ceased to be confined to the assessment of the productive performance of farm activity. New stakeholders have joined the discus-
sion with concerns related to the environmental performances of agriculture and to its
contributions to rural development and social cohesion.
In the case of farm advisory services, for example, a global forum has been created (the
Global Forum for Rural Advisory Services, or G-FRAS) to facilitate collective discussion,
working groups, reports and evaluation initiatives. In Europe, the European Commission has
commissioned an evaluation of the implementation of advisory services in different member
countries. These initiatives highlight sensitive issues about the use of evidence according to
the goal of the evaluation: i) when measuring the effects of alternative advisory interventions
(e.g. debates about the probative force of alternative methods for impact assessment); ii) when
assessing the robustness of the causal scheme of these interventions (e.g. does the idea of
knowledge diffusion, upon which many of these interventions are based, hold up in the field?);
and iii) even when promoting learning through evaluation.
Ideally, an evaluation procedure should be aimed at producing results based on evidence of
the best possible quality. However, such a view remains highly theoretical, and blind spots persist in how such evidence is actually produced. As demonstrated below in three kinds of
ex-post evaluation, the adequacy of a type of evidence varies depending on the goal of the
evaluation.
In other words, the evaluation process does not examine in detail the mechanisms by which
an action is effective; public programs mobilize a large number of factors and it is often
impossible to observe every form of interaction between them. In most cases, evidence of
effectiveness is sought in order to prove that the program made a difference, not to describe
the mechanisms that made the measure effective, nor to check whether the effects confirm an underlying theory of action. Therefore, the evaluator does not open the ‘black box’ of the
evaluated program. For instance, evidence that an agri-environmental scheme has been effec-
tive in maintaining biodiversity can be sought, without analysing the specific ecological, eco-
nomic and social mechanisms that contributed to that outcome.
In this approach, the impact I of a program (the ‘treatment’ T) on an outcome O is estimated as the difference between the expected outcome with and without the treatment:

I = E(O | T = 1) − E(O | T = 0)    (1)

Such a measure is only meaningful if
the population ϕ divides into two groups that are identical with respect to all other features causally
relevant to the targeted outcomes, O, except for the policy treatment T, and its downstream
consequences. (Cartwright, 2011: 18)
The main pitfall in this situation is a selection bias where differences exist between the
‘treated’ group and the control group (stemming from observable or unobservable factors)
which could explain variations in levels of O independently of the effects of the program T.
In light of this, evidence-based decision studies in the medical field rank the methods
used in terms of their ability to reduce this bias: the smaller the bias, the higher the level of
evidence. Traditionally, randomized controlled trials (RCT) are viewed as the ‘gold stand-
ard’ for measuring the outcomes of a specific program. Selection bias is eliminated by ran-
domly distributing individuals in the treated group and the control group. For this reason,
new experimental evaluation methods (Duflo and Kremer, 2005) are emerging in various
sectors (e.g. justice, education, the social sciences as well as the environment and agricul-
ture). However, while such methods are widespread in health-related fields, they are less
used for other public programs, where the randomization of beneficiaries of a public pro-
gram can pose technical and ethical problems. In cases where an RCT cannot be undertaken,
‘quasi-experimental’ methods such as matching or double differencing are considered the
most reliable alternatives (Bro et al., 2004). Matching involves pairing individuals who
benefited from the program with individuals who did not and comparing the levels of indi-
cator variables. The goal is to pair individuals who are as similar as possible, particularly in terms of their likelihood of benefiting from the program. The double differ-
ence method is a combination of a comparison before and after the implementation of a
public program and a comparison with and without the program. Differences in O are meas-
ured with proxy variables in both the beneficiary group and the control group. Nevertheless,
both matching and double differencing have limitations. Matching can only pair individuals on observable variables, with the risk that unobservable ones (skills, attitudes, social capital) induce a selection bias. Double differencing relies on the hypothesis that such unobservable variables have a constant effect over time.
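As a minimal illustration of the logic just described (our sketch, on synthetic data; the numbers and variable names are invented for the example), the following compares the naive difference in means of equation (1) with a double-difference estimate when beneficiaries start from a higher baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Synthetic farms: T = 1 if the farm benefits from the program, 0 otherwise.
T = rng.integers(0, 2, n)

# Selection bias: beneficiaries start from a higher baseline outcome (+5).
baseline = rng.normal(50, 10, n) + 5 * T
true_effect = 3.0   # effect of the program
trend = 2.0         # common time trend affecting every farm

y_before = baseline
y_after = baseline + trend + true_effect * T

# Naive post-program comparison, as in equation (1): E(O | T = 1) - E(O | T = 0).
# It absorbs the baseline gap (+5) on top of the true effect (+3).
naive = y_after[T == 1].mean() - y_after[T == 0].mean()

# Double difference: (after - before) for beneficiaries minus (after - before)
# for non-beneficiaries. Any time-constant difference between groups cancels out.
did = ((y_after[T == 1] - y_before[T == 1]).mean()
       - (y_after[T == 0] - y_before[T == 0]).mean())

print(f"naive difference:  {naive:.2f}")   # roughly 8 (biased)
print(f"double difference: {did:.2f}")     # roughly 3 (close to the true effect)
```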
Such methods have already been used to evaluate farm advisory service policies (Davis
et al., 2012; Godtland et al., 2004; Van den Berg and Jiggins, 2007). But to ensure the
empirical reliability of this kind of work, methodological precautions must be taken which
may limit the scope of findings. Below are four examples related to farm advisory service
programs:
1) The first problem bears on the requirement for a random distribution of farmers who
benefited from these advisory services programs and those who did not (in the case
of RCTs). Aside from the ethical issues raised, this requirement is also contrary to
the diagrams of causality of certain programs, such as participative and bottom-up
interventions (e.g. farmer field schools): the effectiveness of such programs theo-
retically depends on the self-motivated participation of farmers in a collective
project.
2) The second problem bears on an essential hypothesis of the methodologies of impact
evaluation based on RCTs or quasi-experimental designs: beneficiaries must not
be influenced by the fact that non-beneficiaries do not benefit from the program,
and vice versa (Stable Unit Treatment Value Assumption – SUTVA). This hypoth-
esis may also be contrary to the diagrams of causality underlying certain advisory
service programs, particularly those built on so-called diffusionist models (e.g. the
World Bank’s Train & Visit program): in theory, their effectiveness resides in the
fact that farmers who directly receive advice will share acquired knowledge with
those who have not.
3) The third problem is the choice of indicators. Evaluating the impact of farm advisory
services supposes the ability to identify a proxy of the expected results. At which
level should this result be selected (Van den Berg and Jiggins, 2007)? The level of
farm performance (yield, income, etc.); the level of the adoption of innovations; or
the level most directly affected by farm advisory services: farmers’ knowledge and
skills? The question then becomes how to express this knowledge and these skills in
quantitative variables. In that respect, Godtland et al. (2004) have stressed the diffi-
culties and limitations of their attempt to express farmers’ knowledge through knowl-
edge tests. Likewise, effects on this proxy have to be observable over relatively short durations (due to costs, RCTs are often run as one- to two-year population tests). However, in the case of farm advisory services, one can wonder
whether this short-term measure makes any sense due to certain mid- or long-term
dimensions of learning processes.
4) The last aspect is related to the distributive effects of the evaluated policy. In most
impact studies, the effect is calculated by looking at the difference between the average
obtained by the group of individuals benefiting from the measure in a sample and that
of the individuals who do not benefit. However, an average improvement for the target
population can hide great inequalities or even aggravate these inequalities. Abadie
et al. (2002) have shown for instance that a training program for poor populations
could result in an increase in the average income of the target populations, but have no
effect on the poorest fraction of this population.
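The distributive point in item 4 can be illustrated with a small synthetic sketch (ours, not drawn from Abadie et al.): a hypothetical program that raises the average outcome of the treated group while leaving its poorest fraction essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic incomes for a control and a treated group drawn from the same distribution.
control = rng.lognormal(mean=7.0, sigma=0.6, size=n)
treated = rng.lognormal(mean=7.0, sigma=0.6, size=n)

# Hypothetical program: it raises incomes by 10 percent, but only for people
# above the 25th percentile, leaving the poorest quarter untouched.
cutoff = np.quantile(treated, 0.25)
treated = np.where(treated > cutoff, treated * 1.10, treated)

avg_effect = treated.mean() - control.mean()
p10_effect = np.quantile(treated, 0.10) - np.quantile(control, 0.10)

print(f"average effect: {avg_effect:.1f}")                 # clearly positive
print(f"effect at the 10th percentile: {p10_effect:.1f}")  # close to zero
```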
This example of the evaluation of farm advisory services shows that the measurement of
the impact of public programs is only rigorous if the methods used are consistent with specific
hypotheses associated with the method of data collection (e.g. randomization, a lack of diffu-
sion-related effects).
In other words, the experimental settings of the production of evidence of effectiveness are
such that they cause many problems of generalization and external validity. This knowledge is
only valid for a specific population ϕ in a particular environment characterized by a specific
causal structure CS_{t'}(O_t). And it can only be extended to populations θ that share the same causal structure CS_{t'}(O_t). Some authors propose to solve this ‘environmental dependence’
issue by replicating measures of effectiveness (with an RCT) in various contexts, but ‘worry
that there is little incentive in the system to carry out replication studies (because journals may
not be as willing to publish the fifth experiment on a given topic as the first one), and funding
agencies may not be willing to fund them either’ (Banerjee and Duflo, 2009: 161). But the
problem is not a financial one. In any case, replication alone cannot be a solution; a theory
about causal structures is necessary to identify the scale and boundaries of different θ popula-
tions that may share the same causal structure. It is necessary to rely on theories to identify
mechanisms that characterize the causal structure of the target populations of the policies.
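Stated compactly (in our notation, restating the condition just described), an impact measured on ϕ can be extended to another population θ only if the two populations share the same causal structure:
\[
I \ \text{measured on}\ \varphi \ \text{extends to}\ \theta
\quad\text{only if}\quad
CS^{\varphi}_{t'}(O_t) \;=\; CS^{\theta}_{t'}(O_t).
\]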
When evidence of mechanism is produced, regularities or recurring facts are identified so as to determine the various causes {C^1_{t'},…,C^n_{t'}} and the set of causal relations {Prob(O_t | L^1_{t'}),…,Prob(O_t | L^m_{t'})} by which the implementation of a program has expected or unexpected effects. These effects can directly
relate to the goal of the program or to its broader context. The evaluation will thus depend on
the nature of the problem in question: at stake are the specificities of this problem in a particu-
lar context and the assessment of the degree of generality of the proposed solutions for further
action.
In certain cases, to improve the quality of the measurement of impacts, the evaluation is
constructed using a preliminary analysis of the theory underlying the program (program the-
ory). A first step is understanding (before the measurement) the causal mechanisms that guided
the design of the program. The role of the evaluator consists, more precisely, in putting forth
hypotheses on the main features of the causal structure linking a program and its potential
subsequent effects. The aim is to build a diagram that traces these patterns of causality; this diagram constitutes the theory of the program and is a simplified representation of the comprehensive causal structure. Once it is established, such a diagram becomes a reference framework and
the basis of the evaluation approach for the evaluator, who then proposes indicators that will
be useful for measuring impacts.
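As an illustration of what making such a diagram explicit can look like (our sketch; the chain of steps echoes the hypothetical incentive-to-biodiversity example used elsewhere in this article), a program theory can be written down as an ordered set of causal links, each paired with a candidate indicator:

```python
# A hypothetical program-theory diagram for an agri-environmental scheme, written
# as an explicit chain of causal links, each paired with a candidate indicator.
# Steps and indicators are illustrative only, not taken from an actual evaluation.
program_theory = [
    ("financial incentive paid", "uptake of the scheme", "number of contracts signed"),
    ("uptake of the scheme", "change in farming practices", "practices declared in farm surveys"),
    ("change in farming practices", "reduced chemical input use", "kg of inputs per hectare"),
    ("reduced chemical input use", "biodiversity maintained", "field counts of indicator species"),
]

def print_causal_chain(links):
    """Print each hypothesized causal link with the indicator proposed to observe it."""
    for cause, effect, indicator in links:
        print(f"{cause} -> {effect}  [indicator: {indicator}]")

print_causal_chain(program_theory)
```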
The analysis of the causal structure of the program allows a better understanding of the
distributive effects of a program within the target population and across populations. However,
the diagram that is built is only a simplified representation of the proposed causal structure.
Therefore, some of the ways in which evidence on mechanisms is used in the evaluation process raise questions, as illustrated by the following example.
In many evaluations of agri-environmental measures, data on changes in farming practices are collected (about crop rotation, plant pest management, etc.). They are linked to agri-ecological indicators to calculate the potential risks and effects of these changes (e.g. the use of fewer chemical inputs is associated with a positive impact on biodiversity) (Mitchell et al., 1995; Van der Werf and Petit, 2002).
However, it is impossible to identify and take into account the many existing mechanisms
that interact in various contexts. Thus, the causal diagram that underlies these actions is only
an approximation of a comprehensive causal structure that ideally could allow their effect to
be fully predicted. The studies that examine these types of methods all point out that these measures identify ‘potential effects’ but fail to measure actual impacts. Nevertheless, these qualifications are often absent from the executive summaries of reports that present evaluation results. Variations in the value of an indicator can thus be presented as evidence of an improvement in environmental performance. This is not only improper from a formal point of view; the few experimental tests carried out on this issue also show that it is not an acceptable estimate. For instance, Kleijn and Sutherland (2003) and Kleijn et al. (2006) show that certain measures which were successful in terms of ‘policy performance’ did not have the expected environmental impact.
Such doubts about the effectiveness of certain agri-environmental schemes can be linked to the weakness of the theoretical models upon which they are based, as well as to a lack of empirical data with which to identify what works and what does not (McNeely et al., 2005). The work done for the Millennium Ecosystem Assessment demonstrated the importance of these knowledge gaps (Carpenter et al., 2006). This concerns both evidence of difference-making and evidence of mechanism.
1) Identifying the mechanisms by which the actions were effective (or not) is essential to
producing generic knowledge that can be used to develop new programs (e.g. a causal
relation which can be exploited in various contexts). It can also help assess the generic
nature of the knowledge used in the program (e.g. to what extent the causal structure of
two different populations can be considered similar) and to raise new issues for the
evaluators and stakeholders involved in the evaluation.
2) In certain situations, it makes sense to rank results based on the opinions of respected
authorities, single case studies, observations on wider samples of situations, etc. in
order to assess the robustness of available evidence. However, the use of theoretical
models to infer the effective impact of a program, as sophisticated as they may be, is
often limited. The causality diagrams formalized in these theoretical models are only
ever partial representations of complex causal structures. Their predictive capacities
vary according to the object under evaluation and the context; therefore one cannot
replace the observation of the real effects (and the production of evidence of effective-
ness) with that of expected effects (estimated using an analysis of the means imple-
mented in the program).
Some authors have, for example, used Soft Systems Methodology (SSM) to design and evaluate technical advisory programs (Rochs and Navarro, 2008). SSM is designed to help a ‘human activity system’ (HAS) make the most
effective decisions in uncertain and complex contexts (Checkland, 1981) where learning is the
priority. Checkland and Scholes (1990) point out that SSM as a model is not intended to estab-
lish versions of reality. Instead, it aims to facilitate debate so that collective decisions and
action can be taken in problem situations. The seven stages of SSM are (Checkland, 1981):
i. inquiring into the situation (identifying the problem using different communication
techniques: brainstorming, interviews, participant observation, focus groups, etc.);
ii. describing the situation (describing the context using a wide variety of sources);
iii. defining HAS (identifying program stakeholders, and interviewing them on the trans-
formations they are expecting);
iv. building conceptual models of the HAS (representing the relationships between stake-
holders in the program being designed or evaluated);
v. comparing the conceptual models with the real world (preparation of a presentation of
the model for a debate with stakeholders);
vi. defining desirable and feasible changes;
vii. implementation (Rochs and Navarro, 2008).
Corroboration with facts and producing the best possible evidence do not appear to be at
the heart of this conception/evaluation approach, which instead aims at promoting and struc-
turing debate between program stakeholders to arrive at a consensual solution. In practice,
however, significant problems arise (Salner, 2000). In workshops, for example, evidence is
provided by different stakeholders verbally, and must be verified. Salner (2000) likens this
method to journalism, in that it involves the verification of the opinions of different stakehold-
ers so that ‘analysis makes it possible to mount an argument for change which was not simply
an intuitive reaction to a conversation held; it was an argument which could be explicitly
retraced at any time with links to supporting evidence’ (Checkland and Scholes, 1990: 198–9).
Verification is thought to be guaranteed by the open, public and collective nature of the debate.
This comparison with ‘fact checking’ in journalism, however, only holds true if the evidence presented is evidence of presence, describing facts known through stakeholder practices. Instead,
arguments often go deeper and target the expected or measured impact of programs and even
the causality diagram upon which they are based. These evaluation methods thus rely not only on evidence of presence but also on evidence of effectiveness and of mechanism, yet they do not formalize this integration. This lack of formalization manifests itself on two levels: (i) in the use of scientific knowledge to formulate hypotheses on how public programs function, and (ii) in the verification of the level of evidence obtained.
Ultimately, these formalization tasks are implicitly transferred to workshop leaders (often
researchers). This situation poses a number of problems as it is assumed that these leaders
have extensive skills and means at their disposal (to produce state-of-the-art reports of avail-
able scientific literature, statistical analyses and various types of verifications). For this rea-
son, several authors have pointed out that SSM can be exploited to reinforce existing power relations, given the asymmetries of information between stakeholders:
the kind of open, participative debate that is essential for the success of the soft system approach, and
is the only justification for the result obtained, is impossible to obtain in problem situations where
there is a fundamental conflict between interest groups that have access to unequal power resources.
Soft system thinking either has to walk away from these problem situations, or it has to fly in the face
of its own philosophical principles and acquiesce in proposed changes emerging from limited debates
characterized by distorted communication. (Jackson, 1991: 198)
•• The issue of level of evidence is often neglected and seen as secondary to collective
learning objectives. All contributions are accepted equally and the reliability of evi-
dence is not subject to systematic testing procedures.
•• Very quickly, evidence presented by participants with different interests can be in com-
petition and arbitration is often based on non-transparent criteria.
•• Without a systematic, clear verification procedure for evidence, learning may focus
more on the ability to reach consensual positions than on the ability to use the best tools
for achieving a given objective and on evaluating outcomes in a rigorous manner.
Conclusion
This article is not intended as a standard-setting tool. Our goal is to contribute to building a
theory of evidence for evaluation that allows different stakeholders to better judge the quality
of evidence they seek depending on their project.
We have illustrated that while evaluation may have very different objectives (e.g. under-
standing the mechanisms of public programs, measuring their specific impacts, or supporting
collective learning to favour the emergence of an agreement between stakeholders in the pro-
grams), each objective leads to a different examination of the question of types of evidence, i.e. what is the object of evidence (presence, making a difference, mechanism). This concern must be clearly distinguished from the study of levels of evidence, which deals with data collection and interpretation (e.g. single case observations, difference methods, RCTs), as each of these methods can be used for producing each type of evidence.
With this in mind, the issue of RCTs must be re-examined, along with the types of evidence
for which these methods are used. Experimental economics can be used as a tool to test some
hypotheses on mechanisms rather than only be used to assess the impact of a policy in a given
environment. Nevertheless, whether RCTs are a relevant tool in that respect is a matter of
ongoing discussion both in medical sciences and in economics (Deaton, 2009). A key question
in this debate is the importance of heterogeneity and distributive effects across populations,
which are not acknowledged by RCTs, but which can be essential for formulating theories in
various scientific areas (economics, management science, but also bio-medical sciences and
ecology among others).
For each situation, the quality of evidence can be assessed according to three dimensions.
Ideally, as mentioned above, one would like to base a decision on evidence that is at once socially relevant (addressing phenomena considered by each stakeholder to be important), of a high level (with probative force) and of the adequate type for the goals of the evaluation. This ideal is usually inaccessible, for various reasons including cost, meth-
odological constraints, and the need to select precise objectives from a large number of pos-
sible points of view.
Evaluators are permanently confronted with trade-offs. The three examples above show
that a better understanding of quality of evidence can help better assess the limits inherent in
the conclusions of every evaluation depending on the quality of evidence on which they are
based. In the real world, every evaluation process has its own limits and can only produce reli-
able results for a particular field of interest. Choices should thus be made that will involve
institutional issues and possible conflicts of interest. As is the case with any policy instrument,
the final decision depends on a multiplicity of factors which cannot be reduced to evidence
issues alone. However, a clear specification of the limits of validity of findings is a prerequi-
site to avoid misinterpretations. A better shared knowledge of the type and the level of evi-
dence that is used to evaluate the result of interventions can help clarify for various stakeholders
what is at stake in making alternative choices.
Acknowledgements
The authors would like to thank the anonymous referees and editors who provided useful and inspiring
comments on an earlier version of this article.
Funding
This research was conducted in an interdisciplinary research program funded by the French National
Agency for Research (program EBP-Biosoc/ADD). It is based on the combination of former research
experience on evaluation theories (M. Berriet-Solliec), on international debates on evaluation of farm-
advisory services (P. Labarthe) and on quality of evidence (C. Laurent).
References
Abadie A, Angrist J and Imbens G (2002) Instrumental variables estimates of the effect of subsidized
training on the quantiles of trainee earnings. Econometrica 70: 91–117.
Adams WM, Avelling R, Brockington D, Dickson B, Elliot J, Hutton J et al. (2004) Biodiversity con-
servation and the eradication of poverty. Science 306: 1147–9.
Banerjee AV and Duflo E (2009) The experimental approach to development economics. The Annual
Review of Economics 1: 151–78.
Bro E, Mayot P, Corda E and Reitz F (2004) Impact of habitat management on grey partridge popula-
tions: assessing wildlife cover using a multisite BACI experiment. Journal of Applied Ecology 41:
846–57.
Carpenter S, DeFries R, Dietz T, Mooney H, Polasky S, Reid W and Scholes R (2006) Millennium Ecosystem Assessment: research needs. Science 314: 257–8.
Cartwright N (2011) Evidence, external validity and explanatory relevance. In: Morgan GJ (ed.),
Philosophy of Science Matters: The Philosophy of Peter Achinstein. New York: Oxford University
Press, 15–28.
Cartwright N and Hardie J (2012) Evidence-Based Policy: A Practical Guide to Doing It Better. Oxford:
Oxford University Press.
Checkland PB (1981) Systems Thinking, Systems Practice. New York: John Wiley.
Checkland PB and Scholes J (1990) Soft Systems Methodology in Action. Chichester: John Wiley &
Sons.
Chen HT (1990) Theory-Driven Evaluation. Newbury, CA: SAGE.
Chen HT and Rossi PH (1983) Evaluating with sense: the theory-driven approach. Evaluation Review 7(3): 283–302.
Cousins JB and Whitmore E (1998) Understanding and participatory evaluation. New Directions for
Evaluation 80: 69–80.
Davis KE (2008) Extension in Sub-Saharan Africa: overview and assessment of past and current mod-
els, and future prospects. Journal of International Agricultural and Extension Education 15(3):
15–28.
Davis KE, Nkonya E, Kato E, Mekonnen DA, Odendo M, Miiro R and Nkuba J (2012) Impact of farmer
field schools on agricultural productivity and poverty in East Africa. World Development 40(2):
402–13.
Deaton AS (2009) Randomization in the tropics and the search for the elusive keys to economic devel-
opment. National Bureau of Economic Research Working Paper 14690. Cambridge, MA.
Donaldson SI (2007) Program Theory-driven Evaluation Science: Strategies and Applications. New
York: Routledge.
Donaldson SI (2008) In search of the blueprint for an evidence-based global society. In: Donaldson SI, Christie CA and Mark HH (eds), What Counts as Credible Evidence in Evaluation and Evidence-based Practice? Thousand Oaks, CA: SAGE, 2–18.
Donaldson SI, Christie CA and Mark HH (2008) What Counts as Credible Evidence in Evaluation and
Evidence-based Practice? Thousand Oaks, CA: SAGE.
Duflo E and Kremer M (2005) Use of randomization in the evaluation of development effectiveness. In:
Pitman G, Feinstein O and Ingram G (eds), Evaluating Development Effectiveness. New Brunswick,
NJ: Transaction Publishers, 205–32.
Fetterman DM (1996) Empowerment evaluation: an introduction to theory and practice. In: Fetterman, Kaftarian and Wandersman (eds), Empowerment Evaluation: Knowledge and Tools for Self-Assessment & Accountability. Thousand Oaks, CA: SAGE, 3–46.
Fetterman DM and Wandersman A (2005) Empowerment Evaluation. Principles and Practice. New
York: The Guilford Press.
Fitzpatrick JL, Sanders JR and Worthen BR (2011) Program Evaluation: Alternative Approaches and
Practical Guidelines. Upper Saddle River, NJ: Pearson Education.
Godtland EM, Sadoulet E, de Janvry A, Murgai R and Ortiz O (2004) The impact of farmer field
schools on knowledge and productivity: a study of potato farmers in the Peruvian Andes. Economic
Development and Cultural Change 53(1): 63–92.
Hansen HF and Rieper O (2009) The evidence movement: the development and consequences of meth-
odologies in review practices. Evaluation 15: 141–63.
Illari PM (2011) Mechanistic evidence: disambiguating the Russo-Williamson thesis. International
Studies in the Philosophy of Science 25(2): 139–57.
Jackson M (1991) Systems Methodology for the Management Sciences. New York and London: Plenum
Press.
Jordan GB, Hage J and Mote J (2008) A theories-based systemic framework for evaluating diverse
portfolios of scientific work, part 1: micro and meso indicators. New Directions for Evaluation
118: 7–24.
Kleijn D and Sutherland W (2003) How effective are European agri-environment schemes in conserving and promoting biodiversity? Journal of Applied Ecology 40: 947–69.
Kleijn D, Baquero RA, Clough Y, Díaz M, Esteban J, Fernández F et al. (2006) Mixed biodiversity benefits of agri-environment schemes in five European countries. Ecology Letters 9(3): 243–54.
Labarthe P and Laurent C (2013) Privatization of agricultural extension services in the EU: towards a
lack of adequate knowledge for small-scale farms? Food Policy 38: 240–52.
Laurent C and Trouvé A (2011) Competition of evidences and the emergence of the ‘evidence-based’ or ‘evidence-aware’ policies in agriculture. 122nd EAAE Seminar ‘Evidence-based agricultural
and rural policy making: methodological and empirical challenges of policy evaluation’. Ancona,
Italy, 17–18 February 2011.
Laurent C, Berriet-Solliec M, Kirsch M, Perraud D, Tinel B, Trouvé A et al. (2009) Pourquoi s’intéresser
à la notion d’Evidence-based policy? Revue Tiers Monde 200: 853–73.
Liberati A, Buzzetti R, Grilli R, Magrini N and Monozzi S (2001) Evidence-based case review. Which
guidelines can we trust? Assessing strength of evidence behind recommendations for clinical prac-
tice. Western Journal of Medicine 174: 262–5.
Lipsey MW (2007) Method choice for government evaluation: the beam in our own eye. In: Julnes G
and Rog DJ (eds), Informing Federal Policies on Evaluation Methodology: Building the Evidence
Base for Method Choice in Government Sponsored Evaluation. New Directions for Evaluation,
vol. 113. San Francisco, CA: Jossey-Bass, 113–15.
McNeely JA, Faith DP, Albers HJ et al. (2005) Biodiversity. In: Chopra K, Leemans R, Kumar P and
Simons H (eds), Ecosystems and Human Well-Being: Volume 3. Policy Responses. Washington,
DC: Island Press, 119–72.
Mayne J (2012) Contribution analysis: coming of age? Evaluation 18: 270–80.
Mertens D (1999) Inclusive evaluation: implications of transformative theory of evaluation. American
Journal of Evaluation 20(1): 1–14.
Mitchell G, May A and McDonald A (1995) PICABUE: a methodological framework for the develop-
ment of indicators of sustainable development. International Journal of Sustainable Development
& World Ecology 2: 104–23.
Oliver S, Harden A, Rees R, Shepherd J, Brunton J, Garcia J and Oakley A (2005) An emerging frame-
work for including different types of evidence in systematic reviews for public policy. Evaluation
11: 428–46.
Patton MQ (2008) Utilization Focused Evaluation, 4th edn. Thousand Oaks, CA: SAGE.
Pawson R (2002) Evidence-based policy: in search of a method. Evaluation 8: 157–81.
Pawson R (2006) Evidence-based Policy: A Realistic Perspective. London: SAGE.
Pawson R and Tilley N (1997) Realistic Evaluation. London: SAGE.
Petticrew M and Roberts H (2003) Evidence, hierarchies and typologies: horses for courses. Journal of
Epidemiology and Community Health 57: 527–9.
Primdahl J, Peco B, Schramek J, Andersen E and Onate JJ (2003) Environmental effects of agri-environ-
mental schemes in Western Europe. Journal of Environmental Management 67: 129–138.
Rochs F and Navarro M (2008) Soft System Methodology: an intervention strategy. Journal of
International Agricultural and Extension Education 15(3): 95–9.
Rogers P (2008) Using programme theory to evaluate complicated and complex aspects of interven-
tions. Evaluation 14(1): 29–48.
Rossi PH, Lipsey MW and Freeman HE (2004) Evaluation: A Systematic Approach, 7th edn. Newbury
Park, CA: SAGE.
Salner M (2000) Beyond Checkland & Scholes: improving SSM. Occasional Papers on Systemic
Development 11: 23–44.
Schwandt T (2003) ‘Back to the rough ground!’ Beyond theory to practice in evaluation. Evaluation
9(3): 353–64.
Schwandt T (2008) Toward a practical theory of evidence for evaluation. In: Donaldson SI, Christie
CA and Mark HH (eds), What Counts as Credible Evidence in Evaluation and Evidence-based
Practice? Thousand Oaks, CA: SAGE, 197–212.
Shadish WR, Cook TD and Campbell DT (2002) Experimental and Quasi Experimental Designs for
Generalized Causal Inference. Boston, New York: Houghton Mifflin Company.
Shadish WR, Cook TD and Leviton LC (1991) Foundations of Program Evaluation Theories of
Practice. Newbury Park, CA: SAGE.
Stame N (2004) Theory-based evaluation and varieties of complexity. Evaluation 10(1): 58–76.
Stern E (2004) Philosophies and types of evaluation research. In: The Foundations of Evaluation and
Impact Research. Third report on vocational training research in Europe: background report.
Luxembourg: Office for Official Publications of the European Communities (Cedefop Reference
Serie, 58), 12–42.
Stern E, Stame N, Mayne J, Forss K, Davies R and Befani B (2012) Broadening the range of designs and
methods for impact evaluations. DFID Working Paper 38, London.
Stufflebeam DL (2001) ‘Evaluation Models’: New Directions for Evaluation, 89. San Francisco, CA:
Jossey-Bass.
Van den Berg H and Jiggins J (2007) Investing in farmers: The impact of farmer field schools in relation
to Integrated Pest Management. World Development 35(4): 663–87.
Van der Sluijs J, Douguet J-M, O’Connor M, Guimaraes Pereira A, Quintana SC, Maxim L and Ravetz J (2008) Qualité de la connaissance dans un processus délibératif. Natures Sciences Sociétés 16: 265–73.
Van der Werf H and Petit J (2002) Evaluation of the environmental impact of agriculture at the farm level: a comparison and analysis of 12 indicator-based methods. Agriculture, Ecosystems and Environment 93: 131–45.