Goals of evaluation and types of evidence
Marielle Berriet-Solliec
National Institute of Agronomy, Food and Environment (Agrosup Dijon), France
Pierre Labarthe
French National Institute for Agricultural Research (INRA), France
Catherine Laurent
French National Institute for Agricultural Research (INRA), France
Abstract
All stakeholders are urged to pay more attention to the quality of evidence used and produced
during the evaluation process in order to select appropriate evaluation methods. A ‘theory
of evidence for evaluation’ is needed to better address this issue. This article discusses the
relationships between the three main goals of evaluation (to learn, measure and understand) and
the various types of evidence (evidence of presence, of difference-making, of mechanism) which
are produced and/or used in the evaluation process. It argues for the need to clearly distinguish
between this approach and that of levels of evidence, which is linked to data collection and
processing methods (e.g. single case observations, difference methods, randomized controlled
trials…). The analysis is illustrated by examples in the field of agro-environmental policymaking
and farm advisory services.
Keywords
agricultural extension, agri-environment, agricultural policies, evaluation, evidence, evidence-
based decision, knowledge
Recently there has been a resurgence of research into the effects of knowledge characteristics
on the dynamics of collective decision making in public or private organizations. For years,
studies have stressed and modelled the diversity of the sources and types of knowledge used
in decision making (e.g. expertise, theories on causal relations, traditional knowledge, etc.).
More recently, some theoretical developments, such as research around ‘evidence-based
decisions’, have merged learning from various disciplinary standpoints (e.g. philosophy of
science, medical studies, economics, ecology) and opened new debates on ‘empirical evi-
dence for use’ (Cartwright, 2011). Regarding evaluation, a proposal to rank evaluations
according to a hierarchy of evidence (e.g. Lipsey, 2007) has caused heated discussions about
what counts as evidence (Donaldson, 2008). These debates are calling on decision makers to
pay more attention to the quality of evidence when selecting appropriate methods of evalua-
tion and assessing their conclusions. There are also calls for a ‘theory of evidence for evalu-
ation’ (Schwandt, 2008).
This article aims to contribute to the building of such a theory. We analyse the relationships
between goals of evaluation and types of evidence (i.e. what is the object of evidence in dif-
ferent types of evidence). We demonstrate how the resulting theoretical advances help to bet-
ter analyse the trade-offs involved in the use of alternative types of evidence. To illustrate this,
we focus on the ex-post evaluation of public programs in agriculture: specifically, for advisory
services and agri-environmental policies.
Goal 1: To measure – the evaluation is designed to assess the effects of a program. A first group of studies focuses on the quantification of program impacts, often using micro-economic techniques (Rossi et al., 2004), in line with the work of Heckman. An emblematic principle of this type
of research is the identification of an experimental or quasi-experimental situation in which
systematic reference to a counterfactual can be used to identify outcomes which are specific
to the program under evaluation (Banerjee and Duflo, 2009; Shadish et al., 2002). This first
group of studies seeks to assess if a public intervention works (a measure of effect usually
referred to as ‘impact assessment’). A second group of studies aims at measuring efficiency.
This involves measuring the value of goods or services produced through public programs
against the cost of their production. The goal is then to determine whether an organization or
initiative has produced as many benefits as possible given the resources it has at its disposal;
this approach takes into account a combination of factors such as costs, quality, use of
resources, appropriateness and whether deadlines were met.
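Schematically, this second group of studies relates the value produced to the resources consumed; the ratio below is our own shorthand for that comparison, not a formula drawn from the literature cited here:
\[
\text{Efficiency} \;=\;
\frac{\text{value of the goods and services produced by the program}}
     {\text{cost of the resources mobilized to produce them}}
\]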
Goal 2: To understand – the evaluation identifies and analyses the mechanisms by which the program
under evaluation can produce the expected outcomes or may create adverse effects. This second goal
is the basis of studies of the theories underlying public programs and analysis of the specific
mechanisms by which these programs have made an impact. Chen (1990), Chen and Rossi
(1983) and Shadish et al. (1991) introduced the debate in the 1980s and 90s, and several theo-
retical works were recently published on these issues (Donaldson, 2007; Donaldson et al.,
2008; Jordan et al., 2008; Shadish et al., 2002; Stame, 2004). In practice, this raises the ques-
tion of what knowledge can be used to provide a reliable empirical basis to implement these
approaches (Pawson, 2002, 2006; Pawson and Tilley, 1997; Schwandt, 2003) but also of what
credible claim of the contribution of an intervention to a change can be made in the absence of
experimental approaches (Mayne, 2012).
Goal 3: To learn – the evaluation is designed as a collective learning process. Many studies emphasize
the importance of elements that support the use of evaluation; these are intended to facilitate
the implementation of adequate methods and the appropriation of evaluation findings by dif-
ferent types of users (Patton, 2008). Evaluation is considered an operational approach intended
to improve public action and decisions. Emphasis is placed on its instrumental dimension (as
a response to an institutional demand) and on the role played by evaluation approaches as an
organizational learning process. This goal can lead to the idea of a ‘learning society’ (Schwandt,
2003) and to a new conception of evaluation as a form of inquiry involving pedagogical
engagement with real practice. Using diverse participatory methods (e.g. stakeholder-based,
democratic, collaborative, pluralist, responsive) (Cousins and Whitmore, 1998; Mertens,
1999), this ‘learning’ objective can be paired with the goal of empowerment (Fetterman, 1996;
Fetterman and Wandersman, 2005).
This plurality of goals generates an initial question: should we consider the quality of evidence in the same way for all these cases? Whatever the goal pursued, the evidence used and produced in an evaluation should ideally:
a) be socially relevant to those concerned and consider negative as well as positive effects;
b) be based on adequate types of evidence (in line with what the evaluation entails); and
c) be reliable (produced using rigorous methods, to ensure the highest degree of probative force).
Types of evidence
Broadly speaking, the following three types of empirical evidence are necessary to evaluate
public policies:
1) Evidence of presence. This type of evidence aims at the description and verification of a thing which exists on the ground (e.g. species observed while building a botanical inventory to describe biodiversity). It is used to build an agreement among different stakeholders on the state of the world (before and after the program). This can be approached through a proxy (e.g. the number of footprints of individuals belonging to certain species).
2) Evidence of difference-making. This type of evidence shows that a factor makes a difference to an outcome in a given population, i.e. that the outcome would not have been the same in its absence (e.g. evidence that an increase in fertilizer [C = cause] is followed by an increase in crop yield [O] in the farms observed, whatever the underlying biochemical processes).
3) Evidence of a mechanism for a phenomenon. This is produced when the entities and activities that make up a mechanism, and the organization of these entities and activities through which they produce the phenomenon, are known (e.g. the bio-chemical reactions needed for an increase in fertilizer [C = cause] to increase crop yield [O] in a controlled environment).
This type of evidence may confirm a relationship of cause and effect, all other things being
equal. It provides information on the causal pathway to intervene upon for the goals of a pub-
lic program to be achieved. However, in real life conditions, evaluators are always confronted
with complex causal structures in which various mechanisms interfere. In that respect, follow-
ing Cartwright (2011), a probabilistic theory of causality can be adopted.
For each effect-type at a time t, O_t, and for each time t' before t, there is a set of factors {C^1_{t'},…,C^n_{t'}} – the causes at t' of O at t – whose values in combination fix the objective chance at t' that O takes value o for any o in its allowed range. A causal structure, CS_{t'}(O_t), for O_t is such a set along with the related objective chances for all values of O_t for all combinations of allowed values, L^j_{t'}, of the causes in the set: Prob(O_t = o | L^j_{t'}). For simplicity I will usually suppress time and other indices and also restrict attention to two-valued variables. So a causal structure looks like this:
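Cartwright's own illustrative table is not reproduced here; as a rendering in the same notation (the layout and the placeholder probabilities p1,…,p4 are ours), a causal structure for two binary causes C^1 and C^2 can be written as:
\[
CS_{t'}(O_t):\quad
\begin{array}{cc|c}
C^1_{t'} & C^2_{t'} & \mathrm{Prob}(O_t = 1 \mid L^j_{t'}) \\
\hline
1 & 1 & p_1 \\
1 & 0 & p_2 \\
0 & 1 & p_3 \\
0 & 0 & p_4
\end{array}
\]
where each row corresponds to one combination L^j_{t'} of allowed values of the causes.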
In practice, full knowledge of the causal structure involved in a public program is generally
unreachable. It is therefore useful to develop hypotheses on the mechanisms that will play an
important role, in order to design a program and have an effect on ‘manipulable’ factors
(Shadish et al., 2002) or to analyse whether an intervention is a contributory cause to a change
(Mayne, 2012). Here, evaluation usually involves the production of both evidence of mecha-
nism and evidence of difference-making, a combination which provides information about
causal pathways. In certain cases, however, an evaluation is based exclusively on evidence of
difference-making and therefore says little or nothing about underlying causality if the causal
structure is complex.
In practice, the evaluation may bear either on:
•• the production of the expected mechanism, by observing changes which occur at each stage (e.g. whether a financial incentive has led to a shift in practices which has in turn led to the use of a fertilizer that has an impact on crops). In this case, evidence of mechanism will be combined with evidence of difference-making to help clarify causal relationships;
•• measurement only of the produced effects (e.g. has income support increased production levels?), without hypothesizing about the causal chain involved (purchasing of consulting services, purchasing of inputs, reduction of risk aversion, etc.). Here, evidence of difference-making provides little information on the causal relationships which need to be studied in order to judge how generic the results obtained are.
Disentangling various types of evidence highlights the ambiguous relationship between evidence of difference-making and causality: in certain cases, these types of evidence reveal nothing about causal pathways. This remains true even when such evidence is produced using
methods (such as randomized controlled trials) which may confer a high level of proof. Types
of evidence and level of evidence are two independent dimensions of the quality of evidence.
Levels of evidence
The assessment of levels of empirical evidence is usually considered a major issue. Whatever
the type of evidence, not all findings have the same probative force: they cannot be ranked at
the same ‘level of evidence’. In the field of agriculture, for example, levels of evidence of effectiveness are often classified in the following order, from the lowest to the highest quality, according to the methodology of data collection: opinions of respected authorities; single case observations; observations on wider samples of situations; quasi-experimental difference methods; and randomized controlled trials.
But there is not ‘one’ methodology (as is commonly argued for RCTs) that could be considered as the gold standard for all situations. Other types of ranking are possible. For instance, if research aims at understanding a mechanism (e.g. the reasons why children or parents will accept a treatment, depending on individual behaviours), then in-depth qualitative studies including single case observations provide a higher level of evidence than results of cohort studies based on probabilistic models (Petticrew and Roberts, 2003).
In addition, the apparent simplicity of the above classification should not conceal the fact that the assessment of the quality of evidence produced at each level can be based on different criteria (e.g. study design, quality of study conduct, consistency of results) (Liberati et al., 2001). Nor should it conceal the numerous questions that arise when several types of evidence are involved and need to be combined and/or are in competition (Laurent and Trouvé, 2011). In other words, the criteria for assessing the level of evidence must be chosen according to the objectives of this assessment.
Invoking the argument that there is no universal rule by which to rank the level of evidence,
some authors reject this very principle and argue in favour of a symmetry of knowledge, put-
ting on the same level opinions from various stakeholders, traditional knowledge gained from
experience, empirical evidence resulting from systematic investigation, etc. Such a renuncia-
tion may generate significant adverse effects when it comes to action. In a large number of real
evaluation settings, stakeholders want information that is as robust as possible to help them
meet their objectives. This is the case in many areas of public intervention, such as agriculture, which involve both private and public organizations, and actors who consider that it makes sense to look for the best possible level of evidence to inform their decisions (Labarthe and Laurent, 2013).
Therefore, both empirical observations and progress in the theory of evidence invite the abandonment of two equally unproductive claims: the claim that there is a unique methodology for ranking the level of evidence; and the claim that rejects the very principle of assessing the probative force of evidence. Instead, they emphasize the need to define clear principles that will enable various stakeholders to assess the level of available evidence, using the criteria that are relevant for their particular objectives.
In agriculture, there is a long tradition of evaluation in which stakeholders with different backgrounds attempt to find common analytical frameworks through which to assess the relevance of alternative evaluation methods.
This tradition thrives all the more because intervention in agriculture (e.g. financial public
support, regulatory measures, technical support) is subject to decisions taken jointly at the
international level, whether it involves policy frameworks (e.g. the Common Agricultural
Policy), health and environmental standards or economic support for production and advisory
services. In addition, over the last two decades, evaluation has ceased to be confined to the assessment of the productive performance of farm activity. New stakeholders have joined the discus-
sion with concerns related to the environmental performances of agriculture and to its
contributions to rural development and social cohesion.
In the case of farm advisory services, for example, a global forum has been created (the
Global Forum for Rural Advisory Services, or G-FRAS) to facilitate collective discussion,
working groups, reports and evaluation initiatives. In Europe, the European Commission has
commissioned an evaluation of the implementation of advisory services in different member
countries. These initiatives highlight sensitive issues about the use of evidence according to
the goal of the evaluation: i) when measuring the effects of alternative advisory interventions
(e.g. debates about the probative force of alternative methods for impact assessment); ii) when
assessing the robustness of the causal scheme of these interventions (e.g. does the idea of
knowledge diffusion, upon which many of these interventions are based, hold up in the field?);
and iii) even when promoting learning through evaluation.
Ideally, an evaluation procedure should be aimed at producing results based on evidence of
the best possible quality. However, such a view remains highly theoretical, and blind spots persist in how such evidence is actually produced. As demonstrated below in three kinds of
ex-post evaluation, the adequacy of a type of evidence varies depending on the goal of the
evaluation.
In other words, the evaluation process does not examine in detail the mechanisms by which
an action is effective; public programs mobilize a large number of factors and it is often
impossible to observe every form of interaction between them. In most cases, evidence of
effectiveness is sought in order to prove that the program made a difference, not to describe
the mechanisms that made the measure effective, nor to check whether the effects confirm an underlying theory of action. Therefore, the evaluator does not open the ‘black box’ of the
evaluated program. For instance, evidence that an agri-environmental scheme has been effec-
tive in maintaining biodiversity can be sought, without analysing the specific ecological, eco-
nomic and social mechanisms that contributed to that outcome.
In this approach, the impact I of a program (the ‘treatment’ T) on an outcome O is estimated as the difference between the expected outcome with and without the treatment:

I = E(O | T = 1) − E(O | T = 0)    (1)

Such a measure is only meaningful if
the population ϕ divides into two groups that are identical with respect to all other features causally
relevant to the targeted outcomes, O, except for the policy treatment T, and its downstream
consequences. (Cartwright, 2011: 18)
The main pitfall in this situation is a selection bias where differences exist between the
‘treated’ group and the control group (stemming from observable or unobservable factors)
which could explain variations in levels of O independently of the effects of the program T.
In light of this, evidence-based decision studies in the medical field rank the methods
used in terms of their ability to reduce this bias: the smaller the bias, the higher the level of
evidence. Traditionally, randomized controlled trials (RCT) are viewed as the ‘gold stand-
ard’ for measuring the outcomes of a specific program. Selection bias is eliminated by ran-
domly distributing individuals in the treated group and the control group. For this reason,
new experimental evaluation methods (Duflo and Kremer, 2005) are emerging in various
sectors (e.g. justice, education, the social sciences as well as the environment and agricul-
ture). However, while such methods are widespread in health-related fields, they are less
used for other public programs, where the randomization of beneficiaries of a public pro-
gram can pose technical and ethical problems. In cases where an RCT cannot be undertaken,
‘quasi-experimental’ methods such as matching or double differencing are considered the
most reliable alternatives (Bro et al., 2004). Matching involves pairing individuals who
benefited from the program with individuals who did not and comparing the levels of indi-
cator variables. The goal is to pair individuals who are as similar as possible, particularly in terms of their likelihood of benefiting from the program. The double differ-
ence method is a combination of a comparison before and after the implementation of a
public program and a comparison with and without the program. Differences in O are meas-
ured with proxy variables in both the beneficiary group and the control group. Nevertheless,
both matching and double differencing have limitations. Matching can only pair individuals on observable variables, with the risk that unobservable ones (skills, attitudes, social capital) induce a selection bias. Double differencing relies on the hypothesis that such unobservable variables have a constant effect over time.
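As a minimal illustration of the logic just described (our sketch, on synthetic data; the numbers and variable names are invented for the example), the following compares the naive difference in means of equation (1) with a double-difference estimate when beneficiaries start from a higher baseline:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Synthetic farms: T = 1 if the farm benefits from the program, 0 otherwise.
T = rng.integers(0, 2, n)

# Selection bias: beneficiaries start from a higher baseline outcome (+5).
baseline = rng.normal(50, 10, n) + 5 * T
true_effect = 3.0   # effect of the program
trend = 2.0         # common time trend affecting every farm

y_before = baseline
y_after = baseline + trend + true_effect * T

# Naive post-program comparison, as in equation (1): E(O | T = 1) - E(O | T = 0).
# It absorbs the baseline gap (+5) on top of the true effect (+3).
naive = y_after[T == 1].mean() - y_after[T == 0].mean()

# Double difference: (after - before) for beneficiaries minus (after - before)
# for non-beneficiaries. Any time-constant difference between groups cancels out.
did = ((y_after[T == 1] - y_before[T == 1]).mean()
       - (y_after[T == 0] - y_before[T == 0]).mean())

print(f"naive difference:  {naive:.2f}")   # roughly 8 (biased)
print(f"double difference: {did:.2f}")     # roughly 3 (close to the true effect)
```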
Such methods have already been used to evaluate farm advisory service policies (Davis
et al., 2012; Godtland et al., 2004; Van den Berg and Jiggins, 2007). But to ensure the
empirical reliability of this kind of work, methodological precautions must be taken which
may limit the scope of findings. Below are four examples related to farm advisory service
programs:
1) The first problem bears on the requirement for a random distribution of farmers who
benefited from these advisory services programs and those who did not (in the case
of RCTs). Aside from the ethical issues raised, this requirement is also contrary to
the diagrams of causality of certain programs, such as participative and bottom-up
interventions (e.g. farmer field schools): the effectiveness of such programs theo-
retically depends on the self-motivated participation of farmers in a collective
project.
2) The second problem bears on an essential hypothesis of the methodologies of impact
evaluation based on RCTs or quasi-experimental designs: beneficiaries must not
be influenced by the fact that non-beneficiaries do not benefit from the program,
and vice versa (Stable Unit Treatment Value Assumption – SUTVA). This hypoth-
esis may also be contrary to the diagrams of causality underlying certain advisory
service programs, particularly those built on so-called diffusionist models (e.g. the
World Bank’s Train & Visit program): in theory, their effectiveness resides in the
fact that farmers who directly receive advice will share acquired knowledge with
those who have not.
3) The third problem is the choice of indicators. Evaluating the impact of farm advisory
services supposes the ability to identify a proxy of the expected results. At which
level should this result be selected (Van den Berg and Jiggins, 2007)? The level of
farm performance (yield, income, etc.); the level of the adoption of innovations; or
the level most directly affected by farm advisory services: farmers’ knowledge and
skills? The question then becomes how to express this knowledge and these skills in
quantitative variables. In that respect, Godtland et al. (2004) have stressed the diffi-
culties and limitations of their attempt to express farmers’ knowledge through knowl-
edge tests. Likewise, effects on this proxy have to be observable over relatively short durations (due to costs, RCTs are often run as one- to two-year population tests). However, in the case of farm advisory services, one can wonder
whether this short-term measure makes any sense due to certain mid- or long-term
dimensions of learning processes.
4) The last aspect is related to the distributive effects of the evaluated policy. In most
impact studies, the effect is calculated by looking at the difference between the average
obtained by the group of individuals benefiting from the measure in a sample and that
of the individuals who do not benefit. However, an average improvement for the target
population can hide great inequalities or even aggravate these inequalities. Abadie
et al. (2002) have shown for instance that a training program for poor populations
could result in an increase in the average income of the target populations, but have no
effect on the poorest fraction of this population.
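The distributive point in item 4 can be illustrated with a small synthetic sketch (ours, not drawn from Abadie et al.): a hypothetical program that raises the average outcome of the treated group while leaving its poorest fraction essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic incomes for a control and a treated group drawn from the same distribution.
control = rng.lognormal(mean=7.0, sigma=0.6, size=n)
treated = rng.lognormal(mean=7.0, sigma=0.6, size=n)

# Hypothetical program: it raises incomes by 10 percent, but only for people
# above the 25th percentile, leaving the poorest quarter untouched.
cutoff = np.quantile(treated, 0.25)
treated = np.where(treated > cutoff, treated * 1.10, treated)

avg_effect = treated.mean() - control.mean()
p10_effect = np.quantile(treated, 0.10) - np.quantile(control, 0.10)

print(f"average effect: {avg_effect:.1f}")                 # clearly positive
print(f"effect at the 10th percentile: {p10_effect:.1f}")  # close to zero
```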
This example of the evaluation of farm advisory services shows that the measurement of
the impact of public programs is only rigorous if the methods used are consistent with specific
hypotheses associated with the method of data collection (e.g. randomization, a lack of diffu-
sion-related effects).
In other words, the experimental settings of the production of evidence of effectiveness are
such that they cause many problems of generalization and external validity. This knowledge is
only valid for a specific population ϕ in a particular environment characterized by a specific
causal structure CS_{t'}(O_t). And it can only be extended to populations θ that share the same causal structure CS_{t'}(O_t). Some authors propose to solve this ‘environmental dependence’
issue by replicating measures of effectiveness (with an RCT) in various contexts, but ‘worry
that there is little incentive in the system to carry out replication studies (because journals may
not be as willing to publish the fifth experiment on a given topic as the first one), and funding
agencies may not be willing to fund them either’ (Banerjee and Duflo, 2009: 161). But the
problem is not a financial one. In any case, replication alone cannot be a solution; a theory
about causal structures is necessary to identify the scale and boundaries of different θ popula-
tions that may share the same causal structure. It is necessary to rely on theories to identify
mechanisms that characterize the causal structure of the target populations of the policies.
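Stated compactly (in our notation, restating the condition just described), an impact measured on ϕ can be extended to another population θ only if the two populations share the same causal structure:
\[
I \ \text{measured on}\ \varphi \ \text{extends to}\ \theta
\quad\text{only if}\quad
CS^{\varphi}_{t'}(O_t) \;=\; CS^{\theta}_{t'}(O_t).
\]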
When evidence of mechanism is produced, regularities or recurring facts are identified so as to determine the various causes {C^1_{t'},…,C^n_{t'}} and the set of causal relations {Prob(O_t | L^1_{t'}),…,Prob(O_t | L^m_{t'})} by which the implementation of a program has expected or unexpected effects. These effects can directly
relate to the goal of the program or to its broader context. The evaluation will thus depend on
the nature of the problem in question: at stake are the specificities of this problem in a particu-
lar context and the assessment of the degree of generality of the proposed solutions for further
action.
In certain cases, to improve the quality of the measurement of impacts, the evaluation is
constructed using a preliminary analysis of the theory underlying the program (program the-
ory). A first step is understanding (before the measurement) the causal mechanisms that guided
the design of the program. The role of the evaluator consists, more precisely, in putting forth
hypotheses on the main features of the causal structure linking a program and its potential
subsequent effects. The aim is to build a diagram that traces these patterns of causality; this diagram constitutes the theory of the program and is a simplified representation of the comprehensive causal structure. Once it is established, such a diagram becomes a reference framework and
the basis of the evaluation approach for the evaluator, who then proposes indicators that will
be useful for measuring impacts.
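As an illustration of what making such a diagram explicit can look like (our sketch; the chain of steps echoes the hypothetical incentive-to-biodiversity example used elsewhere in this article), a program theory can be written down as an ordered set of causal links, each paired with a candidate indicator:

```python
# A hypothetical program-theory diagram for an agri-environmental scheme, written
# as an explicit chain of causal links, each paired with a candidate indicator.
# Steps and indicators are illustrative only, not taken from an actual evaluation.
program_theory = [
    ("financial incentive paid", "uptake of the scheme", "number of contracts signed"),
    ("uptake of the scheme", "change in farming practices", "practices declared in farm surveys"),
    ("change in farming practices", "reduced chemical input use", "kg of inputs per hectare"),
    ("reduced chemical input use", "biodiversity maintained", "field counts of indicator species"),
]

def print_causal_chain(links):
    """Print each hypothesized causal link with the indicator proposed to observe it."""
    for cause, effect, indicator in links:
        print(f"{cause} -> {effect}  [indicator: {indicator}]")

print_causal_chain(program_theory)
```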
The analysis of the causal structure of the program allows a better understanding of the
distributive effects of a program within the target population and across populations. However,
the diagram that is built is only a simplified representation of the proposed causal structure.
Therefore, some of the ways in which evidence on mechanisms is used in the evaluation process raise questions, as illustrated by the following example.
In many evaluations of agri-environmental measures, data on changes in farming practices are collected (about crop rotation, plant pest management, etc.). They are linked to agri-ecological indicators to calculate the potential risks and effects of these changes (e.g. the use of fewer chemical inputs is associated with a positive impact on biodiversity) (Mitchell et al., 1995; Van der Werf and Petit, 2002).
However, it is impossible to identify and take into account the many existing mechanisms
that interact in various contexts. Thus, the causal diagram that underlies these actions is only
an approximation of a comprehensive causal structure that ideally could allow their effect to
be fully predicted. The studies that examine these types of methods all point out that these measures identify ‘potential effects’ but fail to measure actual impacts. Nevertheless, these qualifications are often absent from the executive summaries of reports that present evaluation results. Variations in the value of an indicator can thus be presented as evidence of an improvement in environmental performance. This is not only improper from a formal point of view; the few experimental tests carried out on this issue also show that it is not an acceptable estimate. For instance, Kleijn and Sutherland (2003) and Kleijn et al. (2006) show that certain measures which were successful in terms of ‘policy performance’ did not have the expected environmental impact.
Such doubts about the effectiveness of certain agri-environmental schemes can be linked to the weakness of the theoretical models upon which they are based, as well as to a lack of empirical data with which to identify what works and what does not (McNeely et al., 2005). The work done for the Millennium Ecosystem Assessment demonstrated the importance of these knowledge gaps (Carpenter et al., 2006). This concerns both evidence of difference-making and evidence of mechanism.
1) Identifying the mechanisms by which the actions were effective (or not) is essential to
producing generic knowledge that can be used to develop new programs (e.g. a causal
relation which can be exploited in various contexts). It can also help assess the generic
nature of the knowledge used in the program (e.g. to what extent the causal structure of
two different populations can be considered similar) and to raise new issues for the
evaluators and stakeholders involved in the evaluation.
2) In certain situations, it makes sense to rank results based on the opinions of respected
authorities, single case studies, observations on wider samples of situations, etc. in
order to assess the robustness of available evidence. However, the use of theoretical
models to infer the effective impact of a program, as sophisticated as they may be, is
often limited. The causality diagrams formalized in these theoretical models are only
ever partial representations of complex causal structures. Their predictive capacities
vary according to the object under evaluation and the context; therefore one cannot
replace the observation of the real effects (and the production of evidence of effective-
ness) with that of expected effects (estimated using an analysis of the means imple-
mented in the program).
Some authors have, for example, used Soft Systems Methodology (SSM) to design and evaluate technical advisory programs (Rochs and Navarro, 2008). SSM is designed to help a ‘human activity system’ (HAS) make the most
effective decisions in uncertain and complex contexts (Checkland, 1981) where learning is the
priority. Checkland and Scholes (1990) point out that SSM as a model is not intended to estab-
lish versions of reality. Instead, it aims to facilitate debate so that collective decisions and
action can be taken in problem situations. The seven stages of SSM are (Checkland, 1981):
i. inquiring into the situation (identifying the problem using different communication
techniques: brainstorming, interviews, participant observation, focus groups, etc.);
ii. describing the situation (describing the context using a wide variety of sources);
iii. defining HAS (identifying program stakeholders, and interviewing them on the trans-
formations they are expecting);
iv. building conceptual models of the HAS (representing the relationships between stake-
holders in the program being designed or evaluated);
v. comparing the conceptual models with the real world (preparation of a presentation of
the model for a debate with stakeholders);
vi. defining desirable and feasible changes;
vii. implementation (Rochs and Navarro, 2008).
Corroboration with facts and producing the best possible evidence do not appear to be at
the heart of this conception/evaluation approach, which instead aims at promoting and struc-
turing debate between program stakeholders to arrive at a consensual solution. In practice,
however, significant problems arise (Salner, 2000). In workshops, for example, evidence is
provided by different stakeholders verbally, and must be verified. Salner (2000) likens this
method to journalism, in that it involves the verification of the opinions of different stakehold-
ers so that ‘analysis makes it possible to mount an argument for change which was not simply
an intuitive reaction to a conversation held; it was an argument which could be explicitly
retraced at any time with links to supporting evidence’ (Checkland and Scholes, 1990: 198–9).
Verification is thought to be guaranteed by the open, public and collective nature of the debate.
This comparison with ‘fact checking’ in journalism, however, only holds true if the evidence presented is evidence of presence, describing facts known through stakeholder practices. Instead,
arguments often go deeper and target the expected or measured impact of programs and even
the causality diagram upon which they are based. These evaluation methods thus rely not only on evidence of presence but also on evidence of effectiveness and of mechanism, yet they do not formalize this integration. This lack of formalization manifests itself on two levels: (i) in the use of scientific knowledge to formulate hypotheses on how public programs function, and (ii) in the verification of the level of evidence obtained.
Ultimately, these formalization tasks are implicitly transferred to workshop leaders (often
researchers). This situation poses a number of problems as it is assumed that these leaders
have extensive skills and means at their disposal (to produce state-of-the-art reports of avail-
able scientific literature, statistical analyses and various types of verifications). For this rea-
son, several authors have pointed out that SSM can be exploited to reinforce existing power relations, given the asymmetries of information between stakeholders:
the kind of open, participative debate that is essential for the success of the soft system approach, and
is the only justification for the result obtained, is impossible to obtain in problem situations where
there is a fundamental conflict between interest groups that have access to unequal power resources.
Soft system thinking either has to walk away from these problem situations, or it has to fly in the face
of its own philosophical principles and acquiesce in proposed changes emerging from limited debates
characterized by distorted communication. (Jackson, 1991: 198)
•• The issue of level of evidence is often neglected and seen as secondary to collective
learning objectives. All contributions are accepted equally and the reliability of evi-
dence is not subject to systematic testing procedures.
•• Very quickly, evidence presented by participants with different interests can be in com-
petition and arbitration is often based on non-transparent criteria.
•• Without a systematic, clear verification procedure for evidence, learning may focus
more on the ability to reach consensual positions than on the ability to use the best tools
for achieving a given objective and on evaluating outcomes in a rigorous manner.
Conclusion
This article is not intended as a standard-setting tool. Our goal is to contribute to building a
theory of evidence for evaluation that allows different stakeholders to better judge the quality
of evidence they seek depending on their project.
We have illustrated that while evaluation may have very different objectives (e.g. under-
standing the mechanisms of public programs, measuring their specific impacts, or supporting
collective learning to favour the emergence of an agreement between stakeholders in the pro-
grams), each objective leads to a different examination of the question of types of evidence, i.e. what is the object of evidence (presence, making a difference, mechanism). This concern must be clearly distinguished from the study of levels of evidence, which deals with data collection and interpretation (e.g. single case observations, difference methods, RCTs), as each of these methods can be used for producing each type of evidence.
With this in mind, the issue of RCTs must be re-examined, along with the types of evidence
for which these methods are used. Experimental economics can be used as a tool to test some
hypotheses on mechanisms rather than only be used to assess the impact of a policy in a given
environment. Nevertheless, whether RCTs are a relevant tool in that respect is a matter of
ongoing discussion both in medical sciences and in economics (Deaton, 2009). A key question
in this debate is the importance of heterogeneity and distributive effects across populations,
which are not acknowledged by RCTs, but which can be essential for formulating theories in
various scientific areas (economics, management science, but also bio-medical sciences and
ecology among others).
For each situation, the quality of evidence can be assessed according to three dimensions.
Ideally, as mentioned above, one would like to base a decision on evidence that is at once socially relevant (addressing phenomena considered by each stakeholder to be important), of a high level (with probative force) and of the adequate type for the goals of the evaluation. This ideal is usually inaccessible, for various reasons including cost, meth-
odological constraints, and the need to select precise objectives from a large number of pos-
sible points of view.
Evaluators are permanently confronted with trade-offs. The three examples above show
that a better understanding of quality of evidence can help better assess the limits inherent in
the conclusions of every evaluation depending on the quality of evidence on which they are
based. In the real world, every evaluation process has its own limits and can only produce reli-
able results for a particular field of interest. Choices should thus be made that will involve
institutional issues and possible conflicts of interest. As is the case with any policy instrument,
the final decision depends on a multiplicity of factors which cannot be reduced to evidence
issues alone. However, a clear specification of the limits of validity of findings is a prerequi-
site to avoid misinterpretations. A better shared knowledge of the type and the level of evi-
dence that is used to evaluate the result of interventions can help clarify for various stakeholders
what is at stake in making alternative choices.
Acknowledgements
The authors would like to thank the anonymous referees and editors who provided useful and inspiring
comments on an earlier version of this article.
Funding
This research was conducted in an interdisciplinary research program funded by the French National
Agency for Research (program EBP-Biosoc/ADD). It is based on the combination of former research
experience on evaluation theories (M. Berriet-Solliec), on international debates on evaluation of farm-
advisory services (P. Labarthe) and on quality of evidence (C. Laurent).
References
Abadie A, Angrist J and Imbens G (2002) Instrumental variables estimates of the effect of subsidized
training on the quantiles of trainee earnings. Econometrica 70: 91–117.
Adams WM, Avelling R, Brockington D, Dickson B, Elliot J, Hutton J et al. (2004) Biodiversity con-
servation and the eradication of poverty. Science 306: 1147–9.
Banerjee AV and Duflo E (2009) The experimental approach to development economics. The Annual
Review of Economics 1: 151–78.
Bro E, Mayot P, Corda E and Reitz F (2004) Impact of habitat management on grey partridge popula-
tions: assessing wildlife cover using a multisite BACI experiment. Journal of Applied Ecology 41:
846–57.
Carpenter S, DeFries R, Dietz T, Mooney H, Polasky S, Reid W and Scholes R (2006) Millennium Ecosystem Assessment: research needs. Science 314: 257–8.
Cartwright N (2011) Evidence, external validity and explanatory relevance. In: Morgan GJ (ed.),
Philosophy of Science Matters: The Philosophy of Peter Achinstein. New York: Oxford University
Press, 15–28.
Cartwright N and Hardie J (2012) Evidence-Based Policy: A Practical Guide to Doing It Better. Oxford:
Oxford University Press.
Checkland PB (1981) Systems Thinking, Systems Practice. New York: John Wiley.
Checkland PB and Scholes J (1990) Soft Systems Methodology in Action. Chichester: John Wiley &
Sons.
Chen HT (1990) Theory-Driven Evaluation. Newbury, CA: SAGE.
Chen HT and Rossi PH (1983) Evaluating with sense: the theory-driven approach. Evaluation Review 7(3): 283–302.
Cousins JB and Whitmore E (1998) Understanding and participatory evaluation. New Directions for
Evaluation 80: 69–80.
Davis KE (2008) Extension in Sub-Saharan Africa: overview and assessment of past and current mod-
els, and future prospects. Journal of International Agricultural and Extension Education 15(3):
15–28.
Davis KE, Nkonya E, Kato E, Mekonnen DA, Odendo M, Miiro R and Nkuba J (2012) Impact of farmer
field schools on agricultural productivity and poverty in East Africa. World Development 40(2):
402–13.
Deaton AS (2009) Randomization in the tropics and the search for the elusive keys to economic devel-
opment. National Bureau of Economic Research Working Paper 14690. Cambridge, MA.
Donaldson SI (2007) Program Theory-driven Evaluation Science: Strategies and Applications. New
York: Routledge.
Donaldson SI (2008) In search of the blueprint for an evidence-based global society. In: Donaldson SI, Christie CA and Mark HH (eds), What Counts as Credible Evidence in Evaluation and Evidence-based Practice? Thousand Oaks, CA: SAGE, 2–18.
Donaldson SI, Christie CA and Mark HH (2008) What Counts as Credible Evidence in Evaluation and
Evidence-based Practice? Thousand Oaks, CA: SAGE.
Duflo E and Kremer M (2005) Use of randomization in the evaluation of development effectiveness. In:
Pitman G, Feinstein O and Ingram G (eds), Evaluating Development Effectiveness. New Brunswick,
NJ: Transaction Publishers, 205–32.
Fetterman DM (1996) Empowerment evaluation: an introduction to theory and practice. In: Fetterman, Kaftarian and Wandersman (eds), Empowerment Evaluation: Knowledge and Tools for Self-Assessment & Accountability. Thousand Oaks, CA: SAGE, 3–46.
Fetterman DM and Wandersman A (2005) Empowerment Evaluation. Principles and Practice. New
York: The Guilford Press.
Fitzpatrick JL, Sanders JR and Worthen BR (2011) Program Evaluation: Alternative Approaches and
Practical Guidelines. Upper Saddle River, NJ: Pearson Education.
Godtland EM, Sadoulet E, de Janvry A, Murgai R and Ortiz O (2004) The impact of farmer field
schools on knowledge and productivity: a study of potato farmers in the Peruvian Andes. Economic
Development and Cultural Change 53(1): 63–92.
Hansen HF and Rieper O (2009) The evidence movement: the development and consequences of meth-
odologies in review practices. Evaluation 15: 141–63.
Illari PM (2011) Mechanistic evidence: disambiguating the Russo-Williamson thesis. International
Studies in the Philosophy of Science 25(2): 139–57.
Jackson M (1991) Systems Methodology for the Management Sciences. New York and London: Plenum
Press.
Jordan GB, Hage J and Mote J (2008) A theories-based systemic framework for evaluating diverse
portfolios of scientific work, part 1: micro and meso indicators. New Directions for Evaluation
118: 7–24.
Kleijn D and Sutherland W (2003) How effective are European agri-environment schemes in conserving and promoting biodiversity? Journal of Applied Ecology 40: 947–69.
Kleijn D, Baquero RA, Clough Y, Díaz M, Esteban J, Fernández F et al. (2006) Mixed biodiversity benefits of agri-environment schemes in five European countries. Ecology Letters 9(3): 243–54.
Labarthe P and Laurent C (2013) Privatization of agricultural extension services in the EU: towards a
lack of adequate knowledge for small-scale farms? Food Policy 38: 240–52.
Laurent C and Trouvé A (2011) Competition of evidences and the emergence of the ‘evidence-based’ or ‘evidence-aware’ policies in agriculture. 122nd EAAE Seminar ‘Evidence-based agricultural
and rural policy making: methodological and empirical challenges of policy evaluation’. Ancona,
Italy, 17–18 February 2011.
Laurent C, Berriet-Solliec M, Kirsch M, Perraud D, Tinel B, Trouvé A et al. (2009) Pourquoi s’intéresser
à la notion d’Evidence-based policy? Revue Tiers Monde 200: 853–73.
Liberati A, Buzzetti R, Grilli R, Magrini N and Monozzi S (2001) Evidence-based case review. Which
guidelines can we trust? Assessing strength of evidence behind recommendations for clinical prac-
tice. Western Journal of Medicine 174: 262–5.
Lipsey MW (2007) Method choice for government evaluation: the beam in our own eye. In: Julnes G
and Rog DJ (eds), Informing Federal Policies on Evaluation Methodology: Building the Evidence
Base for Method Choice in Government Sponsored Evaluation. New Directions for Evaluation,
vol. 113. San Francisco, CA: Jossey-Bass, 113–15.
McNeely JA, Faith DP, Albers HJ et al. (2005) Biodiversity. In: Chopra K, Leemans R, Kumar P and
Simons H (eds), Ecosystems and Human Well-Being: Volume 3. Policy Responses. Washington,
DC: Island Press, 119–72.
Mayne J (2012) Contribution analysis: coming of age? Evaluation 18: 270–80.
Mertens D (1999) Inclusive evaluation: implications of transformative theory of evaluation. American
Journal of Evaluation 20(1): 1–14.
Mitchell G, May A and McDonald A (1995) PICABUE: a methodological framework for the develop-
ment of indicators of sustainable development. International Journal of Sustainable Development
& World Ecology 2: 104–23.
Oliver S, Harden A, Rees R, Shepherd J, Brunton J, Garcia J and Oakley A (2005) An emerging frame-
work for including different types of evidence in systematic reviews for public policy. Evaluation
11: 428–46.
Patton MQ (2008) Utilization Focused Evaluation, 4th edn. Thousand Oaks, CA: SAGE.
Pawson R (2002) Evidence-based policy: in search of a method. Evaluation 8: 157–81.
Pawson R (2006) Evidence-based Policy: A Realistic Perspective. London: SAGE.
Pawson R and Tilley N (1997) Realistic Evaluation. London: SAGE.
Petticrew M and Roberts H (2003) Evidence, hierarchies and typologies: horses for courses. Journal of
Epidemiology and Community Health 57: 527–9.
Primdahl J, Peco B, Schramek J, Andersen E and Onate JJ (2003) Environmental effects of agri-environ-
mental schemes in Western Europe. Journal of Environmental Management 67: 129–138.
Rochs F and Navarro M (2008) Soft System Methodology: an intervention strategy. Journal of
International Agricultural and Extension Education 15(3): 95–9.
Rogers P (2008) Using programme theory to evaluate complicated and complex aspects of interven-
tions. Evaluation 14(1): 29–48.
Rossi PH, Lipsey MW and Freeman HE (2004) Evaluation: A Systematic Approach, 7th edn. Newbury
Park, CA: SAGE.
Salner M (2000) Beyond Checkland & Scholes: improving SSM. Occasional Papers on Systemic
Development 11: 23–44.
Schwandt T (2003) ‘Back to the rough ground!’ Beyond theory to practice in evaluation. Evaluation
9(3): 353–64.
Schwandt T (2008) Toward a practical theory of evidence for evaluation. In: Donaldson SI, Christie
CA and Mark HH (eds), What Counts as Credible Evidence in Evaluation and Evidence-based
Practice? Thousand Oaks, CA: SAGE, 197–212.
Shadish WR, Cook TD and Campbell DT (2002) Experimental and Quasi Experimental Designs for
Generalized Causal Inference. Boston, New York: Houghton Mifflin Company.
Shadish WR, Cook TD and Leviton LC (1991) Foundations of Program Evaluation Theories of
Practice. Newbury Park, CA: SAGE.
Stame N (2004) Theory-based evaluation and varieties of complexity. Evaluation 10(1): 58–76.
Stern E (2004) Philosophies and types of evaluation research. In: The Foundations of Evaluation and
Impact Research. Third report on vocational training research in Europe: background report.
Luxembourg: Office for Official Publications of the European Communities (Cedefop Reference
Serie, 58), 12–42.
Stern E, Stame N, Mayne J, Forss K, Davies R and Befani B (2012) Broadening the range of designs and
methods for impact evaluations. DFID Working Paper 38, London.
Stufflebeam DL (2001) ‘Evaluation Models’: New Directions for Evaluation, 89. San Francisco, CA:
Jossey-Bass.
Van den Berg H and Jiggins J (2007) Investing in farmers: The impact of farmer field schools in relation
to Integrated Pest Management. World Development 35(4): 663–87.
Van der Sluijs J, Douguet J-M, O’Connor M, Guimaraes Pereira A, Quintana SC, Maxim L and Ravetz J (2008) Qualité de la connaissance dans un processus délibératif. Natures Sciences Sociétés 16: 265–73.
Van der Werf H and Petit J (2002) Evaluation of the environmental impact of agriculture at the farm level: a comparison and analysis of 12 indicator-based methods. Agriculture, Ecosystems and Environment 93: 131–45.