Association Mapping in Plants
Edited by
Nnadozie C. Oraguzie
The Horticulture and Food Research Institute of New Zealand Ltd
(HortResearch)
Havelock North, New Zealand
Susan E. Gardiner
The Horticulture and Food Research Institute of New Zealand Ltd
(HortResearch)
Palmerston North, New Zealand
H. Nihal De Silva
The Horticulture and Food Research Institute of New Zealand Ltd
(HortResearch)
Auckland, New Zealand
Dr Nnadozie C. Oraguzie
HortResearch
Hawkes Bay Research Centre
Private Bag 1401
Havelock North
New Zealand

Dr Erik H.A. Rikkerink
HortResearch
Mt. Albert Research Centre
Private Bag 92169
Auckland
New Zealand
springer.com
Preface
The approach taken for locating the genes that underlie human diseases has shifted from
pedigree-based linkage studies to population-based association studies. In both cases the
proximity of a genetic marker to a susceptibility locus is inferred from statistical
measures that reflect the number of recombination events between them. In a disease
pedigree there are no more than a few hundred opportunities for recombination, so
recombination rates of less than about one percent cannot be estimated and genes can be
located only coarsely on a genetic map with that approach. The linkage disequilibrium
detected in an association study, however, reflects the action of many thousands of
recombination events since the initial disease mutation, and the expectation is that
susceptibility genes can then be mapped more accurately.
The editors of this volume have recognized the need for parallel activity in plant species.
For the past 20 years, the genes that affect plant economic traits have usually been
mapped with data collected from “pedigrees” of populations formed by crossing inbred
lines. These Quantitative Trait Loci have been mapped on a coarse scale, and a QTL is
likely to refer to several genes in a region. The move to population-based association
studies was therefore as necessary in plants as it was in humans, and readers will find this
book to be a useful review of the marker technology, statistical methodology, and
progress to date. Although one of the authors fears that “plant genetics can be considered
as less advanced than human genetics” the chapters suggest that if that is the case it will
not be so for long.
The recent increased activity in association mapping in humans has rested on the
development of efficient and affordable methods for discovering and employing Single
Nucleotide Polymorphism markers. Plant geneticists cannot command the resources
available to their human geneticist colleagues, but they can anticipate benefiting from the
success of the International HapMap Project. The improvement in marker technology
from such large projects will inevitably be imported to plant studies. The editors have
provided helpful guides to the use of SNPs in association studies.
Along with the substantial increase in the volume of data when large numbers of
individuals are typed at millions of SNPs there are substantial challenges in the statistical
interpretation of the data. This book contains a valuable account of the issues of multiple
testing and an accessible account of False Discovery Rates. The more basic concepts of
linkage disequilibrium and case-control versus family-based association tests are also
discussed. It is often the case that geneticists do not receive extensive statistical training
and the coverage of the theory of estimation and testing is therefore welcome. Readers
will notice a greater use of Bayesian methods than is usually found in statistical genetics
books. Such methods are appearing more frequently in scientific papers.
I congratulate the editors and all the authors on this timely and comprehensive treatment
of association mapping in plants. The importance of food and fiber for human welfare
cannot be overstated, and progress in plant improvement will rest in no small part on the
work described in these pages. On a personal level, I am delighted by the leadership
shown by my fellow antipodeans.
B.S. Weir
Professor and Chair
Department of Biostatistics
University of Washington
Acknowledgements
The motivation to write this book came from extensive correspondence and, finally, a
meeting in Tulln, Austria, in 2004 at a EUCARPIA quantitative genetics conference with
Keri Witman, former senior editor in plant sciences at Springer Science+Business Media,
New York, USA, who continually stressed the need for a book in this field of research.
The recommendations and suggestions of the anonymous referees consulted by
Springer to review the original proposal kept up the initial enthusiasm. This book would
not have been possible without HortResearch’s support. In particular, we would like to
thank the following people: Drs Vincent Bus (Science Leader, Pipfruit and Summerfruit
breeding), Andrew Granger (Future Fruit Group Leader) and Bruce Campbell (General
manager science operations) for their encouragement and moral support; Stuart Ritchie
for assistance with contract negotiation; Sharlene Cookson for assistance with book cover
design; Dr David Chagné for his immense contribution from the proposal stage of
the book until it went to press; and the SPU team, especially Dr Anne Gunson and
Christine Lamont, for editorial and technical assistance. We are grateful to the following
for permission to reproduce copyright material: Trustees of the Royal Botanic Gardens,
Kew; Elsevier (Figure 1 in Trends in Genetics 11: 83–90 (2002) and Tables 1 & 2 in
Trends in Genetics 20(2): 105 (2004)); Swedish National Biobanking program,
Wallenberg Consortium North (http://www.meb.ki.se/genestat.htm; pairwise D′ for 45
SNPs within a linked region); The University of Chicago Press (Table 3, American
Journal of Human Genetics 60(3): 676–690, 1997); Annual Reviews (Table 1 in Annual
Review of Plant Biology 54, 2003); Nature Publishing Group (Table 3 in Nature Genetics
28: 286–289); and MacMillan Publishing Inc. (Tables 2, 3, 4, 5 & 6 in Genetics 170:
859–873, 2005). Finally, we thank the following colleagues who reviewed the manuscripts and
made suggestions that significantly improved the quality of the chapters: Drs David B.
Neale (Professor of Forest Genetics, Dept of Plant Sciences, UCLA), Mark E. Sorrells
(Professor, Dept of Plant Breeding and Genetics, Cornell University, Ithaca, NY), Trudy
F.C. Mackay (WNR Distinguished Professor of Genetics, North Carolina State
Contents
Contributors
Introduction
2. Linkage Disequilibrium
Nnadozie C. Oraguzie, Phillip L. Wilcox, Erik H.A. Rikkerink,
and H. Nihal de Silva
Index
Contributors
Roderick D. Ball
Ensis Wood Quality, Scion (New Zealand Forest Research Institute Limited), 49 Sala Street,
Private Bag 3020, Rotorua, New Zealand
Jacqueline Batley
Primary Industries Research Victoria, Victorian AgriBiosciences Centre,
La Trobe R&D Park, Bundoora, Victoria 3083, Australia
Rowland D. Burdon
Ensis Genetics, Scion (New Zealand Forest Research Institute Limited), 49 Sala Street,
Private Bag 3020, Rotorua, New Zealand
David Chagné
HortResearch, Plant Gene Mapping group, Private Bag 11030, Palmerston North,
New Zealand
Craig E. Echt
USDA Forest Service, Southern Institute of Forest Genetics, 23332 MS Highway 67,
Saucier, MS 39574, USA
David Edwards
Primary Industries Research Victoria, Victorian AgriBiosciences Centre,
La Trobe R&D Park, Bundoora, Victoria 3083, Australia
John W. Forster
Primary Industries Research Victoria, Victorian AgriBiosciences Centre,
La Trobe R&D Park, Bundoora, Victoria 3083, Australia
Mark P. Dobrowolski
Primary Industries Research Victoria, Plant Genetics and Genomics Platform, Hamilton
Centre, Mt Napier Road Hamilton, Victoria 3300, Australia
Susan E. Gardiner
HortResearch, Palmerston North Research Centre, Private Bag 11030, Palmerston North,
New Zealand
Nnadozie C. Oraguzie
HortResearch, Hawkes Bay Research Centre, Cnr. Crosses and St. George’s Roads,
Private Bag 1401, Havelock North, New Zealand
H. Nihal De Silva
HortResearch, Mt Albert Research Centre, 120 Mt Albert Road, Private Bag 92169,
Auckland, New Zealand
Phillip L. Wilcox
Cell wall Biotechnology Centre, Scion (New Zealand Forest Research Institute Limited),
49 Sala Street, Private Bag 3020, Rotorua, New Zealand
Introduction
Most traits we deal with on a daily basis have complex inheritance patterns that
complicate the ability of existing mapping technologies to detect the underlying genetic
factors. In the last decade or so, we have seen the successful use of conventional map-
based strategies in identification and cloning of quantitative trait loci (QTLs) in model
plant species including tomato and Arabidopsis. However, efficient gene discovery with
this method will probably continue to be largely limited to those loci that have large
effects on quantitative trait variation. Techniques are also needed to more rapidly
identify genes that play a modest role in regulating quantitative trait variation.
Association mapping via linkage disequilibrium or LD (non-random association of alleles
at different loci) offers promise in this area. The traditional approach of linkage/QTL
mapping reliant on developing large mapping populations continues to suffer from lack
of mapping resolution inherent in samples with limited meiotic cross-over events. These
problems are exacerbated in tree crops, where very large populations are impractical from
a plant management point of view. In association mapping, there may not be any need to
make crosses initially to generate segregating populations. The natural variation that
exists in the available germplasm can be utilized for mapping straightaway.
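The definition of LD given above can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the chapters that follow: it assumes two biallelic loci with known haplotype and allele frequencies, and computes the coefficient D together with two commonly used standardized measures, |D′| and r². The function name and its arguments are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab : frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    """
    # D is the departure of the observed haplotype frequency from the
    # frequency expected if alleles at the two loci were independent.
    d = p_ab - p_a * p_b

    # The largest |D| attainable depends on the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Independent loci: the AB haplotype occurs at the product of the allele frequencies.
print(ld_measures(0.25, 0.5, 0.5))
# Complete association: allele A is always found with allele B.
print(ld_measures(0.5, 0.5, 0.5))
```

Under these assumptions, both standardized measures are zero when the loci are independent and reach one when the two alleles always occur together.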
Association genetics via LD mapping is an emerging field of genetic mapping that has
the potential for resolution to the level of individual genes (alleles) underlying
quantitative traits. LD mapping can take full advantage of rapid advances in molecular
biology, combining them with our increasing understanding of the molecular basis of
inheritance and with recently developed molecular markers and genetic maps, in a way
that could have a significant practical impact on breeding. The convergence of improved
statistical methods, growing plant genomics databases, and improvements in the
affordability and potential scale of sequencing and genotyping suggests that this
technology will probably be more widely adopted for mapping and gene discovery in
plants in the near future.
The book brings together all the information on association genetics and linkage
disequilibrium published in different journals in one volume and will be of interest to
advanced breeding/genetics students, researchers, professional plant breeders and
university lecturers. Breeders will find it particularly useful as a guide for making
decisions on breeding strategies that will facilitate identification of ‘superior’ parents for
development of new improved varieties. Difficult statistical concepts and tools are
presented with detailed illustrations in a way readers can comprehend; hence, the book
could also serve as a teaching aid for postgraduate students. A very comprehensive
comparison of statistical approaches/methodologies and guidelines on optimal study
design, as well as the comparison of the relative benefits of association mapping and
conventional QTL mapping will be particularly useful to geneticists wanting to set up
studies on gene/genome mapping.
1 The Horticulture and Food Research Institute of New Zealand Limited (HortResearch), Cnr Crosses and
St George’s Roads, P.B. 1401, Havelock North, New Zealand
2 Scion (New Zealand Forest Research Institute Limited), 49 Sala Street, P.B. 3020, Rotorua, New Zealand
2 N.C. ORAGUZIE ET AL.
Resolution of causative trait polymorphism | Low – moderate density linkage maps only required | High – disequilibrium within small physical regions requiring many markers
Experimental populations for detection | Defined pedigrees, e.g., backcross, F2, RI, three and two generation pedigrees/families, half-sib families, etc. | Linkage disequilibrium experiments: unrelated individuals (“unstructured” populations), large numbers of small unrelated families (e.g., transmission disequilibrium tests, TDT)
Marker discovery costs | Moderate | Moderate for few traits, high for many traits
Extent of inference | Pedigree specific, except where species has high extant LD | Species or subspecies wide
Number of markers required for genome coverage | 10² to low 10³ | 10⁵ for small genomes to ~10⁹ for large genomes
Secondly, there is the opportunity to use molecular markers to enhance rates of genetic gain,
including the utilization of specific genes from non-elite germplasm in a more directed
and efficient manner than was hitherto possible. A further application is the generation of
fundamental knowledge around the genetic architecture of extant variation in
populations, and the opportunity to determine evolutionary phenomena that have led to
existing population structures.
So why is association genetics now becoming more widely used? A number of key
factors have contributed to the recent interest in association genetics, including methods
for high throughput gene discovery, polymorphism detection, and genotyping (see
Chapters 3–5 for more discussion). The prevalence of many complex human diseases
such as asthma, cardiovascular disease, bipolar disorder, and diabetes, has increased over
the past two decades in developed countries (reviewed in Risch 2000). During the same
period, the genetic causes of such disorders have been increasingly emphasized as a
means to better understand their pathogenesis, with the ultimate goal of improving
preventative strategies, diagnostic tools, and treatment. Geneticists seeking to identify
the genetic causes of these disorders through conventional map-based strategies, including
linkage analysis, QTL mapping, and positional cloning, have met with
only limited success. However, these map-based approaches have been instrumental in
the identification and cloning of genes responsible for less common and simply inherited
human disorders, as well as traits controlled by major genes in plants. Examples of such
traits and the genes responsible for them in humans are breast cancer (BRCA-1 and -2),
Alzheimer’s disease (β-amyloid precursor protein (APP) and presenilin-1 and -2),
diabetes (maturity-onset diabetes of youth (MODY)-1, -2), colon cancer (familial
adenomatous polyposis (FAP) and hereditary nonpolyposis colorectal cancer (HNPCC))
(FPC), and heart disease (LDL receptor genes). The best examples of simply inherited traits
in plants controlled mostly by a single locus apart from Mendel’s well-known examples
in peas include resistance to certain pests and diseases, flower and fruit color, plant
growth habit, reproductive mechanisms (such as self incompatibility), and aspects of
genetic load. The map-based strategies have also been utilized for positional cloning of
genes that underlie QTL in plants (reviewed in Yano 2001). For example, the
morphological differences between maize and its wild relative teosinte have been studied
through the analysis of QTL. As a result of such studies, one of the major QTL involved
in maize domestication (teosinte branch 1) has been cloned. Other examples of cloned
genes underlying QTL in other plants are: in tomato, Brix9-2-2, which encodes
Lycopersicon apoplastic invertase (Lin5), responsible for soluble solids content, and fw2.2,
responsible for fruit size; in rice, Heading date 1, which encodes a protein with high
similarity to that encoded by the Arabidopsis gene Constans, responsible for photoperiod
sensitivity; in Arabidopsis, a QTL at the Frigida locus, responsible for vernalization
response to flowering (Johanson et al. 2000), and an allele of Cryptochrome 2 (Cry2),
responsible for variation in flowering time (El-Assal et al. 2001).
Despite the successes of conventional QTL mapping strategies, efficient gene
discovery with these methods will probably continue to be largely limited to those loci
that have large effects on quantitative trait variation. These loci have large effects
compared with the environmental effect. Furthermore, individuals in segregating
populations can usually be assigned to discrete groups corresponding directly to their
genotypes. Unlike these Mendelian traits for which (usually) alleles at single loci
4 N.C. ORAGUZIE ET AL.
studies based on LD may allow the identification of the actual genes represented by these
QTLs. Only polymorphisms with extremely tight linkage to a locus with phenotypic
effects are likely to be significantly associated with a trait in populations typically used
for association mapping, thus providing much finer resolution than QTL mapping based
on pedigreed populations (Remington et al. 2001).
The key advantages of association tests include their speed, because mapping
populations may not be necessary, particularly in crops that are limited to no more than
one generation per year. Controlled breeding is lacking in humans, as are large numbers
of progeny per family. These among other reasons may well be why association genetics
approaches have been exploited better and to a higher degree in human genetics studies.
The other advantage is high resolution as already mentioned. The resolution and cost of
association approaches depend largely on the nature and extent of LD, i.e., the
nonrandom association of alleles in test populations. LD can result from population
structure, selection, drift, or physical linkage. The physical extent of LD around a gene
determines the effectiveness of association mapping, and this could result from many
factors, including rate of out-crossing, the degree of artificial or natural selection on the
region or regions of the genome, recombination rate, chromosomal location, population
size and structure, and the age of the allele under study. In cultivated species, the extent
of LD will also be shaped by human selection and the bottlenecks associated with crop
dispersal beyond the center of origin (see Rafalski and Morgante 2004). The concept and
factors that influence LD are discussed in detail in Chapters 2 and 7. Estimates of LD are
important as an indicator of how useful LD-based association genetics approaches can be
when compared with other available mapping methods, on the basis of the trade-off
between population size and informativeness (Rafalski and Morgante 2004). Where LD
extends over long distances, genome-wide scans may be possible, albeit with poor
resolution. Conversely, if LD declines rapidly, examination of candidate genes
may be a more viable option for association studies, as genome-wide scans will require
excessively large numbers of markers, the cost of which will be prohibitive for
many applied breeding programs. For example, the rapid decay of LD at 100 kb around
the Xa5 locus in rice (Oryza sativa L.) would require an average of one marker per
centiMorgan (where 1 cM = 200–300 kb) (Garris et al. 2003), suggesting that candidate
gene-based LD mapping could provide greater resolution than conventional QTL
mapping (as a result of more recombination events). This is even more apparent in plants
such as conifers and onions which have many megabase pairs per centiMorgan, but
relatively rapid decay of disequilibrium over short distances (Chapter 10). It is important
therefore to gain an understanding of the patterns of LD in different regions of the
genome and in different populations in one organism, to make an informed choice of a
methodology for association genetics studies. According to Rafalski and Morgante
(2004), one of the first tasks to undertake will be to identify populations with different
amounts of LD, including low-LD populations for high-resolution mapping. In most
cases, existing germplasm collections can be exploited for this purpose. Plant
researchers also have the advantage of being able to create, by crossing, populations
with the required amount of LD and diversity (Rafalski and Morgante 2004).
Because of limited genomic resources in most crops, the candidate gene approach may
continue to be used widely in plant association genetics studies irrespective of the extent
of LD (Buckler and Thornsberry 2002). The number of markers needed to scan the whole
genome remains to be determined, but it will differ between populations as well as
between species. Suggestions on the possible number of markers and sample size for
association studies are provided in Chapter 8.
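To give a feel for how the extent of LD drives marker requirements, the following back-of-envelope sketch assumes that markers are evenly spaced and that one marker is needed for each interval over which LD persists. The function name is hypothetical, and the 400 Mb genome size in the usage lines is an illustrative round number rather than an estimate for any species discussed here.

```python
def markers_for_genome_scan(genome_size_bp, ld_extent_bp):
    """Rough number of evenly spaced markers needed so that every position
    lies within one LD-extent interval of a marker (a simplifying assumption)."""
    if genome_size_bp <= 0 or ld_extent_bp <= 0:
        raise ValueError("genome size and LD extent must be positive")
    # Integer ceiling division: a partial final interval still needs a marker.
    return -(-genome_size_bp // ld_extent_bp)


genome = 400_000_000  # hypothetical 400 Mb genome

# LD persisting over about 100 kb: a genome scan needs only thousands of markers.
print(markers_for_genome_scan(genome, 100_000))

# LD decaying within about 1 kb: the same scan needs hundreds of thousands.
print(markers_for_genome_scan(genome, 1_000))
```

The hundredfold difference between the two cases is the reason a candidate gene approach may be preferred when LD decays rapidly.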
AN OVERVIEW OF ASSOCIATION MAPPING
Although LD-based association genetics methods hold promise for speeding up the
fine mapping and identification of genes responsible for variation in agronomic traits,
traditional linkage mapping methods will continue to be useful, particularly when trying
to “mendelise” QTLs and to assess the effects of a QTL in isolation (Paran and Zamir
2003). According to Rafalski and Morgante (2004), plant biologists could potentially
create experimental populations of unlimited size for the purpose of high resolution
genetic mapping. Also, association mapping methods could be adapted to utilize pre-
existing populations. Such populations should offer improved mapping resolution
because of the many opportunities for recombination that will have been realized over
many generations. In addition, the many years of testing that have been carried out on
some breeding lines could result in a more accurate assessment of phenotypic traits that
are difficult to score (Rafalski and Morgante 2004). In general, association genetics
approaches may be more suited to organisms with little or no pedigree information; large
effective population sizes resulting in less differentiation in trait values and little or no
structure in the population; populations with rich allelic diversity, moderate to high
nucleotide diversity; and traits with little or no selection history and controlled by many
loci with small effects, and low frequency older alleles. On the other hand, linkage based
fine mapping methods may be more efficient for marker assisted breeding in inbred crops
than in some out-breeding perennial species. Also, a functional understanding of QTLs
will require positional cloning and complementation tests and this will be more feasible
in organisms with small genomes, mutants with well-defined effects, efficient
transformation systems and near complete genomic sequence. See Neale and Savolainen (2004)
for a detailed discussion on these factors and how they will impinge on the choice of a
mapping strategy. A comparison of the relative power and cost of association genetics
and conventional QTL mapping approaches is presented in Chapter 8.
A number of factors have been identified that can potentially limit the application of
association mapping via the LD approach. These are population structure, i.e., the
presence of subgroups with an unequal distribution of alleles within a population,
population stratification resulting from the complex breeding history of many
agronomically important crops and the limited gene flow in most wild plants (Sharbel
et al. 2000), pleiotropic and epistatic interactions, genotype × environment interactions,
poor experimental design, weak statistical tests/inferences, small sample size, complexity
of the trait under study, as well as the quality of the phenotypic data. Many of these
factors individually and in combination have been suggested to lead to spurious
associations in association genetics studies. Methods to control these factors are
discussed in Chapter 8.
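How population structure can produce a spurious association may be easier to see with numbers. The sketch below uses entirely hypothetical counts for two subpopulations in which a marker allele and a trait are independent within each subpopulation but differ in frequency between them. It is an illustration of the problem only and is not one of the control methods described in Chapter 8; the function name and all counts are assumptions made for this example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 d.f., no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts, in the order:
# (carriers with trait, carriers without trait,
#  non-carriers with trait, non-carriers without trait)
subpop_1 = (64, 16, 16, 4)   # allele and trait both common, but independent
subpop_2 = (4, 16, 16, 64)   # allele and trait both rare, but independent

# Ignoring structure means analysing the pooled counts.
pooled = tuple(x + y for x, y in zip(subpop_1, subpop_2))

for label, table in (("subpopulation 1", subpop_1),
                     ("subpopulation 2", subpop_2),
                     ("pooled", pooled)):
    print(label, round(chi_square_2x2(*table), 2))
```

With these counts the statistic is zero within each subpopulation, yet the pooled table gives a value far above 3.84, the 5% critical value for one degree of freedom, so the apparent association arises purely from mixing the two groups.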
1.4 CONCLUSION
Over the past decades, we have seen the successful use of map-based strategies
including linkage analysis, QTL mapping, and positional cloning for the dissection of the
mechanism of trait inheritance. These approaches have facilitated the identification of
major genes and QTL in human, plant, and animal species particularly in model
organisms. However, efficient gene discovery with these approaches will probably
continue to be largely limited to loci that have a large effect on quantitative variation.
Techniques are also needed to more rapidly identify genes that play a modest role in
regulating quantitative trait variation. Current procedures are time consuming and it can
take several years to develop populations for fine scale mapping. Apart from inherently
poor resolution resulting from limited meiotic crossover events in pedigreed populations,
developing large full sib families for each major gene may be impractical from a plant
management viewpoint, particularly in tree crops (Chapters 10 and 11). A more efficient,
higher-resolution approach that may not require the generation of large pedigreed mapping
populations is therefore needed to complement conventional QTL mapping
strategies. Currently, a population genomics tool termed association mapping seems to be
the method of choice and has been used more extensively in human genetics studies than
in any other species.
To design appropriate association genetics studies we need to understand the
structure of LD within and among populations as well as in different regions of the
genome in an organism. Depending on the extent of LD, a candidate gene approach or
genome-wide association study may be carried out. For most plant species, at least in the
near future, pre-existing mapping populations and germplasm collections may be used as
a starting point because of limited genomic resources and increased precision in
phenotypic assessments resulting from repeated measurements in such populations. Also,
there are situations (depending on species, populations, domestication/selection history,
etc.) under which conventional QTL mapping methods may work better than association
genetics methods, and vice versa. Note also that association studies could result in
spurious associations if factors such as population structure, experimental design,
statistical tests, and inferences are not adequately addressed.
1.5 REFERENCES
Alpert, K.B., Tanksley, S.D., 1996, High-resolution mapping and isolation of a yeast artificial chromosome
contig containing fw2.2: a major fruit weight quantitative trait locus in tomato. Proc Natl Acad Sci USA
93: 15503–15507.
Botstein, D., Risch, N., 2003, Discovering genotypes underlying human phenotypes: past successes for
mendelian disease, future approaches for complex disease. Nature Genetics 33: 228–237.
Buckler, E.S., Thornsberry, J.M., 2002, Plant molecular diversity and applications to genomics. Curr Opin Plant
Biol 5: 107–111.
Chakraborty, R., Weiss, K.M., 1988, Admixture as a tool for finding linked genes and detecting that difference from
allelic association between loci. Proc Natl Acad Sci USA 85: 9119–9123.
Darvasi, A., Soller, M., 1995, Advanced intercross lines, an experimental population for fine genetic mapping.
Genetics 141: 1199–1207.
El-Assal, S., Alonso-Blanco, C., Peeters, A., Raz, V., Koornneef, M., 2001, A QTL for flowering time in
Arabidopsis reveals a novel allele of CRY2. Nat Genet 29: 435–440.
Garris, A.J., McCouch, S.R., Kresovich, S., 2003, Population structure and its effect on haplotype diversity and
linkage disequilibrium surrounding the xa5 locus of rice (Oryza sativa L.). Genetics 165: 759–769.
Jannink, J.-L., Walsh, B., 2002, Association mapping in plant populations. In: Kang M.S. (ed.). Quantitative
Genetics, Genomics and Plant Breeding, CABI, Wallingford, UK.
Johanson, U., West, J., Lister, C., Michaels, S., Amasino, R., Dean, C., 2000, Molecular analysis of FRIGIDA,
a major determinant of natural variation in Arabidopsis flowering time. Science 290: 344–347.
Kruglyak, L., 1999, Prospects for whole-genome linkage disequilibrium mapping of common disease genes.
Nat Genet 22: 139–144.
Neale, D.B., Savolainen, O., 2004, Association genetics of complex traits in conifers. Trends Plant Sci 9(7):
325–330.
Paran, I., Zamir, D., 2003, Quantitative traits in plants: beyond the QTL. Trends Genet 19: 303–306.
Rafalski, A., Morgante, M., 2004, Corn and humans: recombination and linkage disequilibrium in two genomes
of similar size. Trends Genet 20(2): 103–111.
Remington, D.L., Thornsberry, J.M., Matsuoka, Y., Wilson, L.M., Whitt, S.R., Doebley, J., Kresovich, S.,
Goodman, M.M., Buckler, E.S., 2001, Structure of linkage disequilibrium and phenotypic associations in
the maize genome. Proc Natl Acad Sci USA 98: 11479–11484.
Risch, N.J., 2000, Searching for genetic determinants in the new millennium. Nature 405: 847–856.
Sharbel, T.F., Haubold, B., Mitchell-Olds, T., 2000, Genetic isolation by distance in Arabidopsis thaliana:
biogeography and postglacial colonization of Europe. Molecular Ecology 9: 2109–2118.
Stuber, C.W., Polacco, M., Senior, M.L., 1999, Synergy of empirical breeding, marker-assisted selection, and
genomics to increase crop yield potential. Crop Sci 39: 1571–1583.
The International HapMap Consortium, 2005, A haplotype map of the human genome. Nature 437: 1299–1320.
Yano, M., 2001, Genetic and molecular dissection of naturally occurring variation. Curr Opin Plant Biol 4:
130–135.
Chapter 2
LINKAGE DISEQUILIBRIUM
1 2 3
Nnadozie C. Oraguzie , Phillip L. Wilcox , Erik H.A. Rikkerink , and H. Nihal de
3
Silva
2.1 INTRODUCTION
The recent surge of interest in linkage disequilibrium (LD) mapping stems in part
from pioneering work in humans in which LD testing is a convenient means of examining
genetic polymorphisms in different genetic backgrounds, taking advantage of generations
of recombination present in such samples. LD mapping is appealing due to the potential
to identify a large number of haplotypes at many genetic loci across a large collection of
phenotypically well-characterized germplasm, either by DNA sequencing or by high-
throughput single nucleotide polymorphism (SNP) analysis. LD mapping exploits the
phenotypic and genetic variation present across a natural population and draws inferences
on the basis of past recombination events that have shaped the haplotype structure of that
species (Nordborg and Tavare 2002; Borevitz and Nordborg 2003). On the other hand,
conventional quantitative trait locus (QTL) mapping or linkage analysis usually considers
only variation among offspring of relatively few genotypes (most often between two
crossed individuals) and relies solely on recombination events observed in their progeny
(note also that only allelic variation present in the parental genotypes can be evaluated,
limited to up to four alleles in outbred families and two in inbred families). The
resolution of mapping using crosses or pedigrees depends on the amount of
recombination which is determined mostly by the number of meiotic crossover events.
Typical rates of recombination as estimated in humans are in the order of 10–8 per base
pair per meiosis (Hagenblad and Nordborg 2002). This is in the same order of magnitude
in Drosophila melanogaster (10–6 to 10–7 per map distance per meiosis) as shown by
intragenic recombination at the rosy locus on chromosome 3 (Chovnick et al. 1964) –
indicating that the best resolution achievable in a single generation is always going to be
low. Since conventional mapping studies cannot be easily performed with very large
1
The Horticulture and Food Research Institute of New Zealand Limited (HortResearch), Cnr Crosses and St
George’s Roads, P.B. 1401, Havelock North, New Zealand.
2
Scion (New Zealand Forest Research Institute Limited), 49 Sala Street, P.B. 3020, Rotorua, New Zealand.
3
The Horticulture and Food Research Institute of New Zealand Limited (HortResearch), Mt Albert Research
Centre, 120 Mt Albert Road, P.B. 92169, Auckland, New Zealand.
12 N.C. ORAGUZIE ET AL.
numbers of individuals or very many generations, their resolution is generally poor. They
provide a good way of localizing genes to individual chromosomes, or if sample size is
adequate, specific genomic regions, but typically do not provide sufficient resolution to
locate the gene or functional polymorphism. They are also inefficient at finding alleles at
low frequencies in the population. In contrast, LD mapping takes advantage of historical
recombination in the ancestry of a lineage and may be more efficient for detecting
contributions of rare alleles, and for localizing the genes of interest.
In order to use association genetics most effectively, we need to understand the
structure of LD in a genome. In the presence of significant LD, it may be possible to
identify genetic regions that are associated with a trait of interest by a systematic scan of
individuals from an existing population using polymorphisms from either well chosen
genomic regions, or full genome scans where affordable. LD mapping plays a
fundamental role in human gene mapping and has been used extensively to dissect
complex diseases including Alzheimer's disease (Corder et al. 1994) and cystic fibrosis (Kerem et al.
1989). However, many of the initial associations detected have not been consistently
replicated and may well have been spurious, particularly because the tests could not take
sufficient account of population structure, such as admixture (see
below). Nonrepeatable results could also be due to inadequate experimental design
(Altshuler et al. 2000; Ball 2005). LD mapping is of further interest as it may shed light
on the origins and evolutionary history of an organism since the distribution of LD is
determined, in part, by population history (Tishkoff et al. 1996). Moreover, knowledge of
the level of disequilibrium in a population may enable us to learn more about the biology
of recombination in that species (Pritchard and Przeworski 2001). Potentially, it could
also provide information on intraspecific lineages carrying genetic factors (for example,
insertions or inversions that generate large scale differences between chromosomes and
presumably reduce crossovers) capable of modulating rates of recombination, allowing
subsequent characterization.
There is still a lot to learn about genomic patterns of LD in plants. In addition,
knowledge of LD at the chromosomal level is relatively limited. LD mapping in plants will
be useful for identifying allelic variants that potentially relate to traits of interest, for
complementing QTL mapping, and for the general application of molecular markers to
germplasm characterization. In this review, we will provide some background
information on the theory of LD, measures of LD in a population and factors that
influence LD. We will conclude the discussion with some empirical examples of LD
testing in model organisms including humans, Drosophila melanogaster and plants,
particularly the two most advanced model plant systems with respect to LD studies,
maize and Arabidopsis thaliana.
Linkage equilibrium (LE) and LD are population genetics terms used to describe the
likelihood of co-occurrence of alleles at different loci in a population. Generally, linkage
refers to the correlated inheritance of loci through physical connection on a chromosome.
While LE refers to random association of alleles at different loci (that is, the chance of
finding a given allele at one locus is independent of the allele present at another locus), LD refers
to nonrandom association of alleles at different loci. That is, when a particular allele at
LINKAGE DISEQUILIBRIUM 13
one locus is found together with a specific allele at a second locus more often than
expected if alleles at the loci were combining independently in a population, the loci are
said to be in LD (see Figure 2.1). LD does not automatically imply linkage. Tight linkage
may result in high levels of LD and it is in this sense that LD is also a powerful mapping
tool. This occurs when the correlation of allelic states of loci in different parts of the
genome is caused by the physical proximity of the loci. LD may also be influenced by
other factors which will be discussed in the later sections of this chapter.
Although the concept of LD dates to the early part of the 20th century (Jennings
1917), the first commonly used LD measure, D, was developed about four decades ago
(Lewontin 1964). The digenic D (in common with most other measures of LD),
quantifies disequilibrium, as the difference between the observed frequency of co-
occurrence of an allele of locus A with an allele of another locus B, and the expected
frequency of co-occurrence under LE (i.e., if the two alleles were combining at random).
For two loci A/a and B/b, let the frequency of the observed haplotype with alleles A and B
be P_AB. Assuming independence, the expected haplotype frequency is the product of the
corresponding two allele frequencies, i.e., p_A × p_B. Therefore, D = P_AB − p_A p_B. If D
differs significantly from zero, LD is said to exist. We will provide a full discussion of
different measures of LD in Section 2.3. If loci A and B are both biallelic, four different
haplotypes are possible. Under LD, some of these two-locus haplotype frequencies will
be over-represented and others under-represented. Figure 2.1 illustrates two scenarios
where DNA sequences of haplotypes are in complete LD or LE (i.e., no LD).
A) Complete LD

Polymorphic sites 1 and 2:
AAGCTGTCACTG…/intervening DNA sequence/…TCATCGTACTCA
AGGCTGTCACTG…/intervening DNA sequence/…TCATCGTACTCA

Haplotype counts:
                 Site 1
                 A     G
Site 2     C     6     0
           T     0     6

B) No LD

Polymorphic sites 1 and 2:
AAGCTGTCACTG…/intervening DNA sequence/…TCATCGTACTCA
AGGCTGTCACTG…/intervening DNA sequence/…TCATCGTACTCA

Haplotype counts:
                 Site 1
                 A     G
Site 2     C     3     3
           T     3     3
Figure 2.1. Hypothetical scenarios of LD between linked polymorphisms caused by different mutational and
recombinational histories. The starting population has only two haplotypes; AG at locus 1 and TT at locus 2.
Mutation later occurs at locus 2 with “T” being replaced by “C” in some cases. (A) shows maintenance of LD
due to lack of recombination between loci 1 and 2 in generations following mutation, and (B) is a situation
where LE is attained due to recombination breaking down the initial disequilibrium. The corresponding
contingency table shows the haplotype counts. Absolute LD exists when two loci share a similar mutational
history with no recombination, LE is attained when there is recombination between loci regardless of
mutational history. The influence of recombination and mutational history on LD will be discussed in
subsequent paragraphs.
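The two contingency tables of Figure 2.1 can be turned into values of D directly. The following is a minimal sketch, assuming the haplotype counts are supplied as a 2 × 2 nested list with rows for the site 2 alleles (C, T) and columns for the site 1 alleles (A, G); the function name `coefficient_d` is hypothetical and introduced only for illustration.

```python
def coefficient_d(counts):
    """Return D = P_AB - p_A * p_B for a 2 x 2 table of haplotype counts.

    counts[i][j] is the number of haplotypes carrying site-2 allele i
    and site-1 allele j; D is computed for the haplotype in cell [0][0].
    """
    total = sum(sum(row) for row in counts)
    p_hap = counts[0][0] / total                       # observed haplotype frequency
    p_row = sum(counts[0]) / total                     # frequency of first site-2 allele
    p_col = (counts[0][0] + counts[1][0]) / total      # frequency of first site-1 allele
    return p_hap - p_row * p_col                       # observed minus expected under LE


panel_a = [[6, 0], [0, 6]]   # Figure 2.1(A): complete LD
panel_b = [[3, 3], [3, 3]]   # Figure 2.1(B): no LD

print(coefficient_d(panel_a))  # 0.25
print(coefficient_d(panel_b))  # 0.0
```

With both alleles at frequency 0.5, panel (A) gives the largest value D can take (0.25), while panel (B) gives D = 0.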
[Figure 2.2 panels: loci A, B and C in linkage disequilibrium after one generation and after many generations.]
Figure 2.2. Hypothetical diagrams showing decay of LD after many generations following recombination.
(A) shows complete LD in generation one to almost complete dissipation of LD in generation t, (B) After a few
generations, alleles of moderately distant genes still cosegregate (e.g., A & C), but after many generations only
alleles of very close genes cosegregate (e.g., A & B).
Figure 2.3 shows an example of how D is reduced with time (in this case over 100
generations) for different values of the recombination fraction, θ. For example, if θ = 0.10
(10% recombination), it will take about 6.6 generations for D to be cut in half and 28.4
generations for D to drop by 95%, according to the equation obtained by rearranging (2.1):

\[ t = \frac{\ln\left( D_t / D_0 \right)}{\ln\left( 1 - \theta \right)} . \]
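The decay times quoted above can be checked numerically. This is a small sketch, assuming the decay relation D_t = D_0 (1 − θ)^t shown with Figure 2.3; the function name `generations_to_reach` is hypothetical.

```python
import math


def generations_to_reach(ratio, theta):
    """Generations t needed for D_t / D_0 to fall to `ratio`.

    Obtained by solving D_t = D_0 * (1 - theta)**t for t.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    if not 0 < theta < 1:
        raise ValueError("theta must lie in (0, 1)")
    return math.log(ratio) / math.log(1 - theta)


# The four recombination fractions plotted in Figure 2.3.
for theta in (0.001, 0.01, 0.1, 0.5):
    half = generations_to_reach(0.5, theta)      # D reduced by 50%
    most = generations_to_reach(0.05, theta)     # D reduced by 95%
    print(f"theta = {theta}: {half:.1f} generations to halve D, "
          f"{most:.1f} generations to lose 95% of D")
```

For θ = 0.1 this gives roughly 6.6 and 28.4 generations, while for tightly linked loci (θ = 0.001) D needs several hundred generations to halve.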
[Figure 2.3: linkage disequilibrium (vertical axis, 0.0 to 1.0) against generations (horizontal axis, ticks at 1, 10 and 100), with curves of D_t = D_0 (1 − θ)^t for θ = 0.001, 0.01, 0.1 and 0.5.]
2.3 MEASURES OF LD
A variety of measures of LD have been developed, and several good reviews
compare them (Hedrick 1987; Devlin and Risch 1995; Jorde
1995). A good general account of LD is given by Weir (1996). The basis for many LD
measures is the deviation of observed haplotype frequencies from their expectation
assuming independence. Consider two biallelic loci A and B. Let p1 and p2 (= 1 − p1)
be the frequencies of the A1 and A2 alleles at locus A. Likewise, q1 and q2 (= 1 − q1) are
the frequencies of the B1 and B2 alleles at the second locus. These two biallelic loci will
then produce four different haplotypes, A1B1, A1B2, A2B1 and A2B2. These haplotypes can
be represented in a 2 × 2 contingency table as in Table 2.1. We use the notation
P (uppercase), with a subscript indicating the two alleles it carries, to represent the
haplotype frequency. Lowercase letters are used to represent allele frequencies. Note that
the haplotype frequencies given in Table 2.1 are unconditional probabilities.
Sometimes we are interested in conditional probabilities, e.g., the probability of a
haplotype having the A1 allele given that allele B1 is present. This can be calculated using the
haplotype and marginal allele frequencies as P11/q1. Similarly, the probability of
having the A2 allele in the presence of allele B1 is P21/q1.
                 Allele
Allele      B1       B2       Total
A1          P11      P12      p1
A2          P21      P22      p2
Total       q1       q2       1
In the first three lines, the first term of each expression is the observed haplotype
frequency and the second term the expected frequency under independence. Note that all
expressions in equation (2.2) follow from the first because of the equalities among the
marginal totals in Table 2.1, i.e., p1 = P11 + P12; q1 = P11 + P21, etc. For instance,
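\[
\begin{aligned}
D &= P_{11} - p_1 q_1 \\
  &= P_{11} - (1 - p_2)(1 - q_2) \\
  &= P_{11} - (1 - p_2 - q_2 + p_2 q_2) \\
  &= P_{11} - (P_{11} + P_{12} - P_{12} - P_{22} + p_2 q_2) \\
  &= P_{22} - p_2 q_2
\end{aligned}
\]

Likewise,

\[
\begin{aligned}
D &= P_{11} - p_1 q_1 \\
  &= P_{11} - (P_{11} + P_{12})(P_{11} + P_{21}) \\
  &= P_{11} - (P_{11} P_{11} + P_{11} P_{21} + P_{11} P_{12} + P_{12} P_{21}) \\
  &= P_{11} (1 - P_{11} - P_{21} - P_{12}) - P_{12} P_{21} \\
  &= P_{11} P_{22} - P_{12} P_{21}
\end{aligned}
\]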
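The equivalence of these expressions can be confirmed numerically for any set of haplotype frequencies that sum to one. The sketch below uses the frequencies of population (d) from Example 2.1 as illustrative input; the function name `d_expressions` is hypothetical.

```python
def d_expressions(P11, P12, P21, P22):
    """Return three algebraically equivalent expressions for D."""
    p1, p2 = P11 + P12, P21 + P22      # allele frequencies at locus A (row totals)
    q1, q2 = P11 + P21, P12 + P22      # allele frequencies at locus B (column totals)
    return (
        P11 - p1 * q1,                 # observed minus expected for A1B1
        P22 - p2 * q2,                 # observed minus expected for A2B2
        P11 * P22 - P12 * P21,         # cross-product form
    )


# Haplotype frequencies of population (d) in Example 2.1.
first, second, third = d_expressions(0.1, 0.2, 0.0, 0.7)
print(round(first, 4), round(second, 4), round(third, 4))
```

All three values agree up to floating-point rounding, as the derivation requires.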
The ordering of alleles into rows and columns of Table 2.1 is arbitrary, and D is often
reported without any sign as |D|. The measure D is dependent on allele frequencies,
so some standardized measure would be useful for comparisons across loci with different
frequencies. Lewontin (1964) defined a standardized measure of D, called D′, as
follows:
\[
D' =
\begin{cases}
\dfrac{D}{\min(p_1 q_2,\; p_2 q_1)} & D > 0 \\[2ex]
\dfrac{D}{\min(p_1 q_1,\; p_2 q_2)} & D < 0
\end{cases}
\tag{2.3}
\]
The denominator of the expression is the maximum absolute value of D that could be
achieved for the given marginal totals, which are the allele frequencies at the two
loci (Table 2.1). Because D is scaled by this maximum, the absolute value |D′| is
bounded between 0 and 1. The case of |D′| = 1 is known as
complete LD; it occurs when fewer than
four of the possible haplotypes are present. Values of |D′| < 1 indicate that the
complete ancestral LD has been disrupted, presumably by recombination, resulting in
all four possible haplotypes being observed.
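The sign-dependent denominator in Equation (2.3) is easy to get wrong in practice, so a short sketch may help. It assumes the four haplotype frequencies of Table 2.1 as input; the function name `d_prime` and the example frequencies are hypothetical.

```python
def d_prime(P11, P12, P21, P22):
    """Lewontin's D' from the four haplotype frequencies of Table 2.1."""
    p1, p2 = P11 + P12, P21 + P22
    q1, q2 = P11 + P21, P12 + P22
    d = P11 * P22 - P12 * P21
    if d == 0:
        return 0.0                     # also covers a monomorphic locus
    # Largest |D| attainable given the allele frequencies.
    d_max = min(p1 * q2, p2 * q1) if d > 0 else min(p1 * q1, p2 * q2)
    return d / d_max


three_haplotypes = d_prime(0.1, 0.2, 0.0, 0.7)   # one haplotype missing
four_positive = d_prime(0.4, 0.1, 0.1, 0.4)      # all four present, D > 0
four_negative = d_prime(0.1, 0.4, 0.4, 0.1)      # all four present, D < 0

print(round(three_haplotypes, 3), round(four_positive, 3), round(four_negative, 3))
```

The first case returns a value of 1 (up to rounding) because one haplotype is absent; the other two have |D′| = 0.6, with the sign following that of D.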
According to Hill and Weir (1994), D is in fact more frequently used in the
standardized form as:
\[ r = \frac{D}{(p_1 p_2 q_1 q_2)^{1/2}} \tag{2.4} \]

\[ \delta = \frac{D}{q_1 P_{22}} . \tag{2.5} \]
This is the same measure that Terwilliger (1995) referred to as λ . Yet another
epidemiological measure recommended for LD by Kaplan and Weir (1992) is based on
the difference in conditional frequencies:
\[ d = \frac{P_{11}}{q_1} - \frac{P_{12}}{q_2} = \frac{D}{q_1 q_2} . \tag{2.6} \]
The final LD measure that Devlin and Risch (1995) included in their list of commonly
used ones also originated from epidemiology. It is based on the odds ratio (Equation
(2.7)). The odds of an event are the probability of that event happening divided by
the probability of it not happening, i.e., if P is the probability of the event, then the
odds are P/(1 − P). The first line of Equation (2.7) is then read as the odds of the B1 allele being
present in the haplotype given that A1 is present, divided by the odds of the B1 allele being
present in the haplotype given that A2 is present.
\[ \text{Odds Ratio (OR)} = \frac{P_{11} P_{22}}{P_{12} P_{21}} \]

\[ Q = \frac{\mathrm{OR} - 1}{\mathrm{OR} + 1} = \frac{P_{11} P_{22} - P_{12} P_{21}}{P_{11} P_{22} + P_{12} P_{21}} = \frac{D}{P_{11} P_{22} + P_{12} P_{21}} . \tag{2.7} \]
Unlike the odds ratio, which ranges from zero to infinity, Q is bounded between
−1 and +1.
As Devlin and Risch (1995) pointed out, these five measures of LD differ only in
their denominator, which serves only to standardize D. One might then expect all
these measures to provide the same information for simple disequilibrium mapping. Devlin
and Risch (1995) illustrated with simple examples that this is not the case. We advise
readers to refer to their paper for a full account of the comparison of the different LD
measures.
To measure LD experimentally, we take a random sample from the population
of interest. LD calculated from the sample is an estimate of the corresponding
population parameter. Assuming that all haplotypes are observed, we count the different
haplotypes. Let the sample contain n diploid individuals. The
composition of different haplotypes for two biallelic loci can be represented in a 2 × 2
contingency table as follows:

Allele      B1        B2        Total
A1          a         b         a + b
A2          c         d         c + d
Total       a + c     b + d     2n = a + b + c + d
The marginal row and column totals are allelic counts at the A and B loci,
respectively. Cell counts can vary independently, the only constraint being that they add
up to the total count 2n. Dividing these cell and marginal counts by the total gives
estimates of the corresponding haplotype and allele frequencies as shown in Table 2.1. For
data such as in Table 2.2, the hypothesis of association between alleles at the two loci
(i.e., LD) can be tested by either the chi-square test or Fisher's exact test. These tests are discussed
in detail in Chapter 7 (Section 7.6.1). In concluding this section, we will show how different
measures of LD are calculated using some hypothetical data as in Example 2.1.
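As a rough illustration of the chi-square test mentioned above, the sketch below computes the Pearson statistic (without continuity correction) for a 2 × 2 table of haplotype counts laid out as in Table 2.2, using only the standard library. The function name `ld_chi_square` and the example counts are hypothetical, and the treatment in Chapter 7 may differ in detail; for small expected counts Fisher's exact test would be the safer choice.

```python
import math


def ld_chi_square(a, b, c, d):
    """Pearson chi-square test (1 df) of association in a 2 x 2 haplotype table.

    a, b, c, d are the counts of A1B1, A1B2, A2B1 and A2B2 haplotypes.
    Returns the statistic and its upper-tail p-value.
    """
    total = a + b + c + d                          # 2n haplotypes
    row1, row2 = a + b, c + d                      # allele counts at locus A
    col1, col2 = a + c, b + d                      # allele counts at locus B
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    chi2 = total * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical sample of 50 diploid individuals (100 haplotypes).
chi2, p_value = ld_chi_square(10, 20, 0, 70)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2e}")
```

For a 2 × 2 table this statistic equals 2n times the squared correlation coefficient r of Equation (2.4), which links the test directly to one of the LD measures described above.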
(a)  50%  -T--A-          (b)  80%  -T--A-
     50%  -C--G-               20%  -C--G-

(c)  25%  -T--A-          (d)  10%  -T--A-
     25%  -T--G-               20%  -T--G-
     25%  -C--G-               70%  -C--G-
     25%  -C--A-

The haplotype frequencies can be tabulated in the following table for population (d).

                 Allele
Allele      A        G        Total
T           0.1      0.2      0.3
C           0        0.7      0.7
Total       0.1      0.9      1

Using the equations described in the text the different measures of LD are calculated
for the four populations, and these values are tabulated below:

                         LD measure
Population    |D|      |D′|     r        δ        d        Q
a             0.25     1        1        1        1        1
b             0.16     1        1        1        1        1
c             0        0        0        0        0        0
d             0.07     1        0.51     1        0.78     1
Markedly different values are obtained with the different measures,
keeping in mind the range for each measure. |D′| = 1 if only two or three haplotypes are
present. A close look at populations (a) and (b) shows that |D| depends on the
allele frequencies.
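The values in the table above can be reproduced from Equations (2.3) to (2.7). The following sketch assumes that the first locus (T/C) plays the role of locus A and the second (A/G) that of locus B, so that P11 is the frequency of the T-A haplotype; the function name `ld_measures` is hypothetical, and no guard is included for monomorphic loci or for the zero denominators that can arise in δ and Q.

```python
import math


def ld_measures(P11, P12, P21, P22):
    """Six LD measures for two biallelic loci, from haplotype frequencies."""
    p1, p2 = P11 + P12, P21 + P22
    q1, q2 = P11 + P21, P12 + P22
    D = P11 * P22 - P12 * P21
    d_max = min(p1 * q2, p2 * q1) if D > 0 else min(p1 * q1, p2 * q2)
    return {
        "|D|": abs(D),
        "|D'|": abs(D) / d_max,                        # Equation (2.3)
        "r": D / math.sqrt(p1 * p2 * q1 * q2),         # Equation (2.4)
        "delta": D / (q1 * P22),                       # Equation (2.5)
        "d": D / (q1 * q2),                            # Equation (2.6)
        "Q": D / (P11 * P22 + P12 * P21),              # Equation (2.7)
    }


# Haplotype frequencies (T-A, T-G, C-A, C-G) for the populations of Example 2.1.
populations = {
    "a": (0.5, 0.0, 0.0, 0.5),
    "b": (0.8, 0.0, 0.0, 0.2),
    "c": (0.25, 0.25, 0.25, 0.25),
    "d": (0.1, 0.2, 0.0, 0.7),
}

for name, freqs in populations.items():
    measures = ld_measures(*freqs)
    print(name, "  ".join(f"{key} = {value:.2f}" for key, value in measures.items()))
```

Population (d) is the informative case: with three haplotypes present, |D′|, δ and Q all equal 1 while r and d do not.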
When haplotypes are known, it is easy to estimate LD from sample data (Example
2.1). In practice, however, when unrelated individuals are sampled it is not possible to
determine the phase of the double heterozygote, A1A2B1B2, at two marker loci. The
double heterozygote can carry either of two pairs of haplotypes depending on the
phase configuration, i.e., A1B1/A2B2 or A1B2/A2B1. The phase of such genotypes
must be inferred before haplotype frequencies can be estimated. In the case of two biallelic
markers, maximum likelihood estimates (MLE) of haplotype frequencies can be obtained
analytically by solving a cubic equation (SAS 2004). For multiple loci, or markers with
more than two alleles, an iterative process can be used.
One of the earliest methods used for haplotyping was Clark's algorithm (Clark
1990). Today more efficient alternatives are available. One such alternative is the
Expectation Maximization (EM) algorithm of Excoffier and Slatkin (1995). It is a
combination of two algorithms: an EM statistical algorithm for handling missing data,
and a counting algorithm for frequencies. As the process is iterative, it starts by guessing
haplotype frequencies. It then uses the current frequency estimates to apportion the
phase-ambiguous genotypes among their possible phase configurations. Given these phase
configurations, it then estimates the frequency of each haplotype by counting. The process
is repeated until the frequencies converge. As the number of markers increases, the process can be
computationally very demanding. For a set of m unphased biallelic marker loci there
will be 2^m possible haplotypes. Typically, the algorithm is appropriate for fewer than 25 SNPs.
This maximum likelihood method makes the assumption that the population is in Hardy–
Weinberg equilibrium. Haplotype inference from genotype data is becoming more
important in association studies. Intuitively, one would expect an analysis based on
haplotypes to be more powerful because of the simultaneous use of information from
multiple markers. But as discussed here, haplotypes often need to be estimated based on
assumptions made about the populations. This process leads to loss of some information.
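The iteration described above can be sketched for the simplest case of two biallelic loci, where the double heterozygote is the only phase-ambiguous genotype. This is an illustrative sketch under Hardy–Weinberg assumptions, not the implementation of Excoffier and Slatkin (1995). The genotype coding (number of copies of A1, number of copies of B1), the function name `em_haplotype_frequencies` and the example counts are all hypothetical.

```python
def em_haplotype_frequencies(genotype_counts, tol=1e-10, max_iter=1000):
    """EM estimates of the frequencies of A1B1, A1B2, A2B1 and A2B2.

    genotype_counts maps (copies of A1, copies of B1) to a number of
    individuals. Only genotype (1, 1) has ambiguous phase.
    """
    # Haplotypes (A1B1, A1B2, A2B1, A2B2) carried by each unambiguous genotype.
    fixed = {
        (2, 2): (2, 0, 0, 0), (2, 1): (1, 1, 0, 0), (2, 0): (0, 2, 0, 0),
        (1, 2): (1, 0, 1, 0), (1, 0): (0, 1, 0, 1),
        (0, 2): (0, 0, 2, 0), (0, 1): (0, 0, 1, 1), (0, 0): (0, 0, 0, 2),
    }
    base = [0.0] * 4
    for genotype, haps in fixed.items():
        n = genotype_counts.get(genotype, 0)
        for k in range(4):
            base[k] += n * haps[k]
    n_double_het = genotype_counts.get((1, 1), 0)
    total = sum(base) + 2 * n_double_het           # total number of haplotypes

    freqs = [0.25] * 4                             # initial guess
    for _ in range(max_iter):
        # E-step: share of double heterozygotes expected to be A1B1/A2B2.
        cis = freqs[0] * freqs[3]
        trans = freqs[1] * freqs[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies by counting haplotypes.
        counts = [
            base[0] + n_double_het * w,
            base[1] + n_double_het * (1 - w),
            base[2] + n_double_het * (1 - w),
            base[3] + n_double_het * w,
        ]
        new_freqs = [count / total for count in counts]
        converged = max(abs(x - y) for x, y in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Hypothetical sample: 4 A1A1B1B1, 4 A2A2B2B2 and 2 double heterozygotes.
sample = {(2, 2): 4, (0, 0): 4, (1, 1): 2}
print([round(f, 4) for f in em_haplotype_frequencies(sample)])
```

In this sample the unambiguous genotypes carry only A1B1 and A2B2 haplotypes, so the iteration assigns the double heterozygotes almost entirely to the A1B1/A2B2 phase. A uniform starting guess can stall when the data are perfectly symmetric (for example, a sample consisting only of double heterozygotes), so several starting points are advisable in practice.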
The extent of LD can be highly variable across genomes in many of the species
studied to date. Within a given region, LD decreases with the distance between marker
sites (Figure 2.4). Genome-wide patterns of pairwise LD values often show regions
of high LD separated by regions of low LD. The regions of high LD are often referred to as
"haplotype blocks." Regions that are high in LD and low in recombination are also referred
to as "LD hot spots" in the literature; a "hot spot" for LD thus implies a "cold spot" for
recombination. There are two common approaches to defining haplotype blocks. One method
defines a block wherever LD is greater than some threshold value. The second method
defines a block where a small number of haplotypes make up a high proportion of the
observed haplotypes. There is an ongoing debate regarding whether haplotype blocks
truly exist, as our understanding of genomic patterns of recombination and disequilibrium
is still limited (Cardon and Abecasis 2003).
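The first, threshold-based approach can be sketched as a simple left-to-right scan of a pairwise LD matrix. This is only one of several possible readings of "LD greater than some threshold"; published block definitions differ in detail. The function name `threshold_blocks`, the threshold of 0.8 and the example matrix are hypothetical choices made for illustration.

```python
def threshold_blocks(ld, threshold=0.8):
    """Greedy partition of ordered markers into blocks of high pairwise LD.

    ld is a symmetric matrix of pairwise LD values (e.g., |D'| or r^2).
    A marker joins the current block only if its LD with every marker
    already in the block is at least `threshold`. Returns a list of
    (first index, last index) pairs for blocks of two or more markers.
    """
    n_markers = len(ld)
    blocks = []
    start = 0
    for j in range(1, n_markers + 1):
        at_end = j == n_markers
        if at_end or any(ld[i][j] < threshold for i in range(start, j)):
            if j - start >= 2:
                blocks.append((start, j - 1))      # close the current block
            start = j                              # marker j opens a new block
    return blocks


# Hypothetical pairwise LD values for five ordered markers.
ld_matrix = [
    [1.00, 0.90, 0.85, 0.20, 0.10],
    [0.90, 1.00, 0.95, 0.30, 0.20],
    [0.85, 0.95, 1.00, 0.25, 0.15],
    [0.20, 0.30, 0.25, 1.00, 0.90],
    [0.10, 0.20, 0.15, 0.90, 1.00],
]
print(threshold_blocks(ld_matrix))  # [(0, 2), (3, 4)]
```

The result depends strongly on the threshold and on the LD measure used, which is part of the reason the existence of discrete blocks remains debated.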
When a large number of markers are considered it is useful to display the
estimated LD values graphically. It is common to visualize LD patterns in the form of color-coded
matrices (Pettersson et al. 2004). In this way we can identify blocks within a genomic region
in which there is strong LD. Graphical Overview of Linkage Disequilibrium (GOLD) is a
computer program that can graphically display LD structures (Abecasis and Cookson
2000). The software can be downloaded from http://www.sph.umich.edu/csg/abecasis/GOLD/.
A sample output from GOLD is shown in Figure 2.5 (figure from GENESTAT,
http://www.meb.ki.se/genestat/). Pettersson et al. (2004) have developed GOLDsurfer,
which is an extension of the 2D view in GOLD to a 3D package. D′ values are
represented by colors, with hotter colors representing high D′. Note the several large
red blocks on the diagonal, indicating haplotype blocks of maximal disequilibrium where
there has been no recombination since the LD was formed.
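The matrix underlying such a display can be computed directly from phased haplotypes. The sketch below is not part of GOLD or GOLDsurfer; it uses r² rather than D′ for brevity and assumes haplotypes coded as strings of '0' and '1', one character per biallelic marker. The function name `pairwise_r2` and the example haplotypes are hypothetical.

```python
def pairwise_r2(haplotypes):
    """Matrix of pairwise r^2 values between biallelic markers.

    haplotypes is a list of equal-length strings of '0' and '1'.
    Entries involving a monomorphic marker are returned as NaN.
    """
    n = len(haplotypes)
    m = len(haplotypes[0])
    # Frequency of allele '1' at each marker.
    freq = [sum(h[j] == "1" for h in haplotypes) / n for j in range(m)]
    matrix = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            denom = freq[i] * (1 - freq[i]) * freq[j] * (1 - freq[j])
            if denom == 0:
                matrix[i][j] = float("nan")
                continue
            both = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
            d = both - freq[i] * freq[j]           # D for the '1'-'1' haplotype
            matrix[i][j] = d * d / denom
    return matrix


# Four hypothetical haplotypes typed at three markers.
haps = ["110", "110", "001", "000"]
for row in pairwise_r2(haps):
    print("  ".join(f"{value:.2f}" for value in row))
```

Markers 1 and 2 always occur together in this sample and therefore have r² = 1, whereas marker 3 shows only weak association with either of them. Passing such a matrix to any plotting routine that maps values to colors gives a display of the kind shown in Figure 2.5.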
seem to have the most evident impact on LD. Mutation provides the raw material for
producing polymorphisms that will be in LD (Flint-Garcia et al. 2003). LD is created
when a new mutation occurs on a chromosome that carries a particular allele at a nearby
locus. Recombination is the main mechanism that breaks down LD. Meiotic crossing-
over weakens intrachromosomal LD while independent assortment is particularly
responsible for breaking down interchromosomal LD. Recombination rates are known to
vary by more than an order of magnitude across the genome. Because breakdown of LD
is primarily driven by recombination, the extent of LD is expected to vary in inverse
relation to the local recombination rate (Nachman 2002). Recurrent mutations can also
lessen the association between alleles at adjacent loci. Some SNPs, such as those at CpG
dinucleotides, might have high mutation rates (due to deamination of methylated cytosine,
5MeC, to thymine over evolutionary time, leading to CpG
suppression) and therefore show little or no LD with nearby markers, even in the absence
of historical recombination. On the other hand, in places where the levels of DNA
methylation are generally lower (e.g., within genes), this effect would tend to be
mitigated.
Figure 2.4. Plot of LD (r², vertical axis, 0.0 to 1.0) against weighted distance in bp (horizontal axis, 0 to 3500) between polymorphic sites in the candidate gene
"d3" in maize. (Data from Remington et al. 2001.)
Figure 2.5. Pairwise |D′| for 45 SNPs within a linked region (figure from GENESTAT,
http://www.meb.ki.se/genestat/, courtesy of the Swedish National Biobanking program, Wallenberg consortium north). (see
color plate)
Figure 2.6. A simplistic diagram showing the major difference between gene conversion and crossover. (A)
Two DNA molecules. (B) Gene conversion after mismatch correction – the red DNA donates part of its genetic
information (e–e' region) to the blue DNA. (C) DNA crossover – the two DNAs exchange part of their genetic
information (f–f ' and F–F'). (see color plate)
structured population where substructuring has recently ceased. Admixture results in the
introduction of chromosomes of different ancestry and allele frequencies. Often the
resulting LD extends to unlinked sites, on the same and different chromosomes but
breaks down rapidly with random mating (Pritchard and Przeworski 2001). With regard
to LD detection studies, population admixture is one of the major factors that causes
spurious associations between marker alleles and the phenotype. Although our interest in
LD is because it is likely to be caused by tightly linked loci, spurious associations due to
population admixture can often lead to incorrect conclusions. Population admixture can
generate LD even though the individual populations forming the mixture do not show any
such disequilibrium. We will illustrate this point using a hypothetical example. Let us
consider a locus with a disease allele, D and a second unlinked marker locus with allele,
M. Take two populations (I and II) of equal size, but with different frequencies of alleles
D and M, as shown in Table 2.3. Since we assume the loci to be independent, the
expected frequency of individuals carrying both D and M alleles in each population is
the product of their individual frequencies (Table 2.3). For our hypothetical example we
take these to be the observed frequencies. Consider now the admixture of these two
populations in equal proportions. The new observed allele frequencies of the mixed
population would simply be the average of the two, as shown in the table. The observed
frequency of individuals with D and M alleles in the mixed population (0.29) is greater
than what would be expected for independent loci (0.4 × 0.5 = 0.20), i.e., the mixing has resulted in a
spurious association between the D and M alleles. The use of a population that is
likely to have resulted from admixture is therefore not recommended for association
studies without proper genomic control (Table 2.3).
Table 2.3. A hypothetical scenario showing how population admixture can lead to spurious associations

                                  Frequency
                         Population I    Population II    Admixture
D - Disease allele       0.7             0.1              0.4
M - allele               0.8             0.2              0.5
D & M                    0.56            0.02             0.29
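The arithmetic behind Table 2.3 can be written out in a few lines. The sketch below assumes, as in the text, that the two loci are independent within each source population and that the populations are mixed in equal proportions; the function name `admixture_ld` is hypothetical.

```python
def admixture_ld(populations, weights):
    """Disequilibrium created by mixing populations with no internal LD.

    populations is a list of (frequency of D, frequency of M) pairs and
    weights gives the mixing proportions, which should sum to one.
    """
    freq_d = sum(w * fd for w, (fd, fm) in zip(weights, populations))
    freq_m = sum(w * fm for w, (fd, fm) in zip(weights, populations))
    # Within each population the loci are independent, so joint = fd * fm.
    joint = sum(w * fd * fm for w, (fd, fm) in zip(weights, populations))
    expected = freq_d * freq_m                     # expectation under independence
    return freq_d, freq_m, joint, expected, joint - expected


# Populations I and II of Table 2.3, mixed in equal proportions.
freq_d, freq_m, joint, expected, disequilibrium = admixture_ld(
    [(0.7, 0.8), (0.1, 0.2)], [0.5, 0.5]
)
print(f"D: {freq_d:.2f}  M: {freq_m:.2f}  D&M observed: {joint:.2f}  "
      f"expected: {expected:.2f}  disequilibrium: {disequilibrium:.2f}")
```

The mixed population shows a disequilibrium of 0.09 even though neither source population has any, which for two populations equals the product of the mixing proportions and the allele frequency differences at the two loci (0.5 × 0.5 × 0.6 × 0.6).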
An example based on empirical data of how admixture and selection can influence
LD resulting in significant association between genotype and phenotype is the oat study
carried out by Beer et al. (1997). In this study, the authors used 64 North American oat
varieties and landraces that have been phenotyped for 13 quantitative characters and
grouped based on their restriction fragment length polymorphism (RFLP) genotype at 48
loci. They found significant associations between RFLP fragments and the group means
for 11.2% of the fragments at 1% significance level. The authors, however, did not take
into consideration in their data analysis the fact that both spring and winter varieties were
represented in the germplasm pool they used for the study (Souza and Sorrells 1991).
These groups differed in both phenotype means and marker frequencies. Also, the
germplasm had undergone four decades of selection and improvement with some
genotypes older than others. Hence, the germplasm may be considered as an admixture of
old and modern subpopulations, with one having undergone less selection than the other.
In that case, one would expect to find fewer associations between marker alleles and traits
within each subpopulation than in the combined pool. This was indeed the case when the
data were re-analyzed, with only 6% and 4.9% of allele-trait associations significant in the
Factor                                      Effect
Recombination rate                          Higher recombination lowers LD
Mating systems: selfing species             High LD
Mating systems: out-crossing species        Low LD
Genetic isolation between lineages          Increases LD
Population subdivision                      Increases LD
Population admixture                        Increases LD
Natural and artificial selection            Locally increases LD
Population size                             Small populations have more LD
Balancing selection                         Increases LD
Mutation rate                               High mutation rate decreases overall LD, but LD around a
                                            newly created mutated allele remains high until
                                            dissipated by recombination
Genomic rearrangements                      Rearrangements suppress local recombination
Stochastic effects (chance)                 Increase or decrease LD
Epistatic interactions with significant     Increase LD
phenotypic effects

Modified from Rafalski and Morgante (2004).
Due to the different genetic and environmental factors that affect LD, the extent and
pattern of LD are expected to vary within and between species and even between
different regions of the genome of the same species. Below, we discuss the species from
which we currently derive most of our information about LD.
2.5.1 LD in Humans
hot spots of recombination. The authors examined haplotype patterns across 51 autosomal
regions (spanning 13 Mb of the human genome) in samples from Africa, Europe and Asia
and reported that the human genome can be parsed objectively into haplotype blocks;
including sizable regions over which there is little evidence of historical recombination,
and within which only a few common haplotypes are observed. The boundaries of blocks
and specific haplotypes they contained were highly correlated across populations. The
study demonstrated that such haplotype frameworks provide substantial statistical power
in association studies of common genetic variation across each region, and it highlights
the need to develop data of this level of detail for other species, including plants.
It has also been documented that regions of high LD, in general, correspond to
regions of low recombination. For example, McVean et al. (2004) developed and
validated a method for estimating recombination rates from patterns of genetic variation.
From extensive SNP surveys in European and African populations, the authors found
evidence for extreme local rate variation spanning four orders of magnitude, in which
50% of all recombination events take place in less than 10% of the sequence. They
demonstrated that recombination hot spots are a ubiquitous feature of the human genome,
occurring on average every 200 kb or less, but recombination occurs preferentially
outside genes.
Recent human patterns of LD have also highlighted the importance of a second
feature of recombination: homologous gene conversion (Frisse et al. 2001; Przeworski
and Wall 2001). Ptak et al. (2004) estimated local recombination rates indirectly from
patterns of LD in 84 genomic regions in a sample of individuals of European origin and
of African–American descent. They found that LD based estimates are significantly
positively correlated with map-based estimates. Also, using LD based estimators, the
authors found evidence for homologous gene conversion in patterns of polymorphism.
Frisse et al. (2001) also identified significant differences between the African and non-
African populations which will impact on the design of future association studies in these
populations. In general, there is less LD in African populations than in non-African
populations. The half-length of D in the Utah population in USA is about 60 kb,
whereas the half-length is considerably less than 5 kb for the Yoruba tribe from the
southwestern part of Nigeria. Although it is generally believed that these results could be
attributed to major human historical events particularly population bottlenecks associated
with geographical expansion and population isolation, it would be worthwhile estimating
inbreeding coefficients in these populations to see what role (if at all) it might have
played in shaping the LD. Nevertheless, these studies highlight the importance of
developing an understanding of the distribution of LD in any particular population as a
prerequisite for subsequent experimental design. In a high LD population, genome-wide
scans could be conducted to minimize the number of markers needed, and this could be
followed by high resolution mapping in a low LD population (Reich et al. 2001).
Much attention is now focused on the identification of susceptibility genes
underlying complex diseases, such as diabetes, schizophrenia and hypertension.
Parametric linkage analysis narrowed the diastrophic dysplasia (DTD) gene to a ~2 Mb
interval, but an LD study in Finnish patients pinpointed the gene to a ~40 kb interval and
made its positional cloning possible (Hästbacka et al. 1992). The study also showed that
the DTD gene lies within 0.06 cM (about 60 kb) of the colony stimulating factor 1
receptor (CSF1R) gene. Positional cloning of the Huntington disease (HD) and cystic
fibrosis (CF) genes (Kerem et al. 1989), and of one of the major Alzheimer factors (Corder
et al. 1994), was helped greatly by LD mapping followed by association analysis. Several
2.5.2 LD in Drosophila
Much of our understanding of how LD is shaped in natural populations initially
came from research on Drosophila species. Drosophila population history is still not well
understood. Drosophila melanogaster and Drosophila simulans are human commensals;
as with humans, they are thought to have originated in Africa, and only recently spread to
other continents. The levels of diversity seem to be higher in African populations than
non-African ones for D. melanogaster (David and Capy 1988).
The most detailed analysis of LD has been made in D. melanogaster, in which
allelic combinations can readily be determined for individual chromosomes that have
been extracted from wild populations through inbred lines. Most studies have focused on
in-depth comparisons of single gene loci and/or single populations, and the principal
finding is one of regional variations in LD among loci. A total of 3,143 pairwise
comparisons involving 206 polymorphic restriction variants or eight gene regions of
Drosophila melanogaster were included in one analysis (Zapata and Alvarez 1983). It
was found that heterogeneity is mostly explained by large differences in the intensity of
sample disequilibrium among regions. Langley et al. (2000) found evidence for a
surprisingly large amount of recombination at the tip of the X chromosome even though
crossing over rates are known to be extremely low. They inferred that this must be the
result of a high rate of gene conversion. Schaeffer et al. (2001) presented an analysis of
protein variation at the ADH and ADH-related (ADHR) loci in the alcohol
dehydrogenase (Adh) region in 139 strains of Drosophila pseudoobscura. Several
conclusions can be drawn from the LD analysis of SNPs and ADHR haplotypes. First,
recombination reduces the fraction of polymorphic loci that show associations with a
disease-causing gene, but significant LD can be observed as a result of mutation and
random genetic drift. Second, LD studies will be most effective in detecting allele-
phenotype associations when the alleles are at moderate frequencies and the authors
suggest that their model system conclusions may be applicable to other organisms. In-
depth studies of how several forces (for example, mutation, recombination and selection)
act to increase or decrease LD in a given region indicate that the balance of these forces
should result in strongest disequilibrium around alleles at frequencies of ~10%. However,
even adjacent regions can experience quite different evolutionary histories. A recent
chromosome-wide study of the fourth chromosome (Wang et al. 2002), previously
believed to be nonrecombining and invariable, found polymorphic regions interspersed
with regions of little to no variation. Recombination was therefore shown to occur on the
chromosome and, although its rate is very low (consistent with previous findings), it has
been sufficient to affect the structure of genetic variation on the chromosome, allowing
different regions to have different evolutionary histories.
Recombination rates per physical length are well known to show marked regional
variation, and much research on LD in Drosophila has used this factor to focus on
understanding the effects of selection and other forces on the degree of LD. Over the past
decade, numerous surveys of DNA sequence variation in natural populations of several
Drosophila species have established that polymorphism levels are positively correlated
with the regional rate of crossing over, and are not generally explained by variation in
mutation rates (Wang et al. 2002; Begun and Aquadro 1992; 1994). This correlation has
been proposed to result from the hitchhiking that is associated with fixation of
advantageous mutants: in a region of low recombination, if directional selection drives an
advantageous mutation through a population to fixation, much of the variation at linked
sites will be eliminated during the process (Parsch et al. 2001). Selection on a region will
therefore also increase the strength of LD observed: that is, significant allelic associations
over large genetic distances might result from the action of natural selection. For
example, strong geographical clinal variation in many enzyme loci around the
phosphogluconate mutase (Pgm) locus is likely to be explained by clinal selection at Pgm
and pervasive low levels of recombination in the region, so that the other loci are forced
to hitchhike along with it (Verrelli and Eanes 2001). Selection against deleterious
mutations can also reduce variation at linked sites. A recent analysis of multiple loci in D.
melanogaster and D. simulans showed that both species have greater within locus LD
than expected theoretically. This could be due to a departure from the demographic
assumption of a panmictic equilibrium in Drosophila and/or the action of natural selection
on many loci.
2.5.3 LD in Plants
Genetic diversity at the sequence level has been studied in only a handful of plant
taxa, with maize and Arabidopsis thaliana the most commonly studied species. These
two species have evolved different mating systems (out-crossing and selfing, respectively)
and provide contrasting views of LD in plant genomes. Maize can, however, be readily inbred,
and many cultivated varieties have been derived by this process, so maize might be more
accurately described as a facultative inbreeder, in contrast to species such as perennial
ryegrass, which can accurately be described as obligate out-breeders.
2.5.3.1. Maize
Maize is a good candidate for DNA sequence polymorphism surveys because of its
long history as a model genetic system and because of its agricultural importance. Maize
was domesticated in Mexico about 7,500 years ago and dispersed throughout the
Americas shortly thereafter. As a result of dispersal, there are now hundreds of maize
landraces representing worldwide geographic locales. However, most of these have
contributed little to modern maize breeding programs, and virtually all elite US inbred
germplasm is derived from only a few landraces. The first published LD study on maize
was based on a survey of 21 loci distributed along chromosome 1 of maize (Tenaillon
et al. 2001). Each locus was sampled in 25 individuals representing a “species-wide”
sample of maize that included US and exotic landraces. Although the length of these
genes was short (1.5 kb), the rate of decay in LD was surprisingly rapid. On average, LD
declined below nominal levels, which we arbitrarily define here as r² = 0.20, within
400 bp. By contrast, a subsample that included only US inbred lines demonstrated a
lower rate of decay over distance, reaching nominal levels in ~1 kb. Higher LD in the US
germplasm is consistent with the recent formation of these inbred lines and their
relatively narrow genetic base.
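The r² statistic used as the threshold above can be computed directly from phased haplotype counts. The following is a minimal sketch, assuming two biallelic loci with hypothetical allele labels A/a and B/b; the function name and the input encoding are illustrative and are not taken from the studies cited. It also returns the absolute value of D′, the other measure commonly reported for LD.

```python
from collections import Counter

def pairwise_ld(haplotypes):
    """Return (D, |D'|, r2) for two biallelic loci.

    `haplotypes` is a list of 2-character strings, one per phased
    chromosome, using the hypothetical labels "AB", "Ab", "aB", "ab".
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    # Allele frequencies of A (locus 1) and B (locus 2)
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n
    # D is the excess of the AB haplotype over its expectation
    d = counts["AB"] / n - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom
    # D' scales D by its maximum possible value given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2

# Illustrative sample of eight chromosomes (not real data)
sample = ["AB"] * 3 + ["Ab"] + ["aB"] + ["ab"] * 3
d, d_prime, r2 = pairwise_ld(sample)
print(d, d_prime, r2)
```

In a real survey this calculation would be repeated for every pair of polymorphic sites and the resulting r² values plotted against the physical distance between sites to describe the decay of LD.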
A second study surveyed six genes of longer length (1.2–10 kb) in 102 inbred lines
(Remington et al. 2001). These lines included tropical and semitropical lines and thus are
more genetically diverse than samples of US inbred lines alone but probably less diverse
than a species-wide sample. In this study, LD again declined rapidly; for five of six
genes, LD was below the nominal level in 200 to 1,500 bp. In four of six genes sampled,
predicted r² values declined to less than 0.1 within 2,000 bp, much less than the 50 kb
observed for the same degree of LD decay in Northern European human populations.
However, LD did not decay to nominal levels in 10 kb for one gene, shrunken1 (sh1).
Selection can also maintain elevated LD in localized regions. A subsequent study showed
that sh1, which encodes an enzyme in the starch biosynthesis pathway, was under directional
selection during either domestication or breeding (Whitt et al. 2002). This may provide an
explanation for the persistence of LD at sh1. Although LD decays rapidly in a gene after
selection for a particular allele (Przeworski 2002), maize is believed to have arisen from a
single domestication event in southern Mexico about 9,000 years ago (Matsuoka et al.
2002), which is recent on an evolutionary timescale. On this basis, an appreciable
selective effect on LD may still remain.
Another surprising aspect of this study was that a genome-wide sample of 47 simple
sequence repeats (SSRs) demonstrated higher levels of LD than SNPs in candidate genes.
The reason for the apparent difference between SNPs and SSRs is unclear at present, but
it may reflect differences in the type of historical information captured by markers with
different mutation rates (Remington et al. 2001). Thornsberry et al. (2001) measured
disequilibrium in and around the Dwarf8 locus in maize, and found examples of
disequilibrium spanning in excess of 3 kb in this region. An interesting feature was that
within this region were regions in equilibrium, indicating a nonlinear decay of
disequilibrium, and the potential for more complex patterns of LD than simply being
restricted to small regions of localized disequilibrium.
Longer stretches of LD have also been observed in maize: Jung et al. (2004) found
stretches of disequilibrium of up to 500 kb in the vicinity of the adh1 locus. Longer range
LD has been reported in other plants, for example by Yin et al. (2004) in Populus
trichocarpa, another out-crossing angiosperm, where LD was observed in the vicinity of
a resistance gene at distances of 34 and 16 kb. These results also indicated
region-specific LD differences. Although these observations could be due to phenomena
such as selective sweeps, they also raise the intriguing possibility, hypothesized by
Rafalski and Morgante (2004), of nonuniform recombination between genic and
nongenic regions, where less crossover occurs between the lower homology intergenic
DNA (or alternatively, preferential pairing in regions of high sequence homology – such
as expressed genes). Such a phenomenon may be restricted to species where there is less
homology in intergenic sequences, such as in regions of the maize genome. Alternatively, it
could operate in regions where there are clusters of genes under more extreme diversifying
selection, such as resistance genes, where the birth-and-death process can sometimes
eliminate pairing gene partners and result in a rather abrupt, localized end to the
region of homology along otherwise homologous chromosomes. Further re-sequencing of
large stretches of gDNA such as BAC libraries will be informative in revealing the extent
and frequency of longer range LD in out-crossing species.
2.5.3.2. Arabidopsis
time. When alleles are retained within populations for substantially longer periods than
expected for neutral genes, there is time to accumulate relatively high levels of diversity
and ample opportunity for recombination among alleles. Because FRI, rps5, and
CLAVATA2 may be atypical, additional studies of the genomic patterns of Arabidopsis
LD are merited. Nonetheless, one can draw two conclusions. The first is that LD in
species-wide samples decays far more slowly over physical distance in this selfing
species relative to out-crossing maize; this difference is consistent with the low effective
recombination rate and the demographic consequences of selfing. The second is that
selection on a particular gene, such as rps5, affects the distribution of genetic diversity in
neighboring genes through genetic associations. The extent of these effects likely is
stronger in selfing than in out-crossing taxa.
Species                     Extent of LD    Criterion
Human                       60 kb           D′ half-length, North Europeans
Human                       5 kb            D′ half-length, Yoruba-Nigerians
Cattle                      >10 cM          D′ half-length
A. thaliana                 50–100 kb       r² half-length
Soybean                     >50 kb          Little LD decay found
Norway Spruce               ~100 bp         r² half-length
Norway Spruce               ~200 bp         r² = 0.2
Grape                       >500 bp         r² half-length
Maize                       ~400 bp         r² = 0.2
Maize (inbreds from USA)    ~1 kb           r² = 0.2
Maize                       200–1,500 bp    r² = 0.2
Reprinted from Trends in Genetics 20(2), Rafalski and Morgante; Corn and humans: recombination and LD in
two genomes of similar size, pp. 103–111, 2004, with permission from Elsevier.
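The two kinds of criterion in the table above (a half-length and a fixed threshold such as r² = 0.2) can be related to each other with a simple decay model. The sketch below is illustrative only: it assumes the widely used drift-recombination approximation E[r²] ≈ 1/(1 + ρd), where d is the distance in bp and ρ is the population-scaled recombination rate per bp (four times the effective population size times the per-bp recombination rate). This model is not used in the chapter, it ignores sample size, mutation and selection, and it treats the half-length as the distance at which expected r² falls to 0.5. All function names are hypothetical.

```python
def expected_r2(distance_bp, rho_per_bp):
    """Expected r2 at a given distance under E[r2] = 1 / (1 + rho * d)."""
    return 1.0 / (1.0 + rho_per_bp * distance_bp)

def distance_at_r2(target_r2, rho_per_bp):
    """Distance (bp) at which the expected r2 falls to `target_r2`."""
    if not 0.0 < target_r2 < 1.0:
        raise ValueError("target_r2 must lie strictly between 0 and 1")
    return (1.0 / target_r2 - 1.0) / rho_per_bp

def rho_from_threshold(target_r2, distance_bp):
    """Scaled recombination rate implied by reaching `target_r2` at `distance_bp`."""
    if not 0.0 < target_r2 < 1.0:
        raise ValueError("target_r2 must lie strictly between 0 and 1")
    return (1.0 / target_r2 - 1.0) / distance_bp

# Under the assumed model, r2 = 0.2 at ~400 bp (the species-wide maize sample)
rho_diverse = rho_from_threshold(0.2, 400)
# and r2 = 0.2 at ~1 kb (the US inbred subsample)
rho_inbred = rho_from_threshold(0.2, 1000)
print(rho_diverse, rho_inbred)
print(distance_at_r2(0.5, rho_diverse))  # half-length implied by the model
```

Real decay curves rarely follow a single smooth function, as the region-to-region heterogeneity described in this section shows, so a model of this kind is best read as a rough way of comparing criteria rather than as a description of any particular genome.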
2.6 CONCLUSION
meioses (i.e., all those that have occurred in the history of the sample) than traditional
mapping population.
The extent of LD can vary across genomes and between species. Also, there is some
indication of variation in genome-wide patterns of LD within a species, with some high
LD regions interspersed with regions of low LD. Studies in model organisms such as
Drosophila, maize and Arabidopsis, as well as humans, have informed plant geneticists
of the potential complexity of phenomena that give rise to observed patterns of
disequilibrium, including the heterogeneity of LD that can occur within species.
Implications for association genetics are that global, or genome-wide, averages may not
adequately reflect patterns in specific regions; patterns within regions of interest
will therefore need adequate elucidation for successful application of association genetics
approaches. Population structure and biological behavior, in particular, have pronounced
effects on patterns of LD. Such differences have been observed between model plant species
such as Arabidopsis and maize, which show very different patterns of LD.
However, the current state of knowledge is such that even in model species, there is much
to learn about the nature of LD and its underlying causes.
2.7 REFERENCES
Abecasis, G.R., Cookson, W.C., 2000, GOLD – graphical overview of linkage disequilibrium. Bioinformatics
16: 182–183.
Abbott, R.J., Gomes, M.F., 1989, Population genetic structure and outcrossing rate of Arabidopsis thaliana (L.)
Heynh. Heredity 62: 411–418.
Aguadé, M., 2001, Nucleotide sequence variation at two genes of the phenylpropanoid pathway, the FAH1 and
F3H genes in Arabidopsis thaliana. Mol Biol Evol 18(1): 1–9.
Altshuler, D., Hirschhorn, J.N., Klannemark, M., Lindgren, C.M., Vohl, M.-C., Nemesh, J., Lane, C.R., Schaffner,
S.F., Bolk, S., Brewer, C., Tuomi, T., Gaudet, D., Hudson, T.J., Daly, M., Groop, L., Lander, E.S., 2000,
The common PPARγ Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat
Genet 26: 76–80.
Ardlie, K., Liu-Cordero, S.N., Eberle, M.A., Daly, M., Barrett, J., Winchester, E., Lander, E.S., Kruglyak, L.,
2001, Lower-than-expected linkage disequilibrium between tightly linked markers in humans suggests a
role for gene conversion. Am J Hum Genet 69(3): 582–589.
Ball, R.D., 2005, Experimental designs for reliable detection of linkage disequilibrium in unstructured random
population association studies. Genetics 170: 859–873.
Beer, S.C., Siripoonwiwat, W., O’Donoghue, L.S., Souza, E., Mathews, D., Sorrells, M.E., 1997, Associations
between molecular markers and quantitative traits in an oat germplasm pool: can we infer linkages?
J Agric Genom 3. [online] URL: http://www.ncgr.org/research/jag/papers97/paper197/indexp197.html.
Begun, D.J., Aquadro, C.F., 1992, Levels of naturally occurring DNA polymorphisms correlate with
recombination rates in D. melanogaster. Nature 356(6369): 519–520.
Begun, D.J., Aquadro, C.F., 1994, Evolutionary inferences from DNA variation at the 6-phosphogluconate
dehydrogenase locus in natural populations of Drosophila: selection and geographic differentiation.
Genetics 136(1): 155–171.
Bergelson, J., Stahl, E., Dudek, S., Kreitman, M., 1998, Genetic variation within and among populations of
Arabidopsis thaliana. Genetics 148: 1311–1323.
Borevitz, J.O., Nordborg, M., 2003, The impact of genomics on the study of natural variation in Arabidopsis.
Plant Physiol 132: 718–725.
Brown, G.R., Gill, G.P., Kuntz, R.J., Langley, C.H., Neale, D.B., 2004, Nucleotide diversity and linkage
disequilibrium in loblolly pine. Proc Natl Acad Sci USA 101(42): 15255–15260.
Cannon, G.B., 1963, The effects of natural selection on linkage disequilibrium and relative fitness in
experimental population of Drosophila melanogaster. Genetics 48: 1201–1216.
Cardon, L.R., Abecasis, G.R., 2003, Using haplotype blocks to map human complex trait loci. Trends Genet.
19: 135–140.
Chovnick, A., Schalet, A., Kernaghan, R.P., Kraus, M., 1964, The rosy cistron in Drosophila melanogaster:
genetic fine structure analysis. Genetics 50: 1245–1259.
Clark, A.G., 1990, Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol
7: 111–122.
Corder, E.H., Saunders, A.M., Risch, N.J., Strittmatter, W.J., Schmechel, D.E., Gaskell Jr., P.C., Rimmler, J.B.,
Locke, P.A., Conneally, P.M., Schmader, K.E., Small, G.W., Roses, A.D., Haines, J.L., Pericak-Vance,
M.A., 1994, Protective effect of apolipoprotein E type 2 allele for late onset Alzheimer disease. Nat Genet
7: 180–184.
Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., Lander, E.S., 2001, High resolution haplotype structure
in human genome. Nat Genet 29: 229–232.
David, J.R., Capy, P., 1988, Genetic variation of Drosophila melanogaster natural populations. Trends Genet 4:
106–111.
Devlin, B., Risch, N., 1995, A comparison of linkage disequilibrium measures for fine scale mapping.
Genomics 29: 311–322.
Dvornyk, V., Sirviö, A., Mikkonen, M., Savolainen, O., 2002, Low nucleotide diversity at the pal1 locus in the
widely distributed Pinus sylvestris. Mol Biol Evol 19: 179–188.
Excoffier, L., Slatkin, M., 1995, Maximum-likelihood estimation of molecular haplotype frequencies in a
diploid population. Mol Biol Evol 12: 921–927.
Falconer, D.S., Mackay, T.F.C., 1996, Introduction to quantitative genetics (4th edition). Longman Group
Limited, London.
Flint-Garcia, S., Thornsberry, J.M., Buckler IV, E.S., 2003, Structure of linkage disequilibrium in plants. Annu
Rev Plant Biol 54: 357–374.
Frisse, L.R., Hudson, R.R., Bartoszewicz, A., Wall, J.D., Donfack, J., Di Rienzo, A., 2001, Gene conversion and
different population histories may explain the contrast between polymorphism and linkage disequilibrium
levels. Am J Hum Genet 69: 831–843.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M.,
Lochner, A., Faggart, M., Liu-Cordero, S.N., Rotimi, C., Adebayo, A., Cooper, R., Ward, R., Lander,
E.S., Daly, M.J., Altshuler, D., 2002, The structure of haplotype blocks in the human genome. Science
296: 2225–2229.
Gaut, B.S., Long, A.D., 2003, The lowdown on linkage disequilibrium. Plant Cell 15(7): 1502–1506.
Hagenblad, J., Nordborg, M., 2002, Sequence variation and haplotype structure surrounding the flowering time
locus FRI in Arabidopsis thaliana. Genetics 161: 289–298.
Hanfstingl, U., Berry, A., Kellogg, E.A., Costa III, J.T., Rüdiger, W., Ausubel, F.M., 1994, Haplotypic
divergence coupled with lack of diversity at the Arabidopsis thaliana alcohol dehydrogenase locus: roles
for both balancing and directional selection? Genetics 138: 811–828.
Hästbacka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., Lander, E., 1992, Linkage disequilibrium
mapping in isolated founder populations: diastrophic dysplasia in Finland. Nat Genet 2: 204–211.
Hedrick, P.W., 1987, Gametic disequilibrium measures: proceed with caution. Genetics 117: 331–341.
Hill, W.G., Weir, B.S., 1994, Maximum-likelihood estimation of gene location by linkage disequilibrium. Am J
Hum Genet 54: 705–714.
Innan, H., Tajima, F., Terauchi, R., Miyashita, N., 1996, Intragenic recombination in the Adh locus of the wild
plant Arabidopsis thaliana. Genetics 143: 1761–1770.
Inoue, I., Nakajima, T., Williams, C.S., Quackenbush, J., Puryear, R., Powers, M., Cheng, T., Ludwig, E.H.,
Sharma, A.M., Hata, A., Jeunemaitre, X., Lalouel, J.M., 1997, A nucleotide substitution in the promoter
of human angiotensinogen is associated with essential hypertension and affects basal transcription in
vitro. J Clin Invest 99: 1786–1797.
Iso, H., Harada, S., Shimamoto, T., Sato, S., Kitamura, A., Sankai, T., Tanigawa, T., Iida, M., Komachi, Y.,
2000, Angiotensinogen T174M and M235T variants, sodium intake and hypertension in non-drinking,
lean Japanese men and women. J Hypertens 18: 1197–1206.
Jeffreys, A.J., Kauppi, L., Neumann, R., 2001, Intensely punctate meiotic recombination in the class II region
of the major histocompatibility complex. Nat Genet 29: 217–222.
Jennings, H.S., 1917, The numerical results of diverse systems of breeding, with respect to two pairs of
characters, linked or independent, with special relations to the effect of linkage. Genetics 2: 97–154.
Jeunemaitre, X., Inoue, I., Williams, C., Charru, A., Tichet, J., Powers, M., Sharma, A.M., Gimenez-Roqueplo,
A.P., Hata, A.S., Corvol, P., Lalouel, J.M., 1997, Haplotypes of angiotensinogen in essential
hypertension. Am J Hum Genet 60: 1448–1460.
Jorde, L.B., 1995, Linkage disequilibrium as a gene mapping tool. Am J Hum Genet 56: 11–14.
Jung, M., Ching, A., Bhattramakki, D., Dolan, M., Tingey, S., Morgante, M., Rafalski, A., 2004, Linkage
disequilibrium and sequence diversity in a 500-kbp region around the adh1 locus in elite maize
germplasm. Theor Appl Genet 109(4): 681–689.
Kaplan, N., Weir, B.S., 1992, Expected behavior of conditional linkage disequilibrium. Am J Hum
Genet 51: 333–343.
Kato, N., Sugiyama, T., Morita, H., Kurihara, H., Yamori, Y., Yazaki, Y., 1999, Angiotensinogen gene and
essential hypertension in the Japanese: extensive association study and meta-analysis on six reported
studies. J. Hypertens 17: 757–763.
Kerem, B., Rommens, J.M., Buchanan, J.A., Markiewicz, D., Cox, T.K., Chakravarti, A., Buchwald, M., Tsui,
L.C., 1989, Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073–1080.
Knowler, W.C., Williams, R.C., Pettitt, D.J., Steinberg, A.G., 1988, Gm3–5,13,14 and type 2 diabetes mellitus; an
association in American Indians with genetic admixture. Am J Hum Genet 43: 520–526.
Kruglyak, L., 1999, Prospects for whole-genome linkage disequilibrium mapping of common disease genes.
Nat Genet 22: 139–144.
Krutovsky, K.V., Neale, D.B., 2005, Nucleotide diversity and linkage disequilibrium in cold-hardiness- and
wood quality-related candidate genes in Douglas-fir. Genetics 171: 2029–2041.
Kunz, R., Kreutz, R., Beige, J., Distler, A., Sharma, A.M., 1997, Association between the angiotensinogen
235T-variant and essential hypertension in whites: a systematic review and methodological appraisal.
Hypertension 30: 1331–1337.
Langley, C.H., Lazzaro, B.P., Phillips, W., Heikkinen, E., Braverman, J.M., 2000, Linkage disequilibria and the
site frequency spectra in su(s) and su(w(a)) regions of the Drosophila melanogaster X chromosome.
Genetics 156(4): 1837–1852.
Levin, M.L., Bertell, R., 1978, Re: “Simple estimation of population attributable risk from case-control
studies.” Am J Epidemiol 108: 78–79.
Lewontin, R.C., 1964, The interaction of selection and linkage. I. General considerations; heterotic models.
Genetics 49: 49–67.
Matsuoka, Y., Vigouroux, Y., Goodman, M.M., Sanchez, G.J., Buckler, E., Doebley, J., 2002, A single
domestication for maize shown by multilocus microsatellite genotyping. Proc Natl Acad
Sci USA 99: 6080–6084.
McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R., Donnelly, P., 2004, The fine scale structure
of recombination rate variation in the human genome. Science 304: 581–584.
Nachman, M.W., 2002, Variation in recombination rates across the genome: evidence and implications. Curr
Opin Genet Dev 12(6): 657–663.
Nordborg, M., 2000, Linkage disequilibrium, gene trees, and selfing: an ancestral recombination graph with
partial self-fertilization. Genetics 154: 923–929.
Nordborg, M., Donnelly, P., 1997, The coalescent process with selfing. Genetics 146: 1185–1195.
Nordborg, M., Tavaré, S., 2002, Linkage disequilibrium: what history has to tell us. Trends Genet 18: 83–90.
Nordborg, M., Borevitz, J.O., Bergelson, J., Berry, C.C., Chory, J., Hagenblad, J., Kreitman, M., Maloof, J.N.,
Noyes, T., Oefner, P.J., Stahl, E.A., Weigel, D., 2002, The extent of linkage disequilibrium in
Arabidopsis thaliana. Nat Genet 30: 190–193.
Palaisa, K., Morgante, M., Williams, M., Rafalski, A., 2003, Contrasting effects of selection on sequence
diversity and linkage disequilibrium at two phytoene synthase loci. Plant Cell 15: 1795–1806.
Pan, W.H., Chen, J.W., Fann, C., Jou, Y.S., Wu, S.Y., 2000, Linkage analysis with candidate genes: the Taiwan
young-onset hypertension genetic study. Hum Genet 107: 210–215.
Parsch, J., Meiklejohn, C.D., Hartl, D.L., 2001, Patterns of DNA sequence variation suggest the recent action of
positive selection in the janus-ocnus region of Drosophila simulans. Genetics 159(2): 647–657.
Pettersson, F., Oskar, J., Cardon, L.R., 2004, GoldSurfer: three dimensional display of linkage disequilibrium.
Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, J.D., Donaldson, M.A., Studebaker,
J.F., Ankener, W.M., Alfisi, S.V., Kuo, F.-S., Camisa, A.L., Pazorov, V., Scott, K.E., Carey, B.J., Faith,
J., Katari, G., Bhatti, H.A., Cyr, J.M., Derohannessian, V., Elosua, C., Forman, A.M., Grecco, N.M.,
Hock, C.R., Kuebler, J.M., Lathrop, J.A., Mockler, M.A., Natchtman, E.P., Restine, S.L., Varde, S.A.,
Hozza, M.J., Gelfand, C.A., Broxholme, J., Abecasis, G.R., Boyce-Jacino, M.T., Cardon, L.R., 2003,
Chromosome-wide distribution of haplotype blocks and the role of recombination hotspots. Nat Genet
33: 382–387.
Pritchard, J.K., Przeworski, M., 2001, Linkage disequilibrium in humans: models and data. Am J Hum Genet
69: 1–14.
Przeworski, M., 2002, The signature of positive selection at randomly chosen loci. Genetics 160: 1179–1189.
Przeworski, M., Wall, J.D., 2001, Why is there so little intragenic LD in humans? Genet Res 77: 143–151.
Ptak, S.E., Voelpel, S., Przeworski, M., 2004, Insights into recombination from patterns of linkage
disequilibrium in humans. Genetics 167: 387–397.
Rafalski, A., Morgante, M., 2004, Corn and Humans: recombination and linkage disequilibrium in two genomes
of similar size. Trends Genet 20(2): 103–111.
Rankinen, T., Gagnon, J., Perusse, L., Chagnon, Y.C., Rice, T., Leon, A.S., Skinner, J.S., Wilmore, J.H., Rao,
D.C., Bouchard, C., 2000, AGT M235T and ACE ID polymorphisms and exercise blood pressure in the
HERITAGE family study. Am J Physiol Heart Circ Physiol 279: H368–H374.
Reich, D.E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P.C., Richter, D.J., Lavery, T., Kouyoumjian, R.,
Farhadian, S.F., Ward, R., Lander, E.S., 2001, Linkage disequilibrium in the human genome. Nature 411:
199–204.
Remington, D.L., Thornsberry, J.M., Matsuoka, Y., Wilson, L.M., Whitt, S.R., Doebley, J., Kresovich, S.,
Goodman, M.M., Buckler, E.S., 2001, Structure of linkage disequilibrium and phenotypic associations in
the maize genome. Proc Natl Acad Sci USA 98: 11479–11484.
Rice, T., Rankinen, T., Province, M.A., Chagnon, Y.C., Perusse, L., Borecki, I.B., Bouchard, C., Rao, D.C.,
2000, Genome-wide linkage analysis of systolic and diastolic blood pressure: the Quebec family study.
Circulation 102: 1956–1963.
SAS Institute, Inc., 2004, SAS/Genetics 9.1 User’s Guide, Cary, NC: SAS Institute, Inc.
Sato, N., Katsuya, T., Nakagawa, T., Ishikawa, K., Fu, Y., Asai, T., Fukuda, M., Suzuki, F., Nakamura, Y.,
Higaki, J., Ogihara, T., 2000, Nine polymorphisms of angiotensinogen gene in the susceptibility to
essential hypertension. Life Sci 68: 259–272.
Schaeffer, S.W., Walthour, C.S., Toleno, D.M., Olek, A.T., Miller, E.L., 2001, Protein variation in Adh and
Adh-related in Drosophila pseudoobscura. Linkage disequilibrium between single nucleotide
polymorphisms and protein alleles. Genetics 159(2): 673–687.
Shepard, K.A., Purugganan, M.D., 2003, Molecular population genetics of the Arabidopsis CLAVATA2 region:
The genomic scale of variation and selection in selfing species. Genetics 163: 1083–1095.
Souza, E., Sorrells, M.E., 1991, Relationships among 70 North American oat germplasms: I. Cluster analysis
using quantitative characters. Crop Sci 31: 599–605.
Staessen, J.A., Kuznetsova, T., Wang, J.G., Emelianov, D., Vlietinck, R., Fagard, R., 1999, M235T
angiotensinogen gene polymorphism and cardiovascular renal risk. J Hypertens 17: 9–17.
Tajima, F., 1989, Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.
Genetics 123: 585–595.
Tenaillon, M., Sawkins, M.C., Long, A.D., Gaut, R.L., Doebley, J.F., Gaut, B.S., 2001, Patterns of DNA
sequence polymorphisms along chromosome 1 of maize (Zea mays ssp. mays L.). Proc Natl Acad Sci
USA 98: 9161–9166.
Terwilliger, J.D., 1995, A powerful likelihood method for the analysis of linkage disequilibrium between trait
loci and one or more polymorphic loci. Am J Hum Genet 56: 777–787.
The International HapMap Consortium, 2005, A haplotype map of the human genome. Nature 437: 1299–1320.
Thornsberry, J.M., Goodman, M.M., Doebley, J., Kresovich, S., Nielsen, D., Buckler IV, E.S., 2001, Dwarf8
polymorphisms associate with variation in flowering time. Nat Genet 28: 286–289.
Thumma, B.R., Nolan, M.F., Evans, R., Morgan, G.F., 2005, Polymorphisms in Cinnamoyl CoA reductase
(CCR) are associated with variation in microfibril angle in Eucalyptus spp. Genetics 171: 1257–1265.
Tian, D., Araki, H., Stahl, E.A., Bergelson, J., Kreitman, M., 2002, Signature of balancing selection in
Arabidopsis. Proc Natl Acad Sci USA 99: 11525–11530.
Tishkoff, S.A., Dietzsch, E., Speed, W., Pakstis, A.J., Kidd, J.R., Cheung, K., Bonné-Tamir, B., Santachiara-
Benerecetti, A.S., Moral, P., Krings, M., Pääbo, S., Watson, E., Risch, N., Jenkins, T., Kidd, K.K., 1996,
Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271:
1380–1387.
Todokoro, S., Terauchi, R., Kawano, S., 1995, Microsatellite polymorphisms in natural populations of
Arabidopsis thaliana in Japan. Jpn J Genet 70: 543–554.
Verrelli, B.C., Eanes, W.F., 2001, Clinal variation for amino acid polymorphisms at the Pgm locus in
Drosophila melanogaster. Genetics 157(4): 1649–1663.
Wang, W., Thornton, K., Berry, A., Long, M., 2002, Nucleotide variation along the Drosophila melanogaster
fourth chromosome. Science 295(5552): 134–137.
Weir, B.S., 1996, Genetic Data Analysis II, Sinauer, Sunderland, MA, USA.
Whitt, S.R., Wilson, L.M., Tenaillon, M.I., Gaut, B.S., Buckler, E.S., 2002, Genetic diversity and selection in
the maize starch pathway. Proc Natl Acad Sci USA 99: 12959–12962.
Yin, T.-M., DiFazio, S.P., Gunter, L.E., Jawdy, S.S., Boerjan, W., Tuskan, G.A., 2004, Genetic and physical
mapping of Melampsora rust resistance genes in Populus and characterization of linkage disequilibrium
and flanking genomic sequence. New Phytol 164: 95–105.
Zapata, C., Alvarez, C., 1983, On the detection of non-random association in natural populations of Drosophila.
Mol Biol Evol 10: 823–841.
Chapter 3
WHAT ARE SNPs?
1 Primary Industries Research Victoria, Victorian AgriBiosciences Centre, La Trobe R&D Park, Bundoora,
Victoria 3083, Australia
2 HortResearch, Plant Gene Mapping group, Private Bag 11030, Palmerston North, New Zealand
42 DAVID EDWARDS ET AL.
construction of ultra-high density genetic maps and association with genetic disorders (in
humans and livestock) and agronomic traits (in livestock and crop plants).
SNPs provide an abundant source of DNA polymorphism in a number of eukaryote
species. Information on the frequency, nature and distribution of SNPs in the majority of
plant genomes is limited. However, the level of development and application of SNPs in
higher plants, including some crop and tree species, is increasing, and consequently they
provide an attractive marker system to plant breeders and geneticists. With the increasing
availability of public sequence data and the rapid discovery of SNPs in plants, the
development and application of SNP markers will continue to accelerate.
SNPs can differentiate between related sequences both within an individual and
between individuals in a population. In diploid species, in which an individual is
heterozygous at a genetic locus, there are two homologous gene copies that may be
differentiated by SNPs. The inheritance of each variant may be directly measured in
the progeny. Detection of SNPs in individuals becomes complicated in the presence of
gene or genome duplication. In these instances, it is often difficult to differentiate
between homoeologous (between genome) and paralogous (within genome) duplication
of genetic loci without detailed genetic inheritance studies. Because the majority of DNA
in individuals within a related population is the same, genetic differences between
individuals can be defined by SNPs. The frequency of SNPs (nucleotide diversity) and
the haplotypic diversity (heterozygosity) between two individuals or within a population
are direct measures of genetic diversity. Under conditions of forced inbreeding, such as
recurrent backcrossing to parental individuals, sib-mating or mating between individuals
with lower-degree relatedness, reduced genetic diversity and SNP frequency are observed.
Such conditions may have arisen due to population reduction or isolation in natural
populations (the so-called ‘founder effect’). For domesticated crop plants, narrow genetic
bases have contributed to corresponding reduced genetic diversity at the nucleotide level.
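Both diversity measures mentioned above can be computed from a small set of aligned sequences. The sketch below is a minimal illustration under stated assumptions: nucleotide diversity is taken as the average number of pairwise differences per site, and haplotype diversity uses Nei's sample-size-corrected estimator, n/(n − 1) × (1 − Σp²). The function names and the example sequences are hypothetical, and gaps or missing data are not handled.

```python
from collections import Counter
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site among aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching sites for every pair of sequences
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)

def haplotype_diversity(seqs):
    """Nei's unbiased haplotype diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(p * p for p in freqs))

# Four hypothetical 10-bp alleles carrying two SNPs (positions 4 and 10)
alleles = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGAACGTAT"]
print(nucleotide_diversity(alleles))
print(haplotype_diversity(alleles))
```

A population that has passed through inbreeding or a founder event would be expected to give lower values of both statistics than a more diverse sample of the same locus.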
The frequency and nature of SNPs in plants are beginning to receive considerable
attention. A number of reports in Arabidopsis thaliana, rice and maize have provided
estimates of sequence diversity in these species. In many species, the analysis of DNA
sequence variation has been confined to single genes or DNA fragments with the goal of
defining gene structure, function or evolutionary relationships. It is known that SNPs are
widely distributed throughout genomes, although various studies show that the
occurrence and distribution of SNPs differ between species, in particular between
inbreeding and outbreeding species, or in those species with a narrow genetic base. It is
generally well accepted that some species, for example maize, are highly polymorphic,
whilst others, such as soybean and melon, are less polymorphic. Detailed studies of
sequence diversity have now been performed at selected loci for a range of plant species;
in plants, typical frequencies are in the range of 1 SNP every 100–300 bp.
The most advanced SNP studies in plants have been performed on model species
where a large quantity of genomic or EST sequence is available. SNPs have been
detected using high-throughput analysis in A. thaliana (Cho et al. 1999). ESTs are a good
resource for SNP discovery and they have been used for SNP discovery in sugarbeet
(Schneider et al. 2001), maize (Ching et al. 2002; Batley et al. 2003), rice (Nasu et al.
2002), soybean (Zhu et al. 2003) and sugarcane (Grivet et al. 2003). In soybean 280
SNPs from 143 amplicons (76.3 kb) have been identified (Zhu et al. 2003). In maize, one
non-coding SNP/31 bp and one coding SNP/124 bp have been reported for 18 maize genes
in 36 inbred lines (Ching et al. 2002).
A genome-wide polymorphism database of rice has been constructed defining
polymorphisms between the cultivars Nipponbare (from sub-species japonica) and 93-11
(from sub-species indica) (Shen et al. 2004). The database contains 1,703,176 SNPs and
479,406 insertions/deletions (indels) (see Section 3.6 for further discussion on indels).
This equates to approximately 1 SNP/268 bp in the rice genome. A similar study was also
performed by Feltus et al. (2004). After aligning drafts of rice indica and japonica
sequence and filtering to remove multiple copy and low-quality sequence, 384,341
candidate intersubspecific SNPs were identified, at a frequency of approximately 1.7
SNPs/kb. Due to the stringent filtering process, this is probably an underestimate of the
real SNP frequency in rice. This work was performed again in 2005 (Yu et al. 2005) using
alignments of the improved whole-genome shotgun sequences for japonica and indica rice.
SNP frequencies varied from 3 SNPs/kb in coding sequence to 27.6 SNPs/kb in the
transposable elements, with a genome wide measure of 15.13 SNPs/kb, or 1 SNP per 66 bp.
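SNP frequencies are quoted in the literature either as a density (SNPs/kb) or as an average spacing (1 SNP per n bp). The following minimal sketch converts between the two, using figures quoted in this chapter as inputs; the function names are illustrative only.

```python
def bp_per_snp(total_bp, snp_count):
    """Average spacing between SNPs, in base pairs."""
    if snp_count <= 0:
        raise ValueError("snp_count must be positive")
    return total_bp / snp_count


def snps_per_kb(total_bp, snp_count):
    """SNP density expressed per kilobase of sequence surveyed."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return 1000 * snp_count / total_bp


def spacing_from_density(density_per_kb):
    """Convert a density in SNPs/kb to an average spacing in bp."""
    if density_per_kb <= 0:
        raise ValueError("density must be positive")
    return 1000 / density_per_kb


# Soybean: 280 SNPs in 76.3 kb of amplicon sequence (Zhu et al. 2003).
soybean_spacing = bp_per_snp(76_300, 280)
# Rice: genome-wide density of 15.13 SNPs/kb (Yu et al. 2005).
rice_spacing = spacing_from_density(15.13)
```

With these inputs the soybean spacing is roughly 1 SNP per 273 bp and the rice spacing roughly 1 SNP per 66 bp, in line with the values reported in the text.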
Further studies in rice have involved SNP discovery and characterisation in the Piz
and Piz-t regions (Hayashi et al. 2004). The frequency was found to be similar to the
previous studies, with an SNP found every 248 bp (Hayashi et al. 2004). The SNP
frequency varied slightly depending on the cultivars being assessed. On average, 1 SNP
was detected every 390 bp between cultivars Nipponbare and Zenith and 1 SNP per 173
bp between cultivars Nipponbare and Toride 1. The SNP frequency was higher between
Zenith and Toride 1, with an SNP on average every 140 bp. In earlier studies, Yu et al.
(2002) compared sequences from japonica and indica cultivars and found an average of
1 SNP every 170 bp, while Nasu et al. (2002) reported a similar frequency for rice SNPs.
Extensive research has been performed on SNP frequency in barley. Russell et al.
(2004) examined the frequency and distribution of SNPs within 23 genes associated with
grain germination in barley in a range of accessions including European cultivars,
landraces and wild barley. The frequency of SNPs was found to be 1 SNP every 78 bp. In
a further study, the Isa (inhibitor of α-amylase) gene was sequenced in 16 barley
genotypes to detect sequence polymorphisms (Bundock and Henry 2004). A total of 80
SNPs were identified in the 2,164 bp sequence, containing the Isa promoter, transcript
and 3′-untranslated region (UTR), giving a high frequency of 1 SNP/27 bp. Kota et al.
(2001) identified 72 polymorphic SNPs in seven genotypes of barley. The frequency of
SNPs was estimated to be 1 every 240 bp. This was calculated from 52,140 bp of
sequence from each genotype analysed. Similar studies have been performed in other
crop species such as Beta vulgaris and Zea mays, for which the relevant frequencies were
1 every 60–130 bp and 104 bp, respectively (Schneider et al. 2001; Ching et al. 2002;
Tenaillon et al. 2001). As expected, the frequency of SNPs in inbreeding species such as
barley is lower than that observed in outbreeding species. This is further demonstrated in
poplar, an out-breeding tree species, which exhibits a high level of variation. Cronk
(2005) determined the presence of an SNP every 100 bp in poplar, increasing to 1 every 50
bp when geographically diverse species were included in the study.
In a study of 25 diverse genotypes of soybean (Zhu et al. 2003), a total of 280 SNPs
were identified in 143 amplicons, totalling 76.3 kb sequence, providing 1 SNP per
273 bp. It was found that nucleotide diversity was lower in soybean than maize or
A. thaliana, and this may be due to inbreeding. However, as A. thaliana is also
self-pollinating, this does not explain all the findings. These results may also be due to
the narrow genetic base of soybean. SNP discovery has also been performed in lesser-
known crops. Based on EST sequence information, fragments of 34 genes were amplified
from five diverse quinoa (Chenopodium quinoa Willd.) accessions and the related weed
species C. berlandieri and sequenced (Coles et al. 2005). Analysis of the quinoa EST
sequences revealed a total of 51 polymorphisms in 20 EST sequences, including 38 SNPs
and 13 indels. This was an average of 1 SNP every 462 bp, which increased to 1 SNP
every 179 bp when C. berlandieri was included in the analysis. This SNP frequency is
lower than that observed in barley (1/189 bp), maize (1/104 bp) and sugarbeet (1/130 bp),
but similar to levels observed in soybean (1/503 bp) and A. thaliana (1/336 bp). Although
the sample size was small, the SNP frequency reflects the narrow genetic base for
cultivated quinoa.
Lopez et al. (2005) exploited a recently developed EST collection to identify SNPs
in five cultivars of cassava (Manihot esculenta Crantz). One SNP per 905 bp was
detected in intra-cultivar comparisons and 1 SNP per 1,032 bp was detected in inter-
cultivar comparisons, based on data from 111 contigs, with an overall value of 1 SNP
every 509 bp. This study also obtained further information on SNP frequency in six
cultivars from 33 amplicons from 3′-EST and BAC end sequences. A total of 11 kb of
sequence was obtained for each cultivar, with 186 SNPs being identified. Of these, 146
were observed within cultivars and 80 were observed between cultivars. The total
frequency of SNPs was found to be one per 62 bp, a value similar to that observed for
other crops. The intra-cultivar variation may be due to the presence of background
heterozygosity and inbreeding depression within the lines. Cassava is also an ancient
polyploid and predicted SNPs may be due to the presence of paralogous comparisons
between members of multi-gene families.
In potato, 277 SNPs were identified between two alleles of the urease gene, with an
average of 2.5 SNPs per 100 bp (Wittle et al. 2005). This average frequency of 1 SNP per
40 bp is relatively high for comparison between two alleles of a single copy gene. This is
also reflected by studies of SNP variation in resistance gene analogues (RGAs) of
cultivated potato, as described in Chapter 4.
different frequencies in different genomic regions. This uneven distribution may be due
to differences in recombination rate, gene density, transmission pattern, selection strength
and compositional pressure. Genomic regions with low recombination rates generally
have reduced levels of polymorphisms (Rafalski and Morgante 2004). Regions subject to
strong balancing selection (i.e. two or more alleles or haplotypes are maintained), such as
those containing disease resistance genes, show the greatest diversity (Kuang et al. 2004).
The local abundance of SNPs within the genome varies due to a combination of the
mutation rate that generates new polymorphisms and any positive or negative selection
for regions linked to these mutations. SNP generation de novo may be more frequent
outside of transcribed genic regions as these regions tend to exhibit greater levels of
5-methylcytosine (5meC) abundance, an important factor in the generation of the most
abundant C to T mutation due to deamination of 5meC (which is aminothymidine) to T
over evolutionary time. The majority of SNPs would be expected to be evolutionarily
neutral, that is, they would be neither selected for nor against, and their abundance in a
population would vary due to random genetic drift. Rare deleterious mutations are
counter-selected at a rate characteristic of the specific fitness penalty. For example, SNPs
or indels in transcribed sequences that lead to the production of altered proteins are
relatively infrequent in populations when compared to similar polymorphisms within
intron or untranscribed sequence. Selection, either natural or through breeding, would lead
to the removal of deleterious sequences from the population and increase the abundance
of beneficial sequences. Selective pressure would apply to sequences in proximity to the
selected sequence (the so-called ‘hitch-hiking’ phenomenon) unless they are separated
by recombination during meiosis. Thus, strong selective pressure is likely to lead to
genomic regions with reduced genetic diversity and fewer SNPs. This hypothesis is
supported by the observation that in most organisms studied to date, SNPs are more
prevalent in the non-coding regions of the genome. These mutations should theoretically
only affect the phenotype if they cause a change in the regulation of gene expression,
changing the expression pattern of surrounding transcribed regions. Within the coding
regions, an SNP is either non-synonymous and results in an amino acid change, or is
synonymous and does not alter the amino acid sequence, and is therefore usually assumed to be neutral.
Non-synonymous SNPs may also be radical or conservative in nature, depending on
whether the substitution moves between positively charged, uncharged and negatively charged amino acid
side-groups. A synonymous change may, however, modify an RNA splice
site, resulting in phenotypic changes. SNPs have become popular tools for
identifying genetic loci that contribute to phenotypic variation based on LD (see Chapters
2 and 7 for further discussion on the principles of LD).
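The distinction between synonymous, conservative and radical changes can be made concrete with a short sketch. The code below builds the standard genetic code table and classifies a single-base change within a codon. The three-way charge grouping (with histidine treated as positively charged) is a simplifying assumption for illustration, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Simplified side-group charge classes (an assumption; histidine is borderline).
CHARGE_CLASS = {"K": "positive", "R": "positive", "H": "positive",
                "D": "negative", "E": "negative"}


def charge_class(amino_acid):
    """Return the charge class of an amino acid, defaulting to uncharged."""
    return CHARGE_CLASS.get(amino_acid, "uncharged")


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base change at a 0-based position within a codon."""
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"
    if "*" in (ref_aa, alt_aa):
        return "non-synonymous (stop gained or lost)"
    if charge_class(ref_aa) == charge_class(alt_aa):
        return "non-synonymous, conservative"
    return "non-synonymous, radical"
```

For example, a change from GAA to GAG leaves glutamate unchanged and is synonymous, whereas GAA to AAA replaces glutamate with lysine and is classed here as radical.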
The distribution of SNPs across the genome has been studied in a variety of plant
species. Perhaps the most comprehensive study is in A. thaliana, where over 37,000
SNPs were identified by comparing partial genome sequence from the Ler accession
with the near complete sequence of Col-0 (Schmid et al. 2003). The distribution of
SNPs was found to be even across the five chromosomes, with the exception of
centromeric regions, which contain few transcribed genes. In the ESTs studied, a total
of 4,327 SNPs were identified. Analysis of amplicons derived from sequence tagged
sites (STSs), corresponding to 4,955 consensus sequences revealed 3,773 SNPs. Of
these, 2,922 (77%) were in non-coding regions of the genome. In the EST-derived
SNPs, there was an average of 1 SNP per 336 bp. There was a higher ratio of
synonymous to non-synonymous polymorphisms in EST data than in STS data,
supporting the concept that expressed genes are more constrained in their sequence evolution
than randomly selected genomic loci.
SNPs are produced by mutations. The mutation frequency between any two
nucleotides is not random but is dependent on the nucleotide base, the base sequence in
its immediate proximity and the methylation status of the DNA. A major mechanism of
spontaneous mutation is due to errors in DNA replication. Nucleotide bases in DNA can
exist in two different structural forms (tautomers) called keto and enol forms, but are
predominantly found in the keto form. Shifts to the enol form (tautomerisation) can
alter pairing preferences, such that A may pair with C rather than T. Reversion of the
tautomeric shift following DNA replication leads to fixation of a base mutation. The
predicted average frequency of such processes is c. 1 per 10⁴ bp copied, but the influence
of fidelity maintenance systems such as polymerase proof-reading and post-replication
mismatch repair results in observed frequencies of c. 1 in 10¹⁰ bp copied, corresponding
to c. 1 in 10⁶ per gene across a broad range of organisms.
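The relationship between these rates can be checked with simple arithmetic. The sketch below uses only the three figures quoted above; the implied mutational target of about 10 kb per gene is an inference from those figures, not a value stated in the text.

```python
import math

raw_error_rate = 1e-4        # mispairings per bp copied, before correction
observed_error_rate = 1e-10  # per bp copied, after proof-reading and mismatch repair
per_gene_rate = 1e-6         # per gene, as quoted in the text

# Factor by which the fidelity systems reduce the error rate.
fold_improvement = raw_error_rate / observed_error_rate

# Gene length (bp) that would reconcile the per-bp and per-gene figures.
implied_gene_length_bp = per_gene_rate / observed_error_rate
```

Under these figures, proof-reading and mismatch repair together reduce the error rate by about a million-fold.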
Transitions are the most common form of SNP (Garg et al. 1999; Picoult-Newberg
et al. 1999; Deutsch et al. 2001; Batley et al. 2003) reflecting the high frequency of the C
to T mutation following deamination of methylated cytosine residues (Coulondre et al.
1978). C/T transitions constitute 67% of the SNPs observed in humans. Other variations
in base substitution abundance are observed, but the underlying mechanisms for these
differences remain to be explained (Batley et al. 2003).
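A transition exchanges one purine for another (A/G) or one pyrimidine for another (C/T), whereas a transversion exchanges a purine for a pyrimidine. The following sketch classifies substitutions and computes a transition:transversion ratio; the function names and the short example list are illustrative only.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    counts = Counter(substitution_class(ref, alt) for ref, alt in snps)
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return counts["transition"] / counts["transversion"]


# Hypothetical SNP list: two transitions and one transversion.
example_snps = [("C", "T"), ("A", "G"), ("A", "C")]
ratio = ts_tv_ratio(example_snps)
```

Because there are twice as many possible transversions as transitions, a ratio near or above 1:1 already indicates an excess of transitions over random expectation.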
Lopez et al. (2005) observed a significantly higher number of transitions than
transversions in intra-cultivar (64% transitions) and inter-cultivar (65% transitions)
comparisons in cassava. However, Coles et al. (2005) found an approximate 1:1
transition:transversion ratio in quinoa. A total of 20 transitions and 18 transversions were
identified, increasing to 61 and 45, respectively, if the closely related weed species, C.
berlandieri, was included in the analysis. This ratio was similar to those observed in
maize, soybean (Zhu et al. 2003) and A. thaliana, but lower than the 2:1 ratios observed
in sugarbeet, melon (Morales et al. 2004) and barley (Soleimani et al. 2003). The higher-
than-expected C/T transition rate is likely to be due to the methylation effects described
previously. Hayashi et al. (2004) found that 72–75% of SNPs between indica and
japonica rice cultivars were transitions. This finding was supported by Feltus et al.
(2004) who aligned drafts of the rice subspecies japonica and indica sequence and found
that 65.8% SNPs were transitions and 34.2% were transversions. The high frequency of
3.6 INDELS
Small insertion or deletion events (indel for insertion/deletion) are another common
form of genetic mutation. These mutations may be detected as SNPs as the insertion or
deletion of nucleotides changes the sequence read. Indels may be produced by errors in
DNA synthesis, repair or recombination, or may be due to the insertion and excision of
transposable elements that often leave a characteristic DNA footprint of several
nucleotide bases. For example, the relative abundance of eight base indels observed in
maize by Bhattramakki et al. (2002) may be due to sequence duplication during insertion
and excision of Ac/Ds transposable elements (Sutton et al. 1984).
Tenaillon et al. (2002) studied SNPs and indels located in previously published
sequences from 21 loci on maize chromosome 1. Small indels (1–5 bp) were frequent,
with 56% of the indels being 1–2 bp in length and 92% less than 20 bp in length.
Furthermore, 5 of the 21 indels longer than 20 bp were found to be previously
characterised Miniature Inverted-repeat Transposable Elements (MITEs). A total of 263
indels were observed in 17/21 loci. Indel size ranged from 1 to 640 bp, and the number
per locus ranged from 2 to 59. This frequency of small indels was also observed in the
Piz and Piz-t regions of rice. Of the 52 indels identified, 42 (81%) were 1–5 bp in length
and only 4 were longer than 40 bp (Hayashi et al. 2004).
In a study of the urease gene in potato, 40 indels were observed within non-coding
regions, of which 70% were 1–4 bp in length, 20% 5–10 bp and 10% (4 indels) were
greater than 10 bp. The instances of these long indels may be explained by the relevant
sequence features. One insertion was found to be due to a retrotransposon. A 30 bp indel
is found in an array of 30 bp repeats within an intron and may have been caused by
unequal cross-over, while a 34 bp indel is present in an SSR-containing region; such regions are
known to undergo expansion and contraction (Wittle et al. 2005).
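Size distributions such as the one reported for the potato urease gene can be tabulated with a few lines of code. In this sketch the size classes follow the text, but the individual indel lengths are hypothetical, chosen only so that the 40 indels fall into the reported 70%, 20% and 10% classes.

```python
from collections import Counter

SIZE_CLASSES = ("1-4 bp", "5-10 bp", ">10 bp")


def size_class(length_bp):
    """Assign an indel length to one of the size classes used in the text."""
    if length_bp < 1:
        raise ValueError("indel length must be at least 1 bp")
    if length_bp <= 4:
        return "1-4 bp"
    if length_bp <= 10:
        return "5-10 bp"
    return ">10 bp"


def size_distribution(lengths):
    """Fraction of indels falling into each size class."""
    counts = Counter(size_class(n) for n in lengths)
    return {label: counts[label] / len(lengths) for label in SIZE_CLASSES}


# Hypothetical lengths: 28 short, 8 medium and 4 long indels (40 in total).
lengths = [1] * 20 + [2] * 5 + [4] * 3 + [5] * 4 + [8] * 4 + [12, 30, 34, 250]
distribution = size_distribution(lengths)
```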
Morales et al. (2004) searched for indels in 34 ESTs between two distantly related
melon genotypes. On average 1 indel was found per 1,666 bp. No indel was found
inside the coding region. The indel length ranged from 1 to 13 bp, with single bp indels
being the most frequent. This indel frequency was higher than in the total A. thaliana
genome, in which one indel per 6.6 kb was observed (Jander et al. 2002). However, these
data are not directly comparable to the melon study as both coding and non-coding
regions were used in the A. thaliana study. Ching et al. (2002) examined the frequency
and distributions of polymorphisms at 18 maize genes in 36 maize inbreds. Indels were
found to be frequent in non-coding regions (1/85 bp) but rare in coding sequences.
In the genome wide polymorphism database of rice, using cultivars Nipponbare
(japonica) and 93-11 (indica) (Shen et al. 2004), 479,406 indels were detected. This
corresponds to approximately 1 indel per 953 bp in the rice genome. This indel frequency
is higher than that observed in a similar study of the rice subspecies indica and japonica
sequence by Feltus et al. (2004), who found approx. 0.11 indels/kb. However, due to the
stringent sequence filtering performed in the latter study, the result probably
underestimates indel frequency in rice.
A total of 23 indels were identified between 16 barley genotypes in the 2,164 bp of
Isa gene sequence (Bundock and Henry 2004), a measure of 1 indel per 94 bp. Four of
these indels were within a microsatellite region and were excluded. Of the remaining 19
indels, 9 were 1 bp in length and the others ranged from 4 to 306 bp, giving an average
frequency of 1 indel per 114 bp.
SNPs are individual nucleotide base differences between DNA sequences and can
represent differences between individuals or within populations. The specific base
difference is determined by the cause of mutation and is non-random, with C to T
transitions being the most frequent form. Insertion/deletion events (indels) are a special
form of SNP caused by the addition or removal of DNA sequence, resulting in both
length and sequence polymorphisms. The frequency of SNPs is dependent on both their
generation and selection in populations. SNPs are generally evolutionarily neutral, with
frequencies varying due to random genetic drift. Some SNPs, particularly those
associated with expressed genes, may be under positive or negative evolutionary
selection pressure and will be maintained or rapidly removed from populations
(Przeworski 2002; Bamshad and Wooding 2003). SNPs not separated by recombination
at meiosis and thus in LD with other SNPs will be inherited as a linkage block and thus
maintained at a frequency determined by the cumulative selection pressure of the
haplotypic group. SNPs and indels are valuable molecular genetic markers due to both
their abundance and relative stability in the genome, and can be applied as perfect
molecular markers when identified within genes underlying observed traits.
3.8 REFERENCES
Bamshad, M., Wooding, S.P., 2003, Signatures of natural selection in the human genome. Nat. Rev. Genet. 4:
99–111.
Batley, J., Barker, G., O’Sullivan, H., Edwards, K.J., Edwards, D., 2003, Mining for single nucleotide
polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol. 132: 84–91.
Bertin, I., Zhu, J.H., Gale, M.D., 2005, SSCP-SNP in pearl millet – a new marker system for comparative
genetics. Theor. Appl. Genet. 110: 1467–1472.
Bhattramakki, D., Dolan, M., Hanafey, M., Wineland, R., Vaske, D., Register, J.C. III, Tingey, S.V., Rafalski,
A., 2002, Insertion–deletion polymorphisms in 3' regions of maize genes occur frequently and can be used
as highly informative genetic markers. Plant Mol. Biol. 48: 539–547.
Bundock, P.C., Henry, R.J., 2004, Single nucleotide polymorphism, haplotype diversity and recombination in
the Isa gene of barley. Theor. Appl. Genet. 109: 543–551.
Ching, A., Caldwell, K.S., Jung, M., Dolan, M., Smith, O.S., Tingey, S., Morgante, M., Rafalski, A.J., 2002,
SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines. BMC Genet.
3: 1–14.
Cho, R.J., Mindrinos, M., Richards, D.R., Sapolsky, R.J., Anderson, M., Drenkard, E., Dewdney, J., Reuber,
T.L., Stammers, M., Federspiel, N., Theologis, A., Yang, W.H., Hubbell, E., Au, M., Chung, E.Y.,
Lashkari, D., Lemieux, B., Dean, C., Lipshutz, R.J., Ausubel, F.M., Davis, R.W., Oefner, P.J., 1999,
Genome wide mapping with biallelic markers in Arabidopsis thaliana. Nat. Genet. 23: 203–207.
Coles, N.D., Coleman, C.E., Christensen, S.A., Jellen, E.N., Stevens, M.R., Bonifacio, A., Rojas-Beltran, J.A.,
Fairbanks, D.J., Maughan, P.J., 2005, Development and use of an expressed sequenced tag library in
quinoa (Chenopodium quinoa Willd.) for the discovery of single nucleotide polymorphisms. Plant Sci.
168: 439–447.
Coulondre, C., Miller, J.H., Farabaugh, P.J., Gilbert, W., 1978, Molecular basis of base substitution hot spots in
Escherichia coli. Nature 274: 775–780.
Cronk, Q.C.B., 2005, Plant eco-devo: the potential of poplar as a model organism. New Phytol. 166: 39–48.
Deutsch, S., Iseli, C., Bucher, P., Antonarakis, S.E., Scott, H.S., 2001, A cSNP map and database for human
chromosome 21. Genome Res. 11: 300–307.
Dvornyk, V., Sirviö, A., Mikkonen, M., Savolainen, O., 2002, Low nucleotide diversity at the pal1 locus in the
widely distributed Pinus sylvestris. Mol. Biol. Evol. 19: 179–188.
Feltus, F.A., Wan, J., Schulze, S.R., Estill, J.C., Jiang, N., Paterson, A.H., 2004, An SNP resource for rice
genetics and breeding based on subspecies Indica and Japonica genome alignments. Genome Res. 14:
1812–1819.
Garg, K., Green, P., Nickerson, D.A., 1999, Identification of candidate coding region single nucleotide
polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Res. 9: 1087–
1092.
Grivet, L., Glaszmann, J.-C., Vincentz, M., da Silva, F., Arruda, P., 2003, ESTs as a source for sequence
polymorphism discovery in sugarcane: example of Adh genes. Theor. Appl. Genet. 106: 190–197.
Hayashi, K., Hashimoto, N., Daigen, M., Ashikawa, I., 2004, Development of PCR-based SNP markers for rice
blast resistance genes at the Piz locus. Theor. Appl. Genet. 108: 1212–1220.
Jander, G., Norris, S.R., Rounsley, S.D., Bush, D.F., Levin, I.M., Last, R.L., 2002, Arabidopsis map based
cloning in the post genome era. Plant Physiol. 129: 440–450.
Kim, M.Y., Van, K., Lestari, P., Moon, J.-K., Lee, S.-H., 2005, SNP identification and SNAP marker
development for a GmNARK gene controlling supernodulation in soybean. Theor. Appl. Genet. 110:
1003–1010.
Kota, R., Varshney, R.K., Thiel, T., Dehmer, K.J., Graner, A., 2001, Generation and comparison of EST
derived SSRs and SNPs in barley (Hordeum vulgare L.). Hereditas 135: 145–151.
Kuang, H., Woo, S.-S., Meyers, B.C., Nevo, E., Michelmore, R.W., 2004, Multiple genetic processes result in
heterogeneous rates of evolution within the major cluster disease resistance genes in lettuce. Plant Cell 16:
2870–2894.
Lopez, C., Piegu, B., Cooke, R., Delseny, M., Tohme, J., Verdier, V., 2005, Using cDNA and genomic
sequences as tools to develop SNP strategies in cassava (Manihot esculenta Crantz). Theor. Appl. Genet.
110: 425–431.
Mogg, R., Batley, J., Hanley, S., Edwards, D., O’Sullivan, H., Edwards, K.J., 2002, Characterisation of the
flanking regions of Zea Mays microsatellites reveals a large number of useful sequence polymorphisms.
Theor. Appl. Genet. 105: 532–543.
Morales, M., Roig, E., Monforte, A.J., Arús, P., Garcia-Mas, J., 2004, Single-nucleotide polymorphisms
detected in expressed sequence tags of melon (Cucumis melo L.). Genome 47: 352–360.
Nasu, S., Suzuki, J., Ohta, R., Hasegawa, K., Yui, R., Kitazawa, N., Monna, L., Minobe, Y., 2002, Search for
and analysis of single nucleotide polymorphisms (SNPs) in rice (Oryza sativa, Oryza rufipogon) and
establishment of SNP markers. DNA Research 9: 163–171.
Neale, D.B., Savolainen, O., 2004, Association genetics of complex traits in conifers. Trends Plant Sci. 9: 325–
330.
Picoult-Newberg, L., Ideker, T.E., Pohl, M.G., Taylor, S.L., Donaldson, M.A., Nickerson, D.A., Boyce-Jacino,
M., 1999, Mining SNPs from EST databases. Genome Res. 9: 167–174.
Przeworski, M., 2002, The signature of positive selection at randomly chosen loci. Genetics 160: 1179–1189.
Rafalski, J.A., 2002a, Novel genetic mapping tools in plants: SNPs and LD-based approaches. Plant Sci. 162:
329–333.
Rafalski, J.A., 2002b, Applications of single nucleotide polymorphisms in crop genetics. Current Opin. Plant
Biol. 5: 94–100.
Rafalski, A., Morgante, M., 2004, Corn and humans: recombination and linkage disequilibrium in two genomes
of similar size. Trends Genet. 20: 103–111.
Russell, J., Booth, A., Fuller, J., Harrower, B., Hedley, P., Machray, G., Powell, W., 2004, A comparison of
sequence-based polymorphism and haplotype content in transcribed and anonymous regions of the barley
genome. Genome 47: 389–398.
Schmid, K.J., Rosleff Sörensen, T., Stracke, R., Törjék, O., Altmann, T., Mitchell-Olds, T., Weisshaar, B.,
2003, Large-scale identification and analysis of genome wide single nucleotide polymorphisms for
mapping in Arabidopsis thaliana. Genome Res. 13: 1250–1257.
Schneider, K., Weisshaar, B., Borchardt, D.C., Salamini, F., 2001, SNP frequency and allelic haplotype
structure of Beta vulgaris expressed genes. Mol. Breed. 8: 63–74.
Shen, Y.-J., Jiang, H., Jin, J.-P., Zhang, Z.-B., Xi, B., He, Y.-Y., Wang, G., Wang, C., Qian, L., Li, X., Yu, Q.-
B., Liu, H.-J., Chen, D.-H., Gao, J.-H., Huang, H., Shi, T.-L., Yang, Z.-N., 2004, Development of genome-
wide DNA polymorphism database for map-based cloning of rice genes. Plant Physiol. 135: 1198–1205.
Soleimani, V.D., Baum, B.R., Johnson, D.A., 2003, Efficient validation of single nucleotide polymorphisms in
plants by allele specific PCR, with an example from barley. Plant Mol. Biol. Rep. 21: 281–288.
Sutton, W.D., Gerlach, W.L., Schwartz, D., Peacock, W.J., 1984, Molecular analysis of Ds controlling element
mutations at the Adh1 locus of maize. Science 223: 1265–1268.
Syvanen, A.C., 2001, Genotyping single nucleotide polymorphisms. Nat. Rev. Genet. 2: 930–942.
Tenaillon, M.I., Sawkins, M.C., Anderson, L.K., Stack, S.M., Doebley, J., Gaut, B.S., 2002, Patterns of
diversity and recombination along Chromosome 1 of maize (Zea mays ssp. mays L.). Genetics 162: 1401–
1413.
Wittle, C.-P., Tiller, S., Isidore, E., Davies, H.V., Taylor, M.A., 2005, Analysis of two alleles of the urease gene
from potato: polymorphisms, expression and extensive alternative splicing of the corresponding mRNA. J.
Exp. Botany 56: 91–99.
Yu, J., Hu, S., Wang, J., Wong, G.K., Li, S., Liu, B., Deng, Y., Dai, L., Zhou, Y., Zhang, X., Cao, M.,
Liu, J., Sun, J., Tang, J., Chen, Y., Huang, X., Lin, W., Ye, C., Tong, W., Cong, L., Geng, J., Han, Y., Li,
L., Li, W., Hu, G., Huang, X., Li, W., Li, J., Liu, Z., Li, L., Liu, J., Qi, Q., Liu, J., Li, L., Li, T.,
Wang, X., Lu, H., Wu, T., Zhu, M., Ni, P., Han, H., Dong, W., Ren, X., Feng, X., Cui, P., Li, X., Wang,
H., Xu, X., Zhai, W., Xu, Z., Zhang, J., He, S., Zhang, J., Xu, J., Zhang, K., Zheng, X., Dong, J., Zeng,
W., Tao, L., Ye, J., Tan, J., Ren, X., Chen, X., He, J., Liu, D., Tian, W., Tian, C., Xia, H., Bao, Q., Li,
G., Gao, H., Cao, T., Wang, J., Zhao, W., Li, P., Chen, W., Wang, X., Zhang, Y., Hu, J., Wang, J., Liu, S.,
Yang, J., Zhang, G., Xiong, Y., Li, Z., Mao, L., Zhou, C., Zhu, Z., Chen, R., Hao, B., Zheng, W., Chen, S.,
Guo, W., Li, G., Liu, S., Tao, M., Wang, J., Zhu, L., Yuan, L., Yang, H., 2002, A draft sequence of the
rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92.
Yu, J., Wang, J., Lin, W., Li, S., Li, H., Zhou, J., Ni, P., Dong, W., Hu, S., Zeng, C., Zhang, J., Zhang, Y., Li,
R., Xu, Z., Li, S., Li, X., Zheng, H., Cong, L., Lin, L., Yin, J., Geng, J., Li, G., Shi, J., Liu, J., Lv, H., Li,
J., Wang, J., Deng, Y., Ran, L., Shi, X., Wang, X., Wu, Q., Li, C., Ren, X., Wang, J., Wang, X., Li, D.,
Liu, D., Zhang, X., Ji, Z., Zhao, W., Sun, Y., Zhang, Z., Bao, J., Han, Y., Dong, L., Ji, J., Chen, P., Wu, S.,
Liu, J., Xiao, Y., Bu, D., Tan, J., Yang, L., Ye, C., Zhang, J., Xu, J., Zhou, Y., Yu, Y., Zhang, B., Zhuang,
S., Wei, H., Liu, B., Lei, M., Yu, H., Li, Y., Xu, H., Wei, S., He, X., Fang, L., Zhang, Z., Zhang, Y.,
Huang, X., Su, Z., Tong, W., Li, J., Tong, Z., Li, S., Ye, J., Wang, L., Fang, L., Lei, T., Chen, C., Chen,
H., Xu, Z., Li, H., Huang, H., Zhang, F., Xu, H., Li, N., Zhao, C., Li, S., Dong, L., Huang, Y., Li, L., Xi,
Y., Qi, Q., Li, W., Zhang, B., Hu, W., Zhang, Y., Tian, X., Jiao, Y., Liang, X., Jin, J., Gao, L., Zheng, W.,
Hao, B., Liu, S., Wang, W., Yuan, L., Cao, M., McDermott, J., Samudrala, R., Wang, J., Wong, G.K.-S.,
Yang, H., 2005, The genomes of Oryza sativa: A history of duplications. PLoS Biology 3: 0266-0281.
Zhu, Y.L., Song, Q.J., Hyten, D.L., van Tassell, C.P., Matukumalli, L.K., Grimm, D.R., Hyatt, S.M., Fickus,
E.W., Young, N.D., Cregan, P.B., 2003, Single nucleotide polymorphisms in soybean. Genetics 163:
1123–1134.
Chapter 4
SINGLE NUCLEOTIDE POLYMORPHISM
DISCOVERY
David Edwards¹, John W. Forster¹, Noel O.I. Cogan¹, Jacqueline Batley¹, and
David Chagné²
¹ Primary Industries Research Victoria, Victorian AgriBiosciences Centre, La Trobe R&D Park, Bundoora,
VIC 3083, Australia
² HortResearch, Plant Gene Mapping group, Private Bag 11030, Palmerston North, New Zealand
4.1 INTRODUCTION
The first techniques to be presented do not require the
production of a large number of sequences to discover novel SNPs. These techniques are
very popular because they are inexpensive and can be applied
in any molecular biology laboratory, since they use common reagents and equipment.
Chronologically, the first method to be used for DNA polymorphism detection was
restriction fragment length polymorphism (RFLP) (Botstein et al. 1980). This method
was used successfully to detect point mutations occurring at restriction sites and was
employed for mapping in a number of plant species (Keim et al. 1990). Nevertheless, this
method is now rarely applied due to its technical limitations: it is labor-intensive and
requires large quantities of DNA. The next generation of molecular markers was based
on the use of the polymerase chain reaction (PCR) technique. The first PCR-based
marker, cleaved amplified polymorphic sequence (CAPS) (Konieczny and Ausubel 1993),
is comparable to RFLP since it is based on the digestion of PCR-amplified fragments with a
restriction endonuclease. The main drawback of CAPS is that, as with RFLPs, the SNP
must occur within a restriction site. This restricts its use to a small minority of
polymorphisms. To circumvent this problem, Neff et al. (1998) developed the dCAPS
method (“derived” CAPS) where a restriction site can be created through the addition of a
mismatch in a PCR primer located close to the SNP. In addition, the authors created a
simple software system called dCAPS Finder (http://helix.wustl.edu/dcaps/dcaps.html),
which facilitates the design of dCAPS markers. Although CAPS and dCAPS methods can
be applied for genotyping SNPs in a relatively inexpensive way, these methods remain of
low efficiency for SNP discovery. Indeed, a large number of restriction enzymes must be
tested to find polymorphisms.
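The requirement that a SNP fall within a restriction site can be illustrated with a short sketch that tests whether an enzyme's recognition sequence is present in one allele but not the other. The two allele sequences are hypothetical; EcoRI (GAATTC) is used as the example enzyme, and the function names are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def has_site(sequence, site):
    """True if the recognition site occurs on either strand of the sequence."""
    sequence, site = sequence.upper(), site.upper()
    return site in sequence or reverse_complement(site) in sequence


def caps_informative(allele_a, allele_b, site):
    """True if exactly one allele carries the site, so digestion distinguishes them."""
    return has_site(allele_a, site) != has_site(allele_b, site)


ECORI = "GAATTC"  # EcoRI recognition sequence

# Hypothetical alleles differing by one A/G SNP inside the EcoRI site.
allele_1 = "TTAGGAATTCCGAT"
allele_2 = "TTAGGAGTTCCGAT"
informative = caps_informative(allele_1, allele_2, ECORI)
```

Repeating such a test over a panel of enzymes shows why SNP discovery by CAPS is inefficient: most SNPs will not alter any of the recognition sequences tested.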
gel electrophoresis. This method has been employed predominantly in human genetics,
though SSCP has also been applied to detect SNPs in several plant groups, including cereals (Martins-Lopes
et al. 2001; Sato and Nishio 2003), forest trees (Plomion et al. 1999), horticultural
trees (Etienne et al. 2002) and other crops (Hongtrakul et al. 1998; McCallum et al.
2001).
SNP detection can be based on resolving heteroduplex (i.e., mismatched hybridization
between complementary DNA strands) from homoduplex (i.e., perfect
hybridization) DNA fragments. Heteroduplexes can be formed during a heating/slow
cooling procedure (Figure 4.1) and subsequently separated from homoduplex
sequences by polyacrylamide gel electrophoresis (Hauser et al. 1998).
Heteroduplexes usually migrate slower than homoduplexes during electrophoresis due to
the presence of mismatched base pairing. No sequence knowledge is needed prior to
using this technique, which makes it suitable for SNP discovery in heterozygous
individuals or pooled DNA.
Figure 4.1. Nonsequencing SNP discovery methods: heteroduplex analysis, TILLING, DGGE, and SSCP.
(see color plate)
D/TGGE, SSCP, and heteroduplex analysis are readily applicable in any molecular
biology laboratory, as they do not require sophisticated equipment. The major
drawbacks of these techniques, despite their relatively high efficiency in detecting SNPs,
are their low-to-medium throughput, due to the use of polyacrylamide slab gels that require
long migration times, and their reliance on ethidium bromide or silver staining methods. Indeed,
these staining methods do not permit multiplexing and require the use of potentially
hazardous chemicals. The speed and efficiency of SSCP and TGGE can be increased by
the use of capillary electrophoresis systems (i.e., automated sequencers) and can also be
multiplexed by the application of fluorescent dye labels (Hebenbrock et al. 1995; Inazuka
et al. 1997). As an example, Hsia et al. (2005) reported the use of temperature gradient
capillary electrophoresis (TGCE), the capillary electrophoresis equivalent of TGGE, to
detect SNPs in maize without prior knowledge of the polymorphic sequences. In addition,
Kuhn et al. (2005) demonstrated the application of capillary electrophoresis-based SSCP
in cocoa. However, the fact that the position and type of polymorphism are unknown
when using D/TGGE, SSCP, or their capillary electrophoresis equivalents makes them
less attractive to researchers who want to survey the position and nature of the
polymorphism that may be associated with a trait variation.
As has been reported in humans (Giordano et al. 1999), denaturing high-performance
liquid chromatography (dHPLC) can also be used for detecting SNPs by
heteroduplex analysis. dHPLC does not require gel-based genotyping procedures and is
considered more accurate than polyacrylamide gel-based methods. DNA fragments are
amplified by PCR, denatured by heating, slowly cooled, and run through
chromatographic columns at different temperatures. Because dHPLC assays can be
performed in a relatively short time (5–15 min), are compatible with automation and do
not require DNA resequencing, this method can provide an efficient means for relatively
high-throughput SNP discovery and genotyping in plants (Kota et al. 2001).
4.2.4 TILLING
One of the major drawbacks for SNP discovery (and scoring) is the requirement for
the PCR technique to reduce genome complexity. This is particularly problematic
given that plants often have complex genomes. An ideal method for SNP discovery
would be to scan the complete genome in a single reaction. However, only a few methods
that do not rely on PCR have been described to date. One particular method applies DNA
chip technology to identify sequence polymorphisms. Borevitz et al. (2003) used a
microarray, originally developed for gene expression studies, to identify new
polymorphisms between Arabidopsis thaliana accessions (Col and Ler). The difference
in intensity between hybridization experiments was compared using similar statistical
analysis as used for expression data. The authors demonstrated that the method could
efficiently detect known polymorphisms and that detection is more efficient where the
variation is close to the central base of the oligonucleotide feature. Array-based discovery
methods may represent the future for SNP discovery in particular cases where sequence
information is great enough for gene array development but where there is not enough
sequence information from different individuals to predict polymorphisms. This is
particularly the case for species in which large EST databases have been generated from a
small number of genotypes.
4.3.2 MassArray
MassArray technology (Lau et al. 2000; Rodi et al. 2002) is based on the utilization
of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-
TOF MS). MALDI-TOF MS can detect differences between DNA fragments based on
their molecular weight. As the four nucleotides that make up DNA differ in
molecular weight, this system is able to detect a single base variation in a PCR-amplified
DNA fragment. The homogenous MassCleave (hMC) assay (Mattocks et al. 2004) is part
of the MassArray platform developed by Sequenom (Sequenom, San Diego, CA, USA)
and is suitable for SNP discovery. The principle of hMC is the following (Figure 4.3):
PCR fragments between 300 and 700 bp in length are cleaved using an enzyme cutting at
specific bases. Products are then run on a MALDI-TOF MS. The mass spectra obtained
for the four cleavage reactions are compared to the theoretical spectra that were inferred
using reference sequences (i.e., the ones used for PCR design), or compared between
different DNA pools or individuals. Differences between spectra can be due to sequence
variations, the introduction or removal of a cleavage site, or mass shift due to the
presence of an INDEL. The hMC method can be automated and multiplexed, which
makes it a suitable method for high-throughput SNP discovery. However, to date there
are no examples of the use of the hMC technique for SNP discovery in plants,
though the technique has been applied successfully in human genetics (Lau
et al. 2000; O’Donnell et al. 1997).
Figure 4.3. Homogenous MassCleave: principle. A PCR product is amplified, treated with SAP, and then
in vitro transcribed. The transcription of the PCR product in RNA permits the base-specific cleavage using
RNAse A. The resulting cleavage products are run on a MALDI-TOF MS, which generates a signal based on
the fragment masses.
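The comparison of observed and reference cleavage patterns described for the hMC assay can be illustrated with a simplified sketch. It is not the Sequenom software: it models a single base-specific reaction rather than four, the transcript sequences are hypothetical, and each fragment is represented by its base composition rather than a computed mass (a fragment's mass is determined by its composition, so composition serves as a stand-in here).

```python
from collections import Counter


def cleave_after(seq: str, base: str) -> list[str]:
    """Cut `seq` on the 3' side of every occurrence of `base`."""
    fragments, current = [], ""
    for nt in seq:
        current += nt
        if nt == base:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments


def composition_spectrum(seq: str, base: str) -> Counter:
    """Multiset of fragment compositions (sorted bases), standing in for masses."""
    return Counter("".join(sorted(f)) for f in cleave_after(seq, base))


def spectrum_difference(reference: str, sample: str, base: str) -> dict:
    """Signals expected from the reference but absent, and signals that are new."""
    ref = composition_spectrum(reference, base)
    obs = composition_spectrum(sample, base)
    return {"missing": ref - obs, "new": obs - ref}


# Hypothetical RNA transcripts; the sample carries C>U at position 9.
reference = "GGACUAGGCUAAGC"
sample = "GGACUAGGUUAAGC"
diff = spectrum_difference(reference, sample, "C")
print(diff)
```

In this toy case the substitution removes a cleavage site, so two reference fragments disappear and one longer fragment appears, which is the kind of spectral difference the text attributes to the introduction or removal of a cleavage site.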
particularly efficient method for gene-length SNP discovery. Access to full genomic
sequences, such as those derived from large insert DNA libraries, permits direct primer
design to upstream and downstream gene control elements and intragenic introns. Intron
sequences may also be accessed by primer design to flanking exonic regions in cDNA
sequences. However, without prior knowledge of intron–exon structure, recovery of intron
sequences may be incomplete, due either to inadvertent primer design across intron–exon
splice junctions, or to inefficient amplification of large introns. Certain genic regions
are anticipated to show higher levels of SNP variation and may be preferentially
targeted. These include the UTRs and introns, compared to exonic regions. As 3′-UTRs
are frequently more extensive than their 5′-located counterparts, a number of studies have
targeted these regions specifically.
Generation of PCR amplicons may be followed either by direct sequencing using one
of the LAPs, or by cloning into a plasmid vector followed by clone-specific sequencing
using a universal primer. The choice between direct sequencing and cloned amplicon se-
quencing is governed by a number of technical, statistical, and logistical considerations,
and is highly influenced by the breeding system of the organism in question, as well as
(in specialized cases such as conifer megagametophytes) by the ploidy level of the tissue
used for SNP discovery.
Technical considerations include the efficiency and accuracy with which heterozygous
SNPs may be identified by direct sequencing of an amplicon mixture derived from two
(for diploid outbred genotypes) or more (for autopolyploid outbred genotypes) distinct
haplotypes; the confounding effects of heterozygous indels, which produce overlapping
phase shifts under conditions of direct sequencing; and similar effects arising from inad-
vertent amplification from multiple paralogous sequences.
Statistical considerations apply to the optimum number of cloned sequences selected
for sequence analysis prior to alignment, based on expectations of allelic proportions, as
well as the potential biasing effects of allele-specific PCR competition and paralogous
sequence structure. The potential error rate associated with in vitro base substitution by
thermostable polymerases must also be considered, as potential spurious SNPs may be
generated in individual cloned sequences. This rate has been estimated at c. 1 in 10³
bases replicated (Palumbi and Baker 1994), sufficiently high to require multiple clone
sequencing for each allelic variant. In addition, cloned amplicon sequencing is prohibitive
for large numbers of distinct genotypes, requiring appropriate experimental design in
order to provide data of value across the broader germplasm pool of the target species.
These issues are related to the logistical considerations, as amplicon cloning and sequenc-
ing is costly, laborious and time-consuming, especially during the process of manual se-
quence alignment.
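The need for multiple clones per allelic variant follows from simple arithmetic. The sketch below assumes a per-base substitution probability of 1 in 1,000 in the final cloned sequence, errors that are independent between sites and between clones, and substitutions that are equally likely to produce any of the three alternative bases; these are simplifying assumptions, and the 600 bp amplicon length is chosen only for illustration.

```python
def expected_errors(length_bp: int, error_rate: float = 1e-3) -> float:
    """Expected number of polymerase-induced substitutions in one clone."""
    return length_bp * error_rate


def prob_clone_has_error(length_bp: int, error_rate: float = 1e-3) -> float:
    """Probability that a clone carries at least one spurious substitution."""
    return 1.0 - (1.0 - error_rate) ** length_bp


def prob_same_error_in_two_clones(error_rate: float = 1e-3) -> float:
    """Probability that two independent clones show the same substitution
    at one given site (both mutate, and to the same one of three bases)."""
    return error_rate * error_rate / 3.0


amplicon = 600  # hypothetical amplicon length in bp
print(expected_errors(amplicon))
print(round(prob_clone_has_error(amplicon), 3))
print(prob_same_error_in_two_clones())
```

Under these assumptions nearly half of all 600 bp clones would carry at least one artifactual substitution, whereas the same artifact recurring at a given site in a second independent clone is very unlikely, which is why a variant seen in two or more clones is far more credible.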
and Henry 2004; Russell et al. 2004), pearl millet (Pennisetum glaucum L.) (Gaut and
Clegg 1993), and rice (Oryza sativa L.) (Bradbury et al. 2005; Hayashi et al. 2004; Jin
et al. 2003).
For maize, the issue of paralogous sequences was minimized by predominant (but not
exclusive) targeting of the 3′-ends of ESTs (Bhattramakki et al. 2002; Bhattramakki and
Rafalski 2001; Ching et al. 2002). A similar approach was taken to discriminate between
members of the cytochrome P450 gene family (Bundock et al. 2003), and for genes asso-
ciated with grain germination in barley (Russell et al. 2004). For wheat, genome-specific
primers were designed using pre-existing information on substitutions and indels in genes
encoding ADP-glucose pyrophosphorylase and granule-bound starch synthase, and the
specificity of amplification was determined through testing on nullisomic–tetrasomic
(NT) substitution lines which permit discrimination between homoeologous gene se-
quences (Caldwell et al. 2004). In soybean, PCR products derived from a single standard
genotype were pre-screened by gel electrophoresis in order to identify those primer sets
that appeared to produce a single product, while those producing no or weak amplifica-
tion, or multiple products, were discarded. In addition, sequencing from both ends with
each amplification primer was used as necessary for additional quality control. Nonethe-
less, c. 20% of amplicons sequenced produced data attributable to heterogeneous tem-
plate, demonstrating the importance of the paralogy problem (Zhu et al. 2003).
SNP variation between sequences from different homozygous genotypes was assessed
visually (Ching et al. 2002), using the Phred/Phrap suite (Bhattramakki and Rafalski
2001; Ewing and Green 1998; Ewing et al. 1998), using Sequencher™ (Gene Codes, Ann
Arbor, MI, USA; Bundock and Henry 2004; Caldwell et al. 2004; Mogg
et al. 2002) or the PolyBayes SNP detection software (Zhu et al. 2003).
These and other studies have permitted estimates of SNP incidence over different
germplasm samples. A comparison of genome sequences between two different acces-
sions of A. thaliana predicted an SNP frequency of 1 per 6.6 kb (Jander et al. 2002). For
the A. thaliana CRY2 gene, comparison of a 3.2-kb region containing the entire transcrip-
tional unit as well as over 1 kb of upstream and downstream sequences across 32 eco-
types revealed 90 SNPs and 12 indels, corresponding to frequencies of 1 per 36 bp and 1
per 267 bp, respectively. In soybean, comparison was based on 25 genotypes, of which
14 were estimated to have contributed 80.5% of allelic diversity present in North Ameri-
can varietal material. Resequencing was performed for 143 amplicons including coding
and noncoding genic sequences selected from a total of 90 full-length genes and 88
cDNAs, as well as intergenic genomic sequences. A total of 280 SNPs were identified
over 76.3 kb of genomic sequence, at a frequency of 1 per 272.5 bp (Zhu et al. 2003). In
barley, a total of 2.7 kb from 23 grain germination-associated genes was resequenced
across a panel of 24 cultivated barley accessions, eight landraces, and eight lines of the
progenitor species H. spontaneum, identifying 1 SNP per 78 bp and 1 indel per 680 bp
(Russell et al. 2004). Although the selection and range of germplasm clearly influence
such estimates, the obligate inbreeding habit and narrow genetic bases typical of such
species generally contribute to low SNP frequency. By contrast, higher values have been
reported for facultative allogamous species such as maize. A study of 22 3′-UTR-targeted
amplicons from 18 genes was performed using 36 diverse maize genotypes,
representing the major US-derived heterotic germplasm groups (Ching et al. 2002).
Across a total of 6.9 kb of genomic sequence, the SNP frequency was 1 per 61 bp and the
indel frequency was 1 per 126 bp. SNP frequency was 1 per 130.5 bp in coding sequence
and 1 per 47.7 bp in noncoding sequence, while the distribution of indels showed a similar
pattern. Further studies based on analysis of several hundred loci across eight inbred
maize lines (Bhattramakki et al. 2002; Bhattramakki and Rafalski 2001) revealed compa-
rable frequencies (1 SNP per 83 bp, 1 indel per 250 bp).
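The frequencies quoted in this section are all of the form "1 variant per N bp", obtained by dividing the total sequence length surveyed by the number of variants found. A short sketch, using only figures given in the text and an illustrative function name, reproduces two of the reported values:

```python
def bp_per_variant(total_bp: float, n_variants: int) -> float:
    """Average number of base pairs surveyed per variant found."""
    if n_variants <= 0:
        raise ValueError("at least one variant is required")
    return total_bp / n_variants


# Soybean: 280 SNPs over 76.3 kb (Zhu et al. 2003).
soybean_snp = bp_per_variant(76_300, 280)

# A. thaliana CRY2: 90 SNPs and 12 indels over a 3.2-kb region.
cry2_snp = bp_per_variant(3_200, 90)
cry2_indel = bp_per_variant(3_200, 12)

print(soybean_snp, round(cry2_snp), round(cry2_indel))
```

Because the denominator is the sequence length surveyed in a particular germplasm sample, such values are comparable between studies only when panel size and diversity are taken into account.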
Direct sequencing has also been applied to obligate outbreeding (allogamous) species
such as potato (Solanum tuberosum L.) (Rickert et al. 2003). BAC library clones contain-
ing sequences similar to nucleotide-binding site and leucine-rich repeat (NBS–LRR) type
pathogen resistance genes were selected for analysis, and PCR amplicons were designed
in candidate genomic regions. Comparative sequence analysis was performed using a
panel of 17 autotetraploid and 11 diploid potato genotypes. A total of 78 amplicons with
a total sequence length of 31 kb were reanalyzed across the germplasm panel. Predicted
heterozygous indels were confirmed by sequencing from the opposite end of the ampli-
con with the second amplification primer, and SNP dosage in heterozygous autotetraploid
combinations (i.e., ABBB, AABB, or AAAB) was estimated from overlapping sequence
peak heights. A total of 1,498 SNPs and 127 indels were identified visually, correspond-
ing to frequencies of 1 per 21 bp and 1 per 243 bp, respectively.
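Estimating SNP dosage from overlapping peak heights amounts to assigning the observed allele fraction to the nearest expected class. The sketch below is an illustration rather than the procedure of Rickert et al. (2003): it assumes that peak height is proportional to allele copy number, which is only approximately true for sequencing traces, and the peak values are hypothetical.

```python
# Expected copies of allele A (out of four) for each heterozygous class.
DOSAGE_CLASSES = {1: "ABBB", 2: "AABB", 3: "AAAB"}


def call_dosage(peak_a: float, peak_b: float) -> str:
    """Assign a heterozygous tetraploid genotype from two peak heights."""
    total = peak_a + peak_b
    if total <= 0:
        raise ValueError("peak heights must sum to a positive value")
    fraction_a = peak_a / total
    # Choose the class whose expected fraction (copies / 4) is closest.
    copies_a = min(DOSAGE_CLASSES, key=lambda k: abs(fraction_a - k / 4))
    return DOSAGE_CLASSES[copies_a]


print(call_dosage(300, 900))
print(call_dosage(520, 480))
print(call_dosage(800, 250))
```

In practice peak heights also depend on sequence context, so calls near the boundary between two classes would need confirmation by an independent genotyping assay.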
In conifers, SNP discovery by resequencing PCR products can be facilitated by the use
of the megagametophyte. The megagametophyte is a haploid endosperm that develops from the
maternal gamete and has nutritive functions for the zygote it surrounds. The advantage of
using the megagametophyte for SNP discovery in conifers is that time-consuming and costly
cloning procedures become unnecessary, given that sequencing reactions can be performed
directly on PCR-amplified fragments to obtain haplotype sequences.
Sequencing several PCR products from the same endosperm gives the haplotype
structure of the mother plant. This information can further be compared with sequenced
PCR products from the diploid embryo to infer the paternal haplotype. The use of the
megagametophyte was very popular in the 1990s for genetic mapping in gymnosperms. This
approach was recently employed for linkage disequilibrium studies using SNPs in Japanese
sugi (Kado et al. 2003), loblolly pine (Brown et al. 2004; Gill et al. 2003), and maritime
and Monterey pine (Pot et al. 2005).
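Inferring the paternal haplotype from the megagametophyte and embryo sequences is a subtraction at each SNP position: the embryo genotype minus the maternal allele leaves the paternal allele. The following sketch uses hypothetical data and an illustrative function name, and assumes error-free, biallelic calls with no missing data.

```python
def infer_paternal_haplotype(maternal: str, embryo: list[str]) -> str:
    """Infer the paternal haplotype at a series of SNP positions.

    maternal: haplotype read from the haploid megagametophyte, one base per SNP.
    embryo:   unphased diploid genotype per SNP, as two-letter strings.
    """
    if len(maternal) != len(embryo):
        raise ValueError("maternal haplotype and embryo genotypes differ in length")
    paternal = []
    for position, (m, genotype) in enumerate(zip(maternal, embryo)):
        alleles = list(genotype)
        if len(alleles) != 2 or m not in alleles:
            raise ValueError(f"inconsistent genotype at SNP {position}")
        alleles.remove(m)          # remove one copy of the maternal allele
        paternal.append(alleles[0])  # what remains came from the pollen parent
    return "".join(paternal)


maternal_haplotype = "ACGT"
embryo_genotypes = ["AG", "CC", "GT", "TT"]
print(infer_paternal_haplotype(maternal_haplotype, embryo_genotypes))
```

The consistency check matters in real data: an embryo genotype that lacks the maternal allele points to a sequencing error, contamination, or a mislabeled sample.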
Despite the labor-intensive nature of amplicon cloning and sequencing (Zhang and
Hewitt 2003), and the possibility of artifactual results due to in vitro recombination of
cloned heteroduplexes (Tang and Unnasch 1995), the method provides a number of sig-
nificant advantages. Linkage phase between contiguous heterozygous SNPs may be un-
ambiguously determined in primary analysis, allowing SNP haplotype structure in the
target region to be determined. In addition, as noted previously, heterozygous indels of
variable length and paralogous sequences may be unambiguously identified.
In animal systems, amplicon cloning and sequencing has been used for species such as
humpback whales (Palumbi and Baker 1994), black tiger prawn (Duda and Palumbi
1999), and turnip moth (LaForest et al. 1999). The results of several analyses in plant
taxa have been published, and numerous studies are currently being performed in for-
estry, horticultural and forage species. In maize, amplicons corresponding to a c. 600 bp
region of the b anthocyanin biosynthesis-regulatory gene were obtained from 18 different
genotypes, including 18 inbred lines and 7 ancestral lines. Cloned amplicons were se-
quenced and aligned using CLUSTAL W (Thompson et al. 1994) to identify SNPs and
indels (Selinger and Chandler 1999). The teosinte-branched1 (tb1) domestication locus
was targeted in cultivated maize and two species of the ancestral grass teosinte (Z. mays
ssp. parviglumis and Z. mays ssp. mexicana) by amplicon cloning in the TOPO TA-cloning
system (Invitrogen, Carlsbad, CA, USA), revealing limited coding sequence
variation, but substantial promoter region divergence (Wang et al. 1999). Interspecific
comparisons have also been made for the alcohol dehydrogenase (Adh) locus of A. lyrata,
which is allogamous, and A. thaliana, which is autogamous. A combination of amplicon
cloning and sequencing with direct sequencing was used for A. lyrata, followed
by alignment using CLUSTAL W (Savolainen et al. 2000).
The genomic complexity of hexaploid bread wheat has been addressed by sequence
analysis of cloned amplicons derived from RFLP probes previously used for genetic map
construction (Bryan et al. 1999). Low levels of SNP variation were detected between
homologous sequences, at c. 1 per 1,000 bp. The majority of amplicons designed against
template wheat sequences amplified at least two distinct products, reflecting potential
homoeolocus and paralocus detection. Differences in the length of PCR products obtained with
specific primers could be exploited to design genome-specific amplicons. Amplicon cloning
has also been used to detect allelic variation in high molecular weight glutenin subunits
from the wheat D-genome progenitor species Aegilops tauschii (Lu et al. 2005).
Amplicons from the nuclear ribosomal DNA internal transcribed spacer (ITS) region
were obtained from individuals of different species of spruce (genus Picea) and cloned in
pGEM vectors in order to identify SNPs capable of distinguishing black spruce (Picea
mariana) and red spruce (Picea rubens) (Germano and Klein 1999). Nucleotide diversity
has also been studied in the European aspen (Populus tremula L.) through analysis of 24
different trees from four different geographical sites (Ingvarsson 2005). Five gene loci
(Adh1, CI-1, GA20ox1, TI-3, and Gapdh) were used for primer design to generate ampli-
cons from each genotype that were directly cloned into the TA-cloning vector pCR2.1
and subsequently individually sequenced. Sequence alignments were performed using
Sequencher, revealing an SNP frequency (across 6.2 kb) of 1 per 60 bp. Similar studies
have been performed for other long-lived woody perennial species such as the silver
birch, Betula pendula (Järvinen et al. 2003). Amplicons from the PISTILLATA (PI)
homologue BpMADS2 gene, spanning c. 2.4 kb, were derived from ten individuals from
each of two Finnish populations. The amplicons were cloned into the pUC18 vector, se-
quenced and aligned to reveal limited haplotype diversity, with two common types in
each population.
Amplicon cloning and sequencing has also been used for SNP discovery in potato, as a
complement to the direct sequencing activities described above. A study of the LRR-
encoding StVe1 resistance gene in potato (Simko et al. 2004) was performed using a
sample set of 30 North American tetraploid cultivars. PCR products (c. 839 bp in length)
were directly cloned using the TOPO TA-cloning system, and a total of 600 cloned frag-
ments (20 per cultivar) were sequenced. The average SNP incidence was 1 per 15 bp, but
the nucleotide diversity was organized into a number of highly distinct haplotypes, of
which three were detected in 97% of the analyzed cultivars. A paralogous sequence of
851 bp in length was also cloned and discriminated from the StVe1 amplicon by Poly-
Bayes analysis, through the presence of 2–6 bp indels.
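The number of clones sequenced per genotype determines how likely it is that every haplotype present is sampled at least once. The source does not state how the figure of 20 clones per cultivar was chosen; the sketch below simply evaluates that sampling depth under the assumptions that a tetraploid carries four distinct haplotypes and that each is amplified and cloned with equal probability (allele-specific PCR competition, discussed earlier, would violate this).

```python
from math import comb


def prob_all_haplotypes_seen(n_clones: int, n_haplotypes: int = 4) -> float:
    """Probability that every haplotype appears in at least one clone,
    by inclusion-exclusion over the haplotypes that could be missed."""
    k = n_haplotypes
    return sum(
        (-1) ** j * comb(k, j) * ((k - j) / k) ** n_clones
        for j in range(k + 1)
    )


for n in (4, 8, 12, 20):
    print(n, round(prob_all_haplotypes_seen(n), 4))
```

Under these assumptions 20 clones recover all four haplotypes about 98.7% of the time; confirming each haplotype in at least two clones, as the polymerase error rate recommends, requires a correspondingly larger sample.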
The Poaceae species perennial ryegrass (Lolium perenne L.) and the Fabaceae species
white clover (Trifolium repens L.) are the most important components of temperate
pastoral agriculture systems, supporting grazing industries for dairy, meat, and wool pro-
duction. The majority of research to date on molecular marker development and validation in
out-crossing pasture species has been based on anonymous genetic markers, such as ge-
nomic DNA-derived simple sequence repeats (SSRs) and amplified fragment length
polymorphisms (AFLPs) (Jones et al. 2002a, b, 2003). The paradigm for marker-assisted
selection (MAS) that was established in autogamous plant species such as tomato, rice,
and wheat involves the use of such markers to construct linkage maps, genetic trait dis-
section through QTL analysis, and selection of linked markers in selection schemes such
as donor–recipient recurrent selection. The obligate outbreeding nature of pasture
grasses and legumes clearly presents major limitations to the ready implementation of
the inbreeding paradigm.
The most obvious solution to such problems is to develop candidate gene-based mark-
ers that show a functional association with the target trait region (Andersen and Lüb-
berstedt 2003). Based on the population biology of perennial ryegrass and white clover
(outbreeding with relatively large effective population sizes, at least for ecotypic popula-
tions), linkage disequilibrium (LD) is expected to extend over relatively short molecular
distances. In this instance, it should be possible to identify diagnostic variants for the se-
lection of individual parental genotypes on the basis of superior allele content. This will
allow more efficient use of germplasm collections for parental selection. In addition, such
“perfect” markers will allow highly effective progeny selection (Forster et al.
2004).
Large-scale gene sequence collections which have been generated by both incremental
and EST discovery in perennial ryegrass and white clover provide the resource for func-
tionally associated marker development, with c. 15,000 unigenes currently defined for
each species (Sawbridge et al. 2003a, b). Selected genes have already been mapped as
gene-associated RFLP and SSR loci (Barrett et al. 2004; Faville et al. 2005). RFLP
markers are not readily implemented in molecular breeding, and SSRs are only present in
a subset (generally less than 10%) of target genes. However, genic SNP markers can in
principle be developed for any gene, and show the benefits of locus-specificity, high data
fidelity, and high-throughput analysis. The experimental method for SNP discovery is
based on cloning and sequencing of gene-specific amplicons from the heterozygous par-
ents of two-way pseudo-testcross mapping families. The putative SNPs are then validated
in the progeny set, and cross-validated in other sibships and diverse germplasm.
In perennial ryegrass, which is a diploid species (2n = 2x = 14), the in vitro
gene-associated SNP discovery process has been based on a three-part strategy. The
“fast-track” component involves short ESTs, providing single SNP loci for structured map
enhancement; “medium-track” involves full-length cDNAs, providing several SNP loci and
partial haplotypic data; and “slow-track” is based on full-length genes with intron–exon
structure, providing multiple SNP loci and determination of complete haplotype struc-
tures. Such data may be used to determine the extent of linkage disequilibrium and stabil-
ity of gene-length SNP haplotypes, and to test for causal correlation between genotypic
diversity and corresponding variation for related agronomic traits (Cogan et al. 2006).
“Proof-of-concept” for the in vitro discovery process was obtained with the perennial
ryegrass LpASRa2 gene. The Asr gene family encodes a group of proteins that are tran-
scriptionally induced by ABA treatment and water stress, and during fruit ripening. Os-
motic and saline stress leads to up-regulation of the rice gene (Vaidyanathan et al. 1999),
and the maize Zm-Asr1 gene co-locates with QTLs for traits responsive to mild water
stress (Jeanneau et al. 2002). LpASRa2 consequently provides an excellent candidate for
[Figure 4.4. Panel A: genetic linkage map of NA6 LG4. Panel B: LpASRa2 haplotypes, tabulated below. Gene regions labeled in panel B: exon, intron, exon, 3′-UTR.]

Coordinate   108  132  136  183  213  303.10  474  555  652   Haplotype
NA6  1       T    A    G    G    G    G       T    T    A     1
NA6  2       C    G    C    A    G    C       C    G    A     2
AU6  1       C    G    C    G    C    C       C    G    G     3
AU6  2       C    G    C    G    C    C       C    G    A     4

Predicted translation products shown in panel B: Phe Glu Glu Glu Glu His Pro; Phe Glu Gln Glu Asp His Pro.
Figure 4.4. (A) Genetic linkage map of LG4 of the NA6 parental map, showing the SNP loci (indicated as
xlpasra2.coordinate number) in close linkage with the corresponding RFLP locus (xlpasra2). (B) LpASRa2
haplotype structures within and between the NA6 and AU6 parental genotypes. Arrows show putative mutational
changes between members of the second haplogroup (haplotypes 2–4), and predicted translation products of
exon-located SNP loci are indicated.
progenitors (Badr et al. 2002; Chen and Gibson 1970a, b, 1971; Ellison et al. 2006).
The structure of the EST–SSR based genetic map (Barrett et al. 2004) reveals homoeologous
relationships between eight pairs of linkage groups. In silico alignment of SSR-containing
EST sequences with whole genome sequence from model legume species has permitted
comparisons between the subgenome structures of white clover and chromosome structure
in barrel medic (Medicago truncatula Gaertn.).
Close to 50 white clover cDNAs have been selected from public databases, including
the cyanogenesis-associated linamarase gene, TrLIN, and from a unigene resource
(Sawbridge et al. 2003b), including genes for flavonoid biosynthesis (relevant to bloat
safety) and organic acid biosynthesis (relevant to aluminium tolerance and phosphorus
acquisition). Two second generation two-way pseudo-testcross reference mapping fami-
lies, designated F1(Haifa2 × LCL2) and F1(S1846 × LCL6), have provided parental DNA
templates for in vitro SNP discovery. A large proportion of the analyzed genes produced
multiple amplicons which could be assembled into separate contigs by application of high
stringency alignment criteria in Sequencher™. A smaller proportion (< 25%) produced
nonoverlapping contigs under moderately stringent conditions. Although intragenomic
paralogy could account for this effect, in many instances homoeolocus amplification is
probably responsible. For example, two distinct haplogroups obtained using the antho-
cyanidin reductase (banyuls) cDNA (TrBANa) are differentiated by a large indel within
an intron, in addition to multiple coding sequence differences (Figure 4.5). The assignment
of allelic SNPs in each of the putative homoeologues to the second-generation reference
maps will permit clarification of these relationships.
4.5.1 In silico discovery of single nucleotide polymorphisms
Of the methods applied for the discovery of SNPs, the mining of sequence datasets
should provide the cheapest source of abundant SNPs (Buetow et al. 1999; Gu et al.
1998; Picoult-Newberg et al. 1999; Taillon-Miller et al. 1998). Gene discovery and
genome sequencing projects are increasingly considering SNP discovery in the selection
of the starting material for nucleic acid extraction (Jander et al. 2002). With the
development of high-throughput sequencing technology, large amounts of data have been
submitted to the various DNA databases that may be suitable for data mining and SNP
discovery. In particular, EST sequencing programs have provided a wealth of
information, identifying novel genes from a broad range of organisms and providing an
indication of gene expression level in particular tissues (Adams et al. 1995). EST
sequence data may provide the richest source of biologically useful SNPs due to the
relatively high redundancy of gene sequence, the diversity of genotypes represented
within databases, and the fact that each SNP would be associated with an expressed gene
(Picoult-Newberg et al. 1999). Candidate SNPs have been identified and validated from
EST collections from a number of plant species including Arabidopsis (Schmid et al.
2003), barley (Kota et al. 2003), cassava (Lopez et al. 2005), melon (Morales et al. 2004),
pine (Le Dantec et al. 2004), quinoa (Coles et al. 2005), tomato (Yang et al. 2004), and
wheat (Somers et al. 2003). The continuing decrease in the cost of DNA sequencing is
leading to a growing number of whole genome sequencing projects. These data
increasingly enable the identification of SNPs in overlapping genomic sequence and
also through comparison of genomic sequences with EST sequence data (Dawson et al.
2001; Jander et al. 2002; Taillon-Miller et al. 1998). Sequencing technologies continue
[Figure 4.5 labels: TrBANa; indel (89 bp).]
Figure 4.5. Schematic representation of putative homoeolocus structure in the white clover TrBANa gene,
including predicted allelic variation detected by in vitro SNP discovery.
viewed and marked for inspection using Consed (Gordon et al. 1998). More recently, this
approach has been extended to include a Bayesian statistical method. PolyBayes (Marth
et al. 1999) is a fully probabilistic SNP detection algorithm that calculates the probability
that discrepancies at a given location of a multiple alignment represent true sequence
variations as opposed to sequencing errors. The calculation takes into account the
alignment depth, the base calls in each of the sequences, the associated base quality values
(such as generated by the Phred trace analysis program or the Phrap fragment assembler),
the base composition in the region, and the expected a priori polymorphism rate.
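The idea of weighing true variation against sequencing error can be illustrated with a deliberately simplified Bayesian calculation for a single alignment column. This is not the PolyBayes algorithm: it considers only two hypotheses (a monomorphic site, or a site with two equally frequent alleles), ignores alignment depth priors and base composition, and uses a hypothetical prior polymorphism rate. It relies on the standard Phred definition, in which a quality value Q corresponds to an error probability of 10^(-Q/10).

```python
from collections import Counter


def phred_to_error(q: int) -> float:
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def base_likelihood(observed: str, true_base: str, error: float) -> float:
    """P(observed call | true base), with errors spread over three other bases."""
    return 1 - error if observed == true_base else error / 3


def snp_posterior(bases: str, quals: list[int], prior: float = 0.003) -> float:
    """Posterior probability that a column is a two-allele SNP.

    `prior` is an assumed a priori polymorphism rate, not a published value.
    """
    ranked = [b for b, _ in Counter(bases).most_common()]
    if len(ranked) < 2:
        return 0.0  # no discrepancy to evaluate in this toy model
    major, minor = ranked[0], ranked[1]
    like_mono, like_snp = 1.0, 1.0
    for b, q in zip(bases, quals):
        e = phred_to_error(q)
        like_mono *= base_likelihood(b, major, e)
        like_snp *= 0.5 * base_likelihood(b, major, e) \
            + 0.5 * base_likelihood(b, minor, e)
    numerator = prior * like_snp
    return numerator / (numerator + (1 - prior) * like_mono)


print(snp_posterior("AAAGGG", [40] * 6))
print(snp_posterior("AAAAAG", [40, 40, 40, 40, 40, 5]))
```

A discrepancy supported by several high-quality calls receives a posterior close to 1, whereas a single low-quality discrepant call remains far more likely to be an error, which is the behaviour the text attributes to quality-aware SNP detection.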
Where sequence trace files are available, so that polymorphisms in traces of dubious
quality can be filtered out, software such as PolyBayes and
PolyPhred is the most efficient means to differentiate between true SNPs and sequence
error. Unfortunately, complete sequence trace file archives are rarely available for large
sequence datasets collated from a variety of sources. Furthermore, sequence quality-
based SNP discovery does not identify errors in sequences which were incorporated prior
to the base calling process. The principal cause of these prior errors is the inherently high
error rate of the reverse transcription process required for the generation of cDNA
libraries for EST sequencing. Similar errors are also inherent, though to a lesser extent, in
any PCR amplification process that may be part of a sequencing protocol. In cases where
trace files are unavailable, the identification of sequence errors can be based on two
further measures of SNP confidence: redundancy of the polymorphism in an
alignment, and co-segregation of SNPs with haplotype.
EST sequence datasets are most suited to redundancy-based SNP discovery. The
highly redundant nature of EST datasets permits the selection of polymorphisms that
occur multiple times within a set of aligned sequences. The frequency of occurrence of a
polymorphism at a particular locus provides a measure of confidence in the SNP
representing a true polymorphism and is referred to as the SNP redundancy score. By
examining SNPs that have a redundancy score of two or greater, i.e., two or more of the
aligned sequences represent the polymorphism, the vast majority of sequencing errors are
removed. Although some true genetic variation is also ignored due to its presence only
once within an alignment, the high degree of redundancy within the data permits the
rapid identification of large numbers of SNPs without the requirement for sequence trace
files.
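The redundancy filter described above can be sketched as follows. The sketch assumes that the aligned sequences are equal-length, upper-case strings, that gaps and ambiguous bases are written as "-" and "N", and that the redundancy score of a column is the count of its second most frequent base. That last point is one reasonable reading of the definition rather than the exact scoring used by any particular tool, and the function name is hypothetical.

```python
from collections import Counter


def redundancy_scores(alignment, min_score=2, ignore="-N"):
    """Return {column: (allele_counts, score)} for candidate SNP columns.

    A column is kept when at least two different bases are present and the
    second most frequent base occurs at least `min_score` times.
    """
    if not alignment:
        return {}
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have equal length")

    candidates = {}
    for col in range(len(alignment[0])):
        # Count the bases in this column, skipping gaps and ambiguity codes.
        counts = Counter(seq[col] for seq in alignment if seq[col] not in ignore)
        if len(counts) < 2:
            continue  # monomorphic column
        score = counts.most_common()[1][1]  # count of the minor allele
        if score >= min_score:
            candidates[col] = (dict(counts), score)
    return candidates


reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACGT",
    "ACCTACGA",
    "ACGTACGT",
]
snps = redundancy_scores(reads)
for col, (counts, score) in snps.items():
    print(col, counts, score)
```

In this toy alignment the G/C difference at column 2 is supported by two reads and is retained, whereas the single A at column 7 is discarded. The filter cannot tell whether that A is a sequencing error or a rare true variant.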
While redundancy-based methods for SNP discovery are highly efficient, the
nonrandom nature of sequence error may lead to certain sequence errors being repeated
between runs due to conserved, complex DNA structures. Therefore, errors at these loci
would have a relatively high SNP redundancy score and appear as confident SNPs. Errors
from this source require an additional method to differentiate them from true
polymorphisms. A further measure of SNP confidence is based on haplotype co-
segregation. While sequencing errors may occur at nonrandom positions within a
sequencing read due to conserved sequence complexity, the probability of these errors
being repeated between sequence reads remains random. True SNPs that represent
divergence between homologous genes co-segregate to define a conserved haplotype,
whereas nonrandom sequence errors do not co-segregate with haplotype. A co-
segregation score based on the frequency of an SNP pattern occurring at multiple loci in an
alignment allows ready identification of SNPs that do not co-segregate to define a
haplotype. The SNP redundancy score and co-segregation score together provide a valuable means
for estimating confidence in the validity of SNPs within aligned sequences independent
of sequence trace files or the source of the sequence error.
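A minimal sketch of a co-segregation score follows. It assumes that every read covers every candidate column, and it takes the score of a column to be the number of other candidate columns that split the reads into the same groups. Both the function names and this exact definition are illustrative assumptions, not the scoring of a specific published tool.

```python
from collections import Counter


def partition_pattern(column):
    """Relabel bases by order of first appearance.

    Two columns that divide the reads into the same groups receive the same
    pattern, whichever bases are involved.
    """
    labels = {}
    return tuple(labels.setdefault(base, len(labels)) for base in column)


def cosegregation_scores(alignment, snp_columns):
    """Return {column: number of other candidate columns sharing its pattern}."""
    patterns = {
        col: partition_pattern([seq[col] for seq in alignment])
        for col in snp_columns
    }
    frequency = Counter(patterns.values())
    return {col: frequency[pattern] - 1 for col, pattern in patterns.items()}


reads = [
    "ACGTAC",
    "ACGTAG",
    "ATGCAC",
    "ATGCAG",
    "ATGCAC",
]
# Columns 1, 3 and 5 all have a minor allele present in two reads.
scores = cosegregation_scores(reads, [1, 3, 5])
print(scores)
```

Columns 1 and 3 separate the first two reads from the last three and so support each other as a haplotype. Column 5 has the same redundancy but a pattern shared with no other column, which flags it for closer inspection.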
70 D. EDWARDS ET AL.
4.6 CONCLUSION
There are several approaches that may be undertaken for the discovery of SNPs in
plant species. The method applied depends on several factors, including the
expected application of the discovered SNPs, the availability of gene or genome sequence
and the availability of computational tools or laboratory facilities. Where only limited
DNA sequence is available or large numbers of validated SNPs are required within a
limited number of specific genes, an in vitro approach would be favored. Where large
numbers of SNPs are required across a genome and a significant quantity of sequence
data is available, an in silico approach may be more appropriate. Two factors are likely
to influence SNP discovery in the future. These are the increasing ability to produce gene
and genome sequence data at an ever reducing cost and the development of massive
throughput genotyping systems for the assessment of tens of thousands of SNPs across
thousands of genotypes. These technologies can be applied for SNP discovery and
validation within specific genes as well as defining global SNP frequencies across
populations.
4.7 REFERENCES
Adams, M.D., Kerlavage, A.R., Fleischmann, R.D., Fuldner, R.A., Bult, C.J., Lee, N.H., Kirkness, E.F.,
Weinstock, K.G., Gocayne, J.D., White, O., Sutton, G., Blake, J.A., Brandon, R.C., Chiu, M.-W., Clayton,
R.A., Cline, R.T., Cotton, M.D., Earle-Hughes, J., Fine, L.D., FitzGerald, L.M., FitzHugh, W.M.,
Fritchman, J.L., Geoghagen, N.S.M., Glodek, A., Gnehm, C.L., Hanna, M.C., Hedblom, E., Hinkle Jr.,
P.S., Kelley, J.M., Klimek, K.M., Kelley, J.C., Liu, L.-I., Marmaros, S.M., Merrick, J.M., Moreno-
Palanques, R.F., McDonald, L.A., Nguyen, D.T., Pellegrino, S.M., Phillips, C.A., Ryder, S.E., Scott, J.L.,
SINGLE NUCLEOTIDE POLYMORPHISM DISCOVERY 71
Saudek, D.M., Shirley, R., Small, K.V., Spriggs, T.A., Utterback, T.R., Weidman, J.F., Li, Y., Barthlow,
R., Bednarik, D.P., Cao, L., Cepeda, M.A., Coleman, T.A., Collins, E.-J., Dimke, D., Feng, P., Ferrie, A.,
Fischer, C., Hastings, G.A., He, W.-W., Hu, J.-S., Huddleston, K.A., Greene, J.M., Gruber, J., Hudson, P.,
Kim, A., Kozak, D.L., Kunsch, C., Ji, H., Li, H., Meissner, P.S., Olsen, H., Raymond, L., Wei, Y.-F.,
Wing, J., Xu, C., Yu, G.-L., Ruben, S.M., Dillon, P.J., Fannon, M.R., Rosen, C.A., Haseltine, W.A.,
Fields, C., Fraser, C.M., Venter, J.C., 1995, Initial assessment of human gene diversity and expression
patterns based upon 83 million nucleotides of cDNA sequence. Nature 377:3–17.
Ahmadian, A., Gharizadeh, B., Gustafsson, A.C., Sterky, F., Nyren, P., Uhlen, M., Lundeberg, J., 2000, Single-
nucleotide polymorphism analysis by pyrosequencing. Analytical Biochemistry 280:103–110.
Andersen, J.R., Lübberstedt, T., 2003, Functional markers in plants. Trends in Plant Science 8:554–560.
Badr, A., Sayed-Ahmed, H., El-Shanshouri, A., Watson, L.E., 2002, Ancestors of white clover (Trifolium
repens L.), as revealed by isozyme polymorphisms. Theoretical and Applied Genetics 106:143–148.
Barker, G., Batley, J., O’Sullivan, H., Edwards, K.J., Edwards, D., 2003, Redundancy based detection of
sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics 19:421–422.
Barrett, B., Griffiths, A., Schreiber, M., Ellison, N., Mercer, C., Bouton, J., Ong, B., Forster, J., Sawbridge, T.,
Spangenberg, G., Bryan, G., Woodfield, D., 2004, A microsatellite map of white clover. Theoretical and
Applied Genetics 109:596–608.
Batley, J., Barker, G., O’Sullivan, H., Edwards, K.J., Edwards, D., 2003, Mining for single nucleotide
polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiology 132:84–91.
Bhattramakki, D., Rafalski, A., 2001, Discovery and application of single nucleotide polymorphism markers in
plants. In: R.J. Henry (ed) Plant Genotyping: The DNA Fingerprinting of Plants. CABI Publishing,
Wallingford, Oxon, UK, pp 179–193.
Bhattramakki, D., Dolan, M., Hanafey, M., Wineland, R., Vaske, D., Register III, J.C., Tingey, S.V., Rafalski,
A., 2002, Insertion–deletion polymorphisms in 3′ regions of maize genes occur frequently and can be used
as highly informative genetic markers. Plant Molecular Biology 48:539–547.
Borevitz, J.O., Liang, D., Plouffe, D., Chang, H.S., Zhu, T., Weigel, D., Berry, C.C., Winzeler, E., Chory, J.,
2003, Large-scale identification of single-feature polymorphisms in complex genomes. Genome Research
13:513–523.
Botstein, D., White, R.L., Skolnick, M., Davis, R.W., 1980, Construction of a genetic linkage map in man using
restriction fragment length polymorphisms. American Journal of Human Genetics 32:314–331.
Bradbury, L.M.T., Fitzgerald, T.L., Henry, R.J., Jin, Q., Waters, D.L.E., 2005, The gene for fragrance in rice.
Plant Biotechnology Journal 3:363–370.
Brown, G.R., Kadel, E.E., Bassoni, D.L., Kiehne, K.L., Temesgen, B., van Buijtenen, J.P., Sewell, M.M.,
Marshall, K.A., Neale, D.B., 2001, Anchored reference loci in loblolly pine (Pinus taeda L.) for integrating
pine genomics. Genetics 159:799–809.
Brown, G.R., Gill, G.P., Kuntz, R.J., Langley, C.H., Neale, D.B., 2004, Nucleotide diversity and linkage
disequilibrium in loblolly pine. Proceedings of the National Academy of Sciences of the United States of
America 101:15255–15260.
Bryan, G.J., Stephenson, P., Collins, A., Kirby, J., Smith, J.B., Gale, M.D., 1999, Low levels of DNA sequence
variation among adapted genotypes of hexaploid wheat. Theoretical and Applied Genetics 99:192–198.
Buetow, K.H., Edmonson, M.N., Cassidy, A.B., 1999, Reliable identification of large numbers of candidate
SNPs from public EST data. Nature Genetics 21:323–325.
Bundock, P.C., Henry, R.J., 2004, Single nucleotide polymorphism, haplotype diversity and recombination in
the Isa gene of barley. Theoretical and Applied Genetics 109:543–551.
Bundock, P.C., Christopher, J.T., Eggler, P., Ablett, G., Henry, R.J., Holton, T.A., 2003, Single nucleotide
polymorphisms in cytochrome P450 genes from barley. Theoretical and Applied Genetics 106:676–682.
Burke, J., Davison, D., Hide, W., 1999, d2-cluster: a validated method for clustering EST and full-length cDNA
sequences. Genome Research 9:1135–1142.
Caldwell, K.S., Dvorak, J., Lagudah, E.S., Akhunov, E., Luo, M.-C., Wolters, P., Powell, W., 2004, Sequence
polymorphism in polyploid wheat and their D-genome diploid ancestor. Genetics 167:941–947.
Chen, C., Gibson, P.B., 1970a, Chromosome pairing in two interspecific hybrids of Trifolium. Canadian Journal
of Genetics and Cytology 12:790–794.
Chen, C., Gibson, P.B., 1970b, Meiosis in two species of Trifolium and their hybrids. Crop Science 10:188–189.
Chen, C., Gibson, P.B., 1971, Karyotypes of fifteen Trifolium species in section Amoria. Crop Science 11:441–
445.
Ching, A., Caldwell, K.S., Jung, M., Dolan, M., Smith, O.S., Tingey, S., Morgante, M., Rafalski, A.J., 2002,
SNP frequency, haplotype structure and linkage disequilibrium in elite maize inbred lines. BMC Genetics
[electronic resource] 3:19.
Cogan, N.O.I., Ponting, R.C., Vecchies, A.C., Drayton, M.C., George, J., Dobrowolski, M.P., Sawbridge, T.I.,
Spangenberg, G.C., Smith, K.F., Forster, J.W., 2006, Gene-associated single nucleotide polymorphism
(SNP) discovery in perennial ryegrass (Lolium perenne L.). Molecular Genetics and Genomics 276:101–112.
Coles, N.D., Coleman, C.E., Christensen, S.A., Jellen, E.N., Stevens, M.R., Bonifacio, A., Rojas-Beltran, J.A.,
Fairbanks, D.J., Maughan, P.J., 2005, Development and use of an expressed sequenced tag library in quinoa
(Chenopodium quinoa Willd.) for the discovery of single nucleotide polymorphisms. Plant Science
168:439–447.
Comai, L., Young, K., Till, B.J., Reynolds, S.H., Greene, E.A., Codomo, C.A., Enns, L.C., Johnson, J.E.,
Burtner, C., Odden, A.R., Henikoff, S., 2004, Efficient discovery of DNA polymorphisms in natural
populations by Ecotilling. Plant Journal 37:778–786.
Coryell, V.H., Jessen, H., Schupp, J.M., Webb, D., Keim, P., 1999, Allele-specific hybridization markers for
soybean. Theoretical and Applied Genetics 98:690–696.
Dawson, E., Chen, Y., Hunt, S., Smink, L.J., Hunt, A., Rice, K., Livingston, S., Bumpstead, S., Bruskiewich,
R., Sham, P., Ganske, R., Adams, M., Kawasaki, K., Shimizu, N., Minoshima, S., Roe, B., Bentley, D.,
Dunham, I., 2001, A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the
genomic sequence. Genome Research 11:170–178.
Duda Jr., T.F., Palumbi, S.R., 1999, Population structure of the black tiger prawn, Penaeus monodon, among
western Indian Ocean and western Pacific populations. Marine Biology 134:705–710.
Ellison, N.W., Liston, A., Szeimer, J.J., Williams, W.M., Taylor, W.L., 2006, Molecular phylogenetics of the
clover genus (Trifolium-Leguminosae). Molecular Phylogenetics and Evolution 39:688–705.
Etienne, C., Rotham, C., Moing, A., Plomion, C., Bodenes, C., Svanella-Dumas, L., Cosson, P., Pronier, V.,
Monet, R., Dirlewanger, E., 2002, Candidate genes and QTLs for sugar and organic acid content in peach
(Prunus persica L. Batsch). Theoretical and Applied Genetics 105(1):145–159.
Ewing, B., Green, P., 1998, Base-calling of automated sequencer traces using phred. II. Error probabilities.
Genome Research 8:186–194.
Ewing, B., Hillier, L., Wendl, M.C., Green, P., 1998, Base-calling of automated sequencer traces using phred. I.
Accuracy assessment. Genome Research 8:175–185.
Faville, M.J., Vecchies, A.C., Schreiber, M., Drayton, M.C., Hughes, L.J., Jones, E.S., Guthridge, K.M., Smith,
K.F., Sawbridge, T., Spangenberg, G.C., Bryan, G.T., Forster, J.W., 2004, Functionally associated
molecular genetic marker map construction in perennial ryegrass (Lolium perenne L.). Theoretical and
Applied Genetics 110:12–32.
Faville, M., Vecchies, A.C., Schreiber, M., Drayton, M.C., Hughes, L.J., Jones, E.S., Guthridge, K.M., Smith,
K.F., Sawbridge, T., Spangenberg, G.C., Bryan, G.T., Forster, J.W., 2005, Candidate gene-based molecular
marker map construction in perennial ryegrass (Lolium perenne L.). Theoretical and Applied Genetics
110:12–32.
Forster, J.W., Jones, E.S., Batley, J., Smith, K.F., 2004, Molecular marker-based genetic analysis of pasture and
turf grasses. In: A. Hopkins, Z.-Y. Wang, M. Sledge, R.E. Barker (eds) Molecular Breeding of Forage and
Turf. Kluwer, Dordrecht, pp 197–239.
Gaut, B.S., Clegg, M.T., 1993, Nucleotide polymorphism in the Adh1 locus of pearl millet (Pennisetum
glaucum) (Poaceae). Genetics 135:1091–1097.
Germano, J., Klein, A.S., 1999, Species-specific nuclear and chloroplast single nucleotide polymorphisms to
distinguish Picea glauca, P. mariana and P. rubens. Theoretical and Applied Genetics 99:37–49.
Gilchrist, E.J., Haughn, G.W., 2005, TILLING without a plough: a new method with applications for reverse
genetics. Current Opinion in Plant Biology 8:1–5.
Gill, G.P., Brown, G.R., Neale, D.B., 2003, A sequence mutation in the cinnamyl alcohol dehydrogenase gene
associated with altered lignification in loblolly pine. Plant Biotechnology Journal 1:253–258.
Giordano, M., Oefner, P.J., Underhill, P.A., Sforza, L.L.C., Tosi, R., Richiardi, P.M., 1999, Identification by
denaturing high-performance liquid chromatography of numerous polymorphisms in a candidate region for
multiple sclerosis susceptibility. Genomics 56:247–253.
Gordon, D., Abajian, C., Green, P., 1998, Consed: a graphical tool for sequence finishing. Genome Research
8:195–202.
Green, P., 1994, Phrap. Unpublished data (www.phrap.org).
Greene, E.A., Codomo, C.A., Taylor, N.E., Henikoff, J.G., Till, B.J., Reynolds, S.H., Enns, L.C., Burtner, C.,
Johnson, J.E., Odden, A.R., Comai, L., Henikoff, S., 2003, Spectrum of chemically induced mutations from
a large-scale reverse-genetic screen in Arabidopsis. Genetics 164:731–740.
Gu, Z., Hillier, L., Kwok, P.-Y., 1998, Single nucleotide polymorphism hunting in cyberspace. Human
Mutation 12:221–225.
Hauser, M.T., Adhami, F., Dorner, M., Fuchs, E., Glossl, J., 1998, Generation of co-dominant PCR-based
markers by duplex analysis on high resolution gels. Plant Journal 16:117–125.
Hayashi, K., Hashimoto, N., Daigen, M., Ashikawa, I., 2004, Development of PCR-based SNP markers for rice
blast resistance genes at the Piz locus. Theoretical and Applied Genetics 108:1212–1220.
Hebenbrock, K., Williams, P.M., Karger, B.L., 1995, Single strand conformational polymorphism using
capillary electrophoresis with two-dye laser-induced fluorescence detection. Electrophoresis 16:1429–1436.
Hongtrakul, V., Slabaugh, M.B., Knapp, S.J., 1998, DFLP, SSCP, and SSR markers for Δ9-stearoyl-acyl
carrier protein desaturases strongly expressed in developing seeds of sunflower: intron lengths are
polymorphic among elite inbred lines. Molecular Breeding 4:195–203.
Hsia, A.-P., Wen, T.-J., Chen, H.D., Liu, Z., Yandeau-Nelson, M.D., Wei, Y., Guo, L., Schnable, P.S., 2005,
Temperature gradient capillary electrophoresis (TGCE) – a tool for the high-throughput discovery and
mapping of SNPs and IDPs. Theoretical and Applied Genetics 111:218–225.
Huang, X., Madan, A., 1999, CAP3: a DNA sequence assembly program. Genome Research 9:868–877.
Inazuka, M., Wenz, H.M., Sakabe, M., Tahira, T., Hayashi, K., 1997, A streamlined mutation detection system:
multicolor post-PCR fluorescence labeling and single-strand conformational polymorphism analysis by
capillary electrophoresis. Genome Research 7:1094–1103.
Ingvarsson, P.K., 2005, Nucleotide polymorphism and linkage disequilibrium within and among natural
populations of European aspen (Populus tremula L., Salicaceae). Genetics 169:945–953.
Jander, G., Norris, S.R., Rounsley, S.D., Bush, D.F., Levin, I.M., Last, R.L., 2002, Arabidopsis map-based
cloning in the post-genome era. Plant Physiology 129:440–450.
Järvinen, P., Lemmetyinen, J., Savolainen, O., Sopanen, T., 2003, DNA sequence variation in BpMADS2 gene
in two populations of Betula pendula. Molecular Ecology 12:369–384.
Jeanneau, M., Gerentes, D., Foueillassar, X., Zivy, M., Vidal, J., Toppan, A., Perez, P., 2002, Improvement of
drought tolerance in maize: towards the functional validation of the Zm-Asr1 gene and increase of water use
efficiency by over-expressing C4-PEPC. Biochimie 84:1127–1135.
Jin, Q., Waters, D., Cordeiro, G.M., Henry, R.J., Reinke, R.F., 2003, A single nucleotide polymorphism (SNP)
marker linked to the fragrance gene in rice (Oryza sativa L.). Plant Science 165:359–364.
Jones, E.S., Dupal, M.P., Dumsday, J.L., Hughes, L.J., Forster, J.W., 2002a, An SSR-based genetic linkage map
for perennial ryegrass (Lolium perenne L.). Theoretical and Applied Genetics 105:577–584.
Jones, E.S., Mahoney, N.L., Hayward, M.D., Armstead, I.P., Jones, J.G., Humphreys, M.O., King, I.P., Kishida,
T., Yamada, T., Balfourier, F., Charmet, G., Forster, J.W., 2002b, An enhanced molecular marker based
genetic map of perennial ryegrass (Lolium perenne) reveals comparative relationships with other Poaceae
genomes. Genome 45:282–295.
Jones, E.S., Hughes, L.J., Drayton, M.C., Abberton, M.T., Michaelson-Yeates, T.P.T., Bowen, C., Forster, J.W.,
2003, An SSR and AFLP molecular marker-based genetic map of white clover (Trifolium repens L.). Plant
Science 165:531–539.
Kado, T., Yoshimaru, H., Tsumura, Y., Tachida, H., 2003, DNA variation in a conifer, Cryptomeria japonica
(Cupressaceae sensu lato). Genetics 164:1547–1559.
Keim, P., Diers, B.W., Olson, T.C., Shoemaker, R.C., 1990, RFLP mapping in soybean: association between
marker loci and variation in quantitative traits. Genetics 126:735–742.
Konieczny, A., Ausubel, F.M., 1993, A procedure for mapping Arabidopsis mutations using co-dominant
ecotype-specific PCR-based markers. Plant Journal 4:403–410.
Kota, R., Wolf, M., Michalek, W., Graner, A., 2001, Application of denaturing high-performance liquid
chromatography for mapping of single nucleotide polymorphisms in barley (Hordeum vulgare L.). Genome
44:523–528.
Kota, R., Rudd, S., Facius, A., Kolesov, G., Thiel, T., Zhang, H., Stein, N., Mayer, K., Graner, A., 2003,
Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Molecular Genetics
and Genomics 270:24–33.
Kuhn, D.N., Borrone, J., Meerow, A.W., Motamayor, J.C., Brown, J.S., Schnell, R.J., 2005, Single-strand
conformation polymorphism analysis of candidate genes for reliable identification of alleles by capillary
array electrophoresis. Electrophoresis 26:112–125.
LaForest, S.M., Prestwich, G.D., Löfstedt, C., 1999, Intraspecific nucleotide variation at the pheromone binding
protein locus in the turnip moth, Agrotis segetum. Insect Molecular Biology 8:481–490.
Lau, E., Leushner, J., Patnaik, M., 2000, Automated detection of the factor V Leiden mutation using MALDI-
TOF mass spectrometry on the MassARRAY system. Clinical Chemistry 46:1880.
Le Dantec, L.L., Chagné, D., Pot, D., Cantin, O., Garnier-Géré, P., Bedon, F., Frigerio, J.-M., Chaumeil, P.,
Léger, P., Garcia, V., Laigret, F., De Daruvar, A., Plomion, C., 2004, Automated SNP detection in
expressed sequence tags: statistical considerations and application to maritime pine sequences. Plant
Molecular Biology 54:461–470.
Lopez, C., Piégu, B., Cooke, R., Delseny, M., Tohme, J., Verdier, V., 2005, Using cDNA and genomic
sequences as tools to develop SNP strategies in cassava (Manihot esculenta Crantz). Theoretical and
Applied Genetics 110:425–431.
Lu, C.M., Yang, W.Y., Zhang, W.J., Lu, B.-R., 2005, Identification of SNPs and development of allelic specific
PCR markers for high molecular weight glutenin subunit Dtx1.5 from Aegilops tauschii through sequence
characterization. Journal of Cereal Science 41:13–18.
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S.,
Chen, Y.-J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen,
S., Ho, C.H., Irzyk, G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.-B., Knight, J.R.,
Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B.,
McDade, K.E., McKenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan,
M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A.,
Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., Rothberg, J.M.,
2005, Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380.
Marth, G.T., Korf, I., Yandell, M.D., Yeh, R.T., Gu, Z., Zakeri, H., Stitziel, N.O., Hillier, L., Kwok, P.-Y.,
Gish, W.R., 1999, A general approach to single-nucleotide polymorphism discovery. Nature Genetics
23:452–456.
Martins-Lopes, P., Zhang, H., Koebner, R., 2001, Detection of single nucleotide mutations in wheat using
single strand conformation polymorphism gels. Plant Molecular Biology Reporter 19:159–162.
Mattocks, C., White, H.E., Owen, N., Durston, V.J., Harvey, J.F., Cross, N.C.P., 2004, An evaluation of the
MassCLEAVE(tm) biochemistry for diagnostic screening. Journal of Medical Genetics 41:S75.
McCallum, J., Leite, D., Pither-Joyce, M., Havey, M.J., 2001, Expressed sequence markers for genetic analysis
of bulb onion (Allium cepa L.). Theoretical and Applied Genetics 103:979–991.
Mogg, R., Batley, J., Hanley, S., Edwards, D., O’Sullivan, H., Edwards, K.J., 2002, Characterization of the
flanking regions of Zea mays microsatellites reveals a large number of useful sequence polymorphisms.
Theoretical and Applied Genetics 105:532–543.
Morales, M., Roig, E., Monforte, A.J., Arús, P., Garcia-Mas, J., 2004, Single-nucleotide polymorphisms
detected in expressed sequence tags of melon (Cucumis melo L.). Genome 47:352–360.
Myers, R.M., Sheffield, V.C., Cox, D.R., 1988, Detection of single base changes in DNA: ribonuclease
cleavage and denaturing gradient gel electrophoresis. In: K.E. Davies (ed) Genome Analysis: A Practical
Approach. IRL, Oxford, pp 95–139.
Neff, M.M., Neff, J.D., Chory, J., Pepper, A.E., 1998, dCAPS, a simple technique for the genetic analysis of
single nucleotide polymorphisms: experimental applications in Arabidopsis thaliana genetics. Plant Journal
14:387–392.
O’Donnell, M.J., Little, D.P., Braun, A., 1997, MassArray as an enabling technology for the industrial-scale
analysis of DNA. Genetic Engineering News 17:39.
Oleykowski, C.A., Mullins, C.R.B., Godwin, A.K., Yeung, A.T., 1998, Mutation detection using a novel plant
endonuclease. Nucleic Acids Research 26:4597–4602.
Olsen, K.M., Halldorsdottir, S.S., Stinchcombe, J.R., Weinig, C., Schmitt, J., Purugganan, M.D., 2004, Linkage
disequilibrium mapping of Arabidopsis CRY2 flowering time alleles. Genetics 167:1361–1369.
Orita, M., Suzuki, Y., Sekiya, T., Hayashi, K., 1989, Rapid and sensitive detection of point mutations and DNA
polymorphisms using the polymerase chain reaction. Genomics 5:874–879.
Palumbi, S.R., Baker, C.S., 1994, Contrasting population structure from nuclear intron sequences and mtDNA
of humpback whales. Molecular Biology and Evolution 11:426–435.
Paran, I., Zamir, D., 2003, Quantitative traits in plants: beyond the QTL. Trends in Genetics 19:303–306.
Perry, J.A., Wang, T.L., Welham, T.J., Gardner, S., Pike, J.M., Yoshida, S., Parniske, M., 2003, A TILLING
reverse genetics tool and a web-accessible collection of mutants of the legume Lotus japonicus. Plant
Physiology 131:866–871.
Pertea, G., Huang, X., Liang, F., Antonescu, V., Sultana, R., Karamycheva, S., Lee, Y., White, J., Cheung, F.,
Parvizi, B., Tsai, J., Quackenbush, J., 2003, TIGR gene indices clustering tools (TGICL): a software system
for fast clustering of large EST datasets. Bioinformatics 19:651–652.
Picoult-Newberg, L., Ideker, T.E., Pohl, M.G., Taylor, S.L., Donaldson, M.A., Nickerson, D.A., Boyce-Jacino,
M., 1999, Mining SNPs from EST databases. Genome Research 9:167–174.
Plomion, C., Frigerio, J.-M., Ridolfi, M., Pot, D., Pionneau, C., Bodénes, C., Kremer, A., Hurme, P.,
Savolainen, O., Avila, C., Gallardo, F., Canovas, F.M., David, H., Neutelings, G., Campbell, M., 1999,
Developing SSCP markers in two Pinus species. Molecular Breeding 5:21–31.
Pot, D., McMillan, L., Echt, C., Le Provost, G., Garnier-Géré, P., Cato, S., Plomion, C., 2005, Nucleotide
variation in genes involved in wood formation in two pine species. New Phytologist 167:101–112.
Rickert, A.M., Premstaller, A., Gebhardt, C., Oefner, P.J., 2002, Genotyping of SNPs in a polyploid genome by
pyrosequencing (TM). Biotechniques 32(3):592–593.
Rickert, A.M., Jeong, J.H., Meyer, S., Nagel, A., Ballvora, A., Oefner, P.J., Gebhardt, C., 2003, First generation
SNP/InDel markers tagging loci for pathogen resistance in the potato genome. Plant Biotechnology Journal
1:399–410.
Rodi, C.P., Storm, N., Darnhofer-Patel, B., Hartmer, R., Leppin, L., Bocker, S., Denissenko, M., van den Boom,
D., 2002, MassARRAY (TM) analysis of fragmented nucleic acids: applications in typing, sequence
validation, and targeted SNP discovery. European Journal of Human Genetics 10:299.
Russell, J., Booth, A., Fuller, J., Harrower, B., Hedley, P., Machray, G., Powell, W., 2004, A comparison of
sequence-based polymorphism and haplotype content in transcribed and anonymous regions of the barley
genome. Genome 47:389–398.
Sato, Y., Nishio, T., 2003, Mutation detection in rice waxy mutants by PCR–RF–SSCP. Theoretical and
Applied Genetics 107:560–567.
Savage, D., Batley, J., Erwin, T., Logan, E., Love, C.G., Lim, G.A., Mongin, E., Barker, G., Spangenberg, G.C.,
Edwards, D., 2005, SNPServer: a real-time SNP discovery tool. Nucleic Acids Research 33:W493–W495.
Savolainen, O., Langley, C.H., Lazzaro, B.P., Fréville, H., 2000, Contrasting patterns of nucleotide
polymorphism at the alcohol dehydrogenase locus in the outcrossing Arabidopsis lyrata and the selfing
Arabidopsis thaliana. Molecular Biology and Evolution 17:645–655.
Sawbridge, T., Ong, E.-K., Binnion, C., Emmerling, M., McInnes, R., Meath, K., Nguyen, N., Nunan, K.,
O’Neill, M., O’Toole, F., Rhodes, C., Simmonds, J., Tian, P., Wearne, K., Webster, T., Winkworth, A.,
Spangenberg, G., 2003a, Generation and analysis of expressed sequence tags in perennial ryegrass (Lolium
perenne L.). Plant Science 165:1089–1100.
Sawbridge, T., Ong, E.-K., Binnion, C., Emmerling, M., Meath, K., Nunan, K., O’Neill, M., O’Toole, F.,
Simmonds, J., Wearne, K., Winkworth, A., Spangenberg, G., 2003b, Generation and analysis of expressed
sequence tags in white clover (Trifolium repens L.). Plant Science 165:1077–1087.
Schmid, K.J., Sörensen, T.R., Stracke, R., Törjek, O., Altmann, T., Mitchell-Olds, T., Weisshaar, B., 2003,
Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in
Arabidopsis thaliana. Genome Research 13:1250–1257.
Selinger, D.A., Chandler, V.L., 1999, Major recent and independent changes in levels and patterns of
expression have occurred at the b gene, a regulatory locus in maize. Proceedings of the National Academy
of Sciences of the United States of America 96:15007–15012.
Shattuck-Eidens, D.M., Bell, R.N., Neuhausen, S.L., Helentjaris, T., 1990, DNA sequence variation within
maize and melon: observations from polymerase chain reaction amplification and direct sequencing.
Genetics 126:207–217.
Simko, I., Haynes, K.G., Ewing, E.E., Costanzo, S., Christ, B.J., Jones, R.W., 2004, Mapping genes for
resistance to Verticillium albo-atrum in tetraploid and diploid potato populations using haplotype
association tests and genetic linkage analysis. Molecular Genetics and Genomics 271:522–531.
Slade, A.J., Fuerstenberg, S.I., Loeffler, D., Steine, M.N., Facciotti, D., 2005, A reverse genetic, nontransgenic
approach to wheat crop improvement by TILLING. Nature Biotechnology 23:75–81.
Somers, D.J., Kirkpatrick, R., Moniwa, M., Walsh, A., 2003, Mining single-nucleotide polymorphisms from
hexaploid wheat ESTs. Genome 46:431–437.
Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L., Kwok, P.-Y., 1998, Overlapping genomic sequences: a treasure
trove of single-nucleotide polymorphisms. Genome Research 8:748–754.
Tang, J., Unnasch, T.R., 1995, Discriminating PCR artifacts using directed heteroduplex analysis (DHDA).
Biotechniques 19:902–905.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994, CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix
choice. Nucleic Acids Research 22:4673–4680.
Till, B.J., Reynolds, S.H., Greene, E.A., Codomo, C.A., Enns, L.C., Johnson, J.E., Burtner, C., Odden, A.R.,
Young, K., Taylor, N.E., Henikoff, J.G., Comai, L., Henikoff, S., 2003, Large-scale discovery of induced
point mutations with high-throughput TILLING. Genome Research 13:524–530.
Till, B.J., Burtner, C., Comai, L., Henikoff, S., 2004, Mismatch cleavage by single-strand specific nucleases.
Nucleic Acids Research 32:2632–2641.
Vaidyanathan, R., Kuruvilla, S., Thomas, G., 1999, Characterisation and expression pattern of an abscisic acid
and osmotic stress responsive gene from rice. Plant Science 140:25–36.
Wang, D.G., Fan, J.-B., Siao, C.-J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester,
E., Spencer, J., Kruglyak, L., Stein, L., Hsie, L., Topaloglou, T., Hubbell, E., Robinson, E., Mittmann, M.,
Morris, M.S., Shen, N., Kilburn, D., Rioux, J., Nusbaum, C., Rozen, S., Hudson, T.J., Lipshutz, R., Chee,
M., Lander, E.S., 1998, Large-scale identification, mapping, and genotyping of single-nucleotide
polymorphisms in the human genome. Science 280:1077–1082.
Wang, R.-L., Stec, A., Hey, J., Lukens, L., Doebley, J., 1999, The limits of selection during maize
domestication. Nature 398:236–239.
Yang, W., Bai, X., Kabelka, E., Eaton, C., Kamoun, S., Van Der Knaap, E., Francis, D., 2004, Discovery of
single nucleotide polymorphisms in Lycopersicon esculentum by computer aided analysis of expressed
sequence tags. Molecular Breeding 14:21–34.
Zhang, D.X., Hewitt, G.M., 2003, Erratum: nuclear DNA analyses in genetic studies of populations: practice,
problems and prospects (Molecular Ecology (2003) 12:563–584). Molecular Ecology 12:1687.
Zhang, W., Gianibelli, M.C., Ma, W., Rampling, L., Gale, K.R., 2003, Identification of SNPs and development
of allele-specific PCR markers for gliadin alleles in Triticum aestivum. Theoretical and Applied Genetics
107:130–138.
Zhu, Y.L., Song, Q.J., Hyten, D.L., Van Tassell, C.P., Matukumalli, L.K., Grimm, D.R., Hyatt, S.M., Fickus,
E.W., Young, N.D., Cregan, P.B., 2003, Single-nucleotide polymorphisms in soybean. Genetics 163:1123–
1134.
Chapter 5
SINGLE NUCLEOTIDE POLYMORPHISMS
GENOTYPING IN PLANTS
David Chagné1, Jacqueline Batley2, David Edwards2, and John W. Forster2
5.1 INTRODUCTION
Single nucleotide polymorphism (SNP) markers are highly abundant in the genomes
of the majority of organisms, including plants. They provide valuable markers for the
study of agronomic or adaptive traits in plant species, using strategies based on genetic
mapping or association genetics studies. The development of SNP markers usually follows
a three-part progression consisting, chronologically, of SNP discovery based on analysis
of a small set of individuals, validation in a larger set (i.e., to remove false positives due
to sequencing errors or due to the presence of homeologous/paralogous sequences) and
then genotyping in a large population. The present chapter will focus on methods that are
applicable to large-scale SNP genotyping studies.
Syvänen (2001) and Kwok (2001) are among the most recent authors to publish
complete reviews on SNP-genotyping techniques. In addition, a number of methods that
use advanced technology (Invader assay, Pyrosequencing, Illumina fiber-optic array
technology linked to bead-immobilized GoldenGate™ PCR technology, and Sequenom
MALDI-TOF MS MassExtend™ technology) are described in detail in a recent book
chapter (Kahl et al. 2005). Those reviews cover the different methods that can be
employed for scoring SNPs, such as allele-specific oligonucleotide (ASO) hybridization,
oligonucleotide ligation, single nucleotide primer extension, and enzymatic cleavage.
Those methods are commonly used in combination with SNP detection technology
platforms such as gel electrophoresis systems, fluorescent plate readers, flow cytometry,
mass spectrometry, or oligonucleotide-based microarrays. Even with the latest technical
advances that have occurred since these reviews were published, these methods still
provide the core methodologies for SNP genotyping, in particular for plant association
studies. The following chapter will not attempt to provide another fully comprehensive
review, but will aim to describe the key features of the major technologies and attempt to
analyze the requirements of the SNP scoring methods that can be used in plants. A range
1 HortResearch, Plant Gene Mapping group, Private Bag 11030, Palmerston North, New Zealand
2 Primary Industries Research Victoria, Victorian AgriBiosciences Centre, La Trobe R&D Park, Bundoora, VIC
3083, Australia
78 DAVID CHAGNÉ ET AL.
The first strategy consists of scanning the whole genome with a very large number
of genetic loci (in the region of 10,000–100,000 or higher). This objective is difficult to
achieve as it requires an extremely detailed knowledge of the genome under
consideration, the availability of a large number of independent SNP markers, and a
high-throughput detection method that can ideally be multiplexed on a very large scale. For
plant species, in which genomes can be relatively complex, for which linkage
disequilibrium may only extend over short molecular distances because of the influence
of reproductive systems, and for which SNP frequencies may be low (Rafalski and
Morgante 2004), this approach can be difficult to apply. For instance, for well-
characterized crop species such as maize or wheat, despite the availability of large EST
data sets suitable for in silico SNP discovery and partial or complete physical maps,
implementation of whole-genome scan-based association genetics methods would be a
major undertaking. The number of SNPs required for such analysis would substantially
exceed any current technical capacity for genotyping. The same problem arises for other
model systems such as rice (the model genome for grasses and cereals), tomato (the
model species for Solanaceous plants), or poplar (the model species for trees), and is even
more acute for the broad range of little-studied, “genomic-orphan” plant species, none of
which possess sufficient SNP resources to consider a whole-genome scan with a
sufficiently high marker density. An attempt to perform a whole-genome scan in
Arabidopsis was recently reported (Törjek et al. 2003), but this study was based on a very
low number of markers (i.e., 100 SNPs) compared with the larger number of markers
ideally required. Although SNP markers provide the most effective current marker
system for association genetics analysis, those inbreeding plant species that are
descended from narrow domestication bottlenecks may show LD extending over map
distances measured in centimorgans rather than physical distances in the range from Kb
to Mb. In this case, other marker systems may be amenable to implementation for
whole-genome scans. These include restriction fragment length polymorphisms (RFLPs),
amplified fragment length polymorphisms (AFLPs), and diversity array technology (DArT)
(Jaccoud et al. 2001; Wenzl et al. 2004), because of the highly multiplex nature of the
relevant assays, and simple sequence repeats (SSRs), because of the highly polymorphic
and multiallelic nature of each locus. In addition, SSR-based genetic maps have
been developed for a number of the most important crop species and may consequently
be directly used for this purpose. Studies of this nature have been performed in rice
using SSRs (Semon et al. 2005), and in sugarcane (Jannoo et al. 1999) and sorghum
(Deu et al. 2005) using RFLPs, as well as in durum wheat, soybean, and other species. Although such studies
are likely to be augmented and eventually supplanted by SNP-based surveys, the current
data have been highly valuable for assessment of genome-wide patterns of LD.
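The dependence of marker number on the extent of LD can be illustrated with a rough calculation. The following is a minimal sketch, assuming one marker is needed per block of LD and that LD extent is uniform across the genome; the function name and the input values are hypothetical and are not measurements for any species.

```python
import math


def markers_needed(genome_size_bp: int, ld_extent_bp: int) -> int:
    """Rough lower bound on markers for a whole-genome scan.

    Assumes (simplistically) one marker per stretch of ld_extent_bp.
    """
    if genome_size_bp <= 0 or ld_extent_bp <= 0:
        raise ValueError("genome size and LD extent must be positive")
    return math.ceil(genome_size_bp / ld_extent_bp)


# Illustrative inputs only: a 1 Gb genome with LD extending over 1 kb, 100 kb, or 1 Mb.
for ld_extent in (1_000, 100_000, 1_000_000):
    print(f"LD extent {ld_extent:>9} bp -> {markers_needed(1_000_000_000, ld_extent)} markers")
```

Under these assumptions, short-range LD pushes the requirement far beyond the 10,000–100,000 loci mentioned above, whereas LD extending over long distances brings it within reach of lower-density marker systems.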
For large nuclear plant genomes, the feasibility of whole-genome scan-based LD
analysis is greatly enhanced by methods for reduction of genome complexity. For the last
20 years, the global trend to reduce genome complexity in experimental DNA samples
has been to use the polymerase chain reaction (PCR) technique. However, the scoring of
millions of SNP loci spanning the entire genome over large numbers of test individuals
cannot be realistically achieved by using PCR amplification, even with a high level of
multiplexing, which is in any case often difficult to achieve. For that reason, more
contemporary methods use whole-genome amplification (WGA) techniques (Telenius
et al. 1992), which consist of amplifying total genomic DNA without a requirement for
locus-specific oligonucleotide primers. It should be noted that although WGA is most
commonly performed using PCR in combination with random oligomers (usually
between 6 and 10 nucleotides in length), the recently developed multiple displacement
amplification (MDA) technique (Dean et al. 2002), which is commercialized by GE
Healthcare (Little Chalfont, UK) as GenomiPhi™, relies instead on
isothermal rolling-circle replication catalyzed by bacteriophage φ29 polymerase, and has
been reported to provide superior genome coverage to methods such as degenerate
oligonucleotide primed-polymerase chain reaction (DOP-PCR). The idea behind the use
of WGA is to reduce the number of reactions to be performed for one individual to one
single tube reaction. An example of genome complexity reduction using WGA was
reported by Jordan et al. (2002) in which DOP-PCR was used to amplify Arabidopsis
genomic DNA from different ecotypes. The authors suggested that the DOP-PCR
amplified DNA could be used for SNP genotyping by direct sequencing or by ASO
hybridization-based methods. Subsequently, several groups attempted to combine WGA
methods with microarray technologies, based on the potential of microarrays to deliver
high-density analysis. Following this idea, Matsuzaki et al. (2004) used a complexity-
reduction assay to genotype more than 10,000 human SNPs based on oligonucleotide
array-mediated detection. The complete genotyping assay, from the DNA template to the
genotypic score, consists of the following steps: restriction enzyme digestion and
universal adaptor ligation, amplification using sequences complementary to the adaptors
as PCR primers, fragmentation and labeling, hybridization to the microarray, image
scanning and data acquisition. This study presented one of the first examples of relatively
high-resolution genotyping of a complex genome (i.e., average density of one SNP every
0.1 cM in the human genome), and represents one of the most promising techniques for
high-throughput and accurate SNP genotyping. Similar high-throughput methods were
recently developed such as the fiber-optic array-linked GoldenGate® assay (Illumina,
Inc., San Diego, USA), the molecular inversion probe assay (Hardenbol et al. 2005),
which utilizes an oligonucleotide ligation method (OLA; Iannone et al. 2000) and the
Infinium™ WGG method (Gunderson et al. 2005), which combines WGA, allele-specific
primer extension (ASPE; Taylor et al. 2001) and oligonucleotide array hybridization
using the BeadArray™ technology (Illumina, Inc., San Diego, USA; Shen et al. 2005).
Those latest assays make it possible to consider the completion of ambitious initiatives in
plant species equivalent to the HapMap project, which requires high-resolution SNP
haplotype definition across the genomes of members of multiple human populations
(HapMap 2003). Even if such techniques are, in theory, transferable for application in
any target organism, the application of such techniques for a plant species has not yet
been demonstrated. The sole existing plant example of the use of microarray systems for SNP
detection involved detection of single feature polymorphisms (SFPs) on
the Arabidopsis Affymetrix GeneChip® (Borevitz et al. 2003). Interestingly, the use of a
gene-expression-orientated array ensured that the analysis did not require any prior
specific SNP development, but rather inferred SNP structure retrospectively through
comparison of differential features. In this case, the characterization of such a number of
chip-based SNP loci could be sufficient to permit complete coverage of the Arabidopsis
genome, but this number would still not be sufficient for performing a whole-genome
scan in species characterized by a less extensive LD, such as forest trees or out-breeding
forage species (see Chapters 9 and 10). Moreover, and potentially of even higher
significance, the assayed SNPs are only located in transcribed sequences. Since the
Affymetrix chip-based experimental system is based on transcribed regions only, it is
unable (at present) to detect DNA variations arising only in noncoding regulatory
regions, which have been proposed to account for the majority of quantitative trait
variation in animal systems such as Drosophila melanogaster (Robin et al. 2002), and
have been directly implicated in a high proportion of such variation in plant species
(Paran and Zamir 2003). Nevertheless, availability of genome sequence for the two
commonly used Arabidopsis ecotypes (i.e., Columbia [Col] and Landsberg Erecta [Ler])
has provided access to a large quantity of ready-characterized SNPs located in both
coding and noncoding regions (Jander et al. 2002). Furthermore, as plant genomes can
now be sequenced in a relatively short time, as demonstrated through the completion of
the genome sequence of poplar in less than two years (Brunner et al. 2004), it seems likely
that the methods developed for SNP detection in model systems will soon be available for
use by other crop biologists.
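The first steps of the complexity-reduction assay described above (restriction digestion followed by adaptor-mediated amplification) can be mimicked in silico. The following is a minimal sketch, assuming that only fragments within a given size window amplify efficiently; the size window, the toy sequence, and the function names are hypothetical, and adaptor ligation is not modelled.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[str]:
    """Cut `sequence` at each occurrence of `site`, `cut_offset` bases into the site."""
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return [fragment for fragment in fragments if fragment]


def size_select(fragments: list[str], min_len: int, max_len: int) -> list[str]:
    """Keep fragments whose length lies in the (assumed) amplifiable window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy example: an EcoRI-like site (G^AATTC) in a short made-up sequence.
toy_genome = "AAGAATTCTTTTGAATTCCC"
pieces = digest(toy_genome, "GAATTC", cut_offset=1)
print(pieces)
print(size_select(pieces, min_len=5, max_len=8))
```

Only SNPs located on the retained fragments would be represented in the amplified sample, which is why the choice of enzyme and size window determines the subset of the genome that is assayed.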
The second strategy which can be used for association studies in plants to reduce the
complexity of the genomic regions that are targeted is the candidate gene approach. This
approach consists of the characterization of SNPs present in a subset of specific genes
identified using various strategies such as bioinformatics-based data mining, QTL
analysis and linkage mapping, expression studies, transgenic modification by antisense
RNA expression or RNA interference (RNAi), or positional cloning and physical
mapping. The idea is to find the single base polymorphism that is directly responsible for
functional variation in the trait of interest (which is often termed the qualitative or
quantitative trait nucleotide, QTN), or at least to find a SNP located within the functional
gene or at a small physical distance from the gene. This strategy provides a good solution
to the problems raised by the rapid decline of linkage disequilibrium observed in plant
genomes (Rafalski and Morgante 2004), as the chances that linkage disequilibrium may
be dissipated by a recombination event are extremely low in generational time (c. 10⁻⁶
per meiosis) when assaying a SNP located in a candidate gene, compared with much
higher probabilities when using a more distant marker in a low-resolution genome scan.
This numerically discrete strategy may consequently be applied to a large number of
individuals (such as those present within germplasm collections).
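The contrast between a SNP within a candidate gene and a more distant marker can be made concrete with the textbook expectation for the decay of linkage disequilibrium under random mating, D_t = D_0 (1 − r)^t, where r is the recombination fraction per meiosis. This model is an assumption introduced here for illustration and is not taken from the chapter; the value r = 0.01 for a distant marker is likewise hypothetical, while r = 10⁻⁶ is the figure quoted above.

```python
import math


def ld_after(d0: float, r: float, generations: int) -> float:
    """Expected LD after `generations` of random mating: D_t = D_0 * (1 - r)**t."""
    return d0 * (1.0 - r) ** generations


def generations_to_halve(r: float) -> float:
    """Number of generations for LD to fall to half its initial value."""
    return math.log(0.5) / math.log(1.0 - r)


for label, r in (("SNP in candidate gene", 1e-6), ("distant marker", 1e-2)):
    remaining = ld_after(1.0, r, generations=1000)
    print(f"{label}: r = {r:g}, fraction of LD left after 1000 generations = {remaining:.4f}, "
          f"half-life = {generations_to_halve(r):.0f} generations")
```

Under this simple model, LD with a tightly linked SNP is essentially unchanged over a thousand generations, whereas LD with the distant marker has almost entirely decayed.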
A technique that may be widely used for SNP genotyping in candidate genes is the
direct sequencing of PCR products. As described earlier in Chapter 4, sequencing is
accurate and may also be used for SNP discovery and validation. Indeed, sequencing-
derived data are often taken as a benchmark standard in studies for the evaluation of
novel SNP-genotyping methods. Given that a large number of laboratories now possess
automated sequencers or have access to facilities offering low-cost sequencing services,
this method may also be highly effective in terms of cost and throughput. Current
capillary electrophoresis technology permits sequencing of fragments of up to 1 Kb in
length, which makes it possible to genotype several SNPs within the same sequence, and
determine the haplotype structure within the sequenced fragment. Several examples of
the application of SNP detection using sequencing in plants have been published. As an
example, in forest trees, several authors (Brown et al. 2004; Gill et al. 2003; Pot et al.
2005) studied nucleotide diversity within candidate genes associated with wood
formation and adaptive traits in three pine species. Sets of loci were sequenced across
a range of natural populations, revealing heterogeneous patterns of diversity in the
evaluated genes. The loci subjected to genotyping were chosen because they
corresponded to genes of known function in wood formation, as well as co-locating with
QTLs for wood quality that were previously identified by genetic mapping (Brown et al.
2003; Chagné et al. 2003).
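Determining haplotype structure from sequenced fragments, as mentioned above, amounts to reading the alleles at every variable site along each sequence. The following is a minimal sketch, assuming the sequences are already aligned, of equal length, free of gaps, and phased (one sequence per chromosome copy); the function names and the toy sequences are hypothetical.

```python
from collections import Counter


def snp_positions(aligned: list[str]) -> list[int]:
    """Return 0-based alignment columns at which more than one base is observed."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length) if len({seq[i] for seq in aligned}) > 1]


def haplotype_counts(aligned: list[str]) -> Counter:
    """Count haplotypes, each defined by the alleles at all variable columns."""
    positions = snp_positions(aligned)
    return Counter("".join(seq[i] for i in positions) for seq in aligned)


# Toy data: four phased sequences of the same fragment.
reads = ["ACGTAC", "ACGTAC", "ATGTAA", "ATGTAC"]
print(snp_positions(reads))
print(haplotype_counts(reads))
```

Unphased diploid sequence traces would require an additional phasing step, which is outside the scope of this sketch.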
Pyrosequencing (Ahmadian et al. 2000) may also be used for SNP genotyping
through generation of short-read sequences, although the technique slightly differs from
that used in standard Sanger–Coulson sequencing chemistry (see Chapter 4). An example
of the use of pyrosequencing was reported for genotyping of SNPs associated with grain
quality in barley (Polakova et al. 2003). Pyrosequencing is a very rapid and accurate
method, and its cost per sample is compatible with high-throughput analysis (i.e.,
5,000 samples a day), although the price of the requisite equipment platform (PSQ 96™,
Pyrosequencing AB, Uppsala, Sweden) and associated analysis software is relatively
high. The new large-scale sequencing 454 technology (Margulies et al. 2005), which
exploits a pyrosequencing system in combination with solid-phase reaction support and
picoliter-scale reaction volumes, offers a potential mechanism for dramatic increase in
the scale of such sequencing efforts, and has already proven effective for resequencing of
specific small-scale genomic regions.
F2 population and the association between the SNP markers and the contrasting
nodulation production trait was confirmed in different genetic backgrounds.
Figure 5.1. Allele-specific PCR amplification. In the example presented, a G/T SNP is targeted by two PCR
primers: one allele-specific primer is designed to anneal at its 3′ end to one of the SNP variants, but not to the
other. In addition to the SNP site mismatch, a second mismatch is included at the third or fourth nucleotide
upstream of the SNP site, so that the PCR fails for the nonspecific allele because of the low
complementarity of the primer at its 3′ end. No fluorescent labeling is required and regular PCR conditions can
be used.
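The primer design rule in Figure 5.1 can be sketched in a few lines. The following is a minimal illustration, assuming a forward primer whose 3′-terminal base is the SNP allele and a deliberate mismatch three bases upstream of it; the function name, the substitution table used to create the mismatch, and the example sequence are hypothetical, and no melting temperature or secondary-structure checks are made.

```python
# Arbitrary (hypothetical) substitution used to create the deliberate mismatch.
MISMATCH = {"A": "C", "C": "A", "G": "T", "T": "G"}


def allele_specific_primer(flank_5prime: str, allele: str,
                           length: int = 20, mismatch_offset: int = 3) -> str:
    """Build a primer ending on `allele`, with one extra mismatch upstream.

    flank_5prime    -- sense-strand sequence immediately 5' of the SNP
    mismatch_offset -- bases upstream of the SNP at which to place the mismatch
    """
    if length < mismatch_offset + 2 or len(flank_5prime) < length - 1:
        raise ValueError("flank too short or primer length too small")
    primer = list(flank_5prime[-(length - 1):] + allele)
    index = len(primer) - 1 - mismatch_offset
    primer[index] = MISMATCH[primer[index]]
    return "".join(primer)


flank = "ACGTACGTACGTACGTACGT"
print(allele_specific_primer(flank, "G", length=10))  # primer specific to allele G
print(allele_specific_primer(flank, "T", length=10))  # primer specific to allele T
```

Each primer then differs from the non-target allele at two positions near its 3′ end (the SNP base and the introduced mismatch), but from its target allele at only one.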
The derived cleaved amplified polymorphic sequence (dCAPS) method (Neff et al.
1998) represents a cost-effective system to convert SNPs into dominant PCR markers
through incorporation of a mismatch into one of the PCR amplification primers in order
to create a target restriction endonuclease site. Unlike the standard CAPS method, which
is dependent on the presence of an SNP within a restriction site, dCAPS can be developed
relatively easily for any SNP locus. The dCAPS method was successfully used to map
SNPs linked to vernalization requirement in wheat (Iwaki et al. 2002) and to develop
molecular markers linked to self-compatibility in sweet cherry (Ikeda et al. 2004).
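The difference between CAPS and dCAPS can be illustrated by a simple search. The following is a minimal sketch, assuming a palindromic recognition site (so that only one strand needs to be scanned) and a primer whose 3′ end abuts the SNP; the function names and the toy sequences are hypothetical, and fragment-size resolution on a gel is not considered.

```python
def caps_informative(amplicon_a: str, amplicon_b: str, site: str) -> bool:
    """CAPS works only if the SNP itself creates or destroys the site."""
    return (site in amplicon_a) != (site in amplicon_b)


def dcaps_candidates(flank5: str, alleles: tuple[str, str], flank3: str,
                     site: str, window: int = 3) -> list[tuple[int, str, str]]:
    """Search single primer mismatches that create `site` in one allele only.

    Returns (bases upstream of SNP, substituted base, allele that is cut).
    """
    results = []
    for offset in range(1, min(window, len(flank5)) + 1):
        pos = len(flank5) - offset
        for base in "ACGT":
            if base == flank5[pos]:
                continue
            mutated = flank5[:pos] + base + flank5[pos + 1:]
            cut = [site in (mutated + allele + flank3) for allele in alleles]
            if cut[0] != cut[1]:
                results.append((offset, base, alleles[cut.index(True)]))
    return results


# Toy T/G SNP that does not alter an EcoRI site (GAATTC) on its own.
print(caps_informative("TTGACTTCAGG", "TTGACTGCAGG", "GAATTC"))
print(dcaps_candidates("TTGACT", ("T", "G"), "CAGG", "GAATTC"))
```

In this toy case the SNP is not a CAPS marker, but a single primer mismatch two bases upstream of the SNP creates the site in the T allele only.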
The TILLING method (McCallum et al. 2000) differs slightly from CAPS or
dCAPS in that most commonly used versions use a nuclease, such as CelI, which cleaves
mismatch-containing heteroduplex DNA. Although TILLING was originally developed
for detection of mutations induced by chemical agents such as ethylmethanesulphonate
(EMS), it may also be used in “ecoTILLING” applications to genotype SNPs in natural
populations, and is hence suitable for association studies in plants (Comai et al. 2004;
Gilchrist and Haughn 2005). One example is the demonstration of high levels of diversity
in poplar, a long-lived woody perennial species which is outcrossing in nature and shows
a wide natural range of distribution (Cronk 2005). However, the routine use of
ecoTILLING in outbreeding plant species is likely to be highly exacting technically.
The Invader™ assay (Figure 5.2) is a relatively new technique designed specifically
for genotyping SNPs (Mein et al. 2000; Olivier 2005). The technique uses two
target-specific oligonucleotide probes (an invader probe and an SNP-specific probe extended by a
flap sequence) that anneal to the SNP and form a three-dimensional complex. The flap
sequences are not complementary to the SNP site, but in the presence of the complex, a
flap endonuclease (FEN) cleaves the flap, which is released and induces a fluorescent
emission. The first generation of the Invader™ assay, although being a highly accurate
method, does require the PCR amplification of the target DNA and the design of a
specific secondary probe for each of the SNP alleles. This increases the cost of the
method, which makes it unsuitable for high-throughput genotyping. The second
generation of Invader™ assay, namely the Biplex Invader™ assay (Olivier et al. 2002),
uses a serial invasive reaction, where two unlabeled allele-specific probes are designed.
Each of them, if they anneal to the target SNP allele, releases a flap sequence which is
complementary to a fluorescence resonance energy transfer (FRET) molecule, which
fuels another cleavage reaction and then emits fluorescence. This method is characterized
by a very high accuracy and a low failure rate, which makes it very attractive for plant
biologists who want to genotype a small number of SNPs over large populations, as is
characteristic of the candidate gene approach.
[Figure 5.2 (Invader™ assay): diagram not reproduced. Labels recoverable from the figure: target DNA (allele G); allele-specific probe with flap; allele-specific (G) secondary probe; flap 2; FRET probe with quencher; cleavage with fluorescence emission; no cleavage.]
genotyped over a range of potato clones spanning different genotypes of the R locus.
Interestingly, the TaqMan™ assay could distinguish between different allele dosages (as
potato has an autopolyploid genetic constitution), which is an attractive and often critical
feature for analysis of those plant species that show complex higher ploidy levels (such as
wheat or kiwifruit) or are derived paleopolyploids (such as apple and maize).
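Allele dosage calling in a polyploid can be pictured as assigning the relative signal of one allele to the nearest expected dosage class. The following is a minimal sketch, assuming that the two allele-specific fluorescence signals are proportional to allele copy number; this linearity, the function name, and the example signal values are assumptions, and real assays would need calibration against samples of known dosage.

```python
def call_dosage(signal_a: float, signal_b: float, ploidy: int = 4) -> int:
    """Return the estimated number of copies of allele A (0..ploidy)."""
    total = signal_a + signal_b
    if total <= 0:
        raise ValueError("no signal detected")
    fraction_a = signal_a / total
    return round(fraction_a * ploidy)


# Hypothetical signal pairs for an autotetraploid such as potato.
for pair in ((100, 0), (74, 26), (50, 50), (26, 74), (0, 100)):
    print(pair, "->", call_dosage(*pair), "copies of allele A")
```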
Figure 5.3. Allele-specific oligonucleotide hybridization. An oligonucleotide feature with the SNP site in its
central position is bound to a microarray glass plate. Under stringent hybridization conditions, only the
complementary allele will anneal to the fixed oligonucleotide (panel "Hybridization"), and the fluorescent signal
attached to the probe will be detected; the mismatched allele does not anneal (panel "No hybridization").
(see color plate)
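The principle in Figure 5.3 can be expressed as a perfect-match test. The following is a minimal sketch, assuming that stringent conditions allow annealing only when the target contains the exact reverse complement of the bound oligonucleotide; real hybridization depends on temperature, salt, and probe length, none of which is modelled, and the sequences and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridizes(bound_oligo: str, target: str) -> bool:
    """Stringent hybridization modelled as an exact-match requirement."""
    return reverse_complement(bound_oligo) in target


BOUND_OLIGO = "ACGTACGTGCGTACGT"                        # SNP base G near the centre
target_match = reverse_complement("ACGTACGTGCGTACGT")    # sample carrying the matching allele
target_mismatch = reverse_complement("ACGTACGTTCGTACGT")  # sample carrying the other allele

print(hybridizes(BOUND_OLIGO, target_match))
print(hybridizes(BOUND_OLIGO, target_mismatch))
```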
Figure 5.4. Oligonucleotide ligation assay (OLA). The OLA is based on the ligation of two probes hybridizing
next to the SNP site. The joining of the two probes by a DNA ligase depends on hybridization of the probes,
their juxtaposition on the target sequence, and perfect complementarity at the joining site (panel "Ligation"). If
the allele-specific probe does not match the SNP variant, ligation does not occur (panel "No ligation").
A popular method which was designed specifically for genotyping SNPs is the
minisequencing technique (Syvänen 1999; Syvänen et al. 1990), also called the primer
extension technique. The principle of this method is as follows: a detection primer is
designed to target a sequence immediately upstream of the SNP. Then, the 3′-terminus of
the oligonucleotide is extended by a DNA polymerase using labeled ddNTPs
(Figure 5.5). Therefore, one terminating fluorescent dye corresponds to each individual
base, which makes it possible to detect up to four allelic variants for a variable site and
discriminate heterozygous from homozygous genotypes. Different detection platforms
such as microarrays (Pastinen et al. 1997), capillary electrophoresis systems (Pastinen
et al. 1996), pyrosequencing (Ekstroem et al. 2000), flow cytometry (Chen et al. 2000),
mass spectrometry (Buetow et al. 2001; Haff and Smirnov 1997; Li et al. 1999; Tang
et al. 1999) or fluorescence plate readers (Chen et al. 1999; Hsu et al. 2001; Lopez-
Crapez et al. 2005) can be employed with the minisequencing method, demonstrating its
flexibility of adaptation to different analytical technologies.
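Because each terminating base carries its own dye, genotype calling from a minisequencing reaction reduces to asking which bases give a signal. The following is a minimal sketch, assuming a diploid sample, signals already assigned to bases, and a single fixed detection threshold; the function name, the threshold, and the signal values are hypothetical.

```python
def call_genotype(signals: dict[str, float], threshold: float) -> str:
    """Call a diploid genotype from per-base extension signals.

    One base above threshold  -> homozygote (e.g. "CC")
    Two bases above threshold -> heterozygote (e.g. "AG")
    Anything else             -> "NN" (no call)
    """
    called = sorted(base for base, signal in signals.items() if signal >= threshold)
    if len(called) == 1:
        return called[0] * 2
    if len(called) == 2:
        return "".join(called)
    return "NN"


samples = {
    "plant_1": {"A": 1200, "C": 30, "G": 1100, "T": 20},
    "plant_2": {"A": 5, "C": 900, "G": 12, "T": 8},
    "plant_3": {"A": 10, "C": 15, "G": 9, "T": 11},
}
for name, signals in samples.items():
    print(name, call_genotype(signals, threshold=200))
```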
As an example in plants, Törjek et al. (2003) used minisequencing to develop a set
of 112 SNP markers in A. thaliana using the SNaPshot™ assay combined with the use of
an ABI 3700 automated sequencer (Applied Biosystems, Foster City, CA, USA) and
a matrix-assisted laser desorption/ionization time of flight (MALDI-TOF) mass
spectrometer. Both platforms allowed the set of markers to be multiplexed (such that
5,376 data points were collected in this study), which suggested that the method can be
used as a medium- to high-throughput genotyping system. In crop plants, the primer
extension technique was employed for studying the association between variations in the
β-amylase gene and the fermentation properties of barley (Paris et al. 2002). The authors
used the SNuPe technique (GE Healthcare, Little Chalfont, UK) to genotype their SNPs
over a range of barley breeding lines. Similarly, the SNuPe method was employed to
genotype SNPs linked to microsatellite loci in maize inbred lines (Batley et al. 2003),
using a Megabace capillary sequencer (GE Healthcare, Little Chalfont, UK). A set of
SNPs linked to a leaf rust resistance gene in wheat (Tyrka et al. 2004) was also
genotyped by the SNuPe technique. Interestingly, Lee et al. (2004) compared the primer
extension technique with three other methods (i.e., ASPE, OLA, and direct
hybridization), using a flow cytometry instrument as a detection system (Luminex, Austin, TX,
USA). Results of the four methods were compared with SNP genotype scores obtained
with the SNaPshot kit as a positive control. Overall, minisequencing and ASPE using
flow cytometric detection methods were shown to be effective methods for the provision
of codominant markers, as both SNP alleles can be discriminated and are represented in
the same reaction tube. However, minisequencing methods do show some demerits in
terms of cost and time, as the PCR products must be treated with exonuclease I and shrimp
alkaline phosphatase (SAP) to degrade excess PCR primers and dNTPs before the primer
extension reaction.
Figure 5.5. Minisequencing or primer extension. An oligonucleotide primer immediately flanking the SNP is
extended using a DNA polymerase. Fluorescently labeled terminating nucleotides are incorporated, with a
different dye color for every nucleotide. The extended oligonucleotide can be detected on a solid-phase array,
by capillary electrophoresis, by flow cytometry, by mass spectrometry, or with a fluorescence plate reader.
(see color plate)
Table 5.1. Relative costs of different SNP-genotyping methods. Note that we have used
relative rankings in terms of cost as exact values will change with time

Technique                            Scale                         Equipment and platform cost   Reagents cost
Sequencing
  Resequencing (Sanger)              Low- to medium-throughput     High                          High
  Pyrosequencing                     Medium-throughput             High                          Medium
  454 sequencing                     High-throughput               High                          Low
DNA conformation
  SSCP                               Low- to medium-throughput     Low                           Low
  DGGE                               Low- to medium-throughput     Low                           Low
  dHPLC                              Low- to medium-throughput     Medium                        Low
Allele-specific oligonucleotides
  Microarray-based                   High-throughput               High                          High
  TaqMan                             Medium-throughput             Medium                        High
Minisequencing
  Allele-specific primer extension   Medium- to high-throughput    High                          Medium
specific to the project and the crop considered. The HapMap project (HapMap 2003)
provides a striking example of an ambitious international consortium dedicated to
genotyping a very large number of SNPs over a range of individuals derived from
multiple human groups. Plant geneticists may well regard such a project with envy, not
least because of the high level of financial investment required, and speculate on the
feasibility of performing such studies in plant genomes. Plant genetics laboratories,
whether located in academia, the public service sector or the private sector, have typically
been geared to low- to medium-throughput genetic analysis and are often
multidisciplinary in nature, with expertise ranging from classical quantitative and population
genetics to recent molecular biology disciplines, studying gene function and expression
or genome structure and organization. For this reason, the method chosen for SNP
genotyping would have to fit with other technical requirements. In the particular case of
genetic analysis, the method chosen would need to be amenable to association studies, as
well as linkage mapping or marker-assisted breeding. All these applications require
different numbers of loci to be considered and different scales of plant samples to be
characterized. Genetic trait dissection, generally based at present on linkage mapping and
QTL analysis, is characterized by relatively small numbers of closely related genotypes
(150–300) and large numbers of genetic markers (200–400). By contrast, implementation
of validated marker-trait associations in molecular plant breeding is
characterized by relatively large numbers of individuals (typically 1,000–10,000) and
small numbers of markers (5–25). Association genetic analysis, as an aspect of DNA
profiling, shows heterogeneous scale requirements from one to thousands of markers, and
from tens to thousands of individuals. The flexible scales of genotyping analysis imply a
necessity for equally flexible genotyping platforms, ideally modular in nature, to service
the different requirements.
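The scale requirements quoted above translate directly into numbers of genotype data points (individuals × markers). The following is a minimal sketch using only the ranges given in the text; the dictionary keys and the function name are hypothetical labels.

```python
# (min, max) individuals and (min, max) markers, taken from the ranges in the text.
SCENARIOS = {
    "linkage mapping and QTL analysis": ((150, 300), (200, 400)),
    "marker-assisted breeding": ((1_000, 10_000), (5, 25)),
}


def data_point_range(individuals: tuple[int, int],
                     markers: tuple[int, int]) -> tuple[int, int]:
    """Smallest and largest number of genotype data points for a scenario."""
    return individuals[0] * markers[0], individuals[1] * markers[1]


for name, (individuals, markers) in SCENARIOS.items():
    low, high = data_point_range(individuals, markers)
    print(f"{name}: {low:,} to {high:,} data points")
```

The two applications can generate comparable totals of data points while distributing them very differently between samples and loci, which is the reason a modular platform is attractive.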
The model plant species A. thaliana may provide a good model for association
studies through a whole-genome scan strategy. However, unlike the majority of crop
plant species, A. thaliana does not possess a particularly complex genome. Furthermore, its
reproductive system is not shared by the majority of plant species and A. thaliana has not
experienced a strong domestication bottleneck as has occurred for many major crop
species, implying some differences in the structure and distribution of LD compared to
that seen in other plants. On the other hand, the self-pollinating (autogamous) breeding
system of A. thaliana is shared by some crop species such as wheat, barley, rice, tomato,
sorghum, pearl millet, and others, and on this basis, LD information from the model
species may prove useful for other species. It is clear, however, that intensive association
genetics studies must be performed in each target species or species group (as described
in Chapters 9–11), and the candidate gene-based approach seems to offer the most
feasible current option for such analysis. As previously described, plant genomics tends
to follow trends established in its human counterpart, such as complete sequencing of
several plant genomes and the establishment of large EST data sets. If a highly
multiplexed, high-throughput, accurate, low-cost technique is developed for human
genetics, such a system will be rapidly assimilated into plant genetics. The GoldenGate™
and Infinium™ assays commercialized by Illumina (Illumina, San Diego, CA, USA)
present a number of highly attractive features, especially the capacity to process large
numbers of SNP loci over multiple DNA samples using a microtiter plate format, and the
capability to produce modular rearrangements of array elements to address different
scales of analysis, as described earlier. The identification of large numbers of validated
SNP loci is an issue in the generation of such systems, but the methodologies described in
Chapter 4 will provide suitable sets for all major crop species in the near future. A
prototype barley SNP-based Illumina system for assay of 1,536 gene-associated SNPs has
recently been developed in collaboration with the Scottish Crop Research Institute
(SCRI), Dundee, UK (R. Waugh, personal communication). The performance of this
prototype will provide important information on general applicability to other crop
species. If the 454 DNA sequencing technology (Margulies et al. 2005) can be
successfully adapted to large-scale genotyping applications, this may offer another
attractive route for plant geneticists who want to identify SNPs and apply LD mapping.
5.5 REFERENCES
Ahmadian, A., Gharizadeh, B., Gustafsson, A.C., Sterky, F., Nyren, P., Uhlen, M., Lundeberg, J., 2000, Single-
nucleotide polymorphism analysis by pyrosequencing. Analytical Biochemistry 280:103–110.
Alsmadi, O.A., Bornarth, C.J., Song, W., Wisniewski, M., Du, J., Brockman, J.P., Faruqi, A.F., Sun, Z., Du, Y.,
Wu, X., Egholm, M., Abarzúa, P., Lasken, R.S., Driscoll, M.D., 2003, High accuracy genotyping directly
from genomic DNA using a rolling circle amplification based assay. BMC Genomics 4:21.
Batley, J., Mogg, R., Edwards, D., O'Sullivan, H., Edwards, K.J., 2003, A high-throughput SNuPE assay for
genotyping SNPs in the flanking regions of Zea mays sequence tagged simple sequence repeats.
Molecular Breeding 11:111–120.
Baumler, S., Felsenstein, F.G., Schwarz, G., 2003, CAPS and DHPLC analysis of a single nucleotide
polymorphism in the cytochrome b gene conferring resistance to strobilurins in field isolates of Blumeria
graminis f. sp hordei. Journal of Phytopathology-Phytopathologische Zeitschrift 151:149–152.
Bertin, I., Zhu, J.H., Gale, M.D., 2005, SSCP-SNP in pearl millet – A new marker system for comparative
genetics. Theoretical and Applied Genetics 110:1467–1472.
Borevitz, J.O., Liang, D., Plouffe, D., Chang, H.S., Zhu, T., Weigel, D., Berry, C.C., Winzeler, E., Chory, J.,
2003, Large-scale identification of single-feature polymorphisms in complex genomes. Genome
Research 13:513–523.
Brown, G.R., Bassoni, D.L., Gill, G.P., Fontana, J.R., Wheeler, N.C., Megraw, R.A., Davis, M.F., Sewell,
M.M., Tuskan, G.A., Neale, D.B., 2003, Identification of quantitative trait loci influencing wood
property traits in Loblolly pine (Pinus taeda L.). III. QTL verification and candidate gene mapping.
Genetics 164:1537–1546.
Brown, G.R., Gill, G.P., Kuntz, R.J., Langley, C.H., Neale, D.B., 2004, Nucleotide diversity and linkage
disequilibrium in loblolly pine. Proceedings of the National Academy of Sciences of the United States of
America 101(42):15255–15260.
Brunner, A.M., Busov, V.B., Strauss, S.H., 2004, Poplar genome sequence: functional genomics in an
ecologically dominant plant species. Trends in Plant Science 9:49–56.
Buetow, K.H., Edmonson, M., MacDonald, R., Clifford, R., Yip, P., Kelley, J., Little, D.P., Strausberg, R.,
Koester, H., Cantor, C.R., Braun, A., 2001, High-throughput development and characterization of a
genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-
assisted laser desorption/ionization time-of-flight mass spectrometry. Proceedings of the National
Academy of Sciences of the United States of America 98:581–584.
Chagné, D., Brown, G., Lalanne, C., Madur, D., Pot, D., Neale, D., Plomion, C., 2003, Comparative genome
and QTL mapping between maritime and loblolly pines. Molecular Breeding 12:185–195.
Chen, X., Kwok, P.-Y., Levine, L., 1999, Fluorescence polarization in homogeneous nucleic acid analysis.
Genome Research 9:492–498.
Chen, J., Iannone, M.A., Li, M.-S., Taylor, J.D., Rivers, P., Nelsen, A.J., Slentz-Kesler, K.A., Roses, A.,
Weiner, M.P., 2000, A microsphere-based assay for multiplexed single nucleotide polymorphism
analysis using single base chain extension. Genome Research 10:549–557.
Comai, L., Young, K., Till, B.J., Reynolds, S.H., Greene, E.A., Codomo, C.A., Enns, L.C., Johnson, J.E.,
Burtner, C., Odden, A.R., Henikoff, S., 2004, Efficient discovery of DNA polymorphisms in natural
populations by Ecotilling. Plant Journal 37:778–786.
Cronk, Q.C.B., 2005, Plant eco-devo: the potential of poplar as a model organism. New Phytologist 166:39–48.
Dean, F.B., Hosono, S., Fang, L.X.W., Fawad Faruqi, A., Bray-Ward, P., Sun, Z., Zong, Q., Du, Y., Du, J.,
Driscoll, M., Song, W., Kingsmore, S.F., Egholm, M., Lasken, R.S., 2002, Comprehensive human
genome amplification using multiple displacement amplification. Proceedings of the National Academy
of Sciences of the United States of America 99:5261–5266.
De Jong, W.S., De Jong, D.M., Bodis, M., 2003, A fluorogenic 5′ nuclease (TaqMan) assay to assess dosage of
a marker tightly linked to red skin color in autotetraploid potato. Theoretical and Applied Genetics
107:1384–1390.
Délye, C., Matéjicek, A., Gasquez, J., 2002, PCR-based detection of resistance to acetyl-CoA carboxylase-
inhibiting herbicides in black-grass (Alopecurus myosuroides Huds) and ryegrass (Lolium rigidum Gaud).
Pest Management Science 58:474–478.
Deu, M., Ratnadass, A., Hamada, M.A., Noyer, J.L., Diabate, M., Chantereau, J., 2005, Quantitative trait loci
for head-bug resistance in sorghum. African Journal of Biotechnology 4:247–250.
Dunner, S., Miranda, M.E., Amigues, Y., Cañon, J., Georges, M., Hanset, R., Williams, J., Ménissier, F., 2003,
Haplotype diversity of the myostatin gene among beef cattle breeds. Genetics Selection Evolution
35:103–118.
Ekstroem, B., Alderborn, A., Hammerling, U., 2000, Pyrosequencing for SNPs. Proceedings of SPIE – The
International Society for Optical Engineering 3926:134–139.
Faruqi, F.A., Hosono, S., Driscoll, M.D., Dean, F.B., Alsmadi, O., Bandaru, R., Kumar, G., Grimwade, B.,
Zong, Q., Sun, Z., Du, Y., Kingsmore, S., Knott, T., Lasken, R.S., 2001, High-throughput genotyping of
single nucleotide polymorphisms with rolling circle amplification. BMC Genomics 2:4.
Gilchrist, E.J., Haughn, G.W., 2005, TILLING without a plough: a new method with applications for reverse
genetics. Current Opinion in Plant Biology 8:15.
Gill, G.P., Brown, G.R., Neale, D.B., 2003, A sequence mutation in the cinnamyl alcohol dehydrogenase gene
associated with altered lignification in loblolly pine. Plant Biotechnology Journal 1:253–258.
Gunderson, K.L., Steemers, F.J., Lee, G., Mendoza, L.G., Chee, M.S., 2005, A genome-wide scalable SNP
genotyping assay using microarray technology. Nature Genetics 37:549–554.
Haff, L.A., Smirnov, I.P., 1997, Single-nucleotide polymorphism identification assays using a thermostable
DNA polymerase and delayed extraction MALDI-TOF mass spectrometry. Genome Research 7:378–
388.
HapMap, 2003, The International HapMap Project: The International HapMap Consortium. Nature 426:789–
796.
Hardenbol, P., Yu, F., Belmont, J., MacKenzie, J., Bruckner, C., Brundage, T., Boudreau, A., Chow, S., Eberle,
J., Erbilgin, A., Falkowski, M., Fitzgerald, R., Ghose, S., Iartchouk, O., Jain, M., Karlin-Neumann, G.,
Lu, X., Miao, X., Moore, B., Moorhead, M., Namsaraev, E., Pasternak, S., Prakash, E., Tran, K., Wang,
Z., Jones, H.B., Davis, R.W., Willis, T.D., Gibbs, R.A., 2005, Highly multiplexed molecular inversion
probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Research
15:269–275.
Heid, C.A., Stevens, J., Livak, K.J., Williams, P.M., 1996, Real time quantitative PCR. Genome Research
6:986–994.
Hsia, A.-P., Wen, T.-J., Chen, H.D., Liu, Z., Yandeau-Nelson, M.D., Wei, Y., Guo, L., Schnable, P.S., 2005,
Temperature gradient capillary electrophoresis (TGCE) – a tool for the high-throughput discovery and
mapping of SNPs and IDPs. Theoretical and Applied Genetics 111:218–225.
Hsu, T.M., Chen, X., Duan, S., Miller, R.D., Kwok, P.-Y., 2001, Universal SNP genotyping assay with
fluorescence polarization detection. BioTechniques 31:560–570.
Iannone, M.A., Taylor, J.D., Chen, J., Li, M.-S., Rivers, P., Slentz-Kesler, K.A., Weiner, M.P., 2000,
Multiplexed single nucleotide polymorphism genotyping by oligonucleotide ligation and flow cytometry.
Cytometry 39:131–140.
Ikeda, K., Watari, A., Ushijima, K., Yamane, H., Hauck, N.R., Iezzoni, A.F., Tao, R., 2004, Molecular markers
for the self-compatible S4′-haplotype, a pollen-part mutant in sweet cherry (Prunus avium L.). Journal
of the American Society for Horticultural Science 129:724–728.
Iwaki, K., Nishida, J., Yanagisawa, T., Yoshida, H., Kato, K., 2002, Genetic analysis of Vrn-B1 for
vernalization requirement by using linked dCAPS markers in bread wheat (Triticum aestivum L.).
Theoretical and Applied Genetics 104:571–576.
Jaccoud, D., Peng, K., Feinstein, D., Kilian, A., 2001, Diversity arrays: a solid state technology for sequence
information independent genotyping. Nucleic Acids Research 29(4):e25.
Jander, G., Norris, S.R., Rounsley, S.D., Bush, D.F., Levin, I.M., Last, R.L., 2002, Arabidopsis map-based
cloning in the post-genome era. Plant Physiology 129:440–450.
Jander, G., Norris, S.R., Joshi, V., Fraga, M., Rugg, A., Yu, S., Li, L., Last, R.L., 2004, Application of a high-
throughput HPLC-MS/MS assay to Arabidopsis mutant screening; evidence that threonine aldolase plays
a role in seed nutritional quality. Plant Journal 39:465–475.
Jannoo, N., Grivet, L., Seguin, M., Paulet, F., Domaingue, R., Rao, P.S., Dookun, A., D'Hont, A., Glaszmann,
J.C., 1999, Molecular investigation of the genetic base of sugarcane cultivars. Theoretical and Applied
Genetics 99:171–184.
Jordan, B., Charest, A., Dowd, J.F., Blumenstiel, J.P., Yeh, R.-F., Osman, A., Housman, D.E., Landers, J.E.,
2002, Genome complexity reduction for SNP genotyping analysis. Proceedings of the National Academy
of Sciences of the United States of America 99:2942–2947.
Kahl, G., Mast, A., Tooke, N., Shen, R., van den Boom, D., 2005, Single nucleotide polymorphisms: detection
techniques and their potential for genotyping and genome mapping. In: Meksem, K., Kahl, G. (eds). The
Handbook of Plant Genome Mapping: Genetic and Physical Mapping. Wiley-VCH Verlag GmbH &
Co., KGaA, Weinheim, pp. 75–104.
Kim, M.Y., Van, K., Lestari, P., Moon, J.-K., Lee, S.-H., 2005, SNP identification and SNAP marker
development for a GmNARK gene controlling supernodulation in soybean. Theoretical and Applied
Genetics 110:1003–1010.
Kota, R., Wolf, M., Michalek, W., Graner, A., 2001, Application of denaturing high-performance liquid
chromatography for mapping of single nucleotide polymorphisms in barley (Hordeum vulgare L.).
Genome 44(4):523–528.
Kourkine, I.V., Hestekin, C.N., Buchholz, B.A., Barron, A.E., 2002, High-throughput, high-sensitivity genetic
mutation detection by tandem single-strand conformation polymorphism/heteroduplex analysis capillary
array electrophoresis. Analytical Chemistry 74:2565–2572.
Kuhn, D.N., Schnell, R.J., 2005, Use of capillary array electrophoresis single-strand conformational
polymorphism analysis to estimate genetic diversity of candidate genes in germplasm collections.
Methods in Enzymology 395:238–258.
Kwok, P.-Y., 2001, Methods for genotyping single nucleotide polymorphisms. Annual Review of Genomics
and Human Genetics 2:235–258.
Landegren, U., Kaiser, R., Sanders, J., Hood, L., 1988, A ligase-mediated gene detection technique. Science
241:1077–1080.
Lee, S.-H., Walker, D.R., Cregan, P.B., Boerma, H.R., 2004, Comparison of four flow cytometric SNP
detection assays and their use in plant improvement. Theoretical and Applied Genetics 110:167–174.
Li, J., Butler, J.M., Tan, Y., Lin, H., Royer, S., Ohler, L., Shaler, T.A., Hunter, J.M., Pollart, D.J., Monforte,
J.A., Becker, C.H., 1999, Single nucleotide polymorphism determination using primer extension and
time-of-flight mass spectrometry. Electrophoresis 20:1258–1265.
Livak, K.J., 1999, Allelic discrimination using fluorogenic probes and the 5′ nuclease assay. Genetic Analysis –
Biomolecular Engineering 14:143–149.
Livak, K.J., Flood, S.J., Marmaro, J., Giusti, W., Deetz, K., 1995, Oligonucleotides with fluorescent dyes at
opposite ends provide a quenched probe system useful for detecting PCR product and nucleic acid
hybridization. Genome Research 4:357–362.
Lizardi, P.M., Huang, X., Zhu, Z., Bray-Ward, P., Thomas, D.C., Ward, D.C., 1998, Mutation detection and
single-molecule counting using isothermal rolling-circle amplification. Nature Genetics 19:225–232.
Lopez-Crapez, E., Bazin, H., Chevalier, J., Trinquet, E., Grenier, J., Mathis, G., 2005, A separation-free assay
for the detection of mutations: combination of homogeneous time-resolved fluorescence and
minisequencing. Human Mutation 25:468–475.
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S.,
Chen, Y.-J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen,
S., Ho, C.H., Irzyk, G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.-B., Knight,
J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B.,
McDade, K.E., McKenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan,
M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A.,
Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., Rothberg, J.M.,
2005, Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380.
Matsuzaki, H., Loi, H., Dong, S., Tsai, Y.-Y., Fang, J., Law, J., Di, X., Liu, W.-M., Yang, G., Liu, G., Huang,
J., Kennedy, G.C., Ryder, T.B., Marcus, G.A., Walsh, P.S., Shriver, M.D., Puck, J.M., Jones, K.W., Mei,
R., 2004, Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density
oligonucleotide array. Genome Research 14:414–425.
McCallum, C.M., Comai, L., Greene, E.A., Henikoff, S., 2000, Targeting induced local lesions in genomes
(TILLING) for plant functional genomics. Plant Physiology 123:439–442.
Mein, C.A., Barratt, B.J., Dunn, M.G., Siegmund, T., Smith, A.N., Esposito, L., Nutland, S., Stevens, H.E.,
Wilson, A.J., Phillips, M.S., Jarvis, N., Law, S., De Arruda, M., Todd, J.A., 2000, Evaluation of single
nucleotide polymorphism typing with invader on PCR amplicons and its automation. Genome Research
10:330–343.
Mogg, R., Batley, J., Hanley, S., Edwards, D., O'Sullivan, H., Edwards, K.J., 2002, Characterization of the
flanking regions of Zea mays microsatellites reveals a large number of useful sequence polymorphisms.
Theoretical and Applied Genetics 105:532–543.
Myers, R.M., Maniatis, T., Lerman, L.S., 1987, Detection and localization of single base changes by denaturing
gradient gel electrophoresis. Methods in Enzymology 155:501–527.
Neff, M.M., Neff, J.D., Chory, J., Pepper, A.E., 1998, dCAPS, a simple technique for the genetic analysis of
single nucleotide polymorphisms: experimental applications in Arabidopsis thaliana genetics. Plant
Journal 14:387–392.
Newton, C.R., Graham, A., Heptinstall, L.E., Powell, S.J., Summers, C., Kalsheker, N., Smith, J.C., Markham,
A.F., 1989, Analysis of any point mutation in DNA. The Amplification refractory mutation system
(ARMS). Nucleic Acids Research 17:2503–2516.
Olivier, M., 2005, The Invader® assay for SNP genotyping. Mutation Research – Fundamental and Molecular
Mechanisms of Mutagenesis 573:103–110.
Olivier, M., Chuang, L.M., Chang, M.S., Chen, Y.T., Pei, D., Ranade, K., de Witte, A., Allen, J., Tran, N.,
Curb, D., Pratt, R., Neefs, H., de Arruda Indig, M., Law, S., Neri, B., Wang, L., Cox, D.R., 2002, High-
throughput genotyping of single nucleotide polymorphisms using new biplex invader technology.
Nucleic Acids Research 30:e53.
Orita, M., Suzuki, Y., Sekiya, T., Hayashi, K., 1989, Rapid and sensitive detection of point mutations and DNA
polymorphisms using the polymerase chain reaction. Genomics 5:874–879.
Paran, I., Zamir, D., 2003, Quantitative traits in plants: beyond the QTL. Trends in Genetics 19:303–306.
Paris, M., Jones, M.G.K., Eglinton, J.K., 2002, Genotyping single nucleotide polymorphisms for selection of
barley beta-amylase alleles. Plant Molecular Biology Reporter 20:149–159.
Pastinen, T., Syvänen, A.-C., Partanen, J., 1996, Multiplex, fluorescent, solid-phase minisequencing for
efficient screening of DNA sequence variation. Clinical Chemistry 42:1391–1397.
Pastinen, T., Kurg, A., Metspalu, A., Peltonen, L., Syvänen, A.-C., 1997, Minisequencing: a specific tool for
DNA analysis and diagnostics on oligonucleotide arrays. Genome Research 7:606–614.
Polakova, K., Laurie, D., Vaculova, K., Ovesna, J., 2003, Characterization of beta-amylase alleles in 79 barley
varieties with pyrosequencing. Plant Molecular Biology Reporter 21:439–447.
Pot, D., McMillan, L., Echt, C., Le Provost, G., Garnier-Géré, P., Cato, S., Plomion, C., 2005, Nucleotide
variation in genes involved in wood formation in two pine species. New Phytologist 167:101–112.
Rafalski, A., Morgante, M., 2004, Corn and humans: recombination and linkage disequilibrium in two genomes
of similar size. Trends in Genetics 20:103–111.
Robin, C., Lyman, R.F., Long, A.D., Langley, C.H., Mackay, T.F.C., 2002, hairy: a quantitative trait locus for
Drosophila sensory bristle number. Genetics 162:155–164.
Rust, S., Funke, H., Assmann, G., 1993, Mutagenically separated PCR (MS-PCR): a highly specific one step
procedure for easy mutation detection. Nucleic Acids Research 21:3623–3629.
Schwarz, G., Sift, A., Wenzel, G., Mohler, V., 2003, DHPLC scoring of a SNP between promoter sequences of
HMW glutenin x-type alleles at the Glu-D1 locus in wheat. Journal of Agricultural and Food Chemistry
51:4263–4267.
Semon, M., Nielsen, R., Jones, M.P., McCouch, S.R., 2005, The population structure of African cultivated rice
Oryza glaberrima (Steud.): evidence for elevated levels of linkage disequilibrium caused by admixture
with O. sativa and ecological adaptation. Genetics 169:1639–1647.
Shen, R., Fan, J.-B., Campbell, D., Chang, W., Chen, J., Doucet, D., Yeakley, J., Bibikova, M., Garcia, E.W.,
McBride, C., Steemers, F., Garcia, F., Kermani, B.G., Gunderson, K., Oliphant, A., 2005, High-
throughput SNP genotyping on universal bead arrays. Mutation Research – Fundamental and Molecular
Mechanisms of Mutagenesis 573:70–82.
Syvänen, A.-C., 1999, From gels to chips: ‘Minisequencing’ primer extension for analysis of point mutations
and single nucleotide polymorphisms. Human Mutation 13:1–10.
Syvänen, A.-C., 2001, Accessing genetic variation: genotyping single nucleotide polymorphisms. Nature
Reviews Genetics 2:930–942.
Syvänen, A.-C., 2005, Toward genome-wide SNP genotyping. Nature Genetics 37:S5–S10.
Syvänen, A.-C., Aalto-Setala, K., Harju, L., Kontula, K., Soderlund, H., 1990, A primer-guided nucleotide
incorporation assay in the genotyping of apolipoprotein E. Genomics 8:684–692.
Tang, K., Fu, D.-J., Julien, D., Braun, A., Cantor, C.R., Köster, H., 1999, Chip-based genotyping by mass
spectrometry. Proceedings of the National Academy of Sciences of the United States of America
96:10016–10020.
Taylor, J.D., Briley, D., Nguyen, Q., Long, K., Iannone, M.A., Li, M.-S., Ye, F., Afshari, A., Lai, E., Wagner,
M., Chen, J., Weiner, M.P., 2001, Flow cytometric platform for high-throughput single nucleotide
polymorphism analysis. BioTechniques 30:661–669.
Telenius, H., Carter, N.P., Bebb, C.E., Nordenskjold, M., Ponder, B.A.J., Tunnacliffe, A., 1992, Degenerate
oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer.
Genomics 13:718–725.
Tobe, V.O., Taylor, S.L., Nickerson, D.A., 1996, Single-well genotyping of diallelic sequence variations by a
two-color ELISA-based oligonucleotide ligation assay. Nucleic Acids Research 24:3728–3732.
Törjek, O., Berger, D., Meyer, R.C., Müssig, C., Schmid, K.J., Sörensen, T.R., Weisshaar, B., Mitchell-Olds,
T., Altmann, T., 2003, Establishment of a high-efficiency SNP-based framework marker set for
Arabidopsis. Plant Journal 36:122–140.
Tyrka, M., Blaszczyk, L., Chelkowski, J., Wisniewska, H., Lind, V., Kramer, I., Weilepp, M., Ordon, F., 2004,
Development of the single nucleotide polymorphism marker of the wheat Lr1 leaf rust resistance gene.
Cellular and Molecular Biology Letters 9:879–889.
Wenzl, P., Carling, J., Kudrna, D., Jaccoud, D., Huttner, E., Kleinhofs, A., Kilian, A., 2004, Diversity Arrays
Technology (DArT) for whole-genome profiling of barley. Proceedings of the National Academy of
Sciences of the United States of America 101:9915–9920.
Chapter 6
SNP APPLICATIONS IN PLANTS
Jacqueline Batley1 and David Edwards1
1 Primary Industries Research Victoria, Victorian AgriBiosciences Centre, La Trobe R&D Park, Bundoora,
Victoria 3083, Australia
6.1 INTRODUCTION
These techniques may be powerful for amplifying single loci within a single
reaction and for assessing genetic diversity or genetic mapping in species where limited
sequence information is available. However, the markers are anonymous. By contrast,
direct SNP assays, unlike RAPDs and RFLPs, reveal the exact nature of the allelic
variants. Interest in the genetic analysis of SNPs is growing as the ever-increasing
availability of sequence data reveals their abundance. This abundance allows the
construction of very high-density genetic maps, offering the potential to detect
associations between allelic forms of a gene and observed phenotypes. SNPs are far more
prevalent than microsatellites and may therefore provide a high density of markers at a
locus of interest. The abundance of SNPs offsets the disadvantage of bi-allelism
compared with the multi-allelic nature of microsatellites. The low mutation rate of SNPs
also makes them excellent markers for studying complex genetic traits and a useful tool for
understanding genome evolution (Syvänen 2001). SNPs may be used to interrogate
haplotype structure and can be applied in linkage disequilibrium (LD) studies. This
application is described in detail in Chapter 9.
Table 6.1. A comparison of features and applications for AFLP, RFLP, SSR, and SNP
molecular genetic markers. Y = Yes, N = No. Scores are based on the authors’ experience
with these markers
In the following paragraphs, we outline the various applications of the different
marker systems in genetic studies; a comparison of their features and utility is
presented in Table 6.1.
Genetic studies involving linkage mapping, map-based positional cloning, and QTL
mapping require data from large sets of genetic markers. The abundance of SNPs,
combined with methods for their high-throughput discovery and detection, makes them
suitable markers for these applications. SNPs identified within ESTs or large genomic
fragments maintained within bacterial artificial chromosomes (BACs) can be applied for
genetic mapping of complex traits. This enables the genetic mapping of specific genes of
interest and assists in the identification of linked or perfect markers for traits, as well as
increasing the density of markers on genetic maps (Rafalski 2002b). BAC SNP markers also
allow the integration of genetic and physical maps.
SNPs can be used to develop haplotyping systems for genes or regions of interest
(Rafalski 2002a). The information provided by SNPs is useful when several SNPs define
haplotypes in the region of interest. Only a small subset of these SNPs is then required to
define the haplotype, and therefore only that subset needs to be assayed. The use of SNPs
for identifying haplotype structure, and subsequent uses for LD studies, will be covered in
Chapter 9.
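The idea that a small subset of SNPs can define the haplotypes in a region can be sketched in a few lines of Python. The example below is illustrative only. The haplotype names and sequences are hypothetical, and the exhaustive search over SNP subsets is an assumption chosen for clarity, not a method described in this chapter. It returns the smallest set of SNP positions that still distinguishes every haplotype.

```python
from itertools import combinations

def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of SNP column indices that still
    tells every distinct haplotype apart (exhaustive search)."""
    n_snps = len(next(iter(haplotypes.values())))
    n_haps = len(set(haplotypes.values()))
    for size in range(1, n_snps + 1):
        for cols in combinations(range(n_snps), size):
            # Project each haplotype onto the candidate SNP subset
            reduced = {tuple(h[c] for c in cols) for h in haplotypes.values()}
            if len(reduced) == n_haps:
                return cols
    return tuple(range(n_snps))

# Hypothetical haplotypes scored at five SNP positions (illustrative data)
haplotypes = {
    "hap1": "ACGTA",
    "hap2": "ACGCA",
    "hap3": "GTGTA",
    "hap4": "GTGCG",
}
tags = minimal_tag_snps(haplotypes)
print(tags)  # (0, 3): two of the five SNPs separate all four haplotypes
```

Exhaustive search is only practical for a handful of SNPs per region. For larger regions a greedy or LD-based selection would be the more usual choice.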
SNPs can be applied for genetic mapping, positional cloning, QTL mapping, and
association mapping. When SNPs are applied for high-resolution genetic mapping, they can
enable the development of saturated genetic maps. This has been demonstrated on both
large and small scales, in both model species and less widely grown crop species. If a whole
genome scan is to be undertaken, trait mapping by allele association requires high marker
density, which can readily be provided by SNPs. A genome-wide set of SNP markers in
Arabidopsis thaliana has been identified for these purposes (Schmid et al. 2003).
Alternatively, a targeted approach may be undertaken for the mapping of candidate genes or
the fine mapping of specific genomic regions which may have previously been identified
through QTL mapping.
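As a minimal sketch of what testing for allele association at a single SNP can look like, the Python example below computes a Pearson chi-square statistic for a 2x2 table of allele by phenotype class. The counts are hypothetical, and the choice of an uncorrected chi-square test is an assumption made for illustration. A real association study would also need to account for population structure and multiple testing.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Hypothetical counts: rows = SNP allele (A, G),
# columns = phenotype class (resistant, susceptible)
stat = chi_square_2x2(30, 10, 10, 30)
print(stat)  # 20.0, well above 3.84, the 5% critical value for 1 d.f.
```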
The use of SNPs to genetically map genes has also been demonstrated by Ching and
Rafalski (2002). This research showed that the abundance of SNPs makes them useful for
placing ESTs or candidate genes onto a genetic map that has previously been
constructed with other markers. Previously, mapping ESTs predominantly involved
RFLPs or CAPS, both of which require the presence of restriction enzyme
polymorphisms. The use of SNPs for gene mapping has a further advantage in that this
approach can be gene-specific, whereas RFLPs frequently assay multiple loci. Zhu et al.
(2003) characterized SNPs and studied LD in soybean. A further objective of the research
was to develop a strategy for SNP discovery, for the development of a SNP-based
soybean linkage map. This would create a transcript map for soybean with candidate
genes to associate with quantitative trait loci. A high-density transcript map of barley is
being produced (Kota et al. 2001), and this will facilitate alignment of existing linkage
maps in barley and permit identification of ESTs associated with traits of interest.
Moreover, the SNPs can be used for syntenic studies with other related species. As an
example of this approach in a minor crop, five SNPs were genetically mapped in melon,
using three different genotyping assays (Morales et al. 2004). Genetic mapping of genes
and BAC end sequences using SNPs has also been performed in cassava (Lopez et al.
2005). SNPs are being applied in maize for the generation of a high-resolution genetic
map, which will act as a framework to anchor BAC contigs. These data are being managed
in a maize community database (Sanchez-Villeda et al. 2003).
One of the most often cited benefits of genetic markers for plant breeding has been
their use in marker-assisted selection (MAS), exploiting the markers as selection tools in
crop breeding programs (Koebner and Summers 2002). This allows the breeder to
achieve early selection of a trait, or a combination of traits. This is particularly useful
when the trait concerned is under complex genetic control, or when field trials are
unreliable or expensive. By increasing favorable allele frequency early in the breeding
process, a larger number of small populations can be carried forward in the breeding
process, each of which has been prescreened to remove or reduce the frequency of
unfavorable alleles.
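The effect of marker-based prescreening on allele frequency can be illustrated with a short worked example in Python. It rests on simplifying assumptions that are not taken from the text: an F2 population derived from two inbred parents, a codominant marker in complete linkage with the target gene, and removal of every seedling homozygous for the unfavorable allele.

```python
from fractions import Fraction

def favorable_allele_freq(genotype_freqs):
    """Frequency of the favorable allele A, given the frequencies
    of the genotypes AA, Aa and aa (renormalised to their total)."""
    total = sum(genotype_freqs.values())
    half = Fraction(1, 2)
    return (genotype_freqs["AA"] + half * genotype_freqs["Aa"]) / total

# Expected F2 genotype frequencies (1 AA : 2 Aa : 1 aa)
f2 = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}
before = favorable_allele_freq(f2)

# Marker prescreen: discard seedlings scored as homozygous unfavorable (aa)
selected = {g: f for g, f in f2.items() if g != "aa"}
after = favorable_allele_freq(selected)

print(before, after)  # 1/2 2/3
```

Under these assumptions a single round of seedling screening raises the favorable allele frequency from 1/2 to 2/3 before any field evaluation.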
Molecular markers are 100% heritable; using them to select for a trait of low
heritability is therefore more effective and less expensive than phenotypic selection for that
trait. Molecular markers are essential for the mapping of candidate genes, marker-assisted
breeding, and the map-based cloning of genes underlying traits. Marker-assisted breeding
has previously utilized molecular markers such as RAPDs, RFLPs, AFLPs, CAPS, and
microsatellites (SSRs). However, these marker systems are frequently labor-intensive and
time-consuming, and the associated costs constrain the ability to perform high-throughput
genotyping on breeding populations or germplasm. RFLPs are particularly unsuitable for
large-scale MAS because of the high cost of using them to screen large
numbers of individual plants. PCR-based markers are preferable to RFLPs due to their
potential for high throughput and reduced costs. PCR-based methods require only small
quantities of DNA and are therefore suitable for the screening and selection of plants at
early seedling stages. However, the application of each PCR-based marker technology
may have limitations. Many PCR-based technologies are impractical for use as MAS
tools, as they are either too complex for automation (AFLPs) or demonstrate poor
reproducibility (RAPDs). Genotyping with CAPS requires the use of a restriction
endonuclease and is therefore dependent on a polymorphism in the restriction site.
AFLPs are anonymous markers. SSRs are a useful tool for MAS; however, the markers
are often only loosely linked to the polymorphism responsible for the trait, rather than
being 100% diagnostic. Markers loosely linked to a trait may suffer from recombination
between the marker and the gene. Linked markers are also not usually transferable
between populations originating from different parents, due to lack of polymorphism.
Markers within the gene responsible for the trait are considered perfect markers. These
are highly valuable for breeding, as the possibility of recombination between the marker
and gene is essentially eliminated and they are frequently transferable between
populations. SSRs also suffer from homoplasy (alleles which are identical by size, but not by
descent), making them less suitable than SNPs for MAS studies. The abundance of SNPs
in plant genomes makes them attractive tools for MAS and map-based cloning, and both SNP
and indel markers can be applied for MAS.
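The dependence of CAPS on a polymorphism within a restriction site can be made concrete with a small Python sketch. It checks whether a SNP creates or destroys a recognition site in one of two amplicon alleles. The two amplicon sequences are hypothetical, and the three-enzyme table is only a small illustrative selection. Site presence is tested by simple string matching, which suffices here because the listed sites are palindromic.

```python
# Recognition sequences of three common restriction enzymes (palindromic sites)
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def caps_candidates(allele_a, allele_b, sites=SITES):
    """List (enzyme, allele) pairs where the recognition site is present
    in exactly one of the two amplicon alleles."""
    hits = []
    for enzyme, site in sites.items():
        in_a, in_b = site in allele_a, site in allele_b
        if in_a != in_b:
            hits.append((enzyme, "a" if in_a else "b"))
    return hits

# Hypothetical amplicon alleles differing by one SNP (C/T at index 11, 0-based)
allele_a = "TTAGGCGAATTCGGATAC"
allele_b = "TTAGGCGAATTTGGATAC"
print(caps_candidates(allele_a, allele_b))  # [('EcoRI', 'a')]
```

An empty result would mean that none of the listed enzymes can distinguish the alleles, which is the situation in which CAPS cannot be applied directly.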
SNPs are highly stable markers which may contribute directly to phenotype and
they can serve as a powerful tool for MAS. Once SNP markers are found to be associated
with a target trait, they can be applied by plant breeders for MAS to identify individual
plants containing a combination of alleles of interest from large segregating populations.
SNPs can be identified within or in close proximity to genes underlying agronomic traits.
Although such a SNP may not itself be responsible for the mutant phenotype, it may be applied
for MAS and for the positional cloning of the gene in question (Gupta et al. 2001).
Association of SNPs with genes of economic value has already been demonstrated. SNP
markers for supernodulation in soybean have been identified (Kim et al. 2005). The
identified SNP in the GmNARK gene indicates the presence of the hypernodulating
mutation. The SNP was converted to a single nucleotide amplified polymorphism
(SNAP) marker to allow direct MAS for supernodulation at an early growth stage without
the need to inoculate and phenotype roots.
ESTs have been utilized in sugarcane for the identification of SNP markers
associated with the Adh genes (Grivet et al. 2003). The Adh gene family encodes a key
enzyme, alcohol dehydrogenase, in the glycolytic pathway and is well characterized in a
number of plant species, providing an ideal model for SNP discovery and analysis. These
markers demonstrate the principle of the approach in sugarcane and can be used for genetic
mapping and QTL analysis as well as for MAS.
A high-throughput SNP genotyping system has been developed and used to select
barley lines carrying superior alleles of β-amylase, a key enzyme involved in the
degradation of starch during the malting process (Paris et al. 2002). The four allelic forms
of the enzyme were unambiguously identified by genotyping two SNPs using the SnuPE
system. A CAPS marker has also been developed enabling the transfer of the marker to
other laboratories which do not have SnuPE assay capabilities. These assays provide a
rapid and inexpensive method for screening large numbers of individual plants, allowing
the introgression of the desirable allele into breeding programs. Further work on MAS
using SNPs in barley includes identification of SNPs in the Isa gene, which has a likely
role in defense against pathogens. This gene was sequenced and screened for SNPs across
16 genotypes (Bundock and Henry 2004). This study showed there is little diversity in
cultivated barley and that SNPs could be a useful tool for the introduction of novel alleles
from wild barley. Furthermore, SNPs associated with grain germination have been
characterized across 23 varieties (Russell et al. 2004) for their suitability for
implementation in MAS.
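The way two biallelic SNPs can resolve four allelic forms, as in the β-amylase example above, can be sketched as a simple lookup in Python. The base calls and allele labels below are placeholders, since the actual SNP positions, bases, and allele names are not given in this text. The sketch also assumes homozygous lines, so that each SNP yields a single base call.

```python
from itertools import product

# Hypothetical base calls at the two SNP positions (not taken from the study)
SNP1_BASES = ("C", "T")
SNP2_BASES = ("A", "G")

# Two biallelic SNPs give 2 x 2 = 4 combinations; labels are placeholders
ALLELE_OF = {
    combo: f"allele_{i}"
    for i, combo in enumerate(product(SNP1_BASES, SNP2_BASES), start=1)
}

def call_allele(snp1, snp2):
    """Return the allele label for a homozygous line typed at both SNPs,
    or None when the combination of base calls is not recognised."""
    return ALLELE_OF.get((snp1, snp2))

print(call_allele("T", "A"))  # allele_3
```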
A SNP marker has been developed for the waxy gene controlling amylose content in
rice. Amylose is the main component controlling the cooking and nutritional properties of
cereals. Low amylose varieties are considered desirable, and in rice, it has been shown
that the high and low amylose types can be differentiated based on a SNP near the waxy
gene. This marker will be applied for MAS for the low amylose trait in seedlings (Gupta
et al. 2001). Further SNPs associated with important genes in rice include a SNP marker
for the dwarfing gene. The SNP was identified within an SSR flanking sequence and is
used for selection in a wide range of crosses. SNP-based markers for rice-blast resistance
genes have also been developed (Hayashi et al. 2004). These markers enabled the
mapping of the Piz and Piz-t genes, demonstrating that the SNPs are a valuable tool for
gene mapping, map-based cloning and MAS in rice.
In wheat, a SNP that alters the protein structure of adenine phosphoribosyl
transferase has been identified (Xing et al. 2005). This gene encodes the key enzyme
which converts adenine to adenosine monophosphate in the purine salvage pathway. In
wheat, further SNPs in genes of interest have been identified, including the Lr1 leaf rust
resistance gene (Tyrka et al. 2004). Leaf rust infections can lead to severe yield losses, and
resistant cultivars are therefore desirable. The development of the SNP marker in
the Lr1 gene has been a dramatic improvement on the STS marker previously used,
which was not specific in 50% of cultivars tested. The growing number of wheat SNP
markers available will open the possibility of introducing multiplexed assays, targeting
loci to pyramid trait selection during wheat breeding.
Work has also been performed on MAS in less developed crop species. One
hundred and thirty-two SNPs in quinoa have been identified from ESTs (Coles et al.
2005). It was found that the SNP development from ESTs was a practical method for
developing species-specific markers and may provide the molecular differentiation
required to monitor gene flow between cultivated quinoa and weedy species.
Furthermore, these will prove valuable in MAS projects aimed at improving quinoa via
exotic gene introgression. A further potential application in plants arises from a
study of nucleotide diversity at the pal1 locus of Scots pine (Dvornyk et al. 2002). This
gene is predicted to be associated with ozone tolerance, pathogen defense, and
metabolism of exogenous compounds, and SNPs within it could prove valuable for MAS
in this species.
SNPs are increasingly becoming the marker of choice for a wide range of
applications including genetic mapping, MAS, and diversity analysis. As the availability
of SNPs increases, they are displacing other forms of molecular markers for these
applications. As costs associated with SNP discovery and detection continue to fall, SNPs
will increasingly be associated with agronomic traits and will be applied for crop
improvement through parental selection and MAS. Each of the available marker systems
has its own benefits and limitations: AFLPs are anonymous markers and do not provide
sequence information, RFLPs are time-consuming and laborious, and SSRs have the benefit
that they are transferable between related organisms. However, for LD studies SNPs have
significant advantages over SSRs, due to their greater frequency and specificity in the
genome.
6.7 REFERENCES
Bundock, P.C., Henry, R.J., 2004, Single nucleotide polymorphism, haplotype diversity and recombination in
the Isa gene of barley. Theor. Appl. Genet. 109:543–551.
Chiapparino, E., Lee, D., Donini, P., 2004, Genotyping single nucleotide polymorphisms in barley by tetra-
primer ARMS-PCR. Genome 47:414–420.
Ching, A., Rafalski, A., 2002, Rapid genetic mapping of ESTs using SNP pyrosequencing and indel analysis.
Cell. Mol. Biol. Lett. 7:803–810.
Coles, N.D., Coleman, C.E., Christensen, S.A., Jellen, E.N., Stevens, M.R., Bonifacio, A., Rojas-Beltran, J.A.,
Fairbanks, D.J., Maughan, P.J., 2005, Development and use of an expressed sequenced tag library in
quinoa (Chenopodium quinoa Willd.) for the discovery of single nucleotide polymorphisms. Plant Sci.
168:439–447.
Dusabenyagasani, M., Perry, D., Lee, S.-J., Demeke, T., 2003, Genotyping malting barley varieties registered in
Canada with SNP markers. In: XI Plant and Animal Genome Meeting, San Diego, CA.
Dvornyk, V., Sirviö, A., Mikkonen, M., Savolainen, O., 2002, Low nucleotide diversity at the pal1 locus in the
widely distributed Pinus sylvestris. Mol. Biol. Evol. 19:179–188.
Grivet, L., Glaszmann, J.-C., Vincentz, M., da Silva, F., Arruda, P., 2003, ESTs as a source for sequence
polymorphism discovery in sugarcane: example of Adh genes. Theor. Appl. Genet. 106:190–197.
102 JACQUELINE BATLEY ET AL.
Gupta, P.K., Roy, J.K., Prasad, M., 2001, Single nucleotide polymorphisms: a new paradigm for molecular
marker technology and DNA polymorphism detection with emphasis on their use in plants. Curr. Sci.
80:524–535.
Hayashi, K., Hashimoto, N., Daigen, M., Ashikawa, I., 2004, Development of PCR-based SNP markers for rice
blast resistance genes at the Piz locus. Theor. Appl. Genet. 108:1212–1220.
Kim, M.Y., Van, K., Lestari, P., Moon, J.-K., Lee, S.-H., 2005, SNP identification and SNAP marker
development for a GmNARK gene controlling supernodulation in soybean. Theor. Appl. Genet.
110:1003–1010.
Kirkpatrick, R., Somers, D.J., Moniwa, M., Walsh, A., Riemer, E., 2002, Variety identification using single
nucleotide polymorphisms in hexaploid wheat. In: X Plant and Animal Genome Meeting, San Diego, CA.
Koebner, R., Summers, R., 2002, The impact of molecular markers on the wheat breeding paradigm. Cell. Mol.
Biol. Lett. 7:695–702.
Kota, R., Varshney, R.K., Thiel, T., Dehmer, K.J., Graner, A., 2001, Generation and comparison of EST
derived SSRs and SNPs in barley (Hordeum vulgare L.). Hereditas 135:145–151.
Lopez, C., Piegu, B., Cooke, R., Delseny, M., Tohme, J., Verdier, V., 2005, Using cDNA and genomic
sequences as tools to develop SNP strategies in cassava (Manihot esculenta Crantz). Theor. Appl. Genet.
110:425–431.
Morales, M., Roig, E., Monforte, A.J., Arús, P., Garcia-Mas, J., 2004, Single-nucleotide polymorphisms
detected in expressed sequence tags of melon (Cucumis melo L.). Genome 47:352–360.
Osman, A., Jordan, B., Lessard, P.A., Muhammad, N., Haron, M.R., Riffin, N.M., Sinskey, A.J., Rha, C.,
Housman, D.E., 2003, Genetic diversity of Eurycoma longifolia inferred from single nucleotide
polymorphisms. Plant Physiol. 131:1294–1301.
Paris, M., Jones, M.G.K., Eglinton, J.K., 2002, Genotyping single nucleotide polymorphisms for selection of
barley β-amylase alleles. Plant Mol. Biol. Rep. 20:149–159.
Rafalski, J.A., 2002a, Novel genetic mapping tools in plants: SNPs and LD-based approaches. Plant Sci.
162:329–333.
Rafalski, J.A., 2002b, Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin. Plant Biol.
5:94–100.
Russell, J., Booth, A., Fuller, J., Harrower, B., Hedley, P., Machray, G., Powell, W., 2004, A comparison of
sequence-based polymorphism and haplotype content in transcribed and anonymous regions of the barley
genome. Genome 47:389–398.
Sanchez-Villeda, H., Schroeder, S., Polacco, M., McMullen, M., Havermann, S., Davis, G., Vroh-bi, I., Cone,
K., Shrapova, N., Yim, Y., Scultz, L., Duru, N., Musket, T., Houchins, K., Fang, Z., Gardiner, J., Coe, E.,
2003, Development of an integrated laboratory information management system for the maize mapping
project. Bioinformatics 19:2022–2030.
SanMiguel, P., Gaut, B.S., Tikhonov, A., Nakajima, Y., Bennetzen, J.L., 1998, The paleontology of intergene
retrotransposons of maize. Nat. Genet. 20:43–45.
Schmid, K.J., Rosleff Sörensen, T., Stracke, R., Törjék, O., Altmann, T., Mitchell-Olds, T., Weisshaar, B.,
2003, Large-scale identification and analysis of genome wide single nucleotide polymorphisms for
mapping in Arabidopsis thaliana. Genome Res. 13:1250–1257.
Syvanen, A.C., 2001, Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat. Rev.
Genet. 2:930–942.
Tenaillon, M.I., Sawkins, M.C., Anderson, L.K., Stack, S.M., Doebley, J., Gaut, B.S., 2002, Patterns of
diversity and recombination along Chromosome 1 of maize (Zea mays ssp. mays L.). Genetics 162:1401–
1413.
Tyrka, M., Blaszczyk, L., Chelkowski, J., Lind, V., Kramer, I., Weilepp, M., Wisniewska, H., Ordon, F., 2004,
Development of the single nucleotide polymorphism of the wheat Lr1 leaf rust resistance gene. Cell. Mol.
Biol. Lett. 9:879–889.
Vigouroux, Y., Mitchell, S., Matsuoka, Y., Hamblin, M., Kresovich, S., Smith, S.C., Jaqueth, J., Smith, O.S.,
Doebley, J., 2005, An analysis of genetic diversity across the maize genome using microsatellites.
Genetics 169:1617–1630.
Wenzl, P., Carling, J., Kudrna, D., Jaccoud, J., Huttner, E., Kleinhofs, A., Killian, A., 2004, Diversity arrays
technology (DArT) for whole genome profiling of barley. Proc. Natl Acad. Sci. USA 101:9915–9920.
Xing, Q., Ru, Z., Li, J., Zhou, C., Jin, D., Sun, Y., Wang, B., 2005, Cloning a second form of adenine
phosphoribosyl transferase gene (TaAPT2) from wheat and analysis of its association with thermosensitive
genic male sterility (TGMS). Plant Sci. 169:37–45.
Zhu, Y.L., Song, Q.J., Hyten, D.L., van Tassell, C.P., Matukumalli, L.K., Grimm, D.R., Hyatt, S.M., Fickus,
E.W., Young, N.D., Cregan, P.B., 2003, Single nucleotide polymorphisms in soybean. Genetics 163:1123–
1134.
Chapter 7
LINKAGE DISEQUILIBRIUM MAPPING CONCEPTS
H. Nihal De Silva1 and Roderick D. Ball2
7.1 INTRODUCTION
In this section we introduce the basic statistical concepts needed for association
mapping.
If the data is good enough, statistical analysis is hardly needed. Hence the quote:
“If your experiment needs statistics you ought to have done a better experiment.”
(Rutherford)
The physicist Rutherford did not have to contend with biological variation. If your
experiment is not quite good enough for Rutherford, any simple summary statistics would
still suffice. Any effects would be large compared to their standard errors. This is not the
case for most biological experiments.
Mathematically we can view the genome as a disjoint set of lines, one for each
chromosome, with a distance measured along each line, such that the distance between any
two loci is related to the per meiosis probability of recombination in the interval between
the loci. The implications of this structure for gene mapping are:
• That linkage maps can be constructed with a set of markers and distances between
markers estimated; location in the genome can be specified in terms of position in an
interval between markers on the linkage map.
• Within a family, if there is a causal locus affecting the trait, then linked markers will
also be associated with the trait, with effect size reduced by (1 − 2r), where r is the
recombination rate between the two loci.
• Within a population, historical linkage disequilibrium between loci will be reduced
by a factor of (1 − r) per generation of random mating.
Using the genome map and samples from populations or families we can obtain
information on marker locations and on locations of QTL, i.e. locations on the genome
contributing to variation in a trait, where the genotypes are not directly observed.
1 The Horticulture & Food Research Institute Limited (HortResearch), Mt Albert Research Centre, 120
Mt Albert Road, P.B. 92169, Auckland, New Zealand.
2 Ensis (New Zealand Forest Research Institute Limited), 49 Sala Street, P. B. 3020, Rotorua, New Zealand.
104 H. N. DE SILVA ET AL.
Statistical methods need to take into account this genome structure. For example,
marker–trait associations give information on trait loci, but statistical estimates and tests
for marker–trait associations at nearby markers will be correlated, and hence there is no
independent evidence for an association.
In association mapping, the effect being tested is only a small proportion of the total
variation in a trait, and to make matters more difficult the loci affecting a given trait are
only a few among thousands or even hundreds of thousands of possible candidates drawn
from the set of all possible loci, markers or genes in the genome. Use of common statistical
methods has led to many published spurious associations (Chapter 8). Large, costly
experiments are needed. Therefore we have to be careful with the use of statistics, and make
the best possible use of available data.
There are two main schools of statistics, frequentist and Bayesian. The frequentist
considers the sampling properties of real-valued random variables or “statistics”, while the
Bayesian approach uses probability theory to obtain probability distributions for unknown
parameters. A critical comparison of Bayesian and frequentist approaches to statistical
estimation and inference for association mapping will be given throughout this chapter
and Chapter 8, and will show that there are substantial differences between methods that
can lead to spurious associations. In the rest of this section we introduce the concepts
of statistical estimation and inference, illustrated by results and interpretation in simple
examples, for linkage disequilibrium estimation and association mapping. A comprehensive
range of methods is given in more detail in Chapter 8.
Similarities and differences between QTL and LD mapping are summarised in Table
7.1. The main differences are that LD mapping detects historical linkage disequilibrium
generated in a population while QTL mapping detects linkage disequilibrium generated
within a family or pedigree.
In a population, recombinations affecting the association between a gene and a marker
may occur over many generations. This potentially gives a much finer resolution for
mapping QTL than pedigrees used for traditional QTL mapping (linkage analysis), where
recombinations occur over at most several generations. The extent of LD in the population
varies between species and populations (cf. Chapter 8, Section 8.3.1), but may be
as small as 4 kb. However, achieving the potential resolution may require many markers
to cover the genome and large sample sizes, especially where the extent of LD is small
(cf. the power calculations in Chapter 8, Section 8.3.2).
LD mapping is based on a random population sample, which is observational data. In
any observational study associations found may not be causal – associations may be due to
correlation with unmeasured causal factors or population structure.
A QTL mapping family is equivalent to a designed experiment. In a classical designed
experiment, treatments are randomly assigned to experimental units, randomising the
effects of extraneous factors, giving unbiased, and with sufficient sample sizes, accurate,
estimates of treatment effects. A QTL mapping family can be thought of as a random sample
from the set of possible progeny. In each meiosis, recombinations occur randomly,
randomising the effects of unlinked markers, and generating associations between pairs of loci
depending only on the recombination distance between the loci. Effects of unlinked markers
are randomised. The set of progeny is also in random order according to our model
for meiosis. Hence, other environmental factors including maternal environment are also
randomised with respect to progeny genotypes. Hence marker genotypes (for any marker
or set of markers) are equivalent to treatments in a designed experiment. Hence estimates
of marker–trait associations are unbiased, and with sufficient sample size, accurate.

Table 7.1. Similarities and differences between QTL mapping and LD mapping

QTL mapping
Equivalent to a designed experiment.
• Marker and trait loci probably in linkage equilibrium in the population.
• Linkage phase of marker–QTL associations may vary between families, and needs to be
  verified for each family if used for marker-aided selection (MAS). Otherwise selecting on
  marker genotypes may choose the unfavourable allele. Necessity to verify reduces the
  potential benefit of early selection for long lived species (e.g. forest trees).
Lower resolution (typically 1–50 cM, depending on marker density, sample size and QTL
heritability).
• Too many base pairs in QTL interval for brute force sequencing.
• Typically of the order of 100 markers to cover the genome.
• Prior odds of the order of 1/10 per marker.

LD mapping
Observational study.
• Marker and trait loci in linkage disequilibrium. Marker–trait association should persist
  across families.
• Reduced necessity to verify associations – verification for MAS could be limited to
  several families.
• With sufficient sample size, depending on marker spacings, allele frequencies and effect
  sizes, any QTL allele can be detected.
Potentially finer resolution (down to 4 kb depending on extent of LD in the population,
marker density, sample size and QTL heritability).
• If the extent of LD is not too high, the LD region can be sequenced to locate and clone
  genes.
• Potentially hundreds of thousands of SNP markers to cover the genome.
• Prior odds of the order of 1/50,000 per marker or candidate gene.
If existing QTL mapping populations are available, information from QTL and LD
mapping studies can be combined. Use of combined QTL and LD mapping populations is
discussed in Chapter 8, where it is shown that a combined approach can be more efficient
than LD mapping alone.
We can estimate Pr(A), Pr(T) and Pr(A, T) from the sample proportions nA/100,
nT/100 and nAT/100, respectively, and plug these values into Equation (7.1) to solve for D:
Table 7.2. Observed counts for two loci from a sample of 100

         A           G           Total
T        nAT = 10    nGT = 20    nT = 30
C        nAC = 0     nGC = 70    nC = 70
Total    nA = 10     nG = 90     n = 100
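$$\hat{D} = \hat{p}_{AT} - \hat{p}_A \hat{p}_T = 0.10 - 0.10 \times 0.30 = 0.07. \tag{7.2}$$
We refer to this as the “plug-in” estimate. From Equation (7.1), using the delta method (cf.
Weir 1996, Chapter 2), an approximate variance and standard error for $\hat{D}$ are calculated
as:
where we have used: $\mathrm{var}(\hat{p}) = \frac{1}{n} p(1 - p)$, for $p = p_A, p_T, p_{AT}$;
$\mathrm{cov}(\hat{p}_A, \hat{p}_T) = \frac{1}{n}(p_{AT} - p_A p_T)$;
$\mathrm{cov}(\hat{p}_{AT}, \hat{p}_A) = \frac{1}{n} p_{AT}(1 - p_A)$; and
$\mathrm{cov}(\hat{p}_{AT}, \hat{p}_T) = \frac{1}{n} p_{AT}(1 - p_T)$; and replaced
$p_A$, $p_T$, $p_{AT}$ by their estimates.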
The covariances in (7.3) can be derived using indicator variables (cf. Weir 1996,
Chapter 2). Let $x_i$ be indicator variables for the allele A at the first locus, in the ith
sampled individual: $x_i = 1$ if the allele is A, and 0 otherwise. Similarly let $y_i$ be the
indicator variables for the allele T at the second locus. Then
$$\hat{p}_A = \frac{1}{n}\sum_i x_i, \quad \hat{p}_T = \frac{1}{n}\sum_i y_i, \quad \text{and} \quad \hat{p}_{AT} = \frac{1}{n}\sum_i x_i y_i. \tag{7.4}$$
The covariance between $\hat{p}_A$ and $\hat{p}_T$ is calculated as:
$$\begin{aligned}
\mathrm{cov}(\hat{p}_A, \hat{p}_T) &= \frac{1}{n^2}\,\mathrm{cov}\Big(\sum_i x_i, \sum_j y_j\Big) \\
&= \frac{1}{n^2}\sum_i \mathrm{cov}(x_i, y_i) \\
&= \frac{1}{n}(p_{AT} - p_A p_T)
\end{aligned} \tag{7.5}$$
since $\mathrm{cov}(x_i, y_i) = E(x_i y_i) - E(x_i)E(y_i) = p_{AT} - p_A p_T$, and
$\mathrm{cov}(x_i, y_j) = 0$ for $i \neq j$, since alleles for different individuals are independent.
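The covariance between $\hat{p}_{AT}$ and $\hat{p}_A$ is calculated as:
$$\begin{aligned}
\mathrm{cov}(\hat{p}_{AT}, \hat{p}_A) &= \frac{1}{n^2}\,\mathrm{cov}\Big(\sum_i x_i y_i, \sum_j x_j\Big) \\
&= \frac{1}{n}\, p_{AT}(1 - p_A)
\end{aligned} \tag{7.6}$$
since $\mathrm{cov}(x_i y_i, x_i) = E(x_i y_i x_i) - E(x_i y_i)E(x_i) = p_{AT} - p_{AT} p_A$, where we
have used $x_i^2 = x_i$, and since $\mathrm{cov}(x_i y_i, x_j) = 0$ for $i \neq j$. The covariance
between $\hat{p}_{AT}$ and $\hat{p}_T$ is calculated similarly.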
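The covariance identities in Equations (7.5) and (7.6) can be checked numerically by enumerating the four possible allele combinations for a single individual. The following is only an illustrative sketch: the function names are ours, and the sample proportions from Table 7.2 are treated as if they were the true cell probabilities.

```python
# Joint probabilities for one sampled individual, keyed by (x, y):
# x = 1 if the allele at the first locus is A, y = 1 if the allele at the
# second locus is T. Values are the Table 7.2 proportions (an assumption:
# they stand in for the unknown true probabilities).
cells = {(1, 1): 0.10, (0, 1): 0.20, (1, 0): 0.00, (0, 0): 0.70}
n = 100  # sample size


def expectation(f, cells):
    """E[f(x, y)] under the joint distribution in `cells`."""
    return sum(prob * f(x, y) for (x, y), prob in cells.items())


def covariance(f, g, cells):
    """cov(f, g) = E[f g] - E[f] E[g] for a single individual."""
    e_fg = expectation(lambda x, y: f(x, y) * g(x, y), cells)
    return e_fg - expectation(f, cells) * expectation(g, cells)


p_a = expectation(lambda x, y: x, cells)       # Pr(A)
p_t = expectation(lambda x, y: y, cells)       # Pr(T)
p_at = expectation(lambda x, y: x * y, cells)  # Pr(A, T)

# Individuals are independent, so only the n same-individual terms survive;
# each covariance of sample proportions is the per-individual value over n.
cov_pa_pt = covariance(lambda x, y: x, lambda x, y: y, cells) / n
cov_pat_pa = covariance(lambda x, y: x * y, lambda x, y: x, cells) / n

print(cov_pa_pt, (p_at - p_a * p_t) / n)   # both sides of Equation (7.5)
print(cov_pat_pa, p_at * (1 - p_a) / n)    # both sides of Equation (7.6)
```

Each printed pair should agree up to floating-point rounding.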
The mean and standard error are a good summary of a distribution, provided
the distribution is symmetric and approximately normal. However, for pA = 0.1, pT = 0.3
the disequilibrium coefficient D must lie between minimum and maximum limits of −0.03
and 0.07. In this example, the standard error, 0.032, is a substantial fraction of the length
of the parameter space, and D̂ is on the boundary of the parameter space, so the sampling
distribution is likely to be skewed. A mean and standard error may not be a good summary.
Therefore, to calculate a confidence interval we use the method of maximum likelihood.
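The plug-in estimate and the limits on D quoted above can be reproduced from the counts in Table 7.2. This is a minimal sketch, assuming that D is defined in the usual way by Pr(A, T) = Pr(A) Pr(T) + D and that the limits follow from requiring all four cell probabilities to be non-negative; the function names are illustrative.

```python
def plug_in_d(n_at, n_a, n_t, n):
    """Plug-in estimate of D from sample proportions: p_AT - p_A * p_T."""
    p_a, p_t, p_at = n_a / n, n_t / n, n_at / n
    return p_at - p_a * p_t


def d_bounds(p_a, p_t):
    """Range of D for which all four cell probabilities are non-negative."""
    d_min = max(-p_a * p_t, -(1 - p_a) * (1 - p_t))
    d_max = min(p_a * (1 - p_t), (1 - p_a) * p_t)
    return d_min, d_max


# Counts from Table 7.2: n_AT = 10, n_A = 10, n_T = 30, n = 100.
d_hat = plug_in_d(n_at=10, n_a=10, n_t=30, n=100)
d_min, d_max = d_bounds(p_a=0.1, p_t=0.3)

print(round(d_hat, 4))                    # 0.07
print(round(d_min, 4), round(d_max, 4))   # -0.03 0.07
```

The estimate coincides with the upper limit because no A–C combinations were observed (nAC = 0).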
Maximum likelihood estimation of D. The likelihood function is
$$f(x \mid p_A, p_T, D) = \frac{100!}{10!\,20!\,0!\,70!}\; p_{AT}^{n_{AT}}\, p_{GT}^{n_{GT}}\, p_{AC}^{n_{AC}}\, p_{GC}^{n_{GC}}, \tag{7.7}$$
where $x = (n_{AT}, n_{GT}, n_{AC}, n_{GC}) = (10, 20, 0, 70)$ denotes the observed counts. Taking
logs and dropping the initial constant term in Equation (7.7) gives the log-likelihood:
$$L = n_{AT} \log p_{AT} + n_{GT} \log p_{GT} + n_{AC} \log p_{AC} + n_{GC} \log p_{GC}. \tag{7.8}$$
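Solving for the cell probabilities $p_{AT}, p_{GT}, p_{AC}, p_{GC}$ in terms of $p_A, p_T, D$ we obtain
$$p_{AT} = p_A p_T + D, \quad p_{GT} = (1 - p_A) p_T - D, \quad p_{AC} = p_A (1 - p_T) - D, \quad p_{GC} = (1 - p_A)(1 - p_T) + D.$$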
[Figure: log-likelihood plot (vertical axis: Log-likelihood).]
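The log-likelihood can be evaluated as a function of D once the cell probabilities are written in terms of pA, pT and D. The sketch below assumes the standard parameterisation pAT = pA pT + D, pGT = (1 − pA) pT − D, pAC = pA (1 − pT) − D, pGC = (1 − pA)(1 − pT) + D, holds pA and pT at their sample values, and searches a simple grid over the admissible range of D. The grid search and the function names are our own illustrative choices, not necessarily the procedure used by the authors.

```python
import math

counts = {"AT": 10, "GT": 20, "AC": 0, "GC": 70}  # Table 7.2
n = sum(counts.values())
p_a = (counts["AT"] + counts["AC"]) / n  # sample frequency of A
p_t = (counts["AT"] + counts["GT"]) / n  # sample frequency of T


def cell_probs(p_a, p_t, d):
    """Cell probabilities in terms of the allele frequencies and D."""
    return {
        "AT": p_a * p_t + d,
        "GT": (1 - p_a) * p_t - d,
        "AC": p_a * (1 - p_t) - d,
        "GC": (1 - p_a) * (1 - p_t) + d,
    }


def log_likelihood(d, p_a, p_t, counts):
    """Multinomial log-likelihood without the constant term."""
    total = 0.0
    for cell, prob in cell_probs(p_a, p_t, d).items():
        count = counts[cell]
        if count == 0:
            continue  # an empty cell contributes nothing (0 * log p = 0)
        if prob <= 0:
            return -math.inf  # observed counts are impossible at this D
        total += count * math.log(prob)
    return total


# Admissible range of D, then a grid search for the maximum.
d_min = max(-p_a * p_t, -(1 - p_a) * (1 - p_t))
d_max = min(p_a * (1 - p_t), (1 - p_a) * p_t)
steps = 1000
grid = [d_min + (d_max - d_min) * i / steps for i in range(steps + 1)]
d_hat = max(grid, key=lambda d: log_likelihood(d, p_a, p_t, counts))
max_loglik = log_likelihood(d_hat, p_a, p_t, counts)

print(round(d_hat, 4), round(max_loglik, 2))
```

For these counts the log-likelihood increases throughout the admissible range, so the maximum falls at the upper boundary, D = 0.07, in agreement with the plug-in estimate.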
Statistical inference refers to quantifying the evidence for an effect. Statistical
inference is important to association mapping, because we need to quantify evidence for the
existence of a genetic effect at a locus in a genomic region.
The ultimate goal is to determine and locate causal factors, i.e. genes affecting
variability in traits of interest. Gene mapping, either association mapping or QTL mapping,
looks for statistical associations between genetic markers and traits. However, association or
correlation does not imply causality. Gene mapping exploits and also must contend with the
fact that physical proximity between markers and/or genes generates a correlation between
genotypes at different loci. To succeed we must first find good evidence for an association,
and then use that information in further experiments or applications, e.g. sequencing
a region of the genome to find a gene, or using a marker in MAS to select for superior
genotypes. Since there are many possible loci, e.g. hundreds of thousands of SNP markers,
and only a moderate number of genes expected to substantially influence a given trait, any
given marker has an a priori low probability of being closest to the true gene.
The statistical problem is to identify which subsets of markers are likely to be close to
the genes and/or have good predictive value. This evidence should enable us to make
effective or optimal decisions. Hence, statistical methods for quantifying evidence for
associations and comparing different possible subsets (or the corresponding models) are
critical. The large number of published spurious associations, discussed in Chapter 8,
illustrates the need for more rigorous statistical evidence. In this subsection we review
methods of statistical inference, which are used to quantify evidence for associations.
This confusion results from the close association and similar sounding wording used around
p and α. The common mistake and temptation is to assume α is the observed p-value. The
observed p-value is the lowest, i.e. most optimistic, value to which we could set α and still
have rejected H0 . But it is an error to set α = p.
If the probability of being wrong when H0 is true is not p when a p-value of p is
observed, then what is it? In other words, given p, what is α? Sellke et al. (2001) give an
approximate answer, which they call the conditional α, which we denote by αc. A lower
bound for the conditional α (for p < 1/e) is given by
$$\alpha_c \geq \left(1 + \left[-e\, p \log p\right]^{-1}\right)^{-1}. \tag{7.11}$$
For example if p = 0.05, αc ≥ 0.289. Suddenly, p = 0.05 does not look like very good
evidence.
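The figure of 0.289 can be reproduced with a few lines of code. The sketch assumes the calibration of Sellke et al. (2001) in its usual form, αc ≥ (1 + [−e p log p]⁻¹)⁻¹ for p < 1/e; the function name is ours.

```python
import math


def conditional_alpha_lower_bound(p):
    """Lower bound on the conditional error rate for an observed p-value.

    Assumes the Sellke et al. (2001) calibration, valid for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    bayes_factor_bound = -math.e * p * math.log(p)  # bound on evidence for H0
    return 1.0 / (1.0 + 1.0 / bayes_factor_bound)


for p in (0.05, 0.01, 0.001):
    print(p, round(conditional_alpha_lower_bound(p), 3))
```

For p = 0.05 this gives 0.289, matching the value quoted in the text; smaller p-values give smaller, but still much less extreme, bounds than the p-values themselves.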
The main problem with p-values is how to use and interpret them: when is a p-value
good evidence, and how should we make a decision? Problems with the interpretation of
p-values have been pointed out by, e.g. Edwards et al. (1963), Berger and Sellke (1987),
Berger and Berry (1988), and demonstrated in a genetics context by Ball (2001, 2005).
In Chapter 8 we show that the strength of evidence implied by a given p-value depends on
sample size (Chapter 8, Table 8.1). Interpretation of statistical evidence, including p-values
and other measures, and how to make decisions are considered below.
Multiple comparisons. Many comparisons are made, e.g. for each marker in a genome
scan, in a genomics experiment. Multiple comparisons procedures control the type I error
rate for a set of tests. If n independent tests are made under the null hypothesis the
probability that one or more type I errors are made is given by the Bonferroni correction:
In association mapping, the Bonferroni correction is overly conservative for two reasons
(1) because tests are highly correlated between adjacent markers when markers are closely
spaced and (2) because there is no reason that we need to make the overall probability of
even a single type I error low when selecting putative loci from a whole genome scan. We
can afford several errors provided most of the “detected” loci are real, i.e. the proportion
of false discoveries is not too high.
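To see how quickly the chance of at least one type I error grows with the number of tests, consider the following sketch. It assumes the familiar expression 1 − (1 − α)^n for n independent tests each carried out at level α, which is bounded above by nα, and the usual Bonferroni per-test threshold α/n; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Pr(one or more type I errors) for n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n_tests = 100
uncorrected = familywise_error_rate(0.05, n_tests)
per_test = bonferroni_threshold(0.05, n_tests)
corrected = familywise_error_rate(per_test, n_tests)

print(round(uncorrected, 3))  # close to 1 without any correction
print(per_test)               # 0.0005
print(round(corrected, 4))    # just under 0.05
```

With hundreds of thousands of markers the per-test threshold becomes extremely small, which is one reason the correction is so conservative in a genome scan.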
on the false positives, but it does so at the expense of also eliminating most of the true
positives. Perhaps there is an optimum somewhere in between.
Classical frequentist inference controls “error rates” because of an inability to calculate
a probability that H1 is true. However, the FDR is a more useful quantity than the p-value or
type I error rate. Given a set of putative gene effects, end users are interested only in the
proportion of the given gene effects which are real, estimated by 1 − FDR. End users are
not interested in the number of other markers or genes that were or might have been tested
and rejected (as given by the α threshold) to obtain the significant effects.
The false discovery rate is defined as the expected proportion of false discoveries
among the rejected null hypotheses. We give a variant here, the positive false discovery
rate (pFDR, Storey 2003), given by:
$$\mathrm{pFDR} = E\left(\frac{V}{R} \,\Big|\, R > 0\right), \tag{7.13}$$
where V is the number of false positives when H0 is rejected and R is the number of
rejected null hypotheses. The positive false discovery rate avoids a technical problem with
the denominator in Equation (7.13) if the number of rejected null hypotheses is zero. Storey
(2003) shows that the positive false discovery rate has a Bayesian interpretation:
The FDR is the average probability that H0 is true for the set of “detected” or
“significant” effects from a testing procedure. The false discovery rate can be calculated in
the frequentist paradigm when there are many exchangeable tests, meaning that the effects
being tested are a priori indistinguishable. In effect, this is a Bayesian approach with prior
probabilities per gene estimated from the data.
FDR computations. If a large number, m, of multiple exchangeable effects are being
tested, the FDR is controlled at level α by the “step-up procedure” of Benjamini and
Hochberg as follows:
• Sort the p values in increasing order and let p(i) denote the ith ordered p-value.
• Optionally, plot the ordered p-values versus i (Figure 7.3).
• Find the last p-value $p_{(k)}$ which lies below the line $p_{(i)} = \frac{\alpha}{\pi_0 m} \times i$, i.e.
$$k = \max\left\{ i : p_{(i)} \leq \frac{i\alpha}{m\pi_0} \right\} \tag{7.15}$$
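The step-up procedure can be sketched in a few lines. In this illustration π0 is taken to be the proportion of true null hypotheses, with π0 = 1 giving the original Benjamini and Hochberg thresholds; the function name and the four example p-values are hypothetical.

```python
def bh_step_up(p_values, alpha, pi0=1.0):
    """Return the indices of the effects selected by the step-up procedure.

    The p-values are ranked, k is the largest rank i with
    p_(i) <= i * alpha / (m * pi0), and the k smallest p-values are selected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # indices by rank
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * alpha / (m * pi0):
            k = rank  # keep the LAST rank that falls below the line
    return sorted(order[:k])


# Hypothetical p-values for four tests.
p_values = [0.036, 0.5, 0.001, 0.035]
selected = bh_step_up(p_values, alpha=0.05)
print(selected)  # [0, 2, 3]
```

Note that the p-value 0.035 (rank 2) exceeds its own threshold of 0.025 but is still selected, because the p-value at rank 3 falls below the line. This is what distinguishes the step-up rule from stopping at the first p-value above the line.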
[Plot: p-value (p) against Rank (i), with the lines p = α, p = α/m and p = α/(π0 m) × i.]
Figure 7.3. FDR computation. Data are ranked p-values from 60 simulations of which 40 were under the null
hypothesis (plotted as “o”) and 20 under the alternative (plotted as “+”). Parameters α = 0.1 and π0 = 2/3 are
being used. Lines corresponding to comparison-wise (p = α), FWER (p = α/m) and FDR (p = α/(π0 m) × i)
thresholds are shown. The largest p-value below the line p(i) = α/(π0 m) × i is at rank 22. This and all
smaller p-values are selected by the Benjamini and Hochberg (1995) “step-up procedure,” controlling the FDR at
level α.
association mapping, where there are many closely spaced markers along the genome it
seems likely that the FDR, estimated this way, will be overly conservative. Hence the FDR
is probably more suited to microarray data than multi-locus association studies.
For the data shown in Figure 7.3, k = 22 effects are selected. The false discovery rate
was controlled at α = 0.1, so the expected number of false discoveries was 2.2. Treating the
number of false discoveries as binomial with n = 22 and p = 0.1, the actual number of
false discoveries, 3, is slightly higher than expected but consistent with it to within sampling
error. There was one false negative at rank 28.
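The claim that three false discoveries is within sampling error of the expected 2.2 can be checked with a short binomial calculation. The sketch takes the binomial model stated above (n = 22, p = 0.1) at face value; the function names are ours.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_upper_tail(k, n, p):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


n_selected, fdr_level = 22, 0.1
expected_false = n_selected * fdr_level
prob_three_or_more = binomial_upper_tail(3, n_selected, fdr_level)

print(round(expected_false, 1))      # 2.2
print(round(prob_three_or_more, 2))  # 0.38
```

Under this model, three or more false discoveries would occur in well over a third of such experiments, so the observed count is unremarkable.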
Benjamini and Hochberg’s estimation of the FDR above requires a large number of
multiple exchangeable effects. The large number of effects makes it possible to calculate
the FDR without explicitly using a prior. In the Bayesian paradigm described next, the FDR
is calculated as the average posterior probability of H0 for the “detected” effects. Bayesian
calculation of posterior probabilities, and hence the FDR, does not require a large number of
effects. Probabilities can be calculated even for a single effect.
information. Simply assuming equal probabilities of 0.5 for heterozygous and homozygous
parents is not a good strategy, as Fisher rightly pointed out. This was perhaps not so
rightly attributed as following from Laplace’s suggestion of assigning equal probabilities
to the various alternatives in the absence of prior information. The mistake is to
effectively specify an exact value for the allele frequency p below, giving equal probabilities
of heterozygous and homozygous parents, which he knew would probably be wrong, and
which is wrong in principle, because it ignores the knowledge that the parental mice come
from a population. Instead, Laplace’s principle should be applied to values of p. The
modern Bayesian approach is to say the population allele frequency for A is p, and let p have a
Beta(0.5, 0.5) prior distribution. This gives a hierarchical model
where $g_p$, $g_m$ denote the paternal and maternal genotypes and $g_{x,i}$ denotes the ith offspring
genotype. In the hierarchical model, probabilities for each parameter depend on the values
of its ancestors. For example $g_p$ depends on p. If Hardy–Weinberg equilibrium applies,
$\Pr(g_p = AA) = p^2$, $\Pr(g_p = Aa) = 2p(1 - p)$, $\Pr(g_p = aa) = (1 - p)^2$. The progeny
genotype $g_{x,i}$ depends on $g_p$ and on $g_m$. We have
$\Pr(g_{x,i} = Aa \mid g_p = Aa \text{ and } g_m = aa) = 0.5$.
Note: A hierarchical model is similar to a family tree, where the probabilities for the
genotypes of an individual depend on the genotypes of its ancestors. The Beta family of
distributions Beta(a, b) gives a range of shapes, useful as prior distributions for proportions,
with any given mean value $a/(a + b)$ and variance $ab/[(a + b)^2 (a + b + 1)]$, ranging from
uninformative Beta(1/2, 1/2) or Beta(1, 1) to highly informative distributions when a + b is
large. Chapter 8, Example 8.3 gives a Bayesian analysis for a case–control test using Beta
distributions.
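The effect of placing a Beta prior on the allele frequency, rather than fixing p at a single value, can be illustrated with the mean and variance formulas just given. The sketch below averages the Hardy–Weinberg genotype probabilities over a Beta(a, b) prior for p, using E(p²) = var(p) + [E(p)]². It is a generic illustration with names of our own choosing, not a reconstruction of the full hierarchical model for the mouse example.

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def hwe_genotype_prior(a, b):
    """Prior genotype probabilities under Hardy-Weinberg equilibrium,
    averaged over an allele frequency p ~ Beta(a, b)."""
    mean, var = beta_mean_var(a, b)
    e_p2 = var + mean**2  # E(p^2)
    return {
        "AA": e_p2,                   # E[p^2]
        "Aa": 2 * (mean - e_p2),      # E[2p(1 - p)]
        "aa": 1 - 2 * mean + e_p2,    # E[(1 - p)^2]
    }


probs = hwe_genotype_prior(0.5, 0.5)
print(probs)  # AA = 0.375, Aa = 0.25, aa = 0.375
```

Under the Beta(0.5, 0.5) prior a randomly chosen parent is heterozygous with prior probability 0.25, before conditioning on any phenotype or progeny data.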
Use and interpretation of statistical evidence. Using p-values, one strategy would be
to select loci with p ≤ α, for some α. The “detected” loci would then be further tested, or
regions around these loci genotyped, etc. However, the problem is: what is the best value
of α to use? The strength of evidence implied by a given p-value increases with decreasing
p-value, for a given experimental design and test setup. However, there is no interpretation
of the p-value as evidence independent of sample size. This can be seen from Chapter
8, Table 8.1, where correspondences between p-values and Bayes factors for association
tests are given. Fisher himself said that scientists should not make decisions (Fisher 1959,
p. 101) based on p-values or error rates.
Use and interpretation of p-values is problematic, particularly in gene mapping, where
effects may be small, sample sizes large and each effect has an a priori low chance of being
real. Equation (7.11) gives an indication of a valid error rate and hence a better indication
of the strength of evidence when a given p-value is obtained.
Interpretation of Bayesian posterior probabilities and/or Bayes factors is, in principle,
straightforward. The reader can make the decision which maximises their expected utility.
The expected utility is obtained by summing or integrating over the set of possible values
for unknown parameters, and averaging over possible models if a unique true model is not
known or unequivocally determined from the data.
The Bayes factor gives a direct measure of the strength of evidence favouring one
hypothesis or model over another. Posterior probabilities for each hypothesis, assuming
one is true, can be obtained from the Bayes factor and the reader’s own prior probabilities
for each hypothesis to be true. This is useful in gene mapping where prior probabilities
are low.
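The conversion from a Bayes factor to a posterior probability is a one-line calculation: posterior odds equal the Bayes factor multiplied by the prior odds. The sketch below takes the Bayes factor to be the ratio of the probability of the data under H1 to that under H0, and uses the prior odds of 1/50,000 per marker suggested for LD mapping in Table 7.1. The function name and the example Bayes factors are illustrative assumptions.

```python
def posterior_probability(bayes_factor, prior_odds):
    """Posterior probability of H1 from the Bayes factor (H1 versus H0)
    and the prior odds for H1."""
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


prior_odds = 1.0 / 50_000  # order of magnitude per marker, from Table 7.1

for bayes_factor in (20, 1_000, 50_000, 1_000_000):
    prob = posterior_probability(bayes_factor, prior_odds)
    print(bayes_factor, round(prob, 4))
```

With prior odds this low, a Bayes factor of 20 leaves the posterior probability below 0.001, and a Bayes factor of about 50,000 is needed before the association is as likely to be real as not.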
To find the optimal set of effects to choose for further investigation or use in
applications, we simply choose all effects where the expected benefit outweighs the increased
cost, i.e. the expected utility (or marginal profit) is positive. Evaluating the expected utility
for the ith effect requires the posterior probability that the effect is real and the posterior
distribution for the estimated effect. Let the ith effect be $\beta_i$, with utility (benefit − cost)
$U(\beta_i)$. Let $\Pr(H_{1,i} \mid y)$ and $\Pr(H_{0,i} \mid y)$ be the posterior probabilities that the ith gene
effect is real or not real, respectively, where $H_{0,i}$, $H_{1,i}$ are, respectively, the null and
alternative hypotheses for testing the ith effect, and let $g(\beta_i \mid H_{1,i}, y)$ be the posterior
distribution for $\beta_i$ assuming $H_{1,i}$ is true. Then the expected utility from using gene i is
$$\int U(\beta_i)\, g(\beta_i \mid H_{1,i}, y)\, d\beta_i \times \Pr(H_{1,i} \mid y) + C \times \Pr(H_{0,i} \mid y), \tag{7.18}$$
where C is the utility if $H_{0,i}$ is true. If sample sizes are sufficiently large, the posterior
distribution $g(\beta_i \mid H_{1,i}, y)$ in Equation (7.18) can be approximated by a normal distribution
with mean $\hat{\beta}_i$ and standard deviation $\mathrm{se}(\hat{\beta}_i)$ obtained from maximum likelihood.
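As an illustration of Equation (7.18), the sketch below evaluates the expected utility numerically under the normal approximation to the posterior for the effect. Everything specific here is a hypothetical assumption made for the example: the linear utility function, the effect estimate, its standard error, the posterior probability and the cost. The integral is approximated with a simple trapezoid rule over ±8 standard errors.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def expected_utility(utility, beta_hat, se, prob_h1, utility_h0,
                     half_width=8.0, steps=4000):
    """Expected utility of using an effect, in the form of Equation (7.18).

    The posterior for the effect given H1 is approximated by
    Normal(beta_hat, se); Pr(H0 | y) is taken as 1 - prob_h1.
    """
    lower = beta_hat - half_width * se
    h = 2.0 * half_width * se / steps
    total = 0.0
    for i in range(steps + 1):
        beta = lower + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end weights
        total += weight * utility(beta) * normal_pdf(beta, beta_hat, se)
    integral = total * h
    return integral * prob_h1 + utility_h0 * (1.0 - prob_h1)


# Hypothetical example: benefit of 100 per unit of effect, cost of 5.
cost = 5.0
result = expected_utility(
    utility=lambda beta: 100.0 * beta - cost,
    beta_hat=0.3, se=0.1, prob_h1=0.8, utility_h0=-cost,
)
print(round(result, 3))  # 19.0
```

With a linear utility the integral reduces to the utility at the posterior mean, so the answer can be checked by hand: (100 × 0.3 − 5) × 0.8 + (−5) × 0.2 = 19. The effect would be selected because its expected utility is positive.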
FDR and posterior probabilities. We have noted that the FDR is an average of posterior
probabilities for a set of selected gene effects. Posterior probabilities can be recovered as
successive differences from a sequence of FDRs as follows. First, sort the effects in order
of increasing p-values. Then estimate the false discovery rate FDRi , for effects 1, . . . , i,
and similarly estimate FDRi+1 , when the effect i + 1 is added. Since FDRi is the average
posterior probability of H0 for the first i effects and FDRi+1 is the average for the first i+1
effects, these two rates are related by:
$$\mathrm{FDR}_{i+1} = \frac{i\,\mathrm{FDR}_i + \Pr(H_{0,i+1} \mid y)}{i + 1}. \tag{7.19}$$
Then, solve for the (approximate) posterior probabilities Pr(H0,i+1 | y) in Equation (7.19)
by equating the false discovery rates to their estimates.
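Rearranging Equation (7.19) gives Pr(H0,i+1 | y) = (i + 1) FDRi+1 − i FDRi, which can be applied to a whole sequence of estimated false discovery rates. The sketch below uses made-up posterior probabilities to generate a sequence of FDRs and then recovers them; the function names and the numbers are purely illustrative.

```python
def running_fdr(null_probs):
    """FDR_i as the average posterior probability of H0 for effects 1..i."""
    fdrs, total = [], 0.0
    for i, prob in enumerate(null_probs, start=1):
        total += prob
        fdrs.append(total / i)
    return fdrs


def posterior_null_probabilities(fdrs):
    """Recover Pr(H0_i | y) as successive differences of i * FDR_i."""
    probs, previous_total = [], 0.0
    for i, fdr_i in enumerate(fdrs, start=1):
        total = i * fdr_i  # sum of the first i posterior probabilities
        probs.append(total - previous_total)
        previous_total = total
    return probs


# Hypothetical posterior probabilities of H0, sorted by increasing p-value.
true_probs = [0.01, 0.05, 0.20, 0.40]
fdrs = running_fdr(true_probs)
recovered = posterior_null_probabilities(fdrs)

print([round(f, 4) for f in fdrs])
print([round(p, 4) for p in recovered])
```

In practice the FDRs are estimates, so the recovered probabilities are only approximate and may need smoothing if successive differences turn out negative.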
number and location of QTL. Additionally, the most “significant” markers tend to be those
whose effects have been overestimated, a phenomenon known as selection bias (Miller
1990). To avoid selection bias, effects should be re-estimated in an independent population
(Miller 1990) or a Bayesian model selection method can be used (e.g. Ball 2001).
Multi-locus methods give a more direct link between statistical inference and the
genetic architecture. Stepwise regression simply chooses the model which best fits the data,
possibly with some adjustment for the number of parameters. A “model” consists of a
subset of selected markers, on which the trait is regressed. Stepwise regression is naïve in this
context because the best model is generally not unequivocally identified; on the contrary
there may be many models consistent with the data. Inferences or optimal decisions
cannot be made simply by assuming the “best” model is the true model, particularly when the
quantities of interest, e.g. the genetic architecture, are strongly related to the model. The
best approach is to use Bayesian model selection, introduced by George and McCulloch
(1993). The Bayesian model selection approach considers all possible models according to
their probabilities. Estimates are averaged over models and inference is based on the total
probability of models where a proposition is satisfied.
Since no model is selected, there is no selection bias. This approach is applied to
QTL mapping in Ball (2001), with approximate posterior probabilities for models estimated
using the BIC criterion.
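The idea of approximating posterior model probabilities from BIC values can be sketched as follows. The example assumes the common approximation in which the posterior probability of a model is proportional to its prior probability times exp(−BIC/2), with BIC defined so that smaller values indicate a better fit, and an independent prior probability per marker. The BIC values and marker names are hypothetical, only the models listed are considered, and the details in Ball (2001) may differ.

```python
import math


def model_posteriors(models, bic, n_markers, prior_per_marker):
    """Approximate posterior probabilities for a list of candidate models.

    models: list of sets of selected marker names
    bic:    list of BIC values (smaller is better), one per model
    """
    log_weights = []
    for selected, b in zip(models, bic):
        k = len(selected)
        log_prior = (k * math.log(prior_per_marker)
                     + (n_markers - k) * math.log(1.0 - prior_per_marker))
        log_weights.append(-0.5 * b + log_prior)
    largest = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - largest) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def marginal_probability(marker, models, posteriors):
    """Total probability of the models in which `marker` is selected."""
    return sum(p for m, p in zip(models, posteriors) if marker in m)


# Hypothetical candidate models and BIC values for five markers.
models = [{"M2"}, {"M3"}, {"M2", "M3"}, set()]
bic = [200.0, 201.2, 199.0, 208.0]
posteriors = model_posteriors(models, bic, n_markers=5, prior_per_marker=0.1)

print([round(p, 3) for p in posteriors])
print(round(marginal_probability("M2", models, posteriors), 3))
```

The marginal probability for a marker is obtained by summing over models rather than by picking a single best model, which is what removes the selection step.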
By considering all models according to their posterior probabilities (cf. Raftery et al.
1997), it is possible to obtain unbiased estimates of effects, and to make inferences about
the genetic architecture (Ball 2001, discussed in Sillanpää and Corander 2002; see also
Yandell et al. 2002; Bogdan et al. 2004).
The Bayesian model selection analysis from Ball (2001) was applied to a linkage
group with five markers in a QTL mapping family. Markers were in pseudo-backcross
configuration, so only a single additive effect per marker was fitted. The prior probability
per marker was 0.1, approximately equivalent to a Poisson distribution with an average rate
of 10 QTL over the whole genome. Statistics for the ten most probable models are shown
in Table 7.4. Each row of Table 7.4 corresponds to a model, except for the final row which
shows the total probabilities for markers. A “T” in the column for a marker indicates that
marker is selected in the model, e.g. model 1 has marker M2 only selected. This model
had an R² of 18.6 and a posterior probability of 50.5%. The marginal probability for M2
is the total probability for models with marker M2 selected, which is 68.5%, and for M3
the marginal probability is 39.5%. Other markers have probability less than 6%.
The null model, model 6, had posterior probability 1.1%. Thus, the posterior prob-
ability for one or more QTLs to be present is 100 − 1.1 = 98.9%. The probability
for model size 1 is obtained by summing the probabilities for models with k = 1, i.e.
50.5 + 28.0 = 78.5%. The probability for model size 2 is obtained similarly as 9.0 + 4.9 +
2.7 + 1.0 + 0.7 + 0.5 + 0.4 = 19.3%. Thus, there is a 1.1%, 78.5%, 19.3% posterior
probability that there is 0, 1 or 2 QTLs, respectively, present in the linkage group.
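The sums above can be reproduced with a short sketch. It assumes only the ten rows printed in Table 7.4, with their rounded probabilities, so the totals can differ slightly from those quoted in the text (for example 19.2% rather than 19.3% for model size 2, and 67.8% rather than 68.5% for M2, since the text's marginal totals include models outside the top ten). The function names are illustrative and not taken from Ball (2001).

```python
# Rows of Table 7.4: (set of selected markers, posterior probability in %).
# These are the rounded values printed in the table, top ten models only.
models = [
    ({"M2"}, 50.5),
    ({"M3"}, 28.0),
    ({"M2", "M3"}, 9.0),
    ({"M2", "M4"}, 4.9),
    ({"M2", "M5"}, 2.7),
    (set(), 1.1),          # the null model (no markers selected)
    ({"M1", "M3"}, 1.0),
    ({"M1", "M2"}, 0.7),
    ({"M3", "M5"}, 0.5),
    ({"M3", "M4"}, 0.4),
]


def size_probabilities(models):
    """Total posterior probability (%) for each model size k."""
    totals = {}
    for markers, prob in models:
        k = len(markers)
        totals[k] = totals.get(k, 0.0) + prob
    return {k: round(p, 1) for k, p in sorted(totals.items())}


def marginal_probability(models, marker):
    """Total posterior probability (%) of listed models containing `marker`."""
    return round(sum(prob for markers, prob in models if marker in markers), 1)


size_probs = size_probabilities(models)       # {0: 1.1, 1: 78.5, 2: 19.2}
p_any_qtl = round(100.0 - size_probs[0], 1)   # 98.9
```

Because no single model is selected, every quantity here is a sum of model probabilities over the models in which the proposition of interest holds.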
Table 7.4. Top ten models for a linkage group with five markers

                   Markers
Model    M1    M2    M3    M4    M5     k    R2    Prob   Cum.p
1        F     T     F     F     F      1   18.6   50.5    50.5
2        F     F     T     F     F      1   17.4   28.0    78.4
3        F     T     T     F     F      2   23.8    9.0    87.4
4        F     T     F     T     F      2   22.7    4.9    92.3
5        F     T     F     F     T      2   21.5    2.7    95.0
6        F     F     F     F     F      0    0.0    1.1    96.1
7        T     F     T     F     F      2   19.6    1.0    97.1
8        T     T     F     F     F      2   18.9    0.7    97.8
9        F     F     T     F     T      2   18.3    0.5    98.4
10       F     F     T     T     F      2   17.6    0.4    98.8
Total    2.1   68.5  39.5  5.9   3.7                      100.0

For case–control studies we have previously used an indirect method, where the number
of cases and controls is fixed, and the marker allele frequencies become random.
We call this an indirect method because the putative explanatory variables (here allele
frequencies) appear in the model as responses, while the response (here disease status,
case or control) appears as an explanatory variable (or treatment factor). This approach
is convenient when a single marker is being studied. For multi-locus methods, we need
to revert to the direct method of analysis where the putative explanatory variables such as
marker allele frequencies or genotypes appear as explanatory variables in the model and
the response, disease status (case or control), appears as the response in our model. This
is a binary response and is modelled as a generalised linear model (McCullagh and Nelder
1989). A generalised linear model has two parts. The first part is a linear model related to
expectations of the observed data by a non-linear link function g(·):
where
log(p/(1 − p)) = Xβ or p = exp(Xβ)/(1 + exp(Xβ)) . (7.22)
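As a minimal sketch of the logistic link in Equation (7.22), the following maps a linear predictor Xβ to a probability of being a case and back. The coefficient values and the 0/1/2 marker coding are hypothetical, chosen only to illustrate the two equivalent forms of the equation.

```python
import math


def inverse_logit(eta):
    """p = exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """log(p / (1 - p)), the link function in Equation (7.22)."""
    return math.log(p / (1.0 - p))


def case_probability(x_row, beta):
    """Probability of case status for one row of the model matrix X."""
    eta = sum(x * b for x, b in zip(x_row, beta))  # linear predictor
    return inverse_logit(eta)


# Hypothetical coefficients: intercept -1.0, marker effect 0.5 per allele copy.
beta = [-1.0, 0.5]
p = case_probability([1, 2], beta)  # eta = -1.0 + 2 * 0.5 = 0, so p = 0.5
```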
In principle, the Bayesian model selection approach described above can be applied
where X is taken to be the model matrix based on a set of marker loci. However, caution
is advised with binary generalised linear models because the usual model comparison sta-
tistic, the deviance, is approximately distributed as a χ2 for binomial data, only when np
and n(1 − p) are greater than 5. This is not the case for binary data, which is binomial
with n = 1. For the same reason, the BIC criterion may not provide a good approximation
to posterior probabilities for models, and Markov chain Monte Carlo (MCMC) sampling
methods may be required. See Chapter 8, Example 8.7 for an application of MCMC.
7.4.6 Summary
• Single marker hypothesis tests have been used in genome scans because they were a
known and easily computed method.
– A p-value is not a valid error rate. The observed p-value is not the probability
of being wrong if H0 is true.
– Multiple positive tests are likely for linked markers in the neighbourhood of a
causal locus, complicating the interpretation of error rates.
– The FDR may be overly conservative when applied to many closely linked
markers in association studies.
• Using the Bayesian approach avoids these problems, because we can calculate
Bayesian posterior probabilities for propositions of interest, e.g. the probability that
an effect is real, or that a QTL is in a region.
• The posterior probability for H1 gives us a probability that the effect being tested
is real.
• The statistical problem in association mapping is to select loci associated with vari-
ation in a trait. This is a model selection problem, not a hypothesis testing problem.
• In general, experiments should be designed with good power to detect effects with a
reasonable Bayes factor, e.g. at least B = 20.
• In genomics, we are trying to select a small number of effects from the whole
genome. The prior probabilities will be low. Bayes factors as high as 1,000,000
may be required.
• Required sample sizes will be substantially larger than for experiments designed to
detect effects with α = 0.05.
• A case–control test.
For many diseases, the prevalence of cases in the population is much lower than the
prevalence of controls. A random sample from the population would typically have few
cases, and therefore low power to detect effects on disease incidence. The case–control
study design remedies this by selecting separate samples of affected individuals (cases)
and unaffected individuals (controls). The sample sizes for cases and controls are generally
approximately equal.
Consider a case–control study with nD cases and nH healthy controls. The data for
genotypic and allelic case–control tests is summarised in contingency tables (Tables 7.5
and 7.6).
The Pearson χ2 -test or Fisher’s exact test can be used to test for an association be-
tween disease status and marker genotype or allele frequencies. The χ2 approximation to
the distribution of the test statistic is valid only if the expected cell counts are five or more in
at least 80% of cells in the table. Also, the Yates continuity correction is recommended
when sample sizes are small (Sokal and Rohlf 1969). Fisher’s exact test does not depend
on any asymptotic approximation, so should be used if cell counts are low. Fisher’s exact
test conditions on the marginal totals, i.e. considers the distribution of the test statistics
only for tables with the given marginal totals, while the χ2 -test considers all possible ta-
bles with the given total under the null hypothesis, H0 , of no association. According to
Fisher (1959), the Pearson χ2 is not appropriate because the marginal totals are known,
and therefore the appropriate reference set is only the set of tables with the given marginal
totals. Other tables are irrelevant. Both approaches have the drawback that they produce a
p-value whose distribution is uniform under H0 but the distribution under H1 is not con-
sidered. Low p-values may be unlikely under H0 but may be equally unlikely under H1 ,
and hence are not necessarily good evidence for H1 . This problem is addressed by using
Bayes factors in Chapter 8.
Recall that the χ2 test statistic for a contingency table is
\[
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, \tag{7.23}
\]
where Oi , Ei are the observed and expected cell counts for the ith cell, respectively.
For 2 × 2 tables such as Table 7.6, the χ2 test statistic is conveniently calculated as:
\[
\chi^2 = \frac{(a + b + c + d)(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)} \sim \chi^2_1 \quad \text{under } H_0 . \tag{7.24}
\]
Table 7.7. Contingency table for genotypic case–control test for association between the sickle cell locus and
malaria (Example 7.2)
Table 7.8. Contingency table for allelic case–control test for association between the sickle cell locus and malaria
(Example 7.2)
            Marker allele
Status      A       S
Case        623     7
Control     1065    101
The p-value is Pr(χ²₁ ≥ 41.3) = 1.3 × 10⁻¹⁰, which is very low, suggesting strong
evidence for an effect. This is confirmed by the Bayes factor of 1.0 × 10¹⁰ calculated in
Chapter 8.
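The test statistic and p-value quoted above can be checked from the allele counts in Table 7.8 using Equation (7.24). The sketch below assumes no continuity correction and obtains the upper tail of the χ² distribution with one degree of freedom from the complementary error function, using the fact that a χ²₁ variable is the square of a standard normal variable. The function names are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]
    (Equation 7.24, without Yates continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def chi2_1df_upper_tail(x):
    """Pr(chi^2_1 >= x) = Pr(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))


# Allele counts from Table 7.8: cases (A, S) and controls (A, S).
chi2 = chi_square_2x2(623, 7, 1065, 101)   # approximately 41.3
p_value = chi2_1df_upper_tail(chi2)        # approximately 1.3e-10
```

With counts this large the χ² approximation is adequate. For tables with small expected counts, Fisher's exact test would be the safer choice, as noted earlier in this section.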
This and other examples discussed (e.g. the Alzheimer’s – APOE association in Chap-
ter 8) are examples where the association was already known. We should bear in mind that
it is more difficult to find associations from a whole genome scan than by testing a single
locus where the marker locus associated with the trait is already known. The statistical
evidence has to overcome the low prior probability for any given marker locus to be within
a small genomic interval of the given trait locus. Additionally, the statistical evidence must
discriminate between the causal locus and nearby loci which may also be in linkage dise-
quilibrium with the trait.
Table 7.9. Contingency table for transmission of alleles in a TDT test based on parent–offspring trios
requires both parental and offspring genotypes. The S-TDT (Spielman and Ewens 1998)
requires only genotypes from each discordant sib pair, so can be used where parental DNA
is not available. The SDT test (Horvath and Laird 1998) can be used when data from more
than one affected and unaffected sib are available.
The TDT test is based on the fact that each parental allele is transmitted randomly, with
50% probability, to each offspring. If an allele is associated with a disease phenotype, then
there will be a higher or lower proportion of the marker amongst the cases.
The TDT test statistic for a bi-allelic marker is given by
\[
T = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} \sim \chi^2_1 \quad \text{under } H_0 , \tag{7.26}
\]
where nij are as in Table 7.9. For a multi-allelic marker with m alleles, the TDT test
statistic is given by
\[
T = \frac{m - 1}{m} \sum_{i=1}^{m} \frac{(n_{i\cdot} - n_{\cdot i})^2}{n_{i\cdot} + n_{\cdot i} - 2 n_{ii}} \sim \chi^2_{(m-1)} , \tag{7.27}
\]
where nij is the number of times allele i is transmitted and allele j is not transmitted,
ni· = Σj nij and n·i = Σj nji (Weir 2001).
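Equations (7.26) and (7.27) can be sketched as two small functions. The counts used in the example are hypothetical, and skipping alleles whose denominator is zero is an assumption of this sketch rather than something specified in the text. For m = 2 the multi-allelic statistic reduces to the bi-allelic one, which gives a simple consistency check.

```python
def tdt_biallelic(n12, n21):
    """Bi-allelic TDT statistic (Equation 7.26)."""
    return (n12 - n21) ** 2 / (n12 + n21)


def tdt_multiallelic(n):
    """Multi-allelic TDT statistic (Equation 7.27).

    n[i][j] is the number of times allele i is transmitted and allele j
    is not transmitted. Alleles with a zero denominator are skipped
    (an assumption made for this sketch).
    """
    m = len(n)
    total = 0.0
    for i in range(m):
        row = sum(n[i])                        # n_i. : allele i transmitted
        col = sum(n[j][i] for j in range(m))   # n_.i : allele i not transmitted
        denom = row + col - 2 * n[i][i]
        if denom > 0:
            total += (row - col) ** 2 / denom
    return (m - 1) / m * total


# Hypothetical counts: n12 = 30, n21 = 15; diagonal entries are uninformative.
counts = [[10, 30], [15, 12]]
t_bi = tdt_biallelic(30, 15)          # (30 - 15)^2 / 45 = 5.0
t_multi = tdt_multiallelic(counts)    # also 5.0 when m = 2
```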
Note: The Spielman et al. TDT test was based on an earlier idea (Terwilliger and
Ott 1992) of presenting data related to transmission and non-transmission of alleles, as in
Table 7.9.
Allele transmission status is known if one parent is heterozygous and the other ho-
mozygous (e.g. Aa × aa) or both parents are heterozygous with different alleles.
The TDT test simultaneously tests for linkage and linkage disequilibrium:
• If there is no linkage disequilibrium, that means that the marker and trait loci are
independent in the population. Hence any parental trait QTL allele will be indepen-
dent of the parental marker alleles. Hence there will be no association between QTL
allele and the transmitted marker allele.
• If there is linkage disequilibrium between marker and QTL alleles, but no linked QTL
this means that there is an association between parental marker and QTL alleles, but
because there is no linkage (r = 0.5), the QTL allele transmitted will be independent
of the marker allele transmitted. Again, there will be no association between the
transmitted QTL allele and the transmitted marker allele.
• If there is linkage disequilibrium and linkage, linked QTL effects with r < 0.5 will
be reduced by a factor (1 − 2r).
Table 7.10. Parental and offspring genotypes for the TDT test (Example 7.3)
Table 7.11. Transmitted and non-transmitted alleles for TDT test (Example 7.3)
We consider two forms of sib-based TDT tests, the S-TDT test (Spielman and Ewens
1998) and the SDT test (Horvath and Laird 1998). The S-TDT test compares marker allele
frequencies between affected and unaffected sibs from a large number of nuclear families.
The S-TDT gives a test for both linkage and linkage disequilibrium if only one affected
and one unaffected sib per family are used. If data from more than one affected and/or
unaffected sib is used the S-TDT gives a test for linkage (Monks et al. 1998; Weir 2001).
The SDT test bases inference on a summary statistic calculated for each family, based on
whether or not the average allele frequency for affected sibs differs from the family mean.
The SDT tests for both linkage and linkage disequilibrium provided there is at least one
affected and unaffected sib per family.
The S-TDT test. Let yi denote the number of M1 alleles in affected sibs, ai and ui the numbers
of affected and unaffected sibs, ti = ai + ui the total number of sibs, and ri and si the numbers of
M1M1 and M1M2 genotypes in the ith nuclear family, respectively. The test statistic is
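\[
T = \frac{Y - A}{\sqrt{V}} \sim N(0, 1) \quad \text{asymptotically, under } H_0 , \tag{7.29}
\]
where
\[
Y = \sum_{i=1}^{n} y_i , \tag{7.30}
\]
\[
A = \sum_{i=1}^{n} (2 r_i + s_i)\, a_i / t_i , \tag{7.31}
\]
\[
V = \sum_{i=1}^{n} \frac{a_i u_i \left[ 4 r_i (t_i - r_i - s_i) + s_i (t_i - s_i) \right]}{t_i^2 (t_i - 1)} . \tag{7.32}
\]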
Y is the total number of M1 alleles in affected sibs, A is an estimate of the total expected
value of Y under H0 (assuming the allele frequency for affected sibs is the same as for all
sibs) and V is an estimate of the variance of Y − A.
The TDT and S-TDT tests can be combined, with the test statistic, Z, given by:
where n12 , n21 are as in Table 7.9, for the parent–offspring trios, and Y, A, V are calculated
as above for the discordant sib-pair data.
Figure 7.4. Families for the S-TDT test (Example 7.4). Affected offspring are shown in black.
Family t a u r s y
1 6 2 4 2 1 4
2 3 2 1 1 1 3
3 4 2 2 0 3 1
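As an illustration, the S-TDT quantities in Equations (7.29) to (7.32) can be computed directly from the three family rows tabulated above. This is a sketch that takes the columns as t, a, u, r, s, y in the order shown; the function name is illustrative. With only three families the normal approximation is not reliable, so the resulting statistic should be read as a worked calculation rather than a usable test.

```python
import math

# One tuple per family, in the column order of the table: (t, a, u, r, s, y).
families = [
    (6, 2, 4, 2, 1, 4),
    (3, 2, 1, 1, 1, 3),
    (4, 2, 2, 0, 3, 1),
]


def s_tdt(families):
    """Return (Y, A, V, T) for the S-TDT, Equations (7.29)-(7.32)."""
    Y = A = V = 0.0
    for t, a, u, r, s, y in families:
        Y += y                                    # M1 alleles in affected sibs
        A += (2 * r + s) * a / t                  # expected value under H0
        V += a * u * (4 * r * (t - r - s) + s * (t - s)) / (t ** 2 * (t - 1))
    return Y, A, V, (Y - A) / math.sqrt(V)


Y, A, V, T = s_tdt(families)
# Y = 8, A is about 5.17, V is about 2.21, T is about 1.91
```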
7.5.4 Summary
• A range of experimental design types is available, including a random population
sample, a case–control study, a TDT test and a pedigree with mixed model analysis.
• Any of these design types may be combined with a QTL mapping study, giving
reduced genotyping.
• There is scope to develop new experimental designs for association mapping in
plants, using clonal replication of genotypes in field trials, with potential to simulta-
neously increase the effective heritability, control environmental variation and reduce
residual errors.
7.6 SUMMARY
• Single marker hypothesis tests have been used in genome scans because they were
a known and easily computed method. But results of single marker tests are not
directly related to the genetic architecture.
• FDR is better, in principle, than p-values but requires data from many exchangeable
tests, and may be overly conservative when applied to many closely linked markers
in association studies.
• Bayesian measures of evidence are readily interpretable, independent of the data or
experimental designs used.
• Using Bayesian methods it is possible to calculate posterior probabilities for a marker
trait association to be real, or for a causal effect to lie within a given region.
• The statistical problem in association mapping is to select loci associated with vari-
ation in a trait. This is a model selection problem, not a hypothesis testing problem.
• Multi-locus methods can be used to infer the genetic architecture, by giving proba-
bility distributions for number and locations of QTL.
• The Bayes factor gives the strength of evidence in the data.
• Experiments should be designed with good power to detect effects with a reasonable
Bayes factor, e.g. at least B = 20. In genomics we are trying to select a small
number of effects from the whole genome. Prior probabilities will be low. Higher
Bayes factors will be required.
• A range of experimental design types is available, including a random population
sample, a case–control study, a TDT test and a pedigree with mixed model analysis.
• Any of these design types may be combined with a QTL mapping study, giving
reduced genotyping.
• There is scope to develop new experimental designs for association mapping in
plants, using clonal replication of genotypes in field trials, with potential to simulta-
neously increase the effective heritability, control environmental variation and reduce
residual errors.
7.7 REFERENCES
Ackerman, H., Usen, S., Jallow, M., Sisay-Joof, F., Pinder, M., Kwiatkowski, D.P. 2005, A
comparison of case–control and family-based association methods: the example of sickle
cell and malaria. Ann. Human Genet. 69:559–565.
Ball, R.D. 2001, Bayesian methods for quantitative trait loci mapping based on model selection:
approximate analysis using the Bayesian Information Criterion. Genetics 159:1351–1364.
Ball, R.D. 2005, Experimental designs for reliable detection of linkage disequilibrium in unstructured
random population association studies. Genetics 170:859–873.
http://www.genetics.org/cgi/content/abstract/170/2/859
Benjamini, Y., Hochberg, Y. 1995, Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J. R. Stat. Soc. B 57:289–300.
Benjamini, Y., Hochberg, Y. 2000, On the adaptive control of the false discovery rate in multiple
testing with independent statistics. J. Educ. Behav. Stat. 25:60–83.
Benjamini, Y., Yekutieli, D. 2001, The control of the false discovery rate in multiple testing under
dependency. Ann. Stat. 29:1165–1187.
Berger, J., Berry, D. 1988, Statistical analysis and the illusion of objectivity. Am. Scientist
76:159–165.
Berger, J.O., Sellke, T. 1987, Testing a point null hypothesis: The irreconcilability of P values and
evidence (with discussion). J. Am. Stat. Assoc. 82:112–139.
Boehnke, N., Langefeld, C.D. 1998, Genetic association mapping based on discordant sib pairs: the
discordant-alleles test. Am. J. Hum. Genet. 62:950–961.
Bogdan, M., Ghosh, J.K., Doerge, R.W. 2004, Modifying the Schwarz Bayesian information criterion
to locate multiple interacting quantitative trait loci. Genetics 167:989–999.
Curtis, D. 1997, Use of siblings as controls in case–control association studies. Ann. Hum. Genet.
61:319–333.
Edwards, W., Lindman, H., Savage, L.J. 1963, Bayesian statistical inference for psychological
research. Psychol. Rev. 70:193–242.
Fisher, R.A. 1959, Statistical methods and scientific inference, 2nd ed., T. and A. Constable,
Edinburgh.
George, E.I., McCulloch, R.E. 1993, Variable selection via Gibbs sampling. J. Am. Stat. Assoc.
88:881–889.
Horvath, S., Laird, N.M. 1998, A discordant-sibship test for disequilibrium and linkage: no need for
parental data. Am. J. Hum. Genet. 63:1886–1897.
Lander, E.S., Botstein, D. 1989, Mapping Mendelian factors underlying quantitative traits using
RFLP linkage maps. Genetics 121:185–199.
McCullagh, P., Nelder, J.A. 1989, Generalized linear models, 2nd ed., Chapman & Hall/CRC,
London.
Miller, A.J. 1990, Subset selection in regression, Monographs on Statistics and Applied Probability
40, Chapman & Hall, London.
Monks, A.A., Kaplan, N.L., Weir, B.S. 1998, A comparative study of sibship tests of linkage and/or
association. Am. J. Hum. Genet. 63:1507–1516.
Pritchard, J.K., Stephens, M., Donnelly, P. 2000a, Inference of population structure using multilocus
genotype data. Genetics 155:945–959.
Pritchard, J.K., Stephens, M., Rosenberg, N.A., Donnelly, P. 2000b, Association mapping in struc-
tured populations. Am. J. Hum. Genet. 67:170–181.
Raftery A.E., Madigan D., Hoeting J.A. 1997, Bayesian model averaging for linear regression mod-
els. J. Am. Stat. Assoc. 92:179–191.
Sellke, T., Bayarri, M.J., Berger, J.O. 2001, Calibration of p values for testing precise null hypotheses.
Am. Statistician 55:62–71.
Sillanpää, M.J., Corander, J. 2002, Model choice in gene mapping: what and why. Trends Genet.
18:301–307.
Sokal, R.R., Rohlf, F.J. 1969, Biometry: The principles and practice of statistics in biological
research. W. H. Freeman, San Francisco, p. 776
Spielman, R.S., Ewens, W.J. 1998, A sibship test for linkage in the presence of association: the sib
transmission/disequilibrium test. Am. J. Hum. Genet. 62:450–458.
Spielman, R.S., McGinnis, R.E., Ewens, W.J. 1993, Transmission test for linkage disequilibrium:
the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet.
52:506–516.
Storey, J.D. 2003, The positive false discovery rate: A Bayesian interpretation and the q-value. Ann.
Statist. 31:2013–2035.
Stram, D.O., Lee, J.W. 1994, Variance components testing in the longitudinal mixed effects model.
Biometrics 50:1171–1177.
Terwilliger, J.D., Ott, J. 1992, A haplotype based haplotype relative risk approach to detecting allelic
associations. Hum. Hered. 42:337–346.
Yandell B.S., Jin C., Satagopan, J.M., Gaffney, P.J. 2002, In: Discussion of: Model selection
approach for the identification of quantitative trait loci in experimental crosses, by Broman and
Speed. J. Roy. Stat. Soc. B. 64:731–775.
Weir, B.S. 2001, Population Genetic Data Analysis 2001, Southern Summer Institute of Statistical
Genetics, North Carolina State University.
Chapter 8
STATISTICAL ANALYSIS AND
EXPERIMENTAL DESIGN
Roderick D. Ball1
8.1 INTRODUCTION
The goal of association mapping is to locate genes and/or predict genetic effects, to
allow selection of favourable genotypes.
Association mapping, also known as ‘linkage disequilibrium mapping’ or ‘LD mapping’,
aims to detect and locate genes relative to a map of existing genetic markers. Location
information is obtained because the distance between the gene and a marker on a chromo-
some is one factor influencing the strength of association between the gene and marker.
In a population, recombinations affecting the association between a gene and marker may
occur over many generations. This potentially gives a much finer resolution for mapping
QTL than pedigrees used for linkage analysis.
Early attempts to find associations for complex diseases or quantitative traits have led
to many published associations which are likely to be spurious (Terwilliger and Weiss 1998;
Altshuler et al. 2000; Emahazion et al. 2001; Neale and Savolainen 2004).
Altshuler et al. (2000) (discussed in Gura 2000) retested 13 published associations of
SNPs with type II diabetes in an independent population. Only one was significant. These
results are summed up by Altshuler (Hampton 2000):
The lack of replication of the others points to the need for larger samples,
controls for population differences, and stronger statistical evidence prior to
claiming an association. (emphasis added)
Terwilliger and Weiss (1998, Figure 4) show the distribution of around 260 reported
p-values from association studies in two journals, and note that there is no evidence of
departure from the uniform distribution (i.e. no evidence of any real effect):
Neale and Savolainen (2004) note that candidate gene associations have been criticised
as being unreliable with insufficient sample size cited as a contributing factor.
1Ensis (New Zealand Forest Research Institute Limited), 49 Sala Street, P. B. 3020, Rotorua, New Zealand
Emahazion et al. (2001) retested published associations for a number of markers with
13 genes putatively associated with Alzheimer’s disease, found from case–control studies,
noting
. . . limited ability of typical association studies based on candidate genes to
discern the true medium sized signals from false positives. . .
Except for the APOE ε4 allele (with p ≈ 0.003%), which was used as a ‘positive control’
in their study, only 2.8% were ‘verified’ with p < 0.05 and only one had p < 0.01, which
rose to p = 0.33 after allowing for 60 comparisons.
False positives, publication bias, population structure, heterogeneity (i.e. variability
resulting from other genetic and environmental risk factors) and conservative multiple cor-
rections procedures were cited as causes of problems, with the first three factors contribut-
ing to spurious associations and the latter two factors contributing to failure to detect true
effects. These comments point to problems with the use of statistics, particularly the inter-
pretation of p-values.
The rest of this chapter consists of two main sections: the statistical analysis section
(Section 8.2) and the experimental design section (Section 8.3).
The statistical analysis section (Section 8.2) covers general approaches for testing sci-
entific hypotheses, including comparison of frequentist and Bayesian approaches, and com-
parison of model-based and empirical approaches for single marker or multiple marker
(haplotype) analyses. To understand and rectify the problems with spurious associations,
we revisit the fundamentals of statistical inference with respect to the problem of test-
ing scientific hypotheses, comparing frequentist and Bayesian methods in Section 8.2.1.
Problems are noted with the use of frequentist p-values as commonly used, and Bayesian
alternatives given.
The ability to detect LD is determined by factors including the extent of LD and the size of
QTL effects, i.e. the genetic architecture of the trait. These factors are important considerations for
experimental design. The experimental design section (Section 8.3) covers how experi-
ments can be designed with power to detect effects with a given strength of evidence, and
considers the main types of experimental design in separate subsections with examples and
statistical analyses appropriate to each.
Testing for an association between a marker and a trait is an example of testing a sci-
entific hypothesis. We first revisit the fundamentals of statistical inference with respect to
testing scientific hypotheses, including the commonly used frequentist hypothesis testing,
with p-values as a measure of evidence, and the Bayesian approach with Bayes factors and
posterior probabilities as evidence. It is shown that the two approaches are substantially
different, for testing scientific hypotheses, and very small p-values are needed to obtain
even modest evidence according to the Bayesian framework, when sample sizes are suf-
ficiently large to detect the many small effects underlying quantitative traits and complex
diseases.
We then consider various statistical approaches ranging from simple single marker
analysis of variance (ANOVA) or t-tests to more complex statistical models such as
coalescent-based approaches to analysis of haplotype data in a gene region.
Statistical methods specific to each experimental design are discussed in the experi-
mental design section (Section 8.3).
hypothesis (H0 ) to the alternative hypothesis (H1 ). For testing scientific hypotheses, such
as a non-zero marker–trait association, H1 typically has one or more additional unknown
parameters compared with H0 , e.g.
H0 : θ = 0 versus H1 : θ ≠ 0 . (8.1)
The observed value of a test statistic, T , chosen to measure departures from H0 is compared
to its sampling distribution under H0 . The observed value, Tobs , of T is compared to its
value under repeated sampling.
As noted in Chapter 7, making a decision based on p-values is problematic. A low
p-value means there is evidence that H0 may not be the perfect model for the data. A low
p-value means the probability Pr(T ≥ Tobs | H0 ) is small (cf. Equation (7.7)). However,
the corresponding probability under H1 may be equally small. Some threshold has to be
chosen, but there is no method for choosing the optimal threshold. Choosing p = 0.05 as a
threshold, i.e. ‘rejecting’ H0 and choosing H1 , when p ≤ 0.05 may or may not be a good
strategy, whether p is the comparison-wise, genome-wise or experiment-wise p-value.
Bayesian statistics
Table 8.1. p-Values corresponding to various Bayes factors, for testing for linkage disequilibrium between a
bi-allelic marker and QTL.
\[
g(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta} , \tag{8.2}
\]
where g(θ | x) is the posterior distribution of the unknown parameters θ given the data x,
π(θ) is the prior distribution of the parameters and f (x | θ) is the likelihood, i.e. probabil-
ity of observing the data for a given value of the parameters.
Note how information about x given θ in f (x | θ) has been turned into information
about θ given x in g(θ | x).
Note: the technical difficulty in implementing Bayesian computations lies in evalu-
ating the integral in (8.2), which is often analytically intractable. Nowadays, calculating
the integral is generally avoided by using computationally intensive Markov chain Monte
Carlo (MCMC) sampling methods. Gibbs sampling, Metropolis sampling and variants can
be used to obtain a sample from g(·), and quantities of interest easily calculated from this
sample (see, e.g. Gelfand et al. 1990; Gelman et al. 1995; Gilks et al. 1996). This method-
ology gives great modelling flexibility, and avoids the need for asymptotic (requiring large
samples) and distributional (e.g. requiring normal, independent identically distributed
data) assumptions.
The Bayes factor is defined as the ratio of the probability of observing the observed
data under H1 to that under H0 :
\[
B = \frac{\Pr(\text{data} \mid H_1)}{\Pr(\text{data} \mid H_0)} . \tag{8.3}
\]
The Bayes factor measures how much more likely the data are under H1 than under H0 . If
B = 1 the data are equally likely under H0 as under H1 , i.e. there is no evidence either
way. Values close to 1 are weak evidence. High values (greater than 1) are evidence for
H1 , low values (less than 1) are evidence against H1 , or for H0 .
Given prior probabilities π(H0 ), and π(H1 ), for each hypothesis the corresponding
posterior probabilities Pr(H0 | data) and Pr(H1 | data) are determined from the Bayes
factor by
\[
\frac{\Pr(H_1 \mid \text{data})}{\Pr(H_0 \mid \text{data})} = \frac{\Pr(\text{data} \mid H_1)}{\Pr(\text{data} \mid H_0)} \times \frac{\pi(H_1)}{\pi(H_0)} . \tag{8.4}
\]
In other words,
posterior odds = Bayes factor × prior odds . (8.5)
Equation (8.5) is a consequence of Bayes’ theorem, and states that the Bayes factor is
the factor by which prior odds have increased to give posterior odds as a result of observing
the data. The posterior odds are how much more likely we believe H1 to be true than
H0 after observing the data. If H1 and H0 are the only two possibilities, then Pr(H1 |
data) + Pr(H0 | data) = 1, i.e. the evidence can be equivalently specified by giving any
one of the three quantities Pr(H1 | data), Pr(H0 | data) or the posterior odds, whichever is
convenient.
Clearly, both the Bayes factor and prior odds are important factors contributing to the
posterior odds. If the prior odds are low, a higher Bayes factor, i.e. stronger evidence from
the data, is required to convince us of the likelihood of an effect.
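The relationship in Equation (8.5) is easy to explore numerically. The sketch below converts a Bayes factor and a prior probability for H1 into a posterior probability, and inverts the relationship to find the Bayes factor needed to reach a target posterior probability. It assumes H0 and H1 are the only two possibilities; the prior of 1 in 1,000 is a hypothetical value chosen to illustrate a low prior probability, and the function names are illustrative.

```python
def posterior_probability(bayes_factor, prior_prob_h1):
    """Pr(H1 | data) from posterior odds = Bayes factor x prior odds."""
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


def required_bayes_factor(prior_prob_h1, target_posterior):
    """Bayes factor needed to move the prior to a target Pr(H1 | data)."""
    prior_odds = prior_prob_h1 / (1.0 - prior_prob_h1)
    target_odds = target_posterior / (1.0 - target_posterior)
    return target_odds / prior_odds


# Even prior odds: B = 20 gives Pr(H1 | data) = 20/21, about 0.95.
p_even = posterior_probability(20, 0.5)

# Hypothetical low prior (1 in 1,000): B = 20 gives only about 0.02.
p_low = posterior_probability(20, 0.001)

# Bayes factor needed for Pr(H1 | data) = 0.95 with that prior: 18,981.
b_needed = required_bayes_factor(0.001, 0.95)
```

The same Bayes factor therefore leads to very different conclusions depending on the prior odds, which is why much larger Bayes factors are needed when only a small number of loci across the genome are expected to have real effects.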
Note:
1. The Bayes factor does not depend on prior odds. It does, however, depend on the
prior distribution for parameters under H1 , especially the parameter(s) being tested,
e.g. θ in (8.1).
2. The Bayes factor compares the probability of the data under both hypotheses, whereas
the p-value considers only the probability of an event under H0 .
3. The Bayes factor or posterior probability considers only the observed data, unlike
the p-value which considers the probability of unobserved values of the test statistic,
under unobserved repeated sampling.
4. For a given experimental design and test setup, the smaller the p-value, the larger the
Bayes factor will be. However the p-value needed to obtain a given Bayes factor gets
smaller with increasing sample size (Table 8.1; Ball 2005). A p-value of 0.05 can
correspond to evidence against H1 , e.g. with n = 1,728, p = 0.047 corresponds to
a Bayes factor B = 1/20.
5. The Bayes factor has a natural interpretation as the strength of evidence from (8.5).
The p-value is the probability of an unobserved event, and has no such interpretation
independent of experimental design and test setup.
6. The p-value tends to exaggerate the evidence for H1 . A p-value of much less than
0.05 is needed to correspond to B = 20, i.e. 20-fold increase from prior to posterior
odds in association studies. For example if n = 300 we need p = 7.18 × 10⁻⁴ to
obtain a Bayes factor B = 20 in Table 8.1.
Various techniques are available for calculating Bayes factors, or the marginal probabilities
Pr(data | Hi ), i = 0, 1, forming the denominator and numerator, respectively, in the equation
for the Bayes factor. The method of Spiegelhalter and Smith (1982) gives Bayes factors
based on non-informative priors for one-way ANOVA models (Equation (8.8)). This form
of the Bayes factor is used to test for differences between marker classes in independent
population sample sizes in Section 8.3.2, and for designing experiments with power to de-
tect effects with a given Bayes factor. Direct integration is used for the case–control studies
in Section 8.3.3. The Savage–Dickey density ratio (Equation (8.32); Dickey 1971) gives
the Bayes factors for nested hypotheses if the marginal posterior for the variable being
tested can be evaluated. The Savage–Dickey density ratio is applied to calculate equivalent
Bayes factors for the TDT, S-TDT transmission disequilibrium tests from Chapter 7, and
for the TDT-Q1 test for continuous traits, in Section 8.3.4. General methods of estimating
Pr(Hi | data) from MCMC samplers are given by Raftery (1996).
Summary
Scientific hypotheses, such as whether a new drug is effective or whether a genomic
region is associated with a trait, are tested statistically by Bayesian or frequentist methods.
Scientific hypotheses often correspond to a ‘precise hypothesis’ of the form (8.1), where
H0 is a subset of H1 obtained by setting one variable to zero. Frequentist inference uses
STATISTICAL ANALYSIS AND EXPERIMENTAL DESIGN 139
the p-value which measures probabilities of more extreme values of a test statistic than
that observed if H0 is the true model. In Bayesian inference the Bayes factor measures
the strength of evidence, while posterior probabilities combine the evidence with prior
probabilities for effects.
Bayesian and frequentist inference give similar results for testing one-sided hypothe-
ses, or where there is negligible cost to making the wrong decision. For testing scientific
hypotheses this is not the case: there is generally a substantial cost to making the wrong
decision, and we are testing precise hypotheses. Bayesian and frequentist inference are
not similar. For a given experiment, smaller p-values correspond to stronger evidence, but
there is no general interpretation of the p-value as strength of evidence for H1 . Very small
p-values are needed to correspond to a respectable Bayes factor with the kind of sample
sizes needed for association studies. Therefore, we do not recommend p-values for testing
scientific hypotheses. Bayes factors and/or posterior probabilities should be used instead.
Statistical approaches vary in model complexity and assumptions, from simple ANOVA
or t-tests for single markers to more complex multi-locus models involving multiple
markers or haplotypes. Multi-locus approaches can be model-based, using the coalescent or
mixed models based on IBD probabilities to take account of correlation between similar
haplotypes, or more empirical (Table 8.2).
The general approach is to compare statistical models with and without the association
being tested, allowing for other relevant information, e.g. pedigree or marker locations, etc.
. . . simply looking at the marginal dependency between each marker and dis-
ease status in a case/control sample of chromosomes is clearly inefficient. For
an LD mapping strategy to be optimal in fine mapping, it is essential to con-
sider the information observed in a set of contiguous markers (i.e., haplotypes).
In a review paper, Nielsen and Zaykin (2001) noted the literature was divided. Akey
et al. (2000) suggest ‘significant improvement in power and robustness of association
tests’ while Long and Langley (1999) and Kaplan and Morris (2001) conclude ‘single
marker tests are at least as powerful as haplotype-based tests.’ This was without con-
sidering the loss of information, when estimating haplotypes. Haplotype data are available
where chromosome segments have been sequenced, or can be estimated where individuals
have sufficiently many progeny.
In practice, haplotypes may need to be estimated from genotypic data, further reducing
the power of haplotype-based methods. For haplotype estimation see, e.g. Stephens et al.
(2001). Although higher LD may be found with the ‘right’ haplotypes, or group of haplo-
types, there are many such possibilities, each with lower prior probability, hence requiring
stronger evidence to reliably detect.
A pragmatic recommendation is to consider the haplotype-based approach where haplotype
data from closely spaced loci are available for one or a small number of gene regions.
In essence, the coalescent simulates the evolutionary process backwards in time, con-
sidering recombinations and mutations. A coalescence occurs when a single segment is
the common ancestor of two later segments. Figure 8.1 (Figure 1 from Nordborg and
Tavaré 2002) shows a possible genealogy of a short chromosomal segment. The blue,
green and red chromosome segments at the bottom of the figure are traced backwards in
time, i.e. upwards in the figure, to their most recent common ancestor. Four events are
labelled: Event 1 is a recombination and Events 2–4 are coalescences. The colour coding
shows which parts of a chromosome are ancestral to which parts of chromosomes lower
in the tree. Above a coalescence, multiple colours indicate that a segment is ancestral
to multiple segments, e.g. at Event 4 the left-hand chromosome is ancestral to the red,
green and blue chromosomes, while at Event 2 the left-hand chromosome is partly ances-
tral to both red and blue, and partly ancestral to red alone.
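The backwards-in-time view can be made concrete with a small simulation. The sketch below, in Python, is a deliberately simplified illustration: it ignores recombination and mutation (so, unlike Figure 8.1, three lineages need only two coalescences) and assumes the standard result that, with k lineages present, the waiting time to the next coalescence is exponential with rate k(k − 1)/2 in scaled coalescent time units. The function name is hypothetical.

```python
import random


def coalescence_times(n_lineages, rng):
    """Cumulative times of successive coalescences, traced backwards in time.

    Simplifying assumptions: no recombination, no mutation, constant
    population size, time measured in scaled coalescent units.
    """
    times = []
    t = 0.0
    k = n_lineages
    while k > 1:
        rate = k * (k - 1) / 2.0      # number of pairs that could coalesce
        t += rng.expovariate(rate)    # wait until the next coalescence
        times.append(t)
        k -= 1                        # two lineages merge into one ancestor
    return times


rng = random.Random(1)
times = coalescence_times(3, rng)
print("coalescence times for 3 lineages:", times)
```

The last entry is the time to the most recent common ancestor of the sample. Averaged over many replicates it should approach 2(1 − 1/n), i.e. 4/3 for three lineages, under the assumptions stated above.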
There is no uniquely determined ancestral genealogy; rather, inference needs to consider
possible genealogies according to their probabilities. Liu et al. (2001) give a fully
Bayesian approach using MCMC to simulate from genealogies according to their probabilities
in a coalescent model, illustrated with applications to cystic fibrosis and Friedreich’s
ataxia disease haplotypes.
The coalescent approach requires knowledge of several parameters, e.g. recombination
and mutation rates, and embodies assumptions about the evolutionary process which may
or may not accurately reflect the population history for the species or gene being consid-
ered. Estimates of the evolutionary parameters can be obtained from temporal sequence
data for some species (Drummond et al. 2002). Fearnhead and Donnelly (2001) estimate
recombination rates from population genetic data.
For further information on the coalescent see Kingman (1982), Hudson (1983, 1990),
Nordborg (2001), Griffiths and Marjoram (1997), Stephens (2001), Nordborg and Tavaré
(2002) and Stephens and Donnelly (2003) (with discussion by Bahlo et al. 2003; Wilson
2003).
An example of a genealogy for three copies of a short chromosomal segment. Tracing the
segmental lineages back in time, the following events occur: 1, the “green” lineage undergoes
recombination and splits into two lineages, which are then traced separately; 2, one of the re-
sulting green lineages coalesces with the “magenta” lineage, creating a segment, part of which
is ancestral to both green and magenta, part of which is ancestral to magenta only; 3, the “blue”
lineage coalesces with the lineage created by event 2, creating a segment that is partially ances-
tral to blue and magenta, partially ancestral to all three colours; 4, the “other” part of the green
lineage coalesces with the lineage created by event 3, creating a segment that is ancestral to all
three colours in its entirety. The recombination event induces different genealogical trees on
either side of the break: these are shown in the inserted figure.
Reprinted from Trends in Genetics 18, Nordborg, M. and Tavaré, S., Linkage disequilibrium: what history has to tell us,
Pages No.83–90, Copyright (2002), with permission from Elsevier.
Figure 8.1. Example genealogy illustrating the coalescent (Nordborg and Tavaré 2002). (see color plate)
A more sophisticated approach, allowing for groupings of haplotypes with similar ef-
fects could be based on the ‘product partition model’ (Hartigan 1990; Barry and Hartigan
1992) estimated using computationally intensive Bayesian methods. With this approach,
results from each possible grouping of haplotypes are combined according to the posterior
probability for the grouping.
A third empirical approach is to use Bayesian model selection, where a ‘model’ consists
of the trait regressed on a selected set of markers (Chapter 7, Section 7.4.4). This is
less sophisticated but, by reducing the set of possible models, more computationally
efficient than the product partition model. All possible models within the class of linear models
with subsets of markers as explanatory variables are considered according to their
probabilities. Ball (2001), reviewed in Sillanpää and Corander (2002), illustrates Bayesian model
selection for QTL mapping with approximate posterior probabilities for models obtained
using the Bayesian Information Criterion (BIC; Schwarz 1978). Unconditional estimates of
effects, not subject to selection bias, are obtained by Bayesian model averaging (cf. Raftery
et al. 1997). The same approach can be applied to association mapping.
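A small numerical sketch may help to show how BIC values turn into approximate posterior model probabilities and a model-averaged effect. It is written in Python and rests on assumptions that are not taken from Ball (2001): BIC is defined so that smaller is better, the approximation Pr(model | data) ∝ prior × exp(−BIC/2) is used, and the BIC values and effect estimates are invented for illustration.

```python
import math


def posterior_model_probs(bic, prior=None):
    """Approximate Pr(model | data) from BIC values (smaller BIC = better)."""
    names = list(bic)
    if prior is None:                         # default: equal prior per model
        prior = {m: 1.0 / len(names) for m in names}
    best = min(bic.values())                  # subtract the minimum for stability
    weights = {m: prior[m] * math.exp(-0.5 * (bic[m] - best)) for m in names}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def model_averaged_effect(probs, effects):
    """Unconditional estimate: the effect counts as 0 in models omitting the marker."""
    return sum(probs[m] * effects.get(m, 0.0) for m in probs)


# Hypothetical BIC values for the four models built from markers M1 and M2.
bic = {"none": 104.0, "M1": 100.0, "M2": 103.0, "M1+M2": 102.0}
# Hypothetical estimates of the M1 effect in the models that contain M1.
m1_effect = {"M1": 0.50, "M1+M2": 0.40}

probs = posterior_model_probs(bic)
averaged = model_averaged_effect(probs, m1_effect)
print(probs)
print("model-averaged M1 effect:", averaged)
```

The averaged effect is smaller than the estimate from the single best model because models without M1 retain some posterior probability. This shrinkage is what protects the unconditional estimate from selection bias.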
In the first instance, additive terms for each marker would be included as
possible variables, but epistasis can also be included, essentially by adding epistatic terms
with the appropriate prior probability. Bogdan et al. (2004) adapt the BIC criterion to
achieve the same effect. This approach is limited by the number of variables which can
simultaneously be considered (about 30). This is not a problem for additive models, if a
single linkage group in the QTL mapping context, or haplotypes in a single small chromo-
some region in the association mapping context, are being studied. It may not be possible to
simultaneously consider all possible epistatic interactions between loci, because the num-
ber of possible models may be too large. One approach is to limit the interactions to loci
already detected in additive models.
When the space of all models is too large, an alternative to considering all possible
models is to search through the space of all possible models with an MCMC sampler.
Since the MCMC sampler samples from models with probability proportional to their pos-
terior probability in the long run, mainly models with reasonably high probability would be
sampled. Sillanpää and Bhattacharjee (2005) describe a recent MCMC approach that uses indicator
variables to give a similar modelling framework, although they do not specifically consider
interactions. This has the advantage that it is implemented in BUGS (Spiegelhalter et al.
1995). BUGS is a programming language and system for specifying Bayesian hierarchical
models, and generating a Gibbs sampler. Implementing a model in BUGS is much quicker
than developing an MCMC sampler from scratch in a conventional programming language,
e.g. C, and additionally it is much easier to check BUGS code, and have confidence in the
sampler. An analysis using BUGS is given in Section 8.3.4.
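The idea of searching model space with a sampler that visits models in proportion to their posterior probability can be sketched as follows. This is a generic Metropolis sampler over marker-inclusion indicators, written in Python; it is not the method of Sillanpää and Bhattacharjee (2005) and not BUGS code. The scoring function is a toy stand-in for an unnormalised log posterior, with invented numbers.

```python
import math
import random


def sample_models(log_score, n_markers, n_iter, rng):
    """Metropolis sampler over models; a model is a tuple of 0/1 indicators."""
    model = tuple([0] * n_markers)            # start from the empty model
    counts = {}
    for _ in range(n_iter):
        j = rng.randrange(n_markers)          # propose flipping one indicator
        proposal = model[:j] + (1 - model[j],) + model[j + 1:]
        log_ratio = log_score(proposal) - log_score(model)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            model = proposal                  # accept the proposed model
        counts[model] = counts.get(model, 0) + 1
    return {m: c / n_iter for m, c in counts.items()}


def toy_log_score(model):
    """Hypothetical unnormalised log posterior for three markers."""
    gains = (3.0, 0.0, 1.0)    # evidence gained by including each marker
    penalty = 1.0              # prior penalty per included marker
    return sum(g - penalty for g, included in zip(gains, model) if included)


rng = random.Random(42)
freqs = sample_models(toy_log_score, n_markers=3, n_iter=30000, rng=rng)
inclusion = [sum(f for m, f in freqs.items() if m[j]) for j in range(3)]
print("posterior inclusion frequencies:", inclusion)
```

Because the toy score is additive across markers, the exact inclusion probabilities are available in closed form (for the first marker, e²/(1 + e²), about 0.88), which gives a convenient check that the sampler visits models with the intended frequencies.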
Note: an MCMC approach to Bayesian model selection was first given in George
and McCulloch (1993). Variables not selected were given a prior concentrated around
0. This sampler was best for uncorrelated predictors, and could have poor convergence
otherwise. Other Bayesian multi-locus methods for LD mapping include Kilpikari and
Sillanpää (2003) and Meuwissen et al. (2001).
The empirical- and model-based coalescent approaches could be combined using a
‘uniform shrinkage prior’. Two models represented by f1 (x | θ1 ) (e.g. representing
the model-based coalescent approach), and f2 (x | θ2 ) (e.g. representing the empirical
approach) can be combined with a uniform shrinkage prior (Natarajan and Kass 2000)
given by:
f (x | θ1 , θ2 , λ) = λf1 (x | θ1 ) + (1 − λ)f2 (x | θ2 ) . (8.6)
The uniform shrinkage prior is so named because the shrinkage parameter λ varying from
0 to 1 controls the relative influence of each model, and has a uniform prior distribution.
Allowing for λ < 1 relaxes the strong model assumptions, allowing the data to say how
much of the stronger model assumptions apply.
Summary
There is a range of approaches to the analysis of association study data. For inference,
the general approach is to compare models with and without the effect being tested. Single
marker analyses comparing one marker allele or haplotype versus the rest can easily be
carried out using standard methods.
If haplotype data are available in a small genomic region, such as the vicinity of a func-
tional locus, it may be more efficient to use haplotype-based methods: the fully Bayesian
BLADE method based on sampling from the set of possible ancestral genealogies accord-
ing to their posterior probabilities; or a mixed model, with random effects for haplotypes,
and a covariance structure estimated from the coalescent or a deterministic formula. The
assumptions inherent in the coalescent-based models can be avoided by using empirical
models. Reduced dependence on assumptions comes at a possible cost of reduced accu-
racy or power. Further experience is needed to tell which of these approaches is most
effective, and when.
Problem                   Solutions
p-values                  Use Bayesian methods, Bayes factors and posterior probabilities.
Population substructure   Test for substructure. If present, use a STRAT-type method
                          (8.3.5) or TDT design (8.3.4), or allow for relatedness in a
                          pedigree design (8.3.6).
Epistasis                 When major additive genes or markers have been found,
                          allow for possible epistasis using a Bayesian model selection
                          approach.
Non-genetic factors       Allow for factors as fixed or random effects in a mixed model.
(G×E) interactions        Allow for interactions as fixed or random effects in a mixed model.
The main choices available to the experimenter are the number of individuals to sam-
ple, the number of markers to genotype per individual, and which traits to study. Power
calculations allow choice of sample size so that the experiment has power to detect a QTL
with given effect size and LD. Factors affecting the sample size required are summarised
in Figure 8.2.
In this section we consider the various possible experimental designs – independent
samples from a population without substructure (Section 8.3.2), case–control tests (Section
8.3.3), many small families for TDT type tests (Section 8.3.4), populations with substruc-
ture (Section 8.3.5) and samples from related individuals (i.e. pedigrees, Section 8.3.6).
A strategy combining QTL and LD mapping is considered in subsection 8.3.7. Statistical
methods specific to each type of design are discussed in the relevant subsections.
Design of experiments with power to detect effects with given Bayes factor for inde-
pendent population samples is discussed in Section 8.3.2, and the independent population
sample methods applied to results from candidate gene studies in Eucalyptus and maize
in Examples 8.1 and 8.2. Frequentist and Bayesian case–control analyses are compared in
Example 8.3 (the malaria data from Chapter 7, with variants). The power of case–control
studies to detect genomic associations is assessed in Example 8.4 (APOE linkage
disequilibrium data). A full Bayesian analysis using BUGS to implement a Gibbs sampler for
simulated TDT-Q1 data is given in Example 8.7. Example 8.8 shows how LD is generated
following admixture between sub-populations with differing allele frequencies.
[Figure 8.2: diagram not reproduced. Surviving labels: desired or achievable resolution; power calculation with ldDesign; sample size.]
This shows that low-frequency alleles have greater relative variation in frequency from
generation to generation. For a given population size the frequency of low-frequency alleles
is more susceptible to random drift.
These factors contribute to the considerable variation in observed LD between loci in a
population. Coalescent simulations have been used to study the resulting effects (Nordborg
and Tavaré 2002).
LD patterns reflect the population history, including recent man-made influences, with
short-range LD patterns reflecting more ancient history, and long-range patterns reflecting
recent history, admixtures and inbreeding. So, paradoxically, a short range of LD may be
observed when single gene regions are examined while, in the same sample, recently
introduced long-range LD is observed between more distant markers. How best to untangle
this information is a challenging statistical problem.
Summary
The extent of LD is an important consideration in the design of association studies.
There is currently limited information on the extent of LD and on LD patterns in many species;
however, it is clear that the extent of LD varies widely between species, populations and
genomic loci.
Bayesian analysis
A Bayes factor for ANOVA models corresponding to a nearly non-informative prior is
given by Spiegelhalter and Smith (1982) who obtain, for a one-way ANOVA model
$$B = \left[\frac{1}{2}\,\frac{(m+1)}{n}\prod_{i=1}^{m} n_i\right]^{-1/2}\left[1 + \frac{(m-1)}{(n-m)}\,F\right]^{n/2}, \tag{8.8}$$
where $m$ is the number of groups, $n_i$ the number in each group, $n$ the total sample size and
$F$ the classical F-value as in Table 8.4.
                          df           SS    MS              F
Between marker classes    ν1 = 2       SSb   MSb = SSb/ν1    F = MSb/MSw
Within marker classes     ν2 = n − 3   SSw   MSw = SSw/ν2
Reprinted from Ball (2005). Genetics 170:859–873.
and F is the value of the classical F -statistic (Ball 2005), which corresponds to the p-value
via the F -distribution.
This is implemented in the ldDesign R package (Ball 2004). Results for the exam-
ples from Luo (1998) are shown in Table 8.5. Additional columns are the Bayes factor, B,
for the design and the sample size nB20 needed for a Bayes factor of 20 with power 0.9.
Note that none of the original designs has B > 1 when p = 0.05.
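The correspondence between a p-value and the Spiegelhalter and Smith Bayes factor of Equation (8.8) can be checked numerically. The Python sketch below is an independent illustration, not the ldDesign implementation, and its function names are hypothetical. It assumes three marker genotype classes in Hardy–Weinberg proportions with allele frequency 0.5 (an assumption about the design, which is not restated here), and it uses the closed form for the upper tail of an F-distribution with 2 numerator degrees of freedom.

```python
import math


def ss_oneway_bayes_factor(group_sizes, f_value):
    """Equation (8.8): B = [(1/2)(m+1)/n * prod(n_i)]^(-1/2) * [1 + (m-1)F/(n-m)]^(n/2)."""
    m = len(group_sizes)
    n = sum(group_sizes)
    log_b = (-0.5 * math.log(0.5 * (m + 1) * math.prod(group_sizes) / n)
             + 0.5 * n * math.log1p((m - 1) * f_value / (n - m)))
    return math.exp(log_b)


def f_from_p_two_df(p_value, n):
    """F-value on (2, n - 3) df with upper tail area p_value.

    For 2 numerator df the tail area is exactly (1 + 2F/v2)^(-v2/2).
    """
    v2 = n - 3
    return 0.5 * v2 * (p_value ** (-2.0 / v2) - 1.0)


def genotype_class_sizes(n, allele_freq):
    """Expected sizes of the three genotype classes (Hardy-Weinberg assumed)."""
    p = allele_freq
    return [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]


results = {}
for n, p_value in [(300, 7.18e-4), (1728, 0.047)]:
    sizes = genotype_class_sizes(n, 0.5)
    results[n] = ss_oneway_bayes_factor(sizes, f_from_p_two_df(p_value, n))
    print(n, p_value, results[n])
```

Under these assumptions the two cases come out close to 20 and 1/20 respectively, in line with the values quoted from Table 8.1: a much smaller p-value is needed at n = 300 for B = 20, while p just under 0.05 at n = 1,728 is evidence against an effect.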
Table 8.5. Comparison with results from Luo (1998). Results are shown for the 12 example populations (cf.
Luo Tables 2, 3) with sample size n, marker and QTL allele frequencies p, and q, linkage disequilibrium D,
QTL heritability h²_Q and dominance ratio φ. P0.05 is the power to detect an effect with α = 0.05, B is the
corresponding Bayes factor and nB20 is the sample size required to achieve a Bayes factor of 20 with power 0.9.
Table 8.6. Sample sizes required for power of 0.9 of detection of linkage disequilibrium between a bi-allelic QTL
and a bi-allelic marker with given posterior odds for linkage disequilibrium with D = 0.1, p = 0.5 and q = 0.5
in a genome scan with 500,000 SNP markers. Prior probability per marker is assumed to be 1/50,000.
                                  Sample size
Posterior odds   Bayes factor   h²_Q = 0.05   h²_Q = 0.01
1/20             2,500          5,572         30,640
1/5              10,000         6,008         32,792
1                50,000         6,524         35,397
5                250,000        7,031         37,949
20               1,000,000      7,465         40,089
Reprinted from Ball (2005).
Table 8.6 shows the sample sizes needed to obtain various posterior odds for
associations with small effect QTL in a genome scan. These values were calculated using
the ldDesign R package (Ball 2004).
Prior probabilities per marker in Table 8.6 are based on the expected number of QTL
affecting the trait and number of markers. This was based on the assumption that QTL are
equally likely to occur anywhere on the genome and assuming an expected number of 10
QTL. This corresponds to a Poisson prior probability distribution for the number of QTL
in the genome with rate λG = 10. The prior distribution for locations of QTL, assuming
they exist is generally assumed to be uniform: the probability that a QTL is within a small
genomic interval of width Δx is
$$\Pr[\text{QTL in } (x, x + \Delta x)] = \lambda_G \frac{\Delta x}{L_G}, \tag{8.11}$$
where $L_G$ is the genome length in base pairs.
With 500,000 SNP markers this equates to an average probability of 1/50,000 per
marker. For unequally spaced markers, the prior probability for a marker at position $x_i$
with flanking markers at $x_{i-1}$ and $x_{i+1}$ is obtained by taking $\Delta x = \frac{1}{2}(x_{i+1} - x_{i-1})$
in (8.11); this is the width of the sub-interval of points closer to $x_i$ than to either
flanking marker. The prior probability $\pi_i$ for the $i$th marker is then given by:
$$\pi_i = \Pr\left[\text{QTL in } \left(\tfrac{1}{2}(x_{i-1} + x_i),\ \tfrac{1}{2}(x_i + x_{i+1})\right)\right] = \lambda_G \frac{x_{i+1} - x_{i-1}}{2L_G}. \tag{8.12}$$
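As a minimal sketch of how the per-marker prior might be computed in practice, the Python code below applies the rule "expected number of QTL × interval width / genome length" to interior markers only; markers at the ends of a chromosome are not covered by the text and are left out. The genome length and marker positions are hypothetical values chosen for illustration.

```python
def interior_marker_priors(positions, genome_length, expected_qtl):
    """Prior probability per interior marker for sorted marker positions.

    Each marker is assigned the sub-interval of points closer to it than to
    its flanking markers, of width (x[i+1] - x[i-1]) / 2.
    """
    priors = []
    for i in range(1, len(positions) - 1):
        width = 0.5 * (positions[i + 1] - positions[i - 1])
        priors.append(expected_qtl * width / genome_length)
    return priors


genome_length = 500_000_000      # hypothetical genome length in base pairs
expected_qtl = 10                # Poisson rate used in the text

# Equally spaced markers: 500,000 markers give 1/50,000 per marker.
spacing = genome_length / 500_000
equal_prior = expected_qtl * spacing / genome_length
print("equally spaced:", equal_prior)

# Unequally spaced markers (hypothetical positions in base pairs).
positions = [1000, 3000, 4000, 9000, 10000]
unequal_priors = interior_marker_priors(positions, genome_length, expected_qtl)
print("unequally spaced:", unequal_priors)
```

Markers in sparsely covered regions represent wider intervals and therefore receive proportionally larger prior probabilities.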
The Poisson distribution with rate 10 QTL per genome has mean 10, and standard de-
viation about 3, and has 95% of its probability in the range from 4 to 17. Hence there is
some flexibility in the prior – we are not assuming the number of QTL is exactly 10. If
this prior is too precise, we can allow for more uncertainty in the number of QTL by us-
ing a mixture of Poisson distributions, e.g. a mixture of Poisson distributions with means
3,5,10,20, with mixing probability 0.25 for each rate, which has a mean and standard devi-
ation approximately 9.5, and 7.2, respectively, and 95% of its probability in the range 1–26.
Power calculations for this composite prior can be obtained by noting that for a given n the
power to obtain a given Bayes factor B is the same, regardless of the prior, but the poste-
rior probabilities are different. In general, for a mixture prior π, with prior probabilities per
marker πi , and mixing proportions ci for 1 ≤ i ≤ k
π = c1 π1 + · · · + ck πk , (8.13)
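The summaries quoted for the Poisson prior and for the mixture of Poisson priors can be reproduced directly from the probability mass functions. The Python sketch below does this with the standard library only; the function names are illustrative, and the sums are truncated at 200 QTL, which is far into the negligible tail for these rates.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability of k events, computed on the log scale."""
    return math.exp(-rate + k * math.log(rate) - math.lgamma(k + 1))


def mixture_pmf(k, rates, weights):
    """Probability of k under a mixture of Poisson distributions."""
    return sum(w * poisson_pmf(k, r) for r, w in zip(rates, weights))


def summarise(rates, weights, lo, hi, k_max=200):
    """Mean, standard deviation and Pr(lo <= K <= hi) for the mixture."""
    pmf = [mixture_pmf(k, rates, weights) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    coverage = sum(pmf[lo:hi + 1])
    return mean, math.sqrt(var), coverage


single = summarise([10], [1.0], 4, 17)
mixed = summarise([3, 5, 10, 20], [0.25] * 4, 1, 26)
print("Poisson(10):      mean, sd, Pr(4..17) =", single)
print("Poisson mixture:  mean, sd, Pr(1..26) =", mixed)
```

The mixture has nearly the same mean as the single Poisson prior but more than twice the standard deviation, which is the sense in which it expresses greater uncertainty about the number of QTL.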
Where there is no prior information for a given locus, we may be guided by the number
of QTL found at other loci, or by information on similar traits in other species. The trait
genetic variance gives an upper bound for variance explained by each individual QTL.
Results from QTL mapping studies also contain useful prior information, e.g. undetected
QTL are likely to be small enough to have a reasonable chance of escaping detection. An upper
bound on QTL magnitude for undetected QTL together with the amount of unexplained
genetic variance gives a lower bound for the number of undetected QTL. For example, if the
QTL detection experiment was sufficiently powerful so that each undetected QTL explains
5% or less of the total variance, and there are two detected QTL explaining, in total 20%
of the variance of a trait with heritability 50%, that leaves 30% of the variance, which is
genetic, unexplained. Therefore, there should be at least six loci explaining the remaining
30%. A prior rate of λG = 8 loci per genome would be reasonable. QTL mapping studies
also contain prior information on the locations of detected QTL (cf. Section 8.3.7).
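The arithmetic behind the lower bound in this example can be written out explicitly. The symbols $h^2$ (trait heritability), $v_{\max}$ (largest share of total variance an undetected QTL can explain) and $N_u$ (number of undetected QTL) are introduced here only for illustration.

```latex
\[
  \underbrace{0.50}_{h^2} \;-\; \underbrace{0.20}_{\text{detected QTL}} \;=\; 0.30,
  \qquad
  N_u \;\ge\; \frac{0.30}{v_{\max}} \;=\; \frac{0.30}{0.05} \;=\; 6 .
\]
```

Adding the two detected QTL to this minimum of six gives eight loci, which appears to be the reasoning behind the suggested prior rate of λG = 8.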
Table 8.7. Sample sizes required for power of 0.9 of detection of linkage disequilibrium between a bi-allelic QTL
and a bi-allelic marker with given posterior odds for linkage disequilibrium with D = 0.1, p = 0.5 and q = 0.5
in a set of 50,000 markers representing candidate genes. Prior probability per marker is assumed to be 1/5,000.
                                  Sample size
Posterior odds   Bayes factor   h²_Q = 0.05   h²_Q = 0.01
1/20             250            4,826         26,808
1/5              1,000          5,288         29,093
1                5,000          5,808         31,658
5                25,000         6,322         34,223
20               100,000        6,762         36,406
Reprinted from Ball (2005).
– Associated with processes likely to affect the trait in a model species, or,
Table 8.7 shows the sample sizes needed to obtain various posterior odds
for associations with small effect QTL in a candidate gene study. These values were calculated
using the ldDesign R package (Ball 2004).
Prior probabilities per marker in Table 8.7 again assume a Poisson distribution with
rate 10 QTL for the number of QTL. If there are 50,000 candidate genes, this equates to a
prior probability of 1/5,000 per candidate.
The prior probability is clearly an important factor in our ability to find genes. Can-
didate genes with a lower prior probability need a higher Bayes factor to obtain the same
posterior probability.
Within the candidate gene, 25 SNP markers were tested for associations with MFA in
an independent sample of n = 290 E. nitens trees. A putative association with SNP21
explained an estimated 4.6% of the variation, and had a reported experiment-wise p-value
of 0.0002 (Thumma et al., Table 2). The comparison-wise p-value corresponding to an
effect explaining 4.6% of the variation, with the given allele frequencies, was calculated by
us as 0.00023.
The apparently strongest associations (SNP20, SNP21) were not segregating in the
validation families. However, associations were ‘validated’ (p < 0.05), for nearby markers,
in two full-sib families of E. nitens (n = 287, p = 0.02) and E. globulus (n = 148, p =
0.04). In the validation samples the effect sizes were smaller, and with less significant
p-values, than in the association population. At this point readers should ask themselves:
how good is the evidence? Should we consider the associations validated?
The results for the most ‘significant’ associations from Thumma et al., Tables 3, 5
are shown in Table 8.8. To better assess the evidence from the population and validation
samples separately and combined, we converted all p-values to individual comparison-wise
p-values, calculated the corresponding F -values and then calculated the Spiegelhalter and
Smith Bayes factors (Equation (8.8); R function SS.oneway.bf() from ldDesign).
For the association population, we calculated the comparison-wise p-value based on the
reported percent variation explained (4.6%), and the allele frequency for the SNP. The p-
values for the validation populations were already comparison-wise.
Frequentist interpretation. The p-values show a ‘highly significant’ association in the
population sample, supported by significant associations in the two QTL mapping families.
Bayesian interpretation. The Bayes factors show strong evidence in the data (B = 98)
for an effect in the association population, but very weak evidence in the validation families
(B = 1.5, 1.1). A Bayes factor of 98 normally represents strong evidence, however if the
prior odds are low as in Tables 8.6 and 8.7, the posterior probabilities for an association
will be low.
Note that the ‘validation’ of this association in the QTL mapping families, even if the
evidence was good, would be supporting evidence for, but would not validate an association
with SNP21. An association in the QTL mapping families could result from QTL at some
distance (e.g. 20 cM) from the SNP locus, in either the Bayesian or frequentist paradigms.
A better approach to combining QTL and association mapping inference is to use the QTL
posterior probability distribution to improve the prior odds for the association mapping
analysis. This approach is studied in Section 8.3.7.
Prior and posterior probabilities for various priors are shown in Table 8.9. Priors 1,
2 and 3 are given for a random SNP, a random candidate gene and a candidate gene with
some fairly strong prior information, respectively.
Table 8.8. Statistics for markers with ‘significant’ associations with MFA.
Table 8.9. Prior and posterior probabilities for an association with SNP21, for various priors.
For example, for a random SNP selected from the genome the prior probability per
SNP might be 1/50,000, posterior odds are 1/508 and the posterior probability is 0.002.
Clearly posterior probabilities for a real effect are low except in case 3, where the
candidate gene is, a priori, not unlikely. The authors did not give prior probabilities for
an association. Their candidate gene was selected from a set of differentially expressed
genes, and was also associated with stiffness in Arabidopsis. However, the associations
in Arabidopsis are not with the trait under consideration, i.e. MFA. In the absence of other
evidence, since we have no reason to expect a lignin gene to causally affect MFA, we
would use prior 1 or 2. For respectable posterior odds of 20:1 or more, with the Bayes
factor obtained, the prior odds should be at least 1:5. If prior odds of 1:5 (around 800 times
better than for a random candidate gene, representing stronger evidence than the data) are
used, these need careful justification.
Selection bias The reduction in estimated magnitude of the effects, in the validation pop-
ulation compared to the association population, could be due to validation with different
markers. This phenomenon is also typical of selection bias. Significant effects, originally
estimated from the same population used to test for significance tend to be biased upwardly,
a phenomenon known as selection bias. Estimates free of selection bias should be given.
These can be obtained, either by using an independent population, or, in a Bayesian context
by considering multiple models, not just the models where the marker is selected (cf. Ball
2001, for application in a multi-marker-QTL mapping context), and averaging over models
according to their posterior probabilities.
As a special case, this applies when a single marker or haplotype is being tested. In
this case there are two possible models. These correspond to H1 , the alternative hypothe-
sis, where the marker is selected and H0 , the null hypothesis where there is no effect, i.e.
the effect is zero, respectively. Allowing for selection bias means allowing (with non-zero
probability) for the possibility that H0 is true, in which case the effect is zero. Otherwise
selection bias occurs if Pr(H0 | data) is not small. The unconditional estimate of marker
effects is obtained by averaging effects in each model according to the posterior probabil-
ities. With priors 1 and 2 in Table 8.9, the resulting estimates would be very small since
the posterior probabilities for H1 are small. With prior 3, the posterior probability of 0.72
would mean the estimates reduce by a factor of 0.72 and the percentage variation explained
reduces by the square of this factor, or 0.52.
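The shrinkage under prior 3 follows from averaging over the two models. In the worked equation below, $\hat\beta$ denotes the effect estimated under H1; this symbol is introduced here for illustration and is not the chapter's notation.

```latex
\[
  \mathrm{E}[\beta \mid \text{data}]
    = \Pr(H_1 \mid \text{data})\,\hat\beta + \Pr(H_0 \mid \text{data}) \cdot 0
    = 0.72\,\hat\beta ,
  \qquad
  \frac{(0.72\,\hat\beta)^{2}}{\hat\beta^{2}} = 0.72^{2} \approx 0.52 .
\]
```

Because variance explained scales with the square of the effect, a shrinkage factor of 0.72 on the effect corresponds to a factor of about 0.52 on the percentage of variation explained.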
Power Frequentist methods give approximately valid results, approximately free of
selection bias and without the need to use an independent sample, if the power to detect the
true effect is good, e.g. 0.9. The difficulty is that we do not know the true effect; we
only have the estimated effect. Often, even if the power to detect the estimated effect is
reasonable, the true effect may be smaller, and the estimate may hence suffer from selection bias. We could be
reasonably sure of good power if the lower limit of a 95% confidence interval for the esti-
mated effect was larger than the value for which the power is 0.9. Often, experiments are
designed with power of 0.9 to detect an effect with p = 0.05, i.e. two standard deviations
greater than zero. The 95% c.i. for the estimated effect would then be greater than this
value if the effect was at least four standard deviations greater than zero, or a p-value of
around 0.0001.
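The link between "standard deviations from zero" and p-values in this argument can be checked with a normal approximation. The Python sketch below assumes a two-sided test with a normally distributed estimate; the function names are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided normal p-value for an estimate z standard errors from zero."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def lower_95_limit(z):
    """Lower limit of the 95% confidence interval, in standard-error units."""
    return z - 1.96


for z in (2.0, 4.0):
    print(z, two_sided_p(z), lower_95_limit(z))
```

An estimate two standard errors from zero gives p just under 0.05. For the lower confidence limit itself to clear two standard errors, the estimate must be about four standard errors from zero, at which point the p-value is of the order of 0.0001.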
Finally, we examine the power of the experiment using ldDesign (Ball 2004, 2005).
The power, calculated using ldDesign, of the experiment to detect LD with D = 0.1, 0.2,
with the given allele frequencies is shown in Figure 8.3. Power to detect LD with D = 0.1,
with a Bayes factor of 20 is very low (0.04), but nearly 0.5 to detect LD with D = 0.2
(nearly its maximum for the given allele frequencies). The indicated sample size for a
power of 0.9 is 575, or nearly twice the size of the experiment. To detect a QTL explaining
5% of the variation with D = 0.1 and a Bayes factor of 20 or more requires a population of
around 2,730, or almost 10 times the size.
Figure 8.4 shows power to detect LD between a bi-allelic marker and QTL with a given
Bayes factor, as a function of sample size. Allele frequencies are assumed to be 0.31 (the
same as for SNP21) for both marker and QTL. Each panel corresponds to a combination of
D = 0.1 or 0.2, and QTL heritability h²_Q = 0.01 or 0.05, i.e. explaining 1% or 5% of trait
variation. Within each panel power curves are given for power to detect associations with
Bayes factors of 20, 1,000 or 1,000,000.
> library(ldDesign)
> # Power of the n = 290 sample to detect LD with a Bayes factor of 20
> ld.power(n=290,p=0.31,q=0.31,h2=0.05,phi=0,Bf=20,D=0.1)
       n power
[1,] 290 0.038
> ld.power(n=290,p=0.31,q=0.31,h2=0.05,phi=0,Bf=20,D=0.2)
       n power
[1,] 290 0.495
> # Sample size needed for power 0.9 with the same Bayes factor
> ld.design(p=0.31,q=0.31,D=0.1,h2=0.05,Bf=20,phi=0,power=0.9)
[1] 2727.228
> ld.design(p=0.31,q=0.31,D=0.2,h2=0.05,Bf=20,phi=0,power=0.9)
[1] 575.1845
Figure 8.4. Power versus sample size for various levels of disequilibrium. [Plot not reproduced. Surviving labels: curves for B = 20, B = 1000 and B = 1000000; vertical axis ‘power’, ticks 0.0 to 0.8; horizontal axis ticks 2.0 to 4.5.]
and comparison-wise p-values were calculated for the deletion flanking sh2 by reference to
the t-distribution. Bayes factors were based on assuming allele frequencies of 0.5. Some-
what higher values were obtained for Bayes factors based on allele frequencies of 0.1 and
0.9 (not shown).
As with the previous example, fairly high Bayes factors were obtained, but strong prior
information (prior odds of no more than 20:1 against an effect, compared to 1/4,000 for
a random candidate gene assuming 10 genes affecting the trait), or lower for a random
haplotype within the gene, is needed to obtain respectable posterior odds.
Table 8.10. Likelihood ratios, comparison-wise p-values and Bayes factors for time to silking in five fields from
Thornsberry et al. (2001). Bayes factors are based on allele frequencies of 0.5.
Note: readers may notice the variability in the Bayes factor between fields. The log
Bayes factors from the final column in Table 8.10 had a sample standard deviation of 0.95.
This level of variability (if applying independently across haplotypes) would contribute a
3.6–20.3-fold increase (95% c.i. for the maxima of the log Bayes factor) in the Bayes
factor by chance as a result of maximising over 41 haplotypes. With a 20-fold reduction,
the whole-gene Bayes factors become comparable to the sh2 values.
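The size of this maximisation effect can be roughly checked. The sketch below rests on assumptions of ours, not on the original analysis: that the log Bayes factors are natural logarithms, independent across the 41 haplotypes, and normally distributed with standard deviation 0.95. It uses the fact that the maximum of m independent draws has distribution function F(x)^m. Under these assumptions the interval comes out close to, but not identical with, the range quoted above.

```r
# Sketch: 95% interval for the fold increase in the Bayes factor caused by
# taking the maximum over m haplotypes.
# Assumptions (ours): natural-log Bayes factors, independent and normal.
m       <- 41      # number of haplotypes maximised over
sd.logB <- 0.95    # sample standard deviation of the log Bayes factors
probs   <- c(0.025, 0.975)

# Quantiles of the maximum of m standard normals: solve pnorm(z)^m = prob
z.max <- qnorm(probs^(1 / m))

fold.increase <- exp(sd.logB * z.max)
names(fold.increase) <- c("lower", "upper")
print(round(fold.increase, 1))
```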
Summary
Standard statistical methods, such as the frequentist ANOVA method, can be used to
analyse associations in independent samples. However, due to problems with the
interpretation of p-values, Bayes factors and posterior probabilities for H1 are the
recommended measures of evidence.
Using a correspondence between p-values for one-way ANOVA models and the Spiegelhalter
and Smith Bayes factors enables us to use existing power calculations to find the sample
sizes required to detect effects with a given Bayes factor. The same technique is useful
for estimating a Bayes factor based on results where only p-values are published. Results
consistently showed that the p-value is not a reliable measure of evidence. The p-values
corresponding to a respectable Bayes factor were very low, and varied considerably.
The power of an experiment to detect an association between a bi-allelic marker and
QTL with given sample size, allele frequencies and LD coefficient can be calculated using
the ldDesign R library. Alternatively, the sample size required for a given power can be
calculated.
Prior probabilities and the Bayes factor combine to give posterior probabilities. Prior
distributions are a mathematical representation of prior knowledge. Priors are subjective:
there is no ‘right’ or ‘wrong’ prior, and different observers will have different priors.
With experience, priors can be chosen which are a reasonable representation of the available
prior knowledge. Priors based on the Poisson distribution can be used for the number of
QTL present in the genome, and this information used to obtain probabilities that a QTL
is present in a given genomic region, e.g. the vicinity of a marker. Good prior information
may substantially increase prior odds, hence reducing the sample size needed. But it is
important not to overstate prior information. If necessary, mixtures of Poisson distributions
can be used to obtain less informative priors.
The key to obtaining a high posterior probability for detected QTL is to design the
experiment with good power to detect QTL with a given Bayes factor, where the Bayes
factor is chosen sufficiently large to overcome the low prior odds.
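The relation between prior odds, Bayes factor and posterior probability is simple arithmetic, sketched here in R. The function names post.prob() and required.bf() are hypothetical helpers introduced for illustration; the prior odds of 1/4,000 and 1/50,000 are the values used in this chapter for a candidate gene and for a marker in a genome scan.

```r
# Posterior probability of H1 from a Bayes factor and prior odds:
# posterior odds = Bayes factor x prior odds.
post.prob <- function(B, prior.odds) {
  post.odds <- B * prior.odds
  post.odds / (1 + post.odds)
}

# Bayes factor needed to reach a target posterior probability.
required.bf <- function(target.prob, prior.odds) {
  (target.prob / (1 - target.prob)) / prior.odds
}

post.prob(B = 100, prior.odds = 1/4000)    # candidate gene, Bayes factor 100
post.prob(B = 1e6, prior.odds = 1/50000)   # genome scan, Bayes factor 1,000,000
required.bf(target.prob = 0.9, prior.odds = 1/4000)
```

With prior odds of 1/4,000, a Bayes factor of 100 gives posterior odds of only 0.025, which is why a Bayes factor of that size does not overcome the low prior odds for a candidate gene.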
The methods were applied to candidate gene studies in Eucalyptus and maize
(Examples 8.1 and 8.2). Respectable Bayes factors of around 100 were obtained in both
examples, but these were not high enough to overcome the low prior odds for candidate genes.
Lessons learnt from these examples include:
– Approximate Bayes factors can be found from experiments where comparison-wise
p-values are reported if the sample sizes in each marker class are given.
– Large Bayes factors are needed to overcome the low prior odds in genome scans or
candidate gene studies.
– Low prior probabilities for a genome scan or candidate gene region apply even if
testing a single gene region, unless there is independent evidence for the region to
contain loci affecting the trait in consideration.
– When power is low, estimates of effects for the ‘detected’ markers will be
inflated by selection bias.
– QTL mapping results can support but do not validate LD mapping associations.
Table 8.11. Observed counts and expected proportions in a case–control test with two marker classes.
Table 8.12. Frequencies of alleles in the case–control test for the malaria data.
               A      S
Case         623      7
Control    1,065    101
Bayesian analysis. Under H1 , let p12 , p22 be the expected proportions of allele S for
cases and controls, respectively, with Beta(1/2, 1/2) prior distributions. Under H0 we
assume p12 = p22 and let p12 have a Beta(1/2, 1/2) prior.
Note: the Beta prior is a conjugate prior for binomial sampling, meaning that if the
prior is a Beta distribution, and a binomial sample is observed, the posterior distribution is
also a Beta distribution. If the prior for p is Beta(a, b), and k successes are observed in n
Bernoulli trials, then the posterior is Beta(a + k, b + n − k). A Beta(1/2, 1/2) distribution
is the standard Jeffreys prior (Jeffreys 1961) with mean 0.5, and information equivalent to
one Bernoulli trial. The density for a Beta(a, b) distribution is
\[ f(p \mid a, b) = \frac{1}{B(a, b)}\, p^{a-1} (1 - p)^{b-1}, \tag{8.17} \]
where B(a, b) is the value of the Beta function given by
\[ B(a, b) = \int_0^1 p^{a-1} (1 - p)^{b-1} \, dp, \tag{8.18} \]
i.e. the factor needed to make f (p | a, b) in Equation (8.17) a probability density. Values
of B(a, b) can be calculated with the standard R function beta(). When the values of
B(a, b) are very small it is best to work with the logarithm of the values, calculated
directly with the R function lbeta().
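We now calculate the Bayes factor, by explicitly integrating out p12 , p22 for H1 and
p12 for H0 .
\[
\begin{aligned}
\Pr(\text{data} \mid H_1) &= \int_0^1\!\!\int_0^1 \binom{630}{7} p_{12}^{7} (1 - p_{12})^{623}
   \times \frac{p_{12}^{-0.5} (1 - p_{12})^{-0.5}}{B(0.5, 0.5)} \\
 &\qquad \times \binom{1166}{101} p_{22}^{101} (1 - p_{22})^{1065}
   \times \frac{p_{22}^{-0.5} (1 - p_{22})^{-0.5}}{B(0.5, 0.5)} \, dp_{12}\, dp_{22} \\
 &= \binom{630}{7} \binom{1166}{101}
   \frac{B(623.5, 7.5)\, B(1065.5, 101.5)}{B(1/2, 1/2)\, B(1/2, 1/2)}.
\end{aligned} \tag{8.19}
\]
Similarly
\[
\Pr(\text{data} \mid H_0) = \binom{630}{7} \binom{1166}{101} \frac{B(1688.5, 108.5)}{B(1/2, 1/2)} \tag{8.20}
\]
so the Bayes factor is
\[
B = \frac{\Pr(\text{data} \mid H_1)}{\Pr(\text{data} \mid H_0)}
  = \frac{B(623.5, 7.5)\, B(1065.5, 101.5)}{B(1/2, 1/2)\, B(1688.5, 108.5)}
  = 1.0 \times 10^{10}. \tag{8.21}
\]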
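The Bayes factor for the malaria data is easily evaluated on the log scale with lbeta(). The following is a minimal sketch using the allele counts of Table 8.12; the variable names are ours. The binomial coefficients cancel between the two marginal likelihoods and are therefore omitted.

```r
# Allele counts from Table 8.12 (malaria data)
case    <- c(A = 623,  S = 7)
control <- c(A = 1065, S = 101)
a0 <- 0.5; b0 <- 0.5                     # Beta(1/2, 1/2) prior

# Log marginal likelihoods (binomial coefficients omitted, they cancel in B)
logH1 <- lbeta(case["S"] + a0, case["A"] + b0) - lbeta(a0, b0) +
         lbeta(control["S"] + a0, control["A"] + b0) - lbeta(a0, b0)
logH0 <- lbeta(sum(case["S"], control["S"]) + a0,
               sum(case["A"], control["A"]) + b0) - lbeta(a0, b0)

logB <- unname(logH1 - logH0)            # log Bayes factor for H1 versus H0
B    <- exp(logB)
print(signif(B, 2))
```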
Table 8.13. χ2 and Fisher’s exact test statistics and Bayes factors for three case–control datasets.
Frequentist and Bayesian analyses are compared for three possible case–control datasets
in Table 8.13. Fisher’s exact test was computed using the R function fisher.test(),
with a two-sided alternative. In Dataset I (the malaria data) the chi-squared test p-value
$P_{\chi^2}$ is of the order of $10^{-10}$, the Fisher exact test p-value is even smaller at
$P_{\mathrm{Fisher}} = 9.9 \times 10^{-13}$, and the corresponding Bayes factor is of the order
of $10^{10}$, representing very strong evidence.
Even after allowing for prior odds of 1/500,000 the posterior probability for an
association will be high. Datasets II and III had higher counts of the S allele for the cases
and lower counts of the A allele than Dataset I but had the same row totals. Dataset II had
$P_{\chi^2} \le 0.001$, and $P_{\mathrm{Fisher}}$ slightly smaller. Both of these values are
commonly considered ‘highly significant’ in frequentist analyses. The corresponding Bayes
factor was 13.9, representing only moderate evidence and not enough to overcome the low prior
odds for most associations. Dataset III had $P_{\chi^2} = 0.02$, and $P_{\mathrm{Fisher}}$
similar. These values are normally considered ‘significant’ in frequentist analysis. The
corresponding Bayes factor was only 0.5, representing weak evidence against H1 .
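The frequentist tests for Dataset I can be reproduced with base R. This sketch uses the counts of Table 8.12. Note that chisq.test() applies a continuity correction to 2 × 2 tables by default, so both variants are computed; which variant was used for Table 8.13 is not stated, so neither should be read as the tabulated value.

```r
# Dataset I (malaria data, Table 8.12): rows = status, columns = allele
malaria <- matrix(c(623, 1065, 7, 101), nrow = 2,
                  dimnames = list(status = c("Case", "Control"),
                                  allele = c("A", "S")))

p.chisq      <- chisq.test(malaria, correct = FALSE)$p.value  # no correction
p.chisq.cont <- chisq.test(malaria)$p.value                   # Yates correction
p.fisher     <- fisher.test(malaria, alternative = "two.sided")$p.value

print(c(chisq = p.chisq, chisq.corrected = p.chisq.cont, fisher = p.fisher))
```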
Odds ratios and relative risks. The odds ratio is given by
The model for H1 could have been parameterised in terms of odds ratios, rather than
p12 , p22 . Nevertheless we can compute the posterior distribution for odds ratios from the
Figure 8.5. R calculations for simulation from the posterior distribution of the odds ratio for Dataset I.
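A minimal sketch of one way to carry out such a simulation is given here; it is not necessarily the calculation of Figure 8.5. It assumes that the odds ratio compares the odds of allele S in cases with the odds in controls, and it samples independently, under H1, from the Beta posteriors that follow from the Beta(1/2, 1/2) priors and the counts of Table 8.12.

```r
set.seed(1)
n.sim <- 100000

# Posterior under H1: Beta(1/2 + count of S, 1/2 + count of A)
p12 <- rbeta(n.sim, 7.5,   623.5)     # proportion of allele S in cases
p22 <- rbeta(n.sim, 101.5, 1065.5)    # proportion of allele S in controls

# Log-odds ratio (cases versus controls) for each posterior draw
log.or <- log(p12 / (1 - p12)) - log(p22 / (1 - p22))

post.summary <- c(mean = mean(log.or), sd = sd(log.or),
                  quantile(log.or, c(0.025, 0.975)))
print(round(post.summary, 2))
```

The mean and standard deviation obtained this way should be close to the values conditional on H1 for Dataset I in Table 8.14.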
Table 8.14. Posterior statistics for log-odds ratios. Conditional estimates are made assuming H1 is true.
Unconditional estimates average estimates under H0 and H1 , according to their posterior probabilities. The
posterior probabilities pH1 = Pr(H1 | data) were estimated assuming prior odds of 1/4,000, appropriate for a
candidate gene if 10 genes out of 40,000 are expected to contribute to the disease.
          Conditional on H1                         Unconditional
Dataset   Mean    Std. dev.   95% c.i.              Mean       Std. dev.   pH1      Selection bias (%)1
I         −2.13   0.39        (−2.94, −1.41)        −2.13      0.39        1.0000   0
II        −0.71   0.22        (−1.15, −0.29)        −0.0074    0.0443      0.0030   9,495
III       −0.44   0.20        (−0.84, −0.05)        −0.00027   0.00555     0.0001   162,900
1 Selection bias is estimated as the bias from assuming H1 is true as a percentage of the
unconditional estimate.
The APOE gene has three alleles ε2, ε3 and ε4 affecting susceptibility to Alzheimer’s
disease. Nielsen and Weir (N&W 2001) simulate power for allele-based case–control tests
and the TDT to detect associations between two SNP markers (SNP1 and SNP2) located
near the APOE gene locus and the disease. Of interest is whether associations between
two SNP markers and the disease could be detected by association mapping. Power was
reported as around 57% for the allele-based case–control test and 50% for the TDT test
with 50 cases and 50 controls, to detect an association with SNP2 (marker ‘N’ in N&W,
Table II), at significance level α = 0.05 (N&W, Fig. 1, p. 259).
To make the probability calculations required for simulations, we make the following
statistical assumption: that conditional on the APOE genotypes the disease status and
marker genotypes are independent. This conditional independence assumption is equivalent
to the biological assumption that the APOE locus is the only locus affecting the disease
that is in linkage disequilibrium with the marker. The conditional independence model is
represented as a graphical model in Figure 8.6. This is the same type of model used to
represent probabilistic models for Bayesian analysis using BUGS in Section 8.3.4 (cf.
Figure 8.13).
Allele frequencies, LD values and disease penetrances from N&W are shown in Tables
8.15, 8.16 and 8.17. Using these values and Bayes’ theorem we calculate probabilities for
APOE genotypes conditional on case or control status (Equations (8.26) and (8.27)).
Figure 8.6. Graphical representation of probabilistic model relating APOE genotypes, SNP marker genotypes
and Alzheimer’s disease status.
      SNP2                  APOE
N1       N2        ε2       ε3       ε4
0.15     0.85      0.085    0.779    0.137
Table 8.16. Disequilibrium and recombination rates between SNP markers and APOE.
              Disequilibria                        Recombination
Marker        Dε2,1      Dε3,1      Dε4,1          c (%)
M (SNP1)      0.07149    −0.1169    0.04545        0.05
N (SNP2)      0.04545    −0.1169    0.07140        0.5
Genotype (g)    ε2ε2     ε2ε3     ε2ε4     ε3ε3     ε3ε4     ε4ε4
Pr(case | g)    0.0432   0.0288   0.0576   0.0480   0.130    0.600
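and
\[
\begin{aligned}
\Pr(N_2 \mid \varepsilon_k) &= 1 - \Pr(N_1 \mid \varepsilon_k) \\
 &= 1 - \bigl(\Pr(N_1) + D_{\varepsilon_k,1} / \Pr(\varepsilon_k)\bigr) \\
 &= \Pr(N_2) - D_{\varepsilon_k,1} / \Pr(\varepsilon_k).
\end{aligned} \tag{8.31}
\]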
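The conditional marker allele probabilities can be evaluated directly from the tabulated allele frequencies and disequilibria. The sketch below does this for SNP2. Because the tabulated values are rounded, a conditional probability can fall marginally outside [0, 1], so the result is clipped to that interval; the clipping and the variable names are our additions.

```r
# APOE allele frequencies and LD coefficients between each APOE allele and
# allele N1 of SNP2, as tabulated from N&W
p.apoe <- c(e2 = 0.085, e3 = 0.779, e4 = 0.137)
D.N1   <- c(e2 = 0.04545, e3 = -0.1169, e4 = 0.07140)
p.N1   <- 0.15                        # frequency of marker allele N1

# Pr(N1 | APOE allele) = Pr(N1) + D / Pr(APOE allele)
pr.N1 <- p.N1 + D.N1 / p.apoe
pr.N1 <- pmin(pmax(pr.N1, 0), 1)      # guard against rounding in the tables
pr.N2 <- 1 - pr.N1                    # Equation (8.31)

print(round(rbind(N1 = pr.N1, N2 = pr.N2), 3))
```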
1. Simulate APOE genotypes for cases and controls using the probabilities Pr(εk εl |
case), Pr(εk εl | control).
2. Simulate marker genotypes using the probabilities Pr(Ni Nj | εk , εl , case),
Pr(Ni Nj | εk , εl , control).
3. Form the 2 × 2 table of disease status and marker values.
Figure 8.8. Simulations for power of case–control studies to detect associations with Alzheimer’s disease.
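The three simulation steps can be sketched in a few lines of R. This is an illustration under stated assumptions, not the code of Figure 8.8: population genotype frequencies are assumed to follow Hardy-Weinberg proportions, the marker allele on each chromosome is drawn independently given its APOE allele, the test is allele-based (2n alleles per group), and the Bayes factor uses Beta(1/2, 1/2) priors as in Equation (8.21). All function and variable names are ours.

```r
set.seed(1)
p.apoe <- c(e2 = 0.085, e3 = 0.779, e4 = 0.137)
p.apoe <- p.apoe / sum(p.apoe)          # tabulated values sum to 1.001
pr.N1  <- pmin(pmax(0.15 + c(0.04545, -0.1169, 0.07140) / p.apoe, 0), 1)

# APOE genotypes (pairs of allele indices) with penetrances Pr(case | g)
geno <- rbind(c(1, 1), c(1, 2), c(1, 3), c(2, 2), c(2, 3), c(3, 3))
pen  <- c(0.0432, 0.0288, 0.0576, 0.0480, 0.130, 0.600)
pr.g <- p.apoe[geno[, 1]] * p.apoe[geno[, 2]] *
        ifelse(geno[, 1] == geno[, 2], 1, 2)     # Hardy-Weinberg (assumed)
pr.g.case    <- pen * pr.g / sum(pen * pr.g)                 # Bayes' theorem
pr.g.control <- (1 - pen) * pr.g / sum((1 - pen) * pr.g)

# Number of N1 alleles among the 2n marker alleles of n individuals
sim.N1 <- function(n, pr.geno) {
  g <- sample(nrow(geno), n, replace = TRUE, prob = pr.geno)   # step 1
  sum(rbinom(2 * n, 1, pr.N1[as.vector(geno[g, ])]))           # step 2
}

# Log Bayes factor for a 2 x 2 table of allele counts
log.bf <- function(k1, n1, k2, n2) {
  lbeta(k1 + 0.5, n1 - k1 + 0.5) + lbeta(k2 + 0.5, n2 - k2 + 0.5) -
    lbeta(0.5, 0.5) - lbeta(k1 + k2 + 0.5, n1 + n2 - k1 - k2 + 0.5)
}

# Proportion of simulated studies reaching each Bayes factor threshold
sim.power <- function(n, bf = c(1, 20, 1000), n.sim = 1000) {
  lb <- replicate(n.sim, log.bf(sim.N1(n, pr.g.case), 2 * n,
                                sim.N1(n, pr.g.control), 2 * n))  # step 3
  sapply(bf, function(b) mean(lb >= log(b)))
}

power.50 <- sim.power(50)               # 50 cases and 50 controls
print(power.50)
```

Because of the assumptions above and the smaller number of replicates, the estimates need not agree exactly with the tabulated powers.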
Power to detect the association between marker SNP2 and the disease with various
Bayes factors, estimated from 3,000 simulated populations for each sample size, is shown
in Table 8.18. We see that a sample of size n = 50 cases and n = 50 controls with
50% power to obtain a p-value less than 0.05 has 53% power to obtain a Bayes factor of
1 (similar to the power to obtain a p-value of 0.05 from N&W, hence the p = 0.05 is
approximately equivalent to a Bayes factor B = 1 here), but low power to obtain a Bayes
factor of 20. A sample size of n = 200 is sufficient to obtain a Bayes factor of 20 with
80% power, useful if we already have strong prior information on the location of the gene,
while a sample size of n = 600, sufficient to obtain a Bayes factor of 1,000,000 with 95%
power, would suffice for a genome scan, with prior odds of 1/50,000 per marker.
Summary
Data for single marker tests in case–control studies can be summarised as a contingency
table, and associations tested using the χ2 or Fisher exact tests, or Bayesian methods.
In Example 8.3, the frequentist χ2 and Fisher exact tests were compared with Bayesian
inference for several 2 × 2 contingency tables. Bayesian inference for Example 8.3
illustrates calculating the Bayes factor by explicit integration, made possible because of the
use of a conjugate Beta prior. As with other examples, there were ‘significant’ p-values,
corresponding to only weak evidence according to the Bayes factor. Again, selection bias
in estimated effects occurred in the datasets where posterior probabilities for H1 were not
high.
Table 8.18. Power of case–control test with n cases and n controls to detect the association between marker
SNP2 and Alzheimer’s disease with given Bayes factors. Power was estimated from 3,000 simulated populations
for each sample size. Bayes factors were calculated using Equation (8.22).
Bayes                     n
factor       50      200     400     600     800
1            0.532   0.966   1.000   1.000   1.000
20           0.153   0.809   0.994   1.000   1.000
100          0.063   0.666   0.981   1.000   1.000
1,000        0.016   0.449   0.940   0.998   1.000
1,000,000    0.000   0.062   0.593   0.952   0.997
Example 8.4 illustrates probability calculations for the multiple LD coefficients which
occur when there are more than two alleles, and simulations to obtain the power to detect
LD between a marker and trait locus with a given Bayes factor. The sample sizes considered
by Nielsen and Weir, of 50 cases and 50 controls, had about 50% power to detect the marker
with p = 0.05. A sample size of 200 cases and 200 controls is required for power 80% to
detect the association with Bayes factor 20. To detect the associations in a genome scan
requires a Bayes factor of around 1,000,000, and a sample size of 600 cases and 600 controls.
As with other examples, to reliably detect the associations with the Alzheimer’s locus
(assuming its position was not already known) in a genome scan would require substantially
larger sample sizes than those indicated by traditional frequentist power calculations.
\[ B = \frac{\pi(\theta = 0)}{g(\theta = 0 \mid y)}, \tag{8.32} \]
where θ denotes the parameter being tested (here a), and g(θ = 0 | y) is the marginal
posterior density for θ. If there are additional parameters ψ, common to H0 , H1 , these are
integrated over to obtain the marginal posterior g(θ = 0 | y).
Recall that the Beta distribution is the conjugate prior for binomial sampling: if the
prior is Beta(a, b), and k successes are observed in n Bernoulli trials, the posterior is
Beta(a + k, b + n − k). Hence, the posterior for p1 under H1 is Beta(n12 +
0.5, n21 + 0.5). In Example 7.3, we have n12 = 39 and n21 = 86, so the posterior is
Beta(39.5, 86.5). The prior and posterior densities for p1 under H1 are
\[ \pi(p) = \frac{1}{B(0.5, 0.5)}\, p^{-0.5} (1 - p)^{-0.5}, \tag{8.35} \]
\[ g(p \mid n_{12}, n_{21}) = \frac{1}{B(39.5, 86.5)}\, p^{38.5} (1 - p)^{85.5}. \tag{8.36} \]
Recall the p-value from Chapter 7, Example 7.3 was $2.6 \times 10^{-5}$.
The density-ratio calculation is conveniently done using the R function dbeta()
(Figure 8.9).
Figure 8.9. R calculation of the Savage–Dickey density ratio for the Bayes factor.
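A minimal sketch of such a density-ratio calculation follows; it is not necessarily the code of Figure 8.9. It assumes that the null hypothesis corresponds to p1 = 0.5, so that the prior and posterior densities are both evaluated at 0.5, and it uses the counts n12 = 39 and n21 = 86 from Example 7.3.

```r
n12 <- 39; n21 <- 86
p0  <- 0.5                                      # value of p1 under H0 (assumed)

prior.dens <- dbeta(p0, 0.5, 0.5)               # Beta(1/2, 1/2) prior density
post.dens  <- dbeta(p0, n12 + 0.5, n21 + 0.5)   # Beta(39.5, 86.5) posterior density

B <- prior.dens / post.dens                     # Savage-Dickey density ratio
print(B)
```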
where
\[ V_i = \frac{a_i u_i \left[ 4 r_i (t_i - r_i - s_i) + s_i (t_i - s_i) \right]}{t_i^2 (t_i - 1)}. \tag{8.40} \]
Now let $Z_n$ be given by
\[ Z_n = T_n / \sqrt{n} \tag{8.41} \]
\[ \phantom{Z_n} = \frac{(Y_n - A_n)/n}{\sqrt{\bar{V}_n}} \sim N(0, 1/n) \quad \text{under } H_0, \tag{8.42} \]
where $\bar{V}_n = V_n / n$. Notice that the quantities in the numerator and denominator for
$Z_n$ are stable, i.e. estimating the same quantity, independent of n.
Under H1 the sampling variance for $Z_n$ is 1/n, and its estimate is the value of the
statistic. We take a prior for Z, the quantity that $Z_n$ is estimating under H1 , to be the
same as the sampling distribution for $Z_1$, i.e. N (0, 1). By construction, this is a nearly
non-informative prior with the same information as a single experimental unit. The posterior
distribution for Z under H1 is then given by
\[ z \mid Z_{\mathrm{obs}} \sim N\!\left( \frac{n}{n+1} Z_{\mathrm{obs}},\; \frac{1}{n+1} \right). \tag{8.43} \]
[Diagram of family trios:
Parents        Aa × aa   Aa × aa   Aa × AA   Aa × AA   ...
Offspring      Aa        aa        Aa        AA        ...
Transmission   T = 1     T = 0     T = 0     T = 1     ...]
If Q and A are not linked but are ‘spuriously’ associated due to allele frequency differences
between sub-populations, the recombination process will ensure that the transmissions of Q and
A from the heterozygous parent are independent, hence there will be no association (r =
0.5 in Equations (8.48), (8.49), and hence Pr(Q | T = 1) = Pr(Q | T = 0) = Pr(Q),
i.e. T and Q are independent). This eliminates completely spurious associations due to
population structure; however, some partially spurious associations between linked loci may
remain. These are associations where the recombination distance between marker and
QTL is less than 0.5 but still large compared to the resolution of the association mapping
experiment. Partially spurious associations, where r is less than 0.5 in Equation (8.48),
will reduce by a factor (1 − 2r) in magnitude, so could still be substantial for small to
moderate values of r, e.g. with r = 0.1, 0.2, the association is reduced by only 20 or
40%, respectively. These ‘small’ values of r correspond to genomic intervals which are
nevertheless large compared to the resolution of the association mapping experiment.
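The attenuation factor is simple to tabulate. The short sketch below evaluates (1 − 2r) for a few recombination fractions; the values of r are chosen for illustration only.

```r
r <- c(0, 0.01, 0.1, 0.2, 0.5)          # recombination fraction
attenuation <- 1 - 2 * r                # fraction of the association retained

print(data.frame(r = r,
                 retained = attenuation,
                 reduction.percent = 100 * (1 - attenuation)))
```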
The QTL allele transmitted from the homozygous parent will be random, reflecting
population allele frequencies, whether or not A is transmitted from the heterozygous parent,
hence will not contribute to the expected trait value conditional on transmission of A, but
will contribute to variability.
Note: In practice a number of markers will be tested. The heterozygosity condition can
be obtained for each marker by selecting a subset of families where the condition applies.
For families with more than one offspring, only one progeny can be selected at random.
Frequentist analysis
The TDT-Q1 is analysed with a standard t-test, the TDT-Q2 with a χ2 -test (Allison
1997, p. 678), and the TDT-Q3, with a modified t-test (Allison 1997, p. 679). The χ2 -test
tests for independence between transmission status and phenotype class (L for y < ZL and
U for y > ZU where ZL and ZU are the thresholds used for selective genotyping).
Sample power calculations from Allison (1997) are shown in Table 8.19. These are
based on a comparison-wise α value of 0.0001, noting the need
Note also that these values are based on the assumption that the marker locus is the trait
locus, so these values are upper bounds to the power.
The sample sizes shown are the number of families genotyped. Not surprisingly, the
TDT-Q2–Q5 designs using selective genotyping have more power per family genotyped.
Which design is more efficient for the end user depends on the relative costs of obtaining
and phenotyping the families. Additionally if multiple traits are being considered, the
advantage of selective genotyping reduces, as different subsets are needed for each trait.
Not surprisingly, the t-test is more powerful than the χ2 -test (TDT-Q2 versus TDT-Q3),
since the t-test requires more assumptions, while the phenotypic data are grouped into
categories for the χ2 -test.
Bayesian analysis
A full Bayesian model can be fitted to the data. Alternatively, where the t-test analysis
has been done, equivalent Bayes factors can be calculated using Equation (8.8), noting that
$F = t^2$ has an F distribution on 1, n − 1 d.f.
Table 8.19. Sample sizes required for 80% power of the TDT–Q1-Q5 tests from Allison (1997), for an additive
QTL, assuming a type I error rate of α = 0.0001, with QTL explaining 5, 10% of the phenotypic variance, and
allele frequencies p = 0.1, 0.3, 0.5.
Table 8.20. Equivalent Bayes factors for the TDT-Q1 tests from Table 8.19.
Equivalent Bayes factors for the TDT-Q1 with α as in Table 8.19 are shown in
Table 8.20.
The choice of α = 0.0001 has given some more respectable Bayes factors, but these
vary from 109 to 296, nearly a three-fold range depending on QTL heritability and
allele frequencies. These Bayes factors are still too low to use for genome scans or candi-
date genes without strong prior information (cf. Tables 8.6, 8.7, Datasets 1, 2).
We next illustrate the implementation and analysis of the full Bayesian model including
an MCMC sampler using the BUGS language (Spiegelhalter et al. 1995), with assumptions
as in TDT-Q1 for simulated data.
library(ldDesign)
sim.TDT <- function(h2q, N, p=0.5, mu=0, phi=0, Vp=1, Vq=h2q*Vp,
                    Ve=Vp-Vq){
  # simulate data from a TDT for biallelic marker at trait locus
  # Cf Allison 1997
  # h2q: QTL variance as a proportion of total
  # N: number of family trios
  # p: allele frequency for the allele 'A', (1-p) for allele 'a'
  # mu: population mean
  # phi: dominance proportion d=phi*a
  #      phi=0 for additive, phi=1 for complete dominance
  # Vp: total phenotypic variation
  # Vq: QTL variance
  # Ve: residual variance
  # set initial values for a and d
  # calculate QTL variance and scale to give required variance
  a0 <- sqrt(2*Vq)
  d0 <- phi*a0
  muq0 <- p^2*2*a0 + 2*p*(1-p)*(a0+d0)
  Vq0 <- p^2*(2*a0 - muq0)^2 + 2*p*(1-p)*(a0+d0 - muq0)^2 +
         (1-p)^2*(0 - muq0)^2
  sqrt.ratio <- sqrt(Vq/Vq0)
  a <- a0*sqrt.ratio
  d <- d0*sqrt.ratio
  family.type.levels <- c("Aa x aa", "Aa x AA")
  genotype.levels <- c("aa","Aa","AA")
  genotype.means <- c(mu,mu+a+d,mu+2*a)
  # sample family types by label, so that the comparison with
  # "Aa x aa" below is meaningful
  family.type <- sample(family.type.levels, size=N,
                        prob=c((1-p)^2,p^2), replace=TRUE)
  # transmission of allele A from the heterozygous parent
  transmissions <- rbinom(n=N,size=1,prob=0.5)
  progeny.genotypes <- ifelse(family.type=="Aa x aa",
                              ifelse(transmissions==1,"Aa","aa"),
                              ifelse(transmissions==1,"AA","Aa"))
  progeny.phenotypes <- genotype.means[match(progeny.genotypes,
                          genotype.levels)] + sqrt(Ve)*rnorm(N)
  list(progeny.phenotypes=progeny.phenotypes,
       transmissions=transmissions,
       progeny.genotypes=progeny.genotypes)
}
calc.bf.TDT <- function(data){
  # data: list (or data frame) as generated by sim.TDT()
  # one-way ANOVA of phenotype on transmission status
  summ1 <- summary(aov(progeny.phenotypes ~ transmissions,
                       data=data))
  ns <- table(data$transmissions)
  N <- sum(ns)
  F.value <- summ1[[1]]$"F value"[1]
  list(N=N,ns=ns,F.value=F.value,B=SS.oneway.bf(group.sizes=ns,
       Fstat=F.value))
}
Figure 8.11. R functions for simulating TDT data, and calculating the Spiegelhalter and Smith Bayes factors
using the R function SS.oneway.bf() from ldDesign.
µAa = µ + a + d, µaa = µ. Let Ci denote the event that a family meets the selection crite-
ria, Ti = 1, Ti = 0 denote the event that the allele A is transmitted (resp., not transmitted),
from the heterozygous parent, and yi denote phenotype of the ith offspring.
Step 1. Write down the model. The key is to note that the TDT ignores the family
genotypes, and looks at transmission only, and that transmission occurs with probability
0.5 and is independent of family type. Conditional on the heterozygous parent being Aa,
the legal family types Aa × aa and Aa × AA occur with probability $(1 - p)^2$, $p^2$,
respectively.
The likelihood is
\[ f(y \mid \mu, a, d, \sigma_e^2) = \prod_{i=1}^{n} f(y_i \mid T_i), \tag{8.50} \]
where
where
Note: Equations for $\mu_0$, $\mu_1$, $\sigma_0^2$, $\sigma_1^2$ are given in Allison (1997,
p. 677). Our equations may differ due to differing conventions and a possible error in the
equation for $\mu_{Y_1}$ (here denoted by $\mu_1$) there.
Step 2. Represent the hierarchical model as a graphical model. The graphical model
is shown in Figure 8.13. Points to note are:
1. The graph is a directed graph, with the convention that the arcs are directed
downwards, i.e. parent nodes are located above their descendants. The probability
distribution of the variable(s) at a node needs to be given as a function of its parent nodes.
Parameter values, for two nodes which are not direct descendants of each other, are
conditionally independent, given the values of their common ‘ancestors’ in the graph.
2. Parameters specific to individual observations (here the transmission status and
phenotypic value for the offspring of a family trio) are located within the lower box. The
nested boxes signify multiple pages, with one page for each datum.
3. To simplify the diagram, parameters $\mu_{AA}$, $\mu_{Aa}$, $\mu_{aa}$ have been grouped
together in a single node, as have parameters $\mu_0$, $\mu_1$, $\sigma_0^2$, $\sigma_1^2$.
4. BUGS uses precisions (reciprocal of variances) as parameters instead of variances.
5. Parameters $\eta_i$, $\tau_i$ are the means and precisions for the ith observation, given by
\[
\eta_i = \begin{cases} \mu_0 & \text{if } T_i = 0 \\ \mu_1 & \text{if } T_i = 1 \end{cases}
\quad \text{and} \quad
\tau_i = \begin{cases} 1/\sigma_0^2 & \text{if } T_i = 0 \\ 1/\sigma_1^2 & \text{if } T_i = 1 \end{cases}. \tag{8.57}
\]
Step 3. Write down the distributions of nodes in the graph in terms of their ‘parents’.
For example the distribution of yi conditional on its parents is normal with mean ηi and
precision τi .
Top level nodes, i.e. those with no parents are given prior distributions, obviously not
involving any other variables.
Step 4. Implement the Gibbs sampler in BUGS code. BUGS code is shown in Figure
8.14. We do not describe the BUGS language in detail, only essential aspects of our code,
referring the reader to the BUGS manual for further information.
The BUGS code consists of initial declarations of variables and constants, specifying
the data file, and optional file of initial values, followed by the main body of the program,
where the distribution of each variable in the graphical model is specified in terms of the
values of its parents. Distributions are specified viz
tau.e ~ dgamma(1.0,1.0)I(0.7,1.3);
meaning that tau.e has a gamma distribution with shape and rate parameters 1, 1,
respectively. The optional I(0.7,1.3) notation restricts the distribution to the interval
(0.7,1.3), allowing BUGS to use Metropolis sampling. This was required for tau.e
because BUGS could not otherwise choose an update method.
Our default priors for µ and a were normal with mean 0 and precision 1, i.e. similar
to the precision of the phenotypic distribution.³ However, even with Metropolis sampling,
BUGS could not choose an updating method for µ and a with these priors so these variables
were discretised into 21 steps and, for convenience, given a uniform prior on the discrete
values. In order to make best use of all 21 steps in the discretisations, the discretisations
were adapted so as to extend slightly beyond the ranges of the posterior distributions from
an initial run. The discretisation for µ is implemented with the BUGS parameters imu
(categorical index), and pmu (probabilities for each index value) and mu.values
(corresponding values for µ), and similarly for a, with parameters ia, pa and a.values,
with values defined in the data file.
³ With actual data we may have more informative priors.
[Diagram: directed graph with top-level nodes µ, a, d, p and τe = 1/σe²; intermediate nodes
(µAA , µAa , µaa ) and (µ0 , µ1 , σ0², σ1²); and, within the box for the ith observation,
nodes Ti , ηi , τi and yi .]
Figure 8.13. Graphical representation of a Bayesian hierarchical model for a TDT model (TDT-Q1) for
quantitative traits.
The intermediate parameters µAA , µAa , µaa , µ0 , µ1 , σ0², σ1², ηi , τi (present mainly for
convenience and readability), assigned with ‘<-’, are deterministic nodes, since they are
completely determined by their parents.
Step 5. Run the sampler, and examine the output.
The Gibbs sampler was run for 60,000 iterations. Each iteration took approximately
1 s on a Linux machine (Kernel 2.4, with a 2.4 GHz Pentium processor). The output was
examined using coda (Spiegelhalter et al. 1995; Plummer et al. 2005), an R (or Splus)
package for BUGS output diagnosis and analysis.
Graphs of BUGS output, for parameters µ, a, τe , and derived parameters h2Q are shown
in Figure 8.16.
The left-hand column of figures shows the trace of sampler estimates for iterations
1,001–60,000. The solid traces indicate frequent visits of the sampler to high and low
values of the variables, indicating good mixing. The right-hand column of figures shows
density estimates for the marginal posterior distribution of each parameter. The small bumps
in the density estimate for µ and a are an artefact of the discretisations of these variables.
Graphs were produced using coda. The density estimates were obtained by coda using
kernel smoothing. A number of diagnostics are provided in coda. The Raftery and Lewis
diagnostics (Raftery and Lewis 1992, 1995) are shown in Figure 8.15, calculated using the
R function raftery.diag() in coda.
A run of at least 3,746 is indicated if it is desired to estimate the q = 2.5% quantiles of
the distributions with an accuracy of r = ± 0.5%, with probability 0.95. A run of at least
1,377 is needed to estimate the 50% quantiles with an accuracy of ± 5%, with probability
0.95.
Note: In Figures 8.15 and 8.16 only the variables µ, a, τe , h2Q are shown. In general,
it is important to examine all variables for convergence. The mean parameter µ is not of
particular interest here, however in our experience problems with convergence are often
apparent from values of µ, since µ enters the likelihood for every observation, particularly
if prior distributions or initial values are poorly specified.
Note: No diagnostic can guarantee convergence of an MCMC sampler. Apparent
convergence can persist for a large number of iterations in pathological cases or complex
problems. MCMC samplers are best used by statisticians with a good intuitive grasp of the
models and parameterisations being used. In most cases convergence problems can be
overcome by various techniques, e.g. re-parameterising or block updates, which are
beyond our scope. Under general conditions the Gibbs sampler can be shown to converge
geometrically (see, e.g. Tierney 1994), and some authors recommend formally proving
convergence for each sampler. However, the geometric convergence can still be extremely
slow, and most well-constructed MCMC samplers converge orders of magnitude faster than
the theoretical bounds. Where these techniques fail is well beyond the realm of traditional
asymptotic statistical methods.
> raftery.diag(bugs1[,c("mu","a","tau.e","h2.q")])
Figure 8.15. Raftery and Lewis diagnostics for the TDT-Q1 BUGS output from an initial run of 1,000 iterations.
Summary statistics for the marginal posterior distributions of parameters are shown in
Figure 8.17. The sem column shows standard errors of the estimated posterior means,
calculated naively based on variance of the sampler output. These may overstate the precision
because successive samples are auto-correlated. The effects of auto-correlation can be
reduced by calculating standard errors based on batch means, where a batch is a group of
successive iterations (Roberts 1996). The standard errors, re-calculated based on batch means
for batches of size 100, are obtained using the batchSE() function in Figure 8.17.
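The idea behind batch-means standard errors can be illustrated in base R without BUGS output. The sketch below simulates an autocorrelated AR(1) series as a stand-in for sampler output (an assumption made purely for illustration) and compares the naive standard error with one based on batches of 100. The function batch.se() is a hypothetical helper written for this sketch; it is not the batchSE() function used in Figure 8.17.

```r
set.seed(1)
n <- 10000
# AR(1) series with strong positive autocorrelation, standing in for MCMC output
x <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))

# Naive standard error of the mean, ignoring autocorrelation
naive.se <- sd(x) / sqrt(n)

# Standard error based on the means of successive batches
batch.se <- function(x, batch.size = 100) {
  n.batch <- floor(length(x) / batch.size)
  x <- x[seq_len(n.batch * batch.size)]            # drop any incomplete batch
  batch.means <- colMeans(matrix(x, nrow = batch.size))
  sd(batch.means) / sqrt(n.batch)
}

print(c(naive = naive.se, batch = batch.se(x)))
```

For a positively autocorrelated series the batch-means value is the larger of the two, reflecting the smaller effective sample size.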
Finally, from the MCMC output we estimate the Bayes factor for comparing the models
H1 (with a) and H0 (with a = 0), as a sub-model. Recall that the Savage–Dickey Bayes
factor estimate (Dickey 1971) is given by the ratio of prior to posterior densities at 0
\[ B = \frac{\pi(\theta = 0)}{f(\theta = 0 \mid y)}, \tag{8.58} \]
where θ denotes the parameter being tested (here a), and f (θ = 0 | y) is the marginal
posterior density for θ.
The density f (θ = 0 | y) can be estimated from BUGS output since
by definition of f (·) as a probability density. The choice of ε should be small enough to give
[Plot panels: Trace of µ, Density of µ; Trace of a, Density of a; Trace of τe , Density of τe .]
Figure 8.16. BUGS output for simulated TDT-Q1 data. Data were simulated with N = 873 families with one
parent heterozygous for a bi-allelic marker locus coincident with an additive QTL locus with effect a = 0.31
corresponding to 5% of the variation, and allele frequency 0.5. This sample size had power 0.8 to detect an effect
with α = 0.0001 in Table 8.19, and B = 109 in Table 8.20. The sampler was run for 60,000 iterations. The first
1,000 iterations were removed as ‘burn in’ and, to reduce the amount of data for plotting, only every 20th iteration
is plotted.
178 R. D. BALL
> stats <- function (x, na.rm = T, quants = c(0.025, 0.25, 0.5,
+     0.75, 0.975)){
+   if (na.rm) x <- x[!is.na(x)]
+   if (length(x) > 0) {
+     # sem: naive standard error of the mean, ignoring autocorrelation
+     c(mean = mean(x), stdev = sd(x), sem = sd(x)/sqrt(length(x)),
+       quantile(x, quants))
+   }else NA
+ }
# 1,000 iterations
> t(apply(as.matrix(run1.1k[,c("mu","a","tau.e","h2.q")]),2,
+ stats))
mean stdev sem 2.5% 25% 50% 75% 97.5%
mu 0.0516 0.0760 0.002404 -0.09000 0.0000 0.0600 0.1200 0.1807
a 0.2571 0.0703 0.002222 0.12500 0.2000 0.2500 0.3000 0.4000
tau.e 1.0356 0.0525 0.001661 0.93952 0.9993 1.0347 1.0730 1.1377
h2.q 0.0353 0.0179 0.000567 0.00756 0.0207 0.0325 0.0461 0.0755
> batchSE(run1.1k[,c("mu","a","tau.e","h2.q")])
mu a tau.e h2.q
0.00579 0.00552 0.00140 0.00135
# 10,000 iterations
> t(apply(as.matrix(run1.10k[,c("mu","a","tau.e","h2.q")]),2,
+ stats))
mean stdev sem 2.5% 25% 50% 75% 97.5%
mu 0.0465 0.0756 0.000756 -0.09000 0.0000 0.0600 0.0900 0.1800
a 0.2611 0.0686 0.000686 0.12500 0.2250 0.2500 0.3000 0.4000
tau.e 1.0357 0.0518 0.000518 0.93659 0.9996 1.0341 1.0705 1.1383
h2.q 0.0362 0.0177 0.000177 0.00784 0.0239 0.0338 0.0469 0.0767
> batchSE(run1.10k[,c("mu","a","tau.e","h2.q")])
mu a tau.e h2.q
0.002048 0.001883 0.000583 0.000480
# 60,000 iterations
> t(apply(as.matrix(run1.60k[,c("mu","a","tau.e","h2.q")]),2,
+ stats))
mean stdev sem 2.5% 25% 50% 75% 97.5%
mu 0.0473 0.0743 3.06e-04 -0.09000 0.0000 0.0600 0.0900 0.1800
a 0.2604 0.0676 2.78e-04 0.12500 0.2250 0.2500 0.3000 0.4000
tau.e 1.0360 0.0513 2.11e-04 0.93832 1.0007 1.0347 1.0699 1.1397
h2.q 0.0360 0.0175 7.19e-05 0.00792 0.0239 0.0333 0.0462 0.0758
> batchSE(run1.60k[,c("mu","a","tau.e","h2.q")])
mu a tau.e h2.q
0.000860 0.000778 0.000247 0.000199
Figure 8.17. Calculation of summary statistics for parameters from TDT-Q1 BUGS output.
a good approximation to f(θ = 0 | y) in Equation (8.59) but large enough so that there are
a reasonable number of posterior samples less than ε with which to estimate the probability
Pr(0 ≤ θ ≤ ε | y) on the right-hand side of the equation. If necessary, more iterations of the
sampler can be run to enable this. The calculation in R, giving a Bayes factor of B = 89.7,
is shown in Figure 8.18.
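The following is a minimal sketch of this kind of calculation, not the code of Figure 8.18. The function name savage_dickey, the stand-in posterior draws and the N(0, 1) prior are all assumptions made for illustration; they are not the TDT-Q1 sampler output or the prior used in the example.

```r
# Savage-Dickey estimate: B = prior density at 0 / posterior density at 0,
# with the posterior density approximated by Pr(0 <= theta <= eps | y) / eps.
savage_dickey <- function(theta, prior_density_at_0, eps) {
  n_near_0 <- sum(theta >= 0 & theta <= eps)
  if (n_near_0 == 0) {
    stop("no posterior samples in [0, eps]; run more iterations or increase eps")
  }
  post_density_at_0 <- (n_near_0 / length(theta)) / eps
  prior_density_at_0 / post_density_at_0
}

# Hypothetical stand-in for MCMC draws of the tested parameter
set.seed(1)
theta <- rnorm(60000, mean = 0.15, sd = 0.07)
# Hypothetical prior: theta ~ N(0, 1), so the prior density at 0 is dnorm(0)
B <- savage_dickey(theta, prior_density_at_0 = dnorm(0, 0, 1), eps = 0.01)
B
```

Re-running with a few values of eps gives a quick check of how sensitive the estimate is to the trade-off between approximation error and the number of samples near 0.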
Summary
This section considers the TDT test for continuous data.
Bayes factors were estimated from the F-values (F = t²) from the t-test for an effect
of transmission, using the Spiegelhalter and Smith method.
Bayes factors corresponding to α = 0.0001, for the designs considered by Allison
(1997) varied from 109 to 296. These were much more respectable Bayes factors than
those corresponding to α = 0.05, but not large enough to give high posterior probabilities
for markers from a genome scan.
Example 8.7 illustrates a full Bayesian analysis for the TDT test for simulated data
with n = 873 trios.
A Bayesian hierarchical model was fitted using Gibbs sampling, an MCMC method
that generates a sample approximately from the posterior distribution. A Bayesian
graphical model was constructed (Figure 8.13). Distributions for parameters in terms of their
parents were coded in BUGS language (Figure 8.14).
Posterior estimates of interest, e.g. posterior means and standard deviations, are easily
obtained from the Gibbs sampler output. Marginal distributions for a set of one or more
parameters are obtained by simply ignoring the other parameters.
Diagnostics and posterior summary statistics were obtained from the Gibbs sampler
output using the R CODA package. The Bayes factor was estimated from the Gibbs sampler
output using the Savage–Dickey density ratio.
The Bayes factor calculated using the Savage–Dickey density ratio (89.7) from the
computationally intensive MCMC sampler output was similar to that obtained by the easy
to compute Spiegelhalter and Smith method (93.1). This is consistent with our experience
in other problems where the amount of information in the prior for the parameter being
tested is comparable to the information in one data point. Of course, the MCMC output
can be used to compute other useful information such as distributions of parameters, and
predictions of genetic gain.
The Bayes factor of 89.7 represents strong evidence for an effect, but not strong enough
to overcome low prior odds in a genome scan. Readers interested in fitting Bayesian models
using MCMC are advised to study this example, and the examples provided with BUGS,
in detail; most of the methods also apply to other designs and models considered in this
chapter.
Similarly
Pr(pop2 | A) = 0.9 × 0.5/0.5 = 0.9, (8.64)
where the second equality uses the assumed within population independence of Q and A.
which is a substantial level of LD. However, minor allele frequency differences lead to only
small amounts of LD.
Pritchard et al. (2000a, b) give Bayesian methods for testing and allowing for
population structure, where the population may be stratified into several sub-populations. The
number of sub-populations and the assignment of individuals to sub-populations are
unknown. Information on population structure is obtained from a set of unlinked auxiliary
markers.
The Bayesian approach is to simulate from the probability distribution of possible
sub-populations. Each individual in the sample is assigned a set of unknown parameters
representing the proportions of the individual’s alleles coming from each population. The
MCMC sampler is generated by sampling from the conditional distributions of each of
these parameters in turn. These conditional distributions are related to the probability of
belonging to a sub-population given the values of the auxiliary markers. The number
of sub-populations is also allowed to vary using a ‘reversible jump’ MCMC technique
(Green 1995).
This is the approach taken in the Structure method (Pritchard et al. 2000b):
1. If the population structure is known, the population S can be divided into k sets Si,
each without structure:
$$S = \bigcup_{i=1}^{k} S_i. \tag{8.68}$$
In this case the analysis can take the population substructure into account, e.g. by
allowing for different allele frequencies among populations.
2. In the case where the population structure is unknown, i.e. the Si above are unknown,
but k is known in Equation (8.68), a Bayesian approach is used where additional
indicator parameters indicate which of the subsets Si each individual belongs to. The
distribution of parameters is obtained using MCMC.
3. The general case, where k is also unknown, is modelled using reversible jump Markov
Chain Monte Carlo (RJMCMC; Green 1995). For each value of k there is a different
model, as per case 2, and the model dimension varies with k. RJMCMC constructs
‘jumps’ between models, and assuming the sampler converges, gives a sample from
the joint distribution of all models (sampled according to their posterior probabili-
ties), and of parameters within models.
In the ‘STRAT’ test (Pritchard et al. 2000b, for case–control data, generalised by
Thornsberry et al. 2001 for quantitative traits) a likelihood ratio statistic is constructed
from the MCMC output. This stops short of a fully Bayesian approach.
More generally, in a fully Bayesian approach, for each possible population structure
from the MCMC output, the within sub-population disequilibrium estimates can be ob-
tained and the results averaged over possible population substructures according to their
posterior probabilities. An important point to note is that the population structure and
membership may not be determined uniquely. A fully Bayesian approach would take this
uncertainty into account by giving probabilities for membership in each sub-population,
e.g. an individual may be in sub-populations S1 , S2 , S3 with probabilities 0.3, 0.2, 0.5,
respectively.
Marker         M1            M2            M3            M4                 M5
Allele      a1    a2      a1    a2      a1    a2      a1    a2      a1     a2     a3
pop1        0.9   0.1     0.8   0.2     0.7   0.3     0.6   0.4     0.16   0.20   0.64
pop2        0.1   0.9     0.2   0.8     0.3   0.7     0.4   0.6     0.20   0.00   0.80
Table 8.22. Posterior statistics for linkage disequilibrium coefficients. A population of size 200 was simulated
with independent values for markers and QTLs within each sub-population. Allele frequencies within each sub-
population were simulated as in Table 8.21. Linkage disequilibrium between the markers and corresponding QTL
is shown for the combined population (pop1 ∪ pop2), for each sub-population separately (pop1, pop2), and for
the sub-populations estimated by structure (pop1 (est.), and pop2 (est.)).
D̂ (95% c.i.)
              M1–Q1                     M2–Q2                      M3–Q3
pop1 ∪ pop2   0.160 (0.140, 0.18)       0.0900 (0.070, 0.11)       0.040 (0.02, 0.07)
pop1          −0.001 (−0.010, 0.01)     −0.0004 (−0.020, 0.02)     −0.001 (−0.03, 0.03)
pop2          −0.007 (−0.014, 0.002)    0.0300 (0.005, 0.06)       −0.004 (−0.03, 0.02)
pop1 (est.)   0.004 (−0.007, 0.02)      0.0300 (0.006, 0.06)       −0.005 (−0.03, 0.02)
pop2 (est.)   0.006 (−0.005, 0.02)      0.0030 (−0.020, 0.03)      0.002 (−0.03, 0.03)
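The size of the LD generated purely by mixing can be checked with the standard two-population admixture result: if there is no LD within either sub-population, the disequilibrium in the mixture is D = w(1 − w)(p1 − p2)(q1 − q2), where w is the proportion from pop1 and p, q are the marker and QTL allele frequencies. The sketch below is illustrative; it assumes equal mixing (w = 0.5) and that each QTL has the same allele frequencies as its marker, neither of which is stated in the text reproduced here. The function name admixture_D is likewise hypothetical.

```r
# Expected LD in a mixture of two sub-populations with no LD within either.
# w: proportion from pop1; p1, p2: marker allele frequencies; q1, q2: QTL allele frequencies.
admixture_D <- function(p1, p2, q1, q2, w = 0.5) {
  w * (1 - w) * (p1 - p2) * (q1 - q2)
}

p_pop1 <- c(M1 = 0.9, M2 = 0.8, M3 = 0.7)   # frequency of allele a1 in pop1
p_pop2 <- c(M1 = 0.1, M2 = 0.2, M3 = 0.3)   # frequency of allele a1 in pop2
# Assumption: each QTL has the same allele frequencies as its marker
D <- admixture_D(p_pop1, p_pop2, q1 = p_pop1, q2 = p_pop2)
round(D, 4)
```

The resulting values can be compared with the pop1 ∪ pop2 row of Table 8.22, and show how quickly the induced LD falls as the allele frequency difference between sub-populations shrinks.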
Summary
We have seen that LD can be generated by population structure, and that the population
structure analysis methods (Pritchard et al. 2000b; Thornsberry et al. 2001) can be effec-
tive at removing effects of population structure by estimating sub-population membership
probabilities using a set of preferably unlinked control markers.
There are some caveats to the population structure analysis. A full Bayesian analysis
is not yet available – the current methods give a likelihood ratio test, which gives a p-value,
which as we have seen is not a reliable measure of evidence for an association.
Many plant breeders will have access to populations where pedigree information is
available. This material cannot be regarded as an independent population sample because
individuals are related. However, these populations may still contain LD useful for associ-
ation studies.
The methods in this subsection take into account relatedness between individuals in
the analysis of marker–trait associations from a known pedigree. The methodology uses
mixed models to allow for correlation between haplotype effects, with correlation
structure based on IBD probabilities, and also to allow for polygenic effects, with covariance
structure given by the additive relationship matrix from quantitative genetics. In so
doing, the methods combine linkage and linkage disequilibrium information. The linkage,
or ‘QTL mapping’, information is generated by recombinations within the pedigree,
detected by marker genotypes of parents and their offspring. The LD, or association mapping,
information is generated by ancestral recombinations, and detected by population level
associations between individuals.
The effectiveness of a pedigree population for LD mapping depends on the effective
population size of the pedigree. The pedigree will probably be recorded for relatively few
generations, and if it was formed from only a few founders the effective sample size for
detecting population level LD is no larger than the number of founders. For example, a
large single family provides no significant population level LD information, since there are
effectively only two parents sampled from the population, and offspring will replicate most
of the parental chromosomes in large blocks.
Incorporating polygenic random effects in the model via the additive relationship
matrix effectively controls for population structure within the pedigree (Sillanpää and
Bhattacharjee 2005). The pedigree analysis may, however, still be affected by spurious
associations from population structure present when the founders were obtained.
Relatedness between the founders would probably be unknown, and still needs to be checked
and/or controlled by the methods of Section 8.3.5. Such structure might arise if a breeding
population was obtained from material taken from several native provenances, as is the case
for P. radiata. If individuals’ ancestry cannot be traced back to the provenances, the genomes of
currently growing trees may be a mixture of provenances, with unknown mixing
probabilities, which can be estimated by the program structure.
Frequentist analysis
Meuwissen et al. (2002) use combined linkage disequilibrium and linkage information
to fine map a QTL in cattle in a known pedigree. They fit a mixed model
$$y = \mu + Zh + u + e, \tag{8.69}$$
$$h \sim N(0, G\sigma_h^2), \tag{8.70}$$
$$u \sim N(0, A\sigma_u^2), \tag{8.71}$$
$$e \sim N(0, \sigma_e^2), \tag{8.72}$$
where μ is the overall mean, h are random haplotype effects, u are random polygenic effects
and e are residual errors. The haplotypes are based on markers close to the QTL locus. An
‘infinite alleles model’ is assumed so that each haplotype potentially has a different effect.
Note that each individual has two haplotypes. The number of haplotypes is therefore greater
than the number of individuals, so haplotype effects cannot be estimated independently.
However, haplotype effects are identical if the corresponding QTL alleles are IBD, so
haplotype effects are correlated, with correlation matrix G given by the IBD probabilities
for the QTL. The correlation between polygenic effects is
given by the ‘additive relationship matrix’ A (Falconer and Mackay 1996).
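For readers unfamiliar with the additive relationship matrix, the following sketch builds A for a small pedigree using the standard tabular (recursive) method. It is an illustration only: the function name additive_relationship, the coding of unknown parents as 0, the requirement that parents are listed before their offspring, and the five-individual pedigree are all assumptions made for the example.

```r
# Tabular method for the additive relationship matrix A (sketch).
# sire, dam: integer indices of each individual's parents, 0 = unknown.
# Assumes individuals are ordered so that parents precede their offspring.
additive_relationship <- function(sire, dam) {
  n <- length(sire)
  A <- matrix(0, n, n)
  for (i in seq_len(n)) {
    s <- sire[i]
    d <- dam[i]
    # inbreeding coefficient: half the relationship between the parents
    f <- if (s > 0 && d > 0) 0.5 * A[s, d] else 0
    A[i, i] <- 1 + f
    for (j in seq_len(i - 1)) {
      # relationship with an earlier individual: average over the two parents
      a <- 0
      if (s > 0) a <- a + 0.5 * A[j, s]
      if (d > 0) a <- a + 0.5 * A[j, d]
      A[i, j] <- a
      A[j, i] <- a
    }
  }
  A
}

# Hypothetical pedigree: 1 and 2 are unrelated founders, 3 and 4 are full sibs,
# and 5 is the offspring of the full-sib mating 3 x 4
ped_sire <- c(0, 0, 1, 1, 3)
ped_dam  <- c(0, 0, 2, 2, 4)
A <- additive_relationship(ped_sire, ped_dam)
A
```

A matrix of this form is what enters Equation (8.71); for large pedigrees specialised software would normally be used instead of this double loop.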
The IBD probability calculation is based on Meuwissen and Goddard (2001):
– The calculation is different for each putative QTL locus. Haplotypes can be based
on up to around 15 closely spaced markers around the putative QTL locus.
– For base haplotypes (first generation genotyped) Meuwissen and Goddard (M&G)
use a modified coalescent to estimate IBD probabilities. Briefly, assume an effective
population size of Ne , and T generations of random mating. Either simulate the
coalescent (M&G 2000), or use the analytical formulae from M&G (2001). In a
given simulated coalescent, haplotypes are considered IBD if they have coalesced
within the T generations. IBD probabilities for a pair of haplotypes are estimated as
the proportion of simulated coalescents where the haplotypes coalesced.
– For subsequent generations, estimate IBD probabilities using parental and marker information.
Similar to interval mapping QTL approaches (Lander and Botstein 1989), the analysis
is repeated for each putative QTL position and likelihood ratios calculated at each position.
A p-value is obtained by referring the likelihood ratio statistic to its sampling distribution
under the null hypothesis of no effect. As demonstrated in previous sections, there are
problems with the interpretation of p-values.
Note:
1. Meuwissen et al. fitted their model using ASREML (Gilmour et al. 2000). Analy-
sis using the publicly available nlme R package is also possible, by forming the
Choleski decomposition of the matrices G and A, and incorporating the Choleski
factor into the Z-matrices, effectively transforming the sets of random effects to inde-
pendent random effects, enabling the model to be fitted using the standard nlme co-
variance matrix classes as in Figure 8.19. This technique is used in the lmeSplines
R package (Ball 2003).
library(nlme)
# QTL analysis
# given: trait y, G, A matrices
# calculate Z matrices for paternal and maternal haplotypes
# individual: a factor coding individual animals or plants
Zhp <- model.matrix(~ individual - 1)
Zhm <- model.matrix(~ individual - 1)
# Choleski factor for G: chol() returns upper-triangular Rg with t(Rg) %*% Rg == G
Rg <- chol(G, pivot = FALSE)
Zh <- cbind(Zhp, Zhm) %*% t(Rg)
# Choleski factor for A, transposed so that Za %*% t(Za) == A
Ra <- chol(A, pivot = FALSE)
Za <- t(Ra)
# single-level grouping factor, so that all observations form one group
all <- factor(rep(1, length(y)))
# model with polygenic effects only
fit0 <- lme(y ~ 1, random = list(all = pdIdent(~ Za - 1)))
# model with QTL plus polygenic effects
fit1 <- lme(y ~ 1, random = list(all = pdBlocked(list(
    pdIdent(~ Zh - 1), pdIdent(~ Za - 1)))))
# compare models, LR test etc.
anova(fit0, fit1)
Figure 8.19. R code for mixed model QTL analysis (8.69)–(8.72) combining linkage and linkage disequilibrium.
2. The major computational difficulty in fitting the mixed models is evaluating the in-
verse of the matrix A, for large pedigrees.
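The Choleski transformation used for Zh in Figure 8.19 can be checked numerically. The sketch below uses a small made-up correlation matrix standing in for G and a trivial design matrix; both are hypothetical values chosen only to show that independent random effects with the transformed design matrix reproduce the covariance Z G Z'.

```r
# Small positive-definite correlation matrix standing in for G (hypothetical values)
G <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.3,
              0.2, 0.3, 1.0), nrow = 3, byrow = TRUE)
Z <- diag(3)            # trivial design matrix for the illustration
R <- chol(G)            # upper triangular, with t(R) %*% R equal to G
Zstar <- Z %*% t(R)     # transformed design matrix
# If v has independent unit-variance elements, Zstar %*% v has covariance
# Zstar %*% t(Zstar), which should equal Z %*% G %*% t(Z)
max_diff <- max(abs(Zstar %*% t(Zstar) - Z %*% G %*% t(Z)))
max_diff                # zero up to rounding error
```

This is why the correlated effects h in Equation (8.70) can be fitted with an identity covariance class once the Choleski factor has been absorbed into the design matrix.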
Bayesian analysis
The mixed model of Equations (8.69)–(8.72) is almost Bayesian in that random effects
have probability distributions. Making it a full Bayesian model requires only specifying
priors on the variance components σh², σu², σe². As in previous sections, Bayes factors and
posterior probabilities are used for inference. An MCMC sampler can be generated and
Bayes factors for comparing models and posterior probabilities can be calculated
(cf. Section 8.3.4).
Note: There are some similarities between this approach and the ‘BLADE’ method
(Liu et al. 2001, discussed in Section 8.3.2 above). Meuwissen and Goddard (2000, 2001)
simulate or calculate IBD probabilities based on possible ancestral genealogies and use
the IBD probabilities in a mixed model analysis, while Liu et al. simulate possible
ancestral genealogies from a coalescent process, with inference based on analysis of each of
the simulated genealogies. The mixed model approach has the advantage of being able to
incorporate pedigree information and control for population structure, but the
disadvantage of using fixed estimates of IBD probabilities in the mixed model. This means one is
effectively conditioning on the IBD probabilities being the true values in the mixed model
analysis. This is the price paid for the convenience of using a more standard mixed model,
with easier implementation in R or BUGS. The full Bayesian coalescent-based model
conditions on the population assumptions inherent in the coalescent, as does the Meuwissen
and Goddard IBD estimation, but not on possible values of IBD probabilities that might be
consistent with these assumptions.
Summary
A sample from a known pedigree combines QTL and LD mapping information in a sin-
gle dataset. Mixed model analysis for a pedigree combines haplotype effects (at or around
a single locus), with a correlation structure based on IBD probabilities, and polygenic ef-
fects. The effectiveness of the pedigree sample for LD mapping depends on the breadth
and sample size of individuals from which the pedigree was founded.
Incorporating polygenic effects via the additive relationship matrix controls for pop-
ulation structure generated within the pedigree, but not for population structure when the
founders were chosen, since relatedness between the founders is probably unknown. Pop-
ulation structure analysis on the founders is recommended.
For further information on models combining pedigree and LD information, see Wu
and Zeng (2001), Wu et al. (2002), Farnir et al. (2002), Perez-Enciso (2003), Fan and Jung
(2003), Lund et al. (2003), Meuwissen and Goddard (2004) and Lee and Van der Werf
(2005). Other approaches to calculation of IBD probabilities include Heath (1997, 2002),
(a stochastic MCMC method for use in large pedigrees, available as a software package
Loki), Pong-Wong et al. (2001) and Gao and Hoeschele (2005) (deterministic methods).
The deterministic methods are faster but are approximate, and/or ignore uncertainty in
haplotypes.
Table 8.23. Sample sizes and amount of genotyping required to locate a QTL when searching the genome using
QTL and LD mapping combined. Assume there are 10 QTL explaining 5% of the variation, D = 0.1 or D = 0.2,
allele frequencies 0.5 for QTL and marker for closest marker to the trait locus, a genome of 3 × 10^9 bases, extent
of LD 6 kb, 500,000 SNP markers available at a spacing of 6 kb, giving prior probabilities per marker of 1/50,000.
The QTL mapping results assume there are 12 chromosomes and 20 markers per chromosome at a spacing of
10 cM. Results are given for an overall posterior probability of 0.9 for an association.
has not yet been done we can design the LD experiment with given power to obtain a suf-
ficiently high Bayes factor to obtain a reasonably high posterior probability after the LD
analysis. Next, we apply this approach to locating small effect QTL.
Table 8.23 shows results for sample sizes and amount of genotyping required to detect a
QTL explaining 5% of the variation of a trait. Results are given for various sizes (nQTL =
100, 400, 1,000, 3,000) of the QTL mapping family. For each family size the average
standard error (se(x̂)) of the estimate of QTL location was calculated by simulation of an
additive QTL. The QTL interval was assumed to be two standard deviations either side of
the estimate, although smaller values could be considered and may be more cost-effective,
at the risk of losing some QTL. The number of SNPs within the QTL region was calculated
and average prior odds per SNP were determined from this. Then the Bayes factor needed
to obtain the required posterior probability of 0.9 was calculated, and the sample sizes (nLD)
for this were calculated using the R function ld.design() from the ldDesign package
(Ball 2004, 2005).
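The step from prior odds to the required Bayes factor follows from posterior odds = Bayes factor × prior odds. A minimal sketch, in which the function name required_bf and the 500-SNP QTL region are hypothetical and the per-marker genome-scan priors are those quoted in the table captions:

```r
# Bayes factor needed to reach a target posterior probability from a given
# prior probability, using posterior odds = Bayes factor x prior odds.
required_bf <- function(prior_prob, posterior_prob = 0.9) {
  prior_odds <- prior_prob / (1 - prior_prob)
  posterior_odds <- posterior_prob / (1 - posterior_prob)
  posterior_odds / prior_odds
}

bf_scan_6kb  <- required_bf(1 / 50000)   # genome scan, prior 1/50,000 per marker
bf_scan_60kb <- required_bf(1 / 5000)    # genome scan, prior 1/5,000 per marker
# Hypothetical QTL region containing 500 SNPs and one QTL: prior 1/500 per SNP
bf_region    <- required_bf(1 / 500)
c(scan_6kb = bf_scan_6kb, scan_60kb = bf_scan_60kb, region = bf_region)
```

Restricting attention to a QTL region raises the prior odds per SNP, so the Bayes factor that the LD sample must deliver falls in proportion; the corresponding sample sizes then have to be obtained from a design calculation such as ld.design().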
There are a number of factors which could be varied in searching for an optimal design
– we have considered only two special cases here. Nevertheless, the results suggest, with
the extent of LD considered, that a significant efficiency gain can be achieved by combin-
ing QTL and LD mapping, and that the optimal QTL mapping population size will often
be quite large. There are still quite a large number of SNPs to genotype per individual
within the QTL region. Hence, the LD genotyping dominated the QTL genotyping except
for the largest QTL sample size, and the maximum disequilibrium. Except for QTL
population size 100, which had posterior probability of only 0.5, the width of the QTL region
decreased gradually in inverse proportion to the square root of the QTL mapping
population size. The least amount of total genotyping was for the largest QTL population of
size nQTL = 3,000, with a fivefold reduction in genotyping compared to nQTL = 100.
In this case, depending on phenotyping costs, larger QTL mapping populations should be
considered before embarking on LD mapping. Values are given for both D = 0.2 and
D = 0.1, with the latter being the minimum disequilibrium expected within the marker
interval, by assumption. The total genotyping was still decreasing between nQTL = 1,000
and nQTL = 3,000, for both D = 0.2 and D = 0.1, so the optimum may be even higher.
Similar results are shown in Table 8.24 where the extent of LD is assumed to be 60 kb.
In this case the prior odds per SNP have increased tenfold compared to the previous case.
Table 8.24. Sample sizes and amount of genotyping required to locate a QTL when searching the genome using
QTL and LD mapping combined. Assume there are 10 QTL explaining 5% of the variation, D = 0.1 or D = 0.2,
allele frequencies 0.5 for QTL and marker for closest marker to the trait locus, a genome of 3 × 10^9 bases, extent of
LD 60 kb, 50,000 SNP markers available at a spacing of 60 kb, giving prior probabilities per marker of 1/5,000.
The QTL mapping results assume there are 12 chromosomes and 20 markers per chromosome at a spacing of
10 cM. Results are given for a posterior probability of 0.9 for an association.
The optimal design appears to be at approximately nQTL ≈ 1,000 when D = 0.2, and at
nQTL ≈ 3,000 when D = 0.1, the latter giving up to approximately a threefold reduction in
genotyping compared to nQTL = 100.
Summary
This subsection shows that QTL mapping and LD mapping analysis and experimental
design can be profitably combined.
The posterior distributions from QTL analysis can be used as prior distributions for
the LD analysis. The Bayes factor required from the LD mapping population for a given
posterior probability is reduced for loci within a QTL mapping region.
Brute force genotyping of all markers in a genome scan for a sufficiently large popula-
tion is very costly due to the very large amount of total genotyping. One possible strategy
is to restrict genotyping of the LD mapping population to QTL regions. We have seen that
this can result in reduced overall genotyping compared to a stand-alone LD mapping ap-
proach. Considering a single trait, the examples suggest that the optimal strategy is to use
even larger QTL mapping populations than those currently used, prior to LD mapping, in
order to find small effect genes.
A by-product of this approach is that most spurious associations due to population
structure will be eliminated by the QTL mapping study. If the QTL mapping intervals are
small, e.g. with a sufficiently large QTL mapping family size, this can be more effective
than the TDT.
8.4 SUMMARY
From the point of view of statistical testing, association mapping for quantitative traits
or complex diseases is characterised by:
1. Small effects requiring large sample sizes to detect, and,
2. Low prior odds for the effects, requiring additional evidence and/or stronger evidence
from the data.
The frequentist hypothesis testing framework is not suited to testing scientific hypotheses:
problems with p-values are accentuated with the large sample sizes required, and the fre-
quentist approach does not consider prior odds. Bayesian statistics is ideally suited to the
problem, since the Bayes factors or posterior probabilities do not depend on sample size
for their interpretation. Bayesian prior probabilities represent the low prior odds, or prior
information from alternative sources such as QTL mapping studies, differential expression
microarray experiments or results from other species. Frequentists see priors as introducing
an undesirable element of subjectivity, but here they are an essential part of the problem.
Thumma et al. (2005) conclude that:
Careful selection of candidate genes through different approaches such as mi-
cro array analysis, EST database searches and QTL mapping is very important
as a large amount of effort is needed for LD mapping. Success of LD mapping
in out-crossing plants therefore depends upon careful selection of candidate
genes. . .
Section 8.3.7 shows that combining QTL and LD mapping is an effective strategy for
reducing the total amount of genotyping. The number of markers to genotype per individ-
ual is reduced by restricting to QTL regions, and the prior odds per marker is increased,
resulting in fewer individuals needing to be genotyped in the association population for a
given Bayes factor. Where the extent of LD is low, quite large QTL mapping populations
can be profitably used. Of course, the gain from this strategy reduces as more traits are
studied. In addition, other prior information can further improve prior odds.
In addition, it is important for the experiment to be designed with good power, and to
use a reliable measure of evidence, such as the Bayes factor, and to give estimates free of
selection bias. It is essential to consider and justify prior probabilities for markers. Oth-
erwise many misleading and spurious associations will be generated. Large sample sizes
are needed, but these are not out of reach of major companies or international cooperative
efforts for economically important species.
We have given Bayesian power calculations (Ball 2005) for the case of independent
samples only, as these are not yet available for other designs. The existing power calcula-
tions can, however, give a good indication of the sample size required in other cases. In
principle, extending the methods to designs for TDT tests should not be difficult. Where
an existing non-Bayesian power calculation exists this can be used in conjunction with the
R function SS.oneway.bf() from ldDesign, or Equation (8.8). Using this function,
calculate the Bayes factor equivalent to the p-value used for the design. This may be re-
peated, decreasing the p-value used in the power calculation until a sufficiently large Bayes
factor is obtained.
For Bayesian haplotype analysis or analysis allowing for population substructure, sim-
ulations can be carried out to assess the additional sample size required for the more com-
plex models. We conjecture that the analysis allowing for population substructure is equiv-
alent to reducing degrees of freedom by an amount comparable to the number of unlinked
markers used for the population structure analysis, which makes only a minor difference if
the total sample size is much larger.
There are as yet relatively few published association studies in plants. One reason
may be concern about spurious associations resulting from population structure. We
have considered four approaches which can be used to control and/or test for effects of
population structure:
1. Where population structure is unknown, the population structure analysis methods of
Section 8.3.5 can be used to reduce spurious effects in random population samples.
3. Utilising many small families, the TDT design eliminates spurious associations be-
tween unlinked loci, but some ‘partly spurious’ associations between linked loci may
remain at recombination distances less than 0.5.
A combination of two or more of these approaches can be applied for greater effectiveness.
Problems with epistasis are best addressed after most of the additive effects (genes or
loci) contributing to a trait are identified. Then possible interactions between detected loci,
and possibly between detected major loci and other loci can be examined.
Statistical methods cannot definitively establish causality. The best we can do is rule
out likely causes of non-causal or spurious associations, and give putative associations with
reasonably high posterior probability. Then, putative effects can be verified by functional
testing.
8.5 REFERENCES
Akey, J., Jin, L., Xiong, M. 2000, Haplotypes vs single marker linkage disequilibrium tests: what do
we gain? Eur. J. Hum. Genet. 9:291–300.
Allison, D.B. 1997, Transmission disequilibrium tests for quantitative traits. Am. J. Hum. Genet.
60:676–690.
Altshuler, D., Hirschhorn, J.N., Klannemark, M., Lindgren, C.M., Vohl, M.-C., Nemesh, J., Lane,
C.R., Schaffner, S.F., Bolk, S., Brewer, C., Tuomis, T., Gaudet, D., Hudson, T.J., Daly, M.,
Groop, L., Lander, E.S. 2000, The common PPARγ Pro12Ala polymorphism is associated
with decreased risk of type 2 diabetes. Nat. Genet. 26:76–80.
Bahlo, M., Thomson, R., Speed, T. 2003, Discussion of: “Ancestral inference in population genetics
models with selection” by M. Stephens and P. Donnelly. Aust. NZ J. Stat. 45:427–428.
Ball, R.D. 2001, Bayesian methods for quantitative trait loci mapping based on model selection:
approximate analysis using the Bayesian Information Criterion. Genetics 159:1351–1364.
Ball, R.D. 2003, lmeSplines – an R package for fitting smoothing spline terms in LME models.
R News 3/3 p24–28.
http://cran.r-project.org/src/contrib/Descriptions/lmeSplines.html
Ball, R.D. 2004, ldDesign – an R package for design of experiments for detection of linkage disequi-
librium.
http://cran.r-project.org/src/contrib/Descriptions/ldDesign.html
Ball, R.D. 2005, Experimental designs for reliable detection of linkage disequilibrium in
unstructured random population association studies. Genetics 170:859–873.
http://www.genetics.org/cgi/content/abstract/170/2/859
Barry, D., Hartigan, J.A. 1992, Product partition models for change point problems. Ann. Stat.
20:260–279.
Bayes, T. 1763, An essay towards solving a problem in the doctrine of chances. Philos. Trans. R.
Soc. 53:370–418.
Berger, J., Berry, D. 1988, Statistical analysis and the illusion of objectivity. Am. Sci. 76:159–165.
Berger, J.O., Sellke, T. 1987, Testing a point null hypothesis: the irreconcilability of P values and
evidence (with discussion). J. Am. Stat. Assoc. 82:112–139.
Bernardo, J.M. 1999, Nested hypothesis testing: the Bayesian reference criterion. In: Bayesian Sta-
tistics 6. J.M. Bernardo, J.O. Berger, A.P. Dawid, A.F.M. Smith (Eds.) Oxford University
Press, Oxford, pp. 101–130 (with discussion).
Bogdan, M., Ghosh, J.K., Doerge, R.W. 2004, Modifying the Schwarz Bayesian information criterion
to locate multiple interacting quantitative trait loci. Genetics 167:989–999.
Brown, G.R., Gill, G.P., Kuntz, R.K., Langley, C.H., Neale, D.B. 2004, Nucleotide diversity and
linkage disequilibrium in loblolly pine. Proc. Natl Acad. Sci. USA 101:15255–15260.
Casella, G., Berger, R.L. 1987, Reconciling Bayesian and frequentist evidence in the one-sided test-
ing problem. J. Am. Stat. Assoc. 82:106–111.
STATISTICAL ANALYSIS AND EXPERIMENTAL DESIGN 193
Cavalli-Sforza, L.L., Cavalli-Sforza, F. 1993, The great human diasporas – the history of diversity and
evolution (Italian original Chi Siamo: La Storia della Diversit‘a Umana). ISBN 0-201-44231-0
(paperback), 1993.
Crow, T.J. (Ed.) 2002, The speciation of modern Homo Sapiens. ISBN 0-19-726311-9 (paperback)
2002.
Dickey, J.M. 1971, The weighted likelihood ratio, linear hypothesis on normal location parameters.
Ann. Math. Stat. 42:204–223.
Drummond, A.J., Nicholls, G.K., Rodrigo, A.G., Solomon, W. 2002, Estimating mutation parame-
ters, population history and genealogy simultaneously from temporally spaced sequence data.
Genetics 161:1307–1320.
Dunner, S., Charlier, C., Farnir, F., Brouwers, B., Canon, J., et al. 1997, Towards interbreed IBD fine
mapping of the mh locus: double-muscling in the Asturiana de los Valles breed involves the
same locus as in the Belgian Blue cattle breed. Mamm. Genome 8:430–435.
Emahazion, T., Feuk, L., Jobs, M., Sawyer, S.L., Fredman, D., et al. 2001, SNP association stud-
ies in Alzheimer’s disease highlight problems for complex disease analysis. Trends Genet.
17:407–413.
Falconer, D.S., Mackay, T.F.C. 1996, Introduction to Quantitative Genetics. Addison-Wesley Long-
man, Harlow, England.
Fan, R., Jung, J. 2003, High-resolution joint linkage disequilibrium and linkage mapping of quanti-
tative trait loci based on Sibship data. Human Heredity 56:166–187.
Farnir, F., Grisart, B., Coppieters, W., Riquet, J., Berzi, P., et al. 2002, Simultaneous mining of
linkage and linkage disequilibrium to fine map quantitative trait loci in outbred half-sib pedi-
grees: revisiting the location of a quantitative trait locus with major effect on milk production
on bovine chromosome 14. Genetics 161:275–287.
Fearnhead, P., Donnelly 2001, Estimating recombination rates from population genetic data. Genetics
159:1299–1318.
Fisher, R.A. 1930, The Genetical Theory of Natural Selection. Clarendon Press, Oxford.
Foley, R. 1995, Humans before humanity. ISBN 0-631-20528-4 (paperback).
Gao, G., Hoeschele, I. 2005, Approximating identity-by-descent matrices using multiple haplotype
configurations on pedigrees. Genetics 171:365–376.
Gelfand, A.E., Hills, S.E., Racine-Poon A., Smith, A.F.M. 1990, Illustration of Bayesian inference
in normal data models using Gibbs sampling. J. Am. M. Stat. Assoc. 85:972–985.
Gelman, A., Carlin, B., Stern H.S., Rubin D.B. 1995, Bayesian Data Analysis. Chapman and Hall,
London.
George, E.I., McCulloch, R.E. 1993, Variable selection via Gibbs sampling. J. Am. Stat. Assoc.
88(423):881–889.
Gilks, W.R., Spiegelhalter, D.J., Richardson, S. (Eds.) 1996, Markov Chain Monte Carlo in Practice.
Chapman and Hall, London.
Gilmour, A.R., Cullis, B.R., Welham, S.J. 2000, ASREML Reference Manual. NSW Agriculture,
Orange, Australia.
Green, P.J. 1995, Reversible jump Markov Chain Monte Carlo computation and Bayesian model
determination. Biometrika 82:711–732.
Griffiths, R.C., Marjoram, P. 1997, An ancestral recombination graph. pp. 257–270. In: Progress in
Population Genetics and Human Evolution. P. Donnelly and S. Tavaré (Eds.), Springer, Berlin
Heidelberg New York.
Gura, T. 2000: Can SNPs deliver on susceptibility genes? Science 293:593–595.
Hampton, T. 2000, Research Brief, Focus, 29 September 2000. Harvard University.
Hartigan, J.A. 1990, Partition models. Commun. Stat. Theory Meth. 19:2745–2756.
Heath, S.C. 1997, Markov chain Monte Carlo segregation and linkage analysis for oligogenic models.
Am. J. Hum. Genet. 61:748–760.
194 R. D. BALL
Heath, S. 2002, Loki 2.4.5 – A package for multipoint linkage analysis on large pedigrees using
reversible jump Markov chain Monte Carlo. Centre National de Genotypage, Evry Cedex,
France.
http://bioweb.pasteur.fr/docs/doc-gensoft/loki/loki doc.ps
Hudson, R.R. 1983, Properties of a neutral allele model with intragenic recombination. Theor. Popul.
Biol. 23:183–201.
Hudson, R.R. 1990, Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:
1–43.
Jeffreys, H. 1961, Theory of probability, 3rd ed. Oxford University Press, London.
Kingman, J.F.C. 1982, On the genealogy of large populations. J. Appl. Prob. 19A:27–43.
Kaplan, N., Morris, R. 2001, Issues concerning association studies for fine mapping a susceptibility
gene for a complex disease. Genet. Epidemiol. 20:432–457.
Kilpikari, R., Sillanpää, M.J. 2003, Bayesian analysis of multilocus association in quantitative and
qualitative traits. Genet. Epidemiol. 25:122–135.
Lander, E.S., Botstein, D. 1989, Mapping Mendelian factors underlying quantitative traits using
RFLP linkage maps. Genetics 121:185–199.
Lee, S.H., Van der Werf, J.H.J. 2005, The role of pedigree information in combined linkage disequi-
librium and linkage mapping of quantitative trait loci in a general complex pedigree. Genetics
169:455–466.
Liu, J.S., Sabatti, C., Teng, J., Keats, B.J.B., Risch, N. 2001, Bayesian analysis of haplotypes for
linkage disequilibrium mapping. Genome Res. 11:1716–1724.
Long, A.D., Langley, C.H. 1999: The power of association studies to detect the contribution of can-
didate genetic loci to variation in complex traits. Genome Res. 9:720–731.
Lund, M.S., Sorensen, P., Guldbrandtsen, P., Sorensen, D.A. 2003, Multitrait fine mapping of quan-
titative trait loci using combined linkage disequilibria and linkage analysis. Genetics 163:405–
410.
Luo, Z.W. 1998, Detecting linkage disequilibrium between a polymorphic marker locus and a trait
locus in natural populations. Heredity 80:198–208.
Meuwissen, T.H.E., Goddard, M.E. 2000, Fine mapping of quantitative trait loci using linkage dise-
quilibria with closely linked marker loci. Genetics 155:421–430.
Meuwissen, T.H.E., Goddard, M.E. 2001, Prediction of identity-by-descent probabilities from marker
haplotypes. Genet. Sel. Evol. 33:605–634.
Meuwissen, T.H.E., Hayes, B.J., Goddard, M.E. 2001, Prediction of total genetic value using genome-
wide dense marker maps. Genetics 157:1819–1829.
Meuwissen, T.H.E., Karlsen, A., Lien, S., Oldsaker, I., Goddard, M. 2002, Fine mapping of a quanti-
tative trait locus for twinning rate using combined linkage and linkage disequilibrium mapping.
Genetics 161:373–379.
Meuwissen, T.H.E., Goddard, M.E. 2004, Mapping multiple QTL using linkage disequilibrium and
linkage analysis information and multitrait data. Genet. Sel. Evol. 36:261–279.
Molitor, J., Majoram, P., Thomas, D. 2003, Fine scale mapping of disease genes with multiple muta-
tions via spatial clustering techniques. Am. J. Hum. Genet. 73:1368–1384.
Natarajan, R., Kass, R.E. 2000, Reference Bayesian methods for generalized linear mixed models.
J. Am. Stat. Assoc. 95:227–237.
Neale, D.B., Savolainen, O. 2004, Association genetics of complex traits in conifers. Trends Plant
Sci. 9:325–330.
Nielsen, D.M., Weir, B.S. 2001, Association studies under general disease models. Theor. Popul.
Biol. 60:253–263.
Nielsen, D.M., Zaykin, D. 2001, Association mapping: where we’ve been, where we’re going. Expert
Rev. Mol. Diagn. 1(3):89–97.
STATISTICAL ANALYSIS AND EXPERIMENTAL DESIGN 195
Stephens, M., Donnelly, P. 2003, Ancestral inference in population genetics models with selection
(with discussion). Aust. NZ J. Stat. 45:395–430.
Stringer, C., McKie, R. 1996, African Exodus. Owl Books, London.
Sykes, B. 2001, The Seven Daughters of Eve: The Science That Reveals Our Genetic Ancestry. W.W.
Norton, New York.
Terwilliger, J.D., Weiss, K.M. 1998, Linkage disequilibrium mapping of complex disease: fantasy or
reality? Curr. Opin. Biotechnol. 9:578–594.
Thornsberry, J.M., Goodman, M.M., Doebley, J., Kresovich, S., Nielsen, D., et al. 2001, Dwarf8
polymorphisms associate with variation in flowering time. Nat. Genet. 28:286–289.
Thumma, B.R., Nolan, M.F., Evans, R., Moran, G.F. 2005, Polymorphisms in Cinnamoyl CoA
Reductase (CCR) are associated with variation in microfibril angle in Eucalyptus spp. Genetics
10.1534/genetics.105.042028.
Tierney, L. 1994, Markov chains for exploring posterior distributions. In: Proceedings to the 1991
Interface Symposium, also available as Technical report #560 (revised), School of Statistics,
University of Minnesota.
Wells, S. 2003, The Journey of Man: A Genetic Odyssey. Princeton University Press, Princeton.
Wilson, S. 2003, Discussion of: “Ancestral inference in population genetics models with selection”
by M. Stephens and P. Donnelly. Aust. NZ J. Stat. 45:423–426.
Wright, S. 1931, Evolution in Mendelian populations. Genetics 16:97–159.
Wu, R., Zeng, Z.-B. 2001, Joint linkage and linkage disequilibrium mapping in natural populations.
Genetics 157:899–909.
Wu, R., Ma, C.X., Casella, G. 2002, Joint linkage and linkage disequilibrium mapping in natural
populations. Genetics 160:779–792.
Zöllner, S., Pritchard, J.K. 2005, Coalescent-based association mapping and fine mapping of complex
trait loci. Genetics 169:1071–1092.
Chapter 9
LINKAGE DISEQUILIBRIUM-BASED ASSOCIATION
MAPPING IN FORAGE SPECIES
Mark P. Dobrowolski¹ and John W. Forster²
9.1 INTRODUCTION
Forage species provide herbage for grazing, hay and silage production servicing
livestock production industries in both tropical and temperate regions of the world.
The grazing industries are responsible for dairy, meat and fibre production. Tempe-
rate forage grasses include perennial ryegrass (Lolium perenne L.), Italian ryegrass
(Lolium multiflorum Lam.), meadow fescue (Festuca pratensis Huds.), tall fescue
(Festuca arundinacea Schreb.), cocksfoot (Dactylis glomerata L.), Kentucky blue-
grass (Poa pratensis L.), smooth bromegrass (Bromus inermis L.) and harding grass
(Phalaris aquatica L.). Temperate forage legumes include white clover (Trifolium repens
L.), red clover (Trifolium pratense L.), subterranean clover (T. subterraneum L.), bird’s
foot trefoil (Lotus corniculatus L.) and lucerne/alfalfa (Medicago sativa L.). Tropical
forage species include grasses such as buffelgrass (Pennisetum ciliare L.), and members
of the genera Brachiaria and Paspalum, as well as legumes such as round-leafed cassia
(Chamaecrista rotundifolia), siratro (Macroptilium atropurpureum) and members of the
genera Stylosanthes, Centrosema and Desmodium. A range of temperate and warm-
season grasses are also important for non-forage applications such as turf and amenity
cultivation. The temperate turf grasses include Lolium, Festuca, Poa and Agrostis
(bentgrass) species, while warm-season and tropical turf grasses include switchgrass
(Panicum virgatum L.), seashore paspalum and bahiagrass (Paspalum vaginatum Swartz
and Paspalum notatum Flugge) and members of the Zoysia (zoysiagrass) and Cynodon
(bermuda grass) genera (Forster et al. 2001a).
This chapter will focus mainly on the potential application of linkage disequilibrium
(LD)-based association mapping to perennial ryegrass and white clover. These two
¹ Primary Industries Research Victoria, Plant Genetics and Genomics Research Platform, Hamilton Centre, Mt.
Napier Road, Hamilton, Victoria 3300, Australia; Molecular Plant Breeding Cooperative Research Centre, Australia
² Primary Industries Research Victoria, Plant Genetics and Genomics Research Platform, Victorian
AgriBiosciences Centre, La Trobe R&D Park, Bundoora, Victoria 3083, Australia
9.2.1 Taxonomy
Perennial ryegrass is a member of the Poeae tribe of the Poodae super-tribe in the
Pooideae sub-family of the grass and cereal family Poaceae (Soreng and Davis 1998).
The Lolium and Festuca genera are closely allied, and the most nearly related major
cereal species is cultivated oat (Avena sativa L.) within the Aveneae tribe of the Poodae.
The Triticeae cereal tribe (wheat, barley and rye) is located within the Triticodae
super-tribe of the Pooideae. Rice (Oryza sativa L.), by contrast, is located in the Poaceae
sub-family Bambusoideae. Translational genomics from rice to perennial ryegrass based on
whole genome DNA sequence data consequently traverses a significant phylogenetic
distance, but the use of partial genomic sequence and expressed sequence tag (EST) data
from wheat (Triticum aestivum L.: Powell and Langridge 2004) and barley (Hordeum
vulgare L.) exploits closer taxonomic affinities.
White clover is a member of the Trifolieae tribe of the cool-season Galegoid clade
in the Papilionoideae sub-family of the legume family Fabaceae (Doyle and Luckow
2003). The most closely related genus is Melilotus (sweet clovers), and the genus
Medicago, including alfalfa, is also part of the Trifolieae. As a consequence, the model
legume species barrel medic (Medicago truncatula Gaertn.) shares a relatively recent
common ancestor with white clover. Translational genomics based
on whole genome sequencing of M. truncatula (Young et al. 2005; Zhu et al. 2005) is
consequently anticipated to be highly efficient for members of the Trifolium genus. The
other model legume species, Lotus japonicus Gifu, is also a Galegoid legume but is located
in a separate tribe, Loteae.
Members of the Lolium genus are diploids with a fundamental chromosome number
of 7 (2n = 2x = 14). The genome size of perennial ryegrass has been estimated through
measurements of nuclear DNA content by microdensitometry (Hutchinson et al. 1979;
Seal and Rees 1982). A 2C value of 4.16 pg corresponds to a haploid genome size of c.
1.6 × 10⁹ bp. The individual genome sizes of other Lolium species vary, with the
inbreeding taxa such as Lolium temulentum exhibiting nuclear DNA contents c. 50%
larger than those of the outbreeding species. In common with other Poaceae family
members, the genomes of Lolium species contain large numbers of dispersed repetitive
sequences, frequently belonging to major retroelement families (Jenkins et al. 2000).
Due to the obligate outbreeding natures of both perennial ryegrass and white clover,
both natural and synthetic populations are highly genetically heterogeneous. Varietal
development is typically based on the following process: evaluation of base populations
containing 2,000–5,000 individuals; selection of c. 200 potential parental clones (Vogel
and Pedersen 1993); and polycrossing to generate a synthetic 1 (Syn1) population. The
number of foundation individuals may vary from as few as four for perennial ryegrass to
50–100 for polyploid species such as tall wheat grass and alfalfa (Bray and Irwin 1999).
The development of molecular genetic markers and associated genetic maps for
perennial ryegrass has been comprehensively reviewed by Forster et al. (2001a, 2004),
while the application of genetic marker analysis to trait dissection has been reviewed by
Yamada and Forster (2005).
A comprehensive set (c. 400) of unique perennial ryegrass genomic DNA-derived
SSR (LPSSR) markers has been developed using enrichment library technology (Jones
et al. 2001). This resource has been augmented by the results of similar studies on a
smaller scale by Kubik et al. (2001) and Lauvergeat et al. (2005). The perennial ryegrass
EST collection has also been used for the development of a set of 310 EST-SSR primer
pairs (Faville et al. 2004). More recently, gene-associated single nucleotide
polymorphism (SNP) markers have been developed through both in vitro and in silico
discovery in perennial ryegrass and white clover (Spangenberg et al. 2005; Shinozuka
et al. 2005; Cogan et al. 2006b; Chapter 4 in this volume).
Development of the first generation reference genetic map of perennial ryegrass was
performed through the use of public domain genetic markers, including restriction
fragment length polymorphisms (RFLPs) and amplified fragment length polymorphisms
(AFLPs). This was achieved through coordination of the International Lolium Genome
Initiative (ILGI), using the p150/112 one-way pseudo-testcross population of 183 F1
genotypes. The ILGI reference map was constructed through collaboration between Victorian
DPI, Australia; the Institute of Grassland and Environmental Research (IGER), UK;
Yamanashi Prefectural Dairy Experiment Station (YPDES) and the National Agricultural
Research Centre for Hokkaido Region (NARCH), Japan; and the Institut National de la
Recherche Agronomique (INRA), France. It contained c. 200 AFLP loci and 109
heterologous RFLP loci (detected by wheat, barley, oat and rice cDNA probes), allowing
the inference of comparative relationships between perennial ryegrass and other Poaceae
species (Jones et al. 2002a). The ILGI map was enhanced through the addition of 93
LPSSR loci, providing the basis of framework genetic mapping in other populations
(Jones et al. 2002b).
A second generation reference genetic mapping family was developed based on the
F1(NA6 × AU6) two-way pseudo-testcross family of 157 F1 genotypes, generating two
parental genetic maps. The consolidated genetic maps included 43 LPSSR loci, 88 EST-
RFLP loci and 71 EST-SSR loci on the NA6 parental map, with a total length of 963 cM;
and 49 LPSSR loci, 67 EST-RFLP loci and 58 EST-SSR loci on the AU6 parental map,
with a total length of 779 cM (Faville et al. 2004).
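The locus counts and map lengths quoted above can be combined into a rough measure of marker density. The following is a minimal sketch, not part of the original study: the dictionary layout and the function name `summarise_map` are illustrative assumptions, and dividing map length by locus number gives only a crude average spacing, since it ignores the number of linkage groups and any clustering of loci.

```python
# Locus counts and map lengths as quoted in the text for the two parental maps.
# The data layout and names here are illustrative assumptions.
PARENTAL_MAPS = {
    "NA6": {"LPSSR": 43, "EST-RFLP": 88, "EST-SSR": 71, "length_cM": 963},
    "AU6": {"LPSSR": 49, "EST-RFLP": 67, "EST-SSR": 58, "length_cM": 779},
}


def summarise_map(entry):
    """Return (total loci, crude average cM per locus) for one parental map."""
    total_loci = sum(count for key, count in entry.items() if key != "length_cM")
    return total_loci, entry["length_cM"] / total_loci


for parent, entry in PARENTAL_MAPS.items():
    loci, spacing = summarise_map(entry)
    print(f"{parent}: {loci} loci, about {spacing:.1f} cM per locus")
```

On these figures the NA6 map carries 202 loci and the AU6 map 174 loci, so both maps have an average spacing of somewhat under 5 cM per locus.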
Trait dissection for perennial ryegrass has been performed in multiple populations to
allow quantitative trait locus (QTL) analysis. The p150/112 population has been analysed
for traits such as vegetative and reproductive morphogenesis, reproductive development
and winter-hardiness, and herbage quality (Yamada et al. 2004; Cogan et al. 2005), while
the F1(NA6 × AU6) population has been studied for a range of root and shoot morpho-
genesis, photosynthetic efficiency, pseudostem water soluble carbohydrate (WSC)
content and crown rust resistance characters (Forster et al. 2004). Other perennial
ryegrass populations have been analysed to detect genetic control of crown rust resistance
(Dumsday et al. 2003; Muylle et al. 2005a,b), vernalisation response (Jensen et al. 2005)
and flowering time variation (Armstead et al. 2004; Warnke et al. 2004).
The development of molecular genetic markers for white clover has been reviewed
by Forster et al. (2001a). A comprehensive set (c. 400) of unique white clover genomic
DNA-derived SSR (TRSSR) markers was developed using enrichment library construction
technology (Kölliker et al. 2001a). The white clover EST library has also been
used to develop 792 EST-SSR primer pairs (Barrett et al. 2004).
Genetic map development in white clover was performed using a combination of
TRSSR and AFLP markers. The reference mapping population was the F2(I.4R × I.5J)
family that was developed at IGER, Aberystwyth, UK, with parental genotypes from
fourth and fifth generation inbred lines descended from plants containing the rare self-
fertile (Sf) allele. A single F1 plant was self-pollinated to generate an F2 population of 150
individuals (Michaelson-Yeates et al. 1997). The level of genetic polymorphism between
the inbred parents, as assessed with TRSSR markers, was 48% of those markers showing
efficient amplification. The F2(I.4R × I.5J) map contained 135 loci (78 TRSSR and 57
AFLP) on 18 linkage groups (two more than the karyotypic number), with a total map
length of 825 cM. The extent of map construction was limited by high levels of
segregation distortion, affecting 39% of the TRSSR loci, with the majority distorted
towards the heterozygous genotypic class (Jones et al. 2003). A higher-resolution genetic
map largely based on EST-SSR markers was constructed using the F1(Sustain 6525-
2 × NRS 364-7) mapping family (Barrett et al. 2004). The EST-SSR markers detected
homoeologous locations between the ancestral genomes at high frequency, and provided
the basis for standard chromosome nomenclature development.
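Segregation distortion of the kind described for the F2(I.4R × I.5J) family is commonly assessed with a chi-square goodness-of-fit test against the expected 1:2:1 genotype ratio of a codominant marker in an F2. The sketch below illustrates that general test; it is not the procedure reported by Jones et al. (2003), and the function name and genotype counts are hypothetical. With three genotype classes the test has two degrees of freedom, for which the chi-square tail probability has the closed form exp(-x/2).

```python
import math


def f2_segregation_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test against the 1:2:1 F2 ratio.

    Returns (chi-square statistic, p-value). Three genotype classes give
    2 degrees of freedom, for which the chi-square survival function is
    exactly exp(-x / 2).
    """
    observed = (n_aa, n_ab, n_bb)
    total = sum(observed)
    expected = (0.25 * total, 0.5 * total, 0.25 * total)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.exp(-chi2 / 2.0)


# Hypothetical counts for 150 F2 plants showing an excess of heterozygotes.
chi2, p_value = f2_segregation_test(25, 100, 25)
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}")
```

A locus with counts like these would be flagged as distorted towards the heterozygous class, the direction of distortion reported for the majority of affected TRSSR loci.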
The F2(I.4R × I.5J) genetic map has been exploited for QTL analysis of a number of
vegetative morphogenesis, reproductive morphogenesis and reproductive development
traits (Cogan et al. 2006a). Target traits were measured across a number of years of clonal
replication, and geographical sites in Wales and Scotland, United Kingdom. Individual
environment analyses detected a large number of QTLs for each trait, with QTL
clustering for correlated traits. Multi-environment combined analysis revealed genomic
locations that are relatively insensitive to genotype × environment effects. The F1(Sustain
6525-2 × NRS 364-7) population has also been used for QTL analysis, specifically
targeting seed production traits such as inflorescence density, yield per inflorescence and
thousand-seed weight (Barrett et al. 2005). Stability of QTL effects was observed across
temporal replication, along with co-location of QTLs for correlated traits.
Genetic diversity analysis has been performed for perennial ryegrass using AFLP
and SSR-based marker systems (Roldán-Ruiz et al. 2000; Guthridge et al. 2001; Forster
et al. 2001b, 2005a) and has revealed greater levels of genetic variation within populations
than between them. Varieties based on small numbers of parental genotypes (restricted-base
varieties) were found to show lower levels of intrapopulation diversity and to be
more readily discriminated than those based on larger numbers of parental individuals
(non-restricted base varieties). AFLP profiling has also been used to determine levels of
genetic variability within and between white clover populations (Kölliker et al. 2001b).
As for perennial ryegrass, the majority of genetic variation was detected within rather
than between populations, and divergent varieties were largely discriminated on the basis
of AFLP profile. Bulking at the genotypic level followed by AFLP analysis was used to
determine the level of congruence between morphophysiological and genotypic variation
in white clover.
AFLP marker frequencies within populations were tested for association with the
phenotype of populations as a whole, rather than with the individual plant phenotype. For
the cold tolerance trait, a marker frequency-trait association was accepted only if the
marker frequency was also correlated with the average winter temperature and altitude clines
of the population origins, independently of average summer temperature clines, and if these
correlations were consistent across geographical origins; associations that failed these
criteria were regarded as spurious (Skøt et al. 2002). One
AFLP marker locus was identified as associated with cold tolerance using this method,
and was prioritised for further investigation to assess its predictive value for this trait.
A similar approach using previously analysed phenotypes of natural populations
was employed to analyse AFLP marker association with flowering time variation in
perennial ryegrass (Skøt et al. 2005b). Five marker loci showed association with
flowering time variation, based on linear regression analysis, following exclusion of
distinct populations from the analysis due to concerns regarding population structure.
Twenty-nine of the 590 AFLP bands tested could be mapped in a full-sib genetic
mapping family (F2[Aurora × Perma]: Armstead et al. 2002), based on polymorphism in
the mapping population and the assumption that co-migrating bands represent
homologous loci. Three of the five markers that were associated with flowering time
were closely linked in a region of linkage group (LG) 7, which contains a large
quantitative trait locus (QTL) for heading date variation accounting for 70% of the
phenotypic variance (Armstead et al. 2004). These three markers also revealed significant
LD in pair-wise comparisons. However, the effects of residual population structure were
still evident in the data, as many unlinked marker pairs also showed significant LD.
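The population-level analysis described above regresses a phenotype summarised per population on the frequency of a marker band in that population. The following minimal sketch shows the form of such a regression using ordinary least squares; the function name and all data values are hypothetical and are not taken from Skøt et al. (2005b). Note that each data point is a population, not a plant, which is the limitation discussed in the text.

```python
import math


def linear_regression(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


# Hypothetical data: one value per population, not per plant.
band_frequency = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
mean_heading_days = [62.0, 60.5, 58.0, 57.5, 54.0, 52.5]

slope, intercept, r = linear_regression(band_frequency, mean_heading_days)
print(f"slope = {slope:.2f} days per unit band frequency, r = {r:.2f}")
```

A strong correlation in such a plot does not by itself distinguish linkage to a causal locus from population structure, because populations that differ in flowering time may differ in marker frequency for unrelated historical reasons.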
Studies using AFLPs for whole genome scans are vulnerable to various problems,
including: the potential confounding effects of AFLP band size homoplasy; population
structure, which was clearly evident in the latter study and was not measured in the
former, leading to spurious associations; and unanticipated ecological parameters that
define other differentiated traits within and between comparator groups. Another
potential problem was the use of population-level phenotypic data rather than individual
plant performance, and the comparison of these data with population-based allele frequencies.
Use of population-based data is especially problematic considering the high levels of
intrapopulation genetic diversity, often greater than 80%, that is frequently observed in
perennial ryegrass populations (Guthridge et al. 2001; Kubik et al. 2001; Dobrowolski
et al. 2005). As previously stated, LD is also predicted to decay rapidly in outbreeding
species such as perennial ryegrass (Flint-Garcia et al. 2003) and for this reason, and the
various problems with use of whole genome scans, LD-based association mapping in
forages has now shifted to the use of genotype-specific phenotypic data and a candidate
gene-based approach.
In silage maize, a candidate gene-based approach was used to test the association
between digestibility and sequence haplotypes of the maize peroxidase gene ZmPox3
(Guillet-Claude et al. 2004b) and the three O-methyltransferase genes, CCoAOMT2,
CCoAOMT1, and AldOMT (Guillet-Claude et al. 2004a). These genes were chosen as
candidates for association studies on the basis of presumptive role in lignin biosynthesis
and co-location with QTLs for lignin content and cell wall digestibility based on analysis
of genetic mapping populations. Genotypic data was obtained by direct sequencing of the
haplotypes of these genes from various silage maize lines. LD decayed rapidly, reaching
values of r² = 0.2 within 200–1,200 bp, as seen in other studies of diverse maize
populations (Remington et al. 2001). Associations were found between the digestibility
phenotype and distinct haplotypes containing insertions in both ZmPox3 and CCoAOMT2
(Guillet-Claude et al. 2004a, b). However, the investigators recognised that population
structure was not accounted for in testing for the associations, and these effects may have
given rise to spurious haplotype–phenotype correlations.
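The r² statistic used to describe LD decay in these studies is the squared correlation between alleles at two sites. The sketch below computes it from phased two-site haplotypes under the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA pB. The function name, the '0'/'1' allele coding and the example haplotypes are assumptions for illustration, not data from the maize studies.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic sites.

    `haplotypes` is a list of two-character strings, one per phased
    haplotype, with alleles coded '0' or '1' at each site.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "1" for h in haplotypes) / n   # allele frequency at site 1
    p_b = sum(h[1] == "1" for h in haplotypes) / n   # allele frequency at site 2
    p_ab = sum(h == "11" for h in haplotypes) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / denom


# Hypothetical haplotype samples.
complete_ld = ["11"] * 4 + ["00"] * 4             # alleles always co-occur
no_ld = ["11", "10", "01", "00"] * 2              # all combinations equally frequent
print(ld_r_squared(complete_ld), ld_r_squared(no_ld))
```

Plotting such pairwise values against the physical distance between sites gives the decay curves from which figures such as "r² = 0.2 within 200–1,200 bp" are read.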
Candidate gene-based association mapping studies in perennial ryegrass have so far
targeted forage quality and flowering traits. Skøt et al. (2005a) selected 100 genotypes
from each of nine European populations showing wide variation in heading date, and
measured this trait from replicated plants grown in pots following vernalisation. In
parallel, herbage quality traits were measured in replicate, including WSC content,
nitrogen content and dry matter digestibility, using near infrared reflectance spectroscopy
(NIRS).
Two candidate genes were targeted for analysis by Skøt et al. (2005a): the LpHd1
gene, which is the putative perennial ryegrass ortholocus of the rice Hd1 photoperiodic
control gene; and LpAlkInv, an alkaline invertase gene which has been mapped to LG 6 of
the perennial ryegrass genetic map, coincident with a WSC QTL. SNP discovery in
LpAlkInv was based on analysis of 24 genotypes that span the range of phenotypic variation
in the larger sample set. The 6,328 bp LpAlkInv genomic clone, composed of six exons
and five introns, was tiled with overlapping PCR primers and the resulting amplicons
from each genotype were directly sequenced. Minimal problems due to paralogous
sequence amplification and sequence frameshifts due to heterozygous indels were
reported, and heterozygous SNPs were identified based on overlapping peaks in sequence
traces. Across the 48 possible haplotypes, an average of one SNP was identified per
28 bp. LD between SNP loci decayed to an r² value of 0.1 over 2–3 kb distances.
Association analysis based on individual SNP genotype rather than SNP haplotypes
revealed no significant correlations with WSC content variation, but possible correlations
with heading date variation. Equivalent analysis is being performed for the 7.3 kb LpHd1
genomic sequence, including an assessment of the impact of population structure on the
association analysis. Other candidate gene-based studies in perennial ryegrass (Ponting
et al. 2005) have observed similar values for decay of LD, with r² values dropping below
0.1 over 2 kb distances between SNP loci covering 6,137 bp of the forage quality
candidate gene LpFT1 (putative sucrose:fructan 6-fructosyltransferase) (Lidgett et al.
2002). However, in the Lp1-SST (sucrose:sucrose 1-fructosyltransferase) gene (Chalmers
et al. 2003), little LD was evident between SNP loci covering 4,269 bp, and the equivalent
study to detect LD decay rate in the LpASRa2 (abscisic acid, stress, ripening) gene
(Forster et al. 2005b; Cogan et al. 2006b) was limited by the short distance (447 bp)
between the most distal SNPs. These analyses were based on a set of 81 diverse perennial
ryegrass genotypes. With the addition of genotypes from the closely related ryegrass taxa
L. hybridum, L. multiflorum, L. rigidum and L. temulentum, higher levels of LD were
observed, presumably due to the confounding effects of population structure, and specific
LpASRa2 haplotypes were evidently conserved across species.
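The figures quoted in this section support some simple back-of-envelope arithmetic. The sketch below uses only numbers given in the chapter (24 diploid genotypes, a 6,328 bp gene, one SNP per 28 bp, a haploid genome of c. 1.6 × 10⁹ bp and LD decay over c. 2 kb). The implied SNP count and the marker number for a genome scan are derived here and are not reported in the source; the latter rests on the simplifying assumption that one marker is needed per 2 kb of genome.

```python
GENE_LENGTH_BP = 6328        # LpAlkInv genomic clone (from the text)
GENOTYPES_SEQUENCED = 24     # diploid SNP discovery panel (from the text)
BP_PER_SNP = 28              # reported average SNP spacing
GENOME_SIZE_BP = 1.6e9       # haploid genome size quoted earlier in the chapter
LD_EXTENT_BP = 2000          # assumption: one marker needed per c. 2 kb

haplotypes = 2 * GENOTYPES_SEQUENCED                # two haplotypes per diploid plant
implied_snps = round(GENE_LENGTH_BP / BP_PER_SNP)   # derived, not reported in the source
markers_for_scan = GENOME_SIZE_BP / LD_EXTENT_BP    # crude estimate under the assumption

print(f"{haplotypes} haplotypes, about {implied_snps} SNPs in the gene")
print(f"roughly {markers_for_scan:,.0f} evenly spaced markers for a genome scan")
```

Under these assumptions a whole genome scan in perennial ryegrass would require markers numbering in the hundreds of thousands, which illustrates why attention has turned to candidate genes.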
9.5 CONCLUSIONS
The demonstrated rapid decay of LD over short physical distances in the genomes
of outbreeding forage species provides support for the view that whole genome scans are
unlikely to identify regions of the genome that are causally responsible for agrono-
mically-important phenotypic variation, given the constraints of current technology. By
contrast, the candidate gene-based approach should be highly suitable, allowing the
9.6 REFERENCES
Andersen, J.R., Lübberstedt, T., 2003, Functional markers in plants. Trends in Plant Science 8: 554–560.
Armstead, I.P., Turner, L.B., King, I.P., Cairns, A.J., Humphreys, M.O., 2002, Comparison and integration of
genetic maps generated from F2 and BC1-type mapping populations in perennial ryegrass. Plant Breeding
121: 501–507.
Armstead, I.P., Turner, L.B., Farrell, M., Skøt, L., Gomez, P., Montoya, T., Donnison, I.S., King, I.P.,
Humphreys, M.O., 2004, Synteny between a major heading-date QTL in perennial ryegrass (Lolium
perenne L.) and the Hd3 heading-date locus in rice. Theoretical and Applied Genetics 108: 822–828.
Arumuganathan, K., Earle, E.D., 1991, Nuclear DNA content of some important plant species. Plant Molecular
Biology Reporter 9: 208–218.
Attwood, S.S., 1940, Genetics of cross-incompatibility among self-incompatible plants of Trifolium repens.
Journal of the American Society of Agronomy 32: 955–968.
Attwood, S.S., 1941, Controlled self- and cross-pollination of Trifolium repens. Journal of the American
Society of Agronomy 33: 538–545.
Attwood, S.S., 1942a, Oppositional alleles causing self-incompatibility in Trifolium repens. Genetics 27:
333–338.
Attwood, S.S., 1942b, Genetics of pseudo-self-incompatibility and its relation to cross-incompatibility in
Trifolium repens L. Journal of Agricultural Research 64: 699–709.
Badr, A., Sayed-Ahmed, H., El-Shanshouri, A., Watson, L.E., 2002, Ancestors of white clover (Trifolium
repens L.), as revealed by isozyme polymorphisms. Theoretical and Applied Genetics 106: 143–148.
Barrett, B., Griffiths, A., Schreiber, M., Ellison, N., Mercer, C., Bouton, J., Ong, B., Forster, J., Sawbridge, T.,
Spangenberg, G., Bryan, G., Woodfield, D., 2004, A microsatellite map of white clover (Trifolium repens
L.). Theoretical and Applied Genetics 109: 596–608.
Barrett, B.A., Baird, I.J., Woodfield, D.R., 2005, A QTL analysis of white clover seed production. Crop Science
45: 1844–1850.
Baumann, U., Juttner, J., Bian, X.-Y., Langridge, P., 2000, Self-incompatibility in the grasses. Annals of Botany
85 (Supplement A): 203–209.
Bray, R.A., Irwin, J.A.G., 1999, Medicago sativa L. (lucerne) cv. Hallmark. Australian Journal of Experimental
Agriculture 39: 643–644.
Campbell, B.D., Caradus, J.R., Hunt, C.L., 1999, Temperature responses and nuclear DNA amounts of seven
white clover populations which differ in early spring growth rates. New Zealand Journal of Agricultural
Research 42: 9–17.
Chalmers, J., Johnson, X., Lidgett, A., Spangenberg, G., 2003, Isolation and characterisation of a
sucrose:sucrose 1-fructosyltransferase gene from perennial ryegrass (Lolium perenne L.). Journal of Plant
Physiology 160: 1385–1391.
Chen, C.C., Gibson, P.B., 1970, Chromosome pairing in two interspecific hybrids of Trifolium. Canadian
Journal of Genetics and Cytology 12: 790–794.
Chen, C.C., Gibson, P.B., 1971, Karyotypes of fifteen Trifolium species in section Amoria. Crop Science 11:
441–445.
Cogan, N.O.I., Smith, K.F., Yamada, T., Francki, M.G., Vecchies, A.C., Jones, E.S., Spangenberg, G.C.,
Forster, J.W., 2005, QTL analysis and comparative genomics of herbage quality traits in perennial
ryegrass (Lolium perenne L.). Theoretical and Applied Genetics 110: 364–380.
LINKAGE DISEQUILIBRIUM-BASED ASSOCIATION 207
Cogan, N.O.I., Abberton, M.T., Smith, K.F., Kearney, G., Marshall, A.H., Williams, A., Michael-Yeates,
T.P.T., Bowen, C., Jones, E.S., Vecchies, A.C., Forster, J.W., 2006a, Individual and multi-environment
combined analyses identify QTLs for morphogenetic and reproductive development traits in white clover
(Trifolium repens L.). Theoretical and Applied Genetics 112: 1401–1415.
Cogan, N.O.I., Ponting, R.C., Vecchies, A.C., Drayton, M.C., George, J., Dobrowolski, M.P., Sawbridge, T.I.,
Spangenberg, G.C., Smith, K.F., Forster, J.W., 2006b, Gene-associated single nucleotide polymorphism
(SNP) discovery in perennial ryegrass (Lolium perenne L.). Molecular Genetics and Genomics 276: 101–112.
Connolly, V., 1990, Seed yield and yield components in ten white clover cultivars. Irish Journal of Agricultural
Research 29: 41–48.
Cornish, M.A., Hayward, M.D., Lawrence, M.J., 1979, Self-incompatibility in ryegrass. I. Genetic control in
diploid Lolium perenne L. Heredity 43: 95–106.
Devey, F., Fearon, C.H., Hayward, M.D., Lawrence, M.J., 1994, Self-incompatibility in ryegrass. 11. Number
and frequency of alleles in a cultivar of Lolium perenne L. Heredity 73: 262–264.
Dobrowolski, M.P., Bannan, N.R., Ponting, R.C., Forster, J.W., Smith, K.F., 2005, Population genetics of
perennial ryegrass (Lolium perenne L.): differentiation of pasture and turf cultivars. In: Molecular
breeding for the genetic improvement of forage crops and turf. Humphreys M.O. (ed.). Wageningen
Academic Publishers: The Netherlands. p. 273.
Doyle, J.J., Luckow, M.A., 2003, The rest of the iceberg. Legume diversity and evolution in a phylogenetic
context. Plant Physiology 131: 900–910.
Dumsday, J.L., Smith, K.F., Forster, J.W., Jones, E.S., 2003, SSR-based genetic linkage analysis of resistance
to crown rust (Puccinia coronata Corda f. sp. lolii) in perennial ryegrass (Lolium perenne L.). Plant
Pathology 52: 628–637.
Ellison, N.W., Liston, A., Szeimer, J.J., Williams, W.M., Taylor, W.L., 2006, Molecular phylogenetics of the
clover genus (Trifolium-Leguminosae). Molecular Phylogenetics and Evolution 39: 688–705.
Faville, M., Vecchies, A.C., Schreiber, M., Drayton, M.C., Hughes, L.J., Jones, E.S., Guthridge, K.M., Smith,
K.F., Sawbridge, T., Spangenberg, G.C., Bryan, G.T., Forster, J.W., 2004, Functionally-associated
molecular genetic marker map construction in perennial ryegrass (Lolium perenne L.). Theoretical and
Applied Genetics 110: 12–32.
Fearon, C.H., Cornish, M.A., Hayward, M.D., Lawrence, M.J., 1994, Self-incompatibility in ryegrass. 10.
Number and frequency of alleles in a natural-population of Lolium perenne L. Heredity 73: 254–261.
Flint-Garcia, S.A., Thornsberry, J.M., Buckler, E.S.I., 2003, Structure of linkage disequilibrium in plants.
Annual Review of Plant Biology 54: 357–374.
Forster, J.W., Jones, E.S., Kölliker, R., Drayton, M.C., Dumsday, J., Dupal, M.P., Guthridge, K.M., Mahoney,
N.L., van Zijll de Jong, E., Smith, K.F., 2001a, Development and Implementation of Molecular Markers
for Forage Crop Improvement. In: Molecular breeding of forage crops. Spangenberg G. (ed.). Kluwer
Academic Press: Dordrecht. pp. 101–133.
Forster, J.W., Jones, E.S., Kölliker, R., Drayton, M.C., Dupal, M.P., Guthridge, K.M., Smith, K.F., 2001b, DNA
profiling in outbreeding forage species. In: Plant genotyping – the DNA fingerprinting of plants. Henry R
(ed.). CABI Press: New York. pp. 299–320.
Forster, J.W., Jones, E.S., Batley, J., Smith, K.F., 2004, Molecular marker-based genetic analysis of pasture and
turf grasses. In: Molecular breeding of forage and turf. Hopkins A., Wang Z.-Y., Sledge M., Barker R.E.
(eds.). Kluwer Academic Press: Dordrecht. pp. 197–239.
Forster, J.W., Jones, E.S., Smith, K.F., Guthridge, K.M., Dupal, M.P., Howlett, S., Hughes, L.J., Garvie, S.,
Preston, C., 2005a, Molecular Marker Technology for the Study of Molecular Variation and Comparative
Genetics in Pasture Grasses. In: Plant Genome: Biodiversity and Evolution, Volume 1 Pt. B:
Phanerogams. Sharma A.K., Sharma A. (eds.). Science Publishers: Enfield, NH. pp. 119–155.
Forster, J.W., Cogan, N.O.I., Vecchies, A.C., Ponting, R.C., Drayton, M.D., George, J., Dumsday, J.L.,
Sawbridge, T.I., Spangenberg, G.C., 2005b, Gene-associated single nucleotide polymorphism (SNP)
discovery in perennial ryegrass (Lolium perenne L.). In: Molecular breeding for the genetic improvement
of forage crops and turf. Humphreys M.O. (ed.). Wageningen Academic Publishers: The Netherlands.
p. 199.
Grime, J.P., Mowforth, M.A., 1982, Variation in genome size – an ecological interpretation. Nature 299: 151–
153.
Guillet-Claude, C., Birolleau-Touchard, C., Manicacci, D., Fourmann, M., Barraud, S., Carret, V., Martinant,
J.P., Barriere, Y., 2004a, Genetic diversity associated with variation in silage corn digestibility for
three O-methyltransferase genes involved in lignin biosynthesis. Theoretical and Applied Genetics 110:
126–135.
Guillet-Claude, C., Birolleau-Touchard, C., Manicacci, D., Rogowsky, P.M., Rigau, J., Murigneux, A.,
Martinant, J.P., Barriere, Y., 2004b, Nucleotide diversity of the ZmPox3 maize peroxidase gene:
208 M.P. DOBROWOLSKI AND J.W. FORSTER
relationships between a MITE insertion in exon 2 and variation in forage maize digestibility. BMC
Genetics 5: 16.
Guthridge, K.M., Dupal, M.P., Kölliker, R., Jones, E.S., Smith, K.F., Forster, J.W., 2001, AFLP analysis of
genetic diversity within and between populations of perennial ryegrass (Lolium perenne L.). Euphytica
122: 191–201.
Hutchinson, J., Rees, H., Seal, A.G., 1979, An assay of the activity of supplementary DNA in Lolium. Heredity
43: 411–421.
Jenkins, G., Head, J., Forster, J.W., 2000, Probing meiosis in hybrids of Lolium (Poaceae) with a discriminatory
repetitive genomic sequence. Chromosoma 109: 280–286.
Jensen, L.B., Andersen, J.R., Frei, U., Xing, Y., Taylor, C., Holm, P.B., Lübberstedt, T., 2005, QTL mapping of
vernalisation response in perennial ryegrass (Lolium perenne L.) reveals co-location with an orthologue of
wheat VRN1. Theoretical and Applied Genetics 110: 527–536.
Jones, E.S., Dupal, M.P., Kölliker, R., Drayton, M.C., Forster, J.W., 2001, Development and characterisation of
simple sequence repeat (SSR) markers for perennial ryegrass (Lolium perenne L.). Theoretical and
Applied Genetics 102: 405–415.
Jones, E.S., Mahoney, N.L., Hayward, M.D., Armstead, I.P., Jones, J.G., Humphreys, M.O., King, I.P., Kishida,
T., Yamada, T., Balfourier, F., Charmet, C., Forster, J.W., 2002a, An enhanced molecular marker-based
map of perennial ryegrass (Lolium perenne L.) reveals comparative relationships with other Poaceae
species. Genome 45: 282–295.
Jones, E.S., Dupal, M.D., Dumsday, J.L., Hughes, L.J., Forster, J.W., 2002b, An SSR-based genetic linkage
map for perennial ryegrass (Lolium perenne L.). Theoretical and Applied Genetics 105: 577–584.
Jones, E.S., Hughes, L.J., Drayton, M.C., Abberton, M.T., Michaelson-Yeates, T.P.T., Forster, J.W., 2003, An
SSR and AFLP molecular marker-based genetic map of white clover (Trifolium repens L.). Plant Science
165: 531–539.
Kawanabe, S., Yoshihara, K., Okada, T., Ueno, M., Hidaka, M., 1963, Studies on summer depression of pasture
crops. 3. Influence of flower bud removal upon vegetative growth of Ladino clover. Journal of the
Japanese Society for Grassland Science 9: 31–41.
Kölliker, R., Jones, E.S., Drayton, M.C., Dupal, M.P., Forster, J.W., 2001a, Development and characterisation
of simple sequence repeat (SSR) markers for white clover (Trifolium repens L.). Theoretical and Applied
Genetics 102: 416–424.
Kölliker, R., Jones, E.S., Jahufer, M.Z.Z., Forster, J.W., 2001b, Bulked AFLP analysis for the assessment of
genetic diversity in white clover (Trifolium repens L.). Euphytica 121: 305–315.
Kubik, C., Sawkins, M., Meyer, W.A., Gaut, B.S., 2001, Genetic diversity in seven perennial ryegrass (Lolium
perenne L.) cultivars based on SSR markers. Crop Science 41: 1565–1572.
Lauvergeat, V., Barre, P., Bonnet, M., Ghesquiére, M., 2005, Sixty simple sequence repeat markers for use in
the Festuca-Lolium complex of grasses. Molecular Ecology Notes 5: 401–405.
Lidgett, A., Jennings, K., Johnson, X., Guthridge, K., Jones, E., Spangenberg, G., 2002, Isolation and
characterisation of a fructosyltransferase gene from perennial ryegrass (Lolium perenne). Journal of Plant
Physiology 159: 1037–1043.
Mackay, T.F.C., 2001, The genetic architecture of quantitative traits. Annual Review of Genetics 35: 303–309.
Michaelson-Yeates, T.P.T., Marshall, A., Abberton, M.T., Rhodes, I., 1997, Self-incompatibility and heterosis in
white clover (Trifolium repens L.). Euphytica 94: 341–348.
Muylle, H., Baert, J., Van Bockstaele, E., Moerkerke, B., Goetghebeur, E., Roldán-Ruiz, I., 2005a,
Identification of molecular markers linked with crown rust (Puccinia coronata f.sp. lolii) resistance in
perennial ryegrass (Lolium perenne) using AFLP markers and a bulked segregant approach. Euphytica
143: 135–144.
Muylle, H., Baert, J., Van Bockstaele, E., Petijs, J., Roldán-Ruiz, I., 2005b, Four QTLs determine crown rust
(Puccinia coronata f.sp. lolii) resistance in a perennial ryegrass (Lolium perenne) population. Heredity
95: 348–357.
Ponting, R.C., Drayton, M.D., Cogan, N.O.I., Dobrowolski, M.D., Spangenberg, G.C., Smith, K.F., Forster,
J.W., 2005, SNP discovery and haplotypic variation in full-length herbage quality genes of perennial
ryegrass (Lolium perenne L.). In: Molecular breeding for the genetic improvement of forage crops and
turf. Humphreys M.O. (ed.). Wageningen Academic Publishers: The Netherlands. p. 196.
Powell, W., Langridge, P., 2004, Unfashionable crop species flourish in the 21st century. Genome Biology 5:
233.
Rafalski, A., Morgante, M., 2004, Corn and humans: recombination and linkage disequilibrium in two genomes
of similar size. Trends in Genetics 20: 103–111.
Remington, D.L., Thornsberry, J.M., Matsuoka, Y., Wilson, L.M., Whitt, S.R., Doebley, J., Kresovich, S.,
Goodman, M.M., Buckler, E.S.I., 2001, Structure of linkage disequilibrium and phenotypic associations
in the maize genome. Proceedings of the National Academy of Sciences of the United States of America
98: 11479–11484.
Roldán-Ruiz, I., Dendauw, J., Van Bockstaele, J., Depicker, E., De Loose, M., 2000, AFLP markers reveal high
polymorphic rates in ryegrasses (Lolium spp.). Molecular Breeding 6: 125–134.
Sackville Hamilton, N.R., Skøt, L., Chorlton, K.H., Thomas, I.D., Mizen, S., 2002, Molecular genecology of
temperature response in Lolium perenne: 1. Preliminary analysis to reduce false positives. Molecular
Ecology 11: 1855–1863.
Sawbridge, T., Ong, E.-K., Binnion, C., Emmerling, M., McInnes, R., Meath, K., Nguyen, N., Nunan, K.,
O'Neill, M., O'Toole, F., Rhodes, C., Simmonds, J., Tian, P., Wearne, K., Webster, T., Winkworth, A.,
Spangenberg, G., 2003a, Generation and analysis of expressed sequence tags in perennial ryegrass
(Lolium perenne L.). Plant Science 165: 1089–1100.
Sawbridge, T., Ong, E.-K., Binnion, C., Emmerling, M., Meath, K., Nunan, K., O'Neill, O., O'Toole, F.,
Simmonds, J., Wearne, K., Winkworth, A., Spangenberg, G., 2003b, Generation and analysis of
expressed sequence tags in white clover (Trifolium repens L.). Plant Science 165: 1077–1089.
Seal, A.G., Rees, H., 1982, The distribution of quantitative DNA changes associated with the evolution of the
diploid Festuceae. Heredity 49: 179–190.
Shinozuka, H., Hisano, H., Ponting, R.C., Jones, E.S., Cogan, N.O.I., Forster, J.W., Yamada, T., 2005,
Molecular cloning and genetic mapping of perennial ryegrass protein kinase CK2α-subunit genes.
Theoretical and Applied Genetics 112: 167–177.
Skøt, L., Sackville Hamilton, N.R., Mizen, S., Chorlton, K.H., Thomas, I.D., 2002, Molecular genecology of
temperature response in Lolium perenne: 2. Association of AFLP markers with ecogeography. Molecular
Ecology 11: 1865–1876.
Skøt, L., Humphreys, J., Armstead, I.P., Humphreys, M.O., Gallagher, J.A., Thomas, I.D., 2005a, Approaches
for associating molecular polymorphisms with phenotypic traits based on linkage disequilibrium in
natural populations of Lolium perenne. In: Molecular breeding for the genetic improvement of forage
crops and turf. Humphreys M.O. (ed.). Wageningen Academic Publishers: The Netherlands. p. 157.
Skøt, L., Humphreys, M.O., Armstead, I., Heywood, S., Skot, K.P., Sanderson, R., Thomas, I.D., Chorlton,
K.H., Hamilton, N.R.S., 2005b, An association mapping approach to identify flowering time genes in
natural populations of Lolium perenne (L.). Molecular Breeding 15: 233–245.
Soreng, R.J., Davis, J.I., 1998, Phylogenetics and character evolution in the grass family (Poaceae):
simultaneous analysis of morphological and chloroplast DNA restriction site character sets. Botanical
Reviews 64: 1–85.
Spangenberg, G., Forster, J.W., Edwards, D., John, U., Mouradov, A., Emmerling, M., Batley, J., Felitti, S.,
Cogan, N.O.I., Smith, K.F., Dobrowolski, M.P., 2005, Future directions in the molecular breeding of
forage and turf. In: Molecular breeding for the genetic improvement of forage crops and turf. Humphreys
M.O. (ed.). Wageningen Academic Publishers: The Netherlands. pp. 83–97.
Thornsberry, J.M., Goodman, M.M., Doebley, J., Kresovich, S., Nielsen, D., Buckler IV, E.S., 2001, Dwarf8
polymorphisms associate with variation in flowering time. Nature Genetics 28: 286–289.
Vogel, K.P., Pedersen, J.F., 1993, Breeding systems for cross-pollinated forage grasses. Plant Breeding
Reviews 11: 251–274.
Warnke, S.E., Barker, R.E., Jung, G., Rouf Mian, M.A., Saha, M.C., Brilman, L.A., Dupal, M.D., Forster, J.W.,
2004, Genetic linkage mapping of an annual × perennial ryegrass population. Theoretical and Applied
Genetics 109: 294–304.
Williams, T.A., Abberton, M.T., Thornley, W.J., Evans, D.R., Rhodes, I., 1998, Evaluation of seed production
potential in white clover (Trifolium repens L.) varietal improvement programs. Grass and Forage Science 53:
197–207.
Xing, Q., Ru, Z., Li, J., Zhou, C., Jin, D., Sun, Y., Wang, B., 2005, Cloning a second form of adenine
phosphoribosyl transferase gene (TaAPT2) from wheat and analysis of its association with thermo-sensitive
genic male sterility (TGMS). Plant Science 169: 37–45.
Yamada, T., Forster, J.W., 2005, QTL analysis and trait dissection in ryegrasses (Lolium spp.). In: Molecular
breeding for the genetic improvement of forage crops and turf. Humphreys M.O. (ed.). Wageningen
Academic Publishers: The Netherlands. pp. 43–53.
Yamada, T., Higuchi, A., Fukuoka, A., 1989, Recurrent selection of white clover (Trifolium repens L.) using
self-compatible plants. I. Selection of self-compatible plants and inheritance of a self-compatibility
factor. Euphytica 44: 167–172.
Yamada, T., Jones, E.S., Cogan, N.O.I., Vecchies, A.C., Nomura, T., Hisano, H., Shimamoto, Y., Smith, K.F.,
Forster, J.W., 2004, QTL analysis of morphological, developmental and winter hardiness-associated
traits in perennial ryegrass (Lolium perenne L.). Crop Science 44: 925–935.
Young, N.D., Cannon, S.B., Sato, S., Kim, D., Cook, D.R., Town, C.D., Roe, B.A., Tabata, S., 2005,
Sequencing the genespaces of Medicago truncatula and Lotus japonicus. Plant Physiology 137: 1174–1181.
Zhu, H., Choi, H.-K., Cook, D.R., Shoemaker, R.C., 2005, Bridging model and crop legumes through
comparative genomics. Plant Physiology 137: 1189–1196.
Chapter 10
GENE-ASSISTED SELECTION: APPLICATIONS
OF ASSOCIATION GENETICS FOR FOREST TREE
BREEDING
SUMMARY
This chapter describes application of association genetics in forest tree species for
the purposes of selection. We use the term gene-assisted selection (GAS) to denote
application of marker–trait associations determined via association genetics, which we
anticipate will be based on polymorphisms associated with expressed genes. The salient
features of forest trees are reviewed, including existing and somewhat limited knowledge
of linkage disequilibrium (LD), as well as genomic information for both conifers and
hardwoods. The relatively short span of LD in largely undomesticated and outbred forest
tree species offers good prospects for precisely locating quantitative trait nucleotides
(QTN), but necessitates wise candidate gene selection and generation of nongenic
sequences, which could be limiting, particularly for conifers. Prerequisites for successful
application are discussed, and include suitable populations for detecting LD; powerful
quantitative genetic and bioinformatic capabilities; large EST libraries, if not whole
genomic sequences, to identify candidate genes; and other capabilities for studying
functional genomics; as well as a mix of quantitative genetics, tree breeding, and
molecular biology skills. Experimental designs for tree improvement applications are also
described, as well as analytical methods. For existing tree improvement practice, GAS
should be applicable in virtually all population strata, although careful evaluation on a
case-by-case basis will be needed to determine the appropriate implementation
pathway(s). Such evaluation will likely include numerical simulation. GAS also fits well
with other biotechnologies used for tree improvement. A number of impediments to
1 Cellwall Biotechnology Centre, Scion (New Zealand Forest Research Institute), Private Bag 3020, Rotorua,
New Zealand
2 USDA Forest Service, Southern Institute of Forest Genetics, 23332 MS Highway 67, Saucier, MS 39574, USA
3 Ensis Genetics, Scion (New Zealand Forest Research Institute), Private Bag 3020, Rotorua, New Zealand
212 P.L. WILCOX ET AL.
10.1 INTRODUCTION
Key features of most forest tree species include their large size and long lifespan;
predominantly outbreeding behavior; slowness to express their phenotype as well as to
reach reproductive maturity; and high levels of synteny within genera, and among
conifers, within orders. The size and longevity of trees have both benefits and drawbacks.
In terms of the latter, size can create major complications for both conventional breeding
and the application of DNA polymorphism for selection. The complications involve both
delayed expression of traits, and high costs of producing and managing the genetic
material. For phenotypic selection, the delayed expression of traits may preclude
effective selection for a number of years. It similarly affects any cross-referencing of
phenotype with either genomic markers or QTN. The size of trees, along with the
lifespan, means that field-testing trees is very expensive, either for a selection population
in itself or for establishing relationships between phenotypic values and DNA
polymorphisms. Unless the cost is accepted, which is a problem in itself, this in turn will
tend to restrict both the potential selection intensity and the quality of information
available on the relationships in question. In contrast, however, a key benefit of the long
lifespan is the lasting presence of genotypes across years, even decades or centuries,
almost “immortalizing” populations. Such a benefit can allow for repeated measurements
over time on the same populations, further leveraging genotypic data, and/or allow for
repeated DNA collections and therefore continued generation of genotypic data.
GENE-ASSISTED SELECTION 213
Successful application of association genetics in forest trees, as in all other species,
requires considerable genomic information, either in the species of interest or in some
highly syntenic species. Currently, forest tree species straddle the pre- and postgenome
divide, with the majority (especially conifers) in the former. Recently, the full genome
sequence of a poplar (Populus) has been determined, a first for a forest tree species
(http://www.jgi.doe.gov/poplar). A further effort is currently underway in Eucalyptus
As with many other plant and animal species, however, the roles of genes in trait
variation are largely unknown. To date, there are no reports of QTL having been cloned
from forest tree species, partly due to the large number of candidates within QTL
confidence intervals, but also because of the length of time required for trait expression
of transformants arising from complementation studies, as well as the largely subtle
effects expected for most QTL, together requiring considerable experimental resources to
confirm complementation.
LD and nucleotide diversity, insofar as the latter governs functional variation, are
the two key parameters for evaluating the efficacy of association genetics. To date, there
have been relatively few extensive studies of LD in forest trees (see Gupta et al. 2005 for
a review of LD in higher plants). Studies conducted in the 1980s with relatively limited
numbers of polymorphic isozyme loci indicated limited or no LD, as would be expected
in outbred species with relatively large effective population sizes. Mitton et al. (1980)
found higher-than-expected digenic LD (6 out of 30 locus pairs) in Pinus ponderosa.
Similarly, Roberds and Brotschol (1985) found evidence for age-related differences in the
incidence of LD in Liriodendron tulipifera. Muona and Szmidt (1985) reported no
evidence of LD in either pollen or megagametophytes in Pinus sylvestris. A study in
Pinus contorta by Epperson and Allard (1987) showed higher-than-expected LD, but
was limited to certain locus combinations, with some closely linked loci not in LD.
Geburek (1998) also reported higher-than-expected digenic LD in Picea abies, although
most instances were restricted to two or fewer subpopulations. In most of the aforementioned
isozyme-based studies, nonrandom mating and/or selection on a limited number of loci
were the most frequent explanations offered for higher-than-expected observed LD.
Studies with DNA-based markers have tended to reveal similar results. Bucci and
Menozzi (1995) reported no LD in a small sample of P. abies using RAPD markers. A
later study in P. radiata, involving microsatellite marker loci from a range of linkage
groups, also indicated very little genome-wide LD (Kumar et al. 2004). More recently, a
number of results from DNA sequence data have been reported for conifers as well as
Eucalyptus (Thumma et al. 2005) and Populus (Yin et al. 2004), surveying LD patterns
in relatively small regions in and around expressed genes. Results to date generally
indicate very short regions of LD, particularly in conifers where r² values tend to
decrease to zero within a few hundred to a few thousand base pairs (Table 10.1, and
associated references), although there is considerable variability even within genes. Some
exceptions have been noted in the average length of LD within genera; Yin et al. (2004)
reported significant LD in regions around the MXC3 resistance gene in Populus
trichocarpa in the order of 16–34 kb. These results indicate that while on average the
amount of LD is confined to relatively short spans in forest tree species, variations need
to be taken into account, which can only be characterized via empirical data on genes of
interest.
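The r² statistic quoted in these studies can be illustrated with a short calculation. The following is a minimal sketch rather than the procedure used in any of the cited work. It assumes phased haplotypes at two biallelic loci, with alleles coded 0 and 1, and the function name ld_r2 is hypothetical. D is the departure of the observed two-locus haplotype frequency from the product of the allele frequencies, and r² scales D² by the allele-frequency variances.

```python
def ld_r2(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B) pairs,
    each allele coded 0 or 1 (an assumed, illustrative encoding).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")

    # Frequencies of allele 1 at each locus and of the 1-1 haplotype
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # Coefficient of disequilibrium
    d = p_ab - p_a * p_b

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when a locus is monomorphic")
    return d, d * d / denom


if __name__ == "__main__":
    # Complete association: only the 1-1 and 0-0 haplotypes occur
    print(ld_r2([(1, 1)] * 5 + [(0, 0)] * 5))
    # Linkage equilibrium: all four haplotypes equally frequent
    print(ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

With unphased diploid genotypes the haplotype frequencies would first have to be estimated, which this sketch does not attempt.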
Genus and species     | Extent of LD                                   | Metric(s)    | No. of genes | Nucleotide diversity (synonymous) | Nucleotide diversity (nonsynonymous) | References
----------------------|------------------------------------------------|--------------|--------------|-----------------------------------|--------------------------------------|-----------
Pinus radiata         | No evidence between unlinked SSR markers       | r²           | N/A          | N/A                               | N/A                                  | Kumar et al. (2004)
P. radiata            | Not estimated                                  | N/A          | 1            | 0.0300                            | 0.0043                               | Cato et al. (2006)
P. radiata            | Not estimated                                  | N/A          | 8            | 0.0008                            | 0.00005                              | Pot et al. (2005)
P. pinaster           | Not estimated                                  | N/A          | 8            | 0.0003                            | 0.00015                              | Pot et al. (2005)
P. sylvestris         | None observed within approx. 2 kb (a)          | r²           | 11           | 0.0056                            | 0.0022                               | Dvornyk et al. (2002)
P. taeda              | 2,000 bp                                       | r² ∼ 0.2     | 19           | 0.0064                            | 0.0011                               | Brown et al. (2004b)
Pseudotsuga menziesii | 1,000 bp                                       | r² ∼ 0.1     | 18           | 0.0105                            | 0.0021                               | Krutovsky and Neale (2005)
Picea abies           | 100–200 bp                                     | r² ∼ 0.2     | ?            | Not provided                      | Not provided                         | Unpublished results cited in Rafalski and Morgante (2004)
Eucalyptus nitens     | “Similar results to maize and Pinus”           | r²           | 1            | Not estimated                     | Not estimated                        | Thumma et al. (2005)
Populus trichocarpa   | Up to 34 kb                                    | Not provided | 1            | Not estimated                     | Not estimated                        | Yin et al. (2004)
Populus tremula       | <500 bp                                        | r² < 0.05    | 5            | 0.0220                            | 0.0059                               | Ingvarsson (2005)
(a) Analyses based on one gene only.
Nucleotide diversity in forest tree species appears to be variable both among and
within species. In most conifers, typical reported values range between ca. 10⁻² and 10⁻⁴,
with some variation within species (Krutovsky and Neale 2005). Overall, forest trees
appear to show more such diversity than humans, but slightly less than that observed in
species such as maize (Brown et al. 2004b). Diversity appears to be lower in coding
sequences, with nonsynonymous substitutions being less frequent than synonymous
substitutions, although rarely are such differences reported as being statistically
significant – for example, Brown et al. (2004b) found no evidence for selection in 19
genes in P. taeda, while Krutovsky and Neale (2005) reported evidence for selection in
P. menziesii in three of 18 expressed genes. Cato et al. (2006) reported evidence for
selection in a putative dehydrin gene in P. radiata, and found weak associations between
the same gene and both wood density and growth rate.
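Nucleotide diversity values such as those above are commonly computed as the average number of pairwise differences per site among sampled sequences. The following is a minimal sketch of that calculation, assuming equal-length, pre-aligned sequences with no gaps or missing data. The function name nucleotide_diversity and the example sequences are hypothetical and are not data from the cited studies.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site (pi) for aligned sequences.

    seqs: list of equal-length strings; gaps and missing data are not
    handled in this illustrative sketch.
    """
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, non-empty and equal in length")

    # Count mismatched sites over every unordered pair of sequences
    total_diffs = sum(
        sum(x != y for x, y in zip(s, t)) for s, t in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diffs / (n_pairs * length)


if __name__ == "__main__":
    # Four made-up 10-bp sequences with two segregating sites
    example = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
    print(nucleotide_diversity(example))
```

Separate estimates for synonymous and nonsynonymous sites, as in Table 10.1, would require restricting the comparison to the relevant class of sites, which is beyond this sketch.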
The moderate nucleotide diversity, coupled with the typically low LD per base pair,
indicates a relatively high number of haplotypes per genic region. For example,
Krutovsky and Neale (2005) found that there were approximately 2–3 haploblocks per
gene, thus on average, 4–5 single nucleotide polymorphisms (SNPs) would be needed to
adequately cover most single genes for association genetics applications.
What is the significance of these results for association genetics in conifers? Firstly,
the observed levels of nucleotide diversity indicate there is sufficient polymorphism for
association genetics studies. Secondly, the relatively small regions of LD give some
cause for optimism regarding functional assignment, as the small regions of LD observed
within most genes indicate the possibility of implicating genes (or even small regions
within, or associated with, genes) in trait variation. The disadvantage is that relatively
One of the key features of outcrossing species such as forest trees is the expectation
of widespread linkage equilibrium within unstructured populations, and conversely, the
expectation of strong LD within specific pedigrees. The latter has been extensively
utilized to date in the field of QTL mapping based on pedigreed populations (usually full-
sib families), leading to the development of linkage maps for a wide range of species and
demonstration of the potential for within-family MAS. This approach, however, has
various limitations, including the restriction of selection to within specific families for
which the marker allele–trait associations have been previously established (Strauss et al.
1992; Johnson et al. 2000; Wilcox et al. 2001).
From a tree breeding perspective, the key feature of association genetics is the
opportunity to select both among and within families, by establishing relationships
between polymorphisms and heritable trait variation outside of any family structure.
However, because LD is restricted to relatively small chromosomal regions in forest tree
species, we consider that the most likely polymorphisms to be associated with trait
variation are those within, or associated with, expressed genes. For this reason we use the
term “gene-assisted selection” to denote the application of within- and/or among-family
selection based on polymorphisms shown to be associated with trait variation in
unstructured populations, i.e., association genetics.
The idea of selecting genotypes based on DNA sequence variation is not new – the
concept of MAS is indeed based on the same principles, i.e., selecting on the phenotypic-,
and/or discrete isozyme-, and/or DNA-sequence variants that are correlated, through
linkage, with phenotypic variation in commercially relevant traits. There are key
differences between MAS and GAS, however (Table 10.2), from perspectives of both
research and operational implementation. Here, the terms GAS and MAS are used
primarily to define differences relevant to typical forest tree breeding; we refer to MAS
as a technology for within-family selection only, in contrast to GAS, where selection can
in theory be applied at the family level, in addition to individual genotypes within
families, without prior pedigree information. These differences are not trivial with respect
to the objectives and design of the underlying experiments needed to detect and quantify
marker–trait associations. For example, for MAS, marker–trait associations are generally
detected using pedigreed mapping populations, thereby maximizing linkage disequilibria
between neutral markers and QTL that control detectable proportions of the phenotypic
variation. For GAS, researchers basically accept and work with the existing levels of
(dis)equilibria, however incomplete, that prevail in populations within which there are no
recognized patterns of interrelatedness. Marker systems are likely to differ also, although
in limited cases there may be some overlap. For MAS, selectively neutral marker systems
adequate for development of moderate-density linkage maps and high-throughput (HTP)
genotyping are considered satisfactory (e.g., RAPDs, AFLPs, microsatellites). For GAS,
however, we consider it is more likely that polymorphisms associated with candidate
gene sequences, i.e., SNPs, and insertions/deletions (indels), would be the marker
systems of choice.
Table 10.2. Comparisons of requirements for MAS based on QTL detection and GAS
based on association genetics in a tree breeding context
Stromberg et al. (1994) classified generic benefits relating to the use of DNA markers
for selection into three areas: earlier selection; cheaper, more cost-effective selection; and
increased selection intensity. In the context of GAS in a tree improvement program these
also apply, but for the sake of completeness, can be expanded. The following, partly
overlapping areas are where we consider most of the potential benefits will be:
(1) Earlier selection. Perhaps the single most important limiting factor in plantation
forest tree improvement has been selection age. The vast majority of characteristics
do not adequately express their genotypic value until one-quarter to one-half of
rotation age, which is a key factor influencing the long generation intervals typical
of most tree breeding programs. GAS, like MAS, offers the tantalizing prospect of
selecting at an emergent seedling stage, rather than waiting many years for
adequate trait expression. Such early selection can be used as a substitute for direct
phenotypic selection, or as a complement in a multistage selection procedure, or
simultaneously with information on phenotype. The net effect will be to increase
selection intensity (see (3)). A further benefit, particularly in the cases of plus-tree
and among-family selection, is the prospect of screening individuals without need to
generate and evaluate offspring, which will further reduce generation interval by
directly evaluating genotype.
(2) Cheaper, more cost-effective selection. Knowledge of the sequence variants and their
effects on phenotype offers the opportunity to select on sequence alone, which
could reduce or perhaps ultimately eliminate the need for field screening. Field testing is
one of the most expensive components of tree breeding programs, and sequence-
based selection is likely to be cheaper, particularly for multitrait breeding objectives
where expensive-to-measure traits such as wood properties are involved.
Furthermore, advances in DNA technologies offer further reductions in costs in the
medium term, whereas phenotypic measurements are likely to remain relatively
expensive. One factor to consider, however, is the reasonably high cost of
establishing marker–trait associations, which means that a large-scale breeding
operation may be needed to justify use of GAS. Nonetheless, these costs can be
reduced through various means such as pooling DNA samples (e.g., Germer et al.
2000). Moreover, the associations are expected to hold across a number of
generations, so costs can be spread accordingly provided generation intervals are
short. However, detecting marker–trait associations in LD populations may
require at least several thousand genotypes for small-effect QTL, even for
modest levels of power and ability to infer association (Ball 2005;
Chapter 8).
(3) Increased selection intensity. This can result partly from the low cost of producing
young propagules that can be screened by GAS and partly from the higher-
throughput evaluation capacity. Even with current moderate- to high-throughput
genotyping technologies, there is capacity to screen far more genotypes than can be
field-tested, at potentially much lower cost. Thus, genetic gains are likely to increase,
particularly with multitrait breeding objectives that will tend to require larger
numbers of selection candidates. In fixed-resource phenotypic-screening programs
the addition of another trait into a breeding objective will typically incur costs in
gain for any single specific trait unless the “new” and “existing” traits are strongly
220 P.L. WILCOX ET AL.
and favorably correlated. Such a cost can be reduced with increased selection
intensities, but in contemporary breeding programs this usually means more
phenotypic evaluations (often on progenies) and possibly introduction of new
genotypes into breeding populations. GAS could be used as a surrogate selection tool
in these situations, although there may be the challenge of establishing the requisite
associations simultaneously in several traits.
(4) Reduced need for phenotypic selection. The combined result of selection that is
cheaper and/or earlier and/or more intensive may mean, in theory, at least, that GAS
could ultimately replace phenotypic selection. This is based on the intriguing
possibility that concomitant advances in genomics, proteomics, and metabolomics
could eventually lead to development of predictive models that integrate information
on gene sequences with information on environmental influences to predict
phenotype, thus reducing reliance on phenotypic selection, and basing genetic
selection entirely upon DNA sequence. The reduced reliance on field testing has
several distinct advantages, including a reduction in costs and/or a concomitant
increase in effectiveness of a tree improvement program via reallocating financial
resources to other components of the operational program. Field testing is one of the
most costly items in a tree improvement program, not just in terms of data collection,
but also trial establishment and maintenance, and to a lesser extent, analyzing data
and maintaining records. While the need for various forms of field experiments will
likely persist even once all genes are sufficiently well characterized with respect to
effects on trait variation (e.g., genetic gain trials), significant cost reductions should
become possible.
(5) Increased flexibility for operational evaluation and selection of genotypes.
Knowledge of phenotypic value associated with specific DNA sequences that can be
applied across unrelated genotypes expands the scope of potential application. GAS
can be applied to plus-tree selection, as well as among- and within-family selection,
in contrast to MAS, where associations between marker alleles and trait variability
are family-specific, and are thus applicable to within-family selection only.
Therefore, in theory a genetic value can be placed on any specific individual based
on DNA sequence information, where sequence has some nonzero association with
trait value. While implementation of GAS would lead to more field trialling initially
because of the need to find sufficient associations between markers and traits,
ultimately, GAS could reduce need for “common-garden” testing, and allows new
introductions to be evaluated without progeny evaluations (as is typically practiced).
(6) Complementary/synergistic fit with both existing and new genetics technologies to
enhance genetic gains. Because various genetic technologies are available for forest
tree improvement, in addition to an array of new technologies currently being
developed, there are typically alternative routes to delivery of genetic gain. GAS
potentially offers an additional technological route, in that it can either complement –
or possibly supplant – phenotypic selection, but in addition, fits well with newer
technologies. We describe in more detail in Section 10.9 the fit with new
biotechnologies.
(7) Prediction of genotypic value and enhanced opportunities for optimizing
combinations of genotypic, site, and silvicultural characteristics. Eventually, the
knowledge of DNA sequences underpinning heritable variation could be combined
with knowledge of key environmental and silvicultural influences to predict
phenotypic characteristics. While this is a far-reaching goal, it is a tantalizing
GENE-ASSISTED SELECTION 221
relatively small subset of genotypes – could be an effective prescreen for genes more
likely to be associated with trait variations, although some caveats apply regarding
power to detect effects of selection (Wright and Gaut 2005).
Successful application of association genetics for forest tree breeding must depend
on the context of a well-structured breeding program. Genetic variation for economic
traits is essential, and must be proven, while important genetic correlations between
different economic traits need to be at least reasonably understood. Achieving this will
entail major progress towards obtaining the populations needed for detecting associations
between DNA polymorphisms and phenotypes. Efficient assays, which can be used on
young trees, are important for this purpose, just as they are for conventional breeding.
This will generally require new measurement technology, and/or easily measured
juvenile traits that are good proxies for harvest-age economic traits. For wood quality,
the SilviScan instrument (e.g., Evans 1994; Evans et al. 1999) has been developed to
measure several detailed anatomical properties, and this has been complemented by an
improved understanding of how such properties affect processing- and product-
performance characteristics. Resistance to certain diseases can be assayed by inoculation
trials of young seedlings (e.g., Powers et al. 1982). Very early evaluation for growth
rate, however, can be very problematic: juvenile–mature correlations can be low,
physiological variables can show highly nonlinear relationships with performance, and
metabolite fluxes can be far more important than metabolite concentrations.
More specific requirements for applying GAS include quantitative capabilities, both
in providing appropriate material to furnish phenotypic data and in managing, analyzing,
and interpreting phenotypic and genomic data; access to HTP genotyping technologies;
and good marker selection. This involves selection of candidate genes that could be
associated with quantitative variation, and discovery and evaluation of important
polymorphisms. We discuss each of these requirements.
Operational implementation will depend not only on meeting the various technical
conditions listed above, but also on meeting organizational and even institutional
requirements. Between the tree breeders and the genomic scientists there need to be
close communication and considerable mutual education. Allocation of resources to the
various parties will be a continuing challenge. A further challenge will lie in maintaining
a strategic focus, whereby GAS and other new technologies can be used to best long-term
advantage.
The total scale of undertakings for successful development and application of GAS
will typically require collaboration between institutions, including industry, specialist
research organizations, and universities. This will need to be achieved in the face of a
climate of competitive bidding for research funding and the various pressures to
appropriate Intellectual Property for individual organizations’ own gain.
can be detected and utilized. These issues are discussed in more detail elsewhere in this
book (Chapters 7 and 8). We cover components relevant to application of association
genetics for tree breeding.
commercially important traits, there has been some debate regarding the true nature of the
underlying variation. Early studies involving relatively small populations indicated genetic
variation was dominated by a few genes of moderate effect; however, these results were
difficult to repeat, even in the same families (Wilcox et al. 1997; Sewell and Neale 2000).
Interpretations of those early studies may therefore have been erroneous in that results are
also consistent with genetic architecture involving genes of small effect only, similar to
that described in corn (Beavis 1994), and subsequent verification, when done, has
indicated this to be the case (Wilcox et al. 1997; Sewell and Neale 2000; Brown et al.
2003). Therefore for most traits, we contend that the underlying genetic architecture is
most likely to be dominated by genes of relatively small effect contributing a few percent
of the variation at most (e.g., Devey et al. 2004). An exception may be that interspecific
hybrids could involve genes of moderate–large effect (e.g., Bradshaw and Stettler 1995),
although small-effect genes may also have a role. Experimental designs for association
genetics will therefore need to be cognizant of these architectures, particularly genes of
small effect, if selection is going to be effective.
A number of different experimental designs could be used to detect associations
between QTL/N and polymorphisms, such as an unstructured population consisting of
putatively unrelated (or distantly related) genotypes; or combined with information on
progeny (analogous to a TDT design, except using quantitative traits); or alternatively a
hybrid QTL–LD population (see Chapter 8 and references therein). Some of these
approaches have been evaluated in a manner more relevant to forest trees (e.g., Wu et al.
2002; Ball 2005). Furthermore, some of the genetic characteristics of forest trees parallel
humans (e.g., high levels of heterozygosity, adverse effects of inbreeding, longevity), for
which much has been written in regard to the theory and efficacies of specific
experimental designs and analytical procedures, and are therefore relevant to tree species.
We review some of this literature here, and refer the reader to Chapter 8 for a more
extensive review.
A number of theoretical studies have been conducted, particularly in comparing
designs with and without use of information from sibs. A somewhat unclear picture has
emerged to date, however, partly because of differing assumptions and input values used
for simulations. Long and Langley (1999) showed that for smaller-effect QTL (∼5% of
phenotypic variance), unstructured or random populations were more powerful than
TDT-based designs, and that power increased more when greater numbers of individuals
rather than markers were used. Moreover, they concluded that, in unstructured
populations, sample sizes of ≥500 individuals would suffice to detect small-effect QTL
assuming a Type-1 error rate of 0.05. A further and nontrivial finding was that equally
large populations would be needed to verify any detected associations.
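As a rough illustration of how power depends on sample size for a QTL of this magnitude, the following Python sketch uses a textbook large-sample approximation (the Fisher z-transformation of the marker–trait correlation). It is not the simulation procedure of Long and Langley (1999). It assumes that the marker itself captures the stated fraction of phenotypic variance (i.e., complete disequilibrium with the QTL), that the trait is approximately normal, and that a single test is made at a comparison-wise Type-1 error rate; the function name and the sample sizes in the loop are ours, for illustration only.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_power(var_explained, n, alpha=0.05):
    """Approximate power of a two-sided test of zero marker-trait correlation.

    var_explained: assumed fraction of phenotypic variance captured by the marker.
    n: number of unrelated individuals genotyped and phenotyped.
    alpha: comparison-wise Type-1 error rate.
    """
    std_normal = NormalDist()
    r = sqrt(var_explained)                    # implied marker-trait correlation
    shift = atanh(r) * sqrt(n - 3)             # expected value of the z statistic
    crit = std_normal.inv_cdf(1 - alpha / 2)   # two-sided critical value
    return std_normal.cdf(shift - crit) + std_normal.cdf(-shift - crit)


# Illustrative sample sizes (assumptions, not values from the studies cited).
for n in (100, 250, 500, 1000):
    print(n, round(approx_power(0.05, n), 3))
```

Because the sketch assumes complete disequilibrium and a single test, it gives an optimistic upper bound; incomplete disequilibrium or more stringent significance thresholds would lower the power at any given sample size.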
Wu et al. (2002) developed theory for combined linkage- and linkage-
disequilibrium mapping, based on use of genotypic information from a single parent
combined with genotypic and phenotypic information from offspring, analogous to
multiple half-sib families, as is often used in breeding population testing. They compared
different combinations of family numbers and sizes, and compared the power to detect a
segregating QTL of large effect with that of an unstructured population without information
from progenies. In contrast to Long and Langley (1999), their simulation results
indicated that use of information from progenies was more powerful than use of unstructured
populations alone, particularly with low disequilibrium, assuming the same number of
individuals genotyped. Results also indicated that few families with many offspring per
family were more powerful than many families with few offspring. A key benefit of this
approach is that the use of progenies obviates the need to independently evaluate
population structure. However, because these results were based upon a single QTL with
a large effect (both additive and dominance terms equal to residual error), relevance of
these results may well be limited, as individual QTL effects are typically much less than
residual variance. Therefore these results would need more careful evaluation using a
range of QTL effects more relevant to known genetic architectures.
Most of the above studies have involved estimating power with comparison-wise
Type-1 error rates in the region of 0.01–0.05. However, such values may be problematic
in reality because actual results in that range of P-values may not equate to strong evidence
for an association. Using a Bayesian approach based on theory originally developed by
Luo (1998), Ball (2005) calculated that P-values in the range of 0.01–0.05 actually
represented weak evidence against an association for sample sizes in the range of 432–1,200
individuals in an unstructured population. P-values in the range of 10⁻⁴ would be more
indicative of evidence for an association, assuming high prior expectation for an
association (see Chapter 8). This also implies that larger sample sizes than those
generally reported above would be needed.
Ball (2005) also showed that very large sample sizes are necessary for high power
(0.9) of detection of QTL with small effects (explaining 1–5% of total variance) when
using either candidate genes or a genome scan in an unstructured population. To obtain
high power with strong posterior odds (Bayes Factor >20) with moderate disequilibrium
(D′ = 0.1), sample sizes ranging from 6,800 to 40,100 would be necessary to detect QTL
of 5 and 1% effect, respectively. Such sample sizes are based in part on relatively low
prior odds, which may be increased through generation of additional experimental and
biological information on specific genes (e.g., expression profiles, evidence of selection),
so that sample sizes could be reduced. However, even with relatively high prior
odds, sample size requirements will still be relatively high. Furthermore, Ball (2005)
quantified the power to detect QTL when marker and QTN frequencies differed. Even
with very large sample sizes (19,200 and 38,400 genotypes), there is relatively low
power to detect rare QTN with intermediate marker allele frequencies, even when in
almost complete disequilibria. This is an important consideration, given that long-
term genetic gains are driven by low-frequency QTN, along with mutations that arise
during the selection period.
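The role of prior odds in this argument can be made concrete with the odds form of Bayes' rule, in which posterior odds equal the Bayes factor multiplied by the prior odds. The short Python sketch below is illustrative only: the prior odds are hypothetical values chosen by us, not those used by Ball (2005), and the function name is ours. The Bayes factor of 20 is the threshold quoted above.

```python
def posterior_probability(prior_odds, bayes_factor):
    """Posterior probability of a true association via odds-form Bayes' rule."""
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical prior odds for a candidate polymorphism (illustration only).
for prior_odds in (1 / 1000, 1 / 100, 1 / 10, 1.0):
    print(prior_odds, round(posterior_probability(prior_odds, 20), 3))
```

The same Bayes factor yields very different posterior probabilities depending on the prior odds, which is why additional biological evidence on a candidate gene can reduce the sample size needed to reach a given level of posterior certainty.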
What can be concluded regarding optimal experimental designs based upon the
work described above, and what are the implications for tree breeding programs? Firstly,
moderate- to large-effect genes are likely to be easily detected using material from
existing breeding populations, as long as there are sufficient numbers (200–1,000
putatively unrelated genotypes with phenotypic records available). For smaller-effect
genes, which are likely to dominate the genetic architecture of quantitative traits in
particular, much larger sample sizes are likely to be needed; therefore augmentation of
existing breeding populations with genotypes from natural populations may be necessary.
The implication here is that such augmentation will require common-garden
experimentation, which is time-consuming, and could delay or militate against use of
association genetics. Furthermore, maintenance of genetic diversity of nonbreeding
population genotypes is also a necessity. Optimal designs with sufficient power for
detection of small-effect QTL will therefore need to be ascertained in the context of tree
improvement programs, most likely necessitating numerical simulation on a case-by-case
basis.
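A minimal sketch of the kind of numerical simulation referred to above is given below in Python. It is deliberately simplified and is not a design used in any of the studies cited: it assumes a single additive biallelic QTN that is genotyped directly, Hardy–Weinberg proportions, unrelated individuals, normally distributed residuals, and a single test at a comparison-wise Type-1 error rate. All function and parameter names are ours. A realistic evaluation for a breeding program would also need to model disequilibrium between markers and QTN, relatedness, population structure, and multiple testing.

```python
import random
from math import atanh, sqrt
from statistics import NormalDist, mean


def simulate_power(n, var_explained, allele_freq=0.5, alpha=0.05,
                   reps=200, seed=1):
    """Monte Carlo power to detect one additive QTN typed directly."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Genotype (allele count) variance under Hardy-Weinberg proportions is 2pq.
    geno_sd = sqrt(2 * allele_freq * (1 - allele_freq))
    effect = sqrt(var_explained) / geno_sd   # per-allele effect; total variance is 1
    resid_sd = sqrt(1 - var_explained)
    hits = 0
    for _ in range(reps):
        # Allele count (0, 1 or 2) for each of n unrelated individuals.
        g = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
             for _ in range(n)]
        y = [effect * gi + rng.gauss(0, resid_sd) for gi in g]
        mg, my = mean(g), mean(y)
        sxx = sum((a - mg) ** 2 for a in g)
        syy = sum((b - my) ** 2 for b in y)
        if sxx == 0 or syy == 0:
            continue                          # monomorphic sample: no test possible
        sxy = sum((a - mg) * (b - my) for a, b in zip(g, y))
        r = sxy / sqrt(sxx * syy)
        # Fisher z test of zero correlation between genotype and phenotype.
        if abs(atanh(r)) * sqrt(n - 3) > crit:
            hits += 1
    return hits / reps


print(simulate_power(500, 0.05, reps=100))
```

Replacing the directly typed QTN with a marker in partial disequilibrium, or lowering `allele_freq`, is the natural next step when adapting such a sketch to a specific program.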
haplotype-based methods, and simulations suggested lower Type-1 error rates. Genotypic
data are sometimes cheaper to obtain, as direct sequencing is not necessary.
HTP facilities are necessary for sequencing and genotyping, for both detecting
associations and operational selection. Extensive sequencing and resequencing are
required, even if only a small subset of genotypes are used for initial scans of candidate
gene regions. HTP genotyping is an obvious prerequisite, given the large amount of data
generation necessary for adequately conducting powerful association tests. Whether or
not specific breeding programs choose to develop “in-house” capacity or choose to
outsource this component will be a choice made on a case-by-case basis.
Figure 10.1. Generic process for selecting candidate genes and generating polymorphism information on
association tests.
Generic methods for candidate gene selection are described in more detail elsewhere
in this book. Here, we outline more specific approaches that could be considered, noting
that except for Populus and Eucalyptus, there will be very little genome-wide data
available for the subject species, although for most commercially important genera extensive EST
sequence information is available, if not in the species of interest, then in a closely related
species. Note, too, that selection of candidate genes can be based on more than a single
criterion, although the relative efficacies of the various criteria are not yet known. Such
criteria include:
– Choosing orthologous genes to those in model plant species that have been
shown to have a role in traits of interest (Figure 10.1, Box A). For example,
Thumma et al. (2005) found that polymorphism in an intronic region of a CCR
gene was statistically associated with microfibril angle in E. nitens in a small
association population. This gene was chosen because it is homologous to the
IRX4-causing CCR in A. thaliana. However, it is not yet known to what extent
and which plant model systems can predict roles of the homologous genes
governing endogenous variation in forest tree species. If, in the more complex
conifer genomes, there is a greater tendency for large gene families affording
some degree of functional redundancy, information from short-lived
angiosperms could be of limited value.
– Similar to the above, but using information on mutations and knowledge of gene
sequences (and expression patterns of the sequences) from other forest tree
species. For example, while an annual-plant model system could have limited
applicability, a model system based on a woody perennial (e.g., Populus) could
be more useful. In either case, the role of comparative genomics is crucial.
– Endogenous genes based on known or suspected role(s) in relevant biochemical
pathways (Figure 10.1, Box B), e.g., genes involved in lignin biosynthesis as a
preliminary choice to investigate natural variation in lignin chemistry. Much
molecular information has been generated on this topic, and the key regulatory
genes have been identified (e.g., Huntley et al. 2003). Such an approach has
been used in mammalian systems, although with mixed success. For example,
the Booroola gene in sheep (FecB), which causes elevated fecundity, was
initially thought to be due to natural variation in FSH, a gene encoding a
follicle-stimulating hormone. However, subsequent linkage analysis showed
otherwise (Dodds et al. 1993), which was later verified by identifying the
causative gene.
– Information from transcript profiling (Figure 10.1, Box C), identifying genes
whose expression patterns are correlated with specific traits. A number of
differential-expression technologies have been developed, including
microarrays, cDNA–AFLP and similar approaches, and are now extensively used,
although not as tools in breeding programs. Such technologies do reveal many
candidates – possibly too many to be used as a screening tool alone. Moreover,
heritable variation may arise for reasons other than differential expression of
allelic variants. In reviews of cloned plant QTL, only three of ten QTL whose
mechanisms were determined were shown to be due to differential expression
(Salvi and Tuberosa 2005). Nonetheless, combining expression-profiling
technologies with QTL mapping shows considerable promise. A number of studies
have shown this hybrid approach to be useful in identifying the genes potentially
causing trait variation (Wayne and McIntyre 2002). For example, Kirst et al.
(2003) reported a candidate gene underpinning a major-effect QTL in an
interspecific Eucalyptus hybrid. Furthermore, Cato et al. (2006) reported a
dehydrin gene associated with both wood density and growth rate in P. radiata
that showed allelic differences in transcript abundance in different wood-
forming tissues within the same genotype.
– A variant of the above, using proteomics rather than mRNA populations. The
lack of complete correspondence between translation and transcription may be a
useful means to eliminate those genes that are less likely to contribute to trait
variation. Moreover, this approach has promise in that it may also identify gene
products whose contribution to trait variation may be due to reasons other than
differential expression (e.g., protein folding, etc.). Such an approach has not
been extensively tried yet, at least not in forest trees.
– Expressed genes that consistently colocalize with QTL regions in multiple
pedigreed QTL mapping populations, either within or across species
(Figure 10.1, Box D). In practice, this could be of limited value, as confidence
intervals around QTL are likely to cover much of a chromosome, particularly
where sample size is limited (Dupuis and Siegmund 1999). Nonetheless,
pedigreed mapping populations could be used as an additional screening step.
However, caution is recommended: small–moderate size QTL mapping
populations could be of limited value as they may not be sufficiently powerful to
detect QTL, so that a lack of association is not conclusive; or else the QTN
may not be segregating in the particular pedigree(s) being used. If using
information from another species to infer trait association in the subject species,
then evidence for nonrandom colocation of QTL for traits of interest should be
determined a priori, otherwise use of information from other species will be of
little value.
– Genes that have been shown to be associated with variation in traits of interest
via association genetics in other species (Figure 10.1, Box E). Caveats regarding
utility of transferability of QTL across species mentioned above also apply.
Nonetheless, marker–trait associations that occur in homologous sequences
across species may also serve as independent validation of associations.
– Use of genetic transformation to determine potential role(s) of candidate genes
(Figure 10.1, Box F). This approach involves modification of endogenous gene
function in some manner, e.g., enhancer trapping, RNAi, over-expression, etc.
However, for forest trees, such approaches have limited promise, particularly in
species where trait expression takes years, and/or have low transformation
efficiencies. Other technical problems could also be limiting, e.g., sense
suppression in the case of over-expression. Regulatory issues could also impact,
particularly where field trials are necessary. However, this approach may be
useful in cases where in vitro or early-assay systems have been developed,
particularly where transient expression can result in a discernable phenotype.
As we learn more about the function of specific genes alone, and in concert with
other genes, other criteria are likely to be added to the above list. Moreover, as more
information from each of these sources becomes available, it will be possible to evaluate
the relative efficacy of each of these criteria. Suffice to say, the roles of structural and
comparative genomics, proteomics, molecular biology, as well as knowledge of
physiological roles of specific genes, are crucial. Very few of these skills are currently
utilized by, or available within, current tree breeding programs.
Of interest too, are the identity and nature of regulatory regions associated with
candidate genes (Morgante and Salamini 2003; Paran and Zamir 2003). Because trait
variation could be a result of gene regulation, there is a need to ascertain – via de novo
sequencing if necessary – regulatory sequences. This should be easily achievable for
promoter sequences in close proximity to open reading frames, but may be more difficult
for trans-acting enhancer elements, particularly if such sequences are not known a priori.
The generic advantages of using association genetics in tree breeding have already
been stated (cf. Stromberg et al. 1994). For effective use there are many possibilities.
Some of the issues will be common to both MAS (including marker-based and marker-
assisted selection) and true GAS based on QTN, and some will be specific to one or the
other. To be effective, use in tree breeding of nucleotide–trait associations derived from
association genetics must be integrated with what is essentially existing tree improvement
practice. Such practice includes the arrangement and structuring of breeding populations,
and the manner in which genetic gain is delivered into plantation forests. For the future,
the practices can be modified as true GAS becomes possible.
Tree breeding differs from much traditional crop plant breeding because of various
factors, including relatively little history of domestication, moderate–high levels of
genetic load, and long generation intervals imposed by slowness to reach reproductive
competence and/or late expression of trait values. Forest tree breeding tends therefore to
take a population-based approach involving many genotypes, where populations are
usually structured into a hierarchy (Burdon 1988):
breeding population becomes the “engine room” for cumulative genetic advance,
building up frequencies of favorable alleles through successive cycles of mating, genetic
recombination, and selection. For clonal forestry, clonal selection will typically be done
within crosses between top-ranked parents which may be common to both the breeding
population and existing seed orchards.
To complicate matters, tree breeding typically involves multitrait breeding
objectives, and some programs also develop specific breeds that focus on improving
differing sets of traits (Jayawickrama and Carson 2000). Application of GAS in tree
improvement programs needs to fit into this general framework in a cost-effective
manner. We will now consider potential applications of GAS in the context of such
population hierarchies.
In programs where new plus-tree selections are required, GAS may be useful as a
prescreening tool either to increase selection intensity, or to cull candidates down to those
of sufficient promise to warrant costs of testing, and of forwards selection among
offspring. Here, GAS has, in theory, the advantage of favoring selection well before full
phenotypic expression, therefore increasing the available number of selection candidates.
However, this may be constrained by the cost of phenotyping relative to genotyping, plus
the desideratum of ascertaining marker–trait associations for the multiple traits that comprise
a breeding goal. Nonetheless, marker–trait associations could be accumulated over time
from association tests, and utilized as they become available, thereby increasing scope for
adding new material into breeding populations. Similarly, genotypes could be identified
for immediate deployment, in addition to incorporating them into breeding populations –
assuming propagation systems exist to cost-effectively multiply selected genotypes
without detrimental effects of maturation. For instance, in response to a biotic crisis (e.g.,
outbreak of a new disease or pest) GAS could be directly applied to identify genotypes
more likely to be resistant to the pathogen or pest, rather than undertake laborious
phenotypic screening. Specific genes could then be integrated more quickly into the
relevant populations. Prospects for widespread application of GAS for plus-tree selection
may be limited in practice, however, as population sizes for detecting associations would
most likely exceed those required for breeding population advancement. Moreover,
knowledge of nucleotide–trait associations may come to hand too late for fresh plus-tree
selection, especially with traits of late expression.
by forwards selection for the multitrait criteria. Backwards selection, from progeny-test
results, is also used to rank parents, particularly for production populations.
For breeding population advancement, the same marker–trait associations as might
be used for plus-tree selection described above could be used for selecting among and
within families, to increase selection intensity, as an early selection tool, and/or to reduce
costs. However, even within breeding populations, specific applications will be context-
dependent. For example, in main populations, which are generally less intensively
managed than elite populations, GAS could be used as a surrogate for more expensive-to-
measure traits. Here, phenotypic data could be generated on cheap-to-assay traits (e.g.,
growth rate) and GAS used for more expensive or later-expressing traits (e.g., certain
wood properties). However, for the time being, DNA polymorphisms are likely to
characterize less additive genetic variation than phenotypic records, resulting in
potentially less gain for traits selected just on marker information. Such a reduction could
be offset by increasing selection intensity among, and particularly, within families.
Trade-offs will need to be carefully evaluated, initially at least via simulation.
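Before undertaking full simulation, the trade-off between accuracy and selection intensity can be explored with the standard expression for response to truncation selection, R = i × r × σA, where i is the selection intensity, r the accuracy of selection, and σA the additive genetic standard deviation. The Python sketch below is a first-order illustration under simplifying assumptions of our own: a normally distributed selection criterion, a very large candidate pool, mass selection on phenotype with accuracy equal to the square root of heritability, and selection on markers alone with accuracy equal to the square root of the proportion of additive variance the markers explain. The heritability, the proportion explained, and the selected fractions are hypothetical values.

```python
from math import sqrt
from statistics import NormalDist


def selection_intensity(proportion_selected):
    """Standardized selection differential for truncation selection."""
    nd = NormalDist()
    threshold = nd.inv_cdf(1 - proportion_selected)
    return nd.pdf(threshold) / proportion_selected


def relative_response(proportion_selected, accuracy):
    """Expected response in additive genetic SD units: R / sigma_A = i * r."""
    return selection_intensity(proportion_selected) * accuracy


h2 = 0.25          # hypothetical narrow-sense heritability
p_markers = 0.10   # hypothetical share of additive variance explained by markers

# Select 1 in 10 on phenotype versus 1 in 100 on marker information alone.
phenotypic = relative_response(0.10, sqrt(h2))
marker_only = relative_response(0.01, sqrt(p_markers))
print(round(phenotypic, 2), round(marker_only, 2))
```

Varying `p_markers` and the selected fractions shows how much extra selection intensity is needed to compensate for markers that capture only part of the additive variance; finite family sizes and multitrait objectives would require the fuller simulations mentioned above.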
For any breeding, an ideal is saving rare or low-frequency QTN that have current or
contingently favorable additive effects. Such alleles can be the key to longer-term
genetic gain and/or coping with a biotic crisis. For detecting, preserving and increasing
the frequencies of these QTN, instead of losing them to genetic drift, GAS may be
crucial. However, such a pursuit may well be deemed too expensive for breeding
programs that are dominated by shorter-term financial imperatives.
In elite populations, with the fewer families for intensive measurement and
selection, opportunities may exist for more intensive selection and faster turnover of
generations. For combined among- and within-family selection, there is more scope to
increase selection intensity within families. Because association tests identify markers in
strong disequilibria with QTN, it may be relatively easy to detect pedigrees within which
the predominant linkage phase is reversed. Undetected reverse-phase linkages are likely
to be serious within small elite populations, or any other small breeding groups within the
breeding population; simulation would again be helpful in quantifying potential
reductions in gain.
Reducing generation intervals through use of GAS would depend on the trees
becoming reproductively competent before trait expression. However, if markers or
actual QTN were used as a surrogate for trait expression, genotypes could be screened as
soon as sufficient tissue can be spared for DNA assays, even in germinating seedlings.
Some conifers, in particular, are typically reproductively competent before selection age
for at least some commercially important traits, creating a real potential for use of GAS to
shorten generation interval. However, this would require marker–trait associations that
explain substantial additive genetic variance for at least some important breeding goal
traits. While this could one day be achieved, it is currently more likely that associations will
explain only a proportion of additive variance for just a subset of traits.
Thus, trade-offs between expected gain per generation and rate of generation turnover will
need to be carefully evaluated.
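The balance between gain per generation and generation turnover can be written as gain per year = i x r x sigma_A / L, where L is the generation interval in years. The sketch below is illustrative only and is not taken from this chapter. It assumes equal selection intensity under both schemes, and the accuracies and intervals are hypothetical.

```python
def gain_per_year(intensity, accuracy, sigma_a, generation_years):
    """Annual response: (i * r * sigma_A) / L."""
    if generation_years <= 0:
        raise ValueError("generation interval must be positive")
    return intensity * accuracy * sigma_a / generation_years


def break_even_accuracy(pheno_accuracy, pheno_years, marker_years):
    """Marker accuracy needed to match phenotypic gain per year.

    Assumes the same selection intensity and sigma_A under both schemes.
    """
    if pheno_years <= 0 or marker_years <= 0:
        raise ValueError("generation intervals must be positive")
    return pheno_accuracy * marker_years / pheno_years


# Hypothetical scenario: halving the generation interval from 12 to 6 years.
PHENO_ACCURACY = 0.6
PHENO_YEARS = 12
MARKER_YEARS = 6

needed = break_even_accuracy(PHENO_ACCURACY, PHENO_YEARS, MARKER_YEARS)
# Fraction of additive variance the markers would have to explain.
needed_variance_fraction = needed ** 2

print(f"break-even marker accuracy: {needed:.2f}")
print(f"additive variance to be explained: {needed_variance_fraction:.2%}")
```

With these assumed numbers, halving the generation interval means markers need only half the accuracy of phenotypic selection to match its gain per year. Because accuracy is the square root of the variance explained, that corresponds to a quarter of the squared accuracy.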
It is more likely that, in the shorter term at least, selection in elite breeding
populations would be implemented in a multistage approach, using marker information as
an early screening tool, followed by phenotypic records. Such an approach could either
increase selection intensity (by screening more genotypes), or reduce costs of phenotypic
evaluation by short-listing genotypes for field testing, to achieve the same gain.
Alternatively, using GAS to select for later-onset traits – if the nucleotide–trait
4 The type of correlations that can in principle be attacked effectively in this way would be correlations
resulting from important chromosomal linkages that are persisting following fusion of differentiated ancestral
populations, rather than correlations stemming from pleiotropic effects.
disequilibrium with these QTN, has the benefit of obviating the need for screening
families with specific pathotypes, to determine which families carry which resistance
genes. Combining or “pyramiding” different resistance genes, preferably within the same
individuals, can promise resistance that is durable against mutations and genetic shifts in
the pathogen (Burdon 2001). Thus, phenotyping costs can be much reduced, as well as
time required for manipulation of frequencies. This may be a great advantage in the event
of a biotic crisis where low-frequency resistance is required to quickly combat a new
disease or pest. The advantage would be increased by the desirability of pyramiding
different resistance factors. Genotypes carrying such QTN can be identified in the
breeding population (including directly estimating QTN frequencies), enabling among-
and within-family selection to be carried out over a large proportion of the breeding
population. In such circumstances, it is likely that at least some of the resistant genotypes
will be suboptimal for other traits, so GAS might be used to select for other properties to
reduce the loss in genetic gain.
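Directly estimating QTN frequencies, and judging how many candidates already carry a pyramid of resistance alleles, involves only simple counting. The sketch below is a hypothetical illustration, not a procedure from this chapter. It assumes biallelic QTN genotyped without error, Hardy-Weinberg proportions, and independence (no LD) between the resistance loci. The genotype counts are invented.

```python
def allele_frequency(n_hom_res, n_het, n_hom_sus):
    """Frequency of the resistance allele from diploid genotype counts."""
    total = n_hom_res + n_het + n_hom_sus
    if total == 0:
        raise ValueError("no genotypes supplied")
    return (2 * n_hom_res + n_het) / (2 * total)


def carrier_fraction(p):
    """Expected share of individuals with at least one resistance allele.

    Assumes Hardy-Weinberg proportions at the locus.
    """
    return 1.0 - (1.0 - p) ** 2


def pyramid_fraction(frequencies):
    """Expected share of individuals carrying resistance at every locus.

    Assumes the loci are independent (no LD between them).
    """
    fraction = 1.0
    for p in frequencies:
        fraction *= carrier_fraction(p)
    return fraction


# Invented counts for one low-frequency resistance QTN in 100 genotyped trees.
p_res = allele_frequency(n_hom_res=1, n_het=8, n_hom_sus=91)
print(f"estimated allele frequency: {p_res:.3f}")
print(f"expected carriers of one factor: {carrier_fraction(p_res):.4f}")
print(f"expected carriers of two such factors: {pyramid_fraction([p_res, p_res]):.6f}")
```

The steep drop in the expected share of individuals carrying two factors shows why screening large numbers of genotypes matters when pyramiding low-frequency alleles.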
Despite the prevalence of inbreeding depression in forest trees, use of inbreeding as
a breeding tool has attractions because it can theoretically amplify the expression of
additive gene effects (e.g., Burdon and Russell 1999; Russell et al. 2003). In most
species, however, the challenge will be to “purge” highly deleterious recessive alleles
(“hard” genetic load) that threaten viability and/or often mask the expression of favorable
additive gene effects in inbred lines (e.g., Williams and Savolainen 1996). MAS has
promise for such purging, because QTL effects of hard load should be relatively easy to
detect in individual pedigrees in order to purge such alleles even in the heterozygous state
(cf. Kuang et al. 1999). Use of GAS in this way, however, may not be effective, because
such genetic load almost certainly represents alleles that are individually rare but occur at
very many loci and are therefore very unlikely to be involved in any general LD.
Production populations comprise the genotypes that either provide seed for
deployment into plantation forests, or are used for large-scale vegetative propagation for
clonal forestry. These populations usually have a few tens of genotypes at any one time,
and actually represent subsets of the breeding populations and are subject to most of the
same considerations as the breeding populations for the applications of GAS. As subsets,
they represent a relatively narrow genetic base compared to the breeding- and gene-
resource populations. Related matings are avoided as far as possible, to avert inbreeding
depression. Various systems are used to deliver commercial planting stock. Some
programs use open-pollinated seed orchards, to produce seedlings. Other programs use
control-pollination technologies, where top genotypes are pollinated with pollen from
either single or multiple parents. Seed from these either provides seedlings for planting
stock, or is vegetatively multiplied as nursery cuttings or as plantlets raised from in vitro
culture, but, despite the average level of genetic improvement, this still produces
uncharacterized segregating offspring genotypes. For clonal forestry, genotypes produced
by intercrossing top parents are subject to a further round of testing and selection, before
identifying and mass-propagating top clones for deployment.
Production populations are of key importance, as it is from these populations that
seed and plant producers obtain most of their revenues. Additional costs associated
with this form of selection can therefore be offset in a shorter time period than in the
breeding population applications, as few if any products are delivered to forest growers directly
from breeding populations. Furthermore, there is continual pressure on breeding
programs to deliver gains to commercial plantations faster and/or at greater rates.
Production populations are therefore more likely to be target populations for applying
GAS, at least in the shorter term.
GAS, along with its variants, has obvious possibilities for selecting individual
offspring for clonal forestry and/or subsequent vegetative amplification of a narrow range
of genotypes – such as in situations where “family forestry” is combined with vegetative
amplification. The parents – while they may already have been selected with the aid of
GAS – will almost certainly still be highly heterozygous, so the expected genetic
variation within any sort of family will be considerable for most quantitatively inherited
traits. Where GAS is based on markers in LD with the QTN rather than on the QTN itself,
response to selection of a limited number of clones in a limited number of families could
be very vulnerable to reversals of the prevailing linkage phase, especially as this material
will represent one more generation in which decay of LD can occur. On the other hand, the
small number of families should make it relatively easy to verify linkage phases in
individual pedigrees. The results of Wilcox et al. (2001) indicate that this scenario could
be cost-effective in the context of within-family selection (MAS) based on neutral DNA
markers.
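Two ideas in this paragraph lend themselves to a short illustration: the loss of LD with each extra generation, and verification of linkage phase within an individual pedigree. The sketch below is a simplified, hypothetical example. It uses the standard random-mating expectation D_t = D_0 (1 - c)^t and assumes that parental haplotypes are already known. The allele labels and function names are invented for illustration.

```python
def ld_after_generations(d0, recomb_rate, generations):
    """Expected LD coefficient after t generations: D_t = D_0 * (1 - c)^t."""
    if not 0.0 <= recomb_rate <= 0.5:
        raise ValueError("recombination rate must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - recomb_rate) ** generations


def parent_phase(haplotypes, marker_allele, qtn_allele):
    """Classify the marker-QTN linkage phase in one parent.

    haplotypes: two (marker, qtn) tuples, one per homologous chromosome.
    Returns 'coupling' if the given marker allele sits on the same haplotype
    as the given QTN allele, 'repulsion' if it sits on the other one, and
    'uninformative' if the parent is not heterozygous at both sites.
    """
    first, second = haplotypes
    markers = {first[0], second[0]}
    qtns = {first[1], second[1]}
    if (len(markers) < 2 or len(qtns) < 2
            or marker_allele not in markers or qtn_allele not in qtns):
        return "uninformative"
    if (marker_allele, qtn_allele) in (first, second):
        return "coupling"
    return "repulsion"


# One extra generation at 1% recombination removes 1% of the existing LD.
print(ld_after_generations(0.20, 0.01, 1))

# Hypothetical parents: the population-wide association is marker A with QTN Q.
print(parent_phase((("A", "Q"), ("a", "q")), "A", "Q"))
print(parent_phase((("A", "q"), ("a", "Q")), "A", "Q"))
```

A parent classified as "repulsion" is one in which selecting on the marker would favor the unfavorable QTN allele among its offspring. This is the reverse-phase risk the text describes for small numbers of families.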
In selecting clones for clonal forestry the potential of GAS for selecting rare
recombinants, especially involving QTN, looks particularly attractive, because such
recombinants could not be produced reliably through sexual reproduction within any
reasonable timeframe.
Where new traits must be addressed in the breeding goal, the emphasis in selection
for production populations is likely to shift in favor of forwards selection over backwards
selection, which is likely to favor use of GAS if the appropriate associations can be
established.
For disease resistance (and possibly some cases of insect-pest resistance), the
potential of GAS for advantageous pyramiding of resistance factors looks especially
valuable. This could be all the more important where durability of resistance may depend
on certain individual resistance alleles remaining at minority frequencies, in pyramiding
at the level of the population rather than the individual genotype.
As already stated, a key feature of GAS is the complementary fit with other genetic
technologies, including those currently under development. For these new technologies to
be applicable, they need to be more cost-effective at delivering genetic gains than
conventional technologies. A number of new technologies are under development, and
are at various stages of readiness for implementation in tree breeding programs. Here, we
consider examples of new technologies that can be used to complement GAS and greatly
enhance its effectiveness.
Scope exists for integrating GAS strategy with that of MAS. Because most
commercially important tree breeding programs are now well into advanced-generation
selection, there is significant emphasis on within-family selection in order to maintain the
breadth of genetic base and avoid undue build-up of co-ancestry. MAS could be used for
within-family selection although some limitations have been noted (Strauss et al. 1992;
Kerr and Goddard 1997; Johnson et al. 2000), including the need for large individual
family sizes necessary for achieving genetic gain for most quantitatively inherited
characteristics (Wilcox et al. 2001). Given the high cost of detection of marker–trait
associations for MAS on a family-by-family basis, it is likely that in breeding programs
using MAS, detection of marker–trait associations will have been undertaken in only a
subset of families in their respective breeding programs. Here, GAS could be used both
as an aid to among-family selection and to augment MAS for within-family selection
where family-specific marker–trait associations for MAS are not available. There are two
potential benefits in doing this: firstly, increased genetic gains for reasons outlined above,
and secondly, alleviation of the accelerated build-up of co-ancestry that could occur
with the operational dependence on MAS. With MAS, accelerated co-ancestry could
arise through MAS being available only for a small proportion of pedigrees which could
therefore contribute disproportionate numbers of selections. More broadly applicable
marker–trait associations (i.e., GAS), by facilitating selection from all pedigrees, would
not be conducive to the same build-up of co-ancestry. Given the large sample sizes per
family that are needed to detect QTL so as to achieve moderate genetic gains from MAS
(Wilcox et al. 2001), practicing MAS across large numbers of essentially unrelated
families becomes prohibitive. In comparison, GAS requires much lower sample sizes
when averaged across the number of parents in breeding populations (discussed below).
However, this advantage could be offset to some extent by the need to identify and assay
many more polymorphisms per candidate gene, although there is potential to reduce
sampling costs through techniques such as pooling DNA samples from phenotypic
extremes (Michelmore et al. 1991). Moreover, in specific cases such as dominant
major genes for disease and insect resistance (cf. Bus et al. 2000), which do not require
large sample sizes for detection, MAS is likely to be an effective means of obtaining
gain; when thus detected, such genes may then be amenable to use of GAS, with the help
With the various technologies for in vitro propagation (e.g., organogenesis and
somatic embryogenesis), the opportunity for early identification of top genotypes has
benefits both for amplifying limited quantities of top genetic material and for
developing material for clonal testing and deployment. This form of early selection
not only increases selection intensity, but also could be used to increase the efficiency of
tissue culture by identifying genotypes more likely to propagate well – although having
to select for propagation behavior is liable to be at the expense of potential genetic gain in
other directions. This also applies to in vivo vegetative propagation. However, with a
number of propagation technologies in various commercially important forest tree
species, further development of propagation technologies may be required to fully utilize
the potential from GAS.
GAS experiments (LD populations) are also useful as screening populations for
identifying potential causative QTN, allowing integration of molecular and selection
technologies by sharing common experimental platforms. The potential offered by
association genetics experiments to identify candidates offers molecular biologists the
opportunity to use genetics to inform roles and functions, thereby elucidating the
particular roles of specific genes and the manners in which they might interact at a
whole-organism level, either informing or complementing in vitro or model plant studies.
Benefits arising from identification of causal mechanisms and pathways, apart from
improved understanding of the molecular basis for heritable variation, include identifying
genes (and methods) to create and exploit variation based on understanding the causal
mechanisms (including potential pleiotropic effects). In the shorter term, a further benefit
includes the identification of which and what type of genes could be targeted to create
new “mutations” (via transformation) of potentially larger effect (Section 10.9.2 and above).
While the potential for GAS in tree breeding looks positive, implementation in
commercial breeding programs faces a number of key obstacles. These include the high
cost of implementation, institutional barriers, and technical impediments due to certain
molecular mechanisms underpinning trait variation. We briefly discuss each of these
below.
A key impediment to uptake is the high up-front cost of implementation, which is
particularly important given that most commercial breeding programs need to bear most
or all of the costs, whereas the benefits of genetic gain tend to accrue further down
the forestry value chain, which can take decades to materialize. Reasons for high
implementation costs include:
High costs mean GAS is unlikely to be an attractive option for species and/or
breeding objectives with low commercial value. Even for species with greater
commercial value, the additional investment may not be considered affordable,
particularly for existing operational programs that lack additional financial resources with
which to develop and implement the operational infrastructure necessary for GAS.
Therefore, careful evaluation of specific implementation strategies, including their costs
and benefits, is most likely to be necessary.
Certain mechanisms underpinning trait variation could also prevent effective
development of GAS. An example particularly relevant to species with limited
commercial value and/or relatively limited availability of nongenic DNA sequence
(particularly those with large genomes) is where causative QTN occur many kilobases
distal to expressed genes. Such is the case for the Vgt1 locus in corn, which has been
shown via association genetics to map to a 2 kb region that is 70 kb away from the
nearest open reading frame (Salvi et al. 2006). If such distal transacting regulatory factors
dominate trait variability, then extensive amounts of gDNA resequencing will be
required. This would significantly add to costs, as well as reduce efficacy, particularly for
large-genome species, effectively precluding application in gymnosperms, as well
as a number of hardwood species. Another example is where trait architecture is
predominantly composed of clusters of small-effect QTN per QTL. Such architecture is
theoretically possible, and further experimentation will reveal whether or not this is the
case. Experiments of sufficient power will be necessary, increasing cost and time
required to detect QTN. Furthermore, genotyping costs per unit of gain will be greater,
potentially offsetting expected benefits.
Another technical limitation is the predictive value of associations in the light of
potential modes of gene action, particularly epistasis. Nucleotide substitution effects
would usually be estimated by averaging over allelic combinations sampled in
association tests. However, the selected variants may not be well represented in
association tests, so the predictive value of multilocus QTN could be limited in the
presence of epistasis. Evidence from genetic tests in conifers indicates that large-effect
epistasis is unlikely to be prevalent, but does not rule out smaller epistatic effects. Such
interactions are plausible, given the nature of interdependent biosynthetic pathways that
give rise to phenotype, but may not be observed (or even important) in large outbred
deployment populations that are typically derived from open- and control-pollinated seed
orchards. Conversely, for clonal forestry, where GAS could potentially be used to
identify candidates for further testing, such interactions could be important, particularly if
candidates available to be screened are unlikely to include optimal multitrait genotypes
because of biological limitations on the numbers of seed that could be produced for
screening.
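The point that a substitution effect is an average over the genetic backgrounds sampled can be shown with a deliberately small example. The sketch below is hypothetical and much simplified. It treats two biallelic loci on a haploid (gametic) basis, assumes no LD between them, and uses invented trait values in which allele A1 raises the trait only in the presence of B1.

```python
def average_substitution_effect(values, freq_b1):
    """Average effect of replacing allele A0 with A1.

    values[a][b] is the mean trait value for allele a at locus A (0 or 1)
    combined with allele b at locus B (0 or 1). The effect is the
    background-specific difference, weighted by the frequency of each
    allele at locus B.
    """
    if not 0.0 <= freq_b1 <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    effect_on_b0 = values[1][0] - values[0][0]
    effect_on_b1 = values[1][1] - values[0][1]
    return (1.0 - freq_b1) * effect_on_b0 + freq_b1 * effect_on_b1


# Invented epistatic values: A1 adds 4 units, but only alongside B1.
VALUES = [
    [10.0, 10.0],  # A0 with B0, A0 with B1
    [10.0, 14.0],  # A1 with B0, A1 with B1
]

effect_in_test_population = average_substitution_effect(VALUES, freq_b1=0.1)
effect_in_selected_material = average_substitution_effect(VALUES, freq_b1=0.9)

print(f"estimated where B1 is rare:    {effect_in_test_population:.1f}")
print(f"realized where B1 is common:   {effect_in_selected_material:.1f}")
```

In this example, an effect estimated in an association population where B1 is rare understates the effect in selected material where B1 is common. The reverse error would occur if the frequencies were swapped.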
A specific, potentially important class of epistasis is co-adapted gene complexes.
This phenomenon is possible in forest trees, although some surprising cases have been
observed of essentially independent inheritance of traits that would seem to have
common adaptive significance (Howe et al. 2003). If, however, such complexes do exist,
they must be considered when generating and selecting new variants, necessitating the
detection and, if necessary, management of haplotypic complexes. Fortunately, further
experimentation to detect such complexes may be unnecessary, as existing technologies
combined with association test populations may well be adequate. We envision that such
research will be undertaken over the next few years. If present, means of managing co-
adapted complexes in tree breeding programs will need to be implemented; although this
may not be difficult in theory, it may present major logistical challenges.
GAS may have little or no utility for backwards selection and reselection within
existing breeding and production populations, particularly where progeny tests are
already established and measured for other traits. Such instances may not be rare, as
breeding objectives and strategies are frequently being revised, and new traits are often
introduced into breeding programs in response to factors such as new biological pressures
and/or market signals. In these cases, it may be more cost-effective to screen extant
families for new properties. In breeding programs with limited resources, the short-term
cost-effectiveness of such approaches may restrict or prevent investment in technologies
such as GAS, which are longer-term in delivery of improved germplasm, unless marker–trait
relationships can be easily established in association tests that result in a significant
proportion of trait variation being explained by markers.
Institutional barriers to implementation also exist. In the case of breeding
cooperatives and companies whose programs are based on phenotypic selection, barriers
can exist to understanding the nature and complexities of molecular genetics applications
as most programs have tended not to use such tools routinely and, when they have, usually in
some conceptually simple application such as verification of parentage or clonal identity.
Convincing such organizations, which tend to be conservative, to implement this
technology, may be difficult particularly in light of the few results to date that clearly
demonstrate ease of detecting associations let alone actual genetic gains. Furthermore,
fluctuations in the relative economics of plantation forestry and frequent ownership
changes can prevent adequate investment from nongovernment sources to appropriately
develop and implement the technology. This may be particularly important where
plantation ownership is dominated by investors with short-term financial goals, who are therefore
unwilling to participate in longer-term activities such as association genetics.
For reasons described above, we foresee that GAS is most likely to be implemented
in breeding programs where there are good operational links between molecular
geneticists and tree breeders (as well as others), either moderate to high product values or
sufficient scale to allow costs to be widely spread, and sufficient investment over the
requisite period of time to enable discovery of suitable numbers of marker–QTN
relationships.
10.11 CONCLUSIONS
Application of association genetics in plantation forest tree species has the potential
to increase genetic gains from among- and/or within-family selection via a number of
routes such as increased selection intensities and/or earlier selection. Such selection can
be applied to virtually all strata of hierarchically structured populations used in tree
improvement, although it is likely that the most immediate applications will be in
populations used to provide seed for commercial plantations, owing to the relatively
shorter timeframe to recover additional costs associated with detecting marker–trait
associations. Other potential benefits include cheaper selection, reduced need for
phenotypic selection, and complementary fit with other biotechnologies used either
commercially or in research, as well as use of the same experimental infrastructure for
purposes other than selection.
The few studies to date of LD in forest trees indicate relatively short spans of LD,
implying that finding disequilibria between markers and causative QTN will need to be undertaken
via judiciously chosen candidate genes (hence use of the term “gene-assisted selection”),
particularly in conifers where large genomes effectively preclude cost-effective whole
genome resequencing.
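The span of LD referred to here is usually summarized with pairwise statistics such as r-squared between polymorphic sites. The sketch below is a minimal illustration that assumes phased, biallelic haplotype data. The sample is invented, and the function is not drawn from any of the studies cited.

```python
def r_squared(haplotypes, allele_a, allele_b):
    """Squared correlation between alleles at two biallelic sites.

    haplotypes: list of (site1_allele, site2_allele) tuples, one per
    sampled chromosome. r^2 = D^2 / (pA (1 - pA) pB (1 - pB)),
    where D = pAB - pA * pB.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(1 for s1, _ in haplotypes if s1 == allele_a) / n
    p_b = sum(1 for _, s2 in haplotypes if s2 == allele_b) / n
    p_ab = sum(1 for s1, s2 in haplotypes
               if s1 == allele_a and s2 == allele_b) / n
    denominator = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denominator == 0.0:
        raise ValueError("both sites must be polymorphic in the sample")
    d = p_ab - p_a * p_b
    return d * d / denominator


# Invented sample of 100 chromosomes at two sites.
SAMPLE = ([("A", "B")] * 40 + [("a", "b")] * 40
          + [("A", "b")] * 10 + [("a", "B")] * 10)

print(f"r^2 = {r_squared(SAMPLE, 'A', 'B'):.2f}")
```

Where r-squared between a marker and a causative site falls quickly with physical distance, as the studies cited suggest for forest trees, a marker must lie very close to the QTN to be useful. This is the argument for working within candidate genes.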
There are a number of important prerequisites for GAS to be successful. These
include effective integration of existing tree breeding skills with molecular genetics,
10.11 REFERENCES
Allison, D.B., 1997, Transmission-disequilibrium tests for quantitative traits. American Journal of Human Genetics 60:676–690.
Ball, R.D., 2001, Bayesian methods for quantitative trait loci mapping based on model selection: approximate
analysis using the Bayesian Information Criterion. Genetics 159:1351–1364.
Ball, R.D., 2005, Experimental designs for reliable detection of linkage disequilibrium in unstructured random
population association studies. Genetics 170:859–873.
Beavis, W.D., 1994, The power and deceit of QTL experiments: lessons from comparative QTL studies. pp.
250–266. In: Proceedings of the 49th Annual Corn and Sorghum Industry Research Conference.
American Seed Trade Association, Washington, DC.
Bonga, J.M., von Aderkas, P., 1993, Rejuvenation of tissues from mature conifers and its implications for
propagation in vitro. In: Clonal Forestry (Eds. M.R. Ahuja, W.J. Libby) pp. 182–199. Springer-Verlag,
Berlin Heidelberg.
Bradshaw, H.D., Stettler, R.F., 1995, Molecular genetics of growth and development in Populus. IV. Mapping
QTLs with large effects on growth, form, and phenology traits in a forest tree. Genetics 139:963–973.
Brown, G.R., Bassoni, D.L., Gill, G.P., Fontana, J.R., Wheeler, N.C., Megraw, R.A., Davis, M.F., Sewell,
M.M., Tuskan, G.A., Neale, D.B., 2003, Identification of quantitative trait loci influencing wood
property traits in loblolly pine (Pinus taeda L.) III. QTL Verification and candidate gene mapping.
Genetics 164:1537–1546.
Brown, G.R., Gill, G.P., Kuntz, R.J., Beal, J.A., Nelson, C.D., Wheeler, N.C., Penttila, B., Roers, J., Neale,
D.B., 2004a, Associations of candidate gene single nucleotide polymorphism with wood property
phenotypes in loblolly pine (Abstr.). Plant and Animal Genome XII, 10–14 January 2004, San Diego,
CA.
Brown, G.R., Gill, G.P., Kuntz, R.J., Langley, C.H., Neale, D.B., 2004b, Nucleotide diversity and linkage
disequilibrium in loblolly pine. Proceedings of the National Academy of Sciences of the United States of
America 101:15255–15260.
Bucci, G., Menozzi, P., 1995, Genetic variation of RAPD markers in a Picea abies Karst. population. Heredity
75:188–197.
Burdon, R.D., 1982, The Roles and Optimal Place of Vegetative Propagation in Tree Breeding Strategies.
In: Proceedings of IUFRO Meeting on Genetics and Breeding Strategies pp. 66–83. Sensenstein, Germany.
Burdon, R.D., 1988, Recruitment for breeding populations: objectives, genetics, and implementation. In:
Proceedings of Second International Conference on Quantitative Genetics (Eds. B.S. Weir, E.J. Eisen,
M.M. Goodman, G. Namkoong) pp. 555–572. Sinauer, Sunderland, MA.
Burdon, R.D., 1992, Genetic survey of Pinus radiata. 9: general discussion and implications for genetic
management. New Zealand Journal of Forest Science 22:174–198.
Burdon, R.D., 2001, Genetic diversity and disease resistance: some considerations for research, breeding and
deployment. Canadian Journal of Forest Research 31:596–606.
Burdon, R.D., Namkoong, G., 1983, Multiple populations and sublines. Silvae Genetica 32:221–222.
Burdon, R.D., Russell, J.H., 1999, Inbreeding depression in selfing experiments: statistical issues. Forest
Genetics 5:179–189.
Burdon, R.D., Firth, A., Low, C.B., Miller, M.A., 1998, Multi-site provenance trials of Pinus radiata in New
Zealand. Forest Genetic Resources No 26. pp. 3–8. FAO, Rome.
Bus, V.G., Gardiner, S.E., Bassett, H.C.M., Ranatunga, C., Rikkerink, E.H.A., 2000, Marker assisted selection
for pest and disease resistance in the New Zealand apple breeding programme. Acta Horticulturae
538:541–547.
Casasoli, M., Derory, J., Morera-Dutrey, C., Brendel, O., Porth, I., Guehl, J.M., Villani, F., Kremer, A., 2006,
Comparison of quantitative trait loci for adaptive traits between oak and chestnut based on an expressed
sequence tag consensus map. Genetics 172:533–546.
Cato, S.A., Pot, D., Kumar, S., Douglas, J., Gardner, R.C., Wilcox, P.L., 2006, Balancing selection in a
dehydrin gene associated with increased wood density and decreased radial growth in Pinus radiata
(Abstr.). Plant and Animal Genome XIV, 14–18 January 2006, San Diego, CA.
Chagné, D., Brown, G., Lalanne, C., Madur, D., Pot, D., Neale, D., Plomion, C., 2003, Comparative genome
and QTL mapping between maritime and loblolly pines. Molecular Breeding 12:185–195.
Deng, H.-W., 2001, Population admixture may appear to mask, change or reverse genetic effects of genes
underlying complex traits. Genetics 159:1319–1323.
Devey, M.E., Delfino-Mix, A., Kinloch, B.B., Neale, D.B., 1995, Random amplified polymorphic DNA
markers tightly linked to a gene for resistance to white pine blister rust in sugar pine. Proceedings of the
National Academy of Sciences of the United States of America 92:2066–2070.
Devey, M.E., Sewell, M.M., Uren, T.L., Neale, D.B., 1999, Comparative mapping in loblolly and radiata pine
using RFLP and microsatellite markers. Theoretical and Applied Genetics 99:656–662.
Devey, M.E., Groom, K.A., Nolan, M.F., Bell, J.C., Dudzinski, M.J., Old, K.M., Matheson, A.C., Moran, G.F.,
2004, Detection and verification of quantitative trait loci for resistance to Dothistroma needle blight in
Pinus radiata. Theoretical and Applied Genetics 108:1056–1063.
Dodds, K.G., Montgomery, G.W., Tate, M.L., 1993, Testing for linkage between a marker locus and a major
gene locus in half-sib families. Journal of Heredity 84:43–48.
Dupuis, J., Siegmund, D., 1999, Statistical methods for mapping quantitative trait loci from a dense set of
markers. Genetics 151:373–386.
Dvornyk, V., Sirviö, A., Mikkonen, M., Savolainen, O., 2002, Low nucleotide diversity at the pal1 locus in the
widely distributed Pinus sylvestris. Molecular Biology and Evolution 19:179–188.
Echt, C.S., Vendramin, G.G., Nelson, C.D., Marquardt, P., 1999, Microsatellite DNA as shared genetic markers
among conifer species. Canadian Journal of Forest Research 29:365–371.
Epperson, B.K., Allard, R.W., 1987, Linkage disequilibrium between allozymes in natural populations of
lodgepole pine. Genetics 115:341–352.
Evans, R., 1994, Rapid measurement of the transverse dimensions of tracheids in radial wood sections from
Pinus radiata. Holzforschung 48:168–172.
Evans, R., Kibblewhite, R.P., Stringer, S., 1999, Variation of microfibril angle, density and fibre orientation in
twenty-nine Eucalyptus nitens trees. Appita Journal 50:487–494.
Geburek, T., 1998, Genetic variation of Norway spruce (Picea abies [L.] Karst.) populations in Austria 1.
Digenic disequilibrium and microspatial patterns derived from allozymes. Forest Genetics 5:221–230.
Germer, S., Holland, M.J., Higuchi, R., 2000, High-throughput SNP allele-frequency determination in pooled
DNA samples by kinetic PCR. Genome Research 10:258–266.
Gupta, P.K., Rustgi, S., Kulwal, P.L., 2005, Linkage disequilibrium and association studies in higher plants:
present status and future prospects. Plant Molecular Biology 57:461–485.
Howe, G.T., Aitken, S.N., Neale, D.B., Jermstad, K.D., Wheeler, N.C., Chen, T.H.H., 2003, From genotype to
phenotype: unravelling the complexities of cold adaptation in forest trees. Canadian Journal of Forest
Research 33:1247–1266.
Huntley, S.K., Ellis, D., Gilbert, M., Chapple, C., Mansfield, S.D., 2003, Significant increases in pulping
efficiency in C4H–F5H transformed poplars: improved chemical savings and reduced environmental
toxins. Journal of Agricultural and Food Chemistry 51:6178–6183.
Ingvarsson, P.K., 2005, Nucleotide polymorphism and linkage disequilibrium within and among natural
populations of European aspen (Populus tremula L., Salicaceae). Genetics 169:945–953.
Jayawickrama, K.J.S., Carson, M.J., 2000, A breeding strategy for the New Zealand Radiata Pine Breeding Co-
operative. Silvae Genetica 49:82–90.
Johnson, G.R., Burdon, R.D., 1990, Family-site interaction in Pinus radiata: implications for progeny testing
strategy and regionalised breeding in New Zealand. Silvae Genetica 39:55–62.
Johnson, G.R., Wheeler, N.C., Strauss, S.H., 2000, Financial feasibility of marker-aided selection in Douglas-
fir. Canadian Journal of Forest Research 30:1942–1952.
Jones, L., Ennos, A.R., Turner, S.R., 2001, Cloning and characterization of irregular xylem4 (irx4): a severely
lignin-deficient mutant of Arabidopsis. The Plant Journal 26:205–216.
Kerr, R.J., Goddard, M.E., 1997, Comparison between the use of MAS and clonal tests in tree breeding
programmes. In: IUFRO ’97 Genetics of Radiata Pine (Eds. R.D. Burdon, J.M. Moore) pp. 297–303.
Proceedings of NZFRI/IUFRO Conference 1–4 December and Workshop 5 December, Rotorua,
New Zealand FRI Bulletin No. 203.
Kinloch, B.B., Parks, G.K., Flower, C.W., 1970, White pine blister rust: simply inherited resistance in sugar
pine. Science 167:193–195.
Kirst, M.E., Myburg, A.A., Sederoff, R.R., 2003, Genetical genomics of Eucalyptus: combining expression
profiling and genetic segregation analysis (Abstr.). Plant and Animal Genome XI, 11–15 January 2003,
San Diego, CA.
Kirst, M., Myers, R.M., De León, J.P.G., Kirst, M.E., Scott, J., Sederoff, R., 2004, Coordinated genetic
regulation of growth and lignin revealed by quantitative trait locus analysis of cDNA microarray data in
an interspecific backcross of eucalyptus. Plant Physiology 135:2368–2378.
Krutovsky, K.V., Neale, D.B., 2005, Nucleotide diversity and linkage disequilibrium in cold hardiness and
wood quality related candidate genes in Douglas-fir. Genetics 171:2029–2041.
Kuang, H., Richardson, T.E., Carson, S.D., Bongarten, B., 1999, Genetic analysis of inbreeding depression in
plus tree 850.55 of Pinus radiata D. Don. II. Genetics of viability genes. Theoretical and Applied
Genetics 99:140–146.
Kumar, S., Echt, C.S., Wilcox, P.L., Richardson, T.E., 2004, Testing for linkage disequilibrium in the New
Zealand radiata pine breeding population. Theoretical and Applied Genetics 108:292–298.
Lagercrantz, U., Ryman, N., 1990, Genetic structure of Norway spruce (Picea abies): concordance of
morphological and allozymic variation. Evolution 44:38–53.
Long, A.D., Langley, C.H., 1999, The power of association studies to detect the contribution of candidate
genetic loci to variation in complex traits. Genome Research 9:720–731.
Luo, Z.W., 1998, Detecting linkage disequilibrium between a polymorphic marker locus and a trait locus in
natural populations. Heredity 80:198–208.
Lynch, M., Walsh, B., 1997, Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA.
Michelmore, R.W., Paran, I., Kesseli, R.V., 1991, Identification of markers linked to disease-resistance genes
by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using
segregating populations. Proceedings of the National Academy of Sciences of the United States of
America 88:9828–9832.
Mitton, J.B., 1992, The dynamic mating system of conifers. New Forests 6:187–216.
Mitton, J.B., Sturgeon, K.B., Davis, M.L., 1980, Genetic differentiation in ponderosa pine along a steep
elevational transect. Silvae Genetica 29:100–103.
Morgante, M., Salamini, F., 2003, From plant genomics to breeding practice. Current Opinion in Biotechnology
14:214–219.
Muona, O., Szmidt, A.E., 1985, A multilocus study of natural populations of Pinus sylvestris. In: Lecture notes
in Bioinformatics. (Ed H.-R. Gregorious) pp. 226–240. Springer Verlag, Berlin.
Murray, B.G., 1998, Nuclear DNA amounts in gymnosperms. Annals of Botany 82(Supplement A):3–15.
Paran, I., Zamir, D., 2003, Quantitative traits in plants: beyond the QTL. Trends in Genetics 19:303–306.
Paux, E., Tamasloukht, M.B., Ladouce, N., Sivadon, P., Grima-Pettenati, J., 2004, Identification of genes
preferentially expressed during wood formation in Eucalyptus. Plant Molecular Biology 55:263–280.
Plomion, C., Richardson, T.E., MacKay, J., 2005, Advances in forest tree genomics: forest trees workshop,
plant and animal genome XIII conference, San Diego, CA, January 2005. New Phytologist 166:713–717.
Pot, D., McMillan, L.K., Echt, C.S., Le Provost, G., Garnier-Gere, P., Cato, S.A., Plomion, C., 2005,
Nucleotide variation in genes involved in wood formation in two pine species. New Phytologist 167:101–
112.
Powers, H.R., Hubbard, S.D., Anderson, R.L., 1982, Resistance to diseases and pests in forest trees. In: Proceedings
of Third International Workshop on Genetics of Host-Parasite Interactions in Forestry (Eds. H.M. Heybroek,
B.R. Stephan, K. von Weissenberg). pp. 427– 434. Pudoc, Wageningen, The Netherlands.
Pritchard, J.K., Rosenberg, N.A., 1999, Use of unlinked genetic markers to detect population stratification in
association studies. American Journal of Human Genetics 65:220–228.
Pritchard, J.K., Stephens, M., Rosenberg, N.A., Donnelly, P., 2000, Association mapping in structured
populations. Genetics 67:170–181.
Rafalski, J.A., Morgante, M., 2004, Corn and humans: recombination and linkage disequilibrium in two
genomes of similar size. Trends in Genetics 20:103–111.
Roberds, J.H., Brotschol, J.V., 1985, Linkage disequilibrium among allozyme loci in natural populations of
Liriodendron tulipifera L. Silvae Genetica 34:137–141.
Russell, J.H., Burdon, R.D., Yanchuk, A.D., 2003, Inbreeding depression and variance structures for height and
adaptation in self- and outcross Thuja plicata families in varying environments. Forest Genetics 10:171–
184.
Salvi, S., Tuberosa, R., 2005, To clone or not to clone plant QTLs: present and future challenges. Trends in
Plant Sciences 10:1360–1385.
Salvi, S., Sponza, G., Morgante, M., Tomes, D., Tuberosa, R., 2006, Confirmation of the maize flowering time
QTL Vgt1 by association mapping (Abstr.). Plant and Animal Genome XIV, 14–18 January 2006, San
Diego, CA.
GENE-ASSISTED SELECTION 247
Sewell, M.M., Neale, D.B., 2000, Mapping quantitative traits in forest trees. In: Molecular biology of woody
plants, forestry Sciences (Eds. S.M. Jain, S.C. Minocha) pp. 407–433. Kluwer Academic Publishers, The
Netherlands.
Strauss, S.H., Lande, R., Namkoong, G., 1992, Limitations of molecular marker-aided selection in forest tree
breeding. Canadian Journal of Forest Research 22:1050–1061.
Stromberg, L.D., Dudley, J.D., Rufener, G.K., 1994, Comparing conventional early generation selection with
molecular marker assisted selection in maize. Crop Science 34:1221–1225.
Telfer, E.J., Echt, C.S., Nelson, C.D., Wilcox, P.L., 2006, Comparative mapping in Pinus radiata and P. taeda
reveals co-location of wood density-related QTL (Abstr.). Plant and Animal Genome XIV, 14–18
January 2006, San Diego, CA.
Thornsberry, J.M., Goodman, M.M., Doebley, J., Kresovich, S., Nielsen, D., Buckler, E.S., 2001, Dwarf8
polymorphisms associate with variation in flowering time. Nature Genetics 28:286–289.
Thumma, B.R., Nolan, M.F., Evans, R., Moran, G.F., 2005, Polymorphisms in Cinnamoyl CoA Reductase
(CCR) are associated with variation in microfibril angle in Eucalyptus spp. Genetics 171:1257–1265.
Wayne, M.L., McIntyre, L.M., 2002, Combining mapping and arraying: an approach to candidate gene
identification. Proceedings of the National Academy of Sciences of the United States of America
99:14903–14906.
Wilcox, P.L., Amerson, H.V., Kuhlman, E.G., Liu, B.-H., O'Malley, D.M., Sederoff, R.R., 1996, Detection of a
major gene for resistance to fusiform rust disease in loblolly pine by genomic mapping. Proceedings of
the National Academy of Sciences of the United States of America 93:3859–3864.
Wilcox, P.L., Richardson, T.E., Carson, S.D., 1997, Nature of quantitative trait variation in Pinus radiata:
insights from QTL detection experiments. In: IUFRO ’97 Genetics of Radiata Pine (Eds. R.D. Burdon,
J.M. Moore) pp. 304–312. Proceedings of NZFRI/IUFRO Conference 1–4 December and Workshop 5
December, Rotorua, New Zealand FRI Bulletin No. 203.
Wilcox, P.L., Carson, S.D., Richardson, T.E., Ball, R.D., Horgan, G.P., Carter, P., 2001, Benefit-cost analysis
of DNA marker-based selection in progenies of Pinus radiata seed orchard parents. Canadian Journal of
Forest Research 31:2213–2224.
Williams, C.G., Savolainen, O., 1996, Inbreeding depression in conifers: implications for breeding strategy.
Forest Science 42:102–117.
Wright, S.I., Gaut, B.S., 2005, Molecular population genetics and the search for adaptive evolution in plants.
Molecular Biology and Evolution 22:506–519.
Wu, R., Zeng, Z.-B., 2001, Joint linkage and linkage disequilibrium mapping in natural populations. Genetics
157:899–909.
Wu, R., Ma, C.-X., Casella, G., 2002, Joint linkage and linkage disequilibrium mapping of quantitative trait loci
in natural populations. Genetics 160:779–792.
Yin, T.M., DiFrazio, S.P., Gunter, L.E., Jawdy, S.S., Boerjan, W., Tuskan, G.A., 2004, Genetic and physical
mapping of Melampsora rust resistance genes in Populus and characterization of linkage disequilibrium
and flanking genomic sequence. New Phytologist 164:95–105.
Yu, J., Pressoir, G., Briggs, W.H., Bi, I.V., Yamasaki, M., Doebley, J.F., McMullen, M.D., Gaut, B.S., Nielsen,
D.M., Kresovich, S., Buckler, E.S., 2006, A unified mixed-model method for association mapping that
accounts for multiple levels of relatedness. Nature Genetics 38:203–208.
Chapter 11
PROSPECTS OF ASSOCIATION MAPPING IN
PERENNIAL HORTICULTURAL CROPS
Erik H. A. Rikkerink et al.
1 The Horticulture and Food Research Institute of New Zealand Limited (HortResearch), Mount Albert
Research Centre, Private Bag 92169, Auckland, New Zealand
2 HortResearch, Hawke’s Bay Research Centre, Private Bag 1401, Havelock North, New Zealand
3 HortResearch, Palmerston North Research Centre, Private Bag, Palmerston North, New Zealand
11.1 INTRODUCTION
Horticultural crops share some characteristics with each of the crop groups covered
in Chapters 9 and 10. Since some of these properties have already been discussed we
simply outline the major areas of commonality and difference between the crop groups in
Table 11.1. We present an overview of the combination of characteristics which are more
or less peculiar to horticultural crops and then go on to outline in more detail where
horticultural crops differ in their status and/or biological nature as it impinges on the
application of association mapping.
Table 11.1

| Characteristic | Crop group: Forage/Agronomy | Crop group: Forestry | Crop group: Perennial Horticulture |
| --- | --- | --- | --- |
| Economic impact | Several major impact species | Several major impact species | Large number of moderate impact species |
| Breeding systems | Both in-breeding and out-breeding | Largely out-breeding | Largely out-breeding |
| Generation intervals | Annual and perennial, months | Perennial, years | Perennial, years |
| Maintenance and testing costs | Varied, commonly smaller plants and lower costs | Large trees, expensive | Shrubs to large trees, moderate to expensive |
| Ploidy | Diploid and polyploidy | Mostly diploid | Polyploidy common |
| Genome size | Small to large | Moderate to large | Small to moderate |
The most valuable group of horticultural crops, on a gross margin per hectare basis,
is the fruit crops. From a breeder’s point of view, fruit crops differ from most agronomic
or forest crops because of a peculiar combination of features including high
heterozygosity, asexual propagation, their perennial nature, and the perishability of their
products. At the same time these attributes make them attractive candidates for marker-
assisted and/or gene-assisted selection. Most fruit crops maintain high levels of
heterozygosity in individuals and an allelic richness in their primary germplasm pools.
There are, however, some important exceptions to this generalization (such as peach
within the Rosaceae). In nature, diversity is maintained by various mechanisms that
actively promote out-crossing. Such a high degree of diversity might be disadvantageous
if horticulture relied on the sexual cycle to generate the individual plants that yield
product. However, fruit crops were amongst the first plants for which techniques of asexual
propagation were discovered and utilized. In most cases, production can rely on asexual
propagation of individuals, which enables the fruit breeder to exploit all the genetic effects,
additive and non-additive, as they are expressed in the phenotypes of superior individuals.
These crops are mostly perennial with many featuring large plant size, long productive
period, an extended juvenile phase for seedlings, and a marketable product that cannot be
assessed until a seedling is physiologically mature. Added complications derive from
multiple biotic and abiotic factors that can affect both quality and quantity during both
preharvest and postharvest periods.
although there is some evidence to suggest that recombination may be unusually low in
this region in some species at least (Wang et al. 2003). Since good candidates for the
pollen determinant of self-incompatibility have recently been identified near the S-RNase
(pistil determinant) self-incompatibility locus in a number of different crops (Lai et al.
2002; Ushijima et al. 2004), it should now be possible to look for repression of
recombination in the region between or around these genes. Given that the degree of LD
in low recombination regions might be expected to be unusually high compared with the
rest of the genome, the overriding effect of the sex locus will need to be taken into
account when it comes to analysing LD.
predominantly polyploid (e.g. many of the Prunus, Fragaria, and Actinidia species) or
can be polyploid (e.g. Malus species). Fortunately, genomes in the horticultural crops
are often of moderate size and can sometimes be very small. For example, the small
haploid genome size of diploid Fragaria species is comparable to that of model species such as
Arabidopsis (Antonius and Ahokas 1996; Sargent et al. 2004). There is evidence that
several of the species with larger genome sizes (such as the Maloideae) are probably
cryptic polyploids – making their effective haploid genome size in terms of gene content
smaller than the nuclear DNA content data indicate. However, it is likely that neither
polyploidy nor cryptic polyploidy will simplify any of the association mapping strategies,
as both types of polyploids may well have taken advantage of genome duplication events
to generate extra levels of specialization/adaptation of genes at homologous loci. For
any genome scanning based approaches (see below) the absolute genome size will need
to be taken into account (together with the structure of LD in that genome) when deciding
on the number of markers that will be needed to give adequate coverage of the genome.
In the true polyploid, in particular, the inability to readily identify linkage phase and thus
chromosome haplotypes may well hamper subsequent analysis. It may also result in
much more complex inheritance behaviour. The impact of phenomena such as polysomic
inheritance and double reduction on association mapping is largely uncharted territory.
Indeed the novel effects of polyploidy are only now beginning to be addressed in
pedigree based linkage analysis (Luo et al. 2004) and fingerprinting studies (De Silva
et al. 2005).
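The dependence of marker number on genome size and on the structure of LD can be illustrated with a rough calculation. The Python sketch below simply divides genome size by an assumed average distance over which useful LD persists. The function name, the even-spacing assumption and the LD extents in the loop are illustrative assumptions rather than values from this chapter; the 262 Mbp genome size is the Prunus figure from Table 11.2.

```python
import math


def markers_needed(genome_size_mbp, ld_extent_kb, markers_per_window=1):
    """Rough number of evenly spaced markers for a genome scan.

    Hypothetical simplification: LD is assumed to persist uniformly over
    ld_extent_kb kilobases, so each window of that size needs
    markers_per_window markers. Real LD varies along the genome.
    """
    if genome_size_mbp <= 0 or ld_extent_kb <= 0:
        raise ValueError("genome size and LD extent must be positive")
    genome_kb = genome_size_mbp * 1000.0
    windows = math.ceil(genome_kb / ld_extent_kb)
    return windows * markers_per_window


if __name__ == "__main__":
    # Assumed LD extents (kb); these are placeholders, not measured values.
    for ld_kb in (1, 10, 100, 1000):
        n = markers_needed(262, ld_kb)
        print(f"LD extent {ld_kb:>5} kb -> about {n:,} markers")
```

Under these assumptions, the number of markers scales inversely with the extent of LD, which is why base-line LD data are needed before a scan is designed. In a true polyploid the calculation would be further complicated by the unresolved linkage phase discussed above.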
The feasibility of developing a sufficient number of markers to match the genome
coverage required will depend on the resources that are already available in the species of
interest (see Table 11.2). The existence of a large number of EST sequences, e.g. in crops
such as grape and apple, offers a route by which large numbers of single nucleotide
polymorphisms (SNPs) can be identified – particularly if the data are derived from a
number of genotypes and/or from out-breeding species (as are most of these crops).
Existing genetic maps with a number of genome anchoring markers such as
microsatellites and/or RFLPs will enable researchers to integrate the genome position of
these SNPs efficiently with any association data generated – thus enabling a rapid
integration of LD mapping and more traditional genetic mapping approaches. The
markers and map positions can also be used to assess the level of disequilibrium amongst
alleles of unlinked markers and linked markers. In this way, a picture of the background
level of disequilibrium across the genome that is not associated with genetic linkage can
be developed. These markers can also be used to assess population structure, which can
have a major influence on LD. For the most significant horticultural perennials, namely
banana, apple, citrus, grape, and peach, many of these genetic and
genomics resources have already been developed, but there are many other less valuable crops where most of these
basic resources are still missing.
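The comparison of disequilibrium between linked and unlinked marker pairs rests on standard pairwise LD statistics. The following Python sketch computes D, |D'| and r² for two biallelic markers. It assumes that haplotype phase is known, which will often not be the case in highly heterozygous out-breeders, and the function name and the haplotype coding ("A"/"a" at the first locus, "B"/"b" at the second) are hypothetical conventions chosen for illustration.

```python
from collections import Counter


def pairwise_ld(haplotypes):
    """Return (D, abs(D'), r^2) for two biallelic loci.

    haplotypes: iterable of two-character strings "AB", "Ab", "aB" or "ab".
    Phase is assumed known (an assumption of this sketch).
    """
    counts = Counter(haplotypes)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_ab = counts["AB"] / n                      # frequency of haplotype AB
    p_a = (counts["AB"] + counts["Ab"]) / n      # frequency of allele A
    p_b = (counts["AB"] + counts["aB"]) / n      # frequency of allele B
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, d * d / denom


if __name__ == "__main__":
    sample = ["AB"] * 40 + ["Ab"] * 10 + ["aB"] * 10 + ["ab"] * 40
    print(pairwise_ld(sample))
```

Applied to marker pairs known from existing maps to be unlinked, the distribution of r² gives one possible picture of background disequilibrium, against which values for linked pairs can be judged.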
Table 11.2. Genomics status of resources for a selection of valuable perennial horticultural species

| Family/Genus | Crops | Genome size (Mbp)*1 | Nucleotide sequences*2 | Genomic resources | Genetic maps and other resources |
| --- | --- | --- | --- | --- | --- |
| Rosaceae/Prunus | Peach, Apricot, Cherry, Plum, Nectarine | 262 | 43,975 | BAC libraries, partially completed genome*3 | Multiple genetic maps |
| Rosaceae/Fragaria | Strawberry | 98–164 | 7,385*4 | Limited BACs | Micro-arrays, several genetic maps |
| Rosaceae/Pyrus | Pear, Nashi pear | 496 | 664*4 | | Several genetic maps and some markers |
| Rosaceae/Rubus | Raspberry, Blackberry, Boysenberry, Loganberry | 294 | 237 | BAC library | Some molecular markers and maps |
| Vitaceae/Vitis | Grape (table and wine) | 417 | 207,428 | Extensive EST libraries, BAC libraries | Some genetic maps |
| Rutaceae/Citrus | Orange, Lemon, Lime, Tangerine, Grapefruit | 368 | 130,991 | BAC libraries | Some genetic maps |
| Lauraceae/Persea | Avocado | 907 | 8,859 | | Some molecular markers |
| Ericaceae/Vaccinium | Blueberry, Cranberry, Huckleberry | 588–3,528 | 4,832*4 | | Some molecular markers |
| Bromeliaceae/Ananas | Pineapple | 539 | 5,703 | | Some molecular markers and maps |
| Musaceae/Musa | Banana, Plantain | 613 | 3,199*4 | BAC libraries, BAC-end sequence | Limited genetic map information |
| Grossulariaceae/Ribes | Black & Red currants, Gooseberry | 534 | 2,540 | | Limited genetic map information |
| Caricaceae/Carica | Papaya | 368 | 1,657 | BAC library | Limited genetic map information |
| Actinidiaceae/Actinidia | Kiwifruit | 760 | 444*4 | Partial OVERGO based mapping, BACs | Microarray, some genetic maps & markers |
| Others*5 | | 441–1,911 | 1,313 | - | - |

*1 Genome data are mostly derived from the Kew database of genome sizes (http://www.rbgkew.org.uk/cval/homepage.html) Release 3.0, December 2004. Usually data from the most valuable member of the genus is given, except in the case of strawberry where the value is from a diploid species and a few cases where the closest relative that has data from the same genus has been used to provide an approximate value.
*2 Nucleotide sequences from the genus lodged in Genbank as at September 2005.
*3 Development of a physical map of the genome is in progress.
*4 Significant numbers of ESTs known to exist outside the public domain.
*5 Since limited information on sequence status or genetic mapping technology is available for the genera Cydonia (quince), Eriobotrya (loquat), Asimina (pawpaw), Annona (custard apple/cherimoya), Mangifera (mango), Olea (olive), Feijoa (feijoas), or Ficus (figs), their totals have been combined under “Others”.
It is likely that association mapping will have the most immediate and largest
impact on the tier of crops with the greatest economic value. We would therefore expect
association mapping strategies to be applied first in banana, grape, citrus fruit, apple,
pineapple, and stonefruit. There are probably ways in which this impact can be spread
across lower value crops: some technology transfer strategies could benefit
these crops during the interim period when full scale association mapping
technology remains beyond their reach (see later). There are likely to be
three main separate (but partially linked) approaches involving association mapping that
can be used to benefit these crops. The approach at one end of the spectrum concentrates
on the improved delivery of markers that can be used for marker aided selection (MAS).
At the other end of the spectrum is the use of whole genome scans in order to identify the
allele(s) of the gene(s) responsible for a particular phenotype of interest. In between these
approaches lies a candidate gene-based approach. We discuss these three approaches in
detail below.
the breeding cycle by two or more generations. However, most studies on marker-assisted
introgression in fruit crops are still in their early stages and it is difficult to judge the
efficacy of this approach.
How MAS for complex traits based on QTL mapping will benefit fruit crops will
depend on a number of factors including the genetic architecture of traits in question, the
accuracy of the phenotyping technique, the type and size of populations used, the density
and functionality of the genetic markers and the repeatability of results. The prospects are
not promising, according to the initial prognosis of Luby and Shaw (2001), which is summarized below.
The mostly out-breeding nature of these crops, the selection schemes employed by breeders and other life
history characteristics would seem to negate or complicate maintenance of
any marker–gene associations detected in their diverse gene pools. Hence, there will
always be the need to establish and evaluate marker associations for each cross or each
recombination cycle. The reason is simple: in the diverse gene pool of an out-crossing
fruit crop there is less chance that a particular allele will be linked to a particular
phenotype of interest in the germplasm at large. In contrast, marker–gene associations
established in self-pollinated crops, where most initial MAS studies were carried out,
would be expected to break down very slowly because of their mostly in-breeding
behaviour and intense selection. In fruit crops, particularly in the case of simultaneous
selection for multiple polygenic traits, marker-QTL linkages will be more uncertain
because of varying degrees of repulsion/coupling linkage phases (particularly where
genetic correlations between traits are low and probably negative) making LD in specific
crosses more difficult to evaluate. Under these circumstances, it is to be expected that the
same loci may not have the main influence on inheritance of the same traits in different
parents. Also, the same marker alleles may not be segregating in different crosses as
progenies can only inherit alleles from their parents. Therefore, it will be necessary to
conduct separate marker-QTL linkage analysis for each cross or population for which
MAS is used. In the case of markers that are developed in the light of sequence
information (e.g. SNPs, SSRs) this may not always be so serious since the marker may be
adaptable to detect other alleles. However, dominant markers with little sequence
information such as AFLPs and RAPDs would be severely affected by these limitations.
There may also be a large number of loci controlling a quantitative trait, and these may be
in different repulsion phase linkage arrangements in the two parents. In such cases it is
likely that a high number of recombinants will be recovered in the F1 populations and this
would lead to poor resolution in QTL mapping. This poor resolution means that the
methodology identifies large DNA segments with potentially hundreds of candidate
genes. It can then take several more years to produce the populations for fine scale
mapping to narrow the distance between marker and gene. This increases the cost of field
testing and phenotyping further, particularly if the trees need to be bearing fruit to
measure the critical phenotype(s). Identification of markers by association mapping
strategies may well be less prone to some of the above problems. Theoretically, the power
of detection of loci underlying quantitative traits may be improved by employing association mapping.
On the other hand, problems such as population structure may offset some of the
gains in power of detection. These gains may also come at the expense of having to generate a
much greater number of genotypes – but then the high throughput capabilities required
for this are rapidly becoming a reality (see Chapters 3–5). It is also important to be
cautious about equating all reported associations with success, as the real strength of
published evidence may be highly variable (Ball 2005; and see Chapter 8). The resolution
of some maps can definitely be improved by association mapping strategies – as these
strategies naturally encompass a far greater number of meioses than the segregating
family approach. This is a particular advantage for large trees as discussed above. An
additional advantage is that the populations utilized can be designed to target different
levels of detailed mapping – thus permitting an incremental approach to improving the
closeness of the association between marker and trait.
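The contrast drawn above between out-breeding fruit crops and self-pollinated crops can be illustrated with the textbook expectation for the decay of LD, D_t = D_0(1 − c)^t, where c is the recombination fraction and t the number of generations. The Python sketch below adds a commonly used approximation for partial selfing, in which the effective recombination fraction is c(1 − F) with F = s/(2 − s) for selfing rate s. This approximation, the function name and the parameter values are assumptions introduced here for illustration; they are not taken from Luby and Shaw (2001), and selection and drift are ignored.

```python
def ld_after_generations(d0, recomb_fraction, generations, selfing_rate=0.0):
    """Expected LD coefficient after a number of generations.

    Uses D_t = D_0 * (1 - c_eff) ** t, where c_eff = c * (1 - F) and
    F = s / (2 - s) is the equilibrium inbreeding coefficient under
    selfing rate s (an approximation assumed for this sketch).
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing rate must lie in [0, 1]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    inbreeding = selfing_rate / (2.0 - selfing_rate)
    c_eff = recomb_fraction * (1.0 - inbreeding)
    return d0 * (1.0 - c_eff) ** generations


if __name__ == "__main__":
    # Hypothetical marker-gene pair with 5% recombination, followed for 20 generations.
    for label, s in (("out-crossing", 0.0), ("95% selfing", 0.95)):
        d = ld_after_generations(0.25, 0.05, 20, selfing_rate=s)
        print(f"{label:>12}: D after 20 generations = {d:.4f}")
```

With these illustrative values the association is retained far longer under predominant selfing than under out-crossing, which is consistent with the expectation that associations detected in out-breeding fruit crops need to be re-established for each cross or recombination cycle.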
investment in generating the sequence information upon which the subsequent large scale
marker analysis would have to be based, it is likely that the genome scanning strategy
will initially be limited to the more valuable tier of crops, namely banana, grape, apple,
citrus, and stonefruit.
perhaps “peculiar” to a horticultural crop, will require the development of model systems
that are more closely related to the crop and share that particular important characteristic.
Sometimes these traits are shared across several unrelated species – such as the growing
importance of dwarfing rootstocks for many horticultural crops. At other times even
closely related crops may have important differences that may limit the application of
discoveries. For example, within the Rosaceae several different fruit types (achenes,
drupes, pomes) can be recognized, and their evolutionary relationship is unclear (Morgan
et al. 1994) suggesting that there could be limited applicability of at least some important
fruit characteristics across these species. In fact, there is now also evidence supported by
DNA based phylogeny that suggests different fruit types have repeatedly evolved in
distinct lineages (Knapp 2002). Another major difference within the Rosaceae, for example,
is that the family includes both climacteric fruits (such as apple) and
non-climacteric fruits (such as strawberry). Even within the genus Pyrus, there are both
climacteric (European pear – P. communis) and non-climacteric (Asian pear – P.
pyrifolia) species. The influence of such important differences on the success of direct
routes of technology transfer is as yet largely unknown. Another method of technology
transfer (already covered above) is the application of candidate gene approaches that
make use of data from other species.
In this section we will consider whether there are ways in which the adoption of association
mapping based technologies can be accelerated. These will need to include
both scientific and collaborative strategies. Perhaps the best way to ensure adoption is to
develop strategies for integrating research across the often fragmented research sectors in
these crops at the levels of both research discipline and nation states. Integration of a
number of disciplines is required, from breeding through to molecular biology, physiology,
pathology, chemistry, bioinformatics, and biostatistics, to name just a few. The
breadth of disciplines that potentially could play a role highlights the difficulty of achieving
such integration. International collaboration is made more difficult by the fact that the
scientists concerned often focus on research for competing industries, and have different
research priorities.
The potential benefit of better integration/collaboration far outweighs the difficulties
and this is being recognized in some sectors already. Within the Rosaceae, for example, a
group of international researchers has met several times to attempt to integrate the various
species into a multi-species “model system”, where it is planned that the strength of
the different species, in terms of their rate of progress in different areas, will be used to
maximize the benefit for all species in the family. This type of structure may be particularly
appropriate for crop groupings which, in their own right, do not cross the value
threshold required to attract the level of investment that would enable the full power of
association mapping technologies (such as whole genome scanning) to be applied. However,
as a group these species may well make a much more attractive prospect for investment.
Smaller scale integration may be possible when it comes to solving particular
problems that affect several different crops and that may be able to rely on common
underlying mechanisms in these crops. In these cases a model system for the problem could
be a focus of collaboration building on the particular strengths of that crop that are peculiar
to the problem at hand.
A major effort is also needed to make the different scientific disciplines both
accessible and understandable to all researchers. This can help to counteract the growing
degree of specialisation of researchers as they are required to focus on narrower fields of
interest in order to cope with the information explosion. A major role in this respect is
likely to be played by database systems, and their bioinformatic
interfaces, which display and summarize information. These systems will also allow the data
to be shared almost instantaneously between groups that may well be on opposite sides of
the globe, and can thus reduce unnecessary duplication of labour
intensive steps such as genome annotation. The ultimate measure of success of
association mapping strategies, however, will be their integration with breeding practices.
A major strategy should be to develop ways of incorporating association mapping-based
design principles, particularly into the more traditionally based breeding programs.
Fruit breeders are interested in new techniques that can improve their genetic
efficiency in selection and reduce the risk of failing to identify superior individuals. Such
new techniques must, however, have a certain level of cost–benefit advantage over the
older techniques, if they are to be widely adopted. Given the highly variable nature of
horticultural crop species it is likely that multiple guidelines will need to be developed for
these crops. It is not practical to attempt to do this here, as this will require a high degree
of specialized expertise for each crop concerned. There will, however, probably be some common
themes, on which we elaborate below. There are several different strategies for
applying association mapping that require different levels of resource commitment, and
we present these below in order from lowest to highest. Different levels will be
appropriate for different crops at different times, and we suggest that each crop builds
gradually to a situation where the full power of association mapping can be applied.
In most horticultural perennials we know nothing about the level of LD in the crops
concerned and relatively little about population structure. In this situation we can at best
develop hypotheses by extending what we know about the biology of these crops. An
obvious prerequisite for applying LD based analysis in a crop, and one that requires only a modest
investment of resources, is to generate base-line data on population structure and the
extent of LD across the genome. LD should also be analysed in populations consisting of
various levels of known and/or deduced inter-relatedness of plants. If it is anticipated that
a modified form of MAS using association mapping is likely to be the only affordable
route of incorporation into breeding strategies, then a certain approach and experimental
design is required. In this instance the focus could be on utilizing the LD that exists in
breeding populations and among commercial varieties to accelerate selection approaches.
In that case moderate density marker scans (1,000–5,000 markers) may well give sufficient power
first to detect, and then to follow, many of the useful associations in subsequent deliberate
crosses. These would only require moderate throughput capabilities and could
conceivably even utilize a mixture of different marker types including existing
informative markers such as microsatellites and perhaps even RFLPs in the case of
candidate genes. The interpretation of these results will need some caution as different
marker systems can deliver different levels of LD (see Section 2.5.3.1). These systems
could be used to initiate low density genome scans for LD amongst markers in relatively
narrow populations, perhaps even with existing breeding populations. We would expect
the maximum extent of LD to be present in such populations given that they have had
little time to approach equilibrium. Since these are the populations where we might first
want to utilize the technology to aid breeding processes, this would also seem a
reasonable place to start. If we can learn something about LD from pilot studies with
existing markers and populations, this can then inform our decision making when it
comes to developing the marker technology required for high density genome scans and
the platforms to score these markers (see Chapter 5).
The next level of commitment of resources involves using the candidate gene
approach. This can be used to “enrich” for markers that are likely to be linked to a
particular phenotype and therefore requires the researcher and breeder to prioritize which
phenotypes to screen for in the first instance. This approach is partly limited by the set of
phenotypes for which candidate gene approaches are feasible (i.e. based on information
correlating particular gene sequences or gene families with particular traits in other plant
systems). Good candidates for this type of approach are genes such as resistance gene
candidates – which have already been successfully used to simplify the identification of
resistance genes by mapping in segregating families (Paal et al. 2004). There are also
large gene families likely to be involved in controlling a large number of different traits
(e.g. transcription factors and protein kinases) which might be utilized in a
“blind” approach that attempts to correlate alleles of particular genes with phenotype. This
type of “blind” association method could then be the forerunner of a comprehensive
application of whole genome scans – involving the most significant level of resource
commitment.
As mentioned above there are some common themes that could be integrated into a
guideline for incorporating association mapping strategies into more traditionally focused
breeding programs. One place to start is with the germplasm that is maintained and
exploited for breeding gains. The strategy adopted for incorporating gains into new
varieties will have an important influence on the nature of the germplasm that can be
exploited by association mapping. One important consideration is whether introducing
genes by artificial gene transfer techniques is likely to be a viable addition to more
traditional selection based strategies for cultivar improvement. Given that gene transfer
methods are much less susceptible to linkage drag (in theory there would be no linkage
drag at all if single genes are transferred) – the introgression of useful characters from a
much wider germplasm base could be anticipated with these methods. This is particularly
the case for the long-lived out-breeders which are common amongst this group of plants.
In outcrossing plants classical introgression cannot be performed and pseudo-backcrosses
have to substitute. If an artificial gene transfer strategy is adopted, a much wider
germplasm base might be utilized for association mapping than if such a strategy is
deemed not yet viable. The wider germplasm base would also lend itself better to
identifying genes by “landing” on the gene since much smaller regions would be
expected to be in LD with phenotype across a wider germplasm base. While the
investment required for such a strategy would be considerable for each crop, major
components such as DNA sequencing are rapidly decreasing in cost and the potential for
incorporating highly novel characteristics that could provide a very valuable point of
cultivar differentiation in the market is likely to offset these costs in the medium to long-
term.
The above strategies are not mutually exclusive. In fact the MAS strategy will
probably be the first adopted in most crops and it can then be gradually extended to
encompass the candidate gene and eventually the full genome scanning strategy. In some
cases a crop may be able to go directly to a candidate gene strategy, depending on the
nature of the biological question being tackled.
How can LD mapping address some of the major issues in fruit breeding? The
likelihood of success from a cost–benefit point of view will depend on the traits targeted,
their mechanism of inheritance (simple, oligogenic, or polygenic), gene action/effect, and
mating system (self-pollinated, out-crossing, asexual). These factors will then need to
be weighed against any potential cost reductions and/or price advantage that might result
from the technology. In the context of cost reduction, the LD mapping approach may not
initially need large numbers of segregating populations for mapping quantitative traits, so
the costs associated with making crosses and maintaining large tree populations in the
orchard may be reduced. However, these advantages could initially be offset by the
substantial upfront cost associated with establishing large databases of gene/marker
sequence information and developing high throughput genotyping assays to facilitate
scans for marker trait associations. In the case of crops like apple, citrus and peach where
some of the sequence resources required to initiate the development of such a system are
already available, these costs will be less (but still substantial).
Some of the cost of running assays may be mitigated through DNA pooling by
combining phenotype extremes. This approach has recently been used successfully in
humans (Butcher et al. 2005; Sham et al. 2002; Zeng and Lin 2005) but would only be
feasible for finding associations with the particular predetermined phenotype used to devise
the pools. If the breeding approach adopted includes following traits for several
successive generations, it is likely that many of the marker–trait associations
established would hold across a number of generations, so costs could be spread
accordingly. Another cost driver is technological advance. Recent advances in DNA
technologies have made large-scale EST sequencing efforts, and even a significant
number of whole genome sequencing projects, viable when these were considered
virtually unaffordable just ten years ago. These advances are likely to continue to reduce
the cost of developing the required sequence databases and individual genotype assays
further.
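The pooling idea mentioned above can be sketched as a comparison of allele frequencies estimated in two pools built from the phenotype extremes. The example below is a minimal illustration only: it applies an ordinary two-proportion z-test, treats each diploid individual as contributing two chromosomes, and ignores the extra measurement error that pooled assays introduce, so it would overstate significance in practice. The function name and all input values are hypothetical.

```python
import math


def pooled_allele_test(freq_high, n_high, freq_low, n_low):
    """Two-proportion z-test on allele frequencies from two DNA pools.

    freq_high, freq_low: estimated frequency of one allele in each pool.
    n_high, n_low: number of diploid individuals contributing to each pool.
    Returns (z, two-sided p-value). Pool measurement error is ignored.
    """
    c_high, c_low = 2 * n_high, 2 * n_low          # chromosomes per pool
    p_bar = (freq_high * c_high + freq_low * c_low) / (c_high + c_low)
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / c_high + 1 / c_low))
    if se == 0:
        return 0.0, 1.0                            # allele fixed or absent in both pools
    z = (freq_high - freq_low) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail
    return z, p_value


if __name__ == "__main__":
    # Hypothetical pools of 100 trees each from the two tails of a trait distribution.
    print(pooled_allele_test(0.6, 100, 0.4, 100))
```

Because the pools are defined by one phenotype, a scan of this kind would have to be repeated with new pools for every additional trait of interest.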
Marker techniques could also be targeted at traits which are more expensive to
measure by other approaches (labour intensive field assessment, physiological,
biochemical, physical, chemical, or consumer preference measurements). In the
traditional breeding process many of these assessments would tend to be carried out at the
second stage to reduce their costs. Association methods might allow for their
incorporation into the first stage culling process, thus making the selection process much
more efficient and allowing breeders to increase the population size from which the
superior individuals are selected. As for other molecular marker technologies it may be
possible to screen with markers at an earlier stage of plant growth (prefruiting seedlings)
as well and this could save considerable amounts of orchard space and costs.
The initial financial hurdles for developing genome scanning capability for any
particular crop are considerable. However, once the capability has been developed and the
initial cost hurdles cleared, markers are likely to be cheaper, particularly when combining
the multiple traits required to meet breeding objectives. Since it is a relatively new technology, an unknown
factor is to what extent this technology will be able to reduce the cost of other
measurements and assessments without compromising the outcome. The strategy of
application will need to be thought out carefully for each crop. Given that the biology of
some horticultural crops may differ considerably from that of other organisms in which
association mapping has already been applied, it would seem prudent to test the success
of some of these strategies first, before applying them on a large scale.
Accurate cost comparisons between association mapping and “competing”
technologies are difficult, as they need to take into account a complex series of
interdependent costs including maintenance of plant material, phenotyping, data
collection, population sizes required, and genotyping of individuals. Amos and Page
(2001) developed methods for comparing the costs of detecting genetic factors by linkage
and association mapping. Because they were developed for comparing mapping
approaches in the human system they did not take into account factors peculiar to plant
breeding such as the extra costs of germplasm maintenance. They determined that the
cost effectiveness of LD methods was greater for traits with lower single-locus
heritability, whereas family based linkage analysis appeared to be more cost effective for
traits with high single-locus heritability.
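To make the interdependence of these cost components concrete, the sketch below adds them up for a single mapping design. It is a simple additive bookkeeping model written for illustration, not the method of Amos and Page (2001); the class name, the field names and every figure in the example are hypothetical, and a real comparison would also need the population size each design requires to reach a given power.

```python
from dataclasses import dataclass


@dataclass
class MappingCosts:
    """Hypothetical additive cost model for one mapping design."""
    population_size: int
    maintenance_per_plant: float       # germplasm or orchard upkeep
    phenotyping_per_plant: float
    data_collection_per_plant: float
    markers_per_plant: int
    cost_per_marker_assay: float
    upfront_development: float = 0.0   # e.g. sequence databases, assay design

    def total(self) -> float:
        per_plant = (
            self.maintenance_per_plant
            + self.phenotyping_per_plant
            + self.data_collection_per_plant
            + self.markers_per_plant * self.cost_per_marker_assay
        )
        return self.upfront_development + self.population_size * per_plant


if __name__ == "__main__":
    # Illustrative numbers only: an association design with a large upfront cost
    # versus a family-based linkage design with costlier plant maintenance.
    association = MappingCosts(500, 2.0, 5.0, 1.0, 200, 0.1, 50000.0)
    linkage = MappingCosts(300, 20.0, 5.0, 1.0, 100, 0.5)
    print(association.total(), linkage.total())
```

Varying the population size and the upfront development figure shows how quickly the ranking of two designs can change, which is one reason such comparisons are hard to make in general terms.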
In terms of generating a price advantage it is likely that the application of these
refined techniques would significantly improve the chances of delivering superior
genotypes ahead of other breeding programs, assuming they are not using similar
techniques. Conversely any programs not using these more efficient techniques would
run the risk of falling behind their competitors.
included utilizing distinct (but closely related) species (e.g. crossing European and Asian
Pears) which is not an uncommon strategy in plants? Wild sister species are often used as
a strategy for introgressing new pathogen and insect resistance factors into their
domesticated relatives. These types of questions will continue to offer fertile ground for
research with both a practical and intellectual interest.
11.7 ACKNOWLEDGMENTS
E.H.A.R., N.C.O., and S.E.G. were funded by the Horticulture and Food Research Institute
of New Zealand Ltd (HortResearch). We thank reviewers for their comments and the
staff of the HortResearch Publication Unit for their editorial assistance. Genome size data
reproduced with the permission of the Trustees of the Royal Botanic Gardens, Kew
http://www.rbgkew.org.uk/cval/homepage.html, Bennett MD, Leitch IJ. 2005. Plant
DNA C-values database (release 4.0, Oct. 2005).
11.8 REFERENCES
Amos, C.I., Page, G., 2001, Cost of linkage versus association methods. In: Rao D.C., Province M.A. (eds)
Genetic Dissection of Complex Traits, pp. 213–221. Academic Press, San Diego, CA, USA.
Antonius, K., Ahokas, H., 1996, Flow cytometric determination of polyploidy level in spontaneous clones of
strawberries. Hereditas 124: 285.
Ashman, T.L., 1999, Quantitative genetics of floral traits in a gynodioecious wild strawberry Fragaria
virginiana: implications for the independent evolution of female and hermaphrodite floral phenotypes.
Heredity 83: 733–741.
Ball, R.D., 2005, Experimental designs for reliable detection of linkage disequilibrium in unstructured random
population association studies. Genetics 170: 859–875.
Barot, S., Gignoux, J., 2004, How do sessile dioecious species cope with their males? Theoretical Population
Biology 66: 163–173.
Bennetzen, J.L., Freeling, M., 1993, Grasses as a single genetic system: genome composition, collinearity and
compatibility. Trends in Genetics 9: 259–261.
Bus, V., Brooking, L., Davis, L., Norling, C., Ranatunga, C., Gardiner, S., 2001, Accelerated breeding for
apple. In: Halligan, L. (ed) The New Zealand Controlled Environment Laboratory (NZCEL) Workshop
Proceedings. Use of Controlled Environments in Containment Research, June 2001, pp. 25–27.
Butcher, L.M., Meaburn, E., Knight, J., Sham, P.C., Schalkwyk, L.C., Craig, I.W., Plomin, R., 2005, SNPs,
microarrays and pooled DNA: identification of four loci associated with mild mental impairment in a
sample of 6000 children. Human Molecular Genetics 14: 1315–1325.
Charlesworth, D., Guttman, D.S., 1999, The evolution of dioecy and plant sex chromosome systems. In:
Ainsworth C.C. (ed) Sex Determination in Plants, pp. 25–49. BIOS Scientific Publishers Ltd, Oxford,
UK.
Charlesworth, D., Vekemans, X., Castric, V., Glemin, S., 2005, Plant self-incompatibility systems: a molecular
evolutionary perspective. New Phytologist 168: 61–69.
Dalbo, M.A., Ye, G.N., Weeden, N.F., Steinkellner, H., Sefc, K.M., Reisch, B.I., 2000, A gene controlling sex
in grapevines placed on a molecular marker-based genetic map. Genome 43: 333–340.
De Silva, H.N., Hall, A.J., Rikkerink, E., McNeilage, M.A., Fraser, L., 2005, Estimation of allele frequencies in
polyploids under certain patterns of inheritance. Heredity [Epub ahead of print].
Dellaporta, S.L., Calderon-Urrea, A., 1993, Sex determination in flowering plants. Plant Cell 5: 1241–1251.
Dirlewanger, E., Graziano, E., Joobeur, T., Garriga-Caldere, F., Cosson, P., Howad, W., Arus, P., 2004,
Comparative mapping and marker-assisted selection in Rosaceae fruit crops. Proceedings of the National
Academy of Sciences of the United States of America 101: 9891–9896.
Dobzhansky, T., 1972, The genetics of the evolutionary process. Columbia University Press, New York, NY,
USA.
Gardiner, S.E., Bus, V.G.M., Rusholme, R.L., Chagné, D., Rikkerink, E.H.A., 2005, Apple. In: Kole C (ed) The
Genomes: A Series on Genome Mapping, Molecular Breeding & Genomics. Science Publishers, Inc.,
Enfield, NH, USA, Plymouth, UK.
Guttman, D.S., Charlesworth, D., 1998, An X-linked gene has a degenerate Y-linked homologue in the
dioecious plant Silene latifolia. Nature 393: 263–266.
Harvey, C.F., Gill, G.P., Fraser, L.G., McNeilage, M.A., 1997, Sex determination in Actinidia. 1. Sex-linked
markers and progeny sex ratio in diploid A. chinensis. Sexual Plant Reproduction 10: 149–154.
Hospital, F., Chevalet, C., Mulsant, P., 1992, Using markers in gene introgression breeding programs. Genetics
132: 1199–1210.
Knapp, S., 2002, Tobacco to tomatoes: a phylogenetic perspective on fruit diversity in the Solanaceae. Journal
of Experimental Botany 53: 2001–2022.
Lai, Z., Ma, W., Han, B., Liang, L., Zhang, Y., Hong, G., Xue, Y., 2002, An F-box gene linked to the self-
incompatibility (S) locus of Antirrhinum is expressed specifically in pollen and tapetum. Plant Molecular
Biology 50: 29–42.
Lee, J.M., Sonnhammer, E.L.L., 2003, Genomic gene clustering analysis of pathways in eukaryotes. Genome
Research 13: 875–882.
Liu, Z., Moore, P.H., Ma, H., Ackerman, C.M., Ragiba, M., Yu, Q., Pearl, H.M., Kim, M.S., Charlton, J.W.,
Stiles, J.I., Zee, F.T., Paterson, A.H., Ming, R., 2004, A primitive Y chromosome in papaya marks
incipient sex chromosome evolution. Nature 427: 348–352.
Luby, J.J., Shaw, D.V., 2001, Does marker-assisted selection make dollars and sense in a fruit breeding
programme? HortScience 36: 872–879.
Luo, Z.W., Zhang, R.M., Kearsey, M.J., 2004, Theoretical basis for genetic linkage analysis in autotetraploid
species. Proceedings of the National Academy of Sciences of the United States of America 101: 7040–7045.
Mnejja, M., Arus, P., 2006, Microsatellite transportability across Rosaceae crops. In: 3rd International Rosaceae
Genomics conference, p. 57, War Memorial Conference Center, Napier New Zealand, 19–22 March
2006.
Morgan, D.R., Soltis, D.E., Robertson, K.R., 1994, Systematic and Evolutionary implications of rbcL sequence
variation in Rosaceae. American Journal of Botany 81: 890–903.
Paal, J., Henselewski, H., Muth, J., Meksem, K., Menendez, C.M., Salamini, F., Ballvora, A., Gebhardt, C.,
2004, Molecular cloning of the potato Gro1-4 gene conferring resistance to pathotype Ro1 of the root
cyst nematode Globodera rostochiensis, based on a candidate gene approach. Plant Journal 38: 285–297.
Perovic, D., Stein, N., Zhang, H., Drescher, A., Prasad, M., Kota, R., Kopahnke, D., Graner, A., 2004, An
integrated approach for comparative mapping in rice and barley with special reference to the Rph16
resistance locus. Functional & Integrative Genomics 4: 74–83.
Sargent, D.J., Davis, T.M., Tobutt, K.R., Wilkinson, M.J., Battey, N.H., Simpson, D.W., 2004, A genetic
linkage map of microsatellite, gene-specific and morphological markers in diploid Fragaria. Theoretical
and Applied Genetics 109: 1385–1391.
Sham, P., Bader, J.S., Craig, I., O’Donovan, M., Owen, M., 2002, DNA Pooling: a tool for large-scale
association studies. Nature Reviews Genetics 3: 862–871.
Sykes, B., 2003, Adam’s Curse – a future without men. Bantam, London.
Tanksley, S.D., Nelson, J.C., 1996, Advanced backcross QTL analysis: a method for the simultaneous
discovery and transfer of valuable QTLs from unadapted germplasm into elite breeding lines. Theoretical
and Applied Genetics 92: 191–203.
Thumma, B.R., Nolan, M.F., Evans, R., Moran, G.F., 2005, Polymorphisms in Cinnamoyl CoA Reductase
(CCR) are associated with Variation in Microfibril Angle in Eucalyptus spp. Genetics:
genetics.105.042028.
Ushijima, K., Yamane, H., Watari, A., Kakehi, E., Ikeda, K., Hauck, N.R., Iezzoni, A.F., Tao, R., 2004, The S
haplotype-specific F-box protein gene, SFB, is defective in self-compatible haplotypes of Prunus avium
and P. mume. Plant Journal 39: 573–586.
Wang, Y., Wang, X., McCubbin, A.G., Kao, T.H., 2003, Genetic mapping and molecular characterization of the
self-incompatibility (S) locus in Petunia inflata. Plant Molecular Biology 53: 565–580.
Weiblen, G.D., Yu, D.W., West, S.A., 2001, Pollination and parasitism in functionally dioecious figs.
Proceedings of Biological Science 22: 651–659.
Wright, S.I., Lauga, B., Charlesworth, D., 2003, Subdivision and haplotype structure in natural populations of
Arabidopsis lyrata. Molecular Ecology 12: 1247–1263.
Yamamoto, T., Terakami, S., Nishitani, C., Kimura, T., Sawamura, Y., Hirabashi, T., Hayashi, T., 2006,
Genome mapping in pear. In: 3rd International Rosaceae Genomics Conference, p. 55, War Memorial
Conference Center, Napier New Zealand, 19–22 March 2006.
Zeng, D., Lin, D.Y., 2005, Estimating haplotype-disease associations with pooled genotype data. Genetic
Epidemiology 28: 70–82.
Color Plates
Figure 2.5. Pairwise |D′| for 45 SNPs within a linked region (figure from GENESTAT,
http://www.meb.ki.se/genestat/, courtesy of the Swedish National Biobanking program, Wallenberg consortium north).
Figure 2.6. A simplistic diagram showing the major difference between gene conversion and crossover.
(A) Two DNA molecules. (B) Gene conversion after mismatch correction – the red DNA donates part of its genetic
information (e–e' region) to the blue DNA. (C) DNA crossover – the two DNAs exchange part of their genetic
information (f–f ' and F–F').
Figure 4.1. Nonsequencing SNP discovery methods: heteroduplex analysis, TILLING, DGGE, and SSCP.
[Figure: The Invader assay. Panels: target DNA (allele T) and target DNA (allele G). Labels: Invader probe, allele-specific probe, flap, secondary probe, FRET probe, quencher, cleavage / no cleavage, fluorescence emission.]
[Figure 5.3 diagram: oligonucleotides bound to a microarray plate. Labels: hybridization, no hybridization.]
Figure 5.3. Allele-specific oligonucleotide hybridization. An oligonucleotide feature with the SNP site in its
central position is bound to a microarray glass plate. Under stringent hybridization conditions, the
complementary allele will anneal to the fixed oligonucleotide and a fluorescent signal attached to the probe will
be detected.
Figure 5.5. Minisequencing or primer extension. An oligonucleotide primer immediately flanking the SNP is
extended using a DNA polymerase. Fluorescently labeled terminating nucleotides are incorporated, with a
different dye color for every nucleotide. The oligonucleotide can be attached to a solid-phase array, separated in
a capillary electrophoresis system, by a flow cytometry instrument, by mass spectrometry, or revealed by a
fluorescent plate reader.
An example of a genealogy for three copies of a short chromosomal segment. Tracing the
segmental lineages back in time, the following events occur: 1, the “green” lineage undergoes
recombination and splits into two lineages, which are then traced separately; 2, one of
the resulting green lineages coalesces with the “magenta” lineage, creating a segment, part of
which is ancestral to both green and magenta, part of which is ancestral to magenta only; 3, the
“blue” lineage coalesces with the lineage created by event 2, creating a segment that is partially
ancestral to blue and magenta, partially ancestral to all three colours; 4, the “other” part of the
green lineage coalesces with the lineage created by event 3, creating a segment that is ancestral
to all three colours in its entirety. The recombination event induces different genealogical trees
on either side of the break: these are shown in the inserted figure.
Reprinted from Trends in Genetics 18, Nordborg, M. and Tavaré, S., Linkage disequilibrium: what history has to tell us,
Pages No.83–90, Copyright (2002), with permission from Elsevier.
Figure 8.1. Example genealogy illustrating the coalescent (Nordborg and Tavaré 2002).