J. Anim. Breed. Genet.
ISSN 0931-2668
ORIGINAL ARTICLE
A note on ENDOG: a computer program for analysing pedigree information
J.P. Gutierrez1 & F. Goyache2
1 Departamento de Produccion Animal, Facultad de Veterinaria, Avda. Puerta de Hierro s/n, Madrid, Spain 2 SERIDA-Somio, C/Camino de los Claveles, Gijon (Asturias), Spain
Correspondence Juan Pablo Gutierrez, Departamento de Produccion Animal, Facultad de Veterinaria, Avda. Puerta de Hierro s/n, E-28040-Madrid, Spain. Tel/Fax: +34 913943767; E-mail: [email protected] Received: 11 August 2004; accepted: 22 November 2004
Summary The aim of this note is to describe the program ENDOG (v.3.0). The program uses pedigree information to carry out several demographic and genetic analyses, including: (a) the individual inbreeding and average relatedness coefficients; (b) effective population size; (c) parameters characterizing the concentration of the origin of both genes and individuals, such as the effective number of founders and ancestors and the effective number of founder herds; (d) F statistics and paired genetic distances for each subpopulation under study; (e) descriptors of the genetic importance of the herds in a population and (f) generation intervals. The program will help breeders and researchers to monitor changes in genetic variability and population structure at little cost in dataset preparation. The program, user's guide and example file can be downloaded free of charge from the World Wide Web at http://www.ucm.es/info/prodanim/Endog30.zip.
Introduction The assessment of within-population genetic variability has received increasing attention in recent years (Woolliams et al. 2002). For both selection and conservation, some simple demographic parameters have a large impact on the evolution of genetic variability and depend largely on the management of the population (Goyache et al. 2003; Gutierrez et al. 2003; Honda et al. 2004). Moreover, breeders and researchers may want to ascertain the extent to which an inappropriate mating policy leads to structuring of the populations under study (Caballero & Toro 2002). Some computer routines are available to test the evolution of the genetic variability of populations using pedigree information (Boichard 2002). However, little effort has been devoted to pedigree analysis software. ENDOG (current version 3.0) is a population genetics computer program that conducts several demographic and genetic analyses on pedigree information in a user-friendly environment. ENDOG derives from a suite of FORTRAN 77 routines that were widely distributed and used among Spanish groups (Gutierrez et al. 2003). ENDOG is written in Visual Basic™ and runs under Windows 95/98/2000/NT/XP. A setup menu guides users through installation. The program, user's guide and example file can be downloaded free of charge from the World Wide Web at http://www.ucm.es/info/prodanim/Endog30.zip.
Methods The primary functions of ENDOG are the computation of the individual inbreeding (F) (Wright 1931) and average relatedness (AR) (Goyache et al. 2003; Gutierrez et al. 2003) coefficients. F is defined as the probability that an individual has two alleles identical by descent, and is computed following Meuwissen & Luo (1992). The increase in inbreeding ($\Delta F$) is calculated for each generation by
J. Anim. Breed. Genet. 122 (2005) 172–176 © 2005 Blackwell Verlag, Berlin
ENDOG: a program for monitoring genetic variability
means of the classical formula $\Delta F = (F_t - F_{t-1})/(1 - F_{t-1})$, where $F_t$ is the average inbreeding at the $t$th generation. Using $\Delta F$, ENDOG computes the effective population size ($N_e$) as $N_e = 1/(2\Delta F)$ for each generation having $F_t > F_{t-1}$, to roughly characterize the effect of remote and close inbreeding. $N_e$ is defined as the number of breeding animals that would lead to the actual increase in inbreeding if they contributed equally to the next generation. However it is computed, $N_e$ fits real populations poorly when they are small and have shallow pedigrees, and then overestimates the actual effective population size (Goyache et al. 2003). To better characterize this, ENDOG gives three additional values of $N_e$ by computing the regression coefficient ($b$) of the individual inbreeding coefficient on: (i) the number of full generations traced; (ii) the maximum number of generations traced and (iii) the equivalent complete generations (Maignel et al. 1996), and taking the corresponding regression coefficient as the increase in inbreeding between two generations ($F_t - F_{t-1} = b$), so that (assuming $1 - F_{t-1} \approx 1$) $N_e = 1/(2b)$. When the available information is scarce, these estimates can be useful to approximate the upper (using i), lower (using ii) and real (using iii) limits of $N_e$ in the analysed population.
The average relatedness coefficient (AR) of each individual is defined as the probability that an allele randomly chosen from the whole population in the pedigree belongs to a given animal. AR can then be interpreted as the representation of the animal in the whole pedigree. The algorithm used to compute AR is described in Table 1. As shown in Table 1, the AR coefficients can be obtained at the same time as the F coefficients by writing only one additional line of code, without substantially increasing the computational cost.
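The two effective-size formulas given earlier in this section can be illustrated with a minimal Python sketch. It is not ENDOG's code (ENDOG is a Visual Basic program); the function names and the input layout (a list of generation means, and paired lists of per-animal generation values and inbreeding coefficients) are assumptions made only for this example.

```python
def ne_per_generation(mean_f):
    """Effective size per generation from generation-mean inbreeding.

    mean_f[t] is the average F of generation t (hypothetical input layout).
    Ne is returned only for generations where F_t > F_{t-1}.
    """
    ne = {}
    for t in range(1, len(mean_f)):
        prev, cur = mean_f[t - 1], mean_f[t]
        if cur > prev:
            # Delta F = (F_t - F_{t-1}) / (1 - F_{t-1})
            delta_f = (cur - prev) / (1.0 - prev)
            ne[t] = 1.0 / (2.0 * delta_f)
    return ne


def ne_from_regression(generations, inbreeding):
    """Ne = 1/(2b), where b is the least-squares slope of individual F on a
    per-animal generation measure (full, maximum or equivalent generations).

    Returns None when the slope is undefined or not positive.
    """
    n = len(generations)
    mean_x = sum(generations) / n
    mean_y = sum(inbreeding) / n
    sxx = sum((x - mean_x) ** 2 for x in generations)
    if sxx == 0:
        return None  # all animals have the same generation value
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(generations, inbreeding))
    b = sxy / sxx
    return 1.0 / (2.0 * b) if b > 0 else None


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real pedigree.
    print(ne_per_generation([0.0, 0.0, 0.0125, 0.025]))
    print(ne_from_regression([1, 2, 3, 4], [0.01, 0.02, 0.03, 0.04]))
```

Passing the number of full generations, the maximum number of generations or the equivalent complete generations as the first argument of `ne_from_regression` would give the three regression-based values of $N_e$ described above.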
Colleau (2002) recently presented an algorithm useful, among other things, for obtaining the average relationship coefficients between each member of a group and the whole group (including self-relationships) and the average pairwise relationship coefficients. The algorithm implemented in ENDOG is equivalent to that of Colleau (2002) when the whole population is considered as a single group. The advantages of using AR are: (a) the computational cost of calculating AR coefficients is similar to that of computing the numerator relationship matrix, because both procedures use common algorithms; (b) the AR of a founder indicates the extent, as a percentage, to which this founder can be considered the origin of the population; (c) AR coefficients can also
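Table 1 Description of the algorithm used in ENDOG to compute the individual average relatedness (AR) coefficients

Let a vector $\mathbf{c}'$ be defined as:
$$\mathbf{c}' = (1/n)\,\mathbf{1}'\mathbf{A} \qquad (1)$$
$\mathbf{A}$ being the numerator relationship matrix of size $n \times n$. On the other hand, the numerator relationship matrix can be obtained from the $\mathbf{P}$ matrix, where $p_{ij} = 1$ if $j$ is a parent of $i$ and 0 otherwise, which sets the parents of the animals (Quaas 1976), by:
$$\mathbf{A} = \left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}\right)^{-1} \mathbf{D} \left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}'\right)^{-1} \qquad (2)$$
where $\mathbf{D}$ is a diagonal matrix with non-zero elements obtained by:
$$d_{ii} = 1 - \tfrac{1}{4}a_{jj} - \tfrac{1}{4}a_{kk}$$
$j$ and $k$ being the parents of the individual $i$. From (2),
$$\mathbf{A}\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}'\right) = \left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}\right)^{-1}\mathbf{D}$$
Premultiplying by $(1/n)\,\mathbf{1}'$:
$$(1/n)\,\mathbf{1}'\mathbf{A}\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}'\right) = (1/n)\,\mathbf{1}'\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}\right)^{-1}\mathbf{D}$$
and using (1):
$$\mathbf{c}'\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}'\right) = (1/n)\,\mathbf{1}'\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}\right)^{-1}\mathbf{D}$$
Multiplying $\mathbf{c}'$ into the parenthesis and isolating $\mathbf{c}'$:
$$\mathbf{c}' = (1/n)\,\mathbf{1}'\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}\right)^{-1}\mathbf{D} + \tfrac{1}{2}\mathbf{c}'\mathbf{P}'$$
As the computation of both $\mathbf{A}$ and the AR coefficients involves the term $\left(\mathbf{I} - \tfrac{1}{2}\mathbf{P}\right)^{-1}\mathbf{D}$, it is possible to obtain the AR coefficients at the same time as the F coefficients by writing only one additional line of code, without substantially increasing the computational cost.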
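The definition in equation (1) of Table 1 can be checked on a toy pedigree with the short Python sketch below. It builds the full numerator relationship matrix with the standard tabular method and takes column means, which is the direct reading of $\mathbf{c}' = (1/n)\,\mathbf{1}'\mathbf{A}$; it is not the single-pass shortcut of Table 1 and not ENDOG's own code. The pedigree layout (a list of sire and dam index pairs, parents listed before offspring, `None` for an unknown parent) is an assumption made for the example.

```python
def relationship_matrix(pedigree):
    """Numerator relationship matrix A by the tabular method.

    pedigree[i] = (sire, dam) as 0-based indices or None if unknown;
    animals must be ordered so that parents precede their offspring.
    """
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    for i, (sire, dam) in enumerate(pedigree):
        for j in range(i):
            # Relationship with an older animal j is the average of the
            # relationships of j with the known parents of i.
            value = 0.0
            if sire is not None:
                value += 0.5 * a[j][sire]
            if dam is not None:
                value += 0.5 * a[j][dam]
            a[i][j] = a[j][i] = value
        # Diagonal: 1 + F_i, with F_i half the relationship between parents.
        if sire is not None and dam is not None:
            a[i][i] = 1.0 + 0.5 * a[sire][dam]
        else:
            a[i][i] = 1.0
    return a


def inbreeding_and_ar(pedigree):
    """Return (F, AR): F_i = a_ii - 1 and AR_i = mean of column i of A."""
    a = relationship_matrix(pedigree)
    n = len(a)
    f = [a[i][i] - 1.0 for i in range(n)]
    ar = [sum(a[j][i] for j in range(n)) / n for i in range(n)]
    return f, ar


if __name__ == "__main__":
    # Two founders (0, 1), two full sibs (2, 3) and their offspring (4).
    toy = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]
    f, ar = inbreeding_and_ar(toy)
    print("F :", f)
    print("AR:", ar)
```

Storing the whole matrix costs memory proportional to $n^2$, so this form is only practical for small pedigrees; the point of the derivation in Table 1 is precisely to avoid it.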
be used as a measure of inbreeding of the whole population, as they take into account both the inbreeding and the coancestry coefficients; (d) AR can be used as an index to maintain the initial genetic stock, by selecting as breeding animals those with the lowest AR values and (e) AR, as an alternative or complement to F, can be used to predict the long-term inbreeding of a population because it takes into account the percentage of the complete pedigree originating from a given founder at population level. In addition, AR can be used to compute the effective size of the founder population as the inverse of the sum of the squared AR coefficients across founder animals.
When computing the F and AR coefficients, ENDOG also computes, for each animal in the pedigree data, the number of full generations traced, the maximum number of generations traced and the equivalent complete generations. The first is defined as the furthest generation in which all the ancestors are known. Ancestors with no known parent are considered founders (generation 0). The second is the number of
generations separating the individual from its furthest ancestor. The equivalent complete generations are computed as the sum over all known ancestors of the terms $(1/2)^n$, where $n$ is the number of generations separating the individual from each known ancestor (Maignel et al. 1996).
Using ENDOG it is possible to assess the concentration of the origin of both animals and genes by computing the following parameters: (a) effective number of founders ($f_e$); (b) effective number of ancestors ($f_a$) (Boichard et al. 1997) and (c) effective number of founder herds ($f_h$). The first is defined as the number of equally contributing founders that would be expected to produce the same genetic diversity as in the population under study. It is computed as:
$$f_e = 1 \Big/ \sum_{k=1}^{f} q_k^2$$
where $q_k$ is the AR coefficient of founder $k$. Parameter $f_e$, as computed by ENDOG, is equivalent to that computed following Lacy (1989) if the reference population used is the whole pedigree. Parameter $f_a$ is the minimum number of ancestors, not necessarily founders, explaining the complete genetic diversity of a population. It complements the information offered by the effective number of founders by accounting for the losses of genetic variability caused by the unbalanced use of reproductive individuals, which produces bottlenecks. It is computed in a similar way to the effective number of founders:
$$f_a = 1 \Big/ \sum_{j=1}^{a} q_j^2$$
where $q_j$ is the marginal contribution of ancestor $j$; in other words, the genetic contribution made by an ancestor that is not explained by other ancestors chosen before. The last two parameters are initially computed by ENDOG using as reference population all the individuals in the pedigree with both parents known. However, they can be recomputed after choosing a particular reference population. The effective number of herds is simply computed as the inverse of the sum of the squared herd totals, each total being the sum of the contributions of the Boichard et al. (1997) ancestors in that herd.
ENDOG can be used to infer population structure from pedigree information. It can compute Nei's minimum distance (Nei 1987) and F statistics (Wright 1978) for each predefined subpopulation (i.e. according to sex, areas, herds, etc.). Wright's F statistics are computed following Caballero & Toro (2000, 2002). These authors formalized the pedigree tools necessary for the analysis of genetic differentiation in subdivided populations, starting from the average pairwise coancestry coefficient ($f_{ij}$) between individuals of two subpopulations, $i$ and $j$, of a given metapopulation, including all $N_i \times N_j$ pairs. For a given subpopulation $i$, the average coancestry, the average self-coancestry of the $N_i$ individuals and the average coefficient of inbreeding are, respectively, $f_{ii}$, $s_i$ and $F_i$ (so that $F_i = 2s_i - 1$). The average distance between individuals of subpopulations $i$ and $j$ would be $D_{ij} = [(s_i + s_j)/2] - f_{ij}$. From these parameters and the corresponding means for the entire metapopulation, Caballero & Toro (2000, 2002) obtained the genetic distance between subpopulations $i$ and $j$ (Nei's minimum distance; Nei 1987) as $D_{ij} = [(f_{ii} + f_{jj})/2] - f_{ij}$, and its average over the entire metapopulation as
$$\bar{D} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} D_{ij} N_i N_j}{N_T^2}$$
(where $N_i$, $N_j$ and $N_T$ are, respectively, the sizes of subpopulations $i$ and $j$ and the total population size), which are equations (3) and (4) of Caballero & Toro (2002). Finally, Wright's (1978) F statistics are obtained as
$$F_{IS} = \frac{\tilde{F} - \tilde{f}}{1 - \tilde{f}}; \qquad F_{ST} = \frac{\tilde{f} - \bar{f}}{1 - \bar{f}} = \frac{\bar{D}}{1 - \bar{f}}; \qquad \text{and} \qquad F_{IT} = \frac{\tilde{F} - \bar{f}}{1 - \bar{f}},$$
so that $(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$, where $\tilde{f}$ and $\tilde{F}$ are, respectively, the within-subpopulation coancestry and the inbreeding coefficient averaged over the entire metapopulation, and $\bar{f}$ is the average coancestry over all pairs of subpopulations in the metapopulation, so that $\bar{D} = \tilde{f} - \bar{f}$ [see equations (3) and (6) in Caballero & Toro 2002].
At herd level, besides the effective number of herds, ENDOG computes the genetic importance of the herds in a population as the contribution of the herds with reproductive males to the population (Vassallo et al. 1986). Using this methodology the herds are classified as: (i) nucleus herds, if breeders use only their own males, never purchase males but sell them; (ii) multiplier herds, if breeders use purchased males and also sell males and (iii) commercial herds, if they use purchased males and never sell males. Additionally, ENDOG computes the inverse of the probability that two animals taken at random in the population have their parent in the
same herd, for each path, to obtain the effective number of herds supplying fathers (HS), grandfathers (HSS) and great-grandfathers (HSSS) (Robertson 1953). Finally, at population level, ENDOG computes both the generation intervals, defined as the average age of parents at the birth of their progeny kept for reproduction, and the average age of parents at the birth of their offspring (used for reproduction or not). Both parameters are computed for the four pathways (father–son, father–daughter, mother–son and mother–daughter).
Input and output files ENDOG has been designed to minimize the preparation of input files. It accepts xls files (Microsoft Excel™ worksheets) or dbf files. The dbf format can be used for datasets exceeding the Excel row limit (65,536). Columns (or fields) need not be in any given order and no strict naming of the columns is required. At the beginning of a session ENDOG asks for the file containing the input data and, if it is an .xls file, for the worksheet containing the pedigree. The program then asks whether records are renumbered and ordered sequentially (from 1 to n, older animals having lower numbers) and, later, for the columns (or fields) providing the identification of the individuals, of the fathers and of the mothers, and the sex of the individuals. Numbering and ordering individuals is advisable, especially if birth dates are not completely known, but individuals can in fact be identified in any way (using numbers, characters or both).
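The renumbering mentioned above (from 1 to n, older animals first) can be prepared outside ENDOG. The following Python sketch shows one way to do it, assuming birth dates are available in dd/mm/yyyy format; the record layout (dictionaries with the keys `id`, `sire`, `dam`, `sex` and `birth`) and the use of 0 for an unknown parent are assumptions made for this example, not ENDOG requirements.

```python
from datetime import datetime


def renumber_by_birth_date(records):
    """Sort records by birth date and renumber individuals from 1 to n.

    records: list of dicts with hypothetical keys 'id', 'sire', 'dam',
    'sex' and 'birth' (dd/mm/yyyy). Unknown parents are given as None.
    Returns new records whose parent fields use the new numbers, with 0
    standing for an unknown parent (a convention chosen for this sketch).
    """
    ordered = sorted(
        records,
        key=lambda r: datetime.strptime(r["birth"], "%d/%m/%Y"),
    )
    new_id = {r["id"]: k for k, r in enumerate(ordered, start=1)}
    return [
        {
            "id": new_id[r["id"]],
            "sire": new_id.get(r["sire"], 0),
            "dam": new_id.get(r["dam"], 0),
            "sex": r["sex"],
            "birth": r["birth"],
        }
        for r in ordered
    ]


if __name__ == "__main__":
    # Illustrative records only.
    animals = [
        {"id": "C", "sire": "A", "dam": "B", "sex": 1, "birth": "01/03/2002"},
        {"id": "A", "sire": None, "dam": None, "sex": 1, "birth": "15/05/1998"},
        {"id": "B", "sire": None, "dam": None, "sex": 2, "birth": "02/01/1999"},
    ]
    for row in renumber_by_birth_date(animals):
        print(row)
```

Because parents are always born before their offspring, sorting by birth date guarantees that every parent receives a lower number than its progeny, which is the ordering described in the text.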
In any case, the identifications used for individuals must be consistent with those used for parents. If records are not renumbered and sequentially ordered, ENDOG will ask for the column (or field) containing the individuals' birth dates in order to sort the data. Dates must be in dd/mm/yyyy format. Sex must be coded as 1 for males and 2 for females. Apart from these restrictions, the input file can have any other columns (or fields) in any format (character, date, numerical or other). These columns may provide any other information: alternative identifications of the individuals, the herd or population to which the individuals belong, etc. Including a column with the birth date of the animals in the input file is highly recommended because this information is needed for some procedures. Users interested in computing parameter $f_a$ using a particular reference population must include in the input file a column (or field) in which the animals forming the reference population are identified with a 1.
Most results of ENDOG are written to a Microsoft Access file named Gener.mdb to facilitate further use. Results of each analysis are written to the corresponding table within Gener.mdb. However, users may also want the summary results that ENDOG shows on screen after performing some analyses. These summary results are written to corresponding txt files, with delimited fields so that they can be edited using any spreadsheet software. The names of the Access tables and txt files containing the results of the computations are usually self-explanatory and are described in Table 2.

Table 2 Description of the result files obtained using ENDOG

Procedure | Description
Initial check | List of errors found in the pedigree
Default computations | Computes F, AR and number of generations for each individual in the dataset
Generations submenu | Mean values of F, AR and Ne for each generation traced
Founders submenu | Individual and average information on ancestors explaining genetic variability and effective number of founder herds
Intervals submenu | Average generation intervals and reproductive ages for each path parent–son
Fstats submenu | Paired Average, Nei and Fst distance values for each defined subpopulation, and coancestry matrix
Herds structure submenu | Information on the genetic importance of each herd in the population, summary of this information and Robertson (1953) statistics
Coancestry submenu | Coancestry values of a key individual with all the individuals of the other sex in the dataset

ACCESS table: Midef, PorG, PorC, Ancestro, RebaFund, GenInterv, AverDist, DistNei, Fis_Fsts, HerStr, StrHerd, Roberts, Parent
txt results file: Error.txt, Populat.txt, Founders.txt, Ancestor.txt, Coancest.txt, MatFst.txt
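Before loading a file, the input conventions stated in this section (sex coded as 1 or 2, dates in dd/mm/yyyy format, parent identifications consistent with individual identifications) can be verified with a few lines of Python. The checks performed by ENDOG's own initial check are not detailed in this note, so the sketch below covers only those stated conventions; the record layout and the function name are assumptions made for the example.

```python
from datetime import datetime


def check_pedigree(records):
    """Return a list of (id, message) tuples for records that break the
    input conventions described in the text.

    records: list of dicts with hypothetical keys 'id', 'sire', 'dam',
    'sex' and 'birth'; unknown parents and birth dates are None.
    """
    problems = []
    ids = {r["id"] for r in records}
    sex_of = {r["id"]: r["sex"] for r in records}

    for r in records:
        # Sex must be 1 (male) or 2 (female).
        if r["sex"] not in (1, 2):
            problems.append((r["id"], "sex must be coded 1 or 2"))

        # Parents should appear as individuals, with the matching sex code.
        for role, expected_sex in (("sire", 1), ("dam", 2)):
            parent = r.get(role)
            if parent is None:
                continue
            if parent not in ids:
                problems.append((r["id"], f"{role} {parent} has no record"))
            elif sex_of[parent] != expected_sex:
                problems.append((r["id"], f"{role} {parent} has the wrong sex code"))

        # Birth dates, when given, must parse as dd/mm/yyyy.
        birth = r.get("birth")
        if birth is not None:
            try:
                datetime.strptime(birth, "%d/%m/%Y")
            except ValueError:
                problems.append((r["id"], "birth date is not dd/mm/yyyy"))
    return problems


if __name__ == "__main__":
    # Illustrative records only.
    sample = [
        {"id": "A", "sire": None, "dam": None, "sex": 1, "birth": "15/05/1998"},
        {"id": "B", "sire": None, "dam": None, "sex": 2, "birth": "1999-01-02"},
        {"id": "C", "sire": "B", "dam": "Z", "sex": 3, "birth": "01/03/2002"},
    ]
    for animal, message in check_pedigree(sample):
        print(animal, message)
```

Whether a parent without its own record should count as an error depends on how the dataset treats founders; the sketch simply flags such cases for review.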
Conclusions The program ENDOG will help breeders and researchers to monitor changes in the genetic variability and structure of populations with limited cost of preparing datasets. Although written primarily as a population monitoring package, ENDOG offers a number of features that may be of interest to teachers and students seeking an in-depth understanding of important statistical concepts and procedures for population genetic analysis. Although the example file provided with the program contains a very small population, ENDOG can handle very large data files, and successful computation of the parameters is limited basically by the characteristics of the computer. ENDOG has recently been used to analyse the 75 389 records included in the studbook of the Andalusian horse (Valera et al. 2005). The CPU time needed to obtain the complete set of computations on a PC (1.8 GHz processor, 512 MB RAM) was <5 min.
Acknowledgements This paper was partially funded by two grants from INIA, no. RZ02-020 and no. RZ03-011. The authors would like to thank Luis Jose Royo, Isabel Alvarez and especially Ivan Fernandez for their kind support and help. ENDOG has been tested on several datasets and results were checked for consistency with alternative software when possible. The authors would appreciate being informed of any bug detected.
References
Boichard D. (2002) Pedig: a Fortran package for pedigree analysis suited for large populations. In: Y. van der Honing (ed), Proceedings of the 7th World Cong. Genet. Appl. to Livest. Prod., Montpellier, 19–23 August 2002. INRA, Castanet-Tolosan, France, CD-Rom, comm. No. 28-13.
Boichard D., Maignel L., Verrier E. (1997) The value of using probabilities of gene origin to measure genetic variability in a population. Genet. Sel. Evol., 29, 5–23.
Caballero A., Toro M.A. (2000) Interrelations between effective population size and other pedigree tools for the management of conserved populations. Genet. Res. Camb., 75, 331–343.
Caballero A., Toro M.A. (2002) Analysis of genetic diversity for the management of conserved subdivided populations. Conserv. Gen., 3, 289–299.
Colleau J.J. (2002) An indirect approach to the extensive calculation of relationship coefficients. Genet. Sel. Evol., 34, 409–421.
Goyache F., Gutierrez J.P., Fernandez I., Gomez E., Alvarez I., Diez J., Royo L.J. (2003) Using pedigree information to monitor genetic variability of endangered populations: the Xalda sheep breed of Asturias as an example. J. Anim. Breed. Genet., 120, 95–103.
Gutierrez J.P., Altarriba J., Diaz C., Quintanilla A.R., Canon J., Piedrafita J. (2003) Genetic analysis of eight Spanish beef cattle breeds. Genet. Sel. Evol., 35, 43–64.
Honda T., Nomura T., Yamaguchi Y., Mukai F. (2004) Monitoring of genetic diversity in the Japanese Black cattle population by the use of pedigree information. J. Anim. Breed. Genet., 121, 242–252.
Lacy R.C. (1989) Analysis of founder representation in pedigrees: founder equivalents and founder genome equivalents. Zoo Biol., 8, 111–123.
Maignel L., Boichard D., Verrier E. (1996) Genetic variability of French dairy breeds estimated from pedigree information. Interbull Bull., 14, 49–54.
Meuwissen T.I., Luo Z. (1992) Computing inbreeding coefficients in large populations. Genet. Sel. Evol., 24, 305–313.
Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press, New York, pp. 512.
Quaas R.L. (1976) Computing the diagonal elements of a large numerator relationship matrix. Biometrics, 32, 949–953.
Robertson A. (1953) A numerical description of breed structure. J. Agric. Sci., 43, 334–336.
Valera M., Molina A., Gutierrez J.P., Gomez J., Goyache F. (2005) Pedigree analysis in Andalusian horse: population structure, genetic variability and influence of the Carthusian strain. Livest. Prod. Sci., in press.
Vassallo J.M., Diaz C., Garcia-Medina J.R. (1986) A note on the population structure of the Avilena breed of cattle in Spain. Livest. Prod. Sci., 15, 285–288.
Woolliams J.A., Pong-Wong R., Villanueva B. (2002) Strategic optimisation of short and long term gain and inbreeding in MAS and non-MAS schemes. In: Proceedings of the 7th World Cong. Genet. Appl. to Livest. Prod., Montpellier, 19–23 August 2002. INRA, Castanet-Tolosan, France, CD-Rom, comm. No. 23_02.
Wright S. (1931) Evolution in mendelian populations. Genetics, 16, 97–159.
Wright S. (1978) Evolution and the Genetics of Populations: Vol. 4. Variability within and among Natural Populations. University of Chicago Press, Chicago, IL, USA.