Slides
2010
A sketch
We provide an overview of the genetic analysis of complex traits in humans in the context of large volumes of genetic data. While there are many analytical issues, our focus is on the practical side, with specific examples of genetic association studies. We are not limited to R, and provide examples using other systems whenever appropriate. We hope that this will serve as a forum for a range of issues and a contact point for future research. Questions are welcome during the sessions.
Contents
The presentation consists of five parts:
I. Overview
II. Analytic tools and association testing
III. Miscellaneous topics
IV. OpenMx and NCBI2R
V. Conclusion
You may find materials from the useR!2008 and useR!2009 tutorials relevant. Both are available from my personal home page.
Monographs
Morton NE, Rao DC, Lalouel JM. Methods in Genetic Epidemiology. Karger, 1983
Khoury MJ, Beaty TH, Cohen BH. Fundamentals of Genetic Epidemiology. Oxford University Press, 1993
Falconer DS, Mackay TFC. Introduction to Quantitative Genetics, 4e. Longman, 1996
Hartl D, Clark AG. Principles of Population Genetics, 3e. Sinauer Associates, Inc., 1997
Lange K. Mathematical and Statistical Methods for Genetic Analysis, 2e. Springer, 2002
Sorensen D, Gianola D. Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer, 2002
Thomas DC. Statistical Methods in Genetic Epidemiology. Oxford University Press, 2004
Armitage P, Colton T (Eds). Encyclopedia of Biostatistics, 2e. Wiley, 2005
Elston RC, Johnson WD. Basic Biostatistics for Geneticists and Epidemiologists: A Practical Approach. Wiley, 2005
Ahrens W, Pigeot I (Eds). Handbook of Epidemiology. Springer, 2005
Balding DJ, Bishop M, Cannings C (Eds). Handbook of Statistical Genetics, 3e. Wiley, 2007
Siegmund D, Yakir B. The Statistics of Gene Mapping. Springer, 2007
Wu R, Ma C-X, Casella G. Statistical Genetics of Quantitative Traits: Linkage, Maps and QTL. Springer, 2007
Speicher MR, Antonarakis SE, Motulsky AG. Vogel and Motulsky's Human Genetics: Problems and Approaches, 4e. Springer, 2010
Lin S, Zhao H (Eds). Handbook on Analyzing Human Genetic Data: Computational Approaches and Software. Springer, 2010
I Overview
Terminology
Genes, chromosomes, markers
Alleles, genotypes, haplotypes
Phenotypes, mode of inheritance, penetrance
Mendelian laws of inheritance, Hardy-Weinberg equilibrium, linkage disequilibrium
Association tests for single or multiple SNPs
Population stratification
Multiple testing
Gene-environment interaction (GEI)
Topics
Organization
Genetic epidemiology
Genetic epidemiology is the study of the role of genetic factors in determining health and disease in families and in populations, and the interplay of such genetic factors with environmental factors; alternatively, a science which deals with the aetiology, distribution, and control of diseases in groups of relatives and with inherited causes of disease in populations (http://en.wikipedia.org). It customarily includes the study of familial aggregation, segregation, linkage and association, and is closely associated with the development of statistical methods for human genetics that deal with these four questions. The last two questions can only be answered if appropriate genetic markers are available (Elston & Spence. Stat Med 2006;25:3049-80).
Linkage studies
Linkage analysis is the study of cosegregation between genetic markers and putative disease loci. It has been very successful in localizing rare Mendelian disorders, but has difficulty with traits that do not strictly follow a Mendelian mode of inheritance or that show considerable linkage heterogeneity, and it has limited resolution. It typically involves parametric (model-based) and nonparametric (model-free) methods; the latter most commonly refers to allele-sharing methods. The underlying concepts are nevertheless very important, and linkage can still be useful in providing candidates for fine-mapping and association studies. With the availability of whole-genome data, it is possible to infer relationship or correlation between any individuals in a population.
Association studies
Association studies focus on the association between a particular allele and a trait, and are only feasible with the availability of dense markers. They have traditionally been applied both to relatives in families and to population samples. For the latter there has been serious concern over spurious association due to differences in allele frequencies between hidden subpopulations in a sample. A range of considerations has been made (Balding. Nat Rev Genet 2006;7:781-91), but the availability of whole-genome data refreshes our understanding and perspectives.
GWAS
A GWAS is any study of genetic variation across the entire human genome designed to identify genetic association with observable traits or the presence or absence of a disease, usually referring to studies with a genetic marker density of 100,000 or more to represent a large proportion of variation in the human genome (Pearson & Manolio. JAMA 2008;299:1335-44); more simply, it looks for associations between DNA sequence variants and phenotypes of interest (Donnelly. Nature 2008; 456:728-31). It is associated with the common disease-common variant (CD-CV) hypothesis: common polymorphisms (MAF>1%) might contribute to susceptibility to common diseases, so that GWAS of common variants might be used to map loci contributing to common diseases. It therefore requires a catalog of millions of common variants in the human population, genotyping of these variants in large numbers of individuals, and an appropriate analytical framework (Altshuler et al. Science 2008; 322:881-888).
The landscape
A catalog of published GWASs is maintained by the Office of Population Genomics at the National Human Genome Research Institute (NHGRI) and is available from http://www.genome.gov/GWAStudies. As of 3/2010, there were 779 published genome-wide associations at p < 5x10^-8 for 148 traits. As of 06/2010, the table included 587 publications. For instance, for body mass index it includes the major publications from GWASs with at least 100,000 SNPs, namely Thorleifsson et al. Nat Genet 2009;41:18-24; Willer et al. Nat Genet 2009;41:25-34; Loos et al. Nat Genet 2008;40:768-75; Fox et al. BMC Med Genet 2007;8:S18; Frayling et al. Science 2007;316:889-94. Furthermore, there were Benzinou et al. Nat Genet 2008;40:943-5 and Meyre et al. Nat Genet 2009;41:157-9.
[Figure: Published Genome-Wide Associations through 3/2010; 779 published GWA at p < 5x10^-8 for 148 traits. Source: NHGRI GWA Catalog]
Context
The population under study must be characterized to allow the selection of patients likely to share a genetic cause of disease.
Thousands of cases and controls may be needed if a study is to have sufficient statistical power to identify the alleles of interest.
It creates bioinformatics challenges and raises questions about how to identify true positive signals.
A conceptual picture based on a test of H0: μ = 0 vs H1: μ = μ1 > 0 from a normal distribution
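The conceptual picture can be made concrete with a small sketch in base R. It assumes a one-sided z test with known variance; the helper name power_z, the effect size 0.05 and the sample sizes are illustrative assumptions, not values from the slides.

```r
# Power of a one-sided z test of H0: mu = 0 vs H1: mu = mu1 > 0,
# assuming n independent N(mu, sigma^2) observations with known sigma.
# power_z is a hypothetical helper name, not part of any package.
power_z <- function(mu1, n, alpha = 0.05, sigma = 1) {
  z_alpha <- qnorm(1 - alpha)          # critical value under H0
  ncp <- mu1 * sqrt(n) / sigma         # mean of the z statistic under H1
  pnorm(ncp - z_alpha)                 # P(Z > z_alpha | H1)
}

# Power rises with sample size and falls with a stricter alpha
n <- c(1000, 5000, 10000, 20000)
pw <- sapply(c(1e-5, 1e-6, 5e-7), function(a) power_z(0.05, n, alpha = a))
colnames(pw) <- c("1e-5", "1e-6", "5e-7")
rownames(pw) <- n
print(round(pw, 3))
```

Under H0 (mu1 = 0) the function returns alpha itself, which is a quick sanity check on the construction.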
Study designs
Three common genetic association designs involving unrelated individuals (left), nuclear families with affected singletons (middle) and affected sib-pairs (right). Males and females are denoted by squares and circles; affected individuals are filled in black and unaffected individuals are left empty (Risch & Merikangas. Science 1996;273:1516-7; Zhao. J Stat Soft 2007;23(8):1-18).
[Figure: Power by α = 1e-5, 1e-6, 5e-7. Panels for each α combined with values 0.001 to 0.005 of a second parameter; x-axis sample size (10000 to 20000), y-axis power (0.0 to 1.0)]
Power calculation
We can of course perform simulations to obtain a power estimate, but this would be somewhat involved. Instead, the standard error of the FTO-BMI-T2D effect can be calculated and form the basis of a power calculation (Kline RB. Principles and Practice of Structural Equation Modeling, 2nd Edition, The Guilford Press 2005). We implement this in the ab function in R/gap. For EPIC-Norfolk we have n=25,000, a SNP-BMI regression coefficient (SE) of 0.15 (0.01), and a BMI-T2D coefficient (SE) of log(1.19) (0.01). We consider α=0.05. Criticism of this post hoc power calculation could be alleviated by considering a range of sample sizes, as in the next slide.
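A minimal sketch of this kind of calculation in base R follows. It uses a first-order (Sobel-type) standard error for the product of the two coefficients; whether the ab function in R/gap uses exactly this form is an assumption, as is the rescaling of both standard errors by 1/sqrt(n). The helper name power_ab is hypothetical.

```r
# First-order (Sobel-type) standard error for the product a*b,
# using the EPIC-Norfolk figures quoted in the slides.
a <- 0.15;      se_a <- 0.01   # SNP-BMI regression coefficient (SE)
b <- log(1.19); se_b <- 0.01   # BMI-T2D coefficient on the log scale (SE)

ab    <- a * b
se_ab <- sqrt(a^2 * se_b^2 + b^2 * se_a^2)   # delta-method approximation
z     <- ab / se_ab

# Approximate power of a two-sided test at level alpha when both SEs are
# rescaled from the original n0 = 25,000 to a new sample size n
# (assumption: SEs shrink as 1/sqrt(n), so z grows as sqrt(n/n0)).
power_ab <- function(n, n0 = 25000, alpha = 0.05) {
  z_n  <- z * sqrt(n / n0)
  crit <- qnorm(1 - alpha / 2)
  pnorm(z_n - crit) + pnorm(-z_n - crit)
}
print(c(ab = ab, se = se_ab, z = z))
print(round(power_ab(c(500, 1000, 5000, 25000)), 3))
```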
Two-stage GEI
A case-only design is used as the first stage, followed by a second stage involving both cases and controls.
Kass & Gold. Handbook of Epidemiology 2004; I.7
Murcray et al. Am J Epidemiol 2008; 169:219-26
Li & Conti. Am J Epidemiol 2008; 169:497-504
Thomas D. Nat Rev Genet 2010 (Epub)
References
Armitage P, Colton T. Encyclopedia of Biostatistics, Second Edition. Wiley 2005
Balding DJ, Bishop M, Cannings C. Handbook of Statistical Genetics, Third Edition. Wiley 2007
Elston RC, Johnson W. Basic Biostatistics for Geneticists and Epidemiologists: A Practical Approach. Wiley 2008
Haines JL, Pericak-Vance M. Genetic Analysis of Complex Diseases, Second Edition. Wiley 2006
Rao DC, Gu CC (Eds). Genetic Dissection of Complex Traits, Volume 60, Second Edition (Advances in Genetics). Academic Press 2008
Thomas DC. Statistical Methods in Genetic Epidemiology. Oxford University Press 2004
A summary
We have restricted our focus and left out many details in order to cover a rapidly moving field in limited time. It seems that the practice of study design and data analysis cannot be changed in the short run, but we have already seen a steady increase in the use of R.
EPIC study
The European Prospective Investigation into Cancer and Nutrition (EPIC) is coordinated by Dr Elio Riboli, Head of the Division of Epidemiology, Public Health and Primary Care at the Imperial College London. EPIC was designed to investigate the relationships between diet, nutritional status, lifestyle and environmental factors and the incidence of cancer and other chronic diseases. EPIC is the largest study of diet and health ever undertaken, having recruited over half a million (520,000) people in ten European countries: Denmark, France, Germany, Greece, Italy, The Netherlands, Norway, Spain, Sweden and the United Kingdom.
EPIC-Norfolk study
EPIC-Norfolk participants are over 30,000 men and women who were aged between 45 and 74 when they joined the study and who lived in Norwich and the surrounding towns and rural areas. They have been contributing information about their diet, lifestyle and health through questionnaires, and through health checks carried out by EPIC nurses.
Case-cohort design
The distribution of body mass index (BMI) in the case-cohort design of the EPIC-Norfolk study of obesity is a combination of the sub-cohort sample and the case sample, which is truncated from the whole cohort at BMI=30 (Zhao. J Stat Soft 2007;23(8):1-18).
[Figure: Histogram of bmi; x-axis bmi (15 to 45), y-axis Frequency (0 to 800)]
Power/sample size
It started with an assessment of how the power is compromised relative to the original case-control design. This was followed by a power/sample size calculation using methods established by Cai and Zeng (2004) as implemented in an R function, noting a number of assumptions. More practically, it was also envisaged that a properly representative sample would be 10% of the total of 25,000 individuals; the subcohort is then approximately 2,500. The total sample was split between two stages.
GeneChips
Affymetrix 500K
Data were available for 3850 individuals
Illumina 317K
It came at a later time
Data quality appears to be poor?
The focus has therefore been on Affy500K, but with a possible comeback.
Analysis
An incremental approach was adopted since the storage and computing power were somewhat uncertain. It was preceded by an analysis of controls from the breast cancer study, involving about 400 individuals with Perlegen 250K GeneChips. QC including call rates and HWE was feasible with SAS/Genetics (~30GB), which provides a good estimate of the storage for all individuals (~380GB). The Linux platform seemed favourable.
EPIC400 analysis
Additional analysis
Population stratification via EIGENSTRAT
SAS is very handy since a single put statement is sufficient to generate the output.
Collaborative (e.g. height) and consortium work (GIANT)
On the UK side, this mainly involves IMPUTE/SNPTEST, with inputs on strand, standard error, quantitative traits and outputs. This facilitates meta-analysis considerably.
LDL
Height
BMI/obesity
Further on BMI
Current practice
Linux clusters are now ready for comprehensive analyses, greatly facilitated by lightweight Linux/awk scripts. awk proves very useful, and scripts can be translated to Perl. In fact, any statistical package which processes data elements would be less efficient; an example is the transformation between the long, wide and transposed formats noted earlier. The scripts call C/C++ programs such as IMPUTE/SNPTEST. We use a Stata package to automate SNPTEST, in some instances involving C/C++ code. SAS is still useful for data preparation; it is in a sense less professional than a DBMS such as Oracle, but enjoys a large user community and has facilities for data analysis. The SAS 9.2 PROTO procedure allows C/C++ to be called.
References
Bodmer W, Bonilla C. Nat Genet 2008;40:695-701
EPIC: http://epic.iarc.fr/
EPIC-Norfolk: http://www.srl.cam.ac.uk/epic
Long AD et al. Science 1997;275:1328
Loos R et al. Nat Genet 2008;40:768-75
Prentice RL. Biometrika 1986;73:1-11
Risch N, Merikangas K. Science 1996;273:1516-7
Sandhu MS et al. Lancet 2008;371:483-91
Thomas DC. Nat Rev Genet 2010;11:259-72
Vimaleswaran KS. Am J Clin Nutr 2009;90:425-428
Weedon MN et al. Nat Genet 2008;40:575-83
Willer et al. Nat Genet 2009;41:25-34
Zhao JH. J Stat Soft 2007;23(8):1-18
Zhao JH et al. CCIS 2007;2:781-90
II Association analysis
Topics
Elements of association analysis
Analytic tools
R packages
Examples
Appendix
GRAMMAR
GRAMMAR refers to genome-wide rapid association using mixed model and regression, and is implemented in R/GenABEL. The method first obtains residuals adjusted for family effects and subsequently analyzes the association between these residuals and genetic polymorphisms using least-squares methods. Selected polymorphisms can then be followed up with the full measured genotype analysis (Aulchenko et al. Genetics 2007; 177:577-85).
Initial model: $y_i = \mu + \sum_j \beta_j X_{ij} + G_i + e_i$
Residuals: $\hat{e}_i = y_i - (\hat{\mu} + \sum_j \hat{\beta}_j X_{ij} + \hat{G}_i)$
Linear regression: $y_i^* = \hat{e}_i = \mu + \beta_g g_i + \varepsilon_i$
Measured genotype model: $y_i = \mu + \beta_g g_i + \sum_j \beta_j X_{ij} + G_i + e_i$
The method adjusts for familial relationship, is computationally fast, and readily incorporates methods developed for unrelated individuals in the second stage.
Analytical tools
There are several reviews in Human Genomics, and an active list is maintained at http://linkage.rockefeller.edu
LINKAGE, GENEHUNTER, Merlin, PAP, SAGE, SOLAR
ETDT, EHPLUS, FBAT, QTDT, UNPHASED, SAS/Genetics
R (genetics, gap, haplo.stats, haplin, kinship)
For GWAS
HaploView
PLINK, SNPGWA
IMPUTE, MACH, BinBam
EIGENSTRAT
METAL
SAS, Stata, R (snpMatrix, GenABEL, SNPassoc)
Connections with R
Occasionally, these programs will be cross-referenced. Analyses with specialized programs such as IMPUTE/SNPTEST and PLINK are illustrated in the useR!2008 tutorial. snpMatrix provides a connection with PLINK files, e.g.,
narac <- read.plink("narac.bed","narac.bim","narac.fam")
Basic R packages
genetics
haplo.stats
gap
Rassoc, HardyWeinberg, kinship, multic, pedigree, identity
See the CRAN task view for Genetics (http://cran.r-project.org/web/views/Genetics.html), as well as an earlier review on the motivation for analysis with the R statistical and computational environment (Zhao & Tan. Hum Genomics 2006;2:258-65). It also refers to the Rgenetics project, whose packages are now available from http://www.bioconductor.org.
We obtain results similar to ETDT (Sham PC, Curtis D (1995) An extended transmission/disequilibrium test (TDT) for multi-allelic marker loci. Ann. Hum. Genet. 59:323-336).
Haplotype analysis
library(haplo.stats)
mc4r.map <- read.table("mc4r.map",as.is=TRUE)
snps <- mc4r.map[,2]
M <- length(snps)
# allele column names, interleaved as snp1.a1, snp1.a2, snp2.a1, ...
a1 <- sprintf("%s%s",snps,rep(".a1",M))
a2 <- sprintf("%s%s",snps,rep(".a2",M))
a1a2 <- c(a1,a2)
for(i in 1:M) {a1a2[2*i-1] <- a1[i]; a1a2[2*i] <- a2[i]}
mc4r <- read.table("mc4r.ped",col.names=c(paste("v",1:6,sep=""),a1a2))
pheno <- read.csv("mc4r.csv",sep="\t",skip=11)
cohort <- subset(pheno,cohort==1)
attach(cohort)
# x.adj takes a matrix of covariates, hence cbind() rather than sex+age
mc4r.12 <- haplo.score(bmi,mc4r[id,7:30],x.adj=cbind(sex,age),locus.label=snps[1:12])
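[Figure: association plot for the CDKN2A/CDKN2B region; x-axis position (21900 to 22300) with genes CDKN2B and CDKN2A marked, y-axis -log10(Observed p), points shaded by LD (r^2: 0.8, 0.5, 0.2, 0.0) with imputed SNPs indicated; top signal P=5.4e-08]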
Association plot
While the first three are available from CRAN, snpMatrix is available from Bioconductor. Other packages include multtest, meta, rmeta, CAMAN, qvalue and ROCR.
Notes on S4 class
We illustrate with two classes
> library(snpMatrix)
> showClass("snp.matrix")
> library(GenABEL)
> showClass("scan.gwaa")
More information is given by the following commands
> class?snp.matrix
> class?scan.gwaa
Later we will omit the command prompt (>). We will also give examples of creating objects with the new() function.
SNPassoc
library(SNPassoc)
map <- read.table("mc4r.map",sep="\t",as.is=TRUE)
# PLINK map columns: chromosome, SNP name, genetic distance, position
info <- data.frame(snp=map[,2],chr=map[,1],pos=map[,4])
ped <- read.table("mc4r.ped",sep="\t",as.is=TRUE)
names(ped) <- c(paste("v",1:6,sep=""),map[,2])
pheno <- read.csv("mc4r.csv",sep="\t",skip=11,header=TRUE,as.is=TRUE)
is.cohort <- pheno$cohort==1
cohort <- subset(pheno,is.cohort)
snp <- ped[,-c(1:6)][is.cohort,]
snps <- dim(snp)[2]
# recode "A G" as "A/G" and set missing genotypes ("0/0") to NA
for(i in 1:snps) {
  substr(snp[,i],2,2) <- "/"
  empty <- (snp[,i]=="0/0")
  snp[empty,i] <- NA
}
Analysis
mc4r <- setupSNP(snp,1:snps,sep="/",sort=TRUE,info=info)
summary(mc4r)
plot(mc4r$rs17782313)
plot(mc4r$rs17700633,type=pie)
hwe <- tableHWE(mc4r)
mc4r.ld <- LD(mc4r)
summary(mc4r.ld)
mc4r.ld$"R^2"
attach(cohort)
association(bmi ~ sex+age+rs17782313,data=mc4r)
wga <- WGassociation(bmi ~ sex+age+1,model="log-add",data=mc4r)
png("mc4r.png")
qqpval(wga$"log-additive")
dev.off()
Q-Q plot
Comments
SNPassoc is essentially designed for unrelated individuals, with considerable enhancements from genetics and haplo.stats.
It implements permutation tests for binary traits through scanWGassociation(,nperm=) and permTest().
It is possible to examine gene-gene interaction:
mc4r.ip <- interactionPval(bmi~sex+age,data=mc4r,model="log-add")
plot(mc4r.ip)
This is a very simple example, but it gives a good feel for the kind of analysis involved.
snpMatrix
library(snpMatrix)
mc4r <- read.snps.pedfile("mc4r.ped")
summary(mc4r)
mc4rsnps <- row.names(mc4r$snp.support)
head(mc4r$snp.support)
head(mc4r$subject.support)
# quality controls
mc4r.qc <- summary(mc4r$snp.data)
head(mc4r.qc)
mc4r.ld <- ld.snp(mc4r$snp.data)
plot.snp.dprime(mc4r.ld,"mc4r.eps",scheme="rsq")
# ps2pdf mc4r.eps
# xpdf mc4r.pdf
# LD(rs17782313, rs17700633)
mc4r$snp.support[c(1,12),]
pair.result.ld.snp(mc4r$snp.data,1,12)
Meta-analysis
# a meta-analysis
cc.test <- single.snp.tests(cc,snp.data=mc4r$snp.data,score=TRUE)
cc.test2 <- pool(cc.test,cc.test)
summary(cc.test2)
cc.test.sign <- effect.sign(cc.test)
table(cc.test.sign)
cc.test.sign[1:12]
cc.test.switch <- switch.alleles(cc.test,c(1,12))
effect.sign(cc.test.switch)[1:12]
Genotype imputation
It is customary to impute genotypes in a large study based on a small sample of fully genotyped individuals, e.g., HapMap, so as to conduct association tests for a large number of SNPs. It is also useful for meta-analysis of SNPs from different platforms such as Affymetrix 500K and Illumina 550K. As it stands, snpMatrix implements genotype imputation between sets of markers typed on the same individuals; more generally this involves genotypes from HapMap.
QC for chromosome X
# we omit the X.ped data/map here owing to their size
X <- read.snps.pedfile("X.ped",X=TRUE)
X.qc <- summary(X$snp.data)
X.col <- col.summary(X$snp.data)
# z.HWE is a z statistic, so convert it to a two-sided p value before thresholding
SNPs <- subset(X.col, Call.rate>=0.90 & MAF>=0.01 & 2*pnorm(-abs(z.HWE))>=1e-6)
write.csv(row.names(SNPs), "X.snps", quote=FALSE, row.names=FALSE)
library(foreign)
write.dta(X.col,"Xqc.dta")
snp.imputation
By its definition, snp.imputation takes two sets of SNPs typed in the same subjects and calculates regression equations which can be used to impute one set from the other in a subsequent sample. We customarily use external data (e.g., from HapMap, 1000 Genomes or elsewhere) and our sample jointly, treating non-typed SNPs as missing. CRAN packages such as mice should facilitate imputation on the phenotype side.
Comments
snpMatrix has explicit treatment of chromosome X. It also provides some facilities for dealing with family data. The retrospective method would be more appropriate with data involving the kind of sample selection here. Please check the snpMatrix vignette for use of the hexbin package. It is possible to take advantage of the S4 class facility as implemented in the package when coded genotypes are available from or to other sources, e.g.,
m1 <- new("snp.matrix",dm1)
m2 <- new("snp.matrix",dm2)
m <- snp.rbind(m1,m2)
write.snp.matrix(m,"m.dat")
gwaa.data-class
nbytes: number of bytes used to store data on a SNP
nids: number of people
male: male code
idnames: ID names
nsnps: number of SNPs
snpnames: list of SNP names
chromosome: list of chromosomes corresponding to SNPs
coding: list of nucleotide coding for SNP names
strand: strands of the SNPs
map: list of SNP positions
gtps: genotypes (snp.mx-class), in 2-bit storage:
Genotype code   0    1    2    3
Bits            00   01   10   11
Saving: 75%
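The idea of 2-bit storage can be sketched in base R: four genotype codes fit in one byte, which is where a 75% saving relative to one byte per genotype comes from. The helper names pack2bit and unpack2bit are hypothetical, and the bit order chosen here is an assumption; this is not GenABEL's internal layout.

```r
# Pack genotype codes 0-3 (two bits each) four to a byte, and unpack them.
# pack2bit/unpack2bit are hypothetical helper names for illustration only.
pack2bit <- function(g) {
  stopifnot(all(g %in% 0:3))
  pad <- (4 - length(g) %% 4) %% 4                        # pad to a multiple of 4
  m <- matrix(c(as.integer(g), rep(0L, pad)), nrow = 4)   # 4 genotypes per column
  as.raw(colSums(m * c(64L, 16L, 4L, 1L)))                # first genotype in high bits
}
unpack2bit <- function(bytes, n) {
  b <- as.integer(bytes)
  g <- rbind(b %/% 64L, (b %/% 16L) %% 4L, (b %/% 4L) %% 4L, b %% 4L)
  as.vector(g)[seq_len(n)]                                # drop the padding
}

g <- c(0, 1, 2, 3, 3, 2, 1, 0, 2)
packed <- pack2bit(g)
print(packed)
print(unpack2bit(packed, length(g)))
```

Nine genotypes occupy three bytes here; the number of genotypes has to be stored separately so that padding can be dropped on unpacking.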
convert.snp.text() from text file (GenABEL default format)
convert.snp.ped() from Linkage, Merlin, Mach, and similar files
convert.snp.mach() from Mach format
convert.snp.tped() from PLINK TPED format
convert.snp.illumina() from Illumina/Affymetrix-like format
Data manipulation
snp.subset: subset data by SNP names or by QC criteria
add.phdata: merge extra phenotypic data to the gwaa.data-class
ztransform: standard normalization of phenotypes
rntransform: rank-normalization of phenotypes
npsubtreated: non-parametric adjustment of phenotypes for medicated subjects
from scan.glm, scan.haplo, ccfast, qtscore, emp.ccfast, emp.qtscore
Names: snpnames list of names of SNPs tested
P1df: p-values of 1-d.f. (additive or allelic) test for association
P2df: p-values of 2-d.f. (genotypic) test for association
Pc1df: p-values from the 1-d.f. test for association between SNP and trait; the statistic is corrected for possible inflation
effB: effect of the B allele in allelic test
effAB: effect of the AB genotype in genotypic test
effBB: effect of the BB genotype in genotypic test
Map: list of map positions of the SNPs
Chromosome: list of chromosomes the SNPs belong to
Idnames: list of subjects used in analysis
Lambda: inflation factor estimate, as computed using the lower portion (say, 90%) of the distribution, and standard error of the estimate
Formula: formula/function used to compute p-values
Family: family of the link function / nature of the test
ParallABEL
An R Library for Generalized Parallelization of Genome-Wide Association Studies
Analysis
HWE.show(mc4r)
r2 <- r2fast(mc4r)
dp <- dprfast(mc4r)
rho <- rhofast(mc4r)
descriptives.trait(mc4r)
descriptives.marker(mc4r)
use <- csv$cohort==1
qt.bmi <- qtscore(bmi~sex+age,data=mc4r,idsubset=use)
plot(qt.bmi)
However, as shown here, once the gwaa.data object is defined a range of analyses is rather straightforward. Again we only focus on the cohort sample (cohort==1).
Now the object ped has all the necessary information. We omit details of association testing but show pedigree diagrams.
Pedigree diagrams
library(kinship)
pdf("pedfile.pdf"); attach(ped)
uid <- unique(ped$FAMID)
for (j in 1:length(uid)) {
  selected <- FAMID==uid[j]
  id <- ID[selected]
  dadid <- FA[selected]
  momid <- MO[selected]
  sex <- SEX[selected]
  par(xpd=TRUE)
  # use a separate name so that the data frame ped is not overwritten
  ped.j <- pedigree(id, dadid, momid, sex)
  plot(ped.j, id=paste("\n",id,sep=""))
  title(uid[j])
  k <- kinship(id,dadid,momid)
  print(k)
}
detach(ped); dev.off()
References
Elston RC. Introduction and overview. Stat Meth Med Res 9(6, special issue), 2000
Balding DJ. Nat Rev Genet 7:781-791, 2006
Elston RC, Anne Spence M. Stat Med 25:3049-3080, 2006
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN. Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nat Rev Genet 9:356-369, 2008
Zheng G, Marchini J, Geller NL. Introduction to the special issue: Genome-wide association studies. Stat Sci 24(4, special issue), 2009
Data preparation
data long (keep=&snpid id &vlist a1a2 add n);
  set data;
  fid=open("data");
  length id $11. add 3. a1a2 $3.;
  format add 1.;
  set map point=_n_;
  n=0;
  do col=2 to attrn(fid,"nvars");
    iid=col-1;
    set &trait (keep=&vlist) point=iid;
    if &inc=1 then do;
      id=varname(fid,col);
      a1a2=vvaluex(id);
      add=.;
      if a1a2 ne " " then do;
        a1=substr(a1a2,1,1);
        a2=substr(a1a2,3,1);
        add=(a1=b)+(a2=b);
        n+1;
      end;
      output;
    end;
  end;
  rc=close(fid);
run;
Analysis
ods select none;
proc allele data=long genocol;
  by rsn notsorted;
  var a1a2;
  ods output markersumm=ms allelefreq=out.af;
run;
proc reg data=long;
  by rsn notsorted;
  ods output parameterestimates=bmipm;
  model bmi = age add / b stb;
quit;
proc logistic data=long descending;
  by rsn notsorted;
  ods output parameterestimates=obpm CLOddsPL=obclpm;
  model obesity = age add / expb clodds=pl;
run;
Stata
Stata is a general-purpose, modern and easy-to-use statistical analysis system (e.g., http://en.wikipedia.org/wiki/Stata). Functions for genetic data include summary statistics, tests of Hardy-Weinberg equilibrium, haplotype estimation, tagging and association analysis. It allows C/C++ routines to be used for computer-intensive tasks. My colleague has implemented SNPTEST-based GWA analysis to automate a variety of samples and analyses for imputed genotypes. There is also a good implementation of meta-analysis (metan, etc.), as well as a set of functions for instrumental variable regression in our context.
Other programs
By Mario Cleves
gencc - Genetic case-control tests
genhw - Hardy-Weinberg Equilibrium tests
qtlsnp - A program for testing associations between SNPs and a quantitative trait.
By Catherine Saunders
co_power - Power calculations for Case-only study designs.
gei_matching
geipower - Power calculations for Gene-Environment interactions.
ggipower - Power calculations for Gene-Gene interactions.
tdt_geipower - Power calculations for Gene-Environment interactions via TDT analysis.
tdt_ggipower - Power calculations for Gene-Gene interactions via TDT analysis.
By Neil Shephard
genass - Performs a number of statistical tests on your genotypic data and collates the results into a Stata formatted data set for browsing.
Topics
Meta-analysis
Risk prediction
Instrumental variable method and structural equation modeling
Gaussian graphical models and networks
Extreme value modeling
Meta-analysis
Some circulations within the GIANT consortium considered two studies with sample sizes 32000 and 8000, both with p values of 1e-8; the combined two-sided p value is 1.49e-14, whereas p1=1e-4 and p2=1e-5 yield p=4.89e-8 (weighted z-score method from metap in gap). In general, meta-analysis statistically combines data from multiple studies in the consortium to learn about association (level of significance) and factors related to variations in its magnitude (effect size). We have test of significance = size of effect x size of study, e.g., $\chi^2_1 = r^2 N$ (Kramer & Rosenthal. Comprehensive Clinical Psychology 3-15, Elsevier 1998)
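The weighted z-score combination can be sketched in base R as follows. With weights sqrt(N) the sketch gives values close to the two p values quoted for these studies; that metap in gap uses exactly this weighting is an assumption, and wz_meta is a hypothetical helper name.

```r
# Weighted z-score combination of two-sided p values with weights sqrt(N).
# wz_meta is a hypothetical helper name; all effects are assumed to be in
# the same direction.
wz_meta <- function(p, n) {
  z  <- qnorm(p / 2, lower.tail = FALSE)   # two-sided p -> positive z
  w  <- sqrt(n)
  zc <- sum(w * z) / sqrt(sum(w^2))        # combined z statistic
  c(z = zc, p = 2 * pnorm(-abs(zc)))
}
n <- c(32000, 8000)
print(wz_meta(c(1e-8, 1e-8), n))
print(wz_meta(c(1e-4, 1e-5), n))
```

In practice the sign of each z should follow the direction of effect with respect to a common reference allele before the studies are combined.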
$\chi^2_{2k} = -2 \sum_{i=1}^{k} \ln P_i$
$z = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \Phi^{-1}(1 - P_i)$
Fisher's method has limitations in
Giving equal weight to studies with different sizes
No test of heterogeneity
No point estimate to become more precise as K increases
However, there is suggestion about bias regarding msSNP.
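Fisher's method itself is short enough to write out in base R. The sketch below uses the hypothetical helper name fisher_meta and takes only the p values, which makes the first limitation visible: study size does not enter anywhere.

```r
# Fisher's combined probability test: -2 * sum(log(p)) is referred to a
# chi-squared distribution with 2k degrees of freedom.
# fisher_meta is a hypothetical helper name.
fisher_meta <- function(p) {
  k <- length(p)
  stat <- -2 * sum(log(p))
  c(chisq = stat, df = 2 * k,
    p = pchisq(stat, df = 2 * k, lower.tail = FALSE))
}
print(fisher_meta(c(1e-4, 1e-5)))
print(fisher_meta(0.05))   # a single study returns its own p value
```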
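Fixed effects: $\hat{\beta}_i = \beta + \varepsilon_i, \quad i = 1, \ldots, k, \quad \varepsilon_i \sim N(0, \sigma_i^2)$
Random effects: $\hat{\beta}_i = \beta + b_i + \varepsilon_i, \quad i = 1, \ldots, k, \quad b_i \sim N(0, \tau^2), \ \varepsilon_i \sim N(0, \sigma_i^2)$
$\hat{\tau}^2 = \dfrac{\sum_{i=1}^{k} (\hat{\beta}_i - \hat{\beta})^2 \sigma_i^{-2} - (k-1)}{\sum_{i=1}^{k} \sigma_i^{-2} - \sum_{i=1}^{k} \sigma_i^{-4} / \sum_{i=1}^{k} \sigma_i^{-2}}$
With a study-level covariate: $\hat{\beta}_i = \beta + \beta_1 z_i + b_i + \varepsilon_i, \quad i = 1, \ldots, k, \quad b_i \sim N(0, \tau^2), \ \varepsilon_i \sim N(0, \sigma_i^2)$
$\mathrm{Var}(\hat{\beta}_i) = \tau^2 + \sigma_i^2 = (r_i + 1)\sigma_i^2$ with $r_i = \tau^2/\sigma_i^2$; in matrix form $E(\hat{\beta}) = X\beta$, $\mathrm{Var}(\hat{\beta}) = V$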
Measure of Heterogeneity
Cochran's Q, $Q = \sum_{i=1}^{k} (\hat{\beta}_i - \hat{\beta})^2 / \sigma_i^2$, can be referred to a chi-squared distribution with k-1 degrees of freedom.
$I^2$, defined as 100% x (Q-df)/Q, expresses the percentage of between-study variability that is attributable to heterogeneity rather than chance. Thresholds of 25%, 50% and 75% are suggested for low, moderate and high heterogeneity (Higgins et al. BMJ 2003; 327:557-60). It has been suggested that $cQ \sim \chi^2(\nu)$, with Q being the heterogeneity chi-square, has excellent properties (Bohning et al. 2008).
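These heterogeneity measures can be computed directly in base R. The sketch below borrows the nine estimates and precisions from the WinBUGS example in these slides, treating w as 1/variance; the moment estimator of the between-study variance is the DerSimonian-Laird form, and truncation of I2 and tau2 at zero is a common convention assumed here.

```r
# Estimates y and precisions w = 1/variance, taken from the WinBUGS example
# in these slides.
y <- c(0.864, 0.646, 0.272, 0.916, 0.867, 0.819, 0.809, 1.212, -0.273)
w <- c(4.40, 9.89, 16.81, 8.38, 8.15, 10.36, 10.79, 4.40, 15.95)
k <- length(y)

beta_fixed <- sum(w * y) / sum(w)          # inverse-variance pooled estimate
Q   <- sum(w * (y - beta_fixed)^2)         # Cochran's Q
df  <- k - 1
p_Q <- pchisq(Q, df, lower.tail = FALSE)   # heterogeneity p value
I2  <- max(0, 100 * (Q - df) / Q)          # truncated at 0 by convention

# DerSimonian-Laird moment estimator of the between-study variance
tau2 <- max(0, (Q - df) / (sum(w) - sum(w^2) / sum(w)))
w_re <- 1 / (1 / w + tau2)                 # random-effects weights
beta_random <- sum(w_re * y) / sum(w_re)

print(round(c(fixed = beta_fixed, Q = Q, p = p_Q, I2 = I2,
              tau2 = tau2, random = beta_random), 4))
```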
Implementations
SAS has no built-in procedure for meta-analysis, but it can customarily be done via PROCs GLM (fixed effects/inverse variance) and more often MIXED, as well as macros.
Stata has a comprehensive collection of meta-analysis commands, notably metan.
R hosts several packages at CRAN (e.g., meta, rmeta).
S-PLUS has user-written packages, e.g., hblm.
Others such as HLM, MLwiN, WinBUGS.
Customized programs
Useful URLs
CAMAN (Computer Assisted Analysis of Mixtures) http://www.charite.de/biometrie/schlattmann/book/
improved.ci (function for the improved confidence interval using DL method) http://www.statistik.tu-dortmund.de/ma_book.html
hblm (Hierarchical Bayes Linear Model Programs) ftp://ftp.research.att.com/dist/bayes-meta/
CAMAP (Computer-Assisted Meta-Analysis with the Profile Likelihood) http://www.personal.reading.ac.uk/~sns05dab/Software.html
Fixed-effects meta-analysis
data test;
  input studyid lor est;
  col=_n_; row=_n_; value=est;
  cards;
data for 15 studies
run;
proc mixed method = ml data=test;
  class studyid;
  model lor = / s cl;
  repeated / group = studyid;
  parms / parmsdata=test eqcons=1 to 15;
run;
Random-effects meta-analysis
proc mixed data=test covtest;
  class studyid;
  model lor = / s cl outp=predp outpm=predm;
  repeated diag / r;
  random studyid / g gdata = test s v;
  ods output CovParms=cp G=G R=R V=V SolutionF=SF SolutionR=SR;
run;
data predp; set predp; pvalue=probnorm(resid/stderrpred); run;
data predm; set predm; pvalue=probnorm(resid/stderrpred); run;
Stata
use meta5
list in 1/5
metan b se, by(snp) fixedi nograph
WinBUGS
model {
  for (i in 1:r) {
    y[i] ~ dnorm(psi[i],w[i])
    psi[i] ~ dnorm(theta,t)
  }
  theta ~ dnorm(0,1.0E-4)
  t ~ dgamma(0.001,0.001)
  tausq <- 1/t
}
list(y = c(0.864, 0.646, 0.272, 0.916, 0.867, 0.819, 0.809, 1.212, -0.273),
     w = c(4.40, 9.89, 16.81, 8.38, 8.15, 10.36, 10.79, 4.40, 15.95), r = 9)
list(theta = 0, t = 1, psi = c(0,0,0,0,0,0,0,0,0))
Customized programs
META
METAL
MetABEL
R/snpMatrix
A cautionary note
In a meta-analysis, we compute an effect size for each study and combine these, rather than combining summary data and computing an effect size for the combined data. This allows a check of the consistency of effect sizes across studies and minimizes potential confounding. If we were to pool data across studies and then compute the effect size from the pooled data, we may get the wrong answer due to Simpson's paradox. See Chapter 13 of Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-Analysis. Wiley 2009
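A small numerical illustration of the point, in base R. The two 2x2 tables are invented for illustration: each study has an odds ratio of 2, yet naive pooling of the raw counts reverses the direction, while an inverse-variance combination of the study-specific log odds ratios does not.

```r
# Two hypothetical 2x2 tables (counts invented for illustration)
# rows: exposed / unexposed; columns: cases / controls
s1 <- matrix(c(20, 5, 200, 100), nrow = 2,
             dimnames = list(c("exposed", "unexposed"), c("case", "control")))
s2 <- matrix(c(20, 200, 5, 100), nrow = 2,
             dimnames = list(c("exposed", "unexposed"), c("case", "control")))

odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
log_or_var <- function(m) sum(1 / m)             # Woolf variance of log OR

or_each   <- c(odds_ratio(s1), odds_ratio(s2))   # study-specific odds ratios
or_pooled <- odds_ratio(s1 + s2)                 # naive pooling of raw counts
w <- 1 / c(log_or_var(s1), log_or_var(s2))
or_meta <- exp(sum(w * log(or_each)) / sum(w))   # inverse-variance combination

print(c(study1 = or_each[1], study2 = or_each[2],
        pooled = or_pooled, meta = or_meta))
```

The reversal arises because the two studies differ sharply in both exposure frequency and case fraction, so study membership acts as a confounder once the counts are added together.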
Statistical models
The data typically involve b and SE from linear regression on nearby SNPs, to allow for fixed- and random-effects modeling and assessment of statistical significance. It is not obvious how to infer the covariance matrix of these b's. However, we can work with pair-wise correlations (r). For linear regression, r and t (=b/SE) are related via the simple expression $r^2 = t^2/(n-2+t^2)$. The covariance between pair-wise correlations has the following form.
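The relation between r and t stated here can be checked numerically in base R. The simulated data and the effect size 0.2 are illustrative assumptions; the identity holds exactly for simple linear regression with an intercept.

```r
# Check r^2 = t^2 / (n - 2 + t^2) on simulated data
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.2 * x + rnorm(n)

fit <- summary(lm(y ~ x))$coefficients
t_stat <- fit["x", "Estimate"] / fit["x", "Std. Error"]   # t = b / SE
r2_from_t <- t_stat^2 / (n - 2 + t_stat^2)
r2_direct <- cor(x, y)^2
print(c(from_t = r2_from_t, direct = r2_direct))
```

This is what allows correlations to be recovered from published b and SE when only summary statistics and sample sizes are available.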
Summary
This is far from a comprehensive overview but offers some flavour of the kind of thinking and practice. Evidence synthesis with conscious recognition of heterogeneity is at the heart of meta-analysis. Fixed-effects analysis is restricted to data of the type found in the studies included, whereas the random-effects model generalizes to all studies of the type from which our studies were drawn. Results from both models together with the SH model are highly recommended. We have omitted the graphical aspects, e.g., Bax et al. AJE 2009; 169:249-55. An Excel macro is available from http://www.mix-for-meta-analysis.info/index.html
Risk prediction
A set of SNPs can be used in a logistic regression model to predict whether an individual is a case or a control based on a cut-off probability. The choice of an optimal cut-off can be guided by the receiver operating characteristic (ROC) curve. The ability to classify individuals correctly is measured by the area under the ROC curve (AUC; e.g., ~0.5 for no discrimination, 0.7-0.8 for acceptable and 0.8-1 for excellent discrimination). Examples: prostate cancer, obesity, HDL/TG/LDL.
A testing example
library(verification)
obs <- round(runif(100))
pred <- runif(100)
A <- verify(obs, pred, frcst.type = "prob", obs.type = "binary")
roc.plot(A, main = "Test", binormal = TRUE, plot = "both")
roc.plot(A, threshold = seq(0.1, 0.9, 0.1), CI = TRUE, alpha = 0.1)
roc.plot(obs, pred, xlab = '1-specificity', ylab = 'sensitivity', cex = 2)
AUC <- roc.area(obs, pred)$A
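To make the link between a logistic model on SNPs and the AUC concrete, here is a minimal base R sketch. The genotypes, effect sizes and case-control status are simulated (all hypothetical), and the helper auc() is a name introduced here. It computes the AUC from ranks, as the probability that a randomly chosen case has a higher predicted risk than a randomly chosen control.

```r
set.seed(2010)
n <- 1000
# Hypothetical genotypes at three SNPs (0/1/2 copies of the risk allele)
snps <- data.frame(snp1 = rbinom(n, 2, 0.3),
                   snp2 = rbinom(n, 2, 0.2),
                   snp3 = rbinom(n, 2, 0.4))
# Hypothetical case-control status generated from a logistic model
lp <- -1 + 0.3 * snps$snp1 + 0.2 * snps$snp2 + 0.25 * snps$snp3
status <- rbinom(n, 1, plogis(lp))

fit <- glm(status ~ snp1 + snp2 + snp3, data = snps, family = binomial)
pred <- fitted(fit)

# AUC from ranks (Mann-Whitney form); tied scores count one half
auc <- function(obs, score) {
  r <- rank(score)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(status, pred)
```

Because the model is evaluated on the data used to fit it, the value is optimistic; a held-out sample or cross-validation would be the more careful choice.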
[Figure: frequency distribution over categories 6 to 17, with BMI (kg/m2), waist (cm) and odds ratio by category. Percentage of individuals in categories 6 to 17: 1.4, 2.6, 6.6, 11.1, 15.6, 17.9, 16.6, 12.8, 8.2, 4.4, 1.9, 1.0.]
[Figure: ROC curves, sensitivity against 1-specificity.]
AUC:
Model 1 (Age + Age^2 + Sex): 0.572 (95% CI: 0.560-0.584)
Model 2 (SNPs): 0.574 (95% CI: 0.559-0.590)
Model 3 (Age + Age^2 + Sex + SNPs): 0.597 (95% CI: 0.582-0.612)
IV in simple terms
In an observational study, U represents unmeasured confounders of the X-Y association. In a randomized trial, U represents variables that affect adherence to treatment assignment and thus influence received treatment X. Z is called an instrumental variable (or instrument) for estimating the effect of X on Y if:
a. Z affects X (i.e., Z is an ancestor of X).
b. Z affects the outcome Y only through X (i.e., all directed paths from Z to Y pass through X).
c. Z and Y share no common causes.
Rothman KJ, Greenland S, Lash TL. Modern Epidemiology, 3e, Lippincott Williams & Wilkins, 2008.
This is the so-called triangulation approach (Freathy et al. Diabetes 2008; 57:1419-26).
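$$\begin{aligned}
\mathrm{BMI} &= \alpha_0 + \alpha_1\,\mathrm{SNP} + \text{error} \\
\mathrm{TG} &= \beta_0 + \beta_1\,\mathrm{BMI} + \beta_2\,\mathrm{SNP} + \text{error} \\
 &= (\beta_0 + \beta_1\alpha_0) + (\beta_1\alpha_1 + \beta_2)\,\mathrm{SNP} + \text{error} \\
\mathrm{TG} &= \gamma_0 + \gamma_2\,\mathrm{SNP} + \text{error} \\
\gamma_2 &= \beta_1\alpha_1 + \beta_2, \qquad \beta_1 = (\gamma_2 - \beta_2)/\alpha_1
\end{aligned}$$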
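The triangulation idea can be sketched with simulated data. Everything below is hypothetical: the effect sizes, the confounder u, and the names alpha1 (slope of the exposure on the SNP) and gamma2 (slope of the outcome on the SNP) are chosen here for illustration. Under the instrumental variable assumptions (no direct SNP effect on the outcome), the ratio gamma2/alpha1 estimates the causal effect of the exposure, whereas the ordinary regression of outcome on exposure is distorted by the confounder.

```r
set.seed(42)
n <- 10000
snp <- rbinom(n, 2, 0.3)            # instrument Z (hypothetical genotype)
u   <- rnorm(n)                     # unmeasured confounder U
bmi <- 0.5 * snp + u + rnorm(n)     # exposure X
tg  <- 0.3 * bmi + u + rnorm(n)     # outcome Y; true causal effect is 0.3

alpha1 <- coef(lm(bmi ~ snp))["snp"]   # SNP -> exposure
gamma2 <- coef(lm(tg ~ snp))["snp"]    # SNP -> outcome
naive  <- coef(lm(tg ~ bmi))["bmi"]    # confounded observational estimate

iv_ratio <- unname(gamma2 / alpha1)    # ratio (Wald) estimate
c(naive = unname(naive), iv = iv_ratio)
```

In this simulation the instrument is strong; with a weak instrument (small alpha1) the ratio becomes unstable, which is one of the issues listed on the next slide.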
Issues with IV
No suitable genetic variant; unreliable gene association; population stratification; linkage disequilibrium; pleiotropy; nonlinear association; weak instrument
See Lawlor et al. (2007). Stat Med 27:1133-67; Didelez & Sheehan (2007). Stat Meth Med Res 16:309-30; Didelez et al. Stat Sci 2010; Pare & Anand (2010). Lancet 375:1584-5
Mediation analysis
[Figure: path diagrams linking X, Y and Z.]
Scenarios of mediation: complete (upper left), partial (lower left), complete (upper right) and partial (lower right) with two mediators
Results
Quasi-Bayesian Confidence Intervals
Mediation Effect: -0.006834 95% CI -0.022355 0.002811
Direct Effect: -0.1205 95% CI -0.19597 -0.02652
Total Effect: -0.1273 95% CI -0.20195 -0.03293
Proportion of Total Effect via Mediation: 0.04556 95% CI 0.02904 0.15595
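To show how the mediation, direct and total effects relate to one another, the following sketch uses the product-of-coefficients approach on simulated data. This is a simpler technique than the quasi-Bayesian intervals reported above and is offered only as an illustration; the variable names and effect sizes are hypothetical.

```r
set.seed(7)
n <- 2000
snp <- rbinom(n, 2, 0.4)                 # hypothetical exposure
bmi <- 0.3 * snp + rnorm(n)              # hypothetical mediator
tg  <- 0.2 * bmi + 0.1 * snp + rnorm(n)  # hypothetical outcome

a      <- coef(lm(bmi ~ snp))["snp"]     # exposure -> mediator
fit_y  <- coef(lm(tg ~ bmi + snp))
b      <- fit_y["bmi"]                   # mediator -> outcome, given exposure
direct <- fit_y["snp"]                   # direct effect
total  <- coef(lm(tg ~ snp))["snp"]      # total effect

indirect <- a * b                        # mediation (indirect) effect
res <- c(mediation = unname(indirect), direct = unname(direct),
         total = unname(total), proportion = unname(indirect / total))
round(res, 4)
```

With ordinary least squares on the same sample, the mediation and direct effects add up exactly to the total effect, which is the decomposition the output above reports.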
Mplus code
Title:
  snp1: rs1121980 from FTO
  snp2: rs17782313 from MC4R
  zlbmi : BMI
  zlwst : waist
  zltg : Triglycerides
  zsys : SBP
  zdia : DBP
Data:
  File is effectsize.dat ;
Variable:
  Names are snp1 snp2 zlbmi zlwst zltg zsys zdia;
  Missing are all (-9999) ;
  Usevariables are snp1 zlbmi zltg;
Model:
  zltg on zlbmi;
  zlbmi on snp1;
  zltg on snp1;
Model indirect:
  zltg ind snp1;
Output:
  Standardized;
[Figure: cross-lagged model with variables X1, X2, Y1 and Y2.]
When Y1 becomes X2 and X2 becomes Y2, the cross-lagged model can be used to study reverse causation, especially with longitudinal data. It becomes clear that we will be most comfortable with the SEM framework, as is also illustrated with the following slide.
Bayesian networks
Rule-based systems with certainty factors have serious limitations as a method for knowledge representation and reasoning under uncertainty, and attention to a probabilistic interpretation of certainty factors leads to Bayesian networks. A Bayesian network can be described briefly as a directed acyclic graph (DAG) which defines a factorization of a joint probability distribution over the variables represented by the nodes of the DAG. The process of construction involves identification of the relevant variables and their causal relations, which leads to a DAG specified in terms of a set of conditional probabilities.
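The factorization mentioned above can be written out explicitly. For variables $X_1, \ldots, X_n$, with $\mathrm{pa}(X_i)$ denoting the parents of $X_i$ in the DAG (notation assumed here, not taken from the slides):

```latex
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```

As a hypothetical illustration, a chain SNP -> BMI -> TG would give $P(\mathrm{SNP})\,P(\mathrm{BMI} \mid \mathrm{SNP})\,P(\mathrm{TG} \mid \mathrm{BMI})$, so only three small conditional distributions need to be specified rather than the full joint distribution.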
Methods
Gene expression levels, as continuous variables, were assumed to follow a multivariate normal distribution, consistent with a Bayesian network with linear Gaussian conditional densities. The prior of this network is characterised by a prior network, reflecting our belief in the joint distribution of the variables in question, and an equivalent sample size (ESS), which makes the prior behave as if it had been calculated from a prior data set of that size. For instance, without a priori knowledge of the regulatory network, the prior network could be one where all expression levels are independent, in order to avoid explicitly biasing the learning procedure towards a particular edge. The learning procedure starts with a training set and evaluates networks according to an asymptotically consistent scoring function that is obtained through the Bayesian framework. The so-called causal structure assumes that dependencies between variables are due to causal relationships between variables in the model.
Left. Importance of the dependencies. Right. Solid arc has direct causal influence (direct meaning that causal influence is not mediated by any other variable that is included in the study). Dashed arc indicates there are two possibilities, but we do not know which holds. Dashed line without any arrow heads indicates there is a dependency but we do not know the reciprocal dependence. From Zhao et al. BMC
Summary
We have covered a variety of topics ranging from meta-analysis to causal modelling, which is expected to become more familiar as more genetic variants are established. The topics are general, since some are also quite familiar to researchers in other fields (e.g., psychology, social science, econometrics) where, for instance, structural equation modelling is routinely used.
References
Bohning D, Kuhnert R, Rattanasiri S. Meta-Analysis of Binary Data Using Profile Likelihood. CRC Press, 2008
Conneely KN, Boehnke M. AJHG 2007; 81:1158-68
Demidenko E. Mixed Models. Wiley, 2004
Harris et al. Stata J 2008; 8:3-28
Hartung J, Knapp G, Sinha BK. Statistical Meta-Analysis with Applications. Wiley, 2008
Normand S-LT. Stat Med 1999; 18:321-59
Rao DC, Gu CC. Genetic Dissection of Complex Traits, 2e. Academic Press, 2008
Schlattmann P. Medical Applications for Finite Mixture Models. Wiley, 2009
Sidik & Jonkman. Appl Stat 2005; 54:367-84
Sterne J. Meta-Analysis in Stata. Stata Press, 2009
Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for Meta-Analysis in Medical Research. Wiley, 2000
Verzilli et al. AJHG 2008; 82:859-72
Whitehead A. Meta-Analysis of Controlled Clinical Trials. Wiley, 2002
References
Krzanowski WJ, Hand DJ. ROC Curves for Continuous Data. CRC, 2009
Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, 2003
Gonen M. Analyzing Receiver Operating Characteristic Curves with SAS. SAS Institute Inc., 2007
Loehlin JC. Latent Variable Models: An Introduction to Factor, Path, and Structural Equation Analysis, 4e. Lawrence Erlbaum Associates, 2004
Kline RB. Principles and Practice of Structural Equation Modeling, 2e. The Guilford Press, 2005
Bollen KA, Curran PJ. Latent Curve Models: A Structural Equation Perspective. Wiley, 2006
Kjaerulff UB, Madsen AL. Bayesian Networks and Influence Diagrams: A Guide to Construction and Analysis. Springer, 2008
Emmert-Streib F, Dehmer M. Analysis of Microarray Data: A Network-Based Approach. Wiley-VCH, 2008
Junker BH, Schreiber F (Eds). Analysis of Biological Networks. Wiley, 2008
Topics
Heritability estimation
Background
Family data
Twin data
OpenMx
Summary
Information retrieval with NCBI2R
Further information
Some clarifications
For a binary trait, such as whether or not an individual has a disease, heritability is not the proportion of disease in the population attributable to, or caused by, genetic factors.
For a continuous trait, genetic heritability is not a measure of the proportion of an individual's score attributable to genetic factors.
Heritability is not about cause per se, but about the causes of variation in a trait across a particular population.
As heritability varies according to which factors are considered, there is no unique value of the genetic heritability of a characteristic. It also varies from population to population.
For a poorly measured trait, part of the variation is apportioned to measurement error, leading to a lower estimate of genetic heritability.
Heritability studies
Family studies
Adoption studies (rearing of a nonbiological child)
Migrant studies: migrants carry a risk reflecting country of origin
Twin studies: differences between monozygotic and dizygotic twins can be attributed to genetic influence
h: Effect of child's genotype on child's phenotype
hz: Effect of parental genotype on parental phenotype
c: Effect of child's environment on child's phenotype
cy: Effect of parental environment on parental phenotype
p: "Primary" correlation between parental phenotypes due to phenotypic homogamy
m: Correlation between parental genotypes due to social homogamy
u: Correlation between parental environments due to social homogamy
fF: Effect of father's environment on child's environment
fM: Effect of mother's environment on child's environment
b: Effect of common sibship environment on child's environment
i: Effect of child's environment on child's index
iF: Effect of father's environment on father's index
iM: Effect of mother's environment on mother's index
Correlation between parental genotype and parental phenotype
Correlation between parental environment and parental phenotype
Correlation between adult's environment and spouse's genotype due to social homogamy
Total correlation between parental genotype and parental environment
s a
SOLAR
SOLAR (Sequential Oligogenic Linkage Analysis Routines, Almasy L, Blangero J. Am J Hum Genet 1998; 62:1198-211) uses likelihood ratio tests; heritability is tested by comparing a purely polygenic model with a sporadic model. In a polygenic model, h2r is the total additive genetic heritability. In a linkage model (with one or more locus-specific elements), h2q1 represents the heritability associated with the first locus, and h2r represents the residual genetic variance. In an oligogenic model, there may also be h2q2, h2q3, etc. Sung J, et al. JCEM 2009; 94:4946-52 reported a recent study of twins with adjusted (age, sex, age^2, age^2 x sex, total calorie intake, smoking and alcohol use) heritabilities for waist circumference (59%), glucose (59%), HDL (77%) and TG (46%).
$$r_{MZ} = a^2 + c^2, \qquad r_{DZ} = 0.5\,a^2 + c^2$$
Recall that
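Expectations of sample correlation
$$E(r) = \rho - \frac{\rho(1-\rho^2)}{2n}\left[1 - \frac{1 - 9\rho^2}{4n} + \cdots\right]$$
$$V(r) = \frac{(1-\rho^2)^2}{n}\left[1 + \frac{11\rho^2}{2n} + \cdots\right]$$
Keeping ES. Introduction to Statistical Inference. Van Nostrand 1962; Dover 1995.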
Simple estimation
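There we have mean/variance
$$a^2 = 2(r_{MZ} - r_{DZ}), \qquad V(a^2) = 4\left[\frac{(1-r_{MZ}^2)^2}{n_{MZ}} + \frac{(1-r_{DZ}^2)^2}{n_{DZ}}\right]$$
Similarly,
$$c^2 = 2r_{DZ} - r_{MZ}, \qquad V(c^2) = \frac{(1-r_{MZ}^2)^2}{n_{MZ}} + \frac{4(1-r_{DZ}^2)^2}{n_{DZ}}$$
$$e^2 = 1 - r_{MZ}, \qquad V(e^2) = \frac{(1-r_{MZ}^2)^2}{n_{MZ}}$$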
These can be simpler when there are equal numbers of DZ and MZ twins.
Summary statistics
apply(mzm,2,mean)
    bmi1     bmi2
22.68876 22.86700
cov(mzm)
         bmi1     bmi2
bmi1 7.240612 4.698384
bmi2 4.698384 6.921260
cor(mzm)
          bmi1      bmi2
bmi1 1.0000000 0.6636946
bmi2 0.6636946 1.0000000
apply(dzm,2,mean)
    bmi1     bmi2
23.32167 23.16760
cov(dzm)
         bmi1     bmi2
bmi1 9.208189 1.574196
bmi2 1.574196 5.799376
cor(dzm)
          bmi1      bmi2
bmi1 1.0000000 0.2154175
bmi2 0.2154175 1.0000000
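The simple estimators a^2 = 2(rMZ - rDZ), c^2 = 2rDZ - rMZ and e^2 = 1 - rMZ can be applied directly to the two twin correlations printed above. The sketch below does this with leading-order standard errors based on V(r) = (1 - r^2)^2/n. The numbers of twin pairs are not given on the slide, so n_mz and n_dz are hypothetical placeholders and the standard errors are illustrative only.

```r
r_mz <- 0.6636946   # cor(mzm) above
r_dz <- 0.2154175   # cor(dzm) above
n_mz <- 500         # hypothetical number of MZ pairs (not given in the slides)
n_dz <- 500         # hypothetical number of DZ pairs (not given in the slides)

# Large-sample variance of a sample correlation: (1 - r^2)^2 / n
v_mz <- (1 - r_mz^2)^2 / n_mz
v_dz <- (1 - r_dz^2)^2 / n_dz

est <- c(a2 = 2 * (r_mz - r_dz),
         c2 = 2 * r_dz - r_mz,
         e2 = 1 - r_mz)
se  <- sqrt(c(a2 = 4 * (v_mz + v_dz),
              c2 = 4 * v_dz + v_mz,
              e2 = v_mz))
round(cbind(estimate = est, se = se), 4)
```

With these correlations the simple estimate of c^2 comes out negative, which lies outside the parameter space; this is one reason the model-based fitting with OpenMx is preferable to the moment estimators.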
Parameter estimation
   name matrix row col   Estimate   Std.Error
1  <NA>      A  x1   G 0.39715212 0.015549769
2  <NA>      A  x2   G 0.50366111 0.018232514
3  <NA>      A  x3   G 0.57724141 0.020448402
4  <NA>      A  x4   G 0.70277369 0.024011418
5  <NA>      A  x5   G 0.79624998 0.026669452
6  <NA>      S  x1  x1 0.04081419 0.002812717
7  <NA>      S  x2  x2 0.03801999 0.002805794
8  <NA>      S  x3  x3 0.04082718 0.003152308
9  <NA>      S  x4  x4 0.03938706 0.003408875
10 <NA>      S  x5  x5 0.03628712 0.003678561
Model-fitting statistics
observed statistics: 15
estimated parameters: 10
degrees of freedom: 5
-2 log likelihood: -3648.281
saturated -2 log likelihood: -3655.665
number of observations: 500
chi-square: 7.384002
p: 0.1936117
AIC (Mx): -2.615998
BIC (Mx): -11.84452
adjusted BIC:
RMSEA: 0.03088043
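The fit statistics above can be reproduced from the two likelihoods, the degrees of freedom and the number of observations. The formulas in the sketch below were inferred by matching the printed values, so they should be read as an assumption about how these Mx-style statistics are defined rather than as the package's documented definitions.

```r
m2ll_model     <- -3648.281   # -2 log likelihood of the fitted model
m2ll_saturated <- -3655.665   # -2 log likelihood of the saturated model
df <- 5                       # degrees of freedom
n  <- 500                     # number of observations

chisq <- m2ll_model - m2ll_saturated
p     <- pchisq(chisq, df, lower.tail = FALSE)
aic   <- chisq - 2 * df                        # AIC (Mx)
bic   <- 0.5 * (chisq - df * log(n))           # BIC (Mx)
rmsea <- sqrt(max((chisq / df - 1) / n, 0))    # RMSEA

round(c(chisq = chisq, p = p, AIC = aic, BIC = bic, RMSEA = rmsea), 4)
```

A non-significant chi-square together with a small RMSEA indicates that the model does not fit appreciably worse than the saturated model.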
Elementary statements
We can obtain a list of commands as usual, i.e.,
library(help=OpenMx)
e.g., mxAlgebra, mxMatrix, mxData, mxEval, mxAlgebraObjective, mxFIMLObjective, mxRun
Examples
A <- mxMatrix("Full", nrow = 3, ncol = 3, values=2, name = "A")
A
FullMatrix 'A'
@labels: No labels assigned.
@values
     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    2    2    2
[3,]    2    2    2
@free: No free parameters.
@lbound: No lower bounds assigned.
@ubound: No upper bounds assigned.
ACE model
ACE <- function(mzDat=mzData, dzDat=dzData, type="raw", selV=selVars) {
  twinACEModel <- mxModel("ACE",
    # path coefficients a, c, e (free parameters, starting value 0.6)
    mxMatrix("Full", 1, 1, TRUE, .6, "a", name="X"),
    mxMatrix("Full", 1, 1, TRUE, .6, "c", name="Y"),
    mxMatrix("Full", 1, 1, TRUE, .6, "e", name="Z"),
    # variance components and total variance
    mxAlgebra(X %*% t(X), "A"),
    mxAlgebra(Y %*% t(Y), "C"),
    mxAlgebra(Z %*% t(Z), "E"),
    mxAlgebra(A+C+E, name="V"),
    # common mean for both twins
    mxMatrix("Full", 1, 2, TRUE, 20, "mean", name="expMean"),
    # expected covariance matrices for MZ and DZ pairs
    mxAlgebra(rbind(cbind(A+C+E, A+C),
                    cbind(A+C, A+C+E)), "expCovMZ"),
    mxAlgebra(rbind(cbind(A+C+E, 0.5%x%A+C),
                    cbind(0.5%x%A+C, A+C+E)), "expCovDZ"),
    # one submodel per zygosity group, fitted by FIML
    mxModel("MZ", mxData(mzDat, type),
            mxFIMLObjective("ACE.expCovMZ", "ACE.expMean", selV)),
    mxModel("DZ", mxData(dzDat, type),
            mxFIMLObjective("ACE.expCovDZ", "ACE.expMean", selV)),
    # overall objective: sum of the two -2 log likelihoods
    mxAlgebra(MZ.objective + DZ.objective, name="twin"),
    mxAlgebraObjective("twin"))
Fitting AE model
twinAEModel <- mxModel(twinACEModel,
  mxMatrix("Full", 1, 1, FALSE, 0, "c", name="Y"))
twinAEFit <- mxRun(twinAEModel, silent=TRUE)
exp_AE <- mxEval(rbind(expCovMZ,expCovDZ,expMean), twinAEFit)
est_AE <- mxEval(cbind(A,C,E,A/V,C/V,E/V), twinAEFit)
rownames(est_AE) <- 'AE'
LL_AE <- mxEval(objective, twinAEFit)
LRT_ACE_AE <- LL_AE - LL_ACE
Fitting CE model
twinCEModel <- mxModel(twinACEModel,
  mxMatrix("Full", 1, 1, FALSE, 0, "a", name="X"))
twinCEFit <- mxRun(twinCEModel, silent=TRUE)
exp_CE <- mxEval(rbind(expCovMZ,expCovDZ,expMean), twinCEFit)
est_CE <- mxEval(cbind(A,C,E,A/V,C/V,E/V), twinCEFit)
rownames(est_CE) <- 'CE'
LL_CE <- mxEval(objective, twinCEFit)
LRT_ACE_CE <- LL_CE - LL_ACE
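The differences LRT_ACE_AE and LRT_ACE_CE computed above are likelihood ratio statistics, assuming the objective values are -2 log likelihoods (as is the case for FIML objectives). A minimal sketch of turning such a difference into a p-value follows; the helper lrt_p() and the numeric inputs are hypothetical and serve only to illustrate the calculation.

```r
# Likelihood ratio test for a nested submodel; both arguments are
# -2 log likelihoods, the submodel having df fewer free parameters
lrt_p <- function(m2ll_sub, m2ll_full, df = 1) {
  stat <- m2ll_sub - m2ll_full
  c(chisq = stat, df = df, p = pchisq(stat, df, lower.tail = FALSE))
}

# Hypothetical -2 log likelihoods, for illustration only
lrt_p(m2ll_sub = 4070.2, m2ll_full = 4067.7, df = 1)
```

Because a variance component is tested on the boundary of its parameter space, the chi-square reference distribution with 1 degree of freedom tends to be conservative here.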
Heritability estimates
The heritability and 95% bootstrap CI estimates by model are as follows,
         mean         sd       lcl       ucl
ACE 0.6558460 0.04264874 0.5722545 0.7394376
AE  0.6579248 0.03891981 0.5816420 0.7342076
CE  0.0000000 0.00000000 0.0000000 0.0000000
E   0.0000000 0.00000000 0.0000000 0.0000000
This provides a simple estimation, although we could obtain analytical approximation in a more elaborate way.
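The reported limits appear to be normal-approximation intervals, that is, the bootstrap mean plus or minus 1.96 bootstrap standard deviations. This reading is inferred from the numbers rather than stated on the slide, and it assumes that the first two columns of values correspond to the ACE and AE models. A short check:

```r
# Bootstrap mean and SD of the heritability estimate, from the table above
boot_mean <- c(ACE = 0.6558460, AE = 0.6579248)
boot_sd   <- c(ACE = 0.04264874, AE = 0.03891981)

# Normal-approximation 95% interval: mean -/+ 1.96 * sd
ci <- cbind(mean = boot_mean, sd = boot_sd,
            lcl = boot_mean - 1.96 * boot_sd,
            ucl = boot_mean + 1.96 * boot_sd)
round(ci, 7)
```

A percentile interval taken directly from the bootstrap replicates would be an alternative when the bootstrap distribution is skewed.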
Summary
The pursuit of precise estimates of genetic versus environmental contributions to complex traits has a long history and is currently an indispensable part of genetic epidemiology and statistical genetics. The literature we focused on here is largely from the 1970s onwards, and we have come quite far in both our understanding and our implementation of procedures. The former includes simple estimation, maximum likelihood methods, path analysis and structural equation modelling, while the latter has evolved from Fortran and LISREL/Mx/MxGUI to R. Our focus is on the practical side. The new practice with OpenMx rests on the flexible and powerful R computing environment, which makes collaborative work truly possible.
Setup
As of v1.3, NCBI2R is available from CRAN, and the project's homepage has more information: http://drop.io/NCBI2R_package. As usual, the package is loaded into R as follows.
library(NCBI2R)
library(help=NCBI2R)
help.start()
The package is still under development, but most functions should work under the Windows environment.
OpenPmid(refs$PMID[1])
Annotation
This is achieved with functions available, e.g.,
ScanForGenes, ScanforSNPs
GetSNPInfo, GetGeneInfo
GetGeneInfo(#)
AnnotateDataframe(mydata, selections=c("marker","p","beta"))
AnnotateSNPList, AnnotateSNPFile
GetIDs()
GetGeneTable(#)
GetGOs(#)
GetInteractions(#)
GetPathways(#)
GetRegion("snp","4",start,end), GetRegion("gene","X",start,end)
GetPhenotypes(#)
GetSNPsInGene(#)
names(GetGeneInfo(23327))
 [1] "locusID"      "org_ref_taxname" "org_ref_commonname"
 [4] "OMIM"         "synonyms"        "genesummary"
 [7] "genename"     "phenotypes"      "pathways"
[10] "GeneLowPoint" "GeneHighPoint"   "ori"
[13] "chr"          "genesymbol"      "build"
[16] "cyto"         "approx"
GetGeneTable(4160) # positions of exons, DNA/protein Acc #
$ExonInfo
    Where Start Stop Size Set
1    Exon     1 1438 1438   1
2 CodExon   420 1418  999   1
$ACC.DNA
   Identifier Length Exons
1 NM_005912.2   1438     1
$ACC.Prot
   Identifier Length Exons
1 NP_005903.2    332     1
GetGOs
ggo
category   name                                                              evidence  pubmed    db  db_id
Function   G-protein coupled receptor activity                               IEA                 GO  4930
Function   melanocortin receptor activity                                    TAS       8794897   GO  4977
Function   protein binding                                                   IEA                 GO  5515
Function   receptor activity                                                 IEA                 GO  4872
Process    G-protein coupled receptor protein signaling pathway              IEA                 GO  7186
Process    G-protein signaling, coupled to cAMP nucleotide second messenger  TAS       8794897   GO  7188
Process    feeding behavior                                                  TAS       9771698   GO  7631
Process    insulin secretion                                                 IEA                 GO  30073
Process    regulation of bone resorption                                     IMP       16614075  GO  45780
Process    regulation of metabolic process                                   IEA                 GO  19222
Process    response to insulin stimulus                                      IEA                 GO  32868
Process    signal transduction                                               IEA                 GO  7165
Component  cytoplasm                                                         IDA       18029348  GO  5737
Component  integral to membrane                                              TAS       8392067   GO  16021
Component  plasma membrane                                                   TAS       10585465  GO  5886
GetRegion("snp","18",58038564,58040001)
"rs79390404" "rs61741819" "rs13447340" "rs13447335" "rs13447330" "rs13447325" "rs52834737"
"rs78877161" "rs76500026" "rs52820871" "rs52804924" "rs13447339" "rs13447338" "rs13447334" "rs13447333" "rs13447329" "rs13447328" "rs13447324" "rs13447323" "rs2229616" "rs1016862
GetRegion("gene","18",58038564,58040001)
"4160"
GetGeneInfo(4160)
GetSNPsInGene(4160)
GetPathways
gpath
  name
1 KEGG pathway: Neuroactive ligand-receptor interaction
2 Reactome Event:Signaling by GPCR
  web
1 http://www.genome.jp/dbgetbin/show_pathway?hsa04080+4160
2 http://www.reactome.org/cgibin/eventbrowser_st_id?ST_ID=REACT_14797
dim(oinfo) 302 19
A tutorial example
favouriteSNP <- "rs4294787"
favouriteSNPInfo <- GetSNPInfo(favouriteSNP)
pathway <- GetPathways(favouriteSNPInfo$locusID)
genes_in_pathway <- GetIDs(pathway$name)
# a loop to enable GetSNPsInGene to work with multiple genes
for (i in 1:length(genes_in_pathway)) {
  if (!(exists("biglist"))) {
    biglist <- GetSNPsInGene(genes_in_pathway[i])
  } else {
    biglist <- c(biglist, GetSNPsInGene(genes_in_pathway[i]))
  }
}
length(biglist)
165212
Side interest
# An example showing the principle as implemented in the package can be useful for obtaining other information.
keywords <- c("Professor","England")
nj <- NatureJobs(keywords,"nj",days=7)
dim(nj)
120 11
names(nj)
[1] "JobTitle" "Employer" "Location" "Posted" "Desc" "DaysAgo" "IDnumber" "BigDescription"
[9] "WebLink" "LocalLink" "ExpDate"
A summary
NCBI, especially PubMed, has been a major source for biomedical information retrieval in daily research. The process can be facilitated considerably by the R/NCBI2R package. A minor issue is that it only retrieves the latest information, whereas earlier information (e.g., build 35) is often very useful. Hence more experience needs to be gathered. NCBI2R annotates lists of SNPs and/or genes with current information from NCBI, and is designed to allow those performing the genome analysis to produce output that can easily be understood by a person not familiar with R. It is easy to anticipate that more functionality can be added following the same principle. It is helpful to keep an eye on the package development even if implementation is not necessarily a priority in our research.
Further information
We often use the UCSC genome browser (http://genome.ucsc.edu) coupled with the galaxy system (http://g2.bx.psu.edu), but the facility in R will be complementary.
Annotation databases (Lesk AM. Database Annotation in Molecular Biology-Principles and Practice. Wiley 2005)
Other packages such as gene2pathway from CRAN (Prediction of KEGG pathway membership for individual genes based on InterPro domain signatures), SubpathwayMiner (Annotation and identification of the KEGG pathways)
Packages from BioConductor (http://www.bioconductor.org), e.g., KEGGSOAP (interface to the KEGG SOAP server)
library(annotate)
library(XML)
ab <- pubmed(18454148)
buildPubMedAbst(xmlRoot(ab)[[1]])
pubmed(18454148, disp="browser")
RNCBI
library(RNCBI)
ncbi <- NCBI()
einfo <- EInfo(ncbi)
einfo <- setRequestParameter(einfo, "db", "pubmed")
V Conclusion
General comments
Driven by advances in genotyping and computational technology, the genetic analysis of complex traits is a dynamic topic in a fast-moving field. The R environment is now indispensable, with a great deal of recognition and stability. Nevertheless, there are areas which can be further advanced, e.g., graphics. A range of models available in R remains to be explored; so far they have been supplementary to the main analysis.
A great expectation
Ashley et al. Lancet 2010, 375, 1525-35 The authors assessed a patient with a family history of vascular disease and early sudden death. The analysis involved 2.6M SNPs and 752 CNVs showing increased genetic risk for MI, T2D and some cancers.
Summary
The needs of various analyses seed development in R and have much in common with many other problems involving large data, such as interactive graphics in combination with publicly available databases and the use of statistical and computational facilities available in the R system. Applications in substantive areas are the constant source of motivation in package development. The implementation is likely to be patchy but has great prospects, e.g., advanced models and causal pathways. Alternative computing environments are complementary.
References
Murrell P. R Graphics. Chapman & Hall/CRC, 2005
Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, 2005