The Carolina Environmental
Bioinformatics Research Center
Fred Wright, Ph.D.
University of North Carolina at Chapel Hill
Organization of Center
• Three major Research Projects: (1) Biostatistics,
(2) Cheminformatics, and (3) Computational
Infrastructure for Systems Toxicology
• Administrative Unit
• Public Outreach and Training Activity (POTA)
• “Functional areas” of Analysis, Methods
Development and Tools Development overseen
by a panel of experienced investigators
CEBRC Center Organization
• Director: Dr. Fred Wright
• External Science Advisory Committee
• Scientific Co-Director: Dr. Ivan Rusyn
• Administrative Co-Director: Dr. Mary Sym
• Functional areas: Analysis - Dr. Lawrence Kupper; Methods - Dr. J. Stephen Marron; Tools - Dr. Jan Prins
• Oversight: Research Integration; Translation (POTA) - Dr. Brad Hemminger; Quality Assurance - Dr. Mary Sym
• Project 1, Biostatistics: Dr. Fred Wright
• Project 2, Chem-informatics: Dr. Alex Tropsha
• Project 3, Computation and Systems Toxicology: Dr. Ivan Rusyn, Dr. David Stotts
(1) Biostatistics in
Computational Toxicology
• Emphasis on strengths in
microarray analysis,
elucidation of
networks/pathways,
Bayesian approaches
• Stresses existing
capabilities
(2) Chem-informatics
• Seeks to establish a universally applicable and robust predictive toxicology modeling framework
• Focuses on Quantitative Structure Activity/Property Relationships (QSAR)
• Establishes a modeling workflow, toxicity prediction scheme and plan for software development
[Figure 4. Predictive QSPR Modeling Workflow. Stages: Descriptor Generation (convert structures, generate descriptors); Descriptor Formatting (reformat and normalize descriptors); Build & Test Models (train and test set selection, randomize activities, QSAR algorithm, predict test set, QSAR models); Report & Visualize Results (compile results, visualize results); Screen Database (normalize descriptors, mine database, visualize hits). Programs named in the figure include dbtranslate, Babel, MolconnZ, GenAP, MolconnZ-ToDescr, SE6, RWKNN, SAPLS, KNNPredict, SAPLSPred, ModStat, DBMine, Weblab, TSAR, MOE and Spotfire; several are marked (UNC). Legend: utility functions, programs, user input.]
[Figure 5. Workflow Logic of Model Generation Process. Decision points: train/test selection, y-randomization testing, more training sets, pass or fail; results are collected and the status database is updated.]
(3) Computational
Framework for
Systems Toxicology
• Uses model for toxicity
profiling in multiple strains
of mice to set up
computational infrastructure
• Some data mining activity
• Will develop user-friendly
software tools from methods
in Projects 1 and 2
[Figure: Liver injury. Panels: Control, C57BL/10J, BUB/BnJ, MSM/Ms]
Project 1
Biostatistics in Computational Toxicology
• Fred Wright, Ph.D. (P.I.) – statistical genetics, genomic analysis
• Mayetri Gupta, Ph.D. – sequence analysis, motif detection
• Young Truong, Ph.D. – Bayesian network genetic analysis, SVM
methods for metabolomic data
• Joseph Ibrahim, Ph.D. – Bayesian analysis of microarray data
• Danyu Lin, Ph.D. – haplotype-phenotype analysis, microarray
analysis
• Fei Zou, Ph.D. – statistical genetics, genomic analysis
• Andrew Nobel, Ph.D. – clustering, data dimensional reduction,
genetic pathway analysis
• Master’s trained personnel
Project 1 objectives
• To provide statistical analysis capability
• To develop appropriate new methods to apply to
computational toxicology problems
• To develop computational tools
• To disseminate research findings, train students, and
coordinate additional statistical research in
computational toxicology
Methods (to name a few)
• Sample size estimation for high-throughput data
• P-value computation, significance testing
• Multiple-testing issues, false discovery rates
• Dose-response modeling
• New measures of differential expression
• Transcriptional regulation and motif discovery
• Network analysis, discrimination methods
• Pathway analysis
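As an illustration of the multiple-testing item above, here is a minimal sketch of the Benjamini-Hochberg adjustment, one standard way to control the false discovery rate. The slide does not say which FDR procedure the Center uses, so the choice of procedure, the function name `benjamini_hochberg`, and the toy p-values are all assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for five hypothetical genes (not data from the Center).
p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)
significant = [i for i, qi in enumerate(q) if qi <= 0.05]
print(q, significant)
```

Genes whose adjusted value falls at or below the chosen FDR level (0.05 here) are declared significant.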
Tools
• Much of the initial code has been implemented in
R/Bioconductor. This is directly useful to other
statistical investigators.
• Working with Project 3 investigators and students to
produce user-friendly web-based and/or standalone
applications
• Working to increase utility of methods by integration
with informatics and biological annotation
• We view the SAM software as a model for
independent successful dissemination. Project 3
personnel are working on implementing appropriate
procedures in ArrayTrack
Example 1. New ways of detecting
differential expression
Expression measurements show a mean-variance relationship…
Which we can exploit to reduce the false discovery rate…
[Figure: plots against Log(sample mean). Grey envelope shows all the SAM procedures; lowest curve is a new maximum likelihood procedure.]
Hu and Wright, in press
Example 2. Significant genes/pathways/categories: the
Significance Analysis of Function and Expression
procedure (honest pathway significance testing)
[Figure: genes in category, with shading indicating individually significant genes, and the category p-value.]
Barry et al. Bioinformatics 21:1943-1949, 2005
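The idea of permutation-based category testing can be sketched as follows: compute a per-gene (local) statistic, summarize it over the category (global statistic), and recompute both after permuting sample labels so that gene-gene correlation is preserved. This is only a sketch in the spirit of the procedure named above, not the published implementation; the Welch t-statistic, the rank-sum summary, the function names, and the simulated data are assumptions.

```python
import random
from statistics import mean, variance


def t_stat(values, labels):
    """Welch t-statistic comparing label-1 samples with label-0 samples."""
    a = [v for v, g in zip(values, labels) if g == 1]
    b = [v for v, g in zip(values, labels) if g == 0]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def rank_sum(local, category):
    """Sum of the ranks of |local statistic| for the genes in the category."""
    order = sorted(range(len(local)), key=lambda i: abs(local[i]))
    rank = {gene: r + 1 for r, gene in enumerate(order)}
    return sum(rank[g] for g in category)


def category_p_value(expr, labels, category, n_perm=200, seed=0):
    """Empirical p-value obtained by permuting the sample labels."""
    rng = random.Random(seed)
    observed = rank_sum([t_stat(row, labels) for row in expr], category)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat = rank_sum([t_stat(row, shuffled) for row in expr], category)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Simulated data (assumption): 20 genes, 6 controls and 6 treated samples;
# the first five genes form the category and are shifted in the treated group.
rng = random.Random(1)
labels = [0] * 6 + [1] * 6
expr = [[rng.gauss(0, 1) + (4.0 if g < 5 else 0.0) * lab for lab in labels]
        for g in range(20)]
p_cat = category_p_value(expr, labels, category=[0, 1, 2, 3, 4])
print(p_cat)
```

Because whole samples are permuted rather than genes, the null distribution reflects the correlation among genes in the category.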
GO Tree with significant nodes
Key: blue (p<0.001), green (0.001<=p<0.01), red (0.01<=p<0.1).
[Figure: GO tree; labeled nodes include nucleotide metabolism and cell organization and biogenesis.]
Example 3. Isotonic regression: gene expression dose-
response data
Model: f should be strictly increasing or decreasing
(Hu et al., 2005, Bioinformatics 21: 3524-3529).
Pyrethroid Biomarker Project (J. Harrill, K.
Crofton and colleagues, U.S. E.P.A.)
• Problem: Lack of a cost-efficient biomarker of effect
hampers assessments of the cumulative risk of
pyrethroid insecticides.
• Aim: Develop a biochemical biomarker of effect for
pyrethroids that reflects changes in neuronal firing
rates.
• Methods: Use gene arrays and RT-PCR to identify
dose-responsive transcripts in rat CNS. Permethrin
and deltamethrin were each examined at four doses
on Affymetrix arrays.
Dose-response, cont.: a statistic to rank genes…
$$M = \frac{\hat{f}(\mathrm{dose}_{\mathrm{highest}}) - \hat{f}(\mathrm{dose}_{\mathrm{lowest}})}{v}$$
where v is a standard error estimate (could be improved).
[Figure: Dose-response data on pyrethroid in rat brains, courtesy of J. Harrill and K. Crofton, U.S. E.P.A.]
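A minimal sketch of how genes might be ranked with an isotonic fit and the M statistic: the pool-adjacent-violators algorithm gives the least-squares monotone fit to the dose-group means, and M divides the fitted change from lowest to highest dose by a standard error. Pool-adjacent-violators yields a weakly monotone fit, and the slide does not define v, so the pooled-residual standard error used here, the function names, and the toy expression values are all assumptions.

```python
from statistics import mean


def pava(y, w):
    """Pool-adjacent-violators: weighted least-squares nondecreasing fit."""
    blocks = []  # each block holds [weighted mean, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge blocks while the last two violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tot, tot, n1 + n2])
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit


def m_statistic(dose_groups):
    """dose_groups: replicate lists ordered from lowest to highest dose."""
    means = [mean(g) for g in dose_groups]
    sizes = [len(g) for g in dose_groups]
    up = pava(means, sizes)
    down = [-v for v in pava([-m for m in means], sizes)]

    def rss(fit):
        return sum((x - f) ** 2 for g, f in zip(dose_groups, fit) for x in g)

    # Keep whichever direction (increasing or decreasing) fits better.
    fit = up if rss(up) <= rss(down) else down
    # ASSUMPTION: v is a pooled residual standard error for the difference
    # between the two end doses; it fails if the residuals are all zero.
    resid_var = rss(fit) / (sum(sizes) - len(dose_groups))
    v = (resid_var * (1 / sizes[0] + 1 / sizes[-1])) ** 0.5
    return (fit[-1] - fit[0]) / v


# Toy data (assumption): two replicates at each of four doses.
genes = {
    "gene_up": [[1.0, 1.2], [1.1, 1.5], [2.0, 2.2], [3.0, 3.1]],
    "gene_flat": [[2.0, 2.4], [2.3, 1.9], [2.1, 2.2], [2.2, 2.0]],
    "gene_down": [[3.0, 3.3], [2.5, 2.4], [2.6, 2.2], [1.0, 1.2]],
}
m_values = {name: m_statistic(groups) for name, groups in genes.items()}
ranked = sorted(m_values, key=lambda name: abs(m_values[name]), reverse=True)
print(ranked)
```

Ranking by |M| puts strongly dose-responsive genes first, whether expression rises or falls with dose.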
Project 2
Chem-informatics
• Alex Tropsha, Ph.D. (P.I.) – computational chemistry, QSAR
• Weifan Zheng, Ph.D. – computational methods in drug discovery,
QSAR
• Alexander Golbraikh, Ph.D. – mathematical approaches in QSAR
development
• Yufeng Liu, Ph.D. – Support vector machines, semi-supervised
machine learning
• additional personnel
Project 2 objectives
• To develop an innovative QSPR modeling workflow
based on the principles of combinatorial QSPR
modeling, model validation and consensus prediction
• To develop toxicity predictors using the workflow
• To integrate modeling tools and endpoint predictors
using workflow design middleware and workflow
deployment in a predictive toxicology web portal
• To apply these to toxicology datasets
• Project 2 builds on years of research in the Tropsha lab
on QSAR/QSPR modeling and developing robust
predictors
• Many of the machine learning and cross-validation ideas
are used in statistical genomics
• Descriptors – topological molecular indices, size and
shape, hydrophilic/hydrophobic indices, physical properties,
etc.
• Try to predict biological activity
• Analysis of the Carcinogenic Potency Database
(collaboration with Dr. A. Richard, EPA) has been
performed, applied to 693 compounds, with
classification kNN QSAR prediction accuracies
estimated at 85%-90%.
[Figure: Original Dataset → Split into Training and Test Sets → Multiple Training Sets and Multiple Test Sets → Combi-QSAR Modeling, with Y-Randomization → Activity Prediction → Validated Predictive Models with High Internal & External Accuracy → Database Screening. Only accept models that have q2 > 0.6, R2 > 0.6, etc.]
Flowchart of predictive toxicology framework based on
validated combi-QSAR models. Numerous public datasets
proposed.
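The acceptance logic in the flowchart can be sketched with a small k-nearest-neighbor model: compute a leave-one-out q2 on the training set, an external R2 on the test set, accept the model only if both exceed 0.6, and check that a y-randomized model fails. This is only a sketch under stated assumptions, not the Tropsha lab's software; the plain kNN regressor, the use of the coefficient of determination for R2, the function names, and the synthetic one-descriptor dataset are all hypothetical.

```python
import random


def knn_predict(train_x, train_y, query, k=3):
    """Average activity of the k training compounds nearest to the query."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    return sum(train_y[i] for i in nearest) / k


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def loo_q2(x, y, k=3):
    """Leave-one-out cross-validated R2 on the training set."""
    preds = [knn_predict(x[:i] + x[i + 1:], y[:i] + y[i + 1:], x[i], k)
             for i in range(len(x))]
    return r_squared(y, preds)


def validate(train_x, train_y, test_x, test_y, k=3, threshold=0.6):
    """Return (q2, external R2, accepted) for one train/test split."""
    q2 = loo_q2(train_x, train_y, k)
    preds = [knn_predict(train_x, train_y, q, k) for q in test_x]
    r2 = r_squared(test_y, preds)
    return q2, r2, (q2 > threshold and r2 > threshold)


# Synthetic data (assumption): one descriptor, activity roughly linear in it.
n = 50
x = [[i / (n - 1)] for i in range(n)]
y = [3.0 * xi[0] + 0.05 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
idx = list(range(n))
random.Random(0).shuffle(idx)
train, test = idx[:40], idx[40:]
train_x, train_y = [x[i] for i in train], [y[i] for i in train]
test_x, test_y = [x[i] for i in test], [y[i] for i in test]

q2, r2, accepted = validate(train_x, train_y, test_x, test_y)

# Y-randomization: shuffle training activities; a sound workflow should
# find no predictive model on the scrambled data.
shuffled_y = train_y[:]
random.Random(1).shuffle(shuffled_y)
q2_random = loo_q2(train_x, shuffled_y)
print(q2, r2, accepted, q2_random)
```

In a combinatorial workflow this check would be repeated over many train/test splits and descriptor/algorithm combinations, keeping only the models that pass.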
Project 3
Computational Infrastructure for Systems Toxicology
• David Stotts, Ph.D. (co-P.I.) – computer science, software
engineering
• Ivan Rusyn, Ph.D. (co-P.I.) – toxicology, genomics
• Wei Wang, Ph.D. – computer science, data mining
• David Threadgill, Ph.D. – mammalian genetics, genomics
• Additional programmers and students
Project 3 objectives
• Develop and implement algorithms that streamline
analysis of multi-dimensional data streams.
• Facilitate the development of a standard workflow for
(i) analysis of -omics data, (ii) linkages to classical
indicators of adverse health effects, and (iii) integration
with other types of biological information such as
sequence and cross-species comparisons.
• Implement user-friendly software for approaches from
Projects 1-3
A driving biological problem:
• Toxicogenetic analysis of susceptibility to
toxicant-induced organ injury
• The model being used by Drs. Threadgill and
Rusyn involves extensive profiling of numerous
mouse strains (over 40) for relevant organs
• Early data on acetaminophen and alcohol on liver
• Proposals for trichloroethylene and other toxicants
on liver, kidney, and other organs
The Mouse as a Model for Studying
Genotype-Phenotype Interactions
Image courtesy of D.W. Threadgill
A large variation in response by
genetic background…
Strain-specific susceptibility to
acetaminophen (APAP)-induced
liver injury. Serum ALT levels
(top panel) and tissue
histopathological changes
(bottom panel) were assessed
24 hrs after a single dose
exposure to APAP (300
mg/kg, i.g., 24 hrs).
Toxicological and expression analysis of genotype-specific responses to
ethanol in liver. Serum and liver tissues were collected from mice of 6
different strains after acute (5 g/kg, 6 hrs; A) or subchronic (4 weeks, B)
treatment with ethanol.
Systems Biology Approach
Examination of genetic networks that regulate
gene expression in liver (webQTL and beyond)
Transcriptome map for the murine brain and liver.
Source: Ivan Rusyn and colleagues
Source: Ivan Rusyn and colleagues
Correlation between gene expression of CYP2C29 and several liver-
specific phenotypes recorded for BXD strains.
Development of new methods for -omics data analysis:
Finding associations between gene expression profiles and
strain-specific genotyping data
SiZer smoothing approach
• Data analysis procedures in concert with Project 1,
including principal component analyses, distance-weighted
discrimination, SAFE, etc.
• Specific data mining approaches also proposed, such
as subspace clustering (SNPs vs. phenotypes, gene
expression), that fall outside of the typical statistical
framework
• The computational challenges are immense when we
compare different -omics platforms (e.g., 100,000
SNPs × 30,000 transcripts)
• This requires serious computer science (activities of
UNC SNP group).
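To give a sense of scale, 100,000 SNPs × 30,000 transcripts is 3 billion pairwise tests, about 24 GB if every result were stored as an 8-byte number. One common way around this, sketched below, is to scan the SNPs in blocks and keep only the strongest associations. The block size, the use of Pearson correlation, the function names, and the toy genotype and expression values are assumptions, not the UNC SNP group's method.

```python
import heapq
from math import sqrt

# Size of the full problem quoted on the slide.
n_pairs = 100_000 * 30_000
gigabytes_if_stored = n_pairs * 8 / 1e9  # 8 bytes per stored statistic


def standardize(v):
    """Center and scale so that a dot product of two vectors is Pearson r."""
    m = sum(v) / len(v)
    norm = sqrt(sum((x - m) ** 2 for x in v))  # fails for constant vectors
    return [(x - m) / norm for x in v]


def top_associations(snps, transcripts, block_size=2, keep=3):
    """Scan SNP blocks against all transcripts, keeping the largest |r|."""
    z_tx = [standardize(t) for t in transcripts]
    heap = []  # min-heap of (|r|, SNP index, transcript index)
    for start in range(0, len(snps), block_size):
        block = [standardize(s) for s in snps[start:start + block_size]]
        for offset, zs in enumerate(block):
            for j, zt in enumerate(z_tx):
                r = sum(a * b for a, b in zip(zs, zt))
                item = (abs(r), start + offset, j)
                if len(heap) < keep:
                    heapq.heappush(heap, item)
                else:
                    heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True)


# Toy data (assumption): 4 SNPs coded 0/1/2 and 2 transcripts in 6 strains.
snps = [
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],
    [2, 2, 1, 1, 0, 0],
    [0, 2, 0, 2, 1, 1],
]
transcripts = [
    [1.0, 1.1, 2.0, 2.1, 3.0, 3.2],
    [5.0, 4.0, 5.0, 4.0, 5.0, 4.0],
]
top = top_associations(snps, transcripts)
print(n_pairs, gigabytes_if_stored, top)
```

Only one block of SNPs and a short list of top hits are held in memory at any time, so memory use no longer grows with the number of pairs.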
Collaborations with the NCCT and NJ ebCTC
• Several collaborative projects identified and ongoing
• Several graduate students working on projects at
NCCT or shared advising with CEBRC members
• Implementation of ArrayTrack extensions underway
in Project 3
• Plan for deliberate expansion of collaborations with
NCCT
• The ebCTC is highly complementary to our Center.
Coordination with ebTrack/ArrayTrack development
is one key area of interchange.