Docker in Open Science
Data Analysis Challenges
Bruce Hoff
Principal Software Engineer,
Sage Bionetworks
Agenda
Open Science in Disease Research
Containerization as a tool for scientific reproducibility
Case Study: Docker in the 2015 ALS Stratification Challenge
Case Study: Docker in the 2016 Digital Mammography Challenge
Open Issues and Lessons Learned
This talk is about saving lives.
Disease research is data intensive…
… but published analyses often aren’t reproducible.
… and valuable data sets aren’t shared freely.
… which reduces the rate of progress.
Difficulties in science validation
 Amgen scientists tried to confirm 53 landmark papers in pre-clinical
oncology research: only 6 (11%) were confirmed.[1]
 Bayer HealthCare reported that only about 25% of published
preclinical studies could be validated.[2]
 Potti Gate: genomics research at Duke during 2006-2010 led to the
identification of diagnostic signatures that spurred clinical trials. The
research was later deemed statistically flawed and the clinical trials
were stopped.
[1] C. Glenn Begley and Lee M. Ellis, Nature 483, 531 (2012)
[2] Prinz, F., Schlange, T. & Asadullah, K., Nature Rev. Drug Discov. 10, 712 (2011)
Our Solution: Open Data Analysis Challenges
 Engage the community, rather than a select company or
lab, to solve a problem in biological/medicinal research.
 Obtain and expose a high-value data set that would
otherwise be accessible to only a few.
 Require that participants share their code and document
their algorithms; test for reproducibility.
[Chart: number of submitting teams and unique final submissions]
Measures of Impact
• 32 scientific challenges
• 50 partner institutions (since 2006)
• >5000 registered users
• 10 international conferences
• 2500 conference attendees
• >100 publications using DREAM data
• 25 journal articles
• 3 journal special issues
• 2 edited books
• 1,300 citations
• 20 PhD theses
• Use of challenges in the classroom as problem sets
The Organization
Dialogue for Reverse Engineering Assessment and Methods
(DREAM) is a crowdsourcing effort that poses quantitative
challenges about systems biology modeling.
Sage Bionetworks (2009-) is a nonprofit biomedical research
organization seeking to accelerate biomedical research through
open systems, incentives, and standards.
The two organizations merged in 2013 to drive a continuing
series of open science challenges.
Synapse: enabling collaborative research
• Web services that facilitate collaborative web science
– Projects for sharing resources (code, files, ideas)
– Wiki narratives
• Analysis provenance - linking data, code, and results; data
versioning
• Web services that facilitate Challenge logistics
– Registration, acceptance of data usage, acceptance of Challenge Terms and Conditions
– Real-time challenge leaderboards
– Discussion forums
– Formation of teams
– Online supplement for Challenge papers, e.g.:
https://www.synapse.org/#!Synapse:syn2528824/wiki/
2015 ALS Challenge
a case study in using Docker
in a DREAM challenge
ALS is a rapidly progressing neurodegenerative disease that typically leads
to death within 3-5 years but for which disease progression is heterogeneous
across the patient population.
Data for 9,000 ALS patients were provided by the Pooled Resource Open-Access
ALS Clinical Trials (PRO-ACT) database.
The challenge was to predict disease progression from clinical data.
$28,000 in prize money raised through a grass-roots fund drive
https://www.indiegogo.com/projects/fund-the-prize-solve-als-together
Nature Biotechnology agreed to publish the results.
In a typical challenge…
• Data is partitioned into
– training
– leaderboard
– validation
• Participants
– download training data
– apply statistical learning methods
– submit predictions
Organizers want to constrain submitted models to work in a certain
way:
• Model has a ‘selector’ component to select predictive clinical features
• Model has a ‘predictor’ component to predict ALS outcome based on
selected features.
Organizers want to run each model themselves to:
- Ensure models are structured as prescribed
- Ensure reproducibility of output
Docker to the rescue!
[Model pipeline diagram: Clinical Data → Selector → Selected Features → Predictor → Model Output]
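To make the prescribed two-stage structure concrete, here is a minimal sketch of how organizers might run a submitted container in two passes. The entry-point names (select.sh, predict.sh), the image name, and the mount paths are illustrative assumptions, not the challenge's actual interface.

```bash
# Hypothetical sketch only: entry points, image name, and paths are illustrative.

# Stage 1: the 'selector' reads the clinical data and writes the selected features.
docker run --rm \
  -v /data/clinical:/input:ro \
  -v /scratch/features:/output \
  submitted-model /model/select.sh /input /output/selected_features.csv

# Stage 2: the 'predictor' reads the selected features and writes ALS outcome predictions.
docker run --rm \
  -v /scratch/features:/input:ro \
  -v /scratch/predictions:/output \
  submitted-model /model/predict.sh /input/selected_features.csv /output/predictions.csv
```

Running each stage as a separate container invocation lets organizers verify that the model really is structured as prescribed and that its output can be regenerated on demand.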
The ‘Stone Soup’ of Open Challenges
[Diagram: the ingredients contributed by different parties - scientific leadership, a
high-value data set, IT resources, prize money, a high-visibility publication, and
community participation]
IBM Donates a Mainframe for the ALS Challenge
IBM Cloud with a zEC12 system virtual machine running a Linux server with 32
processors, 240 GB of memory, and 9 TB of storage.
Using Docker with a Mainframe
Provision a container on a unique port for each participant. They log in as:
> ssh user_name@129.34.20.96 -p port_number
Provide a script that sends a “signal” to a process running Docker:
> create_model_snapshot
The back-end process runs “docker commit” to create a copy of the model for
scoring.
The back-end then reruns the captured image as a new container, after mounting the
leaderboard (or later, validation) data volume.
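A rough sketch of what the back-end snapshot-and-rerun steps could look like; the container name, image tag, and mount paths below are placeholders, not the actual scripts used in the challenge.

```bash
# Illustrative only: container name, image tag, and paths are placeholders.

# When a participant runs create_model_snapshot, the back-end commits their
# running container to a new image for scoring.
docker commit participant_container als/participant_model:snapshot1

# The captured image is then rerun as a fresh container with the leaderboard
# (or, later, validation) data mounted read-only.
docker run --rm \
  -v /data/leaderboard:/data:ro \
  -v /scratch/participant_output:/output \
  als/participant_model:snapshot1
```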
2016 Digital Mammography
Challenge
a case study in using Docker
in a DREAM challenge
• The Scientific Question: How can we reduce erroneous recall
rate (false positives)?
• Image analysis machine learning problem
• “Deep learning” algorithms expected
• $1.2M in prize money expected to attract 100s of serious
participants
• 600,000 mammography images donated (~20TB)
• Budget for 100s of GPU servers from two Cloud providers
(AWS, IBM)
Why use Docker?
1) Large data size
2) Sensitive data
3) Provisioned compute
[Participant workflow:]
(1) Allocate a machine (e.g. own laptop)
(2) Retrieve the base image
(3) Retrieve the small, pilot dataset
(4) Create a model
(5) Verify the model using the pilot dataset
(6) Push the Dockerized model to the registry
(7) Submit the model to the Challenge
(8) Receive the trained model and score
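As a concrete illustration of steps (2) through (7), the participant-side commands might look roughly like this; the registry host, project ID (syn123), image names, and paths are assumptions, not the challenge's published instructions.

```bash
# Illustrative commands only; registry host, project ID, and image names are assumptions.

# (2) Retrieve the challenge-provided base image.
docker pull docker.synapse.org/syn123/dm-challenge-base

# (4) Build a model image on top of it (a Dockerfile that starts FROM the base image).
docker build -t docker.synapse.org/syn123/my-dm-model:v1 .

# (5) Verify the model locally against the small pilot dataset.
docker run --rm -v /path/to/pilot-data:/trainingData:ro docker.synapse.org/syn123/my-dm-model:v1

# (6) Push the Dockerized model to the registry; then (7) submit it to the
# Challenge through Synapse.
docker push docker.synapse.org/syn123/my-dm-model:v1
```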
Submission queue built into Synapse
[Organizer workflow:]
(1) Retrieve new submissions
(2) Retrieve the Docker image
(3) Train / score the model
(4) Save the trained model and score
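A hedged sketch of the organizer-side loop that such a submission queue could drive; get_new_submissions and record_score are hypothetical helper commands standing in for the real Synapse integration, and the mount paths are placeholders.

```bash
# Hypothetical scoring-harness loop; get_new_submissions and record_score are
# placeholder helpers, not part of the actual Synapse tooling.
while true; do
  for image in $(get_new_submissions); do       # (1) retrieve new submissions
    workdir=$(mktemp -d /scratch/submission.XXXXXX)
    docker pull "$image"                        # (2) retrieve the Docker image
    docker run --rm \
      -v /data/images:/trainingData:ro \
      -v "$workdir":/modelState \
      "$image"                                  # (3) train / score the model where the data lives
    record_score "$image" "$workdir"            # (4) save the trained model and its score
  done
  sleep 60
done
```

The key design point is that the images and GPUs stay behind the organizers' firewall; only the model container and its score move in and out.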
Outcome
• We’ve implemented the data donor’s wish to maintain control of
the data.
• We have obviated the need to download the large data set.
• We have democratized participation, making compute available
to those who might not otherwise have it.
• After the challenge we have a library of rerunnable models,
ensuring reproducibility.
Open questions
• How best to monitor a fleet of Docker hosts (incl. GPU usage)?
• How reproducible are models run on different GPU machines?
How much of the software stack should be in the container?
• How shall we limit submitted jobs?
• Are there networking issues as models access data?
• What are the security issues when running submitted
containers?
/etc
• Images aren't always portable: System z images can't be used
on Intel-based hardware.
• Reproducibility doesn't mean comprehensibility.
• Find out about all our challenges at www.synapse.org
• For those of you down in the trenches, see brucehoff/dockerauth
for an example of how to do registry delegated authorization in
Java.
Acknowledgements
 Sage Bionetworks
 Stephen Friend
 Thea Norman
 Lara Mangravite
 Mike Kellen
 Mette Peters
 Arno Klein
 Solly Sieberts
 Abhi Pratap
 Chris Bare
 Bruce Hoff
 IBM
 Erhan Bilal
 Kely Norel
 Elise Blaese
 Pablo Meyer Rojas
 Kahn Rrhissorrakrai
 EBI
 Julio Saez Rodriguez
 Thomas Cokelaer
 Federica Eduati
 Michael Menden
 L. Maximilians University
 Robert Kueffner
 Univ Colorado, Denver
 Jim Costello
 OHSU
 Joe Gray
 Adam Margolin
 Mehmet Gonen
 Laura Heiser
 Prize4Life
 Melanie Leitner
 Neta Zach
 NCI
 Dinah Singer
 Dan Gallahan
 ISMMS
 Eli Stahl
 Gaurav Pandey
 Columbia University
 Andrea Califano
 Mukesh Bansal
 Chuck Karan
 Rice University
 Amina Qutub
 David Noren
 Byron Long
 MD Anderson
 Steven Kornblau
 Univ of Lausanne
 Daniel Marbach
 Broad Institute
 Bill Hahn
 Barbara Weir
 Aviad Tsherniak
 Merck
 Robert Plenge
 BYU
 Keoni Kauwe
 OICR
 Paul Boutros
 UCSC
 Josh Stuart
Thank you!
Challenge Assisted Peer Review Partners
• Science Translational Medicine (1 paper)
• Nature Biotechnology (4 papers)
• Nature Genetics (papers in preparation)
• Nature Methods (papers in preparation)
• Nature Neuroscience (papers in preparation)
• PLoS Computational Biology (papers in review and preparation)
• National Cancer Institute (contracts for Best Performers)
What are the DREAM Challenges?
 A crowdsourcing effort that poses quantitative challenges about systems
biology modeling and data analysis on:
 Transcriptional and signaling networks,
 Predictions of response to perturbations,
 Translational research (tox, RA, AD, ALS, AML, …)
 Our mission is:
 to contribute to the solution of important biomedical problems
 to foster collaboration between research groups
 to democratize data
 to accelerate research
 to objectively assess algorithm performance
Peer review is subjective. But even if it were not, what comes to the
reviewers may be biased:
 Bias against publication of negative results or results contrary to
published results
 The incentive structure puts researchers under considerable pressure to try
until they find a positive result (multiple testing, over-fitting, etc.)
Dani Brunner et al., Behavioral
processes 89, 187-195 (2012)
Inflated Statistical Significance
Multiple Testing
Selective Reporting
Overfitting
Benefits of crowd-sourcing
• Performance Evaluation
– Unbiased, consistent, and rigorous method assessment
– Unbiased comparison and discovery of best methods
– Determine the solvability of a scientific question
• Sampling of the space of methods
– Understand the diversity of methodologies presently being
used to solve a problem
Benefits of crowd-sourcing
• Acceleration of Research
– The community of participants can do in 4 months what would take any
single group 10 years
• Community Building
– Make high quality, well-annotated data accessible
– Foster community collaborations on fundamental research questions
– Determine robust solutions through community consensus: “The Wisdom
of Crowds”
The challenge of reproducibility
• Disease research is data intensive. A typical researcher has a PhD in
multivariate statistics and does a lot of programming in languages like R,
Python, and Matlab, using libraries of established tools.
• So these analyses are software stacks of a sort, each piece having the
typical series of revisions.
• This makes reproducibility really challenging: to reproduce an analysis
you need not only the original data and the statistical processing script
written by the author, but also the correct versions of all the dependencies.
• Containerization therefore offers a powerful tool for reproducibility: the
entire software stack used in an analysis can be tracked.
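As a minimal illustration of that last point, a container image can pin the interpreter, the library versions, and the analysis script together so the whole stack travels with the result. The version numbers, package names, and file names below are invented for the example.

```bash
# Illustrative only: versions, packages, and file names are made up; the point is
# that the entire software stack of the analysis is captured in one image.
cat > Dockerfile <<'EOF'
FROM r-base:3.2.3
RUN R -e "install.packages(c('glmnet','survival'), repos='https://cloud.r-project.org')"
COPY analysis.R /analysis/analysis.R
CMD ["Rscript", "/analysis/analysis.R"]
EOF

# Anyone with this image and the original data can rerun the exact analysis.
docker build -t published-analysis:v1 .
docker run --rm -v /path/to/original-data:/data:ro published-analysis:v1
```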


Editor's Notes

  • #4 Think in terms of mining data sets incorporating complete genomic profiles from thousands of subjects. Today someone working in disease research may have a PhD in statistics and never see a wet lab.
  • #10 Synapse provides a layer of web services that allow researchers to easily record and collaborate on their research (as widely or narrowly as they choose) in real-time and across institutional boundaries. These services include not only the Synapse web portal, but also programmatic clients which talk to the same web services. By leveraging Synapse provenance services, analysts are able to provide an analysis trail of the data, code, and results associated with a research project. This helps all involved in the project to clearly see what has been done, and by whom. By operating the Synapse platform and its services free of charge as a service to the scientific community, Sage Bionetworks hopes to catalyze new collaborations as well as exciting and reproducible scientific discoveries. Then maybe just mention that Brian Bot, Chris Bare, and Thea Norman are all at the meeting and would be happy to talk to anyone interested, and that they can stop by our poster.
  • #32 For reproduced findings, authors had paid close attention to controls, reagents, investigator bias, and describing the complete data set. For non-reproduced findings, data were not routinely analyzed by investigators blinded to the experimental versus control groups, there were no guidelines to report all data, etc. In the Bayer study, 70% of the studies analyzed were on cancer research.