Docker in Open Science
Data Analysis Challenges
Bruce Hoff
Principal Software Engineer,
Sage Bionetworks
Agenda
Open Science in Disease Research
Containerization as a tool for scientific reproducibility
Case Study: Docker in the 2015 ALS Stratification Challenge
Case Study: Docker in the 2016 Digital Mammography Challenge
Open Issues and Lessons Learned
This talk is about saving lives.
Disease research is data intensive…
… but published analyses often aren’t reproducible.
… and valuable data sets aren’t shared freely.
… which reduces the rate of progress.
Difficulties in science validation
 Amgen scientists tried to confirm 53 landmark papers in pre-clinical
oncology research: only 6 (11%) were confirmed.[1]
 Bayer HealthCare reported that only about 25% of published
preclinical studies could be validated.[2]
 Potti Gate: genomics research at Duke during 2006-2010 led to the
identification of diagnostic signatures that spurred clinical trials. The
research was later deemed statistically flawed and the clinical trials
were stopped.
[1] C. Glenn Begley and Lee M. Ellis, Nature 483, 531 (2012)
[2] Prinz, F., Schlange, T. & Asadullah, K., Nature Rev. Drug Discov. 10, 712 (2011)
Our Solution: Open Data Analysis Challenges
 Engage the community, rather than a select company or
lab, to solve a problem in biological/medicinal research.
 Obtain and expose a high-value data set that would
otherwise be accessible to only a few.
 Require that participants share their code and document
their algorithms; test for reproducibility.
[Chart: number of submitting teams and unique final submissions]
Measures of Impact
• 32 scientific challenges
• 50 partner institutions (since 2006)
• >5000 registered users
• 10 international conferences
• 2500 conference attendees
• >100 publications using DREAM data
• 25 journal articles
• 3 journal special issues
• 2 edited books
• 1,300 citations
• 20 PhD theses
• Use of challenges in the classroom as problem sets
The Organization
Dialogue for Reverse Engineering Assessment and Methods
(DREAM) is a crowdsourcing effort that poses quantitative
challenges about systems biology modeling.
Sage Bionetworks (2009-) is a nonprofit biomedical research
organization seeking to accelerate biomedical research through
open systems, incentives, and standards.
The two organizations merged in 2013 to drive a continuing
series of open science challenges.
Synapse: enabling collaborative research
• Web services that facilitate collaborative web science
– Projects for sharing resources (code, files, ideas)
– Wiki narratives
• Analysis provenance - linking data, code, and results; data
versioning
• Web services that facilitate Challenge logistics
– Registration, acceptance of data usage, acceptance of Challenge Terms and Conditions
– Real-time challenge leaderboards
– Discussion forums
– Formation of teams
– Online supplement for Challenge papers, e.g.:
https://www.synapse.org/#!Synapse:syn2528824/wiki/
2015 ALS Challenge
a case study in using Docker
in a DREAM challenge
ALS is a rapidly progressing neurodegenerative disease that typically leads
to death within 3-5 years but for which disease progression is heterogeneous
across the patient population.
Data for 9,000 ALS patients were provided by the Pooled Resource Open-Access
ALS Clinical Trials (PRO-ACT) database.
The challenge was to predict disease progression from clinical data.
$28,000 in prize money raised through a grass-roots fund drive
https://www.indiegogo.com/projects/fund-the-prize-solve-als-together
Nature Biotechnology agreed to publish the results.
In a typical challenge…
• Data is partitioned into
– training
– leaderboard
– validation
• Participants
– download training data
– apply statistical learning methods
– submit predictions
Organizers want to constrain submitted models to work in a certain
way:
• Model has a ‘selector’ component to select predictive clinical features
• Model has a ‘predictor’ component to predict ALS outcome based on
selected features.
Organizers want to run each model themselves to:
- Ensure models are structured as prescribed
- Ensure reproducibility of output
Docker to the rescue!
[Model pipeline diagram: Clinical Data → Selector → Selected Features → Predictor → Model Output]
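To make the prescribed two-stage structure concrete, here is a minimal sketch of how organizers might run a submitted container in two passes. The entry-point names (select.sh, predict.sh), the image name, and the mount paths are illustrative assumptions, not the challenge's actual interface.

```bash
# Hypothetical sketch only: entry points, image name, and paths are illustrative.

# Stage 1: the 'selector' reads the clinical data and writes the selected features.
docker run --rm \
  -v /data/clinical:/input:ro \
  -v /scratch/features:/output \
  submitted-model /model/select.sh /input /output/selected_features.csv

# Stage 2: the 'predictor' reads the selected features and writes ALS outcome predictions.
docker run --rm \
  -v /scratch/features:/input:ro \
  -v /scratch/predictions:/output \
  submitted-model /model/predict.sh /input/selected_features.csv /output/predictions.csv
```

Running each stage as a separate container invocation lets organizers verify that the model really is structured as prescribed and that its output can be regenerated on demand.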
The ‘Stone Soup’ of Open Challenges
[Diagram: the ingredients contributed by different parties - scientific leadership, a
high-value data set, IT resources, prize money, a high-visibility publication, and
community participation]
IBM Donates a Mainframe for the ALS Challenge
IBM Cloud with a zEC12 system virtual machine running a Linux server with 32
processors, 240 GB of memory, and 9 TB of storage.
Using Docker with a Mainframe
Provision a container on a unique port for each participant. They log in as:
> ssh user_name@129.34.20.96 -p port_number
Provide a script that sends a “signal” to a process running Docker:
> create_model_snapshot
The back-end process runs “docker commit” to create a copy of the model for
scoring.
The back-end then reruns the captured image as a new container, after mounting the
leaderboard (or later, validation) data volume.
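A rough sketch of what the back-end snapshot-and-rerun steps could look like; the container name, image tag, and mount paths below are placeholders, not the actual scripts used in the challenge.

```bash
# Illustrative only: container name, image tag, and paths are placeholders.

# When a participant runs create_model_snapshot, the back-end commits their
# running container to a new image for scoring.
docker commit participant_container als/participant_model:snapshot1

# The captured image is then rerun as a fresh container with the leaderboard
# (or, later, validation) data mounted read-only.
docker run --rm \
  -v /data/leaderboard:/data:ro \
  -v /scratch/participant_output:/output \
  als/participant_model:snapshot1
```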
2016 Digital Mammography
Challenge
a case study in using Docker
in a DREAM challenge
• The Scientific Question: How can we reduce erroneous recall
rate (false positives)?
• Image analysis machine learning problem
• “Deep learning” algorithms expected
• $1.2M in prize money expected to attract 100s of serious
participants
• 600,000 mammography images donated (~20TB)
• Budget for 100s of GPU servers from two Cloud providers
(AWS, IBM)
Why use Docker?
1) Large data size
2) Sensitive data
3) Provisioned compute
[Participant workflow:]
(1) Allocate a machine (e.g. own laptop)
(2) Retrieve the base image
(3) Retrieve the small, pilot dataset
(4) Create a model
(5) Verify the model using the pilot dataset
(6) Push the Dockerized model to the registry
(7) Submit the model to the Challenge
(8) Receive the trained model and score
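As a concrete illustration of steps (2) through (7), the participant-side commands might look roughly like this; the registry host, project ID (syn123), image names, and paths are assumptions, not the challenge's published instructions.

```bash
# Illustrative commands only; registry host, project ID, and image names are assumptions.

# (2) Retrieve the challenge-provided base image.
docker pull docker.synapse.org/syn123/dm-challenge-base

# (4) Build a model image on top of it (a Dockerfile that starts FROM the base image).
docker build -t docker.synapse.org/syn123/my-dm-model:v1 .

# (5) Verify the model locally against the small pilot dataset.
docker run --rm -v /path/to/pilot-data:/trainingData:ro docker.synapse.org/syn123/my-dm-model:v1

# (6) Push the Dockerized model to the registry; then (7) submit it to the
# Challenge through Synapse.
docker push docker.synapse.org/syn123/my-dm-model:v1
```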
Submission queue built into Synapse
[Organizer workflow:]
(1) Retrieve new submissions
(2) Retrieve the Docker image
(3) Train / score the model
(4) Save the trained model and score
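A hedged sketch of the organizer-side loop that such a submission queue could drive; get_new_submissions and record_score are hypothetical helper commands standing in for the real Synapse integration, and the mount paths are placeholders.

```bash
# Hypothetical scoring-harness loop; get_new_submissions and record_score are
# placeholder helpers, not part of the actual Synapse tooling.
while true; do
  for image in $(get_new_submissions); do       # (1) retrieve new submissions
    workdir=$(mktemp -d /scratch/submission.XXXXXX)
    docker pull "$image"                        # (2) retrieve the Docker image
    docker run --rm \
      -v /data/images:/trainingData:ro \
      -v "$workdir":/modelState \
      "$image"                                  # (3) train / score the model where the data lives
    record_score "$image" "$workdir"            # (4) save the trained model and its score
  done
  sleep 60
done
```

The key design point is that the images and GPUs stay behind the organizers' firewall; only the model container and its score move in and out.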
Outcome
• We’ve implemented the data donor’s wish to maintain control of
the data.
• We have obviated the need to download the large data set.
• We have democratized participation, making compute available
to those who might not otherwise have it.
• After the challenge we have a library of rerunnable models,
ensuring reproducibility.
Open questions
• How best to monitor a fleet of Docker hosts (incl. GPU usage)?
• How reproducible are models run on different GPU machines?
How much of the software stack should be in the container?
• How shall we limit submitted jobs?
• Are there networking issues as models access data?
• What are the security issues when running submitted
containers?
/etc
• Images aren't always portable: System z images can't be used
on Intel-based hardware.
• Reproducibility doesn't mean comprehensibility.
• Find out about all our challenges at www.synapse.org
• For those of you down in the trenches, see brucehoff/dockerauth
for an example of how to do registry delegated authorization in
Java.
Acknowledgements
 Sage Bionetworks
 Stephen Friend
 Thea Norman
 Lara Mangravite
 Mike Kellen
 Mette Peters
 Arno Klein
 Solly Sieberts
 Abhi Pratap
 Chris Bare
 Bruce Hoff
 IBM
 Erhan Bilal
 Kely Norel
 Elise Blaese
 Pablo Meyer Rojas
 Kahn Rrhissorrakrai
 EBI
 Julio Saez Rodriguez
 Thomas Cokelaer
 Federica Eduati
 Michael Menden
 L. Maximilians University
 Robert Kueffner
 Univ Colorado, Denver
 Jim Costello
 OHSU
 Joe Gray
 Adam Margolin
 Mehmet Gonen
 Laura Heiser
 Prize4Life
 Melanie Leitner
 Neta Zach
 NCI
 Dinah Singer
 Dan Gallahan
 ISMMS
 Eli Stahl
 Gaurav Pandey
 Columbia University
 Andrea Califano
 Mukesh Bansal
 Chuck Karan
 Rice University
 Amina Qutub
 David Noren
 Byron Long
 MD Anderson
 Steven Kornblau
 Univ of Lausanne
 Daniel Marbach
 Broad Institute
 Bill Hahn
 Barbara Weir
 Aviad Tsherniak
 Merck
 Robert Plenge
 BYU
 Keoni Kauwe
 OICR
 Paul Boutros
 UCSC
 Josh Stuart
Thank you!
Challenge Assisted Peer Review Partners
• Science Translational Medicine (1 paper)
• Nature Biotechnology (4 papers)
• Nature Genetics (papers in preparation)
• Nature Methods (papers in preparation)
• Nature Neuroscience (papers in preparation)
• PLoS Computational Biology (papers in review and preparation)
• National Cancer Institute (contracts for Best Performers)
What are the DREAM Challenges?
 A crowdsourcing effort that poses quantitative challenges about systems
biology modeling and data analysis on:
 Transcriptional and signaling networks,
 Predictions of response to perturbations,
 Translational research (tox, RA, AD, ALS, AML, …)
 Our mission is:
 to contribute to the solution of important biomedical problems
 to foster collaboration between research groups
 to democratize data
 to accelerate research
 to objectively assess algorithm performance
Peer review is subjective. But even if it were not, what comes to the
reviewers may be biased:
 Bias against publication of negative results or results contrary to
published results
 The incentive structure puts researchers under considerable pressure to try
until they find a positive result (multiple testing, over-fitting, etc.)
Dani Brunner et al., Behavioral
processes 89, 187-195 (2012)
Inflated Statistical Significance
Multiple Testing
Selective Reporting
Overfitting
Benefits of crowd-sourcing
• Performance Evaluation
– Unbiased, consistent, and rigorous method assessment
– Unbiased comparison and discovery of best methods
– Determine the solvability of a scientific question
• Sampling of the space of methods
– Understand the diversity of methodologies presently being
used to solve a problem
Benefits of crowd-sourcing
• Acceleration of Research
– The community of participants can do in 4 months what would take any
single group 10 years
• Community Building
– Make high quality, well-annotated data accessible
– Foster community collaborations on fundamental research questions
– Determine robust solutions through community consensus: “The Wisdom
of Crowds”
The challenge of reproducibility
• Disease research is data intensive. A typical researcher has a PhD in
multivariate statistics and does a lot of programming in languages like R,
Python, and Matlab, using libraries of established tools.
• So these analyses are software stacks of a sort, each piece having the
typical series of revisions.
• This makes reproducibility really challenging: to reproduce an analysis
you need not only the original data and the statistical processing script
written by the author, but also the correct versions of all the dependencies.
• Containerization therefore offers a powerful tool for reproducibility: the
entire software stack used in an analysis can be tracked.
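As a minimal illustration of that last point, a container image can pin the interpreter, the library versions, and the analysis script together so the whole stack travels with the result. The version numbers, package names, and file names below are invented for the example.

```bash
# Illustrative only: versions, packages, and file names are made up; the point is
# that the entire software stack of the analysis is captured in one image.
cat > Dockerfile <<'EOF'
FROM r-base:3.2.3
RUN R -e "install.packages(c('glmnet','survival'), repos='https://cloud.r-project.org')"
COPY analysis.R /analysis/analysis.R
CMD ["Rscript", "/analysis/analysis.R"]
EOF

# Anyone with this image and the original data can rerun the exact analysis.
docker build -t published-analysis:v1 .
docker run --rm -v /path/to/original-data:/data:ro published-analysis:v1
```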


Editor's Notes

  • #4 Think in terms of mining data sets incorporating complete genomic profiles from thousands of subjects. Today someone working in disease research may have a PhD in statistics and never see a wet lab.
  • #10 Synapse provides a layer of web services that allow researchers to easily record and collaborate on their research (as widely or narrowly as they choose) in real-time and across institutional boundaries. These services include not only the Synapse web portal, but also programmatic clients which talk to the same web services. By leveraging Synapse provenance services, analysts are able to provide an analysis trail of the data, code, and results associated with a research project. This helps all involved in the project to clearly see what has been done, and by whom. By operating the Synapse platform and its services free of charge as a service to the scientific community, Sage Bionetworks hopes to catalyze new collaborations as well as exciting and reproducible scientific discoveries. Then maybe just mention that Brian Bot, Chris Bare, and Thea Norman are all at the meeting and would be happy to talk to anyone interested, and that they can stop by our poster.
  • #32 For reproduced findings, authors had paid close attention to controls, reagents, investigator bias, and describing the complete data set. For non-reproduced findings, data were not routinely analyzed by investigators blinded to the experimental versus control groups, there were no guidelines to report all data, etc. In the Bayer study, 70% of the studies analyzed were on cancer research.