How	data	commons	are	changing	the	way	that	large	
genomic	and	clinical	datasets	are	analyzed	and	shared	
Robert	Grossman
Center	for	Data	Intensive	Science
University	of	Chicago	&	
Open	Commons	Consortium
Wellcome Genome Campus Workshop on Big Data in Health and Biology
September	26,	2017
• Data Commons (2014 - 2024)
• Data Clouds (2010 - 2020)
• Databases & Repositories (1982 - present)
1. Big Data in Biology, Medicine & Health Care: A Thirty Year Perspective
[Timeline figure. Years: 1984, 1993, 2004, 2011, 2016. Eras: Computationally Intensive Statistics; Data Mining & KDD; Predictive Analytics; Big Data & Data Science; AI (redux) / Deep Learning. Other labels: Hamiltonian Monte Carlo; PageRank; CNN, ANN; direct marketing; POS; Internet; mobile; labeled images etc.]
Deep	Learning	(DL)	Dogma
• Use Deep Neural Networks (DNN) with lots of layers (10’s to 100’s of layers).
• Try using Convolutional Neural Networks (CNN), even if the problem is not translation invariant.
• Represent the inputs and internal states as long vectors of numbers (even if the input is an image, text, spoken voice, etc. that has structure).
• Train with very large amounts of labeled data.
• Don’t worry about the internal structure of the model, just its accuracy and coverage.
                                                      | DNN (2016)                     | Data Mining (1996)
------------------------------------------------------+--------------------------------+-----------------------------
Use lots of parameters                                | 10’s to 100’s of hidden layers | Large trees, large ensembles
Long vectors for inputs                               | Part of the dogma              | Generally the case
Use as much data as possible                          | Yes                            | Yes
Do we care about the internal structure of the model? | No                             | No
Hardware                                              | GPU & custom chips             | Clusters of workstations
Labeled data                                          | Often the limiting factor      | Often the limiting factor
Features                                              | Not needed                     | Generally the hard part
• In some sense, Deep Learning is eating the world.
• Compare: Marc Andreessen, Why Software Is Eating The World, Wall Street Journal, August 20, 2011.
• From a broader perspective, Machine Learning (ML) continues to eat the world, as it has been doing for the last 20 years, driven by the exponential growth in the amount of data and the computational power available to estimate parameters.
2.	Data	Commons,	An	Emerging	Platform	for	
Data	Science
Data	Commons
Data	commons	co-locate	data,	storage	and	computing	infrastructure	
with	commonly	used	services,	tools	&	apps	for	analyzing	and	sharing	
data	to	create	an	interoperable resource	for	the	research	community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson and Walt Wells, A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering, 2016.
Source of image: The CDIS, GDC, & OCC data commons infrastructure at the University of Chicago Kenwood Data Center.
NCI	Genomic	Data	Commons*
• Launched in 2016 with over 4 PB of data.
• Joint project with OICR.
• Used by 1500 - 2000+ users per day.
• Based upon an open source software stack that can be used to build other data commons.
*See: NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112.
System	1:	Data	Portals	to	Explore	and	Submit	Data
• MuSE (MD Anderson)
• VarScan2 (Washington Univ.)
• SomaticSniper (Washington Univ.)
• MuTect2 (Broad Institute)
Source: Zhenyu Zhang, et al. and the GDC Project Team, Uniform Genomic Data Analysis in the NCI Genomic Data Commons, to appear.
System 2: Data Harmonization System to Analyze All of the Submitted Data with Common Pipelines
System	3:	User	Defined	Applications	and	Notebooks	
to	Create	a	Data	Ecosystem
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state	
• The	GDC	has	a	REST	API	so	that	researchers	can	develop	their	own	
applications.
• There	are	third	party	applications	that	use	the	REST	API	for	Python,	R,	
Jupyter notebooks	and	Shiny.
• The	REST	API	drives	the	GDC	data	portal,	data	submission	system,	etc.
GDC	Application	Programming	Interface	(API)	–
To	Build	Applications
https://gdc-api.nci.nih.gov/files/5003adf1-1cfd-467d-8234-0d396422a4ee?fields=state	
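Parts of the URL: API URL (https://gdc-api.nci.nih.gov/), endpoint (files), optional entity ID (5003adf1-1cfd-467d-8234-0d396422a4ee), query parameters (fields=state).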
• Based	upon	data	model
• Drives	internally	developed	applications,	e.g.	data	portal
• Allows	third	parties	to	develop	their	own	applications
• Can	be	used	by	other	commons,	by	workspaces,	by	other	
systems,	by	user-developed	applications	and	notebooks
Figure: PCA-based analysis of RNA-seq data. Purple points are lung adenocarcinoma samples, grey points are lung squamous cell carcinoma samples, and green points appear to be misdiagnosed.
The GDC enables bioinformaticians to build their own applications using the GDC API.
Source: Center for Data Intensive Science, University of Chicago.
Shiny R app built using the GDC API
Databases (1982 - present)
• Data repository
• Researchers download data.
Data Clouds (2010 - 2020)
• Supports big data with cloud computing
• Researchers can analyze data with collaborative tools (workspaces), i.e. data does not have to be downloaded
Data Commons (2014 - 2024)
• Supports big data
• Workspaces
• Common data models
• Core data services
• Harmonized data
• Governance
3.	GDC	Gen3	Open	Source	Software	Stack
• OCC Open Science Data Cloud (2010)
• OCC – NASA Project Matsu (2009)
• NCI Genomic Data Commons* (2016)
• OCC-NOAA Environmental Data Commons (2016)
• OCC Blood Profiling Atlas in Cancer (2017)
• Bionimbus Protected Data Cloud* (2013)
• Brain Commons (2017)
• Kids First Data Resource (2017)
*Operated under a subcontract from NCI / Leidos Biomedical to the University of Chicago with support from the OCC.
Gen3
Gen2
Gen1
The Gen3 Data Model Is Customizable & Extensible
• BloodPAC
• Brain	Commons
• Wellness	Commons
• Kids	First	Data	Resource
Architecture used by Gen3 Data Commons
• Object-based storage with access control lists
• Scalable lightweight workflow
• Community data products
• Database services
• Apps
• Data Commons Framework Services (digital ID, metadata, authentication, authorization, etc.) that support multiple data commons
[Diagram labels: Data Commons 1; Data Commons 2; portals for accessing & submitting data; workspaces; APIs; notebooks; apps; Data Commons Framework Services.]
Core	Data	Commons	Framework	Services
• Digital	ID	services
• Metadata	services
• Authentication	services
• Authorization	services
• Designed	to	span	multiple	data	commons	
• Designed	to	support	multiple	private	and	commercial	
clouds	
• In	the	future,	we	will	support	portable	workspaces
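How shared framework services might sit underneath more than one commons can be sketched as a toy in-memory example. Everything below is an assumption made for illustration: the class names, fields, identifiers and storage URLs are hypothetical and do not come from the Gen3 code base, and authentication is left out. The sketch only shows a digital ID being resolved to storage locations after an access control check.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """Metadata record for one stored object (hypothetical schema)."""
    digital_id: str
    locations: list                         # copies, possibly in different clouds
    acl: set = field(default_factory=set)   # projects allowed to read; empty = open


class FrameworkServices:
    """Toy stand-in for digital ID, metadata and authorization services."""

    def __init__(self):
        self._index = {}    # digital ID -> DataObject
        self._grants = {}   # user -> set of projects the user may read

    def register(self, obj):
        self._index[obj.digital_id] = obj

    def grant(self, user, project):
        self._grants.setdefault(user, set()).add(project)

    def authorized(self, user, obj):
        # Open objects are readable by anyone; otherwise the user needs
        # at least one project that appears on the object's ACL.
        return not obj.acl or bool(obj.acl & self._grants.get(user, set()))

    def resolve(self, user, digital_id):
        """Return the storage locations for a digital ID if access is allowed."""
        obj = self._index[digital_id]
        if not self.authorized(user, obj):
            raise PermissionError(f"{user} may not read {digital_id}")
        return list(obj.locations)


# One services instance shared by two commons (all names are made up).
services = FrameworkServices()
services.register(DataObject("id-1", ["s3://commons-1/a", "gs://commons-2/a"], {"project-x"}))
services.register(DataObject("id-2", ["s3://commons-1/b"]))
services.grant("alice", "project-x")
print(services.resolve("alice", "id-1"))
```

Because the index is keyed by digital ID rather than by storage location, the same lookup works whichever commons or cloud holds a given copy, which is the property the bullets above describe.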
Open Source Software for Data Commons
Existing open source apps
Commercial apps
New FOSS sponsor-funded apps
Public Clouds
Data managed by the commons
Private Clouds
CSOC (Common Services Ops Center)
Data Commons Management & Governance
Sponsor (e.g. funder or consortium of funders)
OCC Data Commons Framework
Data Commons Framework Services
Private Academic Cloud
Univ. of Chicago
CSOC (ops center)
Cross Cloud Services
NCI Cloud Pilots
Compliant apps
Bionimbus PDC & other clouds
FAIR Principles
NCI GDC
Other data commons
Data Peering Principles
Commons Services Operations Center
Commons services
Commons Services Framework
Apps
Summary
1. Designed to support disease specific, project specific or consortium specific data commons, including governance model.
2. Designed to support multiple data commons that peer and interoperate.
3. Designed to support an ecosystem of FAIR-based applications.
4. The core underlying software stack is open source.
5. Uses a data commons governance model in which data is public and you “pay for compute”.
6. Supported by the independent not-for-profit Open Commons Consortium.
4.	Towards	Data	Ecosystems
• Databases (1982 - present)
• Data Clouds (2010 - 2020)
• Data Commons (2014 - 2024)
• Data Ecosystems (2018 - 2028)
Three Large Scale Data Commons That Are Working Towards Common APIs
1. NCI GDC / Cloud Resources (UChicago / Broad)
2. NIH All of Us (Broad / Verily)
3. CZI HCA Data Platform (UCSC / Broad)
For more information, see: Josh Denny, David Glazer, Robert L. Grossman, Benedict Paten & Anthony Philippakis, A Data Biosphere for Biomedical Research, https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d. Also available at: https://goo.gl/9CySeo
Accumulating Knowledge About Small Effects
• Databases: curated data
• Data Commons: data harmonization
• Data Ecosystems: analysis reuse
[Chart: percentage by year (1995, 2005, 2015, 2025) for three series. Hardware / data production: 100%, 75%, 70%, 50%. Software / data analysis: 0%, 25%, 25%, 25%. Data ecosystem / data reuse: 0%, 0%, 5%, 25%.]
Big	(deep)	knowledge
Big	Information	(informatics)
- Shared	analysis
Big	Data
Data
Infrastructure	
Apps
Academic	Data	Centers
Apps	for	clinical	researchers
Apps	for	bioinformaticians
Apps	for	system	builders
Data Commons
Data	Ecosystems
Cisplatin?
Idarubicin?
Floxuridine?
Questions?
rgrossman.com
@bobgrossman
For	more	information:
• To learn more about data commons: Robert L. Grossman, et al. A Case for Data Commons: Toward Data Science as a Service, Computing in Science & Engineering 18.5 (2016): 10-20. Also https://arxiv.org/abs/1604.02608
• To learn more about large scale, secure, compliant cloud based computing environments for biomedical data, see: Heath, Allison P., et al. "Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets." Journal of the American Medical Informatics Association 21.6 (2014): 969-975. This article describes Bionimbus Gen1.
• To learn more about the NCI Genomic Data Commons: Grossman, Robert L., et al. "Toward a shared vision for cancer genomic data." New England Journal of Medicine 375.12 (2016): 1109-1112. The GDC was developed using Bionimbus Gen2.
• To learn more about BloodPAC: Grossman, R. L., et al. "Collaborating to compete: Blood Profiling Atlas in Cancer (BloodPAC) Consortium." Clinical Pharmacology & Therapeutics (2017). BloodPAC was developed using the GDC Community Edition (CE), aka Bionimbus Gen3.
Contact Information
Robert L. Grossman
cdis.uchicago.edu
rgrossman.com
@BobGrossman
robert dot grossman at uchicago.edu
