Tragedy of the (Data) Commons
Tetherless World Constellation, RPI
Jim Hendler
Tetherless World Professor of Computer,
Web and Cognitive Sciences
Director, Institute for Data Exploration and
Applications
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler
Major talks at: http://www.slideshare.net/jahendler
INVESTOPEDIA – Tragedy of the commons (summary)
• The tragedy of the commons is an economic problem that results in overconsumption, underinvestment, and ultimately depletion of a common-pool resource.
• For a tragedy of the commons to occur a resource must be
scarce, rivalrous in consumption, and non-excludable.
• Solutions to the tragedy of the commons include the
imposition of private property rights, government
regulation, or the development of a collective action
arrangement.
Tragedy of the data commons
• The tragedy of the DATA commons is an ongoing scientific problem that results in under-utilization, overinvestment, and ultimately disuse of common data resources.
• For a tragedy of the commons to occur a resource must be
scarce, rivalrous in consumption, and non-excludable.
• Solutions to the tragedy of the commons include the
imposition of private property rights, government
regulation, or the development of a collective action
arrangement.
– Can we move to the third option?
We want data to be FAIR
• Easy to say, connotes a lot
• Harder to operationalize
• For machines
• Formats
• Standards
• …
• For humans
• Incentives
• Trust
• Training
• …
• Need models, best practices, lessons learned, etc.
The big challenge: we require sharing across large projects
• Example: biomedical research
• Best models span disciplines
• People live in different departments at different universities
• But a compelling scientific challenge is a forcing function for people to work together
• It created incentives
• Funding is still largely by project
• Infrastructure for project data: expensive
• Infrastructure for cross-project data sharing: priceless
• Short- to mid-term solutions are likely to require interoperability between separately funded efforts
• WHICH LEADS TO THE TRAGEDY OF THE DATA COMMONS
Organs: Histopathology
Organism: Phenotype
Circuits: Electrophysiology
Cells: In vitro phenotype
Pathways: Signal Cascades
Biomodules: Protein–Protein Interactions
Protein: Proteome
RNA: Transcriptome
DNA: Genome
Population: Epidemiology
A long-term endeavor: understanding a single domain
© G. Bhanavar, IBM, IJCAI ‘16
Solution requires Interoperability
• One reason the Web beat its competitors…
• Gopher
• Archie
• FTP
• …
• Provided a lightweight standard that allowed interoperability between these and more
• The Web was built on “coop-etition”
• How do we learn this lesson for data sharing?
FAIR requires sharable metadata
But that ontology stuff never works….
• Ontologies
– Hard to build
– Expensive to maintain
– Don’t map to people’s data
– Rarely reused
• Aren’t ontologies why the sharing parts of FAIR are so hard?
CHEAR Ontology Effort
The Children’s Health Exposure Analysis Resource, or CHEAR, is a program funded
by the National Institute of Environmental Health Sciences to advance understanding
about how the environment impacts children’s health and development over the
course of a lifetime.
https://chearprogram.org/
Children’s Health Exposure
Analysis Resource (CHEAR)
McGuinness 9/9/19 Partially supported by: NIH/NIEHS 0255-0236-4609 / 1U2CES026555-01
CHEAR is composed of three components:
A National Exposure Assessment Laboratory Network, providing both targeted
and untargeted environmental exposure and biological response analyses in
human samples
A Data Repository, Analysis, and Science Center, providing statistical services,
a data repository, and data standards for integration and sharing
A Coordinating Center, connecting the research community to CHEAR
resources
CHEAR Ontology Effort
Goal: Encode the terminology currently needed by the CHEAR Data Center Portal, and publish an open-source, extensible ontology integrating general exposure science and health, leveraging best-in-class terminologies.
Enabling Findable, Accessible, Interoperable, Reusable data and services to support data analysis and interdisciplinary research.
Ontologies encode terms and their interrelationships, providing a foundation for interoperability and reusability (the I and R of FAIR).
Ontology-enabled infrastructures (knowledge graphs and ontology-enabled search services) also support finding and accessing relevant content (the F and A of FAIR).
Child Health Exposure
Analysis Resource Ontology
Stingone, Mervish, Kovatch, McGuinness, Gennings, Teitelbaum. Big and Disparate Data: Considerations for
Pediatric Consortia. Current Opinions in Pediatrics Journal. 29(2):231-239, April 2017. doi:
10.1097/MOP.0000000000000467. PMID: 28134706
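The F and A role of an ontology-enabled knowledge graph described above can be sketched in a few lines. This is a minimal stand-in, not the actual CHEAR infrastructure: the dataset names, class names, and hierarchy below are all invented for illustration.

```python
# Minimal sketch of ontology-enabled findability (the F and A of FAIR).
# A tiny "knowledge graph" as a set of (subject, predicate, object) triples.
triples = {
    ("dataset42", "type", "ExposureDataset"),
    ("dataset42", "label", "Urinary biomarker panel"),
    ("dataset42", "measures", "LeadExposure"),
    ("LeadExposure", "subClassOf", "EnvironmentalExposure"),
    ("dataset7", "type", "ExposureDataset"),
    ("dataset7", "label", "Dietary recall survey"),
    ("dataset7", "measures", "DietaryIntake"),
}

def subclasses(cls):
    """All classes at or below cls in the hierarchy (transitive closure)."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == "subClassOf" and o in found and s not in found:
                found.add(s)
                changed = True
    return found

def find_datasets(measuring):
    """Ontology-enabled search: match via the class hierarchy, not strings."""
    wanted = subclasses(measuring)
    return sorted(s for s, p, o in triples if p == "measures" and o in wanted)

# dataset42 is found via the subClassOf link, not an exact term match.
print(find_datasets("EnvironmentalExposure"))
```

The point of the sketch: because `LeadExposure` is declared a subclass of `EnvironmentalExposure`, a query at the general level still finds the specific dataset — which is exactly what flat keyword metadata cannot do.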
Ontology Foundations
Imported Ontologies:
● Semantic Science Integrated Ontology (SIO)
● PROV-O
● Units Ontology
● Human-Aware Science Ontology (HAScO)
● Virtual Solar Terrestrial Observatory – Instruments (VSTO-I)
● Environment Ontology (ENVO)
● …
Minimum Information to Reference an External Ontology Term (MIREOT)-ed Ontologies:
● Chemical Entities of Biological Interest (ChEBI)
● Statistics Ontology (STATO)
● PubChem
● UBERON (Anatomy)
● Disease Ontology (DO)
● UniProt (Proteins)
● Cogat (Cognitive Measures)
● ExO
● RefMet, …
Annotations:
● Simple Knowledge Organization System (SKOS)
● Dublin Core (DC) Terms
CHEAR Ontology
Foundations and Reuse
McCusker, Rashid, Liang, Liu, Chastain, Pinheiro, Stingone, McGuinness. Broad, Interdisciplinary Science In Tela: An Exposure and Child Health Ontology. In Proceedings of Web Science 2017, Troy, NY. 349-357.
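The MIREOT approach listed above — referencing just the external terms a project needs, rather than importing whole ontologies — can be sketched roughly as follows. The OBO PURL pattern is real, but treat the specific numeric IDs, the record fields, and the `annotate` helper as illustrative, not as CHEAR's actual mechanism.

```python
# Sketch of MIREOT-style reuse: keep local labels, but bind each one to a
# shared external ontology IRI. IDs below are illustrative examples.
mireot_terms = {
    "water": "http://purl.obolibrary.org/obo/CHEBI_15377",    # ChEBI
    "liver": "http://purl.obolibrary.org/obo/UBERON_0002107",  # UBERON
}

def annotate(record, term_map):
    """Replace local field labels with shared ontology IRIs, so two projects
    that both MIREOT the same terms can link their data automatically.
    Fields with no mapping stay local."""
    return {term_map.get(k, k): v for k, v in record.items()}

sample = {"liver": "biopsy-012", "collected": "2017-04-01"}
print(annotate(sample, mireot_terms))
```

Two independently funded projects that each maintain a small `term_map` like this share data wherever the maps overlap — without either adopting the other's full schema.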
Metadata as an evolving resource, not a predefined standard
• Move from hard metadata standards to sharable resources
– Referenceable by links
• Linked-data principles apply to the linking of metadata
– This is a key part of the original semantic web vision
• To date this still works best “within cultures”…
– Which solves the “grounding” problem
• But interdisciplinary links are growing
– As more sharing occurs
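The idea of metadata as linked, evolving resources can be sketched as follows. The vocabularies and URIs are invented, and the tiny "sameAs" resolver is a minimal stand-in for real linked-data tooling: each community keeps its own terms and simply publishes link assertions to others.

```python
# Sketch: three communities keep their own metadata terms but publish
# pairwise "sameAs" links; a resolver follows the links transitively.
# All URIs are illustrative.
links = [
    ("http://vocabA.example/temp", "sameAs", "http://vocabB.example/temperature"),
    ("http://vocabB.example/temperature", "sameAs", "http://vocabC.example/air_temp"),
]

def equivalents(term):
    """Transitive closure over sameAs links, followed in both directions."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for s, _, o in links:
            for a, b in ((s, o), (o, s)):
                if a in found and b not in found:
                    found.add(b)
                    changed = True
    return found

# A term from one culture resolves to its counterparts in the others,
# without any of the three vocabularies having been merged or frozen.
print(sorted(equivalents("http://vocabA.example/temp")))
```

New links can be added as interdisciplinary sharing grows, which is the "evolving resource, not predefined standard" point above.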
Lessons learned
• This was a crucial component in creating
– Usable (lightweight) semantics for Web Apps (schema.org)
– Usable (lightweight) semantics for government data sharing (DCAT)
– Successful scientific efforts
• Virtual Observatory
• Deep Carbon Observatory
• Health Data Research UK (parts thereof to date)
Details omitted for time
• Multiple large projects have been working with
– Provenance
– Curation and Versioning
• Archiving
– Consistency
• Only partial overlap
– Credit and citation
– Interdisciplinary term mappings
– Term reconciliation
– Computational infrastructure (esp. cloud)
– Third-party (bottom-up) data curation via learning *
– …
How does this beat the data commons problem?
Total project: ~$60M
Data management: ~$10M
CHEAR ontology: ~$1M
Reuse and linking of the data via ontological (metadata) development is a fraction of the total project cost, but key to project success.
Can this beat the data commons problem?
• Projects can share data at a fraction of the cost if they
– Start from overlapping common metadata terms
• Not a single standard
– Each set aside a relatively small cost for the metadata team
• Embedded in the data team
– Their metadata teams work together to the extent possible
• Reusing the metadata leads to reusability of the data
[Diagram: Project → Data → Metadata]
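A minimal sketch of the sharing model above, with invented project names, column names, and values: two projects agree only on a small set of overlapping metadata terms, and data pooling then falls out of that agreement rather than from a single imposed standard.

```python
# Sketch: each project keeps its own extra fields; only the columns both
# teams mapped to shared metadata terms are pooled. Everything here is
# illustrative, not a real CHEAR schema.
SHARED = {"subject_age": "years", "lead_level": "ug/dL"}  # agreed terms + units

project_a = [{"subject_age": 6, "lead_level": 2.1, "a_only_field": "x"}]
project_b = [{"subject_age": 9, "lead_level": 3.4, "b_only_field": "y"}]

def pool(*datasets):
    """Combine records from multiple projects, keeping only the columns
    that carry shared metadata terms; project-specific columns stay local."""
    return [
        {k: row[k] for k in SHARED if k in row}
        for ds in datasets for row in ds
    ]

print(pool(project_a, project_b))
```

Neither project had to adopt the other's full schema; the small shared-term table is the entire coordination cost.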
Questions?
https://idea.rpi.edu
Manufacturing Data Problem
– DARPA Open Manufacturing Performers (Honeywell, Lockheed Martin, Boeing, etc.) generated TBs of metal AM process, testing, and characterization data.
– Data management requirements (Materials Genome Initiative)
– Over time… DARPA’s data server looks like this (image: www.existentialennui.com)
“Good data”, but of little use in its current form!
Our Approach
Step 1: “Pick up the books” – drill into the data files.
Step 2: “Develop a basic Dewey decimal system” – use domain expertise to realize “functional ontologies” that anchor the data sets.
Our Approach
Step 3: “What type of display case?”
• Faceted search-based visualization of data
• Meaningful interaction with data
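A faceted-search backend of the kind Step 3 describes can be sketched in a few lines. The facet names and record values below are invented for illustration, not the project's actual schema.

```python
# Sketch: faceted search over data-set records. A faceted UI shows per-facet
# value counts, then narrows the record set as the user selects values.
records = [
    {"process": "laser powder bed", "alloy": "Ti-6Al-4V", "test": "tensile"},
    {"process": "laser powder bed", "alloy": "IN718",     "test": "fatigue"},
    {"process": "directed energy",  "alloy": "Ti-6Al-4V", "test": "tensile"},
]

def facet_counts(records, facet):
    """Count of records per value of one facet -- what the sidebar displays."""
    counts = {}
    for r in records:
        counts[r[facet]] = counts.get(r[facet], 0) + 1
    return counts

def filter_by(records, **selected):
    """Narrow the record set by the user's chosen facet values."""
    return [r for r in records
            if all(r.get(f) == v for f, v in selected.items())]

print(facet_counts(records, "alloy"))
print(filter_by(records, alloy="Ti-6Al-4V", test="tensile"))
```

The "functional ontologies" from Step 2 supply the facet vocabulary; the search layer itself stays this simple.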
Our Approach
Step 4: “Read & discover new knowledge”
• Apply machine learning to the data sets
• Train, then predict for untested conditions
Grand Vision: Data-driven Inverse Design for an AM Part Qualification Paradigm
Machine Learning Example (Composites Testing Data)
Objective: Classify majority failure modes (interfacial/cohesive) based on input parameters (surface preparation, contaminant type, contaminant amount).
• Data set (n=562) randomly partitioned into a training set (n=395) and a test set (n=167); each trial partitions the data differently.
[Figure: typical validation output (confusion matrix) from a single trial; green cells are correct predictions, gray cells are incorrect predictions.]
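A rough, self-contained analogue of this experiment, showing the 562 → 395/167 random partition and confusion-matrix evaluation. The synthetic features and the simple nearest-centroid classifier are stand-ins for the slide's unspecified features and model; only the data-set sizes and the two failure-mode labels come from the slide.

```python
import random

# Synthetic two-class data standing in for the composites testing set.
random.seed(0)

def make_point(mode):
    # Two invented numeric features loosely separating the two modes.
    base = (1.0, 3.0) if mode == "interfacial" else (3.0, 1.0)
    return ([b + random.gauss(0, 1) for b in base], mode)

data = [make_point(random.choice(["interfacial", "cohesive"]))
        for _ in range(562)]
random.shuffle(data)                       # each trial partitions differently
train, test = data[:395], data[395:]       # n=395 train / n=167 test

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

cents = {m: centroid([x for x, y in train if y == m])
         for m in ("interfacial", "cohesive")}

def predict(x):
    # Nearest-centroid rule: assign the class whose centroid is closest.
    return min(cents, key=lambda m: sum((a - b) ** 2
                                        for a, b in zip(x, cents[m])))

# Confusion matrix: rows = true mode, columns = predicted mode
# (the diagonal plays the role of the slide's green cells).
conf = {t: {p: 0 for p in cents} for t in cents}
for x, y in test:
    conf[y][predict(x)] += 1
print(conf)
```

Re-running without the fixed seed re-partitions the data, matching the slide's note that each trial partitions the data differently.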