Tragedy of the (Data) Commons
Tetherless World Constellation, RPI
Jim Hendler
Tetherless World Professor of Computer,
Web and Cognitive Sciences
Director, Institute for Data Exploration and
Applications
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler
@jahendler
Major talks at: http://www.slideshare.net/jahendler
INVESTOPEDIA – Tragedy of the commons (summary)
• The tragedy of the commons is an economic problem that results in overconsumption, underinvestment, and ultimately depletion of a common-pool resource.
• For a tragedy of the commons to occur a resource must be
scarce, rivalrous in consumption, and non-excludable.
• Solutions to the tragedy of the commons include the
imposition of private property rights, government
regulation, or the development of a collective action
arrangement.
Tragedy of the data commons
• The tragedy of the DATA commons is an ongoing scientific problem that results in under-utilization, overinvestment, and ultimately disuse of common data resources.
• For a tragedy of the commons to occur a resource must be
scarce, rivalrous in consumption, and non-excludable.
• Solutions to the tragedy of the commons include the
imposition of private property rights, government
regulation, or the development of a collective action
arrangement.
– Can we move to the third option?
We want data to be FAIR
• Easy to say, connotes a lot
• Harder to operationalize
• For machines
• Formats
• Standards
• …
• For humans
• Incentives
• Trust
• Training
• …
• Need models, best practices, lessons learned, etc.
The big challenge: we require sharing across large projects
• Example: biomedical research
• Best models span disciplines
• People live in different departments at different universities
• But a compelling scientific challenge is a forcing function for people to work together
• It created incentives
• Funding is still largely by project
• Infrastructure for project data: expensive
• Infrastructure for cross-project data sharing: priceless
• Short- to mid-term solutions are likely to require interoperability between separately funded efforts
• WHICH LEADS TO THE TRAGEDY OF THE DATA COMMONS
Organs: Histopathology
Organism: Phenotype
Circuits: Electrophysiology
Cells: In vitro phenotype
Pathways: Signal Cascades
Biomodules: Protein–Protein Interactions
Protein: Proteome
RNA: Transcriptome
DNA: Genome
Population: Epidemiology
A long-term endeavor: understanding a single domain
© G. Bhanavar, IBM, IJCAI ‘16
Solution requires Interoperability
• One reason the Web beat its competitors…
• Gopher
• Archie
• FTP
• …
• Provided a lightweight standard that allowed interoperability between these and more
• The Web was built on “coop-etition”
• How do we learn this lesson for data sharing?
FAIR requires sharable metadata
But that ontology stuff never works….
• Ontologies
– Hard to build
– Expensive to maintain
– Don’t map to people’s data
– Rarely reused
• Aren’t ontologies why the sharing parts of FAIR are so hard?
CHEAR Ontology Effort
The Children’s Health Exposure Analysis Resource, or CHEAR, is a program funded
by the National Institute of Environmental Health Sciences to advance understanding
about how the environment impacts children’s health and development over the
course of a lifetime.
https://chearprogram.org/
Children’s Health Exposure
Analysis Resource (CHEAR)
McGuinness 9/9/19 Partially supported by: NIH/NIEHS 0255-0236-4609 / 1U2CES026555-01
CHEAR is composed of three components:
A National Exposure Assessment Laboratory Network, providing both targeted
and untargeted environmental exposure and biological response analyses in
human samples
A Data Repository, Analysis, and Science Center, providing statistical services,
a data repository, and data standards for integration and sharing
A Coordinating Center, connecting the research community to CHEAR
resources
CHEAR Ontology Effort
Goal: Encode the terminology currently needed by the CHEAR Data Center Portal, and publish an open-source, extensible ontology integrating general exposure science and health, leveraging best-in-class terminologies.
Enabling Findable, Accessible, Interoperable, Reusable data and services to support data analysis and interdisciplinary research.
Ontologies encode terms and their interrelationships, providing a foundation for interoperability and reusability (the I and R of FAIR).
Ontology-enabled infrastructures (knowledge graphs and ontology-enabled search services) also support finding and accessing relevant content (the F and A of FAIR).
Child Health Exposure
Analysis Resource Ontology
Stingone, Mervish, Kovatch, McGuinness, Gennings, Teitelbaum. Big and Disparate Data: Considerations for
Pediatric Consortia. Current Opinions in Pediatrics Journal. 29(2):231-239, April 2017. doi:
10.1097/MOP.0000000000000467. PMID: 28134706
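The F and A role of an ontology-enabled knowledge graph described above can be sketched in a few lines. This is a minimal stand-in, not the actual CHEAR infrastructure: the dataset names, class names, and hierarchy below are all invented for illustration.

```python
# Minimal sketch of ontology-enabled findability (the F and A of FAIR).
# A tiny "knowledge graph" as a set of (subject, predicate, object) triples.
triples = {
    ("dataset42", "type", "ExposureDataset"),
    ("dataset42", "label", "Urinary biomarker panel"),
    ("dataset42", "measures", "LeadExposure"),
    ("LeadExposure", "subClassOf", "EnvironmentalExposure"),
    ("dataset7", "type", "ExposureDataset"),
    ("dataset7", "label", "Dietary recall survey"),
    ("dataset7", "measures", "DietaryIntake"),
}

def subclasses(cls):
    """All classes at or below cls in the hierarchy (transitive closure)."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == "subClassOf" and o in found and s not in found:
                found.add(s)
                changed = True
    return found

def find_datasets(measuring):
    """Ontology-enabled search: match via the class hierarchy, not strings."""
    wanted = subclasses(measuring)
    return sorted(s for s, p, o in triples if p == "measures" and o in wanted)

# dataset42 is found via the subClassOf link, not an exact term match.
print(find_datasets("EnvironmentalExposure"))
```

The point of the sketch: because `LeadExposure` is declared a subclass of `EnvironmentalExposure`, a query at the general level still finds the specific dataset — which is exactly what flat keyword metadata cannot do.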
Ontology Foundations
Imported Ontologies:
● Semantic Science Integrated Ontology (SIO)
● PROV-O
● Units Ontology
● Human-Aware Science Ontology (HAScO)
● Virtual Solar Terrestrial Observatory – Instruments (VSTO-I)
● Environment Ontology (ENVO)
● …
Minimum Information to Reference an External Ontology Term (MIREOT)-ed Ontologies:
● Chemical Entities of Biological Interest (ChEBI)
● Statistics Ontology (STATO)
● PubChem
● UBERON (Anatomy)
● Disease Ontology (DO)
● UniProt (Proteins)
● Cogat (Cognitive Measures)
● ExO
● RefMet, …
Annotations:
● Simple Knowledge Organization System (SKOS)
● Dublin Core (DC) Terms
CHEAR Ontology
Foundations and Reuse
McCusker, Rashid, Liang, Liu, Chastain, Pinheiro, Stingone, McGuinness. Broad, Interdisciplinary Science In Tela: An Exposure and Child Health Ontology. In Proceedings of Web Science 2017, Troy, NY. 349-357.
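The MIREOT approach listed above — referencing just the external terms a project needs, rather than importing whole ontologies — can be sketched roughly as follows. The OBO PURL pattern is real, but treat the specific numeric IDs, the record fields, and the `annotate` helper as illustrative, not as CHEAR's actual mechanism.

```python
# Sketch of MIREOT-style reuse: keep local labels, but bind each one to a
# shared external ontology IRI. IDs below are illustrative examples.
mireot_terms = {
    "water": "http://purl.obolibrary.org/obo/CHEBI_15377",    # ChEBI
    "liver": "http://purl.obolibrary.org/obo/UBERON_0002107",  # UBERON
}

def annotate(record, term_map):
    """Replace local field labels with shared ontology IRIs, so two projects
    that both MIREOT the same terms can link their data automatically.
    Fields with no mapping stay local."""
    return {term_map.get(k, k): v for k, v in record.items()}

sample = {"liver": "biopsy-012", "collected": "2017-04-01"}
print(annotate(sample, mireot_terms))
```

Two independently funded projects that each maintain a small `term_map` like this share data wherever the maps overlap — without either adopting the other's full schema.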
Metadata as an evolving resource, not a predefined standard
• Move from hard metadata standards to sharable resources
– Referenceable by links
• Linked-data principles apply to the linking of metadata
– This is a key part of the original semantic web vision
• To date this still works best “within cultures”…
– Which solves the “grounding” problem
• But interdisciplinary links are growing
– As more sharing occurs
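The idea of metadata as linked, evolving resources can be sketched as follows. The vocabularies and URIs are invented, and the tiny "sameAs" resolver is a minimal stand-in for real linked-data tooling: each community keeps its own terms and simply publishes link assertions to others.

```python
# Sketch: three communities keep their own metadata terms but publish
# pairwise "sameAs" links; a resolver follows the links transitively.
# All URIs are illustrative.
links = [
    ("http://vocabA.example/temp", "sameAs", "http://vocabB.example/temperature"),
    ("http://vocabB.example/temperature", "sameAs", "http://vocabC.example/air_temp"),
]

def equivalents(term):
    """Transitive closure over sameAs links, followed in both directions."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for s, _, o in links:
            for a, b in ((s, o), (o, s)):
                if a in found and b not in found:
                    found.add(b)
                    changed = True
    return found

# A term from one culture resolves to its counterparts in the others,
# without any of the three vocabularies having been merged or frozen.
print(sorted(equivalents("http://vocabA.example/temp")))
```

New links can be added as interdisciplinary sharing grows, which is the "evolving resource, not predefined standard" point above.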
Lessons learned
• This was a crucial component in creating
– Usable (lightweight) semantics for Web Apps (schema.org)
– Usable (lightweight) semantics for government data sharing (DCAT)
– Successful scientific efforts
• Virtual Observatory
• Deep Carbon Observatory
• Health Data Research UK (parts thereof to date)
Details omitted for time
• Multiple large projects have been working with
– Provenance
– Curation and Versioning
• Archiving
– Consistency
• Only partial overlap
– Credit and citation
– Interdisciplinary term mappings
– Term reconciliation
– Computational infrastructure (esp. cloud)
– Third-party (bottom-up) data curation via learning *
– …
How does this beat the data commons problem?
Total project: ~$60M
Data management: ~$10M
CHEAR ontology: ~$1M
Reuse and linking of the data via ontological (metadata) development is a fraction of the total project cost, but key to project success.
Can this beat the data commons problem?
• Projects can share data at a fraction of the cost if they
– Start from overlapping common metadata terms
• Not a single standard
– Each set aside a relatively small cost for the metadata team
• Embedded in the data team
– Their metadata teams work together to the extent possible
• Reusing the metadata leads to reusability of the data
[Diagram: Project → Data → Metadata]
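A minimal sketch of the sharing model above, with invented project names, column names, and values: two projects agree only on a small set of overlapping metadata terms, and data pooling then falls out of that agreement rather than from a single imposed standard.

```python
# Sketch: each project keeps its own extra fields; only the columns both
# teams mapped to shared metadata terms are pooled. Everything here is
# illustrative, not a real CHEAR schema.
SHARED = {"subject_age": "years", "lead_level": "ug/dL"}  # agreed terms + units

project_a = [{"subject_age": 6, "lead_level": 2.1, "a_only_field": "x"}]
project_b = [{"subject_age": 9, "lead_level": 3.4, "b_only_field": "y"}]

def pool(*datasets):
    """Combine records from multiple projects, keeping only the columns
    that carry shared metadata terms; project-specific columns stay local."""
    return [
        {k: row[k] for k in SHARED if k in row}
        for ds in datasets for row in ds
    ]

print(pool(project_a, project_b))
```

Neither project had to adopt the other's full schema; the small shared-term table is the entire coordination cost.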
Questions?
https://idea.rpi.edu
Manufacturing Data Problem
– DARPA Open Manufacturing Performers (Honeywell, Lockheed Martin, Boeing, etc.) generated TBs of metal AM process, testing, and characterization data.
– Data management requirements (Materials Genome Initiative)
– Over time… DARPA’s data server looks like this (image: www.existentialennui.com)
“Good data”, but of little use in its current form!
Our Approach
Step 1: “Pick up the books” – drill into the data files.
Step 2: “Develop a basic Dewey decimal system” – use domain expertise to realize “functional ontologies” that anchor the data sets.
Our Approach
Step 3: “What type of display case?”
• Faceted search-based visualization of data
• Meaningful interaction with data
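A faceted-search backend of the kind Step 3 describes can be sketched in a few lines. The facet names and record values below are invented for illustration, not the project's actual schema.

```python
# Sketch: faceted search over data-set records. A faceted UI shows per-facet
# value counts, then narrows the record set as the user selects values.
records = [
    {"process": "laser powder bed", "alloy": "Ti-6Al-4V", "test": "tensile"},
    {"process": "laser powder bed", "alloy": "IN718",     "test": "fatigue"},
    {"process": "directed energy",  "alloy": "Ti-6Al-4V", "test": "tensile"},
]

def facet_counts(records, facet):
    """Count of records per value of one facet -- what the sidebar displays."""
    counts = {}
    for r in records:
        counts[r[facet]] = counts.get(r[facet], 0) + 1
    return counts

def filter_by(records, **selected):
    """Narrow the record set by the user's chosen facet values."""
    return [r for r in records
            if all(r.get(f) == v for f, v in selected.items())]

print(facet_counts(records, "alloy"))
print(filter_by(records, alloy="Ti-6Al-4V", test="tensile"))
```

The "functional ontologies" from Step 2 supply the facet vocabulary; the search layer itself stays this simple.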
Our Approach
Step 4: “Read & discover new knowledge”
• Apply machine learning to the data sets
• Train, then predict for untested conditions
Grand Vision: Data-driven Inverse Design for an AM Part Qualification Paradigm
Machine Learning Example (Composites Testing Data)
Objective: Classify majority failure modes (interfacial/cohesive) based on input parameters (surface preparation, contaminant type, contaminant amount).
• Data set (n=562) randomly partitioned into a training set (n=395) and a test set (n=167); each trial partitions the data differently.
[Figure: typical validation output (confusion matrix) from a single trial; green cells are correct predictions, gray cells are incorrect predictions.]
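A rough, self-contained analogue of this experiment, showing the 562 → 395/167 random partition and confusion-matrix evaluation. The synthetic features and the simple nearest-centroid classifier are stand-ins for the slide's unspecified features and model; only the data-set sizes and the two failure-mode labels come from the slide.

```python
import random

# Synthetic two-class data standing in for the composites testing set.
random.seed(0)

def make_point(mode):
    # Two invented numeric features loosely separating the two modes.
    base = (1.0, 3.0) if mode == "interfacial" else (3.0, 1.0)
    return ([b + random.gauss(0, 1) for b in base], mode)

data = [make_point(random.choice(["interfacial", "cohesive"]))
        for _ in range(562)]
random.shuffle(data)                       # each trial partitions differently
train, test = data[:395], data[395:]       # n=395 train / n=167 test

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

cents = {m: centroid([x for x, y in train if y == m])
         for m in ("interfacial", "cohesive")}

def predict(x):
    # Nearest-centroid rule: assign the class whose centroid is closest.
    return min(cents, key=lambda m: sum((a - b) ** 2
                                        for a, b in zip(x, cents[m])))

# Confusion matrix: rows = true mode, columns = predicted mode
# (the diagonal plays the role of the slide's green cells).
conf = {t: {p: 0 for p in cents} for t in cents}
for x, y in test:
    conf[y][predict(x)] += 1
print(conf)
```

Re-running without the fixed seed re-partitions the data, matching the slide's note that each trial partitions the data differently.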