Resource Description Framework Approach to Data Publication and Federation

Agenda. Introduction: Requirements, Background, Use Cases. Technical Example / Use Case: Requirements for creating a data service and invoking a query. Q & A.

Recap - ELN Query Functional Requirements. The ELN Query Services team have produced a rich set of functional requirements/user stories representing common questions scientists have that can be satisfied with an ELN query. Most, if not all, of those requirements have the conceptual form: SELECT <selection> FROM Experiment WHERE <constraints> … A common approach to the problem would be preferable, ideally leveraging existing standards, rather than building a solution that works just for Experiment. … The Pistoia Technical Committee may wish to constrain such query services, by aligning to existing standards, to ensure consistency in approach.
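
As a rough illustration of the conceptual form above, the following Python sketch renders a selection and a set of constraints as a SPARQL query string. The eln: namespace, the property names (title, author, project) and the helper build_experiment_query are hypothetical; the requirements do not define a vocabulary.

```python
# Hypothetical namespace; the source does not define an ELN vocabulary.
ELN_PREFIX = "http://example.org/eln#"


def build_experiment_query(selection, constraints):
    """Render SELECT <selection> FROM Experiment WHERE <constraints> as SPARQL.

    selection   -- property names to return, e.g. ["title", "author"]
    constraints -- mapping of property name to a required literal value
    """
    variables = " ".join(f"?{name}" for name in selection)
    lines = [
        f"PREFIX eln: <{ELN_PREFIX}>",
        f"SELECT {variables}",
        "WHERE {",
        "  ?experiment a eln:Experiment .",
    ]
    # Each selected property becomes a triple pattern that binds a variable.
    for name in selection:
        lines.append(f"  ?experiment eln:{name} ?{name} .")
    # Each constraint becomes a triple pattern with a fixed literal object.
    for name, value in constraints.items():
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  ?experiment eln:{name} "{escaped}" .')
    lines.append("}")
    return "\n".join(lines)


query = build_experiment_query(["title", "author"], {"project": "Kinase-01"})
print(query)
```

The same builder would work for a type other than Experiment if the class name were made a parameter, which is the point of preferring a common approach over an Experiment-only solution.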

Example ELN Query Workflow. Scientist researching a class of… Agents: small molecules (or biologics) intended to hit a target or targets; links to… Assays: tests to determine activity, affinity, binding, promiscuity; determine potential toxicity, adverse events, etc.; links to… Targets: sites where compounds bind (can be locations on a protein, locations on a gene, active centers on an enzyme, etc.); links to… Disease / Gene relationships: e.g. biology, can be from TMO / LODD resources; pathways, proteins, catalysts, immunology defense mechanisms, potential for adverse events, etc. can be included.

Recap - Query Service API. As part of the phase two deliverable, the ELN Query Services team produced a prototype SOAP-based service outlining the methods that would be required to support the query service. There are existing protocols and standards for querying structured data over the web. Aligning to an existing approach will avoid re-inventing a query language and will provide confidence in the stability of the query interface. Examples: OData (http://www.odata.org/) is a RESTful query protocol … GData (http://code.google.com/apis/gdata/) is very similar to OData, but published by Google … SPARQL (http://www.w3.org/TR/rdf-sparql-query/) is an RDF query language and protocol published by W3C for querying linked data. A triple store is [not] required and there may be synergies with existing Pistoia projects (e.g. VSI, SESL) by adopting SPARQL.
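
To make the SPARQL option more concrete, here is a minimal Python sketch of how a client might prepare a query request under the SPARQL Protocol, which accepts the query text in a URL-encoded query parameter on an HTTP GET. The endpoint URL is hypothetical, and the request is only constructed here, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"


def make_sparql_request(endpoint, query):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = make_sparql_request(ENDPOINT, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
print(request.full_url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(request.full_url).query)["query"][0]
print(decoded)
```

Passing the prepared request to urllib.request.urlopen would execute it against a live endpoint; that step is left out so the sketch runs without network access.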

Questions for Tech Committee. Ontology content and format: Regarding reference content, how comprehensive and refined does it need to be? How can it relate to records and differing levels of detail? Who’s going to maintain the ontology? Is there a practical, agile approach to creating, managing, extending? Standards-driven? (UML, OWL, or ERD, etc.) Investigate service versioning and how backward compatibility might be implemented. API mechanism (OData, GData, SPARQL?): Is there a standards-based approach?

Moving Forward? Resource Description Framework. Semantic Technologies provide a unified, standard framework for: Ontology / API representation, modification, merging; Query framework / SOA; SPARQL API; Rich, extensible Query Federation (does not require ETL); Agile ontology development & maintenance… Customers can extend the ontologies themselves. Broadly, these are standards-driven methods that will make the ELN Query Federation lifecycle much easier.

What is RDF? Resource Description Framework (“RDF”) defines and links data for more effective discovery, federation, integration and re-use across applications. RDF is a fundamentally simple, standard and extensible way of specifying data and data relationships. Ontologies describe resources and relationships according to their explicit meaning. SPARQL Protocol and RDF Query Language (“SPARQL”) supports federated queries, with a SPARQL API for publication.
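
The idea of a federated query can be sketched in a few lines of Python: each source is queried in place and the results are joined on a shared resource, with no prior copy into a common store. This is only an in-memory analogy; a real deployment would send patterns to SPARQL endpoints. The two sources, their contents and the predicate names (tests, binds) are hypothetical.

```python
def match(source, s=None, p=None, o=None):
    """Return triples in `source` matching the pattern; None is a wildcard."""
    return [
        t for t in source
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical internal ELN source.
eln_source = [("experiment-1", "tests", "compound-7")]
# Hypothetical public source describing compound/target binding.
public_source = [("compound-7", "binds", "TP53")]


def targets_for_experiment(experiment):
    """Join the two sources on the compound they both mention."""
    results = []
    for _, _, compound in match(eln_source, s=experiment, p="tests"):
        for _, _, target in match(public_source, s=compound, p="binds"):
            results.append((experiment, compound, target))
    return results


print(targets_for_experiment("experiment-1"))
```

The join only works because both sources use the same identifier for the compound; where they do not, some alignment step is needed first.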

Resource Description Framework (RDF) is… a labeled, directed graph of relations between resources and literal values. RDF graphs are collections of triples. Triples are made up of a subject, a predicate, and an object. Resources and relationships (metadata) are stored. Diagram: a subject linked by a predicate to an object. Confidential IO Informatics © 2011

Example RDF Triples. “TP53 encodes Human p53”; “p53 is a tumor suppressor protein”; “TP53 gene is located on the short arm of chromosome 17”. As triples (subject, predicate, object): (TP53, encodes, p53); (p53, is a, tumor suppressor protein); (TP53, located, chromosome 17).
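
The three example triples can be written down directly as data. The Python sketch below is a simplification: it uses plain strings where real RDF would use URIs, and the Triple type and objects_of helper are illustrative names, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The three statements from the slide, as (subject, predicate, object).
triples = [
    Triple("TP53", "encodes", "p53"),
    Triple("p53", "is a", "tumor suppressor protein"),
    Triple("TP53", "located", "chromosome 17"),
]


def objects_of(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [
        t.obj for t in triples
        if t.subject == subject and t.predicate == predicate
    ]


print(objects_of("TP53", "encodes"))
```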

Triples Connect to Form Graphs. Diagram: TP53 encodes p53; p53 is a tumor suppressor protein; TP53 located chromosome 17; two “part of” edges point to Human.
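
Triples connect into a graph wherever the object of one triple is the subject of another. The Python sketch below builds an adjacency map from the three TP53 example triples and walks it breadth-first; the helper name reachable is illustrative.

```python
from collections import defaultdict, deque

edges = [
    ("TP53", "encodes", "p53"),
    ("p53", "is a", "tumor suppressor protein"),
    ("TP53", "located", "chromosome 17"),
]

# Adjacency map: subject -> list of (predicate, object) pairs.
graph = defaultdict(list)
for subject, predicate, obj in edges:
    graph[subject].append((predicate, obj))


def reachable(start):
    """Return nodes reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for _predicate, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(reachable("TP53"))
```

Starting from TP53, the walk reaches "tumor suppressor protein" only through p53, which is the sense in which separate triples combine into one graph.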

Why RDF? What’s Different Here? Triples act as a human- and machine-readable least common denominator for expressing data and relationships. Ontologies organize data according to their human-readable meaning, to make defining and merging data intuitive… RDF supports inference and disambiguation, so merging data and adding new data and relationships without shared identifiers becomes possible…
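
As one small example of inference, the Python sketch below applies two standard RDFS entailment rules until nothing new can be derived: subclass transitivity (rdfs11) and type propagation along subclass links (rdfs9). The class hierarchy used here is an assumption for illustration, and the infer function is a naive forward-chaining loop rather than a production reasoner.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Return the closure of `triples` under RDFS rules rdfs9 and rdfs11."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # Only chain through a subclass statement about `o`.
                if o != s2 or p2 != SUBCLASS:
                    continue
                if p == TYPE:
                    new.add((s, TYPE, o2))      # rdfs9
                elif p == SUBCLASS:
                    new.add((s, SUBCLASS, o2))  # rdfs11
        if new <= closure:
            return closure
        closure |= new


# Assumed facts: the subclass statement is illustrative, not from the slides.
facts = {
    ("p53", TYPE, "TumorSuppressorProtein"),
    ("TumorSuppressorProtein", SUBCLASS, "Protein"),
}

print(sorted(infer(facts) - facts))
```

The derived statement that p53 is a Protein was never asserted directly, so a query for all proteins would find p53 only when inference is applied.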

Maturation of a Standard Framework for Resource Description. This standard method and framework for extensible open data publication is maturing. Standards and practices, ontologies, query and data model specification: W3C, ISCB, IEEE, NCBO, OBO, SKOS, LOD / LODD and related resources. Federation methods: ontology resources, URIs, SPARQL endpoints, ontology alignment, inference, D2R, SWObjects, … Scalability: security, support for transactional processing. Practice: expertise, training, larger projects (FDA, DOD, NASA, World Cup, Data.gov, Chevron, etc.).

Healthcare / Life Sciences remains the largest data sector for SPARQL APIs

Integration / Federation Options. Diagram: Data Sources (RDBMS / REST / Web Services); transformation options: Basic Transformation, Query-Based Transformation, SPARQL 2 SQL (D2R, SWObjects); Federated Semantic Access / Datastore(s); access via SPARQL APIs / SOAP / REST. Also shown: Provenance, Versioning, Governance, Meaning.

RDF / SPARQL Solution Stack. Diagram: sources (DBs, Services, Other Data Sources, Public Data, Services); SPARQL API, SPARQL conversion, ETL=>RDF; Extensible Standards-based Framework; Federated Access / Integrated Semantic Datastore(s); applications (Prediction and Simulation Apps, Query by Meaning, Rich Browsing, Other Apps).

Example Use Cases. Manufacturing: link data sources across imprecise connections to verify reports. Animal Safety: Knowledge Network for discovery, qualification and validation of cross-species biomarkers. Personalized Medicine: Knowledge Network and screening application for personalized medicine.

Example Questions. What data sources support this specific manufacturing report about product purity and shelf-life? What toxicity biomarkers are common to most animals? What patients are showing combinations of indicators that predict risk?

Qualitative Benefits. Link internal data with public resources: e.g. “out of box” linking of ELN data with LODD sources; growing public resources provide cost-effective enrichment and hypothesis generation; reduce dependencies on expensive commercial databases. Emergent properties: ELN data enrichment and knowledge building made easy! Supports inference, rich interrogation, serendipitous discovery. R&D concepts can now be translated to immediate consumer benefits. Previously “out of reach” integration and applications become practical.

Quantitative Benefits. Reduced effort, time and cost to federate, update, extend to new datasets. Start sooner: reduce initial design and deployment burden; ontologies are explicit, can be engineered from or decoupled from resources, can be altered without refactoring. Finish sooner: reduced time to create and test integrations; agile modeling, integration and testing. Extend more easily: add new data sources and applications; building blocks to add new data, create new applications; end users can adapt, modify, extend ontologies.

UBC: Knowledge Network for Organ Failure; “ASK” for Personalized Medicine. Outcome: Knowledge network for enrichment, visualization and qualification of patterns indicating risk of organ failure. Web-based deployment of SPARQL-based screening patterns across multiple data sources, indicating patients at risk.

Personalized Medicine Knowledge Network. Screening of transplant patients for likelihood of transplant failure, based on combined biomarker patterns. Web-based Knowledge Application: applies patterns for predictive screening; weighting and scoring of results; brings “hits” back into the Knowledge Network for validation of hypotheses and algorithms.
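
A weighted screening pattern of the kind described here could look roughly like the Python sketch below. It is not the application's actual algorithm: the marker names, thresholds, weights and cutoff are all invented for illustration, and integer weights are used to keep the arithmetic exact.

```python
# Hypothetical pattern: (marker name, threshold, weight). Values are invented.
PATTERN = [
    ("marker_a", 2.0, 3),
    ("marker_b", 1.0, 2),
    ("marker_c", 5.0, 1),
]
CUTOFF = 4  # hypothetical score at or above which a patient is flagged


def score_patient(measurements, pattern=PATTERN):
    """Sum the weights of all markers whose value exceeds its threshold."""
    return sum(
        weight
        for marker, threshold, weight in pattern
        if measurements.get(marker, 0.0) > threshold
    )


def is_at_risk(measurements, pattern=PATTERN, cutoff=CUTOFF):
    """Flag a patient whose combined score reaches the cutoff."""
    return score_patient(measurements, pattern) >= cutoff


# Hypothetical patient record.
patient = {"marker_a": 2.4, "marker_b": 0.3, "marker_c": 7.0}
print(score_patient(patient), is_at_risk(patient))
```

Flagged patients would then be fed back into the knowledge network so that the pattern itself can be checked against outcomes and refined.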

UBC / PROOF: Quantitative… Integration and analysis time reduced from an estimated 2 years to about 8 months (FTE equivalent). Time to capture and apply patterns reduced from days to hours. Knowledge base can be / has been extended to include new public sources in hours (days / weeks with curation). Expensive commercial database no longer needed due to ease of integrating public resources.

UBC / PROOF: Qualitative… Visual SPARQL presents queries as hypotheses. Makes it possible for researchers to iteratively create, test and refine hypotheses. Research queries were published to a web service for easy scale-up. Extended SPARQL delivers practically useful classifiers with scoring. Use of RDF makes enrichment with public sources practical. Provenance, reference annotation, original data accessible for review. “[The] ability to consume and intuitively represent a wide variety of data-types - from images to quantitative data - and more importantly, display that data in ways that make the significant features immediately obvious to our biologist end-users, has allowed us to move to a completely new level of data analysis….”

Recap – Core Benefit Drivers. Semantic technologies provide a standard method and framework for data publication and interoperability… not a new “data standard”! Low barrier to entry for data publishing: extensible building blocks for agile, growing integrations. Reduced effort, time and cost to deliver and maintain data definitions and applications, particularly those that depend on federation. Growing public resources are a catalyst and value-add. Projects that were impractical become practical to achieve and maintain.
