1
Daan Broeder
Menzo Windhouwer
Meertens Instituut
2
• Create an interoperable domain of Language
Resources (LR)
– Interoperable formats for LR content
– Persistent identification (and citation) of LRs
– Use of SAML based AAI for access to LRs
– Use of the Component Metadata Infrastructure (CMDI) for describing
LRs
3
• Created as a response to a fragmented situation of LR metadata
• Flexible
– Not a single schema, but support for different metadata schemas
– Different schemas for different situations
– Semantic interoperability via linking to semantic registries
• Community driven
– Communities can model their own metadata schemas
– They know their data and can create the right schema
– They know the right terminology
• Sharing
– Concepts, Terminology, Vocabularies
• CLARIN Concept Registry for linguistic concepts,
• ISO 639 and other relevant vocabularies
• CLAVAS for organisation names
– Components & profiles via the CLARIN metadata component registry
4
• A Component groups together metadata
Elements, which naturally belong together
to describe a property of the resources
– The Location where a SpeechRecording took place
– The Location of an Actor
– A Location is described by an address and/or region and/or
country and/or continent
• Components can be nested
– The Language a specific Actor speaks
– An Actor who takes part in a SpeechRecording for a
specific Project
• A Profile is a specific collection of
Components for a specific type of
resources, e.g., speech recordings
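[Figure: example profile SpeechRecording (P) built from the components (C) Actor, Location, Project, Language and TechnicalMetadata; the Location and Language components each appear twice, and Location contains the elements (E) address, region, country and continent]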
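The component model above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names, the exact nesting of the SpeechRecording profile, and the `name` elements given to Language and Project are hypothetical and not taken from the CLARIN component registry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    name: str


@dataclass
class Component:
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)


@dataclass
class Profile:
    name: str
    components: List[Component] = field(default_factory=list)


def make_location() -> Component:
    # Reusable Location component with the four elements named on the slide.
    names = ("address", "region", "country", "continent")
    return Component("Location", [Element(n) for n in names])


def paths(node, prefix=""):
    """Yield a slash-separated path for every element below a profile or component."""
    here = f"{prefix}/{node.name}"
    for el in getattr(node, "elements", []):  # a Profile has no direct elements
        yield f"{here}/{el.name}"
    for comp in node.components:
        yield from paths(comp, here)


# Assumed nesting: the recording has a Location, and each Actor has its own
# Location and Language. The "name" elements are placeholders.
speech_recording = Profile("SpeechRecording", [
    make_location(),
    Component("Actor", components=[
        make_location(),
        Component("Language", [Element("name")]),
    ]),
    Component("Project", [Element("name")]),
])

for p in paths(speech_recording):
    print(p)
```

The same Location component is reused in two places, which is the point of the component approach: it is defined once and nested wherever it is needed.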
5
[Figure: CMDI infrastructure overview. Roles: metadata modeler, metadata creator, metadata curator, metadata user. Building blocks: component registry & editor, metadata editor, Concept Registry, Relation Registry, local metadata repository with OAI-PMH provider, OAI-PMH harvester, joint metadata repository, metadata catalogue, search & semantic mapping, resources]
6
• Started in 2010, version 1.2 released in 2016 supporting
remote vocabularies
• Actively supported by CLARIN ERIC and several national CLARIN
consortia
• Many supporting tools:
– VLO, COMEDI, ARBIL, CMDI maker, Virtual Collection Registry …
• Link to the Linked (open) Data world: CMDI2RDF
[Figure: CMDI → CMDI2RDF → LOD]
7
• Started as a 2014 CLARIN NL project by TLA/MPI and DANS
• Now a service supported by CLARIAH WP2 (X11.400)
• Linking also to other ‘linguistic’ LoD information sources:
– WALS for linguistic typology information
– CLAVAS organization names
– DBpedia (currently only used as glue)
• Automatic synchronization with the CMDI metadata
• Simplification of the RDFS model for CMDI
8
• CMD is classic XML, constrained by a W3C XML Schema
• To map a CMD record to RDF we need
– A mapping for the basic component model to RDFS
• Basic classes and properties to represent profiles, components,
elements, attributes and their relationships and values
– A mapping for a specific profile or component to RDFS
• A specific subclass or subproperty of the basic component model
– A mapping for specific metadata records to RDF instances of RDFS
• Instances of profile or component
– Additionally there is a generic CMD envelope that is mapped using
common LOD vocabularies
9
• Basic CMD model is described by ISO/DIS 24622-1
– 1st part of ISO TC 37 SC 4 3 CMD standards family
• Natural mapping to RDF would be:
– Profiles/components to RDF Classes
– Elements to RDF Properties
• Complication
– CLARIN’s CMDI allows attributes on both Components and Elements
– So elements have to be RDF Classes as well
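The schema-level mapping can be sketched as follows. This is an illustration under assumptions, not the CMD2RDF implementation: the `http://example.org/cmd#` namespace and the base terms `Component`, `Element` and `hasElementValue` are hypothetical, and only the RDF and RDFS URIs are real. Components become subclasses of a base class; elements become properties unless they carry attributes, in which case they become classes as well.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CMD = "http://example.org/cmd#"  # hypothetical namespace for the basic component model


def schema_triples(component, elements, elements_with_attributes=()):
    """Return RDFS triples for one component and its elements.

    Elements listed in elements_with_attributes are modelled as classes,
    because a plain property cannot carry the attribute values.
    """
    comp = CMD + component
    triples = [(comp, RDFS + "subClassOf", CMD + "Component")]
    for el in elements:
        term = CMD + el
        if el in elements_with_attributes:
            # Complication from the slide: the element needs a node of its own.
            triples.append((term, RDFS + "subClassOf", CMD + "Element"))
        else:
            # Natural mapping: the element is a property of the component.
            triples.append((term, RDFS + "subPropertyOf", CMD + "hasElementValue"))
            triples.append((term, RDFS + "domain", comp))
    return triples


for triple in schema_triples("Location", ["country", "region"], {"region"}):
    print(triple)
```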
10
• Nevertheless, this introduces extra hierarchy
• CMDI is already a hierarchical metadata schema
• Human readability decreases
• Other solutions welcome!
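Simplified example
Without attributes, Age is a plain property of the resource R: R –Age→ 14
<Description URI= …. >
<Age>14</Age>
…
</Description>
With a status attribute, Age becomes a node of its own that carries the value 14 and the status U: R –Age→ node (14, status U)
<Description…. >
<Age status="U">14</Age>
…
</Description>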
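The effect of attributes on the instance level can be illustrated with a small converter. It is a sketch under assumptions: triples are plain Python tuples, the `value` predicate and the blank-node naming are hypothetical, and the real CMD2RDF transforms are written in XSLT.

```python
import itertools
import xml.etree.ElementTree as ET


def element_triples(subject, xml_fragment, counter=None):
    """Convert one metadata element into triples about `subject`."""
    if counter is None:
        counter = itertools.count(1)
    el = ET.fromstring(xml_fragment)
    value = (el.text or "").strip()
    if not el.attrib:
        # No attributes: the element maps directly to a property with a literal.
        return [(subject, el.tag, value)]
    # Attributes present: introduce an intermediate node holding value and attributes.
    node = f"_:b{next(counter)}"
    triples = [(subject, el.tag, node), (node, "value", value)]
    triples += [(node, name, val) for name, val in sorted(el.attrib.items())]
    return triples


print(element_triples("R", "<Age>14</Age>"))
print(element_triples("R", '<Age status="U">14</Age>'))
```

The second call returns three triples instead of one, which is the extra hierarchy the slide refers to.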
11
[Figure: CMD2RDF architecture. Labels: OAI harvester; CLARIN joint metadata domain; CMD2RDF (conversion, enrichment); Component Registry, CLAVAS, WALS; caching; Virtuoso; CMD-RDF (SPARQL, REST, browse); (L)L(O)D cloud]
Technology:
• Virtuoso RDF store
• Elda as browser
• Tomcat as application server
• Conversion pipeline in Java
• Core transforms in XSLT
• All source code on GitHub
• Docker build file & images available
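Since the service exposes SPARQL, a client can query it through the standard SPARQL protocol, in which a GET request carries the query in a `query` parameter. The sketch below only builds such a request URL; the endpoint address is a placeholder and not the address of the CMD2RDF service, and the query uses no CMD-specific vocabulary.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL with a percent-encoded query."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Placeholder endpoint; replace it with the address of an actual SPARQL service.
ENDPOINT = "http://example.org/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

print(sparql_get_url(ENDPOINT, QUERY))
```

The resulting URL can be fetched with any HTTP client; the response format depends on the server and on the `Accept` header that is sent.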
12
13
• Offers LoD for different LR
metadata infrastructures
– LRE Map (LREC)
– META-SHARE
– CLARIN
– DataHub (linguistic part)
• However
– For CLARIN, only data with DC
profiles
• Just a small part of CLARIN
– Seems partly based on static old
data dumps
14
• Goals:
– Find metadata-like information about LRs in LD format
– Translate that into a ‘suitable’ CMDI profile based metadata record
• Is there such LD that is not already available directly in another
format (OLAC, CLARIN, DC, META-SHARE)?
– If so, it is useful to have this metadata in the CLARIN VLO metadata catalogue
– Humanities data archives will mostly have DC (inventory available from
different projects, e.g. DASISH) and frequently offer LD
– Easier ways exist to translate DC into CMDI (e.g. the CMDI DC profile)
– But LD can be a pivot set for many such translations
• Still in exploratory phase
– Would like to use a general strategy
– It is very labor intensive to craft specific transformations for every LD set
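One possible shape of such a general strategy is to turn every triple about a resource into an element named after the predicate. The sketch below is an assumption-laden illustration, not RDF2CMD itself: the record is a simplified, namespace-free skeleton with only a `CMD` root and a `Components` payload, and the profile name `DcRecord` is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return uri.rstrip("/#").rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def triples_to_record(subject, triples, profile="DcRecord"):
    """Build a simplified CMDI-like record from the triples about `subject`."""
    cmd = ET.Element("CMD")
    components = ET.SubElement(cmd, "Components")
    payload = ET.SubElement(components, profile)
    for s, p, o in triples:
        if s == subject:  # ignore triples about other resources
            ET.SubElement(payload, local_name(p)).text = o
    return ET.tostring(cmd, encoding="unicode")


DC = "http://purl.org/dc/elements/1.1/"
example = [
    ("http://example.org/lr/1", DC + "title", "Corpus X"),
    ("http://example.org/lr/1", DC + "language", "nld"),
    ("http://example.org/lr/2", DC + "title", "Other corpus"),
]
print(triples_to_record("http://example.org/lr/1", example))
```

A generic mapping like this avoids hand-written transformations per LD set, at the price of producing flat records that still need a matching profile in the component registry.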
15
• Useful for CLARIN?
– Enriching existing CMDI metadata and
recycling them
– Relations to already known sources such as:
• WALS, DBpedia, CLAVAS, Glottolog, …
• Relations to CLARIAH LD sources?
– Enable the VLO (or an alternative browser)
for visualizing this information
– Increasing metadata quality:
• Use CLAVAS to repair errors
• Include preferred labels
– Some CMDI adaptations required
• Foreign namespace support in CMDI
payload
[Figure: RDF2CMD data flow with steps A, B and C. Labels: CLARIN centres, CLARIAH?, CMDI, RDF store, DBpedia, Glottolog, RDF2CMD, enriched CMDI, VLO]
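Repairing organisation names against a controlled vocabulary can be as simple as a normalised lookup that returns the preferred label. The sketch below assumes a vocabulary given as a mapping from preferred label to alternative labels; the entries are made up for illustration and are not taken from CLAVAS.

```python
def normalise(label):
    """Collapse whitespace and ignore case so that near-identical spellings match."""
    return " ".join(label.split()).casefold()


def build_index(vocabulary):
    """Map every normalised preferred or alternative label to its preferred label."""
    index = {}
    for preferred, alternatives in vocabulary.items():
        for label in (preferred, *alternatives):
            index[normalise(label)] = preferred
    return index


def repair(value, index):
    """Return the preferred label for a known name, or the value unchanged."""
    return index.get(normalise(value), value)


# Made-up vocabulary entries, used only to demonstrate the lookup.
vocabulary = {"Meertens Instituut": ["Meertens Institute", "Meertens Inst."]}
index = build_index(vocabulary)

print(repair("meertens   institute", index))
print(repair("Unknown Organisation", index))
```

Unknown names pass through unchanged, so the repair step never invents a label that the vocabulary does not contain.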
16
http://cmdi2rdf.meertens.knaw.nl/cmd2rdf/


Editor's Notes

  • #12 Virtuoso as triple store, Tomcat as application server, Elda as browser; conversion pipeline in Java, core transforms in XSLT; all in a Docker package. Code all on GitHub: