What‘s all the data about –
profiling and interlinking Web datasets
Stefan Dietze
L3S Research Center
27/03/14 Stefan Dietze
Recent work on Linked Data exploration/discovery/search
 Entity interlinking & dataset interlinking recommendation
 Dataset profiling
 Data consistency & conflicts
Research areas
 Web science, Information Retrieval, Semantic Web & Linked
Data, data & knowledge integration (mapping, classification,
interlinking)
 Application domains: education/TEL, Web archiving, …
Some projects
Introduction
http://www.l3s.de/
 See also: http://purl.org/dietze
…why are there so few datasets actually used?
 Data reuse and in-links focus on trusted „reference
graphs“ such as DBpedia, Freebase etc.
 Long tail of LD datasets which are neither reused nor linked
to (LOD Cloud alone 300+ datasets, 50 bn triples)
 Explanations?
Linked Data is awesome, but...
 „HTTP-accessibility“
(SPARQL, URI-dereferencing)
 „Structure“ & „Semantics“
(=> shared/linked vocabularies)
 „Interlinked“
 „Persistent“
Hm,
really?
Linked data is more diverse than we think
SPARQL Web-Querying Infrastructure: Ready for Action?,
Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves
Vandenbussche, International Semantic Web Conference 2013
(ISWC2013).
SPARQL endpoint availability over time [Buil-Aranda et al 2013]
Accessibility of datasets?
 Less than 50% of all SPARQL endpoints actually responsive
at given point of time
 “THE” SPARQL protocol? No, but many variants & subsets
 …
Shared vocabularies & schemas, but:
 …still very heterogeneous [d’Aquin, WebSci13]
 …data partially messy and not conformant
(RDFS, schemas) [HoganJWS2012]
 …even widely used reference datasets such as
DBpedia noisy [Paulheim2013]
Co-occurrence graph of data types in 146 datasets:
144 vocabularies, 588 highly overlapping types,
719 properties
Assessing the Educational Linked Data Landscape, D’Aquin, M.,
Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris,
France, May 2013.
Type Inference on Noisy RDF Data, Paulheim H., Bizer, C. Semantic
Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218,
2013, pp 510-525
An empirical survey of Linked Data conformance, Hogan, A., Umbrich,
J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web
Semantics 14: pp. 14–44, 2012.
What about data consistency?
Inconsistency and Incompleteness of Linked Datasets – a
Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web
Science 2014, WebSci14, under review.
Too many/diverse datasets, too little information
 Which datasets are useful & trustworthy for case
XY (eg „learning about the solar system“) ? Which
topics are covered?
 Types: which datasets describe statistics, videos,
slides, publications etc?
 Currentness, dynamics, accessibility/reliability,
data quantity & quality?
Data curation and dataset profiling
Dataset
Catalog/Registry
 Catalog of data: classification of
datasets according to resource
types, disciplines/topics, data
quality, accessibility, etc.
 Infrastructure for
distributed/federated querying
describes
db:Astro. Objects
Dataset profiling: what’s all the data about
Dataset
Metadata
BIBO
AIISO
FOAF
contains
Entity disambiguation &
linking [ESWC13]
Topic profile extraction
[WWW13, ESWC14]
db:Astronomy
db:Astro. Objects
Dataset
Catalog/Registry
yov:Video
po:Programme
BBC Programme
<po:Programme …>
<po:Series>Wonders of the Solar System</po:Series>
<po:Actor>Brian Cox</po:Actor>
</po:Programme…>
<yo:Video …>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video…>
Yovisto Video
bibo:Film
Schema mappings
[WebSci13]
Schemas/vocabularies on the Web: XKCD 927
https://xkcd.com/927/
Schema assessment and mapping
Co-occurrence of
data types
(in 146 datasets:
144 Vocabularies,
588 highly
overlapping types,
719 Properties)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<title>Viral diseases &
bacteria</title>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
?
http://datahub.io/group/linked-education
Schema assessment and mapping
Co-occurrence of
data types
(in 146 datasets:
144 Vocabularies,
588 highly
overlapping types,
719 Properties)
Co-occurrence after
mapping into most
frequent schemas
(201 frequent types
mapped into 79
classes)
Assessing the Educational Linked Data Landscape,
D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science
2013 (WebSci2013), Paris, France, May 2013.
bibo:Slideshow
bibo:Film
bibo:Document
<po:Programme …>
<po:title>Secret Universe –
The Life of the Cell</po:title>
…
</po:Programme…>
BBC Programme
<sioc:Item …>
<title>Viral diseases &
bacteria</title>
…
</sioc:Item ….>
SlideShare Set
po:Programme
sioc:Item
LinkedUp Data Catalog
in a nutshell http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
 RDF (VoID) dataset catalog: browse &
query distributed datasets
 Live information about endpoint
accessibility
 Federated queries using type mappings
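A hedged sketch of how type mappings could drive a federated query: one target class is expanded into the source-specific types mapped to it. The mapping table and function name are hypothetical and not the catalog's actual API; the `bibo:Film` pairs echo the slides, while `sioc:Item` to `bibo:Slideshow` is an assumed pairing. PREFIX declarations are omitted for brevity.

```python
# Hypothetical mapping: shared target class -> source-specific types.
TYPE_MAPPINGS = {
    "bibo:Film": ["po:Programme", "yov:Video"],
    "bibo:Slideshow": ["sioc:Item"],  # assumed pairing, for illustration only
}


def build_type_query(target_class: str, limit: int = 10) -> str:
    """Build a SPARQL query matching the target class and all types mapped to it."""
    source_types = TYPE_MAPPINGS.get(target_class, [])
    types = " ".join([target_class, *source_types])
    return (
        "SELECT ?resource ?type WHERE {\n"
        f"  VALUES ?type {{ {types} }}\n"
        "  ?resource a ?type .\n"
        f"}} LIMIT {limit}"
    )


print(build_type_query("bibo:Film"))
```

The same query text can then be sent to every responsive endpoint in the catalog, so that a request for films also returns programmes and videos described with other vocabularies.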
http://datahub.io/group/linked-education
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
Topics/categories addressed?
Relatedness of resources/entities?
(types, semantics)
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Combining a co-occurrence-based and a semantic measure
for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R.
Kawase, B. Fetahu, and W. Nejdl., ESWC 2013 - 10th Extended
Semantic Web Conference, (May 2013).
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B., Dietze, S.,
Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended
Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
Challenge: semantics of resources/datasets?
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Data disambiguation (for linking & profiling)
Brian Cox?
Sun?
Pluto?
db:Pluto (Dwarf Planet)
db:Astronomical Objects
db:Sun
Data disambiguation using background knowledge
„Semantic relatedness“ of resources?
db:Astronomy
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<yo:Video 8748720>
<dc:title>Pluto & the
Dwarf Planets</dc:title>
…
</yo:Video 8748720>
Video
db:Pluto (Dwarf Planet)
db:Astronomical Objects
<yov:Lecture8748720>
<title>Pluto & the Dwarf
Planets</title>
…
</yov:Lecture8748720>
Online Lecture
db:Astronomy
 Computation of connectivity scores
between resources/entities
 Method: combination of
 (i) a semantic (graph-based) connectivity
score (SCS) with
 (ii) a Web co-occurrence-based measure
(CBM), similar to the Normalized Google Distance (NGD)
 For (i): adaptation of the Katz index from social network analysis (SNA)
for (linked) data graphs (considering the number
and lengths of paths over transversal
properties)
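The two ingredients can be sketched as follows. This is not the SCS/CBM implementation from the ESWC 2013 paper: the first function is a plain truncated Katz index over an undirected graph, the second is the standard Normalized Google Distance used as a stand-in for the co-occurrence measure. The decay factor `beta`, the cut-off `max_len` and the toy graph are assumptions.

```python
import math
from collections import defaultdict


def katz_score(edges, source, target, beta=0.5, max_len=3):
    """Truncated Katz index: sum over l of beta**l * (number of walks of length l)."""
    neighbours = defaultdict(set)
    for a, b in edges:  # links are treated as undirected
        neighbours[a].add(b)
        neighbours[b].add(a)
    walks = {source: 1}  # walks[n] = number of walks of the current length ending at n
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for node, count in walks.items():
            for nb in neighbours[node]:
                nxt[nb] += count
        walks = nxt
        score += (beta ** length) * walks.get(target, 0)
    return score


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts fx, fy, joint count fxy and index size n (0 = always co-occur)."""
    if fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Toy graph mirroring the slide: Pluto and Sun share a category.
edges = [
    ("db:Pluto", "db:Astronomical_Objects"),
    ("db:Sun", "db:Astronomical_Objects"),
    ("db:Astronomical_Objects", "db:Astronomy"),
]
print(katz_score(edges, "db:Pluto", "db:Sun", beta=0.5, max_len=2))
```

Shorter and more numerous connections raise the graph score, while a lower distance in the second function indicates that two labels co-occur more often on the Web.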
db:Sun
SCS = 0.32
CBM = 0.24
http://purl.org/vol/doc/
http://purl.org/vol/ns/
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
Entity linking: semantic relatedness
<sioc:Item 2139393292>
<title>Planetary motion
& gravity</title>
…
</sioc:Item 2139393292>
Slideset
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Entity linking: evaluation
 Evaluation based on USA Today news items (80,000 entity pairs)
 Manually created gold standard
(1,000 entity pairs)
 Baseline: Explicit Semantic Analysis (ESA)
=> CBM/SCS: „relatedness“; ESA: „similarity“
Precision/Recall/F1 for SCS, CBM, ESA.
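For reference, a small sketch of how precision, recall and F1 can be computed for a set of predicted entity pairs against a gold standard. The pair representation and the example data are illustrative assumptions, not the evaluation data from the paper.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted entity pairs with gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical pairs, for illustration only.
predicted_pairs = {("db:Sun", "db:Pluto"), ("db:Sun", "db:Brian_Cox")}
gold_pairs = {("db:Sun", "db:Pluto"), ("db:Pluto", "db:Astronomy")}
print(precision_recall_f1(predicted_pairs, gold_pairs))
```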
Combining a co-occurrence-based and a semantic
measure for entity linking, B. P. Nunes, S. Dietze, M.A.
Casanova, R. Kawase, B. Fetahu, and W. Nejdl., ESWC 2013
- 10th Extended Semantic Web Conference, (May 2013).
db:Astronomical Objects
db:Astronomy
db:Sun
 Extracting representative metadata („topic profile“) for each dataset
 Ranking of most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets
 Scalability vs representativeness: sampling & ranking for good scalability/accuracy balance
DBpedia category graph
Dataset profiling: what‘s the data about?
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
<po:Programme519215>
<po:Series>Wonders of the Solar
System</po:Series>
<po:Episode>Emp. of the Sun</po:Episode>
<po:Actor>Brian Cox</po:Actor>
</po:Programme519215 >
Programme
Dataset profiling: approach
1. Sampling of resource instances
(random sampling, weighted sampling, resource
centrality sampling)
2. Entity and topic extraction (NER via DBpedia
Spotlight, category mapping and expansion)
3. Normalisation and ranking (using graphical
models such as PageRank with Priors, HITS with
Priors and K-Step Markov)
=> Result: weighted dataset-topic profile graph
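Step 3 can be illustrated with a small PageRank-with-priors power iteration over an entity-to-category graph. This is a sketch under assumptions, not the authors' pipeline: the damping factor, the iteration count, the handling of dangling nodes and the toy graph are all choices made here for illustration.

```python
def pagerank_with_priors(edges, priors, damping=0.85, iterations=50):
    """Rank nodes of a directed graph, restarting at nodes in proportion to priors."""
    total = sum(priors.values())
    if total <= 0:
        raise ValueError("priors must contain positive weight")
    nodes = {n for edge in edges for n in edge} | set(priors)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * prior[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: hand its mass back according to the priors.
                for m in nodes:
                    nxt[m] += damping * rank[n] * prior[m]
        rank = nxt
    return rank


# Toy graph: sampled entities point to their categories; priors sit on the entities.
edges = [
    ("e:Pluto", "c:Dwarf_planets"),
    ("e:Pluto", "c:Astronomy"),
    ("e:Sun", "c:Astronomy"),
]
ranks = pagerank_with_priors(edges, priors={"e:Pluto": 1.0, "e:Sun": 1.0})
topics = sorted((n for n in ranks if n.startswith("c:")), key=ranks.get, reverse=True)
print(topics)
```

Categories reached from more of the sampled entities accumulate more rank, which is the intuition behind a weighted dataset-topic profile.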
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles, Fetahu, B.,
Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W.,
11th Extended Semantic Web Conference
(ESWC2014), Crete, Greece, (2014).
Dataset profiling: exploring LOD datasets/topics
in a nutshell http://data-observatory.org/lod-profiles/
 Automatic extraction of dataset “topics” [ESWC2014]
 Visualisation & exploration of dataset-topic graph
(datasets, topics, relationships)
 Includes all (responsive) datasets of LOD Cloud
Dataset profiling: results evaluation
Figure: NDCG (averaged over all datasets)
Datasets & Ground Truth
 Yovisto, Oxpoints, LAK Dataset, Semantic Web
Dogfood
 Crowd-sourced topic indicators from datasets
(keywords, tags)
 Manual mapping to entities & category extraction
(ranking according to frequency)
Baselines
 1) LDA, 2) tf/idf (applied to entire datasets)
 Topic extraction according to our approach,
weighting/ranking based on term weight
Measure
 NDCG @ rank l
 Performance (time/NDCG) for different sampling
strategies/sizes etc
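A short sketch of NDCG at rank l, using the linear-gain form of DCG with a log2 discount. Which DCG variant the evaluation used is not stated on the slide, so this choice is an assumption, as is computing the ideal ranking from the same list of relevance grades.

```python
import math


def dcg_at(relevances, l):
    """Discounted cumulative gain over the first l relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:l]))


def ndcg_at(ranked_relevances, l):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at(sorted(ranked_relevances, reverse=True), l)
    return dcg_at(ranked_relevances, l) / ideal if ideal > 0 else 0.0


# Relevance grades of the topics returned for one dataset, in ranked order.
print(ndcg_at([3, 2, 1], 3))
print(ndcg_at([1, 2, 3], 3))
```

Averaging `ndcg_at` over all evaluated datasets gives the kind of figure reported in the results plot.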
dbp:Category:Royal_Medal_winners
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
dbp:Category:World_Wide_Web
What have these categories in common?
Diversity of category profile for a single paper
Berners-Lee, Tim; Hendler, James, Ora Lassila (2001). "The Semantic Web".
Scientific American Magazine.
person
document
dbp:Tim_Berners-Lee
dbp:Category:1955_births
dbp:Category:People_from_London
dbp:Category:Buzzwords
dbp:Semantic_Web
dbp:Category:Semantic_Web
dbp:Category:Web_Services
dbp:Category:HTTP
dbp:Category:Unitarian_Universalists
first-level categories (dcterms:subject)
dbp:Category:World_Wide_Web
dbp:Category:Royal_Medal_winners
 DBpedia category graph not an ideal “topic” vocabulary:
 Broad and noisy
 “Categories” vs “topics” (for capturing disciplines, thesauri
like UMBEL or UNESCO Thesaurus seem better suited)
 Hierarchy ?
 Filtering of certain partitions of category graph (too generic
categories etc)
 Mixing categories across resource types (document, person)
creates “perceived noise”
 But: broadness is useful as general vocabulary for
categorisation of all sorts of resource types
Dataset profiling: some lessons learned
http://data-observatory.org/led-explorer/
 Type specific views on datasets/
categories
 “Document” (foaf:Document)
 “Person” (foaf:Person)
 “Course” (aiiso:Course)
 Currently applied to datasets in
LinkedUp Catalog only (as
schema mappings already
available here)
Type-specific exploration of dataset categories
Dataset interlinking recommendation
Candidate datasets for interlinking?
[Figure: target dataset t with existing linksets Linkset1 and Linkset2]
Problem
 Given a dataset t, rank the datasets d_i in D
according to a score(d_i, t): the probability that d_i
contains linking candidates (entities) for t
 Features:
 Vocabulary overlap
 Existing links (SNA)
 Datasets more likely to contain linking
candidates if they (a) share common
schema elements, or (b) already link to t
or datasets t links to (friend of a friend)
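The two features can be combined into a toy ranking as sketched below. This is not the method of the cited WISE/ICWE papers: the Jaccard overlap, the binary link signal, the equal weighting and the example datasets are assumptions made for illustration.

```python
def rank_candidates(target, vocabularies, links):
    """Rank datasets as interlinking candidates for target.

    vocabularies: dataset -> set of vocabulary terms it uses
    links: dataset -> set of datasets it already links to
    """
    t_vocab = vocabularies[target]
    t_links = links.get(target, set())
    scores = {}
    for dataset, vocab in vocabularies.items():
        if dataset == target:
            continue
        union = t_vocab | vocab
        # (a) shared schema elements, measured as Jaccard overlap
        overlap = len(t_vocab & vocab) / len(union) if union else 0.0
        d_links = links.get(dataset, set())
        # (b) dataset already links to target, or to a dataset target links to
        social = 1.0 if target in d_links or (d_links & t_links) else 0.0
        scores[dataset] = 0.5 * overlap + 0.5 * social
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical datasets, for illustration only.
vocabularies = {
    "t": {"foaf", "dc"},
    "A": {"foaf", "dc"},
    "B": {"bibo"},
    "C": {"foaf"},
}
links = {"t": {"DBLP"}, "A": {"t"}, "B": {"DBLP"}, "C": set()}
print(rank_candidates("t", vocabularies, links))
```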
Conclusions
 Roughly 60% MAP for both approaches
 Future work: quantity of links, more
remote links, extraction of dataset links
rather than data from DataHub
Lopes, G.R., Paes Leme, L.A.P., Nunes, B.P., Casanova, M.A.,
Dietze, S., Recommending Tripleset Interlinking through a
Social Network Approach, The 14th International Conference
on Web Information System Engineering (WISE 2013),
Nanjing, China, 2013.
Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova,
M.A., Dietze, S., Identifying candidate datasets for data
interlinking, in Proceedings of the 13th International
Conference on Web Engineering, (2013).
Rank
1 DBLP
2 ACM
3 OAI
4 CiteSeer
5 IBM
6 Roma
7 IEEE
8 Ulm
9 Pisa
Success models:
data & applications
 LinkedUp Challenge
to identify innovative
tools & applications
 Evaluation methods
and approaches
“LinkedUp” – Linking Web Data (for Education)
Data linking & curation
Technology transfer
& community-building
 Collecting & exposing open
data
=> LinkedUp Data Catalog
 Profiling and linking of Web
Data for education
=> educational data graph
[ESWC2013], [ISWC2013]
 Disseminating knowledge &
building communities
(educators, computer
scientists, data engineers)
 Gathering stakeholder
feedback: use cases, and
requirements
http://linkedup-challenge.org/#usecases
http://linkedup-project.eu/events
http://www.linkedup-challenge.org/
http://data.linkededucation.org
European support action to
advance take-up of open
data & related technologies
http://www.linkedup-project.eu
Who we are
LinkedUp Network
LinkedUp Consortium
LinkedUp Advisory Board
LinkedUp Challenge: using open data (for learning)
 Open data competition to promote tools and applications that analyse / integrate (Linked)
Web data
 Organised by the LinkedUp project over 2 years (“Veni”, “Vidi”, “Vici”) with 40,000 EUR in awards
 Veni Competition: 22 submissions, 8 shortlisted for presentation at the Open Knowledge
Conference (17 September, Geneva, Switzerland)
http://linkedup-challenge.org
Veni (May – September 2013)
 Open track only
 Final events at OKCon 2013
(September 2013, Geneva)
Vidi (October 2013 – May 2014)
 Open & focused track(s)
 Final events at ESWC2014
(May, Crete)
Vici (May 2014 – October 2014)
 Open track & focused tracks
 Submission details and calls to be
released soon
 Final events at ISWC2014
(October, Riva del Garda, Italy)
The Veni shortlist & winners
DataConf.
KnowNodes
Mismuseos
ReCredible
YourHistory
PoliMedia – 1st prize: http://www.polimedia.nl/
GlobeTown – 2nd prize: http://www.globe-town.org/
WeShare – 3rd prize / people's choice: http://seek.cloud.gsic.tel.uva.es/weshare/
data.l3s.de – a DataHub for the L3S
Learning Analytics & Knowledge Dataset & Challenge
Facilitating Research on Learning Analytics and EDM
in a nutshell
http://lak.linkededucation.org/
LAK Dataset (450 publications in RDF/R)
 ACM International Conference on Learning Analytics and
Knowledge (LAK) (2011-13)
 International Conference on Educational Data Mining (2008-13)
 Journal of Educational Data Mining (2008-12)
LAK Data Challenge
 Analyse, explore & correlate the LAK Dataset
 At ACM LAK 2014 (April 2014, Indianapolis)
KEYSTONE COST ACTION
http://www.keystone-cost.eu/
 Research network focused on distributed search and
dataset profiling, related to Semantic Web, databases, etc.
 Running 2013-2017
 WG1: Representation of structured data sources
 WG2: Keyword search
 WG3: User interaction and query interpretation
 WG4: Research integration, showcases,
benchmarks, and evaluations
 Open to new members (even beyond Europe)
 Joint workshops (eg PROFILES2014 @ ESWC2014)
Ongoing/future work … and some upcoming events
Linked Data evolution, preservation, consistency
 In RDF graphs (eg LOD Cloud), „all“ nodes are connected
 LD preservation: which datasets to preserve (direct links
or even more distant neighbours)?
=> semantic relatedness as guidance for scalable
preservation strategies /data enrichment
 Link correctness in evolving LD
 Investigating impact of changes on link correctness
(weekly LOD crawls over 1 year time span)
 Application: informed preservation strategies
 Conflict detection and LD quality (link quality, impact of
conflicts in distant nodes)
 PROFILES workshop @ ESWC2014
(http://keystone-cost.eu/profiles2014)
 26 May 2014, Crete, Greece
 Linking User Data 2014 at UMAP2014
(http://liud.linkededucation.org)
 Deadline: 1 April
 Online Learning & LD Tutorial at WWW2014
(http://www2014.kr/)
 07 April, Seoul
Thank you!
WWW
See also (general)
 http://linkedup-project.eu
 http://linkededucation.org
 http://data.l3s.de
http://purl.org/dietze
See also (data)
 http://data.linkededucation.org
 http://data.linkededucation.org/linkedup/catalog/
 http://lak.linkededucation.org
 Besnik Fetahu (L3S)
 Bernardo Pereira Nunes (PUC Rio)
 Marco Casanova (PUC Rio)
 Luiz Andre Paes Leme (PUC Rio)
 Giseli Lopes (PUC Rio)
 Davide Taibi (CNR, IT)
 Mathieu d’Aquin (Open University, UK)
 and many more…
Acknowledgements

What's all the data about? - Linking and Profiling of Linked Datasets

  • 1. What's all the data about – profiling and interlinking Web datasets. Stefan Dietze, L3S Research Center, 27/03/14
  • 2. Introduction. Recent work on Linked Data exploration/discovery/search: entity interlinking & dataset interlinking recommendation; dataset profiling; data consistency & conflicts. Research areas: Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking). Application domains: education/TEL, Web archiving, … Some projects: see http://www.l3s.de/. See also: http://purl.org/dietze
  • 3. Linked Data is awesome, but... why are so few datasets actually used? Data reuse and in-links focus on trusted "reference graphs" such as DBpedia, Freebase, etc. A long tail of LD datasets is neither reused nor linked to (LOD Cloud alone: 300+ datasets, 50 bn triples). Explanations? "HTTP-accessibility" (SPARQL, URI dereferencing); "Structure" & "Semantics" (=> shared/linked vocabularies); "Interlinked"; "Persistent". Hm, really?
  • 4. Linked data is more diverse than we think. Accessibility of datasets? Less than 50% of all SPARQL endpoints are actually responsive at a given point in time (figure: SPARQL endpoint availability over time [Buil-Aranda et al 2013]); there is no single "THE" SPARQL protocol, but many variants & subsets; … Shared vocabularies & schemas, but: still very heterogeneous [d'Aquin, WebSci13] (figure: co-occurrence graph of data types in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties); data partially messy and not conformant (RDFS, schemas) [HoganJWS2012]; even widely used reference datasets such as DBpedia are noisy [Paulheim2013]. References: SPARQL Web-Querying Infrastructure: Ready for Action?, Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., International Semantic Web Conference 2013 (ISWC2013). Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013. Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp. 510-525. An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14: pp. 14–44, 2012.
  • 5. What about data consistency? Inconsistency and Incompleteness of Linked Datasets – a Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., Web Science 2014 (WebSci14), under review.
  • 6. Too many/diverse datasets, too little information. Which datasets are useful & trustworthy for case XY (e.g. "learning about the solar system")? Which topics are covered? Types: which datasets describe statistics, videos, slides, publications, etc.? Currentness, dynamics, accessibility/reliability, data quantity & quality?
  • 7. Data curation and dataset profiling. A Dataset Catalog/Registry describes the datasets: a catalog of data that classifies datasets according to resource types, disciplines/topics, data quality, accessibility, etc.; an infrastructure for distributed/federated querying. Which datasets are useful & trustworthy for case XY (e.g. "learning about the solar system")? Which topics are covered? Types: which datasets describe statistics, videos, slides, publications, etc.? Currentness, dynamics, accessibility/reliability, data quantity & quality?
  • 8. Dataset profiling: what's all the data about. The Dataset Catalog/Registry contains dataset metadata obtained through: schema mappings [WebSci13] (diagram labels: BIBO, AAISO, FOAF; yov:Video, po:Programme, bibo:Film); entity disambiguation & linking [ESWC13]; topic profile extraction [WWW13, ESWC14] (diagram labels: db:Astronomy, db:Astro. Objects). Example resources: BBC Programme <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…>; Yovisto Video <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…>
  • 9. Schemas/vocabularies on the Web: XKCD 927, https://xkcd.com/927/
  • 10. Schema assessment and mapping. Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties). Example: how do po:Programme and sioc:Item relate? BBC Programme <po:Programme …> <po:title>Secret Universe – The Life of the Cell</po:title> … </po:Programme…>; SlideShare Set <sioc:Item …> <label>Viral diseases & bacteria</label> … </sioc:Item ….> Data: http://datahub.io/group/linked-education. Reference: Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  • 11. Schema assessment and mapping. Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties), and co-occurrence after mapping into the most frequent schemas (201 frequent types mapped into 79 classes). Target classes shown: bibo:Slideshow, bibo:Film, bibo:Document. Example source types: po:Programme, sioc:Item. BBC Programme <po:Programme …> <po:title>Secret Universe – The Life of the Cell</po:title> … </po:Programme…>; SlideShare Set <sioc:Item …> <label>Viral diseases & bacteria</label> … </sioc:Item ….> Reference: Assessing the Educational Linked Data Landscape, D'Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  • 12. LinkedUp Data Catalog in a nutshell. RDF (VoID) dataset catalog: browse & query distributed datasets; live information about endpoint accessibility; federated queries using type mappings. Links: http://datahub.io/group/linked-education, http://data.linkededucation.org/linkedup/catalog/
  • 13. Challenge: semantics of resources/datasets? Topics/categories addressed? Relatedness of resources/entities (types, semantics)? Example resources: Video <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720>; Slideset <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292>; Programme <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215>. References: Combining a co-occurrence-based and a semantic measure for entity linking, Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W., ESWC 2013 - 10th Extended Semantic Web Conference (May 2013). A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
  • 14. Data disambiguation (for linking & profiling): Brian Cox? Sun? Pluto? Example resources: Video <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720>; Programme <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215>
  • 15. Data disambiguation using background knowledge: "semantic relatedness" of resources? Background entities: db:Pluto (Dwarf Planet), db:Sun, db:Astronomical Objects, db:Astronomy. Example resources: Programme <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215>; Slideset <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292>; Video <yo:Video 8748720> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video 8748720>
  • 16. Entity linking: semantic relatedness. Computation of connectivity scores between resources/entities. Method: combination of (i) a semantic (graph-based) connectivity score (SCS) with (ii) a Web co-occurrence-based measure (CBM) (similar to NGD). For (i): adaptation of the Katz index from SNA for (linked) data graphs (considering the number and the lengths of paths of transversal properties). Background entities: db:Pluto (Dwarf Planet), db:Sun, db:Astronomical Objects, db:Astronomy; example scores shown for db:Sun: SCS = 0.32, CBM = 0.24. Example resources: Online Lecture <yov:Lecture8748720> <title>Pluto & the Dwarf Planets</title> … </yov:Lecture8748720>; Slideset <sioc:Item 2139393292> <title>Planetary motion & gravity</title> … </sioc:Item 2139393292>; Programme <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215>. Links: http://purl.org/vol/doc/, http://purl.org/vol/ns/. Reference: Combining a co-occurrence-based and a semantic measure for entity linking, Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W., ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
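The two ingredients named on this slide can be sketched in a few lines. This is a minimal illustration, not the SCS/CBM implementation from the cited paper: `katz_score` is the plain Katz index truncated at `max_len` (walks are counted, weighted by `beta**length`), and `ngd` is the standard Normalized Google Distance, which the slide only says CBM is "similar to". The toy graph, the damping factor `beta`, and the hit counts are assumptions made up for the example.

```python
import math

def katz_score(adj, source, target, beta=0.5, max_len=3):
    """Truncated Katz index: sum over l of beta**l * (number of walks of length l)."""
    walks = {source: 1}  # walks[n] = number of walks of the current length ending at n
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, count in walks.items():
            for neighbour in adj.get(node, ()):
                nxt[neighbour] = nxt.get(neighbour, 0) + count
        walks = nxt
        score += beta ** length * walks.get(target, 0)
    return score

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx, fy, joint count fxy, index size n."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical, undirected toy graph loosely following the slide's example entities.
graph = {
    "db:Sun": ["db:Astronomical_objects"],
    "db:Pluto": ["db:Astronomical_objects"],
    "db:Astronomical_objects": ["db:Sun", "db:Pluto", "db:Astronomy"],
    "db:Astronomy": ["db:Astronomical_objects"],
}

sun_pluto = katz_score(graph, "db:Sun", "db:Pluto")
distance = ngd(1000, 100, 10, 1_000_000)  # made-up hit counts
```

Shorter and more numerous connecting paths raise the Katz score, while a lower NGD means two terms co-occur more often than their individual frequencies would suggest. How the two signals are combined is described in the paper and is not reproduced here.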
  • 17. Entity linking: evaluation. Evaluation based on USA Today news items (80,000 entity pairs). Manually created gold standard (1000 entity pairs). Baseline: Explicit Semantic Analysis (ESA) => CBM/SCS: "relatedness"; ESA: "similarity". Figure: Precision/Recall/F1 for SCS, CBM, ESA. Reference: Combining a co-occurrence-based and a semantic measure for entity linking, Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W., ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
  • 18. Dataset profiling: what's the data about? Extracting representative metadata ("topic profile") for each dataset. Ranking of the most representative (DBpedia) categories (= topics); applied to all responsive LOD datasets. Scalability vs representativeness: sampling & ranking for a good scalability/accuracy balance. Figure: DBpedia category graph (db:Astronomy, db:Astronomical Objects, db:Sun) for the example Programme <po:Programme519215> <po:Series>Wonders of the Solar System</po:Series> <po:Episode>Emp. of the Sun</po:Episode> <po:Actor>Brian Cox</po:Actor> </po:Programme519215>. Reference: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
  • 19. Dataset profiling: approach. 1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling). 2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion). 3. Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov). => Result: weighted dataset-topic profile graph. Reference: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
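To make step 3 concrete, here is a small sketch of PageRank with priors run by power iteration over an entity-to-category graph. It is an illustration under assumptions, not the pipeline from the cited paper: the edge list, the uniform priors on extracted entities, the back-probability `beta`, and the handling of dangling nodes (their mass is returned to the prior distribution) are all choices made for this example.

```python
def pagerank_with_priors(edges, priors, beta=0.3, iterations=50):
    """Power iteration for rank = (1 - beta) * propagated rank + beta * prior."""
    nodes = set(priors)
    for u, v in edges:
        nodes.update((u, v))
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for u in nodes:
            mass = (1 - beta) * rank[u]
            if out[u]:
                share = mass / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                # Dangling node: hand its mass back according to the priors.
                for n in nodes:
                    nxt[n] += mass * prior[n]
        rank = nxt
    return rank

# Hypothetical entity -> category edges, as step 2 might produce them.
edges = [
    ("e:Sun", "cat:Astronomy"),
    ("e:Pluto", "cat:Astronomy"),
    ("e:Pluto", "cat:Dwarf_planets"),
    ("e:Brian_Cox", "cat:Physicists"),
]
priors = {"e:Sun": 1.0, "e:Pluto": 1.0, "e:Brian_Cox": 1.0}

ranks = pagerank_with_priors(edges, priors)
topics = sorted((n for n in ranks if n.startswith("cat:")), key=ranks.get, reverse=True)
```

In this toy graph the category reached from the most sampled entities ends up at the top of `topics`, which is the intuition behind using the ranked categories as a weighted topic profile.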
  • 20. Dataset profiling: exploring LOD datasets/topics in a nutshell. Automatic extraction of dataset "topics" [ESWC2014]. Visualisation & exploration of the dataset-topic graph (datasets, topics, relationships). Includes all (responsive) datasets of the LOD Cloud. Link: http://data-observatory.org/lod-profiles/
  • 21. Dataset profiling: results evaluation. Datasets & ground truth: Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood; crowd-sourced topic indicators from datasets (keywords, tags); manual mapping to entities & category extraction (ranking according to frequency). Baselines: 1) LDA, 2) tf/idf (applied to entire datasets); topic extraction according to our approach, weighting/ranking based on term weight. Measure: NDCG @ rank l; performance (time/NDCG) for different sampling strategies/sizes, etc. Figure: NDCG (averaged over all datasets).
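For readers unfamiliar with the measure, NDCG at rank l can be computed as below. This sketch uses the common linear-gain form, DCG@l = sum of rel_i / log2(i + 1) over the first l positions, and normalises by the ideal ordering of the same relevance list; whether the evaluation on the slide used this gain variant or an ideal ranking drawn from the full ground truth is not stated, so treat both as assumptions.

```python
import math

def dcg_at(relevances, l):
    """Discounted cumulative gain over the first l positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:l]))

def ndcg_at(relevances, l):
    """DCG normalised by the DCG of the same relevances sorted best-first."""
    ideal = dcg_at(sorted(relevances, reverse=True), l)
    return dcg_at(relevances, l) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the top-ranked topics of one dataset profile.
profile_relevance = [1, 3, 0, 2]
score = ndcg_at(profile_relevance, 3)
```

Averaging `ndcg_at` over all evaluated datasets gives the kind of aggregate reported in the figure.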
  • 23. Diversity of category profile for a single paper: Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American Magazine. Entities: dbp:Tim_Berners-Lee (person), dbp:Semantic_Web (document). First-level categories (dcterms:subject): dbp:Category:1955_births, dbp:Category:People_from_London, dbp:Category:Buzzwords, dbp:Category:Semantic_Web, dbp:Category:Web_Services, dbp:Category:HTTP, dbp:Category:Unitarian_Universalists, dbp:Category:World_Wide_Web, dbp:Category:Royal_Medal_winners
  • 24. Dataset profiling: some lessons learned. The DBpedia category graph is not an ideal "topic" vocabulary: broad and noisy; "categories" vs "topics" (for capturing disciplines, thesauri like UMBEL or the UNESCO Thesaurus seem better suited); hierarchy?; filtering of certain partitions of the category graph (too generic categories, etc.); mixing categories across resource types (document, person) creates "perceived noise". But: broadness is useful as a general vocabulary for categorisation of all sorts of resource types.
  • 25. Type-specific exploration of dataset categories. Type-specific views on datasets/categories: "Document" (foaf:document), "Person" (foaf:person), "Course" (aaiso:course). Currently applied to datasets in the LinkedUp Catalog only (as schema mappings are already available here). Link: http://data-observatory.org/led-explorer/
  • 26. Dataset interlinking recommendation: candidate datasets for interlinking? Problem: given dataset t, rank the datasets from D according to a probability score (di, t) of containing linking candidates (entities). Features: vocabulary overlap; existing links (SNA). Datasets are more likely to contain linking candidates if they (a) share common schema elements, or (b) already link to t or to datasets t links to (friend of a friend). Example ranking: 1 DBLP, 2 ACM, 3 OAI, 4 CiteSeer, 5 IBM, 6 Roma, 7 IEEE, 8 Ulm, 9 Pisa. Conclusions: roughly 60% MAP for both approaches. Future work: quantity of links, more remote links, extraction of dataset links rather than data from DataHub. References: Lopes, G. R., Paes Leme, L. A. P., Nunes, B. P., Casanova, M. A., Dietze, S., Recommending Tripleset Interlinking through a Social Network Approach, The 14th International Conference on Web Information System Engineering (WISE 2013), Nanjing, China, 2013. Paes Leme, L. A. P., Lopes, G. R., Nunes, B. P., Casanova, M. A., Dietze, S., Identifying candidate datasets for data interlinking, in Proceedings of the 13th International Conference on Web Engineering (2013).
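The two feature families on this slide can be illustrated with a toy ranker. This is a sketch under assumptions, not the ranking model of the cited papers: vocabulary overlap is measured with Jaccard similarity, the "friend of a friend" signal is the share of the target's link neighbourhood that a candidate links into, and the two are mixed with an arbitrary equal weight. Dataset names, vocabularies and links are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_candidates(target, vocabularies, links, weight=0.5):
    """Rank all other datasets by vocabulary overlap and shared link neighbourhood."""
    # The target itself plus every dataset the target already links to.
    neighbourhood = {target} | links.get(target, set())
    scores = {}
    for name, vocab in vocabularies.items():
        if name == target:
            continue
        overlap = jaccard(vocabularies[target], vocab)
        social = len(links.get(name, set()) & neighbourhood) / len(neighbourhood)
        scores[name] = weight * overlap + (1 - weight) * social
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical datasets: vocabularies used and outgoing links (linksets).
vocabularies = {
    "target": {"foaf", "dcterms", "bibo"},
    "dataset_a": {"foaf", "dcterms", "bibo"},
    "dataset_b": {"geo"},
    "dataset_c": {"foaf", "swrc"},
}
links = {
    "target": {"dataset_c"},
    "dataset_a": {"dataset_c"},
    "dataset_b": set(),
    "dataset_c": set(),
}

ranking = rank_candidates("target", vocabularies, links)
```

Here `dataset_a` ranks first because it shares the target's vocabularies and links to a dataset the target also links to, while `dataset_b` shares neither signal and ranks last.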
  • 27. "LinkedUp" – Linking Web Data (for Education): European support action to advance take-up of open data & related technologies. Success models (data & applications): LinkedUp Challenge to identify innovative tools & applications; evaluation methods and approaches. Data linking & curation: collecting & exposing open data => LinkedUp Data Catalog; profiling and linking of Web Data for education => educational data graph [ESWC2013], [ISWC2013]. Technology transfer & community-building: disseminating knowledge & building communities (educators, computer scientists, data engineers); gathering stakeholder feedback: use cases and requirements. Links: http://linkedup-challenge.org/#usecases, http://linkedup-project.eu/events, http://www.linkedup-challenge.org/, http://data.linkededucation.org, http://www.linkedup-project.eu
  • 28. Who we are: LinkedUp Network, LinkedUp Consortium, LinkedUp Advisory Board
  • 29. LinkedUp Challenge: using open data (for learning). Open Data Competition to promote tools and applications that analyse/integrate (Linked) Web data. Organised by the LinkedUp project over 2 years ("Veni", "Vidi", "Vici") with 40,000 EUR in awards. Veni Competition: 22 submissions, 8 shortlisted for presentation at the Open Knowledge Conference (17 September, Geneva, Switzerland). Link: http://linkedup-challenge.org
  • 30. May–September 2013: Open track only; final events at OKCon 2013 (September 2013, Geneva). October 2013–May 2014: Open & focused track(s); final events at ESWC2014 (May, Crete). May 2014–October 2014: Open track & focused tracks; submission details and calls to be released soon; final events at ISWC2014 (October, Riva del Garda, Italy).
  • 31. The Veni shortlist & winners. PoliMedia – 1st prize (http://www.polimedia.nl/); GlobeTown – 2nd prize (http://www.globe-town.org/); WeShare – 3rd prize / people's choice (http://seek.cloud.gsic.tel.uva.es/weshare/). Also shortlisted: DataConf., KnowNodes, Mismuseos, ReCredible, YourHistory.
  • 32. data.l3s.de – a DataHub for the L3S
  • 33. Learning Analytics & Knowledge Dataset & Challenge: facilitating research on Learning Analytics and EDM in a nutshell. LAK Dataset (450 publications in RDF/R): ACM International Conference on Learning Analytics and Knowledge (LAK) (2011-13); International Conference on Educational Data Mining (2008-13); Journal of Educational Data Mining (2008-12). LAK Data Challenge: analyse, explore, correlate the LAK Dataset; at ACM LAK 2014 (April 2014, Indianapolis). Link: http://lak.linkededucation.org/
  • 34. KEYSTONE COST ACTION. Research network focused on distributed search, dataset profiling, to Semantic Web, Databases, etc. Running 2013-2017. WG1: Representation of structured data sources. WG2: Keyword search. WG3: User interaction and query interpretation. WG4: Research integration, showcases, benchmarks, and evaluations. Open to new members (even beyond Europe). Joint workshops (e.g. PROFILES2014 @ ESWC2014). Link: http://www.keystone-cost.eu/
  • 35. Ongoing/future work … and some upcoming events. Linked Data evolution, preservation, consistency: in RDF graphs (e.g. LOD Cloud), "all" nodes are connected. LD preservation: which datasets to preserve (direct links or even more distant neighbours)? => semantic relatedness as guidance for scalable preservation strategies/data enrichment. Link correctness in evolving LD: investigating the impact of changes on link correctness (weekly LOD crawls over a 1-year time span); application: informed preservation strategies. Conflict detection and LD quality (link quality, impact of conflicts in distant nodes). Upcoming events: PROFILES workshop @ ESWC2014 (http://keystone-cost.eu/profiles2014), 26 May 2014, Crete, Greece; Linking User Data 2014 at UMAP2014 (http://liud.linkededucation.org), deadline: 1 April; Online Learning & LD Tutorial at WWW2014 (http://www2014.kr/), 07 April, Seoul.
  • 36. Thank you! See also (general): http://linkedup-project.eu, http://linkededucation.org, http://data.l3s.de, http://purl.org/dietze. See also (data): http://data.linkededucation.org, http://data.linkededucation.org/linkedup/catalog/, http://lak.linkededucation.org. Acknowledgements: Besnik Fetahu (L3S), Bernardo Pereira Nunes (PUC Rio), Marco Casanova (PUC Rio), Luiz Andre Paes Leme (PUC Rio), Giseli Lopes (PUC Rio), Davide Taibi (CNR, IT), Mathieu d'Aquin (Open University, UK), and many more…