CONTENT ANALYSIS IN SOCIOLOGY

MASTER THESIS

Tabea Tietz
Matriculation number: 749153

ABSTRACT
In sociology, texts are understood as social phenomena and provide means to an-
alyze social reality. Throughout the years, a broad range of techniques evolved to perform such analyses: qualitative and quantitative approaches as well as completely manual analyses and computer-assisted methods. The development of the
World Wide Web and social media as well as technical developments like optical
character recognition and automated speech recognition contributed to the enor-
mous increase of text available for analysis. This also led sociologists to rely more on computer-assisted approaches for their text analysis, including statistical Natural Language Processing (NLP) techniques. A variety of techniques, tools, and use cases has developed, but a uniform way of standardizing these approaches is lacking. This problem is coupled with a lack of standards for reporting text analysis studies in sociology. Semantic Web and
Linked Data provide a variety of standards to represent information and knowl-
edge. Numerous applications make use of these standards, including possibilities
to publish data and to perform Named Entity Linking, a specific branch of NLP.
This thesis discusses the question to what extent the standards and tools provided by the Semantic Web and Linked Data community may support computer-assisted text analysis in sociology. First, these tools and standards are briefly introduced and then applied to the use case of constitutional texts of the Netherlands from 1884 to 2016. It is demonstrated how to generate RDF data from text and how to publish and access these data. Furthermore, it is shown how to query the local data on their own as well as how to enrich them with external knowledge from DBpedia. A thorough discussion of the presented approaches is provided, and intersections for a possible future engagement of sociologists in the Semantic Web community are elaborated.
ZUSAMMENFASSUNG

In sociology, texts are understood as social phenomena that can serve as a means of analyzing social reality. Over the years, a broad range of techniques has developed in sociological text analysis, including quantitative and qualitative methods as well as fully manual and computer-assisted approaches. The development of the World Wide Web and social media, as well as technical developments such as optical character recognition and automated speech recognition, have contributed to an enormous increase in the amount of text available for analysis. In recent years, this has led sociologists to rely more on computer-assisted approaches to text analysis, such as statistical Natural Language Processing (NLP) techniques. Yet although versatile methods and technologies for sociological text analysis have been developed, uniform standards for the analysis and publication of textual data are lacking. This problem also causes the transparency of analysis processes and the re-usability of research data to suffer. The Semantic Web and, with it, Linked Data offer a number of standards for the representation and organization of information and knowledge. These standards are used by numerous applications, among them methods for publishing data and Named Entity Linking, a specific form of NLP.

This thesis discusses the question to what extent these standards and tools from the Semantic Web and Linked Data community can support computer-assisted text analysis in sociology. The necessary technologies are briefly introduced and then applied to an example dataset consisting of constitutional texts of the Netherlands from 1884 to 2016. It is demonstrated how RDF data can be generated from the documents, how these data can be published, and how they can be accessed. Queries are formulated which at first refer exclusively to the local data; subsequently it is demonstrated how this local knowledge can be enriched with information from external knowledge bases. The presented approaches are discussed in detail, and intersection points are elaborated for a possible engagement of sociologists in the Semantic Web field, which may extend the presented analyses and query possibilities in the future.
CONTENTS

1 Introduction
2 Theoretical and Methodological Implications
2.1 Social Reality in the Context of Text Analysis
2.2 Text Analysis in Sociology
2.2.1 Qualitative vs. Quantitative Analysis
2.2.2 Transparency and Re-usability in Text Analysis
3 A Brief Introduction to Semantic Web Technologies and Linked Data
3.1 From the Internet to the Semantic Web - A Quick Overview
3.1.1 The Internet - Computer Centered Processing
3.1.2 The World Wide Web - Document Centered Processing
3.1.3 The Semantic Web - Data Centered Processing
3.2 Basic Principles of the Semantic Web and Linked Data
3.2.1 The Semantic Web Technology Stack
3.2.2 Uniform Resource Identifiers (URIs)
3.2.3 The Resource Description Framework
3.2.4 Linked (Open) Data
3.2.5 RDF Schema and OWL
3.2.6 Ontologies in Computer Science
3.2.7 Data Querying with SPARQL
3.2.8 Metadata and Semantic Annotation
3.2.9 Named Entity Linking
3.3 Brief Summary
4 Applying Semantic Web Technologies and Linked Data in Sociological Text Analysis
4.1 Publishing Sociological Research Data as Linked Data
4.1.1 Exemplary Use Case
4.1.2 Application of Best Practices
4.1.3 Result and Brief Summary
4.2 Exemplary Structure Analysis of Constitution Texts
4.2.1 Workflow
4.2.2 Querying
4.2.3 Brief Summary
4.3 Exemplary Content Analysis of Constitution Texts
4.3.1 Workflow
4.3.2 Annotation
4.3.3 Annotation Statistics
4.3.4 Content Exploration
4.3.5 Brief Summary
5 Discussion
5.1 Transparency and Re-usability
5.2 Limitations and Future Work
5.2.1 Feasibility and Process Automation
Appendix
A Appendix
Bibliography
1 INTRODUCTION
Writing was developed independently in Mesopotamia around 3100 B.C., in China
around 1500 B.C. and in Mesoamerica around 300 B.C. (Silberman, 2012). Since
then, writing has served as a means of human-to-human communication and is firmly established in human cultures. Text as “written or printed words, typically forming a connected piece of work”1 has been crucial in human development. Texts
have proven essential to human societies when forming a government or laying
down fundamental laws. But basic interactions affecting the day-to-day life of humans are also highly influenced by the written text they produce, share and consume. The development of the Internet and later the Web increased the amount and diversity of text available to humans. Technologies such as optical character recognition (OCR) enable the digitization of vast amounts of text, as currently attempted by Google Books2, and automated speech recognition (ASR) makes it possible to transcribe spoken words in audiovisual material into text. An inconceivable amount of information about anything humans do, think and know is captured in the form of text through emails, blogs, social media platforms, chat forums, and so on. On Twitter alone, one of the leading social networks worldwide with more than 300 million monthly active users, more than 58 million “tweets” are posted every day3.
Ever since humans began to write and later to print textual documents, scientists
of numerous research fields and professions started to analyze these writings to
research the works’ authors themselves, the context a work was written in (e.g. the
time period) or the impact a written work had on a society.
In sociology, text grants a researcher access to social reality; it provides means to the realization of society or certain aspects of society (Lemke and Wiedemann, 2015). Already during the 18th century, an analysis of religious symbols in songs was performed, which represents the first well-documented account of a quantitative analysis of printed material. Until the late 1950s, text analysis was mainly used to describe text, e.g. through word frequency analysis. Simple valence analyses were also developed, in which researchers attempted to determine whether specific words were more positively or negatively valued. Intensity analyses helped to grant certain words or phrases more weight than others to enable more precise analysis results (Popping, 2000). During the 1960s, researchers began
using the computer for text analysis, and especially the development of The General Inquirer, a mainframe program to classify and count words or phrases within certain categories, marked a milestone in social scientific research (Stone, Dunphy, and Smith, 1966). Everything that had previously been accomplished with pen and paper, e.g. data coding, could now be achieved with the computer. From that moment, text files available as data files on computer systems could
be entered into programs and automatically analyzed (Popping, 2000). During the last decade, statistical Natural Language Processing (NLP) approaches have been used more and more in social scientific text analysis, and the algorithms have simultaneously become more accurate and efficient, supporting the uncovering of linguistic structures as well as semantic associations (Evans and Aceves, 2016).
While the analysis of natural language text has become an important component of research in sociology, numerous methods of computer-assisted data acquisition and analysis have been established. However, Mayring, (2015) criticizes that especially in social scientific research, no standardized and systematic means of analyzing complex text material has emerged. Also according to Lemke and Wiedemann, (2015), it is still absolutely necessary to establish universal standards for sustainable computer-assisted text mining in sociology, which would enable researchers to focus more on the actual research work and less on the development of new methods. Furthermore, in sociology, data sharing and publishing is to this day widely un-standardized and often not practiced at all (Herndon and O’Reilly, 2016; Zenk-Möltgen and Lepthien, 2014). According to Büthe et al., (2014), this lack of transparency lowers the integrity and interpretability of the performed research. Another widely discussed issue in sociology is the re-use of research data, especially qualitative data (Moore, 2007). A study by Curty, (2016) suggests that sociologists generally welcome re-using research data, but certain aspects, including the difficulty of finding and accessing these data, often prevent them from doing so.
This thesis builds on the idea of developing standards for sociological text analysis based on Semantic Web and Linked Data technologies. While working at CERN, Tim Berners-Lee established the general idea of the World Wide Web (WWW) in 1989. Previously, the Internet was mostly accessible to experts only; the Web was developed in a way that enabled everyone to create and consume content. Today, social media allows anyone to post text and audiovisual content and to share locations with user-friendly interfaces. However, the development of the Web also called for numerous standards and formats to assure that new applications can be created by anyone within the existing framework of the traditional Web. During the early 2000s, the Semantic Web started to evolve,
which brought even more sophisticated standards, recommended by the World
Wide Web Consortium (W3C)4 . “The Semantic Web is an extension of the tradi-
tional Web in which information is given well-defined meaning, better enabling
computers and people to work in cooperation” (Berners-Lee, Hendler, and Las-
sila, 2001). The Semantic Web provides “a common framework for the liberation of
data” (Berners-Lee et al., 2006) by giving data an independent existence which is
free from the constraints of the document in which they appear (Halford, Pope, and
Weal, 2013). Several domains have not only firmly established methods to utilize the possibilities provided by Semantic Web and Linked Data technologies and standards, they have also found ways to take part in their development, providing new applications based on the general idea. Among these domains are initiatives in the Life Sciences (e.g. cancer research by McCusker et al., (2017)), the media and
4 https://www.w3.org/, last visited: June 5, 2018
film industry (e.g. by the BBC (Kobilarov et al., 2009) or film production in general
(Agt-Rickauer et al., 2016)) and many more (Schmachtenberg, Bizer, and Paulheim,
2014). However, the field of sociology has so far contributed little to the Semantic Web, even though there are many points of intersection, especially in the field of text analysis. On this foundation, the following research question has been developed:
Research Question:
To what extent can state-of-the-art Semantic Web and Linked Data technologies, standards and principles support computer-assisted text analysis in sociology to improve research transparency and data re-usability?
This thesis attempts to show and discuss these intersection points. It will be elaborated how different technologies and standards that are part of the Semantic Web help to make the research process more transparent and reproducible, to make data re-usable, and to support text analysis. It should be made clear that the goal of this thesis is not to replace but rather to support the traditional NLP techniques as introduced by Evans and Aceves, (2016), and utilized e.g. by Knoth, Stede, and Hägert, (2018).
Halford, Pope, and Weal, (2013) furthermore discuss that sociologists intending to work with computer-assisted methods should not simply wait until the perfect method or perfect data suddenly appears. Using their domain knowledge, sociologists should engage in the way data can be represented on the Web and analyzed according to their needs. This thesis therefore also attempts to discuss intersection points for sociologists to engage in future work by emphasizing the imperfections of current tools and standards provided by the Semantic Web community.
Chapter 3 gives a brief introduction to Semantic Web technologies and Linked Data for the semantic analysis of natural language text. The following chapter 4 will demonstrate the utilization of these technologies and principles on real-world research examples. In chapter 5, an in-depth discussion of the results and presented approaches will be provided. The final chapter summarizes and concludes this thesis.
2 THEORETICAL AND METHODOLOGICAL IMPLICATIONS
In this chapter, theoretical and methodological concepts important for computer-
assisted text analysis in social science are briefly introduced.
the matter. These perspectives establish a proximity to the analysis methods of distant reading and close reading (ibid. pp 30). While both perspectives have certain assets and drawbacks, Luhmann, (1984) warns that distance can only be kept if researchers can rely on the instruments they use. Lemke and Wiedemann, (2015) suggest that bringing together both perspectives in the modular analysis process entitled blended reading delivers the best results for text analysis in sociology. The authors do not understand the hermeneutical approach and the approach of the sociology of knowledge as oppositional, but assume that in blended reading both methods can generate synergetic effects if algorithms and humans work together in semi-automatic methods and optimally combine their respective competencies (ibid. pp 43-54).
Text analysis can furthermore be categorized into qualitative and quantitative approaches. Mayring, (2015) discussed a variety of categories to differentiate between qualitative and quantitative research. The first category is merely a terminological distinction. According to the author, qualitative terms are used to divide objects into classes (e.g. house, car, street), while quantitative terms introduce numerical functions into language. In social scientific research, methods are often categorized according to their scale level. Mayring defines that any analysis based on a nominal scale is most likely a qualitative approach, while analyses based on ordinal, interval or ratio scales belong to quantitative research methods. Of course, they may also overlap, which makes a clear distinction more difficult. Another method of distinguishing between both methods in sociology is based on the implicit understanding of research. That is, qualitative research analyzes the complexity of a matter and intends to understand it, and is therefore rather inductive. Quantitative research, on the other hand, isolates an object of analysis into variables and defines the impact of interfering effects. It intends to explain things rather than to understand them and thus tends to be more deductive (ibid. pp 17-21).
Mayring clarifies that the qualitative vs. quantitative battle often fought by social scientists is unnecessary, because both methods can be used in synergy in the process of text analysis, a view which is shared and acknowledged in this thesis. The author explains that the first step in text analysis is always the definition of the research topic and the clarification of what is analyzed. Then, either qualitative or quantitative methods or both may be used for the analysis process, depending on the use case and research question. In the last step, qualitative methods are used to interpret the observations made in the analysis (ibid. pp 20-22).
3 A BRIEF INTRODUCTION TO SEMANTIC WEB TECHNOLOGIES AND LINKED DATA

3.1 From the Internet to the Semantic Web - A Quick Overview

3.1.1 The Internet - Computer Centered Processing

The development of the Internet already began during the 1960s. During a meet-
ing of the Advanced Research Projects Agency (ARPA) research directors in 1967
the heads of the Information Processing Techniques Office Joseph Licklider and
Lawrence Roberts first raised a discussion about connecting heterogeneous com-
puter networks. As a result, so-called Interface Message Processors (IMP) were
developed to connect proprietary computer systems to telephone networks. Octo-
ber 29, 1969 marked the birth of the so-called ARPANET. The first four connected
nodes of this newly created network belonged to research departments of the Uni-
versities of Santa Barbara, Utah, Los Angeles, and Stanford. The extension of this
network soon reached the west coast of the United States and by the early 1970s, 23
hosts were connected via 15 nodes. The first international nodes were connected in
1973 and starting from 1975, the network was not only connected via telephone ca-
bles but also via satellite. Among the first famous applications of this network was an email program developed by Ray Tomlinson in 1971. Another milestone was
reached in 1983 when the communication software of all connected computer sys-
tems was adapted to the TCP/IP protocol under the leadership of Vinton Cerf and
Robert Kahn. This marks the birth of the Internet (Meinel and Sack, 2011).
Imagine the Internet without the comfortable Web applications we know today
which enable us to access our social media feeds in our browser or on smartphone
applications or send large files around the world using user friendly applications.
In order to access and use information on the Internet, users were required to
connect to a remote system (e.g. using a terminal), retrieve the file system data
on said remote system, download the file and read it on a local system. All of
these steps had to be accomplished via the command line. As revolutionary as this method was during the early age of the Internet, simply accessing information via the Internet required considerable expert knowledge.
3.1.2 The World Wide Web - Document Centered Processing

While working at CERN, Tim Berners-Lee published “Information Management: A Proposal” (1989). In it, the author proposed a decentralized hypertext-based document management system to administer the enormous amount of research data and documentation at CERN. Together with Robert Cailliau,
Berners-Lee began working on his initial idea using the NeXT computer system. In
November 1990, the term WorldWideWeb was coined by Tim Berners-Lee and in
1991 the first Web browser was released. The foundation of the World Wide Web
(WWW) is the interlinking of documents via hyperlinks. A hyperlink is defined as
an explicit reference of one document to another document or within the same doc-
ument. Text based documents on the Web are referred to as hypertext documents
(Meinel and Sack, 2011).
3.1.3 The Semantic Web - Data Centered Processing

The traditional Web is a document-based decentralized network, and its numerous applications make it possible for everyone to access and publish information on the Web and hence to participate in its content and development without expert knowledge.
However, accessing data on the Web (especially in the context of scientific research) is often difficult due to the variety of standards and formats used. Documents on the Web can be encoded as HTML (Hypertext Markup Language), PDF (Portable Document Format) or proprietary document formats, e.g. Microsoft Word or Excel. Data in these documents are often unstructured and embedded in text, or semi-structured in tables. To make use of these data, they have to be semi-automatically extracted, which is not only labor- and time-intensive but also error-prone (Pellegrini, Sack, and Auer, 2014). Next to these formats, XML evolved as a prominent standard on the Web to encode data syntactically by creating a tree of nested sets of tags. However, proprietary style sheets and parsers are needed to make use of this information efficiently (this will be discussed in more detail in section 4.1.1.1).
In the traditional Web, HTML is the standard markup language to create Web pages and applications. It describes how information is presented and how information is linked (Faulkner et al., 2017). However, HTML cannot describe what the information actually means. This becomes clear when initiating a Google image search using the keyword Jaguar. As Figure 1 shows, the search engine returns images of the animal as well as of the car Jaguar. The reason is that natural language is often ambiguous and contains words or phrases with the same spelling but different meanings (e.g. Jaguar) as well as words or phrases with a different spelling and the same meaning (e.g. important, substantial, essential). The former is called a homonym and the latter a synonym.
But why is it important to also include the meaning of words and phrases in Web applications and to disambiguate natural language? In communication, meaning is necessary to understand information which is conveyed in a message using a specific language. Information is understood by the receiver of a message if the receiver interprets the information correctly. Hence, if a machine is programmed to not only read data but also to understand it, human-computer as well as computer-computer communication can be significantly improved. The example above shows that without any further context, neither humans nor a computer program can correctly and unambiguously interpret the meaning of the keyword Jaguar. This misinterpretation causes communication problems, as the results the computer returns may differ from the results the user expected.
The Semantic Web with its underlying technologies enables the development of machine-understandable data. In it, the meaning (semantics) is made explicit by formal (structured) and standardized knowledge representations (ontologies). The Semantic Web makes it possible to automatically process the meaning of information, relate and integrate heterogeneous data, and deduce implicit information from existing information. The example above helps to understand what the effect of programming a computer to interpret the meaning of information may look
Figure 2: Google image search using the query “Jaguar with two doors”. Retrieved on June 6, 2018
like. If the search query is changed from simply “Jaguar” to “Jaguar with two doors”, Semantic Web technologies make it possible to automatically take into account the context of the given query. Context denotes the surrounding of a symbol (concept) in an expression with respect to its relationship with surrounding expressions (concepts) and further related elements. If a human reads a query like “Jaguar with two doors”, it is immediately clear that the term Jaguar is not about the large cat, since an animal obviously does not have two doors. Taking into account the given context (two doors), it becomes immediately clear to humans that Jaguar can only refer to a type of car. Semantic Web technologies make it possible to draw these distinctions as well, as shown in Figure 2. They make it possible to structure information in a way that makes clear that a Jaguar can be a type of car with a specific number of doors, an engine and four wheels, or a type of wild cat species.
In contrast to the traditional Web, where documents are interconnected to organize semi-structured information, the Semantic Web makes it possible to structure data, give data a well-defined meaning, and derive new implicit knowledge from explicit knowledge via logical reasoning. With the development of the Semantic Web, a variety of standards, methods and practices have been created to structure information (data) to be machine-interpretable.
The following section will give a brief overview of the underlying technologi-
cal foundations of the Semantic Web, before its potential for social scientific text
analysis can be discussed thoroughly.
3.2 Basic Principles of the Semantic Web and Linked Data

The Semantic Web is closely related to the traditional Web of documents. Tim
Berners-Lee, who is credited with the invention of the WWW, also coined the term
Semantic Web. According to Berners-Lee, the “Semantic Web is an extension of the
current Web in which information is given well-defined meaning, better enabling
computers and people to work in cooperation” (Berners-Lee, Hendler, and Lassila,
2001).
According to Hitzler, Kroetzsch, and Rudolph, (2009) the Semantic Web shares
a number of goals with the traditional Web:
• Make knowledge widely accessible
3.2.1 The Semantic Web Technology Stack

The basic technologies used in Semantic Web applications are visualized in the so-called Semantic Web technology stack, as shown in Figure 3. It gives an overview of the standardized concepts and abstractions used (left side) as well as specifications and solutions (right side). While standards have always played an important role
in the Web of Documents, the development of the Semantic Web increased the im-
portance of standardizations even more. Most standardizations in this area have
been conducted under the lead of the World Wide Web Consortium (W3C)1 (Hit-
zler, Kroetzsch, and Rudolph, 2009). In the course of this section, the stack will be
used to visualize the position of the explained technologies and concepts to give a
better general overview.
3.2.2 Uniform Resource Identifiers (URIs)

The Uniform Resource Identifier (URI) is part of the Web Platform layer in the
Semantic Web Technology stack (cf. Figure 3). A URI “defines a simple and exten-
sible schema for worldwide unique identification of abstract or physical resources”
(Berners-Lee, Fielding, and Masinter, 2005). A resource in that sense can be any ob-
ject with a clear identity, e.g. a Web page (URL) or a book (ISBN). In the Semantic
Web, URIs are used to uniquely distinguish resources from each other.
3.2.3 The Resource Description Framework

1 <http://dbpedia.org/resource/University_of_Potsdam>
2 <http://dbpedia.org/ontology/president>
3 <http://dbpedia.org/resource/Oliver_Günther> .
4 <http://dbpedia.org/resource/University_of_Potsdam>
5 <http://dbpedia.org/property/established>
6 "1991" .
7 <http://dbpedia.org/resource/Oliver_Günther>
8 <http://dbpedia.org/ontology/birthPlace>
9 <http://dbpedia.org/resource/Stuttgart> .
Listing 1 depicts the N-Triples serialization of the graph shown in Figure 5. The syntax translates directly from the graph visualization into triples. URIs are written in angle brackets, while literals are written in quotation marks. Each triple is terminated by a full stop. Due to the lengthy names, the triples in Listing 1 are spread over several lines. Turtle offers a mechanism to abbreviate URIs using so-called namespaces by means of defining prefixes, as depicted in lines 1 - 3 in Listing 2. The prefix text can be chosen freely by the user, but it is recommended to select abbreviations which are easy to read and refer to what they abbreviate. Turtle furthermore provides the possibility to shortcut triples with the same subject or with the same subject and property (Beckett et al., 2014). Lines 1 and 4 in Listing 1 show that dbr:University_of_Potsdam is the subject of both triples. In Listing 2, a semicolon is used to indicate that line 6 uses the same subject as line 5.
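A sketch of the Turtle serialization referred to as Listing 2 is given below; the triples are those of Listing 1, while the prefix spellings (dbr:, dbo:, dbp:) are assumptions following common DBpedia conventions:

1 @prefix dbr: <http://dbpedia.org/resource/> .
2 @prefix dbo: <http://dbpedia.org/ontology/> .
3 @prefix dbp: <http://dbpedia.org/property/> .
4
5 dbr:University_of_Potsdam dbo:president dbr:Oliver_Günther ;
6     dbp:established "1991" .
7 dbr:Oliver_Günther dbo:birthPlace dbr:Stuttgart .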
There are several further ways to serialize RDF triples, including JSON-LD3 and RDF/XML4, but for the sake of simplicity they will not be discussed in this chapter.
3.2.4 Linked (Open) Data

The term Linked Data refers to best practices for publishing data on the Web. These practices or principles have been coined by Tim Berners-Lee in (2006):

1. Use URIs as names for things.

2. Use HTTP URIs, so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (e.g. RDF).

4. Include links to other URIs, so that they can discover more things.
These principles emphasize that when representing and accessing data on the Web, standards should be used to enable extensibility, reuse and sharing of data as well as interoperability between data sources. The term Linked Open Data refers to public Linked Data resources on the Web which are licensed under Creative Commons CC-BY5. Tim Berners-Lee created a five-star rating scheme for Linked Open Data6:
3 https://www.w3.org/TR/json-ld/, visited: June 11, 2018
4 https://www.w3.org/TR/rdf-syntax-grammar/, visited: June 11, 2018
5 https://creativecommons.org/licenses/by/2.0/, visited: June 12, 2018
6 http://5stardata.info/en/, visited: June 12, 2018
★ Make your data available on the Web (in whatever format) under an open license

★★ Make it available as structured data (e.g. a table as Excel instead of an image scan)

★★★ Use non-proprietary open formats (e.g. CSV instead of Excel)

★★★★ Use URIs to denote things, so that people can point at your stuff

★★★★★ Link your data to other data to provide context
Numerous datasets are available on the Web which fulfill at least some of these criteria. The Linked Open Data (LOD) cloud diagram gives an overview of the Linked Datasets available on the Web. Not every dataset qualifies for the diagram7; for instance, a dataset must contain at least 1000 triples to be represented in the LOD cloud. Figure 6 shows the Linked Open Data cloud from May 2007, which contained only 12 datasets interlinked with each other. Since then, the amount of Linked Open Data on the Web has increased tremendously, and the current diagram shown in Figure 7 contains 1186 datasets. The different colors in the diagram represent specific domains, including life sciences, geography, government or media. To name a few examples, part of the media domain in the LOD cloud is the BBC Music8 dataset with around 20,000 triples. Part of the geography domain is the LinkedGeoData9 knowledge base, which contains information collected by OpenStreetMap10. The LinkedGeoData dataset contains around 3 billion triples and is categorized as five-star Linked Data according to the criteria listed above. As shown in Figures 6 and 7, many datasets or knowledge bases are interlinked with each other. The knowledge base with the most diverse connections is DBpedia11. DBpedia is often referred to as the semantic version of Wikipedia12. Like Wikipedia, DBpedia is a community effort. Semi-structured information from Wikipedia (e.g. from Wikipedia infoboxes) is extracted and made available on the Web in the form of RDF triples (Lehmann et al., 2015). DBpedia currently contains around 9.5 billion triples13.
7 http://lod-cloud.net/, last visited: June 11, 2018
8 https://lod-cloud.net/dataset/bbc-music, last visited: June 11, 2018
9 https://lod-cloud.net/dataset/linkedgeodata, visited: June 11, 2018
10 https://www.openstreetmap.org, last visited: June 11, 2018
11 https://lod-cloud.net/dataset/dbpedia, last visited: June 11, 2018
12 https://www.wikipedia.org/, last visited: June 11, 2018
13 https://lod-cloud.net/dataset/dbpedia, last visited: June 11, 2018
3.2.5 RDF Schema and OWL

In Section 3.1.3, it was stated that Semantic Web technologies make it possible to embed meaning in data. However, so far in this section, only instances (e.g. dbr:University_of_Potsdam or dbr:Oliver_Günther) and properties which relate instances to each other have been specified. But where does the meaning introduced in Section 3.1.3 come from? One way to introduce semantics into the provided RDF data is by means of RDF Schema (or RDFS), which is part of the Models layer in the Semantic Web Technology Stack (cf. Figure 3). RDF Schema makes it possible to specify terminological knowledge, i.e. to express information about the data structure. In order to explain the possibilities of RDF Schema, the concept of classes in RDF has to be defined: Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class. The group of resources that are RDF Schema classes is itself a class called rdfs:Class (Brickley and Guha, 2014).
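As a brief illustration, class membership and class hierarchy can be written down in Turtle as follows (a sketch; the triples mirror the example discussed below for Figure 8):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix dbo:  <http://dbpedia.org/ontology/> .

# A-Box: assertional knowledge about an instance
dbr:University_of_Potsdam rdf:type dbo:University .

# T-Box: terminological knowledge about the classes
dbo:University rdf:type rdfs:Class ;
    rdfs:subClassOf dbo:EducationalInstitution .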
RDFS makes it possible to mark a resource as an instance of a class, to define hierarchical relationships between classes and between properties, and to draw simple logical inferences. Figure 8 visualizes a possible model based on the examples above. The green nodes in the A-Box depict assertional knowledge with instances as well as their directed relations to other instances and classes. The blue nodes in the T-Box depict the terminological knowledge with classes, properties and their (hierarchical) relationships. Here, the schema or model in the T-Box defines the structure of the instances in the A-Box. In the visualized example, dbr:University_of_Potsdam is of type (rdf:type) dbo:University, which itself is a rdfs:subClassOf dbo:EducationalInstitution, and so on. The orange nodes in
dent and not two or three or none (Hitzler, Kroetzsch, and Rudolph, 2009). These
examples briefly demonstrate that RDFS lacks semantic expressivity.
The Web Ontology Language (OWL) makes these differentiations possible. OWL is an ontology language for the Semantic Web with formally defined meaning, based on Description Logic (Welty and McGuinness, 2004). Several different OWL flavors exist; for the sake of simplicity, in the following only the concept of OWL 2 will be considered. Similarly to RDFS, there is a Turtle syntax for OWL. OWL classes are comparable to RDFS classes, individuals can be compared to class instances in RDFS, and OWL properties are also comparable to RDFS properties. OWL classes, properties and individuals can be defined as presented in Listing 4. In it, a number of classes are defined in lines 9 - 19. The classes :Fulltime and :Parttime are both subclasses of :Student. However, the owl:disjointWith statement in line 15 expresses that any individual (e.g. :Adrian_Schmidt) can only be in the class :Fulltime or :Parttime, but never in both. Properties are defined in lines 22 - 29. The property ex:president is defined as an owl:FunctionalProperty. This means that a university can have only one president; if more than one president were added, this would cause a logical contradiction. The possibilities of modeling ontologies with OWL and of inferring new knowledge are enormous, but for the sake of simplicity only a few examples were explained in this section, briefly exploring the functionalities of OWL in contrast to RDFS.
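A condensed sketch of the constructs described for Listing 4 is given below (class and property names as described above; the namespaces and any further axioms are assumptions):

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/university#> .
@prefix ex:   <http://example.org/> .

:Student  a owl:Class .
:Fulltime a owl:Class ; rdfs:subClassOf :Student ;
    owl:disjointWith :Parttime .              # an individual is never in both classes
:Parttime a owl:Class ; rdfs:subClassOf :Student .

:Adrian_Schmidt a :Fulltime .                 # an individual (class instance)

# A university can have at most one president
ex:president a owl:ObjectProperty , owl:FunctionalProperty .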
3.2.7 Data Querying with SPARQL

In the previous sections, RDF was introduced, which makes it possible to store data in the form of triples in knowledge bases. In order to utilize the information stored in these knowledge bases, the data has to be queried. Since RDF is stored in a triple format, a query language has to be used that is able to process these patterns. SPARQL, short for SPARQL Protocol and RDF Query Language, allows querying RDF data and is thus part of the Query layer of the Semantic Web technology stack, cf. Figure 3. SPARQL is based on the RDF Turtle serialization as well as on basic graph pattern matching, i.e. a graph pattern may contain variables at any arbitrary place.
The basic functionalities of SPARQL will be explained on the basis of the RDF graph displayed in Listing 5, which contains nine triples. In order to make use of the information stored in the graph, a SPARQL query has to be formulated. To find all entities which are connected to another entity via the property dbo:almaMater, the query shown in Listing 6 is issued.
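Listing 6 can be sketched as follows (the variable names and the dbo: prefix follow the description below; the exact formatting is an assumption):

1 PREFIX dbo: <http://dbpedia.org/ontology/>
2 SELECT ?university ?alumni WHERE {
3     ?alumni dbo:almaMater ?university .
4 }
5 ORDER BY ?university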
In line 1, the namespace used in the query is defined, similar to the prefix definitions in Listing 5. The SELECT clause in line 2 specifies the output variables ?university and ?alumni. In the WHERE clause, a graph pattern is listed: the query asks for any ?university and ?alumni in the graph which are connected by the dbo:almaMater property. The ORDER BY statement in line 5 orders the results by the variable ?university. The results of the query are listed in Table 1.

Table 1: Results of the query in Listing 6

university                                                    alumni
<http://dbpedia.org/resource/Humboldt_University_of_Berlin>  <http://dbpedia.org/resource/Karl_Liebknecht>
<http://dbpedia.org/resource/Humboldt_University_of_Berlin>  <http://dbpedia.org/resource/Rudolph_Virchow>
<http://dbpedia.org/resource/University_of_Potsdam>          <http://dbpedia.org/resource/Jens_Eisert>
<http://dbpedia.org/resource/University_of_Potsdam>          <http://dbpedia.org/resource/Katharina_Reiche>
3.2.8 Metadata and Semantic Annotation

On the Web, resources are described by metadata, i.e. information about data. Metadata are defined as “structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities” (Task Force on Metadata, 2000). Semantic metadata are part of the foundations of Semantic Web technologies. The semantics in these metadata are explicitly and formally defined via ontologies, and therefore they are machine-understandable. Semantic metadata form the basis of semantic annotation, which describes the process of attaching data to another piece of data. Thereby, a typed relation between the annotated data and the annotating data is established (Handschuh, 2005).
Various ontologies have been designed to enable the semantic annotation of textual documents and audiovisual content. The Web Annotation Ontology15 is one of the most prominent examples for multi-purpose annotations. It “provides an extensible, interoperable framework for expressing annotations such that they can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource” (Sanderson, Ciccarese, and Young, 2016).
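A minimal sketch of such an annotation in the Web Annotation vocabulary (the annotation URI and target document are hypothetical):

@prefix oa:  <http://www.w3.org/ns/oa#> .
@prefix dbr: <http://dbpedia.org/resource/> .

# The entity dbr:Stuttgart is attached as the body of an annotation
# whose target is a Web document.
<http://example.org/anno1> a oa:Annotation ;
    oa:hasBody   dbr:Stuttgart ;
    oa:hasTarget <http://example.org/doc> .

The listing below shows a related approach: an exemplary sentence annotated with DBpedia entities using the NLP Interchange Format (NIF) vocabulary.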
<http://example.org/doc?char=0,50>
    rdf:type nif:String , nif:Context , nif:RFC5147String ;
    nif:isString "On October 22, 1961, Günther was born in Stuttgart" .

<http://example.org/doc?char=21,28>
    rdf:type nif:String , nif:Phrase ;
    nif:anchorOf "Günther"^^xsd:string ;
    nif:beginIndex "21"^^xsd:nonNegativeInteger ;
    nif:endIndex "28"^^xsd:nonNegativeInteger ;
    nif:referenceContext <http://example.org/doc?char=0,50> ;
    itsrdf:taIdentRef dbr:Oliver_Günther .

<http://example.org/doc?char=41,50>
    rdf:type nif:String , nif:Phrase ;
    nif:anchorOf "Stuttgart"^^xsd:string ;
    nif:beginIndex "41"^^xsd:nonNegativeInteger ;
    nif:endIndex "50"^^xsd:nonNegativeInteger ;
    nif:referenceContext <http://example.org/doc?char=0,50> ;
    itsrdf:taIdentRef dbr:Stuttgart .
3.2.9 Named Entity Linking

In recent years, Natural Language Processing has become an important means of text analysis in sociology (Evans and Aceves, 2016; Lemke and Wiedemann, 2015). Named Entity Linking is part of the field of NLP and refers to the task of identifying mentions in a text and linking them to the entity they name in a knowledge base. In that sense, a named entity is a real-world object, for instance a person, a location, an organization or a product, that is denoted with a proper name. It can be abstract or have a physical existence. In this definition, named refers to entities for which rigid designators exist, as defined by Kripke, (1972). Contrary to rigid designators are non-rigid designators, which may refer to many different objects in many worlds, e.g. time periods. Rigid designators include mostly proper names or specific terms like biological species; non-rigid designators do not extensionally designate the same object in all possible worlds (Nadeau and Sekine, 2007). In the sentence ’Oliver Günther is president of the University of Potsdam’, ’Oliver Günther’ and ’University of Potsdam’ are considered named entities. Both refer to specific objects, while ’president’ can refer to many different objects in many worlds. The distinction between rigid and non-rigid designators is not defined universally in ongoing research, and non-rigid designators (e.g. ’president’) may also be included in NEL approaches. Assuming that in sociology rigid as well as non-rigid designators play an important role in the analysis of text, the term named entity will refer to both categories in this thesis. In the process of NEL, mentions in a text are linked to the entity they name in a knowledge base like DBpedia or Wikidata16.
For example, in the sentence above, the mention ’Oliver Günther’ is linked to dbr:Oliver_Günther and ’University of Potsdam’ to dbr:University_of_Potsdam.
3.3 Brief Summary

In this chapter, an introduction to a few basic Semantic Web and Linked Data technologies has been given, which will be applied to text analysis in sociology in the following chapter.
It has been discussed that the traditional Web offers many possibilities of participation for any user, but an issue of the Web is the availability of information in formats which are often unstructured and do not encode what the information actually means. It has been shown that the meaning of information on the Web is important for human-computer and computer-computer communication, e.g. in the area of Web search. In the Semantic Web, an extension of the traditional Web, meaning is made explicit by formal and standardized knowledge representations – ontologies. These ontologies are the foundations of Semantic Web and Linked Data applications, which include Named Entity Linking. The goal of this special task of Natural Language Processing is to identify mentions in a text and link them to the entity they name in a knowledge base. This also means correctly disambiguating the named entities, e.g. specifying whether an entity mention Günther refers to the female soccer player Sarah Günther or the university president Oliver Günther. Embedding this information in text using semantic metadata enables users (humans and machines) to understand the text and its meaning. Linking entities to their representations in a knowledge base also makes it possible to utilize the underlying organization of such knowledge. Through the explicit definition that Oliver Günther is the president of Potsdam University, it becomes possible to enrich the original text with additional information about these entities. Thereby, the text is given more context about other persons, locations or events.
In the following chapter 4, it will be demonstrated how these technologies and standards can be used in the field of sociology, especially text analysis. First, it will be shown how textual data can be structured using RDF. Then, these data are annotated with semantic entities and queried using SPARQL to exploit not only the corpus on its own, but also to use the underlying graph structure to enrich the text with additional context information and to aggregate the content in a meaningful way.
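To preview the kind of enrichment meant here, a sketch of a federated SPARQL query is given below: it combines a local graph pattern (the pattern itself is hypothetical) with knowledge retrieved live from the public DBpedia endpoint via the SERVICE keyword.

PREFIX dbo:    <http://dbpedia.org/ontology/>
PREFIX itsrdf: <http://www.w3.org/2005/11/its/rdf#>

SELECT ?entity ?abstract WHERE {
    # Local data: entities annotated in the corpus (hypothetical pattern)
    ?annotation itsrdf:taIdentRef ?entity .

    # External data: German abstracts fetched from DBpedia
    SERVICE <http://dbpedia.org/sparql> {
        ?entity dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "de")
    }
}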
4 APPLYING SEMANTIC WEB TECHNOLOGIES AND LINKED DATA IN SOCIOLOGICAL TEXT ANALYSIS
On the basis of the foundations and principles presented in chapter 3, this chapter focuses on direct applications in the field of sociological text analysis. Section 4.1 first motivates why textual research data in sociology should be published as Linked Data. The section also includes a step-by-step analysis of the process of publishing data via the use case described in section 4.1.1. Sections 4.2 and 4.3 demonstrate how the generated RDF data can be utilized in sociological text analysis. Finally, in chapter 5, the contributions of this chapter are summarized and the benefits, system limitations and future work are discussed.
4.1 Publishing Sociological Research Data as Linked Data

The Web has “radically altered the way we share knowledge by lowering the barrier to publishing and accessing documents as part of a global information space” (Bizer, Heath, and Berners-Lee, 2011). Even though the Web provides numerous benefits in the way documents can be shared and accessed, “the same principles that enabled the Web of documents to flourish have not been applied to data” until the rise of the Web of data. On the traditional Web, data has been made available as raw dumps like CSV or XML, or as HTML tables, which sacrifices much of its structure and semantics. During the last two decades, the Web evolved from an information space of linked documents to a space where documents and data are interlinked, underpinned by best practices for publishing and connecting structured data, which became known as Linked Data (ibid.). In research as well, initiatives have been promoting the publication of data resources, articles, and reviews according to the Linked Data principles (as discussed in 3.2.4) to make research more accessible, transparent, reusable, and therefore more credible. For instance, the Linked Research project1 promotes making all articles and resources available in a format that is human- and machine-readable and interlinked with other information on the Web.
In sociological research, data sharing and publishing is neither standardized nor widely practiced. Studies by Zenk-Möltgen and Lepthien, (2014) and Herndon and O’Reilly, (2016) show that social science journals have only just started to slowly adopt data-sharing policies, and most journals which enforce data publishing policies do so in an incomplete and varied way. Bosch and Zapilko, (2015) found that there are promising applications for Semantic Web technologies in the social sciences as well, especially concerning publishing and exploring survey and statistical data. The intention of this section is to demonstrate how textual documents used for analysis in sociology can be modeled and published according to Linked Data best practices.
4.1.1 Exemplary Use Case

The corpus has been kindly made available as XML. While XML provides a number of benefits regarding the way data can be encoded syntactically, it also has a number of disadvantages in contrast to RDF. XML was designed for markup in documents of any structure and creates a tree of nested sets of tags. With XML, it is possible to read data and obtain a representation which can be further exploited utilizing an XML parser. However, its major disadvantage compared to RDF is that XML does not make it possible to recognize semantic units. It aims at document structure and imposes no common interpretation of the data contained in a document (Decker et al., 2000). An RDF document describes a directed graph. RDF makes it easy to describe the relationships between resources and allows combining data from multiple sources (Hitzler, Kroetzsch, and Rudolph, 2009). Furthermore, the potential reuse of RDF data is enormous and goes far beyond the parser reuse offered by XML (Decker et al., 2000). In the specific context of constitution data, Elkins et al., (2014) have pointed out three major reasons why to use RDF instead of XML. The first regards syntax consistency: when using RDF, it does not matter which syntax is actually chosen as long as the data are modeled as a graph, while XML requires deciding upon a schema beforehand to define relationships. The authors have pointed out that constitutions across countries may
Figure 10: Life cycle of generating, publishing and maintaining Linked Data by Villazón-Terrazas et al., (2011)
vary in their structure, which makes it difficult to provide a schema suitable for all constitutions; using RDF and modeling the underlying ontology eliminates the need for such a rigid schema. The second reason regards flexibility, because ontologies allow for changes in the underlying data as well as in their architecture, while XML requires changing the schema, which also entails re-encoding each constitutional text. The third reason mentioned by the authors regards the ability to link to other data in the LOD cloud, e.g. DBpedia (Elkins et al., 2014, pp 17 - 18).
In the following section 4.1.2, it will be discussed how the exemplary use case data can be converted to RDF based on best practices developed by the W3C.

4.1.2 Application of Best Practices

The W3C has published ten best practices for publishing Linked Data (2014), ranging from creating a common mindset among potential collaborators to selecting and modeling data, converting it, and making it accessible to humans and machines for reuse. On the basis of these principles and with respect to the presented use case, it will be demonstrated how textual documents used for analysis in sociology can be published on the Web as Linked Data.
Prepare Stakeholders

This step includes the preparation of stakeholders for the process of creating, publishing and maintaining Linked Data. In principle, this entire chapter can be understood as a preparation of sociologists seeking to publish their research data in the context of text analysis as Linked Data. In order to prepare stakeholders, the overall workflow is usually demonstrated. A popular workflow was published by Villazón-Terrazas et al., (2011). In it, the authors identify a life cycle consisting of five steps to successfully specify, model, generate, publish, and exploit Linked Data in the government domain, as shown in Figure 10. The life cycle also shows that once the data is published, the work is never fully completed: once new data is modeled, it has to be generated, published and so on. Even though this mostly
Figure 11: Original constitution data collected by Knoth, Stede, and Hägert, (2018) from http://www.verfassungen.eu/, last visited: July 27, 2018
Select a Dataset

According to the W3C, a dataset should be selected that contains uniquely collected or created data that provides benefits for others to reuse and adapt (Hyland, Atemezing, and Villazón-Terrazas, 2014). The dataset selected to be published as RDF in this use case consists of 20 constitution documents of the Netherlands from 1884 to 2016 in the German language. All documents have previously been made available by Knoth, Stede, and Hägert, (2018) in XML. A clear limitation is that the documents are available in German only. Nevertheless, it is assumed that the generated data will be valuable to any German-speaking researcher intending to study European constitutions based on their structure or content.
Model the Data

Modeling Linked Data often requires going from one model to another, e.g. from a relational database to a graph-based representation or, as in this use case, from pre-defined XML documents to the intended graph model. Modeling the data also requires understanding its basic structure. The exemplary dataset consists of constitution documents which employ a very specific and relatively consistent document structure. The structure of the utilized XML documents has previously been modeled by Knoth, Stede, and Hägert, (2018). The process of generating the original XML dataset was not trivial and is a highly cumbersome task, since no machine-readable and chronological dataset of European constitutions is available on the Web. Even though an HTML representation of these constitutions exists on the Web2, it only contains the latest version of each constitution, with colored text paragraphs indicating changes to the respective constitution. An example of the original data is given in Figure 11. Constitutions are usually divided into several main chapters which are furthermore divided into paragraphs,
2 http://www.verfassungen.eu/, last visited: July 27, 2018
articles and sections. In some cases, articles and sections are directly connected to main chapters; the XML representation of this structure is further discussed by Knoth, Stede, and Hägert, (2018, p 199).
Specify an Appropriate License

According to the W3C, data reuse is more likely to occur when “there is a clear statement about the origin, ownership and terms related to the use of the published data” (Hyland, Atemezing, and Villazón-Terrazas, 2014). The Creative Commons project provides a sophisticated framework for “free, international, easy-to-use copyright licenses that are the standard for enabling sharing and remix”3. All data provided in this use case are published under the Creative Commons ’Attribution-ShareAlike 4.0 International’ license4, which means the data can be copied and distributed in any medium or format, and they can be remixed, transformed and built upon for any purpose, even commercially. The license requires that (1) the person or organization reusing the data gives appropriate credit to the author and (2) contributions are distributed under the same license as the original.
In order to benefit from the value of Linked Data, resources should be identified using HTTP URIs. Furthermore, the URI structure should never contain anything that will change over time, e.g. session IDs or tokens, to give others the possibility to reuse the data.
The W3C highly promotes the reuse of standardized vocabularies (Hyland, Atemezing, and Villazón-Terrazas, 2014). Reusing other people's vocabularies has become an important factor in the development of the Semantic Web and Linked Data, and it is one of the major aspects that have made the Semantic Web as successful as it is across domains. Building on other people's work significantly increases a dataset's value, decreases the workload of the person or organization reusing it, and enlarges the network of Linked Data on the Web (Noy, McGuinness, et al., 2001). Several services (recommended by the W3C) exist to find existing vocabularies, including Linked Open Vocabularies5 or Prefix.cc6.
In the context of this use case, a corpus of constitutional documents is used and therefore a vocabulary for this domain was utilized. The Constitute Project7 has already developed an ontology for the domain, which could partially be reused in the context of this corpus. The project aims at creating a platform for professionals drafting constitutions, who thus need to read and compare constitutions of various countries with each other (Elkins et al., 2014). The used ontology is freely available on the Web8. Figure 12 visualizes parts of the ontology in the Protégé editor9. The ontology contains a class co:Constitution10 and the subclass co:Section. Each section has a co:rowType, which can be a title, ulist, olist, or body11. Furthermore, each constitution co:isConstitutionOf co:Country. The modeling of all countries in the ontology has been reused by the authors from the FAO Geopolitical Ontology12. The ontology treats all parts of a constitution in the same way, regardless of whether a part is actually an article, section, or paragraph. However, in order to query the constitutions for a further analysis in sociology, this information is critical; therefore, the ontology has been extended with this information. The namespace "https://github.com/tabeatietz/semsoc/" has been created and used with the prefix s:. The classes s:Part, s:Paragraph, s:Section and s:Article have been added to the ontology, which makes it possible to query each specific unit separately. The given ontology furthermore models the year the respective constitution was created in. However, the corpus in this use case often contains two constitution versions for a specific year. Therefore, the s:edition property has been added, which accepts values of the type xsd:date13.
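A minimal sketch of this extension in Turtle could look as follows (the subclass axioms are an assumption for illustration; in the generated instance data, each unit is typed with both a co: and an s: class):

@prefix s: <https://github.com/tabeatietz/semsoc/> .
@prefix co: <http://www.constituteproject.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# added structural units, modeled here as subclasses of co:Section (assumed)
s:Part a owl:Class ; rdfs:subClassOf co:Section .
s:Chapter a owl:Class ; rdfs:subClassOf co:Section .
s:Paragraph a owl:Class ; rdfs:subClassOf co:Section .
s:Section a owl:Class ; rdfs:subClassOf co:Section .
s:Article a owl:Class ; rdfs:subClassOf co:Section .

# edition property to distinguish two versions within one year
s:edition a owl:DatatypeProperty ;
    rdfs:domain co:Constitution ;
    rdfs:range xsd:date .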
Even though ontologies are grounded in formal logic and not every sociologist is familiar with this complex domain, modeling ontologies can also be achieved by non-technicians. Halford, Pope, and Weal, (2013)
have especially emphasized the possibilities and benefits for sociologists and discussed that creating ontologies is not solely a technical problem, because high domain knowledge is required to create sophisticated models. For instance, in a recent project in the domain of film and TV production, non-technicians have been creating a film ontology14 to model the entire production process from the initial idea to costume design to the final editing (Agt-Rickauer, Waitelonis, Tietz, and Sack, 2016). There are numerous tools and guides to support non-technicians in the development and reuse of ontologies. One of the most popular free and open-source editors is called Protégé (Musen, 2015). The community has built numerous plugins and developed numerous guides to support non-technical users. One of the most widely used guides has been created by Horridge, Knublauch, Rector, Stevens, and Wroe, (2004) and has since been renewed in several editions. If an ontology has to be created from scratch, the PoolParty Thesaurus Management System15 can help sociologists to collect and describe all necessary concepts and define relationships to other concepts (Schandl and Blumauer, 2010). The tool was created specifically for domain experts unfamiliar with Semantic Web technologies and without programming skills, and Bosch and Zapilko, (2015) have also specifically pointed out the usefulness of the PoolParty tool for social scientists.
Converting the provided XML data to RDF involves mapping source data to RDF statements (Hyland, Atemezing, and Villazón-Terrazas, 2014). The files made available by Knoth, Stede, and Hägert, (2018) have been converted to RDF using a Python script. The script is available on the Web via Google Colaboratory16. For sociologists, this step in the process of publishing Linked Data is one of the most crucial and is most likely achieved through an interdisciplinary collaboration, since programming skills are required. Even though several tools exist which provide aid in converting data (e.g. by Lange, (2009), and Heyvaert et al., (2016)), no complete out-of-the-box software exists which runs the process completely automatically and error-free. Ideally, this process will become obsolete in the future when adding further constitution data, since the data can then be modeled directly in RDF.
With this best practice, it is made sure that not only humans are able to access and exploit the provided data, but machines have access to the data as well. This can mainly be accomplished by providing a RESTful application programming interface (API), a SPARQL endpoint, or a file download in RDF. Which method the sociologist should use highly depends on their skill set, the use case, and the ability to maintain the access point (cf. section 4.1.2.9). The easiest method is to simply provide RDF data dumps. This approach does not require extensive maintenance (as e.g. a SPARQL endpoint does) and does not require enormous technical skills. All data generated in this use case are available as a Turtle dump file on GitHub17. After downloading the dataset, the researcher is able to load it into a triple store of choice and then proceed to formulate queries using SPARQL. Popular triple stores include but are not limited to Apache Jena Fuseki18, Blazegraph19, and Virtuoso20.
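After the import, a minimal SPARQL query can serve as a smoke test, e.g. by counting all triples in the store:

SELECT (COUNT(*) AS ?triples)
WHERE { ?s ?p ?o }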
The W3C has furthermore recommended best practices to announce the published data and to recognize the social contract that comes with the published material. The former includes associating an appropriate data license, ensuring data accuracy, and planning a persistence strategy as well as a method for people to provide feedback on the data. Recognizing the social contract refers to the responsibilities that come with maintaining the data as well as its access points. If a SPARQL endpoint is provided to query the data directly, the social contract involves keeping the endpoint available and stable (Hyland, Atemezing, and Villazón-Terrazas, 2014). The way in which data and access points are maintained substantially influences not only the way a specific dataset can be reused and valued by other people, it also contributes significantly to the success or failure of Linked Open Data in the future of the Web (priv. comm.). However, due to the exemplary nature of this use case, an extensive announcement of the provided dataset as well as a long-term commitment to a social contract has not been planned at this point. This will be further discussed in the future work section in chapter 6.
This section gave an overview of the motivation and process of publishing sociological text documents as Linked Data according to the best practices published by the W3C. Thereby, the exemplary use case of constitutional texts provides a real-world scenario. As a contribution, the corpus consisting of 20 constitutional documents has been converted to RDF and made available on the Web, along with the Python script (described in section 4.1.2.7) converting the files to RDF. A snippet of the generated RDF data is depicted in Figure 13 and Listing 19 (in Appendix A). In the example, the 2016 constitution edition and its section structure are depicted.
17 https://github.com/tabeatietz/semsoc
18 https://jena.apache.org/documentation/fuseki2/, last visited: July 5, 2018
19 https://www.blazegraph.com/, last visited: July 5, 2018
20 https://virtuoso.openlinksw.com/, last visited: July 5, 2018
Figure 13 shows how this information has been structured in the RDF graph. The entire document (the constitution from 2016) is defined as co:Constitution, and the rest of the different levels of the document are defined as co:Section. To provide a more fine-grained analysis, the single section parts have been further divided into s:Chapter, s:Paragraph, s:Article and so on. Thereby, the chapter is co:parent of a paragraph and a paragraph is co:parent of an article. For each element, it is modeled whether it is of type co:title or co:body. Each element further has a co:sectionID, which is a sequential number.
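The following Turtle fragment sketches this structure; the resource names and the edition date value are hypothetical, the direction of co:parent is assumed to point from a unit to its containing unit, and the actual identifiers are shown in Listing 19 in Appendix A:

@prefix s: <https://github.com/tabeatietz/semsoc/> .
@prefix co: <http://www.constituteproject.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# hypothetical resource names for illustration
s:NL_2016 a co:Constitution ;
    s:edition "2016-01-01"^^xsd:date .   # date value assumed

s:NL_2016_c1 a co:Section , s:Chapter ;  # a chapter of the 2016 edition
    co:parent s:NL_2016 ;
    co:rowType co:title ;
    co:sectionID 1 .

s:NL_2016_c1_a12 a co:Section , s:Article ;  # an article within the chapter
    co:parent s:NL_2016_c1 ;
    co:rowType co:body ;
    co:sectionID 12 .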
The following sections 4.2 and 4.3 will demonstrate how the generated data can
be queried and exploited in the context of sociological research.
Figure 14: Workflow of converting the provided XML data to RDF and querying the RDF
via Blazegraph
In this section, the process of text analysis aided by Semantic Web and Linked Data technologies will be discussed on the foundation of the use case described in section 4.1.1. As described above, the analysis of constitutions in the context of sociological research makes it possible to assess how states model societies. Constitutions are self-descriptions of these states and mirror the different roles within them as well as their relationships with each other. Numerous aspects may be analyzed, including affiliations (who belongs to the state and who is considered a foreigner), leadership (who is the head of the state and how responsibilities are allocated), or more fine-grained aspects, e.g. the role of women in the state. Two possible perspectives for analyzing these aspects in sociology are the analysis of the document structure and the analysis of the actual content. This section focuses on the structure level before an analysis on content level is performed in section 4.3.
Constitutional documents follow a strict formal hierarchy. Each document is organized into several units, namely chapters, paragraphs, articles, and sections. Typically, each section belongs to a specific article, and each article belongs either to a paragraph or directly to a chapter. When analyzing constitutional documents of specific countries and their changes over time, the document structure may give insights into which chapters, paragraphs, or articles have been added, deleted, or changed over time. This analysis enables an initial exploration of the corpus before the in-depth content analysis takes place.
4.2.1 Workflow
Analyzing the document structure in this use case means querying it using the SPARQL query language as described in section 3.2.7. Figure 14 shows the workflow of this section. The RDF graph generated with the Python script introduced in section 4.1 is taken as the input file here and uploaded to the Blazegraph framework for querying. In order to enrich the documents in the use case with external context information, the DBpedia knowledge base is used via federated querying (cf. section 4.2.2.1).
Listing 8: SPARQL query for the list of constitution documents and their editions
4.2.2 Querying
In order to get a first and broad overview of all documents in the corpus, a first query as shown in Listing 8 asks:
1. Which documents are in the corpus that contain the constitution of the Netherlands?
The query in Listing 8 selects anything in the knowledge base that is of rdf:type co:Constitution and belongs to the Netherlands. Furthermore, the edition of each constitution is selected via the s:edition property. A reconstruction of the query is sketched below. Figure 16 visualizes the query result in a timeline overview, generated via TimeGraphics21. It shows the 20 constitution editions in the corpus, spanning the years 1884 to 2016.
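A possible reconstruction of the query in Listing 8 (a sketch based on the description above; variable names and the country pattern, cf. footnote 22, are assumptions) is:

PREFIX s: <https://github.com/tabeatietz/semsoc/>
PREFIX co: <http://www.constituteproject.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?constitution ?edition
WHERE {
  ?constitution rdf:type co:Constitution ;
                co:isConstitutionOf ?country ;  # country pattern for generalizability
                s:edition ?edition .
}
ORDER BY ?edition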
Chapter Level
The constitutions' topmost elements are the chapters. The chapters set the entire framework of the constitutions; they dictate the main topics as well as their order. Therefore, it is assumed that structural changes on chapter level tend to have a higher impact on the entire constitution structure and content than changes on a lower article or section level. How can changes in a document structure be assessed? Counting the chapters in each edition is a simple and low-effort approach, but it is assumed to deliver a promising entry point into top-level document changes. The query in Listing 9 returns results to the question:
• How many chapters does each edition of the Dutch constitution contain?
The results to the query above are also visualized in Figure 16. The red line in the timeline view shows how many chapters are included in each constitution edition. While the number of chapters stayed constant until the 1922 edition, there were a number of changes between 1938 and 1972. Due to the fact that the chapter number dropped from 14 to 9 between 1972 and 1983, it is assumed that significant changes in the constitution occurred. Between 1948 and 1953, one more
22 While currently there are only documents of the Netherlands in the corpus, this was included in the
query to increase generalizability
chapter was first added and later removed. As simple as these numbers seem, they give initial insights into the extent of the changes from one edition to another. The observations made here provide evidence of significant structural (and most likely also content-related) changes and give insights on where to begin a further in-depth investigation. Based on these observations, further content analysis may focus on the editions with more severe changes (e.g. 1938 and 1983) rather than on editions with obviously less severe changes.
To get more information about these changes, the next level to investigate are the chapters' titles, and it can be asked:
• For each edition of the Dutch constitution, what are the names of the single chapters?
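The corresponding query (Listing 10) might be sketched as follows; the s:text property used here is a hypothetical placeholder for whichever property holds the section text, and the selection of the first chapter via co:sectionID is an assumption:

PREFIX s: <https://github.com/tabeatietz/semsoc/>
PREFIX co: <http://www.constituteproject.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?edition ?title
WHERE {
  ?constitution rdf:type co:Constitution ;
                s:edition ?edition .
  ?chapter rdf:type s:Chapter ;
           co:parent ?constitution ;
           co:sectionID 1 ;   # first chapter only (assumed)
           s:text ?title .    # hypothetical text property
}
ORDER BY ?edition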
Listing 10 queries all chapter 1 headers per constitution edition. The results show that the title of the first chapter of the constitution of the Netherlands from 1884 to 1972 was 'Erstes Hauptstück - Vom Reich und seinen Einwohnern' (Of the empire and its inhabitants) and that, starting from 1983, the first chapter was entitled 'Grundrecht' (Fundamental Rights). That means the Netherlands first included the Fundamental Rights in its constitution in 1983.
Listing 11: Query to count all sections for the first chapter of the 2016 constitution edition
Listing 12: Query to list all section texts for article 12 in the constitution editions of 1983 and 2016
Comparing the first chapters of both editions (cf. Listing 11), it becomes clear that the Fundamental Rights did not significantly change on structure level, with the exception of article 12, where one section was added to the constitution. To have a closer look at article 12, the query in Listing 12 has been created, which returns the text of all article 12 sections. The output as shown in Table 2 allows comparing the single sections of both editions with each other. Article 12 defines rules for the entering of a person's home. While section 1 of article 12 was only slightly changed, the changes in section 2 were much bigger. It can be seen that the last sentence in the 1983 edition, which reads "The resident receives a written report on entering the apartment", was removed in the 2016 edition. Even though it initially seems that this sentence was removed entirely, the last row in the table reveals that this topic was simply defined in greater detail and is now also put in relation to national security. Further investigation shows that article 12 in chapter 1 was not edited at all from 1983 to 2000. The changes, which now include a paragraph about national security, were only added in the 2016 edition.
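A reconstruction of Listing 12 under the same assumptions (hypothetical s:text property; edition date literals assumed) could read:

PREFIX s: <https://github.com/tabeatietz/semsoc/>
PREFIX co: <http://www.constituteproject.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?edition ?section ?text
WHERE {
  ?constitution rdf:type co:Constitution ;
                s:edition ?edition .
  ?article rdf:type s:Article ;
           co:sectionID 12 ;           # article 12 (numbering assumed)
           co:parent+ ?constitution .  # anywhere below the constitution
  ?section rdf:type s:Section ;
           co:parent ?article ;
           s:text ?text .              # hypothetical text property
  FILTER (?edition IN ("1983-01-01"^^xsd:date, "2016-01-01"^^xsd:date))  # date values assumed
}
ORDER BY ?edition ?section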
From this exploration of the document structure, it can be learned that, from a top-level perspective, after the Fundamental Rights were first included in the constitution of the Netherlands, there were significant changes in article 12, which deals with the entering of a person's home. In 2016, a paragraph was added to the article that regulates the notification procedure in case the interests of national security are affected. It is assumed that the addition of a section concerning national security is worthy of a further analysis on content level. If there is further evidence that more paragraphs related to the national security of the Netherlands were added to the constitution, future work may research when and to what extent this factor of national security became important in the constitution. Furthermore, it may be studied which entities of the state (citizens, foreigners, civil servants, ministers) are affected by regulations on national security in the corpus.
When analyzing a document corpus through the exploration of its structure and its changes over time, findings may be better understood when placed into their historical and societal context. The exemplary use case utilized in this thesis provides historical textual data about the constitutions of the Netherlands. But what does it actually mean that a constitution was created in a specific year and was valid for a certain time period? Without any background information, e.g. about historical events, the respective leading party, or the ruling monarch, these constitutional versions are only data without any meaning and can hardly be understood thoroughly. Two main methods exist to integrate such external background knowledge:
1. Federated Queries: SPARQL not only enables querying RDF in local graph databases, but also allows to "express queries across diverse data sources" (Prud'hommeaux and Buil-Aranda, 2013). This feature is executed via the SERVICE keyword and allows merging data distributed across the Web via an endpoint. A SPARQL endpoint enables humans and machines to query a specific knowledge base via SPARQL23. In the case of DBpedia, the existing SPARQL endpoint24 can be used to easily query very specific information out of the large knowledge base without having to download an entire dataset. One of the advantages is that the data does not need to be updated by the user, since it is the data provider who has this responsibility. While this method is easy to use, the major disadvantage is the dependence on third-party servers. SPARQL endpoints are often not available due to maintenance or other reasons, which is a crucial factor for Web applications (Verborgh et al., 2014).
2. Data dump: Instead of querying only the needed triples, with data dumps an entire dataset is downloaded and integrated into the existing data. For instance, DBpedia provides a large number of datasets for eight different language versions in Turtle and quad-turtle format25. This method is often used because applications thereby remain completely independent of the availability of third-party servers and rely solely on their own systems. However, querying data in this way is not considered querying on the Web, as envisioned by the Semantic Web community. Using data dumps for querying means loading a large number of triples into a graph database, even if only a small part of these data is actually used in the end. Furthermore, the data have to be updated manually by the user, contrary to the federated querying method.
The exemplary use case in this thesis will focus on federated querying. The advantages have been discussed above. The main disadvantage of federated queries is the low availability of endpoints on the Web. This is a crucial aspect, especially for commercial applications. In this use case, it will be assumed that high availability at all times is not the most crucial factor. Therefore, a federated query seems to be the best option to integrate a few very specific external data points into the corpus.
Query Planning
The Netherlands is a constitutional monarchy. The monarch is the head of state, and the constitution defines the monarch's position, power, and responsibility in the state as well as his or her relationship with the rest of the government. The provided document corpus contains the Dutch constitution from 1884 to 2016. In these 132 years, several monarchs ruled the Netherlands. When analyzing constitution editions and their changes over time, the background information on which monarch ruled the country during which specific constitution edition gives the data meaning and context to understand the dataset. The goal of the query is:
1. List all editions of the constitution of the Netherlands contained in the docu-
ment corpus and
2. for each edition, present the respective ruling monarch along with the start-
ing and ending year on the throne.
The query is divided into two parts. The first part is a simple query of content already included in the corpus. The second part concerns information that is not part of the corpus and has to be queried from an external Linked Data knowledge base.
Query Building
To enrich the existing data with external knowledge, a dataset has to be selected. In this case, DBpedia will be used via its SPARQL endpoint, because it is assumed (since the data is generated from Wikipedia content) that it covers information about countries and their governments well. An example representation of a Dutch monarch in DBpedia is the HTML page for the resource dbr:Beatrix_of_the_Netherlands27. All triples contained in DBpedia with this subject are visualized on this HTML page.
Listing 13 shows the entire query necessary to present the constitution editions and their respective monarchs in one table. It consists of an inner and an outer query. The outer query makes use of the triples in the local knowledge base and collects all constitutions and their editions (lines 10 - 15). The inner query begins at line 17. The SERVICE keyword followed by <http://dbpedia.org/sparql> indicates that the DBpedia SPARQL endpoint will be queried for the triples requested in the following curly braces. The lines 18 to 28 ask for something (?monarch) that belongs to the category dbc:Dutch_monarchs and has an rdfs:label. As can be seen in the HTML representation of the entity Queen Beatrix, there are multiple values for rdfs:label in several languages. The FILTER constraint in line 20 specifies that the query should only return labels in the English language. Furthermore, the query specifies in line 21 that all monarchs have to have a starting year of reign to be included in the results. This is followed by two OPTIONAL statements in lines 22 and 26. Optional means that the query collects the triples matching the patterns in the curly brackets, but only if they exist in the knowledge base; they are not mandatory specifications. The first statement queries for all monarchs who have a successor (?suc), and the beginning of the successor's reign is defined as the previous monarch's end of reign. The second optional statement simply queries the end year of the monarch's reign. The reason to include these statements as optional is that it is unknown when the reign of the current King of the Netherlands will end. The filter constraints in lines 31 and 32 make sure that only those monarchs are selected whose reigns are related to the edition years of the documents. The BIND statements create a dummy variable present in both the inner and the outer query to join the results of both queries in the resulting table.
Results
Table 3 shows the result of the query in Listing 13. Regarding the data itself, there are two aspects to observe: (1) rows 7 and 8 refer to the same edition of the constitution and list two monarchs. The reason is that in 1948 Wilhelmina of the Netherlands died and the throne was inherited by Juliana of the Netherlands. (2) The current King, Willem-Alexander of the Netherlands, is not listed, even though he succeeded Beatrix in 2013 and should theoretically be in the result table. The reason is simply an error in DBpedia. The correct resource URL for the ruling King is dbr:Willem-Alexander_of_the_Netherlands. Unfortunately, the URL listed in DBpedia as the successor of Beatrix is dbr:Willem-Alexander, which only redirects to the correct resource28 and therefore does not match the query29.
27 http://dbpedia.org/page/Beatrix_of_the_Netherlands, last visited: August 2, 2018
Listing 13: Federated Query for all Dutch monarchs from 1884 to 2016 and the respective
constitution editions
1 PREFIX s: <https://github.com/tabeatietz/semsoc/>
2 PREFIX dct: <http://purl.org/dc/terms/>
3 PREFIX co: <http://www.constituteproject.org/ontology/>
4 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
5 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
6 PREFIX dbc: <http://dbpedia.org/resource/Category:>
7 PREFIX dbo: <http://dbpedia.org/ontology/>
8
9 SELECT DISTINCT ?edition ?year ?name ?start ?end
10 WHERE {
11 ?constitution a co:Constitution ;
12 s:edition ?edition ;
13 co:isCreatedIn ?year .
14 BIND("1" as ?dummy )
15
16
17 SERVICE <http://dbpedia.org/sparql> {
18 ?monarch dct:subject dbc:Dutch_monarchs ;
19 rdfs:label ?name .
20 FILTER (LANG(?name)="en")
21 ?monarch dbo:activeYearsStartYear ?start .
22 OPTIONAL {
23 ?monarch dbo:successor ?suc .
24 ?suc dbo:activeYearsStartYear ?end .
25 }
26 OPTIONAL {
27 ?monarch dbo:activeYearsEndYear ?end .
28 }
29 BIND("1" as ?dummy )
30 }
31 FILTER(str(?year)<= str(?end))
32 FILTER(str(?year)>= str(?start))
33 }
34 ORDER BY ?edition
28 When visiting the HTML representation for the resource dbr:Willem-Alexander, the user is immedi-
ately redirected to the correct URL. However, this redirect is not easily accomplished with SPARQL.
29 Even though King Willem Alexander was not found, he was included in gray color in the timeline.
The content enrichment with the reigns of Dutch monarchs is one example of many. In the exact same manner, further context could be provided, e.g. the governing Prime Minister during each edition or periods of war which had significant influence on the country. In this example, it was important to show that, once data is converted to RDF and queried with SPARQL in a local triple store, the exploration of the data does not have to end. The possibility to enrich the content automatically with external data and thus to create new context and new knowledge is one of the main achievements of the Semantic Web. I strongly believe that the possibilities that come with the enrichment as demonstrated can be highly beneficial for sociologists when analyzing textual data.
Table 3: Result of the federated SPARQL query in Listing 13 with the columns 1-2 queried
in the existing dataset and the columns 3-5 queried in DBpedia
The analysis of the document structure enables the sociologist to initially explore the corpus. In this section, it has been demonstrated that analyzing documents in RDF can be accomplished with the help of SPARQL queries, using the exemplary use case of constitutional documents. Furthermore, it has been shown how external Linked Data can be integrated into the existing data to create a historical context and to give the data more meaning.
The 20 documents analyzed in this section can be considered a rather small document corpus, in which some elements of the structural changes, especially on
Figure 18: Workflow of semantically annotating the data with DBpedia entities and con-
verting the output document containing RDFa into NIF2
the top chapter level, may be surveyed by one researcher quickly without using SPARQL. However, it is easy to imagine that this will not be possible anymore when there are 200 or 2000 constitution documents from numerous different countries to be analyzed. The technologies and standards presented in this section provide means to study such data in the same way this rather small corpus has been analyzed.
The provided queries are merely examples of what is possible and realistic for sociological research using SPARQL to explore the document structure, from the very top level of simple chapter counting down to the section level deep within the documents. This is possible because the data were modeled in a way that references each section separately with a unique URI. The entire process as discussed is not limited to constitutional documents and is therefore generalizable to numerous other document corpora and research topics.
A system limitation is the dependency on available and reliable knowledge bases. Also, it is assumed that another obstacle for users with limited technical skills is the need to create SPARQL queries. One solution to overcome this problem is to create interactive user interfaces (based on queries as presented) to help researchers explore the content in an easy and fluent manner without having to concentrate on designing the correct queries. Even though many solutions already exist to create generic user interfaces for a broad range of texts, this topic is part of ongoing research and will be further discussed in section 5.2.4.
In the previous sections, the document corpus has been converted to RDF to create unique identifiers for each single unit down to section level (cf. 4.1), which makes it possible to query the generated data using SPARQL. Furthermore, RDF makes it possible to enrich the provided content with knowledge from external knowledge bases, which was demonstrated using DBpedia (cf. 4.2). In this section, it will be shown how Semantic Web technologies and Linked Data make it possible to extend text analysis in sociology by means of semantic annotation.
4.3.1 Workflow
The workflow in this section builds upon the work in sections 4.1 and 4.2 and is visualized in Figure 18. The Python script converting the XML data into RDF produces two outputs. The .ttl file is, as discussed in the previous sections, directly integrated into Blazegraph. The other is a .txt file which is used for the semantic annotations. It contains all constitution texts sorted by edition and is imported into a Wordpress editing interface, where it is annotated with DBpedia entities using the refer tool. These annotations are stored as a text file containing RDFa and converted into NIF2 via a simple Python script. The resulting .ttl file is also integrated into Blazegraph, where both graphs are merged and queried together with SPARQL. As already elaborated in section 4.2.2.1, the existing content can be further enriched via federated queries.
4.3.2 Annotation
The rationale and overall methodology of semantically annotating text has been briefly described before and will be implemented in the exemplary use case in this section. First, the challenges and functionalities of semi-automated annotation interfaces in general will be introduced, followed by a brief description of the refer system. Then, the annotation method and criteria are discussed, followed by a number of use cases and queries to support the social scientist in the exploration of large textual documents.
Semantically annotating text with entities from a large knowledge base like DBpedia requires a well-functioning user interface if the annotations are created manually or semi-automatically. The task of the user interface is to suggest possible entity candidates to the annotating user based on an input text. One of the major challenges is to present the entities in a way that users unfamiliar with Linked Data (so-called lay-users) are able to make use of the interfaces. Lay-users typically have no further insight into what the content of a knowledge base is or how it is structured, which has to be considered when suggesting the entities the user should choose from (Shneiderman et al., 2016). An example to demonstrate the difficulty of this task is the annotation of the term 'Berlin'. The entity dbr:Berlin as the capital of Germany could be considered, as well as the historical references to the city, dbr:West_Berlin and dbr:East_Berlin, or the person dbr:Nils_Johan_Berlin, and many more. Some entity mentions yield lists of thousands of candidates, which a human cannot survey quickly to find the correct one. Therefore, autosuggestion utilities are applied to rank and organize the candidate lists according to e.g. string similarity with the entity mention, or
general popularity of the entity (Osterhoff, Waitelonis, and Sack, 2012). Khalili and Auer, (2013) furthermore defined requirements for semantic content authoring tools. While authoring (and annotating) content, the user should face a minimum level of interruption, and entity recommendations should be displayed with no (or a minimum) level of distraction. The tool should assist (semi-)automated annotation in a useful manner and provide easy correction of previous annotations (whether created by another user or an algorithm). The user should be able to distinguish between manually created annotations and automated annotations. Furthermore, the user interface should be customizable, depending on the visual layout of the respective publishing environment. There are numerous tools available to semantically annotate text.
The semantic editor and text composition tool Seed by Eldesouky et al., (2016) enables automated as well as semi-automated semantic text annotation in real time. That means, while the author writes a piece of text, the system automatically performs NEL to reduce the work effort for the author. Although this feature seems rather useful for blog authors, it is not applicable to this use case, because the text in the documents is already completed. The Pundit Annotator Pro by Morbidoni and Piccioli, (2015) also offers to create semantic annotations in text. The tool allows users to define their own properties and knowledge bases. However, it is assumed that in order to define their own resources, sociologists already have to have a profound knowledge of these knowledge bases beforehand. Furthermore, the annotator is not available for free. dokieli by Capadisli et al., (2017) is an annotation tool with integrated support for social interactions. The goal is to employ a tool-agnostic, generic format for semantic annotations, which are saved in an HTML+RDFa format. Furthermore, dokieli allows users to save annotated documents.
refer
In order to annotate the constitution documents, the refer annotation system is used (Tietz et al., 2016). refer consists of a set of powerful tools focusing on NEL. It aims at helping text authors and curators to semi-automatically analyze textual content and semantically annotate it with entities contained in DBpedia. In refer, automated NEL is complemented by manual semantic annotation supported by sophisticated autosuggestion of candidate entities, implemented as a publicly available Wordpress plugin30. Next to content annotation, refer also enables visualizing the semantically enriched documents in a navigation interface for content exploration. refer was chosen for this task because it fulfills all of the criteria mentioned by Khalili and Auer, (2013). Furthermore, the entire system is available for download and is easily installed in the Wordpress content management system, which increases the reproducibility of this work. A user study focusing on lay-users has shown that the refer annotation interface is easy to use and enables a sophisticated annotation process (Tietz et al., 2016).
For automated annotation, refer deploys the KEA-NEL (Waitelonis and Sack, 2016), which implements entity linking with DBpedia entities (Usbeck et al., 2015). The user can choose between a manual and an automated annotation process. The refer annotator includes two configurable annotation interfaces for creating or correcting annotations manually: (1) the Modal annotator, shown in Figure 20, and (2) the Inline annotator, shown in Figure 19. The former builds upon the native TinyMCE editor31 controls provided by Wordpress to trigger the display of suggested entities in a modal dialog window. The suggestion dialog starts with a text input field, which initially contains a selected text fragment and can be used to refine the search term. Suggested entities are shown below in a table-based layout, divided into the four categories Person (green), Place (blue), Event (yellow), and Thing (purple). The window further includes a list of recently selected entities for faster selection of already annotated entities in the same text. The entity's DBpedia abstract and URI are displayed on mouseover. A click selects the entity and encodes the annotation as RDFa markup, which is added to the corresponding text fragment. The Inline annotator enables choosing entities directly in the context of a selected text fragment.
30 https://www.refer.cx/, last visited: July 28, 2018
31 https://www.tiny.cloud/, last visited: August 2, 2018
As elaborated in the previous section, the refer annotation tool has been utilized for this use case. The original corpus provided by Knoth, Stede, and Hägert, (2018) was generated in the German language for a number of reasons. This provides some challenges regarding the annotation process. The automated analysis in refer deploys the KEA-NEL, which was created for the analysis of English text. Furthermore, KEA uses DBpedia, currently one of the largest Linked Data knowledge bases available, to create annotations. DBpedia is mainly generated from Wikipedia infoboxes and therefore provides information about an enormous variety of topics. This leads to the assumption that the automated analysis of the very specific domain of constitutional documents (even if they were provided in English) would be rather error-prone, because the candidate lists contain not only entities related to constitutions but also entities related to any topic covered by DBpedia. Therefore, the decision was made to manually annotate certain parts of the corpus with entities from the English DBpedia using refer. While this method seems rather cumbersome, it was considered the best alternative for this use case. The way the Modal annotation interface was implemented enables an easy adaptation. After an initial survey of the corpus, a candidate list of entities was created and integrated into the interface to improve the annotation process.
Before the actual annotation task can start, annotation criteria have to be defined to ensure consistent results, especially if the annotations are created collaboratively. First, it has to be defined what actually constitutes a named entity worth annotating. In the use case of this thesis, it is assumed that rigid as well as non-rigid designators are important for the analysis, as discussed in section 3.2.9. The rationale here is to generate as much knowledge as possible from the text in order to be able to analyze the data from multiple perspectives. Further entity annotation criteria concern entity specificity and completeness.
Entity Specificity
Another annotation criterion noteworthy in this use case refers to the specificity of entities. While the level of entity specificity may differ for various annotation use cases, the annotations in this thesis are performed with the most specific entity in the knowledge base. If a sentence reads 'In 2018, the Winter Olympics took place in South Korea', the DBpedia entity to annotate the phrase 'Winter Olympics' with is not dbr:Winter_Olympic_Games, since it is not the most specific entity for this context in the knowledge base. The context reveals that the sentence refers to the 2018 Winter Olympics; therefore, the phrase is annotated with the resource dbr:2018_Winter_Olympics.
Entity Completeness
The aspect of entity completeness means that anything that is a named entity (according to the given definition) should be annotated. Even though the 20 given documents contain too much content to annotate entirely in the course of this thesis, the articles and sections chosen for annotation have been annotated according to this completeness criterion. In the use case of this thesis, DBpedia has been the sole source of entities for all annotations. Even though DBpedia is currently one of the largest cross-domain knowledge bases and provides a sophisticated source for semantic annotations, it is clear that it cannot represent all real-world objects or abstract concepts known to humans. DBpedia is generated mainly from Wikipedia infoboxes and therefore depends on the coverage of content in Wikipedia. Unfortunately, Wikipedia suffers from a so-called systemic bias, which causes an unequally distributed interlinking of entities within the knowledge base. For example, in Wikipedia, and thus also in DBpedia, entities about film and music are overrepresented and very well interconnected compared to other domains (Oeberst et al., 2016). The consequence for this use case is that not all named entities related to the specific domain of constitutions can be expected to be represented in the knowledge base, especially considering that some of the texts were created in and before the early 20th century. Still, the annotations should be as complete as possible. Furthermore, it is important to measure which of the entities found in the text could not be linked to a DBpedia entity, to ensure the validity and informative value of this approach. To overcome these shortcomings, a 'Not In List' (NIL) entity has been created and included in the Modal annotation interface of refer. Whenever the annotating user encounters an entity not available in the knowledge base, the NIL entity is used, which makes it possible to assess the level of completeness of the annotations.
Temporal Roles
Another factor which requires some discussion, especially for the annotation of persons, is the acknowledgement of the entities' temporal roles. That means, if a text in a Dutch constitution document edition from the year 2016 mentions a term like 'der König' (the King), the term has been annotated with dbr:Willem-Alexander_of_the_Netherlands, who was (and currently is) the King of the Netherlands. This task is known as temporal role detection and is part of current research in NLP. Significant advances in this rather young field of research have been accomplished by Koutraki, Bakhshandegan-Moghaddam, and Sack, (2018); the topic is also tackled in a current research project led by the University of Zurich32. Even though NLP and NEL technologies are constantly improving, this rather difficult disambiguation task has not yet been solved in a way that can easily be implemented in any domain. This aspect also supported the decision to proceed with a manual annotation process in this use case.
32 http://www.cl.uzh.ch/en/research/completed-research/hist-temporal-entities.html, last vis-
ited: July 29, 2018
Table 4: Statistics of the annotations generated in the dataset

Description                    Count
Triples overall                155.804
Annotations overall            1.175
Distinct Entities overall      218
NIL Annotations overall        242
Annotations - 2016 edition     455
Annotations - 1983 edition     443
Annotations - 1884 edition     277
Parts of three constitutional documents have been semantically annotated with DBpedia entities according to the criteria and method discussed above. Table 4 shows the statistics of the generated annotations in the dataset. Overall, 1.175 annotations have been created in three constitution documents, using 218 distinct DBpedia entities. This means that, on average, each DBpedia entity has been used around five times. Across all documents, 242 NIL annotations have been used, which means that around 20% of all named entities in the documents were not in the knowledge base (or could not be found). It can be concluded that solely using the DBpedia knowledge base is not enough for a profound annotation. The complete list of NIL annotation surface forms is presented in Table 7 in Appendix A. In order to advance in this matter, a next step may involve the analysis of all surface forms for which no annotations could be created. Domain experts may then (1) find another already existing knowledge base or (2) create their own knowledge base to enable a more complete annotation process.
Using the refer tool, the texts have been enriched with RDFa annotations. The documents have further been converted to NIF2 and imported into Blazegraph. This section discusses how these annotations make it possible to explore the generated data. All prefixes used in the SPARQL queries of this section are shown in Listing 14.
Two datasets have now been imported into Blazegraph: the RDF data representing the entire structure of all documents and the data containing all NIF2 annotations (cf. Figure 18). SPARQL now allows querying both graphs in a single query to exploit all data generated so far in the process. Listing 15 shows how to locate a specific entity annotated in the corpus on section level. This is possible because each section, article, paragraph, chapter, and constitution has been assigned a unique URI and can thus be referenced easily.
Listing 14: All prefixes used for the SPARQL queries in this section
PREFIX s: <https://github.com/tabeatietz/semsoc/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX co: <http://www.constituteproject.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX itsrdf: <http://www.w3.org/2005/11/its/rdf#>
Listing 15: Query for the location of the entity dbr:Netherlands in the corpus
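Such a query can be sketched as follows; it assumes that each annotated section text forms a nif:Context, so that nif:referenceContext points to the section URI (the dbr: prefix is added to those of Listing 14):

PREFIX dbr: <http://dbpedia.org/resource/>

SELECT DISTINCT ?section
WHERE {
  ?annotation itsrdf:taIdentRef dbr:Netherlands ;
              nif:referenceContext ?section .
}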
A popular query to start out with in text analysis is a frequency analysis (Mayring, 2015). It is assumed that a frequency analysis becomes especially important when comparing document changes over time, as is the case in this exemplary use case. The analysis can function as a metric to find out when a certain term was first introduced into a constitution and, accordingly, whether the usage of this term has increased or decreased over a certain time period. For example, the analysis of the representation of marriage and its meaning for the state can be initiated through a frequency analysis. The SPARQL query in Listing 16 counts all annotations of the entity dbr:Marriage for each annotated constitution separately. The results show that the annotation has been used five times in the edition of 1884 and three times each in the editions of 1983 and 2016. Of course, these results are by no means complete, since only small portions of the documents have been annotated. Using semantic annotations for a frequency analysis makes it possible to query for specific entities (and also entity categories) regardless of the surface form, which eliminates the problem of covering all possible synonyms in the query.
Listing 16: Query counting all annotations of the entity dbr:Marriage in the annotated
data
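A sketch of Listing 16 (it assumes that the NIF context URIs coincide with the section URIs of the structure graph, so that co:parent* leads from a section up to its constitution):

PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?edition (COUNT(?annotation) AS ?num)
WHERE {
  ?annotation itsrdf:taIdentRef dbr:Marriage ;
              nif:referenceContext ?section .
  ?section co:parent* ?constitution .   # walk up the section hierarchy (assumed)
  ?constitution rdf:type co:Constitution ;
                s:edition ?edition .
}
GROUP BY ?edition
ORDER BY ?edition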
4.3.4.3 Contingency
The goal of a contingency analysis is to find out in which relations specific terms are used in a text document. Osgood, (1959) was among the first to use this method in content analysis. In the context of the research on constitution documents, a contingency analysis is especially interesting because it gives insights into how certain topics are modeled. The role of religion in a state has been widely researched (e.g. Lagler, 2000). As previously discussed, constitutions mirror how the society of a state is modeled. Therefore, analyzing the development of terms related to religion in constitution documents is plausible. Listing 17 shows what a contingency analysis using SPARQL may look like. The query selects all entities which have been annotated in the same section as the entity dbr:Religion. Thereby, the focus entity itself should not be part of the result list (lines 8-10). Table 5 shows the results. According to the annotations made, the entity dbr:Religion has been used most in the context of educational topics. Of course, for this analysis as well, the results are by no means representative, since only a small portion of the text has been annotated. Uncommenting the patterns in lines 17 and 18 allows sorting the co-occurrences by constitution edition to explore the changes over time.
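Listing 17 might be reconstructed as follows (a sketch under the same assumptions as above; the commented-out patterns correspond to lines 17 and 18 of the original listing, and when they are uncommented, ?edition has to be added to the SELECT and GROUP BY clauses):

PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?entity (COUNT(?other) AS ?num)
WHERE {
  ?focus itsrdf:taIdentRef dbr:Religion ;
         nif:referenceContext ?section .
  ?other itsrdf:taIdentRef ?entity ;
         nif:referenceContext ?section .
  FILTER (?entity != dbr:Religion)   # exclude the focus entity itself
  # ?section co:parent* ?constitution .
  # ?constitution s:edition ?edition .
}
GROUP BY ?entity
ORDER BY DESC(?num)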
In section 4.2.2.1, DBpedia was used to enrich the constitution documents with information about the reigning monarch during each document edition. In this section, the document text has been directly annotated with content from DBpedia. One significant asset of annotating research data in sociological text analysis with semantic entities from an external knowledge base is the possibility to enrich the existing content with external knowledge, as will be discussed in this section. Especially the annotation of text with temporal roles, as briefly elaborated in section 4.3.2.2, provides means to use the knowledge in DBpedia for further analysis. One example is the annotation of monarchs in the constitutions with respect to their temporal roles.
Listing 17: Query for all DBpedia entities annotated within the same section as the entity
dbr:Religion
The annotated text passage stems from the 1983 constitution edition. In that year, Queen Beatrix was the reigning monarch of the Netherlands, as revealed by the query in Listing 13 in section 4.2.2.1. Therefore, the term König (King) has been annotated with the respective DBpedia entity. Using DBpedia for knowledge enrichment first requires knowing which information the knowledge base holds about the entity in focus and how it is organized. The information is stored in DBpedia in the form of triples; a visual representation of this information is provided via an HTML webpage33. At the top of the page, there is a short abstract about Beatrix, which was automatically retrieved from the Wikipedia page of the former queen34. Directly beneath the abstract, a table lists all of the triples in DBpedia in which Queen Beatrix is the subject. The column on the left represents the properties connected to the subject. The right column lists the respective objects or values which are connected to Beatrix. The DBpedia page lists information such as birth and death dates, family members, religion, and succession. The information most vital for the analysis in this chapter is shown in Figure 21. The properties dct:subject and rdf:type connect the subject to certain classes and categories, which were retrieved from Wikipedia categories or from external knowledge bases like Wikidata. These classes and categories help to organize each entity in formal structures (ontologies). For instance, the categories show that Beatrix is a Dutch monarch and belongs to the House of Orange-Nassau. Furthermore, Beatrix is of type Person and belongs to a royal family. Clicking on one of the categories, e.g. dbc:Dutch_monarchs, reveals a list of all entities which are also connected to this category, e.g. dbr:Willem-Alexander_of_the_Netherlands. This organization of knowledge creates an enormous network of information, which can be exploited via SPARQL queries. In the area of Information Retrieval (IR), these semantic structures are utilized in numerous applications, including semantic search, recommender systems, and topic detection (Waitelonis, 2018). Also for
Table 5: DBpedia entities annotated within the same sections as the entity dbr:Religion

entity                                             num
http://dbpedia.org/resource/World_view             6
http://dbpedia.org/resource/Law                    4
http://dbpedia.org/resource/Liberty                4
http://dbpedia.org/resource/Education              4
http://dbpedia.org/resource/Freedom_of_thought     2
http://dbpedia.org/resource/Race_(biology)         2
http://dbpedia.org/resource/Sex                    2
http://dbpedia.org/resource/Netherlands            2
http://dbpedia.org/resource/Discrimination         2
http://dbpedia.org/resource/State_school           2
http://dbpedia.org/resource/Private_school         2
http://dbpedia.org/resource/Government_spending    2
http://dbpedia.org/resource/Requirement            2
text analysis in sociology, this knowledge can be used to give existing data more meaning.
In constitution texts, the roles of persons (e.g. Prime Ministers or monarchs) are not defined according to their sex, as the example above shows. That means, even though Queen Beatrix was the Queen of the Netherlands in 1983, the respective constitution text does not refer to her as the queen ("die Königin") but in the male form ("der König"). However, when analyzing constitutions, especially in the context of sociological gender studies (e.g. Crawford, 2009), the information on whether the king was actually a king or, at the time, a queen may be vital. For this purpose, DBpedia's graph structure helps to aggregate the content accordingly to answer the question:
1. Which constitution editions, articles and sections are valid under a reigning
female Dutch monarch?
a) Which edition, article and section has been annotated with an entity . . .
b) under the constraint that this entity belongs to the category of Dutch
monarchs in DBpedia . . .
c) and under the constraint that a gender is specified for the entity in the knowledge base, which must be female.
Accordingly, the query in Listing 18 selects all constitution editions, articles, and sections that contain a semantic annotation with a female Dutch monarch (line 16). The results of the query are shown in Table 6. For simplicity, only a portion of the results is shown here. This means that the annotations make it possible to aggregate the previously existing content in a way that exploits the organization of knowledge in an external database.
Of course, in this exemplary use case, only the constitution of the Netherlands
has been annotated and analyzed, and it could also simply be looked up whether
a king or a queen reigned in a specific time period.
Figure 21: Part of the HTML page about Queen Beatrix in DBpedia
Listing 18: Federated Query for all constitution editions, articles, and sections that contain
a semantic annotation with a female Dutch monarch
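A rough sketch of such a federated query (not the verbatim Listing 18) could
look as follows, assuming that the local annotations follow the NIF 2.0
conventions, i.e. itsrdf:taIdentRef links an annotated phrase to its DBpedia
entity and nif:referenceContext links the phrase to the section text it occurs
in, and that DBpedia records the gender of a person as a foaf:gender literal:

PREFIX nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX itsrdf: <http://www.w3.org/2005/11/its/rdf#>
PREFIX dct:    <http://purl.org/dc/terms/>
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX dbc:    <http://dbpedia.org/resource/Category:>

SELECT DISTINCT ?section ?entity WHERE {
  # local NIF2 data: annotated phrases and the section text they occur in
  ?phrase itsrdf:taIdentRef    ?entity ;
          nif:referenceContext ?section .
  # remote constraints, evaluated at the DBpedia endpoint
  SERVICE <http://dbpedia.org/sparql> {
    ?entity dct:subject dbc:Dutch_monarchs ;
            foaf:gender "female"@en .
  }
}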
One of the
benefits of annotating the content in the described way is the easy adaptability
in case more content is added to the corpus. It is easy to imagine that future
analyses of constitution texts will not be limited to one country at a time;
especially the comparison between countries will be of significant value. Using
the category dbc:Dutch_monarchs helps to do that. The category does not only
connect monarchs of the Netherlands with each other, but it is also connected
to more general categories, e.g. dbc:European_monarchs via the property
skos:broader. This connection makes it possible to place Queen Beatrix in the
context of other monarchs throughout Europe. The example in Figure 22
visualizes how this works.
In the figure, the elements existing in the corpus are marked by blue frames,
the elements from DBpedia which are used to extend the existing knowledge are
marked in red, and the entities connecting both are marked by green frames. It
shows that in case the corpus of constitution texts was extended by the Swedish
constitution, the category hierarchy allows the query to be easily adapted to
match not only Dutch but also Swedish monarchs at the same time.
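Under the same assumptions as in the sketch above, the entity constraint can be
generalized with a SPARQL 1.1 property path, so that every monarch whose
category leads to dbc:European_monarchs via zero or more skos:broader steps is
matched (assuming the Swedish category is linked to dbc:European_monarchs
analogously to the Dutch one):

PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>

# matches members of dbc:Dutch_monarchs, dbc:Swedish_monarchs, etc., since
# their categories lead to dbc:European_monarchs via skos:broader
SELECT DISTINCT ?monarch WHERE {
  ?monarch dct:subject/skos:broader* dbc:European_monarchs .
}

In practice, the unbounded path may be expensive on a public endpoint and could
be replaced by a fixed number of skos:broader steps.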
The RDFa enrichment created with refer in the corpus makes it possible to
visualize additional information about annotated entities directly within the
context of the document. When the annotated text is published within WordPress,
the annotations are immediately present in the document’s HTML code. Each
annotated entity is indicated by thin, semi-transparent, colored lines. The
colors indicate whether the entity is of type Person (green), Location (blue),
Event (yellow) or Thing (purple). On mouseover, a so-called infobox, as shown
in Figure 23, is displayed below the annotated text fragment. It contains basic
information about the entity derived from DBpedia, e.g. a thumbnail and
additional data from the entity’s RDF graph put in a table layout. The visual
design and content of infoboxes varies per category and allows the user to
gather basic facts about an entity as well as relations to other entities
(Tietz et al., 2016).
Sociologists, when exploring a document corpus of interest (which has previously
been annotated), can make use of these infobox visualizations to learn more
about the data in front of them without having to leave the original context of the
text. This can support a better understanding of the text, for instance if a certain
term is unknown to them or, as shown in Figure 23, they want to learn about the
temporal roles of entities. In the example, the user can learn that in the constitution
edition of 1983, Ruud Lubbers was the Prime Minister of the Netherlands.
Of course, this visualization is only a preliminary version of what sociologists
could benefit from in the future, because visualizing the document corpus for
sociological research in WordPress, as it is done with refer, may not be
sufficient in some cases. Nevertheless, RDFa is versatile and could be embedded
in any HTML document. Furthermore, the current system integrates solely content
from DBpedia into the infoboxes. Whether this or another knowledge base should
be used for content enrichment always depends on the use case and the document
corpus.
In this section, the meaning and potential of semantic text annotations have
been evaluated for text analysis in sociology on the foundation of the use case
introduced in section 4.1.1. The contributions of this section include 1,175
semantic annotations of three constitution documents, a Python script to convert
a text document containing RDFa into NIF2, SPARQL queries and scenarios to
exploit the generated data, as well as an in-depth discussion of the results.
The generated annotations may be re-used in the form of RDFa, useful for HTML
pages, or NIF2, useful for querying and further adaptation. If the annotations
have been created thoroughly, they can furthermore function as a gold standard
for computer scientists to improve and test NEL systems.
In this section, the limitations of the presented approach and future work will
be discussed, with an emphasis on automating the annotation process, the
challenges of temporal role annotations in particular, the insufficiencies of
existing knowledge bases, and interactive user interfaces.
As mentioned above, thoroughly created annotations may also help to improve NEL
systems in the future. The difficulties in annotating the documents within a
reasonable amount of time resulted in relatively few annotations. Even though
the provided 1,175 annotations have proven sufficient to perform queries, obtain
exemplary results and demonstrate the possibilities of these technologies for
sociology, it was not possible to perform a representative study on their basis,
which is considered a shortcoming of this thesis.
During the annotation process of the constitution corpus, a few challenges
occurred that are worth a brief note. For instance, one annotation criterion has
been the acknowledgement of the entities’ temporal roles. In the 2016
constitution edition of the Netherlands, the second chapter states that the
oldest child is next in the line of succession to the king. According to the
temporal role criterion, the oldest child should, to be precise, be annotated
with the actual person entity (in this case the Princess of Orange,
Catharina-Amalia). However, in case of a tragic sudden death of the princess,
the oldest child would be her younger sibling, and the annotation would thus
become incorrect. This challenges the validity of temporal role disambiguation
for future events.
Another challenge in the annotation process was the imprecise definition of some
named entities, for example the term Königswürde (literally translated: royal
dignity). In German, this term is quite ambiguous; it has no precise definition,
nor does it denote anything tangible. In English versions of the Dutch
constitution, the term has been rendered simply as throne, title of the throne
and so on, but no consistent term was used. Therefore, annotating it with a
specific entity from DBpedia was not possible.
In other cases, it is assumed that knowledge in the domain of politics and law
was needed to provide the correct annotations. The terms in question include (in
German) Gesetzesvorlage, eine Vorlage, etwas vorlegen. All three terms appeared
in the text and could refer to a draft bill, but even taking the given context
into account, this was not always clear.
Another challenge during the annotation process was that not all concepts could
be covered by the knowledge base. These insufficiencies will be discussed in the
following.
For this case, the NIL entity has been created and added to the refer annotator,
which has proven to be quite efficient. Table 7 in Appendix A provides a list of
all 242 occurrences. The NIL entity made it possible to measure the feasibility
of DBpedia as a knowledge base for the given use case. As already described in
section 4.3.3, about 20% of all annotations used NIL entities. The NIL
annotations furthermore made it possible to define a clear limit for all
automated NEL regarding this use case: if an entity is not present in the
knowledge base, the system will never detect it, and therefore the recall can
never reach 100 percent. For this corpus, with about 20% of all annotations
pointing to NIL, the recall of any NEL system restricted to DBpedia is bounded
at roughly 80 percent.
In this use case, the text corpus dealt with constitutions, country-specific
information and facts about state leaders. These topics are generally well
represented in DBpedia, and the way this information is structured does not
leave much room for discussion. However, the texts analyzed by many sociologists
are not always like this. Often, text corpora are analyzed that deal with
human-to-human communication (e.g. conversations), and the topics to investigate
are diverse. One example is the investigation of family structures and the role
of women in the family. For cases like this, it has to be assumed that no
currently available knowledge base represents this information sufficiently.
Next to the problem of availability, another issue is how the knowledge is
formally defined. A sociologist may have a completely different understanding of
the formal definition of the concept family than a psychologist or a political
scientist. The facts that represent the concept family and how they are related
may be prioritized in completely different ways in various domains. That means
that solely relying on third-party knowledge bases from other domains may bring
different concept definitions into the text analysis than anticipated by the
sociologist.
One way to overcome this challenge is to start creating own knowledge bases, or
even single concepts, from the sociological perspective. The methods to achieve
this have been described in this thesis. Creating a knowledge base, similarly to
creating concepts for coding text in sociology, could in fact become part of the
research process itself, because even though a general sociological perspective
on a concept may exist, each researcher may add their own ideas. Concepts could
also be modeled according to different schools of thought, an interesting
possibility for future work.
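As a minimal sketch of what such a first modeling step could look like (using a
SPARQL Update request against a local triple store; the namespace ex: and all
class and property names are purely hypothetical), a researcher could define an
own family concept as follows:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX ex:   <http://example.org/sociology/>

INSERT DATA {
  # a first, deliberately simple version of an own concept of family
  ex:Family a owl:Class ;
      rdfs:label   "Family"@en ;
      rdfs:comment "Family as understood in this research project."@en .
  ex:FamilyMember a owl:Class .
  ex:hasMember a owl:ObjectProperty ;
      rdfs:domain ex:Family ;
      rdfs:range  ex:FamilyMember .
}

Such definitions can then be refined, documented and published alongside the
study, which directly supports the transparency and re-usability discussed next.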
The benefits of modeling own concepts in the form of formal and structured
ontologies have been widely discussed in this thesis and include first and
foremost research transparency and the re-usability of the individual research
process. Furthermore, this gives sociologists the chance to become a part of the
Semantic Web and its future and to share their domain knowledge with the
community.
Sections 4.2 and 4.3 have introduced means for sociologists to query the
generated RDF data and to explore its content. However, issuing a SPARQL query
for each of these exploration tasks is not really practical in the long term.
One reason is the requirement to learn SPARQL. Even though it can be assumed
that once the general idea of RDF has been understood, the query language will
not be a major issue, it prevents the researcher from focusing merely on the
document and its exploration. A solution enabling easier content exploration is
the development of interactive user interfaces in which the discussed SPARQL
queries merely run in the background. Another problem of using SPARQL (or any
other query language) for content exploration is that researchers will only find
exactly what they are looking for. However, often the most interesting
relationships and contexts are discovered by means of serendipity, which can be
enabled by intelligent interactive user interfaces. Serendipity refers to the
process of finding valuable information or facts which have not been sought for.
This phenomenon is well known and widely utilized in information retrieval
(Foster and Ford, 2003; Waitelonis and Sack, 2012).
Visualizations and user interfaces enabling the exploration of textual documents
are part of current research. Especially approaches using Semantic Web and
Linked Data technologies have delivered promising results. Rahman and Finin
(2018) developed an unsupervised method based on deep learning to explore large
structured documents like business reports or requests for proposals. They
created an ontology to capture not only the general-purpose semantic structure
but also domain-specific semantic concepts. While the method seems promising, it
is (so far) missing an exploration feature for non-technical users. It is also
unknown whether the method generalizes to other domains, e.g. constitution
documents. Latif, Liu, and Beck (2018) follow a completely different approach.
The authors developed a framework that takes text containing markup, a related
dataset, and a configuration file as inputs and produces an interactive
document. The result makes it possible to retrieve further details, visual
highlighting, and text comparisons. However, the framework does not make it
possible to create aggregations over a large set of documents to obtain
information on their overall structure.
However, while many solutions exist for this purpose, no out-of-the-box,
one-for-all solution is available at the moment. Another way to obtain useful
interactive interfaces for exploring the discussed documents is to use simple
libraries to create new interfaces almost from scratch. Data Driven Documents
(D3) is a JavaScript library for manipulating documents based on data. The
library is well documented, and there are numerous examples on the Web which
can be reused and further modified. Creating an application to interactively
explore the annotated corpus would be a promising interdisciplinary research
project for future work.
This thesis closes with a call to action for sociologists, which emphasizes how
sociologists can contribute to the efforts of the Semantic Web and Linked Data
community and benefit from its current achievements and future possibilities.
A take-away message of this thesis for sociologists is that the Semantic Web is
a community effort. Researchers (e.g. sociologists in the field of
computer-assisted text analysis) who want to benefit from the possibilities,
principles, standards and technologies this community offers have to engage in
this effort, as has also been emphasized by Halford, Pope, and Weal (2013). This
thesis has shown that many of the current knowledge bases, interfaces, and
analysis tools are not yet mature enough for a sophisticated textual analysis in
sociology from start to finish on any text in any language. However, the domain
knowledge sociologists can bring into the Semantic Web is immense. It can be
assumed that interdisciplinary efforts which include sociologists more closely
can result in a significant improvement of these insufficiencies. That is
because only sociologists know which concepts in knowledge bases are needed to
cover the important aspects of text analysis, and only they know the
requirements an interactive user interface made for sociological research has to
meet, so that they can explore textual content and even find things they have
not been looking for. One result of this work is the highlighting of various
topics, tools and principles that offer sociologists appropriate opportunities
for their own contributions to the Semantic Web. The Semantic Web has proven to
be highly valuable for the life sciences and medicine, and it is beginning to be
more and more incorporated in the digital humanities. Hopefully, sociologists
will engage in this ever-growing community in the future as well.
APPENDIX A
Listing 19: RDF Turtle depiction of the RDF graph snippet visualized in Figure 13
@prefix s: <https://github.com/tabeatietz/semsoc/> .
@prefix co: <http://www.constituteproject.org/ontology/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
s:constitution_24_n_2016-constitution a co:Constitution ;
co:hasConstName "Verfassung des Königreiches der Niederlande" ;
co:isConstitutionOf co:Netherlands_the ;
s:edition "2016-11-04"^^xsd:date ;
co:isCreatedIn "2016" .
s:constitution_24_n_2016-constitution_t2 a co:Section, s:Chapter ;
co:isSectionOf s:constitution_24_n_2016-constitution ;
co:rowType co:title ;
co:text "Hauptstück 2 - Regierung" ;
co:header "2" ;
co:sectionID "131’" .
s:constitution_24_n_2016-constitution_t2_s2 a co:Section, s:Paragraph ;
co:isSectionOf s:constitution_24_n_2016-constitution ;
co:rowType co:title ;
co:parent s:constitution_24_n_2016-constitution_t2 ;
co:text "2. König und Minister" ;
co:header "2" ;
co:sectionID "235" .
s:constitution_24_n_2016-constitution_t2_s2_a1 a co:Section, s:Article ;
co:parent s:constitution_24_n_2016-constitution_t2_s2 ;
co:rowType co:title ;
co:isSectionOf s:constitution_24_n_2016-constitution ;
co:text "Art. 42." ;
co:header "42" ;
co:sectionID "236" .
s:constitution_24_n_2016-constitution_t2_s2_a1_s1_title a co:Section, s:Section ;
co:isSectionOf s:constitution_24_n_2016-constitution ;
co:parent s:constitution_24_n_2016-constitution_t2_s2_a1 ;
co:rowType co:title ;
co:header "1" ;
co:sectionID "237" .
s:constitution_24_n_2016-constitution_t2_s2_a1_s1 a co:Section, s:Section ;
co:isSectionOf s:constitution_24_n_2016-constitution ;
co:parent s:constitution_24_n_2016-constitution_t2_s2_a1_s1_title ;
co:rowType co:body ;
co:text "(1) Die Regierung besteht aus dem König und den Ministern." ;
co:sectionID "238" .
B I B L I O G R A P H Y
Agt-Rickauer, Henning, Jörg Waitelonis, Tabea Tietz, and Harald Sack (2016). “Data
Integration for the Media Value Chain.” In: 15th International Semantic Web
Conference (Posters and Demos). CEUR-WS.
Alexy, Robert (1999). “Grundrechte.” In: Enzyklopädie Philosophie 1, pp. 525–529.
Beckett, David (2014). RDF 1.1 N-Triples: A line-based syntax for an RDF graph. W3C
Recommendation. W3C, https://www.w3.org/TR/n-triples/.
Beckett, David, Tim Berners-Lee, Eric Prud’hommeaux, and Gavin Carothers (2014).
RDF 1.1 Turtle: Terse RDF Triple Language. W3C Recommendation. W3C, https:
//www.w3.org/TR/turtle/.
Berger, Peter and Thomas Luckmann (1966). The Social Construction of Reality: A
Treatise in the Sociology of Knowledge.
Berners-Lee, T., R. Fielding, and L. Masinter (2005). RFC3986: Uniform Resource
Identifier (URI): Generic Syntax. https://www.ietf.org/rfc/rfc3986.txt.
Berners-Lee, Tim, James Hendler, and Ora Lassila (2001). “The Semantic Web.” In:
Scientific American 284.5, pp. 34–43.
Berners-Lee, Tim, Wendy Hall, James A Hendler, Kieron O’Hara, Nigel Shadbolt,
Daniel J Weitzner, et al. (2006). “A Framework for Web Science.” In: Founda-
tions and Trends in Web Science 1.1, pp. 1–130.
Berners-Lee, Timothy J (1989). Information management: A Proposal. Tech. rep. CERN.
Bizer, Christian, Tom Heath, and Tim Berners-Lee (2011). “Linked Data: The Story
So Far.” In: Semantic Services, Interoperability and Web Applications: Emerging
Concepts. IGI Global, pp. 205–227.
Boli-Bennett, John (1979). “The Ideology of Expanding State Authority in National
Constitutions, 1870-1970.” In: National development and the world system, pp. 212–
237.
Bosch, Thomas and Benjamin Zapilko (2015). “Semantic Web Applications for the
Social Sciences.” In: IASSIST Quarterly 38.4, p. 7.
Brickley, Dan and R. V. Guha (2014). RDF Schema 1.1. W3C Recommendation. W3C,
https://www.w3.org/TR/rdf-schema/.
Büthe, Tim, Alan M Jacobs, Erik Bleich, Robert J Pekkanen, and Marc Trachtenberg
(2014). “Qualitative & Multi-Method Research.” In: Journal Scan: January 2015,
p. 63.
Capadisli, Sarven, Amy Guy, Ruben Verborgh, Christoph Lange, Sören Auer, and
Tim Berners-Lee (2017). “Decentralised Authoring, Annotations and Notifica-
tions for a Read-Write Web with dokieli.” In: International Conference on Web
Engineering. Springer, pp. 469–481.
Crawford, Katherine (2009). Perilous Performances: Gender and Regency in Early Mod-
ern France. Vol. 145. Harvard University Press.
Curty, Renata Gonçalves (2016). “Factors Influencing Research Data Reuse in the
Social Sciences: An Exploratory Study.” In: IJDC 11.1, pp. 96–117.
Daiber, Joachim, Max Jakob, Chris Hokamp, and Pablo N. Mendes (2013). “Improv-
ing Efficiency and Accuracy in Multilingual Entity Extraction.” In: Proceedings
of the 9th International Conference on Semantic Systems (I-Semantics), pp. 121–124.
Decker, Stefan, Sergey Melnik, Frank Van Harmelen, Dieter Fensel, Michel Klein,
Jeen Broekstra, Michael Erdmann, and Ian Horrocks (2000). “The Semantic
Web: The Roles of XML and RDF.” In: IEEE Internet computing 4.5, pp. 63–73.
Task Force on Metadata Description and Access (2000). Final Report. https://www.
libraries.psu.edu/tas/jca/ccda/tf-meta6.html, last visited: July 19, 2018.
Eldesouky, Bahaa, Menna Bakry, Heiko Maus, and Andreas Dengel (2016). “Seed,
an End-User Text Composition Tool for the Semantic Web.” In: International
Semantic Web Conference. Springer, pp. 218–233.
Elkins, Zachary, Tom Ginsburg, James Melton, Robert Shaffer, Juan F Sequeda,
and Daniel P Miranker (2014). “Constitute: The World’s Constitutions to Read,
Search, and Compare.” In: Web Semantics: Science, Services and Agents on the
World Wide Web 27, pp. 10–18.
Evans, James A and Pedro Aceves (2016). “Machine Translation: Mining Text for
Social Theory.” In: Annual Review of Sociology 42, pp. 21–50.
Faulkner, Steve, Arron Eicholz, Travis Leithead, Alex Danilo, and Sangwhan Moon
(2017). HTML 5.2. W3C Recommendation. W3C, https : / / www . w3 . org / TR /
html/.
Foster, Allen and Nigel Ford (2003). “Serendipity and Information Seeking: an
Empirical Study.” In: Journal of documentation 59.3, pp. 321–340.
Froschauer, Ulrike and Manfred Lueger (2003). Das qualitative Interview: Zur Praxis
interpretativer Analyse sozialer Systeme. Vol. 2418. UTB.
Go, Julian (2003). “A Globalizing Constitutionalism?: Views from the Postcolony,
1945-2000.” In: International Sociology 18.1, pp. 71–95.
Gruber, Thomas R. (1993). “A translation approach to portable ontology specifica-
tions.” In: Knowledge Acquisition 5, pp. 199–220.
Halford, Susan, Catherine Pope, and Mark Weal (2013). “Digital Futures? Sociologi-
cal Challenges and Opportunities in the Emergent Semantic Web.” In: Sociology
47.1, pp. 173–189.
Handschuh, Siegfried (2005). “Creating Ontology-based Metadata by Annotation
for the Semantic Web.” PhD thesis. Karlsruher Institut für Technologie.
Heaton, Janet (2004). Reworking Qualitative Data. Sage.
Heintz, Bettina and Annette Schnabel (2006). “Verfassungen als Spiegel globaler
Normen?” In: Kölner Zeitschrift für Soziologie und Sozialpsychologie 58, pp. 685–716.
Hellmann, Sebastian (2013). NIF 2.0 Core Ontology. Ontology Description. AKSW,
University Leipzig, http://persistence.uni-leipzig.org/nlp2rdf/ontologies/
nif-core/nif-core.html.
Hellmann, Sebastian, Jens Lehmann, Sören Auer, and Martin Brümmer (2013). “In-
tegrating NLP using Linked Data.” In: International Semantic Web Conference.
Springer, pp. 98–113.
Hepp, Martin (2008). “Goodrelations: An ontology for describing products and
services offers on the web.” In: International Conference on Knowledge Engineering
and Knowledge Management. Springer, pp. 329–346.
Herndon, Joel and Robert O’Reilly (2016). “Data Sharing Policies in Social Sciences
Academic Journals: Evolving Expectations of Data Sharing as a Form of Schol-
arly Communication.” In: Databrarianship: The Academic Data Librarian in Theory
and Practice.
Heyvaert, Pieter, Anastasia Dimou, Aron-Levi Herregodts, Ruben Verborgh, Dim-
itri Schuurman, Erik Mannens, and Rik Van de Walle (2016). “RMLEditor: A
Graph-based Mapping Editor for Linked Data Mappings.” In: International
Semantic Web Conference. Springer, pp. 709–723.
Hitzler, Pascal, Markus Kroetzsch, and Sebastian Rudolph (2009). Foundations of
Semantic Web Technologies. CRC press.
Horridge, Matthew, Holger Knublauch, Alan Rector, Robert Stevens, and Chris
Wroe (2004). “A Practical Guide To Building OWL Ontologies Using The Protégé-
OWL Plugin and CO-ODE Tools Edition 1.0.” In: University of Manchester.
Hyland, Bernadette, Ghislain Atemezing, and Boris Villazón-Terrazas (2014). Best
Practices for Publishing Linked Data. W3C Recommendation. W3C, https://www.
w3.org/TR/ld-bp/.
Khalili, Ali and Sören Auer (2013). “User interfaces for Semantic Authoring of
Textual Content: A Systematic Literature Review.” In: Web Semantics: Science,
Services and Agents on the World Wide Web 22, pp. 1–18.
Knoth, Alexander Henning (2016). “Staatliche Selbstbeschreibungen Analysieren.
Soziologische und computerlinguistische Ansätze der Dokumentenarbeit.” In:
Trajectoires. Travaux des jeunes chercheurs du CIERA Hors série n° 1.
Knoth, Alexander, Manfred Stede, and Erik Hägert (2018). “Dokumentenarbeit mit
hierarchisch strukturierten Texten: Eine historisch vergleichende Analyse von
Verfassungen.” In: Kritik der digitalen Vernunft. Abstracts zur Jahrestagung des
Verbandes Digital Humanities im deutschsprachigen Raum, 26.02.-02.03. 2018 an
der Universität zu Köln, veranstaltet vom Cologne Center for eHumanities (CCeH).
Ed. by Georg Vogeler. Universität zu Köln, pp. 196–203.
Kobilarov, Georgi, Tom Scott, Yves Raimond, Silver Oliver, Chris Sizemore, Michael
Smethurst, Christian Bizer, and Robert Lee (2009). “Media Meets Semantic
Web–How the BBC uses DBpedia and Linked Data to Make Connections.” In:
European Semantic Web Conference. Springer, pp. 723–737.
Koutraki, Maria, Farshad Bakhshandegan-Moghaddam, and Harald Sack (2018).
“Temporal Role Annotation for Named Entities.” In: Proceedings of the 14th Int.
Conference on Semantic Systems. (to be published).
Kripke, Saul A (1972). “Naming and necessity.” In: Semantics of natural language.
Springer, pp. 253–355.
Lagler, Wilfried (2000). Gott im Grundgesetz? Zur Bedeutung des Gottesbezugs in un-
serer Verfassung und zum christlichen Hintergrund der Grund-und Menschenrechte.
Lange, Christoph (2009). “Krextor–An Extensible XML→ RDF Extraction Frame-
work.” In: Workshop on Scripting and Development for the Semantic Web, co-located
with 6th European Semantic Web Conference 449, p. 38.
Latif, Shahid, Diao Liu, and Fabian Beck (2018). “Exploring Interactive Linking
Between Text and Visualization.” In: EUROVIS.
Lehmann, Jens, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas,
Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören
Auer, and Christian Bizer (2015). “DBpedia – A Large-scale, Multilingual
Knowledge Base Extracted from Wikipedia.” In: Semantic Web 6.2, pp. 167–195.
Prud’hommeaux, Eric and Carlos Buil-Aranda (2013). SPARQL 1.1 Federated Query.
W3C Recommendation. W3C, https://www.w3.org/TR/sparql11-federated-
query/.
Rahman, Muhammad Mahbubur and Tim Finin (2018). “Understanding and Rep-
resenting the Semantics of Large Structured Documents.” In: Workshop on Se-
mantic Deep Learning, co-located with the 17th International Semantic Web Confer-
ence.
Sanderson, Robert, Paolo Ciccarese, and Benjamin Young (2016). Web Annotation
Ontology. https://www.w3.org/ns/oa#, last visited: July 19, 2018.
Schandl, Thomas and Andreas Blumauer (2010). “PoolParty: SKOS Thesaurus Man-
agement Utilizing Linked Data.” In: Extended Semantic Web Conference. Springer,
pp. 421–425.
Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim (2014). “Adoption of
the Linked Data Best Practices in Different Topical Domains.” In: International
Semantic Web Conference. Springer, pp. 245–260.
Schreiber, Guus and Yves Raimond (2014). RDF 1.1 Primer. W3C Recommendation.
W3C, https://www.w3.org/TR/rdf11-primer/.
Shneiderman, Ben, Catherine Plaisant, Maxine S Cohen, Steven Jacobs, Niklas
Elmqvist, and Nicholas Diakopoulos (2016). Designing the User Interface: Strate-
gies for Effective Human-Computer Interaction. Pearson.
Silberman, Neil Asher (2012). The Oxford Companion to Archaeology. Vol. 1:
Ache–Hoho. Oxford University Press.
Stone, Philip J, Dexter C Dunphy, and Marshall S Smith (1966). “The General In-
quirer: A Computer Approach to Content Analysis.” In: MIT press.
Tietz, Tabea, Joscha Jäger, Jörg Waitelonis, and Harald Sack (2016). “Semantic An-
notation and Information Visualization for Blogposts with refer.” In: Workshop
on Visualization and Interaction for Ontologies and Linked Data, co-located with the
15th International Semantic Web Concernce, pp. 28–40.
Usbeck, Ricardo et al. (2015). “GERBIL: General Entity Annotator Benchmark-
ing Framework.” In: Proceedings of the 24th International Conference on World
Wide Web. International World Wide Web Conferences Steering Committee,
pp. 1133–1143.
Van Rijsbergen, Cornelis Joost (1979). “Information Retrieval.” In: Dept. of Computer
Science, University of Glasgow.
Verborgh, Ruben, Olaf Hartig, Ben De Meester, Gerald Haesendonck, Laurens De
Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Colpaert, Erik Mannens,
and Rik Van de Walle (2014). “Querying Datasets on the Web with High Avail-
ability.” In: International Semantic Web Conference. Springer, pp. 180–196.
Verborgh, Ruben, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Lau-
rens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert
(2016). “Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for
the Web.” In: Journal of Web Semantics 37–38, pp. 184–206.
Villazón-Terrazas, Boris, Luis M Vilches-Blázquez, Oscar Corcho, and Asunción
Gómez-Pérez (2011). “Methodological Guidelines for Publishing Government
Linked Data.” In: Linking Government Data. Springer, pp. 27–49.
Waitelonis, Jörg, Henrik Jürges, and Harald Sack (2016). “Don’t compare Apples to
Oranges: Extending GERBIL for a fine grained NEL evaluation.” In: Proceedings
of the 12th International Conference on Semantic Systems. ACM, pp. 65–72.
Waitelonis, Jörg and Harald Sack (2012). “Towards Exploratory Video Search using
Linked Data.” In: Multimedia Tools and Applications 59.2, pp. 645–672.
— (2016). “Named Entity Linking in #Tweets with KEA.” In: Proceedings of 6th
workshop on ’Making Sense of Microposts’, Named Entity Recognition and Linking
(NEEL) Challenge in conjunction with 25th International World Wide Web Confer-
ence. CEUR-WS.
Waitelonis, Jörg (2018). “Linked Data Supported Information Retrieval.” PhD the-
sis. Karlsruher Institut für Technologie (KIT). 256 pp. doi: 10.5445/IR/1000084458.
Welty, Christopher and Deborah McGuinness (2004). OWL Web Ontology Language
Guide. W3C Recommendation. W3C, http://www.w3.org/TR/2004/REC- owl-
guide-20040210/.
Wilde, Erik and Martin Dürst (2008). RFC5147: URI Fragment Identifiers for the text/-
plain Media Type. https://www.ietf.org/rfc/rfc5147.txt.
Zenk-Möltgen, Wolfgang and Greta Lepthien (2014). “Data Sharing in Sociology
Journals.” In: Online Information Review 38.6, pp. 709–722.
D E C L A R A T I O N  O F  A U T H O R S H I P

I declare that I have written the submitted work, including any accompanying
drawings, sketch maps, illustrations etc., independently and that I have used no
aids other than those indicated. All passages that are taken verbatim or in
substance from other texts have in every case been clearly marked as borrowings
with a precise indication of the source. This also applies to data or text
fragments from the Internet. I am aware of the “Guideline for Safeguarding Good
Academic Practice for Students at the University of Potsdam (Plagiarism
Guideline) of October 20, 2010”, available online at
https://www.uni-potsdam.de/am-up/2011/ambek-2011-01-037-039.pdf.
Tabea Tietz