Semi-Automatic Annotation System For OWL-based Semantic Search
Abstract—Current keyword search engines such as Google and Yahoo return a large number of unsuitable results. One possible solution is to annotate textual web data with semantics so as to enable semantic search rather than keyword search. However, purely manual annotation is very time-consuming. Furthermore, a high-level concept such as a metaphor cannot be searched for if the annotation is done at a low abstraction level. We therefore present a semi-automatic annotation system consisting of an automatic annotator and a manual annotator. Against the web ontology language (OWL) terms defined with Protégé, the former annotates the textual web data using the Knuth-Morris-Pratt (KMP) algorithm, while the latter allows a user to use the terms to annotate metaphors at a high abstraction level. The resulting semantically-enhanced textual web document can be semantically processed by other web services, such as the information retrieval system and the recommendation system shown in our example.

Keywords- semi-automatic annotation system, semantic search, web ontology language (OWL)

I. INTRODUCTION

Current keyword search engines such as Google and Yahoo give inaccurate results because two keywords may share the same string but carry different semantics. For instance, suppose a user wants to find news about a blaze and searches Google for “祝融” (God of fire, which stands for a blaze in Chinese). The user would find much information about the God of fire as a character, but not about the blaze.

An ontology is a knowledge description technology that can describe the semantics of textual web data such as strings and articles. The web ontology language (OWL) [2] is a popular language for describing ontologies. An OWL term annotated by a user represents a concept of the world and is used to improve search accuracy. However, annotation seems difficult for most users. Furthermore, as different users may annotate the same data with different terms, some management scheme is needed. We thus propose a semi-automatic annotation system, which assists users in annotating textual web data and manages the terms defined by users. Three issues need to be solved:

1. Current information retrieval (IR) is keyword search, not semantic search, and therefore gives inaccurate results. We believe that search accuracy must be improved by annotating textual web data with semantics.

2. Because many people do not understand the high-abstraction concepts of a domain, they simply annotate with some low-abstraction keyword (a string in the textual web data) of the domain ontology. For example, suppose the title of a news item is “祝融” (God of fire). Educated people know that this is a metaphor (a high-abstraction concept) for “火災” (blaze), so they annotate the news with “火災” (blaze). Ordinary people, on the other hand, may just annotate the news with the string “祝融” (God of fire), which makes semantic search essentially equivalent to the old keyword (string) search.

3. Most textual web data are updated very frequently. Purely manual annotation of them is extremely time-consuming.

We address the three issues as follows:

1. We propose a semantically-enhanced textual web document that is annotated with semantic terms, so that an IR service can give more accurate results than it otherwise would.

2. We allow various users to share terms in annotation, which enhances the abstraction level. The users may be experts or general users, and the terms are shared among all of them. The terms include both low-level and high-level abstraction information, which is helpful for semantic search. For example, an expert can annotate a poem containing a blaze with “祝融” (God of fire), a high-abstraction concept of blaze, while a general user can annotate it with “火” (fire), a low-abstraction concept.

3. The Knuth-Morris-Pratt (KMP) algorithm [4] used in the automatic annotation runs in linear time, O(n + m) for a text of length n and a term of length m, which saves a lot of time. This reduces the load of manual annotation.

This paper is organized as follows. Section II compares our approach with related work. Section III introduces the architecture of the system. Section IV describes an example. Finally, Section V draws conclusions.

III. A SEMI-AUTOMATIC ANNOTATION SYSTEM

This section presents the architecture of the semi-automatic annotation system and how it is implemented.

A. Architecture of a Semi-Automatic Annotation System

Many semantic annotation systems have been developed. According to the taxonomy described by Lawrence [10], they fall into two kinds: 1) pattern-based and 2) machine learning-based. In this paper, we use the pattern-based approach to build a semi-automatic annotation system. Its architecture is shown in Figure 1.
[...] recommend the relevant information according to the semantic information.

7. Textual Web Data: This is textual information on the Web, such as strings, news, messages, and articles.

B. Implementation of a Semi-Automatic Annotation System

In our architecture, the core of the ontology toolbox is the Protégé graphical tree structure tool, which is responsible for editing the domain ontology. In the ontology repository, we use SESAME to store domain ontologies and annotation terms in OWL files.

There are two annotators: 1) an automatic annotator and 2) a manual annotator. In the former, the KMP algorithm compares the strings in the textual web data against the terms in the domain ontology. Matched terms are automatically annotated and stored in SESAME as an OWL file. In the latter, a user explores the domain ontology through its hierarchical structure. He/she first sees the top-level terms of the domain ontology and can then select a term such as 人事 (people) to view its subclasses, such as 述懷 (memory) and 思考 (think). He/she can then select appropriate terms as annotation terms.

In semantic search, our system compares the string entered by the user with the annotated terms. If they match, the system returns the textual web data annotated with those terms. In addition, the terms are displayed as a tag cloud, in which the font size of each term is determined by the number of annotations that use it.

For the domain ontology, we use the Protégé graphical tree structure tool to build the poem ontology we defined, which lets us quickly revise the tree structure and properties directly (Fig. 2). After that, we transform the poem ontology into Java code, and a user can get the terms by declaring an object to access them. The poem ontology includes 1116 classes: the first level includes 40 classes such as “京都” (capital), “人事” (people) and “儒家” (Confucian), and the second level includes 1076 classes such as “留別” (stay and leave), “嘲戲” (ridicule) and “尋訪” (visiting). Each of the 1116 classes stands for a concept of the real world.

[...] by using the ClassifyFactory API provided by Protégé. Furthermore, the user can save terms as an OWL file. Figure 3 shows an OWL file, in which the poem is “出塞 (out of border) 王昌齡 (Wang T.-L.)” and the OWL terms are “述懷” (memory), “人事” (people), “長城” (great wall) and “戰場” (battlefield).

Figure 3 OWL file

IV. AN EXAMPLE

This section illustrates an example, an OWL-based poem semantic search system that we developed based on our architecture, as shown in Figure 4.
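The automatic annotator described in Section III.B, which also performs the first step of the example, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system's actual code: the function names build_failure_table, kmp_find, and auto_annotate are hypothetical, the term list is a small sample of the terms mentioned in this paper, and the sample text is the well-known text of 出塞 by 王昌齡, used only for illustration.

```python
def build_failure_table(pattern):
    """Return the KMP failure table.

    failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it.
    """
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


def auto_annotate(text, ontology_terms):
    """Return the ontology terms that occur in text, in the given term order."""
    return [term for term in ontology_terms if kmp_find(text, term)]


# Illustrative data only (an assumption of this sketch).
POEM = "秦時明月漢時關，萬里長征人未還。但使龍城飛將在，不教胡馬度陰山。"
TERMS = ["月", "長城", "人事", "戰場"]

print(auto_annotate(POEM, TERMS))
```

In this sketch only 月 (moon) is found by string matching. Terms such as 長城 (great wall) and 戰場 (battlefield) do not appear literally in the text, which is the kind of high-level abstraction information that the manual annotator is meant to supply.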
First, the system automatically annotates the poems using the KMP algorithm. Next, using the Teacher User Interface, a teacher selects the poem 出塞_王昌齡 (Wang T.-L, Out of Border) under “two words” in the poem classification at the top of Figure 5, and then selects 述懷 (memory), 人事 (people), 長城 (great wall) and 戰場 (battlefield) from the poem keyword list on the left side of Figure 5 as high-level abstraction information for 出塞_王昌齡 (Wang T.-L, Out of Border). After that, he/she presses the “send” button to store the poem and the high-level abstraction information (terms). The web services can then use these terms in the semantically-enhanced poem documents through the API of the SESAME RDF repository.

In Figure 6, the system finds that “長城” (great wall), “人事” (people) and “戰場” (battlefield) appear both in the keywords and in the ontology, and the recommendation system (RS) then recommends five poems, such as 王昌齡’s 出塞 (Wang T.-L, Out of Border).

After students get the five poems, they can also select a term that the system has annotated in a poem to search for related poems, for instance 月 (moon) in 出塞_王昌齡 (Wang T.-L, Out of Border), as shown in Figure 7. The system then returns another five poems, such as 宿建德江 (live in the J. D. River), according to 月 (moon) in 出塞_王昌齡 (Wang T.-L, Out of Border).
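The term matching used in this example (semantic search, recommendation by shared terms, and the tag cloud of Section III.B) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system's implementation: the in-memory dictionary ANNOTATIONS stands in for the SESAME repository, the function names are hypothetical, ranking by the number of shared terms is an assumed policy, and the linear font scaling between min_pt and max_pt is an assumed formula, since the paper only states that font size depends on the number of annotations.

```python
from collections import Counter

# Hypothetical stand-in for the annotation repository:
# document title -> set of annotation terms (taken from the example above).
ANNOTATIONS = {
    "出塞_王昌齡": {"述懷", "人事", "長城", "戰場", "月"},
    "宿建德江": {"月"},
}


def semantic_search(query, annotations):
    """Return the titles of documents annotated with a term equal to query."""
    return sorted(title for title, terms in annotations.items() if query in terms)


def recommend(keywords, annotations, limit=5):
    """Return up to `limit` titles sharing at least one term with keywords.

    Documents are ranked by the number of shared terms (an assumption of
    this sketch), with ties broken by title.
    """
    wanted = set(keywords)
    scored = [(len(terms & wanted), title) for title, terms in annotations.items()]
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in ranked[:limit]]


def tag_cloud_sizes(annotations, min_pt=10, max_pt=30):
    """Map each term to a font size that grows with its annotation count."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for term, n in counts.items():
        scale = (n - low) / span if span else 0.0  # 0.0 when all counts are equal
        sizes[term] = round(min_pt + scale * (max_pt - min_pt))
    return sizes


print(semantic_search("月", ANNOTATIONS))
print(recommend(["長城", "人事", "戰場"], ANNOTATIONS))
print(tag_cloud_sizes(ANNOTATIONS))
```

With this sample data, searching for 月 (moon) returns both poems, while the keywords 長城, 人事 and 戰場 lead only to 出塞_王昌齡, mirroring the two searches in the example.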
Figure 7(b) Related Term Search in English

V. CONCLUSIONS

We propose a semi-automatic annotation system, which assists users in annotating textual web data and manages the terms defined by users. Its advantages are:

1. Traditional information retrieval (IR) search is combined with semantic information through the semantically-enhanced textual web document. This gives more accurate results than the old keyword search does.

2. The annotation terms are saved in the Annotated Repository, which is shared among all users. This allows a user to manually annotate with terms for abstract concepts, which improves the search.

3. In automatic annotation, using the Knuth-Morris-Pratt (KMP) algorithm with linear time complexity saves a lot of time. Thus, the load of manual annotation is reduced.

ACKNOWLEDGMENT

The authors would like to thank the Industrial Technology Research Institute (ITRI) in Taiwan for its support under the project "Ontology-based database management technology for surveillance data" in 2008.

REFERENCES

AUTHORS PROFILE

Chih-Hao Liu received his Master's degree in Information Engineering from Chaoyang University of Technology. He is currently a PhD candidate at the National Central University in Taiwan. He joined the software engineering laboratory in 2005. He also participated in the SIM (Service-oriented Information Marketplace) project from 2005 to 2007. His current research interests focus on the Semantic Web and agents.

Shang-Chih Hung received his Master's degree in Control Engineering from the National Chiao-Tung University. He is currently with ISTC (Identification and Security Technology Center) of ITRI (Industrial Technology Research Institute) in Taiwan. ISTC focuses on developing next generation video surveillance technologies. His current research interests include data fusion and situation awareness.

Jhih-Liang Jain received his Master's degree in Information Engineering from the National Central University. He joined the software engineering laboratory in 2006. He also participated in the ITRI project “Ontology-based database management technology for surveillance data” in 2008.

Jason Jen Yen Chen is with the Department of Computer Science and Information Engineering at the National Central University in Taiwan. He earned international recognition by being ranked Top, Third, and Fifth Scholar in the world in the field of System and Software Engineering in 1995, 1996, and 1997, respectively. The ranking is based on cumulative publication in six leading journals in that field. His current research interests include agile methods and agent technology.