Lecture Notes in Computer Science 6703
Lecture Notes in Artificial Intelligence
Modern Approaches in Applied Intelligence
24th International Conference
on Industrial Engineering and Other Applications
of Applied Intelligent Systems, IEA/AIE 2011
Syracuse, NY, USA, June 28 – July 1, 2011
Proceedings, Part I
Volume Editors
Kishan G. Mehrotra
Chilukuri K. Mohan
Jae C. Oh
Pramod K. Varshney
Syracuse University, Department of Electrical Engineering and Computer Science
Syracuse, NY 13244-4100, USA
E-mail: {mehrotra, mohan, jcoh, varshney}@syr.edu
Moonis Ali
Texas State University San Marcos, Department of Computer Science
601 University Drive, San Marcos, TX 78666-4616, USA
E-mail: [email protected]
There has been a steady increase in demand for efficient and intelligent tech-
niques for solving complex real-world problems. The fields of artificial intelligence
and applied intelligence cover computational approaches and their applications
that are often inspired by biological systems. Applied intelligence technologies are
used to build machines that can solve real-world problems of significant complex-
ity. Technologies used in applied intelligence are thus applicable to many areas
including data mining, adaptive control, intelligent manufacturing, autonomous
agents, bio-informatics, reasoning, computer vision, decision support systems,
fuzzy logic, robotics, intelligent interfaces, Internet technology, machine learn-
ing, neural networks, evolutionary algorithms, heuristic search, intelligent design,
planning, and scheduling.
The International Society of Applied Intelligence (ISAI), through its annual
IEA/AIE conferences, provides a forum for international scientific and indus-
trial communities to interact with each other to develop and advance intelligent
systems that address such concerns.
The 24th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE-2011), held in Syracuse,
NY (USA), followed the IEA/AIE tradition of providing an international sci-
entific forum for researchers in the diverse field of applied intelligence. Invited
speakers and authors addressed problems we face and presented their solutions
by applying a broad spectrum of intelligent methodologies. Papers presented at
IEA/AIE-2011 covered theoretical approaches as well as applications of intelli-
gent systems in solving complex real-life problems.
We received 206 papers and selected the 92 best papers for inclusion in these
proceedings. Each paper was reviewed by at least three members of the Pro-
gram Committee. The papers in the proceedings cover a wide number of topics
including feature extraction, discretization, clustering, classification, diagnosis,
data refinement, neural networks, genetic algorithms, learning classifier systems,
Bayesian and probabilistic methods, image processing, robotics, navigation, op-
timization, scheduling, routing, game theory and agents, cognition, emotion, and
beliefs.
Special sessions included topics in the areas of incremental clustering and nov-
elty detection techniques and their applications to intelligent analysis of time
varying information, intelligent techniques for document processing, modeling
and support of cognitive and affective human processes, cognitive computing
facets in intelligent interaction, applications of intelligent systems,
nature-inspired optimization – foundations and application, chemoinformatic
and bioinformatic methods, algorithms and applications.
These proceedings, consisting of 92 chapters authored by participants of
IEA/AIE-2011, cover both the theory and applications of applied intelligent
systems. Together, these papers highlight new trends and frontiers of applied
intelligence and show how new research could lead to innovative applications of
considerable practical significance. We expect that these proceedings will provide
useful reference for future research.
The conference also invited three outstanding scholars to give plenary keynote speeches. They were Ajay K. Royyuru from the IBM Thomas J. Watson Research Center, Henry Kautz from the University of Rochester, and Linderman from the Air Force Research Laboratory.
We would like to thank Springer for their help in publishing the proceedings.
We would also like to thank the Program Committee and other reviewers for
their hard work in assuring the high quality of the proceedings. We would like
to thank organizers of special sessions for their efforts to make this conference
successful. We especially thank Syracuse University for their generous support
of the conference.
We thank our main sponsor, ISAI, as well as our cooperating organizations:
Association for the Advancement of Artificial Intelligence (AAAI), Association
for Computing Machinery (ACM/SIGART, SIGKDD), Austrian Association for
Artificial Intelligence (OeGAI), British Computer Society Specialist Group on
Artificial Intelligence (BCS SGAI), European Neural Network Society (ENNS),
International Neural Network Society (INNS), Japanese Society for Artificial In-
telligence (JSAI), Slovenian Artificial Intelligence Society (SLAIS), Spanish So-
ciety for Artificial Intelligence (AEPIA), Swiss Group for Artificial Intelligence
and Cognitive Science (SGAICO), Taiwanese Association for Artificial Intelli-
gence (TAAI), Taiwanese Association for Consumer Electronics (TACE), Texas
State University-San Marcos.
Finally, we cordially thank the organizers, invited speakers, and authors,
whose efforts were critical for the success of the conference and the publication
of these proceedings. Thanks are also due to many professionals who contributed
to making the conference successful.
Program Committee
Adam Jatowt, Japan
Ah-Hwee Tan, Singapore
Amruth Kumar, USA
Andres Bustillo, Spain
Anna Fensel, Austria
Antonio Bahamonde, Spain
Azizi Ab Aziz, The Netherlands
Bärbel Mertsching, Germany
Bin-Yih Liao, Taiwan
Bipin Indurkhya, India
Bohdan Macukow, Poland
Bora Kumova, Turkey
C.W. Chan, Hong Kong
Catholijn Jonker, The Netherlands
Cecile Bothorel, France
César García-Osorio, Spain
Changshui Zhang, Canada
Chien-Chung Chan, USA
Chih-Cheng Hung, USA
Chilukuri K. Mohan, USA
Chiung-Yao Fang, Taiwan
Chunsheng Yang, Canada
Chun-Yen Chang, Taiwan
Colin Fyfe, UK
Coral Del Val-Muñoz, Spain
Dan Halperin, Israel
Dan Tamir, USA
Daniela Godoy, Argentina
Dariusz Krol, Poland
David Aha, USA
Djamel Sadok, Brazil
Domingo Ortiz-Boyer, Spain
Don Potter, USA
Don-Lin Yang, Taiwan
Duco Ferro, The Netherlands
Emilia Barakova, The Netherlands
Enrique Frias-Martinez, Spain
Enrique Herrera-Viedma, Spain
Erik Blasch, USA
Fevzi Belli, Germany
Floriana Esposito, Italy
Fran Campa Gómez, Spain
Francois Jacquenet, France
Fred Freitas, Brazil
Gerard Dreyfus, France
Geun-Sik Jo, South Korea
Gonzalo Aranda-Corral, Spain
Gonzalo Cerruela-García, Spain
Greg Lee, Taiwan
Gregorio Sainz-Palmero, Spain
Additional Reviewers
Chein-I Chang, USA
Chun-Nan Hsu, Taiwan
John Henry, USA
Jozsef Vancza, Hungary
Michelle Hienkelwinder, USA
Table of Contents – Part I
Section 3: Methodologies
Basic Object Oriented Genetic Programming  59
Tony White, Jinfei Fan, and Franz Oppacher
Classification Model for Data Streams Based on Similarity

Dayrelis Mena Torres¹, Jesús Aguilar Ruiz², and Yanet Rodríguez Sarabia³

¹ University of Pinar del Río "Hermanos Saíz Montes de Oca", Cuba
[email protected]
² University "Pablo de Olavide", Spain
[email protected]
³ Central University of Las Villas "Marta Abreu", Cuba
[email protected]
Abstract. Mining data streams is a field of study that poses new challenges. This research studies the application of different techniques for the classification of data streams and carries out a comparative analysis against a similarity-based proposal, introducing a new way of managing representative data models and policies of insertion and removal, and advancing the design of appropriate estimators to improve classification performance and model updating.
Keywords: classification, data streams, similarity.
1 Introduction
Mining data streams has attracted the attention of the scientific community
in recent years with the development of new algorithms for data processing
and classification. In this area it is necessary to extract the knowledge and the
structure gathered in data, without stopping the flow of information. Examples
include sensor networks, the measure of electricity consumption, the call log
in mobile networks, the organization and rankings of e-mails, and records in
telecommunications and financial or commercial transactions.
Incremental learning techniques have been used extensively for these problems. A learning task is incremental if the training set is not completely available at the beginning but is generated over time; the examples are processed sequentially, forcing the system to learn in successive episodes.
Some of the main techniques used in incremental learning algorithms for solving classification problems are listed below:
- Artificial neural networks (Fuzzy-UCS [1], Learn++ [2])
- Symbolic learning (rules) (FACIL [3], FLORA [4], STAGGER [5], LAIR [6])
- Decision trees (ITI [7], VFDT [8], NIP [9], IADEM [10])
- Instance-based learning (IBL-DS [11], TWF [12], LWF [12], Sliding Windows [13], DROP3 [12], ICF [14])
Instance-based learning (IBL) works primarily through the maintenance of the
representative examples of each class and has three main features [15]: the
K.G. Mehrotra et al. (Eds.): IEA/AIE 2011, Part I, LNAI 6703, pp. 1–9, 2011.
c Springer-Verlag Berlin Heidelberg 2011
similarity function, the selection of the prototype instances, and the classification function. The main difficulty of this technique is to combine these three functions optimally.
This paper introduces a new algorithm for classification in data streams based on similarity, called Dynamic Model Partial Memory-based (DMPM), where a representative set of the current information is maintained.
The paper is structured as follows: Section 2 presents a description of the proposal and its formalization; Section 4 presents a comparative analysis, introducing other relevant techniques, then the parameters defined for the experiments, and finally the results; Section 5 presents conclusions and some future work.
Algorithm 1. UpdateClassifier
input : ei = (xi, c): instance; d: distance function; n: maximum number of instances in g
output: B: instance base

Update d(ei)
if c ∉ B then
    CreateNewClassGroup(c)
    Update B
    UpdateEntropy
else
    g(m, ci) ← min d(ei, ci)
    if ci ≠ c then
        CreateNewInstanceGroup(ei)
        Update B
        UpdateEntropy
        EditGroup
    else
        if instanceCount(g) ≤ n then
            Add(ei, g)
        else
            EditGroup
When an instance (ei) arrives whose class (c) is not in the set, a new group for this class is created, and it is inserted into B in an organized manner.
If the class already exists, the proximity of the new instance to groups of other classes is checked. If it is closer to another class, a new group is created containing the instance, the mean instance (initially the only instance), and the age of the group (initially 1). Otherwise, if the new instance is closer to groups of its own class, it is added to the group in question.
It may happen that the group in question has reached the maximum permitted number of instances (n), because the whole set has a limited number of examples. In this case the procedure EditGroup is called, which is responsible for adding the new instance and deciding which of the remaining instances will be removed, following an inclusion-deletion policy aided by different estimators. The process of editing the groups is illustrated in Algorithm 2.
The strategy of insertion and deletion of instances in a group (EditGroup) is carried out mainly using two estimators: the mean instance or center of the group (m) and a parameter defined as a threshold of gradual change (b), which is updated with each insertion into and removal from the buffer (instance base). This threshold is defined for each stored group (g) and is obtained by calculating the distance from the center of the group to its farthest neighbor.
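As an illustration of this insertion-deletion policy, the following sketch (assuming Euclidean distance and hypothetical helper names; it is not the authors' implementation) computes the group centre m and the gradual-change threshold b and applies the rule of Algorithm 2:

import numpy as np

def edit_group(group, x_new):
    """Minimal sketch of the EditGroup policy (assumed Euclidean distance).

    `group` is a list of instance vectors belonging to one class group.
    Returns the updated group after inserting x_new and removing one instance.
    """
    X = np.asarray(group)
    m = X.mean(axis=0)                          # mean instance (centre of the group)
    dists = np.linalg.norm(X - m, axis=1)       # distances of members to the centre
    nearest = int(np.argmin(dists))             # nearest neighbor of the centre
    farthest = int(np.argmax(dists))            # farthest neighbor of the centre
    b = dists[farthest]                         # threshold of gradual change

    if np.linalg.norm(np.asarray(x_new) - m) < b:
        # new instance falls inside the current spread: drop the most redundant member
        removed = nearest
    else:
        # new instance extends the spread: drop the outlier instead
        removed = farthest

    updated = [x for i, x in enumerate(group) if i != removed]
    updated.append(np.asarray(x_new))
    return updated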
The procedure UpdateEntropy is called every time there is a substantial change in the instance base. The actions defined as substantial changes are the creation of new groups, the removal of groups, and the elimination of instances in
Algorithm 2. EditGroup
input : ei = (xi, c): instance; d: distance function; g(m, ci): instance group
output: B: instance base

nearestNeighbor ← NearestNeighbor(g, m)
farthestNeighbor ← FarthestNeighbor(g, m)
b ← d(m, farthestNeighbor)
if d(m, xi) < b then
    Delete nearestNeighbor
    Add ei; Update B
else
    Delete farthestNeighbor
    Add ei; Update B
UpdateEntropy
a group when the instance removed is the farthest neighbor of the center of the group. In these cases, the entropy is calculated over all classes, taking into account the number of instances of each class, and the number of examples stored for each class is adjusted, ensuring a proper balance among the stored examples of each class.
This adjustment might lead to increasing the number of instances to be stored,
or to decreasing it by eliminating groups whose contribution to classification of
new instances has been poor.
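A minimal sketch of this kind of entropy computation over the stored instances (assuming the standard Shannon entropy of the class distribution; the names are illustrative, not taken from the paper):

import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy of the class distribution of the stored instances."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: a balanced buffer has higher entropy than an unbalanced one.
print(class_entropy(["a"] * 50 + ["b"] * 50))   # 1.0
print(class_entropy(["a"] * 90 + ["b"] * 10))   # ~0.47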
4 Experimentation
In this section we analyze the algorithms that have been selected for the
experimentation.
IBL-DS [11]: This algorithm is able to adapt to concept changes and also shows high accuracy for data streams that do not have this feature. However, these two behaviors depend on the size of the case base: if the concept is stable, the classification accuracy increases with the size of the case base; on the other hand, a large case base turns out to be unfavorable when concept changes occur. It establishes a replacement policy based on:
Temporal Relevance: Recent observations are considered more useful than other,
less recent ones.
Space Relevance: A new example in a region of space occupied by other examples
of instances is less relevant than an example of a sparsely occupied region.
Consistency: An example should be removed if it is inconsistent with the current
concept.
As the distance function, SVDM is used, a simplified version of the VDM distance measure. IBL-DS implements two strategies that are used in combination to successfully face gradual or abrupt concept changes. It is relatively robust and produces good results when using the default parameters.
LWF (Locally Weighted Forgetting) [12]: One of the best adaptive learning algorithms. It is a technique that reduces the weights of the k nearest neighbors (in increasing order of distance) of a new instance. An example is completely eliminated if its weight falls below a threshold. To control the size of the case base, the parameter k is adapted according to that size. As an obvious alternative to LWF, the TWF (Time Weighted Forgetting) algorithm, which determines the weight of the cases according to their age, was also considered.
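For illustration, the age-based weighting of TWF (with the value w = 0.996 used later in Section 4.1) can be sketched as follows; this is a simplified, assumed rendering rather than the reference implementation, and LWF would instead decay the weights of the k nearest neighbours of each incoming instance:

def twf_weights(ages, w=0.996):
    """Time Weighted Forgetting: the weight of a case decays exponentially with its age."""
    return [w ** age for age in ages]

def prune(cases, ages, w=0.996, threshold=0.05):
    """Drop cases whose weight has fallen below a threshold."""
    return [(c, a) for c, a in zip(cases, ages) if w ** a >= threshold]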
In previous approaches, strategies for adapting the window size, if any, are mostly heuristic in nature. In the Sliding Windows algorithm [13], the authors propose to adapt the window size in such a way as to minimize the estimated generalization error of the learner trained on that window. To this end, they divide the data into batches and successively (that is, for k = 1, 2, ...) test windows consisting of the k most recent batches. On each of these windows, they train a classifier (in this case a support vector machine) and estimate its predictive accuracy (by approximating the leave-one-out error). The window/model combination that yields the highest accuracy is finally selected. In [17], this approach is further generalized by allowing the selection of arbitrary subsets of batches instead of only uninterrupted sequences. Despite the appealing idea of this approach to window (training set) adjustment, the successive testing of different window lengths is computationally expensive and therefore not immediately applicable in a data stream scenario with tight time constraints.
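A rough sketch of this batch-wise window selection is given below, using scikit-learn's SVC and a small cross-validation estimate as a stand-in for the leave-one-out approximation of [13]; the batch handling and function names are assumptions, not the original implementation:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_window(batches_X, batches_y):
    """Try windows made of the k most recent batches and keep the best classifier."""
    best_acc, best_model, best_k = -1.0, None, None
    for k in range(1, len(batches_X) + 1):
        X = np.vstack(batches_X[-k:])            # window = k most recent batches
        y = np.concatenate(batches_y[-k:])
        acc = cross_val_score(SVC(), X, y, cv=3).mean()   # proxy for leave-one-out error
        if acc > best_acc:
            best_acc, best_model, best_k = acc, SVC().fit(X, y), k
    return best_model, best_k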
4.1 Parameters
Experiments were performed with the following parameter settings:
IBL-DS, k = 1.
Single sliding windows of size 400.
LWF, β = 0.8.
TWF, w = 0.996
DMPM, k = 1, change in entropy = 0.98.
To simulate data streams, the generators Agrawal, Led24, RDG1 and RandomRBF included in Weka were used, and a total of 8 different data streams found in the literature and used in multiple experiments [18], [19], [11] were also incorporated. They are: Gauss, Sine, Distrib, Random, Stagger, Mixed, Hyperplane and Means. Each set has a maximum of 20,000 instances with 5% noise added, an initial training set of 100 examples is generated, and each algorithm uses a window of 400 instances. Table 1 shows the attribute information, classes and type of concept (presence of drift) for each of the test datasets.
4.2 Results
For the evaluation of the proposed model, 4 algorithms were used: IBL-DS [11],
TWF [12], LWF [12], and Win400 (variant of Sliding Windows with size 400)
[13].
6 D. Mena Torres, J. Aguilar Ruiz, and Y. Rodrı́guez Sarabia
The results show that the DMPM algorithm performs better on the Led24 data stream, which has nominal attributes and higher dimensionality.
Table 3 shows the standard deviation (σ) and mean (x̄) values obtained with the absolute accuracy measure.
The experiment shows that the average performance of the proposed model is good, with an average accuracy of 65% and a low standard deviation, giving the model the desired behavior.
The results obtained with the streaming precision measure are shown in Table 4. For this performance measure the DMPM algorithm shows the best results for the Led24 and Gauss data streams.
Table 5 shows the standard deviation (σ) and mean (x̄) values obtained with the streaming accuracy measure.
With this measure, the proposed model achieved superior results, showing a mean accuracy of 71% and a very low standard deviation, which reflects the accuracy of the algorithm in classifying the 100 most recently seen instances at any point during learning.
Fig. 1 contains the results of all performance measures and all the algorithms for the LED24 data stream, which has the highest dimensionality of nominal attributes and no concept change. The algorithm proposed in this paper shows the best results with this data set, since the model presented does not have a policy to handle concept drift.
It should be noted that the proposed algorithm does not handle abrupt concept change, while it is being compared to others that are prepared for this situation. Taking these criteria into account, the results obtained for the efficiency measures are comparable to those obtained with the other algorithms.
5 Conclusions
This paper presents a Dynamic Model Partial Memory-based (DMPM) algorithm for data stream classification based on similarity, introducing new proposals for managing data models and policies of insertion and removal.
The proposed model shows better results in data streams with high dimen-
sionality and no concept changes.
Experiments were conducted considering two performance measures, showing that the proposed model gives results comparable to other classifiers from the literature, with opportunities to improve both response time and precision.
References
1. Orriols-Puig, A., Casillas, J., Bernado, E.: Fuzzy-UCS: A Michigan-style Learning
Fuzzy-Classifier System for Supervised Learning. Transactions on Evolutionary
Computation, 1–23 (2008)
2. Polikar, R., Udpa, L., Udpa, S.S., Honavar, V.: LEARN ++: an Incremental Learn-
ing Algorithm For Multilayer Perceptron Networks. IEEE Transactions on System,
Man and Cybernetics (C), Special Issue on Knowledge Management, 3414–3417
(2000)
3. Ferrer, F.J., Aguilar, J.S., Riquelme, J.C.: Incremental Rule Learning and Border
Examples Selection from Numerical Data Streams. Journal of Universal Computer
Science, 1426–1439 (2005)
4. Widmer, G.: Combining Robustness and Flexibility in Learning Drifting Concepts.
Machine Learning, 1–11 (1994)
5. Schlimmer, J.C., Granger, R.H.: Incremental learning from noisy data. Machine
Learning 1(3), 317–354 (1986)
6. Watanabe, L., Elio, R.: Guiding Constructive Induction for Incremental Learning
from Examples. Knowledge Acquisition, 293–296 (1987)
7. Kolter, J.Z., Maloof, M.A.: Dynamic Weighted Majority: A New Ensemble Method
for Tracking Concept Drift. In: Proceedings of the Third International IEEE Con-
ference on Data Mining, pp. 123–130 (2003)
8. Domingos, P., Hulten, G.: Mining High-Speed Data Streams. In: Proceedings of
the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 71–80 (2000)
9. Jin, R., Agrawal, G.: Efficient Decision Tree Construction on Streaming Data. In:
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 571–576 (2003)
10. Ramos-Jiménez, G., del Campo-Ávila, J., Morales-Bueno, R.: IADEM-0: Un Nuevo Algoritmo Incremental, pp. 91–98 (2004)
11. Beringer, J., Hüllermeier, E.: Efficient Instance-Based Learning on Data Streams.
Intelligent Data Analysis, 1–43 (2007)
12. Salganicoff, M.: Tolerating Concept and Sampling Shift in Lazy Learning Using
Prediction Error Context Switching. Artificial Intelligence Review, 133–155 (1997)
13. Klinkenberg, R., Joachims, T.: Detecting Concept Drift with Support Vector Ma-
chines. In: Proceedings of the Seventeenth International Conference on Machine
Learning (ICML), pp. 487–494 (2000)
14. Mukherjee, K.: Application of the Gabriel Graph to Instance Based Learning Al-
gorithms. PhD thesis, Simon Fraser University (2004)
15. Aha, D.W., Kibler, D., Albert, M.K.: Instance-Based Learning Algorithms. Machine Learning 6, 37–66 (1991)
16. Wilson, D.R., Martinez, T.R.: Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research 6, 1–34 (1997)
17. Stanfill, C., Waltz, D.: Toward memory-based reasoning. Communications of the
ACM 29(12), 1213–1228 (1986)
18. Gama, J., Medas, P., Rocha, R.: Forest Trees for On-line Data. In: Proceedings of
the 2004 ACM Symposium on Applied Computing, pp. 632–636 (2004)
19. Gama, J., Rocha, R., Medas, P.: Accurate Decision Trees for Mining High-speed
Data Streams. In: Proc. SIGKDD, pp. 523–528 (2003)
Comparison of Artificial Neural Networks and
Dynamic Principal Component Analysis for
Fault Diagnosis
1 Introduction
Any abnormal event in an industrial process can represent a financial loss for the industry in terms of lost productivity, environmental or health damage, etc. The Fault Detection and Diagnosis (FDD) task is more complicated when the process has many sensors or actuators, as in chemical processes. Advanced
methods of FDD can be classified into two major groups [1]: process history-
based methods and model-based methods.
Most of the existing Fault Detection and Isolation (FDI ) approaches tested
on Heat Exchangers (HE ) are based on model-based methods. For instance,
fuzzy models based on clustering techniques are used to detect leaks in a HE
[2]. In [3], an adaptive observer is used to estimate model parameters, which are
used to detect faults in a HE. In [4], particle filters are proposed to predict the
fault probability distribution in a HE. These approaches require a process model with high accuracy; however, due to the permanent increase in the size and complexity of chemical processes, the modeling and analysis tasks for their monitoring
and control have become exceptionally difficult. For this reason, process history-based methods are gaining in importance.
In [5] an Artificial Neural Network (ANN) is proposed to model the performance of a HE, including fault detection and classification. In [6] a fast-training ANN based on a Bayes classifier is proposed; the ANN can be retrained online to detect and isolate incipient faults. On the other hand, in [7] and [8], different adaptive approaches based on Dynamic Principal Component Analysis (DPCA) have been proposed to detect faults while avoiding alarms due to normal variations. A comparative analysis between DPCA and Correspondence Analysis (CA) is presented in [9]; CA showed the best performance but needs greater computational effort. Another comparison between two FDI systems is shown in [10], where DPCA could not isolate sequential failures using the multivariate T² statistic while the ANN could.
This paper presents an extended and improved version of [10] in terms of the quantitative comparison between DPCA and ANN. Moreover, this work proposes the use of an individual control chart with higher sensitivity to multiple abnormal deviations, instead of the multivariate T² statistic used in [10]. The comparative analysis, based on the same experimental data obtained from an industrial HE, is divided into two major parts: (1) the detection stage, in terms of detection time and the probabilities of detection and false alarms, and (2) the diagnosis stage, based on the percentage of classification error.
The outline of this paper is as follows: in the next section, the DPCA formulation is presented. Section 3 describes the ANN design. Section 4 presents the experimental system. Sections 5 and 6 present the results and discussion, respectively. Conclusions are presented in Section 7.
2 DPCA Formulation
Process data at the normal operating point must be acquired and standardized. The variability of the process variables and their measurement scales are factors that influence the PCA performance [11]. In chemical processes, serial and cross-correlations among the variables are very common. To capture these correlations in the decomposition, the column space of the data matrix X must be augmented with past observations, generating a static context of dynamic relationships:
$X_D(t) = [X_1(t), X_1(t-1), \ldots, X_1(t-w), \ldots, X_n(t), X_n(t-1), \ldots, X_n(t-w)]$ . (1)
See the variable definitions in Table 1. The main objective of DPCA is to obtain a smaller set (r < n) of variables; these r variables must preserve most of the information given by the correlated data. The DPCA formulation is reviewed in detail in [12]. Once the scaled data matrix X̄ is projected onto a set of orthogonal vectors, called loading vectors (P), a new and smaller data matrix T is obtained. Matrix T can be back-transformed into the original data coordinate system as X* = TP.
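A minimal numpy sketch of these steps (lag augmentation per Eq. (1), standardization, extraction of the first r loading vectors via SVD, and back-transformation); this is an illustrative reading of the formulation, not the authors' code:

import numpy as np

def augment(X, w):
    """Build X_D by stacking each variable with its w past observations (Eq. 1)."""
    m, n = X.shape
    cols = [X[w - d : m - d, j] for j in range(n) for d in range(w + 1)]
    return np.column_stack(cols)

def dpca(X, w=2, r=3):
    XD = augment(X, w)
    Xbar = (XD - XD.mean(axis=0)) / XD.std(axis=0, ddof=1)   # standardize
    U, S, Vt = np.linalg.svd(Xbar, full_matrices=False)
    P = Vt[:r].T                  # loading vectors (first r principal directions)
    T = Xbar @ P                  # scores
    X_star = T @ P.T              # back-transformed data (written X* = TP in the text)
    return T, P, X_star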
Normal operating conditions can be characterized by the T² statistic [13] based on the first r Principal Components (PC). Using the multivariate Hotelling statistic it is possible to detect multiple faults; however, fault isolation is not achieved [10]. Therefore, the individual control chart of the T² statistic is
proposed in this work for diagnosing multiple faults. In this case, the variability
of all measurements is amplified by the PC, i.e. the statistic is more sensitive to
any abnormal deviation once the measurement vector is filtered by P . Using the
X ∗ data matrix, the Mahalanobis distance between a variable and its mean is:
$T_i^2(t) = (x_i^*(t) - \mu_i)\, s_i^{-1}\, (x_i^*(t) - \mu_i)$ . (2)

Since each $T_i^2$ statistic follows a t-distribution, the threshold is defined by:

$T^2(\alpha) = t_{m-1}\!\left(\frac{\alpha}{2n}\right)$ . (3)
A sensor or actuator fault is correctly detected and isolated when its T 2 statistic
overshoots the control limit. All variables are defined in Table 1.
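For illustration, one plausible reading of Eqs. (2)-(3) is sketched below: the individual statistic is taken as the squared standardized deviation of each back-transformed variable, and the control limit as the squared t-quantile at α/(2n). This is an assumption about the intended computation, not the authors' exact implementation; scipy's t-distribution is used for the quantile.

import numpy as np
from scipy.stats import t as student_t

def individual_t2(X_star, mu, s, alpha=0.01):
    """Individual control chart: per-variable T^2 (Eq. 2) and a t-based threshold (Eq. 3).

    mu, s: mean and standard deviation of each variable from the DPCA training data.
    """
    m, n = X_star.shape
    T2 = ((X_star - mu) / s) ** 2                               # squared standardized deviation
    limit = student_t.ppf(1 - alpha / (2 * n), df=m - 1) ** 2   # squared t quantile (assumed)
    alarms = T2 > limit                                         # True where a fault is flagged
    return T2, limit, alarms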
Table 1. Variable definitions

Variable  Description
X         Data matrix
XD        Augmented data matrix
X̄         Scaled data matrix
X*        Back-transformed data matrix
w         Number of time delays included for the n process variables
n         Number of process variables
m         Number of observations
r         Number of variables associated with the PC
T²        Hotelling statistic (multivariate or univariate)
T         Scores data matrix
P         Loading vectors (matrix of eigenvectors)
μi        Mean of variable i used in the DPCA training
si        Standard deviation of variable i used in the DPCA training
tm−1      t-distribution with m − 1 degrees of freedom
α         Significance value
3 ANN Formulation
An ANN is a computational model capable of learning the behavior patterns of a process; it can be used to model nonlinear, complex and unknown dynamic systems [15]. A Multilayer Perceptron (MLP) network, which corresponds to a feedforward system, has been proposed for sensor and actuator fault diagnosis. For this research, the ANN inputs correspond directly to the process measurements, and the ANN outputs generate a fault signature which must be codified into predefined operating states. The ANN training algorithm was backpropagation [14]. The codifier of the output indicates a fault occurrence; when the ANN output is zero, the sensor or actuator is free of faults; otherwise, it is faulty. A fault signature is used for the identification of individual or multiple faults. The trained network can be subsequently validated using unseen process data (fresh data). Cross-validation was used to validate the results.
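As an illustration of such an MLP fault classifier, the sketch below uses scikit-learn; the measurement features, fault codes and network size are placeholder assumptions, and the original work used a plain backpropagation MLP rather than this particular library.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# X: process measurements (e.g., TT1, TT2, FT1, FT2), one row per sample
# y: coded operating state (0 = fault free, 1..k = specific sensor/actuator fault)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                    # placeholder measurements
y = rng.integers(0, 4, size=500)                 # placeholder fault codes

mlp = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd", max_iter=2000)
scores = cross_val_score(mlp, X, y, cv=5)        # cross-validation, as in the paper
print("mean accuracy:", scores.mean())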
4 Experimental System
Heat Exchangers (HE) are widely used in industry for both cooling and heating large-scale industrial processes. An industrial shell-and-tube HE was the test bed for this research. The experimental setup has all the desirable characteristics for testing a fault detection scheme due to its industrial components. Fig. 1 shows a photo of the system, while the right picture displays a conceptual diagram of the main instrumentation: 2 temperature sensors (TT1, TT2), 2 flow sensors (FT1, FT2) and their control valves (FV1, FV2). A data acquisition system (NI USB-6215) connects the process to a computer.
[Figure: photo and schematic of the heat exchanger, with the steam inlet, water inlet and condensed water outlet and the instruments FV1, FT1, FSV1, TT1 and TT2.]
Fig. 1. Experimental system. An industrial HE is used as the test bed; water runs through the tubes whereas low-pressure steam flows inside the shell.
5 Results
[Figure: time traces of the multivariate T² statistic and its threshold (left) and of the individual TT1 and TT2 statistics with their thresholds (right), with annotations marking three sequential faults.]
Fig. 2. FDD analysis for sensor faults. Multiple faults cannot be isolated using the multivariate T² statistic (left), while the univariate statistics can (right). The univariate statistics are obtained by decomposing the multivariate statistic, which allows the fault sequence to be detected and isolated simultaneously.
ANN approach. Fig. 3 shows the real faulty condition and the ANN estimation for faults in actuators. Since 5 different operating points were designed for actuator faults, a codifier is used to translate a set of binary fault signatures into a specific operating case. The FDD analysis for sensor faults using the ANN can be reviewed in detail in [10]; the ANN estimation is binary (0 is considered free of faults and 1 a faulty condition), and the inlet temperature sensor (TT1) showed the greatest false alarm rate.
6 Discussion
The comparative analysis is quantified in terms of detection time, probability of
false alarms and percentage of classification error.
[Figure: real fault case (top) and ANN-estimated fault case (bottom) versus time (seconds), fault cases 0-4.]
Fig. 3. FDD analysis for actuator faults using ANN. Several times fault cases 1 and 3 are erroneously detected instead of the real faulty case.
[Figure: ROC curves of probability of detection (pd) versus probability of false alarms (pfa) for sensor and actuator faults, with the reference of the worst case in fault isolation; ANN (left) and DPCA (right).]
Fig. 4. ROC curve. The relation between right fault detection and false alarms is presented by the ROC curve. Both applications show similar probabilities for detecting sensor and actuator faults: DPCA (right plot) and ANN (left plot). However, the detection of faults in actuators is worse than the fault detection in sensors.
have been classified correctly, while the elements outside the diagonal are misclassified events. The last column contains 2 numbers: the right classification and the relative false alarm rate, based on the estimated number of faults in each faulty condition; whereas the last row represents the percentage of right classification in terms of the real number of faults in each faulty condition and the relative error, respectively. Finally, the last cell indicates the global classification performance: the total right classification and the total error in the isolation step.
Fig. 5 shows the isolation results for faults in actuators using DPCA (left) and ANN (right); both confusion matrices have the same number of samples. A fault in valve FV1 has the greatest isolation error (40.8%) using DPCA, while the fault-free case is erroneously classified several times (78+135) instead of a real faulty condition (false alarm rate of 37.7%). In the ANN approach, the normal operating condition has the lowest percentage of right classification (68.6%). In general, the ANN shows a lower total classification error (23.32%) than DPCA (27.17%). Similarly, for sensor faults, the ANN has a lower total fault classification error than DPCA: 4.34% versus 15.28%.
Fig. 5. Classification results. DPCA approach shows a greater value in the total error
of classification than ANN approach for faults in actuators.
Considering the real time necessary for fault detection, the detection time is greater in the DPCA approach even though these faults were abrupt, whereas the ANN shows instantaneous detection. Table 2 shows a summary of the FDD performance of both approaches.
Table 2. Comparison of DPCA and ANN approaches. ANN shows a better perfor-
mance in detection and isolation of faults.
7 Conclusions
A comparison between Dynamic Principal Component Analysis (DPCA) and
Artificial Neural Networks (ANN ) under the same experimental data generated
by an industrial heat exchanger is presented. Both approaches can be used to detect and isolate abnormal conditions in nonlinear processes without requiring a reliable model structure of the process; moreover, the FDD task can be achieved when the industrial process has many sensors/actuators. The ANN showed the best performance in the FDD task, with greater detection probability (1%), lower error of fault isolation (14.7% for actuator faults and 71.6% for sensor faults) and instantaneous detection.
The individual control chart generated from the principal components is proposed for fault diagnosis in DPCA; the method can isolate multiple faults due to its greater sensitivity to the data variability, in contrast to the multivariate T² statistic. For the ANN, there is no problem in the fault isolation task if all faulty conditions are known. New and unknown faults can be diagnosed by DPCA, while the ANN must be retrained including all faulty behaviors.
References
1. Venkatasubramanian, V., Rengaswamy, R., Kavuri, S., Yin, K.: A Review of Pro-
cess Fault Detection and Diagnosis Part I Quantitative Model-Based Methods.
Computers and Chemical Eng. 27, 293–311 (2003)
2. Habbi, H., Kinnaert, M., Zelmat, M.: A Complete Procedure for Leak Detection
and Diagnosis in a Complex Heat Exchanger using Data-Driven Fuzzy Models. ISA
Trans. 48, 354–361 (2008)
3. Astorga-Zaragoza, C.M., Alvarado-Martı́nez, V.M., Zavala-Rı́o, A., Méndez-
Ocaña, R., Guerrero-Ramı́rez, G.V.: Observer-based Monitoring of Heat Exchang-
ers. ISA Trans. 47, 15–24 (2008)
4. Morales-Menendez, R., Freitas, N.D., Poole, D.: State Estimation and Control of
Industrial Processes using Particles Filters. In: IFAC-ACC 2003, Denver Colorado
U.S.A, pp. 579–584 (2003)
5. Tan, C.K., Ward, J., Wilcox, S.J., Payne, R.: Artificial Neural Network Modelling of
the Thermal Performance of a Compact Heat Exchanger. Applied Thermal Eng. 29,
3609–3617 (2009)
6. Rangaswamy, R., Venkatasubramanian, V.: A Fast Training Neural Network and
its Updation for Incipient Fault Detection and Diagnosis. Computers and Chemical
Eng. 24, 431–437 (2000)
7. Perera, A., Papamichail, N., Bârsan, N., Weimar, U., Marco, S.: On-line Novelty
Detection by Recursive Dynamic Principal Component Analysis and Gas Sensor
Arrays under Drift Conditions. IEEE Sensors J. 6(3), 770–783 (2006)
8. Mina, J., Verde, C.: Fault Detection for MIMO Systems Integrating Multivariate
Statistical Analysis and Identification Methods. In: IFAC-ACC 2007, New York
U.S.A, pp. 3234–3239 (2007)
9. Detroja, K., Gudi, R., Patwardhan, S.: Plant Wide Detection and Diagnosis using
Correspondence Analysis. Control Eng. Practice 15(12), 1468–1483 (2007)
10. Tudón-Martı́nez, J.C., Morales-Menendez, R., Garza-Castañón, L.: Fault Diagnosis
in a Heat Exchanger using Process History based-Methods. In: ESCAPE 2010,
Italy, pp. 169–174 (2010)
11. Peña, D.: Análisis de Datos Multivariantes. McGrawHill, España (2002)
12. Tudón-Martı́nez, J.C., Morales-Menendez, R., Garza-Castañón, L.: Fault Detection
and Diagnosis in a Heat Exchanger. In: 6th ICINCO 2009, Milan Italy, pp. 265–270
(2009)
13. Hotelling, H.: Analysis of a Complex of Statistical Variables into Principal Components. J. Educ. Psychol. 24 (1933)
14. Freeman, J.A., Skapura, D.M.: Neural Networks: Algorithms, Applications and
Programming Techniques. Addison-Wesley, Reading (1991)
15. Korbicz, J., Koscielny, J.M., Kowalczuk, Z., Cholewa, W.: Fault Diagnosis Models,
Artificial Intelligence, Applications. Springer, Heidelberg (2004)
16. Woods, K., Bowyer, K.W.: Generating ROC Curves for Artificial Neural Networks.
IEEE Trans. on Medical Imaging 16(3), 329–337 (1997)
Comparative Behaviour of Recent Incremental
and Non-incremental Clustering Methods on
Text: An Extended Study
1 Introduction
Most clustering methods show reasonable performance on homogeneous textual datasets. However, the highest performance on such datasets is generally obtained by neural clustering methods [6]. Neural clustering methods are based on the principle of a neighbourhood relation between clusters, which is either preset (fixed topology), as in the "Self-Organizing Maps" (SOM) [4], or dynamic (free topology), as in the static "Neural Gas" (NG) [10] or the "Growing Neural Gas" (GNG) [3]. Compared to usual clustering methods, like K-means [9], this strategy makes them less sensitive to initial conditions, which is an important asset in the analysis of highly multidimensional and sparse data, like textual data.
The best-known neural clustering method is the SOM, which is based on a mapping of the original data space onto a two-dimensional grid of neurons. The SOM algorithm is presented in detail in [4]. It consists of two basic procedures: (1) selecting a winning neuron on the grid and (2) updating the weights of the winning neuron and of its neighbouring neurons.
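A minimal sketch of these two procedures for one training step (winner selection and neighbourhood-weighted update on a rectangular grid); the grid size, learning rate and neighbourhood radius are illustrative assumptions:

import numpy as np

def som_step(weights, grid_pos, x, lr=0.1, radius=1.5):
    """One SOM step: (1) pick the winning neuron, (2) update it and its grid neighbours.

    weights:  (n_neurons, dim) codebook vectors
    grid_pos: (n_neurons, 2) fixed positions of the neurons on the 2-D grid
    """
    # (1) winner = neuron whose weight vector is closest to the input x
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # (2) Gaussian neighbourhood on the grid, centred on the winner
    grid_dist = np.linalg.norm(grid_pos - grid_pos[winner], axis=1)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    weights += lr * h[:, None] * (x - weights)
    return winner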
In the NG algorithm [10], the weights of the neurons are adapted without any fixed topological arrangement within the neural network. Instead, this algorithm utilizes a neighbourhood ranking process of the neuron weights for a given input datum. The weight changes are not determined by the relative distances between the neurons within a topologically pre-structured grid, but by the relative distances between the neurons within the input space, hence the name "neural gas" network.
The GNG algorithm [3] overcomes the static character of the NG algorithm by introducing the concept of an evolving network. Hence, in this approach the number of neurons is adapted during the learning phase according to the characteristics of the data distribution. Neurons are created only periodically (every T iterations, or time period), between the two neighbouring neurons that have accumulated the largest error in representing the data.
Recently, an incremental growing neural gas algorithm (IGNG) [12] was proposed by Prudent and Ennaji for general clustering tasks. It represents an adaptation of the GNG algorithm that relaxes the constraint of periodical evolution of the network. Hence, in this algorithm a new neuron is created each time the distance of the current input data to the existing neurons is greater than a prefixed threshold σ. The σ value is a global parameter that corresponds to the average distance of the data to the centre of the dataset. Prudent and Ennaji have shown that their method produces better results than existing neural and non-neural methods on standard test distributions.
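A simplified sketch of this neuron-creation rule (with σ fixed to the average distance of the data to the dataset centre, as stated above; the embryo/maturity handling of the full IGNG algorithm is omitted, and the adaptation step size is an assumption):

import numpy as np

def compute_sigma(X):
    """sigma = average distance of the data to the centre of the dataset."""
    centre = X.mean(axis=0)
    return float(np.mean(np.linalg.norm(X - centre, axis=1)))

def igng_insert(neurons, x, sigma):
    """Create a new neuron for x if it is farther than sigma from every existing neuron,
    otherwise adapt the closest neuron slightly towards x (simplified IGNG rule)."""
    x = np.asarray(x, dtype=float)
    if not neurons:
        return [x]
    dists = [np.linalg.norm(n - x) for n in neurons]
    if min(dists) > sigma:
        neurons.append(x)                                   # new neuron at the data point
    else:
        closest = int(np.argmin(dists))
        neurons[closest] += 0.05 * (x - neurons[closest])   # small adaptation step
    return neurons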
More recently, the results of the IGNG algorithm were evaluated on heterogeneous datasets by Lamirel et al. [6] using generic quality measures and cluster labeling techniques. As the results proved to be insufficient for such data, a new incremental growing neural gas algorithm using cluster label maximization (IGNG-F) was proposed by the same authors. In this strategy the use of a standard distance measure for determining a winner is completely replaced by the label maximization approach as the main winner selection process. The label maximization approach is sketched in Section 2 and detailed more precisely in [6]. One of its important advantages is that it provides the IGNG method with an efficient incremental character, as it becomes independent of parameters.
In order to shed some light on the qualities and defects of the above-mentioned clustering methods on real-world data, we propose hereafter an extended evaluation of those methods using 3 textual datasets of increasing complexity. The third dataset we exploit is a highly polythematic dataset that represents a static simulation of evolving data. It thus constitutes an interesting benchmark for comparing the behaviour of incremental and non-incremental methods. Generic quality measures like Micro-Precision, Micro-Recall, Cumulative Micro-Precision and cluster labeling techniques are exploited for comparison. In the next section, we give some details on these clustering quality measures. The third section provides a detailed analysis of the datasets and their complexity. Following this, we present the results of our experiments on the 3 different test datasets.
where Ci+ represents the subset of clusters of C for which the number of associated data is greater than i.

¹ The IGNG-F algorithm uses this strategy as a substitute for the classical distance-based measure, which provides the best results for homogeneous datasets [6].
3 Dataset Analysis
For the experiments we use 3 different datasets, namely the Total-Use, PROMTECH and Lorraine datasets. The clustering complexity of each dataset is greater than that of the previous one. We provide a detailed analysis of the datasets and try to estimate their complexity and heterogeneity.
The Total-Use dataset consisted of 220 patent records related to oil engineering technology recorded during the year 1999. This dataset contains information such as the relationships between the patentees, the advantages of different types of oils, the technologies used by the patentees, and the context of exploitation of their final products. Based on the information present in the dataset, the labels belonging to the dataset have been categorized by the domain experts into four viewpoints or categories, namely Patentees, Title, Use and Advantages. The task of extracting the labels from the dataset is divided into two elementary steps. In step 1, the rough index set of each specific text subfield is constructed using a basic computer-based indexing tool. In step 2, the rough index set associated with each viewpoint is normalized by the domain expert in order to obtain the final index sets. In our experiment, we focus solely on the labels belonging to the Use viewpoint. Thus, the resulting corpus can be regarded as a homogeneous dataset, since it covers an elementary description field of the patents with a limited and contextually standardized vocabulary of 234 keywords or labels spanning 220 documents.
The PROMTECH dataset is extracted from the original dataset of the PROMTECH project, which was initially constituted using the INIST PASCAL database and its classification plan, with the overall goal of analysing the dynamics of the various identified topics. To build this dataset, a simple search strategy was first employed, consisting in the selection of bibliographic records having both a code in Physics and a code corresponding to a technological scope of application. The selected applicative field was Engineering. By successive selections combining statistical techniques and expert approaches, 5 promising sets of themes were released [11]. The final choice was to use the set of themes on optoelectronic devices, because this field is one of the most promising of the last decade. 3890 records related to these topics were thus selected from the PASCAL database. The corpus was then split into two periods, (1996-1999: period 1) and (2000-2003: period 2), in order to carry out an automatic classification for each one, using the description of the content of the bibliographic records provided by the indexing keywords. Only the data associated with the first period have been used in our experiment. In this way, our second experimental dataset finally consisted of 1794 records indexed by 903 labels.
The Lorraine dataset is also built up from a set of bibliographic records resulting from the INIST PASCAL database and covering all the research activity performed in the Lorraine region during the year 2005. The structure of the records makes it possible to distinguish the titles, the summaries, the indexing labels and the authors as representative of the contents of the information
published in the corresponding article. In our experiment, only the research topics associated with the labels field are considered. Since these labels cover a large set of different topics (as distant from one another as medicine, structural physics or forest cultivation), the resulting experimental dataset can be considered as highly heterogeneous. The number of records is 1920. A frequency threshold of 2 is applied to the initial label set, resulting in a description space of 3556 labels. Although the keyword data do not come directly from full text, we noted that the distribution of the terms took the form of a Zipf law (a linear relation between the log of the term frequencies and the log of their rank), which is characteristic of full-text data. The final keyword or label set also contains a high ratio of polysemous terms, like age, system, structure, etc.
4 Results
For each method, we ran many different experiments, varying the number of clusters in the case of the static methods and the neighbourhood parameters in the case of the incremental ones (see below). We finally kept the best clustering results for each method with regard to the value of the Recall-Precision F-measure and the Cumulative Micro-Precision.
We first conducted our experiments on the Total-Use dataset, which is homogeneous by nature. Figure 2 shows the Macro-F-Measure, Micro-F-Measure and the Cumulative Micro-Precision (CMP) for this dataset. We see that the Macro-F-Measure and Micro-F-Measure values are nearly similar for the different clustering approaches, although in the case of the SOM approach more than half of the clusters are empty. However, the CMP value shows some difference². The NG and GNG algorithms have good Macro and Micro F-Measure values but lower CMP than the IGNG approaches. We can thus conclude that the best results are obtained for the IGNG and IGNG-F methods. The dataset is homogeneous by nature, so it is possible to reach such high precision values. Thus small, distinct and non-overlapping clusters are formed with the best methods. Its lowest CMP value also highlights that the standard K-Means approach produces the worst results on this first dataset.
it traverses the entire dataset. The K-Means method obtained the second best CMP value (0.35). However, its Micro-Precision value is lower, illustrating a lower global average performance than all the other methods. IGNG-F has a moderate CMP value (0.25) but a high Micro-Precision, so the average clustering quality is quite good, even though some clusters of this method may still be agglomerating labels. The neural NG method has been left out of this experiment and of the next one because of its too high computation time.
Even if it embeds stable topics, the Lorraine dataset is a very complex heterogeneous dataset, as we have illustrated earlier. In a first step we restricted our experiment to 198 clusters as, beyond this number, the GNG approach entered an infinite loop (see below). A first analysis of the results on this dataset within this limit shows that most of the clustering methods have huge difficulties dealing with it, consequently producing very bad quality results, even with such a high expected number of clusters, as illustrated in Figure 4 by the very low CMP values. This indicates the presence of degenerate results, including a few garbage clusters attracting most of the data in parallel with many chunk clusters representing either marginal groups or unformed clusters. This is the case for the K-Means, IGNG and IGNG-F methods and, to a lesser extent, for the GNG method.
This experiment also highlights the irrelevance of Mean Square Error (MSE) (or distance-based) quality indexes for estimating the clustering quality in complex cases. Hence, the K-Means method, which obtained the lowest MSE, practically produces the worst results. This behaviour can be confirmed by looking more precisely at the cluster content for the said method, using the methodology described in Section 2. It can thus be highlighted that the K-Means method mainly produced one very big garbage cluster (1722 data, or more than 80% of the dataset) attracting (i.e., maximising) many kinds of different labels (3234 labels among 3556), which amounts to a degenerate clustering result. Conversely, despite its highest MSE, the correct results of the SOM method can also be confirmed in the same way. Hence, cluster label extraction clearly highlights that this latter method produces different clusters of similar size attracting semantically homogeneous label groups, which correspond to the main research topics covered by the analysed dataset. The grid-constrained learning of the SOM method seems to be a good strategy for avoiding too bad results in such a critical context: it enforces the homogeneity of the results by splitting both data and noise over the grid.
As mentioned earlier, for numbers of clusters higher than 198 the GNG method does not provide any results on this dataset because of its inability to escape from an infinite cycle of creation and destruction of neurons (i.e., clusters). Moreover, the CMP value for the GNG approach was surprisingly greater for 136 clusters than for 198 clusters (see Figure 4). Thus, increasing the expected number of clusters does not help the method discriminate between potential data groups in the Lorraine dataset context. On the contrary, it even leads the method to increase its garbage agglomeration effect. However, we found that as we increase the number of clusters beyond 198 for the other methods, the actual peak values for the SOM, IGNG and IGNG-F methods are reached, and the static SOM method and the incremental IGNG-F method reach equivalently the best clustering quality. Thus, only these two methods are really able to appropriately cluster this highly complex dataset. Optimal quality results are reported in Figure 5.
5 Conclusion
Clustering algorithms show reasonable performance in the usual context of the analysis of homogeneous textual datasets. This is especially true for the recent adaptive versions of neural clustering algorithms, like the incremental growing neural gas algorithm (IGNG) or the incremental growing neural gas algorithm based on label maximization (IGNG-F). Nevertheless, using a stable evaluation methodology and datasets of increasing complexity, this paper clearly highlights the drastic decrease in performance of most of these algorithms, as well as that of more classical non-neural algorithms, when a very complex heterogeneous textual dataset, representing a static simulation of evolving data, is used as input. If incrementality is considered as a main constraint, our experiments also showed
that only non-standard distance-based methods, like the IGNG-F method, can produce reliable results in such a case. Such a method also has the advantage of producing the most stable and reliable results across the different experimental contexts, unlike the other neural and non-neural approaches, whose results vary highly across the datasets. However, our experiments also highlighted that one of the problems IGNG-F faces in some cases is that it can associate a data point with labels completely different from those existing in the prototypes. This leads to a significant decrease in performance in such cases, as labels belonging to different clusters are clubbed together. We are thus investigating the use of a distance-based criterion to limit the number of prototypes that are considered for a new incoming data point. This would allow setting a neighbourhood threshold and focus for each new data point, which is lacking in the IGNG-F approach.
References
1. Davies, D., Bouldin, W.: A cluster separation measure. IEEE Transaction on Pat-
tern Analysis and Machine Intelligence 1, 224–227 (1979)
2. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood for incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977)
3. Fritzke, B.: A growing neural gas network learns topologies. Advances in Neural Information Processing Systems 7, 625–632 (1995)
4. Kohonen, T.: Self-organized formation of topologically correct feature maps. Bio-
logical Cybernetics 43, 56–59 (1982)
5. Lamirel, J.-C., Al-Shehabi, S., Francois, C., Hofmann, M.: New classification qual-
ity estimators for analysis of documentary information: application to patent anal-
ysis and web mapping. Scientometrics 60 (2004)
6. Lamirel, J.-C., Boulila, Z., Ghribi, M., Cuxac, P.: A new incremental growing neu-
ral gas algorithm based on clusters labeling maximization: application to cluster-
ing of heterogeneous textual data. In: The 22nd Int. Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE),
Cordoba, Spain (2010)
7. Lamirel, J.-C., Phuong, T.A., Attik, M.: Novel labeling strategies for hierarchical
representation of multidimensional data analysis results. In: IASTED International
Conference on Artificial Intelligence and Applications (AIA), Innsbruck, Austria
(February 2008)
8. Lamirel, J.-C., Ghribi, M., Cuxac, P.: Unsupervised recall and precision measures:
a step towards new efficient clustering quality indexes. In: Proceedings of the 19th
Int. Conference on Computational Statistics (COMPSTAT 2010), Paris, France
(August 2010)
9. MacQueen, J.: Some methods of classification and analysis of multivariate observa-
tions. In: Proc. 5th Berkeley Symposium in Mathematics, Statistics and Probabil-
ity, vol. 1, pp. 281–297. Univ. of California, Berkeley (1967)
10. Martinetz, T., Schulten, K.: A neural gas network learns topologies. Artificial Neural
Networks, 397–402 (1991)
11. Oertzen, J.V.: Results of evaluation and screening of 40 technologies. Deliverable
04 for Project PROMTECH, 32 pages + appendix (2007)
12. Prudent, Y., Ennaji, A.: An incremental growing neural gas learns topology.
In: 13th European Symposium on Artificial Neural Networks, Bruges, Belgium
(April 2005)
Fault Diagnosis in Power Networks
with Hybrid Bayesian Networks and Wavelets
1 Introduction
The monitoring and supervision of power networks play a very important role
in modern societies, because of the close dependency of almost every system on
the electricity supply. In this domain, failures in one area perturb neighboring
areas, and troubleshooting is very difficult due to the excess of information, cascade
effects, and noisy data. Early detection of network misbehavior can help to avoid
major breakdowns and incidents, such as those reported in the northeast of the USA
and the south of Canada in August 2003, and in 18 states of Brazil and in Paraguay
in November 2009. These events caused heavy economic losses and millions of
affected users. Therefore, an alternative for increasing the efficiency of electrical
distribution systems is the use of automated tools, which can help the operator
to speed up the process of system diagnosis and recovery.
In order to tackle these problems, fault detection and system diagnosis have
been very active research domains for several decades. Recently, the need
to develop more powerful methods has been recognized, and approaches based
on Bayesian Networks (BN), able to deal very efficiently with noise and with the
modeling of uncertainties, have been developed. For instance, in [1], a discrete
BN with noisy-OR and noisy-AND nodes, and a parameter-learning algorithm
similar to the one used by artificial neural networks (ANN), is used to estimate
the faulty section in a power system. In [2] and [3], discrete BN are used to
handle discrete information coming from the protection breakers. Continuous
voltage data coming from nodes are handled in [2] with dynamic probabilistic
maximum entropy models, whereas in [3] ANN are used. A Bayesian selectivity
technique and the Discrete Wavelet Transform are used in [4] to identify the faulty
feeder in a compensated medium voltage network. Although BN are used in
several processes for fault diagnosis and decision support, none of the
surveyed papers deals with multiple simultaneous faults, and none reports the
use of hybrid Bayesian networks (HBN) to diagnose faults in power systems.
In this paper, a hybrid diagnostic framework is proposed, which combines the
extraction of relevant features from continuous data with wavelet theory and the
modeling of a complete 24-node power network with HBN. Wavelets are used
to analyze and extract the characteristic features of the voltage data obtained
from electrical network node measurements, creating specific coefficient fault
patterns for each node. These coefficients are encoded as probability distributions
in the continuous nodes of the HBN model, which in turn help to identify the
fault type of the component. On the other hand, the operation of protection
breakers is modeled with discrete nodes encoding the probability of the status
of the breakers. The HBN model of the system states probabilistic relationships
between continuous and discrete system components. Promising results are reported
for three different evaluations based on simulations of a power network
composed of 24 nodes and 67 breakers, together with a comparison to another approach.
The organization of the paper is as follows: Section 2 reviews the fundamentals
of wavelet theory. Section 3 presents the BN framework. Section 4 gives
the general framework description. Section 5 shows the case study, and finally,
Section 6 concludes the paper.
2 Wavelets
Wavelet Transform (WT) methods have been effectively used for multi-scale rep-
resentation and analysis of signals. The WT decomposes signal transients into a
series of wavelet components, each of which corresponds to a time-domain sig-
nal that covers a specific frequency band containing more detailed information.
Wavelets localize the information in the time-frequency plane, and they are ca-
pable of trading one type of resolution for another, which makes them especially
suitable for the analysis of non-stationary signals. Wavelet analysis can also
appropriately capture rapid changes in signal transients. The main strength
of wavelet analysis is its ability to reveal the local features of a particular
area of a large signal.
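To make the feature-extraction step concrete, the following sketch computes a single-level discrete wavelet transform of a node voltage with the PyWavelets library and keeps the approximation coefficients as the pattern. The wavelet family, the single decomposition level, and the synthetic signal are illustrative assumptions, not the exact settings used by the authors.

```python
# Sketch: extracting a wavelet-coefficient pattern from a node voltage.
# The wavelet family ('db4') and the synthetic 60 Hz waveform are assumptions.
import numpy as np
import pywt

def extract_pattern(voltage, wavelet="db4"):
    """Single-level DWT of a voltage waveform: the approximation
    coefficients give a general view of the wave and are kept as the
    node's characteristic pattern."""
    approximation, detail = pywt.dwt(voltage, wavelet)
    return approximation

t = np.linspace(0.0, 0.1, 1000)
voltage = np.sin(2 * np.pi * 60 * t)   # nominal 60 Hz voltage
voltage[500:520] += 0.5                # injected transient (illustrative fault)
print(extract_pattern(voltage)[:5])
```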
3 Bayesian Networks
Bayesian Networks (BN) are a representation for probability distributions over
complex domains. Traditional BN can handle only discrete information, but
Hybrid Bayesian Networks (HBN) contain both discrete nodes and continuous
conditional probability distributions (CPDs) as numerical inputs. The main advantages
of these models are that they can infer probabilities based on continuous
and discrete information, and that they can deal with large amounts of data,
uncertainty and noise.
Bayesian inference is a kind of statistical inference in which evidence
or observations are employed to update or infer the probability that a hypothesis
is true [6]. The name Bayesian comes from the frequent use of Bayes'
theorem (Eqn. 1) during the inference process. The basic task of a probabilistic
inference system is to calculate the posterior probability distributions for a
group of variables, given an observed event.
P(b | a) = P(a | b) P(b) / P(a)        (1)
In an HBN, the conditional distribution of a continuous variable xi is given by a
linear Gaussian model of the form p(xi | Z = z, I = i) = N(μi,I + wi,I · z, σi,I),
where Z and I are the sets of continuous and discrete parents of xi, respectively,
and N(μ, σ) denotes a multivariate normal distribution. The network represents a joint
distribution over all its variables, given by the product of all its CPDs.
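As an illustration of such a conditional linear Gaussian CPD, the sketch below evaluates the density of a continuous node whose mean is linear in its continuous parents and whose parameters are selected by the discrete parent configuration. All parameter values and the breaker-status labels used as configurations are assumptions for illustration.

```python
# Sketch of a conditional linear Gaussian CPD for a continuous node x_i:
# given a discrete parent configuration and continuous parent values z,
# x_i is normally distributed with a mean linear in z. Numbers are illustrative.
import math

def linear_gaussian_pdf(x, z, params):
    """params = (b, w, sigma): intercept, weights for the continuous
    parents, and standard deviation, chosen by the discrete parents."""
    b, w, sigma = params
    mu = b + sum(wj * zj for wj, zj in zip(w, z))
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# One parameter set per configuration of the discrete parent (breaker status).
cpd = {
    "OK":   (0.0, [1.0, 0.2], 0.1),
    "OPEN": (0.5, [0.8, 0.1], 0.3),
    "FAIL": (1.0, [0.3, 0.0], 0.5),
}
print(linear_gaussian_pdf(0.4, z=[0.3, 0.5], params=cpd["OK"]))
```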
4 Framework
A portion of the HBN model developed for the power system is shown in Figure 1.
Each node in the power network is represented with a model of this type;
it is composed of discrete nodes, representing the status of protection breakers,
and continuous nodes, representing wavelet features extracted from voltage
measurements at the power network nodes. The model has three parts:
1. Upper part: Manages the discrete data, i.e., the breaker status. The main
function of this part is to establish the status of the power network node.
The status of the node (faulty or not) is known, but not the fault type.
2. Bottom part: Its main function is the analysis of continuous data, so that
this information can be communicated to the other parts of the network.
The data inserted in this section are the wavelet coefficients computed from
measurements of node voltages. Its main contribution is the confirmation and
determination of which possible fault the device may have.
3. Middle part: This is probably the most important part of the whole structure,
because here the weighted probabilities of continuous and discrete data are
combined. Faults can thus be isolated and identified.
In order to perform the system diagnosis, the first step is to obtain discrete and
continuous observations. The discrete observations reflect, in some way, the
status of the system's components (node status). It is assumed that every breaker has
a finite set of states indicating normal or faulty operation. Protection breakers
help to isolate faulted nodes by opening the circuit, and are considered to be
working in one of three states: OK, OPEN, and FAIL. The OK status means
the breaker remains closed and has not detected any problem. The OPEN status
is related to the opening of breaker when a fault is detected. The FAIL status
means the breaker malfunctioned and did not open when a fault event was
detected. The inference task over discrete information is accomplished by
applying the upper part of the HBN model for each node (Figure 1). The output
is a possible detection of a faulty or non-faulty node, consistent with the discrete
observations.
In the second step, the continuous data coming from all the system components
are analyzed. First, features are extracted through a bank of iterated filters
according to wavelet theory, in which the data are decomposed into approximation and
detail information, generating specific patterns for each component. A signature
of the different possible states or fault types of the system's components is thereby
obtained. The first decomposition (the approximation) is chosen as the representative pattern
of each signal, because it gives a general view of the analyzed wave and, unlike
the detail information, it is not overly sensitive in the presence of
simultaneous multiple faults. In order to make an adequate diagnosis, the patterns
obtained here are inserted into the bottom part of the HBN model of every
node (continuous part), which, together with the CPDs, produces a weighted
magnitude. Finally, in the middle part of the HBN model, using both of the weighted
magnitudes (discrete and continuous), the real power network node status
is calculated.
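The following toy sketch only illustrates how the weighted discrete and continuous parts could be combined into a node status; the likelihood values, the simple weighting scheme, and the helper functions are assumptions for illustration, since in the paper this combination is carried out by inference in the HBN model of Figure 1.

```python
# Illustrative combination of discrete breaker evidence (upper part) with
# continuous wavelet-pattern evidence (bottom part) into a node status
# (middle part). Tables and the weighting rule are assumptions; the paper
# performs this step by probabilistic inference in the HBN.
def diagnose_node(breaker_states, pattern_likelihoods, p_fault_given_breakers):
    p_fault = p_fault_given_breakers(breaker_states)        # discrete part
    total = sum(pattern_likelihoods.values())                # continuous part
    posterior = {ft: p_fault * (lik / total) for ft, lik in pattern_likelihoods.items()}
    posterior["NO FAULT"] = 1.0 - p_fault
    return max(posterior, key=posterior.get), posterior

# Toy inputs: OPEN breakers strongly suggest a fault; the wavelet pattern
# points to a single line-to-ground fault.
p_fault_given_breakers = lambda states: 0.9 if "OPEN" in states else 0.05
likelihoods = {"A GND": 0.7, "A-B GND": 0.2, "A-B-C GND": 0.05, "A-B": 0.03, "B-C": 0.02}
print(diagnose_node(["OPEN", "OK", "OPEN"], likelihoods, p_fault_given_breakers))
```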
5 Case Study
The system under test is a network that represents an electrical power system
with dynamic load changes; it was taken from [7] (Figure 2). The system has
24 nodes, and every node is protected by several breaker devices, which become
active when a malfunction is detected. For every node of the system, there is a
three-phase continuous voltage measurement, and for every breaker there is a
discrete signal of the state of the device.
The complete HBN model is shown in Figure 3. Each node of the power
network (Figure 2) is modeled individually with the HBN proposed in Figure
1. The 24 nodes of the power network in the complete HBN model communicate
only through the discrete nodes, which represent the breakers between
two electrical nodes.
The proposed methodology is applied as follows:
1. The first evaluation was made only to test and validate the HBN model, in 25
scenarios with faults at randomly selected nodes (five simulations with one
fault, five with two simultaneous faults, and so on, up to five simultaneous
faults).
2. The second and third evaluations were made for comparison purposes. 24 fault
simulations were included for each evaluation, with faults at 4 test nodes (3,
9, 10 and 13). The results of the second evaluation were obtained from
the correct diagnosis of one test node at a time, while in the third evaluation
the results were obtained from the correct diagnosis of the 4 test nodes in
each scenario. The results were compared with the investigation presented in
[3]. In that article, a 2-phase model for fault detection is elaborated; in the
first phase, a discrete BN is used to generate a group of suspicious fault
nodes, and in the second phase, the eigenvalues of the correlation matrix of
the continuous data from the suspicious nodes are computed and inserted
into a neural network, to confirm which of these suspicious nodes are faulty.
The simulated fault types are: one line to ground (A GND), two lines to ground
(A-B GND), three lines to ground (A-B-C GND), or faults between two lines (A-B or B-C).
Data where none of the nodes of the system were under a faulty situation (NO FAULT)
are also included. For every fault scenario, the discrete responses of the breakers
associated with the faulty nodes were simulated. The response of a breaker is given
by one of three states: OK, which means not affected by a fault; OPEN, which means
opening the circuit to isolate a fault; and FAIL, which means that the breaker
malfunctioned and did not open when the event was detected.
The three evaluations were tested on our network under ideal conditions (no
uncertainty), and the first evaluation also under non-ideal conditions (with uncertainty).
In the first case there was no missing information or uncertainty in the
discrete data, and in the second case we simulated wrong readings of the status
of protection breakers that can mislead the diagnosis.
The efficiency of the first evaluation can be seen in Table 1. The term efficiency
refers to the percentage of correct identifications of the nodes' status in
the different scenarios. The results presented by the model under ideal conditions
have high efficiency values, even in scenarios with 5 simultaneous faults. The
lowest efficiency appears in scenarios with 3 and 4 faults. A possible explanation
for this behavior is that the random faults occurred at nodes that
are not directly connected to a generator, such as nodes 8, 9, and 10 (see Figure 2),
which may not recover as quickly from a failure. The results obtained with the
model under non-ideal conditions also have high efficiency values, which means
that the scenarios were correctly diagnosed, except for the scenarios with 5 simultaneous
faults, which present some degradation due to the interconnection
effects of the faulted nodes in the electrical network.
In Table 2, the results of the second evaluation are presented. In this evaluation,
1 to 4 faults at random nodes were simulated in every scenario, but the
results were only based on the correct determination of the status of the 4 test nodes
(3, 9, 10 and 13). The efficiency is almost perfect under ideal conditions, while under
non-ideal conditions there is a decrease for the nodes with line-to-line faults and
faults in neighboring nodes.
In Tables 3 and 4, a comparison of the results of our work with the research
in [3] is shown. In that research, three evaluations with different numbers of
fault samples in the same scenarios were made. In Case 1 they use 75% of faulty samples.
Tables 3 and 4 give a summary of the percentages obtained for each of the
three cases considered in [3] and for the work presented here, according to the type
of fault in Table 3 and according to the node in Table 4. There is degradation
in the investigation of [3], mainly due to the similarity of some of their processed
data when there are more normal-operation data than faulted data (Cases 2 and 3).
According to these two tables, our proposal is competitive with the work done in [3];
it even presents a better performance, obtaining an efficiency higher by up to 15.65%,
with the additional advantage that there are no restrictions regarding the number of
fault samples analyzed.
6 Conclusions
A fault detection framework for power systems, based on Hybrid Bayesian Networks
and wavelets, was presented. The wavelet processing analyzes the continuous
data of the power network to extract the relevant fault features as pattern
coefficients. These patterns are then used in the HBN model. This model
combines the discrete observations of the system's components and the results
of applying wavelets to the continuous data, and determines the status of
the node (location) and the type of fault. The experiments have shown that the
type of fault can be identified in most cases. The proposal was compared with a
previous approach in the same domain [3] and showed a better performance,
obtaining an efficiency higher by up to 15.65%.
Future work may extend the model presented here to include the degradation
of the power network devices over time, developing a more realistic approach
through the use of Dynamic HBN models.
References
1. Yongli, Z., Limin, H., Jinling, L.: Bayesian Networks-based Approach for Power
Systems Fault Diagnosis. IEEE Transactions on Power Delivery 21(2), 634–639 (2006)
2. Garza Castañón, L., Acevedo, P.S., Cantú, O.F.: Integration of Fault Detection and
Diagnosis in a Probabilistic Logic Framework. In: Garijo, F.J., Riquelme, J.-C.,
Toro, M. (eds.) IBERAMIA 2002. LNCS (LNAI), vol. 2527, pp. 265–274. Springer,
Heidelberg (2002)
3. Garza Castañón, L., Nieto, G.J., Garza, C.M., Morales, M.R.: Fault Diagnosis of
Industrial Systems with Bayesian Networks and Neural Networks. In: Gelbukh, A.,
Morales, E.F. (eds.) MICAI 2008. LNCS (LNAI), vol. 5317, pp. 998–1008. Springer,
Heidelberg (2008)
4. Elkalashy, N.I., Lehtonen, M., Tarhuni, N.G.: DWT and Bayesian technique for
enhancing earth fault protection in MV networks. In: IEEE/PES Power Systems
Conference and Exposition (PSCE 2009), pp. 89–93 (2009)
5. Valens, C.: A Really Friendly Guide to Wavelets. PolyValens,
http://www.polyvalens.com/ (retrieved February 1, 2010)
6. Jensen, F.V.: Bayesian Networks and Influence Diagrams. Aalborg University. De-
partment of Mathematics and Computer Science, Denmark
7. Reliability Test System Task Force, Application of Probability Methods Subcommittee:
IEEE Reliability Test System. IEEE Transactions on Power Apparatus and
Systems 98(6), 2047–2054 (1979)
8. MicroTran official webpage, http://www.microtran.com/
Learning Temporal Bayesian Networks for
Power Plant Diagnosis
1 Introduction
Power plants and their effective operation are vital to industry, schools, and even
our homes; for this reason they are subject to strict regulations and quality standards.
However, problems may appear, and when they do, human operators have to make
decisions relying mostly on their experience to determine the best recovery action,
with very limited help from the system. In order to provide useful information to the
operator, different models that can deal with industrial diagnosis have been developed.
These models must manage uncertainty, because real-world information is usually imprecise,
incomplete, and noisy. Furthermore, they must manage temporal reasoning, since the
timing of occurrence of the events is an important piece of information.
Bayesian Networks [9] are an alternative for dealing with uncertainty that has
proven successful in various domains. Nevertheless, these models cannot
deal with temporal information. An extension of BNs, called Dynamic Bayesian
Networks (DBNs), can deal with temporal information. DBNs can be seen as
multiple slices of a static BN over time, with links between adjacent slices. However,
these models can become quite complex, in particular when only a few
important events occur over time.
Temporal Nodes Bayesian Networks (TNBNs) [1] are another extension of
Bayesian Networks. They belong to a class of temporal models known as Event
Bayesian Networks [5]. TNBNs were proposed to manage uncertainty and tem-
poral reasoning. In a TNBN, each Temporal Node has intervals associated with
it. Each node represents an event or state change of a variable. An arc between
two Temporal Nodes corresponds to a causal–temporal relation. One interesting
property of this class of models, in contrast to Dynamic Bayesian Networks, is
that the temporal intervals can differ in number and size.
TNBNs have been used in diagnosis and prediction of temporal faults in a
steam generator of a fossil power plant [1]. However, one problem that appears
when using TNBNs is that no learning algorithm exists, so the model has to be
obtained from external sources (i.e., a domain expert). This can be a hard and
error-prone task. In this paper, we propose a learning algorithm to obtain the
structure and the temporal intervals for TNBNs from data, and apply it to the
diagnosis of a combined cycle power plant.
The learning algorithm consists of three phases. In the first phase, we obtain
an approximation of the intervals. For this, we apply a clustering algorithm.
Then we convert these clusters into initial intervals. In the second phase, the
BN structure is obtained with a structure learning algorithm [2]. The last step
is performed to refine the intervals for each Temporal Node. Our algorithm
obtains the number of possible sets of intervals for each configuration of the
parents by clustering the data based on a Gaussian mixture model. It then
selects the set of intervals that maximizes the prediction accuracy. We applied
the proposed method to fault diagnosis in a subsystem of a power plant. The
data was obtained from a power plant simulator. The structure and intervals
obtained by the proposed algorithm are compared to a uniform discretization
and a k-means clustering algorithm; the results show that our approach creates
a simpler TNBN with high predictive accuracy.
2 Related Work
Bayesian Networks (BN) have been applied to industrial diagnosis [6]. However,
static BNs are not suited to deal with temporal information. For this reason
Dynamic Bayesian Networks [3] were created. In a DBN, a copy of a base model
is created for each time stage. These copies are linked via a transition network
which is usually connected through links only allowing connections between con-
secutive stages (the Markov property). The problem is that DBNs can become very
complex, and this is unnecessary when dealing with problems for which there
are only a few changes for each variable in the model. Moreover, DBNs are not
capable of managing different levels of time granularity; they have a fixed time
interval between stages.
In TNBNs, each variable represents an event or state change. So, only one
(or a few) instance(s) of each variable is required, assuming there is one (or a
few) change(s) of a variable state in the temporal range of interest. No copies
of the model are needed, and no assumption about the Markovian nature of the
process is made. TNBNs can deal with multiple granularity because the number
and the size of the intervals for each node can be different.
There are several methods to learn BNs from data [8]. Unfortunately, the al-
gorithms used to learn BNs cannot deal with the problem of learning temporal
intervals, so these cannot be applied directly to learn TNBNs. To the best of
our knowledge, there is only one previous work that attempts to learn a TNBN.
Liu et al. [7] proposed a method to build a TNBN from a temporal probabilistic
database. The method obtains the structure from a set of temporal dependen-
cies in a probabilistic temporal relational model (PTRM). In order to build the
TNBN, they obtain a variable ordering that maximizes the set of conditional in-
dependence relations implied by a dependency graph obtained from the PTRM.
Based on this order, a directed acyclic graph corresponding to the implied in-
dependence relations is obtained, which represents the structure of the TNBN.
The previous work assumes a known probabilistic temporal–relational model
from the domain of interest, which is not always available. Building this PTRM
can be as difficult as building a TNBN. In contrast, our approach constructs the
TNBN directly from data, which in many applications is readily available or can
be generated, for instance, using a simulator.
Fig. 1. The TNBN for Example 1. Each oval represents a node. The Failure Steam
Valve is an Instantaneous Node, so it does not have temporal intervals. The Elec-
trical Generation Disturbance and Drum Pressure Disturbance are Temporal Nodes.
Therefore, they have temporal intervals associated with their values.
4 Learning Algorithm
First, we present the interval learning algorithm for a TN, assuming that we
have a defined structure; later we present the whole learning algorithm.
Algorithm 1 performs two nested loops. For each set of intervals, we sort the
intervals by their starting point and then check whether an interval is contained in
another interval. While this is true, the algorithm obtains an average interval,
taking the average of the start and end points of the two intervals and replacing them
with the new one. Next, it refines the intervals to be contiguous by
taking the mean of two adjacent values as their common boundary.
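A minimal sketch of this interval refinement is given below; since Algorithm 1 itself is not reproduced in this excerpt, the data layout and loop structure are assumptions based on the textual description.

```python
# Sketch of the interval refinement described for Algorithm 1: sort the
# intervals by start point, repeatedly replace a pair (interval, interval
# contained in it) by their average, then make adjacent intervals
# contiguous by cutting at the mean of neighbouring endpoints.
def refine_intervals(intervals):
    intervals = sorted(intervals, key=lambda iv: iv[0])
    merged = True
    while merged:
        merged = False
        for i in range(len(intervals) - 1):
            (s1, e1), (s2, e2) = intervals[i], intervals[i + 1]
            if s1 <= s2 and e2 <= e1:                  # containment
                avg = ((s1 + s2) / 2.0, (e1 + e2) / 2.0)
                intervals[i:i + 2] = [avg]
                merged = True
                break
    for i in range(len(intervals) - 1):                # make contiguous
        boundary = (intervals[i][1] + intervals[i + 1][0]) / 2.0
        intervals[i] = (intervals[i][0], boundary)
        intervals[i + 1] = (boundary, intervals[i + 1][1])
    return intervals

print(refine_intervals([(0, 30), (5, 20), (40, 60)]))  # [(2.5, 32.5), (32.5, 60)]
```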
As in the first approximation, the best set of intervals for each TN is selected
based on the predictive accuracy in terms of their RBS. However, when a TN
has as parents other Temporal Nodes (an example of this situation is illustrated
in Figure 3), the state of the parent nodes is not initially known. So, we cannot
directly apply the second approximation. In order to solve this problem, the
intervals are selected sequentially in a top–down fashion according to the TNBN
structure. That is, we first select the intervals for the nodes in the second level
of the network (the root nodes are instantaneous by definition in a TNBN [1]).
Once these are defined, we know the values of the parents of the nodes in the 3rd
level, so we can find their intervals; and so on, until the leaf nodes are reached.
4.4 Pruning
Taking the combinations and joining the intervals can become computationally
expensive: the number of sets of intervals for a node is in O(q²k²), where q is
the number of configurations of the parents and k is the maximum number of
clusters for the GMM. For this reason we used two pruning techniques for each
TN to reduce the computation time.
The first pruning technique discards the partitions that contain few instances.
For this, we count the number of instances in each partition; if this count is
greater than a value β = (number of instances) / (2 × number of partitions), the
configuration is used, otherwise it is discarded. A second technique is applied when
the intervals for each combination are being obtained. If the final set of intervals
contains only one interval (no temporal information) or more than α intervals
(producing a complex network), the set of intervals is discarded. For our
experiments we used α = 4.
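The two pruning rules can be expressed compactly as below; the representation of a partition as a plain list of training instances is an assumption.

```python
# Sketch of the two pruning rules. A partition is assumed to hold the
# training instances falling under one configuration of the parents.
def keep_partition(partition, n_instances, n_partitions):
    """First rule: discard partitions that contain too few instances."""
    beta = n_instances / (n_partitions * 2.0)
    return len(partition) > beta

def keep_interval_set(intervals, alpha=4):
    """Second rule: discard interval sets with no temporal information
    (a single interval) or with more than alpha intervals."""
    return 1 < len(intervals) <= alpha

print(keep_partition(list(range(30)), n_instances=100, n_partitions=4))   # True
print(keep_interval_set([(0, 10), (10, 20), (20, 35)]))                    # True
```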
Now we present the complete algorithm that learns the structure and the intervals
of the TNBN. First, we perform an initial discretization of the temporal
variables based on a clustering algorithm (k-means); the obtained clusters are
converted into intervals according to the process shown in Algorithm 2. With
this process, we obtain an initial approximation of the intervals for all the Temporal
Nodes, and we can then perform standard BN structural learning. We used the
K2 algorithm [2]. This algorithm takes an ordering of the nodes as a parameter.
For learning a TNBN, we can exploit this parameter and define an order based on
the temporal domain information.
Once a structure is obtained, we can apply the interval learning algorithm
described in Section 4.1. Moreover, this process of alternating interval learning
and structure learning may be iterated until convergence.
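A sketch of the initial discretization step is shown below, under the assumption that each cluster is turned into the interval spanned by its members; Algorithm 2 is not reproduced in this excerpt, so this conversion rule and the use of scikit-learn's k-means are assumptions.

```python
# Sketch of the initial discretization: cluster the observed event times
# of a temporal variable with k-means and convert each cluster into an
# interval (here, the [min, max] range of its members, sorted by start).
import numpy as np
from sklearn.cluster import KMeans

def initial_intervals(event_times, k=3):
    times = np.asarray(event_times, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(times)
    intervals = []
    for c in range(k):
        members = times[labels == c].ravel()
        intervals.append((members.min(), members.max()))
    return sorted(intervals)

print(initial_intervals([1, 2, 3, 20, 22, 25, 60, 64, 70], k=3))
# [(1.0, 3.0), (20.0, 25.0), (60.0, 70.0)]
```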
Fig. 2. Schematic description of a power plant showing the feedwater and main steam
subsystems. Ffw refers to feedwater flow, Fms refers to main steam flow, dp refers
to drum pressure, dl refers to drum level.
mass becomes steam and is separated from the water in a special tank called
the drum. Here, water is stored to provide the boiler tubes with the appropriate
volume of liquid that will be evaporated and steam is stored to be sent to the
steam turbine.
From the drum, water is supplied to the rising water tubes called water walls
by means of the water recirculation pump, where it will be evaporated, and
water-steam mixture reaches the drum. From here, steam is supplied to the
steam turbine. The conversion of liquid water to steam is carried out at a spe-
cific saturation condition of pressure and temperature. In this condition, water
and saturated steam are at the same temperature. This must be the stable con-
dition where the volume of water supply is commanded by the feed-water control
system. Furthermore, the valves that allow the steam supply to the turbine are
controlled in order to manipulate the values of pressure in the drum. The level
of the drum is one of the most important variables in the generation process. A
decrease of the level may cause not enough water to be supplied to the rising
tubes, and the excess of heat and lack of cooling water may destroy the tubes.
On the contrary, an excess of level in the drum may drag water as humidity into
the steam provided to the turbine and cause severe damage to the blades. In
both cases, a degradation of the performance of the generation cycle is observed.
Even with very well calibrated instruments, controlling the level of the drum
is one of the most complicated and uncertain processes of the whole generation
system. This is because the mixture of steam and water makes reading the exact
level of mass very difficult.
Among the simulated failures is a failure in the Steam Valve. These types of failures
are important because they may cause disturbances in the generation capacity and in the drum.
Fig. 3. The learned TNBN for a subsystem of a combined cycle power plant. For each
node the obtained temporal intervals are shown. The TNBN presents the possible
effects of the failure of two valves over different important components.
In order to evaluate our algorithm, we obtained the structure and the in-
tervals for each Temporal Node with the proposed algorithm. In this case, we
do not have a reference network, so to compare our method, we used as base-
lines an equal-width discretization (EWD) and a K-means clustering algorithm
to obtain the intervals for each TN. We evaluated the model using three mea-
sures: (i) the predictive accuracy using RBS, (ii) the error in time defined as
the difference between the real event and the expected mean of the interval,
and (iii) the number of intervals in the network. The best network should have
high predictive RBS, low error in time and low complexity (reduced number of
intervals).
We performed three experiments varying the number of cases. First, we generated
the data with the simulator; then we learned the structure and the intervals.
Finally, we used the learned network to compare the results with the original
data. The results are presented in Table 1. The network obtained with the proposed
algorithm with the highest accuracy is presented in Figure 3.
The following observations can be obtained from these results. In all the ex-
periments, our algorithm obtained the best RBS score and the lowest number of
intervals. The K-means and EW discretization obtained the best score in time
error. However, this happens because they produced a high number of intervals
of smaller size, which decreases the difference between the mean of an interval
and the real event. Even though our algorithm does not obtain the best time
error, it is not far from the other algorithms. It is important to note that our
algorithm obtains the highest accuracy with a simpler model.
Table 1. Evaluation on the power plant domain. We compare the proposed algorithm
(Prop), K-means clustering and equal-width discretization (EWD) in terms of predic-
tive accuracy (RBS), time error and number of intervals generated.
Num. of Cases   Algorithm   RBS (Max 100)   Time Error   Average num. intervals
50              Prop.       93.26           18.02        16.25
50              K-means     83.57           15.6         24.5
50              EWD         85.3            16.5         24.5
75              Prop.       93.7            17.8         16
75              K-means     85.7            16.3         24.5
75              EWD         86.9            17.2         24.5
100             Prop.       93.37           17.7         17
100             K-means     90.4            17.1         24.5
100             EWD         91.9            15.29        24.5
References
1. Arroyo-Figueroa, G., Sucar, L.E.: A temporal Bayesian network for diagnosis and
prediction. In: Proceedings of the 15th UAI Conference, pp. 13–22 (1999)
2. Cooper, G.F., Herskovits, E.: A bayesian method for the induction of probabilistic
networks from data. Machine learning 9(4), 309–347 (1992)
3. Dagum, P., Galper, A., Horvitz, E.: Dynamic network models for forecasting. In:
Proc. of the 8th Workshop UAI, pp. 41–48 (1992)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society 39(1), 1–38
(1977)
5. Galán, S.F., Arroyo-Figueroa, G., Díez, F.J., Sucar, L.E.: Comparison of two types
of Event Bayesian Networks: A case study. Applied Artificial Intelligence 21(3), 185 (2007)
6. Knox, W.B., Mengshoel, O.: Diagnosis and Reconfiguration using Bayesian Net-
works: An Electrical Power System Case Study. In: SAS 2009, p. 67 (2009)
7. Liu, W., Song, N., Yao, H.: Temporal Functional Dependencies and Temporal Nodes
Bayesian Networks. The Computer Journal 48(1), 30–41 (2005)
8. Neapolitan, R.E.: Learning Bayesian Networks. Pearson Prentice Hall, London
(2004)
9. Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible infer-
ence. Morgan Kaufmann, San Francisco (1988)
On the Fusion of Probabilistic Networks
Abstract. This paper deals with the problem of merging multiple-source uncer-
tain information in the framework of probability theory. Pieces of information
are represented by probabilistic (or bayesian) networks, which are efficient tools
for reasoning under uncertainty. We first show that the merging of probabilis-
tic networks having the same graphical (DAG) structure can be easily achieved in
polynomial time. We then propose solutions to merge probabilistic networks hav-
ing different structures. Lastly, we show how to deal with the sub-normalization
problem which reflects the presence of conflicts between different sources.
1 Introduction
This paper addresses the problem of fusing multi-source information represented in
the framework of probability theory. Uncertain pieces of information are assumed to
be represented by probabilistic networks. Probabilistic networks [3] are important tools
proposed for an efficient representation and analysis of uncertain information. They are
directed acyclic graphs (DAG), where each node encodes a variable and every edge
represents a causal or influence relationship between two variables. Uncertainties are
expressed by means of conditional probability distributions for each node in the context
of its parents.
Several works have been proposed to fuse propositional or weighted logical knowledge
bases issued from different sources (e.g., [1], [6]). However, there are few works
on the fusion of belief networks in agreement with the combination laws of probability
distributions; see [7], [8], [4] and [5]. In these existing works, fusion is based on exploiting
intersections or unions of the independence relations induced by the individual graphs. In
this paper, we are more interested in computing the counterparts of fusing the probability
distributions associated with bayesian networks. The results of this paper can be viewed as
a natural counterpart of the ones recently developed in possibility theory for merging
possibilistic networks [2]. In fact, all the main steps proposed in [2] have natural counterparts.
More precisely, we propose an approach to merge n probabilistic networks. The
obtained bayesian network is such that its associated probability distribution is a function
of the probability distributions associated with the initial bayesian networks to merge. The
merging operator considered in this paper is the product operator. We study, in particular,
the problem of sub-normalization, which concerns the conflict between information
sources.
2 Probabilistic Networks
Let V = {A1, A2, ..., AN} be a set of variables. We denote by DA = {a1, .., an}
the domain associated with the variable A. By a we denote any instance of A. Ω =
×Ai∈V DAi denotes the universe of discourse, which is the Cartesian product of all
variable domains in V. Each element ω ∈ Ω is called a state, an interpretation or a
solution. Subsets of Ω are simply called events. This section defines quantitative
probabilistic graphs. A probabilistic network on a set of variables V, denoted by B = (pB, GB),
consists of:
– a graphical component, denoted by GB, which is a DAG (Directed Acyclic Graph).
Nodes represent variables and edges encode the influence between variables. The
set of parents of a node A is denoted by UA, and μA denotes an instance of the parents
of A. We say that C is a child of A if there exists an edge from A to C. We define the
descendants of a variable A as the set of nodes obtained by applying the transitive
closure of the child relation.
– a numerical component, denoted by pB, which quantifies the different links of the
network. For every node A, uncertainty is represented by the conditional probability
degree pB(a | uA) of each instance a ∈ DA in the context of each uA ∈ DUA.
In the following, probability distributions pB, defined at the node level, are called local
probability distributions. From the set of local conditional probability distributions, one
can define a unique global joint probability distribution.
Definition 1. Let B = (pB, GB) be a quantitative probabilistic network. The joint or
global probability distribution associated with B, denoted by pBJ, is expressed by
the following quantitative chain rule:

pBJ(A1, .., AN) = ∏i=1..N p(Ai | UAi).        (1)
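As an illustration of Definition 1, the following sketch computes the joint probability of a complete assignment as the product of each node's local conditional probability given its parents; the tiny three-variable network and its numbers are purely illustrative.

```python
# Sketch of the quantitative chain rule: the joint probability of a
# complete assignment is the product of each node's conditional
# probability given its parents. The network and numbers are illustrative.
def joint_probability(assignment, parents, cpts):
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[u] for u in parents[var])
        p *= cpts[var][(value, parent_values)]
    return p

parents = {"A": (), "B": ("A",), "C": ("A", "B")}
cpts = {
    "A": {("a1", ()): 0.7, ("a2", ()): 0.3},
    "B": {("b1", ("a1",)): 0.9, ("b2", ("a1",)): 0.1,
          ("b1", ("a2",)): 0.4, ("b2", ("a2",)): 0.6},
    "C": {("c1", ("a1", "b1")): 0.5, ("c2", ("a1", "b1")): 0.5,
          ("c1", ("a1", "b2")): 0.2, ("c2", ("a1", "b2")): 0.8,
          ("c1", ("a2", "b1")): 0.6, ("c2", ("a2", "b1")): 0.4,
          ("c1", ("a2", "b2")): 0.3, ("c2", ("a2", "b2")): 0.7},
}
print(joint_probability({"A": "a1", "B": "b1", "C": "c1"}, parents, cpts))  # 0.315
```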
Note that p⊕(ω) is generally subnormalized. Therefore, the fused probability distribu-
tion should be normalized.
– The set of its variables is the union of the sets of variables in G1 and in G2 .
– For each variable A, its parents are those in G1 and G2 .
If the union of G1 and G2 does not contain cycles, we say that G1 and G2 are U-acyclic
networks. In this case the fusion can be easily obtained. The fusion of networks which
are not U-acyclic is left for future work. But we first need to introduce some intermediate
results concerning the addition of variables and arcs to a given probabilistic
network.
The following proposition shows how to add new variables to a probabilistic network
without changing its joint probability distribution:
Then, we have:

∀ω ∈ ×Ai∈V DAi,   pBJ(ω) = Σa∈DA pB1J(aω).        (3)
The above proposition is essential for a fusion process since it allows us, if necessary, to
equivalently increase the size of all the probabilistic networks to merge, such that they
use the same set of variables.
The following proposition shows how to add links to a probabilistic network without
modifying its probability distributions.
Using the results of the previous section and Proposition 1, the fusion of two U-acyclic
networks is immediate. Let GB⊕ be the union of GB1 and GB2. Then the fusion of B1
and B2 can be obtained using the following two steps:
Step 1. Using Propositions 2 and 3, expand B1 and B2 such that GB1 = GB2 = GB⊕.
The extension consists of adding variables and arcs.
Step 2. Apply Proposition 1 to the probabilistic networks obtained from Step 1 (since
the two networks now have the same structure).
Example 1. Let us consider two probabilistic networks B1 and B2, whose DAGs are given
by Figure 1. These two DAGs have different structures. The conditional probability distributions
associated with B1 and B2 are given below. The DAG GB⊕ is given by Figure 2, which is simply
the union of the two graphs of Figure 1.
Fig. 1. The DAGs of the probabilistic networks B1 and B2
Now we transform both of GB1 and GB2 to the common graph GB⊕ by adding the required
variables and links for each graph. In our example we apply the following steps:
Fig. 2. The common DAG GB⊕, the union of GB1 and GB2, in which D is a parent of B, B and D are parents of A, and A is the parent of C
• a new variable C, and a link from the variable A to the variable C, with a uniform
conditional probability distribution, namely:
∀c ∈ DC, ∀a ∈ DA, pB2(c | a) = .5;
• a link from the variable B to the variable A, so that the new conditional probability
distribution on the node A becomes:
∀d ∈ DD, ∀b ∈ DB, ∀a ∈ DA, pB2(a | b, d) = pB2(a | d).
The conditional probability distributions associated with the DAG representing the result of the
fusion are given in the following tables.
D   pB⊕(D)        C   A   pB⊕(C | A)        B   D   pB⊕(B | D)
d1  .9            c1  a1  .25               b1  d1  .24
d2  .1            c1  a2  0                 b1  d2  .64
                  c2  a1  .25               b2  d1  .14
                  c2  a2  .5                b2  d2  .4

A   B   D   pB⊕(A | B ∧ D)
a1  b1  d1  .18
a1  b1  d2  .3
a1  b2  d1  .6
a1  b2  d2  1
a2  b1  d1  .42
a2  b1  d2  .28
a2  b2  d1  0
a2  b2  d2  0
From these different tables of conditional probability distributions, we can easily show that
the joint probability distribution pB⊕J computed by using the chain rule (Definition 1) is equal to
the product of pB1J and pB2J. For instance, let ω = a1 b1 c1 d1 be a possible situation. Using the
probabilistic chain rule we have:
pB1J(a1 b1 c1 d1) = .3 ∗ .5 ∗ .8 = .12.
pB2J(a1 b1 c1 d1) = .9 ∗ .6 ∗ .3 = .162.
pB⊕J(a1 b1 c1 d1) = pB⊕(c1 | a1) ∗ pB⊕(a1 | b1 d1) ∗ pB⊕(b1 | d1) ∗ pB⊕(d1) = .5 ∗ .18 ∗ .24 ∗ .9 = .01944.
It is very important to note that local probability distributions are subnormalized. Hence
the joint probability distribution is also subnormalized.
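The equality illustrated above can be checked directly with the numbers of Example 1:

```python
# Check, with the numbers of Example 1, that the chain-rule value of the
# fused network equals the product of the two original joint values for
# the situation ω = a1 b1 c1 d1.
p_B1 = 0.3 * 0.5 * 0.8              # pB1^J(a1 b1 c1 d1)
p_B2 = 0.9 * 0.6 * 0.3              # pB2^J(a1 b1 c1 d1)
p_fused = 0.5 * 0.18 * 0.24 * 0.9   # chain rule in the fused network
print(p_B1 * p_B2, p_fused)         # both equal .01944, up to floating-point rounding
```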
Proposition 4. Let B be a probabilistic network. Let A be a root variable where
Σa∈DA p(a) = α < 1. Let us define B1 such that:
– GB1 = GB,
– ∀X, X ≠ A, pB1(X | UX) = pB(X | UX),
– ∀a ∈ DA, pB1(a) = pB(a)/α.
Therefore: ∀ω, pB1J(ω) = pBJ(ω)/α.
This proposition allows us to normalize root variables.
A corollary of the above proposition concerns the situation where only the probability
distribution of a root variable is subnormalized (namely, all other variables are normalized);
then the probabilistic network B1 is in fact the result of normalizing B. Namely,
pB1 = pB / h(pB).
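A small numerical sketch of Proposition 4 is given below; the distributions are illustrative and only show that dividing a subnormalized root distribution by α rescales every joint value by 1/α.

```python
# Sketch of Proposition 4: if the distribution of a root variable A sums
# to alpha < 1, dividing it by alpha divides every joint value by alpha
# while leaving the rest of the network untouched. Numbers are illustrative.
p_A = {"a1": 0.3, "a2": 0.3}                  # subnormalized root, alpha = 0.6
alpha = sum(p_A.values())
p_A_normalized = {a: p / alpha for a, p in p_A.items()}

p_C_given_A = {("c1", "a1"): 0.8, ("c2", "a1"): 0.2,
               ("c1", "a2"): 0.4, ("c2", "a2"): 0.6}

joint_before = p_A["a1"] * p_C_given_A[("c1", "a1")]
joint_after = p_A_normalized["a1"] * p_C_given_A[("c1", "a1")]
print(joint_after, joint_before / alpha)      # identical: 0.4
```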
The next important issue is how to deal with a variable A which is not a root. In this
situation, we solve the problem by modifying the local probability distributions associated
with the parents of the variable A. This modification does not change the joint global
probability distribution. However, the result may produce, as a side effect, subnormalized
probability distributions on the parents of A. Hence, the normalization process should be
repeated from the leaves up to the root variables. When we reach the roots, it is enough to apply
Proposition 6 to get a probabilistic network with normalized local probability
distributions. Let us first consider the case of a variable A whose local probability
distribution is subnormalized and which admits only one parent. The normalization of such
a variable is given by the next proposition.
Let us explain the different conditions enumerated in the above proposition. These
conditions allow us to normalize the local probability distributions at the level of a node A
which has one parent B. The first condition says that the probability distributions associated
with variables different from A and B remain unchanged. The second condition specifies that
the normalization only affects the variable A (the variable concerned by the normalization
problem). Lastly, the third condition applies to the variable B (the parent of A) the inverse
operation of the normalization, which ensures the equivalence between the joint distributions.
Example 2. Let us consider again the previous example. The conditional probability distribution
on the node C is sub-normalized. C has only one parent; hence, we can apply the above
proposition. The result is given in Table 4, where the distribution on the node C is now normalized.
We also give the distribution on the node A, which has been changed.
The above proposition allows us to normalize the local distributions of variables having one
parent, without modifying the joint global distribution. Let us now generalize the above
proposition.
1. GB1 = GB ∪ {C → B : C ∈ UA, C ≠ B}.
2. Local probability distributions are defined by:
(a) ∀X ≠ A, ∀X ≠ B, pB1(X | UX) = pB(X | UX).
(b) ∀a ∈ DA, ∀μA ∈ DUA,
    pB1(a | μA) = pB(a | μA)/α   if Σa (pB(a | μA)) = α,
    pB1(a | μA) = pB(a | μA)     otherwise.
(c) ∀b ∈ DB, ∀μ′B,
    pB1(b | μ′B) = pB(b | μB) ∗ α   if μ′B = μB × (μA − {b}) and Σa (pB(a | μA)) = α,
    pB1(b | μ′B) = pB(b | μB)       otherwise.
We first recall that this definition concerns the normalization of the local probability
distribution associated with a variable which has at least two parents. Note that in Definition
3 the variable B always exists (otherwise the graph would contain a cycle). The first condition
says that GB is a subgraph of GB1; namely, GB1 is obtained from GB by adding new
links from each variable C (which is different from B and is a parent of A) to the variable B.
We now explain the local probability distributions associated with the probabilistic network
B1.
Condition (a) says that the probability distributions associated with variables which are
different from A and B are exactly the same as those of the initial network. Condition (b)
normalizes the probability distribution associated with the variable A.
Condition (c) applies an inverse normalization operation on the variable B in order
to preserve the joint distribution. Here μ′B denotes an instance of the parents of B in the new
network B1, and μB (resp. μA) denotes an instance of the parents of B (resp. of A) in the
initial network B.
6 Conclusions
This paper has addressed the fusion of probabilistic networks. We first showed that when
the probabilistic networks have the same structure, the fusion can be realized
efficiently. We then studied the fusion of networks having different structures. Lastly, we
proposed an approach to handle the sub-normalization problem induced by the fusion
of probabilistic networks.
References
1. Baral, C., Kraus, S., Minker, J., Subrahmanian, V.S.: Combining knowledge bases consisting
of first order theories. Computational Intelligence 8(1), 45–71 (1992)
2. Benferhat, S., Titouna, F.: Fusion and normalization of quantitative possibilistic networks.
Appl. Intell. 31(2), 135–160 (2009)
3. Darwiche, A.: Modeling and Reasoning with Bayesian Networks. Cambridge University
Press, Cambridge (2009)
4. de Oude, P., Ottens, B., Pavlin, G.: Information fusion with distributed probabilistic networks.
Artificial Intelligence and Applications, 195–201 (2005)
5. del Sagrado, J., Moral, S.: Qualitative combination of Bayesian networks. International Journal
of Intelligent Systems 18(2), 237–249 (2003)
6. Konieczny, S., Perez, R.: On the logic of merging. In: Proceedings of the Sixth International
Conference on Principles of Knowledge Representation and Reasoning (KR 1998), pp. 488–
498 (1998)
7. Matzkevich, I., Abramson, B.: The topological fusion of bayes nets. In: Dubois, D., Wellman,
M.P., D’Ambrosio, B., Smets, P. (eds.) 8th Conf. on Uncertainty in Artificial Intelligence
(1992)
8. Matzkevich, I., Abramson, B.: Some complexity considerations in the combination of belief
networks. In: Heckerman, D., Mamdani, A. (eds.) Proc. of the 9th Conf. on Uncertainty in
Artificial Intelligence (1993)
Basic Object Oriented Genetic Programming
1 Introduction
Genetic Programming (GP) is an approach to automated computer program
generation using genetic algorithms [1]. The original genetic programs used Lisp-like
expressions [2] for their representation. Since the original introduction of GP,
several mechanisms have been proposed for program generation, such as
strongly-typed GP [3], grammar-based GP [3] and linear GP [3], all of which
have attempted to improve the performance and broaden the scope of GP's
applicability.
Genetic programming has been successfully applied in several engineering domains;
however, there are limitations. Current GP algorithms cannot handle
large-scale problems. For example, the generation of a modern computer operating
system is far beyond the capability of any GP algorithm.
If we could find a way to drastically improve the performance of current
GP algorithms and methods, there might be more GP applications. This paper is
motivated by the goal of making a contribution towards this end.
It is worth noting that there are implementations of traditional GP algorithms
in object-oriented languages; this is not what is discussed in this paper.
Computer programming has developed from using machine code, to assembly
language, to structured languages, to object-oriented languages, and now to
model-driven development. People are now studying a higher level of abstraction
for computer software than simply using the notions of object and class; however,
this too is of limited interest in this paper.
As the level of abstraction in software programming increases, so does the size
of the systems that can be developed and the ease with which they are maintained
and upgraded.
Although the GP paradigm already includes the idea of constraining the functions
and types, it is still at the stage of using data and functions to solve
the problem, which is analogous to the era of modular programming. Applying
object-oriented programming should be a better approach, judging from human
software engineering practice. There has been prior research in the area of object-oriented
genetic programming (OOGP) [4], [5], but it is early-stage research,
and this paper proposes a representation and genetic operators for OOGP.
This paper proposes extensions to the application of object-oriented concepts
in genetic programming. The main contribution is a method of operating on objects
within the same executable (hereafter called Basic OOGP), which is explained in detail
in Section 3. By experimentally comparing the performance of the Basic OOGP algorithm
with 3 modern GP algorithms, we demonstrate that this method has practical advantages.
The experimental setup is described in Section 4 and the results in Section 5. Conclusions
and future work are described in Section 6.
2 Related Work
Research in this area is limited. Many of the papers found containing the terms
object-oriented genetic programming are either not relevant to this area or reference
the two papers described in the following two subsections. One of the
earliest works in this area is due to Bruce [6] and was built upon by the two papers
described.
Although these two papers do include OOGP in their titles and they attempt
to use the advantages of object-oriented programming, they are different from
what is proposed in this paper in terms of applying OO concepts to software.
The fundamental difference is in the nature of their representation; these two
papers still represent the software program as a tree-based data structure, in
contrast to the linear representation proposed here. We should point out that
linear representations are not new, linear GP having a well-developed, if recent,
history [7] within mainstream GP.
3 Basic OOGP
A simple problem may not be solved in a large search space. With the help of
the no-op in Basic OOGP, the algorithm has the choice of filling no-ops into the
chromosomes if the solution does not require all genes to be used.
In Basic OOGP, the length of the chromosome is fixed and is determined for a
particular problem domain. In tree-based GP, to increase the tree depth from N
to N+1 for a binary tree, the total number of nodes has to be increased by 2^(N+1).
Some sections of the chromosome might not be effective in the parent generation,
but could express their functionality in the offspring; thus Basic OOGP successfully
simulates a pseudo-gene feature. This kind of section is saved and passed to the
next generation. Because Basic OOGP is linear, there is no need to truncate the
offspring tree; the offspring retains the same chromosome length as its parents.
In contrast, in normal tree-based GP, since the portions selected
from the two parents can have many nodes, the generated offspring might have
more nodes than the pre-defined maximum number of nodes (there has to be
a maximum value, otherwise the tree might grow to a size that is beyond the
processing ability of the host computer). The truncation operation is therefore always
necessary in normal GP.
From the above discussion, we can see that Basic OOGP is somewhat different
from traditional GP. It applies the concept of OO at the very core of the
genetic algorithm. It is certainly far more than a simple OO implementation of
conventional GP.
Because of the linear representation used by Basic OOGP, the search space
is bounded by the length of the chromosome, which only increases when the
number of nodes increases. For example, for a binary tree, if we set the maximum
depth of the tree to (11+N), then for each step K (12 to 11+N), the number of tree nodes
in the search space increases by 2 to the power of K. Using Basic OOGP, if
the original chromosome length is (11+N), we only increase the number of searched
nodes by 3*N. Hence, the rate of increase of our search space is substantially
lower than for a tree-based GP. This substantially reduces the computational
search effort compared with normal GP.
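The growth comparison above can be illustrated numerically; the figures below simply follow the counts stated in the text (2 to the power of K new tree nodes per step versus 3*N additional gene fields) and are not a formal complexity analysis.

```python
# Numeric illustration of the growth argument: going from step 12 to step
# 11+N, a binary tree adds on the order of 2**K nodes at each step K,
# while the linear chromosome adds only 3 gene fields per step.
N = 5
tree_growth = sum(2 ** K for K in range(12, 12 + N))
linear_growth = 3 * N
print(tree_growth, linear_growth)   # 126976 vs. 15
```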
The mutation operator implemented was: randomly pick two gene positions
and generate new random genes at those positions. The Basic OOGP evaluation engine
can then be specified as:
Algorithm 1.
Evaluate(Gene gene)
N = chromosome length
for i = 0 to N − 1 do
  j = gene[i].host
  k = gene[i].action
  l = gene[i].passive
  host = resourcePool.hosts[j]
  action = resourcePool.actions[k]
  passive = resourcePool.hosts[l]
  Invoke host→action(passive)
end for
return resourcePool.hosts[0]
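A runnable sketch of this evaluation engine is given below; the Bit class, the chosen action set, and the resource-pool layout are illustrative assumptions rather than the authors' implementation. Each gene is a (host, action, passive) triple of indices, and evaluation walks the chromosome and invokes the selected action of the host object on the passive object.

```python
# Runnable sketch of Algorithm 1 on a toy boolean resource pool.
# The Bit class and the action subset are assumptions for illustration.
class Bit:
    def __init__(self, value=False):
        self.value = value
    def AND(self, other):   self.value = self.value and other.value
    def OR(self, other):    self.value = self.value or other.value
    def XOR(self, other):   self.value = self.value != other.value
    def NOT(self, other):   self.value = not self.value      # ignores 'other'
    def NO_OP(self, other): pass

def evaluate(chromosome, hosts):
    actions = ["AND", "OR", "XOR", "NOT", "NO_OP"]
    for j, k, l in chromosome:                 # host, action, passive indices
        host, passive = hosts[j], hosts[l]
        getattr(host, actions[k])(passive)     # invoke host -> action(passive)
    return hosts[0]                            # result read from hosts[0]

# The chromosome XORs inputs 1 and 2 into host 0, i.e. it computes the
# XOR (odd parity) of the three inputs.
inputs = [Bit(True), Bit(False), Bit(True)]
chromosome = [(0, 2, 1), (0, 2, 2)]
print(evaluate(chromosome, inputs).value)      # False: two True bits cancel under XOR
```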
4 Experiments
The performance of Basic OOGP for an even-parity problem [2] was investigated
and compared with 3 other techniques: conventional GP, Liquid State GP and
Traceless GP.
Parameter      Value
Pop. Size      800
Function Set   AND, OR, NOT, XOR, RESET (reset to initial state), NO OP
Crossover      Exchange 50% of genes based upon a uniform distribution
Mutation       Mutation rate 0.5% of population size. Randomly regenerate the selected genes in the individual.
Selection      Keep the best 10% of individuals. Select one of the remaining individuals as parent A and randomly select another from the top 33.3% as parent B. Randomly select 50% of the genes from parent B and replace those in the same positions in parent A, generating one offspring. Use this child in the next generation, replacing parent A.
Success        Predict all truth values in the truth table.
5 Results
The following tables provide results for Basic OOGP, which empirically demon-
strate that Basic OOGP can indeed work. Results from two other types of GP
are also provided for comparison.
Liquid State Genetic Programming (LSGP) [9] is a hybrid method combining
both Liquid State Machine (LSM) and Genetic Programming techniques. It uses
a dynamic memory for storing the inputs (the liquid). Table 2 shows the results
from LSGP [9].
Further experiments for the even parity problem of order 9 to 14 showed that
Basic OOGP could solve the problem consistently; LSGP and GP could not.
LSGP could solve the even parity problem for size 9 and above in less than 10%
of all trials.
Traceless Genetic Programming (TGP) [10] is a hybrid method combining
a technique for building the individuals and a technique for representing the
individuals. TGP does not explicitly store the evolved computer programs. Table
3 shows the results for Traceless GP (TGP) [10].
The results of normal GP are not given because, according to [10] and [9], LSGP
and TGP both outperformed normal GP. Table 4 shows the average number of
generations and standard deviation for Basic OOGP solving the even parity problem
for orders 3 to 9. As can be observed, Basic OOGP solves the even parity prob-
lem in all runs (i.e., is 100% successful) and solves the problem in a fraction of the
number of evaluations used in successful runs of the LSGP or TGP. It should be
noted that for the 10 and 11 problem sizes a larger representation was used.
6 Conclusions
This paper proposes the application of OO concepts to GP in order to apply
GP to large-scale problems. Through the introduction of OO programming, the
computer program being evolved is represented as a linear array, which helps
to reduce the search space. In the experiments reported in this paper, Basic
OOGP can solve parity problems of up to order 12¹, while LSGP can only solve
the even-parity problem up to order 8, with conventional GP failing to solve
even smaller problems. This demonstration is promising and indicates that Basic
OOGP is a feasible approach to program production. Furthermore, Basic OOGP
has the potential to converge to the result faster than normal GP, even with a
smaller population size and fewer generations, though the results may vary due
to differing implementations of selection, mutation and crossover strategies. A
more detailed examination of the roles of various operators in the two techniques
is left as future research. In conclusion, the experiments show that incorporating
OO concepts into GP is a promising direction for improving GP performance,
though it is not clear at this point what the ultimate complexity of problem
solvable by GP is. This remains an open question.
By applying Basic OOGP to problems in other domains we expect to be able
to demonstrate that such an approach can improve the performance of GP in
more complex software engineering domains.
References
1. Koza, J.R.: Genetic programming II: automatic discovery of reusable programs.
MIT Press, Cambridge (1994)
2. Koza, J.R.: Genetic programming: on the programming of computers by means of
natural selection. MIT Press, Cambridge (1992)
¹ Success with problems of up to size 14 has been achieved.
3. Poli, R., Langdon, W.B., McPhee, N.F.: A field guide to genetic programming
(With contributions by J. R. Koza) (2008), Published via http://lulu.com and
freely available at http://www.gp-field-guide.org.uk
4. Abbott, R.: Object-oriented genetic programming an initial implementation. Fu-
ture Gener. Comput. Syst. 16(9), 851–871 (2000)
5. Lucas, S.M.: Exploiting Reflection in Object Oriented Genetic Programming. In: Keijzer, M., O'Reilly, U.-M., Lucas, S., Costa, E., Soule, T. (eds.) EuroGP 2004. LNCS, vol. 3003, pp. 369–378. Springer, Heidelberg (2004)
6. Bruce, W.S.: Automatic generation of object-oriented programs using genetic pro-
gramming. In: Proceedings of the First Annual Conference on Genetic Program-
ming, GECCO 1996, pp. 267–272. MIT Press, Cambridge (1996)
7. Brameier, M.F., Banzhaf, W.: Linear Genetic Programming (Genetic and Evolu-
tionary Computation). Springer, New York (2006)
8. Oppacher, Y., Oppacher, F., Deugo, D.: Evolving java objects using a grammar-
based approach. In: GECCO, pp. 1891–1892 (2009)
9. Oltean, M.: Liquid State Genetic Programming. In: Beliczynski, B., Dzielinski, A.,
Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4431, pp. 220–229.
Springer, Heidelberg (2007)
10. Oltean, M.: Solving even-parity problems using traceless genetic programming. In:
Proceedings of the 2004 IEEE Congress on Evolutionary Computation, June 20-23,
vol. 2, pp. 1813–1819. IEEE Press, Portland (2004)
Inferring Border Crossing Intentions with
Hidden Markov Models
1 Introduction
Illegal border crossings constitute a problem addressed by governments at many
international boundaries, and considerable efforts are devoted to surveillance,
monitoring, and interdiction activities. This paper proposes to combine two kinds
of data for this task: soft data, viz., observations made by humans, and hard data, collected from sensors. In this paper, we use the word subject to indicate
a group of people who are under observation, who may be intending to cross a
border.
Decisions based on hard data alone, along with the fusion of data from multiple
(hard) sensors, have been extensively studied and are fairly well understood [4].
In recent years, researchers have recognized the importance of soft data and
the need to combine it effectively with hard data. For example, Laudy and Goujon [7] have proposed various techniques to extract valuable information
from soft data, and Hall et al. [3] have proposed a framework for dynamic hard
and soft data fusion. But concrete methods for fusing the information collected
from these two types of sources remain elusive. In this paper, we attempt to use
multiple human observations to estimate the likelihood that a subject intends
to cross the border, and to infer the location and time of such crossing.
Appropriate decisions can be taken to prevent the border crossing or catch the
subjects upon such crossing. The final “belief” in the border crossing hypothesis
is obtained by probabilistically combining three factors:
1. Observations from (hard) sensors;
2. Inferences of subjects’ intentions from their partially observed paths (via
human intelligence reports); and
3. Soft data from other human observations (e.g., increased activity of unknown
individuals).
Artificial Intelligence has been used to determine intentions from actions in
several scenarios. For example, the use of intention graphs is proposed in Youn
and Oh [11], Hidden Markov Models (HMMs) [8] have been used to model se-
quenced behavior by Feldman and Balch [2] and sequenced intentions in Kelley
et al. [6]. We explore the following questions: (1) If a partial path travelled by a subject is accessible to us, can we predict whether the subject will cross the border? (2) Can we assign levels of confidence to these decisions? (3) If it is determined with a high level of confidence that the subject will cross the border, where and when will the crossing occur?
Section 2 presents the Hidden Markov Model approach as applied to repre-
senting and estimating intentions of subjects (to cross or not to cross). Section
3 discusses our probabilistic fusion approach. Simulation results are given in
Section 4. Section 5 contains a description and example of the application of
the HMM approach to predict subject paths. Concluding remarks are given in
Section 6.
(1) approaching the border, (2) returning from the border, (3) looking around in the region for a suitable location to cross, (4) waiting at a location for an appropriate time to proceed, and (5) crossing the border. A subject with no intention of crossing the border can also approach the border, try to identify a location by moving around, wait at a location in need of directions, return from the border and then go back to one of the previous states, with transitions as shown in Figure 1(b).
Fig. 1. States of Hidden Markov Models for Cross and Not-cross intentions
We now discuss how the HMM parameters are obtained. This is followed by
a description of how the HMMs are used to infer intentions. We emphasize that
the two steps are completely separate.
Fig. 2. Typical paths generated by Markov Models for cross and not-cross intentions.
(a) Typical paths taken with intention to cross, (b) Typical paths taken with intention
to not-cross.
The performance of the system (after the training phase) is then tested on a different set (test set) of paths for the intention to cross, the intention to not-cross, as well as a third set of paths where the subject changes its intention from cross to not-cross in the middle, possibly due to (i) information received from its informants, (ii) other information from its surroundings, such as being aware that someone is closely observing him/her, or (iii) spotting a police patrol van. That is, in the latter scenario, the subject starts with the intention to cross the border but at some point in time changes intention (perhaps temporarily) to not-cross on the way to the border.
Our goal is to infer the subject's intention while the subject is in the middle of the path towards the border, and not necessarily when he is very near the border. In addition, we allow the possibility that the subject may change its intention with time. For this reason, HMM-based intention recognition offers a good way to estimate the intention of the subject using the information available up to that time. Given a set of observations, by using the Forward component of the Baum-Welch algorithm [8] we can estimate the probability that the subject is performing an activity corresponding to HMM_C or HMM_NC, and infer that the intention of the subject corresponds to the HMM with the higher probability.
Let λc and λnc denote the vectors of parameters of the HMMs for cross and not-cross intentions. Then
the Forward component of the Baum-Welch algorithm provides the probabilities
P (O|λc ) and P (O|λnc ). If the prior probabilities of the models P (λc ) and P (λnc )
are known, then the posterior probabilities P (λc |O) and P (λnc |O) can be easily
calculated. In the absence of any information, the priors are presumed equal
for both models, but if inference from the soft data indicates that the subject
is likely to cross the border, then by Bayes' rule these priors are adjusted
accordingly:
P(\lambda_c \mid O) = \frac{P(O \mid \lambda_c)\,P(\lambda_c)}{P(O)}, \quad\text{and}\quad P(\lambda_{nc} \mid O) = \frac{P(O \mid \lambda_{nc})\,P(\lambda_{nc})}{P(O)}.

The probability of the intention to cross is then

P(C) = \frac{P(\lambda_c \mid O)}{P(\lambda_c \mid O) + P(\lambda_{nc} \mid O)} \qquad (1)
and the probability of the intention of not-cross is assigned a value P (N ) =
1 − P(C), where λc and λnc are the estimated parameter vectors of HMM_C and HMM_NC, respectively.
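As a rough illustration of how Equation (1) can be evaluated, the sketch below computes P(O|λ) with a standard forward pass over two discrete HMMs and combines the resulting posteriors; the parameter layout (initial probabilities pi, transition matrix A, emission matrix B) and the function names are our own placeholders, not the models learnt in the paper.

import numpy as np

def forward_likelihood(obs, pi, A, B):
    # P(O | lambda) for a discrete HMM via the forward algorithm.
    # pi: initial state probabilities, A: transition matrix,
    # B: emission matrix (states x symbols), obs: list of symbol indices.
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def prob_cross(obs, hmm_c, hmm_nc, prior_c=0.5):
    # Posterior probability of the cross intention as in Equation (1);
    # equal priors are assumed unless soft data suggests otherwise.
    p_c = forward_likelihood(obs, *hmm_c) * prior_c
    p_nc = forward_likelihood(obs, *hmm_nc) * (1.0 - prior_c)
    return p_c / (p_c + p_nc)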
where P_soft(C) is the probability of crossing the border based on evidence provided by the soft sensor only and is calculated as discussed above.
2. Alternatively, P_soft(C) can be used to adjust the prior P(λc) and then calculate the posterior probability of the model. That is:
P_fused(λc|O) = P(O|λc) × P_fused(λc);  P_fused(λnc|O) = P(O|λnc) × P_fused(λnc)
These two fusion methods have both been evaluated and compared, as dis-
cussed in the next section.
4 Simulation Results
In simulation experiments we use three levels of surveillance, viz. Low Surveil-
lance during which around 20% of the path is expected to be seen, Medium
Surveillance of around 40% and Alert Surveillance of around 60%. These results
are compared with the ideal situation when there is 100% surveillance, i.e., the
entire path followed by the subject is observed.
The HMMs for intentions to Cross and Not-Cross are learnt from the data set
of 100 paths for each intention. The learnt HMM’s are then tested on another
set of 100 paths of each intention, and 100 paths of changing intentions.
In Figure 3, the red and blue curves denote the probability of intention to
cross without fusion and with fusion respectively whereas the green curve is for
not-cross.
The probability obtained from an analysis of soft sensor information clearly
influences these values. For the purpose of simulation, we assumed that the
probability of ‘false alarm’ associated with soft sensors information is 0.18; that
is, the probability that soft sensors report an innocent person as suspicious is
Fig. 4. Probabilities of crossing (dashed), not crossing (dotted) and fused probability
(solid) of crossing (when the intention of the subject is changing between cross and not-
cross) vs. time of observation
0.18. Also, the probability that soft sensors wrongly conclude a subject to be innocent is assumed to be 0.18 (this probability is called the probability of Miss). Note that when the subject does not have the intention to cross, the fusion comes into play when the human intelligence (soft data) results in a false alarm and reports that the subject will cross.
As mentioned earlier, in the case of border surveillance scenarios it is infeasible to observe the entire path. Simulation was done on 20%, 40% and 60% of the complete path. We sampled the entire path by alternating observed segments with randomly chosen unobserved segments, depending on the percentage of the path to be seen. In each of these cases, 100 incomplete paths for each intention were generated 10 times. The performance of the learning algorithms is measured in terms of how often the correct intention is predicted. In Tables 1, 2 and 3, results are reported for three cases – when fusion is not implemented, when fusion with soft evidence was done, and when the probabilities of false alarm and miss were considered – for the intentions Cross, Not-Cross and Changing intention, respectively.
The above results are for the fusion method given by Equation 2, out of the two methods discussed in the section on fusion. Recall that in Equation 2 the decisions of the HMMs and soft sensors are fused, whereas in the fusion method given by Equation 3 the soft sensor information is fused with the prior probabilities of the HMM models. In general, we have observed that the effect of soft sensor evidence is more prominent when Equation 2 is used instead of Equation 3, and hence it is the preferred approach. For example, if the soft sensor report is contrary to the HMM results, then the overall probability changes by a more noticeable amount under Equation 2, often bringing the probability below 0.5 from a value above it, which is generally not the case with the use of Equation 3.
Table 1. Simulation results for the intention to cross the border. † Accounting for the soft sensor's probability of a 'Miss' in detection.
Table 2. Simulation results for the intention to not cross the border. ‡ Accounting for the soft sensor's probability of a 'false alarm' in detection. This table has only three columns because there is no information from the soft sensor for the probability that a subject will change his/her intention.
Table 3. Simulation results for the changing cross intention. † Accounting for the soft sensor's probability of a 'Miss' in detection.
5 Path Prediction
Estimation of when and where a subject will cross the border is a significant component of the border crossing problem. To accomplish this task, the path observed so far is given as input to the Viterbi algorithm [8], which estimates the most probable state sequence that could have produced the given sequence of observations. The last state in this sequence, which corresponds to the present state of the subject, becomes the state from which the future path prediction starts. We construct the remaining path, up to the border, using the parameters of the HMM that has the higher probability of generating the path observed so far.
However, the path so generated is a random path and represents one of the
several possible paths that can be obtained by repeated use of HMM. By gen-
erating an ensemble of paths we can obtain a region where the subject is likely
to cross with high level of confidence. Using the set of predicted data we can
also find the average and minimum time required to reach each zone. Figure 5
shows an example of the path predictor after observing the subject for 60 minutes. It displays the probability of crossing the border at each zone and the corresponding time of crossing.
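A minimal sketch of this path-prediction step, under the assumption of a discrete HMM whose per-state emission distribution is over border zones (helper names and the data layout are ours): a Viterbi pass gives the most likely current state, and repeated sampling from the selected HMM yields an ensemble of future paths from which crossing zones and times can be tallied.

import numpy as np

def viterbi_last_state(obs, pi, A, B, eps=1e-12):
    # Most likely current (last) hidden state for the observed partial path.
    delta = np.log(pi + eps) + np.log(B[:, obs[0]] + eps)
    for o in obs[1:]:
        delta = np.max(delta[:, None] + np.log(A + eps), axis=0) + np.log(B[:, o] + eps)
    return int(np.argmax(delta))

def sample_crossings(start_state, A, B, cross_state, n_paths=1000, max_steps=200):
    # Generate an ensemble of future paths until the 'cross' state is reached;
    # returns sampled (time-step, zone) pairs for the predicted crossings.
    rng = np.random.default_rng()
    crossings = []
    for _ in range(n_paths):
        s = start_state
        for t in range(1, max_steps + 1):
            s = rng.choice(len(A), p=A[s])
            if s == cross_state:
                zone = rng.choice(B.shape[1], p=B[s])   # zone emitted at crossing
                crossings.append((t, zone))
                break
    return crossings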
Table 4 shows the predicted place and time of crossing for a subject having
an intention to cross, using the Viterbi algorithm for 1000 iterations. The place of crossing can be one of several equal-sized zones along the border. The table shows
the predicted time of crossing at the different zones before the crossing takes
place.
In realistic examples a subject is likely to follow a trail as opposed to walking
through all possible areas within a region. In that case, the above approach will
provide a short list of trails where the subject is likely to cross the border.
Fig. 5. Prediction of zone and time of border crossing given partially observed path
6 Conclusion
The main contribution of this paper is to demonstrate that given a partially
observed path followed by a subject, it is possible to assess whether a subject
intends to cross the border. Using the simulation it is also possible to estimate
when and where the subject will cross the border. Soft sensor data can be very
valuable in increasing or decreasing the amount of surveillance near the border
and can also be used to better assess the probability of the intention to cross the border. In this work we do not consider geographical constraints, nor do we consider the possibility that the subject's path can be modeled using other methods, such as via a state equation. In our future work we plan to extend these results to more realistic scenarios and will consider a sequential estimation approach for the case when soft and hard sensor data arrive at the fusion center at multiple, intermingled times.
Possible errors, imprecision, and uncertainty in the observations will also be
considered in future work.
Acknowledgements
This research was partially supported by U.S. Army Research Contract W911NF-
10-2-00505. The authors also thank Karthik Kuber for his assistance and helpful
comments.
References
1. Brdiczka, O., Langet, M., Maisonnasse, J., Crowley, J.L.: Detecting Human Behav-
ior Models From Multimodal Observation in a Smart Home. IEEE Transactions
on Automation Science and Engineering
2. Feldman, A., Balch, T.: Representing Honey Bee Behavior for Recognition Using
Human Trainable Models. Adapt. Behav. 12, 241–250 (2004)
3. Hall, D.L., Llinas, J., Neese, M.M., Mullen, T.: A Framework for Dynamic
Hard/Soft Fusion. In: Proceedings of the 11th International Conference on In-
formation Fusion (2008)
4. Brooks, R., Iyengar, S.: Multi-Sensor Fusion: Fundamentals and Applications with
Software. Prentice Hall, Englewood Cliffs (1997)
5. Jones, R.E.T., Connors, E.S., Endsley, M.R.: Incorporating the Human Analyst
into the Data Fusion Process by Modeling Situation Awareness Using Fuzzy Cog-
nitive Maps. In: 12th International Conference on Information Fusion, Seattle, WA,
USA, July 6-9 (2009)
6. Kelley, R., Tavakkoli, A., King, C., Nicolescu, M., Nicolescu, M., Bebis, G.: Un-
derstanding Human Intentions via Hidden Markov Models in Autonomous Mobile
Robots. In: HRI 2008, Amsterdam, Netherlands (2008)
7. Laudy, C., Goujon, B.: Soft Data Analysis within a Decision Support System. In:
12th International Conference on Information Fusion, Seattle, WA, USA, July 6-9
(2009)
8. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition. Proceedings of the IEEE 77(2) (February 1989)
9. Rickard, J.T.: Level 2/3 Fusion in Conceptual Spaces. In: 10th International Con-
ference on Information Fusion, Quebec, Canada, July 9-12 (2007)
10. Sambhoos, K., Llinas, J., Little, E.: Graphical Methods for Real-Time Fusion and
Estimation with Soft Message Data. In: 11th International Conference on Infor-
mation Fusion, Cologne, Germany, June 30-July 3 (2008)
11. Youn, S.-J., Oh, K.-W.: Intention Recognition using a Graph Representation.
World Academy of Science, Engineering and Technology 25 (2007)
A Framework for Autonomous Search in the
ECLiPSe Solver
1 Introduction
Constraint Programming (CP) is known to be an efficient software technology for
modeling and solving constraint-based problems. Under this framework, prob-
lems are formulated as Constraint Satisfaction Problems (CSP). This formal
problem representation mainly consists of a set of variables, each lying in a domain, and a set of constraints. Solving the CSP implies finding a complete variable instantiation that satisfies the whole set of constraints. The common approach to solving CSPs relies on creating a tree data structure holding the potential solutions by using a backtracking-based procedure. In general, two main
phases are involved: an enumeration and a propagation phase. In the enumer-
ation phase, a variable is instantiated to create a branch of the tree, while the
propagation phase attempts to prune the tree by filtering from domains the
values that do not lead to any solution. This is possible by using the so-called
consistency properties [6].
In the enumeration phase, there are two important decisions to be made: the order in which the variables and the values are selected. This selection refers to the
variable and value ordering heuristics, which jointly constitute the enumeration strategy. It turns out that this pair of decisions is crucial to the performance
of the resolution process. For instance, if the right value is chosen on the first
try for each variable, a solution can be found without performing backtracks.
However, to decide a priori the correct heuristics is quite hard, as the effects of
the strategy can be unpredictable.
A modern approach to handle this concern is called Autonomous Search
(AS) [2]. AS is a special feature allowing systems to improve their performance
by self-adaptation or supervised adaptation. Such an approach has been success-
fully applied in different solving and optimization techniques [3]. For instance, in
evolutionary computing, excellent results have been obtained for parameter set-
tings [5]. In this context, there exists a theoretical framework as well as different
successful implementations. In contrast, AS for CP is a more recent trend. A few works have reported promising results based on a similar theoretical framework [1], but little work has been done on developing extensible frameworks and architectures for implementing AS in CP. This hinders advances in this area, which, in particular, requires strong experimental work.
The central idea of AS in CP is to perform a self-adaptation of the search pro-
cess when it exhibits poor performance. This roughly consists of performing an "on the fly" replacement of badly performing strategies by other, more promising ones. In this paper, we propose a new framework for implementing
AS in CP. This new framework enables the required “on the fly” replacement by
measuring the quality of strategies through a choice function. The choice func-
tion determines the performance of a given strategy in a given amount of time,
and it is computed based upon a set of indicators and control parameters. Ad-
ditionally, to guarantee the precision of the choice function, a genetic algorithm
optimizes the control parameters. This framework has been implemented in the
ECLiPSe solver and is supported by a 4-component architecture (described in Section 3). An important capability of this new framework is the possibility of easily updating its components. This is useful for experimentation tasks. Devel-
opers are able to add new choice functions, new control parameter optimizers,
and/or new ordering heuristics in order to test new AS approaches. We believe
this framework is a useful support for experimentation tasks and can even be
the basis for the definition of a general framework for AS in CP.
This paper is organized as follows. Section 2 presents the basic notions of
CP and CSP solving. The architecture of the new framework is described in
Section 3. The implementation details are presented in Section 4, followed by
the conclusion and future work.
2 Constraint Programming
3 Architecture
Our framework is supported by the architecture proposed in [4]. This architecture
consists of 4 components: SOLVE, OBSERVATION, ANALYSIS and UPDATE.
4 Framework Implementation
The framework has been designed to allow an easy modification of the UPDATE component. In fact, UPDATE is the component most susceptible to modification, since the most obvious experiment – in the context of AS – is to tune or replace the choice function or the optimizer.
[Figure: architecture of the framework. The SOLVE, OBSERVATION, ANALYSIS and UPDATE components interact through a database of snapshots and a database of indicators; the UPDATE component hosts the choice function and the optimizer as a Java plug-in.]
In the current UPDATE component, we use a choice function [7] that ranks and
chooses between different enumeration strategies at each step (a step is every
time the solver is invoked to fix a variable by enumeration). For any enumeration
strategy Sj , the choice function f in step n for Sj is defined by equation 1,
where l is the number of indicators considered and αi is the control parameter that manages the relevance of indicator i within the choice function. Such control parameters are computed by the genetic algorithm, which attempts to find the values for which the number of backtracks is minimized.
f_n(S_j) = \sum_{i=1}^{l} \alpha_i \, f_{i\,n}(S_j) \qquad (1)

f_{i\,1}(S_j) = v_0 \qquad (2)

f_{i\,n}(S_j) = v_{n-1} + \beta_i \, f_{i\,n-1}(S_j) \qquad (3)
Let us note that the speed at which the older observations are smoothed (damp-
ened) depends on β. When β is close to 0, dampening is quick and when it is
close to 1, dampening is slow.
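A small sketch of how such a choice function could be maintained at each step, assuming the raw indicator observations v are already available (the data layout and the function name are our own):

def update_choice_function(f_prev, v_prev, alpha, beta):
    # One step of the choice function for a single strategy S_j.
    # f_prev[i] : f_{i,n-1}(S_j), the smoothed value of indicator i
    # v_prev[i] : the raw observation of indicator i at the previous step
    # alpha[i]  : relevance weight of indicator i (tuned by the genetic algorithm)
    # beta[i]   : dampening factor in [0, 1] for indicator i
    f_new = [v_prev[i] + beta[i] * f_prev[i] for i in range(len(f_prev))]
    score = sum(alpha[i] * f_new[i] for i in range(len(f_new)))
    return f_new, score

At each enumeration step, the strategy with the highest score would then be applied, with ties broken randomly, in the spirit of equations (1)-(3).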
The general solving procedure including AS can be seen in Algorithm 2. Three
new function calls have been included: for calculating the indicators (line 11),
the choice function (line 12), and for choosing promising strategies (line 13),
that is, the ones with the highest choice function value¹. They are called after constraint
propagation to compute the real effects of the strategy (some indicators may be
impacted by the propagation).
In this work, we have presented a new framework for performing AS in CP. Based
on a set of indicators, the framework measures the state of the resolution process to allow the replacement of strategies exhibiting poor performance. A main el-
ement of the framework is the UPDATE component, which can be seen as a
plug-in of the architecture. This allows users to modify or replace the choice
¹ When strategies have the same score, one is selected randomly.
function and/or the optimizer in order to perform new AS-CP experiments. The
framework has been tested with different instances of several CP benchmarks (send+more=money, N-queens, N linear equations, self-referential quiz, magic squares, sudoku, knight's tour problem, etc.) by using the already presented UPDATE component, which was plugged into the framework in a few hours by a master's student with limited ECLiPSe knowledge.
The framework introduced here is ongoing work, and it can certainly be ex-
tended by adding new UPDATE components. This may involve implementing new optimizers as well as studying new statistical methods for improving the choice function.
References
1. Crawford, B., Montecinos, M., Castro, C., Monfroy, E.: A hyperheuristic approach
to select enumeration strategies in constraint programming. In: Proceedings of ACT,
pp. 265–267. IEEE Computer Society, Los Alamitos (2009)
2. Hamadi, Y., Monfroy, E., Saubion, F.: Special issue on autonomous search. Constraint Programming Letters 4 (2008)
3. Hamadi, Y., Monfroy, E., Saubion, F.: What is autonomous search? Technical Re-
port MSR-TR-2008-80, Microsoft Research (2008)
4. Monfroy, E., Castro, C., Crawford, B.: Adaptive enumeration strategies and
metabacktracks for constraint solving. In: Yakhno, T., Neuhold, E.J. (eds.) AD-
VIS 2006. LNCS, vol. 4243, pp. 354–363. Springer, Heidelberg (2006)
5. Robet, J., Lardeux, F., Saubion, F.: Autonomous control approach for local search.
In: Stützle, T., Birattari, M., Hoos, H.H. (eds.) SLS 2009. LNCS, vol. 5752, pp.
130–134. Springer, Heidelberg (2009)
6. Rossi, F., van Beek, P., Walsh, T.: Handbook of Constraint Programming. Elsevier,
Amsterdam (2006)
7. Soubeiga, E.: Development and Application of Hyperheuristics to Personnel Schedul-
ing. PhD thesis, University of Nottingham School of Computer Science (2009)
Multimodal Representations, Indexing,
Unexpectedness and Proteins
1 Introduction
Multimodal systems have been successfully deployed across multiple domains,
ranging from intelligent media browsers for teleconferencing to systems for analysing
expressive gesture in music and dance performances [1, 2]. The success of
multimodal approaches lies in the inherent assumption that each modality brings
information that is absent from the others. That is, complementary perspectives on an
otherwise partially known piece of information are provided. Usually, multimodal
systems aim to combine the inputs from diverse media. For example, the combination
of various images (visible, infrared and ultraviolet) in order to improve the contrast in
air surveillance comes to mind [3]. However, a strength of multimodal analysis is that
it may provide unexpected and complementary, or even apparently conflicting, perspectives on a problem domain. This aspect is often overlooked, which may lead to unexplored knowledge being left undiscovered. This paper shows that such unexpected and complementary modes exist when analyzing large protein
[Figure: overview of the multimodal protein indexing and analysis system. The initial inputs are PDB amino acid sequences in FASTA format (e.g., chains of entries 1HDA and 1BNF) together with the protein name/family.]
A protein, however, does not have a shape “per se”. It is ultimately constituted of
atoms, which interact with one another according to quantum mechanics. However, it
is possible to infer, from the position of the various atoms, various representations
perspectives. That is, their composition, their internal structure and the way they
present themselves to the outside world, which shows the geometry involved in
interaction. We further aim to determine the synergy and the unexpectedness of such
a multimodal analysis. Indeed, as will be shown in Section 4, it is difficult to obtain a
good understanding by simply considering one modality. Furthermore, the “a priori”
knowledge associated with such a single modality may lead either to false, or over-
generalized, conclusions. We will show that such erroneous conclusions may be
avoided with a multimodal analysis.
From the position of the atoms, the topologies and the envelopes were calculated
with the University of Illinois molecular solver [6]. The 3D shape can be obtained in one of two ways: firstly, from the isosurface associated with the electronic
density around the atoms forming the protein or, secondly, from the isosurface
corresponding to the potential energy associated with their relative spatial
configuration.
Our algorithm for 3D shape description then proceeds as follows. The 3D shape is
represented by a triangular mesh. The tensor of inertia associated with the shape is
calculated, from which a rotation-invariant reference frame is defined on the eigenvectors of the latter. Then, the probability that a given triangle has a certain orientation and a certain distance with respect to the reference frame is calculated. The distributions associated with a given representation of two proteins may be compared either in terms of the Euclidean distance, which is a metric measure of similarity, or in terms of the Jensen-Shannon divergence. The Jensen-Shannon divergence is a probabilistic dissimilarity measure between the two distributions.
Given a 3D shape, our algorithm provides a ranking of the most similar shapes, which
in our case are either associated with the topologies or the envelopes of the proteins.
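As an illustration of the two comparison measures, the following sketch compares two normalized descriptor histograms; how the histograms are binned from triangle orientations and distances is omitted here, and the function names are our own:

import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two normalized histograms.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def euclidean(p, q):
    # Metric (Euclidean) distance between the same two descriptors.
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

Ranking a database then amounts to sorting the proteins by either measure with respect to the reference descriptor (e.g., that of 1r1y).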
In the next section we illustrate our results. We use, as a reference protein, the
member 1r1y of the human haemoglobin; where 1r1y is the unique identifier of the
protein in the PDB. However, our procedure may be followed for the multimodal
analysis of any protein in the Protein Data Bank.
The first eleven entries obtained when comparing the 1r1y amino acid sequence with
the other proteins as contained in the Protein Data Bank are shown in Table 1.
Table 1. Eleven highest Blast scores [6] for the comparison of the amino acid sequence of
1r1y with the sequences of the other proteins in the Protein Data Bank
Our results show that the members of the human haemoglobin family have very similar amino acid sequences. This is to be expected. The sequence is the fruit of evolution
and as such, it is very unlikely that for a given species, two distinct sequences would
have produced the same functionality. Here, the aim is the transport of oxygen to the
tissues. For many years, researchers believed this was “the end of the story”. Even
nowadays, it is still sometimes “a priori” assumed that most, if not all, useful
information may be extracted from the amino acid sequence. As our multimodal
analysis will show, this is clearly not the case.
The first twelve results for the topology similarity search are presented in Figure 2. It
follows that 1lfq appears in the first position. For each protein, because of its
intricate structure, four (4) views of the topology are shown. Recall that what is
indexed is the actual 3D shape associated with the topology; the views are only
utilised for visualization. The reference protein is shown to the left and the outcome
of the search is shown to the right. As expected, the most similar structures all belong
to the human haemoglobin.
One more unexpected result may be inferred by considering the first hundred (100)
topology-based search results. The result at the 93rd position is not part of the human
haemoglobin, but a member of the cow haemoglobin, entry 1hda in the Protein Data
Bank. It shares a very close topology with its human counterpart. These two proteins
have a similar topology since they both share the same function of carrying oxygen
and it follows that such a function strongly constrains the possible shape of the
protein.
A second unexpected behaviour is that the ranking based on the amino acid
sequence is quite distinct from the one obtained with the topology. For example,
1vwt is the most similar protein to 1r1y from the topological point of view with a
distance of 59.3. It has a Blast maximum score of 278 when considering the amino
acid sequence. This means that there is a substantial level of dissimilarity, from the
sequence point of view. On the other hand, 1r1x which is the most similar protein
from the amino acid sequence point of view (see Table 1), has a distance of 510.1
from a topological perspective as opposed to the most similar one which has a
distance of 59.3. Thus, highly similar amino acid sequences do not necessarily
imply a high similarity from the topological point of view, and vice versa.
The results for the envelope may be divided into two groups. The members of the first group have a 3D shape similar to that of 1r1y, as illustrated in Figure 4,
while the members of the second group have a shape similar to the one of 1lfq as
illustrated in Figure 5. Again, the similarity rankings differ from that of the amino
acid sequences as presented in Section 4.1.
A few unexpected results appear. The first is that the rankings of the proteins in terms of their topologies and envelopes are not the same. In other words, some members of the human haemoglobin are more similar in terms of their topology and others in terms of their envelopes. This implies, in general, that the topology is not a sufficient criterion to constrain the envelope. Importantly, the topology is thus not
suitable for finding the modus operandi of the associated protein; recall that the
envelope is related to interaction and of high interest for drug designers.
Perhaps even more surprising is the fact that, for a given rank, the envelopes are much more dissimilar than their topological counterparts, i.e., there is more variability in the envelope than in the topology.
Even more unexpected, non-members of the human haemoglobin appear much
earlier in the envelope results, than with their topological counterpart. This is
especially evident for the 1lfq subgroup. For instance, the result at the 7th position,
1uir, is not even a member of the globins family, but of the spermidine synthase
family. This implies that low similarity amino acid sequences may have a very
similar envelope. Although counterintuitive, this result may be explained by the fact
that proteins are highly non-linear, complex molecular systems which are difficult to
understand only in terms of amino acid sequences or topologies.
5 Conclusions
Complex systems, such as proteins, may be modelled using numerous complementary
representations. It is important to recognize that a true understanding of these systems
may only be obtained through a multimodal analysis of such diverse representations.
Due to the underlying complexity, it might be very difficult to capture all the relevant
knowledge with a sole representation. We have shown that, in the case of proteins, a
deeper understanding may be obtained by analyzing their similarity through three
diverse modalities, i.e. in terms of amino acid sequences, topology and envelopes. It
is also important to realize that it is the synergy, and not the mere combination, of the various representations that matters. As shown in Section 4, complementary representations might lead to distinct results, which nevertheless hold in their
respective domain of validity.
We are interested in further extending our initial system to include the ability for
additional search modalities. To this end we are currently adding a text mining
component, to enable users to enter free text, such as a disease description, as a search
criterion. We also aim to add a component to provide users with additional links to
relevant research articles from e.g. Medline.
Proteins are flexible structures, in the sense that the relative position of their
constitutive atoms may vary over time. Our shape descriptor is mainly oriented
toward rigid shapes, such as those that are routinely obtained from X-ray
crystallography. However, shapes obtained by, for example, magnetic resonance
References
1. Tucker, S., Whittaker, S.: Accessing Multimodal Meeting Data: Systems, Problems and
Possibilities. In: Bengio, S., Bourlard, H. (eds.) MLMI 2004. LNCS, vol. 3361, pp. 1–11.
Springer, Heidelberg (2005)
2. Camurri, A., et al.: Communicating Expressiveness and Affect in Multimodal Interactive
Systems. IEEE Multi Media 12(1), 43–53 (2005)
3. Jha, M.N., Levy, J., Gao, Y.: Advances in Remote Sensing for Oil Spill Disaster
Management: State-of-the-Art Sensors Technology for Oil Spill Surveillance. Sensors 8,
236–255 (2008)
4. Zaki, M.J., Bystroff, C.: Protein Structure Prediction. Humana Press, Totowa (2007)
5. Berman, H.M., et al.: The Protein Data Bank. Nucleic Acids Research, 235–242 (2000)
6. Humphrey, W., et al.: VMD - Visual Molecular Dynamics. J. Molec. Graphics 14, 33–38
(1996)
7. Karlin, S., Altschul, S.F.: Applications and statistics for multiple high-scoring segments in
molecular sequences. Proc. Natl. Acad. Sci. USA 90, 5873–5877 (1993)
8. Paquet, E., Viktor, H.L.: Addressing the Docking Problem: Finding Similar 3-D Protein
Envelopes for Computer-aided Drug Design, Advances in Computational Biology. In:
Advances in Experimental Medicine and Biology. Springer, Heidelberg (2010)
9. Paquet, E., Viktor, H.L.: CAPRI/MR: Exploring Protein Databases from a Structural and
Physicochemical Point of View. In: 34th International Conference on Very Large Data
Bases – VLDB, Auckland, New Zealand, August 24-30, pp. 1504–1507 (2008)
A Generic Approach for Mining Indirect Association
Rules in Data Streams
1 Introduction
Recently, the problem of mining interesting patterns or knowledge from large volumes
of continuous, fast growing datasets over time, so-called data streams, has emerged as
one of the most challenging issues to the data mining research community [1, 3]. Al-
though over the past few years there has been a large volume of literature on mining frequent patterns, such as itemsets, maximal itemsets, closed itemsets, etc., no work, to our knowledge, has endeavored to discover indirect associations, a recently coined new type of infrequent pattern. The term indirect association, first proposed by Tan et al. in 2000 [21], refers to an infrequent item pair, each item of which is highly co-occurring with a frequent itemset called the "mediator". Indirect associations have been
recognized as powerful patterns in revealing interesting information hidden in many
applications, such as recommendation ranking [14], common web navigation path [20],
and substitute items (or competitive items) [23], etc. For example, Coca-cola and Pepsi
are competitive products and could be replaced by each other. So it is very likely that
there is an indirect association rule revealing that consumers who buy a certain kind of cookie tend to also buy either Coca-Cola or Pepsi, but not both (Coca-Cola, Pepsi | cookie).
In this paper, the problem of mining indirect associations from data streams is con-
sidered. Unlike contemporary research work on stream data mining, which investigates the problem separately for different types of streaming models, we treat the problem in a unified way. A generic streaming window model that can encompass contem-
porary streaming window models and is endowed with user flexibility for defining
specific models is proposed. In accordance with this model, we develop a generic algo-
rithm for mining indirect associations over the generic streaming window model,
which guarantees no false positive patterns and a bounded error on the quality of the
discovered associations. We further demonstrate an efficient implementation of the
generic algorithm. Comprehensive experiments on both synthetic and real datasets showed that the proposed algorithm is efficient and effective in finding indirect association rules.
The remainder of this paper is organized as follows. Section 2 introduces contempo-
rary stream window models and related work conducted based on these models. Our
proposed generic window model, system framework and algorithm GIAMS for mining
indirect association rules over streaming data are presented in Section 3. Some proper-
ties of the proposed algorithm are also discussed. The experimental results are pre-
sented in Section 4. Finally, in Section 5, conclusions and future work are described.
2 Related Work
Suppose that we have a data stream S = (t0, t1, … ti, …), where ti denotes the transaction
arriving at time i. Since data stream is a continuous and unlimited incoming data along
with time, a window W is specified, representing the sequence of data arrived from ti to
tj, denoted as W[i, j] = (ti, ti+1, …, tj). In the literature [1], there are three main
types of window models for data stream mining, i.e., landmark window, time-fading
window, and sliding window models.
• Landmark model: The landmark model monitors the entire history of stream data
from a specific time point called landmark to the present time. For example, if
window W1 denotes the stream data from time ti to tj, then windows W2 and W3 will
span stream data from ti to tj+1 and ti to tj+2, respectively.
• Time-fading model: The time-fading model (also called damped model) assigns
more weights to recently arrived transactions so that new transactions have higher
weights than old ones. At every moment, based on a fixed decay rate d, a transac-
tion processed n time steps ago is assigned a weight dn, where 0 < d < 1, and the
occurrence of a pattern within that transaction is decreased accordingly.
• Sliding window model: A sliding window model keeps a window of size ω, moni-
toring the data within a fixed time [18] or a fixed number of transactions [8]. Only
the data kept in the window is used for analysis; when a new transaction arrives,
the oldest resident in the window is considered obsolete and deleted to make room
for the new one.
The first work on mining frequent itemsets over data stream with landmark window
model was proposed by Manku et al. [19]. They presented an algorithm, namely
Definition 1. Given a data stream S = (t0, t1, … ti, …) as defined before, a generic
window model Ψ is represented as a four-tuple specification, Ψ(l, ω, s, d), where l
denotes the timestamp at which the window starts, ω the window size, s the stride the
window moves forward, and d is the decay rate.
The stride notation s is introduced to allow the window moving forward in a batch
of transactions, i.e., a block of size s. That is, if the current window under concern is
(tj−ω+1, tj−ω+2, …, tj), then the next window will be (tj−ω+s+1, tj−ω+s+2, …, tj+s), and the
weight of a transaction within (tj−s+1, tj−s+2, …, tj), say α, is decayed to αd, and the
weight of a transaction within (tj+1, …, tj+s) is 1.
Example 1. Let ω = 4, s = 2, l = t1, and d = 0.9. An illustration of the generic streaming
window model is depicted in Figure 1. The first window W1 = W[1, 4] = (t1, t2, t3, t4)
consists of two blocks, B1 = {AH, AI} and B2 = {AH, AH}, with B1 receiving weight 0.9 and B2 receiving weight 1. Next, the window moves forward with stride 2. That means B1 becomes outdated and a new block B3 is added, resulting in a new window W2 = W[3, 6].
Below we show that this generic window model can be specified into any one of
the contemporary models described in Section 2.
• Landmark model: Ψ(l, ∞, 1, 1). Since ω = ∞, there is no limitation on the window
size and so the corresponding window at timestamp j is (tl, tl+1, …, tj) and is (tl, tl+1,
…, tj, tj+1) at timestamp j+1.
• Time-fading model: Ψ(l, ∞, 1, d). The parameter setting for this model is similar to
landmark except that a decay rate less than 1 is specified.
• Sliding window model: Ψ(l, ω, 1, 1). Since the window size is limited to ω, the
corresponding window at timestamp j is (tj−ω+1, tj−ω+2, …, tj) and is (tj−ω+2, …, tj, tj+1)
at timestamp j+1.
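A compact sketch of the Ψ(l, ω, s, d) specification and the three classical specializations above (the class and attribute names are ours):

from dataclasses import dataclass
import math

@dataclass
class WindowModel:
    l: int              # timestamp at which the window starts
    omega: float        # window size (math.inf means unbounded)
    s: int = 1          # stride: number of transactions per block
    d: float = 1.0      # decay rate (1.0 means no fading)

# The contemporary models as special cases of the generic model:
landmark    = WindowModel(l=0, omega=math.inf, s=1, d=1.0)
time_fading = WindowModel(l=0, omega=math.inf, s=1, d=0.9)   # any 0 < d < 1
sliding     = WindowModel(l=0, omega=100_000,  s=1, d=1.0)   # some fixed size omega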
Fig. 1. An illustration of the generic window model
Fig. 2. A generic framework for indirect association mining (the data stream feeds Process 1, which discovers and maintains PF, while Process 2 accesses PF to generate indirect associations in response to user queries)
itemsets into PF. The second process is activated when the user issues a query about the current indirect associations and is responsible for generating the qualified patterns from the frequent itemsets maintained by the PF-monitoring process. Algorithm 1 presents a sketch of GIAMS.
Algorithm 1. GIAMS
Input: Itempair support threshold σs, association support threshold σf, dependence threshold σd, stride s, decay rate d, window size ω, and support error ε.
Output: Indirect associations IA.
Initialization:
1: Let N be the accumulated number of transactions, N = 0;
2: Let η be the decayed accumulated number of transactions, η = 0;
3: Let cbid be the current block id, cbid = 0, and sbid the starting block id of the window, sbid = 1;
4: repeat
5:   Process 1;
6:   Process 2;
7: until terminate;
Process 1: PF-monitoring
1: Read the newly arrived block;
2: N = N + s; η = η × d + s;
3: cbid = cbid + 1;
4: if (N > ω) then
5:   Block_delete(sbid, PF); // Delete outdated block Bsbid
6:   sbid = sbid + 1;
7:   N = N − s; // Decrease the transaction size in the current window
8:   η = η − s × d^(cbid−sbid+1); // Decrease the decayed transaction size in the current window
9: endif
10: Insert(PF, σf, cbid, η); // Insert potentially frequent itemsets in block cbid into PF
11: Decay&Pruning(d, s, ε, cbid, PF); // Remove infrequent itemsets from PF
Process 2: IA-generation
1: if user query request = true then
2:   IAgeneration(PF, σf, σd, σs, N); // Generate all indirect associations from PF
The most challenging and critical issue for the effectiveness of GIAMS is bounding the error. That is, under the constraints of only one pass over the data and limited memory usage, how can we ensure that the error of the generated patterns is always bounded by a user-specified range? Our approach to this end is, during the PF-monitoring process, to eliminate those itemsets with the least possibility of becoming frequent afterwards. More precisely, after processing the insertion of the newly arriving block, we decay the accumulated count of each maintained itemset, and then prune any itemset X whose count is below the threshold indicated as follows:
X.count < \varepsilon \times s \times (d + d^2 + \cdots + d^{\,cbid-sbid+1}) = \varepsilon \times s \times \frac{d\,(1 - d^{\,cbid-sbid+1})}{1 - d} \qquad (1)
where cbid and sbid denote the identifiers of the current block and of the first block in which X appears in PF, respectively, s is the stride, and ε is a user-specified tolerable error. Note that the term s × (d + d^2 + … + d^(cbid−sbid+1)) equals the decayed number of transactions between the first block in which X appears and the current block within the current
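A sketch of the pruning test implied by (1), with cbid and sbid tracked per itemset as in Process 1 (the surrounding bookkeeping is an assumption on our part):

def should_prune(count, cbid, sbid, s, d, eps):
    # Prune itemset X from PF when its decayed count falls below
    # eps * s * (d + d^2 + ... + d^(cbid - sbid + 1)), cf. equation (1).
    k = cbid - sbid + 1
    if d == 1.0:                       # no fading: the geometric sum reduces to k
        bound = eps * s * k
    else:
        bound = eps * s * d * (1.0 - d ** k) / (1.0 - d)
    return count < bound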
4 Experimental Results
A series of experiments were conducted to evaluate the efficiency and effectiveness of
the GIAMS algorithm. Our purpose is to inspect (1) how GIAMS, as a generic algorithm, performs under various streaming models, especially the three classical models, and (2) how each parameter involved in specifying the window model influences GIAMS. Each evaluation was inspected from three aspects: execution time, memory usage, and pattern accuracy.
All experiments were done on an AMD X3-425(2.7 GHz) PC with 3GB of main
memory, running the Windows XP operating system. All programs were coded in Vis-
ual C++ 2008. A synthetic dataset, T5.I5.N0.1K.D1000K, generated by the program in
[2] as well as a real web-news-click stream [4] extracted from msn.com for the entire
day of September 28, 1999 were tested. Since similar phenomena on the synthetic data
were observed, we only show the results on the real dataset.
Effect of Mediator Support Threshold: We first examine the effect of varying the mediator support threshold, ranging from 0.01 to 0.018. It can be seen from Figure 3(a) that, in the case of the landmark model, the smaller σf is, the less execution time is spent, since a smaller σf results in a smaller number of itemsets. The overall trend is linearly proportional to the transaction size. The memory usage exhibits a similar phenomenon. The situation is different in the case of the time-fading model. From Figure 3(d)
we can observe both the execution time and memory usage are not affected by the me-
diator support threshold. This is because most of the time for time-fading model, com-
pared with the other two models, is spent on the insertion of itemsets and the benefit of
pruning fades away. The result for sliding model is omitted because it resembles that
for landmark model.
Effect of Window Stride: The stride value ranges from 10000 to 80000. Two noticeable phenomena are observed in the case of the landmark model. First, the execution time decreases as the stride increases because larger strides encourage analogous transactions; more transactions can be merged together. Second, a larger stride is also helpful in saving memory; as indicated in (1), the pruning threshold becomes stricter, so more itemsets will be pruned. The effect of larger strides is, however, the opposite in the case of the sliding model. The execution time increases in proportion to the stride because larger strides imply larger transaction blocks to be inserted and deleted in maintaining PF. The memory usage also increases, but not significantly.
Effect of Decay Rate: Note that only the time-fading model depends on this factor. As exhibited in Figure 3(c), a smaller decay rate contributes to a longer execution time. This is because a smaller decay rate makes the support count decay more quickly, so the itemset lifetime becomes shorter, leading to more itemset insertion and deletion operations. This is also why the memory usage, though not significantly, is smaller than that for larger decay rates.
Effect of Window Size: Not surprisingly, the memory usage increases as the window size increases, as shown in Figure 3(f), while the performance gap is not significant. This is because the stride (block size) is the same, and most of the time of the PF-monitoring process is spent on block insertion and deletion.
Performance of Process IA-generation: We compare the performance of the two proposed methods for implementing the IA-generation process. We only show the results for the landmark model because the other models exhibit similar behavior. GIAMS-IND denotes the approach modified from the INDIRECT algorithm, while GIAMS-MED represents the more efficient method utilizing qualified mediator
[Figure 3 residue omitted: plots of execution time (sec) and memory usage (MB) versus transaction size and mediator support threshold. Recoverable panel titles: (c) Time-fading: varying decay rates; (d) Time-fading: varying mediator supports; (g) IA-generation: execution time and memory; (h) IA-generation: number of candidate rules.]
Fig. 3. Experimental results on evaluating GIAMS over a real web-news-click stream
Accuracy: First, we check the difference between the true support and the estimated support, measured by ASE (Average Support Error) = Σx∈F (Tsup(x) − Esup(x))/|F|, where F denotes the set of all frequent itemsets w.r.t. σf. The ASEs for all test cases with strides varying between 10000 and 80000 and σf ranging from 0.01 to 0.018 were recorded. All of them are zero except for the time-fading model with d = 0.9, where all values are less than 3×10⁻⁷. We also measured the accuracy of the discovered indirect association rules by inspecting how many of the generated rules are correct patterns, i.e., the recall. All test cases exhibit 100% recall.
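For reference, the two accuracy measures can be computed as in this minimal sketch; the dictionaries mapping itemsets to supports and the sets of rules are an assumed layout:

def average_support_error(true_sup, est_sup):
    # ASE = sum over frequent itemsets x of (Tsup(x) - Esup(x)) / |F|.
    freq = list(true_sup)
    return sum(true_sup[x] - est_sup.get(x, 0.0) for x in freq) / len(freq)

def rule_recall(generated_rules, true_rules):
    # Fraction of the true indirect association rules recovered by the algorithm.
    return len(set(generated_rules) & set(true_rules)) / len(set(true_rules))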
5 Conclusions
In this paper, we have investigated the problem of indirect association mining from a
generic viewpoint. We have proposed a generic stream window model that can encom-
pass all classical streaming models and a generic mining algorithm that guarantees no
false positive rules and a bounded support error. An efficient implementation of GIAMS was also presented. Comprehensive experiments on both synthetic and real datasets have shown that the proposed generic algorithm is efficient and effective in finding indirect association rules.
Recently, the design of data stream mining methods that can perform adaptively under constrained resources has emerged as an important and challenging research issue for the stream mining community [9, 22]. In the future, we will study how to apply or incorporate adaptive techniques such as load shedding [1] into our approach.
Acknowledgements
This work is partially supported by National Science Council of Taiwan under grant
No. NSC97-2221-E-390-016-MY2.
References
1. Aggarwal, C.: Data Streams: Models and Algorithms. Springer, Heidelberg (2007)
2. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: 20th Int. Conf.
Very Large Data Bases, pp. 487–499 (1994)
3. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and Issues in Data
Stream Systems. In: 21st ACM Symp. Principles of Database Systems, pp. 1–16 (2002)
4. Cadez, I., Heckerman, D., Meek, C., Smyth, P., White, S.: Visualization of Navigation Pat-
terns on a Web Site Using Model-based Clustering. In: 6th ACM Int. Conf. Knowledge
Discovery and Data Mining, pp. 280–284 (2000)
5. Chang, J.H., Lee, W.S.: Finding Recent Frequent Itemsets Adaptively over Online Data
Streams. In: 9th ACM Int. Conf. Knowledge Discovery and Data Mining, pp. 487–492
(2003)
6. Chang, J.H., Lee, W.S.: estWin: Adaptively Monitoring the Recent Change of Frequent
Itemsets over Online Data Streams. In: 12th ACM Int. Conf. Information and Knowledge
Management, pp. 536–539 (2003)
104 W.-Y. Lin, Y.-E. Wei, and C.-H. Chen
7. Chen, L., Bhowmick, S.S., Li, J.: Mining Temporal Indirect Associations. In: 10th Pacific-
Asia Conf. Knowledge Discovery and Data Mining, pp. 425–434 (2006)
8. Chi, Y., Wang, H., Yu, P.S., Muntz, R.R.: Moment: Maintaining Closed Frequent Itemsets
over a Stream Sliding Window. In: 4th IEEE Int. Conf. Data Mining, pp. 59–66 (2004)
9. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Towards an Adaptive Approach for Min-
ing Data Streams in Resource Constrained Environments. In: 6th Int. Conf. Data Ware-
housing and Knowledge Discovery, pp. 189–198 (2004)
10. Hidber, C.: Online Association Rule Mining. ACM SIGMOD Record 28(2), 145–156
(1999)
11. Jiang, N., Gruenwald, L.: CFI-stream: Mining Closed Frequent Itemsets in Data Streams.
In: Proc. 12th ACM Int. Conf. Knowledge Discovery and Data Mining, pp. 592–597
(2006)
12. Jin, R., Agrawal, G.: An Algorithm for In-core Frequent Itemset Mining on Streaming
Data. In: 5th IEEE Int. Conf. Data Mining, pp. 210–217 (2005)
13. Kazienko, P.: IDRAM—Mining of Indirect Association Rules. In: Int. Conf. Intelligent In-
formation Processing and Web Mining, pp. 77–86 (2005)
14. Kazienko, P., Kuzminska, K.: The Influence of Indirect Association Rules on Recommen-
dation Ranking Lists. In: 5th Int. Conf. Intelligent Systems Design and Applications, pp.
482–487 (2005)
15. Koh, J.L., Shin, S.N.: An Approximate Approach for Mining Recently Frequent Itemsets
from Data Streams. In: 8th Int. Conf. Data Warehousing and Knowledge Discovery, pp.
352–362 (2006)
16. Lee, D., Lee, W.: Finding Maximal Frequent Itemsets over Online Data Streams Adap-
tively. In: 5th IEEE Int. Conf. Data Mining, pp. 266–273 (2005)
17. Li, H.F., Lee, S.Y., Shan, M.K.: An Efficient Algorithm for Mining Frequent Itemsets over
the Entire History of Data Streams. In: 1st Int. Workshop Knowledge Discovery in Data
Streams, pp. 20–24 (2004)
18. Lin, C.H., Chiu, D.Y., Wu, Y.H., Chen, A.L.P.: Mining Frequent Itemsets from Data
Streams with a Time-Sensitive Sliding Window. In: 5th SIAM Data Mining Conf., pp. 68–
79 (2005)
19. Manku, G.S., Motwani, R.: Approximate Frequency Counts over Data Streams. In: 28th
Int. Conf. Very Large Data Bases, pp. 346–357 (2002)
20. Tan, P.N., Kumar, V.: Mining Indirect Associations in Web Data. In: 3rd Int. Workshop
Mining Web Log Data Across All Customers Touch Points, pp. 145–166 (2001)
21. Tan, P.N., Kumar, V., Srivastava, J.: Indirect Association: Mining Higher Order Depend-
encies in Data. In: 4th European Conf. Principles of Data Mining and Knowledge Discov-
ery, pp. 632–637 (2000)
22. Teng, W.G., Chen, M.S., Yu, P.S.: Resource-Aware Mining with Variable Granularities in
Data Streams. In: 4th SIAM Conf. Data Mining, pp. 527–531 (2004)
23. Teng, W.G., Hsieh, M.J., Chen, M.S.: On the Mining of Substitution Rules for Statistically
Dependent Items. In: 2nd IEEE Int. Conf. Data Mining, pp. 442–449 (2002)
24. Wan, Q., An, A.: An Efficient Approach to Mining Indirect Associations. Journal of Intelligent Information Systems 27(2), 135–158 (2006)
25. Yu, J.X., Chong, Z., Lu, H., Zhou, A.: False Positive or False Negative: Mining Frequent
Itemsets from High Speed Transactional Data Streams. In: 30th Int. Conf. Very Large Data
Bases, pp. 204–215 (2004)
Status Quo Bias in Configuration Systems
1 Introduction
phenomenon well known as mass confusion [9]. One way to help users identify meaningful alternatives that are compatible with their current preferences is to provide defaults. Defaults in the context of interactive configuration
dialogs are preselected options used to express personalized feature recommen-
dations. Felfernig et al. [7] conducted a study to investigate the impact of person-
alized feature recommendations in a knowledge-based recommendation process.
Nearest neighbors and Naive Bayes voter algorithms have been applied for the
calculation of default values. The results of this research indicate that support-
ing users with personalized defaults can lead to a higher satisfaction with the
configuration process. In this paper we want to discuss further impacts of pre-
senting such default values to users of configurator applications. We present the
results of a case study conducted to figure out whether default values can have
an impact on a user’s selection behavior in product configuration sessions. The
motivation for this empirical analysis is the existence of so-called status quo
biases in human decision making [14].
The remainder of the paper is organized as follows. In the next section we
discuss the concept of the status quo bias in human decision making. In Section 3
we introduce major functionalities of RecoMobile, an environment that supports
the configuration of mobile phones and corresponding subscription features. In
Section 4 we present the design of our user study. Section 5 then discusses the results of the study, with the goal of pointing out to what extent a status quo effect exists in product configuration systems. Finally, we discuss
related work and conclude the paper.
People have a strong tendency to accept preset values (representing the status
quo) compared to other alternatives [11,13,14]. Samuelson and Zeckhauser [14] explored this effect, known as the status quo bias, in a series of laboratory experiments. Their results implied that an alternative was chosen significantly more often when it was designated as the status quo. They also showed that the status quo effect increases with the number of alternatives. Kahneman, Knetsch and
Thaler [11] argue that the status quo bias can be explained by a notion of loss
aversion. They explain that the status quo serves as a neutral reference point
and users evaluate alternative options in terms of gains and losses relative to the
reference point. Since individuals tend to regard losses as more important than
gains in decision making under risk (i.e., alternatives with uncertain outcomes)
[12] the possible disadvantages when changing the status quo appear larger than
advantages.
A major risk of preset values is that they could be exploited for misleading
users and making them choose options that are not really needed to fulfill their
requirements. Bostrom and Ord defined the status quo bias as “a cognitive error,
where one option is incorrectly judged to be better than another because it
represents the status quo” [10]. Ritov and Baron [13] suggest counteracting the
status quo bias by presenting the options in such a way that keeping as well as
changing the status quo needs user input. They argue that “when both keeping
and changing the status quo require action, people will be less inclined to err by
favoring the status quo when it is worse” [13].
In this paper we want to focus on answering the question whether a status
quo bias exists in the context of product configuration systems and whether it
is possible to reduce this biasing effect by providing an interface supporting the
interaction mechanisms introduced by Ritov and Baron [13].
4 Study Design
Our experiment addressed two relevant questions. (1) Are users of product con-
figuration systems influenced by default settings even if these settings are un-
common? (2) Is it possible to counteract the status quo bias by providing a
configuration interface where both keeping and changing the presented default
settings needs user interaction? To test the influence of uncommon defaults on
the selection behavior of users we differentiate between three basic versions of
RecoMobile (see Table 1). Users of RecoMobile Version A were not confronted
with defaults, i.e., they had to specify each feature preference independent of
any default proposals. Out of the resulting interaction log we selected for each
feature (presented as questions within a configuration session) the alternative
which was chosen least often and used it as default for Versions B and C. These
two versions differ in the extent to which user interaction is required. In Version B user interaction is only required when the customer wants to change the recommended default setting (low user involvement). In Version C both accepting and changing the default settings require user interaction (high user involvement - see Figure 2). We conducted an online survey at the
Fig. 1. RecoMobile user interface – in the case of default proposal acceptance no further user interaction is needed (low involvement)
Fig. 2. Alternative version of the RecoMobile user interface – in the case of default
proposal acceptance users have to explicitly confirm their selection (high involvement)
5 Results
In our evaluation we compared the data of the configurator version without de-
fault settings (Version A - see Table 1) with the data collected in Versions B and
C. For each feature we conducted a chi-square test (the standard test procedure
when dealing with data sets that express frequencies) to compare the selection
behavior of the users. For many of the features we could observe significant dif-
ferences in the selection distribution. A comparison of the selection behavior in
the different configurator versions is given in Table 2.
For example, the evaluation results regarding the feature Which charged ser-
vices should be prohibited for SMS? are depicted in Figure 3. For this feature the
default in Versions B and C was set to alternative 3 - Utility and Entertainment
- (that option which was chosen least often in Version A). In both versions the
default setting obviously had a strong impact on the selection behavior of the
users. Only 2 % of the users of Version A selected option 3, whereas in Version B 24 % chose this default option. The interesting result is that the version with high user involvement (Version C) did not counteract the status quo bias: 25.6 % of the users of Version C selected the default alternative. Contrary to the
assumption of Ritov and Baron [13] people tend to stick to the status quo (the
default option) even when user interaction is required to accept it.
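The per-feature comparison can be reproduced with a standard chi-square test of independence; the sketch below uses made-up selection counts for three alternatives in Versions A and B, not the actual study data.

from scipy.stats import chi2_contingency

# Hypothetical selection counts for one feature (alternatives 1-3)
counts = [
    [60, 38, 2],   # Version A (no defaults)
    [45, 31, 24],  # Version B (uncommon default = alternative 3)
]
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")  # p < 0.05 -> selection distributions differ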
Fig. 3. Selections for prohibit charged services for SMS - the results of the conducted chi-square test show that the underlying distributions differ significantly (p=0.009 for Version A compared with Version B, p=0.020 for Version A compared with Version C)
In Figure 4 another example is shown for the feature Which data package do
you want?. The default in Version B and C was set to option 5 (2048 kbit/s
(+ 29.90 euro)), which was the most expensive alternative of this feature. In
Version A, 4 % of the users decided to choose this option; the mean value of the expenses for this attribute is 5.5 Euro (see Table 3). In Version B 16 % and in Version C 18.6 % of the users retained the status quo alternative. The mean value of the data package expenses in Version B is 12.8 Euro and in Version C 13.2 Euro. This example shows that exploiting the status quo effect can lead to the selection of more expensive alternatives. Here again the status quo effect was
not suppressed in Version C, where people had to confirm the default setting.
6 Related Work
Research in the field of human decision making has revealed that people have
a strong tendency to keep the status quo when choosing among alternatives
(see e.g. [10,11,13,14]). This decision bias was first reported by Samuelson and Zeckhauser [14]. To our knowledge, such decision biases have not been
analyzed in detail in the context of interactive configuration scenarios. The goal
Fig. 4. Selections for Monthly data package - the results of the conducted chi-square
test show that the underlying distributions differ significantly (p=0.001 for Version A
compared with Version B, p=0.004 for Version A compared with Version C)
Table 3. Mean data package expenses (Euro) per configurator version
Version A: 5.528
Version B: 12.844
Version C: 13.281
of our work was to investigate whether the status quo effect also exists in product
configuration systems. Felfernig et al. [7] introduced an approach to integrate
recommendation technologies with knowledge-based configuration. The results
of this research indicate that supporting users with personalized feature recom-
mendations (defaults) can lead to a higher satisfaction with the configuration
process. The work presented in this paper is a logical continuation of the work of [7], extending the impact analysis of personalization concepts to the psychological phenomenon of decision biases.
Although product configuration systems support interactive decision processes with the goal of determining configurations that are useful for the customer, the integration of aspects of human decision psychology has been largely ignored, with only a few exceptions. Human choice processes within a product configuration task
have been investigated by e.g. Kurniawan, So, and Tseng [16]. They conducted a
study to compare product configuration tasks (choice of product attributes) with
product selection tasks (choice of product alternatives). Their results suggest
that configuring products instead of selecting products can increase customer
satisfaction with the shopping process. The research of [17] and [9] was aimed
at investigating the influences on consumer satisfaction in a configuration envi-
ronment. The results of the research of Kamali and Loker [17] showed a higher
consumer satisfaction with the website’s navigation as involvement increased.
Huffman and Kahn [9] explored the relationship between the number of choices
during product configuration and user satisfaction with the configuration pro-
cess. From their results they concluded that customers might be overwhelmed
when being confronted with too many choices.
In the psychological literature there exist a couple of theories that explain
the existence of different types of decision biases. In the context of our empirical
study we could observe a status quo bias triggered by feature value recommen-
dations, even if uncommon values are used as defaults. Another phenomenon
that influences the selection behavior of consumers is known as the Decoy effect. According to this theory, consumers show a preference change between two options when a third, asymmetrically dominated option is added to the consideration set. Decoy effects have been intensively investigated in different application contexts, see, for example, [18,19,20,21,22,23]. The Framing effect describes the fact that presenting one and the same decision alternative in different variants can lead to choice reversals. Tversky and Kahneman have shown this effect in a series of studies where they confronted participants with choice problems using variations in the framing of decision outcomes, and reported that seemingly inconsequential variations in the framing of outcomes can cause significant shifts of preference.
7 Conclusions
In this paper we have presented the results of an empirical study whose goal was to analyze the impact of the status quo bias in product configuration scenarios where defaults are presented as recommendations to users. The results of our study show that there exists a strong biasing effect even if uncommon values are presented as defaults. Our findings show that, for example, status quo effects can make users of a configuration system select more expensive solution alternatives. As a consequence of these results, we have to pay increased attention to ethical aspects when implementing product configurators, since users may be misled simply by the fact that some defaults represent expensive solution alternatives that are perhaps not needed to fulfill the given requirements. Finally, we found that requiring user input for both keeping and changing the provided defaults (we called this the high-involvement user interface) does not counteract the status quo bias. Our future work
will include the investigation of additional decision phenomena in the context of
knowledge-based configuration scenarios (e.g., framing or decoy effects).
References
1. Barker, V., O’Connor, D., Bachant, J., Soloway, E.: Expert systems for configura-
tion at Digital: XCON and beyond. Communications of the ACM 32(3), 298–318
(1989)
2. Fleischanderl, G., Friedrich, G., Haselboeck, A., Schreiner, H., Stumptner, M.: Con-
figuring Large Systems Using Generative Constraint Satisfaction. IEEE Intelligent
Systems 13(4), 59–68 (1998)
3. Mittal, S., Frayman, F.: Towards a Generic Model of Configuration Tasks. In: 11th
International Joint Conference on Artificial Intelligence, Detroit, MI, pp. 1395–1401
(1990)
4. Sabin, D., Weigel, R.: Product Configuration Frameworks - A Survey. IEEE Intel-
ligent Systems 13(4), 42–49 (1998)
5. Stumptner, M.: An overview of knowledge-based configuration. AI Communica-
tions (AICOM) 10(2), 111–126 (1997)
6. Cöster, R., Gustavsson, A., Olsson, T., Rudström, A.: Enhancing web-based config-
uration with recommendations and cluster-based help. In: De Bra, P., Brusilovsky,
P., Conejo, R. (eds.) AH 2002. LNCS, vol. 2347, Springer, Heidelberg (2002)
7. Felfernig, A., Mandl, M., Tiihonen, J., Schubert, M., Leitner, G.: Personalized User
Interfaces for Product Configuration. In: International Conference on Intelligent
User Interfaces (IUI 2010), pp. 317–320 (2010)
8. Mandl, M., Felfernig, A., Teppan, E., Schubert, M.: Consumer Decision Making
in Knowledge-based Recommendation. Journal of Intelligent Information Systems
(2010) (to appear)
9. Huffman, C., Kahn, B.: Variety for Sale: Mass Customization or Mass Confusion.
Journal of Retailing 74, 491–513 (1998)
10. Bostrom, N., Ord, T.: The Reversal Test: Eliminating Status Quo Bias in Applied
Ethics. Ethics (University of Chicago Press) 116(4), 656–679 (2006)
11. Kahneman, D., Knetsch, J., Thaler, R.: Anomalies: The Endowment Effect, Loss
Aversion, and Status Quo Bias. The Journal of Economic Perspectives 5(1), 193–
206 (1991)
12. Kahneman, D., Tversky, A.: Prospect theory: An analysis of decision under risk.
Econometrica 47(2), 263–291 (1979)
13. Ritov, I., Baron, J.: Status-quo and omission biases. Journal of Risk and Uncer-
tainty 5, 49–61 (1992)
14. Samuelson, W., Zeckhauser, R.: Status quo bias in decision making. Journal of
Risk and Uncertainty 1(1), 7–59 (1988)
15. Mandl, M., Felfernig, A., Teppan, E., Schubert, M.: Consumer Decision Making
in Knowledge-based Recommendation. Journal of Intelligent Information Systems
(to appear)
16. Kurniawan, S.H., So, R., Tseng, M.: Consumer Decision Quality in Mass Cus-
tomization. International Journal of Mass Customisation 1(2-3), 176–194 (2006)
17. Kamali, N., Loker, S.: Mass customization: On-line consumer involvement in prod-
uct design. Journal of Computer-Mediated Communication 7(4) (2002)
18. Huber, J., Payne, W., Puto, C.: Adding Asymmetrically Dominated Alternatives:
Violations of Regularity and the Similarity Hypothesis. Journal of Consumer Re-
search 9(1), 90–98 (1982)
19. Simonson, I., Tversky, A.: Choice in context: Tradeoff contrast and extremeness
aversion. Journal of Marketing Research 29(3), 281–295 (1992)
20. Yoon, S., Simonson, I.: Choice set configuration as a determinant of preference
attribution and strength. Journal of Consumer Research 35(2), 324–336 (2008)
21. Teppan, E.C., Felfernig, A.: Calculating Decoy Items in Utility-Based Recommen-
dation. In: Chien, B.-C., Hong, T.-P., Chen, S.-M., Ali, M. (eds.) IEA/AIE 2009.
LNCS, vol. 5579, pp. 183–192. Springer, Heidelberg (2009)
22. Teppan, E., Felfernig, A.: Impacts of Decoy Elements on Result Set Evaluation
in Knowledge-Based Recommendation. International Journal of Advanced Intelli-
gence Paradigms 1(3), 358–373 (2009)
23. Felfernig, A., Gula, B., Leitner, G., Maier, M., Melcher, R., Schippel, S., Teppan,
E.: A Dominance Model for the Calculation of Decoy Products in Recommendation
Environments. In: AISB 2008 Symposium on Persuasive Technology, pp. 43–50
(2008)
24. Tversky, A., Kahneman, D.: The Framing of Decisions and the Psychology of
Choice. Science, New Series 211, 453–458 (1981)
Improvement and Estimation of Prediction Accuracy
of Soft Sensor Models Based on Time Difference
Abstract. Soft sensors are widely used to estimate process variables that are
difficult to measure online. However, their predictive accuracy gradually
decreases with changes in the state of the plants. We have been constructing
soft sensor models based on the time difference of an objective variable y and
that of explanatory variables (time difference models) for reducing the effects
of deterioration with age such as the drift and gradual changes in the state of
plants without reconstruction of the models. In this paper, we have attempted to
improve and estimate the prediction accuracy of time difference models, and
proposed to handle multiple y values predicted from multiple intervals of time
difference. An exponentially-weighted average is the final predicted value and
the standard deviation is the index of its prediction accuracy. This method was
applied to real industrial data and its usefulness was confirmed.
1 Introduction
Soft sensors are inferential models to estimate process variables that are difficult to
measure online and have been widely used in industrial plants [1,2]. These models are constructed between those variables that are easy to measure online and those that are not, and an objective variable is then estimated using those models. Through the use of soft sensors, the values of objective variables can be estimated with a high degree of accuracy. Their use, however, involves some practical difficulties. One crucial difficulty is that their predictive accuracy gradually decreases due to changes in the state of chemical plants, loss of catalyst performance, sensor and process drift, and the like. In order to reduce the degradation of a soft sensor model, the updating of regression models [3] and Just-In-Time (JIT) modeling [4] have been proposed. Regression models are reconstructed with the newest database, in which newly observed data are stored online. While many excellent results have been reported based on the use of these methods, there remain some problems for the introduction of soft sensors into
practice.
First of all, if soft sensor models are reconstructed with the inclusion of any
abnormal data, their predictive ability can deteriorate [5]. Though such abnormal data
must be detected with high accuracy in real time, under present circumstances it is
difficult to accurately detect all of them. Second, reconstructed models have a high
tendency to specialize in predictions over a narrow data range [6]. Subsequently,
when rapid variations in the process variables occur, these models cannot predict the
resulting variations in data with a high degree of accuracy. Third, if a soft sensor
model is reconstructed, the parameters of the model, for example, the regression
coefficients in linear regression modeling, are dramatically changed in some cases.
Without the operators’ understanding of a soft sensor model, the model cannot be
practically applied. Whenever soft sensor models are reconstructed, operators check
the parameters of the models so they will be safe for operation. This takes a lot of
time and effort because it is not rare that tens of soft sensors are used in a plant [7].
Fourth, the data used to reconstruct soft sensor models are also affected by the drift.
In the construction of the model, data must be selected from a database which
includes both data affected by the drift and data after correction of the drift.
In order to solve these problems, it was proposed to construct soft sensor models
based on the time difference of explanatory variables, X, and that of an objective
variable, y, for reducing the effects of deterioration with age such as the drift and
gradual changes in the state of plants without reconstruction of the models [8,9]. In
other words, models which are not affected by these changes must be constructed
using not the values of process variables, but the time difference in soft sensor
modeling. A model whose construction is based on the time difference of X and that
of y is referred to as a ‘time difference model’. Time difference models can also have
high predictive accuracy even after drift correction because the data is represented as
the time difference that cannot be affected by the drift. We confirmed through the
analysis of actual industrial data that the time difference model displayed high
predictive accuracy for a period of three years, even when the model was never
reconstructed [8,9]. However, its predictive accuracy was lower than that of the
updating model.
In addition, we previously proposed to estimate the relationships between applicability
domains (ADs) and the accuracy of prediction of soft sensor models quantitatively
[6]. The larger the distances to models (DMs) are, the lower the estimated accuracy of
prediction would be. We used the distances to the average of training data and to the
nearest neighbor of the training data as DMs and obtained the relationships between the DMs and prediction accuracy quantitatively. False alarms could then be prevented by estimating large prediction errors when the state was different from that of the training data; further, actual y-analyzer faults could be detected with high accuracy [6].
Therefore, in this paper, we attempt to improve and estimate the prediction accuracy of time difference models, focusing on the interval of the time difference. In prediction with a time difference model, when the interval is small, a rapid variation in the process variables can be accounted for; however, after the variation, the predictive accuracy can be low, because the state of the plant before the variation and the state after the variation are different, and a time difference model with a small interval cannot account for this difference. On the other hand, when the interval is large, the difference between the states before and after a variation in the process variables can be accounted for, but a rapid variation cannot.
2 Method
We explain the proposed ensemble prediction method handling multiple y values
predicted by using multiple intervals of time difference of X. Before that, we briefly
introduce the time difference modeling method and DMs.
because y(t’-i) is given previously. This method can be easily expanded to a case in which the interval i is not constant.
By constructing time difference models, the effects of deterioration with age, such as drift and gradual changes in the state of plants, can be accounted for, because the data are represented as time differences that are not affected by these factors.
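A minimal sketch of this idea is given below, assuming a fixed interval i and using scikit-learn's LinearRegression as a stand-in for the PLS model employed in the paper; the variable names and the toy data are illustrative only.

import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for PLS

def fit_time_difference_model(X, y, interval):
    # Fit f on time differences: delta_y(t) = f(delta_X(t)) for a fixed interval i
    dX = X[interval:] - X[:-interval]
    dy = y[interval:] - y[:-interval]
    return LinearRegression().fit(dX, dy)

def predict_with_time_difference(model, x_now, x_past, y_past):
    # y_pred(t) = y(t - i) + f(x(t) - x(t - i)); y(t - i) is already measured
    return y_past + model.predict((x_now - x_past).reshape(1, -1))[0]

# Toy drifting data: 200 samples, 3 explanatory variables
rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(200, 3)), axis=0)
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.01, size=200)
model = fit_time_difference_model(X, y, interval=3)
print(predict_with_time_difference(model, X[-1], X[-4], y[-4]))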
Previously, we proposed a method to estimate the relationships between DMs and the
accuracy of prediction of soft sensor models quantitatively [6]. For example, the
Euclidean distance to the average of training data (ED) is used as a DM. The ED of
explanatory variables of data, x, is defined as follows:
ED = \sqrt{(x - \mu)^T (x - \mu)}   (5)
where μ is a vector of the average of training data. When there is correlation among
the variables, the Mahalanobis distance is often used as the distance. The MD of x is
defined as follows:
MD = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}   (6)
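Eqs. (5) and (6) translate directly into code; the sketch below is illustrative and assumes the training data are given as a NumPy matrix with one row per sample.

import numpy as np

def euclidean_dm(x, X_train):
    # ED, Eq. (5): Euclidean distance from x to the mean of the training data
    mu = X_train.mean(axis=0)
    return float(np.sqrt((x - mu) @ (x - mu)))

def mahalanobis_dm(x, X_train):
    # MD, Eq. (6): accounts for correlation among the explanatory variables
    mu = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))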
Fig. 1 shows the basic concept of the proposed method. In the prediction of the y-value at time k, the time difference of y from times k-j, k-2j, ..., k-nj is predicted by using the time difference model, f, obtained in Eq. (2). The standard deviation of the resulting predictions is used as the index of prediction accuracy:

SD = \sqrt{\frac{1}{n-1} \sum_{r=1}^{n} \left( y_{pred}(k, k-rj) - \mu' \right)^2}   (9)
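A sketch of the ensemble step follows; since the exact weighting scheme and the meaning of μ' are not reproduced in this excerpt, the exponential weights α(1−α)^(r−1) and the use of the plain sample mean inside the SD are assumptions made for illustration.

import numpy as np

def ensemble_predict(model, X, y, k, j, n, alpha):
    # Predictions of y(k) from the n reference points k - j, k - 2j, ..., k - nj
    preds = np.array([
        y[k - r * j] + model.predict((X[k] - X[k - r * j]).reshape(1, -1))[0]
        for r in range(1, n + 1)
    ])
    # Final value: exponentially-weighted average (assumed weight form)
    weights = alpha * (1 - alpha) ** np.arange(n)
    weights /= weights.sum()
    y_final = float(np.dot(weights, preds))
    # Accuracy index, Eq. (9): standard deviation of the n predictions
    sd = float(np.std(preds, ddof=1))
    return y_final, sd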
each explanatory variable that was delayed for durations ranging from 0 to 60 minutes in steps of 10 minutes. The three methods listed below were applied.
A: Update a model constructed with the values of X and those of y.
B: Do not update a model constructed with the time difference of X and that of y.
C: Do not update a model constructed with the time difference of X and that of y; the final predicted value is an exponentially-weighted average of multiple predicted values (proposed method).
For method B, the time difference was calculated between the present values and those 30 minutes before the present time. For method C, the intervals of the time difference ranged from 30 minutes to 1440 minutes in steps of 30 minutes. The α
value optimized by using the training data was 0.56. The partial least squares (PLS)
method was used to construct each regression model because the support vector
regression (SVR) model had almost the same predictive accuracy as that of PLS for
this distillation column [6]. The results from distance-based JIT models are not
presented here, but they were almost identical to those of the updating models.
Fig. 2 shows the RMSE values per month for each method from April 2003 to
December 2006. RMSE (root mean square error) is defined as follows:
RMSE = \sqrt{\frac{\sum (y_{obs} - y_{pred})^2}{N}}   (10)
where yobs is the measured y value, ypred is the predicted y value, and N is the number
of data.
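Eq. (10) in code, for reference (variable names are illustrative):

import numpy as np

def rmse(y_obs, y_pred):
    # Root mean square error, Eq. (10)
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))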
As shown in Fig. 2(a), the predictive accuracy of A was higher than that of B overall. On the other hand, the RMSE values for C were almost the same as those for A, as shown in Fig. 2(b). The predictive accuracy was improved by using the exponentially-weighted average of multiple predicted values. It is important to note that the time difference model in B and C was constructed using only data from January to March 2003 and was never reconstructed. It is thus possible for a predictive model to be constructed without updating by using the time difference and multiple predicted values. Additionally, when the RMSE values were small, those of C tended to be smaller than those of A. This is of practical value because high prediction accuracy is desired in such states of the plant.
Fig. 3. The relationships between DMs and the absolute prediction errors in 2002
Fig. 4. The relationships between the coverage and the absolute prediction errors
Next, we estimated the prediction accuracy. The five DMs listed below were applied.
A1: The ED of the values of X
A2: The MD of the values of X
Improvement and Estimation of Prediction Accuracy of Soft Sensor Models 123
where Nin,m is the number of data whose DMs are smaller than those of the mth data point. The relationships between the coverage and the absolute prediction errors are shown in Fig. 4. The mth RMSE value is calculated with these Nin,m data. For ADs it is desirable that the smaller the values of the coverage are, the smaller the RMSE values are, and vice versa. This tendency was observed in all figures except (c), and the line of the proposed method was smoother than the others. In addition, comparing the values of the coverage where the RMSE values were 0.15 and 0.2, for example, those of the proposed method were larger than those of the other methods. This means that the proposed model could predict a larger number of data points with higher predictive accuracy.
Fig. 5 shows the relationships between the coverage and the absolute prediction
errors in each year using the proposed method. The tendency was identical in all
years. Therefore, we confirmed the high performance of the proposed index.
4 Conclusion
In this study, to improve and estimate the prediction accuracy of time difference models, we proposed an ensemble prediction method with time difference. The final predicted value is an exponentially-weighted average of the multiple predicted values, and we use their standard deviation as an index of the prediction accuracy of the final predicted value. The proposed method was applied to actual industrial data obtained from the operation of a distillation column. The proposed model achieved high predictive accuracy and could predict a larger number of data points with higher predictive accuracy than traditional models. We therefore confirmed the usefulness of the proposed method without reconstruction of the soft sensor model. If the relationship between the SD and the standard deviation of the prediction errors is modeled as is done in [6], we can estimate not only the value of y but also the prediction accuracy for new data. The proposed methods can be combined with other kinds of regression methods, and thus can be used in various soft sensor applications. It is expected that the problems of maintaining soft sensor models will be reduced by using our methods.
Fig. 5. The relationships between the coverage and the absolute prediction errors in each year
using the proposed method
References
1. Kano, M., Nakagawa, Y.: Data-Based Process Monitoring, Process Control, and Quality
Improvement: Recent Developments and Applications in Steel Industry. Comput. Chem.
Eng. 32, 12–24 (2008)
2. Kadlec, P., Gabrys, B., Strandt, S.: Data-Driven Soft Sensors in the Process Industry.
Comput. Chem. Eng. 33, 795–814 (2009)
3. Qin, S.J.: Recursive PLS Algorithms for Adaptive Data Modelling. Comput. Chem.
Eng. 22, 503–514 (1998)
4. Cheng, C., Chiu, M.S.: A New Data-based Methodology for Nonlinear Process Modeling.
Chem. Eng. Sci. 59, 2801–2810 (2004)
5. Kaneko, H., Arakawa, M., Funatsu, K.: Development of a New Soft Sensor Method Using
Independent Component Analysis and Partial Least Squares. AIChE J. 55, 87–98 (2009)
6. Kaneko, H., Arakawa, M., Funatsu, K.: Applicability Domains and Accuracy of Prediction
of Soft Sensor Models. AIChE J. (2010) (in press)
7. Ookita, K.: Operation and quality control for chemical plants by soft sensors. CICSJ
Bull. 24, 31–33 (2006) (in Japanese)
8. Kaneko, H., Arakawa, M., Funatsu, K.: Approaches to Deterioration of Predictive Accuracy
for Practical Soft Sensors. In: Proceedings of PSE ASIA 2010, P054(USB) (2010)
9. Kaneko, H., Funatsu, K.: Maintenance-Free Soft Sensor Models with Time Difference of
Process Variables. Chemom. Intell. Lab. Syst. (accepted)
Network Defense Strategies for Maximization of Network
Survivability
Frank Yeong-Sung Lin1, Hong-Hsu Yen2, Pei-Yu Chen1,3,4,*, and Ya-Fang Wen1
1 Department of Information Management, National Taiwan University
2 Department of Information Management, Shih Hsin University
3 Information and Communication Security Technology Center
4 Institute for Information Industry
Taipei, Taiwan, R.O.C.
yslin@im.ntu.edu.tw, [email protected], [email protected], [email protected]
Abstract. The Internet has brought about several information security threats to individuals and corporations. It is difficult to keep a network completely safe because cyber attackers can launch attacks through networks without limitations of time and space. As a result, being able to efficiently evaluate network survivability is an important and critical issue. In this paper, an innovative metric called the Degree of Disconnectivity (DOD) is proposed, which is used to evaluate the damage level of the network. A network attack-defense scenario is also considered in this problem, in which the attack and defense actions are composed of many rounds, with each round containing two stages. In the first stage, defenders deploy limited resources on the nodes, forcing attackers to increase attack costs to compromise the nodes. In the second stage, the attacker uses his limited budget to launch attacks, trying to maximize the damage to the network. The Lagrangean Relaxation method is applied to obtain optimal solutions for the problem.
1 Introduction
With the growth of Internet use, the number of computer security breaches has increased exponentially in recent years, especially impacting businesses that are increasingly dependent on being connected to the Internet. The computer networks of these businesses are more vulnerable to access from outside hackers or cyber attackers, who can launch attacks without the constraints of time and space. In addition to the rapid growth of cyber attack threats, another factor that influences overall network security is the network protection provided by the network defender [1]. Since it is impossible to keep a network completely safe, the problem of network security gradually shifts to the issue of survivability [2].
* Corresponding author.
As knowing
how to evaluate the survivability of the Internet is a critical issue, more and more
researchers are focusing on the definitions of the survivability and evaluation of
network survivability.
However, to enhance or reduce network survivability, both the network defender and the cyber attacker usually need to invest a certain amount of resources in the network. The interaction between cyber attackers and network defenders is like information warfare, and how to efficiently allocate scarce resources in the network is a significant issue for both the cyber attacker and the network defender. Hence, the attack-defense situation can be formulated as a min-max or max-min problem. As a result, researchers can solve this kind of attack-defense problem of network security by mathematical programming approaches, such as game theory [3], simulated annealing [4], and the Lagrangean Relaxation method [5]. In [5], the authors propose a novel
metric called Degree of Disconnectivity (DOD) that is used to measure the damage
level, or survivability, of a network. The DOD value is calculated by (1), in which a
larger DOD value represents a greater damage level, which also implies lower
network survivability. An attacker’s objective is to minimize the total attack cost,
whereas a defender’s objective is to maximize the minimized total attack cost. The
DOD is used as a threshold to determine whether a network has been compromised.
Considering defense and attack scenarios in the real world, it is more reasonable that the attacker fully utilizes his budget to cause maximal impact to the network rather than simply minimizing his total attack cost. In this paper, the attack and defense actions are composed of rounds, where each round contains two stages. In the first stage, a defender deploys defense resources on the nodes in the network, whereas in the second stage, an attacker launches attacks to compromise nodes in the network in order to cause maximal impact. The survivability and damage of the network are measured in terms of the DOD value.
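Since Eq. (1) is not reproduced in this excerpt, the sketch below follows the objective of (IP 1) given later: each OD pair is routed over its cheapest path with respect to node processing costs (ε for functional nodes, M for compromised ones), and the total is normalized by |W|·M. The graph, the OD pairs and the constants are toy values for illustration.

import heapq

def min_node_cost_path(adj, costs, src, dst):
    # Dijkstra over node processing costs: path cost = sum of c_i over its nodes
    dist = {src: costs[src]}
    heap = [(costs[src], src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + costs[v]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def degree_of_disconnectivity(adj, compromised, od_pairs, M=1e6, eps=1.0):
    # DOD = (sum over OD pairs of the cheapest path cost) / (|W| * M)
    costs = {i: (M if i in compromised else eps) for i in adj}
    total = sum(min_node_cost_path(adj, costs, s, t) for s, t in od_pairs)
    return total / (len(od_pairs) * M)

# Toy 4-node ring: compromising node 1 forces OD pair (0, 2) onto the long way round
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(degree_of_disconnectivity(adj, compromised={1}, od_pairs=[(0, 2), (0, 3)]))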
In the real world, network attack and defense are continuous processes. For modeling purposes, the problem is defined as an attack-defense problem, with each round containing the two stages of defense and attack. First, the defender tries to minimize the network damage, i.e., the DOD value, by deploying defense resources on the nodes in the network. The allocation is restricted by the defender's defense budget. Next, the attacker launches attacks that compromise nodes, trying to maximize the damage, i.e., the DOD value, of the network. Since the attacker only has a limited budget and thus has to make good use of it, he needs to decide which nodes to attack in order to cause the greatest impact to network operation. The given parameters and decision variables of the problem are shown in Table 1.
Table 1. Given parameters and decision variables

Given parameters
V: Index set of nodes
W: Index set of OD pairs
P_w: Set of all candidate paths of an OD pair w, where w ∈ W
M: A sufficiently large processing cost indicating that a node has been compromised
ε: A sufficiently small processing cost indicating that a node is functional
δ_pi: Indicator function, 1 if node i is on path p, 0 otherwise, where i ∈ V and p ∈ P_w
d_i: Existing defense resources on node i, used for conditions with more than one round, where i ∈ V
a_i(b_i + d_i): Attack cost of node i, which is a function of b_i + d_i, where i ∈ V
q_i: State of node i before this round, 1 if node i is inoperable, 0 otherwise, used for conditions with more than one round, where i ∈ V
A: Attacker's total budget in this round
B: Defender's defense budget in this round

Decision variables
x_p: 1 if path p is chosen, 0 otherwise, where p ∈ P_w
y_i: 1 if node i is compromised by the attacker, 0 otherwise, where i ∈ V
t_wi: 1 if node i is used by OD pair w, 0 otherwise, where i ∈ V and w ∈ W
c_i: Processing cost of node i, which is ε if node i is functional and M if it is compromised by the attacker, where i ∈ V
b_i: Defense budget allocated to node i, where i ∈ V
The problem is then formulated as the following min-max problem:

Objective function:

\min_{b_i} \max_{y_i} \frac{\sum_{w \in W} \sum_{i \in V} t_{wi} c_i}{|W| \times M}   (IP 1)

Subject to:

c_i = (y_i + q_i) M + [1 - (y_i + q_i)] \varepsilon,   ∀ i ∈ V   (IP 1.1)

\sum_{i \in V} t_{wi} c_i \le \sum_{i \in V} \delta_{pi} c_i,   ∀ p ∈ P_w, w ∈ W   (IP 1.2)

\sum_{p \in P_w} x_p \delta_{pi} = t_{wi},   ∀ i ∈ V, w ∈ W   (IP 1.3)

\sum_{i \in V} y_i a_i(b_i + d_i) \le A   (IP 1.4)

\sum_{i \in V} b_i \le B   (IP 1.5)

\sum_{p \in P_w} x_p = 1,   ∀ w ∈ W   (IP 1.8)
3 Solution Approach
\sum_{p \in P_w} x_p \delta_{pi} \le t_{wi},   ∀ i ∈ V, w ∈ W   (IP 1.3')

Subject to:

t_{wi} = 0 or 1,   ∀ i ∈ V, w ∈ W   (Sub 1.3.1)
c_i = ε or M,   ∀ i ∈ V   (Sub 1.3.2)
In (Sub 1.3), both decision variables t_wi and c_i have two options. As a result, the values of t_wi and c_i can be determined by applying an exhaustive search to obtain the minimal value. The time complexity here is O(|W|×|V|).
The Lagrangean Relaxation problem (LR 1) can be solved optimally if all the above subproblems are solved optimally. By the weak duality theorem [7], for any set of multipliers, the solution to the dual problem is a lower bound on the primal problem (IP 1). To acquire the tightest lower bound, the values of the Lagrangean multipliers need to be adjusted to maximize the optimal value of the dual problem.
The dual problem can be solved in many ways; here the subgradient method is
adopted to solve the dual problem.
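A generic subgradient update for the multipliers looks as follows; the step-size rule shown is a common textbook choice, not necessarily the exact schedule used in the paper.

import numpy as np

def subgradient_update(multipliers, subgrad, dual_value, upper_bound, lam=2.0):
    # One step: mu <- max(0, mu + t * g), with t = lam * (UB - dual) / ||g||^2
    norm_sq = float(np.dot(subgrad, subgrad))
    if norm_sq == 0.0:
        return multipliers           # zero subgradient: multipliers already optimal
    step = lam * (upper_bound - dual_value) / norm_sq
    return np.maximum(0.0, multipliers + step * subgrad)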
The result of the inner problem represents the attack strategy under a certain initial
defense budget allocation policy. The objective (IP 1) is to minimize the maximized
damage of the network under intentional attacks. Therefore, the outcome of the inner
problem can be used as the input of the outer problem for developing a better budget
allocation policy. From the current attack strategy, the defender can adjust the budget
allocated to nodes in the network according to certain reallocation policies. After
budget adjustment, the inner problem is solved again to derive an attack strategy
under the new defense budget allocation policy. This procedure is repeated a number
of times until an equilibrium is achieved.
The concept used to adjust the defense budget allocation policy is similar to the
subgradient method, in which the budget allocated to each node is redistributed
according to the current step size. This subgradient-like method is described as follows.
Initially, the status of each node after the attack is checked. If a node is uncompromised, this suggests either that more defense resources (budget) were allotted to this node than needed, or that it is not worthwhile for an attacker to attack this node because it has too large a defense budget. Therefore, we can extract a fraction of the defense resources from the nodes unaffected by attacks and allocate it to the compromised nodes. The amount extracted is related to the step size coefficient, and it is halved if the optimal solution of (IP 1) does not improve within a given iteration limit.
Another factor related to the amount of deducted defense resources is the importance of a node. In general, the greater the number of times a node is used by all OD paths, the higher its importance. When a node that is used a larger number of times by all OD paths is compromised, it contributes more to the growth of the objective function value (i.e., the DOD value) than a node used a smaller number of times. As a result, only a small amount of defense resources should be extracted from nodes with higher usage. In the proposed subgradient-like method, an importance factor is used to measure the importance of a node, calculated as t_i / t_total, where t_i is the average number of times node i is used by all OD paths, and t_total is the summation of t_i (∀ i ∈ V). An uncompromised node with a greater importance factor will have a smaller amount of defense resources extracted.
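The reallocation step can be sketched as follows; how the extracted budget is split among the compromised nodes and the exact scaling by the importance factor are assumptions, since the text only states that less is extracted from more important nodes.

def reallocate_budget(budget, compromised, usage, step_coeff):
    # usage[i] is t_i, the average number of times node i is used by all OD paths
    t_total = sum(usage.values())
    new_budget = dict(budget)
    extracted = 0.0
    for i in budget:
        if i not in compromised:
            importance = usage[i] / t_total          # importance factor t_i / t_total
            amount = step_coeff * (1.0 - importance) * budget[i]
            new_budget[i] -= amount
            extracted += amount
    if compromised:
        share = extracted / len(compromised)         # assumed: split evenly
        for i in compromised:
            new_budget[i] += share
    return new_budget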
4 Computational Experiments
4.1 Experiment Environment
The proposed Lagrangean Relaxation algorithm and the two simple comparison algorithms were implemented in C++, and the programs were executed on a PC with an AMD 3.6 GHz quad-core CPU. Three types of network topology acted as attack targets: the grid network, the random network, and the scale-free network. To determine which budget allocation policy is more effective in different cases, two initial budget allocation policies were designed: uniform and degree-based. The former distributes the defense budget evenly to all nodes in the network, while the latter allocates the budget to each node according to the percentage of each node's degree.
To demonstrate the effectiveness of the Lagrangean Relaxation algorithm and the proposed heuristic, two simple algorithms were developed for comparison purposes; they are introduced in Table 2 and Table 3, respectively.
// Table 2. Simple attack algorithm 1: compromise nodes in ascending order of attack cost
// initialization
total_attack_cost = 0;
sort all nodes by their attack cost in ascending order;
for each node i { // already sorted
  if ( total_attack_cost + attack_cost_i <= TOTAL_ATTACK_BUDGET
       AND (node i is not compromised OR compromised but repaired)) {
    compromise node i;
    total_attack_cost += attack_cost_i;
  }
}
calculate DOD;
return DOD;
// Table 3. Simple attack algorithm 2: compromise nodes in descending order of node degree
// initialization
total_attack_cost = 0;
sort all nodes by their node degree in descending order;
for each node i { // already sorted
  if ( total_attack_cost + attack_cost_i <= TOTAL_ATTACK_BUDGET
       AND (node i is not compromised OR compromised but repaired)) {
    compromise node i;
    total_attack_cost += attack_cost_i;
  }
}
calculate DOD;
return DOD;
Fig. 2 illustrates the survivability (i.e., DOD value) of the network under different
topologies, node numbers and initial budget allocation policies. Networks with a
degree-based initial budget allocation strategy are more robust compared to those with
a uniform strategy, and the damage of the network was less under the same attack and
defense budgets. This finding suggests that the defense resources should be allotted
according to the importance of each node in the network. The survivability of grid networks is lower than that of the others under the DOD metric, since grid networks are more regular and more connected; as a result, some nodes are used more often by OD pairs, and if these nodes are compromised by the attacker, the DOD value increases considerably. Random networks have higher survivability compared with scale-free networks because nodes in random networks are arbitrarily connected to each other rather than preferentially attached to nodes with a higher degree, and therefore suffer lower damage when encountering intentional attacks.
The survivability of networks with different topologies, node numbers, and budget reallocation strategies is demonstrated in Fig. 2. From the bar chart, the subgradient-like heuristic with both budget reallocation strategies performs well in all conditions (approximately 20% improvement on average), and the degree-based budget reallocation strategy is better than the uniform reallocation. However, the difference between the two budget reallocation strategies is not significant. One possible reason is that in the proposed heuristic, the extracted defense resources are allotted to the compromised nodes according to certain strategies; the nodes selected by the attacker already indicate which nodes are more important, even though node degree is also related to the importance of nodes. As a result, the gap between the two strategies is small.
5 Conclusion
In this paper, a generic mathematical programming model that can be used for solving
network attack-defense problem is proposed by simulating the role of the defender
and the attacker. The attacker tries to maximize network damage by compromising
nodes in the network, whereas the defender’s goal is to minimize the impact by
deploying defense resources to nodes, and thus enhancing their defense capability. In
this context, the Degree of Disconnectivity (DOD) is used to measure the damage of
the network. The inner problem is solved by a Lagrangean Relaxation-based algorithm, and the solution to the min-max problem is obtained from the subgradient-like heuristic and the budget adjustment algorithm.
The main contribution of this research is the generic mathematical model for
solving the network attack-defense problem. The proposed Lagrangean Relaxation-
based algorithm and subgradient-like heuristic have been proved to be effective and
can be applied to real-world networks, such as grid, random and scale-free networks.
Also, the survivability of networks with different topologies, sizes, and budget
allocation policies has been examined. Their survivability can be significantly
improved by adjusting the defense budget allocation. From the outcomes of the
experiments, the defense resources should be allocated according to the importance of
nodes. Moreover, the attack and defense scenario proceeds over rounds until the end; as a result, the number of rounds, e.g., from 1 to N, could be further expanded in future work.
References
1. Symantec.: Symantec Global Internet Security Threat Report Trends for 2009, Symantec
Corporation, vol. XV (April 2010)
2. Ellison, R.J., Fisher, D.A., Linger, R.C., Lipson, H.F., Longstaff, T., Mead, N.R.:
Survivable Network Systems: An Emerging Discipline, Technical Report CMU/SEI-97-
TR-013 (November 1997)
3. Jiang, W., Fang, B.X., Zhang, H.L., Tian, Z.H.: A Game Theoretic Method for Decision
and Analysis of the Optimal Active Defense Strategy. In: The International Conference on
Computational Intelligence and Security, pp. 819–823 (2007)
4. Lin, F.Y.S., Tsang, P.H., Chen, P.Y., Chen, H.T.: Maximization of Network Robustness
Considering the Effect of Escalation and Accumulated Experience of Intelligent Attackers.
In: The 15th World Multi-Conference on Systemics, Cybernetics and Informatics
(July 2009)
5. Lin, F.Y.S., Yen, H.H., Chen, P.Y., Wen, Y.F.: An Evaluation of Network Survivability
Considering Degree of Disconnectivity. In: The 6th International Conference on Hybrid
Artificial Intelligence Systems (May 2011)
6. Fisher, M.L.: The Lagrangian Relaxation Method for Solving Integer Programming
Problems. Management Science 27(1), 1–18 (1981)
7. Geoffrion, M.: Lagrangean Relaxation and its Use in Integer Programming. Mathematical
Programming Study 2, 82–114 (1974)
PryGuard: A Secure Distributed Authentication Protocol
for Pervasive Computing Environment
1 Introduction
Pervasive computing is becoming an inseparable part of our daily life due to the development of technology for low-cost pervasive devices and inexpensive but powerful wireless communication. Pervasive computing, which was the brainchild of Weiser [3], is now proving its viability in industry, education, hospitals, healthcare, battlefields, etc. The computational power and memory capacity of pervasive computing devices have grown significantly in the last few years. However, they are still not comparable to the storage capacity of other memory devices employed in distributed computing environments with a fixed infrastructure. Low battery power places another restriction on pervasive computing devices, as mentioned in [23]. Due to such constraints, the dependency of one device on another is an important aspect of the pervasive computing environment.
2 Related Works
Implementations of several device discovery protocols for ad hoc network environments are presented in [13, 14, 15, 17]. These protocols are compatible with different wireless communication protocols like Bluetooth, 802.11, etc., but none of them has the capability to serve as a middleware service. These protocols do not provide the application interface required for application developers to use a middleware service; rather, they work in the MAC and network layers. In [32], the security of Web services is discussed. However, the focus of this software system is to enable the interoperability of applications over the Internet. Martin and Hung presented a security policy [33] to manage different security-related concerns, including confidentiality, integrity and availability, for VoIP. However, our main focus is on secure device discovery in pervasive ad hoc networks. Aleksy et al. proposed a three-tier approach to successfully handle the heterogeneity and interoperability issues [34].
In PryGuard, we have focused on making validity decisions based on behavior. Popovski et al. [13] have proposed a randomized distributed algorithm for device discovery with collision avoidance using a stack algorithm. Unfortunately, it fails to adapt itself to the dynamically changing environment of an ad hoc network.
In the UPnP protocol [18, 19], a controlled device broadcasts its presence and the services it is willing to provide. Jini [20, 21] constructs a community where each device can use the services provided by other devices belonging to the same community. A framework is provided in OSGi [22] for a home network that allows the user to communicate through devices with heterogeneous communication protocols. However, none of them considered the issue of maintaining a valid device list and the corresponding security threats. Research studies in [14, 15, 16] discussed Bluetooth wireless communication. However, this protocol does not address power optimization and fault tolerance. In addition, it does not include the impact of malicious attacks, which we have discussed in our findings.
In [24, 25] the authors have shown that the Hopper–Blum protocol can be applied to strengthen the security features of RFID. In our approach, the task of sending challenges is distributed to all valid nodes; thus, the ambassador only combines the replies of the other devices. Due to the limitation of battery power, it is not feasible for a single node to manage all the challenges and calculations as the ad hoc network grows in size. In [27] a similar distributed protocol called ILDD is proposed, where recommendations from the valid devices are returned to the ambassador in a plain format, like ‘true’ or ‘false’. Unfortunately, this defeats the purpose of noise inclusion, since by observing the recommendations a passive intruder could determine whether a response was correct or noisy. We explain this loophole in ILDD in the following section, where we demonstrate that the authentication protocol presented in ILDD is not fully secure. In fact, any distributed authentication protocol based on the LPN technique is vulnerable to a special type of security flaw, which we have termed the “Noise Recognition Attack”, as it enables the attacker to distinguish noisy responses from valid ones. Our proposed distributed authentication protocol, PryGuard, resolves this security attack.
Fig. 1. Overview of existing distributed device discovery protocol and security loophole
4 Characteristics
PryGuard is intended to offer a feasible device authentication protocol for ad hoc network environments in a distributed fashion. It features several novel and exclusive characteristics to accomplish that purpose. Our model differs from others due to its capability of being used as a middleware service. Some other exclusive features of PryGuard, along with a comparison with competing approaches, are mentioned below.
1) The HB protocol is for human–computer authentication, in which a user communicates with the computer through an insecure channel. This protocol was not designed for an ad hoc environment, where any member can join and leave at any time. Our protocol is adapted to deal with the constraints of such a dynamic environment.
2) The HB protocol is based on a single server–client scenario. However, an ad hoc network is formed by numerous devices joining and leaving arbitrarily. Instead of providing a challenge from a single device/server, we modified the method so that each valid device sends only one challenge to every other node present in the network. Such a distributed approach obviously makes it more scalable.
3) We have used the concept of an ambassador node which is chosen based on trust
level. For the trust portion we used the trust model proposed in [26]. This model
updates the trust level of each device dynamically, depending on its interaction
and behavior with other devices. Remaining battery power of a device can be
another significant criterion for selecting the ambassador node.
4) In our model, we consider the impact of noise inclusion. Due to the fact that a device is only required to make a certain number of correct responses, there is a possibility of a user being successfully authenticated without having the secret. Such users will impact later authentication processes because they will also actively participate. A new term β has been incorporated; β denotes the expected number of malicious devices present in the valid neighbor list. The number of valid responses required to gain access to the network is adjusted based on this parameter.
5) We have introduced the use of a specific function to enable the process of bit position generation and distribution. This function is a bijection and is shared among the valid devices only. The ambassador selects a particular bit position for each member and uses the bijective function to encrypt it (a hedged sketch of one possible such bijection follows this list).
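The paper does not specify the bijection itself; the following is a minimal sketch, assuming a keyed permutation of bit positions derived from a seed shared only by the valid devices. The function names, the SHA-256 derivation, and the seed value are illustrative assumptions, not the authors' construction.

```python
# A minimal sketch: a keyed bijection over the bit positions 0..n-1 of the
# shared secret, derived from a seed known only to valid devices.
import hashlib
import random

def position_bijection(shared_seed: bytes, n: int):
    """Return encode/decode maps forming a bijection on {0, ..., n-1}."""
    # Derive a deterministic permutation of the n bit positions from the seed.
    rng = random.Random(hashlib.sha256(shared_seed).digest())
    perm = list(range(n))
    rng.shuffle(perm)                      # perm[i] is the encoded form of position i
    inverse = {encoded: i for i, encoded in enumerate(perm)}
    return (lambda pos: perm[pos]), (lambda enc: inverse[enc])

# Example: the ambassador encodes position 5 of a 32-bit secret; a valid
# device holding the same seed decodes it.
encode, decode = position_bijection(b"group-secret-seed", 32)
assert decode(encode(5)) == 5
```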
5 Overview of PryGuard
The major steps of PryGuard are described here. After the execution of PryGuard, each device in the network is updated with a list of valid neighbors. This is suitable for large corporate buildings, college campuses, etc., where numerous devices communicate with each other over wireless connections. In our model, each device maintains a list known as a valid neighbor list that contains the nodes that have been declared valid by the ambassador node. The terms leader node and ambassador node are used interchangeably in this paper. The ambassador node initiates the process and finally computes the validity of each device upon getting replies from existing valid devices.
Later this updated list is broadcast within the network. We are using the term valid
device to denote those devices that are already in the valid neighbor list of others.
This also indicates that these devices have passed the challenge–response based
device update mechanism in the previous phase. If a malicious device can generate
the required number of correct responses to pass the challenge–response phase, it will
also be updated as a valid device in the valid neighbor list of other devices.
First of all, one of the nodes is selected as the ambassador node. We assume that
selection of ambassador node is based on mutual trust ratings among the network
members. The ambassador node initiates the device authentication process. Whenever
a new member intends to join the network this process is triggered. Again, the
ambassador triggers the process periodically to update its valid neighbor list. It is
assumed that a common secret is possessed by every valid member; the authentication process runs on this assumption. The main task is to verify whether a new member or an existing member possesses the correct version of the current secret, as the secret is also changed periodically. The verification process is distributed over all the network
members and it is administered by the ambassador. The ambassador uses a sequence
generator function to assign a bit position of the current secret to individual members
which are already in the valid neighbor list. These devices then send challenges either
to the new member or to one another. Afterwards each of them replies with the
corresponding response. Based on the correctness of the response the challengers
forward their replies to the ambassador. Finally the ambassador decides which are the
valid devices based on the number of correct responses made by each device. At the
end, this updated list is broadcast within the network. An overview of the architecture
of PryGuard is depicted in the figure below. A detailed description of the protocol follows.
Step 5) Now suppose that sender S was actually the k-th network member u_k and was assigned the bit position b by the ambassador in Step 2. Then, for all the matched results, S will send the value of the b-th bit of x, i.e., x_b, to the ambassador. This will be counted as a true recommendation to the ambassador for the corresponding member to which the response belongs. Conversely, for the devices that fail to correctly compute the binary inner product, S will send the negated value of the b-th bit of x, i.e., the complement of x_b, to the ambassador. This will be counted as a false recommendation to the ambassador for the corresponding member to which the response belongs. So, from the point of view of the ambassador node, a true recommendation (the exact value of the corresponding bit of x) indicates that the receiver node calculated the correct response, while a false recommendation (the negated value of the corresponding bit of x) denotes an incorrect response.
Step 6) Steps 2–5 are repeated for each of the valid devices u_1, u_2, u_3, …, u_Δ present in the network, where Δ denotes the total number of valid devices.
Step 7) After receiving all of the recommendations, the ambassador calculates the
validity of each device including itself then broadcasts the valid result. Finally, every
other device updates its valid neighbor list according to the ambassador’s broadcasted
results.
All these tasks are executed sequentially among the different participating entities. The ambassador/leader initiates the authentication process. It generates random challenges a_1, a_2, …, a_Δ and mutually exclusive sequence numbers b_1, b_2, …, b_Δ for all the current valid members and sends them to each member individually. Upon receiving the challenge and the sequence number, each valid member updates the old secret and forwards the challenge to the new member. The new member also updates its version of the current secret and uses it to compute the binary inner product. Then, it sends the resulting parity bit p′ to the corresponding network member. The valid member itself computes the parity bit p using its version of the secret key. Then, based on the comparison between p′ and p, it returns its decision, in terms of the value of x_b, to the ambassador. After getting decisions from all the existing members, the ambassador picks the predefined n recommendations, which correspond to the n bits of x, and forms the value of the secret possessed by the new member. If this secret matches exactly with the original x, the authentication process succeeds.
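To make the round concrete, here is a minimal sketch of the flow just described, under our reading of the protocol: each valid member issues one challenge, the new member replies with a (possibly noisy) binary inner product, and the challenger encodes its verdict as x_b or its negation. All names, the noise rate, and the secret length are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one PryGuard-style round; secrets and challenges are bit vectors.
import random

def inner_product(a, x):
    """Binary inner product <a, x> over GF(2)."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def prover_response(challenge, secret, eta):
    """New member's reply: correct parity, flipped with probability eta (noise)."""
    p = inner_product(challenge, secret)
    return p ^ 1 if random.random() < eta else p

def challenger_recommendation(challenge, reply, secret, assigned_bit):
    """Valid member u_k: compare the reply with its own parity and encode the
    verdict as x_b (match) or its negation (mismatch), as in Step 5."""
    expected = inner_product(challenge, secret)
    bit = secret[assigned_bit]
    return bit if reply == expected else bit ^ 1

# One round: the ambassador assigns distinct bit positions, each valid member
# challenges the new member, and the recommendations are collected.
n, eta = 8, 0.125
x = [random.randint(0, 1) for _ in range(n)]          # current shared secret
positions = random.sample(range(n), n)                # mutually exclusive b_1..b_n
recommendations = {}
for b in positions:
    a = [random.randint(0, 1) for _ in range(n)]      # challenge a_k
    reply = prover_response(a, x, eta)                # new member answers
    recommendations[b] = challenger_recommendation(a, reply, x, b)

# The ambassador reassembles the claimed secret from the recommended bits and
# compares it with x; with noise, most (but not all) positions should match.
reconstructed = [recommendations[b] for b in range(n)]
print(sum(r == xb for r, xb in zip(reconstructed, x)), "of", n, "bits match")
```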
The calculation determining the validity of a node based on recommendations from its peers needs to be explained, since we intend to consider the impact of noise inclusion on this decision. A device that is already in the valid neighbor list will be updated as a valid device if

K ≥ ceil((1 − η) × (l − β))   … (1)

where K is the number of true recommendations from other devices to the ambassador about the device being validated, η is the maximum allowable percentage of noise (i.e., intentional incorrect answers), and l is the total number of challenges received by one valid device. As each valid device receives challenges from every other valid device except itself, l = Δ − 1; β is the expected number of malicious devices present in the network that have updated themselves into the valid neighbor list, and Δ is the total number of valid devices in the network. So, (1) can be rewritten as

K ≥ ceil((1 − η) × (Δ − 1 − β))   … (2)

A new device that has just joined the network but is not in the valid neighbor list of other devices will receive a challenge from every valid device in the network. Thus, it will actually receive Δ challenges. As a result, l in (1) is replaced by Δ for a new device.
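A small sketch of the resulting thresholds, under our reading of (1) and (2) above; the function and parameter names are ours.

```python
# Minimum number of true recommendations required for validity.
import math

def min_true_recommendations(delta, eta, beta, new_member=False):
    l = delta if new_member else delta - 1      # challenges actually received
    return math.ceil((1 - eta) * (l - beta))

# Example: 12 valid devices, 12.5% allowed noise, 1 suspected malicious member.
print(min_true_recommendations(delta=12, eta=0.125, beta=1))                   # existing member
print(min_true_recommendations(delta=12, eta=0.125, beta=1, new_member=True))  # new member
```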
References
1. Hopper, N., Blum, M.: A secure human-computer authentication scheme. Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-00-139 (2000)
2. Hopper, N.J., Blum, M.: Secure human identification protocols. In: Boyd, C. (ed.)
ASIACRYPT 2001. LNCS, vol. 2248, pp. 52–66. Springer, Heidelberg (2001)
3. Weiser, M.: Some computer science issues in ubiquitous computing. Communications of the ACM 36(7), 75–84 (1993)
4. Eronen, P., Nikander, P.: Decentralized Jini security. In: Proceedings of the Network and Distributed System Security Symposium, San Diego, CA (February 2001)
5. Hewlett Packard CoolTown (2008), http://cooltown.hp.com
6. UC Berkeley. The Ninja Project: Enabling internet scale services from arbitrarily small
devices (2008), http://ninja.cs.berkeley.edu
7. Balazinska, M., Balakrishnan, H., Karger, D.: INS/Twine: A scalable peer-to-peer
architecture for intentional resource discovery. In: Proceedings of the International
Conference on Pervasive Computing, Zurich, Switzerland (2002)
8. Adjie-Winoto, W., Schwartz, E., Balakrishnan, H., Lilley, J.: The design and
implementation of an intentional naming system. In: Proceedings of the 17th ACM
Symposium on Operating Systems Principles (SOSP 1999), Kiawah Island, SC (1999)
9. Nidd, M.: Service discovery in DEAPspace. IEEE Pers. Communications 8(4), 39–45
(2001)
10. Guttman, E., Perkins, C., Veizades, J.: Service location protocol. Version 2,
http://www.ietf.org/rfc/rfc2608.txt
11. The Salutation Consortium, Inc. Salutation architecture specification (1999),
http://ftp.salutation.org/salute/sa20e1a21.ps
12. Czerwinski, S., Zhao, B.Y., Hodes, T., Joseph, A., Katz, R.: An architecture for a secure
service discovery service. In: Proceedings of the 5th Annual International Conference on
Mobile Computing Networks (MobiCom 1999), Seattle, WA (1999)
13. Popovski, P., Kozlova, T., Gavrilovska, L., Prasad, R.: Device discovery in short-range
wireless ad hoc networks. IEEE Networks 3, 1361–1365 (2002)
14. Zaruba, G.V., Gupta, V.: Simplified Bluetooth device discovery— Analysis and
simulation. In: Proceedings of the 37th Hawaii International Conference on Systems
Sciences, pp. 307–315 (January 2004)
15. Ferraguto, F., Mambrini, G., Panconesi, A., Petrioli, C.: A new approach to device
discovery and scatternet formation in Bluetooth networks. In: Proceedings of the 18th
International Parallel Distributed Process. Symposium, pp. 221–228 (April 2004)
16. Zaruba, G.V., Chlamtac, I.: Accelerating Bluetooth inquiry for personal area networks. In:
Proceedings of IEEE Global Telecommunication Conference, vol. 2, pp. 702–706
(December 2003)
17. Sohrabi, K., Gao, J., Ailawadhi, V., Pottie, G.J.: Protocols for self-organization of a wireless sensor network. IEEE Personal Communications 7(5), 16–27 (2000)
18. Universal Plug and Play Forum. About universal plug and play technology (2008),
http://www.upnp.org/about/default.asp#technology
19. Universal Plug and Play. Understanding universal plug and play: A white paper (June
2000), http://upnp.org/resources/whitepapers.asp
20. Sun Microsystems. Jini network technology (2008), http://www.sun.com/jini
21. Sun Microsystems. The community resource for Jini technology (2008),
http://www.jini.org
22. Dobrev, P., Famolari, D., Kurzke, C., Miller, B.: Device and service discovery in home
networks with OSGI. IEEE Communications Magazine 40(8), 86–92
(2002)
23. Satyanarayanan, M.: Fundamental challenges in mobile computing. In: Proceedings of
15th ACM Symposium on Principles of Distributed Computing, pp. 1–7 (May 1996)
24. Weis, S.A.: Security parallels between people and pervasive devices. In: Proceedings of
3rd IEEE International Conference on Pervasive Computing Communications Workshops,
pp. 105–109 (2005)
25. Juels, A., Weis, S.A.: Authenticating pervasive devices with human protocols. In: Shoup,
V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 293–308. Springer, Heidelberg (2005)
26. Sharmin, M., Ahmed, S., Ahamed, S.I.: An adaptive lightweight trust reliant secure
resource discovery for pervasive computing environments. In: Proc. of PerCom 2006, Pisa,
Italy, pp. 258–263 (2006)
27. Haque, M., Ahamed, S.I.: An Impregnable Lightweight Device Discovery (ILDD) Model
for the Pervasive Computing Environment of Enterprise Applications. IEEE Transactions
on Systems, Man, and Cybernetics 38(3), 334–346 (2008)
28. Sharmin, M., Ahmed, S., Ahamed, S.I.: MARKS (middleware adaptability for resource
discovery, knowledge usability and self-healing) in pervasive computing environments. In:
Proc. of 3rd Int. Conf. Inf. Technol.: New Gen, pp. 306–313 (April 2006)
29. Ahmed, S., Sharmin, M., Ahamed, S.I.: Knowledge usability and its characteristics for
pervasive computing. In: Proc. 2005 Int. Conf. pervasive Syst. Computing (PSC 2005),
Las Vegas, NV, pp. 206–209 (2005)
30. Sharmin, M., Ahmed, S., Ahamed, S.I.: SAFE-RD (Secure, adaptive, fault tolerant, and
efficient resource discovery) in pervasive computing environments. In: Proc. IEEE Int.
Conf. Inf. Technol (ITCC 2005), Las Vegas, NV, pp. 271–276 (2005)
31. Ahmed, S., Sharmin, M., Ahamed, S.I.: GETS (Generic, efficient, transparent and secured)
self-healing service for pervasive computing applications. International
Journal of Network Security 4(3), 271–281 (2007)
32. Carminati, B., Ferrari, E., Hung, P.C.K.: Web services composition: A security
perspective. In: Proceedings of the 21st Int. Conference on Data Engineering (ICDE 2005),
Japan, April 8–9 (2005)
33. Martin, M.V., Hung, P.C.K.: Toward a security policy for VoIP applications. In:
Proceedings of the 18th Annual Can. Conf. Electr. Comput. Eng (CCECE 2005),
Saskatoon, SK, Canada (May 2005)
34. Aleksy, M., Schader, M., Tapper, C.: Interoperability and interchangeability of middleware
components in a three-tier CORBA environment: state of the art. In: Proc. 3rd Int. Conf.
Enterprise Distrib. Object Comput (EDOC 1999), pp. 204–213 (1999)
A Global Unsupervised Data Discretization Algorithm
Based on Collective Correlation Coefficient
1 Introduction
With the rapid emergence of Internet applications and other information systems, the amount of available data has grown exponentially. Meanwhile, handling large amounts of heterogeneous data can be challenging because current data analysis algorithms (e.g., machine learning and statistical learning) are not effective enough to discover knowledge across a wide spectrum of data analysis tasks. In fact, some algorithms, such as association rule mining, Bayesian networks, and rough set theory, can only directly handle discrete data rather than continuous numerical data.
As a data-preprocessing step, discretization is to partition continuous numerical
values of an attribute in a data set into a finite number of intervals as discrete values.
A great number of research findings have indicated that discretization results exert a
great impact on the performance of data mining algorithms. A good discretization
algorithm not only can simplify continuous attributes to aid people in more easily
comprehending data and results, but also can make subsequent procedures such as
classification more effective and efficient.
In this paper, we propose a Global Unsupervised Discretization Algorithm based on the Collective Correlation Coefficient (GUDA-CCC). Without requiring class labels in a data set, the algorithm is able to simultaneously optimize the numbers of data intervals corresponding to all continuous attributes in the data set.
In the GUDA-CCC algorithm, we attempt to stay as close as possible to the original data set in two ways when determining the number of discrete intervals and positioning these intervals (or cut-points). One way is to preserve the rank of attribute
importance in an attribute set by applying an optimization approach when determining
the number of data intervals. In the method, the attribute importance is measured as
the Collective Correlation Coefficient (CCC), which is in essence an index
quantifying each continuous attribute’s contribution to the state spaces comprised of
continuous attributes in a data set as a whole [1]. In addition, we can consider
discretization as a quantization procedure since quantization is the process of
subdividing the range of a signal into non-overlapping regions (i.e. data intervals) and
assigning a numerical value to represent each region. Hence, the other way is to
minimize the discretization errors using the Lloyd-Max Scalar Quantization [2], when
positioning the data intervals according to the given numbers of data intervals.
The rest of the paper is organized as follows. In section 2, we introduce related
work. Section 3 articulates the algorithm and relevant thoughts. Key components used
in the GUDA-CCC algorithm are detailed in section 4. A set of empirical results and
corresponding comparisons with other discretization algorithms are shown in section
5. Section 6 concludes the paper.
2 Related Works
A wide range of discretization methods has been proposed so far. In [3], six separate dimensions were used to classify these discretization methods: unsupervised versus supervised, static versus dynamic, local versus global, splitting (top-down) versus merging (bottom-up), direct versus incremental, and univariate versus multivariate.
Because they fail to take into account the interdependencies among the attributes to be discretized, univariate discretization algorithms have difficulty obtaining a good discretization scheme for a multi-attribute data set. The characteristics of dynamic methods constrain their generality. Local discretization approaches do not fully explore the whole instance space, so there is still room for improvement. Merging methods are computationally expensive and probably infeasible for high-dimensional and large-scale data sets. Supervised approaches require much more information, i.e., class labels, than unsupervised ones. However, in many applications such as information retrieval, unlabeled instances are readily available while labeled ones are too expensive to obtain, since labeling the instances requires substantial human involvement. In addition, in supervised approaches, leveraging the class labels to obtain the discretization scheme is likely to result in overfitting or oversimplification of the acquired knowledge.
In contrast, our GUDA-CCC algorithm uses the Collective Correlation Coefficient (CCC) derived from PCA to define the attribute importance and keeps the rank of the attribute importance unchanged when directly discretizing all the continuous numerical attributes simultaneously.
3 Methodology
Before introducing the GUDA-CCC algorithm, we first define the formal terminology
on discretization.
In Figure 1, there are n data intervals (i.e. quantization levels), and c1, c2, …, cn+1
are the cut-points to position the corresponding intervals. Moreover, there is a
numerical value (i.e., quantum) representing every interval. For example, q_2 and q_n represent (c_2, c_3] and (c_n, c_{n+1}], respectively. So, the discretized A_m (i.e., A_m^D) consists of a_{km}^D ∈ {q_1, q_2, …, q_n} (k = 1, 2, …, T), i.e.,

a_{km}^D = q_i,  if a_{km} ∈ (c_i, c_{i+1}];  k = 1, 2, …, T;  i = 1, 2, …, n   (2)

Thus, after the discretization, a_{6m}, a_{4m}, a_{2m}, a_{5m} and a_{3m} correspond to q_1, q_1, q_2, q_2 and q_n, respectively.
Note that some discretization algorithms do not calculate the quanta q_1, q_2, …, q_n, but merely use nominal values to label and distinguish different data intervals.
However, in our algorithm, the numerical values must be calculated to facilitate the
subsequent processes.
In fact, the discretization process above can be considered as the quantization
procedure. Quantization is the process of subdividing the range of a signal into non-
overlapping regions (i.e., data intervals). An output level (i.e., quantum) is then assigned to represent each region.
3) Catch a fish, check whether or not the stop condition of the algorithm is
satisfied. If so, stop; or else, do the following steps;
a) Check whether or not the ‘Follow’ action can be executed. If yes,
execute it; or else, ‘Prey’. And then, compare the current results with
the Bulletin Board to see which one is better. If the current results are
better, update the Bulletin Board with them;
b) Based on the current state, judge whether or not the ‘Swarm’ action
can be executed. If so, execute it; or else, ‘Prey’. And then, compare
the current results with the Bulletin Board to see which one is better. If
the current results are better, update the Bulletin Board with them;
4) Do step 3) until all n artificial fish have been visited;
The detailed steps of the prey, follow, and swarm actions are given in [11].
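A hedged sketch of the loop outlined in steps 3)–4) above is given below; the prey, follow, and swarm behaviours and the visiting conditions are passed in as stubs, since their details are defined in [11] and are not reproduced here.

```python
# Sketch of one pass of the artificial fish swarm loop as described above.
def fish_swarm_step(fishes, bulletin, fitness, follow, swarm, prey, can_follow, can_swarm):
    for fish in fishes:
        # a) Follow if possible, otherwise Prey; keep the best result so far.
        candidate = follow(fish, fishes) if can_follow(fish, fishes) else prey(fish)
        if fitness(candidate) > fitness(bulletin):
            bulletin = candidate
        # b) Swarm if possible, otherwise Prey; again update the bulletin board.
        candidate = swarm(fish, fishes) if can_swarm(fish, fishes) else prey(fish)
        if fitness(candidate) > fitness(bulletin):
            bulletin = candidate
    return bulletin
```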
e = x − q_i = x − A^D(x)   (5)
Assuming the probability density function of an attribute is p(x), the mean square error σ_e² can be written as:
σ_e² = Σ_{i=1}^{n} ∫_{c_i}^{c_{i+1}} (x − q_i)² p(x) dx   (6)

Setting ∂σ_e²/∂c_i = 0 gives

c_{i,opt} = (q_{i,opt} + q_{i−1,opt}) / 2,  i = 2, 3, …, n   (7)

and setting ∂/∂q_k [ ∫_{c_k}^{c_{k+1}} (x − q_k)² p(x) dx ] = 0 gives

q_{k,opt} = ( ∫_{c_{k,opt}}^{c_{k+1,opt}} x p(x) dx ) / ( ∫_{c_{k,opt}}^{c_{k+1,opt}} p(x) dx ),  k = 1, 2, …, n   (8)
Formulas (7) and (8) show that the best boundary (cut-point) lies at the midpoint of two adjacent output levels (quanta) and that the best quantum is the centroid of the zone lying between two adjacent boundaries, respectively.
5 Experimental Results
All experimental data sets are from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). Detailed descriptions of the data sets are shown in Table 1. Each continuous attribute of the data sets in Table 1 is assumed to have a normal distribution when the GUDA-CCC algorithm is applied. In practice, before the GUDA-CCC algorithm is applied, the probability density functions of the continuous attributes can be estimated. Based on the given probability density functions, the quanta q_{k,opt} and cut-points c_{k,opt} can then be obtained with formulas (7) and (8).
Table 2. Experimental results in error rates based on C4.5 classifier (the best results in bold)
All data sets are divided into the training and test data by a 5-trial, 4-fold stratified
sampling cross-validation test method. In each fold, the discretization is learned on
the training data and the resulting bins are applied to the test data. All these
experiments are implemented with Matlab and Weka [14].
From Table 2, we can see that C4.5 combined with the GUDA-CCC has the lowest
error rates in Bupa, Glass, Ionosphere, Musk, Musk2, Pima Indians Diabetes,
Waveform and Wine.
A one-sided t-test indicates that GUDA-CCC is significantly better than Equal Width, Equal Frequency, Fayyad & Irani's MDL criterion, Kononenko's MDL criterion, and Continuous (non-discretization) on most data sets, i.e., Bupa, Glass, Ionosphere, Musk2, Pima Indians Diabetes and Waveform. As for
the data set Iris, the difference between GUDA-CCC and Equal Frequency is not
significant though the error rate of Equal Frequency is lower than that of GUDA-
CCC. When the data set Vehicle is processed, although the result of Continuous (non-
discretization) is better than that of GUDA-CCC, the difference between them is not
significant. Likewise, for the data sets: Musk and Wine, even though the error rates
from GUDA-CCC are lower than those of Kononenko's MDL criterion and
Continuous (non-discretization) respectively, the differences between them are not
significant either.
The experimental results imply that our algorithm works better on data sets with high dimensions, such as Musk2 with 166 continuous attributes and
Ionosphere with 33 continuous attributes. The reason behind it could be that both the
rank of attribute importance calibrated by CCC values and the discretization errors
measured by the MSE are more suitable for data sets with more continuous attributes.
To sum up, Table 2 suggests that GUDA-CCC is superior to the other algorithms among these six in classification tasks. This strongly supports our hypothesis that when
other available information is not reliable enough to be leveraged in the discretization,
sticking closely to an original data set, just like honesty, might be the best policy.
Moreover, preserving the rank of attribute importance together with minimizing
discretization errors measured by the MSE seems to be an effective means to that end.
6 Conclusion
In this paper, we present a new data discretization algorithm which requires no class labels and only a few predefined parameters. The key idea behind the method is that staying close to the original data set, by preserving the rank of attribute importance while minimizing discretization errors measured by the MSE, might be the best policy.
As an important practical advantage, the algorithm does not require class labels, but
can outperform some supervised methods. The algorithm can discretize all attributes
simultaneously, rather than one attribute at a time, which improves the efficiency of
unsupervised discretization. Experiments on benchmark data sets demonstrate that GUDA-CCC could be a rewarding discretization approach for improving the quality of data mining results, especially when dealing with high-dimensional data sets.
References
1. Zeng, A., Pan, D., Zheng, Q.L., Peng, H.: Knowledge Acquisition based on Rough Set
Theory and Principal Component Analysis. IEEE Intelligent Systems 21, 78–85 (2006)
2. Lloyd, S.P.: Least Squares Quantization in PCM. IEEE Transactions on Information
Theory 28(2), 129–137 (1982)
3. Liu, H., Hussain, F., Tan, C., Dash, M.: Discretization: An Enabling Technique. Data
Mining and Knowledge Discovery 6(4), 393–423 (2002)
4. Kurgan, L.A., Cios, K.J.: CAIM Discretization Algorithm. IEEE Transactions on
Knowledge and Data Engineering 16, 145–153 (2004)
5. Tsai, C.J., Lee, C.I., Yang, W.P.: A Discretization Algorithm based on Class-attribute
Contingency Coefficient. Information Sciences 178, 714–731 (2008)
6. Yang, Y., Webb, G.I.: Discretization for Naïve-Bayes Learning: Managing Discretization
Bias and Variance. Machine Learning 74, 39–74 (2009)
7. Au, W.H., Chan, K.C.C., Wong, A.K.C.: A Fuzzy Approach to Partitioning Continuous
Attributes for Classification. IEEE Transactions on Knowledge and Data Engineering 18,
715–719 (2006)
8. Bondu, A., Boulle, M., Lemaire, V., Loiseru, S., Duval, B.: A Non-parametric Semi-
supervised Discretization Method. In: Proceedings of 2008 Eighth International
Conference on Data Mining, pp. 53–62 (2008)
9. Mehta, S., Parthasarathy, S., Yang, H.: Toward Unsupervised Correlation Preserving
Discretization. IEEE Transactions on Knowledge and Data Engineering 17, 1174–1185 (2005)
10. Li, X.L., Shao, Z.J.: An Optimizing Method based on Autonomous Animals: Fish-Swarm
Algorithm. Systems Engineering-Theory & Practice 11, 32–38 (2002) (in Chinese)
11. Reynolds, C.W.: Flocks, Herds, and Schools: a Distributed Behavioral Model. Computer
Graphics 21, 25–34 (1987)
12. Fayyad, U.M., Irani, K.B.: On the Handling of Continuous-Valued Attributes in Decision
Tree Generation. Machine Learning 8, 87–102 (1992)
13. Kononenko, I.: On Biases in Estimating Multi-Valued Attributes. In: Proceedings of 14th
International Joint Conference on Artificial Intelligence, pp. 1034–1040 (1995)
14. Weka, http://www.cs.waikato.ac.nz/ml/weka/
A Heuristic Data-Sanitization Approach
Based on TF-IDF
Abstract. Data mining technology can help extract useful knowledge from
large data sets. The process of data collection and data dissemination may,
however, result in an inherent risk of privacy threats. In this paper, the SIF-IDF
algorithm is proposed to modify original databases in order to hide sensitive
itemsets. It is a greedy approach based on the concept of the Term Frequency
and Inverse Document Frequency (TF-IDF) borrowed from text mining.
Experimental results also show the performance of the proposed approach.
1 Introduction
In recent years, privacy-preserving data mining (PPDM) has become an important issue due to the quick proliferation of electronic data in governments, corporations, and non-profit organizations. Verykios et al. [10] thus proposed a data sanitization process to hide sensitive knowledge by item addition or deletion. Zhu et al. [11] then discussed what kinds of public information are suitable for release without revealing sensitive data and suggested that the k-anonymity technique might still have security problems.
In text mining, the technique of term frequency–inverse document frequency (TF-
IDF) [9] is usually used to evaluate how relevant a word in a corpus is to a document.
It may be thought of as a statistical measure. In this paper, a novel greedy-based
approach called sensitive items frequency - inverse database frequency (SIF-IDF)
algorithm is thus designed to modify the TF-IDF [9] approach. It evaluates the
degrees of transactions associated with given sensitive itemsets, reducing the
frequencies of sensitive itemsets for data sanitization. Based on the SIF-IDF
algorithm, the user-specific sensitive itemsets can be completely hidden with reduced
side effects. Experimental results are also used to evaluate the performance of the
proposed approach.
Data mining is most commonly used in attempts to induce association rules from
transaction data, such that the presence of certain items in a transaction will imply the
presence of some other items. To achieve this purpose, Agrawal et al. proposed
several mining algorithms based on the concept of large itemsets to find association
rules in transaction data [3-4]. In the first phase, candidate itemsets were generated
and counted by scanning the transaction data. If the count of an itemset appearing in
the transactions was larger than a pre-defined threshold value (called the minimum
support), the itemset was considered a large itemset. Large itemsets containing only
single items were then combined to form candidate itemsets containing two items.
This process was repeated until all large itemsets had been found. In the second phase,
association rules were induced from the large itemsets found in the first phase. All
possible association combinations for each large itemset were formed, and those with
calculated confidence values larger than a predefined threshold (called the minimum
confidence) were output as association rules.
Years of effort in data mining have produced a variety of efficient techniques, which
have also caused the problems of security and privacy threats [7]. In the past, Atallah
et al. first proposed the protection algorithm for data sanitization to avoid the
inference of association rules [2]. Dasseni et al. then proposed a hiding algorithm
based on the hamming-distance approach to reduce the confidence or support values
of association rules [5]. Amiri then proposed three heuristic approaches to hide
multiple sensitive rules [1]. Pontikakis et al. [8] then proposed two heuristics
approaches based on data distortion.
The optimal sanitization of databases is, in general, regarded as an NP-hard problem.
Atallah et al. [2] proved that selecting which data to modify or sanitize was also NP-
hard. Their proof was based on the reduction from the NP-hard problem of hitting-
sets [6].
SIF-IDF(T_i) = Σ_{j=1}^{n} ( |si_ij| / |T_i| × Σ_{k=1}^{p} log( |n| / |f_k − MRC_k| ) ),

where |si_ij| is the number of sensitive items contained in the j-th sensitive itemset in T_i, |T_i| is the number of items in transaction T_i, |n| is the number of records in the database, |f_k| is the frequency count of each item, and |MRC_k| is the maximum reduced count of each item.
The SIF-IDF value of each transaction is calculated and used to measure whether a transaction has a large number of sensitive items but little influence on other transactions. Transactions with high SIF-IDF values are processed for sanitization with high priority. The transactions are sorted in a
the transactions for the proposed algorithm. In data sanitization, an item with a higher
occurrence frequency in the sensitive itemsets may be considered to have a larger
influence than the ones with a lower occurrence frequency. The sensitive items in the
processed transactions are then deleted according to the ordering of their occurrence
frequencies. This procedure is repeated until the set of sensitive itemsets becomes
null, which indicates all the supports of the sensitive itemsets are under the user-
specific minimum support threshold. The proposed algorithm is described as follows.
Substep 3-3: Calculate the inverse database frequency (IDF_k) value of each item i_k as follows:

IDF_k = log( |n| / |f_k − MRC_k| ),

where f_k is the occurrence count of item i_k in the database.
Substep 3-4: Sum the IDF values of all sensitive items within sensitive itemsets and calculate the SIF-IDF value for each transaction as follows:

SIF-IDF(T_i) = Σ_{j=1}^{n} ( |si_ij| / |T_i| × Σ_{k=1}^{p} log( |n| / |f_k − MRC_k| ) ).
STEP 4: Find the transaction (Tb) which has the best SIF-IDF value.
STEP 5: Process the transaction Tb to prune appropriate items by the following
substeps.
Substep 5-1: Sort the items in a descending order of their occurrence
frequencies within the sensitive itemsets.
Substep 5-2: Find the first sensitive item (item_o) in T_b according to the sorted order obtained in Substep 5-1.
Substep 5-3: Delete the item (itemo) from the transaction.
STEP 6: Update the occurrence frequencies of the sensitive itemsets.
STEP 7: Repeat STEPS 2 to 6 until the set of sensitive itemsets is null, which
indicates that the supports of all the sensitive itemsets are below the user-
specific minimum support threshold s.
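A hedged sketch of this sanitization loop follows. It is a simplified reading of STEPs 1-7: MRC_k is taken as the amount by which item i_k's support would have to drop among the sensitive itemsets containing it, the log argument is clamped to avoid division by zero, and all names are our own assumptions rather than the authors' implementation.

```python
# Greedy SIF-IDF-style sanitization of a transaction database.
import math
from collections import Counter

def sanitize(transactions, sensitive_itemsets, min_count):
    db = [set(t) for t in transactions]
    sens = [frozenset(s) for s in sensitive_itemsets]
    count = lambda s: sum(s <= t for t in db)
    while True:
        active = [s for s in sens if count(s) >= min_count]
        if not active:                                     # STEP 7: all hidden
            return db
        freq = Counter(i for t in db for i in t)           # f_k
        mrc = {i: max(count(s) - min_count + 1 for s in active if i in s)
               for s in active for i in s}                 # MRC_k (our reading)
        idf = {i: math.log(len(db) / max(freq[i] - mrc[i], 1)) for i in mrc}

        def sif_idf(t):
            return sum(len(s & t) / len(t) * sum(idf[i] for i in s & t)
                       for s in active)

        # STEPs 4-5: pick the transaction with the best SIF-IDF value among those
        # still fully supporting some sensitive itemset, then delete its most
        # frequent sensitive item.
        best = max((t for t in db if any(s <= t for s in active)), key=sif_idf)
        victim = max((i for i in mrc if i in best), key=lambda i: freq[i])
        best.discard(victim)

# e.g. sanitize(table1_transactions, [{"c","f","h"}, {"a","f"}, {"c"}], min_count=4)
# where table1_transactions is a hypothetical list built from Table 1.
```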
4 An Example
In this section, an example is given to demonstrate the proposed sensitive items
frequency - inverse database frequency (SIF-IDF) algorithm for privacy preserving
data mining (PPDM). Assume a database shown in Table 1 is used as the example. It
consists of 10 transactions and 9 items, denoted a to i.
TID Item
1 a, b, c, d, f, g, h
2 a, b, d, e
3 b, c, d, f, g, h
4 a, b, c, f, h
5 c, d, e, g, i
6 a, c, f, i
7 b, c, d, e, f, g
8 c, d, f, h, i
9 a, d, e, f, i
10 a, c, e, f, h
Assume the set of user-specific sensitive itemsets S is {cfh, af, c}. Also assume the user-specific minimum support threshold is set at 40%, which indicates that the minimum count is 0.4*10, or 4. The proposed approach proceeds as follows to hide the sensitive itemsets so that they cannot be mined from the database.
STEPs 1 & 2: The transactions with sensitive itemsets in the database are found and
kept. The sensitive items frequency (SIF) value of each sensitive itemset in each
transaction is calculated. The results are shown in Table 2.
STEP 3: The inverse database frequency (IDF) value of each sensitive itemset in
each transaction is calculated. In this step, the reduced count (RC) of each item for
each sensitive itemset is first calculated and the maximum of the RC values of each
item is found as the MRC value. The IDF value of each item is then calculated. The
MRC and IDF values of all items are shown in Table 3.
The SIF-IDF value of a sensitive itemset in each transaction is then calculated as the SIF value of the sensitive itemset multiplied by its IDF value in the transaction. The results are shown in Table 4.
STEPs 4 & 5: The transactions in Table 4 are sorted in the descending order of their
SIF-IDF values. The transactions are processed in the above descending order to
prune appropriate items. In this example, the item c in transaction 4 is then first
selected to be deleted.
STEP 6: After item c is deleted from the fourth transaction, the new occurrence
frequencies of the sensitive itemsets in the transactions are updated. The sensitive
itemsets with their occurrence frequencies are then updated from {cfh:5, af:5, c:8} to
{cfh:4, af:5, c:7}.
STEP 7: STEPs 2 to 6 are then repeated until the supports of all the sensitive itemsets
are below the minimum count. The results of the final sanitized database in the
example are shown in Table 5.
TID Item
1 a, b, d, f, g, h
2 a, b, d, e
3 b, c, d, g, h
4 a, b, h
5 c, d, e, g, i
6 a, f, i
7 b, c, d, e, f, g
8 d, f, h, i
9 a, d, e, f, i
10 a, e, h
5 Experimental Results
Two datasets called BMSPOS and Webview-1 are respectively used to evaluate the
performance of the proposed algorithm. The numbers of sensitive itemsets for
BMSPOS and WebView_1 were set at 4, 4, and 8, respectively. For the proposed SIF-
IDF algorithm, the relationships between the numbers of iterations and the EC values are compared to show that the proposed algorithm can completely hide all user-
specific sensitive itemsets. The results for BMSPOS and WebView_1 are shown in
Figures 1 and 2, respectively.
Fig. 1. The relationships between the EC value and the number of iterations in the BMSPOS
database
Fig. 2. The relationships between the EC values and the number of iterations in the Webview_1 database
From Figure 1, it can be seen that the sensitive itemsets S3 and S4 were alternately processed to be hidden. From Figure 2, the sensitive itemsets S4, S3, S2
and S1 are sequentially processed to be hidden. The execution time to hide the four
sensitive itemsets for two databases was also compared and shown in Figure 3. Note
that the WebView_1 database required more execution time to hide the sensitive
itemsets than the BMSPOS dataset.
Fig. 3. The execution time to hide the four sensitive itemsets in the two databases
6 Conclusion
In this paper, the SIF-IDF algorithm is proposed to evaluate the similarity between sensitive itemsets and transactions so as to minimize side effects; it inherits its key idea from the TF-IDF technique in information retrieval. Based on the user-specific sensitive itemsets in the experiments, the proposed SIF-IDF algorithm can process all defined sensitive itemsets without any side effects in three databases. The experimental results show that the proposed algorithm performs well without side effects and is efficient and effective at hiding the sensitive itemsets.
References
1. Amiri, A.: Dare to share: Protecting sensitive knowledge with data sanitization. Decision
Support Systems, 181–191 (2007)
2. Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M., Verykios, V.S.: Disclosure
limitation of sensitive rules. In: Knowledge and Data Engineering Exchange Workshop,
pp. 45–52 (1999)
3. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in
large databases. In: The ACM SIGMOD Conference on Management of Data, pp. 207–216
(1993)
4. Agrawal, R., Srikant, R.: Fast algorithm for mining association rules. In: The International
Conference on Very Large Data Bases, pp. 487–499 (1994)
5. Dasseni, E., Verykios, V.S., Elmagarmid, A.K., Bertino, E.: Hiding Association Rules by
Using Confidence and Support. In: The 4th International Workshop on Information
Hiding, pp. 369–383 (2001)
6. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-
Completeness. W. H. Freeman, New York (1979)
7. O'Leary, D.E.: Knowledge Discovery as a Threat to Database Security. Knowledge
Discovery in Databases, 507–516 (1991)
1 Introduction
Prognostics has recently attracted much attention from researchers in different areas such as sensors, reliability engineering, machine learning/data mining, and machinery maintenance.
maintenance. It is defined as the detection of precursors of a failure and accurate
prediction of remaining useful life (RUL) or time-to-failure (TTF) before a failure [1],
and it is expected to provide a paradigm shift from traditional reactive-based
maintenance to proactive-based maintenance. The final aim is to perform the necessary
maintenance actions “just-in-time” in order to minimize costs and increase availability.
Many researchers have focused on developing knowledge-based prognostics [2-6]. With this approach, the prognostic models are built using physics and materials knowledge [7]. Data-driven prognostics is an alternative approach in which the models
are developed using huge amounts of historical data and techniques from data mining
[8-9]. The data-driven models can predict the likelihood of a component failure and are
applicable to various applications such as aircraft maintenance [8] and train wheel
failure predictions [9]. However, they may sometimes fail to precisely predict the
remaining useful life or time-to-failure due to limitations of the techniques used with
respect to important variations in the data. To address this issue, we propose to focus on
the identification of reliable patterns for prognostics. A reliable pattern identifies the
evolution of states such as degradation of the physical component and provides an onset
for TTF estimation. In this paper, we propose a novel KDD methodology that uses
historic operation and sensor data to automatically identify such patterns. After
describing the methodology, we present results from the application on a real-world
problem: train wheel failure prognostics.
The rest of the paper is organized as follows: Section 2 introduces the KDD
methodology; Section 3 presents the application domain: train wheel prognostics;
Section 4 presents the experiments and some preliminary results; Section 5 discusses the
results and limitations; and the final section concludes the paper.
2 Methodology
In modern operation of complex electro-mechanical systems, a given component or
subsystem generates operational or sensor data periodically. When we combine all
sensor data for a given component, we obtain multivariate time-series that characterizes
the evolution of that component from its installation to its removal. Generalizing to a
fleet of vehicles, we obtain a series of multivariate time-series, each characterizing an
individual component. For example, the monitoring of train wheels can generate thousands of multivariate time-series (Section 3). Each of these time-series, denoted s_i, is associated with a unique component ID, denoted SequenceID. We denote an instance within a time-series as x_ij, referring to the j-th instance of the i-th time-series in the multivariate time-series. An instance x_ij is represented by a set of attributes (a ∈ {a_1, a_2, …, a_n}) and is associated with a timestamp t_j. Therefore, an instance can be written as x_ij = {a_1, a_2, …, a_m, …, t_j} and a time-series as s_i = {x_i1, x_i2, …, x_ij, …}.
The operational database for a complex system, denoted DS, consists of a set of multivariate time-series, i.e., DS ∈ {s_1, s_2, s_3, …, s_i, …, s_n}. A pattern, p_k, is defined as a combination of multiple observations. Its length, m, is the number of observations in a given time period starting from the k-th observation. As a result, p_k is written as p_k = {x_ik, x_ik+1, …, x_ik+m}, i.e., p_k = {(a_ik, t_k), (a_ik+1, t_k+1), …, (a_ik+m, t_k+m)}. Here, t_k is the TTF estimate associated with pattern p_k.
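For concreteness, the notation above can be represented with simple data structures such as the following sketch (the field and class names are ours):

```python
# Minimal data model for the notation: instances, time-series, and patterns.
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:               # x_ij
    attributes: List[float]   # a_1 ... a_m
    timestamp: float          # t_j

@dataclass
class TimeSeries:             # s_i
    sequence_id: str          # unique component ID (SequenceID)
    instances: List[Instance]

@dataclass
class Pattern:                # p_k = {x_ik, ..., x_ik+m}
    observations: List[Instance]
    ttf_estimate: float       # t_k
```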
The challenge is to find the reliable patterns p_k from a given dataset DS ∈ {s_1, s_2, s_3, …, s_i, …, s_n}. A reliable pattern can be expressed using numerical values or symbols representing various states. In this paper, we focus on symbolic patterns as they allow us to represent the evolution of states in an easy-to-understand manner. For
example, a symbolic pattern “DEF” contains three statuses along with information on
transitions that happened. It shows that a monitored component changes from status
“D” to “E” to “F”. To find such symbolic patterns, we developed the
KDD methodology shown in Table 1. The methodology consists of three main
processes: (1) Symbolizing time-series data to generate a symbolic dataset
A. Dimensionality reduction
The dimensionality reduction process compresses all instances within a given window
size into a new observation. Several techniques such as Fourier transformations,
wavelets, and Piecewise Aggregate Approximation (PAA) [10-11] could be used for
this task. In this work, we opted to use PAA as it could be directly applied without
further processing of the real world equipment data available for this project. Most of
the other techniques would have required additional pre-processing to account for
data issues such as missing values and irregular data sampling. Following PAA, we
convert each time-series of length m_i into a new time-series of length n_i (n_i < m_i). Precisely, the initial sequence s_i = {x_i1, x_i2, …, x_ij, …} (j = 1, 2, …, m_i) is converted into a new dimensionality-reduced sequence s̄_i = {x̄_i1, x̄_i2, …, x̄_ij, …} (j = 1, 2, …, n_i) by using the following equation:
x̄_ij = (n_i / m_i) Σ_{l = k(j−1)+1}^{k·j} x_il   (1)

where x̄_ij is the j-th observation in s̄_i, x_il is the l-th original observation in s_i, and k is the window size for dimensionality reduction.
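A small PAA sketch following (1) is shown below; leftover samples that do not fill a complete window are simply truncated, which is our own simplification.

```python
# Piecewise Aggregate Approximation: average the raw series over windows of size k.
import numpy as np

def paa(series, n_out):
    series = np.asarray(series, dtype=float)
    k = len(series) // n_out                     # window size k = m_i / n_i
    trimmed = series[: k * n_out]
    return trimmed.reshape(n_out, k).mean(axis=1)

reduced = paa([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], n_out=4)   # -> [2. 5. 8. 11.]
```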
Fig. 1. The breakpoints for symbolization
Fig. 2. The target window for patterns (time axis showing the failure event, a detected pattern at time t_p, and the target window between t_s and t_e)
Having obtained a symbolic sequence dataset (CS), we can initiate the search for
patterns. For this purpose, we divide CS into a training dataset (S) and a testing
dataset (T). To search patterns from S for a given pattern size, a Full Space Search
(FSS) algorithm is developed. FSS consists of the following 5 steps:
time (t_e). Using the window size (w), the utility (u) for a pattern is computed by Equation (2):

u = w − t_p,  if t_p ∈ [t_s, t_e];   u = −t_p,  if t_p ∉ [t_s, t_e]   (2)

where w is the window size (w = t_e − t_s) and t_p is the TTF estimate of a pattern. The total utility for a set of patterns is computed as follows:

U = Σ_{i=1}^{k} Σ_{j=1}^{N_i} (α_i × u_ij)   (3)

where k is the number of reliable patterns, N_i is the number of failure events where pattern i appears, u_ij is the utility of pattern i in the j-th time-series, and α_i is the confidence of pattern i based on pattern frequency.
Using Equations (2) and (3), we evaluate the reliability of patterns by computing the total utility on the testing dataset (T). The larger the total utility, the higher the pattern reliability and the better the pattern performance.
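A minimal sketch of the total-utility aggregation in (3) is shown below; the per-pattern utilities u_ij are assumed to have already been computed from (2), and the example numbers are illustrative.

```python
# Total utility over a set of reliable patterns, Eq. (3).
def total_utility(alphas, utilities):
    """alphas[i]: confidence of pattern i; utilities[i]: the utilities u_ij over
    the N_i failure events in the test set where pattern i appears."""
    return sum(a * sum(us) for a, us in zip(alphas, utilities))

# e.g. three reliable patterns evaluated on a test set
U = total_utility([0.9, 0.8, 0.7], [[12.0, 15.0], [10.0], [8.0, -3.0]])
```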
above to WILD data in order to see if they could be used for prognostics of train
wheel failures.
For this study, we used data collected over a period of 17 months from a fleet of
804 large cars with 12 axles each. After data pre-processing, we ended up with a
dataset containing 2,409,696 instances grouped in 9906 time-series (one time-series
for each axle in operation during the study). Out of the 9906 time-series, only 218 are
associated with an occurrence of a wheel failure. Therefore, we selected these 218
time-series for pattern discovery and pattern evaluation. We divided these 218 time-series into a training dataset (106 time-series) and a testing dataset (112 time-series).
• The number of symbols (S#)
We used 8, 10, 12, and 14 when converting the numeric values to symbols (nominal values), i.e., S# = 8, 10, 12, or 14.
• The breakpoints (BKs)
The breakpoints determine how we map the original numerical values to the selected
symbols. This directly affects the quality of the resulting patterns. We need to take
into account the distribution and characteristics of original time-series as well as the
particularities of the application. As discussed above, we adopted an unequal area for
the symbol definitions. For each selected number of symbols, several breakpoints
were tried to improve the quality of patterns. We use BK-x to denote a breakpoint setting, where K is the number of symbols (S#) and x indexes the candidate settings for that value of K. Corresponding to the symbol numbers 8, 10, 12, and 14, the breakpoints were chosen as follows (the unit is percentage; a hedged sketch of the resulting value-to-symbol mapping follows the list):
For 8 symbols: B8-1 ={ 2.5, 5, 7.5, 44.38, 81.26, 91.26, 97.51}
B8-2 ={ 6.25, 12.5, 25, 43.75, 62.5, 81.25, 93.75}
B8-3 ={ 2.5, 6.25, 11.25, 17.5, 51.25, 85, 97.5}
For 10 symbols: B10-1 ={ 2, 4, 6, 25.75, 45.5, 65.25, 85, 93, 98}
B10-2 ={ 5, 10, 20, 30, 40, 55, 70, 85, 95}
B10-3 ={ 2, 5, 9, 14, 32.5, 51, 69.5, 88, 98}
For 12 symbols: B12-1 ={ 1.67, 3.34, 5.01, 18.76, 32.51, 46.01, 59.76, 73.51, 87.26, 93.93, 98.1}
B12-2 ={ 4.17, 8.34, 16.67, 25, 33.33, 41.66, 49.99, 62.49, 74.99, 87.49, 95.82}
B12-3 ={ 1.67, 4.17, 7.5, 11.67, 24.68, 37.69, 50.7, 63.71, 76.72, 89.73, 98.06}
For 14 symbols:
B14-1 ={ 1.43, 3.57, 6.43, 10, 20.18, 30.36, 40.54, 50.72, 60.9, 71.08, 81.26, 91.44, 98.58}
B14-2 ={ 3.57, 7.14, 14.28, 21.42, 28.56, 35.7, 42.84, 49.98, 57.12,67.83, 78.54, 89.25, 96.39}
B14-3 ={1.43, 2.86, 4.29, 14.92, 25.55, 36.18, 46.81, 57.44, 68.07,78.7, 89.33, 95.04, 98.61}
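The sketch below illustrates one possible reading of this mapping, assuming the breakpoints are percentiles of the observed (PAA-reduced) value distribution; this interpretation and the helper names are our assumptions.

```python
# Map numeric values to symbols using percentile breakpoints such as B8-2.
import string
import numpy as np

def symbolize(values, breakpoints_pct):
    cuts = np.percentile(values, breakpoints_pct)            # numeric cut-points
    symbols = string.ascii_uppercase[: len(breakpoints_pct) + 1]
    return "".join(symbols[np.searchsorted(cuts, v)] for v in values)

b8_2 = [6.25, 12.5, 25, 43.75, 62.5, 81.25, 93.75]
print(symbolize([0.1, 0.4, 0.8, 2.5, 9.0], b8_2))
```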
• The number of reliable patterns
In practice, there might be several patterns representing relevant state transition
trends. In order for the discovered patterns to cover more failure events, a reasonable
number is needed for selecting reliable patterns. In our experiment, we choose the top
3 patterns as the reliable patterns from each experiment for evaluation.
• The target window size of time to failure
The target window size directly impacts the pattern evaluation criteria. For this application, we used 20 days as the target window size of time to failure. Therefore, t_s equals −22 and t_e equals −2 (the unit of time is days in this application).
Table 2 presents results for a number of runs. The results are obtained by applying
the discovered patterns on the test dataset which contains 112 wheel failures (time-
series). The total utility for each experiment is computed using the key parameters
and Equations 2 and 3. Table 2 also shows the mean and deviation of the TTF
estimates as well as the problem coverage information. The problem coverage is
computed as a problem detection rate (PDR) which is the number of detected
problems over the total number of problems in the test dataset. Whenever one of the
reliable patterns appears in a time-series, the failure is considered as detected. The
mean and standard deviation information for the TTF estimates have been obtained
using the differences between TTF estimates of a pattern and the actual failure time.
Table 2. Some results selected from the experiments (# of problems (time-series) = 112)
These results strongly support the applicability and feasibility of the proposed
approach. However, the proposed method is empirical. Many factors, such as the parameter settings, data quality, and application requirements, directly affect the reliability of patterns. To find reliable patterns, we needed to conduct a large number of experiments using a trial-and-error process. It is also worth noting that the patterns
found using the FSS algorithm from the symbolic sequence data do not have
transition probabilities from one state to another. It would be desirable that a pattern
provides practitioners with additional information on state transitions. For example,
pattern “BDF” should provide the transition probabilities from “B” to “D” and from “D” to “F”, such that the prognostic decision could be effectively made for applications.
To this end, we could use the Gibbs algorithm [14] during the search for patterns and
directly obtain state transition probabilities.
It is worth noting that the pattern length has a significant impact on the prognostic
performance. Generally speaking, the longer the pattern is, the higher the problem
detection rate is. However, the precision of TTF estimation deteriorates with increasing pattern length, as demonstrated by the results of Exp 1.
With a relatively long pattern, the total utility becomes lower.
We also noted that the proposed pattern evaluation method appears adequate for selecting reliable patterns. The higher the total utility is, the more precise the TTF
estimation is. For example, Exp 4 and Exp 5 with higher utility have more accurate
TTF estimates compared to the other experiments.
6 Conclusions
In this paper, we presented a KDD methodology to discover reliable patterns from
multivariate time-series data. The methodology consists of three main processes:
creating symbolic sequences from time-series, searching for patterns from the
symbolic sequences, and evaluating the patterns with a novel utility-based method.
The developed techniques have been applied to a real-world application: prognostics
of train wheel failures. The preliminary results demonstrated the applicability of the
developed techniques. The paper also discussed the results and the limitations of the
methodology.
Acknowledgment
Special thanks go to Kendra Seu, and Olena Frolova for their hard work when they
were at NRC. We also extend our thanks to Chris Drummond and Marvin Zaluski for
their valuable discussions.
References
[1] Schwabacher, M., Goebel, K.: A survey of artificial intelligence for prognostics. In: The AAAI Fall Symposium on Artificial Intelligence for Prognostics, pp. 107–114. AAAI
Press, Arlington (2007)
[2] Brown, E.R., McCollom, N.N., Moore, E.E., Hess, A.: Prognostics and health
management: a data-driven approach to supporting the F-35 lightning II. In: IEEE
Aerospace Conference (2007)
[3] Camci, F., Valentine, G.S., Navarra, K.: Methodologies for integration of PHM systems
with maintenance data. In: IEEE Aerospace Conference (2007)
[4] Daw, C.S., Finney, E.A., Tracy, E.R.: A review of symbolic analysis of experimental
data. Review of Scientific Instruments 74(2), 915–930 (2003)
[5] Przytula, E.W., Chol, A.: Reasoning framework for diagnosis and prognosis. In: IEEE
Aerospace Conference (2007)
[6] Usynin, A., Hines, J.W., Urmanov, A.: Formulation of prognostics requirements. In: IEEE
Aerospace Conference (2007)
[7] Luo, M., Wang, D., Pham, M., Low, C.B., Zhang, J.B., Zhang, D.H., Zhao, Y.Z.: Model-
based fault diagnosis/prognosis for wheeled mobile robots: a review. In: The 31st Annual
Conference of IEEE Industry Electronics Society 2005, New York (2005)
[8] Létourneau, S., Yang, C., Drummond, C., Scarlett, E., Valdés, J., Zaluski, M.: A domain
independent data mining methodology for prognostics. In: The 59th Meeting of the
Machinery Failure Prevention Technology Society (2005)
[9] Yang, C., Létourneau, S.: Learning to predict train wheel failures. In: The 11th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD
2005), pp. 516–525 (2005)
[10] Keogh, E., Lin, J., Fu, A., Herle, H.V.: Finding the unusual medical time series:
algorithms and applications. IEEE Transactions on Information Technology in
Biomedicine (2005)
[11] Keogh, E., Chu, S., Hart, D., Pazzani, M.: Segmenting time series: A survey and novel
approach. In: Kandel, Bunke (eds.) Data Mining in Time Series Databases, pp. 1–44.
World Scientific Publishing, Singapore (2004)
[12] Mielikäinen, T., Mannila, H.: The pattern ordering problem. In: Lavrač, N., Gamberger,
D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838,
pp. 327–338. Springer, Heidelberg (2003)
[13] Xin, D., Cheng, H., Yan, X., Han, J.: Extracting redundancy-aware top-k patterns. In:
ACM KDD 2006, Philadelphia, Pennsylvania, USA, pp. 444–453 (2006)
[14] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.:
Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignments.
Science 262(8), 208–214 (1993)
Automating the Selection of Stories for AI in the News
1 Introduction
Selecting interesting news stories about AI, or any other topic, requires more than
searching for individual terms. The AAAI started collecting current news stories
about AI and making them available to interested readers several years ago, with
manual selection and publishing by an intelligent webmaster.
Current news stories from credible sources that are considered relevant to AI and
interesting to readers are presented every week in five different formats: (i) posting
summarized news stories on the AI in the News page of the AITopics web site [2],(ii)
sending periodic email messages to subscribers through the “AI Alerts” service, (iii)
posting RSS feeds for stories associated with major AITopics,(iv) archiving each
month’s collection of stories for later reference, and(v) posting each news story into a
separate page on the AITopics web site.1
Manually finding and posting stories that are likely to be interesting is time-consuming. Therefore, we have developed an AI program, NewsFinder, that collects news stories from selected sources, rates them with respect to a learned measure of goodness, and publishes them in the five formats mentioned. Off-the-shelf implementations of several existing techniques were integrated into a working system for the AAAI.
1 Anyone may view current and archived stories and subscribe to any of the RSS feeds; alerts are available only to AAAI members.
1.1 Crawling
In the crawling phase, the program collects a large number of recent news stories
about AI. Since crawling the entire web for stories mentioning a specific term like
‘artificial intelligence’ brings in far too many stories, we restrict the crawling to about
two dozen major news publications. This makes a story more credible and more likely
to interest an international audience. The system administrators (AI subject matter
experts) maintain a list of news sources, chosen for their international scope,
credibility, and stability. These include The BBC, The New York Times, Forbes, The
Wall Street Journal, MIT Technology Review, CNET, Discovery, Popular Science,
Wired, The Washington Post, and The Guardian. Others can be added to the list.
NewsFinder collects the latest news stories via either in-site search or RSS feeds
from those sources and filters out blogs, press releases, and advertisements. If a source
has a search function, then the program uses it to find stories that contain ‘artificial
intelligence’ or ‘robots’ or ‘machine learning’. If a source has RSS feeds, then
NewsFinder selects those feeds labeled as ‘technology’ or ‘science’.
In order to parse the text to retrieve the content of candidate pages, we write a
specific HTML parser for each news source to identify and extract the news content
from its news web pages. The advantage of this method is precision, in that it can
accurately extract news text stories and eliminate advertisements, user comments,
navigation bars, menus and irrelevant in-site hyperlinks. The disadvantage of writing
separate parsers for each news source is somewhat offset by starting with a generic
template. We have written a dozen specific source parsers as modifications of code
inherited from a base parser. Each parser is specifically designed for one news source
web site since different sites use different HTML/CSS tags. We are also investigating
an alternative method [5, 10] in which a classification method is used to train parsers to
recognize news content either by counting hyperlinked words or by visual layout.
For a typical news source the parser will extract three items from the metadata
associated with each news item: URL, title, and publication date. If the publication
date is outside the crawling period (currently seven days), the news story is
skipped.2 For the remaining stories, the parser extracts the text of each story from
its URL.
NewsFinder then processes the natural language text, using the Natural Language
Toolkit (NLTK) [7] to perform word counting, morphing, stemming, and removal of
the most common words from a stoplist.3
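As an illustration of this preprocessing step, the short sketch below uses NLTK to tokenize, stem, and stop-filter a story before counting terms; the choice of the Porter stemmer and NLTK's English stoplist is an assumption, not necessarily the exact NewsFinder configuration.

```python
# Hedged preprocessing sketch (assumed choices: NLTK English stoplist, Porter stemmer).
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    stemmer = PorterStemmer()
    stoplist = set(stopwords.words("english"))
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    stems = [stemmer.stem(t) for t in tokens if t not in stoplist]
    return Counter(stems)  # word counts over stemmed, stop-filtered tokens
```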
A text summarization algorithm extracts 4-5 sentences from the story to build a
short description (the highlights that make the story interesting), since an arbitrary
paragraph, like the first or last, is often not informative. The main idea of the
algorithm is to first measure the term frequency over the entire story, and then select
the 4-5 sentences containing the most frequent terms. In the end, it re-assembles the
selected top 4-5 sentences in their original order for readability.
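A minimal sketch of such a frequency-based summarizer is given below; the sentence splitter and function names are illustrative and are not taken from NewsFinder.

```python
# Frequency-based extractive summary: score sentences by the story-wide term
# frequencies of their words, keep the top ones, and restore the original order.
import re
from collections import Counter

def summarize(story, n_sentences=5, stoplist=frozenset()):
    sentences = re.split(r'(?<=[.!?])\s+', story.strip())
    words = [w for w in re.findall(r'[a-z]+', story.lower()) if w not in stoplist]
    freq = Counter(words)
    def score(sentence):
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:n_sentences]
    return ' '.join(sentences[i] for i in sorted(top))
```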
NewsFinder references a Whitelist of terms whose inclusion in a story is required
for further consideration. If the extracted text contains no Whitelist term, the story is
skipped. In addition to the term ‘artificial intelligence’, Whitelist includes several
dozen other words, bigrams and trigrams that indicate a story has additional relevance
and interest beyond the search term used to find it in the first place. For example,
stories are retrieved from RSS feeds for the topic ‘robots’, but an additional mention
of ‘autonomous robots’ or ‘unmanned vehicles’ suggests that AI is discussed in
sufficient detail to interest AAAI readers.
2 When using Google News, we also skip stories originating from a URL on our list of
inappropriate domains. We set up the list initially to block formerly legitimate domains that
have been purchased by inappropriate providers, but it can be used to block any that are
known to be unreliable or offensive.
3 The program also includes a Named Entity recognition algorithm, but it is not used routinely
because it runs slowly. Instead, we check for names of particular interest, like “Turing”, by
adding them to the Goodlist described in the Ranking section.
The program then determines the main topic of each story. It uses the traditional
Salton tf-idf cosine vector algorithm [8, 11] to measure the similarity of a story to the
introductory pages of each of the major topics on the AITopics web site.4,5 Each
document is treated as a vector with one component corresponding to each term and
its tf-idf weight. Thus, we can measure the similarity of two documents by measuring
the dot product of their normalized vectors, which produces the cosine of the vectors’
angle in a multi-dimensional space [8].
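The sketch below illustrates this topic-assignment step with scikit-learn's tf-idf and cosine-similarity utilities; it is only an illustration of the idea, not the authors' PHP/Python implementation.

```python
# Assign a story to the AITopics page whose introductory text is most similar,
# measured by the cosine of the angle between tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def assign_topic(story_text, topic_pages):
    """topic_pages: dict mapping topic name -> introductory page text."""
    names = list(topic_pages)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(
        [story_text] + [topic_pages[n] for n in names])
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return names[sims.argmax()]
```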
The story is then linked to the AITopics page with the highest similarity, so that
readers wanting to follow up on a story can find background information on that
topic. The story is also added to a list for the RSS feed for the selected topic. At
publication time the topic is shown with the story and the RSS feed that contains it.
Finally, NewsFinder saves the candidate news stories and their metadata into a
database for subsequent processing.
1.2 Training
4 The current major topics are: AI Overview, Agents, Applications / Expert Systems, Cognitive
Science, Education, Ethical & Social, Games & Puzzles, History, Interfaces, Machine
Learning, Natural Language, Philosophy, Reasoning, Representation, Robots, Science Fiction,
Speech, Systems & Languages, Vision.
5 The topic assignment algorithm was originally written in PHP by Tom Charytoniuk and
rewritten in Python by Liang Dong.
address is a proxy for a user ID and allows us to record just one vote per news item
per IP address.6
During training, all the readers’ ratings are collected and averaged. If a news story
has fewer ratings than a specified number, the average rating is ignored (unless it is
from one of the administrators). If the standard deviation of a news story’s ratings is
greater than a cutoff (default 2.0), the ratings are discarded as well. This way, a news
story is only added to the training set if there is general consensus among several
raters about it (or if one of the administrators ranks it).
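A sketch of this consensus filter is shown below; the minimum number of ratings is not stated in the text, so min_ratings is an assumed parameter, while the standard-deviation cutoff of 2.0 follows the description above.

```python
# Return the averaged rating to add to the training set, or None to skip the story.
import statistics

def training_label(ratings, from_admin=False, min_ratings=3, std_cutoff=2.0):
    if from_admin:
        return statistics.mean(ratings)       # an administrator's rating always counts
    if len(ratings) < min_ratings:
        return None                           # too few raters: ignore the average
    if len(ratings) > 1 and statistics.stdev(ratings) > std_cutoff:
        return None                           # no consensus: discard the ratings
    return statistics.mean(ratings)
```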
The Support Vector Machine (SVM) [3] is a widely used supervised learning
method which can be used for classification, regression or other tasks by constructing
a hyperplane or set of hyperplanes in a high-dimensional space. An SVM from the
Python library LibSVM [4] has been trained with manually scored stories from the
web to classify the goodness of each story into one of three categories: (a) high –
interesting enough to AI in the News readers to publish, (b) medium – relevant but not
as interesting to readers as the first group, and (c) low – not likely to interest
readers. Currently, we build three ‘one against the rest’ classifiers to identify these
three sets.
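The following sketch shows what the three ‘one against the rest’ classifiers might look like; the authors used LibSVM's Python interface directly, whereas scikit-learn's SVC (itself built on LibSVM) is substituted here for brevity.

```python
# Train one binary SVM per category (high / medium / low) against the rest.
from sklearn.svm import SVC

def train_one_vs_rest(tfidf_matrix, labels, categories=("high", "medium", "low")):
    classifiers = {}
    for cat in categories:
        y = [1 if lab == cat else 0 for lab in labels]
        clf = SVC(probability=True)           # probability estimates are needed for scoring
        clf.fit(tfidf_matrix, y)
        classifiers[cat] = clf
    return classifiers
```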
1.3 Ranking
After crawling and training, the next step is ranking the candidate stories during the
current news period by computing and comparing the scores of all news stories
crawled during the period. The score for each news story is computed in two steps: (i)
assign an SVM score and (ii) adjust it using a key term score.
The SVM score is assigned based on which of the three SVM categories has the
highest probability: high interest = 5, medium = 3, low or no interest = 0. If none of
the classifiers assigns a 50% or greater probability of the story being in its category,
the default score for the story is 1. The probability is based on the tf-idf measure of
interest of all non-stop words in the document, typically about 200 words.
NewsFinder performs an adjustment to the SVM score by first retrieving every
recent news story containing a term from a list called Goodlist. Terms on Goodlist are
those whose inclusion in a story signals higher interest, as determined by subject-
matter experts.
NewsFinder then measures the tf-idf score for each Goodlist term. All the term
scores are accumulated and normalized across the recent stories.
When a new topic of interest first appears in AI, as “semantic web” did several
years ago, the SVM can automatically recognize its importance as readers give high
ratings to stories on this topic. Normal practice is for authors of stories on a new topic
to tie the topic to the existing literature. However, an administrator (who is a subject
matter expert) may also add new terms to Goodlist to jump-start this practice. Although
one can imagine many dozen key terms on Goodlist, the initial two tests reported
here used only 12 terms.
6 As with Netflix, if there are multiple ratings for the same story from the same reader (more
precisely, from the same IP address), only the last vote is used.
The same process is executed for terms on a list called Badlist. Terms on Badlist
are those whose inclusion in a story signals lower interest. Initial testing was done
using 12 Badlist terms. Both Goodlist and Badlist are easily edited in the setup file.
The key term score from Goodlist lies in [0, +1], which boosts the final score. The
key term score from Badlist, which reduces the final score, is unbounded. Unlike the
terms on Whitelist, whose omission forces exclusion of a story from further
consideration, the terms on Goodlist and Badlist merely add or subtract from the
initial SVM score based on the number of terms appearing and their frequency of
occurrence. Multi-word terms on Goodlist, such as ‘unmanned vehicle’, have been
manually selected as signals of increased interest. Badlist terms such as ‘ceo’, ‘actor’,
and ‘movie’ can reduce the score for including unrelated news such as gossip about
actors who appeared in Spielberg’s movie “Artificial Intelligence.” The terms ‘tele-
operated’ and ‘manually operated’ similarly reduce the score on many stories about
robots that are less likely to involve AI.
The computation of the key term score is as follows: given a key term such as
‘automated robot’, the program first finds all the recent stories containing both
‘automated’ and ‘robot’. Then it computes the tf-idf score for each term, and adds all
the tf-idf scores for this story.
After NewsFinder obtains the trained SVM score and key term score, each news
story’s final score is a weighted sum of its SVM score and its key term score, where
the weight w of the key term score was selected heuristically to be 3.0:
Score = SVMScore + w · KeyTermScore
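A sketch of the two-step score is given below; the 5/3/0 point mapping, the 50% probability rule, the default score of 1 and the weight w = 3.0 follow the text, while the function and variable names are illustrative.

```python
# Combine the SVM category score with the Goodlist/Badlist key term adjustment.
def final_score(category_probs, goodlist_score, badlist_score, w=3.0):
    """category_probs: dict like {'high': 0.6, 'medium': 0.3, 'low': 0.1}."""
    points = {"high": 5, "medium": 3, "low": 0}
    best = max(category_probs, key=category_probs.get)
    svm_score = points[best] if category_probs[best] >= 0.5 else 1
    key_term_score = goodlist_score - badlist_score   # boost in [0, 1], penalty unbounded
    return svm_score + w * key_term_score
```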
Currently, both Goodlist and Badlist are manually maintained by the webmaster, in
order to control quality during startup. When the size of the training set reaches about
500 stories, we plan to remove both lists.
It is worth noting that the length of the story doesn’t affect the SVM score, since
each story’s tf-idf is normalized before being classified. But it affects the key term
scores, since each term’s tf-idf depends on the number of terms in the document.
However, longer stories are prima facie more likely to include more key terms. In
addition, when selecting from among similar stories, the program prefers longer ones.
After all the potential candidates have been scored, NewsFinder measures the text
similarity to eliminate duplicate stories. The program clusters all the news candidates
to identify news about the same event. These may be exact duplicates (e.g., the same
story from one wire service used in different publications), or they may be two reports
of the same event (e.g., separately written announcements of the winner of a
competition). Again, NewsFinder measures the cosine of the angle between the two
documents’ tf-idf vectors to determine their similarity in the vector space. If the computed
similarity value is greater than a cutoff (0.33 by default), these stories are clustered
together. If there is more than one story in a group, the story with the highest final
score is kept for publishing.
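The sketch below approximates this duplicate-elimination step with a greedy rule (drop a story whenever a sufficiently similar story has a higher final score); the 0.33 cutoff comes from the text, while the greedy grouping is an assumption.

```python
# Keep only the highest-scoring story among near-duplicates (cosine similarity > cutoff).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(stories, scores, cutoff=0.33):
    sims = cosine_similarity(TfidfVectorizer(stop_words="english").fit_transform(stories))
    kept = []
    for i in range(len(stories)):
        dominated = any(sims[i, j] > cutoff and scores[j] > scores[i]
                        for j in range(len(stories)) if j != i)
        if not dominated:
            kept.append(i)
    return kept          # indices of the stories to publish
```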
The N-highest-scoring stories are selected for publishing each week. At the current
time, these are the N (or fewer) “most interesting” stories with final scores above a
threshold of 3.0; i.e., ranked “medium” to “very high.” For the initial testing, N=20; in
the last test and current version, N=12.
1.4 Publishing
The stories selected for publishing are those with the highest final scores from the
ranking phase, but these still need to be formatted for publishing in different ways: (i)
posting summarized news stories on the Latest AI in the News page of the AITopics
web site, (ii) sending periodic email messages to subscribers through the “AI Alerts”
service, (iii) posting RSS feeds for stories associated with major AITopics, (iv)
archiving each month’s collection of stories for later reference, and (v) posting each
news story into a separate page on the AITopics web site.
2 Validation
2.1 SVM Alone
After training on the first 100 cases scored manually, we determined the extent to
which the selections of the SVM part of NewsFinder matched our own. For a new set
of 49 stories retrieved from Google News by searching for ‘artificial intelligence’, we
marked each story as “include” or “exclude” from the stories we would want
published, and we matched these against the list of stories NewsFinder would publish,
without use of the additional knowledge of terms on Goodlist and Badlist. On the
unseen new set of 49 recent stories crawled from Google News, the SVM put 46 of 49
stories (94%) into the same two categories – include as “relevant and interesting” or
exclude – as we did. Five stories would have been included for the 10-day period,
which we take to be about right (but on the low side) for weekly email alerts.
This was not a formal study with careful controls, since the person rating the stories
could see the program’s ratings, and the SVM was retrained using some of the same
stories it then scored again. Nevertheless, it did suggest that the SVM was worth
keeping. It also suggested that merely using an RSS feed or broad web search with a
term like ‘artificial intelligence’ would return many more irrelevant and low-interest
stories than we wanted. In a one-week period Google News returned 400 candidate
stories mentioning the term ‘artificial intelligence’, 88 mentioning ‘machine learning’,
8,195 mentioning ‘robot’, and 2,264 mentioning ‘robotics.’ We concluded that not all
would be good to publish in AI in the News, nor would readers want this many.
In a subsequent test, we used a specified set of credible news sources, a training set of
265 stories (including the 149 from the initial test), and a test set of 69 new
stories. The full NewsFinder program was used, with scores from the SVM adjusted by
additional knowledge of good and bad terms to look for. We compared the program’s
decision to include or exclude from the published set against the judgment of one
administrator (BGB) that was made before looking at the program’s score.
We accumulated scores and ratings by the administrator for 3-4 stories per day that
were not in each previous night’s training set, a total of 69 stories in the first three
weeks of September, 2010. Although the SVM is improving (or at least changing)
each night, these stories are truly “unseen” in the sense that they did not yet appear in
the training set used to train the classifier that scored them. Among 42 stories that the
program scored above the publication threshold (≥ 3.0), the administrator rated 33
(78.6%) above threshold.
Out of 27 candidate stories that the program rated below the publication threshold
(< 3.0), the administrator rated 11 (40.7%) below threshold. Thus the program is
publishing mostly stories that the administrator agrees should be published, but is
omitting about half the likely candidates that the administrator rates above
threshold. The 27 candidates in this study that were not published were mostly “near
misses.” Many were rated 3 by the administrator, indicating that they were OK, but
not great. Also, a few of the stories the administrator would have published may be
selected on a later day, after retraining or when their normalized scores rise above
threshold because the best story in the new set is not as good as in the previous
set. Given a limit of twelve stories, the tradeoff between false positives and false
negatives weighs in favor of omitting some good stories over including uninteresting
or marginal ones.
We conducted a 5-fold cross validation for 218 stories with administrator ratings to
validate the performance of the SVM classifier (before adjustment). As above, each of
the tests was on “unseen” stories. For these 218 valid ratings, we counted the times
that the administrator and the SVM classified a story in the same way. The accuracy of
the “high” predictions was 66.5%, of the “medium” ratings 72.9%, and of the “low”
ratings 74.3%.
After the completion of these tests, some adjustments were made to correct occasional
problems noted during testing.
─ A story categorized as low or no interest by the SVM (category 0) is not published,
regardless of its adjusted score.
─ The threshold for similarity of two news stories was lowered from 0.4 to 0.33 to
reduce the number of duplicates.
─ Whitelist and Goodlist were made to contain the same terms, though their uses
remain different. Thus a story must contain at least one of several dozen terms to be
considered at all (Whitelist), and the more occurrences of these terms that are
found in a story, the more its score will be boosted (Goodlist). Three new terms
were added to Whitelist and Goodlist.
─ Upward adjustments to the score from the key terms on Goodlist are now
normalized to the highest adjustment in any period, because adding a larger number
of Goodlist terms created uncontrollably large adjustments. (Unbounded downward
adjustments do not concern us because stories containing Badlist terms are
unwanted anyway.)
─ Terms having to do with tele-operated robots and Hollywood movies were added
to the Badlist, thus downgrading stories that are about manually controlled robots
or movie personalities.
─ The frequency with which the program searches for stories and publishes a new AI
in the News page was changed from daily to weekly.
─ The number of stories published in any period has been changed from 20 to 12,
since that will reduce the false positives and also reduce the size of weekly email
messages to busy people.
─ Stories can be added manually to be included in the current set of stories to be
ranked. Thus when an interesting story is published in a source other than the ones
we crawl automatically, it can be considered for publication. It will also be included
in subsequent training, which may help offset the inertia of training over the
accumulation of all past stories and the lag time in recognizing new topics.
A follow-up test was performed on 118 unseen stories to confirm that the changes we
had made were not detrimental to performance. We also gathered additional statistics
to help us understand the program’s behavior better. Two-thirds of the stories (80/118)
were at or above the program’s publication threshold of 3.0, based on their initial
SVM and adjustment scores.
Among 118 stories that passed the relevance screening and duplicate elimination,
and thus were scored with respect to interest, the overall rate of agreement between
the program and an administrator is 74.6% on decisions to publish or not (threshold ≥
3.0), with Precision = 0.813, Recall = 0.813, and F1 = 0.813. Both the program and the
administrator recommend publishing about two-thirds of the stories passing the
relevance filters, just not the same two-thirds.
                           Admin: Publish    Admin: Don’t Publish
NewsFinder: Publish        65 (55%)          15 (13%)
NewsFinder: Don’t Publish  15 (13%)          23 (20%)
3 Conclusions
Replacing a time-consuming manual operation with an AI program is an obvious
thing for the AAAI to do. Although intelligent selection of news stories from the web
is not as simple to implement as it is to imagine, we have shown it is possible to
integrate many existing techniques into one system for this task, at low cost. There are
many different operations, each requiring several parameters to implement the
heuristics of deciding which stories are good enough to present to readers. The two-
step scoring system appears to be a conceptually simple way of combining a trainable
SVM classifier based on term frequencies with prior knowledge of semantically
significant term relationships.
NewsFinder has not been in operation for long, but it appears to be capable of
providing a valuable service. We speculate that it could be generalized to alert
other groups of people to news stories that are relevant to the focus of the group and
highly interesting to many or most of the group. The program itself is not specific to
AI, but the terms on Goodlist and Badlist, the terms used for searching news sites and
RSS feeds, and to some extent the list of sources to be scanned, are specific to AI.
Learning how to select stories that the group rates highly adds generality as well as
flexibility to change its criteria as the interests of the group change over time.
References
1. 5 star rating system for Pmwiki, http://www.pmwiki.org/wiki/Cookbook/StarRater
2. Buchanan, B.G., Glick, J., Smith, R.G.: The AAAI Video Archive. AI Magazine 29(1),
91–94 (2008)
3. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining
and Knowledge Discovery 2(2), 121–167 (1998)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines, Software (2001),
http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Gupta, S., Kaiser, G., Neistadt, D., Grimm, P.: DOM-based Content Extraction of HTML
Documents. In: Proceedings of the 12th International World Wide Web
Conference(WWW2003), Budapest, Hungary (2003)
6. Herlocker, J., Konstan, J., Terveen, L., Riedl, J.: Evaluating Collaborative Filtering
Recommender Systems. ACM Transactions on Information Systems 22 (2004)
7. Loper, E., Bird, S.: NLTK: The Natural Language Toolkit. In: Proceedings of the ACL
Workshop on Effective Tools and Methodologies for Teaching Natural Language
Processing and Computational Linguistics. Association for Computational Linguistics,
Philadelphia (2002), http://www.nltk.org/
8. Manning, C., et al.: Intro. to Information Retrieval. Cambridge University Press,
Cambridge (2008)
9. Melville, P., Sindhwani, V.: Recommender System. In: Sammut, C., Webb, G. (eds.)
Encyclopedia of Machine Learning, Springer, Heidelberg (2010)
10. Song, R., Liu, H., Wen, J.-R., Ma, W.-Y.: Learning Block Importance Models for Web
Pages. In: Proceedings of the 13th International World Wide Web Conference (WWW
2004), New York (2004)
11. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval.
Information Processing & Management 24(5), 513–523 (1988), doi:10.1016/0306-
4573(88)90021-0
12. Zhang, S., Wang, W., Ford, J., Makedon, F.: Learning from incomplete ratings using non-
negative matrix factorization. In: Proc. of the 6th SIAM Conference on Data Mining
(2006)
Diagnosability Study of Technological Systems
1 Introduction
Technological systems are complex systems made up of many components that
interact with each other and combine multiple physical phenomena:
thermodynamic, hydraulic, electric, etc. Faults, which are unobservable damages
affecting components of a system, can occur for many reasons: wear, fouling,
breakage, etc. Some are serious and require stopping the system, or putting it in a
safe mode, while others have a minor impact and should only be reported for
off-board repair. Thus, it is necessary to detect these faults on-board and to
identify them as precisely as possible, in order to take the appropriate
decision. An embedded diagnosis system, complementing the controller, is a suitable
solution for this ([1]). However, the problem is then to ensure that this diagnosis
system will always be able not only to detect any fault when it occurs (does the fault
induce an observable behavior distinct from normality?), but also to assign a
unique listed fault to a divergent observable behavior (do some faults induce the same
observable behavior?). This problem is known as diagnosability ([2]).
A way to handle diagnosability for a given system is to augment the
model of this system (the normal model) with faults (producing faulty models), and to
exploit them to characterize observable behaviors of the system, with or without a fault,
by a specific property which will be verified by the diagnosis system. This approach,
called the diagnosability study of faults, is part of the analysis process of the diagnosis
system. It requires, by definition, all faulty models in order to produce, for each fault, a
specific characterization according to its observable behaviors. All these fault
characterizations will then be used by the embedded diagnosis system to detect and
identify faults.
In this paper, we present an approach to study the diagnosability of faults of a
technological system by exploiting its observable behaviors. These observable
behaviors are obtained by using the normal and all faulty models of the system. In the
second part, we present the framework to model a technological system and show
how to integrate faults in it. In the third part, we exploit these produced models in
order to define observable behaviors of the system in the normal and all faulty cases.
In the fourth part, we study diagnosability of faults by producing their
characterizations. In the fifth part, we apply this theoretical framework to a practical
application: a fuel cell system. Finally, the last part concludes by summarizing results
and outlining interesting directions for future work.
Classical works in the diagnosis literature ([1], [3] and [4]) are based on an
open-loop representation of the system. But for the majority of industrial
applications, the system is inserted in a closed loop and its controller computes
system inputs by taking its outputs into account, in order to increase system performance
and to maintain it in spite of unknown perturbations affecting it. In this context
fault detection and isolation are more difficult because of the contradiction between
control objectives and diagnosis objectives. In fact, control objectives are to minimize
the effects of disturbances or faults, whereas diagnosis objectives are precisely to bring out
these faults. The solution considered to address this problem is to model
the system together with its controller in closed loop. Fig. 1 below represents the complete
structure of the system: the system and its controller in closed loop.
We model a controlled system with two parts: the controller and the system itself
composed of the physical process, actuators and sensors. We consider a continuous
model in discrete time T with a state space representation, described by the set (1) of
equations:
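The display of the equation set (1) was lost; the LaTeX sketch below is a plausible reconstruction, consistent with the system equations repeated in Section 3 (where p4 corresponds to x and p5 to y); the exact arguments of the controller maps h and k are an assumption.

```latex
\begin{equation}
\left\{
\begin{aligned}
x(t+1) &= f\bigl(x(t),\,\theta,\,u(t),\,d(t)\bigr) \\
y(t)   &= g\bigl(x(t),\,\theta,\,u(t),\,d(t)\bigr) \\
a(t+1) &= h\bigl(a(t),\,c(t),\,y(t)\bigr) \\
u(t)   &= k\bigl(a(t),\,c(t),\,y(t)\bigr)
\end{aligned}
\right.
\tag{1}
\end{equation}
```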
where the first two equations model the system and the other two model the
controller (more precisely its control laws). Variables u, x, θ, d and y are respectively
input, state, parameter, disturbance and output vectors of the system; c and a are
respectively order (from the operator) and state vectors of the controller. We denote
by V = {c;a;u;x;y;d} the set of all variables of the model, with respective domains of
values C, A, U, X, Y and D, and by VObs = {c;u;y} the set of observable variables.
3 Observable Behaviors
By adding faults to the model of the system, we have produced all the faulty models
required for the diagnosability study. We can now exploit them to characterize
observable behaviors in the normal and all faulty cases. A behavior of the system
represents its way of operating, according to an instruction (a sequence of orders)
from the operator. An observable behavior is therefore obtained from the behavior by
restricting it to the observable variables only. Observable means visible to an
external observer, such as a diagnosis system for example.
A behavior of the system is represented by the set of values of variables during its
operation and according to an instruction from the operator. This operation can take
place in the presence or absence of a fault. A behavior is therefore specified according to an
instruction and a fault.
An instruction is the evolution of orders from the operator over time.
Formally, it is a sequence cs from a temporal window Ics ⊆ T, assumed to begin
at the time instant 0, to the domain C. In the following, we consider a set Cons of
the most representative instructions, i.e., providing all operation ranges of the system.
Thus, though each instruction cs is defined on its own temporal window Ics, we
assume that all instructions are defined on the same temporal window I = max{Ics},
by extending any instruction cs, where Ics ⊂ I, with its last value cs(max(Ics)).
For a vector v = (v1,…,vn) and for an index i ∈ {1,…,n}, we denote by pi(v) = vi the
i-th element of v. For a set E constituted by a direct product E = E1×…×En, for a sub-
set G ⊆ E and for indexes i1,…,ik ∈ {1,…,n} with k ≤ n, the projection of G onto
Ei1×…×Eik is the set PrEi1×…×Eik(G) = {(pi1(v),…,pik(v)) ∈ Ei1×…×Eik | v∈G}.
Normal Behaviors. For an instruction cs∈Cons, the normal behavior of the system,
according to cs, is the set B(cs,F0) of vectors of data (t,v(t))∈I×C×A×U×X×Y×D,
ordered by time t, with v(t) = (c(t),a(t),u(t),x(t),y(t),d(t)). This set of data vectors
satisfies the following:
a. Existence and uniqueness in time: for any time instant t∈I, there exists a unique
vector v(t)∈C×A×U×X×Y×D such that (t,v(t))∈B(cs,F0).
b. Construction according to the instruction cs: for any time instant t∈I,
p1(v(t)) = cs(t).
c. Satisfaction of system equations in normal case: for any time instant t∈I,
1. p4(v(t+1)) = f(p4(v(t)),θ,p3(v(t)),p6(v(t))) and p4(v(0)) = xinit
2. p5(v(t)) = g(p4(v(t)),θ,p3(v(t)),p6(v(t)))
of observable behaviors, according to the set Cons of instructions, is the union of all
sets of observable behaviors for all faults: ObsBehCons = ∪F∈Γ ObsBehCons(F). We also
For a fault F∈Γ, its characterization is a property PF which must be satisfied at each
time instant at least by the observable behaviors under the considered fault
(PF : ObsBehCons×J→{true;false}). We propose two kinds of fault characterization.
For a fault F∈Γ, its perfect characterization is the property PF defined as follows: for
an observable behavior ob∈ObsBehCons and a time instant t∈J, PF(ob,t) is true iff
Pr[t–b;t]×C×U×Y(ob) ∈ ObsBehbdCons(F).
Intuitively, an observable behavior ob∈ObsBehCons, restricted to a temporal
window [ti – b;ti], is an element of the set ObsBehbdCons(F) iff there exist an instruction
cs∈Cons, an occurrence time tn∈Θ(F,cs) of F and a time instant td∈[tn;tn+b], such that
the data vectors of ob (restricted to the temporal window [ti – b;ti]) are equal to the
data vectors of ObsB(cs,F,tn) restricted to the temporal window [td – b;td]; i.e., if for
any time instant k∈[0;b] ⊆ T, the data vector v(ti+k) of ob at the time instant ti+k is
equal to the data vector v’(td+k) of ObsB(cs,F,tn) at the time instant td+k.
It is a perfect characterization because it is not possible to specify the observable
behaviors in a better way. In fact, the sets ObsBehbdCons(F), constructed by restricting
5 Practical Applications
With the above theoretical framework, we have defined the diagnosability study of
faults by analyzing their observable behaviors obtained from models. By using
simulation tools, such as MATLAB/Simulink©, we can apply this theoretical
framework to practical applications.
First of all, we require a simulation model of the system. Simulation models used
in control design could be a first solution; nevertheless, as shown in [10], we have to
keep in mind that models needed for diagnosis are not the same as models needed
for control. A model for control is generally less complex than a model for diagnosis.
We assume that we have a simulation model of the system (the process and its
controller) and, furthermore, we assume that the model of the process perfectly
represents the real system.
P is the air pressure in the line. Qin and Qout are air mass flow rates respectively
before and after the stack. W is the compressor rotation speed and X is the valve
opening. uW and uX are respectively compressor and valve orders from the air line
controller; yP and yQ are respectively air pressure and air mass flow rate measures. k,
αW, αX, λP, λQ are constants.
The mass flow rate and pressure are controlled according to orders (cQ for mass
flow rate and cP for pressure) supplied by the global controller of the fuel cell system.
The supplied mass flow rate order cQ is computed, for the most part, according to the
electrical power needed by the vehicle controller (the operator). The pressure order cP
is then deduced from this mass flow rate according to pressure requirements in the
stack and in the line. Thus, we can assume the mass flow rate order cQ is the main
order. Therefore, the set of observable variables of this air line is
VObs = {cQ;cP;uW;uX;yQ;yP}.
All important faults of the air line have been identified and integrated into the
simulation model by using an adapted fault library developed in
MATLAB/Simulink© ([7]). For example, we consider a lock of the compressor, Flock,
which causes an abrupt decrease of the mass flow rate output of the compressor;
while the fault is present, this mass flow rate output stays equal to 0 grams per second.
Within the model, the variable Qin is therefore disturbed by a multiplicative perturbation
and is described by Qin,Flock(t) = (1 – flt(t,tn)) · Qin(t), where flt is the behavior of Flock,
parameterized to occur abruptly at a time instant tn and stay permanent (so flt(t,tn) = 0
before tn and 1 after).
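A minimal sketch of this multiplicative fault injection is shown below, assuming the abrupt, permanent fault behavior described above; the function names are illustrative.

```python
# Lock-of-compressor fault: the nominal mass flow rate Qin is multiplied by (1 - flt).
def flt(t, tn):
    """Fault behavior of Flock: 0 before the occurrence time tn, 1 afterwards (permanent)."""
    return 0.0 if t < tn else 1.0

def q_in_faulty(q_in_nominal, t, tn):
    """Disturbed output: Qin,Flock(t) = (1 - flt(t, tn)) * Qin(t)."""
    return (1.0 - flt(t, tn)) * q_in_nominal
```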
We have considered a set Cons of instructions representing all operation ranges of the
air line. By simulating the normal and all faulty models, according to the set Cons of
instructions, we have obtained all data sets ObsBehCons(F) of observable behaviors for
all identified faults F∈Γ parameterized according to their behaviors and effects ([7]).
Fig. 2 illustrates two observable behaviors obtained for a random instruction cs
during the temporal window [0;20]. In the two figures, first and second graphs show
the mass flow rate and the pressure in the air line: dotted lines represent orders cQ and
cP whereas plain lines represent measures yQ and yP from sensors. The third graph
shows orders from the air line controller: compressor orders uW in plain line and valve
orders uX in dotted line. Fig. 2 (left) shows the normal observable behavior
ObsB(cs,F0,0) whereas Fig. 2 (right) shows the faulty observable behavior
ObsB(cs,Flock,13), for the lock of the compressor occurring at the time instant 13.
During the time interval [0;13[, the air line operates correctly, as in the normal behavior;
but from the time instant 13, we can see disturbances in all graphs. Mass flow rate
measures (first graph) decrease abruptly to 0 grams per second; compressor orders
(third graph) are therefore maximal (equal to 1) in order to compensate the difference
between orders and measures. Pressure measures (second graph) are thus equal
to 1 bar.
The diagnosability study of faults in the air line model was achieved using the
temporal formula characterization.
For example, the conjunction ϕFo, of the two following formulas, characterizes the
observable behavior of the air line in the normal case:
ϕ1Fo: (cQ∈[0;25[) ⇒
( ( G[-3;0](c – V[-0.01]c = 0) ⇒ ( (|cQ – yQ| ≤ 0.5) ∧ (|cP – yP| ≤ 0.1) ) )
∨ ( E[-3;0](c – V[-0.01]c ≠ 0) ⇒ ( (|cQ – yQ| > 0) ∧ (|cP – yP| > 0) ) ) )
ϕ2Fo: (cQ∈[25;30]) ⇒
( ( G[-3;0](c – V[-0.01]c = 0) ⇒ ( (|cQ – yQ| ≤ 1) ∧ (|cP – yP| ≤ 0.5) ) )
∨ ( E[-3;0](c – V[-0.01]c ≠ 0) ⇒ ( (|cQ – yQ| > 0) ∧ (|cP – yP| > 0) ) ) )
Fig. 2. Observable behaviors obtained by simulations (left: normal; right: lock of compressor)
The following formula ϕlock characterizes the observable behavior of the air line
under the lock of the compressor:
ϕlock: ( G[-1;0]( ( yQ = 0 ) ∧ ( yP = 1 ) ∧ ( uW = 1 ) ∧ ( uX = 1 ) ) )
We consider the set of faults Γ = {F0;Flock}, where F0 is the normal case and Flock is
the lock of the compressor only occurring at the time instant tn = 13. We suppose the
set Cons of instructions is reduced to {cs} and we consider a bound b = 5. Thus,
ObsBeh{cs} = {ObsB(cs,F0,0);ObsB(cs,Flock,13)}, with Θ(Fo,cs) = {0} and
Θ(Flock,cs) = {13}.
Firstly, the normal observable behavior ObsB(cs,F0,0) satisfies the formula ϕFo.
Therefore, the fault F0 is eligible.
Secondly, the faulty observable behavior ObsB(cs,Flock,13) satisfies the faulty
formula ϕlock from the time instant te = 14.3∈[13;18]; therefore, the fault Flock is
eligible. In addition, from the time instant 13, this faulty observable behavior
ObsB(cs,Flock,13) does not satisfy the normal formula ϕFo; the fault Flock is therefore
detectable.
Of course, this is just an example. Firstly, the set of instructions is not reduced to
only one but contains several instructions representing all operation ranges of the system.
Moreover, all identified faults have been taken into account with several occurrence
times: before or after a change of orders and according to the response time δ of the
system. Furthermore, all the real temporal formulas obtained are more elaborate.
The first kind, adapted for the study during design and development phases of the
system, simply considers all sets of data vectors restricted to temporal windows. The
other one uses a temporal logic formalism to express the temporal evolution of the
observable data of the system and is adapted for an embedded diagnosis system.
Finally, for diagnosable faults, their characterization will then be embedded inside
the diagnosis system in order to detect and identify faults on-line. It could be
combined with an embedded model of the system simulated by the controller, and the
temporal formulas could take into account the temporal evolution of the difference
between real and model output data. This future work will be presented in
forthcoming papers.
References
1. Venkatasubramanian, V., Rengaswamy, R., Yin, K., Kavuri, S.N.: A review of process
fault detection and diagnosis, Parts I–III. Computers and Chemical Engineering 27,
293–346 (2003)
2. Travé-Massuyès, L., Cordier, M.O., Pucel, X.: Comparing diagnosability in CS and DES.
In: International Workshop on Principles of Diagnosis, Aranda de Duero, Spain, June
26-28 (2006)
3. Blanke, M., Kinnaert, M., Lunze, J., Staroswiecki, M.: Diagnosis and Fault-tolerant
Control. Springer, Berlin (2003)
4. Isermann, R.: Fault Diagnosis Systems. Springer, Berlin (2006)
5. Stapelberg, R.F.: Handbook of Reliability, Availability, Maintainability and Safety in
Engineering Design. Springer, London (2009)
6. Basseville, M., Nikiforov, I.V.: Detection of Abrupt Changes: Theory and Application.
Prentice-Hall, Englewood Cliffs (1993)
7. Batteux, M., Dague, P., Rapin, N., Fiani, P.: Fuel cell system improvement for model-
based diagnosis analysis. In: IEEE Vehicle Power and Propulsion Conference, Lille,
France, September 1-3 (2010)
8. Alur, R., Feder, T., Henzinger, T.A.: The Benefits of Relaxing Punctuality. Journal of the
ACM 43, 116–146 (1996)
9. Rapin, N.: Procédé et système permettant de générer un dispositif de contrôle à partir de
comportements redoutés spécifiés, French patent n°0804812 pending, September 2 (2008)
10. Frank, P.M., Alcorta García, E., Köppen-Seliger, B.: Modelling for fault detection and
isolation versus modelling for control. Mathematics and Computers in Simulation 53(4-6),
259–271 (2000)
Using Ensembles of Regression Trees to Monitor
Lubricating Oil Quality
Abstract. This work describes a new on-line sensor that includes a novel cali-
bration process for the real-time condition monitoring of lubricating oil. The pa-
rameter studied with this sensor has been the variation of the Total Acid Number
(TAN) since the beginning of oil’s operation, which is one of the most important
laboratory parameters used to determine the degradation status of lubricating oil.
The calibration of the sensor has been done using machine learning methods with
the aim to obtain a robust predictive model. The methods used are ensembles of
regression trees. Ensembles are combinations of models that often are able to im-
prove the results of individual models. In this work the individual models were
regression trees. Several ensemble methods were studied; the best results were
obtained with Rotation Forests.
1 Introduction
One of the main root causes of failures within industrial lubricated machinery is the
degradation of the lubricating oil. This degradation comes mainly from the oxidation of
the oil caused by overheating and by water and air contamination [1]. The machinery stops
and failures caused by oil degradation reduce its life and generate significant main-
tenance costs. Traditionally, lubricating oil quality was monitored by
off-line analysis carried out in the laboratory. This solution is not suitable for detecting the
early stages of degradation of lubricating oil because it is expensive and complex to
extract oil from the machine and deliver it to a laboratory where this test can be per-
formed. Therefore, the machine owner usually changes the lubricating oil without any
real analysis of its quality, just following the specifications of the supplier, which are
always highly conservative.
The advances in micro and nano-technologies in all fields of engineering have
opened new possibilities for the development of new equipment and devices and there-
fore have permitted the miniaturization of such systems with an important reduction of
manufacturing costs.
This work was supported by the vehicle interior manufacturer, Grupo Antolin Ingenieria S.A.,
within the framework of the project MAGNO2008-1028, a CENIT project funded by the
Spanish Ministry of Science and Innovation.
Taking all this into account, the use of an on-line sensor for lubricating oil condition
monitoring will permit, in a short period of time, the optimization of the oil's life and result in
significant savings in operational costs by solving problems in industrial machinery, in-
creasing reliability and availability [2]. The main advantage of using this type of sensor
is that real-time information could be extracted in order to establish a proper predictive
and proactive maintenance. Therefore early stages of lubricating oil degradation could
also be detected [3]. To evaluate the degradation status of a lubricating oil, the most of-
ten used parameter is the Total Acid Number (TAN). This parameter accounts for acid
compounds generated within the oil and, therefore, provides information about the oil degradation status.
Although there are some examples of artificial intelligence techniques applied to oil
condition monitoring, especially Artificial Neural Networks [4], no previous application
of ensemble techniques has been reported for this industrial task. There are a few ex-
amples of the use of ensembles to predict critical parameters of different manufacturing
processes. Mainly, ensembles have been used for fault diagnosis in different milling op-
erations [5,6]. This lack of previous work is mainly due to the traditional gap between
data mining researchers and manufacturing engineers, and it is slowly disappearing thanks
to the development of comprehensible software tools, like WEKA [7], that implement the main ensem-
ble techniques and can be used directly by manufacturing engineers
without a deep knowledge of ensemble techniques.
The purpose of this work is to describe a new on-line sensor that includes a novel
calibration process for oil condition monitoring. The new sensor should not require the
same accuracy and precision as laboratory equipment, because its aim is to avoid the
change of lubricating oil simply because it already has lasted the mean life specified
by the oil supplier, although its condition could still be suitable for further operation.
Therefore the new sensor should be able to detect medium degradation of lubricating
oil working under industrial conditions, and its cost should be considerably lower than
that of laboratory equipment.
The paper is structured as follows: Section 2 explains the experimental procedure
followed to obtain the dataset to validate the new calibration process proposed in this
work; Section 3 reviews the ensemble learning techniques tested for the calibration
process; Section 4 shows the results obtained with all the tested ensemble techniques.
Finally, conclusions and future lines of work are summarized in Section 5.
Fig. 1. Fuchs Renolyn Unysin CLP 320 samples with different degradation stages
to 780 nm. The sensor has been developed taking into account end-user requirements.
Various knowledge fields (optics, electronics and fluidics) were necessary in order to obtain
robust equipment (Figure 2). Finally, a software tool manages the calibration and the
data treatment [9].
Then, the samples were analysed with the new on-line sensor and the generated
spectra were used as input data for the calibration process. The output variable was the
variation of the TAN value. Three measurements per sample were carried out and
averaged in order to minimize the error. The oil samples were introduced into the
sensor through a 5 ml syringe, and between every measurement the fluidic cell was
cleaned twice with petroleum ether. The fluidic cell is the compartment where the
light source is directed on the oil sample. Some time is necessary in order to obtain
proper stabilization of the oil sample in the fluidic cell. All samples were measured in
a temperature range of 21–22 °C, due to the effect of the variation of the signal with
temperature in the visible spectra.
Regression trees [11] are simple predictive models. Figure 3 shows a regression tree
obtained for the considered problem. In order to predict with these trees, the process
starts at the root node. There are two types of nodes. In internal nodes a condition is
evaluated: an input variable is compared with a threshold. Depending on the result of
the evaluation, the process selects one of the node's children. Each leaf has an assigned
numeric value. When a leaf is reached, the prediction assigned is the value of the leaf.
The displayed variables in Figure 3 are the following: intensities of the emission’s
spectrum at a certain wavelength (e.g., 431nm, 571nm. . . ), intensity of the maximum
of the emission’s spectrum (Max), Total Acid Number of the analysed oil before its
use (Tan ref) and the wavelength (expressed in nanometres) of the maximum of the
emission’s spectrum (Pos).
In order to avoid over-fitting, it is usual to prune trees. The pruning method used in
this work is Reduced Error Pruning [12]. The original training data is partitioned into two
subsets: one is used to construct the tree, the other to decide which nodes should be
pruned.
Trees are especially suitable for ensembles. First, they are fast, both in training and
testing time. This is important when using ensembles because several models are con-
structed and used for prediction. Second, they are unstable [13]. This means that small
changes in the data set can cause big differences in the obtained models. This is desir-
able in ensembles because combining very similar models is not useful.
Several methods for constructing ensembles are considered. If the method used to con-
struct the models to be combined is not deterministic, an ensemble can be obtained
based only on Randomization: different models are obtained from the same training
data.
In Bagging [13], each model in the ensemble is constructed using a random sample,
with replacement, from the training data. Usually the size of the sample is the same as
the size of the original training data, but some examples are selected several times in
the sample, so other examples will not be included in this sample. The prediction of the
ensemble is the average of the models' predictions.
Random Subspaces [14] is based on training each model using all the examples, but
only a subset of the attributes. The prediction of the ensemble is also the average of the
models' predictions.
Iterated Bagging [15] can be considered an ensemble of ensembles. First, a model
based on Bagging is constructed using the original training data. The following Bagging
models are trained for predicting the residuals of the previous iterations. The residuals
are the difference between the predicted values and the actual values. The predicted
value is the sum of the predictions of the Bagging models.
AdaBoost.R2 [16] is a boosting method for regression. Each training example has
assigned a weight. The models have to be constructed taking these weights into account.
The weights of the examples are adjusted depending on the predictions of the previous
model. If the example has a small error, its weight is reduced, otherwise its weight is
augmented. The models also have weights and better models have greater weights. The
prediction is obtained as the weighted median of the models’ predictions.
Rotation Forest [17,18] is an ensemble method. Each model is constructed using a
different rotation of the data set. The attributes are divided into several groups and for
each group Principal Component Analysis is applied to a random sample from the data.
All the obtained components from all the groups are used as new attributes. All the
training examples are used to construct all the models and the random samples are only
used for obtaining different principal components. The ensemble prediction is obtained
as the mean value from the predictions of the models.
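As a hedged illustration of the ensemble idea (the experiments reported here were run in WEKA, and Rotation Forest is not available in scikit-learn), the sketch below builds a Bagging ensemble of regression trees in Python.

```python
# Bagging of regression trees: each tree is trained on a bootstrap sample of the
# calibration data and the ensemble prediction is the average of the trees' outputs.
from sklearn.ensemble import BaggingRegressor

def train_bagging(X, y, n_models=50):
    # BaggingRegressor's default base learner is an unpruned regression tree.
    return BaggingRegressor(n_estimators=n_models).fit(X, y)

# Example usage (X: spectra-derived features, y: TAN variation):
# model = train_bagging(X_train, y_train)
# tan_variation_pred = model.predict(X_test)
```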
4.2 Results
Table 1 shows the obtained results for the different methods. The results are sorted
according to the average MAE and RMSE. The best results are from Rotation Forests,
combining Regression Trees without pruning. The next methods are AdaBoost.R2 with
unpruned trees and Rotation Forests with pruned trees.
In general, ensembles of unpruned trees have better results than ensembles of pruned
trees. In fact, a single unpruned regression tree has better MAE than all the ensembles
of pruned trees, with the only exception of Rotation Forest.
The best results among the methods that are not ensembles of regression trees are for
the Nearest Neighbour method.
As expected, the worst results are obtained with the simplest model, predicting the
average value.
5 Conclusions
This work describes a novel calibration process for an on-line sensor for oil condition
monitoring. The calibration process is done using ensembles of regression trees. Several
ensemble methods were considered, the best results were obtained with Rotation Forest.
The validation of the new calibration process with a dataset of 179 samples of 14 dif-
ferent commercial oils in different stages of degradation shows that the calibration can
assure an accuracy of ±0.11 in TAN prediction. This accuracy is worse than that of
laboratory equipment, which is usually better than ±0.03, but it is enough for the
new on-line sensor because it achieves an accuracy better than ±0.15. This fact is due
to the sensor’s purpose: to avoid the change of lubricating oil simply because it already
has lasted the mean life specified by the oil supplier, although its condition could still
be suitable for further operation.
Future work will focus on the study and application of the calibration process to
on-line sensors devoted to a certain application. These sensors will achieve a higher
accuracy but with a smaller variation range in the measured spectra. The ensembles used
in this work combine models obtained with regression trees; ensembles combining
models obtained with other methods will be studied. Moreover, it is also possible to
combine models obtained from different methods; these heterogeneous ensembles could
improve the results.
References
1. Noria Corporation: What the tests tell us,
http://www.machinerylubrication.com/Read/873/oil-tests
2. Gorritxategi, E., Arnaiz, A., Spiesen, J.: Marine oil monitoring by means of on-line sensors.
In: Proceedings of MARTECH 2007- 2nd International Workshop on Marine Technology,
Barcelona, Spain (2007)
3. Holmberg, H., Adgar, A., Arnaiz, A., Jantunen, E., Mascolo, J., Mekid, J.: E-maintenance.
Springer, London (2010)
4. Yan, X., Zhao, C., Lu, Z.Y., Zhou, X., Xiao, H.: A study of information technology used in
oil monitoring. Tribology International 38(10), 879–886 (2005)
5. Cho, S., Binsaeid, S., Asfour, S.: Design of multisensor fusion-based tool condition monitor-
ing system in end milling. International Journal of Advanced Manufacturing Technology 46,
681–694 (2010)
6. Binsaeid, S., Asfour, S., Cho, S., Onar, A.: Machine ensemble approach for simultaneous
detection of transient and gradual abnormalities in end milling using multisensor fusion.
Journal of Materials Processing Technology 209(10), 4728–4738 (2009)
7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
data mining software: An update. SIGKDD Explorations 11(1) (2009)
8. Terradillos, J., Aranzabe, A., Arnaiz, A., Gorritxategi, E., Aranzabe, E.: Novel method for
lube quality status assessment based on visible spectrometric analysis. In: Proceedings of the
International Congress Lubrication Management and Technology LUBMAT 2008, San Se-
bastian, Spain (2008)
9. Gorritxategi, E., Arnaiz, A., Aranzabe, E., Aranzabe, A., Villar, A.: On line sensors for con-
dition monitoring of lubricating machinery. In: Proceedings of 22nd International Congress
on Condition Monitoring and Diagnostic Engineering Management COMADEM, San Se-
bastian, Spain (2009)
10. Mang, T., Dresel, W.: Lubricants and lubrication. WILEY-VCH Verlag GmbH, Weinheim,
Germany (2007)
11. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd
edn. Morgan Kaufmann, San Francisco (2005)
12. Elomaa, T., Kääriäinen, M.: An analysis of reduced error pruning. Journal of Artificial Intel-
ligence Research 15, 163–187 (2001)
13. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
14. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions
on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
15. Breiman, L.: Using iterated bagging to debias regressions. Machine Learning 45(3), 261–277
(2001)
16. Drucker, H.: Improving regressors using boosting techniques. In: ICML 1997: Proceedings of
the Fourteenth International Conference on Machine Learning, pp. 107–115. Morgan Kauf-
mann Publishers Inc., San Francisco (1997)
17. Rodrı́guez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: A new classifier ensem-
ble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10),
1619–1630 (2006)
18. Zhang, C., Zhang, J., Wang, G.: An empirical study of using rotation forest to improve re-
gressors. Applied Mathematics and Computation 195(2), 618–629 (2008)
Image Region Segmentation Based on Color Coherence
Quantization
Guang-Nan He1, Yu-Bin Yang1, Yao Zhang2, Yang Gao1, and Lin Shang1
1 State Key Laboratory for Novel Software Technology, Nanjing University,
Nanjing 210093, China
2 Jinlin College, Nanjing University, Nanjing 210093, China
[email protected]
Abstract. This paper presents a novel approach for image region segmentation
based on color coherence quantization. Firstly, we conduct an unequal color
quantization in the HSI color space to generate representative colors, each of
which is used to identify coherent regions in an image. Next, all pixels are labeled
with the values of their representative colors to transform the original image into
a “Color Coherence Quantization” (CCQ) image. Labels with the same color
value are then viewed as coherent regions in the CCQ image. A concept of
“connectivity factor” is thus defined to describe the coherence of those regions.
Afterwards, we propose an iterative image segmentation algorithm that evaluates the
“connectivity factor” distribution in the resulting CCQ image and yields a
segmented image with only a few important color labels. Image segmentation
experiments of the proposed algorithm are designed and implemented on the
MSRC datasets [1] in order to evaluate its performance. Quantitative results and
qualitative analysis are finally provided to demonstrate the efficiency and
effectiveness of the proposed approach.
1 Introduction
Segmenting an image into a group of salient, coherent regions is a prerequisite for many
computer vision tasks and image understanding applications, particularly object
recognition, image retrieval, and image annotation. Many segmentation methods are
currently available. If an image contains only a few homogeneous color regions, a
clustering method such as mean-shift [2] is sufficient to handle the problem. Typical
automatic segmentation methods include stochastic model
based approaches [4,14], graph partitioning methods [5,6], etc. However, the problem
of automatic image segmentation is still very difficult in many real applications.
Consequently, incorporation of prior knowledge interactively into automated
segmentation techniques is now a challenging and active research topic. Many
interactive segmentation methods have recently been proposed [7, 8, 9, 10,16].
Unfortunately, those kinds of methods are still not suitable for segmenting a huge
number of natural images. In real situations, images such as natural scenes are rich in
colors and textures, and the segmentation results achieved by currently available
algorithms still fall far short of human perception.
A typical way to identify salient image regions is to enhance the contrast between
those regions and their surrounding neighbors [3][15]. In this work we aim to segment
an image into a few coherent regions effectively, which is very useful for image
retrieval, object recognition, and image understanding purposes. The coherent regions
correspond to semantically meaningful objects that can be discovered from an image. To
achieve this goal, we propose a novel segmentation approach in which only low-level
color features are needed. An image is firstly quantized into a CCQ image based on
color consistency quantization. After that, an iteration is designed to segment the CCQ
image to achieve the final segmentation result. The main contributions of this paper are
listed as follows.
(1) We introduce a new color coherence quantization method in the HSI color space
to generate the CCQ image, which has been validated by our experiments.
(2) We propose a practical salient image region segmentation algorithm based on
the concept of “coherent region”, which involves maximizing the “connectivity
factor” and suppressing small connected components of each coherent region.
The remainder of the paper is organized as follows. Section 2 describes the process of
color consistency quantization. The segmentation approach is then presented in Section
3. Experimental results are illustrated and analyzed in Section 4. Finally, concluding
remarks with future work directions are provided in Section 5.
(1) If S ≤ 0.1, partition the HSI color space into three types of color according to the
intensity value by using Eq. (1):

    i = 0, if I ≤ 0.2;    i = 1, if 0.2 < I ≤ 0.75;    i = 2, if I > 0.75.    (1)
(2) Otherwise, the hue, saturation, and intensity values of each pixel are quantized by
using Eqs. (2), (3), and (4), respectively.
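As a small illustration of the quantization step, the sketch below (Python/NumPy, with function names of our own choosing) implements only the achromatic branch of Eq. (1); the chromatic branch defined by Eqs. (2)–(4) is not reproduced in this excerpt, so those pixels are simply marked for later processing.

```python
import numpy as np

def quantize_achromatic(S, I):
    """Assign labels 0-2 to low-saturation (achromatic) pixels using Eq. (1).

    S and I are arrays of saturation and intensity values in [0, 1].
    Pixels with S > 0.1 are chromatic and would be quantized by
    Eqs. (2)-(4) (not reproduced in this excerpt); they are marked -1 here.
    """
    labels = np.full(S.shape, -1, dtype=int)            # -1 = chromatic, handled elsewhere
    achromatic = S <= 0.1
    labels[achromatic & (I <= 0.2)] = 0                  # dark
    labels[achromatic & (I > 0.2) & (I <= 0.75)] = 1     # mid intensity
    labels[achromatic & (I > 0.75)] = 2                  # bright
    return labels

# Example on a 2x2 patch of (S, I) values
S = np.array([[0.05, 0.08], [0.5, 0.02]])
I = np.array([[0.1,  0.8 ], [0.4, 0.5 ]])
print(quantize_achromatic(S, I))                         # [[ 0  2] [-1  1]]
```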
Fig. 1. Color Coherence Quantization Result. (Left: the original image, Right: the quantized CCQ
image.)
In this manner, all remaining colors in the HSI color space are partitioned into the other
63 types of representative colors. By applying the above quantization algorithm, a color
image is firstly transformed from the RGB color space into the HSI color space. Then, all
pixels are labeled with the values of their representative colors to obtain a “Color
Coherence Quantization” (CCQ) image. Each representative color is labeled as i, where i
is in the range [0, 65]. Figure 1 shows an example of our color coherence quantization
result. The image shown in Figure 1 contains three coherent regions: “cow”, “grass”, and
“land”. In the CCQ image, each of the three regions has been labeled with several different
numbers. Our aim in the next step is to label the CCQ image with only three numbers,
in which each number denotes a semantic region.
In this section we describe our segmentation algorithm based on the generated CCQ
images. Take Figure 1 as an example. The CCQ image has 33 types of color labels.
However, the regions with the same color label vary in their sizes. Here the size of each
region is calculated as the pixel number in the region divided by the size of the original
image. To address this problem, we design a segmentation approach based on the
“connectivity” of those regions in a region-growing manner. Even if an object in the
image is described using different representative colors, the proposed algorithm can
still segment it as a whole.
In order to describe the connectivity of all color labels in a CCQ image, we define eight
types of connectivity for each color label. The connectivity type of a color label i (i ∈
[0, 65]) is decided by the number of other labels j in its 8-connectivity neighborhood
within a 3 × 3 window. Take the color label “0” in Figure 2 as an example. The label “0”
is in the window center, and the eight types of its connectivity are all shown in Figure 2.
The notation ‘*’ denotes color labels different from label “0”. Rotating each type of
connectivity does not change its connectivity factor value.
Let a_i denote the size (number of pixels) of label i in a CCQ image, and let c_i^k denote
the total number of pixels of label i with connectivity type k, where k ∈ [1, 8] and
i ∈ [0, 65]. Let σ_i denote the connectivity factor of label i, which is calculated as:

    σ_i = (∑_{k=1}^{8} k × c_i^k) / (8 × a_i),    σ_i ∈ [0, 1)    (6)
According to the above definitions, the connectivity factors of label “0” in Figure 2
(image size: 3 × 3) can be easily calculated, as shown in Table 1.

Table 1. Connectivity factors of label “0” for each connectivity type k = 1, 2, . . . , 8
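To make Eq. (6) concrete, the sketch below computes σ_i for every label of a small CCQ label image. It assumes, as our reading of the definition, that a pixel of label i is of connectivity type k when exactly k of its 8 neighbours carry the same label, and that neighbours outside the image border count as different labels.

```python
import numpy as np

def connectivity_factors(ccq):
    """Compute the connectivity factor sigma_i (Eq. (6)) for each label in a
    CCQ label image. A pixel contributes to c_i^k when exactly k of its 8
    neighbours carry the same label i (our interpretation)."""
    h, w = ccq.shape
    sigma = {}
    for label in np.unique(ccq):
        a_i = np.count_nonzero(ccq == label)       # region size a_i
        weighted = 0                               # accumulates sum_k k * c_i^k
        ys, xs = np.nonzero(ccq == label)
        for y, x in zip(ys, xs):
            k = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and ccq[ny, nx] == label:
                        k += 1
            weighted += k
        sigma[int(label)] = weighted / (8.0 * a_i)  # Eq. (6), value in [0, 1)
    return sigma

# Toy 3x3 CCQ image with three labels
ccq = np.array([[0, 0, 1],
                [0, 0, 1],
                [2, 2, 2]])
print(connectivity_factors(ccq))
```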
The connectivity factor of label i becomes larger as its size grows. Table 2 lists this
changing trend of the connectivity factor of label i. As we can see from Table 2, the
connectivity factor approaches 1 as the size of label i grows, but it rarely equals 1
except in some limited situations. We find that this is very helpful for performing the
segmentation task in a region-growing manner. At the beginning, the generated CCQ
image may have a lot of small regions labeled with different color values. If we try to
maximize their connectivity factors by merging those small regions, they may finally
merge into a few large and coherent regions, which can serve as a satisfactory image
segmentation result.
In order to simplify the calculation, we can assume that each edge point and its neighbors
are all 4-connected, while the others are 8-connected. Therefore, the connectivity factor
of the circle is σ = (πr² × 8 − 2πr × 4 × 2) / (πr² × 8) = 1 − 2/r. When r = 20, the
circle’s connectivity factor reaches 0.9, so that we can segment it out as a homogeneous
region.
In a CCQ image, each type of color label represents a coherent region. Let
R(n) = [i, σ_i, a_i] denote the n-th coherent region, where i is the color label, σ_i is the
corresponding connectivity factor, a_i is the region size, and n ∈ {1, 2, . . . , N}. Here
N denotes the total number of coherent regions in the CCQ image.
Most of the coherent regions will be merged when the values of their connectivity factors
are very small. The lower the connectivity factors of the regions, the more scattered
their spatial distributions. A lower bound for the connectivity factor of a coherent region
should be specified by the user, and for different categories of images the lower bound
should be different. If a coherent region’s connectivity factor is smaller than the lower
bound, the region will be merged. Regions with smaller connectivity factors are often
incompact and usually have more than one connected component. Considering the
spatial distribution of a coherent region, we need to merge the coherent region’s
connected components one by one. In order to find the connected components of a
coherent region, we use the method outlined in Ref. [13] and assume that each connected
component is defined in an 8-connectivity manner.
If region R(n) = [i, σ_i, a_i] is to be merged, all of its connected components will be
identified. First, let p be one of the connected components. Then, we need to decide
which adjacent region p should be merged with. One intuitive way is to merge
p into the nearest region. For each edge point of p, we construct a 3 × 3 window with
the edge point located at its center. These windows contain labels different from i.
We then accumulate all labels different from i over all such windows and find the region
with the largest number of those labeled points; it is identified as the nearest region
to p and merged with p.
To simplify the algorithm, we only use the grid windows centered at label i. When
the total number of pixels with label i in a window is no less than 5, the center point is
assumed to be an edge point of region p, which means that label i has a connectivity of
at least 4. The reason for choosing 5 as the threshold is to avoid some special situations,
such as a region p with many holes (here a “hole” denotes labels different from i). An
example is illustrated in Figure 4: the region shown there has a hole of label “0”. Such a
hole is usually detected as an edge of the region by traditional techniques. However,
when we calculate the edge points of label “0” with our algorithm, the hole is not
identified as an edge. This rules out unnecessary noise and increases the robustness of
the segmentation process by avoiding erroneous merges.
increase, which is also shown in Table 2. The algorithm stops when no coherent region
in the CCQ image has a connectivity factor less than σ_threshold. Figure 5 illustrates the
flow of our algorithm.
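The overall iteration can then be sketched roughly as follows, reusing connectivity_factors() from the sketch above. Regions whose connectivity factor falls below σ_threshold are merged component by component into the adjacent region that contributes the most foreign labels around the component; the helper names, the order in which weak labels are processed, and the tie-breaking are our own assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy import ndimage

def nearest_label(ccq, component_mask, own_label):
    """Vote among foreign labels found in 3x3 windows around the component."""
    dilated = ndimage.binary_dilation(component_mask, structure=np.ones((3, 3)))
    border = dilated & ~component_mask            # pixels adjacent to the component
    votes = ccq[border]
    votes = votes[votes != own_label]
    if votes.size == 0:
        return own_label                          # isolated component: leave as is
    return int(np.argmax(np.bincount(votes)))

def segment(ccq, sigma_threshold=0.96):
    ccq = ccq.copy()
    eight = np.ones((3, 3), dtype=int)            # 8-connectivity, as assumed in the paper
    while True:
        sigma = connectivity_factors(ccq)
        weak = [lab for lab, s in sigma.items() if s < sigma_threshold]
        if not weak:
            break                                 # all remaining regions are coherent enough
        lab = min(weak, key=lambda l: sigma[l])   # our choice: merge the most scattered label first
        before = ccq.copy()
        components, n = ndimage.label(ccq == lab, structure=eight)
        for c in range(1, n + 1):
            mask = components == c
            ccq[mask] = nearest_label(ccq, mask, lab)   # merge component into nearest region
        if np.array_equal(before, ccq):
            break                                 # nothing could be merged any further
    return ccq
```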
Applying the algorithm to a color image yields a segmented result which has
only a few color labels, among which some regions may have more than one connected
component, as shown in the middle image of Figure 6. Hence, we need to merge the
small connected components in those regions to make the segmentation result more
coherent and consistent. We refer to this step as minor-component suppression.
Minor-component suppression is similar to the merging of coherent regions; the only
difference is that it merges the connected components of coherent regions rather than
the regions themselves. When a connected component of a region is very small, it
should be merged. For example, if the size of a region’s connected component is less
than 5% of the whole region’s size, it is merged into the neighborhood region. The right
image in Figure 6 illustrates the results after performing the suppression.
(Fig. 5 flowchart: color image → CCQ image → segmentation result.)
It should be noted that the suppression step is optional. Whether or not it is used makes
no difference when an appropriate σ_threshold is adopted in the algorithm. However, it is
usually difficult to find an optimal σ_threshold suitable for all kinds of images; therefore,
it is necessary to perform the suppression step in some applications.
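The minor-component suppression step admits a similarly short sketch, reusing nearest_label() from the sketch above; the 5% bound follows the example given in the text, and everything else is our own reading.

```python
import numpy as np
from scipy import ndimage

def suppress_minor_components(seg, min_fraction=0.05):
    """Merge connected components smaller than min_fraction of their region
    into the neighbouring region (reuses nearest_label() from the sketch above)."""
    seg = seg.copy()
    eight = np.ones((3, 3), dtype=int)
    for lab in np.unique(seg):
        region_size = np.count_nonzero(seg == lab)
        components, n = ndimage.label(seg == lab, structure=eight)
        for c in range(1, n + 1):
            mask = components == c
            if np.count_nonzero(mask) < min_fraction * region_size:
                seg[mask] = nearest_label(seg, mask, lab)
    return seg
```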
Fig. 6. Segmentation results. Left: Original image; Middle: Results without performing
minor-component suppression; Right: Results after suppression
4 Experimental Results
We have tested our algorithm on the challenging MSRC datasets, which contain many
categories of high-resolution images with a size of 640 × 480 [1]. Different predetermined
values of the parameter σ_threshold lead to different segmentation results.
Firstly, we applied our algorithm to the “cow” images using different values of σ_threshold.
Examples of the segmentation results are shown in Figure 7.
(Fig. 7: rows show the original images and the segmentation results for σ_threshold = 0.9, 0.93, and 0.96.)
As can be seen from Figure 7, the best result is obtained when σ threshold = 0.96 , and all
the major objects have been segmented from the background. In order to evaluate the
The images shown in Figure 7 and Figure 8 are high-resolution and their contents are
less structured; therefore, we can set the threshold higher to get better results. On the
contrary, for some low-resolution and highly structured images, we need to set the
threshold lower in order to prevent over-segmentation. Figure 9 provides segmentation
examples obtained by applying our algorithm to low-resolution images with a size of
320 × 213 and setting σ_threshold = 0.93.
Since finding an appropriate threshold is important to achieve the best segmentation
result, our method cannot segment some highly structured images as well as other
categories of images. Take a “bicycle” image as an example. Figure 10 gives the
segmentation results for a bicycle image. Our algorithm cannot segment the whole
structure of the bicycle into a single coherent region. However, the main parts of the
bicycle, the “tyre” and the “seat”, are both segmented. The segmentation result is still
very useful for real applications.
5 Conclusions
In this paper, we present a novel approach for image region segmentation based on
color coherence quantization. A concept of “connectivity factor” is defined, and an
iterative image segmentation algorithm is proposed. Image segmentation experiments
are designed and implemented on the MSRC datasets in order to evaluate the
performance of the proposed algorithm. Experimental results show that the algorithm is
able to segment a variety of images accurately by using a fixed threshold. Usually, the
threshold suitable for high-resolution and unstructured object class images is larger
than the threshold suitable for low-resolution and structured object class images. We
also compare our algorithm with the mean-shift algorithm under the same conditions,
and the results show that our algorithm performs better. Finally, quantitative
experimental tests on multiple-category images of MSRC datasets have also been done
to validate the algorithm.
Acknowledgements
This work is supported by the “973” Program of China (Grant No. 2010CB327903), the
National Natural Science Foundation of China (Grant Nos. 60875011, 60723003,
61021062, 60975043), and the Key Program of the Natural Science Foundation of Jiangsu
Province, China (Grant No. BK2010054).
References
1. Criminisi, A.: Microsoft research cambridge object recognition image database,
http://research.microsoft.com/vision/cambridge/recognition/
2. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE
Trans. Pattern Anal. Machine Intell. 24(5), 603–619 (2002)
3. Achanta, R., Estrada, F., Wils, P., Susstrunk, S.: Salient region detection and segmentation.
In: International Conference on Computer Vision Systems (2008)
4. Wang, J.-P.: Stochastic relaxation on partitions with connected components and its
application to image segmentation. PAMI 20(6), 619–636 (1998)
5. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 22(8), 888–905 (2000)
6. Tolliver, D.A., Miller, G.L.: Graph partitioning by spectral rounding: Applications in image
segmentation and clustering. In: CVPR 2006: Proceedings of the 2006 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, pp. 1053–1060. IEEE
Computer Society Press, Washington, DC, USA (2006)
7. Blake, A., Rother, C., Brown, M., Perez, P., Torr, P.: Interactive image segmentation using
an adaptive GMMRF model. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
8. Figueiredo, M., Cheng, D.S., Murino, V.: Clustering under prior knowledge with
application to image segmentation. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances
in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge (2007)
9. Guan, J., Qiu, G.: Interactive image segmentation using optimization with statistical priors.
In: International Workshop on The Representation and Use of Prior Knowledge in Vision, In
conjunction with ECCV 2006, Graz, Austria (2006)
10. Duchenne, O., Audibert, J.-Y., Keriven, R., Ponce, J., Segonne, F.:
Segmentation by transduction. In: CVPR 2008, Alaska, US, June 24-26 (2008)
11. http://en.wikipedia.org/wiki/HSL_color_space
12. Malisiewicz, T., Efros, A.A.: Improving Spatial Support for Objects via Multiple
Segmentations. In: BMVC (September 2007)
13. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision, vol. I, pp. 28–48.
Addison-Wesley, Reading (1992)
14. Belongie, S., et al.: Color and texture based image segmentation using EM and its
application to content-based image retrieval. In: Proc.of ICCV, pp. 675–682 (1998)
15. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally
stable extremal regions. In: Proc. BMVC, pp. I384–I393 (2002)
16. Boykov, Y., Jolly, M.-P.: Interactive graph cuts for optimal boundary & region
segmentation of objects in N-D images. In: Proc. ICCV, pp. 105–112 (2001)
Image Retrieval Algorithm Based on Enhanced
Relational Graph
Abstract. The “semantic gap” problem is one of the main difficulties in image
retrieval tasks. Semi-supervised learning is an effective methodology proposed to
narrow down the gap, and it is often integrated with relevance feedback
techniques. However, in semi-supervised learning, the amount of unlabeled data
is usually much greater than that of labeled data. Therefore, the performance of a
semi-supervised learning algorithm relies heavily on how effectively it uses the
relationship between the labeled and unlabeled data. A novel algorithm is
proposed in this paper to enhance the relational graph built on the entire data set,
which is expected to increase the intra-class weights of the data while decreasing
the inter-class weights and linking potentially intra-class data. The enhanced
relational matrix can be directly used in any semi-supervised learning algorithm.
The experimental results on feedback-based image retrieval tasks show that the
proposed algorithm performs much better than other algorithms in the same
semi-supervised learning framework.
1 Introduction
In image retrieval, the most difficult problem is the “semantic gap”, which means that
low-level image features are unable to discriminate between different semantic
concepts. The relevance feedback technique provides a feasible solution to this
problem [1][2]. Early relevance feedback technologies were mainly based on modifying
the feedback information, that is, the image features, for example by re-weighting the
query vectors [3] or adjusting the query vectors’ positions [4][5]. In recent years, a
large number of image retrieval algorithms have been proposed along with the
development of semi-supervised learning [6][7][8]. These algorithms generally use the feedback
information to learn potential semantic distributions, so as to improve the retrieval
performance. However, compared with the high dimensionality of image features, the
information available from feedback is usually scarce and inadequate. Manifold
learning is a powerful tool for this kind of problem, with the goal of finding the
regular structure of the distribution of high-dimensional data. It assumes that high-dimensional data
lie on or are distributed in a low-dimensional manifold, and uses graph embedding techniques
to discover their low dimensions. The representative algorithms include ISOMAP [9],
Local Linear Embedding [10], and Laplacian Eigenmap [11]. S. Yan et al. proposed a
unified framework of graph embedding, under which most data dimension reduction
algorithms can be well integrated [12].
Manifold learning is based on spectral graph theory. It first assumes that
high-dimensional data lie on or are distributed in a low-dimensional manifold, and then uses
linear embedding methods to reduce their dimensionality. Meanwhile, it can increase the
distances between inter-class data points and reduce the distances between intra-class data
points. In this process, different data relationship matrices greatly influence the results of
the corresponding algorithms. Moreover, the performance of manifold learning also
depends mainly on the relational graph built on the data. In recent years, quite a few
algorithms have been proposed to learn the image semantic manifold by using the feedback
information as labeled data together with the unlabeled data. Based on that, the relational
graph can be constructed simply by the K-nearest neighbor method, in which the
connectivity value of points belonging to the same K-nearest neighborhood equals one.
The representative algorithms are Augmented Relation Embedding (ARE) [6],
Maximum Margin Projection (MMP) [7], and Semi-Supervised Discriminant Analysis
(SDA) [8]. MMP and ARE have made some improvements on the construction of
the data’s relational graph by modifying the weights in the local scope of the labeled
data, but this improvement is still limited. Take MMP as an example: it uses
the feedback information acquired from the labeled data to increase the weights of
intra-class data points and reduce the weights of inter-class data points. However, this only
changes the weights of locally scoped points among the labeled data, which makes the
improvement rather small. To address the above issues, we propose a novel algorithm
to enhance the relational graph. It is shown to be capable of largely increasing the
weights of intra-class data points while efficiently decreasing the weights of inter-class data
points. Besides, the weights of potentially intra-class data points can also be
increased. The algorithm outputs an enhanced relational graph, which possesses
more information and is more instructive for feedback. Furthermore, we apply the
enhanced relational graph in the framework of semi-supervised learning and improve
the algorithm’s performance effectively.
The rest of this paper is organized as follows. Section 2 briefly introduces the typical
semi-supervised algorithms based on the framework of graph embedding. Section 3
presents the construction algorithm for enhanced relational graph. The experiment
results are then illustrated in Section 4. Finally, Section 5 makes concluding remarks.
Then, a low-dimensional embedding can be achieved through a projection. Let A denote
the projection matrix; it can be calculated by minimizing Eq. (2):

    ∑_ij (A^T x_i − A^T x_j)^2 W_ij    (2)

Let y_i = a^T x_i; we then have:

    ∑_ij (y_i − y_j)^2 W_ij = ∑_ij y_i^2 W_ij − 2 ∑_ij y_i y_j W_ij + ∑_ij y_j^2 W_ij = 2 y^T L y
Here y denotes the projection of all data onto a, that is, y = a^T X; D_ii represents the
number of points connected to the i-th point, which to some extent reveals the
importance of that point. By adding the constraint y^T D y = 1 and making a coordinate
transformation, the more important a point is, the closer it will be to the origin. Finally,
the objective function is represented as:
From the above derivation, we can see that the similarity matrix W plays an important
role in the procedure. The projected data point y is closely related to W as well: the
larger W_ij is, the more similar x_i and x_j are, and after dimension reduction the distance
between y_i and y_j should be smaller. Here the similarity relationship can be seen as
the class label, and intra-class points are more similar to each other. For unlabeled
data, the similarity can be measured by using the relationships of their
neighboring points: neighboring points should be more similar than other points. For
those data points neither belonging to the same class nor in the same neighborhood, the
similarity is generally set as W_ij = 0.
We will then introduce three representative semi-supervised learning algorithms
proposed in recent years. They generally use the above dimension reduction method in
the relevance feedback process to solve the image retrieval problem.
    W_ij^P = 1 if x_i, x_j ∈ Pos; 0 otherwise    (5)

    W_ij^N = 1 if (x_i ∈ Pos and x_j ∈ Neg) or (x_j ∈ Pos and x_i ∈ Neg); 0 otherwise    (6)

where Pos and Neg denote the positive set and the negative set, respectively.
The relational matrix of the whole data set can then be denoted as:

    W_ij = 1 if x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i); 0 otherwise    (7)
    X [L^N − γ L^P] X^T a = λ X L X^T a    (8)
where γ is the ratio of the number of positive data points to the number of negative data
points.
Eq. (8) finally provides the embedding vector of each image. Theoretically, those
vectors represent the resulting images which better match the user’s search query.
Therefore, the feedback information can be used to modify W_ij^P and W_ij^N repeatedly
to finally improve the image retrieval accuracy.
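Under our reading of Eqs. (5)–(8), a minimal sketch of this ARE-style construction looks as follows. The small ridge added to the right-hand side (so that a standard generalized eigensolver applies) and the choice of keeping the eigenvectors with the largest eigenvalues are our own assumptions, anticipating the regularization discussed later in the paper.

```python
import numpy as np
from scipy.linalg import eigh

def knn_graph(X, k=5):
    """Symmetric 0/1 k-NN affinity matrix (Eq. (7)); X holds one sample per column."""
    n = X.shape[1]
    d = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)   # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d[i])[1:k + 1]] = 1                      # skip the point itself
    return np.maximum(W, W.T)                                    # x_i in N_k(x_j) or vice versa

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def are_projection(X, pos, neg, k=5, gamma=None, ridge=1e-6):
    """Solve X[L^N - gamma*L^P]X^T a = lambda X L X^T a (Eq. (8)).
    pos/neg are index lists of positive/negative feedback images; gamma defaults
    to |Pos|/|Neg| as described after Eq. (8)."""
    if gamma is None:
        gamma = len(pos) / len(neg)
    n = X.shape[1]
    WP = np.zeros((n, n))
    WN = np.zeros((n, n))
    for i in pos:
        for j in pos:
            if i != j:
                WP[i, j] = 1                                     # Eq. (5)
    for i in pos:
        for j in neg:
            WN[i, j] = WN[j, i] = 1                              # Eq. (6)
    W = knn_graph(X, k)                                          # Eq. (7)
    A = X @ (laplacian(WN) - gamma * laplacian(WP)) @ X.T
    B = X @ laplacian(W) @ X.T + ridge * np.eye(X.shape[0])      # ridge keeps B positive definite
    vals, vecs = eigh(A, B)
    return vecs[:, ::-1]                                         # largest generalized eigenvalues first

# Toy usage: 112-D features for 40 images and a handful of feedback labels
rng = np.random.default_rng(1)
X = rng.random((112, 40))
a = are_projection(X, pos=[0, 1, 2], neg=[3, 4])
print(a.shape)                                                   # (112, 112)
```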
The parameter γ is used to increase the weights between each positive point and each
negative point. Suppose that the vector after a projection is y = (y_1, y_2, ..., y_m)^T; the two
objective functions are then calculated as follows:
Eq. (11) makes the data points belonging to the same class closer to each other after
projection, while Eq. (12) makes the data points in different classes far away from each
other. Through algebraic operations, the two objective functions can be derived as:
    X (α L^b + (1 − α) W^w) X^T a = λ X D^w X^T a    (13)
The SDA provides a learning framework for the feedback-based image retrieval.
Generally, there is a great difference between the number of labeled data and unlabeled
data. How to make full use of the unlabeled data is the key point of semi-supervised
learning. The idea of SDA is similar to LDA, except that the LDA is a supervised
learning algorithm while SDA is a semi-supervised one. It utilizes a small amount of
labeled data to improve the discrimination between different classes. The key of this
algorithm is to find the eigenvectors by solving the following equation:
    X W_SDA X^T a = λ X (I + α L) X^T a    (14)

where L is the Laplacian matrix corresponding to the relational matrix defined in Eq. (7),
and W_SDA is the relational matrix for the labeled images,

    W_SDA = [ W^l  0 ; 0  0 ].

The feedback information is preserved in W^l. Let the number of positive images be l; then

    W^l = [ W^(1)  0 ; 0  W^(2) ].

The size of the matrix W^(1) is l × l, with all elements equal to 1/l. Similarly, we can
calculate the matrix W^(2). Moreover, I = [ I_l  0 ; 0  0 ], where I_l is an l × l identity
matrix. Therefore, the rank of W_SDA is 2, and the high-dimensional image feature data
are projected onto a two-dimensional data space. Repeatedly modifying W_SDA and
I can improve the image retrieval accuracy.
Some researchers have summarized the above semantic learning methods as
spectral regression [13]. The objective functions of this kind of algorithm form a
quadratic optimization problem in a low-dimensional space, and the optimization
problem is finally transformed into an eigenvalue problem. In general it is similar
to the spectral clustering algorithm [14].
the points belonging to the same class may not be the neighbors. Therefore, our
algorithm tries to build a more robust relational graph to simulate the semantic
similarity between images more accurately.
In order to better distinguish the labeled data from the unlabeled data, we firstly build a
hierarchical relational matrix given as:
    W_ij = γ, if x_i and x_j share the same label;
    W_ij = 1, if x_i or x_j is unlabeled, and x_i ∈ N_k(x_j) or x_j ∈ N_k(x_i);
    W_ij = 0, if x_i and x_j have different labels;
    W_ij = 0, otherwise.    (15)
      W*_ij ← ∑_z W_iz W_jz
    EndIf
  EndFor
EndFor
3. Output: W*
Finally, we generate the relational matrix W*. From the above procedure, we can
see that if two points have common neighboring points, the relation between those two
points will be enhanced: the weight of the two points is calculated as the sum of the
products of their weights to their common neighbors.
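Since the opening steps of the algorithm listing are missing from this copy, the sketch below reconstructs the whole procedure from the surrounding description: first build the hierarchical matrix of Eq. (15), then, for every pair of points that share common neighbours, replace the weight by the sum of the products of their weights to those common neighbours. The exact condition under which a pair qualifies for enhancement (and whether pairs with explicitly different labels are exempt) is our own guess.

```python
import numpy as np

def hierarchical_matrix(X, labels, k=5, gamma=50.0):
    """Hierarchical relational matrix of Eq. (15); X holds one sample per column,
    labels[i] is the class of x_i or None when x_i is unlabeled."""
    n = X.shape[1]
    d = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    knn = [set(np.argsort(d[i])[1:k + 1]) for i in range(n)]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if labels[i] is not None and labels[j] is not None:
                W[i, j] = gamma if labels[i] == labels[j] else 0.0
            elif j in knn[i] or i in knn[j]:
                W[i, j] = 1.0          # at least one point is unlabeled and they are k-NN
    return W

def enhance(W):
    """Enhanced relational matrix W*: our reading of the listing is that a pair
    sharing common neighbours gets W*_ij = sum_z W_iz * W_jz, and other pairs
    keep their original weight."""
    W_star = W.copy()
    n = W.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            common = W[i] * W[j]       # nonzero only at common neighbours z
            if common.any():
                W_star[i, j] = common.sum()
    return W_star

# Toy usage: 10 points, two labeled per class, the rest unlabeled
rng = np.random.default_rng(2)
X = rng.random((5, 10))
labels = [0, 0, 1, 1] + [None] * 6
W_star = enhance(hierarchical_matrix(X, labels, k=3))
print(W_star.shape)                    # (10, 10)
```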
3.2 ESDA
    X W_SDA X^T a = λ X (I + α L*) X^T a    (16)
After dimension reduction, the dimension of the data space of ESDA is 2, similar to
SDA. In feedback-based image retrieval, we use the feedback information to modify
Eq. (15) to improve the retrieval performance iteratively.
Generally speaking, there are two ways to solve the above objective function: one is
singular value decomposition [15] and the other is regularization [16]. In this paper, we
adopt the regularization method. Regularization is a simple and efficient way to change
a singular matrix into a non-singular one. Take Eq. (8) as an example: we just change the
right part of the equation from X L X^T to X L X^T + α I, where I is an identity matrix;
a full-rank matrix is obtained after this transformation.
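A minimal numerical illustration of this trick, with an arbitrary α of our own choosing: replacing X L X^T by X L X^T + αI turns a rank-deficient right-hand side into a full-rank, positive-definite matrix, so a standard generalized eigensolver can be used.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
X = rng.random((112, 40))                 # 112-D features, 40 images: X L X^T is rank-deficient
W = rng.random((40, 40))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

A = X @ W @ X.T                           # left-hand side (stand-in for X[L^N - gamma*L^P]X^T)
B = X @ L @ X.T                           # right-hand side, singular since 40 < 112
print(np.linalg.matrix_rank(B))           # at most 39

alpha = 1e-3
B_reg = B + alpha * np.eye(112)           # regularized: full rank and positive definite
vals, vecs = eigh(A, B_reg)               # now solvable by a standard generalized eigensolver
print(vecs.shape)                         # (112, 112)
```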
4 Experimental Results
To evaluate the performance of the proposed algorithm, we apply it to the relevance
feedback-based image retrieval. For simplicity, we choose the ESDA algorithm as an
example to carry out our experiments and demonstrate the effectiveness of the proposed
enhanced relational graph algorithm.
We used Corel5K [17] as the data set in our experiments, in which 30 categories are
tested. Each category has a different semantic meaning and contains one hundred images;
the whole data set therefore consists of 3,000 images in total. Examples of the tested
images are shown in Figure 1(b)-(d). The query image was randomly selected from
each category, and the rest of the images served as the searched database. To evaluate
algorithm performance, we used the Precision-Scope Curve as the benchmark [6][18],
which is more suitable than the Precision-Recall Curve for measuring image retrieval
performance, because the retrieved results may contain a large number of images and a
user usually will not examine the entire result set. Therefore, checking the accuracy
within a certain scope is more reasonable for performance evaluation.
For image features, we used color moments, Tamura texture features, Gabor texture
features, and a color histogram. The image features and their dimensions are shown in
Table 1; each image was described by a vector of dimension 112.
In the experiment, we used the optimal values for all parameters, as indicated in the
original literature. For example, we set γ = 50 in ARE and MMP, set the number of
reduced dimensions to 30 [6][7], and set the number of neighbors to k = 5. Both ESDA
and SDA reduce the original data space to a 2-dimensional space. With this
configuration, we first chose an image randomly to act as the query image, and used the
top 20 returned images to evaluate the performance. For each class of images, we ran
each algorithm 50 times, with only one feedback iteration per run. Finally, we
calculated the average retrieval accuracy over all classes for each algorithm.
The experimental results are shown in Figure 1(a). The “Baseline” denotes the accuracy
without feedback. From the results, we can see that the accuracy of ESDA is higher
than that of ARE and MMP on almost all categories, with the exception of Category 9
and Category 19. In particular, ESDA greatly outperforms SDA. The results also show
that relevance feedback improves the accuracy globally, which confirms that the
relevance feedback mechanism is an effective way to address the semantic gap
problem [1][2].
In order to further evaluate the retrieval performance of those algorithms, we
conducted an experiment to test the feedback effectiveness. In this experiment, we ran
each algorithm 50 times for each class of images, taking feedback for 4 iterations. We
then calculated the average retrieval accuracy over all classes for each algorithm; the
resulting Precision-Scope Curves are shown in Figure 2(a). The X-axis shows the scope
(top 10, top 20, etc.), and the Y-axis shows the accuracy within that scope. The results
show that ESDA achieved the highest accuracy in each feedback iteration, in particular
higher than SDA, which further confirms the effectiveness of the enhanced relational
graph algorithm. Figure 2(b)-(e) shows the top 10 retrieval results for the category
“tiger” with no feedback, 1-iteration feedback, 2-iteration feedback, and 3-iteration
feedback, respectively.
Fig. 1. (a) The algorithms’ accuracy curves after 1-iteration feedback; (b)-(d) sample images from
category 1, 2, and 3, respectively
Fig. 2. (a) The algorithms’ Precision-Scope Curves in a 4-iteration feedback; (b)-(e) retrieval results
with no feedback, 1-iteration feedback, 2-iteration feedback, and 3-iteration feedback, respectively
5 Conclusions
A new algorithm for constructing an enhanced relational graph was proposed in this paper.
Compared with traditional construction methods, the algorithm combines all of the
class information in the data and effectively enhances the intra-class data relationships.
It can be easily extended to the framework of semi-supervised learning. Based on the
enhanced relational graph and the SDA method, an algorithm called ESDA was developed.
The experimental results in image retrieval show that ESDA outperforms the existing
algorithms. The enhanced relational graph can also be used to improve the robustness
and performance of the MMP and ARE methods; therefore, it can be widely applied to
data dimension reduction methods based on graph embedding. It can utilize the
information from both labeled and unlabeled data in the framework of
semi-supervised learning to effectively improve the robustness of a semi-supervised
learning algorithm.
Acknowledgements
We would like to acknowledge the support from the National Science Foundation of
China (Grant Nos. 60875011, 60975043, 60723003, 61021062), the National 973
Program of China (Grant No. 2010CB327903), and the Key Program of the Natural
Science Foundation of Jiangsu, China (Grant No. BK2010054).
References
1. Smeulders, W.M., et al.: Content-based image retrieval at the end of the early years. IEEE
Trans. on Pattern Analysis and Machine Intelligence 22, 1349–1380 (2000)
2. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New
York (1982)
3. Ishikawa, Y., Subramanya, R., Faloutsos, C.: Mindreader: Querying databases through
multiple examples. In: International Conference on Very Large Data Bases, pp. 218–227
(1998)
4. Porkaew, K., Chakrabarti, K.: Query refinement for multimedia similarity retrieval in
MARS. In: ACM Conference on Multimedia, pp. 235–238 (1999)
5. Rui, Y., Huang, T., Mehrotra, S.: Content-based image retrieval with relevance feedback in
mars. In: Int’l Conference on Image Processing, pp. 815–818 (1997)
6. Lin, Y.-Y., Liu, T.-L., Chen, H.-T.: Semantic manifold learning for image retrieval. In:
Proceedings of the ACM Conference on Multimedia, Singapore (November 2005)
7. He, X., Cai, D., Han, J.: Learning a Maximum Margin Subspace for Image Retrieval. IEEE
Transactions on Knowledge and Data Engineering 20(2), 189–201 (2008)
8. Cai, D., He, X., Han, J.: Semi-Supervised Discriminant Analysis. In: IEEE International
Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil (October 2007)
9. Tenenbaum, J., de Silva, V., Langford, J.: A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500), 2319–2323 (2000)
10. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding.
Science 290(5500), 2323–2326 (2000)
11. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and
clustering. In: Advances in Neural Information Processing Systems, vol. 14, pp. 585–591.
MIT Press, Cambridge (2001)
12. Yan, S., Xu, D., Zhang, B., Yang, Q., Zhang, H., Lin, S.: Graph embedding and extensions:
A general framework for dimensionality reduction. TPAMI 29(1), 40–51 (2007)
13. Cai, D., He, X., Han, J.: Spectral Regression: A Unified Subspace Learning Framework for
Content-Based Image Retrieval. In: ACM Multimedia, Augsburg, Germany (September
2007)
14. Von Luxburg, U.: A Tutorial on Spectral Clustering. Statistics and Computing 17(4),
395–416 (2007)
15. Stewart, G.W.: Matrix Algorithms: Eigensystems, vol. II. SIAM, Philadelphia (2001)
16. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical
Association 84(405), 165–175 (1989)
17. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object recognition as machine
translation: Learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G.,
Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 97–112. Springer,
Heidelberg (2002)
18. Huijsmans, D.P., Sebe, N.: How to Complete Performance Graphs in Content-Based Image
Retrieval: Add Generality and Normalize Scope. IEEE Trans. Pattern Analysis and Machine
Intelligence 27(2), 245–251 (2005)
19. Stricker, M., Orengo, M.: Similarity of color images. In: SPIE Storage and Retrieval for
Image and Video Databases III, vol. 2185, pp. 381–392 (February 1995)
20. Tamura, H., Mori, S., Yamawaki, T.: Texture features corresponding to visual perception.
IEEE Trans. On Systems, Man, and Cybernetics Smc-8(6) (June 1978)
21. Bovic, A.C., Clark, M., Geisler, W.S.: Multichannel texture analysis using localized spatial
filters. IEEE Trans. Pattern Analysis and Machine Intelligence 12, 55–73 (1990)
Prediction-Oriented Dimensionality
Reduction of Industrial Data Sets
Maciej Grzenda
1 Introduction
– the problem of dimensionality reduction and the main features of PCA are
summarised in Sect. 2,
– the algorithm controlling the process of dimensionality reduction is proposed
in Sect. 3,
– next, the results of applying the algorithm are discussed, based on experiments
performed with two different industrial data sets,
– finally, conclusions and the main directions of future research are outlined.
Many data sets used in machine learning problems contain numerous variables
of partly unknown value and mutual impact. Some of these variables may be cor-
related, which may negatively affect model performance [4]. Among other prob-
lems, an empty space phenomenon discussed above is a major issue when mul-
tidimensional data sets are processed. Hence, the analysis of high-dimensional
data aims to identify and eliminate redundancies among the observed variables
[5]. The process of DR is expected to reveal underlying latent variables. More
formally, the DR of a data set D ⊂ R^S can be defined by two functions used to
code and decode an element x ∈ D [5]:

    c : R^S → R^R,  x → x̃ = c(x)    (1)
    d : R^R → R^S,  x̃ → x = d(x̃)    (2)
The dimensionality reduction may be used for better understanding of the data
including its visualisation and may contribute to model development process.
The work concentrates on the latter problem of dimensionality reduction as a
technique improving the quality of prediction models.
An important aspect of DR is the estimation of intrinsic dimension, which
can be interpreted as the number of latent variables [5]. More precisely, the
intrinsic dimension of a random vector y can be defined as the minimal number
of parameters or latent variables needed to describe y [3,5]. Unfortunately, in
many cases, it remains difficult or even impossible to precisely determine the
intrinsic dimension for a data set of interest. Two main factors contribute to this
problem. The first reason is noisiness of many data sets making the separation
of noise and latent variables problematic. The other reason is the fact that many
techniques such as PCA may overestimate the number of latent variables [5].
It is worth emphasizing that even if intrinsic dimension is correctly assessed,
in the case of some prediction models built using the multidimensional data,
some latent variables may have to be skipped to reduce model complexity and
avoid model overfitting. In the case of MLP-based models, this can be due to the
limited number of records comparing to the number of identified latent variables.
The latter number defines the number of input signals and largely influences
the number of connection weights that have to be set by a training algorithm.
Nevertheless, since R < S, the pair of functions c() and d() usually does not
define a reversible process, i.e. for all or many x ∈ D it may be observed that
d(c(x)) ≠ x. Still, the overall benefits of the transformation may justify the
information loss caused by DR.
To sum up, for the purpose of this study an assumption is made that intrin-
sic dimension will not be assessed. Instead, the performance of prediction
models built using the transformed data sets D̃ ⊂ R^R is analyzed.
One of the most frequently used DR techniques, PCA, was used for the
simulations performed in this study. The PCA model assumes that the S observed
variables result from a linear transformation of R unknown latent variables.
An important requirement of the method is that the input variables are centred
by subtracting the mean value and, if necessary, standardised. The standardisation
process should take into account the knowledge of the data set and should
not be applied automatically, especially in the presence of noise. Nevertheless,
the basic initial transformation can be defined as X_i → (X_i − μ_i)/σ_i, where μ_i and
σ_i stand for the mean and standard deviation of variable i, i = 1, 2, . . . , S. In the
remainder of the work, for the sake of simplicity, an input data set D denotes
the data centred and standardised.
Let Σ denote the (S × S) covariance matrix, let λ_1, . . . , λ_S denote the eigenvalues of
the matrix sorted in descending order, and let q_1, . . . , q_S denote the associated eigenvectors,
i.e. Σ q_i = λ_i q_i, i = 1, 2, . . . , S. Finally, let a_j = q_j^T x = x^T q_j, j = 1, . . . , S.
Then, the coding function c_R(x) can be defined as c_R(x) = [a_1, . . . , a_R]
[2]. Hence, c_R(x) : R^S → R^R. Similarly, d_R(a) = ∑_{j=1}^{R} a_j q_j [2].
One of the objective techniques of selecting the number of preserved com-
ponents R and hence transforming D −→ D̃ ⊂ RR is based on proportion of
explained variation (PEV) α [4,5]. In technical sciences, α = 0.95 may be ap-
plied [4].
    R = min { l = 1, 2, . . . , S : (∑_{d=1}^{l} λ_d) / (∑_{d=1}^{S} λ_d) ≥ α }    (3)
According to this criterion, the reduced dimension R is set to the lowest number
of largest eigenvalues satisfying the condition stated in formula 3.
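The coding/decoding pair and the PEV rule of formula (3) can be written down directly. The sketch below (plain NumPy, with variable names of our own choosing) centres and standardises the data, eigendecomposes the covariance matrix, picks R by the PEV criterion, and reports the reconstruction error ‖x − d_R(c_R(x))‖ for one record of a synthetic data set.

```python
import numpy as np

def pev_pca(D, alpha=0.95):
    """D: (n_records, S) data matrix. Returns coding/decoding functions and
    the reduced dimension R chosen by the PEV criterion of formula (3)."""
    mu, sigma = D.mean(axis=0), D.std(axis=0)
    Z = (D - mu) / sigma                          # centred and standardised data
    C = np.cov(Z, rowvar=False)                   # (S x S) covariance matrix
    lam, Q = np.linalg.eigh(C)                    # eigenvalues in ascending order
    lam, Q = lam[::-1], Q[:, ::-1]                # sort descending: lambda_1 >= ... >= lambda_S
    pev = np.cumsum(lam) / lam.sum()
    R = int(np.searchsorted(pev, alpha) + 1)      # smallest l with PEV >= alpha
    c = lambda x: Q[:, :R].T @ x                  # coding  c_R: R^S -> R^R
    d = lambda a: Q[:, :R] @ a                    # decoding d_R: R^R -> R^S
    return c, d, R, lam

# Toy usage on synthetic data with 7 correlated variables
rng = np.random.default_rng(3)
latent = rng.normal(size=(219, 3))
D = latent @ rng.normal(size=(3, 7)) + 0.05 * rng.normal(size=(219, 7))
c, d, R, lam = pev_pca(D, alpha=0.95)
x = (D[0] - D.mean(axis=0)) / D.std(axis=0)
print(R, np.linalg.norm(x - d(c(x))))             # chosen R and reconstruction error for one record
```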
DR, both in its linear and non-linear form is frequently used in engineering
applications, especially in image processing problems. In the latter case, even
mid-size resolution of images results in potentially thousands of dimensions.
Hence, numerous applications of DR to problems such as shape recognition, face
recognition, motion understanding or colour perception were proposed. Among
other studies, [10] provides a survey of works in this field. Ever new applications
of PCA continue to be proposed, such as the detection of moving objects [11].
When modelling production processes, PCA is usually used to run a separate
preprocessing process. In this case, frequently [6,8], a decision is made to leave
the principal components representing α = 0.95 of overall variance in the set,
as stated in formula 3. In this way, a transformed data set D̃ ⊂ RR is obtained
and used for model development. In the next stage, it is typically divided into
the training and testing set [8] or a 10-fold cross-validation technique is applied
to assess the merits of model construction algorithm [6]. In the latter stage,
different prediction and classification models based on neural models, such as
MLPs [8], Radial Basis Function (RBF) [8], or Support Vector Machines (SVM)
[7] are frequently developed. It is important to note that usually a decision is
made to apply an a priori reduction of a data set. Also, some other popular
methods of selecting the reduced dimension R, such as the eigenvalue criterion, the scree
plot, and the minimum communality criterion [4], refrain from evaluating the impact of R on
model creation and its generalisation capabilities.
3 Algorithm
Prediction models based on MLP networks can suffer from poor generalisation.
When the data are scarce, and this is often the case, a model may not respond
correctly on the data not used in the training process. In industrial applications,
the main objective is the credibility of a model i.e. the minimisation of prediction
error on new data. Therefore, the criterion for selecting the reduced dimension
R might be expressed as follows:
    R : E(M(D̃_T^R)) = min_{l=1,2,...,S} E(M(D̃_T^l))    (4)

where M denotes the prediction model trained on the training set D̃_L^l of dimension
l, D̃_T^l denotes the testing data set, and E() the error observed on the testing
set D̃_T^l when using model M. In reality, the results of applying criterion 4
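A skeleton of the selection procedure implied by criterion (4) is sketched below: for every candidate dimension l, an MLP is trained on the PCA-reduced training set and the test error is recorded, and the R with the lowest error is kept. The MLP settings, the library choice, and the simple split are illustrative assumptions only; they are not the exact configuration used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def select_dimension(X_train, y_train, X_test, y_test):
    """Pick R minimising the test error E(M(D_T^l)) over l = 1..S (criterion (4))."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Ztr, Zte = (X_train - mu) / sd, (X_test - mu) / sd     # centre/standardise with training stats
    lam, Q = np.linalg.eigh(np.cov(Ztr, rowvar=False))
    Q = Q[:, ::-1]                                         # principal directions, descending variance
    errors = {}
    for l in range(1, X_train.shape[1] + 1):
        model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
        model.fit(Ztr @ Q[:, :l], y_train)
        errors[l] = mean_squared_error(y_test, model.predict(Zte @ Q[:, :l]))
    R = min(errors, key=errors.get)
    return R, errors

# Toy usage with 7 input variables, mimicking the shape of the drilling data set
rng = np.random.default_rng(4)
X = rng.random((219, 7))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=219)
R, errors = select_dimension(X[:150], y[:150], X[150:], y[150:])
print(R, min(errors.values()))
```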
4 Experimental Results
The first data set comes from a series of deep drilling experiments [1]. The exper-
imental test provided a dataset with 7 input variables: tool type, tool diameter,
hole length, feed rate per revolution a_v, cutting speed V_c, type of lubrication
system and axial cutting force. A data set of 219 records was used for the sim-
ulations in this study. Some of the records were originally incomplete. Hence,
a data imputation algorithm was applied. The experimental procedure and the
imputation of the data set were discussed in detail in [1]. The main modelling
objective is the prediction of a roughness of a drill hole under different drilling
conditions and tool settings.
Fig. 1 shows the results of the simulations. The lowest error rate E_T^R(D, P)
and reduction factor β^R(D, P) are observed for R = 6. Further dimensionality
reduction results in a rise of the error rates measured on the training and testing
data sets. In other words, from the prediction perspective, a PCA-based reduction
to R = 6 dimensions yields the best roughness prediction. When a reduction to
D̃ ⊂ R^2 is made, the error rate of the MLP networks is virtually identical to the
naive baseline N. Hence, the PCA-based reduction cannot be used to visualise the
problem in R^2.
The second data set used in the simulations was obtained from a milling process.
Milling consists of removing the excess material from the workpiece in the form
of small individual chips. A resulting surface consists of a series of elemental
surfaces generated by the individual cutter edges. Surface roughness average
(Ra) is commonly used to evaluate surface quality. Hence, the need for a model
predicting the Ra value depending on the other experimentally measured factors.
Further details on the process can be found in [9].
(Plot: error rate, reduction factor, and preserved variance σ_l^2 against dimension l = 7, . . . , 2.)
Fig. 1. The impact of DR on prediction error rates for the drilling process
(Plot: error rate, reduction factor, and preserved variance σ_l^2 against dimension l = 20, . . . , 2.)
Fig. 2. The impact of DR on prediction error rates for the milling process
The results of the simulations are shown in Fig. 2. When no PCA transformation
is applied to the data set, β^R(D, P) is close to 1, i.e. the MLP-based
prediction is similar to naively guessing the mean value. However, as the dimension
R is reduced, the number of connection weights of the MLP network is reduced,
too. Therefore, β^R(D, P) is gradually reduced and reaches its minimum at
R = 4. In this case, even a reduction to R^2 does not fully diminish the predictive
capabilities of the MLP models.
4.4 Discussion
The error rates calculated on the testing sets and the reduction factors were
summarised in Table 1. The reduced dimension calculated according to the PEV
criterion, i.e. formula 3, is denoted by R̃, while R denotes the dimension resulting
from formula 6. Not only is R̃ ≠ R, but there is also no straightforward relation between the two
values. In the case of a drilling data set, the number of principal components
preserved by PEV rule is lower, while in the case of milling data much higher
than dimension R at which the error reduction β R reaches its minimum. This can
be easily explained, when the size of both data sets is considered. For the drilling
data set, the best prediction models rely on components representing 0.9962 of
the total variance. This is due to the larger data set compared with the milling case.
In the latter case, with just 60 records present in the data set, the
best reduction in terms of predictive capabilities preserves only 0.7317 of total
variance. Nevertheless, when a reduction is made to RR , R ≥ 5, the overfitting
of the MLP models diminishes the benefits arising from lower information loss.
5 Summary
When the prediction of industrial processes is based on experimental data sets,
PCA-based dimensionality reduction is frequently applied. The simulations de-
scribed in this work show that the PEV criterion may not always yield the best
dimensionality reduction from the prediction perspective. When the number of avail-
able records is sufficiently large, the number of preserved components may be
set to a higher value. In the other cases, even significant information loss may
contribute to the overall model performance. A formal framework for testing
different DR settings was proposed in the work. It may be used with other DR
methods, including non-linear transformations. Further research on the selec-
tion of the optimal combination of DR method, reduced dimension and neural
network architecture using both unsupervised and supervised DR methods is
planned.
Acknowledgements. This work has been made possible thanks to the support
received from Dr. Andres Bustillo of the University of Burgos, who provided both
data sets and the descriptions of the industrial problems. The author would especially
like to thank him for his generous help and advice.
References
1. Grzenda, M., Bustillo, A., Zawistowski, P.: A Soft Computing System Using Intel-
ligent Imputation Strategies for Roughness Prediction in Deep Drilling. Journal of
Intelligent Manufacturing, 1–11 (2010),
http://dx.doi.org/10.1007/s10845-010-0478-0
2. Haykin, S.: Neural Networks and Learning Machines. Pearson Education (2009)
3. Kegl, B.: Intrinsic Dimension Estimation Using Packing Numbers. In: Adv.
In Neural Inform. Proc. Systems, Massachusetts Inst. of Technology, vol. 15,
pp. 697–704 (2003)
4. Larose, D.T.: Data Mining Methods and Models. John Wiley & Sons, Chichester
(2006)
5. Lee, J., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer, Heidelberg
(2010)
6. Li, D.-C., et al.: A Non-linear Quality Improvement Model Using SVR for Manu-
facturing TFT-LCDs. Journal of Intelligent Manufacturing, 1–10 (2010)
7. Maszczyk, T., Duch, W.: Support Vector Machines for Visualization and Dimen-
sionality Reduction. In: Kůrková, V., Neruda, R., Koutnı́k, J. (eds.) ICANN 2008,
Part I. LNCS, vol. 5163, pp. 346–356. Springer, Heidelberg (2008)
8. Pal, S., et al.: Tool Wear Monitoring and Selection of Optimum Cutting Conditions
with Progressive Tool Wear Effect and Input Uncertainties. Journal of Intelligent
Manufacturing, 1–14 (2009)
9. Redondo, R., Santos, P., Bustillo, A., Sedano, J., Villar, J.R., Correa, M., Alique,
J.R., Corchado, E.: A Soft Computing System to Perform Face Milling Operations.
In: Omatu, S., Rocha, M.P., Bravo, J., Fernández, F., Corchado, E., Bustillo, A.,
Corchado, J.M. (eds.) IWANN 2009. LNCS, vol. 5518, pp. 1282–1291. Springer,
Heidelberg (2009)
10. Rosman, G., et al.: Nonlinear Dimensionality Reduction by Topologically Con-
strained Isometric Embedding. Int. J. of Computer Vision 89(1), 56–68 (2010)
11. Verbeke, N., Vincent, N.: A PCA-Based Technique to Detect Moving Objects. In:
Ersbøll, B.K., Pedersen, K.S. (eds.) SCIA 2007. LNCS, vol. 4522, pp. 641–650.
Springer, Heidelberg (2007)
Informative Sentence Retrieval for Domain
Specific Terminologies
1 Introduction
When students study a course in a specific domain, the learning materials usually
mention many domain-specific terminologies. For example, “supervised learning”
and “decision tree” are important domain-specific terminologies in the field of data
mining. Domain-specific terminologies represent important concepts in the
learning process. If a student does not know the implicit meaning of a domain-specific
terminology, it is difficult for the student to understand the complete semantics or
concepts represented in the sentences which contain that terminology. To solve this
problem, a student would like to look for resources that explain the domain-specific
terminology. Accordingly, it is very useful to provide an effective retrieval
system for searching informative sentences about a domain-specific terminology.
Although various kinds of data on the Internet can be accessed easily by search
engines, the quality and correctness of the data are not guaranteed. Books are still the
2 Related Works
The traditional IR systems provide part of the solution for searching informative data
of a specific term, which can only retrieve relevant documents for a query topic but
not the relevant sentences. Aiming to find exact answers to natural language
questions in a large collection of documents, open-domain QA has become one of the
most actively investigated topics over the last decade [13].
Among the research issues of QA, many works focused on constructing short
answers for relatively limited types of questions, such as factoid questions, from a
large document collection [13]. The problem of definitional question answering is a
task of finding out conceptual facts or essential events about the question target [14],
which is similar to the problem studied in this paper. In contrast to factoid questions, a
definitional question does not clearly imply an expected answer type but only
specifies its question target. Moreover, the answers of definitional questions may
consist of small segments of data with various conceptual information called
information nuggets. Therefore, the challenge is how to find the information which is
essential for the answers to a definitional question.
Most approaches used pattern matching for definition sentence retrieval. Many of
the previously proposed systems created patterns manually [12]. To prevent the
manually constructed rules from being too rigid, a sequence-mining algorithm was
applied in [6] to discover definition-related lexicographic patterns from the Web.
According to the discovered patterns, a collection of concept-description pairs is
extracted from the document database. The maximal frequent word sequences in the
set of extracted descriptions were selected as candidate answers to the given question.
Finally, the candidate answers were evaluated according to the frequency of
occurrence of their subsequences to determine the most adequate answers. In the joint
prediction model proposed in [9], not only the correctness of individual answers, but
also the correlations of the extracted answers were estimated to get a list of accurate
and comprehensive answers. For solving the problem of diversity in patterns, a soft
pattern approach was proposed in [4]. However, the pattern matching approaches are
usually applicable to general topics or to certain types of entities.
The relevance-based approaches explore another direction of solving definitional
question answering. Chen et al. [3] used the bigram and bi-term language model to
capture the term dependence, which was used to rerank candidate answers for
definitional QA. The answer of a QA system is a smaller segment of data than in a
document retrieval task. Therefore, the problems of data sparsity and exact matching
become critical when constructing a language model for extracting relevant answers
to a query. For solving these problems, after performing terms and n-grams clustering,
a class-based language model was constructed in [11] for sentence retrieval. In [7], it
was considered that an answer for the definitional question should not only contain
the content relevant to the topic of the target, but also have a representative form of
the definition style. Therefore, a probabilistic model was proposed to systematically
combine the estimations of a sentence on topic language model, definition language
model, and general language model to find retrieval essential sentences as answers for
the definitional question. Furthermore, external knowledge from web was used in [10]
to construct human interest model for extracting both informative and human-
interested sentences with respect to the query topic.
From the early 2000s, rather than just consuming information on the Web,
more and more users participated in content creation. Accordingly, social media
sites such as web forums, question/answering sites, photo and video sharing
communities, etc. have become increasingly popular. For this reason, how to retrieve content
from social media to support question answering has become an important research topic
in text mining. The problems of identifying question-related threads and their
potential answers in forums were studied in [5] and [8]. A sequential-pattern-based
classification method was proposed in [5] to detect questions in a forum thread;
within the same thread, a graph-based propagation method was provided to detect the
corresponding answers. Furthermore, it was shown in [8] that, in addition to the
content features, the combination of several non-content features can improve the
performance of question and answer detection. The extracted question-answer pairs
in a forum can be applied to find potential solutions or suggestions when users ask
similar questions. Consequently, the next problem is how to find good answers for a
user’s question from a question and answer archive. To solve the word mismatch
problem when looking for similar questions in a question and answer archive, the
retrieval model proposed in [15] adopted a translation-based language model for the
question part. Besides, after combining it with the query likelihood language model for
the answer part, it achieved a further improvement in the accuracy of the retrieved results.
However, the main problem of the above tasks is that it is difficult to ensure the
quality of content in social media [1].
Fig. 1. The proposed system architecture for retrieving informative sentences of a query term
3 Proposed Methods
The overall architecture of our proposed system for retrieving informative sentences
of a query term is shown in Fig. 1. The main processing components include Data
Preprocessing, Candidate Answers Selection, Term Weighting, and Answer Ranking.
In the following subsections, we will introduce the strategies used in the processing
components in detail.
<1> Text extraction: In our system, electronic books in PDF format are selected as
the answer resources. Therefore, the text content of the books is extracted and
maintained in one text file per page by using a PDF-to-text converter. In addition, an
HTML parser is implemented to extract the text content of the documents obtained from
the web resources.
<2> Stemming and stop word removal: In this step, the English words are all
transformed to their root forms. Porter’s stemming algorithm is applied to perform
stemming. After that, we use a stop word list to filter out the common words which do
not contribute significant semantics.
<3> Sentence separation: Because the goal is to retrieve informative sentences of the
query term, the text has to be separated into sentences. We use some heuristic
sentence patterns to separate sentences, such as a capital letter at the beginning of a
sentence and the punctuation marks ‘?’, ‘!’, and ‘.’ at the end of a sentence.
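A minimal sketch of steps <2> and <3> is given below. It assumes that NLTK's Porter stemmer and English stop word list are available and uses a single regular expression in place of the heuristic sentence patterns described above; the function names are illustrative, not those of our implementation.

# Sketch of steps <2> and <3>: sentence separation, stemming, stop word removal.
# Assumes the NLTK stop word data has been downloaded (nltk.download('stopwords')).
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))

def split_sentences(text):
    # A sentence ends with '.', '?' or '!' followed by whitespace and a capital letter.
    return [s.strip() for s in re.split(r'(?<=[.?!])\s+(?=[A-Z])', text) if s.strip()]

def normalize(sentence):
    # Lowercase, keep alphabetic tokens, stem them, and drop stop words.
    words = re.findall(r'[a-z]+', sentence.lower())
    return [STEMMER.stem(w) for w in words if w not in STOP_WORDS]

page_text = ("Supervised learning builds a model from labeled data. "
             "A decision tree is one such model.")
for s in split_sentences(page_text):
    print(normalize(s))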
<4> Index construction: In order to retrieve candidate sentences efficiently from the
textbooks according to the query terminology, we apply the functions supported by
the Apache Lucene search engine to construct an index file for the text content of the
books. In a document database which consists of a large number of documents, the
index file maintains not only the ids of the documents in which a term appears, but also
the locations and frequencies of the term's occurrences in those documents. In the scenario
considered in our system, the text content of the electronic books forms one large
document. We apply two different ways to separate the large document into small
documents for indexing: one way is indexing by pages and the other one is indexing
by sentences.
Let B denote the electronic book used as the answer resource. The set of pages in B
is denoted as B.P = {p1, p2, p3, ..., pn}, where pi denotes the document content
of page i in B. The set of sentences in B is denoted as
B.S = {s1, s2, s3, ..., sm}, where si denotes the document content of the i-th sentence in B.
The method of indexing by pages constructs an index file for B.P; the method of indexing
by sentences constructs an index file for B.S.
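Our system relies on Apache Lucene for the actual index files; the fragment below is only a library-free illustration of the two indexing granularities (by pages, B.P, and by sentences, B.S), using a plain Python inverted index in place of Lucene.

from collections import defaultdict

def build_inverted_index(units):
    # Map each term to the ids of the indexing units (pages or sentences)
    # in which it appears, together with its frequency in each unit.
    index = defaultdict(dict)
    for unit_id, text in enumerate(units):
        for term in text.lower().split():
            index[term][unit_id] = index[term].get(unit_id, 0) + 1
    return index

# B.P: the book as a list of page texts; B.S: the same book as a list of sentences.
pages = ["supervised learning uses labeled data", "a decision tree splits data"]
sentences = ["supervised learning uses labeled data.", "a decision tree splits data."]
page_index = build_inverted_index(pages)          # indexing by pages
sentence_index = build_inverted_index(sentences)  # indexing by sentences
print(page_index["data"])                         # e.g. {0: 1, 1: 1}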
The training corpus consists of the documents obtained from the web resources:
Wikipedia and the Free On-line Dictionary of Computing (FOLDOC). We also construct
an index file for the training corpus in order to calculate the relevance degree of a word
to the query terminology efficiently.
In order to perform ranking for the retrieved candidate sentences, the next step is to
estimate the relevance scores of the candidate sentences to the query term. Therefore,
we use the web resources, including Wikipedia and FOLDOC, as the training
corpus for mining the relevance degrees of words with respect to the query term.
The term frequency-inverse document frequency (TF-IDF) of a term is a weight
often used in information retrieval to evaluate the importance of a word in a document
within a corpus. The Lucene system also applies a TF-IDF based formula to measure
the relevance score of the indexed objects (here, the sentences) to a query term.
Therefore, in our experiments, we use the ranking method supported by Lucene as one of
the baseline methods for comparison.
In our approach, we consider the words which appear in a sentence as features of
the sentence. A sentence should receive a higher score when it contains more words with a high
ability to distinguish the sentences about the query term from the whole
corpus. Therefore, we apply the Jensen-Shannon Divergence (JSD) measure
to perform term weighting, which was used in [2] to extract important terms for
representing the documents within the same cluster. The weight of a term (word) w with
respect to the query terminology X is estimated by measuring the contribution of w to
the JSD between the set of documents containing the query term and the whole training
corpus.
Let P denote the set of query-related documents returned by Lucene, which are
retrieved from the training corpus, and let the set of documents in the training corpus be
denoted by Q. The JSD term weight of a word w, denoted as W_JSD(w), is computed
according to the following formula:
W_{JSD}(w) = \frac{1}{2}\left[\, p(w|P)\cdot\log\frac{p(w|P)}{p(w|M)} \;+\; p(w|Q)\cdot\log\frac{p(w|Q)}{p(w|M)} \,\right]    (1)

The equations for obtaining the values of p(w|P), p(w|Q), and p(w|M) are defined as
follows:

p(w|P) = \frac{freq(w,P)}{|P.words|}    (2)

p(w|Q) = \frac{freq(w,Q)}{|Q.words|}    (3)

p(w|M) = \frac{1}{2}\left[\, p(w|P) + p(w|Q) \,\right]    (4)

where freq(w,P) and freq(w,Q) denote the frequencies of word w appearing in P and
Q, respectively. Besides, |P.words| and |Q.words| denote the total word counts in
P and Q, respectively.
According to the JSD term weighting method, the JSD weight of each word in the
candidate sentences is evaluated. The relevance score of a candidate sentence s,
denoted as Score_JSD(s), is obtained by summing the JSD weights of the words in s
as in the following formula:

Score_{JSD}(s) = \sum_{w \in s} W_{JSD}(w)

where w denotes a word in s. The top-scored sentences are then selected as the
informative sentences of the given domain-specific terminology.
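The sketch below implements Equations (1)-(4) and the sentence score defined above; for simplicity, the word counts of P (the query-related documents) and Q (the whole training corpus) are passed in as plain frequency dictionaries rather than being read from the Lucene index, which is an assumption of this illustration.

import math
from collections import Counter

def jsd_weight(w, freq_P, freq_Q):
    # Contribution of word w to the JSD between P and Q, Eqs. (1)-(4).
    p_wP = freq_P.get(w, 0) / max(1, sum(freq_P.values()))   # Eq. (2)
    p_wQ = freq_Q.get(w, 0) / max(1, sum(freq_Q.values()))   # Eq. (3)
    p_wM = 0.5 * (p_wP + p_wQ)                               # Eq. (4)
    def part(p):
        return p * math.log(p / p_wM) if p > 0 else 0.0
    return 0.5 * (part(p_wP) + part(p_wQ))                   # Eq. (1)

def score_sentence(words, freq_P, freq_Q):
    # Relevance score of a candidate sentence: sum of the JSD weights of its words.
    return sum(jsd_weight(w, freq_P, freq_Q) for w in words)

# Toy usage: P = documents about "decision tree", Q = the whole corpus.
freq_P = Counter("decision tree split node leaf decision tree".split())
freq_Q = Counter("decision tree split node leaf cluster centroid distance page".split())
candidates = [["tree", "split", "node"], ["page", "distance"]]
ranked = sorted(candidates, key=lambda s: score_sentence(s, freq_P, freq_Q), reverse=True)
print(ranked[0])   # the sentence containing more query-discriminative words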
We invited 8 testers to participate in the experiment for evaluating the quality of the
returned answer sentences. All the testers are graduate students majoring in computer
science, and they are familiar with the field of data mining. For each test term, the
sets of the top 25 answer sentences returned by the 5 different methods are grouped
together. An interface was developed to collect the satisfaction level assigned by each
tester to each returned answer, where the meanings of the 5 satisfaction levels are
defined as follows.
Level 5: the answer sentence explains the definition of the test term clearly.
Level 4: the answer sentence introduces concepts related to the test term.
Level 3: the answer sentence mentions the test term and other related terms, but the
content is not very helpful for understanding the test term.
Level 2: the content of the answer sentence is not related to the test term.
Level 1: the semantics represented in the answer sentence is not clear.
DCG_n denotes the Discounted Cumulative Gain accumulated at a particular rank
position n, which is defined by the following equation:

DCG_n = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i}

The rel_i in the equation denotes the actual relevance score of the sentence at rank i.
We estimated the actual relevance score of a sentence by averaging the satisfaction
scores given by the 8 testers. In addition, IDCG_n is the ideal DCG at position n,
which is obtained when the sentences are sorted in descending order of their
actual relevance scores, so that NDCG_n = DCG_n / IDCG_n. In the experiment, the top 25
answer sentences returned by each method are used to compute NDCG_25 for evaluating
the quality of the returned sentences. The results of the experiment are shown in Tab. 1.
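For reference, a small sketch of the NDCG computation used in this evaluation is shown below; it follows the DCG formulation given above, and the averaged tester scores are passed in as a list ordered by the rank assigned by the method under evaluation.

import math

def dcg(relevances, n):
    # DCG at rank n: rel_1 + sum over i >= 2 of rel_i / log2(i).
    rels = relevances[:n]
    return sum(r if i == 1 else r / math.log2(i) for i, r in enumerate(rels, start=1))

def ndcg(relevances, n):
    # Normalized DCG: the DCG of the returned ranking divided by the ideal DCG.
    idcg = dcg(sorted(relevances, reverse=True), n)
    return dcg(relevances, n) / idcg if idcg > 0 else 0.0

# Averaged satisfaction scores of the returned sentences, in rank order.
scores = [4.5, 3.0, 5.0, 2.0, 4.0]
print(round(ndcg(scores, 5), 3))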
As shown in Tab. 1, the answers retrieved by our proposed method are better
than those retrieved by either the Lucene system or the pattern-
based approach. The main reason is that a sentence usually consists of dozens of
words at most. In the Lucene system, a TF-IDF based formula is used to measure the
relevance scores of a sentence to the query. Accordingly, only the sentences which
contain the query terms are considered. As shown in Fig. 2, the short sentences which
contain all the query terms are ranked high by the Lucene system but do not carry
important information nuggets. On the other hand, most of the sentences retrieved by
the JSD term weighting method proposed in this paper describe important concepts
related to the query terminology and are not limited to specific patterns. In
addition, it is not required that the answer sentences contain the query term.
References
1. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding High-Quality
Content in Social Media. In: Proc. the International Conference on Web Search and Data
Mining, WSDM (2008)
2. Carmel, D., Roitman, H., Zwerdling, N.: Enhancing Cluster Labeling Using Wikipedia. In:
Proc. the 32nd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR (2009)
3. Chen, Y., Zhou, M., Wang, S.: Reranking Answers for Definitional QA Using Language
Modeling. In: Proc. the 21st International Conference on Computational Linguistics and
44th Annual Meeting of the ACL (2006)
4. Chi, H., Kan, M.-Y., Chua, T.-S.: Generic Soft Pattern Models for Definitional Question
Answering. In: Proc. the 28th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR (2005)
5. Cong, G., Wang, L., Lin, C.Y., Song, Y.I., Sun, Y.: Finding Question-Answer Pairs from
Online Forums. In: Proc. the 31st international ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR (2008)
6. Denicia-Carral, C., Montes-y-Gómez, M., Villaseñor-Pineda, L., Hernández, R.G.: A Text
Mining Approach for Definition Question Answering. In: Proc. the 5th International
Conference on Natural Language Processing, FinTAL 2006 (2006)
7. Han, K.S., Song, Y.I., Rim, H.C.: Probabilistic Model for Definitional Question
Answering. In: Proc. the 29th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR (2006)
8. Hong, L., Davison, B.D.: A Classification-based Approach to Question Answering in
Discussion Boards. In: Proc. the 32nd International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR (2009)
9. Ko, J., Nyberg, E., Si, L.: A Probabilistic Graphical Model for Joint Answer Ranking in
Question Answering. In: Proc. the 30th International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR (2007)
10. Kor, K.W., Chua, T.S.: Interesting Nuggets and Their Impact on Definitional Question
Answering. In: Proc. the 30th International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR (2007)
11. Momtazi, S., Klakow, D.: A Word Clustering Approach for Language Model-based
Sentence Retrieval in Question Answering Systems. In: Proc. the 18th ACM International
Conference on Information and Knowledge Management, CIKM (2009)
12. Sun, R., Jiang, J., Tan, Y.F., Cui, H., Chua, T.-S., Kan, M.-Y.: Using Syntactic and
Semantic Relation Analysis in Question Answering. In: Proc. the 14th Text REtrieval
Conference, TREC (2005)
13. Voorhees, E.M.: Overview of the TREC 2001 Question Answering Track. In: Proc. the
10th Text REtrieval Conference, TREC (2001)
14. Voorhees, E.M.: Overview of the TREC 2003 Question Answering Track. In: Proc. the
12th Text REtrieval Conference, TREC (2003)
15. Xue, X., Jeon, J., Croft, W.B.: Retrieval Models for Question and Answer Archives. In:
Proc. the 31st International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR (2008)
Factoring Web Tables
Keywords: table analysis, table headers, header paths, table syntax, category
trees, relational tables, algebraic factorization.
1 Introduction
The objective of this work is to make sense of (analyze, interpret, transform) tables
the best we can without resorting to any external semantics: we merely manipulate
symbols. Remaining firmly on the near side of the semantic gap, we propose a
canonical table representation based on Header Paths that relate column/row headers
and data cells. We show that this “canonical” representation is adequate for
subsequent transformations into three other representations that are more suitable for
specific data/information retrieval tasks. The three targets for transformation are:
Visual Table (VT). The VT provides the conventional two-dimensional table that
humans are used to. Header and contents rows and columns may, however, be jointly
permuted in the VT generated from Header Paths, and it does not preserve style
attributes like typeface and italics.
Relational Table (RT). RT provides the link to standard relational databases and their
extensive apparatus for storing and retrieving data from a collection of tables.
Header Paths are a general approach to tables and capture the essential features of
two-dimensional indexing. A Well Formed Table (WFT) is a table that provides a
unique string of Header Paths to each data cell. If the Header Paths of two distinct
data cells are identical, then the table is ambiguous. Although the concept of Header
Paths seems natural, we have not found any previous work that defines them or uses
them explicitly for table analysis. Our contribution includes extracting Header Paths
from web tables, analysis of Header Paths using open-source mathematical software,
and a new way of transforming them into a standard relational representation.
We have surveyed earlier table processing methods (most directed at scanned
printed tables rather than HTML tables) in [2]. An excellent survey from a different
perspective was published by Zanibbi et al. [3]. More recent approaches that share our
objectives have been proposed in [4], [5], and [6]. Advanced approaches to document
image analysis that can be potentially applied to tables are reported in [7] and [8].
After providing some definitions, we describe a simple method of extracting
Header Paths from CSV versions of web tables from large statistical sites. Then we
show how the structure of the table is revealed by a decomposition that incorporates
algebraic factoring. The next section on analysis illustrates the application of Header
Paths to relational tables. Our experiments to date demonstrate the extraction of
header paths, the factorization, and the resulting Wang categories. Examination of the
results suggests modifications of our algorithms that will most increase the fraction of
automatically analyzed tables, and enhancements of the functionality of the
interactive interface necessary to correct residual errors.
2 Header Paths
We extract Header Paths from CSV versions of web tables imported into Excel,
which is considerably simpler than extracting them directly from HTML [9]. The
transformation into the standard comma-separated-variables format for information
exchange is not, however, entirely lossless. Color, typeface, type size, type style
(bold, italics), and layout (e.g. indentations not specified explicitly in an HTML style
sheet) within the cells are lost. Merged table cells are divided into elementary cells
and the cell content appears only in the first element of the row. Anomalies include
demoted superscripts that alter cell content. Nevertheless, enough of the structure and
content of the web tables is usually preserved to allow a potentially complete understanding
of the table. Some sites provide both HTML and CSV versions of the same table.
An example table is shown in Fig. 1a. The (row-) stub contains Year and Term, the
row header is below the stub, the column header is to the right of the stub, and the 36
delta (or data, or content) cells are below the column header. Our Python routines
convert the CSV file (e.g. Figs. 1b and 1c) to Header Paths as follows:
1. Identify the stub header, column header, row header, and content regions.
2. Eliminate blank rows and almost blank rows (that often designate units).
3. Copy into blank cells the contents of the cell above (the reverse for rows).
4. Copy into blank cells the contents of the cell to the left.
5. Underscore blanks within cells and add quote marks to cell contents.
6. Mark row header roots in the stub with negative column coordinates.
7. Add cell identifiers (coordinates) to each cell.
8. Trace the column and row Header Paths.
(a) The running example table:

                         Mark
             Assignments           Examinations     Grade
Year  Term   Ass1  Ass2  Ass3    Midterm  Final
1991  Winter  85    80    75       60       75       75
      Spring  80    65    75       60       70       70
      Fall    80    85    75       55       80       75
1992  Winter  85    80    70       70       75       75
      Spring  80    80    70       70       75       75
      Fall    75    70    65       60       80       70

[(b) and (c), the CSV version of the table and the raw CSV file string with EOL markers, are not reproduced here.]

Fig. 1. (a) A table from [1] used as a running example. (b) Its CSV version. (c) CSV file string.
Step 1 can partition a table into its four constituent regions only if either the stub
header is empty, or the content cells contain only digits. Otherwise the user must click
on the top-left (critical) delta cell. Fig. 2 shows the paths for the example table.
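As an illustration of steps 3, 4, and 8 for the column header region, the sketch below fills the blanks produced by splitting spanning cells and then reads off one product term per data column; it is a simplified stand-in for our actual routines (cell coordinates and quoting are omitted), and the hard-coded header rows correspond to the running example.

def fill_down(rows):
    # Step 3: a blank cell inherits the contents of the cell above it.
    for c in range(len(rows[0])):
        for r in range(1, len(rows)):
            if not rows[r][c].strip():
                rows[r][c] = rows[r - 1][c]
    return rows

def fill_right(rows):
    # Step 4: a blank cell inherits the contents of the cell to its left.
    for row in rows:
        for c in range(1, len(row)):
            if not row[c].strip():
                row[c] = row[c - 1]
    return rows

def column_header_paths(header_rows):
    # Step 8 (columns): one product term per data column, i.e. the top-to-bottom
    # conjunction of the header labels above that column.
    rows = fill_right(fill_down([list(r) for r in header_rows]))
    terms = []
    for c in range(len(rows[0])):
        labels = [rows[r][c] for r in range(len(rows))]
        terms.append('(' + '*'.join('"%s"' % lab for lab in labels) + ')')
    return ' + '.join(terms)

# Column header region of the running example (stub columns removed).
header = [['Mark', '', '', '', '', ''],
          ['Assignments', '', '', 'Examinations', '', 'Grade'],
          ['Ass1', 'Ass2', 'Ass3', 'Midterm', 'Final', '']]
print(column_header_paths(header))
# ("Mark"*"Assignments"*"Ass1") + ... + ("Mark"*"Grade"*"Grade")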
The paths to the body of the table (delta cells) are traced in the same way. The
combination of Header Paths uniquely identifies each delta cell. The three Wang
Categories for this table are shown in Fig. 3. The algorithm yielded paths for 89 tables
in a random sample of 107 tables from our collection of 1000 web tables. Most of the
rejected tables had non-empty stub headers, blank separators, or decimal commas.
colpaths =
(("<0,2>Mark"*"<1,2>Assignments"*"<2,2>Ass1")
+("<0,3>Mark"*"<1,3>Assignments"*"<2,3>Ass2")
+("<0,4>Mark"*"<1,4>Assignments"*"<2,4>Ass3")
+("<0,5>Mark"*"<1,5>Examinations"*"<2,5>Midterm")
+("<0,6>Mark"*"<1,6>Examinations"*"<2,6>Final")
+("<0,7>Mark"*"<1,7>Grade"*"<2,7>Grade"));
rowpaths =
(("<-2,3>Year"*"<-1,3>1991"*"<0,3>Term"*"<1,3>Winter")
+("<-2,4>Year"*"<-1,4>1991"*"<0,4>Term"*"<1,4>Spring")
+("<-2,5>Year"*"<-1,5>1991"*"<0,5>Term"*"<1,5>Fall")
+("<-2,6>Year"*"<-1,6>1992"*"<0,6>Term"*"<1,6>Winter")
+("<-2,7>Year"*"<-1,7>1992"*"<0,7>Term"*"<1,7>Spring")
+("<-2,8>Year"*"<-1,8>1992"*"<0,8>Term"*"<1,8>Fall"));
Fig. 2. Column Header and Row Header Paths for the table of Fig. 1
Year            Mark
  1991            Assignments
  1992              Ass1
                    Ass2
Term                Ass3
  Winter          Examinations
  Spring            Midterm
  Fall              Final
                  Grade*Grade

[Fig. 3: the three Wang category trees (Year, Term, and Mark) for the table of Fig. 1.]
We define the factoring of the column header as the decomposition of the column Header Paths expression into
an equivalent expression E satisfying the following constraints:
(a) E is an indexing relation on Γ;
(b) E is irredundant;
(c) E is minimal in the total number of label occurrences in the expression.
A benefit of this formulation is that the decomposition problem for algebraic
expressions has attracted attention for a long time from fields ranging from symbolic
mathematics [11,12,13] to logic synthesis [14,15] and programs incorporating the
proposed solutions are widely available [16,17]. In this work, we adopt Sis, a system
for sequential circuit synthesis, for decomposition of header-path expressions [16].
Sis represents logic functions to be optimized in the form of a network with nodes
representing the logic functions and directed edges (a,b) to denote the use of the
function at node a as a sub-function at node b. An important step in network
optimization is extracting, by means of a division operation, new nodes representing
logic functions that are factors of other nodes. Because all good Boolean division
algorithms are computationally expensive, Sis uses the ordinary algebraic division.
The basic idea is to look for expressions that occur many times in the
nodes of the network and to extract them.
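Our implementation delegates the factoring to Sis. Purely to illustrate the algebraic idea on header-path expressions, the sketch below recovers a nested factored form by recursively grouping the product terms on their leading label; this exploits the fact that header paths are already ordered from the top header row down, and it is not a substitute for Sis's algebraic division on general expressions.

def factor(paths):
    # Each path is a list of labels ordered from the top header row down.
    # Grouping on the leading label and recursing yields the category tree.
    groups, order = {}, []
    for path in paths:
        head, tail = path[0], path[1:]
        if head not in groups:
            groups[head] = []
            order.append(head)
        groups[head].append(tail)
    parts = []
    for head in order:
        suffixes = [t for t in groups[head] if t]
        parts.append('%s*[%s]' % (head, factor(suffixes)) if suffixes else head)
    return ' + '.join(parts)

colpaths = [['Mark', 'Assignments', 'Ass1'], ['Mark', 'Assignments', 'Ass2'],
            ['Mark', 'Assignments', 'Ass3'], ['Mark', 'Examinations', 'Midterm'],
            ['Mark', 'Examinations', 'Final'], ['Mark', 'Grade']]
print(factor(colpaths))
# Mark*[Assignments*[Ass1 + Ass2 + Ass3] + Examinations*[Midterm + Final] + Grade]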
Some caution is in order for this formulation because seemingly identical labels in
two cells may carry different meaning. In our experience with many web tables,
however, identical labels in the same row of a column header (or the same column of
a row header) do not need to be differentiated for our Boolean-algebraic formulation.
This is because either they carry the same meaning (e.g. repeated labels like Winter in
the example table) or the differences in their meaning are preserved by other labels in
their product terms. To illustrate the last point, assume ITEMS*TOTAL and
AMOUNT*TOTAL are two product terms in a Header Paths expression, where the two
occurrences of TOTAL refer to a quantity and a value, respectively. In the factored
form, these two terms might appear as TOTAL*(ITEMS+AMOUNT), where the cell
with label TOTAL now spans cells labeled ITEMS and AMOUNT, thus preserving the
two terms in the Header Paths expression. On the other hand, suppose the column
Header Paths expression included TOTAL*MALES + TOTAL*FEMALES +
TOTAL*TOTAL as a subexpression, where the last term spans the two rows
corresponding to the first two terms. In Boolean algebra, the sub-expression could be
simplified to TOTAL and the resulting expression would no longer cover the two
columns covered by the first two terms. With the row indices attached to labels, the
subexpression becomes: TOTAL1*MALES + TOTAL1*FEMALES + TOTAL1*TOTAL2,
which cannot be simplified so as to eliminate one or more terms.
We illustrate the decomposition obtained by Sis for the column Header Paths of the
example table:
Input Order = Mark Assignments Ass1 Ass2 Ass3 Examinations Midterm
Final Grade
colpaths = Mark*[Examinations*[Final + Midterm]
+ Assignments* [Ass3 + Ass2 + Ass1] + Grade]
Note that Sis does not preserve the input order in the output equations, but because
the order is listed in the output, we can permute the terms of the output expression
according to the left-to-right, top-to-bottom order of the labels occurring in the table:
colpaths=Mark*[Assignments*[Ass1+Ass2+Ass3] +
[Examinations*[Midterm + Final]] + Grade]
Similarly, the rowpaths expression produces the following output:
rowpaths = Year*[ 1991+1992]*Term*[Winter+Spring+Fall]
We apply a set of Python regular expression functions to split each equation into a
list of product terms and convert each product term into the form of [root, children].
A virtual header is inserted whenever a label for the category root is missing in the
table, e.g. if the right-hand side of colpaths were missing “Mark*”, we would insert a
virtual header for the missing root. Then, the overall equation of the table can be
printed out recursively in the form of Wang category trees, as illustrated in Fig. 3.
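The fragment below is a simplified sketch of that conversion: a tiny recursive-descent parser (in place of our regular-expression functions) turns a factored expression such as the colpaths output above into nested [root, children] lists, from which the category tree can be printed recursively; virtual-header insertion is omitted.

import re

def parse(expr):
    # Parse a single-root factored expression like 'Mark*[A*[x+y]+B]' into nested
    # [root, children] lists; a bare label parses to itself. (Rowpaths with several
    # factors per term would need a small extension.)
    tokens = re.findall(r'[A-Za-z0-9_]+|[\[\]+*]', expr)
    pos = 0
    def term():
        nonlocal pos
        label = tokens[pos]
        pos += 1
        if pos < len(tokens) and tokens[pos] == '*':
            pos += 2                       # consume '*' and the opening '['
            children = [term()]
            while tokens[pos] == '+':
                pos += 1
                children.append(term())
            pos += 1                       # consume the closing ']'
            return [label, children]
        return label
    return term()

def print_tree(node, indent=0):
    # Print a parsed category tree with indentation (cf. Fig. 3).
    if isinstance(node, str):
        print(' ' * indent + node)
    else:
        print(' ' * indent + node[0])
        for child in node[1]:
            print_tree(child, indent + 2)

print_tree(parse('Mark*[Assignments*[Ass1+Ass2+Ass3]+Examinations*[Midterm+Final]+Grade]'))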
Further, we produce a canonical expression for the three categories in the form shown in
Fig. 4, which is used in the generation of relational tables:

Mark*(Assignments*(Ass1+Ass2+Ass3)+Examinations*(Midterm+Final)+Grade)
+ Year*(1991+1992)
+ Term*(Winter+Spring+Fall)

Fig. 4. Canonical expression for the table of Fig. 1
Currently, we don’t have an automated way of verifying the visual table (VT) in
CSV form against the category tree (CT) form obtained by factorization. Instead, we
do the verification by comparing the following common characteristic features of the
two forms: the number of row and column categories, and for each category: fanout at
the root level, the total fanout, and whether the root category has a real or virtual
label. The data for CT are unambiguous and readily derived. For VT, however, we
derive the data by labor-intensive and error-prone visual inspection, which limits the size
and the objectivity of the experiment. Still, we believe the result demonstrates the
current status of the proposed scheme for automated conversion of web tables.
Of 89 tables for which Header Paths were generated, 66 (74%) have correct row
and column categories. Among the remaining 23 tables, 18 have either row or column
category correct, including two where the original table contained mistakes (duplicate
row entries). Both human and computer found it especially difficult to detect the
subtle visual cues (indentation or change in type style) that indicate row categories.
Since in principle the decomposition is error-free given the correct Header Paths, we
are striving to improve the rowpath extraction routines.
The factored table can be transformed into a relational table for a relational database. We can then query the table with SQL or
any other standard database query language. Given a collection of tables
transformed into relations, we may also be able to join tables in the collection and
otherwise manipulate the tables as we do in a standard relational database.
Our transformation of a factored table assumes that one of the Wang categories
provides the attributes for the relational table while the remaining categories provide
key values for objects represented in the original table. Without input from an
oracle, we do not know which of the Wang categories would serve best for the
attributes. We therefore transform a table with n Wang category trees into n
complementary relational tables—one for each choice of a category to serve as the
attributes.
The transformation from a factored table with its delta cells associated with one of
its categories is straightforward. We illustrate with the example from Wang [1] in
Fig. 1a. This table has three categories and thus three associated relational tables
(attributes can be permuted). We obtain the relational table in Fig. 5 using the
factored expression in Fig. 4 and the data values from the original table in Fig. 1.
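A sketch of this construction is given below: one category (here Mark) supplies the attribute names, the remaining categories (Year and Term) supply key columns, and each delta cell becomes one attribute value. The category paths and delta values of the running example are hard-coded here; our implementation derives them from the factored expressions and the Header Paths of the delta cells.

def build_relation(attr_paths, key_paths, delta):
    # One attribute per path of the chosen category, one row per key combination,
    # values taken from the delta cells addressed by the two paths.
    attributes = ['_'.join(p) for p in attr_paths]
    rows = []
    for r, keys in enumerate(key_paths):
        row = dict(keys)                       # key columns, e.g. Year and Term
        for c, attr in enumerate(attributes):
            row[attr] = delta[r][c]
        rows.append(row)
    return rows

attr_paths = [('Mark', 'Assignments', 'Ass1'), ('Mark', 'Assignments', 'Ass2'),
              ('Mark', 'Assignments', 'Ass3'), ('Mark', 'Examinations', 'Midterm'),
              ('Mark', 'Examinations', 'Final'), ('Mark', 'Grade')]
key_paths = [{'Year': 1991, 'Term': 'Winter'}, {'Year': 1991, 'Term': 'Spring'},
             {'Year': 1991, 'Term': 'Fall'},   {'Year': 1992, 'Term': 'Winter'},
             {'Year': 1992, 'Term': 'Spring'}, {'Year': 1992, 'Term': 'Fall'}]
delta = [[85, 80, 75, 60, 75, 75], [80, 65, 75, 60, 70, 70], [80, 85, 75, 55, 80, 75],
         [85, 80, 70, 70, 75, 75], [80, 80, 70, 70, 75, 75], [75, 70, 65, 60, 80, 70]]
relation = build_relation(attr_paths, key_paths, delta)
print(relation[2]['Mark_Grade'])   # the grade for Fall 1991, i.e. 75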
Given the relational table in Fig. 5, we can now pose queries with SQL:
Query 1: What is the average grade for Fall, 1991?
select K_G
from R
where Y = 91 and T = “F”
With abbreviated names spelled out, the query takes on greater meaning:
select Mark_Grade
from The_average_marks_for_1991_to_1992
where Year = 1991 and Term = “Fall”
The answer for Query 1 is:
Mark_Grade
75
Query 2:
What is the overall average of the final for each year?
select Year, avg(Mark_Examinations_Final)
from The_average_marks_for_1991_to_1992
group by Year
The answer for Query 2 is:
Year Avg(Mark_Examinations_Final)
1991 75.0
1992 76.7
Fig. 6. (a) A second relational table for the table of Fig. 1. (b) A third relational table
5 Discussion
We propose a natural approach for table analysis based on the indexing of data cells
by the column and row header hierarchies. Each data cell is defined by paths through
every header cell that spans that data cell. Automated extraction of paths from CSV
versions of web tables must first locate the column and row headers via identification
of an empty stub or by finding the boundaries of homogeneous rows and columns of
content cells. It must also compensate for the splitting of spanning cells in both
directions, and for spanning unit cells. We define a relational algebra in which the
collection of row or column Header Paths is represented by a sum-of-products
expression, and we show that the hierarchical structure of the row or column
categories can be recovered by a decomposition process that can be carried out using
widely available symbolic mathematical tools.
We demonstrate initial results on 107 randomly selected web tables from our
collection of 1000 web tables from large sites. The experiments show that in 83% of
our sample the header regions can be found using empty stubs or numerical delta
cells. We estimate that the header regions can be found in at least a further 10%-15%
with only modest improvements in our algorithms. The remainder will still require
clicking on one or two cells on an interactive display of the table.
Most of the extracted column Header Paths are correct, but nearly 25% of the row
Header Paths contain some mistake (not all fatal). The next step is more thorough
analysis of indentations and of header roots in the stub. Interactive path correction is
far more time consuming than interactive table segmentation. An important task is to
develop adequate confidence measures to avoid having to inspect every table.
References
1. Wang, X.: Tabular Abstraction, Editing, and Formatting, Ph.D Dissertation, University of
Waterloo, Waterloo, ON, Canada (1996)
2. Embley, D.W., Hurst, M., Lopresti, D., Nagy, G.: Table Processing Paradigms: A
Research Survey. Int. J. Doc. Anal. Recognit. 8(2-3), 66–86 (2006)
3. Zanibbi, R., Blostein, D., Cordy, J.R.: A survey of table recognition: Models, observations,
transformations, and inferences. International Journal of Document Analysis and
Recognition 7(1), 1–16 (2004)
4. Krüpl, B., Herzog, M., Gatterbauer, W.: Using visual cues for extraction of tabular data
from arbitrary HTML documents. In: Proceedings. of the 14th Int’l Conf. on World Wide
Web, pp. 1000–1001 (2005)
5. Pivk, A., Ciamiano, P., Sure, Y., Gams, M., Rahkovic, V., Studer, R.: Transforming
arbitrary tables into logical form with TARTAR. Data and Knowledge Engineering 60(3),
567–595 (2007)
6. Silva, E.C., Jorge, A.M., Torgo, L.: Design of an end-to-end method to extract information
from tables. Int. J. Doc. Anal. Recognit. 8(2), 144–171 (2006)
7. Esposito, F., Ferilli, S., Di Mauro, N., Basile, T.M.A.: Incremental Learning of First Order
Logic Theories for the Automatic Annotations of Web Documents. In: Proceedings of the
9th International Conference on Document Analysis and Recognition (ICDAR-2007),
Curitiba, Brazil, September 23-26, pp. 1093–1097. IEEE Computer Society, Los Alamitos
(2007); ISBN 0-7695-2822-8, ISSN 1520-5363
8. Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N.: Machine Learning for Digital
Document Processing: From Layout Analysis To Metadata Extraction. In: Marinai, S.,
Fujisawa, H. (eds.) Machine Learning in Document Analysis and Recognition. SCI,
vol. 90, pp. 79–112. Springer, Berlin (2008); ISBN 978-3-540-76279-9
9. Jandhyala, R.C., Krishnamoorthy, M., Nagy, G., Padmanabhan, R., Seth, S., Silversmith,
W.: From Tessellations to Table Interpretation. In: Carette, J., Dixon, L., Coen, C.S., Watt,
S.M. (eds.) MKM 2009, Held as Part of CICM 2009. LNCS, vol. 5625, pp. 422–437.
Springer, Heidelberg (2009)
10. http://www.mathworks.com/help/toolbox/symbolic/horner.html
11. Fateman, R. J.: Essays in Symbolic Simplification. MIT-LCS-TR-095, 4-1-1972,
http://publications.csail.mit.edu/lcs/pubs/pdf/
MIT-LCS-TR-095.pdf (downloaded November 10, 2010)
12. Knuth, D.E.: Factorization of Polynomials (Section 4.6.2). In: Seminumerical Algorithms. The Art
of Computer Programming, 2nd edn., pp. 439–461, 678–691. Addison-Wesley, Reading
(1997)
13. Kaltofen, E.: Polynomial factorization: a success story. In: ISSAC 2003 Proc. 2003
Internat. Symp. Symbolic Algebraic Comput. [-12], pp. 3–4 (2003)
14. Brayton, R.K., McMullen, C.: The Decomposition and Factorization of Boolean
Expressions. In: Proceedings of the International Symposium on Circuits and Systems, pp.
49–54 (May 1982)
15. Vasudevamurthy, J., Rajski, J.: A Method for Concurrent Decomposition and Factorization
of Boolean Expressions. In: Proceedings of the International Conference on Computer-
Aided Design, pp. 510–513 (November 1990)
16. Sentovich, E.M., Singh, K.J., Lavagno, L., Moon, C., Murgai, R., Saldanha, A., Savoj, H.,
Stephan, P.R., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: SIS: A System for Sequential
Circuit Synthesis. In: Memorandum No. UCB/ERL M92/41, Electronics Research
Laboratory, University of California, Berkeley (May 1992),
http://www.eecs.berkeley.edu/Pubs/TechRpts/
1992/ERL-92-41.pdf (downloaded November 4, 2010)
17. (Quickmath-ref),
http://www.quickmath.com/webMathematica3/quickmath/
page.jsp?s1=algebra&s2=factor&s3=advanced (last accessed November 12,
2010)
Document Analysis Research in the Year 2021
The field of document analysis research has had a long, rich history. Still, de-
spite decades of advancement in computer software and hardware, not much has
changed in how we conduct our experimental science, as emphasized in George
Nagy’s superb keynote retrospective at the DAS 2010 workshop [11].
In this paper, we present a vision for the future of experimental document
analysis research. Here the availability of “cloud” resources consisting of data,
algorithms, interpretations and full provenance, provides the foundation for a
research paradigm that builds on collective intelligence (both machine and hu-
man) to instill new practices in a range of research areas. The reader should
be aware that this paradigm is applicable to a much broader scope of machine
perception and pattern recognition – we use document analysis as the topic area
to illustrate the discussion as this is where our main research interests lie, and
where we can legitimately back our claims. Currently under development, the
platform we are building exploits important trends we see arising in a number
of key areas, including the World Wide Web, database systems, and social and
collaborative media.
The first part of this paper presents our view of this future as a fictional,
yet realizable, “story” outlining what we believe to be a compelling view of
community-created and managed resources that will fundamentally change the
way we do research. In the second part of the paper, we then turn to a more
technical discussion of the status of our current platform and developments in
this direction.
(This work is a collaborative effort hosted by the Computer Science and Engineering
Department at Lehigh University and funded by a Congressional appropriation
administered through DARPA IPTO via Raytheon BBN Technologies.)
Sometime in the year 2021, Jane, a young researcher just getting started in
the field, decides to turn her attention to a specific task in document analysis:
given a page image, identify regions that contain handwritten notations.1 Her
intention is to develop a fully general method that should be able to take any
page as input, although there is the implicit assumption that the handwriting,
if present, covers only a relatively small portion of the page and the majority of
the content is pre-printed text or graphics.
Note that in this version of the future, there is no such thing as “ground truth”
– that term is no longer used. Rather, we talk about the intent of the author,
the product of the author (e.g., the physical page) [4], and the interpretation
arrived at by a reader of the document (human or algorithm). There are no
“right” or “wrong” answers – interpretations may naturally differ – although for
some applications, we expect that users who are fluent in the language and the
domain of the document will agree nearly all of the time.
The goal of document analysis researchers is to develop new methods that
mimic, as much as possible, what a careful human expert would do when con-
fronted with the same input, or at least to come closer than existing algorithms.
Of course, some people are more “careful” or more “expert” than others when
performing certain tasks. The notion of reputation, originally conceived in the
early days of social networking, figures prominently in determining whose inter-
pretations Jane will choose to use as the target when developing her new method.
Members of the international research community – as well as algorithms – have
always had informal reputations, even in the early days of the field. What is
different today, in 2021, is that reputation has been formalized and is directly
associated with interpretations that we use to judge the effectiveness of our algo-
rithms, so that systems to support experimental research can take advantage of
this valuable information. Users, algorithms, and even individual data items all
have reputations that are automatically managed and updated by the system.
After Jane has determined the nature of the task, she turns to a well-known
resource – a web server we shall call DARE (for “Document Analysis Research
Engine”) – to request a set of sample documents which she will use in developing
and refining her algorithm. This server, which lives in the “cloud” and is not a
single machine, has become the de facto standard in the field, just as certain
1 Experience has taught us that we tend to be overly optimistic when we assume
problems like this will be completely solved in the near future and we will have
moved on to harder questions. Since we need a starting point for our story, we ask
the reader to suspend skepticism on what is likely a minor quibble. Jane’s problem
can be replaced with any one that serves the purpose.
datasets were once considered a standard in the past. There are significant dif-
ferences, however. In the early days, datasets were simply collections of page
images along with a single interpretation for each item (which was called the
“ground-truth” back then). In 2021, the DARE server supports a fundamentally
different paradigm for doing experimental science in document image analysis.
Jane queries the server to give her 1,000 random documents from the various
collections it knows about. Through the query interface, she specifies that:
– Pages should be presented as a 300 dpi bitonal TIF image.
– Pages should be predominantly printed material: text, line art, photographs,
etc. This implies that the page regions have been classified somehow: perhaps
by a human interpreter, or by another algorithm, or some combination. Jane
indicates that she wants the classifications to have come from only the most
“trustworthy” sources, as determined through publication record, citations
to past work, contributions to the DARE server, etc.
– A reasonable number of pages in the set should contain at least one hand-
written annotation. Jane requests that the server provide a set of documents
falling within a given range.
– The handwritten annotations on the pages should be delimited in a way that
is consistent with the intended output from Jane’s algorithm. Jane is allowed
to specify the requirements she would like the interpretations to satisfy, as
well as the level of trustworthiness required of their source.
By now, the status of the DARE web server as the de facto standard for the
community has caused most researchers to use compatible file formats for record-
ing interpretations. Although there is no requirement to do so, it is just easier
this way since so much data is now delivered to users from the server and no
longer lives locally on their own machines. Rather than fight the system, people
cooperate without having to be coerced.
The DARE server not only returns the set of 1,000 random pages along with
their associated interpretations, it also makes a permanent record of her query
and provides a URL that will return exactly the same set of documents each time it
is run. Any user who has possession of the URL can see the parameter settings
Jane used. The server logs all accesses to its collections so that members of the
research community can see the history for every page delivered by the server.
In the early days of document image analysis research, one of the major hur-
dles in creating and distributing datasets was copyright concerns. In 2021,
however, the quantity of image-based data available on the web is astounding.
Digital libraries, both commercially motivated and non-profit, present billions
of pages that have already been scanned and placed online. While simple OCR
results and manual transcriptions allow for efficient keyword-based searching,
opportunities remain for a vast range of more sophisticated analysis and re-
trieval techniques. Hence, contributing a new dataset to the DARE server is not
a matter of scanning the pages and confronting the copyright issues one’s self
but, rather, the vast majority of new datasets are references (links) to collections
of page images that already exist somewhere online. Access, whether free or
through subscription services, is handled through well-developed mechanisms
(including user authentication, if it is needed) that are part of the much bigger
web environment.
With dataset in hand, Jane proceeds to work on her new algorithm for detect-
ing handwritten annotations. This part of the process is no different from the
way researchers worked in the past. Jane may examine the pages in the dataset
she was given by the DARE server. She uses some pages as a training set and
others as her own “test” set, although this is just for development purposes and
never for publication (since, of course, she cannot prove that the design of her
algorithm was not biased by knowing what was contained in this set).
While working with the data, Jane notices a few problems. One of the page
images was delivered to her upside down (rotated by 180 degrees). These sorts
of errors, while rare, arise from time to time given the enormous size of the
collections on the DARE server. In another case, the TIF file for a page was un-
readable, at least by the version of the library Jane is using. Being a responsible
member of the research community (and wanting her online reputation to reflect
this), Jane logs onto the DARE server and, with a few mouse clicks, reports both
problems – it just takes a minute. Everyone in the community works together
to build and maintain the collections delivered via the web server. Jane’s bug
reports will be checked by other members of the community (whose reputations
will likewise rise) and the problem images will be fixed in time.
In a few other cases, Jane disagrees with the interpretation that is provided
for the page in question. In her opinion, the bounding polygons are drawn im-
properly and, on one page, there is an annotation that has been missed. Rather
than just make changes locally to her own private copies of the annotation files
(as would have happened in the past), Jane records her own interpretations on
the DARE server and then refreshes her copies. No one has to agree with her,
of course – the previous versions are still present on the server. But by adding
her own interpretations, the entire collection is enriched. (At the same time
Jane is doing her own work, dozens of other researchers are using the system.)
The DARE server provides a wiki-like interface with text editing and graphical
markup tools that run in any web browser. Unlike a traditional wiki, however,
the different interpretations are maintained in parallel. The whole process is
quite easy and natural. Once again, Jane’s online reputation benefits when she
contributes annotations that other users agree with and find helpful.
After Jane is done fine-tuning her algorithm, she prepares to write a paper
for submission to a major conference. This will involve testing her claim that
her technique will work for arbitrary real-world pages, not just for the data she
has been using (and becoming very familiar with) for the past six months. She
has two options for performing random, unbiased testing of her method, both of
which turn back to the DARE server.2 These are:
Option 1: Jane can “wrap” her code in a web-service framework provided by
the DARE server. The code continues to run on Jane’s machine, with the
2 All top conferences and journals now require the sort of testing we describe here.
This is a decision reached through the consensus of the research community, not
dictated by some authority.
DARE server delivering a random page image that her algorithm has not
seen before, but that satisfies certain properties she has specified in advance.
Jane’s algorithm performs its computations and returns its results to the
DARE server within a few seconds. As the results are returned to the DARE
server, they are compared to existing interpretations for the page in question.
These could be human interpretations or the outputs from other algorithms
that have been run previously on the same page.
Option 2: If she wishes, Jane can choose to upload her code to the DARE
server, thereby contributing it to the community and raising her reputation.
In this case, the server will run her algorithm locally on a variety of previously
unseen test pages according to her specifications. It will also maintain her
code in the system and use it in future comparisons when other researchers
test their own new algorithms on the same task.
At the end of the evaluation, Jane is provided with:
– A set of summary results showing how well her algorithm matched human
performance on the task.
– Another set of summary results showing how well her algorithm fared in
comparison to other methods tested on the same pages.
– A certificate (i.e., a unique URL) that guarantees the integrity of the results
and which can be cited in the paper she is writing. Anyone who enters
the certificate into a web browser can see the summary results of Jane’s
experiment delivered directly from the (trusted) DARE web server, so there
can be no doubt what she reported in her paper is true and reproducible.
When Jane writes her paper, the automated analysis performed by the DARE
server allows her to quantify her algorithm's performance relative to that of a
human, as well as to techniques that were previously registered on the system. Of
course, given the specifics of our paradigm, performances can only be expressed in
terms of statistical agreement and, perhaps, reputation, but perhaps not in terms
of an absolute ranking of one algorithm with respect to another. Ranking and
classification of algorithms and the data they were evaluated on will necessarily
take more subtle and multi-valued forms. One may argue that having randomly
selected evaluation documents for certification can be considered only marginally
fair, since there is a factor of chance involved. While this is, in essence, true,
the randomly generated dataset remains available for reproduction (i.e.,
once generated, the certificate provides a link to the exact dataset used to certify
the results), so anyone arguing that the result was obtained on an unusually biased
selection can access the very same data and use it to evaluate other algorithms.
It was perhaps a bit ambitious of Jane to believe that her method would
handle all possible inputs and, in fact, she learns that her code crashes on two
of the test pages. The DARE server allows Jane to download these pages to see
what is wrong (it turns out that she failed to dimension a certain array to be big
enough). If Jane is requesting a certificate, the DARE server will guarantee that
her code never sees the same page twice. If she is not requesting a certificate,
then this restriction does not apply and the server will be happy to deliver the
same page as often as she wishes.
Unlike past researchers who had the ability to remove troublesome inputs from
their test sets in advance, the DARE server prohibits such behavior. As a result,
it is not uncommon for a paper’s authors to report, with refreshing honesty, that
their implementation of an algorithm matched the human interpretation 93% of
the time, failed to match the human 5% of the time, and did not complete (i.e.,
crashed) 2% of the time.
When other researchers read Jane’s paper, they can use the URL she has
published to retrieve exactly the same set of pages from the DARE server.3 If
they wish to perform an unbiased test of their own competing method, comparing
it directly to Jane’s – and receive a DARE certificate guaranteeing the integrity
of their results – they must abide by the same rules she did.
In this future world, there is broad agreement that the new paradigm intro-
duced (and enforced) by the DARE server has improved the quality of research.
Results are now verifiable and reproducible. Beginning researchers no longer
waste their time developing methods that are inferior to already-known tech-
niques (since the DARE server will immediately tell you if another algorithm
did a better job on the test set you were given). The natural (often innocent)
tendency to bias an algorithm based on knowing the details of the test set has
been eliminated. The overuse of relatively small “standard” collections that was
so prevalent in the early days of the field is now no longer a problem.
The DARE server is not foolproof, of course – it provides many features to
encourage and support good science, but it cannot completely eliminate the
possibility of a malicious individual abusing the system. However, due to its
community nature, all records are open and visible to every user of the sys-
tem, which increases the risk of being discovered to the degree that legitimate
researchers would never be willing to take that chance.
Looking back with appreciation at how this leap forward was accomplished,
Jane realizes that it was not the result of a particular research project or any
single individual. Rather, it was the collective effort and dedication of the entire
document analysis research community.
The scenario just presented raises a number of fundamental questions that must
be addressed before document analysis research can realize its benefits. In this
section we develop these questions and analyze the extent to which they already
have (partial or complete) answers in the current state-of-the-art, those which
are open but that can be answered with a reasonable amount of effort, and those
which will require significant attention by the community before they are solved.
We also refer to a proof-of-concept prototype platform for Document Anal-
ysis and Exploitation (DAE – not to be confused with DARE), accessible at
http://dae.cse.lehigh.edu, which is capable of storing data, meta-data and
3 This form of access-via-URL is not limited to randomly generated datasets. Legacy
datasets from the past are also available this way.
While the DAE platform is a promising first step toward the DARE paradigm,
it still falls short in addressing some of the key concepts of the scenario we
depicted in Section 2.
earlier, or at least provide tools for users to discover interpretations that are
likely to be compatible with their own context.
– Query Interfaces, and especially those capable of handling the complex ex-
pressions used in this paper, are still open research domains [1]. Their appli-
cation is also directly related to the previously mentioned semantic issues.
Semantic Web [5] and ontology-folksonomy combinations [7,3] are therefore
also probable actors in our scenario. To make the querying really semantic
in as automated a way as possible, and to correctly capture
the power of expressiveness as people contribute to the resource pool, the
DAE platform will need to integrate adequate knowledge representations.
This goes beyond the current storage of attributes and links between data.
Since individual research contexts and problems usually require specific rep-
resentations and concepts, contributions to the system will initially focus on
their own formats. However, as the need for new interpretations arises, users
will want to combine different representations of similar concepts to expand
their experiment base. To allow them to do that, formal representations and
semantic web tools will need to be developed. Although there seem to be
intuitively obvious inter-domain synergies between all the cited domains (e.g.,
data-mining query languages need application contexts and data to validate
their models, while our document analysis targeted platform needs query languages
to validate the scalability and generality of its underlying research),
only widespread adoption of the paradigm described in this paper will reveal
potentially intricate research interactions.
– Archiving and Temporal Consistency concern a fundamentally crucial part
of this research. Numerous problems arise relating to the comprehensive us-
age and storage of all the information mentioned in this paper. In short, and
given that our scenario is built on a distributed architecture, how shall avail-
ability of data be handled? Since our concept relies on complete traceability
and inter-dependence of data, annotations and algorithms, how can we guar-
antee long term conservation of all these resources when parts of them are
third-party provided? Simple replication and redundancy may rapidly run
into copyright, intellectual property, or trade secret issues. Even data that
was initially considered public domain may suddenly turn out to be owned
by someone and need to be removed. What about all derived annotations,
interpretations, and results? What about reliability and availability from a
purely operational point of view, if the global resource becomes so widely
used that it becomes of vital importance?
4 Conclusion
In this paper, we have presented a vision for the future of experimental research
in document analysis and described how our current DAE platform [8,9] can
exploit collective intelligence to instill new practices in the field. This forms a
significant first step toward a crowd-sourced document resource platform that
can contribute in many ways to more reproducible and sustainable machine
perception research. Some of its features, such as its ability to host complex
workflows, are currently being developed to support benchmarking contests.
We have no doubt that the paradigm we are proposing is largely feasible and
we strongly believe that the future of experimental document analysis research
will head in a new direction much like the one we are suggesting. When this
compelling vision comes to pass, it will be through the combined efforts and
ingenuity of the entire research community.
The DAE server prototype is open to community contributions. It is inher-
ently cloud-ready and has the potential to evolve to support the grand vision of
DARE. Because of this new paradigm’s significance to the international research
community, we encourage discussion, extensions and amendments through a con-
stantly evolving Wiki: http://dae.cse.lehigh.edu/WIKI. This Wiki also hosts
a constantly updated chart of DAE platform features realizing the broader goals
discussed in this paper.
References
1. Boulicaut, J.F., Masson, C.: Data mining query languages. In: Maimon, O., Rokach,
L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 655–664. Springer,
US (2010), doi:10.1007/978-0-387-09823-4_33
2. Clavelli, A., Karatzas, D., Lladós, J.: A framework for the assessment of text ex-
traction algorithms on complex colour images. In: DAS 2010: Proceedings of the 8th
IAPR International Workshop on Document Analysis Systems, pp. 19–26. ACM,
New York (2010)
3. Dotsika, F.: Uniting formal and informal descriptive power: Reconciling ontolo-
gies with folksonomies. International Journal of Information Management 29(5),
407–415 (2009)
4. Eco, U.: The limits of interpretation. Indiana University Press (1990)
5. Feigenbaum, L., Herman, I., Hongsermeier, T., Neumann, E., Stephens, S.: The
semantic web in action. Scientific American (December 2007)
6. Hu, J., Kashi, R., Lopresti, D., Nagy, G., Wilfong, G.: Why table ground-truthing
is hard. In: 6th International Conference on Document Analysis and Recognition,
pp. 129–133. IEEE Computer Society, Los Alamitos (2001)
7. Kim, H.L., Decker, S., Breslin, J.G.: Representing and sharing folksonomies with
semantics. Journal of Information Science 36(1), 57–72 (2010)
8. Lamiroy, B., Lopresti, D.: A platform for storing, visualizing, and interpreting col-
lections of noisy documents. In: Fourth Workshop on Analytics for Noisy Unstruc-
tured Text Data - AND 2010. ACM International Conference Proceeding Series,
IAPR. ACM, Toronto (2010)
9. Lamiroy, B., Lopresti, D., Korth, H., Heflin, J.: How carefully designed open resource
sharing can help and expand document analysis research. In: Agam, G., Viard-
Gaudin, C. (eds.) Document Recognition and Retrieval XVIII. SPIE Proceedings,
vol. 7874. SPIE, San Francisco (2011)
10. Lopresti, D., Nagy, G., Smith, E.B.: Document analysis issues in reading optical
scan ballots. In: DAS 2010: Proceedings of the 8th IAPR International Workshop
on Document Analysis Systems, pp. 105–112. ACM, New York (2010)
11. Nagy, G.: Document systems analysis: Testing, testing, testing. In: Doerman, D.,
Govindaraju, V., Lopresti, D., Natarajan, P. (eds.) DAS 2010, Proceedings of the
Ninth IAPR International Workshop on Document Analysis Systems, p. 1 (2010),
http://cubs.buffalo.edu/DAS2010/GN_testing_DAS_10.pdf
12. Raub, W., Weesie, J.: Reputation and efficiency in social interactions: An example
of network effects. American Journal of Sociology 96(3), 626–654 (1990)
13. Sabater, J., Sierra, C.: Review on computational trust and reputation models.
Artificial Intelligence Review 24(1), 33–60 (2005)
14. Smith, E.H.B.: An analysis of binarization ground truthing. In: DAS 2010: Pro-
ceedings of the 8th IAPR International Workshop on Document Analysis Systems,
pp. 27–34. ACM, New York (2010)
Markov Logic Networks
for Document Layout Correction
1 Introduction
The task of identifying the geometrical structure of a document is known
as Layout Analysis, and represents a wide area of research in document process-
ing, for which several solutions have been proposed in the literature. The quality
of the layout analysis outcome is crucial, because it determines and biases the
quality of the subsequent understanding steps. Unfortunately, the variety of document
styles and formats to be processed makes the layout analysis task a non-trivial
one, so that the automatically found structure often needs to be manually fixed
by domain experts.
The geometric layout analysis phase involves several processes, among which
is page decomposition. Several works concerning the page decomposition step are
present in the literature, exploiting different approaches and having different ob-
jectives. The basic operators of all these approaches are split and merge: they exploit
the features extracted from an elementary block to decide whether to split or
merge two or more of the identified basic blocks, in a top-down [8], bottom-up
[15] or hybrid [12] approach to the page decomposition step. Since all methods
split or merge blocks/components based on certain parameters, parameter esti-
mation is crucial in layout analysis. All these methods exploit parameters that
are able to model the split or merge operations in specific classes of the document
domain. Few adaptive methods, in the sense that split or merge operations are
performed using estimated parameter values, are present in the literature [2,10].
A step forward is represented by the exploitation of Machine Learning techniques
in order to automatically assess the parameters/rules able to perform the doc-
ument page decomposition, and hence the eventual correction of the performed
split/merge operations, without requiring an empirical evaluation on the specific
document domain at hand. In this regard, learning methods have been used to
separate textual areas from graphical areas [5] and to classify text regions as
headline, main text, etc. [3,9] or even to learn split/merge rules in order to carry
out the corresponding operations and/or correction [11,16].
However, a common limitation of the above methods is that they are all designed
to work on scanned documents, and in some cases on documents of a specific
typology, and therefore lack generality with respect to documents available online
in various digital formats. On the other hand, methods that work on natively
digital documents assume that the segmentation phase can be carried out by
simply matching the document against a standard template, again of a specified
format. In this work we propose the application
of a Statistical Relational Learning [7] (SRL) technique to infer a probabilistic
logical model recognising wrong document layouts from sample corrections per-
formed by expert users in order to automatically apply them to future incoming
documents. Corrections are codified in a first-order language and the learned
correction model is represented as a Markov Logic Network [14] (MLN). Exper-
iments in a real-world task confirmed the good performance of the solution.
2 Preliminaries
In this section we briefly describe DOC (Document Organization Composer) [4],
a tool for discovering a full layout hierarchy in digital documents based primarily
on layout information. The layout analysis process starts with a preprocessing
step performed by a module that takes as input a generic digital document and
extracts the set of its elementary layout components (basic-blocks), that will be
exploited to identify increasingly complex aggregations of basic components.
The first step in the document layout analysis concerns the identification of
rules to automatically shift from the basic digital document description to a
higher level one. Indeed, the basic-blocks often correspond just to fragments of
words (e.g., in PS/PDF documents), thus a preliminary aggregation based on
their overlapping or adjacency is needed in order to obtain blocks surrounding
whole words (word-blocks). Successively, a further aggregation of word-blocks
could be performed to identify text lines (line-blocks). As to the grouping of
blocks into lines, since techniques based on the mean distance between blocks
proved unable to correctly handle cases of multi-column documents, Machine
Learning approaches were applied in order to automatically infer rewriting rules
that could suggest how to set some parameters in order to group together rect-
angles (words) to obtain lines. To do this, a kernel-based method was exploited
to learn rewriting rules able to perform the bottom-up construction of the whole
document starting from the basic/word blocks up to the lines.
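As a rough illustration of the preliminary aggregation step described above (not the kernel-based rule learner actually used by DOC; the rectangle representation and the adjacency tolerance are assumptions), a greedy merge of overlapping or adjacent basic blocks into word blocks could be sketched in Python as follows:

# Greedy aggregation of basic blocks into word blocks (illustrative sketch only).
# Boxes are (x0, y0, x1, y1) tuples; `tol` is an assumed adjacency tolerance.

def boxes_touch(a, b, tol=1.0):
    """True if two boxes overlap or are horizontally adjacent within `tol`."""
    horizontally_close = a[0] <= b[2] + tol and b[0] <= a[2] + tol
    vertically_overlapping = a[1] <= b[3] and b[1] <= a[3]
    return horizontally_close and vertically_overlapping

def merge(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def aggregate_word_blocks(basic_blocks, tol=1.0):
    blocks = list(basic_blocks)
    changed = True
    while changed:
        changed = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if boxes_touch(blocks[i], blocks[j], tol):
                    blocks[i] = merge(blocks[i], blocks[j])
                    del blocks[j]
                    changed = True
                    break
            if changed:
                break
    return blocks

# Two fragments of the same word are merged; the distant block is kept apart.
print(aggregate_word_blocks([(0, 0, 10, 5), (10.5, 0, 20, 5), (40, 0, 50, 5)]))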
The next step towards the discovery of the high-level layout structure of a
document page consists in applying an improvement of the algorithm reported
in [1]. To this aim, DOC analyzes the whitespace and background structure of
each page in the document in terms of rectangular covers and identifies the white
rectangles that are present in the page by decreasing area, thus reducing to the
Maximal White Rectangle problem as follows: given a set of rectangular con-
tent blocks (obstacles) C = {r0 , . . . , rn }, all placed inside the page rectangular
contour rb , find a rectangle r contained in rb whose area is maximal and that
does not overlap any ri ∈ C. The algorithm exploits a priority queue of pairs
(r, O), where r is a rectangle and O is a set of obstacles overlapping r. The
priority of the pair in the queue corresponds to the area of its rectangle. Pairs
are iteratively extracted from the queue and if the set of obstacles corresponding
to its rectangle is empty, then it represents the maximum white rectangle still
to be discovered. Otherwise, one of its obstacles is chosen as a pivot and the
rectangle is consequently split into four regions (above, below, to the right and
to the left of the pivot). Each such region, along with the obstacles that fall in
it, represents a new pair to be inserted in the queue. The complement of the found
maximal white rectangles yields the document content blocks.
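To make the priority-queue procedure concrete, here is a minimal Python sketch of the enumeration just described (rectangle representation, pivot choice and result limit are simplifying assumptions; in the full procedure each retrieved white rectangle would typically also be added to the obstacle set before continuing).

import heapq

# Sketch of the maximal-white-rectangle enumeration via a priority queue
# (an illustration of the described scheme, not the DOC implementation).
# Rectangles are (x0, y0, x1, y1) tuples.

def area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def overlapping(r, obstacles):
    return [o for o in obstacles
            if o[0] < r[2] and r[0] < o[2] and o[1] < r[3] and r[1] < o[3]]

def maximal_white_rectangles(page, obstacles, max_results=10):
    heap = [(-area(page), page, overlapping(page, obstacles))]
    whites = []
    while heap and len(whites) < max_results:
        _, r, obs = heapq.heappop(heap)          # pair with the largest area first
        if not obs:
            whites.append(r)                     # largest white rectangle still to be found
            continue
        px0, py0, px1, py1 = obs[0]              # choose one obstacle as pivot
        x0, y0, x1, y1 = r
        for sub in ((x0, y0, x1, py0), (x0, py1, x1, y1),     # above / below the pivot
                    (x0, y0, px0, y1), (px1, y0, x1, y1)):    # left / right of the pivot
            if area(sub) > 0:
                heapq.heappush(heap, (-area(sub), sub, overlapping(sub, obs)))
    return whites

# Toy page with two content blocks.
print(maximal_white_rectangles((0, 0, 100, 100),
                               [(10, 10, 40, 40), (60, 60, 90, 90)], max_results=3))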
However, taking the algorithm to its natural end and then computing the
complement would result again in the original basic blocks, while the layout
analysis process aims at returning higher-level layout aggregates. This raised the
problem of identifying a stop criterion to end this process. An empirical study
carried out on a set of 100 documents of three different categories revealed that
the best moment to stop the algorithm is when the ratio of the last white area
retrieved to the total white area in the current page of the document approaches
0: before that point the layout is not sufficiently detailed, while after it only
useless white spaces are found.
fragments are enqueued; if the contour is empty and fulfils the constraints, it is
added to the list of white areas; if the contour is empty but does not fulfil the
constraints, it is discarded.
Allowing the user to interact with the algorithm means modifying the algo-
rithm behaviour as a consequence of his choices. It turns out that the relevance
of a (white or black) block to the overall layout can be assessed based on its
position inside the document page and its relationships with the other layout
components. According to this assumption, each time the user applies a manual
correction, the information on his actions and on their effect can be stored in a
trace file for subsequent analysis. In particular, each manual correction (user in-
tervention) can be exploited as an example from which to learn a model of how
to classify blocks as meaningful or meaningless for the overall layout. Applying
the learned models to subsequent incoming documents, it would be possible to
automatically decide whether or not any white (resp., black) block is to be in-
cluded as background (resp., content) in the final layout, thereby reducing the
need for user intervention.
is based on literals expressing the page layout and describing the blocks and
frames surrounding the forced block, and, among them, only those touching or
overlapping the forced block. Each involved frame frame(r) or block block(r)
is considered as a rectangular area of the page, and described according to the
following parameters: horizontal and vertical position of its centroid with respect
to the top-left corner of the page (posX(r,x) and posY(r,y)), height and width
(width(r) and height(r)), and its content type (type(r,t)).
The relationships between blocks/frames are described by means of a set of
predicates representing the spatial relationships existing among all considered
frames and among all blocks belonging to the same frame (belongs(b,f)), that
touch or overlap the forced block; furthermore for each frame or block that
touches the forced block a literal specifying that they touch (touches(b1,b2));
finally, for each block of a frame that overlaps the forced block, the percent-
age of overlapping (overlaps(b1,b2,perc)). It is fundamental to completely
describe the mutual spatial relationships among all involved elements. All, and
only, the relationships between each block/frame and the forced blocks are ex-
pressed, but not their inverses (i.e., the relationships between the forced block
and the block/frame in question). To this aim, the model proposed in [13] for
representing the spatial relationships among the blocks/frames was considered.
Specifically, according to such a model, given a fixed rectangle, the plane is partitioned
into 25 parts, and the rectangle's spatial relationships to any other rectangle in
the plane can be specified by simply listing the parts that the other rectangle
overlaps (overlapPart(r1,r2,part)).
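For illustration, the Python sketch below shows how such a set of literals could be assembled for a forced block (the geometry helpers, the exact predicate spellings in the output strings, and the handling of the 25-part partition are assumptions, not the authors' actual description generator).

# Illustrative construction of the first-order description of a forced block.

def centroid(r):
    return (r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0

def overlap_fraction(a, b):
    """Fraction of block a covered by block b."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (ix * iy) / area_a if area_a else 0.0

def describe_forced_block(forced_id, forced_rect, neighbours):
    """neighbours: id -> (rect, content_type, frame_id) for the blocks/frames
    touching or overlapping the forced block."""
    literals = []
    for bid, (rect, ctype, frame) in neighbours.items():
        cx, cy = centroid(rect)
        literals += [f"posX({bid},{cx:.0f})", f"posY({bid},{cy:.0f})",
                     f"width({bid},{rect[2] - rect[0]:.0f})",
                     f"height({bid},{rect[3] - rect[1]:.0f})",
                     f"type({bid},{ctype})", f"belongs({bid},{frame})"]
        perc = overlap_fraction(forced_rect, rect)
        if perc > 0:
            literals.append(f"overlaps({bid},{forced_id},{perc:.2f})")
        else:
            literals.append(f"touches({bid},{forced_id})")
        # overlapPart(bid, forced_id, part) literals would be added here once the
        # 25-part partition of [13] is computed for `rect`.
    return literals

print(describe_forced_block("b0", (0, 0, 50, 20), {"b1": ((0, 20, 50, 40), "text", "f1")}))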
In a Markov network (MN) the joint distribution factorizes over the cliques of the graph as $P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$, where $x_{\{k\}}$ is the state of the $k$-th clique, and the partition function $Z$ is given by $Z = \sum_{x \in \mathcal{X}} \prod_k \phi_k(x_{\{k\}})$. MNs may be represented as log-linear models; in an MLN the probability of a world $x$ becomes
$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i n_i(x)\Big) = \frac{1}{Z} \prod_i \phi_i(x_{\{i\}})^{n_i(x)},$$
where $n_i(x)$ is the number of true groundings of $F_i$ in $x$, $x_{\{i\}}$ is the state (truth values) of the atoms appearing in $F_i$, and $\phi_i(x_{\{i\}}) = e^{w_i}$.
Reasoning with MLNs can be classified as either learning or inference. Infer-
ence in SRL is the problem of computing probabilities to answer specific queries
after a probability distribution has been defined. Learning corresponds to inferring
both the structure and the parameters of the true unknown model. One inference
task is computing the probability that a formula holds, given an MLN and a set
of constants; by definition, this is the sum of the probabilities of the worlds
where it holds. MLN weights can be learned generatively by maximizing the
likelihood of a relational database consisting of one or more possible worlds that
form our training examples. The inference and learning algorithms for MLNs
are publicly available in the open-source Alchemy system
(http://alchemy.cs.washington.edu/). Given a relational database and a set of
clauses in the KB, many weight learning and inference procedures are implemented
in Alchemy. For weight learning, we used the generative approach that maximises
the pseudo-likelihood of the data with standard Alchemy parameters (./learnwts -g),
while for inference we used the MaxWalkSAT procedure (./infer), also with standard
Alchemy parameters. (In the ground network, the value of a node is 1 if the
corresponding ground atom is true, and 0 otherwise; the value of a feature is 1
if the corresponding ground formula is true, and 0 otherwise.)
Predicates split(b) and merge(b) represent the query in our MLN, where b
is the forced block. The goal is to assign a black (merge) or white (split) forcing
to unlabelled blocks. The MLN clauses used in our system are reported in the
following. There is one rule of the following form for each of the 25 plane parts
capturing the spatial relationships among the blocks:
overlapPart(b1,b,part) => split(b), merge(b)
Other relations are represented by the MLN rules:
belongs(b1,b) => split(b), merge(b)
touches(b1,b) => split(b), merge(b)
overlaps(b1,b,perc) => split(b), merge(b)
belongs(b1,b) => split(b1), merge(b1)
touches(b1,b) => split(b1), merge(b1)
overlaps(b1,b,perc) => split(b1), merge(b1)
Running weight learning with Alchemy, we learn a weight for every clause,
representing how good a relation is for predicting the label. Then, with this
classifier, each test instance can be classified using inference.
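To give a feel for how the learned clause weights drive the final decision, the toy Python computation below uses the standard MLN property that, for a single ground query atom which appears only as the head of rules like those above, with all other atoms given as evidence, the conditional probability is a logistic function of the summed weights of the ground clauses whose bodies are satisfied. The weights and evidence below are made-up placeholders, and this is not a substitute for the MaxWalkSAT inference actually used.

import math

# Toy log-linear scoring of the split(b) query from the satisfied clause bodies.
clause_weights = {
    "belongs": 0.8,
    "touches": -0.3,
    "overlaps": 1.2,
    "overlapPart_5": 0.6,   # one weight per plane part in the full model
}

def prob_split(evidence, weights):
    """evidence: names of the clause templates whose body is true for block b."""
    total = sum(weights.get(name, 0.0) for name in evidence)
    return 1.0 / (1.0 + math.exp(-total))       # sigmoid of the summed weights

# Block b touched by a frame, overlapped by a block, falling in plane part 5.
print(round(prob_split(["touches", "overlaps", "overlapPart_5"], clause_weights), 3))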
4 Experimental Evaluation
Table 1. Area under the ROC and PR curves for split (S) and merge (M)
number of correctly classified positive examples varies with the number of incor-
rectly classified negative examples, and the Precision-Recall (PR) curve.
Table 1 reports the results for the queries split and merge in this first exper-
iment. The outcomes of the experiment suggest that the description language
proposed and the way in which the forcings are described are effective to let
the system learn clause weights that can be successfully used for automatic lay-
out correction. This suggested trying another experiment to simulate the actual
behavior of such an automatic system, working on the basic layout analysis al-
gorithm. After finishing the execution of the layout analysis algorithm according
to the required stop threshold, three queues are produced (the queued areas
still to be processed, the white areas discarded because not satisfying the con-
straints and the white blocks selected as useful background). Among these, the
last one contains whites that can be forced to black, while the other two contain
rectangles that might be forced white.
Since the rules needed by DOC to automate the layout correction process
must be able to evaluate each block in order to decide whether to force it or
not, it is no longer sufficient to consider each white-block forcing as a coun-
terexample for black forcing and vice versa; to ensure that the learned MLN
is correct, all blocks in the document that have not been forced must also be
exploited as negative examples for the corresponding concepts. The adopted
solution was to still express forcings as discussed above, including additional
negative examples obtained from the layout configuration finally accepted by
the user. Indeed, when the layout is considered correct, all actual white blocks
that were not forced become negative examples for the concept merge (because
they could have been forced as black, but were not), while all white blocks
discarded or still to be processed become negative examples for the concept split
(because they were not forced). The dataset for this experiment was obtained by running the
layout analysis algorithm until the predefined threshold was reached, and then
applying the necessary corrections to fix the final layout. The 36 documents con-
sidered were a subset of the former dataset, evenly distributed among the four
categories. Specifically, the new dataset included 113 positive and 840 negative
examples for merge, and resulted in the performance reported in Table 2.
As to the concept split, the dataset was made up of 101 positive and 10046
negative examples. The large number of negative examples is due to the number
of white blocks discarded or still to be processed being typically much greater
than that of white blocks found. Since exploiting such a large number of neg-
ative examples might have significantly unbalanced the learning process, only
Table 2. Area under the ROC and PR curves for split (S) and merge (M)
a random subset of 843 such examples was selected, in order to keep the same
ratio between positive and negative examples as for the merge concept. The
experiment run on such a subset provided the results shown in Table 2.
Figure 1 reports the plot obtained averaging ROC and PR curves for the
ten folds. As reported in [6], a technique to evaluate a classifier over the results
obtained with a cross validation method is to merge together the test instances
belonging to each fold with their assigned scores into one large test set.
Fig. 1. ROC (left, True Positive Rate vs. False Positive Rate) and PR (right, Precision vs. Recall) curves obtained by merging the 10-fold curves
5 Conclusions
The variety of document styles and formats to be processed makes the layout
analysis task a non-trivial one, and the automatically found structure often
needs to be manually fixed by domain experts. In this work we proposed a tool
able to use the steps carried out by the domain expert, with the aim of correcting
the outcome of the layout analysis phase, in order to infer models to be applied to
future incoming documents. Specifically, the tool makes use of a first-order logic
representation of the document structure, since that structure is not fixed and a
correction often depends on the relationships of the wrong components with the surrounding ones.
Moreover, the tool exploits the statistical relational learning system Alchemy.
Experiments in a real-world domain made up of scientific documents have been
presented and discussed, showing the validity of the proposed approach.
References
1. Breuel, T.M.: Two geometric algorithms for layout analysis. In: Lopresti, D.P.,
Hu, J., Kashi, R.S. (eds.) DAS 2002. LNCS, vol. 2423, pp. 188–199. Springer,
Heidelberg (2002)
2. Chang, F., Chu, S.Y., Chen, C.Y.: Chinese document layout analysis using adaptive
regrouping strategy. Pattern Recognition 38(2), 261–271 (2005)
3. Dengel, A., Dubiel, F.: Computer understanding of document structure. Interna-
tional Journal of Imaging Systems and Technology 7, 271–278 (1996)
4. Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N.: Machine Learning for digital
document processing: from layout analysis to metadata extraction. In: Marinai, S.,
Fujisawa, H. (eds.) Machine Learning in Document Analysis and Recognition. SCI,
vol. 90, pp. 105–138. Springer, Heidelberg (2008)
5. Etemad, K., Doermann, D., Chellappa, R.: Multiscale segmentation of unstruc-
tured document pages using soft decision integration. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence 19(1), 92–96 (1997)
6. Fawcett, T.: Roc graphs: Notes and practical considerations for researchers. Tech.
rep., HP Laboratories (2004),
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
7. Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning (Adaptive
Computation and Machine Learning). MIT Press, Cambridge (2007)
8. Krishnamoorthy, M., Nagy, G., Seth, S., Viswanathan, M.: Syntactic segmenta-
tion and labeling of digitized pages from technical journals. IEEE Transactions on
Pattern Analysis and Machine Intelligence 15(7), 737–747 (1993)
9. Laven, K., Leishman, S., Roweis, S.: A statistical learning approach to document
image analysis. In: Proceedings of the Eighth International Conference on Docu-
ment Analysis and Recognition, pp. 357–361. IEEE Computer Society, Los Alami-
tos (2005)
10. Liu, J., Tang, Y.Y., Suen, C.Y.: Chinese document layout analysis based on adap-
tive split-and-merge and qualitative spatial reasoning. Pattern Recognition 30(8),
1265–1278 (1997)
11. Malerba, D., Esposito, F., Altamura, O., Ceci, M., Berardi, M.: Correcting the
document layout: A machine learning approach. In: Proceedings of the 7th Intern.
Conf. on Document Analysis and Recognition, pp. 97–103. IEEE Comp. Soc., Los
Alamitos (2003)
12. Okamoto, M., Takahashi, M.: A hybrid page segmentation method. In: Proceedings
of the Second International Conference on Document Analysis and Recognition,
pp. 743–748. IEEE Computer Society, Los Alamitos (1993)
13. Papadias, D., Theodoridis, Y.: Spatial relations, minimum bounding rectangles,
and spatial data structures. International Journal of Geographical Information
Science 11(2), 111–138 (1997)
14. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62,
107–136 (2006)
15. Simon, A., Pret, J.-C., Johnson, A.P.: A fast algorithm for bottom-up document
layout analysis. IEEE Transactions on PAMI 19(3), 273–277 (1997)
16. Wu, C.C., Chou, C.H., Chang, F.: A machine-learning approach for analyzing
document layout structures with two reading orders. Pattern Recogn. 41(10),
3200–3213 (2008)
Extracting General Lists from Web Documents:
A Hybrid Approach
1 Introduction
The extraction of lists from the Web is useful in a variety of Web mining tasks,
such as annotating relationships on the Web, discovering parallel hyperlinks,
enhancing named entity recognition, disambiguation, and reconciliation. The
many potential applications have also attracted large companies, such as Google,
which has made publicly available the service Google Sets to generate lists from
a small number of examples by using the Web as a big pool of data [13].
Several methods have been proposed for the task of extracting information em-
bedded in lists on the Web. Most of them rely on the underlying HTML markup
and corresponding DOM structure of a Web page [13,1,9,12,5,3,14,7,16,17]. Un-
fortunately, HTML was initially designed for rendering purposes and not for
information structuring (like XML). As a result, a list can be rendered in several
ways in HTML, and it is difficult to find an HTML-only tool that is sufficiently
robust to extract general lists from the Web.
Another class of methods is based on the rendering of an HTML
page [2,4,6,10,11]. These methods are likewise inadequate for general list ex-
traction, since they tend to focus on specific aspects, such as extracting tables
where each data record contains a link to a detail page [6], or discovering tables
rendered from Web databases [10] (deep web pages) like Amazon.com. Due to
the restricted notion of what constitutes a table on the web, these visual-based
methods are not likely to effectively extract lists from the Web in the general
case.
This work aims to overcome the limitations of previous works with respect to
the generality of the extracted lists. This is obtained by combining several
visual and structural features of Web lists. We start from the observation that
lists usually contain items which are similar in type or in content. For example,
the Web page in Figure 1a) contains eight separate lists. Looking closely
at it, we can infer that the individual items in each list: 1) are visually aligned
(horizontally or vertically), and 2) share a similar structure.
The proposed method, called HyLiEn (Hybrid approach for automatic List
discovery and Extraction on the Web), automatically discovers and extracts
general lists on the Web, by using both information on the visual alignment of
list items, and non-visual information such as the DOM structure of visually
aligned items. HyLiEn uses the CSS2 visual box model to segment a Web page
into a number of boxes, each of which has a position and size, and can either
contain content (i.e., text or images) or more boxes. Starting from the box rep-
resenting the entire Web page, HyLiEn recursively considers inner boxes, and
then extracts list boxes which are visually aligned and structurally similar to
other boxes. A few intuitive, descriptive, visual cues in the Web page are used
to generate candidate lists, which are subsequently pruned with a test for struc-
tural similarity in the DOM tree. As shown in this paper, HyLiEn significantly
outperforms existing extraction approaches in specific and general cases.
The paper is organized as follows. Section 2 presents the Web page layout
model. Sections 3 and 4 explain the methodology used by our hybrid approach
for visual candidate generation and DOM-tree pruning, respectively. Section 5
provides the methodology of our hybrid approach. The experimental results are
presented in Section 6.
Algorithm 1: HybridListExtractor
input : Web site S, level of similarity α, max DOM-nodes β
output: set of lists L
RenderedBoxTree T(S);
Queue Q;
Q.add(T.getRootBox());
while !Q.isEmpty() do
    Box b = Q.top();
    list candidates = b.getChildren();
    list aligned = getVisAligned(candidates);
    Q.addAll(getNotAligned(candidates));
    Q.addAll(getStructNotAligned(candidates, α, β));
    aligned = getStructAligned(aligned, α, β);
    L.add(aligned);
return L;
We notice that, even if purely DOM-centric approaches fail in the general list-finding
problem, the DOM tree can still be a valuable resource for the comparison
of visually aligned boxes. For a list on a Web page, we hypothesize that the
DOM subtrees corresponding to the elements of the list must satisfy a structural
similarity measure (structSim) to within a certain threshold α, and that the
subtrees must not have a number of DOM nodes (numNodes) greater than β.
This DOM-structural assumption serves to prune false positives from the can-
didate list set. The assumption we make here is shared with most other DOM-
centric structure mining algorithms, and we use it to determine whether a set of
visually aligned boxes can be regarded as a real list or whether the
candidate list should be discarded. Specifically, the α and β parameters are es-
sentially the same as the K and T thresholds from MDR [9] and DEPTA [16,17],
and the α and C thresholds from Tag Path Clustering [12].
5 Visual-Structural Method
The input to our algorithm is a set of unlabeled Web pages containing lists. For
each Web page, Algorithm 1 is called to extract Web lists. We use the open
source library CSSBox (http://cssbox.sourceforge.net) to render the pages.
The actual rendered box tree of a Web page could contain hundreds or thou-
sands of boxes. Enumerating and matching all of the boxes in search of lists of
arbitrary size would take time exponential in the number of boxes if we used a
Algorithm 2: getVisAligned
input : list candidates
output: list visAligned
Box head = candidates[0];
list visAligned;
for (i = 1; i < candidates.length; i++) do
    Box tail = candidates[i];
    if (head.x == tail.x || head.y == tail.y) then
        visAligned.add(tail);
if visAligned.length > 1 then
    visAligned.add(head);
return visAligned;
brute force approach. To avoid this, our proposed method explores the space of
the boxes in a top-down manner (i.e., from the root to boxes that represent the
elements of a list) using the edges of the rendered box tree (V). This makes our
method more efficient with respect to those reported in the literature.
Starting from the root box r, the rendered box tree is explored. Using a Queue,
a breadth first search over children boxes is implemented. Each time a box b is
retrieved from the Queue, all the children boxes of b are tested for visual and
structural alignment using Algorithms 2 and 3, thereby generating candidate
and genuine lists among the children of b. All the boxes which are not found to
be visually or structurally aligned are enqueued, while the tested boxes are added
to the result set of the Web page. This process ensures that the search does not
need to explore each atomic element of a Web page, and thus keeps the search
bounded by the complexity of the actual lists in the Web page.
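A compact Python rendering of this exploration strategy is given below (a sketch only: the Box class and the two toy alignment tests stand in for the CSSBox rendering and for Algorithms 2 and 3, and are assumptions rather than the actual HyLiEn code).

from collections import deque

# Sketch of the top-down exploration of the rendered box tree.
class Box:
    def __init__(self, x, y, children=None, tag_tree=""):
        self.x, self.y = x, y
        self.children = children or []
        self.tag_tree = tag_tree

def extract_lists(root, vis_aligned, struct_aligned):
    lists, queue = [], deque([root])
    while queue:
        box = queue.popleft()
        candidates = box.children
        genuine = struct_aligned(vis_aligned(candidates))   # Algorithms 2 and 3
        if genuine:
            lists.append(genuine)
        # boxes not recognised as list items are explored further
        queue.extend(c for c in candidates if c not in genuine)
    return lists

def vis_aligned(cs):
    """Toy Algorithm 2 stand-in: keep boxes sharing x or y with the first one."""
    return [c for c in cs if c.x == cs[0].x or c.y == cs[0].y] if cs else []

def struct_aligned(cs):
    """Toy Algorithm 3 stand-in: all items must share the same tag skeleton."""
    return cs if cs and len({c.tag_tree for c in cs}) == 1 else []

page = Box(0, 0, children=[Box(10, y, tag_tree="li/a") for y in (0, 20, 40)])
print(len(extract_lists(page, vis_aligned, struct_aligned)))   # -> 1 list of 3 items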
Algorithm 2 uses the visual information of boxes to generate candidate lists
which are horizontally or vertically aligned. To facilitate comprehension of the
approach, we present a generalized version of the method where the vertical and
horizontal alignments are evaluated together. However, in the actual implemen-
tation of the method these features are considered separately; this enables the
method to discover both the horizontal, vertical and tiled lists on the Web page.
Algorithm 3 prunes false positive candidate lists. The first element of the
visually aligned candidate list is used as an element for the structural test. Each
time the tail candidate is found to be structurally similar to the head, the tail
is added to the result list. At the end, if the length of the result list is greater
than one, the head is added. If none of the boxes are found to be structurally
similar, an empty list is returned.
To check if two boxes are structurally similar, Algorithm 4 exploits the
DOM-tree assumption described in Def. 3. It works with any sensible tree sim-
ilarity measure. In our experiments we use a simple string representation of
the corresponding tag subtree of the boxes being compared for our similarity
measurement.
Algorithm 3: getStructAligned
input : list candidates, min. similarity α, max. tag size β
output: list structAligned
Box head = candidates[0];
list structAligned;
for (i = 1; i < candidates.length; i++) do
    Box tail = candidates[i];
    if getStructSim(head, tail, β) ≤ α then
        structAligned.add(tail);
if structAligned.length > 1 then
    structAligned.add(head);
return structAligned;
Algorithm 4: getStructSim
input : Box a, b, max. tag size β
output: double simValue
TagTree tA = a.getTagTree();
TagTree tB = b.getTagTree();
double simValue;
if (tA.length ≥ β || tB.length ≥ β) then
    return MAXDouble;
simValue = Distance(tA, tB);
return simValue;
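As an illustration of the string-based similarity mentioned above, the Python sketch below serializes each box's tag subtree to a string and uses difflib as an assumed stand-in distance; the β cut-off mirrors Algorithm 4, while the actual measure used by HyLiEn is not specified beyond being a string comparison.

import difflib

def struct_sim(tag_tree_a, tag_tree_b, beta=50):
    """Distance in [0,1] between two serialized tag subtrees; boxes whose
    serialized subtree exceeds the size bound beta are rejected outright."""
    if len(tag_tree_a) >= beta or len(tag_tree_b) >= beta:
        return float("inf")
    return 1.0 - difflib.SequenceMatcher(None, tag_tree_a, tag_tree_b).ratio()

print(struct_sim("div/a/img", "div/a/img"))    # 0.0: identical skeleton
print(struct_sim("div/a/img", "table/tr/td"))  # larger distance: likely pruned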
At the end of the computation, Algorithm 1 returns the collection of all the
lists extracted from the Web page. A post-processing step is finally applied to
deal with tiled structures. Tiled lists are not directly extracted by this algorithm.
Based on Section 2, each list discovered is contained in a box b. Considering the
position P of these boxes, we can recognize tiled lists. We do this as a post-
processing step by: 1) identifying the boxes which are vertically aligned, and
2) checking whether the element lists contained in those boxes are visually and
structurally aligned (see the sketch below). Using this simple heuristic we are
able to identify tiled lists and update the result set accordingly.
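A possible shape for this post-processing step is sketched below in Python (the bookkeeping of each list's containing box and the structural test are assumptions; this is not the HyLiEn implementation).

from collections import namedtuple

Box = namedtuple("Box", "x y")

def merge_tiled(lists, same_structure):
    """lists: [(containing_box, items)]. Vertically aligned containing boxes
    (same x) whose item lists are structurally aligned are merged into one
    tiled list."""
    merged, used = [], set()
    for i, (box_i, items_i) in enumerate(lists):
        if i in used:
            continue
        tile = list(items_i)
        for j in range(i + 1, len(lists)):
            box_j, items_j = lists[j]
            if j not in used and box_i.x == box_j.x and same_structure(items_i, items_j):
                tile.extend(items_j)
                used.add(j)
        merged.append(tile)
    return merged

rows = [(Box(0, 0), ["a", "b", "c"]), (Box(0, 30), ["d", "e", "f"]), (Box(200, 0), ["nav"])]
print(merge_tiled(rows, lambda p, q: len(p) == len(q)))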
6 Experiments
We showed in [15] that implementing a method that uses the assumption in Def. 1
is sufficient to outperform all existing list extraction methods. Thus we tested
HyLiEn on a dataset used to validate the Web table discovery in VENTex [4].
This dataset contains Web pages saved by WebPageDump
(http://www.dbai.tuwien.ac.at/user/pollak/webpagedump/), including the Web
page after the “x-tagging”, the coordinates of all relevant Visualized Words
(VENs), and the manually determined ground truth. From the first 100 pages of
the original dataset we manually extracted and verified 224 tables, with a total
number of 6146 data records. We can use this dataset as a test set for our method
because we regard tables on the Web to be in the set of lists on the Web; that
is, a table is a special type of list. This dataset was created by asking students
taking a class in Web information extraction at Vienna University of Technology
to provide a random selection of Web tables. This, according to Gatterbauer et
al.[4], was done to eliminate the possible influence of the Web page selection on
the results. We use the generality advantage of this dataset to show that our
method is also robust and that the results are not biased by the selected test set.
HyLiEn returns a text and visual representation of the results. The former
consists of a collection of all the discovered lists, where each element is repre-
sented by its HTML tag structure and its inner text. The latter is a png image,
where all the discovered lists are highlighted with random colors.
We compared HyLiEn to VENTex, which returns an XML representation of
the frames discovered. Because of the differences in output of the two methods,
we erred on the side of leniency in most questionable cases. In the experiment,
the two parameters α and β required by HyLiEn were empirically set to 0.6 and
50, respectively.
Table 1 shows that VENTex extracted 82.6% of the tables and 85.7% of the
data records, and HyLiEn extracted 79.5% of the tables and 99.7% of the data
records. We remind readers that HyLiEn was not initially created to extract
tables, but we find that our method can work because we consider tables to be
a type of list.
We see that VENTex did extract 8 more tables than HyLiEn. We believe
this is because HyLiEn does not have any notion of element distance that could
be used to separate aligned but separated lists. On the contrary, in HyLiEn,
if elements across separate lists are aligned and structurally similar they are
merged into one list. Despite the similar table extraction performance, HyLiEn
extracted many more records (i.e., rows) from these tables than VENTex.
We also computed the precision score here for comparison's sake. We find that,
among the 100 Web pages, only 2 results contained false positives (i.e., incorrect
list items), resulting in 99.9% precision. VENTex remained competitive with a
precision of 85.7%. Table 2 shows the full set of results on the VENTex data set.
We see that HyLiEn consistently and convincingly outperforms VENTex.
Interestingly, the recall and precision values that we obtained for VENTex
were actually higher than the results presented in Gatterbauer et al. [4] (they
show precision: 81%, and recall: 68%). We are confident this difference is because
we use only the first 100 Web pages of the original dataset.
Table 1. Recall for table and record extraction on the VENTex data set
Table 2. Precision and Recall for record extraction on the VENTex data set
Acknowledgments
This research is funded by an NDSEG Fellowship to the second author. The first
and fourth authors are supported by the projects “Multi-relational approach to
spatial data mining” funded by the University of Bari “Aldo Moro,” and the
Strategic Project DIPIS (Distributed Production as Innovative Systems) funded
by Apulia Region. The third and fifth authors are supported by NSF IIS-09-
05215, U.S. Air Force Office of Scientific Research MURI award FA9550-08-1-
0265, and by the U.S. Army Research Laboratory under Cooperative Agreement
Number W911NF-09-2-0053 (NS-CTA).
References
1. Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., Zhang, Y.: Webtables: exploring
the power of tables on the web. Proc. VLDB Endow. 1(1), 538–549 (2008)
2. Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: Extracting content structure for web
pages based on visual representation. In: Zhou, X., Zhang, Y., Orlowska, M.E.
(eds.) APWeb 2003. LNCS, vol. 2642, pp. 406–417. Springer, Heidelberg (2003)
3. Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: automatic data extraction
from data-intensive web sites. SIGMOD, 624–624 (2002)
4. Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards domain-
independent information extraction from web tables. In: WWW, pp. 71–80. ACM,
New York (2007)
5. Gupta, R., Sarawagi, S.: Answering table augmentation queries from unstructured
lists on the web. Proc. VLDB Endow. 2(1), 289–300 (2009)
6. Lerman, K., Getoor, L., Minton, S., Knoblock, C.: Using the structure of web sites
for automatic segmentation of tables. SIGMOD, 119–130 (2004)
7. Lerman, K., Knoblock, C., Minton, S.: Automatic data extraction from lists and
tables in web sources. In: IJCAI. AAAI Press, Menlo Park (2001)
8. Lie, H.W., Bos, B.: Cascading Style Sheets: Designing for the Web, 2nd edn.
Addison-Wesley Professional, Reading (1999)
9. Liu, B., Grossman, R., Zhai, Y.: Mining data records in web pages. In: KDD, pp.
601–606. ACM Press, New York (2003)
10. Liu, W., Meng, X., Meng, W.: Vide: A vision-based approach for deep web data
extraction. IEEE Trans. on Knowl. and Data Eng. 22(3), 447–460 (2010)
11. Mehta, R.R., Mitra, P., Karnick, H.: Extracting semantic structure of web docu-
ments using content and visual information. In: WWW, pp. 928–929. ACM, New
York (2005)
12. Miao, G., Tatemura, J., Hsiung, W.-P., Sawires, A., Moser, L.E.: Extracting data
records from the web using tag path clustering. In: WWW, pp. 981–990. ACM,
New York (2009)
13. Tong, S., Dean, J.: System and methods for automatically creating lists. In: US
Patent: 7350187 (March 2008)
14. Wang, R.C., Cohen, W.W.: Language-independent set expansion of named entities
using the web. In: ICDM, pp. 342–350. IEEE, Washington, DC, USA (2007)
15. Weninger, T., Fumarola, F., Barber, R., Han, J., Malerba, D.: Unexpected results
in automatic list extraction on the web. SIGKDD Explorations 12(2), 26–30 (2010)
16. Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: WWW,
pp. 76–85. ACM, New York (2005)
17. Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree
alignment. IEEE Trans. on Knowl. and Data Eng. 18(12), 1614–1628 (2006)
Towards a Computational Model of
the Self-attribution of Agency
1 Introduction
The difference between falling and jumping from a cliff is a significant one.
Traditionally, this difference is characterized in terms of the contrast between
something happening to us and doing something. This contrast, in turn, is cashed
out by indicating that the person involved had mental states (desires, motives,
reasons, intentions, etc.) that produced the action of jumping, and that such
factors were absent or ineffective in the case of falling. Within philosophy, major
debates have taken place about a proper identification of the relevant mental
states and an accurate portrayal of the relation between these mental states
and the ensuing behavior (e.g. [1,2]). In this paper, however, we will focus on a
psychological question: how does one decide that oneself is the originator of one’s
behavior? Where does the feeling of agency come from? Regarding this question
we start with the assumption that an agent generates explanatory hypotheses
about events in the environment, including physical events, the behavior of others,
and the agent's own behavior. In line with this assumption, in [3] Wegner has
singled out three factors involved in the self-attribution of agency: the principles
of priority, consistency and exclusivity. Although his account is detailed, both
historically and psychologically, Wegner does not provide a formal model of his
theory, nor a computational mechanism. In this paper, we will provide a review of
the basic aspects of Wegner’s theory, and sketch the outlines of a computational
model implementing it, with a focus on the priority principle. Such a model of
self-attribution can be usefully applied in interaction design to establish whether
a human attributes the effects of the interaction to itself or to a machine.
The paper is organized as follows: Section 2 provides an outline of Wegner’s
theory and introduces the main contributing factors in the formation of an expe-
rience of will. In section 3, it is argued that first-order Bayesian network theory
is the appropriate tool for modeling the theory of apparent mental causation,
and a model of this theory is presented. In section 4, the model is in-
stantiated with the parameters of the I Spy experiment as performed by Wegner
and the results are evaluated. Finally, section 5 concludes the paper.
Part of a theory of mind is the link between an agent’s state and its actions.
That is, agents describe, explain and predict actions in terms of underlying
mental states that cause the behavior. In particular, human agents perceive their
intentions as causes of their behavior. Moreover, intentions to do something that
occur prior to the corresponding act are interpreted as reasons for doing the
action. This understanding is not fully present yet in very young children.
It is not always clear-cut whether or not an action was caused by one's own
prior intentions. For example, when one finds someone else on the line after
making a phone call to a friend using voice dialing, various explanations may
come to mind. The name may have been pronounced incorrectly making it hard
to recognize it for the phone, the phone’s speech recognition unit may have mixed
up the name somehow, or, alternatively, one may have more or less unconsciously
mentioned the name of someone else only recognizing this fact when the person is
on the line. The perception of agency thus may vary depending on the perception
of one’s own mind and the surrounding environment.
In the self-attribution of agency, intentions play a crucial role, but the con-
scious experience of a feeling that an action was performed by the agent itself
still may vary quite extensively. We want to gain a better understanding of the
perception of agency, in particular of the attribution of agency to oneself. We
believe that the attribution of agency plays an important role in the interaction
and the progression of interaction between agents, whether they are human or
computer-based agents. As the example of the previous paragraph illustrates, in
order to understand human interaction with a computer-based agent it is also
important to understand the factors that play a role in human self-attribution
of agency. Such factors will enhance our understanding of the level of control
that people feel when they find themselves in particular environments. One of
our objectives is to build a computational model to address this question which
may also be useful in the assessment by a computer-based agent of the level of
control of one of its human counterparts in an interaction.
As our starting point for building such a model, we use Wegner’s theory of
apparent mental causation [4]. Wegner argues that there is more to intentional
action than forming an intention to act and performing the act itself. A causal
relation between intention and action may not always be present in a specific
case, despite the fact that it is perceived as such. This may result in an illusion
of control. Vice versa, in other cases, humans that perform an act do not per-
ceive themselves as the author of those acts, resulting in more or less automatic
behavior (automatisms). As Wegner shows, the causal link between intention
and action cannot be taken for granted.
more conscious will [3]. People discount the causal influence of one potential
cause if there are others available [6]. Wegner distinguishes between two types
of competing causes: (i) internal ones such as: emotions, habits, reflexes, traits,
and (ii) external ones such as external agents (people, groups), imagined agents
(spirits, etc.), and the agent’s environment. In the cognitive process which eval-
uates self-agency these alternative causes may discount an intention as the cause
of action. Presumably, an agent has background knowledge about possible alter-
native causes that can explain a particular event in order for such discounting to
happen. Wegner illustrates this principle by habitual and compulsive behavior
like eating a large bag of potato chips. In case we know we do this because of
compulsive habits, any intentions to eat the chips are discounted as causes.
3 Computational Model
defining the model. The basic idea of this approach is that causal attribution
involves searching for underlying mechanism information (i.e. the processes un-
derlying the relationship between the cause and the effect), given evidence made
available through perception and introspection. Assuming that each mechanism
defines a particular covariation (or joint probability distribution) of the con-
tributing factors with the resulting outcome, the introduction of separate prob-
ability distributions for each particular event that is to be explained can be
avoided. As a result, the number of priority and causality fragments needed is
linear in the number of mechanisms rather than in the number of events.
difference decreases to about one second. As one typically needs some time to
perform an action, the probability starts to decrease again for time intervals less
than one second. Each fragment may be instantiated multiple times, illustrated
in Section 4, depending on the number of generated explanatory hypotheses.
variable in the intentional fragment given the values of the other nodes in the
network. By querying other Cause variables we can find by means of comparison
which of the potential causes is the most plausible one. As a result, only when
the node representing the feeling of doing has a high associated probability would
an agent explain the occurrence of an event as caused by itself.
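To make this kind of comparison between potential causes concrete, the toy Python calculation below applies Bayes' rule to two competing hypotheses (the agent's own intention versus an external cause), with a likelihood for the intention-to-event interval that peaks around one second in line with the priority principle. The priors and likelihood shapes are purely illustrative assumptions and are not the parameters of the authors' network or of the I Spy experiment.

import math

def likelihood_interval_given_self(dt):
    """Priority principle: intervals of roughly one second between intention
    and event are taken as most consistent with self-causation (assumed Gaussian)."""
    return math.exp(-((dt - 1.0) ** 2) / (2 * 0.5 ** 2))

def p_self_cause(dt, prior_self=0.5, lik_external=0.3):
    """Posterior probability that the agent caused the event, given the
    intention-to-event interval dt (in seconds)."""
    num = prior_self * likelihood_interval_given_self(dt)
    den = num + (1.0 - prior_self) * lik_external
    return num / den

for dt in (0.2, 1.0, 5.0, 30.0):
    print(f"interval {dt:>5.1f}s -> P(self caused event) = {p_self_cause(dt):.2f}")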
                 P(Exists(Ip, tp))
P(Priority)      0.55    0.45    0.65
0.30             0.41    0.36    0.45
0.35             0.44    0.39    0.48
0.50             0.51    0.46    0.56
0.80             0.62    0.56    0.66
0.85             0.63    0.58    0.67
References
1. Anscombe, G.E.M.: Intention. Harvard University Press, Cambridge (1958)
2. Dretske, F.: Explaining behavior. MIT Press, Cambridge (1988)
3. Wegner, D.M.: The Illusion of Conscious Will. MIT Press, Cambridge (2002)
4. Wegner, D.M.: The mind’s best trick: How we experience conscious will. Trends in
Cognitive Science 7, 65–69 (2003)
5. Wegner, D.M., Wheatley, T.: Apparent mental causation: Sources of the experience
of will. American Psychologist 54 (1999)
6. Ahn, W.K., Bailenson, J.: Causal attribution as a search for underlying mecha-
nisms. Cognitive Psychology 31, 82–123 (1996)
7. Pearl, J.: Probabilistic Reasoning in Intelligent Systems - Networks of Plausible
Inference. Morgan Kaufmann Publishers, Inc., San Francisco (1988)
8. Gopnik, A., Schulz, L.: Mechanisms of theory formation in young children. Trends
in Cognitive Science 8, 371–377 (2004)
9. Tenenbaum, J., Griffiths, T., Niyogi, S.: Intuitive Theories as Grammars for Causal
Inference. In: Gopnik, A., Schulz, L. (eds.) Causal Learning: Psychology, Philoso-
phy, and Computation. Oxford University Press, Oxford (in press)
10. Laskey, K.B.: MEBN: A Logic for Open-World Probabilistic Reasoning. Technical
Report C4I-06-01, George Mason University Department of Systems Engineering
and Operations Research (2006)
11. Kim, J.: Supervenience and Mind. Cambridge University Press, Cambridge (1993)
12. Jonker, C., Treur, J., Wijngaards, W.: Temporal modelling of intentional dynamics.
In: ICCS, pp. 344–349 (2001)
13. Marzollo, J., Wick, W.: I Spy. Scholastic, New York (1992)
An Agent Model for Computational Analysis of
Mirroring Dysfunctioning in Autism Spectrum Disorders
Abstract. Persons with an Autism Spectrum Disorder (ASD) may show certain
types of deviations in social functioning. Since the discovery of mirror neuron
systems and their role in social functioning, it has been suggested that ASD-
related behaviours may be caused by certain types of malfunctioning of mirror
neuron systems. This paper presents an approach to explore such possible
relationships more systematically. As a basis it takes an agent model
incorporating a mirror neuron system. This is used for a functional computational
analysis of the different types of malfunctioning of a mirror neuron system that
can be distinguished, and of the types of deviations in social functioning to which
these types of malfunctioning can be related.
1 Introduction
This causal chain is extended to a recursive loop by assuming that the preparation for
the response is also affected by the level of feeling of the emotion associated with the
expected outcome of the response:
sensory representation of body state → preparation for response
Thus the obtained agent model is based on reciprocal causation relations between
emotions felt and preparations for responses. Within the agent model presented here,
states are assigned a quantitative (activation) level or gradation. The positive feedback
loops between preparation states for responses and their associated body states, and
the sensory representations of expected outcomes are triggered by a sensory
representation of a stimulus and converge to a certain level of feeling and preparation.
Apparently, activation of preparation neurons by itself has no unambiguous
meaning; it is strongly context-dependent. Suitable forms of context can be
represented at the neurological level based on what are called supermirror neurons
[14, pp. 196-203], [18]; see also [5]. These are neurons which were found to have a
function in control (allowing or suppressing) action execution after preparation has
taken place. In single cell recording experiments with epileptic patients [18], cells
were found that are active when the person prepares an own action to be executed, but
shut down when the action is only observed, suggesting that these cells may be
involved in the distinction between a preparation state to be used for execution, and a
preparation state generated to interpret an observed action. In [14, pp. 201-202] it is
also described that as part of this context representation, certain cells are sensitive to a
specific person, so that in the case of an observed action, this action can be attributed
to the person that was observed. Within the agent model presented in this section, the
functions of super mirror neurons have been incorporated as focus states, generated
by processing of available (sensory) context information. For the case modeled, this
focus can refer to the person her or himself, or to an observed person.
To formalise the agent model in an executable manner, the hybrid dynamic
modeling language LEADSTO has been used; cf. [4]. Within LEADSTO, the dynamic
property or temporal relation a →→D b denotes that when a state property a occurs, then
after a certain time delay D (which for each relation instance can be specified as any
positive real number), state property b will occur. Below, this D will be taken as the
positive real number D), state property b will occur. Below, this D will be taken as the
time step Δt, and usually not be mentioned explicitly. Both logical and quantitative
calculations can be specified, and a software environment is available to support
specification and simulation. The modeled agent receives input from the external
world, for example, another agent is sensed (see also Fig. 1). Not all signals from the
external world come in with the same level, modelled by having a sensor state of
certain strength. The sensor states, in their turn, will lead to sensory representations.
Sensory representations lead to a state called supermirroring state and to a specific
motor preparation state. The supermirroring state provides a focus state for regulation
and control, it also supports self-other distinction. In the scenario used as illustration,
it is decisive in whether a prepared action is actually executed by the observing agent,
or a communication to the observed agent is performed reflecting that it is understood
what the other agent is feeling. Note that the internal process modelled is not a linear
chain of events, but cyclic: the preparation state of the agent is updated constantly in a
cyclic process involving both a body loop and an internal as-if body loop (via the
connections labeled with w6 and w7). All updates of states take place in parallel.
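To illustrate the kind of dynamics described above, the following Python sketch runs a minimal discrete-time version of the reciprocal loop; the wiring and the smoothed update rule are simplifying assumptions and do not reproduce the LEADSTO specification or the exact role of each connection w1–w10. Under this assumed wiring, setting w3 = w7 = 0 keeps the preparation level at 0, in line with the impaired-mirroring case discussed below.

# Minimal discrete-time sketch of the reciprocal preparation/feeling loop
# (assumed wiring: the sensed stimulus drives preparation via w3, the felt body
# state reinforces it via w7; the felt body state arises via the body loop
# (w5, w10, w8) and an as-if body loop (w6)).

def simulate(w3, w5, w6, w7, w8=1.0, w10=1.0, stimulus=1.0, steps=50, dt=0.2):
    prep = body = sensor_b = srs_b = 0.0
    srs_s = stimulus                        # sensory representation of the stimulus
    for _ in range(steps):
        prep += dt * (min(1.0, w3 * srs_s + w7 * srs_b) - prep)
        body += dt * (w5 * prep - body)                      # body loop, part 1
        sensor_b += dt * (w10 * body - sensor_b)             # body loop, part 2
        srs_b += dt * (min(1.0, w8 * sensor_b + w6 * prep) - srs_b)  # + as-if loop
    return prep, srs_b

print("normal mirroring:   prep=%.2f feeling=%.2f" % simulate(0.8, 0.8, 0.5, 0.5))
print("impaired (w3=w7=0): prep=%.2f feeling=%.2f" % simulate(0.0, 0.8, 0.5, 0.0))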
Capitals in the agent model are variables (universally quantified), lower case letters
specific instantiations. All strengths are represented by values between 0 and 1. A
capital V with or without subscripts indicates a real number between 0 and 1. The
variable S reflects that it is of the sort signal and B of the sort that concerns the
agent’s body state. What is outside the dotted lining is not a part of the internal
process of the agent. The first two sensor states (sensor_state(A,V) and sensor_state(S,V))
possibly come from a single source in the external world, but are not further
specified: their specific forms are not relevant for the processes captured in this agent
model. A more detailed description will follow below. For each of the dynamic
properties an informal and formal explanation is given.
Fig. 1. Overview of the agent model. Sensor states sensor_state(A,V), sensor_state(S,V) and sensor_state(B,V) lead to sensory representations srs(A,V), srs(S,V) and srs(B,V); these feed the supermirroring state supermirroring(A,S,V), the preparation state prep_state(B,V), the effector state effector_state(B,V), the communication preparation prep_comm(A,B,V) with the communication comm(I know how you feel, A, B), and the body state body_state(B,V). The connections are labelled w1–w10.
which the indicated state properties hold. Extreme cases occur when some of the
connection strengths are 0. Three such types of malfunctioning are discussed.
Impaired basic mirroring. For example, when both w3 and w7 are 0, from the agent
model it can easily be deduced that the preparation state will never have a nonzero
level. This indeed is confirmed in simulations (in this case the second graph in the
lower part of Fig. 2 is just a flat line at level 0). This illustrates a situation of full lack
of basic mirroring of an observed action or body state. Due to this type of
malfunctioning, neither imitation nor empathic attribution to another agent is possible.
Further systematic exploration was performed by changing one connection at a time,
from very low (0.01 and 0.001) and low (0.25) to medium (0.5) and high (0.75)
strengths. The connections w4, w5, w7 and w10 showed more substantial deviations
from the normal situation than the connections w1, w2, w3, w6, w8 and w9. As an
example, a reduced connection w5 (e.g., value 0.001) from preparation to body state
means that it takes longer to reach an increased value for the body state. This
corresponds to persons with low expressivity of prepared body states. However, when
the other connections are fully functional, empathy may still occur, and may even be
expressed verbally.
4 Discussion
understood; for this inspiration was taken from the agent model described in [14]. In
the case of autism spectrum disorder and the dysfunction of mirror neurons, there was
no general description of the process in the sense of formalised causal relations.
However, neurological evidence informally described what brain area would have an
effect on the performance of certain tasks, resulting in (impaired) behavior. Modeling
such causal relations as presented here does not take specific neurons into
consideration but more abstract states, involving, for example, groups of neurons. At
such an abstract level the proposed agent model summarizes the process in
accordance with literature.
The agent model makes it possible to distinguish three major types of malfunctioning,
corresponding to impaired mirroring, impaired self-other distinction and control
(supermirroring), and impaired emotion integration. Neurological evidence for
impaired mirroring in persons with ASD is reported, for example, in [9], [16], [21].
This type of analysis fits well to the first case of malfunctioning discussed in Section
4. In [16] the role of super mirror neurons is also discussed, but not in relation to
persons with ASD. In [5], [13] it is debated whether the social deviations seen in ASD
could be related more to impaired self-other distinction and control (impaired super
mirroring) than to the basic mirror neuron system; for example:
‘Recent research has focused on the integrity of the mirror system in autistic patients
and has related this to poor social abilities and deficits in imitative performance in
ASD [21]. To date this account is still being debated. In contrast to this hypothesis, we
would predict that autistic patients likely to have problems in the control of imitative
behaviour rather then in imitation per se. Recent evidence has revealed no deficit in
goal-directed imitation in autistic children, which speaks against a global failure in the
mirror neuron system in ASD [13]. It is, therefore, possible that the mirror system is
not deficient in ASD but that this system is not influenced by regions that distinguish
between the self and other agents.’ [4, p. 62]
The type of impaired mechanism suggested here fits well to the second case of
malfunctioning discussed in Section 4.
In [11], [12] it is also debated whether the basic mirror neuron system is the source
of the problem. Another explanation of ASD-related phenomena is suggested:
impaired emotion integration:
‘Three recent studies have shown, however, that, in high-functioning individuals with
autism, the system matching observed actions onto representations of one’s own action
is intact in the presence of persistent difficulties in higher-level processing of social
information (…). This raises doubts about the hypothesis that the motor contagion
phenomenon – “mirror” system – plays a crucial role in the development of
sociocognitive abilities. One possibility is that this mirror mechanism, while
functional, may be dissociated from socio-affective capabilities. (…) A dissociation
between these two mechanisms in autistic subjects seems plausible in the light of
studies reporting problems in information processing at the level of the STS and the
AMG (…) and problems in connectivity between these two regions.’ [9, pp. 73-74]
This mechanism may fit to the third case of malfunctioning discussed in Section 4.
The agent-model-based computational analysis approach presented explains how a
number of dysfunctioning connections cause certain impaired behaviors that are
referred to as typical symptoms in the autism spectrum disorder. The agent model
used, despite the fact that it was kept rather simple compared to the real life situation,
seems to give a formal confirmation that different hypotheses relating to ASD, such
as the ones put forward in [5], [11], [16] can be explained by different types of
malfunctioning of the mirror neuron system in a wider sense (including super
mirroring and emotion integration). An interesting question is whether the three types
of explanation should be considered as in competition or not. Given the broad
spectrum of phenomena brought under the label ASD, it might well be the case that
these hypotheses are not in competition, but describe persons with different variants of
characteristics. The computational analysis approach presented here provides a
framework to both unify and differentiate the different variants and their underlying
mechanisms and to further explore them. Further research will address computational
analysis of different hypotheses about ASD which were left out of consideration in
the current paper, for example, the role of enhanced sensory processing sensitivity in
ASD; e.g., [6]. Moreover, the possibilities to integrate this model in human-robot
interaction may be addressed in further research; see, e.g., [2].
References
1. Asperger, H.: Die ‘autistischen Psychopathen’ im Kindesalter. Arch. Psychiatr.
Nervenkr. 117, 76–136 (1944); reprinted as: ‘Autistic psychopathy’ in childhood, transl. by
Frith, U. In: Frith, U. (ed.) Autism and Asperger Syndrome. Cambridge University Press,
Cambridge (1991)
2. Barakova, E.I., Lourens, T.: Mirror neuron framework yields representations for robot
interaction. Neurocomputing 72, 895–900 (2009)
3. Bleuler, E.: Das autistische Denken. In: Jahrbuch für psychoanalytische und
psychopathologische Forschungen 4 (Leipzig and Vienna: Deuticke), pp. 1–39 (1912)
4. Bosse, T., Jonker, C.M., van der Meij, L., Treur, J.: A Language and Environment for
Analysis of Dynamics by Simulation. Intern. J. of AI Tools 16, 435–464 (2007)
5. Brass, M., Spengler, S.: The Inhibition of Imitative Behaviour and Attribution of Mental
States. In: [20], pp. 52–66 (2009)
6. Corden, B., Chilvers, R., Skuse, D.: Avoidance of emotionally arousing stimuli predicts
social-perceptual impairment in Asperger syndrome. Neuropsych. 46, 137–147 (2008)
7. Damasio, A.: The Feeling of What Happens. Body and Emotion in the Making of
Consciousness. Harcourt Brace, New York (1999)
8. Damasio, A.: Looking for Spinoza: Joy, Sorrow, and the Feeling Brain. Vintage books,
London (2003)
9. Dapretto, M., Davies, M.S., Pfeifer, J.H., Scott, A.A., Sigman, M., Bookheimer, S.Y.,
Iacoboni, M.: Understanding emotions in others: Mirror neuron dysfunction in children
with autism spectrum disorder. Nature Neuroscience 9, 28–30 (2006)
10. Frith, U.: Autism, Explaining the Enigma. Blackwell Publishing, Malden (2003)
11. Grèzes, J., de Gelder, B.: Social Perception: Understanding Other People’s Intentions and
Emotions through their Actions. In: [20], pp. 67–78 (2009)
12. Grèzes, J., Wicker, B., Berthoz, S., de Gelder, B.: A failure to grasp the affective meaning
of actions in autism spectrum disorder subjects. Neuropsychologia 47, 1816–1825 (2009)
13. Hamilton, A.F.C., Brindley, R.M., Frith, U.: Imitation and action understanding in autistic
spectrum disorders: How valid is the hypothesis of a deficit in the mirror neuron system?
Neuropsychologia 45, 1859–1868 (2007)
14. Hendriks, M., Treur, J.: Modeling super mirroring functionality in action execution,
imagination, mirroring, and imitation. In: Pan, J.-S., Chen, S.-M., Nguyen, N.T. (eds.)
ICCCI 2010. LNCS, vol. 6421, pp. 330–342. Springer, Heidelberg (2010)
15. Hesslow, G.: Conscious thought as simulation of behavior and perception. Trends Cogn.
Sci. 6, 242–247 (2002)
16. Iacoboni, M.: Mirroring People: the New Science of How We Connect with Others. Farrar,
Straus & Giroux (2008)
17. Kanner, L.: Autistic disturbances of affective contact. Nervous Child 2, 217–250 (1943)
18. Mukamel, R., Ekstrom, A.D., Kaplan, J., Iacoboni, M., Fried, I.: Mirror properties of
single cells in human medial frontal cortex. Soc. for Neuroscience (2007)
19. Richer, J., Coates, S. (eds.): Autism, The Search for Coherence. Jessica Kingsley
Publishers, London (2001)
20. Striano, T., Reid, V.: Social Cognition: Development, Neuroscience, and Autism. Wiley-
Blackwell (2009)
21. Williams, J.H., Whiten, A., Suddendorf, T., Perrett, D.I.: Imitation, mirror neurons and
autism. Neuroscience and Biobehavioral Reviews 25, 287–295 (2001)
Multi-modal Biometric Emotion Recognition
Using Classifier Ensembles
1 Introduction
Affective computing covers the area of computing that relates to, arises from, or
influences emotions [14]. Its application scope stretches from human-computer
interaction for the creative industries sector to social networking and ubiquitous
health care [13]. Real-time emotion recognition is expected to greatly advance
and change the landscape of affective computing [15]. Brain-Computer Inter-
face (BCI) is a rapidly expanding area, offering new, inexpensive, portable and
accurate technologies to neuroscience [21]. However, measuring and recognising
emotion as a brain pattern or detecting emotion from changes in physiological
and behavioural parameters is still a major challenge.
Emotion is believed to be initiated within the limbic system, which lies deep
inside the brain. Hardoon et al. [4] found that the brain patterns corresponding
to basic positive and negative emotions are complex and spatially scattered. This
suggests that in order to classify emotions accurately, the whole brain must be
analysed.
Functional Magnetic Resonance Imaging (fMRI) and Electro Encephalogra-
phy (EEG) have been the two most important driving technologies in modern
neuroscience. No individual technique for measuring brain activity is perfect.
fMRI has the spatial resolution needed for emotion recognition while EEG does
not. fMRI, however, offers little scope for a low-cost, real-time, portable emo-
tion classification system. In spite of the reservations, EEG has been applied for
classification of emotions [1, 5, 19, 20]. Bos [1] argues that the projections of pos-
itive and negative emotions in the left and right frontal lobes of the brain make
these two emotions distinguishable by EEG. He also warns that the granularity
of the information collected from these regions through EEG may be insufficient
for detecting more complex emotions. Different success rates of emotion recog-
nition through EEG have been reported in the recent literature ranging from
moderate [2] to excellent accuracy [10, 13]. The reason for the inconclusiveness
of the results can be explained with the different experimental set-ups, different
ways of eliciting and measuring emotion response, and the type and number of
distinct emotions being recognised.
Chanel et al. [2] note that, until recently, there has been a lack of studies on
combination of biometric modalities for recognising affective states. Some phys-
iological signals can be used since they come as spontaneous reactions to emo-
tions. Among other affective states, stress detection gives a perfect example of
the importance of additional biometric modalities. It is known that stress induces
physiological responses such as increased heart rate, rapid breathing, increased
sweating, cool skin, tense muscles, etc. This gives stress detection systems good
chances of success [9]. Considerable effort has been invested in designing low-
power and high-performance readout circuits for the acquisition of biopotential
signals such as EEG/EMG electrodes [16, 24, 25], skin conductance sensors [12],
temperature sensors and muscle tightness gauges. Finally, combination of EEG
and other biometric modalities has proved to be a successful route for affective
state recognition [1, 2, 10, 20].
Here we present a system for multi-modal biometric emotion recognition (AM-
BER) consisting of a single-electrode headset, an EDA sensor and a pulse reader.
These modalities were selected due to their low cost, commercial availability and
simple design. We evaluate state-of-the-art classifiers, including classifier ensem-
bles, on data collected from the system. The goal is to assess the ability of the
classification methodologies to recognise emotion from signals spanning several
seconds.
most basic form, human skin is used as an electrical resistor whose value changes
when a small quantity of sweat is secreted. Figure 1(b) depicts the circuit, and
1(c) shows the electronic breadboard used in AMBER.
To feed the signal into the system we used a low-cost analogue to digital
converter, FEZ Domino, shown in Figure 1(d). The FEZ Domino enables elec-
trical and digital data to be controlled using the .NET programming language.
The digital output was transmitted to a computer using a TTL Serial to USB
converter cable.
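As an illustration only, reading such a serial stream on the computer side can be done with a few lines of Python using pyserial; the port name, baud rate and one-value-per-line framing below are assumptions, since the actual framing used by the FEZ Domino firmware is not described here.

# Hedged sketch: reading the digitised sensor values arriving over the TTL
# serial-to-USB cable with pyserial. The port name, baud rate and "one integer
# per line" framing are illustrative assumptions.
import serial  # pip install pyserial

with serial.Serial("/dev/ttyUSB0", baudrate=9600, timeout=1.0) as port:
    samples = []
    while len(samples) < 300:           # collect a short segment of readings
        line = port.readline().strip()
        if line:
            samples.append(int(line))
    print("mean level:", sum(samples) / len(samples))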
3 Data
3.1 The Data Collection Experiment
The experiment involved presenting auditory stimuli to the subject in twenty
60-second runs. The stimuli were selected so as to provoke states of relaxation
(positive emotion) or irritation (negative emotion). The positive audio stimuli
were taken from an Apple iPhone application called Sleep Machine. The compo-
sition was a combination of wind, sea waves and sounds referred to as Reflection
(a mixture of slow violins tinkling bells and oboes); this combination was con-
sidered by the subject to be the most relaxing. The negative audio stimuli were
musical tracks taken from pop music, which the subject strongly disliked. The
three biometric signals were recorded for 60 seconds for each of the 20 runs: 10
with the positive stimuli and 10 with the negative stimuli.
Typical examples of a one-second run of the three signals are shown in Figure 2.
Fig. 2. Typical one-second examples of the three recorded signals; panel (c) shows the pulse signal (signal value against time)
Fig. 3. Frequency powers for the 8 bands (δ, θ, α1, α2, β1, β2, γ1, γ2) and the two classes (negative and positive)
The remaining two features for the sections were the mean EDA signal and
the mean pulse signal.
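For illustration, a feature vector of this kind (8 EEG band powers plus the mean EDA and mean pulse values) could be computed as in the Python sketch below; the band edges, sampling rate and the use of Welch's method are assumptions, as the paper does not spell out its exact spectral estimation.

# Hedged sketch: one way to build the 10-dimensional feature vector (8 EEG band
# powers + mean EDA + mean pulse). Band edges, sampling rate and Welch's method
# are illustrative assumptions.
import numpy as np
from scipy.signal import welch

BANDS = [("delta", 1, 4), ("theta", 4, 8), ("alpha1", 8, 10), ("alpha2", 10, 13),
         ("beta1", 13, 20), ("beta2", 20, 30), ("gamma1", 30, 40), ("gamma2", 40, 50)]

def feature_vector(eeg, eda, pulse, fs=128):
    freqs, psd = welch(eeg, fs=fs, nperseg=min(len(eeg), fs))
    powers = [psd[(freqs >= lo) & (freqs < hi)].sum() for _, lo, hi in BANDS]
    return np.array(powers + [np.mean(eda), np.mean(pulse)])

rng = np.random.default_rng(0)
x = feature_vector(rng.standard_normal(128), rng.uniform(150, 250, 300), rng.uniform(60, 90, 300))
print(x.shape)   # (10,) -> 8 band powers + mean EDA + mean pulse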
4 Classification Methods
The most widely used classification method in neuroscience analyses is the Sup-
port Vector Machine classifier (SVM) [3, 11, 18, 22]. Our previous research con-
firmed the usefulness of SVM but also highlighted the advantages of multiple
classifier systems (classifier ensembles) [6, 7, 8].
All experiments were run within Weka [23] with the default parameter settings.
The individual classifiers and the classifier ensemble methods chosen for this study
are shown in Table 1.³ Ten-fold cross-validation was used to estimate the
classification accuracy.
² The curves for the remaining 7 data sets were a close match.
³ We assume that the reader is familiar with the basic classifiers and ensemble methods.
Further details and references can be found within the Weka software environment at
http://www.cs.waikato.ac.nz/ml/weka/
Table 1. Classifiers and classifier ensembles used with the AMBER data
Single classifiers
1nn Nearest neighbour
DT Decision tree
RT Random tree
NB Naive Bayes
LOG Logistic classifier
MLP Multi-layer perceptron
SVM-L Support vector machines with linear kernel
SVM-R Support vector machines with radial basis function (RBF) kernel
Ensembles
BAG Bagging
RAF Random Forest
ADA AdaBoost.M1
LB LogitBoost
RS Random Subspace
ROF Rotation Forest
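The study itself used Weka; purely as an illustration of how such a comparison can be set up, the sketch below runs a few roughly corresponding scikit-learn classifiers and ensembles with ten-fold cross-validation on synthetic stand-in data (default parameters in scikit-learn differ from Weka's).

# Hedged sketch: a scikit-learn analogue of part of Table 1, run with ten-fold
# cross-validation on synthetic stand-in data (the paper used Weka defaults,
# which differ from scikit-learn's).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))            # stand-in for the 10 AMBER features
y = rng.integers(0, 2, size=200)              # positive / negative emotion labels

models = {
    "1nn": KNeighborsClassifier(n_neighbors=1),
    "SVM-L": SVC(kernel="linear"),
    "SVM-R": SVC(kernel="rbf"),
    "BAG": BaggingClassifier(),
    "RAF": RandomForestClassifier(),
    "ADA": AdaBoostClassifier(),
}
for name, model in models.items():
    accuracy = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: {100 * accuracy:.1f}%")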
5 Results
Table 2 shows the correct classification (in %) for all methods and data sets.
The highest accuracy for each data set is highlighted with a frame box, and the
second highest is underlined. All of the highest accuracies are achieved by the ensemble
methods. The individual classifiers reach only one of the second highest accuracies,
while the ensemble methods hold the remaining 7 second highest scores.
This result confirms the advantage of using the classifier ensembles compared
to using single classifiers, even the current favourite SVM. In fact, SVM-R was
outperformed by all classifiers and ensembles, and SVM-L managed to beat only
the logistic classifier. A series of pilot experiments revealed that none of the
modalities alone were as accurate as the combination.
Fig. 4. Colour matrix for the classification methods sorted by their average ranks.
Brown/red correspond to high accuracy; green/blue correspond to low accuracy.
6 Conclusion
This paper presents a case study of affective data classification coming from
three biometric modalities: EEG electrode, electrodermal sensor (EDA) and
pulse reader, embedded in a system called AMBER. The results indicate that
positive and negative emotional states evoked by audio stimuli can be detected
with good accuracy from a time segment spanning a few seconds. This work
serves as a first step in developing an inexpensive and accurate real-time emo-
tion recognition system. Improvements on the hardware and the preprocessing
of the signals are considered. We are currently working towards preparing an
experimental protocol and the supporting software for gathering data from AM-
BER on a large scale. The new protocol will be based on a combination of visual,
audio and computer-game type of stimuli.
Abstract. In this paper, we take the first steps towards developing a fully
automatic tool for supporting users in navigating the web. We developed a
prototype that takes a user goal and a website URL as input, predicts the
correct hyperlink to click on each web page starting from the home page, and
uses these predictions as support for users. We evaluated our system's usefulness
with data from real users. It was found that with system-generated support users
needed significantly less time and fewer clicks, were significantly less disoriented
and more accurate, and perceived the support positively. Projected extensions to
this system are discussed.
1 Introduction
This paper presents an approach towards the support of web navigation by means of a
computational model. Though several studies have established the usefulness of
providing support based on cognitive models to end-users, no such fully automatic
tools have been developed so far for web navigation. Several existing tools are either
used for evaluating hyperlink structure (Auto-CWW based on CoLiDeS) or for
predicting user navigation behavior on the web (Bloodhound based on WUFIS).
For example, Cognitive Walkthrough for the Web (Auto-CWW) [1] is an
analytical method (based on a cognitive model of web-navigation called CoLiDeS
[2]) to inspect usability of websites. It tries to account for the four steps of parsing,
elaborating, focusing and selecting of CoLiDeS. It also provides a publicly available
online interface called AutoCWW (http://autocww.colorado.edu/), which allows you
to run CWW online. One bottleneck is that the headings and the hyperlinks under each
heading on a page, the correct hyperlink corresponding to each goal, and various
parameters concerning LSA (a computational mechanism to compute similarity between
two texts, described later in detail), such as the semantic space, word frequencies and
the minimum cosine value used to come up
with the link and heading elaborations, all need to be entered manually. Auto-CWW then
generates a report identifying any potential usability problems such as unfamiliar
hyperlink text, competing/confusing hyperlinks. A designer can make use of this
report to make corrections in the website's hyperlinks.
Bloodhound developed by [3] predicts how typical users would navigate through a
website hierarchy given their goals. It combines both information retrieval and
spreading activation techniques to arrive at the probabilities associated with each
hyperlink that specify the proportion of users who would navigate through it.
Bloodhound takes a starting page, a few keywords that describe the user goal, and a
destination page as input. It outputs average task success based on the percentage of
simulated users who reach the destination page for each goal.
ScentTrails [4] brings together the strengths of both browsing and searching
behavior. It operates as a proxy between the user and the web server. A ScentTrails
user can input a list of search terms (keywords) into an input box at any point while
browsing. ScentTrails highlights hyperlinks on the current page that lead the user
towards his goal. It has been found that with ScentTrails running, users could finish
their tasks quicker than in a normal scenario without ScentTrails.
Both Auto-CWW and Bloodhound are tools for web designers and evaluators and
not for supporting end-users. ScentTrails, though designed to support end-users,
makes the underlying assumption that knowledge about the website structure
is known beforehand. The user can enter queries at any point during a browsing
session, and the ScentTrails system directs the user along paths that
lead to his or her desired target pages from the current page. Our proposed system
does not make this assumption. It takes as input from the user only the goal and a
website URL and nothing else. A fully automatic model of web-navigation has many
potential benefits for people working under cognitively challenging conditions.
Screen readers for visually impaired persons can be made more efficient with an
automated model that can read out only relevant information. Elderly people have one
or more of the following problems: they can forget their original goal, or forget the
outcome of the previous steps which can have an impact on their next action; they
may have low mental capacity to filter unnecessary information that is not relevant to
their goal; and their planning capabilities may be weak in complex scenarios [5]. For
these situations, an automated model can plan an optimized path for a goal, can
provide relevant information only and keep track of their progress towards the
completion of the task. Naive users who are very new to the internet generally do not
employ efficient strategies to navigate: they follow more of an exploratory navigation
style; they get lost and disoriented quite often; and they are slow and also inefficient
in finding their information. An automated model that can provide visual cues to such
users can help them learn the art of navigation faster. An experienced internet user
generally opens multiple applications on her or his machine and also multiple tabs in
a browser. She or he could be listening to songs, writing a report, replying by email to
a friend, chatting with friends on a chat-application and searching on internet for the
meaning of a complex word she or he is using in a report. Under these scenarios, she
or he would definitely appreciate an automated model that can reduce the time spent
on one of these tasks.
In previous research [1], it was shown that the CoLiDeS model could be used to
predict user navigation behavior and also to come up with the correct navigation path
for each goal. CoLiDeS can find its way towards the target page by picking the most
relevant hyperlink (based on semantic similarity between the user-goal and all the
hyperlink texts on a web-page) on each page. The basic idea of the CoLiDeS model
[2] [7] is that a web page is made up of many objects competing for user's attention.
Users are assumed to manage this complexity by an attention cycle and action-
selection cycle. In the attention cycle they first parse the web page into regions and
then focus on a region that is relevant to their goal. In the action-selection cycle each
of the parsed regions is comprehended and elaborated based on user's memory, and
here various links are compared in relevancy to the goal and finally the link that has
the highest information scent – that is, the highest semantic similarity between the
link and the user’s goal – is selected. For this, Latent Semantic Analysis technique is
used [6]. This process is then repeated for every page visited by users until they reach
the target page. This model can be used to come up with a tool in which we give the
successful path back to the user and this could help the user in reducing the efforts
spent in filtering unnecessary information. Thus, the tool we are developing is
designed to help users in cognitively challenging scenarios. In the next section, we
provide the details of our system.
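To make the link-selection step concrete, the sketch below scores a goal description against a handful of hyperlink texts in a small LSA-style space (TF-IDF followed by a truncated SVD) and picks the highest-scoring link; the toy corpus and the SVD dimensionality are illustrative assumptions and far smaller than the semantic spaces used by AutoCWW.

# Hedged sketch: choosing the hyperlink with the highest "information scent",
# i.e. the highest LSA-style cosine similarity to the user goal. The tiny corpus
# and the number of SVD components are toy assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

links = ["Heart and circulation", "Bones and muscles", "Brain and nerves", "Skin and hair"]
goal = "find out what causes heart diseases"

tfidf = TfidfVectorizer().fit_transform(links + [goal])
vectors = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

scores = cosine_similarity(vectors[-1:], vectors[:-1])[0]   # goal vs. each link
best = max(range(len(links)), key=lambda i: scores[i])
print("predicted click:", links[best])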
2 System Details
support will be found useful by the participants; and their navigation performance in
terms of time, number of clicks taken to finish the task, accuracy and overall
disorientation will significantly improve.
3.1 Method
Material. A mock-up website on the Human Body with 34 web pages spread across
four levels of depth was used. Eight user-goals (or tasks), two for each level, which
required the users to navigate, search and find the answer, were designed.
Design. We had two conditions: a control condition, where no support was provided;
and a support condition, where support was provided in the form of highlighted links.
These conditions are shown in Figures 2 and 3, respectively.
The automated system based on CoLiDeS described in the previous section was
run on the eight different information retrieval tasks. The results of the simulations
were paths predicted by the system. Based on these results, the model-predicted paths
for each goal were highlighted in green color (See Figure 3). It is important to
emphasize that this support is based on the automated model, which is based on
CoLiDeS: computation of semantic similarity between the goal description and
hyperlink text. We used a between-subjects design: half the participants received the
control condition and the other half the support condition. The dependent variables
were mean task-completion time, mean number of clicks, mean disorientation and
mean task accuracy. Disorientation was measured using Smith's lostness measure [8]:
$L = \sqrt{(N/S - 1)^2 + (R/N - 1)^2}$
where R = number of nodes required to finish the task successfully (thus, the number
of nodes on the optimal path), S = total number of nodes visited while searching, and
N = number of different nodes visited while searching. Task accuracy was measured
by scoring the answers given by the users. The correct answer from the correct page
was scored 1. A wrong answer from the correct page was scored 0.5. Wrong answers
from wrong pages and answers beyond the time limit were scored 0.
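A minimal sketch of these two measures is given below; the lostness function follows Smith's measure as stated above, and the accuracy function simply encodes the 1 / 0.5 / 0 scoring rule.

# Minimal sketch of the disorientation (lostness) measure and the accuracy
# scoring rule described above.
from math import sqrt

def lostness(R, S, N):
    """R: nodes on the optimal path, S: total nodes visited, N: distinct nodes visited."""
    return sqrt((N / S - 1) ** 2 + (R / N - 1) ** 2)

def accuracy(correct_page, correct_answer, within_time=True):
    if not within_time or not correct_page:
        return 0.0
    return 1.0 if correct_answer else 0.5

print(lostness(R=4, S=4, N=4))    # perfect navigation along the optimal path -> 0.0
print(lostness(R=4, S=10, N=7))   # revisits and detours -> higher disorientation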
Procedure. The participants were given eight information retrieval tasks in random
order. They were first presented with the task description on the screen and then the
website was presented in a browser. The task description was always present in the
top-left corner, in case the participant wished to read it again. In the control condition,
participants had to read the task, browse the website, search for the answer, type the
answer in the space provided and move to the next task. In the support condition, the
task was the same except that they got a message reminding them that they were
moving away from the model-predicted path whenever they did not choose the model-
predicted hyperlink, but they remained free to choose their own path.
3.2 Results
Mean number of clicks. An independent samples t-test between the control and support
conditions with mean number of clicks as the dependent variable was performed. There
was a significant difference in the number of clicks between the control and support
conditions, t(22)=5.47, p<.001. Figure 5 shows the means plot. Participants needed
significantly fewer clicks to reach their target page in the support condition
than in the control condition.
Mean task accuracy. Next, an independent samples t-test was performed between the
control and support conditions with mean task accuracy as the dependent variable. The
difference was found to be highly significant, t(22)=-3.112, p<.01. Participants were
more accurate in the support condition than in the control condition. Support
helped them not only to reach their target pages faster but also to answer the question
more accurately. See Figure 7 for mean task accuracies in both conditions.
(M=3.3, SD=2.3), “Learning to use the site was easy” (M=5.6, SD=1.9), “Support
helped me reach my target page quicker” (M=6.3, SD=1.4) and “Support made it very
easy to navigate” (M=5.6, SD=2.3). These results indicate that participants had in
general a (very) positive opinion about the support mechanism.
We took the first steps towards filling a gap in the domain of cognitive models of
web-navigation: lack of a fully automated support system that aids the user in
navigating on a website. This is important as it illustrates the practical relevance of
developing these cognitive models of web-navigation. We provided the details of our
prototype. We evaluated our system with an experiment and found a very positive
influence of model-generated support on navigation performance of the user. The
participants were significantly faster, needed significantly fewer clicks, and
were significantly less disoriented and more accurate when support was provided than
in the no-support control condition. All in all, the main conclusion is that the support
seems to have had a significant positive influence on performance.
However, our system is preliminary and we developed this to illustrate the positive
effects of such a system. Several improvements can be made to the existing system,
some of which will be discussed next. The current system uses only semantic
similarity between the user-goal description and the hyperlink text to predict the
correct hyperlink. Several other parameters could be incorporated, such as the background
knowledge of the user in relation to the hyperlink text, the frequency with which the user
has used the hyperlink text, and whether there is a literal match between the user
goal and the hyperlink text (such as when the user is looking for ‘heart diseases’ and the
hyperlink also says ‘Heart Diseases’) [9]. The path adequacy concept and
backtracking rules of CoLiDeS+ can be taken into account [7]. Semantic information
from pictures, if included, could make the predictions even more accurate as shown in
[10]. For simplicity, and also to establish a proof of concept, we ran our model only
on our mock-up website. Real websites provide more programming challenges like
handling image links, handling the terms that are not present in the semantic space
etc. Though we introduced the paper with cognitively challenging scenarios, in our
behavioral study, we used a conventional scenario. It would be interesting to replicate
the study under such scenarios of heavy cognitive load and look at the results again.
We expect an even larger improvement in performance under such
conditions. Also, the perceived usefulness of the support system under such
conditions would be very high. Which form of support is the best? Visually
highlighting correct hyperlinks as we did in this study? Or re-ordering the hyperlinks
in the order of their similarity measure with respect to the user-goal or using a bigger
font? The current version of our system does not give feedback if the user is already
on the target page. The user is left pondering over questions such as: Is this the target
page? Should I navigate further? Other questions are: How intrusive should the
support be? Should it take over the user-control over the navigation completely?
When does it get annoying? These are some research questions open for investigation.
We are already working on incorporating additional parameters when
computing information scent and plan to answer these questions in our future studies.
References
1. Blackmon, M.H., Mandalia, D.R., Polson, P.G., Kitajima, M.: Automating Usability
Evaluation: Cognitive Walkthrough for the Web Puts LSA to Work on Real-World HCI
Design Problems. In: Landauer, T.K., McNamara, D.S., Dennis, S., Kintsch, W. (eds.)
Handbook of Latent Semantic Analysis, pp. 345–375. Lawrence Erlbaum Associates,
Mahwah (2007)
2. Kitajima, M., Blackmon, M.H., Polson, P.G.: A comprehension-based model of Web
navigation and its application to Web usability analysis. In: Proceedings of CHI 2000, pp.
357–373 (2000)
3. Chi, E.H., Rosien, A., Supattanasiri, G., Williams, A., Royer, C., Chow, C., Robles, E.,
Dalal, B., Chen, J., Cousins, S.: The Bloodhound Project: Automating discovery of web
usability issues using the InfoScent Simulator. In: Proceedings of CHI 2003, pp. 505–512.
ACM Press, New York (2003)
4. Olston, C., Chi, E.H.: ScentTrails: Integrating browsing and searching on the web. ACM
Transactions on Computer Human Interaction 10(3), 1–21 (2003)
5. Kitajima, M., Kumada, T., Akamatsu, M., Ogi, H., Yamazaki, H.: Effect of Cognitive
Ability Deficits on Elderly Passengers’ Mobility at Railway Stations - Focusing on
Attention, Working Memory, and Planning. In: The 5th International Conference of The
International Society for Gerontechnology (2005)
6. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis.
Discourse Processes 25, 259–284 (1998)
7. Juvina, I., van Oostendorp, H.: Modeling semantic and structural knowledge in Web-
navigation. Discourse Processes 45(4-5), 346–364 (2008)
Abstract. In this paper we propose a method for detecting and localizing
indoor stairways. This is a fundamental step towards autonomous stair-climbing
navigation and passive alarm systems intended for the blind and visually
impaired. Both kinds of systems must be able to recognize parameters that
describe stairways in unknown environments. The method analyzes the edges of a
stairway based on planar motion tracking and directional filters. We extract the
horizontal edges of the stairs using a Gabor filter. From the resulting set of
horizontal line segments, we extract a hypothetical set of targets using a
correlation method. Finally, we use a discrimination method to find the ground
plane, based on the behaviour of the distance measurements; the remaining
information is considered an indoor stairway candidate region. Tests
demonstrate the effectiveness of the approach.
1 Introduction
2 Algorithm Description
Our proposed algorithm for indoor stairs segmentation has six steps which are: (1)
extracting frames, (2) finding the maximum distance of the ground plane (MDGP),
(3) extracting the line segments, (4) extracting the region of interest (ROI), (5) finding
correspondences, and (6) ground discrimination. From the last step we will define the
stair candidate.
Using our algorithm, we extracted frames over a short time interval that started just
before the earliest detection in the video capture sequence. Because the frames are
collected within a short interval, there is a temporally close relation between them.
Consequently, both spatial and temporal information is extracted from the video image
sequence in order to capture the dynamics of the scene. Our method is based on the
assumption that stairways are located below the true horizon. In the coordinate system
of the single camera, axis Z is aligned with the optical axis of the camera, and axes X
and Y are aligned with axes x and y of the image plane. In Fig. 1, P is a point in the
world at coordinates (X,Y,Z); the projection of this point onto the image plane is
denoted R(xc,yc).
The proposed method consists mainly of extracting the information surrounding the
ground plane projected onto the image plane. The MDGP is the maximum distance
that the proposed method is able to compute from the camera position to the horizon
line on the image plane by using the Pythagorean trigonometric identities. Given
an image, our goal is to compute the MDGP according to the following set of
equations (1, 2, 3):
(1)
(2)
(3)
where d is the distance estimation between the camera and the target object in the
image, h is the height of the camera above the ground, δ is the angle between the
optical axis of the camera and the horizontal projection, α is the angle formed by
scanning the image from top to bottom with respect to the center point of the image
(R(xc,yc) in Fig. 1), y is the pixel position on the y axis, yc is the pixel located in the
center of the image, f is the focal length in pixels and mm, Img is the image width,
and CCD is the sensor width.
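A hedged sketch of one standard way to compute such a ground-plane distance from the variables just listed is shown below; it assumes ordinary pinhole geometry and illustrative parameter values, and is not necessarily identical to the authors' equations (1)–(3).

# Hedged sketch: ground-plane distance from a single camera of known height and
# tilt, using standard pinhole geometry consistent with the variables above.
# This is an assumption for illustration, not necessarily the authors' exact
# equations (1)-(3).
from math import atan, tan, radians, inf

def ground_distance(y, yc, f_mm, img_width_px, ccd_width_mm, h, delta_deg):
    f_px = f_mm * img_width_px / ccd_width_mm   # focal length converted to pixels
    alpha = atan((y - yc) / f_px)               # angle of image row y w.r.t. the optical axis
    delta = radians(delta_deg)                  # camera tilt below the horizontal
    if delta + alpha <= 0:                      # ray at or above the horizon:
        return inf                              # the discontinuity discussed in the text
    return h / tan(delta + alpha)               # distance along the ground plane

for row in (40, 80, 120, 200):                  # scanning the image from top to bottom
    d = ground_distance(row, yc=120, f_mm=4.0, img_width_px=320,
                        ccd_width_mm=3.6, h=1.2, delta_deg=9.3)
    print(row, round(d, 2) if d != inf else "ambiguous")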
After scanning the image from top to bottom, the algorithm was able to find a point of
discontinuity (see Fig. 2). This discontinuity causes an ambiguity in the distance value
when |−α| is greater than or equal to δ. This point is closely related to the MDGP. By
finding the pixel at which the discontinuity happened, the proposed algorithm
determined where the ground plane candidate area was.
This step consisted of extracting the line segments of the stairs by applying the Gabor
filter [7, 8, 9]. Gabor filters are among the most widely used texture feature extractors. The
Gabor filter is made of directional wavelet-type filters, or masks, and consists of a
sinusoidal plane wave of a particular frequency and orientation, modulated by a
Gaussian envelope. Using different values for the wavelength, the variance of the Gaussian
Fig. 2. Finding the maximum distances of the ground plane. The graph above shows the pixel
value where the discontinuity occurs. For δ = 9.3, the discontinuity occurs between the pixel
value 72 and 73 on the y-axis. Consequently the image shows the ambiguity in the distance
value from the 72 pixel value.
and the orientation, we experimentally evaluated a set of Gabor filters in order to get the
best response. Eq. (4) shows the real part of the Gabor filter:
$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi \frac{x'}{\lambda} + \psi\right), \quad x' = x\cos\theta + y\sin\theta, \; y' = -x\sin\theta + y\cos\theta$   (4)
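As an illustration, a single Gabor filter tuned to near-horizontal structures can be applied with OpenCV as sketched below; the kernel size, wavelength and sigma are illustrative values (the paper selected its filter bank experimentally), and the input frame is synthetic.

# Hedged sketch: one Gabor filter tuned to respond to (near-)horizontal stair
# edges. Kernel size, wavelength and sigma are illustrative; the paper tuned its
# own set of filters experimentally.
import cv2
import numpy as np

# Synthetic 240x320 frame with bright horizontal lines standing in for stair edges.
frame = np.zeros((240, 320), dtype=np.uint8)
frame[60::30, :] = 255

# theta is the orientation of the normal to the filter's stripes; pi/2 makes the
# sinusoid vary vertically, so horizontal structures give the strongest response.
kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=np.pi / 2,
                            lambd=10.0, gamma=0.5, psi=0)
response = cv2.filter2D(frame, ddepth=cv2.CV_32F, kernel=kernel)

# Binarise the (normalised) magnitude of the response to obtain candidate
# horizontal line segments.
magnitude = cv2.normalize(np.abs(response), None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
_, edges = cv2.threshold(magnitude, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(edges.shape, int(edges.max()))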
a line using the estimated angle of orientation and to generate the data set of pixel
target positions at time t (PTP). As a result, the proposed algorithm was able to extract
the set of line segments from a sequence of images.
In this section we propose an automatic target detection system. This process consists
of two steps. First, using the data set of points in the PTP, a target area that includes
the pixel target position, called the region of interest (ROI), is extracted from the
current image frame (from now on, time t). Because frames are collected in a short time
interval, there is a temporally close relation between frames. For the next image frame
(from now on, time t+1), the possible target area in which the pixel target position must
appear is extracted. As mentioned above, we consider the data set of pixel target
positions as our target at time t (ROI1), as well as at time t+1 (ROI2). Each ROI1
is confined to a small block of 20x10 pixels, and each ROI2 is confined to a block of
40x20 pixels.
Fig. 3. Region of interest at frame t and frame t+1 stages. (a) Grayscale input image. (b) Result
of binarization process from Gabor filter. (c) Extracting the longest line segments. (d),(e)
Extracting the area of interest for ROI1. The white block represents the ROI1, using a 20x10
pixel size for each block extracted at time t; the numbers show the order of extraction. The blue
block represents the ROI2, using a 40x20 pixel size for each block extracted at time t+1; the
numbers show the order of extraction.
$r_s = \frac{\sum_i (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_i (x_i - m_x)^2}\,\sqrt{\sum_i (y_i - m_y)^2}}$   (7)
where rs is the cross correlation coefficient, xi is the intensity of the i-th pixel in
image ROI1, yi is the intensity of the i-th pixel in image ROI2, mx and my are the
means of intensity of the corresponding images. For each ROI1 and its respective
ROI2 we generated a matrix, which included all the correlation coefficients computed
for every displacement of ROI1 within the ROI2 area. In this matrix, the maximum
value of the correlation coefficient represents the position of the best match
between ROI1 and ROI2. By finding the correspondences between them,
the new pixel target position for the frame at time t+1 could be extracted.
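This search over displacements is equivalent to normalised cross-correlation template matching; a small sketch using OpenCV's matchTemplate is given below, with synthetic image content and block sizes matching the 20x10 and 40x20 choice described above.

# Hedged sketch: sliding ROI1 over ROI2 and locating the displacement with the
# highest normalised cross-correlation (the coefficient matrix of Eq. (7)).
# The image content is synthetic; block sizes follow the 20x10 / 40x20 choice above.
import cv2
import numpy as np

rng = np.random.default_rng(0)
roi2 = rng.integers(0, 256, size=(20, 40), dtype=np.uint8)   # 40x20 search area at time t+1
roi1 = roi2[5:15, 10:30].copy()                              # 20x10 block "seen" at time t

scores = cv2.matchTemplate(roi2, roi1, cv2.TM_CCOEFF_NORMED)  # matrix of correlation values
_, best_score, _, best_xy = cv2.minMaxLoc(scores)
print("best match at (x, y) =", best_xy, "score =", round(best_score, 3))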
From the set of line segments, the algorithm removes those candidate segments that
can represent ground plane information based on the concept that stairways can be
described as objects with a set of line segments at a certain height level. After
finding the correspondences, the algorithm extracted a set of pixel target points for
two different frames. Using this information, those line segments that showed a
distance discrepancy between the frame at time t+1 and the frame at time t according
to the camera displacement are removed. In other words, the displacement of the
camera is highly related with the displacement of the target in an image sequence. The
difference between the frame at t+1 and the frame at time t has to be almost the same
as the distance of the camera displacement. Then, the distance estimation for each
pixel target point in exposure time is calculated. This step is done according to eq.
(1, 2, 3). Table 1 shows the estimation result.
By computing the mean for the data set of absolute errors, we were able to remove
those sets of lines in the frame at time t+1 if the mean value was more than 10% of
Table 1. Probability first loop results in 4 different data set images with different properties
Note: No. Areas is the number of stairway candidates in the image plane, PA is the probability
per area, PL is the probability value per area, PP is the probability of the pixel
numbers per area, and PC (in bold) is the best stairway candidate in the image.
Fig. 4. Horizontal and vertical position histogram. (a) By scanning the image using the
horizontal histogram position we identify the candidate stairways areas. In this image there are
two stairways candidates which were detected by the process. (b) Vertical histogram position
result for area 1. (c) Vertical histogram position result for area 2. The number of lines and
pixels in each candidate area are extracted by applying the vertical histogram position.
the displacement of the camera between frames. This process was done with the
assumption that a value of 10% of the displacement of the camera between frames can
be a target point with a certain height level above the ground plane. After verifying
the value and determining the removal of those line segments in the frame at time t+1,
we scanned the image by using a horizontal and vertical histogram position in order to
determine the distribution of the set of line segments (see Fig 4). This step is done
through the analysis of the data that was extracted specifically for this process, such
as the number of density areas, as well as the number of lines and pixels per each
density area. Fig. 5 shows these steps according to eq. (8)
$P_A = \frac{N_A}{N}$   (8)
Fig. 5. Stairway detection process result of the first two image sequences used in the first loop.
(a) Stairway candidate area at time t after extracting the ROI1 and ROI2. (b) Stairway
candidate area at time t+1. Segmented result after determining the removal of those line
segments for which the distance difference is less than or equal to 10% of the camera
displacement in the frame at time t+1.
where NA is the number of areas, line segments, and pixels in A, and N is the total
number of areas, line segments, and pixels in the image. The process mentioned so far
was recursively repeated until we extracted, from the set of targets, relevant
information for describing stairways (see Table 1). The process stops at the exact
moment when the mean value of the distance discrepancy between targets is not the
same as the camera displacement.
The purpose of this step is to extract the candidate stairway region from the image.
Once this point is obtained, the proposed algorithm is able to describe the area in which
the stairway candidate is confined, using the maximum and minimum values of the
last frame that contains height information. Fig. 5 shows the candidate stairway.
From the detected area, we extract the distance and the angle between the
coordinate system of the camera and the candidate region. Table 2 shows the
distance and angle estimation results with respect to the camera coordinate system.
This stairway detection process is relevant for the implementation of autonomous
stair climbing navigation and passive alarm systems intended for blind and visually
impaired people. Consequently, the proposed algorithm can determine the
localization of stairways from a sequence of images obtained by motion stereo.
Table 2. First loop results in 4 different data set images with different properties
Note that rs is the cross correlation coefficient, de is the distance estimation, and dd is the
distance difference.
3 Experiment Result
In this section, we show the final results of the experiment. All the experiments
were run on a Pentium(R) Dual-Core E5200 CPU at 2.5 GHz with 2 GB RAM in an
Emacs environment. We used 320x240 images. Tables 1 and 2 show the results
for the first loops, and Table 3 shows the computational time required per loop iteration.
Figure 5 shows the results for the different groups of data sets. The depicted results
come from the distance measurement used to estimate the ground plane, in
combination with a Gabor filter used to find the high density of line segments
in the candidate area. The experiment demonstrated a system that can detect the
localization of stairways by combining edge information with planar motion tracking.
Table 3. Stairway candidate localization for the first loop results in 4 different data set images
with different properties.
Note: these values are given with respect to the coordinate system of the camera. Real dist.
is the real distance in meters, Est. dist. is the distance estimation, Relat. error is the relative
error, Est. Angle is the angle estimation, and Time is the computational time.
4 Conclusion
In this paper we presented a method for stair detection and localization without any
a priori information about the position of the stairways in the image. The approach
extracts stairway features by analyzing the edges of the stairways with respect to the
horizon projected in 2D space. Before the edge analysis is applied, the presented
approach assesses the maximum distance of the ground plane; in this way the authors
presented a method to identify the projected horizon in 2D space. This process is
important for reducing the computational time when developing a real-time
application. By using the Gabor filter, the algorithm can extract the line segments of
the stairways with robustness to noise such as varying illumination conditions (see
Fig. 3(a) and 3(b)), where most of the line segments are located above the ground
plane. The contribution of this paper is that the proposed algorithm provides an
estimate of information about the stairways, such as their localization with respect to
the coordinate system of the camera (see Table 3). This information is important and
necessary in order to localize stairways in unknown environments. It is also a
fundamental step for the implementation of autonomous stair climbing navigation and
passive alarm systems intended for blind and visually impaired people. As future
work, we will improve the performance by adding information about the camera
rotation in the 3D world (such information is important because nonplanar surfaces
affect the result), and we will also improve the accuracy of the distance estimation.
References
1. Hernandez, D.C., Jo, K.-H.: Outdoor Stairway Segmentation Using Vertical Vanishing
Point and Directional Filter. In: The 5th International Forum on Strategic Technology
(2010)
2. Aparício Fernandez, J.C., Campos Neves, J.A.B.: Angle Invariance for Distance
Measurements Using a Single Camera. In: 2006 IEEE International Symposium on
Industrial Electronics (2006)
3. Cong, Y., Li, X., Liu, J., Tang, Y.: A Stairway Detection Algorithm based on Vision for
UGV Stair Climbing. In: 2008 IEEE International Conference on Networking, Sensing and
Control (2008)
4. Lu, X., Manduchi, R.: Detection and Localization of Curbs and Stairways Using Stereo
Vision. In: International Conference on Robots and Automation (2005)
5. Gutmann, J.-S., Fukuchi, M., Fujita, M.: Stair Climbing for Humanoid Robots Using
Stereo Vision. In: International Conference on Intelligent Robots and System (2004)
6. Se, S., Brady, M.: Vision-based detection of stair-cases. In: Fourth Asian Conference on
Computer Vision ACCV 2000, vol. 1, pp. 535–540 (2000)
7. Ferraz, J., Ventura, R.: Robust Autonomous Stair Climbing by a Tracked Robot Using
Accelerometer Sensors. In: Proceedings of the Twelfth International Conference on
Climbing and Walking Robots and the Support Technologies for Mobile Machines (2009)
8. Barnard, S.T.: Interpreting perspective images. Artificial Intelligence 21(4), 435–462
(1983)
9. Weldon, T.P., Higgins, W.E., Dunn, D.F.: Efficient Gabor Filter Design for Texture
Segmentation. Pattern Recognition 29, 2005–2015 (1996)
10. Lee, T.S.: Image Representation Using 2D Gabor Wavelets. IEEE Transactions on Pattern
Analysis and Machine Intelligence 18(10) (1996)
11. Basca, C.A., Brad, R., Blaga, L.: Texture Segmentation. Gabor Filter Bank Optimization
Using Genetic Algorithms. In: The International Conference on Computer as a Tool (2007)
12. Deb, K., Vavilin, A., Kim, J.-W., Jo, K.-H.: Vehicle License Plate Tilt Correction Based
on the Straight Line Fitting Method and Minimizing Variance of Coordinates of Projection
Points. International Journal of Control, Automation, and Systems (2010)
Robot with Two Ears Listens to
More than Two Simultaneous Utterances
by Exploiting Harmonic Structures
1 Introduction
Since people have increasing opportunities to see and interact with humanoid
robots, e.g., the Honda ASIMO [1], Kawada HRP [2], and Kokoro Actroid [3],
enhanced verbal communication is critical in attaining symbiosis between hu-
mans and humanoid robots in daily life. Verbal communication is the easiest
and most effective way of human-humanoid interaction such as when we ask a
robot to do housework or when a robot gives us information.
Robots’ capabilities are quite unbalanced in verbal communication. Robots
can speak very fluently and sometimes in an emotional way, but they cannot
hear well. This is partially due to many interfering sounds, e.g., other people
interrupting or speaking at the same time, air-conditioning, the robot’s own
cooling fans, and the robot’s movements. Even if robots have a higher level of
intelligence, they cannot respond properly because they cannot correctly hear
and recognize what a human is saying.
Robots working with humans often encounter an under-determined situation,
i.e. there are more sound sources than microphones. One good way to deal with
many sounds is to use sound separation methods for preprocessing. However,
The problem settings for speech separation used in this paper are as follows.
N speech signals s are captured by M microphones as observed mixtures x; the speech separation module produces N separated sounds ŝ, and the speech recognition module turns these into N recognition results.
When all elements are real numbers, this is a Linear Programming (LP) problem,
and we can show that sopt, which maximizes Eq. (3), has at most M non-zero
elements. When we define K = {k1, k2, ..., kM} as the set of indices of these
non-zero sources, we can also prove that the optimum separation result sopt is
represented as [ŝ1, ŝ2, ..., ŝN]^T using the following equations:
$\hat{s}_K = H_K^{-1} x$,   (4)
$\hat{s}_i = 0 \quad \forall i \notin K$,   (5)
in which ŝK = [ŝk1, ŝk2, ..., ŝkM]^T and HK = [hk1, hk2, ..., hkM].
Using this knowledge, a maximum a posteriori problem can be written as the
following combinatorial optimization problem, i.e., the problem of estimating K:
$K_{opt} = \arg\min_{K} \sum_{i=1}^{M} |\hat{s}_{k_i}|$   (6)
in which
$\hat{s}_K = H_K^{-1} x = [h_{k_1}, h_{k_2}, ..., h_{k_M}]^{-1} x$   (7)
$K = \{k_1, k_2, ..., k_M\} \quad (1 \le k_1 < ... < k_M \le N)$   (8)
than a strict solution and that the solution is similar to a strict one even when
elements are complex numbers. Thus, in this paper, we use the above combina-
torial optimization to obtain the solution.
We will now discuss the problem that occurs when we use this method in a
simultaneous speech recognition system. As we stated in subsection 2.2, we need
a separation method that has the ability to handle a large number of talkers and
reduces distortion in acoustic features. Since this method can handle at most M
dominant sources in each time-frequency region, this method satisfies the first
requirement. However, it does not satisfy the second requirement because the
accuracy of the dominant source estimation is about 50 to 60%, and this poor
estimation accuracy causes a lack of spectra and noise leakage, which greatly
distort the acoustic features. To improve the system’s ASR results, it is necessary
to improve the accuracy of the dominant source estimation.
Overview of the proposed method: the observed sounds x are first separated by the conventional L1-norm separation; from the tentatively separated sounds, the fundamental frequencies and harmonic structures are estimated, and a second separation that exploits the harmonic structure produces the separated sounds ŝ.
$K_{opt} = \arg\min_{K} \sum_{i=1}^{M} |\hat{s}_{k_i}|$   (9)
in which
$\hat{s}_K = H_K^{-1} x = [h_{k_1}, h_{k_2}, ..., h_{k_M}]^{-1} x$   (10)
$K = \{k_1, k_2, ..., k_M\} \quad (1 \le k_1 < ... < k_M \le N)$   (11)
$P \subseteq K \quad (|P| \le M)$   (12)
$P \supset K \quad (|P| > M)$   (13)
The difference between this combinatorial optimization problem and the one
discussed in subsection 2.3 is the existence of Eqs. (12) and (13). When we
take into consideration a time-frequency region that does not have harmonic
structure (P = ∅), these constraints do not take effect; thus, we can reuse the
separation results from the first separation. In addition, even when we consider
time-frequency regions that contain harmonic structure, we can reuse the calculation
results of the matrix operations in Eq. (10).
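A compact NumPy sketch of this per-time-frequency search is given below: it enumerates the candidate sets K, applies the harmonic-structure constraints of Eqs. (12) and (13), and keeps the combination with the smallest L1 norm. The mixing matrix and dimensions are toy values, and real STFT bins would be complex-valued.

# Hedged sketch of the per-time-frequency search of Eqs. (9)-(13): try each set
# K of M sources, solve s_K = H_K^{-1} x (Eq. (10)), and keep the K with the
# smallest L1 norm (Eq. (9)), restricted by the harmonic-structure set P.
# Toy real-valued data; actual STFT bins are complex.
import itertools
import numpy as np

def separate_bin(H, x, P=frozenset()):
    M, N = H.shape                                    # M microphones, N sources (N > M)
    P = set(P)
    best_K, best_sK, best_cost = None, None, np.inf
    for K in itertools.combinations(range(N), M):
        if len(P) <= M and not P.issubset(K):         # Eq. (12): harmonic sources must be in K
            continue
        if len(P) > M and not set(K).issubset(P):     # Eq. (13): K chosen from within P
            continue
        s_K = np.linalg.solve(H[:, list(K)], x)
        cost = np.abs(s_K).sum()
        if cost < best_cost:
            best_K, best_sK, best_cost = K, s_K, cost
    s_hat = np.zeros(N)
    s_hat[list(best_K)] = best_sK                     # sources outside K stay zero
    return s_hat

H = np.array([[1.0, 0.8, 0.3],                        # toy 2-mic / 3-talker mixing matrix
              [0.4, 0.9, 1.0]])
x = H @ np.array([0.0, 1.2, 0.5])                     # mixture of talkers 1 and 2
print(np.round(separate_bin(H, x, P={1}), 3))         # -> [0.  1.2  0.5]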
4 Experiments
To determine the improvements obtained with our method, we carried out two
experiments using synthesized sounds. Table 1 lists the experimental conditions,
and Fig. 3 shows the arrangement of microphones and loud speakers. To deter-
mine the proper STFT frame length, we use the results of a previous experiment
by Yılmaz et al. [6].
Our evaluations use two kinds of measurements. The first is a Signal-to-Noise
Ratio (SNR) to check whether our method can accurately separate mixed speech
signals, and the second is ASR correctness to check whether the output sounds
maintain acoustic feature values and are suitable for ASR.
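As a small illustration of the first measurement, the global SNR between a clean reference and its separated estimate can be computed as below; the paper does not state which SNR variant (e.g., segmental SNR) it uses, so this is only the textbook global definition.

# Hedged sketch of a global signal-to-noise ratio between a reference clean
# signal and a separated estimate; the paper's exact SNR variant is not
# specified, so this is the standard global definition.
import numpy as np

def snr_db(reference, estimate):
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
separated = clean + 0.05 * np.random.default_rng(0).standard_normal(t.size)
print(round(snr_db(clean, separated), 1))   # roughly 23 dB for this toy example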
Additionally, we evaluate the calculation time, which is important for real-time
human-robot interaction. Table 2 shows the calculation time taken to separate the
198 mixtures, which total 1218 seconds of audio. To solve the LP problem, we
implemented a program in Matlab following Subsections 2.3 and 3.2. To solve the
SOCP problem, we used the SeDuMi toolbox through CVX with default settings.
Table 2. Separation time and RTF
Method            Time (s)   RTF
Conventional LP        266   0.23
Proposed method        376   0.31
SOCP solver          31686   26.0

Fig. 3. Arrangement of mics and talkers: the left, center and right talkers are located 1 m from the robot's microphones, 60° apart
Figure 4 shows the original signal, the separation results with the conventional
method, the estimated harmonic structure, and the separation results with the
proposed method. It shows only the low-frequency region because most of the
estimated harmonic structure lies in the low-frequency region. The black in the
lower left panel represents the estimated harmonic structure, and black and
white represent high and low power, respectively, in the other three panels.
The results of conventional method 4(b) indicate that there are some spectra
lacking in the black circles and some noise leakage in the black rectangles. How-
ever, in the results obtained with the proposed method seen in 4(d), the spectra
in the black circles are recovered, and noise leakage in the black rectangles is
800 800
Frequency (Hz)
Frequency (Hz)
600 600
400 400
200 200
800 800
Frequency (Hz)
Frequency (Hz)
600 600
400 400
200 200
reduced. This means that our method improves the accuracy of dominant source
estimation and reduces interference.
Table 3 lists the average SNR and average ASR correctness for each talker. In
this table, “(c) Optimum harmonic” represents the condition in which the harmonic
structures are given in all time frames, and “(d) Optimum in all TF” represents
the condition in which the dominant sources are given in all time-frequency
regions; the latter is thus the upper bound under our experimental conditions.
Note that the ASR correctness for clean speech is about 93%.
As this table shows, “(b) Proposed method” outperforms “(a) Conventional
method” in both measurements. In addition, the table shows that the results of
“(x) SOCP,” a theoretically justified solution, are as good as those of
“(a) Conventional method.”
4.2 Discussion
These experiments demonstrated that our method improves separation in terms of
both SNR and ASR correctness. This means that our method, which uses harmonic
structure constraints, can estimate dominant sources correctly while maintaining
the acoustic feature values.
In Table 3, the difference between “(b) Proposed method” and “(c) Optimum
harmonic” shows that our method could gain up to 8 more points if the harmonic
structure were estimated more accurately. Also, the difference between “(c)
Optimum harmonic” and “(d) Optimum in all TF” shows that there is further room
for improvement, e.g., by adding new constraints for high-frequency regions.
Finally, we also find that a solution using the SOCP solver has almost no
advantage under our experimental conditions: its accuracy is almost the same as
that of the conventional method, while its computational complexity is much
higher. Of course, the SOCP solver can solve more general problems, so its
computational complexity needs to be weighed against these benefits.
References
1. Hirose, M., Ogawa, K.: Honda humanoid robots development. Philosophical Trans.
A 365(1850), 11–19 (2007)
2. Akachi, K., Kaneko, K., et al.: Development of humanoid robot HRP-3P. In: Proc.
Humanoids 2005, pp. 50–55 (2005)
3. MacDorman, K.F., Ishiguro, H.: The uncanny advantage of using androids in cog-
nitive and social science research. Interaction Studies 7(3), 297–337 (2006)
4. Hyvärinen, A., Oja, E.: Independent Component Analysis: Algorithms and Appli-
cations. Neural Networks 13(4-5), 411–430 (2000)
5. Griffiths, L., Jim, C.: An alternative approach to linearly constrained adaptive
beamforming. IEEE Trans. on Antennas and Propagation 30(1), 27–34 (1982)
6. Yılmaz, O., Rickard, S.: Blind separation of speech mixtures via time-frequency
masking. IEEE Trans. on Signal Processing 52(7), 1830–1847 (2004)
7. Lee, T.W., Lewicki, M.S., et al.: Blind source separation of more sources than
mixtures using overcomplete representations. IEEE Signal Processing Letters 6(4),
87–90 (1999)
8. Bofill, P., Zibulevsky, M.: Underdetermined blind source separation using sparse
representations. Signal Processing 81(11), 2353–2362 (2001)
9. Li, Y., Cichocki, A., Amari, S.: Analysis of sparse representation and blind source
separation. Neural Computation 16(6), 1193–1234 (2004)
10. Li, Y., Amari, S., Cichocki, A., Ho, D.W.C., Xie, S.: Underdetermined blind source
separation based on sparse representation. IEEE Trans. on Signal Processing 54(2),
423–437 (2006)
11. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyl-
labic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics,
Speech and Signal Processing 28(4), 357–366 (1980)
12. Winter, S., Sawada, H., Makino, S.: On real and complex valued L1-norm mini-
mization for overcomplete blind source separation. In: Proc. WASPAA 2005, pp.
86–89 (2005)
Author Index