Computer Analysis of Images and Patterns: Richard Wilson Edwin Hancock Adrian Bors William Smith
Edwin Hancock
Adrian Bors
William Smith (Eds.)
LNCS 8047
Computer Analysis
of Images and Patterns
15th International Conference, CAIP 2013
York, UK, August 2013
Proceedings, Part I
Lecture Notes in Computer Science 8047
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Richard Wilson Edwin Hancock
Adrian Bors William Smith (Eds.)
Computer Analysis
of Images and Patterns
15th International Conference, CAIP 2013
York, UK, August 27-29, 2013
Proceedings, Part I
Volume Editors
Richard Wilson
Edwin Hancock
Adrian Bors
William Smith
University of York
Department of Computer Science
Deramore Lane
York YO10 5GH, UK
E-mail: {wilson, erh, adrian, wsmith}@cs.york.ac.uk
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This volume contains the papers presented at the 15th International Conference on
Computer Analysis of Images and Patterns (CAIP 2013) held in York during August
27–29, 2013.
CAIP was first held in 1985 in Berlin, and since then has been organized biennially
in Wismar, Leipzig, Dresden, Budapest, Prague, Kiel, Ljubljana, Warsaw, Groningen,
Versailles, Vienna, Münster, and Seville.
We received 243 full papers from authors in 48 countries. Of these, 142 were accepted: 39 for oral presentation and 103 as posters. There were three invited speakers: Rama
Chellappa from the University of Maryland, Xiaoyi Jiang from the University of Münster,
and Tim Weyrich from University College London.
We hope that participants not only benefited scientifically from the meeting but also got a flavor of York's rich history and saw something of the region too. To this end, we
organized a reception at the York Castle Museum, and the conference dinner at the
Yorkshire Sculpture Park. The latter gave participants the chance to view large-scale
works by the Yorkshire artists Henry Moore and Barbara Hepworth.
We would like to thank a number of people for their help in organizing this event. Firstly, we would like to thank the IAPR for its sponsorship. Furqan Aziz managed the production of the proceedings, and Bob French coordinated local arrangements.
Program Committee
Ceyhun Burak Akgül Vistek ISRA Vision, Turkey
Madjid Allili Bishop’s University, Canada
Nigel Allinson University of Lincoln, UK
Apostolos Antonacopoulos University of Salford, UK
Helder Araujo University of Coimbra, Portugal
Nicole M. Artner PRIP, Vienna University of Technology, Austria
Furqan Aziz University of York, UK
Andrew Bagdanov Media Integration and Communication Center
University of Florence, Italy
Antonio Bandera University of Malaga, Spain
Elisa H. Barney Smith Boise State University, USA
Ardhendu Behera University of Leeds, UK
Abdel Belaid Université de Lorraine - LORIA, France
Gunilla Borgefors Centre for Image Analysis, Swedish University
of Agricultural Sciences, Sweden
Adrian Bors University of York, UK
Luc Brun GREYC, ENS, France
Lorenzo Bruzzone University of Trento, Italy
Horst Bunke University of Bern, Switzerland
Martin Burger WWU Münster, Germany
Gustavo Carneiro University of Adelaide, Australia
Andrea Cerri University of Bologna, Italy
Kwok-Ping Chan The University of Hong Kong, SAR China
Rama Chellappa University of Maryland, USA
Sei-Wang Chen National Taiwan Normal University, Taiwan
Dmitry Chetverikov Hungarian Academy of Sciences, Hungary
John Collomosse University of Surrey, UK
Bertrand Coüasnon Irisa/Insa, France
Marco Cristani University of Verona, Italy
Guillaume Damiand LIRIS/Université de Lyon, France
Justin Dauwels M.I.T., USA
Mohammad Dawood University of Münster, Germany
Joachim Denzler University Jena, Germany
Cecilia Di Ruberto Università di Cagliari, Italy
Junyu Dong Ocean University of China, China
Sphere Detection in Kinect Point Clouds Via the 3D Hough Transform . . . . . . . 290
Anas Abuzaina, Mark S. Nixon, and John N. Carter
Color Transfer Based on Earth Mover’s Distance and Color Categorization . . . . 394
Wenya Feng, Yilin Guo, Okhee Kim, Yonggan Hou, Long Liu, and
Huiping Sun
Exploring Interest Points and Local Descriptors for Word Spotting Application
on Historical Handwriting Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
Peng Wang, Véronique Eglin, Christine Largeron, Antony McKenna, and
Christophe Garcia
Local and Global Statistics-Based Explicit Active Contour for Weld Defect
Extraction in Radiographic Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
Aicha Baya Goumeidane, Nafaa Nacereddine, and Mohammed Khamadja
Biomedical Imaging: A Computer Vision Perspective

1 Introduction
The success story of modern biology and medicine is also one of imaging. It is
the imaging techniques that enable biological experiments (for high-throughput
behavioral screens or conformation analysis) and make the body of humans and
animals anatomically or functionally visible for clinical purposes (medical pro-
cedures seeking to reveal, diagnose, or examine disease). With the widespread
use of imaging modalities in fundamental research and routine clinical practice,
researchers and physicians are faced with an ever-increasing amount of image data to be analyzed, and the quantitative outcomes of such analysis are becoming increasingly important. Modern computer vision technology is thus indispensable for acquiring and extracting information from this huge amount of data.
Computer vision has a long history and is becoming increasingly mature.
Many computer vision algorithms have been successfully adapted and applied
to biomedical imaging applications. However, biomedical imaging has several
special characteristics which pose particular challenges, e.g.,
– Acquisition and enhancement techniques for challenging imaging situations
are needed.
– The variety of different imaging sensors, each with its own physical principle
and characteristics (e.g., noise modeling), often requires modality-specific
treatment.
Fig. 1. Illustration of three noise models. (a) Noise-free 1D signal. (b) Signal biased
by additive Gaussian noise with σ = 5. (c) Signal biased by Poisson noise. (d) Signal
biased by speckle noise with σ = 5 and γ = 1. (from [42])
– It is not uncommon for several different modalities to be involved. Thus, algorithms must be designed to cope with multiple modalities.
– Due to the high complexity of many biomedical image analysis tasks, semi-
automatic processing may be unavoidable in some cases. The design of in-
telligent and user-friendly interactive tools is a challenging task.
– In addition, different body organs may require organ-specific treatment.
As an example, the influence of noise modeling is considered. The following noise
models are popular:
– Additive Gaussian noise: f = μ + ν, where μ is the unbiased image intensity and ν is a Gaussian-distributed random variable with expectation 0 and variance σ².
– Poisson noise ("photon counting noise"): This type of noise is signal-dependent and appears in a wide class of real-life applications, e.g., in positron emission tomography and fluorescence microscopy.
– Speckle noise: f = μ + ν μ^{γ/2} occurs in ultrasound imaging and is of multiplicative nature. Its dependency on the unbiased image intensity μ is controlled by the parameter γ; ν is the same as for additive Gaussian noise.
To illustrate the different characteristics of these noise forms, a synthetic 1D signal and its corrupted versions are shown in Figure 1. We can observe that for similar parameters, the signal-dependent Poisson and speckle noise appear considerably stronger than the additive Gaussian noise. Their processing is thus decidedly challenging and underscores the need for accurate data modeling in computer vision.
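For readers who want to experiment with these models, the following is a minimal numpy sketch of the three noise types applied to a synthetic piecewise-constant 1D signal in the spirit of Figure 1; the signal values and noise parameters are illustrative assumptions, not those used in [42].

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(mu, sigma=5.0):
    # additive Gaussian noise: f = mu + nu, nu ~ N(0, sigma^2)
    return mu + rng.normal(0.0, sigma, size=mu.shape)

def poisson_noise(mu):
    # signal-dependent Poisson ("photon counting") noise:
    # every sample is drawn from a Poisson distribution whose mean is mu
    return rng.poisson(mu).astype(float)

def speckle_noise(mu, sigma=5.0, gamma=1.0):
    # multiplicative speckle noise: f = mu + nu * mu^(gamma/2)
    nu = rng.normal(0.0, sigma, size=mu.shape)
    return mu + nu * np.power(mu, gamma / 2.0)

# synthetic piecewise-constant 1D signal (illustrative values)
signal = np.concatenate([np.full(100, 20.0), np.full(100, 150.0), np.full(100, 60.0)])
noisy = {
    "gaussian": gaussian_noise(signal),
    "poisson": poisson_noise(signal),
    "speckle": speckle_noise(signal),
}
```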
On the other hand, the special characteristics of biomedical imaging also give extra power to computer vision research. Multimodality can be helpful, since the different modalities carry complementary information and their combined use may ease some image analysis tasks (e.g., segmentation [25]). More generally, a great deal of knowledge specific to a particular application or object type may exist; it should be accurately modeled and integrated into algorithms for dedicated processing towards improved performance.
Given the challenges discussed above, biomedical computer vision goes far beyond simply adapting and applying advanced computer vision techniques to solve real problems. It is also a wide field with huge potential for developing novel concepts, techniques, and algorithms. Indeed, biomedical imaging can be seen as a driving force for computer vision research.
In this paper this view of biomedical computer vision is emphasized by considering several important topics of biomedical imaging: minimum-cost boundary detection, region-based image segmentation, image registration, optical flow computation, and imaging techniques. Our intention is not to give a complete coverage of these topics, but rather to focus, by way of example, on typical challenges and the related concepts, techniques, and algorithms. The majority of the given examples are based on our own research and experience in the respective fields.
Quantification is one of the key words in biomedical imaging and requires robust, fast, and preferably automatic image segmentation algorithms. Segmentation can take the form of either boundary detection or region-based segmentation. Automatic segmentation enables the assessment of meaningful parameters, e.g., for the diagnosis of pathological findings in clinical environments.
Fig. 2. B-mode CCA image (left) and detected intima and adventitia layer of far wall
(right)
Live-wire segmentation has also been generalized to 3D volume data and time sequences of 2D images. The key idea there is that the
user specifies contours via live-wiring on a few slices that are orthogonal to the
natural slices of the original data. If these slices are selected strategically, then
one obtains a sufficient number of seed points in each natural slice which enable
a subsequent automatic optimal boundary detection therein.
Live-wire techniques are a good example of designing intelligent and user-friendly interactive segmentation tools. They help to solve complex segmentation tasks by locally and non-extensively integrating the expertise and wishes of domain experts, which in turn also increases the user's faith in the automatic solution.
Fig. 3. (a) Tumor cell ROI; (b) gradient; (c) gradient-based optimal boundary; (d)
region-based optimal contour (from [29])
However, the image gradient is not always a reliable measure to work with. One such example is the region-of-interest (ROI) of a tumor cell from microscopic imaging shown in Figure 3. Maximizing the sum of gradient magnitudes does not produce a satisfactory result.
There are only very few works on DP-based boundary detection using non-
gradient information [35,53]. A challenge remains to develop boundary detection
methods based on region information. A general framework for this purpose is
proposed in [29]. A star-shaped contour C can be represented in polar form
r(θ), θ ∈ [0, 2π). Given the image boundary B(θ), θ ∈ [0, 2π), the segmentation
task can be generally formulated as one of optimizing the energy function:
E(C) = \int_0^{2\pi} \left( \int_0^{r(\theta)} F_i(\theta, r)\,dr + \int_{r(\theta)}^{B(\theta)} F_o(\theta, r)\,dr \right) d\theta    (1)
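As an illustration of how Eq. (1) can be evaluated numerically, the sketch below discretises the polar domain and scores radii inside r(θ) with F_i and radii outside with F_o. The Chan–Vese-style terms (I − c_in)² and (I − c_out)² and all parameter values are assumptions made for this example; the framework in [29] admits other region models.

```python
import numpy as np

def region_energy(image, center, r_of_theta, c_in, c_out, n_r=200, r_max=None):
    """Discretised polar region energy E(C) of Eq. (1).
    r_of_theta: candidate contour radius for each sampled angle.
    F_i = (I - c_in)^2 and F_o = (I - c_out)^2 are assumed region terms."""
    cy, cx = center
    h, w = image.shape
    if r_max is None:
        r_max = min(cy, cx, h - 1 - cy, w - 1 - cx)
    thetas = np.linspace(0.0, 2.0 * np.pi, len(r_of_theta), endpoint=False)
    radii = np.linspace(0.0, r_max, n_r)
    dr = radii[1] - radii[0]
    dtheta = thetas[1] - thetas[0]
    energy = 0.0
    for theta, r_c in zip(thetas, r_of_theta):
        # sample the image along the ray at angle theta
        ys = np.clip(np.round(cy + radii * np.sin(theta)).astype(int), 0, h - 1)
        xs = np.clip(np.round(cx + radii * np.cos(theta)).astype(int), 0, w - 1)
        ray = image[ys, xs].astype(float)
        inside = radii <= r_c
        f_i = (ray - c_in) ** 2
        f_o = (ray - c_out) ** 2
        energy += (f_i[inside].sum() + f_o[~inside].sum()) * dr * dtheta
    return energy
```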
This approach has also been extended to images that contain non-star-shaped objects. One such attempt from [30] allows the user to interactively specify and edit the general shape of the desired object by using a
so-called rack, which basically corresponds to the object skeleton. The straight-
forward extension of the boundary class considered here to 3D is the terrain-like
surface z = f (x, y) (height field or discrete Monge surface). Unfortunately, there
is no way of extending the dynamic programming solution to the 3D minimum-
cost surface detection problem in an efficient manner. An optimal 3D graph-search algorithm with low-order polynomial time complexity is presented in [32] for this purpose. Similar to handling closed boundaries, cylindrical (tube-like) surfaces can be handled by first unfolding them into a terrain-like surface using a cylindrical coordinate transform. In addition to detecting minimum-cost surfaces, this algorithm can also be applied to sequences of 2D images for temporally consistent boundary detection.
In practice, fast and easy-to-use algorithms like DP-based boundary detection are highly desired. To cite the biologist colleague who provided us with the microscopic images used in [29] (see Figure 3): “I have literally tens of thousands of images per experiment” that must be processed within reasonable time. Therefore, further
developments like boundary detection based on region information will have high
practical impact.
The two approaches discussed above are representative of a variety of segmentation algorithms that fully exploit knowledge about the specific characteristics of the image data at hand. Better modeling is the prerequisite for improved segmentation accuracy and robustness. This is especially important in biomedical imaging due to the variety of imaging modalities.
4 Image Registration
Image registration [21,34] aims at geometrically aligning two images of the same
scene, which may be taken at different times, from different viewpoints, and by
different sensors. It is among the most important tasks of biomedical imaging in
practice. Given a template image T : Ω → ℝ and a reference image R : Ω → ℝ, where Ω ⊂ ℝ^d is the image domain and d the dimension, the registration yields a transformation y : ℝ^d → ℝ^d representing point-to-point correspondences between T and R. To find y, the following functional has to be minimized:
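The functional itself is not given above. A generic form consistent with the quantities defined below (with an assumed regularization weight α > 0) is

\min_{y} \; J(y) = D\bigl(M[T, y],\, R\bigr) + \alpha\, S(y)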
Here, D denotes the distance functional, M the transformation model, and S the regularization functional. D measures the dissimilarity between the transformed template image and the fixed reference image. If both images are of the same modality, the sum-of-squared differences (SSD) can be used as the distance functional D. In the case of multimodal image registration, information-theoretic measures, in particular mutual information, are popular [39].
The SSD and related dissimilarity measures implicitly assume the intensity
constancy between the template and reference image. Thus, we solely search
for the optimal geometric transformation. In medical imaging, however, this as-
sumption is not always satisfied. Such a problem instance appears in the context
of motion correction in positron emission tomography (PET) [14].
PET requires relatively long image acquisition times in the range of minutes.
In thoracic PET both respiratory and cardiac motion lead to spatially blurred
images. To reduce motion artifacts in PET, so-called gating based techniques
Fig. 4. Coronal slices of the left ventricle in a human heart during systole (a) and
diastole (b) and corresponding line profiles (c) are shown for one patient. It can be
observed that the maximum peaks in these line profiles vary a lot. (from [24])
were found useful; these decompose the whole dataset into parts that represent different breathing and/or cardiac phases [16]. After gating, each single gate shows less motion but suffers from a relatively low signal-to-noise ratio (SNR), as it contains only a small portion of all available events. After gating
the data, each gate is reconstructed individually and registered to one assigned
reference gate. The registered images are averaged afterwards to overcome the
problem of low SNR. Tissue compression and the partial volume effect (PVE)
lead to intensity modulations. Especially for relatively small structures like the
myocardium the true uptake values are affected by the PVE. An example is
given in Figure 4 where a systolic and diastolic slice (same respiratory phase)
of a gated 3D dataset and line profiles are shown. Among others, the maximum
intensity values of the two heart phases indicate that corresponding points can
differ in intensity significantly.
In this situation an image registration mechanism is required which consists
of simultaneous geometric transformation (spatially moving the pixels) and in-
tensity modulation (redistributing the intensity values). In gating, all gates are
formed over the same time interval. Hence, the total amount of radioactivity in
each phase is approximately equal. In other words, in any respiratory and/or car-
diac gate no radioactivity can be lost or added apart from some minor changes
at the edges of the field of view. This property provides the foundation for a
mass-preserving image registration. VAMPIRE (Variational Algorithm for Mass-
Preserving Image REgistration) [24] incorporates a mass-preserving component
by accounting for the volumetric change induced by the transformation y. Based
on the integration by substitution theorem for multiple variables we have:
\int_{y(\Omega)} T(x)\,dx \;=\; \int_{\Omega} T(y(x))\,\left|\det(\nabla y(x))\right|\,dx    (7)
It guarantees the same total amount of radioactivity before and after applying
the transformation y to T . Therefore, for an image T and a transformation y,
the mass-preserving transformation model is defined as:
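The expression is cut off here; consistent with Eq. (7), the mass-preserving transformation model can be written (a hedged reconstruction, up to notation) as

M_{\mathrm{MP}}[T, y](x) = T\bigl(y(x)\bigr)\,\bigl|\det\bigl(\nabla y(x)\bigr)\bigr|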
Optical flow computation is another central topic. It rests on the brightness constancy assumption, I(x, y, t) = I(x + u, y + v, t + 1), which assumes that when a pixel moves from one image to another, its intensity (or color) does not change. In fact, this assumption combines a number of assumptions about the reflectance properties of the scene, the illumination in the scene, and the image formation process in the camera [2]. Linearizing this constancy equation by applying a first-order Taylor expansion to the right-hand side leads to the fundamental optical flow constraint (OFC):
u\,I_x + v\,I_y = -I_t    (10)

or more compactly:

\mathbf{f} \cdot \nabla I = -I_t    (11)

with \mathbf{f} = (u, v) and \nabla I = (I_x, I_y), which is used to derive optimization algorithms in a continuous setting.
In practice, however, this popular brightness constancy is not always valid. Other constancy terms have also been suggested, including the gradient, the gradient magnitude, higher-order derivatives (e.g., the spatial Hessian), photometric invariants, texture features, and combinations of multiple features (see [5] for a discussion). In the following two subsections we briefly discuss two additional variants from the medical imaging perspective.
Based on the observation that the OFC is very similar to the continuity equation of fluid dynamics, Schunck [44] presented the extended optical flow constraint (EOFC):
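The EOFC is not reproduced above; in the form usually attributed to Schunck it augments the OFC with a divergence term (a hedged reconstruction, using the notation of Eqs. (10)–(11)):

u\,I_x + v\,I_y + I\,(u_x + v_y) + I_t \;=\; \nabla \cdot (I\,\mathbf{f}) + I_t \;=\; 0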
Multiplicative speckle noise (cf. Figure 1d) is characteristic of diagnostic ultrasound imaging. Speckle originates from tiny inhomogeneities within the tissue, which reflect ultrasound waves but cannot be resolved by the ultrasound system. Speckle noise f = μ + ν μ^{γ/2} is of multiplicative nature, i.e., the noise variance directly depends on the underlying signal intensity.
Speckle noise has a substantial impact on motion estimation. In fact, it turns out that the brightness constancy no longer holds (see [49] for a mathematical proof). This can be demonstrated by the following simple experiment [49]. Starting from two pixel patches of size 5 × 5 with constant intensity values μ = 150 and η ∈ [0, 255], a realistic amount of speckle noise was added according to f = μ + ν μ^{γ/2} with γ = 1.5. The resulting pixel patches, denoted by X^{150} and Y^{η}, were compared pixelwise with the squared L2-distance. Comparison of the two pixel patches was performed 10,000 times for every value of
η ∈ [0, 255]. The simulation results (average distance of the two pixel patches
and standard deviation) are plotted in Figure 5 (left). Normally, one would ex-
pect the minimum of the graph to be exactly at the value η = μ = 150, i.e., both
pixel patches have the same constant intensity before adding noise. However, the
minimum of the graph is below the expected value. This discrepancy has been
theoretically analyzed in [49], which predicts the minimum at η ≈ 141 for the
particular example as can be observed in Figure 5 (left).
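The following numpy sketch reproduces the spirit of this experiment; the noise level σ and the number of repetitions are assumptions, since they are not stated above, so the exact location of the minimum depends on them.

```python
import numpy as np

rng = np.random.default_rng(0)

def speckle_patch(mu, sigma=1.0, gamma=1.5, shape=(5, 5)):
    # f = mu + nu * mu^(gamma/2), nu ~ N(0, sigma^2); sigma is an assumed value
    return mu + rng.normal(0.0, sigma, size=shape) * mu ** (gamma / 2.0)

mu = 150.0
etas = np.arange(0, 256)
mean_dist = np.empty(etas.size)
for i, eta in enumerate(etas):
    # squared L2-distance between two independently corrupted patches
    d = [np.sum((speckle_patch(mu) - speckle_patch(float(eta))) ** 2)
         for _ in range(1000)]          # the experiment in [49] uses 10,000 repetitions
    mean_dist[i] = np.mean(d)

print("distance-minimising eta:", etas[np.argmin(mean_dist)])
# the minimiser falls below the true value eta = 150, illustrating the bias
```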
In [49] it is argued that the overall distribution within a local image region remains approximately constant, since the tissue characteristics remain unchanged under motion; intensity constancy is therefore replaced by a histogram-based constancy constraint.
Fig. 5. Left: Average distance between two pixel patches biased by speckle noise. The global minimum lies below the correct value of η = 150. Right: Average distance between the histograms of two pixel patches biased by speckle noise. The global minimum matches the correct value of η = 150. In both cases the two dashed lines represent the standard deviation over the 10,000 experiments. (from [49])
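The constraint itself is not reproduced above; in the spirit of [49] it replaces intensity constancy by constancy of the local histogram, roughly

H(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = H(x, y, t)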
where H represents the cumulative histogram of the region surrounding the pixel (x, y) at time t. The validity of this new constraint has been mathematically proven in [49] and can also be seen in Figure 5 (right). On ultrasound data the derived histogram-based optical flow algorithm outperforms state-of-the-art general-purpose optical flow methods.
In medical imaging some motion is inherently periodic. For example, this occurs
in cardiac gated imaging, where images are obtained at different phases of the
periodic cardiac cycle. Another example is in respiratory gated imaging, where
the respiratory motion of the chest can also be described by a periodic model. Li
and Yang [33] proposed optical flow estimation for a sequence of images wherein
the inherent motion is periodic over time. Although in principle one could adopt
a frame-by-frame approach to determine the motion fields, a joint estimation,
in which all motion fields of a sequence are estimated simultaneously, explicitly
exploits the inherent periodicity of the image motion over time and can thus be advantageous over the frame-by-frame approach.
By applying Fourier series expansion, the components (u, v) at location (x, y) over time are modeled by:

u(x, y, t) = \sum_{l=1}^{L} \left[ a_l(x, y) \cos\frac{2\pi l}{T} t + b_l(x, y) \sin\frac{2\pi l}{T} t \right]    (14)

v(x, y, t) = \sum_{l=1}^{L} \left[ c_l(x, y) \cos\frac{2\pi l}{T} t + d_l(x, y) \sin\frac{2\pi l}{T} t \right]    (15)
where a_l(x, y), b_l(x, y), c_l(x, y), d_l(x, y) are the coefficients associated with harmonic component l and L is the order of the harmonic representation. This motion model is embedded into the motion estimation for each pair of successive images, and the overall data term of the energy function to be minimized is the sum of all pairwise data terms derived from the brightness constancy.
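A small sketch of evaluating the harmonic motion model of Eqs. (14)–(15) is given below; the coefficient arrays would in practice come from the joint estimation, and the array layout is an assumption made for this example.

```python
import numpy as np

def periodic_flow(a, b, c, d, t, period):
    """Evaluate u(x, y, t), v(x, y, t) from Eqs. (14)-(15).
    a, b, c, d: arrays of shape (L, H, W), one spatial coefficient map
    per harmonic component l = 1..L; period is the cycle length T."""
    L = a.shape[0]
    l = np.arange(1, L + 1).reshape(-1, 1, 1)
    phase = 2.0 * np.pi * l * t / period
    u = np.sum(a * np.cos(phase) + b * np.sin(phase), axis=0)
    v = np.sum(c * np.cos(phase) + d * np.sin(phase), axis=0)
    return u, v
```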
In an ongoing project we aim to track freely moving small animals with high precision inside a positron emission tomograph. Normally, the animals have to be anesthetized for the 15–60 minutes of data acquisition to avoid motion artifacts.
However, anesthesia influences the metabolism which is measured by PET. To
avoid this, the aim of our project is to track awake and freely moving animals
during the scan and use the information to correct the acquired PET data for
motion. For this task a small animal chamber of 20×10×9 cm was built (Figure 6)
with a pair of stereo cameras positioned on both small sides of the chamber.
Due to the experimental setup, highly distorting wide-angle lenses have to be used. To reach the required tracking accuracy, a high-precision lens distortion correction is crucial. First tests using a simple polynomial model for lens distortion correction led to deviations from a pinhole camera model of up to 5 pixels. Therefore, more sophisticated methods are needed. Two high-precision lens distortion correction methods are described in [26,27]. In the latter case several images of a calibration harp (taut wires) are acquired, and a massive number of edge points is used to determine the parameters of an 11th-degree polynomial distortion function. Both methods require a very accurately manufactured calibration pattern. In [43] another solution is suggested, using a planar checkerboard pattern as the calibration target, since it provides very accurately detectable feature points even under distortion. Smoothed thin plate splines are applied to model the mapping between control points, leading to a mean accuracy below 0.084 pixels.
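A generic sketch of such a spline-based correction is shown below, using SciPy's radial basis function interpolator with a thin-plate kernel; this is not the exact model of [43], and the smoothing weight is an assumed value.

```python
import numpy as np
from scipy.interpolate import Rbf

def fit_undistortion(detected_xy, ideal_xy, smooth=1e-3):
    """Fit a smoothed thin-plate-spline mapping from detected (distorted)
    checkerboard corners to their ideal (undistorted) positions.
    detected_xy, ideal_xy: (N, 2) arrays of corresponding points.
    Returns a function mapping distorted pixel coordinates to corrected ones."""
    dx, dy = detected_xy[:, 0], detected_xy[:, 1]
    fx = Rbf(dx, dy, ideal_xy[:, 0], function="thin_plate", smooth=smooth)
    fy = Rbf(dx, dy, ideal_xy[:, 1], function="thin_plate", smooth=smooth)
    return lambda x, y: (fx(x, y), fy(x, y))
```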
Fig. 6. Camera setup for PET imaging of freely moving mice. Left: Construction model
of the animal chamber. Right: Manufactured chamber halfway inserted into a quad-
HIDAC PET-scanner (16 cm in diameter). (from [43])
Fig. 7. The FIM setup. (A) Image of 10 larvae (arrow) imaged in a conventional
setup. The asterisks denote scratches and reflections in the tracking surface. (B) Image
of 10 larvae (arrow) imaged in the FIM setup with high contrast. (C) The principle
of frustrated total internal reflection. n_a, n_1, n_2, and n_3 indicate the different refractive indices of air, acrylic glass, agar, and larvae, respectively; an acrylic glass plate is flooded
with infrared light (indicated by red lines). The camera is mounted below the tracking
table. (D) Schematic drawing of the setup. (E) Image of the tracking table. (from [41])
it even allows internal organs to be imaged. FIM is suitable for a wide range of biological applications and a wide range of organisms. Together with optimized tracking software, it facilitates the analysis of larval locomotion and will simplify genetic screening procedures.
7 Conclusion
Biomedical computer vision goes far beyond simply adapting and applying advanced computer vision techniques to solve real problems. Biomedical imaging also poses new and challenging computer vision problems in order to cope with the complex and multifarious reality. In this paper we have discussed, by way of example, a number of challenges and the related concepts and algorithms, mainly in the fields of our own research. They are well motivated by practice. Biomedical imaging is full of such challenges, and powerful computer vision solutions will immediately benefit practice.
We need to understand how domain experts work best with a technical system; this helps in designing intelligent and user-friendly interactive tools. In addition, we are forced to gain a deeper understanding of the sources of the signals and images to be processed, i.e., the objects of interest and the biomedical devices. Only in this way can the essential knowledge be included for improved modeling and solutions.
Many fundamental assumptions made when developing algorithms for biomedical imaging are shared by different, even non-biomedical, imaging modalities.
For instance, the speckle noise model applies to both ultrasound and synthetic
aperture radar imaging. Thus, the developed algorithms are of general interest
and can be used in manifold application contexts.
Modern biology and medicine are a success story of imaging. In the past, biomedical computer vision has already established a vast body of powerful methods and tools. Continuing well-founded research will further enlarge the spectrum of successfully solved practical problems and thus continue to make a noticeable contribution to biology and medicine.
References
1. Ardekani, R., Biyani, A., Dalton, J., Saltz, J., Arbeitman, M., Tower, J., Nuzhdin,
S., Tavare, S.: Three-dimensional tracking and behaviour monitoring of multiple
fruit flies. J. R. Soc. Interface 10(78), 20120547 (2013)
2. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database
and evaluation methodology for optical flow. International Journal of Computer
Vision 92(1), 1–31 (2011)
3. Béréziat, D., Herlin, I., Younes, L.: A generalized optical flow constraint and its
physical interpretation. In: Proc. of CVPR, pp. 487–492 (2000)
4. Bimbo, A.D., Nesi, P., Sanz, J.L.C.: Optical flow computation using extended
constraints. IEEE Trans. on Image Processing 5(5), 720–739 (1996)
5. Bruhn, A.: Variational Optic Flow Computation – Accurate Modelling and Efficient
Numerics. Ph.D. thesis, University of Saarland (2006)
6. Burkard, R., Dell'Amico, M., Martello, S.: Assignment Problems. Society for Industrial and Applied Mathematics (2009)
7. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on Image
Processing 10(2), 266–277 (2001)
8. Cheng, D.C., Jiang, X.: Detections of arterial wall in sonographic artery images
using dual dynamic programming. IEEE Trans. on Information Technology in
Biomedicine 12(6), 792–799 (2008)
9. Chesnaud, C., Réfrégier, P., Boulet, V.: Statistical region snake-based segmentation adapted to different physical noise models. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(11), 1145–1157 (1999)
10. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans.
on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
11. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models-their
training and application. Computer Vision and Image Understanding 61(1), 38–59
(1995)
12. Corpetti, T., Heitz, D., Arroyo, G., Memin, E., Santa-Cruz, A.: Fluid experimental
flow estimation based on an optical-flow scheme. Experiments in Fluids 40(1),
80–97 (2006)
13. Dawood, M., Gigengack, F., Jiang, X., Schäfers, K.: A mass conservation-based op-
tical flow method for cardiac motion correction in 3D-PET. Medical Physics 40(1),
012505 (2013)
14. Dawood, M., Jiang, X., Schäfers, K. (eds.): Correction Techniques in Emission
Tomographic Imaging. CRC Press (2012)
15. Dawood, M., Büther, F., Jiang, X., Schäfers, K.P.: Respiratory motion correction
in 3-D PET data with advanced optical flow algorithms. IEEE Trans. on Medical
Imaging 27(8), 1164–1175 (2008)
16. Dawood, M., Büther, F., Stegger, L., Jiang, X., Schober, O., Schäfers, M., Schäfers,
K.P.: Optimal number of respiratory gates in positron emission tomography: A
cardiac patient study. Medical Physics 36(5), 1775–1784 (2009)
17. Dawood, M., Kösters, T., Fieseler, M., Büther, F., Jiang, X., Wübbeling, F.,
Schäfers, K.P.: Motion correction in respiratory gated cardiac PET/CT using
multi-scale optical flow. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (eds.)
MICCAI 2008, Part II. LNCS, vol. 5242, pp. 155–162. Springer, Heidelberg (2008)
18. Falcão, A.X., Udupa, J.K.: A 3D generalization of user-steered live-wire segmen-
tation. Medical Image Analysis 4(4), 389–402 (2000)
19. Falcão, A.X., Udupa, J.K., Miyazawa, F.K.: An ultra-fast user-steered image segmentation paradigm: Live-wire-on-the-fly. IEEE Trans. on Medical Imaging 19(1), 55–62 (2000)
20. Falcão, A.X., Udupa, J.K., Samarasekera, S., Sharma, S., Hirsch, B.E., de Alencar
Lotufo, R.: User-steered image segmentation paradigms: Live wire and live lane.
Graphical Models and Image Processing 60(4), 233–260 (1998)
21. Fischer, B., Modersitzki, J.: Ill-posed medicine - an introduction to image registra-
tion. Inverse Problems 24(3), 034008 (2008)
22. Fleet, D., Weiss, Y.: Optical flow estimation. In: Paragios, N., Chen, Y., Faugeras, O. (eds.) The Handbook of Mathematical Models in Computer Vision, pp. 241–260. Springer (2005)
23. Gigengack, F.: Mass-Preserving Motion Correction and Multimodal Image Segmentation in Positron Emission Tomography. Ph.D. thesis, University of Münster (2012)
24. Gigengack, F., Ruthotto, L., Burger, M., Wolters, C.H., Jiang, X., Schäfers, K.P.:
Motion correction in dual gated cardiac PET using mass-preserving image regis-
tration. IEEE Trans. on Medical Imaging 31(3), 698–712 (2012)
25. Gigengack, F., Ruthotto, L., Jiang, X., Modersitzki, J., Burger, M., Hermann,
S., Schäfers, K.P.: Atlas-based whole-body PET-CT segmentation using a passive
contour distance. In: Menze, B.H., Langs, G., Lu, L., Montillo, A., Tu, Z., Criminisi,
A. (eds.) MCV 2012. LNCS, vol. 7766, pp. 82–92. Springer, Heidelberg (2013)
26. von Gioi, R.G., Monasse, P., Morel, J.M., Tang, Z.: Towards high-precision lens
distortion correction. In: Proc. of ICIP, pp. 4237–4240 (2010)
27. von Gioi, R.G., Monasse, P., Morel, J.M., Tang, Z.: Lens distortion correction with
a calibration harp. In: Proc. of ICIP, pp. 617–620 (2011)
28. Heimann, T., Meinzer, H.P.: Statistical shape models for 3D medical image seg-
mentation: A review. Medical Image Analysis 13(4), 543–563 (2009)
29. Jiang, X., Tenbrinck, D.: Region based contour detection by dynamic programming.
In: Hancock, E., Smith, W., Wilson, R., Bors, A. (eds.) CAIP 2013, Part II. LNCS,
vol. 8048, pp. 152–159. Springer, Heidelberg (2013)
30. Jiang, X., Große, A., Rothaus, K.: Interactive segmentation of non-star-shaped
contours by dynamic programming. Pattern Recognition 44(9), 2008–2016 (2011)
31. Khurana, S., Atkinson, W.L.N.: Image enhancement for tracking the translucent
larvae of drosophila melanogaster. PLoS ONE 5(12), e15259 (2010)
32. Li, K., Wu, X., Chen, D., Sonka, M.: Optimal surface segmentation in volumetric
images - a graph-theoretic approach. IEEE Trans. on Pattern Analysis and Machine
Intelligence 28(1), 119–134 (2006)
33. Li, L., Yang, Y.: Optical flow estimation for a periodic image sequence. IEEE Trans.
on Image Processing 19(1), 1–10 (2010)
34. Maintz, J.B.A., Viergever, M.A.: A survey of medical image registration. Medical
Image Analysis 2(1), 1–36 (1998)
35. Malon, C., Cosatto, E.: Dynamic radial contour extraction by splitting homo-
geneous areas. In: Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A.,
Kropatsch, W. (eds.) CAIP 2011, Part I. LNCS, vol. 6854, pp. 269–277. Springer,
Heidelberg (2011)
36. Martin, P., Réfrégier, P., Goudail, F., Guérault, F.: Influence of the noise model
on level set active contour segmentation. IEEE Trans. on Pattern Analysis and
Machine Intelligence 26(6), 799–803 (2004)
37. Mortensen, E., Morse, B., Barrett, W.: Adaptive boundary detection using ‘live-
wire’ two-dimensional dynamic programming. In: IEEE Proc. Computers in Car-
diology, pp. 635–638 (1992)
38. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and
associated variational problems. Commun. Pure Appl. Math. 42, 577–685 (1989)
39. Pluim, J.P.W., Maintz, J.B.A., Viergever, M.A.: Mutual information based reg-
istration of medical images: A survey. IEEE Trans. on Medical Imaging 22(8),
986–1004 (2003)
40. Qiu, M.: Computing optical flow based on the mass-conserving assumption. In:
Proc. of ICPR, pp. 7041–7044 (2000)
41. Risse, B., Thomas, S., Otto, N., Löpmeier, T., Valkov, D., Jiang, X., Klämbt,
C.: FIM, a novel FTIR-based imaging method for high throughput locomotion
analysis. PLoS ONE 8(1), e53963 (2013)
42. Sawatzky, A., Tenbrinck, D., Jiang, X., Burger, M.: A variational framework for
region-based segmentation incorporating physical noise models. Journal of Math-
ematical Imaging and Vision (2013), doi:10.1007/s10851-013-0419-6
43. Schmid, S., Jiang, X., Schäfers, K.: High-precision lens distortion correction using
smoothed thin plate splines. In: Hancock, E., Smith, W., Wilson, R., Bors, A. (eds.)
CAIP 2013, Part II. LNCS, vol. 8048, pp. 432–439. Springer, Heidelberg (2013)
44. Schunck, B.: The motion constraint equation for optical flow. In: Proc. of ICPR,
pp. 20–22 (1984)
45. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision.
Cengage Learning, 3rd edn. (2007)
46. Sun, C., Appleton, B.: Multiple paths extraction in images using a constrained
expanded trellis. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(12),
1923–1933 (2005)
47. Sun, C., Pallottino, S.: Circular shortest path in images. Pattern Recognition 36(3),
709–719 (2003)
48. Tenbrinck, D., Jiang, X.: Discriminant analysis based level set segmentation for
ultrasound imaging. In: Hancock, E., Smith, W., Wilson, R., Bors, A. (eds.) CAIP
2013, Part II. LNCS, vol. 8048, pp. 144–151. Springer, Heidelberg (2013)
49. Tenbrinck, D., Schmid, S., Jiang, X., Schäfers, K., Stypmann, J.: Histogram-based
optical flow for motion estimation in ultrasound imaging. Journal of Mathematical
Imaging and Vision (2013), doi:10.1007/s10851-012-0398-z
50. Tenbrinck, D., Sawatzky, A., Jiang, X., Burger, M., Haffner, W., Willems, P., Paul,
M., Stypmann, J.: Impact of physical noise modeling on image segmentation in
echocardiography. In: Proc. of Eurographics Workshop on Visual Computing for
Biomedicine, pp. 33–40 (2012)
51. Udupa, J., Samarasekera, S., Barrett, W.: Boundary detection via dynamic pro-
gramming. In: Visualization in Biomedical Computing 1992, pp. 33–39 (1992)
52. Yan, H., Gigengack, F., Jiang, X., Schäfers, K.: Super-resolution in cardiac PET
using mass-preserving image registration. In: Proc. of ICIP (2013)
53. Yu, M., Huang, Q., Jin, R., Song, E., Liu, H., Hung, C.C.: A novel segmentation
method for convex lesions based on dynamic programming with local intra-class
variance. In: Proc. of ACM Symposium on Applied Computing, pp. 39–44 (2012)
54. Zou, D., Zhao, Q., Wu, H.S., Chen, Y.Q.: Reconstructing 3d motion trajecto-
ries of particle swarms by global correspondence selection. In: Proc. of ICCV,
pp. 1578–1585 (2009)
Rapid Localisation and Retrieval
of Human Actions with Relevance Feedback
1 Introduction
In recent years search engines – such as Google – that operate on textual in-
formation have become both mature and commonplace. Efficient and accurate
search of multimedia data, however, is still an open research question, and this
is becoming an increasingly relevant problem with the growth in use of Internet
multimedia data. In order to perform searches on multimedia databases, cur-
rent technology relies on textual metadata associated with each video, such as
keyword tags or the video’s description – unfortunately such metadata are of-
ten incomplete or inaccurate. Furthermore, even if a textual search engine can
locate the correct video, it cannot search within that video to localise specific
sub-sequences that the user is interested in.
Compared to this, content-based retrieval systems present a better alterna-
tive. Such systems directly search through the content of multimedia objects,
avoiding the problems associated with metadata searches. Content-Based Im-
age Retrieval (CBIR) is the primary focus of many researchers. Video retrieval
(CBVR) has also been studied [1], but to a far lesser degree. Retrieval of human
actions in particular has received relatively little attention in comparison to ac-
tion recognition, with some notable exceptions in [2,3]. This is perhaps because human actions are particularly difficult to retrieve: only a single query example is provided to search on, and this single query cannot capture the vast intraclass variability of even the simplest of human actions. Additionally, if the
[Fig. 1 pipeline blocks: query q, query histogram H_q, time-slice histograms, time series S_{q,T}, temporal localisation, temporally localised candidates (+ histograms)]
Fig. 1. An overview of the localisation and ranking aspects of our algorithm. Relevance
feedback has been omitted for clarity.
query itself is noisy it can be difficult to isolate the relevant features of the ac-
tion. One method researchers use to overcome this issue is relevance feedback,
such as presented in [2].
Finding relevant videos alone is not enough for a practical video retrieval system. It is also necessary to localise the relevant segments within longer videos, as real-world actions of interest are rarely neatly segmented. In the image
domain, Rahmani et al. [4] and Zhang et al. [5] have combined retrieval with
spatial localisation of objects. In videos, most localisation to date has been per-
formed in a recognition context, such as in [6]. However, more recently Yu et al.
[3] have performed human action retrieval combined with localisation.
Our goal is to introduce a time-efficient system for performing human action retrieval, showing how localisation and retrieval can be integrated while maintaining accuracy. We argue that, compared to previous works such as Yu et al. [3], our method is an order of magnitude more efficient in time and space, making it far more practical for real-world searches, while still maintaining practical accuracy. Furthermore, we experiment with the addition of relevance feedback in various forms, demonstrating that even imperfectly localised feedback can be used to significantly improve results. We believe ours is the first work to consider the effect of noisy relevance feedback samples; our experimentation is detailed further in Section 3.
2.1 Pre-processing
In the pre-processing stage it is helpful to consider previous work on human
action recognition. Approaches to human action recognition are broken down
into two categories based on the feature extraction method: global feature-based
methods and local feature-based methods [7]. Global feature based methods, such
as [8], consider the whole human shape or scene through time. Local feature-
based methods, such as [9,10], discard more potentially salient information, such
as the structural information between features, so are generally not as accu-
rate on clean datasets. However, they are typically more robust against noise.
Some methods, however, including the spatio-temporal shape context [11] and
spatio-temporal pyramid representations [12], are local feature-based but par-
tially retain structural information between features. The localisation technique
presented in this work is similar to these structure-retaining representations.
The first step in our approach is to reduce the video database to a compact
representation. As we want our algorithm to operate on realistic datasets, we use
local features. Features are detected using Dollar’s method [10] at a loosely con-
stant rate with respect to time, at multiple spatial and temporal scales. At each
detected point, we extract a spatio-temporal cuboid and apply the HOG3D [13]
descriptor. We base our choice of detector on a human action classification eval-
uation study [14], and the descriptor on the experimental results shown in [13].
Next we assign each feature to one of k distinct codewords/clusters, as in the Bag-of-Words method. To achieve this, we first reduce the feature descriptors'
dimensionality using principal components analysis. We then perform k-means
clustering on the reduced descriptors, and each feature is assigned to one cluster.
Each feature is then represented by a single value – its cluster membership.
We then aggregate these features in a way suitable for rapid localisation. While
Yu et al.’s fast method [3] for action localisation can often localise the optimal
3D sub-volume, generating a score for each STIP using Random Forests is too
expensive for real-world retrieval. Feature voting [6] is another potential scheme, but we have experimentally determined that such methods are only stable when applied to clean datasets. We instead propose to use a BoW-derived approach to video representation, visualised in part of Figure 1. Each database video is divided into time-slices t ∈ T of n_f frames, and we create a codeword frequency histogram H_t for all the features within each t. Each histogram is normalised, and n_f is chosen to be approximately half the size of the smallest query that can be searched on. The time-slices do not overlap, as preliminary experiments
have shown this does not improve accuracy. While this representation is simple,
we show through experiments that it captures sufficient information to localise
a human action. All of the aforementioned steps can be processed once on the
database in batch – this improves the time efficiency of later user searches.
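A minimal sketch of this batch pre-processing step is shown below; the function and variable names are illustrative, and the feature detection and clustering steps are assumed to have been run already.

```python
import numpy as np

def time_slice_histograms(feature_frames, feature_words, n_frames, nf, k):
    """One normalised codeword-frequency histogram per time-slice of nf frames.
    feature_frames: frame index of each detected feature;
    feature_words: its codeword (cluster membership) in [0, k)."""
    n_slices = int(np.ceil(n_frames / nf))
    hists = np.zeros((n_slices, k))
    for frame, word in zip(feature_frames, feature_words):
        hists[min(int(frame) // nf, n_slices - 1), int(word)] += 1
    sums = hists.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0   # leave empty slices as all-zero histograms
    return hists / sums
```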
2.2 Search
Previous works [15,16] on human action localisation typically utilise a trained model; this requires several examples of the target action and the accompanying ground truths. This is not possible in a retrieval context, where only a
Rapid Localisation and Retrieval of Human Actions 23
single query example is provided. Some researchers have made attempts to per-
form image retrieval with spatial localisation [4,5], and one work focuses on
spatio-temporal retrieval and localisation of videos [3]. However, all of the afore-
mentioned techniques are computationally complex, making them unsuitable for
real-world retrieval. We present a more efficient system below.
To search, the user provides a video example of the human action they want
to find. The system performs feature extraction on this query in the manner
described in section 2.1, but a single normalised histogram is generated for the
entire length of the query, rather than for time-slices. To search for an action
within a single video taken from the database, we first use a simple metric to cal-
culate the similarity between each time-slice histogram and the query histogram
Hq . This metric is the histogram intersection:
s(H_q, H_t) = \sum_{i=1}^{k} \min(H_q^i, H_t^i)    (1)
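In code, the similarity of Eq. (1) and its use for scoring the time-slices of a single database video could look like this (a sketch; slice_hists is assumed to come from the pre-processing step above):

```python
import numpy as np

def histogram_intersection(h_q, h_t):
    # s(Hq, Ht) = sum_i min(Hq_i, Ht_i), Eq. (1)
    return np.minimum(h_q, h_t).sum()

def score_time_slices(h_q, slice_hists):
    # one similarity score per time-slice of a database video
    return np.array([histogram_intersection(h_q, h_t) for h_t in slice_hists])
```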
3 Relevance Feedback
We can use relevance feedback (RF) to iteratively improve both the ranking and
localisation aspects of our algorithm. After an initial search, RF can improve
results by combining the original query with user feedback about the quality of
the initial results, to generate a more discriminative query. Usually this second,
more discriminative query will return better results than the original query. To
date, RF has been used mostly in the image retrieval domain [17,18], but has
also been applied to human action retrieval in more recent years [2].
In this work, relevance feedback occurs after the localisation and retrieval have
been performed once as described above to give an initial ranking of videos. The
user provides binary feedback on the relevance of several highly-ranked results,
and the histograms associated with these results are used to train new local-
isation and retrieval algorithms. To improve localisation, we use the feedback
histograms and the original query histogram to train an SVM, with the his-
togram intersection shown in equation 1 as the SVM’s kernel. Then, to calculate
the relevance of each time slice t, we measure the distance from the SVM's hyperplane to H_t. The rest of the localisation algorithm proceeds as described in
§2.2. To improve our ranking with relevance feedback, we replace the histogram
intersection shown in equation 1 with a simple query expansion metric that only
utilises positive feedback pos. This query expansion takes the following form:
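The expression itself is not reproduced above. One plausible Rocchio-style form consistent with the description, in which the query histogram is averaged with the positively judged histograms, would be (an assumption, not necessarily the authors' exact metric):

H_q' = \frac{1}{1 + |\mathrm{pos}|}\Bigl(H_q + \sum_{H_p \in \mathrm{pos}} H_p\Bigr), \qquad s'(H_q, H_t) = s(H_q', H_t)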
4 Experiments
4.1 Setup
[Fig. 2. (a) Precision vs. recall after 0, 1, 3, and 5 iterations of imperfect relevance feedback. (b) Top-20 precision against the number of RF iterations for imperfect and user-adjusted feedback applied to both, localisation only, or ranking only.]
Since an action may be performed towards either the left or the right, all features are also mirrored on the y-axis, giving an average of 24 features per frame. In the creation of the feature codebook, we use PCA to retain 95% of the total variance, and for clustering we set k = 1000.
Leave-one-out cross validation is performed, treating each of the 203 actions
as the query in turn, averaging results over all runs. We use the following method to determine the accuracy of our localisation: let L(E, G) = length(E ∩ G) / length(E ∪ G), where E is the temporal extent of the estimated action, and G is the temporal extent of the closest ground truth. An action is considered successfully localised when L(E, G) ≥ 0.5. To simulate a user's relevance feedback, we use the ground truth to determine up to 5 examples of each of positive and negative feedback.
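The localisation criterion is straightforward to implement; a small sketch with interval endpoints in frames (names illustrative):

```python
def temporal_overlap(est, gt):
    """L(E, G) = length(E intersect G) / length(E union G) for temporal
    intervals est = (start, end) and gt = (start, end); localisation is
    counted as successful when the returned value is >= 0.5."""
    inter = max(0.0, min(est[1], gt[1]) - max(est[0], gt[0]))
    union = (est[1] - est[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```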
4.2 Results
Figure 2a shows a precision-recall graph using our optimal setup over the whole
MSR II dataset, after various iterations of imperfect relevance feedback. Preci-
sion and recall are usually used in the context of binary relevance. To use these
metrics with localised results, however, we need a way of determining whether
an imperfectly localised result is still relevant. In [3], the authors determined
relevance of a result differently for precision and recall. However, we contend
that this method creates an unintuitive statistic, which cannot be interpreted in
the same way as traditional precision-recall. We use the single, stricter criterion
L, defined above, for both precision and recall.
5 Discussion
We have created and demonstrated the use of an efficiency-focused video retrieval
system with localisation. Our relatively simple localisation search can still give
practical results, but completes in a fraction of the time of any previously re-
ported algorithm. We have additionally looked at the application of relevance
feedback in a retrieval context, and have shown that both user-adjusted and
imperfect feedback can be used to improve results significantly.
Our proposed method's primary weakness, compared to existing algorithms, lies in its inability to separate spatially distinct background noise from the results, which may cause incorrect ranking of the candidates. This has not significantly affected our results on MSR II, but on more complex datasets, such as HMDB [20], it may become a problem, particularly as a larger number of action classes may decrease accuracy [21]. In future work, we will investigate ways to spatially isolate
actions without the performance costs associated with branch-and-bound derived
methods. Additionally, further experimentation needs to be done on more complex
datasets, such as HMDB [20], to prove the algorithm’s general applicability.
References
1. Zhang, H.J., Wu, J., Zhong, D., Smoliar, S.: An Integrated System for Content-
based Video Retrieval and Browsing. Pattern Recognition 30(4), 643–658 (1997)
2. Jones, S., Shao, L., Zhang, J., Liu, Y.: Relevance Feedback for Real-World Human
Action Retrieval. Pattern Recognition Lett. 33(4), 446–452 (2012)
3. Yu, G., Yuan, J., Liu, Z.: Unsupervised Random Forest Indexing for Fast Ac-
tion Search. In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition,
pp. 865–872 (2011)
4. Rahmani, R., Goldman, S.A., Zhang, H., Krettek, J., Fritts, J.E.: Localized Content
Based Image Retrieval. In: ACM SIGMM Int. Conf. Multimedia Inform. Retrieval,
pp. 227–236 (2005)
5. Zhang, D., Wang, F., Shi, Z., Zhang, C.: Interactive Localized Content Based Im-
age Retrieval With Multiple-Instance Active Learning. Pattern Recognition 43(2),
478–484 (2010)
6. Ryoo, M., Aggarwal, J.: Spatio-temporal Relationship Match: Video Structure
Comparison for Recognition of Complex Human Activities. In: IEEE Int. Conf.
Comput. Vision, pp. 1593–1600 (2009)
7. Poppe, R.: A survey on vision-based human action recognition. Image and Vision
Computing 28(6), 976–990 (2010)
8. Davis, J.W., Bobick, A.F.: The Representation and Recognition of Human Move-
ment Using Temporal Templates. In: Proc. IEEE Conf. Comput. Vision and Pat-
tern Recognition, p. 928 (1997)
9. Laptev, I.: On Space-Time Interest Points. Int. J. Comput. Vision 64(2-3), 107–123
(2005)
10. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition via Sparse
Spatio-Temporal Features. In: Proc. IEEE Workshop Visual Surveillance and Per-
formance Evaluation Tracking and Surveillance, pp. 65–72 (2005)
11. Shao, L., Du, Y.: Spatio-temporal Shape Contexts for Human Action Retrieval.
In: Proc. Int. Workshop Interactive Multimedia Consumer Electronics, pp. 43–50
(2009)
12. Choi, J., Jeon, W.J., Lee, S.-C.: Spatio-temporal pyramid matching for sports
videos. In: ACM SIGMM Int. Conf. Multimedia Inform. Retrieval, pp. 291–297
(2008)
13. Kläser, A., Marszalek, M., Schmid, C.: A Spatio-Temporal Descriptor Based on
3D-Gradients. In: Proc. British Mach. Vision Conf., pp. 995–1004 (2008)
14. Shao, L., Mattivi, R.: Feature Detector and Descriptor Evaluation in Human Action
Recognition. In: Proc. ACM Int. Conf. Image and Video Retrieval, pp. 477–484
(2010)
15. Kläser, A., Marszalek, M., Schmid, C., Zisserman, A.: Human Focused Action
Localization in Video. In: International Workshop on Sign, Gesture, Activity (2010)
16. Sullivan, J., Carlsson, S.: Recognizing and Tracking Human Action. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part I. LNCS, vol. 2350,
pp. 629–644. Springer, Heidelberg (2002)
17. Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval.
In: Proc. ACM Multimedia, pp. 107–118 (2001)
18. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric Bagging and Random Subspace
for Support Vector Machines-Based Relevance Feedback in Image Retrieval. IEEE
Trans. Pattern Anal. Mach. Intell. 28, 1088–1099 (2006)
19. Cao, L., Liu, Z., Huang, T.: Cross-dataset Action Detection. In: Proc. IEEE Conf.
Comput. Vision and Pattern Recognition, pp. 1998–2005 (2010)
20. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A Large Video Database for Human Motion Recognition. In: IEEE Int. Conf. Comput. Vision (2011)
21. Reddy, K., Shah, M.: Recognizing 50 human action categories of web videos. Mach.
Vision and Applicat., 1–11 (2012)
Deformable Shape Reconstruction
from Monocular Video with Manifold Forests
1 Introduction
Deformable shape recovery from a single uncalibrated camera is a challenging, under-
constrained problem. The methods proposed to deal with this problem can be divided
into three main categories: Low-rank shape models [9], shape trajectory approaches
[1,5,6] and template based methods [13]. Most of the existing methods are restricted by
the fact that they try to explain the complex deformations using a linear model.
Recent methods have integrated manifold learning algorithms to regularise the shape reconstruction problem by constraining the shapes to be well represented by the learned manifold. Using shape embedding as an initialisation was introduced in [10]. Hamsici et al. [7] modelled the shape coefficients in a manifold feature space. The mapping was learned from the corresponding 2D measurement data of the upcoming reconstructed shapes, rather than from a fixed set of trajectory bases.
Contrary to other techniques using manifolds in shape reconstruction, our manifold is learned from the 3D shapes rather than from 2D observations. The proposed implementation is based on the manifold forest method described in [4]. The main advantage of using a manifold forest, as compared for example to standard diffusion maps [3], is the fact that in the manifold forest the neighbourhood topology is learned from the data itself rather than being defined by the Euclidean distance. To the best of the authors' knowledge, the random forest technique has never been applied in the context of non-rigid shape reconstruction. This work is the first to integrate the ideas of manifold forests and deformable shape reconstruction.
2 Manifold Forest
Random forests have become a popular method, given their capability to handle
high-dimensional data, to avoid over-fitting efficiently without pruning, and to operate
in parallel. This section gives a brief review of randomized decision forests and
their use in learning diffusion map manifolds. Although other choices are possible, this
paper focuses only on binary decision forests.
where $|\cdot|$ denotes the cardinality of a set, $X_m$ denotes the training data $X$ reaching
node $m$, and $X_m^L$, $X_m^R$ are the subsets assigned to the left and right child nodes of node $m$.
In the paper it is assumed that the data are adequately represented by a Gaussian distribution
[4]. In that case the differential entropy $H(X_m)$ can be calculated analytically as
$$H(X_m) = \frac{1}{2}\log\left(2\pi e\,|\Lambda|\right) \qquad (4)$$
where $\Lambda$ is the covariance matrix of $X_m$. The trees are trained until the number of
samples in a leaf falls below a pre-specified limit or the depth of the tree exceeds
a pre-defined maximum.
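As an illustration of this training criterion, the following is a minimal NumPy sketch of the Gaussian node entropy of Eq. (4) and of a cardinality-weighted information gain of the kind used in [4] to select splits; the function names, the small regularisation term and the use of the full multivariate form $(2\pi e)^d$ are assumptions made here for a runnable example, not details taken from the paper.

import numpy as np

def gaussian_entropy(X, eps=1e-9):
    # Differential entropy of a Gaussian fitted to the samples in X (rows),
    # computed as 0.5 * log((2*pi*e)^d |Lambda|) via a stable log-determinant.
    d = X.shape[1]
    cov = np.cov(X, rowvar=False) + eps * np.eye(d)   # regularise small nodes
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet

def information_gain(X_m, X_left, X_right):
    # Entropy reduction of splitting the data X_m reaching node m into the
    # subsets X_m^L and X_m^R, weighted by their cardinalities.
    n = float(len(X_m))
    children = (X_left, X_right)
    return gaussian_entropy(X_m) - sum(len(Xc) / n * gaussian_entropy(Xc) for Xc in children)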
Once the random forest has been trained, a new sample can simply be passed through
each tree. Depending on the result of the decision function at each internal node, the new
data is sent to the left or right child node until it arrives at a leaf. The samples ending
up in the same leaf are likely to be statistically similar and are expected to represent the
same neighbourhood of the manifold. As this similarity measure is statistical in nature,
the result is averaged over many decision trees. If two samples end up in the same
leaf for the majority of the trees, they are considered to be drawn from a similar location
on the manifold.
For a single tree $t$, each pair of samples $X_i$, $X_j$ is assigned an affinity using a binary
affinity model:
$$L^t(X_i, X_j) = \begin{cases} 0 & l(X_i) = l(X_j) \\ \infty & \text{otherwise} \end{cases} \qquad (5)$$
where $l(X_i)$ denotes the leaf reached by $X_i$.
The binary model is simple and efficient and can be considered parameter-free.
However, as the affinity matrix calculated from a single tree is not representative, an
ensemble of $T$ trees is used to calculate a more accurate affinity matrix $\mathbf{W}$ by averaging
the affinity matrices obtained from the single trees: $\mathbf{W} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{W}^t$.
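A minimal sketch of this ensemble affinity is shown below, assuming that the leaf reached by every sample in every tree is available and using the convention that the binary model of Eq. (5) translates into an affinity of 1 for pairs sharing a leaf and 0 otherwise; the array layout is illustrative.

import numpy as np

def forest_affinity(leaf_ids):
    # leaf_ids: (T, N) array, leaf_ids[t, i] is the leaf reached by sample i in tree t.
    # Returns W = (1/T) sum_t W^t with W^t[i, j] = 1 if l(X_i) = l(X_j) in tree t, else 0.
    T, N = leaf_ids.shape
    W = np.zeros((N, N))
    for t in range(T):
        same_leaf = leaf_ids[t][:, None] == leaf_ids[t][None, :]
        W += same_leaf.astype(float)
    return W / T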
Coifman et al. [3] presented a justification for using the normalised graph Laplacian
by connecting it to the diffusion distance. Each entry of the diffusion operator $G$
is constructed as $G(X_i, X_j) = \widetilde{W}_{ij}/\Upsilon_{ii}$ with $\Upsilon_{ii} = \sum_j \widetilde{W}_{ij}$, where $\widetilde{\mathbf{W}}$ is a renormalised
affinity matrix obtained from $\mathbf{W}$ using an anisotropic normalised graph Laplacian, such
that $\widetilde{W}_{ij} = W_{ij}/(q_i q_j)$ with $q_i = \sum_j W_{ij}$ and $q_j = \sum_i W_{ji}$. The convergence of the
optimal embedding $\Psi$ for diffusion maps is proven in [3]; it is found via the eigenvectors
$\varphi$ and the corresponding $n$ largest eigenvalues $\lambda$ of the operator $G$, with
$1 = \lambda_0 > \lambda_1 \geq \ldots \geq \lambda_n$,
$$\Psi : X_i \rightarrow \left[\lambda_1 \varphi_1(X_i), \cdots, \lambda_n \varphi_n(X_i)\right]^T \qquad (6)$$
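The embedding of Eq. (6) can be computed numerically as in the following NumPy sketch, which assumes a symmetric affinity matrix $\mathbf{W}$ with strictly positive row sums; it is an illustration, not the authors' implementation.

import numpy as np

def diffusion_map(W, n):
    # Anisotropic renormalisation: W~_ij = W_ij / (q_i q_j), q_i = sum_j W_ij.
    q = W.sum(axis=1)
    W_tilde = W / np.outer(q, q)
    upsilon = W_tilde.sum(axis=1)                 # Upsilon_ii = sum_j W~_ij
    # Diffusion operator G = W~ / Upsilon; eigendecompose its symmetric conjugate.
    S = W_tilde / np.sqrt(np.outer(upsilon, upsilon))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    phi = vecs / np.sqrt(upsilon)[:, None]        # right eigenvectors of G
    # Drop the trivial pair (lambda_0 = 1) and scale coordinates by the eigenvalues.
    return vals[1:n + 1] * phi[:, 1:n + 1]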
Inverse Mapping: Given a point $\mathbf{b} \in \mathbb{R}^n$ in the reduced space, finding its inverse mapping
$S_t = \Psi^{-1}(\mathbf{b})$ from the feature space back to the input space is a typical pre-image
problem. As claimed in [2], the exact pre-image might not exist if the shape $S_t$ has not
been seen in the training set. However, according to the properties of isometric mapping,
if the points in the reduced space are relatively close, the corresponding shapes in the
high dimensional space should represent similar shapes since they have small diffusion
distances. Based on this, the point $\mathbf{b}_t$ can be approximated as a linear combination of
its weighted neighbouring points in the feature space, such that $\mathbf{b}_t = \sum_{l=1}^{n+1}\theta_{tl}\mathbf{x}_{tl}$, where
$\mathbf{x}_{tl}$ is the $l$th nearest point of $\mathbf{b}_t$ and the weights $\theta_{tl}$ are computed as the barycentric
coordinates of $\mathbf{b}_t$. Once the weights are estimated, the shape $S_t$ can be calculated as
well based on a set of weighted training samples, $S_t = \sum_{l=1}^{n+1}\theta_{tl}X_{tl}$, where the training
samples $X_{tl}$ are the pre-images of $\mathbf{x}_{tl}$ and are equivalent to the basis shapes in Eq. 1.
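A minimal sketch of this inverse mapping is given below; Y_train and X_train denote the embedded training points and the corresponding (flattened) training shapes, and the least-squares solution of the barycentric system is an implementation choice made here for illustration.

import numpy as np

def pre_image(b, Y_train, X_train):
    # Find the n+1 nearest embedded neighbours of b and express b in their
    # barycentric coordinates (weights sum to one and reproduce b).
    n = Y_train.shape[1]
    idx = np.argsort(np.linalg.norm(Y_train - b, axis=1))[:n + 1]
    Y_nb, X_nb = Y_train[idx], X_train[idx]
    A = np.vstack([Y_nb.T, np.ones(n + 1)])       # sum_l theta_l y_l = b and sum_l theta_l = 1
    rhs = np.append(b, 1.0)
    theta, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    # Apply the same weights to the pre-images (training shapes) to obtain S_t.
    return theta @ X_nb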
A number of experiments were carried out to evaluate the proposed method. Several
state-of-the-art algorithms were compared in these experiments:
RF: the proposed random forest method; DM: the diffusion maps based method, which is
similar to RF except that the manifold learning is implemented without the random
forest [11]; MP: the metric projection method [9]; PTA: the discrete cosine
transform (DCT) based point trajectory approach [1]; CSF: the column space fitting
method [5]; KSFM: the kernel non-rigid structure from motion approach [6]; IPCA:
the incremental principal component analysis based method [12].
The data used for evaluation include: two articulated face sequences,
surprise and talking, both captured using a passive 3D scanner with 3D tracking of 83
facial landmarks [8]; two surface models, cardboard and cloth [13]; two human actions,
walking and stretch; and three dance sequences, dance, Indian dance and Capoeira,
from the CMU motion capture database1. This paper does not focus on feature detection
and tracking; in the experiments described here the 3D points are known and were
projected onto the image sequences under the orthographic camera model and subsequently
used as features. Diffusion maps require a training process, so training datasets
for the two face sequences were taken from BU-3DFE [14] and for the two surface
sequences the data were obtained from [13]. Since no separate training data are provided
for the CMU database, half of each sequence was used for manifold learning and the other
half for testing. All the training data were rigidly co-registered, and the same testing
data were used with the methods that do not require training.
Measurement Data with Noise: In order to assess the performance of the recon-
struction methods when the observed data is corrupted by noise, the next experiment
1 The data was obtained from http://mocap.cs.cmu.edu
Fig. 1. Reconstruction 3D error as a function of the number of bases n. (left) Errors produced
by RF; bars left to right: group of small deformation sequences, large deformation sequences, all
dance sequences, all the sequences. (right) Comparison results on the stretch sequence.
Fig. 2. Reconstruction results on walking (left) and capoeira (right) sequences with Gaussian
noise
compared the RF method against previously proposed methods in terms of the shape
reconstruction error expressed as a function of the level of noise in the observed data. We ran
10 trials for each experiment at each level of noise, using the walking and capoeira sequences
respectively. It can be noticed that although the performance of all six algorithms decreases
with the level of noise, the two non-linear methods DM and RF are clearly superior
and achieve smaller standard deviations, whereas the others are quite sensitive, with large
mean errors and error dispersion. Even though RF and DM provide comparable performance
on walking, as expected RF outperforms DM in the recovery of more
complex deformations, e.g. the capoeira sequence.
Table 1. Relative normalised mean reconstruction 3D error in percentages for KSFM, IPCA, DM
and RF methods. The optimal number of bases n, for which the 3D errors are shown in the table,
is given in brackets for each tested method.

                KSFM       IPCA    DM         RF: Initial   RF: No Opt.   RF: Opt.
Surprise        3.81(4)    12.89   3.52(10)   31.54         29.29         2.41(15)
Talking         4.98(4)    9.86    3.50(10)   96.57         8.37          3.43(10)
Cardboard       27.53(2)   24.45   10.64(10)  26.74         16.06         9.40(10)
Cloth           18.06(2)   19.09   2.87(7)    29.67         17.29         2.54(7)
Walking         10.29(5)   32.64   2.65(9)    35.02         16.31         3.69(15)
IndianDance     23.43(7)   34.40   9.81(10)   29.69         12.82         5.55(15)
Capoeira        23.76(7)   40.59   2.58(9)    40.59         29.2          0.54(10)
Stretch         7.36(12)   19.18   6.87(6)    26.23         17.08         5.88(10)
Dance           23.69(4)   30.58   16.76(7)   26.08         15.30         11.69(15)
Fig. 3. Reconstruction results on the Indian Dance sequence using KSFM and RF. Reconstructed
3D shapes (circles), with ground truth (dots), are shown.
For the RF method, the initialisation error and the errors produced by the proposed
algorithm with and without non-linear refinement are presented. The errors shown in
the table correspond to the optimal selection of n, obtained by running trials with n
varying from 2 to 15; the best n value for each tested method is given in brackets. The
reconstructed shapes are aligned using a single global rotation based on Procrustes
alignment [1]. As shown in the table, RF performs better than the other methods,
especially for large deformations. Even though the initial error is large, the RF method
is still able to provide accurate reconstruction results.
Fig. 3 shows three randomly selected reconstructed shapes from the Indian Dance
sequence using the KSFM and RF methods. More comparison results for DM against other
methods can be found in [11].
Fig. 4. Selected 2D frames from the video sequence of a paper bending. Front and top views of
the corresponding 3D reconstructed results using the proposed method (RF), PTA and KSFM.
Real Data: The algorithms used in the motion capture experiments above are applied
to real data in Fig. 4. In the video, 81 point features were tracked over 61 frames showing
approximately two periods of paper bending movement.
5 Conclusions
In this paper a new approach for monocular reconstruction of non-rigid object is de-
scribed. The method performs particularly well, when compared to other methods,
especially for large and complex deformations. The method combines the ideas of non-
linear manifold learning and deformable shape reconstruction. The non-linear manifold
has been build upon diffusion maps with random forests used to estimate local manifold
neighbourhood topology. The method has the potential to be extended to handle cases
with missing data and to be implemented for real time reconstructions. The proposed
method shows a significant improvement for the reconstruction of large deformable
objects, even though, due to the lack of training data, the manifold is built using only
limited number of shapes. Further possible improvements include building a sufficiently
dense representation of the manifold by collecting and generating more training data.
References
1. Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Trajectory space: A dual representation for
nonrigid structure from motion. IEEE PAMI 33, 1442–1456 (2011)
2. Arias, P., Randall, G., Sapiro, G.: Connecting the out-of sample and pre-image problems in
kernel methods. In: ICPR, pp. 1–8 (2007)
3. Coifman, R., Lafon, S.: Diffusion maps. Appl. Comp. Harm. Anal. 21, 5–30 (2006)
4. Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: A unified framework for clas-
sification, regression, density estimation, manifold learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Computer Vision 7, 81–227 (2012)
5. Gotardo, P., Martinez, A.M.: Computing smooth time-trajectories for camera and deformable
shape in structure from motion with occlusion. IEEE PAMI 33, 2051–2065 (2011)
6. Gotardo, P., Martinez, A.M.: Kernel non-rigid structure from motion. In: ICCV, pp. 802–809
(2011)
7. Hamsici, O.C., Gotardo, P.F.U., Martinez, A.M.: Learning spatially-smooth mappings in non-
rigid structure from motion. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid,
C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575, pp. 260–273. Springer, Heidelberg (2012)
8. Matuszewski, B., Quan, W., Shark, L.-K., McLoughlin, A., Lightbody, C., Emsley, H.,
Watkins, C.: Hi4d–adsip 3d dynamic facial articulation database. Image and Vision Com-
puting 10, 713–727 (2012)
9. Paladini, M., Bue, A., Xavier, J., Stosic, M., Dodig, M., Agapito, L.: Factorization for non-
rigid and articulated structure using metric projections. In: CVPR, pp. 2898–2905 (2009)
10. Rabaud, V., Belongie, S.: Linear embeddings in non-rigid structure from motion. In: CVPR,
pp. 2427–2434 (2009)
11. Tao, L., Matuszewski, B.J.: Non-rigid structure from motion with diffusion maps prior. In:
CVPR (2013)
12. Tao, L., Matuszewski, B.J., Mein, S.J.: Non-rigid structure from motion with incremental
shape prior. In: ICIP, pp. 1753–1756 (2012)
13. Varol, A., Salzmann, M., Fua, P., Urtasun, R.: A constrained latent variable model. In: CVPR,
pp. 2248–2255 (2012)
14. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.: A 3d face expression database for facial
behavior research. In: AFGR, pp. 211–216 (2006)
Multi-SVM Multi-instance Learning
for Object-Based Image Retrieval
1 Introduction
With the explosive growth in the number of digital images, effective and efficient
retrieval techniques are urgently needed. Since a user usually pays attention to
some object instead of the whole image, if only overall characteristics are used
for image description, the retrieval performance is often unsatisfactory. To deal
with this problem, object-based (or localized content-based) image retrieval has been
proposed, and much related work has been developed [1], [2].
As an effective approach to describe the relationship between whole and part,
multi-instance learning has been widely used in image analysis [3], [4]. In this
learning framework, each sample is called a bag, and contains several instances.
The available labels are only assigned to bags, and the relationship between bags
and instances is that a bag is positive if at least one instance in it is positive,
and negative otherwise. In order to cast object-based image retrieval into the
framework of multi-instance learning, images are first segmented into regions,
and then images and regions are treated as bags and instances, respectively.
In this way, image $I_m$ can be described by either a vector $\mathbf{x}_{I_m}$ or a set of vectors
$\{\mathbf{x}_{R_n} \,|\, R_n \in I_m\}$. For the image-level representation, a standard linear SVM can be
constructed by solving
$$\min_{\mathbf{w}^I, b^I, \boldsymbol{\xi}^I} Q_1 = \min_{\mathbf{w}^I, b^I, \boldsymbol{\xi}^I} \frac{1}{2}\|\mathbf{w}^I\|^2 + C^I\sum_{m=1}^{M}\xi^I_m \qquad (1)$$
$$\text{s.t.} \quad y_m f(\mathbf{x}_{I_m}) \geq 1 - \xi^I_m, \quad \xi^I_m \geq 0, \quad (m = 1, 2, \cdots, M),$$
where $\mathbf{w}^I$ and $b^I$ are classifier parameters and the classification function is defined
as $f(\mathbf{x}_{I_m}) = \mathbf{w}^I \cdot \mathbf{x}_{I_m} + b^I$, "$\cdot$" denotes the inner product of two vectors,
$\xi^I_m$ $(m = 1, 2, \cdots, M)$ are slack variables, and $C^I > 0$ is a penalty parameter.
For the region-level representation, as each image is described by a set of feature
vectors, a standard SVM cannot be directly adopted. According to the basic idea
of multi-instance learning, the classification result of a bag can be determined
by the instance with the maximum classification function value. In this way,
multi-instance learning is introduced into the framework of SVM [12], and a
linear multi-instance SVM with region-level features can be constructed by
$$\min_{\mathbf{w}^R, b^R, \boldsymbol{\xi}^R} Q_2 = \min_{\mathbf{w}^R, b^R, \boldsymbol{\xi}^R} \frac{1}{2}\|\mathbf{w}^R\|^2 + C^R\sum_{m=1}^{M}\xi^R_m \qquad (2)$$
$$\text{s.t.} \quad y_m \max_{R_n \in I_m} g(\mathbf{x}_{R_n}) \geq 1 - \xi^R_m, \quad \xi^R_m \geq 0, \quad (m = 1, 2, \cdots, M).$$
With fixed $\{\mathbf{w}^R, b^R, \boldsymbol{\xi}^R\}$, the optimization problem (5) becomes
$$\min_{\mathbf{w}^I, b^I, \boldsymbol{\xi}^I} \frac{1}{2}\|\mathbf{w}^I\|^2 + C^I\sum_{m=1}^{M}\xi^I_m + \beta\sum_{m=1}^{M}\left(\max\left(y_m f(\mathbf{x}_{I_m}), 1\right) - \Delta^R_m\right)^2 \qquad (6)$$
$$\text{s.t.} \quad y_m f(\mathbf{x}_{I_m}) \geq 1 - \xi^I_m, \quad \xi^I_m \geq 0, \quad (m = 1, 2, \cdots, M),$$
where $\Delta^R_m = \max\left(y_m \max_{R_n \in I_m} g(\mathbf{x}_{R_n}), 1\right)$. By introducing new variables
$\lambda^I_m$ $(m = 1, 2, \cdots, M)$, we can further rewrite (6) as
$$\min_{\mathbf{w}^I, b^I, \boldsymbol{\xi}^I, \boldsymbol{\lambda}^I} \frac{1}{2}\|\mathbf{w}^I\|^2 + C^I\sum_{m=1}^{M}\xi^I_m + \beta\sum_{m=1}^{M}\left(1 + \lambda^I_m - \Delta^R_m\right)^2 \qquad (7)$$
$$\text{s.t.} \quad y_m f(\mathbf{x}_{I_m}) \geq 1 - \xi^I_m, \quad \xi^I_m \geq 0, \quad (m = 1, 2, \cdots, M);$$
$$\qquad\; y_m f(\mathbf{x}_{I_m}) \leq 1 + \lambda^I_m, \quad \lambda^I_m \geq 0, \quad (m = 1, 2, \cdots, M).$$
By converting (7) to its dual problem, the resulting quadratic programming problem can
be solved efficiently.
With fixed $\{\mathbf{w}^I, b^I, \boldsymbol{\xi}^I\}$, the optimization problem (5) becomes
$$\min_{\mathbf{w}^R, b^R, \boldsymbol{\xi}^R} \alpha\,\frac{1}{2}\|\mathbf{w}^R\|^2 + C^R\sum_{m=1}^{M}\xi^R_m + \beta\sum_{m=1}^{M}\left(\max\left(y_m \max_{R_n \in I_m} g(\mathbf{x}_{R_n}), 1\right) - \Delta^I_m\right)^2 \qquad (8)$$
$$\text{s.t.} \quad y_m \max_{R_n \in I_m} g(\mathbf{x}_{R_n}) \geq 1 - \xi^R_m, \quad \xi^R_m \geq 0, \quad (m = 1, 2, \cdots, M),$$
where $\Delta^I_m = \max\left(y_m f(\mathbf{x}_{I_m}), 1\right)$. To deal with this problem, similarly to [12], a
selector variable is defined for each image $I_m$ as the region with the maximum
classification function value,
$$S_m = \arg\max_{R_n \in I_m} g(\mathbf{x}_{R_n}), \qquad (9)$$
so that (8) reduces to
$$\min_{\mathbf{w}^R, b^R, \boldsymbol{\xi}^R} \alpha\,\frac{1}{2}\|\mathbf{w}^R\|^2 + C^R\sum_{m=1}^{M}\xi^R_m + \beta\sum_{m=1}^{M}\left(\max\left(y_m g(\mathbf{x}_{R_{S_m}}), 1\right) - \Delta^I_m\right)^2 \qquad (10)$$
$$\text{s.t.} \quad y_m g(\mathbf{x}_{R_{S_m}}) \geq 1 - \xi^R_m, \quad \xi^R_m \geq 0, \quad (m = 1, 2, \cdots, M).$$
It can be seen that (10) has the same form as (6). If the values of $S_m$
$(m = 1, 2, \cdots, M)$ are determined, (10) can also be solved by the aforementioned
method. Therefore, the final solution of the original problem (8) can be obtained
by iteratively calculating (9) and (10).
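The following is a simplified Python sketch of this alternation, written with scikit-learn for brevity; it replaces the coupled objectives (6)-(10) by two off-the-shelf SVMs and only illustrates the selector-based iteration of Eq. (9), so the coupling terms weighted by β are deliberately omitted and all names and parameters are illustrative rather than the authors' exact method.

import numpy as np
from sklearn.svm import SVC

def train_two_svms(X_img, bags, y, n_iter=10, gamma=1e-3, C=1000.0):
    # X_img: (M, D) image-level features; bags: list of (n_m, d) region feature arrays;
    # y: (M,) bag labels in {+1, -1}.
    img_svm = SVC(kernel='rbf', gamma=gamma, C=C).fit(X_img, y)   # image-level classifier
    reg_svm = SVC(kernel='rbf', gamma=gamma, C=C)
    sel = np.array([b.mean(axis=0) for b in bags])                # initial bag representatives
    for _ in range(n_iter):
        reg_svm.fit(sel, y)                                       # region-level step
        # Selector of Eq. (9): the region with the maximum classification function value.
        new_sel = np.array([b[np.argmax(reg_svm.decision_function(b))] for b in bags])
        if np.allclose(new_sel, sel):
            break                                                 # selectors have stabilised
        sel = new_sel
    return img_svm, reg_svm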
So far, only linear SVMs are adopted in our proposal. As only the inner product
of feature vectors is involved in the dual problem for solving (6) and (8), the kernel
trick can be easily introduced in our proposed optimization framework, and
nonlinear SVMs can also be utilized conveniently.
After the iterative process has converged, all the parameters for the two clas-
sifiers can be calculated. For a database image It with image-level feature xIt
3 Experimental Results
The proposed method is evaluated on the SIVAL (Spatially Independent, Vari-
able Area, and Lighting) image benchmark, which is widely used for multi-
instance learning. It consists of 25 different categories, each of which includes 60 images.
The images in one category contain the same object photographed against highly
diverse backgrounds. The object may occur anywhere in the images and may be
photographed at a wide-angle or close up. Some example images are shown in
Fig. 1. All the images in the data set have been segmented, and each region is
represented by a 30-dimensional feature vector.
We conduct 30 independent runs for all the categories in the database. In
each category, 8 positive and 8 negative images are randomly selected as training
samples. To compare with other methods, we also use the area under the receiver
operating characteristic curve (AUC) as the performance measure.
For image representation, the method of locality-constrained linear coding [13]
is adopted for constructing image-level features, and the region-level features
provided by the data set are directly used after normalization. The parameters
in the optimization framework are set as follows. The penalty parameters C I
and C R are set to 1000. The combination coefficients α and β are set to 1 and
100, respectively. Nonlinear SVMs are constructed in the experiments, in which
the Gaussian kernel $K(\mathbf{u}, \mathbf{v}) = \exp\left(-\gamma\|\mathbf{u} - \mathbf{v}\|^2\right)$ is adopted, and $\gamma$ is set to 0.001.
The methods used for comparison include multi-graph multi-instance learning
(MGMIL) [10], multi-instance learning based on region-level graph (GMIL) [8],
support vector machine with evidence region identification (EC-SVM) [6], as well
as image-level semi-supervised multi-instance learning (MISSL) [5]. The average
Table 1. Average AUC values and 95%-confidence intervals over 30 independent runs
on the SIVAL data set
AUC values and the 95%-confidence intervals for our proposal and the other
four methods are listed in Table 1. We can see that the overall performance of
our proposal is the best. In all the 25 categories, our proposal achieves the highest
AUC values on 14 categories. Especially for "RapBook", "TranslucentBowl", and
"BlueScrunge", the performance is improved by more than 8%. MGMIL also
adopts both image- and region-level representations. As image-level features can
provide additional information, it outperforms GMIL, EC-SVM and MISSL, in
which only region features are considered. Comparing MGMIL with our proposal,
the main difference is that graph-based learning is involved in MGMIL, while
SVM is introduced as the basic classifier in our method. The superiority of our
proposal demonstrates the advantage of exploring information in the feature
space over analyzing relationships between graph nodes.
As far as computational load is concerned, we also compare our proposal with
MGMIL. Both the methods develop optimization frameworks based on two kinds
of features, but the numbers of involved variables are different. MGMIL wants
to calculate the soft labels for images and regions, while our proposal aims to
construct effective SVMs. In general, the number of all the images and regions is
larger than the number of classifier parameters, hence our proposal often costs
less time than MGMIL.
4 Conclusions
In this paper, a novel multi-SVM multi-instance learning method is proposed
for object-based image retrieval. According to the two kinds of representations,
image-level SVM and region-level multi-instance SVM are adopted respectively.
For comprehensively utilizing the available information, a unified optimization
framework is developed, and the relationship between images and their seg-
mented regions is also taken into consideration to avoid constructing the two
classifiers separately. An iterative approach is introduced to solve the optimiza-
tion problem. It is demonstrated that our proposal is both effective and efficient
for image retrieval.
References
1. Rahmani, R., Goldman, S.A., Zhang, H., Krettek, J., Fritts, J.E.: Localized con-
tent based image retrieval. In: Proc. ACM SIGMM Int. Workshop Multimedia
Information Retrieval, pp. 227–236 (2005)
2. Zheng, Q.-F., Wang, W.-Q., Gao, W.: Effective and efficient object-based image
retrieval using visual phrases. In: Proc. ACM Int. Conf. Multimedia, pp. 77–80
(2006)
3. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-instance learning via embedded
instance selection. IEEE Trans. Pattern Analysis and Machine Intelligence 28(12),
1931–1947 (2006)
4. Feng, S., Xu, D.: Transductive multi-instance multi-label learning algorithm with
application to automatic image annotation. Expert Systems with Applications 37,
661–670 (2010)
5. Rahmani, R., Goldman, S.A.: MISSL: Multiple-instance semi-supervised learning.
In: Proc. Int. Conf. Machine Learning, pp. 705–712 (2006)
6. Li, W.-J., Yeung, D.-Y.: Localized content-based image retrieval through evidence
region identification. In: Proc. IEEE Int. Conf. Computer Vision and Pattern
Recognition, pp. 1666–1673 (2009)
7. Tang, J., Hua, X.-S., Qi, G.-J., Wu, X.: Typicality ranking via semi-supervised
multiple-instance learning. In: Proc. ACM Int. Conf. Multimedia, pp. 297–300
(2007)
8. Wang, C., Zhang, L., Zhang, H.-J.: Graph-based multiple-instance learning for
object-based image retrieval. In: Proc. ACM Int. Conf. Multimedia Information
Retrieval, pp. 156–163 (2008)
9. Tang, J., Li, H., Qi, G.-J., Chua, T.-S.: Image annotation by graph-based infer-
ence with integrated multiple/single instance representations. IEEE Trans. Multi-
media 12(2), 131–141 (2010)
10. Li, F., Liu, R.: Multi-graph multi-instance learning for object-based image and
video retrieval. In: Proc. ACM Int. Conf. Multimedia Retrieval (2012)
11. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York
(1995)
12. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for
multiple-instance learning. In: Advances in Neural Information Processing Systems
(2002)
13. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained
linear coding for image classification. In: Proc. IEEE Int. Conf. Computer Vision
and Pattern Recognition, pp. 3360–3367 (2010)
Maximizing Edit Distance Accuracy
with Hidden Conditional Random Fields
1 Introduction
Handwriting recognition (HWR) aims at transforming a raw image of a handwritten
document into a sequence of characters and words. HWR systems consist
first in performing some preprocessing steps on a sliding window over each text
line, yielding a sequence of real-valued feature vectors, and second in applying
statistical models such as Hidden Markov Models (HMMs) or Conditional
Random Fields (CRFs). These systems take as input a T-length sequence of
observations x and output an L-length sequence of characters y with temporal
boundaries. Their accuracy is systematically evaluated from the edit distance,
which counts the number of character errors (insertions, deletions and replacements)
required to align a predicted string y and the true string y′ (e.g. the word
to recognize), thus ignoring irrelevant temporal boundary shifts. One
can assign various weights to these error types (we used uniform weighting
here). The accuracy of a recognition engine, which will be further denoted
EDA (Edit Distance Accuracy), is defined from the edit distance as follows:
$$\text{accuracy}(y, y') = \text{EDA}(y, y') = \frac{\text{Hits} - \text{Insertions}}{|y'|} \qquad (1)$$
where Hits denotes the number of characters that have been correctly predicted and $|y'|$
denotes the length (the number of characters) of the true string. Most popular
approaches rely on HMMs trained with either a generative or a discriminative criterion
such as Maximum Mutual Information (MMI) [1], Minimum Classification
Error and variants (MCE, P-MCE) [2–4] and Minimum Phone Error (MPE) [5].
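For concreteness, the sketch below computes the EDA of Eq. (1) from a uniform-weight Levenshtein alignment; the function name and the backtracking conventions are illustrative and assume a non-empty ground-truth string.

def edit_distance_accuracy(pred, truth):
    # dp[i][j] = minimal edit cost aligning pred[:i] with truth[:j] (uniform weights).
    n, m = len(pred), len(truth)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (pred[i - 1] != truth[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack the alignment to count hits and insertions.
    i, j, hits, insertions = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (pred[i - 1] != truth[j - 1]):
            hits += pred[i - 1] == truth[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            insertions += 1                 # extra predicted character
            i -= 1
        else:
            j -= 1                          # deletion: a true character was missed
    return (hits - insertions) / len(truth)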
Recently, building on the success of large-margin learning ideas (popularized with
Support Vector Machines), some works have demonstrated the strong potential
of this approach for learning HMMs [7–11]. Most of them focus on optimizing
a Hamming distance loss criterion (a frame-based error measure which takes into
account the temporal boundary mismatch), which is simpler to optimize, while
their performance is measured with EDA.
Recently, pure discriminative models have also been proposed to deal with
sequence labelling tasks. Hidden Conditional Random Fields (HCRFs), which are
CRFs [12] powered by hidden states (like HMMs) [13–15], have been successfully
applied to various signal labelling tasks [16, 14, 13, 17–19]. HCRFs are trained
either to maximize the conditional likelihood of label sequences given observation
sequences, or a margin criterion based on a naive zero-one loss (the sentence is
completely recognized or not) or at best a Hamming distance loss [20, 16, 21].
This work describes an approach for learning HCRFs with a large-margin
criterion to maximize the edit distance accuracy. Although this has often been
mentioned as a perspective in previous works [22], this extension is not straightforward.
We first recall some background on HCRFs, then we detail our algorithm
and provide experimental results on two HWR datasets.
2 Background
2.1 HCRF
HCRFs are discriminative graphical models that have been proven useful for
handling complex signals like handwriting and speech [13–15]. A common approach
consists in using a similar architecture as for HMM systems: a left-right
sub-model for every character to be recognized, where all ending states of all
sub-models are connected to all starting states. Learning and decoding are performed
within this global model. An HCRF defines a conditional probability of a
label sequence y given an observation sequence x as follows:
$$p(\mathbf{y}|\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{h} \in S(\mathbf{y})} p(\mathbf{h}|\mathbf{x}, \mathbf{w}) \qquad (2)$$
where $\Phi(\cdot, \cdot)$ is a real-valued vector, called the joint feature map, which is commonly
defined as a sum of local feature maps, i.e. $\Phi(\mathbf{x}, \mathbf{h}) = \sum_t \varphi(x_t, h_t, h_{t-1})$,
so that all necessary quantities for training and inference may be computed
efficiently by dynamic programming.
This approach has been used in the past both for generative models (HMMs
[21, 20]) and discriminative ones (CRFs and HCRFs [23, 16, 21]).
Although $F(\mathbf{x}, \mathbf{y}, \mathbf{w}) = \log p(\mathbf{y}|\mathbf{x}, \mathbf{w})$ would be a natural choice for learning
HCRFs, a common and simpler choice is the approximation $F(\mathbf{x}, \mathbf{y}, \mathbf{w}) =
\max_{\mathbf{h} \in S(\mathbf{y})} \langle\Phi(\mathbf{x}, \mathbf{h}), \mathbf{w}\rangle$, which amounts to $F(\mathbf{x}, \mathbf{y}, \mathbf{w}) = \max_{\mathbf{h} \in S(\mathbf{y})} \log p(\mathbf{y}, \mathbf{h}|\mathbf{x}, \mathbf{w})
\approx \log p(\mathbf{y}|\mathbf{x}, \mathbf{w})$.
The above optimization problems use as their objective an upper bound of the
Δ-loss. They may be solved using quadratic programming algorithms. However,
since the number of constraints is exponential in the length of the input, standard
solvers cannot be used easily. Efficient algorithms that overcome this
difficulty either rely on an online learning scheme (see Section 3.1) or exploit a
limited-memory algorithm to keep the size of the quadratic program limited [21].
3 Optimization
Our motivation to maximize the edit-distance-based accuracy (EDA) rather
than the Hamming distance accuracy (HDA) comes from the weak correlation
between those measures. Indeed, the scatter plot in Figure 1 shows, for each test
sequence, a point whose coordinates are the HDA and the EDA. An example of
an extreme case is shown on the right part of Figure 1, with a high HDA (i.e. 76%
of the image's columns are correctly labelled) and a low EDA (25%) suffering
from a large number of insertions.
Including the edit distance loss as the penalty term $\Delta(y, y_i) = d_{edit}(y, y_i)$ is far
from straightforward since the objective function is piecewise constant.
All gradient-based algorithms are thus excluded in favour of contrastive-based [22] or
margin-based ones; we investigate the latter methods here.
Fig. 1. Relation between Edit Distance Accuracy and Hamming Distance Accuracy:
EDA vs. HDA scatter plot (left); example of an extreme case with HDA 76% and EDA 25%,
with 7 matching labels, 5 insertions and 1 replacement (right)
In the margin violation case, the optimal solution $\mathbf{w}^*$ can be computed analytically:
$$\mathbf{w}^* = \mathbf{w}^{n-1} + \min\left(C, \frac{\Delta(\hat{y}_{i_n}, y_{i_n}) - \langle\delta\Phi(\hat{\mathbf{h}}), \mathbf{w}\rangle}{\|\delta\Phi(\hat{\mathbf{h}})\|^2}\right)\delta\Phi(\hat{\mathbf{h}}) \quad \text{(MR case)} \qquad (8)$$
where $\delta\Phi(\hat{\mathbf{h}})$ is the difference, in the joint feature space, between the most violating
hidden state sequence $\hat{\mathbf{h}}^{i_n}$ and the hidden state sequence $\mathbf{h}^{i_n}$ matching
the ground truth $y_{i_n}$.
These quantities are defined according to:
$$\hat{y}_{i_n} = \arg\max_{y}\; \Delta(y, y_{i_n}) - \delta F(y_{i_n}, y) \qquad (9)$$
$$\hat{\mathbf{h}}^{i_n} = \arg\max_{\mathbf{h} \in S(\hat{y}_{i_n})} \langle\Phi(\mathbf{x}^{i_n}, \mathbf{h}), \mathbf{w}\rangle \qquad (10)$$
$$\mathbf{h}^{i_n} = \arg\max_{\mathbf{h} \in S(y_{i_n})} \langle\Phi(\mathbf{x}^{i_n}, \mathbf{h}), \mathbf{w}\rangle \qquad (11)$$
$$\delta\Phi(\hat{\mathbf{h}}) = \Phi(\mathbf{x}^{i_n}, \mathbf{h}^{i_n}) - \Phi(\mathbf{x}^{i_n}, \hat{\mathbf{h}}^{i_n}) \qquad (12)$$
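A minimal NumPy sketch of this closed-form update is given below; the argument names are illustrative (phi_truth and phi_violator stand for the joint feature maps of the ground-truth and most violating sequences), and the no-violation shortcut is an implementation assumption.

import numpy as np

def pa_margin_rescaling_update(w, delta_loss, phi_truth, phi_violator, C):
    # delta Phi(h_hat): difference between ground-truth and most violating joint feature maps.
    d_phi = phi_truth - phi_violator
    violation = delta_loss - w @ d_phi          # Delta(y_hat, y) - <delta Phi, w>
    if violation <= 0:
        return w                                # passive step: the margin is satisfied
    tau = min(C, violation / (d_phi @ d_phi))   # aggressive but capped step size
    return w + tau * d_phi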
The E expansions may represent (in the theoretical worst case) a huge number
of explored hypotheses (i.e. alternative strings): $O(\exp(E))$. The empirical complexity
was bounded by $(E/2)^3$. To deal with it, we used an efficient dynamic
programming routine to compute $\hat{\mathbf{h}}^{i_n}$ on the graph by factorizing the edit distance
computation as much as possible.
This part of our approach differs from MPE [5, 6] in that we use the
true EDA (instead of a frame-based "local accuracy" estimation computed from
the character overlapping in a hypothesis lattice).
4 Experiments
(Figure: accuracy as a function of the number of expansions E, for EDM MR, EDM SR, HDM, CML and the method close to [29].)
We used iterate averaging. This technique uses (in the inference process on the validation
and test sets) an average of w over some of the last iterations. In our case, we
performed averaging over the last pass over the whole training set.
We first compare on R16 and R32 (Fig. 3) the accuracies of a few HCRFs
trained by maximizing the following learning criteria: the conditional likelihood
(CML), a margin based on the Hamming distance (denoted HDM), and a margin
based on the edit distance (denoted EDM) with the Slack or Margin Rescaling
variants. We also report the accuracy obtained with an HCRF trained with a
method very similar to [29], using the Margin Rescaling approach
and a 1-expansion graph. It differs from the original method in that we do
not model stay duration in the states.
The EDM approach significantly outperforms the other methods on both datasets.
Both strategies of EDM (Margin and Slack Rescaling) work well, with a slight
advantage for Margin Rescaling. The HDM approach also performs well, although slightly
worse than EDM with the Margin Rescaling strategy. Interestingly, EDM already
outperforms the other methods when exploiting a small number of expansions,
which means a limited complexity overhead, and the method steadily improves
with the number of expansions.
This table compares a number of methods on the more complex and bigger
IAM dataset. All methods have been implemented and tuned by us. These
results show that the margin-based methods (HDM and EDM) clearly outperform
non-discriminative and discriminative training of HMMs (HMM and HDM
HMMs) as well as standard discriminative training of HCRFs (CML). Moreover, the
EDM methods again achieve the best results on this dataset. Although the improvement
over HDM-HCRF is modest, it must be noted that all these results are
already high on this dataset and that any improvement is very hard to obtain.
Finally, it must be noted that we deliberately limited the training complexity
on this dataset by exploiting a rather small number of expansions to fit our
computational power, but one can expect even better results by increasing
the number of expansions and the training time.
5 Conclusion
We proposed a new algorithm for learning HCRFs relying on a max-margin criterion
to directly optimize the edit distance accuracy in the passive-aggressive
framework. We detailed a lattice-based approach allowing a factored computation
of the Levenshtein distance for negative example selection. We finally
showed the benefits of this approach on a few handwriting labelling tasks with
respect to a number of alternative discriminative learning schemes.
References
1. Woodland, P.C., Povey, D.: Large scale discriminative training of hidden markov
models for speech recognition. Computer Speech & Language (1) (2002)
2. Juang, B.H., Katagiri, S.: Discriminative learning for minimum error classification.
IEEE Transactions on Signal Processing (12) (1992)
3. Fu, Q., He, X., Deng, L.: Phone-discriminating minimum classification error
(p-mce) training for phonetic recognition. In: Interspeech (2007)
4. He, X., Deng, L., Chou, W.: A novel learning method for hidden markov models
in speech and audio processing. In: Multimedia Signal Processing. IEEE (2006)
5. Povey, D., Woodland, P.C.: Minimum phone error and i-smoothing for improved
discriminative training. In: ICASSP, vol. 1, p. I–105. IEEE (2002)
6. Deng, L., Wu, J., Droppo, J., Acero, A.: Analysis and comparison of two speech
feature extraction/compensation algorithms. In: SPL (2005)
7. Cheng, C.-C., Sha, F., Saul, L.K.: Online learning and acoustic feature adaptation
in large-margin hidden markov models. JSP (6) (December 2010)
8. Sha, F., Saul, L.K.: Large margin hidden markov models for automatic speech
recognition. In: NIPS (2007)
9. Cheng, C.C., Sha, F., Saul, L.K.: A fast online algorithm for large margin training
of continuous density hidden markov models. In: Interspeech (2009)
10. Do, T.M.T., Artieres, T.: Maximum margin training of gaussian hmms for hand-
writing recognition. In: ICDAR, pp. 976–980. IEEE Computer Society (2009)
11. Yu, D., Deng, L., He, X., Acero, A.: Large-margin minimum classification error
training for large-scale speech recognition tasks. In: ICASSP (2007)
12. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In: ICML Workshop (2001)
13. Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random
fields for phone classification. In: Interspeech (2005)
14. Do, T.-M.-T., Artieres, T.: Conditional random fields for online handwriting recog-
nition. In: ICFHR (2006)
15. Morency, L.P., Quattoni, A., Darrell, T.: Latent-dynamic discriminative models
for continuous gesture recognition. In: CPVR, pp. 1–8. IEEE (2007)
16. Wang, Y., Mori, G.: Max-margin hidden conditional random fields for human ac-
tion recognition. In: CVPR, pp. 872–879. IEEE (2009)
17. Vinel, A., Do, T.M.T., Artières, T.: Joint optimization of hidden conditional ran-
dom fields and non linear feature extraction. In: ICDAR (2011)
18. Soullard, Y., Artieres, T.: Hybrid hmm and hcrf model for sequence classification.
In: ESANN (2011)
19. Reiter, S., Schuller, B., Rigoll, G.: Hidden conditional random fields for meeting
segmentation. In: Multimedia and Expo. IEEE (2007)
20. Taskar, B., Guestrin, C., Koller, D.: Max-margin markov networks. In: NIPS (2003)
21. Do, T.M.T., Artières, T.: Large margin training for hidden markov models with
partially observed states. In: ICML (2009)
22. Keshet, J., Cheng, C.-C., Stoehr, M., McAllester, D.A.: Direct error rate mini-
mization of hidden markov models. In: Interspeech (2011)
23. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. JMLR (2) (2006)
24. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-
aggressive algorithms. Journal of Machine Learning Research (2006)
25. Tran, B.H., Seide, F., Steinbiss, T.: A word graph based n-best search in continuous
speech recognition. In: ICSLP (1996)
26. http://YAWDa.lip6.fr/
27. Marti, U.V., Bunke, H.: A full english sentence database for off-line handwriting
recognition. In: ICDAR (2002)
28. Marti, U.V., Bunke, H.: Handwritten sentence recognition. In: ICPR (2000)
29. Keshet, J., Shalev-Shwartz, S., Bengio, S., Singer, Y., Chazan, D.: Discriminative
kernel-based phoneme sequence recognition. In: Interspeech (2006)
Background Recovery by Fixed-Rank Robust
Principal Component Analysis
Wee Kheng Leow, Yuan Cheng, Li Zhang, Terence Sim, and Lewis Foo
1 Introduction
3 Fixed-Rank RPCA
Given an $m \times n$ data matrix $\mathbf{D}$, PCA seeks to recover a low-rank matrix $\mathbf{A}$ from
$\mathbf{D}$ such that the discrepancy or error $\mathbf{E} = \mathbf{D} - \mathbf{A}$ is minimized:
$$\min_{\mathbf{A}, \mathbf{E}} \|\mathbf{E}\|_F, \quad \text{subject to} \quad \mathrm{rank}(\mathbf{A}) \leq r, \;\; \mathbf{D} = \mathbf{A} + \mathbf{E} \qquad (1)$$
FrALM
Input: D, r, λ
1. A = 0, E = 0.
2. Y = sgn(D)/J(sgn(D)), μ > 0, ρ > 1.
3. Repeat until convergence:
4.   Repeat until convergence:
5.     U, S, V = svd(D − E + Y/μ).
6.     If rank(T_{1/μ}(S)) < r, A = U T_{1/μ}(S) V; otherwise, A = U S_r V.
7.     E = T_{λ/μ}(D − A + Y/μ).
8.   Y = Y + μ(D − A − E), μ = ρμ.
Output: A, E.
In line 2, sgn(·) computes the sign of each matrix element, and J(·) computes a
scaling factor
$$J(\mathbf{X}) = \max\left(\|\mathbf{X}\|_2, \lambda^{-1}\|\mathbf{X}\|_\infty\right) \qquad (5)$$
as recommended in [9]. The function T used in lines 6 and 7 is a soft-thresholding function:
$$T_\epsilon(x) = \begin{cases} x - \epsilon, & \text{if } x > \epsilon, \\ x + \epsilon, & \text{if } x < -\epsilon, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$
The main difference between FrALM and LrALM lies in Step 6. Sr is the diagonal
matrix of singular values whose diagonal elements above r are set to 0. FrALM
fixes A’s rank to the desired rank r if a rank-r matrix is recovered. Otherwise,
it behaves in the same manner as LrALM. On the other hand, LrALM allows
A’s rank to increase beyond r if μ is too large.
FrALM is algorithmically equivalent to LrALM with a sufficiently small μ
that restricts the rank of A to r. Therefore, the convergence proof of LrALM
given in [9] applies to FrALM. Consequently, FrALM can converge as efficiently
as LrALM does (Fig. 2(a)). The advantage of FrALM over LrALM is that the
user does not have to specify the exact μ that fixes the rank of A to r.
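A compact NumPy sketch of FrALM is given below for illustration; the fixed number of inner iterations and the convergence test are simplifications made here, while the initialisation of Y and μ, the shrinkage of the singular values and the rank-fixing rule of line 6 follow the listing above.

import numpy as np

def soft_threshold(X, eps):
    # Element-wise soft-thresholding operator T_eps of Eq. (6).
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def fralm(D, r, lam, rho=6.0, inner_iter=10, max_iter=200, tol=1e-7):
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    sgn = np.sign(D)
    J = max(np.linalg.norm(sgn, 2), np.abs(sgn).max() / lam)   # scaling factor of Eq. (5)
    Y = sgn / J
    mu = 0.5 / np.linalg.svd(Y, compute_uv=False)[0]           # initial mu = 0.5 / sigma_1
    for _ in range(max_iter):
        for _ in range(inner_iter):                            # steps 5-7
            U, S, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
            S_shrunk = np.maximum(S - 1.0 / mu, 0.0)
            if np.count_nonzero(S_shrunk) < r:
                A = (U * S_shrunk) @ Vt                        # behave as LrALM
            else:
                S_r = np.where(np.arange(len(S)) < r, S, 0.0)
                A = (U * S_r) @ Vt                             # fix the rank of A to r
            E = soft_threshold(D - A + Y / mu, lam / mu)
        Y = Y + mu * (D - A - E)                               # step 8
        mu *= rho
        if np.linalg.norm(D - A - E, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
    return A, E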
Fig. 1. Performance comparison for reflection removal. (a) Error Eg vs. λ. Error curves
above Eg = 1000 for very small λ are cropped to reduce clutter. (b) Rank of low-rank
matrix recovered by LrALM. (Red dashed lines) LrALM results. Vertical dashed lines
denote the theoretical optimal λ∗ = 0.003̇. (Blue solid lines) FrALM results.
Fig. 2. Performance comparison for reflection removal. (a) Convergence curves with
λ = 0.003̇. (b) Error vs. rank. Error curves above Eg = 1000 for very small λ are
cropped to reduce clutter.
LrALM and FrALM were tested on the test sets over a range of λ from 0.0001
to 0.5, including the theoretical optimal λ of $1/\sqrt{m} = 0.00\dot{3}$, denoted as λ∗, as
proved in [13]. The parameters ρ and initial μ were set to the default values of 6
and $0.5/\sigma_1$, where $\sigma_1$ is the largest singular value of the initial $\mathbf{Y}$, as for LrALM.
The algorithms' accuracy was measured in terms of the mean squared error $E_g$
between the ground truth $\mathbf{G}$ and the recovered $\mathbf{A}$:
$$E_g = \frac{1}{mn}\|\mathbf{G} - \mathbf{A}\|_F^2. \qquad (7)$$
Test results show that FrALM converges as efficiently as LrALM (Fig. 2(a)).
Since the desired rank of A is 1, λ has to be sufficiently small for LrALM
to produce accurate results (Fig. 1(a)). For Sets 2, 3, and 5 with non-sparse
E, the empirical optimal λ (0.002) is smaller than the theoretical λ∗ (0.003̇),
contrary to the theory of [13]. At this lower λ, the ranks of the optimal A
Fig. 3. Sample test results for reflection removal. (a) Ground truth background. (b)
Sample input images. (c) LrALM's results. (d) FrALM's results. (1) Set 1: synthetic
local reflection. (2) Set 2: synthetic global reflection (light background). (3) Set 3:
synthetic global reflection (dark background). (4) Set 4: real local reflection. (5) Set 5:
real global reflection. λ = 0.003̇ for these test results.
recovered by LrALM are still larger than the known value of 1 (Fig. 1(b)). This
shows that LrALM has accumulated higher-rank components in A and thus
over-estimated A. In contrast, FrALM constrains the rank of A to 1, removing
the over-estimation. Consequently, FrALM yields more accurate results than
does LrALM, and it returns optimal or near optimal results over a wide range
of λ (Fig. 1(a)). We have also verified empirically that reducing the rank of
A to 1 after it is returned by LrALM can reduce over-estimation and improve
LrALM’s accuracy. However, this post-processing is insufficient for removing the
over-estimation entirely and LrALM’s error is still larger than that of FrALM.
To investigate the stability of FrALM, we ran it on the test cases at a range of
fixed ranks r, with λ set to the empirical optimal of 0.002. FrALM’s results were
plotted together with LrALM’s results obtained in previous tests (Fig. 2(b)).
Fig. 4. Sample test results for human and traffic video. (a) Ground truth background.
(b) Sample video frames. (c) LrALM's results. (d) FrALM's results. (1) Human motion
video. (2–4) Traffic video; ground truth is not available. λ = 0.003̇ for these test results.
When r is slightly larger than 1, FrALM's error increases only slightly. When r
is larger than the rank of the A recovered (line 6 of the algorithm), FrALM reduces to
LrALM, and its error simply approaches that of LrALM.
Figure 3 displays sample results for reflection removal obtained at the the-
oretical λ∗ . LrALM’s results are good for Sets 1 and 4 whose E is sparse. For
Sets 2, 3, and 5, E is not sparse and LrALM’s results have visually noticeable
errors (when the images are viewed at higher zoom factors). In contrast, FrALM
obtains good results for all test sets.
Figure 4 shows sample results for video background modeling obtained at the
theoretical λ∗ . In the video frames where the human and vehicles are moving
continuously, LrALM can recover the stationary background well (Fig. 4(1c, 2c)).
When the vehicles are moving slowly, E is not sparse, and LrALM shows signs
of inaccuracy (Fig. 4(3c)). When the vehicles stop at the traffic junction for an
extended period of time, LrALM regards them as part of the low-rank matrix
A and fails to remove them from A (Fig. 4(4c)). In contrast, FrALM produces
much better overall results than does LrALM (Fig. 4(d)).
5 Conclusions
A fixed-rank RPCA algorithm, FrALM, based on the exact augmented Lagrange
multiplier method is proposed in this paper. By fixing the rank of the low-rank
matrix to be recovered, FrALM removes over-estimation of the low-rank matrix
and produces more accurate results than does the low-rank ALM method (LrALM).
Moreover, FrALM returns optimal or near-optimal results over a wide range of
λ values, whereas LrALM's accuracy is sensitive to λ. If FrALM is fixed to a
desired rank that is larger than the actual rank, then FrALM simply reduces to
LrALM. These properties make FrALM more reliable and accurate than LrALM
for solving computer vision problems whose low-rank matrices have known ranks.
References
1. Babacan, S.D., Luessi, M., Molina, R., Katsaggelos, A.K.: Sparse bayesian methods
for low-rank matrix estimation. IEEE Trans. Signal Processing 60(8), 3964–3977
(2012)
2. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis?
Journal of ACM 58(3), 11 (2011)
3. Candès, E.J., Plan, Y.: Matrix completion with noise. In: Proc. IEEE, pp. 925–936
(2010)
4. De la Torre, F., Black, M.: A framework for robust subspace learning. Int. Journal
of Computer Vision 54(1-3), 117–142 (2003)
5. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography. Communications
of ACM 24(6), 381–385 (1981)
6. Ganesh, A., Lin, Z., Wright, J., Wu, L., Chen, M., Ma, Y.: Fast convex optimization
algorithms for exact recovery of a corrupted low-rank matrix. In: CAMSAP (2009)
7. Gnanadesikan, R., Kettenring, J.: Robust estimates, residuals, and outlier detection
with multiresponse data. Biometrics 28(1), 81–124 (1972)
8. Ke, Q., Kanade, T.: Robust L1 norm factorization in the presence of outliers and
missing data by alternative convex programming. In: Proc. CVPR, pp. 739–746
(2005)
9. Lin, Z., Chen, M., Wu, L., Ma, Y.: The augmented Lagrange multiplier method
for exact recovery of corrupted low-rank matrices. Technical Report UILU-ENG-
09-2215, UIUC (2009), arXiv preprint arXiv:1009.5055
10. Liu, R., Lin, Z., De la Torre, F., Su, Z.: Fixed-rank representation for unsupervised
visual learning. In: Proc. CVPR, pp. 598–605 (2012)
11. Wang, N., Yao, T., Wang, J., Yeung, D.-Y.: A probabilistic approach to robust
matrix factorization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid,
C. (eds.) ECCV 2012, Part VII. LNCS, vol. 7578, pp. 126–139. Springer, Heidelberg
(2012)
12. Wright, J., Peng, Y., Ma, Y., Ganesh, A., Rao, S.: Robust principal component
analysis: Exact recovery of corrupted low-rank matrices by convex optimization.
In: Proc. NIPS, pp. 2080–2088 (2009)
13. Zhou, Z., Li, X., Wright, J., Candès, E.J., Ma, Y.: Stable principal component
pursuit. In: Proc. Int. Symp. Information Theory, pp. 1518–1522 (2010)
Manifold Learning and the Quantum
Jensen-Shannon Divergence Kernel
1 Introduction
Graph-based representations have become increasingly popular due to their
ability to characterize in a natural way a large number of systems [2, 3]. Un-
fortunately, our ability to analyse this wealth of data is severely limited by the
restrictions posed by standard pattern recognition techniques, which usually re-
quire the graphs to be first embedded into a vectorial space, a procedure which
is far from being trivial. Kernel methods [4] provide a neat way to shift the prob-
lem from that of finding an embedding to that of defining a positive semidefinite
kernel. In fact, once we define a positive semidefinite kernel $k : X \times X \rightarrow \mathbb{R}$
on a set $X$, there exists a map $\phi : X \rightarrow H$ into a Hilbert space $H$, such that
$k(x, y) = \langle\phi(x), \phi(y)\rangle$ for all $x, y \in X$. Thus, any algorithm can be formulated in
terms of the data by implicitly mapping them to $H$ via the well-known kernel
trick. As a consequence, we are now faced with the problem of defining a positive
semidefinite kernel on graphs rather than computing an embedding. However,
due to the rich expressiveness of graphs, this task has also proven to be difficult.
Many different graph kernels have been proposed in the literature [5–7], which
are generally instances of the family of R-convolution kernels introduced by
Haussler [8]. The fundamental idea is that of decomposing two discrete objects
into simpler substructures and comparing them. For example, Gärtner et al. [5]
propose to count the number of common random walks between two graphs,
while Borgwardt and Kriegel [6] measure the similarity based on the shortest
paths in the graphs. Shervashidze et al. [7], on the other hand, count the number
of graphlets, i.e. subgraphs with k nodes. Recently, Rossi et al. [1] introduced
a novel kernel where the graph structure is probed through the evolution of a
continuous-time quantum walk [9]. The idea underpinning their method is that
the interference effects introduced by the quantum walk seem to be enhanced by
the presence of symmetrical motifs in the graph [10, 11]. To this end, they define
a walk on a new structure that is maximally symmetric when the original
graphs are isomorphic. Finally, the kernel is defined as the quantum Jensen-
Shannon divergence [12] between the density operators [13] associated with the
walks.
In this paper, we study the separability properties of the QJSD kernel and we
apply standard manifold learning techniques [14, 15] on the kernel embedding
to map the data onto a low-dimensional space where the different classes can
exhibit a better linear separation. The idea stems from the observation that
the multidimensional scaling embeddings of the QJSD kernel show the so-called
horseshoe effect [16]. This particular behaviour is known to arise when long range
distances are not estimated accurately, and it implies that the data lie on a non-
linear manifold. This is no surprise, since Emms et al. [10] have shown that the
continuous-time quantum walk underestimates the commute time related to the
classical random walk. For this reason, it is natural to investigate the impact
of the locality of distance information on the performance of the QJSD kernel.
Given a set of graphs, we propose to use Isomap [14] to embed the graphs onto
a low-dimensional vectorial space, and we compute the separability of the graph
classes as the distance information varies from local to global. Moreover, we
perform the same analysis on a set of alternative graph kernels commonly found
in the literature [5–7]. Experiments on several standard datasets demonstrate
that the Isomap embedding shows a higher separability of the classes.
The remainder of this paper is organized as follows: Section 2 introduces some
basic quantum mechanical terminology, while Section 3 reviews the QJSD kernel.
Section 4 illustrates the experimental results and the conclusions are presented
in Section 5.
Quantum walks are the quantum analogue of classical random walks. In this
paper we consider only continuous-time quantum walks, as first introduced by
Farhi and Gutmann in [9]. Given a graph G = (V, E), the state space of the
continuous-time quantum walk defined on G is the set of the vertices V of the
graph. Unlike the classical case, where the evolution of the walk is governed by
a stochastic matrix (i.e. a matrix whose columns sum to unity), in the quantum
case the dynamics of the walker is governed by a complex unitary matrix i.e.,
a matrix that multiplied by its conjugate transpose yields the identity matrix.
Hence, the evolution of the quantum walk is reversible, which implies that quan-
tum walks are non-ergodic and do not possess a limiting distribution. Using
Dirac notation, we denote the basis state corresponding to the walk being at
vertex $u \in V$ as $|u\rangle$. A general state of the walk is a complex linear combination
of the basis states, such that the state of the walk at time $t$ is defined as
$$|\psi_t\rangle = \sum_{u \in V} \alpha_u(t)\,|u\rangle \qquad (1)$$
where the amplitude $\alpha_u(t) \in \mathbb{C}$ and $|\psi_t\rangle \in \mathbb{C}^{|V|}$ are both complex.
At each instant in time the probability of the walker being at a particular
vertex of the graph is given by the square of the norm of the amplitude of the
relative state. More formally, let $X^t$ be a random variable giving the location of
the walker at time $t$. Then the probability of the walker being at vertex $u$
at time $t$ is given by $\Pr(X^t = u) = \alpha_u(t)\,\alpha_u^*(t)$.
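As an illustration of these definitions, the following SciPy sketch evolves a continuous-time quantum walk and returns the vertex probabilities; the choice of the walk Hamiltonian (adjacency matrix or Laplacian) is an assumption here, since it is not fixed by the text above.

import numpy as np
from scipy.linalg import expm

def ctqw_probabilities(H, psi0, t):
    # psi0: complex initial amplitude vector; the state evolves unitarily as
    # psi_t = exp(-iHt) psi0, and Pr(X^t = u) = |alpha_u(t)|^2.
    U_t = expm(-1j * t * H)
    psi_t = U_t @ psi0
    return np.abs(psi_t) ** 2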
Fig. 1. The MDS embeddings from the QJSD kernel consistently show a horseshoe-shaped
distribution of the points
Fig. 2. Sample images of the four selected objects from the COIL-100 [18] dataset
where α is the 10-fold cross-validation accuracy of the C-SVM, C is the regularizer
constant, d is the embedding dimension and k is the number of nearest
neighbors. Note that the multi-class classification task is solved using majority voting
on a set of one-vs-one C-SVM classifiers.
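This selection can be sketched with scikit-learn as below; the kernel-PCA step used to obtain an initial vectorial embedding of the precomputed kernel, the linear SVM and the grid ranges are assumptions made here for a runnable example, not the exact protocol of the paper.

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def best_separability(K, y, dims=range(2, 26), ks=(10, 30, 50, 70), Cs=(1, 10, 100)):
    # K: precomputed (e.g. QJSD) kernel matrix; y: class labels.
    X = KernelPCA(n_components=min(len(K) - 1, 50), kernel='precomputed').fit_transform(K)
    best = 0.0
    for k in ks:
        for d in dims:
            emb = Isomap(n_neighbors=k, n_components=d).fit_transform(X)
            for C in Cs:
                # SVC handles multi-class problems with one-vs-one majority voting;
                # the 10-fold cross-validation accuracy plays the role of alpha.
                acc = cross_val_score(SVC(kernel='linear', C=C), emb, y, cv=10).mean()
                best = max(best, acc)
    return best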
4 Experimental Results
The experiments are performed on four different datasets, namely MUTAG, PPI,
COIL [18] and a set of shock graphs. MUTAG is a dataset of 188 mutagenic
Fig. 3. 3D plot of the 10-fold cross-validation accuracy on the PPI dataset as the
number of nearest neighbors k and the embedding dimension d vary
On the other hand, for the graphlet kernel the maximum accuracy is achieved for a smaller
neighborhood, which means that in this case the long-range distance information
is less accurate.
Figure 4 shows the two-dimensional Isomap embeddings with the highest linear
separability for the QJSD kernel on the synthetic dataset, MUTAG and
COIL. The result clearly shows the absence of the horseshoe-shaped distribution of
Figure 1. Note, however, that the best embedding is usually found at a dimension
higher than two and, as shown in Figure 3, the separability can change
significantly as the dimension varies. Figure 4 also shows a clearer separation
among the different classes, as highlighted in Table 1, which shows the separa-
bility of the data for each kernel and dataset. It is interesting to observe that,
with the exception of a few cases, the Isomap embedding always yields an in-
creased separability of the data, independently of the original kernel. It should
also be underlined that the QJSD kernel always yields the highest separation,
with a maximum classification accuracy above 90% in 4 out of 5 datasets.
5 Conclusions
In this paper, we studied the separability properties of the QJSD kernel and
we have proposed a way to compute a low-dimensional embedding where the
separation of the different classes is enhanced. The idea stems from the observa-
tion that the multidimensional scaling embeddings on this kernel show a strong
horseshoe shape distribution, a pattern which is known to arise when long range
distances are not estimated accurately. Here we proposed to use Isomap to em-
bed the graphs using only local distance information onto a new vectorial space
with a higher class separability. An extensive experimental evaluation has shown
the effectiveness of the proposed approach.
References
1. Rossi, L., Torsello, A., Hancock, E.R.: A continuous-time quantum walk kernel for
unattributed graphs. In: Kropatsch, W.G., Artner, N.M., Haxhimusa, Y., Jiang,
X. (eds.) GbRPR 2013. LNCS, vol. 7877, pp. 101–110. Springer, Heidelberg (2013)
2. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.: Shock graphs and shape
matching. International Journal of Computer Vision 35, 13–32 (1999)
3. Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.: The large-scale orga-
nization of metabolic networks. Nature 407, 651–654 (2000)
4. Schölkopf, B., Smola, A.J.: Learning with kernels: Support vector machines, regu-
larization, optimization, and beyond. MIT press (2001)
5. Gaertner, T., Flach, P., Wrobel, S.: On graph kernels: Hardness results and efficient
alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS
(LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
6. Borgwardt, K., Kriegel, H.: Shortest-path kernels on graphs. In: Fifth IEEE Inter-
national Conference on Data Mining, p. 8. IEEE (2005)
7. Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., Borgwardt, K.: Effi-
cient graphlet kernels for large graph comparison. In: Proceedings of the Interna-
tional Workshop on Artificial Intelligence and Statistics (2009)
8. Haussler, D.: Convolution kernels on discrete structures. Technical report, UC
Santa Cruz (1999)
9. Farhi, E., Gutmann, S.: Quantum computation and decision trees. Physical Review
A 58, 915 (1998)
10. Emms, D., Wilson, R., Hancock, E.: Graph embedding using a quasi-quantum ana-
logue of the hitting times of continuous time quantum walks. Quantum Information
& Computation 9, 231–254 (2009)
11. Rossi, L., Torsello, A., Hancock, E.R.: Approximate axial symmetries from contin-
uous time quantum walks. In: Gimel’farb, G., Hancock, E., Imiya, A., Kuijper, A.,
Kudo, M., Omachi, S., Windeatt, T., Yamada, K. (eds.) SSPR&SPR 2012. LNCS,
vol. 7626, pp. 144–152. Springer, Heidelberg (2012)
12. Lamberti, P., Majtey, A., Borras, A., Casas, M., Plastino, A.: Metric character of
the quantum Jensen-Shannon divergence. Physical Review A 77, 052311 (2008)
13. Nielsen, M., Chuang, I.: Quantum computation and quantum information. Cam-
bridge university press (2010)
14. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for
nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
15. Czaja, W., Ehler, M.: Schroedinger eigenmaps for the analysis of biomedical data.
IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1274–1280
(2013)
16. Kendall, D.G.: Abundance matrices and seriation in archaeology. Probability The-
ory and Related Fields 17, 104–112 (1971)
17. Briët, J., Harremoës, P.: Properties of classical and quantum Jensen-Shannon divergence. Physical Review A 79, 052311 (2009)
18. Nayar, S., Nene, S., Murase, H.: Columbia object image library (coil 100). Technical
report, Tech. Report No. CUCS-006-96. Department of Comp. Science, Columbia
University (1996)
19. Torsello, A., Rossi, L.: Supervised learning of graph structure. In: Pelillo, M., Han-
cock, E.R. (eds.) SIMBAD 2011. LNCS, vol. 7005, pp. 117–132. Springer, Heidel-
berg (2011)
Spatio-temporal Manifold Embedding
for Nearly-Repetitive Contents
in a Video Stream
1 Introduction
In recent years a wide range of audio-visual data has become publicly available, including news, movies, television programmes and meeting records, resulting in various content-management problems. Among these there exist nearly-repetitive video sequences, whereby the original material is transformed into nearly, but not exactly, identical contents. Rushes videos, also referred to as pre-production videos, belong to one category of such examples [1]. They are collections of raw footage used to produce, e.g., TV programmes [2].
Unlike many other video datasets rushes are unconventional, containing ad-
ditional contents such as clapper boards, colour bars and empty white shots.
They also contain repetitive contents from multiple retakes of the same scene,
caused by, e.g., actors’ mistakes or technical failures during the production.
Although contents are nearly repetitive they may not be totally identical dupli-
cates, sometimes causing inconsistency between retakes. Occasionally some parts
of the original sequence may be dropped or extra information may be added at
various places, resulting in retakes of the same scene with unequal lengths.
The task of aligning multiple audio visual sequences, potentially from differ-
ent angles, needs precise synchronisation in both spatial and temporal domains.
The majority of previous works employed techniques such as template match-
ing, camera calibration analysis and object tracking. Whitehead et al. [3], for
example, tracked multiple objects throughout each sequence using a 2D shape
heuristic. Temporal correspondence was then computed between frames by iden-
tifying the object’s location in all views satisfying the epipolar geometry. In [4],
the authors required the events to be captured by still cameras with flashes; the
binary flash patterns were analysed and matched throughout the video sequence.
Tresadern and Reid [5] used a rank constraint on corresponding frame features
instead of the epipolar geometry. The synchronisations were defined by searching
frame pairs that minimise the rank constraint. However their approach requires
prior knowledge on the number of correspondences in the frame sequences.
In this paper we present a spatio-temporal framework to aligning nearly-
repetitive contents. Embedded repetitions in the three dimensional (3D) signal,
consisting of two spatial and one temporal dimension, are discovered by defining
the coherent structure. We depart from the previous extension made on Isomap
[1] to spatio-temporal graph-based manifold embedding that captures correla-
tions between repetitive scenes. The intra- and inter-correlations within and
between repeated video contents are defined by applying the spatio-temporal
extension of the scale-invariant feature transform (SIFT) [6]. It is followed by
the modified version of the locality constrained linear coding (LLC) [7], where
each spatio-temporal descriptor is encoded by k-nearest neighbours (kNN) based
on the geodesic distances, instead of the Euclidean distance. The latter measures
the distance between two points as the length of a straight line from one point
to the other, whereas on the non-linear manifold, their Euclidean distance may
not accurately reflect their intrinsic similarity, which is measured by the geodesic
distance. A cluster of intrinsic coordinates are then generated on the embedded
space to define the spatial and temporal similarity between repetitions.
The contributions of this study are as follows: Firstly a spatial intra-correlation
representation is created for repetitive contents in a video stream. Interest points
that have significant local variations in both space and time are extracted and
encoded using fewer codebook bases in the high-dimensional feature space. Intra-
correlation is derived by constructing a shortest path graph using the kNN with
the geodesic distances. Secondly Isomap is extended to estimate the underlying
structure of repetitive contents and to define a spatio-temporal inter-correlation
in a video stream. Thirdly an unsupervised framework, which does not require
prior information or pre-processing steps, for aligning similar contents is pre-
sented for multimedia data with repetitions.
coordinate systems. It translates image descriptors into local sparse codes based
on the Euclidean distances and the kNN search. We extended this algorithm
to project spatio-temporal descriptors extracted from a video stream into their
local linear codes using the geodesic distance and the shortest path graph.
Technically, given the ST-SIFT feature matrix extracted from a video stream
with N entries and D dimensions, i.e., X = {x1 , . . . , xN } ∈ RD×N , LLC solves
the following problem:
$$\min_{S} \sum_{i=1}^{N} \left\| x_i - B s_i \right\|^2 + \lambda \left\| d_i \odot s_i \right\|^2 \qquad \text{s.t.}\ \mathbf{1}^{\top} s_i = 1, \ \forall i$$
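As a concrete illustration of this coding step, the sketch below (an illustrative assumption, not the authors' implementation) computes an LLC-style code for a single descriptor using the commonly used analytical approximation: restrict the code to the k nearest codewords, solve a small regularised linear system, and enforce the sum-to-one constraint. The optional dists argument stands in for the geodesic distances discussed above; the function name and the regularisation constant lam are hypothetical.

import numpy as np

def llc_code(x, B, k=5, lam=1e-4, dists=None):
    # x: descriptor of shape (D,); B: codebook of shape (D, M).
    # dists: optional precomputed distances from x to the M codewords
    # (e.g. geodesic distances on a neighbourhood graph); Euclidean otherwise.
    D, M = B.shape
    if dists is None:
        dists = np.linalg.norm(B - x[:, None], axis=0)
    nn = np.argsort(dists)[:k]              # k nearest codewords
    Bk = B[:, nn]                           # local basis (D, k)
    z = Bk - x[:, None]                     # shift to the descriptor
    C = z.T @ z                             # local covariance (k, k)
    C += lam * np.trace(C) * np.eye(k)      # regularisation
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                            # enforce 1^T s_i = 1
    s = np.zeros(M)
    s[nn] = w                               # sparse code over the full codebook
    return s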
where argmin_j indicates node indexes for j that give the L minimum values of δ_ij. Other L chronologically ordered neighbours around each frame x_i are then defined as temporal neighbours (tn):
$$tn_{x_i} = \left\{ x_{i-\frac{L}{2}}, \dots, x_{i-1}, x_{i+1}, \dots, x_{i+\frac{L}{2}} \right\}, \quad i = 1, \dots, N$$
The temporal neighbours of the spatial neighbours, tnsn_{x_i}, are defined for more coverage:
$$tnsn_{x_i} = \left\{ tn_{x_{i1}}, \dots, tn_{x_{iL}} \right\}, \quad i = 1, \dots, N$$
Finally, union between spatial and temporal sets represents the spatio-temporal
neighbours stn:
T : δ_γ → D
where Q and Λ are the eigenvectors and the eigenvalues of δ_γ. To optimise the embedded representation, the m largest eigenvalues along the diagonal of Λ are retained in Λ⁺, and the corresponding m columns of Q are retained in Q⁺.
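For concreteness, the eigendecomposition step as it is typically realised in classical Isomap can be sketched as follows; this is an assumption about the usual form of that step (coordinates scaled by the square roots of the retained eigenvalues), not the authors' code.

import numpy as np

def isomap_embed(tau, m):
    # tau: double-centred (geodesic) distance matrix of shape (N, N)
    # m: target embedding dimension
    evals, evecs = np.linalg.eigh(tau)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:m]         # indices of the m largest eigenvalues
    lam_plus = np.clip(evals[order], 0.0, None) # Lambda^+ (non-negative part)
    q_plus = evecs[:, order]                    # Q^+ (corresponding eigenvectors)
    return q_plus * np.sqrt(lam_plus)           # (N, m) intrinsic coordinates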
3 Experiments
The approach was evaluated using MPEG-1 videos from the NIST TRECVID
2008 BBC rushes video collection [2]. Five video sequences were selected con-
taining drama productions in the following genres: detective, emergency, police,
ancient Greece and historical London. In total we had an approximate duration
of 82 minutes, sampled at the frame rate of 25 fps (frames per second) and a
frame size of 288 × 352 pixels. Table 1 provides further details of the dataset.
The video representation was created as follows. Firstly, spatio-temporal re-
gions were detected and described from the video cube using the ST-SIFT [6].
For each interest point the descriptor length was 640-dimensional, determined
by the number of bins to represent the orientation angles, θ and φ, in the sub-
histograms. In the spatial pyramid matching step, the LLC codes were computed
for each sub-region and pooled together using the multi-scale max pooling to cre-
ate the pooled representation. We used 4 × 4, 2 × 2 and 1 × 1 sub-regions. The
pooled features were then concatenated and normalised using the ℓ2-norm.
Table 1. The duration, the number of scenes and the number of retakes for each scene
summarisation task in 2008 [2]. The ground truth was constructed for each video
using three human judges at a frame rate of 0.5 fps (one frame per two seconds).
The judges were asked to study the video summary and use it to identify the
start and the end for each retake. The defined positions for five videos, totalling
39 scenes and 94 retakes, were used as the ground truth.
In the experiments, the approach was compared with three other simplified
alternatives. The first one evaluates the performance of the entire framework.
It was a combination of the original 2D SIFT by Lowe [9], LLC coding with
the Euclidean distance graph by [7] and spatial Isomap by [10]. The second
one evaluates the performance of the intra-correlation step covered by ST-SIFT
and LLC with the shortest path graph. It consisted of the 2D SIFT, LLC coding
with the Euclidean distance graph and the Isomap-ST, an adapted version of [1].
The third one evaluates the performance of the inter-correlation step covered by
the Isomap-ST. For that we combined the ST-SIFT [6] with LLC coding
with the shortest path graph and the spatial Isomap.
3.2 Results
Figure 2 presents the average precision and recall for each video using the ap-
proach and three alternatives. Graphs were created using the neighbourhood size
k as the operating parameter. They indicate that the approach outperformed the
conventional techniques with a fair margin. The approach was able to capture
the spatio-temporal correlations between retakes in each video sequence. The
best result was obtained with video MRS150072 in Figure 2(e). This video con-
tained outdoor scenes with large variations, characterised by busy backgrounds
and lots of movements by actors and objects. On the other hand, Figure 2(a)
for video MS206290 resulted in the lowest performance. It consisted of indoor
scenes with crowded people and little movement. Therefore there were few significant
changes between the frames to be captured by the ST-SIFT.
Figure 3 illustrates the reconstruction of video sequences, aiming to uncover
their nearly-repetitive contents. Retakes from the same scene were mapped close
to each other in the manifold resulting in clusters of repetitive contents. The
video sequence MRS044499 presented in the figure contained six scenes with
ten retakes (described earlier in Table 1). The left panel of the figure shows
the aligned sequences in the 2D space with multiple clusters of frames. Most
frames from the same scene were re-positioned and placed close together in the
Fig. 2. Average precision and recall for five rushes videos, identified as MS206290,
MS206370, MS215830, MRS044499, and MRS150072. For each video stream, the
spatio-temporal alignment method (blue) is compared with three other alternatives.
lower dimensional space. There were many causes, such as camera moves, that
could result in discontinuity because such frames did not share sufficient spatial
features with others. Consideration of temporal relation in the intra-correlation
step alleviated this problem, thus successfully producing a clear video trajectory
in the manifold. The contents of one cluster, two retakes of the same scene, are
presented in the right panel of the figure.
Fig. 3. Video sequence MRS044499 was aligned in the two-dimensional space using
the neighbourhood size of k = 15
Acknowledgements. The first author would like to thank Umm Al-Qura Uni-
versity, Makkah, Saudi Arabia for funding this work as part of her PhD schol-
arship program.
References
1. Chantamunee, S., Gotoh, Y.: Nearly-repetitive video synchronisation using nonlin-
ear manifold embedding. In: Proceedings of ICASSP (2010)
2. Over, P., Smeaton, A.F., Awad, G.: The TRECVID 2008 BBC rushes summariza-
tion evaluation. In: ACM TRECVID Video Summarization Workshop (2008)
3. Whitehead, A., Laganiere, R., Bose, P.: Temporal synchronization of video se-
quences in theory and in practice. In: IEEE Workshop on Motion and Video Com-
puting (2005)
4. Shrestha, P., Weda, H., Barbieri, M., Sekulovski, D.: Synchronization of multiple
video recordings based on still camera flashes. In: Proceedings of ACM Multimedia
(2006)
5. Tresadern, P.A., Reid, I.D.: Synchronizing image sequences of non-rigid objects.
In: Proceedings of BMVC (2003)
6. Al Ghamdi, M., Zhang, L., Gotoh, Y.: Spatio-temporal SIFT and its application
to human action classification. In: Fusiello, A., Murino, V., Cucchiara, R. (eds.)
ECCV 2012 Ws/Demos, Part I. LNCS, vol. 7583, pp. 301–310. Springer, Heidelberg
(2012)
7. Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local
spatio-temporal features for action recognition. In: Proceedings of BMVC (2009)
8. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual
cortex. In: Proceedings of CVPR (2005)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision (2004)
10. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for
nonlinear dimensionality reduction. Science (2000)
Spatio-temporal Human Body Segmentation
from Video Stream
1 Introduction
the very nature of the process of segmentation developed for still images that it
is unable to realise continuity through time.
In this work we explore an approach to extracting three dimensional (3D)
human volume, consisting of two spatial and one temporal dimension. Our im-
plementation of video segmentation follows the line of tracking-based methods.
It detects and segments human body regions from a video stream by jointly
embedding parts and pixels [1]. For all extracted segments the appearance and
shape models will be learned in order to automatically identify foreground ob-
jects across video frames. It focuses on human contours in particular, and is modified from the category-independent segmentation work by Lee et al. [2]. The approach
is evaluated using office scenes selected from the Hollywood2 dataset [3]. The
experimental results indicate that the approach was able to create consistently
better segmentation than recently implemented work [2].
2 Approach
Our goal is to segment human body volume in an unlabelled video. The ap-
proach consists of two main stages (Figure 1). Firstly, human body objects are
segmented at a frame level by combining low-level cues with a top-down part-
based person detector developed by Maire et al. [1], formulating grouped patches.
Secondly, detected segments are propagated along the video frames, exploiting
the temporal consistency of detected foreground objects using colour models and
local shape matching [2]. The final output is a spatio-temporal segmentation of
the human body in a video stream. We now describe each stage in turn.
where Mj is the region of the image overlapped by a part qj . Each part is then
assigned to a Qi , which represents the number of confirmed objects detected.
Human body segments are then scored for each frame. This step is repeated
with a set of N × F , where each N is the number of human body objects per
frame and F is the number of frames. These steps result in the set of hypotheses,
h, which are then used to identify the spatio-temporal segmentation of human
body parts in the entire video stream.
where Uic (·) is the colour-induced cost, and Uil (·) is the local shape match-induced
cost. The segments detected in each frame on the basis of their parts and pixels
are projected onto other frames by local shape matching, with a spatial extent
which defines the location and scale prior to the segment, whose pixels can sub-
sequently be labelled as foreground or background. Optical flow connections are
used to maintain frame-to-frame consistency of the background and foreground
labelling of propagated segments. For each hypothesis h, the foreground object
segmentation of the video can be labelled by using binary graph cuts to minimise
the function E(f, h). Each frame is labelled in this way, using a space-time graph
of three frames to connect each frame to its preceding and subsequent frames.
This is more efficient than segmenting the video as a whole.
Fig. 2. Sample segmentations. The first row shows key frames from two video clips.
The second and the third rows respectively present the results of key segments and the
corresponding segmentation using the approach in this paper. The last two rows show
the same attempts using the implementation by Lee et al. [2]. Best viewed on pdf.
3 Experiments
Dataset. The Hollywood2 dataset [3] holds a total of 69 Hollywood movie
scenes, from which ten short video clips were selected for testing the approach.
All the selected scenes are set in an office environment, and feature a broad range
of motions as well as a variety of temporal changes, thus creating a challenging
video segmentation task. The selected clips vary from 30 seconds to 2 minutes
in duration, with at least one human present in each shot; there are many shots
showing multiple human figures. For each clip, video frames are extracted using
a ffmpeg1 decoder, with a sample rate of one frame per second.
Evaluation Scheme. Accuracy is the commonly used measure for evaluating
video segmentation tasks. In this work we adopt the average per-frame pixel error
rate [19] for evaluation of the approach. Let F denote the number of frames in
the video, and S and GT represent pixels in the segmented region and in the
‘groundtruth’ across the frame sequence respectively. The error rate is calculated
using the exclusive OR operation:
$$E(S) = \frac{\left| \mathrm{XOR}(S, GT) \right|}{F}$$
The equation is used under the general hypothesis that object and groundtruth
annotation should match.
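A direct transcription of this error measure, assuming the segmentation and ground truth are available as per-frame boolean masks, could look as follows (illustrative sketch only):

import numpy as np

def per_frame_pixel_error(segmentations, groundtruths):
    # segmentations, groundtruths: lists of boolean masks, one per frame
    F = len(segmentations)
    total = sum(np.logical_xor(S, GT).sum()          # |XOR(S, GT)| per frame
                for S, GT in zip(segmentations, groundtruths))
    return total / F                                  # average over the F frames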
1 www.ffmpeg.org/
Table 1. The average number of incorrectly segmented pixels per frame. The video
clip name is in the format of ‘sceneclipautoautotrain· · · · ·’ where ‘· · · · ·’ part is shown
in the table.
4 Conclusion
In this paper we presented a two-stage approach to spatio-temporal human
body segmentation by extracting a human body at a frame level, followed by
tracking the segmented regions using colour appearance and local shape match-
ing across the frames. By detecting and segmenting human body parts, we over-
came the limitations of the bottom-up unsupervised methods that often overseg-
mented an object. Using ten challenging video clips derived from the Hollywood2
dataset, we were able to obtain consistently better segmentation results than re-
cent implementations in the field.
2 Program code available from www.cs.utexas.edu/~grauman/research/software.html. We tested their implementation with our office scene dataset. This was perhaps not a totally fair comparison because the purpose of their work was an unsupervised approach to key object segmentation from unlabelled video, where the number of objects was restricted to one, while we focused on extraction of human volume.
References
1. Maire, M., Yu, S.X., Perona, P.: Object detection and segmentation from joint
embedding of parts and pixels. In: Proceedings of ICCV (2011)
2. Lee, Y.J., Kim, J., Grauman, K.: Key-segments for video object segmentation. In:
Proceedings of ICCV (2011)
3. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: Proceedings of CVPR
(2009)
4. Freedman, D., Kisilev, P.: Fast mean shift by compact density representation. In:
Proceedings of CVPR (2009)
5. Patti, A.J., Tekalp, A.M., Sezan, M.I.: A new motion-compensated reduced-order
model Kalman filter for space-varying restoration of progressive and interlaced
video. IEEE Transactions on Image Processing 7 (1998)
6. Paris, S.: Edge-preserving smoothing and mean-shift segmentation of video
streams. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS,
vol. 5303, pp. 460–473. Springer, Heidelberg (2008)
7. Klein, A.W., Sloan, P.P.J., Finkelstein, A., Cohen, M.F.: Stylized video cubes. In:
ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2002)
8. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space anal-
ysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002)
9. DeMenthon, D., Megret, R.: Spatio-temporal segmentation of video by hierarchical
mean shift analysis. Technical report, Language and Media Processing Laboratory,
University of Maryland (2002)
10. Wang, J., Xu, Y., Shum, H.Y., Cohen, M.F.: Video tooning. ACM Transaction on
Graphics 23 (2004)
11. Wang, J.Y.A., Adelson, E.H.: Representing moving images with layers. IEEE
Transactions on Image Processing 3 (1994)
12. Khan, S., Shah, M.: Object based segmentation of video using color, motion and
spatial information. In: Proceedings of CVPR (2001)
13. Zitnick, C.L., Jojic, N., Kang, S.B.: Consistent segmentation for optical flow esti-
mation. In: Proceedings of ICCV (2005)
14. Brendel, W., Todorovic, S.: Video object segmentation by tracking regions. In:
Proceedings of ICCV (2009)
15. Bai, X., Wang, J., Simons, D., Sapiro, G.: Video SnapCut: robust video object
cutout using localized classifiers. ACM Transaction on Graphics 28 (2009)
16. Huang, Y., Liu, Q., Metaxas, D.: Video object segmentation by hypergraph cut.
In: Proceedings of CVPR (2009)
17. Li, Y., Sun, J., Shum, H.Y.: Video object cut and paste. ACM Transaction on
Graphics 24 (2005)
18. Yu, S.X., Shi, J.: Segmentation given partial grouping constraints. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 26 (2004)
19. Tsai, D., Flagg, M., Rehg, J.M.: Motion coherent tracking with multi-label MRF
optimization. In: Proceedings of BMVC (2010)
Sparse Depth Sampling for Interventional
2-D/3-D Overlay: Theoretical Error Analysis
and Enhanced Motion Estimation
1 Introduction
Since the 3-D point x with depth d and the point xE with sparsely estimated
depth dE are both on r(p), it yields λ(x) = 1/z = 1/d and λ(xE ) = 1/dE . The
points can then be reformulated as
$$x = \left( (v_r^{xyz}(p))^T,\ 1/d \right)^T \quad \text{and} \quad x_E = \left( (v_r^{xyz}(p))^T,\ 1/d_E \right)^T. \tag{2}$$
Since $v_r^{xyz}$ is determined by the 2D projection, the representations in Eq. 2 show
the geometric relationship between x and xE (as in Fig. 1(b)): they share the
same projection but with a shift of Δd = dE − d in depth.
After a rigid motion (rotation R0 and translation t0 ), the new projections of
x and xE are p and pE , respectively, as shown in Fig. 1(b). In this scenario,
Fig. 1. (a) Depth sampling along the viewing direction; (b) the systematic error
the points p and p are observations of x on the 2-D image before and after
the motion. Since the estimated 3-D point xE and p (instead of pE ) are used
in motion estimation [6], the systematic error of one point is introduced by the
difference vector between p and pE , as follows:
$$p_{xy} - p_{E,xy} = a \cdot \frac{d - d_E}{(d \cdot r_3 v_r^{xyz} + t_0^z)(d_E \cdot r_3 v_r^{xyz} + t_0^z)} \begin{pmatrix} t_0^z & 0 & -t_0^x \\ 0 & t_0^z & -t_0^y \end{pmatrix} R_0 v_r^{xyz}, \tag{3}$$
where $r_3 \in \mathbb{R}^{1 \times 3}$ is the third row of $R_0$ and $t_0 = (t_0^x, t_0^y, t_0^z)^T$. The above 2-D
vector corresponds to a line segment lε connecting p and pE , which is exactly
a segment of the epipolar line of p under the motion of [R0 |t0 ] [7]. As Eq. 3
shows, the direction of the vector is only determined by the motion [R0 |t0 ] and
the 2-D projection p. The depth error Δd together with the off-plane motion (r3
and tz0 ) affects the length of lε . Therefore, the systematic error by sparse depth
sampling is not only influenced by the estimation error Δd in depth.
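The sketch below evaluates this error vector for a single point, following the reconstruction of Eq. 3 above; the scale factor a is assumed to absorb the camera intrinsics, and the function name is hypothetical.

import numpy as np

def systematic_error_vector(v, d, d_E, R0, t0, a=1.0):
    # v: unit viewing direction of the ray (3,); d, d_E: true and estimated depths
    # R0 (3, 3), t0 (3,): rigid motion; a: image scale factor (assumed)
    Rv = R0 @ v
    r3v = Rv[2]                                  # r_3 v (off-plane component)
    tx, ty, tz = t0
    M = np.array([[tz, 0.0, -tx],
                  [0.0, tz, -ty]])               # 2x3 matrix from Eq. 3
    scale = a * (d - d_E) / ((d * r3v + tz) * (d_E * r3v + tz))
    return scale * (M @ Rv)                      # 2-D vector p_xy - p_{E,xy}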
Fig. 2. (a) Illustration of the projection and the errors; (b) the metrics for the “influences” of the errors; (c) a case with more significant random noise
(Fig. 2(a)), which reflects the accuracy of the tracking method applied in our procedure. However, since p and δ_max are unknown in practice, there is no explicit measurement for the random noise. Therefore, we again make use of Eq. 3 for the metric of the random noise. Since the in-plane motion can be well estimated initially [6] and the depth error as well as the off-plane motion affect mostly the length of the systematic error vector in Eq. 3, the true projection p appears near to or on the line l_ε. Thus, we introduce here the point-to-line distance N^i (the distance between p^i and the line l_ε^i in Fig. 2(b)) as the metric for the “influence” of the random noise.
Fig. 2(b) and 2(c) show two examples of error conditions. In Fig. 2(b), N^i is obviously smaller than S^i. If this is the case for most of the points, we can draw the conclusion that the systematic error is more dominant than the random noise. Conversely, if N^i is larger than S^i (Fig. 2(c)) for most of the points, the random noise appears more dominant.
where n is the number of points, dist(·, ·) is the Euclidean distance of the two
points and pE = K[R̂0 |t̂0 ]xE . This least-squares optimization helps to find the
best fitting corrected depth values based on the estimated motion. We then refine
the motion by a follow-up motion estimation using the corrected depths.
However, if the random error is at about the same level as, or more dominant than, the systematic error (e.g. under small or specific motions causing a small systematic error), depth correction can even introduce more error in the motion estimation results (see Section 4). The reason is that minimizing $\sum_{i}^{n} \mathrm{dist}\!\left(p^i, p_E(d_E^i)\right)$ in Eq. 4 does not lead to properly fitting depth values (Fig. 2(c)). In contrast, if the
random error has an acceptable range (i.e. with reasonable δmax ), it’s better to
include more points in the motion estimation procedure, so that a globally con-
sistent solution of the motion can be estimated while the effect of the random
error is averaged to a minimum.
An error-based motion estimation strategy is therefore proposed according to
the influence metrics proposed in section 2.2 and a dominance factor f (Tab. 1).
We consider it a strong depth correction criterion if S̄ > f · N̄; in this case we perform depth correction and motion estimation on all points. For the cases not satisfying the strong criterion, we consider it a weak criterion if some points x_E^{i,weak} still contain dominant systematic error (S^i > f · N̄); we then perform depth correction on x_E^{i,weak}, but still refine the motion using all points. If neither of the criteria is satisfied, we consider all points as random-noise dominant (no further correction).
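A compact sketch of this decision rule (hypothetical helper, not the authors' code) is given below; S and N hold the per-point systematic-error and noise metrics.

def depth_correction_strategy(S, N, f):
    # S[i], N[i]: influence metrics of point i; f: dominance factor
    S_bar = sum(S) / len(S)
    N_bar = sum(N) / len(N)
    if S_bar > f * N_bar:                        # strong criterion
        return 'strong', list(range(len(S)))     # correct depths of all points
    weak = [i for i, s in enumerate(S) if s > f * N_bar]
    if weak:                                     # weak criterion
        return 'weak', weak                      # correct only these points
    return 'none', []                            # random noise dominant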
[Two plots; x-axes: before motion correction (mm)]
(a) Results of using 5-interval depth sampling;
References
1. Rossitti, S., Pfister, M.: 3D road-mapping in the endovascular treatment of cerebral
aneurysms and arteriovenous malformations. Interventional Neuroradiology 15(3),
283 (2009)
2. Ruijters, D.: Multi-modal image fusion during minimally invasive treatment. PhD
thesis, Katholieke Universiteit Leuven and the University of Technology Eindhoven,
TU/e (2010)
3. Brost, A., Liao, R., Strobel, N., Hornegger, J.: Respiratory motion compensation by
model-based catheter tracking during ep procedures. Medical Image Analysis 14(5),
695–706 (2010)
4. Ma, Y., King, A.P., Gogin, N., Rinaldi, C.A., Gill, J., Razavi, R., Rhode, K.S.:
Real-time respiratory motion correction for cardiac electrophysiology procedures
using image-based coronary sinus catheter tracking. In: Jiang, T., Navab, N., Pluim,
J.P.W., Viergever, M.A. (eds.) MICCAI 2010, Part I. LNCS, vol. 6361, pp. 391–399.
Springer, Heidelberg (2010)
5. Wang, P., Marcus, P., Chen, T., Comaniciu, D.: Using needle detection and tracking
for motion compensation in abdominal interventions. In: 2010 IEEE International
Symposium on Biomedical Imaging: From Nano to Macro, pp. 612–615. IEEE (2010)
6. Wang, J., Borsdorf, A., Hornegger, J.: Depth-layer based patient motion compen-
sation for the overlay of 3D volumes onto x-ray sequences. In: Proceedings Bildver-
arbeitung für die Medizin 2013, pp. 128–133 (2013)
7. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press (2003)
8. Taylor, J.R.: An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books (1997)
Video Synopsis Based on a Sequential Distortion
Minimization Method
1 Introduction
are extracted from video footage. They have been used to distinguish videos,
summarize them and provide access points [3].
Key frames selection approaches can be classified into basically three cate-
gories, namely cluster-based methods, energy minimization-based methods and
sequential methods [1,4]. Cluster-based methods take all frames from every shot
and classify by content similarity to take key-frame. The disadvantage of these
methods is that the temporal information of a video sequence is omitted. The
energy minimization based methods extract the key frames by solving a rate-
constrained problem. These methods are generally computational expensive by
iterative techniques. Sequential methods consider a new key frame when the
content difference from the previous key frame exceeds the predefined threshold.
In [5], key frames are computed based on unsupervised learning for video
retrieval and video summarization by combination of shot boundary detection,
intra-shot-clustering and keyframe “meta-clustering”. It exploits the Color Layout Descriptor (CLD) [6] on consecutive frames and computes differences between them to define the bounds of each shot. Recently, dynamic programming
techniques have been proposed in the literature, such as the MINMAX approach
of [7] to extract the key frames of a video sequence. In this work, the problem is
solved optimally in O(N 2 · Kmax ), where Kmax is related to the rate-distortion
optimization. In [8], a video is represented as a complete undirected graph and
the normalized cut algorithm is carried out to globally and optimally partition
the graph into video clusters. The resulting clusters form a directed temporal
graph and a shortest path algorithm is proposed for video summarization.
Video summarization has been applied by many researchers with multiple ap-
proaches. Most of them are dealing with minimizing content features, defining
restrictions on distortion, applying simple clustering-based techniques and ignor-
ing temporal variation. In addition, due to their high computational cost (O(N³) when the number of key frames is proportional to the number of video frames N), most of the aforementioned methods have been used to extract a small percentage of initial frames that represent the visual content well, but they have not been used to reproduce a video synopsis. Video synopsis is quite an important task for video summarization, since it is another short video representation of
visual content and video variation. This paper refers to video summarization by
the meaning of video synopsis creation. The resulting video synopsis takes into
account temporal content variation, shot detection, and minimizes the content
distortion between the initial video and the synoptic video. At the same time
the proposed method has low computational cost O(N 2 ). Another advantage of
this work is that it can be used under any visual content description.
The rest of the paper is organized as follows: Section 2 gives the problem
formulation. Section 3 presents the proposed methodology of the video synopsis
creation. The experimental results are
given in Section 4. Finally, conclusions and discussion are provided in Section 5.
2 Problem Formulation
The problem of video synopsis belongs to the class of video summarization problems. Its goal is to create a new video, shorter than the initial video according to a given parameter α, without significant loss of content between the two videos (the distortion between the original video and the video synopsis is minimized). The ratio between the temporal duration of the video synopsis and the initial video is equal to α ∈ [0, 1]. Let N denote the number of frames of the original video. Then, the video synopsis consists of α · N frames. Therefore, we have to select the α · N representative key frames. The video synopsis is played back at the original frame rate, meaning that the real speed of the new video has been increased by a factor of 1/α on average. For example, for a video of 5 sec duration at 25 frames/sec, the whole video consists of 5 × 25 = 125 frames; with the given parameter α = 0.2, the final video will have 125 × 0.2 = 25 frames. In other words, the final duration will be one second, which is 20% of the initial video.
Let C_i, i ∈ {1, ..., N}, denote the visual descriptor of the i-th frame of the original video. Let S ⊂ {1, ..., N} denote the frames of the video synopsis. According to the problem definition, it holds that the number of frames of the video synopsis (|S|) is equal to
α · N . Then, the distortion D({1, ..., N }, S) between the original video and video
synopsis is given by the following equation:
$$D(\{1, \dots, N\}, S) = \sum_{i=1}^{S(1)} d(i, S(1)) + \sum_{i=S(|S|)+1}^{N} d(i, S(|S|)) + \sum_{i=S(1)+1}^{S(|S|)} \min_{S(j) \le i \le S(j+1)} \big( d(i, S(j)),\, d(i, S(j+1)) \big) \tag{1}$$
where d(i, S(j)) denotes the distance between the visual descriptors of the i-th frame and the S(j)-th frame. S(j) and S(j + 1) are two successive frames of the video synopsis such that S(j) ≤ i ≤ S(j + 1); this means that S(j) is determined by the index
i. The first and the second parts of this sum concern the cases that the frame
i is located before the first key frame S(1) or after the last key frame S(|S|),
respectively. Therefore, the used distortion that is defined by the sum of visual
distances between the frame of original video and the “closest” corresponding
frame of video synopsis, can be considered as an extension of the definition of
Iso-Content Distortion Principle [1] in the domain of shots.
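As an illustration, the distortion of Eq. 1 can be evaluated directly from the key-frame indices. The sketch below is not the authors' code; it assumes 1-based frame indices and a caller-supplied visual distance function d(i, j).

def distortion(N, S, d):
    # N: number of frames; S: sorted key-frame indices; d(i, j): visual distance
    total = sum(d(i, S[0]) for i in range(1, S[0] + 1))        # before the first key frame
    total += sum(d(i, S[-1]) for i in range(S[-1] + 1, N + 1)) # after the last key frame
    for j in range(len(S) - 1):                                # between successive key frames
        for i in range(S[j] + 1, S[j + 1] + 1):
            total += min(d(i, S[j]), d(i, S[j + 1]))
    return total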
3 Methodology
Fig. 1 illustrates a scheme of the proposed system architecture. The proposed
method can be divided into several steps. Initially, we estimate the CLD for each
frame of the original video. Next, we perform shot detection (see Section 3.1). Based on the shot detection results and on the given parameter α, we estimate the number of frames per shot for the video synopsis (see Section 3.1). Finally, the
[Fig. 1: system architecture — Video → CLD → Shot Detection → Number of Frames per Shot (parameter α) → Sequential Distortion Minimization → Video Synopsis]
L_k, is defined by the following equation: $b_k = \frac{\alpha \cdot N \cdot L_k}{\sum_{k=1}^{|SH|} L_k}$, where |SH| denotes the
number of shots. This definition of b_k also satisfies the constraint that the video synopsis should contain α · N frames: $\sum_{k=1}^{|SH|} b_k = \alpha \cdot N$. In the special case of b_k ≤ 1, which means that all frames of the shot have the same content, we set b_k = 2 so that the video synopsis summarizes all of the shots of the video.
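A small sketch of this allocation follows; how b_k is rounded to an integer is an implementation choice that the text does not specify, so the rounding here is an assumption.

def frames_per_shot(shot_lengths, alpha, N):
    # shot_lengths: list of shot lengths L_k; alpha: synopsis ratio; N: total frames
    total = sum(shot_lengths)
    budgets = [alpha * N * L / total for L in shot_lengths]    # b_k of the equation above
    return [2 if b <= 1 else int(round(b)) for b in budgets]   # special case b_k <= 1 -> 2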
When the number of key frames of shot k becomes b_k, CAN_k becomes the empty set (CAN_k = ∅), since we cannot select more frames from this shot. The process continues until the number of key frames of the video synopsis becomes α · N.
Concerning the computational cost, this procedure can be implemented in O(N²). The worst case appears when the video consists of one shot; in this case it holds that N = |SH_1|. In the first step, finding the global minimum of D({1, ..., N}, ∅) needs O(N²) (see Equation 1). In the n-th step of the method, we have to compute D({1, ..., N}, S ∪ u) only when the previous or the next key frame of u is the last key frame that was added to S in the previous step (n−1). Otherwise, it holds that D({1, ..., N}, S ∪ u) = D({1, ..., N}, S). This needs O(N²/n²), since the video content changes “smoothly” in the sense that the selected frames are approximately equally distributed over time. Let T(·) denote the computational cost of the algorithm. It holds that T(1) = O(N²). In the n-th step, we have to find the minimum of D(·,·), which can be obtained in O(N), and to update the specific values of D(·,·) in O(N²/n²). So, the total computational cost is O(N²).
In addition, we have proposed a simple variation of SeDiM that is presented
hereafter. In this variation, we just assume the first and last frame of each shot
as two starting key frames for video synopsis. So, in the case of one-shot, we
initialize Sk = {SHk (1), SHk (|SHk |)}. This algorithm is called SeDiM-IN. The
rest of the process is exactly the same with SeDiM. The proposed methods do
not guarantee global minima of distortion, since they sequentially minimize the
distortion function. SeDiM guarantees global minima of distortion only in the
case of bk = 1.
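The greedy selection can be sketched as follows, reusing the distortion sketch above. This naive version re-evaluates every candidate in full at each step; the paper reaches O(N²) overall by updating only the distortion terms affected by the last inserted key frame.

def sedim(N, d, budget):
    # N: number of frames; d(i, j): visual distance; budget: alpha * N key frames
    S = []
    candidates = set(range(1, N + 1))
    while len(S) < budget and candidates:
        best = min(candidates,
                   key=lambda u: distortion(N, sorted(S + [u]), d))
        S = sorted(S + [best])                   # add the frame that minimises Eq. 1
        candidates.discard(best)
    return S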
4 Experimental Results
In this section, the experimental results and comparisons with other algo-
rithms are presented. We have tested the proposed algorithm on a data set containing more than 100 video sequences. We selected ten videos (eight real-life and two synthetic animation videos) of different content in order to evaluate the distortion of each algorithm on videos with varied content. The real-life videos were recorded in either indoor or outdoor environments. The ten selected videos consist of 69 shots. The number of shots per video varies from one to 22. In addition, the duration of the videos varies from 300 frames to 1925 frames. Fig. 2 depicts snapshots from these videos. The names of the videos are given in the first column of Table 1.
Table 1. The distortion D({1, ..., N }, S) between the original video and video synopsis
α = 0.1 α = 0.1 α = 0.1 α = 0.1 α = 0.3 α = 0.3 α = 0.3 α = 0.3
Dataset SeDiM SeDiM-IN CEA TEA SeDiM SeDiM-IN CEA TEA
foreman.avi 19209 18973 21814 22069 6755.1 6774.1 7738.2 8992.2
coast guard.avi 6962.7 7054.8 7486.6 7079.9 2521.4 2562.4 2669.5 4146.4
hall.avi 3913.8 3938.8 4309.1 4444.4 2137 2141 2228.1 3863
table.avi 10207 11578 11529 10928 4097 4113.2 4542.4 6046.4
blue.avi 13826 14487 14690 16171 5419 5494 5736 10631
doconCut.avi 116420 122550 142460 148230 40503 43602 45412 70521
data.avi 14635 15800 17147 15294 4292 4303 5058 17260
Wildlife.avi 27187 29841 31763 33752 9052 9210 10826 12493
MessiVsRonaldo.avi 74630 85270 85310 111070 20971 22051 23001 40209
FootballHistory.avi 68434 79676 80236 95497 16402 16842 17323 58503
5 Conclusion
In this paper, we have proposed a video synopsis creation scheme that can be
used in video summarization applications. According to the proposed frame-
work, the problem of video synopsis creation is reduced to the minimization of
the distortion between the initial video and the video synopsis. The proposed
method sequentially minimizes this distortion, resulting in high performance re-
sults under any value of the parameter α that controls the number of frames
of the video synopsis. In addition, the proposed scheme can be used under any
type of video content description.
1 https://www.dropbox.com/sh/rpysux4oa746jty/B265lHwpAB
References
1. Panagiotakis, C., Doulamis, A., Tziritas, G.: Equivalent key frames selection based
on iso-content principles. IEEE Transactions on Circuits and Systems for Video
Technology 19, 447–451 (2009)
2. Hanjalic, A., Zhang, H.: An integrated scheme for automated video abstraction
based on unsupervised cluster-validity analysis. IEEE Trans. on Circuits and Sys-
tems for Video Tech. 9, 1280–1289 (1999)
3. Girgensohn, A., Boreczky, J.S.: Time-constrained keyframe selection technique.
Multimedia Tools and Applications 11, 347–358 (2000)
4. Panagiotakis, C., Doulamis, A., Tziritas, G.: Equivalent key frames selection based
on iso-content distance and iso-distortion principles. In: IEEE International Work-
shop on Image Analysis for Multimedia Interactive Services (2007)
5. Hammoud, R., Mohr, R.: A probabilistic framework of selecting effective key frames
for video browsing and indexing. In: International Workshop on Real-Time Image
Sequence Analysis (RISA 2000), pp. 79–88 (2000)
6. Manjunath, B., Ohm, J., Vasudevan, V., Yamada, A.: Color and texture descrip-
tors. IEEE Trans. on Circuits and Systems for Video Tech. 11, 703–715 (2001)
7. Li, Z., Schuster, G., Katsaggelos, A.: Minmax optimal video summarization. IEEE
Trans. Circuits Syst. Video Techn. 15, 1245–1256 (2005)
8. Ngo, C.W., Ma, Y.F., Zhang, H.J.: Video summarization and scene detection by
graph modeling. IEEE Trans. Circuits Syst. Video Techn. 15, 296–305 (2005)
9. Kasutani, E., Yamada, A.: The MPEG-7 color layout descriptor: a compact im-
age feature description for high-speed image/video segment retrieval, pp. 674–677
(2001)
10. Pele, O., Werman, M.: The quadratic-chi histogram distance family. In: Dani-
ilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312,
pp. 749–762. Springer, Heidelberg (2010)
A Graph Embedding Method Using the Jensen-Shannon
Divergence
Abstract. Riesen and Bunke recently proposed a novel dissimilarity based ap-
proach for embedding graphs into a vector space. One drawback of their approach is the computational cost of the graph edit operations required to compute the dissimilarity between graphs. In this paper we explore whether the Jensen-Shannon divergence
can be used as a means of computing a fast similarity measure between a pair of
graphs. We commence by computing the Shannon entropy of a graph associated
with a steady state random walk. We establish a family of prototype graphs by
using an information theoretic approach to construct generative graph prototypes.
With the required graph entropies and a family of prototype graphs to hand, the
Jensen-Shannon divergence between a sample graph and a prototype graph can be computed. It is defined in terms of the entropies of the pair of separate graphs and of a composite structure formed by the pair of graphs. Since the required entropies of the graphs can be efficiently computed, the proposed graph embedding using the Jensen-Shannon divergence avoids the burdensome graph edit operations. We
explore our approach on several graph datasets abstracted from computer vision
and bioinformatics databases.
1 Introduction
methods to the graph mining domain. The key idea is to use the edit distance from a
sample graph to a number of class prototype graphs to give a vectorial description of
the sample graph in the embedding space. Furthermore, this approach potentially allows
any (dis)similarity measure of graphs to be used for graph (dis)similarity embedding as
well. Unfortunately, the edit distance between a sample graph and a prototype graph
requires burdensome computations, and as a result the graph dissimilarity embedding
using the edit distance can not be efficiently computed for graphs.
To address this inefficiency, in this paper we investigate whether the Jensen-Shannon
divergence can be used as a means of establishing a computationally efficient similarity
measure between a pair of graphs, and then use such a measure to propose a novel fast
graph embedding approach. In information theory the Jensen-Shannon divergence is a
nonextensive mutual information theoretic measure based on nonextensive entropies.
An extensive entropy is defined as the sum of the individual entropies of two probabil-
ity distributions. The definition of nonextensive entropy generalizes the sum operation
into composite actions. The Jensen-Shannon divergence is defined as a similarity mea-
sure between probability distributions, and is related to the Shannon entropy [3]. The
problem of establishing Jensen-Shannon divergence measures for graphs is that of com-
puting the required entropies for individual and composite graphs. In [4], we have used
the steady state random walk of a graph to establish a probability distribution for this
purpose. The Jensen-Shannon divergence between a pair of graphs is defined as the dif-
ference between the entropy of a composite structure and their individual entropies. To determine a set of prototype graphs for vector space embedding, we use an information theoretic approach to construct the required graph prototypes [5]. Once the vectorial
descriptions of a set of graphs are established, we perform graph classification in the
principal component space. Experiments on graph datasets abstracted from bioinfor-
matics and computer vision databases demonstrate the effectiveness and the efficiency
of the proposed graph embedding method.
This paper is organized as follows. Section 2 develops a Jensen-Shannon divergence
measure between graphs. Section 3 reviews the concept of graph dissimilarity embed-
ding, and shows how to compute the similarity vectorial descriptions for a set of graphs
using the Jensen-Shannon divergence. Section 4 provides the experimental evaluations.
Finally, Section 5 provides the conclusion and future work.
$$A(i, j) = \begin{cases} 1 & \text{if } (i, j) \in E; \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
The vertex degree matrix of G(V, E) is a diagonal matrix D with diagonal elements given by $D(v_i, v_i) = d(i) = \sum_{j \in V} A(i, j)$.
Shannon Entropy. For the graph G(V, E), the probability of the steady state random walk on G(V, E) visiting vertex i is $P_G(i) = d(i) / \sum_{j \in V} d(j)$. The Shannon entropy associated with the steady state random walk on G(V, E) is
$$H_S(G) = -\sum_{i=1}^{|V|} P_G(i) \log P_G(i). \tag{2}$$
Time Complexity. For the graph G(V, E) having n = |V | vertices, the Shannon en-
tropy HS (G) requires time complexity O(n2 ).
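A direct implementation of Eq. 2 from an adjacency matrix might look like this (illustrative sketch):

import numpy as np

def shannon_entropy(A):
    # A: adjacency matrix of the graph
    deg = A.sum(axis=1)                 # vertex degrees d(i)
    P = deg / deg.sum()                 # steady-state visiting probabilities P_G(i)
    P = P[P > 0]                        # skip isolated vertices (P = 0)
    return -(P * np.log(P)).sum()       # H_S(G) of Eq. 2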
Let graphs Gp and Gq be the connected components of the disjoint union graph GDU ,
and ρp = |V (Gp )|/|V (GDU )| and ρq = |V (Gq )|/|V (GDU )|. The entropy (i.e. the
composite entropy) [7] of GDU is
Here the entropy function H(·) is the Shannon entropy HS (·) defined in Eq.(2).
$$D_{JS}(P, Q) = H_S\!\left(\frac{P + Q}{2}\right) - \frac{H_S(P) + H_S(Q)}{2}. \tag{5}$$
where $H_S(P) = -\sum_{k=1}^{K} p_k \log p_k$ is the Shannon entropy of the probability distribution P. Given a pair of graphs $G_p(V_p, E_p)$ and $G_q(V_q, E_q)$, the Jensen-Shannon divergence for them is
$$D_{JS}(G_p, G_q) = H(G_p \oplus G_q) - \frac{H(G_p) + H(G_q)}{2}. \tag{6}$$
where H(Gp ⊕ Gq ) is the entropy of the composite structure. Here we use the disjoint
union defined in Sec.2.2 as the composite structure, and the entropy function H(·) is
the Shannon entropy HS (·) defined in Eq.(2).
Time Complexity. For a pair of graphs Gp (Vp , Ep ) and Gq (Vq , Eq ) both having n
vertices, computing the Jensen-Shannon divergence DJS (Gp , Gq ) defined in Eq. (6)
requires time complexity O(n2 ).
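Combining the previous sketch with Eq. 6, and taking the disjoint union literally as the composite structure whose Shannon entropy is computed from its own steady-state random walk (one natural reading of the definition above), a sketch is:

import numpy as np

def jsd_graphs(Ap, Aq):
    # Ap, Aq: adjacency matrices of the two graphs
    n, m = Ap.shape[0], Aq.shape[0]
    A_union = np.zeros((n + m, n + m))
    A_union[:n, :n] = Ap                # disjoint union: block-diagonal adjacency
    A_union[n:, n:] = Aq
    return shannon_entropy(A_union) - 0.5 * (shannon_entropy(Ap) + shannon_entropy(Aq))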
where D_JS(G_i, T_m) is the Jensen-Shannon divergence between the sample graph G_i(V_i, E_i) and the m-th prototype graph T_m. Since the Jensen-Shannon divergence between graphs can be efficiently computed, the proposed embedding method is more efficient than the dissimilarity embedding using the costly graph edit distance.
4 Experimental Evaluation
Classification accuracy (%):
Data    ALOI    CMU     NCI109   MUTAG
DEJS    91.35   100     65.49    80.75
CIZF    −       100     67.19    80.85
PVAG    −       62.59   64.59    82.44
DEED    −       100     63.34    83.55

Runtime:
Data    ALOI    CMU     NCI109   MUTAG
DEJS    2"      1"      2"       1"
CIZF    −       2'33"   14"      1"
PVAG    −       5"      19"      1"
DEED    −       3h55'   17h49'   49'23"
Experimental Results: On the ALOI dataset, which possesses graphs of more than one thousand vertices, our method takes 2 seconds, while DEED takes over one day and CIZF and PVAG even generate overflows in the computation. The runtimes of the CIZF and PVAG methods are only competitive with our method DEJS on the MUTAG, NCI109 and CMU datasets, which possess graphs of smaller sizes. This reveals that our DEJS can easily scale up to graphs with thousands of vertices. DEED can achieve classification accuracies competitive with our DEJS, but requires more computation time. The graph similarity embedding using the Jensen-Shannon divergence measure is more efficient than that using the edit distance dissimilarity measure proposed by Riesen and Bunke. The reason for this is that the Jensen-Shannon divergence between graphs only requires time quadratic in the number of vertices.
Furthermore, both our embedding method and DEED also require extra runtime for
learning the required prototype graphs. For the ALOI, CMU, NCI109 and MUTAG
datasets, the average times for learning a prototype graph are 5 hours, 30 minutes, 15
minutes and 5 minutes respectively. This reveals that for graphs of large sizes, our em-
bedding method may require additionally and potentially expansive computations for
learning the prototype graphs. However for graphs of less than 300 vertices, the learn-
ing of prototype graphs can still be completed in polynomial time.
We show the experimental results in Fig.1 and Fig. 2. Fig.1 and Fig. 2 show the effects
of vertex and edge deletion respectively. The x-axis represents the percentage (1% to 35%) of vertices or edges deleted, and the y-axis shows the Euclidean distance d_{G_0,G_n} between the original seed graph G_0 and its noise-corrupted counterpart G_n. From Fig. 1 and Fig. 2, there is an approximately linear relationship in each case. This implies that the proposed method possesses the ability to distinguish graphs under controlled structural error.
[Fig. 1: Euclidean distance vs. number of node edit operations, three panels]
[Fig. 2: Euclidean distance vs. number of edge edit operations, three panels]
feature space by computing the Jensen-Shannon divergence measure between the sam-
ple graph and each of the prototype graphs. We perform 10-fold cross-validation with a KNN classifier to assign the graphs into classes. Experimental results
demonstrate the effectiveness and efficiency of the proposed method. Since learning
prototype graphs usually requires expensive computation, our further work is to define
a fast approach to learn the prototype graphs. This will be useful to define a faster graph
embedding method.
Acknowledgments. We thank Dr. Peng Ren for providing the Matlab implementation
for the graph Ihara zeta function method.
References
1. Riesen, K., Bunke, H.: Graph Classification and Clustering Based on Vector Space Embed-
ding. World Scientific Press (2010)
2. Pekalska, E., Duin, R.P.W., Paclı́k, P.: Prototype Selection for Dissimilarity-based Classifiers.
Pattern Recognition 39, 189–208 (2006)
3. Martins, A.F., Smith, N.A., Xing, E.P., Aguiar, P.M., Figueiredo, M.A.: Nonextensive Infor-
mation Theoretic Kernels on Measures. Journal of Machine Learning Research 10, 935–975
(2009)
4. Bai, L., Hancock, E.R.: Graph Kernels from The Jensen-Shannon Divergence. Journal of
Mathematical Imaging and Vision (to appear)
5. Han, L., Hancock, E.R., Wilson, R.C.: An Information Theoretic Approach to Learning
Generative Graph Prototypes. In: Pelillo, M., Hancock, E.R. (eds.) SIMBAD 2011. LNCS,
vol. 7005, pp. 133–148. Springer, Heidelberg (2011)
6. Gadouleau, M., Riis, S.: Graph-theoretical Constructions for Graph Entropy and Network
Coding Based Communications. IEEE Transactions on Information Theory 57, 6703–6717
(2011)
7. Körner, J.: Coding of An Information Source Having Ambiguous Alphabet and The Entropy
of Graphs. In: Proceedings of the 6th Prague Conference on Information Theory, Statistical
Decision Function, Random Processes, pp. 411–425 (1971)
8. Luo, B., Hancock, E.R.: Structural Graph Matching Using the EM Algorithm and Singular
Value Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 23,
1120–1136 (2001)
9. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
10. Rissanen, J.: Modelling by Shortest Data Description. Automatica 14, 465–471 (1978)
11. Rissanen, J.: An Universal Prior for Integers and Estimation by Minimum Description
Length. Annals of Statistics 11, 417–431 (1983)
12. Han, L., Hancock, E.R., Wilson, R.C.: Characterizing Graphs Using Approximate von Neu-
mann Entropy. In: Vitrià, J., Sanches, J.M., Hernández, M. (eds.) IbPRIA 2011. LNCS,
vol. 6669, pp. 484–491. Springer, Heidelberg (2011)
13. Shervashidze, N., Borgwardt, K.M.: Fast Subtree Kernels on Graphs. In: NIPS,
pp. 1660–1668 (2009)
14. Ren, P., Wilson, R.C., Hancock, E.R.: Graph Characterization via Ihara Coefficients. IEEE
Transactions on Neural Networks 22, 233–245 (2011)
15. Wilson, R.C., Hancock, E.R., Luo, B.: Pattern Vectors from Algebraic Graph Theory. IEEE
Transactions on Pattern Analysis and Machine Intelligence 27, 1112–1124 (2005)
Mixtures of Radial Densities for Clustering
Graphs
Brijnesh J. Jain
1 Introduction
Attributed graphs are a versatile and expressive data structure for representing
complex patterns consisting of objects and relationships between objects. Ex-
amples include molecules, mid- and high-level description of images, instances
of relational schemes, web graphs, and social networks.
Despite the many advantages of graph-based representations, statistical learn-
ing on attributed graphs is underdeveloped compared to learning on feature vec-
tors. For example, generic state-of-the-art methods for clustering of non-vectorial
data are mainly based on pairwise dissimilarity methods such as hierarchical or
spectral clustering. One research direction to complement the manageable range
of graph clustering methods aims at extending centroid-based clustering methods
to attributed graphs [2,3,9,11], which more or less amount to different variants
of graph quantization methods. A theoretical justification of these approaches is
provided in [8] by means of establishing conditions for optimality and statistical
consistency.
Vector quantization, k-means and their variants are not only well-known for
their simplicity but also for their deficiencies. Such a statement in the graph
domain is difficult to derive, because more advanced generalizations of standard
clustering algorithms to graphs are rare and an empirical comparison to state-
of-the-art graph clustering methods is missing.
Following this line of research, the contributions of this paper are twofold: (1)
we extend mixtures of Gaussians to mixtures of radial densities on graphs and
adopt the EM algorithm for parameter estimation on the basis of the orbifold
2 Graph Orbifolds
The section introduces a suitable representation of attributed graphs by means
of the orbifold framework as proposed in [5,7].
Let E be a p-dimensional Euclidean space. An attributed graph X = (V, E, α)
consists of a set V of vertices, a set E ⊆ V × V of edges, and an attribute function α : V × V → E, such that α(i, j) ≠ 0 for each edge and α(i, j) = 0 for each non-edge. Attributes α(i, i) of vertices i may take any value from E.
For simplifying the mathematical treatment, we assume that all graphs are
of order n, where n is chosen to be sufficiently large. Graphs of order less than
n can be extended to order n by including isolated vertices with attribute zero.
This is a merely technical assumption to simplify mathematics. A graph X is
completely specified by its matrix representation x = (α(i, j)). Let X = E^{n×n} be the Euclidean space of all (n × n)-matrices with elements from E and let Π^n be the set of all (n × n)-permutation matrices. For each p ∈ Π^n we define a mapping
γ_p : X → X,  x ↦ p^T x p.
Then G = {γ_p : p ∈ Π^n} is a finite group acting on X. For x ∈ X, the orbit of x is the set defined by [x] = {γ(x) : γ ∈ G}. Thus, the orbit [x] consists of all possible matrix representations of X obtained by reordering its vertices. We define a graph orbifold by the quotient set
X_G = X / G = {[x] : x ∈ X}
of all orbits. Its natural projection is given by π : X → X_G, x ↦ [x]. In the following, we identify [x] with X and occasionally write x ∈ X if π projects x to X.
In order to mimic Gaussian distributions, we extend the Euclidean norm ‖·‖ to a metric on X_G defined by
d(X, X') = min { ‖x − x'‖ : x ∈ X, x' ∈ X' },   (1)
where ‖·‖ is the Euclidean distance on X. We call a pair (x, x') ∈ X × X' with ‖x − x'‖ = d(X, X') an optimal alignment of X and X'.
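For small graphs, the metric (1) can be evaluated exactly by enumerating all vertex permutations. The following Python sketch (our own illustration, not code from the paper; in practice an approximate alignment such as graduated assignment would be used instead of brute force) computes d(X, X') and an optimal alignment from two matrix representations:

    import itertools
    import numpy as np

    def graph_distance(x, x_prime):
        """Exact d(X, X') for small graphs: minimise the Frobenius distance
        between x and all reorderings p^T x' p of the second representation."""
        n = x.shape[0]
        best, best_aligned = np.inf, None
        for perm in itertools.permutations(range(n)):
            p = np.eye(n)[list(perm)]             # permutation matrix
            aligned = p.T @ x_prime @ p           # reordered representation of X'
            dist = np.linalg.norm(x - aligned)    # Euclidean distance on the matrix space
            if dist < best:
                best, best_aligned = dist, aligned
        return best, best_aligned                 # (x, best_aligned) is an optimal alignment

    # toy example: two 3-vertex unattributed graphs (scalar attributes, p = 1)
    x1 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
    x2 = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
    print(graph_distance(x1, x2)[0])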
An orbifold function is a mapping of the form f : X_G → R. Instead of studying f, it is more convenient to study the lift f̃ : X → R of f satisfying f̃(x) = f(π(x)) = f(X) for all x ∈ X.
D_c = {x ∈ X : ‖x − c‖ ≤ ‖x − c'‖, c' ∈ C}.
where the parameters π_j are the mixing coefficients of components j with centers C_j and widths σ_j. Lifting the mixture model p(X) to the Euclidean space X yields
p̃(x) = Σ_{j=1}^{K} π_j h̃(x | C_j, σ_j),   (5)
where x_j ∈ D_j. Note that the argument x of the mixture p̃ and the data points x_j represent the same graph. To emphasize that the mixture p̃(x) is indeed
3.3 EM Algorithm
Suppose that S = (X_i)_{i=1}^N are N example graphs generated by a mixture model p(X) of radial densities as defined in equation (4). Our goal is to estimate the parameters π_j, C_j, and σ_j^2 on the basis of S by adopting the EM algorithm for maximizing the log-likelihood
ℓ(Θ|S) = Σ_{i=1}^{N} ln Σ_{j=1}^{K} π_j h(X_i | C_j, σ_j),
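A minimal EM sketch for this model is given below. It is our own simplified illustration: we assume h is an unnormalised Gaussian of the orbifold distance, that an alignment routine (such as the one sketched in Section 2) returns a representation of each graph optimally aligned to each current centre, and we omit the adjustment terms for truncation discussed in the paper.

    import numpy as np

    def radial_density(dist, sigma):
        # un-normalised radial (Gaussian-shaped) density of the distance to the centre
        return np.exp(-0.5 * (dist / sigma) ** 2)

    def em_radial_mixture(graphs, centers, sigmas, weights, align, n_iter=50):
        """graphs: list of (n, n) matrix representations; centers: list of K matrices;
        sigmas, weights: numpy arrays of length K; align(x, c) returns a representation
        of x optimally aligned to the centre c."""
        K = len(centers)
        for _ in range(n_iter):
            # E-step: responsibilities r[i, j] proportional to pi_j * h(X_i | C_j, sigma_j)
            aligned = [[align(x, c) for c in centers] for x in graphs]
            dists = np.array([[np.linalg.norm(a - c) for a, c in zip(row, centers)]
                              for row in aligned])
            resp = weights * radial_density(dists, sigmas)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: update mixing weights, centres and widths
            Nj = resp.sum(axis=0)
            weights = Nj / len(graphs)
            for j in range(K):
                centers[j] = sum(resp[i, j] * aligned[i][j]
                                 for i in range(len(graphs))) / Nj[j]
                sigmas[j] = np.sqrt((resp[:, j] * dists[:, j] ** 2).sum() / Nj[j])
        return centers, sigmas, weights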
4 Experiments
4.1 Data
We selected subsets of the following training data sets from the IAM graph
database repository [14]: Letter (low, medium, high), fingerprint, grec, and coil.
In addition, we considered the Monoamine Oxydase (MAO) data set. The let-
ter data sets compile distorted letter drawings from the Roman alphabet that
consist of straight lines. Lines of a letter are represented by edges and endpoints
of lines by vertices. Fingerprint images of the fingerprints data set are con-
verted into graphs, where vertices represent endpoints and bifurcation points of
skeletonized versions of relevant regions. Edges represent ridges in the skeleton.
The distortion levels are low, medium, and high. The grec data set consists of
graphs representing symbols from noisy versions of architectural and electronic
drawings. Vertices represent endpoints, corners, intersections, or circles. Edges
represent lines or arcs. The coil-4 data set is a subset of the coil-100 data set
consisting of 4 out of 32 objects that are expected to be recognized easily. The
arbitrarily chosen objects correspond to indices 7 (car), 17 (cup), 52 (duck), and
75 (house) of the coil-100 data set (starting at index 1). After preprocessing, the
images are represented by graphs, where vertices represent endpoints of lines and
edges represent lines. The mao data set is composed of 68 molecules, divided
into two classes, where 38 molecules inhibit the monoamine oxidase (antidepres-
sant drugs) and 30 do not. These molecules are composed of different chemical
elements and are thus encoded as labeled graphs, where nodes represent atoms
and edges represent bonds between atoms.1
see that acc is often close to max. 3. Inspecting the letter data sets, where low,
medium and high refer to the noise level with which the letters were distorted,
we observe that the EM algorithm and k-means are most robust against noise.
It is notable that the performance of k-medoids strongly declines with increas-
ing noise level. In contrast to findings in vector spaces (see e.g. [13]), results
on graphs do not confirm a common view that k-medoids is more robust than
k-means. 4. As shown in Table 2, other hierarchical clustering methods using
complete, average, and single linkage were not able to recover the class structure
given the fixed number K of clusters.
The good results of Ward’s clustering and the EM algorithm suggest that
clustering methods relying on the graph orbifold framework complement the
collection of existing clustering approaches. The beneficial feature of orbifold-
based clustering approaches is the notion of centroid of graphs, which may im-
prove clustering results for some problems and can be used for nearest neighbor
classification in a straightforward manner.
5 Conclusion
Graph orbifolds together with lifting and truncating constitute a suitable toolkit
to generalize standard clustering methods from vectors to graphs and to
provide geometrical insight into the graph domain. Using this toolkit, we showed
that the study of mixtures of radial densities on graphs can be reduced to the
study of mixtures of truncated Gaussians, where each truncated component lives
in a different region of the Euclidean space. From these findings, we adapted the
EM algorithm for parameter estimation. In experiments, we compared clustering
methods operating in a graph orbifold against state-of-the-art clustering methods
based on pairwise dissimilarities. Results show that clustering in a graph orbifold
is a competitive alternative and therefore complements the collection of existing
clustering algorithms on graphs.
Open issues with respect to estimating the parameters of a mixture model in-
clude a principled approximation of the adjustment terms for truncation, exten-
sion to covariances, and a statement of the extent to which mixtures of radial densities
can approximate arbitrary distributions on graphs.
References
1. Gold, S., Rangarajan, A.: A Graduated Assignment Algorithm for Graph Match-
ing. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(4),
377–388 (1996)
2. Gold, S., Rangarajan, A., Mjolsness, E.: Learning with preknowledge: cluster-
ing with point and graph matching distance measures. Neural Computation 8(4),
787–804 (1996)
3. Günter, S., Bunke, H.: Self-organizing map for clustering in the graph domain.
Pattern Recognition Letters 23(4), 405–417 (2002)
4. Jain, B.J., Obermayer, K.: Algorithms for the Sample Mean of Graphs. In: Jiang,
X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702, pp. 351–359. Springer, Heidel-
berg (2009)
5. Jain, B., Obermayer, K.: Structure spaces. The Journal of Machine Learning Re-
search 10 (2009)
6. Jain, B.J., Obermayer, K.: Large sample statistics in the domain of graphs. In: Han-
cock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F. (eds.) SSPR&SPR
2010. LNCS, vol. 6218, pp. 690–697. Springer, Heidelberg (2010)
7. Jain, B., Obermayer, K.: Maximum Likelihood Method for Parameter Estimation
of Bell-Shaped Functions on Graphs. Pat. Rec. Letters 33(15), 2000–2010 (2012)
8. Jain, B.J., Obermayer, K.: Graph quantization. Computer Vision and Image Un-
derstanding 115(7), 946–961 (2011)
9. Jain, B.J., Wysotzki, F.: Central clustering of attributed graphs. Machine Learn-
ing 56(1), 169–207 (2004)
10. Kaufman, L., Rousseeuw, P.: Clustering by means of medoids. Statistical Data
Analysis Based on the L1 -Norm and Related Methods, 405–416 (1987)
11. Lozano, M.A., Escolano, F.: Protein classification by matching and clustering sur-
face graphs. Pattern Recognition 39(4), 539–551 (2006)
12. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm.
Advances in Neural Information Processing Systems 2, 849–856 (2002)
13. Ng, R.T., Han, J.: Efficient and effective clustering methods for spatial data mining.
In: Proceedings of the 20th International Conference on Very Large Data Bases,
VLDB 1994, pp. 144–155 (1994)
14. Riesen, K., Bunke, H.: IAM graph database repository for graph based pattern
recognition and machine learning. In: da Vitoria Lobo, N., Kasparis, T., Roli, F.,
Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR
2008. LNCS, vol. 5342, pp. 287–297. Springer, Heidelberg (2008)
15. Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association 58(301) (1963)
Complexity Fusion for Indexing Reeb Digraphs
1 Introduction
This paper is motivated by the hypothesis that mixing the same graph complex-
ity measure over the same shape, represented with different graphs, boosts the
discrimination power of isolated complexity measures. To commence, there has
been a recent effort in quantifying the intrinsic complexity of graphs in their orig-
inal discrete space. Early attempts have incorporated principles related to MDL
(Minimum Description Length) to trees and graphs (see [1] for trees and [2]
for edge-weighted undirected graphs). More recently, the intersection between
structural pattern recognition and complex networks has proved to be fruitful
and has inspired several interesting measures of graph complexity. Many of them
rely on elements of spectral graph theory. For instance, Passerini and Severini
have applied the quantum (von Neumann) entropy to graphs [3]. We have re-
cently applied thermodynamic depth [5] to the domain of graphs [6] and we
have extended the approach to digraphs [7]. However, this latter approach has not yet been applied to graph discrimination in the way that the approach based on approximate von Neumann entropy [4] has. Simultaneously, we have recently developed a method
for selecting the best set of spectral features in order to classify Reeb graphs
(which summarize 3D shapes) [8]. Besides the spectral features, in the latter work we have evaluated the discriminability of three different real functions for building Reeb graphs: the geodesic distance, the distance from the barycenter, and the distance from the circumscribing sphere. Feature selection yields two intriguing conclusions:
a) heat flow complexity is not one of the most interesting features, and b) the
three latter real functions for building Reeb graphs have a similar relevance. The
first conclusion seems to discard the use of heat flow based complexity measures
for discrimination, at least in undirected graphs, whereas the second conclusion
points towards discarding also the analysis of the impact of the representations.
In this paper we show that these conclusions are misleading. To commence, when
directed graphs are considered, heat-flow complexity information is richer. Sec-
ondly, the three functions explored in [8] are far from being orthogonal (they
produce very similar graphs). Consequently herein we fuse both lines of research
in order to find the best performance achievable only with topological informa-
tion (without attributes). In Section 2 we present the catalog of real functions
we are going to explore. In Section 3 we highlight the main ingredients of heat
flow and thermodynamic depth in digraphs. Section 4 is devoted to analyzing the result of fusing the directed complexities of several Reeb graphs from the same 3D shape. Finally, in Section 5 we present our conclusions and future work.
The Reeb graph [9] is a well-known topological description that codes in a graph
the evolution of the isocontours of a real-valued, continuous function f : M → R
over a manifold M . In other words, it tracks the origin, the disappearance,
the union or the split of the isocontours as the co-domain of the function f is
spanned. The nodes of the Reeb graph correspond to the critical points of f
while the arcs are associated with the surface portions crossed when going from one critical point to another.
Several algorithms exist for the Reeb graph extraction from triangle meshes
[10]; in this paper we adopt a directed version of the Extended Reeb graph (ERG)
[11], which we name the diERG. The diERG differs from the ERG in terms of arc orientation. Similarly to the ERG, to build the diERG we sample the co-domain of f with a finite number of intervals, then we characterize the surface in terms of critical or regular areas and, finally, we track the evolution of the regions in the graph. Arcs are oriented according to the increasing value of the function. The diERG is then an acyclic, directed graph; a formal proof of this fact can be
found in [12]. Figure 1 shows the pipeline of the graph extraction.
Each function can be seen as a geometric property and a tool for coding
invariance in the description [13]. When dealing with shape retrieval, the function f has to be invariant to object rotation, translation and scaling. Among the large number of functions available in the literature, we consider:
– the distance from the barycentre B of the object, Bar(p) = d_E(p, B), p ∈ M, where d_E is the usual Euclidean distance (Figure 2-b; see the sketch after this list);
Fig. 1. Pipeline of the diERG extraction. (a) Surface partition and recognition of
critical areas; blue areas correspond to minima, red areas correspond to maxima, green
areas to saddle areas. (b) Expansion of critical areas to their nearest one. (c) The
oriented diERG.
– the distance from the main shape axis v, MSA(p) = d_E(p, v) (Figure 2-c);
– the function MSANorm(p) = ‖v × (p − B)‖, p ∈ S, where v is the same as above and B is the barycentre (Figure 2-d);
– the average of the geodesic distances defined in [14] (Figure 2-e);
– the first six (ranked with respect to the decreasing eigenvalues) non-constant eigenfunctions of the Laplace-Beltrami operator of the mesh computed according to [15], LAPL_i, i = 1, . . . , 6 (Figure 2(f-i));
– a mix of the first three eigenfunctions of the Laplace-Beltrami operator obtained according to the rule MIX_{i+j−2} = (LAPL_i)^2 − (LAPL_j)^2, i ∈ {1, 2}, j ∈ {2, 3}, i ≠ j (Figure 2(j-l)).
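As a concrete illustration of the simplest of these functions, the following Python sketch (our own; it assumes the main shape axis v is given as a unit vector and that the barycentre is the mean of the mesh vertices) evaluates Bar(p) and MSANorm(p) on all vertices of a mesh:

    import numpy as np

    def bar_and_msanorm(vertices, main_axis):
        """vertices: (N, 3) array of mesh vertex coordinates;
        main_axis: unit vector v of the main shape axis (assumed given)."""
        B = vertices.mean(axis=0)                        # barycentre of the object
        rel = vertices - B
        bar = np.linalg.norm(rel, axis=1)                # Bar(p) = d_E(p, B)
        msa_norm = np.linalg.norm(np.cross(main_axis, rel), axis=1)   # MSANorm(p) = ||v x (p - B)||
        return bar, msa_norm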
Fig. 2. The set of real functions in our framework. Colors represent the function, from
low (blue) to high (red) values.
A more compact definition of the flow is F_β(G) = A : K_β, where X : Z = Σ_{ij} X_{ij} Z_{ij} = trace(XZ^T) is the Frobenius inner product. While the instantaneous flow accounts for the heat flowing through the edges of the graph, it accounts neither for the heat remaining in the nodes nor for that in the transitivity links. The limiting cases are F_0 = 0 and F_{β_max} = (1/n) Σ_{i→j} A_{ij}, which reduces to |E|/n if G is unattributed (A_{ij} ∈ {0, 1} ∀ij). Defining F_β in terms of A instead of W, we retain the directed nature of the original graph G. The function derived from computing F_β(G) from β = 0 to β_max is the so-called directed heat flow trace. These traces satisfy the phase transition principle [6] (although the formal proof is out of the scope of this paper). In general, heat flow diffuses more slowly than in the undirected case, and phase transition points (PTPs) appear later. This is due to the constraints imposed by A.
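A rough sketch of computing such a trace is shown below. It is only illustrative: we stand in for the kernel K_β (whose exact definition is given earlier in the paper) with the matrix exponential of a simple out-degree Laplacian of the directed adjacency matrix, and sample F_β = A : K_β over a range of β values.

    import numpy as np
    from scipy.linalg import expm

    def directed_heat_flow_trace(A, betas):
        """A: (n, n) adjacency matrix of the digraph (possibly weighted).
        Returns F_beta = A : K_beta for each beta, with K_beta = exp(-beta * L)
        and L a simple out-degree Laplacian (a stand-in for the paper's kernel)."""
        D_out = np.diag(A.sum(axis=1))
        L = D_out - A                          # stand-in directed Laplacian (assumption)
        flows = []
        for beta in betas:
            K = expm(-beta * L)                # heat kernel at time beta
            flows.append(np.sum(A * K))        # Frobenius product A : K_beta
        return np.array(flows)                 # note: with zero diagonal, F_0 = trace(A) = 0

    # example: trace of a small directed cycle
    A = np.roll(np.eye(5), 1, axis=1)
    print(directed_heat_flow_trace(A, np.linspace(0.0, 5.0, 11)))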
4 Experiments
In order to compare our method with the technique proposed in [8] we use the same database, the SHREC version used in the Shape Retrieval Contest [23]. The database has 400 exemplars and 20 classes (20 exemplars per class). For each exemplar we apply the 13 real functions presented in Section 2 and then extract the corresponding Reeb digraphs. Each exemplar is thus characterized by 13 heat flow complexities (one per digraph). If we map these vectors (bags of complexities) via MDS we find that it is quite easy to discriminate glasses from pliers and fishes. However, it is very difficult to discriminate humans from chairs
Fig. 3. Experiments. Left: MDS for humans, chairs and armadillos. Right: PR curves.
(see Figure 3-left). The average behavior of these bags of complexities is given
by the precision recall (PR) curves. In Figure 3-right we show the PR curves
for Feature Selection [8] (CVIU’13), Thermodynamic Depth (TD) with the von
Neumann Entropy (here we use the W attributed graph induced by the Directed
Laplacian) and TD with directed heat flow. Our PR (heat flow) as well as the
one of Feature Selection reaches the average performance of attributed methods.
The 10-fold CV error for 15 classes reported by Feature Selection is 23.3%.
However, here we obtain a similar PR curve for the 20 class problem. Given that
Feature Selection relies on a complex offline process, the less computationally
demanding heat flow TD complexity for digraphs produces comparable results
(or better ones, if we consider that we are addressing the 20 class problem). In
addition, heat flow outperforms von Neumann entropy when embedded in TD.
The main contribution of this paper is the proposal of a method (fusion of digraph heat flow complexities) which has a discrimination power similar to that of Feature Selection (or even better, if we consider the whole 20-class problem) and outperforms von Neumann entropy. Future work includes the exploration of more sophisticated methods for fusing complexities and of more real functions.
References
[1] Torsello, A., Hancock, E.R.: Learning Shape-Classes Using a Mixture of Tree-
Unions. IEEE Tran. on Pattern Analysis and Mach. Intelligence 28(6), 954–967
(2006)
[2] Torsello, A., Lowe, D.L.: Supervised Learning of a Generative Model for Edge-
Weighted Graphs. In: Proc. of ICPR (2008)
[3] Passerini, F., Severini, S.: The von Neumann Entropy of Networks.
arXiv:0812.2597v1 (December 2008)
[4] Han, L., Escolano, F., Hancock, E.R., Wilson, R.: Graph Characterizations From
Von Neumann Entropy. Pattern Recognition Letters (2012) (in press)
[5] Lloyd, S., Pagels, H.: Complexity as Thermodynamic Depth. Ann. Phys. 188, 186
(1988)
[6] Escolano, F., Hancock, E.R., Lozano, M.A.: Heat Diffusion: Thermodynamic
Depth Complexity of Networks. Phys. Rev. E 85, 036206 (2012)
[7] Escolano, F., Bonev, B., Hancock, E.R.: Heat Flow-Thermodynamic Depth Com-
plexity in Directed Networks. In: Gimel’farb, G., Hancock, E., Imiya, A., Kuijper,
A., Kudo, M., Omachi, S., Windeatt, T., Yamada, K. (eds.) SSPR & SPR 2012.
LNCS, vol. 7626, pp. 190–198. Springer, Heidelberg (2012)
[8] Bonev, B., Escolano, F., Giorgi, D., Biasotti, S.: Information-theoretic Selection of
High-dimensional Spectral Features for Structural Recognition. Computer Vision
and Image Understanding 117(3), 214–228 (2013)
[9] Reeb, G.: Sur les points singuliers d’une forme de Pfaff complètement intégrable
ou d’une fonction numérique. Comptes Rendus Hebdomadaires des Séances de
l’Académie des Sciences 222, 847–849 (1946)
[10] Biasotti, S., Giorgi, D., Spagnuolo, M., Falcidieno, B.: Reeb graphs for shape
analysis and applications. Theoretical Computer Science 392(1-3), 5–22 (2008)
[11] Biasotti, S.: Topological coding of surfaces with boundary using Reeb graphs.
Computer Graphics and Geometry 7(3), 31–45 (2005)
[12] Biasotti, S.: Computational Topology Methods for Shape Modelling Applications.
PhD Thesis, Universitá degli Studi di Genova (May 2004)
[13] Biasotti, S., De Floriani, L., Falcidieno, B., Frosini, P., Giorgi, D., Landi, C., Pa-
paleo, L., Spagnuolo, M.: Describing shapes by geometrical-topological properties
of real functions. ACM Comput. Surv. 40(4), 1–87 (2008)
[14] Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology Matching for Fully
Automatic Similarity Estimation of 3D Shapes. In: Proc. of SIGGRAPH 2001,
pp. 203–212 (2001)
[15] Belkin, M., Sun, J., Wang, Y.: Discrete Laplace Operator for Meshed Surfaces.
In: Proc. Symposium on Computational Geometry, pp. 278–287 (2008)
[16] Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Efficient Computation of
Isometry-Invariant Distances Between Surfaces. SIAM J. Sci. Comput. 28(5),
1812–1836 (2006)
[17] Chung, F.: Laplacians and the Cheeger Inequality for Directed Graphs. Annals of
Combinatorics 9, 1–19 (2005)
[18] Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking:
Bringing Order to the Web (Technical Report). Stanford University (1998)
[19] Johns, J., Mahadevan, S.: Constructing Basis Functions from Directed Graphs for
Value Function Approximation. In: Proc. of ICML (2007)
[20] Zhou, D., Huang, J., Schölkopf, B.: Learning from Labeled and Unlabeled Data
on a Directed Graph. In: Proc. of ICML (2005)
[21] Nock, R., Nielsen, F.: Fitting the Smallest Enclosing Bregman Ball. In: Gama,
J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS
(LNAI), vol. 3720, pp. 649–656. Springer, Heidelberg (2005)
[22] Tsang, I.W., Kocsor, A., Kwok, J.T.: Simple Core Vector Machines with Enclosing
Balls. In: Proc. of ICML (2007)
[23] Giorgi, D., Biasotti, S., Paraboschi, L.: SHape Retrieval Contest: Watertight Mod-
els Track, http://watertight.ge.imati.cnr.it
Analysis of Wave Packet Signature of a Graph
1 Introduction
Graph clustering is one of the most common problems in areas where data are represented using graphs. Since graphs are non-vectorial, we require a method for characterizing graphs that can be used to embed a graph in a high-dimensional feature space for the purpose of clustering. Most of the commonly used methods for graph clustering are spectral methods, which are based on the eigensystem of the Laplacian matrix associated with the graph. For example, Xiao et al. [1] have used the heat kernel for graph characterization. Wilson et al. [2] have made use of graph spectra to construct a set of permutation-invariant features for the purpose of clustering graphs.
The discrete Laplacian defined over the vertices of a graph, however, cannot link most results in analysis to a graph-theoretic analogue. For example, the wave equation u_tt = Δu, defined with the discrete Laplacian, does not have finite speed of propagation. In [3,4], Friedman and Tillich develop a calculus on graphs which provides a strong connection between graph theory and analysis. Their work is based on the fact that graph theory involves two different volume measures, i.e., a “vertex-based” measure and an “edge-based” measure. This approach has many advantages. It allows the application of many results from analysis directly to the graph domain.
While the method of Friedman and Tillich leads to the definition of both a
divergence operator and a Laplacian (through the definition of both vertex and
edge Laplacian), it is not exhaustive in the sense that the edge-based eigen-
functions are not fully specified. In a recent study we have fully explored the
eigenfunctions of the edge-based Laplacian and developed a method for explicitly
calculating the edge-interior eigenfunctions of the edge-based Laplacian [5]. This
reveals a connection between the eigenfunctions of the edge-based Laplacian and
both the classical random walk and the backtrackless random walk on a graph.
As an application of the edge-based Laplacian, we have recently presented a new
approach to characterizing points on a non-rigid three-dimensional shape [6]. The wave equation provides a potentially richer characterisation of graphs than the heat equation. Initial work by ElGhawalby and Hancock [7] has revealed some of its potential uses. They have proposed a new approach for embedding graphs on pseudo-Riemannian manifolds based on the wave kernel. However, there are two problems with the rigorous solution of the wave equation: a) we need to compute the edge-based Laplacian, and b) the solution is more complex than
the heat equation. Recently we [8] have presented a solution of the edge-based
wave equation on a graph. In [9] we have used this solution to define a signature,
called the wave packet signature (WPS) of a graph. In this paper we extend the
idea of WPS to weighted graphs and experimentally demonstrate the properties
of WPS. We perform numerous experiments and demonstrate the performance
of the proposed methods on both weighted and un-weighted graphs.
2 Edge-Based Eigensystem
In this section we review the eigenvalues and eigenfunctions of the edge-based Laplacian [3,5]. Let G = (V, E) be a graph with a boundary ∂G. The geometric realization of G is the metric space consisting of the vertices V with a closed interval of length l_e associated with each edge e ∈ E. We associate an edge variable x_e with each edge that represents the standard coordinate on the edge, with x_e(u) = 0 and x_e(v) = 1. For our work, it will suffice to assume that the graph is finite with empty boundary (i.e., ∂G = ∅) and l_e = 1.
A graph coordinate X specifies an edge e and a value x of the standard coordinate on that edge. The wave equation associated with the edge-based Laplacian Δ_E is
∂²u/∂t² (X, t) = Δ_E u(X, t).
Let W(z) denote z wrapped to the range [−1/2, 1/2), i.e., W(z) = z − ⌊z + 1/2⌋. For the un-weighted graph, we solve the wave equation assuming that the initial condition is a Gaussian wave packet on a single edge of the graph [9]. The solution for this case becomes
u(X, t) = Σ_{ω∈Ω_a} (C(ω, e)C(ω, f)/2) [ e^{−aW(x+t+μ)²} cos( B(e, ω) + B(f, ω) + ω(x + t + μ + 1/2) )
        + e^{−aW(x−t−μ)²} cos( B(e, ω) − B(f, ω) + ω(x − t − μ + 1/2) ) ]
  + (1/(2|E|)) [ (1/4) e^{−aW(x+t+μ)²} + (1/4) e^{−aW(x−t−μ)²} ]
  + Σ_{ω∈Ω_c} (C(ω, e)C(ω, f)/4) ( e^{−aW(x−t−μ)²} − e^{−aW(x+t+μ)²} )
  + Σ_{ω∈Ω_c} (C(ω, e)C(ω, f)/4) ( (−1)^{⌊x−t−μ+1/2⌋} e^{−aW(x−t−μ)²} − (−1)^{⌊x+t+μ+1/2⌋} e^{−aW(x+t+μ)²} ).
To define a signature for both weighted and un-weighted graphs, we use the amplitudes of the waves on the edges of the graph over time. For un-weighted graphs, we assume that the initial condition is a Gaussian wave packet on a single edge of the graph. For this purpose we select the edge (u, v) ∈ E such that u is the highest-degree vertex in the graph and v is the highest-degree vertex among the neighbours of u. For weighted graphs, we assume a wave packet on every edge whose amplitude is multiplied by the weight of the edge. We define the local signature of an edge as
WPS(X) = [u(X, t_0), u(X, t_1), u(X, t_2), ..., u(X, t_n)].
Given a graph G, we define its global wave packet signature as
GWPS(G) = hist( WPS(X_1), WPS(X_2), ..., WPS(X_{|E|}) )   (1)
where hist(·) is the histogram operator which bins the list of arguments WPS(X_1), WPS(X_2), ..., WPS(X_{|E|}).
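The two-stage construction — per-edge signatures followed by a global histogram — can be sketched as follows (our own illustration; wave_amplitude stands in for the closed-form solution u(X, t) above, and the sampling times, bin count and normalisation are placeholder choices):

    import numpy as np

    def wps(edge_coord, times, wave_amplitude):
        """Local signature of one graph coordinate: amplitudes u(X, t) sampled
        at times t_0, ..., t_n (wave_amplitude implements the solution above)."""
        return np.array([wave_amplitude(edge_coord, t) for t in times])

    def gwps(edge_coords, times, wave_amplitude, n_bins=100):
        """Global wave packet signature: histogram over all local signatures."""
        samples = np.concatenate([wps(x, times, wave_amplitude) for x in edge_coords])
        hist, _ = np.histogram(samples, bins=n_bins)
        return hist / hist.sum()     # normalised histogram used as feature vector (our choice)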
4 Experiments
We compute the wave signature for an edge by taking t_max = 100 and x_e = 0.5. We take t = 20 to allow the wave packet to be distributed over
the whole graph. We then compute the GWPS for the graph by fixing 100 bins
for histogram. To visualize the results, we have performed principal component
analysis (PCA) on GWPS. PCA is mathematically defined [12] as an orthogo-
nal linear transformation that transforms the data to a new coordinate system
such that the greatest variance by any projection of the data comes to lie on
the first coordinate (called the first principal component), the second greatest
variance on the second coordinate, and so on. Figure 2(a) shows the results of
the embedding of the feature vectors on the first three principal components.
To measure the performance of the proposed method we compare it with
truncated Laplacian, random walk [13] and Ihara coefficients [14]. Figure 2 shows
Method DT GG RNG
Wave Kernel Signature 0.9965 0.9511 0.8235
Random Walk Kernel 0.9526 0.9115 0.8197
Ihara Coefficients 0.9864 0.8574 0.7541
compare the performance we have computed the rand indices for both methods.
The Rand index for WPS is 0.9931, while for the truncated Laplacian it is 0.8855. Finally, we look at the characteristics of the proposed WPS. The histogram distribution of the WPS closely follows a Gaussian distribution. Figure 5 shows
the distribution of the WPS for a single view of 3 different objects in the COIL dataset and a Gaussian fit for each signature. Figures 6(a) and 6(b) show the standard deviation values for DT and GG, respectively, over all 72 views of 4 different objects of the COIL dataset. Table 2 shows the mean value of the standard deviation and the standard error for each of the 4 objects.
Fig. 6. Standard deviation. (a) DT. (b) GG.
References
1. Xiao, B., Yu, H., Hancock, E.R.: Graph matching using manifold embedding. In:
Campilho, A.C., Kamel, M.S. (eds.) ICIAR 2004. LNCS, vol. 3211, pp. 352–359.
Springer, Heidelberg (2004)
2. Wilson, R.C., Hancock, E.R., Luo, B.: Pattern vectors from algebraic graph theory.
IEEE Trans. Pattern Anal. Mach. Intell. 27, 1112–1124 (2005)
3. Friedman, J., Tillich, J.P.: Wave equations for graphs and the edge based laplacian.
Pacific Journal of Mathematics, 229–266 (2004)
4. Friedman, J., Tillich, J.P.: Calculus on graphs. CoRR (2004)
5. Wilson, R.C., Aziz, F., Hancock, E.R.: Eigenfunctions of the edge-based laplacian
on a graph. Linear Algebra and its Applications 438, 4183–4189 (2013)
6. Aziz, F., Wilson, R.C., Hancock, E.R.: Shape signature using the edge-based lapla-
cian. In: International Conference on Pattern Recognition (2012)
7. ElGhawalby, H., Hancock, E.R.: Graph embedding using an edge-based wave ker-
nel. In: Hancock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F. (eds.)
SSPR & SPR 2010. LNCS, vol. 6218, pp. 60–69. Springer, Heidelberg (2010)
8. Aziz, F., Wilson, R.C., Hancock, E.R.: Gaussian wave packet on a graph. In:
Kropatsch, W.G., Artner, N.M., Haxhimusa, Y., Jiang, X. (eds.) GbRPR 2013.
LNCS, vol. 7877, pp. 224–233. Springer, Heidelberg (2013)
9. Aziz, F., Wilson, R.C., Hancock, E.R.: Graph characterization using gaussian
wave packet signature. In: Hancock, E., Pelillo, M. (eds.) SIMBAD 2013. LNCS,
vol. 7953, pp. 176–189. Springer, Heidelberg (2013)
10. Murase, H., Nayar, S.K.: Visual learning and recognition of 3-D objects from ap-
pearance. International Journal of Computer Vision 14, 5–24 (1995)
11. Harris, C., Stephens, M.: A combined corner and edge detector. In: Fourth Alvey
Vision Conference, Manchester, UK, pp. 147–151 (1988)
12. Jolliffe, I.T.: Principal component analysis. Springer, New York (1986)
13. Gärtner, T., Flach, P.A., Wrobel, S.: On graph kernels: Hardness results and ef-
ficient alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003.
LNCS (LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
14. Ren, P., Wilson, R.C., Hancock, E.R.: Graph characterization via Ihara coefficients.
IEEE Tran. on Neural Networks 22, 233–245 (2011)
15. MacQueen, J.B.: Some methods for classification and analysis of multivariate ob-
servations, vol. 1, pp. 281–297. University of California Press (1967)
16. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal
of the American Statistical Association 66, 846–850 (1971)
Hearing versus Seeing Identical Twins
School of Computing
National University of Singapore
Singapore, 117417
{lizhang,shenggao,tsim,leowwk}@comp.nus.edu.sg,[email protected]
1 Introduction
According to the statistics in [1], the twin birth rate has risen from 17.8 to 32.2 per 1000 births, with an average 3% growth per year, since 1990. This increase is associated with the increasing usage of fertility therapies and changing birth patterns. Nowadays women tend to bear children at an older age and are more likely than younger women to conceive multiples spontaneously, especially in developed countries [2]. Although identical twins currently represent only a minority (0.2% of the world’s population), it is worth noting that the total number of identical twins is equal to the whole population of countries like Portugal or Greece. This, in turn, has created an urgent demand for biometric systems that can accurately distinguish between identical twins. Identical twins share the same genetic code and therefore look very alike. This poses a great challenge to current biometric systems, especially face recognition systems. The challenge of using facial appearance to distinguish between identical twins has been verified by Sun et al. [2] on 93 pairs of twins using a commercial face matcher. Nevertheless,
some biometrics depend not only on the genetic signature but also on the individual development in the womb. Some researchers have explored the possibility of using behavioral differences, such as expressions and head motion [3], to distinguish between identical twins. Zhang et al. [3] proposed to use an exception reporting model to model head motion abnormality in order to differentiate twins. They reported a verification accuracy of over 90%, but their algorithm was very sensitive to the consistency of subject behavior and relied strongly on an accurate tracking algorithm. Several researchers showed encouraging results by using fingerprint [4,2], palmprint [5], ear [6] and iris [7,2] to distinguish between identical twins. For example, the equal error rate for 4-finger fusion reported by Sun et al. [2] was 0.49, and the equal error rate for 2-iris fusion was also 0.49. Despite the discriminating ability of those biometrics, they require the cooperation of the subject. Therefore, it is desirable to identify twins in a natural way. In this paper, we propose to utilize the voice biometric to distinguish between identical twins and compare the voice biometric with facial appearance. Voice is non-intrusive and natural: it does not require explicit cooperation of the subject and is widely available from videos captured by ordinary camcorders. To the best of our knowledge, we are the first to investigate voice and appearance biometrics at the same time.
The voice signal usually conveys several levels of information. Primarily, it conveys the words or message being spoken, but on a secondary level it also conveys information about the identity of the speaker [8]. The voice biometric tries to extract this identity information from the voice and uses it for speaker recognition. Generally speaking, speaker recognition can be divided into two specific tasks: speaker verification and speaker identification. In speaker verification, the
goal is to establish whether a person is who he/she claims to be; whereas in
speaker identification, the goal is to determine the identity (name or employee
number) of the unknown speaker. In either task the speech can be further di-
vided into text-dependent (i.e., the speaker is required to say the same phrase) and text-independent (i.e., the speaker can say different phrases) settings. Reynolds et al. [8] and Sinith et al. [9] proposed to use Mel Frequency Cepstral Coefficients and Gaussian Mixture Models to solve the text-independent identification problem for the general population, i.e., non-twins. Dupont et al. [10] and Dean et al. [11] tried to use hidden Markov models to model the distribution of the speaker's spectral shape from voice samples and claimed the identity using the maximum likelihood of the posterior probabilities belonging to different classes. Both these works demonstrated that the identity of a speaker can be recognized well from their voice under the condition that the voice samples are of good quality and the gallery size is small, i.e., the number of subjects is small. This conclusion, in turn, brings new hope for using the voice biometric to differentiate identical twins, because to distinguish between identical twins the number of involved subjects is very small, i.e., the number of twin siblings. In this paper, we are trying to answer the following questions:
(Figure: speaker verification pipeline — speech signal → frames → preprocessing → feature extraction → feature vectors → speaker verification.)
The number of MFCC coefficients is set to 13 and the predictor order (i.e., the number of LPC coefficients) is set to 8.
p(x) = Σ_{i=1}^{M} w_i b_i(x)   (1)
where x is the D-dimensional feature vector (in our case, Pitch, LPC and MFCC), b_i(x) is the component density and w_i is the mixture weight. Each component density is represented as a Gaussian distribution of the form
b_i(x) = (1 / ((2π)^{D/2} |Δ_i|^{1/2})) exp{ −(1/2) (x − μ_i)^T Δ_i^{−1} (x − μ_i) }   (2)
with mean vector μ_i and covariance matrix Δ_i. The mixture weights w_i sum to 1. For convenience, we denote the mean vectors, covariance matrices and mixture weights collectively as Γ = {w_i, μ_i, Δ_i}, i = 1, ..., M. Therefore, each speaker is represented by his/her model Γ.
Given the training data in the gallery, we use the Expectation Maximization algorithm [18] to estimate Γ for each subject. In the verification phase, given a test feature vector ψ and the hypothesized speaker S, we aim to check whether the hypothesized identity is the same as the classified identity. We state this task as a basic hypothesis test between two hypotheses:
H0: ψ is from the hypothesized twin speaker S.
H1: ψ is not from the hypothesized speaker S (i.e., ψ is from the twin sibling of the hypothesized speaker S).
The optimum classification to decide between these two hypotheses is through the likelihood ratio (LR) given by
LR = p(ψ|H0) / p(ψ|H1)   (3)
If LR > θ, we accept H0; otherwise, we reject H0. Here, θ is the threshold, p(ψ|H0) is the probability density for the hypothesized subject S given the observed feature vector ψ, and p(ψ|H1) is the probability density for not being the hypothesized subject S given the observed feature vector ψ.
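Using an off-the-shelf GMM implementation, enrolment and the likelihood-ratio test of Eq. (3) can be sketched as follows (the diagonal covariance and small component counts match the experimental settings reported below; the use of scikit-learn, the log-domain comparison and all names are our own choices):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def enroll(features, n_components=5):
        """Fit a diagonal-covariance GMM (the speaker model Gamma) to the
        training feature vectors of one speaker via the EM algorithm."""
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(features)
        return gmm

    def verify(probe, model_claimed, model_twin, log_threshold=0.0):
        """Likelihood-ratio test of Eq. (3), performed in the log domain:
        H0 = claimed speaker, H1 = his/her twin sibling."""
        log_lr = model_claimed.score(probe) - model_twin.score(probe)  # average log-likelihood difference
        return log_lr > log_threshold       # accept H0 if the (log) likelihood ratio exceeds the threshold

    # usage sketch: models = {s: enroll(train_feats[s]) for s in subjects}
    #               accept = verify(test_feats, models["twin_A"], models["twin_B"])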
3 Experiments
3.1 Data and Performance Evaluation
We collected a twins audio-visual database at the Sixth Mojiang International
Twins Festival held on 1 May 2010 in China. It includes Chinese, Canadian and
Hearing versus Seeing Identical Twins 141
Russian subjects for a total of 39 pairs of twins. Several examples can be seen
in Figure 2. For each subject, there are at least three audio recordings, each
around 30 seconds. The spoken content of those recordings differs. For the first recording, the subjects are required to count from one to ten; for the second recording, the subjects read a paragraph; for the third recording, the subjects recite a poem.
The twin verification performance is evaluated in terms of the Twin Equal Error Rate (Twin-EER), at which the Twin False Accept Rate (Twin-FAR) equals the False Reject Rate (FRR). The Twin-FAR is the ratio of the number of times a twin impostor is accepted as genuine to the total number of impostor attempts. The FRR is the ratio of the number of times a genuine user is rejected as an impostor to the total number of genuine attempts. We also introduce the General Equal Error Rate (General-EER), at which the General False Accept Rate (General-FAR) equals the FRR. The General-FAR is the ratio of the number of times a general (non-twin) impostor is accepted as genuine to the total number of non-twin impostor attempts. The purpose of introducing the General-FAR is to compare the verification accuracy on twins with that on non-twins, to expose the challenge posed by twins.
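Given genuine and twin-impostor similarity scores, the equal error rate can be found by sweeping a decision threshold, as in the following sketch (our own utility, not code from the paper):

    import numpy as np

    def equal_error_rate(genuine_scores, impostor_scores):
        """Sweep the threshold over all observed scores and return the point where
        the false accept rate (impostors accepted) and the false reject rate
        (genuine users rejected) are closest."""
        genuine_scores = np.asarray(genuine_scores)
        impostor_scores = np.asarray(impostor_scores)
        thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
        best_gap, eer = np.inf, None
        for t in thresholds:
            far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
            frr = np.mean(genuine_scores < t)     # genuine users wrongly rejected
            if abs(far - frr) < best_gap:
                best_gap, eer = abs(far - frr), (far + frr) / 2.0
        return eer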
(Fig. 3. FAR versus FRR curves: (a) facial appearance features, (b) voice features.)
used as probe. Each recording is divided into three parts, and each part acts as a single probe. During GMM training, the covariance matrix is assumed to be diagonal and the number of Gaussians is set to 4 for Pitch, 4 for LPC and 5 for MFCC. The number of Gaussians is optimized on the test set for better performance. The experimental result is shown in Figure 3(b). Compared with Figure 3(a), it can be clearly seen that twins can be better distinguished via voice than via appearance. The Twin-EER for MFCC is 0.171, which is significantly better than appearance (the best for appearance is 0.338). However, not all voice features are better than appearance. The Twin-EERs of Pitch and LPC (0.394 for Pitch and 0.366 for LPC) are even larger than those of the appearance-based approach. This shows that Pitch and LPC are not discriminating enough for twins.
Moreover, based on the experimental results in [10], the General-EER for speaker verification on the general population is around 0.05, which is much smaller than the best (0.171) on the twins database. The difference may come from three aspects: 1) insufficient training data in our experiments; in our case, we only use one audio recording of around 30 seconds for training, and the spoken content is very simple and sometimes duplicated, so it may not cover the entire voice spectral pattern; 2) the voice spectral patterns of identical twins may overlap, since identical twins share the same genetic code and their voices may therefore share some similarity; 3) our audio recordings were not collected in a very clean environment, and the environmental noise may also degrade our performance. The General-EER reported in [10] was obtained in a clean recording room.
(Fig. 4. FAR versus FRR curves. (a) Fusion of Gabor and MFCC. (b) Zoom-in result for Figure 4(a).)
in our previous experiment. In multimodal systems, there are three levels of fusion when combining two biometrics. The first is fusion at the feature extraction level: the features of each biometric modality are combined into a new feature. The second is fusion at the confidence level: each biometric provides a similarity score, and these scores are combined to assess the validity of the claimed identity. The third is fusion at the decision level: each biometric makes one decision and the final decision is made based on those decisions.
In our proposal, we use the second fusion strategy. Given a probe and a claimed identity, we compute the Euclidean distance of the Gabor features, denoted GD, and the likelihood ratio against the claimed identity, denoted LR in Eq. (3), separately. The final similarity, FS, is computed as the weighted sum of GD and LR, i.e., FS = αGD + (1 − α)LR. Then, we compare FS against the pre-set threshold θ. If FS > θ, we accept; otherwise we reject. We conducted the experiments on the whole database, and the performance is shown in Figure 4. From this figure, we can see that when α is set to 0.415, fusion of Gabor and MFCC decreases the Twin-EER from 0.171 (MFCC) to 0.160. We set α for the best test performance on our dataset.
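The confidence-level fusion therefore reduces to a single weighted sum per claim; a minimal sketch (our own, with any score normalisation of GD and LR left implicit as in the text) is:

    def fused_score(gabor_distance, likelihood_ratio, alpha=0.415):
        """FS = alpha * GD + (1 - alpha) * LR; alpha = 0.415 is the value
        reported to give the best fusion result on this dataset."""
        return alpha * gabor_distance + (1.0 - alpha) * likelihood_ratio

    def fused_decision(gabor_distance, likelihood_ratio, threshold, alpha=0.415):
        # accept the claimed identity if the fused similarity exceeds the pre-set threshold
        return fused_score(gabor_distance, likelihood_ratio, alpha) > threshold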
In this work, we collected a moderate-sized identical twins database including appearance and voice. We propose to use a Gaussian Mixture Model to model the voice spectral pattern for verification. The results verify that the voice biometric can be used to distinguish between identical twins and that it is significantly better than traditional facial appearance features, including EigenFace, LBP and Gabor. Among the various voice features, MFCC has the most discriminating ability. We further show that the accuracy can be improved via fusion of the voice biometric and facial appearance.
In future work, we would like to test the robustness of our voice proposal with respect to the length of the training data and environmental noise. Even though our current result is very promising, we still hope to collect a larger twin database for our research. We also intend to test the scalability of our voice proposal. Finally, we look forward to building a multimodal biometric system which not only works well for the general population but also can prevent the evil twin attack.
References
1. Martin, J., Kung, H., Mathews, T., Hoyert, D., Strobino, D., Guyer, B., Sutton,
S.: Annual summary of vital statistics: 2006. Pediatrics (2008)
2. Sun, Z., Paulino, A., Feng, J., Chai, Z., Tan, T., Jain, A.: A study of multibiometric
traits of identical twins. SPIE (2010)
3. Zhang, L., Ye, N., Marroquin, E.M., Guo, D., Sim, T.: New hope for recognizing
twins by using facial motion. In: WACV, pp. 209–214. IEEE (2012)
4. Jain, A., Prabhakar, S., Pankanti, S.: On the similarity of identical twin finger-
prints. Pattern Recognition, 2653–2663 (2002)
5. Kong, A., Zhang, D., Lu, G.: A study of identical twins’ palmprints for personal
verification. Pattern Recognition, 2149–2156 (2006)
6. Nejati, H., Zhang, L., Sim, T., Martinez-Marroquin, E., Dong, G.: Wonder ears:
Identification of identical twins from ear images. In: ICPR, pp. 1201–1204 (2012)
7. Daugman, J., Downing, C.: Epigenetic randomness, complexity and singularity of
human iris patterns. Proceedings of the Royal Society of London, 1737 (2001)
8. Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification us-
ing gaussian mixture speaker models. IEEE Transactions on Speech and Audio
Processing 3(1), 72–83 (1995)
9. Sinith, M., Salim, A., Gowri Sankar, K., Sandeep Narayanan, K., Soman, V.: A
novel method for text-independent speaker identification using mfcc and gmm. In:
ICALIP, pp. 292–296. IEEE (2010)
10. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recog-
nition. IEEE Transactions on Multimedia 2(3), 141–151 (2000)
11. Dean, D., Sridharan, S., Wark, T.: Audio-visual speaker verification using contin-
uous fused hmms. In: Proceedings of the HCSNet workshop, pp. 87–92 (2006)
12. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: CVPR,
pp. 586–591. IEEE (1991)
13. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns.
In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481.
Springer, Heidelberg (2004)
14. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher
linear discriminant model for face recognition. IEEE Transactions on Image pro-
cessing 11(4), 467–476 (2002)
15. Zatorre, R.J., Evans, A.C., Meyer, E., Gjedde, A.: Lateralization of phonetic and
pitch discrimination in speech processing. Science 256(5058), 846–849 (1992)
16. Atal, B.S., Hanauer, S.L.: Speech analysis and synthesis by linear prediction of the
speech wave. The Journal of the Acoustical Society of America 50, 637 (1971)
17. Logan, B., et al.: Mel frequency cepstral coefficients for music modeling. In: Inter-
national Symposium on Music Information Retrieval, vol. 28, p. 5 (2000)
18. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the em algorithm. Journal of the Royal Statistical Society, 1–38 (1977)
19. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape
model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS,
vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
Voting Strategies
for Anatomical Landmark Localization
Using the Implicit Shape Model
1 Introduction
The localization of anatomical landmarks (e.g. hands or hip center) is an essential
preprocessing step in many action recognition approaches. Sliding window-based
person detectors (Dollar et al. [3] provides a survey) are often used for the
initial person detection step. Local feature based methods (Leibe et al. [6]) have
a clear advantage over sliding window based methods in cases where persons
are occluded. For both person detection approaches there are corresponding
anatomical landmark detection approaches. Bourdev and Malik [2] trained SVM
classifiers for body part classification using sets of example image patches that
correspond to a similar 2D (or 3D) pose (called ’poselets’). A multiscale sliding-
window is run over the image and each window is classified by the SVMs, which
is computationally demanding. Müller and Arens [7] reuse local features from an
ISM person detection step to vote for landmark locations, which is more efficient.
2 Voting Strategies
Original Implicit Shape Model Voting (ORIG-VOT). The basic idea be-
hind the Implicit Shape Model (ISM) (Leibe et al. [6]) is to learn the spatial
relationship between local features and an object using training data. For a new
image, local features are used to vote for possible object locations according
to the learned spatial relationship. More formally, an ISM I = (C, P) consists
of a set C (codebook) of prototypical image structures wi (visual words, code-
words) together with a set P of 3D probability distributions P = {P1 , ..., P|C| }.
Pi are 3D distributions, which specify where a visual word wi is typically lo-
cated on the object and at which feature scale. Leibe et al. [6] represent these
probability distributions in a non-parametric manner by collecting a set Oi =
{oj = (Δxj , Δyj , sj ) : j = 1, ..., Qi } of sample observation vectors oj that en-
code where (Δxj , Δyj ) the object center was observed relative to a local feature
(matching to word wi ) and at which scale sj the feature appeared. In Lehmann
et al. [5] this non-parametric representation of observation vectors is replaced
by Gaussian Mixture Models. In the object detection phase, a set of local fea-
tures F = {fk = (fx , fy , fs , d, wi ) : 1 ≤ k ≤ K} is computed, where fk
is a local feature detected at keypoint location (fx , fy ) at scale fs with corre-
sponding descriptor vector d which matches best to word wi . According to the
list of previously learned observation vectors Oi this feature now casts a vote
v = (v_r, v_x, v_y, v_s) according to each previously stored observation vector o_j, where the vote location, scale, and weight v_r are computed by:
v_r = (1/|O_i|) P(w_i|d),   v_x = f_x + (f_s/s_j) Δx_j,   v_y = f_y + (f_s/s_j) Δy_j,   v_s = f_s/s_j   (1)
where P(w_i|d) is the probability that descriptor vector d matches word w_i. With v_3 = (v_x, v_y, v_s) we denote the 3D vote space location, v_2 = (v_x, v_y) the corresponding 2D image vote location, and V denotes the set of all votes cast by all features. Object instances are then detected by identifying clusters of high vote density in the 3D vote space using a Mean-Shift search.
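In code, casting votes according to Eq. (1) for a set of matched features could look as follows (data layout and names are our own assumptions; the Mean-Shift mode search over the 3D vote space is omitted):

    import numpy as np

    def cast_votes(features, observations, word_posteriors):
        """features: list of (fx, fy, fs, word_id); observations[word_id]: list of
        training vectors (dx, dy, s_j) for that visual word; word_posteriors[k]:
        P(w_i | d) for the k-th feature's best-matching word."""
        votes = []                                    # each vote: (weight, vx, vy, vs)
        for (fx, fy, fs, w), p_word in zip(features, word_posteriors):
            obs = observations[w]
            for (dx, dy, sj) in obs:
                scale = fs / sj
                votes.append((p_word / len(obs),      # v_r = P(w_i|d) / |O_i|
                              fx + scale * dx,        # v_x
                              fy + scale * dy,        # v_y
                              scale))                 # v_s
        return np.array(votes)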
Reference-Point Voting (RP-VOT). For human pose estimation, we first
have to detect persons in the image. The idea of RP-VOT is to exploit knowledge
about the person detection location in the landmark localization process. The
motivation goes back to the observation that many visual words can appear at
very different locations on the human body. For this, it is helpful to include the
knowledge about the location where this image structure is observed relative to a
reference point (here: person center). For RP-VOT we modify the original voting
procedure such that votes are only casted, if the word appears in the detection
phase at a similar location relative to the reference point, as in training. We need
a description of the location of the word relative to the reference point which is
independent of the person’s appearance size in the image. More formally, we augment the observation vectors o_j = (Δx_j, Δy_j, s_j, h_1) such that we also record at which size h_1 (in pixels) we observed the person during training. The word location relative to the object center can then be represented in person height units as a = (−Δx_j/h_1, −Δy_j/h_1) and can be compared with the word location b = ((f_x − R_x)/h_2, (f_y − R_y)/h_2) relative to the reference point (R_x, R_y) during testing, where we estimate the person’s height to be h_2. For each feature f and observation vector o_j we then cast a vote only if their location difference is below some threshold, i.e. ‖a − b‖_2 < θ. Here we use θ = 0.05, which means that we use the observation vector only if the word’s location distance between
training and testing is below 5% of the person’s height. For the person size estimate we experimented with two approaches: (i) estimating the size from the person’s bounding box height, which is a plausible estimate if the person is standing upright, and (ii) estimating the size from the set of local features within the person’s bounding box, which is a better choice if we expect non-upright poses, e.g. bending down, or if the person is partially occluded. For estimating the person height from local features we used an ISM as well, where each local feature casts votes for the person height (a 1D vote space) and the final height estimate is found by applying a 1D Mean Shift with a 1D Gaussian kernel.
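The reference-point strategy only adds a relative-location check before each vote is cast; a sketch building on the voting routine above (again with our own names and data layout) is:

    import numpy as np

    def cast_votes_rp(features, observations, word_posteriors,
                      ref_point, person_height, theta=0.05):
        """Reference-point voting: an observation (dx, dy, s_j, h1) recorded at
        training person height h1 is used only if the word's location relative to
        the reference point (person centre), in person-height units, matches the
        training-time location to within theta."""
        rx, ry = ref_point
        votes = []
        for (fx, fy, fs, w), p_word in zip(features, word_posteriors):
            obs = observations[w]
            for (dx, dy, sj, h1) in obs:
                a = np.array([-dx / h1, -dy / h1])                 # word rel. to centre (training)
                b = np.array([(fx - rx) / person_height,
                              (fy - ry) / person_height])          # word rel. to centre (testing)
                if np.linalg.norm(a - b) < theta:                  # vote only for consistent locations
                    scale = fs / sj
                    votes.append((p_word / len(obs), fx + scale * dx, fy + scale * dy, scale))
        return np.array(votes)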
Heuristic Voting (H-VOT). In the training phase of ORIG-VOT for each
word wi – which has its keypoint location on the person’s segmentation mask
– we store an observation vector with its location relative to the person center
in Oi . In [6] the person’s segmentation mask is automatically retrieved using
a motion segmentation by the Grimson-Stauffer background model. For land-
mark localization we could also allow each word that appears on the person’s
segmentation mask to store an observation vector for each landmark. Though
this means that e.g. a word that appears during training on the feet would store
an observation vector for all other landmarks (including the left/right hand, the
head, etc.), not only the left/right foot. Therefore, we propose to use a simple
but effective heuristic, which is to exploit the information of whether the landmark location in which we are interested was within the descriptor region of the feature during training. A vote is generated only if this is the case. This filters for image structures that most probably contain information about the location of the landmark. It is not necessary to augment the observation vectors with this information, since they already contain it: the landmark location is within the descriptor region of the word during training if √(Δx_j² + Δy_j²) < s_j.
Dataset. Aa et al. [1] recently published the new UMPM benchmark¹ which allows for quantitative evaluations of 2D and 3D human pose estimation algorithms. It consists of 30 subjects and a total of approx. 400,000 frames. The dataset is provided with extrinsic and intrinsic camera parameters for all of the four cameras. This allows the motion capture data to be projected into the image to yield ground truth landmark locations which can be used to train the landmark
¹ http://www.projects.science.uu.nl/umpm/
Table 1. Results for evaluation measures α and β for all landmark localization exper-
iments. [x] specifies the number of persons used for training, or testing respectively.
Ø specifies the average evaluation measure value for each voting strategy, where we
averaged over all experiments 1a-8b.
Exp #   Train videos   Test video   α: ORIG RP H OW COMBI   β: ORIG RP H OW COMBI
1a [2] p2-chair-2 [1] p1 chair 2 9.8 25.5 18.2 12.9 44.0 44.7 26.3 31.9 35.2 16.4
1b [2] p2-chair-2 [1] p1 chair 2 8.4 24.4 16.7 11.7 43.5 47.1 26.9 33.5 37.0 16.0
2a [2] p2-chair-1 [1] p1 chair 2 9.3 21.8 17.8 11.5 37.9 44.1 27.4 30.2 37.3 18.5
2b [2] p2-chair-1 [1] p1 chair 2 8.6 22.3 17.8 10.9 38.8 46.3 27.1 31.4 39.0 18.0
3a [4] p2/p3-chair-1 [1] p1 chair 2 8.7 20.9 17.0 11.0 36.7 45.5 28.5 31.1 37.0 17.9
3b [4] p2/p3-chair-1 [1] p1 chair 2 7.7 21.4 15.9 10.2 37.5 48.3 28.2 33.0 38.9 17.5
4a [2] p2-grab-1 [1] p1 grab 3 12.0 29.7 21.0 16.4 49.4 38.5 20.1 26.8 30.5 12.1
4b [2] p2-grab-1 [1] p1 grab 3 10.9 30.5 20.2 14.3 50.3 41.5 19.7 28.9 34.1 12.0
5a [4] p2/p3-grab-1 [1] p1 grab 3 11.1 29.1 19.8 14.9 48.7 39.9 20.4 27.9 31.5 12.2
5b [4] p2/p3-grab-1 [1] p1 grab 3 10.0 30.0 18.2 13.3 49.3 42.7 19.8 30.2 34.5 12.1
6a [2] p3-ball-2 [2] p2 ball 1 9.9 22.4 17.4 13.7 39.3 44.3 26.4 30.5 34.2 15.6
6b [2] p3-ball-2 [2] p2 ball 1 8.4 22.5 15.0 11.5 39.0 47.5 26.1 33.5 37.0 15.5
7a [2] p3-free-1 [2] p2 free 1 8.8 22.1 17.5 11.6 38.2 43.7 25.1 30.6 33.6 15.5
7b [2] p3-free-1 [2] p2 free 1 7.7 22.1 14.9 10.2 38.5 45.6 25.0 32.6 35.8 15.5
8a [4] p3-free-1/11 [2] p2 free 1 9.5 23.5 18.8 12.4 41.6 42.3 24.5 29.4 32.9 14.8
8b [4] p3-free-1/11 [2] p2 free 1 8.2 23.1 16.1 10.8 41.4 44.8 24.7 31.4 35.3 14.9
Ø 9.3 24.5 17.6 12.3 42.1 44.2 24.8 30.8 35.2 15.3
Evaluation Measures. For a good part localization we want the vote dis-
tribution to be compact, uni-modal, centered on the ground truth landmark,
and monotonically decreasing to the periphery. We use three different evalua-
tion measures (α, β, γ) to assess to which degree this is fulfilled by the different
voting strategies:
α = (1/W) Σ_{v∈V} v_r δ_r(‖v_2 − t‖_2),  with δ_r(x) = 1 if x ≤ r and 0 if x > r   (2)
β = (1/W) Σ_{v∈V} v_r ‖v_2 − t‖_2 / h   (3)
γ(d) = (1/|X_d|) Σ_{(d,ρ(l̃))∈X_d} ρ(l̃),  with ρ(l̃) = (1/(W λ(s))) Σ_{v∈V} v_r K(‖v_3 − l̃‖_2 / λ(s))   (4)
α measures the ratio of correct to total votes cast, weighted by the corresponding
vote weights. All votes within a circle of radius r around the ground-truth
location are considered correct. Here we use r = 0.1h, where h is the
person height measured in pixels. h can be estimated from the stick-figure ground
truth 2D pose by h = (L_l + L_r)/2 + S + N, where L_l and L_r are the lengths of the
legs, S is the length of the spine and N is the length from the neck to the head.
t is the ground-truth 2D landmark location, and $W = \sum_{v\in V} v_r$ is the sum of all
vote weights. β measures the mean distance of the votes to the true landmark
location, again weighted by the vote weights, such that the distance (to the true
location) of a vote with a large weight has a higher impact than the distance of a
vote with a small weight. The distance is computed in relative person-height units
by dividing by h. γ measures the average vote density as a function of the
distance to the true marker location. For this we sample 3D vote space locations
l̃ = (x̃, ỹ, s̃) on a regular 3D grid and compute the density ρ of the votes at
these locations using a (weighted) kernel density estimator with scale-dependent
bandwidth λ(s). For each vote density sample location l̃, we then compute the
distance d of the corresponding 2D vote location v2 to the true marker location
t in person-height units, i.e. $d = \|v_2 - t\|_2 / h$, and add a new sample x = (d, ρ(l̃))
of this distance and vote density to a histogram of 100 discretized distance bins
Xd (d = 0.01n, 0 ≤ n ≤ 100). Fig. 1 shows the average vote density γ(d) of all
votes within such a bin Xd as a function of the distance d.
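As an illustration of how the first two measures are computed, the sketch below evaluates α (Eq. 2) and β (Eq. 3) for a toy set of weighted 2D votes; the vote coordinates, weights and person height are made up for the example.

```python
import numpy as np

def alpha_beta(votes_2d, weights, t, h, r_frac=0.1):
    """Evaluate alpha (Eq. 2) and beta (Eq. 3) for weighted 2D votes.
    votes_2d: (N, 2) vote locations, weights: vote weights v_r,
    t: ground-truth landmark location, h: person height in pixels."""
    votes_2d = np.asarray(votes_2d, float)
    weights = np.asarray(weights, float)
    dists = np.linalg.norm(votes_2d - np.asarray(t, float), axis=1)
    W = weights.sum()
    alpha = np.sum(weights * (dists <= r_frac * h)) / W   # weighted correct-vote ratio
    beta = np.sum(weights * dists / h) / W                # weighted mean distance in person-height units
    return alpha, beta

# Toy example: three votes around the true landmark t = (100, 50), person height 200 px.
print(alpha_beta([(102, 51), (90, 60), (180, 40)], [0.5, 0.3, 0.2], (100, 50), 200))
```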
Fig. 1. Vote density as a function of the distance to the ground-truth landmark location.
γ(0.1), for example, specifies the average vote density found at locations whose distance
to the true marker location is 10% of the person's height. Left: averaged over all
experiments and landmarks. Right: averaged over all experiments for the landmark 'Upper
Spine' (top) and the landmark 'Right Hand' (bottom).
Codebooks. Two different codebooks were used for the following experiments.
First, a codebook was generated using 178 of the 272 UMPM video sequences.
All video sequences were skipped in which persons occurred on whom we later
tested in the experiments. From 1747 person images we collected 109458 SURF
descriptor vectors of keypoints within a person bounding box. The 128-dimensional
descriptor vectors were clustered using RNN clustering [6], resulting in 1315
visual words. The generic codebook was generated using the ETHZ
[Fig. 2 layout: rows show the landmarks right elbow, left knee, left shoulder, left foot, right shoulder and right hand; columns show the input image and the vote densities produced by ORIG-VOT, RP-VOT, H-VOT, OW-VOT and COMBI-VOT]
Fig. 2. Vote densities generated by the different voting strategies. For different
landmarks we show the vote density generated by each of the strategies on six
example person images from the experiments. The original ISM voting strategy yields
unfocused vote distributions, while especially the combination of the new voting
strategies allows a much more focused localization of the landmarks and often shows
vote density peaks at the true landmark locations (heat map color encoding: warm
colors mean high density).
4 Conclusions
In this paper we showed that the original ISM voting strategy produces vote
distributions of clearly limited use in the context of landmark localization and
as a basis for human pose estimation. We introduced three new alternative vote
generation mechanisms which produce much more focused vote distributions and yield
higher vote densities near the true landmark locations. When combining all three
voting strategies into a fourth one, we see clear vote peaks near the true marker
locations. While our work is set in the context of human pose estimation and action
recognition, it would be highly interesting for future work to repeat the experimental
comparison of the strategies presented here on object categories other than humans.
References
1. van der Aa, N., Luo, X., Giezeman, G., Tan, R., Veltkamp, R.: Utrecht multi-person
motion (umpm) benchmark: a multi-person dataset with synchronized video and
motion capture data for evaluation of articulated human motion and interaction.
In: HICV Workshop, in Conj. with ICCV (2011)
2. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose
annotations. In: Proc. of ICCV (2009)
3. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation
of the state of the art. PAMI 99(PrePrints) (2011)
2 http://www.vision.ee.ethz.ch/~aess/dataset/
4. Girshick, R., Shotton, J., Kohli, P., Criminisi, A., Fitzgibbon, A.: Efficient regression
of general-activity human poses from depth images. In: Proc. of ICCV (2011)
5. Lehmann, A., Leibe, B., van Gool, L.: Prism: Principled implicit shape model. In:
Proc. of BMVC, pp. 64.1–64.11 (2009)
6. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved cat-
egorization and segmentation. IJCV 77, 259–289 (2008)
7. Müller, J., Arens, M.: Human pose estimation with implicit shape models. In: ACM
Artemis, ARTEMIS 2010, pp. 9–14. ACM, New York (2010)
8. Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth
images. In: CVPR, pp. 1297–1304 (2011)
Evaluating the Impact of Color
on Texture Recognition
Fahad Shahbaz Khan1, Joost van de Weijer2, Sadiq Ali3, and Michael Felsberg1
1 Computer Vision Laboratory, Linköping University, Sweden
[email protected]
2 Computer Vision Center, CS Dept., Universitat Autonoma de Barcelona, Spain
3 SPCOMNAV, Universitat Autonoma de Barcelona, Spain
1 Introduction
There exist two main approaches to combining color and texture cues for texture
categorisation.
Early Fusion: Early fusion fuses the two cues at the pixel level to obtain a joint
color-texture representation. The fusion is obtained by computing the texture
descriptor on the color channels. Early fusion performs best for categories which
exhibit constancy in both color and shape [1].
Late Fusion: Late fusion processes the two visual cues separately. The two
histograms are concatenated into a single representation which is then the input to
a classifier. Late fusion combines the visual cues at the image level. It works
better for categories where one cue remains constant and the other changes
significantly [1]. In this work we analyze both early and late fusion approaches
for the task of texture categorisation.
As mentioned above, state-of-the-art early fusion approaches [3] combine the
features at the pixel level. In contrast, it is well known that in the human brain
visual features are processed separately and combined at a later stage for
visual recognition [4,5]. Recently, Khan et al. [6] proposed an
alternative approach to perform early fusion for object recognition. The visual
cues are combined in a single product vocabulary. A clustering algorithm based
on information theory is then applied to obtain a discriminative compact
representation. Here we apply this approach to obtain a compact early-fusion-based
color-texture feature representation.
In conclusion, we make the following novel contributions:
– We investigate state-of-the-art color features used for image classification for
the task of texture categorisation. We show that the color names descriptor,
with its only 11-dimensional feature vector, provides the best results for
texture categorisation.
– We analyze fusion approaches to combine color and texture. Both early and
late feature fusion are investigated in our work.
– We also introduce a new dataset of 10 different and challenging texture
categories, as shown in Figure 1, for the problem of color-texture categorisation.
The images are collected from the internet and Corel collections.
Color has been shown to provide excellent results for bag-of-words based object
recognition [3,1]. Recently, Khan et al. [1,2] have shown that an explicit
representation based on color names outperforms other color descriptors for object
recognition and detection. However, the performance of color descriptors popular
in image classification has yet to be investigated for the texture categorization
task. Therefore, in this paper we investigate the contribution of color to texture
categorization. Different from previous methods [11,12], we propose to use
color names as a compact explicit color representation. We investigate both late
and early fusion based global color-texture description approaches. Contrary to
conventional pixel-based early fusion methods, we use an alternative approach
to construct a compact color-texture image representation.
Color Names [14]: Most of the aforementioned color descriptors are designed
to achieve photometric invariance. Instead, the color names descriptor balances a
certain degree of photometric invariance with discriminative power. Humans use
color names to communicate color, such as "black", "blue" and "orange". In this
work we use the color names mapping learned from Google images [14].
Here we discuss different fusion approaches to combine color and texture features.
Early Fusion: Early fusion involves binding the visual cues at the pixel level.
A common way to construct an early fusion representation is to compute the
texture descriptor on the color channels. Early fusion results in a more discrim-
inative representation since both color and shape are combined together at the
pixel level. However, the final representation is high dimensional. Constructing
an early fusion representation using color channels with a texture descriptor for
an image I is obtained as:
$$T_E = [T_R, T_G, T_B] \tag{1}$$

$$T_L = [H_T, H_C] \tag{2}$$

where H_T and H_C are explicit texture and color histograms. Late fusion provides
superior performance for categories where one of the visual cues changes
significantly. For example, most man-made categories, such as car, motorbike, etc.,
change significantly in color. Since an explicit color representation is used for
late fusion, it has been shown to provide superior results for such classes [1].
Portmanteau Fusion: Most theories from the human vision literature suggest
that the visual cues are processed separately [4,5] and combined at a later stage
for visual recognition. Recently, Khan et al. [6] proposed an alternative solution
for constructing compact early fusion within the bag-of-words framework. Color
and shape are processed separately and a product vocabulary is constructed. A
divisive information-theoretic clustering algorithm (DITC) [15] is then applied
to obtain a compact discriminative color-shape vocabulary. Similarly, in this
$$I(R;\,TC) - I\!\left(R;\,TC^{R}\right) = \sum_{j=1}^{J} \sum_{tc_s \in TC_j} p(tc_s)\, KL\!\left(p(R\,|\,tc_s),\, p(R\,|\,TC_j)\right) \tag{3}$$
The algorithm finds a desired number of histogram bins by minimizing the loss in
mutual information between the bins of the product histogram and the class labels
of the training instances. Histogram bins with similar discriminative power over
the classes are merged together. We refer to Dhillon et al. [15] for a detailed
introduction to the DITC algorithm.
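To make the objective in Eq. (3) concrete, the sketch below evaluates the mutual-information loss for a given assignment of product words to clusters. It is not the full DITC optimization (which iteratively reassigns words), the array and function names are ours, and strictly positive class posteriors are assumed so that the KL divergence is well defined.

```python
import numpy as np

def ditc_loss(p_r_given_tc, p_tc, clusters):
    """Loss in mutual information (Eq. 3) when product words tc_s are grouped
    into clusters TC_j. p_r_given_tc: (num_words, num_classes) posteriors
    p(R|tc_s), p_tc: word priors p(tc_s), clusters: list of index arrays."""
    loss = 0.0
    for idx in clusters:
        idx = np.asarray(idx)
        p_cluster = p_tc[idx].sum()
        # cluster posterior p(R|TC_j) as the prior-weighted mean of its members
        p_r_given_cluster = (p_tc[idx, None] * p_r_given_tc[idx]).sum(axis=0) / p_cluster
        for s in idx:
            kl = np.sum(p_r_given_tc[s] * np.log(p_r_given_tc[s] / p_r_given_cluster))
            loss += p_tc[s] * kl
    return loss

rng = np.random.default_rng(0)
P = rng.random((20, 5)); P /= P.sum(axis=1, keepdims=True)   # p(R|tc_s) for 20 product words
prior = np.full(20, 1.0 / 20)                                # uniform word priors
print(ditc_loss(P, prior, [np.arange(0, 10), np.arange(10, 20)]))
```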
5 Experimental Results
To evaluate the performance of our approach we have collected a new dataset
of 400 images for color-texture recognition. The dataset consists of 10 different
2 In our experiments we also evaluated PCA and PLS, but inferior results were obtained. A comparison of other compression techniques with DITC is also performed in [16].
Fig. 1. Example images from the two datasets used in our experiments. First row:
images from the OT scenes dataset. Bottom row: images from our texture dataset.
categories, namely: marble, beads, foliage, wood, lace, fruit, cloud, graffiti, brick
and water. We use 25 images per class for training and 15 instances for testing.
Existing datasets are either grey-scale, such as the Brodatz set, or too simple for
color-texture recognition, such as the Outex dataset. Texture cues are also
used frequently within the context of object and scene categorisation. Therefore,
we also perform experiments on the challenging OT scenes dataset [17]. The OT
dataset [17] consists of 2688 images classified into 8 categories. Figure 1 shows
example images from the two datasets.
In all experiments a global histogram is constructed for the whole image. We
use LBP with uniform patterns, with a final dimensionality of 383. Early fusion
is performed by computing the texture descriptor on the color channels. For late
fusion, the histogram of the pure color descriptor is concatenated with the texture
histogram. A non-linear SVM is used for classification. The performance is
evaluated as classification accuracy, i.e., the percentage of correctly classified
instances of each category. The final performance is the mean accuracy over all
categories. We also compare our approach with color-texture descriptors proposed
in the literature [12,10].
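A minimal sketch of this evaluation protocol is given below. The histograms are random placeholders standing in for real LBP and color names descriptors, and the RBF kernel and C value are our assumptions; the text only states that a non-linear SVM is used.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test, n_classes = 250, 150, 10
# Placeholder histograms: 383-bin uniform-LBP texture and 11-bin color names per image.
lbp_train, cn_train = rng.random((n_train, 383)), rng.random((n_train, 11))
lbp_test, cn_test = rng.random((n_test, 383)), rng.random((n_test, 11))
y_train, y_test = rng.integers(0, n_classes, n_train), rng.integers(0, n_classes, n_test)

# Late fusion (Eq. 2): concatenate the separately computed texture and color histograms.
X_train = np.hstack([lbp_train, cn_train])
X_test = np.hstack([lbp_test, cn_test])

clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_train, y_train)
print("mean accuracy:", (clf.predict(X_test) == y_test).mean())
```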
Table 1. Classification accuracy on the two datasets. (a) Results using different pure
color descriptors. Note that on both datasets color names being additionally compact
provides the best results. (b) Scores using late fusion approaches. On both datasets
late fusion using color names provides the best results while being low dimensional.
Here, we first show the results obtained by late fusion approaches in Table 1. The
texture descriptor with 383 dimensions provides classification scores of 77%
and 69%, respectively. The late fusion of RGB and LBP provides classification
scores of 79% and 74%. The STD [12] descriptor provides inferior results
of 58% and 67%, respectively. The best results on both datasets are obtained
using the combination of color names with LBP. Table 2 shows the results obtained
using early fusion approaches on the two datasets. The conventional pixel-based
descriptors provide inferior results on both datasets. The LCVBP descriptor [10]
provides classification scores of 76% and 53% on the two datasets. Taking the
product histogram directly without compression provides accuracies of 81%
and 72% while being significantly higher dimensional. It is worth mentioning that
both the JTD and LCVBP descriptors are also significantly high dimensional. The
portmanteau fusion provides the best results among early fusion based methods
while additionally being compact in size.
In summary, late fusion provides superior performance while being compact on
both datasets. Among early fusion based methods, portmanteau fusion provides
improved performance on both datasets. The best results are achieved using the
color names descriptor. Color names, having only an 11-dimensional histogram, is
Table 2. Classification accuracy using early fusion approaches. Among early fusion
approaches, portmanteau fusion provides the best results on both datasets while addi-
tionally being compact.
6 Conclusions
References
1. Khan, F.S., van de Weijer, J., Vanrell, M.: Modulating shape features by color
attention for object recognition. IJCV 98(1), 49–64 (2012)
2. Khan, F.S., Anwer, R.M., van de Weijer, J., Bagdanov, A.D., Vanrell, M., Lopez,
A.M.: Color attributes for object detection. In: CVPR (2012)
3. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for
object and scene recognition. PAMI 32(9), 1582–1596 (2010)
4. Treisman, A., Gelade, G.: A feature integration theory of attention. Cogn.
Psych. 12, 97–136 (1980)
5. Wolfe, J.M.: Watching single cells pay attention. Science 308, 503–504 (2005)
6. Khan, F.S., van de Weijer, J., Bagdanov, A.D., Vanrell, M.: Portmanteau vocabu-
laries for multi-cue image representations. In: NIPS (2011)
7. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local
affine regions. PAMI 27(8), 1265–1278 (2005)
8. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. PAMI 24(7), 971–987
(2002)
9. Varma, M., Zisserman, A.: A statistical approach to texture classification from
single images. IJCV 62(2), 61–81 (2005)
10. Lee, S.H., Choi, J.Y., Ro, Y.M., Plataniotis, K.: Local color vector binary patterns
from multichannel face images for face recognition. TIP 21(4), 2347–2353 (2012)
11. Mäenpää, T., Pietikäinen, M.: Classification with color and texture: jointly or separately?
PR 37(8), 1629–1640 (2004)
12. Alvarez, S., Vanrell, M.: Texton theory revisited: A bag-of-words approach to com-
bine textons. PR 45(12), 4312–4325 (2012)
13. van de Weijer, J., Schmid, C.: Coloring local feature extraction. In: Leonardis, A.,
Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 334–348. Springer,
Heidelberg (2006)
14. van de Weijer, J., Schmid, C., Verbeek, J.J., Larlus, D.: Learning color names for
real-world applications. TIP 18(7), 1512–1524 (2009)
15. Dhillon, I., Mallela, S., Kumar, R.: A divisive information-theoretic feature clus-
tering algorithm for text classification. JMLR 3, 1265–1287 (2003)
16. Elfiky, N., Khan, F.S., van de Weijer, J., Gonzalez, J.: Discriminative compact
pyramids for object and scene recognition. PR 45(4), 1627–1636 (2012)
17. Oliva, A., Torralba, A.B.: Modeling the shape of the scene: A holistic representation
of the spatial envelope. IJCV 42(3), 145–175 (2001)
Temporal Self-Similarity for Appearance-Based
Action Recognition in Multi-View Setups
1 Introduction
The automatic recognition of actions from video streams is a very important
problem in current computer vision research, as reflected by recent surveys [1]. A
variety of possible applications (e.g. Human-Machine Interaction, surveillance,
Smart Environments, entertainment) underlines the growing relevance of
this topic.
As monocular approaches rely on single-view images, they solely perceive 2d
projections of the real world and discard important information. Hence, they are
likely to suffer from occlusions and ambiguities. As a consequence, the majority
of these methods use data-driven methods like Space-Time Interest Points[8]
instead of model-based representations of the image content. In contrast, existing
multi-view action recognition systems try to directly exploit 3d information,
e.g. by reconstructing the scene or fitting anatomical models, resulting in a far
higher complexity.
Having these observations in mind, we propose a method to recognize articu-
lated actions, which meets the following demands: (i) it is designed to be general
and not restricted to human action recognition, (ii) it avoids expensive dense 3d
reconstruction, (iii) it is independent of the camera setup it was learned in,
and (iv) it does not rely on foreground segmentation and exact localization.
The rest of this paper is structured as follows: in Sect. 2 we give a short introduction
to the theory of Recurrence Plots and Temporal Self-Similarity Maps and
Fig. 1. Two SSMs obtained for a robot dog performing a stand kickright action captured
from different viewpoints. Action primitives induce similar local structures in the
corresponding SSM even under changes of viewpoint, illumination, or image quality.
motivate their usage. We also suggest extending the related approach of Junejo
et al. [7] with new low-level features and distance metrics. Subsequently, Sect. 3
presents our approach to utilizing SSMs for multi-view action recognition. For this
purpose we use a Gaussian Process classifier together with the Histogram Intersection
Kernel, which has been shown to be more suitable for comparing histograms.
In Sect. 4, we show results of our approach on our own new multi-view action
recognition dataset as well as on the widely used IXMAS dataset.
Going through the related literature, methods for action recognition can be
categorized into three groups: the first kind of approach tries to reconstruct 3d
information or trajectories from the scene [15] or augments these representations
with a fourth time dimension [19,5]. Alternatively, relationships between action
features obtained from different views are learned by applying transfer learning
or knowledge transfer techniques [3,9]. The methods most related to our proposal
try to directly model the dynamics of actions within a view-independent
framework [7,2]. For a more extensive overview of recent work on action
recognition we refer to recent surveys [1,14].
To understand human actions and activities, observers benefit from their prior
knowledge of typical temporal and spatial recurrences in the execution of actions.
Despite all differences in execution, two actions can be perceived as semantically
identical if they share atomic action primitives with a similar frequency.
Assuming those actions to be instances of deterministic dynamical systems, which
can be modeled by differential equations, Marwan et al. [11] presented an
extensive discussion of their interpretation using Recurrence Plots (RP).
This work has since been applied to human gait analysis [2] and, due to the
stability of RPs under viewpoint changes, to cross-view action recognition [7].
(Anti-)diagonal straight lines: the recorded action contains different atomic actions with similar evolutionary characteristics in (reversed) time.
Horizontal & vertical lines: no or slow change of states for a given period of time.
Bow structures: the recorded action contains different atomic actions with similar evolutionary characteristics in reversed time, with different velocities.
The choice of low-level image features f(·) is of inherent importance and has to
suit the given scenario. In the following, we discuss some possible alternatives.
Intensity Values. The simplest way to convert an image into a descriptive feature
vector f_int(I) ∈ R^{M·N} is to concatenate its intensities, as proposed for human
gait analysis [2]. While this is suitable for sequences with a single stationary
actor, it yields large feature vectors and is very sensitive to noise and illumination
changes.
Landmark Positions. Assuming we are able to track anatomical or artificial
landmarks of the actor over time, their positions f_pos(I) = (x_0, x_1, ...), with x_i =
(x_i, y_i, z_i), can be used to represent the current system configuration [7]. This is
sufficient as long as the tracked points are distributed over the moving body parts,
but it requires the points to be trackable continuously.
[Pipeline figure: frame-wise low-level features (intensity, HoG, HoF, Fourier) are compared with a distance measure (Euclidean distance, normalized cross-correlation, histogram intersection) to form per-sequence SSMs; SSM features are reduced by PCA and quantized by a GMM into a Bag of SSM Words, which is used to train and test a GP classifier.]
and might harm the further processing. Hence, we concentrate on using
the proposed Fourier coefficients, since they are easy and fast to compute and
provide some handy invariances by design.
Besides the choice of a suitable image representation f(·), the distance measure
d(·,·) plays an important role when computing self-similarities, as qualitatively
compared in Tab. 3.
Euclidean Distances. The Euclidean distance d_eucl(f1, f2) = ‖f1 − f2‖_2 serves
as a straightforward way to quantify the similarity between two image feature
descriptors f1 = f(I1) and f2 = f(I2) of equal length, as proposed by [7]. While
it is easy to compute, it might be unsuited for histogram data [10], since false
bin assignments cause large errors in the Euclidean distance.
Normalized Cross-Correlation. From a signal-theoretical point of view, the
image feature descriptors f1, f2 can be regarded as D-dimensional discrete signals
of equal size. Then, the normalized cross-correlation coefficient
d_NCC(f1, f2) = ⟨f1, f2⟩ / (‖f1‖ · ‖f2‖) ∈ [−1, 1] measures the cosine of the angle
between the signal vectors f1 and f2. Hence, this distance measure is independent
of their lengths.
Histogram Intersection. The intersection d_HI(h1, h2) = Σ_{i=0}^{D−1} min(h_{1,i}, h_{2,i})
of two histograms h1, h2 ∈ R^D was shown to perform better for codebook
generation and image classification tasks [13]. In the case of comparing normalized
histograms, the histogram intersection distance is bounded by [0, +1].
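The sketch below builds an SSM from per-frame descriptors with any of the three measures just discussed; NCC and histogram intersection are similarities rather than distances, and here they are simply written into the matrix entries, as the SSMs in the text do. The frame descriptors are random placeholders.

```python
import numpy as np

def ssm(features, measure="hik"):
    """Temporal Self-Similarity Map: entry (i, j) compares descriptors f_i and f_j."""
    F = np.asarray(features, float)          # shape: (num_frames, D)
    T = len(F)
    M = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            if measure == "euclidean":
                M[i, j] = np.linalg.norm(F[i] - F[j])
            elif measure == "ncc":           # cosine of the angle, in [-1, 1]
                M[i, j] = F[i] @ F[j] / (np.linalg.norm(F[i]) * np.linalg.norm(F[j]))
            elif measure == "hik":           # histogram intersection, in [0, 1] for normalized histograms
                M[i, j] = np.minimum(F[i], F[j]).sum()
    return M

rng = np.random.default_rng(1)
H = rng.random((50, 16)); H /= H.sum(axis=1, keepdims=True)   # 50 frames, 16-bin histograms
print(ssm(H, "hik").shape)
```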
As mentioned before, SSMs obtained from videos capturing the identical action
from different viewpoints share common patterns. Hence, local feature descrip-
tors suitable for monitoring the structure of those patterns have to be developed
4 Experimental Evaluation
In order to evaluate our multi-view action recognition system, we firstly per-
formed experiments on our own dataset. This dataset contains 10 sequences of
each 56 predefined actions performed by Sony AIBO robot dogs simultaneously
captured by six cameras.1
In our general setup, the dimension of SIFT descriptors extracted along the
SSM diagonal was reduced from 128 to 32 by applying PCA. Subsequently, all
descriptors from all train sequences were clustered into a mixture of 512 Gaus-
sians to create a Bag of Self-Similarity Words (BOSS Words). This is further
used to represent each training sequence by a histogram of relative frequencies.
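A compact sketch of this Bag of SSM Words step is shown below, using scikit-learn's PCA and Gaussian mixture. The local descriptor array is a random placeholder, and only 64 mixture components are used instead of 512 to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
train_desc = rng.random((5000, 128))          # placeholder SIFT-like SSM descriptors

pca = PCA(n_components=32).fit(train_desc)    # 128 -> 32, as in the text
gmm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0)
gmm.fit(pca.transform(train_desc))            # mixture of Gaussians = BoSS vocabulary

def boss_histogram(seq_desc):
    """Represent one sequence by the relative frequencies of its BoSS words."""
    words = gmm.predict(pca.transform(seq_desc))
    hist = np.bincount(words, minlength=gmm.n_components).astype(float)
    return hist / hist.sum()

print(boss_histogram(rng.random((200, 128))).shape)
```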
These parameter values heuristically gave the best results. While Junejo et al. [7]
propose to employ a multiclass SVM, this yields a very high complexity in our case,
as the AIBO dataset covers a relatively large number of classes to be distinguished.
Hence, we use a Gaussian Process (GP) classifier combined with a Histogram
Intersection Kernel κ_HIK(h, h′) = Σ_{i=0}^{D−1} min(h_i, h′_i), h, h′ ∈ R^D, which can be
evaluated efficiently, as recently shown by Rodner et al. [16] and Freytag et al. [4].
Recognition rates were obtained after 10-fold cross validation.
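For illustration, the histogram intersection kernel can be computed as below. The text plugs this kernel into a GP classifier [16,4]; the sketch instead feeds the precomputed Gram matrix to an SVM simply because that is compact to write down, so the classifier choice here is our substitution.

```python
import numpy as np
from sklearn.svm import SVC

def hik_kernel(A, B):
    """Histogram intersection kernel matrix between row-wise histograms."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

rng = np.random.default_rng(3)
X_train, X_test = rng.random((120, 64)), rng.random((30, 64))
X_train /= X_train.sum(axis=1, keepdims=True)
X_test /= X_test.sum(axis=1, keepdims=True)
y_train = rng.integers(0, 5, 120)

clf = SVC(kernel="precomputed").fit(hik_kernel(X_train, X_train), y_train)
pred = clf.predict(hik_kernel(X_test, X_train))   # (n_test, n_train) Gram matrix
print(pred[:5])
```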
One of the most important questions concerning multi-view action recogni-
tion is the influence of the training and testing camera setup on the overall
1 The complete dataset including labels, calibration data and background images is available at http://www.inf-cv.uni-jena.de/JAR-Aibo.
the invariance and stability properties of SSMs support our demands on an action
recognition system.
We made three contributions: (i) we further extended the method originally
presented in [7] by new low-level features and distance metrics, (ii) we applied
a Gaussian Process (GP) classifier combined with a histogram intersection kernel,
which has been shown to be more suitable and efficient for comparing
histograms [16,4], and (iii) we used a new extensive dataset for evaluating multi-view
action recognition systems, which will be made publicly available.
It is straightforward to augment the Bag of Self-Similarity Words modeling
scheme with histograms of co-occurrences of vocabulary words in order to improve
the descriptive power of this representation. Another important aspect is the
direct integration of calibration knowledge into our framework.
References
1. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. ACM Comput.
Surv. 43(3), 16:1–16:43 (2011)
2. Cutler, R., Davis, L.S.: Robust real-time periodic motion detection, analysis, and
applications. TPAMI 22(8), 781–796 (2000)
3. Farhadi, A., Tabrizi, M.K.: Learning to recognize activities from the wrong view
point. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS,
vol. 5302, pp. 154–166. Springer, Heidelberg (2008)
4. Freytag, A., Rodner, E., Bodesheim, P., Denzler, J.: Rapid uncertainty compu-
tation with gaussian processes and histogram intersection kernels. In: Lee, K.M.,
Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part II. LNCS, vol. 7725,
pp. 511–524. Springer, Heidelberg (2013)
5. Holte, M.B., Chakraborty, B., Gonzalez, J., Moeslund, T.B.: A local 3-D motion de-
scriptor for multi-view human action recognition from 4-D spatio-temporal interest
points. Selected Topics in Signal Processing 6(5), 553–565 (2012)
6. Iwanski, J.S., Bradley, E.: Recurrence plots of experimental data: To embed or not
to embed? Chaos 8(4), 861–871 (1998)
7. Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: View-independent action recognition
from temporal self-similarities. TPAMI 33(1), 172–185 (2011)
8. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: CVPR, pp. 1–8 (2008)
9. Liu, J., Shah, M., Kuipers, B., Savarese, S.: Cross-view action recognition via view
knowledge transfer. In: CVPR, pp. 3209–3216 (2011)
10. Maji, S., Berg, A.C., Malik, J.: Classification using intersection kernel support
vector machines is efficient. In: CVPR, pp. 1–8 (2008)
11. Marwan, N., Romano, M.C., Thiel, M., Kurths, J.: Recurrence plots for the analysis
of complex systems. Physics Reports 438(5-6), 237–329 (2007)
12. McGuire, G., Azar, N.B., Shelhamer, M.: Recurrence matrices and the preservation
of dynamical properties. Physics Letters A 237(1-2), 43–47 (1997)
13. Odone, F., Barla, A., Verri, A.: Building kernels from binary strings for image
matching. IP 14(2), 169–180 (2005)
14. Poppe, R.: A survey on vision-based human action recognition. IVC 28(6), 976–990
(2010)
15. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of
actions. IJCV 50(2), 203–226 (2002)
16. Rodner, E., Freytag, A., Bodesheim, P., Denzler, J.: Large-scale gaussian process
classification with flexible adaptive histogram kernels. In: Fitzgibbon, A., Lazebnik,
S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575,
pp. 85–98. Springer, Heidelberg (2012)
17. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views
using 3D exemplars. In: ICCV, pp. 1–7 (2007)
18. Weinland, D., Özuysal, M., Fua, P.: Making action recognition robust to occlusions
and viewpoint changes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV
2010, Part III. LNCS, vol. 6313, pp. 635–648. Springer, Heidelberg (2010)
19. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using mo-
tion history volumes. CVIU 104(2), 249–257 (2006)
Adaptive Pixel/Patch-Based Stereo Matching
for 2D Face Recognition
1 Introduction
Although face recognition in controlled environments has been largely solved, its
performance in real applications is still far from satisfactory. Variations of pose,
illumination, occlusion and expression are still critical issues that affect face
recognition performance. Existing techniques such as Eigenfaces [1] or Fisherfaces [2]
are not robust to these variations. Local features such as local binary patterns (LBP)
[3] have been proposed for recognition. Recently, sparse representation-based
classification (SRC) [4] has also been proposed and has shown very promising results.
However, these methods degrade considerably with changes of pose. Previous methods for
improving face recognition accuracy under pose variation include [5-10]. In [5], a
pose-specific locally linear mapping is learned between a set of non-frontal faces and
the corresponding frontal faces. [6] shows that a dynamic programming-based stereo
matching algorithm (DP-SM) can gain significant performance for 2D face recognition
across pose. A learning method is presented in [10] to perform patch-based
rectification based on locally linear regression.
In our work, we perform recognition by using Adaptive Pixel/Patch-based Stereo
Matching (APP-SM) to judge the similarity of two 2D face images. Fig. 1 gives a
schematic overview of our method, which consists of the following steps:
first, we build a gallery of 2D face images. Second, we align each probe-gallery
image pair using four feature points by calculating the epipolar geometry. Then we
run an adaptive pixel/patch-based stereo algorithm on the image pair. Note that we
do not perform 3D reconstruction. We also discard all correspondences and the
disparities; we only use the matching cost to compute the similarity of two face
images. Finally, we identify the probe with the gallery image that produces the
maximum similarity.
The paper is organized as follows. Section 2 presents the details of our face recog-
nition method. Section 3 presents and analyses all experiments. Finally, in Section 4,
conclusions will be given.
2.1 Alignment
Before running the stereo algorithm on an image pair, we first need to rectify it so
that the epipolar constraint holds. Generally, eight corresponding points are required
to obtain the epipolar geometry. Nevertheless, since the average variation of the depth
of the head is small compared to the distance from the camera to the head, and the
field of view for the facial image is small, we can simplify the model to scaled
orthographic projection [11]. Then we only need four feature points to calculate the
epipolar geometry.
However, it has been shown in [11] that there can still be considerable variations
in disparity between two images under scaled orthographic projection. Traditional
linear transformations can only create linear disparity maps, which cannot be used
here to align the images while accounting for these disparity variations. So in this
step, we follow [6] to achieve alignment by solving a non-linear system. For
completeness, we briefly review the main ideas of solving the epipolar geometry under
scaled orthographic projection. Details can be found in [6].
The epipolar geometry in this scenario is modeled as a tuple (θ, γ, s, t). Here, θ
and γ are the angles of the epipolar lines in the left and right image, respectively.
Scaling the right image by s makes the distance between two epipolar lines in the
right image match the distance in the left image. Translating the image perpendicular
to the epipolar lines by t aligns the corresponding lines.
In our experiments, we specify these feature points by hand, as in [6]. With four
corresponding points, we get a nonlinear system of equations, which we solve in a
straightforward way to complete the alignment step.
left scanline (row) and a pixel x_R in the right scanline (row). Birchfield
and Tomasi [12] defined a pixel dissimilarity measure that is insensitive to image
sampling:

$$d(x_L, x_R) = \min\left\{ \min_{x_L - \frac{1}{2} \le x \le x_L + \frac{1}{2}} \left| \hat{I}_L(x) - I_R(x_R) \right|,\ \min_{x_R - \frac{1}{2} \le x \le x_R + \frac{1}{2}} \left| I_L(x_L) - \hat{I}_R(x) \right| \right\} \tag{1}$$

Here, x_L = x_R + d, d ∈ [Δ_1, Δ_2]. I_L and I_R are two discrete one-dimensional arrays of
intensity values. Î_R is the linearly interpolated function between the sample points of
the right scanline, and Î_L is defined similarly.
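The measure in Eq. (1) has a simple closed form when the scanlines are linearly interpolated, which the sketch below uses; the helper names and the toy scanlines are ours.

```python
import numpy as np

def bt_half(I_ref, x_ref, I_other, x_other):
    """One side of Eq. (1): distance of I_ref(x_ref) to the linearly interpolated
    intensities of the other scanline within half a pixel around x_other."""
    lo = 0.5 * (I_other[x_other] + I_other[max(x_other - 1, 0)])
    hi = 0.5 * (I_other[x_other] + I_other[min(x_other + 1, len(I_other) - 1)])
    i_min = min(lo, hi, I_other[x_other])
    i_max = max(lo, hi, I_other[x_other])
    c = I_ref[x_ref]
    return max(0.0, c - i_max, i_min - c)

def bt_dissimilarity(IL, IR, xL, xR):
    """Birchfield-Tomasi sampling-insensitive pixel dissimilarity (Eq. 1)."""
    return min(bt_half(IL, xL, IR, xR), bt_half(IR, xR, IL, xL))

row_left = np.array([10., 20., 30., 40., 50.])
row_right = np.array([12., 22., 31., 39., 48.])
print(bt_dissimilarity(row_left, row_right, 2, 2))
```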
Based on the Birchfield-Tomasi measure, we propose a simple adaptive pixel-based
stereo matching algorithm to deal with horizontal slant. It processes a pair of
scanlines at a time. Horizontal disparities Δ are assigned to the scanline within a
given range [Δ_1, Δ_2]. The disparities are not assigned to individual pixels, but
continuously over the whole scanline. Given a point x_L in the left scanline and its
corresponding point x_R in the right scanline, we have

$$x_L = m \cdot x_R + d \tag{2}$$
m is the horizontal slant, which allows line segments of different lengths in the two
scanlines to correspond adaptively. The values of the horizontal slant to be examined
are provided as inputs, i.e., m ∈ M, where M = {m_1, m_2, ..., m_k} with
m_1, m_2, ..., m_k ≥ 1.

$$\Delta = (m - 1) \cdot x_R + d \tag{3}$$

Δ is the horizontal disparity. The disparity search range [Δ_1, Δ_2] is also provided as
an input. We can then find the range for d using the given range of Δ and Equation (3).
In our implementation, we choose Δ and m empirically. Given the constraints on runtime
and memory consumption, we find that Δ ∈ [0, 8] and M = {1, 1.2, ..., 3} perform best.
For the i-th pixel of the s-th scanline in the left image, we simultaneously search the
space of possible disparities and horizontal slants, and choose the minimum
dissimilarity as the value of PI(i, s). After we obtain the dissimilarity value for each
pixel in the left image, we normalise the values using Min-Max normalization:

$$\widetilde{PI}(i, s) = 1 - \frac{PI(i, s) - \min(PI)}{\max(PI) - \min(PI)} \tag{4}$$
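Continuing the previous sketch (it reuses bt_dissimilarity and the toy scanlines defined there), the fragment below searches slants and offsets per pixel and applies the Min-Max normalization of Eq. (4). Two simplifications are ours: the offsets d come from a fixed grid rather than being derived from the Δ range via Eq. (3), and the fractional right-scanline position is rounded instead of interpolated.

```python
import numpy as np

def adaptive_pixel_cost(left_row, right_row, i, slants, d_values):
    """Minimum Birchfield-Tomasi dissimilarity over slants m and offsets d
    (x_L = m * x_R + d, Eq. 2) for the i-th pixel of a left scanline."""
    best = np.inf
    for m in slants:
        for d in d_values:
            xR = (i - d) / m                       # invert Eq. (2)
            if 0 <= xR <= len(right_row) - 1:
                best = min(best, bt_dissimilarity(left_row, right_row, i, int(round(xR))))
    return best

slants = np.arange(1.0, 3.01, 0.2)                 # M = {1, 1.2, ..., 3}
d_values = np.arange(0, 9)                         # offsets explored, cf. Delta = [0, 8]
PI = np.array([adaptive_pixel_cost(row_left, row_right, i, slants, d_values)
               for i in range(len(row_left))])
PI_tilde = 1.0 - (PI - PI.min()) / (PI.max() - PI.min() + 1e-12)   # Eq. (4)
print(PI_tilde)
```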
We compute a similarity value sv(I_1, I_2) by aggregating over the set of scanlines in
the left image. This value tells us how well image I_1 and image I_2 match. Since each
similarity is going to be compared to costs matched over scanlines of potentially
different lengths, we again use a normalization strategy:

$$sv(I_1, I_2) = \frac{\sum_{s}\sum_{i} MATCH(i, s)}{\sum_{s}\left(|I_{1,s}| + |I_{2,s}|\right)} \tag{7}$$
$$\mathrm{similarity}(I_1, I_2) = \max\left\{ sv(I_1, I_2),\ sv(I_2, I_1),\ sv(\mathrm{flip}(I_1), I_2),\ sv(I_2, \mathrm{flip}(I_1)) \right\} \tag{8}$$
Finally, we perform recognition simply by matching a probe image to the most simi-
lar image in the gallery.
3 Experiments
The CMU PIE [13] database consists of 13 poses, of which 9 have approximately the
same camera altitude (poses c34, c14, c11, c29, c27, c05, c37, c25 and c22). For each
pose of the same person there are also 21 images, each with a different illumination.
We conducted experiments to compare our method APP-SM with others. The
thumbnails used were generated as described in Section 2.1. A number of prior
experiments have been done using the CMU PIE database, but under somewhat different
experimental conditions. We have run our own algorithm under a variety of conditions
so that we may compare with these.
First, we tested only on individuals 35-68 from the PIE database to compare our
method with six others. Specifically, we selected each gallery pose as one of the 13
PIE poses and the probe pose as one of the remaining 12 poses, for a total of 156
gallery-probe pairs. We evaluated the accuracy of our method in this setting and
compared to the results in [6, 7, 9]. Table 1 summarizes the average recognition rates.
[Fig. 2 plot: recognition accuracy (%) over the probe poses c34, c31, c14, c11, c29, c09, c27, c07, c05, c37, c02, c25, c22 for the proposed method, DP-SM [6] and BFS [7]]
Fig. 2. A comparison with the methods of Castillo et al. [6] and Gross et al. [7]. The
gallery pose is frontal (c27). We report the average over the 21 illuminations.
We also evaluate our method on the FERET face image database [14]. This database is
one of the largest publicly available databases. It has been used for evaluating face
recognition algorithms and displays diversity across gender, ethnicity, and age.
Method bh bg bf be bd bc Avg(%)
Zhang et al. [15] 62.0 91.0 98.0 96.0 84.0 51.0 80.5
Gao et al. [16] 78.5 91.5 98.0 97.0 93.0 81.5 90.0
Sarfraz et al. [8] 92.4 89.7 100 98.6 97.0 89.0 94.5
Mostafa et al. [17] 87.5 98.0 100 99.0 98.5 82.4 94.2
Proposed APP-SM 92.0 94.5 100 98.8 96.0 89.5 95.1
In our experiments, we used all 200 subjects at 7 different poses (ba, bh, bg, bf, be,
bd, bc). The pose angles range from +60° to −60°. The frontal image ba of each subject
is used as gallery and the remaining 6 images per subject were used as probes
(1,200 in total). Table 2 shows that our APP-SM performs as well as any prior method
based on image comparison. However, it should be noted that APP-SM needs
no training. It is much simpler and more straightforward, which is very important for
applications.
4 Conclusion
References
1. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: Proceedings 1991 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (91CH2983-5),
pp. 586–591 (1991)
2. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition
using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine
Intelligence 19, 711–720 (1997)
3. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns. In:
Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer,
Heidelberg (2004)
4. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust Face Recognition via
Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 31, 210–227 (2009)
5. Chai, X., Shan, S., Chen, X., Gao, W.: Locally linear regression for pose-invariant face
recognition. IEEE Transactions on Image Processing 16, 1716–1725 (2007)
6. Castillo, C.D., Jacobs, D.W.: Using Stereo Matching with General Epipolar Geometry for
2D Face Recognition across Pose. IEEE Transactions on Pattern Analysis and Machine In-
telligence 31, 2298–2304 (2009)
7. Gross, R., Matthews, S.B.I., Kanade, T.: Face recognition across pose and illumination. In:
Jain, A.K., Li, S.Z. (eds.) Handbook of Face Recognition. Springer-Verlag New York, Inc.
(2005)
8. Sarfraz, M.S., Hellwich, O.: Probabilistic learning for fully automatic face recognition
across pose. Image and Vision Computing 28, 744–753 (2010)
9. Sharma, A., Jacobs, D.W.: Bypassing Synthesis: PLS for Face Recognition with
Pose, Low-Resolution and Sketch. In: 2011 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 593–600 (2011)
10. Ashraf, A.B., Lucey, S., Tsuhan, C.: Learning patch correspondences for improved view-
point invariant face recognition. In: 2008 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), p. 8 (2008)
11. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge
University Press (2003)
12. Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sam-
pling. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 401–406
(1998)
13. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. IEEE
Transactions on Pattern Analysis and Machine Intelligence 25, 1615–1618 (2003)
14. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for
face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 22, 1090–1104 (2000)
15. Zhang, W., Shan, S., Gao, W., Chen, X., Zhang, H.: Local Gabor binary pattern
histogram sequence (LGBPHS): a novel non-statistical model for face representation and
recognition. In: Proceedings of the Tenth IEEE International Conference on Computer Vi-
sion, vol. 1, pp. 786–791 (2005)
16. Gao, H., Ekenel, H.K., Stiefelhagen, R.: Pose Normalization for Local Appearance-Based
Face Recognition. In: Tistarelli, M., Nixon, M.S. (eds.) ICB 2009. LNCS, vol. 5558,
pp. 32–41. Springer, Heidelberg (2009)
17. Mostafa, E.A., Farag, A.A.: Dynamic weighting of facial features for automatic pose-
invariant face recognition. In: 2012 Canadian Conference on Computer and Robot Vision,
pp. 411–416 (2012)
A Machine Learning Approach
for Displaying Query Results in Search Engines
Tunga Güngör1,2
1 Boğaziçi University, Computer Engineering Department, Bebek, 34342 İstanbul, Turkey
2 Visiting Professor at Universitat Politècnica de Catalunya, TALP Research Center, Barcelona, Spain
[email protected]
1 Introduction
A search engine is a web information retrieval system that, given a user query,
outputs brief information about a number of documents that it thinks relevant to the
query. By looking at the results displayed, the user tries to locate the relevant pages.
The main drawback is the difficulty of determining the relevancy of a result from the
short extracts. The work in [14] aims at increasing the relevancy by accompanying the
text extracts by images. In addition to important text portions in a document, some
images determined by segmenting the web page are also retrieved. Roussinov and
Chen propose an approach that returns clusters of terms as query results [12]. A
framework is proposed and its usefulness is tested with comprehensive experiments.
Related to summarization of web documents, White et al. describe a system that
forms hierarchical summaries. The documents are analyzed using the DOM (document
object model) tree and their summaries are formed. A similar approach is used in [10],
where a rule-based method is employed to obtain the document structures. Sentence
weighting schemes have been used for the identification of important sentences [6,9,16].
In one study, the "table of contents"-like structure of HTML documents was incorporated
into summaries [1]. Yang and Wang developed a fractal summarization method where
generic summaries are created based on the structures of documents [15]. These studies
focus on general-purpose summaries, not tailored to particular user queries. There
are also some studies on summarization of XML documents. In [13], query-based
summarization is used for searching XML documents. In another study, a machine
learning approach was proposed based on document structure and content [2]. The
concept of lexical chains has also been used for document summarization. Berker and
Güngör used lexical chaining as a feature in summarization [3]. In another work, a
lexical chain formation algorithm based on relaxation labeling was proposed [5], where
the sentences were selected according to some heuristics.
In this paper, we propose an approach that displays the search results as long
extracts. We build a system that creates a hierarchical summary for each retrieved
document. The cohesion of the summaries is maintained by using lexical chains. The
experiments on standard query sets showed that the methods significantly outperform
traditional search engines and that the lexical chain component is an important factor.
[System overview figure: training and test data pass through a structure extractor and a lexical chain module; a model builder learns from the training data and the summarizer produces summaries for the test documents.]
that includes different types of parts. The first process is extracting the underlying
content of the document. We parse a given web document and build its DOM tree
using the Cobra open source toolkit [4]. We then remove the nodes that contain
non-textual content by traversing the tree. After this process, we obtain a tree that
includes only textual elements of the document and the hierarchical relations between
them. The result of the simplification process for the document in Fig. 2 is shown in
Fig. 3.
[Tree diagram: the root node "Biological Fathers' Rights in Adoption" has child headings such as "Adopted Persons" and "Adopting a Child", each with their own text nodes ("A Lack of attention ...", "While the ... baby ...", "In 1996 ...")]
Fig. 3. Simplified DOM tree
The tree structure obtained does not correspond to the actual hierarchical structure.
We identify the structure in three steps by using a learning approach. We consider the
document structure as a hierarchy formed of sections where each section has a title
(heading) and includes subsections. The first step is identification of headings which
is a binary classification problem (heading or non-heading). For each textual unit in a
document, we use the features shown in Table 1. The second step is determining the
hierarchical relationships between the units. We first extract the parent-child relations
between the heading units. This is a learning problem where the patterns of heading–
subheading connections are learned. During training, we use the actual parent-child
connections between headings as positive examples and other possible parent-child
connections as negative examples. We use the same set of features shown in Table 1.
In the last step, the non-heading units are attached to the heading units. For this
purpose, we employ the same approach used for heading hierarchy.
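A toy version of the first step is sketched below, with a linear SVM from scikit-learn standing in for SVM-Light; the feature vectors and labels are synthetic placeholders, since the actual features are those listed in Table 1.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
X = rng.random((400, 8))                 # placeholder per-unit feature vectors
y = (X[:, 0] > 0.7).astype(int)          # synthetic heading / non-heading labels

# Step 1: binary heading classification on the training part of the units.
clf = LinearSVC(C=1.0, max_iter=5000).fit(X[:300], y[:300])
print("held-out accuracy:", (clf.predict(X[300:]) == y[300:]).mean())

# Steps 2 and 3 are analogous: candidate parent-child pairs become training
# examples (actual links positive, other possible links negative) with the same features.
```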
$$\mathrm{rel}(t_i, t_j) = \begin{cases} w_{\mathrm{rel}}, & \text{if terms } t_i \text{ and } t_j \text{ are connected by relation type } \mathrm{rel} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
The chain score is calculated by summing up the scores of all pairs of terms in the
chain. The score of terms t_i and t_j depends on the relation between them; we use a
fixed value for each relation type. Once the chains are scored, we select only the
strongest chains for summarization. A lexical chain is accepted as a strong chain if its
score is more than two standard deviations above the average of the lexical chain scores.
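The chain scoring and strong-chain selection just described can be sketched as follows. The relation types and their fixed scores are hypothetical; the actual values are not given in this excerpt.

```python
import numpy as np

REL_SCORE = {"same": 1.0, "synonym": 0.8, "hypernym": 0.5, "sibling": 0.3}   # assumed values

def chain_score(term_pairs):
    """Sum the fixed relation score over all related term pairs of a chain."""
    return sum(REL_SCORE.get(rel, 0.0) for _, _, rel in term_pairs)

def strong_chains(chains):
    """Keep chains scoring more than two standard deviations above the mean."""
    scores = np.array([chain_score(c) for c in chains])
    threshold = scores.mean() + 2.0 * scores.std()
    return [c for c, s in zip(chains, scores) if s > threshold]

example_chain = [("car", "automobile", "synonym"), ("car", "vehicle", "hypernym")]
print(chain_score(example_chain))
```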
In this work, we aim at producing summaries that take into account the structure of
web pages and that will be shown to the user as a result of a search query. We use the
criteria shown in Table 2 for determining the salience of sentences in a document. For
each feature, the table gives the feature name, the formula used to calculate the feature
value for a sentence S, and the explanation of the parameters in the formula. The score
values of the features are normalized to the range [0,1]. We learn the weight of each
feature using a genetic algorithm. As the feature weights are learned, the score of a
sentence can be calculated by the equation

$$\mathrm{score}(S) = \sum_{i} w_i \cdot f_i(S) \tag{2}$$
where wi denotes the weight of the corresponding feature. Given a document as a
result of a query, the sentences in the document are weighted according to the learned
feature weights. The summary of the document is formed using a fixed summary size.
While forming the summary, the hierarchical structure of the document is preserved
and each section is represented by that number of sentences that is proportional to the
importance of the section (total score of the sentences in the section).
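The scoring and section-proportional selection can be sketched as below. The feature vectors, weights and section names are placeholders; in the real system the weights come from the genetic algorithm and the features are those of Table 2.

```python
import numpy as np

def summarize(sections, weights, summary_size):
    """Score each sentence as the weighted sum of its feature values (Eq. 2) and
    give every section a number of sentences proportional to its total score."""
    scores = {name: feats @ weights for name, feats in sections.items()}
    grand_total = sum(s.sum() for s in scores.values())
    summary = {}
    for name, s in scores.items():
        n = max(1, int(round(summary_size * s.sum() / grand_total)))
        summary[name] = np.argsort(s)[::-1][:n]    # indices of the top-scoring sentences
    return summary

rng = np.random.default_rng(5)
sections = {"Intro": rng.random((6, 5)), "Methods": rng.random((10, 5))}
print(summarize(sections, rng.random(5), summary_size=5))
```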
To evaluate the proposed approach, we identified 30 queries from the TREC (Text
Retrieval Conference) Robust Retrieval Tracks for the years 2004 and 2005. Each
query was given to the Google search engine and the top 20 documents retrieved for
each query were collected. Thus, we compiled a corpus formed of 600 documents.
The corpus was divided into training and test sets with an 80%-20% ratio.
For structure extraction, we used SVM-Light which is an efficient algorithm for
binary classification [8]. The results are given in Table 3. Accuracy is measured by
dividing the number of correctly identified parent-child relationships to the total
number of parent-child relationships. The first row of the table shows the performance
of the proposed method. This figure is computed by considering each pair of nodes
independent of the others. A stronger success criterion is counting a connection to a
node as a success only if the node is in the correct position in the tree. The accuracy
under this criterion is shown in the second row, which indicates that the method
identified the correct path in most of the cases. The third result gives the accuracy
when only the heading units are considered. That is, the last step of the method
explained in Section 2 was not performed.
Table 3. Accuracy (%)
Document structure              76.47
Document structure (full path)  68.41
Sectional structure             78.11
Once the feature weights were determined, we formed the summaries of all the
documents in the corpus. For evaluation, instead of using a manually prepared
summarization data set, we used the relevance prediction method [9] adapted to a
search engine setting. In this method, a summary is compared with the original
document: if the user evaluates both of them as relevant, or both as irrelevant, to the
search query, then we consider the summary a successful summary.
The evaluation was performed by two evaluators. For a query, the evaluator was
given the query terms, a short description of the query, and a guide that shows which
documents are relevant results. The evaluator is first shown the summaries of the 20
documents retrieved by the search engine for the query, in random order, and then the
original documents in random order. The user is asked to mark each displayed document
or summary as relevant or irrelevant for the query. The results are given in
Table 4. We use precision, recall and f-measure for the evaluation, as shown below:
$$\mathrm{precision} = \frac{|D_{rel} \cap S_{rel}|}{|S_{rel}|} \tag{3}$$

$$\mathrm{recall} = \frac{|D_{rel} \cap S_{rel}|}{|D_{rel}|} \tag{4}$$

$$\mathrm{f\text{-}measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{5}$$
where Drel and Srel denote, respectively, the set of documents and the set of summaries
relevant for the query.
The first row in the table shows the performance of the method, where we obtain
about an 80% success rate. The second row is the performance of the Google search
engine. We see that the outputs produced by the proposed system are significantly
better than those produced by a traditional search engine. This is due to the fact
that when the user is given a long summary that shows the document structure and the
important contents of the document, it becomes easier to determine the relevancy of
the corresponding page. Thus we can conclude that the proposed approach yields an
effective way of displaying query results to users.
6 Conclusions
In this paper, we built a framework for displaying web pages retrieved as a result of a
search query. The system makes use of the document structures and the lexical chains
extracted from the documents. The contents of web pages are summarized according
to the learned model by preserving the sectional layouts of the pages. The
experiments on two query datasets and a corpus of documents compiled from the
results of the queries showed that document structures can be extracted with 76%
References
1. Alam, H., Kumar, A., Nakamura, M., Rahman, A.F.R., Tarnikova, Y., Wilcox, C.:
Structured and Unstructured Document Summarization: Design of a Commercial
Summarizer Using Lexical Chains. In: Proc. of the 7th International Conference on
Document Analysis and Recognition, pp. 1147–1150 (2003)
2. Amini, M.R., Tombros, A., Usunier, N., Lalmas, M.: Learning Based Summarisation of
XML Documents. Journal of Information Retrieval 10(3), 233–255 (2007)
3. Berker, M., Güngör, T.: Using Genetic Algorithms with Lexical Chains for Automatic
Text Summarization. In: Proc. of the 4th International Conference on Agents and Artificial
Intelligence (ICAART), Vilamoura, Portugal, pp. 595–600 (2012)
4. Cobra: Java HTML Renderer & Parser (2010),
http://lobobrowser.org/cobra.jsp
5. Gonzàlez, E., Fuentes, M.: A New Lexical Chain Algorithm Used for Automatic
Summarization. In: Proc. of the 12th International Congress of the Catalan Association of
Artificial Intelligence (CCIA) (2009)
6. Guo, Y., Stylios, G.: An Intelligent Summarisation System Based on Cognitive
Psychology. Information Sciences 174(1-2), 1–36 (2005)
7. Hobson, S.P., Dorr, B.J., Monz, C., Schwartz, R.: Task-based Evaluation of Text
Summarisation Using Relevance Prediction. Information Processing and
Management 43(6), 1482–1499 (2007)
8. Joachims, T.: Advances in Kernel Methods: Support Vector Learning. MIT (1999)
9. Otterbacher, J., Radev, D., Kareem, O.: News to Go: Hierarchical Text Summarisation for
Mobile Devices. In: Proc. of 29th Annual ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 589–596 (2006)
10. Pembe, F.C., Güngör, T.: Structure-Preserving and Query-Biased Document
Summarisation for Web Searching. Online Information Review 33(4) (2009)
11. Princeton University, About WordNet (2010), http://wordnet.princeton.edu
12. Roussinov, D.G., Chen, H.: Information Navigation on the Web by Clustering and
Summarizing Query Results. Information Processing and Management 37 (2001)
13. Szlavik, Z., Tombros, A., Lalmas, M.: Investigating the Use of Summarisation for
Interactive XML Retrieval. In: Proc. of ACM Symposium on Applied Computing (2006)
14. Xue, X.-B., Zhou, Z.-H.: Improving Web Search Using Image Snippets. ACM
Transactions on Internet Technology 8(4) (2008)
15. Yang, C.C., Wang, F.L.: Hierarchical Summarization of Large Documents. Journal of
American Society for Information Science and Technology 59(6), 887–902 (2008)
16. Yeh, J.Y., Ke, H.R., Yang, W.P., Meng, I.H.: Text Summarisation Using a Trainable
Summariser and Latent Semantic Analysis. Information Processing and
Management 41(1), 75–95 (2005)
A New Pixel-Based Quality Measure
for Segmentation Algorithms Integrating
Precision, Recall and Specificity
1 Introduction
There is an increasing need for automated processing of the massive amounts of
visual information generated by video surveillance systems. The goal here is to
verify that the automation of a number of image and video analysis tasks is
reliable, in order to reduce human supervision and to assist decision-making.
Quantitative performance evaluation methods should make it possible to compare
the results provided by different segmentation algorithms. The most commonly
used methods in the literature attempt to find a compromise between Precision,
Recall (also called Sensitivity) and Specificity. But the most widely used
representations consider only two of the three indicators: the Precision/Recall
space and ROC curves (which represent only Sensitivity = Recall and Specificity).
In the same way, the F-measure is a single value quality measure based on
Recall and Precision only, that completely ignores the Specificity of segmentation
algorithms.
In the context of segmentation quality evaluation for video surveillance, we
observed situations where these measures disagree, in the sense that they select
very different parameter values. In such cases, the question we tried to answer is:
which measure behaves better than the others, and why? We compared the optimal
parameters selected by several of these measures and propose a segmentation
quality measure that seeks a compromise between the three indicators above.
We show, in a case study, that our measure gives more satisfactory results than
the most commonly used measures (F-measure, Jaccard coefficient (JC), percentage
of correct classification (PCC)).
The outline of the paper is as follows: Section 2 presents related work in the
field of performance evaluation of pixel-based segmentation algorithms. Section 3
explains the evaluation methodology and introduces a new measure which is shown
to yield better parameters for segmentation algorithms more often than the other
measures. Experimental results and conclusions are presented in Sections 4 and 5,
respectively.
2 Related Work
the foreground, but one of them can perform worse than the other if the
background is not well detected. In this case, the two algorithms will differ in
terms of Specificity. After noting that the TPR dimension used in ROC curves
is equal to the Recall dimension of Precision/Recall graphs, we propose to build
a quality measure that takes into account the three dimensions involved in the
previous two approaches. The 3D generalization of ROC curves had already
been proposed by [7]. They use a soft decision threshold parameter, which helps
to take the final binary decision, as a third dimension. This is different from our
approach; moreover, it does not take into account the true negative rate and therefore,
in our context of video surveillance applications, it has the same drawbacks as
standard ROC analysis.
3 Evaluation Measure
Our goal is to find the best way to measure the quality of a segmentation algo-
rithm having one or more parameters and to find the best values for these pa-
rameters. Most of the authors compare the results of their algorithms to a ground
truth which is considered as the ideal segmentation. This ground truth can be
obtained from manual segmentations done by human users, but this can be very
time-consuming, especially for large video sequences. Alternatively, we can use
synthetic sequences where the ground truth is known, since the moving objects
are generated by a computer algorithm. This is the method we have used in this
paper. The results of any segmentation algorithm vary as a function of the values
of its parameters. The best parameter values minimize a distance or maximize
the similarity between the segmentation and the ground truth. Several criteria for
evaluating this proximity (F-measure, JC, PCC) behave differently, leading to
different choices of the best values of the algorithm's parameters.
We consider the segmentation of images divided into two classes: foreground
and background. For a given image in a video sequence, we compare the results
of a binary segmentation S with the binary image of the ground truth T . A pixel
is represented in white if it is part of a moving object (foreground), and black
when it belongs to the background. A white pixel in S is called a positive. If it
is also white in T , then it is a true positive (TP), whereas if it is black in T ,
it is a false positive (FP). Symmetrically, a black pixel in S is a negative. If it
is also black in T , it is a true negative (TN), while if it is white in T , it is a
false negative (FN). We can then define the Precision (PR), Recall (RE) and
Specificity (SP) for each image:
PR = TP / (TP + FP)    (1)
RE = TP / (TP + FN)    (2)
SP = TN / (TN + FP)    (3)
A perfect segmentation algorithm calculates an image S identical to the ground
truth T . Such an algorithm will give values of Precision, Recall and Specificity
equal to 1. The F-measure is the harmonic mean of Precision and Recall:
F = 2 · PR · RE / (PR + RE)    (4)
PCC = (TP + TN) / (TP + FN + FP + TN)    (5)
JC = TP / (TP + FP + FN)    (6)
We propose to measure the quality of a segmentation as a Euclidean distance,
called Dprs, in the space of the indicators, between the point (PR, RE, SP) and
the ideal point (1, 1, 1):
Dprs = √((1 − PR)² + (1 − RE)² + (1 − SP)²)    (7)
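To make these definitions concrete, the following minimal Python/NumPy sketch (our own illustration, not the authors' code) computes PR, RE, SP, the F-measure, PCC, JC and Dprs from a binary segmentation S and a binary ground truth T; it assumes non-degenerate pixel counts so that no denominator is zero.

    import numpy as np

    def quality_measures(S, T):
        """Pixel-based quality measures for a binary segmentation S against a
        binary ground truth T (True = white/foreground, False = black/background)."""
        S, T = S.astype(bool), T.astype(bool)
        TP = np.sum(S & T)       # white in both S and T
        FP = np.sum(S & ~T)      # white in S, black in T
        TN = np.sum(~S & ~T)     # black in both
        FN = np.sum(~S & T)      # black in S, white in T
        PR = TP / (TP + FP)                           # Precision, Eq. (1)
        RE = TP / (TP + FN)                           # Recall, Eq. (2)
        SP = TN / (TN + FP)                           # Specificity, Eq. (3)
        F = 2 * PR * RE / (PR + RE)                   # harmonic mean, Eq. (4)
        PCC = (TP + TN) / (TP + FN + FP + TN)         # Eq. (5)
        JC = TP / (TP + FP + FN)                      # Eq. (6)
        Dprs = np.sqrt((1 - PR)**2 + (1 - RE)**2 + (1 - SP)**2)   # Eq. (7)
        return {"PR": PR, "RE": RE, "SP": SP, "F": F,
                "PCC": PCC, "JC": JC, "Dprs": Dprs}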
4 Experimental Results
In order to evaluate our segmentation quality measure, we segment several videos
of the synthetic Visage database [12]. We apply several morphological operations
in order to improve the segmentation results and compare the behavior of several
quality measures.
In order to segment moving objects we use the hierarchical background modeling
technique introduced by [8]. It is considered one of the best background modeling
algorithms in the comparative study proposed by [9]. This method uses coarse-level
contrast descriptors introduced by [11], whose evolution is represented by mixtures
of Gaussians. This coarse-level representation is then combined with the classical
pixel-based mixture of Gaussians [10]. We can thus identify the foreground objects
at the coarse level, then detail the shapes of foreground objects at the pixel level.
Figure 1 presents some results of background subtraction algorithms.
Background modeling algorithms used without post-processing very often produce
isolated pixels in the background, which are wrongly considered as foreground
objects. Conversely, holes inside detected objects are classified as background.
The method most commonly used to overcome these drawbacks is to apply
mathematical morphology (erosion, dilation, opening, closing, etc.).
In our experiments we vary the number of erosions (p) and dilations (q) applied
as post-processing of the segmentation. We search the parameter sets giving
the best possible values for each of the segmentation quality measures. In many
cases, the optimal parameters are significantly different from one measurement
to another, and do not coincide with the optimal (subjective) settings provided
by a human user.
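A minimal sketch of this parameter search (our own illustration): it reuses the quality_measures helper sketched above, applies OpenCV erosions and dilations, and assumes a 3 × 3 structuring element, which the paper does not specify.

    import cv2
    import numpy as np

    def best_postprocessing(segmentations, truths, max_iter=10):
        """Grid-search the number of dilations q and erosions p applied after
        background subtraction, keeping the pair that minimises the mean Dprs
        over a sequence of (segmentation, ground-truth) image pairs."""
        kernel = np.ones((3, 3), np.uint8)       # structuring element (an assumption)
        best_pair, best_score = None, np.inf
        for q in range(max_iter + 1):            # dilations
            for p in range(max_iter + 1):        # erosions
                scores = []
                for S, T in zip(segmentations, truths):
                    S2 = cv2.dilate(S, kernel, iterations=q) if q else S.copy()
                    S2 = cv2.erode(S2, kernel, iterations=p) if p else S2
                    scores.append(quality_measures(S2 > 0, T > 0)["Dprs"])
                if np.mean(scores) < best_score:
                    best_pair, best_score = (q, p), float(np.mean(scores))
        return best_pair, best_score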
We calculate the following measures: F-measure, JC, PCC and Dprs for all
images in a sequence, and for each set of parameters. We evaluate the mathematical
morphology (dilations followed by erosions) between (0, 0) and (10, 10)
(a) Original image (b) Ground truth image (c) Block-based [11]
In order to validate this evaluation, we referred to what a human user considers
the best result, by means of a voting system. A website has been developed that
presents the four segmented images obtained with the best parameters chosen by
the four compared measures (PCC, F, JC and Dprs); the frames for which the
four measures agree are discarded. In this first series of experiments, the new
Table 1. Comparison of different measures (sunny weather). PCC and Dprs give
optimal values for the (4, 4) pair, whereas F-measure and JC give optimal values
for the (10, 10) pair.
                            Standard measures                              New measure
Dilation  Erosion   TPR      FPR      Recall   Precision  F-measure   PCC     JC      Dprs
   0        0      0.3614   0.00042   0.3614    0.6836     0.3003    0.9967  0.2946  0.9554
   1        1      0.3818   0.00073   0.3818    0.6738     0.4023    0.9970  0.3881  0.9451
   2        2      0.4057   0.00121   0.4057    0.66635    0.4765    0.9971  0.4007  0.9320
   3        3      0.4806   0.00124   0.4806    0.6522     0.5064    0.9972  0.4233  0.8684
   4        4      0.5176   0.00145   0.5176    0.6497     0.5267    0.9972  0.4684  0.8341
   5        5      0.5360   0.00206   0.5360    0.6278     0.5561    0.9971  0.4899  0.8382
   6        6      0.5587   0.00242   0.5587    0.5979     0.5668    0.9969  0.5028  0.8458
   7        7      0.5703   0.00261   0.5703    0.5858     0.5722    0.9965  0.5089  0.8465
   8        8      0.5891   0.00313   0.5891    0.5723     0.5754    0.9961  0.5132  0.8417
   9        9      0.5911   0.00362   0.5911    0.5709     0.5879    0.9957  0.5161  0.8416
  10       10      0.6018   0.00413   0.6018    0.5612     0.5980    0.9952  0.5165  0.8411
measure appears to give results that are the closest to the human evaluation:
Dprs collects 37% of votes, PCC receives 29%, and F-measure and JC have a
comparable number of votes with only 17%.
(a) Original (b) Truth (c) Dprs (d) PCC (e) F (f) JC
5 Conclusions
In this paper we have presented a new measure (Dprs) for the quantitative evaluation
of segmentation algorithms in complex video surveillance environments. It
is an extension of the traditional Precision-Recall methodology and represents
a compromise between three indicators: Recall, Precision and Specificity. The
third dimension, Specificity, takes into account the correct detection of the
background. This measure was compared with other performance measures (F-measure,
PCC and JC). It is important to note that each criterion may lead to a different
optimal set of parameters. Experiments show that our measure
References
1. Nascimento, J., Marques, J.: Performance evaluation of object detection algorithms
for video surveillance. IEEE Transactions on Multimedia 8, 761–777 (2006)
2. Rosin, P.L., Ioannidis, E.: Evaluation of Global Image Thresholding for Change
Detection. Pattern Recognition Letters 24, 2345–2356 (2003)
3. Elhabian, S., El-Sayed, K.: Moving Object Detection in Spatial Domain using Back-
ground Removal Techniques - State-of-Art. Recent Patents on Computer Science 1,
32–54 (2008)
4. Gao, X., Boult, T.E., Coetzee, F., Ramesh, V.: Error analysis of background adap-
tation. In: Computer Vision and Pattern Recognition, vol. 1, pp. 503–510 (2000)
5. Davis, J., Burnside, E., Dutra, I., Page, D., Ramakrishnan, R., Santos Costa, V.,
Shavlik, J.: View learning for statistical relational learning: With an application
to mammography. In: Proceedings of the 19th International Joint Conference on
Artificial Intelligence, pp. 677–683 (2005)
6. Davis, J., Goadrich, M.: The Relationship between Precision-Recall and ROC
Curves. In: International Conference on Machine Learning, pp. 233–240 (2006)
7. Wang, S., Chang, C.I., Yang, S.C., Hsu, G.C., Hsu, H.H., Chung, P.C., Guo,
S.M., Lee, S.K.: 3D ROC Analysis for Medical Imaging Diagnosis. Engineering
in Medicine and Biology Society, 7545–7548 (2005)
8. Chen, Y.T., Chen, C.S., Huang, C.R., Hung, Y.P.: Efficient hierarchical method
for background subtraction. Pattern Recognition, 2706–2715 (2007)
9. Dhome, Y., Tronson, N., Vacavant, A.: A Benchmark for Background Subtraction
Algorithms in Monocular Vision: A Comparative Study. Image Processing Theory
Tools and Applications (IPTA), 7–10 (2010)
10. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time
tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp.
246–252 (1999)
11. Huang, C.R., Chen, C.S., Chung, P.C.: Contrast context histogram: A discriminating
local descriptor for image matching. In: Proceedings of IEEE International
Conference on Pattern Recognition, vol. 4, pp. 53–56 (2006)
12. Chateau, T.: Visage Challenge: Video-surveillance Intelligente: Systemes et AlGo-
rithmES (2012), http://visage.univ-bpclermont.fr/?q=node/2
A Novel Border Identification Algorithm
Based on an “Anti-Bayesian” Paradigm
1 Introduction
The objective of a Prototype Reduction Scheme (PRS) is to reduce the cardinality of the training set to be as
small as possible by selecting some training patterns based on various criteria,
as long as the reduction does not significantly affect the performance. Thus,
instead of considering all the training patterns for the classification, a subset of
the whole set is selected based on certain criteria. The learning (or training) is
then performed on this reduced training set, which is also called the “Reference”
set. This Reference set not only contains the patterns which are closer to the
true discriminant’s boundary, but also the patterns from the other regions of the
space that can adequately represent the entire training set.
The authors are grateful for the partial support provided by NSERC, the Natural
Sciences and Engineering Research Council of Canada.
Chancellor's Professor; Fellow of the IEEE and Fellow of the IAPR. This author is also an
Adjunct Professor with the University of Agder in Grimstad, Norway.
Border Identification (BI) algorithms, which are a subset of PRSs, work with
a Reference set which only contains “border” points. Specializing this criterion,
the current-day BI algorithms, designed by Duch [1], Foody [2,3], and Li et al.
[4], attempt to select a Reference set which contains border patterns derived, in
turn, from the set of training patterns. Observe that, in effect, these algorithms
also yield reduced training sets. Once the Reference set is obtained, all of these
traditional methods perform the classification by invoking some sort of classifier
like the SVM, neural networks etc. As opposed to the latter, we are interested in
determining border patterns which, in some sense, are neither closer to the true
optimal classifier nor to the means, and which can thus better classify the entire
training set. Contrary to a Bayesian intuition, these border patterns have the
ability to accurately classify the testing patterns, as we shall presently demon-
strate. Our method is a combination of NN computations and (Mahalanobis)
multi-dimensional¹ distance computations which yield the border points that
are subsequently used for the purpose of classification. The characterizing com-
ponent of our algorithm, referred to as ABBI, is that classification can be done
by processing the obtained border points by themselves without invoking, for
example, a subsequent SVM phase.
How then can one determine the border points themselves? This, indeed,
depends on the model of computation - for example, whether we are working
within the parametric or non-parametric model. The current paper deals with
the former model, where the information about the classes is crystallized in
the class-conditional distributions and their respective parameters, where the
training samples are used to estimate the parameters of these models. In this
paper, we have shown how the border points can be obtained by utilizing the
information gleaned from the estimated distributions. Observe that with regard
to classification and testing, all of these computations can be considered to be of
a “pre-processing” nature, and so the final scheme would merely be of a Nearest
Neighbor (NN) sort. The details of how this is achieved are described in the paper.
The Formal Algorithm. The problem of determining the border points for
the parametric model of computation can be solved for fairly complex scenarios.
When one examines the existing BI schemes, one observes that the information
that has been utilized to procure the border patterns is primarily (and indeed,
essentially) distance-based. In other words, the distances between the patterns
are evaluated independently, and the border patterns are obtained based on
such distances. The patterns obtained in this manner are considered as the new
training set, which reduces these BI schemes to special types of PRSs, but with
the border patterns being the Reference set. However, as these border patterns
¹ We also have some initial results in which the distance and optimizations are done
using lower-dimensional projections, the results of which are subsequently fused using
an appropriate fusion technique.
are only the “Near” ones, they do not possess sufficient information to train an
efficient classifier. We shall now rectify this.
We now mention a second major handicap associated with the traditional BI
schemes. Once they have computed the border points associated with the specific
classes, the traditional schemes operate by determining a “classifier” based on
the new set. In other words, they have to determine a classifying boundary (linear,
quadratic or SVM-based) to achieve this classification. As the reader will observe,
in our work, we attempt to circumvent this entire phase. Indeed, in our proposed
strategy, we merely achieve the classification using the final small subset of
border points – which entails a significant reduction in computation.
The reader should also observe that this final decision would involve NN-like
computations with a few points. The intriguing feature of these few points is that
they lie close to the boundary and not to the mean, implying an “anti-Bayesian”
philosophy [5,6,7].
In order to obtain the border patterns of the distributions ω1 and ω2 in an
“anti-Bayesian” manner, we make use of the axiom that patterns whose nearest
neighbors come from other classes as well as from their own class fall into a
common region, which is, by definition, the overlapped region.
The proposed algorithm has 4 parameters, namely, J, J1 , J2 and K. First
of all, J denotes the number of border points that have to be selected from
each class. We understand that in the process of selecting the border points,
the training set must be “examined” so as to ignore the patterns which are not
relevant for the classification. As this decision is taken based on the border points
in and of themselves, we conclude that the patterns which are in the overlapping
region are not able to provide an accurate decision, and so these points have to
be ignored. Thus, for any X, those patterns with J2 or more NNs out of the J1
NNs, which are not from the same class as X, are ignored.
To be more specific, in order to eliminate the overlapping points, we first
determine J1 -NNs of every pattern X. If J2 or more of these NN patterns are
from the same class, this pattern X is added to the new training set. Once this
step is achieved, we are left with the training points which are not overlapping
with any other classes. Thereafter, we evaluate the (Mahalanobis) distance² (MD)
of every pattern of the new training set with respect to the mean of both the
classes. Both of these phases distinguish our particular strategy. The patterns
which are almost equidistant from both the classes, and which are not determined
to be overlapping with respect to the other classes, are added to the Border set.
The process of determining the (Mahalanobis) distances with respect to both
the classes, is repeated for all the patterns of the new training set, and a decision
is made for each pattern based on the difference between these distances.
The two-dimensional view of this philosophy is depicted in Figures 1a - 1c. The
border patterns obtained by applying this method are also given in the figure,
where the border patterns of class ω1 are specified by rectangles, and those of
class ω2 are specified by circles. We now make the following observations:
² Any well-defined norm, appropriate for the data distribution, can be used to quantify
this distance.
Fig. 1. Border patterns obtained by the proposed method (border patterns of class ω1
shown as rectangles, those of class ω2 as circles): (a) almost separable classes,
(b) semi-overlapped classes, (c) overlapped classes.
1. If we examine Figure 1a, we can see that the border patterns that are speci-
fied by rectangles and circles are precisely those that lie at the true borders
of the classes.
2. However, if the classes are semi-overlapped, then the “more interior” symmetric
percentiles, such as the (2/3, 1/3) percentiles, can perform a near-optimal
classification. This can be seen in Figure 1b. The patterns in this figure have more
overlap (BD = 1.69), and the border points chosen are the ones which
lie just outside the overlapping region.
3. The same argument is valid for Figure 1c. In the OS-based classification, we
have seen that if the classes have a large overlap as in Figure 1c (in this case,
MD = 0.78), the border patterns again lie outside the overlapped region.
The algorithm for obtaining the border patterns, ABBI, is formally given in
Algorithm 1.
Contrary to the traditional BI algorithms, ABBI requires only a small number
of border patterns for the classification. For example, consider the Breast Cancer
data set which contains 699 patterns. A traditional BI algorithm will obtain a
border set of around 150 patterns for this data set. Furthermore, once these
methods have obtained the border points, they will have to generate a classifier
for the new reduced set to achieve the classification. As opposed to this, our
method requires only 20 border patterns, and the classification is based on the
five NN border patterns of the testing pattern.
3 Experimental Results
The proposed method ABBI has been tested on various data sets that include
artificial and real-life data sets obtained from the UCI repository [8]. ABBI has
also been compared with other well-known methods which include the NB, SVM,
and the kNN. In order to obtain the results, ABBI algorithm was executed 50
times with the 10-fold cross validation scheme.
Algorithm 1. ABBI(ω1, ω2)
Input: The training patterns of the classes ω1 and ω2, and the parameters J, J1, J2 and K.
Notation: NTR1 and NTR2 are the new training sets which do not contain points in the overlapped region.
Output: The border set BI and the class labels of the testing points.
Method:
1: NTR1 ← ∅
2: NTR2 ← ∅
3: Divide points of ω1 into training and testing sets, TRP1 and T1 respectively
4: Divide points of ω2 into training and testing sets, TRP2 and T2 respectively
5: for all X ∈ TRP1 do
6:   Compute the J1 NNs of X
7:   If J2 or more NNs are from class ω1, NTR1 ← NTR1 ∪ X
8: end for
9: for all X ∈ TRP2 do
10:   Compute the J1 NNs of X
11:   If J2 or more NNs are from class ω2, NTR2 ← NTR2 ∪ X
12: end for
13: for all X ∈ NTR1 do
14:   Compute Dist(X, M1)
15:   Compute Dist(X, M2)
16: end for
17: for all X ∈ NTR2 do
18:   Compute Dist(X, M1)
19:   Compute Dist(X, M2)
20: end for
21: for all X ∈ NTR1 do
22:   Compute DistDiff(X) = |Dist(X, M1) − Dist(X, M2)|
23: end for
24: for all X ∈ NTR2 do
25:   Compute DistDiff(X) = |Dist(X, M1) − Dist(X, M2)|
26: end for
27: Add the J points with minimum DistDiff from NTR1 and NTR2 to BI
28: Classify testing points using a K-NN rule based on the points in BI
End Algorithm
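The listing above can be read as the following NumPy/SciPy sketch (our own illustration with hypothetical helper names, not the authors' code); it assumes two labelled training matrices and uses the Mahalanobis distance to the two class means, as described in the text.

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    def abbi_border_set(X1, X2, J=10, J1=7, J2=5):
        """Sketch of ABBI: select J 'anti-Bayesian' border points per class.
        X1, X2 are (n_i, d) arrays of training patterns of classes w1 and w2."""
        X = np.vstack([X1, X2])
        y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])

        def non_overlapping(Xc, label):
            keep = []
            for x in Xc:
                d = np.linalg.norm(X - x, axis=1)
                nn = y[np.argsort(d)[1:J1 + 1]]      # J1 nearest neighbours, self excluded
                if np.sum(nn == label) >= J2:        # mostly same-class NNs -> not overlapping
                    keep.append(x)
            return np.array(keep)

        NTR1, NTR2 = non_overlapping(X1, 0.0), non_overlapping(X2, 1.0)

        # Mahalanobis distances of every retained pattern to both class means
        M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
        VI1, VI2 = np.linalg.inv(np.cov(X1.T)), np.linalg.inv(np.cov(X2.T))

        def dist_diff(Xc):
            return np.array([abs(mahalanobis(x, M1, VI1) - mahalanobis(x, M2, VI2))
                             for x in Xc])           # near-equidistant points are border points

        B1 = NTR1[np.argsort(dist_diff(NTR1))[:J]]
        B2 = NTR2[np.argsort(dist_diff(NTR2))[:J]]
        return B1, B2

    def abbi_classify(x, B1, B2, K=5):
        """Classify a test pattern by a K-NN vote over the border set alone."""
        B = np.vstack([B1, B2])
        labels = np.array([0] * len(B1) + [1] * len(B2))
        nearest = np.argsort(np.linalg.norm(B - x, axis=1))[:K]
        return int(round(labels[nearest].mean()))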
Artificial Data Sets: For a prima facie testing of artificial data, we generated
two classes that obeyed Gaussian distributions. To do this, we made use of a Uni-
form [0, 1] random variable generator to generate data values that follow a Gaus-
sian distribution. The expression z = √(−2 ln u1) · cos(2π u2) is known to yield
data values that follow N(0, 1) [9]. Thereafter, by using the technique described
in [10], one can generate Gaussian random vectors which possess any arbitrary
mean and covariance matrix. In our experiments, since this is just for a prima
facie case, we opted to perform experiments for two-dimensional and three-
dimensional data sets. The respective means of the classes were [μ11 , μ12 ]T and
[μ21 , μ22 ]T for the two-dimensional data, and [μ11 , μ12 , μ13 ]T and [μ21 , μ22 , μ23 ]T
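The generation step can be sketched as follows (our own illustration): the Box–Muller transform turns uniform variates into N(0, 1) samples, and a Cholesky factor of the desired covariance matrix, one standard realisation of the technique in [10], turns them into correlated Gaussian vectors. The means and covariance shown are placeholders, not the values used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def box_muller(n):
        """z = sqrt(-2 ln u1) cos(2 pi u2) yields N(0, 1) samples."""
        u1 = 1.0 - rng.random(n)          # uniform on (0, 1], avoids log(0)
        u2 = rng.random(n)
        return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

    def gaussian_vectors(n, mean, cov):
        """Draw n d-dimensional Gaussian vectors with the given mean and covariance."""
        mean, cov = np.asarray(mean), np.asarray(cov)
        Z = box_muller(n * len(mean)).reshape(n, len(mean))   # i.i.d. standard normals
        L = np.linalg.cholesky(cov)                           # cov = L @ L.T
        return Z @ L.T + mean

    # Placeholder example: two 2-D classes sharing the same covariance matrix
    X1 = gaussian_vectors(500, mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]])
    X2 = gaussian_vectors(500, mean=[2.0, 2.0], cov=[[1.0, 0.3], [0.3, 1.0]])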
By examining Tables 1 and 2, one can see that ABBI can achieve remarkable
classification when compared to that attained by the benchmark classifiers. For
³ These values can be included if requested by the Referees.
example, if we consider the case where the classes are separated by a BD of 1.66
in Table 1, ABBI can achieve a classification accuracy of 95.38%, while the 3NN
achieves 97.25%. This is quite fascinating when we consider the fact that ABBI
performs the classification based on only 5 samples out of the 10 samples selected
from each class, whereas the classification of the NN involves the entire training set.
Real-life Data Sets: The data sets [8] used in this study have two classes, and
the number of attributes varies from four up to thirty two. The data sets are
described in Table 3.
Experimental Results: Real-life Data Sets. The results obtained for the
ABBI algorithm are tabulated in Table 4.
From the table of results, one can see that the proposed algorithm achieves
a comparable classification when compared to the other traditional classifiers,
which is particularly impressive because only a very few samples are involved in
the process. For example, for the WOBC data set, we can see that the new ap-
proach yielded an accuracy of 95.80% which should be compared to the accuracies
of the SVM (95.99%), NB (96.40%) and the kNN (96.60%). Similarly, for the Iris
data set, ABBI can achieve an accuracy of 94.53%, which is again comparable
to the performance of SVM (96.67%), NB (96.00%), and NN (95.13%).
4 Conclusions
The objective of BI algorithms is to reduce the number of training vectors by se-
lecting the patterns that are close to the class boundaries. However, the patterns
that are on the exact border of the classes (“near” borders) are not sufficient
to perform a classification which is comparable to that obtained based on the
centrally located patterns. In order to resolve this issue, researchers have tried to
add more patterns (“far” borders) to the “border” set so as to boost the quality
of the resultant border set. Thus, the cardinality of the resultant border set can
be relatively high. After obtaining such a large border set, a classifier has to be
generated for this set, to perform a classification.
In this paper, we have proposed a novel BI algorithm which involves the border
patterns selected with respect to a new definition of the term “border”. In line
with the newly proposed OS-based anti-Bayesian classifiers [5,6,7], we created
the “border” set by selecting those patterns which are close to the true border
of the alternate class. The classification is achieved with regard to these border
patterns alone, and the size of this set is very small, in some cases, as small as
five from each class. The resultant accuracy is comparable to that attained by
other well-established classifiers. The superiority of this method over other BI
schemes is that it yields a relatively small border set, and as the classification is
based on the border patterns themselves, it is computationally inexpensive.
References
1. Duch, W.: Similarity Based Methods: A General Framework for Classification,
Approximation and Association. Control and Cybernetics 29(4), 937–968 (2000)
2. Foody, G.M.: Issues in Training Set Selection and Refinement for Classification by
a Feedforward Neural Network. In: Proceedings of IEEE International Geoscience
and Remote Sensing Symposium, pp. 409–411 (1998)
3. Foody, G.M.: The Significance of Border Training Patterns in Classification by
a Feedforward Neural Network using Back Propagation Learning. International
Journal of Remote Sensing 20(18), 3549–3562 (1999)
4. Li, G., Japkowicz, N., Stocki, T.J., Ungar, R.K.: Full Border Identification for
Reduction of Training Sets. In: Bergler, S. (ed.) Canadian AI 2008. LNCS (LNAI),
vol. 5032, pp. 203–215. Springer, Heidelberg (2008)
5. Oommen, B.J., Thomas, A.: Optimal Order Statistics-based “Anti-Bayesian” Para-
metric Pattern Classification for the Exponential Family. Pattern Recognition (ac-
cepted for publication, 2013)
6. Thomas, A., Oommen, B.J.: Order Statistics-based Parametric Classification for
Multi-dimensional Distributions (submitted for publication, 2013)
7. Thomas, A., Oommen, B.J.: The Fundamental Theory of Optimal “Anti-Bayesian”
Parametric Pattern Classification Using Order Statistics Criteria. Pattern Recog-
nition 46, 376–388 (2013)
8. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010),
http://archive.ics.uci.edu/ml (April 18, 2013)
9. Devroye, L.: Non-Uniform Random Variate Generation. Springer, New York (1986)
10. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press, San Diego (1990)
Assessing the Effect of Crossing Databases
on Global and Local Approaches
for Face Gender Classification
1 Introduction
Automated face analysis has been extensively studied over the past decades.
Specifically, gender classification has attracted the interest of researchers for its
useful applications in many areas, such as, commercial profiling, surveillance and
human-computer interaction.
Contrary to what might be thought, gender classification should not simply be
considered as a two-class version of the face recognition problem. While face
recognition searches for characteristics that make a face unique, gender classification
techniques look for common features shared among a group of faces (female or
male) [1]. Hence, face recognition solutions are not always suitable for solving
gender classification problems.
Although some researchers employ local descriptions for classifying gender
[2,3], most of the published works on face gender classification use global infor-
mation provided by the whole face [4]. Intuitively, holistic solutions seem to be
more likely to achieve higher classification rates, since global characterisations
provide configural information (i.e. relations among face parts) as well as featu-
ral (i.e. characteristics of the face parts), whereas local descriptors only provide
2 Methodology
This study compares the performance in solving gender classification problems
of two different approaches (global and local), two types of features (grey levels
and PCA) and three different classifiers (1-NN, PCA+LDA and SVM). The
methodology followed for performing the experiments has three steps:
Image preprocessing. First of all, the face in the image is detected by the
Viola and Jones algorithm [5] implemented in the OpenCV library [6]. Next,
the area containing the face is equalized and resized. The interpolation pro-
cess required for resizing the image uses a three-lobed Lanczos windowed
sinc function [7] which keeps the original aspect ratio of the image. It should
be noted that no techniques for aligning faces are applied, so in the end
unaligned face images are classified (a minimal sketch of this step is given after this list of steps).
Feature extraction. Given a preprocessed face image, a global or local ap-
proach is followed, as described in Section 2.1, for characterising the face
with grey levels or PCA feature vectors, as explained in Section 2.2.
Classification. A trained classifier predicts the gender of a test face using pre-
viously extracted features. The classifiers used are detailed in Section 2.3.
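A minimal OpenCV sketch of the preprocessing step referred to above (our own illustration): Viola–Jones detection with the frontal-face Haar cascade that ships with OpenCV, histogram equalisation and resizing. cv2.INTER_LANCZOS4 stands in for the three-lobed Lanczos filter of [7], and the fixed output size (which ignores the original aspect ratio) is a simplifying assumption.

    import cv2

    # Viola-Jones detector with the frontal-face Haar cascade bundled with OpenCV
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def preprocess_face(image_bgr, size=(64, 64)):
        """Detect the largest face, equalise it and resize it; None if no face found."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
        face = cv2.equalizeHist(gray[y:y + h, x:x + w])
        return cv2.resize(face, size, interpolation=cv2.INTER_LANCZOS4)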
2.2 Features
Two types of features are used in the experiments: grey levels and Principal
Component Analysis.
– Grey Levels
In the global case, the feature vector simply consists of the grey level values
of the pixels within the area of the image containing the face. In the local
case, one feature vector is formed by the grey level values of the pixels within
each patch.
– Principal Component Analysis (PCA)
In the global case, a PCA basis is calculated from the grey level value vectors
of all the training images. Then, this transformation is applied to all the
vectors extracted from the face images of both sets, training and test. In
the local case, a PCA basis is calculated over the features extracted from
each one of the neighbourhoods in the training images. Afterwards, the grey
level value vector of each patch is transformed using the PCA transformation
associated with its corresponding neighbourhood.
In our experiments, the PCA transformation applied retains those eigen-
vectors accounting for 95% of the variance of the data.
2.3 Classifiers
Three classifiers are tested in the experiments: 1-NN, PCA+LDA and SVM. All
of them are well-known classification methods which have been extensively used
in automated facial analysis.
– 1-NN
In the global case, the classic 1-NN is used. In the local case, a 1-NN classifier
per patch neighbourhood is defined, and each of these local classifiers
provides a gender estimation for the corresponding patch of a given test face
image. Finally, the predicted gender is obtained by majority voting over the
local predictions (a sketch of this patch-wise voting scheme is given after this
classifier list). In both approaches, the metric used is the squared Euclidean distance.
– PCA+LDA
Linear Discriminant Analysis (LDA) searches for a linear combination of the
features that best discriminates between classes. In the face analysis field,
this classifier is most commonly applied over a transformed space, usually
Principal Components Analysis (PCA) [3].
In the global case, the standard PCA+LDA is used. In the local case,
a local PCA+LDA classifier per patch’s neighbourhood is defined which is
trained using only the patches that belong to the corresponding neighbour-
hood. The final predicted class label is obtained by majority voting of the
local decisions.
– SVM
Support Vector Machine (SVM) is a recognised classifier for its good results
in automated face analysis tasks. It is also known that it requires a large
amount of time for training purposes. We conducted an experimental study
which concluded that the use of local SVMs was not computationally af-
fordable. Therefore, SVM only follows a global approach in the experiments.
The SVM implementation with a third degree polynomial kernel provided
with LIBSVM 3.0 is employed.
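As a sketch of the local scheme referred to in the 1-NN item above (our own illustration using scikit-learn): every patch neighbourhood gets its own 1-NN classifier over grey-level vectors, and the face-level gender is the majority vote of the patch-level predictions. The non-overlapping grid of patches is an assumption; the paper's exact patch layout is not reproduced here.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def extract_patches(face, patch=16):
        """Split a preprocessed face into non-overlapping patches, one grey-level
        feature vector per patch neighbourhood."""
        h, w = face.shape
        return [face[r:r + patch, c:c + patch].ravel()
                for r in range(0, h - patch + 1, patch)
                for c in range(0, w - patch + 1, patch)]

    def train_local_classifiers(train_faces, genders, patch=16):
        """One 1-NN classifier per patch neighbourhood (the squared Euclidean
        distance is monotone in the Euclidean metric used here)."""
        per_patch = zip(*[extract_patches(f, patch) for f in train_faces])
        return [KNeighborsClassifier(n_neighbors=1).fit(np.array(p), genders)
                for p in per_patch]

    def predict_gender(face, classifiers, patch=16):
        """Majority vote over the local patch predictions."""
        votes = [clf.predict([p])[0]
                 for clf, p in zip(classifiers, extract_patches(face, patch))]
        return max(set(votes), key=votes.count)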
3 Experimental Setup
A number of experiments have been designed to assess how robust global
and local classification models are when training and test faces are acquired under
different conditions. In order to compare the performance of both approaches,
both types of features and the three classifiers previously detailed, an experiment
was performed involving each of the possible combinations of those three factors
(from now on each combination is referred to as a classification model). These classi-
fication models are tested with non-occluded frontal faces from three databases:
sampling the database it can be said that all the individuals are young Cau-
casian adults. Our experiments use 130 occlusion-free frontal face images
with neutral expressions (74 males and 56 females).
                          Global                                      Local
Training     Test         1-NN    1-NN    PCA+    SVM     SVM      1-NN    1-NN    PCA+
Data Set     Data Set     Grey    PCA     LDA     Grey    PCA      Grey    PCA     LDA
FERET        FERET        85.31   85.57   91.86   93.66   92.83    92.35   91.29   85.07
FERET        PAL          66.03   64.98   71.25   66.72   62.55    66.03   62.19   60.80
FERET        AR Neutral   79.17   82.31   77.69   81.54   84.62    86.15   86.92   83.08
PAL          FERET        66.53   65.56   75.22   72.99   70.66    63.16   62.07   77.11
PAL          PAL          77.42   77.35   82.72   85.23   85.61    83.73   83.52   73.69
PAL          AR Neutral   81.25   82.31   89.23   92.31   91.54    90.00   90.00   87.69
AR Neutral   FERET        76.02   76.86   80.09   80.83   77.21    78.90   78.90   78.20
AR Neutral   PAL          73.35   72.30   71.43   75.09   70.38    74.39   73.17   65.51
AR Neutral   AR Neutral   83.99   82.46   87.54   90.42   98.15    88.92   89.08   86.31
values. Let R+ be the sum of the ranks where the 1st model outperforms the
2nd, and R− be the sum of the ranks not included in R+. In cases where di = 0,
its rank is split evenly between R+ and R−; if di = 0 occurs an odd number
of times, one of those ranks is ignored. Letting Z = min(R+, R−), if Z is less
than or equal to the critical value of the Wilcoxon distribution for n degrees of
freedom, then the null hypothesis stating that both classification models perform
equally is rejected. These statistical tests were conducted using the KEEL data
mining software [14].
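The procedure described above is the standard Wilcoxon signed-rank test; a minimal SciPy sketch comparing two classification models over the nine training/test combinations would look as follows. The accuracy vectors are placeholders, not the values of Table 1, and the "zsplit" option reproduces the rank-splitting of zero differences mentioned in the text.

    import numpy as np
    from scipy.stats import wilcoxon

    # Placeholder per-experiment accuracies of two models (one value per
    # training/test combination); substitute the corresponding columns of Table 1.
    model_a = np.array([91.9, 71.3, 77.7, 75.2, 82.7, 89.2, 80.1, 71.4, 87.5])
    model_b = np.array([85.3, 66.0, 79.2, 66.5, 77.4, 81.3, 76.0, 73.4, 84.0])

    # Two-sided signed-rank test; zero differences have their ranks split evenly
    stat, p_value = wilcoxon(model_a, model_b, zero_method="zsplit")
    print(f"W = {stat:.1f}, p = {p_value:.3f}")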
Two different analyses of the results are presented in this section: the first one
includes all the experiments, whereas the second includes only the cross-database experiments.
Looking at the numerical results of all conducted experiments (shown in Ta-
ble 1), the first impression is that the classification models using a global SVM
or a local classifier obtain higher accuracies than the rest. In order to check
whether these performance differences are statistically relevant or not, we ap-
plied the three statistical tests previously described, whose results are shown
in Figure 1(a). In Figure 1(a), the table on the left-hand side shows the value
of the Iman-Davenport’s statistic (FF ) and the corresponding value of the F-
distribution; the table in the centre shows the results of the Holm’s method with
a 95% confidence level where all models above the double line performed signif-
icantly worse than the most significant model (marked in bold at the bottom of
the table); and the table on the right-hand side shows a summary of Wilcoson’s
test where the symbol “•” indicates that the classification model in the row sig-
nificantly outperforms the model in the column, and the symbol “◦” indicates
that the model in the column significantly surpasses the model in the row (above
the main diagonal with a 90% confidence level, and below it with a 95%).
Iman-Davenport’s statistic finds significant differences among the perform-
ances of all classification models, which is corroborated by the results of the
other two tests. Specifically, Holm’s method results indicate that the models
[Figure content: (a) all experiments — Iman-Davenport statistic FF = 12.18, critical value F(7,35)0.95 = 2.29; Holm's adjusted thresholds from 0.007143 to 0.05 with SVM-grey-G as the control model; pairwise Wilcoxon matrix over the models (1) 1NN-grey-G, (2) 1NN-pca-G, (3) PCALDA-G, (4) SVM-grey-G, (5) SVM-pca-G, (6) 1NN-grey-L, (7) 1NN-pca-L, (8) PCALDA-L. (b) cross-database experiments only — Iman-Davenport statistic FF = 1.53, critical value F(7,35)0.95 = 2.29; Holm's thresholds and the pairwise Wilcoxon matrix over the same eight models.]
Fig. 1. Statistical analyses performed. Holm’s results with a 95% significance level
(models above the double line performed significantly worse than the most significant
model, marked in bold at the bottom). Wilcoxon’s summary above the main diagonal
with a 90% significance level, and below it with a 95% (“•”: model in row outperforms
model in column, “◦”: model in column outperforms model in row).
using global SVMs (both, with grey levels and PCA features), global PCA+LDA
and local 1-NN with grey levels are statistically superior to the rest. In a
pairwise comparison, Wilcoxon’s test reveals that a global SVM model using
grey levels outperforms all classification models except for global PCA+LDA
and global SVM with PCA features. In view of the results of this first analysis,
a straightforward conclusion would be that global methods are more suitable for
dealing with a gender classification problem than local models.
The results of a second analysis of only the cross-database experiments, that
is, omitting three experiments that were carried out using the same database for
training and testing, are shown in Figure 1(b). In this case, Iman-Davenport’s
statistic does not find significant differences among classification models. Holm’s
method only rejects global 1-NN with PCA, indicating that the rest of the models
perform statistically equal. The pairwise comparison provided by Wilcoxon’s test
supports these results, since only a couple of statistical differences are found
where global SVM with grey levels outperforms both global 1-NN models.
After these two statistical studies on the performances of all experiments and
the cross-database experiments, an interesting fact was discovered: differences
among the classification accuracies of the implemented models only exist when
single-database experiments are taken into account. In more realistic scenarios,
where training and testing images do not share the same acquisition conditions
nor the demography of subjects (i.e., simulated with cross-database experiments),
no significant differences are found in the performances of the models.
5 Conclusion
This paper has provided a comprehensive statistical study of how suitable global
and local approaches are for gender classification under realistic conditions.
These circumstances have been simulated by cross-database experiments involv-
ing three face image collections with a wide range of ages and races and different
acquisition conditions. The comparison has included three classifiers using two
different types of features.
The main conclusion drawn from the results is that, when addressing gender
classification problems from neutral non-occluded faces, global and local approaches
achieve statistically equal accuracies. However, if similar acquisition conditions
can be ensured (i.e., conditions similar to the experiments using the same database
for training and testing), global features should be used. As regards the classifiers
and features, when the training and test images share the same characteristics,
a global SVM using grey levels is more likely to obtain the highest classification
accuracies. In other cases, no significant differences were found among the three
classifiers studied nor the two types of features considered.
Acknowledgements. This work has been partially funded by Universitat
Jaume I through grant FPI PREDOC/2009/20 and projects P1-1B2012-22,
and TIN2009-14205-C04-04 from the Spanish Ministerio de Economı́a y
Competitividad.
References
1. Zhao, W., Chellappa, R.: Face Processing: Advanced Modeling and Methods. Aca-
demic Press (2006)
2. Shan, C.: Learning local binary patterns for gender classification on real-world face
images. Pattern Recognition Letters 33(4), 431–437 (2012)
3. Bekios-Calfa, J., Buenaposada, J.M., Baumela, L.: Revisiting linear discriminant
techniques in gender recognition. IEEE PAMI 33(4), 858–864 (2011)
4. Makinen, E., Raisamo, R.: Evaluation of gender classification methods with auto-
matically detected and aligned faces. IEEE PAMI 30(3), 541–547 (2008)
5. Viola, P., Jones, M.: Robust real-time face detection. Int. J. of Computer Vision 57,
137–154 (2004)
6. Bradski, G.R., Kaehler, A.: Learning OpenCV. O’Reilly (2008)
7. Turkowski, K.: Filters for common resampling tasks. In: Graphics Gems I,
pp. 147–165. Academic Press (1990)
8. Phillips, P.J., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for
face-recognition algorithms. IEEE PAMI 22(10), 1090–1104 (2000)
9. Minear, M., Park, D.: A lifespan database of adult facial stimuli. Behavior Research
Methods, Instruments, and Computers 36, 630–633 (2004)
10. Martinez, A., Benavente, R.: The AR face database. Technical report, CVC (1998)
11. Iman, R., Davenport, J.: Approximations of the critical region of the Friedman
statistic. Communications in Statistics 9, 571–595 (1980)
12. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics 6, 65–70 (1979)
13. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bul-
letin 1(6), 80–83 (1945)
14. Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcı́a, S., Sánchez, L.,
Herrera, F.: KEEL. Journal of Multiple-Valued Logic and Soft Computing 17(2-3), 255–287 (2011)
BRDF Estimation for Faces from a Sparse
Dataset Using a Neural Network
1 Introduction
This paper verifies the capture accuracy of our system and presents progress
in obtaining BRDF data from a sparse dataset and using this model to simulate
unseen lighting angles photo-realistically via an ANN. The motivation for this
work is to show that an accurate BRDF can be modelled from the sparse dataset,
and that high speed Near-InfraRed (NIR) capture is suitable for the photometric
reconstruction of faces. We make no claim that the modelled BRDF is state-of-
the-art for skin reflectance modelling (for examples of such work please refer to
[7], [8]), but we are offering a particularly rapid means to acquire sufficient data
for photo-realistic rendering.
Fig. 1. The four dimensions (zenith incident and reflected angles: θi , θr and azimuth
incident and reflected angles: αi , αr ) upon which the observed intensity at V (viewer)
depends. L is the light source vector, and N is normal to the plane of the reflecting
surface. In order to reduce the dimensionality, Δα, the difference between incident and
reflective azimuths is used.
The contributions of this paper are i) to prove the practicality of using NIR
for high-speed 2.5D data capture, in terms of speed and accuracy, compared
with a commercial projected-pattern scanner, and ii) to demonstrate accurate
modelling of the BRDF from only five lighting directions via an ANN, which is
modelling of the BRDF from only five lighting directions via an ANN, which is
used to generate photo-realistic images from novel lighting angles.
2 Capture Device
2.1 Hardware
This section details the acquisition device hardware which is based upon the
Photoface device presented in [9]. The device, shown in Fig. 2, is designed for
practical 3D face geometry capture and recognition. The presence of an individ-
ual approaching the device is detected by an ultrasound proximity sensor placed
before the archway. This can be seen in Fig. 2(6) towards the left-hand side of
the photograph. The sensor triggers a sequence of high speed synchronised frame
grabbing and light source switching operations.
Fig. 2. The NIR geometry capture device (left) and an enlarged image of one of the
LED clusters (right). A camera can be seen on the rear panel, above which is located a
NIR light source for retro-reflective capture (5). Four other light sources are arranged
at evenly spaced angles around the camera (1-4). An ultrasound trigger is located on
the left vertical beam of the archway (6).
The aim is to capture six images at a high frame rate: one control image with
only ambient illumination and five images each illuminated by one of the NIR
light sources in sequence. A captured face is typically 700 × 850 pixels. Note
that the ambient lighting is uncontrolled (for the experiments presented in this
paper, overhead fluorescent lights are present). The five NIR lamps are made
from a cluster of seven high power NIR LEDs arranged in an ‘H’-formation
to minimize the emitting area (as can be seen in the right hand side image
of Fig. 2). The LEDs emit light at ≈850nm. The light sources and camera are
located approximately 1.2m from the head of the subject with four of the light
sources arranged at evenly spaced angles, and one placed as close as possible to
the camera to capture retro-reflection.
It was found experimentally that for people walking through the device, a
minimum frame rate of approximately 150fps was necessary to avoid significant
movement between frames. The device currently operates at 210fps, and it should
be noted that it is only operating for the period required to capture the six
images. That is, the device is left idle until it is triggered. A monitor is included
on the back panel to show the reconstructed face or to display other information.
where ρi is the reflectance albedo. The intensity values (I) and light source (L)
positions are known, and from these the albedo and surface normal components
(n) can be calculated by solving (1) using a linear least-squares method.
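A minimal NumPy sketch of this least-squares step (our own illustration, assuming a Lambertian model): with the five light-source directions stacked in a matrix L, the scaled normal m = ρ·n at every pixel is the least-squares solution of L·m = I; the albedo is then the norm of m and the unit normal its direction.

    import numpy as np

    def photometric_stereo(images, light_dirs):
        """Lambertian photometric stereo.
        images:     (k, h, w) stack of intensity images (ambient frame subtracted)
        light_dirs: (k, 3) unit light-source direction vectors
        Returns the per-pixel albedo (h, w) and unit surface normals (h, w, 3)."""
        k, h, w = images.shape
        I = images.reshape(k, -1)                            # one column per pixel
        m, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # solve L @ m = I, m = albedo * n
        albedo = np.linalg.norm(m, axis=0)
        normals = (m / np.maximum(albedo, 1e-12)).T.reshape(h, w, 3)
        return albedo.reshape(h, w), normals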
3 BRDF Modelling
Gargan & Neelamkavil [12] showed that using an ANN provides excellent approx-
imation performance for a dense BRDF generated using a gonio-reflectometer.
Experimenting with different numbers of layers (which affects the ability of the
[Fig. 3 diagram: the inputs θi, θr, Δα, x and y feed a network with layers of 5, 10, 10, 10 and 1 nodes; the single output is a grey level in the range 0–255.]
Fig. 3. The architecture of the ANN used to model the BRDF. The inputs θi and θr
are in the range of 0-90 degrees, and Δα is in the range 0-180 degrees. x and y give
the pixel coordinate and so are in the range of 1 to either the width (W) or height (H).
The output is in the range 0-255 for each pixel, so that when all pixels are estimated,
a full reflectance image will have been rendered.
The network architecture can be seen in Fig. 3 and was trained using the
Levenberg-Marquardt optimisation backpropagation algorithm, taking in the re-
gion of 200 epochs to obtain a Mean Square Error (MSE) of 11.58 gray levels
and an R value of 0.9598. Using fewer hidden layers generated higher MSEs and
lower R-values although training times were faster, while using four layers led to
very slightly improved results but at a much higher computational cost. 100,000
image locations (approximately 20% of all data) were chosen at random to pro-
vide a representative sample of the whole face surface, and for each location,
θi , θr and Δα as well as the x, y co-ordinates were used as inputs.
The reason for including the pixel coordinates is an attempt to allow for the
different types of reflectance around the face to be captured (i.e. the reflectance
of the skin at the nose tip is different to that of the cheeks). In doing so it is
possible to correctly model the behaviour of different skin types when the same
θi , θr and Δα are provided without having to label different regions of the face
as having different skin types. This provides a means of unsupervised learning
that will assist in improving the realism of rendered images.
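A minimal scikit-learn sketch of this regression (our own illustration): the 5–10–10–10–1 architecture of Fig. 3 is kept, but the Adam optimiser replaces the Levenberg–Marquardt algorithm used by the authors, since scikit-learn does not provide the latter. Inputs are (θi, θr, Δα, x, y) and the target is the observed grey level.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def train_brdf_ann(features, grey_levels):
        """features:    (n, 5) array of [theta_i, theta_r, delta_alpha, x, y]
        grey_levels: (n,) observed intensities in 0..255."""
        model = make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(10, 10, 10),  # three hidden layers of 10 nodes
                         activation="tanh",
                         solver="adam",                     # stand-in for Levenberg-Marquardt
                         max_iter=2000, random_state=0))
        return model.fit(features, grey_levels)

    def render(model, theta_i, theta_r, delta_alpha, xs, ys):
        """Predict grey levels for a novel lighting direction over pixel coordinates xs, ys."""
        n = len(xs)
        feats = np.column_stack([np.full(n, theta_i), np.full(n, theta_r),
                                 np.full(n, delta_alpha), xs, ys])
        return np.clip(model.predict(feats), 0, 255)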
4 Results
This section first presents results showing the reconstruction accuracy under
NIR using a commercially available system as ground truth. Then, to assess the
potential of the interpolation and ANN BRDF models, we use re-rendered images
from the estimated surface normals obtained by PS. We use the BRDF models
to generate images from novel lighting angles to see how well the models can
generalise. We compare these images with those generated using the Lambertian
reflectance model and show that the ANN produces the most photo-realistic
images for unseen lighting angles.
5 Discussion
The results demonstrated the practicality of using the custom built NIR lamps
for PS acquisition. The capture process itself is unobtrusive (most other PS
¹ Written by Ajmal Saeed Mian, Computer Science, The University of Western Australia.
Fig. 4. Reconstructions from the Photoface device (a and c) and 3dMD (b and d), and
maps of the ℓ2-distance (e) and angular error (f)
techniques require a sequence of pulsed visible lights), takes only 30ms, and the
results generated are accurate and of high resolution.
In terms of BRDF modelling, the results show that photo-realistic images
can be synthesised by using an ANN to model the BRDF from a sparse dataset
resulting from practical PS acquisition. It offers more realistic results for novel
lighting angles than either a linear interpolation based method or Lambertian
model. The ANN offers a compact representation of the BRDF and a fast method
of synthesising observed intensities from novel lighting directions.
There are some limitations of using a BRDF for modelling skin reflectance,
especially under NIR. The BRDF describes the relationship between incident,
reflected angles and observed intensity. However, there will be a certain amount
of sub-surface scattering (and this will be increased under NIR which penetrates
deeper into the skin) which the BRDF is not designed to capture. Also, the
BRDF may deviate from actual values as we have used surface normals estimated
by PS, but for purposes such as CGI this is not as important as the perceived
realism (e.g. avatar generation). We have shown that photo-realistic results are
achievable and future work will aim to overcome the Lambertian assumption by
incorporating the BRDF model into normal estimates by iteratively enhancing
the accuracy of the surface normal representations, which in turn can then be
used to generate a more accurate BRDF until convergence is reached. This in
turn will reduce distortion in the 3D reconstruction of the surface relief.
6 Conclusion
We have presented a five source NIR, high speed and high resolution 2.5D PS
face capture device, which can be used to generate accurate 3D models of human
faces. In addition, the five light sources are used to train an ANN to model
Fig. 5. Synthesised intensity images using estimated surface normals from PS and
synthesised light angles (azimuth angles are indicated by arrows; the zenith angle
is 15 degrees, which is representative of the Photoface light sources). Top row:
ANN using Photoface surface normals, second row: interpolated Photoface surface
normals, bottom row: images generated using the Lambertian reflectance model.
A video of the rendering created by the ANN BRDF can be downloaded from
www.cems.uwe.ac.uk/~mf-hansen/CAIP13/rerender75.avi
the individual’s BRDF. Using this modelled BRDF, photo-realistic results are
attained from novel light source directions. Future work will look at the use of the
BRDF to improve the 2.5D estimates by replacing the Lambertian assumption
in PS, as well as using it as an additional biometric.
References
1. Marschner, S.R., Westin, S.H., Lafortune, E.P.F., Torrance, K.E., Greenberg, D.P.:
Image-based BRDF measurement including human skin. In: Proceedings of the
10th Eurographics Workshop on Rendering, pp. 139–152 (1999)
2. Lambert, J.-H.: Photometria, sive de Mensura et gradibus luminis, colorum et
umbrae. sumptibus viduae E. Klett (1760)
3. Phong, B.T.: Illumination for computer generated pictures. Communications of the
ACM 18(6), 311–317 (1975)
4. Torrance, K.E., Sparrow, E.M.: Theory for off-specular reflection from roughened
surfaces. Journal of the Optical Society of America A 57(9), 1105–1112 (1967)
5. Oren, M., Nayar, S.K.: Generalization of the lambertian model and implications for
machine vision. International Journal of Computer Vision 14(3), 227–251 (1995)
6. Kumar, R., Barmpoutis, A., Banerjee, A., Vemuri, B.C.: Non-lambertian re-
flectance modeling and shape recovery of faces using tensor splines. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 33(3), 533–567 (2011)
7. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Ac-
quiring the reflectance field of a human face. In: Proceedings of the 27th Annual
Conference on Computer Graphics and Interactive Techniques, pp. 145–156 (2000)
8. Ghosh, A., Hawkins, T., Peers, P., Frederiksen, S., Debevec, P.: Practical modeling
and acquisition of layered facial reflectance. ACM Transactions on Graphics 27(5),
1–10 (2008)
9. Hansen, M.F., Atkinson, G.A., Smith, L.N., Smith, M.L.: 3D face reconstructions
from photometric stereo using near infrared and visible light. Computer Vision and
Image Understanding 114(8), 942–951 (2010)
10. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid ob-
ject detection. In: IEEE International Conference on Image Processing, vol. 1,
pp. 900–903 (2002)
11. Forsyth, D.A., Ponce, J.: Computer Vision: A modern approach. Prentice Hall
Professional Technical Reference (2002)
12. Gargan, D., Neelamkavil, F.: Approximating reflectance functions using neural net-
works. In: Rendering Techniques 1998: Proceedings of the Eurographics Workshop
in Vienna, Austria, June 29-July 1, p. 23 (1998)
13. 3dMDface system, http://www.3dmd.com/3dmdface.html (accessed: December
2011)
Comparison of Leaf Recognition by Moments
and Fourier Descriptors
1 Introduction
Since the most discriminative information is carried by the leaf boundary (see
Fig. 2c), all above-cited papers employ boundary-based features. We decided to
objectively compare the most popular ones – Fourier descriptors, Zernike mo-
ments, Legendre moments, Chebyshev moments, and a direct use of the bound-
ary coordinates – on a large database of tree leaves.
2 Data Set
In the experiments, we used our own data set named Middle European Woody
Plants (MEW 2012 – Fig. 1, [8]). It contains all native and frequently cultivated
trees and shrubs of the Central Europe Region. It has 151 botanical species (153
recognizable classes), at least 50 samples per species and a total of 9745 samples
(leaves). In the case of compound leaves (Fig. 2b), we considered the individual
leaflets separately.
Fig. 1. Samples of our data set (different scale – MEW 2012 scans cleaned for this
printed presentation): 1st row – Acer pseudoplatanus, Ailanthus altissima (leaflet of
pinnately compound leaf), Berberis vulgaris, Catalpa bignonioides, Cornus alba, 2nd
row – Deutzia scabra, Fraxinus excelsior (leaflet of pinnately compound leaf), Juglans
regia, Maclura pomifera (male), Morus alba, 3rd row – Populus tremula, Quercus pe-
traea, Salix caprea, Tilia cordata and Vaccinium vitis-idaea.
Fig. 2. (a) A simple leaf (Rhamnus cathartica). (b) A pinnately compound leaf (Clematis
vitalba). (c) The boundary of the leaf (Fagus sylvatica).
3 Method
3.1 Preprocessing
The preprocessing consists mainly of leaf segmentation and boundary detection.
The scanned green leaves on a white background are first segmented by simple
thresholding: the leaves are converted from color to gray scale and Otsu's
threshold [9] is computed. The contours in the binary image are then traced.
Only the longest outer boundary is used; the other boundaries (if any) and holes
are ignored.
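A rough illustration of this preprocessing step (the file name and the choice of OpenCV are assumptions, not part of the paper):

```python
# Sketch: threshold a scanned leaf with Otsu's method and keep the longest outer boundary.
import cv2
import numpy as np

img = cv2.imread("leaf_scan.png")                      # green leaf on white background
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu's threshold; the leaf is darker than the white background, so invert
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Trace contours and keep only the longest outer boundary; holes are ignored
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
boundary = max(contours, key=len).squeeze()            # (n_o, 2) array of (x, y) points
x, y = boundary[:, 0].astype(float), boundary[:, 1].astype(float)
```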
Then we compute the following features: Cartesian coordinates of the boundary
points (CB), polar coordinates of the boundary (PB), Fourier descriptors (FD),
Zernike moments computed from the boundary image (ZMB), Legendre moments (LM),
and Chebyshev moments of the first kind (CM1) and of the second kind (CM2).
All the features need to be normalized to translation and rotation. The
normalization to translation is provided by subtracting the centroid coordinates
m_{10}/m_{00} and m_{01}/m_{00}, where m_{pq} is a geometric moment. The rotation
normalization in the case of the direct coordinates, Legendre and Chebyshev
moments is provided so that the principal axis coincides with the x-axis and the
complex moment c_{21} has a non-negative real part:

\theta = \tfrac{1}{2}\arctan\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}; \quad \text{if } (\mu_{30}+\mu_{12})\cos\theta-(\mu_{21}+\mu_{03})\sin\theta<0 \text{ then } \theta:=\theta+\pi, \qquad (1)

where \mu_{pq} is a central geometric moment. All the boundary coordinates are
multiplied by a rotation matrix corresponding to the angle −θ.
The starting point of the coordinate sequence is the one with the minimum
x-coordinate. If there are several such points, the one with the minimum
y-coordinate is chosen.
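A minimal sketch of this translation/rotation normalization of Eq. (1); the helper name and the use of arctan2 (for numerical robustness) are my assumptions:

```python
import numpy as np

def normalize_boundary(x, y):
    # translation normalization: subtract the centroid m10/m00, m01/m00
    xc, yc = x - x.mean(), y - y.mean()

    # central geometric moments needed by Eq. (1)
    mu11 = np.sum(xc * yc)
    mu20, mu02 = np.sum(xc ** 2), np.sum(yc ** 2)
    mu30, mu03 = np.sum(xc ** 3), np.sum(yc ** 3)
    mu21, mu12 = np.sum(xc ** 2 * yc), np.sum(xc * yc ** 2)

    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    if (mu30 + mu12) * np.cos(theta) - (mu21 + mu03) * np.sin(theta) < 0:
        theta += np.pi

    # rotate all boundary points by -theta
    c, s = np.cos(-theta), np.sin(-theta)
    xr, yr = c * xc - s * yc, s * xc + c * yc

    # start the sequence at the point with minimum x (ties broken by minimum y)
    start = np.lexsort((yr, xr))[0]
    return np.roll(xr, -start), np.roll(yr, -start)
```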
The parameter n is called the order and ℓ the repetition. Since ZMs were
designed for 2D images, we treat the leaf boundary (which is actually 1D
information) as a 2D binary image.
This explicit formula becomes numerically unstable for high orders; therefore
three recurrence formulas were developed, known as the Prata, Kintner and Chong
methods. We used the Kintner method [12]. The scaling invariance is provided by
a suitable mapping of the image onto the unit disk. The points at the distance
κn_o from the centroid are mapped onto the boundary of the unit disk, where κ is
a constant found by optimization of the discriminability on the given dataset;
the value κ = 0.3 was determined for MEW2012. The parts of the leaf mapped
outside the unit disk are not included in the computation. The moment amplitudes
are also normalized both to the sampling density and to the contrast as
a_{nℓ} = |A_{nℓ}|/A_{00}, and the phases are normalized to the rotation as
ϕ_{nℓ} = angle(A_{nℓ}) − ℓ·angle(A_{31}).
P_n = \sum_{k=1}^{n_o} (x_k + i\,y_k)\, K_n\!\left(\frac{2(k-1)}{n_o-1}-1\right), \qquad (5)
\delta_s(a,b) = 1 - \exp\!\left(-\frac{\big(d_m^{(a)}-d_m^{(b)}\big)^2}{2\, d_m^{(a)} d_m^{(b)}}\right). \qquad (7)
4 Classifier
We use a simple nearest neighbor classifier with optimized weights of the
individual features. While we can use the plain L2 norm for comparing the
amplitude features, the phase features are angles in principle and we have to
use a special distance for them, where S_A is the set of all indices for which
a_k is an amplitude feature and, similarly, S_P is the set of all indices for
which ϕ_k is a phase feature. The weight w_f is constant for a given type of
features, while w_c(k) depends on the order of the feature. We use
w_c(k) = 1/|u_k| for FD and w_c(k) = 1/n_k for all the moments, where u_k is the
current harmonic and n_k is the current moment order. In the case of CB and PB,
w_c(k) has no meaning. The parameters and weights of all features were optimized
for MEW2012.
In the training phase, the features of all leaves in the data set are computed.
In the classification phase, the features of the query leaf are computed; they
are labeled by the index (q) in Eq. (9), while the compared features are taken
successively from the whole data set. We only consider one nearest neighbor from
each species. Where the information whether the leaf is simple or compound is
available, only the corresponding species are considered.
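Since the exact combination used in Eq. (9) is not reproduced above, the following is only an illustrative sketch of a weighted nearest-neighbour distance of this kind: a weighted L2 term on amplitude features and a wrapped angular difference on phase features. The helper names, the index sets and the weight vectors are assumptions.

```python
import numpy as np

def leaf_distance(q, t, amp_idx, phase_idx, w_f, w_c):
    """q, t: feature vectors of the query leaf and one training leaf."""
    d_amp = np.sum(w_c[amp_idx] * (q[amp_idx] - t[amp_idx]) ** 2)
    # phases are angles, so compare them modulo 2*pi
    dphi = np.angle(np.exp(1j * (q[phase_idx] - t[phase_idx])))
    d_phase = np.sum(w_c[phase_idx] * dphi ** 2)
    return w_f[0] * d_amp + w_f[1] * d_phase

def classify(query, train_feats, train_labels, **kw):
    d = np.array([leaf_distance(query, t, **kw) for t in train_feats])
    return train_labels[np.argmin(d)]          # 1-NN over the training set
```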
5 Results
In the experiments, we randomly divided the leaves of each species in the data
set into two halves. One of them was used as a training set and the other half
was tested against it. The results are in Tab. 1. The Fourier descriptors
slightly outperform the other tested features. The reason for their superiority
to moments in this task lies in the numerical properties of the features. Since
the leaves are similar
Table 1. The success rates (f – boundary features only, s – the leaf size, c – information
whether the leaf is simple or compound)
6 Conclusion
We have tested several types of features in a specific task: recognition of
woody species based on their leaves. We concluded that Fourier descriptors are
the most appropriate features, which can, when combined with the leaf size,
achieve a recognition rate above 85%. A crucial factor influencing the success
rate is of course the quality of the input image.
In this study, the leaves were scanned in the laboratory. The system is not
primarily designed to work with photographs of the leaves taken directly on
the tree. In such a case, the background segmentation and elimination of the
perspective would have to be incorporated. We encourage the readers to take
their own pictures and to try our public web-based application [13].
References
1. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C.,
Soares, J.V.B.: Leafsnap: A computer vision system for automatic plant species
identification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C.
(eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 502–516. Springer, Heidelberg
(2012)
228 T. Suk, J. Flusser, and P. Novotný
2. Chen, Y., Lin, P., He, Y.: Velocity representation method for description of contour
shape and the classification of weed leaf images. Biosystems Engineering 109(3),
186–195 (2011)
3. Kadir, A., Nugroho, L.E., Susanto, A., Santosa, P.I.: Foliage plant retrieval using
polar fourier transform, color moments and vein features. Signal & Image Process-
ing: An International Journal 2(3), 1–13 (2011)
4. Nanni, L., Brahnam, S., Lumini, A.: Local phase quantization descriptor for
improving shape retrieval/classification. Pattern Recognition Letters 33(16),
2254–2260 (2012)
5. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.X., Chang, Y.F., Xiang, Q.L.: A leaf
recognition algorithm for plant classification using probabilistic neural network.
In: 7th International Symposium on Signal Processing and Information Technology
ISSPIT 2007, p. 6. IEEE (2007)
6. Söderkvist, O.J.O.: Computer vision classification of leaves from Swedish trees.
Master's thesis, Linköping University (September 2001)
7. Kadir, A., Nugroho, L.E., Susanto, A., Santosa, P.I.: Experiments of Zernike mo-
ments for leaf identification. Journal of Theoretical and Applied Information Tech-
nology 41(1), 113–124 (2012)
8. MEW2012: Download middle european woods (2012),
http://zoi.utia.cas.cz/node/662
9. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans-
actions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
10. Lin, C.C., Chellapa, R.: Classification of partial 2-D shapes using Fourier de-
scriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 9(5),
686–690 (1987)
11. Flusser, J., Suk, T., Zitová, B.: Moments and Moment Invariants in Pattern Recog-
nition. Wiley, Chichester (2009)
12. Kintner, E.C.: On the mathematical properties of Zernike polynomials. Journal of
Modern Optics 23(8), 679–680 (1976)
13. MEWProjectSite: Recognition of woods by shape of the leaf (2012),
http://leaves.utia.cas.cz/index?lang=en
Dense Correspondence of Skull Models by
Automatic Detection of Anatomical Landmarks
1 Introduction
Most of these methods are demonstrated on models with simple surfaces such
as faces [8, 12, 20], human bodies [1, 15], knee ligaments [6], and lower jaws [2].
TPS is particularly effective for mesh models with highly complex surfaces such
as brain sulci [4], lumbar vertebrae [10], and skulls [5, 7, 9, 14, 18]. Skull models
are particularly complex because they have holes, missing teeth and bones, and
interior as well as exterior surfaces.
Like all non-rigid registration methods, TPS registration of skull models re-
quires known correspondence on the reference and the target, which can be
manually marked or automatically detected. The first approach manually marks
anatomical landmarks on the reference and the target [5, 9, 14], and uses the
landmarks as hard constraints in TPS registration. This approach is accurate,
but manual marking is tedious and potentially error prone. The second approach
automatically detects surface points on the reference mesh, which are mapped
to the target surface. These surface points can be randomly sampled points [7]
or distinctive feature points such as local curvature maximals [18], and they
serve as soft constraints in TPS registration. This approach is sensitive to noise,
outliers, and false correspondences. Turner et al. [18] apply multi-stage coarse-
to-fine method to reduce outliers, and forward (reference-to-target) and back-
ward (target-to-reference) registrations to reduce false correspondences. How-
ever, there is no guarantee that the correspondences detected anatomically are
consistent and accurate, despite the complexity of the method.
This paper presents an automatic dense correspondence algorithm for skull
models that combines the strengths of both approaches. First, anatomical land-
marks are automatically and accurately detected to serve as hard constraints
in TPS registration. They ensure anatomically consistent correspondence.
The number of such landmarks is expected to be small because automatic detec-
tion of anatomical landmarks is a very difficult task (Section 2). Second, control
points are sampled on skull surfaces to serve as soft constraints in TPS regis-
tration. They provide additional local shape constraints for a close matching
of reference and target surfaces. Compared to [18], our method also uses a
multi-stage coarse-to-fine approach, except that our landmark detection algorithm is
based on anatomical definitions of landmarks, which ensures the correctness and
accuracy of the detected landmarks.
Quantitative evaluation of point correspondence is a challenging task. Most
works reported only qualitative results. The quantitative errors measured in [2,
8, 19] are non-rigid registration error instead of point correspondence error. This
paper proposes a method for measuring point correspondence error, and shows
that registration error is not necessarily correlated to correspondence error.
In anatomy [16] and forensics [17], craniometric landmarks are feature points on
a skull that are used to define and measure skull shapes. Automatic detection
of craniometric landmarks is very difficult and challenging due to a form of
cyclic definition. Many craniometric landmarks are defined according to the three
Fig. 1. Skull models and craniometric landmarks. (1) Reference model. (1a) Frankfurt
plane (FP) is the horizontal (red) plane and mid-sagittal plane (MSP) is the vertical
(green) plane. (1b–1d) Blue dots denote landmarks used for registration and yellow dots
denote landmarks used for evaluation. (2) Detected registration landmarks (blue) and
50 control points (red) on two sample test targets.
Like ICP, FICP iteratively computes the best similarity transformation (scaling,
rotation, and translation) that registers the reference to the target. The difference
is that in each iteration, FICP computes the transformation using only a subset
of reference points whose distances to the target model are the smallest.
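A rough sketch of one such fractional ICP iteration: only the fraction of reference points closest to the target is used to estimate the similarity transform. The fraction value, the Umeyama-style estimator and the fixed iteration count are simplifying assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def ficp(reference, target, fraction=0.8, n_iter=50):
    """Register `reference` (Nx3) onto `target` (Mx3); returns the registered points."""
    ref = reference.copy()
    tree = cKDTree(target)
    for _ in range(n_iter):
        dist, idx = tree.query(ref)                      # closest target point per reference point
        keep = np.argsort(dist)[: int(fraction * len(ref))]
        A, B = ref[keep], target[idx[keep]]

        # similarity transform (scale, rotation, translation) via SVD
        muA, muB = A.mean(0), B.mean(0)
        A0, B0 = A - muA, B - muB
        U, S, Vt = np.linalg.svd(A0.T @ B0)
        D = np.eye(3)
        D[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))     # avoid reflections
        R = Vt.T @ D @ U.T
        s = np.trace(np.diag(S) @ D) / np.sum(A0 ** 2)
        t = muB - s * (R @ muA)

        ref = s * (ref @ R.T) + t                        # update current registration estimate
    return ref
```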
After registration, Step 2 maps the landmarks on the reference to the target.
First, closest points on the target surface to the reference landmarks are identi-
fied. These closest points are the initial estimates of the landmarks on the target,
which may not be accurate due to shape variations among the skulls. Next, FP
and MSP are fitted to the initial estimates using PCA.
In Step 3, an elliptical landmark region R is identified around each initial
estimate. The orientation and size of R are empirically predefined. R varies for
different landmark according to the shape of the skull around the landmark.
These regions should be large enough to include the landmarks on the target
model. Accurate landmark locations are searched within the regions according
to their anatomical definitions. For example, the left and right porions (Pl, Pr
in Fig. 1) are the most lateral points of the roofs of the ear canals [16, 17]. After
refining FP landmarks in Step 3(a), FP is fitted to the refined FP landmarks.
Next, MSP landmarks are refined in Step 3(b) in a similar manner, and MSP is
fitted to the refined MSP landmarks, keeping it orthogonal to FP.
As Step 3 is iterated, the locations and orientations of FP and MSP are refined
by fitting to the landmarks, and the landmarks’ locations are refined according
to the refined FP and MSP. After the algorithm converges, accurate craniometric
landmarks are detected on the target model.
In addition to the landmarks on FP and MSP, other landmarks are also de-
tected (Fig. 1). These include points of extremum along the anatomical orienta-
tions defined by FP and MSP. These landmarks are detected in a similar manner
as the FP and MSP landmarks, first by mapping known landmark regions on the
reference to the target, and then searching within the regions for the landmarks
according to their anatomical definitions. Test results show that the average
landmark detection error is 3.54 mm, which is very small compared to the size
of human skulls.
6 Conclusions
This paper presents a multi-stage, coarse-to-fine automatic dense correspon-
dence algorithm for mesh models of skulls that combines two key features. First,
anatomical landmarks are automatically and accurately detected to serve as
hard constraints for non-rigid registration. They ensure anatomically consis-
tent correspondence. Second, control points are sampled on the skull surfaces
to serve as soft constraints for non-rigid registration. They provide additional
local shape constraints to ensure close matching of reference and target sur-
faces. Test results show that, by combining both approaches, our algorithm can
Detection of Visual Defects in Citrus Fruits: MIA-DDRCT vs EGIS
1 Introduction
Fig. 1. Some blemishes in citrus. From left to right: scale, thrips scarring, sooty mould
and green mould.
scarring, chilling injury, stem injury, sooty mould, anthracnose and phytotoxicity.
Figure 1 shows four different types of defects (blemishes) in citrus.
The automatic detection of visual defects in oranges post-harvest, performed to
classify the fruit depending on their appearance, is a major problem. Species
and cultivars of citrus present great unpredictability in colors and textures in
both sound and defective areas. Thus, the inspection system will need frequent
training to adapt to the visual features of new cultivars and even different
batches within the same cultivar [1]. In addition, as the training process will
be performed by non-specialized operators at the inspection lines, we need to
select an unsupervised methodology (no labeling process required) that leads to
an easy-to-train inspection system. Real-time compliance is also an important
issue, so that the overall production can be inspected at on-line rates; thus,
approaches with low computational costs are valuable. In the present paper, we
study and compare two methods that offer these features: they are unsupervised,
easy to train, and also have low computational costs in comparison with
similar-in-purpose methods in the literature.
The first method [2] is based on a Multivariate Image Analysis (MIA) strategy
developed in the area of applied statistics [3,4,5]. This strategy differs from
traditional image analysis, where the image is considered a single sample from
which a vector of features is extracted and then used for classification or
comparison purposes. In MIA, the image is considered a sample of size equal to
the number of pixels that compose the image. Principal Component Analysis (PCA)
is applied to the raw pixel data and then statistical measures are used to
perform the image analysis. The method was originally developed as a general
approach for the detection of defects in random color textures, a computer
vision topic on which several works have recently been published. We chose this
kind of method because it fits the needs of blemish (visual defect) detection in
citrus, where sound peel areas and damaged areas are in fact random color
textures. With regard to the other methods in the literature for the detection
of defects in random color textures, this method presents the following
advantages: it uses one of the simplest approaches, providing low computational
costs, and it is also unsupervised and only needs a few samples to train the
system [2]. In order to better capture defects and parts of defects of different
sizes, we introduce a multiresolution scheme which minimizes the computational
effort. The method is applied at different scales, gathering the results in one
map of defects. In the
2 Experimental Work
The set of fruit used to carry out the experiments consisted of a total of 120 or-
anges and mandarins coming from four different cultivars: Clemenules, Marisol,
Fortune, and Valencia (30 samples per cultivar). The fruit was randomly col-
lected from a citrus packing house. Five fruits of each cultivar belonged to the
extra category, thus, they were fully free of defects. The other 25 fruits of each
cultivar fitted secondary commercial categories and had several skin defects,
representing the most important causes of losses during post-harvesting (see
Section 1).
The first step in the experimental work for this approach was to select a set of
defect-free samples for each cultivar, in order to build the corresponding model
of sound color textures. A total of 64 different sound patches were collected for
each cultivar (see Figure 2). We used patches instead of complete samples in
order to introduce more different types of sound peel into the model and to
capture as much of the variability of colors and textures as possible.
In this approach there is no training stage and no model of sound color textures
is built. Instead, the method tries to segment the sample (the image) into
regions in such a way that adjacent regions have a different visual appearance
while the appearance remains similar within each region. Thus, in order to
extract the defects it is necessary to adopt the hypothesis that the biggest
regions in the samples always correspond to the sound area (the background is
not considered).
Since no training is performed, we went directly to tuning the parameters of the
method for each cultivar. The parameters are sigma, which is used to smooth the
image before it is segmented, and the k value of the threshold function. In [6]
the recommended values for sigma and k are 0.5 and 500 respectively; we
therefore varied the parameters around these central values. For each cultivar a
set of experiments was performed varying sigma in [0.25, 0.30, 0.35, 0.40, 0.45,
0.50, 0.55, 0.60, 0.65, 0.70, 0.75], and k in [200, 250, 300, 350, 400, 450,
500, 550, 600, 650, 700, 750], which led to 132 different experiments. As in the
previous approach, the parameters were tuned by comparing the manually marked
defects with those achieved by the method. This comparison was performed again
through the measures of Precision, Recall and F-score. Tables 3 and 4 correspond
to Tables 1 and 2 of the previous approach. These tables show that the EGIS
approach is better at fitting the marked defects and also in defect detection,
although it produces more false detections.
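A hedged sketch of such a parameter sweep, using the Felzenszwalb-Huttenlocher segmentation available in scikit-image (its scale parameter plays the role of k); the min_size value and the rule that the largest region is the sound peel follow the stated hypothesis, while background masking is omitted here:

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def defect_map(image, sigma, k):
    labels = felzenszwalb(image, scale=k, sigma=sigma, min_size=20)
    largest = np.bincount(labels.ravel()).argmax()   # hypothesis: largest region = sound peel
    return labels != largest                          # True where a potential defect lies

sigmas = np.arange(0.25, 0.80, 0.05)                  # 11 sigma values
ks = range(200, 800, 50)                              # 12 k values -> 132 experiments
# each (sigma, k) result would then be scored against the manually marked
# defects with Precision / Recall / F-score (not shown here)
```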
A major difference between the two approaches arises when we study their timing
costs. Using a standard PC, we measured for both methods the mean timing cost of
20 executions performed on the same sample of the Clemenules cultivar. While the
MIA-DDRCT method achieved a mean timing of 588.5 ms, the EGIS method achieved
162.5 ms. Nevertheless, despite the difference, both methods can meet the
real-time requirements at production lines (5 pieces per second) since
their timing costs can be drastically reduced by using simple and cheap paral-
lelization techniques based on computer clustering. Figure 3 shows the results
achieved by both approaches on two different samples.
3 Conclusions
Fig. 3. MIA-DDRCT versus EGIS. From top to bottom: original, manually marked
defects, MIA-DDRCT and EGIS results.
faster than MIA-DDRCT, although both methods can easily achieve real-time
compliance by introducing simple parallelization techniques. Finally, the
MIA-DDRCT approach has the advantage that it does not need the hypothesis that
the biggest area in a sample corresponds to the sound area, unlike the EGIS
method.
References
1. Blasco, J., Aleixos, N., Moltó, E.: Computer vision detection of peel defects in citrus
by means of a region oriented segmentation algorithm. Journal of Food Engineer-
ing 81(3), 535–543 (2007)
2. López, F., Prats, J.M., Ferrer, A., Valiente, J.M.: Defect Detection in Random
Colour Textures Using the MIA T2 Defect Maps. In: Campilho, A., Kamel, M.S.
(eds.) ICIAR 2006. LNCS, vol. 4142, pp. 752–763. Springer, Heidelberg (2006)
3. Bharati, M.H., MacGregor, J.F.: Texture analysis of images using Principal Com-
ponent Analysis. In: SPIE/Photonics Conference on Process Imaging for Automatic
Control, pp. 27–37 (2000)
4. Geladi, P., Grahn, H.: Multivariate Image Analysis. Wiley, Chichester (1996)
5. Prats-Montalbán, J.M., Ferrer, A.: Integration of colour and textural information
in multivariate image analysis: defect detection and classification issues. Journal of
Chemometrics 21(1-2), 10–23 (2007)
6. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient Graph-Based Image Segmentation.
International Journal of Computer Vision 59(2), 167–181 (2004)
7. Urquhart, R.: Graph theoretical clustering based on limited neighborhood sets. Pat-
tern Recognition 15(3), 173–187 (1982)
8. Zahn, C.T.: Graph-theoretic methods for detecting and describing gestalt clusters.
IEEE Transactions on Computers C-20(1), 68–86 (1971)
Domain Adaptation Based on Eigen-Analysis
and Clustering, for Object Categorization
1 Introduction
Domain adaptation [1], [2] is a well-known problem in the field of machine
learning, with recent applications in many computer vision tasks. The basic
assumption for most classification and regression techniques is that the
training and the testing samples are drawn from the same distribution. For many
real-world datasets, the distributions of the training and the testing data are
dissimilar, which leads to poor classification performance. This happens in
situations where the test samples are drawn only from the target domain while
typically a large number of training samples are available in the source domain.
In many situations, only a few labeled samples (images) are available for a
classification task in the target domain, though plenty of samples are available
from the source (or auxiliary) domain. When a small number of labeled training
samples are used for learning, an ill-fitted model is generally obtained. This
is known as the small sample size (SSS) problem, where the parameters obtained
during the training phase do not generalize to the testing data, leading to
highly erroneous results during the testing phase.
Domain adaptation (DA) is the process where one can use the training sam-
ples available from source domain to aid a classification task. Typically, a large
number of samples (instances) from source domain and a few from target domain
are available for training in supervised DA. The job of classification is done using
a separate set of test samples obtained only from the target domain. Broadly
speaking, there are two types of DA techniques available in the literature:
(a) supervised, where we have a very small number of labeled training samples
from the target domain, and (b) unsupervised, where we have plenty of unlabeled
training samples from the target domain. Using training samples from both
domains, we expect to build a statistical model which gives better performance
on the testing data available from the target domain.
In the recent past, the problem of transfer learning or DA has been attempted in
various computer vision applications [1], [2], [3], [4]. Jiang et al. [5] and
Yang et al. [6] have proposed methods of modifying an existing SVM trained on
video samples from the source domain by introducing a bias term between the
source and target domains during the SVM training optimization. A transformation
matrix has been proposed in [1], which transforms instances from one domain to
the other. In [7], a domain-dependent regularizer has been proposed, which
enforces the target classifier to give results similar to the relevant base
classifiers on unlabeled instances from the target domain. There have been works
on unsupervised DA which measure the geodesic distance between the source and
target domains on the Grassmannian manifold [2], [3]. Application of DA to face
recognition [8] has also been explored.
In this paper, we use the structural information of clusters present in the
dataset for DA. Transformation of data to a different domain becomes simplified
if the distributions of the two domains are known or estimated properly. Hence, we
group the dataset into Gaussian clusters in both the domains, and propose trans-
formation of clusters from source to target domain using an inter-domain cluster
mapping function. Results shown on real-world object categorization datasets
reveal the efficiency of the system.
The rest of the paper is organized as follows. Section 2 gives a concise descrip-
tion of the proposed method of clustering and domain transformation. Section
3 presents and discusses the performance of the proposed methodology on real-
world datasets. Section 4 concludes the paper.
Let X ∈ ℝ^{n_s×D} and Y ∈ ℝ^{n_t×D} denote the source and the target data having
n_s and n_t samples respectively. Let K_s^d and K_t^d be the number of clusters
in X_d and Y_d respectively. Let δ_s^d = {1, . . . , K_s^d} and
δ_t^d = {1, . . . , K_t^d} be the sets of cluster labels of the clusters formed
in X_d and Y_d, where X_d and Y_d represent the d-th feature of X and Y
respectively (d = 1, 2, . . . , D). Let X_d^i and Y_d^j denote the i-th and j-th
clusters of X_d and Y_d respectively (i ∈ δ_s^d, j ∈ δ_t^d). The entire process
has been explained in the following sub-sections.
This step computes the density distribution of the data and obtains clusters in
both domains along each dimension separately. To cluster the data along each
dimension, we estimate the density of the data using a Parzen window estimator.
The size of the window is set to n^{−1/2}, where n is the number of instances.
Next, we detect the peaks and the valleys present in the probability density
distribution. All the instances whose probability density falls between two
adjacent valley points are clustered together. Since we are clustering the data
along each dimension separately, a dataset may have a different number of
clusters along each dimension. This process is repeated for both the source and
target domains. A process of smoothing the density distribution may be necessary
prior to the detection of peaks and valleys.
Initially, we normalize the range of the data in both domains. The problem of
small sample size does not affect this non-parametric clustering, as the process
is performed repeatedly (for all dimensions separately) using only one dimension
at a time. The distribution of a dataset can in general be parameterized by
fitting a Gaussian mixture model (GMM). However, due to the presence of very few
samples in the target domain, fitting a GMM often produces inaccurate results
and is thus avoided.
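A minimal sketch of this per-dimension clustering (Parzen window with width n^{-1/2}, then valley detection); the grid resolution and the omission of density smoothing are my simplifications:

```python
import numpy as np
from scipy.signal import argrelextrema

def cluster_1d(values):
    """Cluster the samples of one feature dimension by valleys of a Parzen density estimate."""
    n = len(values)
    h = n ** -0.5                                        # window size as in the paper
    grid = np.linspace(values.min(), values.max(), 512)
    # Gaussian Parzen window estimate of the 1D density
    density = np.exp(-0.5 * ((grid[:, None] - values[None, :]) / h) ** 2).sum(axis=1)
    density /= n * h * np.sqrt(2 * np.pi)

    valleys = grid[argrelextrema(density, np.less)[0]]   # local minima separate clusters
    return np.searchsorted(valleys, values)              # cluster index per sample

# one label vector per dimension, e.g.:
# labels = [cluster_1d(X[:, d]) for d in range(X.shape[1])]
```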
between all pairs of clusters, such that Λ_{S→T}^d(i, j) = kldiv(X_d^i, Y_d^j).
b. Calculate the average similarity of each of the clusters in Y_d with all the
source domain clusters as: η^d(j) = mean_i Λ_{S→T}^d(i, j), ∀j ∈ δ_t^d.
c. Using the criterion that if Y_d^j is most similar to X_d^i then
F_{S→T}^d(i) = j, calculate F_{S→T}^d as:
F_{S→T}^d(i) = arg min_j Λ_{S→T}^d(i, j), ∀i ∈ δ_s^d. Here, (F_{S→T}^d)^{-1}(j)
denotes the inverse mapping, which gives the set of 'pre-image clusters' of
Y_d^j in the source domain.
d. Now, if for any Y_d^j the 'pre-image cluster' set is NULL, then identify an
X_d^i satisfying the following conditions and re-assign the mapping as:

F_{S→T}^d(i) = arg min_j Λ_{S→T}^d(i, j),  ∀j such that (F_{S→T}^d)^{-1}(j) is NULL,
∀i such that |(F_{S→T}^d)^{-1}(F_{S→T}^d(i))| > 1 and Λ_{S→T}^d(i, j) ≤ η^d(j).

If there remains some cluster in the target domain which is outside the range of
the function F_{S→T}^d but has a distribution similar to a cluster in the source
domain, we re-assign the mapping function based on the above equation. Let X_d^i
be a cluster in X_d and let its corresponding 'image cluster' in Y_d be Y_d^j.
Let there exist another cluster Y_d^k in Y_d for which the 'pre-image cluster'
set is empty (thus, kldiv(X_d^i, Y_d^j) ≤ kldiv(X_d^i, Y_d^k)). If X_d^i is
quite similar, based on the condition kldiv(X_d^i, Y_d^k) ≤ η^d(k), and the
current number of elements in the 'pre-image' set of Y_d^j is greater than one
(there exists at least one more cluster X_d^k such that k ≠ i), then according
to this step, update the 'image cluster' for X_d^i to Y_d^k.
This stage describes the process of transferring instances from the source to
the target domain using the well-formed clusters obtained at the earlier stage.
During the transformation, we preserve the relative distances between instances
of the source domain along each of the principal component directions. The
relation between the eigen-analysis of data following a Gaussian distribution
and the distribution parameters is given in [10] (p. 29). The proposed method of
EDT exploits this relation, since the clustering stage groups the data into
approximately Gaussian clusters. We match the distribution of the source and
target clusters in the eigen-space and project it back to get the transformed
source cluster.
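A minimal sketch of this kind of eigen-space matching (whiten a source cluster with its own eigen-decomposition, then re-colour it with the eigen-decomposition and mean of its target 'image cluster'); this illustrates the idea rather than the authors' exact step listing, and the regularizer eps is an assumption:

```python
import numpy as np

def transform_cluster(Xs, Yt, eps=1e-8):
    """Xs: source cluster (n_s x D), Yt: its image cluster in the target domain (n_t x D)."""
    mu_s, mu_t = Xs.mean(0), Yt.mean(0)
    lam, Phi = np.linalg.eigh(np.cov(Xs, rowvar=False))   # source eigenvalues / eigenvectors
    gam, Psi = np.linalg.eigh(np.cov(Yt, rowvar=False))   # target eigenvalues / eigenvectors

    # project onto the source eigen-space and whiten (relative distances are preserved)
    Z = (Xs - mu_s) @ Phi / np.sqrt(lam + eps)
    # re-scale with the target eigenvalues and project back via the target eigenvectors
    return (Z * np.sqrt(gam + eps)) @ Psi.T + mu_t
```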
Let (λ_i = [λ_i^1, . . . , λ_i^D], Φ_i = [Φ_i^1, . . . , Φ_i^D]) and
(γ_j = [γ_j^1, . . . , γ_j^D], Ψ_j = [Ψ_j^1, . . . , Ψ_j^D]) be the corresponding
pairs of eigenvalues/eigenvectors obtained by cluster-wise eigen-analysis of X_i
and Y_j respectively. Each cluster X_d^i, i ∈ δ_s^d, of the source domain is
transformed to match the distribution of its corresponding 'image cluster'
F_{S→T}^d(i), as determined in the previous stage. If X^τ denotes the
transformed source domain data, then the eigen-analysis of X^τ should be identical
to that of Y. Let us consider one cluster, X_d^i, whose 'image cluster' is Y_d^j
in the target domain (F_{S→T}^d(i) = j). The steps to obtain X^τ are as follows:
Fig. 1. Data from the source and target domains (in ℝ²) are marked as green and
blue points in (a) and (d), while the transformed source domain data is marked
in red in (d). Intermediate results of clustering are shown in (b) and (c) for
the first and second dimensions respectively. Green and blue (dashed) curves
denote the density of the data in the source and target domains. Brown and
magenta (dashed) curves indicate the Gaussian functions modeling the cluster
distributions formed in the source and target domains respectively.
3 Experimental Results
We evaluate the performance of the proposed method on real world datasets
for object categorization. The original Office dataset [1] contains 3 domains:
Amazon (A), Dslr (D) and Webcam (W), each having 31 classes of objects.
In [2], Office dataset has been merged with Caltech-256 dataset to create four
domains, Amazon (A), Caltech (C), Dslr (D) and Webcam (W), with ten classes
of objects. A few sample images from the four domains are shown in Fig. 2. The
sizes of the image samples in the Amazon, DSLR, Webcam and Caltech datasets are
300 × 300, 1000 × 1000, 500 × 500 (average size) and from 170 × 104 to
2304 × 1728, respectively.
Fig. 2. Sample images of two classes of objects taken from four domains
SURF features [11] are extracted and a bag of words (BOW) feature set is
calculated with a codebook of size 800, as done in [1], [2]. Two methods of EDT
are used for experimentation: (i) class-wise EDT: done for every class separately,
and (ii) Unsupervised EDT: done on the entire dataset by considering data from
all classes together. In the following, we describe the two sets of experimentation
done to exhibit the efficiency of the proposed method of DA.
In the first set of experiments, we consider the Office dataset [1] with 31
classes. The numbers of training samples taken are: 3 samples per class from the
target domain for Amazon/Dslr/Webcam, and from the source domain 8 samples per
class for Webcam/Dslr and 20 samples per class for Amazon. Results obtained
using a K-nearest neighbor (k=1) classifier are averaged over a 10-fold cross
validation and compared with those reported in [1], [2]. Table 1 shows the
classification accuracy (in %) of object categorization using different
techniques of DA. The 2nd and 3rd columns show the results of metric learning
[1], while the 4th and 5th columns show the results of sampling geodesic flow
(SGF) [2]. The 6th and 7th columns show the results of our proposed class-wise
and unsupervised EDT methods respectively. The proposed EDT gives better
performance than the metric learning method given in [1], while the results of
SGF [2] outperform the proposed method only in one case.
In the second set of experiments, we consider the object dataset used in [7],
which is a mixture of the Office dataset [1] and Caltech-256 [12]. The
Table 1. Classification accuracy (in %-age) of Office dataset [1] using different tech-
niques of domain adaptation (DA). Best classification accuracy is highlighted in bold.
Fig. 3. Classification accuracy done using DA for 12 different cases of object catego-
rization, using the Office+Caltech dataset [2]. Results are grouped into four categories,
with an identical target domain and three source domains considered separately.
dataset has four different domains from which we get 12 different pairs of source
and target domains. We create different sets of training samples by considering
different fractions (0.2, 0.3, ... , 0.7) of the training samples from the target
domain. The average classification accuracy is reported with a 10-fold cross vali-
dation. In this case, we use SVM with histogram intersection kernel to obtain the
classification accuracy. We compare our method with Domain Adaptive Machine
(DAM) [7], a supervised technique of DA. Results are shown in Fig. 3.
The mean accuracy over the different sets of training samples is reported for
all 12 scenarios of classification using DA. The red and the green bars in
Fig. 3 show the performance of the proposed class-wise and unsupervised EDT
methods respectively. The blue bar shows the mean accuracy when DAM [7] is used. We
also observe the performance of the classifier when samples from both source
and target domains are combined together for training. This method is termed
as Naive Combination (NC), for which the performance is given by the yellow
bar. Class-wise EDT technique gives the best result for 10 DA classification
tasks. For two cases, D→W and W→D, NC gives better results. This is due to
the fact that the two domains - Dslr and Webcam have similar distribution and
application of DA in this case leads to negative transfer.
Another interesting fact for these two tasks is that the unsupervised EDT
performs marginally better than the class-wise EDT, as the available number of
training samples is the smallest among the 12 different classification tasks.
Hence, the unsupervised EDT is expected to give better performance, as the
covariance matrix is estimated more accurately with a larger number of training
samples (than in class-wise EDT), leading to less error during eigen-analysis.
Hence, the choice between the two techniques of EDT (class-wise and
unsupervised) for a classification task should be based on the number of
available training samples.
4 Conclusion
We have proposed a new method of domain adaptation and applied it successfully
to the task of object categorization. The difference in distributions between
the data in the source and target domains is overcome by clustering and then
modeling with Gaussian functions. A cross-domain mapping function helps to
transform data from the source to the target domain, using a forward followed by
an inverse eigen-transformation. Results show that the proposed method of DA is
better than several state-of-the-art methods published in the recent past. The
work can be extended to handle multiple source domains.
References
1. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to
new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part
IV. LNCS, vol. 6314, pp. 213–226. Springer, Heidelberg (2010)
2. Gopalan, R., Li, R., Chellappa, R.: Domain adaptation for object recognition:
An unsupervised approach. In: International Conference in Computer Vision,
pp. 999–1006 (2011)
3. Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised
domain adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2066–2073 (2012)
4. Marton, Z.-C., Balint-Benczedi, F., Seidel, F., Goron, L.C., Beetz, M.: Object cate-
gorization in clutter using additive features and hashing of part-graph descriptors.
In: Stachniss, C., Schill, K., Uttal, D. (eds.) Spatial Cognition 2012. LNCS (LNAI),
vol. 7463, pp. 17–33. Springer, Heidelberg (2012)
5. Jiang, W., Zavesky, E., Fu Chang, S., Loui, A.: Cross-domain learning methods
for high-level visual concept classification. In: International Conference on Image
Processing, pp. 161–164 (2008)
6. Yang, J., Yan, R., Hauptmann, A.G.: Cross-domain video concept detection using
adaptive svms. In: International Conference on Multimedia, pp. 188–197 (2007)
7. Duan, L., Xu, D., Tsang, I.W.H.: Domain adaptation from multiple sources: A
domain-dependent regularization approach. IEEE Transactions on Neural Networks
and Learning Systems 23(3), 504–518 (2012)
8. Qiu, Q., Patel, V.M., Turaga, P., Chellappa, R.: Domain adaptive dictionary learn-
ing. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV
2012, Part IV. LNCS, vol. 7575, pp. 631–645. Springer, Heidelberg (2012)
9. Penny, W.: Kl-divergences of normal, gamma, dirichlet and wishart densities. Tech-
nical report, Wellcome Department of Cognitive Neurology, University College Lon-
don (2001)
10. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press (1990)
11. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (surf).
Computer Vision Image Understanding 110(3), 346–359 (2008)
12. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical
Report 7694, California Institute of Technology (2007)
Estimating Clusters Centres Using Support
Vector Machine: An Improved Soft Subspace
Clustering Algorithm
1 Introduction
The objective function of ESSC is

J_{ESSC}(v, u, w) = \sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^{2} + \gamma\sum_{i=1}^{c}\sum_{k=1}^{D} w_{ik}\ln w_{ik} - \eta\sum_{i=1}^{c}\Big(\sum_{j=1}^{N} u_{ij}^{m}\Big)\sum_{k=1}^{D} w_{ik}(v_{ik}-v_{0k})^{2}, \qquad (5)

where c is the number of clusters, N the data size, D the number of features,
v the cluster centre matrix, w the weight matrix and u the fuzzy partition matrix.
Using the Lagrange multiplier method, u, v and w are derived to minimize the
objective function of Eq. (5):
u_{ij} = \frac{\Big[\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^{2} - \eta\sum_{k=1}^{D} w_{ik}(v_{ik}-v_{0k})^{2}\Big]^{-\frac{1}{m-1}}}{\sum_{i'=1}^{c}\Big[\sum_{k=1}^{D} w_{i'k}(x_{jk}-v_{i'k})^{2} - \eta\sum_{k=1}^{D} w_{i'k}(v_{i'k}-v_{0k})^{2}\Big]^{-\frac{1}{m-1}}} \qquad (6)

v_{ik} = \frac{\sum_{j=1}^{N} u_{ij}^{m}(x_{jk} - \eta v_{0k})}{\sum_{j=1}^{N} u_{ij}^{m}(1-\eta)} \qquad (7)

w_{ik} = \frac{\exp\Big(-\frac{\sum_{j=1}^{N} u_{ij}^{m}(x_{jk}-v_{ik})^{2} - \eta\sum_{j=1}^{N} u_{ij}^{m}(v_{ik}-v_{0k})^{2}}{\gamma}\Big)}{\sum_{k'=1}^{D}\exp\Big(-\frac{\sum_{j=1}^{N} u_{ij}^{m}(x_{jk'}-v_{ik'})^{2} - \eta\sum_{j=1}^{N} u_{ij}^{m}(v_{ik'}-v_{0k'})^{2}}{\gamma}\Big)} \qquad (8)
In each iteration, the obtained cluster centres v_{ik} and their near
neighbourhoods, detected using the Euclidean distance, constitute the new SVM
training vector

T_{xy} = \{(v_1, 1), \ldots, (v_c, c), (x_i, y_i), \ldots, (x_k, y_k)\} \qquad (9)

where y_i is the index of the cluster with the maximum membership degree of x_i,
c is the number of clusters and k is the vector size. The size of the learning
vector depends on the data size.
The final equations of u and w are as follows:
u_{ij} = \frac{\Big[\sum_{k=1}^{D} w_{ik}\big(x_{jk}-\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k}}{N_i}\big)^{2} - \eta\sum_{k=1}^{D} w_{ik}\big(\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k}}{N_i}-v_{0k}\big)^{2}\Big]^{-\frac{1}{m-1}}}{\sum_{i'=1}^{c}\Big[\sum_{k=1}^{D} w_{i'k}\big(x_{jk}-\frac{\sum_{j'=1}^{N} u_{i'j'}x_{j'k}}{N_{i'}}\big)^{2} - \eta\sum_{k=1}^{D} w_{i'k}\big(\frac{\sum_{j'=1}^{N} u_{i'j'}x_{j'k}}{N_{i'}}-v_{0k}\big)^{2}\Big]^{-\frac{1}{m-1}}} \qquad (10)

where

v_{0k} = \frac{1}{c}\sum_{i=1}^{c}\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k}}{N_i} \qquad (11)

w_{ik} = \frac{\exp\Big(-\frac{\sum_{j=1}^{N} u_{ij}^{m}\big(x_{jk}-\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k}}{N_i}\big)^{2} - \eta\sum_{j=1}^{N} u_{ij}^{m}\big(\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k}}{N_i}-v_{0k}\big)^{2}}{\gamma}\Big)}{\sum_{k'=1}^{D}\exp\Big(-\frac{\sum_{j=1}^{N} u_{ij}^{m}\big(x_{jk'}-\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k'}}{N_i}\big)^{2} - \eta\sum_{j=1}^{N} u_{ij}^{m}\big(\frac{\sum_{j'=1}^{N} u_{ij'}x_{j'k'}}{N_i}-v_{0k'}\big)^{2}}{\gamma}\Big)} \qquad (12)
Algorithm.
Initialization step
– Input: number of clusters C, parameters m, η, γ and ε.
– Train the SVM with the input data.
– Compute w using (8).
Processing step
While ||v(t + 1) − v(t)|| ≥ ε do
– Compute the partition matrix u using (10).
– Compute the cluster centre matrix v using (7).
– Compute w using (12).
– Extract the learning vector using the centres previously computed.
– Apply the SVM algorithm.
– Compute the new centres matrix using equations (3) and (4).
endwhile
Clustering step
– Assign each pixel i to its cluster using the maximum of the membership degrees.
The performance of the proposed algorithm has been studied on the six UCI
databases [14] given with the specific number of clusters, number of instances
and number of attributes. The normalized mutual information (NMI) metric given
by Eq. (13) is used to evaluate and compare the performance of the proposed
algorithm and ESSC [6]. The higher the value of NMI, the better the result of
the clustering:

NMI = \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} N_{ij}\log\frac{N\, N_{ij}}{N_i N_j}}{\sqrt{\Big(\sum_{i=1}^{c} N_i\log\frac{N_i}{N}\Big)\Big(\sum_{j=1}^{c} N_j\log\frac{N_j}{N}\Big)}} \qquad (13)

where N_{ij} is the number of agreements between cluster i and true cluster j,
N_i is the number of data points in cluster i, and N_j is the number of data
points in true cluster j.
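A straightforward NumPy implementation of this NMI measure (the helper name and the label-vector interface are assumptions):

```python
import numpy as np

def nmi(pred, truth):
    """Normalized mutual information between predicted and true integer label vectors."""
    N = len(pred)
    classes_p, classes_t = np.unique(pred), np.unique(truth)
    Nij = np.array([[np.sum((pred == i) & (truth == j)) for j in classes_t]
                    for i in classes_p], dtype=float)
    Ni, Nj = Nij.sum(axis=1), Nij.sum(axis=0)

    mask = Nij > 0                                      # skip empty cells (0*log0 = 0)
    num = np.sum(Nij[mask] * np.log(N * Nij[mask] / np.outer(Ni, Nj)[mask]))
    den = np.sqrt(np.sum(Ni * np.log(Ni / N)) * np.sum(Nj * np.log(Nj / N)))
    return num / den
```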
All the experiments were implemented on a 2.22 GHz CPU with 2 GB RAM. The
descriptions of the used databases are given in Table 1. The NMI values are
tabulated in Table 2.
It is clear from Table 2 that the NMI value is greater for the proposed
algorithm than for the ESSC algorithm. The number of iterations also decreases.
For both methods, the norm of the cluster centres is plotted against the number
of iterations. The results are compared to the real norm of the cluster centres
given by the dataset (Figure 2, green). The cluster centres obtained by
Table 2. Clustering results obtained for nine UCI datasets with NMI as the metric
the proposed method are closer to the real centres with a minimum number of
iterations (Figure 2). The proposed approach has been tested on different types
of images to evaluate its performance in image segmentation. The first six
parameters of the co-occurrence features [15] and the edge detection have been
used, namely: contrast, homogeneity, correlation, energy, angular second moment
and entropy. The number of clusters C is 5 for Fig. 3A for both methods, and
C = 6 for Fig. 3B. The obtained images are given in Figure 3. It is noticed that
in Figures 3B1 and 3B2 some clusters are merged. In Figure 3B1 some edges of the
triangle are missing and some parts of the flower do not appear; the background
of the image and the triangle constitute the same cluster. Unlike Figures 3B1
and 3B2, Figures 3C1 and 3C2 show better shapes. The edge and the leaves of the
flower are well defined.
4 Conclusion
In this paper, a soft subspace clustering algorithm has been enhanced. New
formulations of the membership degrees and cluster centres have been developed.
The obtained results have shown a significant improvement in data clustering.
Estimating the centres and the membership degrees in the initialization step has
reduced the number of iterations, which makes the algorithm converge very fast.
The SVM algorithm used in the initialization step has improved the centre
locations. In future work, we suggest using active learning for a better
selection of the learning vector, thus improving the clustering and minimizing
the running time.
References
1. Agrawal, R., et al.: Automatic subspace clustering of high dimensional data for
data mining applications. In: SIGMOD Record ACM Special Interest Group on
Management of Data, pp. 94–105 (1998)
Estimating Clusters Centres Using Support Vector Machine 261
2. Parsons, L., Haque, E., Liu, H.: Evaluating subspace clustering algorithms. In:
Workshop on Clustering High Dimensional Data and its Applications, SIAM Int.
Conf. on Data Mining, pp. 48–56 (2004)
3. Yip, K.Y., Cheung, D.W., Ng, M.K.: A practical projected clustering algorithm.
IEEE Trans. Knowl. Data Eng. 16(11), 1387–1397 (2004)
4. Chang, H., Yeung, D.Y.: Locally linear metric adaptation with application to
semi-supervised clustering and image retrieval. Pattern Recognition 39, 1253–1264
(2006)
5. Liang, B., et al.: A novel attribute weighting algorithm for clustering high-
dimensional categorical data. Pattern Recognition 44, 2843–2861 (2011)
6. Deng, Z., et al.: Enhanced soft subspace clustering integrating within cluster and
between-cluster information. Pattern Recognition 43, 767–781 (2010)
7. Damodar, R., Janaa, P.K.: A prototype-based modified DBSCAN for gene cluster-
ing. Procedia Technology 6, 485–492 (2012)
8. Yang, A., et al.: Unsupervised segmentation of natural images via lossy data com-
pression. Comput. Vis. Image Understand 110, 212–225 (2008)
9. Vidal, R., Tron, R., Hartley, R.: Multiframe motion segmentation with missing
data using power factorization and GPCA. Int. J. Comput. Vis. 79, 85–105 (2008)
10. Domeniconi, C., et al.: Locally adaptive metrics for clustering high dimensional
data. Data Min. Knowl. Disc. 14, 63–67 (2007)
11. Xiaojun, C., et al.: A feature group weighting method for subspace clustering of
high dimensional data. Pattern Recognition 45, 434–446 (2012)
12. Sangeetha, R., et al.: Identifying Efficient Kernel Function in Multiclass Support
Vector Machines. International Journal of Computer Applications 28 (2011)
13. Vapnik, V.: An overview of statistical learning theory. IEEE Trans. on Neural
Networks (1999)
14. http://archive.ics.uci.edu/ml/
15. Haralick, R., et al.: Textural features for image classification. IEEE Transactions
on Systems, Man and Cybernetics 3(6), 610–621 (1973)
Fast Approximate Minimum Spanning Tree
Algorithm Based on K-Means
1 Introduction
Given an undirected and weighted graph, the problem of MST is to find a spanning
tree such that the sum of the edge weights is minimized. Since an MST can
roughly estimate the intrinsic structure of a dataset, it has been broadly
applied in image segmentation [1], cluster analysis [9], classification [4] and
manifold learning [8]. However, traditional MST algorithms such as Prim's and
Kruskal's algorithms have a running time of O(N²) [3], and for a large dataset a
fast MST algorithm is needed.
Recent work on finding an approximate MST can be found in [6][7], and both works
apply MSTs to clustering. Wang et al. [7] employ a divide-and-conquer scheme to
detect the long edges of the MST at an early stage for clustering. Initially,
data points are randomly stored in a list and each data point is connected to
its predecessor (or successor), so that a spanning tree is obtained. To optimize
the spanning tree, the dataset is divided into a collection of subsets with a
divisive hierarchical clustering algorithm. The distance between any pair of
data points within a subset can be computed by a brute-force nearest neighbor
search, and with these distances the spanning tree is updated.
Lai et al. [6] proposed an approximate MST algorithm based on the Hilbert curve
for clustering. It is a two-phase algorithm: the first phase constructs an
approximate MST of a given dataset with the Hilbert curve, and the second phase
partitions the dataset into subsets by measuring the densities of points along
Fig. 1. Overview of the proposed method. Divide-and-conquer stage: (a) data set,
(b) partitions by K-means, (c) MSTs of the subsets, (d) connected MSTs.
Refinement stage: (e) partitions on borders, (f) MSTs of the subsets,
(g) connected MSTs, (h) approximate MST.
the approximate MST with a specified density threshold. However, the accuracy of
the MST depends on the order of the Hilbert curve and the number of neighbors of
a visited point in the linear list.
2 Proposed Method
2.1 Overview of the Proposed Framework
The key to improving the efficiency of constructing an MST is to reduce the
number of unnecessary comparisons. For example, in Kruskal's algorithm, it is
not necessary to sort all N(N − 1)/2 edges of the complete graph but only to
find the (1 + α)N edges with the least weights, where (N − 3)/2 ≥ α ≥ −1/N. We
employ a divide-and-conquer technique to achieve the improvement. The overview
of the proposed method is illustrated in Fig. 1.
therefore expected that the subsets preserve this locality. Since K-means can
partition local neighboring data points into the same group, we employ K-means
to partition the dataset.
Divide-and-Conquer Algorithm. After the dataset is divided into √N subsets by
K-means, the MSTs of the subsets are constructed with an exact MST algorithm,
such as Prim's or Kruskal's algorithm. The K-means-based divide-and-conquer
algorithm is described as follows:
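As a rough illustration of this stage only (a sketch, not the authors' listing), the partitioning and per-subset exact MSTs could be computed with scikit-learn's KMeans and SciPy's minimum_spanning_tree:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def divide_and_conquer_msts(X):
    """Partition X into sqrt(N) subsets and build an exact MST inside each subset."""
    K = int(np.sqrt(len(X)))
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    msts = []
    for k in range(K):
        idx = np.where(labels == k)[0]
        W = squareform(pdist(X[idx]))                 # complete graph of the subset
        T = minimum_spanning_tree(W).tocoo()          # exact MST of the subset
        msts.append([(idx[i], idx[j], w) for i, j, w in zip(T.row, T.col, T.data)])
    return labels, msts                               # edges indexed w.r.t. the full dataset
```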
Fig. 2. The combine step of the MSTs in the proposed algorithm. In (a), centers
of the partitions (c1, ..., c8) are calculated. In (b), an MST of the centers,
MST_cen, is constructed with an exact MST algorithm. In (c), each pair of
subsets whose centers are neighbors with respect to MST_cen in (b) is connected.
However, the accuracy of the approximate MST achieved so far is not enough.
The reason is that, when the MST of a subset is built, the data points that lie in
the boundary of the subset are considered only within the subset but not across
the boundaries. Based on this observation, the refinement stage is designed.
Fig. 3. Boundary-based partition. In (a), the black solid points, m1 , · · · , m7 , are the
midpoints of the edges of M STcen . In (b), each data point is assigned to its nearest
midpoint, and the dataset is partitioned by the midpoints. The corresponding Voronoi
graph is with respect to the midpoints.
Build Secondary Approximate MST. After the dataset has been re-partitioned,
conquer and combine steps similar to those in the first stage are used to
produce the secondary approximate MST. The algorithm is summarized as follows:
Secondary Approximate MST (SAM)
Input: MST of the subset centers MST_cen, dataset X;
Output: Approximate MST of X, MST_2;
1. Compute the midpoint m_i of each edge e_i ∈ MST_cen, where 1 ≤ i ≤ K − 1.
2. Partition the dataset X into K − 1 subsets, S'_1, ..., S'_{K−1}, by assigning
each point to its nearest point from m_1, ..., m_{K−1}.
3. Build the MSTs, MST(S'_1), ..., MST(S'_{K−1}), with an exact MST algorithm.
4. Combine the K − 1 MSTs with CA to produce an approximate MST, MST_2.
The total running time is composed of T_DAC, T_CA and T_SAM, the time
complexities of the algorithms DAC, CA and SAM respectively, and T_COM, the
running time of an exact MST algorithm on the combination of MST_1 and MST_2.
DAC consists of two operations: partitioning the dataset X into K subsets and
constructing the MSTs of the subsets with an exact MST algorithm. Since K = √N,
we have T_DAC = O(N^1.5). In CA, computing the mean points of the subsets and
constructing the MST of the K mean points take only O(N) time. For each
connected subset pair, determining the connecting edge requires
O(2N × (K − 1)/K). The total computational cost of CA is therefore O(N).
In SAM, computing the K − 1 midpoints and partitioning the dataset take
O(N × (K − 1)) time. The running times of Steps 3 and 4 are
O((K − 1) × N²/(K − 1)²) = O(N²/(K − 1)) and O(N), respectively. Therefore, the
time complexity of SAM is O(N^1.5). The number of edges in the graph formed by
combining MST_1 and MST_2 is at most 2(N − 1). The time complexity of applying
an exact MST algorithm to this graph is only O(2(N − 1) log N). Thus,
T_COM = O(N log N). To sum up, the time complexity of the proposed fast
algorithm is O(N^1.5).
4 Experiments
In this section, experimental results are presented to illustrate the efficiency and
the accuracy of the proposed fast approximate MST algorithm. The accuracy of
FMST is tested with four datasets: t4.8k [5], MNIST [10], ConfLongDemo [11]
and MiniBooNE [11]. Experiments were conducted on a PC with an Intel Core2
2.4GHz CPU and 4GB memory running Windows 7.
4.2 Accuracy
Suppose E_appr is the set of the correct edges in an approximate MST. The edge
error rate ER_edge is defined as:

ER_{edge} = \frac{N - |E_{appr}| - 1}{N - 1}.
Fig. 4. Running time (in seconds) of FMST (combined with Prim's algorithm) and
of Prim's algorithm alone, together with the edge error rate (%) and weight
error rate (%), on the four datasets.
b             t4.8k   MNIST   ConfLongDemo   MiniBooNE
FMST           1.57    1.62       1.54          1.44
Prim's Alg.    1.88    2.01       1.99          2.00
The second measure is defined as the relative difference between the sums of the
weights of the FMST and the exact MST, called the weight error rate:

ER_{weight} = \frac{W_{appr} - W_{exact}}{W_{exact}},

where W_exact and W_appr are the sums of the weights of the exact MST and the
FMST, respectively.
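A small helper computing these two measures from the edge sets and total weights of the two trees (the interface is an assumption):

```python
def error_rates(exact_edges, approx_edges, w_exact, w_approx):
    """MST edges are given as (i, j) pairs with i < j; an exact MST of N points has N-1 edges."""
    n = len(exact_edges) + 1
    correct = len(set(exact_edges) & set(approx_edges))   # |E_appr|: correct edges in the FMST
    er_edge = (n - correct - 1) / (n - 1)
    er_weight = (w_approx - w_exact) / w_exact
    return er_edge, er_weight
```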
The edge error rates and weight error rates of the four datasets are shown in
the third row of Fig. 4. We can see that both the edge error rate and the weight
error rate decrease with the increase of the data size. For datasets with high
dimensionality, the edge error rates are bigger; for example, the maximum edge
error rates of MNIST are approximately 18.5%, while those of t4.8k and
ConfLongDemo are less than 3.2%. In contrast, the weight error rates decrease
when the dimensionality increases. This is one aspect of the curse of
dimensionality, distance concentration, which means that the Euclidean distances
between all pairs of points in high-dimensional data tend to be similar.
5 Conclusion
References
1. An, L., Xiang, Q.S., Chavez, S.: A fast implementation of the minimum spanning
tree method for phase unwrapping. IEEE Trans. Medical Imaging 19, 805–808
(2000)
2. Bezdek, J.C., Pal, N.R.: Some new indexes of cluster validity. IEEE Trans. Systems,
Man and Cybernetics, Part B 28, 301–315 (1998)
3. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms,
2nd edn. The MIT Press (2001)
4. Juszczak, P., Tax, D.M.J., Pȩkalska, E., Duin, R.P.W.: Minimum spanning tree
based one-class classifier. Neurocomputing 72, 1859–1869 (2009)
5. Karypis, G., Han, E.H., Kumar, V.: CHAMELEON: A hierarchical clustering al-
gorithm using dynamic modeling. IEEE Trans. Comput. 32, 68–75 (1999)
6. Lai, C., Rafa, T., Nelson, D.E.: Approximate minimum spanning tree clustering in
high-dimensional space. Intelligent Data Analysis 13, 575–597 (2009)
7. Wang, X., Wang, X., Wilkes, D.M.: A divide-and-conquer approach for minimum
spanning tree-based clustering. IEEE Trans., Knowledge and Data Engineering 21,
945–958 (2009)
8. Yang, L.: Building k edge-disjoint spanning trees of minimum total length for iso-
metric data embedding. IEEE Trans. Pattern Analysis and Machine Intelligence 27,
1680–1683 (2005)
9. Zhong, C., Miao, D., Wang, R.: A graph-theoretical clustering method based on
two rounds of minimum spanning trees. Pattern Recognition 43, 752–766 (2010)
10. http://yann.lecun.com/exdb/mnist
11. http://archive.ics.uci.edu/ml/
Fast EM Principal Component Analysis Image
Registration Using Neighbourhood Pixel
Connectivity
Parminder Singh Reel1 , Laurence S. Dooley1 , K.C.P. Wong1 , and Anko Börner2
1
Department of Communication and Systems, The Open University, Milton Keynes,
United Kingdom
{p.s.reel,laurence.dooley,k.c.p.wong}@open.ac.uk
2
Optical Sensor Systems, German Aerospace Center (DLR), Berlin, Germany
[email protected]
1 Introduction
Image Registration (IR) is a vital processing task in numerous applications where
the final information is obtained by combining different data sources, as for ex-
ample in computer vision, remote sensing and medical imaging [1]. The process
of IR involves the geometric transformation of a source image in order to attain
the best physical alignment with a reference target image. It applies an opti-
mization method to maximize some predefined similarity measure with known
transformations between the source and reference image set.
Similarity measures which have been proposed [1] for both mono and multi-
modal IR can be broadly categorized according to whether they are based on
EMPCA-MI [9] is a recent similarity measure for IR, which efficiently incor-
porates spatial information together with MI without incurring high computa-
tional overheads. Fig. 1 illustrates the three core processing steps involved in the
EMPCA-MI algorithm, namely: input image data rearrangement (highlighted in
yellow ) followed by EMPCA and MI calculation. Both the reference (IR ) and
source images (IS ) are pre-processed (Step I ) into vector form for a given neigh-
bourhood radius r, so the spatial and intensity information is preserved (see Fig.
1(a) and 1(b)). The first P principal components XR and XS of the respective
reference and source images are then iteratively computed using EMPCA [12] in
Step II. Subsequently, the MI [3] is calculated between XR and XS in Step III,
with a higher MI value signifying that the images are better aligned. In [9], only the first principal component is considered, i.e., P = 1, since this is the direction of highest variance and represents the most dominant feature in any region.
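The three steps can be mirrored in a short Python sketch for a pair of grayscale images; the square-window rearrangement, the EM iteration count and the histogram MI estimator are illustrative assumptions rather than the exact choices of [9].

```python
import numpy as np

def neighbourhood_vectors(img, r=1):
    """Step I (sketch): one d-dimensional vector per pixel with its (2r+1)x(2r+1) neighbourhood."""
    H, W = img.shape
    pad = np.pad(img, r, mode='edge')
    cols = [pad[dy:dy + H, dx:dx + W].ravel()
            for dy in range(2 * r + 1) for dx in range(2 * r + 1)]
    return np.stack(cols, axis=1).astype(float)       # shape (H*W, d)

def empca_first_component(X, n_iter=20):
    """Step II (sketch): first principal component via the EM algorithm for PCA [12]."""
    X = X - X.mean(axis=0)
    w = np.random.default_rng(0).standard_normal(X.shape[1])
    for _ in range(n_iter):
        c = X @ w / (w @ w)            # E-step: scores given the current direction
        w = X.T @ c / (c @ c)          # M-step: direction given the scores
    w /= np.linalg.norm(w)
    return X @ w                        # projection of every pixel vector

def mutual_information(x, y, bins=64):
    """Step III (sketch): histogram estimate of the MI between the two projections."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def empca_mi(ref, src, r=1):
    """EMPCA-MI value for two equally sized grayscale images (higher = better aligned)."""
    xr = empca_first_component(neighbourhood_vectors(ref, r))
    xs = empca_first_component(neighbourhood_vectors(src, r))
    return mutual_information(xr, xs)
```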
Fig. 1. Illustration of the EMPCA-MI algorithm [9], together with the proposed
mEMPCA-MI pre-processing step using neighbourhood 8-pixel and 4-pixel region con-
nectivity for an image pair size of 10 x 10 pixels
Once Step I has been completed, the remaining two processing steps of mEMPCA-MI are as in [9]. Fig. 2 displays two mEMPCA-MI traces for both 4-pixel and 8-pixel connectivity, together with EMPCA-MI, with respect to the θ angular rotational transformation parameter for the IR of the multimodal MRI pair T1 and T2. Fig. 2(a) shows the IR case when neither INU nor noise is present, while Fig. 2(b) reflects the challenging registration with 40% INU and Gaussian noise. It is evident that the mEMPCA-MI traces for both 4-pixel and 8-pixel neighbourhood connectivity provide smoother and higher similarity measure values for the best alignment compared with EMPCA-MI.
Fig. 2. Similarity measure value traces for EMPCA-MI and mEMPCA-MI (8-pixel and
4-pixel connectivity). (a) shows the angular rotation transformation for MRI T1 and
T2 multimodal registration (without INU and noise) and (b) with 40% INU and noise.
Interestingly 4-pixel neighbourhoods are better in both cases, since they ex-
ploit the neighbourhood relations with the strongest links leading to a corre-
spondingly higher overall MI value between IR and IS. Fig. 2 also highlights the smooth convergence of mEMPCA-MI compared with EMPCA-MI, with less oscillatory behaviour, particularly where 40% INU and noise are present. This is a
very useful feature for effective convergence of the ensuing optimization process
[1] and ultimately leads to lower IR errors.
3 Experiment Setup
To evaluate the performance of the mEMPCA-MI similarity measure, a series
of multimodal IR experiments were undertaken. Multimodal MRI T1 and T2
datasets from BrainWeb Database [13] were chosen due to their challenging char-
acteristics of varying INU and noise artefacts with the corresponding parameter
details being defined in Table 1. To simulate a range of applications and analyse
the robustness of mEMPCA-MI, both Lena and Baboon images have also been
used with a simulated INU function Z [14]. Finally, Gaussian noise has been
added to all the datasets. The IR experiments were classified into four separate
scenarios representing monomodal, multimodal and two generic registrations.
4 Results Discussion
Table 2 shows the IR error results for all four Scenarios in terms of the percent-
age translation (ΔX, ΔY ) and angular rotational (Δθ) errors. To clarify the
nomenclature adopted in Table 2; T1+α20 for example, represents an MRI T1
image slice with 20% INU, while Bb+Z +β refers to the Baboon image with INU
and Gaussian noise artefacts. The results confirm the mEMPCA-MI algorithm
using both 8-pixel and 4-pixel neighbourhood connectivity consistently provides
better registration than the EMPCA-MI model for both mono and multimodal MRI T1 and T2 images, both with and without INU and noise present.
For example, in monomodal IR Scenario 1 with 40% INU and noise present
(T1+α40 +β), 8-pixel and 4-pixel connectivity mEMPCA-MI provide percentage
errors for the (ΔX, ΔY, Δθ) parameters of (7.84, 9.59, 0.48 ) and (7.45, 9.28,
0.46 ) respectively which are both lower than the corresponding EMPCA-MI
error (8.0, 10.0, 0.58 ). Similar performance improvements are also evident for
Lena and Baboon images.
Table 3. Average Runtimes (ART ) Results (in ms) for Different Scenarios
This corroborates the fact that the mEMPCA-MI algorithm using both 8-pixel and 4-pixel connectivity in the pre-processing step more accurately reflects neighbourhood spatial information by considering a second-order representation of region pixel values with respect to the centre pixel of the sliding window. The
results also reveal that the IR error performance of mEMPCA-MI with 4-pixel
neighbourhood connectivity is consistently lower than 8-pixel connectivity across
all four Scenarios. Particularly striking is the performance achieved for the challenging MRI T1 and T2 multimodal registration in Scenario 2, in the presence of both INU and noise. This reflects that 4-pixel neighbourhood connectivity exploits the direct pixel relations, providing more relevant spatial information about the local neighbourhood for the subsequent EMPCA and MI computation. In
contrast, 8-pixel connectivity also considers weaker indirect neighbours, which
marginally reduces the corresponding principal component values leading to a
lower MI between the image pair.
Table 3 displays the average runtimes (ART ) for both EMPCA-MI and
mEMPCA-MI. While ART is a resource dependent metric, it concomitantly pro-
vides an insightful time complexity comparator between similarity measures. As
illustrated in Fig. 1, since the data dimensionality of mEMPCA-MI with 4-pixel
connectivity is reduced to 5 from 9 for both 8-pixel connectivity and EMPCA-MI
[12], the corresponding ART values are considerably lower, i.e., 95ms compared
to 144ms for 8-pixel connectivity and 152ms for EMPCA-MI to determine only
the first principal component for Scenarios 1 and 2. A similar trend in the ART
values is observed in Scenarios 3 and 4, though these datasets have a different
spatial resolution compared to Scenarios 1 and 2. Overall, the ART results re-
veal a notable improvement in computational efficiency for mEMPCA-MI using
4-pixel neighbourhood connectivity, allied with superior IR robustness to both
INU and noise for both mono and multimodal image datasets.
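The dimensionality argument behind Table 3 can be illustrated as follows; this sketch only reproduces the 5- versus 9-dimensional layout of the per-pixel vectors (centre pixel plus its 4- or 8-connected neighbours) and is not the authors' exact second-order rearrangement.

```python
import numpy as np

# Offsets for the two connectivities (centre pixel first): d = 5 with 4-connectivity,
# d = 9 with 8-connectivity, which is the dimensionality gap behind the ART values.
OFFSETS_4 = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
OFFSETS_8 = OFFSETS_4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def connectivity_vectors(img, offsets):
    """Illustrative pre-processing only: stack, per pixel, the centre and connected
    neighbour intensities (the paper's second-order representation may differ)."""
    H, W = img.shape
    pad = np.pad(img, 1, mode='edge')
    cols = [pad[1 + dy:1 + dy + H, 1 + dx:1 + dx + W].ravel() for dy, dx in offsets]
    return np.stack(cols, axis=1).astype(float)       # shape (H*W, 5) or (H*W, 9)
```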
5 Conclusion
This paper has presented a neighbourhood connectivity based modification to
the existing Expectation Maximisation for Principal Component Analysis with
MI (EMPCA-MI) similarity measure. Superior and enhanced robust image reg-
istration performance in the presence of both INU and Gaussian noise has
been achieved by incorporating second-order neighbourhood region information
compared with the grayscale value rearrangement in the original EMPCA-MI
References
1. Zitová, B., Flusser, J.: Image registration methods: a survey. Image and Vision
Computing 21(11), 977–1000 (2003)
2. Pluim, J., Maintz, J., Viergever, M.: Mutual-information-based registration of med-
ical images: a survey. IEEE Transactions on Medical Imaging 22(8), 986–1004
(2003)
3. Collignon, Maes, F., Delaere, D., Vandermeulen, D., Suetens, P., Marchal, G.:
Automated multi-modality image registration based on information theory. Imag-
ing 3(1), 263–274 (1995)
4. Viola, P., Wells, W.M.: Alignment by maximization of mutual information. In:
Proceedings of the Fifth International Conference on Computer Vision, pp. 16–23.
IEEE (June 1995)
5. Studholme, Hill, D., Hawkes, D.J.: An overlap invariant entropy measure of 3D
medical image alignment. Pattern Recognition 32(1), 71–86 (1999)
6. Simmons, A., Tofts, P.S., Barker, G.J., Arridge, S.R.: Sources of intensity nonuni-
formity in spin echo images at 1.5 t. Magnetic Resonance in Medicine 32(1),
121–128 (1994)
7. Russakoff, D.B., Tomasi, C., Rohlfing, T., Maurer Jr., C.R.: Image similarity using
mutual information of regions. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004.
LNCS, vol. 3023, pp. 596–607. Springer, Heidelberg (2004)
8. Yang, C., Jiang, T., Wang, J., Zheng, L.: A neighborhood incorporated method in
image registration. In: Yang, G.Z., Jiang, T.-Z., Shen, D., Gu, L., Yang, J. (eds.)
MIAR 2006. LNCS, vol. 4091, pp. 244–251. Springer, Heidelberg (2006)
9. Reel, P.S., Dooley, L.S., Wong, K.C.P.: A new mutual information based similarity
measure for medical image registration. In: IET Conference on Image Processing
(IPR 2012), pp. 1–6 (July 2012)
10. Reel, P.S., Dooley, L.S., Wong, K.C.P.: Efficient image registration using fast prin-
cipal component analysis. In: 19th IEEE International Conference on Image Pro-
cessing (ICIP 2012), Lake Buena Vista, Orlando, Florida, USA, pp. 1661–1664.
IEEE (September 2012)
11. Reel, P., Dooley, L., Wong, P., Börner, A.: Robust retinal image registration us-
ing expectation maximisation with mutual information. In: 38th IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013),
Vancouver, Canada, pp. 1118–1122. IEEE (May 2013)
12. Roweis, S.: EM algorithms for PCA and SPCA. In: Proceedings of the 1997 Con-
ference on Advances in Neural Information Processing Systems 10, NIPS 1997, pp.
626–632. MIT Press, Cambridge (1998)
13. Collins, D.L., Zijdenbos, A.P., Kollokian, V., Sled, J.G., Kabani, N.J., Holmes,
C.J., Evans, A.C.: Design and construction of a realistic digital brain phantom.
IEEE Transactions on Medical Imaging 17(3), 463–468 (1998)
14. Garcia-Arteaga, J.D., Kybic, J.: Regional image similarity criteria based on the
kozachenko-leonenko entropy estimator. In: IEEE Computer Society Conference
on Computer Vision and Pattern Recognition Workshops, CVPRW 2008, pp. 1–8.
IEEE (June 2008)
Fast Unsupervised Segmentation Using Active
Contours and Belief Functions
1 Introduction
can improve the segmentation based AC models for vector-valued images [2]. An-
other reason for failed segmentations is due to the local or global minimizer for
AC models [3]. To overcome these difficulties, the evidential framework appears
to be a new way to improve segmentation based AC models for vector valued im-
ages [4,5,6]. The Dempster-Shafer (DS) framework [7] has been combined with
either a simple thresholding [4], a clustering algorithm [8], a region merging al-
gorithm [5], or an AC algorithm [6]. In this paper we propose to use the evidential framework [7] to fuse several sources of statistical knowledge into a new descriptor and to incorporate this descriptor in the formulation of the AC models. The fusion of information issued from different feature channels, e.g., color channels and texture, offers an alternative to the Bayesian framework [9]. Instead of fusing separate probability densities, the evidential framework can represent both imprecision and uncertainty. This concept is represented using Belief Functions
(BFs) [7,10,11,12], which are particularly well suited to represent information from partial and unreliable knowledge. The use of BFs as an alternative to probabilities in the segmentation process can be very helpful in reducing uncertainties and imprecisions through the conjunctive combination of neighboring pixels. First, it allows us to reduce the noise and, secondly, to highlight conflicting areas, mainly present at the transitions between regions where the contours occur. In addition, BFs have the advantage of manipulating not only singletons but also disjunctions. This gives the ability to represent both uncertainties and imprecisions explicitly. The disjunctive combination allows transferring both uncertain and imprecise information onto disjunctions [7,12]. Finally, the conjunctive combination is applied to reduce uncertainties due to noise while maintaining the representation of imprecise information on disjunctions at the boundaries between areas. In this paper, we highlight the advantage of the evidential framework by defining a new descriptor based on BFs and incorporating it in the formulation of the AC models. In Section 2, we review the Dempster-Shafer concept in order to define our evidential descriptor. In Section 3 we propose a fast algorithm for our segmentation model based on split Bregman. In Section 4, we demonstrate the advantages of the proposed method by applying it to some challenging images.
$$m(\emptyset) = 0, \qquad \sum_{\Omega_i \subseteq \Omega} m(\Omega_i) = 1$$
$$BFs(\Omega_{II}) = \sum_{\Omega_i \subseteq \Omega_{II}} m(\Omega_i), \qquad Pl(\Omega_{II}) = \sum_{\Omega_i \cap \Omega_{II} \neq \emptyset} m(\Omega_i) \qquad (1)$$
When m(Ω_i) > 0, Ω_i is called a focal element [5,7]. The relation between the mass function, BFs and Pl can be described as:
The independent masses m_j are defined within the same frame of discernment as:
$$m\!\left(\Omega_{i=\{1,\dots,n\}}\right) = m_1\!\left(\Omega_{i=\{1,\dots,n\}}\right) \otimes m_2\!\left(\Omega_{i=\{1,\dots,n\}}\right) \otimes \cdots \otimes m_m\!\left(\Omega_{i=\{1,\dots,n\}}\right) \qquad (3)$$
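A toy illustration of how such masses combine is given below; the two-hypothesis frame, the numeric masses and the use of the unnormalised conjunctive rule (whose mass on the empty set exposes the conflict between sources) are purely illustrative choices.

```python
from itertools import product

def conjunctive_combination(m1, m2):
    """Unnormalised conjunctive rule on two mass functions given as dicts
    mapping frozensets of hypotheses to weights."""
    out = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        out[a & b] = out.get(a & b, 0.0) + wa * wb
    return out

def belief(m, A):
    return sum(w for B, w in m.items() if B and B <= A)

def plausibility(m, A):
    return sum(w for B, w in m.items() if B & A)

# Toy frame {foreground, background}: two channels giving partially conflicting evidence.
FG, BG = frozenset({'in'}), frozenset({'out'})
m_color   = {FG: 0.6, BG: 0.2, FG | BG: 0.2}   # the disjunction FG|BG carries imprecision
m_texture = {FG: 0.3, BG: 0.5, FG | BG: 0.2}
m = conjunctive_combination(m_color, m_texture)
print(m.get(frozenset(), 0.0), belief(m, FG), plausibility(m, FG))  # conflict, Bel, Pl
```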
In equation (6), the first energy term corresponds to the geometric properties
of P (Ω) = {Ωin , Ωout }. Ωin and Ωout correspond respectively to the fore-
ground and background region to be extracted. The data-fidelity energy term
Equation (8) can be handled because P(Ω) = {Ω_in, Ω_out} is a focal element. When the foreground/background regions are disjoint (Ω_in ∩ Ω_out = ∅), then:
The pdfs p_in^j and p_out^j are estimated for all channels using the Parzen kernel [13]. Our proposed method uses the total belief committed to the foreground or background region. In the next section we propose a fast version of our segmentation algorithm.
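A minimal sketch of such a Parzen estimate for one channel, assuming 8-bit intensities, a Gaussian kernel and an arbitrary bandwidth; the toy samples stand in for the pixel values currently inside and outside the contour.

```python
import numpy as np

def parzen_pdf(samples, grid, sigma=5.0):
    """Parzen (Gaussian kernel) estimate of a channel pdf on the given intensity grid."""
    diff = grid[:, None] - np.asarray(samples, dtype=float)[None, :]
    k = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

rng = np.random.default_rng(0)
inside_pixels = rng.normal(90, 10, 500)      # toy values for one channel inside the contour
outside_pixels = rng.normal(170, 15, 500)    # toy values outside the contour
grid = np.arange(256.0)
p_in, p_out = parzen_pdf(inside_pixels, grid), parzen_pdf(outside_pixels, grid)
```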
$$\min_{\chi, d} E(\chi, d) = \int_{\Omega} |d(x)|\, dx + \lambda_{in} \int_{\Omega} V^{in}_{BFs}\, \chi(x)\, dx - \lambda_{out} \int_{\Omega} V^{out}_{BFs}\, \chi(x)\, dx \qquad (12)$$
where the velocity V^{in/out}_{BFs} is calculated using the Eulerian derivative of E_data in the direction ξ as follows:
$$\left\langle \frac{\partial E_{data}\!\left(\Omega_{in/out}(t), I\right)}{\partial t}, \xi \right\rangle = \int_{\partial\Omega} V^{in/out}_{BFs}\, \left\langle \xi(s), N(s) \right\rangle ds \qquad (13)$$
4 Results
We introduced an AC model that incorporates BFs as statistical region knowledge. To illustrate and demonstrate the accuracy of our segmentation method, we present some results of our method and compare them to the segmentation of vector-valued images done by the traditional AC model and by the model proposed in [6]. The three methods are evaluated on 20 color images taken from the Berkeley segmentation dataset [15] using the F-measure criterion. The traditional segmentation and the method in [6] are initialized by a contour curve around the object to be segmented, while our method is initialization-free. The segmentations produced by the three methods are presented for three challenging images (see Fig. 1). The accuracy of the segmentation is reported in terms of Precision/Recall. The proposed method gives the best segmentation, and its F-measure is better than that of the other methods (see Table 1).
Fig. 1. Images taken from the Berkeley Segmentation benchmark dataset [15]. From left to right: in red, the segmentation done by our model; in blue, the segmentation done by the model proposed in [6]; in yellow, the segmentation done by the traditional vector-valued AC model with the KL distance.
5 Conclusion
We have investigated the use of the evidential framework for the AC model using Dempster-Shafer (DS) theory. In particular, we have investigated how to calculate the mass functions using the Parzen kernel, which is a difficult task. The results have shown that our proposed approach gives the best segmentation for color and textured images. The experimental results show that the segmentation performance is improved by using the three information sources to represent the same image, compared with the use of a single information source. However, there are some drawbacks to our proposed method. Our way of calculating mass functions is highly time consuming when the number of channels increases. Furthermore, the search for other optimal models to estimate mass functions in DS theory, and for handling the imprecision coming from the different image channels, is an important area for future research.
References
1. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level
set segmentation: Integrating color, texture, motion and shape. Int. J. Comput.
Vision 72(2), 195–215 (2007)
2. Chan, T.F., Sandberg, B.Y., Vese, L.A.: Active contours without edges for vector-
valued images. J. of Vis. Communi. and Image Repres. 11, 130–141 (2000)
3. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J.P., Osher, S.: Fast global
minimization of the active contour/snake model. J. Math. Imaging Vis. 28(2),
151–167 (2007)
4. Rombaut, M., Zhu, Y.M.: Study of dempster–shafer theory for image segmentation
applications. Image and Vision Computing 20(1), 15–23 (2002)
5. Lelandais, B., Gardin, I., Mouchard, L., Vera, P., Ruan, S.: Using belief function
theory to deal with uncertainties and imprecisions in image processing. In: Denœux,
T., Masson, M.-H. (eds.) Belief Functions: Theory & Appl. AISC, vol. 164, pp.
197–204. Springer, Heidelberg (2012)
6. Scheuermann, B., Rosenhahn, B.: Feature quarrels: The dempster-shafer evi-
dence theory for image segmentation using a variational framework. In: Kimmel,
R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part II. LNCS, vol. 6493, pp.
426–439. Springer, Heidelberg (2011)
7. Dempster, A.P., Chiu, W.F.: Dempster-shafer models for object recognition and
classification. Int. J. Intell. Syst. 21(3), 283–297 (2006)
8. Masson, M.H., Denoeux, T.: Ecm: An evidential version of the fuzzy c. Pattern
Recognition 41(4), 1384–1397 (2008)
9. Vannoorenberghe, P., Colot, O., de Brucq, D.: Color image segmentation using
dempster-shafer’s theory. In: ICIP (4), pp. 300–303 (1999)
10. Cuzzolin, F.: A geometric approach to the theory of evidence. IEEE Trans. on
Syst., Man, and Cyber., Part C 38(4), 522–534 (2008)
11. Denoeux, T.: Maximum likelihood estimation from uncertain data in the belief
function framework. IEEE Trans. Knowl. Data Eng. 25(1), 119–130 (2013)
12. Appriou, A.: Generic approach of the uncertainty management in multisensor fu-
sion processes. Revue Traitement du Signal 22(2), 307–319 (2005)
13. Parzen, E.: On estimation of a probability density function and mode. The Annals
of Mathematical Statistics 33(3), 1065–1076 (1962)
14. Goldstein, T., Bresson, X., Osher, S.: Geometric applications of the split breg-
man method: Segmentation and surface reconstruction. J. Sci. Comput. 45(1-3),
272–293 (2010)
15. Martin, D.R., Fowlkes, C.C., Malik, J.: Learning to detect natural image bound-
aries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal.
Mach. Intell. 26(5), 530–549 (2004)
Flexible Hypersurface Fitting with RBF Kernels
Keywords: feature space, fitting, kernel PCA, RBF kernel, Hilbert space.
1 Introduction
In the fields of computer vision and machine learning, various nonlinear prob-
lems are reduced to linear problems in feature space with feature mappings. To obtain such feature mappings, RBF kernel functions are widely used because they have a great advantage called the kernel trick [4]. Since the kernel trick changes a searching problem in feature space into a problem in the
space spanned by the sample data, the dimension of the search space also changes from that of the feature space to the number of samples; the trick is therefore effective when the dimension of the feature space is very large compared to the number of samples. In other words, the kernel trick makes high-dimensional problems free from the curse of dimensionality. The justification of the kernel trick is given by the representer theorem [7]. The representer theorem ensures that the optimal estimator can be represented by a linear combination of kernel functions evaluated at the sample points (functions). When the theorem holds, an estimation in the infinite-dimensional feature space corresponding to RBF kernels changes to an estimation in a finite-dimensional sample space. This is why RBF kernels are widely used to treat infinite dimensions.
One of the purposes of this paper is to apply the kernel trick to hypersurface fitting. Wahba [7] gives the representer theorem in the case of regression, which can be regarded as a kind of hypersurface fitting. However, hypersurface fitting based on regression is not geometrical. For example, the fitting result is not invariant under a rotation of coordinates. The reason the fitting is not geometrical is that it has a special variable called the target variable, so the variables (coordinates) are not all treated equivalently, and hence the fitting result is not geometrical.
On the other hand, line fitting based on a principal component analysis (PCA)
is geometrical, that is, the fitting result is invariant under rotation of coordinates.
PCA is kernelized as a kernel PCA [3], which is widely utilized in pattern recogni-
tion, and a kernel PCA satisfies the representer theorem. The line fitting method
based on PCA is extended to nonlinear hypersurface fitting methods [1]. In the
extended methods, the fitting hypersurface is represented in an inner product form a⊤F(x) = 0, where F(x) is a function of the coordinates that represents a set of hypersurfaces. The parameter vector a is estimated so as to minimize the (weighted) sum of (a⊤F(x))² over the observed data, in general.
The extended method amounts to subtracting the one-dimensional space of the smallest principal component from the feature space. Such a subtraction of the smallest principal component is called a minor component analysis (MCA) [6]. From this point of view, to establish a geometrical hypersurface fitting method, the MCA should be given in a kernel formulation such as a kernel MCA (KMCA). However, the KMCA does not satisfy the representer theorem. This fact can be explained as follows. If the KMCA satisfied the representer theorem, it could be represented in a kernel formulation, and the KMCA corresponding to an infinite-dimensional feature space would exist. But in an infinite-dimensional feature space, the dimension of the eigenspace corresponding to the zero eigenvalue is also infinite. This means that the parameter vector which describes the fitting hypersurface cannot be determined uniquely, and there are infinitely many possible parameter vectors. This paper gives a flexible hypersurface fitting method by utilizing this indeterminacy.
$$\underset{(n\times 1)}{a} = \sum_{d=1}^{D} \alpha_{[d]}\, F_{[d]} = \underset{(n\times D)}{F}\ \underset{(D\times 1)}{\alpha}.$$
If the representer theorem holds, it is guaranteed that the global minimum exists in the linear combination; if not, this is not guaranteed. As explained later, the representer theorem does not hold for hypersurface fitting.
Let K be the D × D matrix with (K)_{ij} = k(x_{[i]}, x_{[j]}), and let the columns of K be defined by K = (K_{[1]} · · · K_{[D]}). Since K = F⊤F, there holds K_{[d]} = F⊤F_{[d]}, and (a⊤F_{[d]})² = α⊤ K_{[d]} K_{[d]}⊤ α.
Then the coefficient vector α minimizes
$$\sum_{d=1}^{D} \alpha^{\top} K_{[d]} K_{[d]}^{\top} \alpha = \alpha^{\top} K \alpha. \qquad (1)$$
The coefficient vector α is thus the eigenvector corresponding to the minimum eigenvalue of the matrix K.
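In practice the fit therefore reduces to an eigen-decomposition of the kernel matrix. A minimal Python sketch, assuming a Gaussian RBF kernel and simply taking the eigenvector of the smallest eigenvalue (the paper later exploits the freedom of choosing among small-eigenvalue eigenvectors, and σ must be chosen as discussed in Section 5):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_hypersurface_fit(X, sigma=1.0):
    """Kernelised MCA-style fit (sketch): alpha is the eigenvector of K for the smallest
    eigenvalue, and the fitted hypersurface is the zero set of f(x) = sum_d alpha_d k(x, x_d)."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    eigvals, eigvecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    alpha = eigvecs[:, 0]                       # eigenvector of the minimum eigenvalue
    def f(x):
        k_x = np.exp(-cdist(np.atleast_2d(x), X, 'sqeuclidean') / (2 * sigma ** 2))
        return (k_x @ alpha).item()
    return f

# Toy usage: points near a parabola; |f| is small near the fitted curve, larger away from it.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 60)
P = np.column_stack([x, x ** 2 + 0.05 * rng.standard_normal(60)])
f = rbf_hypersurface_fit(P, sigma=1.0)
print(f([0.0, 0.0]), f([0.0, 3.0]))             # near-surface vs off-surface values
```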
Fig. 1. Quadratic curve fitting for parabola: with (top) and without (bottom) noises
Fig. 2. Equidistant lines from points to Vk : with (top) and without (bottom) noises
$$E(a(x)) = \sum_{d=1}^{D} \langle a \mid v_{[d]} \rangle \langle v_{[d]} \mid a \rangle. \qquad (2)$$
As already discussed, the representer theorem does not hold for hypersurface fitting, but we find an appropriate |a⟩ in the linear space spanned by square-integrable functions in the Hilbert space as |a⟩ = Σ_{d=1}^{D} α_d |v_{[d]}⟩ = |V⟩α, where |V⟩ = (|v_{[1]}⟩ · · · |v_{[D]}⟩).
Let k(v, w) be an RBF kernel function with k(v, w) = ⟨v | w⟩, and let K be the D × D matrix whose ij-component is defined as (K)_{ij} = ⟨v_{[i]} | v_{[j]}⟩.
There holds K = ⟨V | V⟩, where ⟨V| collects the bras ⟨v_{[1]}|, ..., ⟨v_{[D]}|, and the d-th column of K is represented as K_{[d]} = ⟨V | v_{[d]}⟩. Since there holds
$$\langle a \mid v_{[d]} \rangle \langle v_{[d]} \mid a \rangle = \alpha^{\top} \langle V \mid v_{[d]} \rangle \langle v_{[d]} \mid V \rangle \alpha = \alpha^{\top} K_{[d]} K_{[d]}^{\top} \alpha,$$
Eq. (2) is rewritten as
$$\sum_{d=1}^{D} \alpha^{\top} K_{[d]} K_{[d]}^{\top} \alpha = \alpha^{\top} K \alpha$$
without utilizing the mapping to the Hilbert space. That is, the energy function has the same representation as in the Riemannian-space case.
5 Discussion
This paper gives a flexible hypersurface fitting method for a set of samples drawn from some hypersurface. It is shown by some simulations that our method works well, but it is not easy to choose both the best eigenvector and the width parameter σ systematically. In order to choose both of them, some new criteria are needed. Nevertheless, our method is worth using as a first approach, as we have shown in this paper.
Fig. 4. Examples of flexible fitting: points (top) and results (middle and bottom)
References
1. Fujiki, J., Akaho, S.: Hypersurface fitting via Jacobian nonlinear PCA on Rieman-
nian space. In: Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A., Kropatsch,
W. (eds.) CAIP 2011, Part I. LNCS, vol. 6854, pp. 236–243. Springer, Heidelberg
(2011)
2. R Development Core Team, R: A language and environment for statistical comput-
ing, R Foundation for Statistical Computing, Vienna, Austria (2008) ISBN3-900051-
07-0, http://www.R-project.org
3. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
4. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization, and Beyond. MIT Press (2001)
5. Tsuda, K.: Subspace Classifier in the Hilbert Space. Pattern Recognition Letters 20,
513–519 (1999)
6. Xu, L., Oja, E., Suen, C.: Modified Hebbian learning for curve and surface fitting.
Neural Networks 5(3), 441–457 (1992)
7. Wahba, G.: Spline Models for Observational Data. SIAM (1990)
Gender Classification Using Facial Images
and Basis Pursuit
1 Introduction
Gender classification is an important task in social activities and communications. In
fact, automatically identifying gender is useful for many applications, e.g. security
surveillance [4] and statistics about customers in places such as movie theaters, build-
ing entrances and restaurants [3]. Automatic gender classification is performed based
on facial features [8], voice [10], body movement or gait [23].
Most of the published work in gender classification is based on facial images.
Moghaddam et al. [16] used Support Vector Machines (SVMs) for gender classification
from facial images. They used low resolution thumbnail face images (21 × 12 pixels).
Wu et al. [21] presented a real time gender classification system using a Look-Up-Table
Adaboost algorithm. They extracted demographic information from human faces. Gol-
lomb et al. [8] developed a neural network based gender identification system. They
used face images with resolution of 30x30 pixels from 45 males and 45 females to
train a fully connected two-layer neural network, SEXNET. Cottrell and Metcalfe [6]
used neural networks for face emotion and gender classification from facial images.
Gutta and Wechsler [9] used hybrid classifiers for gender identification from facial
images. The authors proposed a hybrid approach that consists of an ensemble of RBF
neural networks and inductive decision trees. Yu et al. [23] presented a study of gen-
der classification based on human gait. They used model-based gait features such as
height, frequency and angle between the thighs. Face-based gender classification is still an attractive research area and there is room for developing novel algorithms that are
more robust, more accurate and faster. In this paper, we present a novel method for gender classification based on sparse representation and basis pursuit.
Over the past few years, the theory of sparse representation has been used in various
practical applications in signal processing and pattern recognition [7]. A sparse repre-
sentation of a signal can be achieved by representing the signal as a linear combination
of a relatively few base elements in a basis or an overcomplete dictionary [2]. Sparse
representation has been used for compression [1], denoising [19], and audio and im-
age analysis [14]. However, its use in recognition and classification is relatively new.
Wright et al. [20] proposed a classification algorithm for face recognition based on a
sparse representation. The reported results for face recognition are encouraging enough
to extend this concept to other areas such as gender classification. In addition, Patal et
al. [17] proposed a face recognition algorithm based on dictionary learning and sparse
representation. A dictionary is learned for each class based on given training samples.
The test sample is projected onto the span of the training data in each learned dictionary.
The main idea of using sparse representation for recognition and classification is to
represent the test data as a linear combination of training data. The set of coefficients
of the linear representation is called the weight vector. If we assume that there are many subjects in the database, the test data will only be related to one of the subjects. Therefore,
in sparse representation, the weight vector should be sparse and it is important to find
the sparsest solution. To find the weight vector, we use basis pursuit as described in the
next section.
In this paper, we present a gender classification system based on 2-D facial images
and sparse representation. This paper is organized as follows: In Section 2, we present
a brief mathematical explanation of the sparse representation concept and the proposed
method based on basis pursuit to obtain the sparsest solution. Section 3 presents exper-
imental results that demonstrate the performance of the proposed method in terms of
recognition. Conclusions and future research directions are discussed in Section 4.
vi,j , where j = 1, ..., ni , is a column vector that represents the features extracted from
the training data sample j of subject i.
It is assumed that a test data from class i can be represented as a linear combination
of the training data from that class [20]:
where y ∈ Rm is the feature vector of the test data and the α values are the coefficients
corresponding to the training data samples of subject i. Concatenating the matrices
Ai , i = 1, 2, ..., k yields:
y = Ax0 ∈ Rm (4)
where x0 is the coefficient vector. By solving this equation for x0 , the class of the test
data y can be identified (as described in the next section).
$$(\ell_2): \quad \hat{x}_2 = \arg\min \|x\|_2 \quad \text{subject to} \quad y = Ax \qquad (5)$$
where x̂₂ is the solution, which can be obtained simply by computing the pseudo-inverse of A. However, this solution does not contain useful information for recognition. In recognition, the test data belongs to one of the classes represented in the dictionary. Therefore, the obtained answer should be sparse (i.e., only the few elements that correspond to the training samples of the correct class are non-zero). The sparsest solution of y = Ax can be obtained by minimizing the ℓ0 norm as follows [20]:
$$(\ell_0): \quad \hat{x}_0 = \arg\min \|x\|_0 \quad \text{subject to} \quad y = Ax \qquad (6)$$
where ‖·‖₀ is the zero norm, which counts the number of non-zero elements of x. However, finding the minimum ℓ0 norm is not an easy task, and it becomes harder as the dimensionality increases since a combinatorial search is needed. Furthermore, noise affects the solution because the noise magnitude can significantly change the ℓ0 norm of a vector. In this paper, Basis Pursuit (BP) is used to find the sparsest solution, i.e., the solution vector x that has the smallest number of non-zero elements.
$$\hat{x}_1 = \arg\min \|x\|_1 \quad \text{subject to} \quad y = Ax \qquad (7)$$
Because |x₁| + ... + |xₙ| is a nonlinear function, the optimization problem cannot be solved directly using linear programming methods. To make this function linear, the nonlinearities should be changed into constraints by adding new variables as follows:
where ri (y) is the residual distance for class i. This signifies that the classification is
performed based on the best approximation and least error [20].
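One standard linearisation splits x into non-negative parts, x = u − v with u, v ≥ 0, and minimises the sum of their entries; the Python sketch below solves the resulting linear program with SciPy purely as an illustration of Eq. (7), not as the solver used in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Basis pursuit as a linear program: minimise sum(u + v) subject to A(u - v) = y."""
    m, n = A.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n), method='highs')
    u, v = res.x[:n], res.x[n:]
    return u - v

# Toy check: recover a 2-sparse vector from an underdetermined system.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 60))
x_true = np.zeros(60)
x_true[[5, 17]] = [1.5, -2.0]
x_hat = basis_pursuit(A, A @ x_true)
print(np.flatnonzero(np.abs(x_hat) > 1e-6))     # expected support: [ 5 17]
```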
Here, we propose a new approach to perform classification using x̂₁. In gender classification, there are only two classes, and the dictionary contains training face images for males and females as representatives of these two classes. The obtained elements of x̂₁ are the coefficients associated with each training face image, and we can divide x̂₁ into two vectors, x₁ and x₂, where x₁ contains the coefficients associated with males and x₂ contains the coefficients associated with females. x̂₁ can then be written as:
$$\hat{x}_1 = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
The length of x̂₁ is m, and the numbers of training samples for males and females are equal. Hence, the lengths of x₁ and x₂ are both m/2.
Let x_max be the maximum value of the elements of x̂₁ (x_max = max(x̂₁)). Then, a threshold x_max/τ, where τ ≥ 1, is defined. The elements in x₁ and x₂ whose values exceed the threshold are counted. The classification is performed based on the majority vote of these coefficients.
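The voting rule reads, in sketch form (the value of τ and the tie-breaking are illustrative assumptions):

```python
import numpy as np

def classify_gender(x_hat, tau=4.0):
    """Majority vote over the basis pursuit coefficients: the first half of x_hat is
    assumed to correspond to male training images and the second half to female ones."""
    half = len(x_hat) // 2
    x1, x2 = x_hat[:half], x_hat[half:]
    thr = x_hat.max() / tau                     # threshold x_max / tau, tau >= 1
    votes_male = int(np.sum(x1 > thr))
    votes_female = int(np.sum(x2 > thr))
    return 'male' if votes_male >= votes_female else 'female'
```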
3 Gabor Wavelets
The Gabor filters (kernels) with orientation μ and scale ν are defined as [22]
$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2}\, e^{-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}} \left[ e^{i\, k_{\mu,\nu} \cdot z} - e^{-\sigma^2/2} \right] \qquad (12)$$
where z = (x, y) is the pixel position, and the wave vector k_{μ,ν} is defined as k_{μ,ν} = k_ν e^{iφ_μ}, with k_ν = k_max / f^ν and φ_μ = πμ/8. k_max is the maximum frequency, and f is the spacing factor between kernels in the frequency domain. The ratio of the Gaussian window width to the wavelength is determined by σ. Considering Eq. 12, the Gabor kernels can be generated from one wavelet, i.e., the mother wavelet, by scaling and rotation via the wave vector k_{μ,ν} [13]. In this work, we used five scales, ν ∈ {0, ..., 4}, and eight orientations, μ ∈ {0, ..., 7}. We also used σ = 2π, k_max = π/2 and f = √2.
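A short Python sketch of Eq. (12) with these parameter values; the kernel size and the commented feature-extraction step are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(mu, nu, size=31, sigma=2 * np.pi, kmax=np.pi / 2, f=np.sqrt(2)):
    """Gabor kernel of Eq. (12) for orientation mu in 0..7 and scale nu in 0..4."""
    k = kmax / f ** nu
    phi = np.pi * mu / 8
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    z2 = x ** 2 + y ** 2
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * z2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)   # DC-compensated
    return envelope * carrier

# Illustrative feature extraction: magnitudes of the 40 responses (8 orientations x 5 scales).
# from scipy.signal import fftconvolve
# responses = [np.abs(fftconvolve(img, gabor_kernel(m, n), mode='same'))
#              for n in range(5) for m in range(8)]
```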
Fig. 1. Sample images for both males and females in FERET database
The FERET database [18] is used to validate the proposed method. Images are frontal faces at a resolution of 256x384 with 256 gray levels. All the images are preprocessed before applying the algorithm. First, the automatic eye-detection method of [11] is applied and the distance, d, between the two eye corners is measured. Then, the middle point between the two eye corners is found and the image is cropped to a size of 2d × 2d. All images are then resized to 128x128. A few sample face images
for both male and female subjects are shown in Fig. 1. In this database, there are 250
male subjects and 250 female subjects. As previously stated, in sparse classification,
the training samples are used to build a dictionary, which is used during the classifica-
tion to represent a test sample as a linear combination of the training samples. Since
we are using majority voting for making a decision between the two categories, the
number of training samples for males and females should be equal. In addition, to com-
pare our results with other methods, especially [12], four experiments are conducted
with different number of subjects used for training in each experiment, sizes of: 50,
100, 150 and 200 subjects were used for training. In each experiment, the remaining
subjects are used for testing. For instance, when using 200 subjects for training (100
male subjects and 100 female subjects), the other 300 subjects are used for testing. In
the feature extraction step, Gabor wavelets are extracted for 8 orientations and 5 spatial
frequencies. Finally, using PCA, the number of features used to represent each image
is reduced to 128. In the proposed approach, for each test data, the sparsest coefficient
vector x>1 is obtained based on the basis pursuit. Majority voting is then used to recog-
nize the gender of the test subject. We provide a comparison of the experimental results
with other gender classification systems applied to the same dataset. Table. 1 shows the
classification rates for 4 different training set sizes. The results of our proposed method
(PCA + BP) are compared with the results of the methods proposed by Jain et al. [12],
in which the authors evaluated their method on the FERET database. They used Inde-
pendent Component Analysis (ICA) to represent each image as feature vector in a low
dimensional subspace. In addition, they used different classifiers such as cosine classi-
fier (COS), linear discriminant classifier (LDA) and the support vector machine (SVM).
The best result reported in [12] is 95.67% accuracy using SVM with ICA. Furthermore,
the results for conventional sparse representation based classification (SRC)[20] are re-
ported in Table. 1 which show our modification was helpful in gender recognition. The
Table 1. Performance comparison to other gender classification systems based on facial images
Training Set Size ICA + COS ICA + LDA ICA+ SVM SRC PCA+BP (Proposed Method)
50 60.67% 64.67% 68.30% 68.88% 68.88%
100 71.67% 73.67% 76.00% 76.00% 76.25%
150 80.33% 83.00% 86.67% 86.85% 88.57%
200 85.33% 93.33% 95.67% 96.33% 97.66%
experimental results in this paper indicate that our proposed method using sparse representation and PCA achieves a higher correct classification rate on the same data set. To the best of our knowledge, no better results for gender classification on the FERET database have been reported since 2005. Moreover, in [15], the authors used 661 images from the FERET database, covering 248 subjects. The best result obtained for gender classification in that paper is 90% with a feature dimension of 11,520. In contrast, we obtained a classification rate of 97% with a feature dimension of 512 and 500 subjects.
5 Conclusion
In this paper, we presented a method for gender classification, from facial images, using
sparse representation. Basis pursuit method was used to formulate the problem in order
to find the sparsest solution. The experiments were conducted on the FERET data set
containing 500 subjects (250 male and 250 female subjects). Features were extracted
using Gabor wavelets, and a dictionary was constructed based on the extracted fea-
tures from a training set. The rest of the data set was used for testing. We compared
the proposed method in this paper with previous methods that used the same data set,
performance of our the presented method is better than the previous reported methods.
Experiments are encouraging enough for future research on the sparse representation
for gender classification. In the future, we plan to apply our method for the fusion of
facial and ear features.
References
1. Aharon, M., Elad, M., Bruckstein, A.: K-svd: An algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11),
4311–4322 (2006)
2. Baraniuk, R., Candes, E., Elad, M., Ma, Y.: Applications of sparse representation and com-
pressive sensing. Proceedings of the IEEE 98(6), 906–909 (2010)
3. Cao, L., Dikmen, M., Fu, Y., Huang, T.S.: Gender recognition from body. In: Proceedings
of the 16th ACM International Conference on Multimedia, MM 2008, New York, NY, USA,
pp. 725–728 (2008)
4. Chen, D.-Y., Lin, K.-Y.: Robust gender recognition for real-time surveillance system. In:
IEEE International Conference on Multimedia and Expo (ICME), pp. 191–196 (July 2010)
5. Chen, S.S., Donoho, D.L., Michael, Saunders, A.: Atomic decomposition by basis pursuit.
SIAM Journal on Scientific Computing 20, 33–61 (1998)
6. Cottrell, G.W., Metcalfe, J.: Empath: face, emotion, and gender recognition using holons. In:
Proceedings of the 1990 Conference on Advances in Neural Information Processing Systems
3, NIPS-3, San Francisco, CA, USA, pp. 564–571 (1990)
7. Donoho, D.L.: Compressed sensing. IEEE Transaction on Information Theory 52(4) (2006)
8. Golomb, B.A., Lawrence, D.T., Sejnowski, T.J.: Sexnet: A neural network identifies sex from
human faces. In: Proceedings Conf. Advances in Neural Information Processing Systems 3,
pp. 572–577 (1990)
9. Gutta, S., Wechsler, H.: Gender and ethnic classification of human faces using hybrid clas-
sifiers. In: International Joint Conference on Neural Networks, IJCNN 1999, vol. 6, pp.
4084–4089 (1999)
10. Harb, H., Chen, L.: Gender identification using a general audio classifier. In: Proceedings of
the International Conference on Multimedia and Expo, ICME 2003, Washington, DC, USA,
pp. 733–736 (2003)
11. Hsu, R.-L., Abdel-Mottaleb, M., Jain, A.: Face detection in color images. IEEE Transactions
on Pattern Analysis and Machine Intelligence 24(5), 696–706 (2002)
12. Jain, A., Huang, J., Fang, S.: Gender identification using frontal facial images. In: IEEE
International Conference on Multimedia and Expo, ICME 2005, p. 4 (July 2005)
13. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear
discriminant model for face recognition. IEEE Transactions on Image Processing 11(4),
467–476 (2002)
14. Llagostera Casanovas, A., Monaci, G., Vandergheynst, P., Gribonval, R.: Blind audiovisual
source separation based on sparse redundant representations. IEEE Transactions on Multi-
media 12(5), 358–371 (2010)
15. Lu, H., Huang, Y., Chen, Y., Yang, D.: Automatic gender recognition based on pixel-pattern-
based texture feature. Journal of Real-Time Image Processing 3, 109–116 (2008)
16. Moghaddam, B., Yang, M.-H.: Gender classification with support vector machines. In: Fourth
IEEE International Conference on Automatic Face and Gesture Recognition, pp. 306–311
(2000)
17. Patel, V.M., Wu, T., Biswas, S., Phillips, P.J., Chellappa, R.: Dictionary-based face recog-
nition under variable lighting and pose. IEEE Transactions on Information Forensics and
Security 7(3), 954–965 (2012)
18. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The feret evaluation methodology for face-
recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 22(10), 1090–1104 (2000)
19. Protter, M., Elad, M.: Image sequence denoising via sparse and redundant representations.
IEEE Transactions on Image Processing 18(1), 27–35 (2009)
20. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse repre-
sentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 210–227
(2009)
21. Wu, B., Ai, H., Huang, C.: Facial image retrieval based on demographic classification. In:
Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 3,
pp. 914–917 (2004)
22. Yang, M., Zhang, L.: Gabor feature based sparse representation for face recognition with
gabor occlusion dictionary. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part VI. LNCS, vol. 6316, pp. 448–461. Springer, Heidelberg (2010)
23. Yu, S., Tan, T., Huang, K., Jia, K., Wu, X.: A study on gait-based gender classification. IEEE
Transactions on Image Processing 18(8), 1905–1910 (2009)
Graph Clustering through Attribute Statistics
Based Embedding
1 Introduction
is available. Graph embeddings and graph kernels are the main paradigms. The
former explicitly assign a feature vector to each graph while the latter implicitly
map each graph in a feature space and compute the corresponding scalar product.
The relation between graph embeddings and graph kernels is clear since, given
an explicit embedding, any kernel function on vectors also defines a graph kernel.
We have previously proposed an explicit embedding approach which is based
on extracting features describing the occurrence and co-occurrence of node label
representatives in a given graph [4]. Its efficiency and good performance, when
compared to state of the art methodologies for graph classification, have been
empirically demonstrated. In the current paper, we aim at an evaluation of this
embedding methodology for graph clustering. To that end, we make use of the
ICPR 2010 Graph Embedding Contest [3]. This contest was organized in order
to provide a framework for direct comparison between embedding methodologies
for the purpose of graph clustering. Three object image datasets were chosen and
converted into graphs, divided into a training and a test set. The participants
also received a code with which they could assess their own methodologies in
terms of a clustering measure. Object images were first segmented into different
regions and a region adjacency graph was constructed. Each node representing a
region was attributed with the corresponding relative size and the average RGB
color components, while edges remained unattributed.
For the contest, four algorithms were submitted, three explicit embedding
methods and an implicit one. Jouili and Tabbone build feature vectors whose
distances distribution respect as much as possible that of the corresponding
graphs. In particular, they assign a feature vector to every graph by considering
the eigenvectors of a positive semidefinite matrix regarding the dissimilarity of
graphs [6]. Riesen and Bunke map every graph to a feature vector whose com-
ponents are the edit distances to a predefined set of prototypes [8]. Their goal
is thus to characterize graphs as how they are located with respect to some
key graphs in the graph space. Luqman et al. search for particular subgraph
structures present in the original graphs. They encode relevant information by
quantizing node and edge attributes via the use of fuzzy intervals [7]. Finally, the
implicit methodology proposed by Osmanlıoğlu et al. maps each node of each
graph to a vector space by means of the caterpillar decomposition, and com-
putes a kernel value between two given graphs in terms of a point set matching
algorithm based on the Earth Mover’s distance [2].
The contribution of the work described in this paper is to evaluate the novel
embedding methodology of [4], which was not yet available at the time of the
ICPR contest, for the task of clustering and compare it to existing approaches.
Besides, the mentioned embedding methodology has been re-formulated in such
a way that it can handle color-based attributed graphs. We will show that, in
such a way, it constitutes an attractive addition to the set of graph cluster-
ing tools currently available. For the purpose of self-completeness, Section 2 of
the paper provides a brief introduction to graph embedding using node label
occurrence and co-occurrence statistics. Next, Section 3 describes in detail the
experimental evaluation and shows how to gain further improvements along this
line of research. Finally, Section 4 draws conclusions from this work.
The main idea of the embedding methodology used in this work is based on
counting the frequency of appearance of the node labels in a given graph and also
on the co-occurrence of pairs of node labels in conjunction with edge linkings.
The fact that node labels might not be discrete (as is the case here) demands a discretization of the node labelling space and, thus, the selection of
a set of representatives. Under the proposed approach, the features are obtained
by computing statistics on these representatives in terms of those nodes that
have been assigned to each of them. Based on how this assignment from nodes
to representatives is made we have two formulations of the embedding approach.
$$B_{ij} = \#(w_i \leftrightarrow w_j, g) = \left| \{ (u, v) \in E \mid w_i = \lambda_h(u) \wedge w_j = \lambda_h(v) \} \right|. \qquad (3)$$
Both the unary features U_i and the binary ones B_ij are eventually arranged in a feature vector.
In particular, note that what this formulation is proposing is to build a his-
togram of the presence of specific features in the graphs. In the present case,
we aim at evaluating the presence of each color in every graph, and also the
presence of the neighbouring relations of all colors in the graphs. In Section 2.4
we discuss the connections of this approach to other existing graph embedding
methodologies.
The fuzzy version of the binary features needs to regard the transition probabilities from one node to the other and is thus defined as
$$B_{ij} = \#(w_i \leftrightarrow w_j, g) = \sum_{(u,v) \in E} \big( p_i(u)\, p_j(v) + p_j(u)\, p_i(v) \big). \qquad (6)$$
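The soft statistics can be sketched as follows on a toy attributed graph; the softmax-style assignment probabilities p_i(u) and the packing of the unary and binary statistics into one vector are our own illustrative choices and do not reproduce the exact soft assignment of [4].

```python
import numpy as np

def soft_embedding(node_attrs, edges, reps, gamma=1.0):
    """Soft attribute-statistics embedding (sketch): assignment probabilities p_i(u) from
    RGB distances to the representatives, then unary occurrences and the binary
    co-occurrences of Eq. (6) stacked into one feature vector."""
    D = -np.linalg.norm(node_attrs[:, None, :] - reps[None, :, :], axis=2) ** 2
    P = np.exp(gamma * D)
    P /= P.sum(axis=1, keepdims=True)             # p_i(u): each row sums to one
    U = P.sum(axis=0)                             # soft unary counts per representative
    B = np.zeros((len(reps), len(reps)))
    for u, v in edges:                            # binary features, Eq. (6)
        B += np.outer(P[u], P[v]) + np.outer(P[v], P[u])
    iu = np.triu_indices(len(reps))
    return np.concatenate([U, B[iu]])             # upper triangle avoids duplicate pairs

# Toy graph: 3 nodes with RGB attributes, 2 edges, 2 colour representatives.
nodes = np.array([[200., 30., 30.], [180., 40., 35.], [20., 20., 210.]])
edges = [(0, 1), (1, 2)]
reps = np.array([[255., 0., 0.], [0., 0., 255.]])
print(soft_embedding(nodes, edges, reps, gamma=1e-4))
```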
One of the key issues of the embedding methodology is the selection of the set
of representatives for the node labels. We can make use of generic clustering approaches that are independent of the domain, such as kMeans, or we can use domain-specific approaches. In addition to using kMeans, we propose in this work a
color-based approach that tries to adapt the set of representatives to the inherent
RGB structure of the node labelling space.
Fig. 1. Distributions of the graphs’ nodes in the RGB space (best seen in color). (a)
Original color of each node. (b) kMeans clusters for k = 10. (c) Color naming distri-
bution.
Table 1. C -index on the test sets under the Euclidean distance: lower index values
indicate better clustering results. Comparison with the contest participants. The best
results are shown bold face.
3 Experimental Evaluation
The three object image datasets that were used in the contest are the ALOI,
COIL and OBDK collections. Each of them is representing object images under
different angles of rotation and illumination changes. For more details on the
datasets, we refer to the contest report [3]. We recall here that a training and
a test set are available for each dataset. We use the training set to validate
the parameters (number of representative elements) that are eventually used for
processing the test set.
Every approach is assessed by computing the C-index clustering measure, and
approaches are ranked in terms of the geometric mean of the results on the three
datasets. When the embedding is explicit, the clustering index is computed based
on the Euclidean distances of the vectorial representations of graphs. When an
implicit formulation is given, distances are computed according to the following
formula
$$d_{ij} = \sqrt{k_{ii} + k_{jj} - 2 k_{ij}} \qquad (7)$$
where k_{ij} is the kernel value between graphs g_i and g_j. Under a kernel function, graphs are implicitly mapped to a hidden feature space where the scalar product is calculated. Formula (7) is the Euclidean distance between the corresponding vectors in such a feature space. Results of the proposed methodologies
in comparison with the contest participants are shown in Table 1.
As expected, the Soft approaches obtain better results than the Hard ones, and the
color-based versions improve on the generic ones. Compared to the participants'
methods, the proposed embedding approach ranks second on two databases and first on
the third one. This leads to the best geometric mean among all tested methods.
Moreover, our approach is highly efficient, since the embedding is based on very
simple features that are fast to compute.
As mentioned above, the contest clustering measures are computed based on the
Euclidean distances of the embedding representations. In other works, the proposed
embedding methodology has been shown to perform better under different vectorial
metrics [4]. We thus refine our results by computing the C-index for clustering
validation under the L1 and χ2 distances. Results of these
Table 2. C-index under different distances and under the kχ2 kernel on the test sets.
The best results are shown in bold face.

Distance / Kernel      ALOI                      COIL
                       Soft kM     Soft Color    Soft kM     Soft Color
L2                     0.073       0.056         0.136       0.121
L1                     0.064       0.060         0.130       0.110
χ2                     0.031       0.032         0.066       0.064
kχ2                    0.088       0.083         3.10e-08    9.04e-07
experiments for the Soft versions are shown in the first three rows of Table 2 (the
Hard versions are discarded since they do not perform as well as the Soft ones). The
χ2 distance provides the best results, ranking best on all datasets, even when
compared to the contest participants (we, however, want to point out that a direct
comparison to the results obtained by the contest participants would not be fair,
since we do not know how their algorithms would perform under other metrics).
Interestingly, the χ2 distance extracts the best out of the Soft kM version, which
outperforms the Soft Color one on two of the three datasets; this does not happen
with the other two metrics.
Finally, in order to relate our methodology to those that provide an implicit
embedding of graphs, we compute kernel values between embedded graphs as

$$k_d(g_1, g_2) = \exp\!\left( -\frac{1}{\gamma}\, d\big(\phi(g_1), \phi(g_2)\big) \right), \quad \gamma > 0 \qquad (8)$$

where $\phi(g_i)$ is the vectorial representation of the graph $g_i$ under the
described embedding methodology, and $d$ is the χ2 metric (L2 or L1 could also be
used, but χ2 is the one providing the best results when clustering under metrics, as
discussed above). Distance values for the C-index computation are calculated using
Eq. (7). The $\gamma$ parameter is also validated using the training set. Results
for the Soft versions are shown on the last row of Table 2.
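For illustration, a straightforward way to realize Eqs. (7) and (8) on the embedded vectors is sketched below; the χ2 distance uses one common form of the statistic, and all names are assumptions rather than the authors' implementation:

```python
import numpy as np

def chi2_distance(x, y, eps=1e-12):
    # A common form of the chi-square distance between non-negative feature vectors
    denom = x + y
    mask = denom > eps
    return np.sum((x[mask] - y[mask]) ** 2 / denom[mask])

def chi2_kernel_matrix(X, gamma):
    # Eq. (8): k_d(g1, g2) = exp(-d(phi(g1), phi(g2)) / gamma), over embedded graphs X (n, dim)
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = np.exp(-chi2_distance(X[i], X[j]) / gamma)
    return K

def kernel_induced_distances(K):
    # Eq. (7): d_ij = sqrt(k_ii + k_jj - 2 k_ij), the Euclidean distance in feature space
    diag = np.diag(K)
    return np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2 * K, 0.0))
```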
Although the results for the ALOI database worsen when using the kernel values for
both versions of the embedding, the most significant point to highlight from this
table is that we obtain almost perfect separation indexes for the COIL dataset under
the two Soft versions, and also for the ODBK dataset under the Soft kM one. This
drastically decreases the geometric means and demonstrates that the embedding
methodology proposed in this work is a strong approach for graph clustering.
4 Conclusions
References
1. Benavente, R., Vanrell, M., Baldrich, R.: Parametric fuzzy sets for automatic color
naming. J. Optical Society of America A 25(10), 2582–2593 (2008)
2. Demirci, M.F., Osmanlıoğlu, Y., Shokoufandeh, A., Dickinson, S.: Efficient many-
to-many feature matching under the l1 norm. Computer Vision and Image Under-
standing 115(7), 976–983 (2011)
3. Foggia, P., Vento, M.: Graph Embedding for Pattern Recognition. In: Ünay, D.,
Çataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 75–82. Springer,
Heidelberg (2010)
4. Gibert, J., Valveny, E., Bunke, H.: Graph embedding in vector spaces by node
attribute statistics. Pattern Recognition 45(9), 3072–3083 (2012)
5. Jain, A.K.: Data Clustering: 50 years beyond K-means. Pattern Recognition Let-
ters 31(8), 651–666 (2010)
6. Jouili, S., Tabbone, S.: Graph Embedding Using Constant Shift Embedding. In:
Ünay, D., Çataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 83–92.
Springer, Heidelberg (2010)
7. Luqman, M.M., Lladós, J., Ramel, J.-Y., Brouard, T.: A Fuzzy-Interval Based
Approach for Explicit Graph Embedding. In: Ünay, D., Çataltepe, Z., Aksoy, S.
(eds.) ICPR 2010. LNCS, vol. 6388, pp. 93–98. Springer, Heidelberg (2010)
8. Riesen, K., Bunke, H.: Graph Classification and Clustering Based on Vector Space
Embedding. World Scientific (2010)
9. Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., Vert, J.-P.: Graph Kernels for Molec-
ular Structure-Activity Relationship Analysis with Support Vector Machines. Jour-
nal of Chemical Information and Modelling, 939–951 (2005)
10. Gärtner, T., Flach, P.A., Wrobel, S.: On graph kernels: Hardness results and
efficient alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003.
LNCS (LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
Graph-Based Regularization of Binary
Classifiers for Texture Segmentation
1 Introduction
One drawback of this segmentation technique comes from applying the Chan and Vese
criteria to texture features. Depending on the type of image (acquisition method,
content, etc.), some features may be irrelevant, or more relevant than others. This
cannot be handled automatically by the Chan and Vese criteria, and features must be
selected manually in order to obtain the best results. Moreover, unsupervised
weighting of the features is impossible.
The aim of this work is to overcome this limitation while keeping the two non-local
approaches. To do so, we propose to combine a supervised binary classifier with the
graph regularization process used previously. Implemented as a neural network, the
classifier takes care of the texture feature relevancy issue by providing an initial
classification of the pixels. The graph regularization process from [3], which is
designed to handle any multivariate feature, is this time applied to a univariate one,
the output of the classifier, in order to correct classification errors and produce a
smoothed segmentation.
2 Methodology
2.1 Haralick Texture Features
In order to deal with complex vision problems, the use of pixel features with a
higher level of abstraction than raw gray level intensities has now become almost
mandatory. Among all the texture characterization techniques available in the
literature, we have decided to use Haralick features [4]. Several works have shown
their efficiency, especially when applied to medical images [6], and they can easily
be extended to 3D images [7]; both are among our final goals. Only 10 of the 14
features proposed by Haralick have been used, the correlation-based ones having been
left out due to numerical instability.
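As an illustration of this kind of descriptor, the sketch below computes a few grey-level co-occurrence statistics with scikit-image. It is not the authors' feature set (which uses 10 of Haralick's 14 features); the chosen properties and parameters are assumptions, and a recent scikit-image release is assumed (older versions spell the functions greycomatrix/greycoprops):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def cooccurrence_features(patch, levels=32):
    """patch: 2D uint8 grey-level image patch. Returns a small co-occurrence feature vector."""
    # Quantize the grey levels so the co-occurrence matrix stays small
    q = (patch.astype(np.uint16) * levels // 256).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    props = ["contrast", "dissimilarity", "homogeneity", "energy", "ASM"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])
```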
measure of each pair of nodes. Multiple ways to build such a graph exist; they
correspond to three successive steps: choosing the node set, the edge set, and the
similarity measure.
For the choice of the node set, the most obvious approach is to build a pixel
adjacency graph (each pixel of the image is represented by a node), but more advanced
methods use clustering or segmentation algorithms to group similar pixels together
beforehand, building a region adjacency graph [2,11]. While the latter approach can
greatly reduce the size of the node set, obtaining such a pre-segmentation when
working with texture features is a vast subject, and will therefore not be explored
in this paper.
From the point of view of the heuristic used to build the edge set, the best known
types of graphs are the fully connected graph, the ε-neighborhood graph and the
k-nearest neighbors graph (see [12] for further details). In this paper, it has been
decided to work with the ε-neighborhood graph, where two nodes are connected together
if the distance between them is below a defined threshold. The distance involved in
this process is application dependent, and any combination of characteristics (gray
level intensity, coordinates, etc.) can be used. The one used here is the Manhattan
distance applied to the pixel coordinates with ε = 1, which allows us to build
4-neighbor graphs.
The similarity measure used by the weighting function is not necessarily linked to
the distance used in the previous step, but is also application dependent. We chose
to use a constant value of $w_{u,v} = 1$ associated with each edge $(u, v) \in E$,
which is enough to allow the regularization process to take place.
The regularization process is actually carried out by solving an optimization
problem, which consists in finding a function f : V → [0; 1] that minimizes the
following energy:
$$E(f, f_0, \lambda) = R_w(f) + \lambda \sum_{u \in V} g(f_0, u)\, f(u), \qquad (1)$$
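Since the exact regularizer $R_w$ and fidelity term $g$ are not detailed in this excerpt, the following sketch uses a generic quadratic smoothness-plus-fidelity energy on a 4-neighbour grid with unit weights as a stand-in; it only illustrates how an initial classifier output can be smoothed by iterative graph regularization, and is not the paper's exact energy:

```python
import numpy as np

def regularize(f0, lam=0.5, n_iter=500):
    """f0: 2D array in [0, 1] holding the raw classifier output. Returns the smoothed map."""
    f = f0.astype(np.float64).copy()
    for _ in range(n_iter):
        # Sum over the 4 neighbours of each pixel (unit edge weights; image borders
        # handled by wrap-around here only for brevity)
        nb = (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
              np.roll(f, 1, 1) + np.roll(f, -1, 1))
        # Jacobi step for the quadratic energy: smoothness term + lam * fidelity to f0
        f = (nb + lam * f0) / (4.0 + lam)
    return f

# Example usage: threshold the regularized map to obtain the final segmentation
# segmentation = regularize(classifier_output, lam=0.5, n_iter=500) > 0.5
```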
3 Experimental Results
The results presented throughout this section correspond to a measure of the partition
distance [13], which is equivalent to the percentage of misclassified pixels according
to a ground truth. The ground truths we refer to are either absolute (for artificial
images) or obtained from experts (for medical images).
For each image, a classifier is trained using the ground truth (the binary mask
used to compose the textures, see figure 1(a)). Then, each image is corrupted by
adding white Gaussian noise of varying standard deviation, and processed by the
method for different combinations of parameters (λ and number of iterations).
In order to illustrate the benefit of applying a graph regularization algorithm
to the output of a binary classifier, the results of our method are compared
with the ones obtained by thresholding the raw output of the same classifier.
Our method is also compared to the texture-based graph regularization process
proposed in [3]. Table 1 presents the results obtained for the three methods for
noise level (standard deviation) 8.
Because of the use of a classifier that relies on eager learning, handling high noise
levels is impossible, since such a level of corruption strongly affects the texture
features. Results for this test are therefore reported for noise levels 2, 4, 6, 8,
10 and 12.
By comparing the values in the second and fourth columns of Table 1, we can see that
applying the regularization process to the output of the classifier greatly improves
the segmentation quality. On the whole test set, it correctly classifies up to 13.8%
more pixels, with an average improvement of 4.3%. Compared to the texture-based graph
regularization method from [3] (third column), our new approach shows an improvement
of the segmentation
Table 1. Results (partition distance) for the artificial test set for noise level 8
quality of 1.9%, and requires neither a manual selection of relevant texture features
nor the definition of an initialization. Moreover, such results are obtained with
significantly fewer regularization iterations: fewer than 500 for our new method
against more than 1000 for the one from [3].
to different depths. The one used here is a view of the olfactory system of a bee,
the bright circular elements being the glomeruli (see figure 2(d)).
The classifier is trained using the ground truth of the first image of the stack,
then our method is applied on the following slices. Table 3 presents the partition
distance obtained on the test image for some of the slices.
Results for both modalities are illustrated in figure 2.
Table 3. Results (partition distance) for the confocal microscopy test set
4 Conclusion
In this paper, a recent texture-based graph regularization process has been im-
proved. A supervised binary classifier is included in the segmentation process in
order to take care of the selection of features. A learning set is first provided by
an expert in order to train the classifier, which is then used to provide a raw
segmentation of the image. A graph regularization process is finally applied on
this initial segmentation to produce the final one.
By including a supervised binary classifier in the segmentation process, we enable it
to automatically weight the relevant texture features. Compared to the initial
algorithm, the benefits are multiple. First, irrelevant features do not need to be
manually removed, which makes it possible to use virtually any texture characterization
technique without having to worry about its usefulness for the type of image to be
processed. The generic nature of the system has also been improved: many more features
can be added without the risk of diluting their contribution, since the training
algorithm will sort them out. Finally, no initialization has to be provided for each
image. This point might be argued, since a training set has to be provided, but by
turning this algorithm into a system configured for one or several pre-defined tasks,
it can easily be rendered parameter-less.
In order to increase the capacity of the process to perform any segmentation task, we
intend to incorporate more texture descriptors into it (Haralick features computed on
different co-occurrence matrices, Gabor filters, etc.).
During the design of this algorithm, we chose to implement the classifier as an MLP
because of its ability to handle multiclass problems. Research into transforming this
binary segmentation algorithm into a multiclass one is already in progress.
References
1. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
2. Ta, V.-T., Lezoray, O., Elmoataz, A.: Graph Based Semi and Unsupervised Classi-
fication and Segmentation of Microscopic Images. In: IEEE International Sympo-
sium on Signal Processing and Information Technology, pp. 1160–1165 (December
2007)
3. Faucheux, C., Olivier, J., Bone, R., Makris, P.: Texture-based graph regularization
process for 2D and 3D ultrasound image segmentation. In: IEEE International
Conference on Image Processing (ICIP), pp. 2333–2336 (September 2012)
4. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Clas-
sification. IEEE Transactions on Systems, Man, and Cybernetics 3(6), 610–621
(1973)
5. Chan, T.F., Sandberg, B.Y., Vese, L.A.: Active Contours without Edges for Vector-
Valued Images. Journal of Visual Communication and Image Representation 11(2),
130–141 (2000)
6. Olivier, J., Boné, R., Rousselle, J.-J., Cardot, H.: Active Contours Driven by Su-
pervised Binary Classifiers for Texture Segmentation. In: Bebis, G., et al. (eds.)
ISVC 2008, Part I. LNCS, vol. 5358, pp. 288–297. Springer, Heidelberg (2008)
7. Tesar, L., Shimizu, A., Smutek, D., Kobatake, H., Nawano, S.: Medical image analy-
sis of 3D CT images based on extension of Haralick texture features. Computerized
Medical Imaging and Graphics: The Official Journal of the Computerized Medical
Imaging Society 32(6), 513–520 (2008)
Hierarchical Annealed PSO for Articulated Object Tracking
1 Introduction
Tracking articulated structures with accuracy and within a reasonable time is
challenging due to the high complexity of the problem to solve. For this purpose,
various approaches based on particle filtering have been proposed. Among them,
one class addresses the complexity issue by reducing the dimensionality of the
state space. For instance, some methods add constraints (e.g., physical) to the
mathematical models [4, 13], to the object priors [7] or to their interactions with
the environment [11]. Relying on the basic assumption that some body part
movements are mutually dependent, some learning-based approaches [16, 19]
reduce the number of degrees of freedom of these movements.
Alternatively, a second class of methods has been proposed in the literature [5,
9, 12, 14, 17, 18] whose key idea is to decompose the state space into a set of small
subspaces where particle filtering can be applied: by working on small subspaces,
sampling is more efficient and, therefore, fewer particles are needed to achieve a
good performance. Finally, in the class of the optimization-based methods, the
approach is to optimize an objective function corresponding to the matching
between the model and the observed image features [3, 6, 8]. Recently, Particle
Swarm Optimization (PSO) has been reported to perform well on articulated
human tracking [10, 20]. Its key idea is to apply evolutionary algorithms, inspired
by social behaviors observed in wildlife, to make the particles evolve by following
their own experience and the experience of the global population.
In this paper, our approach consists in decomposing the search space into
subspaces of smaller dimensions and, then, in exploiting the approach proposed
in [20] to search within these subspaces in a hierarchical order. A hierarchical
particle swarm optimization has also been introduced in [10]. The main difference
between this approach and ours is that we incorporate the sampling covariance
and the annealing factor into the update equation of PSO at each optimization
step to tackle the problem of noisy observations and cluttered background.
The paper is organized as follows. In Section 2, we briefly recall PSO. Section 3
presents the proposed algorithm. Section 4 reports the results of our experimental
evaluation. Finally, Section 5 gives some conclusions and perspectives.
Let $\mathcal{X}$ denote the state space: our goal is to search for the state $x \in \mathcal{X}$ that maximizes a cost function $f : \mathcal{X} \rightarrow \mathbb{R}$, with $a \leq x \leq b$. A swarm consists of $N$ particles, each one representing a candidate state of the articulated object. Denote by $x^{(i)}_{(m)}$ the $i$th particle at the $m$th iteration. $x^{(i)}_{(m)}$ is decomposed into $K$ (object) parts, i.e., $x^{(i)}_{(m)} = \{x^{(i),1}_{(m)}, \ldots, x^{(i),K}_{(m)}\} \in \mathcal{X}$. Unlike evolutionary algorithms, each particle in PSO is assigned a velocity $v^{(i)}_{(m)} = \{v^{(i),1}_{(m)}, \ldots, v^{(i),K}_{(m)}\} \in \mathcal{X}$, and each particle has the ability to memorize its best state computed so far, $s^{(i)} = \{s^{(i),1}, \ldots, s^{(i),K}\} \in \mathcal{X}$. Let $s^g$ be the current global best state, i.e., $s^g = \mathrm{Argmax}\,\{f(s^{(i)})\}_{i=1}^{N}$. The evolution of the particles in PSO is described by the following equations:

$$v^{(i)}_{(m)} = w\, v^{(i)}_{(m-1)} + \beta_1 r_1 \big(s^{(i)} - x^{(i)}_{(m-1)}\big) + \beta_2 r_2 \big(s^g - x^{(i)}_{(m-1)}\big) \qquad (1)$$
$$x^{(i)}_{(m)} = x^{(i)}_{(m-1)} + v^{(i)}_{(m)} \qquad (2)$$
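A minimal sketch of the standard PSO update of Eqs. (1) and (2) is given below; the cost function, bounds, coefficients and stopping rule are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def pso_maximize(f, a, b, n_particles=50, n_iter=30, w=0.7, beta1=1.5, beta2=1.5, seed=0):
    """Maximize f over the box [a, b] (a, b: 1D arrays) with the update of Eqs. (1)-(2)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    x = rng.uniform(a, b, size=(n_particles, a.size))    # particle states x^(i)
    v = np.zeros_like(x)                                  # particle velocities v^(i)
    s = x.copy()                                          # personal best states s^(i)
    fs = np.array([f(p) for p in s])
    sg = s[fs.argmax()].copy()                            # global best state s^g
    for _ in range(n_iter):
        r1 = rng.random((n_particles, 1))
        r2 = rng.random((n_particles, 1))
        v = w * v + beta1 * r1 * (s - x) + beta2 * r2 * (sg - x)   # Eq. (1)
        x = np.clip(x + v, a, b)                                    # Eq. (2), kept inside [a, b]
        fx = np.array([f(p) for p in x])
        better = fx > fs
        s[better], fs[better] = x[better], fx[better]
        sg = s[fs.argmax()].copy()
    return sg
```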
3 Proposed Approach
searched for in these subspaces in the hierarchical order of the kinematic struc-
ture using Partitioned Sampling (PS) [12]. These optimal states are then used
to constrain the search in the next subspaces in the hierarchical order.
At time $t$, let $x^{(i),k}_t$ (resp. $s^{(i),k}_t$) denote the $k$th substate of the $i$th particle $x^{(i)}_t$ (resp. of the $i$th particle's best state $s^{(i)}_t$), and let $s^{g,k}_t$ be the $k$th substate of the global best state. Then, at the $m$th iteration, $x^{(i)}_{t,(m)} = \{x^{(i),1}_{t,(m)}, \ldots, x^{(i),K}_{t,(m)}\}$, $v^{(i)}_{t,(m)} = \{v^{(i),1}_{t,(m)}, \ldots, v^{(i),K}_{t,(m)}\}$ and $s^{(i)}_{t,(m)} = \{s^{(i),1}_{t,(m)}, \ldots, s^{(i),K}_{t,(m)}\}$. We follow the approach proposed in [20], except that the state and velocity update equations for each subpart $k$ are written as follows:

$$v^{(i),k}_{t,(m)} = r_0 P_{(m-1)} + \beta_1 r_1 \big(s^{(i),k}_t - x^{(i),k}_{t,(m-1)}\big) + \beta_2 r_2 \big(s^{g,k}_t - x^{(i),k}_{t,(m-1)}\big) \qquad (3)$$
$$x^{(i),k}_{t,(m)} = x^{(i),k}_{t,(m-1)} + v^{(i),k}_{t,(m)} \qquad (4)$$
4 Experimental Results
We compare our approach with APF [6], PSAPF [2], APSOPF [20] and HPSO [10]. The cost
function $w(x^{(i),k}_{t,(m)}, y)$ to be optimized by PSO measures how well a state
hypothesis $x^{(i),k}_{t,(m)}$ matches the true state w.r.t. the observed image $y$,
and is constructed using histogram and foreground silhouette [6]. An articulated
object is described by a hierarchy (a tree) of parts, each part being linked to its
parent in the tree by an articulation point. For instance, in the top row of Fig. 1,
the blue polygonal parts are the root of the tree and the colored rectangles are
the other nodes of the tree. The root is described by its center (x, y) and its
Input: $\{s^{(i)}_{t-1}\}_{i=1}^{N}$, $\alpha_0$, $\beta_0$, $\beta_{\max}$, $\beta_{\min}$, $P_{(0)}$, $M$ (number of iterations)
Output: $\{s^{(i)}_{t}\}_{i=1}^{N}$
1. Set $\pi^{(i)}_t = 1$, $i = 1, \ldots, N$
2. for $k = 1$ to $K$ do
3.   Sample: $x^{(i),k}_{t,(0)} \sim \mathcal{N}(s^{(i),k}_{t-1}, P_{(0)})$, $i = 1, \ldots, N$
4.   for $m = 0$ to $M$ do
5.     if $m \geq 1$ then
6.       Compute $P_{(m)}$ and update $\beta_1$, $\beta_2$
7.       Carry out the PSO iteration based on Eqs. (3) and (4)
8.     Evaluate: $f(x^{(i),k}_{t,(m)}) = w(x^{(i),k}_{t,(m)}, y)$, $i = 1, \ldots, N$
9.     Update $\{s^{(i),k}_t\}_{i=1}^{N}$ and the $k$-th part of the global best state $s^{g,k}_t$
10.  Evaluate particle weights: $\pi^{(i)}_t = \pi^{(i)}_t \times w(s^{(i),k}_t, y)$, $i = 1, \ldots, N$
11. Normalize particle weights: $\bar{\pi}^{(i)}_t = \pi^{(i)}_t \big/ \sum_{j=1}^{N} \pi^{(j)}_t$, $i = 1, \ldots, N$
12. return $\{s^{(i)}_t\}_{i=1}^{N}$, $\bar{x} = \sum_{i=1}^{N} \bar{\pi}^{(i)}_t\, s^{(i)}_t$
orientation θ whereas the other parts are only characterized by their angle θ.
For all algorithms, particles are propagated using a random walk with standard
deviations fixed to σx = 2, σy = 2 and σθ = 0.05. For APSOPF and HAPSOPF,
P(0) is a diagonal matrix with the values of σx , σy and σθ . Our comparisons are
based on two criteria: estimation errors and computation times.
Fig. 1. Synthetic video sequences used for quantitative evaluation (number of arms
$N_a$, length of arms $L_a$): (a) without clutter and (b) with clutter. The four
configurations shown are $(L_a = 3, N_a = 4)$, $(L_a = 4, N_a = 5)$, $(L_a = 3, N_a = 6)$
and $(L_a = 4, N_a = 7)$.
Quantitative Tracking Results. The tracking errors are given by the sum
of the Euclidean distances between each corner of the estimated parts and their
corresponding corner in the ground truth. We used M = 3 layers for PSAPF and APF,
since this produces stable results for both algorithms, and M = 3 maximal iterations
for HAPSOPF, HPSO and APSOPF. Table 1 gives the performances of the tested algorithms
for sequences without and with noise (cluttered background). In our experiments,
tracking in noisy sequences is challenging due to the background. In such cases, the
annealing factor helps the particle swarm to follow its own search strategy without
being misled by erroneous local or global best states. On the contrary, the annealing
process of PSAPF
causes some parts of the object to get stuck in wrong locations. This problem
of annealing approaches was reported in [1]. Moreover, the use of the sampling
covariance instead of the inertial velocity of Eq. (1) leads to an efficient explo-
ration of the search space without losing the searching power of PSO. This is
validated by our experiments on sequences without cluttered background, where
our approach outperforms all the other ones. Fig. 2 gives comparative conver-
gence results (error depending on the number N of particles) and computation
times for a synthetic sequence (behaviors are similar for other sequences). Note
that our approach converges better and faster than the other methods.
Table 1. Tracking errors in pixels (average over 30 runs) and standard deviations for
synthetic video sequences, N is the number of particles used per filter
                        Na = 4, La = 3      Na = 5, La = 4      Na = 6, La = 3      Na = 7, La = 4
          N             50        200       50        200       50        200       50         200
HAPSOPF   without noise 110(2)    106(1)    214(5)    195(2)    243(11)   211(9)    312(7)     271(4)
          noise         204(39)   143(10)   227(56)   175(30)   322(67)   295(60)   553(194)   516(180)
PSAPF     without noise 120(2)    114(1)    238(6)    208(4)    251(7)    218(3)    319(8)     278(4)
          noise         309(109)  221(94)   281(78)   219(48)   432(86)   388(75)   1008(232)  914(213)
HPSO      without noise 125(5)    119(2)    252(9)    227(5)    254(11)   213(6)    382(5)     315(3)
          noise         277(78)   194(65)   245(42)   201(26)   345(27)   295(10)   922(334)   731(259)
APSOPF    without noise 184(3)    169(2)    260(12)   241(10)   265(15)   257(12)   471(30)    439(21)
          noise         254(16)   227(8)    308(33)   291(25)   490(68)   474(47)   817(223)   785(169)
APF       without noise 128(3)    109(2)    246(11)   221(9)    270(13)   236(11)   487(35)    412(24)
          noise         272(9)    258(5)    322(29)   309(18)   440(51)   429(40)   613(174)   592(156)
Fig. 2. Comparison tests for convergence and computation time when tracking the object
$N_a = 4$, $L_a = 3$: (a) convergence (error in pixels as a function of the number of
particles) and (b) computation times in seconds (HPSO and our approach give the same
curves). Methods compared: APF, HPSO, PSAPF, APSOPF and HAPSOPF.
right forearm. For a fair comparison, we fixed the number of evaluations of the
weighting function at each frame for all the algorithms to 2000, and tuned pa-
rameters {N,M} for each method so that they achieve the best performance while
satisfying the above constraint: {400, 5} for APF, {40, 5} for PSAPF, {200, 10}
for APSOPF and {20, 10} for HPSO and HAPSOPF.
Table 2. Tracking errors for full body in pixels (average over 30 runs)
Fig. 3. Tracking results for frames 123,160,275,387,488,523: HPSO (first row), PSAPF
(second row), HAPSOPF (third row). The tracking results for the other approaches as
well as those for the sequence S1 Gesture are not presented due to space constraint.
In this paper, we have introduced a new algorithm for articulated object tracking
based on particle swarm optimization and hierarchical search. We addressed the
problem of articulated object tracking in high dimensional spaces by employ-
ing a hierarchical search to improve search efficiency. Furthermore, the problem
of noisy observation has been alleviated by incorporating the annealing factor
terms into the velocity updating equation of PSO. Our experiments on syn-
thetic and real video sequences demonstrate the efficiency and effectiveness of
our approach compared to other common approaches, both in terms of tracking
accuracy and computation time. Our future work will focus on evaluating the
proposed approach in multi-view environments.
References
[1] Balan, A.O., Sigal, L., Black, M.J.: A quantitative evaluation of video-based 3d
person tracking. In: PETS, pp. 349–356 (2005)
[2] Bandouch, J., Engstler, F., Beetz, M.: Evaluation of Hierarchical Sampling Strate-
gies in 3D Human Pose Estimation. In: BMVC, pp. 925–934 (2008)
[3] Bray, M., Koller-Meier, E., Van Gool, L.: Smart particle filtering for high-
dimensional tracking. Computer Vision and Image Understanding 106(1), 116–129
(2007)
[4] Brubaker, M., Fleet, D., Hertzmann, A.: Physics-based person tracking using
the anthropomorphic walker. International Journal of Computer Vision 87(1-2),
140–155 (2009)
[5] Chang, I.C., Lin, S.Y.: 3D human motion tracking based on a progressive particle
filter. Pattern Recognition 43(10), 3621–3635 (2010)
[6] Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search.
International Journal of Computer Vision 61(2), 185–205 (2005)
[7] Hauberg, S., Pedersen, K.S.: Stick it! articulated tracking using spatial rigid object
priors. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III.
LNCS, vol. 6494, pp. 758–769. Springer, Heidelberg (2011)
[8] Hofmann, M., Gavrila, D.: 3D human model adaptation by frame selection and
shape-texture optimization. Computer Vision and Image Understanding 115(11),
1559–1570 (2011)
[9] Isard, M.: PAMPAS: real-valued graphical models for computer vision. In: CVPR,
pp. 613–620 (2003)
[10] John, V., Trucco, E., Ivekovic, S.: Markerless human articulated tracking using
hierarchical particle swarm optimization. Image and Vision Computing 28(11),
1530–1547 (2010)
[11] Kjellstrom, H., Kragic, D., Black, M.: Tracking people interacting with objects.
In: CVPR, pp. 747–754 (2010)
[12] MacCormick, J., Isard, M.: Partitioned sampling, articulated objects, and
interface-quality hand tracking. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843,
pp. 3–19. Springer, Heidelberg (2000)
[13] Oikonomidis, I., Kyriazis, N.: Full DOF tracking of a hand interacting with an
object by modeling occlusions and physical constraints. In: ICCV, pp. 2088–2095
(2011)
[14] Rose, C., Saboune, J., Charpillet, F.: Reducing particle filtering complexity for
3D motion capture using dynamic Bayesian networks. In: AAAI, pp. 1396–1401 (2008)
[15] Sigal, L., Balan, R.: Humaneva: Synchronized video and motion capture dataset
and baseline algorithm for evaluation of articulated human motion. Technical re-
port (2009)
[16] Urtasun, R., Fleet, D., Hertzmann, A., Fua, P.: Priors for people tracking from
small training sets. In: ICCV, pp. 403–410 (2005)
[17] Wu, Y., Hua, G., Yu, T.: Tracking articulated body by dynamic Markov network.
In: ICCV, pp. 1094–1101 (2003)
[18] Xinyu, X., Baoxin, L.: Learning Motion Correlation for Tracking Articulated Hu-
man Body with a Rao-Blackwellised Particle Filter. In: ICCV, pp. 1–8 (2007)
[19] Yao, A., Gall, J., Gool, L., Urtasun, R.: Learning probabilistic non-linear latent
variable models for tracking complex activities. In: Shawe-Taylor, J., Zemel, R.,
Bartlett, P., Pereira, F., Weinberger, K. (eds.) Advances in Neural Information
Processing Systems 24, pp. 1359–1367 (2011)
[20] Zhang, X., Hu, W., Wang, X., Kong, Y., Xie, N., Wang, H., Ling, H., Maybank,
S.: A swarm intelligence based searching strategy for articulated 3D human body
tracking. In: CVPRW, pp. 45–50 (2010)
High-Resolution Feature Evaluation Benchmark
Abstract. Benchmark data sets consisting of image pairs and ground truth homographies
are used for evaluating fundamental computer vision challenges, such as the detection
of image features. The most widely used benchmark provides only low-resolution images.
This paper presents an evaluation benchmark consisting of high-resolution images of up
to 8 megapixels and highly accurate homographies. State-of-the-art feature detection
approaches are evaluated using the new benchmark data. It is shown that existing
approaches perform differently on the high-resolution data compared to the same images
with lower resolution.
1 Introduction
Fig. 1. Part of the mapped image 1 and of image 3 of the Graffiti sequence (top row)
and the Trees sequence (bottom row). For the mapping of image 1, the ground truth
homographies are used. Large errors occur due to the car in the foreground (Graffiti)
and the leaves moving in the wind (Trees). The bottom part of the Graffiti wall
indicates a violation of the homography assumption. The error is shown in images 1(c)
and 1(f) (cf. equation (6)).
as ground truth for high-accuracy evaluations, sometimes using very small overlap error
thresholds [3,8,9]. Apart from feature evaluation there are applications [10] which use
a dense representation of the images. In this case, the mapping errors would spoil the
evaluation significantly. Hence, the data set is useless for applications with dense image
representations.
Nowadays, consumer cameras provide image resolutions of 8 megapixels or more. The
question arises whether feature detector evaluations based on 0.5 megapixel data are
valid for high-resolution images. In [3], the evaluated detectors provide
scale-invariant properties. On the other hand, the localization accuracy of a
scale-invariant feature may depend on the detected scale [11], because its position
error in a certain pyramid layer is mapped to the ground plane of the scale space
pyramid. In high-resolution data, more features are expected to be detected at higher
scales of the image pyramid. Thus, a small localization error of a detector may become
significant in high-resolution image data.
An improved homography benchmark is provided in [12] with image resolutions of
1.5 megapixels per image. In addition, the accuracy of the Mikolajczyk benchmark is
slightly increased using a dense image representation instead of image features.
We use the RAW camera data from the images of the data set [12]. The proposed
technique exploits the ground truth data from [12] for initializing an evolutionary opti-
mization for the computation of ground truth homographies between image pairs with
2 Homography Upscaling
We make use of the RAW image data from [12]. In [12], the benchmark is created
using subsampled images of size 1536 × 1024 (1.5 megapixels). We use the images
with the same scene content at higher resolution. The radial distortion is removed in a
preprocessing step. Our objective is to create ground truth homographies with image
resolutions of up to 3456 × 2304 (8 megapixels), which is the maximum resolution of
the utilized Canon EOS 350D camera.
Since the homography for the image pair at resolution $R_1$ is approximately known, it
can be used as a reasonable initialization for the optimization at resolution $R_2$,
as shown in Section 2.1. The optimization is based on a cost function which computes
the mapping error of the homography $H_{R_2}$ at resolution $R_2$. The minimization of
the cost function is explained in Section 2.2.
$$E(H) = \frac{1}{J} \sum_{j=1}^{J} d_{\mathrm{RGB}}\big(H \cdot p_j,\; p'_j\big), \qquad (6)$$

using the RGB values of the left and the right image $I_1$, $I_2$. The homography $H$
maps a pixel $p_j$, $j \in [1; J]$, from the left image $I_1$ to the corresponding
pixel $p'_j$ in the right image $I_2$. If the homography is accurate, the color
distance $d_{\mathrm{RGB}}(\cdot)$ is small. The color distance $d_{\mathrm{RGB}}(\cdot)$
is determined as:

$$d_{\mathrm{RGB}}(p_i, p_j) = \frac{1}{3} \big( |r(p_i) - r(p_j)| + |g(p_i) - g(p_j)| + |b(p_i) - b(p_j)| \big) \qquad (7)$$

using the RGB values $(r(p_i), g(p_i), b(p_i))$ of an image point $p_i$. For the
extraction of the color values, a bilinear interpolation is used. If a mapped point
$p'_j$ falls outside the image boundaries, it is neglected.
Due to lighting and perspective changes between the images, the cost function is
likely to have several local minima. Hence, a Differential Evolution (DE) optimizer is
used for the minimization of $E(H)$ with respect to $H$ in the cost function (6).
Evolutionary optimization methods have shown impressive performance on parameter
estimation problems, finding the global optimum in a parameter space with many local
optima. Nevertheless, limiting the parameter space with upper and lower boundaries
increases the performance of these optimization algorithms significantly. For setting
the search space boundaries, the approximately known solutions for the homographies at
lower resolution are used. With equation (5), the search space centers are computed.
Then, a Differential Evolution (DE) optimizer is run using common parameter
settings [17].
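A possible realization of this pipeline is sketched below, assuming OpenCV and SciPy: the dense photometric cost approximates Eqs. (6) and (7) by warping the left image and averaging absolute RGB differences, and Differential Evolution searches the 8 free entries of H within bounds placed around an initial guess (e.g., inferred from the lower-resolution homography). The margins, parameters and function names are assumptions, not the paper's exact settings:

```python
import numpy as np
import cv2
from scipy.optimize import differential_evolution

def mapping_error(params, img1, img2):
    """Approximate Eqs. (6)-(7): mean absolute RGB difference after warping img1 by H."""
    H = np.append(params, 1.0).reshape(3, 3)
    h, w = img2.shape[:2]
    warped = cv2.warpPerspective(img1, H, (w, h), flags=cv2.INTER_LINEAR)  # bilinear interp.
    valid = warped.sum(axis=2) > 0                 # crude mask for points mapped outside
    if not valid.any():
        return 1e9
    diff = np.abs(warped.astype(np.float32) - img2.astype(np.float32))
    return float(diff[valid].mean())               # averages the |r|, |g|, |b| terms

def refine_homography(H_init, img1, img2, rel_margin=0.05):
    """Bounded Differential Evolution around an initial homography (8 free parameters)."""
    p0 = (H_init / H_init[2, 2]).ravel()[:8]
    span = rel_margin * np.maximum(np.abs(p0), 1e-3)
    bounds = list(zip(p0 - span, p0 + span))
    result = differential_evolution(mapping_error, bounds, args=(img1, img2),
                                    maxiter=200, polish=True, seed=0)
    return np.append(result.x, 1.0).reshape(3, 3)
```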
3 Experimental Results
For the benchmark generation, 5 sequences are used. Each of the sequences contains 6
images like in the reference benchmark [3]. In Section 3.1, the resulting cost function
values of different resolutions are compared. In Section 3.2 the evaluation protocol [3]
is performed on the new data.
(a) Colors (b) Grace (c) Posters (d) There (e) Underground
Fig. 2. First images of the input image sequences. The resolution is up to 3456 × 2304.
Table 1. Comparison of cost function values E(H) for the homographies for image resolutions
1536×1024 (cf. [12] for Grace) and the new data set with resolution 3456×2304. The resulting
cost function values for each image pair are approximately the same.
Fig. 3. Repeatability results (top row) and the number of correctly detected points
(bottom row) for the Underground sequence with different resolutions. Repeatability
(%) and number of correspondences are plotted against the viewpoint angle for the
Harris-Affine, Hessian-Affine, MSER, IBR and EBR detectors.
significantly lose repeatability score. The repeatability rate of IBR decreases by
between 12% and 15%, and MSER loses up to 25% for large viewpoint changes.
Interestingly, EBR gains about 4% for small viewpoint changes, but loses about 5% for
large viewpoint changes. Generally, none of the detectors can really improve their
performance using high-resolution images.
4 Conclusions
In this paper, high-resolution image data of up to 8 megapixels is presented together
with highly accurate homographies. This data can be used as a benchmark for computer
vision tasks, such as feature detection. In contrast to the most widely used benchmark,
our data provides high-resolution, fully planar scenes with radial distortion removed
and a feature-independent computation of the homographies. They are determined by the
global optimization of a cost function using a dense representation of the images. The
optimization is initialized with values inferred from the solution for lower-resolution
images.
The evaluation shows that none of the standard feature detection approaches improves
in repeatability on higher-resolution images. On the contrary, their performance
Fig. 4. Repeatability results (top row) and the number of correctly detected points
(bottom row) for the Grace sequence with different resolutions. Repeatability (%) and
number of correspondences are plotted against the viewpoint angle for the
Harris-Affine, Hessian-Affine, MSER, IBR and EBR detectors.
References
1. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International
Journal of Computer Vision (IJCV) 60, 63–86 (2004)
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (PAMI) 27, 1615–1630 (2005)
3. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F.,
Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Com-
puter Vision (IJCV) 65, 43–72 (2005)
4. Schmid, C., Mohr, R., Bauckhage, C.: Comparing and evaluating interest points. In: IEEE
International Conference on Computer Vision (ICCV), pp. 230–235 (1998)
5. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. International
Journal of Computer Vision (IJCV) 37, 151–172 (2000)
6. Haja, A., Jähne, B., Abraham, S.: Localization accuracy of region detectors. In: IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008)
7. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Foundations and
Trends in Computer Graphics and Vision, vol. 3 (2008)
8. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part I. LNCS, vol. 2350, pp.
128–142. Springer, Heidelberg (2002)
9. Förstner, W., Dickscheid, T., Schindler, F.: Detecting interpretable and accurate scale-
invariant keypoints. In: IEEE International Conference on Computer Vision (ICCV), Kyoto,
Japan, pp. 2256–2263 (2009)
10. Mobahi, H., Zitnick, C., Ma, Y.: Seeing through the blur. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1736–1743 (2012)
11. Brown, M., Lowe, D.G.: Invariant features from interest point groups. In: British Machine
Vision Conference (BMVC), pp. 656–665 (2002)
12. Cordes, K., Rosenhahn, B., Ostermann, J.: Increasing the accuracy of feature evaluation
benchmarks using differential evolution. In: IEEE Symposium Series on Computational In-
telligence (SSCI) - IEEE Symposium on Differential Evolution (SDE). IEEE Computer So-
ciety (2011)
13. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally
stable extremal regions. British Machine Vision Conference (BMVC) 1, 384–393 (2002)
14. Tuytelaars, T., Gool, L.V.: Wide baseline stereo matching based on local, affinely invariant
regions. In: British Machine Vision Conference (BMVC), pp. 412–425 (2000)
15. Tuytelaars, T., Van Gool, L.: Content-based image retrieval based on local affinely invariant
regions. In: Huijsmans, D.P., Smeulders, A.W.M. (eds.) VISUAL 1999. LNCS, vol. 1614,
pp. 493–500. Springer, Heidelberg (1999)
16. Hartley, R.I., Zisserman, A.: Multiple View Geometry, 2nd edn. Cambridge University Press
(2003)
17. Price, K.V., Storn, R., Lampinen, J.A.: Differential Evolution - A Practical Approach to
Global Optimization. Natural Computing Series. Springer, Berlin (2005)
Fully Automatic Segmentation of AP Pelvis
X-rays via Random Forest Regression
and Hierarchical Sparse Shape Composition
1 Introduction
Segmenting anatomical regions such as the femur and the pelvis is an important task in
the analysis of conventional 2D X-ray images, which benefits many applications such as
disease diagnosis [1,2], operation planning/intervention [3], 3D reconstruction [4,5],
and so on. Traditionally, manual segmentation of X-ray images is both time-consuming
and error-prone. Therefore, automatic methods are beneficial both in efficiency and
accuracy. However, automatic segmentation of X-ray images faces many challenges. The
poor and non-uniform image contrast, along with the noise, makes the segmentation very
difficult. Occlusions such as the overlap between bones make it difficult to identify
local features of bone contours. Furthermore, the existence of implants drastically
interferes with the appearance. Therefore, conventional segmentation techniques [1,3],
which mainly depend on local image features such as edge information, cannot provide
satisfactory results, and model-based segmentation techniques are often adopted [5,6].
However, model-based methods suffer from the requirement of a proper initialization,
which is typically done manually, and from a limited convergence region, leading to
unsatisfactory results.
Fig. 1. (a): Global landmarks. (b): Local landmarks defining the left femur shape. (c):
Local landmarks defining the left pelvis shape.
Fig. 2. Workflow on a test image. Rectangles represent different steps. Clouds represent
different pre-trained models.
2.2 Workflow
Fig. 3. The landmark detection algorithm. (a): A patch sampled around the ground-
truth landmark. (b): Patches sampled for training. (c): For a new image, patches sam-
pled over the image. (d): Each patch produces a prediction of landmark position. (e):
The response image is calculated by combining individual predictions.
In practice, the output of the random forest regressor $d(P_j)$ is not a single value,
but a distribution². Similarly to Eq. (1), we add up the predicted distributions,
obtaining a single distribution (as in Fig. 3(e)), which is called the response image.
In our method, we use the multi-level HoG (Histogram of Oriented Gradients) [13] as
our feature for image patches, and we use a feature selection algorithm proposed in
[14] to efficiently select only the most relevant feature components.
¹ We use 1000 patches per image for both training and testing in our implementation.
² The raw output of the random forest regressor is the set of displacement vectors
stored on the leaf where the test feature vector falls, from which we fit a Gaussian
distribution.
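The following simplified sketch (assumed helpers and parameters; not the authors' implementation) illustrates the voting idea: a random forest regressor predicts a displacement from each patch centre to the landmark, and Gaussian votes are accumulated into the response image:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_displacement_forest(patch_features, displacements):
    """patch_features: (n, d) HoG descriptors; displacements: (n, 2) patch centre -> landmark."""
    return RandomForestRegressor(n_estimators=100, random_state=0).fit(
        patch_features, displacements)

def response_image(forest, patch_features, patch_centres, image_shape, sigma=5.0):
    """Accumulate Gaussian votes of the predicted landmark positions into a response map."""
    resp = np.zeros(image_shape, dtype=np.float64)
    pred = forest.predict(patch_features)                    # predicted (dy, dx) per patch
    ys, xs = np.mgrid[0:image_shape[0], 0:image_shape[1]]
    for (cy, cx), (dy, dx) in zip(patch_centres, pred):
        my, mx = cy + dy, cx + dx                            # voted landmark position
        resp += np.exp(-((ys - my) ** 2 + (xs - mx) ** 2) / (2.0 * sigma ** 2))
    return resp / len(patch_features)
```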
Input: Initial shape $y_0$, landmark response images $\{R(I)_i\}_{i=1,\ldots,K}$, shape model $\mathcal{M}$
Output: Optimal shape $y^*$, pose transform matrix $T$
Procedure:
1. Initialize $y = y_0$, $T =$ the optimal similarity transform from $y$ to the shape model $\mathcal{M}$
2. Update the shape $y$ by locally moving each landmark in the ascent direction of the corresponding response image
3. Regularize the shape $y$ by the shape model $\mathcal{M}$
4. Update the pose $T$ by the optimal similarity transform from $y$ to the shape model $\mathcal{M}$
5. Repeat steps 2 to 4 until convergence. Then $y^* = y$
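A generic version of this loop is sketched below; a fitted PCA model (with transform/inverse_transform, e.g. from scikit-learn) stands in for the hierarchical sparse shape composition, the pose update and convergence test are omitted, and landmarks are moved to the local maximum of their response image rather than along an explicit ascent direction:

```python
import numpy as np

def fit_shape(y0, response_images, shape_model, n_iter=20, radius=10):
    """y0: (K, 2) initial landmark positions (row, col); response_images: K 2D maps;
    shape_model: fitted model exposing transform/inverse_transform (e.g. sklearn PCA)."""
    y = y0.astype(np.float64).copy()
    for _ in range(n_iter):
        # Step 2 (simplified): move each landmark to the best response in a local window
        for k, R in enumerate(response_images):
            cy, cx = np.round(y[k]).astype(int)
            r0, c0 = max(cy - radius, 0), max(cx - radius, 0)
            win = R[r0:cy + radius + 1, c0:cx + radius + 1]
            dy, dx = np.unravel_index(win.argmax(), win.shape)
            y[k] = (r0 + dy, c0 + dx)
        # Steps 3-4 (simplified): regularize by projecting the shape onto the model subspace
        y = shape_model.inverse_transform(
            shape_model.transform(y.reshape(1, -1))).reshape(-1, 2)
    return y
```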
3 Experiments
3.1 Data
We conduct experiments using a collection of 436 AP radiographs from our clinical
partner. A considerable part of these images are post-operative X-ray radiographs
acquired after trauma or joint replacement surgery, which significantly increases the
challenge due to the large variation in femur/pelvis appearance and the presence of
implants. From these 436 images, we randomly select 100 images for training, and the
other 336 images are used for testing purposes. For the training images, we manually
annotate all the (global and local) landmarks.
Fig. 4. Segmentation examples obtained with our method. Yellow: pelvis; green: femur.
the 173 valid images for pelvis, our method succeeded in 169 with a success rate
of 97.7%.
Our method takes around 5 minutes to process one image with an unoptimized
Matlab implementation.
To evaluate the effectiveness of our shape model, we compare it with the PCA-based
shape model (as in [2]), for which the result is shown in Table 3. Comparing Table 3
with Table 2, we see that our shape model outperforms the PCA-based one.
Note that the result reported here is not directly comparable with that of [2] for
several reasons. First, we perform both femur and pelvis segmentation, while [2] only
segments the femur. Second, we model femur contour details such as the lesser
trochanter, and these details are missing in [2], which uses a simplified model.
Third, and most importantly, in part of our test images the regions to be segmented
are occluded by implants (see Fig. 4).
Anatomy Success rate Median Min. Max. Mean Std. 97.5th percentile
Femur 98.4% 1.2 0.6 3.4 1.3 0.6 2.7
Pelvis 97.7% 2.1 1.0 3.7 2.2 0.5 3.4
Anatomy Success rate Median Min. Max. Mean Std. 97.5th percentile
Femur 97.1% 1.3 0.6 3.6 1.4 0.6 3.0
Pelvis 95.4% 2.4 1.2 3.8 2.5 0.6 3.5
4 Conclusions
We have proposed a new fully-automatic method for left femur and pelvis seg-
mentation in conventional X-ray images. Our method features a hierarchical
segmentation framework and a shape model based on sparse representation. Ex-
periments show that our method achieves good results, and that the different
contributions (feature selection, shape model) indeed improve the performance.
Although we demonstrate our method using the left femur and left pelvis, our
method can be readily extended to the right side. In the future, we are also
interested in extending our method into 3D data.
References
1. Chen, Y., Ee, X., Leow, W.-K., Howe, T.S.: Automatic extraction of femur contours
from hip X-ray images. In: Liu, Y., Jiang, T.-Z., Zhang, C. (eds.) CVBIA 2005.
LNCS, vol. 3765, pp. 200–209. Springer, Heidelberg (2005)
2. Lindner, C., Thiagarajah, S., Wilkinson, J.M., Wallis, G.A., Cootes, T.F.: Accurate
fully automatic femur segmentation in pelvic radiographs using regression voting.
In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012, Part
III. LNCS, vol. 7512, pp. 353–360. Springer, Heidelberg (2012)
3. Gottschling, H., Roth, M., Schweikard, A., Burgkart, R.: Intraoperative,
fluoroscopy-based planning for complex osteotomies of the proximal femur. IJMR-
CAS 1(3), 33–38 (2005)
4. Baka, N., Kaptein, B.L., Bruijne, M., van Walsum, T., Giphart, J.E., Niessen, W.J.,
Lelieveldt, B.P.: 2D-3D shape reconstruction of the distal femur from stereo x-ray
imaging using statistical shape model. Med. Image Anal. 15(6), 840–850 (2011)
5. Dong, X., Zheng, G.: Automatic extraction of proximal femur contours from cali-
brated x-ray images using 3D statistical models: an in vitro study. IJMRCAS 5(2),
213–222 (2009)
6. Cristinacce, D., Cootes, T.: Automatic feature localization with constrained local
models. Pattern Recognition 41(19), 3054–3067 (2008)
7. Zhou, S.K., Comaniciu, D.: Shape regression machine. In: Karssemeijer, N.,
Lelieveldt, B. (eds.) IPMI 2007. LNCS, vol. 4584, pp. 13–25. Springer, Heidelberg
(2007)
8. Zheng, Y., Barbu, A., Georgescu, B., Scheuering, M., Comaniciu, D.: Four-chamber
heart modeling and automatic segmentation of 3-D cardiac CT volumes using
marginal space learning and steerable features. IEEE T. Med. Imaging 27(11),
1668–1681 (2008)
9. Pauly, O., Glocker, B., Criminisi, A., Mateus, D., Möller, A.M., Nekolla, S., Navab,
N.: Fast multiple organ detection and localization in whole-body MR Dixon se-
quences. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011, Part III.
LNCS, vol. 6893, pp. 239–247. Springer, Heidelberg (2011)
10. Criminisi, A., Shotton, J., Robertson, D., Konukoglu, E.: Regression forests for
efficient anatomy detection and localization in CT studies. In: MCV 2010, pp.
106–117 (2010)
11. Gall, J., Lempitsky, V.: Class-specific Hough forests for object detection. In: CVPR,
pp. 1022–1029 (2009)
12. Zhang, S., Zhan, Y., Dewan, M., Huang, J., Metaxas, D.N., Zhou, X.S.: Sparse
shape composition: a new framework for shape prior modeling. In: CVPR (2011)
13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR, vol. I, pp. 886–893 (2005)
14. Chen, C., Yang, Y., Nie, F., Odobez, J.M.: 3D human pose recovery from image
by efficient visual feature selection. CVIU 115(3), 290–299 (2011)
15. Cootes, T.F., Taylor, C.J.: Active shape models-‘smart snakes’. In: BMVC (1992)
Language Adaptive Methodology
for Handwritten Text Line Segmentation
1 Introduction
Text line segmentation of handwritten documents is much more difficult than that of
printed documents. Unlike printed documents, which have approximately straight and
parallel text lines, the lines in handwritten documents are often non-uniformly skewed
and curved. Moreover, the spaces between handwritten text lines are often not obvious
compared to the spaces between within-line characters, and some text lines may
interfere with each other. Therefore many text line detection techniques, such as
projection analysis [7] [5], the Hough transform [6] and K-nearest-neighbour connected
components (CCs) grouping [9], are not able to segment handwritten text lines
successfully, and a uniform approach that handles all kinds of challenges is still not
available. Figure 1 shows an example of an unconstrained handwritten document. Text
document image segmentation can be roughly categorized into three classes: top-down,
bottom-up, and hybrid. Top-down methods partition the document image recursively into
text regions, text lines, and words/characters under the assumption of straight lines.
Bottom-up methods group small units of the image (pixels, CCs, characters, words,
etc.) into text lines and then text regions. Bottom-up grouping can be viewed as a
clustering process, which aggregates image components according to proximity and does
not rely on the assumption of straight lines. Hybrid methods
Fig. 1. Example image of a general handwritten text paragraph from the IAM dataset [4]
Fig. 2. MST generated for the text paragraph shown in Figure 1; the green pixels mark
the centroids of every connected component, and the red lines depict the edges of the
MST of the graph of connected components of the same figure
Fig. 4. Forest remaining after removing the weak edges from Figure 1 using the CSF
where CSF = 0 only when $d = y_d$, meaning that the two components have the minimum
connectivity strength: they are orthogonal and hence belong to different lines, which
are almost parallel. Conversely, CSF = ∞ only when $y_d = 0$, meaning that the
connectivity between the two components is the strongest, as they both belong to the
same line; the angle between the two components is zero, aligning them on the same
line, as shown in Figure 3.
Thus, after applying the CSF rules to Figure 1, we remove the misaligned components
from the text lines and generate the forest of the given document image, as shown in
Figure 4, where the forest is defined as a group of trees and every tree is a text
line. Text lines may be over-segmented in a single iteration of CSF. To overcome this
situation, we apply the same process to the forest, treating each tree of the forest
as a single node, finding the connected graph and applying the CSF approach again.
Finally, we obtain a forest having a single tree for every text line, as shown in
Figure 4. Figure 5(a) shows an example of a Hindi handwritten
Fig. 5. (a) Example image of a Hindi handwritten document. (b) Forest remaining after
applying CSF.
document. Figure 5(b) shows the result of the proposed approach. The complete
process can be enumerated as shown in Algorithm 1.
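Algorithm 1 itself is not reproduced in this excerpt; the rough sketch below only illustrates the surrounding pipeline (connected components, MST over their centroids, removal of weak edges, and the trees of the remaining forest taken as text lines). The simple angle threshold is a stand-in for the CSF criterion, and this single pass omits the second CSF iteration on the forest:

```python
import numpy as np
import cv2
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def segment_lines(binary_image, max_angle_deg=30.0):
    """binary_image: uint8 image with text pixels > 0. Returns (n_lines, line label per CC)."""
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_image)
    c = centroids[1:]                                     # drop the background component
    mst = minimum_spanning_tree(squareform(pdist(c))).tocoo()
    adj = np.zeros((len(c), len(c)))
    for u, v in zip(mst.row, mst.col):
        dx, dy = c[v] - c[u]                              # centroids are (x, y)
        angle = abs(np.degrees(np.arctan2(dy, dx)))
        if min(angle, 180.0 - angle) <= max_angle_deg:    # near-horizontal edge: keep (strong)
            adj[u, v] = 1                                 # otherwise the weak edge is removed
    n_lines, line_of_cc = connected_components(adj, directed=False)
    return n_lines, line_of_cc                            # every remaining tree is a text line
```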
3 Experimental Results
Fig. 6. (A) Example image of curved lines in a handwritten text document. (B) Forest
remaining after applying CSF.
Fig. 7. (a) Example image of an Urdu handwritten document. (b) Forest remaining after
applying CSF.
Fig. 8. (A) Example image of skewed lines in a handwritten text document. (B) Forest
remaining after applying CSF.
The experimental results of the proposed line segmentation approach show that the
proposed CSF improves line segmentation accuracy significantly in all cases. The
proposed method was also compared with other state-of-the-art methods in experiments
on the large IAM handwritten documents dataset [4], and its superiority was
demonstrated. The accuracy of the proposed text line segmentation method is summarized
in Table 1.
Line types      Total no. of lines   Accurately detected lines   Accuracy rate (%)
Printed lines   320                  320                         100
Skewed lines    2600                 2520                        96.92
Curved lines    1750                 1670                        95.42
4 Conclusions
In this paper, a language-adaptive approach for handwritten text line segmentation
with CSF has been presented and applied to the IAM dataset and to documents collected
from different writers in different languages, such as Hindi, English and Urdu. The
proposed text line segmentation approach, with its novel use of CSF, has the advantage
of language adaptivity on highly curved and skewed text lines. In the experiments, an
average accuracy of 97.30% was observed for the system. The result obtained by this
segmentation is a forest of lines. It shows that the proposed system is capable of
accurately locating the text lines in images and documents. Future work mainly
concerns the sequential arrangement of all the lines of the forest according to their
order of appearance in the paragraph, so that the sequential strokes can be sent to
the next step of the recognition system.
References
1. Ouwayed, N., Belaid, A.: A general approach for multi-oriented text line extraction
of handwritten documents. IJDAR 15(4), 297–314 (2012)
2. Kumar, J., et al.: Handwritten Arabic text line segmentation using affinity propa-
gation. In: DAS 2010, pp. 135–142 (2010)
3. Papavassiliou, V., et al.: Handwritten document image segmentation into text lines
and words. Pattern Recognition 43, 369–377 (2010)
4. Marti, U., Bunke, H.: The IAM-database: An English Sentence Database for Off-
line Handwriting Recognition. Int. Journal on Document Analysis and Recognition,
IJDAR 5, 39–46 (2002)
5. Likforman Sulem, L., Zahour, A., Taconet, B.: Text line segmentation of historical
documents: A survey. IJDAR 9, 123–138 (2007)
6. Likforman-Sulem, L., Hanimyan, A., Faure, C.: A Hough based algorithm for ex-
tracting text lines in handwritten documents. In: Proc. 3rd Int. Conf. on Document
Analysis and Recognition, pp. 774–777 (1995)
7. Zamora-Martinez, F., Castro-Bleda, M.J., España-Boquera, S., Gorbe-Moya, J.: Un-
constrained offline handwriting recognition using connectionist character N-grams.
In: The 2010 International Joint Conference on Neural Networks (IJCNN), July
18-23, pp. 1–7 (2010)
8. Yin, F., Liu, C.-L.: Handwritten Chinese text line segmentation by clustering with
distance metric learning. Pattern Recognition (Elsevier) 42(12), 3146–3157 (2009)
9. Kumar, M., Jindal, M.K., Sharma, R.K.: K-nearest neighbour Based offline Hand-
written Gurumukhi Character Recognition. In: International IEEE Conference on
Image Information Processing (ICIIP 2011), vol. 1, pp. 7–11 (2011)
Learning Geometry-Aware Kernels
in a Regularization Framework
1 Introduction
There has recently been a surge of interest in learning algorithms that are aware
of the geometric structure of the data. These algorithms have been successfully
applied to pattern recognition, image analysis, data mining, etc. A kernel function,
defining the similarity between the data in Reproducing Kernel Hilbert Space
(RKHS), can capture the structure of the data. Thus, the use of kernels for
learning the geometric structure of the data has received a significant amount
of attention. Such kernels are called geometry-aware kernels. Algorithms for
learning geometry-aware kernels can be roughly classified into two categories.
Algorithms in the first category only explore the geometric structure of the data, ignoring other sources of information. Kondor and Lafferty propose Diffusion Kernels, which originate from the heat equation on a geometric manifold and are aware of the data geometry [1]. Smola and Kondor show that the spectrum of the graph Laplacian can be passed through various filter functions, leading to a family of geometry-aware kernels [2]. Some examples of such kernels are given, including the Regularized Laplacian Kernel, Diffusion Kernel, Random Walk Kernel and Inverse Cosine Kernel. Some well-known algorithms for dimensionality reduction on manifolds can be unified in a kernel perspective [3]; these algorithms can be interpreted as kernel PCA with specifically constructed Gram matrices. Other researchers focus on learning geometry-aware kernels for nonlinear dimensionality reduction [4,5]. These methods are unsupervised and well suited to the task of dimensionality reduction. However, they may not give satisfactory performance
on supervised tasks.
The second category learns kernels from multiple sources of information, including data geometry, side information and so on. Sindhwani et al. show how standard kernels can be adapted to incorporate data geometry while retaining an out-of-sample extension [6]. Song et al. present a variant of Maximum Variance Unfolding that is aware of the data geometry and side information [7]. Learning geometry-aware kernels from nonparametric transforms of the graph Laplacian is discussed in the semi-supervised learning scenario [8]. Some studies focus on learning nonparametric geometry-aware kernels with the help of manifold regularization [9]. Compared with the algorithms in the first category, these methods are more suitable for supervised tasks, since task-related information is incorporated into them.
In this paper, we present a general framework for learning geometry-aware kernels. Our framework involves an optimization problem that minimizes a divergence between the learnt kernel matrix and a given prior matrix, together with a regularization term. Some existing geometry-aware kernels can be unified in our framework, and new geometry-aware kernels can be developed. We show how to integrate multiple sources of information within the framework, leading to a family of algorithms obtained by choosing different divergences, prior matrices and regularization terms. Empirical results indicate that our algorithms significantly improve performance.
where K is the kernel matrix on X associated with φ, i.e., Kij = φ(xi)^T φ(xj), L = I − D^{−1/2} W D^{−1/2} is the normalized Laplacian matrix, and D is the diagonal matrix with entries Dii = Σj Wji. Thus, the regularization term is Ω(K) = tr(KL).
2.4 Algorithm
The kernel learning problem with manifold regularization is

    min_K  F(K) − F(K0) − tr((K − K0) ∇F(K0)^T) + γ tr(KL)        (6)
    s.t.   K ⪰ 0.
Learning Geometry-Aware Kernels in a Regularization Framework 355
We will show how the positive semi-definiteness constraint can be satisfied au-
tomatically by choosing various kinds of Bregman divergence. Specifically, we
present the following kernels:
    K = K0 − (γ/2) L                  (squared Frobenius norm)    (9)
    K = exp(log K0 − γL)              (von Neumann divergence)    (10)
    K = (K0^{−1} + γL)^{−1}           (LogDet divergence)         (11)
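The three closed-form updates, together with the diffusion kernel of Equation (12) below, can be illustrated with a short sketch (not the authors' implementation); K0 is assumed symmetric positive definite so that its matrix logarithm exists.

```python
# Illustrative sketch of the updates (9)-(11) and of the diffusion kernel (12).
import numpy as np
from scipy.linalg import expm, logm, inv

def frobenius_kernel(K0, L, gamma):       # Eq. (9)
    return K0 - 0.5 * gamma * L

def von_neumann_kernel(K0, L, gamma):     # Eq. (10); .real drops tiny
    return expm(np.real(logm(K0)) - gamma * L)   # imaginary round-off

def logdet_kernel(K0, L, gamma):          # Eq. (11)
    return inv(inv(K0) + gamma * L)

def diffusion_kernel(L, gamma):           # Eq. (12)
    return expm(-gamma * L)
```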
3 Discussions
K = exp(−γL). (12)
Sindhwani’s Work. Note that Equation (11) is the Gram matrix of the kernel
proposed in [6]. Given new test data, we can compute the out-of-sample ex-
tension in Equation (11) by enlarging K0 with additional test data and setting
L̃ = diag(L, O). The resulting formulation is the same as the kernel function pre-
sented in [6]. Therefore, we reinterpret Sindhwani’s work in our regularization
framework.
4 Experiments
We perform experiments on clustering and classification tasks. We use the von Neumann divergence (Equation (10)) and choose various prior kernel matrices, leading to three algorithms: Manifold Regularized Gaussian Kernel (MR-Gaussian), Manifold Regularized Linear Kernel (MR-Linear) and Manifold Regularized Ideal Kernel (MR-Ideal). Comparisons are made with the Diffusion Kernel (Equation (12)), the Gaussian kernel κG and the linear kernel κL :
4.1 Classification
We consider the transductive setting, where the test data are available before the classifier is learned. The classification experiments are designed using the USPS dataset, which contains 16 × 16 grayscale images of handwritten digits. We select challenging tasks where the digits are similar: 2 vs 3, and 5 vs 6. The first 400 images for each digit were taken to form the dataset. Training data are normalized to zero mean and unit variance, and the same processing is applied to the test data. The parameter of the Gaussian kernel is tuned via 2-fold cross validation. The trade-off γ in (10) is fixed to 10. We compute accuracy using a 1-norm soft margin SVM classifier with regularization parameter C = 1. The results are averaged over 20 runs.
The number of training data per digit is fixed to 10 and the total number of images per digit is varied within the set {25, 50, 100, 200, 400}. We wish to see how the accuracies vary with accurate and inaccurate manifolds. The averaged performances are tabulated in Tables 1 and 2. The accuracies of the Gaussian and linear kernels change only slightly with the total number of data: since these two kernels do not consider the geometric structure of the data, the additional data have little influence on their performance. The Diffusion Kernel performs unstably. The performance of MR-Gaussian and MR-Linear is unsatisfactory with 25 data points, but improves as more data become available, as does that of MR-Ideal. This is in accordance with our conjecture: when the total number of data is very limited, the recovered manifold is inaccurate, so a kernel built on this inaccurate information yields worse performance. But once the data are sufficient to reconstruct the manifold, our algorithms give higher accuracies.
Table 1. Accuracy (%) of 2 vs 3 task (#training=10). The best results are highlighted.
Table 2. Accuracy (%) of 5 vs 6 task (#training=10). The best results are highlighted.
4.2 Clustering
The MNIST dataset is used for clustering. This dataset contains 28×28 grayscale
images of handwritten digits. We also take the first 400 images for each digit to
form the dataset. The data are normalized to zero mean and unit variance. The
parameter of Gaussian kernel is fixed to 104 . The trade-off γ in (10) is also fixed
to 10. We use the kernel k-means algorithm and evaluate the performance by
computing Normalized Mutual Information (NMI) and clustering accuracy. For
two random variables A and B, the NMI is defined as:

    NMI = I(A, B) / √(H(A) H(B)),                (18)
where I(A, B) is the mutual information between the random variables A and
B, H(A) is the Shannon entropy of A. High NMI value indicates that the cluster
and true labels match well. The clustering accuracy is defined as:
    Accuracy = ( Σ_{i=1}^{n} δ(yi , map(ci)) ) / n,                (19)
where n is the number of data, yi denotes the true label and ci denotes the
corresponding cluster label, δ(y, c) is a function that equals 1 if y = c and equals
0 otherwise. map(·) is a permutation function that maps each cluster label to a
true label. This optimal matching can be found with the Hungarian algorithm.
We run kernel k-means on the dataset 10 times with random initialization, then
average the NMI and Accuracy values.
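A small illustrative sketch of this evaluation protocol is given below, assuming label vectors y_true and y_pred are available; the Hungarian matching is computed with SciPy and the NMI with scikit-learn's geometric normalization, which matches the √(H(A)H(B)) form of Equation (18).

```python
# Sketch of the evaluation in Eqs. (18)-(19): NMI plus clustering accuracy with
# the optimal cluster-to-label mapping found by the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    labels, clusters = np.unique(y_true), np.unique(y_pred)
    # confusion[i, j] = number of points in cluster i with true label j
    confusion = np.array([[np.sum((y_pred == c) & (y_true == l))
                           for l in labels] for c in clusters])
    row, col = linear_sum_assignment(-confusion)   # maximise matched count
    return confusion[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])              # permuted cluster labels
print(clustering_accuracy(y_true, y_pred))          # -> 1.0
print(normalized_mutual_info_score(y_true, y_pred, average_method='geometric'))
```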
The experiment is designed to investigate the impact of the number of manifolds on the performance. Since each digit can be viewed as a submanifold in the input space, we vary the number of digits within the set {2, 3, . . . , 10}. We first adopt digits 1 and 2 for evaluation, then add the remaining digits one by one. The results are shown in Figure 1. As the number of digits increases, the performance of all algorithms tends to degrade, because the mixture of manifolds complicates the problem. Our MR-Gaussian outperforms the other two algorithms in all but one case. This demonstrates that the incorporation of manifold structure provides advantages in clustering.
Fig. 1. MNIST results: (a) Accuracy (%) and (b) NMI as functions of the number of digits (2–10), for MR-Gaussian, Gaussian and Diffusion kernels.
5 Conclusions
References
1. Kondor, R.I., Lafferty, J.: Diffusion kernels on graphs and other discrete input
spaces. In: Proceedings of the 19th Annual International Conference on Machine
Learning, pp. 315–322 (2002)
2. Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Schölkopf, B.,
Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–158.
Springer, Heidelberg (2003)
3. Ham, J., Lee, D., Mika, S., Schölkopf, B.: A kernel view of the dimensionality
reduction of manifolds. In: Proceedings of the 21st Annual International Conference
on Machine Learning, pp. 47–54 (2004)
4. Weinberger, K., Sha, F., Saul, L.: Learning a kernel matrix for nonlinear dimen-
sionality reduction. In: Proceedings of the 21st Annual International Conference
on Machine Learning, pp. 839–846 (2004)
5. Lawrence, N.D.: A unifying probabilistic perspective for spectral dimensionality
reduction: Insights and new models. The Journal of Machine Learning Research 12,
1609–1638 (2012)
6. Sindhwani, V., Niyogi, P., Belkin, M.: Beyond the point cloud: from transductive
to semi-supervised learning. In: Proceedings of the 22nd International Conference
on Machine Learning, pp. 824–831 (2005)
7. Song, L., Smola, A., Borgwardt, K., Gretton, A.: Colored maximum variance un-
folding. Advances in Neural Information Processing Systems 20, 1385–1392 (2008)
8. Zhu, X., Kandola, J., Ghahramani, Z., Lafferty, J.: Nonparametric transforms of
graph kernels for semi-supervised learning. Advances in Neural Information Pro-
cessing Systems 17, 1641–1648 (2005)
9. Zhuang, J., Tsang, I., Hoi, S.: A family of simple non-parametric kernel learning
algorithms. The Journal of Machine Learning Research 12, 1313–1347 (2011)
10. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric frame-
work for learning from labeled and unlabeled examples. The Journal of Machine
Learning Research 7, 2399–2434 (2006)
Motion Trend Patterns for Action Modelling
and Recognition
1 Introduction
Action recognition has become a very important topic in computer vision in
recent years, due to its applicative potential in many domains, like video surveil-
lance, human computer interaction, or video indexing. In spite of many proposed
methods exhibiting good results on academic databases, action recognition in real time and in real conditions is still a big challenge. Previous work cannot be reviewed extensively here; we refer to [1] for a comprehensive survey. In the following, we concentrate on the two classes of methods most related to our work: trajectory-based modelling and dynamic texture methods.
An important approach to action representation is to extract features from point trajectories of moving objects; such trajectories have long been considered an efficient feature for representing actions. Johansson [2] showed that human subjects can perceive a structured action such as walking from points of light attached to the walker's body. Messing et al. [3], inspired by human psychovisual performance, extracted features from the velocity histories of keypoints obtained with a KLT tracker. Sun et al. [4] used trajectories of SIFT points and encoded motion in three levels of context information: point level, intra-trajectory context and
to compute statistics on the set of trajectories. In this work we use the semi-dense point tracking method [13], which is a trade-off between long-term tracking and dense optical flow and allows a high number of weak keypoints to be tracked in a video in real time, thanks to its high level of parallelism. Using a GPU implementation, this method can handle 10 000 points per frame at 55 frames/s on 640 × 480 videos. In addition, it is robust to sudden camera motion changes
thanks to a dominant acceleration estimation. Figure 1 shows several actions
represented by their corresponding beams of trajectories.
Fig. 1. Actions from the KTH dataset represented as beams of trajectories. For actions
(d-f), only the most recent part of the trajectory is displayed.
[Figure: motion trend encoding between successive trajectory points p_{i−1}, p_i and p_{i+1}, with the corresponding 2-bit codes 00, 01, 10 and 11.]
Classification
To perform action classification, we choose the SVM classifier of Vedaldi et al. [15], which approximates large-scale support vector machines using an explicit feature map for the additive class of kernels. It is generally much faster than non-linear SVMs and can be used on large-scale problems.
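As an illustration of this classification step, the sketch below trains a linear SVM on explicitly mapped histogram descriptors; scikit-learn's AdditiveChi2Sampler implements this kind of approximate feature map for an additive kernel, and the descriptor matrix X and labels y are placeholders rather than real MTP features.

```python
# Sketch of the classification step: explicit additive-kernel feature map
# followed by a fast linear SVM. X and y are placeholder descriptors/labels.
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((200, 256))            # placeholder non-negative histograms
y = rng.integers(0, 6, size=200)      # placeholder action labels

clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
clf.fit(X, y)
print(clf.score(X, y))
```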
5 Experiments
We evaluate our descriptor on two well-known datasets. The first one (KTH) [16]
is a classic dataset, used to evaluate many action recognition methods. The
second one (UCF Youtube) [17] is a more realistic and challenging dataset.
Extraction of Semi-Dense Trajectories. We have studied the influence of the extraction of semi-dense trajectories on the performance of our model. We changed the parameters of the semi-dense point tracker [13] to modify the number of trajectories obtained on the video. We observe that, as long as the average matching error does not increase significantly, the more trajectories we have, the better the recognition rate. The improvement can reach 5-6% on the KTH dataset. Table 1 shows the recognition rate obtained on this dataset for different average numbers of tracked trajectories. In our experiments, the average number is set to 5 000, which gives good results.
Table 1. Recognition rate (%) on the KTH dataset as a function of the number of trajectories
Mean number of trajectories per video   1 240   2 700   3 200   5 271   7 763
Recognition rate (%)                    87.5    88.33   90      92.5    90.83
1 It should contain 600 videos but one is missing.
Table 4. Confusion matrix on UCF. Ground truth (by row) and predicted (by column)
labels are: basketball, biking, diving, golf swing, horse riding, soccer juggling, swing,
tennis swing, trampoline jumping, volleyball spiking, walking with dog.
6 Conclusions
We have presented a new action model based on semi-dense trajectories and an LBP-like encoding of motion trends. It allows online action recognition on unsegmented videos at low computational complexity.
For the future, we are interested in using other variants of the LBP operator. A temporal multi-scale approach for MTP encoding will also be considered. Furthermore, we will address the effect of camera motion on the performance of our model, in order to deal with uncontrolled realistic videos.
References
1. Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Comput. Surv.
43, 16:1–16:43 (2011)
2. Johansson, G.: Visual perception of biological motion and a model for its analysis.
Perception and Psychophysics 14, 201–211 (1973)
3. Messing, R., Pal, C., Kautz, H.A.: Activity recognition using the velocity histories
of tracked keypoints. In: ICCV 2009, pp. 104–111 (2009)
4. Sun, J., Wu, X., Yan, S., Cheong, L.F., Chua, T.S., Li, J.: Hierarchical spatio-
temporal context modeling for action recognition. In: CVPR, pp. 2004–2011 (2009)
5. Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving
camera using motion decomposition of lagrangian particle trajectories. In: ICCV,
pp. 1419–1426 (2011)
6. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajecto-
ries. In: CVPR, pp. 3169–3176 (2011)
7. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. PAMI 24, 971–987 (2002)
8. Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns
with an application to facial expressions. PAMI 29, 915–928 (2007)
9. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Human activity recognition using a
dynamic texture based method. In: BMVC (2008)
10. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Texture based description of move-
ments for activity analysis. In: VISAPP (2), 206–213 (2008)
11. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: ICCV,
pp. 492–497 (2009)
12. Kliper-Gross, O., Gurovich, Y., Hassner, T., Wolf, L.: Motion interchange patterns
for action recognition in unconstrained videos. In: Fitzgibbon, A., Lazebnik, S.,
Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VI. LNCS, vol. 7577, pp.
256–269. Springer, Heidelberg (2012)
13. Garrigues, M., Manzanera, A.: Real time semi-dense point tracking. In: Campilho,
A., Kamel, M. (eds.) ICIAR 2012, Part I. LNCS, vol. 7324, pp. 245–252. Springer,
Heidelberg (2012)
14. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: CVPR, pp. 2169–2178 (2006)
15. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps.
PAMI 34, 480–492 (2012)
16. Schuldt, C., Laptev, I., Caputo, B.: Recognizing Human Actions: A Local SVM
Approach. In: ICPR, pp. 32–36 (2004)
17. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from video “in the wild”.
In: CVPR, pp. 1996–2003 (2009)
18. Tabia, H., Gouiffès, M., Lacassagne, L.: Motion histogram quantification for human
action recognition. In: ICPR, pp. 2404–2407 (2012)
19. Mattivi, R., Shao, L.: Human action recognition using lbp-top as sparse spatio-
temporal feature descriptor. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS,
vol. 5702, pp. 740–747. Springer, Heidelberg (2009)
20. Lu, Z., Peng, Y., Ip, H.H.S.: Spectral learning of latent semantics for action recog-
nition. In: ICCV, pp. 1503–1510 (2011)
21. Bregonzio, M., Li, J., Gong, S., Xiang, T.: Discriminative topics modelling for
action feature selection and recognition. In: BMVC, pp. 1–11 (2010)
22. Wang, S., Yang, Y., Ma, Z., Li, X., Pang, C., Hauptmann, A.G.: Action recognition
by exploring data distribution and feature correlation. In: CVPR, pp. 1370–1377
(2012)
On Achieving Near-Optimal “Anti-Bayesian”
Order Statistics-Based Classification
for Asymmetric Exponential Distributions
Abstract. This paper considers the use of Order Statistics (OS) in the
theory of Pattern Recognition (PR). The pioneering work on using OS
for classification was presented in [1] for the Uniform distribution, where
it was shown that optimal PR can be achieved in a counter-intuitive
manner, diametrically opposed to the Bayesian paradigm, i.e., by com-
paring the testing sample to a few samples distant from the mean. In
[2], we showed that
the results could be extended for a few symmetric distributions within
the exponential family. In this paper, we attempt to extend these results
significantly by considering asymmetric distributions within the expo-
nential family, for some of which even the closed form expressions of
the cumulative distribution functions are not available. These distribu-
tions include the Rayleigh, Gamma and certain Beta distributions. As
in [1] and [2], the new scheme, referred to as Classification by Moments
of Order Statistics (CMOS), attains an accuracy very close to the opti-
mal Bayes’ bound, as has been shown both theoretically and by rigorous
experimental testing.
1 Introduction
Class conditional distributions have numerous indicators such as their means,
variances etc., and these indices have, traditionally, played a prominent role in
achieving pattern classification, and in designing the corresponding training and
testing algorithms. It is also well known that a distribution has many other
characterizing indicators, for example, those related to its Order Statistics (OS).
The interesting point about these indicators is that some of them are quite
unrelated to the traditional moments themselves, and in spite of this, have not
been used in achieving PR. The amazing fact, demonstrated in [3] is that OS can
be used in PR, and that such classifiers operate in a completely “anti-Bayesian”
manner, i.e., by only considering certain outliers of the distribution.
Chancellor's Professor; Fellow of the IEEE and Fellow of the IAPR. This author is also an Adjunct Professor with the University of Agder in Grimstad, Norway. The work of this author was partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
Earlier, in [1] and [2], we showed that we could obtain optimal results by
an “anti-Bayesian” paradigm by using the OS. Interestingly enough, the novel
methodology that we propose, referred to as Classification by Moments of Order
Statistics (CMOS), is computationally not any more complex than working with
the Bayesian paradigm itself. This was done in [1] for the Uniform distribution
and in [2] for certain distributions within the exponential family. In this paper,
we attempt to extend these results significantly by considering asymmetric dis-
tributions within the exponential family, for some of which even the closed form
expressions of the cumulative distribution functions are not available. Examples
of these distributions are the Rayleigh, Gamma and certain Beta distributions.
Again, as in [1] and [2], we show the completely counter-intuitive result that
by working with a very few (sometimes as few as two) points distant from
the mean, one can obtain remarkable classification accuracies, and this has been
demonstrated both theoretically and by experimental verification.
Let us assume that we are dealing with the 2-class problem with classes ω1 and ω2, whose class-conditional densities are f1(x) and f2(x) respectively (i.e., their corresponding distributions are F1(x) and F2(x) respectively)1. Let ν1 and ν2 be the corresponding medians of the distributions. Then, classification based on ν1 and ν2 is the strategy that classifies samples based on a single OS. We can see that, for all symmetric distributions, this classification attains the Bayes' accuracy.
This result is not too astonishing, because the median is centrally located, close to (if not exactly at) the mean. The result for higher-order OS is actually far more intriguing, because the higher-order OS are not located centrally (close to the means) but rather distant from the means. In [2], we have shown that
for a large number of distributions, mostly from the exponential family, the
classification based on these OS again attains the Bayes’ bound. These results
are now extended for asymmetric exponential distributions.
The pdf of the Rayleigh distribution, whose applications are found in [4], with parameter σ > 0 is ϕ(x, σ) = (x/σ²) e^{−x²/(2σ²)}, x ≥ 0, and the cumulative distribution function is F(x) = 1 − e^{−x²/(2σ²)}. The 2-OS points used by CMOS are u1 = σ √(2 ln 3) and u2 = θ + σ √(2 ln(3/2)).
Theorem 1. For the 2-class problem in which the two class conditional distributions are Rayleigh and identical, the accuracy obtained by CMOS deviates from the optimal Bayes' bound as the solution of the transcendental equality ln(x/(x − θ)) = (−θ² + 2θx)/(2σ²) deviates from θ/2 + (σ/√2)(√(ln 3) + √(ln(3/2))).
Table 1. A comparison of the accuracy (%) of the Bayesian and the 2-OS CMOS classifiers for the Rayleigh distribution

θ          3      2.5     2      1.5     1
Bayesian   99.1   97.35   94.45  87.75   78.80
CMOS       99.1   97.35   94.40  87.70   78.65
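The 2-OS CMOS rule can be illustrated with a small Monte Carlo sketch in which a test sample is assigned to the class whose order-statistic point (u1 or u2) is closer; σ = 1 is an assumption here, since Table 1 does not list it, so the numbers are only indicative of the trend rather than a reproduction of the table.

```python
# Monte Carlo sketch of the 2-OS CMOS rule for two identical Rayleigh classes,
# the second shifted by theta; sigma = 1 is an assumption.
import numpy as np

def cmos_accuracy(theta, sigma=1.0, n=200000, seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.rayleigh(scale=sigma, size=n)            # class omega_1
    x2 = theta + rng.rayleigh(scale=sigma, size=n)    # class omega_2
    u1 = sigma * np.sqrt(2.0 * np.log(3.0))           # 2/3 percentile of F1
    u2 = theta + sigma * np.sqrt(2.0 * np.log(1.5))   # 1/3 percentile of F2
    acc1 = np.mean(np.abs(x1 - u1) < np.abs(x1 - u2)) # closer OS point wins
    acc2 = np.mean(np.abs(x2 - u2) <= np.abs(x2 - u1))
    return 0.5 * (acc1 + acc2)

for theta in (3, 2.5, 2, 1.5, 1):
    print(theta, round(100 * cmos_accuracy(theta), 2))
```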
We shall now consider the scenario in which we utilize other k-OS. Let u1 be the point for the percentile (n+1−k)/(n+1) of the first distribution, and u2 be the point for the percentile k/(n+1) of the second distribution. Then,

    ∫₀^{u1} (x/σ²) e^{−x²/(2σ²)} dx = (n+1−k)/(n+1)  ⟹  u1 = σ √(2 ln((n+1)/k))   and   u2 = θ + σ √(2 ln((n+1)/(n+1−k))).
Theorem 3. For the 2-class problem in which the two class conditional distributions are Rayleigh and identical, a near-optimal Bayesian classification can be achieved by using symmetric pairs of the n-OS, i.e., the (n − k) OS for ω1 and the k OS for ω2, if and only if √(ln((n+1)/k)) − √(ln((n+1)/(n+1−k))) < θ/(σ√2). The classification obtained by CMOS deviates from the optimal Bayes' bound as the solution of the transcendental equality ln(x/(x − θ)) = (−θ² + 2θx)/(2σ²) deviates from θ/2 + (σ/√2)(√(ln((n+1)/k)) + √(ln((n+1)/(n+1−k)))).
Proof. The proof of this theorem is omitted here, but is included in [4]. □
Table 2. A comparison of the accuracy of the Bayesian classifier (i.e., 82.15%) and the k-OS CMOS classifier for the Rayleigh distribution, using symmetric pairs of the OS for different values of n (where σ = 2 and θ = 1.5)

No.  Order (n)  Moments                       OS1              OS2                   CMOS    CMOS / Dual CMOS
1    Two        2/3, 1/3                      σ√(2 ln 3)       θ + σ√(2 ln(3/2))     82.05   CMOS
2    Four       (5−i)/5, i/5, 1 ≤ i ≤ n/2     σ√(2 ln(5/2))    θ + σ√(2 ln(5/3))     82.0    CMOS
3    Six        (7−i)/7, i/7, 1 ≤ i ≤ n/2     σ√(2 ln 7)       θ + σ√(2 ln(7/6))     81.6    Dual CMOS
4    Eight      (9−i)/9, i/9, 1 ≤ i ≤ n/2     σ√(2 ln(9/4))    θ + σ√(2 ln(9/5))     82.15   CMOS
Details of when the original OS-based criteria and when the Dual criteria are
used, are found in [4]. These are omitted here in the interest of space.
The pdf of the Gamma distribution is (1/(Γ(a) b^a)) x^{a−1} e^{−x/b}; a > 0, b > 0, with mean ab and variance ab², where a and b are the parameters. Unfortunately, the cumulative distribution function does not have a closed form expression [5,6,7].
Theoretical Analysis: Gamma Distribution. The typical PR problem in-
voking the Gamma distribution would consider two classes ω1 and ω2 where the
class ω2 is displaced by a quantity θ, and in the case analogous to the ones we
have analyzed, the values of the scale and shape parameters are identical. We
consider the scenario when a1 = a2 = a and b1 = b2 = b. Thus, we consider the
distributions: f (x, 2, 1) = xe−x and f (x − θ, 2, 1) = (x − θ)e−(x−θ) .
We first derive the moments of the 2-OS, which are the points of interest for
CMOS, for the Gamma distribution. Let u1 be the point for the percentile 2/3 of the first distribution, and u2 be the point for the percentile 1/3 of the second distribution. Then, ∫₀^{u1} x e^{−x} dx = 2/3 ⟹ ln(u1) − 2u1 = ln(1/3), and ln(u2 − θ) − 2(u2 − θ) = ln(1/3) − ln(θ). The following results hold for the Gamma distribution.
Theorem 4. For the 2-class problem in which the two class conditional dis-
tributions are Gamma and identical, the accuracy obtained by CMOS deviates
from the accuracy attained by the classifier with regard to the distance from the
corresponding medians as 1.7391 + θ/2 deviates from 1.6783 + θ/2.
Table 3. A comparison of the accuracy with respect to the median and the 2-OS
CMOS classifier for the Gamma Distribution
Theorem 5. For the 2-class problem in which the two class conditional distri-
butions are Gamma and identical, a near-optimal Bayesian classification can be
achieved by using certain symmetric pairs of the n-OS, i.e., the (n − k)th OS for
ω1 (represented as u1 ) and the k th OS for ω2 (represented as u2 ) if and only if
u1 < u2 .
Table 4. A comparison of the k-OS CMOS classifier when compared to the Bayes’
classifier and the classifier with respect to median and mean for the Gamma Distri-
bution for different values of n. In each column, the value which is near-optimal is
rendered bold.
Theorem 6. For the 2-class problem in which the two class conditional distri-
butions are Beta(α, β) (α > 1, β > 1) and identical with α = 2 and β = 5, the
accuracy obtained by CMOS deviates from the accuracy attained by the classifier
with regard to the distance from the corresponding medians as the areas under
the error curves deviate from the positions 0.26445 + θ/2 and 0.2689 + θ/2.
Proof. The proof of this theorem is omitted here, but can be found in [4]. □
Table 5. A comparison of the accuracy of the 2-OS CMOS classifier with the clas-
sification with respect to the medians for the Beta Distribution for different values
of θ
2
Any analysis will clearly have to involve specific values for α and β. The analyses for
other values of α and β will follow the same arguments and are not included here.
Theorem 7. For the 2-class problem in which the two class conditional dis-
tributions are Beta(α, β) (α > 1, β > 1) and identical with α = 2, β = 5, a
near-optimal classification can be achieved by using certain symmetric pairs of
the n-OS, i.e., the (n − k)th OS for ω1 (represented as o1 ) and the k th OS for
ω2 (represented as o2 ) if and only if o1 < o2 . If o1 > o2 , the CMOS classifier
uses the Dual condition, i.e., the k OS for ω1 and the n − k OS for ω2 .
□
Experimental Results: Beta Distribution (α > 1, β > 1) - k-OS. The CMOS method has been rigorously tested for certain symmetric pairs of the k-OS and for various values of n, and the test results are given in Table 6. From the table, we can see that CMOS attains a near-optimal Bayes' accuracy when o1 < o2. Also, we can see that the Dual CMOS has to be invoked if o1 > o2.
Table 6. A comparison of the k-OS CMOS classifier when compared to the classifier
with respect to means and medians for the Beta Distribution for different values of n.
The scenarios for the Dual condition are specified by “(D)”.
6 Conclusions
In this paper, we have shown that optimal classification for symmetric distributions and a near-optimal bound for asymmetric distributions can be attained by an "anti-Bayesian" approach, i.e., by working with a very few (sometimes as few as two) points distant from the mean. This scheme, referred to as CMOS,
Classification by Moments of Order Statistics, operates by using these points
determined by the Order Statistics of the distributions. In this paper, we have
proven the claim for some distributions within the exponential family, and the
theoretical results have been verified by rigorous experimental testing. Our re-
sults for classification using the OS are both pioneering and novel.
References
1. Thomas, A., Oommen, B.J.: Optimal “Anti-Bayesian” Parametric Pattern Clas-
sification Using Order Statistics Criteria. In: Alvarez, L., Mejail, M., Gomez, L.,
Jacobo, J. (eds.) CIARP 2012. LNCS, vol. 7441, pp. 1–13. Springer, Heidelberg
(2012)
2. Thomas, A., Oommen, B.J.: Optimal “Anti-Bayesian” Parametric Pattern Classi-
fication for the Exponential Family Using Order Statistics Criteria. In: Campilho,
A., Kamel, M. (eds.) ICIAR 2012, Part I. LNCS, vol. 7324, pp. 11–18. Springer,
Heidelberg (2012)
3. Thomas, A., Oommen, B.J.: The Fundamental Theory of Optimal “Anti-Bayesian”
Parametric Pattern Classification Using Order Statistics Criteria. Pattern Recogni-
tion 46, 376–388 (2013)
4. Oommen, B.J., Thomas, A.: Optimal Order Statistics-based “Anti-Bayesian”
Parametric Pattern Classification for the Exponential Family. Pattern Recognition
(accepted for publication, 2013)
5. Krishnaiah, P.R., Rizvi, M.H.: A Note on Moments of Gamma Order Statistics.
Technometrics 9, 315–318 (1967)
6. Tadikamalla, P.R.: An Approximation to the Moments and the Percentiles of
Gamma Order Statistics. Sankhya: The Indian Journal of Statistics 39, 372–381
(1977)
7. Young, D.H.: Moment Relations for Order Statistics of the Standardized Gamma
Distribution and the Inverse Multinomial Distribution. Biometrika 58, 637–640
(1971)
Optimizing Feature Selection through Binary
Charged System Search
Douglas Rodrigues1 , Luis A.M. Pereira1, Joao P. Papa1, Caio C.O. Ramos2 ,
Andre N. Souza3 , and Luciene P. Papa4
1 UNESP - Univ Estadual Paulista, Department of Computing, Bauru, Brazil
{markitovtr1,caioramos,lucienepapa}@gmail.com, [email protected]
2 UNESP - Univ Estadual Paulista, Depart. of Electrical Engineering, Bauru, Brazil
[email protected]
3 University of São Paulo, Polytechnic School, São Paulo, Brazil
4 Faculdade Sudoeste Paulista, Department of Health, Avaré, Brazil
1 Introduction
Feature selection is a challenging task which aims at selecting a subset of the features of a given dataset. Working only with relevant features can reduce the training time and improve the prediction performance of classifiers. A simple way to handle feature selection is to perform an exhaustive search, if the number of features is not too large. However, this problem is known to be NP-hard and the computational load may become intractable [1].
Recently, several works have employed meta-heuristic algorithms based on
biological behavior and physical systems to deal with feature selection as an
optimization problem. In such context, Kennedy and Eberhart [2] proposed a
binary version of the traditional Particle Swarm Optimization (PSO) algorithm
in order to handle binary optimization problems. Further, Firpi and Goodman [3]
The authors would like to thank FAPESP grants #2009/16206-1, #2011/14094-1
and #2012/14158-2, CAPES and also CNPq grant #303182/2011-3.
extended BPSO to the context of feature selection. Some years later, Rashedi et
al. [4] proposed a binary version of the Gravitational Search Algorithm (GSA)
called BGSA for feature selection, and Ramos et al. [5] presented their version of
the Harmony Search (HS) for the same purpose in the context of theft detection
in power distribution systems. Nakamura et al. [6] introduced their version of
Bat Algorithm (BA) for binary optimization problems, called BBA.
Kaveh and Talatahari [7] proposed an optimization algorithm called Charged
System Search (CSS), which is based on the interactions between electrically
charged particles. The idea is that the electric field of one particle generates an attracting or repelling force on the other particles. This interaction is defined by
physical principles such as Coulomb, Gauss and Newtonian laws. The authors
have shown interesting results of CSS when compared with some well-known
approaches, such as PSO and Genetic Algorithms.
In this paper, we propose a binary version of the Charged System Search for feature selection purposes, called BCSS, in which the search space is modeled as an m-cube, where m stands for the number of features. The main idea is to associate with each charged particle a set of binary coordinates that denote whether a feature will belong to the final set of features or not; the function to be maximized is the accuracy of a supervised classifier. As the quality of the solution is related to the number of charged particles, we need to evaluate each one of them by training a classifier with the features selected by that particle and then classifying an evaluating set. Thus, we need a fast and robust classifier, since we have one instance of it for each charged particle. As such, we opted to use the Optimum-Path Forest (OPF) classifier [8,9], which has been demonstrated to be as effective as Support Vector Machines, but faster to train.
The proposed algorithm has been compared with the Binary Bat Algorithm, Binary Gravitational Search Algorithm, Binary Harmony Search and Binary Particle Swarm Optimization on several public datasets, one of which is related to non-technical loss detection in power distribution systems. The remainder of the paper is organized as follows. In Section 2 we revisit the Charged
System Search approach and we present the proposed methodology for binary
optimization using CSS. The methodology and the experimental results are dis-
cussed in Sections 3 and 4, respectively. Finally, conclusions are stated in Sec-
tion 5.
Coulomb's law is a physics law that describes the interactions between electrically charged particles. Let a charge be a solid sphere with radius r and uniform volume charge density. The attraction force Fij between two spheres i and j with total charges qi and qj is defined by:
    Fij = ke qi qj / dij²,                (1)
where ke is a constant called the Coulomb constant and dij is the distance
between the charges.
Based on the aforementioned definition, Kaveh and Talatahari [7] have proposed a new metaheuristic algorithm called Charged System Search (CSS). In this algorithm, each Charged Particle (CP) in the system is affected by the electrical fields of the others, generating a resultant force over each CP, which is determined using electrostatics laws. The CP movement is determined using Newtonian mechanics laws. Kaveh and Talatahari [7] have summarized CSS through the following definitions:
– Definition 1: The charge magnitude qi of the ith CP is defined as:

    qi = (fit(i) − fitworst) / (fitbest − fitworst),                (2)
where fitbest and fitworst denote, respectively, the best and the worst fitness obtained so far by any particle. The distance dij between two CPs is given by the following equation:

    dij = ‖xi − xj‖ / ( ‖(xi + xj)/2 − xbest‖ + ε ),                (3)

in which xi, xj and xbest denote the positions of the ith, the jth and the best current CP, respectively, and ε is a small positive number used to avoid singularities.
– Definition 2: The initial position xij(0) and velocity vij(0) of each jth variable of the ith CP, with j = 1, 2, . . . , m, are given by:

    xij(0) = xi,min + θ · (xi,max − xi,min),                (4)

and

    vij(0) = 0,                                             (5)

where xi,max and xi,min represent the upper and lower bounds, respectively, and θ ∼ U(0, 1).
– Definition 3: For a maximization problem, the probability of each CP moving toward the other CPs is given as follows:

    pij = 1   if (fit(j) − fitworst)/(fit(i) − fit(j)) > θ  or  fit(i) > fit(j),
    pij = 0   otherwise.                                    (6)
The resultant force acting on the jth CP is then given by:

    Fj = qj Σ_{i, i≠j} ( (qi / r³) · dij · c1 + (qi / dij²) · c2 ) pij (xi − xj).                (8)
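A rough sketch of these quantities for a real-valued CP population is given below; the switch between the c1 and c2 terms (Equation (7) is not reproduced above) is assumed to depend on whether dij is smaller than the sphere radius r, and the fitness values, positions and best position are assumed given.

```python
# Rough sketch of the CSS quantities: charges (2), distances (3), moving
# probabilities (6) and resultant forces (8); the c1/c2 switch is an assumption.
import numpy as np

def css_forces(positions, fitness, x_best, r=1.0, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n = len(positions)
    fit_best, fit_worst = fitness.max(), fitness.min()
    q = (fitness - fit_worst) / (fit_best - fit_worst + eps)          # Eq. (2)
    F = np.zeros_like(positions, dtype=float)
    for j in range(n):
        for i in range(n):
            if i == j:
                continue
            sep = positions[i] - positions[j]
            d = np.linalg.norm(sep) / (np.linalg.norm(
                (positions[i] + positions[j]) / 2.0 - x_best) + eps)   # Eq. (3)
            denom = fitness[i] - fitness[j]
            ratio = (fitness[j] - fit_worst) / denom if abs(denom) > eps else np.inf
            p = 1.0 if (ratio > rng.random() or fitness[i] > fitness[j]) else 0.0  # Eq. (6)
            c1, c2 = (1.0, 0.0) if d < r else (0.0, 1.0)   # assumed role of Eq. (7)
            F[j] += q[j] * (q[i] / r**3 * d * c1
                            + q[i] / (d**2 + eps) * c2) * p * sep      # Eq. (8)
    return F
```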
3 Methodology
Suppose we have a fully labeled dataset Z = Z1 ∪ Z2 ∪ Z3 ∪ Z4, in which Z1, Z2, Z3 and Z4 stand for the training, learning, validating and test sets, respectively. The idea is to employ Z1 and Z2 to find the subset of features that maximizes the accuracy over Z2, this accuracy being the fitness function. Each agent (bat, CP, particle, etc.) is initialized with random binary positions and the original dataset is mapped to a new one containing the features selected in this first sampling. Further, the fitness of each agent is set to the recognition rate of a classifier over Z2 after training on Z1. As soon as an agent changes its position, a new training on Z1 followed by classification of Z2 needs to be performed. As the reader can see, such a formulation requires fast training and classification steps. This is the reason why we have employed the Optimum-Path Forest (OPF) classifier [8,9], since it is a non-parametric and very robust classifier.
However, in order to allow a fair comparison, we have added a third set to the experiments, called the validating set (Z3): the idea is to establish a threshold ranging from 10% to 90%, and for each value of this threshold we mark the features that were selected in at least that percentage of the runs (10 runs) of the learning process over Z1 and Z2, as described above. For instance, a threshold of 40% means we choose the features that were selected in at least 40% of the runs. For each threshold, we compute the fitness function over the validation set Z3 to evaluate the generalization capability of the selected solution. Thus, the final subset is the one that maximizes the curve over the range of threshold values, i.e., the features that maximize the accuracy over Z3. These selected features are then used to assess the accuracy over Z4. Notice that the fitness function employed in this paper is the accuracy measure proposed by Papa et al. [8], which is capable of handling unbalanced classes. Notice that we used 30% of the original dataset for Z1, 20% for Z2, 20% for Z3, and 40% for Z4. These percentages have been empirically chosen.
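The wrapper evaluation described above can be sketched as follows; a k-NN classifier and scikit-learn's balanced accuracy stand in for the OPF classifier and the accuracy measure of [8], purely for illustration, and the data split mirrors the Z1/Z2 naming of the text.

```python
# Sketch of the wrapper fitness: a binary mask selects columns, a classifier is
# trained on Z1 and scored on Z2; k-NN and balanced accuracy are stand-ins.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

def fitness(mask, X1, y1, X2, y2):
    if mask.sum() == 0:                       # empty subset gets zero fitness
        return 0.0
    cols = np.flatnonzero(mask)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X1[:, cols], y1)
    return balanced_accuracy_score(y2, clf.predict(X2[:, cols]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 3] > 0).astype(int)       # synthetic labels
X1, y1, X2, y2 = X[:100], y[:100], X[100:], y[100:]
mask = rng.integers(0, 2, size=20)            # one charged particle's position
print(fitness(mask, X1, y1, X2, y2))
```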
4 Experimental Results
4.1 Dataset
In this work we have employed four datasets, three of which have been obtained from the LibSVM repository1, while NTL refers to a dataset obtained from a Brazilian electrical power company, frequently used to detect thefts in power distribution systems. Table 1 presents all datasets and their main characteristics.
4.2 Experiments
In this section we discuss the experiments conducted in order to assess the ro-
bustness of BCSS against BBA, BGSA, BHS (Binary Harmony Search) and
1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
BPSO for feature selection. Table 2 presents the parameters employed for each
evolutionary-based technique. Notice for all techniques we employed 30 agents
with 100 iterations. These parameters have been empirically set.
Technique   Parameters
BBA         α = 0.9, γ = 0.9
BGSA        G0 = 100
BHS         HMCR = 0.9
BPSO        c1 = 2.0, c2 = 2.0, w = 0.7
Figure 1a shows the OPF accuracy curve for the Australian dataset, in which BBA, BCSS and BGSA achieve the maximum value of the fitness function, equal to 87.20%, with thresholds of 40%, 80% and 70%, respectively. Figure 1b displays the results for the Diabetes dataset. We can see that BBA, BCSS, BGSA and BPSO achieved the same effectiveness (around 61.5%), while BHS did not perform very well. Actually, although BHS selected the same number of features, its accuracy over Z4 was about 2.32% lower than that of the other approaches, as can be seen in Table 3, which displays the accuracy over Z4 and also the threshold over Z3 for all datasets.
Figures 2a and 2b display the curves over Z3 for the Vehicle and NTL datasets, respectively. From Figure 2a we can see that the maximum accuracy over Z3 was obtained by BHS with a threshold of 40%; with respect to the NTL dataset, BBA and the proposed BCSS achieved 95% with a threshold of 30%, while the remaining techniques needed a threshold of 50% to reach such accuracy.
Table 4 displays the mean execution times for all techniques. The fastest approach was BHS, followed by BPSO and BGSA. Although the proposed technique requires a considerable computational effort, it is not very different from BBA and BGSA. From Table 3, we can see that BCSS was, together with other techniques, the most accurate approach for the Australian, Diabetes and NTL datasets, and was the single most effective approach for the Vehicle dataset.
Fig. 1. OPF accuracy (%) over Z3 as a function of the threshold (10–90%) for the (a) Australian and (b) Diabetes datasets, comparing BBA, BCSS, BGSA, BHS and BPSO
Fig. 2. OPF accuracy (%) over Z3 as a function of the threshold (10–90%) for the (a) Vehicle and (b) NTL datasets, comparing BBA, BCSS, BGSA, BHS and BPSO
Table 3. Classification accuracy over Z4 with the best subset of features selected over
Z3
5 Conclusions
References
1. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
2. Kennedy, J., Eberhart, R.C.: A discrete binary version of the particle swarm algo-
rithm. In: IEEE International Conference on Systems, Man, and Cybernetics, vol. 5,
pp. 4104–4108 (1997)
3. Firpi, H.A., Goodman, E.: Swarmed feature selection. In: Proceedings of the 33rd
Applied Imagery Pattern Recognition Workshop, pp. 112–118. IEEE Computer
Society, Washington, DC (2004)
4. Rashedi, E., Nezamabadi-pour, H., Saryazdi, S.: BGSA: binary gravitational search
algorithm. Natural Computing 9, 727–745 (2010)
5. Ramos, C., Souza, A., Chiachia, G., Falcão, A., Papa, J.: A novel algorithm for
feature selection using harmony search and its application for non-technical losses
detection. Computers & Electrical Engineering 37(6), 886–894 (2011)
6. Nakamura, R.Y.M., Pereira, L.A.M., Costa, K.A., Rodrigues, D., Papa, J.P., Yang,
X.-S.: BBA: A binary bat algorithm for feature selection. In: Proceedings of the
XXV SIBGRAPI - Conference on Graphics, Patterns and Images (2012) (accepted
for publication)
7. Kaveh, A., Talatahari, S.: A novel heuristic optimization method: charged system
search. Acta Mechanica 213(3), 267–289 (2010)
8. Papa, J.P., Falcão, A.X., Suzuki, C.T.N.: Supervised pattern classification based
on optimum-path forest. International Journal of Imaging Systems and Technol-
ogy 19(2), 120–131 (2009)
9. Papa, J.P., Falcão, A.X., Albuquerque, V.H.C., Tavares, J.M.R.S.: Efficient
supervised optimum-path forest classification for large datasets. Pattern Recogni-
tion 45(1), 512–520 (2012)
Outlines of Objects Detection by Analogy
1 Introduction
Table 1. Illustrative contours located using a selection (four) of the 14 artificial pat-
terns. Note the increase of intensity around the located contours from left to right.
done with success. However, when IB comes from a different scene, the location of contour pixels cannot be done without losing many candidates. To locate all contour pixels, a set of constraints must be verified in the neighbourhoods N(p), N(q). To avoid this, a set of pairs of artificial patterns (IA, SA) is proposed instead of hand-drawn contours. The pattern IA is composed of a shape with intensity FA (foreground) and a background with intensity GA. The pattern SA is the same as IA, except that the contour is highlighted. The values of (GA, FA) are chosen so that, for any query pixel q, the values of (GB, FB), representing the average intensities of the N(q) regions, verify the required constraints. The set of patterns P1, P2, ..., P14 (see figure 2) are characterised by the values of GA, FA (background and foreground intensities):
(background and foreground intensities):
(0, 32), (0, 64), (0, 96), (0, 128), (0, 160), (0, 192), (0, 224), (64, 192), (64, 224),
(96, 224), (128, 224), (160, 224), (192, 224), (208, 240).
For each pattern (IA, SA) and for a query image IB, only the set of contour pixels q whose neighbouring pixel intensities in N(q) verify a defined constraint related to (IA, SA) will be localized. We thus obtain 14 contour images corresponding to the 14 patterns (see figures in Table 1).
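A small sketch of how such pattern pairs (IA, SA) could be generated is given below; the square foreground shape, the pattern size and the highlight value 255 are assumptions made for illustration only.

```python
# Sketch generating one artificial pattern pair (IA, SA) per (GA, FA) value:
# a centred square of intensity FA on a GA background, with the square's
# contour highlighted in SA.
import numpy as np

PAIRS = [(0, 32), (0, 64), (0, 96), (0, 128), (0, 160), (0, 192), (0, 224),
         (64, 192), (64, 224), (96, 224), (128, 224), (160, 224),
         (192, 224), (208, 240)]

def make_pattern(GA, FA, size=64, half=16, highlight=255):
    IA = np.full((size, size), GA, dtype=np.uint8)
    lo, hi = size // 2 - half, size // 2 + half
    IA[lo:hi, lo:hi] = FA                     # foreground shape
    SA = IA.copy()
    SA[lo:hi, [lo, hi - 1]] = highlight       # vertical contour sides
    SA[[lo, hi - 1], lo:hi] = highlight       # horizontal contour sides
    return IA, SA

patterns = [make_pattern(GA, FA) for GA, FA in PAIRS]
```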
Fig. 3. Possible values of FB in the case where q is detected using only one pattern
4 Results
We present in this section results obtained by applying our method to real images from the BSD [4]. Firstly, we illustrate in Table 3 the evolution of the contours located using the artificial patterns. We can see that the contour located using P7, P8, P9 is steady around the object boundary, except in the central left part, where the contour moves fast (3 pixels from one pattern to another). For the patterns P10, P11, P12, contours move fast from one pattern to another due to the absence of an object boundary.
We applied our method using different values of the energy, defined as the number of times the contour is steady or moves slowly. Increasing the energy value produces the most significant contours, corresponding to high
Table 4. (Left to right): original image, located outlines with energy equal to 3 and 4
are similar to those of Arbelaez et al. [4]. For high Recall values, our Precision is better and the difference reaches 20%. However, for low Recall values, our Precision values are close to those of Arbelaez et al. [4]; the difference is around 3%.
References
1. Alpert, S., Galun, M., Basri, R., Brandt, A.: Image Segmentation by Probabilistic
Bottom-Up Aggregation and Cue Integration. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (June 2007)
2. Ashikhmin, M.: Fast texture transfer. IEEE Computer Graphics and Applica-
tions 23(4), 38–43 (2003)
3. Alpert, S., Galun, M., Basri, R., Brandt, A.: Image Segmentation by Probabilistic
Bottom-Up Aggregation and Cue Integration, In. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (2007)
4. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierar-
chical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence 33(5), 898–916 (2011)
5. Cheng, H.D., Jiang, X.H., Sun, Y., Wang, J.L.: Color image segmentation: advances
and prospects. Pattern Recognition 34, 2259–2281 (2001)
6. Cheng, L., Vishwanathan, S.V.N., Zhang, X.: Consistent image analogies using
semi-supervised learning. In: IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2008 (2008)
7. De Winter, J., Wagemans, J.: Segmentation of object outlines into parts: A large-
scale integrative study. Cognition 99, 275–325 (2006)
8. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low-Level Vision. In-
ternational Journal of Computer Vision 40(1) (2000)
9. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Seitz, S.M.: Image analogies.
In: SIGGRAPH Conference Proceedings, pp. 327–340 (2001)
10. Hertzmann, A., Oliver, N., Curless, B., Seitz, S.M.: Curve analogies. In: Proc. 13th
Eurographics Workshop on Rendering, Pisa, Italy, pp. 233–245 (2002)
11. Lackey, J.B., Colagrosso, M.D.: Supervised segmentation of visible human data
with image analogies. In: Proceedings of the International Conference on Machine
Learning; Models, Technologies and Applications (2004)
12. Larabi, S., Robertson, N.M.: Contour detection by image analogies. In: Bebis, G.,
Boyle, R., Parvin, B., Koracin, D., Fowlkes, C., Wang, S., Choi, M.-H., Mantler, S.,
Schulze, J., Acevedo, D., Mueller, K., Papka, M. (eds.) ISVC 2012, Part II. LNCS,
vol. 7432, pp. 430–439. Springer, Heidelberg (2012)
13. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented
Natural Images and its Application to Evaluating Segmentation Algorithms and
Measuring Ecological Statistics. In: Proc. 8th Int’l Conf. Computer Vision (2001)
14. Sykora, D., Burianek, J., Zara, J.: Unsupervised colorization of black-and-white
cartoons. In: Proceedings of the 3rd Int. Symp. Non-photorealistic Animation and
Rendering, pp. 121–127 (2004)
15. Wang, G., Wong, T., Heng, P.: Deringing cartoons by image analogies. ACM Trans-
actions on Graphics 25(4), 1360–1379 (2006)
16. Zhanga, H., Frittsb, J.E., Goldmana, S.A.: Image segmentation evaluation: A sur-
vey of unsupervised methods. Computer Vision and Image Understanding 110(2),
260–280 (2008)
PaTHOS: Part-Based Tree Hierarchy for Object
Segmentation
1 Introduction
more challenging. We are also aware that a single photo may be insufficient to identify the species, let alone the variety, of a flower. From a botanical point of view, plant recognition should furthermore take into account plant morphology, i.e., features based on the appearance or external form of a plant. The study of the vegetative parts (roots, stems and leaves) as well as the reproductive parts (inflorescences, flowers, fruits and seeds) is crucial for plant variety identification.
The purpose of this paper is object segmentation at multiple levels of detail. A hierarchical aggregation describes an image in terms of its constituent objects; thus, in contrast with classical approaches, the analysis of plant morphology becomes accessible. In particular, we are interested in plant recognition tasks based on images of flowers and/or inflorescences [14]; therefore an accurate segmentation is an essential step.
The remainder of the paper is organized as follows: Section 2 presents related work, Section 3 describes our hierarchical approach for natural image segmentation, and relevant object selection is presented in Section 4. Experimental results are shown in Section 5, followed by conclusions and future work in Section 6.
2 Related Work
A wide variety of image segmentation methods have recently been proposed, focusing on particular object types (cars, birds, horses, plants, etc.). In the field of flower segmentation, authors explore background/foreground separation techniques [1,12], or combine them with geometrical models [2,10] and superpixel segmentation [11,3,13].
Bottom-up methods use uniformity conditions to form image segments, which are merged together according to homogeneity criteria such as similar color properties, spatial structure, texture, etc. The object results from the aggregation of several components; for example, the components of a plant can be the flower, the leaves and the stem. This approach thus offers multiple levels of detail, according to the level we would like to study. In [5] the authors propose an image segmentation technique based on a high-performance contour detector: an oriented watershed procedure creates regions from the oriented contour signal, and image segmentation is achieved by agglomerative clustering with a method that transforms contours into a hierarchy of regions. In [9], hierarchical grouping is used for object localization: segments are generated using [4], while a greedy algorithm merges similar regions based on size and appearance features. In [7] and [6] a novel multiscale image segmentation method is introduced: the ramp transform detects ramp discontinuities and seeds for all regions, while a region-growing technique creates the desired segmented parts, which can be organized as a tree representation. Segmentation is independent of object properties, parameters and initialization; however, the region-growing technique may produce false associations.
We focused on hierarchical approaches since our purpose is plant morphology
analysis (sepals, petals, stamens, etc.). It will be an intermediate step towards
object semantics: a complex object having several regions of different colors will
be represented as a more complex hierarchical structure than a uniform color
region which will be represented as a single node in the hierarchy.
Fig. 1. The diagram of the proposed method describing the main steps
Algorithm 1. ”PaTHOS”
Require: Image I; number of bins for each color channel BinCount.
Ensure: Object hierarchy H.
1: O ← ∅
2: for color = 0 → I.channels do
3: for i = 0 → BinCount − 1 do
4: binaryImg ← new Image
5: for all pixel ∈ I do
6: if (0 ≤ pixel[color] < (i + 1) ∗ 255/BinCount) then
7: binaryImg[pixel] = 1
8: else
9: binaryImg[pixel] = 0
10: end if
11: end for
12: O ← O ∪ segmentation(binaryImg)
13: end for
14: end for
15: H ← build tree(O)
16: return H
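A compact sketch of Algorithm 1 under assumptions is given below: connected-component labelling plays the role of the segmentation step, and the segments are returned as a flat list, leaving the construction of the parent-child hierarchy (the build_tree step) out of the sketch.

```python
# Sketch of Algorithm 1: per-channel incremental thresholding with BinCount
# bins; connected-component labelling stands in for the segmentation step.
import numpy as np
from scipy import ndimage

def pathos_segments(img, bin_count=8):
    """img: H x W x C array with intensities in [0, 255]; returns binary masks."""
    segments = []
    for color in range(img.shape[2]):
        channel = img[..., color]
        for i in range(bin_count):
            binary = channel < (i + 1) * 255.0 / bin_count
            labels, n = ndimage.label(binary)
            segments.extend(labels == k for k in range(1, n + 1))
    return segments
```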
5 Experimental Results
5.1 Segmentation
Fig. 2. Flower segmentation results: top row - original images; bottom row - our seg-
mentations. Note that for images containing multiple objects, we obtain each object
separately unless a spatial correlation exists.
Tests were performed on Oxford Flowers 17 [8], with 848 images for which ground-truth segmentations are available. Applying the proposed segmentation method, we obtain 5958 segments (one or several segments organized in a hierarchy per image). Subsequently, the decision tree indicates 2089 correct segments (GoodSegmentation = "Yes"), corresponding to the relevant objects (here, flowers). In order to estimate the quality of our segmentation, we compare our results to the contour-based tree segmentation approach presented in [6]. As their method produces several results without indicating the correct segmentation, the user has to indicate the one corresponding to the object of interest; we chose their best result for each image according to the maximum value of the accuracy compared to the ground truth. Table 1 presents the average Hausdorff distance between the segmented objects and the ground truth; the best segmentation is achieved in the case of the minimum distance. Table 1 shows a small advantage of our method compared to [6].
Fig. 4. Results of the relevant object choice (first row - original and ground-truth image; second row - correct and incorrect segmentation after classification)
Future work includes the development of new features based on the hierarchy
that may complement classical ones in the process of object selection. Our target
is to employ the presented segmentation approach in plant recognition tasks
extending our research towards flower variety identification.
References
1. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive Foreground Extrac-
tion using Iterated Graph Cuts. ACM Transactions on Graphics 23, 309–314 (2004)
2. Nilsback, M.-E., Zisserman, A.: Delving Deeper into the Whorl of Flower Segmen-
tation. Image and Vision Computing 28(6), 1049–1062 (2010)
3. Chai, Y., Lempitsky, V., Zisserman, A.: BiCoS: A Bi-level Co-Segmentation
Method for Image Classification. In: IEEE International Conference on Computer
Vision, pp. 2579–2586 (2011)
4. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient Graph-Based Image Segmenta-
tion. International Journal on Computer Vision 59(2), 167–181 (2004)
5. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierar-
chical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence 33(5), 898–916 (2011)
6. Akbas, E., Ahuja, N.: From Ramp Discontinuities to Segmentation Tree. In: Zha,
H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009, Part I. LNCS, vol. 5994, pp.
123–134. Springer, Heidelberg (2010)
7. Todorovic, S., Ahuja, N.: Unsupervised Category Modeling, Recognition, and Seg-
mentation in Images. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 30(12), 2158–2174 (2008)
8. Oxford Flowers 17 (2011), http://www.robots.ox.ac.uk/~vgg/data/bicos/
9. van de Sande, K., Uijlings, J., Gevers, T., Smeulders, A.: Segmentation as Selective
Search for Object Recognition. In: IEEE International Conference on Computer
Vision, pp. 1879–1886 (2011)
10. Cerutti, G., Tougne, L., Vacavant, A., Coquin, D.: A Parametric Active Polygon for
Leaf Segmentation and Shape Estimation. In: International Symposium on Visual
Computing, pp. 202–213 (2011)
11. Angelova, A., Zhu, S., Lin, Y.: Image segmentation for large-scale subcategory
flower recognition. In: IEEE Workshop on the Applications of Computer Vision,
pp. 39–45 (2013)
12. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., John Kress, W., Lopez,
I.C., Soares, J.V.B.: Leafsnap: A computer vision system for automatic plant
species identification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid,
C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 502–516. Springer, Heidelberg
(2012)
13. Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L., Zisserman, A.: TriCoS: A
tri-level class-discriminative co-segmentation method for image classification. In:
Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012,
Part I. LNCS, vol. 7572, pp. 794–807. Springer, Heidelberg (2012)
14. Singh, G.: Plants Systematics: An Integrated Approach. Science Publishers (2004)
15. Folia (2011), http://liris.cnrs.fr/reves/index.php
16. Leafsnap (2011), http://leafsnap.com/
Tracking System with Re-identification
Using a Graph Kernels Approach
1 Introduction
Re-identification is a recent field of study in pattern recognition. Its purpose is to identify an object or person coming back into the field of view of a camera. Such a framework may be extended to the tracking of objects or persons over a network of cameras.
Methods dealing with the re-identification problem can be divided into two categories. A first group is based on building a unique signature for each object. The features used to describe signatures vary: regions, Haar-like features, interest points [1], [2]. The second group of methods [3], [4] does not use a single signature for the object; the latter is instead represented by a set of signatures. Thus, the comparison between objects takes place between two sets of signatures rather than between two individual signatures.
The basic idea of our work starts from the consideration that there are few
works that exploit relationships between the visual features of an object. Fur-
thermore, our work combines both approaches by describing a person both with
a global descriptor over several frames and a set of representative frames. More
precisely, the principle of our approach is to represent each occurrence of a per-
son at time t by a graph representation called a t-prototype (Section 2). A kernel
2 T-Prototype Construction
The first step of our method consists in separating subjects from the background. To that end, we use binary object masks [5] defined by a foreground detection with shadow removal. Each moving person within a frame is thus associated with a mask that we characterize using SIFT key-point detectors. Such key points provide a fine local characterization of the image inside the mask which is robust against usual image transformations such as scale changes and rotations. Each key point is represented by its x and y coordinates, scale, orientation and 128 numbers (the descriptors) per color channel. In order to contextualize the information encoded by the SIFT points, we encode them by a mutual k-nearest-neighbor graph G = (V, E, w), where V corresponds to the set of SIFT points, E to the set of edges, and w is a weight function over V defined as the scale of appearance of the corresponding vertex. The set of edges E is defined from the key-point coordinates x and y: an edge (v, v′) belongs to E if v′ belongs to the k nearest neighbors of v while v belongs to the k nearest neighbors of v′. The degree of each vertex is thus bounded by k. For a given vertex u, we take into account the local arrangement of its incident vertices by explicitly encoding the sequence of its neighbors encountered when turning counterclockwise around it. This neighborhood N(u) = (u1, ..., un) is thus defined as an ordered set of vertices. The first vertex of this sequence, u1, is arbitrarily chosen as the upper-right vertex.
right vertex. The set {N (u)}u∈V is called the bag of oriented neighborhoods
(BON). The node u is called the central node.
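A small sketch of this construction, assuming keypoint coordinates and scales are already extracted; the helper names and the rule used to pick the upper-right starting vertex are illustrative assumptions.

import numpy as np

def mutual_knn_graph(xy, k):
    """Mutual k-NN graph over SIFT keypoint coordinates (sketch).

    xy: (n, 2) keypoint coordinates; k: neighbourhood size.
    Returns an adjacency list; vertex degrees are bounded by k.
    """
    n = len(xy)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]                  # k nearest neighbours of each vertex
    knn_sets = [set(row) for row in knn]
    return [[v for v in knn[u] if u in knn_sets[v]] for u in range(n)]

def oriented_neighbourhood(u, adjacency, xy):
    """N(u): neighbours ordered counterclockwise around u, starting from an
    (assumed) upper-right vertex, as in the bag of oriented neighbourhoods."""
    neigh = adjacency[u]
    angles = np.array([np.arctan2(xy[v, 1] - xy[u, 1], xy[v, 0] - xy[u, 0]) for v in neigh])
    order = np.argsort(angles)                          # counterclockwise sweep
    # one way to pick the starting (upper-right) vertex: angle closest to pi/4
    start = int(np.argmin((angles[order] - np.pi / 4) % (2 * np.pi)))
    order = np.roll(order, -start)
    return [neigh[i] for i in order]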
of each graph by a finite bag of patterns. Such an approach consists in: i) defining the bag of patterns from each graph, ii) defining a minor kernel between patterns, and iii) convolving minor kernels into a major one in order to encode the similarity between bags. SIFT points being local detectors, we consider that the most relevant information of a t-prototype corresponds to the local oriented neighborhood of its vertices. We thus define the bag of patterns of a t-prototype as its
BON (section 2). The minor kernel between oriented neighborhoods is defined as follows:

K_{seq}(u, v) = \begin{cases} 0 & \text{if } |N(u)| \neq |N(v)| \\ \prod_{i=1}^{|N(u)|} K_g(u_i, v_i) & \text{otherwise} \end{cases}    (1)

where K_g(u, v) is an RBF kernel between features of input vertices defined by a tuning parameter \sigma and the Euclidean distance d(., .) between feature values: K_g(x, y) = e^{-\frac{d(\mu(x), \mu(y))}{\sigma}}.
Eq. (1) corresponds to the same basic idea as the heuristic used to compute the graph edit distance between two nodes [7], where the similarity between two nodes is enforced by a comparison of their neighborhoods.
Note that K_{seq}(., .) corresponds to a tensor product kernel and is hence positive definite. However, due to acquisition noise or small changes between two images, some SIFT points may be added or removed within the neighborhood of some vertices. Such an alteration of the neighborhood's cardinality may drastically change the similarity between key points. Indeed, according to equation (1), two points with different neighborhood cardinalities have a similarity equal to 0. Equation (1) thus induces a strong sensitivity to noise. In order to overcome
this drawback, we introduce a rewriting rule on oriented neighborhoods. Given a vertex v, the rewriting of its oriented neighborhood, denoted \kappa(v), is defined as \kappa(v) = (v_1, ..., \widehat{v_i}, ..., v_{l_v}), where v_i = \arg\min_{j \in \{1,...,l_v\}} w(v_j) is the neighbor of v with lowest weight and the hat denotes its removal.
This rewriting is iterated, leading to a sequence of oriented neighborhoods (\kappa^i(v))_{i \in \{0,...,D_v\}}, where D_v denotes the maximal number of rewritings. The cost of each rewriting is measured by the cumulative weight function CW defined by:

CW(v) = 0, \qquad CW(\kappa^i(v)) = w(v_i) + CW(\kappa^{i-1}(v))    (2)

where v_i is the vertex removed between \kappa^{i-1}(v) and \kappa^i(v).
Kernel between Oriented Neighborhoods: Our kernel between two oriented neighborhoods is defined as a convolution kernel between the sequences of rewritings of each neighborhood, each rewriting being weighted by its cumulative cost:

K_{rewriting}(u, v) = \sum_{i=1}^{D_u} \sum_{j=1}^{D_v} K_W(\kappa^i(u), \kappa^j(v)) \cdot K_{seq}(\kappa^i(u), \kappa^j(v))    (3)

The weighting function \varphi encodes the relevance of each vertex and is defined as an increasing function of the weight: \varphi(u) = e^{-\frac{1}{\sigma(1+w(u))}}.
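A sketch of Eqs. (1)-(3) in Python, assuming each vertex has a descriptor (for K_g) and a scale-based weight w; since the exact form of the weighting term K_W is not fully specified in this excerpt, it is approximated here by an exponential of the summed cumulative rewriting costs, which is an assumption.

import numpy as np

def k_g(feat_u, feat_v, sigma):
    """RBF minor kernel between vertex features."""
    return np.exp(-np.linalg.norm(feat_u - feat_v) / sigma)

def k_seq(neigh_u, neigh_v, features, sigma):
    """Eq. (1): product over aligned oriented neighbourhoods, zero when the
    two neighbourhoods have different cardinalities."""
    if len(neigh_u) != len(neigh_v):
        return 0.0
    return float(np.prod([k_g(features[a], features[b], sigma)
                          for a, b in zip(neigh_u, neigh_v)]))

def rewritings(neigh, weights):
    """Sequence kappa^0, kappa^1, ... obtained by repeatedly removing the
    lowest-weight neighbour, together with the cumulative cost CW of Eq. (2)."""
    seq, cw, current, cost = [], [], list(neigh), 0.0
    while current:
        seq.append(list(current)); cw.append(cost)
        i = int(np.argmin([weights[v] for v in current]))
        cost += weights[current[i]]
        current.pop(i)
    return seq, cw

def k_rewriting(neigh_u, neigh_v, features, weights, sigma):
    """Eq. (3) sketch, including the unrewritten neighbourhoods; the weighting
    K_W is approximated (assumption) from the cumulative costs."""
    seq_u, cw_u = rewritings(neigh_u, weights)
    seq_v, cw_v = rewritings(neigh_v, weights)
    total = 0.0
    for su, cu in zip(seq_u, cw_u):
        for sv, cv in zip(seq_v, cw_v):
            total += np.exp(-(cu + cv) / sigma) * k_seq(su, sv, features, sigma)
    return total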
4 People Description
The identification of a person by a single t-prototype is subject to errors due to slight changes of pose or errors in the location of SIFT points. Assuming that the appearance of a person remains stable over a set of successive frames, we describe a person at instant t by the set of its t-prototypes computed on its HTW window. The description of a person by a set of t-prototypes provides an implicit definition of the mean appearance of this person over HTW. Let H denote the Hilbert space defined by Kgraph (equation 5). In order to get an explicit representation of this mean appearance, we first use Kgraph to project the mapping of all t-prototypes onto the unit sphere of H. This operation is performed by normalizing our kernel [8]. Following [8], we then apply a one-class ν-SVM to each set of t-prototypes describing a person. From a geometrical point of view, this operation is equivalent to modeling the set of projected t-prototypes by a spherical cap defined by a weight vector w and an offset ρ, both provided by the ν-SVM algorithm. These two parameters define the hyperplane whose intersection with the unit sphere defines the spherical cap. T-prototypes whose projection on the unit sphere lies outside the spherical cap are considered as outliers. Each person is thus encoded by a triplet (w, ρ, S), where S corresponds to the set of t-prototypes and (w, ρ) are defined from a one-class ν-SVM. The parameter w indicates the center of the spherical cap and may be intuitively understood as the vector encoding the mean appearance of a person over its HTW window. The parameter ρ influences the radius of the spherical cap and may be understood as the extent of the set of representative t-prototypes in S.
d_{sphere}(w_A, w_B) = \arccos\left( \frac{w_A^T K_{A,B}\, w_B}{\|w_A\| \, \|w_B\|} \right), where \|w_A\| and \|w_B\| denote the norms of w_A and w_B in H and K_{A,B} is a |S_A| \times |S_B| matrix defined by K_{A,B} = (K_{norm}(t, t'))_{(t, t') \in S_A \times S_B}, where K_{norm} denotes our normalized kernel. Based on d_{sphere}, the kernel between A and B is defined as the following product of RBF kernels:

K_{change}(P_A, P_B) = e^{-\frac{d^2_{sphere}(w_A, w_B)}{2\sigma^2_{moy}}} \; e^{-\frac{(\rho_A - \rho_B)^2}{2\sigma^2_{origin}}}    (6)
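A sketch of d_sphere and K_change, assuming each one-class ν-SVM model is available through its dual coefficients over its t-prototype set (so that w_A^T K_{A,B} w_B and the norms can be evaluated with the normalized kernel); all variable names are illustrative.

import numpy as np

def d_sphere(alpha_a, alpha_b, K_aa, K_bb, K_ab):
    """Spherical-cap distance between two one-class nu-SVM models.

    alpha_a, alpha_b: dual coefficients (w = sum_i alpha_i phi(t_i));
    K_aa, K_bb, K_ab: Gram matrices of the normalized kernel within and
    across the two t-prototype sets S_A and S_B.
    """
    norm_a = np.sqrt(alpha_a @ K_aa @ alpha_a)
    norm_b = np.sqrt(alpha_b @ K_bb @ alpha_b)
    cos_ab = (alpha_a @ K_ab @ alpha_b) / (norm_a * norm_b)
    return np.arccos(np.clip(cos_ab, -1.0, 1.0))

def k_change(alpha_a, rho_a, alpha_b, rho_b, K_aa, K_bb, K_ab,
             sigma_moy, sigma_origin):
    """Eq. (6): product of RBF kernels on the cap-centre distance and offsets."""
    d = d_sphere(alpha_a, alpha_b, K_aa, K_bb, K_ab)
    return (np.exp(-d ** 2 / (2.0 * sigma_moy ** 2))
            * np.exp(-(rho_a - rho_b) ** 2 / (2.0 * sigma_origin ** 2)))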
5 Tracking System
Our tracking algorithm uses four labels ‘new’, ‘get out’, ‘unknown’ and ‘get back’
with the following meaning: new refers to an object classified as new, get-out
represents an object leaving the scene, unknown describes a query object (an
object recently appeared, not yet classified) and get-back refers to an object
classified as an old one.
Unlike our previous work [5], where we used a training data set to model each object and re-identification was triggered by a graph edit distance, in this paper we use online learning, and re-identification is performed using the similarity (eq. 6) between each unknown person and all the get out
persons. The general architecture of our system is shown in Figure 1. All masks
detected in the first frame of a video are considered as new persons. Then a
mask detected in frame t + 1 is considered as matched if there is a sufficient overlap between its bounding box and a single mask's bounding box defined in frame t. In this case, the mask is assigned to the same person as in frame t and its graph of SIFT points is added to the sliding HTW window containing the last graphs of this person. If a mask defined at frame t does not have any successor in frame t + 1, the associated person is marked as get out and its triplet P = (w, ρ, S) (Section 4), computed over the last |HTW| frames, is stored in an output object database denoted DBS. In the case of a person corresponding to an unmatched mask in frame t + 1, the unmatched person
is initially labeled as 'get in'. When a 'get in' person is detected, if there are no 'get out' persons we classify this 'get in' person immediately as new. This
‘get in’ person is then tracked along the video using the previously described
protocol. On the other hand, if there is at least one ‘get out’ person we should
delay the identification of this ‘get in’ person which is thus labeled as ‘unknown’.
This ‘unknown’ person is then tracked on |HT W | frames in order to obtain its
description by a triplet (w, ρ, S). Using this description we compute the value of
kernel Kchange (equation 6) between this unknown person and all get out persons
contained in our database. Similarities between the unknown person and get out
ones are sorted in decreasing order so that the first get out person of this list
corresponds to the best candidate for re-identification. Our criterion to map an unknown person to a get out one, and thus to classify it as get back, is based both on a threshold on the maximum similarity value maxker and a threshold on the standard deviation σker of the list of similarities. This criterion, called SC, is defined as maxker > th1 and σker > th2, where th1 and th2 are experimentally fixed thresholds. Note that SC is reduced to a fixed threshold on maxker when the set of get out persons is reduced to two elements. An unknown person whose SC criterion is false is labeled as a new person. Both new and get back persons
are tracked between frames until they get out from the video and reach the
get out state.
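A compact sketch of the SC decision described above; th1 and th2 stand for the experimentally fixed thresholds, and the special case where only two get out persons are available (fixed threshold on maxker alone) is omitted for brevity.

import numpy as np

def classify_unknown(similarities, th1, th2):
    """SC criterion over the similarities between an unknown person and all
    'get out' persons; returns the label assigned to the unknown person."""
    similarities = np.sort(np.asarray(similarities))[::-1]   # decreasing order
    max_ker, sigma_ker = similarities[0], similarities.std()
    if max_ker > th1 and sigma_ker > th2:
        return "get back"                                     # best candidate is re-identified
    return "new"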
Classically, any tracking algorithm has to deal with many difficulties such as
occlusions. The type of occlusions examined in this paper is limited to the case
where bounding boxes overlap. An occlusion is detected when the spatial overlap
between two bounding boxes is greater than an experimentally fixed threshold
while each individual box remains detected. If for a given object an occlusion
is detected, the description of this object is compromised. Thus a compromised
object is only tracked and its triplet (w, ρ, S) is neither updated nor stored
in DBS . At identification time, the model of the unknown person is matched
against each get-out person from DBS .
6 Experiments
The proposed algorithm has been tested on v01, v05, v04 and v06 video sequences
of the PETS’09 S2L1 [9] dataset. Each sequence contains multiple persons. To
compare our framework with previous work, we use the well-known metrics Sequence Frame Detection Accuracy (SFDA), Multiple Object Detection Accuracy (MODA) and Multiple Object Tracking Accuracy (MOTA) described in [11]. Note that such measures do not take into account the fact that the identification of a person may be delayed. Since our method identifies a person only after HTW frames, we decided not to take into account persons with an unknown status in the MODA and MOTA measures until these persons are identified as get back or new (Section 5).
In our first experiment we evaluated how different values of the HTW length may affect the re-identification accuracy. The obtained results show that v01 and v05 perform at peak efficiency for HTW=35, while v04 and v06 attain their optimum at HTW=20.
(Figure: per-view curves plotted against rank (2-20) for sequences v01, v05, v04 and v06.)

View   MODA of [10]   MODA     MOTA     SFDA
v01    0.67           0.91     0.91     0.90
v05    0.72           0.75     0.75     0.80
v04    0.61           0.2799   0.2790   0.47
v06    0.75           0.506    0.505    0.64
7 Conclusion
In this paper, we presented a new people re-identification approach based on
graph kernels. Our graph kernel between SIFT points includes rewriting rules on oriented neighborhoods in order to reduce the lack of stability of the key-point detection methods. Furthermore, each person in the video is defined by a set
of graphs with a similarity measure between sets which removes outliers. Our tracking system is based on a simple matching criterion to follow one person along a video. The person descriptions and the kernel between these descriptions are used to remove ambiguities when a person reappears in the video. Such a system may easily be extended to follow one person over a network of cameras. People are prone to occlusions by others nearby; however, a re-identification algorithm for an individual person is not suitable for solving the group cases. A further study with more focus on groups is therefore suggested.
References
1. Hamdoun, O., Moutarde, F., Stanciulescu, B., Steux, B.: Person re-identification
in multi-camera system by signature based on interest point descriptors collected
on short video sequences. In: ICDSC 2008, pp. 1–6 (2008)
2. Ijiri, Y., Lao, S., Han, T.X., Murase, H.: Human Re-identification through Distance
Metric Learning based on Jensen-Shannon Kernel. In: VISAPP 2012, pp. 603–612
(2012)
3. Truong Cong, D.-N., Khoudour, L., Achard, C., Meurie, C., Lezoray, O.: People
re-identification by spectral classification of silhouettes. International Journal of
Signal Processing 90, 2362–2374 (2010)
4. Zhao, S., Precioso, F., Cord, M.: Spatio-Temporal Tube data representation and
Kernel design for SVM-based video object retrieval system. Multimedia Tools
Appl. (55), 105–125 (2011)
5. Brun, L., Conte, D., Foggia, P., Vento, M.: People Re-identification by Graph
Kernels Methods. In: Jiang, X., Ferrer, M., Torsello, A. (eds.) GbRPR 2011. LNCS,
vol. 6658, pp. 285–294. Springer, Heidelberg (2011)
6. Mahboubi, A., Brun, L., Dupé, F.-X.: Object Classification Based on Graph Ker-
nels. In: HPCS-PAR, pp. 385–389 (2010)
7. Fankhauser, S., Riesen, K., Bunke, H.: Speeding up Graph Edit Distance Com-
putation through Fast Bipartite Matching. In: Jiang, X., Ferrer, M., Torsello, A.
(eds.) GbRPR 2011. LNCS, vol. 6658, pp. 102–111. Springer, Heidelberg (2011)
8. Desobry, F., Davy, M., Doncarli, C.: An Online Kernel Change Detection Algo-
rithm. IEEE Transaction on Signal Processing 53, 2961–2974 (2005)
9. Ellis, A., Shahrokni, A., Ferryman, J.: PETS 2009 and Winter PETS 2009 Results,
a Combined Evaluation. In: 12th IEEE Int. Work. on Performance Evaluation of
Tracking and Surveillance, pp. 1–8 (2009)
10. Berclaz, J., Shahrokni, A., Fleuret, F., Freyman, J.M., Fua, P.: Evaluation of prob-
abilistic occupancy map people detection for surveillance systems. In: 11th IEEE
Int. Work. on Performance Evaluation of Tracking and Surveillance, pp. 55–62
(2009)
11. Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers,
R., Boonstra, M., Korzhova, V., Zhang, J.: Framework for performance evaluation
of face, text, and vehicule detection and tracking in video: Data, metrics, and
protocol. IEEE Transaction on Pattern Analysis and Machine Intelligence 31(2),
319–336 (2009)
Recognizing Human-Object Interactions
Using Sparse Subspace Clustering
1 Introduction
Recognizing human-object interactions from videos is a hard problem that has been receiving renewed attention from the computer-vision community. The problem's complexity comes from the large degree of variation present in both the appearance of objects and the many ways people interact with them. Current solutions differ mostly in terms of the type of input data used by algorithms, which ranges from low-level features such as optical flow to human-centered features such as spatio-temporal volumes.
The state-of-the-art is represented by the weakly supervised method described
by Prest et al. [1], which combines a part-based human detector, tracking by detection, and classification into a single framework. This method reports the best
results on most datasets. Other representative solutions include the work of
Gupta et al. [2], which uses a histogram-based model to account for appearance
information, and trajectories for representing motion. Their classification ap-
proach is based on a Bayesian network. They introduce interaction features such
as time of the object grasp, interaction start, and interaction stop, which are
learned from velocity profiles. Motion trajectories were also used by Filipovych
and Ribeiro [3] for recognizing interactions by matching trajectories of hand
motion using a robust sequence-alignment method.
2 Our Method
We commenced by annotating the videos from Gupta et al. [2]. Currently, video datasets of human-object interactions are either not publicly available, such as the Coffee and Cigarettes dataset used by Laptev and Pérez [5] and by Prest et al. [1], or are unannotated, as in the case of the dataset provided by Gupta et al. [2]. Videos in Gupta et al. are short sequences (i.e., 3-10 seconds long) of a single person performing interactions such as drinking from a cup, answering the phone, making a phone call, spraying, pouring from a cup, and lighting a torch. Our
annotation is as follows: on every 3rd frame of the videos, we extract the position
(i.e., the centroid of the bounding box) of the left hand, torso, head, right hand,
and the object associated with the interaction. Figure 1 shows samples of these
trajectories superimposed on frames from the input videos.
In our interaction-classification method, each video is represented by five trajectories that we extracted manually by linearly interpolating between the previously annotated keyframes. These trajectories are termed T^h, T^r, T^l, T^t, and T^o, for head, right hand, left hand, torso, and object, respectively. While
automatic trajectory extraction is indeed desirable, it is not the focus of this
paper, and we decided that assuming trajectory availability would suffice. The
\bar{x}^h = \frac{1}{N} \sum_{i=1}^{N} x_i^h.    (1)

Here, \bar{x}^h = (\bar{x}^h, \bar{y}^h)^T is a centroid location that we can use to flip trajectories horizontally and thus account for left-right hand symmetry. For each trajectory j in the video, the registered trajectory points are given by:

\hat{x}_i^j = \left( |x_i^j - \bar{x}^h|,\; y_i^j - \bar{y}^h \right)^T, \quad \forall i.    (2)
where p(.) returns the p-value. Now that we have determined the trajectory corresponding to the interacting hand, we represent the video by a feature vector f that stacks this registered interacting-hand trajectory on top of the object trajectory T^o (Eq. (4)).
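A sketch of the trajectory registration of Eqs. (1)-(2) and of assembling a per-video feature vector; stacking the registered interacting-hand and object trajectories, and the helper names, are assumptions made for illustration.

import numpy as np

def register_trajectories(T_head, T_hand, T_obj):
    """Register trajectories to the head centroid and fold out left/right symmetry.

    Each trajectory is an (N, 2) array of (x, y) positions at the annotated
    keyframes (linearly interpolated in between).
    """
    centroid = T_head.mean(axis=0)                          # Eq. (1)
    def register(T):
        return np.stack([np.abs(T[:, 0] - centroid[0]),     # Eq. (2): |x - x_bar|
                         T[:, 1] - centroid[1]], axis=1)    #          y - y_bar
    return register(T_hand), register(T_obj)

def video_feature(T_head, T_hand, T_obj):
    """Assumed composition of f: registered interacting-hand and object trajectories."""
    T_hand_r, T_obj_r = register_trajectories(T_head, T_hand, T_obj)
    return np.concatenate([T_hand_r.ravel(), T_obj_r.ravel()])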
Y = Y_{perfect} + Z.    (11)

Here, Z is the noise and Y_{perfect} is the clean data, which lies in the union of the subspaces. Thus, the problem becomes:

\min_{C, Z} \; \|C\|_1 + \frac{\lambda_z}{2} \|Z\|_F^2    (12)
\text{s.t.} \quad Y = YC + Z,    (13)
\qquad \; \mathrm{diag}(C) = 0,    (14)
where \|Z\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |z_{ij}|^2} and \lambda_z is a regularization parameter. For the parameter setting, Elhamifar and Vidal [4] suggest setting \lambda_z = \alpha_z / \mu_z, where \alpha_z > 1 and \mu_z is defined by \mu_z = \min_i \max_{j \neq i} |y_i^T y_j|.
After the coefficient matrix C is found, SSC clusters trajectories using spectral clustering. SSC gave state-of-the-art results on the Hopkins-155 dataset [10]. Combined with its nice theoretical properties [11] and the momentum gained by the success of sparse optimization problems [12], we believe that SSC can be useful in classifying human actions and interactions.
Let Y ∈ R^{m×n} be the data set. The Sparse Subspace Clustering algorithm seeks the sparsest representation of every y_i ∈ Y, i = 1, . . . , n, subject to c_{ii} = 0. Let c_{-i} be c_i without c_{ii}. Similarly, define Y_{-i} as Y without its i-th column. Then the problem can be cast as

\min \|c_{-i}\|_1 \quad \text{s.t.} \quad y_i = Y_{-i}\, c_{-i}.

The latter is a problem where we are trying to find the sparsest approximation of y_i using Y_{-i}, in the form y_i = \sum_j Y_{-i,j}\, c_{-i,j}. It can be shown that an SVM can be
reformulated to solve it [13]. On the other hand, Girosi [14] has shown that, for
noiseless data, the solution given by sparse approximation corresponds exactly
to the solution given by SVM. Moreover, non-zero coefficients of c−i correspond
to support vectors.
A limited number of support vectors corresponds to a bounded VC dimension of the dataset, which is known to govern generalization ability; the result, due to Vapnik [13], shows the connection between the generalization error and the number of support vectors.
This leads to the connection between sparsity and the ability to extract the
most important samples from the dataset, which in turn leads to the proper
partitioning of the samples via SSC.
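A sketch of the SSC pipeline described above, solving the self-expression column by column with a Lasso surrogate of Eqs. (12)-(14) and then applying spectral clustering to the symmetrized coefficient magnitudes; the scikit-learn solver and its parameterization are assumptions, not the CVX setup cited by the authors.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(Y, n_clusters, alpha=0.01):
    """Y: (m, n) data matrix with one trajectory feature vector per column."""
    m, n = Y.shape
    C = np.zeros((n, n))
    for i in range(n):
        # Sparse self-expression of column i by all other columns (c_ii = 0)
        idx = [j for j in range(n) if j != i]
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(Y[:, idx], Y[:, i])
        C[idx, i] = lasso.coef_
    # Symmetric affinity from coefficient magnitudes, then spectral clustering
    W = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels, C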
3 Experimental Evaluation
In this section, we experiment with the SSC algorithm in an unsupervised ap-
proach for actor-object interaction recognition based on the trajectory data. Our
experiments were designed with an emphasis on trying to answer the following
two questions: (i) Can human-object interactions be seen as body and object tra-
jectories that lie in a low-dimensional space? (ii) What is the role of interaction
localization (i.e., segmentation) in recognition?
Here, we run the SSC algorithm1 in two settings: (a) With complete trajec-
tories (i.e., from the first to the last frame of the input videos), and (b) With
trajectories corresponding only to interaction frames (i.e., frames where the in-
teraction starts and ends)2 . Confusion matrices resulting from these experiments
are given in Figures 2(a) and 2(b), and present classification rates of 74.1% and
81.48%, respectively. As a point of comparison, we note that Gupta et al. [2] report an accuracy of 93.34% while Prest et al. [1] report an average classification rate of 93%. However, in addition to being completely supervised, these approaches use additional data to train a HOG-based object detector, while our method is unsupervised given the trajectories and their parameters. As reported in Gupta et al. [2], interactions such as lighting and pouring, or dialing and lighting, have similar trajectories and thus can hardly be distinguished by motion cues alone. This behavior is also observed in our results. The test videos and demo code for our method are available online.
Our results suggest that segmentation of interaction trajectories can be seen
as a special case of motion segmentation, and consequently, the space of such
trajectories consists of a union of low-dimensional subspaces. Our results imply
that, if two interactions lie on different subspaces, motion information alone is
able to distinguish between them. However, if the interactions lie on the same
subspace then appearance information should be used for classification.
In our second experiment, the higher classification rates suggest that interaction localization (i.e., segmentation of the start and end of actions) is a goal worth pursuing. The results agree with the intuition that, by removing unnecessary parts of trajectories, we can improve the value of the information used for recognition.
(Figure 2: confusion matrices for (a) complete trajectories and (b) interaction-only trajectories; classes: Drinking, Lighting, Pouring, Spraying, Talking, Dialing.)
A main drawback of our work is the use of manual trajectory annotation. This
issue can be addressed by implementing a method for tracking and detection [1].
Another direction is to investigate how to use the classification labels provided
by our method in a fully supervised setting.
References
[1] Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions be-
tween humans and objects. TPAMI 34(3), 601–614 (2012)
[2] Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: Using
spatial and functional compatibility for recognition. TPAMI 31(10), 1775–1789
(2009)
[3] Filipovych, R., Ribeiro, E.: Robust sequence alignment for actor-object interaction
recognition: Discovering actor-object states. CVIU 115(2), 177–193 (2011)
[4] Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and ap-
plications. arXiv preprint arXiv:1203.1005 (2012)
[5] Laptev, I., Pérez, P.: Retrieving actions in movies. In: ICCV, pp. 1–8 (2007)
[6] Rao, S., Tron, R., Vidal, R., Ma, Y.: Motion segmentation in the presence of
outlying, incomplete, or corrupted trajectories. TPAMI 32(10), 1832–1845 (2010)
[7] Meka, R., Jain, P., Caramanis, C., Dhillon, I.S.: Rank minimization via online
learning. In: ICML, pp. 656–663 (2008)
[8] Ma, S., Goldfarb, D., Chen, L.: Fixed point and bregman iterative methods for
matrix rank minimization. Mathematical Programming 128(1-2), 321–353 (2011)
[9] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 267–288 (1996)
[10] Tron, R., Vidal, R.: A benchmark for the comparison of 3-d motion segmentation
algorithms. In: CVPR, pp. 1–8. IEEE (2007)
[11] Soltanolkotabi, M., Candes, E.J.: A geometric analysis of subspace clustering with
outliers. The Annals of Statistics 40(4), 2195–2238 (2012)
[12] Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Sparse and low-
rank matrix decompositions. In: IEEE CCC, pp. 962–967 (2009)
[13] Vapnik, V.: The nature of statistical learning theory. Springer (1999)
[14] Girosi, F.: An equivalence between sparse approximation and support vector ma-
chines. Neural computation 10(6), 1455–1480 (1998)
[15] Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: CVPR, pp. 2790–2797.
IEEE (2009)
[16] CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 beta (September 2012), http://cvxr.com/cvx
Scale-Space Clustering on the Sphere
1 Introduction
2 Mathematical Preliminaries
A vector x ∈ S^2 is expressed as x = x(φ, θ) = (cos φ sin θ, sin φ sin θ, cos θ) using spherical coordinates (φ, θ), where φ ∈ [0, 2π), θ ∈ [0, π]. The scale image of an image f(x, τ) on S^2 is the solution of the linear spherical heat equation

\frac{\partial}{\partial \tau} f(x, \tau) = \Delta_{S^2} f(x, \tau) = \left[ \frac{1}{\sin\theta} \frac{\partial}{\partial\theta} \left( \sin\theta \, \frac{\partial}{\partial\theta} \right) + \frac{1}{\sin^2\theta} \frac{\partial^2}{\partial\phi^2} \right] f(x, \tau),    (1)
for f(x, 0) = f(x). The scale-space image f(x, τ) of scale τ is expressed as

f(x, \tau) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} e^{-l(l+1)\tau} c_l^m Y_l^m(x), \qquad c_l^m = \int_{S^2} f(x) Y_l^m(x) \, d\sigma,    (2)
for dσ = sin θdθdφ, where Ylm is the spherical harmonic function of the degree l
and the order m. Equation (2) is re-expressed as
f(x, \tau) = \int_{S^2} f(y) K(x, y, \tau) \, d\sigma = K_\tau *_{S^2} f(x), \qquad x, y \in S^2    (3)
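A numerical sketch of Eqs. (2)-(3) for the point-set case used later for clustering: by the addition theorem for spherical harmonics, the heat kernel reduces to a Legendre series in the cosine of the angle between points, so the scale-space image of a sum of deltas can be evaluated by summing the kernel over the input points. The truncation degree l_max and the function names are assumptions.

import numpy as np
from scipy.special import eval_legendre

def spherical_heat_kernel(cos_angle, tau, l_max=64):
    """K(x, y, tau) = sum_l e^{-l(l+1) tau} (2l+1)/(4 pi) P_l(<x, y>)."""
    k = np.zeros_like(np.asarray(cos_angle, dtype=float))
    for l in range(l_max + 1):
        k += np.exp(-l * (l + 1) * tau) * (2 * l + 1) / (4 * np.pi) \
             * eval_legendre(l, cos_angle)
    return k

def gpdf(query, points, tau, l_max=64):
    """Scale-space image of a point set (sum of deltas) at scale tau.

    query:  (q, 3) unit vectors where the function is evaluated;
    points: (n, 3) unit vectors of the input point set.
    """
    cosines = np.clip(query @ points.T, -1.0, 1.0)      # (q, n) pairwise cosines
    return spherical_heat_kernel(cosines, tau, l_max).sum(axis=1)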
For a function f (x) and the matrix R ∈ SO(3), the function g(x) = f (R x)
represents a rotated function of f . The north pole n = (0, 0, 1) is moved to
n = Rn and the relationship f (n) = g(n ) is satisfied. When f (x(φ, θ)) is
constant for any φ, the rotated function of f is identical to f for any R ∈ SO(3)
which satisfies n = Rn. Using this property, we define the rotation of functions
on the sphere.
Definition 2. For the function f (x(φ, θ)) which is constant for any φ and the
rotation matrix R ∈ SO(3) which moves the north pole (0, 0, 1) to p, we write
the function rotated by R as f (x ∼ p) = f (R x).
For a point set on the sphere, by substituting an impulse function for each point, we obtain the function associated with the point set.
where \nabla_{S^2} = (\partial_\phi, \partial_\theta)^T, with \partial_\phi = \frac{1}{\sin\theta} \frac{\partial}{\partial\phi} and \partial_\theta = \frac{\partial}{\partial\theta}. We call f(x, \tau) = [P](x, \tau) the generalised PDF (GPDF), after scale-space theory [8].
For PDF f (x) on the sphere, the derivative of f (x) describes the differential
geometric features. A primitive geometric feature of the GPDF is the extension
of stationary point {x | ∇S2 f (x) = 0}, where the spatial gradient vanishes. The
stationary point can be classified into three types based on the combination of
the signs of the eigenvalues of the Hessian matrix H_f = \nabla_{S^2} \nabla_{S^2}^{\top} f, where

\nabla_{S^2} \nabla_{S^2}^{\top} = \begin{pmatrix} \partial_\phi \partial_\phi & \partial_\phi \partial_\theta \\ \partial_\theta \partial_\phi & \partial_\theta \partial_\theta \end{pmatrix} = \begin{pmatrix} \frac{1}{\sin^2\theta} \frac{\partial^2}{\partial\phi^2} + \cot\theta \frac{\partial}{\partial\theta} & \frac{1}{\sin\theta} \frac{\partial^2}{\partial\theta \partial\phi} - \frac{\cot\theta}{\sin\theta} \frac{\partial}{\partial\phi} \\ \frac{1}{\sin\theta} \frac{\partial^2}{\partial\theta \partial\phi} - \frac{\cot\theta}{\sin\theta} \frac{\partial}{\partial\phi} & \frac{\partial^2}{\partial\theta^2} \end{pmatrix}.    (8)
We denote the signs of the eigenvalues as (±, ±). Sign (−, −) means that the
point on f is a local maximum. A local maximum of a PDF is called the mode
in probability theory and statistics.
Let P be a set of points, f = [P], and N = |P|. Let τi be selected scale values,
where τi < τi+1 for i = 0, 1, . . . , M with τ0 = 0. The mode tree M corresponding
to f is defined as follows.
– Each node in M has three values: the node ID i, a scale value τ and a location vector x, and is denoted by (τ, i, x).
– M has N leaf nodes. Each node has a unique ID in {0, . . . , N − 1}. The scale values of all leaf nodes are 0 and each location is defined by P.
– The parent of a node whose scale is τi is a node whose scale is τi+1, for i < M − 1.
– A node whose scale and location are τ and p, respectively, is one of the local maxima of the function f(x, τ) at x = p.
We denote the set of nodes whose scale is τ by Mτ. Algorithm 1 constructs the mode tree. In it, the leaf nodes (at scale 0) correspond to the input points. All nodes at a given scale are moved according to the scale image of the next scale. When several node points concentrate onto a single point, they are merged into a new node whose ID is inherited from the point which remains as an isolated point in the scale space. Figure 1 shows an example of the construction of a tree.
Fig. 1. Mode tree and node merging. (a) Trajectory of three modes in scale space.
Each plane represents the spherical space in spherical coordinates on the different scale.
(b) The mode tree. Each circle and the number in it represent a node in the tree and
its ID, respectively. (c) The linear separability for point set on the spherical space.
5 Experimental Results
We show an example in Fig. 2. The set consists of three clusters with 3000 points, which can be separated by a curve with the appearance of a baseball stitch. Figure 2(c) shows the graph of the number of maxima. The set can be successfully separated into two clusters in the mode tree.
Fig. 3(a) shows the spherical image generated from the dioptric image shown in Fig. 3(b). Our aim is to detect spatial lines captured in the spherical image. Since spatial lines are projected to great circles in the spherical image [7], we use the spherical Hough transform for spherical images [13,14], which detects the great circles from sample points on a spherical image. Since the voting space of the spherical Hough transform is the unit sphere, the votes yield a point cloud on the sphere [13]. Therefore, line detection is achieved by detecting the mean points of the clusters of a point cloud on the spherical voting space.
To apply the spherical clustering, we extend the voting space to the whole sphere by duplicating all points x as the antipodal points −x. After constructing the mode tree of the point cloud, Fig. 4(d) shows the graph of the number of modes at each scale. From the symmetry of the extended points, the modes are also symmetric at any scale and we can divide the mode tree into two subtrees. From this geometric property of the mode tree, we use the modes in the northern hemisphere
(Fig. 2: point set of three clusters on the sphere and the number of maxima as a function of scale.)
Fig. 3. Spherical image captured by a fish-eye-lens camera system. (a) Image on the unit sphere transformed from (b). (b) Image acquired by the fish-eye-lens camera system.
to estimate clusters of the point cloud generated by voting. In the graph, there are some stable intervals in which the number of points does not change. The first three of these stable scale intervals are [0.00337, 0.00526], [0.00925, 0.0134] and [0.01587, 0.02907]. These intervals are the shaded regions on the graph. The numbers of modes in these intervals are 36, 20 and 10, respectively. The clustering results in these intervals are shown in Fig. 4(d).
Figures 5(a), (b) and (c) show the means of the clusters detected from the point cloud on the spherical voting space using spherical scale-space clustering. Furthermore, Figs. 5(d), (e) and (f) illustrate the detected lines. The results show that the method can detect the parameters of lines in dioptric images.
6 Conclusions
We introduced an algorithm for scale-space-based clustering of point clouds on the sphere, regarding the union of the given point sets as an image formed by a finite sum of delta functions located at the positions of the points on the sphere.
The principal advantage of the scale-space-based analysis for the point-set
analysis is that deterministic features of the point set can be observed at higher
scales even if the positions of the points are stochastic. This property can be
Fig. 4. Voting points obtained for the N-point randomized Hough transform (NPRHT) [13,14]. (a) and (b) Point set on the spherical accumulator space. (c) Mode tree. (d) Number of modes.
Fig. 5. Detected means and lines. (a) Means detected from 1587 clusters. (b) Means detected from 925 clusters. (c) Means detected from 337 clusters. (d) Lines detected from 1587 clusters. (e) Lines detected from 925 clusters. (f) Lines detected from 337 clusters.
References
1. Witkin, A.P.: Scale space filtering. In: Proc. 8th IJCAI, pp. 1019–1022 (1983)
2. Griffin, L.D., Colchester, A.: Superficial and deep structure in linear diffusion scale
space: Isophotes, critical points and separatrices. Image and Vision Computing 13,
543–557 (1995)
3. Nakamura, E., Kehtarnavaz, N.: Determining number of clusters and prototype lo-
cations via multi-scale clustering. Pattern Recognition Letters 19, 1265–1283 (1998)
4. Loog, M., Duistermaat, J.J., Florack, L.M.J.: On the Behavior of Spatial Criti-
cal Points under Gaussian Blurring (A Folklore Theorem and Scale-Space Con-
straints). In: Kerckhove, M. (ed.) Scale-Space 2001. LNCS, vol. 2106, pp. 183–192.
Springer, Heidelberg (2001)
5. Sakai, T., Imiya, A., Komazaki, T., Hama, S.: Critical scale for unsupervised cluster
discovery. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 218–232.
Springer, Heidelberg (2007)
6. Sakai, T., Imiya, A.: Unsupervised cluster discovery using statistics in scale space.
Engineering Applications of Artificial Intelligence 22, 92–100 (2009)
7. Franz, M.O., Chahl, J.S., Krapp, H.G.: Insect-inspired estimation of egomotion.
Neural Computation 16, 2245–2260 (2004)
8. Zhao, N.-Y., Iijima, T.: A theory of feature extraction by the tree of stable view-
points. IEICE Japan, Trans. D J68-D, 1125–1135 (1985) (in Japanese)
9. Kim, G., Sato, M.: Scale space filtering on spherical pattern. In: Proc. ICPR 1992,
pp. 638–641 (1992)
10. Chung, M.K.: Heat kernel smoothing on unit sphere. In: Proc. 3rd IEEE ISBI:
Nano to Macro, pp. 992–995 (2006)
11. Kuijper, A., Florack, L.M.J., Viergever, M.A.: Scale space hierarchy. Journal of
Mathematical Imaging and Vision 18, 169–189 (2003)
12. Minnotte, M.C., Scott, D.W.: The mode tree: A tool for visualization of nonpara-
metric density features. Journal of Computational and Graphical Statistics 2, 51–68
(1993)
13. Torii, A., Imiya, A.: The randomized-Hough-transform method for great-circle de-
tection on sphere. Pattern Recognition Letters 28, 1186–1192 (2007)
14. Mochizuki, Y., Torii, A., Imiya, A.: N -Point Hough transform for line detection.
Journal of Visual Communication and Image Representation 20, 242–253 (2009)
The Importance of Long-Range Interactions to Texture
Similarity
Abstract. We have tested 51 sets of texture features for estimating the percep-
tual similarity between textures. Our results show that these computational fea-
tures only agree with human judgments at an average rate of 57.76%. In a
second experiment we show that the agreement rates, between humans and
computational features, increase when humans are not allowed to use long-
range interactions beyond 19×19 pixels. We believe that this experiment
provides evidence that humans exploit long-range interactions which are not
normally available to computational features.
1 Introduction
Although computed texture similarity is widely used for texture classification and
retrieval, human “perceptual similarity” is difficult to acquire and estimate. Halley [1]
derived a perceptual similarity matrix for a large texture database of 334 textures, and
Clarke et al. [2] compared these data with similarity matrices obtained by 4 computa-
tional feature sets and found that these did not correlate well.
Traditionally, computational features are divided into: filtering-based [3], structural
[4], statistical [5], and model-based [6] features. According to Parseval’s theorem [7],
filtering operations in the spatial domain are equivalent to those in the frequency do-
main when the variances of filtered images are used. In this case, linear filtering based
features, with the exception of quadrature filters which are designed to capture local
phase, only use power spectrum information. However, phase is believed to encode
the “structure” information in images [8]. As a result, these approaches are unlikely to
be able to capture texture structure. Texton-based features are a form of vector quanti-
zation and normally work by clustering in pixel neighborhood space. Computational
cost and feature space sparsity both severely limit the size of the neighborhood. Gen-
erally, statistical features also extract only local statistics again largely for reasons of
computational cost. Similarly, model-based features utilize only a small neighborhood
although the recursive structures have the potential to encode long-range interactions.
However, the majority of published features either work in the power spectrum, or
only exploit higher-order information from relatively small (i.e. 19×19 or less) local
neighborhoods. However, the more interesting aperiodic structures in textures are
represented by phase spectra data and features with small spatial extent cannot encode
these long-range interactions.
In order to examine the abilities of computational features to estimate perceptual
similarity, we have benchmarked 51 sets of features but we have found that these do
not agree well with humans’ perceptual judgments, even if a multi-pyramid scheme is
employed. We believe this may be because the majority of texture features do not
encode long-range higher-order information, such as continuity as expressed by the
Gestalt law of "good continuation" [9]. Furthermore, after studying the lateral interactions between spatial filters, Polat et al. [10] found that these interactions might be attributed to the grouping of collinear line segments into smooth curves. As Spillmann et al. stated, classical receptive-field models can only explain local perceptual effects but are unable to explain some global perceptual phenomena, such as the perception of illusory contours [11], which is believed to result from long-range interactions.
The most direct hypothesis is that the features that we have investigated cannot ex-
ploit these long-range interactions and hence these produce estimates of similarity that
are not consistent with human judgments. Unfortunately, it is difficult to test this hy-
pothesis directly by “adding” long-range interactions to textures as such actions inva-
riably change local features or 1st- or 2nd-order statistics. However, it is relatively
simple to prevent humans from using long-range interactions. In this situation, hu-
mans are likely to make judgments that are more similar to the computational results
if our hypothesis holds true. Hence, we performed two additional pair-of-pairs expe-
riments and their results show that humans are more inclined to make judgments that
coincide with the feature data when long-range interactions are not available to the
observers. These results indicate that the features that we have examined do not ex-
ploit long-range interactions.
In the next section, a series of evaluation experiments is conducted that compare
human and computational similarity. The effect of removing long-range interactions
on perceptual similarity is investigated in section 3. Finally, in section 4, conclusions
are drawn.
In this section, we carry out a series of evaluation experiments for examining the abil-
ities of 51 sets of different computational features to estimate perceptual similarity as
obtained from free-grouping [1, 12] and pair-of-pairs [12] experiments respectively.
Multi-pyramid decomposition is used to increase the spatial extent of the computa-
tional features and 6 pyramid levels are used. The computational similarity is com-
pared with perceptual similarity and the agreement rate is used to measure the estima-
tion ability of each set of features.
, ∑ (1)
, ∑ (2)
In our research, 51 sets of features were used to estimate perceptual similarity. Due to
space limitations, we list and reference these in the paper’s supplementary material.
(a) Accumulate the choice (left or right) decisions made by all 20 participants for
1000 pair-of-pairs trials, and label these as and respectively. The differ-
ence between these two figures is computed and normalized:
, 1, 2, … , 1000; (3)
(b) For each trial, label the computational similarities of the left and right pairs as
and respectively, and compute their difference as
, 1, 2, … , 1000; (4)
(c) Compute the criterion to decide whether the computational features and human
decisions are consistent for each trial:
0 || 0 && 0 , 1, 2, … , 1000; (5)
% ∑ 100⁄1000. (6)
, 1, 2, … , 1000. (7)
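A sketch of the agreement-rate computation of steps (a)-(c) above; since the symbol names of Eqs. (3)-(7) are not legible in this excerpt, n_left/n_right (human vote counts) and s_left/s_right (computational similarities of the left and right pairs) are hypothetical names.

import numpy as np

def agreement_rate(n_left, n_right, s_left, s_right, n_participants=20):
    """Percentage of the 1000 pair-of-pairs trials on which the computational
    similarity ranks the two pairs the same way as the majority of observers."""
    human_diff = (np.asarray(n_left) - np.asarray(n_right)) / n_participants   # step (a)
    comp_diff = np.asarray(s_left) - np.asarray(s_right)                        # step (b)
    agree = (human_diff * comp_diff) > 0                                        # step (c): same sign
    return 100.0 * agree.mean()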
2.4 Results
The average agreement rates (%) of the humans’ perceptual pair-of-pairs judgments
against 51 sets of computational features computed at 6 resolutions are displayed in
Figure 1(a). In addition a second set of human data obtained by free grouping and
Isomap analysis (8D-ISO) is also shown for comparison purposes. The 8D-ISO data
provides the highest agreement rate at 73.9% providing validation of the pair-of-pairs
data and an indication of the variability of human performance. However, the perfor-
mance of the computational features is much lower (average agreement rates lie in the
range 48.58% to 63.38%). Figure 1(b) provides a similar plot in which the same com-
putational features are compared against the 8D-ISO human data. The highest and
lowest performances (58.55% and 48.85%) were provided by MRSAR and SVR,
respectively. It can be seen that two curves in Figure 1(a) and 1(b) are similar. In both
cases the performance of the 51 computational features is poor when compared
against the two sources of human data.
Fig. 1. The average agreement rates (%) of computational features with perceptual pair-of-pairs
judgments (a), 8D-ISO data (b), and corresponding standard deviations (error bars) over 6
resolutions. In (a) and (b), the black solid lines illustrate the overall average agreement rates
(57.76% and 53.59%), over all 51 methods and 6 resolutions.
Table 1. Pairwise t-test (α = 0.05) results, where r ≥ 0.5 indicates a strong effect. POP , POP and POP : probabilities that the participants chose "Left" throughout 80 trials in the original, non-randomized and randomized pair-of-pairs experiments respectively.
t-test t p r df Significant
POP vs. POP -1.12 0.27 0.21 28 No
POP vs. POP -4.73 0.00 0.67 28 Yes
POP vs. POP 2.74 0.02 0.67 9 Yes
Fig. 2. The 4 (out of 80) pairs of pairs (rows) in which computational features have the most difficulty agreeing with human judgments: (a) most participants think that the right pairs are more similar in the original (POP ) and the non-randomized (POP ) pair-of-pairs but they change their minds in the randomized pair-of-pairs (POP ), and (b) most participants always think the left pairs are more similar throughout all three versions of pair-of-pairs
Fig. 3. Average agreement rates between each method and perceptual judgments in POP and POP , along with standard deviations (error bars) over 6 resolutions. The black solid lines illustrate the overall average rates: (a) 31.28% and (b) 54.67%, on 51 methods and 6 resolutions.
We hypothesize that one possible reason for the computational features producing
results that do not agree with human perceptual judgments is that the features com-
pute higher order spatial statistics only on small neighborhoods. Thus, these cannot
exploit the long-range interactions that humans have been shown to be capable of
using for other tasks. Unfortunately, it is difficult to introduce long-range interactions
to real texture images or generate synthetic textures that with controlled long-range
interactions that do not affect local and/or 1st-/2nd-order statistics. As an alternative,
we designed an experiment in which the human observers were prevented from using
long-range interactions. We were inspired by Field et al. [9] who showed that humans
can recognize an object in an image by using long-range interactions even though a grid has been imposed on top of it. Hence, given a "blocked" texture image, we can
remove the original long-range interactions by randomizing the position of the blocks.
Experiments were performed using “non-randomized blocked” and “randomized
blocked” textures respectively. The former was designed to provide a control to un-
derstand the effect of superimposing the grid onto images. Note that the grid was
provided in order to lessen the effect of local discontinuities at randomized blocked
edges. The size of each block was set at 19×19 which is the largest neighborhood
used by computational features (excluding filtering-based features). Two modified
pair-of-pairs (POP and POP ) experiments were designed using non-randomized
and randomized blocked images respectively. All other conditions were kept the same
as for the original pair-of-pairs (POP ) experiment.
Three pairwise t-tests were conducted on the 3 sets of results (see Table 1). There
is no significant difference between the choices participants have made in the original
(POP ) and the blocked but non-randomized (POP ) experiments. However, the re-
sults of the randomized experiment (POP ) against both the original (POP ) and the
blocked but non-randomized (POP ) show significant changes. In both cases the ran-
domized blocked experiment provides increased agreements with the computational
features. That is, when humans are provided with images containing the original long-
range interactions (i.e. either the original, or the blocked but non-randomized images),
they disagree significantly more with the computational results compared with when
these long-range interactions have been removed. For example, in Figure 2(a) most
participants judge that the right pairs are more similar in POP and POP but
change their minds in POP . The average agreement rates between each computa-
tional approach and perceptual judgments in POP and POP over 6 resolutions, are
plotted in Figure 3(a) and 3(b). It can be seen that the participants have agreed more
with the features when they have not been able to use long-range interactions.
4 Conclusions
In this paper, we examined the abilities of 51 feature sets to estimate perceptual simi-
larity as estimated using free-grouping and pair-of-pairs experiments. Even though
five different pyramid resolutions were used in order to enhance the feature sets’
spatial extents, the results did not agree well with humans’ perceptual judgments. The
average agreement rates of 57.76% and 53.59% were obtained over all methods and
resolutions. Obviously, enhancing spatial extent alone is not enough for capturing the
complexity of human perception.
In a second experiment, 80 of the most difficult pairs of pairs of images from experiment one were selected for further investigation. These were "blocked" and then the
position of the blocks within an image was randomized in order to remove, or at least
reduce, the ability of observers to exploit long-range interactions in the textures. The
results of the second experiment showed that the block-randomized images produced
significantly different results from the original experiment, while the blocked but non-
randomized images did not. When human observers are allowed to use long-range
interactions in textures, they agree significantly less with the computational feature-
based results. Thus we hypothesize that long-range interactions are important when
humans judge the similarity of textures and that the 51 feature sets that we tested do
not use this important information.
References
1. Halley, F.: Perceptually Relevant Browsing Environments for Large Texture Databases.
PhD Thesis, Heriot Watt University (2011)
2. Clarke, A.D.F., Halley, F., Newell, A.J., Griffin, L.D., Chantler, M.J.: Perceptual Similari-
ty: A Texture Challenge. In: BMVC 2011, pp. 120.1–120.10. BMVA Press (2011)
3. Randen, T., Husøy, J.H.: Filtering for Texture Classification: A Comparative Study. IEEE
Transactions on PAMI 21, 291–310 (1999)
4. Varma, M., Zisserman, A.: A Statistical Approach to Material Classification Using Image
Patch Exemplars. IEEE Transactions on PAMI 31, 2032–2047 (2009)
5. Unser, M.: Sum and Difference Histograms for Texture Classification. IEEE Transactions
on PAMI 8(1), 118–125 (1986)
6. Mao, J., Jain, A.K.: Texture classification and segmentation using multiresolution simulta-
neous autoregressive models. Pattern Recognition 25(2), 173–188 (1992)
7. Parseval’s Theorem, http://mathworld.wolfram.com/ParsevalsTheorem.html
8. Oppenheim, A.V., Lim, J.S.: The Importance of Phase in Signals. Proceedings of the
IEEE 69(5), 529–541 (1991)
9. Field, D.J., Hayes, A., Hess, R.F.: Contour integration by the human visual system:
evidence for a local “association field”. Vision Research 33, 173–193 (1993)
10. Polat, U., Sagi, D.: The Architecture of Perceptual Spatial Interactions. Vision Re-
search 34, 73–78 (1994)
11. Spillmann, L., Werner, J.S.: Long-range interactions in visual perception. Trends in Neu-
rosciences 19, 428–434 (1996)
12. Clarke, A.D.F., Dong, X., Chantler, M.J.: Does Free-sorting Provide a Good Estimate of
Visual Similarity. In: Predicting Perceptions, pp. 17–20 (2012)
13. MatlabPyrTools-v1.4, http://www.cns.nyu.edu/~lcv/software.php
Unsupervised Dynamic Textures Segmentation
1 Introduction
Many automated static or dynamic visual data analysis systems build on segmentation as the fundamental process, which affects the overall performance of any analysis. Visual scene regions, homogeneous with respect to some, usually textural or colour, measure, which result from a segmentation algorithm, are analysed in subsequent interpretation steps. Dynamic texture-based (DT) image segmentation has been an area of novel research activity in recent years, and several algorithms have been published as a result of this effort. Different published methods are difficult to compare because of incompatible assumptions (grayscale, fixed or known number of regions, segmentation or retrieval, constant shape and/or location of texture regions, etc.) and the lack of a comprehensive analysis together with accessible experimental data. Gray-scale dynamic texture segmentation or retrieval was addressed in a few papers [1–5], while colour texture retrieval based on VLBP [6] or DT segmentation [7], based on the geodesic active contour algorithm and partial shape matching to obtain partial match costs between regions of subsequent frames, was addressed to an even lesser extent. However, all available published results indicate that the ill-defined dynamic texture
$$ Y_r = \gamma X_r + e_r , \qquad (1) $$

where $X_r = [Y_{r-s}^T : \forall s \in I_r^c]^T$ is a vector of the contextual neighbours $Y_{r-s}$, $I_r^c$ is a causal neighbourhood index set of the model with cardinality $\eta = \mathrm{card}(I_r^c)$, $\gamma = [A_1, \ldots, A_\eta]$ is the $d \times d\eta$ parameter matrix containing parametric sub-matrices $A_s$ for each contextual neighbour $Y_{r-s}$, $d$ is the number of spectral bands, $e_r$ is a white Gaussian noise vector with zero mean and a constant but unknown covariance, and $r, r-1, \ldots$ is a chosen direction of movement on the image index lattice $I$. The selection of an appropriate model support ($I_r^c$) is important for obtaining a good texture representation for realistic texture synthesis, but it is less important for adequate texture segmentation, which works only with site-specific parameters. Both the optimal neighbourhood and the Bayesian estimates of the AR3D model parameters can be found analytically under a few additional and acceptable assumptions using the Bayesian approach (see details in [15]). The local model parameters can be advantageously evaluated using the recursive Bayesian parameter estimator for every DT frame as follows:
$$ \hat{\gamma}_{r-1}^T = \hat{\gamma}_{r-2}^T + \frac{V_{x(r-2)}^{-1} X_{r-1} \left(Y_{r-1} - \hat{\gamma}_{r-2} X_{r-1}\right)^T}{1 + X_{r-1}^T V_{x(r-2)}^{-1} X_{r-1}} , \qquad (2) $$

where the data accumulation matrix is

$$ V_{x(r-1)} = \sum_{k=1}^{r-1} X_k X_k^T + V_{x(0)} . \qquad (3) $$
Thus the parameter matrix estimate can easily be updated after moving to a new lattice location ($r-1 \rightarrow r$). The model is very fast, hence the local texture at each pixel can be represented by four directional parametric vectors corresponding to four distinct models. Each vector contains local estimates of the AR3D model parameters. These models have an identical contextual neighbourhood $I_r^c$ but differ in their major movement direction (top-down, bottom-up, rightward, leftward), i.e.,

$$ \tilde{\gamma}_{r,o} = \{\hat{\gamma}_{r,o}^{t\,T}, \hat{\gamma}_{r,o}^{b\,T}, \hat{\gamma}_{r,o}^{r\,T}, \hat{\gamma}_{r,o}^{l\,T}\}^T , \qquad (4) $$

where $o = 1, \ldots, n$ is the DT frame number.
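For illustration, the recursive estimator (2)–(3) can be sketched as follows; this is a minimal sketch under the stated model, not the authors' implementation, and the function and variable names are assumptions.

```python
import numpy as np

def ar_recursive_update(gamma_T, V_inv, X, Y):
    """One step of the recursive estimator (2): update the transposed parameter
    matrix gamma_T (d*eta x d) and the inverse of the data accumulation matrix
    after observing the neighbour vector X (d*eta,) and the pixel value Y (d,)."""
    denom = 1.0 + X @ V_inv @ X                      # 1 + X^T V^-1 X
    residual = Y - gamma_T.T @ X                     # prediction error of the AR3D model
    gamma_T = gamma_T + np.outer(V_inv @ X, residual) / denom
    # Sherman-Morrison update of V^-1, matching the accumulation V <- V + X X^T in (3)
    V_inv = V_inv - np.outer(V_inv @ X, X @ V_inv) / denom
    return gamma_T, V_inv
```

Because only rank-one updates are involved, the per-pixel cost stays constant, which is what makes the representation fast enough for frame-by-frame processing.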
$$ p(\Theta_r \mid \nu_i, \Sigma_i) = \frac{|\Sigma_i|^{-\frac{1}{2}}}{(2\pi)^{\frac{d}{2}}} \, e^{-\frac{1}{2} (\Theta_r - \nu_i)^T \Sigma_i^{-1} (\Theta_r - \nu_i)} . \qquad (7) $$
The mixture model equations (6), (7) are solved using a modified EM algorithm. The algorithm is initialised, for the first DT frame, using the $\nu_i, \Sigma_i$ statistics estimated from the corresponding rectangular subimages obtained by a regular division of the input texture mosaic. An alternative initialisation is a random choice of these statistics. For each possible couple of rectangles the Kullback-Leibler divergence

$$ D\!\left(p(\Theta_r \mid \nu_i, \Sigma_i) \,\|\, p(\Theta_r \mid \nu_j, \Sigma_j)\right) = \int_{\Omega} p(\Theta_r \mid \nu_i, \Sigma_i) \log \frac{p(\Theta_r \mid \nu_i, \Sigma_i)}{p(\Theta_r \mid \nu_j, \Sigma_j)} \, d\Theta_r \qquad (8) $$

is evaluated, and the most similar rectangles
are merged together in each step. This initialization results in $K_{ini}$ subimages and recomputed statistics $\nu_i, \Sigma_i$, with $K_{ini} > K$, where $K$ is the optimal number of textured segments to be found by the algorithm. All the subsequent DT frames are initialized either from the corrected statistics $\hat{\nu}_{i,o-1}, \hat{\Sigma}_{i,o-1}$ for $i = 1, \ldots, K$, computed from the trimmed segmented regions in the previous frame $o-1$, or with random parameter values $\hat{\nu}_{i,o-1}, \hat{\Sigma}_{i,o-1}$, $i = K+1, \ldots, K_{ini}$, for possibly newly (re)appearing regions. Two steps of the EM algorithm are repeated after initialisation. Components with weights smaller than a fixed threshold ($p_j < \frac{0.1}{K_{ini}}$) are eliminated. For every pair of components we estimate their Kullback-Leibler divergence (8). From the most similar couple, the component with the weight smaller than the threshold is merged into its stronger partner and all statistics are updated using the EM algorithm. The algorithm stops when either the likelihood function shows a negligible increase ($L_t - L_{t-1} < 0.05$) or the maximum iteration number threshold is reached.
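The weak-component elimination and merging step can be sketched as follows. It assumes the components are the Gaussians of (7), for which the divergence (8) has a closed form; the statistics of the surviving component are left to be re-estimated by the subsequent EM step, and all names are illustrative.

```python
import numpy as np

def gauss_kl(nu_i, S_i, nu_j, S_j):
    """Closed-form Kullback-Leibler divergence (8) between two Gaussian components (7)."""
    d = nu_i.shape[0]
    S_j_inv = np.linalg.inv(S_j)
    diff = nu_j - nu_i
    return 0.5 * (np.trace(S_j_inv @ S_i) + diff @ S_j_inv @ diff - d
                  + np.log(np.linalg.det(S_j) / np.linalg.det(S_i)))

def merge_weak_components(weights, means, covs, K_ini):
    """Drop components with weight below 0.1/K_ini by merging each one into its
    most similar (smallest-KL) partner; lists are modified in place."""
    thresh = 0.1 / K_ini
    while len(weights) > 1 and min(weights) < thresh:
        j = int(np.argmin(weights))                            # weakest component
        kls = [gauss_kl(means[j], covs[j], means[k], covs[k]) if k != j else np.inf
               for k in range(len(weights))]
        k = int(np.argmin(kls))                                # most similar partner
        w = weights[j] + weights[k]
        means[k] = (weights[j] * means[j] + weights[k] * means[k]) / w
        weights[k] = w                                         # covs[k] re-estimated by EM later
        del weights[j]; del means[j]; del covs[j]
```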
The parametric vectors representing texture mosaic pixels are assigned to the
clusters according to the highest component probabilities, i.e., Yr is assigned to
the cluster ωj if
$$ \pi_{r,j} = \max_j \left\{ \sum_{s \in I_r} w_s \, p(\Theta_{r-s} \mid \nu_j, \Sigma_j) \right\} , \qquad (10) $$
4 Experimental Results
The algorithm was tested on natural colour dynamic textural mosaics from the Prague Texture Segmentation Data-Generator and Benchmark [16]. The benchmark (http://mosaic.utia.cas.cz) test mosaics have randomly generated layouts and cell texture memberships and are filled with dynamic colour textures from the DynTex database [17]. The benchmark ranks segmentation algorithms according to a chosen criterion and implements the majority of segmentation criteria used for the evaluation of both supervised and unsupervised algorithms. Twenty-seven evaluation criteria (see their definitions in [16]) are categorized into four groups: region-based (5+5), pixel-wise (12), consistency measures (2), and clustering comparison criteria (3); they permit a detailed and objective study of the properties of any segmentation method.
Table 1. Benchmark – Dynamic (ranks in parentheses)

Criterion   DTAR3D+EM e+pp (1.33)   DTAR3D+EM pp (1.86)   DTAR3D+EM (2.62)
↑ CS        92.68 (1)               60.75 (2)             60.12 (3)
↓ OS        39.47 (3)               20.37 (2)             14.78 (1)
↓ US         0.00 (1)                0.00 (1)              0.00 (1)
↓ ME         0.00 (1)               35.77 (2)             35.77 (2)
↓ NE         0.00 (1)               36.52 (2)             36.76 (3)
↓ O          3.23 (1)               10.07 (2)             11.08 (3)
↓ C         13.25 (2)               12.45 (1)             14.56 (3)
↑ CA        87.03 (1)               81.26 (2)             80.10 (3)
↑ CO        92.68 (1)               84.42 (2)             83.69 (3)
↑ CC        94.01 (3)               95.85 (1)             95.18 (2)
↓ I.         7.32 (1)               15.58 (2)             16.31 (3)
↓ II.        1.32 (3)                0.76 (1)              0.89 (2)
↑ EA        92.80 (1)               89.01 (2)             88.30 (3)
↑ MS        89.07 (1)               82.41 (2)             81.35 (3)
↓ RM         2.56 (1)                5.54 (3)              5.21 (2)
↑ CI        93.07 (1)               89.56 (2)             88.86 (3)
↓ GCE       11.13 (1)               12.39 (2)             13.51 (3)
↓ LCE        7.02 (1)               11.03 (2)             12.21 (3)
↓ dD         7.27 (1)               11.68 (2)             12.37 (3)
↓ dM         4.95 (1)                6.36 (2)              6.80 (3)
↓ dVI       13.18 (1)               13.93 (2)             13.99 (3)
Table 1 compares the overall (average over all DT frames) benchmark performance of the proposed algorithm (DTAR3D+EM (e+pp)) with postprocessing (pp) and robust trimmed initialization (e) against its alternative versions. The results demonstrate very good performance on all criteria with the exception of an over-segmentation tendency and a slightly worse variation-of-information criterion. We could not compare our results with the few published alternative DT segmenters [1, 2, 4] because neither their code nor their experimental segmentation data are publicly available; however, the static single-frame (AR3D+EM) version of the method was extensively evaluated and compared with several alternative methods (22
Fig. 1. Selected experimental dynamic texture mosaic frames (0, 1, 2, 126, 249), ground truth from the benchmark (middle row), and the corresponding segmentation results (DTAR3D+EM e+pp, bottom row)
5 Conclusions
We proposed a novel method for fast unsupervised dynamic texture or video segmentation with an unknown, variable number of classes, based on an underlying three-dimensional Markovian local image representation and Gaussian mixture parametric-space models. Single homogeneous texture regions can not only dynamically change their location but simultaneously also their shape. Textural regions can also disappear temporarily or permanently, and new regions can appear at any time. Although the algorithm uses a random-field-type data model, it is very fast because it relies on an efficient recursive parameter estimator; it is therefore much faster than the usual Markov chain Monte Carlo estimation approach needed for Markovian models. Segmentation methods typically suffer from many application-dependent parameters that have to be estimated experimentally. Our method requires only a contextual neighbourhood selection and two additional thresholds, all of which have an intuitive meaning. The algorithm's performance is demonstrated in extensive objective benchmark tests on natural dynamic texture mosaics. The static version of our method outperforms several alternative unsupervised segmentation algorithms and is also faster than most of them. These dynamic texture unsupervised segmentation results are encouraging, and we will proceed with more elaborate post-processing and with modifications of the texture representation model.
References
1. Doretto, G., Cremers, D., Favaro, P., Soatto, S.: Dynamic texture segmentation.
In: Proceedings of the 9th IEEE International Conference on Computer Vision,
vol. 2, pp. 1236–1242 (2003)
2. Péteri, R., Chetverikov, D.: Dynamic texture recognition using normal flow and
texture regularity. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds.) IbPRIA
2005, Part II. LNCS, vol. 3523, pp. 223–230. Springer, Heidelberg (2005)
3. Chan, A.B., Vasconcelos, N.: Classifying video with kernel dynamic textures. In:
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2007), pp. 1–6. IEEE Computer Society (2007)
4. Chan, A.B., Vasconcelos, N.: Layered dynamic textures. IEEE Transactions on
Pattern Analysis and Machine Intelligence 31(10), 1862–1879 (2009)
5. Chen, J., Zhao, G., Salo, M., Rahtu, E., Pietikäinen, M.: Automatic dynamic texture
segmentation using local descriptors and optical flow. IEEE Transactions on Image
Processing (2012)
6. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns
with an application to facial expressions. IEEE Transactions on Pattern Analysis
and Machine Intelligence 29(6), 915–928 (2007)
7. Donoser, M., Urschler, M., Riemenschneider, H., Bischof, H.: Highly consistent se-
quential segmentation. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688,
pp. 48–58. Springer, Heidelberg (2011)
8. Kashyap, R.L.: Image models. In: Young, T.Y., Fu, K.S. (eds.) Handbook of Pat-
tern Recognition and Image Processing. Academic Press, New York (1986)
9. Haindl, M.: Texture synthesis. CWI Quarterly 4(4), 305–331 (1991)
10. Panjwani, D.K., Healey, G.: Markov random field models for unsupervised seg-
mentation of textured color images. IEEE Transactions on Pattern Analysis and
Machine Intelligence 17(10), 939–954 (1995)
11. Manjunath, B.S., Chellapa, R.: Unsupervised texture segmentation using Markov
random field models. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 13, 478–482 (1991)
12. Haindl, M.: Texture segmentation using recursive Markov random field parameter
estimation. In: Bjarne, K.E., Peter, J. (eds.) Proceedings of the 11th Scandinavian
Conference on Image Analysis, Lyngby, Denmark, pp. 771–776. Pattern Recogni-
tion Society of Denmark (June 1999)
13. Haindl, M., Mikeš, S., Pudil, P.: Unsupervised hierarchical weighted multi-
segmenter. In: Benediktsson, J.A., Kittler, J., Roli, F. (eds.) MCS 2009. LNCS,
vol. 5519, pp. 272–282. Springer, Heidelberg (2009)
14. Haindl, M., Mikeš, S.: Unsupervised texture segmentation using multispectral mod-
elling approach. In: Tang, Y.Y., Wang, S.P., Yeung, D.S., Yan, H., Lorette, G. (eds.)
Proceedings of the 18th International Conference on Pattern Recognition, ICPR
2006, vol. II, pp. 203–206. IEEE Computer Society, Los Alamitos (2006)
15. Haindl, M., Šimberová, S.: A multispectral image line reconstruction method. In:
Theory & Applications of Image Analysis, pp. 306–315. World Scientific Publishing
Co., Singapore (1992)
16. Haindl, M., Mikeš, S.: Texture segmentation benchmark. In: Lovell, B., Lauren-
deau, D., Duin, R. (eds.) Proceedings of the 19th International Conference on
Pattern Recognition, ICPR 2008, pp. 1–4. IEEE Computer Society (2008)
17. Péteri, R., Fazekas, S., Huiskes, M.J.: DynTex: A Comprehensive Database
of Dynamic Textures. Pattern Recognition Letters 31(12), 1627–1632 (2010),
http://projects.cwi.nl/dyntex/
18. Hoang, M.A., Geusebroek, J.M., Smeulders, A.W.M.: Color texture measurement
and segmentation. Signal Processing 85(2), 265–275 (2005)
19. Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P.J., Bunke, H., Goldgof, D.B.,
Bowyer, K., Eggert, D.W., Fitzgibbon, A., Fisher, R.B.: An experimental com-
parison of range image segmentation algorithms. IEEE Transaction on Pattern
Analysis and Machine Intelligence 18(7), 673–689 (1996)
20. Kittler, J.V., Marik, R., Mirmehdi, M., Petrou, M., Song, J.: Detection of defects
in colour texture surfaces. In: IAPR Workshop on Machine Vision Application,
Tokyo, Japan, pp. 558–567 (1994)
21. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature
similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24,
301–312 (2002)
22. Mirmehdi, M., Marik, R., Petrou, M., Kittler, J.: Iterative morphology for fault
detection in stochastic textures. Electronics Letters 32(5), 443–444 (1996)
Voting Clustering and Key Points Selection
1 Introduction
over all clusters. In [3], a variant method (K-means++ algorithm) for centroid
initialization has been proposed that chooses centers at random from the data
points, but weights the data points according to their squared distance from
the closest center already chosen. K-means++ usually outperforms K-means in
terms of both accuracy and speed. A deterministic initialization scheme for K-
means is given by the KKZ algorithm [4]. According to the KKZ method, the first centroid is the data point with the maximum norm, the second centroid is the point farthest from the first centroid, the third centroid is the point farthest from its closest existing centroid, and so on. An extension/variation of K-means is the K-medoid or Partitioning Around Medoids (PAM) algorithm [5], where the clusters are represented using the medoid of the data instead of the mean. The medoid is the object of the cluster with minimum distance to all other objects in the cluster. Most approaches from the literature are heuristic, optimize a criterion that may not be appropriate for clustering, or require a training set. In contrast, in this paper, we solve the crisp clustering problem via a voting maximization scheme that ensures high similarity between the points of the same cluster without any user-defined parameter. In addition, the proposed method has been applied to the video summarization problem [6].
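As an illustration of the farthest-point seeding described above, a minimal sketch of KKZ-style initialization follows; the data layout and names are assumptions, not the authors' code.

```python
import numpy as np

def kkz_init(X, k):
    """KKZ seeding: the first centre is the point with maximum norm; each further
    centre is the point farthest from its closest already-chosen centre."""
    centres = [X[np.argmax(np.linalg.norm(X, axis=1))]]
    for _ in range(k - 1):
        dist_to_closest = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(dist_to_closest)])
    return np.array(centres)

# Example: seed four centres for a random 2-D dataset.
X = np.random.rand(200, 2)
C = kkz_init(X, 4)
```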
The voting matrix $V$ satisfies the following conditions:

(a) $\sum_{j=1}^{N} V(i,j) = 1$,  (b) $V(i,i) = 0$,  (c) $V(i,j) \sim \frac{1}{d(x_i, x_j)}$,  (d) $V(i,j) \le \frac{1}{2}$,

where $d(x_i, x_j)$ denotes the Euclidean distance between the points $x_i, x_j$. The first two conditions ensure point "equality" (each point has the same voting "power"). The third condition ensures the scale/density-invariance property. According to the first three conditions it holds that

$$ V_3(i,j) = \frac{\frac{1}{d(x_i, x_j)}}{\sum_{k \in \{1,\ldots,N\} \setminus \{i\}} \frac{1}{d(x_i, x_k)}} , $$

where $V_3(i,j)$ denotes the voting matrix that satisfies the first three conditions (the sub-index shows the number of satisfied conditions). The last condition is added in order to ensure that each point will vote for the rest of the points, avoiding the special case of pairs of identical points that vote only for each other, which results in wrong voting descriptors (see the end of this section). When all the conditions are satisfied, $V_4(i,j)$ is given by:

$$ V_4(i,j) = \begin{cases} V_3(i,j), & \delta(i) \le 0 \\ \min\!\left(\frac{V_3(i,j)}{1-\delta(i)}, \frac{1}{2}\right), & \delta(i) > 0 \end{cases} \qquad (1) $$
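A minimal sketch of computing the $V_3$ voting matrix from pairwise distances is given below; the correction towards $V_4$ is omitted because $\delta(i)$ is defined elsewhere in the paper, and the small constant guarding against identical points is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def voting_matrix_v3(X, eps=1e-12):
    """V3(i, j): inverse-distance votes of point i, with zero diagonal (condition (b))
    and rows normalised to sum to one (condition (a))."""
    D = cdist(X, X) + eps                    # pairwise Euclidean distances; eps avoids 1/0
    W = 1.0 / D
    np.fill_diagonal(W, 0.0)                 # a point never votes for itself
    return W / W.sum(axis=1, keepdims=True)  # condition (c): votes proportional to 1/d
```

Pairs of (nearly) identical points concentrate almost all of their vote on each other, which is exactly the degenerate case that condition (d) and $V_4$ are designed to limit.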
Fig. 1. (a) The dataset using a colormap according to voting descriptor. Results of
clustering (b) K = 4, SSE = 27.68 and (c) K = 4, SSE = 23.89.
resulting algorithm is called CVR-LMV. Let us assume that two nearby points $x_i, x_j$, $i, j \in \{1, \ldots, N\}$, are misclassified by CVR into the same cluster, so that $V(i,j) \ge V(i,k)$, $\forall k \in \{1,\ldots,N\}$ and $V(j,i) \ge V(j,k)$, $\forall k \in \{1,\ldots,N\}$. Under this assumption, it is possible that if we separately check whether to reassign the point $i$ (or the point $j$) to the true cluster, VM will be reduced, since the point $x_j$ (or the point $x_i$) belongs to a different cluster.
In order to solve this problem without increasing the computational cost of the algorithm, we have introduced the median-based VM, $\widehat{VM}$, which estimates VM based on the median value of the votes of the points, so that it is not affected by nearby points:

$$ \widehat{VM} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i \in p_k} \operatorname{median}_{j \in p_k} \big( V(j,i) \big) \qquad (3) $$
Let $VM$ and $VM'$ denote the validity measure before and after the possible reassignment. Let $\widehat{VM}$ and $\widehat{VM}'$ denote the median-based VM before and after the possible reassignment. According to the proposed algorithm, we reassign the point with index $i$ if $VM' > VM$, or if $\widehat{VM}' > \widehat{VM}$ and $\widehat{VM}' - \widehat{VM} + VM' - VM > 0$. The first condition ensures that $VM$ increases. If only the second condition is true, this will cause a temporary decrease of VM. Since the increase of $\widehat{VM}$ is higher than the decrease of $VM$, the point with index $i$ is closer to the examined cluster and we have to perform the reassignment. In the next steps, we will also reassign the neighbours of the point with index $i$, and $VM$ will increase.
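The reassignment test can be sketched as follows. Since eq. (2) for VM is not reproduced in this excerpt, the plain VM below is taken, for illustration only, as the average vote each point receives from the other members of its own cluster; the median-based variant follows (3).

```python
import numpy as np

def validity_measures(V, labels):
    """Return (VM, VM_median) for a voting matrix V (N x N) and cluster labels (N,)."""
    vm, vm_med = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        votes = V[np.ix_(idx, idx)]                      # votes exchanged inside cluster c
        vm.append(votes.mean())
        vm_med.append(np.median(votes, axis=0).mean())   # median over voters, mean over receivers
    return float(np.mean(vm)), float(np.mean(vm_med))

def should_reassign(vm, vm_new, vmh, vmh_new):
    """Reassignment rule: accept if VM increases, or if the median-based measure
    increases by more than VM decreases."""
    return vm_new > vm or (vmh_new > vmh and (vmh_new - vmh) + (vm_new - vm) > 0)
```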
Figs. 1(b) and 1(c) illustrate two different clustering results of the same
dataset using the CVR-LMV and K-means clustering, respectively. The SSE
of clustering depicted in Fig. 1(c) is 13.69% lower than the SSE of clustering de-
picted in Fig. 1(b). However, the optimal clustering solution is clearly the one depicted in Fig. 1(b).
4 Experimental Results
In this section, the experimental results of our performance study are presented.
We have tested our methods (CVR and CVR-LMV) using six real datasets [8] (Iris, Yeast, Segmentation, Wisconsin, Wine and covtype10k; see Table 1), in which the number of records, the number of clusters, the data dimension, the cluster sizes, and the cluster densities vary.
We have tested the proposed methods with 144 synthetic datasets generated by
c random cluster centroids that are uniformly distributed over the d-dimensional
hypercube (c ∈ {4, 8, 16}, d ∈ {4, 8}). The number of points ni in cluster i is ran-
domly selected from a uniform distribution between min n and max n (min n ∈
{16, 128}, max n − min n ∈ {0, 128}). The ni points in cluster i are randomly
selected around the cluster centroid from a d-dimensional multivariate Gaussian
distribution with covariance matrix Σi = σi2 Id and mean value equal to the
cluster centroid, where σi is randomly selected from a uniform distribution be-
tween min σ and max σ, (min σ ∈ {0.04, 0.08, 0.16}), (max σ −min σ ∈ {0, 0.08}).
The parameters $c$ and $\min\sigma$ take three different values and the remaining parameters take two different values, yielding $3^2 \cdot 2^4 = 144$ datasets.
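A minimal sketch of how one such synthetic dataset could be generated under the description above (uniform centroids in the unit hypercube, isotropic Gaussian clusters); parameter names are illustrative.

```python
import numpy as np

def make_synthetic_dataset(c, d, min_n, max_n, min_sigma, max_sigma, rng=None):
    """c cluster centroids uniform in [0,1]^d; cluster i holds n_i ~ U{min_n..max_n}
    points drawn from N(centroid_i, sigma_i^2 I) with sigma_i ~ U(min_sigma, max_sigma)."""
    rng = rng or np.random.default_rng()
    X, y = [], []
    for i, mu in enumerate(rng.uniform(0.0, 1.0, size=(c, d))):
        n_i = int(rng.integers(min_n, max_n + 1))
        sigma_i = rng.uniform(min_sigma, max_sigma)
        X.append(mu + sigma_i * rng.standard_normal((n_i, d)))
        y.append(np.full(n_i, i))
    return np.vstack(X), np.concatenate(y)

# One configuration from the grid: c=4, d=4, n_i in [16,144], sigma_i in [0.04,0.12].
X, y = make_synthetic_dataset(4, 4, 16, 144, 0.04, 0.12)
```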
In order to evaluate the accuracy of the proposed scheme, we have compared the proposed methods with seven other clustering methods: K-means, the K-means KKZ algorithm [4], the hierarchical agglomerative algorithm based on the average-link linkage metric (HAC-AV) [9], spectral clustering using the Nystrom method without orthogonalization (SCN) and with orthogonalization (SCN-O) [10], the K-means++ method [3], and the PAM algorithm [5]. For the non-deterministic algorithms, 20 trials were performed on each dataset and the average value of the performance metrics was taken. We evaluate the performance using the clustering accuracy (Acc) [10]. Acc ∈ [0, 1] is defined as the percentage of correctly classified points.
Table 1. The accuracy (first 6 lines) and the average Acc (last line) of several clustering
algorithms in 6 real and 144 synthetic datasets (144 S.D.), respectively
Dataset CVR-LMV CVR K-means K-means KKZ HAC-AV SCN SCN-O PAM K-means++
Iris 93.33% 81.33% 84.20% 89.33% 90.67% 89.10% 88.87% 77.43% 85.77%
Yeast 39.22% 42.39% 36.04% 37.80% 32.35% 37.54% 37.10% 32.37% 35.15%
Segmentation 52.14% 37.43% 51.87% 35.62% 14.62% 47.35% 46.55% 52.45% 50.86%
Wisconsin 91.04% 90.51% 85.41% 85.41% 66.26% 73.15% 85.14% 84.97% 85.41%
Wine 71.35% 71.35% 68.20% 56.74% 61.24% 66.04% 60.17% 67.44% 65.65%
covtype10k 37.10% 38.18% 36.41% 35.95% 35.63% 36.20% 36.49% 35.90% 36.97%
144 S.D. 98.71% 97.85% 79.51% 97.51% 97.01% 94.04% 97.08% 78.61% 86.21%
Fig. 2. Selected key frames of the tennis ((a),(b),(c)) and foreman ((e),(f)) videos
5 Conclusions
In this paper, we propose a deterministic point clustering method that can also be used for the video summarization problem. According to the proposed

1 http://media.xiph.org/video/derf/
References
1. Gupta, U., Ranganathan, N.: A game theoretic approach for simultaneous com-
paction and equipartitioning of spatial data sets. IEEE Transactions on Knowledge
and Data Engineering 22, 465–478 (2010)
2. Jain, A.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31,
651–666 (2010)
3. Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 1027–1035 (2007)
4. Katsavounidis, I., Kuo, C.C.J., Zhang, Z.: A new initialization technique for gen-
eralized lloyd iteration. IEEE Signal Processing Letters 1, 144–146 (1994)
5. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Elsevier (2006)
6. Panagiotakis, C., Doulamis, A., Tziritas, G.: Equivalent key frames selection based
on iso-content principles. IEEE Transactions on Circuits and Systems for Video
Technology 19, 447–451 (2009)
7. Panagiotakis, C., Tziritas, G.: Successive group selection for microaggregation.
IEEE Trans. on Knowledge and Data Engineering 99 (accepted, 2011)
8. Blake, C., Keough, E., Merz, C.J.: UCI Repository of Machine Learning Database
(1998), http://www.ics.uci.edu/~mlearn/MLrepository.html
9. Day, W., Edelsbrunner, H.: Efficient algorithms for agglomerative hierarchical clus-
tering methods. Journal of Classification 1, 7–24 (1984)
10. Chen, W.Y., Song, Y., Bai, H., Lin, C.J., Chang, E.Y.: Parallel spectral cluster-
ing in distributed systems. IEEE Transactions on Pattern Analysis and Machine
Intelligence 33, 568–586 (2011)
Motor Pump Fault Diagnosis with Feature
Selection and Levenberg-Marquardt Trained
Feedforward Neural Network
The signals were obtained during a period of five years. To generate the classified
training data, experts in maintenance engineering provided a label for every fault
present in each acquired example. Since several faults can simultaneously occur
in a machine, we construct an independent classification task for each type of
fault (one against all). Naturally the labelling process has been done by different
persons and therefore the ground truth of the class membership is subject to
model errors introduced a priori, i.e. it cannot be excluded that the provided
label is erroneous in some cases.
Fig. 1. Misalignment fault and its manifestation in the frequency spectrum at the first
three harmonics of the shaft rotation frequency. The high energy in the fifth harmonic,
as well as the noise in low frequencies indicate that additionally a hydrodynamic fault
is emerging.
2 Feature Selection
The main idea of feature selection is to obtain data that has a reduced di-
mensionality and more relevant information that can increase the classification
performance. Feature selection is basically composed of a search algorithm and
a selection criterion [11,12,7,13], Another important aspect of feature selection
is the explication of the importance of each feature for the classification process.
Previous work has investigated the problem of feature selection in the context
of fault detection of rotating machines. In [14], features were ranked by their
appraisal, using the sensitivity [3] during the training of a feedforward net with
one hidden layer. Feature selection based on a individual threshold ranking in
the context of tool condition monitoring can be found in [15]. A heuristic based
on binary ant colony is used for feature selection in the context of a rotary kiln
in [16]. Mutual information is the selection criterion proposed in [17] which also
give a considerable overview of feature extraction and selection methods, also cf.
We follow the terminology of [8,9] to define the network calculus and the weight optimization. We expect the reader to be familiar with the basic concepts of a feedforward net and the principle of gradient descent.
3.1 Architecture
Consider as input to the net an $R$-dimensional pattern $\mathbf{p}$ from the Euclidean vector space $\mathbb{R}^R$. The net input with weights $w_{i,j}^{m+1}$ from the $j$th unit in layer $m$ to the $i$th unit in layer $(m+1)$ and the biases $b_i^{m+1}$ is

$$ n_i^{m+1} = \sum_{j=1}^{S^m} w_{i,j}^{m+1} a_j^m + b_i^{m+1} . \qquad (1) $$

$$ a_i^{m+1} = f^{m+1}(n_i^{m+1}) . \qquad (2) $$

The network has $M$ layers, and the output of layer $m$ in matrix form can be written as

$$ \mathbf{a}^{m+1} = \mathbf{f}^{m+1}\!\left(W^{m+1}\mathbf{a}^m + \mathbf{b}^{m+1}\right), \quad m = 0, \ldots, M-1, \quad \mathbf{a}^0 = \mathbf{p}, \qquad (3) $$

where $\mathbf{y} = \mathbf{a}^M$ is the final output of the net. The $S^{m+1} \times S^m$ matrix $W^{m+1}$ contains the weights of layer $(m+1)$. The activation function $f : \mathbb{R} \rightarrow \mathbb{R}$ is usually the logistic sigmoid function $f(n) = 1/(1+\exp(-n))$, the hyperbolic tangent sigmoid function $f(n) = \tanh n$, or the identity function $f(n) = n$. We consider a network with only one hidden layer, i.e. $M = 2$ and $m = 0, 1$, since empirically additional layers do not increase the discriminative power.
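A minimal sketch of the forward pass (1)–(3) for the single-hidden-layer case used here; the layer sizes, the tanh hidden activation and the linear output are illustrative assumptions.

```python
import numpy as np

def forward(p, W1, b1, W2, b2):
    """Two-layer (M = 2) feedforward net: a1 = tanh(W1 p + b1), y = W2 a1 + b2."""
    n1 = W1 @ p + b1        # net input of the hidden layer, eq. (1)
    a1 = np.tanh(n1)        # hidden activations, eq. (2)
    y = W2 @ a1 + b2        # linear output layer
    return y, a1

# Example with R = 8 inputs, 5 hidden units and a single output.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 8)), np.zeros(5)
W2, b2 = rng.standard_normal((1, 5)), np.zeros(1)
y, _ = forward(rng.standard_normal(8), W1, b1, W2, b2)
```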
Motor Pump Fault Diagnosis with Feature Selection 453
$$ e_{k,q} , \quad k = 1, \ldots, S^M , \; q = 1, \ldots, Q . \qquad (4) $$

There are $N = Q \cdot S^M$ such errors. Their gradient with respect to each of the weights and biases,

$$ \frac{\partial e_{k,q}}{\partial w_{i,j}^m} , \; \frac{\partial e_{k,q}}{\partial b_i^m} , \quad k = 1, \ldots, S^M , \; q = 1, \ldots, Q , \; i = 0, \ldots, S^m , \; j = 1, \ldots, S^{m+1} , \; m = 0, \ldots, M-1 , \qquad (5) $$

is calculated and introduced into the $N \times n$ Jacobian matrix $J(x)$.

The sensitivity [9] of the $i$-th unit of the $m$-th layer, $s_i^m$, of conventional backpropagation is replaced by the definition of the Marquardt sensitivity of unit $i$ in layer $m$,

$$ \tilde{s}_{i,h}^m \equiv \frac{\partial e_{k,q}}{\partial n_{i,q}^m} , \qquad (6) $$

where the index $h$ over all $N$ pattern-output unit pairs is calculated as $h = (q-1) S^M + k$, $q = 1, \ldots, Q$, $k = 1, \ldots, S^M$. The elements of the Jacobian can now be obtained as

$$ [J]_{h,\ell} = \frac{\partial e_{k,q}}{\partial w_{i,j}^m} = \frac{\partial e_{k,q}}{\partial n_{i,q}^m} \cdot \frac{\partial n_{i,q}^m}{\partial w_{i,j}^m} = \tilde{s}_{i,h}^m \, a_{j,q}^{m-1} , \qquad [J]_{h,\ell} = \frac{\partial e_{k,q}}{\partial b_i^m} = \tilde{s}_{i,h}^m . \qquad (7) $$

The index $\ell$ over all network connections from layer $m$ to layer $(m+1)$ is calculated as $\ell = j \cdot S^m + i$, $i = 1, \ldots, S^{m+1}$, $j = 0, \ldots, S^m$, $m = 0, \ldots, M-1$.
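Once the Jacobian is assembled, the damped Gauss-Newton step of Levenberg-Marquardt training (not reproduced in this excerpt) is typically of the following form; this is a generic sketch, not the authors' implementation.

```python
import numpy as np

def lm_step(x, e, J, mu):
    """One Levenberg-Marquardt step on the flattened weight/bias vector x:
    solve (J^T J + mu I) dx = -J^T e for the N-dimensional error vector e."""
    n = J.shape[1]
    dx = np.linalg.solve(J.T @ J + mu * np.eye(n), -J.T @ e)
    return x + dx
```

The damping term mu is usually decreased after a step that reduces the error and increased otherwise, interpolating between Gauss-Newton and gradient descent.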
4 Experimental Results
The main objective is to illustrate the advantages of feature selection. With only
a fraction of the original feature set it should be possible to obtain equivalent or
better performance, compared to the complete feature set. As mentioned before,
we use a combination of performance estimation and selection by taking the
performance score as the proper criterion (wrapper).
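A minimal sketch of the wrapper-style sequential forward selection (SFS) loop used here: at each step the feature whose addition maximises the estimated classification performance is kept. The classifier and the scoring function are placeholders, not the authors' code.

```python
def sfs_wrapper(features, score, max_features=10):
    """Greedy SFS; score(subset) must return the estimated accuracy of the
    classifier (e.g., a cross-validated neural net) trained on that subset."""
    selected, curve = [], []
    remaining = list(features)
    while remaining and len(selected) < max_features:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        curve.append(score(selected))    # accuracy vs. number of accumulated features
    return selected, curve
```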
[Fig. 2: plots of estimated accuracy versus the number of SFS accumulated features (1–10), one panel per fault classification task.]
4.2 Discussion
It can be clearly observed from the evolution of the estimated error in Fig. 2
that the feature selection is able to reduce the complexity of the subsequent
classification stage considerably. At the same time the performance is improved
since noise is filtered out. Except for the ’structural pump looseness’ fault, the
estimated performance with only a fraction of the features is higher compared
to the case when taking all available features for the training of the classifier.
This clearly justifies the use of this important step in pattern recognition, also
for this field of application.
5 Conclusion
We have presented a complete system for the diagnosis of faults in a real world
scenario of rotating machinery installed on offshore oil rigs. A feature pool of
frequency measurements is provided. From this pool, a subset is selected that
achieves better performance with less complexity. Future work will concentrate
on other sensors, feature models and performance estimation techniques.
References
1. Tavner, P.J., Ran, L., Penman, J., Sedding, H.: Condition Monitoring of Electrical
Machines. The Institution of Engineering and Technology, London (2008)
Unobtrusive Fall Detection at Home Using Kinect Sensor
1 Introduction
In almost all countries of the world the elderly population is continuously increasing. Improving the quality of life of the increasingly elderly population is one of the most central challenges facing our society today. As humans become old, their bodies weaken and the risk of accidental falls rises noticeably [12]. A fall can lead to severe injuries such as broken bones, and a fallen person might need assistance to get up again. Falls lead to a loss of self-confidence, a loss of independence and a higher risk of morbidity and mortality. Thus, in recent years a lot of research has been devoted to the development of unobtrusive fall detection methods [15]. However, despite many efforts undertaken to achieve reliable and unobtrusive fall detection [16], the existing technology does not meet the seniors' needs [18]. The main reason is that it does not preserve privacy and unobtrusiveness adequately. In particular, the current solutions generate too many false alarms, which in turn lead to considerable frustration of the seniors.
Most of the currently available techniques for fall detection are based on
body-worn or built-in devices. They typically employ accelerometers or both ac-
celerometers and gyroscopes [16]. However, on the basis of such sensors it is not
easy to separate real falls from fall-like activities [2]. They typically trigger a significant number of false alarms. Moreover, the detectors, which are typically worn on a belt around the hip, are obstructive and uncomfortable during sleep [7]. What is more, their monitoring performance in critical phases like getting up from the bed or the chair is relatively poor.
In recent years, a lot of research has been done on detecting falls using a wide range of sensor types [16][18], including pressure pads [17], a single CCD camera [1], multiple cameras [6], specialized omni-directional cameras [14] and stereo-pair cameras [8]. Video cameras have several advantages over other sensors, including the capability of recognizing a variety of activities. An additional benefit is low intrusiveness and the possibility of remote verification of fall events. However, the solutions that are available at present require time for installation and camera calibration, and in general they are not cheap. Additionally, the lack of 3D information can lead to many false alarms. Moreover, in the vast majority of such systems the privacy is not preserved adequately.
Recently, the Kinect sensor was employed in fall detection systems [9][10][13].
It is the world’s first low-cost device that combines an RGB camera and a depth
sensor. Unlike 2D cameras, it allows tracking the body movements in 3D. Thus,
if only depth images are used it preserves the privacy. Since it is equipped with
an active light source it is independent of external light conditions. Owing to
using the infrared light it is capable of extracting depth images in dark rooms.
In this work we demonstrate an approach to fall detection using only depth
images. The person is detected on the basis of the depth reference image. We
demonstrate a method for updating the depth reference image with a low compu-
tational cost. The ground plane is extracted automatically using the v-disparity
images, Hough transform and the RANSAC algorithm. Fall detection is achieved
using a classifier trained on features representing the extracted person both in
depth images and in point clouds.
Depth is a very useful cue for achieving reliable person detection because humans may not have consistent color and texture but have to occupy an integrated region in space. The depth images were acquired by the Kinect sensor using the OpenNI (Open Natural Interaction) library. The sensor has an infrared laser-based IR emitter, an infrared camera and an RGB camera. The IR camera and the IR projector form a stereo pair with a baseline of approximately 75 mm. Kinect depth measurement is based on structured light, performing a triangulation between the emitted dot pattern and the pattern captured by the IR CMOS sensor. The pixels in the depth images indicate calibrated depth in the scene. Kinect's angular field of view is 57° horizontally and 43° vertically. The sensor has a practical ranging limit of about 0.6–5 m. It captures depth and color images simultaneously at a frame rate of about 30 fps. The default RGB video stream has a size of 640 × 480 pixels with 8 bits per channel. The depth stream has 640 × 480 resolution with 11-bit depth, which provides 2048 levels of sensitivity.
Due to occlusions it is not easy to detect a person using only a single camera and depth images. The software called NITE from PrimeSense offers skeleton tracking on the basis of images acquired by the Kinect sensor. However, this software is targeted at supporting human-computer interaction, not at detecting a person's fall. Thus, in many circumstances it can have difficulties in extracting and tracking the person's skeleton [10].
The person was detected on the basis of a scene reference image, which was extracted in advance and then updated on-line. In the depth reference image each pixel assumes the median value of several pixel values from the past images. In the set-up stage we collect a number of depth images, and for each pixel we assemble a list of pixel values from the former images, which is then sorted in order to extract the median. Given the sorted lists of pixels, the depth reference image can be updated quickly by removing the oldest pixels, updating the sorted lists with the pixels from the current depth image and then extracting the median value. We found that for typical human motions, good results can be obtained using 13 depth images. For the Kinect acquiring images at 25 Hz we take every fifteenth image.
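A minimal sketch of the per-pixel median reference described above, using a rolling buffer of 13 subsampled frames; the difference threshold used for the foreground mask is an assumption, and a full per-pixel sorted-list update (as in the paper) would be faster than recomputing the median.

```python
import numpy as np
from collections import deque

class DepthReference:
    """Per-pixel median depth reference built from every 15th depth frame."""
    def __init__(self, buffer_size=13, subsample=15):
        self.frames = deque(maxlen=buffer_size)
        self.subsample = subsample
        self.count = 0

    def update(self, depth):                 # depth: H x W array in millimetres
        if self.count % self.subsample == 0:
            self.frames.append(depth.copy())
        self.count += 1
        return np.median(np.stack(self.frames), axis=0)

def foreground_mask(depth, reference, thresh_mm=100):
    """Pixels that differ from the reference by more than thresh_mm."""
    return np.abs(depth.astype(np.int32) - reference.astype(np.int32)) > thresh_mm
```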
Figure 1 illustrates some example depth reference images obtained using the discussed technique. In image #500 we can see an office with a closed door, which was then opened to demonstrate how the algorithm updates the reference image. In frames #650 and #800 we can see that the opened door appears temporarily in the binary image, and then it disappears in frame #1000. As we can observe, the updated reference image is clutter-free and allows us to extract the person's silhouette in the depth images. In order to eliminate small objects, the depth connected components were extracted and small artifacts were then eliminated. Alternatively, the depth images can be cleaned using morphological erosion. When the person does not move, the reference image is not updated.
Fig. 1. Person segmentation using depth reference image. RGB images (upper row),
depth (middle row) and binary images depicting the delineated person (bottom row).
In the detection mode the foreground objects are extracted by differencing the current image from such a reference depth map. Afterwards, the foreground object is determined by extracting the largest connected component in the thresholded difference map. Alternatively, the subject could be delineated using a pre-trained person detector. However, with privacy in mind, the use of a person detector operating on depth images or point clouds leads to a lower detection ratio and a higher computational cost.
Fig. 2. V-disparity map calculated on depth images from Kinect: RGB image a), cor-
responding depth image b), v-disparity map c)
The line corresponding to the floor pixels in the v-disparity map was extracted
using the Hough transform. Assuming that the Kinect is placed at a height of about 1 m above the floor, the line representing the floor should begin at disparities ranging from 15 to 25, depending on the tilt angle of the sensor. In Fig. 3 we can see some example lines extracted from the v-disparity images, which were obtained on the basis of images acquired in typical rooms such as an office (see Fig. 2c) or a classroom.
The line corresponding to the floor was extracted using the Hough transform (HT) operating on v-disparity values and a predefined range of parameters. The accumulator was incremented by the v-disparity values, see Fig. 4a. It is worth noting that an ordinary HT operating on thresholded v-disparity images often gives incorrect results, see Fig. 4b, where the extremum is quite close to 0 deg.
Fig. 4. Accumulator of the Hough transform: operating on v-disparity values a), thresh-
olded v-disparity images b). The accumulator depicted on figure a) is divided by 100.
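A minimal sketch of building a v-disparity map (a per-row histogram of disparities) and of the value-weighted Hough vote described above; the parameter grid and array sizes are assumptions.

```python
import numpy as np

def v_disparity(disparity, d_max=256):
    """V-disparity map: for every image row, the histogram of its disparity values."""
    rows = disparity.shape[0]
    vmap = np.zeros((rows, d_max), dtype=np.float32)
    for v in range(rows):
        d = disparity[v]
        d = d[(d > 0) & (d < d_max)].astype(int)
        np.add.at(vmap[v], d, 1.0)
    return vmap

def floor_line_hough(vmap, slopes, intercepts):
    """Vote for line candidates d = a*v + b; each (v, d) cell contributes its
    v-disparity value to the accumulator, as in Fig. 4a."""
    acc = np.zeros((len(slopes), len(intercepts)))
    rows = np.arange(vmap.shape[0])
    for i, a in enumerate(slopes):
        for j, b in enumerate(intercepts):
            d = np.round(a * rows + b).astype(int)
            ok = (d >= 0) & (d < vmap.shape[1])
            acc[i, j] = vmap[rows[ok], d[ok]].sum()
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return slopes[i], intercepts[j]
```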
Given the line extracted in this way, the pixels belonging to the floor areas were determined. Due to measurement inaccuracies, pixels falling into some disparity extent $d_t$ were also considered as belonging to the ground. Assuming that $d_y$ is the disparity on the line at row $y$, which represents the pixels belonging to the ground plane, we take into account the disparities from the range $d \in (d_y - d_t, d_y + d_t)$ as a representation of the ground plane. Given the line extracted by the Hough transform, the points on the v-disparity image with the corresponding depth pixels were selected and then transformed to the point cloud [10]. After the transformation of the pixels representing the floor to the 3D point cloud, the plane described by the equation $ax + by + cz + d = 0$ was recovered. The parameters $a$, $b$, $c$ and $d$ were estimated using the RANSAC algorithm. The distance from the 3D centroid of the point cloud corresponding to the segmented person to the ground plane was determined on the basis of the following equation:

$$ D = \frac{|aX_c + bY_c + cZ_c + d|}{\sqrt{a^2 + b^2 + c^2}} \qquad (2) $$

where $X_c, Y_c, Z_c$ stand for the coordinates of the centroid.
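A minimal sketch of the RANSAC plane fit and the centroid-to-plane distance (2); the inlier tolerance and iteration count are illustrative assumptions.

```python
import numpy as np

def fit_plane_ransac(points, n_iter=200, tol=0.02, rng=None):
    """Fit a plane a*x + b*y + c*z + d = 0 to an (N, 3) point cloud with RANSAC."""
    rng = rng or np.random.default_rng()
    best, best_inliers = None, -1
    for _ in range(n_iter):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                               # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        dist = np.abs(points @ n - n @ p0)         # point-to-plane distances
        inliers = int((dist < tol).sum())
        if inliers > best_inliers:
            best, best_inliers = (n[0], n[1], n[2], -float(n @ p0)), inliers
    return best                                     # (a, b, c, d)

def distance_to_plane(centroid, plane):
    """Eq. (2): distance of the person's centroid (Xc, Yc, Zc) from the ground plane."""
    a, b, c, d = plane
    return abs(a*centroid[0] + b*centroid[1] + c*centroid[2] + d) / np.sqrt(a*a + b*b + c*c)
```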
4 Experimental Results
In total 312 images representing typical human actions were selected and then
utilized to extract the following features:
Figure 6 depicts a scatterplot matrix for the employed attributes, in which a collection of scatterplots is organized in a two-dimensional matrix to simultaneously provide correlation information among the attributes. In a single scatterplot two attributes are projected along the x-y axes of the Cartesian coordinates. As we can observe, the overlaps in the attribute space are not too significant. We also considered other attributes, for instance the filling ratio of the rectangles making up the person's bounding box. The worth of the features was evaluated on the basis of the information gain [4], which measures the dependence between the feature and the class label. In the evaluation we utilized the InfoGainAttributeEval procedure from Weka [5], which is a collection of machine learning algorithms.
The classification accuracy was evaluated in 10-fold cross-validation using the Weka software. The falls were classified using KStar [3], AdaBoost, SVM, multilayer perceptron (MLP), Naïve Bayes (NB) and k-NN classifiers. The KStar and
MLP classified all falls correctly, whereas the remaining algorithms incorrectly classified 2 instances. The number of images containing a person's fall was equal to 110. The system was implemented in C/C++ and runs at 25 fps on a 2.4 GHz i7 (4 cores, Hyper-Threading) notebook powered by Linux. The most computationally demanding operation is the extraction of the depth reference image. For images of size 640 × 480 the computation time needed for the extraction of the depth reference image is about 9 milliseconds. On the PandaBoard, which is a low-power, low-cost single-board computer development platform, this operation can be completed in 0.15 s. We are planning to implement the whole system on the PandaBoard.
5 Conclusions
In this work we demonstrated our approach to fall detection using Kinect. The fall detection is done on the basis of the segmented person in the depth images. The segmentation of the person takes place using an updated depth reference image of the scene. For the person extracted in this way, the corresponding point cloud is then extracted. The ground plane is determined automatically using the v-disparity images, the Hough transform and the RANSAC algorithm. The fall is detected using a classifier built on features extracted both from the depth images and from the point cloud corresponding to the extracted person. The system achieves a high detection rate. On an image set consisting of 312 images, of which 110 contained human falls, all fall events were recognized correctly.
References
1. Anderson, D., Keller, J., Skubic, M., Chen, X., He, Z.: Recognizing falls from sil-
houettes. In: Annual Int. Conf. of the Engineering in Medicine and Biology Society,
pp. 6388–6391 (2006)
2. Bourke, A., O’Brien, J., Lyons, G.: Evaluation of a threshold-based tri-axial ac-
celerometer fall detection algorithm. Gait & Posture 26(2), 194–199 (2007)
3. Cleary, J., Trigg, L.: An instance-based learner using an entropic distance measure.
In: Int. Conf. on Machine Learning, pp. 108–114 (1995)
4. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley (1992)
5. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and
Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
6. Cucchiara, R., Prati, A., Vezzani, R.: A multi-camera vision system for fall detec-
tion and alarm generation. Expert Systems 24(5), 334–345 (2007)
7. Degen, T., Jaeckel, H., Rufer, M., Wyss, S.: Speedy: A fall detector in a wrist
watch. In: Proc. of IEEE Int. Symp. on Wearable Computers, pp. 184–187 (2003)
8. Jansen, B., Deklerck, R.: Context aware inactivity recognition for visual fall detec-
tion. In: Proc. IEEE Pervasive Health Conference and Workshops, pp. 1–4 (2006)
9. Kepski, M., Kwolek, B., Austvoll, I.: Fuzzy inference-based reliable fall detection
using kinect and accelerometer. In: Rutkowski, L., Korytkowski, M., Scherer, R.,
Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012, Part I. LNCS,
vol. 7267, pp. 266–273. Springer, Heidelberg (2012)
10. Kepski, M., Kwolek, B.: Human fall detection using kinect sensor. In: Burduk, R.,
Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds.) CORES 2013.
AISC, vol. 226, pp. 743–752. Springer, Heidelberg (2013)
11. Labayrade, R., Aubert, D., Tarel, J.P.: Real time obstacle detection in stereovi-
sion on non flat road geometry through “v-disparity” representation. In: IEEE
Intelligent Vehicle Symposium, vol. 2, pp. 646–651 (2002)
12. Marshall, S.W., Runyan, C.W., Yang, J., Coyne-Beasley, T., Waller, A.E., Johnson,
R.M., Perkis, D.: Prevalence of selected risk and protective factors for falls in the
home. American J. of Preventive Medicine 8(1), 95–101 (2005)
13. Mastorakis, G., Makris, D.: Fall detection system using Kinect’s infrared sensor.
J. of Real-Time Image Processing, 1–12 (2012)
14. Miaou, S.G., Sung, P.H., Huang, C.Y.: A customized human fall detection system
using omni-camera images and personal information. Distributed Diagnosis and
Home Healthcare, 39–42 (2006)
15. Mubashir, M., Shao, L., Seed, L.: A survey on fall detection: Principles and ap-
proaches. Neurocomputing 100, 144–152 (2013), special issue: Behaviours in video
16. Noury, N., Fleury, A., Rumeau, P., Bourke, A., ÓLaighin, G., Rialle, V., Lundy,
J.: Fall detection - principles and methods. In: Annual Int. Conf. of the IEEE
Engineering in Medicine and Biology Society, pp. 1663–1666 (2007)
17. Tzeng, H.W., Chen, M.Y., Chen, J.Y.: Design of fall detection system with floor
pressure and infrared image. In: Int. Conf. on System Science and Engineering, pp.
131–135 (2010)
18. Yu, X.: Approaches and principles of fall detection for elderly and patient. In: 10th
Int. Conf. on e-Health Networking, Applications and Services, pp. 42–47 (2008)
“BAM!” Depth-Based Body Analysis in Critical Care
1 Introduction
Fig. 1. Left: the Medical Recording Device developed within the VIPSAFE project monitors
the patient and the ICU environment. Right: The Bed Aligned Map (BAM) is a height based
representation aligned to the surface of the bed (best viewed in color).
We identified three main problems for computer vision ICU monitoring:

Occlusion: As most of the body is occluded by a blanket, high-level approaches that rely on the shape of the body, such as poselets [3] and body-part detectors [12], are not effective.
Lack of Datasets: Due to privacy concerns, there is no public dataset to train data
intensive models.
Night Monitoring: Night monitoring can be done under infrared illumination [10, 11],
but color information is lost.
Depth cameras have been used successfully to automatically estimate breathing rate
in clothing-occluded ICU patients [1,10]. Depth cameras allow us to overcome the night
monitoring problems as they are independent of the light conditions, and volumetric
information can be extracted even when the patient is covered by the bed clothing. Cap-
turing a meaningful depth field requires an active depth camera like Kinect, as stereo
cameras are unable to capture an accurate depth field due to the lack of texture in most
medical clothing.
Fig. 2. Bed localization even when a patient is sleeping on it (best viewed in color). From left to right: post-filtered tiles from a patient lying in fetal position, the outline of the estimated bed position, and the corresponding BAM; the same for a patient lying in supine position.
In this paper we go one step further and propose the Bed Aligned Map (BAM), a
robust representation model aligned to the bed surface. To this end we develop a novel
algorithm able to localize the bed even when somebody is sleeping on it.
Although some indicators (e.g., bed occupancy, body location with respect to the
bed) can be obtained directly from BAM, its main advantage is its capability to easily
combine multiple observations of several patients, simplifying the development of ma-
chine learning based classifiers. We prove this capability by training a sleeping position
classifier using data from only 23 subjects and achieving a 100% accuracy in a 4-class
test.
2 Experimental Setup
Within the framework of the VIPSAFE1 [11] project, we have developed a Medical
Recording Device (Fig. 1) with a large variety of sensors and cameras. This project
uses the depth camera (derived from Kinect) which provides a 640x480@30fps depth
map. We recorded 23 male and female subjects from different ethnicities and ages be-
tween 14 and 50; and they were asked to perform a sequence of 45 actions divided in
5 scenarios. To capture a wider range of behaviors, subjects were given only minimal
guidance, relying on their own interpretation. To evaluate sleeping positions they were
asked to lie on their back, and then move to a lateral right position followed by lateral
left position. To evaluate agitation they were asked to be relaxed, then to show small dis-
tress, followed by increased distress and strong distress. This minimal guidance resulted
in strongly different interpretations of distress, which was our goal.
Beds in critical care are wheeled, articulated and can be installed in a variety of configu-
rations. Commonly a wall with medical equipment is behind the head of the patient, but
having the wall along the side of the patient is not unheard of. Finally, in the most ver-
satile medical wards, most equipment is also attached to mobile stands around the bed
in order to accommodate the different requirements a patient may have. Therefore the
location of the bed must be determined to accurately select the Region of Interest (ROI).
In most studies the ROI is fixed or manually defined [1, 5, 16]. Kittipanya-Ngam et al. [8] suggest an automated algorithm which models the bed as a rigid rectangular surface, using edges and the Hough transform for localization. However, the articulated beds used in critical care are divided into several segments which can be adjusted at different inclinations to better suit the needs of the patient (Fig. 1). The baseline approach used in VIPSAFE [11] used region growing in the depth field to find a low-curvature area; this approach was successful on articulated beds but required the bed to be empty.
We present here an approach that improves our previous work by enabling the detec-
tion of non-empty articulated beds. The algorithm performs the following steps:
1 VIPSAFE: Visual Monitoring for Improving Patient Safety, https://cvhci.anthropomatik.kit.edu/project/vipsafe
Prefiltering: Non-smooth pixels in the disparity image are discarded: a pixel is considered smooth if the difference in disparity between itself and its neighbors is at most 1. This removes noisy pixels and pixels adjacent to edges.

Tile Splitting: Pixels are grouped into tiles of 16×16; the center and normal vector of each tile are estimated. The size is chosen to be small enough to offer good spatial resolution, but large enough to determine the direction of the normal vector with precision.

Tile Filtering: Tiles below the minimum height of the bed are discarded. Tiles tilted more than 45 degrees with respect to the ground are discarded (usually walls and medical equipment). At this point most remaining tiles belong to the bed.

2D Estimation: The remaining tiles are projected onto the ground plane. In the ground plane we fit the smallest 2D bounding box containing 95% of the remaining tiles. The long side of the bounding box can have a varying size due to the bed articulation, but the short side is assumed to be fixed. If it is not close to the measured width of the bed, the estimation is discarded (Fig. 2).

3D Estimation: To compensate for the articulation of the bed, its height profile is estimated along the long side of the bounding box using the convex hull of the detected points. We assume that the bed is wide enough not to be covered entirely by the patient. Thus each horizontal cut of the bed will provide at least one measurement of its mattress height.

Normalization: The estimated 2D surface of the bed is divided into sections of 10×10 cm and the average height above the mattress is calculated for each section (Fig. 2). Sections without a height estimate (rare) are interpolated from their neighbors.
The resulting representation estimates the average height of the patient with respect
to a planar bed mattress; we call it Bed Aligned Map (BAM). It is independent of
bed localization and lighting conditions and allows us to compare the behavior of patients in different medical institutions, which was not possible until now.
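A minimal sketch of the final normalization step: foreground points already expressed in the bed frame are averaged over 10×10 cm cells to form the BAM. A flat mattress reference height is assumed here, whereas the paper compensates for the bed articulation with a height profile; all names are illustrative.

```python
import numpy as np

def bed_aligned_map(points_bed, bed_length, bed_width, mattress_height, cell=0.10):
    """BAM from points in the bed frame (x along the bed, y across it, z up, metres):
    each cell stores the mean point height above the mattress surface."""
    nx, ny = int(np.ceil(bed_length / cell)), int(np.ceil(bed_width / cell))
    bam = np.zeros((nx, ny))
    counts = np.zeros((nx, ny))
    ix = np.clip((points_bed[:, 0] / cell).astype(int), 0, nx - 1)
    iy = np.clip((points_bed[:, 1] / cell).astype(int), 0, ny - 1)
    np.add.at(bam, (ix, iy), points_bed[:, 2] - mattress_height)
    np.add.at(counts, (ix, iy), 1)
    filled = counts > 0
    bam[filled] /= counts[filled]          # empty cells (rare) would be interpolated
    return bam
```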
We estimated the bed localization once every 10 seconds (a total of 2505 times). We accepted a bed estimation if the detected bed width was within 5 cm of the actual value. In total, 91.8% of the bed estimations were accepted, and the mean standard deviation measured was 13.6 mm for the width and 31.4 mm for the length.
4 Body Analysis
Fig. 3. BAM representations of subjects lying in supine position (top), on the left side (middle),
and on the right side (bottom). The estimated center of gravity of the body is displayed as a circle.
Best viewed in color.
[Plot: bed volume (l) over time (s), with events A–D marked.]
Fig. 4. Bed Occupancy: subject enters the bed (A), changes two times of sleeping position (B,
C), and leaves the bed (D). Note how the volume never reaches zero as the pillow and the bed
clothing occupy a significant amount of space.
4.2 Agitation
Agitation is the main indicator recorded in several computer vision ICU monitoring systems. Due to the difficulty of localizing the body precisely, it is usually quantified by analyzing the changes between consecutive images. This approach is not robust to changes in illumination [13], although light-invariant feature descriptors have been used to compensate for global illumination changes [16].

We propose an agitation measure defined as the mean difference between the maximum and minimum cell height over all BAMs captured within one second. The resulting measure has volumetric units and is independent of the viewpoint used to capture it.
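A minimal sketch of this agitation measure; whether the max-min range is taken per cell (as below) and how it is scaled to volumetric units are assumptions made for illustration.

```python
import numpy as np

def agitation(bams, cell_area_m2=0.01):
    """Agitation over a one-second window of BAMs (equally shaped 2-D arrays, metres):
    mean per-cell height range, scaled by the 10x10 cm cell area and expressed in litres."""
    stack = np.stack(bams)                          # (frames, nx, ny)
    per_cell_range = stack.max(axis=0) - stack.min(axis=0)
    return float(per_cell_range.mean() * cell_area_m2 * 1000.0)
```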
Lacking a standard procedure to measure agitation, we asked the subjects to show
three different levels of distress in supine and fetal positions. To compare between
(a) Supine (b) Fetal
Fig. 5. Mean and standard deviation of the agitation values of subjects when instructed to rest or
show a low, mild and strong distress respectively
Table 2. Confusion matrices of sleep position classification using BAM. The high accuracy obtained with the simple 1NN approach endorses the quality of the BAM as a robust representation, while using LMNN and PCA achieves 100% accuracy.

(a) 1NN
          empty  supine  left  right
  empty     21      2      0      0
  supine     0     20      3      0
  left       1      3     18      1
  right      0      0      3     20

(b) PCA-LMNN
          empty  supine  left  right
  empty     23      0      0      0
  supine     0     23      0      0
  left       0      0     23      0
  right      0      0      0     23
5 Conclusions
We address two principal challenges that computer vision approaches face in critical
care monitoring: First, bedridden patients in hospitals are often covered by a textureless
blanket. This makes it hard for computer vision algorithms to estimate body parameters
and articulation. But, it is possible to detect movements and rough shapes beneath the
blanket, especially when depth information is available. Second, intensive care units are
dynamic environments in which the location of the bed or sensor can be changed by the
hospital personnel at any time.
We address these two challenges and introduce the Bed Aligned Map (BAM), which
extracts and aligns the image patch that contains the bed. BAM is calculated from depth
information, is view and light independent and does not require markers. We show some
indicators that can be obtained directly from BAM (bed occupancy, body location with
respect to the bed) and present a robust metric to quantify body agitation. Furthermore,
the BAM facilitates the development of machine learning based classifiers, because the
alignment allows us to combine observations of several patients. We use this property
to develop a sleeping position classifier where we discern between an empty bed, a
patient lying on his back, and a patient lying on his left and right sides. On this 4-class
problem a naive nearest-neighbor approach using BAM achieves 85.9% accuracy, while a combined LMNN and PCA approach achieves 100% accuracy in a 23-subject experiment.
References
1. Aoki, H., Takemura, Y., Mimura, K., Nakajima, M.: Development of non-restrictive sens-
ing system for sleeping person using fiber grating vision sensor. In: Micromechatronics and
Human Science (2001)
472 M. Martinez, B. Schauerte, and R. Stiefelhagen
2. Becouze, P., Hann, C., Chase, J., Shaw, G.: Measuring facial grimacing for quantifying pa-
tient agitation in critical care. In: Computer Methods and Programs in Biomedicine (2007)
3. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annota-
tions. In: ICCV (2009)
4. Chanques, G., Jaber, S., Barbotte, E., Violet, S., Sebbane, M., Perrigault, P.F., Mann, C.,
Lefrant, J.Y., Eledjam, J.J.: Impact of systematic evaluation of pain and agitation in an inten-
sive care unit* (2006)
5. Geoffrey Chase, J., Agogue, F., Starfinger, C., Lam, Z., Shaw, G.M., Rudge, A.D., Sirisena,
H.: Quantifying agitation in sedated icu patients using digital imaging. In: Computer Meth-
ods and Programs in Biomedicine (2004)
6. Grap, M.J., Hamilton, V.A., McNallen, A., Ketchum, J.M., Best, A.M., Isti Arief, N.Y.,
Wetzel, P.A.: Actigraphy: Analyzing patient movement. Heart & Lung: The Journal of Acute
and Critical Care (2011)
7. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest
neighbor classification. In: NIPS (2006)
8. Kittipanya-Ngam, P., Guat, O., Lung, E.: Computer vision applications for patients monitor-
ing system. In: FUSION (2012)
9. Mansor, M., Yaacob, S., Nagarajan, R., Che, L., Hariharan, M., Ezanuddin, M.: Detection of
facial changes for ICU patients using knn classifier. In: ICIAS (2010)
10. Martinez, M., Stiefelhagen, R.: Breath rate monitoring during sleep using near-ir imagery
and pca. In: ICPR (2012)
11. Martinez, M., Stiefelhagen, R.: Automated multi-camera system for long term behavioral
monitoring in intensive care units. In: MVA (2013)
12. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic
assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS,
vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
13. Naufal Bin Mansor, M., Yaacob, S., Nagarajan, R., Hariharan, M.: Patient monitoring in ICU
under unstructured lighting condition. In: ISIEA (2010)
14. Ouimet, S., Kavanagh, B.P., Gottfried, S.B., Skrobik, Y.: Incidence, risk factors and conse-
quences of ICU delirium. Intensive Care Medicine (2007)
15. Paquay, L., Wouters, R., Defloor, T., Buntinx, F., Debaillie, R., Geys, L.: Adherence to pres-
sure ulcer prevention guidelines in home care: a survey of current practice. Journal of Clinical
Nursing (2008)
16. Reyes, M., Vitria, J., Radeva, P., Escalera, S.: Real-time activity monitoring of inpatients. In:
MICCAT (2010)
3-D Feature Point Matching for Object
Recognition Based on Estimation
of Local Shape Distinctiveness
1 Introduction
Bin-picking systems are an important means for developing automated cell manufacturing systems. An important requirement for such systems is a reliable and high-speed means of recognizing the pose of an object in scenes that contain many randomly stacked identical objects.
In the field of 3-D object recognition, many model-based object recognition methods have been proposed. These methods estimate the pose parameters of objects by matching an object model to an input range image. The Spin Image method [1] is a typical model-based method. It uses pose-invariant features created by calculating the direction of normal vectors at each point of an object model. However, its computational cost is high because it is
necessary to calculate the feature values from all points of the object model.
Other methods [2][3] using edge information with depth values have been proposed. These methods can achieve high-speed recognition because they use only local information of the object model. For randomly stacked objects, however, mismatches may frequently occur due to pseudo-edges caused by overlapping objects.
As other model-based approaches, several high-speed recognition methods have been proposed that use only feature points for the matching process. For example, the DAI (Depth Aspect Image) matching method [4] and the Local Surface Patch method [5] are typical methods that take this approach. These methods use distinctive local shapes that have large curvature, so they are effective in some cases. However, when there are many local shapes with large curvature, mismatches increase.
A recent study proposed a 3-D local descriptor called SHOT (Signature of
Histograms of OrienTations)[6][7]. This method uses only one corresponding
point with SHOT descriptors, so high-speed recognition is achieved. A problem
with it, however, is that pose parameter calculation becomes difficult when the
SHOT descriptor is disturbed by outliers due to multiple objects.
A more substantial problem is that no practical 3-D object recognition meth-
ods have yet been developed that achieve both high speed and reliability.
The purpose of our research, therefore, is to develop a new method that can
achieve both reliability and high speed. From the viewpoint of efficient process-
ing, our method can be categorized as a feature-point-based matching method
using “point cloud data”.
We assume that the object model in this study consists of point cloud data
with 3-D coordinates. Each point of the object model has an attribute value
that represents the local shape around the interest point. The attribute value is
represented as an orientation histogram of a normal vector, which is calculated
by using several neighboring feature points around the interest point. As men-
tioned above, the attribute value of a model point represents its local shape. Before matching the model points to acquired data, we determine the distinctiveness of all points by calculating the relative similarity between all possible pairs of points in the object model. Rather than all of the feature points, only a small number of them with high distinctiveness are used in the matching process. Using this effective feature-point selection based on estimated distinctiveness, we achieve both reliable and high-speed recognition.
In Section 2 we explain the key idea of and concrete algorithm for the proposed
method. In Section 3 we demonstrate, through experimental results on many real range images, that our method has better performance than
conventional methods such as the Spin Image method.
2 Proposed Method
2.1 Basic Idea
In this study, we introduce two basic ideas.
Figure 2 shows the method for creating a normal distribution histogram, which
is the local shape descriptor we propose in this study.
First, a sphere region of radius r is set around the interest point n. Next, the angle θ between the normal vector N_n and each other normal vector N_mt contained in the sphere region is calculated. From these angles, the normal distribution histogram of θ is created. This histogram represents the local shape around an interest point using its neighboring points. Even if the input data contain outliers, stable feature description is possible because the proposed descriptor uses many neighboring points of the interest point for feature description. This process is applied to all points of the object model.
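As an illustration, a minimal sketch of such a normal-angle histogram is given below. It assumes the model is available as NumPy arrays of 3-D points and unit normals; the radius, the number of bins and the function names are illustrative choices, not values taken from the paper.

```python
import numpy as np

def normal_angle_histogram(points, normals, idx, radius=0.01, n_bins=18):
    """Histogram of the angles between the normal at point `idx` and the
    normals of its neighbours inside a sphere of radius `radius`."""
    d = np.linalg.norm(points - points[idx], axis=1)
    neighbours = np.where((d < radius) & (d > 0))[0]
    cos_theta = np.clip(normals[neighbours] @ normals[idx], -1.0, 1.0)
    theta = np.arccos(cos_theta)                      # angles in [0, pi]
    hist, _ = np.histogram(theta, bins=n_bins, range=(0.0, np.pi))
    return hist / max(hist.sum(), 1)                  # normalised histogram
```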
Next, we explain the method for calculating the distinctiveness value at each
point of the object model.
The dissimilarity value B is calculated by the Bhattacharyya coefficient between the normal distribution histograms of the interest point and of another point, created as described in Subsection 2.3:

B(P, Q) = 1 − \sum_{u=1}^{U} \sqrt{P_u Q_u},   (1)

S_n = \frac{1}{T} \sum_{t=1}^{T} B(p_n, q_t).   (2)
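A minimal sketch of the distinctiveness computation implied by Eqs. (1) and (2) is given below, assuming normalised histograms stored as NumPy arrays; whether the self-term is excluded from the average and how many points are kept are assumptions made here for illustration.

```python
import numpy as np

def bhattacharyya_dissimilarity(p, q):
    """B(P, Q) = 1 - sum_u sqrt(P_u * Q_u), cf. Eq. (1)."""
    return 1.0 - np.sum(np.sqrt(p * q))

def distinctiveness(histograms):
    """Distinctiveness S_n of every model point, cf. Eq. (2): the average
    dissimilarity of its histogram to the histograms of the other points."""
    H = np.sqrt(np.asarray(histograms))
    B = 1.0 - H @ H.T               # pairwise Bhattacharyya dissimilarities
    n = len(H)
    return B.sum(axis=1) / (n - 1)  # self-dissimilarity is zero, so exclude it

def select_distinctive_points(histograms, n_keep=50):
    """Keep only the n_keep most distinctive feature points for matching."""
    return np.argsort(distinctiveness(histograms))[::-1][:n_keep]
```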
Fig. 3. Overview of pose recognition scheme using distinctive feature points extracted
from object model
R = \frac{1}{T} \sum_{t=1}^{T} \left| m_t − f(i, j) \right|   (4)
where mt represents the t-th point in a transformed model object. The value
R represents the difference between a transformed object model and an input
range image. A low R value means high consistency.
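A minimal sketch of this consistency check is given below; the pose representation and the `project` helper, which maps a transformed 3-D point to its pixel coordinates and depth, are hypothetical placeholders rather than functions from the paper.

```python
import numpy as np

def consistency_error(model_points, rotation, translation, range_image, project):
    """Average per-point difference R (cf. Eq. 4) between a transformed
    object model and an input range image; a low value means high consistency."""
    transformed = model_points @ rotation.T + translation
    errors = []
    for m in transformed:
        (i, j), depth = project(m)          # hypothetical projection helper
        if 0 <= i < range_image.shape[0] and 0 <= j < range_image.shape[1]:
            errors.append(abs(depth - range_image[i, j]))
    return np.mean(errors) if errors else np.inf
```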
The random method extracted feature points from smoothly shaped parts, which account for a large portion of the object model. The curvature method extracted feature points from the recessed, high-curvature part of the model; however, such feature points are easily hidden, so correct matches cannot be obtained for them if the model undergoes a pose change. The proposed method, in contrast, selects distinctive feature points from the shapes that characterize the model, such as regions where large curvature continues along a straight line, as well as from the planar parts of the object model. Therefore, the proposed method enables correct matching even if the object model undergoes a pose change.
4 Conclusion
We proposed an object recognition system that achieves both reliable and high-
speed recognition by using a small number of distinctive feature points.
Experimental results using actual scenes demonstrated that our method achieves a 93.8% recognition rate, which is 42.2% higher than that of the conventional Spin Image method, and that it is also about nine times faster. These results confirmed that the process our method uses to select
distinctive feature points is an effective approach to object recognition.
In future work, we intend to further improve the method’s processing time,
optimize various parameters, and build a bin-picking system that implements
the method.
References
1. Johnson, A.E., Hebert, M.: Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE Trans. Pattern Analysis and Machine Intelligence 21(5), 433–449 (1999)
2. Sumi, Y., Tomita, F.: 3D Object Recognition Using Segment-Based Stereo Vision.
In: Chin, R., Pong, T.-C. (eds.) ACCV 1998. LNCS, vol. 1352, pp. 249–256. Springer,
Heidelberg (1997)
3. Steder, B., Rusu, R.B., Konolige, K., Burgard, W.: Point Feature Extraction on
3D Range Scans Taking into Account Object Boundaries. In: IEEE International
Conference on Robotics and Automation, pp. 2601–2608 (2011)
4. Takeguchi, T., Kaneko, S.: Depth Aspect Images for Efficient Object Recognition.
In: Proc. SPIE Conference on Optomechatronic Systems IV, vol. 5264, pp. 54–65
(2003)
5. Chen, H., Bhanu, B.: 3D Free-form Object Recognition in Range Images Using Local
Surface Patches. Pattern Recognition Letters 28, 1252–1262 (2007)
6. Tombari, F., Salti, S., Di Stefano, L.: Unique Signatures of Histograms for Local
Surface Description. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part III. LNCS, vol. 6313, pp. 356–369. Springer, Heidelberg (2010)
7. Tombari, F., Salti, S., Stefano, L.D.: A Combined Texture-Shape Descriptor for
Enhanced 3D feature Matching. In: IEEE International Conference on Image Pro-
cessing, pp. 809–812 (2011)
8. Tombari, F., Stefano, L.D.: Object Recognition in 3D Scene with Occlusions and
Clutter by Hough Voting. In: IEEE Proc. on 4th Pacific-Rim Symposium on Image
and Video Technology, pp. 349–355 (2010)
9. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: IEEE Interna-
tional Conference on Robotics and Automation, pp. 1–4 (2011)
3D Human Tracking from Depth Cue
in a Buying Behavior Analysis Context
Abstract. This paper presents a real-time approach to track the human body pose in 3D space. For buying behavior analysis, the camera is placed on top of the shelves, above the customers. In this top view, markerless tracking is harder. Hence, we use the depth cue provided by the Kinect, which gives discriminative features of the pose. We introduce a new 3D model that is fitted to these data in a particle filter framework. First, the head and shoulder positions are tracked in the 2D space of the acquisition images. Then the arm poses are tracked in the 3D space. Finally, we demonstrate that an efficient implementation provides a real-time system.
1 Introduction
Behavior analysis based on artificial vision methods offers a wide range of applications that are currently little developed in the marketing area. In customer behavior analysis, the camera is often placed on the ceiling of the market, so only a top view of the person is available. However, the great majority of the methods in the literature use a model adapted to a front view of the person, because the shape of a person is much more discriminative in that orientation. The aim of the project ANR-10-CORD0016 ORIGAMI2, which supports this work, is to develop real-time and non-intrusive tools designed to analyze shoppers' buying decisions. The approach is first based on extracting and tracking the shoppers' gaze and gesture positions with computer vision algorithms. It is then based on statistically analyzing the extracted data: the goal of this cognitive analysis is to measure the interaction between the shopper and their environment. This technology will provide consumer goods producers with unbiased and exhaustive information on shoppers' behaviors during their buying acts.
To make the tracking possible, the depth cue is required. One of the more popular devices used to provide it is the Kinect, which has sensors that capture both RGB and depth data. In this paper we integrate the depth cue in a particle filter to track the body parts. The gesture recognition and the behavior of the customer could subsequently be analyzed using Moeslund's taxonomy.
We use the Xtion Pro-live camera produced by Asus for the acquisition. All points for which the sensor is not able to measure depth are set to 0 in the output array; we regard these as a kind of noise. Moreover, we only model the upper part of the body. Thus, we threshold the image to take into consideration only the pixels recognized as an element of the torso, the arms or the head. This gives a first segmentation of the region of interest (ROI).
The Asus Xtion Pro-live simultaneously provides the color and the depth cues. Nevertheless, the color cue is often degraded in practice. Indeed, persons at the
supermarket shelves are over-lit. The tracking must be robust, and the depth cue is not disturbed by lighting. Thus we only take the depth cue into consideration.
sampled. Particles are selected by their weight: large-weight particles are duplicated while low-weight particles are deleted.
– propagation: particles are propagated using the dynamic model of the system p(x_{k+1} | x_k) to obtain \{x^i_{k+1}, \frac{1}{N}\} \sim p(x_{k+1} | y_k).
– weighting: particles are weighted by a likelihood function related to the correspondence from the model to the new observation. The new weights \omega^i_{k+1} are normalized so that \sum_{i=1}^{N} \omega^i_{k+1} = 1. This provides the new sample \{x^i_{k+1}, \omega^i_{k+1}\} \sim p(x_{k+1} | y_{k+1}).
– estimation: the new pose is approximated by (see the sketch below):

x_{k+1} = \sum_{i=1}^{N} \omega^i_{k+1} \, x^i_{k+1}   (1)
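Below is a minimal sketch of this resampling, propagation, weighting and estimation loop; the state representation, the dynamic model `propagate` and the `likelihood` function are placeholders, not the particular models used in the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, propagate, likelihood, rng):
    """One iteration of an SIR particle filter for tracking.
    `particles` is an (n, state_dim) array, `weights` sums to one,
    `propagate` and `likelihood` are placeholders for p(x_{k+1}|x_k) and
    the observation likelihood, respectively."""
    n = len(particles)
    # resampling: duplicate large-weight particles, drop low-weight ones
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # propagation with the dynamic model
    particles = np.array([propagate(p, rng) for p in particles])
    # weighting with the likelihood of the new observation, then normalisation
    weights = np.array([likelihood(p) for p in particles])
    weights /= weights.sum()
    # estimation: weighted mean of the particle states (Eq. 1)
    estimate = np.sum(weights[:, None] * particles, axis=0)
    return particles, weights, estimate
```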
Fig. 1. The 3D models: (a) the 3D model is made of a skeleton with geometrical primitives, (b) the angles of the articulations define the pose of the person, (c) in the 3D-2D processing the head and the shoulders are tracked in the 2D space of the recorded images whereas the arms are tracked in the 3D space.
3 Performances Analysis
We now present some experimental results. So as to control the movement of the person and to maximize the number of tested poses, we have simulated the behavior of customers under experimental conditions. In fact, the most important variation is the presence of shelves and goods. But, as the camera does not move, an estimate of the background is computed and can be subtracted from the frames
of the sequence. Using experimental conditions is justified because the ROIs obtained in this way are similar to those obtained in real conditions.
The Xtion Pro-live camera produced by Asus is installed 2.9 m above the ground. It provides 7 frames per second. The dimension of a frame is 320 × 240 pixels. In the first experiment, we recorded two sequences S1 and S2 that are made of 450 frames (>1 min) and 300 frames (≈43 s). The movements of the arms are varied and representative of buying behaviors. The depth cue is extracted with the OpenNI library. The operating range of the Xtion Pro-live camera is between 0.8 m and 3.5 m. Consequently, it cannot be used for at-a-distance video surveillance but is relevant in the buying behavior analysis context.
Fig. 2. The tracking provides the pose of the person: visualization of the estimated model state on the recorded frames (left) and in the 3D space (right) with the projection of the pixels of the depth image in white
We now estimate the quality of the arm tracking. We have manually annotated the pixels of the arms on the frames of the two sequences to create a ground truth. Then we compute the average distance ε from the projection in the 3D space of each of these pixels to the model state estimated by our method. It has to be minimized to optimize the tracking. The main variable of the particle filter is the number of particles. If it is increased, the tracking is improved
but the computing time is increased too. The processing times reported here are obtained with a non-optimized C++ implementation running on a 3.1 GHz processor. We give in the following the average processing times per frame. We can see in Figure 3 that there is no meaningful improvement beyond 50 particles (computed in 25 ms). With this configuration there is an average distance of less than 2.5 cm between each pixel of the observation and the estimated model state. This processing is real-time.
We compare our algorithm with the case where the 3D fitting presented in Section 2.3 is applied to a complete 3D model (Figure 1(a)) with 17 degrees of freedom. Figure 3 shows that the tracking is less efficient with this configuration. Indeed, the required number of particles is higher because the number of degrees of freedom is higher. Consequently, the processing time increases. Moreover, as we perform a part-based treatment, each body part is tracked more efficiently.
Fig. 3. Performances of the tracking with the various models on the 2 sequences: the
tracking is the best with our 3D-2D method
Fig. 4. Trajectories of the 3D coordinates (x, y and z) of the shoulder, the elbow and the wrist of the left arm in the sequence S3: our 3D-2D method follows the articulation movements well
4 Conclusion
In this paper we have presented a 3D gesture tracking method that uses the well-known particle filter. To be efficient in the buying behavior analysis context, where the camera is placed above the customers, our treatment is adapted to the top view of the person and uses the depth cue provided by the new Asus camera. To do this, we have introduced a top-view model that simultaneously uses 2D and 3D fitting. The process is accurate and real-time.
In the future, our method could be inserted into an action recognition process to analyse customer behavior. Moreover, a camera pose estimation [5,2,1] could place our work in an Augmented Reality context with a moving camera. Finally, an additional camera placed at head level could refine the behavior analysis with a gaze estimation [13].
References
1. Ababsa, F.: Robust Extended Kalman Filtering For Camera Pose Tracking Us-
ing 2D to 3D Lines Correspondences. In: IEEE/ASME Conference on Advanced
Intelligent Mechatronics, pp. 1834–1838 (2009)
2. Ababsa, F., Mallem, M.: A Robust Circular Fiducial Detection Technique and
Real-Time 3D Camera Tracking. International Journal of Multimedia 3, 34–41
(2008)
3. Canton-Ferrer, C., Salvador, J., Casas, J.R., Pardàs, M.: Multi-person Track-
ing Strategies Based on Voxel Analysis. In: Stiefelhagen, R., Bowers, R., Fiscus,
J.G. (eds.) CLEAR 2007 and RT 2007. LNCS, vol. 4625, pp. 91–103. Springer,
Heidelberg (2008)
4. Deutscher, J., Reid, I.: Articulated Body Motion Capture by Stochastic Search.
International Journal of Computer Vision 2, 185–205 (2005)
5. Didier, J.Y., Ababsa, F., Mallem, M.: Hybrid Camera Pose Estimation Combin-
ing Square Fiducials Localisation Technique and Orthogonal Iteration Algorithm.
International Journal of Image and Graphics 8, 169–188 (2008)
6. Gonzalez, M., Collet, C.: Robust Body Parts Tracking using Particle Filter and
Dynamic Template. In: IEEE International Conference on Image Processing, pp.
529–532 (2011)
7. Hauberg, S., Sommer, S., Pedersen, K.S.: Gaussian-like Spatial Priors for Artic-
ulated Tracking. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part I. LNCS, vol. 6311, pp. 425–437. Springer, Heidelberg (2010)
8. Horaud, R., Niskanen, M., Dewaele, G., Boyer, E.: Human Motion Tracking by
Registering an Articulated Surface to 3D Points and Normals. IEEE Transaction
on Pattern Analysis and Machine Intelligence 31, 158–163 (2009)
9. Isard, M., Blake, A.: CONDENSATION - Conditional Density Propagation for
Visual Tracking. International Journal of Computer Vision 29, 5–28 (1998)
10. Kjellström, H., Kragic, D., Black, M.J.: Tracking People Interacting with Objects.
In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
11. Kobayashi, Y., Sugimura, D., Sato, Y., Hirasawa, K., Suzuki, N., Kage, H., Sug-
imoto, A.: 3D Head Tracking using the Particle Filter with Cascaded Classifiers.
In: British Machine Vision Conference, pp. 37–46 (2006)
12. Lin, J.Y., Wu, Y., Huang, T.S.: 3D Model-based Hand Tracking using Stochastic
Direct Search Method. In: IEEE International Conference on Automatic Face and
Gesture Recognition, pp. 693–698 (2004)
13. Funes-Mora, K.A., Odobez, J.: Gaze Estimation from Multimodal Kinect Data. In:
IEEE Conference on Computer Vision and Pattern Recognition, pp. 25–30 (2012)
14. Micilotta, A., Bowden, R.: View-Based Location and Tracking of Body Parts for
Visual Interaction. In: British Machine Vision Conference, pp. 849–858 (2004)
15. Stoll, C., Hasler, N., Gall, J., Seidel, H.P., Theobalt, C.: Fast Articulated Motion
Tracking using a Sums of Gaussians Body Model. In: International Conference on
Computer Vision, pp. 951–958 (2011)
16. Xia, L., Chen, C.C., Aggarwal, J.K.: Human Detection Using Depth Information
by Kinect. In: International Workshop on Human Activity Understanding from 3D
Data (2011)
17. Yang, C., Duraiswami, R., Davis, L.: Fast Multiple Object Tracking via a Hierarchi-
cal Particle Filter. In: International Conference on Computer Vision, pp. 212–219
(2005)
A New Bag of Words LBP (BoWL) Descriptor
for Scene Image Classification
Abstract. This paper explores a new Local Binary Patterns (LBP) based im-
age descriptor that makes use of the bag-of-words model to significantly im-
prove classification performance for scene images. Specifically, first, a novel
multi-neighborhood LBP is introduced for small image patches. Second, this
multi-neighborhood LBP is combined with frequency domain smoothing to ex-
tract features from an image. Third, the features extracted are used with spatial
pyramid matching (SPM) and bag-of-words representation to propose an innova-
tive Bag of Words LBP (BoWL) descriptor. Next, a comparative assessment is
done of the proposed BoWL descriptor and the conventional LBP descriptor for
scene image classification using a Support Vector Machine (SVM) classifier. Fur-
ther, the classification performance of the new BoWL descriptor is compared with
the performance achieved by other researchers in recent years using some popu-
lar methods. Experiments with three fairly challenging publicly available image
datasets show that the proposed BoWL descriptor not only yields significantly
higher classification performance than LBP, but also generates results better than
or on par with some other popular image descriptors.
1 Introduction
Fig. 1. (a) shows a grayscale image, its LBP image, and the illustration of the computation of the
LBP code for a center pixel with gray level 90. (b) shows the eight 4-neighborhood masks used
for computing the proposed BoWL descriptor.
Lately, part-based methods have been very popular among researchers due to their
accuracy in image classification tasks [5]. Here the image is considered as a collection
of sub-images or parts. After features are extracted from all the parts, similar parts are
clustered together to form a visual vocabulary and a histogram of the parts is used to
represent the image. This approach is known as a "bag-of-words model", with features from each part representing a "visual word" that describes one characteristic of the complete image [6].
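As an illustration of this bag-of-words representation, a minimal sketch is given below; it uses k-means from scikit-learn as the clustering step, and the vocabulary size and function names are assumptions made here for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(patch_features, n_words=200):
    """Cluster local patch features into a visual vocabulary."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(patch_features)

def bow_histogram(image_features, vocabulary):
    """Represent one image as a normalised histogram of visual words."""
    words = vocabulary.predict(image_features)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```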
This paper explores a new bag-of-words based image descriptor that makes use of the
multi-neighborhood LBP concept from [7], but significantly improves the classification
accuracy.
Fig. 2. (a) A grayscale image is broken down into small image patches which are then quantized
into a number of visual words and the image is represented as a histogram of words. (b) The
spatial pyramid model for image representation. The image is successively tiled into different
regions and features are extracted from each region and concatenated.
shows a grayscale image on the top left and its LBP image on the bottom left. The two
3 × 3 matrices on the right illustrate how the LBP code is computed for the center pixel
whose gray level is 90.
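The basic LBP coding described here can be sketched as follows; the neighbour ordering and bit weights are one common convention, and the example patch values are illustrative rather than copied from Figure 1.

```python
import numpy as np

def lbp_code(patch):
    """Compute the basic 8-neighbour LBP code for the centre pixel of a
    3x3 patch: each neighbour >= centre contributes one bit."""
    center = patch[1, 1]
    # neighbours in a fixed clockwise order starting at the top-left
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    bits = [1 if n >= center else 0 for n in neighbours]
    return sum(b << i for i, b in enumerate(bits))

# example: centre pixel with gray level 90 (illustrative values)
patch = np.array([[80, 95, 90],
                  [70, 90, 100],
                  [60, 90, 120]])
print(lbp_code(patch))
```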
in the spatial domain. In the proposed method, the original image is transformed to
the frequency domain and the highest 25%, 50% and 75% frequencies are eliminated,
respectively. The original image and the three images thus formed undergo the same
process of dense sampling and eight-mask LBP feature extraction.
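A minimal sketch of this frequency-domain smoothing step is given below, using a 2-D DCT from SciPy on a grayscale image; the exact transform and the way the highest 25%, 50% and 75% of frequencies are masked are assumptions made for illustration.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct_smooth(image, keep_fraction):
    """Frequency-domain smoothing: keep only the lowest `keep_fraction`
    of DCT frequencies along each axis and zero out the rest."""
    coeffs = dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')
    h, w = image.shape
    mask = np.zeros_like(coeffs)
    mask[:int(h * keep_fraction), :int(w * keep_fraction)] = 1.0
    return idct(idct(coeffs * mask, axis=1, norm='ortho'), axis=0, norm='ortho')

# the three smoothed versions described in the text (drop highest 25/50/75%)
# versions = [dct_smooth(gray_image, f) for f in (0.75, 0.50, 0.25)]
```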
3 Experiments
This section first introduces the three scene image datasets used for testing the new
BoWL image descriptor and then does a comparative assessment of the classification
performances of the LBP, the BoWL and some other popular descriptors.
The UIUC Sports Event Dataset. The UIUC Sports Event dataset [15] contains 1,574
images from eight sports event categories. These images contain both indoor and out-
door scenes where the foreground contains elements that define the category. The back-
ground is often cluttered and is similar across different categories. Some sample images
are displayed in Figure 3(a).
Fig. 3. Some sample images from (a) the UIUC Sports Event dataset, (b) the MIT Scene dataset,
and (c) the Fifteen Scene Categories dataset
The MIT Scene Dataset. The MIT Scene dataset (also known as OT Scenes) [16] has 2,688 images classified into eight categories. There is a large variation in light, content
and angles, along with a high intra-class variation [16]. Figure 3(b) shows a few sample
images from this dataset.
The Fifteen Scene Categories Dataset. The Fifteen Scene Categories dataset [12] is composed of 15 scene categories with 200 to 400 images each: thirteen were provided by
[5], eight of which were originally collected by [16] as the MIT Scene dataset, and two
were collected by [12]. Figure 3(c) shows one image each from the newer seven classes
of this dataset.
3.2 Comparative Assessment of the LBP, the BoWL and other Popular
Descriptors on Scene Image Datasets
In this section, a comparative assessment of the LBP and the proposed BoWL descriptor
is made using the three datasets described earlier to evaluate classification performance.
To compute the BoWL and the LBP, first each training image, if color, is converted to
grayscale. For evaluating the relative classification performances of the LBP and the
BoWL descriptors, a Support Vector Machine (SVM) classifier with a Hellinger kernel
[17], [14] is used.
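For reference, the Hellinger kernel used here can be written as K(x, y) = Σ_i sqrt(x_i y_i) for non-negative, L1-normalised histogram features; a minimal sketch with scikit-learn's callable-kernel SVM interface follows, where the feature normalisation and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def hellinger_kernel(X, Y):
    """Hellinger kernel K(x, y) = sum_i sqrt(x_i * y_i) for
    non-negative, L1-normalised histogram features."""
    return np.sqrt(X) @ np.sqrt(Y).T

# clf = SVC(kernel=hellinger_kernel).fit(train_histograms, train_labels)
# predictions = clf.predict(test_histograms)
```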
For the UIUC Sports Event dataset, 70 images are used from each class for training
and 60 from each class for testing of the two descriptors. The results are obtained over
five random splits of the data. As shown in Figure 4, the BoWL outperforms the LBP
Fig. 4. The mean average classification performance of the LBP and the proposed BoWL descrip-
tors using an SVM classifier with a Hellinger kernel on the three datasets
Fig. 5. The comparative mean average classification performance of the LBP and the BoWL
descriptors on the 15 categories of the Fifteen Scene Categories dataset
by a large margin of over 15%. In fact, on this dataset the BoWL not only outperforms the LBP, but also provides a decent classification performance on its own.
From both the MIT Scene dataset and the Fifteen Scene Categories dataset five ran-
dom splits of 100 images per class are used for training, and the rest of the images
are used for testing. Again, the BoWL produces decent classification performance on
its own apart from beating the LBP by a fair margin. Figure 4 displays these results
on the MIT Scene dataset and the Fifteen Scene Categories dataset. The highest classification rate on the MIT Scene dataset is 91.6%, achieved by the BoWL descriptor. The classification performance of the BoWL beats that of the LBP by a margin of over 17%.
On the Fifteen Scene Categories dataset, the overall success rate for the BoWL is 80.7%, which is again over 14% higher than the LBP. This is also shown in Figure 4. In Figure 5, the category-wise classification rates of the grayscale LBP and the grayscale BoWL descriptors for all 15 categories of this dataset are shown. The BoWL is shown to exceed the LBP classification performance in 12 of the 15 scene categories.
Table 1. Comparison of the Classification Performance (%) of the Proposed Grayscale BoWL
Descriptor with Other Popular Methods on the Three Image Datasets
4 Conclusion
In this paper, a variation of the LBP descriptor is used with a DCT and bag-of-words
based representation to form the novel Bag of Words-LBP (BoWL) image descriptor.
The contributions of this paper are manifold. First, a new multi-neighborhood LBP is
proposed for small image patches. Second, this multi-neighborhood LBP is coupled
with a DCT-based smoothing to extract features at different scales. Third, these fea-
tures are used with a spatial pyramid image representation and SVM classifier to prove
that the BoWL descriptor significantly improves image classification performance over
LBP. Finally, experimental results on three popular scene image datasets show that the
BoWL descriptor also yields classification performance better than or comparable to
several recent methods used by other researchers.
References
1. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with clas-
sification based on feature distributions. Pattern Recognition 29(1), 51–59 (1996)
2. Banerji, S., Sinha, A., Liu, C.: New image descriptors based on color, texture, shape, and
wavelets for object and scene image classification. Neurocomputing (2013)
3. Banerji, S., Sinha, A., Liu, C.: Scene image classification: Some novel descriptors. In: IEEE
International Conference on Systems, Man, and Cybernetics, Seoul, Korea, October 14-17,
pp. 2294–2299 (2012)
4. Sinha, A., Banerji, S., Liu, C.: Novel color gabor-lbp-phog (glp) descriptors for object and
scene image classification. In: The Eighth Indian Conference on Vision, Graphics and Image
Processing, Mumbai, India, December 16-19, p. 58 (2012)
5. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories.
In: Conference on Computer Vision and Pattern Recognition, pp. 524–531 (2005)
6. Yang, J., Jiang, Y., Hauptmann, A., Ngo, C.: Evaluating bag-of-visual-words representations
in scene classification. In: Multimedia Information Retrieval, pp. 197–206 (2007)
7. Banerji, S., Verma, A., Liu, C.: Novel color LBP descriptors for scene and image texture
classification. In: 15th International Conference on Image Processing, Computer Vision, and
Pattern Recognition, Las Vegas, Nevada, July 18-21, pp. 537–543 (2011)
8. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of
Computer Vision 60(2), 91–110 (2004)
9. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification.
In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503.
Springer, Heidelberg (2006)
10. Zhu, C., Bichot, C., Chen, L.: Multi-scale color local binary patterns for visual object classes
recognition. In: International Conference on Pattern Recognition, Istanbul, Turkey, August
23-26, pp. 3065–3068 (2010)
11. Gu, J., Liu, C.: Feature local binary patterns with application to eye detection. Neurocom-
puting (2013)
12. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern
Recognition, New York, NY, USA (2006)
13. Sivic, J., Zisserman, A.: Video google: a text retrieval approach to object matching in videos.
In: Ninth IEEE International Conference on Computer Vision, pp. 1470–1477 (2003)
14. Vedaldi, A., Fulkerson, B.: Vlfeat – an open and portable library of computer vision algo-
rithms. In: The 18th Annual ACM International Conference on Multimedia (2010)
15. Li, L.J., Fei-Fei, L.: What, where and who? classifying event by scene and object recognition.
In: IEEE International Conference in Computer Vision (2007)
16. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the
spatial envelope. International Journal of Computer Vision 42(3), 145–175 (2001)
17. Vapnik, Y.: The Nature of Statistical Learning Theory. Springer (1995)
18. Li, L.J., Su, H., Xing, E.P., Fei-Fei, L.: Object bank: A high-level image representation for
scene classification & semantic feature sparsification. In: Neural Information Processing Sys-
tems, Vancouver, Canada (December 2010)
19. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1794–1801 (2009)
20. Van Gemert, J., Veenman, C., Smeulders, A., Geusebroek, J.M.: Visual word ambiguity.
IEEE Transactions on Pattern Analysis and Machine Intelligence 32(7), 1271–1283 (2010)
21. Niu, Z., Hua, G., Gao, X., Tian, Q.: Context aware topic model for scene recognition. In:
IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June
16-21, pp. 2743–2750 (2012)
22. Bo, L., Ren, X., Fox, D.: Hierarchical matching pursuit for image classification: Architecture
and fast algorithms. In: Advances in Neural Information Processing Systems (December
2011)
23. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via pLSA. In: Leonardis,
A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 517–530. Springer,
Heidelberg (2006)
Accurate Scale Factor Estimation
in 3D Reconstruction
1 Introduction
Structure from motion with a single camera aims at recovering both the 3D
structure of the world and the motion of the camera used to photograph it.
Without any external knowledge, this process is subject to the inherent scale
ambiguity [9,17,5], which consists in the fact that the recovered 3D structure
and the translational component of camera motion are defined up to an unknown
scale factor which cannot be determined from images alone. This is because if a
scene and a camera are scaled together, this change would not be discernible in
the captured images. However, in applications such as robotic manipulation or
augmented reality which need to interact with the environment using Euclidean
measurements, the scale of a reconstruction has to be known quite accurately.
Albeit important, scale estimation is an often overlooked step in structure from motion algorithms. It is commonly suggested that scale should be estimated
by manually measuring a single absolute distance in the scene and then using
it to scale a reconstruction to its physical dimensions [5,12]. In practice, there
are two problems associated with such an approach. The first is that it favors
certain elements of the reconstruction, possibly biasing the estimated scale. The
second, and more important, is that the distance in question has to be measured
Work funded by the EC FP7 programme under grant no. 270138 DARWIN.
accurately in the world and then correctly associated with the corresponding dis-
tance in the 3D reconstruction. Such a task can be quite difficult to perform and
is better suited to large-scale reconstructions for which the measurement error
can be negligible compared to the distance being measured. However, measuring
distances for objects at the centimeter scale has to be performed with extreme
care and is therefore remarkably challenging. For example, [1] observes that a
modeling error of 1mm in the scale of a coke can, gives rise to a depth estimation
error of up to 3cm at a distance of 1m from the camera, which is large enough
to cause problems to a robotic manipulator attempting to grasp the object.
This work investigates three techniques for obtaining reliable scale estimates
pertaining to a monocular 3D reconstruction and evaluates them experimen-
tally. These techniques differ in their required level of manual intervention, their
flexibility and accuracy. Section 2 briefly presents our approach for obtaining a
reconstruction whose scale is to be estimated. Scale estimation techniques are
detailed in Sections 3-5 and experimental results from their application to real
and synthetic datasets are reported in Sect. 6. The paper concludes in Sect. 7.
2 Obtaining a 3D Reconstruction
In this work, 3D reconstruction refers to the recovery of sparse sets of points
from an object’s surface. To obtain a complete and view independent represen-
tation, several images depicting an object from multiple unknown viewpoints are
acquired with a single camera. These images are used in a feature-based struc-
ture from motion pipeline to estimate the interimage camera motion and recover
a corresponding 3D point cloud [16]. This pipeline relies on the detection and
matching of SIFT keypoints across images which are then reconstructed in 3D.
The 3D coordinates are complemented by associating with each reconstructed
point a SIFT feature descriptor [11], which captures the local surface appearance
in the point’s vicinity. A SIFT descriptor is available from each image where a
particular 3D point is seen. Thus, we select as its most representative descriptor
the one originating from the image in which the imaged surface is most frontal
and close enough to the camera. This requires knowledge of the surface normal,
which is obtained by gathering the point's 3D neighbours and robustly fitting a plane to them. As will become clear in the following, SIFT descriptors permit the
establishment of putative correspondences between an image and an object’s 3D
geometry. Combined together, 3D points and SIFT descriptors of their image
projections constitute an object’s representation.
image points by substituting the squared reprojection error with a robust cost
function (i.e., M-estimator). Our pose estimation approach is detailed in [10].
\bar{M} = \frac{1}{n} \sum_{i=1}^{n} M_i, \qquad \bar{N} = \frac{1}{n} \sum_{i=1}^{n} N_i, \qquad \tilde{M}_i = M_i − \bar{M}, \qquad \tilde{N}_i = N_i − \bar{N}.

Forming the cross-covariance matrix C as \sum_{i=1}^{n} \tilde{N}_i \tilde{M}_i^{t}, the rotational component of the similarity is directly computed from C's decomposition C = U \Sigma V^{t}
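The closed-form absolute-orientation computation referred to here (centring both point sets, forming the cross-covariance matrix and recovering rotation, scale and translation from its SVD) can be sketched as below. This follows the standard Horn/Umeyama-style formulation and is an illustrative sketch under that assumption, not a transcription of the paper's implementation.

```python
import numpy as np

def similarity_from_correspondences(M, N):
    """Estimate scale s, rotation R and translation t with N_i ≈ s * R @ M_i + t
    from corresponding 3-D point sets M and N (rows are points)."""
    M_mean, N_mean = M.mean(axis=0), N.mean(axis=0)
    Mc, Nc = M - M_mean, N - N_mean
    C = Nc.T @ Mc                                    # cross-covariance matrix
    U, S, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt                                   # rotation aligning M to N
    s = np.trace(np.diag(S) @ D) / np.sum(Mc ** 2)   # scale factor
    t = N_mean - s * R @ M_mean
    return s, R, t
```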
\sum_{i=1}^{n} d\big(K_L \cdot [\lambda\, R(r) \mid t] \cdot M_i − m^{L}_{i}\big)^{2} + \sum_{j=1}^{m} d\big(K_R \cdot [\lambda\, R_s R(r) \mid R_s t + t_s] \cdot M_j − m^{R}_{j}\big)^{2},   (2)
where λ, t and R(r) are respectively the sought scale factor, translation vector
and rotation matrix parameterized using the Rodrigues rotation vector r, KL ·
6 Experiments
Each of the three methods previously described provides a means for computing
a single estimate of the pursued scale factor through monocular or binocular
measurements. It is reasonable to expect that such estimates will be affected by
various errors, therefore basing scale estimation on a single pair of images should
be avoided. Instead, more accurate estimates can be obtained by employing
multiple images in which the object has been moved to different positions and
collecting the corresponding estimates. Then, the final scale estimate is obtained
by applying a robust location estimator such as their sample median [14]. In the
following, the methods of Sect. 3, 4 and 5 will be denoted as mono, absor and
reproj, respectively. Due to limited space, two sets of experiments are reported.
An experiment with synthetic images was conducted first, in which the base-
line of the stereo pair imaging the target object was varied. A set of images
was generated, utilizing a custom OpenGL renderer. A 1:1 model of a textured
rectangular cuboid (sized 45 × 45 × 90 mm3), represented by a 3D triangle mesh
(with 14433 vertices & 28687 faces), was rendered in 59 images. These images
correspond to a virtual camera (1280 × 960 pixels, 22.2◦ × 16.7◦ FOV) circum-
venting the object in a full circle of radius 500 mm perpendicular to its major
symmetry axis. At all simulated camera locations, the optical axis was oriented
so that it pointed towards the object’s centroid. The experiment was conducted
in 30 conditions, each employing an increasingly larger baseline. In condition n,
the i-th stereo pair consisted of images i and i + n. Hence, the baseline increment
in each successive condition was ≈ 52mm. In Fig. 1(a) and (b), an image from
the experiments and the absolute error in the estimated scale factor are shown.
Notice that the plot for absor terminates early at a baseline of ≈ 209mm. This
is because as the baseline length increases, the reduction in overlap between the
two images of the stereo pair results in fewer correspondences. In conditions
of the experiment corresponding to larger baselines, some stereo pairs did not
provide enough correspondences to support a reliable estimate by absor. As a
result, the estimation error for these pairs was overly large.
[Figure 1 plot residue: curves for MONO, ABSOR and REPROJ; y-axis: error (mm, ×10⁻³ for the scale error plot); x-axis: baseline (mm), 0–1600]
Fig. 1. Experiments. Left to right: (a) sample image from the experiment with synthetic
stereo images and (b) scale factor estimation error (in milli scale), (c) sample image
from the experiment with real images and (d) translational pose estimation error.
The three methods are compared next with the aid of real images. Consider-
ing that the task of directly using the estimated scales to assess their accuracy is
cumbersome, it was chosen to compare scales indirectly through pose estimation.
More specifically, an arbitrarily scaled model of an object was re-scaled with the
estimates provided by mono, absor and reproj. Following this, these re-scaled
models were used for estimating poses of the object as explained in Sect. 3, which
were then compared with the true poses. In this manner, the accuracy of a scale
estimate is reflected on the accuracy of the translational components of the esti-
mated poses. To obtain ground truth for object poses, a checkerboard was used
to guide the placement of the object that was systematically moved at locations
aligned with the checkers. The camera pose with respect to the checkerboard was
estimated through conventional extrinsic calibration, from which the locations of
the object on the checkerboard were transformed to the camera reference frame.
The object and the experimental setup are shown in Fig. 1(c). Note that these
presumed locations include minute calibration inaccuracies as well as human er-
rors in object placement. The object was placed and aligned upon every checker
of the 8 × 12 checkerboard in the image. The checkerboard was at a distance
of approximately 1.5 m from the camera, with each checker being 32 × 32 mm2 .
Camera resolution was 1280 × 960 pixels, and its FOV was 16◦ × 21◦ . The mean
translational error in these 96 trials was 1.411 mm with a deviation of 0.522 mm
for mono, 1.342 mm with a deviation of 0.643 mm for absor and 0.863 mm with
a deviation of 0.344 mm for reproj. The mean translational errors of the pose
estimates are shown graphically in Fig. 1(d).
7 Conclusion
The paper has presented one monocular and two binocular methods for scale
factor estimation. Binocular methods are preferable due to their flexibility with
respect to object placement. Furthermore, the binocular method of Sect. 5 is
applicable regardless of the size of the baseline and was shown to be the most
accurate, hence it constitutes our recommended means for scale estimation.
References
1. Collet Romea, A., Srinivasa, S.: Efficient Multi-View Object Recognition and Full
Pose Estimation. In: Proc. of ICRA 2010 (May 2010)
2. Fischler, M., Bolles, R.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 24(6), 381–395 (1981)
3. Grunert, J.: Das pothenotische Problem in erweiterter Gestalt nebst über seine
Anwendungen in Geodäsie. Grunerts Archiv für Mathematik und Physik (1841)
4. Hartley, R., Sturm, P.: Triangulation. CVIU 68(2), 146–157 (1997)
5. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn.
Cambridge University Press (2004) ISBN: 0521540518
6. Horn, B.: Closed-form Solution of Absolute Orientation Using Unit Quaternions.
J. Optical Soc. Am. A 4(4), 629–642 (1987)
7. Kneip, L., Scaramuzza, D., Siegwart, R.: A Novel Parametrization of the
Perspective-three-Point Problem for a Direct Computation of Absolute Camera
Position and Orientation. In: Proc. of CVPR 2011, pp. 2969–2976 (2011)
8. Li, Y., Snavely, N., Huttenlocher, D.P.: Location Recognition Using Prioritized
Feature Matching. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part II. LNCS, vol. 6312, pp. 791–804. Springer, Heidelberg (2010)
9. Longuet-Higgins, H.: A Computer Algorithm for Reconstructing a Scene From Two
Projections. Nature 293(5828), 133–135 (1981)
10. Lourakis, M., Zabulis, X.: Model-Based Pose Estimation for Rigid Objects. In:
Chen, M., Leibe, B., Neumann, B. (eds.) ICVS 2013. LNCS, vol. 7963, pp. 83–92.
Springer, Heidelberg (2013)
11. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Com-
put. Vis. 60(2), 91–110 (2004)
12. Moons, T., Gool, L.V., Vergauwen, M.: 3D Reconstruction from Multiple Images
Part 1: Principles. Found. Trends. Comput. Graph. Vis. 4(4), 287–404 (2009)
13. Nistér, D., Naroditsky, O., Bergen, J.: Visual Odometry for Ground Vehicle Ap-
plications. J. Field Robot. 23, 3–20 (2006)
14. Rousseeuw, P.: Least Median of Squares Regression. J. Am. Stat. Assoc. 79,
871–880 (1984)
15. Rubner, Y., Puzicha, J., Tomasi, C., Buhmann, J.: Empirical Evaluation of Dis-
similarity Measures for Color and Texture. Comput. Vis. Image Und. 84(1), 25–43
(2001)
16. Snavely, N., Seitz, S., Szeliski, R.: Photo Tourism: Exploring Photo Collections in
3D. ACM Trans. Graph. 25(3), 835–846 (2006)
17. Szeliski, R., Kang, S.: Shape Ambiguities in Structure from Motion. In: Buxton,
B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 709–721. Springer,
Heidelberg (1996)
Affine Colour Optical Flow Computation
1 Introduction
The theoretical aim of this paper is to introduce the affine tracker [10] for multi-
channel temporal image sequences. Furthermore, we also introduce an evaluation
criterion for the computed optical flow field without ground truth.
Optical flow provides fundamental features for motion analysis and motion
understanding. In ref. [10], using local stationarity of visual motion, a linear
method for motion tracking was introduced. The colour optical flow method
computes optical flow from a multichannel image sequence, assuming the multi-
channel optical flow constraint that, in a short duration, the illumination of an
image in each channel is locally constant [4]. This assumption is an extension of
the classical optical flow constraint to the multichannel case. This colour optical
flow constraint derives a multichannel version of KLT tracker [8].
The colour optical flow constraint yields an overdetermined or redundant sys-
tem of linear equations [4], although the usual optical flow constraint for a single
channel image yields a singular linear equation. Therefore, the colour optical
flow constraint provides a simple method to compute optical flow without ei-
ther regularisation [1] or multiresolution analysis [2]. Another way to use a multichannel image is to unify the features of each channel.
Fig. 1. Order of planar vector fields. (a) A constant vector field u(x, y) = (1, 1) . (b)
A linear vector field u(x, y) = (x, x) .
\frac{d f^{\alpha_i}}{dt} = \frac{\partial \phi^{\alpha\beta}_i}{\partial f^{\beta_1}} \frac{d f^{\beta_1}}{dt} + \frac{\partial \phi^{\alpha\beta}_i}{\partial f^{\beta_2}} \frac{d f^{\beta_2}}{dt} + \frac{\partial \phi^{\alpha\beta}_i}{\partial f^{\beta_3}} \frac{d f^{\beta_3}}{dt}.   (2)

We call

\frac{d}{dt} f^{\alpha} = \left( \frac{d}{dt} f^{\alpha_1}, \frac{d}{dt} f^{\alpha_2}, \frac{d}{dt} f^{\alpha_3} \right)^{\top} = 0   (3)

the colour brightness consistency. Lemma 1 implies that colour brightness consistency is satisfied for all colour spaces. Therefore, hereafter, we set the spatio-temporal multichannel colour image as f(x) = (f^1(x), f^2(x), f^3(x))^{\top}. Then, equation (3) becomes

J u + f_t = 0   (4)

for f_t = (f_t^1, f_t^2, f_t^3)^{\top} and the optical flow u = (\frac{dx}{dt}, \frac{dy}{dt})^{\top} = (\dot{x}, \dot{y})^{\top} = \dot{x} of the point x, where J = (\nabla f^1, \nabla f^2, \nabla f^3)^{\top}.

Assuming that the optical flow vector u is constant in the neighbourhood \Omega(x) of point x, the optical flow vector is the minimiser of

E_0 = \frac{1}{2} \cdot \frac{1}{|\Omega(x)|} \int_{\Omega(x)} |J u + f_t|^2 \, dx = \frac{1}{2} u^{\top} \bar{G} u + \bar{e}^{\top} u + \frac{1}{2} \bar{c}   (5)

where

\bar{G} = \frac{1}{|\Omega(x)|} \int_{\Omega(x)} J^{\top} J \, dx = \sum_{i=1}^{3} G_i, \qquad G_i = \frac{1}{|\Omega(x)|} \int_{\Omega(x)} \nabla f^i \, \nabla f^{i\top} \, dx,   (6)

\bar{e} = \frac{1}{|\Omega(x)|} \int_{\Omega(x)} J^{\top} f_t \, dx = \sum_{i=1}^{3} e_i, \qquad e_i = \frac{1}{|\Omega(x)|} \int_{\Omega(x)} f_t^i \, \nabla f^i \, dx,   (7)

\bar{c} = \frac{1}{|\Omega(x)|} \int_{\Omega(x)} |f_t|^2 \, dx = \sum_{i=1}^{3} c_i, \qquad c_i = \frac{1}{|\Omega(x)|} \int_{\Omega(x)} (f_t^i)^2 \, dx.   (8)

Equation (5) implies that the solution of the system of linear equations

\frac{\partial E_0}{\partial u} = \bar{G} u + \bar{e} = 0   (9)

is the optical flow field vector u of the point x.
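A minimal sketch of solving Eq. (9) for a constant flow in a local window is given below; the finite-difference gradients, the box window and the small regularisation term are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def colour_optical_flow(f0, f1, window=7):
    """Per-pixel constant-flow estimate from Eq. (9): accumulate the
    per-channel structure tensors G_i and vectors e_i over a local window
    and solve G u = -e at every pixel.
    f0, f1: consecutive colour frames of shape (H, W, 3)."""
    H, W, C = f0.shape
    G = np.zeros((H, W, 2, 2))
    e = np.zeros((H, W, 2))
    box = np.ones((window, window)) / window ** 2
    for c in range(C):
        fy, fx = np.gradient(f0[..., c])        # spatial gradients
        ft = f1[..., c] - f0[..., c]            # temporal derivative
        G[..., 0, 0] += convolve(fx * fx, box)
        G[..., 0, 1] += convolve(fx * fy, box)
        G[..., 1, 0] += convolve(fx * fy, box)
        G[..., 1, 1] += convolve(fy * fy, box)
        e[..., 0] += convolve(fx * ft, box)
        e[..., 1] += convolve(fy * ft, box)
    # small regularisation keeps the 2x2 systems invertible
    flow = -np.linalg.solve(G + 1e-6 * np.eye(2), e[..., None])[..., 0]
    return flow        # flow[..., 0], flow[..., 1]: x and y components of u
```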
for the point x, we have an affine optical flow field vector u at the point x.
Since \mathrm{rank}\, \bar{G} \le 2 and \mathrm{rank}(x x^{\top}) = 1,

\begin{pmatrix} d \\ \mathrm{vec}\, D \end{pmatrix} = − \begin{pmatrix} \bar{G} & x^{\top} \otimes \bar{G} \\ x \otimes \bar{G} & (x x^{\top}) \otimes \bar{G} \end{pmatrix}^{\dagger} \begin{pmatrix} \bar{e} \\ x \otimes \bar{e} \end{pmatrix}.   (12)
In Algorithm 1, for the sampled vector function f_{ijk} = f(i, j, k), the pyramid transform R and its dual transform E are expressed as

R f^{k}_{mn} = \sum_{i,j=-1}^{1} w_i w_j \, f^{k}_{2m-i,\, 2n-j}, \qquad E f^{k}_{mn} = 4 \sum_{i,j=-2}^{2} w_i w_j \, f^{k}_{\frac{m-i}{2},\, \frac{n-j}{2}},   (13)

where w_{\pm 1} = \frac{1}{4} and w_0 = \frac{1}{2}, and the summation in E is taken only over those i, j for which (m − i)/2 and (n − j)/2 are integers.
1 The matrix equation AXB = C is replaced by the linear system of equations (B^{\top} \otimes A)\, \mathrm{vec}\, X = \mathrm{vec}\, C.
4 Numerical Experiments
Colour Space. There are several standard and nonstandard colour spaces for
the representation of colour images. Selection of the most relevant colour space
is a fundamental issue for colour optical flow computation. We use the following
spaces.
Primary colour systems: RGB, CMY, XYZ.
Luminance-chrominance colour systems: YUV, YIQ, YCbCr, HSV, HSL.
Perceptual colour systems: L*a*b*.
Usual noncorrelated colour system: I1I2I3.
For the selection of the window size in the affine colour optical flow computation,
we evaluate the spatial angle error ψE = arccos(uc , ue ) between the ground
truth uc and the estimation ue for Middlebury sequences.
Figure 2 shows the errors of the computed colour optical flow vectors for various window sizes. The left and right columns show the average angle errors and the least mean square errors, respectively, for the hydra, grove3, dimetrodon, and urban3 sequences.
The results in Fig. 2 suggest that, for accurate colour optical flow computation, we are required to use a window larger than 5 × 5. In Fig. 3, we show the results of colour optical flow computation using the 7 × 7 window for the Hydra and Grove3 sequences, respectively. In Fig. 3, (a) and (e) are the original images, (b) and (f) are the ground truths of the optical flow fields, and (c) and (g) are the optical flow fields computed with the 7 × 7 window.
For the flow vector v(x, y, t) = (u, v)^{\top}, setting \hat{f}(x, y, t) = f(x − u, y − v, t + 1), we define

\mathrm{RMS\ error} = \sqrt{ \frac{1}{|\Omega|} \int_{\Omega} \big( f(x, y, t) − \hat{f}(x, y, t) \big)^2 \, dx\, dy }   (14)

for images in the region of interest \Omega at time t. We use the sequential error

\epsilon(t) = \sqrt{ \frac{1}{|\Omega|} \int_{\Omega} \big| u(x, t) − u(x', t + 1) \big|^2 \, dx }   (15)
[Figure 2 plot residue: average angle error (AAE) and RMS error versus window size (3×3 to 17×17) for the hydrangea, grove3, dimetrodon and urban3 sequences; legend lists the colour spaces CMY, HSL, HSV, I1I2I3, I1rg, RGB, SLab, SLuv, XYZ, YCC, YIQ, YUV, Yxy, rgb and Y]
Fig. 2. Computed optical flow. Errors for various window sizes. Left: Average angle error. Right: Least mean square error. From top to bottom: results for the hydra, grove3, dimetrodon and urban3 sequences.
Fig. 3. Computed optical flow: (a), (e) Images. (b), (f) Ground truth optical flow fields. (c), (g) Optical flow fields computed with a 7 × 7 window.
[Figure 4 plot residue: sequential error FD versus frame number (0–20); left panel legend: RGB, XYZ, HSL, HSV, CMY, YCC, YUV, III, LAB; right panel legend: window sizes 3×3 to 15×15]
Fig. 4. Result of DIPLODOC Sequence. (a) Image from the sequence. (b) Sequential
error eq. (15) of various colour channels for the window 7 × 7. (c) Sequential error eq.
(15) for various windows for the RGB channel. We used the pyramid of two layers.
5 Conclusions
References
1. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17,
185–204 (1981)
2. Bouguet, J.-Y.: Pyramidal implementation of the Lucas Kanade feature tracker
description of the algorithm, In: Intel Corporation. Microprocessor Research Labs,
OpenCV Documents (1999)
3. Lucas, B.D., Kanade, T.: An iterative image registration technique with an appli-
cation to stereo vision. In: International Joint Conference on Artificial Intelligence,
pp. 674–679 (1981)
4. Golland, P., Bruckstein, A.M.: Motion from color. CVIU 68, 346–362 (1997)
5. Andrews, R.J., Lovell, B.C.: Color optical flow. In: Proc. Workshop on Digital
Image Computing, pp. 135–139 (2003)
6. van de Weijer, J., Gevers, T.: Robust optical flow from photometric invariants. In:
Proc. ICIP, pp. 1835–1838 (2004)
7. Barron, J.L., Klette, R.: Quantitative color optical flow. In: Proceedings of 16th
ICPR, vol. 4, pp. 251–255 (2002)
8. Heigl, B., Paulus, D., Niemann, H.: Tracking points in sequences of color images.
In: Proceedings 5th German-Russian Workshop on Pattern Analysis, pp. 70–77
(1998)
9. Mileva, Y., Bruhn, A., Weickert, J.: Illumination-robust variational optical flow
with photometric invariants. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.)
DAGM 2007. LNCS, vol. 4713, pp. 152–162. Springer, Heidelberg (2007)
10. Shi, J., Tomasi, C.: Good features to track. In: CVPR 1994, pp. 593–600 (1994)
Can Salient Interest Regions Resume Emotional
Impact of an Image?
1 Introduction
Many achievements have been made in computer vision in order to replicate the most amazing capabilities of the human brain. However, some aspects of our behaviour remain unsolved, for example emotion prediction for images and videos. Emotion extraction has several applications, for example film classification or road safety education, by choosing images adequate to the situation.
Perreira Da Silva et al. [10] show that despite the non-deterministic behaviour of the prey/predator equations, the system exhibits interesting properties of stability, reproducibility and reactiveness while allowing a fast and efficient exploration of the scene. We applied the same optimal parameters used by Perreira Da Silva to evaluate our approach.
The attention model presented in this section is computationally efficient and
plausible [9]. It provides many tuning possibilities (adjustment of curiosity, cen-
tral preferences, etc.) that can be exploited in order to adapt the behavior of the
system to a particular context.
3 Experimentations
There are many image databases for emotion studies; the best known is the International Affective Picture System (IAPS) [3] from the Center for Emotion and Attention (CSEA) at the University of Florida. In general they are highly semantic, and this aspect justifies our choice to create a new low-semantic database for emotion studies and for research purposes in general. In this paper, "low-semantic" means that the images do not shock and do not force a strong emotional response. We also chose low-semantic images to minimize the potential interactions between the emotions of successive images during subjective evaluations. This aspect is important to ensure that the emotion indicated for an image is really related to its content and not to the emotional impact of the previous one.
For these experiments, the data set used in [1] has been expanded to 350 images; it is now called SENSE (Studies of Emotion on Natural image databaSE) and is free to use. It is a diversified set of images containing landscapes, animals, food and drink, and historic and touristic monuments.
This data set also has the advantage of being composed of natural images, except for some non-natural transformations (rotations and colour balance modifications) applied to a few images. These transformations are performed to measure their impact on an emotion recognition system based on low-level image features [1].
3.2 Experiments
Our goal during the psycho-visual evaluations is to assess the different images according to the nature of their emotional impact during a short viewing duration. For the evaluation of the emotional impact of an image, viewing duration is really important: if the observation time is extended, observers access more of the semantics and their ratings become semantic interpretations rather than "primary emotions".
Usually two methodologies of emotion classification are found [4]:
– Discrete approach;
– Dimensional approach.
The images of IAPS are scored according to the affective ratings: pleasure, arousal and dominance.
During our experiments we asked participants to indicate the nature of the emotion, "Positive", "Negative" or "Neutral", with a power varying from "Weak" to "Strong". In our view, it is easier to rate our images in this way, especially for a short observation duration.
Fig. 2. (a)-(c) Some images assessed during SENSE2 with the percentage of the original image conserved, and (d)-(f) their corresponding images in SENSE1
Because our tests were performed on a low-semantic database, we do not need to worry about the potential interactions between the emotions of successive images; these interactions are strongly minimized.
We conducted two different tests:
– First, evaluations on the full images; an example is shown in Fig. 2(d)-2(f). 1741 participants, including 848 men (48.71%) and 893 women (51.29%), from around the world, scored the database. These evaluations are named SENSE1.
– Second, tests on the regions of interest obtained with the visual attention model described in the previous section. 1166 participants, including 624 women (53.49%) and 542 men (46.51%), scored the 350 images. Their size varies from 3% to 100% of the size of the original ones; Fig. 2(a)-2(c) shows some examples. These experiments are named SENSE2.
The two experiments were successively accessible via the Internet; SENSE1 was run several months before SENSE2. Participants took part voluntarily in one or both evaluations and could stop whenever they wanted. Even if we cannot control the observation conditions (concentration, display, mood), this is not a problem for the evaluation of emotional impact, as these are the daily viewing conditions. Participants were asked not to take a long time to score the images. The average observation time was 6.6 seconds, so we considered the responses to be "primary" emotions. Each observer evaluated at most 24 randomly selected images if he completed the full test.
Each image was assessed by an average of 104.81 observers during SENSE1 and 65.40 during SENSE2.
– During SENSE1, only 21 images (6% of the whole database) were scored by fewer than 100 persons. The least assessed image was evaluated by 86 different participants, both genders combined.
– During SENSE2, the least rated image was seen by 47 participants. Only 2 images were rated by fewer than 50 persons.
Despite these diversified evaluations, some images are not really categorized. We considered that an image is categorized (into one of the emotion classes Negative, Neutral or Positive) if the difference of the percentages (of observers) between the two most frequent emotions is greater than or equal to 10%.
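As a concrete illustration, this categorization rule can be expressed in a few lines; a minimal sketch (the function name and the vote encoding are illustrative, not taken from the paper):

```python
from collections import Counter

def categorize(votes, margin=0.10):
    """Categorize an image from observer votes ('Negative'/'Neutral'/'Positive').

    The image is categorized only if the share of the most frequent emotion
    exceeds the share of the second most frequent one by at least `margin`
    (10% in the paper); otherwise it stays 'Uncategorised'.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    ranked = counts.most_common()
    if len(ranked) == 1:
        return ranked[0][0]
    (top_label, top), (_, second) = ranked[0], ranked[1]
    return top_label if (top - second) / total >= margin else "Uncategorised"

# Example: 60% Positive, 25% Neutral, 15% Negative -> categorized as Positive.
print(categorize(["Positive"] * 60 + ["Neutral"] * 25 + ["Negative"] * 15))
```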
The results from SENSE1 are considered as the reference for this study, and the impact of the reduction of the image size will be interpreted through the rate of good categorization.
[Bar chart: good categorization rate (%) for the thumbnail classes P1: ]7%, 50%[, P2: [50%, 70%[ and P3: [70%, 100%] of the original image.]
Fig. 3. Good categorization rates during SENSE2 for categorized images during
SENSE1
Fig. 4. Rate of uncategorised images during SENSE1 now categorized during SENSE2
Sometimes images are rated neutral because they are found positive or negative, but not strongly enough.
Fig. 4 shows the images that were uncategorised¹ during SENSE1 and are now definitively classified during SENSE2, most often into one of the two major emotion classes found in SENSE1. In SENSE1, 61 images are "Uncategorised"; the main contribution of this paper concerns this kind of images. Figure 4 shows that a large part of them (79%) is now categorized, often into one of the two major classes of SENSE1. Reducing the viewing region has probably reduced the semantic content and the analysis time.
Fig. 5 represents the rate of images correctly categorized during SENSE2 according to the percentage of the image covered by the observed thumbnails. It shows that, for the three classes of emotions, from 50% coverage onwards 77% of the images are correctly categorized. This supports our hypothesis that reducing the image with a bottom-up visual attention model can offer results similar to those obtained with the full images.
¹ To be considered categorized, the major class of an image must have a percentage at least 10% higher than the next one.
References
1. Gbèhounou, S., Lecellier, F., Fernandez-Maloigne, C.: Extraction of emotional im-
pact in colour images. In: Proc. CGIV, vol. 6, pp. 314–319 (2012)
2. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for
rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998)
3. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International affective picture system
(IAPS): Affective ratings of pictures and instruction manual. Technical report A-8,
University of Florida (2008)
4. Liu, N., Dellandréa, E., Chen, L.: Evaluation of features and combination ap-
proaches for the classification of emotional semantics in images. In: International
Conference on Computer Vision Theory and Applications (2011)
5. Lucassen, M., Gevers, T., Gijsenij, A.: Adding texture to color: quantitative anal-
ysis of color emotions. In: Proc. CGIV (2010)
6. Machajdik, J., Hanbury, A.: Affective image classification using features inspired
by psychology and art theory. In: Proc. International Conference on Multimedia,
pp. 83–92 (2010)
7. Ou, L., Luo, M.R., Woodcock, A., Wright, A.: A study of colour emotion and
colour preference. part i: Colour emotions for single colours. Color Research &
Application 29(3), 232–240 (2004)
8. Paleari, M., Huet, B.: Toward emotion indexing of multimedia excerpts. In: Proc.
Content-Based Multimedia Indexing, International Workshop, pp. 425–432 (2008)
9. Perreira Da Silva, M., Courboulay, V., Prigent, A., Estraillier, P.: Evaluation of
preys/predators systems for visual attention simulation. In: International Conference
on Computer Vision Theory and Applications, VISAPP 2010, pp. 275–282 (2010)
10. Perreira Da Silva, M., Courboulay, V.: Implementation and evaluation of a com-
putational model of attention for computer vision. In: Developing and Applying
Biologically-Inspired Vision Systems: Interdisciplinary Concepts, pp. 273–306 (2012)
11. Wang, W., Yu, Y.: Image emotional semantic query based on color semantic de-
scription. In: Proc. of the Fourth International Conference on Machine Learning and Cybernetics, vol. 7, pp. 4571–4576 (2005)
12. Wei, K., He, B., Zhang, T., He, W.: Image Emotional Classification Based on Color
Semantic Description. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds.)
ADMA 2008. LNCS (LNAI), vol. 5139, pp. 485–491. Springer, Heidelberg (2008)
Contraharmonic Mean Based Bias Field
Correction in MR Images
1 Introduction
2 Basics of HUM
The HUM assumes that intensity inhomogeneity is a low-frequency component that can be separated from the high-frequency structure of the image. It is usually implemented with a noise threshold to prevent background pixels from distorting the bias field estimate.
The model of the HUM assumes that intensity inhomogeneity is multiplicative.
If the ith pixel of the inhomogeneity-free image is ui , and corresponding intensity
inhomogeneity field and noise are bi and ni , respectively, then the ith pixel vi of
the acquired image is obtained as follows:
vi = ui bi + ni . (1)
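For reference, a minimal sketch of a HUM-style correction under this multiplicative model, using a large mean filter and a noise threshold as described above (the window size, the threshold value and the function name are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def hum_correct(v, size=31, noise_threshold=50.0):
    """HUM-style bias correction sketch for the multiplicative model v = u*b + n."""
    v = v.astype(np.float64)
    mask = v > noise_threshold                       # ignore background pixels
    masked = np.where(mask, v, 0.0)
    local_sum = uniform_filter(masked, size=size)    # mean of masked intensities
    local_cnt = uniform_filter(mask.astype(np.float64), size=size)
    local_mean = np.where(local_cnt > 0, local_sum / np.maximum(local_cnt, 1e-12), 0.0)
    global_mean = v[mask].mean()
    bias = np.where(mask, local_mean / global_mean, 1.0)   # b ~ local mean / global mean
    corrected = np.where(bias > 0, v / bias, v)             # correct by division
    return corrected, bias
```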
In general, the bias field can be estimated either from the noise-free image
or from the noisy image. However, Guillemaud and Brady [4] showed that
3 Proposed Method
This section presents a new approach, using the merits of the contraharmonic mean, for estimating the bias field present in MR images.
where I denotes the set of all pixels in the image and the normalizing constant
CN is estimated by the global CHM of the intensity values of all the pixels in
the image.
Now, $\hat{u}_i < u_i \;\Leftrightarrow\; \frac{v_i}{\hat{b}_i} < \frac{v_i}{b_i} \;\Leftrightarrow\; b_i < \hat{b}_i$

$$\Leftrightarrow\; \left\{\frac{\sum_{j\in N_i} v_j^{p+1}}{\sum_{j\in N_i} v_j^{p}}\right\}\left\{\frac{\sum_{j\in I} v_j^{p}}{\sum_{j\in I} v_j^{p+1}}\right\} \;<\; \left\{\frac{\sum_{j\in N_i} v_j}{|N_i|}\right\}\left\{\frac{|I|}{\sum_{j\in I} v_j}\right\}$$

$$\Leftrightarrow\; \frac{|N_i|\sum_{j\in N_i} v_j^{p+1}}{\sum_{j\in N_i} v_j \,\sum_{j\in N_i} v_j^{p}} \;<\; \frac{|I|\sum_{j\in I} v_j^{p+1}}{\sum_{j\in I} v_j \,\sum_{j\in I} v_j^{p}} \qquad (8)$$

Subtracting 1 from both sides of (8) and multiplying by 2, we get

$$\Leftrightarrow\; \frac{\sum_{j\in N_i}\sum_{k\in N_i}(v_j - v_k)(v_j^{p} - v_k^{p})}{\sum_{j\in N_i} v_j \,\sum_{j\in N_i} v_j^{p}} \;<\; \frac{\sum_{j\in I}\sum_{k\in I}(v_j - v_k)(v_j^{p} - v_k^{p})}{\sum_{j\in I} v_j \,\sum_{j\in I} v_j^{p}}$$

$$\Leftrightarrow\; \eta_{N_i} < \eta_I, \quad \text{where } \eta_R = \frac{\sum_{j\in R}\sum_{k\in R}(v_j - v_k)(v_j^{p} - v_k^{p})}{\sum_{j\in R} v_j \,\sum_{j\in R} v_j^{p}}$$
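A minimal sketch of the CHM-based bias estimation described above, with the local contraharmonic mean of order p normalised by the global CHM; the neighbourhood size and function name are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def chm_bias_correct(v, p=1.0, size=31):
    """Contraharmonic-mean (CHM) based bias estimation sketch.

    The local bias at each pixel is the CHM of order p over a square
    neighbourhood, normalised by the global CHM of the whole image; the
    image is then corrected by division.
    """
    v = v.astype(np.float64) + 1e-12
    # Ratios of local means equal ratios of local sums over the same window.
    num = uniform_filter(v ** (p + 1), size=size)
    den = uniform_filter(v ** p, size=size)
    local_chm = num / den
    global_chm = (v ** (p + 1)).sum() / (v ** p).sum()
    bias = local_chm / global_chm
    return v / bias, bias
```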
Fig. 1 presents the performance of the proposed, HUM and N3 bias field correction methods, in terms of RMSE value. From the results reported in Fig. 1, it is observed that the proposed algorithm provides optimum restoration in 7 out of the total 12 cases, in terms of RMSE value, while optimum restoration is achieved in the remaining 5 cases using the N3 algorithm. The second, third, and fourth columns of Figs. 2 and 3 compare the reconstructed images produced by the proposed, HUM, and N3 algorithms for different bias fields and noise levels. All the results reported in Figs. 2 and 3 establish that the proposed method estimates the bias field more accurately than the existing HUM and N3 algorithms, irrespective of the bias fields and noise levels.
[Plots: RMSE of the Proposed, HUM and N3 algorithms against the noise–bias combination (0-20 to 9-20, left; 0-40 to 9-40, right).]
Fig. 1. Performance of proposed, HUM, and N3 algorithms for bias affected images
Fig. 2. Input image with 20% bias field and images restored by the proposed algorithm
(using CHM filter), the HUM algorithm of Brinkmann et al., and the N3 algorithm
One of the caveats about the HUM algorithm is that it can alter an image even
when no inhomogeneity is present, while a perfect correction algorithm should
be expected to leave the image unchanged.
From the results reported in Fig. 4, it is observed that the proposed algorithm provides better restoration in all 6 cases, in terms of the lowest RMSE value. The HUM algorithm of Brinkmann et al. and the N3 algorithm severely change the input image despite the absence of intensity inhomogeneity artifacts, whereas the proposed algorithm leaves the input image more or less unchanged.
Fig. 3. Input image with 40% bias field and images restored by the proposed algorithm
(using CHM filter), the HUM algorithm of Brinkmann et al., and the N3 algorithm
[Fig. 4 plot: RMSE of the Proposed, HUM and N3 algorithms against the noise–bias combination (0-0 to 9-0), i.e. for images without bias field.]
5 Conclusion
The contribution of the paper is twofold: the development of a bias field correction algorithm using the merits of the contraharmonic mean filter, and a demonstration of the effectiveness of the proposed algorithm, along with a comparison with other algorithms, on a set of MR images obtained from the "BrainWeb: Simulated Brain Database" for different bias fields and noise levels. A theoretical analysis is presented to justify the use of the contraharmonic mean for bias field estimation. The algorithm using the contraharmonic mean filter instead of the arithmetic mean filter provides better restoration of the MR images than the conventional AM-based HUM.
References
1. Suetens, P.: Fundamentals of Medical Imaging. Cambridge University Press (2002)
2. Sled, J.G., Zijdenbos, A.P., Evans, A.C.: A Nonparametric Method for Automatic
Correction of Intensity Nonuniformity in MRI Data. IEEE Transactions on Medical
Imaging 17(1), 87–97 (1998)
3. Wells III, W.M., Grimson, W.E.L., Kikinis, R., Jolesz, F.A.: Adaptive Segmentation of MRI Data. IEEE Transactions on Medical Imaging 15(4), 429–442 (1996)
4. Guillemaud, R., Brady, M.: Estimating the Bias Field of MR Images. IEEE Trans-
actions on Medical Imaging 16(3), 238–251 (1997)
5. Pham, D.L., Prince, J.L.: Adaptive Fuzzy Segmentation of Magnetic Resonance
Images. IEEE Transactions on Medical Imaging 18(9), 737–752 (1999)
6. Ahmed, M.N., Yamany, S.M., Mohamed, N., Farag, A.A., Moriarty, T.: A Modified
Fuzzy C-Means Algorithm for Bias Field Estimation and Segmentation of MRI
Data. IEEE Transactions on Medical Imaging 21(3), 193–199 (2002)
7. Ashburner, J., Friston, K.J.: Unified Segmentation. NeuroImage 26(3), 839–851
(2005)
8. Axel, L., Costantini, J., Listerud, J.: Intensity Correction in Surface-Coil MR Imag-
ing. American Journal of Roentgenology 148, 418–420 (1987)
9. Zhou, L.Q., Zhu, Y.M., Bergot, C., Laval-Jeantet, A.M., Bousson, V., Laredo,
J.D., Laval-Jeantet, M.: A Method of Radio-Frequency Inhomogeneity Correction
for Brain Tissue Segmentation in MRI. Computerized Medical Imaging and Graph-
ics 25(5), 379–389 (2001)
10. Narayana, P.A., Borthakur, A.: Effect of Radio Frequency Inhomogeneity Cor-
rection on the Reproducibility of Intra-Cranial Volumes Using MR Image Data.
Magnetic Resonance in Medicine 33, 396–400 (1994)
11. Bedell, B.J., Narayana, P.A., Wolinsky, J.S.: A Dual Approach for Minimizing
False Lesion Classifications on Magnetic-Resonance Images. Magnetic Resonance
in Medicine 37(1), 94–102 (1997)
12. Brinkmann, B.H., Manduca, A., Robb, R.A.: Optimized Homomorphic Unsharp
Masking for MR Grayscale Inhomogeneity Correction. IEEE Transactions on Med-
ical Imaging 17(2), 161–171 (1998)
Correlation between Biopsy Confirmed Cases
and Radiologist’s Annotations in the Detection of Lung
Nodules by Expanding the Diagnostic Database
Using Content Based Image Retrieval
1 Introduction
Lung cancer is the leading cause of cancer death in the United States. Early detection and treatment of lung cancer are important in order to improve the five-year survival rate of cancer patients. Medical imaging plays an important role in the early detection and treatment of cancer: it provides physicians with information essential for efficient and effective diagnosis of various diseases. In order to improve lung nodule detection, CAD is effective as a second opinion for radiologists in clinical settings [1]. To ensure the high quality of the data, several researchers and physicians have to be involved in
the case selection process and the delineation of regions of interest (ROIs) to cope
with the inter- and intra-observer variability, the latter being particularly important in
radiology [2]. Efforts for building a resource for the lung imaging research communi-
ty are detailed in [3]. In almost all CAD studies, the authors created their own datasets with their own ground truth for evaluation. The use of different datasets makes the comparison of these CAD systems infeasible; therefore, there is an immediate need for reference datasets that can provide a common ground truth for the evaluation and validation of these systems.
The pulmonary CT scans used in this study were obtained from the LIDC [3], and
we refer to the nodules in this dataset as the LIDC Nodule Dataset. Recently, diagno-
sis data for some of the nodules were released by the LIDC; however, because the
diagnosis is available patient-wise not nodule-wise, only the diagnoses belonging to
patients with a single nodule could be reliably matched with the nodules in the LIDC
Nodule Dataset, resulting in 18 diagnosed nodules (eight benign, six malignant, three
metastases and one unknown). The 17 nodules with known diagnoses comprise the
initial Diagnosed Subset, as the one case with an unknown diagnosis cannot be considered as ground truth. Since the diagnoses in the LIDC Diagnosis Dataset are the closest thing
to a ground truth available for the malignancy of the LIDC nodules, our goal is to
expand the Diagnosed Subset by adding nodules similar to those already in the subset.
To identify these similar nodules and to predict their diagnoses, CBIR with classi-
fication is employed. The radiologist’s annotation along with LIDC data is also con-
sidered as semantic rating to prepare the ground truth from LIDC data. Increasing the
number of nodules for which a diagnostic ground truth is available is important for
future CAD applications of the LIDC database. With the aid of similar images, radi-
ologists’ diagnoses of lung nodules in CT scans can be significantly improved [4].
Having diagnostic information for medical images is an important tool for datasets
used in clinical CBIR; however, any CAD system would benefit from a larger Diag-
nosed Subset as well as the semantic rating, since the increased variability in this set
would result in more accurately predicted diagnoses for new patients.
Only a limited number of CAD studies have used a pathologically confirmed diagnostic ground truth, since there are few publicly available databases with pathological annotations [5]. Even with the LIDC data, where biopsy-confirmed cases are available, the variability in the opinions of four different radiologists makes the data more complex and redundant. In exploring the relationship between content-based similarity and semantic-based similarity for LIDC images, Jabon et al. found that there is a high correlation between image features and radiologists' semantic ratings [6]. In the present study, the malignancy rating is also considered for patients having multiple nodules, by taking the mean of all four radiologists' ratings. McNitt-Gray et
ristics to design a linear discriminant analysis (LDA) classification system for malig-
nant versus benign nodules. Armato et al. [8] used nodule appearance and shape to
build an LDA classification system to classify pulmonary nodules into malignant
versus benign classes. Takashima et al. [9] used shape information to characterize
malignant versus benign lesions in the lung. Samuel et al. [10] developed a system for
lung nodule diagnosis using fuzzy logic. Although the work cited here provides convincing evidence that a combination of image features can indirectly encode radiologists' knowledge about indicators of malignancy, the precise mechanism by which this correspondence happens is unknown. To understand this mechanism, the correlation between all of these factors needs to be explored in order to prepare the ground truth of the LIDC data. Also, in all these systems the major concern was to distinguish benign nodules from malignant ones, whereas in the current study we have assigned a new class, metastasis, which indicates that the nodule is malignant but the primary cancer is not lung cancer; adding this new class will help physicians better understand the cause and diagnosis for those patients. The third class, metastasis, has not previously been introduced in CBIR and medical imaging. In the current study, we adopted a semi-supervised approach for labeling undiagnosed nodules in the LIDC. CBIR is used to label the nodules most similar to the query with respect to the Euclidean distance between image features.
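A minimal sketch of this labeling step, assuming precomputed feature vectors; the function name and the majority-vote tie-breaking are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def retrieve_and_label(query_feat, ref_feats, ref_labels, k=10):
    """Label a query nodule by CBIR: retrieve the k nearest reference nodules
    (Euclidean distance in feature space) and take the majority diagnosis."""
    d = np.linalg.norm(ref_feats - query_feat, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]                          # indices of the k most similar
    majority, _ = Counter(ref_labels[i] for i in nearest).most_common(1)[0]
    return majority, nearest
```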
2 Materials
The NIH LIDC has created a dataset to serve as an international research resource for
development, training, and evaluation of CAD algorithms for detecting lung nodules
on CT scans. The LIDC database, released in 2009, contains 399 pulmonary CT
scans. Up to four radiologists analyzed each scan by identifying nodules and rating
the malignancy of each nodule on a scale of 1-5. The boundaries provided in the
XML files are marked using manual as well as semi-automated methods [1][4]. Both cancerous and non-cancerous regions appear with little distinction on a CT scan image. The nine characteristics presented in [11] are the common terms physicians consider when deciding whether a nodule is benign or malignant. To the best of our knowledge, this is the first use of the LIDC dataset for the purpose of validating and classifying lung nodules using the biopsy reports as well as the attached semantics.
truth, the values of the annotations are averaged over all four radiologists. No automatic segmentation is considered in this study, as manual segmentation in medical imaging provided better results; see Fig. 1 [12].
Fig. 1 presents the radiologists' segmentation of a nodule; manual segmentation is hence considered the "gold standard". Each slice is read independently to identify the area marked by all four radiologists, and only the slice per nodule whose marked area is maximum is included in the database [13].
truth. The first query set (Rad210) used the radiologist-predicted malignancy, the second set (Comp210) used the computer-predicted malignancy, the third set (Comp_Rad_biopsy57) used only those nodules for which the radiologist-predicted, computer-predicted and biopsy-confirmed malignancies agreed, and the fourth set used only those nodules for which the radiologist- and computer-predicted malignancies agreed. The radiologist-predicted and computer-predicted query sets contained an equal number of nodules, i.e. 210, the radiologist-computer-biopsy-agreement query set contained 57, and Rad_Comp92 contained 92 nodules after all modifications.
3 Methods
In this way, 1677 nodules from 62 patients were assigned a malignancy class as above. These 1677 nodules, containing multiple slices per nodule, form the RadioMarked62 set, which has further been reduced to 210 nodules to form QueryNoduleSet210. QueryNoduleSet210 is further divided into categories such as Rad210, Comp210 and Comp_Rad_biopsy57, as explained earlier.
In the absence of diagnostic information, labels can be applied to unlabeled data using
semi-supervised learning (SSL) approaches. In SSL, unlabeled data is exploited to
improve learning when the dataset contains an insufficient amount of labeled data
[14]. Using available datasets and by evaluating the method with a CAD application,
we determined how to effectively expand the Diagnosed17 with CBIR and assist the
physicians in the final diagnosis. Each nodule in the QueryNoduleSet210 was then
used as a query to retrieve the ten most similar images from the remaining nodules in
the Diagnosed17 using CBIR with Euclidean distance. The query nodule was as-
signed a predicted malignancy rating based on the retrieved nodules (e.g., if the majority of the retrieved nodules belong to the malignant class, then the query nodule was assigned class M), see Fig. 2. The newly labeled nodules were considered candidates for addition to the Diagnosed17.
[Figure: CBIR setup — query set RadioMarked62/QueryNoduleSet210, retrieval set Diagnosed17, 10 nodules retrieved per query.]
Nodules to be added to the Diagnosed17 were selected from the candidates described above. To verify the addition of a candidate nodule to the Diagnosed17, a reverse mechanism is adopted: Diagnosed17 nodules act as queries and the nodules to be retrieved come from QueryNoduleSet210, see Fig. 2. The first three similar nodules are assigned the same malignancy as the query nodule if they were previously assigned as candidate nodules (i.e. if the query nodule is benign, then the top three retrieved nodules are also assigned the class benign, provided they were previously assigned as candidate nodules). With this mechanism, Diagnosed17 is expanded to Diagnosed74 and then to Diagnosed121. This process is repeated until no candidate nodules are added to the Diagnosed Subset in an iteration. Neither computer-predicted nor radiologist-predicted malignancy ratings can be considered ground truth, due to the high variability between radiologists' ratings [5]; this mechanism therefore ensures the preparation of the LIDC ground truth and the accuracy of CBIR-based diagnostic labeling. All the nodules can be classified into three classes: benign, malignant and metastasis.
Using the query and retrieval sets described above, the average precision after 3, 5, 10, and 15 retrieved images was calculated. A retrieved nodule was considered relevant if its diagnosis matched the malignancy rating (radiologist-predicted, computer-predicted, or both) of the query nodule. Initial precision values were obtained by using the 17 nodules of the initial Diagnosed17 as the retrieval set. Then, nodules were added to this set as described in Sections 2.2 and 2.3, precision was recalculated, and the nodule addition process was repeated iteratively using the new Diagnosed Subset.
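This precision measure can be sketched as follows (a hypothetical helper, assuming the retrieved labels are already ordered by distance):

```python
def precision_at_k(retrieved_labels, query_label, ks=(3, 5, 10, 15)):
    """Precision after k retrieved images: fraction of the top-k retrieved
    nodules whose diagnosis matches the query's malignancy rating."""
    return {k: sum(1 for lab in retrieved_labels[:k] if lab == query_label) / float(k)
            for k in ks}
```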
Fig. 3. Comparison of precision for the different query sets (x-axis) and the different retrieval sets (Diagnosed17, Diagnosed74, Diagnosed121)
Various experiments were set up to validate the nodules examined. Fig. 3 shows that, with five query sets and the three retrieval sets Diagnosed17, Diagnosed74 and Diagnosed121, the precision increases accordingly. Nodules in Comp_Rad_biopsy57 provided the best precision, i.e. 98%, which is, to the best of our knowledge, the highest precision reported for medical CBIR.
CBIR is an effective method for expanding the Diagnosed Subset by labeling nodules which do not have associated diagnoses. As the LIDC lacks a diagnostic ground truth, CBIR techniques work remarkably well to prepare one. This method outperforms control expansion, yielding higher precision values when tested with a potential CAD application [12] that requires a diagnostically accurate ground truth. By increasing the size of the Diagnosed Subset from 17 to 74 and finally to 121 nodules, CBIR expansion provides greater variability in the retrieval set, resulting in retrieved nodules that are more similar to undiagnosed queries.
References
1. Wormanns, D., Fiebich, M., Saidi, M., Diederich, S., Heindel, W.: Automatic detection of
pulmonary nodules at spiral CT: clinical application of a computer-aided diagnosis system.
European Radiology 12, 1052–1057 (2002)
2. Blum, A., Mitchell, T.: Combining Labelled and Unlabelled Data with Co-Training. In:
Proceedings of the 11th Annual Conference on Computational Learning Theory, COLT
1998, pp. 92–100 (1998)
3. McNitt-Gray, M.F., Armato, S.G., Meyer, C.R., Reeves, A.P., McLennan, G., Pais, R.C.,
et al.: The lung image database consortium (LIDC) data collection process for nodule de-
tection and annotation. Academic Radiology 14(12), 1464–1474 (2007)
4. Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves,
A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., Kazerooni, E.A., MacMa-
hon, H., van Beek, E.J.R., Yankelevitz, D., et al.: The Lung Image Database Consortium
(LIDC) and Image Database Resources Initiative (IDRI): A completed reference database
of lung nodules on CT scans. Medical Physics 38, 915–931 (2011)
5. Horsthemke, W.H., Raicu, D.S., Furst, J.D., Armato III, S.G.: Evaluation Challenges for
Computer-Aided Diagnostic Characterization: Shape Disagreements in the Lung Image
Database Consortium Pulmonary Nodule Dataset. In: Tan, J. (ed.) New Technologies for
Advancing Healthcare and Clinical Practices, pp. 18–43. IGI Global, Hershey PA (2011)
6. Jabon, S.A., Raicu, D.S., Furst, J.D.: Content-based versus semantic-based similarity re-
trieval: a LIDC case study. In: SPIE Medical Imaging Conference, Orlando (February 2009)
7. McNitt-Gray, M.F., Hart, E.M., Wyckoff, N., Sayre, J.W., Goldin, J.G., Aberle, D.R.: A
pattern classification approach to characterizing solitary pulmonary nodules imaged on
high resolution CT: Preliminary results. Med. Phys. 26, 880–888 (1999)
8. Armato III, S.G., Altman, M.B., Wilkie, J., Sone, S., Li, F., Doi, K., Roy, A.S.: Automated
lung nodule classification following automated nodule detection on CT: A serial approach.
Med. Phys. 30, 1188–1197 (2003)
9. Takashima, S., Sone, S., Li, F., Maruyama, Y., Hasegawa, M., Kadoya, M.: Indeterminate
solitary pulmonary nodules revealed at population-based CT screening of the lung: using
first follow-up diagnostic CT to differentiate benign and malignant lesions. Am. J. Roent-
genol. 180, 1255–1263 (2003)
10. Samuel, C.C., Saravanan, V., Vimala, D.M.R.: Lung nodule diagnosis from CT images us-
ing fuzzy logic. In: Proceedings of International Conference on Computational Intelligence
and Multimedia Applications, Sivakasi, Tamilnadu, India, December 13-15, pp. 159–163
(2007)
11. Raicu, D.S., Varutbangkul, E., Furst, J.D.: Modelling semantics from image data: oppor-
tunities from LIDC. International Journal of Biomedical Engineering and Technology
3(1-2), 83–113 (2009)
12. Giuca, A.-M., Seitz Jr., K.A., Furst, J., Raicu, D.: Expanding diagnostically labeled data-
sets using content-based image retrieval. In: IEEE International Conference on Image
Processing 2012, Lake Buena Vista, Florida, September 30-October 3 (2012)
13. Aggarwal, P., Vig, R., Sardana, H.K.: Largest Versus Smallest Nodules Marked by Differ-
ent Radiologists in Chest CT Scans for Lung Cancer Detection. In: International Confe-
rence on Image Engineering, ICIE 2013 Organized by IAENG at Hong Kong (in press,
2013)
14. Zhou, Z.-H.: Learning with Unlabeled Data and Its Application to Image Retrieval. In:
Yang, Q., Webb, G. (eds.) PRICAI 2006. LNCS (LNAI), vol. 4099, pp. 5–10. Springer,
Heidelberg (2006)
Enforcing Consistency of 3D Scenes
with Multiple Objects
Using Shape-from-Contours
1 Introduction
and the contours Ci segmented from the image of the corresponding view are
detected and then used to correct the 3D scene. The visual hull, denoted by H, is
the outer bound of the scene based on its appearance in several images and was
used for modelling 3D scenes from images in shape-from-silhouettes [4,6,5,11].
If a point lies within the visual hull then its projection falls inside the scene
silhouette in each image. However, the visual hull will not be able to represent
certain regions in multi-object scenes, such as for example the region from the
middle of the scene, due to the fact that objects will invariably occlude each
other in several images in such scenes.
In shape-from-contours, the concept of the visual hull is applied to individual
objects from the scene. In this case, the visual hull of objects, denoted as H(ai )
and H(bi ) is provided by the object contours such as ai or bi from each image
i = 1, . . . , n, where these objects are visible. After comparing the sets of pixels
corresponding to 2D contours and those from the projected 3D contours we
identify the regions corresponding to the undesired difference sets as:
{c | Pi c ∈ (S(Pi CS ) \ S(Ci )) ∪ (S(Ci ) \ S(Pi CS ))} (2)
where S(Ci ) represents the set of pixels located in the interior of contour Ci ,
i = 1, . . . , n and c ∈ S is a point from the 3D scene, whose projection Pi c lies
among the pixels from the area between the sets Pi CS and Ci from each image.
Such points are displaced to their nearest surface in 3D along its surface normal:
$$\hat{c} = c - \gamma\,\frac{1}{m}\sum_{i=1}^{m} P_i^{-1}\,\overrightarrow{\nabla_{z_j} C_i}\,, \qquad (3)$$
where m ≤ n represents all the images in which the inconsistency between the
3D scene projections and the actual object contours is identified, γ is a correction
−−−−→
factor, and ∇zj Ci represents the correction vector which is perpendicular on the
object contour calculated in the location zj . Eventually, such points would be
located in S(Pi CS )∩S(Ci ) ensuring the consistency of H(ai ) and H(bi ) with the
3D scene S. This methodology can be applied to various surface representations
such as voxels, parametric (including RBFs) and meshes, where surface self-
intersections would have to be avoided [3].
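For binary silhouette masks, the difference set of Eq. (2) is simply the symmetric difference of the two pixel sets; a minimal sketch (function name and mask format are illustrative):

```python
import numpy as np

def inconsistent_region(projected_mask, contour_mask):
    """Pixels in the undesired difference set of Eq. (2): the symmetric
    difference between the interior of the projected 3D contour and the
    interior of the contour segmented from the image."""
    return np.logical_xor(projected_mask.astype(bool), contour_mask.astype(bool))

# Example: count the inconsistent pixels for one view.
# n_bad = inconsistent_region(proj_i, seg_i).sum()
```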
In the following we consider the case when two objects from the scene are wrongly merged together in the initial stages of the 3D modelling of the scene. Such situations may arise due to object occlusion, uncertainty in the camera parameters, illumination conditions, image noise, etc. [3]. We consider a simple artificial scene consisting of two identical objects, either a pair of cylinders or a pair of cuboids with a square base. A circular configuration of n cameras is considered, evenly spaced on a circle located at a height corresponding to θ = 0.85 radians.
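A small sketch of such a camera configuration; only the even spacing and θ = 0.85 rad come from the text, while the circle radius and the interpretation of θ as an elevation angle are assumptions:

```python
import numpy as np

def camera_centres(n, radius=3.0, theta=0.85):
    """Centres of n cameras evenly spaced on a circle raised to elevation theta."""
    phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)   # even azimuth spacing
    x = radius * np.cos(theta) * np.cos(phi)
    y = radius * np.cos(theta) * np.sin(phi)
    z = np.full(n, radius * np.sin(theta))                    # constant height
    return np.stack([x, y, z], axis=1)
```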
[Plots: surface error eA and Hausdorff distance eH against the azimuth φ and the elevation θ or the inter-object distance d; last panel shows best-case and worst-case curves of eA against the number of cameras n.]
Fig. 2. Surface error eA (F, G) plotted against the number of cameras n
the objects is kept constant. We also consider varying the distance between the
objects d by moving them away from each other. Plots of both Hausdorff distance
[10] and the area error eA (F, G) are shown when varying φ and d in Figs. 1(c)
and 1(d) for the scene showing two cuboids. These plots clearly show the width
of the peak getting smaller as the intra-object gap is reduced.
Large numbers of input images do not necessarily improve 3D scene recon-
struction proportionally [11]. We estimate the necessary number of cameras for
detecting when two cuboids are merged. For each number of given cameras n,
we record the minimum and maximum area errors eA (F, G), measured between
the fused and separate case hypotheses when considering all possible offsets. The results for the best and worst cases when detecting the separation for each camera configuration are shown in Figure 2. It can be observed that when using
more than 10 cameras, the error eA (F, G) from (4) provides a good assessment
of the inconsistencies between the 3D scene and the objects shown in images.
4 Experimental Results
The proposed methodology of correcting 3D scenes of multiple objects using
the consistency with object contours was applied on various image sets. Four
images, from a set of n = 12 images of a scene with 5 main objects captured
from various viewpoints, are shown in Figs. 3a-e. We initialize the 3D scene using
space carving [1], represent its surface with implicit RBFs [2], and correct the
image disparities from projections of 3D patches as in [7]. The resulting 3D scene
is shown from two different angles in Figs. 4a and 7a. It can be observed that
two of the objects representing a knife-block and a kettle are merged together
as shown in the closer view from Fig. 4b. The results provided by shape-from-
silhouettes (SFS) [5,6] when applied on the original set of 12 images are shown in Fig. 5a. In Fig. 5b we apply SFS to the result of the 3D scene from Fig. 4a.
(a) Applied on the initial image set (b) Applied on the 3D initial estimate
Fig. 5. Shape-from-silhouettes results
between the kettle and knife-block. These results are definitely better than those
provided by SFS from Figs. 5a and 5b.
Numerical errors are evaluated for assessing the improvement in the 3D scene when using SFC from either unsupervised or supervised image segmentations. We consider two error measures as in the study from Section 3: the Hausdorff distance [10] and the area error eA (F, G) from (4). These measures assess the differences between the projected contours of objects and their corresponding image-segmented contours. Numerical results are shown in Figs. 8a and 8b when varying the azimuth angle φ. An error peak, which corresponds to the region located between the kettle and the knife-block, can be seen in all four curves in both plots of Fig. 8.
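A minimal sketch of the symmetric Hausdorff distance between two contour point sets (illustrative only; the area error eA(F, G) of (4) is not reproduced here):

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two 2-D point sets (rows are points),
    used here to compare projected and segmented contours."""
    d = cdist(A, B)                                   # pairwise Euclidean distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```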
[Plots: Hausdorff distance eH (left) and area error eA (right) against the azimuth φ, for unsupervised and supervised segmentations.]
Fig. 8. Numerical accuracy of extracted contours for the entire image set
In the following, we evaluate the PSNR, for the region shown inside the black rectangle in Fig. 3e, between the original image and the corresponding projections from the corrected 3D scene. For this region, the initial PSNR of 11.17 dB is improved to 13.91 dB and to 13.75 dB when using unsupervised and supervised segmentations for SFC, respectively. Three different views of the entire 3D scene reconstruction are shown in Fig. 9 after enforcing unsupervised shape-from-contours consistency, considering the same projection parameters P as in the original image set.
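For reference, the PSNR over a region can be computed with the standard definition (an 8-bit peak value is assumed here):

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a region of the original image
    and its re-projection from the corrected 3D scene."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```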
5 Conclusions
is provided together with the analysis of the number of cameras required for
detecting such errors. The proposed methodology can be applied for correcting
various 3D scene representations including those using voxels, meshes or para-
metric models.
References
1. Broadhurst, A., Drummond, T.W., Cipolla, R.: A probabilistic framework for space
carving. In: Proc. Int. Conf. on Comp. Vision, vol. 1, pp. 388–393 (2001)
2. Dinh, H.Q., Turk, G., Slabaugh, G.: Reconstructing surfaces by volumetric regular-
ization using radial basis functions. IEEE Trans. on Pattern Analysis and Machine
Intelligence 24(10), 1358–1371 (2002)
3. Zaharescu, A., Boyer, E., Horaud, R.: Topology-Adaptive Mesh Deformation for
Surface Evolution, Morphing, and Multiview Reconstruction. IEEE Trans. on Pat-
tern Analysis and Machine Intelligence 33(4), 823–837 (2011)
4. Liang, C., Wong, K.-Y.: Robust recovery of shapes with unknown topology from
the dual space. IEEE Trans. on Pat. Anal. and Machine Intell. 29(12), 2205–2216
(2007)
5. Lazebnik, S., Furukawa, Y., Ponce, J.: Projective visual hulls. Int. Jour. of Comp. Vision 74(2), 137–165 (2007)
6. Koenderink, J.: What does the occluding contour tell us about solid shape. Per-
ception 13(3), 321–330 (1984)
7. Grum, M., Bors, A.G.: Enforcing image consistency in multiple 3-D object mod-
elling. In: Proc. Int. Conf. on Pattern Recog., Tampa, FL, USA, pp. 3354–3357
(2008)
8. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space
analysis. IEEE Trans. on Pattern Analysis and Machine Intell. 24(5), 603–619
(2002)
9. Scholkopf, B., Sung, K.-K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., Vapnik,
V.: Comparing support vector machines with Gaussian kernels to Radial Basis
Function Classifiers. IEEE Trans. on Signal Processing 45(11), 2758–2765 (1997)
10. Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing images using
the Hausdorff distance. IEEE Trans. on Pattern Analysis and Machine Intelli-
gence 15(9), 850–863 (1993)
11. Chaurasia, G., Sorkine, O., Drettakis, G.: Silhouette-Aware Warping for Image-
Based Rendering. Computer Graphics Forum 30(4), 1223–1232 (2011)
Expectation Conditional Maximization-Based
Deformable Shape Registration
Guoyan Zheng
Abstract. This paper addresses the issue of matching statistical and non-rigid
shapes, and introduces an Expectation Conditional Maximization-based de-
formable shape registration (ECM-DSR) algorithm. Similar to previous works,
we cast the statistical and non-rigid shape registration problem into a missing
data framework and handle the unknown correspondences with Gaussian Mix-
ture Models (GMM). The registration problem is then solved by fitting the
GMM centroids to the data. But unlike previous works where equal isotropic
covariances are used, our new algorithm uses heteroscedastic covariances
whose values are iteratively estimated from the data. A previously introduced
virtual observation concept is adopted here to simplify the estimation of the reg-
istration parameters. Based on this concept, we derive closed-form solutions to
estimate parameters for statistical or non-rigid shape registrations in each itera-
tion. Our experiments conducted on synthesized and real data demonstrate that
the ECM-DSR algorithm has various advantages over existing algorithms.
1 Introduction
their algorithm allows the use of general covariance matrices for the mixture model
components and improves over the equal isotropic covariance case, but they only
applied their algorithm to solve rigid and articulated point registration problems. Re-
cently Xie et al. [14] used the ECM algorithm to solve the statistical shape registra-
tion problem but the shape coefficients were estimated asynchronously.
In this paper, we extend the ECM algorithm [9, 19] to solve the statistical and non-rigid shape registration problems and introduce the ECM-based deformable shape registration (ECM-DSR) algorithm. Unlike previous works where equal isotropic covariances are used, our new algorithm uses heteroscedastic covariances whose values are iteratively estimated. Furthermore, a previously introduced virtual observation concept is adopted here to simplify the estimation of the registration parameters. Based on this concept, we derive closed-form solutions to estimate the parameters for statistical or non-rigid shape registration in each iteration. We conducted comprehensive experiments on synthesized and real data to demonstrate the advantages of the ECM-DSR algorithm over existing algorithms.
Details about the ECM-DSR algorithm will be described in Section 2, followed by
experimental results in Section 3. We conclude the paper in Section 4.
Throughout the paper, we use the following notation. The superscript "T" means "transpose". D is the dimension of the point sets; N and M are the numbers of points in the two point sets; X = (x_1, ..., x_N) is the data matrix for the data points and Y = (y_1, ..., y_M) is the data matrix for the GMM centroids; T(Y, θ) is the transformation applied to Y to get the new positions of the GMM centroids (more specifically, T(y_m, θ)), where θ is the set of transformation parameters. ||L(T)||² is a regularization over the transformation. For a statistical shape model, we also use Y to indicate the mean model. Given a cut-off number L, λ_l, l = 1, 2, ..., L, are the eigenvalues sorted in descending order, and φ_l, l = 1, 2, ..., L, are their corresponding normalized eigenvectors of the statistical shape model, where each eigenvector is φ_l = (φ_{l,1}, ..., φ_{l,M}). Furthermore, the dot product between two vectors a and b is written as either a·b or ⟨a, b⟩, depending on the context. I is an identity matrix, diag(v) denotes a diagonal matrix constructed from the vector v, and tr(A) denotes the trace of a matrix A.
$$p_{mn} = \frac{\exp\!\left(-\,\|x_n - T(y_m,\theta)\|^2/(2\sigma_m^2)\right)}{\sum_{m'=1}^{M}\exp\!\left(-\,\|x_n - T(y_{m'},\theta)\|^2/(2\sigma_{m'}^2)\right)} \qquad (2)$$

$$Q = \sum_{n=1}^{N}\sum_{m=1}^{M} p_{mn}\left[\frac{\|x_n - T(y_m,\theta)\|^2}{2\sigma_m^2} + \frac{D}{2}\log\sigma_m^2\right] + \lambda\,\|L(T)\|^2 \qquad (3)$$

where λ is the parameter controlling the contribution of the regularization, and the problem is solved by two conditional minimization steps, using the posteriors p_{mn} given by (2):

A. Estimate the registration parameters by minimizing

$$\theta = \arg\min_{\theta}\; \sum_{n=1}^{N}\sum_{m=1}^{M} \frac{p_{mn}}{\sigma_m^2}\,\|x_n - T(y_m,\theta)\|^2 + \lambda\,\|L(T)\|^2 \qquad (4)$$

Details about how to estimate the registration parameters will be given in Section 2.3.

B. For all m = 1, 2, ..., M, estimate the covariances using the following closed-form solution:

$$\sigma_m^2 = \frac{\sum_{n=1}^{N} p_{mn}\,\|x_n - T(y_m,\theta)\|^2}{D\sum_{n=1}^{N} p_{mn}} \qquad (5)$$

Writing ν_m = Σ_n p_{mn} and denoting by w_m = (1/ν_m) Σ_n p_{mn} x_n the virtual observation of the m-th centroid, step A can be rewritten as

$$\theta = \arg\min_{\theta}\; \sum_{m=1}^{M} \frac{\nu_m}{\sigma_m^2}\,\|w_m - T(y_m,\theta)\|^2 + \lambda\,\|L(T)\|^2 \qquad (6)$$

Eq. (6) and Eq. (4) have exactly the same solution. This can be proved by expanding the first term of the right-hand side of Eq. (4) and neglecting the constant coefficient:

$$\sum_{m=1}^{M}\frac{1}{\sigma_m^2}\left[\sum_{n=1}^{N} p_{mn}\,\langle x_n, x_n\rangle - 2\Big\langle\sum_{n=1}^{N} p_{mn}\,x_n,\; T(y_m,\theta)\Big\rangle + \nu_m\,\big\langle T(y_m,\theta),\, T(y_m,\theta)\big\rangle\right] \qquad (7)$$
Since the first term of Eq. (7) does not depend on the registration parameters θ, replacing it with ν_m⟨w_m, w_m⟩ will not change the original optimization problem of Eq. (4). The second term of Eq. (7) is 2ν_m⟨w_m, T(y_m, θ)⟩, and the third term is ν_m⟨T(y_m, θ), T(y_m, θ)⟩. Combining all three terms we obtain Σ_m (ν_m/σ_m²) ||w_m − T(y_m, θ)||². This proves that Eq. (6) and Eq. (4) have the same solution.
With the virtual observation concept, we can now discuss how to solve Eq. (6). The solution depends on how we parameterize the transformation T(Y, θ).
The registration parameters are then obtained in closed form as ratios of sums of dot products involving the virtual observations and the GMM centroids (Eqs. (9) and (10)).
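To illustrate the overall structure only, here is a generic sketch of one ECM-style iteration for GMM-based point registration with per-centroid (heteroscedastic) isotropic variances and a virtual-observation CM-step; the transformation fit is left abstract, and the update formulas are a standard generic form rather than the paper's exact closed-form solutions:

```python
import numpy as np

def ecm_iteration(X, Y, sigma2, estimate_transform):
    """One generic ECM-style iteration for GMM-based point registration.

    X: (N, D) data points; Y: (M, D) current (transformed) GMM centroids;
    sigma2: (M,) per-centroid isotropic variances; estimate_transform:
    callable fitting the transformation to weighted virtual observations
    (the CM-step, kept abstract here).
    """
    N, D = X.shape
    # E-step: posterior responsibilities p[m, n] of centroid m for point n.
    d2 = ((X[None, :, :] - Y[:, None, :]) ** 2).sum(-1)            # (M, N)
    log_p = -0.5 * d2 / sigma2[:, None] - 0.5 * D * np.log(sigma2)[:, None]
    log_p -= log_p.max(axis=0, keepdims=True)                      # numerical stability
    p = np.exp(log_p)
    p /= p.sum(axis=0, keepdims=True)
    # Virtual observations: weighted means of the data assigned to each centroid.
    nu = p.sum(axis=1)                                             # (M,)
    W = (p @ X) / nu[:, None]                                      # (M, D)
    # CM-step 1: fit the transformation to the virtual observations.
    Y_new = estimate_transform(Y, W, nu / sigma2)
    # CM-step 2: update the per-centroid variances from the residuals.
    sigma2_new = (p * ((X[None, :, :] - Y_new[:, None, :]) ** 2).sum(-1)).sum(1) / (D * nu)
    return Y_new, sigma2_new, p
```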
3 Experimental Results
Qualitative and quantitative experiments are conducted to evaluate the performance of
the present approach.
Fig. 1. Performance of CPD (top) and ECM-DSR (bottom) in handling outliers. From left to right: inputs, results, overlap of the results with the ground truth, and estimated covariances depicted as black circles around the GMM centroids, whose radii equal the corresponding covariances.
¹ From http://www.umiacs.umd.edu/~zhengyf/PointMatchDemo/DataChui.zip we got the synthesized data, and from https://sites.google.com/site/myronenko/research/cpd we got the reference implementation of the CPD algorithm and the bunny model data shown in Fig. 3.
Fig. 2. Equal covariances estimated by the CPD algorithm (top) and heteroscedastic covariances estimated by the ECM-DSR algorithm (bottom) during the iterations. The rightmost column shows the estimated covariances after convergence. Three regions of high uncertainty (depicted with red ellipses) can be identified for ECM-DSR but not for CPD.
Fig. 4. Boxplots of the comparison of the ECM-DSR with those of the CPD on the Chui-
Rangarajan synthesized data sets. Top: the Chinese Character shape, bottom: the fish shape.
three sets of data designed on two shape templates (Chinese Character shape and fish
shape) to measure the robustness of an algorithm under different degrees of deforma-
tion, noise levels and outliers. Fig. 4 demonstrates the quantitative comparison results
between our non-rigid registration algorithm and the CPD algorithm.
4 Conclusions
In this paper we presented a robust point matching algorithm for statistical and non-
rigid shape registration based on the Expectation Conditional Maximization algo-
rithm. Our experiments conducted on synthesized and real data demonstrate that
ECM-DSR has various advantages over existing algorithms, including less iteration
steps required for convergence, higher accuracy, more robust to outliers and providing
more information about the uncertainty of the registration results.
References
1. Besl, P., McKay, N.: A method for registration of 3-d shapes. IEEE Trans. Pat. Anal. and
Machine Intel. 14, 239–256 (1992)
2. Granger, S., Pennec, X.: Multi-scale EM-ICP: A fast and robust approach for surface reg-
istration. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV.
LNCS, vol. 2353, pp. 418–432. Springer, Heidelberg (2002)
3. Chui, H., Rangarajan, A.: A new point matching algorithm for nonrigid registration. Com-
put. Vis. Image Understand. 89(2-3), 114–141 (2003)
4. Luo, B., Hancock, E.: A unified framework for alignment and correspondence. Comp. Vis.
and Image Underst. 92(1), 26–55 (2003)
5. Tsin, Y., Kanade, T.: A correlation-based approach to robust point set registration. In: Paj-
dla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023, pp. 558–569. Springer, Heidel-
berg (2004)
6. Zheng, Y., Doermann, D.S.: Robust Point Matching for Nonrigid Shapes by Preserving Local
Neighborhood Structures. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 643–649 (2006)
7. Myronenko, A., Song, X.: Point set registration: coherent point drift. IEEE Trans. Pattern
Anal. Mach. Intell. 32(12), 2262–2275 (2010)
8. Jian, B., Vemuri, B.C.: Robust Point Set Registration Using Gaussian Mixture Models.
IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1633–1645 (2011)
9. Horaud, R., Forbes, F., Yguel, M., Dewaele, G., Zhang, J.: Rigid and Articulated Point
Registration with Expectation Conditional Maximization. IEEE Trans. Pattern Anal. Mach.
Intell. 33(3), 587–602 (2011)
10. Chui, H., Rangarajan, A., Zhang, J., Leonard, C.M.: Unsupervised learning of an atlas
from unlabeled point-sets. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 160–172 (2004)
11. Hufnagel, H., Pennec, X., Ehrhardt, J., Ayache, N., Handels, H.: Generation of a statistical
shape model with probabilistic point correspondences and the expectation maximization-
iterative closest point algorithm. Int. J. CARS 2(5), 265–273 (2008)
12. Abi-Nahed, J., Jolly, M., Yang, G.Z.: Robust active shape models: A robust, generic and
simple automatic segmentation tool. In: Larsen, R., Nielsen, M., Sporring, J. (eds.)
MICCAI 2006. LNCS, vol. 4191, pp. 1–8. Springer, Heidelberg (2006)
13. Shen, K., Bourgeat, P., Fripp, J., Meriaudeau, F., Salvado, O.: Consistent estimation of
shape parameters in statistical shape model by symmetric EM algorithm. In: SPIE Medical
Imaging 2012: Image Processing, vol. 8134, p. 83140R (2012)
14. Xie, W., Schumann, S., Franke, J., Grützner, P.A., Nolte, L.P., Zheng, G.: Finding De-
formable Shapes by Correspondence-Free Instantiation and Registration of Statistical
Shape Models. In: Wang, F., Shen, D., Yan, P., Suzuki, K. (eds.) MLMI 2012. LNCS,
vol. 7588, pp. 258–265. Springer, Heidelberg (2012)
15. Kang, X., Taylor, R.H., Armand, M., Otake, Y., Yau, W.P., Cheung, P.Y.S., Hu, Y.: Cor-
respondenceless 3D-2D registration based on expectation conditional maximization. In:
Proc. SPIE, vol. 7964, p. 79642Z (2011)
16. Chen, T., Vemuri, B.C., Rangarajan, A., Eisenschenk, S.J.: Groupwise point-set registra-
tion using a novel CDF-based Havrda-Charvtat divergence. IJCV 86(1), 111–124 (2010)
17. Rasoulian, A., Rohling, R., Abolmaesumi, P.: Group-wise registration of point sets for sta-
tistical shape models. IEEE Trans. Med. Imaging 31(11), 2025–2034 (2012)
18. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. Royal Statistical Soc. (B) 39, 1–38 (1977)
19. Meng, X.-L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: A
general framework. Biometrika 80(2), 267–278 (1993)
Facial Expression Recognition with Regional
Features Using Local Binary Patterns
Abstract. This paper presents a simple yet efficient and completely automatic approach to recognize the six fundamental facial expressions using Local Binary Pattern (LBP) texture features. A system is proposed that can automatically locate four important facial regions, from which the uniform LBP features are extracted and concatenated to form a 236-dimensional enhanced feature vector used for the recognition of the six fundamental expressions. The features are trained using three widely used classifiers: Naive Bayes, Radial Basis Function Network (RBFN) and a three-layered Multi-Layer Perceptron (MLP3). The notable feature of the proposed method is the use of a few preferred regions of the face to extract the LBP features, as opposed to the use of the entire face. The experimental results obtained on the MMI database show the proficiency of the proposed feature extraction method.
1 Introduction
Communication plays a very important role in our day-to-day life. Facial expressions, which come under the category of nonverbal communication, are considered to be one of the most powerful and immediate means of recognizing one's emotions, intentions and opinions about each other. A study by Mehrabian [1] found that while communicating feelings and attitudes, a person conveys 55% of the message through facial expression alone, 38% via vocal cues and the remaining 7% through verbal cues. The goal of a facial expression recognition system is to have an automatic system that can recognize expressions like Happiness, Sadness, Disgust, Anger, Surprise and Fear regardless of the person's identity. Over the last few decades [2,3,4,5], researchers have been working on facial expression recognition and a lot of advances have been made in recent years. But recognizing facial expressions with high accuracy is still a challenging problem because of the subtlety, complexity and variability of expressions.
Generally, techniques to represent the facial features needed for expression recognition are broadly categorized into two types: geometric-based methods and
appearance-based methods. Geometric features, such as eyes, mouth and nose, contain location and shape information of those features, whereas appearance features examine the appearance changes of the face, like wrinkles, bulges and furrows [6]. In the presented work, we use both facial geometry and appearance-based information to represent facial features for the six fundamental expressions (Happiness, Sadness, Disgust, Anger, Surprise, Fear). Facial geometry is used for the automatic localization of four important facial regions: the two eye regions, the nose region and the lips region. These regions are extracted in such a way that they also cover the nearby areas containing important information needed for expression recognition. LBPs have already been shown to be a successful texture descriptor in many computer vision applications [7,8,3]. LBPs are used instead of Gabor filters because of their simplicity and much lower dimensionality. Many researchers have applied LBPs to facial expression recognition, but mostly they either applied them over the whole face image or after dividing the face region into M × N sub-blocks [6,3]. A holistic representation of features loses information related to the location of the features. Moreover, dividing the whole facial region into different sub-blocks and then taking LBPs from each sub-block needs more computation time, which is unnecessary: not all the blocks within the face region contain useful information needed for expression recognition. We introduce a new method that reduces such unnecessary computation cost by localizing the required facial zones needed for expression recognition. Uniform LBPs obtained from each of the 4 facial regions are concatenated to form a feature vector of dimension 236. We apply three widely used classifiers: Naive Bayes, Radial Basis Function Network, and three-layered MLP [9,10].
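A minimal sketch of this feature extraction, assuming scikit-image's non-rotation-invariant uniform LBP (59 labels for P = 8, so four regions give 4 × 59 = 236 dimensions); the function name and region format are illustrative:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def region_lbp_features(gray, regions, P=8, R=1):
    """Concatenate uniform-LBP histograms of the four facial regions.

    `regions` is a list of (x, y, w, h) boxes (two eyes, nose, lips);
    'nri_uniform' LBP with P=8 has 59 labels, so four regions give a
    236-dimensional feature vector.
    """
    feats = []
    for (x, y, w, h) in regions:
        patch = gray[y:y + h, x:x + w]
        lbp = local_binary_pattern(patch, P, R, method="nri_uniform")
        hist, _ = np.histogram(lbp, bins=59, range=(0, 59), density=True)
        feats.append(hist)
    return np.concatenate(feats)        # length 236 for four regions
```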
The rest of the paper is organized as follows. Section 2 demonstrates the extraction techniques for the different facial regions. Section 3 gives a brief overview of the local binary pattern and the proposed method. Section 4 shows the experimental results obtained after applying three different classifiers: Naive Bayes, RBFN and MLP3. Finally, conclusions are drawn in Section 5.
– Feature representation
– Modeling of an appropriate classifier [3].
Extraction of features that can represent the facial expressions effectively is the key to an accurate facial expression recognition system. In this paper, we introduce a new mechanism for the automatic extraction of appearance features from different facial regions using local binary patterns. Fig. 1 shows the four important localized regions from which the LBP features are extracted.
The flow diagram shown in Fig. 2 demonstrates the steps involved in the automatic facial region extraction technique.
[Fig. 1: facial geometry used to localize the regions — eye centres (x1, y1) and (x2, y2); nose height h_nose = 1/3 × face height; nose width = x2 − x1; lip region height h_l = 1.5 × h_nose.]
The foremost important requirement for better facial expression recognition results is to have automatic and accurate face and facial region detection methods. Accurate extraction of facial features is the key to successful classification results. Most facial expression recognition techniques using local binary patterns apply the LBP over each sub-block covering the whole face image [3,6,11]. Caifeng Shan et al. [3] divided the whole face image into small sub-block regions, extracted the LBPs from each region and then assigned weights to each sub-region based on the importance of that region. But not all facial regions contain useful information for expression recognition; some regions contain almost no information for any expression. Thus, calculation of LBPs over the whole facial region leads to unnecessary computational cost. In this work we demonstrate a completely automatic facial region extraction method. The four important regions where most of the information is available are: the two eye regions enclosing the eyebrow regions, the nose region, and the lips region enclosing the chin region. Fig. 1 shows a pictorial explanation of the calculation of the estimated facial regions based on the actual face height and width. The steps involved in this process are given in the block diagram shown in Fig. 2. Face detection is followed by eye detection. We apply Paul Viola and Michael Jones' face detection algorithm [12] to detect the face region from the image. It uses simple rectangular (Haar-like) features, which are equivalent to intensity difference readings and are quite easy to compute. Face detection using the Viola-Jones method is about 15 times quicker than previous techniques and gives 95% detection accuracy at around 17 fps.
The next important step after face detection is detection of two eyes. The eyes’
centers play a vital role in face alignment, scaling and location estimation of other
[Fig. 2 block diagram: Image → Detect Face → Normalize face → Detect eyes → …]
facial features, such as the lips, eyebrows and nose. Thus, accurate detection of the eyes is highly desirable. We estimate the expected eye regions using basic facial geometry [13]: in frontal face images, the eyes are located in the upper facial region. We remove the upper (1/5)th of the facial region and extract the next (1/3)rd vertical part as the expected eye regions. We apply Haar-like cascaded features and the Viola-Jones object detection algorithm to detect the two eyes within the expected eye regions. The nose lies below the eye regions and above the lips region in frontal face images. Fig. 1 shows a pictorial description of the nose region. We calculate
images. Fig. 1 shows a pictorial description about the nose region. We calculate
the two eyes' centers as (x1, y1) and (x2, y2) (the centroids of the eye rectangles), respectively. The expected nose region starts from the left eye's center with width (x2 − x1) and height (1/3)rd of the face height. Similarly, we calculate
the expected lip region, which also covers the chin region. The lip region starts after the nose region. We take the lip region with its x coordinate the same as the eyes' centers' x coordinate and its width equal to the distance between the two eye centers. To cover the extra area near the lips that usually contains some wrinkles during certain expressions, we add (1/4)th of the lip width on both the left and the right side. The lip region height is taken as the last (1/3)rd of the face, which also includes the chin region. The eye regions are also extended using facial geometry so that they cover the areas near the eyebrows and the eyes' crow's feet.
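A minimal sketch of the geometric rules just described, assuming OpenCV's bundled Haar cascades; the cascade file names, the choice of the largest detection and the exact padding details are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the region localization described above, assuming OpenCV's bundled
# Haar cascades; extents (1/5, 1/3, 1/4, widths) follow the text, other details
# are illustrative assumptions.
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def localize_regions(gray):
    fx, fy, fw, fh = max(face_cascade.detectMultiScale(gray, 1.1, 5),
                         key=lambda r: r[2] * r[3])          # largest detected face
    face = gray[fy:fy + fh, fx:fx + fw]

    # Expected eye band: drop the upper 1/5 of the face, keep the next 1/3.
    band_top = fh // 5
    band = face[band_top:band_top + fh // 3, :]
    eyes = eye_cascade.detectMultiScale(band, 1.1, 5)[:2]    # keep two strongest boxes
    centers = sorted([(ex + ew // 2, band_top + ey + eh // 2)
                      for ex, ey, ew, eh in eyes])           # left eye first
    (x1, y1), (x2, y2) = centers

    # Nose: starts at the left eye center, width (x2 - x1), height 1/3 of face height.
    h_nose = fh // 3
    nose = (x1, y1, x2 - x1, h_nose)

    # Lips: below the nose, widened by 1/4 of the eye distance on each side,
    # height taken as the last 1/3 of the face (includes the chin).
    lip_w = x2 - x1
    lips = (x1 - lip_w // 4, y1 + h_nose, lip_w + lip_w // 2, fh // 3)
    return face, centers, nose, lips
```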
Pixels within the neighborhood are labeled by thresholding each pixel with the value of the center pixel gc. For a gray value gp in an evenly spaced circular neighborhood with a maximum of P pixels and radius R around the point (xc, yc), the thresholding function is

S(gp − gc) = 1 if gp ≥ gc, 0 otherwise. (1)

The LBP_{P,R} operator is computed by applying a binomial factor to each of the S(gp − gc). The method can be stated as

LBP_{P,R}(xc, yc) = Σ_{p=0}^{P−1} S(gp − gc) · 2^p. (2)
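A small NumPy sketch of Eqs. (1) and (2) for the common LBP(8, 1) case on a 3 × 3 neighborhood; it is illustrative only, not the authors' implementation.

```python
# Sketch of Eqs. (1)-(2) for LBP(8, 1) on the 3x3 neighborhood of every pixel.
import numpy as np

def lbp_8_1(image):
    """Return the LBP(8,1) code image for a grayscale array."""
    img = image.astype(np.int32)
    center = img[1:-1, 1:-1]
    # Clockwise offsets of the 8 neighbors, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += ((neighbor >= center).astype(np.int32)) << p   # S(g_p - g_c) * 2^p
    return code
```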
[Fig. 3 example: a 3 × 3 neighborhood is thresholded against its center pixel, giving the binary pattern 01100011, i.e. decimal 99.]
Fig. 3. An example of basic LBP operation. Neighborhood pixels are thresholded with
center pixel and followed by generation of decimal value for the binary coded data.
Fig. 3 shows an example of the basic LBP operator. Each pixel around the center pixel is thresholded, a binary pattern is extracted, and the corresponding decimal equivalent is calculated. The prominent properties of LBP features are robustness to changes in illumination and computational simplicity. Texture primitives such as spots, line ends, edges and corners are detected by the operator. Fig. 4 shows examples of texture primitives that can be detected by applying LBPs. The notation LBP(P, R) indicates P sampling points, each at equal distance R from the center pixel.
Ojala et al. [14] verified that only a subset of the 2^P patterns is sufficient to describe most of the texture information within an image. The patterns for LBP(8, 1) with no more than two bitwise transitions, called uniform patterns, contain more than 90% of the texture information. For LBP(8, 1) there are 58 such uniform patterns in total.
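A sketch of the 59-bin uniform-LBP histogram per region; concatenating the histograms of the four regions would give the 236-dimensional feature vector used in this paper. The region names in the commented usage (eye_l, eye_r, nose, lips) are hypothetical, and lbp_8_1 refers to the sketch above.

```python
# Sketch of the 59-bin uniform-LBP histogram (58 uniform patterns + 1 bin for
# all non-uniform codes); four regions concatenated give 4 x 59 = 236 features.
import numpy as np

def transitions(code):
    bits = [(code >> p) & 1 for p in range(8)]
    return sum(bits[p] != bits[(p + 1) % 8] for p in range(8))

uniform_codes = [c for c in range(256) if transitions(c) <= 2]   # 58 codes
bin_of = {c: i for i, c in enumerate(uniform_codes)}

def uniform_histogram(lbp_codes):
    hist = np.zeros(59)
    for c in lbp_codes.ravel():
        hist[bin_of.get(int(c), 58)] += 1        # non-uniform codes share bin 58
    return hist / max(hist.sum(), 1)

# feature = np.concatenate([uniform_histogram(lbp_8_1(region))
#                           for region in (eye_l, eye_r, nose, lips)])
```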
Fig. 4. Examples of texture primitives that can be detected by LBP. White circles show ones and black circles show zeros.
Fig. 5. Images showing lips region, corresponding uniform local binary code and the
feature histogram for the lips region
4 Experimental Results
Experiments are conducted on publicly available MMI facial expression database
[15]. The results present classification accuracy of six basic facial expressions
(happiness (H), sadness (Sa), disgust (D), anger (A), surprise (Sur), fear (F)).
In our experiment, we use 81 different video clips taken from the MMI facial expression database. The video clips comprise 12 different subjects, and each subject shows all six basic expressions separately. The uniform LBP features of dimension 59 obtained from each of the four facial zones are concatenated to form the 236-dimensional feature vector.
Table 1. Confusion matrix of emotions detection for the 236 dimensional LBP features
data using Naive Bayes. The emotion classified with maximum percentage is shown to
be the detected emotion.
H Sa D A Sur F
H 91.9 5.12 0 2.56 0 1.28
Sa 0.9 45.9 4.5 1.8 29.7 17.1
D 3.84 3.84 75.0 11.53 3.84 1.92
A 1.61 0 11.29 61.3 20.96 4.83
Sur 0 1.36 0 0 90.4 8.21
F 1.47 5.88 1.47 2.94 35.29 52.9
Table 2. Confusion matrix of emotions detection for the 236 dimensional LBP features
data using Radial Basis Function. The emotion classified with maximum percentage is
shown to be the detected emotion.
H Sa D A Sur F
H 87.2 3.84 5.12 3.84 0 0
Sa 0 90.1 0 1.8 5.4 2.7
D 5.7 5.7 80.8 5.7 1.9 0
A 0 0 3.2 96.8 0 0
Sur 0 5.48 0 2.74 91.8 0
F 1.47 5.88 0 5.88 0 86.8
Table 3. Confusion matrix of emotions detection for the 236 dimensional LBP features
data using MLP3. The emotion classified with maximum percentage is shown to be
the detected emotion.
H Sa D A Sur F
H 89.7 3.8 5.13 0 0 1.28
Sa 2.7 88.3 0 0 0 9.0
D 1.92 5.77 84.6 0 0 7.7
A 0 4.84 11.29 74.2 0 9.7
Sur 0 2.73 2.73 0 94.52 0
F 1.47 8.82 1.47 0 0 88.2
5 Conclusions
The recognition results show the accuracy of our proposed facial expression recognition system. It is observed that RBFN outperforms the other two classifiers. As future work, we plan to conduct experiments using different neighborhood sizes. A comparative analysis could also be done by applying LBPs over the whole face image.
References
1. Mehrabian, A.: Nonverbal communication. Aldine (2007)
2. Sun, Y., Yin, L.: Facial expression recognition based on 3D dynamic range model
sequences. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II.
LNCS, vol. 5303, pp. 58–71. Springer, Heidelberg (2008)
3. Shan, C., Gong, S., McOwan, P.: Facial expression recognition based on local binary
patterns: A comprehensive study. Image and Vision Computing 27, 803–816 (2009)
4. Tsalakanidou, F., Malassiotis, S.: Real-time 2d+ 3d facial action and expression
recognition. Pattern Recognition 43, 1763–1775 (2010)
5. Moridis, C., Economides, A.: Affective learning: Empathetic agents with emotional
facial and tone of voice expressions. IEEE Transactions on Affective Computing 3,
260–272 (2012)
6. Moore, S., Bowden, R.: Local binary patterns for multi-view facial expression recog-
nition. Computer Vision and Image Understanding 115, 541–558 (2011)
7. Chan, C.-H., Kittler, J., Messer, K.: Multi-scale local binary pattern histograms
for face recognition. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp.
809–818. Springer, Heidelberg (2007)
8. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns.
In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481.
Springer, Heidelberg (2004)
9. Zhang, Z., Zhang, Z.: Feature-based facial expression recognition: Sensitivity analy-
sis and experiments with a multilayer perceptron. International Journal of Pattern
Recognition and Artificial Intelligence 13, 893–911 (1999)
10. Rosenblum, M., Yacoob, Y., Davis, L.: Human expression recognition from motion
using a radial basis function network architecture. IEEE Transactions on Neural
Networks 7, 1121–1138 (1996)
11. Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns
with an application to facial expressions. IEEE Transactions on Pattern Analysis
and Machine Intelligence 29, 915–928 (2007)
12. Viola, P., Jones, M.: Robust real-time object detection. International Journal of
Computer Vision 57, 137–154 (2002)
13. Majumder, A., Behera, L., Venkatesh, K.S.: Automatic and Robust Detection of
Facial Features in Frontal Face Images. In: Proceedings of the 13th International
Conference on Modelling and Simulation. IEEE (2011)
14. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures
with classification based on featured distributions. Pattern Recognition 29, 51–59
(1996)
15. Pantic, M., Valstar, M.F., Rademaker, R., Maat, L.: Web-based database for facial
expression analysis. In: Proceedings of IEEE Int’l Conf. Multimedia and Expo,
ICME 2005, Amsterdam, The Netherlands, pp. 317–321 (2005)
Global Image Registration Using Random
Projection and Local Linear Method
1 Introduction
Image registration overlays two or more template images of the same scene, observed at different times, from different viewpoints, or by different sensors, onto a reference image. Image registration is a process of estimating geometric transformations that transform all or most points on the template images to corresponding points on the reference image.
Setting Π to be an appropriate parameter space for image generation, we
assume that images are expressed as f (x, θ) for ∃θ ∈ Π, x ∈ Rl . We call
the set of generated images f (x, θi ) and the parameter θi a dictionary. Here,
i = 1, . . . , N . Image registration methods are generally classified into local image
registration and global image registration. For the global alignment of images,
the linear transformation x′ = Ax + t that minimises the criterion

R(f, g) = ∫_Ω |f(x′) − g(x)|² dx (1)
for the reference image f (x) and the template image g(x) is used to relate two
images. In image registration, we assume that the parameter θ in Π generates the
affine coefficients A and t. Solving the nearest-neighbour search (NNS) problem
using the dictionary, we can estimate the transformation A, t as θi .
The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has
a computational cost of O(N d). Here, N and d are the cardinality of a set of
points in a metric space and the dimensionality of the metric space, respectively.
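A plain NumPy sketch of this naive O(Nd) nearest-neighbour search over a dictionary of vectorised images; purely illustrative.

```python
# Naive O(N d) nearest-neighbour search over the dictionary.
import numpy as np

def naive_nns(query, dictionary):
    """dictionary: (N, d) array of vectorised images; returns index of the best match."""
    best, best_dist = -1, np.inf
    for i, entry in enumerate(dictionary):
        dist = np.sum((entry - query) ** 2)   # squared Euclidean distance
        if dist < best_dist:                  # keep track of the "best so far"
            best, best_dist = i, dist
    return best
```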
The NNS-based image registration requires the storage of reference images in the dictionary, since the method finds the best-matched image in the dictionary. Since robust image registration requires storing a large number of images in the dictionary, the space complexity of the dictionary becomes large. Fast global image registration algorithms using random projection have been developed [1–3]. In these methods, using random projection, we can reduce d in the NNS [4].
In addition to random projection, using a local linear property of the pattern
space, we interpolate entries in a sparse dictionary by generating an image and
estimating the parameter. Using such an interpolation, we can reduce N in NNS.
Generally, the pattern space generated by the affine motion of the image data
is a curved manifold in the higher-dimensional Euclidean space. However, if the
motion is relatively small and can be approximated by linear perturbation to
a sampled image, it is possible to approximate transformed images as a linear
combination of sampled images in the neighbourhood of the reference image.
Using this approximation property, we generate new reference images that are
similar to the target image using images in the sparse dictionary. Furthermore,
using the local linear property, we can estimate the parameter of the generated
image. This strategy reduces the space complexity of the dictionary.
In this paper, using an efficient random projection[3] and the local linear
property of the images, we introduce a method of reducing the time and space
computational costs of this naive NNS-based image registration.
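A hedged sketch of the dimensionality-reduction step: a dense Gaussian random projection applied to the dictionary before the NNS. The efficient random projection of [3] uses a more structured construction; this version only illustrates the idea, and the target dimension of 1024 follows the experiments reported later.

```python
# Sketch of dimensionality reduction by random projection before the NNS step.
import numpy as np

rng = np.random.default_rng(0)

def random_projection_matrix(d, k=1024):
    # Rows of i.i.d. Gaussians scaled so that squared norms are preserved in expectation.
    return rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

def project(images, P):
    """images: (N, d) vectorised images; returns the (N, k) projected dictionary."""
    return images @ P.T

# P = random_projection_matrix(dictionary.shape[1])
# dictionary_low = project(dictionary, P); query_low = project(query[None, :], P)[0]
```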
images g(x) = f(Rx + t) for a small-angle rotation R and a small translation vector t, we can assume the relation given by Eq. (4). Equation (4) implies that the number of independent images in the collected images,

L(f) = { f_ij | f_ij(x) = λ f(R_i x + t_j) }_{i,j=1}^{p,q}, (5)

is 3. Setting f ⊗ g to be a linear operation such that (f ⊗ g)h = (h, g), where (·, ·) is the inner product of the image space, the covariance of L(f) is defined as L_f = E_{ij,uv}[f_ij ⊗ f_uv], where E_i[f_i] is the expectation of {f_i}_{i=1}^{n}. We can use the first three principal vectors of L_f as the local bases for image expression.
Fig. 1. (a) Nearest neighbours of g searched for by k-NNS on manifold. (b) Generation
of image in dictionary. The input image g is projected onto the subspace spanned by
three-nearest neighbours. (c) Interpolation of dictionary. For the template g, we firstly
generate the image g ∗ . Next, we estimate the parameter θ∗ of g∗. Here, dim θ = 1.
is searched for in a random projected space. Using the local linear property, we can approximate the space spanned by {u_i}_{i=1}^{3} by the one spanned by {g} ∪ {f^r}_{r=1}^{3} if the data space L is not extremely sparse. Using Gram-Schmidt orthonormalisation for f^1, f^2 and f^3, we obtain the bases {u_i}_{i=1}^{3}. Projecting the template onto the space spanned by {u_i}_{i=1}^{3}, we obtain a new image,

g* = Σ_{i=1}^{3} a_i u_i, (6)
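A small sketch of Eq. (6), under the assumption that f1, f2, f3 are the vectorised nearest dictionary images: Gram-Schmidt orthonormalisation followed by projection of the template g onto the resulting bases.

```python
# Sketch of Eq. (6): orthonormalise the three nearest dictionary images by
# Gram-Schmidt and project the template g onto their span to obtain g*.
import numpy as np

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        w = v - sum(np.dot(v, u) * u for u in basis)
        basis.append(w / np.linalg.norm(w))
    return basis

def generate_image(g, f1, f2, f3):
    u = gram_schmidt([f1, f2, f3])                 # bases {u_i}
    a = [np.dot(g, ui) for ui in u]                # coefficients a_i = (g, u_i)
    return sum(ai * ui for ai, ui in zip(a, u))    # g* = sum a_i u_i
```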
Fig. 2. Parameter estimation for phantom image. From top to bottom, original im-
ages, rotated images, estimated parameters and the differential curves of the estimated
parameters are shown for σ = 1, 2, 4 and 8. In Figs. 2(p)-2(t), the solid and dashed
lines represent the first- and second-order differentials, respectively.
as the difference among images. Here, f(x, θ_α^1), f(x, θ_a^1) and f(x, θ_b^1) are the 2nd, 3rd and 4th nearest neighbours in the dictionary. From Eqs. (8) and (9), we obtain the relation

ψ = − (f(x, θ^1) − f(x, θ)) / |∇_Π f(x, θ)|² · ∇_Π f(x, θ) (10)

for the image. Therefore, we obtain the equation θ* = θ^1 + E[ψ] for parameter estimation, where E(ψ) = (E[ψ_1], E[ψ_2], E[ψ_3])^T.
4 Numerical Examples
For rotation transform, we show results of registration using the LLM. We
evaluated our method both for phantom and real images. Phantom images are
¹ The least-mean-squares solution of x^T a + c = 0 is x = −c a/|a|².
Fig. 3. Parameter estimation for MRI slice image of human brain. From top to bottom,
original images, rotated images, estimated parameters and the differential curves of the
estimated parameters are shown for σ = 1, 2, 4 and 8. In Figs. 3(p)-3(t), the solid and
dashed lines represent the first- and second-order differentials, respectively.
generated from a two-dimensional Gaussian function. Real images are MRI slice
images of a simulated human brain. Figures 2(a) and 3(a) show the phantom
image P 0 and the slice image A0, respectively. For the generation of entries in
the dictionary, we rotated the original image by angles (π/12)·i, i = 0, 1, 2, . . . , 23.
We use the original image as reference and the rotated images as targets. Figures
2(f) and 3(f) show the sample images of the dictionary, respectively. For phan-
tom images, we generate rotated phantom images with angles of −6, −5, . . . , 6
degrees as template images. For real images, our template images are selected
from several slice images different from the original slice image. Figures 4(a),
(b), (c) and (d) show the MRI volume data and the images of slice A0, slice B0
and slice C0, respectively. As template images, we generate the rotated images
A0, B0 and C0 with angles of −6, −5, . . . , 6 degrees. For the stable computation,
we first select images using the NNS, then we apply the Gaussian smoothing
for selected images. We set the standard deviation of the Gaussian function to
be σ = 1, 2, 4 and 8. Figures 2(b)-(e), 2(g)-(j), 3(b)-(e) and 3(g)-(j) show the
smoothed images. In all registrations, the dimensions of vectorised images are
reduced to 1024 dimensions by the efficient random projection.
Selecting a template image from the generated template set, we apply our
registration algorithm to reference images in the dictionary. For phantom im-
ages, Figs. 2(k)-(o) show the estimated parameters for each σ. Furthermore,
Fig. 4. Extracted slice images from three-dimensional volume data. The volume data
is the MRI simulation data of a human brain[10]. The size of the volume data is
181×217×181 pixels. The slice images A0, B0 and C0 are extracted from the z = 50,
z = 48 and z = 52 planes, respectively. The size of A0, B0, and C0 is 543×543 pixels.
Figs. 3(k)-(o) show the estimated parameters for real images. For phantom im-
ages, in Figs. 2(p)-(t), the solid and dashed lines show the first- and second-order
differentials for the curves of the estimated parameters, respectively. For real im-
ages, in Figs. 3(p)-(t), the solid and dashed lines show the first- and second-order
differentials for the curves of the estimated parameters, respectively.
Next, selecting a template image from the generated template set using several slice images, we apply our registration algorithm to the reference images in the dictionary, smoothing the images with σ = 0, 1 and 2. The results of image generation
and parameter estimation are shown in Tabs. 1 and 2, respectively. In Tab. 1,
we evaluate the accuracy of the generation errors using the distance between the
template and the generated images.
In Figs. 2(q)-2(t), the second-order differentials of the estimation curves are
almost flat, that is, the rotation angle is linearly estimated for a potentially
smooth image. In Figs. 3(q)-3(t), for σ ≥ 4, the second-order differentials of the
estimation curves are almost flat. These results indicate that, for |θ − θi | < 6,
our method accurately estimates the parameters. From Tab. 1, the registration
errors of the LLM are smaller than those of the NNS method. Table 2 shows
that, for σ = 1, the estimated rotation angle satisfies G.T. − 0.3 ≤ θ* ≤ G.T. + 0.5, where G.T. denotes the ground-truth angle. Tables 1 and 2 show that the LLM generates images in the dictionary using the local linear property of images in the pattern space.
These experiments show that our algorithm achieves the medical image reg-
istration with a sparse dictionary. The errors of our algorithm are less than 1
degree, while the maximum error of NNS is 6 degrees. Using the random pro-
jection, we can compute the global image registration using the LLM with only
8.3% of the memory storage size of the NNS method.
5 Conclusions
We introduced the interpolation of entries in a dictionary to reduce the compu-
tational cost of preprocessing and the size of the dictionary used in the nearest-
neighbour search. Using the random projection and interpolation techniques for
the dictionary, we developed an algorithm that efficiently establishes a global
image registration. The numerical examples show that our method performs the registration with high accuracy using a dictionary with a much smaller memory footprint than the naive nearest-neighbour-search-based method.
References
1. Healy, D.M., Rohge, G.K.: Fast global image registration using random projections.
In: Proc. Biomedical Imaging: From Nano to Macro, pp. 476–479 (2007)
2. Itoh, H., Lu, S., Sakai, T., Imiya, A.: Global image registration by fast random
projection. In: Bebis, G. (ed.) ISVC 2011, Part I. LNCS, vol. 6938, pp. 23–32.
Springer, Heidelberg (2011)
3. Sakai, T., Imiya, A.: Practical algorithms of spectral clustering: Toward large-
scale vision-based motion analysis. In: Machine Learning for Vision-Based Motion
Analysis, pp. 3–26. Springer (2011)
4. Vempala, S.S.: The Random Projection Method, vol. 65. American Mathematical
Society (2004)
5. Iijima, T.: Theory of pattern recognition. Electronics and Communications in
Japan, 123–134 (1963)
6. Watanabe, S., Labert, P.F., Kulikowski, C.A., Buxton, J.L., Walker, R.: Evaluation
and selection of variables in pattern recognition. In: Computer and Information
Science II, pp. 91–122 (1967)
7. Oja, E.: Subspace methods of pattern recognition. Research Studies Press (1983)
8. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision.
Cambridge University Press (2004)
9. Mahajan, D., Huang, F.C., Matusik, W., Ramamoorthi, R., Belhumeur, P.: Moving
gradients: A path-based method for plausible image interpolation. ACM Transac-
tions on Graphics 28, 42:1–42:11 (2009)
10. Cocosco, C., Kollokian, V., Kwan, R.S., Evans, A.: Brainweb. online interface to
a 3D MRI simulated brain database. NeuroImage 5, 425 (1997)
Image Segmentation
by Oriented Image Foresting Transform
with Geodesic Star Convexity
1 Introduction
Image segmentation, such as to extract an object from a background, is very
useful for medical and biological image analysis. However, in order to guarantee
reliable and accurate results, user supervision is still required in several seg-
mentation tasks, such as the extraction of poorly defined structures in medical
imaging, due to their intensity non-standardness among images, field inhomo-
geneity, noise, partial volume effects, and their interplay [1]. The high-level,
application-domain-specific knowledge of the user is also often required in the
digital matting of natural scenes, because of their heterogeneous nature [2]. These
problems motivated the development of several methods for semi-automatic seg-
mentation [3,4,5,6], aiming to minimize the user involvement and time required
without compromising accuracy and precision.
One important class of interactive image segmentation comprises seed-based
methods, which have been developed based on different theories, supposedly
not related, leading to different frameworks, such as watershed [6], random
walks [7], fuzzy connectedness [8], graph cuts [4], distance cut [2], image forest-
ing transform [9], and grow cut [10]. The study of the relations among different
frameworks, including theoretical and empirical comparisons, has a vast litera-
ture [11,12,13,14]. These methods can also be adapted to automatic segmentation
whenever the seeds can be automatically found [15].
In this paper, we pursue our previous work on Oriented Image Foresting
Transform (OIFT) [16], which extends popular methods [9,8], by incorporat-
ing the boundary orientation (boundary polarity) to resolve between very sim-
ilar nearby boundary segments by exploring directed weighted graphs. OIFT
presents an excellent trade-off between time efficiency and accuracy, and is ex-
tensible to multidimensional images. In this work, we discuss how to incorpo-
rate Gulshan’s geodesic star convexity (GSC) prior in the OIFT approach. This
convexity constraint eliminates undesirable intricate shapes, improving the seg-
mentation of objects with more regular shape. We include a theoretical proof of
the optimality of the new algorithm in terms of a global maximum of an energy
function subject to the shape constraints. The proposed method GSC-OIFT can
simultaneously handle boundary polarity and shape constraints with improved
accuracy for targeted image segmentation [17].
The next sections give a summary of the relevant previous work of the Image
Foresting Transform [9] and OIFT [16]. The proposed extensions are presented
in Section 5. In Section 6, we evaluate the methods, and state our conclusions.
As demonstrated in [16], the following non-smooth connectivity functions f_{i,max}^{bkg} and f_{o,max}^{bkg} in the IFT algorithm (which we denote as OIFT) lead to optimum cuts that maximize Eq. 6 and Eq. 7, respectively. The handicap values of f_{i,max}^{bkg} and f_{o,max}^{bkg} for trivial paths are defined as before (i.e., H(t) = −1 for all t ∈ S).

f_{i,max}^{bkg}(π_s · ⟨s, t⟩) = max{ f_{i,max}^{bkg}(π_s), 2 × w(t, s) + 1 }  if R(π_s) ∈ S_o
                               max{ f_{i,max}^{bkg}(π_s), 2 × w(s, t) }      if R(π_s) ∈ S_b   (8)

f_{o,max}^{bkg}(π_s · ⟨s, t⟩) = max{ f_{o,max}^{bkg}(π_s), 2 × w(s, t) + 1 }  if R(π_s) ∈ S_o
                               max{ f_{o,max}^{bkg}(π_s), 2 × w(t, s) }      if R(π_s) ∈ S_b   (9)
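A compact, illustrative sketch of seed-competition propagation with the path-extension rule of Eq. (9) (the f_{o,max}^{bkg} variant) on a 4-neighbour grid; the tie-breaking policies needed for this non-smooth function, and many practical details of the IFT, are omitted. The arc-weight dictionary w and the seed sets are assumed inputs, not the authors' implementation.

```python
# Sketch of OIFT-style propagation using the path extension rule of Eq. (9) on a
# 4-neighbour graph. w[(s, t)] holds directed arc weights for every grid arc.
import heapq

def oift_outer(shape, w, seeds_obj, seeds_bkg):
    H, W = shape
    nodes = [(y, x) for y in range(H) for x in range(W)]
    value = {p: float("inf") for p in nodes}
    label, root, heap = {}, {}, []
    for p in seeds_obj | seeds_bkg:
        value[p] = -1                                   # handicap H(t) = -1
        label[p] = 1 if p in seeds_obj else 0
        root[p] = p
        heapq.heappush(heap, (-1, p))
    while heap:
        v, s = heapq.heappop(heap)
        if v > value[s]:
            continue                                    # outdated queue entry
        y, x = s
        for t in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if t not in value:
                continue                                # outside the image
            if root[s] in seeds_obj:                    # R(pi_s) in S_o
                ext = max(value[s], 2 * w[(s, t)] + 1)
            else:                                       # R(pi_s) in S_b
                ext = max(value[s], 2 * w[(t, s)])
            if ext < value[t]:                          # offer a lower path value
                value[t], label[t], root[t] = ext, label[s], root[s]
                heapq.heappush(heap, (ext, t))
    return label
```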
C in P_sum lies in the set O). In this work, we want to constrain the search for optimum results, which maximize the graph-cut measures E_i(L, G) (Eq. 6) and E_o(L, G) (Eq. 7), only to segmentations that satisfy the geodesic star convexity constraint.
First, we compute the optimum forest P_sum for f_sum by the regular IFT algorithm, using only S_o as seeds, for the given directed graph G = (I, A, w). Let us consider the following two sets of arcs: ξ^i_{P_sum} = {(s, t) ∈ A | s = P_sum(t)} and ξ^o_{P_sum} = {(s, t) ∈ A | t = P_sum(s)}. We have the following Lemma 1:

Lemma 1. For a given segmentation L, we have C_o(L) ∩ ξ^o_{P_sum} ≠ ∅ if and only if there is a violation of the geodesic star convexity constraint. Likewise, C_i(L) ∩ ξ^i_{P_sum} ≠ ∅ if and only if there is a violation of the geodesic star convexity constraint.
Proof. We will demonstrate it for C_o(L) ∩ ξ^o_{P_sum} ≠ ∅; the demonstration for C_i(L) ∩ ξ^i_{P_sum} ≠ ∅ is essentially identical. By definition, a violation of the geodesic star convexity constraint with respect to a set of centers C = S_o occurs if there exists a point p ∈ O = {t | L(t) = 1} that is not visible to C via O (i.e., there is a pixel r in the shortest path joining p to C in P_sum with r ∉ O).
By the definitions of ξ^o_{P_sum} and C_o(L), we have C_o(L) ∩ ξ^o_{P_sum} = {(s, t) ∈ A | L(s) = 1, L(t) = 0 and t = P_sum(s)}. For any edge (s, t) ∈ C_o ∩ ξ^o_{P_sum} we have t = P_sum(s), which means that there exists a shortest path π_s = π_t · ⟨t, s⟩ in P_sum rooted at the internal seeds S_o (i.e., a line segment between s and S_o). But (s, t) ∈ C_o(L) implies that L(t) = 0 (i.e., t ∉ O), and hence s is not visible to S_o through π_s = π_t · ⟨t, s⟩ in P_sum. Thus, C_o ∩ ξ^o_{P_sum} ≠ ∅ implies a violation of the geodesic star convexity constraint.
On the other hand, if there is a violation of the geodesic star convexity constraint, it means that there exists s ∈ O (i.e., L(s) = 1) that is not visible to S_o via the shortest path π_s in P_sum, so that there is a pixel p_i ∉ O in π_s = ⟨p_1, . . . , p_i, . . . , p_n = s⟩, with P_sum(p_{i+1}) = p_i and p_{i+1} ∈ O. Hence, (p_{i+1}, p_i) ∈ C_o ∩ ξ^o_{P_sum}, which implies that C_o ∩ ξ^o_{P_sum} ≠ ∅.
Therefore, we have C_o ∩ ξ^o_{P_sum} ≠ ∅ if and only if there is a violation of the geodesic star convexity constraint.
Theorem 1 (Inner/outer-cut boundary optimality). For a given image graph G = (I, A, w), consider a modified weighted graph G′ = (I, A, w′), with weights w′(s, t) = −∞ for all (s, t) ∈ ξ^o_{P_sum}, and w′(s, t) = w(s, t) otherwise. For two given sets of seeds S_o and S_b, the segmentation computed over G′ by the IFT algorithm for function f_{o,max}^{bkg} defines an optimum cut in the original graph G that maximizes E_o(L, G) among all possible segmentation results satisfying the shape constraints imposed by the geodesic star convexity, and the seed constraints. Similarly, the segmentation computed by the IFT algorithm for function f_{i,max}^{bkg}, over a modified graph G′ = (I, A, w′) with weights w′(s, t) = −∞ for all (s, t) ∈ ξ^i_{P_sum} and w′(s, t) = w(s, t) otherwise, defines an optimum cut in the original graph G that maximizes E_i(L, G) among all possible segmentation results satisfying the shape constraints imposed by the geodesic star convexity, and the seed constraints.
Proof. We will prove the theorem for the function f_{o,max}^{bkg}; the other case has an essentially identical proof. Since we assign the worst weight to all arcs (s, t) ∈ ξ^o_{P_sum} in G′ (i.e., w′(s, t) = −∞), any segmentation L̃ with C_o(L̃) ∩ ξ^o_{P_sum} ≠ ∅ will receive the worst energy value (E_o(L̃, G′) = −∞)¹. From the theorem in [16], we know that the IFT with f_{o,max}^{bkg} over G′ maximizes the energy E_o(L, G′) in the graph G′; consequently, it will naturally avoid in its outer-cut boundary any edge from ξ^o_{P_sum}. Since there is always a solution that does not violate the GSC constraint (e.g., we could take O = S_o), and from Lemma 1, we have that the computed solution cannot violate the GSC constraint.
Since w(s, t) ≥ 0 for all (s, t) ∈ A, and from Lemma 1, we have that any candidate segmentation L̈ satisfying the GSC constraint must have E_o(L̈, G′) ≥ 0. Moreover, since the weights of the arcs in C_o(L̈) were not changed in G′, we also have that E_o(L̈, G′) = E_o(L̈, G). Hence, all results satisfying the GSC constraint were considered in the optimization, and therefore Theorem 1 holds, as we wanted to prove.
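A schematic of the procedure implied by Theorem 1, reusing the oift_outer sketch above: compute the geodesic forest P_sum from the internal seeds, collect the arcs of ξ^o_{P_sum}, give them the worst possible weight, and rerun the propagation. The helper ift_sum (an ordinary additive-cost IFT returning the predecessor map) is an assumed name, not code from the paper.

```python
# Schematic of GSC-OIFT (outer variant): forbid the arcs of xi^o_{P_sum} by
# assigning them the worst weight, then run the OIFT propagation shown earlier.
NEG_INF = float("-inf")

def gsc_oift_outer(shape, w, seeds_obj, seeds_bkg, ift_sum):
    pred = ift_sum(shape, w, seeds_obj)        # P_sum: predecessor of each pixel
    w_mod = dict(w)
    for (s, t) in w:                           # xi^o_{P_sum} = {(s, t) | t = P_sum(s)}
        if pred.get(s) == t:
            w_mod[(s, t)] = NEG_INF            # worst possible arc weight
    return oift_outer(shape, w_mod, seeds_obj, seeds_bkg)
```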
[Plots: Dice coefficient (vertical axis) versus erosion radius in pixels (horizontal axis), one panel per value of β.]
Fig. 1. The mean accuracy curves of all methods for the liver segmentation for various values of β: (a) β = 0.0, (b) β = 0.2, (c) β = 0.5, and (d) β = 0.7
¹ The GSC restrictions are embedded directly into the graph G′.
Fig. 2. Results for user-selected markers: (a) IRFC (IFT with f_max), (b) OIFT (f_{o,max}^{bkg} with α = 0.5), (c) GSC-IFT (β = 0.7, α = 0.0), and (d) GSC-OIFT (β = 0.7, α = 0.5)
Fig. 3. Example of 3D skull stripping in MRI: (a) IRFC (IFT with fmax ), (b) GSC-IFT
(β = 0.3, α = 0.0), and (c) GSC-OIFT (β = 0.3, α = 0.5), for the same user-selected
markers
References
1. Madabhushi, A., Udupa, J.: Interplay between intensity standardization and in-
homogeneity correction in MR image processing. IEEE Transactions on Medical
Imaging 24(5), 561–576 (2005)
2. Bai, X., Sapiro, G.: Distance cut: Interactive segmentation and matting of images
and videos. In: Proc. of the IEEE Intl. Conf. on Image Processing, vol. 2, pp.
II-249–II-252 (2007)
3. Falcão, A., Udupa, J., Samarasekera, S., Sharma, S., Hirsch, B., Lotufo, R.: User-
steered image segmentation paradigms: Live-wire and live-lane. Graphical Models
and Image Processing 60(4), 233–260 (1998)
4. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. Intl.
Journal of Computer Vision 70(2), 109–131 (2006)
5. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Intl. Journal
of Computer Vision 1, 321–331 (1987)
6. Cousty, J., Bertrand, G., Najman, L., Couprie, M.: Watershed cuts: Thinnings,
shortest path forests, and topological watersheds. Trans. on Pattern Analysis and
Machine Intelligence 32, 925–939 (2010)
7. Grady, L.: Random walks for image segmentation. IEEE Trans. Pattern Anaysis
and Machine Intelligence 28(11), 1768–1783 (2006)
8. Ciesielski, K., Udupa, J., Saha, P., Zhuge, Y.: Iterative relative fuzzy connectedness
for multiple objects with multiple seeds. Computer Vision and Image Understand-
ing 107(3), 160–182 (2007)
9. Falcão, A., Stolfi, J., Lotufo, R.: The image foresting transform: Theory, algo-
rithms, and applications. IEEE Transactions on Pattern Analysis and Machine
Intelligence 26(1), 19–29 (2004)
10. Vezhnevets, V., Konouchine, V.: “growcut” - interactive multi-label N-D image
segmentation by cellular automata. In: Proc. Graphicon., pp. 150–156 (2005)
11. Sinop, A., Grady, L.: A seeded image segmentation framework unifying graph cuts
and random walker which yields a new algorithm. In: Proc. of the 11th International
Conference on Computer Vision, ICCV, pp. 1–8. IEEE (2007)
12. Miranda, P., Falcão, A.: Elucidating the relations among seeded image segmen-
tation methods and their possible extensions. In: XXIV Conference on Graphics,
Patterns and Images, Maceió, AL (August 2011)
13. Ciesielski, K., Udupa, J., Falcão, A., Miranda, P.: Fuzzy connectedness image seg-
mentation in graph cut formulation: A linear-time algorithm and a comparative
analysis. Journal of Mathematical Imaging and Vision (2012)
14. Couprie, C., Grady, L., Najman, L., Talbot, H.: Power watersheds: A unifying
graph-based optimization framework. Trans. on Pattern Anal. and Machine Intel-
ligence 99 (2010)
15. Miranda, P., Falcão, A., Udupa, J.: Cloud bank: A multiple clouds model and
its use in MR brain image segmentation. In: Proc. of the IEEE Intl. Symp. on
Biomedical Imaging, Boston, MA, pp. 506–509 (2009)
16. Miranda, P., Mansilla, L.: Oriented image foresting transform segmentation by seed
competition. IEEE Transactions on Image Processing (accepted, to appear, 2013)
17. Lézoray, O., Grady, L.: Image Processing and Analysis with Graphs: Theory and
Practice. CRC Press, California (2012)
18. Miranda, P., Falcão, A.: Links between image segmentation based on optimum-
path forest and minimum cut in graph. Journal of Mathematical Imaging and
Vision 35(2), 128–142 (2009)
19. Veksler, O.: Star shape prior for graph-cut image segmentation. In: Forsyth, D.,
Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 454–467.
Springer, Heidelberg (2008)
20. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic star
convexity for interactive image segmentation. In: Proc. of Computer Vision and
Pattern Recognition, pp. 3129–3136 (2010)
Multi-run 3D Streetside Reconstruction
from a Vehicle
1 Introduction
Current large-scale camera-based reconstruction techniques can be subdivided
into aerial reconstruction and ground-level reconstruction techniques. For aerial reconstruction, although a large amount of user interaction is needed, the resulting model is often of high quality and visually compelling. There are various commercial products
available in the market, demonstrating high quality, such as 3D RealityMaps [1],
for example. However, reconstruction methods using aerial images only cannot
produce models with photo-realistic details at ground level. There is an extensive
literature on ground-level reconstruction; see, for example, [5,9,17]. In both aerial
and ground-level reconstructions, cameras capture input images as they travel
through the scene. Standard cameras only have limited viewing angles. Thus, a
large number of blind spots of the scene exist, resulting in incomplete 3D models,
and this is inevitable for a single run reconstruction (i.e. when moving cameras
on a “nearly straight” path, without any significant variations in the path). A
single run has a defined direction, namely the vector from the start point to the end point of the run.
In this paper, we propose a stereo-based reconstruction framework for auto-
matically merging reconstruction results from multiple single runs in different
directions. For each single run, we perform binocular stereo analysis on pairs of
left and right images. We use the left image and the generated disparity map for
a bundle-adjustment-based visual odometry algorithm. Then, applying the esti-
mated changes in camera poses, a 3D point cloud of the scene is accumulated
frame by frame. Finally, we triangulate the 3D point cloud using an α-shape
algorithm to generate a surface model. Up to this stage we apply basically exist-
ing techniques. The novelty of this paper is mainly in the merging step, and we
Fig. 1. From top to bottom: original image of a used stereo frame sequence, and colour-
coded disparity maps using OpenCV (May 2013) block matching or iSGM
detail the case where two surface models are merged generated from single runs
in opposite directions. Input data are recorded stereo sequences from a mobile
platform. In this paper we discuss greyscale sequences recorded at Tamaki cam-
pus, The University of Auckland, at a resolution of 960 × 320 at 25 Hz, with 10
bit per pixel. Each recorded sequence consists of about 1,800 stereo frames. For
an example of an input image, see the top of Fig. 1.
The quality of the stereo matcher used has a crucial impact on the accuracy of our 3D reconstruction. We decided on iterative semi-global matching (iSGM), see [8], mainly due to its performance at ECCV 2012 [7]. A comparison with the
block-matching stereo procedure in OpenCV (see Fig. 1, middle) illustrates the
achieved improvement by using iSGM.
The rest of the paper is structured as follows. In Section 2, we estimate the
ego-motion of the vehicle using some kind of bundle adjustment. Section 3 dis-
cusses alpha-shape, as used for the surface reconstruction algorithm applied in
the system. Finally, the merging step is discussed in Section 4, also showing
experimental results. Section 5 concludes the paper.
2 Visual Odometry
Visual Odometry [13], the estimation of position and direction of the camera, is
achieved by analysing consecutive images in the recorded sequence. The quality
of our reconstructed 3D scene is directly related to the result of visual odome-
try. Drift in visual odometry [10] often leads to a twist in the 3D model. The
basic algorithm is usually: (1) Detect feature points in the image. (2) Track the
features across consecutive frames. (3) Calculate the camera’s motion based on
the tracked features. In this paper, since we focus on quality, an algorithm [15]
based on Bundle Adjustment (BA) is used for visual odometry.
We tested a basic algorithm for comparison. 2-dimensional (2D) feature points
are detected and tracked across the left sequence only. The speeded-up robust
feature detector (SURF), see [2], is used to extract feature points in the first
frame. We chose SURF over the Harris corner detector [6] (which is a common
choice in visual odometry) because corner points may not be evenly distributed
depending on the geometry of the scene. The Lucas-Kanade [12] algorithm is
used to track these detected features in the subsequent frame. Tracked feature
points serve then as input, and are again tracked in the following frame, and so
on. Since the same set of feature points is tracked, the total number of features
decays over frames. When the total number of features drops below a threshold
τ then a new set of features is detected using again the SURF detector. After
calculating a relative transformation between Frames t − 1 and t, the global pose
of the cameras at time t is obtained by keeping a global accumulator, assuming
that the pose of the camera at time 1 is a 4 × 4 identity matrix for initialization.
However, in our experiments, when applying this basic algorithm, the estimation
of camera pose transformations was inaccurate, and became less stable as errors
accumulate along the sequence. In order to improve the accuracy, we apply a
sliding-window bundle adjustment.
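A minimal sketch of the detect-and-track loop described above, using OpenCV; SURF lives in the opencv-contrib build, and the re-detection threshold τ = 200 is an illustrative value rather than the authors' setting.

```python
# Sketch of the SURF + Lucas-Kanade detect/track loop on the left image sequence.
import cv2
import numpy as np

TAU = 200                                   # re-detect when fewer features survive

def track_features(frames):
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # needs opencv-contrib
    prev = frames[0]
    pts = cv2.KeyPoint_convert(surf.detect(prev, None)).reshape(-1, 1, 2).astype(np.float32)
    tracks = []
    for frame in frames[1:]:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, frame, pts, None)
        good = status.ravel() == 1
        tracks.append((pts[good], nxt[good]))   # matched 2D points per frame pair
        pts, prev = nxt[good], frame
        if len(pts) < TAU:                      # feature set decayed: re-detect
            pts = cv2.KeyPoint_convert(surf.detect(frame, None)).reshape(-1, 1, 2).astype(np.float32)
    return tracks
```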
Bundle adjustment [16] is the problem of refining the 3D structure as well
as the camera parameters. Mathematically, assume that n 3D points bi are seen
from m cameras with parameters aj , and Xij is the projection of the ith point on
camera j. Bundle adjustment is the task to minimize the reprojection error with
respect to 3D points bi and cameras’ parameters aj . In formal representation,
determine the minimum

min_{a_j, b_i} Σ_{i=1}^{n} Σ_{j=1}^{m} d(Q(a_j, b_i), X_ij)²,

where Q(a_j, b_i) is the predicted projection of point i on camera j and d(·, ·) denotes the Euclidean image distance.
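A schematic of this reprojection objective, assuming a simple axis-angle pinhole model for Q and SciPy's least_squares for the minimisation; a real system would use the calibrated stereo rig and a sliding window of frames.

```python
# Schematic of the bundle-adjustment objective: sum of squared reprojection errors.
import numpy as np
from scipy.optimize import least_squares

def project(cam, point):
    # cam = (rx, ry, rz, tx, ty, tz, f): axis-angle rotation, translation, focal length.
    rvec, t, f = cam[:3], cam[3:6], cam[6]
    theta = np.linalg.norm(rvec)
    k = rvec / theta if theta > 0 else rvec
    # Rodrigues rotation of the 3D point, then translation and pinhole projection.
    p = (point * np.cos(theta) + np.cross(k, point) * np.sin(theta)
         + k * np.dot(k, point) * (1 - np.cos(theta))) + t
    return f * p[:2] / p[2]

def residuals(params, n_cams, n_pts, observations):
    cams = params[:n_cams * 7].reshape(n_cams, 7)
    pts = params[n_cams * 7:].reshape(n_pts, 3)
    res = []
    for i, j, xij in observations:              # observation X_ij of point i in camera j
        res.append(project(cams[j], pts[i]) - xij)
    return np.concatenate(res)

# x0 stacks initial camera and point parameters:
# sol = least_squares(residuals, x0, args=(n_cams, n_pts, observations))
```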
3 Surface Reconstruction
In this section, we build a 3D model of the scene using results of visual odometry.
The final surface representation is polygonal, but in order to build it we construct
a point cloud model first. Once we calculate the pose for cameras for all frames,
building a 3D point cloud model can be as easy as projecting all 3D points derived
from pixels with valid disparities into a global coordinate system. However, we
did not accumulate pixels for all the frames, because the number of points grows rapidly and a large percentage of the points is actually redundant information.
(The vehicle was driving at 10 km/h only, and recall that images were captured
at 25 Hz.) For each frame, only pixels within a specified disparity range are used,
due to the non-linear property of the Z-function. See Fig. 2 for an example.
Point-cloud data usually contain a large portion of noise and outliers, and the density of points varies across the 3D space. Two additional steps are carried out to refine the quality of the point cloud.
Down-Sampling. A voxel grid filter is applied to simplify cloud data, thus im-
proving the efficiency of subsequent processing. The filter creates a 3D voxel
grid spanning over the cloud data. Then, for each voxel, all the points within
are replaced by their centroid.
Outlier Removal. Errors in stereo matching and visual odometry lead to sparse
outliers which corrupt the cloud data. Some of these errors can be eliminated by
Fig. 2. A generated point cloud model. Yellow cubes indicate detected camera poses.
applying a statistical filter on the point set, i.e. for each point, we compute the
mean distance from it to all of its neighbours. If this mean distance of the point
is outside a predefined interval, then the point can be treated as an outlier and
is removed from the set. The order of these steps affects the overall performance
of the process. The down-sampling process is significantly faster than outlier
removal. Thus we decided to perform these two processes in the listed order.
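A NumPy sketch of the two refinement steps in that order: voxel-grid down-sampling followed by statistical outlier removal. The leaf size, neighbourhood size and standard-deviation threshold are illustrative values, not those used by the authors.

```python
# Sketch of voxel-grid down-sampling followed by statistical outlier removal.
import numpy as np
from scipy.spatial import cKDTree

def voxel_downsample(points, leaf=0.1):
    keys = np.floor(points / leaf).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    counts = np.bincount(inverse)
    centroids = np.zeros((counts.size, 3))
    for d in range(3):                           # centroid of the points in each voxel
        centroids[:, d] = np.bincount(inverse, weights=points[:, d]) / counts
    return centroids

def remove_outliers(points, k=16, std_ratio=2.0):
    dists, _ = cKDTree(points).query(points, k=k + 1)   # first neighbour is the point itself
    mean_dist = dists[:, 1:].mean(axis=1)
    keep = np.abs(mean_dist - mean_dist.mean()) < std_ratio * mean_dist.std()
    return points[keep]

# cloud = remove_outliers(voxel_downsample(raw_cloud))   # order as discussed above
```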
Given a set S of points in 3D, the α-shape [4] was designed for answering
questions such as “What is the shape formed by these points?” Edelsbrunner
and Mücke mention in [4] an intuitive description of 3D α-shape: Imagine that
a huge ice-cream fills space R3 and contains all points of S as “hard” chocolate
pieces. Using a sphere-formed spoon, we carve out all possible parts of the ice-
cream block without touching any of the chocolate pieces, even carving out holes
inside the block. The object we end up with is the α-shape of S, and the value
α is the squared radius of the carving spoon.
To formally define the α-shape, we first define an α-complex. An α-complex
of a set S of points is a subcomplex of the 3D Delaunay triangulation of S, which
is a tetrahedrization such that no point in S is inside the circumsphere of any
of the created tetrahedra. Given a value of α, the α-complex contains all the
simplexes in the Delaunay triangulation which have an empty circumscribing
sphere with squared radius equal to, or smaller than α. The α-shape is the
topological frontier of the α-complex.
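A simplified sketch of the α-complex idea using SciPy's Delaunay triangulation: keep the tetrahedra whose circumsphere has squared radius at most α and report the triangles belonging to exactly one kept tetrahedron as the surface. A full α-shape also handles lower-dimensional simplices; this is illustrative only.

```python
# Simplified alpha-complex sketch: filter Delaunay tetrahedra by squared
# circumradius and take the boundary triangles of the kept set.
import numpy as np
from collections import Counter
from scipy.spatial import Delaunay

def circumradius2(tet):
    a, b, c, d = tet
    A = 2 * np.array([b - a, c - a, d - a])
    rhs = np.array([np.dot(b, b) - np.dot(a, a),
                    np.dot(c, c) - np.dot(a, a),
                    np.dot(d, d) - np.dot(a, a)])
    center = np.linalg.solve(A, rhs)             # circumcenter of the tetrahedron
    return np.sum((center - a) ** 2)

def alpha_surface(points, alpha):
    tri = Delaunay(points)
    kept = [s for s in tri.simplices if circumradius2(points[s]) <= alpha]
    faces = Counter()
    for s in kept:
        for f in ((s[0], s[1], s[2]), (s[0], s[1], s[3]),
                  (s[0], s[2], s[3]), (s[1], s[2], s[3])):
            faces[tuple(sorted(f))] += 1
    return [f for f, n in faces.items() if n == 1]   # triangles on the frontier
```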
In our reconstruction pipeline, after obtaining and refining a point-cloud
model, the α-shape is calculated and defines a 3D surface model of the scene.
See Fig. 3 for an example. Compared to Fig. 2, the reader might agree with our
general observation that the surface model looks in general “better” than the
point-cloud visualization.
Now we are ready to discuss our proposed merger of point-cloud or surface data
obtained from multiple runs through a 3D scene.
The 3D model reconstructed from a single run (i.e. driving through the scene
in one direction) contains a large number of “blind spots” (e.g. due to occlusions,
e.g. the “other side of the wall”, or the limited viewing angle of the cameras, but
also due to missing depth data, if disparities were rated “invalid”). By combining
the results from opposite runs, we aim at producing a more accurate and more
complete model of the scene.
The task of consistently aligning models from different views is known as registration. Fully automatic pairwise registration methods exist for laser-scanner
data, and the main steps are listed below
1. Identify a set of interest points (e.g. SIFT [11]) that best represent both 3D
point sets.
2. Compute a feature descriptor at each interest point, using methods such as
fast point feature histograms (FPFH); see [14].
Fig. 4. Bird’s-eye view of an initial alignment of two opposite runs. Results of each
run are shown in different colours.
edge. When merging models from opposite runs, the occlusion walls from the
two models intersect each other.
References
13. Nister, D., Naroditsky, O., Bergen, J.: Visual odometry. In: Proc. CVPR, vol. 1,
pp. 652–659 (2004)
14. Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3D
registration. In: Proc. IEEE ICRA, pp. 3212–3217 (2009)
15. Sünderhauf, N., Konolige, K., Lacroix, S., Protzel, P.: Visual odometry using sparse
bundle adjustment on an autonomous outdoor vehicle. In: Proc. Autonome Mobile
Systems, pp. 157–163 (2005)
16. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment
– A modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) ICCV-WS
1999. LNCS, vol. 1883, pp. 298–375. Springer, Heidelberg (2000)
17. Xiao, J., Fang, T., Zhao, P., Lhuilier, M., Quan, L.: Image-based street-side city
modeling. In: Proc. SIGGRAPH, pp. 114:1–114:12 (2009)
Interactive Image Segmentation via Graph
Clustering and Synthetic Coordinates Modeling
1 Introduction
Image segmentation is a key step in many image-video analysis and multimedia
applications. According to interactive image segmentation, which is a special
case of image segmentation, unambiguous solutions, or segmentations satisfying
subjective criteria, could be obtained, since the user gives some markers on the
regions of interest and on the background. Fig. 1 illustrates an example of an
original image, two types of markers and the segmentation ground truth.
During the last decade, a large number of interactive image segmentation algo-
rithms have been proposed in the literature. In [1], a new shape constraint based
method for interactive image segmentation has been proposed using Geodesic
paths. The authors introduce Geodesic Forests, which exploit the structure of
Paraskevi Fragopoulou is also with the Foundation for Research and Technology-
Hellas, Institute of Computer Science, 70013 Heraklion, Crete, Greece.
Fig. 1. (a) Original image, (b), (c) given markers and (d) the ground truth image
2 Graph Generation
Initially, we partition the image into superpixels using the oversegmentation al-
gorithm proposed in [5]. In this work, the description of visual content consists of
Lab color components for color distribution and textureness for texture content.
This approach has been also used in [8]. The visual distance dv (si , sj ) between
two superpixels si and sj is given by the Mallows distance [9] of the three color
components in Lab color space and for the textureness measure of the corre-
sponding superpixels. Let G′ be the weighted graph of superpixels, so that two superpixels si and sj are connected with an edge of weight dv(si, sj) if and only if they are neighbors, meaning that they share a common boundary. Then, the proximity distance dp(si, sj) between superpixels si and sj is given by the length of the shortest path from si to sj in graph G′. The proposed distance between
superpixels si and sj that efficiently combines the visual and proximity distances
is given by Equation 1:
d(si, sj) = √(dp(si, sj)) · dv(si, sj) (1)
The use of the square root on the proximity distance is explained by the fact that
the visual distance is more important than the proximity distance. The graph
G′ is used in order to compute the graph G that is defined hereafter.
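A sketch of Eq. (1), assuming the networkx library for the shortest-path proximity on G′; the visual distance d_v would come from the Mallows distance on the Lab colour and textureness descriptions.

```python
# Sketch of Eq. (1): proximity = shortest-path length in the adjacency graph G'
# (edge weights are visual distances), combined with the direct visual distance.
import math
import networkx as nx

def build_gprime(neighbor_pairs, d_v):
    """neighbor_pairs: (i, j) superpixels sharing a boundary; d_v(i, j): visual distance."""
    Gp = nx.Graph()
    for i, j in neighbor_pairs:
        Gp.add_edge(i, j, weight=d_v(i, j))
    return Gp

def combined_distance(Gp, d_v, i, j):
    d_p = nx.shortest_path_length(Gp, i, j, weight="weight")  # proximity distance d_p
    return math.sqrt(d_p) * d_v(i, j)                         # d = sqrt(d_p) * d_v
```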
In the next step, we construct a graph G that represents the superpixels
and the connections between them, taking into account the given markers and
visual information. According to the given markers, two superpixels can either
be connected, meaning that they belong to the same class or be disconnected,
meaning that they belong to different classes. Thus, the nodes (superpixels) in
this graph are connected with edges of two types:
– the EC edges that connect two superpixels belonging to the same class,
– the ED edges that connect two superpixels belonging to different classes,
taking into account the two types of relations between superpixels. In the second
step of the algorithm, the visual distance and the superpixels’ proximity are used
to create the set of edges EC until G becomes a connected graph.
Hereafter, we present the procedure that computes the two sets of edges, EC
and ED for graph G. EC and ED are initialized to the corresponding edges ac-
cording to the given markers. Then, the N·(N−1)/2 pairs of distances d(·, ·) are
sorted and stored in vector v, where N denotes the number of superpixels. We
add the sorted edges of v on EC set until G becomes a connected graph in
order to be able to execute the Vivaldi algorithm [6] that generates the super-
pixels’ synthetic coordinates (see Section 3). In addition, we keep the graph
balanced (almost equal degree per node) using an upper limit on node degree
(MaxConn = 10).
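A sketch of the EC construction just described: marker edges initialise the set, then the sorted candidate edges are added, subject to the degree limit and to not contradicting the ED edges, until the graph becomes connected. The union-find bookkeeping and the function names are illustrative.

```python
# Sketch of the EC construction with a degree cap and a connectivity test.
MAX_CONN = 10

def build_ec(n, sorted_pairs, ec_marker, ed_marker):
    """sorted_pairs: [(d, i, j), ...] sorted by distance; ec/ed_marker: marker edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    ec, degree = set(ec_marker), [0] * n
    for i, j in ec:
        degree[i] += 1; degree[j] += 1
        parent[find(i)] = find(j)
    components = len({find(i) for i in range(n)})
    for _, i, j in sorted_pairs:
        if components == 1:                 # G is connected: stop adding edges
            break
        if degree[i] >= MAX_CONN or degree[j] >= MAX_CONN:
            continue
        if (i, j) in ed_marker or (j, i) in ed_marker:
            continue
        ec.add((i, j)); degree[i] += 1; degree[j] += 1
        if find(i) != find(j):
            parent[find(i)] = find(j)
            components -= 1
    return ec
```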
3 Synthetic Coordinates
In this work, we have used Vivaldi [6] to position the superpixels in a virtual space (the n-dimensional Euclidean space ℝⁿ, e.g. n = 20). Vivaldi [6] is a
fully decentralized, light-weight, adaptive network coordinate algorithm that
predicts Internet latencies with low error. Recently, we have successfully applied
4 Graph Clustering
– from the closest image border is less than 1% of the image diagonal
– from the closest image corner is less than 7% of the image diagonal.
5 Image Segmentation
classified pixels estimated by the graph clustering algorithm. For any unclassified
pixel s we can consider all the paths linking it to a classified set or region. A
path Cl(s) is a sequence of adjacent pixels {s0, . . . , sn−1, sn = s}. It holds that all pixels of the sequence are unlabeled, except s0, which has label l. The cost of a
particular path is defined as the maximum cost of a pixel classification according
to the Bayesian rule and along the path
Finally, the classification problem becomes equivalent to a search for the shortest
path given the above cost. Two algorithms based on the principle of the min −
max Bayesian criterion for classification, have been used. These algorithms have
been proposed in [4, 8].
– According to the Independent Label Flooding MRF-based minimization Al-
gorithm (ILFMA), we use the primal-dual method proposed in [12], which
casts the MRF optimization problem as an integer program and then makes
use of the duality theory of linear programming in order to derive solutions
that have been proved to be almost optimal.
– The Priority Multi-Class Flooding Algorithm (PMCFA), that is analytically
described in [8], imposes strong topology constraints. All the contours of
initially classified regions are propagated towards the space of unclassified
image pixels, according to similarity criteria, which are based on the class
label and the segmentation features.
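A generic sketch of the min-max path classification shared by both flooding approaches above: each unclassified pixel receives the label of the classified region reachable along the path whose maximum per-pixel Bayesian cost is smallest, computed with a Dijkstra-like propagation where max replaces the usual sum. The neighbours and cost callables are assumed inputs.

```python
# Sketch of min-max path classification: Dijkstra variant with max-cost paths.
import heapq

def minmax_flooding(seeds, neighbors, cost):
    """seeds: {pixel: label}; neighbors(p): adjacent pixels; cost(p, label): Bayesian cost."""
    best = {p: (0.0, l) for p, l in seeds.items()}
    heap = [(0.0, p, l) for p, l in seeds.items()]
    heapq.heapify(heap)
    while heap:
        c, p, l = heapq.heappop(heap)
        if best.get(p, (float("inf"), None))[0] < c:
            continue                                    # outdated queue entry
        for q in neighbors(p):
            cq = max(c, cost(q, l))                     # path cost = max cost along path
            if cq < best.get(q, (float("inf"), None))[0]:
                best[q] = (cq, l)
                heapq.heappush(heap, (cq, q, l))
    return {p: l for p, (c, l) in best.items()}
```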
In what follows, the proposed methods using the MRF model and the flooding
algorithm are denoted as SGC-ILFMA and SGC-PMCFA, respectively.
6 Experimental Results
SGC-ILFMA and SGC-PMCFA have been compared with algorithms from the
literature according to the reported results of [2] and [13], using the following
two datasets:
– The LHI interactive segmentation benchmark [14]. This benchmark consists
of 21 natural images with ground-truths and three types of users’ scribbles
for each image.
– The Zhao interactive segmentation benchmark [13]. This benchmark consists
of 50 natural images with ground-truths and four types of users’ scribbles
(levels) for each image. The higher the level, the more markers are added.
In order to measure the algorithms’ performance, we use the Region precision
criterion (RP ) [2]. RP measures an overlap rate between a result foreground
and the corresponding ground truth foreground. A higher RP indicates a better
segmentation result.
Fig. 2 depicts intermediate (initial map) and final segmentation results of the
proposed methods (SGC-ILFMA and SGC-PMCFA) on an image of LHI dataset
(see Fig. 2(a)) and three images of Zhao dataset. We graphically depict the given
markers on the original images using red color for foreground and green color for
background, respectively. The red, blue and white coloring of intermediate re-
sults correspond to foreground, background and unclassified pixels, respectively.
In every case, the results of the proposed algorithms are almost the same, yielding high performance results. In Figs. 2(f) and 2(j) the initial marker information suffices for the segmentation. Although in Fig. 2(n) the low number of given markers does not suffice to discriminate the foreground and background classes, the proposed methods give good performance results. A demonstration of the proposed method with experimental results is available online¹.
Using the LHI dataset, we have compared the proposed methods with other algorithms from the literature, CO3 [2] and Unger et al. [3], based on the reported results of [2]. The proposed methods SGC-ILFMA and SGC-PMCFA
yield RP 85.4% and 85.2%, respectively, outperforming the other methods. The
third and the fourth highest performance results are given by the CO3 [2] method
with RP = 79% and Unger et al. with RP = 73%.
In addition, we have compared the proposed methods with three other algorithms from the literature, Couprie et al. [15], Grady [16] and Noma et al. [17], using the reported results of [13] on the Zhao dataset. Table 1 depicts the mean
region precision (RP ) on four different simulation levels of SGC-ILFMA, SGC-
PMCFA, Bai et al., Couprie et al., Grady and Noma et al. algorithms. The
highest performance results are clearly obtained by the SGC-ILFMA and SGC-
PMCFA algorithms, while the third highest performance results are given by the Couprie et al. [15] method, which gives results similar to those of the Grady and Noma et al. methods.
7 Conclusion
In this paper, a two-step algorithm is proposed for interactive image segmenta-
tion taking into account image visual information, proximity distances as well as
the given markers. In the first step, we constructed a weighted graph of super-
pixels and we clustered this graph based on a synthetic coordinates algorithm.
In the second step, we have used a MRF or a flooding algorithm for getting the
final image segmentation. The proposed method yields high performance results
under different types of images and shapes of the initial markers.
¹ http://www.csd.uoc.gr/~cpanag/DEMOS/intImageSegmentation.htm
Fig. 2. (a), (e), (i), (m) Original images with markers from the LHI and Zhao
datasets. (b), (f ), (j), (n) Initial map of classified pixels. (c), (g), (k), (o) Final
segmentation results of the SGC-ILFMA method. (d), (h), (l), (p) Final segmenta-
tion results of the SGC-PMCFA method.
References
1. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic star
convexity for interactive image segmentation. In: IEEE Conference on Computer
Vision and Pattern Recognition, CVPR, pp. 3129–3136 (2010)
2. Zhao, Y., Zhu, S., Luo, S.: Co3 for ultra-fast and accurate interactive segmentation.
In: Proceedings of the International Conference on Multimedia, pp. 93–102. ACM
(2010)
3. Unger, M., Pock, T., Trobin, W., Cremers, D., Bischof, H.: Tvseg-interactive total
variation based image segmentation. In: British Machine Vision Conference, BMVC
(2008)
4. Grinias, I., Komodakis, N., Tziritas, G.: Flooding and MRF-based algorithms for
interactive segmentation. In: International Conference on Pattern Recognition,
ICPR, pp. 3943–3946 (2010)
5. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. In-
ternational Journal of Computer Vision 59, 167–181 (2004)
6. Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A decentralized network co-
ordinate system. In: Proceedings of the ACM SIGCOMM 2004 Conference, vol. 34,
pp. 15–26 (2004)
7. Panagiotakis, C., Papadakis, H., Grinias, I., Komodakis, N., Fragopoulou, P.,
Tziritas, G.: Interactive image segmentation based on synthetic graph coordinates.
In: Pattern Recognition (accepted, 2013)
8. Panagiotakis, C., Grinias, I., Tziritas, G.: Natural image segmentation based on
tree equipartition, bayesian flooding and region merging. IEEE Transactions on
Image Processing 20, 2276–2287 (2011)
9. Mallows, C.: A note on asymptotic joint normality. The Annals of Mathematical
Statistics 43, 508–515 (1972)
10. Papadakis, H., Panagiotakis, C., Fragopoulou, P.: Local community finding using
synthetic coordinates. In: Park, J.J., Yang, L.T., Lee, C. (eds.) FutureTech 2011,
Part II. CCIS, vol. 185, pp. 9–15. Springer, Heidelberg (2011)
11. Papadakis, H., Panagiotakis, C., Fragopoulou, P.: Locating communities on real
dataset graphs using synthetic coordinates. Parallel Processing Letters, PPL 20
(2012)
12. Komodakis, N., Tziritas, G.: Approximate labeling via graph cuts based on linear
programming. IEEE Transactions on Pattern Analysis and Machine Intelligence 29,
1436–1453 (2007)
13. Zhao, Y., Nie, X., Duan, Y., Huang, Y., Luo, S.: A benchmark for interactive
image segmentation algorithms. In: IEEE Workshop on Person-Oriented Vision,
POV, pp. 33–38 (2011)
14. Yao, B., Yang, X., Zhu, S.-C.: Introduction to a large-scale general purpose ground
truth database: Methodology, annotation tool and benchmarks. In: Yuille, A.L.,
Zhu, S.-C., Cremers, D., Wang, Y. (eds.) EMMCVPR 2007. LNCS, vol. 4679, pp.
169–183. Springer, Heidelberg (2007)
15. Couprie, C., Grady, L., Najman, L., Talbot, H.: Power watersheds: A new image
segmentation framework extending graph cuts, random walker and optimal span-
ning forest. In: International Conference on Computer Vision, pp. 731–738 (2009)
16. Grady, L.: Random walks for image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 28, 1768–1783 (2006)
17. Noma, A., Graciano, A., Consularo, L., Cesar Jr., R., Bloch, I.: A new algorithm for
interactive structural image segmentation. arXiv preprint arXiv:0805.1854 (2008)
Author Index