
Richard Wilson

Edwin Hancock
Adrian Bors
William Smith (Eds.)
LNCS 8047

Computer Analysis
of Images and Patterns
15th International Conference, CAIP 2013
York, UK, August 2013
Proceedings, Part I

Lecture Notes in Computer Science 8047
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Richard Wilson Edwin Hancock
Adrian Bors William Smith (Eds.)

Computer Analysis
of Images and Patterns
15th International Conference, CAIP 2013
York, UK, August 27-29, 2013
Proceedings, Part I

Volume Editors
Richard Wilson
Edwin Hancock
Adrian Bors
William Smith
University of York
Department of Computer Science
Deramore Lane
York YO10 5GH, UK
E-mail:{wilson, erh, adrian, wsmith}@cs.york.ac.uk

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-642-40260-9 e-ISBN 978-3-642-40261-6
DOI 10.1007/978-3-642-40261-6
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2013944666

CR Subject Classification (1998): I.5, I.4, I.2, H.2.8, I.3, H.3

LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

This volume contains the papers presented at the 15th International Conference on
Computer Analysis of Images and Patterns (CAIP 2013) held in York during August
27–29, 2013.
CAIP was first held in 1985 in Berlin, and since then has been organized biennially
in Wismar, Leipzig, Dresden, Budapest, Prague, Kiel, Ljubljana, Warsaw, Groningen,
Versailles, Vienna, Münster, and Seville.
We received 243 full papers from authors in 48 countries. Of these, 142 were accepted: 39 for oral presentation and 103 as posters. There were three invited speakers, Rama
Chellappa from the University of Maryland, Xiaoyi Jiang from the University of Münster,
and Tim Weyrich from University College London.
We hope that participants benefitted scientifically from the meeting, but also got
a flavor of York’s rich history and saw something of the region too. To this end, we
organized a reception at the York Castle Museum, and the conference dinner at the
Yorkshire Sculpture Park. The latter gave participants the chance to view large-scale
works by the Yorkshire artists Henry Moore and Barbara Hepworth.
We would like to thank a number of people for their help in organizing this event.
Firstly, we would like to thank the IAPR for its sponsorship. Furqan Aziz managed the production
of the proceedings, and Bob French coordinated local arrangements.

June 2013 Edwin Hancock


William Smith
Richard Wilson
Adrian Bors
Organization

Program Committee
Ceyhun Burak Akgül Vistek ISRA Vision, Turkey
Madjid Allili Bishop’s University, Canada
Nigel Allinson University of Lincoln, UK
Apostolos Antonacopoulos University of Salford, UK
Helder Araujo University of Coimbra, Portugal
Nicole M. Artner PRIP, Vienna University of Technology, Austria
Furqan Aziz University of York, UK
Andrew Bagdanov Media Integration and Communication Center
University of Florence, Italy
Antonio Bandera University of Malaga, Spain
Elisa H. Barney Smith Boise State University, USA
Ardhendu Behera University of Leeds, UK
Abdel Belaid Université de Lorraine - LORIA, France
Gunilla Borgefors Centre for Image Analysis, Swedish University
of Agricultural Sciences, Sweden
Adrian Bors University of York, UK
Luc Brun GREYC, ENS, France
Lorenzo Bruzzone University of Trento, Italy
Horst Bunke University of Bern, Switzerland
Martin Burger WWU Münster, Germany
Gustavo Carneiro University of Adelaide, Australia
Andrea Cerri University of Bologna, Italy
Kwok-Ping Chan The University of Hong Kong, SAR China
Rama Chellappa University of Maryland, USA
Sei-Wang Chen National Taiwan Normal University, Taiwan
Dmitry Chetverikov Hungarian Academy of Sciences, Hungary
John Collomosse University of Surrey, UK
Bertrand Coüasnon Irisa/Insa, France
Marco Cristani University of Verona, Italy
Guillaume Damiand LIRIS/Université de Lyon, France
Justin Dauwels M.I.T., USA
Mohammad Dawood University of Münster, Germany
Joachim Denzler University Jena, Germany
Cecilia Di Ruberto Università di Cagliari, Italy
Junyu Dong Ocean University of China, China

Hazim Kemal Ekenel InterACT Research, Universität Karlsruhe,


Germany
Hakan Erdogan Sabanci University, Turkey
Francisco Escolano University of Alicante, Spain
M. Taner Eskil ISIK University, Turkey
Alexandre Falcão Institute of Computing - University of
Campinas (Unicamp), Brazil
Chiung-Yao Fang National Taiwan Normal University, Taiwan
Massimo Ferri University of Bologna, Italy
Gernot Fink TU Dortmund University, Germany
Ana Fred Instituto Superior Tecnico, Portugal
Patrizio Frosini University of Bologna, Italy
Laurent Fuchs XLIM-SIC, UMR CNRS 7252, Université de
Poitiers, France
Xinbo Gao Xidian University, China
Dr. Anarta Ghosh Research Fellow, Ireland
Georgy Gimelfarb The University of Auckland, New Zealand
Daniela Giorgi IMATI Genova, Italy
Dmitry Goldgof University of South Florida, USA
Rocio Gonzalez-Diaz University of Seville, Spain
Cosmin Grigorescu European Patent Office, Brussels
Miguel A. Gutiérrez-Naranjo University of Seville, Spain
Michal Haindl Institute of Information Theory and
Automation, Czech Republic
Edwin Hancock University of York, UK
Yll Haxhimusa Vienna University of Technology, Austria
Vaclav Hlavac Czech Technical University in Prague,
Czech Republic
Zha Hongbin Peking University, China
Yo-Ping Huang National Taipei University of Technology,
Taiwan
Yung-Fa Huang Chaoyang University of Technology, Taiwan
Atsushi Imiya IMIT Chiba University, Japan
Xiaoyi Jiang Universität Münster, Germany
Maria Jose Jimenez University of Seville, Spain
Martin Kampel Vienna University of Technology, Computer
Vision Lab, Austria
Nahum Kiryati Tel Aviv University, Israel
Reinhard Klette University of Auckland, New Zealand
Andreas Koschan University of Tennessee, USA
Walter Kropatsch Vienna University of Technology, Austria
Xuelong Li University of London, UK
Pascal Lienhardt SIC Laboratory, France
Guo-Shiang Lin Da-Yeh University, Taiwan
Agnieszka Lisowska University of Silesia, Poland

Josep Llados Computer Vision Center, Universitat


Autonoma de Barcelona, Spain
Jean-Luc Mari Faculté des Sciences de Luminy, Université
Aix-Marseille 2, LSIS Laboratory, UMR
CNRS 6168, France
Eckart Michaelsen FGAN-FOM, Germany
Majid Mirmehdi University of Bristol, UK
Radu Nicolescu The University of Auckland, New Zealand
Mark Nixon University of Southampton, UK
Darian Onchis University of Vienna, Austria
Ioannis Patras Queen Mary College London, UK
Petra Perner Institute of Computer Vision and Applied
Computer Sciences, Germany
Nicolai Petkov University of Groningen, The Netherlands
Ioannis Pitas Aristotle University of Thessaloniki, Greece
Eugene Popov Nizhegorodsky Architectural and Civil
Engineering State University (NNACESU),
Russia
Mario J. Pérez Jiménez University of Seville, Spain
Petia Radeva Computer Vision Center, Universitat
Autònoma de Barcelona, Spain
Pedro Real University of Seville, Spain
Bodo Rosenhahn University of Hannover, Germany
Paul Rosin Cardiff University, UK
Samuel Rota Bulo Università Ca’ Foscari, Italy
Jose Ruiz-Shulcloper Advanced Technologies Applications Center
(CENATAV) MINBAS, Cuba
Robert Sablatnig Vienna University of Technology, Austria
Hideo Saito Keio University, Japan
Albert Salah Bogazici University, Turkey
Gabriella Sanniti Di Baja Institute of Cybernetics “E. Caianiello”, CNR,
Italy
Sudeep Sarkar University of South Florida, USA
Oliver Schreer Fraunhofer Heinrich Hertz Institute, Germany
Francesc Serratosa Universitat Rovira i Virgili, Spain
Luciano Silva Universidade Federal do Parana, Brazil
William Smith University of York, UK
Mingli Song Zhejiang University, China
K.G. Subramanian Universiti Sains Malaysia, Malaysia
Akihiro Sugimoto National Institute of Informatics, Japan
Dacheng Tao The Hong Kong Polytechnic University, SAR
China
Bernie Tiddeman University of Wales, Wales
Klaus Toennies Otto-von-Guericke-Universität, Germany
Javier Toro Desarrollo para la Ciencia y la Tecnologia, C.A.,
Venezuela

Andrea Torsello Università Ca’ Foscari, Italy


Chwei-Shyong Tsai National Chung Hsing University, Taiwan
Georgios Tzimiropoulos University of Lincoln, UK
Ernest Valveny Computer Vision Center - Universitat
Autònoma de Barcelona, Spain
Mario Vento Università degli Studi di Salerno, Italy
José Antonio Vilches Universidad de Sevilla, Spain
Sophie Viseur University of Provence, France
Shengrui Wang University of Sherbrooke, Canada
Michel Westenberg Eindhoven University of Technology,
The Netherlands
Paul Whelan DCU, Ireland
Richard Wilson University of York, UK
David Windridge University of Surrey, UK
Xianghua Xie Swansea University, UK
Jovisa Zunic University of Exeter, UK
Table of Contents – Part I

Biomedical Imaging: A Computer Vision Perspective . . . . . . . . . . . . . . . . . . . . . 1


Xiaoyi Jiang, Mohammad Dawood, Fabian Gigengack,
Benjamin Risse, Sönke Schmid, Daniel Tenbrinck, and
Klaus Schäfers

Rapid Localisation and Retrieval of Human Actions with Relevance


Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Simon Jones and Ling Shao

Deformable Shape Reconstruction from Monocular Video with Manifold


Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Lili Tao and Bogdan J. Matuszewski

Multi-SVM Multi-instance Learning for Object-Based Image


Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Fei Li, Rujie Liu, and Takayuki Baba

Maximizing Edit Distance Accuracy with Hidden Conditional Random


Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Antoine Vinel and Thierry Artières

Background Recovery by Fixed-Rank Robust Principal Component


Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Wee Kheng Leow, Yuan Cheng, Li Zhang, Terence Sim, and
Lewis Foo

Manifold Learning and the Quantum Jensen-Shannon Divergence


Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Luca Rossi, Andrea Torsello, and Edwin R. Hancock

Spatio-temporal Manifold Embedding for Nearly-Repetitive Contents in a


Video Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Manal Al Ghamdi and Yoshihiko Gotoh

Spatio-temporal Human Body Segmentation from Video Stream . . . . . . . . . . . . 78


Nouf Al Harbi and Yoshihiko Gotoh

Sparse Depth Sampling for Interventional 2-D/3-D Overlay: Theoretical Error


Analysis and Enhanced Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Jian Wang, Christian Riess, Anja Borsdorf, Benno Heigl, and
Joachim Hornegger

Video Synopsis Based on a Sequential Distortion Minimization


Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Costas Panagiotakis, Nelly Ovsepian, and Elena Michael

A Graph Embedding Method Using the Jensen-Shannon Divergence . . . . . . . . . 102


Lu Bai, Edwin R. Hancock, and Lin Han

Mixtures of Radial Densities for Clustering Graphs . . . . . . . . . . . . . . . . . . . . . . 110


Brijnesh J. Jain

Complexity Fusion for Indexing Reeb Digraphs . . . . . . . . . . . . . . . . . . . . . . . . . 120


Francisco Escolano, Edwin R. Hancock, and Silvia Biasotti

Analysis of Wave Packet Signature of a Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 128


Furqan Aziz, Richard C. Wilson, and Edwin R. Hancock

Hearing versus Seeing Identical Twins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


Li Zhang, Shenggao Zhu, Terence Sim, Wee Kheng Leow,
Hossein Najati, and Dong Guo

Voting Strategies for Anatomical Landmark Localization Using


the Implicit Shape Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Jürgen Brauer, Wolfgang Hübner, and Michael Arens

Evaluating the Impact of Color on Texture Recognition . . . . . . . . . . . . . . . . . . . 154


Fahad Shahbaz Khan, Joost van de Weijer, Sadiq Ali, and
Michael Felsberg

Temporal Self-Similarity for Appearance-Based Action Recognition


in Multi-View Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Marco Körner and Joachim Denzler

Adaptive Pixel/Patch-Based Stereo Matching for 2D Face


Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Rui Liu, Weiguo Feng, and Ming Zhu

A Machine Learning Approach for Displaying Query Results in Search


Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Tunga Güngör

A New Pixel-Based Quality Measure for Segmentation Algorithms Integrating


Precision, Recall and Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Kannikar Intawong, Mihaela Scuturici, and Serge Miguet

A Novel Border Identification Algorithm Based on an “Anti-Bayesian”


Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Anu Thomas and B. John Oommen

Assessing the Effect of Crossing Databases on Global and Local Approaches


for Face Gender Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Yasmina Andreu Cabedo, Ramón A. Mollineda Cárdenas, and
Pedro Garcı́a-Sevilla
BRDF Estimation for Faces from a Sparse Dataset Using a Neural Network . . . 212
Mark F. Hansen, Gary A. Atkinson, and Melvyn L. Smith
Comparison of Leaf Recognition by Moments and Fourier
Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Tomáš Suk, Jan Flusser, and Petr Novotný
Dense Correspondence of Skull Models by Automatic Detection
of Anatomical Landmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Kun Zhang, Yuan Cheng, and Wee Kheng Leow
Detection of Visual Defects in Citrus Fruits: Multivariate Image Analysis vs
Graph Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Fernando López-Garcı́a, Gabriela Andreu-Garcı́a,
José-Miguel Valiente-Gonzalez, and Vicente Atienza-Vanacloig
Domain Adaptation Based on Eigen-Analysis and Clustering,
for Object Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Suranjana Samanta and Sukhendu Das
Estimating Clusters Centres Using Support Vector Machine:
An Improved Soft Subspace Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . 254
Amel Boulemnadjel and Fella Hachouf
Fast Approximate Minimum Spanning Tree Algorithm
Based on K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Caiming Zhong, Mikko Malinen, Duoqian Miao, and Pasi Fränti
Fast EM Principal Component Analysis Image Registration Using
Neighbourhood Pixel Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Parminder Singh Reel, Laurence S. Dooley, K.C.P. Wong, and Anko Börner
Fast Unsupervised Segmentation Using Active Contours and Belief
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Foued Derraz, Laurent Peyrodie, Abdelmalik Taleb-Ahmed,
Miloud Boussahla, and Gerard Forzy
Flexible Hypersurface Fitting with RBF Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 286
Jun Fujiki and Shotaro Akaho
Gender Classification Using Facial Images and Basis Pursuit . . . . . . . . . . . . . . . 294
Rahman Khorsandi and Mohamed Abdel-Mottaleb
Graph Clustering through Attribute Statistics Based Embedding . . . . . . . . . . . . 302
Jaume Gibert, Ernest Valveny, Horst Bunke, and Luc Brun

Graph-Based Regularization of Binary Classifiers for Texture Segmentation . . . 310


Cyrille Faucheux, Julien Olivier, and Romuald Boné
Hierarchical Annealed Particle Swarm Optimization for Articulated Object
Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Xuan Son Nguyen, Séverine Dubuisson, and Christophe Gonzales
High-Resolution Feature Evaluation Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 327
Kai Cordes, Bodo Rosenhahn, and Jörn Ostermann
Fully Automatic Segmentation of AP Pelvis X-rays via Random Forest
Regression and Hierarchical Sparse Shape Composition . . . . . . . . . . . . . . . . . . . 335
Cheng Chen and Guoyan Zheng
Language Adaptive Methodology for Handwritten Text Line
Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Subhash Panwar, Neeta Nain, Subhra Saxena, and P.C. Gupta
Learning Geometry-Aware Kernels in a Regularization Framework . . . . . . . . . . 352
Binbin Pan and Wen-Sheng Chen
Motion Trend Patterns for Action Modelling and Recognition . . . . . . . . . . . . . . 360
Thanh Phuong Nguyen, Antoine Manzanera, and Matthieu Garrigues
On Achieving Near-Optimal “Anti-Bayesian” Order Statistics-Based
Classification for Asymmetric Exponential Distributions . . . . . . . . . . . . . . . . . . 368
Anu Thomas and B. John Oommen
Optimizing Feature Selection through Binary Charged System Search . . . . . . . . 377
Douglas Rodrigues, Luis A.M. Pereira, Joao P. Papa,
Caio C.O. Ramos, Andre N. Souza, and Luciene P. Papa
Outlines of Objects Detection by Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Asma Bellili, Slimane Larabi, and Neil M. Robertson
PaTHOS: Part-Based Tree Hierarchy for Object Segmentation . . . . . . . . . . . . . . 393
Loreta Suta, Mihaela Scuturici, Vasile-Marian Scuturici, and
Serge Miguet
Tracking System with Re-identification Using a Graph Kernels
Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Amal Mahboubi, Luc Brun, Donatello Conte, Pasquale Foggia, and
Mario Vento
Recognizing Human-Object Interactions Using Sparse Subspace
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Ivan Bogun and Eraldo Ribeiro
Scale-Space Clustering on the Sphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Yoshihiko Mochizuki, Atsushi Imiya, Kazuhiko Kawamoto,
Tomoya Sakai, and Akihiko Torii

The Importance of Long-Range Interactions to Texture Similarity . . . . . . . . . . . 425


Xinghui Dong and Mike J. Chantler

Unsupervised Dynamic Textures Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . 433


Michal Haindl and Stanislav Mikeš

Voting Clustering and Key Points Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441


Costas Panagiotakis and Paraskevi Fragopoulou

Motor Pump Fault Diagnosis with Feature Selection and


Levenberg-Marquardt Trained Feedforward Neural Network . . . . . . . . . . . . . . . 449
Thomas W. Rauber and Flávio M. Varejão

Unobtrusive Fall Detection at Home Using Kinect Sensor . . . . . . . . . . . . . . . . . 457


Michal Kepski and Bogdan Kwolek

“BAM!” Depth-Based Body Analysis in Critical Care . . . . . . . . . . . . . . . . . . . . 465


Manuel Martinez, Boris Schauerte, and Rainer Stiefelhagen

3-D Feature Point Matching for Object Recognition


Based on Estimation of Local Shape Distinctiveness . . . . . . . . . . . . . . . . . . . . . . 473
Masanobu Nagase, Shuichi Akizuki, and Manabu Hashimoto

3D Human Tracking from Depth Cue in a Buying Behavior Analysis


Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
Cyrille Migniot and Fakhreddine Ababsa

A New Bag of Words LBP (BoWL) Descriptor for Scene Image


Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
Sugata Banerji, Atreyee Sinha, and Chengjun Liu

Accurate Scale Factor Estimation in 3D Reconstruction . . . . . . . . . . . . . . . . . . . 498


Manolis Lourakis and Xenophon Zabulis

Affine Colour Optical Flow Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507


Ming-Ying Fan, Atsushi Imiya, Kazuhiko Kawamoto, and
Tomoya Sakai

Can Salient Interest Regions Resume Emotional Impact of an Image? . . . . . . . . 515


Syntyche Gbèhounou, François Lecellier,
Christine Fernandez-Maloigne, and Vincent Courboulay

Contraharmonic Mean Based Bias Field Correction in MR Images . . . . . . . . . . 523


Abhirup Banerjee and Pradipta Maji

Correlation between Biopsy Confirmed Cases and Radiologist’s Annotations


in the Detection of Lung Nodules by Expanding the Diagnostic Database
Using Content Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Preeti Aggarwal, H.K. Sardana, and Renu Vig

Enforcing Consistency of 3D Scenes with Multiple Objects Using


Shape-from-Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Matthew Grum and Adrian G. Bors

Expectation Conditional Maximization-Based Deformable Shape


Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
Guoyan Zheng

Facial Expression Recognition with Regional Features Using Local Binary


Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
Anima Majumder, Laxmidhar Behera, and
Venkatesh K. Subramanian

Global Image Registration Using Random Projection and Local Linear


Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
Hayato Itoh, Tomoya Sakai, Kazuhiko Kawamoto, and Atsushi Imiya

Image Segmentation by Oriented Image Foresting Transform


with Geodesic Star Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Lucy A.C. Mansilla and Paulo A.V. Miranda

Multi-run 3D Streetside Reconstruction from a Vehicle . . . . . . . . . . . . . . . . . . . 580


Yi Zeng and Reinhard Klette

Interactive Image Segmentation via Graph Clustering and Synthetic


Coordinates Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
Costas Panagiotakis, Harris Papadakis, Elias Grinias,
Nikos Komodakis, Paraskevi Fragopoulou, and Georgios Tziritas

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597


Table of Contents – Part II

Classified-Distance Based Shape Descriptor for Application to Image


Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Jinhee Chun, Natsuda Kaothanthong, and Takeshi Tokuyama

A Shape Descriptor Based on Trainable COSFIRE Filters for the Recognition


of Handwritten Digits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
George Azzopardi and Nicolai Petkov

Supporting Ancient Coin Classification by Image-Based Reverse Side Symbol


Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Hafeez Anwar, Sebastian Zambanini, and Martin Kampel

Eyewitness Face Sketch Recognition Based on Two-Step Bias Modeling . . . . . 26


Hossein Nejati, Li Zhang, and Terence Sim

Weighted Semi-Global Matching and Center-Symmetric Census Transform


for Robust Driver Assistance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Robert Spangenberg, Tobias Langner, and Raúl Rojas

Handwritten Word Image Matching Based on Heat Kernel Signature . . . . . . . . . 42


Xi Zhang and Chew Lim Tan

Wrong Roadway Detection for Multi-lane Roads . . . . . . . . . . . . . . . . . . . . . . . . 50


Junli Tao, Bok-Suk Shin, and Reinhard Klette

Blind Deconvolution Using Alternating Maximum a Posteriori Estimation


with Heavy-Tailed Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Jan Kotera, Filip Šroubek, and Peyman Milanfar

Focus Fusion with Anisotropic Depth Map Smoothing . . . . . . . . . . . . . . . . . . . . 67


Madina Boshtayeva, David Hafner, and Joachim Weickert

Accurate Fibre Orientation Measurement for Carbon Fibre Surfaces . . . . . . . . . 75


Stefan Thumfart, Werner Palfinger, Matthias Stöger, and
Christian Eitzinger

Benchmarking GPU-Based Phase Correlation for Homography-Based


Registration of Aerial Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Falk Schubert and Krystian Mikolajczyk

Robustness of Point Feature Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91


Zijiang Song and Reinhard Klette

Depth Super-Resolution by Enhanced Shift and Add . . . . . . . . . . . . . . . . . . . . . 100


Kassem Al Ismaeil, Djamila Aouada, Bruno Mirbach, and Björn Ottersten

Using Region-Based Saliency for 3D Interest Points Detection . . . . . . . . . . . . . 108


Yitian Zhao, Yonghuai Liu, and Ziming Zeng

Accurate 3D Multi-marker Tracking in X-ray Cardiac Sequences Using


a Two-Stage Graph Modeling Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Xiaoyan Jiang, Daniel Haase, Marco Körner, Wolfgang Bothe, and
Joachim Denzler

3D Mesh Decomposition Using Protrusion and Boundary Part Detection . . . . . 126


Fattah Alizadeh and Alistair Sutherland

Isometrically Invariant Description of Deformable Objects Based on the


Fractional Heat Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Eric Paquet and Herna Lydia Viktor

Discriminant Analysis Based Level Set Segmentation for Ultrasound


Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Daniel Tenbrinck and Xiaoyi Jiang

Region Based Contour Detection by Dynamic Programming . . . . . . . . . . . . . . . 152


Xiaoyi Jiang and Daniel Tenbrinck

Sparse Coding and Mid-Level Superpixel-Feature for ℓ0-Graph Based


Unsupervised Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Xiaofang Wang, Huibin Li, Simon Masnou, and Liming Chen

Intuitive Large Image Database Browsing Using Perceptual Similarity


Enriched by Crowds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Stefano Padilla, Fraser Halley, David A. Robb, and Mike J. Chantler

Irreversibility Analysis of Feature Transform-Based Cancelable Biometrics . . . 177


Christian Rathgeb and Christoph Busch

L∞ Norm Based Solution for Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . 185


Mohammed Boulekchour and Nabil Aouf

Matching Folded Garments to Unfolded Templates Using Robust Shape


Analysis Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Ioannis Mariolis and Sotiris Malassiotis

Multi-scale Image Segmentation Using MSER . . . . . . . . . . . . . . . . . . . . . . . . . . 201


Il-Seok Oh, Jinseon Lee, and Aditi Majumder

Multi-spectral Material Classification in Landscape Scenes Using Commodity


Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Gwyneth Bradbury, Kenny Mitchell, and Tim Weyrich

Multispectral Stereo Image Correspondence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217


Marcelo D. Pistarelli, Angel D. Sappa, and Ricardo Toledo

NLP EAC Recognition by Component Separation in the Eye Region . . . . . . . . . 225


Ruxandra Vrânceanu, Corneliu Florea, Laura Florea, and
Constantin Vertan

OPF-MRF: Optimum-Path Forest and Markov Random Fields


for Contextual-Based Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Rodrigo Nakamura, Daniel Osaku, Alexandre Levada,
Fabio Cappabianco, Alexandre Falcão, and Joao Papa

Orthonormal Diffusion Decompositions of Images for Optical Flow


Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Sravan Gudivada and Adrian G. Bors

Pairwise Similarity for Line Extraction from Distorted Images . . . . . . . . . . . . . . 250


Hideitsu Hino, Jun Fujiki, Shotaro Akaho, Yoshihiko Mochizuki, and
Noboru Murata

Plant Leaf Classification Using Color on a Gravitational Approach . . . . . . . . . . 258


Jarbas J. de M. Sá Junior, André R. Backes, and Paulo César Cortez

Semi-automatic Image Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266


Julia Moehrmann and Gunther Heidemann

Segmentation of Skin Spectral Images Using Simulated Illuminations . . . . . . . . 274


Zhengzhe Wu, Ville Heikkinen, Markku Hauta-Kasari, and Jussi Parkkinen

Robust Visual Object Tracking via Sparse Representation and


Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Zhenjun Han, Qixiang Ye, and Jianbin Jiao

Sphere Detection in Kinect Point Clouds Via the 3D Hough Transform . . . . . . . 290
Anas Abuzaina, Mark S. Nixon, and John N. Carter

Watermark Optimization of 3D Shapes for Minimal Distortion and High


Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Adrian G. Bors and Ming Luo

Wavelet Network and Geometric Features Fusion Using Belief Functions


for 3D Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Mohamed Anouar Borgi, Maher El’Arbi, and Chokri Ben Amar

A Color-Based Selective and Interactive Filter Using Weighted TV . . . . . . . . . . 315


Cédric Loosli, François Lecellier, Stéphanie Jehan-Besson, and
Jonas Koko

A Composable Strategy for Shredded Document Reconstruction . . . . . . . . . . . . 324


Razvan Ranca and Iain Murray

A Global-Local Approach to Saliency Detection . . . . . . . . . . . . . . . . . . . . . . . . . 332


Ahmed Boudissa, JooKooi Tan, Hyoungseop Kim, Seiji Ishikawa,
Takashi Shinomiya, and Krystian Mikolajczyk

A Moving Average Bidirectional Texture Function Model . . . . . . . . . . . . . . . . . 338


Michal Havlı́ček and Michal Haindl

A Multiscale Blob Representation of Mammographic Parenchymal


Patterns and Mammographic Risk Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 346
Zhili Chen, Liping Wang, Erika Denton, and Reyer Zwiggelaar

Alternating Optimization for Lambertian Photometric Stereo Model


with Unknown Lighting Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
Khrystyna Kyrgyzova, Lorène Allano, and Michaël Aupetit

An Automated Visual Inspection System for the Classification of the Phases


of Ti-6Al-4V Titanium Alloy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Antonino Ducato, Livan Fratini, Marco La Cascia, and Giuseppe Mazzola

Analysis of Bat Wing Beat Frequency Using Fourier Transform . . . . . . . . . . . . . 370


John Atanbori, Peter Cowling, John Murray, Belinda Colston, Paul Eady,
Dave Hughes, Ian Nixon, and Patrick Dickinson

Automated Ground-Plane Estimation for Trajectory Rectification . . . . . . . . . . . 378


Ian Hales, David Hogg, Kia Ng, and Roger Boyle

Breast Parenchymal Pattern Analysis in Digital Mammography: Associations


between Tabàr and Birads Tissue Compositions . . . . . . . . . . . . . . . . . . . . . . . . . 386
Wenda He and Reyer Zwiggelaar

Color Transfer Based on Earth Mover’s Distance and Color Categorization . . . . 394
Wenya Feng, Yilin Guo, Okhee Kim, Yonggan Hou, Long Liu, and
Huiping Sun

Empirical Comparison of Visual Descriptors for Multiple Bleeding Spots


Recognition in Wireless Capsule Endoscopy Video . . . . . . . . . . . . . . . . . . . . . . . 402
Sarah Alotaibi, Sahar Qasim, Ouiem Bchir, and
Mohamed Maher Ben Ismail

Exploring Interest Points and Local Descriptors for Word Spotting Application
on Historical Handwriting Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
Peng Wang, Véronique Eglin, Christine Largeron, Antony McKenna, and
Christophe Garcia

Gravitational Based Texture Roughness for Plant Leaf Identification . . . . . . . . . 416


Jarbas J. de M. Sá Junior, André R. Backes, and Paulo César Cortez

Heterogeneity Index for Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424


Cheng Ye, Richard C. Wilson, and Edwin R. Hancock

High-Precision Lens Distortion Correction Using Smoothed Thin Plate


Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
Sönke Schmid, Xiaoyi Jiang, and Klaus Schäfers

Identification Using Encrypted Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440


Mohammad Haghighat, Saman Zonouz, and Mohamed Abdel-Mottaleb

Illumination Effects in Quantitative Virtual Microscopy . . . . . . . . . . . . . . . . . . . 449


Doreen Altinay and Andrew P. Bradley

Improving the Correspondence Establishment Based on Interactive


Homography Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
Xavier Cortés, Carlos Moreno, and Francesc Serratosa

Interactive Segmentation of Media-Adventitia Border in IVUS . . . . . . . . . . . . . 466


Jonathan-Lee Jones, Ehab Essa, Xianghua Xie, and Dave Smith

Kernel Maximum Mean Discrepancy for Region Merging Approach . . . . . . . . . 475


Alya Slimene and Ezzeddine Zagrouba

Laplacian Derivative Based Regularization for Optical Flow Estimation


in Driving Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
Naveen Onkarappa and Angel D. Sappa

Local and Global Statistics-Based Explicit Active Contour for Weld Defect
Extraction in Radiographic Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
Aicha Baya Goumeidane, Nafaa Nacereddine, and Mohammed Khamadja

Minimum Entropy Models for Laser Line Extraction . . . . . . . . . . . . . . . . . . . . . 499


Wei Yang, Liguo Zhang, Wei Ke, Ce Li, and Jianbin Jiao

A Convenient and Fast Method of Endoscope Calibration under Surgical


Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Meiqing Liu, Dayuan Yan, Xiaoming Hu, Ya Zhou, and Zhaoguo Wu

SAMSLAM: Simulated Annealing Monocular SLAM . . . . . . . . . . . . . . . . . . . . 515


Marco Fanfani, Fabio Bellavia, Fabio Pazzaglia, and Carlo Colombo

Spatial Patch Blending for Artefact Reduction in Pattern-Based Inpainting


Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Maxime Daisy, David Tschumperlé, and Olivier Lézoray

Spatio-temporal Support for Range Flow Based Ego-Motion Estimators . . . . . . 531


Graeme A. Jones and Gordon Hunter

Tracking for Quantifying Social Network of Drosophila Melanogaster . . . . . . . 539


Tanmay Nath, Guangda Liu, Barbara Weyn, Bassem Hassan,
Ariane Ramaekers, Steve De Backer, and Paul Scheunders

Virtual Top View: Towards Real-Time Aggregation of Videos to Monitor


Large Areas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
Hagen Borstell, Saira Saleem Pathan, Michael Soffner, and Klaus Richter

Writer Identification in Old Music Manuscripts Using Contour-Hinge


Feature and Dimensionality Reduction with an Autoencoder . . . . . . . . . . . . . . . 555
Masahiro Niitsuma, Lambert Schomaker, Jean-Paul van Oosten, and
Yo Tomita

Human Action Recognition Using Temporal Segmentation and Accordion


Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
Manel Sekma, Mahmoud Mejdoub, and Chokri Ben Amar

Effective Diversification for Ambiguous Queries in Social Image Retrieval . . . . 571


Amel Ksibi, Ghada Feki, Anis Ben Ammar, and Chokri Ben Amar

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579


Biomedical Imaging:
A Computer Vision Perspective

Xiaoyi Jiang 1,2,3, Mohammad Dawood 1,2, Fabian Gigengack 1,2,
Benjamin Risse 1,4, Sönke Schmid 1,2,3, Daniel Tenbrinck 1,2,
and Klaus Schäfers 2,3

1 Department of Mathematics and Computer Science, University of Münster, Germany
2 European Institute for Molecular Imaging (EIMI), University of Münster, Germany
3 Cluster of Excellence EXC 1003, Cells in Motion, CiM, Münster, Germany
4 Department of Neuro and Behavioral Biology, University of Münster, Germany

Abstract. Many computer vision algorithms have been successfully adapted and
applied to biomedical imaging applications. However, biomedical computer vision
is far beyond being only an application field. Indeed, it is a wide field with huge
potential for developing novel concepts and algorithms and can be seen as a
driving force for computer vision research. To emphasize this view of biomedical
computer vision we consider a variety of important topics of biomedical imaging
in this paper and exemplarily discuss some challenges, the related concepts,
techniques, and algorithms.

1 Introduction
The success story of modern biology and medicine is also one of imaging. It is
the imaging techniques that enable biological experiments (for high-throughput
behavioral screens or conformation analysis) and make the body of humans and
animals anatomically or functionally visible for clinical purposes (medical procedures
seeking to reveal, diagnose, or examine disease). With the widespread use of imaging
modalities in fundamental research and routine clinical practice, researchers and
physicians are faced with an ever-increasing amount of image data to be analyzed,
and the quantitative outcomes of such analysis are becoming increasingly important.
Modern computer vision technology is thus indispensable to acquire and extract
information out of the huge amount of data.
Computer vision has a long history and is becoming increasingly mature.
Many computer vision algorithms have been successfully adapted and applied
to biomedical imaging applications. However, biomedical imaging has several
special characteristics which pose particular challenges, e.g.,
– Acquisition and enhancement techniques for challenging imaging situations
are needed.
– The variety of different imaging sensors, each with its own physical principle
and characteristics (e.g., noise modeling), often requires modality-specific
treatment.

R. Wilson et al. (Eds.): CAIP 2013, Part I, LNCS 8047, pp. 1–19, 2013.

© Springer-Verlag Berlin Heidelberg 2013
Fig. 1. Illustration of three noise models. (a) Noise-free 1D signal. (b) Signal biased
by additive Gaussian noise with σ = 5. (c) Signal biased by Poisson noise. (d) Signal
biased by speckle noise with σ = 5 and γ = 1. (from [42])

– Different modalities are often involved. Thus, algorithms must be designed to
cope with multiple modalities.
– Due to the high complexity of many biomedical image analysis tasks, semi-automatic
processing may be unavoidable in some cases. The design of intelligent and
user-friendly interactive tools is a challenging task.
– Also, the different body organs may require specific treatment.
As an example, the influence of noise modeling is considered. The following noise
models are popular:
– Additive Gaussian noise: f = μ + ν, where μ is the unbiased image intensity
and ν is a Gaussian-distributed random variable with expectation 0 and variance σ².
– Poisson noise ("photon counting noise"): This type of noise is signal-dependent
and appears in a wide class of real-life applications, e.g., in positron emission
tomography and fluorescence microscopy.
– Speckle noise: f = μ + ν μ^{γ/2} occurs in ultrasound imaging and is of
multiplicative nature. Its dependency on the unbiased image intensity μ is
controlled by the parameter γ; ν is the same as for additive Gaussian noise.
To illustrate the different characteristics of these noise forms, a synthetic 1D
signal and its corrupted versions are shown in Figure 1. We can observe that, for
similar parameters, the appearance of signal-dependent Poisson and speckle noise
is in general stronger compared to additive Gaussian noise. Their processing is
thus decidedly challenging and reinforces the need for accurate data modeling
in computer vision.
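For illustration, the following Python sketch corrupts a synthetic 1D step signal with the three noise models above; the signal values, random seed, and parameters are illustrative and are not those used to generate Figure 1:

import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility

def add_gaussian(mu, sigma=5.0):
    # additive Gaussian noise: f = mu + nu, nu ~ N(0, sigma^2)
    return mu + rng.normal(0.0, sigma, size=mu.shape)

def add_poisson(mu):
    # photon-counting noise: each sample is drawn from Poisson(mu)
    return rng.poisson(mu).astype(float)

def add_speckle(mu, sigma=5.0, gamma=1.0):
    # speckle noise: f = mu + nu * mu^(gamma/2), multiplicative in mu
    nu = rng.normal(0.0, sigma, size=mu.shape)
    return mu + nu * np.power(mu, gamma / 2.0)

# synthetic 1D step signal (illustrative values only)
signal = np.concatenate([np.full(50, 20.0), np.full(50, 120.0)])
noisy = {
    "gaussian": add_gaussian(signal),
    "poisson": add_poisson(signal),
    "speckle": add_speckle(signal),
}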
On the other hand, the special characteristics of biomedical imaging also give
extra power to computer vision research. Multimodality can be helpful since the
different modalities carry complementary information, and their combined use may
ease some image analysis tasks (e.g., segmentation [25]). Generally, a lot of
knowledge specific to a particular application or object type may exist that should
be accurately modeled and integrated into algorithms for dedicated processing
towards improved performance.
Given the challenges discussed above, biomedical computer vision is far beyond
simply adapting and applying advanced computer vision techniques to solve real
problems. It is also a wide field with huge potential for developing novel concepts,
techniques, and algorithms. Indeed, biomedical imaging can be seen as a driving
force for computer vision research.
In this paper this view of biomedical computer vision is emphasized by
considering important topics of biomedical imaging: minimum-cost boundary
detection, region-based image segmentation, image registration, optical flow
computation, and imaging techniques. Our intention is not to give a complete
coverage of these topics, but rather to exemplarily focus on typical challenges and
the related concepts, techniques, and algorithms. The majority of the given
examples are based on our own research and experiences in the respective fields.

2 Minimum-cost Boundary Detection

Quantification is one of the key words in biomedical imaging and requires robust,
fast, and possibly automatic image segmentation algorithms. It can take the form
of either boundary detection or, alternatively, region-based segmentation.
Automatic segmentation enables assessment of meaningful parameters, e.g., for
diagnosis of pathological findings in clinical environments.

2.1 Live-Wire Techniques

Several paradigms of minimum-cost boundary detection exist in the literature.


Among them the live-wire approach was initially introduced by Mortensen et
al. [37] and Udupa et al. [51]. The user interactively picks a seed point on the
boundary. Then, a live-wire is displayed in real time from the initial point to any
subsequent position taken by the cursor. The entire 2D boundary is specified by
means of a set of live-wire segments in this manner. The detection of segments
is formulated as a graph searching problem, which finds the globally optimal
(minimum-cost) path between an initial start pixel and an end pixel.
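A generic sketch of this minimum-cost path search is given below; it runs Dijkstra's algorithm on an 8-connected pixel graph with a precomputed per-pixel cost map (e.g., an inverted gradient magnitude) and only illustrates the principle, not the specific cost terms of [37,51]:

import heapq
import numpy as np

def live_wire_path(cost, seed, target):
    # cost : 2D array of local costs, low where edges are strong,
    #        so that boundary pixels are cheap to traverse
    # seed, target : (row, col) tuples chosen by the user
    h, w = cost.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if (y, x) == target:
            break
        if d > dist[y, x]:
            continue                      # stale heap entry
        for dy, dx in nbrs:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + cost[ny, nx]     # cost of entering the neighbour
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    prev[(ny, nx)] = (y, x)
                    heapq.heappush(heap, (nd, (ny, nx)))
    # backtrack from target to seed
    path, p = [target], target
    while p != seed:
        p = prev[p]
        path.append(p)
    return path[::-1]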
Placing seed points precisely on an object boundary may be difficult and
tedious. To facilitate seed point placement, a cursor snap mechanism forces the
mouse point to the pixel of maximum gradient magnitude within a user-specified
neighborhood. The user-friendliness can be further increased by the live lane
approach [20]. The user selects only the initial point. Subsequent points are
selected automatically as the cursor is moved within a lane surrounding the
boundary whose width changes as a function of the speed and acceleration of
cursor motion.
Live-wire boundaries are piecewise optimal (between two seed points) and
thus provide a balance between global optimality and local control. In contrast
to statistical deformable approaches (e.g., [10,11,28]) no training is required.
This semi-automatic technique has established itself as a robust and user-friendly
method for the extraction of structure outlines for many biomedical applications.
A very fast implementation called live-wire on the fly is described in [19]
which avoids unnecessary minimum-cost path computation during segmentation.
Another important extension is the 3D generalization proposed in [18] to segment
Fig. 2. B-mode CCA image (left) and detected intima and adventitia layers of the far wall (right)

3D volume data or time sequences of 2D images. The key idea there is that the
user specifies contours via live-wiring on a few slices that are orthogonal to the
natural slices of the original data. If these slices are selected strategically, then
one obtains a sufficient number of seed points in each natural slice which enable
a subsequent automatic optimal boundary detection therein.
Live-wire techniques are a good example of designing intelligent and user-
friendly interactive segmentation tools. They help to solve complex segmentation
tasks by locally and non-extensively integrating the expertise and wishes of
domain experts, which in turn also increases the user’s faith in the automatic
solution.

2.2 Dynamic Programming Based Boundary Detection


Dynamic programming (DP) is a popular technique for boundary detection due
to its elegance, efficiency, and guarantee of optimality. One class of detectable
boundaries starts from the left, passes each image column exactly once, and ends
in the last column. An example is shown in Figure 2 for detecting the intimal
and adventitial layers of the common carotid artery (CCA) in B-mode sonographic
images [8]. Given an image of n rows and m columns, a total number of
O(n · 3^{m-1}) potential paths exists. However, the dynamic programming technique
gives us an efficient algorithm for exactly finding the minimum-cost path in
O(mn) time and space [45].
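The following sketch illustrates such an O(mn) dynamic program: the boundary passes every column exactly once and may move at most one row between neighboring columns (an illustrative implementation, not the exact formulation of [8,45]):

import numpy as np

def min_cost_column_path(cost):
    # cost : (n, m) array; cost[i, j] is the price of placing the boundary
    # in row i of column j. Transitions are limited to the three
    # neighbouring rows of the previous column.
    n, m = cost.shape
    acc = cost.copy()                    # accumulated minimum cost
    back = np.zeros((n, m), dtype=int)   # backtracking pointers
    for j in range(1, m):
        for i in range(n):
            lo, hi = max(0, i - 1), min(n, i + 2)
            k = lo + int(np.argmin(acc[lo:hi, j - 1]))
            acc[i, j] = cost[i, j] + acc[k, j - 1]
            back[i, j] = k
    # backtrack the optimal row index per column
    rows = np.empty(m, dtype=int)
    rows[-1] = int(np.argmin(acc[:, -1]))
    for j in range(m - 1, 0, -1):
        rows[j - 1] = back[rows[j], j]
    return rows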
Another, perhaps even more important, application class deals with closed
boundaries. Based on a point p in the interior of the boundary, a polar
transformation with p being the central point brings the original image into a
matrix, in which a closed boundary becomes one from left to right afterwards.
Finally, the detected boundary has to be transformed back to the original image
space. This technique works well for star-shaped boundaries,¹ particularly
including (nearly) convex boundaries. Note that special care must be taken in
order to guarantee the closedness of the detected boundary [47].
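A minimal sketch of this polar unwrapping by nearest-neighbor resampling along rays is given below (the angular and radial resolutions are arbitrary illustrative parameters):

import numpy as np

def polar_unwrap(image, center, n_theta=360, n_r=100):
    # Samples the image along n_theta rays emanating from `center`.
    # In the returned (n_r, n_theta) matrix a star-shaped closed boundary
    # becomes a path crossing every column exactly once.
    cy, cx = center
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    radii = np.arange(n_r)
    ys = cy + np.outer(radii, np.sin(thetas))
    xs = cx + np.outer(radii, np.cos(thetas))
    ys = np.clip(np.rint(ys).astype(int), 0, image.shape[0] - 1)
    xs = np.clip(np.rint(xs).astype(int), 0, image.shape[1] - 1)
    return image[ys, xs]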
Typically, DP-based boundary detection assumes strong edges along the
boundary and is thus based on gradient computation. In the simplest case the
cost function is defined by the sum of gradient magnitudes. In practice, however,
¹ A star-shaped boundary is characterized by the existence of a point p such that for
each interior point q the segment pq lies entirely inside.
Fig. 3. (a) Tumor cell ROI; (b) gradient; (c) gradient-based optimal boundary; (d)
region-based optimal contour (from [29])

gradient is not always a reliable measure to work with. One such example is the
region-of-interest (ROI) of a tumor cell from microscopic imaging shown in
Figure 3. Maximizing the sum of gradient magnitudes does not produce a
satisfactory result.
There are only very few works on DP-based boundary detection using non-
gradient information [35,53]. A challenge remains to develop boundary detection
methods based on region information. A general framework for this purpose is
proposed in [29]. A star-shaped contour C can be represented in polar form
r(θ), θ ∈ [0, 2π). Given the image boundary B(θ), θ ∈ [0, 2π), the segmentation
task can be generally formulated as one of optimizing the energy function:
E(C) = \int_0^{2\pi} \Big( \int_0^{r(\theta)} F_i(\theta, r) \, dr + \int_{r(\theta)}^{B(\theta)} F_o(\theta, r) \, dr \Big) \, d\theta        (1)

Each region is assumed to be well represented by some model, which can be
validated by model testing functions Fi (inside) and Fo (outside), respectively.
This problem, however, cannot be solved by dynamic programming since the
model parameters have to be estimated from the entirety of the inside and outside
of C. In [29] an approximation is thus made by modeling each radial ray
separately, which makes it possible to restrict the model testing functions
Fi(θ, r) and Fo(θ, r) to a particular radial ray θ instead of the whole image.
Then, a dynamic programming solution becomes possible for any representation
model and model testing function, independent of their form, complexity, and
mathematical properties, e.g., differentiability. This universality gives the rather
simple scheme of dynamic programming considerable power for real-world
applications. In particular, robust estimation methods such as median-based
approaches and the L1 norm (see Figure 3d for a related result) are highly desired
for improved robustness. Also, sophisticated testing criteria like the Fisher linear
discriminant and others from machine learning theory provide extra useful options
for measuring the separability of two distributions.
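As one concrete instance of such per-ray model testing, the following sketch scores every candidate boundary position on a single polar-resampled ray with median-based L1 costs for the inside and outside segments; the per-ray cost profiles obtained this way would then be combined over all rays by dynamic programming (the median/L1 choice is only one possible pair Fi, Fo):

import numpy as np

def l1_ray_cost(ray):
    # ray : 1D array of intensities sampled from the centre outwards.
    # For every candidate boundary position r, the inside [0, r) and the
    # outside [r, end) are modelled by their medians and scored with the
    # L1 norm; cost[r] is the total misfit of this split.
    n = len(ray)
    cost = np.full(n, np.inf)
    for r in range(1, n):                      # keep both regions non-empty
        inside, outside = ray[:r], ray[r:]
        cost[r] = (np.abs(inside - np.median(inside)).sum()
                   + np.abs(outside - np.median(outside)).sum())
    return cost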
The principle of DP-based boundary detection can be extended in various
ways. One extension is to simultaneously extract multiple boundaries [8,46],
e.g., for detecting a pair of intimal and adventitial boundaries in sonographic
images (Figure 2). The domain of detectable boundaries can be further enlarged
to contain non-star-shaped objects. One such attempt from [30] allows the user to
interactively specify and edit the general shape of the desired object by using a
so-called rack, which basically corresponds to the object skeleton. The
straightforward extension of the boundary class considered here to 3D is the
terrain-like surface z = f(x, y) (height field or discrete Monge surface).
Unfortunately, there is no way of extending the dynamic programming solution to
the 3D minimum-cost surface detection problem in an efficient manner. An optimal
3D graph search algorithm is presented in [32] with low polynomial time for this
purpose. Similar to the handling of closed boundaries, cylindrical (tube-like)
surfaces can be handled by first unfolding them into a terrain-like surface using a
cylindrical coordinate transform. In addition to detecting minimum-cost surfaces,
this algorithm can also be applied to sequences of 2D images for temporally
consistent boundary detection.
In practice, fast and easy-to-use algorithms like DP-based boundary detection
are highly desired. To cite the biologist colleague who provided us with the
microscopic images used in [29] (see Figure 3): "I have literally tens of thousands
of images per experiment" that must be processed within a reasonable time.
Therefore, further developments like boundary detection based on region
information will have high practical impact.

3 Region-Based Image Segmentation


Region-based image segmentation is one of the fundamental problems in
biomedical imaging for quantitative reasoning and diagnosis. Recently,
mathematical tools such as level sets and variational methods have led to
significant improvements in image segmentation. However, a majority of works on
image segmentation implicitly assume the given image to be biased by additive
Gaussian noise, for instance the popular Mumford-Shah model [38]. In general, a
mature treatment of segmenting images with non-Gaussian noise models is still
lacking.

3.1 Discriminant Analysis Based Level Set Segmentation


The popular Chan-Vese (CV) approach [7], which is a special case of the
Mumford-Shah formulation, uses a closed contour Γ ⊂ Ω to separate a given im-
age domain Ω into two regions Ω1 , Ω2 . In particular, Γ is implicitly represented
by the level sets of a Lipschitz function Φ : Ω → ℝ, i.e., Φ(x) < 0 for x ∈ Ω1,
Φ(x) = 0 for x ∈ Γ , and Φ(x) > 0 for x ∈ Ω2 . Disregarding regularization of the
segmentation area, the CV energy functional is given as:

E^{CV}(c_1, c_2, \Phi) = \beta \int_\Omega \delta_0(\Phi(x))\,|\nabla \Phi(x)|\,dx + \lambda_1 \int_\Omega (c_1 - f(x))^2 H(\Phi(x))\,dx + \lambda_2 \int_\Omega (c_2 - f(x))^2 (1 - H(\Phi(x)))\,dx    (2)

Here f is the perturbed image to be segmented and c1 and c2 are constant
approximations of f in Ω1 and Ω2, respectively. The Heaviside function H is
used as an indicator function for Ω1, while δ0 denotes the one-dimensional Dirac δ-measure.
In case of additive Gaussian noise (cf. Figure 1b) it is shown in [48] that
for fixed c1, c2, the energy in Eq. (2) becomes minimal if Φ partitions the data
according to the natural threshold tCV = (c1 + c2)/2, as in clustering where c1
and c2 are cluster centers. However, the situation in the presence of multiplicative
noise is different and the optimal threshold cannot be tCV in this case (see
[42] for details). In [48] a discriminant analysis (corresponding to the popular
Otsu thresholding method) is thus applied to determine an optimal threshold
tO . Then, a new variational segmentation model is formulated as:
 
E(\Phi) = \frac{1}{2} \int_\Omega \mathrm{sgn}(\Phi(x)) \, (f(x) - t_O) \, dx + \beta \int_\Omega \delta_0(\Phi(x)) \, |\nabla \Phi(x)| \, dx    (3)

This approach has been demonstrated to be superior to the Chan-Vese formulation
on real patient data from echocardiography, which are known to be
perturbed by multiplicative speckle noise.
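As a small illustration of the data term in Eq. (3), the following Python/NumPy sketch computes an Otsu-type threshold by exhaustive search over histogram bins and evaluates the (unregularised) sign-based data energy for a given level-set function. It is a simplified stand-in, not the implementation of [48]; the regularisation term of Eq. (3) and the level-set evolution itself are omitted.

import numpy as np

def otsu_threshold(f, n_bins=256):
    # exhaustive-search Otsu threshold: maximise the between-class variance
    hist, edges = np.histogram(f.ravel(), bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w = hist / hist.sum()
    best_t, best_sep = centers[0], -1.0
    for i in range(1, n_bins):
        w0, w1 = w[:i].sum(), w[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (w[:i] * centers[:i]).sum() / w0
        m1 = (w[i:] * centers[i:]).sum() / w1
        sep = w0 * w1 * (m0 - m1) ** 2
        if sep > best_sep:
            best_sep, best_t = sep, centers[i]
    return best_t

def sign_data_term(phi, f, t_otsu):
    # unregularised data term of Eq. (3): 0.5 * sum of sgn(phi) * (f - t_O)
    return 0.5 * np.sum(np.sign(phi) * (f - t_otsu))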

3.2 Variational Segmentation Framework Incorporating Physical Noise Models
Despite its high popularity the Mumford-Shah formulation has not yet been
investigated in a more general context of explicit physical noise modeling. Indeed,
only a few publications have considered the effect of a specific noise model on the results
of image segmentation [9,36]. Many segmentation problems, e.g., in positron
emission tomography or medical ultrasound imaging, require a suitable noise model.
Especially for data with poor statistics, i.e., with a low signal-to-noise ratio, it is
important to consider the impact of the present noise model on the segmentation
process.
In [42] a general segmentation framework for different physical noise models
is presented, which also allows the incorporation of a-priori knowledge by using
different regularization terms. For the special case of two-phase segmentation
problems, the image domain Ω is partitioned into a background and a target
subregion Ω1 and Ω2 , respectively. An indicator function χ is introduced such
that χ(x) = 1 if x ∈ Ω1 and 0 otherwise (comparable to the Heaviside function
H(Φ) in Eq. (2)). The data fidelity functions are defined by the negative log-
likelihood functions derived from Bayesian modeling:

Di (f, ui ) = − log pi (f | ui ) for i ∈ {1, 2} (4)

where ui is a smooth function for each subregion, which is chosen according to
the assumed noise model for the given data f. Then, the energy functional for
the two-phase segmentation problem is formulated as:

E(u_1, u_2, \chi) = \int_\Omega \left[ \chi(x)\, D_1(f, u_1) + (1 - \chi(x))\, D_2(f, u_2) \right] dx + \alpha_1 R_1(u_1) + \alpha_2 R_2(u_2) + \beta H^{n-1}(\Gamma)    (5)
Here H^{n−1}(Γ) is the (n − 1)-dimensional Hausdorff measure. The regularization
terms R1 and R2 are used to incorporate a-priori knowledge about the expected
unbiased signals, e.g., the H¹ seminorm, Fisher information, or TV regularization.
The choice of the probability densities pi (f | ui ) for i = 1, 2 crucially depends
on the image formation process and hence on the noise model assumed for the
data f and the subregion Ωi . This is the place where physical noise modeling
comes into play. In [42] the cases of Poisson and multiplicative speckle noise (cf.
Figure 1c and 1d, respectively) have been intensively discussed.
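The following Python/NumPy sketch illustrates how the negative log-likelihood fidelities of Eq. (4) enter the data part of Eq. (5) for a two-phase segmentation. The Gaussian and Poisson terms below drop additive constants; the speckle model treated in [42] is not included, and the regularisers R1, R2 and the perimeter term are omitted, so this is only a skeleton under those assumptions.

import numpy as np

def gaussian_nll(f, u, sigma=1.0):
    # pointwise -log p(f|u) for additive Gaussian noise (constants dropped)
    return 0.5 * (f - u) ** 2 / sigma ** 2

def poisson_nll(f, u, eps=1e-12):
    # pointwise -log p(f|u) for Poisson noise (constants dropped)
    u = np.maximum(u, eps)
    return u - f * np.log(u)

def two_phase_data_energy(chi, f, u1, u2, nll=poisson_nll):
    # data part of Eq. (5): chi(x)*D1(f,u1) + (1 - chi(x))*D2(f,u2), summed
    # over all pixels; regularisation and perimeter terms are omitted here
    return np.sum(chi * nll(f, u1) + (1 - chi) * nll(f, u2))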
In [50] the influence of three different noise models is investigated using this
variational segmentation framework. In particular, shape priors are integrated
as a regularization term in the framework. It is demonstrated that correct physical
noise modeling is of high importance for the computation of accurate segmentation
results in both low-level and high-level segmentation.

The two approaches discussed above are representative of a variety of seg-
mentation algorithms which fully utilize the knowledge about the specific char-
acteristics of the image data at hand. Better modeling is a prerequisite for
improved segmentation accuracy and robustness. This is especially important in
biomedical imaging due to the variety of imaging modalities.

4 Image Registration
Image registration [21,34] aims at geometrically aligning two images of the same
scene, which may be taken at different times, from different viewpoints, and by
different sensors. It is among the most important tasks of biomedical imaging in
practice. Given a template image T : Ω → ℝ and a reference image R : Ω → ℝ,
where Ω ⊂ ℝ^d is the image domain and d the dimension, the registration
yields a transformation y : ℝ^d → ℝ^d representing point-to-point correspondences
between T and R. To find y, the following functional has to be minimized:

\min_{y} \; D(M(T, y), R) + \alpha S(y)    (6)

Here, D denotes the distance functional, M the transformation model, and
S is the regularization functional. D measures the dissimilarity between the
transformed template image and the fixed reference image. If both images are
of the same modality, the sum-of-squared differences (SSD) can be used as a
distance functional D. In case of multimodal image registration information-
theoretic measures, in particular, mutual information, are popular [39].
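To illustrate the structure of Eq. (6), the following Python sketch evaluates the registration functional for a given displacement field using an SSD distance and a simple diffusion-type regulariser. The choice of S, the interpolation scheme, and the parameter alpha are illustrative assumptions and not the setup used in [24].

import numpy as np
from scipy.ndimage import map_coordinates

def registration_energy(T, R, disp, alpha=0.1):
    # disp: (2, H, W) displacement field, y(x) = x + disp(x);
    # M(T, y) is plain interpolation of T at y(x) (no mass preservation here)
    H, W = R.shape
    rows, cols = np.mgrid[0:H, 0:W].astype(float)
    warped = map_coordinates(T, [rows + disp[0], cols + disp[1]],
                             order=1, mode='nearest')
    ssd = 0.5 * np.sum((warped - R) ** 2)        # distance functional D
    reg = 0.0                                    # diffusion-type regulariser S
    for d in disp:
        dr, dc = np.gradient(d)
        reg += np.sum(dr ** 2 + dc ** 2)
    return ssd + alpha * reg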
The SSD and related dissimilarity measures implicitly assume the intensity
constancy between the template and reference image. Thus, we solely search
for the optimal geometric transformation. In medical imaging, however, this as-
sumption is not always satisfied. Such a problem instance appears in the context
of motion correction in positron emission tomography (PET) [14].
PET requires relatively long image acquisition times in the range of minutes.
In thoracic PET both respiratory and cardiac motion lead to spatially blurred
images. To reduce motion artifacts in PET, so-called gating based techniques
were found useful, which decompose the whole dataset into parts that represent
different breathing and/or cardiac phases [16]. After gating, each single gate
shows less motion but suffers from a relatively low signal-to-noise ratio
(SNR), as only a small portion of all available events is contained.

Fig. 4. Coronal slices of the left ventricle in a human heart during systole (a) and
diastole (b) and corresponding line profiles (c) are shown for one patient. It can be
observed that the maximum peaks in these line profiles vary considerably. (from [24])

After gating
the data, each gate is reconstructed individually and registered to one assigned
reference gate. The registered images are averaged afterwards to overcome the
problem of low SNR. Tissue compression and the partial volume effect (PVE)
lead to intensity modulations. Especially for relatively small structures like the
myocardium the true uptake values are affected by the PVE. An example is
given in Figure 4 where a systolic and diastolic slice (same respiratory phase)
of a gated 3D dataset and line profiles are shown. Among others, the maximum
intensity values of the two heart phases indicate that corresponding points can
differ in intensity significantly.
In this situation an image registration mechanism is required which consists
of simultaneous geometric transformation (spatially moving the pixels) and in-
tensity modulation (redistributing the intensity values). In gating, all gates are
formed over the same time interval. Hence, the total amount of radioactivity in
each phase is approximately equal. In other words, in any respiratory and/or car-
diac gate no radioactivity can be lost or added apart from some minor changes
at the edges of the field of view. This property provides the foundation for a
mass-preserving image registration. VAMPIRE (Variational Algorithm for Mass-
Preserving Image REgistration) [24] incorporates a mass-preserving component
by accounting for the volumetric change induced by the transformation y. Based
on the integration by substitution theorem for multiple variables we have:
 
\int_{y(\Omega)} T(x)\,dx = \int_{\Omega} T(y(x))\,|\det(\nabla y(x))|\,dx    (7)

It guarantees the same total amount of radioactivity before and after applying
the transformation y to T . Therefore, for an image T and a transformation y,
the mass-preserving transformation model is defined as:

MMP (T , y) := T (y) · det(∇y) (8)



which is used in the registration functional (6) to enable simultaneous geometric
transformation and intensity modulation.
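A minimal Python sketch of the mass-preserving transformation model of Eq. (8) is given below: the template is interpolated at the transformed positions and weighted by the Jacobian determinant of y, approximated here by finite differences. The displacement-field representation and the interpolation order are assumptions made for this illustration, not details of the VAMPIRE algorithm.

import numpy as np
from scipy.ndimage import map_coordinates

def mass_preserving_transform(T, disp):
    # Eq. (8): M_MP(T, y) = T(y) * det(grad y), with y(x) = x + disp(x);
    # disp is a (2, H, W) displacement field (row and column components)
    H, W = T.shape
    rows, cols = np.mgrid[0:H, 0:W].astype(float)
    y_row, y_col = rows + disp[0], cols + disp[1]
    warped = map_coordinates(T, [y_row, y_col], order=1, mode='nearest')
    # Jacobian determinant of y approximated by finite differences
    dyr_dr, dyr_dc = np.gradient(y_row)
    dyc_dr, dyc_dc = np.gradient(y_col)
    det_jac = dyr_dr * dyc_dc - dyr_dc * dyc_dr
    return warped * det_jac   # for orientation-preserving y, det_jac > 0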
In [24] this mass-preserving registration algorithm has been successfully ap-
plied to correct motion for dual – cardiac as well as respiratory – gated PET
imaging. Motion estimation is also a fundamental requirement for super-resolu-
tion computation. More robust motion estimation based on mass-preserving reg-
istration thus facilitates improved super-resolution quality [52].
Similar to noise modeling discussed for region-based image segmentation, it
is the explicit consideration of the mass-preserving property which enables im-
proved image registration. This is another example of the benefit of accurate
modeling in biomedical imaging.

5 Optical Flow Computation


Motion analysis is an important tool in biomedical imaging and optical flow
estimation plays a central role in this context [2,22]. The basis of most optical
flow algorithms is the brightness constancy2 :

I(x, y, t) = I(x + u, y + v, t + 1) (9)

which assumes that when a pixel moves from one image to another, its inten-
sity (or color) does not change. In fact, this assumption combines a number of
assumptions about the reflectance properties of the scene, the illumination in
the scene, and the image formation process in the camera [2]. Linearizing this
constancy equation by applying a first-order Taylor expansion to the right-hand
side leads to the fundamental optical flow constraint (OFC):

u · Ix + v · Iy = −It (10)

or more compactly:
f · ∇I = −It (11)
with f = (u, v) and ∇I = (Ix , Iy ), which is used to derive optimization algo-
rithms in a continuous setting.
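The following Python/NumPy sketch simply evaluates the residual of the linearised constraint (10)/(11) for a candidate flow field; the frame-difference approximation of It and the derivative scheme are illustrative choices, and no particular optical flow solver from the literature is implemented here.

import numpy as np

def ofc_residual(I0, I1, u, v):
    # residual of the linearised constraint: u*Ix + v*Iy + It; small values
    # indicate that the brightness constancy assumption holds for (u, v)
    Iy, Ix = np.gradient(I0)   # derivatives along rows (y) and columns (x)
    It = I1 - I0               # simple frame-difference approximation of I_t
    return u * Ix + v * Iy + It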
In practice, however, this popular brightness constancy is not always valid.
Other constancy terms have also been suggested, including the gradient, gradient
magnitude, higher-order derivatives (e.g., on the spatial Hessian), photometric
invariants, texture features, and combinations of multiple features (see [5] for a
discussion).
variants from the medical imaging perspective.

5.1 Mass-Preserving Optical Flow


The problem of intensity modulations discussed for image registration in the
previous section can also be tackled with optical flow techniques. According to
2
For notation simplicity we consistently give the 2D version only. Its extension to n-D
cases is straightforward.

the observation that the OFC is very similar to the continuity equation of fluid
dynamics, Schunck [44] presented the extended optical flow constraint (EOFC):

f · ∇I + I · div(f ) = −It (12)

where div(f) = u_x + v_y is the divergence of f. Optical flow computation
based on EOFC has been studied and compared with ordinary OFC-based
methods [4].
Interestingly, the EOFC has a physical interpretation of mass preservation.
As shown by several researchers [3,12,40], this constraint is equivalent to a to-
tal brightness invariance hypothesis. The total brightness is defined as the sum
of intensity values of a moving object. Instead of assuming that a point has a
constant brightness over time, it is assumed that a moving object has a total
brightness constant over time. Combined with a non-quadratic penalization a
mass-preserving optical flow method has been applied for cardiac motion cor-
rection in 3D PET imaging [13]. In contrast to OFC-based optical flow [15,17],
mass-preserving methods reflect better the physical reality of PET imaging.
Note that the idea behind the mass-preserving optical flow and the mass-
preserving registration discussed in Section 4 is the same. Indeed, Eq. (12) can
also be derived from Eq. (6), see [23]. Both registration and optical flow methods
give us a powerful tool for solving mass-preserving motion estimation problems.

5.2 Histogram-Based Optical Flow

Multiplicative speckle noise (cf. Figure 1d) is characteristic for diagnostic ultra-
sound imaging. The origin of speckle are tiny inhomogeneities within the tissue,
which reflect ultrasound waves but cannot be resolved by the ultrasound system.
Speckle noise f = μ + ν·μ^{γ/2} is of multiplicative nature, i.e., the noise variance
directly depends on the underlying signal intensity.
The speckle noise has substantial impact on motion estimation. In fact, it
turns out that the brightness constancy does not hold any more (see [49] for a
mathematical proof). This can be demonstrated by the following simple experi-
ment [49]. Starting from two pixel patches of size 5 × 5 with constant intensity
values μ = 150 and η ∈ [0, 255], a realistic amount of speckle noise was added
according to f = μ + ν·μ^{γ/2} with γ = 1.5. The resulting pixel patches, denoted
by X_150 and Y_η, were compared pixelwise with the squared L2-distance. Com-
parison of the two pixel patches was performed 10,000 times for every value of
η ∈ [0, 255]. The simulation results (average distance of the two pixel patches
and standard deviation) are plotted in Figure 5 (left). Normally, one would ex-
pect the minimum of the graph to be exactly at the value η = μ = 150, i.e., both
pixel patches have the same constant intensity before adding noise. However, the
minimum of the graph is below the expected value. This discrepancy has been
theoretically analyzed in [49], which predicts the minimum at η ≈ 141 for the
particular example as can be observed in Figure 5 (left).
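The simulation can be reproduced in a few lines of Python/NumPy. The sketch below assumes a zero-mean Gaussian ν with a free standard deviation sigma (the exact noise parameters of [49] are not reproduced here) and uses fewer trials per value of η than the 10,000 used in the experiment above.

import numpy as np

def speckle(img, gamma=1.5, sigma=0.6, rng=None):
    # multiplicative speckle model f = mu + nu * mu**(gamma/2); nu is assumed
    # zero-mean Gaussian here, and sigma is a free parameter of this sketch
    rng = np.random.default_rng() if rng is None else rng
    nu = rng.normal(0.0, sigma, size=img.shape)
    return img + nu * img ** (gamma / 2)

def mean_patch_distance(mu=150.0, etas=range(256), trials=1000, size=(5, 5)):
    # average squared L2 distance between speckle-corrupted constant patches
    # X_mu and Y_eta (the paper uses 10,000 trials per eta)
    rng = np.random.default_rng(0)
    out = np.empty(len(etas))
    for i, eta in enumerate(etas):
        d = [np.sum((speckle(np.full(size, mu), rng=rng)
                     - speckle(np.full(size, float(eta)), rng=rng)) ** 2)
             for _ in range(trials)]
        out[i] = np.mean(d)
    return out   # its argmin is expected below eta = mu (cf. Figure 5, left)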
Fig. 5. Left: Average distance between two pixel patches biased by speckle noise. The
global minimum is below the correct value of η = 150. Right: Average distance between
the histograms of two pixel patches biased by speckle noise. The global minimum
matches the correct value of η = 150. In both cases the two dashed lines represent
the standard deviation of the 10,000 experiments. (from [49])

In [49] it is argued that the overall distribution within a local image region
remains approximately constant, since the tissue characteristics remain unchanged. It is thus
suggested to consider a small neighborhood around a pixel and to compare the
local statistics of the images, i.e., local histograms as a discrete representation of the intensity
distribution. This leads to the histogram constancy constraint:

H(x, y, t) = H(x + u, y + v, t + 1) (13)

where H represents the cumulative histogram of the region surrounding the pixel
(x, y) at time t. The validity of this new constraint has been mathematically
proven in [49] and can also be seen in Figure 5 (right). On ultrasound data
the derived histogram-based optical flow algorithm outperforms state-of-the-art
general-purpose optical flow methods.

5.3 Periodic Optical Flow

In medical imaging some motion is inherently periodic. For example, this occurs
in cardiac gated imaging, where images are obtained at different phases of the
periodic cardiac cycle. Another example is in respiratory gated imaging, where
the respiratory motion of the chest can also be described by a periodic model. Li
and Yang [33] proposed optical flow estimation for a sequence of images wherein
the inherent motion is periodic over time. Although in principle one could adopt
a frame-by-frame approach to determine the motion fields, a joint estimation,
in which all motion fields of a sequence are estimated simultaneously, explicitly
exploits the inherent periodicity in image motion over time and can thus be
advantageous compared to the framewise approach.
By applying Fourier series expansion, the components (u, v) at location (x, y)
over time are modeled by:


u(x, y, t) = \sum_{l=1}^{L} \left[ a_l(x, y) \cos\!\left(\frac{2\pi l}{T}\, t\right) + b_l(x, y) \sin\!\left(\frac{2\pi l}{T}\, t\right) \right]    (14)

v(x, y, t) = \sum_{l=1}^{L} \left[ c_l(x, y) \cos\!\left(\frac{2\pi l}{T}\, t\right) + d_l(x, y) \sin\!\left(\frac{2\pi l}{T}\, t\right) \right]    (15)

where al (x, y), bl (x, y), cl (x, y), dl (x, y) are the coefficients associated with har-
monic component l and L is the order of the harmonic representation. This
motion model is embedded into the motion estimation for each pair of two suc-
cessive images and the overall data term of the energy function to be minimized
is the sum of all pairwise data terms from the brightness constancy.
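The harmonic motion model of Eqs. (14)-(15) is straightforward to evaluate. The following Python/NumPy sketch assumes the Fourier coefficients are given as per-pixel arrays; the joint estimation of these coefficients, which is the actual contribution of [33], is not shown.

import numpy as np

def periodic_flow(a, b, c, d, t, T):
    # a, b, c, d: (L, H, W) arrays of per-pixel Fourier coefficients;
    # returns the flow components (u, v) at time t for period T
    L = a.shape[0]
    l = np.arange(1, L + 1).reshape(-1, 1, 1)
    phase = 2.0 * np.pi * l * t / T
    u = np.sum(a * np.cos(phase) + b * np.sin(phase), axis=0)
    v = np.sum(c * np.cos(phase) + d * np.sin(phase), axis=0)
    return u, v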

While a number of constancy terms have been suggested in computer vision,
the popular brightness constancy is dominant. The mass-preserving and
histogram-based optical flow computations discussed above demonstrate
the need for finding suitable constancy terms in particular biomedical imag-
ing scenarios. Periodic optical flow is a new concept and has not been fully explored yet.
In both cases biomedical imaging provides large room for methodological devel-
opment from a computer vision perspective.

6 Novel Imaging Techniques

Biomedical imaging has a broad range of subjects to be imaged and various
imaging modalities. Despite the enormous progress there is still substan-
tial room for further development of imaging technology. In the following we
describe two exemplary scenarios.

6.1 PET Imaging of Freely Moving Mice

In an on-going project we aim to track freely moving small animals with high
precision inside a positron emission tomograph. Normally, the animals have to be
anesthetized during 15-60 minutes of data acquisition to avoid motion artifacts.
However, anesthesia influences the metabolism which is measured by PET. To
avoid this, the aim of our project is to track awake and freely moving animals
during the scan and use the information to correct the acquired PET data for
motion. For this task a small animal chamber of 20×10×9 cm was built (Figure 6)
with a pair of stereo cameras positioned on both small sides of the chamber.
Due to the experimental setup highly distorted wide angle lenses have to be
used. To reach the required tracking accuracy a high-precision lens distortion
correction is crucial. First tests using a simple polynomial model for lens dis-
tortion correction led to deviations from a pinhole camera model of up to 5
pixels. Therefore, more sophisticated methods are needed. Two high-precision
lens distortion correction methods are described in [26,27]. In the latter case
several images of a wire calibration harp are acquired and a massive amount of edge
points is used to determine the parameters of an 11th-degree polynomial distor-
tion function. Both methods require a very accurately manufactured calibration
pattern. In [43] another solution is suggested using a planar checkerboard pat-
tern to provide very accurately detectable feature points even under distortion
as a calibration pattern. Smoothed thin plate splines are applied to model the
mapping between control points, leading to a mean accuracy below 0.084 pixel.
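As an illustration only (this is not the implementation of [43]), a smoothed thin-plate-spline mapping between detected and ideal calibration points can be fitted with SciPy's RBF interpolator; the smoothing value below is an arbitrary example and would have to be tuned.

import numpy as np
from scipy.interpolate import RBFInterpolator

def fit_distortion_map(detected_pts, ideal_pts, smoothing=1e-3):
    # detected_pts, ideal_pts: (N, 2) arrays of corresponding corner positions
    # in the distorted image and in the ideal (pinhole) model, respectively;
    # the smoothing value is an illustrative choice
    return RBFInterpolator(detected_pts, ideal_pts,
                           kernel='thin_plate_spline', smoothing=smoothing)

# usage: undistorted_xy = fit_distortion_map(corners_px, corners_ideal)(query_xy)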

Fig. 6. Camera setup for PET imaging of freely moving mice. Left: Construction model
of the animal chamber. Right: Manufactured chamber halfway inserted into a quad-
HIDAC PET-scanner (16 cm in diameter). (from [43])

In addition to lens distortion correction we also need to solve other problems
like feature detection and stereo vision in order to provide the technical founda-
tion for a motion-corrected reconstruction. It is this bundle of computer vision
solutions that will help to further improve the functionality of PET imaging
towards imaging of freely moving mice.

6.2 FTIR-Based Imaging Method


The investigation of complex movement patterns of various organisms has be-
come an integral subject of biological research. One of the most popular model
organisms to study how the nervous system controls locomotion is Drosophila
melanogaster (the fruit fly). Drosophila is a holometabolous insect. In the lar-
val stage locomotion is confined to 2D, whereas the adult fly moves in two and
three dimensions.
Work on freely flying fruit flies is still in its infancy because tracking them leads to the
so-called general multi-index assignment problem, which is nondeterministic
polynomial-time hard (NP-hard) [6]. The current solutions are only able to track
a small number of subjects for a short period [1,54]. From the computer vision
perspective, 3D tracking of flying flies is an unsolved challenge. Much effort is
still required to realize highly accurate tracking systems in order to fully meet
the needs of biological behavior studies.
In contrast, larval crawling occurs in two dimensions at relatively low speed.
In principle, larval movement can be documented by a simple camera setup.
However, recording of crawling larvae requires high contrast images, which are
typically obtained by following sophisticated illumination protocols or dye ap-
plications [31]. For conventional, relatively low resolution tracking of larval lo-
comotion, larvae are illuminated by incident or transmitted light and monitored
by cameras with appropriate filters. This is technically challenging due to the
semi-translucent body of these small animals. In addition, the observation of
larvae is complicated by light reflections caused by the tracking surface. Thus,
illumination problems hamper faithful recording of larval crawling paths, and
the poor signal-to-noise ratio complicates subsequent computer-based analysis.

Fig. 7. The FIM setup. (A) Image of 10 larvae (arrow) imaged in a conventional
setup. The asterisks denote scratches and reflections in the tracking surface. (B) Image
of 10 larvae (arrow) imaged in the FIM setup with high contrast. (C) The principle
of frustrated total internal reflection. na, n1, n2, and n3 indicate different refractive
indices of air, acrylic glass, agar and larvae respectively, an acrylic glass plate is flooded
with infrared light (indicated by red lines). The camera is mounted below the tracking
table. (D) Schematic drawing of the setup. (E) Image of the tracking table. (from [41])

A novel imaging technique based on frustrated total internal reflection (FTIR)
is reported in [41], see Figure 7. Instead of directly illuminating crawling larvae,
the frustrated total internal reflection is used to determine the contact surface
between the animal and the substrate. In this FIM setup, an acrylic glass plate
is flooded with infrared light. Due to the differences in the refractive indices of
acrylic glass and air, it is completely reflected at the glass/air boundary (Fig-
ure 7C). To provide a moist crawling environment a thin agar layer is added.
According to Snell’s law, the light enters the agar layer since its refractive index
(n2 ) is higher than the refractive index of the acrylic glass (n1 ). The larvae have
an even higher optical density resulting in a higher refractive index (n3 ), and
thus, reflection is frustrated at the agar/larva interface and light enters the larval
body. Here, light is reflected and since the reflection angle is smaller than the
critical angle, the light passes through the different layers and can be detected
by a camera equipped with an infrared filter (Figure 7D,E). This setup is easy to
assemble and does not require cost-intensive equipment. This new imaging ap-
proach, named FIM (FTIR-based Imaging Method), provides an unprecedented
high contrast view on crawling animals. Even without any background subtrac-
tion it generates constant image quality superior to previous setups. In addition,
it even allows imaging of internal organs. FIM is suitable for a wide range of
biological applications and a wide range of organisms. Together with optimized
tracking software it facilitates analysis of larval locomotion and will simplify
genetic screening procedures.

7 Conclusion
Biomedical computer vision goes far beyond simply adapting and applying ad-
vanced computer vision techniques to solve real problems. Biomedical imaging
also poses new and challenging computer vision problems in order to cope with
the complex and multifarious reality. In this paper we have exemplarily discussed
a number of challenges and the related concepts and algorithms, mainly in the
fields of our own research. They are well motivated by the practice. Biomedical
imaging is full of such challenges and powerful computer vision solutions will
immediately have benefit for the practice.
We need to understand how the domain experts work best with a technical
system, which helps to design intelligent and user-friendly interactive tools. In
addition, we are forced to have deeper understanding of the sources of signals
and images to be processed, i.e., the objects of interest and biomedical devices.
Only in this way can essential knowledge be included for improved modeling and
solutions.
Many fundamental assumptions made when developing algorithms for biome-
dical imaging are shared by different, even non-biomedical, imaging modalities.
For instance, the speckle noise model applies to both ultrasound and synthetic
aperture radar imaging. Thus, the developed algorithms are of general interest
and can be used in manifold application contexts.
Modern biology and medicine are a success story of imaging. In the past
biomedical computer vision has already established a vast body of powerful
methods and tools. Continued well-founded research will further enlarge the
spectrum of successfully solved practical problems and thus continue to make a
noticeable contribution to biology and medicine.

Acknowledgments. The authors were supported by the Deutsche Forschungs-
gemeinschaft (DFG): SFB 656 MoBil (project B3, C3), EXC 1003 Cells in Mo-
tion – Cluster of Excellence, and DA 1064/3. Thanks go to Kristen Mills at the Max
Planck Institute for Intelligent Systems, Stuttgart, for providing the microscopic
images.

References
1. Ardekani, R., Biyani, A., Dalton, J., Saltz, J., Arbeitman, M., Tower, J., Nuzhdin,
S., Tavare, S.: Three-dimensional tracking and behaviour monitoring of multiple
fruit flies. J. R. Soc. Interface 10(78), 20120547 (2013)
2. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database
and evaluation methodology for optical flow. International Journal of Computer
Vision 92(1), 1–31 (2011)

3. Béréziat, D., Herlin, I., Younes, L.: A generalized optical flow constraint and its
physical interpretation. In: Proc. of CVPR, pp. 487–492 (2000)
4. Bimbo, A.D., Nesi, P., Sanz, J.L.C.: Optical flow computation using extended
constraints. IEEE Trans. on Image Processing 5(5), 720–739 (1996)
5. Bruhn, A.: Variational Optic Flow Computation – Accurate Modelling and Efficient
Numerics. Ph.D. thesis, University of Saarland (2006)
6. Burkard, R., Dell’Amico, M., Martello, S.: Assignment Problems. Society for In-
dustrial Mathematics (2009)
7. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. on Image
Processing 10(2), 266–277 (2001)
8. Cheng, D.C., Jiang, X.: Detections of arterial wall in sonographic artery images
using dual dynamic programming. IEEE Trans. on Information Technology in
Biomedicine 12(6), 792–799 (2008)
9. Chesnaud, C., Réfrégier, P., Boulet, V.: Statistical region snake-based segmentation
adapted to different physical noise models. IEEE Trans. on Pattern Analysis and
Machine Intelligence 21(11), 1145–1157 (1999)
10. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans.
on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
11. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models-their
training and application. Computer Vision and Image Understanding 61(1), 38–59
(1995)
12. Corpetti, T., Heitz, D., Arroyo, G., Memin, E., Santa-Cruz, A.: Fluid experimental
flow estimation based on an optical-flow scheme. Experiments in Fluids 40(1),
80–97 (2006)
13. Dawood, M., Gigengack, F., Jiang, X., Schäfers, K.: A mass conservation-based op-
tical flow method for cardiac motion correction in 3D-PET. Medical Physics 40(1),
012505 (2013)
14. Dawood, M., Jiang, X., Schäfers, K. (eds.): Correction Techniques in Emission
Tomographic Imaging. CRC Press (2012)
15. Dawood, M., Büther, F., Jiang, X., Schäfers, K.P.: Respiratory motion correction
in 3-D PET data with advanced optical flow algorithms. IEEE Trans. on Medical
Imaging 27(8), 1164–1175 (2008)
16. Dawood, M., Büther, F., Stegger, L., Jiang, X., Schober, O., Schäfers, M., Schäfers,
K.P.: Optimal number of respiratory gates in positron emission tomography: A
cardiac patient study. Medical Physics 36(5), 1775–1784 (2009)
17. Dawood, M., Kösters, T., Fieseler, M., Büther, F., Jiang, X., Wübbeling, F.,
Schäfers, K.P.: Motion correction in respiratory gated cardiac PET/CT using
multi-scale optical flow. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (eds.)
MICCAI 2008, Part II. LNCS, vol. 5242, pp. 155–162. Springer, Heidelberg (2008)
18. Falcão, A.X., Udupa, J.K.: A 3D generalization of user-steered live-wire segmen-
tation. Medical Image Analysis 4(4), 389–402 (2000)
19. Falcão, A.X., Udupa, J.K., Miyazawa, F.K.: An ultra-fast user-steered image seg-
mentation paradigm: Live-wire-on-the-fly. IEEE Trans. on Medical Imaging 19(1),
55–62 (2000)
20. Falcão, A.X., Udupa, J.K., Samarasekera, S., Sharma, S., Hirsch, B.E., de Alencar
Lotufo, R.: User-steered image segmentation paradigms: Live wire and live lane.
Graphical Models and Image Processing 60(4), 233–260 (1998)
21. Fischer, B., Modersitzki, J.: Ill-posed medicine - an introduction to image registra-
tion. Inverse Problems 24(3), 034008 (2008)

22. Fleet, D., Weiss, Y.: Optical flow estimation. In: Paragios, N., Chen, Y., Faugeras,
O. (eds.) The Handbook of Mathematical Models in Computer Vision, pp. 241–260.
Springer (2005)
23. Gigengack, F.: Mass-Preserving Motion Correction and Multimodal Image Seg-
mentation in Positron Emission Tomography. Ph.D. thesis, University of Münster
(2012)
24. Gigengack, F., Ruthotto, L., Burger, M., Wolters, C.H., Jiang, X., Schäfers, K.P.:
Motion correction in dual gated cardiac PET using mass-preserving image regis-
tration. IEEE Trans. on Medical Imaging 31(3), 698–712 (2012)
25. Gigengack, F., Ruthotto, L., Jiang, X., Modersitzki, J., Burger, M., Hermann,
S., Schäfers, K.P.: Atlas-based whole-body PET-CT segmentation using a passive
contour distance. In: Menze, B.H., Langs, G., Lu, L., Montillo, A., Tu, Z., Criminisi,
A. (eds.) MCV 2012. LNCS, vol. 7766, pp. 82–92. Springer, Heidelberg (2013)
26. von Gioi, R.G., Monasse, P., Morel, J.M., Tang, Z.: Towards high-precision lens
distortion correction. In: Proc. of ICIP, pp. 4237–4240 (2010)
27. von Gioi, R.G., Monasse, P., Morel, J.M., Tang, Z.: Lens distortion correction with
a calibration harp. In: Proc. of ICIP, pp. 617–620 (2011)
28. Heimann, T., Meinzer, H.P.: Statistical shape models for 3D medical image seg-
mentation: A review. Medical Image Analysis 13(4), 543–563 (2009)
29. Jiang, X., Tenbrinck, D.: Region based contour detection by dynamic programming.
In: Hancock, E., Smith, W., Wilson, R., Bors, A. (eds.) CAIP 2013, Part II. LNCS,
vol. 8048, pp. 152–159. Springer, Heidelberg (2013)
30. Jiang, X., Große, A., Rothaus, K.: Interactive segmentation of non-star-shaped
contours by dynamic programming. Pattern Recognition 44(9), 2008–2016 (2011)
31. Khurana, S., Atkinson, W.L.N.: Image enhancement for tracking the translucent
larvae of drosophila melanogaster. PLoS ONE 5(12), e15259 (2010)
32. Li, K., Wu, X., Chen, D., Sonka, M.: Optimal surface segmentation in volumetric
images - a graph-theoretic approach. IEEE Trans. on Pattern Analysis and Machine
Intelligence 28(1), 119–134 (2006)
33. Li, L., Yang, Y.: Optical flow estimation for a periodic image sequence. IEEE Trans.
on Image Processing 19(1), 1–10 (2010)
34. Maintz, J.B.A., Viergever, M.A.: A survey of medical image registration. Medical
Image Analysis 2(1), 1–36 (1998)
35. Malon, C., Cosatto, E.: Dynamic radial contour extraction by splitting homo-
geneous areas. In: Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A.,
Kropatsch, W. (eds.) CAIP 2011, Part I. LNCS, vol. 6854, pp. 269–277. Springer,
Heidelberg (2011)
36. Martin, P., Réfrégier, P., Goudail, F., Guérault, F.: Influence of the noise model
on level set active contour segmentation. IEEE Trans. on Pattern Analysis and
Machine Intelligence 26(6), 799–803 (2004)
37. Mortensen, E., Morse, B., Barrett, W.: Adaptive boundary detection using ‘live-
wire’ two-dimensional dynamic programming. In: IEEE Proc. Computers in Car-
diology, pp. 635–638 (1992)
38. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and
associated variational problems. Commun. Pure Appl. Math. 42, 577–685 (1989)
39. Pluim, J.P.W., Maintz, J.B.A., Viergever, M.A.: Mutual information based reg-
istration of medical images: A survey. IEEE Trans. on Medical Imaging 22(8),
986–1004 (2003)
40. Qiu, M.: Computing optical flow based on the mass-conserving assumption. In:
Proc. of ICPR, pp. 7041–7044 (2000)

41. Risse, B., Thomas, S., Otto, N., Löpmeier, T., Valkov, D., Jiang, X., Klämbt,
C.: FIM, a novel FTIR-based imaging method for high throughput locomotion
analysis. PLoS ONE 8(1), e53963 (2013)
42. Sawatzky, A., Tenbrinck, D., Jiang, X., Burger, M.: A variational framework for
region-based segmentation incorporating physical noise models. Journal of Math-
ematical Imaging and Vision (2013), doi:10.1007/s10851-013-0419-6
43. Schmid, S., Jiang, X., Schäfers, K.: High-precision lens distortion correction using
smoothed thin plate splines. In: Hancock, E., Smith, W., Wilson, R., Bors, A. (eds.)
CAIP 2013, Part II. LNCS, vol. 8048, pp. 432–439. Springer, Heidelberg (2013)
44. Schunck, B.: The motion constraint equation for optical flow. In: Proc. of ICPR,
pp. 20–22 (1984)
45. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision.
Cengage Learning, 3rd edn. (2007)
46. Sun, C., Appleton, B.: Multiple paths extraction in images using a constrained
expanded trellis. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(12),
1923–1933 (2005)
47. Sun, C., Pallottino, S.: Circular shortest path in images. Pattern Recognition 36(3),
709–719 (2003)
48. Tenbrinck, D., Jiang, X.: Discriminant analysis based level set segmentation for
ultrasound imaging. In: Hancock, E., Smith, W., Wilson, R., Bors, A. (eds.) CAIP
2013, Part II. LNCS, vol. 8048, pp. 144–151. Springer, Heidelberg (2013)
49. Tenbrinck, D., Schmid, S., Jiang, X., Schäfers, K., Stypmann, J.: Histogram-based
optical flow for motion estimation in ultrasound imaging. Journal of Mathematical
Imaging and Vision (2013), doi:10.1007/s10851-012-0398-z
50. Tenbrinck, D., Sawatzky, A., Jiang, X., Burger, M., Haffner, W., Willems, P., Paul,
M., Stypmann, J.: Impact of physical noise modeling on image segmentation in
echocardiography. In: Proc. of Eurographics Workshop on Visual Computing for
Biomedicine, pp. 33–40 (2012)
51. Udupa, J., Samarasekera, S., Barrett, W.: Boundary detection via dynamic pro-
gramming. In: Visualization in Biomedical Computing 1992, pp. 33–39 (1992)
52. Yan, H., Gigengack, F., Jiang, X., Schäfers, K.: Super-resolution in cardiac PET
using mass-preserving image registration. In: Proc. of ICIP (2013)
53. Yu, M., Huang, Q., Jin, R., Song, E., Liu, H., Hung, C.C.: A novel segmentation
method for convex lesions based on dynamic programming with local intra-class
variance. In: Proc. of ACM Symposium on Applied Computing, pp. 39–44 (2012)
54. Zou, D., Zhao, Q., Wu, H.S., Chen, Y.Q.: Reconstructing 3d motion trajecto-
ries of particle swarms by global correspondence selection. In: Proc. of ICCV,
pp. 1578–1585 (2009)
Rapid Localisation and Retrieval
of Human Actions with Relevance Feedback

Simon Jones and Ling Shao

The University of Sheffield


{simon.m.jones,ling.shao}@shef.ac.uk

Abstract. As increasing levels of multimedia data online require more
sophisticated methods to organise this data, we present a practical sys-
tem for performing rapid localisation and retrieval of human actions from
large video databases. We first temporally segment the database and
calculate a histogram-match score for each segment against the query.
High-scoring, adjacent segments are joined into candidate localised re-
gions using a noise-robust localisation algorithm, and each candidate
region is then ranked against the query. Experiments show that this
method surpasses the efficiency of previous attempts to perform similar
action searches with localisation. We demonstrate how results can be
enhanced using relevance feedback, considering how relevance feedback
can be effectively applied in the context of localisation.

1 Introduction

In recent years search engines – such as Google – that operate on textual in-
formation have become both mature and commonplace. Efficient and accurate
search of multimedia data, however, is still an open research question, and this
is becoming an increasingly relevant problem with the growth in use of Internet
multimedia data. In order to perform searches on multimedia databases, cur-
rent technology relies on textual metadata associated with each video, such as
keyword tags or the video’s description – unfortunately such metadata are of-
ten incomplete or inaccurate. Furthermore, even if a textual search engine can
locate the correct video, it cannot search within that video to localise specific
sub-sequences that the user is interested in.
Compared to this, content-based retrieval systems present a better alterna-
tive. Such systems directly search through the content of multimedia objects,
avoiding the problems associated with metadata searches. Content-Based Im-
age Retrieval (CBIR) is the primary focus of many researchers. Video retrieval
(CBVR) has also been studied [1], but to a far lesser degree. Retrieval of human
actions in particular has received relatively little attention in comparison to ac-
tion recognition, with some notable exceptions in [2,3]. This is perhaps because
human actions are particularly difficult to retrieve: only a single query
example is provided to search on, and this single query cannot capture the vast
intraclass variability of even the simplest of human actions. Additionally, if the
query itself is noisy it can be difficult to isolate the relevant features of the
action. One method researchers use to overcome this issue is relevance feedback,
such as presented in [2].

[Figure 1: block diagram. A query q is reduced to a query histogram Hq; each database video is divided into time slices of nf frames with time-slice histograms Ht; the similarities s(Hq, Ht) form the time series Sq,T used for temporal localisation; the temporally localised candidates (with histograms) are ranked by s(Hq, Hc) to return the top X results.]

Fig. 1. An overview of the localisation and ranking aspects of our algorithm. Relevance
feedback has been omitted for clarity.

Finding relevant videos alone is not enough for a practical video retrieval sys-
tem. It is also necessary to localise the relevant segments within longer videos,
as in the real world actions of interest are rarely neatly segmented. In the image
domain, Rahmani et al. [4] and Zhang et al. [5] have combined retrieval with
spatial localisation of objects. In videos, most localisation to date has been per-
formed in a recognition context, such as in [6]. However, more recently Yu et al.
[3] have performed human action retrieval combined with localisation.
Our goal is to introduce a time-efficient system for performing human action
retrieval, showing how localisation and retrieval can be integrated while main-
taining accuracy. We argue that, compared to previous works such as Yu et al.[3]
our method is an order of magnitude more efficient in time and space, making
it far more practical for real-world searches, while still maintaining practical
accuracy. Furthermore, we experiment with the addition of relevance feedback
in various forms, demonstrating that even imperfectly localised feedback can be
used to significantly improve results. We believe ours is the first work to consider
the effect of noisy relevance feedback samples in our experimentation, detailed
further in section 3.

2 Localisation and Retrieval

Our foremost consideration in performing video localisation and retrieval is ef-
ficiency, as videos are data-intensive and yet searches need to be fast to be
practical. In this section, we detail a localisation algorithm. This algorithm has
linear complexity with respect to the size of the database, the potential to be
further optimised, yet makes little sacrifice in accuracy. We additionally reduce
the query time through batch pre-processing of the database to a compact rep-
resentation. As it is based on local features, our algorithm is scale-invariant,
robust against noise and partially viewpoint invariant.

2.1 Pre-processing
In the pre-processing stage it is helpful to consider previous work on human
action recognition. Approaches to human action recognition are broken down
into two categories based on the feature extraction method: global feature-based
methods and local feature-based methods [7]. Global feature based methods, such
as [8], consider the whole human shape or scene through time. Local feature-
based methods, such as [9,10], discard more potentially salient information, such
as the structural information between features, so are generally not as accu-
rate on clean datasets. However, they are typically more robust against noise.
Some methods, however, including the spatio-temporal shape context [11] and
spatio-temporal pyramid representations [12], are local feature-based but par-
tially retain structural information between features. The localisation technique
presented in this work is similar to these structure-retaining representations.
The first step in our approach is to reduce the video database to a compact
representation. As we want our algorithm to operate on realistic datasets, we use
local features. Features are detected using Dollar’s method [10] at a loosely con-
stant rate with respect to time, at multiple spatial and temporal scales. At each
detected point, we extract a spatio-temporal cuboid and apply the HOG3D [13]
descriptor. We base our choice of detector on a human action classification eval-
uation study [14], and the descriptor on the experimental results shown in [13].
Next we assign each of the features one of k distinct codewords/clusters, as in the
Bag-of-Words method. To achieve this, we first reduce the feature descriptors’
dimensionality using principal components analysis. We then perform k-means
clustering on the reduced descriptors, and each feature is assigned to one cluster.
Each feature is then represented by a single value – its cluster membership.
We then aggregate these features in a way suitable for rapid localisation. While
Yu et al.’s fast method [3] for action localisation can often localise the optimal
3D sub-volume, generating a score for each STIP using Random Forests is too
expensive for real-world retrieval. Feature voting[6] is another potential scheme,
but we have experimentally determined that such methods are only stable when
applied to clean datasets. We instead propose to use a BoW-derived approach
to video representation, visualised in part of Figure 1. Each database video is
divided into time-slices t ∈ T , of nf frames, and we create a code-word frequency
histogram Ht for all the features within each t. Each histogram is normalised,
and nf is chosen to be approximately half the size of the smallest query that
can be searched on. The time-slices do not overlap, as preliminary experiments
have shown this does not improve accuracy. While this representation is simple,
we show through experiments that it captures sufficient information to localise
a human action. All of the aforementioned steps can be processed once on the
database in batch – this improves the time efficiency of later user searches.
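A minimal Python/NumPy sketch of this pre-processing step is given below, assuming the per-feature frame indices and code-word assignments have already been computed; the names and the handling of a trailing partial slice are illustrative choices of this sketch.

import numpy as np

def time_slice_histograms(feature_frames, feature_words, n_frames, nf, k):
    # feature_frames: frame index of each detected local feature
    # feature_words:  code-word (0..k-1) assigned to each feature
    # returns one L1-normalised histogram H_t per non-overlapping slice of nf frames
    n_slices = max(1, n_frames // nf)
    H = np.zeros((n_slices, k))
    for frame, word in zip(feature_frames, feature_words):
        t = min(frame // nf, n_slices - 1)   # trailing frames go to the last slice
        H[t, word] += 1
    sums = H.sum(axis=1, keepdims=True)
    np.divide(H, sums, out=H, where=sums > 0)
    return H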

2.2 Search
Previous work [15,16] on human action localisation typically utilises a trained
model – this requires several examples of the target action and the accompa-
nying ground truths. This is not possible in a retrieval context, where only a

single query example is provided. Some researchers have made attempts to per-
form image retrieval with spatial localisation [4,5], and one work focuses on
spatio-temporal retrieval and localisation of videos [3]. However, all of the afore-
mentioned techniques are computationally complex, making them unsuitable for
real-world retrieval. We present a more efficient system below.
To search, the user provides a video example of the human action they want
to find. The system performs feature extraction on this query in the manner
described in section 2.1, but a single normalised histogram is generated for the
entire length of the query, rather than for time-slices. To search for an action
within a single video taken from the database, we first use a simple metric to cal-
culate the similarity between each time-slice histogram and the query histogram
Hq . This metric is the histogram intersection:


s(H_q, H_t) = \sum_{i=1}^{k} \min(H_q^i, H_t^i)    (1)

If nf is chosen appropriately, each time-slice t can, at best, represent only a
small fraction of the action being searched for; thus Hq and Ht will not be fully
correlated. However, we show in our experiments below that the histogram inter-
section still generally generates a stronger response for relevant time-slices than
for irrelevant ones. Aggregating s over all t ∈ T gives a time series Sq,T representing
the similarity s of each t ∈ T to q.
Analysing Sq,T , it is possible to find candidate regions for the localised action.
One possible approach involves finding local peaks in this series. However, such
a method proves too sensitive to noise. Our best method applies thresholding
and then candidate segmentation. First, any t where Sq,t is below a threshold
is discarded. This threshold is one standard deviation above the mean over all
Sq,T . Next, we identify false negative time-slices that occur during an action: if
time slice ti and ti+2 are candidates, then ti+1 is also considered a candidate.
False-negative time-slices are often caused by brief interference with the action,
such as a person walking in front of the actor as the action is performed. (The
assumption is made that even the shortest action will span several time-slices,
making the choice of nf important.) Finally, remaining candidate time slices
without neighbours are also discarded, as candidate regions are unlikely to be
only nf frames in length. (N.B. these last two steps are somewhat analogous to
the region growing and shrinking methods found in image segmentation.) After
this, any temporally contiguous set of time slices remaining are considered to
be a single candidate for the action. The computational complexity of the entire
localisation process is O(|T |).
Performing these steps on all videos in the database, the system identifies a
large set of candidate regions. A single feature frequency histogram Hc is gen-
erated over each candidate region, and s(Hq , Hc ) is used to rank the candidates
by their relevance to the query. The top X of these are returned to the user. The
entire process is shown in Figure 1.
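The localisation stage can be sketched as follows in Python/NumPy; the thresholding, gap filling and pruning steps follow the description above, while the data layout and function names are assumptions made for this illustration. Each returned candidate region would then receive a single histogram Hc and be ranked by s(Hq, Hc).

import numpy as np

def histogram_intersection(Hq, Ht):
    # Eq. (1): sum_i min(Hq_i, Ht_i); works on a single histogram or a stack
    return np.minimum(Hq, Ht).sum(axis=-1)

def localise_candidates(Hq, H_slices):
    # H_slices: (n_slices, k) time-slice histograms of one database video
    S = histogram_intersection(Hq, H_slices)      # similarity time series S_{q,T}
    keep = S >= S.mean() + S.std()                # thresholding
    # fill single false-negative slices: t-1 and t+1 kept implies t kept
    fill = np.zeros_like(keep)
    fill[1:-1] = keep[:-2] & keep[2:]
    keep |= fill
    # discard isolated slices (no kept neighbour)
    isolated = keep.copy()
    isolated[1:] &= ~keep[:-1]
    isolated[:-1] &= ~keep[1:]
    keep &= ~isolated
    # group the remaining contiguous slices into candidate (first, last) regions
    regions, start = [], None
    for t, kept in enumerate(keep):
        if kept and start is None:
            start = t
        elif not kept and start is not None:
            regions.append((start, t - 1))
            start = None
    if start is not None:
        regions.append((start, len(keep) - 1))
    return regions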

3 Relevance Feedback

We can use relevance feedback (RF) to iteratively improve both the ranking and
localisation aspects of our algorithm. After an initial search, RF can improve
results by combining the original query with user feedback about the quality of
the initial results, to generate a more discriminative query. Usually this second,
more discriminative query will return better results than the original query. To
date, RF has been used mostly in the image retrieval domain [17,18], but has
also been applied to human action retrieval in more recent years [2].
In this work, relevance feedback occurs after the localisation and retrieval have
been performed once as described above to give an initial ranking of videos. The
user provides binary feedback on the relevance of several highly-ranked results,
and the histograms associated with these results are used to train new local-
isation and retrieval algorithms. To improve localisation, we use the feedback
histograms and the original query histogram to train an SVM, with the his-
togram intersection shown in equation 1 as the SVM’s kernel. Then, to calculate
the relevance of each time slice t, we measure the distance from the SVM’s hy-
perplane to Ht . The rest of the localisation algorithm proceeds as described in
§2.2. To improve our ranking with relevance feedback, we replace the histogram
intersection shown in equation 1 with a simple query expansion metric that only
utilises positive feedback pos. This query expansion takes the following form:

D_{t,pos} = \min(\{\, s(H_p, H_t) \mid p \in pos \,\})    (2)
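A sketch of this relevance-feedback stage is shown below, assuming scikit-learn for the SVM with a precomputed histogram intersection kernel. The feedback sets are assumed to contain both positive and negative examples, and the exact training configuration of the system described here is not reproduced; the function names are illustrative.

import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    # histogram intersection kernel matrix between the rows of A and of B
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def rf_slice_scores(H_slices, Hq, pos_hists, neg_hists):
    # localisation with feedback: signed distance of each time-slice histogram
    # from an SVM hyperplane trained on the query plus positive/negative feedback,
    # using the intersection kernel of Eq. (1) as the SVM kernel
    X = np.vstack([Hq[None, :], pos_hists, neg_hists])
    y = np.array([1] * (1 + len(pos_hists)) + [0] * len(neg_hists))
    svm = SVC(kernel='precomputed').fit(intersection_kernel(X, X), y)
    return svm.decision_function(intersection_kernel(H_slices, X))

def query_expansion_score(Ht, pos_hists):
    # Eq. (2): D_{t,pos} = min over positive feedback p of s(H_p, H_t)
    return min(np.minimum(Hp, Ht).sum() for Hp in pos_hists)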


Applying relevance feedback to a system with localisation results in an unusual
issue. Results returned to the user are often neither completely irrelevant nor
completely relevant – a result may be mostly relevant, but imperfectly localised.
In light of this problem, does the user have to manually re-localise the feed-
back both spatially and temporally before rerunning the query? Two methods
of providing feedback are considered in our experiments.

4 Experiments

4.1 Setup

In this section, we describe experiments to demonstrate our algorithm. We use
the MSR II human action dataset [19] based on its popular use in other human
action localisation works. This dataset consists of 54 videos, totalling approxi-
mately 46 minutes of footage, containing 203 total examples of actions. The three
classes of action are: handwaving, handclapping and boxing. These actions are
performed orthogonally to the camera in a very similar fashion to one another, but
the localisation is made more difficult due to various issues such as movement of
action-unrelated actors in the background and spatially/temporally overlapping
actions.
During feature extraction, we extract, on average, 3 features per frame, at 4
different spatio-temporal scales. Because boxing can be performed to either the
left or the right, all features are also mirrored on the y-axis, giving an average
of 24 features per frame. In the creation of the feature codebook, we use PCA
to retain 95% of the total variance, and for clustering k = 1000.

[Figure 2: (a) precision-recall curves after 0, 1, 3 and 5 relevance feedback iterations; (b) top-20 precision versus the number of RF iterations for imperfect and user-adjusted feedback, applied to both stages, to localisation only, and to ranking only.]

Fig. 2. (a) Precision-recall of localisation+retrieval after having performed relevance
feedback iteratively. The improvements of successive RF iterations can be seen clearly
here. (b) The precision (% of true positives) of the top 20 results after relevance feed-
back in different scenarios. We show both imperfect and user-adjusted relevance feed-
back. We also show the effects of applying relevance feedback to the localisation and
ranking algorithms in isolation, to see their contributions to the overall improvement
in precision.

Leave-one-out cross validation is performed, treating each of the 203 actions
as the query in turn, averaging results over all runs. We use the following method
length(E∩G)
to determine the accuracy of our localisation: let L(E, G) = length(E∪G) where
E is the temporal extent of the estimated action, and G is the temporal extent
of the closest ground truth. An action is considered successfully localised when
L(E, G) ≥ 0.5. To simulate a user’s relevance feedback, we use the ground truth
to determine up to 5 examples of each of positive and negative feedback.
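The criterion can be computed directly for temporal intervals; the frame-inclusive interval convention below is an assumption of this small Python sketch.

def temporal_overlap(est, gt):
    # L(E, G) = |E intersect G| / |E union G| for frame-inclusive intervals
    # (start, end); a result counts as correctly localised when L >= 0.5
    inter = max(0, min(est[1], gt[1]) - max(est[0], gt[0]) + 1)
    union = (est[1] - est[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union if union > 0 else 0.0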

4.2 Results
Figure 2a shows a precision-recall graph using our optimal setup over the whole
MSR II dataset, after various iterations of imperfect relevance feedback. Preci-
sion and recall are usually used in the context of binary relevance. To use these
metrics with localised results, however, we need a way of determining whether
an imperfectly localised result is still relevant. In [3], the authors determined
relevance of a result differently for precision and recall. However, we contend
that this method creates an unintuitive statistic, which cannot be interpreted in
the same way as traditional precision-recall. We use the single, stricter criterion
L, defined above for both precision and recall.

The effects of relevance feedback on the precision of the top 20 results are
shown in Figure 2b. Only results that satisfy L(E, G) ≥ 0.5 are considered for
positive relevance feedback. Negative relevance feedback is taken from results
where L(E, G) = 0. We have considered both “imperfect” feedback, unmodified
from the results, and user-adjusted feedback, where the spatial and temporal
extents of positive feedback are modified to exactly match the ground truth.
While user-adjusted feedback performs better, imperfect feedback still shows a
significant improvement after one and subsequent rounds of relevance feedback.
This could have practical implications for the usability of a retrieval system
with localisation. We also consider the effects of applying relevance feedback to
only localisation, and only retrieval, to show their individual contribution to the
overall improvement in precision.
We ran our experiments using MATLAB R2009a, on a 2.9GHz Core 2 Duo
PC, with 4GB RAM, running 32-bit Windows 7. The database is 46 minutes in
length, and the mean length of a query video is 5.7 seconds. The average query
time is 0.298 seconds without relevance feedback and 0.847 seconds with relevance
feedback, excluding offline computational costs.
These times are at least an order of magnitude better than previous results. Our
algorithm could also potentially be accelerated through programmatic optimi-
sation, or its computational complexity reduced through a search of hierarchical
time-slices according to size.

5 Discussion
We have created and demonstrated the use of an efficiency-focused video retrieval
system with localisation. Our relatively simple localisation search can still give
practical results, but completes in a fraction of the time of any previously re-
ported algorithm. We have additionally looked at the application of relevance
feedback in a retrieval context, and have shown that both user-adjusted and
imperfect feedback can be used to improve results significantly.
Our proposed method’s primary weakness, compared to existing algorithms,
lies in its inability to separate spatially-distinct background noise from the results,
which may cause incorrect ranking of the candidates. This has not significantly
affected our results on the MSR II, but on more complex datasets, such as the
HMDB [20], it may become a problem, particularly as a larger number of action
classes may decrease accuracy [21].
actions without the performance costs associated with branch-and-bound derived
methods. Additionally, further experimentation needs to be done on more complex
datasets, such as HMDB [20], to prove the algorithm’s general applicability.

References
1. Zhang, H.J., Wu, J., Zhong, D., Smoliar, S.: An Integrated System for Content-
based Video Retrieval and Browsing. Pattern Recognition 30(4), 643–658 (1997)
2. Jones, S., Shao, L., Zhang, J., Liu, Y.: Relevance Feedback for Real-World Human
Action Retrieval. Pattern Recognition Lett. 33(4), 446–452 (2012)

3. Yu, G., Yuan, J., Liu, Z.: Unsupervised Random Forest Indexing for Fast Ac-
tion Search. In: Proc. IEEE Conf. Comput. Vision and Pattern Recognition,
pp. 865–872 (2011)
4. Rahmani, R., Goldman, S.A., Zhang, H., Krettek, J., Fritts, J.E.: Localized Content
Based Image Retrieval. In: ACM SIGMM Int. Conf. Multimedia Inform. Retrieval,
pp. 227–236 (2005)
5. Zhang, D., Wang, F., Shi, Z., Zhang, C.: Interactive Localized Content Based Im-
age Retrieval With Multiple-Instance Active Learning. Pattern Recognition 43(2),
478–484 (2010)
6. Ryoo, M., Aggarwal, J.: Spatio-temporal Relationship Match: Video Structure
Comparison for Recognition of Complex Human Activities. In: IEEE Int. Conf.
Comput. Vision, pp. 1593–1600 (2009)
7. Poppe, R.: A survey on vision-based human action recognition. Image and Vision
Computing 28(6), 976–990 (2010)
8. Davis, J.W., Bobick, A.F.: The Representation and Recognition of Human Move-
ment Using Temporal Templates. In: Proc. IEEE Conf. Comput. Vision and Pat-
tern Recognition, p. 928 (1997)
9. Laptev, I.: On Space-Time Interest Points. Int. J. Comput. Vision 64(2-3), 107–123
(2005)
10. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition via Sparse
Spatio-Temporal Features. In: Proc. IEEE Workshop Visual Surveillance and Per-
formance Evaluation Tracking and Surveillance, pp. 65–72 (2005)
11. Shao, L., Du, Y.: Spatio-temporal Shape Contexts for Human Action Retrieval.
In: Proc. Int. Workshop Interactive Multimedia Consumer Electronics, pp. 43–50
(2009)
12. Choi, J., Jeon, W.J., Lee, S.-C.: Spatio-temporal pyramid matching for sports
videos. In: ACM SIGMM Int. Conf. Multimedia Inform. Retrieval, pp. 291–297
(2008)
13. Kläser, A., Marszalek, M., Schmid, C.: A Spatio-Temporal Descriptor Based on
3D-Gradients. In: Proc. British Mach. Vision Conf., pp. 995–1004 (2008)
14. Shao, L., Mattivi, R.: Feature Detector and Descriptor Evaluation in Human Action
Recognition. In: Proc. ACM Int. Conf. Image and Video Retrieval, pp. 477–484
(2010)
15. Kläser, A., Marszalek, M., Schmid, C., Zisserman, A.: Human Focused Action
Localization in Video. In: International Workshop on Sign, Gesture, Activity (2010)
16. Sullivan, J., Carlsson, S.: Recognizing and Tracking Human Action. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part I. LNCS, vol. 2350,
pp. 629–644. Springer, Heidelberg (2002)
17. Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval.
In: Proc. ACM Multimedia, pp. 107–118 (2001)
18. Tao, D., Tang, X., Li, X., Wu, X.: Asymmetric Bagging and Random Subspace
for Support Vector Machines-Based Relevance Feedback in Image Retrieval. IEEE
Trans. Pattern Anal. Mach. Intell. 28, 1088–1099 (2006)
19. Cao, L., Liu, Z., Huang, T.: Cross-dataset Action Detection. In: Proc. IEEE Conf.
Comput. Vision and Pattern Recognition, pp. 1998–2005 (2010)
20. Kuehne, H., Poggio, H.: HMDB: A Large Video Database for Human Motion
Recognition. In: IEEE Int. Conf. Comput. Vision (2011)
21. Reddy, K., Shah, M.: Recognizing 50 human action categories of web videos. Mach.
Vision and Applicat., 1–11 (2012)
Deformable Shape Reconstruction
from Monocular Video with Manifold Forests

Lili Tao and Bogdan J. Matuszewski

Applied Digital Signal and Image Processing Research Centre


University of Central Lancashire, UK
{lltao,bmatuszewski1}@uclan.ac.uk

Abstract. A common approach to recover structure of 3D deformable scene and


camera motion from uncalibrated 2D video sequences is to assume that shapes
can be accurately represented in linear subspaces. These methods are simple and
have been proven effective for reconstructions of objects with relatively small de-
formations, but have considerable limitations when the deformations are large or
complex. This paper describes a novel approach to reconstruction of deformable
objects utilising a manifold decision forest technique. The key contribution of
this work is the use of random decision forests for the shape manifold learning.
The learned manifold defines constraints imposed on the reconstructed shapes.
Due to nonlinear structure of the learned manifold, this approach is more suitable
to deal with large and complex object deformations when compared to the linear
constraints.

1 Introduction
Deformable shape recovery from a single uncalibrated camera is a challenging, under-
constrained problem. The methods proposed to deal with this problem can be divided
into three main categories: Low-rank shape models [9], shape trajectory approaches
[1,5,6] and template based methods [13]. Most of the existing methods are restricted by
the fact that they try to explain the complex deformations using a linear model.
Recent methods have integrated the manifold learning algorithm to regularise the
shape reconstruction problem by constraining the shapes as to be well represented by
the learned manifold. Using shape embedding as initialisation was introduced in [10].
Hamsici et al. [7] modelled the shape coefficients in a manifold feature space. The map-
ping was learned from the corresponding 2D measurement data of upcoming recon-
structed shapes, rather than a fixed set of trajectory bases.
Contrary to other techniques using manifold in the shape reconstruction, our mani-
fold is learned based on the 3D shapes rather than on 2D observations. The proposed
implementation is based on the manifold forest method described in [4]. The main ad-
vantage of using manifold forest as compared for example to standard diffusion maps
[3] is the fact that in the manifold forest the neighbourhood topology is learned from the
data itself rather than being defined by the Euclidean distance. To the best of the authors’
knowledge, the random forest technique has never been applied in the context of non-rigid
shape reconstruction. This work is the first to integrate the ideas of manifold forests and
deformable shape reconstruction.


1.1 Basic Formulation


Throughout this paper, vectors and matrices are denoted as lower- and upper-case bold
letters, whereas sets are represented by calligraphic letters.
We assume that 2D points (features) are obtained from F frames under an ortho-
graphic camera projection model. The problem consists of shapes S = {S1 , S2 , . . . SF }
and camera rotations R = {R1 , R2 , . . . , RF } recovery from 2D observations
Y = {Y1 , Y2 , . . . , YF }, thus can be formulated as the following optimisation
problem,
$$\arg\min_{R,\,S} \sum_{t=1}^{F} \left\| Y_t - P \cdot R_t \cdot S_t \right\|^2 \qquad (1)$$

where P represents the orthographic camera projection matrix, $Y_t$ is a $2 \times P$ matrix, and
$S_t \in \mathbb{R}^{3 \times P}$ contains the coordinates of P 3D points describing the shape in the t-th frame.
The shape $S_t$ is represented as a linear combination of n + 1 (where n is the dimension of the
manifold introduced in Section 2.2) unknown but fixed basis shapes $X_{tl}$: $S_t = \sum_{l=1}^{n+1} \theta_{tl} X_{tl}$.
The camera translation has been eliminated by expressing the 2D observations Y with respect
to the data point centroid calculated in each observed image.
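For illustration, the objective in Eq. (1) and the basis-shape model can be written as a short NumPy sketch; the variable names and the list-based containers are choices of this sketch rather than part of the formulation.

import numpy as np

def reprojection_error(Y, R, S):
    """Sum of squared 2D reprojection errors over F frames (cf. Eq. (1)).

    Y : list of F arrays, each 2 x P (centred 2D observations)
    R : list of F arrays, each 3 x 3 (camera rotations)
    S : list of F arrays, each 3 x P (3D shapes)
    """
    P_proj = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])   # orthographic projection matrix
    return sum(np.sum((Yt - P_proj @ Rt @ St) ** 2) for Yt, Rt, St in zip(Y, R, S))

def shape_from_basis(theta_t, X_t):
    """Shape as a linear combination of n+1 basis shapes: S_t = sum_l theta_tl X_tl."""
    return sum(w * X for w, X in zip(theta_t, X_t))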

2 Manifold Forest
Random forests have become a popular method, given their capability to handle high-
dimensional data, to avoid over-fitting efficiently without pruning, and to operate in
parallel. This section gives a brief review of randomized decision forests and
their use in learning diffusion map manifolds. Although other choices are possible, this
paper is focused only on the binary decision forest.

2.1 Randomized Decision Forest


The decision trees in our method are built by making decision in each node of the tree
based on randomly selected features. A random decision forest is an ensemble of such
decision trees. The trees are different and independent from each other.
We are given a set of training data $\mathcal{X}$ with M samples $X_i \in \mathcal{X}$, $i = 1 \dots M$, where
each sample contains 3P features. The trees are randomised by selecting a single random
feature at each internal node. The decision function at the internal node is used
to decide whether the data Xi reaching that node should be assigned to its left or right
child node. The threshold αm of the decision function at node m is selected as result of
the maximisation of the information gain:
$$\alpha_m^* = \arg\max_{\alpha_m} I_m \qquad (2)$$

with the generic information gain $I_m$ defined as:

$$I_m = H(\mathcal{X}_m) - \sum_{i \in \{L,R\}} \frac{|\mathcal{X}_m^i|}{|\mathcal{X}_m|}\, H(\mathcal{X}_m^i) \qquad (3)$$

where $|\cdot|$ denotes the cardinality of a set, $\mathcal{X}_m$ denotes the training data reaching
node m, and $\mathcal{X}_m^L$, $\mathcal{X}_m^R$ are the subsets assigned to the left and right child nodes of node m.

In the paper it is assumed that data is adequately represented by the Gaussian distribu-
tion [4]. In that case the differential entropy H (Xm ) can be calculated analytically as:
$$H(\mathcal{X}_m) = \frac{1}{2}\log\left|2\pi e\,\Lambda\right| \qquad (4)$$
where Λ is the covariance matrix of Xm . The trees are trained until the number of
samples in a leaf is less than the pre-specified limit or the depth of the tree has exceeded
the pre-defined depth.
Once the random forest has been trained, the new sample can be simply put through
each tree. Depending on the result of the decision function at each internal node, the new
data is sent to the left or right child node until it arrives at a leaf. The samples ending
up in the same leaf are likely to be statistically similar and are expected to represent the
same neighbourhood of the manifold. As this similarity measure is statistical in nature,
the result is averaged over many decision trees. If the samples end up in the same
leaf for the majority of the trees, they are considered to be drawn from a similar location
on the manifold.
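As an illustration of Eqs. (2)–(4), the following Python sketch picks one random feature at a node and selects the threshold that maximises the Gaussian-entropy information gain; the candidate-threshold sampling and the covariance regularisation are assumptions of the sketch, not prescribed by the method.

import numpy as np

def gaussian_entropy(X):
    """Differential entropy of a Gaussian fitted to X (rows = samples), cf. Eq. (4)."""
    d = X.shape[1]
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)   # regularise for stability
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet

def best_split(X, n_thresholds=10, rng=np.random.default_rng(0)):
    """Randomly pick one feature and return the threshold maximising the
    information gain of Eq. (3) over a few candidate thresholds."""
    feat = rng.integers(X.shape[1])
    values = X[:, feat]
    best_gain, best_thr = -np.inf, None
    parent_h = gaussian_entropy(X)
    for thr in rng.uniform(values.min(), values.max(), n_thresholds):
        left, right = X[values < thr], X[values >= thr]
        if len(left) < X.shape[1] + 1 or len(right) < X.shape[1] + 1:
            continue   # need enough samples to estimate a covariance
        gain = parent_h - (len(left) * gaussian_entropy(left)
                           + len(right) * gaussian_entropy(right)) / len(X)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return feat, best_thr, best_gain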

2.2 Forest Model for Manifold Learning


In many problems, data is hard to represent or analyse due to its high-dimensional
structure. However, such complex data might be governed by a small number of param-
eters. The goal of manifold learning is to find the embedding function mapping the
data set $\mathcal{X}$ from a high, N = 3P dimensional space to a reduced, n dimensional space.
The manifold forests are constructed upon diffusion maps [3] with the neighbour-
hood topology learned through random forest data clustering. It generates efficient
representations of complex geometric structures even when the observed samples are
non-uniformly distributed. The diffusion map is a graph based non-linear technique
with isometric mapping from original shape space onto a lower dimensional diffusion
space.
In the proposed method, the affinity model in manifold learning is built by applying
random forest clustering. The data partition is defined based on the leaf node $l(\cdot)$ the
input data $X_i$ would reach. The entries of the affinity matrix $W^t$ for tree t are calculated
as $W_{ij}^t = e^{-L^t(X_i, X_j)}$, $i, j \in 1 \dots M$, where the distance $L^t$ is obtained using the binary
affinity model:

$$L^t(X_i, X_j) = \begin{cases} 0 & l(X_i) = l(X_j) \\ \infty & \text{otherwise} \end{cases} \qquad (5)$$

The binary model is simple and efficient and can be considered parameter-free.
However, as an affinity matrix calculated from a single tree is not representative, the
ensemble of T trees is used to calculate a more accurate affinity matrix W by averaging
over the affinity matrices of the single trees: $W = \frac{1}{T}\sum_{t=1}^{T} W^t$.
Coifman et al. presented a justification for using the normalised graph Laplacian
[3] by connecting it to the diffusion distance. Each entry of the diffusion operator G
is constructed as $G(X_i, X_j) = \widetilde{W}_{ij}/\Upsilon_{ii}$ with $\Upsilon_{ii} = \sum_j \widetilde{W}_{ij}$, where $\widetilde{W}$ is a renormalised
affinity matrix of W obtained with an anisotropic normalised graph Laplacian, such
that $\widetilde{W}_{ij} = W_{ij}/(q_i q_j)$ with $q_i = \sum_j W_{ij}$, $q_j = \sum_i W_{ji}$. The convergence of the

optimal embedding Ψ for diffusion maps is proven in [3]; it is found via the eigenvectors
$\varphi$ and the corresponding n largest eigenvalues $\lambda$ of the operator G, such that
$1 = \lambda_0 > \lambda_1 \ge \dots \ge \lambda_n$,

$$\Psi : X_i \to \left[\lambda_1 \varphi_1(X_i), \cdots, \lambda_n \varphi_n(X_i)\right]^T \qquad (6)$$
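A minimal NumPy sketch of this construction is given below, assuming the leaf assignments of the training samples in every tree are available; it forms the averaged binary affinity of Eq. (5), applies the anisotropic normalisation, and returns the diffusion-map coordinates of Eq. (6). The eigen-solver and the handling of the trivial eigenpair are simplifications of the sketch.

import numpy as np

def forest_affinity(leaf_ids):
    """leaf_ids: (T, M) array, leaf_ids[t, i] = leaf reached by sample i in tree t.
    Returns the M x M affinity averaged over trees (binary model, Eq. (5))."""
    T, M = leaf_ids.shape
    W = np.zeros((M, M))
    for t in range(T):
        W += (leaf_ids[t][:, None] == leaf_ids[t][None, :]).astype(float)
    return W / T

def diffusion_map(W, n_dims):
    """Anisotropic normalisation followed by the diffusion-map embedding (Eq. (6))."""
    q = W.sum(axis=1)
    W_tilde = W / np.outer(q, q)                        # anisotropic renormalisation
    G = W_tilde / W_tilde.sum(axis=1, keepdims=True)    # row-stochastic diffusion operator
    evals, evecs = np.linalg.eig(G)
    order = np.argsort(-evals.real)
    evals, evecs = evals.real[order], evecs.real[:, order]
    # skip the trivial eigenpair (lambda_0 = 1, constant eigenvector)
    return evecs[:, 1:n_dims + 1] * evals[1:n_dims + 1]

# usage: Psi = diffusion_map(forest_affinity(leaf_ids), n_dims=10)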

3 Random Forests in 3D Reconstruction


Initialisation: Initial shapes and camera motion are estimated by running a few itera-
tions of the optimisation process using the linear method described in [12]. Our method
is not significantly sensitive to the initial solution as the method can iteratively update
the shapes by projecting them on the learned manifold until convergence.

Mapping Out-of-Sample Points: The manifold forests method briefly described in


Section 2 is used to find a meaningful representation of the data, but the mapping Ψ is
only able to provide an embedding for the data present in the given training set. Suppose
a new shape $S_t \in \mathbb{R}^N$ becomes available after the manifold has been learned. Instead
of re-learning the manifold, which is computationally expensive, an efficient way is to
interpolate the shape onto the lower dimensional feature space. For each new shape,
such an embedding is calculated based on the Nyström extension [2].
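A possible sketch of such an out-of-sample interpolation is shown below; it assumes that the normalised affinities between the new shape and the training shapes (one row of the diffusion operator) are available and follows the usual Nyström form, so the exact kernel and normalisation used are assumptions of the sketch rather than details of the method.

import numpy as np

def nystrom_embed(k_new, evals, evecs):
    """Approximate embedding of a new sample (cf. the Nystrom extension [2]).

    k_new : (M,) normalised affinities between the new shape and the M training shapes
    evals : (n,) retained eigenvalues lambda_1..lambda_n
    evecs : (M, n) corresponding eigenvectors evaluated on the training set
    """
    # phi_k(new) = (1 / lambda_k) * sum_j k_new[j] * phi_k(j)
    coords = k_new @ evecs / evals      # Nystrom interpolation of the eigenvectors
    return coords * evals               # same lambda_k scaling as the training embedding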

Inverse Mapping: Given a point b ∈ Rn in the reduced space, finding its inverse map-
ping St = Ψ−1 (b) from the feature space back to the input space is a typical pre-image
problem. As claimed in [2], the exact pre-image might not exist if the shape St has not
been seen in the training set. However, according to the properties of isometric map-
ping, if the points in the reduced space are relatively close, the corresponding shapes in
high dimensional space should represent similar shapes since they have small diffusion
distances. Based on this, the point $b_t$ can be approximated as a linear combination of
its weighted neighbouring points in feature space, such that $b_t = \sum_{l=1}^{n+1} \theta_{tl} x_{tl}$, where
$x_{tl}$ is the l-th nearest point of $b_t$ and the weights $\theta_{tl}$ are computed as the barycentric
coordinates of $b_t$. Once the weights are estimated, the shape $S_t$ can be calculated as well,
based on the set of weighted training samples: $S_t = \sum_{l=1}^{n+1} \theta_{tl} X_{tl}$, where the training
samples $X_{tl}$ are the pre-images of $x_{tl}$ and are equivalent to the basis shapes in Eq. (1).
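The inverse mapping can be sketched as follows; the least-squares solution of the barycentric system and the projection of the weights back onto the simplex are simplifications of this sketch.

import numpy as np

def inverse_map(b, embedded, shapes, n):
    """Approximate pre-image of a point b in the reduced space.

    b        : (n,) query point in diffusion space
    embedded : (M, n) training embeddings
    shapes   : (M, 3P) training shapes (flattened)
    Returns the reconstructed shape as a weighted sum of n+1 neighbouring shapes.
    """
    idx = np.argsort(np.linalg.norm(embedded - b, axis=1))[:n + 1]
    # barycentric weights: solve sum_l theta_l * x_l = b  with  sum_l theta_l = 1
    A = np.vstack([embedded[idx].T, np.ones(n + 1)])     # (n+1) x (n+1) system
    rhs = np.append(b, 1.0)
    theta, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    theta = np.clip(theta, 0.0, None)
    theta /= theta.sum()                                  # keep weights in the simplex
    return theta @ shapes[idx]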

Non-linear Refinement: The cost function is given as

$$\arg\min_{\{R_t\},\{\theta_{tl}\}} \sum_{t=1}^{F} \left\|Y_t - P \cdot R_t \cdot S_t\right\|^2 + \varphi_S \sum_{t=2}^{F} \left\|S_t - S_{t-1}\right\|^2 + \varphi_R \sum_{t=1}^{F} \varepsilon_{rot}$$

$$\text{with } \sum_{l=1}^{n+1} \theta_{tl} = 1, \quad 0 \le \theta_{tl} \le 1 \qquad (7)$$

where $\varepsilon_{rot} = \left\|R_t \cdot R_t^T - I\right\|^2$ enforces orthonormality of all $R_t$; $\varphi_S$ and $\varphi_R$ are regu-
larisation constants.
However, the underlying problem is that the quality of the optimisation result depends
strongly on the accuracy of the initial shapes. To avoid this, we update the basis
shapes in each iteration until the 2D measurement error is less than a defined threshold
($10^{-3}$ in our case) and the error between two adjacent frames is relatively small.
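For clarity, the cost of Eq. (7) can be evaluated as in the sketch below; the constrained minimisation over {R_t} and {θ_tl} itself is not reproduced, and the data layout is an assumption of the sketch.

import numpy as np

def refinement_cost(Y, R, theta, X_basis, phi_S=1.0, phi_R=1.0):
    """Cost of Eq. (7): reprojection + temporal smoothness + rotation orthonormality.

    Y       : list of F (2 x P) observations
    R       : list of F (3 x 3) rotations
    theta   : (F, n+1) barycentric weights, rows sum to 1
    X_basis : list of F lists of (3 x P) basis shapes (neighbouring training shapes)
    """
    P_proj = np.eye(2, 3)
    S = [sum(w * X for w, X in zip(theta[t], X_basis[t])) for t in range(len(Y))]
    data = sum(np.sum((Y[t] - P_proj @ R[t] @ S[t]) ** 2) for t in range(len(Y)))
    smooth = sum(np.sum((S[t] - S[t - 1]) ** 2) for t in range(1, len(Y)))
    rot = sum(np.sum((R[t] @ R[t].T - np.eye(3)) ** 2) for t in range(len(Y)))
    return data + phi_S * smooth + phi_R * rot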

4 Results and Discussion

A number of experiments were carried out to evaluate the proposed method. Several
state-of-the-art algorithms were evaluated and compared in these experiments:
RF: The proposed random forest method; DM: The diffusion maps based method. The
DM method is similar to the RF except the manifold learning was implemented without
random forest. [11]; MP: The metric projection method [9]; PTA: The discrete cosine
transform (DCT) based point trajectory approach [1]; CSF: The column space fitting
method [5]; KSFM: The kernel non-rigid structure from motion approach [6]; IPCA:
The incremental principal components analysis based method [12].
The data which were used for evaluation include: two articulated face sequences,
surprise and talking, both captured using passive 3-D scanner with 3D tracking of 83
facial landmarks [8]; two surface models, cardboard and cloth [13]; two human actions,
walking and stretch, and three dance sequences: dance, Indian dance and Capoeira
from CMU motion capture database1 . This paper is not focusing on feature detection
and tracking. In the experiments described here the 3D points are known and these were
projected onto the image sequences under the orthographic camera model and subse-
quently used as features. Diffusion maps require training process, so training datasets
for two face sequences were taken from the BU-3DFE [14] and for two surface se-
quences the data were obtained from [13]. Since no separate training data are provided
for CMU database, half of each sequence was used for manifold learning and the other
half for testing. All the training data has been rigidly co-registered, the same testing
data has been used with the methods which do not require training.

4.1 Quantitative Evaluation

Different Number of Bases n: The accuracy of reconstruction is affected by the di-


mensionality of the reduced space n, corresponding to number of shape basis. The first
test looked at the relation between manifold dimensionality and the shape reconstruc-
tion error. The 9 sequences were separated into 3 groups: small deformation (Surprise,
Talking, Cardboard), large deformation (Cloth, Walking, Stretch) and all the dance
sequences representing very large and complex deformations. The forests have been
trained with the average 600 number of trees. The results in Fig.1(left) show that with
increasing dimension of the reduced space n the shape reconstruction error is reduced.
As expected, a higher number of bases is required to describe a complex shape defor-
mation, e.g. dance sequences.
Fig.1(right) shows the comparison results on stretch sequence which were produced
by the proposed method and the previous methods. The error calculated for PTA, CSF
and KSFM varies with the number of bases and indeed increases for n > 12, demonstrat-
ing that the problem becomes ill-conditioned. The DM and RF methods are “more stable” as
the solution is strongly constrained by the requirement that it belongs to the manifold.

Measurement Data with Noise: In order to assess the performance of the recon-
struction methods when the observed data is corrupted by noise, the next experiment
1 The data was obtained from http://mocap.cs.cmu.edu

[Plots: 3D reconstruction error (%) as a function of n; legend (left): Small, Large, Dance, All; legend (right): PTA, CSF, KSFM, DM, RF.]

Fig. 1. Reconstruction 3D error as a function of the number of bases n. (left) Errors produced
by RF. Bars left to right: Group of small deformation sequences, large deformation sequences,all
dance sequences, all the sequences; (right) Comparison results on stretch sequence.

[Plots: 3D reconstruction error (%) as a function of the level of noise (%); legend: MP, PTA, CSF, KSFM, DM, RF; (a) walking, (b) capoeira.]

Fig. 2. Reconstruction results on walking (left) and capoeira (right) sequences with Gaussian
noise

compared the RF method against previously proposed methods in terms of shape recon-
struction error expressed as a function of level of noise in the observed data. We ran 10
trials for each experiment for each level of noise using walking and capoeira sequences
respectively. It can be noticed that although the performance of all six algorithms de-
creases with the level of noise, two non-linear methods DM and RF are clearly superior
and achieve smaller standard deviations, whereas others are quite sensitive with large
mean error and error dispersion. Even though RF and DM provide comparable perfor-
mance in walking, as expected RF outperforms DM in the cases of recovery of more
complex deformations, e.g. capoeira sequence.

4.2 Qualitative Evaluation


Motion Capture Data: Table 1 shows the 3D reconstruction error for RF, DM, IPCA
and KSFM which on average provide better results than other trajectory based methods.
The relative normalised means of the 3D error [6] are compared over all frames and all

Table 1. Relative normalised mean reconstruction 3D error in percentages for KSFM, IPCA, DM
and RF methods. The optimal number of bases n, for which the 3D errors are shown in the table,
is given in brackets for each tested method.

Sequence        KSFM        IPCA       DM          RF (Initial)   RF (No Opt.)   RF (Opt.)
Surprise 3.81(4) 12.89 3.52(10) 31.54 29.29 2.41(15)
Talking 4.98(4) 9.86 3.50(10) 96.57 8.37 3.43(10)
Cardboard 27.53(2) 24.45 10.64(10) 26.74 16.06 9.40(10)
Cloth 18.06(2) 19.09 2.87(7) 29.67 17.29 2.54(7)
Walking 10.29(5) 32.64 2.65(9) 35.02 16.31 3.69(15)
IndianDance 23.43(7) 34.40 9.81(10) 29.69 12.82 5.55(15)
Capoeira 23.76(7) 40.59 2.58(9) 40.59 29.2 0.54(10)
Stretch 7.36(12) 19.18 6.87(6) 26.23 17.08 5.88(10)
Dance 23.69(4) 30.58 16.76(7) 26.08 15.30 11.69(15)

[Figure panels: frames 1, 50, 60, 90 and 120; rows: KSFM (top), RF (bottom).]

Fig. 3. Reconstruction results on the Indian Dance sequence. Reconstructed 3D shapes (circles),
with ground truth (dots) are shown.

points. For RF method the initialisation error and the error produced by the proposed
algorithm with and without non-linear refinement are presented. The errors shown in
the table correspond to the optimal n value selection. This is achieved by running the
trials with n varying from 2 to 15. The best selected n value for each tested method is
shown in brackets. The reconstructed shapes are aligned using a single global rotation
based on Procrustes alignment [1]. As shown in the table, RF has better performance
than other methods, especially for the large deformations. Even though the initial error
is big, the RF method is still able to provide accurate reconstruction results.
Fig.3 shows three randomly selected reconstructed shapes from the Indian Dance
sequence using KSFM and RF methods. More comparison results for DM against other
methods can be found in [11].

[Figure panels: Frame 1 and Frame 41, each showing PTA, KSFM and RF reconstructions.]

Fig. 4. Selected 2D frames from the video sequence of a paper bending. Front and top views of
the corresponding 3D reconstructed results using the proposed method (RF), PTA and KSFM.

Real Data: The algorithms used in the motion capture experiments above are applied
to real data in Fig.4. In the video, 81 point features were tracked along 61 frames show-
ing approximately two periods of paper bending movement.

5 Conclusions

In this paper a new approach for monocular reconstruction of non-rigid objects is de-
scribed. The method performs particularly well, when compared to other methods,
especially for large and complex deformations. The method combines the ideas of non-
linear manifold learning and deformable shape reconstruction. The non-linear manifold
has been built upon diffusion maps, with random forests used to estimate the local manifold
neighbourhood topology. The method has the potential to be extended to handle cases
with missing data and to be implemented for real time reconstructions. The proposed
method shows a significant improvement for the reconstruction of large deformable
objects, even though, due to the lack of training data, the manifold is built using only a
limited number of shapes. Further possible improvements include building a sufficiently
dense representation of the manifold by collecting and generating more training data.

References
1. Akhter, I., Sheikh, Y., Khan, S., Kanade, T.: Trajectory space: A dual representation for
nonrigid structure from motion. IEEE PAMI 33, 1442–1456 (2011)
2. Arias, P., Randall, G., Sapiro, G.: Connecting the out-of sample and pre-image problems in
kernel methods. In: ICPR, pp. 1–8 (2007)
3. Coifman, R., Lafon, S.: Diffusion maps. Appl. Comp. Harm. Anal. 21, 5–30 (2006)
4. Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: A unified framework for clas-
sification, regression, density estimation, manifold learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Computer Vision 7, 81–227 (2012)
5. Gotardo, P., Martinez, A.M.: Computing smooth time-trajectories for camera and deformable
shape in structure from motion with occlusion. IEEE PAMI 33, 2051–2065 (2011)
6. Gotardo, P., Martinez, A.M.: Kernel non-rigid structure from motion. In: ICCV, pp. 802–809
(2011)
7. Hamsici, O.C., Gotardo, P.F.U., Martinez, A.M.: Learning spatially-smooth mappings in non-
rigid structure from motion. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid,
C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575, pp. 260–273. Springer, Heidelberg (2012)
36 L. Tao and B.J. Matuszewski

8. Matuszewski, B., Quan, W., Shark, L.-K., McLoughlin, A., Lightbody, C., Emsley, H.,
Watkins, C.: Hi4d–adsip 3d dynamic facial articulation database. Image and Vision Com-
puting 10, 713–727 (2012)
9. Paladini, M., Bue, A., Xavier, J., Stosic, M., Dodig, M., Agapito, L.: Factorization for non-
rigid and articulated structure using metric projections. In: CVPR, pp. 2898–2905 (2009)
10. Rabaud, V., Belongie, S.: Linear embeddings in non-rigid structure from motion. In: CVPR,
pp. 2427–2434 (2009)
11. Tao, L., Matuszewski, B.J.: Non-rigid structure from motion with diffusion maps prior. In:
CVPR (2013)
12. Tao, L., Matuszewski, B.J., Mein, S.J.: Non-rigid structure from motion with incremental
shape prior. In: ICIP, pp. 1753–1756 (2012)
13. Varol, A., Salzmann, M., Fua, P., Urtasun, R.: A constrained latent variable model. In: CVPR,
pp. 2248–2255 (2012)
14. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.: A 3d face expression database for facial
behavior research. In: AFGR, pp. 211–216 (2006)
Multi-SVM Multi-instance Learning
for Object-Based Image Retrieval

Fei Li1 , Rujie Liu1 , and Takayuki Baba2


1
Fujitsu Research & Development Center Co., Ltd., Beijing, China
{lifei,rjliu}@cn.fujitsu.com
2
Fujitsu Laboratories Ltd., Kawasaki, Japan
[email protected]

Abstract. Object-based image retrieval has been an active research


topic in recent years, in which a user is only interested in some object
in the images. The recently proposed methods try to comprehensively
use both image- and region-level features for more satisfactory perfor-
mance, but they either cannot well explore the relationship between the
two kinds of features or lead to heavy computational load. In this pa-
per, by adopting support vector machine (SVM) as the basic classifier, a
novel multi-instance learning method is proposed. To deal with the differ-
ent forms of image- and region-level representations, standard SVM and
multi-instance SVM are utilized respectively. Moreover, the relationship
between images and their segmented regions is also taken into account. A
unified optimization framework is developed to involve all the available
information, and an efficient iterative solution is introduced. Experimen-
tal results on the benchmark data set demonstrate the effectiveness of
our proposal.

Keywords: Object-based image retrieval, support vector machine,


multi-instance learning.

1 Introduction
With the explosive growth of the number of digital images, effective and efficient
retrieval techniques are urgently needed. Since a user usually pays attention to
some object instead of the whole image, if only overall characteristics are used
for image description, the retrieval performance is often unsatisfactory. To deal
with the problem, object-based (or localized content-based) image retrieval is
proposed, and much related work has been developed [1], [2].
As an effective approach to describe the relationship between whole and part,
multi-instance learning has been widely used in image analysis [3], [4]. In this
learning framework, each sample is called a bag, and contains several instances.
The available labels are only assigned for bags, and the relationship between bag
and instance is that a bag is positive if at least one instance in it is positive,
otherwise it is negative. In order to involve object-based image retrieval into the
framework of multi-instance learning, images are first segmented into regions,
and then images and regions are treated as bags and instances, respectively.


According to the adopted image representation, the existing multi-instance


retrieval methods can be divided into two categories. In the first category, only
region features are used. To transform region-level multi-instance learning into
image-level single-instance learning, MISSL [5] calculates the weighted edges of
image-level graph based on region similarities, while EC-SVM [6] maps all the
images into a new space spanned by some selected regions. No matter which
approach is utilized, some useful information is inevitably lost during the trans-
formation process, and this may influence the final performance. Another usu-
ally adopted idea is to directly conduct region-level multi-instance learning, and
graph-based methods are often used. In [7], real-valued labels are first assigned
to the selected underlying positive regions, and then propagated to the regions
of all the database images. GMIL [8] considers the relationship between images
and their segmented regions as the fitting constraint of an optimization problem.
Since both graph-based learning and multi-instance learning are well involved in
a unified framework, it achieves the state-of-the-art performance.
In the other category of methods, both image- and region-level representations
are adopted. Although global characteristics cannot effectively describe the user-
interested object, some features, especially image-level statistical description of
local descriptors, are also useful for image retrieval. Therefore, it is hoped that
more satisfactory performance can be obtained by comprehensively utilizing the
two kinds of information. In [9], two image-level graphs are constructed, one is
directly from image-level features, and the other is from region-level features
by a suitable conversion strategy. Then the corresponding propagation matrices
are linearly combined and graph-based learning is conducted. Since the two
kinds of representations are dealt with separately and only combined via graphs,
the available information is not well explored. In the method of multi-graph
multi-instance learning [10], both image- and region-level graphs are constructed.
To address the problem of their different sizes, the whole learning process is
conducted in an optimization framework. And the relationship between images,
the relationship between regions, as well as the relationship between images
and their segmented regions, are all well involved. Although effective, there are
many variables in the optimization problem, and the final solution is obtained
by iterative calculation, so its computational load is quite heavy.
In this paper, for exploring both image- and region-level information, we
present a novel multi-instance image retrieval method based on support vec-
tor machine (SVM). Considering the different forms of the two kinds of features,
image-level SVM and region-level multi-instance SVM are adopted respectively.
In order to construct two classifiers at the same time, similarly as [10], a unified
optimization framework is developed, and the relationship between images and
their segmented regions is also involved. Although iterative calculation is still
needed to get the final results, since less variables are involved, our proposal is
efficient enough for practical applications.
The rest of the paper is organized as follows. Section 2 describes our pro-
posed multi-SVM multi-instance learning method. Our experimental results are
presented in Section 3, and it is followed by some conclusions in Section 4.

2 Multi-SVM Multi-instance Learning


In this section, first we explain the ways to represent images by two kinds of fea-
tures. Then we present the unified optimization framework to construct image-
level SVM and region-level multi-instance SVM, and talk about its solution.

2.1 Image Representation


Suppose there are altogether M training images denoted as {I1 , I2 , · · · , IM }, and
each image Im (m = 1, 2, · · · , M ) corresponds to a category label ym ∈ {−1, 1}.
After segmentation, image Im is represented by a set of regions. Let the total
number of regions from all the training images be N . When it is unnecessary to
point out the corresponding images, the regions are denoted as {R1 , R2 , · · · , RN }.
To describe the relationship between image and region, we use Rn ∈ Im to
indicate that Rn is a region in image Im . The extracted image- and region-level
features are denoted as $\{x^I_1, x^I_2, \cdots, x^I_M\}$ and $\{x^R_1, x^R_2, \cdots, x^R_N\}$, respectively. In
this way, image $I_m$ can be described by either a vector $x^I_m$ or a set of vectors
$\{x^R_n \mid R_n \in I_m\}$.
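This representation can be mirrored by a simple container such as the one below; the class name and layout are illustrative only.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TrainingImage:
    """One training bag: an image-level feature plus its region-level instances."""
    label: int                    # y_m in {-1, +1}
    x_image: np.ndarray           # image-level feature vector x^I_m
    x_regions: List[np.ndarray]   # region-level feature vectors {x^R_n : R_n in I_m}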

2.2 Optimization Framework


Based on the idea of maximizing the geometric margin, SVM [11] has shown its
effectiveness in many fields. With vector-formed image-level features, a linear
SVM can be constructed by solving the optimization problem
$$\min_{w^I, b^I, \xi^I} Q_1 = \min_{w^I, b^I, \xi^I} \frac{1}{2}\left\|w^I\right\|^2 + C^I \sum_{m=1}^{M} \xi^I_m \qquad (1)$$

$$\text{s.t. } y_m f(x^I_m) \ge 1 - \xi^I_m, \quad \xi^I_m \ge 0, \quad (m = 1, 2, \cdots, M).$$

where $w^I$ and $b^I$ are classifier parameters and the classification function is defined as
$f(x^I_m) = w^I \cdot x^I_m + b^I$, “·” denotes the inner product of two vectors, $\xi^I_m$
$(m = 1, 2, \cdots, M)$ are slack variables, and $C^I > 0$ is a penalty parameter.
For region-level representation, as each image is described by a set of feature
vectors, standard SVM cannot be directly adopted. According to the basic idea
of multi-instance learning, the classification results of a bag can be determined
by the instance with the maximum classification function value. In this way,
multi-instance learning is introduced into the framework of SVM [12], and a
linear multi-instance SVM with region-level features can be constructed by
$$\min_{w^R, b^R, \xi^R} Q_2 = \min_{w^R, b^R, \xi^R} \frac{1}{2}\left\|w^R\right\|^2 + C^R \sum_{m=1}^{M} \xi^R_m \qquad (2)$$

$$\text{s.t. } y_m \max_{R_n \in I_m} g(x^R_n) \ge 1 - \xi^R_m, \quad \xi^R_m \ge 0, \quad (m = 1, 2, \cdots, M).$$

where $w^R$ and $b^R$ are classifier parameters and the classification function is defined as
$g(x^R_n) = w^R \cdot x^R_n + b^R$, $\xi^R_m$ $(m = 1, 2, \cdots, M)$ are also slack variables,
and $C^R > 0$ is also a penalty parameter.

To avoid separately constructing image-level SVM and region-level multi-


instance SVM, a cost item corresponding to the relationship between images
and their segmented regions is introduced. Since the classification results can be
determined by either image- or region-level features, their corresponding classi-
fication function values should be consistent with each other, thus the cost item
is defined as

$$Q_3 = \sum_{m=1}^{M} L\left(y_m f(x^I_m),\; y_m \max_{R_n \in I_m} g(x^R_n)\right) \qquad (3)$$

where L(x, y) is a suitable distance measure. Squared Euclidean distance is usu-


ally adopted, but here we only want to involve the distance when the classifica-
tion results are believable. Considering the characteristic of SVM, larger value
of $y_m f(x^I_m)$ or $y_m \max_{R_n \in I_m} g(x^R_n)$ indicates more confidence in the result, and only
when the value is larger than 1 is the corresponding slack variable 0. Therefore,
the measure is defined as

$$L(x, y) = \left(\max\{x, 1\} - \max\{y, 1\}\right)^2 \qquad (4)$$
It should be noted that if $y_m f(x^I_m) > 1$ and $y_m \max_{R_n \in I_m} g(x^R_n) < 1$, only the difference
between $y_m f(x^I_m)$ and 1 is considered. This is because the difference between
1 and $y_m \max_{R_n \in I_m} g(x^R_n)$ has already been embodied by the corresponding slack
variable. The case is similar when $y_m f(x^I_m) < 1$ and $y_m \max_{R_n \in I_m} g(x^R_n) > 1$.
By taking all the aforementioned issues in a unified framework, the final op-
timization problem is formulated as
$$\min_{w^I, b^I, \xi^I, w^R, b^R, \xi^R} Q_1 + \alpha Q_2 + \beta Q_3
= \min_{w^I, b^I, \xi^I, w^R, b^R, \xi^R} \left[\frac{1}{2}\left\|w^I\right\|^2 + C^I \sum_{m=1}^{M}\xi^I_m\right]
+ \alpha\left[\frac{1}{2}\left\|w^R\right\|^2 + C^R \sum_{m=1}^{M}\xi^R_m\right]
+ \beta \sum_{m=1}^{M} L\left(y_m f(x^I_m),\; y_m \max_{R_n \in I_m} g(x^R_n)\right) \qquad (5)$$

$$\text{s.t. } y_m f(x^I_m) \ge 1 - \xi^I_m, \;\; \xi^I_m \ge 0; \quad
y_m \max_{R_n \in I_m} g(x^R_n) \ge 1 - \xi^R_m, \;\; \xi^R_m \ge 0, \quad (m = 1, 2, \cdots, M).$$

where α and β are combination coefficients.

2.3 Solution to Optimization Problem


In this paper, the above problem is treated as joint optimization for {wI , bI , ξ I }
and {wR , bR , ξ R }. In order to solve it, an iterative approach is proposed, in
which image-level SVM and region-level multi-instance SVM are constructed
respectively, and the details are explained as follows.
Either {wI , bI , ξ I } or {wR , bR , ξ R } can be adopted in the first iteration, and
their original values are calculated by (1) or (2).

With fixed {wR , bR , ξ R }, the optimization problem (5) is reduced to


$$\min_{w^I, b^I, \xi^I} \frac{1}{2}\left\|w^I\right\|^2 + C^I \sum_{m=1}^{M}\xi^I_m
+ \beta \sum_{m=1}^{M} \left(\max\{y_m f(x^I_m), 1\} - \Delta^R_m\right)^2 \qquad (6)$$

$$\text{s.t. } y_m f(x^I_m) \ge 1 - \xi^I_m, \quad \xi^I_m \ge 0, \quad (m = 1, 2, \cdots, M).$$

where $\Delta^R_m = \max\{y_m \max_{R_n \in I_m} g(x^R_n), 1\}$. By introducing new variables
$\lambda^I_m$ $(m = 1, 2, \cdots, M)$, we can further rewrite (6) as
$$\min_{w^I, b^I, \xi^I, \lambda^I} \frac{1}{2}\left\|w^I\right\|^2 + C^I \sum_{m=1}^{M}\xi^I_m
+ \beta \sum_{m=1}^{M} \left(1 + \lambda^I_m - \Delta^R_m\right)^2 \qquad (7)$$

$$\text{s.t. } y_m f(x^I_m) \ge 1 - \xi^I_m, \;\; \xi^I_m \ge 0; \quad
y_m f(x^I_m) \le 1 + \lambda^I_m, \;\; \lambda^I_m \ge 0, \quad (m = 1, 2, \cdots, M).$$
By changing it to its dual problem, we can solve the problem of quadratic pro-
gramming efficiently.
While with fixed $\{w^I, b^I, \xi^I\}$, the optimization problem (5) becomes

$$\min_{w^R, b^R, \xi^R} \alpha\left[\frac{1}{2}\left\|w^R\right\|^2 + C^R \sum_{m=1}^{M}\xi^R_m\right]
+ \beta \sum_{m=1}^{M} \left(\max\{y_m \max_{R_n \in I_m} g(x^R_n), 1\} - \Delta^I_m\right)^2 \qquad (8)$$

$$\text{s.t. } y_m \max_{R_n \in I_m} g(x^R_n) \ge 1 - \xi^R_m, \quad \xi^R_m \ge 0, \quad (m = 1, 2, \cdots, M).$$

where $\Delta^I_m = \max\{y_m f(x^I_m), 1\}$. To deal with the problem, similarly as in [12], for
each image $I_m$ a selector variable is defined as

$$S_m = \arg\max_{n:\, R_n \in I_m} g(x^R_n) \qquad (9)$$
Then (8) can be written as

$$\min_{w^R, b^R, \xi^R} \alpha\left[\frac{1}{2}\left\|w^R\right\|^2 + C^R \sum_{m=1}^{M}\xi^R_m\right]
+ \beta \sum_{m=1}^{M} \left(\max\{y_m g(x^R_{S_m}), 1\} - \Delta^I_m\right)^2 \qquad (10)$$

$$\text{s.t. } y_m g(x^R_{S_m}) \ge 1 - \xi^R_m, \quad \xi^R_m \ge 0, \quad (m = 1, 2, \cdots, M).$$

It can be seen that (10) has the same form as (6). If the values of Sm
(m = 1, 2, · · · , M ) are determined, (10) can also be solved by the aforementioned
method. Therefore, the final solution of the original problem (8) can be obtained
by iteratively calculating (9) and (10).
So far, only linear SVM is adopted in our proposal. As only the inner product
of feature vectors is involved in the dual problem for solving (6) and (8), kernel
trick can be easily introduced in our proposed optimization framework, and
nonlinear SVM can also be utilized conveniently.
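A heavily simplified sketch of the iterative solution is given below using scikit-learn SVMs; it alternates between the image-level SVM and an MI-SVM-style selection of one representative region per image, and it omits the β consistency term and the extra λ constraints of Eqs. (6)–(10), so it illustrates only the alternation, not the full method. All names are illustrative.

import numpy as np
from sklearn.svm import SVC

def train_multi_svm(bags, n_iters=5, gamma=0.001, C=1000.0):
    """bags: list of TrainingImage-like objects with .label, .x_image, .x_regions."""
    y = np.array([b.label for b in bags])
    X_img = np.array([b.x_image for b in bags])

    img_svm = SVC(kernel='rbf', gamma=gamma, C=C).fit(X_img, y)

    # initialise each bag's representative region arbitrarily (first region)
    reps = [0] * len(bags)
    reg_svm = None
    for _ in range(n_iters):
        X_reg = np.array([b.x_regions[r] for b, r in zip(bags, reps)])
        reg_svm = SVC(kernel='rbf', gamma=gamma, C=C).fit(X_reg, y)
        # re-select, for every bag, the region with the largest decision value
        new_reps = [int(np.argmax(reg_svm.decision_function(np.array(b.x_regions))))
                    for b in bags]
        if new_reps == reps:
            break
        reps = new_reps
    return img_svm, reg_svm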
After the iterative process has converged, all the parameters for the two clas-
sifiers can be calculated. For a database image It with image-level feature xIt

Fig. 1. Example images in the SIVAL data set

and region-level features $\{x^R_s \mid R_s \in I_t\}$, the final classification function $h(I_t)$ is
determined by combining the two kinds of information together:

$$h(I_t) = \omega f(x^I_t) + (1 - \omega) \max_{R_s \in I_t} g(x^R_s) \qquad (11)$$

where $\omega$ $(0 \le \omega \le 1)$ is a tunable combination parameter and can simply be set to
0.5 for convenience. Then the images with the largest values of $h(I_t)$ are returned
as the retrieval results.
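Ranking a database with Eq. (11) then amounts to the following sketch, reusing the two classifiers from the previous illustration; the data layout is again an assumption of the sketch.

import numpy as np

def rank_database(db_images, img_svm, reg_svm, omega=0.5):
    """Score each database image with Eq. (11) and return indices sorted by score."""
    scores = []
    for img in db_images:
        f_val = img_svm.decision_function(img.x_image.reshape(1, -1))[0]
        g_val = np.max(reg_svm.decision_function(np.array(img.x_regions)))
        scores.append(omega * f_val + (1.0 - omega) * g_val)
    return np.argsort(scores)[::-1]   # most relevant first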

3 Experimental Results
The proposed method is evaluated on the SIVAL (Spatially Independent, Vari-
able Area, and Lighting) image benchmark, which is widely used for multi-
instance learning. It consists of 25 different categories, each includes 60 images.
The images in one category contain the same object photographed against highly
diverse backgrounds. The object may occur anywhere in the images and may be
photographed at a wide-angle or close up. Some example images are shown in
Fig. 1. All the images in the data set have been segmented, and each region is
represented by a 30-dimensional feature vector.
We conduct 30 independent runs for all the categories in the database. In
each category, 8 positive and 8 negative images are randomly selected as training
samples. To compare with other methods, we also use the area under the receiver
operating characteristic curve (AUC) as the performance measure.
For image representation, the method of locality-constrained linear coding [13]
is adopted for constructing image-level features, and the region-level features
provided by the data set are directly used after normalization. The parameters
in the optimization framework are set as follows. The penalty parameters C I
and C R are set to 1000. The combination coefficients α and β are set to 1 and
100, respectively. Nonlinear SVMs are constructed in the experiments, in which
Gaussian kernel $K(u, v) = \exp(-\gamma \|u - v\|^2)$ is adopted, and $\gamma$ is set to 0.001.
The methods used for comparison include multi-graph multi-instance learning
(MGMIL) [10], multi-instance learning based on region-level graph (GMIL) [8],
support vector machine with evidence region identification (EC-SVM) [6], as well
as image-level semi-supervised multi-instance learning (MISSL) [5]. The average

Table 1. Average AUC values and 95%-confidence intervals over 30 independent runs
on the SIVAL data set

Category              Our proposal   MGMIL      GMIL       EC-SVM     MISSL
FabricSoftenerBox 99.1±0.5 95.8±0.7 94.6±0.6 97.9±0.5 97.7±0.3
WD40Can 95.7±1.0 92.3±1.0 84.9±1.1 94.3±0.6 93.9±0.9
DataMiningBook 95.3±0.9 93.0±0.9 84.8±1.6 75.0±2.4 77.3±4.3
RapBook 95.1±0.9 87.0±1.0 77.0±1.7 68.6±2.3 61.3±2.8
GreenTeaBox 95.1±1.8 90.4±0.9 93.1±0.8 86.9±2.2 80.4±3.5
CheckeredScarf 94.0±1.0 89.3±0.9 94.0±0.6 96.9±0.5 88.9±0.7
AjaxOrange 93.3±1.3 93.9±0.7 88.2±1.2 93.8±2.1 90.0±2.1
GoldMedal 91.2±1.1 85.3±1.6 80.4±1.7 87.5±1.4 83.4±2.7
FeltFlowerRug 90.6±1.4 89.2±1.2 94.1±0.6 94.2±0.8 90.5±1.1
SpriteCan 90.3±1.0 79.7±1.4 79.4±1.3 85.4±1.2 81.2±1.5
SmileyFaceDoll 89.7±1.3 83.8±1.2 81.1±1.4 84.6±1.9 80.7±2.0
CokeCan 88.8±1.7 89.4±1.4 85.3±0.8 94.6±0.8 93.3±0.9
TranslucentBowl 88.7±1.9 79.3±1.7 79.6±1.3 74.2±3.2 63.2±5.2
BlueScrunge 88.7±2.1 80.4±1.4 73.4±1.8 74.1±2.4 76.8±5.2
JuliesPot 88.5±2.0 87.3±1.5 87.1±1.6 67.3±3.3 68.0±5.2
DirtyRunningShoe 86.7±1.6 82.9±1.1 89.5±0.8 90.3±1.3 78.2±1.6
CardboardBox 86.6±1.5 81.2±1.2 85.0±1.3 85.6±1.6 69.6±2.5
DirtyWorkGloves 81.0±1.9 80.1±1.2 78.1±1.7 83.0±1.3 73.8±3.4
Banana 78.2±2.2 77.1±0.9 69.5±1.6 69.1±2.9 62.4±4.3
StripedNotebook 77.4±1.7 73.8±1.3 83.7±1.7 75.6±2.3 70.2±2.9
CandleWithHolder 76.0±1.9 76.0±1.3 81.0±1.5 88.1±1.1 84.5±0.8
Apple 75.0±2.1 73.5±1.3 72.7±1.5 68.0±2.6 51.1±4.4
GlazedWoodPot 73.5±1.3 74.9±1.1 76.4±1.2 68.0±2.8 51.5±3.3
LargeSpoon 71.8±1.4 73.7±1.3 64.3±1.4 61.3±1.8 50.2±2.1
WoodRollingPin 69.7±1.2 70.3±1.2 72.4±1.9 66.9±1.7 51.6±2.6
Average 86.4 83.2 82.0 81.3 74.8

AUC values and the 95%-confidence intervals for our proposal and the other
four methods are listed in Table 1. We can see that the overall performance of
our proposal is the best. Among the 25 categories, our proposal achieves the highest
AUC values on 14. Especially for “RapBook”, “TranslucentBowl”, and
“BlueScrunge”, the performance can be improved by more than 8%. MGMIL also
adopts both image- and region-level representations. As image-level features can
provide additional information, it outperforms GMIL, EC-SVM and MISSL, in
which only region features are considered. Comparing MGMIL with our proposal,
the main difference is that graph-based learning is involved in MGMIL, while
SVM is introduced as the basic classifier in our method. The superiority of our
proposal demonstrates the advantage of exploring information in the feature
space over analyzing relationship between graph nodes.
As far as computational load is concerned, we also compare our proposal with
MGMIL. Both the methods develop optimization frameworks based on two kinds
of features, but the numbers of involved variables are different. MGMIL wants
to calculate the soft labels for images and regions, while our proposal aims to
construct effective SVMs. In general, the number of all the images and regions is
larger than the number of classifier parameters, hence our proposal often costs
less time than MGMIL.

4 Conclusions
In this paper, a novel multi-SVM multi-instance learning method is proposed
for object-based image retrieval. According to the two kinds of representations,
image-level SVM and region-level multi-instance SVM are adopted respectively.
For comprehensively utilizing the available information, a unified optimization
framework is developed, and the relationship between images and their seg-
mented regions is also taken into consideration to avoid constructing the two
classifiers separately. An iterative approach is introduced to solve the optimiza-
tion problem. It is demonstrated that our proposal is both effective and efficient
for image retrieval.

References
1. Rahmani, R., Goldman, S.A., Zhang, H., Krettek, J., Fritts, J.E.: Localized con-
tent based image retrieval. In: Proc. ACM SIGMM Int. Workshop Multimedia
Information Retrieval, pp. 227–236 (2005)
2. Zheng, Q.-F., Wang, W.-Q., Gao, W.: Effective and efficient object-based image
retrieval using visual phrases. In: Proc. ACM Int. Conf. Multimedia, pp. 77–80
(2006)
3. Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-instance learning via embedded
instance selection. IEEE Trans. Pattern Analysis and Machine Intelligence 28(12),
1931–1947 (2006)
4. Feng, S., Xu, D.: Transductive multi-instance multi-label learning algorithm with
application to automatic image annotation. Expert Systems with Applications 37,
661–670 (2010)
5. Rahmani, R., Goldman, S.A.: MISSL: Multiple-instance semi-supervised learning.
In: Proc. Int. Conf. Machine Learning, pp. 705–712 (2006)
6. Li, W.-J., Yeung, D.-Y.: Localized content-based image retrieval through evidence
region identification. In: Proc. IEEE Int. Conf. Computer Vision and Pattern
Recognition, pp. 1666–1673 (2009)
7. Tang, J., Hua, X.-S., Qi, G.-J., Wu, X.: Typicality ranking via semi-supervised
multiple-instance learning. In: Proc. ACM Int. Conf. Multimedia, pp. 297–300
(2007)
8. Wang, C., Zhang, L., Zhang, H.-J.: Graph-based multiple-instance learning for
object-based image retrieval. In: Proc. ACM Int. Conf. Multimedia Information
Retrieval, pp. 156–163 (2008)
9. Tang, J., Li, H., Qi, G.-J., Chua, T.-S.: Image annotation by graph-based infer-
ence with integrated multiple/single instance representations. IEEE Trans. Multi-
media 12(2), 131–141 (2010)
10. Li, F., Liu, R.: Multi-graph multi-instance learning for object-based image and
video retrieval. In: Proc. ACM Int. Conf. Multimedia Retrieval (2012)
11. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York
(1995)
12. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for
multiple-instance learning. In: Advances in Neural Information Processing Systems
(2002)
13. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained
linear coding for image classification. In: Proc. IEEE Int. Conf. Computer Vision
and Pattern Recognition, pp. 3360–3367 (2010)
Maximizing Edit Distance Accuracy
with Hidden Conditional Random Fields

Antoine Vinel and Thierry Artières

Université Pierre et Marie Curie (LIP6), Paris, France


{antoine.vinel,thierry.artieres}@lip6.fr

Abstract. Handwriting recognition aims at predicting a sequence of


characters from an image of a handwritten text. Main approaches rely on
learning statistical models such as Hidden Markov Models or Conditional
Random Fields, whose quality is measured through character and word
error rates while they are usually not trained to optimize such criterion.
We propose an efficient method for learning Hidden Conditional Random
Fields to optimize the error rate within the large margin framework.

Keywords: Document Analysis, Conditional Random Fields, Maximum


Margin Learning, Handwriting recognition.

1 Introduction
Handwriting recognition (HWR) aims at transforming a raw image of a hand-
written document into a sequence of characters and words. HWR systems consist
first in performing some preprocessing steps on a sliding window over each text
lines, yielding to a sequence of real-valued feature vectors, and second in apply-
ing statistical models such as Hidden Markov Models (HMMs) or Conditional
Random Fields (CRFs). These systems take as input a T-length sequence of
observations x and output a L-length sequence of characters y with temporal
boundaries. Their accuracy is systematically evaluated from the edit distance
which counts the number of character errors (insertions, deletions and replacements)
required to align a predicted string y and the true string $y'$ (e.g. the word to recognize),
thus ignoring irrelevant temporal boundary shifts. One can assign different weights to
these error types (we use uniform weighting here). The accuracy of a recognition engine,
which will be further denoted by EDA (Edit Distance Accuracy), is defined from the
edit distance as follows:

$$\text{accuracy}(y, y') = \text{EDA}(y, y') = \frac{\text{Hits} - \text{Insertions}}{|y'|} \qquad (1)$$

where Hits denotes the number of characters that have been correctly predicted and $|y'|$
denotes the length (the number of characters) of the true string. Most popular
approaches rely on HMMs trained with either generative or discriminative crite-
rion such as Maximum Mutual Information (MMI) [1], Minimum Classification
Error and variants (MCE, P-MCE) [2–4] and Minimum Phone Error (MPE) [5].
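For concreteness, EDA can be computed from a standard edit-distance alignment as in the following Python sketch; the backtracking convention in the case of ties is an arbitrary choice of the sketch.

def eda(pred, truth):
    """Edit Distance Accuracy of Eq. (1): (hits - insertions) / |truth|.

    pred and truth are sequences of character labels; all edit operations
    are weighted uniformly."""
    n, m = len(pred), len(truth)
    # dp[i][j] = minimal edit cost to align pred[:i] with truth[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (pred[i - 1] != truth[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrace to count hits (matches) and insertions (extra predicted characters)
    hits = insertions = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (pred[i - 1] != truth[j - 1]):
            hits += pred[i - 1] == truth[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            insertions += 1          # a predicted character with no counterpart
            i -= 1
        else:
            j -= 1                   # a missed true character (deletion)
    return (hits - insertions) / len(truth)

# e.g. eda("buybacks", "buybacks") == 1.0, while eda("brunybanclrus", "buybacks") is far lower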


Recently, building on the success of large margin learning ideas (popularized with
Support Vector Machines), some works have demonstrated the strong potential
of this approach for learning HMMs [7–11]. Most of them focus on optimizing
a Hamming distance loss criterion (a frame-based error measure which takes into
account the temporal boundaries mismatch) which is simpler to optimize, while
their performances are measured with EDA.
Recently, pure discriminative models have been also proposed to deal with
sequence labelling tasks. Hidden Conditional Random Fields (HCRFs), which are
CRFs [12] powered by hidden states (alike HMMs) [13–15] have been successfully
applied to various signal labelling tasks [16, 14, 13, 17–19]. HCRFs are trained
either to maximize the conditional likelihood of label sequences given observation
sequences or a margin criterion based on a naive zero-one loss (the sentence is
completely recognized or not) or at best a Hamming distance loss [20, 16, 21].
This work describes an approach for learning HCRFs with a large margin
criterion to maximize the edit distance accuracy. Although this has been often
mentioned as a perspective of previous works [22] this extension is not straight-
forward. We first recall some background on HCRFs, then we detail our algo-
rithm and provide experimental results on two HWR datasets.

2 Background
2.1 HCRF
HCRFs are discriminative graphical models that have been proven useful for
handling complex signals like handwriting and speech [13–15]. A common ap-
proach consists in using a similar architecture as for HMM systems: a left-right
sub-model for every character to be recognized, where all ending states of all
sub-models are connected to all starting states. Learning and decoding is per-
formed within this global model. A HCRF defines a conditional probability of a
label sequence y given an observation sequence x as follows:

$$p(y|x, w) = \sum_{h \in S(y)} p(h|x, w) \qquad (2)$$

where x is a T -length observation sequence x = (x1 , ..., xT ), y = (y1 , ..., yL ) is


a sequence of L labels (characters), and h = (h1 , ..., hT ) is a state sequence.
Note that L is usually much less than T . The set of all possible labels (resp.
hidden states) is denoted by Y (resp. H). Finally S(y) denotes the set of all
states sequences that match the label sequence y and w stands for the HCRF
parameter set. The posterior probability of a sequence of states is given by:
$$p(h|x, w) = \frac{\exp\langle\Phi(x, h), w\rangle}{Z(x, w)} \qquad (3)$$

where $\Phi(\cdot, \cdot)$ is a real-valued vector, called the joint feature map, which is commonly
defined as a sum of local feature maps, i.e. $\Phi(x, h) = \sum_t \varphi(x_t, h_t, h_{t-1})$,
so that all necessary quantities for training and inference may be computed

using dynamic programming procedures similar to algorithms used for HMMs


(e.g. forward and Viterbi algorithms). The denominator Z(.) is a normal-
ization
 factor known as the ”partition function”, and defined by Z(x, w) =
h∈H T exp Φ(x, h), w. This term is computationally expensive, and has to be
computed at every iteration for each sequence during training.

2.2 Maximum Margin Learning for HCRFs


HCRFs are usually trained to maximize  the conditional likelihood. Using a train-
ing dataset B = (xi , yi )|i = 1, .., N with N training sequences, the optimal
parameter set for the HCRF is defined by :

w∗ = argminw − log p(yi |xi , w) (4)
i=1..N
Alternatively, the large margin framework was successfully applied to many
structured problems [23]. Optimizing a margin criterion has a priori two ad-
vantages over maximizing conditional likelihood as in Eq. (4): first, there is no
need to compute the costly partition function, second, margin based criterion
are known to achieve good generalization. Training resumes to:
⎧ ∗ 

⎪ w = argminw,ξi 12 w 2 + C ξi

such that
(5)

⎪ ∀(i, y = yi ), δF (yi , y) ≥ 1 − ξi

∀i ξi ≥ 0
where δF (yi , y) = F (xi , yi , w) − F (xi , y, w) and F (x, y, w) is a discriminant
function. Moreover, the standard zero-one loss can be replaced by a penalty
term Δ(y, yi ) that fits better the structured output prediction framework (e.g.
Hamming distance loss).
The Margin Rescaling (MR) and the Slack Rescaling (SR) frameworks consist
in solving the following problem :
⎧ ∗ 1 2


⎪ w = argminw,ξi 2 w + C ξi



⎪ such that


⎨ ∀i ξi ≥ 0
and (6)

⎪ i
≥ i


⎪ δF (y , y) Δ(y, y ) ξi (MR case)


⎪ ∀(i, y = y ),
⎪ i or

δF (yi , y) ≥ 1 − Δ(y,y
ξi
i) (SR case)

This approach has been used in the past both for generative models (HMMs
[21, 20]) and discriminant ones (CRF and HCRFs [23, 16, 21]).
Although F (x, y, w) = log p(y|x, w) would be a natural choice for learn-
ing HCRFs, a common and simpler choice is an approximation F (x, y, w) =
maxh∈S(y) Φ(x, h), w which resumes to F (x, y, w) = maxh∈S(y) log p(y, h|x, w)
≈ log p(y|x, w).
The above optimization problems are using as objective an upper bound of the
Δ-loss. They may be solved using quadratic programming algorithms. However,
48 A. Vinel and T. Artières

since the number of the constraints is exponential in the length of the input, the
standard solvers cannot be used easily. Efficient algorithms that overcome this
difficulty either rely on an online learning scheme (see section 3.1) or exploit a
limited memory algorithm to keep the size of the quadratic program limited [21].

3 Optimization
Our motivation to maximize the edit distance based accuracy (EDA) rather
than the Hamming distance accuracy (HDA) comes from the weak correlation
between those measures. Indeed, the scatter plot (on figure 1) shows, for each test
sequence, a point whose coordinates are the HDA and the EDA. An example of
an extreme case is shown on the right part of figure 1 with a high HDA (i.e. 76%
of the image’s columns are correctly labelled) and a low EDA (25%) suffering
from an important number of insertions.
Including the edit distance loss as penalty term Δ(y, yi ) = dedit (y, yi ) is far
from being straightforward since the objective function is piecewise constant.
All gradient-based algorithms are excluded in favour of contrastive-based [22] or
margin-based ones, we investigate these latter methods here.

3.1 Online Learning


The passive-aggressive algorithm [24] solves iteratively a succession of simple
quadratic problems. For each iteration n, we pick a new training sequence
(xin , yin ) at random, and update the parameter set wn as the solution of the
following problem (SR case is similar) :
$$w^* = \arg\min_{w, \xi} \frac{1}{2}\left\|w - w^{n-1}\right\|^2 + C\xi
\quad \text{s.t.} \quad \delta F(y^{i_n}, \hat{y}^{i_n}) \ge \Delta(\hat{y}^{i_n}, y^{i_n}) - \xi, \quad \xi \ge 0 \quad \text{(MR case)} \qquad (7)$$
Note that when there is not any margin violation (i.e. ξ = 0 ) the optimal
solution is trivially wn ← w∗ = wn−1 . The main interest of this formulation

[Scatter plot of EDA vs. HDA over the test sequences (left); extreme example (right): the word “buybacks” misrecognized as “brunybanclrus”, giving HDA 76% and EDA 25% (7 matching labels, 5 insertions, 1 replacement).]

Fig. 1. Relation between Edit Distance Accuracy and Hamming Distance Accuracy
EDA vs HDA scatter plot (left) Example of an extreme case (right)

is that in the margin violation case, the optimal solution w∗ can be computed
analytically.
$$w^* = w^{n-1} + \min\!\left(C,\; \frac{\Delta(\hat{y}^{i_n}, y^{i_n}) - \langle\delta\Phi(\hat{h}), w\rangle}{\|\delta\Phi(\hat{h})\|^2}\right)\delta\Phi(\hat{h}) \quad \text{(MR case)} \qquad (8)$$

where $\delta\Phi(\hat{h})$ is the difference, in the joint feature space, between the most violating
hidden state sequence $\hat{h}^{i_n}$ and the hidden state sequence $h^{i_n}$ matching the ground
truth $y^{i_n}$. These quantities are defined according to:

$$\hat{y}^{i_n} = \arg\max_{y} \Delta(y, y^{i_n}) - \delta F(y^{i_n}, y) \qquad (9)$$
$$\hat{h}^{i_n} = \arg\max_{h \in S(\hat{y}^{i_n})} \langle\Phi(x^{i_n}, h), w\rangle \qquad (10)$$
$$h^{i_n} = \arg\max_{h \in S(y^{i_n})} \langle\Phi(x^{i_n}, h), w\rangle \qquad (11)$$
$$\delta\Phi(\hat{h}) = \Phi(x^{i_n}, h^{i_n}) - \Phi(x^{i_n}, \hat{h}^{i_n}) \qquad (12)$$
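Given these quantities, the update of Eq. (8) is a one-line computation; the following sketch (with a small guard against a zero feature difference, added only for numerical safety) illustrates it.

import numpy as np

def pa_update(w, phi_true, phi_hat, delta, C):
    """One MR passive-aggressive step (cf. Eq. (8)).

    phi_true : Phi(x, h) for the state sequence matching the ground truth
    phi_hat  : Phi(x, h_hat) for the most violating hypothesis
    delta    : Delta(y_hat, y), the edit-distance-based loss of the hypothesis
    """
    d_phi = phi_true - phi_hat
    loss = delta - w @ d_phi                              # margin violation
    if loss <= 0.0:
        return w                                          # passive: no violation, keep w
    tau = min(C, loss / max(d_phi @ d_phi, 1e-12))        # aggressive: clipped step size
    return w + tau * d_phi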

3.2 Negative Example Selection


Computing the best negative example for the i-th training sample, $\hat{h}^{i_n}$, requires $\hat{y}^{i_n}$
(Eq. (9)). These quantities cannot be computed by dynamic programming routines since
the edit distance is not decomposable over the frames. We propose to incrementally build
a lattice enclosing a limited number of best hypotheses for which we will compute the
edit distance.

Fig. 2. Hypothesis lattice built for selecting the negative example: initial path, 2nd expansion, and 3rd to 7th expansions.

Inspired by the word-graph algorithm proposed e.g. in [25], we initialize the graph using a standard Viterbi-like algorithm to generate the prediction of the current model. In Figure 2 this step corresponds to the boldest black path, the word rwnning (starting with the character r from frames 1 to 30, w from frames 31 to 90, ...); we add a node in the graph for each character (of the decoded string) at its starting time. Then, we iterate what we call expansions of the graph up to a maximum number of expansions E.
Every iteration e < E selects the most likely alternative best path from the
beginning of the sequence to one node of the graph. For instance in figure 2, the
node i was selected for the second expansion yielding a new hypothesis ruming.

The E expansions may represent (in the theoretical worst case) a huge number of explored hypotheses (i.e. alternative strings): O(exp(E)). The empirical complexity was bounded by (E/2)³. To deal with it, we used an efficient dynamic programming routine to compute ĥ_{i_n} on the graph by factorizing the edit distance computation as much as possible.
This part of our approach differs from MPE [5, 6] in that we use the true EDA (instead of a frame-based "local accuracy" estimation computed from the character overlap in a hypothesis lattice).
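For reference, the Levenshtein distance itself, and the edit distance accuracy derived from it, can be computed with the plain quadratic-time dynamic program sketched below; this is the textbook version, not the factorized lattice computation described above, and normalising by the reference length is an assumption that reproduces the 25% EDA of the example in Fig. 1.

def levenshtein(hyp, ref):
    # Plain dynamic-programming edit distance between two label sequences.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (h != r)))     # substitution
        prev = curr
    return prev[-1]

def edit_distance_accuracy(hyp, ref):
    # EDA = 1 - edit distance normalised by the reference length (may be negative).
    return 1.0 - levenshtein(hyp, ref) / len(ref)

# e.g. edit_distance_accuracy("brunybanclrus", "buybacks") == 0.25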

4 Experiments

We performed experiments on two handwritten word datasets. The YAWDa dataset is a home-made dataset (freely available at [26]) of 2,400 handwritten words whose character distribution is roughly uniform. We used two versions of this set. In R16 the raw images are rescaled to a 16-pixel height and a frame consists of one column of pixels (resulting in a 16-dimensional feature vector). In R32 the raw images are rescaled to a 32-pixel height and every frame is the concatenation of five columns (two "context columns" on both sides of the central frame), leading to 160-dimensional frames. R16 is obviously harder than R32; it allows investigating the behaviour of the methods with "low-informative" data, while R32 helps exploring the models' behaviour in a "richer-context" setting. We also performed experiments on a 10k-word benchmark dataset (∼46k characters) extracted from the IAM corpus [27]; pre-processing yields nine geometrical features computed on a sliding window [28]. As in the YAWDa R32 set, we use augmented frames by including two context frames. As done in [13], we also add the cross-product of all features, which guarantees that the model has a representational power at least equal to that of HMMs with one Gaussian per state. The final dimension of each frame is 1080. In all cases the system's performance is measured with the EDA. As often done with online algorithms, we

Fig. 3. Comparison of HCRF learning criteria (EDM MR, EDM SR, HDM, CML and the [29]-close approach) w.r.t. the number of expansions E: accuracy (a) on YAWDa R16 and (b) on YAWDa R32.



used iterate averaging. This technique uses (in the inference process on the validation and test sets) an average of w over some of the last iterations. In our case, we performed averaging over the last pass over the whole training set.
We first compare on R16 and R32 (Fig. 3) the accuracies of a few HCRFs trained by maximizing the following learning criteria: the conditional likelihood (CML), a margin based on the Hamming distance (denoted HDM), and a margin based on the edit distance (denoted EDM) with the Slack or Margin rescaling variants. We also report the accuracy obtained with an HCRF trained with a method closely similar to [29], using the "Margin rescaling" approach and a 1-expansion graph. It differs from the original method in that we do not model stay duration in the states.
The EDM approach significantly outperforms the other methods on both datasets. Both strategies of EDM (Margin and Slack rescaling) work well, with a slight advantage for Margin rescaling. The HDM approach performs well too, although slightly worse than EDM with the Margin rescaling strategy. Interestingly, EDM already outperforms the other methods when exploiting a small number of expansions, which means a limited complexity overhead, and the method steadily improves with the number of expansions.

Table 1. Performance comparison on IAM

Models                                          Specifications             Accuracy
Hidden Markov Model (HMM)                       8 states                   51.2
                                                14 states                  59.6
HMM trained with                                8 states                   70.7
Hamming Distance Margin (HDM)                   14 states                  70.3
Hidden Conditional Random Field (HCRF)          5 states                   70.6
HCRF trained with HDM                           5 states                   72.0
HCRF trained with EDM ([29]-close approach)     5 states                   71.1
HCRF trained with EDM,                          5 states - 5 expansions    72.1
Slack Rescaling case                            5 states - 10 expansions   72.4
HCRF trained with EDM,                          5 states - 5 expansions    72.2
Margin Rescaling case                           5 states - 10 expansions   72.9

This table compares a number of methods on the more complex and bigger IAM dataset. All methods have been implemented and tuned by us. These results show that the margin-based methods (HDM and EDM) clearly outperform non-discriminative and discriminative training of HMMs (HMM and HDM HMMs), as well as standard discriminative training of HCRFs (CML). Moreover the margin EDM methods again achieve the best results on this dataset. Although the improvement over HDM-HCRF is modest, it must be noted that all these results are already high on this dataset and that any improvement is very hard to obtain. Finally, it must be noted that we deliberately limited training complexity on this dataset by exploiting a rather small number of expansions to fit our

computational power, but one can expect even better results by increasing the number of expansions and the training time.

5 Conclusion
We proposed a new algorithm for learning HCRFs, relying on a max-margin criterion to directly optimize the edit distance accuracy in the passive-aggressive framework. We detailed a lattice-based approach allowing a factored computation of the Levenshtein distance for the negative example selection. We finally showed the benefits of this approach on several handwriting labelling tasks with respect to a number of alternative discriminative learning schemes.

References
1. Woodland, P.C., Povey, D.: Large scale discriminative training of hidden markov
models for speech recognition. Computer Speech & Language (1) (2002)
2. Juang, B.H., Katagiri, S.: Discriminative learning for minimum error classification.
IEEE Transactions on Signal Processing (12) (1992)
3. Fu, Q., He, X., Deng, L.: Phone-discriminating minimum classification error
(p-mce) training for phonetic recognition. In: Interspeech (2007)
4. He, X., Deng, L., Chou, W.: A novel learning method for hidden markov models
in speech and audio processing. In: Multimedia Signal Processing. IEEE (2006)
5. Povey, D., Woodland, P.C.: Minimum phone error and i-smoothing for improved
discriminative training. In: ICASSP, vol. 1, p. I–105. IEEE (2002)
6. Deng, L., Wu, J., Droppo, J., Acero, A.: Analysis and comparison of two speech
feature extraction/compensation algorithms. In: SPL (2005)
7. Cheng, C.-C., Sha, F., Saul, L.K.: Online learning and acoustic feature adaptation
in large-margin hidden markov models. JSP (6) (December 2010)
8. Sha, F., Saul, L.K.: Large margin hidden markov models for automatic speech
recognition. In: NIPS (2007)
9. Cheng, C.C., Sha, F., Saul, L.K.: A fast online algorithm for large margin training
of continuous density hidden markov models. In: Interspeech (2009)
10. Do, T.M.T., Artieres, T.: Maximum margin training of gaussian hmms for hand-
writing recognition. In: ICDAR, pp. 976–980. IEEE Computer Society (2009)
11. Yu, D., Deng, L., He, X., Acero, A.: Large-margin minimum classification error
training for large-scale speech recognition tasks. In: ICASSP (2007)
12. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In: ICML Workshop (2001)
13. Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random
fields for phone classification. In: Interspeech (2005)
14. Do, T.-M.-T., Artieres, T.: Conditional random fields for online handwriting recog-
nition. In: ICFHR (2006)
15. Morency, L.P., Quattoni, A., Darrell, T.: Latent-dynamic discriminative models
for continuous gesture recognition. In: CVPR, pp. 1–8. IEEE (2007)
16. Wang, Y., Mori, G.: Max-margin hidden conditional random fields for human ac-
tion recognition. In: CVPR, pp. 872–879. IEEE (2009)
17. Vinel, A., Do, T.M.T., Artières, T.: Joint optimization of hidden conditional ran-
dom fields and non linear feature extraction. In: ICDAR (2011)

18. Soullard, Y., Artieres, T.: Hybrid hmm and hcrf model for sequence classification.
In: ESANN (2011)
19. Reiter, S., Schuller, B., Rigoll, G.: Hidden conditional random fields for meeting
segmentation. In: Multimedia and Expo. IEEE (2007)
20. Taskar, B., Guestrin, C., Koller, D.: Max-margin markov networks. In: NIPS (2003)
21. Do, T.M.T., Artières, T.: Large margin training for hidden markov models with
partially observed states. In: ICML (2009)
22. Keshet, J., Cheng, C.-C., Stoehr, M., McAllester, D.A.: Direct error rate mini-
mization of hidden markov models. In: Interspeech (2011)
23. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. JMLR (2) (2006)
24. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-
aggressive algorithms. Journal of Machine Learning Research (2006)
25. Tran, B.H., Seide, F., Steinbiss, T.: A word graph based n-best search in continuous
speech recognition. In: ICSLP (1996)
26. http://YAWDa.lip6.fr/
27. Marti, U.V., Bunke, H.: A full english sentence database for off-line handwriting
recognition. In: ICDAR (2002)
28. Marti, U.V., Bunke, H.: Handwritten sentence recognition. In: ICPR (2000)
29. Keshet, J., Shalev-Shwartz, S., Bengio, S., Singer, Y., Chazan, D.: Discriminative
kernel-based phoneme sequence recognition. In: Interspeech (2006)
Background Recovery by Fixed-Rank Robust
Principal Component Analysis

Wee Kheng Leow, Yuan Cheng, Li Zhang, Terence Sim, and Lewis Foo

Department of Computer Science, National University of Singapore


Computing 1, 13 Computing Drive, Singapore 117417
{leowwk,cyuan,zhangli,tsim,lewis}@comp.nus.edu.sg

Abstract. Background recovery is a very important theme in computer


vision applications. Recent research shows that robust principal compo-
nent analysis (RPCA) is a promising approach for solving problems such
as noise removal, video background modeling, and removal of shadows
and specularity. RPCA utilizes the fact that the background is common
in multiple views of a scene, and attempts to decompose the data ma-
trix constructed from input images into a low-rank matrix and a sparse
matrix. This is possible if the sparse matrix is sufficiently sparse, which
may not be true in computer vision applications. Moreover, algorithmic
parameters need to be fine tuned to yield accurate results. This paper
proposes a fixed-rank RPCA algorithm for solving background recover-
ing problems whose low-rank matrices have known ranks. Comprehensive
tests show that, by fixing the rank of the low-rank matrix to a known
value, the fixed-rank algorithm produces more reliable and accurate re-
sults than the existing low-rank RPCA algorithm.

Keywords: Background recovery, reflection removal, robust PCA.

1 Introduction

Background recovery is a very important recurring theme in computer vision


applications. Traditionally, different approaches have been developed to solve
different varieties of the problem. Recent research in robust principal com-
ponent analysis (RPCA) offers a promising alternative approach for solving
problems such as noise removal, video background modeling, and removal of
shadows and specularity [2,12]. RPCA utilizes the fact that multiple views of
a scene contain consistent information about the common background. It con-
structs a data matrix from multiple views and decomposes it into a low-rank
matrix that contains the background and a sparse matrix that captures non-
background components. It has been proved that an exact solution of the RPCA problem is available if the data matrix is composed of a sufficiently low-rank matrix and a sufficiently sparse matrix [2,3,9,13]. Various algorithms have been proposed for solving the RPCA problem [6,9,12]. In particular, the methods based on
augmented Lagrange multiplier (ALM) have been shown to be among the
most efficient and accurate methods [9].


In computer vision applications, the non-background components may not be


sparse. Moreover, algorithmic parameters need to be fine tuned to yield accurate
results [1]. These difficulties are especially pronounced for the reflection removal problem, and no work on applying RPCA to reflection removal has been reported so far. Fortunately, each of these application problems can be framed as one of recovering a fixed-rank matrix from the data matrix because the rank of the low-rank
matrix is known. This paper proposes a fixed-rank RPCA algorithm based
on ALM (FrALM) for solving background recovering problems. Comprehensive
tests on reflection removal and video background modeling show that FrALM
produces more accurate results than does the low-rank ALM method (LrALM).
Moreover, FrALM can produce optimal or near optimal results over a much
wider range of parameter values than does LrALM, making it more reliable for
solving computer vision problems whose low-rank matrices have known ranks.

2 Existing RPCA Methods


Robust PCA is a term given to a long line of work that aims to render PCA
robust to gross corruption and outliers. Various methods have been proposed
including influence function [4], multivariate trimming [7], alternating minimiza-
tion [8], and random sampling [5]. These methods are either inefficient, having
non-polynomial time complexity, or do not guarantee optimal solutions [12].
A recent approach directly decomposes a corrupted data matrix into a low-
rank matrix and a sparse matrix. The corruption is assumed to be sparse, but
the noise amplitude can be large. Various methods have been proposed such as
iterative thresholding [9], proximal gradient [12], accelerated proximal gradient
[6], and augmented Lagrange multiplier method (ALM) [9]. In particular, ALM
has been shown to be among the most efficient and accurate methods [9]. These
methods require tuning of algorithmic parameters [1]. On the other hand, [1]
applies a Bayesian approach to estimate the algorithmic parameters along with
the matrices based on prior distributions of inverse variances.
In our applications, the rank of the low-rank matrix is known. So, we adopt
the ALM approach but fix the rank of the low-rank matrix, which provides
more specific constraint than do prior distributions. This approach allows our
algorithm to converge efficiently and accurately, as for the low-rank ALM method
of [9], and is simpler and more efficient than the Bayesian method of [1].
Other methods have been proposed to solve related but different problems. For
example, [11] solves low-rank matrix factorization and [10] computes a fixed-rank
representation for sparse subspace clustering. They are not directly applicable
to our application problem, which is a matrix decomposition problem.

3 Fixed-Rank RPCA
Given an m×n data matrix D, PCA seeks to recover a low-rank matrix A from the data matrix D such that the discrepancy or error E = D − A is minimized:

    min_{A,E} ||E||_F ,  subject to  rank(A) ≤ r,  D = A + E    (1)

where r ≪ min(m, n) is the target rank of A and ||·||_F is the Frobenius norm.


Eq. 1 can be solved by SVD but the solution will be vastly inaccurate if the error
entries in E are arbitrarily large. Under the conditions that A is low-rank and
E is sufficiently sparse, Wright et al. [12] show that A can be exactly recovered
by solving the following convex optimization problem:

    min_{A,E} ||A||_* + λ ||E||_1 ,  subject to  D = A + E    (2)

where ||·||_* denotes the nuclear norm and ||·||_1 denotes the 1-norm.


Lin et al. [9] reformulate Eq. 2 using the augmented Lagrange multiplier method. Their method (LrALM) uses a matrix Y and a parameter μ to merge the constraint into the objective function, leading to the following revised problem:

    min_{A,E} ||A||_* + λ ||E||_1 + ⟨Y, D − A − E⟩ + (μ/2) ||D − A − E||²_F    (3)

where ⟨U, V⟩ is the sum of the products of corresponding elements in U and V, and λ and μ are parameters that need to be specified. An iterative algorithm is applied to determine the A and E that minimize Eq. 3.
For reflection removal, a set of reflection images are arranged as column ma-
trices in D. If the images are well aligned such that the transmitted parts are
identical, then A captures the transmitted parts and has a rank of 1. If the
reflection is localized, E is sparse; otherwise, E is not sparse. Similar charac-
teristics are observed in background modeling of video taken with a stationary
camera.
When E is not sparse, LrALM may not recover accurate results unless the
parameter λ is carefully chosen (Section 4). If λ is too large, the trivial solution
of E = 0 is obtained, and A = D, which has a rank larger than the desired low
rank. On the other hand, if λ is too small, E = D and A = 0, which has a rank
of 0. So, the value of λ directly influences the rank of A recovered by LrALM.
Although Zhou et al. [13] prove theoretically that the optimal λ can be set to 1/√max(m, n), this is true only if A is low-rank and E is sufficiently sparse.
The parameter μ can also affect the accuracy of the recovered A by influencing
the rank of A (see discussion below).
To overcome the above difficulties, we frame the background recovery problem
as one of recovering a low-rank matrix A with a known rank r:

    min_{A,E} ||E||_F ,  subject to  rank(A) = r (known),  D = A + E.    (4)

To solve Eq. 4 robustly, we reformulate it in the same manner as the ALM


approach (Eq. 3), with the additional constraint of rank(A) = r. With A’s rank
fixed, it may seem that the term ||A||_* in Eq. 3 is redundant. Nevertheless, we choose to keep ||A||_* in Eq. 3 and solve for A using the ALM approach so that the
convergence and optimality properties proved by Lin et al. [9] are preserved.
Our algorithm (FrALM) adopts the exact ALM approach to solve fixed-rank
RPCA problem. It is similar to the low-rank ALM algorithm (LrALM) proposed
by Lin et al. [9], except that FrALM fixes the rank of A.

FrALM
Input: D, r, λ
1. A = 0, E = 0.
2. Y = sgn(D)/J(sgn(D)), μ > 0, ρ > 1.
3. Repeat until convergence:
4.    Repeat until convergence:
5.       (U, S, V) = svd(D − E + Y/μ).
6.       If rank(T_{1/μ}(S)) < r, A = U T_{1/μ}(S) Vᵀ; otherwise, A = U S_r Vᵀ.
7.       E = T_{λ/μ}(D − A + Y/μ).
8.    Y = Y + μ(D − A − E), μ = ρμ.
Output: A, E.
In line 2, sgn(·) computes the sign of each matrix element, and J(·) computes a scaling factor

    J(X) = max( ||X||_2 , λ⁻¹ ||X||_∞ )    (5)
as recommended in [9]. The function T in line 7 is a soft thresholding function:

    T_ε(x) = x − ε  if x > ε;   x + ε  if x < −ε;   0  otherwise.    (6)

The main difference between FrALM and LrALM lies in Step 6. Sr is the diagonal
matrix of singular values whose diagonal elements above r are set to 0. FrALM
fixes A’s rank to the desired rank r if a rank-r matrix is recovered. Otherwise,
it behaves in the same manner as LrALM. On the other hand, LrALM allows
A’s rank to increase beyond r if μ is too large.
FrALM is algorithmically equivalent to LrALM with a sufficiently small μ
that restricts the rank of A to r. Therefore, the convergence proof of LrALM
given in [9] applies to FrALM. Consequently, FrALM can converge as efficiently
as LrALM does (Fig. 2(a)). The advantage of FrALM over LrALM is that the
user does not have to specify the exact μ that fixes the rank of A to r.
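As an illustration, the NumPy sketch below transcribes the FrALM pseudocode. The convergence tests are replaced by fixed iteration counts plus a simple residual check, and the default ρ and initial μ from Section 4 are used; this is a sketch under those simplifications, not the authors' code.

import numpy as np

def soft_threshold(X, eps):
    # Element-wise soft thresholding T_eps (Eq. 6).
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def fralm(D, r, lam, rho=6.0, tol=1e-7, n_outer=30, n_inner=10):
    # Sketch of fixed-rank RPCA by exact ALM (FrALM); returns (A, E).
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    Y = np.sign(D)
    Y = Y / max(np.linalg.norm(Y, 2), np.abs(Y).max() / lam)   # Y = sgn(D)/J(sgn(D)), Eq. (5)
    mu = 0.5 / np.linalg.norm(Y, 2)                            # initial mu (default of Section 4)
    for _ in range(n_outer):
        for _ in range(n_inner):                               # inner loop: pseudocode lines 4-7
            U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
            s_thr = np.maximum(s - 1.0 / mu, 0.0)              # T_{1/mu}(S)
            if np.count_nonzero(s_thr) < r:
                A = (U * s_thr) @ Vt                           # rank already below r
            else:
                s_r = s.copy()
                s_r[r:] = 0.0                                  # S_r: keep only the r leading singular values
                A = (U * s_r) @ Vt
            E = soft_threshold(D - A + Y / mu, lam / mu)       # T_{lam/mu}(.)
        R = D - A - E
        Y = Y + mu * R                                         # pseudocode line 8
        mu = rho * mu
        if np.linalg.norm(R, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
    return A, E

For the reflection removal setting described in Section 4, D would be built by stacking the vectorised input images as columns and calling fralm(D, 1, lam) with λ around 1/√m.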

4 Experiments and Discussions


Seven test sets were used to evaluate the performance of FrALM and LrALM on the
tasks of reflection removal and video background modeling (Fig. 3, 4). Sets 1 to 3
contained synthetically generated reflection images, and Sets 4 and 5 contained
real reflection images. Sets 1 and 4 were corrupted by local reflections whereas
Sets 2, 3, and 5 were corrupted by global reflections. Sets 6 and 7 contained
video frames of a single moving human and busy traffic junction, respectively.
All the test images were color images of size 200×150 pixels, so m = 200×150×3 = 90000. The number of images n in Sets 1 to 7 was 38, 38, 38, 31, 46, 210, and 250, respectively. Ground truth background images were
available for Sets 1 to 6 but not for Set 7. The images were captured with a
stationary camera. So the desired rank of the low-rank matrix is 1.

Fig. 1. Performance comparison for reflection removal. (a) Error Eg vs. λ; error curves above Eg = 1000 for very small λ are cropped to reduce clutter. (b) Rank of the low-rank matrix recovered by LrALM. Red dashed lines: LrALM results; blue solid lines: FrALM results. Vertical dashed lines denote the theoretical optimal λ* = 0.003̇.

Fig. 2. Performance comparison for reflection removal. (a) Convergence curves with λ = 0.003̇. (b) Error vs. rank; error curves above Eg = 1000 for very small λ are cropped to reduce clutter.

LrALM and FrALM were tested on the test sets over a range of λ from 0.0001 to 0.5, including the theoretical optimal λ of 1/√m = 0.003̇, denoted as λ*, as
proved in [13]. The parameters ρ and initial μ were set to the default values of 6
and 0.5/σ1 , where σ1 is the largest singular value of the initial Y, as for LrALM.
The algorithms’ accuracy was measured in terms of the mean squared error Eg
between the ground truth G and the recovered A:
    Eg = (1/mn) ||G − A||²_F .    (7)
Test results show that FrALM converges as efficiently as LrALM (Fig. 2(a)).
Since the desired rank of A is 1, λ has to be sufficiently small for LrALM
to produce accurate results (Fig. 1(a)). For Sets 2, 3, and 5 with non-sparse
E, the empirical optimal λ (0.002) is smaller than the theoretical λ∗ (0.003̇),
contrary to the theory of [13]. At this lower λ, the ranks of the optimal A

Fig. 3. Sample test results for reflection removal (rows 1-5, columns a-d). (a) Ground truth background. (b) Sample input images. (c) LrALM's results. (d) FrALM's results. (1) Set 1: synthetic local reflection. (2) Set 2: synthetic global reflection (light background). (3) Set 3: synthetic global reflection (dark background). (4) Set 4: real local reflection. (5) Set 5: real global reflection. λ = 0.003̇ for these test results.

recovered by LrALM are still larger than the known value of 1 (Fig. 1(b)). This
shows that LrALM has accumulated higher-rank components in A and thus
over-estimated A. In contrast, FrALM constrains the rank of A to 1, removing
the over-estimation. Consequently, FrALM yields more accurate results than
does LrALM, and it returns optimal or near optimal results over a wide range
of λ (Fig. 1(a)). We have also verified empirically that reducing the rank of
A to 1 after it is returned by LrALM can reduce over-estimation and improve
LrALM’s accuracy. However, this post-processing is insufficient for removing the
over-estimation entirely and LrALM’s error is still larger than that of FrALM.
To investigate the stability of FrALM, we ran it on the test cases at a range of
fixed ranks r, with λ set to the empirical optimal of 0.002. FrALM’s results were
plotted together with LrALM’s results obtained in previous tests (Fig. 2(b)).

Fig. 4. Sample test results for human and traffic video (rows 1-4, columns a-d). (a) Ground truth background. (b) Sample video frames. (c) LrALM's results. (d) FrALM's results. (1) Human motion video. (2-4) Traffic video; ground truth is not available. λ = 0.003̇ for these test results.

When r is slightly larger than 1, FrALM's error increases only slightly. When r is larger than the rank of the A recovered (line 6 of the algorithm), FrALM reduces to
LrALM, and its error simply approaches that of LrALM.
Figure 3 displays sample results for reflection removal obtained at the the-
oretical λ∗ . LrALM’s results are good for Sets 1 and 4 whose E is sparse. For
Sets 2, 3, and 5, E is not sparse and LrALM’s results have visually noticeable
errors (when the images are viewed at higher zoom factors). In contrast, FrALM
obtains good results for all test sets.
Figure 4 shows sample results for video background modeling obtained at the
theoretical λ∗ . In the video frames where the human and vehicles are moving
continuously, LrALM can recover the stationary background well (Fig. 4(1c, 2c)).
When the vehicles are moving slowly, E is not sparse, and LrALM shows signs
of inaccuracy (Fig. 4(3c)). When the vehicles stop at the traffic junction for an
extended period of time, LrALM regards them as part of the low-rank matrix
A and fails to remove them from A (Fig. 4(4c)). In contrast, FrALM produces
much better overall results than does LrALM (Fig. 4(d)).

5 Conclusions
A fixed-rank RPCA algorithm, FrALM, based on exact augmented Lagrange
multiplier method is proposed in this paper. By fixing the rank of the low-rank
matrix to be recovered, FrALM removes over-estimation of the low-rank matrix
and produces more accurate results than does the low-rank ALM method (LrALM).
Moreover, FrALM returns optimal or near optimal results over a wide range of
λ values, whereas LrALM’s accuracy is sensitive to λ. If FrALM is fixed to a
desired rank that is larger than the actual rank, then FrALM just reduces to
LrALM. These properties make FrALM more reliable and accurate than LrALM
for solving computer vision problems whose low-rank matrices have known ranks.

References
1. Babacan, S.D., Luessi, M., Molina, R., Katsaggelos, A.K.: Sparse bayesian methods
for low-rank matrix estimation. IEEE Trans. Signal Processing 60(8), 3964–3977
(2012)
2. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis?
Journal of ACM 58(3), 11 (2011)
3. Candès, E.J., Plan, Y.: Matrix completion with noise. In: Proc. IEEE, pp. 925–936
(2010)
4. De la Torre, F., Black, M.: A framework for robust subspace learning. Int. Journal
of Computer Vision 54(1-3), 117–142 (2003)
5. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography. Communications
of ACM 24(6), 381–385 (1981)
6. Ganesh, A., Lin, Z., Wright, J., Wu, L., Chen, M., Ma, Y.: Fast convex optimization
algorithms for exact recovery of a corrupted low-rank matrix. In: CAMSAP (2009)
7. Gnanadesikan, R., Kettenring, J.: Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28(1), 81–124 (1972)
8. Ke, Q., Kanade, T.: Robust L1 norm factorization in the presence of outliers and
missing data by alternative convex programming. In: Proc. CVPR, pp. 739–746
(2005)
9. Lin, Z., Chen, M., Wu, L., Ma, Y.: The augmented Lagrange multiplier method
for exact recovery of corrupted low-rank matrices. Technical Report UILU-ENG-
09-2215, UIUC (2009), arXiv preprint arXiv:1009.5055
10. Liu, R., Lin, Z., De la Torre, F., Su, Z.: Fixed-rank representation for unsupervised
visual learning. In: Proc. CVPR, pp. 598–605 (2012)
11. Wang, N., Yao, T., Wang, J., Yeung, D.-Y.: A probabilistic approach to robust
matrix factorization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid,
C. (eds.) ECCV 2012, Part VII. LNCS, vol. 7578, pp. 126–139. Springer, Heidelberg
(2012)
12. Wright, J., Peng, Y., Ma, Y., Ganesh, A., Rao, S.: Robust principal component
analysis: Exact recovery of corrupted low-rank matrices by convex optimization.
In: Proc. NIPS, pp. 2080–2088 (2009)
13. Zhou, Z., Li, X., Wright, J., Candès, E.J., Ma, Y.: Stable principal component
pursuit. In: Proc. Int. Symp. Information Theory, pp. 1518–1522 (2010)
Manifold Learning and the Quantum
Jensen-Shannon Divergence Kernel

Luca Rossi¹, Andrea Torsello¹, and Edwin R. Hancock²

¹ Department of Environmental Science, Informatics and Statistics,
Ca’ Foscari University of Venice, Italy
{lurossi,torsello}@dsi.unive.it
² Department of Computer Science, University of York, YO10 5GH, UK
[email protected]

Abstract. The quantum Jensen-Shannon divergence kernel [1] was re-


cently introduced in the context of unattributed graphs where it was
shown to outperform several commonly used alternatives. In this paper,
we study the separability properties of this kernel and we propose a way
to compute a low-dimensional kernel embedding where the separation
of the different classes is enhanced. The idea stems from the observa-
tion that the multidimensional scaling embeddings on this kernel show
a strong horseshoe shape distribution, a pattern which is known to arise
when long range distances are not estimated accurately. Here we propose
to use Isomap to embed the graphs using only local distance information
onto a new vectorial space with a higher class separability. The experi-
mental evaluation shows the effectiveness of the proposed approach.

Keywords: Graph Kernels, Manifold Learning, Continuous-Time


Quantum Walk, Quantum Jensen-Shannon Divergence.

1 Introduction
Graph-based representations have become increasingly popular due to their
ability to characterize in a natural way a large number of systems [2, 3]. Un-
fortunately, our ability to analyse this wealth of data is severely limited by the
restrictions posed by standard pattern recognition techniques, which usually re-
quire the graphs to be first embedded into a vectorial space, a procedure which
is far from being trivial. Kernel methods [4] provide a neat way to shift the prob-
lem from that of finding an embedding to that of defining a positive semidefinite
kernel. In fact, once we define a positive semidefinite kernel k : X × X → R
on a set X, there exists a map φ : X → H into a Hilbert space H, such that
k(x, y) = φ(x)ᵀφ(y) for all x, y ∈ X. Thus, any algorithm can be formulated in terms of the data by implicitly mapping them to H via the well-known kernel
trick. As a consequence, we are now faced with the problem of defining a positive
semidefinite kernel on graphs rather than computing an embedding. However,
due to the rich expressiveness of graphs, this task has also proven to be difficult.
Many different graph kernels have been proposed in the literature [5–7], which
are generally instances of the family of R-convolution kernels introduced by


Haussler [8]. The fundamental idea is that of decomposing two discrete objects and comparing some of their simpler substructures. For example, Gärtner et al. [5]
propose to count the number of common random walks between two graphs,
while Borgwardt and Kriegel [6] measure the similarity based on the shortest
paths in the graphs. Shervashidze et al. [7], on the other hand, count the number
of graphlets, i.e. subgraphs with k nodes. Recently, Rossi et al. [1] introduced
a novel kernel where the graph structure is probed through the evolution of a
continuous-time quantum walk [9]. The idea underpinning their method is that
the interference effects introduced by the quantum walk seem to be enhanced by
the presence of symmetrical motifs in the graph [10, 11]. To this end, they define
a walk onto a new structure that is maximally symmetric when the original
graphs are isomorphic. Finally, the kernel is defined as the quantum Jensen-
Shannon divergence [12] between the density operators [13] associated with the
walks.
In this paper, we study the separability properties of the QJSD kernel and we
apply standard manifold learning techniques [14, 15] on the kernel embedding
to map the data onto a low-dimensional space where the different classes can
exhibit a better linear separation. The idea stems from the observation that
the multidimensional scaling embeddings of the QJSD kernel show the so-called
horseshoe effect [16]. This particular behaviour is known to arise when long range
distances are not estimated accurately, and it implies that the data lie on a non-
linear manifold. This is no surprise, since Emms et al. [10] have shown that the
continuous-time quantum walk underestimates the commute time related to the
classical random walk. For this reason, it is natural to investigate the impact
of the locality of distance information on the performance of the QJSD kernel.
Given a set of graphs, we propose to use Isomap [14] to embed the graphs onto
a low-dimensional vectorial space, and we compute the separability of the graph
classes as the distance information varies from local to global. Moreover, we
perform the same analysis on a set of alternative graph kernels commonly found
in the literature [5–7]. Experiments on several standard datasets demonstrate
that the Isomap embedding shows a higher separability of the classes.
The remainder of this paper is organized as follows: Section 2 introduces some
basic quantum mechanical terminology, while Section 3 reviews the QJSD kernel.
Section 4 illustrates the experimental results and the conclusions are presented
in Section 5.

2 Quantum Mechanical Background

Quantum walks are the quantum analogue of classical random walks. In this
paper we consider only continuous-time quantum walks, as first introduced by
Farhi and Gutmann in [9]. Given a graph G = (V, E), the state space of the
continuous-time quantum walk defined on G is the set of the vertices V of the
graph. Unlike the classical case, where the evolution of the walk is governed by
a stochastic matrix (i.e. a matrix whose columns sum to unity), in the quantum
case the dynamics of the walker is governed by a complex unitary matrix i.e.,

a matrix that multiplied by its conjugate transpose yields the identity matrix.
Hence, the evolution of the quantum walk is reversible, which implies that quan-
tum walks are non-ergodic and do not possess a limiting distribution. Using
Dirac notation, we denote the basis state corresponding to the walk being at
vertex u ∈ V as |u. A general state of the walk is a complex linear combination
of the basis states, such that the state of the walk at time t is defined as

|ψt  = αu (t) |u (1)
u∈V

where the amplitude αu (t) ∈ C and |ψt  ∈ C|V | are both complex.
At each instant in time the probability of the walker being at a particular
vertex of the graph is given by the square of the norm of the amplitude of the
relative state. More formally, let X t be a random variable giving the location of
the walker at time t. Then the probability of the walker being at the vertex u
at time t is given by

    Pr(X^t = u) = α_u(t) α*_u(t)    (2)

where α*_u(t) is the complex conjugate of α_u(t). Moreover α_u(t)α*_u(t) ∈ [0, 1] for all u ∈ V, t ∈ R⁺, and in a closed system Σ_{u∈V} α_u(t)α*_u(t) = 1.

The evolution of the walk is governed by the Schrödinger equation, where we take the Hamiltonian of the system to be the graph adjacency matrix A, which yields

    (d/dt) |ψ_t⟩ = −iA |ψ_t⟩    (3)

Given an initial state |ψ_0⟩, we can solve Equation 3 to determine the state vector at time t

    |ψ_t⟩ = e^{−iAt} |ψ_0⟩ = Φ e^{−iΛt} Φᵀ |ψ_0⟩ ,    (4)

where A = ΦΛΦᵀ is the spectral decomposition of the adjacency matrix.
Consider a quantum system that can be in a number of states |ψ_i⟩, each with probability p_i. The system is said to be in the ensemble of (pure) states {|ψ_i⟩, p_i}. The density operator (or density matrix) of such a system is defined as

    ρ = Σ_i p_i |ψ_i⟩⟨ψ_i|    (5)

The Von Neumann entropy [13] of a density operator ρ is H_N(ρ) = −Tr(ρ log ρ) = −Σ_j λ_j log λ_j, where the λ_j are the eigenvalues of ρ.
With the Von Neumann entropy to hand, we can define the quantum Jensen-
Shannon divergence between two density operators ρ and σ as
    D_JS(ρ, σ) = H_N((ρ + σ)/2) − (1/2) H_N(ρ) − (1/2) H_N(σ)    (6)
This quantity is always well defined, symmetric and negative definite [17]. It can
also be shown that DJS (ρ, σ) is bounded, i.e., 0 ≤ DJS (ρ, σ) ≤ 1, with equality
to 1 if and only if the states ρ and σ have support on orthogonal subspaces.
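As a concrete illustration, the Von Neumann entropy and the divergence of Eq. (6) can be computed from the eigenvalues of the density matrices as in the sketch below; base-2 logarithms are used so that the divergence stays in [0, 1], and the tolerance on small eigenvalues is a numerical detail assumed here.

import numpy as np

def von_neumann_entropy(rho, tol=1e-12):
    # H_N(rho) = -sum_j lambda_j log lambda_j over the eigenvalues of rho.
    lam = np.linalg.eigvalsh(rho)        # rho is Hermitian positive semidefinite
    lam = lam[lam > tol]                 # 0 log 0 is taken to be 0
    return float(-np.sum(lam * np.log2(lam)))

def qjsd(rho, sigma):
    # Quantum Jensen-Shannon divergence between two density operators (Eq. 6).
    return (von_neumann_entropy(0.5 * (rho + sigma))
            - 0.5 * von_neumann_entropy(rho)
            - 0.5 * von_neumann_entropy(sigma))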

Fig. 1. The MDS embeddings from the QJSD kernel consistently show a horseshoe shape distribution of the points

3 The QJSD Kernel


Given two graphs G_1(V_1, E_1) and G_2(V_2, E_2) we construct a new graph G = (V, E) where V = V_1 ∪ V_2, E = E_1 ∪ E_2 ∪ E_12, and (u, v) ∈ E_12 only if u ∈ V_1 and v ∈ V_2. With this new structure to hand, we define two continuous-time quantum walks |ψ_t^−⟩ = Σ_{u∈V} ψ_{0u}^− |u⟩ and |ψ_t^+⟩ = Σ_{u∈V} ψ_{0u}^+ |u⟩ on G with starting states

    ψ_{0u}^− = { +d_u/C if u ∈ G_1 ;  −d_u/C if u ∈ G_2 }        ψ_{0u}^+ = { +d_u/C if u ∈ G_1 ;  +d_u/C if u ∈ G_2 }    (7)
where du is the degree of the node u and C is the normalisation constant such
that the probabilities sum to one.
We allow the two quantum walks to evolve until a time T and we define the average density operators ρ_T and σ_T over this time as

    ρ_T = (1/T) ∫₀ᵀ |ψ_t^−⟩⟨ψ_t^−| dt        σ_T = (1/T) ∫₀ᵀ |ψ_t^+⟩⟨ψ_t^+| dt    (8)
In other words, we have defined two mixed systems with equal probability of
being in any of the pure states defined by the quantum walks evolutions.
The quantum Jensen-Shannon kernel kT (G1 , G2 ) between the unattributed
graphs G1 and G2 is defined as
kT (G1 , G2 ) = DJS (ρT , σT ) (9)
where ρT and σT are the density operators defined as in Eq. 8. Note that this
kernel is parametrised by the time T. In [1] the authors propose to let T → ∞; however, they show that a proper choice of T can yield an increased average
accuracy in an SVM classification task.
It can be proved [1] that 0 ≤ kT (G1 , G2 ) ≤ 1 and that if G1 and G2 are two
isomorphic graphs, then ρT and σT have support on orthogonal subspaces, and
as a consequence kT (G1 , G2 ) = 1. Note that although the authors are unable to
provide a proof that the QJSD kernel is positive semidefinite, both empirical ev-
idence and the fact that the Jensen-Shannon Divergence is negative semidefinite
on pure quantum states [17] while the QJSD is maximal on orthogonal states
suggest that it might be.
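To show how the pieces fit together, the sketch below computes the kernel for two unattributed graphs given as adjacency matrices: it builds the merged graph G, evolves the two walks through the spectral decomposition of Eq. (4), approximates the time averages of Eq. (8) by a discrete sum over [0, T], and evaluates Eq. (9). Connecting every node of G1 to every node of G2, taking the degrees d_u in the merged graph, and the particular time discretisation are assumptions made for this sketch, not details fixed by the text above.

import numpy as np

def _entropy(rho, tol=1e-12):
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > tol]
    return float(-np.sum(lam * np.log2(lam)))

def qjsd_kernel(A1, A2, T=10.0, n_steps=200):
    # Sketch of k_T(G1, G2) = D_JS(rho_T, sigma_T) for adjacency matrices A1, A2.
    n1, n2 = A1.shape[0], A2.shape[0]
    n = n1 + n2
    A = np.zeros((n, n))                      # merged graph: E1, E2 and E12
    A[:n1, :n1] = A1
    A[n1:, n1:] = A2
    A[:n1, n1:] = 1.0                         # assumption: E12 links every u in V1 to every v in V2
    A[n1:, :n1] = 1.0

    d = A.sum(axis=1)                         # node degrees (taken in the merged graph)
    psi_plus = d / np.linalg.norm(d)          # Eq. (7): amplitudes proportional to d_u, unit norm
    psi_minus = psi_plus.copy()
    psi_minus[n1:] *= -1.0                    # opposite sign on the nodes of G2

    lam, Phi = np.linalg.eigh(A)              # A = Phi Lambda Phi^T
    rho = np.zeros((n, n), dtype=complex)
    sigma = np.zeros((n, n), dtype=complex)
    for t in np.linspace(0.0, T, n_steps):
        U_t = (Phi * np.exp(-1j * lam * t)) @ Phi.T      # e^{-iAt}, Eq. (4)
        for psi0, avg in ((psi_minus, rho), (psi_plus, sigma)):
            psi_t = U_t @ psi0
            avg += np.outer(psi_t, psi_t.conj())         # |psi_t><psi_t|
    rho /= n_steps
    sigma /= n_steps
    return _entropy(0.5 * (rho + sigma)) - 0.5 * _entropy(rho) - 0.5 * _entropy(sigma)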

Fig. 2. Sample images of the four selected objects from the COIL-100 [18] dataset

3.1 Enhancing the QJSD through Manifold Learning


Figure 1 shows the MDS embedding of the distance matrices associated with
the QJSD kernel for the synthetic, MUTAG and COIL datasets. Details on the
datasets used in this paper can be found in Section 4. These embeddings clearly
suffer from a horseshoe shape effect, which is usually the result of an accurate
estimate of the distance between objects only when they are close together, but
not when they are far apart [16]. As a consequence, it should be possible to
increase the kernel performance by filtering out in some way this long range
distance information.
In this paper we propose a simple yet effective way to achieve this goal. Given
a set of graphs, we compute the Isomap [14] embedding of the graphs and we
evaluate the separability of the graph classes as the distance information varies
from local to global. Isomap is a well-known manifold learning technique, which
extends classical MDS by incorporating the pairwise geodesic distances between
points. To this end, a neighborhood graph is constructed from the original set
of points, where each node is connected to its k nearest neighbors in the high-
dimensional space. The geodesic distance between two nodes is then defined as
the sum of the edge weights along the shortest-path between them. It is known
that Isomap suffers from several shortcomings, so further work should focus on
experimenting with more robust manifold learning techniques.
The class separability is evaluated in the following way. For each embedding,
we perform a 10-fold cross validation using a binary C-SVM with a linear kernel,
where we let the value of the SVM regularizer constant C vary over the interval [10⁻³, 10³]. Then, we take the maximum value of the average classification
accuracy as an indicator of the separability. More formally, we look for the
Isomap embedding which maximises
    arg max_{d,k} max_C α    (10)

where α is the 10-fold cross validation accuracy of the C-SVM, C is the regu-
larizer constant, d is the embedding dimension and k is the number of nearest
neighbors. Note that the multi-classification task is solved using majority voting
on a set of one-vs-one C-SVM classifiers.
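A scikit-learn sketch of this separability search is given below. The kernel matrix is first converted into pairwise distances (via d_ij = sqrt(k_ii + k_jj − 2 k_ij), one standard choice), each (k, d) pair is embedded with Isomap, and every embedding is scored by a 10-fold cross-validated linear C-SVM. The grids over k, d and C are illustrative, and a scikit-learn version whose Isomap accepts precomputed distances is assumed.

import numpy as np
from sklearn.manifold import Isomap
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def separability(K, y, dims=range(2, 26), ks=range(10, 71, 10),
                 Cs=np.logspace(-3, 3, 7)):
    # Best 10-fold CV accuracy of a linear C-SVM over Isomap embeddings (Eq. 10).
    diag = np.diag(K)
    D = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))
    best = 0.0
    for k in ks:
        for d in dims:
            X = Isomap(n_neighbors=k, n_components=d,
                       metric='precomputed').fit_transform(D)
            for C in Cs:
                acc = cross_val_score(SVC(kernel='linear', C=C), X, y, cv=10).mean()
                best = max(best, acc)
    return best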

4 Experimental Results
The experiments are performed on four different datasets, namely MUTAG, PPI,
COIL [18] and a set of shock graphs. MUTAG is a dataset of 188 mutagenic

Fig. 3. 3D plots of the 10-fold cross validation accuracy on the PPI dataset as the number of nearest neighbors k and the embedding dimension d vary: (a) QJSD, (b) Random Walk, (c) Graphlet.

aromatic and heteroaromatic compounds labeled according to whether or not


they have a mutagenic effect on the Gram-negative bacterium Salmonella ty-
phimurium. The PPI dataset consists of protein-protein interaction (PPIs) net-
works related to histidine kinase from two different groups: 40 PPIs from Aci-
dovorax avenae and 46 PPIs from Acidobacteria. The COIL dataset consists
of the 4 objects shown in Figure 2, each with 72 views obtained from equally
spaced viewing directions over 360◦. For each image, a graph is obtained as the
Delaunay triangulation of the Harris corner points. Finally, we select a set of
shock graphs, a skeletal-based representation of the differential structure of the
boundary of a 2D shape. The 120 graphs are divided into 8 classes of 15 shapes
each. Each graph has a node attribute that reflects the size of the boundary
feature generating the corresponding skeletal segment. To reflect the presence
of attributes, the QJSD kernel is modified by labeling the new connections of
the merged graph with the similarity between its two endpoints. To these four
datasets, we add a fifth set of 30 synthetically generated graphs, 10 for each
class. The graphs belonging to each class were sampled from a generative model
with sizes 12, 14 and 16, respectively [19].
Figure 3 shows the 3D plots of the 10-fold cross validation accuracy on the
Isomap embeddings of the QJSD, the random walk and the graphlet kernels
for the PPI dataset, as the size of the initial neighborhood and the embedding
dimension vary. The plots show that for this dataset the QJSD kernel seems
to be less sensitive to the locality of the distance information. On the other

Fig. 4. The optimal two-dimensional Isomap embeddings in terms of separability between the graph classes

Table 1. Maximum classification accuracy on the unattributed graph datasets. Here SP is the shortest-path kernel of Borgwardt and Kriegel [6], RW is the random walk kernel of Gärtner et al. [5], and GR denotes the graphlet kernel computed using all graphlets of size 3 described in Shervashidze et al. [7]; the subscript ISO indicates the result after the Isomap embedding. For each dataset, the best performing kernel before and after the embedding is shown in bold and italic, respectively.

Kernel      Synthetic  MUTAG  PPI    COIL   Shock
QJSD        90.00      88.27  78.75  84.44  67.50
QJSD_ISO    96.67      91.96  90.69  91.53  77.50
SP          80.00      86.08  71.25  85.56  61.67
SP_ISO      86.67      89.33  87.08  89.17  60.05
RW          86.67      77.02  70.97  79.72  49.17
RW_ISO      86.67      81.35  82.50  80.97  50.12
GR          86.67      82.92  49.56  86.67  39.17
GR_ISO      90.00      84.53  77.08  87.78  54.17

hand, for the graphlet kernel the maximum accuracy is achieved for a smaller
neighborhood, which means that in this case the long range distance information
is less accurate.
Figure 4 shows the two-dimensional Isomap embeddings with the highest lin-
ear separability for the QJSD kernels on the synthetic dataset, MUTAG and
COIL. The result clearly shows the lack of the horseshoe shape distribution of
Figure 1. Note, however, that the best embedding is usually found at a dimen-
sion higher than two and, as shown in Figure 3, the separability can change
significantly as the dimension varies. Figure 4 also shows a clearer separation
among the different classes, as highlighted in Table 1, which shows the separa-
bility of the data for each kernel and dataset. It is interesting to observe that,
with the exception of a few cases, the Isomap embedding always yields an in-
creased separability of the data, independently of the original kernel. It should
also be underlined that the QJSD kernel always yields the highest separation,
with a maximum classification accuracy above 90% in 4 out of 5 datasets.

5 Conclusions
In this paper, we studied the separability properties of the QJSD kernel and
we have proposed a way to compute a low-dimensional embedding where the
separation of the different classes is enhanced. The idea stems from the observa-
tion that the multidimensional scaling embeddings on this kernel show a strong
horseshoe shape distribution, a pattern which is known to arise when long range
distances are not estimated accurately. Here we proposed to use Isomap to em-
bed the graphs using only local distance information onto a new vectorial space
with a higher class separability. An extensive experimental evaluation has shown
the effectiveness of the proposed approach.

Acknowledgments. Edwin Hancock was supported by a Royal Society Wolfson


Research Merit Award.

References
1. Rossi, L., Torsello, A., Hancock, E.R.: A continuous-time quantum walk kernel for
unattributed graphs. In: Kropatsch, W.G., Artner, N.M., Haxhimusa, Y., Jiang,
X. (eds.) GbRPR 2013. LNCS, vol. 7877, pp. 101–110. Springer, Heidelberg (2013)
2. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.: Shock graphs and shape
matching. International Journal of Computer Vision 35, 13–32 (1999)
3. Jeong, H., Tombor, B., Albert, R., Oltvai, Z., Barabási, A.: The large-scale orga-
nization of metabolic networks. Nature 407, 651–654 (2000)
4. Schölkopf, B., Smola, A.J.: Learning with kernels: Support vector machines, regu-
larization, optimization, and beyond. MIT press (2001)
5. Gaertner, T., Flach, P., Wrobel, S.: On graph kernels: Hardness results and efficient
alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS
(LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
6. Borgwardt, K., Kriegel, H.: Shortest-path kernels on graphs. In: Fifth IEEE Inter-
national Conference on Data Mining, p. 8. IEEE (2005)
7. Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., Borgwardt, K.: Effi-
cient graphlet kernels for large graph comparison. In: Proceedings of the Interna-
tional Workshop on Artificial Intelligence and Statistics (2009)
8. Haussler, D.: Convolution kernels on discrete structures. Technical report, UC
Santa Cruz (1999)
9. Farhi, E., Gutmann, S.: Quantum computation and decision trees. Physical Review
A 58, 915 (1998)
10. Emms, D., Wilson, R., Hancock, E.: Graph embedding using a quasi-quantum ana-
logue of the hitting times of continuous time quantum walks. Quantum Information
& Computation 9, 231–254 (2009)
11. Rossi, L., Torsello, A., Hancock, E.R.: Approximate axial symmetries from contin-
uous time quantum walks. In: Gimel’farb, G., Hancock, E., Imiya, A., Kuijper, A.,
Kudo, M., Omachi, S., Windeatt, T., Yamada, K. (eds.) SSPR&SPR 2012. LNCS,
vol. 7626, pp. 144–152. Springer, Heidelberg (2012)
12. Lamberti, P., Majtey, A., Borras, A., Casas, M., Plastino, A.: Metric character of
the quantum Jensen-Shannon divergence. Physical Review A 77, 052311 (2008)
13. Nielsen, M., Chuang, I.: Quantum computation and quantum information. Cam-
bridge university press (2010)
14. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for
nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
15. Czaja, W., Ehler, M.: Schroedinger eigenmaps for the analysis of biomedical data.
IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1274–1280
(2013)
16. Kendall, D.G.: Abundance matrices and seriation in archaeology. Probability The-
ory and Related Fields 17, 104–112 (1971)
17. Briët, J., Harremoës, P.: Properties of classical and quantum jensen-shannon di-
vergence. Physical review A 79, 052311 (2009)
18. Nayar, S., Nene, S., Murase, H.: Columbia object image library (coil 100). Technical
report, Tech. Report No. CUCS-006-96. Department of Comp. Science, Columbia
University (1996)
19. Torsello, A., Rossi, L.: Supervised learning of graph structure. In: Pelillo, M., Han-
cock, E.R. (eds.) SIMBAD 2011. LNCS, vol. 7005, pp. 117–132. Springer, Heidel-
berg (2011)
Spatio-temporal Manifold Embedding
for Nearly-Repetitive Contents
in a Video Stream

Manal Al Ghamdi and Yoshihiko Gotoh

Department of Computer Science, University of Sheffield, United Kingdom


{m.alghamdi,y.gotoh}@dcs.shef.ac.uk

Abstract. This paper presents a framework to identify and align nearly-


repetitive contents in a video stream using spatio-temporal manifold em-
bedding. The similarities observed in frame sequences are captured by
defining two types of correlation graphs: an intra-correlation graph in the
spatial domain and an inter-correlation graph in the temporal domain.
The presented work is novel in that it does not utilise any prior informa-
tion such as the length and contents of the repetitive scenes. No template
is required, and no learning process is involved in the approach. Instead it
analyses the video contents using the spatio-temporal extension of SIFT
combined with a coding technique. The underlying structure is then re-
constructed using manifold embedding. Experiments using a TRECVID
rushes video proved that the framework was able to improve embedding
of repetitive sequences over the conventional methods, thus was able to
identify the repetitive contents from complex scenes.

Keywords: manifold embedding, synchronisation, inter- and intra-


correlations, rushes video.

1 Introduction
In recent years a wide range of audio-visual data has become publicly available, including news, movies, television programmes and meeting records, resulting in various content-management problems. Among these there exist nearly-repetitive video sequences, whereby the original material is transformed into nearly, but not exactly, identical contents. Rushes videos, also referred to as pre-production videos, belong to one category of such examples [1]. A rushes video is a collection of raw footage, used to produce, e.g., TV programmes [2].
Unlike many other video datasets rushes are unconventional, containing ad-
ditional contents such as clapper boards, colour bars and empty white shots.
They also contain repetitive contents from multiple retakes of the same scene,
caused by, e.g., actors’ mistakes or technical failures during the production.
Although contents are nearly repetitive they may not be totally identical dupli-
cates, sometimes causing inconsistency between retakes. Occasionally some parts
of the original sequence may be dropped or extra information may be added at
various places, resulting in retakes of the same scene with unequal lengths.


The task of aligning multiple audio visual sequences, potentially from differ-
ent angles, needs precise synchronisation in both spatial and temporal domains.
The majority of previous works employed techniques such as template match-
ing, camera calibration analysis and object tracking. Whitehead et al. [3], for
example, tracked multiple objects throughout each sequence using a 2D shape
heuristic. Temporal correspondence was then computed between frames by iden-
tifying the object’s location in all views satisfying the epipolar geometry. In [4],
the authors required the events to be captured by still cameras with flashes; the
binary flash patterns were analysed and matched throughout the video sequence.
Tresadern and Reid [5] used a rank constraint on corresponding frame features
instead of the epipolar geometry. The synchronisations were defined by searching
frame pairs that minimise the rank constraint. However their approach requires
prior knowledge of the number of correspondences in the frame sequences.
In this paper we present a spatio-temporal framework to aligning nearly-
repetitive contents. Embedded repetitions in the three dimensional (3D) signal,
consisting of two spatial and one temporal dimension, are discovered by defining
the coherent structure. We depart from the previous extension made on Isomap
[1] to spatio-temporal graph-based manifold embedding that captures correla-
tions between repetitive scenes. The intra- and inter-correlations within and
between repeated video contents are defined by applying the spatio-temporal
extension of the scale-invariant feature transform (SIFT) [6]. It is followed by
the modified version of the locality constrained linear coding (LLC) [7], where
each spatio-temporal descriptor is encoded by k-nearest neighbours (kNN) based
on the geodesic distances, instead of the Euclidean distance. The latter measures
the distance between two points as the length of a straight line from one point
to the other, whereas on the non-linear manifold, their Euclidean distance may
not accurately reflect their intrinsic similarity, which is measured by the geodesic
distance. A cluster of intrinsic coordinates are then generated on the embedded
space to define the spatial and temporal similarity between repetitions.
The contributions of this study are as follows: Firstly a spatial intra-correlation
representation is created for repetitive contents in a video stream. Interest points
that have significant local variations in both space and time are extracted and
encoded using fewer codebook basis in the high-dimensional feature space. Intra-
correlation is derived by constructing a shortest path graph using the kNN with
the geodesic distances. Secondly Isomap is extended to estimate the underlying
structure of repetitive contents and to define a spatio-temporal inter-correlation
in a video stream. Thirdly an unsupervised framework, which does not require
prior information or pre-processing steps, for aligning similar contents is pre-
sented for multimedia data with repetitions.

2 Spatio-temporal Alignment of Nearly-Repetitive


Scenes
In this work we explore a low-dimensional representation of nearly-repetitive
contents observed in a video stream. To this end video’s semantic structure is de-
fined in the high-dimensional space. The approach consists of two stages. Firstly

Fig. 1. Processing steps for spatio-temporal alignment of nearly-repetitive contents in a video stream

intra-correlation is captured in the high-dimensional space using the space-time


invariant interest points detection and coding scheme (Section 2.1). Application
of the LLC technique at this stage allows consideration of the locality of the man-
ifold structure. Secondly a manifold representation maps the video sequence to
the embedded space (Section 2.2). At this stage the inter-correlation is com-
puted between multiple video scenes using the spatio-temporal kNN graph. We
adapted the spatio-temporal Isomap implemented previously by [1] to generate
the intrinsic coordinates for each manifold. Generated coordinates are chrono-
logically ordered based on the spatio-temporal similarity and clustered to groups
of similar repetitive contents. The entire process of the approach is illustrated
in Figure 1.

2.1 Spatio-temporal Video Representation

The framework consists of a space-time extension of SIFT, or ST-SIFT [6], com-


bined with a modified version of the LLC [7]. The coding scheme projects each
descriptor into a local coordinate representation produced by max pooling [8].
Spatio-Temporal SIFT. The ST-SIFT algorithm identifies spatially and tem-
porally invariant interest points given a video stream [6]. These points contain
the amount of information sufficient to represent the video contents. Unlike other
interest point detection schemes, ST-SIFT is able to detect spatially distinctive
points with sufficient motion information at multiple scales. To achieve in-
variance in both space and time, spatio-temporal Gaussian and Difference of
Gaussian (DoG) pyramids are calculated first. Then the points shared between
three spatial and temporal planes (xy, xt and yt) at each scale in the DoG are
chosen as interest points.
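As an illustration of this plane-wise extremum test, the following sketch (not the authors' implementation; the scales, filter sizes and threshold are hypothetical) keeps the voxels of a spatio-temporal DoG stack that are local maxima within the xy, xt and yt planes at each scale:

```python
# A minimal sketch of ST-SIFT-style interest point detection on a video cube.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def st_dog_interest_points(video, sigmas=(1.0, 1.6, 2.6, 4.2), thresh=0.02):
    """video: (T, H, W) float array; returns (t, y, x, scale index) tuples."""
    # spatio-temporal Gaussian pyramid (same sigma in space and time, for simplicity)
    blurred = [gaussian_filter(video, sigma=s) for s in sigmas]
    dogs = [blurred[i + 1] - blurred[i] for i in range(len(sigmas) - 1)]
    points = []
    for k, dog in enumerate(dogs):
        # a voxel is kept if it is a local maximum in each of the three planes
        max_xy = maximum_filter(dog, size=(1, 3, 3))
        max_xt = maximum_filter(dog, size=(3, 1, 3))
        max_yt = maximum_filter(dog, size=(3, 3, 1))
        mask = (dog >= max_xy) & (dog >= max_xt) & (dog >= max_yt) & (dog > thresh)
        points.extend([(t, y, x, k) for t, y, x in zip(*np.nonzero(mask))])
    return points
```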
Coding with Shortest Path Graph. LLC is a coding scheme proposed
by Wang et al. [7] to project individual descriptors onto their respective local

coordinate systems. It translates image descriptors into local sparse codes based
on the Euclidean distances and the kNN search. We extended this algorithm
to project spatio-temporal descriptors extracted from a video stream into their
local linear codes using the geodesic distance and the shortest path graph.
Technically, given the ST-SIFT feature matrix extracted from a video stream
with N entries and D dimensions, i.e., X = {x1 , . . . , xN } ∈ RD×N , LLC solves
the following problem:

$$\min_{S}\ \sum_{i=1}^{N} \left\| x_i - B s_i \right\|^2 + \lambda \left\| d_i \odot s_i \right\|^2 \qquad \text{s.t. } \mathbf{1}^{\top} s_i = 1,\ \forall i$$

where ⊙ is the element-wise multiplication, B is a codebook, λ is a sparsity
regularisation term and S = {s1, . . . , sN} ∈ R^{D×N} is a set of codes for X.
Furthermore, '1^⊤ si = 1, ∀i' expresses the shift-invariance requirement for the LLC
code. The locality-constrained parameter di represents each basis vector with
code. The locality-constrained parameter di represents each basis vector with
different freedom based on its shortest path to the spatio-temporal descriptor
xi . Intra-correlation in the spatial domain S is derived by firstly constructing a
neighbourhood graph based on the geodesic distances between the descriptors
and the codebook, then computing the shortest path, performing a kNN search,
and finally solving a constrained least square fitting problem.
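A minimal sketch of this coding step is given below; the codebook B, the neighbourhood size k and the regularisation weight lam are hypothetical, and SciPy/scikit-learn are used only for illustration. It builds a kNN graph over descriptors and codewords, approximates geodesic distances by shortest paths, and solves the locality-constrained least-squares fit for each descriptor:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def llc_geodesic(X, B, k=5, lam=1e-4):
    """X: D x N descriptors, B: D x M codebook; returns M x N codes."""
    D, N = X.shape
    M = B.shape[1]
    nodes = np.hstack([X, B]).T                       # descriptors and codewords as graph nodes
    graph = kneighbors_graph(nodes, k, mode='distance')
    geo = shortest_path(graph, directed=False)        # geodesic distance approximation
    d = geo[:N, N:]                                   # descriptor-to-codeword geodesic distances
    S = np.zeros((M, N))
    for i in range(N):
        nn = np.argsort(d[i])[:k]                     # k nearest codewords along the graph
        Z = B[:, nn] - X[:, [i]]                      # shift the local basis to the origin
        C = Z.T @ Z + lam * np.eye(k)                 # regularised local covariance
        w = np.linalg.solve(C, np.ones(k))
        S[nn, i] = w / w.sum()                        # enforce 1^T s_i = 1
    return S
```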

2.2 Manifold Embedding


High-dimensional representation can be mapped to a spatio-temporal graph
where nodes represent frames and edges represent the temporal order (event
sequence). We adapted the Isomap extension of [1] to reconstruct the spatio-
temporal inter-correlation δ from the intra-correlation S. The algorithm calcu-
lates the geodesic distance within the video frames to ensure the shortest path.
The algorithm can be summarised in the following three steps:
Step 1. We construct a spatio-temporal neighbourhood graph δ from the intra-
correlation matrix S. N nodes represent frames and N edges represent the con-
nection between the frames if they are related. Initially, the geodesic distances
between the nodes in δ graph are computed. Then the L spatial neighbours (sn)
are defined for each frame xi using the shortest path:
 
$$sn_{x_i} = \left\{ x_{i_1}, \ldots, x_{i_L} \,\middle|\, \operatorname*{argmin}_{j}^{L} (\delta_{ij}) \right\}, \quad i = 1, \ldots, N$$

where argmin_j^L indicates the node indexes j that give the L minimum values of
δij. Other L chronologically ordered neighbours around each frame xi are then
defined as temporal neighbours (tn):
$$tn_{x_i} = \left\{ x_{i-\frac{L}{2}}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+\frac{L}{2}} \right\}, \quad i = 1, \ldots, N$$

The temporal neighbours of the spatial neighbours tnsnxi are defined for more
coverage:
$$tnsn_{x_i} = \left\{ tn_{x_{i_1}}, \ldots, tn_{x_{i_L}} \right\}, \quad i = 1, \ldots, N$$

Finally, the union of the spatial and temporal sets represents the spatio-temporal
neighbours stn:

$$stn_{x_i} = sn_{x_i} \cup tnsn_{x_i}, \quad i = 1, \ldots, N$$
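The neighbour construction of Step 1 can be sketched as follows, assuming delta is the N × N geodesic distance matrix between frames and frames are indexed chronologically (boundary handling is simplified):

```python
import numpy as np

def spatio_temporal_neighbours(delta, L):
    N = delta.shape[0]
    stn = []
    for i in range(N):
        # L spatial neighbours: frames with the smallest geodesic distance to frame i
        sn = [j for j in np.argsort(delta[i]) if j != i][:L]
        # temporal neighbours of each spatial neighbour (L/2 frames on either side)
        tnsn = set()
        for j in sn:
            lo, hi = max(0, j - L // 2), min(N, j + L // 2 + 1)
            tnsn.update(k for k in range(lo, hi) if k != j)
        stn.append(sorted(set(sn) | tnsn))              # union sn U tnsn
    return stn
```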

Step 2. Given the spatio-temporal neighbourhood graph δ, correlation based on


the geodesic distances δγ is defined by recalculating the shortest path between
the neighbouring nodes.
Step 3. The manifold embedding is modelled as a transformation T of the
high-dimensional data in terms of correlation δγ into a new embedded space D:

T : δγ → D

The function T is the eigen decomposition of the inter-correlation matrix that


minimises the following loss function:
$$L_{projection} = \delta - T(\delta) = \delta - T(\delta_\gamma) = \delta - (Q \Lambda Q^{T}) = \delta - (Q^{+} \Lambda^{+\frac{1}{2}})$$

where Q and Λ are the eigenvectors and the eigenvalues of δγ. To optimise the
embedded representation, the m largest eigenvalues of Λ along the diagonal are
defined in Λ+, and the corresponding m columns of Q are defined in Q+.
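A compact sketch of Steps 2 and 3 is given below, assuming the spatio-temporal graph is available as a sparse weighted adjacency matrix; it follows the classical Isomap recipe of geodesic shortest paths followed by an eigendecomposition, returning coordinates of the form Q+ Λ+^(1/2):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def embed(graph, m=2):
    geo = shortest_path(graph, directed=False)        # delta_gamma: geodesic distances
    N = geo.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N               # centring matrix
    K = -0.5 * J @ (geo ** 2) @ J                     # double-centred squared distances
    evals, evecs = np.linalg.eigh(K)
    idx = np.argsort(evals)[::-1][:m]                 # m largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))
```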

3 Experiments
The approach was evaluated using MPEG-1 videos from the NIST TRECVID
2008 BBC rushes video collection [2]. Five video sequences were selected con-
taining drama productions in the following genres: detective, emergency, police,
ancient Greece and historical London. In total we had an approximate duration
of 82 minutes, sampled at the frame rate of 25 fps (frames per second) and a
frame size of 288 × 352 pixels. Table 1 provides further details of the dataset.
The video representation was created as follows. Firstly, spatio-temporal re-
gions were detected and described from the video cube using the ST-SIFT [6].
For each interest point the descriptor length was 640-dimensional, determined
by the number of bins to represent the orientation angles, θ and φ, in the sub-
histograms. In the spatial pyramid matching step, the LLC codes were computed
for each sub-region and pooled together using the multi-scale max pooling to cre-
ate the pooled representation. We used 4 × 4, 2 × 2 and 1 × 1 sub-regions. The
pooled features were then concatenated and normalised using the ℓ2-norm.
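A sketch of this pooling step is shown below, assuming an M × P matrix of LLC codes with (y, x) positions for P interest points in a frame of size H × W; the exact sub-region layout of the original implementation may differ:

```python
import numpy as np

def pyramid_max_pool(codes, yx, H, W, levels=(1, 2, 4)):
    """codes: M x P LLC codes, yx: P x 2 integer (row, col) positions."""
    pooled = []
    for g in levels:
        cell = (yx // [np.ceil(H / g), np.ceil(W / g)]).astype(int)   # cell index per point
        for r in range(g):
            for c in range(g):
                sel = (cell[:, 0] == r) & (cell[:, 1] == c)
                pooled.append(codes[:, sel].max(axis=1) if sel.any()
                              else np.zeros(codes.shape[0]))
    v = np.concatenate(pooled)
    return v / (np.linalg.norm(v) + 1e-12)             # l2-normalised pooled feature
```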

3.1 Evaluation Schema


Each scene from the rushes videos is a line of actions defined by actors’ dialogue.
They were used as units of evaluation and the purpose of the experiment was to
group and align the multiple similar retakes of the same scene. The description
of actions for each scene was provided by the NIST for the BBC rushes video

Table 1. The duration, the number of scenes and the number of retakes for each scene

video id     duration (min:sec)   #scenes   #retakes (#retakes/scene)
MS206290     21:03                11        27 (2,1,1,3,5,5,2,2,1,4,1)
MS206370     12:30                7         17 (2,2,2,2,3,4,2)
MS215830     14:55                5         14 (3,3,3,2,3)
MRS044499    12:42                6         10 (2,2,2,1,1,2)
MRS1500072   21:40                10        26 (3,5,1,2,2,3,2,3,3,2)

summarisation task in 2008 [2]. The ground truth was constructed for each video
using three human judges at a frame rate of 0.5 fps (one frame per two seconds).
The judges were asked to study the video summary and use it to identify the
start and the end for each retake. The defined positions for five videos, totalling
39 scenes and 94 retakes, were used as the ground truth.
In the experiments, the approach was compared with three other simplified
alternatives. The first one evaluates the performance of the entire framework.
It was a combination of the original 2D SIFT by Lowe [9], LLC coding with
the Euclidean distance graph by [7] and spatial Isomap by [10]. The second
one evaluates the performance of the intra-correlation step covered by ST-SIFT
and LLC with the shortest path graph. It consisted of the 2D SIFT, LLC coding
with the Euclidean distance graph and the Isomap-ST, an adapted version of [1].
The third one evaluates the performance of the inter-correlation step covered by
the Isomap-ST. For that we combined the ST-SIFT [6] with LLC coding
with the shortest path graph and the spatial Isomap.

3.2 Results
Figure 2 presents the average precision and recall for each video using the ap-
proach and three alternatives. Graphs were created using the neighbourhood size
k as the operating parameter. They indicate that the approach outperformed the
conventional techniques with a fair margin. The approach was able to capture
the spatio-temporal correlations between retakes in each video sequence. The
best result was obtained with video MRS150072 in Figure 2(e). This video con-
tained outdoor scenes with large variations, characterised by busy backgrounds
and lots of movements by actors and objects. On the other hand, Figure 2(a)
for video MS206290 resulted in the lowest performance. It consisted of indoor
scenes with crowded people and little movement. Therefore there were few significant
changes between the frames to be captured by the ST-SIFT.
Figure 3 illustrates the reconstruction of video sequences, aiming to uncover
their nearly-repetitive contents. Retakes from the same scene were mapped close
to each other in the manifold resulting in clusters of repetitive contents. The
video sequence MRS044499 presented in the figure contained six scenes with
ten retakes (described earlier in Table 1). The left panel of the figure shows
the aligned sequences in the 2D space with multiple clusters of frames. Most
frames from the same scene were re-positioned and placed close together in the
Fig. 2. Average precision and recall for five rushes videos, identified as MS206290,
MS206370, MS215830, MRS044499, and MRS150072. For each video stream, the
spatio-temporal alignment method (blue) is compared with three other alternatives.

lower dimensional space. There were many causes, such as camera moves, that
could result in discontinuity because such frames did not share sufficient spatial
features with others. Consideration of temporal relation in the intra-correlation
step alleviated this problem, thus successfully producing a clear video trajectory
in the manifold. The contents of one cluster, two retakes of the same scene, are
presented in the right panel of the figure.

4 Conclusions and Future Work


This paper presented a framework for aligning nearly-repetitive contents in a
video stream using manifold embedding. It utilised LLC with the shortest path
graph to densely extract and encode salient feature points from a 3D signal, gen-
erating an intra-correlation in the spatial domain. A spatio-temporal graph was
derived as a step for manifold embedding that defined the inter-correlation across
the video sequence. Experimental results using rushes videos showed that the
approach with spatio-temporal representation performed better than the con-
ventional techniques. The contribution of this study may be extended to other
applications involving temporal information processing, such as video summari-
sation and video information retrieval.
Fig. 3. Video sequence MRS044499 was aligned in the two-dimensional space using
the neighbourhood size of k = 15

Acknowledgements. The first author would like to thank Umm Al-Qura Uni-
versity, Makkah, Saudi Arabia for funding this work as part of her PhD schol-
arship program.

References
1. Chantamunee, S., Gotoh, Y.: Nearly-repetitive video synchronisation using nonlin-
ear manifold embedding. In: Proceedings of ICASSP (2010)
2. Over, P., Smeaton, A.F., Awad, G.: The TRECVID 2008 BBC rushes summariza-
tion evaluation. In: ACM TRECVID Video Summarization Workshop (2008)
3. Whitehead, A., Laganiere, R., Bose, P.: Temporal synchronization of video se-
quences in theory and in practice. In: IEEE Workshop on Motion and Video Com-
puting (2005)
4. Shrestha, P., Weda, H., Barbieri, M., Sekulovski, D.: Synchronization of multiple
video recordings based on still camera flashes. In: Proceedings of ACM Multimedia
(2006)
5. Tresadern, P.A., Reid, I.D.: Synchronizing image sequences of non-rigid objects.
In: Proceedings of BMVC (2003)
6. Al Ghamdi, M., Zhang, L., Gotoh, Y.: Spatio-temporal SIFT and its application
to human action classification. In: Fusiello, A., Murino, V., Cucchiara, R. (eds.)
ECCV 2012 Ws/Demos, Part I. LNCS, vol. 7583, pp. 301–310. Springer, Heidelberg
(2012)
7. Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local
spatio-temporal features for action recognition. In: Proceedings of BMVC (2009)
8. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual
cortex. In: Proceedings of CVPR (2005)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision (2004)
10. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for
nonlinear dimensionality reduction. Science (2000)
Spatio-temporal Human Body Segmentation
from Video Stream

Nouf Al Harbi and Yoshihiko Gotoh

Department of Computer Science, University of Sheffield, United Kingdom


[email protected], [email protected]

Abstract. We present a framework in which human body volume is


extracted from a video stream. Following the line of object tracking-
based methods, our approach detects and segments human body regions
by jointly embedding parts and pixels. For all extracted segments the ap-
pearance and shape models are learned in order to automatically extract
the foreground objects across a sequence of video frames. We evaluated
the framework using a challenging set of video clips, consisting of office
scenes, selected from Hollywood2 dataset. The outcome from the experi-
ments indicates that the approach was able to create better segmentation
than recently implemented work.

Keywords: spatio-temporal segmentation, human volume, object


tracking.

1 Introduction

Computer vision presents several challenges, the foremost of which is that of


automatic interpretation of video streams. The intricate task involves detecting
and interpreting video contents and identifying people and other objects, in
addition to recognising their movement. Traditionally these tasks are carried
out by working with individual video frames. However, the reality of video and
its moving language means that there are clues to interpretation beyond any
given frame. Specifically these cues can be defined as the motion of objects, the
way that individuals and objects interact over time, how time moves within the
video, and the event relationships between objects and characters.
To date, various attempts have been made to sort video pixels into groups
of similarity, but this has not proven a simple or error-free task. Video seg-
mentation aims to sort pixels into regions of spatio-temporal unity in terms of
individual contents and their movements. It is useful for higher level vision tasks
such as activity recognition, object tracking across video frames, content-based
retrieval and more general image enhancement. However the intricate nature of
the segmentation process is related to the temporal coherence of a video clip.
The frame-by-frame approach to segmentation generally results in a choppy and
unusable end product, since individually segmented frames are difficult to patch
back into a video stream due to a lack of coherence in movement. Indeed it is in


the very nature of the process of segmentation developed for still images that it
is unable to realise continuity through time.
In this work we explore an approach to extracting three dimensional (3D)
human volume, consisting of two spatial and one temporal dimension. Our im-
plementation of video segmentation follows the line of tracking-based methods.
It detects and segments human body regions from a video stream by jointly
embedding parts and pixels [1]. For all extracted segments the appearance and
shape models will be learned in order to automatically identify foreground ob-
jects across video frames. It focuses on human contours, in particular, modified
from the category independent segmentation work by Lee et al. [2]. The approach
is evaluated using office scenes selected from the Hollywood2 dataset [3]. The
experimental results indicate that the approach was able to create consistently
better segmentation than recently implemented work [2].

1.1 Related Work


Recent image segmentation techniques have the possibility of rendering segmen-
tation in real time, though continuity across frames may still present an issue.
The consideration of temporal coherence is obviously what needs to be added
to the conventional schemes in order to cleanly segment video streams. A mean
shift approach was developed by Freedman and Kisilev [4], which used a sample-
based method to group frames, ten in their case, into image clusters. They were
able to smooth out the segmentation, resulting in elements larger than frames
of a moving image, without taking temporal information into account.
The current range of spatio-temporal video segmentation techniques can gen-
erally be divided into those that use information from subsequent frames, and
those that make use of information from previous frames. Patti et al. [5] inves-
tigated a Kalman filtering based mechanism, generating more fluid and coher-
ent segmentation. Kalman filtering was able to sort visual information through
time, although its causal process could only take past data into account. Paris
[6] managed to realise real-time performance using a method based on the Gaus-
sian kernel with mean shift segmentation. The mechanism did not, however, take
information from future video frames into the process of analysis.
Techniques that look at both past and future frames form the third category
of segmentation [7]. These methods treat a video as a 3D space-time
volume, making use of a varying mean shift algorithm in order to carry out a
segmentation process [8]. One such application was created by Dementhon and
Megret [9], who created a lattice with hierarchies, in order to rank and evaluate
clusters of space-time in an efficient manner.
Wang et al. [10] developed a mechanism for ‘tooning’ videos based on an
anisotropic kernel mean shift. Motion heuristics was yet another scheme for
creating smooth layers within a video, as Wang and Adelson [11] segmented a
video iteratively using this method. Tracking-based video segmentation methods
generally define segments at a frame level; they use motion, colour and spatial re-
lations to determine segmentation in a relatively unified fashion [12,13]. Brendel
and Todorovic [14] used contour cues to allow splitting and merging of segments

Fig. 1. Two-stage approach to human volume segmentation. A human body detected


in the first stage is propagated along video frames in the second stage.

to boost the tracking performance. Finally, interactive object segmentation has


recently shown significant progress [15,16,17], producing high quality segmenta-
tions driven by user input. We exhibit a similar interactive framework driven by
our segmentation.

2 Approach
Our goal is to segment human body volume in an unlabelled video. The ap-
proach consists of two main stages (Figure 1). Firstly, human body objects are
segmented at a frame level by combining low-level cues with a top-down part-
based person detector developed by Maire et al. [1], formulating grouped patches.
Secondly, detected segments are propagated along the video frames, exploiting
the temporal consistency of detected foreground objects using colour models and
local shape matching [2]. The final output is a spatio-temporal segmentation of
the human body in a video stream. We now describe each stage in turn.

2.1 Estimation of Human Body Region at Frame Level


This stage builds on the graph-based image segmentation technique of Maire et
al. [1]. It produces a grouping of parts and pixels along the following idea:
– pixels are connected based on low-level cues in order to accomplish region
consistency;
– detected parts are bound together when they belong to the same object;
– the regions belonging to a part are included in the foreground, whereas the
remaining regions are pushed to the background.
A brief description of the first stage is given below. Further detail should be
referred to [1].

Globalisation. The angular embedding (AE) algorithm is used as a globalisa-
tion framework [18], which is constrained using a pairwise ordering relationship
matrix Θ. Each relationship is assigned a confidence matrix C, which is com-
bined with linear constraints on a solution space of embedding U and complex
eigenvectors, z0 , . . . , zm−1 , to form the generalised eigenproblem:
$$QPQz = \lambda z$$

where P is a normalised weight matrix and Q is a projector onto the feasible
solution space, defined by

$$P = D^{-1}W, \qquad Q = I - D^{-1}U(U^{T}D^{-1}U)^{-1}U^{T}$$

D and W are defined based on C and Θ:

$$D = \mathrm{Diag}(C\mathbf{1}_n), \qquad W = C \bullet e^{i\Theta}$$

where n represents the number of nodes, 1n is a column vector of ones, I is the
identity matrix, Diag(·) is a matrix with its argument on the main diagonal, •
stands for the matrix Hadamard product, i = √−1, and exponentiation is performed
element-wise.
and transfer the output of pixels and parts into C.
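For illustration, a dense NumPy sketch of this globalisation step is given below, assuming the matrices C, Θ and the constraint matrix U are already assembled (the actual implementation of [18] relies on sparse solvers):

```python
import numpy as np

def angular_embedding(C, Theta, U, m):
    n = C.shape[0]
    D = np.diag(C @ np.ones(n))                       # D = Diag(C 1_n)
    W = C * np.exp(1j * Theta)                        # W = C . e^{i Theta} (Hadamard product)
    D_inv = np.linalg.inv(D)
    P = D_inv @ W
    Q = np.eye(n) - D_inv @ U @ np.linalg.inv(U.T @ D_inv @ U) @ U.T
    evals, evecs = np.linalg.eig(Q @ P @ Q)           # generalised problem QPQ z = lambda z
    order = np.argsort(-np.abs(evals))[:m]            # m leading eigenvectors z_0, ..., z_{m-1}
    return evecs[:, order]
```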
Graph Setup: Pixel and Part Relations. The graph to be used for the
image segmentation is constructed using four node types [pixels (p), parts (q),
surround (s) and figure/ground prior (f )] within a block structure defined by n×
n matrices, C and Θ. Colour and texture pixel-pixel affinity Cp is determined by
examining the contour between the pixels, whereas the geometric compatibility
Cq (the part-part affinity) is identified using pairwise part-pose compatibility and
poselet detection scores. These part-part detection scores are used to determine
increases in repulsion between the part and the surround [(Cs , Θs )]; the latter is
based on the global surround node (Cf , Θf ). U is constrained by part embedding
equalling the mean embedding of the pixels comprising the part, and requires
the part/surround nodes to concur with the pixels assigned to each node.
$$C = \begin{bmatrix} C_p & 0 & 0 & 0 \\ 0 & \alpha \cdot C_q & \beta \cdot C_s & \gamma \cdot C_f \\ 0 & \beta \cdot C_s^{T} & 0 & 0 \\ 0 & \gamma \cdot C_f^{T} & 0 & 0 \end{bmatrix}, \qquad \Theta = \Sigma^{-1}\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -\Theta_s & -\Theta_f \\ 0 & -\Theta_s^{T} & 0 & 0 \\ 0 & -\Theta_f^{T} & 0 & 0 \end{bmatrix}$$

Output: Decoding Eigenvectors. The nodes in Cm , which are based on the


pixels and parts plugged into the graph according to the eigenvectors, are inher-
ently meaningful; the eigenvectors can be used to identify the region occupied
by each human body object in the frame. Each pixel is assigned to a part by
solving the equation:
$$p_k \longrightarrow \operatorname*{argmin}_{Q_i} \left\{ \min_{\substack{q_j \in Q_i \\ p_k \in M_j}} D(p_k, q_j) \right\}$$

where Mj is the region of the image overlapped by a part qj . Each part is then
assigned to a Qi , which represents the number of confirmed objects detected.
Human body segments are then scored for each frame. This step is repeated
with a set of N × F , where each N is the number of human body objects per
frame and F is the number of frames. These steps result in the set of hypotheses,
h, which are then used to identify the spatio-temporal segmentation of human
body parts in the entire video stream.
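A simple sketch of this assignment rule is given below, assuming a distance function D in the embedding space, a list of detected objects (each a set of part indices) and, for each part, the set of pixels covered by its region M_j:

```python
import numpy as np

def assign_pixel(pk, objects, part_pixels, D):
    """Return the index of the object Q_i whose closest overlapping part is nearest to pixel pk."""
    best_obj, best_dist = None, np.inf
    for i, parts in enumerate(objects):
        for qj in parts:
            if pk in part_pixels[qj]:                 # pixel must lie in the part's region M_j
                dist = D(pk, qj)
                if dist < best_dist:
                    best_obj, best_dist = i, dist
    return best_obj
```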

2.2 Spatio-temporal Segmentation of Human from a Video Stream


Each of the hypotheses (h) identified in the previous stage defines a foreground
(human body) and a background (surround) model. Each object-like region in
each frame is replaced by a human body, following the method of Lee et al. [2].
Pixel-wise segmentation is used to extract the human body segments from the
surround in the video stream on a frame-by-frame basis, using the space-time
Markov random field (MRF) described below. For each frame, the space-time
graph of a pixel is defined, where the pixel is represented by a node and the edge
between two nodes equates to the cut between two pixels. Each hypothesis h has
an energy function, which can be determined by:
 
$$E(f, h) = \sum_{i \in S} D_i^{h}(f_i) + \gamma \sum_{i,j \in \mathcal{N}} V_{i,j}(f_i, f_j)$$

where f represents the pixel nodes, S = {p1 , . . . , pn } is a set of n pixels in the


video, and i and j index the pixels in space and time. Each pixel is then assigned
to the foreground or background by setting pi of each pixel to fi ∈ {0, 1}, where
0 = background and 1 = foreground. The neighbourhood term Vi,j is used to
enhance smoothness in space and time between the pixels in adjacent frames.
Four spatial neighbours are assigned to each pixel per frame. Two temporal
neighbours are assigned in the preceding and subsequent frames; each of these is
then given an optical flow vector displacement. Neighbouring pixels of the same
colour are labelled using standard contrast dependent functions, with the cost
of labelling defined by:

$$D_i^{h}(f_i) = -\log\left(\alpha \cdot U_i^{c}(f_i, h) + (1 - \alpha) \cdot U_i^{l}(f_i, h)\right)$$

where Uic (·) is the colour-induced cost, and Uil (·) is the local shape match-induced
cost. The segments detected in each frame on the basis of their parts and pixels
are projected onto other frames by local shape matching, with a spatial extent
which defines the location and scale prior to the segment, whose pixels can sub-
sequently be labelled as foreground or background. Optical flow connections are
used to maintain frame-to-frame consistency of the background and foreground
labelling of propagated segments. For each hypothesis h, the foreground object
segmentation of the video can be labelled by using binary graph cuts to minimise
the function E(f, h). Each frame is labelled in this way, using a space-time graph
of three frames to connect each frame to its preceding and subsequent frames.
This is more efficient than segmenting the video as a whole.
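As a rough illustration, the energy above can be evaluated as sketched below, assuming precomputed per-pixel foreground/background costs and a list of spatial and flow-linked temporal neighbour pairs; minimisation with binary graph cuts, as used in the actual method, is not shown:

```python
import numpy as np

def mrf_energy(labels, unary, edges, colours, gamma=1.0, beta=10.0):
    """labels: {0,1} per pixel; unary: (n_pixels, 2) costs D_i^h; colours: per-pixel RGB."""
    data_term = unary[np.arange(len(labels)), labels].sum()
    smooth_term = 0.0
    for i, j in edges:
        if labels[i] != labels[j]:
            # contrast-dependent penalty: cheaper to cut across strong colour edges
            smooth_term += np.exp(-beta * np.sum((colours[i] - colours[j]) ** 2))
    return data_term + gamma * smooth_term
```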

Fig. 2. Sample segmentations. The first row shows key frames from two video clips.
The second and the third rows respectively present the results of key segments and the
corresponding segmentation using the approach in this paper. The last two rows show
the same attempts using the implementation by Lee et al. [2]. Best viewed on pdf.

3 Experiments
Dataset. The Hollywood2 dataset [3] holds a total of 69 Hollywood movie
scenes, from which ten short video clips were selected for testing the approach.
All the selected scenes are set in an office environment, and feature a broad range
of motions as well as a variety of temporal changes, thus creating a challenging
video segmentation task. The selected clips vary from 30 seconds to 2 minutes
in duration, with at least one human present in each shot; there are many shots
showing multiple human figures. For each clip, video frames are extracted using
an ffmpeg1 decoder, with a sample rate of one frame per second.
Evaluation Scheme. Accuracy is the commonly used measure for evaluating
video segmentation tasks. In this work we adopt the average per-frame pixel error
rate [19] for evaluation of the approach. Let F denote the number of frames in
the video, and S and GT represent pixels in the segmented region and in the
‘groundtruth’ across the frame sequence respectively. The error rate is calculated
using the exclusive OR operation:
$$E(S) = \frac{|\mathrm{XOR}(S, GT)|}{F}$$
The equation is used under the general hypothesis that object and groundtruth
annotation should match.
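A minimal sketch of this measure, assuming binary segmentation and ground-truth masks of shape (F, H, W), is:

```python
import numpy as np

def pixel_error_rate(seg_masks, gt_masks):
    """Average number of mislabelled pixels per frame."""
    F = seg_masks.shape[0]
    return np.logical_xor(seg_masks.astype(bool), gt_masks.astype(bool)).sum() / F
```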
1
www.ffmpeg.org/

Table 1. The average number of incorrectly segmented pixels per frame. The video
clip name is in the format of ‘sceneclipautoautotrain· · · · ·’ where ‘· · · · ·’ part is shown
in the table.

clip name this paper Lee et al.


00007 1172 1875
00062 9829 42532
00099 6996 20858
00105 10870 13949
00107 2096 6919
00181 1265 9624
00187 8900 19112
00319 520 4659
00405 3400 30513
00431 11585 45361

Results and Discussion. Figure 2 presents sample outcomes of segmentation


using (a) the approach in this paper and (b) the recent implementation by Lee
et al.2 [2]. It shows that accurate segmentation of humans was made by our
approach. Implementation by Lee et al. could not extract a complete human
body although it discovered some parts.
Quantitative evaluation was conducted using ten video clips from the Hol-
lywood2 dataset. The groundtruth was obtained by manually segmenting each
frame into the foreground (humans) and the background (anything else present
in the frame). Table 1 shows that the approach produced consistently better seg-
mentation than the one implemented by Lee et al. We observed that the typical
cause of failed segmentation by our approach was the absence of a human face.
On the other hand, segmentation by the latter was unsuccessful especially when
there was more than one person present in the scene.

4 Conclusion
In this paper we presented the two-stage approach to spatio-temporal human
body segmentation by extracting a human body at a frame level, followed by
tracking the segmented regions using colour appearance and local shape match-
ing across the frames. By detecting and segmenting human body parts, we over-
came the limitations of the bottom-up unsupervised methods that often overseg-
mented an object. Using ten challenging video clips derived from the Hollywood2
dataset, we were able to obtain consistently better segmentation results than re-
cent implementations in the field.

2
Program code available from www.cs.utexas.edu/˜grauman/research/software.html.
We tested their implementation with our office scene dataset. This was perhaps not
totally a fair comparison because the purpose of their work was an unsupervised
approach to key object segmentation from unlabelled video, where the number of
objects was restricted to one, while we focused on extraction of human volume.

Acknowledgements. The first author would like to thank Taibah University,


Madinah, Saudi Arabia for funding this work as part of her PhD scholarship
program.

References
1. Maire, M., Yu, S.X., Perona, P.: Object detection and segmentation from joint
embedding of parts and pixels. In: Proceedings of ICCV (2011)
2. Lee, Y.J., Kim, J., Grauman, K.: Key-segments for video object segmentation. In:
Proceedings of ICCV (2011)
3. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: Proceedings of CVPR
(2009)
4. Freedman, D., Kisilev, P.: Fast mean shift by compact density representation. In:
Proceedings of CVPR (2009)
5. Patti, A.J., Tekalp, A.M., Sezan, M.I.: A new motion-compensated reduced-order
model Kalman filter for space-varying restoration of progressive and interlaced
video. IEEE Transactions on Image Processing 7 (1998)
6. Paris, S.: Edge-preserving smoothing and mean-shift segmentation of video
streams. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS,
vol. 5303, pp. 460–473. Springer, Heidelberg (2008)
7. Klein, A.W., Sloan, P.P.J., Finkelstein, A., Cohen, M.F.: Stylized video cubes. In:
ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2002)
8. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space anal-
ysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002)
9. DeMenthon, D., Megret, R.: Spatio-temporal segmentation of video by hierarchical
mean shift analysis. Technical report, Language and Media Processing Laboratory,
University of Maryland (2002)
10. Wang, J., Xu, Y., Shum, H.Y., Cohen, M.F.: Video tooning. ACM Transaction on
Graphics 23 (2004)
11. Wang, J.Y.A., Adelson, E.H.: Representing moving images with layers. IEEE
Transactions on Image Processing 3 (1994)
12. Khan, S., Shah, M.: Object based segmentation of video using color, motion and
spatial information. In: Proceedings of CVPR (2001)
13. Zitnick, C.L., Jojic, N., Kang, S.B.: Consistent segmentation for optical flow esti-
mation. In: Proceedings of ICCV (2005)
14. Brendel, W., Todorovic, S.: Video object segmentation by tracking regions. In:
Proceedings of ICCV (2009)
15. Bai, X., Wang, J., Simons, D., Sapiro, G.: Video SnapCut: robust video object
cutout using localized classifiers. ACM Transaction on Graphics 28 (2009)
16. Huang, Y., Liu, Q., Metaxas, D.: Video object segmentation by hypergraph cut.
In: Proceedings of CVPR (2009)
17. Li, Y., Sun, J., Shum, H.Y.: Video object cut and paste. ACM Transaction on
Graphics 24 (2005)
18. Yu, S.X., Shi, J.: Segmentation given partial grouping constraints. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 26 (2004)
19. Tsai, D., Flagg, M., Rehg, J.M.: Motion coherent tracking with multi-label MRF
optimization. In: Proceedings of BMVC (2010)
Sparse Depth Sampling for Interventional
2-D/3-D Overlay: Theoretical Error Analysis
and Enhanced Motion Estimation

Jian Wang1,2 , Christian Riess1 , Anja Borsdorf2, Benno Heigl2 ,


and Joachim Hornegger1,3
1
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
2
Healthcare Sector, Siemens AG, Forchheim
3
Erlangen Graduate School in Advanced Optical Technologies (SAOT)
[email protected]

Abstract. Patient motion compensation is challenging for dynamic 2-


D/3-D overlay in interventional procedures. A first motion compensation
approach based on depth-layers has been recently proposed, where 3-D
motion can be estimated by tracking feature points on 2-D X-ray images.
However, the sparse depth estimation introduces a systematic error. In
this paper, we present a theoretical analysis on the systematic error and
propose an enhanced motion estimation strategy accordingly. The sim-
ulation experiments show that the proposed approach yields a reduced
3-D correction error that is consistently below 2 mm, in comparison to a
mean of 6 mm with high variance using the previous approach.

Keywords: interventional 2-D/3-D overlay, error analysis, sparse depth


sampling, 3-D motion estimation.

1 Introduction

In interventional radiology, pre-operative three-dimensional (3-D) images (e.g.


computed tomography (CT) or magnetic resonance angiography (MRA)) can
be fused with interventional two-dimensional (2-D) X-ray images (fluoroscopy),
which is known as 2-D/3-D overlay. This yields several advantages: 1) the pre-
operative planning information in the 3-D images can be displayed on the fluo-
roscopic images; 2) additional information that is not visible in the fluoroscopic
images (e.g. vascular structure and spatial information) can be seen in the over-
laid 3-D images. A good 2-D/3-D overlay can shorten the time of the procedure
and reduce the radiation dose [1]. Accuracy is the most critical factor for the
quality of 2-D/3-D overlay. The proper spatial alignment of a 2-D projection to
a 3-D image (e.g. volume) is typically referred to as 2-D/3-D image registration.
2-D/3-D registration is usually performed before the intervention to ensure
an accurate overlay at the starting point. However, patient motion during the
intervention makes it necessary to correct the registration on the fly. In state-
of-the-art applications, the patient motion is usually detected by clinicians and


the correction is triggered by user interaction. However, clinicians have limited


time and attention for computer interaction during the treatment [2].
Research on real-time motion compensation can be found in the
literature [3–5]. All these approaches are either application specific or rely on
specific devices or particular motion models.
Recently, we proposed in [6] a depth-layer-based tracking approach for patient
motion compensation. The key innovative contribution of this approach is that
the depth information is transferred from 3-D image to 2-D feature points using
depth layers, which are the images rendered separately from sub-volumes of
different depth intervals. Fig. 1(a) shows how the sub-volumes are generated.
Based on the initial registration, the 2-D feature points can be mapped to certain
depth intervals by matching them to the depth layers. To the knowledge of the
authors, this is the first approach that is capable of estimating real 3-D motion
by only tracking 2-D feature points from single-view X-ray images. Since this
approach does not rely on a particular device or motion model, and no iterative
computation of digitally reconstructed radiographs (DRRs) is involved, it yields
a high potential for real-time motion compensation in dynamic 2-D/3-D overlay.
However, depth sampling (quantization) introduces a systematic error in mo-
tion estimation. Using fine depth sampling can of course reduce the error, but
3-D structures are then truncated into several small sub-volumes, and this leads
to poor 2-D/3-D matching results. In contrast, the 3-D structures are more likely
to be preserved in sub-volumes using coarse depth intervals, i.e. using sparse
depth sampling. Therefore, we see a requirement to extend the method to be
able to handle the depth error caused by sparse depth sampling.
In this paper, we present a mathematical model of the systematic error in-
troduced by sparse depth sampling. Based on this analysis, we propose a depth
correction strategy for motion estimation, which handles the systematic error
together with random noise. Quantitative simulation experiments are performed
to evaluate the new approach. Qualitative results are shown by an example of
motion compensated 2-D/3-D overlay using our approach.

2 Theoretical Error Analysis of Sparse Depth Sampling


In this part, we analyze the systematic error of sparse depth sampling, and set
it into relation with the random noise coming from other noise sources.

2.1 The Systematic Error Introduced by Depth Sampling


The principle of 2-D/3-D overlay is to virtually place the 3-D volume at the cor-
responding position of the patient, so that the volume is rendered as imaged from
the X-ray source and fused with the live fluoroscopic image [1]. The projection
geometry of a C-arm system is described by a pinhole camera model, as shown
in Fig. 1(a). The projection procedure is described by the projection matrix
P ∈ R^{3×4}, which can be represented as P = K[R|t], where

$$K = \begin{bmatrix} a & 0 & u \\ 0 & a & v \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3\times3}$$

contains the intrinsic parameters, rotation R ∈ R3×3 and translation t ∈ R3 are


known as extrinsic parameters [7]. All the parameters are known during the 2-D
acquisition from a calibrated C-arm system.
To simplify the problem, the motion estimation is done in the camera coordi-
nate system, where the origin is located at the camera center c and the z-axis is
aligned with the principal ray direction (L0 in Fig. 1(a)). Thus the z compo-
nent of a 3-D point represents its depth, and the projection matrix in the camera
coordinate system is simplified as Pc = K[I|0].
To analyze the systematic error by depth sampling, we start with a 3-D point
x = (x, y, z, 1)T (homogeneous coordinates) and its projection p = (up , vp , 1)T
on the detector plane D (Fig. 1(b)). Given the projection matrix Pc , the 2-
D projection point p can be back-projected to a ray in 3-D [7], denoted as
r(p) = v_r(p) + λc, where

$$v_r(p) = P_c^{+} p = \left( \frac{u_p - u}{a},\ \frac{v_p - v}{a},\ 1,\ 0 \right)^{T}$$

and λ is a scalar related to the depth of the 3-D point on the ray [7] [6]. In the
camera coordinate system, where c = (0, 0, 0, 1), r(p) can be further simplified as

$$r(p) = \left( (v_r^{xyz}(p))^{T},\ \lambda \right)^{T}, \quad \text{with} \quad v_r^{xyz} = \left( \frac{u_p - u}{a},\ \frac{v_p - v}{a},\ 1 \right)^{T} \qquad (1)$$

Since the 3-D point x with depth d and the point xE with sparsely estimated
depth dE are both on r(p), it yields λ(x) = 1/z = 1/d and λ(xE ) = 1/dE . The
points can be then reformulated as
$$x = \left( (v_r^{xyz}(p))^{T},\ 1/d \right)^{T} \quad \text{and} \quad x_E = \left( (v_r^{xyz}(p))^{T},\ 1/d_E \right)^{T}. \qquad (2)$$
Since vrxyz is determined by the 2D projection, the representations in Eq. 2 show
the geometric relationship between x and xE (as in Fig. 1(b)): they share the
same projection but with a shift of Δd = dE − d in depth.
After a rigid motion (rotation R0 and translation t0 ), the new projections of
x and xE are p′ and p′E, respectively, as shown in Fig. 1(b). In this scenario,

(a) Depth sampling along viewing direction (b) The systematic error

Fig. 1. Illustration of depth sampling and the systematic error



the points p and p′ are observations of x on the 2-D image before and after
the motion. Since the estimated 3-D point xE and p′ (instead of p′E) are used
in motion estimation [6], the systematic error of one point is introduced by the
difference vector between p′ and p′E, as follows:

$$p'_{xy} - p'_{E,xy} = a \cdot \frac{d - d_E}{(d \cdot r_3 v_r^{xyz} + t_0^{z})(d_E \cdot r_3 v_r^{xyz} + t_0^{z})} \begin{bmatrix} t_0^{z} & 0 & -t_0^{x} \\ 0 & t_0^{z} & -t_0^{y} \end{bmatrix} R_0 v_r^{xyz}, \qquad (3)$$

where r3 ∈ R^{1×3} is the third row of R0 and t0 = (t0^x, t0^y, t0^z)^T. The above 2-D
vector corresponds to a line segment lε connecting p′ and p′E, which is exactly
a segment of the epipolar line of p under the motion of [R0|t0] [7]. As Eq. 3
shows, the direction of the vector is determined only by the motion [R0|t0] and
the 2-D projection p. The depth error Δd together with the off-plane motion (r3
and t0^z) affects the length of lε. Therefore, the systematic error introduced by sparse
depth sampling is not influenced only by the estimation error Δd in depth.
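The quantities in Eqs. 1–3, as reconstructed above, can be sketched as follows; the intrinsics (a, u, v), the motion (R0, t0) and the depths d and dE are assumed given, and the function returns the difference vector of Eq. 3:

```python
import numpy as np

def ray_direction(up, vp, a, u, v):
    return np.array([(up - u) / a, (vp - v) / a, 1.0])   # v_r^xyz from Eq. 1

def systematic_error(up, vp, a, u, v, R0, t0, d, dE):
    vr = ray_direction(up, vp, a, u, v)
    r3 = R0[2]                                            # third row of R0
    A = np.array([[t0[2], 0.0, -t0[0]],
                  [0.0, t0[2], -t0[1]]])
    scale = a * (d - dE) / ((d * (r3 @ vr) + t0[2]) * (dE * (r3 @ vr) + t0[2]))
    return scale * (A @ R0 @ vr)                          # 2-D vector p'_xy - p'_E,xy (Eq. 3)
```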

2.2 The Systematic Error in Relation to the Random Noise


In the last section, the mathematical representation of the systematic error is
derived. However, all the measurements in the real world are subject not only
to systematic error but also to random noise [8]. Therefore, we model the noise
from other steps in the whole procedure in [6] (e.g. tracking error) as random
noise, which is defined by a uniform distribution. In this section, we analyze the
systematic error in Eq. 3 together with random noise, in order to treat them
differently to achieve a better motion estimation.
Fig. 2(a) illustrates the systematic error together with random noise in our
scenario. After an initial motion estimation using sparse depth sampling as in [6],
we can compute the systematic error vector using Eq. 3, which corresponds to
the line segment lε through p′E (Fig. 2(a)). The possible maximum length of lε,
determined by the depth bounds, can be used as an explicit measurement of
the systematic error, denoted as S^i for p′E^i (Fig. 2(b) and 2(c)).
Furthermore, we assume that the random noise can shift a 2-D point towards
an arbitrary direction with maximum distance δmax , where direction and dis-
tance are uniformly distributed. Therefore, the observation of the projection p′
(denoted as p̃′) is positioned within a disk-like region with a radius of δmax


Fig. 2. (a) Illustration of projection and the errors; (b) the metrics for the "influences"
of the errors; (c) a case with more significant random noise

(Fig. 2(a)), which reflects the accuracy of the tracking method applied in our
procedure. However, since p′ and δmax are unknown in practice, there is no explicit
measurement for the random noise. Therefore, we again make use of Eq. 3 for
the metric of the random noise. Since the in-plane motion can be well estimated
initially [6], and the depth error as well as the off-plane motion affects mainly
the length of the systematic error vector in Eq. 3, the true projection p′ appears
near to or on the line lε. Thus, we introduce here the point-to-line distance N^i
(the distance between p̃′^i and lε^i in Fig. 2(b)) as the metric for the "influence"
of the random noise.
Fig. 2(b) and 2(c) show two examples of error conditions. In Fig. 2(b), N^i is
obviously smaller than S^i. If this is mostly the case for other points, we can draw
the conclusion that the systematic error is more dominant than the random
noise. Contrarily, if N^i is bigger than S^i (Fig. 2(c)) for most of the points, the
random noise appears more dominant.

3 The Error-Dependent Motion Estimation Strategy


In order to reduce the depth error, we now consider a depth correction step
after the initial motion estimation (denoted as [R̂0 |t̂0 ]). It can be performed by
solving a least-squares optimization problem as
 n 
! " 
i  i
dˆ = argmin
i
E . , p (d )
dist p E E , (4)
{dE } i

where n is the number of points, dist(·, ·) is the Euclidean distance of the two
points and pE = K[R̂0 |t̂0 ]xE . This least-squares optimization helps to find the
best fitting corrected depth values based on the estimated motion. We then refine
the motion by a follow-up motion estimation using the corrected depths.
However, if the random error is about the same level of or more dominant than
the systematic error (e.g. under a small or specific motions causing small system-
atic error), depth correction can even introduce more error in the motion estima-
n
tion results (see section 4). The reason is that minimizing i dist p .  , pE (di
E)
in Eq. 4 does not lead to proper fitting depth values 2(c). In contrast, if the
random error has an acceptable range (i.e. with reasonable δmax ), it’s better to
include more points in the motion estimation procedure, so that a globally con-
sistent solution of the motion can be estimated while the effect of the random
error is averaged to a minimum.
An error-based motion estimation strategy is therefore proposed according to
the influence metrics proposed in section 2.2 and a dominance factor f (Tab. 1).
We consider a strong depth correction criterion: if S̄ > f · N̄, we perform depth
correction and motion estimation on all points. For the cases not satisfying the
strong criterion, we consider a weak criterion: if some points {x_E^i}^{weak} still
contain dominant systematic error (S^i > f · N̄), we perform depth correction on
{x_E^i}^{weak}, but still refine the motion using all points. If neither of the criteria is
satisfied, we consider all points as random noise dominant (no further correction).

Table 1. The error-dependent motion estimation via dynamic depth correction


Inputs: 3-D point set with sparse depth estimation {x_E^i}, 2-D projections before
motion {p^i} and the observed projections after motion {p̃′^i}.
Initialization: Initial estimation of the rigid motion [R̂0|t̂0] using {x_E^i} and {p̃′^i} [6].
Error analysis: Based on [R̂0|t̂0], compute the systematic error vectors for all {x_E^i}
using Eq. 3, the systematic error influences {S^i} and the mean influence S̄. Com-
pute the random noise influences {N^i} and the mean influence N̄.
Optimization determination criteria:
1. Strong criterion: if S̄ > f · N̄, perform depth correction and motion refinement
on all points {x^i};
2. Weak criterion: else if max(S^i) > f · N̄, perform depth correction on {x_E^i}^{weak}
and motion refinement on all points {x^i};
3. Otherwise: no further correction.
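The decision logic of Table 1 can be sketched as follows, assuming the per-point influences S^i and N^i have already been computed and using an illustrative dominance factor f:

```python
import numpy as np

def select_points_for_depth_correction(S, N, f=3.0):
    """S, N: per-point systematic and random-noise influences; returns indices to correct."""
    S_mean, N_mean = S.mean(), N.mean()
    if S_mean > f * N_mean:                 # strong criterion: correct all points
        return np.arange(len(S))
    weak = np.nonzero(S > f * N_mean)[0]    # weak criterion: only systematically dominated points
    if len(weak) > 0:
        return weak
    return None                             # random noise dominant: no further correction
```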

4 Experiments and Discussion


In this section, we quantitatively evaluate our new approach using point-based
simulation experiments. Furthermore, we show the qualitative motion estimation
results using a real clinical CT volume with simulated X-ray images.

4.1 Point-Based Simulation Experiments


Point-based simulation is a convenient and established way of evaluating the
theoretical-analysis-based algorithms (as in [6]). It allows to neglect the external
influences and gives better insights how things behave in the scope of interest.
Our point-based simulation set-up is similar as in [6]. The projection parameters
of a real C-arm system are applied. 3-D point sets are randomly generated within
a bounding box (20 cm×20 cm×30 cm). Random 3-D motions of different scales
are generated, which cause 2-D projection errors from about 1 mm to 13 mm on
the detector plane. Random, uniformly distributed noise with δmax = 4 pixels
(see Sec. 2.2) is added to the 2-D correspondences.
In Fig. 3, results of the test cases using 5- and 10-interval depth sampling
are shown. The plots show how the errors caused by a motion, which represent
the scale of the 3-D motion, are corrected in 2-D and 3-D. In each plot, the
horizontal and vertical axes show respectively the error before and after motion
correction. We discuss three important properties of the proposed algorithm.
1. Error reduction in 2-D and 3-D – In clinical practice, an error of 2 mm can
be considered acceptable [3]. As shown in Fig. 3, the baseline algorithm [6] often
fails to achieve this requirement in 3-D error correction. Conversely, with our
proposed depth correction scheme, we consistently yield a 3-D correction error
below 2 mm. This shows that the 3-D motion (and even the off-plane
motion part) can be well estimated using our proposed approach.
2. Effects of depth quantization – Both examples use sparse depth sampling,
where the systematic error causes a significant quantization effect.

(a) Results of using 5-interval depth sampling; (b) Results of using 10-interval depth sampling. [Each panel plots the 2-D and 3-D error before motion correction (horizontal axis, mm) against the error after motion correction (vertical axis, mm), comparing no depth correction, depth correction on all points (10-interval case) and our depth correction.]

Fig. 3. Plots of the results of point-based experiments

The quantization effect leads to an uncertainty of the motion estimation. As we


can see by comparing the results in Fig. 3(a) and 3(b), 10-interval depth sampling
even yields worse results than 5-interval sampling. The results are much more
stable (consistently under 2 mm) by using our proposed depth correction.
3. Performance gained by the point selection criteria – In Fig. 3(b), the esti-
mation results using all points for depth correction, where none of the criteria
in Tab. 1 are considered, are shown together with the results using our motion
estimation strategy (dominance factor f = 3). Obviously, more computation is
involved if all points are considered for depth correction. Nevertheless, we can
observe better results by using the point selection criteria for depth correction.

4.2 Image-Based Experiments


Similarly to [6], a sequence of DRRs is generated from a clinical CT volume
under a sequence of rigid motions, where the 2-D/3-D overlay is initially registered.
10-layer depth sampling is used for motion compensation. A normalized cross cor-
relation (NCC) based similarity map between the 2-D projection (gradient mag-
nitude) and the 3-D volume (gradient-based rendering [6]) is shown, respectively
in green, blue and red, in Fig. 4 together with the overlays. Fig. 4(a) and 4(b)
show the 2-D/3-D overlay without and with our motion estimation approach
at frame 19. Due to the patient motion in 2-D projection, the overlay loses 2-
D/3-D similarity (green) along the frames. However, our motion compensation
approach maintains the high 2-D/3-D similarity. The results clearly show that
the 2-D/3-D overlay accuracy is strongly improved using our proposed approach.

(a) Without motion compensation (b) With motion compensation

Fig. 4. Motion compensation using sparse depth estimation

5 Conclusion and Future Work


In this paper, a theoretical analysis of the systematic error introduced by sparse
depth sampling for motion estimation is presented. An improved motion esti-
mation strategy which handles the depth error is proposed. The experimental
results show the improved estimation of 3-D motion in cases of very sparse depth
sampling, the 3-D errors are below 2 mm after motion correction.
As an outlook, we will evaluate different tracking methods using the presented
error analysis. The theoretical analysis can further help to adapt the tracking
methods to X-ray images for our motion compensation framework.

References
1. Rossitti, S., Pfister, M.: 3D road-mapping in the endovascular treatment of cerebral
aneurysms and arteriovenous malformations. Interventional Neuroradiology 15(3),
283 (2009)
2. Ruijters, D.: Multi-modal image fusion during minimally invasive treatment. PhD
thesis, Katholieke Universiteit Leuven and the University of Technology Eindhoven,
TU/e (2010)
3. Brost, A., Liao, R., Strobel, N., Hornegger, J.: Respiratory motion compensation by
model-based catheter tracking during ep procedures. Medical Image Analysis 14(5),
695–706 (2010)
4. Ma, Y., King, A.P., Gogin, N., Rinaldi, C.A., Gill, J., Razavi, R., Rhode, K.S.:
Real-time respiratory motion correction for cardiac electrophysiology procedures
using image-based coronary sinus catheter tracking. In: Jiang, T., Navab, N., Pluim,
J.P.W., Viergever, M.A. (eds.) MICCAI 2010, Part I. LNCS, vol. 6361, pp. 391–399.
Springer, Heidelberg (2010)
5. Wang, P., Marcus, P., Chen, T., Comaniciu, D.: Using needle detection and tracking
for motion compensation in abdominal interventions. In: 2010 IEEE International
Symposium on Biomedical Imaging: From Nano to Macro, pp. 612–615. IEEE (2010)
6. Wang, J., Borsdorf, A., Hornegger, J.: Depth-layer based patient motion compen-
sation for the overlay of 3D volumes onto x-ray sequences. In: Proceedings Bildver-
arbeitung für die Medizin 2013, pp. 128–133 (2013)
7. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn.
Cambridge Univsersity Press (2003)
8. Taylor, J.R.: An Introduction Error Analysis: The Study of Uncertainties in Physical
Measurements. University Science Books (1997)
Video Synopsis Based on a Sequential Distortion
Minimization Method

Costas Panagiotakis1, Nelly Ovsepian2 , and Elena Michael2


1
Dept. of Commerce and Marketing,
Technological Educational Institute (TEI) of Crete, 72200 Ierapetra, Greece
[email protected]
2
Dept. of Computer Science, University of Crete, P.O. Box 2208, Greece
[email protected], [email protected]

Abstract. The main goal of the proposed method is to select from a


video the most “significant” frames in order to broadcast, without appar-
ent loss of content by decreasing the potential distortion criterion. Ini-
tially, the video is divided into shots and the number of synopsis frames
per shot is computed based on a criterion that takes into account the
visual content variation. Next, the most “significant” frames are sequen-
tially selected, so that the visual content distortion between the initial
video and the synoptic video is minimized. Experimental results and
comparisons with other methods on several real-life and animation video
sequences illustrate the high performance of the proposed scheme.

Keywords: Video summarization, Key frames, Video synopsis.

1 Introduction

The traditional representation of video files as a sequence of numerous consecu-


tive frames, each of which corresponds to a constant time interval, while being
adequate for viewing a file in a movie mode, presents a number of limitations for
the new emerging multimedia services such as content-based search, retrieval,
navigation and video browsing [1]. Therefore, it is important to segment the
video into homogenous segments in content domain and then to describe each
segment by a small and sufficient number of frames [2] in order to get a video
summarization.
Video summarization algorithms attempt to abstract the main occurrences,
scenes, or objects in a clip in order to provide an easily interpreted multimedia
synopsis. The videos consist of a sequence of successive images which are called
frames and represent scenes in movie-motion. The video summarization algo-
rithms are based on detection of representative frames inside on basic temporal
units which are called shots. They are designed to detect the most suitable frames
from each shot in order to shorten the video without high distortion. A shot can
be defined as a sequence of frames that are or appear to be continuously cap-
tured from the same camera. Key frames are the most significant images which


are extracted from video footage. They have been used to distinguish videos,
summarize them and provide access points [3].
Key frames selection approaches can be classified into basically three cate-
gories, namely cluster-based methods, energy minimization-based methods and
sequential methods [1,4]. Cluster-based methods take all frames from every shot
and group them by content similarity in order to select key frames. The disadvantage of these
methods is that the temporal information of a video sequence is omitted. The
energy minimization based methods extract the key frames by solving a rate-constrained
problem; these methods are generally computationally expensive due to the
iterative techniques they require. Sequential methods select a new key frame when the
content difference from the previous key frame exceeds a predefined threshold.
In [5], key frames are computed based on unsupervised learning for video
retrieval and video summarization by combination of shot boundary detection,
intra-shot-clustering and keyframe “meta-clustering”. It exploits the Color Layout
Descriptor (CLD) [6] on consecutive frames, and the computed differences
between the descriptors define the bounds of each shot. Recently, dynamic programming
techniques have been proposed in the literature, such as the MINMAX approach
of [7] to extract the key frames of a video sequence. In this work, the problem is
solved optimally in O(N 2 · Kmax ), where Kmax is related to the rate-distortion
optimization. In [8], a video is represented as a complete undirected graph and
the normalized cut algorithm is carried out to globally and optimally partition
the graph into video clusters. The resulting clusters form a directed temporal
graph and a shortest path algorithm is proposed for video summarization.
Video summarization has been applied by many researchers with multiple ap-
proaches. Most of them deal with minimizing content features, defining
restrictions on distortion, or applying simple clustering-based techniques while ignoring
temporal variation. In addition, due to their high computational cost (O(N³)
when the number of key frames is proportional to the number of video frames
N), most of the aforementioned methods have been used to extract a small
percentage of the initial frames that represents the visual content well, but they have
not been used to reproduce a video synopsis. Video synopsis is an important
task for video summarization, since it is another short video representation of
visual content and video variation. This paper refers to video summarization by
the meaning of video synopsis creation. The resulting video synopsis takes into
account temporal content variation, shot detection, and minimizes the content
distortion between the initial video and the synoptic video. At the same time
the proposed method has low computational cost O(N 2 ). Another advantage of
this work is that it can be used under any visual content description.
The rest of the paper is organized as follows: Section 2 gives the problem
formulation. Section 3 presents the proposed methodology for video synopsis
creation. The experimental results are
given in Section 4. Finally, conclusions and discussion are provided in Section 5.

2 Problem Formulation
The problem of video synopsis belongs to the class of video summarization problems. Its goal
is to create a new video, shorter than the initial video according to a given
parameter α, without significant loss of content between the two videos (the
distortion between the original video and the video synopsis is minimized). The
ratio between the temporal duration of the video synopsis and the initial video
is equal to α ∈ [0, 1]. Let N denote the number of frames of the original video. Then,
the video synopsis consists of α · N frames. Therefore, we have to select the α · N
most representative key frames. The video synopsis is broadcast at the original
frame rate, meaning that the real speed of the new video has been increased
by a factor of 1/α on average. For example, for a video of 5 sec duration at
25 frames/sec, the whole video consists of 5 × 25 = 125 frames; given the
parameter α = 0.2, the final video will have 125 × 0.2 = 25 frames. In other
words, the final duration will be one second, which is 20% of the initial video.
Let C_i, i ∈ {1, ..., N}, denote the visual descriptor of the i-th frame of the original video.
Let S ⊂ {1, ..., N} denote the frames of the video synopsis. According to the problem
definition, the number of frames of the video synopsis (|S|) is equal to
α · N. Then, the distortion D({1, ..., N}, S) between the original video and the video
synopsis is given by the following equation:

$$D(\{1, ..., N\}, S) = \sum_{i=1}^{S(1)} d(i, S(1)) + \sum_{i=S(|S|)+1}^{N} d(i, S(|S|)) + \sum_{i=S(1)+1}^{S(|S|)} \min_{S(j) \le i \le S(j+1)} \big( d(i, S(j)),\, d(i, S(j+1)) \big) \qquad (1)$$

where d(i, S(j)) denotes the distance between the visual descriptors of the i-th frame
and the S(j)-th frame, and S(j), S(j+1) are the two successive frames of the video synopsis
for which S(j) ≤ i ≤ S(j+1); this means that S(j) is determined by the index
i. The first and the second parts of this sum concern the cases that the frame
i is located before the first key frame S(1) or after the last key frame S(|S|),
respectively. Therefore, the distortion used here, defined as the sum of visual
distances between each frame of the original video and the “closest” corresponding
frame of the video synopsis, can be considered as an extension of the
Iso-Content Distortion Principle [1] to the domain of shots.
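To make the distortion criterion concrete, the following Python sketch (illustrative only, not part of the original paper) evaluates D({1, ..., N}, S) for a list of per-frame descriptors and a set of key-frame indices; the descriptor distance d is passed in as a function, and 0-based frame indices are assumed.

```python
import bisect

def synopsis_distortion(descriptors, key_frames, d):
    """Distortion D({1,...,N}, S) of Eq. (1), using 0-based frame indices.

    descriptors : list of per-frame visual descriptors (e.g. CLDs)
    key_frames  : indices of the selected synopsis frames S
    d           : distance function between two descriptors
    """
    S = sorted(key_frames)
    total = 0.0
    for i in range(len(descriptors)):
        if i <= S[0]:                      # frames before the first key frame
            total += d(descriptors[i], descriptors[S[0]])
        elif i >= S[-1]:                   # frames after the last key frame
            total += d(descriptors[i], descriptors[S[-1]])
        else:                              # frames between key frames S(j) and S(j+1)
            j = bisect.bisect_right(S, i) - 1
            total += min(d(descriptors[i], descriptors[S[j]]),
                         d(descriptors[i], descriptors[S[j + 1]]))
    return total
```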

3 Methodology
Fig. 1 illustrates a scheme of the proposed system architecture. The proposed
method can be divided into several steps. Initially, we estimate the CLD for each
frame of the original video. Next, we perform shot detection (see Section 3.1).
Based on the shot detection results and on the given parameter α, we estimate the
number of frames per shot that the video synopsis should contain (see Section 3.1). Finally, the
video distortion is sequentially minimized according to the proposed methods,
resulting in the video synopsis (see Section 3.2).

Fig. 1. Scheme of the proposed system architecture: CLD extraction, shot detection, computation of the number of frames per shot (given the parameter α), and sequential distortion minimization producing the video synopsis
The proposed method can be executed under any choice or combination of au-
dio/visual content descriptors. Descriptors based on image segmentation results
or on camera motion estimation techniques are computationally expensive. Moreover,
there is no guarantee that their results will be accurate under any video
content variation [1]. To overcome these problems, we adopt the MPEG-7 visual
descriptors [6, 9] as appropriate features, in particular the Color Layout Descriptor
(CLD), a low-cost and compact descriptor which suffices to describe smoothly
the changes in the visual content of a shot. These descriptors have been successfully
used in our previous work on key frame extraction problems [1, 4]. The
CLD is a compact descriptor that uses representative colors on a grid followed by
a DCT and encoding of the resulting coefficients. We used the following semimet-
ric function D to measure the content distance of two CLDs, {DY, DCb, DCr}
and {DY', DCb', DCr'}:

$$D = \sqrt{\textstyle\sum_i (DY_i - DY'_i)^2} + \sqrt{\textstyle\sum_i (DCb_i - DCb'_i)^2} + \sqrt{\textstyle\sum_i (DCr_i - DCr'_i)^2},$$

where DY_i, DCb_i, DCr_i denote the i-th DCT coefficients of the respective
color components [1, 4].
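As an illustration only, the semimetric above can be computed as follows in Python, assuming each CLD is stored as a mapping from the channel names 'DY', 'DCb', 'DCr' to arrays of DCT coefficients (the MPEG-7 recommendation additionally weights the low-frequency coefficients, which is omitted here):

```python
import numpy as np

def cld_distance(cld_a, cld_b):
    """Unweighted CLD semimetric: sum over Y, Cb, Cr of the Euclidean
    distance between the corresponding DCT coefficient vectors."""
    return sum(
        float(np.linalg.norm(np.asarray(cld_a[ch], float) - np.asarray(cld_b[ch], float)))
        for ch in ("DY", "DCb", "DCr")
    )
```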

3.1 Shot Detection


This section presents the shot detection method. Shot detection is optional and
it is used in order to ensure that the video synopsis contains frames from each shot
and to decrease the computational cost of the proposed sequential algorithms.
We detect sharp shot changes only. This is done by using the
chi-squared distance between the lightness histogram of each frame and that of the
next one. This histogram distance has also been successfully used for texture
and object categories classification, near duplicate image identification, local
descriptors matching, shape classification and boundary detection [10].
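A minimal sketch of this shot-boundary test is given below; the histogram bin count and the decision threshold are not specified in the paper and are therefore free parameters here.

```python
import numpy as np

def detect_shot_boundaries(lightness_histograms, threshold):
    """Mark frame i+1 as the start of a new shot when the chi-squared distance
    between the lightness histograms of frames i and i+1 exceeds the threshold."""
    boundaries = []
    for i in range(len(lightness_histograms) - 1):
        h = np.asarray(lightness_histograms[i], float)
        g = np.asarray(lightness_histograms[i + 1], float)
        chi2 = 0.5 * np.sum((h - g) ** 2 / (h + g + 1e-12))
        if chi2 > threshold:
            boundaries.append(i + 1)
    return boundaries
```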
Hereafter, we present the method that we have used to compute the number
of frames of each detected shot for video synopsis. So, the goal of this section is
to determine the percentage of frames of each shot that suffices to
represent the whole shot. We use the metric $L_k$, defined as the sum of the
CLD distances between all successive frames in shot k: $L_k = \sum_{i \in SH_k} d(i, i+1)$,
where $SH_k$ denotes the set of frames of shot k. $L_k$ measures the total sequential
visual change of shot k. The higher $L_k$, the higher the number of frames that have to
be selected from shot k. The selected frames (frames of video synopsis) are also
called key frames. The number of frames $b_k$ selected in shot k, which is proportional to
$L_k$, is defined by the following equation:

$$b_k = \frac{\alpha \cdot N \cdot L_k}{\sum_{k=1}^{|SH|} L_k},$$

where |SH| denotes the number of shots. This definition of $b_k$ also satisfies the constraint that the video
synopsis should contain α · N frames: $\sum_{k=1}^{|SH|} b_k = \alpha \cdot N$. In the special case of $b_k \le 1$,
which means that all frames of the shot have the same content, we set $b_k = 2$ so
that the video synopsis summarizes all of the shots of the video.
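The allocation of synopsis frames to shots can be sketched as follows (illustrative code, not the authors' implementation); because of rounding and of the b_k = 2 rule, the returned counts may deviate slightly from exactly α · N.

```python
def frames_per_shot(shots, descriptors, d, alpha):
    """Number of synopsis frames b_k per shot, proportional to the total
    visual change L_k of each shot.

    shots       : list of shots, each a contiguous list of frame indices
    descriptors : per-frame visual descriptors (e.g. CLDs)
    d           : descriptor distance function
    alpha       : synopsis ratio in [0, 1]
    """
    N = sum(len(shot) for shot in shots)
    L = [sum(d(descriptors[i], descriptors[i + 1]) for i in shot[:-1]) for shot in shots]
    total_L = sum(L) or 1.0                 # avoid division by zero for static videos
    b = [alpha * N * Lk / total_L for Lk in L]
    return [2 if bk <= 1 else int(round(bk)) for bk in b]
```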

3.2 Sequential Distortion Minimization

This section presents the proposed Sequential Distortion Minimization algorithm


(SeDiM) for video synopsis creation. This method selects b_k frames for the k-th
shot, so that the distortion between the original video and the video synopsis is
sequentially minimized. The ordering of key frames selection corresponds to their
significance on content description. The SeDiM method is described hereafter:
Let CANk denote the set of candidate frames of shot k for video synopsis.
Initially, we set CANk = SHk . Let Sk be the frames of video synopsis of shot k.
Initially, we set Sk = ∅. For each shot k, we iteratively select the frame f from
CANk so that if we include it in set Sk the current video distortion of shot k is
minimized (see Equation 2). Next, we remove it from set CANk and we add it
on set Sk :

$$f = \arg\min_{u \in CAN_k} D(SH_k, S_k \cup \{u\}) \qquad (2)$$
$$CAN_k = CAN_k - \{f\}, \qquad S_k = S_k \cup \{f\}$$

When the number of key frames of shot k becomes b_k, CAN_k is set to the empty
set (CAN_k = ∅), since we cannot select more frames from this shot. The process
continues until the number of key frames of the video synopsis becomes α · N.
Concerning the computational cost, this procedure can be implemented in
O(N²). The worst case appears when the video consists of a single shot, in which
case N = |SH₁|. In the first step, finding the global minimum of D({1, ..., N}, {u})
over all candidate frames u needs O(N²) (see Equation 1). In the n-th step of the
method, we have to recompute D({1, ..., N}, S ∪ {u}) only when the previous or the
next key frame of u is the key frame that was added to S in the previous
step (n−1). Otherwise, it holds that D({1, ..., N}, S ∪ {u}) = D({1, ..., N}, S). This
needs O(N²/n²), since the video content changes “smoothly” in the sense that the
selected frames are approximately equally distributed over time. Let T(·) denote
the computational cost of the algorithm. It holds that T(1) = O(N²). In the n-th
step, we have to find the minimum of D(·, ·), which can be done in O(N), and to
update the specific values of D(·, ·) in O(N²/n²). So, the total computational cost
is O(N²).
In addition, we have proposed a simple variation of SeDiM that is presented
hereafter. In this variation, we simply take the first and last frame of each shot
as two starting key frames for the video synopsis. So, in the case of a single shot, we
initialize S_k = {SH_k(1), SH_k(|SH_k|)}. This algorithm is called SeDiM-IN. The
rest of the process is exactly the same as in SeDiM. The proposed methods do

not guarantee a global minimum of the distortion, since they minimize the
distortion function sequentially. SeDiM guarantees a global minimum of the distortion only in the
case of b_k = 1.
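The following Python sketch illustrates the greedy selection for a single shot; it recomputes the shot distortion from scratch for every candidate, so it runs in O(N²·b_k) rather than with the incremental O(N²) bookkeeping described above, and it is not the authors' implementation. Passing the shot's first and last frame as init gives the SeDiM-IN variant.

```python
def sedim_select(frames, descriptors, d, b_k, init=None):
    """Greedy SeDiM key-frame selection for one shot.

    frames : frame indices of the shot SH_k (in temporal order)
    init   : optional starting key frames (e.g. [frames[0], frames[-1]] for SeDiM-IN)
    """
    def distortion(keys):
        # shot distortion w.r.t. the candidate key-frame set, as in Eq. (1)
        total = 0.0
        for i in frames:
            left = [k for k in keys if k <= i]
            right = [k for k in keys if k >= i]
            nearest = ([left[-1]] if left else []) + ([right[0]] if right else [])
            total += min(d(descriptors[i], descriptors[k]) for k in nearest)
        return total

    selected = sorted(init) if init else []
    candidates = [f for f in frames if f not in selected]
    while len(selected) < b_k and candidates:
        best = min(candidates, key=lambda u: distortion(sorted(selected + [u])))
        candidates.remove(best)
        selected = sorted(selected + [best])
    return selected
```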

4 Experimental Results

Fig. 2. Snapshots of videos that we have used in the paper

In this section, the experimental results and comparisons with other algo-
rithms are presented. We have tested the proposed algorithm on a data set
containing more than 100 video sequences. We selected ten videos (eight real-life
and two synthetic (animation) videos) of different content in order to evaluate
the distortion of each algorithm on varied material. The real-life videos have been recorded in either indoor or outdoor
environments. The ten selected videos consist of 69 shots. The number of shots
per video varies from one to 22. In addition, the duration of the videos varies
from 300 frames to 1925 frames. Fig. 2 depicts snapshots from these videos. The
names of the videos are given in the first column of Table 1.

4.1 Comparison to other Algorithms


The proposed methods SeDiM and SeDiM-IN have been compared with the
content equidistant and time equidistant algorithms on the same data sets with the
same parameters α = 0.1 and α = 0.3. Hereafter, we present these
two algorithms. The content equidistant algorithm (CEA) is inspired by the
work [1], where the iso-content principle has been proposed to estimate
key frames that are equidistant in video content. According to this method,
the key frames {t_1, t_2, ..., t_{b_k}} in shot k are defined by the following relation:

$$m \simeq \sum_{u=1}^{t_1 - 1} d(u, u+1) \simeq \sum_{u=t_1}^{t_2 - 1} d(u, u+1) \simeq \dots \simeq \sum_{u=t_{b_k - 1}}^{t_{b_k} - 1} d(u, u+1),$$

where $m = \frac{1}{b_k} \sum_{u=1}^{t_{b_k} - 1} d(u, u+1)$. So, based on the measurement m, first we compute
the key frame t_1, next we compute t_2, and so on; finally, we compute t_{b_k}.
The time equidistant algorithm (TEA) selects frames from each
shot at equal temporal intervals over the duration of the shot. According
to this method, the key frame t_i, i ∈ {1, 2, ..., b_k}, in shot k is directly
defined by the following equation: $t_i = \left[ \frac{i \cdot |SH_k|}{b_k} \right]$, where $|SH_k|$ denotes the number
of frames of shot k and [·] denotes the nearest integer function. This is the
simplest method for video synopsis creation, since it does not take into account
visual changes.
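Both baselines admit short implementations; the sketch below (not from the paper) uses 0-based frame lists and simple boundary handling, which may differ in detail from the original implementations.

```python
def tea_select(shot_frames, b_k):
    """Time-equidistant key frames: t_i = round(i * |SH_k| / b_k), i = 1..b_k."""
    n = len(shot_frames)
    return [shot_frames[min(n - 1, max(0, round(i * n / b_k) - 1))] for i in range(1, b_k + 1)]

def cea_select(shot_frames, descriptors, d, b_k):
    """Content-equidistant key frames: successive key frames are separated by
    roughly equal accumulated content change m."""
    diffs = [d(descriptors[a], descriptors[b]) for a, b in zip(shot_frames, shot_frames[1:])]
    m = sum(diffs) / b_k
    keys, acc, target = [], 0.0, m
    for pos, dv in enumerate(diffs):
        acc += dv
        if acc >= target and len(keys) < b_k:
            keys.append(shot_frames[pos + 1])
            target += m
    while len(keys) < b_k:          # pad with the last frame if rounding left a gap
        keys.append(shot_frames[-1])
    return keys
```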
Table 1 depicts the distortion D({1, ..., N }, S) between the original video and
video synopsis of SeDiM, SeDiM-IN, CEA, TEA methods under the ten used
video sequences with α = 0.1 and α = 0.3.

Table 1. The distortion D({1, ..., N }, S) between the original video and video synopsis
α = 0.1 α = 0.1 α = 0.1 α = 0.1 α = 0.3 α = 0.3 α = 0.3 α = 0.3
Dataset SeDiM SeDiM-IN CEA TEA SeDiM SeDiM-IN CEA TEA
foreman.avi 19209 18973 21814 22069 6755.1 6774.1 7738.2 8992.2
coast guard.avi 6962.7 7054.8 7486.6 7079.9 2521.4 2562.4 2669.5 4146.4
hall.avi 3913.8 3938.8 4309.1 4444.4 2137 2141 2228.1 3863
table.avi 10207 11578 11529 10928 4097 4113.2 4542.4 6046.4
blue.avi 13826 14487 14690 16171 5419 5494 5736 10631
doconCut.avi 116420 122550 142460 148230 40503 43602 45412 70521
data.avi 14635 15800 17147 15294 4292 4303 5058 17260
Wildlife.avi 27187 29841 31763 33752 9052 9210 10826 12493
MessiVsRonaldo.avi 74630 85270 85310 111070 20971 22051 23001 40209
FootballHistory.avi 68434 79676 80236 95497 16402 16842 17323 58503

According to these experiments, SeDiM yields the highest performance re-


sults, outperforming the other algorithms, since in 95% of cases (19 out of 20)
it gives the lowest distortion. SeDiM-IN is the second best performing method:
when α = 0.3 it is always the second best, and when α = 0.1
it is the second best in 70% of cases. In addition, on foreman.avi,
SeDiM-IN gives the lowest distortion when α = 0.1. SeDiM usually
gives less distortion than SeDiM-IN, because the video synopsis of SeDiM-IN
contains the first and last frame of each shot, without examining whether they are
appropriate for optimizing the summarization of the video.
High performance results are also obtained by CEA, which is the third best
performing method, especially when α = 0.3. We observed that CEA is a better
method for obtaining a video synopsis than TEA, because equal time intervals within a
shot do not guarantee anything about the visual content of the selected frames.
CEA ensures that the key frames are separated by equal
content differences and in this way keeps the distortion of the video at low
levels. The initial videos and the video synopsis results (with α = 0.1 and α = 0.3)
of the SeDiM, SeDiM-IN, CEA and TEA methods are available online¹. The video
synopses of the proposed schemes describe the visual content well for any
type of video.

5 Conclusion

In this paper, we have proposed a video synopsis creation scheme that can be
used in video summarization applications. According to the proposed frame-
work, the problem of video synopsis creation is reduced to the minimization of
the distortion between the initial video and the video synopsis. The proposed
method sequentially minimizes this distortion, resulting in high performance re-
sults under any value of the parameter α that controls the number of frames
of the video synopsis. In addition, the proposed scheme can be used under any
type of video content description.
1
https://www.dropbox.com/sh/rpysux4oa746jty/B265lHwpAB

Acknowledgments. This research has been partially co-financed by the Euro-


pean Union (European Social Fund - ESF) and Greek national funds through the
Operational Program “Education and Lifelong Learning” of the National Strate-
gic Reference Framework (NSRF) - Research Funding Programs: ARCHIMEDE
III-TEI-Crete-P2PCOORD and THALIS-UOA- ERASITECHNIS.

References
1. Panagiotakis, C., Doulamis, A., Tziritas, G.: Equivalent key frames selection based
on iso-content principles. IEEE Transactions on Circuits and Systems for Video
Technology 19, 447–451 (2009)
2. Hanjalic, A., Zhang, H.: An integrated scheme for automated video abstraction
based on unsupervised cluster-validity analysis. IEEE Trans. on Circuits and Sys-
tems for Video Tech. 9, 1280–1289 (1999)
3. Girgensohn, A., Boreczky, J.S.: Time-constrained keyframe selection technique.
Multimedia Tools and Applications 11, 347–358 (2000)
4. Panagiotakis, C., Doulamis, A., Tziritas, G.: Equivalent key frames selection based
on iso-content distance and iso-distortion principles. In: IEEE International Work-
shop on Image Analysis for Multimedia Interactive Services (2007)
5. Hammoud, R., Mohr, R.: A probabilistic framework of selecting effective key frames
for video browsing and indexing. In: International Workshop on Real-Time Image
Sequence Analysis (RISA 2000), pp. 79–88 (2000)
6. Manjunath, B., Ohm, J., Vasudevan, V., Yamada, A.: Color and texture descrip-
tors. IEEE Trans. on Circuits and Systems for Video Tech. 11, 703–715 (2001)
7. Li, Z., Schuster, G., Katsaggelos, A.: Minmax optimal video summarization. IEEE
Trans. Circuits Syst. Video Techn. 15, 1245–1256 (2005)
8. Ngo, C.W., Ma, Y.F., Zhang, H.J.: Video summarization and scene detection by
graph modeling. IEEE Trans. Circuits Syst. Video Techn. 15, 296–305 (2005)
9. Kasutani, E., Yamada, A.: The MPEG-7 color layout descriptor: a compact image
feature description for high-speed image/video segment retrieval. In: IEEE International
Conference on Image Processing (ICIP), pp. 674–677 (2001)
10. Pele, O., Werman, M.: The quadratic-chi histogram distance family. In: Dani-
ilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part II. LNCS, vol. 6312,
pp. 749–762. Springer, Heidelberg (2010)
A Graph Embedding Method Using the Jensen-Shannon
Divergence

Lu Bai, Edwin R. Hancock, and Lin Han

Department of Computer Science, University of York, UK


Deramore Lane, Heslington, York, YO10 5GH, UK
{lu,erh,lin}@cs.york.ac.uk

Abstract. Riesen and Bunke recently proposed a novel dissimilarity based ap-
proach for embedding graphs into a vector space. One drawback of their approach
is the computational cost of the graph edit operations required to compute the dissimilarity
between graphs. In this paper we explore whether the Jensen-Shannon divergence
can be used as a means of computing a fast similarity measure between a pair of
graphs. We commence by computing the Shannon entropy of a graph associated
with a steady state random walk. We establish a family of prototype graphs by
using an information theoretic approach to construct generative graph prototypes.
With the required graph entropies and a family of prototype graphs to hand, the
Jensen-Shannon divergence between a sample graph and a prototype graph can be
computed. It is defined in terms of the entropies of the pair of separate graphs
and of a composite structure formed by the pair of graphs. Since the required entropies
of the graphs can be efficiently computed, the proposed graph embedding using
the Jensen-Shannon divergence avoids the burdensome graph edit operations. We
explore our approach on several graph datasets abstracted from computer vision
and bioinformatics databases.

1 Introduction

In pattern recognition, graph based object representations offer a versatile alternative


to the vector based representation. The main advantage of graph representations is
their rich mathematical structure. Unfortunately, most of the standard pattern recogni-
tion and machine learning algorithms are formulated for vectors, and are not available
for graphs. One way to overcome this problem is to embed the graph data into a vector
space, and then deploy vectorial methods.
However, the vector space embedding presents two obstacles. First, since graphs can
be of different sizes, the vectors may be of different lengths. The second problem is
that the information residing on the edges of a graph is discarded. In order to over-
come these problems, Riesen and Bunke recently proposed a method for embedding
graphs into a vector space [1], that bridges the gap between the powerful graph based
representation and the algorithms available for the vector based representation. The
ideas underpinning the graph dissimilarity embedding framework were first described in Duin
and Pekalska’s work [2]. Riesen and Bunke generalized and substantially extended the

Edwin R. Hancock is supported by a Royal Society Wolfson Research Merit Award.


methods to the graph mining domain. The key idea is to use the edit distance from a
sample graph to a number of class prototype graphs to give a vectorial description of
the sample graph in the embedding space. Furthermore, this approach potentially allows
any (dis)similarity measure of graphs to be used for graph (dis)similarity embedding as
well. Unfortunately, the edit distance between a sample graph and a prototype graph
requires burdensome computations, and as a result the graph dissimilarity embedding
using the edit distance cannot be computed efficiently for graphs.
To address this inefficiency, in this paper we investigate whether the Jensen-Shannon
divergence can be used as a means of establishing a computationally efficient similarity
measure between a pair of graphs, and then use such a measure to propose a novel fast
graph embedding approach. In information theory the Jensen-Shannon divergence is a
nonextensive mutual information theoretic measure based on nonextensive entropies.
An extensive entropy is defined as the sum of the individual entropies of two probabil-
ity distributions. The definition of nonextensive entropy generalizes the sum operation
into composite actions. The Jensen-Shannon divergence is defined as a similarity mea-
sure between probability distributions, and is related to the Shannon entropy [3]. The
problem of establishing Jensen-Shannon divergence measures for graphs is that of com-
puting the required entropies for individual and composite graphs. In [4], we have used
the steady state random walk of a graph to establish a probability distribution for this
purpose. The Jensen-Shannon divergence between a pair of graphs is defined as the dif-
ference between the entropy of a composite structure and their individual entropies. To
determine a set of prototype graphs for vector space embedding, we use an information
theoretic approach to construct the required graph prototypes [5]. Once the vectorial
descriptions of a set of graphs are established, we perform graph classification in the
principal component space. Experiments on graph datasets abstracted from bioinfor-
matics and computer vision databases demonstrate the effectiveness and the efficiency
of the proposed graph embedding method.
This paper is organized as follows. Section 2 develops a Jensen-Shannon divergence
measure between graphs. Section 3 reviews the concept of graph dissimilarity embed-
ding, and shows how to compute the similarity vectorial descriptions for a set of graphs
using the Jensen-Shannon divergence. Section 4 provides the experimental evaluations.
Finally, Section 5 provides the conclusion and future work.

2 The Jensen-Shannon Divergence on Graphs


In this section, we exploit the Jensen-Shannon divergence for developing a computa-
tionally efficient similarity measure for graphs. We commence by defining a Shannon
entropy of a graph associated with its steady state random walk. Then we develop the
similarity measure for graphs by using the Jensen-Shannon divergence between the
graph entropies.

2.1 Graph Entropies


Consider a graph G(V, E) with vertex set V and edge set E ⊆ V × V . The adjacency
matrix A for G(V, E) has elements
$$A(i, j) = \begin{cases} 1 & \text{if } (i, j) \in E, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

The vertex degree matrix of G(V, E) is a diagonal matrix D with diagonal elements
given by $D(v_i, v_i) = d(i) = \sum_{j \in V} A(i, j)$.
Shannon Entropy. For the graph G(V, E), the probability of a steady state random
walk on G(V, E) visiting vertex i is $P_G(i) = d(i)/\sum_{j \in V} d(j)$. The Shannon entropy
associated with the steady state random walk on G(V, E) is

$$H_S(G) = -\sum_{i=1}^{|V|} P_G(i) \log P_G(i). \qquad (2)$$

Time Complexity. For the graph G(V, E) having n = |V | vertices, the Shannon en-
tropy HS (G) requires time complexity O(n2 ).
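A direct Python transcription of this entropy (illustrative, not from the paper) takes the adjacency matrix, forms the degree-based stationary distribution, and evaluates Eq. (2); vertices of degree zero contribute no probability mass and are skipped.

```python
import numpy as np

def graph_entropy(A):
    """Shannon entropy H_S(G) of the steady-state random walk, Eq. (2)."""
    degrees = np.asarray(A, float).sum(axis=1)
    degrees = degrees[degrees > 0]          # isolated vertices carry zero probability
    P = degrees / degrees.sum()
    return float(-(P * np.log(P)).sum())
```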

2.2 A Composite Entropy of a Pair of Graphs


To compute the Jensen-Shannon divergence of a pair of random walks on a pair of
graphs Gp (Vp , Ep ) and Gq (Vq , Eq ), we require a method for constructing a composite
structure Gp ⊕ Gq for the pair of graphs Gp (Vp , Ep ) and Gq (Vq , Eq ). For reasons of
efficiency, we use the disjoint union as the composite structure. According to [6], the
disjoint union graph of Gp (Vp , Ep ) and Gq (Vq , Eq ) is

GDU = Gp ∪ Gq = {Vp ∪ Vq , Ep ∪ Eq }. (3)

Let graphs Gp and Gq be the connected components of the disjoint union graph GDU ,
and ρp = |V (Gp )|/|V (GDU )| and ρq = |V (Gq )|/|V (GDU )|. The entropy (i.e. the
composite entropy) [7] of GDU is

H(GDU ) = ρp H(Gp ) + ρq H(Gq ). (4)

Here the entropy function H(·) is the Shannon entropy HS (·) defined in Eq.(2).

2.3 The Jensen-Shannon Divergence on Graphs


The Jensen-Shannon divergence between the (discrete) probability distributions P =
(p1 , p2 , . . . , pK ) and Q = (q1 , q2 , . . . , qK ), associated with the random walks on graphs
$G_p(V_p, E_p)$ and $G_q(V_q, E_q)$, is the negative definite (nd) function

$$D_{JS}(P, Q) = H_S\!\left(\frac{P + Q}{2}\right) - \frac{H_S(P) + H_S(Q)}{2}, \qquad (5)$$

where $H_S(P) = -\sum_{k=1}^{K} p_k \log p_k$ is the Shannon entropy of the probability distribution
P. Given a pair of graphs $G_p(V_p, E_p)$ and $G_q(V_q, E_q)$, the Jensen-Shannon divergence
between them is

$$D_{JS}(G_p, G_q) = H(G_p \oplus G_q) - \frac{H(G_p) + H(G_q)}{2}, \qquad (6)$$

where H(Gp ⊕ Gq ) is the entropy of the composite structure. Here we use the disjoint
union defined in Sec.2.2 as the composite structure, and the entropy function H(·) is
the Shannon entropy HS (·) defined in Eq.(2).
Time Complexity. For a pair of graphs Gp (Vp , Ep ) and Gq (Vq , Eq ) both having n
vertices, computing the Jensen-Shannon divergence DJS (Gp , Gq ) defined in Eq. (6)
requires time complexity O(n2 ).
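Putting Eqs. (2), (4) and (6) together, the divergence between two graphs needs only their individual entropies and vertex counts; the sketch below assumes the graph_entropy helper from Section 2.1 and plain adjacency matrices.

```python
def jsd_graphs(A_p, A_q):
    """Jensen-Shannon divergence of Eq. (6) with the disjoint union as the
    composite structure, whose entropy is the weighted sum of Eq. (4)."""
    H_p, H_q = graph_entropy(A_p), graph_entropy(A_q)
    n_p, n_q = len(A_p), len(A_q)
    rho_p, rho_q = n_p / (n_p + n_q), n_q / (n_p + n_q)
    H_union = rho_p * H_p + rho_q * H_q     # entropy of the disjoint union G_p ⊕ G_q
    return H_union - 0.5 * (H_p + H_q)
```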

3 Graph Embedding Using The Jensen-Shannon Divergence


In this section, we explore how to use the Jensen-Shannon divergence as a means of em-
bedding graph structures into a vector space. We commence by reviewing the definition
of the graph dissimilarity embedding.

3.1 Graph Dissimilarity Embedding


In [1], Riesen and Bunke have proposed a graph dissimilarity embedding to embed a
sample graph into a vectorial description by computing the edit distances between
the sample graph and a number of prototype graphs. For a sample graph Gi (Vi , Ei )
(i = 1, . . . , N ) and a set of prototype graphs T = {T1 , . . . , Tm , . . . , Tn }, we measure
the dissimilarities between Gi (Vi , Ei ) and each prototype graph Tm ∈ T as the m-th
element of the n-dimensional vectorial description V (Gi ) of Gi . The mapping ϕTn :
(Gi ) → Rn is defined as the function

V (Gi ) = (d(Gi , T1 ), . . . , d(Gi , Tm ), . . . , d(Gi , Tn )) (7)


where d(Gi , Tm ) is the graph dissimilarity measure between Gi (Vi , Ei ) and the m-th
prototype graph Tm . Riesen and Bunke proposed to use the graph edit distance as the
dissimilarity measure, although their approach allows any (dis)similarity measure of
graphs to be used.

3.2 A Graph Embedding Method Using the Jensen-Shannon Divergence


The graph embedding procedure described in Section 3.1 offers us a principled
way to develop a new graph embedding approach using the Jensen-Shannon diver-
gence. Consider the sample graph Gi (Vi , Ei ) and the set of prototype graphs T =
{T1 , . . . , Tm , . . . , Tn }, we compute the similarity measure between Gi (Vi , Ei ) and each
prototype graph using the Jensen-Shannon divergence, as a result the mapping ϕTn :
(Gi ) → Rn defined in Eq.(7) can be re-written as

VDJS (Gi ) = (DJS (Gi , T1 ), . . . , DJS (Gi , Tm ), . . . , DJS (Gi , Tn )) (8)

where DJS (Gi , Tm ) is the Jensen-Shannon divergence between the sample graph
G_i(V_i, E_i) and the m-th prototype graph T_m. Since the Jensen-Shannon divergence
between graphs can be computed efficiently, the proposed embedding method is
more efficient than the dissimilarity embedding using the costly graph edit
distance.
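In code, the embedding of Eq. (8) is then a single loop over the prototypes (again a sketch building on the jsd_graphs helper above); the resulting vectors can be fed directly to PCA or a KNN classifier, as done in the experiments.

```python
import numpy as np

def jsd_embedding(A_sample, prototype_adjacencies):
    """n-dimensional vectorial description V_DJS(G_i) of Eq. (8)."""
    return np.array([jsd_graphs(A_sample, A_T) for A_T in prototype_adjacencies])
```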

3.3 The Prototype Graph Selection

For our approach, the prototype graphs T = {T1 , . . . , Tm , . . . , Tn } serve as reference


points to transform graphs into real vectors. Hence, the aim of prototype graph selec-
tion is to find reference points which result in a meaningful vector in the embedding
space. Intuitively, the prototype graphs T = {T1 , . . . , Tm , . . . , Tn } should be able to
characterize the structural variations present in a set of sample embedded graphs. Fur-
thermore, these prototype graphs should be neither too redundant nor too simple. To
locate the prototype graphs we make use of Luo and Hancock’s probabilistic model
of graph structures described in [8], and develop an information theoretic approach to
selecting prototype graphs. By using a two-part minimum description length criterion
[9,10,11], these selected prototype graphs trade off the goodness-of-fit to the sample
data against their intrinsic complexities. To formalize this idea, we locate the proto-
type graphs that minimize the overall code-length. The code-length of a set of sample
graphs is the average of their Shannon-Fano code which is equivalent to the negative
logarithm of their likelihood function. The code-length for describing the complexity
of the prototype graphs is measured using the approximate von-Neumann entropy [12].
To minimize the overall code length, we develop a variant of EM algorithm where we
view both the structure of the prototype graphs and the vertex correspondence informa-
tion between the sample and prototype graphs as missing data. In the two interleaved
steps of the EM algorithm, the expectation step involves recomputing the a posteriori
probability of vertex correspondence while the maximization step involves updating
both the structure of the prototype graphs and the vertex correspondence information.
More details of how we apply the minimum description length criterion and how the
EM algorithm works can be found in [5].

4 Experimental Evaluation

In this section, we demonstrate the performance of our proposed method on several


graph datasets abstracted from real-world image and bioinformatics databases. These
datasets are: ALOI, CMU, MUTAG and NCI109 [13]. The ALOI dataset consists of 54
graphs extracted from selected images of three similar boxes. The CMU dataset consists
of 54 graphs extracted from selected images of three similar toy houses. For each object
in the ALOI and CMU datasets, there are 18 images captured from different viewpoints.
The graphs are the Delaunay triangulations of feature points extracted from the different
images using SIFT detection. The maximum and minimum numbers of vertices are
1288 and 295 for ALOI, and 495 and 27
for CMU. The MUTAG dataset is based on graphs representing 188 chemical
compounds, and aims to predict whether each compound possesses mutagenicity. The
maximum and minimum number of vertices are 28 and 10 respectively. As the vertices
and edges of each compound are labeled with real numbers, we transform these graphs
into unweighted graphs. The NCI109 dataset is based on unweighted graphs representing 4127
chemical compounds, and aims to predict whether each compound is active
in an anti-cancer screen. The maximum and minimum number of vertices are 111 and
4 respectively.

4.1 Experiments on Graph Datasets

Experimental Setup: We evaluate the performance of our proposed graph embedding


method using the Jensen-Shannon divergence (DEJS) on the four graph datasets. We
compare our method against several alternative graph based learning methods. The
comparative methods include a) pattern vectors from coefficients of the Ihara zeta func-
tion (CIZF) [14], b) pattern vectors from algebraic graph theory (PVAG) [15], and c)
the graph dissimilarity embedding using the edit distance (DEED) [1]. For our method
and DEED on each dataset, we randomly divide the graphs into 20 folds and use any
6 folds to learn 6 prototype graphs. We construct the 6 dimensional vector description
of each testing graph. For the alternative methods CIZF and PVAG on each dataset,
we construct the vectorial description of each testing graph. We then perform 10-fold
cross-validation with a KNN classifier to evaluate the performance of our method and the
alternative methods, using nine folds for training and one fold for testing. All the KNN
classifiers and their parameters were optimised using the Weka workbench. We
repeat the whole experiment 10 times and report the average classification accuracies
in Table 1. We also report the runtime to establish graph feature vectors of each method
in Table 1 under Matlab R2011a with an Intel(i5) 3.2GHz 4-core processor.

Table 1. Experimental comparisons on graph datasets (left: classification accuracy in %, right: runtime for computing the graph feature vectors)

Accuracy (%)  ALOI    CMU     NCI109  MUTAG        Runtime  ALOI  CMU     NCI109   MUTAG
DEJS          91.35   100     65.49   80.75        DEJS     2”    1”      2”       1”
CIZF          −       100     67.19   80.85        CIZF     −     2’33”   14”      1”
PVAG          −       62.59   64.59   82.44        PVAG     −     5”      19”      1”
DEED          −       100     63.34   83.55        DEED     −     3h55’   17h49’   49’23”

Experimental Results: On the ALOI dataset which possesses graphs of more than one
thousand vertices, our method takes 2 seconds, while DEED takes over one day, and
CIZF and PVAG even generate numerical overflows. The runtimes of the CIZF
and PVAG methods are only competitive with our DEJS method on the MUTAG, NCI109
and CMU datasets, which possess graphs of smaller sizes. This reveals that our DEJS
can easily scale up to graphs with thousands of vertices. DEED can achieve competitive
classification accuracies to our DEJS, but requires more computation time. The graph
similarity embedding using the Jensen-Shannon divergence measure is more efficient
than that using the edit distance dissimilarity measure proposed by Riesen and Bunke.
The reason for this is that the Jensen-Shannon divergence between graphs only requires
time quadratic in the number of vertices.
Furthermore, both our embedding method and DEED also require extra runtime for
learning the required prototype graphs. For the ALOI, CMU, NCI109 and MUTAG
datasets, the average times for learning a prototype graph are 5 hours, 30 minutes, 15
minutes and 5 minutes respectively. This reveals that for graphs of large sizes, our embedding
method may require additional and potentially expensive computations for
learning the prototype graphs. However, for graphs of less than 300 vertices, the learn-
ing of prototype graphs can still be completed in polynomial time.

4.2 Stability Evaluation


In this subsection, we investigate the stability of our proposed method DEJS. We ran-
domly select three seed graphs from the ALOI dataset. We then apply random edit
operations on the three seed graphs to simulate the effects of noise. The edit operations
are vertex deletion and edge deletion. For each seed graph, we randomly delete a pre-
determined fraction of vertices or edges to obtain noise corrupted variants. The feature
distance between an original seed graph G0 and its noise corrupted counterpart Gn is
defined as their Euclidean distance:

$$d_{G_0, G_n} = \sqrt{(V_{JS}(G_0) - V_{JS}(G_n))^{T} (V_{JS}(G_0) - V_{JS}(G_n))} \qquad (9)$$

We show the experimental results in Fig.1 and Fig. 2. Fig.1 and Fig. 2 show the effects
of vertex and edge deletion respectively. The x-axis represents the percentage (1% to 35%)
of vertices or edges deleted, and the y-axis shows the Euclidean distance d_{G_0,G_n} between the
original seed graph G_0 and its noise corrupted counterpart G_n. In Fig. 1 and Fig. 2,
there is an approximately linear relationship in each case. This implies that the proposed
method possesses the ability to distinguish graphs under controlled structural error.

Fig. 1. Stability evaluation under vertex edit operations: Euclidean distance d_{G_0,G_n} versus the number of vertex deletions for the three seed graphs (a)–(c)

Fig. 2. Stability evaluation under edge edit operations: Euclidean distance d_{G_0,G_n} versus the number of edge deletions for the three seed graphs (a)–(c)

5 Conclusion and Future Work


In this paper, we have shown how to use the Jensen-Shannon divergence as a means
of embedding a sample graph into a vector space. We use an information theoretic
approach to construct the required prototype graphs. We embed a sample graph into

feature space by computing the Jensen-Shannon divergence measure between the sam-
ple graph and each of the prototype graphs. We perform 10-fold cross-validation with a
KNN classifier to assign the graphs to classes. Experimental results
demonstrate the effectiveness and efficiency of the proposed method. Since learning
prototype graphs usually requires expensive computation, our future work is to define
a fast approach for learning the prototype graphs. This will be useful for defining a faster graph
embedding method.

Acknowledgments. We thank Dr. Peng Ren for providing the Matlab implementation
for the graph Ihara zeta function method.

References
1. Riesen, K., Bunke, H.: Graph Classification and Clustering Based on Vector Space Embed-
ding. World Scientific Press (2010)
2. Pekalska, E., Duin, R.P.W., Paclı́k, P.: Prototype Selection for Dissimilarity-based Classifiers.
Pattern Recognition 39, 189–208 (2006)
3. Martins, A.F., Smith, N.A., Xing, E.P., Aguiar, P.M., Figueiredo, M.A.: Nonextensive Infor-
mation Theoretic Kernels on Measures. Journal of Machine Learning Research 10, 935–975
(2009)
4. Bai, L., Hancock, E.R.: Graph Kernels from The Jensen-Shannon Divergence. Journal of
Mathematical Imaging and Vision (to appear)
5. Han, L., Hancock, E.R., Wilson, R.C.: An Information Theoretic Approach to Learning
Generative Graph Prototypes. In: Pelillo, M., Hancock, E.R. (eds.) SIMBAD 2011. LNCS,
vol. 7005, pp. 133–148. Springer, Heidelberg (2011)
6. Gadouleau, M., Riis, S.: Graph-theoretical Constructions for Graph Entropy and Network
Coding Based Communications. IEEE Transactions on Information Theory 57, 6703–6717
(2011)
7. Körner, J.: Coding of an Information Source Having Ambiguous Alphabet and the Entropy
of Graphs. In: Proceedings of the 6th Prague Conference on Information Theory, Statistical
Decision Function, Random Processes, pp. 411–425 (1971)
8. Luo, B., Hancock, E.R.: Structural Graph Matching Using the EM Algorithm and Singular
Value Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 23,
1120–1136 (2001)
9. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
10. Rissanen, J.: Modelling by Shortest Data Description. Automatica 14, 465–471 (1978)
11. Rissanen, J.: An Universal Prior for Integers and Estimation by Minimum Description
Length. Annals of Statistics 11, 417–431 (1983)
12. Han, L., Hancock, E.R., Wilson, R.C.: Characterizing Graphs Using Approximate von Neu-
mann Entropy. In: Vitrià, J., Sanches, J.M., Hernández, M. (eds.) IbPRIA 2011. LNCS,
vol. 6669, pp. 484–491. Springer, Heidelberg (2011)
13. Shervashidze, N., Borgwardt, K.M.: Fast Subtree Kernels on Graphs. In: NIPS,
pp. 1660–1668 (2009)
14. Ren, P., Wilson, R.C., Hancock, E.R.: Graph Characterization via Ihara Coefficients. IEEE
Transactions on Neural Networks 22, 233–245 (2011)
15. Wilson, R.C., Hancock, E.R., Luo, B.: Pattern Vectors from Algebraic Graph Theory. IEEE
Transactions on Pattern Analysis and Machine Intelligence 27, 1112–1124 (2005)
Mixtures of Radial Densities for Clustering
Graphs

Brijnesh J. Jain

Technische Universität Berlin, Germany


[email protected]

Abstract. We address the problem of unsupervised learning on graphs.


The contribution is twofold: (1) we propose an EM algorithm for esti-
mating the parameters of a mixture of radial densities on graphs on the
basis of the graph orbifold framework; and (2) we compare orbifold-based
clustering algorithms including the proposed EM algorithm against state-
of-the-art methods based on pairwise dissimilarities. The results show
that orbifold-based clustering methods complement the existing arsenal
of clustering methods on graphs.

Keywords: graphs, clustering, EM algorithm, graph matching.

1 Introduction

Attributed graphs are a versatile and expressive data structure for representing
complex patterns consisting of objects and relationships between objects. Ex-
amples include molecules, mid- and high-level description of images, instances
of relational schemes, web graphs, and social networks.
Despite the many advantages of graph-based representations, statistical learn-
ing on attributed graphs is underdeveloped compared to learning on feature vec-
tors. For example, generic state-of-the-art methods for clustering of non-vectorial
data are mainly based on pairwise dissimilarity methods such as hierarchical or
spectral clustering. One research direction to complement the manageable range
of graph clustering methods aims at extending centroid-based clustering methods
to attributed graphs [2,3,9,11], which more or less amount to different variants
of graph quantization methods. A theoretical justification of these approaches is
provided in [8] by means of establishing conditions for optimality and statistical
consistency.
Vector quantization, k-means and their variants are not only well-known for
their simplicity but also for their deficiencies. Such a statement in the graph
domain is difficult to derive, because more advanced generalizations of standard
clustering algorithms to graphs are rare and an empirical comparison to state-
of-the-art graph clustering methods is missing.
Following this line of research, the contributions of this paper are twofold: (1)
we extend mixtures of Gaussians to mixtures of radial densities on graphs and
adopt the EM algorithm for parameter estimation on the basis of the orbifold


framework [7]; (2) we compare the performance of clustering methods in orb-


ifolds with state-of-the art clustering based on pairwise dissimilarity data. The
results show that clustering in orbifolds constitute a promising alternative and
complement existing state-of-the-art graph clustering methods.

2 Graph Orbifolds
The section introduces a suitable representation of attributed graphs by means
of the orbifold framework as proposed in [5,7].
Let E be a p-dimensional Euclidean space. An attributed graph X = (V, E, α)
consists of a set V of vertices, a set E ⊆ V ×V of edges, and an attribute function
α : V × V → E, such that α(i, j) ≠ 0 for each edge and α(i, j) = 0 for each
non-edge. Attributes α(i, i) of vertices i may take any value from E.
For simplifying the mathematical treatment, we assume that all graphs are
of order n, where n is chosen to be sufficiently large. Graphs of order less than
n can be extended to order n by including isolated vertices with attribute zero.
This is a merely technical assumption to simplify mathematics. A graph X is
completely specified by its matrix representation x = (α(i, j)). Let X = En×n
be the Euclidean space of all (n × n)-matrices with elements from E and let Π n
be the set of all (n × n)-permutation matrices. For each p ∈ Π n we define a
mapping
γp : X → X , x → pT xp.
Then G = {γp : p ∈ Π n } is a finite group acting on X . For x ∈ X , the orbit
of x is the set defined by [x] = {γ(x) : γ ∈ G}. Thus, the orbit [x] consists of
all possible matrix representations of X obtained by reordering its vertices. We
define a graph orbifold by the quotient set
X_G = X/G = {[x] : x ∈ X}
of all orbits. Its natural projection is given by π : X → XG , x → [x]. In the
following, we identify [x] with X and occasionally write x ∈ X if π projects
to X.
In order to mimic Gaussian distributions, we extend the Euclidean norm ‖·‖
to a metric on X_G defined by

$$d(X, X') = \min \{ \|x - x'\| : x \in X,\; x' \in X' \}, \qquad (1)$$

where ‖·‖ denotes the Euclidean norm on X. We call a pair (x, x') ∈ X × X' with
‖x − x'‖ = d(X, X') an optimal alignment of X and X'.
An orbifold function is a mapping of the form f : X_G → R. Instead of studying
f, it is more convenient to study the lift f' : X → R of f satisfying f'(x) =
f(π(x)) = f(X) for all x ∈ X.
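For intuition, the metric of Eq. (1) can be evaluated exactly for very small graphs by brute force over all vertex permutations, as in the sketch below (graphs are assumed to be padded to a common order and to carry scalar attributes); for graphs of realistic size an approximate matcher such as graduated assignment is used instead.

```python
import itertools
import numpy as np

def graph_distance(x, y):
    """Orbifold metric d(X, X'): minimum Euclidean distance between matrix
    representations over all reorderings of the vertices of one graph."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    best = np.inf
    for perm in itertools.permutations(range(x.shape[0])):
        y_perm = y[np.ix_(perm, perm)]      # representation p^T y p for this permutation
        best = min(best, float(np.linalg.norm(x - y_perm)))
    return best
```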

3 Mixtures of Radial Densities


This section studies mixtures of radial densities on graphs. Throughout this
section, we assume that (X , B, λ) is the Lebesgue-Borel measure space. The
action of the finite group G on X induces the measure space (X_G, B_G, λ_G),
where $B_G = \{B \subset X_G : \pi^{-1}(B) \in \mathcal{B}\}$ and $\lambda_G = \lambda \circ \pi^{-1}$ is the induced quotient
measure.

3.1 Radial Densities


Radial densities on graphs are bell-shaped functions of the form

$$h(X|C, \sigma) = a \cdot \exp\left( -\frac{d(X, C)^2}{2\sigma^2} \right),$$

where a > 0 is the height of the bell, C ∈ X_G is the graph at the centre of the
bell, and σ > 0 is the width of the bell. The height a scales h to a density and
thus is of the form

$$a = \left( \int_{X_G} h_1(X|C, \sigma)\, \lambda_G(dX) \right)^{-1},$$

where h_1 denotes the same function with height one.

Lifting a radial density h(X|C, σ) to the Euclidean space X yields

$$h'(x|C, \sigma) = a \cdot \exp\left( -\frac{\min_{c \in C} \|x - c\|^2}{2\sigma^2} \right) = \max_{c \in C}\; \underbrace{a \cdot \exp\left( -\frac{\|x - c\|^2}{2\sigma^2} \right)}_{=:\; \phi(x|c, \sigma)}, \qquad (2)$$

where x is an arbitrary matrix representation of X and φ(x|c, σ) denotes the


radial function on vectors with center c and width σ.
Next, we are interested in those regions of X at which a radial density φ(x|c, σ)
on vectors is maximal. By definition of the lift h', this region is the Dirichlet
(fundamental) domain of c,

$$D_c = \{ x \in X : \|x - c\| \le \|x - c'\|,\; c' \in C \}.$$

A Dirichlet domain is a convex polyhedral cone [6] with a number of useful


properties: (1) members x ∈ D_c are optimally aligned with c, that is d(X, C) =
‖x − c‖; (2) each graph X has a representation matrix x in D_c; (3) if x, x' ∈
Dc are two distinct matrix representations of the same graph X, then both
representations lie on the boundary of Dc ; and (4) there is a cross section s :
XG → X of π such that π(s(X)) = X.
A cross section s establishes a one-to-one correspondence between the graphs
from XG and the elements of s(XG ) ⊆ Dc by selecting a unique representation
for each graph from the Dirichlet domain Dc . The subset s(XG ) ⊂ Dc looks like
the Dirichlet domain D_c itself, except for some holes in the boundary.
Suppose that h(X|C, σ) is a radial density on graphs with lift h'(x|C, σ). Let
c be an arbitrary representation of the center C. Truncating the lift h' to the
Dirichlet domain D_c yields the density

$$h^t(x|c, \sigma) = \begin{cases} a \cdot \phi(x|c, \sigma) & : x \in D_c, \\ 0 & : \text{otherwise,} \end{cases}$$

where the Dirichlet domain Dc of center c is the support of ht . This shows


that the study of radial densities h(X|C, σ) on graphs reduces to the study
of truncated radial densities ht (x|c, σ) on a representable region in Dc . The
truncated radial density ht is related to the Gaussian distribution N (x|c, Σ)
with mean c and covariance matrix Σ = σ 2 I as follows:
1
ht (x|c, σ) = · N (x|c, Σ), (3)
Pc
where d is the dimension of X and x ∈ Dc . The term Pc is the probability
of being in the Dirichlet domain Dc with respect to the measure λ defined by
some cross section into Dc . Thus, equation (3) shows that studying radial densi-
ties on graphs can be reduced to studying truncated Gaussians on vectors with
support Dc .
Note that neither the center c nor the squared width σ 2 of a truncated Gaus-
sian coincides with its expectation E[x] or variance V[x]. The expectation and
variance can be obtained from the center and squared width plus an adjustment
for the truncation on the distribution.

3.2 Mixture Models of Radial Densities


A mixture model of radial densities on graphs is a probability distribution of the
form

$$p(X) = \sum_{j=1}^{K} \pi_j\, h(X|C_j, \sigma_j), \qquad (4)$$

where the parameters πj are the mixing coefficients of components j with centers
Cj and widths σj . Lifting the mixture model p(X) to the Euclidean space X
yields

$$p'(x) = \sum_{j=1}^{K} \pi_j\, h'(x|C_j, \sigma_j), \qquad (5)$$

where x is an arbitrarily chosen matrix representation of graph X. Truncation of


the lift p' to a single Dirichlet domain, as for radial functions, is no longer useful,
because different centers C_j may result in different Dirichlet domains regardless
of how we choose the particular representations c_j of the centers C_j. Instead,
we choose a matrix representation c_j for each center C_j. Next, we truncate the
lift h' of each component j to the Dirichlet domain D_j = D_{c_j}. In doing so, we
obtain a truncated mixture model

$$p^t(x) = \sum_{j=1}^{K} \pi_j\, h^t(x^j | c_j, \sigma_j), \qquad (6)$$

where x^j ∈ D_j. Note that the argument x of the mixture p^t and the data points
x^j represent the same graph. To emphasize that the mixture p^t(x) is indeed
a function of x, we may think of x^j = s_j(π(x)), where s_j is a cross section
into D_j.
Equation (6) together with equation (3) show that the study of mixtures of
radial densities on graphs can be reduced to the study of mixtures of truncated
Gaussians on vectors. In contrast to standard mixtures, each component j of a
truncated mixture lives in a different region Dj , which requires a transformation
of the data points.

3.3 EM Algorithm
Suppose that $S = (X_i)_{i=1}^{N}$ are N example graphs generated by a mixture model
p(X) of radial densities as defined in equation (4). Our goal is to estimate the
parameters π_j, C_j, and σ_j² on the basis of S by adopting the EM algorithm for
maximizing the log-likelihood

$$\ell(\Theta|S) = \sum_{i=1}^{N} \ln \sum_{j=1}^{K} \pi_j\, h(X_i | C_j, \sigma_j),$$

where $\Theta = (\pi_j, C_j, \sigma_j^2)_{j=1}^{K}$. Let c_j be arbitrarily chosen matrix representations
of the centers C_j. Lifting and truncating the log-likelihood ℓ(Θ|S) yields

$$\ell^t(\theta|S) = \sum_{i=1}^{N} \ln \sum_{j=1}^{K} \pi_j\, h^t(x_i^j | c_j, \sigma_j),$$

where $\theta = (\pi_j, c_j, \sigma_j^2)_{j=1}^{K}$ and $x_i^j \in D_j$ is a representation of the i-th example X_i
in D_j.
Since the log-likelihood ℓ^t(θ|S) is differentiable with respect to its parameters
in D_j, we can adopt the EM algorithm for mixtures of truncated Gaussians.
The E-Step determines the responsibilities γ_ij that component j has generated
example X_i according to

$$\gamma_{ij} = \frac{\pi_j\, h^t(x_i^j \,|\, c_j, \sigma_j)}{\sum_k \pi_k\, h^t(x_i^k \,|\, c_k, \sigma_k)}. \qquad (7)$$

For the M-Step, we arrive at

$$\hat{c}_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, x_i^j - \delta_E(j) \qquad (8)$$

$$\hat{\sigma}_j^2 = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij}\, \big\| x_i^j - \hat{c}_j \big\|^2 - \delta_V(j) \qquad (9)$$

$$\hat{\pi}_j = \frac{N_j}{N}, \qquad (10)$$

where $N_j = \sum_{i=1}^{N} \gamma_{ij}$ is the effective number of training graphs assigned to component j,
and $\delta_E(j) = \delta_E(c_j, \sigma_j)$ as well as $\delta_V(j) = \delta_V(c_j, \sigma_j)$ are the adjustments
for the truncations.
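The E/M loop can be sketched in Python as follows. This is only an illustration under strong simplifications: vertex and edge attributes are assumed scalar, the alignment of a graph into the Dirichlet domain of a center is delegated to a user-supplied align(x, c) function (in practice an approximate graph matcher such as graduated assignment), the truncation factor 1/P_c and the adjustments δ_E, δ_V are dropped, and the estimated width is squashed through the sigmoid heuristic described in the experimental setup of Section 4.2. It is not the author's implementation.

```python
import numpy as np

def em_radial_mixture(graphs, K, align, n_iter=50, beta=1.0, seed=0):
    """Simplified EM for a mixture of radial densities on graphs (sketch only).

    graphs : list of n x n matrix representations, all padded to the same order
    align  : align(x, c) -> representation of graph x aligned with center c
    """
    n = graphs[0].shape[0]
    dim = n * n
    rng = np.random.default_rng(seed)
    centers = [graphs[i].copy() for i in rng.choice(len(graphs), size=K, replace=False)]
    sigma2 = np.ones(K)
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities of Eq. (7), ignoring the truncation factor 1/P_c
        X = np.array([[align(x, c).ravel() for c in centers] for x in graphs])  # N x K x dim
        C = np.array([c.ravel() for c in centers])                               # K x dim
        sq = ((X - C[None, :, :]) ** 2).sum(axis=2)                              # squared distances
        log_p = np.log(pi + 1e-12)[None, :] - 0.5 * sq / sigma2[None, :] \
                - 0.5 * dim * np.log(sigma2)[None, :]
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: Eqs. (8)-(10) without the adjustments delta_E, delta_V
        Nj = resp.sum(axis=0) + 1e-12
        pi = Nj / len(graphs)
        for j in range(K):
            centers[j] = ((resp[:, j, None] * X[:, j, :]).sum(axis=0) / Nj[j]).reshape(n, n)
            raw = (resp[:, j] * ((X[:, j, :] - centers[j].ravel()) ** 2).sum(axis=1)).sum() / Nj[j]
            sigma2[j] = 1.0 / (1.0 + np.exp(-beta * raw))   # sigmoid width control (Sec. 4.2)
    return centers, sigma2, pi, resp
```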

4 Experiments

The goal of our experimental study is to assess the performance of clustering


methods based on the graph orbifold framework and compare it against state-
of-the-art clustering method based on pairwise dissimilarity data. We consid-
ered the following algorithms: (1) EM algorithm for mixtures of radial densities
on graphs, (2) k-means for graphs [8], (3) hierarchical clustering using Ward’s
linkage [15], (4) spectral clustering [12], (5) k-medoids [10], and (6) hierarchi-
cal clustering using complete, average, and single linkage. Methods (1)–(3) are
orbifold-based and methods (4)–(6) are based on pairwise dissimilarity data.
Note that Ward’s linkage is considered as an orbifold method though it is based
on pairwise dissimilarities, because it computes the cluster mean of graphs for
each newly merged cluster.

4.1 Data
We selected subsets of the following training data sets from the IAM graph
database repository [14]: Letter (low, medium, high), fingerprint, grec, and coil.
In addition, we considered the Monoamine Oxydase (MAO) data set. The let-
ter data sets compile distorted letter drawings from the Roman alphabet that
consist of straight lines. Lines of a letter are represented by edges and endpoints
of lines by vertices. Fingerprint images of the fingerprints data set are con-
verted into graphs, where vertices represent endpoints and bifurcation points of
skeletonized versions of relevant regions. Edges represent ridges in the skeleton.
The distortion levels are low, medium, and high. The grec data set consists of
graphs representing symbols from noisy versions of architectural and electronic
drawings. Vertices represent endpoints, corners, intersections, or circles. Edges
represent lines or arcs. The coil-4 data set is a subset of the coil-100 data set
consisting of 4 out of 32 objects that are expected to be recognized easily. The
arbitrarily chosen objects correspond to indices 7 (car), 17 (cup), 52 (duck), and
75 (house) of the coil-100 data set (starting at index 1). After preprocessing, the
images are represented by graphs, where vertices represent endpoints of lines and
edges represent lines. The mao data set is composed of 68 molecules, divided
into two classes, where 38 molecules inhibit the monoamine oxidase (antidepres-
sant drugs) and 30 do not. These molecules are composed of different chemical
elements and are thus encoded as labeled graphs, where nodes represent atoms
and edges represent bonds between atoms.1

4.2 Experimental Setup


We set the number K of clusters to the number of classes of the respective data
sets. All clustering algorithms apply the squared graph metric defined in eq. (1)
as underlying dissimilarity function. All graph metrics were approximated using
the graduated assignment algorithm [1]. Spectral clustering and k-medoids can
1
https://brunl01.users.greyc.fr/CHEMISTRY/

be applied in any distance space, because both methods operate on pairwise


dissimilarity data. Ward’s clustering, k-means for graphs, and the proposed EM
algorithm require the concept of a sample mean, which is generally not available
in an arbitrary distance space. By means of the orbifold framework, we esti-
mated sample mean graphs using the incremental arithmetic mean algorithm
[4]. The non-hierarchical methods iterate several times through the training set.
We terminated these iterative methods after 20 epochs without improvement of
the respective error function and recorded the solution where the value of the
error function was lowest. To initialize the centroids of the iterative methods, we
applied the furthest first method on 3K log(K) randomly chosen elements. To
overcome computational issues of the EM algorithm, we substituted the calcula-
tion of the adjustment terms by controling the estimated width σ̂j2 according to
σ̃j2 = 1/(1 + exp −β · σ̂j2 ), where the slope β is a problem dependent parameter
of ψ. The task of the sigmoid is to replace absolute adjustments in a relative
manner and to ensure numerical stable solutions which would be sufficient for
learning tasks such as clustering.
To assess the quality of the clustering algorithms, we determined its classi-
fication accuracy on the respective data sets. Clusters are assigned to classes
by solving the corresponding maximum weighted bipartite matching problem.
Nodes of the complete bipartite graph represent clusters and classes. Edges con-
nect each cluster C with each class c weighted by the normalized number of
members in C of class c. We performed scaling parameter selections for EM
(β) and spectral clustering (σ 2 ) as follows: For each data set, we performed a
one-dimensional grid-search, where each parameter configuration was tested 10
times. We selected the parameter with best average cost over all 10 runs. We
performed 10 trials for each data set. Since each clustering algorithm optimizes
an error function, we reported the classification accuracy of the particular trial
with lowest error over all 10 trials.
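As an illustrative aside (not part of the original experiments), the cluster-to-class assignment described above can be computed with an off-the-shelf Hungarian solver. The sketch below is our own; the function names and the use of SciPy are assumptions, and raw co-occurrence counts are used, which yield the same matching as the normalized weights.

```python
# Sketch of the cluster-to-class assignment used to score a clustering:
# clusters and classes form a complete bipartite graph whose edge weights
# are co-occurrence counts; a maximum weighted matching assigns each
# cluster to one class. Hypothetical inputs for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(cluster_labels, class_labels, K):
    """Classification accuracy after optimally matching clusters to classes."""
    n = len(class_labels)
    # Weight matrix: entry (i, j) counts members of cluster i belonging to class j.
    W = np.zeros((K, K))
    for c, y in zip(cluster_labels, class_labels):
        W[c, y] += 1
    # linear_sum_assignment minimizes cost, so negate to obtain a maximum matching.
    rows, cols = linear_sum_assignment(-W)
    return W[rows, cols].sum() / n

# Example: three clusters, three classes, ten samples.
clusters = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])
classes  = np.array([1, 1, 0, 0, 0, 2, 2, 2, 1, 2])
print(clustering_accuracy(clusters, classes, K=3))  # 0.9
```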

4.3 Results and Discussion


Tables 1 and 2 summarize the results. Shown are the classification accuracy acc obtained at the trial with lowest error, and the average (avg), maximum (max) and minimum (min) accuracy over all ten trials for each data set. Note that in an unsupervised setting, we are unable to identify the trial with maximum accuracy when using the error values as the criterion for selecting the "best" outcome of a clustering method out of ten trials. We make the following observations, taking into account the pre-specified number K of clusters and the limited significance due to the low number of trials:
1. Ward’s clustering performs best overall. Spectral clustering and the EM
algorithm are comparable to Ward’s clustering, whereas k-means for graphs and
in particular k-medoids are not competitive. Compared to the best results, each
cluster method has a performance drop on at least one data set showing that
there is no best clustering method. 2. A low error is a good indicator for recov-
ering the class structure for all cluster algorithms. Inspecting the accuracy acc
at the lowest error and the maximum accuracy max attained over all runs, we see that acc is often close to max.

Table 1. Classification accuracy of graph clustering algorithms over 10 trials. K denotes the number of clusters. acc: accuracy at lowest error. avg: average accuracy with standard deviation. max: maximum accuracy. min: minimum accuracy.

                            |-- pairwise approaches --|  |------ orbifold-based approaches ------|
data             accuracy   Spectral      k-Medoids      Ward's        k-Means       EM
                            Clustering                   Clustering    for graphs    for graphs
letter (low)     acc        93.7          82.4           94.5          88.8          93.3
K = 15           avg        93.9±0.1      73.4±6.5       94.1±0.4      80.9±4.1      91.4±3.0
                 max        94.0          82.4           94.5          88.8          93.5
                 min        93.7          59.3           93.2          75.5          86.8
letter (medium)  acc        91.1          77.9           92.4          84.9          90.9
K = 15           avg        90.6±1.8      71.0±5.4       91.3±0.7      78.7±4.9      90.6±0.3
                 max        91.6          77.9           92.5          85.1          90.9
                 min        85.6          60.4           90.1          68.3          90.3
letter (high)    acc        86.5          55.7           84.9          85.3          90.3
K = 15           avg        84.1±3.9      47.7±4.9       82.5±1.4      75.1±5.5      85.2±1.9
                 max        88.4          55.7           85.5          85.3          90.3
                 min        78.9          40.9           78.9          66.0          83.3
fingerprint      acc        61.4          60.8           69.7          63.0          63.2
K = 4            avg        61.4±0.0      60.0±5.1       66.6±2.3      64.6±1.6      63.0±0.3
                 max        61.4          70.4           69.7          66.6          63.2
                 min        61.4          54.6           58.7          62.4          62.2
grec             acc        64.7          50.3           69.6          54.5          54.9
K = 22           avg        65.7±1.2      50.5±3.4       68.1±1.3      51.7±3.1      50.1±2.3
                 max        68.5          55.2           70.6          57.3          54.9
                 min        64.7          46.2           67.1          45.8          46.5
coil             acc        58.3          44.8           54.2          50.0          64.6
K = 4            avg        59.0±2.0      42.6±2.7       51.8±4.6      46.0±2.0      51.1±7.1
                 max        64.6          46.9           60.4          50.0          64.6
                 min        58.3          38.5           44.8          42.7          44.8
mao              acc        73.5          70.6           73.5          72.1          73.5
K = 2            avg        69.7±7.9      60.0±6.4       69.6±8.0      72.8±1.0      74.1±5.5
                 max        80.9          70.6           89.7          75.0          88.2
                 min        63.2          55.9           60.3          72.1          66.2

3. Inspecting the letter data sets, where low, medium and high refer to the noise level with which the letters were distorted, we observe that the EM algorithm and k-means are most robust against noise. It is notable that the performance of k-medoids strongly declines with increasing noise level. In contrast to findings in vector spaces (see e.g. [13]), results on graphs do not confirm the common view that k-medoids is more robust than
k-means.
complete, average, and single linkage were not able to recover the class structure
given the fixed number K of clusters.
The good results of Ward’s clustering and the EM algorithm suggest that
clustering methods relying on the graph orbifold framework complement the
collection of existing clustering approaches. The beneficial feature of orbifold-
based clustering approaches is the notion of centroid of graphs, which may im-
prove clustering results for some problems and can be used for nearest neighbor
classification in a straightforward manner.

5 Conclusion

Graph orbifolds together with lifting and truncating constitute a suitable toolkit to generalize standard clustering methods from vectors to graphs and to provide geometrical insight into the graph domain.

Table 2. Classification accuracy of hierarchical clustering methods. Results are deterministic.

data              K    Complete   Average   Single
letter (low)     15        69.3      35.7      8.5
letter (medium)  15        63.1      33.7      8.5
letter (high)    15        54.9      22.8      8.5
fingerprint       4        55.6      40.4     40.2
grec             22        47.6      43.0     38.1
coil              4        32.3      28.1     28.1
mao               2        55.9      66.2     55.9

Using this toolkit, we showed
that the study of mixtures of radial densities on graphs can be reduced to the
study of mixtures of truncated Gaussians, where each truncated component lives
in a different region of the Euclidean space. From these findings, we adapted the
EM algorithm for parameter estimation. In experiments, we compared clustering
methods operating in a graph orbifold against state-of-the-art clustering methods
based on pairwise dissimilarities. Results show that clustering in a graph orbifold
is a competitive alternative and therefore complements the collection of existing clustering algorithms on graphs.
Open issues with respect to estimating the parameters of a mixture model in-
clude a principled approximation of the adjustment terms for truncation, exten-
sion to covariances, and a statement to which extent mixtures of radial densities
can approximate arbitrary distributions on graphs.

References
1. Gold, S., Rangarajan, A.: A Graduated Assignment Algorithm for Graph Match-
ing. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(4),
377–388 (1996)
2. Gold, S., Rangarajan, A., Mjolsness, E.: Learning with preknowledge: clustering with point and graph matching distance measures. Neural Computation 8(4), 787–804 (1996)
3. Günter, S., Bunke, H.: Self-organizing map for clustering in the graph domain.
Pattern Recognition Letters 23(4), 405–417 (2002)
4. Jain, B.J., Obermayer, K.: Algorithms for the Sample Mean of Graphs. In: Jiang,
X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702, pp. 351–359. Springer, Heidel-
berg (2009)
5. Jain, B., Obermayer, K.: Structure spaces. The Journal of Machine Learning Re-
search 10 (2009)
6. Jain, B.J., Obermayer, K.: Large sample statistics in the domain of graphs. In: Han-
cock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F. (eds.) SSPR&SPR
2010. LNCS, vol. 6218, pp. 690–697. Springer, Heidelberg (2010)
7. Jain, B., Obermayer, K.: Maximum Likelihood Method for Parameter Estimation
of Bell-Shaped Functions on Graphs. Pat. Rec. Letters 33(15), 2000–2010 (2012)
8. Jain, B.J., Obermayer, K.: Graph quantization. Computer Vision and Image Un-
derstanding 115(7), 946–961 (2011)
9. Jain, B.J., Wysotzki, F.: Central clustering of attributed graphs. Machine Learn-
ing 56(1), 169–207 (2004)

10. Kaufman, L., Rousseeuw, P.: Clustering by means of medoids. Statistical Data
Analysis Based on the L1 -Norm and Related Methods, 405–416 (1987)
11. Lozano, M.A., Escolano, F.: Protein classification by matching and clustering sur-
face graphs. Pattern Recognition 39(4), 539–551 (2006)
12. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm.
Advances in Neural Information Processing Systems 2, 849–856 (2002)
13. Ng, R.T., Han, J.: Efficient and effective clustering methods for spatial data mining.
In: Proceedings of the 20th International Conference on Very Large Data Bases,
VLDB 1994, pp. 144–155 (1994)
14. Riesen, K., Bunke, H.: IAM graph database repository for graph based pattern
recognition and machine learning. In: da Vitoria Lobo, N., Kasparis, T., Roli, F.,
Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR
2008. LNCS, vol. 5342, pp. 287–297. Springer, Heidelberg (2008)
15. Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association 58(301) (1963)
Complexity Fusion for Indexing Reeb Digraphs

Francisco Escolano1, Edwin R. Hancock2 , and Silvia Biasotti3


1 University of Alicante, [email protected]
2 University of York, [email protected]
3 CNR-IMATI Genova, [email protected]

Abstract. In this paper we combine different quantifications of heat


diffusion-thermodynamic depth on digraphs in order to match directed
Reeb graphs for 3D shape recognition. Since different real valued func-
tions can also infer different Reeb graphs for the same shape, we exploit
a set of quasi-orthogonal representations for comparing sets of digraphs
which encode the 3D shapes. In order to do so, we fuse complexities.
Fused complexities come from computing the heat-flow thermodynamic
depth approach for directed graphs, which has been recently proposed
but not yet used for discrimination. In this regard, we do not rely on
attributed graphs, as is usual, because we want to explore the limits of pure
topological information for structural pattern discrimination. Our
experimental results show that: a) our approach is competitive with
information-theoretic selection of spectral features and, b) it outperforms
the discriminability of the von Neumann entropy embedded in a ther-
modynamic depth, and thus spectrally robust, approach.

1 Introduction
This paper is motivated by the hypothesis that mixing the same graph complex-
ity measure over the same shape, represented with different graphs, boosts the
discrimination power of isolated complexity measures. To commence, there has
been a recent effort in quantifying the intrinsic complexity of graphs in their orig-
inal discrete space. Early attempts have incorporated principles related to MDL
(Minimum Description Length) to trees and graphs (see [1] for trees and [2]
for edge-weighted undirected graphs). More recently, the intersection between
structural pattern recognition and complex networks has proved to be fruitful
and has inspired several interesting measures of graph complexity. Many of them
rely on elements of spectral graph theory. For instance, Passerini and Severini
have applied the quantum (von Neumann) entropy to graphs [3]. We have re-
cently applied thermodynamic depth [5] to the domain of graphs [6] and we
have extended the approach to digraphs [7]. However, this latter approach has
not been applied to graph discrimination as the one based on approximated
von Neumann entropy [4]. Simultaneously, we have recently developed a method
for selecting the best set of spectral features in order to classify Reeb graphs


(which summarize 3D shapes) [8]. Besides the spectral features we have evalu-
ated in the latter work the discriminability of three different real functions for
building Reeb graphs: geodesic, distance from barycenter and distance from the
circumscribing sphere. Feature selection results in two intriguing conclusions:
a) heat flow complexity is not one of the most interesting features, and b) the
three latter real functions for building Reeb graphs have a similar relevance. The
first conclusion seems to discard the use of heat flow based complexity measures
for discrimination, at least in undirected graphs, whereas the second conclusion
points towards discarding also the analysis of the impact of the representations.
In this paper we show that these conclusions are misleading. To commence, when
directed graphs are considered, heat-flow complexity information is richer. Sec-
ondly, the three functions explored in [8] are far from being orthogonal (they
produce very similar graphs). Consequently herein we fuse both lines of research
in order to find the best performance achievable only with topological informa-
tion (without attributes). In Section 2 we present the catalog of real functions
we are going to explore. In Section 3 we highlight the main ingredients of heat
flow and thermodynamic depth in digraphs. Section 4 is devoted to analyze the
result of fusing the directed complexities of several Reeb graphs from the same
3D shape. Finally, in Section 5 we will present our conclusions and future works.

2 Directed Reeb Graphs and Real Functions

The Reeb graph [9] is a well-known topological description that codes in a graph
the evolution of the isocontours of a real-valued, continuous function f : M → R
over a manifold M . In other words, it tracks the origin, the disappearance,
the union or the split of the isocontours as the co-domain of the function f is
spanned. The nodes of the Reeb graph correspond to the critical points of f
while the arcs are associated with the surface portions crossed when going from one critical point to another.
Several algorithms exist for the Reeb graph extraction from triangle meshes
[10]; in this paper we adopt a directed version of the Extended Reeb graph (ERG)
[11], which we name the diERG. The diERG differs from the ERG in terms of arc
orientation. Similarly to the ERG, to build the diERG we sample the co-domain
of f with a finite number of intervals, then we characterize the surface in term
of critical or regular areas and, finally, we track the evolution of the regions in
the graph. Arcs are oriented according to the increasing value of the function.
The diERG is then an acyclic, directed graph, a formal proof of this fact can be
found in [12]. Figure 1 shows the pipeline of the graph extraction.
Each function can be seen as a geometric property and a tool for coding
invariance in the description [13]. When dealing with shape retrieval, the function f has to be invariant to object rotation, translation and scaling. Among the large number of functions available in the literature, we consider:

– the distance from the barycentre B of the object, Bar(p) = dE(p, B), p ∈ M, where dE represents the usual Euclidean distance (Figure 2-b);


Fig. 1. Pipeline of the diERG extraction. (a) Surface partition and recognition of
critical areas; blue areas correspond to minima, red areas correspond to maxima, green
areas to saddle areas. (b) Expansion of critical areas to their nearest one. (c) The
oriented diERG.

– the distance from the main shape axis v, MSA(p) = dE(p, v) (Figure 2-c);
– the function MSANorm(p) = ‖v × (p − B)‖, p ∈ S, where v is the same as above and B is the barycentre (Figure 2-d);
– the average of the geodesic distances defined in [14] (Figure 2-e);
– the first six (ranked with respect to the decreasing eigenvalues) non-constant eigenfunctions of the Laplace-Beltrami operator of the mesh, computed according to [15], LAPLi, i = {1, . . . , 6} (Figure 2(f-i));
– a mix of the first three eigenfunctions of the Laplace-Beltrami operator obtained according to the rule MIXi+j−2 = (LAPLi)² − (LAPLj)², i = {1, 2}, j = {2, 3}, i ≠ j (Figure 2(j-l)).


Fig. 2. The set of real functions in our framework. Colors represent the function, from
low (blue) to high (red) values.

Each function reflects either intrinsic or extrinsic shape features. Geodesic-


based and Laplacian-based functions are isometry-invariant and therefore pose
invariant because they approximate the intrinsic Riemannian metric of the sur-
face [16]. In this way, the graph representation is independent of the different
articulations of the objects. On the other hand, the distance from the barycentre
highlights the distribution of the object with respect to its barycentre. There-
fore such a function is rotation invariant with respect to rotations around the
barycentre but sensitive to pose variations. Similarly the distances from the
principal shape axis and its orthogonal are independent of axis rotations and
independent of axis symmetries.
By mixing the different properties (rigid or isometry invariant), different shape features are captured.
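As a hedged illustration of two of the extrinsic functions listed above (not the authors' implementation), the following sketch evaluates the barycentre distance and the distance from the main shape axis at mesh vertices; taking the axis as the first principal direction of the vertex cloud is an assumption made only for this example.

```python
# Sketch of two extrinsic real functions evaluated at mesh vertices.
# The "main shape axis" is taken here as the first principal axis of the
# vertex cloud -- an illustrative assumption, not the paper's definition.
import numpy as np

def barycentre_distance(P):
    """Bar(p) = ||p - B|| for every vertex p (rows of P)."""
    B = P.mean(axis=0)
    return np.linalg.norm(P - B, axis=1)

def main_axis_distance(P):
    """MSA(p): Euclidean distance of p from the line through B along axis v."""
    B = P.mean(axis=0)
    # First principal direction of the centred vertices.
    _, _, Vt = np.linalg.svd(P - B, full_matrices=False)
    v = Vt[0]
    diff = P - B
    proj = np.outer(diff @ v, v)          # component along the axis
    return np.linalg.norm(diff - proj, axis=1)

P = np.random.rand(100, 3)                # placeholder vertex coordinates
f_bar, f_msa = barycentre_distance(P), main_axis_distance(P)
```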

3 Heat Flow Complexity in Digraphs

3.1 The Laplacian of a Directed Graph

A directed graph (digraph) G = (V, E) with n = |V| vertices and edges E ⊆ V × V is encoded by an adjacency matrix A, where Aij > 0 if i → j ∈ E and Aij = 0 otherwise (this definition includes weighted adjacency matrices). The outdegree matrix D is a diagonal matrix with Dii = Σj∈V Aij. The transition matrix P is defined by Pij = Aij/Dii if (i, j) ∈ E and Pij = 0 otherwise. The transition matrix is key to defining random walks on the digraph, and Pij is the probability of reaching node j from node i. Given these definitions we have that Σj∈V Pij = 1 in general. In addition, P is irreducible iff G is strongly connected (there is a path from each vertex to every other vertex). If P is irreducible, the Perron-Frobenius theorem ensures that there exists a left eigenvector φ satisfying φᵀP = λφᵀ and φ(i) > 0 ∀i. If P is aperiodic (spectral radius ρ = 1) we have φᵀP = ρφᵀ and all the other eigenvalues have an absolute value smaller than
ρ = 1. By ensuring strong connection and aperiodicity we also ensure that any
random walk in a directed graph satisfying these two properties converges to a
unique stationary distribution.

Normalizing φ so that Σi∈V φ(i) = 1, we encode the eigenvector elements as a probability distribution. This normalized row vector φ corresponds to the stationary distribution of the random walks defined by P, since φP = φ. Therefore, φ(i) = Σj:j→i φ(j)Pji, that is, the probability that the random walk is at node i is the sum of all incoming probabilities from all nodes j satisfying j → i. If we define Φ = diag(φ(1), . . . , φ(n)) we have the definition of the following matrices:

    L = Φ − (ΦP + PᵀΦ)/2   and   ℒ = I − (Φ^{1/2} P Φ^{−1/2} + Φ^{−1/2} Pᵀ Φ^{1/2})/2,    (1)

where L is the combinatorial directed Laplacian and ℒ is the normalized directed Laplacian [17].

Symmetrizing P leads to real-valued eigenvalues and eigenvectors. In any case, satisfying irreducibility is difficult in practice since sink vertices may arise frequently. A formal trick for solving this problem consists of replacing P so that Pij = 1/n if Aij = 0 and Dii = 0. This strategy is adopted in Pagerank [18] and allows teleporting, acting on the random walk, to any other node in the graph. Teleporting is modeled by redefining P in the following way: P = ηP + (1 − η)11ᵀ/n with 0 < η < 1. The new P ensures both irreducibility and aperiodicity, and this allows us to both apply P with probability η and to teleport from any node with Aij = 0 with probability 1 − η. In [19] a trade-off between large values of η (preserving more of the structure of P) and small ones (potentially increasing the spectral gap) is recommended. For instance, in [20], where the task is to learn classifiers on directed graphs, the setting is η = 0.99. When using the new P we always have that Pii ≠ 0 due to the Pagerank masking. Such masking may introduce significant interference in heat diffusion when the Laplacian is used to derive the heat kernel.
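A minimal sketch of this construction follows, assuming NumPy and using η = 0.99 as quoted from [20]; the function name and the uniform treatment of sink rows are our own illustrative choices, not the authors' code.

```python
# Sketch of the directed-Laplacian construction described above: build the
# teleported transition matrix, recover its stationary distribution phi, and
# form the combinatorial and normalized directed Laplacians of eq. (1).
import numpy as np

def directed_laplacians(A, eta=0.99):
    n = A.shape[0]
    P = A.astype(float).copy()
    deg = P.sum(axis=1)
    P[deg == 0] = 1.0 / n                               # sink rows become uniform
    P[deg > 0] /= P[deg > 0].sum(axis=1, keepdims=True)
    P = eta * P + (1 - eta) * np.ones((n, n)) / n       # teleportation

    # Stationary distribution: leading left eigenvector of P, normalized.
    w, V = np.linalg.eig(P.T)
    phi = np.real(V[:, np.argmax(np.real(w))])
    phi = np.abs(phi) / np.abs(phi).sum()

    Phi = np.diag(phi)
    L = Phi - (Phi @ P + P.T @ Phi) / 2                 # combinatorial Laplacian
    Phi_h, Phi_ih = np.diag(np.sqrt(phi)), np.diag(1 / np.sqrt(phi))
    L_norm = np.eye(n) - (Phi_h @ P @ Phi_ih + Phi_ih @ P.T @ Phi_h) / 2
    return L, L_norm, phi

A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])         # a directed 3-cycle
L, L_norm, phi = directed_laplacians(A)
```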

3.2 Directed Heat Kernels and Heat Flow


We commence by reviewing the concept of heat flow [6]. Firstly, the spectral decomposition of the diffusion kernel is Kβ(G) = exp(−βL) ≡ ΨΛΨᵀ, where Λ = diag(e^{−βλ1}, e^{−βλ2}, . . . , e^{−βλn}), Ψ = [ψ1, ψ2, . . . , ψn], and {(λi, ψi)}, i = 1, . . . , n, are the eigenvalue-eigenvector pairs of Φ − W, where W = (ΦP + PᵀΦ)/2 can be seen as the weight matrix of the undirected graph Gᵘ associated with G. In any case, Kβij = Σ_{k=1}^{n} ψk(i)ψk(j)e^{−λkβ}, and Kβij ∈ [0, 1] is the (i, j) entry of a doubly stochastic matrix. Double stochasticity for all β implies heat conservation in the system as a whole, that is, not only in the nodes and edges of the graph but also in the transitivity links eventually established between non-adjacent nodes (if i is not adjacent to j, an entry Kβij > 0 will eventually appear for β large enough).
The total directed heat flowing through the graph at a given β (instantaneous directed flow) is given by

    Fβ(G) = Σ_{i→j} Aij Σ_{k=1}^{n} ψk(i)ψk(j)e^{−λkβ},    (2)

A more compact definition of the flow is Fβ(G) = A : Kβ, where X : Z = Σij Xij Zij = trace(XZᵀ) is the Frobenius inner product. While the instantaneous flow accounts for the heat flowing through the edges of the graph, it accounts neither for the heat remaining in the nodes nor for that in the transitivity links. The limiting cases are F0 = 0 and Fβmax = (1/n) Σ_{i→j} Aij, which reduces to |E|/n if G is unattributed (Aij ∈ {0, 1} ∀ij). Defining Fβ in terms of A instead of W, we retain the directed nature of the original graph G. The function derived from computing Fβ(G) from β = 0 to βmax is the so-called directed heat flow trace. These traces satisfy the phase transition principle [6] (although the formal proof is out of the scope of this paper). In general heat flow diffuses more slowly than in the undirected case and phase transition points (PTPs) appear later. This is due to the constraints imposed by A.
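Under the same assumptions, a short sketch of the directed heat flow trace follows; it reuses the directed_laplacians helper from the previous sketch and is an illustration only, not the authors' implementation.

```python
# Sketch of the directed heat flow trace: diagonalize Phi - W (returned as L by
# the previous sketch), form the heat kernel K_beta, and evaluate
# F_beta(G) = A : K_beta over a range of beta values.
import numpy as np

def heat_flow_trace(A, betas, eta=0.99):
    L, _, _ = directed_laplacians(A, eta)        # L = Phi - W, symmetric by construction
    lam, Psi = np.linalg.eigh((L + L.T) / 2)     # symmetrize for numerical safety
    trace = []
    for beta in betas:
        K = Psi @ np.diag(np.exp(-beta * lam)) @ Psi.T   # heat kernel exp(-beta L)
        trace.append(np.sum(A * K))                       # Frobenius product A : K
    return np.array(trace)

A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
flow = heat_flow_trace(A, betas=np.linspace(0.0, 10.0, 50))   # F_0 = 0, as in the text
```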

3.3 Thermodynamic Depth Complexity

Let G = (V, E) with |V| = n. Then the directed history of a node i ∈ V is hi(G) = {e(i), e²(i), . . . , eᵖ(i)}, where e(i) ⊆ G is the first-order expansion subgraph given by i and all j : i → j. If there are nodes j also satisfying j → i then these edges are included. If node i is a sink then e(i) = i. Similarly, e²(i) = e(e(i)) ⊆ G is the second-order expansion consisting of j → z : j ∈ Ve(i), z ∈ Ve(i), including also z → j if these edges exist and j → z. This process continues until p cannot be increased. If G is strongly connected eᵖ(i) = G, otherwise eᵖ(i) is the strongly connected component to which i belongs. Thus, every hi(G) defines a different causal trajectory which may lead to G itself if it is strongly connected. The depth of such macro-states relies on the variability of the causal trajectories leading to them. In order to characterize each trajectory we combine the heat flow complexities of its expansion subgraphs by means of defining minimal enclosing Bregman balls (MEBB) [21]. Here we use the I-Kullback-Leibler (I-KL) Bregman divergence between traces f and g: DF(f||g) = Σ_{i=1}^{d} fi log(fi/gi) − Σ_{i=1}^{d} fi + Σ_{i=1}^{d} gi, with convex generator F(f) = Σ_{i=1}^{d} (fi log fi − fi).
Given hi(G), the heat flow complexity trace fᵗ = F(eᵗ(i)) for the t-th expansion of i, a generator F and a Bregman divergence DF, the causal trajectory leading to G (or one of its strongly connected components) from i is characterized by the center ci ∈ Rᵈ and radius ri ∈ R of the MEBB B_{ci,ri} = {fᵗ ∈ Xi : DF(ci||fᵗ) ≤ ri}, where Xi is the set of all causal trajectories for the i-th node. Solving for the center and radius implies finding c* and r* minimizing r subject to DF(ci||fᵗ) ≤ r ∀t ∈ Xi, with |Xi| = T. Considering the Lagrange multipliers αt we have that c* = ∇F⁻¹(Σ_{t=1}^{T} αt ∇F(fᵗ)). The efficient algorithm in [21] estimates both the center and the multipliers. This idea is closely related to Core Vector Machines [22], and it is interesting to focus on the non-zero multipliers (and their support vectors) used to compute the optimal radius. More precisely, the multipliers define a convex combination and we have αt ∝ DF(c*||fᵗ), and the radius is simply chosen as r* = max_{αt>0} DF(c*||fᵗ). Given the directed graph G = (V, E), with |V| = n, and all the n pairs (ci, ri), the heat flow-thermodynamic depth complexity of G is characterized by the MEBB B_{c,r} = {ci ∈ Xi : DF(c||ci) ≤ r}. As a result, the TD depth of the directed graph is given by D(G) = r.
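For illustration, the following sketch computes the I-KL divergence between heat-flow traces and approximates an enclosing Bregman ball with a simple iterative update; note that the paper relies on the exact algorithm of [21], so this loop is only a crude stand-in under our own assumptions.

```python
# Sketch of the trajectory characterization step: the I-KL Bregman divergence
# between two heat-flow traces and a rough iterative approximation of an
# enclosing ball centre and radius (stand-in for the exact MEBB of [21]).
import numpy as np

def ikl(f, g, eps=1e-12):
    """I-Kullback-Leibler Bregman divergence D_F(f || g)."""
    f, g = np.asarray(f) + eps, np.asarray(g) + eps
    return np.sum(f * np.log(f / g) - f + g)

def enclosing_bregman_ball(traces, iters=200):
    """Approximate centre c* and radius r* of a ball enclosing all traces."""
    traces = np.asarray(traces, dtype=float)
    c = traces.mean(axis=0)
    for t in range(1, iters + 1):
        far = traces[np.argmax([ikl(c, f) for f in traces])]  # furthest trace
        c = c + (far - c) / (t + 1)                            # move towards it
        c = np.clip(c, 1e-12, None)
    r = max(ikl(c, f) for f in traces)
    return c, r

traces = np.abs(np.random.rand(5, 20))     # placeholder heat-flow traces
centre, radius = enclosing_bregman_ball(traces)
```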

4 Experiments

In order to compare our method with the technique proposed in [8] we use the same database, the SHREC version used in the Shape Retrieval Contest [23]. The database has 400 exemplars and 20 classes (20 exemplars per class). For each exemplar we apply the 13 real functions presented in Section 2 and then extract the corresponding Reeb digraphs. Then, each exemplar is characterized by 13 heat flow complexities (one per digraph). If we map these vectors (bags of complexities) via MDS we find that it is quite easy to discriminate glasses from pliers and fishes. However, it is very difficult to discriminate humans from chairs

Fig. 3. Experiments. Left: MDS for humans, chairs and armadillos. Right: PR curves.

(see Figure 3-left). The average behavior of these bags of complexities is given
by the precision recall (PR) curves. In Figure 3-right we show the PR curves
for Feature Selection [8] (CVIU’13), Thermodynamic Depth (TD) with the von
Neumann Entropy (here we use the W attributed graph induced by the Directed
Laplacian) and TD with directed heat flow. Our PR (heat flow) as well as the
one of Feature Selection reaches the average performance of attributed methods.
The 10-fold CV error for 15 classes reported by Feature Selection is 23.3%.
However, here we obtain a similar PR curve for the 20 class problem. Given that
Feature Selection relies on a complex offline process, the less computationally
demanding heat flow TD complexity for digraphs produces comparable results
(or better ones, if we consider that we are addressing the 20 class problem). In
addition, heat flow outperforms von Neumann entropy when embedded in TD.

5 Conclusions and Future Work

The main contribution of this paper is the proposal of a method (fusion of di-
graph heat flow complexities) which has a discrimination power similar to (or even better than, if we consider the whole 20-class problem) that of Feature Selection and outperforms the von Neumann entropy. Future work includes the exploration of more
sophisticated methods for fusing complexities and more real functions.

Acknowledgements. Francisco Escolano was funded by project TIN2012-


32839 of the Spanish Government. Edwin Hancock was supported by a Royal
Society Wolfson Research Merit Award.

References

[1] Torsello, A., Hancock, E.R.: Learning Shape-Classes Using a Mixture of Tree-
Unions. IEEE Tran. on Pattern Analysis and Mach. Intelligence 28(6), 954–967
(2006)
[2] Torsello, A., Lowe, D.L.: Supervised Learning of a Generative Model for Edge-
Weighted Graphs. In: Proc. of ICPR (2008)

[3] Passerini, F., Severini, S.: The von Neumann Entropy of Networks.
arXiv:0812.2597v1 (December 2008)
[4] Han, L., Escolano, F., Hancock, E.R., Wilson, R.: Graph Characterizations From
Von Neumann Entropy. Pattern Recognition Letters (2012) (in press)
[5] Lloyd, S., Pagels, H.: Complexity as Thermodynamic Depth. Ann. Phys. 188, 186 (1988)
[6] Escolano, F., Hancock, E.R., Lozano, M.A.: Heat Diffusion: Thermodynamic
Depth Complexity of Networks. Phys. Rev. E 85, 036206 (2012)
[7] Escolano, F., Bonev, B., Hancock, E.R.: Heat Flow-Thermodynamic Depth Com-
plexity in Directed Networks. In: Gimel’farb, G., Hancock, E., Imiya, A., Kuijper,
A., Kudo, M., Omachi, S., Windeatt, T., Yamada, K. (eds.) SSPR & SPR 2012.
LNCS, vol. 7626, pp. 190–198. Springer, Heidelberg (2012)
[8] Bonev, B., Escolano, F., Giorgi, D., Biasotti, S.: Information-theoretic Selection of
High-dimensional Spectral Features for Structural Recognition. Computer Vision
and Image Understanding 117(3), 214–228 (2013)
[9] Reeb, G.: Sur les points singuliers d’une forme de Pfaff complètement intégrable
ou d’une fonction numérique. Comptes Rendus Hebdomadaires des Séances de
l’Académie des Sciences 222, 847–849 (1946)
[10] Biasotti, S., Giorgi, D., Spagnuolo, M., Falcidieno, B.: Reeb graphs for shape
analysis and applications. Theoretical Computer Science 392(1-3), 5–22 (2008)
[11] Biasotti, S.: Topological coding of surfaces with boundary using Reeb graphs.
Computer Graphics and Geometry 7(3), 31–45 (2005)
[12] Biasotti, S.: Computational Topology Methods for Shape Modelling Applications.
PhD Thesis, Universitá degli Studi di Genova (May 2004)
[13] Biasotti, S., De Floriani, L., Falcidieno, B., Frosini, P., Giorgi, D., Landi, C., Pa-
paleo, L., Spagnuolo, M.: Describing shapes by geometrical-topological properties
of real functions. ACM Comput. Surv. 40(4), 1–87 (2008)
[14] Hilaga, M., Shinagawa, Y., Kohmura, T., Kunii, T.L.: Topology Matching for Fully
Automatic Similarity Estimation of 3D Shapes. In: Proc. of SIGGRAPH 2001,
pp. 203–212 (2001)
[15] Belkin, M., Sun, J., Wang, Y.: Discrete Laplace Operator for Meshed Surfaces.
In: Proc. Symposium on Computational Geometry, pp. 278–287 (2008)
[16] Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Efficient Computation of
Isometry-Invariant Distances Between Surfaces. SIAM J. Sci. Comput. 28(5),
1812–1836 (2006)
[17] Chung, F.: Laplacians and the Cheeger Inequality for Directed Graphs. Annals of Combinatorics 9, 1–19 (2005)
[18] Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web (Technical Report). Stanford University (1998)
[19] Johns, J., Mahadevan, S.: Constructing Basis Functions from Directed Graphs for Value Function Approximation. In: Proc. of ICML (2007)
[20] Zhou, D., Huang, J., Schölkopf, B.: Learning from Labeled and Unlabeled Data
on a Directed Graph. In: Proc. of ICML (2005)
[21] Nock, R., Nielsen, F.: Fitting the Smallest Enclosing Bregman Ball. In: Gama,
J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS
(LNAI), vol. 3720, pp. 649–656. Springer, Heidelberg (2005)
[22] Tsang, I.W., Kocsor, A., Kwok, J.T.: Simpler Core Vector Machines with Enclosing Balls. In: Proc. of ICML (2007)
[23] Giorgi, D., Biasotti, S., Paraboschi, L.: SHape Retrieval Contest: Watertight Models Track, http://watertight.ge.imati.cnr.it
Analysis of Wave Packet Signature of a Graph

Furqan Aziz, Richard C. Wilson, and Edwin R. Hancock

Department of Computer Science, University of York, YO10 5GH, UK


{furqan,wilson,erh}@cs.york.ac.uk

Abstract. In this paper we investigate a new approach for characteriz-


ing both the weighted and un-weighted graphs using the solution of the
edge-based wave equation. The reason for using wave equation is that
it provides a richer and potentially more expressive means of charac-
terizing graphs than the more widely studied heat equation. The wave
equation on a graph is defined using the Edge-based Laplacian. We com-
mence by defining the eigensystem of the edge-based Laplacian. We give
a solution of the wave equation and define signature for both weighted
graphs and un-weighted graphs. In the experiment section we perform
the proposed method on real world data and compare its performance
with other state-of-the-art methods.

Keywords: Edge-based Laplacian, Wave Equation, Gaussian wave


packet, Graph Characterization, Weighted graphs.

1 Introduction

Graph clustering is one of the most commonly encountered problems in areas where data are represented using graphs. Since graphs are non-vectorial, we require a method for characterizing a graph that can be used to embed the graph in a high-dimensional feature space for the purpose of clustering. Most of the commonly used methods for graph clustering are spectral methods, which are based on the eigensystem of the Laplacian matrix associated with the graph. For example Xiao et al. [1] have used the heat kernel for graph characterization. Wilson et al.
[2] have made use of graph spectra to construct a set of permutation-invariant
features for the purpose of clustering graphs.
The discrete Laplacian defined over the vertices of a graph, however, cannot link most results in analysis to a graph-theoretic analogue. For example the wave equation u_tt = Δu, defined with the discrete Laplacian, does not have a finite speed of propagation. In [3,4], Friedman and Tillich develop a calculus on graphs which provides a strong connection between graph theory and analysis. Their work is based on the fact that graph theory involves two different volume measures, i.e., a "vertex-based" measure and an "edge-based" measure. This approach has
many advantages. It allows the application of many results from analysis directly
to the graph domain.
While the method of Friedman and Tillich leads to the definition of both a
divergence operator and a Laplacian (through the definition of both vertex and


edge Laplacian), it is not exhaustive in the sense that the edge-based eigen-
functions are not fully specified. In a recent study we have fully explored the
eigenfunctions of the edge-based Laplacian and developed a method for explicitly
calculating the edge-interior eigenfunctions of the edge-based Laplacian [5]. This
reveals a connection between the eigenfunctions of the edge-based Laplacian and
both the classical random walk and the backtrackless random walk on a graph.
As an application of the edge-based Laplacian, we have recently presented a new
approach to characterizing points on a non-rigid three-dimensional shape[6].
The wave equation provides a potentially richer characterisation of graphs than the heat equation. Initial work by ElGhawalby and Hancock [7] has revealed some of its potential uses. They have proposed a new approach for embedding graphs on pseudo-Riemannian manifolds based on the wave kernel. However, there are two problems with the rigorous solution of the wave equation: a) we need to compute the edge-based Laplacian, and b) the solution is more complex than that of the heat equation. Recently we [8] have presented a solution of the edge-based
wave equation on a graph. In [9] we have used this solution to define a signature,
called the wave packet signature (WPS) of a graph. In this paper we extend the
idea of WPS to weighted graphs and experimentally demonstrate the properties
of WPS. We perform numerous experiments and demonstrate the performance
of the proposed methods on both weighted and un-weighted graphs.

2 Edge-Based Eigensystem
In this section we review the eigenvalues and eigenfunctions of the edge-based Laplacian [3,5]. Let G = (V, E) be a graph with a boundary ∂G. Let 𝒢 be the geometric realization of G. The geometric realization is the metric space consisting of vertices V with a closed interval of length le associated with each edge e ∈ E. We associate an edge variable xe with each edge that represents the standard coordinate on the edge, with xe(u) = 0 and xe(v) = 1. For our work, it will suffice to assume that the graph is finite with empty boundary (i.e., ∂G = ∅) and le = 1.

2.1 Vertex Supported Edge-Based Eigenfunctions


The vertex-supported eigenpairs of the edge-based Laplacian can be expressed in terms of the eigenpairs of the normalized adjacency matrix of the graph. Let A be the adjacency matrix of the graph G, and à be the row-normalized adjacency matrix, i.e., the (i, j)th entry of à is given as Ã(i, j) = A(i, j)/Σ_{(k,j)∈E} A(k, j). Let (φ(v), λ) be an eigenvector-eigenvalue pair for this matrix. Note that φ(·) is defined on vertices and may be extended along each edge to an edge-based eigenfunction. Let ω² and φ(e, xe) denote the edge-based eigenvalue and eigenfunction. Then the vertex-supported eigenpairs of the edge-based Laplacian are given as follows:
1. For each (φ(v), λ) with λ ≠ ±1, we have a pair of eigenvalues ω² with ω = cos⁻¹λ and ω = 2π − cos⁻¹λ. Since there are multiple solutions to

ω = cos⁻¹λ, we obtain an infinite sequence of eigenfunctions; if ω0 ∈ [0, π] is the principal solution, the eigenvalues are ω = ω0 + 2πn and ω = 2π − ω0 + 2πn, n ≥ 0. The eigenfunctions are φ(e, xe) = C(e) cos(B(e) + ωxe).
2. λ = 1 is always an eigenvalue of Ã. We obtain a principal frequency ω = 0, and therefore φ(e, xe) = C cos(B), so that φ(v) = φ(u) = C cos(B), which is constant on the vertices.

2.2 Edge-Interior Eigenfunctions


The edge-interior eigenfunctions are those eigenfunctions which are zero on vertices and therefore must have a principal frequency of ω ∈ {π, 2π}. Recently we have shown that these eigenfunctions can be determined from the eigenvectors of the adjacency matrix of the oriented line graph [5]. We have shown that the eigenvector corresponding to eigenvalue λ = 1 of the oriented line graph provides a solution in the case ω = 2π. In this case we obtain |E| − |V| + 1 linearly independent solutions. Similarly, the eigenvector corresponding to eigenvalue λ = −1 of the oriented line graph provides a solution in the case ω = π. In this case we obtain |E| − |V| linearly independent solutions. This comprises all the principal eigenpairs which are only supported on the edges.
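As an illustrative sketch (our own, not part of the paper), the vertex-supported principal frequencies can be read off the eigenvalues of the row-normalized adjacency matrix as described in Section 2.1; the edge-interior frequencies coming from the oriented line graph are not computed here.

```python
# Sketch of the vertex-supported principal frequencies of the edge-based
# Laplacian: each eigenvalue lambda of the row-normalized adjacency matrix
# with lambda != +-1 yields omega = arccos(lambda) and 2*pi - arccos(lambda).
import numpy as np

def vertex_supported_frequencies(A, tol=1e-9):
    A = np.asarray(A, dtype=float)
    A_tilde = A / A.sum(axis=1, keepdims=True)        # row-normalized adjacency
    lam = np.linalg.eigvals(A_tilde).real
    freqs = []
    for l in lam:
        if abs(abs(l) - 1.0) < tol:                   # lambda = +-1 handled separately
            continue
        w0 = np.arccos(np.clip(l, -1.0, 1.0))
        freqs.extend([w0, 2 * np.pi - w0])
    return np.sort(np.array(freqs))

# A 4-cycle: normalized adjacency eigenvalues are {1, 0, 0, -1}.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
print(vertex_supported_frequencies(A))   # pi/2 and 3*pi/2, each with multiplicity 2
```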

3 Wave Packet Signatures

Let a graph coordinate X define an edge e and a value x of the standard coordinate on that edge. The eigenfunctions of the edge-based Laplacian are

    φω,n(X) = C(e, ω) cos(B(e, ω) + ωx + 2πnx)

The edge-based wave equation is

    ∂²u/∂t²(X, t) = ΔE u(X, t)

Let W(z) be z wrapped to the range [−1/2, 1/2), i.e., W(z) = z − ⌊z + 1/2⌋. For
the un-weighted graph, we solve the wave equation assuming that the initial
condition is a Gaussian wave packet on a single edge of a graph [9]. The solution
for this case becomes
    u(X, t) = Σ_{ω∈Ωa} [C(ω, e)C(ω, f)/2] { e^{−aW(x+t+μ)²} cos(B(e, ω) + B(f, ω) + ω(x + t + μ + 1/2))
                                           + e^{−aW(x−t−μ)²} cos(B(e, ω) − B(f, ω) + ω(x − t − μ + 1/2)) }
             + (1/(2|E|)) [ (1/4) e^{−aW(x+t+μ)²} + (1/4) e^{−aW(x−t−μ)²} ]
             + Σ_{ω∈Ωc} [C(ω, e)C(ω, f)/4] [ e^{−aW(x−t−μ)²} − e^{−aW(x+t+μ)²} ]
             + Σ_{ω∈Ωc} [C(ω, e)C(ω, f)/4] [ (−1)^⌊x−t−μ+1/2⌋ e^{−aW(x−t−μ)²} − (−1)^⌊x+t+μ+1/2⌋ e^{−aW(x+t+μ)²} ]

where Ωa represents the set of vertex-supported eigenvalues, and Ωb and Ωc represent the sets of edge-interior eigenvalues with principal frequencies π and 2π, respectively.
For a weighted graph, we assume a Gaussian wave packet on every edge of the
graph, whose amplitude is multiplied by the weight of that particular edge, and
solve the wave equation for this case. Let wij be the weight of the edge (i, j).
The solution in this case becomes
    u(X, t) = Σ_{(i,j)∈E} wij Σ_{ω∈Ωa} [C(ω, e)C(ω, f)/2] { e^{−aW(x+t+μ)²} cos(B(e, ω) + B(f, ω) + ω(x + t + μ + 1/2))
                                                           + e^{−aW(x−t−μ)²} cos(B(e, ω) − B(f, ω) + ω(x − t − μ + 1/2)) }
             + (1/(2|E|)) [ (1/4) e^{−aW(x+t+μ)²} + (1/4) e^{−aW(x−t−μ)²} ]
             + Σ_{ω∈Ωc} [C(ω, e)C(ω, f)/4] [ e^{−aW(x−t−μ)²} − e^{−aW(x+t+μ)²} ]
             + Σ_{ω∈Ωc} [C(ω, e)C(ω, f)/4] [ (−1)^⌊x−t−μ+1/2⌋ e^{−aW(x−t−μ)²} − (−1)^⌊x+t+μ+1/2⌋ e^{−aW(x+t+μ)²} ]

To define signatures for both weighted and un-weighted graphs, we use the amplitudes of the waves on the edges of the graph over time. For un-weighted graphs, we assume that the initial condition is a Gaussian wave packet on a single edge of the graph. For this purpose we select the edge (u, v) ∈ E such that u is the highest degree vertex in the graph and v is the highest degree vertex among the neighbours of u. For weighted graphs, we assume a wave packet on every edge whose amplitude is multiplied by the weight of the edge. We define the local signature of an edge as

    WPS(X) = [u(X, t0), u(X, t1), u(X, t2), . . . , u(X, tn)]

Given a graph G, we define its global wave packet signature as

    GWPS(G) = hist(WPS(X1), WPS(X2), . . . , WPS(X|E|))    (1)

where hist(·) is the histogram operator which bins the list of arguments WPS(X1), WPS(X2), . . . , WPS(X|E|).
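A minimal sketch of eq. (1) follows, assuming the per-edge amplitudes u(X, t0), . . . , u(X, tn) have already been computed; the 100-bin setting matches the experiments below, while the value range and the random placeholder amplitudes are assumptions made only for illustration.

```python
# Sketch of the global wave packet signature (GWPS): every edge contributes
# its vector of amplitudes over time, and all values are pooled into one
# fixed-length histogram feature vector.
import numpy as np

def gwps(edge_signatures, n_bins=100, value_range=(-1.0, 1.0)):
    """Global wave packet signature: histogram over all per-edge amplitudes."""
    values = np.concatenate([np.asarray(s).ravel() for s in edge_signatures])
    hist, _ = np.histogram(values, bins=n_bins, range=value_range, density=True)
    return hist

# One signature per edge, each a vector of amplitudes sampled over time
# (random placeholders standing in for solutions of the wave equation).
edge_signatures = [np.random.randn(101) * 0.1 for _ in range(25)]
signature = gwps(edge_signatures)          # 100-dimensional feature vector
```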

4 Experiments

In this section we perform an experimental evaluation of the proposed methods


on different graphs. These graphs are extracted from the images in the Columbia
object image library (COIL) dataset [10]. This dataset contains views of 3D
objects under controlled viewer and lighting conditions. For each object in the database there are 72 equally spaced views. The objective here is to cluster different views of the same object into the same class. To establish a graph on
the images of objects, we first extract feature points from the image. For this
purpose, we use the Harris corner detector [11]. We then construct a Delaunay
triangulation (DT) using the selected feature points as vertices of the graph.
Figure 1(a) shows some of the object views (images) used for our experiments
and Figure 1(b) shows the corresponding Delaunay triangulations.

Fig. 1. COIL objects and their extracted graphs: (a) COIL images, (b) Delaunay triangulation (DT), (c) Gabriel graph (GG), (d) relative neighbourhood graph (RNG)
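A short sketch of this graph-extraction step is given below, assuming SciPy; the corner coordinates are random placeholders standing in for Harris corner detections, so this is an illustration rather than the authors' code.

```python
# Sketch of the graph-extraction step: detected corner points become graph
# vertices and the edges of their Delaunay triangulation become graph edges.
import numpy as np
from scipy.spatial import Delaunay

def delaunay_graph(points):
    """Adjacency matrix of the Delaunay triangulation of 2-D points."""
    tri = Delaunay(points)
    n = len(points)
    A = np.zeros((n, n), dtype=int)
    for simplex in tri.simplices:              # each triangle contributes 3 edges
        for i in range(3):
            a, b = simplex[i], simplex[(i + 1) % 3]
            A[a, b] = A[b, a] = 1
    return A

points = np.random.rand(30, 2)                 # placeholder corner locations
A = delaunay_graph(points)
```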

We compute the wave signature for an edge by taking tmax = 100 and
xe = 0.5. We take t = 20 to allow the wave packet to be distributed over
the whole graph. We then compute the GWPS for the graph by fixing 100 bins
for the histogram. To visualize the results, we have performed principal component
analysis (PCA) on GWPS. PCA is mathematically defined [12] as an orthogo-
nal linear transformation that transforms the data to a new coordinate system
such that the greatest variance by any projection of the data comes to lie on
the first coordinate (called the first principal component), the second greatest
variance on the second coordinate, and so on. Figure 2(a) shows the results of
the embedding of the feature vectors on the first three principal components.
To measure the performance of the proposed method we compare it with
truncated Laplacian, random walk [13] and Ihara coefficients [14]. Figure 2 shows

Fig. 2. Embedding of the feature vectors onto the first three principal components: (a) WPS, (b) Random Walk, (c) Ihara Coefficients



the embedding results for different methods. To compare the performance, we


cluster the feature vectors using k-means clustering [15]. k-means clustering is
a method which aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean. We compute the Rand index [16] of these clusters, which is a measure of the similarity between two data clusterings. The Rand indices for these methods are shown in Table 1. It is
clear from the table that the proposed method can classify the graphs with
higher accuracy.
Table 1. Experimental results (Rand indices) on the COIL dataset

Method DT GG RNG
Wave Kernel Signature 0.9965 0.9511 0.8235
Random Walk Kernel 0.9526 0.9115 0.8197
Ihara Coefficients 0.9864 0.8574 0.7541

We now compare the performance of the proposed method on Gabriel graphs


(GG) and relative neighbourhood graphs (RNG) extracted from the same COIL
dataset. The Gabriel graph for a set of n points is a subset of the Delaunay triangulation which connects two data points vi and vj for which there is no other point vk inside the open ball whose diameter is the edge (vi, vj). The relative neighbourhood graph is also a subset of the Delaunay triangulation. In this case a lune
is constructed on each Delaunay edge. The circles enclosing the lune have their
centres at the end-points of the Delaunay edge; each circle has a radius equal to
the length of the edge. If the lune contains another node then its defining edge
is pruned from the relative neighbourhood graph. Figure 1(c) and 1(d) show the
GG and RNG of the corresponding COIL object of Figure 1(a) respectively.
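For illustration, the Gabriel-graph pruning described above can be sketched as follows (assuming the Delaunay adjacency matrix from the previous sketch as input; this is our own stand-in, not the authors' implementation).

```python
# Sketch of pruning a Delaunay triangulation down to a Gabriel graph: an edge
# (i, j) survives only if no other point lies inside the open ball whose
# diameter is that edge.
import numpy as np

def gabriel_graph(points, A_delaunay):
    points = np.asarray(points, dtype=float)
    A = A_delaunay.copy()
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            if not A[i, j]:
                continue
            mid = (points[i] + points[j]) / 2
            radius = np.linalg.norm(points[i] - points[j]) / 2
            others = np.delete(np.arange(n), [i, j])
            if np.any(np.linalg.norm(points[others] - mid, axis=1) < radius):
                A[i, j] = A[j, i] = 0          # another point inside the ball
    return A

# Example (using delaunay_graph from the previous sketch):
# A_gg = gabriel_graph(points, delaunay_graph(points))
```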
The purpose of comparing the performance on GG and RNG is twofold. First,
since both the GG and RNG are subsets of DT, it allows us to analyze the
performance of the proposed method under controlled structure modification.
Second, since both GG and RNG reduce the frequency of cycles of smaller length
and introduce branches in the graph, it allows us to analyze the performance of
the proposed method on non-cyclic graphs. We compute the performance on GG
and RNG in the same way as we did for DT. The visual results of the proposed
method on GG and RNG are shown in Figure 3(a) and Figure 3(b) respectively.
Table 1 compares the performance of the three methods, which shows that the
proposed method performs well under controlled structural modification. Note
that the drop in the performance of the Ihara coefficients is due to the fact that the
Ihara coefficients cannot provide a good measure of similarity for the graphs
when branches are present.
We now compare the performance of the proposed WPS on weighted graphs.
For this purpose, we have selected the same objects from the COIL dataset. We
have extracted the Gabriel graphs for each of these views. The edges are weighted with the exponential of the negative distance between two connected vertices, i.e. wij = exp[−k‖xi − xj‖], where xi and xj are the coordinates of corner points i and j in an image and k is a scalar scaling factor. Figure 4(a) shows the clustering result of WPS, while Figure 4(b) shows the clustering result of the truncated Laplacian. To

Fig. 3. Clustering results: (a) GG, (b) RNG

Fig. 4. Clustering results on weighted graphs: (a) WPS, (b) truncated Laplacian

compare the performance we have computed the Rand indices for both methods. The Rand index for WPS is 0.9931, while for the truncated Laplacian it is 0.8855.
Finally we look at the characteristics of the proposed WPS. The histogram distribution of the WPS closely follows a Gaussian distribution. Figure 5 shows the distribution of the WPS of a single view of 3 different objects in the COIL dataset and a Gaussian fit for each signature. Figures 6(a) and 6(b) show the values of the standard deviation, for the DT and GG respectively, of all 72 views of 4 different objects of the COIL dataset. Table 2 shows the mean value of the standard deviation and the standard error for each of the 4 objects.

Fig. 5. Gaussian Fit

Fig. 6. Standard deviation: (a) DT, (b) GG

Table 2. Average value of standard deviation

           Standard Deviation   Standard Error
Object 1   0.1400               1.54 × 10⁻³
Object 2   0.0989               6.57 × 10⁻⁴
Object 3   0.0793               5.64 × 10⁻⁴
Object 4   0.0685               4.07 × 10⁻⁴

5 Conclusion and Future Work


In this paper we have used the solution of the wave equation on a graph to
characterize both weighted and un-weighted graphs. The wave equation is solved
using the edge-based Laplacian of a graph. The advantage of using the edge-based
Laplacian over the vertex-based Laplacian is that it allows the direct application of many results from analysis to the graph-theoretic domain. In the future our goal is to
use the solution of other equations defined using the edge-based Laplacian for
defining local and global signatures for graphs.

References
1. Xiao, B., Yu, H., Hancock, E.R.: Graph matching using manifold embedding. In:
Campilho, A.C., Kamel, M.S. (eds.) ICIAR 2004. LNCS, vol. 3211, pp. 352–359.
Springer, Heidelberg (2004)
2. Wilson, R.C., Hancock, E.R., Luo, B.: Pattern vectors from algebraic graph theory.
IEEE Trans. Pattern Anal. Mach. Intell. 27, 1112–1124 (2005)
3. Friedman, J., Tillich, J.P.: Wave equations for graphs and the edge based laplacian.
Pacific Journal of Mathematics, 229–266 (2004)
4. Friedman, J., Tillich, J.P.: Calculus on graphs. CoRR (2004)
5. Wilson, R.C., Aziz, F., Hancock, E.R.: Eigenfunctions of the edge-based laplacian
on a graph. Linear Algebra and its Applications 438, 4183–4189 (2013)
6. Aziz, F., Wilson, R.C., Hancock, E.R.: Shape signature using the edge-based lapla-
cian. In: International Conference on Pattern Recognition (2012)
7. ElGhawalby, H., Hancock, E.R.: Graph embedding using an edge-based wave ker-
nel. In: Hancock, E.R., Wilson, R.C., Windeatt, T., Ulusoy, I., Escolano, F. (eds.)
SSPR & SPR 2010. LNCS, vol. 6218, pp. 60–69. Springer, Heidelberg (2010)
8. Aziz, F., Wilson, R.C., Hancock, E.R.: Gaussian wave packet on a graph. In:
Kropatsch, W.G., Artner, N.M., Haxhimusa, Y., Jiang, X. (eds.) GbRPR 2013.
LNCS, vol. 7877, pp. 224–233. Springer, Heidelberg (2013)
9. Aziz, F., Wilson, R.C., Hancock, E.R.: Graph characterization using gaussian
wave packet signature. In: Hancock, E., Pelillo, M. (eds.) SIMBAD 2013. LNCS,
vol. 7953, pp. 176–189. Springer, Heidelberg (2013)
10. Murase, H., Nayar, S.K.: Visual learning and recognition of 3-D objects from ap-
pearance. International Journal of Computer Vision 14, 5–24 (1995)
11. Harris, C., Stephens, M.: A combined corner and edge detector. In: Fourth Alvey
Vision Conference, Manchester, UK, pp. 147–151 (1988)

12. Jolliffe, I.T.: Principal component analysis. Springer, New York (1986)
13. Gärtner, T., Flach, P.A., Wrobel, S.: On graph kernels: Hardness results and ef-
ficient alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003.
LNCS (LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
14. Ren, P., Wilson, R.C., Hancock, E.R.: Graph characterization via Ihara coefficients.
IEEE Tran. on Neural Networks 22, 233–245 (2011)
15. MacQueen, J.B.: Some methods for classification and analysis of multivariate ob-
servations, vol. 1, pp. 281–297. University of California Press (1967)
16. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal
of the American Statistical Association 66, 846–850 (1971)
Hearing versus Seeing Identical Twins

Li Zhang, Shenggao Zhu, Terence Sim, Wee Kheng Leow,


Hossein Najati, and Dong Guo

School of Computing
National University of Singapore
Singapore, 117417
{lizhang,shenggao,tsim,leowwk}@comp.nus.edu.sg,[email protected]

Abstract. Identical twins pose a great challenge to face recognition sys-


tems due to their similar appearance. Nevertheless, even though twins
may look alike, we believe they speak differently. Hence we propose to
use their voice patterns to distinguish between twins. Voice is a natural
signal to produce, and it is a combination of physiological and behavioral
biometrics, therefore it is suitable for twin verification. In this paper, we
collect an audio-visual database from 39 pairs of identical twins. Three
types of typical voice features are investigated, including Pitch, Linear
Prediction Coefficients (LPC) and Mel Frequency Cepstral Coefficients
(MFCC). For each type of voice feature, we use Gaussian Mixture Model
to model the voice spectral distribution of each subject, and then em-
ploy the likelihood ratio of the probe belonging to different classes for
verification. The experimental results on this database demonstrate a
significant improvement by using voice over facial appearance to distin-
guish between identical twins. Furthermore, we show that by fusing both
types of biometrics, recognition accuracy can be improved.

Keywords: identical twins, verification, fusion, Gaussian Mixture Model.

1 Introduction
According to the statistics in [1], the twin birth rate has risen from 17.8 to 32.2 per 1000 births, with an average 3% growth per year since 1990. This increase is associated with the increasing usage of fertility therapies and the change of birth concept. Nowadays women tend to bear children at an older age and are more likely than younger women to conceive multiples spontaneously, especially in developed countries [2]. Although currently identical twins still only represent a
minority (0.2% of the world’s population), it is worth noting that the total num-
ber of identical twins is equal to the whole population of countries like Portugal
or Greece. This, in turn, has created an urgent demand for biometric systems
that can accurately distinguish between identical twins. Identical twins share the
same genetic code, therefore they look very alike. This poses a great challenge to
current biometric systems, especially face recognition system. The challenge us-
ing facial appearance to distinguish between identical twins has been verified by
Sun et al. [2] on 93 pairs of twins using a commercial face matcher. Nevertheless,


some biometrics depend not only on the genetic signature but also on the individ-
ual development in the womb. Some researchers have explored the possibility of using behavioral differences, such as expressions and head motion [3], to distinguish between identical twins. Zhang et al. [3] proposed to use an exception reporting model to model head motion abnormality to differentiate twins. They reported a verification accuracy of over 90%, but their algorithm was very sensitive to the consistency of subject behavior and relied strongly on an accurate tracking algorithm. Several researchers showed encouraging results using fingerprint [4,2], palmprint [5], ear [6] and iris [7,2] to distinguish between identical twins. For example, the equal error rate for 4-finger fusion reported by Sun et al. [2] was 0.49, and the equal error rate for 2-iris fusion was also 0.49. Despite the discriminating ability of these biometrics, they require the cooperation of the subject. Therefore, it is desirable to identify twins in a natural way. In this paper, we propose to utilize the voice biometric to distinguish between identical twins and compare the voice biometric with facial appearance. Voice is non-intrusive and natural; it does not require explicit cooperation of the subject and is widely available from videos captured by ordinary camcorders. To the best of our knowledge, we are the first to investigate voice and appearance biometrics at the same time.
Voice signal usually conveys several levels of information. Primarily, voice sig-
nal conveys the words or message being spoken, but on a secondary level, it also
conveys information about the identity of the speaker [8]. Voice biometric tries
to extract the identity information from the voice and uses it for speaker recogni-
tion. Generally speaking, speaker recognition can be divided into two specific tasks: speaker verification and speaker identification. In speaker verification, the goal is to establish whether a person is who he/she claims to be, whereas in speaker identification, the goal is to determine the identity (name or employee number) of the unknown speaker. In either task the speech can be further divided into text dependent (i.e., the speaker is required to say the same phrase) and text independent (i.e., the speaker can say different phrases). Douglas et al. [8] and Sinith et al. [9] proposed to use Mel Frequency Cepstral Coefficients and Gaussian Mixture Models to solve the text-independent identification problem for the general population, i.e., non-twins. Dupont et al. [10] and Dean et al. [11] used hidden Markov models to model the distribution of the speaker's spectral shape from voice samples and determined the identity by maximum likelihood over the posterior probabilities of the different classes. Both these works demonstrated that the identity of a speaker can be well recognized via their voice under the condition that the voice samples are of good quality and the gallery size is small, i.e., the number of subjects is small. This conclusion, in turn, brings new hope for using the voice biometric to differentiate identical twins, because to distinguish between identical twins the number of involved subjects is very small, i.e., the number of twin siblings.

1. Can voice be used to distinguish between identical twins? Is it bet-


ter than appearance based approach? If it is, which voice feature
is the best for identical twins?


      
  
     
       
  
 

Fig. 1. Flowchart of twin verification using voice: speech signal → preprocessing → frames → feature extraction → feature vectors → speaker verification

2. Can we combine facial appearance with speech to improve


accuracy?
Our work can be divided into three parts: 1) we first collect a twin audio-visual database with 39 pairs of identical twins and test the discriminating ability of facial appearance to distinguish between identical twins by using Eigenface [12], Local Binary Patterns [13] and Linear Discriminant Analysis on Gabor wavelet features (Gabor) [14]. 2) We propose to use a Gaussian Mixture Model to estimate the spectral shape of each twin subject, and then use the ratio of the probabilities of belonging to different twin subjects for verification. Three types of voice features are used: Pitch, LPC and MFCC. 3) We use confidence-level
fusion to combine the Gabor and MFCC to improve accuracy.

2 Twin Verification Using GMM


2.1 Preprocessing and Feature Extraction
The pipeline of our twin verification approach is shown in Figure 1. The first step of preprocessing is framing, which divides the audio into successive overlapping frames. The frame size is set to 23 milliseconds in our work, with 50% overlap. The energy in the high frequencies is boosted in each frame to compensate for the nonlinear nature of the human voice, in which more energy is located at lower frequencies. A Hamming window is utilized to smooth out the discontinuities at the beginning and the end of each frame. Since silent frames may exist in the speech signal, we filter out these frames using a simple thresholding method: for each frame we estimate the probability θ that it contains human voice; if θ is larger than the threshold, we keep the frame, otherwise we discard it. In our experiments, we set the threshold to 0.4.
After preprocessing, various acoustic features can be extracted from the frames. We select three kinds of features for testing and comparison purposes: Pitch [15], Linear Prediction Coefficients (LPC) [16], and Mel Frequency Cepstral Coefficients (MFCC) [17]. Pitch is a perceptual property of the voice that allows ordering on a frequency-related scale. MFCC maps the powers of the frame spectrum onto the mel scale and then uses the amplitudes of the discrete cosine transform of the mel-scale energies as features. LPC consists of the coefficients of linear predictive coding computed from the frames. In our work, the number of MFCC coefficients is set to 13 and the predictor order (i.e., the number of LPC coefficients) is set to 8.
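To make the preprocessing chain concrete, the following is a minimal sketch in Python/NumPy of framing, pre-emphasis, Hamming windowing and silence removal. The sampling rate, the pre-emphasis coefficient (0.97) and the energy-based voicing score are our own illustrative assumptions; the paper only fixes the 23 ms frame size, the 50% overlap and the 0.4 threshold.

```python
import numpy as np

def preprocess(signal, sr, frame_ms=23, overlap=0.5, preemph=0.97, vad_thresh=0.4):
    """Split a speech signal into pre-emphasised, Hamming-windowed frames
    and discard (near-)silent frames using a simple energy-based score."""
    # Pre-emphasis boosts high frequencies to compensate for the spectral tilt of speech.
    emph = np.append(signal[0], signal[1:] - preemph * signal[:-1])

    frame_len = int(sr * frame_ms / 1000.0)
    hop = int(frame_len * (1.0 - overlap))
    window = np.hamming(frame_len)

    frames, energies = [], []
    for start in range(0, len(emph) - frame_len + 1, hop):
        frame = emph[start:start + frame_len]
        frames.append(frame * window)
        energies.append(np.sum(frame ** 2))

    frames = np.array(frames)
    energies = np.array(energies)
    # Energy normalised to [0, 1] acts as a crude "probability of voice" score;
    # frames scoring below the threshold (0.4 in the paper) are dropped.
    score = energies / (energies.max() + 1e-12)
    return frames[score >= vad_thresh]
```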

2.2 Modeling Using GMM


For each subject, his/her identity-dependent acoustic spectral distribution is
modeled as a weighted sum of M component densities given by the equation


p(x) = \sum_{i=1}^{M} w_i \, b_i(x) \qquad (1)

where x is the D-dimensional feature vector (in our case Pitch, LPC or MFCC), b_i(x) is the component density and w_i is the mixture weight. Each
component density is represented as a Gaussian distribution of the form
b_i(x) = \frac{1}{(2\pi)^{D/2} |\Delta_i|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i)^{\top} \Delta_i^{-1} (x - \mu_i) \right\} \qquad (2)
with mean vector μi and covariance matrix Δi . The sum of mixture weights wi
equals 1. For convenience, we denote the mean vectors, covariance matrices and
mixture weights as Γ , where Γ = {wi , μi , Δi }, i = 1, ..., M . Therefore, each
speaker is represented by his/her model Γ .
Given the training data in the gallery, we use Expectation Maximization al-
gorithm [18] to estimate the Γ for each subject. In the verification phase, given a
test feature vector, ψ, and the hypothesized speaker S, we aim to check whether
the hypothesized identity is the same as the classified identity. We state this task as a
basic hypothesis test between two hypotheses:
H0: ψ is from the hypothesized twin speaker S.
H1: ψ is not from the hypothesized speaker S (i.e. ψ is from the twin sibling
of hypothesized speaker S).
The optimum classification to decide between these two hypotheses is through
the likelihood ratio (LR) given by

LR = \frac{p(\psi \mid H0)}{p(\psi \mid H1)} \qquad (3)

If LR is larger than a pre-set threshold, we accept H0; otherwise, we reject H0. Here, p(ψ|H0) is the probability density function for the hypothesized subject S given the observed feature vector ψ, and p(ψ|H1) is the probability density function for not being the hypothesized subject S given the observed feature vector ψ.
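A minimal sketch of this verification rule, using scikit-learn's GaussianMixture as the density estimator; diagonal covariances, the number of components and the decision threshold are placeholders, and modelling H1 by the GMM of the twin sibling follows the hypothesis formulation above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, n_components=5):
    """Fit a diagonal-covariance GMM (eqs. 1-2) to the gallery features of one subject."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=200, random_state=0)
    gmm.fit(features)  # EM estimation of weights, means and covariances
    return gmm

def verify(probe_features, gmm_claimed, gmm_sibling, threshold=0.0):
    """Log-likelihood-ratio test (eq. 3): accept H0 if the probe is more likely
    under the claimed speaker's model than under the twin sibling's model."""
    llr = (gmm_claimed.score_samples(probe_features).sum()
           - gmm_sibling.score_samples(probe_features).sum())
    return llr > threshold, llr
```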

3 Experiments
3.1 Data and Performance Evaluation
We collected a twins audio-visual database at the Sixth Mojiang International
Twins Festival held on 1 May 2010 in China. It includes Chinese, Canadian and Russian subjects, for a total of 39 pairs of twins; several examples can be seen in Figure 2. For each subject, there are at least three audio recordings, each around 30 seconds long. The talking content of these recordings differs: for the first recording, the subjects are required to count from one to ten; for the second recording, the subjects read a paragraph; for the third recording, the subjects recite a poem.

[Fig. 2. Some image examples of identical twins]
The twin verification performance is evaluated in terms of the Twin Equal Error Rate (Twin-EER), the operating point at which the Twin False Accept Rate (Twin-FAR) meets the False Reject Rate (FRR). The Twin-FAR is the ratio of the number of times a twin impostor is accepted as genuine to the total number of impostor attempts. The FRR is the ratio of the number of times a genuine subject is rejected as an impostor to the total number of genuine attempts. We also introduce the General Equal Error Rate (General-EER), at which the General False Accept Rate (General-FAR) meets the FRR; the General-FAR is the ratio of the number of times a non-twin impostor is accepted as genuine to the total number of non-twin impostor attempts. The purpose of introducing the General-FAR is to compare the verification accuracy on twins with that on non-twins, to show the challenge brought by twins.
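The equal error rates can be obtained by sweeping a decision threshold over genuine and impostor scores until the two error rates meet; the sketch below assumes that higher scores indicate genuine trials.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep all candidate thresholds and return the EER, i.e. the operating
    point where the false accept rate meets the false reject rate."""
    genuine_scores = np.asarray(genuine_scores)
    impostor_scores = np.asarray(impostor_scores)
    thresholds = np.sort(np.unique(np.concatenate([genuine_scores, impostor_scores])))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted as genuine
        frr = np.mean(genuine_scores < t)    # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Twin-EER uses twin siblings as impostors; General-EER uses non-twin subjects.
```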

3.2 Performance of Appearance and Audio Based Approach


We chose three traditional facial appearance approaches, Eigenface, Local Binary Patterns and Gabor, to test the ability of appearance to distinguish between identical twins. For each twin subject, we randomly select 8 images. The images are registered by the eye positions detected by STASM [19] and resized to 160 by 128 pixels. For Eigenface, we vectorize the gray intensity of each pixel as the feature and perform PCA to reduce the dimension. For LBP, we divide the image into 80 blocks and extract a 59-bin histogram from each block. For Gabor, we use 40 Gabor filters (5 scales, 8 orientations) and set the kernel size of each Gabor filter to 17 by 17. A PCA is performed to reduce the feature dimension for LBP and Gabor. The experimental result is shown in Figure 3(a). From this figure, we can see that identical twins indeed pose a great challenge to appearance-based approaches. The General-EER of Gabor for the general population is around 0.122, while the Twin-EER is significantly larger, above 0.33. We can also see that there is no large difference between Intensity, LBP and Gabor for twin verification: their Twin-EERs are 0.352 (Intensity), 0.340 (LBP) and 0.338 (Gabor), respectively.
[Fig. 3. Performance comparison between facial appearance and voice biometrics: (a) appearance accuracy (Twin FAR vs. FRR for Intensity, LBP and Gabor, plus General FAR vs. FRR for Gabor), (b) voice accuracy (Twin FAR vs. FRR for Pitch, LPC and MFCC).]

For voice-based twin verification, we use one of the audio recordings as the gallery to train the GMM for each subject. The remaining audio recordings are used as probes; each recording is divided into three parts, and each part acts as a single probe. During GMM training, the covariance matrices are assumed to be diagonal and the number of Gaussians is set to 4 for Pitch, 4 for LPC and 5 for MFCC; the number of Gaussians is optimized on the test set for better performance. The experimental result is shown in Figure 3(b). Compared with Figure 3(a), it can clearly be seen that twins can be distinguished better via voice than via appearance. The Twin-EER for MFCC is 0.171, which is significantly better than appearance (the best appearance result is 0.338). However, not all voice features are better than appearance: the Twin-EERs of Pitch (0.394) and LPC (0.366) are even larger than those of the appearance-based approaches. This shows that Pitch and LPC are not discriminative enough for twins.
Moreover, based on the experimental results in [10], the General-EER for speaker verification on the general population is around 0.05, which is much smaller than the best result (0.171) on the twins database. The difference may come from three aspects: 1) insufficient training data in our experiments: we only use one audio recording of around 30 seconds for training, and the talking content is very simple and sometimes duplicated, so it may not cover the entire voice spectral pattern; 2) overlap in the voice spectral patterns of identical twins: identical twins share the same genetic code, so their voices may share some similarity; 3) recording conditions: our audio recordings were not collected in a very clean environment, and the background sound may also degrade our performance, whereas the General-EER reported in [10] was obtained in a clean recording room.

4 Fusion of Gabor and MFCC


In this section, we combine appearance and speech to improve the twin recognition accuracy. We choose Gabor to represent the appearance feature and MFCC to represent the voice feature; the reason for this choice is straightforward, because these two features performed the best in their respective categories in our previous experiments.

[Fig. 4. Performance of the fusion of Gabor and MFCC: (a) fusion of Gabor and MFCC, (b) zoom-in result for (a); Twin FAR vs. FRR curves for MFCC and the fused system.]

In multimodal systems, there are three levels of
fusion when combining two biometrics. The first is fusion at the feature extraction level: the features of the two biometric modalities are combined into a new feature. The second is fusion at the confidence level: each biometric provides a similarity score, and these scores are combined to assert the veracity of the claimed identity. The third is fusion at the decision level: each biometric makes its own decision, and the final decision is made based on those decisions.
In our proposal, we use the second fusion strategy. Given a probe and a claimed identity, we compute the Euclidean distance of the Gabor features, denoted GD, and the likelihood ratio against the claimed identity, denoted LR in Eq. (3), separately. The final similarity FS is computed as the weighted sum of GD and LR, i.e. FS = αGD + (1 − α)LR. Then, we compare FS against a pre-set threshold: if FS exceeds it, we accept; otherwise we reject. We conducted the experiments on the whole database, and the performance is shown in Figure 4. From this figure, we can see that when α is set to 0.415, the fusion of Gabor and MFCC decreases the Twin-EER from 0.171 (MFCC) to 0.160. We set α to the value giving the best test performance on our dataset.
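The confidence-level fusion can be sketched as follows. Note that GD is a distance (smaller means more similar) while LR is a likelihood ratio (larger means more similar), so the sketch normalises and flips the Gabor distances before the weighted sum; this min-max normalisation is our own assumption and is not specified above.

```python
import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_scores(gabor_distances, likelihood_ratios, alpha=0.415):
    """Weighted-sum fusion FS = alpha * GD + (1 - alpha) * LR, with both scores
    brought to a common range and polarity (assumption for illustration)."""
    gd = 1.0 - minmax(gabor_distances)   # turn distances into similarities
    lr = minmax(likelihood_ratios)
    return alpha * gd + (1.0 - alpha) * lr

# accept = fuse_scores(gd_all, lr_all) > epsilon   # epsilon: pre-set threshold
```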

5 Conclusion and Future Work

In this work, we collected a moderate-sized identical twins database including appearance and voice. We propose to use a Gaussian Mixture Model to model the voice spectral pattern for verification. The results verify that voice biometrics can be used to distinguish between identical twins and that it is significantly better than traditional facial appearance features, including Eigenface, LBP and Gabor; among the various voice features, MFCC has the most discriminating ability. We further show that the accuracy can be improved via fusion of the voice biometric and facial appearance.
In future work, we would like to test the robustness of our voice proposal with respect to the length of the training data and environmental noise. Even though our current result is very promising, we still hope to collect a larger twin database for our research. We also intend to test the scalability of our voice proposal. Finally, we look forward to building a multimodal biometric system which not only works well for the general population but can also prevent the evil-twin attack.

References
1. Martin, J., Kung, H., Mathews, T., Hoyert, D., Strobino, D., Guyer, B., Sutton,
S.: Annual summary of vital statistics: 2006. Pediatrics (2008)
2. Sun, Z., Paulino, A., Feng, J., Chai, Z., Tan, T., Jain, A.: A study of multibiometric
traits of identical twins. SPIE (2010)
3. Zhang, L., Ye, N., Marroquin, E.M., Guo, D., Sim, T.: New hope for recognizing
twins by using facial motion. In: WACV, pp. 209–214. IEEE (2012)
4. Jain, A., Prabhakar, S., Pankanti, S.: On the similarity of identical twin finger-
prints. Pattern Recognition, 2653–2663 (2002)
5. Kong, A., Zhang, D., Lu, G.: A study of identical twins’ palmprints for personal
verification. Pattern Recognition, 2149–2156 (2006)
6. Nejati, H., Zhang, L., Sim, T., Martinez-Marroquin, E., Dong, G.: Wonder ears:
Identification of identical twins from ear images. In: ICPR, pp. 1201–1204 (2012)
7. Daugman, J., Downing, C.: Epigenetic randomness, complexity and singularity of
human iris patterns. Proceedings of the Royal Society of London, 1737 (2001)
8. Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification us-
ing gaussian mixture speaker models. IEEE Transactions on Speech and Audio
Processing 3(1), 72–83 (1995)
9. Sinith, M., Salim, A., Gowri Sankar, K., Sandeep Narayanan, K., Soman, V.: A
novel method for text-independent speaker identification using mfcc and gmm. In:
ICALIP, pp. 292–296. IEEE (2010)
10. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recog-
nition. IEEE Transactions on Multimedia 2(3), 141–151 (2000)
11. Dean, D., Sridharan, S., Wark, T.: Audio-visual speaker verification using contin-
uous fused hmms. In: Proceedings of the HCSNet workshop, pp. 87–92 (2006)
12. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: CVPR,
pp. 586–591. IEEE (1991)
13. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns.
In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481.
Springer, Heidelberg (2004)
14. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher
linear discriminant model for face recognition. IEEE Transactions on Image pro-
cessing 11(4), 467–476 (2002)
15. Zatorre, R.J., Evans, A.C., Meyer, E., Gjedde, A.: Lateralization of phonetic and
pitch discrimination in speech processing. Science 256(5058), 846–849 (1992)
16. Atal, B.S., Hanauer, S.L.: Speech analysis and synthesis by linear prediction of the
speech wave. The Journal of the Acoustical Society of America 50, 637 (1971)
17. Logan, B., et al.: Mel frequency cepstral coefficients for music modeling. In: Inter-
national Symposium on Music Information Retrieval, vol. 28, p. 5 (2000)
18. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the em algorithm. Journal of the Royal Statistical Society, 1–38 (1977)
19. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape
model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS,
vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
Voting Strategies
for Anatomical Landmark Localization
Using the Implicit Shape Model

Jürgen Brauer, Wolfgang Hübner, and Michael Arens

Fraunhofer IOSB, Ettlingen / Germany


{juergen.brauer,wolfgang.huebner,michael.arens}@iosb.fraunhofer.de

Abstract. We address the problem of anatomical landmark localiza-


tion using monocular camera information only. For person detection the
Implicit Shape Model (ISM) is a well known method. Recently it was
shown that the same local features that are used to detect persons, can
be used to give rough estimates for anatomical landmark locations as
well. However, the landmark localization accuracy of the original ISM is
far from optimal. We show that a direct application of the
ISM to the problem of landmark localization leads to poorly localized
vote distributions. In this context, we propose three alternative voting
strategies which include the use of a reference point, a simple observa-
tion vector filtering heuristic, and an observation vector weight learning
algorithm. These strategies can be combined in order to further increase
localization accuracy. An evaluation on the UMPM benchmark shows
that these new voting strategies are able to generate compact and mono-
tonically decreasing vote distributions, which are centered around the
ground truth location of the landmarks. As a result, the ratio of correct
votes can be increased from only 9.3% for the original ISM up to 42.1%
if we combine all voting strategies.

Keywords: human pose estimation, anatomical landmark localization,


Implicit Shape Model.

1 Introduction
The localization of anatomical landmarks (e.g. hands or hip center) is an essential
preprocessing step in many action recognition approaches. Sliding window-based
person detectors (Dollar et al. [3] provides a survey) are often used for the
initial person detection step. Local feature based methods (Leibe et al. [6]) have
a clear advantage over sliding window based methods in cases where persons
are occluded. For both person detection approaches there are corresponding
anatomical landmark detection approaches. Bourdev and Malik [2] trained SVM
classifiers for body part classification using sets of example image patches that
correspond to a similar 2D (or 3D) pose (called ’poselets’). A multiscale sliding-
window is run over the image and each window is classified by the SVMs, which
is computationally demanding. Müller and Arens [7] reuse local features from an
ISM person detection step to vote for landmark locations which is more efficient

in terms of computation time. This approach was recently adopted by Girshick


et al. [4] for estimating 3D human poses using depth information provided by
the Kinect: instead of assigning each pixel a unique landmark classification as in the original approach (Shotton et al. [8]), a Hough forest is used, where a set of votes is stored at each leaf to directly vote for the location of the landmark. Here we adopt the approach from [7] for landmark localization, i.e. for each landmark we are interested in localizing, a single ISM is learned. The contribution of this paper is to provide a set of new voting strategy alternatives in section 2 which allow a much better localization of landmarks. Further, we provide
an evaluation on the UMPM benchmark of each of the new voting strategies and
their combination in section 3.

2 Voting Strategies
Original Implicit Shape Model Voting (ORIG-VOT). The basic idea be-
hind the Implicit Shape Model (ISM) (Leibe et al. [6]) is to learn the spatial
relationship between local features and an object using training data. For a new
image, local features are used to vote for possible object locations according
to the learned spatial relationship. More formally, an ISM I = (C, P) consists
of a set C (codebook) of prototypical image structures wi (visual words, code-
words) together with a set P of 3D probability distributions P = {P1 , ..., P|C| }.
Pi are 3D distributions, which specify where a visual word wi is typically lo-
cated on the object and at which feature scale. Leibe et al. [6] represent these
probability distributions in a non-parametric manner by collecting a set Oi =
{oj = (Δxj , Δyj , sj ) : j = 1, ..., Qi } of sample observation vectors oj that en-
code where (Δxj , Δyj ) the object center was observed relative to a local feature
(matching to word wi ) and at which scale sj the feature appeared. In Lehmann
et al. [5] this non-parametric representation of observation vectors is replaced
by Gaussian Mixture Models. In the object detection phase, a set of local fea-
tures F = {fk = (fx , fy , fs , d, wi ) : 1 ≤ k ≤ K} is computed, where fk
is a local feature detected at keypoint location (fx , fy ) at scale fs with corre-
sponding descriptor vector d which matches best to word wi . According to the
list of previously learned observation vectors Oi this feature now casts a vote
v = (vr , vx , vy , vs ) according to each previously stored observation vector oj ,
where the vote location, scale, and weight vr is computed by:

1 fs fs fs
vr = P (wi |d) vx = fx + Δx vy = fy + Δy vs = (1)
|Oi | sj sj sj
where P (wi |d) is the probability that descriptor vector d matches to word wi .
With v3 = (vx , vy , vs ) we denote the 3D vote space location, v2 = (vx , vy ) the
corresponding 2D image vote location, and V denotes the set of all votes cast
by all features. Object instances are then detected by identifying clusters of high
vote density in the 3D vote space using a Mean-Shift search.
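A compact sketch of the vote casting of eq. (1); the data structures (per-word arrays of observation vectors) are simplified for illustration.

```python
import numpy as np

def cast_votes(features, observations, p_word_given_desc):
    """features: list of (fx, fy, fs, word_id); observations[word_id]: array with
    rows (dx, dy, s_j); p_word_given_desc: P(w_i | d) for each feature."""
    votes = []  # rows: (weight v_r, v_x, v_y, v_s)
    for (fx, fy, fs, wi), p in zip(features, p_word_given_desc):
        obs = observations[wi]
        if len(obs) == 0:
            continue
        w = p / len(obs)  # v_r = P(w_i|d) / |O_i|
        for dx, dy, sj in obs:
            scale = fs / sj
            votes.append((w, fx + scale * dx, fy + scale * dy, scale))
    return np.array(votes)
```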
Reference-Point Voting (RP-VOT). For human pose estimation, we first
have to detect persons in the image. The idea of RP-VOT is to exploit knowledge
about the person detection location in the landmark localization process. The
motivation goes back to the observation that many visual words can appear at
very different locations on the human body. For this, it is helpful to include the
knowledge about the location where this image structure is observed relative to a
reference point (here: person center). For RP-VOT we modify the original voting
procedure such that votes are only cast if the word appears in the detection
phase at a similar location relative to the reference point, as in training. We need
a description of the location of the word relative to the reference point which is
independent from the person’s appearance size in the image. More formally, we
augment the observation vectors oj = (Δxj , Δyj , sj , h1 ) such that we record
also at which size h1 (in pixels) we observed the person during training. The word
location relative to the object center can then be represented in person height
units a = (−Δxj /h1 , −Δyj /h1 ) and can be compared with the word location
b = ((fx −Rx )/h2 , (fy −Ry )/h2 ) relative to the reference point (Rx , Ry ) during
testing where we estimate the person’s height to be h2 . For each feature f and
observation vector oj we then cast a vote only, if their location difference is
below some threshold, i.e. \|a - b\|_2 < θ. Here we use θ = 0.05, which means
that we use the observation vector only if the word’s location distance between
training and testing is below 5% of the person’s height. For the person size
estimate we experimented with two approaches: (i) estimating the size from the
person’s bounding box height, which is a plausible estimate, if the person is
upright standing and (ii) estimating the size from the set of local features within
the person’s bounding box, which is a better choice, if we expect poses, where
the person also will show non upright standing poses, as e.g. bending down, or
if the person is partially occluded. For estimating the person height from local
features we used an ISM as well, where each local feature casts votes for the
person height (1D vote space) and the final height estimate is found by applying
a 1D Mean Shift with a 1D Gaussian kernel.
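The RP-VOT criterion can be implemented as a small guard in front of the vote casting sketched above; the reference point R, the training person height h1 and the estimated test person height h2 are assumed to be given, and θ = 0.05 follows the text.

```python
import numpy as np

def rp_filter(fx, fy, dx, dy, h_train, ref_point, h_test, theta=0.05):
    """Keep an observation vector only if the word appears at a similar position
    relative to the person centre as it did during training (in height units)."""
    a = np.array([-dx / h_train, -dy / h_train])             # training-time location
    b = np.array([(fx - ref_point[0]) / h_test,
                  (fy - ref_point[1]) / h_test])              # test-time location
    return np.linalg.norm(a - b) < theta
```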
Heuristic Voting (H-VOT). In the training phase of ORIG-VOT for each
word wi – which has its keypoint location on the person’s segmentation mask
– we store an observation vector with its location relative to the person center
in Oi . In [6] the person’s segmentation mask is automatically retrieved using
a motion segmentation by the Grimson-Stauffer background model. For land-
mark localization we could also allow each word that appears on the person’s
segmentation mask to store an observation vector for each landmark. Though
this means that e.g. a word that appears during training on the feet would store
an observation vector for all other landmarks (including the left/right hand, the
head, etc.), not only the left/right foot. Therefore, we propose to use a simple
but effective heuristic which is to exploit the information, whether the landmark
location which we are interested in was within the descriptor region of the fea-
ture during training. A vote is generated only if this is the case. This filters for
image structures that most probably contain information about the location of
the landmark. It is not necessary to augment the observation vectors by this
information, since they already contain this information: the landmark location is within the descriptor region of the word during training if \sqrt{\Delta x_j^2 + \Delta y_j^2} < s_j.

Observation Vector Weighting Voting (OW-VOT). A more generic


approach – compared to H-VOT – is to treat the problem of choosing observation
vectors as a learning problem. For this, each observation vector is provided with
an individual weight. The idea is to give an observation vector a large weight if
it successfully allows to detect landmarks and to give it a smaller weight if this
is not the case. The training of an OW-VOT needs three steps. The training
data is split into two equally sized subsets, comparable to the principle of cross-
validation. The first set is used to collect the observation vectors. The second
set is used to estimate the weights. In the first step, we collect observation
vectors as in ORIG-VOT on the first set. In the second step, we iterate on
the second set of the training data and augment each observation vector oj =
(Δxj , Δyj , sj , hj , ηj ) by a weight ηj which is initially set to 0. For each sample
image from the second set we compute local features f , match them to words, and
for each word iterate over all the associated observation vectors. We compute the
corresponding vote location v2 and compare it with the ground truth landmark
location t. The weight is increased by K(\|v_2 - t\|_2, \sigma^2), where K is a Gaussian
kernel with a standard deviation of σ = 0.1h, and h is the current person height.
This ensures that observation weights are increased stronger if the corresponding
vote location is near to the ground truth marker location. In the third step,
each weight \eta_j is normalized by W, i.e. \eta_j \leftarrow \frac{\eta_j}{W}, where W is the sum of all observation weights of the word w_i, i.e. W = \sum_{k=1}^{Q_i} \eta_k. During voting, the vote weight formula in eqn. (1) is then modified by the observation vector weights, i.e. v_r = \eta_j P(w_i \mid d). Note that the original vote weight normalization by \frac{1}{|O_i|} –
which was introduced in ORIG-VOT to give each feature the same weight when
voting – can be skipped, since the new weights ηj are already normalized to 1.
Thus OW-VOT replaces the uniform weighting of all observation vectors by a
relative weighting.
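A sketch of the weight estimation on the second training split; the Gaussian kernel, σ = 0.1h and the per-word normalisation follow the text, while the bookkeeping (dictionaries of observation records) is our own simplification.

```python
import numpy as np

def learn_observation_weights(samples, observations):
    """samples: iterable of (features, p_word_given_desc, ground_truth_xy, person_height).
    observations[word_id]: list of dicts with keys 'dx', 'dy', 's' and weight 'eta' (init 0)."""
    for features, probs, t, h in samples:
        sigma = 0.1 * h
        for (fx, fy, fs, wi), p in zip(features, probs):
            for o in observations[wi]:
                scale = fs / o['s']
                v2 = np.array([fx + scale * o['dx'], fy + scale * o['dy']])
                d = np.linalg.norm(v2 - np.asarray(t))
                # Gaussian kernel K(d, sigma^2): votes near the ground truth
                # increase the observation weight more strongly.
                o['eta'] += np.exp(-0.5 * (d / sigma) ** 2)
    # Normalise per word so that the weights of each word sum to one.
    for wi, obs in observations.items():
        total = sum(o['eta'] for o in obs)
        if total > 0:
            for o in obs:
                o['eta'] /= total
    return observations
```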
Combined Voting Strategy (COMBI-VOT). The three voting strategies
(RP,H,OW - VOT) can be combined in order to generate an improved voting
strategy COMBI-VOT. RP-VOT and H-VOT mainly act as a filter for observa-
tion vectors: each observation vector associated with a detected word is checked
for being used during the voting procedure. OW-VOT then further modifies the
final vote weights by giving observation vectors larger weights, if they have shown
to be appropriate for predicting landmarks.

3 Experiments and Results

Dataset. Aa et al. [1] recently published the new UMPM benchmark (http://www.projects.science.uu.nl/umpm/), which allows for quantitative evaluations of 2D and 3D human pose estimation algorithms. It consists of 30 subjects and a total of approx. 400,000 frames. The
dataset is provided with extrinsic and intrinsic camera parameters for all of the
four cameras. This allows to project the motion capture data into the image to
yield ground truth landmark locations which can be used to train the landmark
ISMs. Table 1 shows the list of 16 experiments we conducted. In the Xa experiments we used the UMPM codebook, while the Xb experiments were conducted using the generic codebook (see the Codebooks paragraph below). In exp. 5b, e.g., we used the generic codebook, trained 15 landmark ISMs using 2 video sequences showing 4 persons performing mainly object-grabbing poses, and tested on one video sequence showing one person. In all experiments except 1a and 1b, the test person(s) differed from the persons in the ISM training data.

Table 1. Results for evaluation measures α and β for all landmark localization exper-
iments. [x] specifies the number of persons used for training, or testing respectively.
Ø specifies the average evaluation measure value for each voting strategy, where we
averaged over all experiments 1a-8b.

Exp #   Train videos   Test video   |   α: ORIG  RP  H  OW  COMBI   |   β: ORIG  RP  H  OW  COMBI
1a [2] p2-chair-2 [1] p1 chair 2 9.8 25.5 18.2 12.9 44.0 44.7 26.3 31.9 35.2 16.4
1b [2] p2-chair-2 [1] p1 chair 2 8.4 24.4 16.7 11.7 43.5 47.1 26.9 33.5 37.0 16.0
2a [2] p2-chair-1 [1] p1 chair 2 9.3 21.8 17.8 11.5 37.9 44.1 27.4 30.2 37.3 18.5
2b [2] p2-chair-1 [1] p1 chair 2 8.6 22.3 17.8 10.9 38.8 46.3 27.1 31.4 39.0 18.0
3a [4] p2/p3-chair-1 [1] p1 chair 2 8.7 20.9 17.0 11.0 36.7 45.5 28.5 31.1 37.0 17.9
3b [4] p2/p3-chair-1 [1] p1 chair 2 7.7 21.4 15.9 10.2 37.5 48.3 28.2 33.0 38.9 17.5
4a [2] p2-grab-1 [1] p1 grab 3 12.0 29.7 21.0 16.4 49.4 38.5 20.1 26.8 30.5 12.1
4b [2] p2-grab-1 [1] p1 grab 3 10.9 30.5 20.2 14.3 50.3 41.5 19.7 28.9 34.1 12.0
5a [4] p2/p3-grab-1 [1] p1 grab 3 11.1 29.1 19.8 14.9 48.7 39.9 20.4 27.9 31.5 12.2
5b [4] p2/p3-grab-1 [1] p1 grab 3 10.0 30.0 18.2 13.3 49.3 42.7 19.8 30.2 34.5 12.1
6a [2] p3-ball-2 [2] p2 ball 1 9.9 22.4 17.4 13.7 39.3 44.3 26.4 30.5 34.2 15.6
6b [2] p3-ball-2 [2] p2 ball 1 8.4 22.5 15.0 11.5 39.0 47.5 26.1 33.5 37.0 15.5
7a [2] p3-free-1 [2] p2 free 1 8.8 22.1 17.5 11.6 38.2 43.7 25.1 30.6 33.6 15.5
7b [2] p3-free-1 [2] p2 free 1 7.7 22.1 14.9 10.2 38.5 45.6 25.0 32.6 35.8 15.5
8a [4] p3-free-1/11 [2] p2 free 1 9.5 23.5 18.8 12.4 41.6 42.3 24.5 29.4 32.9 14.8
8b [4] p3-free-1/11 [2] p2 free 1 8.2 23.1 16.1 10.8 41.4 44.8 24.7 31.4 35.3 14.9
Ø 9.3 24.5 17.6 12.3 42.1 44.2 24.8 30.8 35.2 15.3

Evaluation Measures. For a good part localization we want the vote dis-
tribution to be compact, uni-modal, centered on the ground truth landmark,
and monotonically decreasing to the periphery. We use three different evalua-
tion measures (α, β, γ) to assess to which degree this is fulfilled by the different
voting strategies:

\alpha = \frac{1}{W} \sum_{v \in V} v_r \, \delta_r(\|v_2 - t\|_2) \quad \text{with} \quad \delta_r(x) = \begin{cases} 1 & , x \le r \\ 0 & , x > r \end{cases} \qquad (2)

\beta = \frac{1}{W} \sum_{v \in V} v_r \, \|v_2 - t\|_2 / h \qquad (3)

\gamma(d) = \frac{1}{|X_d|} \sum_{(d,\rho(\tilde{l})) \in X_d} \rho(\tilde{l}) \quad \text{with} \quad \rho(\tilde{l}) = \frac{1}{W \lambda(s)} \sum_{v \in V} v_r \, K(\|v_3 - \tilde{l}\|_2 / \lambda(s)) \qquad (4)

α measures the ratio of correct vs. total votes cast, weighted by the corre-
sponding vote weights. All votes within a circle of radius r around the ground
truth location are considered as correct. Here we use r = 0.1h, where h is the
person height measured in pixels. h can be estimated from the stick figure ground
truth 2D pose by h = (L_l + L_r)/2 + S + N, where L_l and L_r are the lengths of the legs, S is the length of the spine and N is the length from the neck to the head. t is the ground truth 2D landmark location, and W = \sum_{v \in V} v_r is the sum of all
vote weights. β measures the mean distance of the votes to the true landmark
location, again weighted by the vote weights, such that the distance (to the true
location) of a vote with a large weight has higher impact than a distance of a
vote with small weight. The distance is computed in relative person height units
by dividing through h. γ measures the average vote density in dependence of the
distance to the true marker location. For this we sample 3D vote space locations
l̃ = (x̃, ỹ, s̃) on a regular 3D grid and compute the density ρ of the votes at
these locations using a (weighted) kernel density estimator with scale-dependent
bandwidth λ(s). For each vote density sample location l̃, we then compute the
distance d of the corresponding 2D vote location v2 to the true marker location
t in person height units, i.e. d = \|v_2 - t\|_2 / h, and add a new sample x = (d, \rho(\tilde{l}))
of this distance and vote density to a histogram of 100 discretized distance bins
Xd (d = 0.01n, 0 ≤ n ≤ 100). Fig. 1 shows the average vote density γ(d) of all
votes within such a bin Xd as a function of the distance d.
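Given the vote set and the ground-truth landmark, the first two measures can be computed directly; the vote array layout follows the voting sketch above.

```python
import numpy as np

def alpha_beta(votes, t, h, r_frac=0.1):
    """votes: array with rows (weight, vx, vy, vs); t: ground-truth 2D location;
    h: person height in pixels. Returns (alpha, beta) of eqs. (2) and (3)."""
    weights = votes[:, 0]
    dists = np.linalg.norm(votes[:, 1:3] - np.asarray(t), axis=1)
    W = weights.sum()
    alpha = np.sum(weights * (dists <= r_frac * h)) / W   # weighted ratio of correct votes
    beta = np.sum(weights * dists / h) / W                # weighted mean distance in height units
    return alpha, beta
```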

Fig. 1. Vote density as a function of the distance to ground truth landmark location.
γ(0.1) e.g. specifies the average vote density we find at locations which have a distance
of 10% of the person’s height to the true marker location. Left: averaged over all
experiments and landmarks. Right: averaged over all experiments for landmark ’Upper
Spine’ (top) and landmark ’Right Hand’ (bottom).

Codebooks. Two different codebooks were used for the following experi-
ments. First, a codebook was generated by using 178 of the 272 UMPM video
sequences. All video sequences were skipped in which persons occurred on whom we later tested in the experiments. From 1747 person images we collected 109458
SURF descriptor vectors of keypoints within a person bounding box. The 128
dimensional descriptor vectors were clustered using RNN clustering [6] result-
ing in 1315 visual words. The generic codebook was generated using the ETHZ
Pedestrian dataset (http://www.vision.ee.ethz.ch/~aess/dataset/) showing hundreds of different persons in street scenes. From 17721 person images, 56234 descriptor vectors were clustered into 843 clusters.

Fig. 2. Vote densities as generated by the different voting strategies. For several landmarks (rows: right elbow, left knee, left shoulder, left foot, right shoulder, right hand) we show the resulting vote density generated by each strategy (columns: Input, ORIG-VOT, RP-VOT, H-VOT, OW-VOT, COMBI-VOT) on 6 example person images from the experiments. The original ISM voting strategy yields non-focused vote distributions, while especially the combination of the new voting strategies allows a much more focused localization of the landmarks and often shows vote density peaks at the true landmark locations (heat map color encoding: warm colors mean high density).
Results. Table 1 shows the results for the evaluation measures α and β
for the individual experiments (α,β are specified in %). While only 9.3% of
the votes are correct for the ORIG-VOT, RP-VOT can increase the number of
correct votes already up to 24.5%. COMBI-VOT can even increase it up to 42.1%
correct votes. The average-distance-of-votes measure β paints a picture which
is consistent with measure α. While the average distance of votes is 44.2% of
the person height for ORIG-VOT, COMBI-VOT can reduce this error measure
down to 15.3%. The influence of the codebook is only marginal (compare a with
b results) which indicates that a generic codebook trained on different person
images is appropriate. We expected a larger number of training persons (2 vs. 4)
to yield better results, which was slightly the case in exp. 7a vs. 8a and 7b vs.
8b, but not in exp. 4a vs. 5a and 4b vs. 5b, where the performance even slightly
dropped. This indicates that a small number of training persons is sufficient, at least when the poses shown in the training phase are similar to the ones in the test
phase. Fig. 1 shows the evaluation measure γ(d): the new voting strategies locate
more of the vote mass near to the ground truth landmark location. Especially,
COMBI-VOT and RP-VOT show a steep increase in the vote density near to
the ground truth location (compare Fig. 2 as well).

4 Conclusions
In this paper we showed that the original ISM voting strategy produces vote
distributions which are clearly limited for usage in the context of landmark
localization and as a basis for human pose estimation. We introduced three new
alternative vote generation mechanisms which produce much more focused vote
distributions and yield higher vote densities near to the true landmark locations.
When combining all three voting strategies to a new fourth one, we can see clear
vote peaks near to the true marker locations. While our work is in the context of
human pose estimation and action recognition, it is highly interesting for future
work to repeat the experimental comparison of the strategies presented here on object categories other than humans.

References
1. van der Aa, N., Luo, X., Giezeman, G., Tan, R., Veltkamp, R.: Utrecht multi-person
motion (umpm) benchmark: a multi-person dataset with synchronized video and
motion capture data for evaluation of articulated human motion and interaction.
In: HICV Workshop, in Conj. with ICCV (2011)
2. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose
annotations. In: Proc. of ICCV (2009)
3. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation
of the state of the art. PAMI 99(PrePrints) (2011)
4. Girshick, R., Shotton, J., Kohli, P., Criminisi, A., Fitzgibbon, A.: Efficient regression
of general-activity human poses from depth images. In: Proc. of ICCV (2011)
5. Lehmann, A., Leibe, B., van Gool, L.: Prism: Principled implicit shape model. In:
Proc. of BMVC, pp. 64.1–64.11 (2009)
6. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved cat-
egorization and segmentation. IJCV 77, 259–289 (2008)
7. Müller, J., Arens, M.: Human pose estimation with implicit shape models. In: ACM
Artemis, ARTEMIS 2010, pp. 9–14. ACM, New York (2010)
8. Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R.,
Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth
images. In: CVPR, pp. 1297–1304 (2011)
Evaluating the Impact of Color
on Texture Recognition

Fahad Shahbaz Khan1 , Joost van de Weijer2 , Sadiq Ali3 , and Michael Felsberg1
1
Computer Vision Laboratory, Linköping University, Sweden
[email protected]
2
Computer Vision Center, CS Dept. Universitat Autonoma de Barcelona, Spain
3
SPCOMNAV, Universitat Autonoma de Barcelona, Spain

Abstract. State-of-the-art texture descriptors typically operate on grey


scale images while ignoring color information. A common way to obtain
a joint color-texture representation is to combine the two visual cues at
the pixel level. However, such an approach provides sub-optimal results
for the texture categorisation task.
In this paper we investigate how to optimally exploit color informa-
tion for texture recognition. We evaluate a variety of color descriptors,
popular in image classification, for texture categorisation. In addition we
analyze different fusion approaches to combine color and texture cues.
Experiments are conducted on the challenging scenes dataset and a 10-class texture dataset. Our experiments clearly suggest that in all cases color names provide the best performance, and that late fusion is the best strategy to combine color and texture. Selecting the best color descriptor with the optimal fusion strategy provides a gain of 5% to 8% over texture alone on the scenes and texture datasets.

Keywords: Color, texture, image representation.

1 Introduction

Texture categorisation is a difficult task. The problem involves assigning a class


label to an image according to the texture category it belongs to. A significant amount of variation in images of the same class, illumination changes, and scale and viewpoint variations
are some of the key factors that make the problem challenging. The task consists
of two parts, namely, efficient feature extraction and classification. In this work
we focus on obtaining compact color-texture features to represent an image.
State-of-the-art texture descriptors operate on grey level images. Color and
texture are two of the most important low level visual cues for visual recognition.
A straightforward way to extend these descriptors with color is to operate separately on the color channels and then concatenate the descriptors. However,
such representations are high dimensional. Recently, it has been shown that an
explicit color representation improves performance on object recognition and
detection tasks [1,2]. Therefore, this work explores several pure color descriptors
popular in image classification for the texture categorisation task.


There exist two main approaches to combine color and texture cues for texture
categorisation.
Early Fusion: Early fusion fuses the two cues at the pixel level to obtain a joint
color-texture representation. The fusion is obtained by computing the texture
descriptor on the color channels. Early fusion performs best for categories which
exhibit constancy in both color and shape [1].
Late Fusion: Late fusion process the two visual cues separately. The two his-
tograms are concatenated into a single representation which is then the input to
a classifier. Late fusion combines the visual cues at the image level. Late fusion
works better for categories where one cue remains constant and the other changes
significantly [1]. In this work we analyze both early and late fusion approaches
for the task of texture categorisation.
As mentioned above, state-of-the-art early fusion approaches [3] combine the
features at the pixel level. Contrary to this practice in computer vision, it is well known that in the human brain visual features are processed separately before being combined at a later stage for visual recognition [4,5]. Recently, Khan et al. [6] propose an
alternative approach to perform early fusion for object recognition. The visual
cues are combined in a single product vocabulary. A clustering algorithm based
on information theory is then applied to obtain a discriminative compact repre-
sentation. Here we apply this approach to obtain a compact early fusion based
color-texture feature representation.
In conclusion, we make the following novel contributions:
– We investigate state-of-the-art color features used for image classification for
the task of texture categorisation. We show that the color names descriptor
with its only 11 dimensional feature vector provides the best results for
texture categorisation.
– We analyze fusion approaches to combine color and texture. Both early and
late feature fusion is investigated in our work.
– We also introduce a new dataset of 10 different and challenging texture cat-
egories as shown in Figure 1 for the problem of color-texture categorisation.
The images are collected from the internet and Corel collections.

2 Relation to Prior Work


Image representations based on color and texture description are an interesting
research problem. A significant amount of research has been done in recent years to solve the problem of texture description [7,8,9,10]. Texture description based on local binary patterns [8] is one of the most commonly used approaches
for texture classification. Other than texture classification, local binary patterns
have been employed for many other vision tasks such as face recognition, object
and pedestrian detection. Due to its success and wide applicability, we also use
local binary patterns for texture categorisation in this paper. (We also investigated other texture descriptors such as MR8 and Gabor filters, but inferior results were obtained compared to LBP; the approach presented in this paper can be applied with any texture descriptor.)

Color has been shown to provide excellent results for bag-of-words based object recognition [3,1]. Recently, Khan et al. [1,2] have shown that an explicit representation based on color names outperforms other color descriptors for object recognition and detection. However, the performance of color descriptors popular in image classification has yet to be investigated for the texture categorization task. Therefore, in this paper we investigate the contribution of color to texture
categorization. Different from the previous methods [11,12], we propose to use
color names as a compact explicit color representation. We investigate both late
and early fusion based global color-texture description approaches. Contrary to
conventional pixel based early fusion methods, we use an alternative approach
to construct a compact color-texture image representation.

3 Pure Color Descriptors

Here we show a comparison of pure color descriptors popular in image classifi-


cation for texture description.
RGB Histogram [3]: As a baseline, we use the standard RGB descriptor. The
RGB histogram combines the three histograms from the R, G and B channels.
The descriptor has 45 dimensions.
rg Histogram [3]: The histogram is based on the normalized RGB color model.
The descriptor is 45 dimensional and invariant to light intensity changes and
shadows.
C Histogram: This descriptor has been shown to provide excellent results on the object recognition task [3]. The descriptor is derived from the opponent color space as O1/O3 and O2/O3. The channels O1 and O2 describe the color information, whereas the O3 channel contains the intensity information of an image. We quantize the descriptor into 36 bins using K-means to construct a histogram.
Opponent-angle Histogram [13]: The opponent-angle histogram proposed
by van de Weijer and Schmid is based on image derivatives. The histogram has
36 dimensions.
HUE Histogram [13]: The descriptor was proposed by [13] where hue is
weighted by the saturation of a pixel in order to counter the instabilities in
hue. This descriptor also has 36 dimensions.
Transformed Color Distribution [3]: The descriptor is derived by normal-
izing each channel of RGB histogram. The descriptor has 45 dimensions and is
invariant to scale with respect to light intensity.
Color Moments and Invariants [3]: In the work of [3] the color moment
descriptor is obtained by using all generalized color moments up to the second
degree and the first order. Whereas color moment invariants are constructed
using generalized color moments.
Hue-saturation Descriptor: The hue-saturation histogram is invariant to
luminance variations. It has 36 dimensions (nine bins for hue times four for
saturation).
Color Names [14]: Most of the aforementioned color descriptors are designed
to achieve photometric invariance. Instead, the color names descriptor balances a
certain degree of photometric invariance with discriminative power. Humans use
color names to communicate color, such as “black”, “blue” and “orange”. In this
work we use the color names mapping learned from the Google images [14].
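A sketch of building the 11-dimensional color names histogram of an image. It assumes a precomputed lookup table w2c mapping quantised RGB values to the 11 color name probabilities, as learned in [14]; the 32x32x32 quantisation and the way the table is provided are assumptions made only for illustration.

```python
import numpy as np

def color_name_histogram(rgb_image, w2c):
    """rgb_image: H x W x 3 uint8 array; w2c: (32*32*32) x 11 array of color name
    probabilities indexed by quantised RGB. Returns a normalised 11-bin histogram."""
    img = rgb_image.astype(np.int64)
    # Quantise each channel to 32 levels and build one lookup index per pixel
    # (index convention is an assumption for this sketch).
    idx = (img[..., 0] // 8) + 32 * (img[..., 1] // 8) + 32 * 32 * (img[..., 2] // 8)
    probs = w2c[idx.ravel()]           # per-pixel distribution over the 11 color names
    hist = probs.sum(axis=0)
    return hist / (hist.sum() + 1e-12)
```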

4 Combining Color and Texture

Here we discuss different fusion approaches to combine color and texture features.
Early Fusion: Early fusion involves binding the visual cues at the pixel level.
A common way to construct an early fusion representation is to compute the
texture descriptor on the color channels. Early fusion results in a more discrim-
inative representation since both color and shape are combined together at the
pixel level. However, the final representation is high dimensional. Constructing
an early fusion representation using color channels with a texture descriptor for
an image I is obtained as:

TE = [TR , TG , TB ], (1)

where T can be any texture descriptor. Most color-texture approaches in the literature are based on the early fusion approach [12,3]. Recently, Khan et al. [1] have
shown that early fusion performs better for categories that exhibit constancy of
both color and shape. For example, the foliage category has a constant shape
and color.
Late Fusion: Late fusion involves combining visual cues at the image level. The
visual cues are processed independently. The two histograms are then concate-
nated into a single representation before the classification stage. Since the visual
cues are combined at the histogram level, the binding between the visual cues is
lost. A late fusion histogram for an image is obtained as,

TL = [HT , HC ] , (2)

where HT and HC are explicit texture and color histograms. Late fusion pro-
vides superior performance for categories where one of the visual cues changes
significantly. For example, most of the man made categories such as car, motor-
bike etc. changes significantly in color. Since an explicit color representation is
used for late fusion, it is shown to provide superior results for such classes [1].
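The two global fusion schemes of eqs. (1) and (2) amount to simple concatenations. The sketch below uses a uniform LBP histogram from scikit-image as the texture descriptor and any of the color histograms of Section 3 as the color cue; the LBP parameters are our own choice for illustration and do not reproduce the 383-dimensional descriptor used in the experiments.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    """Uniform LBP histogram of a single-channel image (P+2 bins)."""
    lbp = local_binary_pattern(gray, P, R, method='uniform')
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def early_fusion(rgb_image):
    """T_E = [T_R, T_G, T_B]: the texture descriptor computed per color channel."""
    return np.concatenate([lbp_histogram(rgb_image[..., c]) for c in range(3)])

def late_fusion(gray, color_hist):
    """T_L = [H_T, H_C]: independent texture and color histograms, concatenated."""
    return np.concatenate([lbp_histogram(gray), color_hist])
```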
Portmanteau Fusion: Most theories from the human vision literature suggest
that the visual cues are processed separately [4,5] and combined at a later stage
for visual recognition. Recently, Khan et al. [6] propose an alternative solution
for constructing compact early fusion within the bag-of-words framework. Color
and shape are processed separately and a product vocabulary is constructed. A
Divisive information theoretic clustering algorithm (DITC) [15] is then applied
to obtain a compact discriminative color-shape vocabulary. Similarly, in this
work we also aim at constructing a compact early fusion based color-texture representation. (In our experiments we also evaluated PCA and PLS, but inferior results were obtained; a comparison of other compression techniques with DITC is also performed by [16].)
Here we construct separate histograms for color and texture, from which a product histogram is built. Suppose that T = {t1 , t2 , ..., tL } and C = {c1 , c2 , ...,
cM } represent the visual texture and color histograms, respectively. Then the
product histogram is given by

T C = {tc1 , tc2 , ..., tcS } = {{ti , cj } | 1 ≤ i ≤ L, 1 ≤ j ≤ M },

where S = L × M. The size of the product histogram equals the number of texture bins times the number of color histogram bins, which leads to a high dimensional feature representation. This product histogram is then input to the DITC algorithm to
obtain a low dimensional compact color-texture representation. The DITC algo-
rithm works on the class-conditional distributions over product histograms. The
class-conditional estimation is measured by the probability distribution p (R|tcs ),
where R = {r_1, r_2, ..., r_O} is the set of O classes. The DITC algorithm works by
estimating the drop in mutual information I between the histogram T C and
the class labels R. The transformation from the original histogram T C to the
new representation T C^R = {T C_1, T C_2, ..., T C_J} (where every T C_j represents a
group of clusters from T C) is equal to

I(R; TC) - I(R; TC^R) = \sum_{j=1}^{J} \sum_{tc_s \in TC_j} p(tc_s) \, KL\big(p(R|tc_s), p(R|TC_j)\big), \qquad (3)

where KL is the Kullback-Leibler divergence between the two distributions de-


fined by

KL(p_1, p_2) = \sum_{y \in Y} p_1(y) \log \frac{p_1(y)}{p_2(y)}. \qquad (4)

The algorithm finds a desired number of histogram bins based on minimizing the
loss in mutual information between the bins of the product histogram and the class labels of the training instances. Histogram bins with similar discriminative power over the classes are merged together. We refer to Dhillon et al. [15] for a detailed introduction to the DITC algorithm.
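To make the portmanteau construction concrete, the sketch below builds the product histogram of an image from its texture and color word assignments and evaluates the mutual-information loss of eq. (3) for a given grouping of product bins; the full DITC optimisation, which iteratively reassigns bins to clusters, is described in [15] and omitted here.

```python
import numpy as np

def product_histogram(texture_ids, color_ids, L, M):
    """Joint histogram over the S = L*M product words tc_s = (t_i, c_j)."""
    idx = np.asarray(texture_ids) * M + np.asarray(color_ids)
    return np.bincount(idx, minlength=L * M).astype(float)

def mi_loss(p_tc, p_class_given_tc, clusters):
    """Eq. (3): loss in mutual information when product bins are grouped into clusters.
    p_tc: prior over product bins; p_class_given_tc: S x O matrix of p(R|tc_s);
    clusters: list of index arrays, one per cluster TC_j."""
    loss = 0.0
    for bins in clusters:
        w = p_tc[bins]
        # The cluster conditional p(R|TC_j) is the weight-averaged distribution of its members.
        p_cluster = (w[:, None] * p_class_given_tc[bins]).sum(axis=0) / (w.sum() + 1e-12)
        for s, ws in zip(bins, w):
            p_s = p_class_given_tc[s]
            kl = np.sum(p_s * np.log((p_s + 1e-12) / (p_cluster + 1e-12)))
            loss += ws * kl
    return loss
```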

5 Experimental Results
Fig. 1. Example images from the two datasets used in our experiments. First row: images from the OT scenes dataset. Bottom row: images from our texture dataset.

To evaluate the performance of our approach we have collected a new dataset of 400 images for color-texture recognition. The dataset consists of 10 different
categories, namely: marble, beads, foliage, wood, lace, fruit, cloud, graffiti, brick
and water. We use 25 images per class for training and 15 instances for testing.
Existing datasets are either grey scale, such as the Brodatz set, or too simple,
such as the Outex dataset, for color-texture recognition. Texture cues are also
used frequently within the context of object and scene categorisation. Therefore,
we also perform experiments on the challenging OT scenes dataset [17]. The OT
dataset [17] consists of 2688 images classified into 8 categories. Figure 1 shows
example images from the two datasets.
In all experiments a global histogram is constructed for the whole image. We
use LBP with uniform patterns, with a final dimensionality of 383. Early fusion is performed by computing the texture descriptor on the color channels. For late fusion, the histogram of a pure color descriptor is concatenated with the texture histogram. A non-linear SVM is used for classification. The performance is evaluated as classification accuracy, i.e. the fraction of correctly classified instances of each category; the final performance is the mean accuracy over all the categories. We also compare our approach with color-texture descriptors proposed in the literature [12,10].
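A sketch of the classification stage: global histograms are fed to a non-linear SVM and the mean per-class accuracy is reported. The histogram intersection kernel and the value of C are our own illustrative choices; the text above only states that a non-linear SVM is used.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Histogram intersection kernel: K(a, b) = sum_k min(a_k, b_k)."""
    A, B = np.asarray(A), np.asarray(B)
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def mean_class_accuracy(train_hists, train_labels, test_hists, test_labels):
    """Train an SVM on a precomputed kernel and report mean per-class accuracy."""
    test_labels = np.asarray(test_labels)
    clf = SVC(kernel='precomputed', C=10.0)
    clf.fit(intersection_kernel(train_hists, train_hists), train_labels)
    pred = clf.predict(intersection_kernel(test_hists, train_hists))
    classes = np.unique(test_labels)
    return np.mean([np.mean(pred[test_labels == c] == c) for c in classes])
```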

Table 1. Classification accuracy on the two datasets. (a) Results using different pure
color descriptors. Note that on both datasets color names being additionally compact
provides the best results. (b) Scores using late fusion approaches. On both datasets
late fusion using color names provides the best results while being low dimensional.

(a) Pure color descriptors:
Method               Size   OT [17]   Texture
RGB                   45      43        51
rg                    30      39        50
HUE                   36      38        43
C                     36      39        41
Opp-angle             36      33        27
Transformed color     45      40        41
Color moments         30      42        50
Color moments inv     24      23        34
HS                    36      37        42
Color names           11      46        56

(b) Late fusion of color and LBP:
Method                    Size       OT [17]   Texture
RGB LBP                   383 + 45     79        74
rg LBP                    383 + 30     80        69
HUE LBP                   383 + 36     80        74
C LBP                     383 + 36     79        73
Opp-angle LBP             383 + 36     79        74
Transformed color LBP     383 + 45     79        72
Color moments LBP         383 + 30     80        74
Color moments inv LBP     383 + 24     23        71
HS LBP                    383 + 36     79        72
Color names LBP           383 + 11     82        77

5.1 Experiment 1: Pure Color Descriptors

We start by providing results on the pure color descriptors discussed in Section 3.


The results are presented in Table 1. On both datasets, the baseline RGB pro-
vides improved results compared to several other more sophisticated color descriptors. Among all the descriptors, the color names descriptor provides the best results on both datasets. Note that color names, while additionally compact, possess a certain degree of photometric invariance together with discriminative power, and have the ability to encode achromatic colors such as grey, white, etc. Based on these results, we propose to use color names as the explicit color representation to combine with the texture cue.

5.2 Experiment 2: Fusing Color and Texture

Here, we first show the results obtained by late fusion approaches in Table 1. The texture descriptor alone, with 383 dimensions, provides classification scores of 77% and 69%, respectively, on the OT and texture datasets. The late fusion of RGB and LBP provides classification scores of 79% and 74%. The STD [12] descriptor provides inferior results of 58% and 67%, respectively. The best results on both datasets are obtained using the combination of color names with LBP. Table 2 shows the results obtained using early fusion approaches on the two datasets. The conventional pixel-based descriptors provide inferior results on both datasets. The LCVBP descriptor [10] provides classification scores of 76% and 53% on the two datasets. Taking the product histogram directly without compression provides accuracies of 81% and 72% while being significantly higher dimensional. It is worth mentioning that both the JTD and LCVBP descriptors are also significantly higher dimensional. Portmanteau fusion provides the best results among the early fusion based methods while additionally being compact in size.
In summary, late fusion provides superior performance while being compact on both datasets. Among the early fusion based methods, portmanteau fusion provides improved performance on both datasets. The best results are achieved using the color names descriptor, which, having only an 11-dimensional histogram, is compact and possesses a certain degree of photometric invariance while maintaining discriminative power. Note that in this paper we investigate a global color-texture representation; such a representation can further be combined with local bag-of-words based descriptors for further improvement in performance.

Table 2. Classification accuracy using early fusion approaches. Among early fusion approaches, portmanteau fusion provides the best results on both datasets while additionally being compact.

Method                 Dimension   OT [17]   Texture
RGBLBP                   1149        79        70
CLBP                     1149        78        69
OPPLBP                   1149        80        70
HSVLBP                   1149        78        71
JTD [12]                15625        57        61
LCVBP [10]              15104        76        53
Product                  4213        81        72
Portmanteau fusion        500        82        73

6 Conclusions

We evaluate a variety of color descriptors and fusion approaches popular in


image classification for texture recognition. Our results suggest that color names provide the best performance for texture recognition and that late fusion is the optimal approach to combine the two cues. Portmanteau fusion provides superior results compared to conventional pixel-level early fusion. On the scenes and texture datasets, color names in a late fusion setting significantly improve the performance, by 5% to 8% compared to texture alone.

Acknowledgments. We acknowledge the support of Collaborative Unmanned


Aerial Systems (within the Linnaeus environment CADICS), ELLIIT, the Strate-
gic Area for ICT research, funded by the Swedish Government, and Spanish
project TIN2009-14173.

References

1. Khan, F.S., van de Weijer, J., Vanrell, M.: Modulating shape features by color
attention for object recognition. IJCV 98(1), 49–64 (2012)
2. Khan, F.S., Anwer, R.M., van de Weijer, J., Bagdanov, A.D., Vanrell, M., Lopez,
A.M.: Color attributes for object detection. In: CVPR (2012)
3. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for
object and scene recognition. PAMI 32(9), 1582–1596 (2010)
4. Treisman, A., Gelade, G.: A feature integration theory of attention. Cogn.
Psych. 12, 97–136 (1980)
5. Wolfe, J.M.: Watching single cells pay attention. Science 308, 503–504 (2005)
6. Khan, F.S., van de Weijer, J., Bagdanov, A.D., Vanrell, M.: Portmanteau vocabu-
laries for multi-cue image representations. In: NIPS (2011)
7. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local
affine regions. PAMI 27(8), 1265–1278 (2005)
8. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. PAMI 24(7), 971–987
(2002)
9. Varma, M., Zisserman, A.: A statistical approach to texture classification from
single images. IJCV 62(2), 61–81 (2005)
10. Lee, S.H., Choi, J.Y., Ro, Y.M., Plataniotis, K.: Local color vector binary patterns
from multichannel face images for face recognition. TIP 21(4), 2347–2353 (2012)
11. Mäenpää, T., Pietikäinen, M.: Classification with color and texture: jointly or separately?
PR 37(8), 1629–1640 (2004)
12. Alvarez, S., Vanrell, M.: Texton theory revisited: A bag-of-words approach to com-
bine textons. PR 45(12), 4312–4325 (2012)

13. van de Weijer, J., Schmid, C.: Coloring local feature extraction. In: Leonardis, A.,
Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 334–348. Springer,
Heidelberg (2006)
14. van de Weijer, J., Schmid, C., Verbeek, J.J., Larlus, D.: Learning color names for
real-world applications. TIP 18(7), 1512–1524 (2009)
15. Dhillon, I., Mallela, S., Kumar, R.: A divisive information-theoretic feature clus-
tering algorithm for text classification. JMLR 3, 1265–1287 (2003)
16. Elfiky, N., Khan, F.S., van de Weijer, J., Gonzalez, J.: Discriminative compact
pyramids for object and scene recognition. PR 45(4), 1627–1636 (2012)
17. Oliva, A., Torralba, A.B.: Modeling the shape of the scene: A holistic representation
of the spatial envelope. IJCV 42(3), 145–175 (2001)
Temporal Self-Similarity for Appearance-Based
Action Recognition in Multi-View Setups

Marco Körner and Joachim Denzler

Friedrich Schiller University of Jena, Computer Vision Group, Jena, Germany


{marco.koerner,joachim.denzler}@uni-jena.de, www.inf-cv.uni-jena.de

Abstract. We present a general data-driven method for multi-view ac-


tion recognition relying on the appearance of dynamic systems captured
from different viewpoints. Thus, we do not depend on 3d reconstruction,
foreground segmentation, or accurate detections. We further extend earlier
approaches based on Temporal Self-Similarity Maps by new low-level
image features and similarity measures. Gaussian Process classification
in combination with Histogram Intersection Kernels serves as a powerful
tool in our approach. Experiments performed on our new combined
multi-view dataset as well as on the widely used IXMAS dataset show
promising and competitive results.

Keywords: Action Recognition, Multi-View, Temporal Self-Similarity,


Gaussian Processes, Histogram-Intersection Kernel.

1 Introduction

The automatic recognition of actions from video streams is a very important
problem in current computer vision research, as reflected by recent surveys[1]. A
variety of possible applications, e.g. Human-Machine Interaction, surveillance,
Smart Environments, and entertainment, argues for the emerging relevance of
this topic.
As monocular approaches rely on single-view images, they solely perceive 2d
projections of the real world and discard important information. Hence, they are
likely to suffer from occlusions and ambiguities. As a consequence, the majority
of these methods use data-driven methods like Space-Time Interest Points[8]
instead of model-based representations of the image content. In contrast, existing
multi-view action recognition systems try to directly exploit 3d information,
e.g. by reconstructing the scene or fitting anatomical models, resulting in a far
higher complexity.
Having these observations in mind, we propose a method to recognize articu-
lated actions, which meets the following demands: (i) it is designed to be general
and not restricted to human action recognition, (ii) it avoids expensive dense 3d
reconstruction, (iii) it is independent from the camera setup it was learned in,
and (iv) it does not rely on foreground segmentation and exact localization.
The rest of this paper is structured as follows: in Sect. 2 we give a short
introduction to the theory of Recurrence Plots and Temporal Self-Similarity Maps and


Fig. 1. Two SSMs obtained for a robot dog performing a stand kickright action captured from different viewpoints ((a) Camera 0, (b) Camera 3). Action primitives induce similar local structures in the corresponding SSM even under changes of viewpoint, illumination, or image quality.

motivate their usage. We also propose extending the related approach of Junejo
et al.[7] by new low-level features and distance metrics. Subsequently, Sect. 3
presents our approach to utilizing SSMs for multi-view action recognition. For this
purpose, we use a Gaussian Process classifier together with the Histogram Intersection
Kernel, which has been shown to be more suitable for the comparison of histograms.
In Sect. 4, we show results of our approach on our own new multi-view action
recognition dataset as well as on the widely used IXMAS dataset.

1.1 Related Work

Going through the related literature, methods for action recognition can be
categorized into three groups: the first kind of approach tries to reconstruct 3d
information or trajectories from the scene[15] or augments these representations
by a fourth time dimension[19,5]. Alternatively, relationships between action
features obtained from different views are learned by applying transfer learning
or knowledge transfer techniques[3,9]. The methods most related to our pro-
posal try to directly model the dynamics of actions within a view-independent
framework[7,2]. For a more extensive review about recent work on action recog-
nition we refer to recent reviews[1,14].

2 Temporal Self-similarity Maps

To understand human actions and activities, observers take benefit of their prior
knowledge of typical temporal and spatial recurrences in execution of actions.
Besides all differences in execution, two actions can be perceived as semantically
identical if they share atomic action primitives with similar frequency.
Assuming those actions to be instances of deterministic dynamical systems,
which can be modeled by differential equations, Marwan et al.[11] presented an
extensive discussion of their interpretation by means of Recurrence Plots (RP).
This work was later applied to human gait analysis[2] and, due to the stability
of such plots under viewpoint changes, to cross-view action recognition[7].

Table 1. Semantic interpretations of patterns in SSMs induced by recorded actions (interpretation following [11])
Pattern  Interpretation

Homogeneous areas  The corresponding atomic action represents a stationary process
Fading in corners  The recorded action represents a non-stationary process
Periodic structures  The recorded action contains a cyclic/periodic motion
Isolated points  The recorded action contains an abrupt fluctuation
(Anti-)diagonal straight lines  The recorded action contains different atomic actions with similar evolutionary characteristics in (reversed) time
Horizontal & vertical lines  No or slow change of states for a given period of time
Bow structures  The recorded action contains different atomic actions with similar evolutionary characteristics in reversed time with different velocities

Given a sequence $I_{1:N} = \{I_1, \ldots, I_N\}$ of images $I_i$, $1 \le i \le N$, a temporal Self-Similarity Map (SSM) is generically defined as a square and symmetric matrix $S_{f,d}^{I_{1:N}} = [d(f(I_i), f(I_j))]_{i,j} \in \mathbb{R}^{N \times N}$ of pairwise similarities $d(\cdot, \cdot)$ between low-level image features $f(\cdot)$ computed independently for every sequence frame. In the literature, it has already been shown that SSMs preserve invariants of the dynamic systems they capture[12], that they are stable with respect to different embedding dimensions[12,6], and that they are invariant under isometric transformations[12]. Though not invariant under projective or affine transformations, SSMs have heuristically been shown to be stable under 3d view changes[7]. In Fig. 1, a robot dog
performing a stand kickright action was captured from two viewpoints with
different illumination conditions. Apparently, atomic action primitives induce
similar structures within the corresponding SSM. It can further be observed
that the local structure of these SSMs reflects the temporal relations between
different system configurations over time, as summarized in Tab. 1.

2.1 Image Features

The choice for low-level image features f (·) is of inherent importance and has to
suit the given scenario. In the following, we will discuss some possible alternatives.
Intensity Values. The simplest way to convert an image into a descriptive feature
vector $f_{int}(I) \in \mathbb{R}^{M \cdot N}$ is to concatenate its pixel intensities, as proposed for human
gait analysis[2]. While this is suitable for sequences with a single stationary actor,
it yields large feature vectors and is very sensitive to noise and illumination changes.
Landmark Positions. Assuming that anatomical or artificial landmarks of the actor
can be tracked over time, their positions $f_{pos}(I) = (\mathbf{x}_0^\top, \mathbf{x}_1^\top, \ldots)^\top$, $\mathbf{x}_i = (x_i, y_i, z_i)^\top$, can be used to represent the current system configuration[7]. This is
sufficient as long as the tracked points are distributed over moving body parts,
but it requires points that can be tracked continuously.

Table 2. Exemplary SSMs extracted from recordings of actions from the Aibo dataset using different low-level image features (columns: greeting, scoot right, stretch, dance1, all performed in the greeting pose; rows: Intensity, HoG, HoF, Fourier).

Table 3. Exemplary SSMs extracted from the same stand dance1 action from the Aibo dataset using different similarity measures (rows: Euclidean distance, normalized cross-correlation, histogram intersection; columns: HoG, HoF, Fourier).

Histograms of Oriented Gradients have been shown to give good representations
of shape for object detection. For this purpose, the image is subdivided into
overlapping cells, in which the distribution of gradient directions is approximated
by a fixed-bin discretization. These local orientation histograms are normalized to
the direction of the strongest gradient in order to obtain local rotation invariance.
Appending the local gradient histograms gives the final descriptor
$f_{HoG}(I) = (h_0^\top, h_1^\top, \ldots)^\top$, $h_i = (n_i^0, n_i^1, \ldots)^\top$[7].
Histograms of Optical Flows. The optical flow field, which describes the
displacement of each pixel between two succeeding frames, represents an early
fusion of temporal dynamics. Building a global histogram over discretized flow
orientations, or appending histograms obtained from smaller subimages, yields the
HoF descriptor $f_{HoF}(I)$.
Fourier Coefficients. When computing the 2-dimensional discrete Fourier
transform $\hat{a}_{k,l} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{m,n} \cdot e^{-2\pi i \left(\frac{mk}{M} + \frac{nl}{N}\right)}$, $0 \le k \le M-1$, $0 \le l \le N-1$, $\hat{a}_{k,l} \in \mathbb{C}$, of an image patch $I$, the series of Fourier coefficients $[\hat{a}_{k,l}]$ contains
spectral information up to a given cutoff frequency $0 \le k \le M_c - 1$, $0 \le l \le N_c - 1$
and inherently provides invariance against translation. Since the first Fourier coefficient $\hat{a}_{0,0}$ represents the mean intensity of the transformed image patch $I$, the
Fourier coefficient descriptor $f_{Fourier} = (\hat{a}_{0,1}, \hat{a}_{0,2}, \ldots, \hat{a}_{1,N_c-1}, \ldots, \hat{a}_{M_c-1,N_c-1})$
is further invariant wrt. global illumination changes. By tuning the cutoff fre-
quencies Mc , Nc , statistical noise can be suppressed as it is represented by higher-
order frequencies. Since DFT can be implemented in parallel on modern GPU
environments, these features can be computed very efficiently.
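As an illustration only (not the authors' implementation), the following Python sketch shows one way such a Fourier coefficient descriptor could be computed with NumPy; the patch size and the cutoff frequencies are illustrative assumptions, and the complex coefficients are split into real and imaginary parts to obtain a real-valued vector.

```python
import numpy as np

def fourier_descriptor(patch, mc=8, nc=8):
    """Low-frequency 2D DFT coefficients of a grayscale patch.

    Dropping the DC coefficient a_{0,0} (the mean intensity) makes the
    descriptor insensitive to global illumination changes; truncating at the
    cutoff frequencies (mc, nc) suppresses high-frequency noise.
    The cutoff values used here are illustrative, not taken from the paper.
    """
    coeffs = np.fft.fft2(patch)            # full set of coefficients a_{k,l}
    low = coeffs[:mc, :nc].flatten()       # keep only frequencies below the cutoff
    low = np.delete(low, 0)                # drop a_{0,0}
    return np.concatenate([low.real, low.imag])  # real-valued feature vector

# Example: descriptor of a random 64x64 patch
print(fourier_descriptor(np.random.rand(64, 64)).shape)
```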
A qualitative comparison of these features extracted from different action
classes is given in Tab. 2. It can be seen that the HoF feature shows many
abrupt changes, while the other SSMs contain smoother transitions between
the similarity values. The HoG feature seems to be more sensitive to
temporal changes at small time scales, which could be explained by image noise

Fig. 2. Outline of the training and testing phases of our approach: for each sequence an SSM is computed, SSM features are extracted and reduced by PCA, a GMM vocabulary yields the Bag of SSM Words representation, and a GP classifier is trained on (respectively applied to) the resulting histograms to predict the action class.

and might harm the further processing. Hence, we concentrate on the proposed
Fourier coefficients in the following, since they are easy and fast to compute and
provide several useful invariances by design.

2.2 Similarity Measures

Beside the choice for a suitable image representation f (·), the distance measure
d(·, ·) plays an important role when computing self-similarities, as qualitatively
compared in Tab. 3.
Euclidean Distances. The Euclidean distance $d_{eucl}(f_1, f_2) = \|f_1 - f_2\|_2$ serves
as a straightforward way to quantify the similarity between two image feature
descriptors $f_1 = f(I_1)$ and $f_2 = f(I_2)$ of equal length, as proposed by [7]. While
this is easy to compute, it might be unsuited for histogram data[10], since false
bin assignments would cause large errors in the Euclidean distance.
Normalized Cross-Correlation. From a signal-theoretical point of view, the
image feature descriptors $f_1, f_2$ can be regarded as $D$-dimensional discrete signals
of equal size. Then, the normalized cross-correlation coefficient
$d_{NCC}(f_1, f_2) = \left\langle \frac{f_1}{\|f_1\|}, \frac{f_2}{\|f_2\|} \right\rangle \in [-1, 1]$
measures the cosine of the angle between the signal vectors
$f_1$ and $f_2$. Hence, this distance measure is independent of their lengths.
Histogram Intersection. The intersection $d_{HI}(h_1, h_2) = \sum_{i=0}^{D-1} \min(h_{1,i}, h_{2,i})$
of two histograms $h_1, h_2 \in \mathbb{R}^D$ was shown to perform better for codebook generation and image classification tasks[13]. In the case of comparing normalized histograms, the histogram intersection distance is bounded by $[0, +1]$.
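To make the construction of an SSM from these ingredients concrete, the following sketch (an illustration, not the authors' code) builds the pairwise matrix for a sequence of per-frame feature vectors using one of the measures above; the feature extraction itself is assumed to be given.

```python
import numpy as np

def euclidean(f1, f2):
    # d_eucl(f1, f2) = ||f1 - f2||_2
    return float(np.linalg.norm(f1 - f2))

def ncc(f1, f2):
    # Cosine of the angle between the two descriptors, in [-1, 1].
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def histogram_intersection(h1, h2):
    # Sum of bin-wise minima; bounded by [0, 1] for normalized histograms.
    return float(np.minimum(h1, h2).sum())

def self_similarity_map(features, measure=euclidean):
    """N x N matrix of pairwise scores d(f(I_i), f(I_j)) for one sequence.

    `features` holds the per-frame descriptors f(I_1), ..., f(I_N), e.g. the
    Fourier coefficient descriptors computed independently for every frame.
    """
    n = len(features)
    ssm = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            ssm[i, j] = ssm[j, i] = measure(features[i], features[j])
    return ssm

# Example: SSM of a toy sequence of 50 random descriptors
print(self_similarity_map(np.random.rand(50, 126)).shape)
```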

3 MVSSM Feature Extraction and Action Topic Model Learning and Classification

As mentioned before, SSMs obtained from videos capturing the identical action
from different viewpoints share common patterns. Hence, local feature descrip-
tors suitable for monitoring the structure of those patterns have to be developed

in order to use Multi-View SSM (MVSSM) representations for action recognition
purposes in multi-view environments. Depending on the choice of features
and the similarity measure used to create the SSM, self-similarity values are
expected to become less reliable when moving away from the diagonal, as measuring
the similarity gets more difficult. Junejo et al.[7] proposed to use a log-polar
histogram of intensity gradients extracted at discrete positions along the main
diagonal of the SSM to be analyzed, which yields a descriptor of dimension 88. The
radius of this histogram, i.e. the temporal extent of interest, controls the amount
of temporal information taken into account. As an extension, they constructed
these histograms at different time scales to capture variations in execution.
Alternatively, we propose to extract 128-dimensional SIFT descriptors at keypoints
equally distributed along the diagonal of fused multi-view SSMs. These are
scale-invariant by design, as they examine and aggregate the image information
on different scale spaces. To reduce the number of dimensions, we further apply
PCA to the matrix of descriptor vectors.
Since the number of feature descriptors varies with the size of the SSM, i.e. the
length of the sequence, and with the density of keypoints used for extracting these
features, we need to transform this set of features into a fixed-size representation.
We use the widely popular Bag of Visual Words approach to assign the given
action descriptors to representative prototypes identified by a clustering
algorithm. When an appropriate number of prototypes is chosen, the obtained
feature histograms are sparse and thus easy to distinguish. Fig. 2 outlines
the training and testing phases of our system.
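The following sketch illustrates this Bag of SSM Words step with scikit-learn; the descriptor extraction along the diagonal is assumed to be available, the dimensionalities (32 PCA components, 512 Gaussians) follow the values reported in Sect. 4, and everything else is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_vocabulary(train_descriptors, n_dims=32, n_words=512):
    """Learn the PCA projection and the GMM vocabulary from all training
    descriptors (e.g. SIFT vectors sampled along the SSM diagonals)."""
    pca = PCA(n_components=n_dims).fit(train_descriptors)
    gmm = GaussianMixture(n_components=n_words).fit(pca.transform(train_descriptors))
    return pca, gmm

def boss_histogram(descriptors, pca, gmm):
    """Represent one sequence by relative frequencies of its visual words."""
    words = gmm.predict(pca.transform(descriptors))
    hist = np.bincount(words, minlength=gmm.n_components).astype(float)
    return hist / hist.sum()

# Toy example with random stand-in descriptors (a small vocabulary for speed)
train = np.random.rand(2000, 128)
pca, gmm = fit_vocabulary(train, n_words=32)
print(boss_histogram(np.random.rand(300, 128), pca, gmm).shape)
```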

4 Experimental Evaluation
In order to evaluate our multi-view action recognition system, we first performed
experiments on our own dataset. This dataset contains 10 sequences of
each of 56 predefined actions performed by Sony AIBO robot dogs, simultaneously
captured by six cameras.1
In our general setup, the dimension of the SIFT descriptors extracted along the
SSM diagonal was reduced from 128 to 32 by applying PCA. Subsequently, all
descriptors from all training sequences were clustered into a mixture of 512
Gaussians to create a Bag of Self-Similarity Words (BOSS Words). This is further
used to represent each training sequence by a histogram of relative frequencies.
These parameter values heuristically gave the best results. While Junejo et al.[7]
propose to employ a multiclass SVM, this yields a very high complexity in our case,
as the AIBO dataset covers a relatively large number of classes to be distinguished.
Hence, we use a Gaussian Process (GP) classifier combined with a Histogram
Intersection Kernel $\kappa_{HIK}(h, h') = \sum_{i=0}^{D-1} \min(h_i, h'_i)$, $h, h' \in \mathbb{R}^D$, which can be
evaluated efficiently, as recently shown by Rodner et al.[16] and Freytag et al.[4].
Recognition rates were obtained using 10-fold cross validation.
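For reference, a small sketch of the histogram intersection kernel matrix on which such a classifier operates; the fast exact GP computations of [16,4] are not reproduced here, and the function below is an illustrative stand-in rather than the authors' code.

```python
import numpy as np

def hik_matrix(H1, H2):
    """Histogram intersection kernel between two sets of BOSS histograms.

    H1 has shape (n1, D), H2 has shape (n2, D); entry (a, b) equals
    sum_i min(H1[a, i], H2[b, i]).
    """
    return np.array([[float(np.minimum(h1, h2).sum()) for h2 in H2] for h1 in H1])

# Example: kernel between 4 training and 2 test histograms over 512 words
print(hik_matrix(np.random.rand(4, 512), np.random.rand(2, 512)).shape)  # (4, 2)
```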
One of the most important questions concerning multi-view action recogni-
tion is the influence of the training and testing camera setup on the overall
1
The complete dataset including labels, calibration data and background images is
available at http://www.inf-cv.uni-jena.de/JAR-Aibo.

Table 4. Results obtained on the IXMAS dataset (cross-view evaluation)

Approach  Description  Rec.
our approach    79%
Junejo et al.[7]  HoG (1)  63%
Junejo et al.[7]  HoF (1)  67%
Junejo et al.[7]  HoG+HoF (1)  74%
Junejo et al.[7]  HoG+HoF (2)  80%
Weinland et al.[17]  2d Silhouettes  58%
Farhadi et al.[3]  2d Silhouettes+OF  69%
Weinland et al.[18]  3d HoG (3)  84%
(1) Multi-Scale SSM, (2) Space-Time Interest Points[8], (3) all views used for training and testing

Fig. 3. Results obtained on the Aibo dataset: average recognition rates for different
ntraining/ntesting view partitions (1/5, 2/4, 3/3, 4/2, 5/1, 6/6); the plotted rates range from 0.62 to 0.87.

accuracy. In order to preserve generality, we evaluated our method on disjoint


sets for training and testing views. Fig. 3 shows averaged results of experiments,
where all 62 possible partitions of views for training and testing were used. As
expected, the maximum performance was obtained when dividing the available
views into equally-sized subsets. Confusions between semantically related classes
only appeared occasionally. In general, we were even able to distinguish identical
actions performed in different poses, which argues for the discriminativeness of
our modeling scheme. For our experiments we used a standard desktop computer
equipped with an Intel(R) Core(TM)2 Quad CPU at 2.50 GHz and 8 GB of RAM.
Some algorithms were parallelized, e.g. the Fourier transform, SIFT extraction,
and GMM modeling. While learning an action model for the whole dataset took
about 3 hours, the SSM computation, feature extraction, and classification run
in real time. Most of the approaches presented before concerning the
recognition of actions in multi-view environments focus on cross-view setups,
i.e. the system is trained on one single view and evaluated on another view.
Hence, we adopted the evaluation method of Junejo et al.[7] in order to allow a fair
comparison. We made no further adaptations; in particular, we did not tune the process
parameters to obtain optimal results for this scenario. Tab. 4 shows the resulting
recognition rates compared to other non-model-based approaches. While Junejo
et al.[7] used a combination of HoF and HoG features, we reach similar results
using our proposed Fourier descriptors, which can presumably be computed
more efficiently. Furthermore, their approach achieves time-scale
invariance by extracting the SSM features at different scales, i.e. with distinct
radii, while the SIFT features we use for representing SSMs are (time-) scale-
invariant by design. By estimating 3d optical flow, Weinland et al.[18] obtained
slightly higher recognition rates.

5 Summary and Outlook


We presented a framework for creating and evaluating temporal self-similarity
maps and employing them for multi-view action recognition. It was pointed out that

the invariance and stability properties of SSMs support our demands on an action
recognition system.
We made three contributions: (i) we further extended the method originally
presented in [7] by new low-level features and distance metrics, (ii) we applied
a Gaussian Process (GP) classifier combined with histogram intersection ker-
nel, which have been shown to be more suitable and efficient for comparing
histograms[16,4], and (iii) we used a new extensive dataset for evaluating multi-
view action recognition systems, which will be made publicly available.
It is straightforward to augment the Bag of Self-Similarity Words modeling
scheme by histograms of co-occurrences of vocabulary words in order to improve
the descriptive power of this representation. Another important aspect is the
direct integration of calibration knowledge into our framework.

References
1. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. ACM Comput.
Surv. 43(3), 16:1–16:43 (2011)
2. Cutler, R., Davis, L.S.: Robust real-time periodic motion detection, analysis, and
applications. TPAMI 22(8), 781–796 (2000)
3. Farhadi, A., Tabrizi, M.K.: Learning to recognize activities from the wrong view
point. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS,
vol. 5302, pp. 154–166. Springer, Heidelberg (2008)
4. Freytag, A., Rodner, E., Bodesheim, P., Denzler, J.: Rapid uncertainty compu-
tation with gaussian processes and histogram intersection kernels. In: Lee, K.M.,
Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part II. LNCS, vol. 7725,
pp. 511–524. Springer, Heidelberg (2013)
5. Holte, M.B., Chakraborty, B., Gonzalez, J., Moeslund, T.B.: A local 3-D motion de-
scriptor for multi-view human action recognition from 4-D spatio-temporal interest
points. Selected Topics in Signal Processing 6(5), 553–565 (2012)
6. Iwanski, J.S., Bradley, E.: Recurrence plots of experimental data: To embed or not
to embed? Chaos 8(4), 861–871 (1998)
7. Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: View-independent action recognition
from temporal self-similarities. TPAMI 33(1), 172–185 (2011)
8. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: CVPR, pp. 1–8 (2008)
9. Liu, J., Shah, M., Kuipers, B., Savarese, S.: Cross-view action recognition via view
knowledge transfer. In: CVPR, pp. 3209–3216 (2011)
10. Maji, S., Berg, A.C., Malik, J.: Classification using intersection kernel support
vector machines is efficient. In: CVPR, pp. 1–8 (2008)
11. Marwan, N., Romano, M.C., Thiel, M., Kurths, J.: Recurrence plots for the analysis
of complex systems. Physics Reports 438(5-6), 237–329 (2007)
12. McGuire, G., Azar, N.B., Shelhamer, M.: Recurrence matrices and the preservation
of dynamical properties. Physics Letters A 237(1-2), 43–47 (1997)
13. Odone, F., Barla, A., Verri, A.: Building kernels from binary strings for image
matching. IP 14(2), 169–180 (2005)
14. Poppe, R.: A survey on vision-based human action recognition. IVC 28(6), 976–990
(2010)
15. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of
actions. IJCV 50(2), 203–226 (2002)

16. Rodner, E., Freytag, A., Bodesheim, P., Denzler, J.: Large-scale gaussian process
classification with flexible adaptive histogram kernels. In: Fitzgibbon, A., Lazebnik,
S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575,
pp. 85–98. Springer, Heidelberg (2012)
17. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views
using 3D exemplars. In: ICCV, pp. 1–7 (2007)
18. Weinland, D., Özuysal, M., Fua, P.: Making action recognition robust to occlusions
and viewpoint changes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV
2010, Part III. LNCS, vol. 6313, pp. 635–648. Springer, Heidelberg (2010)
19. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using mo-
tion history volumes. CVIU 104(2), 249–257 (2006)
Adaptive Pixel/Patch-Based Stereo Matching
for 2D Face Recognition

Rui Liu1, Weiguo Feng1, and Ming Zhu2


1
Department of Electronic Engineering and Information Science, University of Science and
Technology of China, Hefei, China
{liuruin,fwg168}@mail.ustc.edu.cn
2
Department of Automation, University of Science and Technology of China, Hefei, China
[email protected]

Abstract. In this paper, we propose using adaptive pixel/patch-based stereo


matching for 2D face recognition. We do not perform 3D reconstruction but define
a measure of the similarity of two 2D face images. After rectifying the two
images using epipolar geometry, we match them and use the resulting similarity for
face recognition. The proposed approach has been tested on the CMU PIE and FERET
databases and demonstrates superior performance compared to existing methods
in real-world situations including changes in pose and illumination.

Keywords: face recognition, adaptive, pixel, patch, stereo matching.

1 Introduction

Although face recognition in controlled environments has been well solved, its
performance in real applications is still far from satisfactory. Variations of pose,
illumination, occlusion and expression are still critical issues that affect face
recognition performance. Existing techniques such as Eigenfaces [1] or Fisherfaces [2]
are not robust to these variations. Local features such as local binary patterns (LBP)
[3] were then proposed for recognition. Recently, sparse representation-based
classification (SRC) [4] has also been proposed and has shown very promising results.
However, the performance of these methods still degrades under changes of pose.
Previous methods for improving face recognition accuracy under pose variation
include [5-10]. In [5], a pose-specific locally linear mapping is learned between a set
of non-frontal faces and the corresponding frontal faces. [6] shows that a dynamic
programming-based stereo matching algorithm (DP-SM) can achieve significant
performance gains for 2D face recognition across pose. A learning method is presented
in [10] to perform patch-based rectification based on locally linear regression.
In our work, we perform recognition by using Adaptive Pixel/Patch-based Stereo
Matching (APP-SM) to judge the similarity of two 2D face images. Fig. 1 gives an
overview of our method, which consists of the following steps:
First, we build a gallery of 2D face images. Second, we align each probe-gallery
image pair using four feature points by calculating the epipolar geometry. Then we

run an adaptive pixel/patch-based stereo algorithm on the image pair. Note that we
do not perform 3D reconstruction. We also discard all the correspondences and the
disparities; we only use the matching cost to compute the similarity of two face
images. Finally, we identify the probe with the gallery image that produces the
maximum similarity.

Fig. 1. An overview of our APP-SM method

The paper is organized as follows. Section 2 presents the details of our face recog-
nition method. Section 3 presents and analyses all experiments. Finally, in Section 4,
conclusions will be given.

2 Stereo Matching and Face Recognition

2.1 Alignment

Before running the stereo algorithm on an image pair, we first need to rectify the
images to maintain the epipolar constraint. Generally, eight corresponding points are
required to obtain the epipolar geometry. Nevertheless, since the average variation of
the depth of the head is small compared to the distance of the camera to the head,
and the field of view for the facial image is small, we can simplify the model to
scaled orthographic projection [11]. Then we only need four feature points to
calculate the epipolar geometry.

However, it has been shown in [11] that there can still be considerable variations
in disparity between two images under scaled orthographic projection. Traditional
linear transformations can only create linear disparity maps, which cannot be used
here to align the images while accounting for the disparity variations. So in this step,
we follow [6] and achieve alignment by solving a non-linear system. For completeness,
we briefly review the main ideas of solving the epipolar geometry under scaled
orthographic projection. Details can be found in [6].
The epipolar geometry in this scenario is modeled as a tuple (θ, γ, s, t). Here, θ
and γ are the angles of the epipolar lines in the left and right image, respectively.
Scaling the right image by s causes the distance between two epipolar lines in the
right image to match the distance in the left image. Translating the image
perpendicular to the epipolar lines by t aligns the corresponding lines.
In our experiments, we specify these feature points by hand, as in [6]. With four
corresponding points, we get a nonlinear system of equations, which we solve in a
straightforward way to complete the alignment step.

2.2 Stereo Matching and Similarity Measure


We propose a stereo algorithm which is appropriate for wide-baseline matching of
faces. Notice that our approach is designed to operate at both the pixel and the patch
level, so that it can handle problems such as large pose changes and variable
illumination.

2.2.1 Pixel Level

Our goal here is to compute the dissimilarity $\widetilde{PI}$ between a pixel at position $x_L$ in the left scanline (row) and a pixel at position $x_R$ in the right scanline (row). Birchfield and Tomasi [12] defined a pixel dissimilarity measure that is insensitive to image sampling:

$$d(x_L, x_R) = \min\Big( \min_{x_L - \frac{1}{2} \le x \le x_L + \frac{1}{2}} \big| \hat{I}_L(x) - I_R(x_R) \big| ,\; \min_{x_R - \frac{1}{2} \le x \le x_R + \frac{1}{2}} \big| I_L(x_L) - \hat{I}_R(x) \big| \Big) \qquad (1)$$

Here, $x_L = x_R + d$, $d \in [\Delta_1, \Delta_2]$. $I_L$ and $I_R$ are two discrete one-dimensional arrays of intensity values, $\hat{I}_R$ is the function obtained by linear interpolation between the sample points of the right scanline, and $\hat{I}_L$ is defined similarly.
Based on the Birchfield-Tomasi method, we propose a simple adaptive pixel-based
stereo matching algorithm to deal with horizontal slant. It processes a pair of
scanlines at a time. Horizontal disparities $\Delta$ are assigned to the scanline within a
given range $[\Delta_1, \Delta_2]$. The disparities are not assigned to individual pixels, but
continuously over the whole scanline. Given a point $x_L$ in the left scanline and its
corresponding point $x_R$ in the right scanline, we have

$$x_L = m \cdot x_R + d \qquad (2)$$

$m$ is the horizontal slant, which allows line segments of different lengths in the two
scanlines to correspond adaptively. The values of the horizontal slant which are to be
examined are provided as inputs, i.e., $m \in M$, where $M = \{m_1, m_2, \ldots, m_k\}$, such that
$m_1, m_2, \ldots, m_k \ge 1$.

$$\Delta = (m - 1) \cdot x_R + d \qquad (3)$$

$\Delta$ is the horizontal disparity. The disparity search range $[\Delta_1, \Delta_2]$ is also provided as
an input. We can then find the range for $d$ using the given range of $\Delta$ and Equation (3).
In our implementation, we choose $\Delta$ and $m$ empirically. Under the constraints of
runtime and memory consumption, we find that $\Delta = [0, 8]$ and $M = \{1, 1.2, \ldots, 3\}$
perform best.

For the $i$th pixel in the left $s$th scanline, we simultaneously search the space of
possible disparities and horizontal slants and choose the minimum dissimilarity as the
value of $PI(i, s)$. After we obtain the dissimilarity value for each pixel in the left
image, we normalise these values using min-max normalization:

$$\widetilde{PI}(i, s) = 1 - \frac{PI(i, s) - \min(PI)}{\max(PI) - \min(PI)} \qquad (4)$$
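As an illustration of the search over slants and disparities (not the authors' implementation), the sketch below computes the raw dissimilarity for every pixel of one left scanline; a plain absolute intensity difference at rounded positions stands in for the sampling-insensitive measure of Eq. (1), and the slant and disparity ranges follow the values quoted above.

```python
import numpy as np

def scanline_dissimilarity(left_row, right_row, slants=(1.0, 1.2, 1.4), d_range=(0, 8)):
    """Raw dissimilarity PI(i, s) for each pixel i of one left scanline s.

    For every slant m and offset d, right-scanline pixels x_R are mapped to
    left positions x_L = m * x_R + d (Eq. (2)); the minimum absolute intensity
    difference found for a left pixel is kept. A plain absolute difference
    replaces the sampling-insensitive measure of Eq. (1) for brevity.
    """
    n = len(left_row)
    xr = np.arange(len(right_row))
    pi = np.full(n, np.inf)
    for m in slants:
        for d in range(d_range[0], d_range[1] + 1):
            xl = np.round(m * xr + d).astype(int)      # Eq. (2)
            valid = xl < n
            diff = np.abs(left_row[xl[valid]] - right_row[xr[valid]])
            np.minimum.at(pi, xl[valid], diff)         # keep best match per left pixel
    return pi

def normalize(pi):
    # Eq. (4)-style min-max normalization (applied per scanline here for simplicity).
    return 1.0 - (pi - pi.min()) / (pi.max() - pi.min())

# Example on two random scanlines of a rectified image pair
print(normalize(scanline_dissimilarity(np.random.rand(100), np.random.rand(100))).shape)
```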

2.2.2 Patch Level


Since matching individual pixel intensities is very sensitive to noise such as
lighting variation, we also compute an adaptive patch-based dissimilarity $\widetilde{PA}$ using
local facial features.

For the $i$th pixel in the left $s$th scanline and the $j$th pixel in the right $s$th scanline, we
compute the dissimilarity $PA(i, s)$ by matching patches around the points $(i, s)$ and
$(j, s)$ between the images using an LBP feature vector (59 dimensions in our
experiments).

However, this requires us to account for the effects of slant on patch size. The
patch in the left image is of fixed size, while the original patch size in the right image
should be determined by the horizontal slant.

We use $m$ to determine the patch size in the right image. For example, if the size of
the patch in the left image is fixed at $9 \times 21$, the size of the patch in the right image is
therefore $9m \times 21$. We then use interpolation to create a matching patch in the right
image and resize it to be the same size as the patch in the left image. Similarly, after
we obtain the dissimilarity value for each pixel in the left image, we choose the
minimum dissimilarity as the value of $PA(i, s)$ and normalise as follows:

$$\widetilde{PA}(i, s) = 1 - \frac{PA(i, s) - \min(PA)}{\max(PA) - \min(PA)} \qquad (5)$$

2.2.3 Similarity Value Computation


For the $i$th pixel in the left $s$th scanline, a weighted fusion is made to compute the
matching cost matrix:

$$MATCH(i, s) = (1 - \alpha) \cdot \widetilde{PI}(i, s) + \alpha \cdot \widetilde{PA}(i, s) \qquad (6)$$

We compute a similarity value $sv(I_1, I_2)$ by aggregating over all scanlines in the
left image. This value tells us how well image $I_1$ and image $I_2$ match. Since each
similarity is going to be compared to costs matched over scanlines of potentially
different lengths, we again use a normalization strategy:

$$sv(I_1, I_2) = \frac{\sum_s \sum_i MATCH(i, s)}{\sum_s \left( |I_{1,s}| + |I_{2,s}| \right)} \qquad (7)$$

where $|I_{1,s}|$ and $|I_{2,s}|$ denote the lengths of the $s$th scanlines of the two images.

2.3 Face Recognition


Since we do not know which image plays the role of the left image and which the
right image in the stereo setup, it is better to try both options. We also use a flip
operation which produces a left-right reflection of the image. Faces are approximately
vertically symmetric, so flipping is helpful when the two views see mainly different
sides of the face.
Given two images $I_1$ and $I_2$, we define the similarity of the two images as:

$$\mathrm{similarity}(I_1, I_2) = \max\big\{\, sv(I_1, I_2),\; sv(I_2, I_1),\; sv(\mathrm{flip}(I_1), I_2),\; sv(I_2, \mathrm{flip}(I_1)) \,\big\} \qquad (8)$$

Finally, we perform recognition simply by matching a probe image to the most simi-
lar image in the gallery.
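A minimal sketch of this decision rule (Eq. (8)), assuming a function sv implementing the scanline aggregation of Eq. (7) is available; the stand-in sv used in the example is purely illustrative and not the paper's cost computation.

```python
import numpy as np

def similarity(i1, i2, sv):
    """Eq. (8): try both image orderings and a left-right flip of I1."""
    flipped = np.fliplr(i1)
    return max(sv(i1, i2), sv(i2, i1), sv(flipped, i2), sv(i2, flipped))

def recognize(probe, gallery, sv):
    """Identify the probe with the gallery image of maximum similarity."""
    return int(np.argmax([similarity(probe, g, sv) for g in gallery]))

# Toy example with a stand-in sv that just compares mean intensities
toy_sv = lambda a, b: -abs(float(a.mean()) - float(b.mean()))
gallery = [np.random.rand(64, 64) for _ in range(5)]
print(recognize(np.random.rand(64, 64), gallery, toy_sv))
```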

3 Experiments

3.1 CMU PIE Database

The CMU PIE [13] database consists of 13 poses of which 9 have approximately the
same camera altitude (poses: c34, c14, c11, c29, c27, c05, c37, c25 and c22). For each
pose of the same person, there are also 21 images, each with different illumination.
We conducted experiments to compare our method APP-SM with other methods. The
thumbnails used were generated as described in Section 2.1. A number of prior
experiments have been done using the CMU PIE database, but under somewhat
different experimental conditions. We have run our own algorithm under a variety of
conditions so that we may compare to these.
First, we only tested on individuals 35-68 from the PIE database to compare our
method with six others. Specifically, we selected each gallery pose as one of the 13
PIE poses and the probe pose as one of the remaining 12 poses, for a total of 156 gal-
lery-probe pairs. We evaluated the accuracy of our method in this setting and com-
pared to the results in [6, 7, 9]. Table 1 summarizes the average recognition rates.

Table 1. A comparison on 34 subjects of CMU PIE

Method Accuracy (%)


Eigenfaces [7] 16.6
FaceIt [7] 24.3
Eigen light-fields (Multi-point norm.) [7] 66.3
DP-SM (Castillo and Jacobs [6]) 86.8
Partial Least Squares [9] 90.1
Proposed APP-SM 92.3

Then, we evaluated simultaneous variations in pose and illumination to illustrate
that our method can work in more realistic situations. We compare our method to
DP-SM [6], which also takes advantage of a stereo algorithm, and to Bayesian Face
Subregions (BFS) [7], which computes the reflectance and illumination fields from
real images. The gallery is the frontal pose and illumination. For each probe pose, the
accuracy is determined by averaging the results over all 21 different illumination
conditions. It can be observed from Fig. 2 that our pixel/patch-based synthesizer and
normalization strategy provide good robustness to local lighting changes.

Fig. 2. A comparison with the methods of Castillo et al. [6] (DP-SM) and Gross et al. [7] (BFS): recognition accuracy (%) over the probe poses c34, c31, c14, c11, c29, c09, c27, c07, c05, c37, c02, c25, c22 for the proposed method, DP-SM [6], and BFS [7]. The gallery pose is frontal (c27). We report the average over the 21 illuminations.

3.2 FERET Database

We also evaluate our method on the FERET face image database [14]. This database is
one of the largest publicly available databases. It has been used for evaluating face
recognition algorithms and displays diversity across gender, ethnicity, and age.

Table 2. A comparison on 200 subjects of FERET

Method bh bg bf be bd bc Avg(%)
Zhang et al. [15] 62.0 91.0 98.0 96.0 84.0 51.0 80.5
Gao et al. [16] 78.5 91.5 98.0 97.0 93.0 81.5 90.0
Sarfraz et al. [8] 92.4 89.7 100 98.6 97.0 89.0 94.5
Mostafa et al. [17] 87.5 98.0 100 99.0 98.5 82.4 94.2
Proposed APP-SM 92.0 94.5 100 98.8 96.0 89.5 95.1

In our experiments, we used all 200 subjects at 7 different poses. The pose angles
range from +60° to -60°. The frontal image ba of each subject is used as the gallery
and the remaining 6 images per subject (bh, bg, bf, be, bd, bc) were used as probes
(1,200 in total). Table 2 shows that our APP-SM performs as well as any prior
method based on image comparison. However, it should be noticed that APP-SM
needs no training. It is much simpler and more straightforward, which is very
important for applications.

4 Conclusion

In this paper, we proposed a method using adaptive pixel/patch-based stereo matching


(APP-SM) for 2D face recognition. Compared to existing methods, our APP-SM is
simple and performs very well. There is still a lot of room for improvement
in our method. For example, some strategies can be pursued to automatically select
the parameters. Also it remains a future direction to determine how best to incorporate
learning into it.

Acknowledgment. This research was supported by the “Strategic Priority Research


Program - Network Video Communication and Control” of the Chinese Academy of
Sciences (Grant No. XDA06030900).

References
1. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: Proceedings 1991 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (91CH2983-5),
pp. 586–591 (1991)
2. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition
using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine
Intelligence 19, 711–720 (1997)

3. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns. In:
Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481. Springer,
Heidelberg (2004)
4. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust Face Recognition via
Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 31, 210–227 (2009)
5. Chai, X., Shan, S., Chen, X., Gao, W.: Locally linear regression for pose-invariant face
recognition. IEEE Transactions on Image Processing 16, 1716–1725 (2007)
6. Castillo, C.D., Jacobs, D.W.: Using Stereo Matching with General Epipolar Geometry for
2D Face Recognition across Pose. IEEE Transactions on Pattern Analysis and Machine In-
telligence 31, 2298–2304 (2009)
7. Gross, R., Matthews, S.B.I., Kanade, T.: Face recognition across pose and illumination. In:
Jain, A.K., Li, S.Z. (eds.) Handbook of Face Recognition. Springer-Verlag New York, Inc.
(2005)
8. Sarfraz, M.S., Hellwich, O.: Probabilistic learning for fully automatic face recognition
across pose. Image and Vision Computing 28, 744–753 (2010)
9. Sharma, A., Jacobs, D.W.: Bypassing Synthesis: PLS for Face Recognition with
Pose, Low-Resolution and Sketch. In: 2011 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 593–600 (2011)
10. Ashraf, A.B., Lucey, S., Tsuhan, C.: Learning patch correspondences for improved view-
point invariant face recognition. In: 2008 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), p. 8 (2008)
11. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge
University Press (2003)
12. Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sam-
pling. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 401–406
(1998)
13. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression database. IEEE
Transactions on Pattern Analysis and Machine Intelligence 25, 1615–1618 (2003)
14. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for
face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 22, 1090–1104 (2000)
15. Wenchao, Z., Shiguang, S., Wen, G., Xilin, C., Hongming, Z.: Local Gabor binary pattern
histogram sequence (LGBPHS): a novel non-statistical model for face representation and
recognition. In: Proceedings of the Tenth IEEE International Conference on Computer Vi-
sion, vol. 781, pp. 786–791 (2005)
16. Gao, H., Ekenel, H.K., Stiefelhagen, R.: Pose Normalization for Local Appearance-Based
Face Recognition. In: Tistarelli, M., Nixon, M.S. (eds.) ICB 2009. LNCS, vol. 5558,
pp. 32–41. Springer, Heidelberg (2009)
17. Mostafa, E.A., Farag, A.A.: Dynamic weighting of facial features for automatic pose-
invariant face recognition. In: 2012 Canadian Conference on Computer and Robot Vision,
pp. 411–416 (2012)
A Machine Learning Approach
for Displaying Query Results in Search Engines

Tunga Güngör1,2
1
Boğaziçi University, Computer Engineering Department, Bebek,
34342 İstanbul, Turkey
2
Visiting Professor at Universitat Politècnica de Catalunya, TALP Research Center,
Barcelona, Spain
[email protected]

Abstract. In this paper, we propose an approach that displays the results of a


search engine query in a more effective way. Each web page retrieved by the
search engine is subjected to a summarization process and the important content
is extracted. The system consists of four stages. First, the hierarchical structures
of documents are extracted. Then the lexical chains in documents are identified
to build coherent summaries. The document structures and lexical chains are
used to learn a summarization model by the next component. Finally, the
summaries are formed and displayed to the user. Experiments on two datasets
showed that the method significantly outperforms traditional search engines.

1 Introduction

A search engine is a web information retrieval system that, given a user query,
outputs brief information about a number of documents that it thinks relevant to the
query. By looking at the results displayed, the user tries to locate the relevant pages.
The main drawback is the difficulty of determining the relevancy of a result from the
short extracts. The work in [14] aims at increasing the relevancy by accompanying the
text extracts by images. In addition to important text portions in a document, some
images determined by segmenting the web page are also retrieved. Roussinov and
Chen propose an approach that returns clusters of terms as query results [12]. A
framework is proposed and its usefulness is tested with comprehensive experiments.
Related to the summarization of web documents, White et al. describe a system that
forms hierarchical summaries. The documents are analyzed using the DOM (document
object model) tree and their summaries are formed. A similar approach is used in [10],
where a rule-based method is employed to obtain the document structures. Sentence
weighting schemes have been used for the identification of important sentences [6,9,16].
In one study, the "table of contents"-like structure of HTML documents was incorporated
into summaries [1]. Yang and Wang developed a fractal summarization method where
generic summaries are created based on structures of documents [15]. These studies
focus on general-purpose summaries, not tailored to particular user queries. There
exist some studies on summarization of XML documents. In [13], query-based
summarization is used for searching XML documents. In another study, a machine


learning approach was proposed based on document structure and content [2]. The
concept of lexical chains was also used for document summarization. Berker and
Güngör used lexical chaining as a feature in summarization [3]. In another work, a
lexical chain formation algorithm based on relaxation labeling was proposed [5]. The
sentences were selected according to some heuristics.
In this paper, we propose an approach that displays the search results as long
extracts. We build a system that creates a hierarchical summary for each retrieved
document. The cohesion of the summaries is maintained by using lexical chains.
Experiments on standard query sets showed that the method significantly outperforms
traditional search engines and that the lexical chain component is an important factor.

2 Proposed Summarization Framework

The architecture of the summarization framework is shown in Fig. 1. The system is
formed of four main components. The first component is the structure extractor, where
the document structure is analyzed and converted into a hierarchical representation.
The lexical chain builder processes the document content using WordNet and forms
the lexical chains. The model builder employs a learning algorithm to learn a
summarization model using the structures and lexical chains. Finally, the summarizer
forms the summaries of the documents in a hierarchical representation using the
learned model.

Fig. 1. System architecture: training and test data are processed by the structure extractor and the lexical chain builder; the model builder learns the summarization model from the training data, and the summarizer applies it to the test data.

2.1 Extracting Structures of Documents


We simplify the problem of document structural processing by dividing the whole
process into a number of consecutive steps. Fig. 2 shows an example web document

that includes different types of parts. The first process is extracting the underlying
content of the document. We parse a given web document and build its DOM tree
using the Cobra open source toolkit [4]. We then remove the nodes that contain
non-textual content by traversing the tree. After this process, we obtain a tree that
includes only textual elements in the document and the hierarchical relations between
them. The result of the simplification process for the document in Fig. 2 is shown in
Fig. 3.

Fig. 2. An example web document

Fig. 3. Simplified DOM tree for the document in Fig. 2: the root "Biological Fathers' Rights in Adoption" with child nodes for headings (e.g. "Content Categories", "Biological Fathers' Rights", "The Baby M Case") and their textual content.

The tree structure obtained does not correspond to the actual hierarchical structure.
We identify the structure in three steps by using a learning approach. We consider the
document structure as a hierarchy formed of sections where each section has a title
(heading) and includes subsections. The first step is the identification of headings,
which is a binary classification problem (heading or non-heading). For each textual unit in a
document, we use the features shown in Table 1. The second step is determining the
hierarchical relationships between the units. We first extract the parent-child relations
between the heading units. This is a learning problem where the patterns of heading–
subheading connections are learned. During training, we use the actual parent-child
connections between headings as positive examples and other possible parent-child
connections as negative examples. We use the same set of features shown in Table 1.
In the last step, the non-heading units are attached to the heading units. For this
purpose, we employ the same approach used for heading hierarchy.

Table 1. Features used in the machine learning algorithm

Heading <h1> – <h6>


Emphasis <b>, <i>, <u>, <em>, <strong>, <big>, <small>
Font <font size>, <font color>, <font face>
Indentation <center>
Anchor <a>, <link>
Length length of the unit (in characters)
Sentence count number of sentences in the unit
Punctuation punctuation mark at the end of the unit
Coordinate x and y coordinates of the unit
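As an illustration of how such features could be gathered for one textual unit, the sketch below uses BeautifulSoup; the paper itself builds the DOM tree with the Cobra toolkit and also uses coordinate features from rendering, which are omitted here, so both the library choice and the exact feature encoding are assumptions rather than the paper's implementation.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS_TAGS = {"b", "i", "u", "em", "strong", "big", "small"}

def unit_features(tag):
    """Collect features of one textual unit in the spirit of Table 1.

    Coordinate features are omitted because they require a rendering engine;
    the encoding below is an illustrative choice, not the paper's exact one.
    """
    text = tag.get_text(" ", strip=True)
    ancestors = {p.name for p in tag.parents if p.name}
    return {
        "heading": int(tag.name in HEADING_TAGS or bool(ancestors & HEADING_TAGS)),
        "emphasis": int(tag.name in EMPHASIS_TAGS or bool(ancestors & EMPHASIS_TAGS)),
        "font": int(tag.name == "font" or "font" in ancestors),
        "indentation": int(tag.name == "center" or "center" in ancestors),
        "anchor": int(tag.name in {"a", "link"} or "a" in ancestors or "link" in ancestors),
        "length": len(text),
        "sentence_count": sum(text.count(c) for c in ".!?"),
        "punctuation": int(text.endswith((".", "!", "?", ":"))),
    }

# Example
soup = BeautifulSoup("<h2><b>Biological Fathers' Rights</b></h2>", "html.parser")
print(unit_features(soup.find("b")))
```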

3 Identification of Lexical Chains


A lexical chain is a set of terms that are related to each other in some context. We
make use of lexical chains in addition to other criteria for summarization. We process
all the documents and construct a set of lexical chains from the documents’ contents.
The idea is forming the longest and strongest chains so that the important relations
between different parts of the documents can be captured. To determine the chains,
we process the terms by using WordNet [11] to identify different term relations. We
consider only the nouns. The documents are parsed using a part-of-speech tagger
(http://alias-i.com/lingpipe). The relations we consider are the synonymy, hypernymy,
and hyponymy relations. The words are processed in turn, and each word is placed in
the lexical chain that holds the largest number of words related to the candidate word.
After the lexical chains are formed, each chain is given a score as follows:

$$\mathrm{score}(c) = \sum_{t_i, t_j \in c} \mathrm{rel\_score}(t_i, t_j) \qquad (1)$$

The chain score is calculated by summing up the scores of all pairs of terms in the
chain. The score of a pair of terms ti and tj depends on the relation between them;
we use a fixed value for each relation type. Once the chains are scored, we select only
the strongest chains for summarization. A lexical chain is accepted as a strong chain
if its score is more than two standard deviations above the average of the lexical
chain scores.
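A minimal sketch of this scoring and selection step; the per-relation score values are illustrative assumptions (the paper only states that a fixed value is used per relation type), and the WordNet queries are abstracted behind a hypothetical relation function.

```python
import statistics
from itertools import combinations

# Fixed scores per relation type; the concrete values are assumptions, the
# paper only states that a fixed value is used for each relation.
RELATION_SCORES = {"synonym": 1.0, "hypernym": 0.5, "hyponym": 0.5}

def chain_score(chain, relation):
    """Sum the relation-dependent scores over all term pairs of a chain (Eq. (1)).

    `relation(t1, t2)` is a hypothetical helper returning 'synonym',
    'hypernym', 'hyponym' (e.g. looked up in WordNet) or None.
    """
    return sum(RELATION_SCORES.get(relation(t1, t2), 0.0)
               for t1, t2 in combinations(chain, 2))

def strong_chains(chains, relation):
    """Keep chains scoring more than two standard deviations above the mean."""
    scores = [chain_score(c, relation) for c in chains]
    threshold = statistics.mean(scores) + 2 * statistics.stdev(scores)
    return [c for c, s in zip(chains, scores) if s > threshold]

# Toy example with a stand-in relation function
toy_relation = lambda a, b: "synonym" if a[0] == b[0] else None
chains = [["car", "cab", "cart"], ["dog", "tree"], ["sun", "moon"], ["pen", "ink"]]
print([chain_score(c, toy_relation) for c in chains])
```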

4 Learning Feature Weights and Summarization of Documents

In this work, we aim at producing summaries that take into account the structure of
web pages and that will be shown to the user as the result of a search query. We use the
criteria shown in Table 2 for determining the salience of sentences in a document. For
each feature, the table gives the feature name and an explanation of the parameters
used in calculating the feature value for a sentence S. The score values of the features
are normalized to the range [0,1]. We learn the weight of each feature using a genetic
algorithm. Once the feature weights are learned, the score of a sentence can be
calculated as

$$\mathrm{score}(S) = \sum_i w_i \cdot f_i(S) \qquad (2)$$

where $w_i$ denotes the weight of the corresponding feature $f_i$. Given a document as a
result of a query, the sentences in the document are weighted according to the learned
feature weights. The summary of the document is formed using a fixed summary size.
While forming the summary, the hierarchical structure of the document is preserved
and each section is represented by a number of sentences proportional to the
importance of the section (the total score of the sentences in the section).
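The sketch below illustrates the weighted scoring of Eq. (2) and the proportional allocation of summary sentences to sections; the feature values, the mapping of the reported weight names to features, and the rounding used in the allocation are illustrative assumptions.

```python
def sentence_score(feature_values, weights):
    """Eq. (2): weighted sum of the normalized feature scores of one sentence."""
    return sum(weights[name] * value for name, value in feature_values.items())

def allocate_sentences(section_scores, summary_size):
    """Give each section a share of the summary proportional to the total
    score of its sentences; simple rounding is used here, so the counts may
    not add up exactly to summary_size."""
    total = sum(section_scores.values())
    return {sec: int(round(summary_size * score / total))
            for sec, score in section_scores.items()}

# Example using the weights reported in Sect. 5 (name-to-feature mapping assumed)
weights = {"location": 5, "tfidf": 7, "heading": 8, "query": 12, "chain": 11}
features = {"location": 0.8, "tfidf": 0.4, "heading": 1.0, "query": 0.0, "chain": 0.6}
print(sentence_score(features, weights))
print(allocate_sentences({"intro": 14.2, "methods": 30.5, "results": 22.1}, summary_size=10))
```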

5 Experiments and Results

To evaluate the proposed approach, we identified 30 queries from the TREC (Text
Retrieval Conference) Robust Retrieval Tracks for the years 2004 and 2005. Each
query was given to the Google search engine and the top 20 documents retrieved for
each query were collected. Thus, we compiled a corpus formed of 600 documents.
The corpus was divided into training and test sets with an 80%-20% ratio.
For structure extraction, we used SVM-Light, an efficient algorithm for
binary classification [8]. The results are given in Table 3. Accuracy is measured by
dividing the number of correctly identified parent-child relationships by the total
number of parent-child relationships. The first row of the table shows the performance
of the proposed method. This figure is computed by considering each pair of nodes
independent of the others. A stronger success criterion is counting a connection to a
node as a success only if the node is in the correct position in the tree. The accuracy
under this criterion is shown in the second row, which indicates that the method
identified the correct path in most of the cases. The third result gives the accuracy
when only the heading units are considered. That is, the last step of the method
explained in Section 2 was not performed.

Table 2. Sentence features used in the summarization algorithm

Feature: Sentence location
Parameters: maxs: maximum score that a sentence can get; sdepth: section depth; spos: location of the sentence within the section

Feature: Term frequency-inverse document frequency (tf-idf)
Parameters: tf(ti): term frequency of term ti in the document; dnum: number of documents in the corpus; dnumti: number of documents in which ti occurs

Feature: Heading
Parameters: hterm(ti): 1 or 0 depending on whether ti occurs or does not occur in the corresponding heading

Feature: Query
Parameters: qterm(ti): 1 or 0 depending on whether ti occurs or does not occur in the query

Feature: Lexical chain
Parameters: tf(ti): term frequency of ti in the document

We use the document structures identified by the structure extractor component in


the summarization process. As lexical chains are formed, we use genetic algorithm to
learn the weights of the features. 50 documents selected randomly from the corpus
were summarized manually using a fixed summary length. The feature weights were
allowed to be in the range of 0-15. After training, we obtained the feature weights as
wl=5, wrf=7, wh=8, wq=12, and wlc=11. This shows that query terms are important in
determining the summary sentences. The lexical chain concept is also an important
tool for summarization. This is probably due to the combining effect of lexical chains
in the sense that they build a connection between related parts of a document and it is
preferable to include such parts in the summary to obtain coherent summaries.

Table 3. Results of the structure extractor

Method  Accuracy (%)
Document structure  76.47
Document structure (full path)  68.41
Sectional structure  78.11

Once the feature weights were determined, we formed the summaries of all the
documents in the corpus. For evaluation, instead of using a manually prepared
summarization data set, we used the relevance prediction method [9] adapted to a
search engine setting. In this method, a summary is compared with the original
document. If the user evaluates both of them as relevant, or both as irrelevant, to the
search query, then we consider the summary successful.
The evaluation was performed by two evaluators. For a query, the evaluator was
given the query terms, a short description of the query, and a guide that shows which
documents are relevant results. The evaluator is shown first the summaries of the 20
documents retrieved by the search engine for the query in random order and then the
original documents in random order. The user is asked to mark each document or
summary displayed as relevant or irrelevant for the query. The results are given in
Table 4. We use precision, recall and F-measure for the evaluation, as shown below:

$$\mathrm{precision} = \frac{|D_{rel} \cap S_{rel}|}{|S_{rel}|} \qquad (3)$$

$$\mathrm{recall} = \frac{|D_{rel} \cap S_{rel}|}{|D_{rel}|} \qquad (4)$$

$$F\text{-}\mathrm{measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (5)$$

where Drel and Srel denote, respectively, the set of documents and the set of summaries
relevant for the query.
The first row in the table shows the performance of the proposed method, which
obtains a success rate of about 80%. The second row shows the performance of the
Google search engine. We see that the outputs produced by the proposed system are
significantly better than those produced by a traditional search engine. This is due to
the fact that when the user is given a long summary that shows the document structure
and the important contents of the document, it becomes easier to determine the
relevancy of the corresponding page. Thus we can conclude that the proposed
approach yields an effective way of displaying query results to users.

Table 4. Results of the summarization system

                     Precision   Recall   F-measure
Proposed method          80.76    78.17       79.44
Search engine            63.57    58.24       60.79

6 Conclusions
In this paper, we built a framework for displaying web pages retrieved as a result of a
search query. The system makes use of the document structures and the lexical chains
extracted from the documents. The contents of web pages are summarized according
to the learned model by preserving the sectional layouts of the pages. The
experiments on two query datasets and a corpus of documents compiled from the
results of the queries showed that document structures can be extracted with 76%
accuracy. In the summarization experiments, we obtained nearly 80% success rates. A


comparison with a state-of-the-art search engine has shown that the method
significantly outperforms current search engines.
As a future work, we plan to improve the summarizer component by including new
features that can determine the saliency of sentences more effectively. Some semantic
features that take into account dependencies between sentences can be used. The
methods used in the proposed framework such as structural analysis and lexical chain
identification can also be utilized in other related areas. Another future work can be
making use of these methods in multi-document summarization or text categorization.

References
1. Alam, H., Kumar, A., Nakamura, M., Rahman, A.F.R., Tarnikova, Y., Wilcox, C.:
Structured and Unstructured Document Summarization: Design of a Commercial
Summarizer Using Lexical Chains. In: Proc. of the 7th International Conference on
Document Analysis and Recognition, pp. 1147–1150 (2003)
2. Amini, M.R., Tombros, A., Usunier, N., Lalmas, M.: Learning Based Summarisation of
XML Documents. Journal of Information Retrieval 10(3), 233–255 (2007)
3. Berker, M., Güngör, T.: Using Genetic Algorithms with Lexical Chains for Automatic
Text Summarization. In: Proc. of the 4th International Conference on Agents and Artificial
Intelligence (ICAART), Vilamoura, Portugal, pp. 595–600 (2012)
4. Cobra: Java HTML Renderer & Parser (2010),
http://lobobrowser.org/cobra.jsp
5. Gonzàlez, E., Fuentes, M.: A New Lexical Chain Algorithm Used for Automatic
Summarization. In: Proc. of the 12th International Congress of the Catalan Association of
Artificial Intelligence (CCIA) (2009)
6. Guo, Y., Stylios, G.: An Intelligent Summarisation System Based on Cognitive
Psychology. Information Sciences 174(1-2), 1–36 (2005)
7. Hobson, S.P., Dorr, B.J., Monz, C., Schwartz, R.: Task-based Evaluation of Text
Summarisation Using Relevance Prediction. Information Processing and
Management 43(6), 1482–1499 (2007)
8. Joachims, T.: Advances in Kernel Methods: Support Vector Learning. MIT (1999)
9. Otterbacher, J., Radev, D., Kareem, O.: News to Go: Hierarchical Text Summarisation for
Mobile Devices. In: Proc. of 29th Annual ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 589–596 (2006)
10. Pembe, F.C., Güngör, T.: Structure-Preserving and Query-Biased Document
Summarisation for Web Searching. Online Information Review 33(4) (2009)
11. Princeton University, About WordNet (2010), http://wordnet.princeton.edu
12. Roussinov, D.G., Chen, H.: Information Navigation on the Web by Clustering and
Summarizing Query Results. Information Processing and Management 37 (2001)
13. Szlavik, Z., Tombros, A., Lalmas, M.: Investigating the Use of Summarisation for
Interactive XML Retrieval. In: Proc. of ACM Symposium on Applied Computing (2006)
14. Xue, X.-B., Zhou, Z.-H.: Improving Web Search Using Image Snippets. ACM
Transactions on Internet Technology 8(4) (2008)
15. Yang, C.C., Wang, F.L.: Hierarchical Summarization of Large Documents. Journal of
American Society for Information Science and Technology 59(6), 887–902 (2008)
16. Yeh, J.Y., Ke, H.R., Yang, W.P., Meng, I.H.: Text Summarisation Using a Trainable
Summariser and Latent Semantic Analysis. Information Processing and
Management 41(1), 75–95 (2005)
A New Pixel-Based Quality Measure
for Segmentation Algorithms Integrating
Precision, Recall and Specificity

Kannikar Intawong, Mihaela Scuturici, and Serge Miguet

Université de Lyon, CNRS


Université Lumière Lyon 2, LIRIS UMR5205
5, Av. Pierre Mendès-France, 69676, Bron, France
{kannikar.intawong,mihaela.scuturici,serge.miguet}@univ-lyon2.fr

Abstract. There are several approaches for performance evaluation of


image processing algorithms in video-based surveillance systems: Preci-
sion/Recall, Receiver Operating Characteristics (ROC), F-measure, Jac-
card Coefficient, etc. These measures can be used to find good values for
input parameters of image segmentation algorithms. Different measures
can give different values of these parameters, considered as optimal by
one criterion, but not by another. Most of the times, the measures are
expressed as a compromise between two of the three aspects that are im-
portant for a quality assessment: Precision, Recall and Specificity. In this
paper, we propose a new 3-dimensional measure (Dprs ), which takes into
account all of the three aspects. It can be considered as a 3D generaliza-
tion of 2D ROC analysis and Precision/Recall curves. To estimate the
impact of parameters on the quality of the segmentation, we study the
behavior of this measure and compare it with several classical measures.
Both objective and subjective evaluations confirm that our new measure
allows to determine more stable parameters than classical criteria, and
to obtain better segmentations of images.

Keywords: Segmentation, quality measures, F-measure, Jaccard


Coefficient (JC), Percentage of Correct Classification (PCC).

1 Introduction
There is an increasing need for automated processing of massive amounts of
visual information generated by video surveillance systems. The goal here is to
verify that the automation of a number of tasks for image and video analysis
is reliable, in order to reduce human supervision and to assist decision-making.
The quantitative performance evaluation methods should make it possible to
compare the results provided by different segmentation algorithms. The most
commonly used methods in the literature attempt to find a compromise between
Precision, Recall (also called Sensitivity) and Specificity. But most widely used
representations only consider two of the three indicators: Precision/Recall space
and ROC curves (which represent only the Sensitivity=Recall and Specificity).

In the same way, the F-measure is a single value quality measure based on
Recall and Precision only, that completely ignores the Specificity of segmentation
algorithms.
In the context of segmentation quality evaluation in video surveillance, we
observed there are situations where all these measures disagree in the sense that
they can give very different values of parameters. In this case, the question we
asked ourselves and we tried to answer is “which measure behaves better than
others and why”? We compared the optimal parameters given by several of these
measures. We propose a segmentation quality measure seeking a compromise
between the three indicators above. We show, in a case study, our measure gives
more satisfactory results than the most commonly used measures (F-measure,
Jaccard coefficient (JC), percentage of correct classification (PCC)).
The outline of the paper is as follows: Section 2 presents related work in the
field of performance evaluation of pixel-based segmentation algorithms. Sec-
tion 3 explains the methodology of evaluation and introduces a new measure
which is shown to give more often than the others, better parameters for seg-
mentation algorithms. Experimental results and conclusions are presented in
Sections 4 and 5, respectively.

2 Related Work

Nascimento et al. compare in [1] the performance of different segmentation algo-


rithms. They investigate three different approaches: pixel-based, template-based
and object-based performance metrics. Here we focus on pixel-based evaluation
metrics which compare a classic binary detection with a ground truth. Three
measures for quantifying the performances of a classifier are classically used in
the literature [2,3]. The most widely used measure is the Percentage of Correct
Classification (PCC). PCC does not allow a good quality estimation when the
classes are unbalanced (e.g. scenes where the background represents at least
95% of pixels). This problem is solved by the Jaccard Coefficient (JC) which
only considers the foreground. This measure eliminates the consequence of the
big volume of true negatives but it does not take into account a bad background
detection. Another measure, the Yule Coefficient (YC), tries to avoid the draw-
backs of these measures by giving the same weight to the two classes, but it
cannot be used for images containing only background.
Several authors have shown the benefit of using two dimensions to describe
the behavior of their algorithms: in ROC curves [4], the True Positive Rate
(TPR) is plotted against the false positive rate (FPR). For instance, in the
case of video surveillance applications, the user can choose the best compromise
between probability of miss detections and rate of false alarms. However, in
the case of unbalanced classes, ROC curves present weaknesses. In such cases,
[5,6] show the advantages of Precision/Recall curves to analyze the behavior of
segmentation algorithms. Recall is equal to the TPR used in the ROC curves,
but Precision is different from FPR. Two different algorithms might present the
same Precision/Recall graph, because they have the same capacity to detect
the foreground, but one of them can perform worse than the other if the
background is not well detected. In this case, the two algorithms will differ in
terms of Specificity. After noting that the TPR dimension used in ROC curves
is equal to the Recall dimension of Precision/Recall graphs, we propose to build
a quality measure that takes into account the three dimensions involved in the
previous two approaches. A 3D generalization of ROC curves has already
been proposed by [7]. They use a soft decision threshold parameter as a third
dimension, which helps to take the final binary decision. This differs from our
approach and does not take into account the true negative rate; therefore,
in our context of video surveillance applications, it has the same drawbacks as
standard ROC analysis.

3 Evaluation Measure
Our goal is to find the best way to measure the quality of a segmentation algo-
rithm having one or more parameters and to find the best values for these pa-
rameters. Most of the authors compare the results of their algorithms to a ground
truth which is considered as the ideal segmentation. This ground truth can be
obtained from manual segmentations done by human users, but this can be very
time-consuming, especially for large video sequences. Alternatively, we can use
synthetic sequences where the ground truth is known, since the moving objects
are generated by a computer algorithm. This is the method we have used in this
paper. The results of any segmentation algorithm vary as a function of the val-
ues of its parameters. The best parameter values minimize a distance or
maximize the similarity between the segmentation and the ground truth. Several
criteria for evaluating this proximity (F-measure, JC, PCC) behave differently,
leading to different choices of the best value of the algorithm’s parameters.
We consider the segmentation of images divided into two classes: foreground
and background. For a given image in a video sequence, we compare the results
of a binary segmentation S with the binary image of the ground truth T . A pixel
is represented in white if it is part of a moving object (foreground), and black
when it belongs to the background. A white pixel in S is called a positive. If it
is also white in T , then it is a true positive (TP), whereas if it is black in T ,
it is a false positive (FP). Symmetrically, a black pixel in S is a negative. If it
is also black in T , it is a true negative (TN), while if it is white in T , it is a
false negative (FN). We can then define the Precision (PR), Recall (RE) and
Specificity (SP) for each image:

PR = TP / (TP + FP)                                                     (1)

RE = TP / (TP + FN)                                                     (2)

SP = TN / (TN + FP)                                                     (3)
A perfect segmentation algorithm calculates an image S identical to the ground
truth T . Such an algorithm will give values of Precision, Recall and Specificity
equal to 1. The F-measure is the harmonic mean of Precision and Recall.
F = 2 · (PR · RE) / (PR + RE)                                           (4)

Other scalar values, such as the Percentage of Correct Classification (PCC) or the
Jaccard coefficient (JC), can also be defined.

PCC = (TP + TN) / (TP + FN + FP + TN)                                   (5)

JC = TP / (TP + FP + FN)                                                (6)
We propose to measure the quality of a segmentation as a Euclidean distance,
called Dprs, in the space of the indicators, between the point (PR, RE, SP) and
the ideal point (1, 1, 1):

Dprs = √((1 − PR)² + (1 − RE)² + (1 − SP)²)                             (7)

This can be seen as a 3D generalization of the use of 2D ROC curves or Preci-


sion/Recall diagrams for optimal parameter determination.
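A minimal sketch of these measures, assuming the segmentation S and the ground truth T are given as binary NumPy masks with foreground pixels set to True (degenerate images with an empty foreground or background are not handled here):

import numpy as np

def quality_measures(S, T):
    TP = np.sum(S & T)           # white in S and in T
    FP = np.sum(S & ~T)          # white in S, black in T
    TN = np.sum(~S & ~T)         # black in S and in T
    FN = np.sum(~S & T)          # black in S, white in T
    PR = TP / (TP + FP)
    RE = TP / (TP + FN)
    SP = TN / (TN + FP)
    return {"PR": PR, "RE": RE, "SP": SP,
            "F": 2 * PR * RE / (PR + RE),
            "PCC": (TP + TN) / (TP + FN + FP + TN),
            "JC": TP / (TP + FP + FN),
            "Dprs": np.sqrt((1 - PR) ** 2 + (1 - RE) ** 2 + (1 - SP) ** 2)}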

4 Experimental Results
In order to evaluate our segmentation quality measure, we segment several videos
of the synthetic Visage database [12]. We apply several morphological operations
in order to improve the segmentation results and compare the behavior of several
quality measures.
In order to segment moving objects we use the hierarchical background mod-
eling technique introduced by [8]. It is considered as one of the best background
modeling algorithms in the comparative study proposed by [9]. This method uses
coarse level contrast descriptors introduced by [11] whose evolution is represented
by mixtures of gaussians. This coarse-level representation is then combined with
the classical pixel-based mixture of gaussians [10]. We can thus identify the fore-
ground objects at coarse level, then detail shapes of foreground objects at pixel
level. Figure 1 presents some results of background subtraction algorithms.
Object detection by background modeling algorithms, used without post-
processing, very often leaves isolated pixels in the background, which are
wrongly considered as foreground objects. Conversely, holes in objects are classified
as background. The method most commonly used to overcome these drawbacks
is to apply mathematical morphology (erosion, dilation, opening, closing, . . . ).
In our experiments we vary the number of erosions (p) and dilations (q) applied
as post-processing of the segmentation. We search the parameter sets giving
the best possible values for each of the segmentation quality measures. In many
cases, the optimal parameters are significantly different from one measurement
to another, and do not coincide with the optimal (subjective) settings provided
by a human user.
We calculate the following measures: F-measure, JC, PCC and Dprs for all
images in a sequence, and for each set of parameters. We evaluate the mathe-
matical morphology (dilations followed by erosions) between (0, 0) and (10, 10)
on 10 synthetic videos composed of 1500 frames each (different scenes, sunny or


rainy weather, noisy camera). We select the optimal parameters for each frame.
F, PCC and JC are quality measures, whose value should be maximized. Instead,
Dprs is a distance from the ideal situation and its value has to be minimized. It
is interesting to note that the criteria may conclude at different optimal sets of
parameters. We show in Table 1 and Table 2 a set of experiments for which the
different criteria find very different optimal parameters. In Table 1, Dprs and
PCC exhibit an optimal parameter set equal to (4, 4) whereas F and JC suggest
that (10, 10) give better results. In Table 2, F, JC and Dprs coincide for choosing
(3, 3) as the optimal parameter values, whereas PCC finds (1, 1) to be better. In
order to illustrate the contribution of Table 2, the graph curves are presented
in Figure 3 and Figure 4 and the segmentation results which use optimal pa-
rameters determined by the F-measure, JC, PCC and Dprs are presented in
Figure 5. This first series of experiments shows that the optimal parameters se-
lected by Dprs are closer to those determined by users.
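The parameter search described above can be sketched as follows; here morphology is a placeholder standing for the morphological post-processing, quality_measures can be the sketch given above, and the sweep over equal numbers of dilations and erosions mirrors the (0, 0) to (10, 10) range used in the experiments:

def best_settings(S_raw, T, morphology, quality_measures, max_ops=10):
    best = {}  # measure name -> ((dilations, erosions), value)
    for p in range(max_ops + 1):
        S = morphology(S_raw, dilations=p, erosions=p)
        for name, value in quality_measures(S, T).items():
            if name not in best:
                best[name] = ((p, p), value)
                continue
            # F, PCC and JC are maximised; Dprs is a distance and is minimised
            better = value < best[name][1] if name == "Dprs" else value > best[name][1]
            if better:
                best[name] = ((p, p), value)
    return best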

(a) Original image (b) Ground truth image (c) Block-based [11]

(d) Mixture Of Gaussians [10] (e) Hierarchical [8] (f) Hierarchical+Morphology

Fig. 1. Segmentation results of different background modeling algorithms. Block-based


method result is not accurate enough (c). The Mixture of Gaussians algorithm is not as
reliable as the hierarchical method when illumination/weather conditions change (d):
there are plenty of false positives. The background model we use (e) is the intersection
of (c) and (d). (f) presents the segmentation with morphological post-treatment.

In order to validate this evaluation, we referred to what a human
user considers the best segmentation, through a voting system. A website has been devel-
oped for presenting the four segmented images using the best parameters chosen
by the four compared measures (PCC, F, JC and Dprs ) (the frames for which the
four measures agree are discarded). In these first series of experiments, the new
(a) (b) (c)


Fig. 2. Number of morphological operations (dilations, erosions): (a) (0, 0) (b) (4,4)
(c) (10,10)

Table 1. Comparison of different measures (sunny weather). PCC and Dprs give the
optimal values corresponding to the (4, 4) pair. Simultaneously, F-measure and JC give
the optimal values corresponding to the (10, 10) pair.
Standard Measures New Measure
Dilation Erosion TPR FPR Recall Precision F measure PCC JC Dprs
0 0 0.3614 0.00042 0.3614 0.6836 0.3003 0.9967 0.2946 0.9554
1 1 0.3818 0.00073 0.3818 0.6738 0.4023 0.9970 0.3881 0.9451
2 2 0.4057 0.00121 0.4057 0.66635 0.4765 0.9971 0.4007 0.9320
3 3 0.4806 0.00124 0.4806 0.6522 0.5064 0.9972 0.4233 0.8684
4 4 0.5176 0.00145 0.5176 0.6497 0.5267 0.9972 0.4684 0.8341
5 5 0.5360 0.00206 0.5360 0.6278 0.5561 0.9971 0.4899 0.8382
6 6 0.5587 0.00242 0.5587 0.5979 0.5668 0.9969 0.5028 0.8458
7 7 0.5703 0.00261 0.5703 0.5858 0.5722 0.9965 0.5089 0.8465
8 8 0.5891 0.00313 0.5891 0.5723 0.5754 0.9961 0.5132 0.8417
9 9 0.5911 0.00362 0.5911 0.5709 0.5879 0.9957 0.5161 0.8416
10 10 0.6018 0.00413 0.3614 0.5612 0.5980 0.9952 0.5165 0.8411

Table 2. Comparison of different measures (noisy camera). F-measure, JC and Dprs


give the optimal values corresponding to the (3, 3) pair, while only PCC gives the
optimal values corresponding to the (1, 1) pair.
Standard Measures New Measure
Dilation Erosion TPR FPR Recall Precision F measure PCC JC Dprs
0 0 0.3964 0.0005 0.3964 0.6396 0.7493 0.9973 0.3477 0.9645
1 1 0.6096 0.0009 0.6096 0.6500 0.6198 0.9989 0.5134 0.7413
2 2 0.7010 0.0013 0.7010 0.6350 0.6584 0.9987 0.5642 0.6654
3 3 0.7345 0.0017 0.7345 0.6155 0.6615 0.9985 0.5684 0.6517
4 4 0.7500 0.0020 0.7500 0.5986 0.6566 0.9982 0.5623 0.6535
5 5 0.7577 0.0024 0.7577 0.5862 0.6516 0.9979 0.5561 0.6586
6 6 0.7632 0.0029 0.7632 0.5724 0.6436 0.9975 0.5466 0.6674
7 7 0.7678 0.0034 0.7678 0.5602 0.6362 0.9970 0.5376 0.6754
8 8 0.7702 0.0040 0.7702 0.5469 0.6269 0.9965 0.5267 0.6869
9 9 0.7731 0.0046 0.7731 0.5369 0.6204 0.9959 0.5185 0.6945
10 10 0.7744 0.0052 0.7744 0.5296 0.6151 0.9954 0.5125 0.7012

(a) Precision-Recall (b) ROC

Fig. 3. Conventional 2D representations and graphical interpretation of optimal


parameter values for different measures
Fig. 4. Precision, Recall, Specificity space

measure appears to give results that are the closest to the human evaluation:
Dprs collects 37% of votes, PCC receives 29%, and F-measure and JC have a
comparable number of votes with only 17%.

(a) Original (b) Truth (c) Dprs (d) PCC (e) F (f) JC

Fig. 5. Example of optimal segmentations according to different criteria, presented to


user for subjective evaluation

5 Conclusions

In this paper we have presented a new measure (Dprs ) for qualitative evaluation
of the segmentation algorithms in complex video surveillance environments. It
is an extension of the traditional Precision-Recall methodology and represents
a compromise between three indicators: Recall, Precision and Specificity. The
third dimension - Specificity - takes into account the correct detection of the
background. This measure was compared with other performance measures (F-
measure, PCC and JC). It is important to note that each criterion may conclude
at different optimal sets of parameters. Experiments show that our measure
Dprs seems to give optimal parameters close to those determined by subjective


evaluations. Additional experiments should be done on larger scale data, as well
as on real data.

References
1. Nascimento, J., Marques, J.: Performance evaluation of object detection algorithms
for video surveillance. IEEE Transactions on Multimedia 8, 761–777 (2006)
2. Rosin, P.L., Ioannidis, E.: Evaluation of Global Image Thresholding for Change
Detection. Pattern Recognition Letters 24, 2345–2356 (2003)
3. Elhabian, S., El-Sayed, K.: Moving Object Detection in Spatial Domain using Back-
ground Removal Techniques - State-of-Art. Recent Patents on Computer Science 1,
32–54 (2008)
4. Gao, X., Boult, T.E., Coetzee, F., Ramesh, V.: Error analysis of background adap-
tation. In: Computer Vision and Pattern Recognition, vol. 1, pp. 503–510 (2000)
5. Davis, J., Burnside, E., Dutra, I., Page, D., Ramakrishnan, R., Santos Costa, V.,
Shavlik, J.: View learning for statistical relational learning: With an application
to mammography. In: Proceeding of the 19th International Joint Conference on
Artificial Intelligence, pp. 677–683 (2005)
6. Davis, J., Goadrich, M.: The Relationship between Precision-Recall and ROC
Curves. In: International Conference on Machine Learning, pp. 233–240 (2006)
7. Wang, S., Chang, C.I., Yang, S.C., Hsu, G.C., Hsu, H.H., Chung, P.C., Guo,
S.M., Lee, S.K.: 3D ROC Analysis for Medical Imaging Diagnosis. Engineering
in Medicine and Biology Society, 7545–7548 (2005)
8. Chen, Y.T., Chen, C.S., Huang, C.R., Hung, Y.P.: Efficient hierarchical method
for background subtraction. Pattern Recognition, 2706–2715 (2007)
9. Dhome, Y., Tronson, N., Vacavant, A.: A Benchmark for Background Subtraction
Algorithms in Monocular Vision: A Comparative Study. Image Processing Theory
Tools and Applications (IPTA), 7–10 (2010)
10. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time
tracking. In: IEEE Conference on Computer Vision and Pattern Recognition, pp.
246–252 (1999)
11. Huang, C.R., Chen, C.S., Chung, P.C.: Contrast context histogram – A discrimi-
nating local descriptor for image matching. In: Proceedings of IEEE International
Conference on Pattern Recognition, vol. 4, pp. 53–56 (2006)
12. Chateau, T.: Visage Challenge: Video-surveillance Intelligente: Systemes et AlGo-
rithmES (2012), http://visage.univ-bpclermont.fr/?q=node/2
A Novel Border Identification Algorithm
Based on an “Anti-Bayesian” Paradigm

Anu Thomas and B. John Oommen 

School of Computer Science, Carleton University, Ottawa, Canada: K1S 5B6

Abstract. Border Identification (BI) algorithms, a subset of Prototype


Reduction Schemes (PRS) aim to reduce the number of training vectors
so that the reduced set (the border set) contains only those patterns
which lie near the border of the classes, and have sufficient information
to perform a meaningful classification. However, one can see that the
true border patterns (“near” border) are not able to perform the task
independently as they are not able to always distinguish the testing sam-
ples. Thus, researchers have worked on this issue so as to find a way to
strengthen the “border” set. A recent development in this field tries to
add more border patterns, i.e., the “far” borders, to the border set, and
this process continues until it reaches a stage at which the classification
accuracy no longer increases. In this case, the cardinality of the border
set is relatively high. In this paper, we aim to design a novel BI algorithm
based on a new definition for the term “border”. We opt to select the
patterns which lie at the border of the alternate class as the border pat-
terns. Thus, those patterns which are neither on the true discriminant
nor too close to the central position of the distributions, are added to the
“border” set. The border patterns, which are very small in number (for
example, five from both classes), selected in this manner, have the po-
tential to perform a classification which is comparable to that obtained
by well-known traditional classifiers like the SVM, and very close to the
optimal Bayes’ bound.

1 Introduction
The objective of a PRS is to reduce the cardinality of the training set to be as
small as possible by selecting some training patterns based on various criteria,
as long as the reduction does not significantly affect the performance. Thus,
instead of considering all the training patterns for the classification, a subset of
the whole set is selected based on certain criteria. The learning (or training) is
then performed on this reduced training set, which is also called the “Reference”
set. This Reference set not only contains the patterns which are closer to the
true discriminant’s boundary, but also the patterns from the other regions of the
space that can adequately represent the entire training set.

The authors are grateful for the partial support provided by NSERC, the Natural
Sciences and Engineering Research Council of Canada.

Chancellor’s Professor; Fellow: IEEE and Fellow: IAPR. This author is also an
Adjunct Professor with the University of Agder in Grimstad, Norway.

Border Identification (BI) algorithms, which are a subset of PRSs, work with
a Reference set which only contains “border” points. Specializing this criterion,
the current-day BI algorithms, designed by Duch [1], Foody [2,3], and Li et al.
[4], attempt to select a Reference set which contains border patterns derived, in
turn, from the set of training patterns. Observe that, in effect, these algorithms
also yield reduced training sets. Once the Reference set is obtained, all of these
traditional methods perform the classification by invoking some sort of classifier
like the SVM, neural networks etc. As opposed to the latter, we are interested in
determining border patterns which, in some sense, are neither closer to the true
optimal classifier nor to the means, and which can thus better classify the entire
training set. Contrary to a Bayesian intuition, these border patterns have the
ability to accurately classify the testing patterns, as we shall presently demon-
strate. Our method is a combination of NN computations and (Mahalanobis)
multi-dimensional1 distance computations which yield the border points that
are subsequently used for the purpose of classification. The characterizing com-
ponent of our algorithm, referred to as ABBI, is that classification can be done
by processing the obtained border points by themselves without invoking, for
example, a subsequent SVM phase.
How then can one determine the border points themselves? This, indeed,
depends on the model of computation - for example, whether we are working
within the parametric or non-parametric model. The current paper deals with
the former model, where the information about the classes is crystallized in
the class-conditional distributions and their respective parameters, where the
training samples are used to estimate the parameters of these models. In this
paper, we have shown how the border points can be obtained by utilizing the
information gleaned from the estimated distributions. Observe that with regard
to classification and testing, all of these computations can be considered to be of
a “pre-processing” nature, and so the final scheme would merely be of a Nearest
Neighbor(NN)-sort. The details of how this is achieved is described in the paper.

2 A Novel Two-Class “Anti-Bayesian” BI Scheme

The Formal Algorithm. The problem of determining the border points for
the parametric model of computation can be solved for fairly complex scenarios.
When one examines the existing BI schemes, he observes that the information
that has been utilized to procure the border patterns is primarily (and indeed,
essentially) distance-based. In other words, the distances between the patterns
are evaluated independently, and the border patterns are obtained based on
such distances. The patterns obtained in this manner are considered as the new
training set, which reduces these BI schemes to be special types of PRSs, but with
the border patterns being the Reference set. However, as these border patterns
1
We also have some initial results in which the distance and optimizations are done
using lower-dimensional projections, the results of which are subsequently fused using
an appropriate fusion technique.
are only the “Near” ones, they do not possess sufficient information to train an
efficient classifier. We shall now rectify this.
We now mention a second major handicap associated with the traditional BI
schemes. Once they have computed the border points associated with the specific
classes, the traditional schemes operate by determining a “classifier” based on
the new set. In other words, they have to determine a classifying boundary (linear,
quadratic or SVM-based) to achieve this classification. As the reader will observe,
in our work, we attempt to circumvent this entire phase. Indeed, in our proposed
strategy, we merely achieve the classification using the final small subset of
border points – which entails a significant reduction in computation.
The reader should also observe that this final decision would involve NN-like
computations with a few points. The intriguing feature of these few points is that
they lie close to the boundary and not to the mean, implying an “anti-Bayesian”
philosophy [5,6,7].
In order to obtain the border patterns of the distributions ω1 and ω2 in an
“anti-Bayesian” approach, we make use of the axiom that the patterns that have
nearest neighbors from other classes along with the patterns of the same class
fall into a common region - which is, by definition, the overlapped region.
The proposed algorithm has 4 parameters, namely, J, J1 , J2 and K. First
of all, J denotes the number of border points that have to be selected from
each class. We understand that in the process of selecting the border points,
the training set must be “examined” so as to ignore the patterns which are not
relevant for the classification. As this decision is taken based on the border points
in and of themselves, we conclude that the patterns which are in the overlapping
region are not able to provide an accurate decision, and so these points have to
be ignored. Thus, for any X, those patterns with J2 or more NNs out of the J1
NNs, which are not from the same class as X, are ignored.
To be more specific, in order to eliminate the overlapping points, we first
determine J1 -NNs of every pattern X. If J2 or more of these NN patterns are
from the same class, this pattern X is added to the new training set. Once this
step is achieved, we are left with the training points which are not overlapping
with any other classes. Thereafter, we evaluate the (Mahalanobis) distance2 (MD)
of every pattern of the new training set with respect to the mean of both the
classes. Both of these phases distinguish our particular strategy. The patterns
which are almost equidistant from both the classes, and which are not determined
to be overlapping with respect to the other classes, are added to the Border set.
The process of determining the (Mahalanobis) distances with respect to both
the classes, is repeated for all the patterns of the new training set, and a decision
is made for each pattern based on the difference between these distances.
The two-dimensional view of this philosophy is depicted in Figures 1a - 1c. The
border patterns obtained by applying this method are also given in the figure,
where the border patterns of class ω1 are specified by rectangles, and those of
class ω2 are specified by circles. We now make the following observations:

2
Any well-defined norm, appropriate for the data distribution, can be used to quantify
this distance.
(a) Almost separable classes (b) Semi-overlapped classes (c) Overlapped classes

Fig. 1. Border patterns for separable and overlapped classes

1. If we examine Figure 1a, we can see that the border patterns that are speci-
fied by rectangles and circles are precisely those that lie at the true borders
of the classes.
2. However, if the classes are semi-overlapped, then the “more interior” sym-
metric percentiles, such as the (2/3, 1/3) percentiles, can perform a near-optimal classifica-
tion. This can be seen in Figure 1b. The patterns in this figure have more
overlap (the BD = 1.69), and the border points chosen are the ones which
lie just outside the overlapping region.
3. The same argument is valid for Figure 1c. In the OS-based classification, we
have seen that if the classes have a large overlap as in Figure 1c (in this case,
BD = 0.78), the border patterns again lie outside the overlapped region.

The algorithm for obtaining the border patterns, ABBI, is formally given in
Algorithm 1.
Contrary to the traditional BI algorithms, ABBI requires only a small number
of border patterns for the classification. For example, consider the Breast Cancer
data set which contains 699 patterns. A traditional BI algorithms will obtain a
border set of around 150 patterns for this data set. Furthermore, once these
methods have obtained the border points, they will have to generate a classifier
for the new reduced set to achieve the classification. As opposed to this, our
method requires only 20 border patterns, and the classification is based on the
five NN border patterns of the testing pattern.

3 Experimental Results

The proposed method ABBI has been tested on various data sets that include
artificial and real-life data sets obtained from the UCI repository [8]. ABBI has
also been compared with other well-known methods which include the NB, SVM,
and the kNN. In order to obtain the results, ABBI algorithm was executed 50
times with the 10-fold cross validation scheme.
Algorithm 1. ABBI(ω1, ω2 )
Input:

Data from two classes; ω1 , ω2 , whose means are M1 and M2 respectively.


Parameters: J1 , J, J2 , K: Small numbers

Assumption:

Dist computes the distance between two vectors.


DistDiff computes the difference in distances obtained with respect to M1 and M2

Notation:

N T R1 and N T R2 are the new training sets which do not contain points in the overlapped region.

Output:

The classification based only on the Border points

Method:
1: N T R1 ← ∅
2: N T R2 ← ∅
3: Divide points of ω1 into training and testing sets, T RP1 and T1 respectively
4: Divide points of ω2 into training and testing sets, T RP2 and T2 respectively
5: for all X ∈ T RP1 do
6: Compute J1 NNs of X
7: If J2 or more NNs are from class ω1 , N T R1 ← N T R1 ∪ X
8: end for
9: for all X ∈ T RP2 do
10: Compute J1 NNs of X
11: If J2 or more NNs are from class ω2 , N T R2 ← N T R2 ∪ X
12: end for
13: for all X ∈ N T R1 do
14: Dist(X, M1 )
15: Dist(X, M2 )
16: end for
17: for all X ∈ N T R2 do
18: Dist(X, M1 )
19: Dist(X, M2 )
20: end for
21: for all X ∈ N T R1 do
22: DistDiff(X)
23: end for
24: for all X ∈ N T R2 do
25: DistDiff(X)
26: end for
27: Add J points with minimum DistDiff from N T R1 and N T R2 to BI
28: Classify testing points using a K-NN based on the points in BI.
End Algorithm
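The following is a compact sketch of how Algorithm 1 could be realised, under simplifying assumptions: plain Euclidean distances are used in place of the Mahalanobis norm, the two classes are given as NumPy arrays, and each class is assumed to retain at least J non-overlapping points. Parameter names follow the algorithm (J1 neighbours examined, J2 same-class neighbours required, J border points per class, K neighbours in the final vote).

import numpy as np

def abbi_border_set(X1, X2, J1=10, J2=7, J=10):
    M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
    X = np.vstack([X1, X2])
    y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])
    pts, lbls = [], []
    for cls, Xc in ((0, X1), (1, X2)):
        kept = []
        for x in Xc:
            nn = np.argsort(np.linalg.norm(X - x, axis=1))[1:J1 + 1]   # J1 nearest neighbours
            if np.sum(y[nn] == cls) >= J2:        # keep only points outside the overlap region
                kept.append(x)
        kept = np.array(kept)
        diff = np.abs(np.linalg.norm(kept - M1, axis=1) - np.linalg.norm(kept - M2, axis=1))
        order = np.argsort(diff)[:J]              # points nearly equidistant from both means
        pts.append(kept[order])
        lbls.append(np.full(len(order), cls))
    return np.vstack(pts), np.concatenate(lbls)

def abbi_classify(x, border_pts, border_lbls, K=5):
    nn = np.argsort(np.linalg.norm(border_pts - x, axis=1))[:K]
    return int(np.round(border_lbls[nn].mean()))  # majority vote among the K border neighbours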

Artificial Data Sets: For a prima facie testing of artificial data, we generated
two classes that obeyed Gaussian distributions. To do this, we made use of a Uni-
form [0, 1] random variable generator to generate data values that follow a Gaus-
sian distribution. The expression z = √(−2 ln(u1)) cos(2πu2) is known to yield
data values that follow N (0, 1) [9]. Thereafter, by using the technique described
in [10], one can generate Gaussian random vectors which possess any arbitrary
mean and covariance matrix. In our experiments, since this is just for a prima
facie case, we opted to perform experiments for two-dimensional and three-
dimensional data sets. The respective means of the classes were [μ11 , μ12 ]T and
[μ21 , μ22 ]T for the two-dimensional data, and [μ11 , μ12 , μ13 ]T and [μ21 , μ22 , μ23 ]T
for the three-dimensional data. Further, the corresponding covariance matrices


of the two-dimensional classes had the forms:
Σ1 = [[a², αab], [αab, b²]],    Σ2 = [[b², αab], [αab, a²]]
The covariance matrices for the three-dimensional classes had the forms:
Σ1 = [[a², 0, αab], [0, 1, 0], [αab, 0, b²]],    Σ2 = [[b², 0, αab], [0, 1, 0], [αab, 0, a²]]
With regard to the cardinality of the data set, each of the classes had 200
instances in the corresponding two and three-dimensional space. For the distance
computations, we used the MD, which is based on the means and the covariance
matrices Σ1 and Σ2 . It is based on the correlations between the variables using
which different patterns can be identified and analyzed.
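A hedged sketch of this data generation, combining the Box-Muller expression with a Cholesky factorisation as one standard way of realising the transformation of [10] to an arbitrary mean and covariance; the concrete values of the means and of a, b and α below are purely illustrative:

import numpy as np

def box_muller(n, rng):
    u1, u2 = 1.0 - rng.random(n), rng.random(n)        # u1 in (0, 1] avoids log(0)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

def gaussian_class(n, mean, cov, rng=np.random.default_rng(0)):
    d = len(mean)
    Z = np.column_stack([box_muller(n, rng) for _ in range(d)])   # i.i.d. N(0, 1) columns
    L = np.linalg.cholesky(cov)                                   # cov = L L^T
    return Z @ L.T + np.asarray(mean)

# Illustrative two-dimensional classes with the covariance structure given above
a, b, alpha = 2.0, 1.0, 0.3
S1 = np.array([[a**2, alpha*a*b], [alpha*a*b, b**2]])
S2 = np.array([[b**2, alpha*a*b], [alpha*a*b, a**2]])
X1 = gaussian_class(200, [0.0, 0.0], S1)
X2 = gaussian_class(200, [3.0, 3.0], S2)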
In order to not make the paper too cumbersome, the specific details of the
values of the μ’s, a, b and α (for the means and covariances), are not included
here3 . However, what is crucial to guarantee “repeatability”, are the respective
values of the BD for each experimental setting, and these are clearly specified in
every single row.
Experimental Results: Artificial Data Sets. The experimental results ob-
tained for two dimensional artificial data sets can be seen in Table 1 and those
for three dimensional artificial data sets can be seen in Table 2.

Table 1. Results of the classification of two dimensional artificial data sets

BD 1NN 3NN SVM ABBI


4.52 100 100 100 100
2.94 99.10 99.20 99.25 99.25
1.69 95.30 96.50 97.00 96.40
0.78 84.15 86.05 88.25 88.0
0.45 73.55 75.45 81.50 80.55

Table 2. Results of the classification of three dimensional artificial data sets

Class Nature Average BD 1NN 3NN SVM ABBI


Separated 6.08 100 100 100 100
Semi-overlapped 2.64 96.92 97.67 97.81 95.67
Overlapped 2.42 94.50 95.50 96.50 94.72
Highly overlapped 1.43 83.50 87.23 88.79 85.20

By examining Tables 1 and 2, one can see that ABBI can achieve remarkable
classification when compared to that attained by the benchmark classifiers. For
3
These values can be included if requested by the Referees.
example, if we consider the case where the classes are separated by a BD of 1.69
in Table 1, ABBI achieves a classification accuracy of 96.40%, while the 3NN
achieves 96.50%. This is quite fascinating when we consider the fact that ABBI
performs the classification based on only 5 of the 10 border samples selected
from each class, whereas the classification of the NN involves the entire training set.
Real-life Data Sets: The data sets [8] used in this study have two classes, and
the number of attributes varies from four up to thirty two. The data sets are
described in Table 3.

Table 3. The Real-life data sets used in our experiments


Data set No. Instances No. Attributes No. Classes Attribute Type
WOBC 699 9 2 Integer
WDBC 569 32 2 Real
Diabetes 768 8 2 Integer, Real
Hepatitis 155 19 2 Categorical, Integer, Real
Iris 150 4 3 Real
Statlog (Heart) 270 13 2 Categorical, Real
Statlog (Australian Credit) 690 14 2 Categorical, Integer, Real
Vote 435 16 2 Categorical, Integer

Experimental Results: Real-life Data Sets. The results obtained for the
ABBI algorithm are tabulated in Table 4.

Table 4. Classification of Real Data

Data set kNN NB SVM ABBI


WOBC 96.60 96.40 95.99 95.80
WDBC 96.66 92.97 97.71 92.39
Diabetes 75.26 73.1098 76.70 72.30
Hepatitis 82.58 83.19 82.51 80.27
Iris 95.13 96.00 96.67 94.53
Statlog (Australian Credit) 85.90 87.40 85.51 78.85
Statlog (Heart) 84.40 83.00 85.60 82.50
Vote 94.2857 90.23 94.33 90.76

From the table of results, one can see that the proposed algorithm achieves
a comparable classification when compared to the other traditional classifiers,
which is particularly impressive because only a very few samples are involved in
the process. For example, for the WOBC data set, we can see that the new ap-
proach yielded an accuracy of 95.80% which should be compared to the accuracies
of the SVM (95.99%), NB (96.40%) and the kNN (96.60%). Similarly, for the Iris
data set, ABBI can achieve an accuracy of 94.53%, which is again comparable
to the performance of SVM (96.67%), NB (96.00%), and NN (95.13%).

4 Conclusions
The objective of BI algorithms is to reduce the number of training vectors by se-
lecting the patterns that are close to the class boundaries. However, the patterns
that are on the exact border of the classes (“near” borders) are not sufficient
to perform a classification which is comparable to that obtained based on the
centrally located patterns. In order to resolve this issue, researchers have tried to
add more patterns (“far” borders) to the “border” set so as to boost the quality
of the resultant border set. Thus, the cardinality of the resultant border set can
be relatively high. After obtaining such a large border set, a classifier has to be
generated for this set, to perform a classification.
In this paper, we have proposed a novel BI algorithm which involves the border
patterns selected with respect to a new definition of the term “border”. In line
with the newly proposed OS-based anti-Bayesian classifiers [5,6,7], we created
the “border” set by selecting those patterns which are close to the true border
of the alternate class. The classification is achieved with regard to these border
patterns alone, and the size of this set is very small, in some cases, as small as
five from each class. The resultant accuracy is comparable to that attained by
other well-established classifiers. The superiority of this method over other BI
schemes is that it yields a relatively small border set, and as the classification is
based on the border patterns themselves, it is computationally inexpensive.

References
1. Duch, W.: Similarity Based Methods: A General Framework for Classification,
Approximation and Association. Control and Cybernetics 29(4), 937–968 (2000)
2. Foody, G.M.: Issues in Training Set Selection and Refinement for Classification by
a Feedforward Neural Network. In: Proceedings of IEEE International Geoscience
and Remote Sensing Symposium, pp. 409–411 (1998)
3. Foody, G.M.: The Significance of Border Training Patterns in Classification by
a Feedforward Neural Network using Back Propagation Learning. International
Journal of Remote Sensing 20(18), 3549–3562 (1999)
4. Li, G., Japkowicz, N., Stocki, T.J., Ungar, R.K.: Full Border Identification for
Reduction of Training Sets. In: Bergler, S. (ed.) Canadian AI 2008. LNCS (LNAI),
vol. 5032, pp. 203–215. Springer, Heidelberg (2008)
5. Oommen, B.J., Thomas, A.: Optimal Order Statistics-based “Anti-Bayesian” Para-
metric Pattern Classification for the Exponential Family. Pattern Recognition (ac-
cepted for publication, 2013)
6. Thomas, A., Oommen, B.J.: Order Statistics-based Parametric Classification for
Multi-dimensional Distributions (submitted for publication, 2013)
7. Thomas, A., Oommen, B.J.: The Fundamental Theory of Optimal “Anti-Bayesian”
Parametric Pattern Classification Using Order Statistics Criteria. Pattern Recog-
nition 46, 376–388 (2013)
8. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010),
http://archive.ics.uci.edu/ml (April 18, 2013)
9. Devroye, L.: Non-Uniform Random Variate Generation. Springer, New York (1986)
10. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press, San Diego (1990)
Assessing the Effect of Crossing Databases
on Global and Local Approaches
for Face Gender Classification

Yasmina Andreu Cabedo, Ramón A. Mollineda Cárdenas,


and Pedro Garcı́a-Sevilla

Dep. de Lenguajes y Sistemas Informáticos, Universidad Jaume I


{yandreu, mollined, pgarcia}@uji.es

Abstract. This paper presents a comprehensive statistical study of the


suitability of global and local approaches for face gender classification from
frontal non-occluded faces. A realistic scenario is simulated with cross-
database experiments where acquisition and demographic conditions con-
siderably vary between training and test images. The performances of
three classifiers (1-NN, PCA+LDA and SVM) using two types of features
(grey levels and PCA) are compared for the two approaches. Supported by
three statistical tests, the main conclusion extracted from the experiments
is that if training and test faces are acquired under different conditions
from diverse populations, no significant differences exist between global
and local solutions. However, global methods outperform local models
when training and test sets contain only images of the same database.

Keywords: Face Gender Classification, Global/Local Representations,


Cross-database Experiments, Statistical Study.

1 Introduction
Automated face analysis has been extensively studied over the past decades.
Specifically, gender classification has attracted the interest of researchers for its
useful applications in many areas, such as, commercial profiling, surveillance and
human-computer interaction.
Contrary to what could be thought, gender classification should not be simply
considered as a 2-class version of a face recognition problem. While face recog-
nition searches for characteristics that make a face unique, gender classification
techniques look for common features shared among a group of faces (female or
male) [1]. Hence, face recognition solutions are not always suitable for solving
gender classification problems.
Although some researchers employ local descriptions for classifying gender
[2,3], most of the published works on face gender classification use global infor-
mation provided by the whole face [4]. Intuitively, holistic solutions seem to be
more likely to achieve higher classification rates, since global characterisations
provide configural information (i.e. relations among face parts) as well as featu-
ral (i.e. characteristics of the face parts), whereas local descriptors only provide

featural information. However, this has only be tested on single-database exper-


iments. In this work, more realistic scenarios are simulated by using different
databases for both training and testing where the acquisition conditions and
demographic characteristics vary notably. To the best of our knowledge, the
literature does not contain a comprehensive study of how realistic conditions
affect global and local approaches when classifying gender.
The present paper studies the suitability of global and local approaches for ad-
dressing automated gender classification problems of neutral non-occluded faces
under realistic conditions. To simulate such conditions, cross-database experi-
ments are performed involving three different databases. A comparison of the
performances of three well-known classifiers (1-NN, PCA+LDA and SVM) us-
ing two different types of features (grey levels and PCA) is provided. In order
to support the discussion of the results, three statistical tests are conducted to
better grasp the performance differences.
The rest of this paper is organised as follows: Section 2 presents the method-
ology adopted for describing the faces and classifying them, Section 3 describes
in detail the experiments and the statistical tests, in Section 4 the results are
presented and discussed and, finally, Section 5 presents the conclusions.

2 Methodology
This study compares the performance in solving gender classification problems
of two different approaches (global and local), two types of features (grey levels
and PCA) and three different classifiers (1-NN, PCA+LDA and SVM). The
methodology followed for performing the experiments has three steps:
Image preprocessing. First of all, the face in the image is detected by the
Viola and Jones algorithm [5] implemented in the OpenCV library [6]. Next,
the area containing the face is equalized and resized. The interpolation pro-
cess required for resizing the image uses a three-lobed Lanczos windowed
sinc function [7] which keeps the original aspect ratio of the image. It should
be noted that no techniques for aligning faces are applied, so in the end
unaligned face images are classified.
Feature extraction. Given a preprocessed face image, a global or local ap-
proach is followed, as described in Section 2.1, for characterising the face
with grey levels or PCA feature vectors, as explained in Section 2.2.
Classification. A trained classifier predicts the gender of a test face using pre-
viously extracted features. The classifiers used are detailed in Section 2.3.

2.1 Global and Local Approaches


In this work, faces are described following two approaches: global and local.
The global approach provides configural and featural information by charac-
terizing the face as a whole. In this case, one feature vector is extracted from
the area of the image where the face is detected.
The local approach provides featural information by describing overlapping


patches of N × N pixels considered over the face image. From one patch to its
neighbour, there is a one pixel shift. A feature vector is extracted from each one
of these patches, consequently several feature vectors describe one face. In this
approach, the classification methods are based on the concept of neighbourhood
which is defined to achieve a higher tolerance towards inaccurate face detections
and unaligned faces. Let Ni,j be the neighbourhood associated to position (i, j)
of the image, then, given a patch pk,l centred at position (k, l), pk,l ∈ Ni,j iff
|i − k| ≤ P and |j − l| ≤ P , where P defines how many pixels the neighbourhood
spans in each direction.
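A minimal sketch of this local representation, assuming a greyscale image stored as a 2-D NumPy array; patch positions are indexed by their centres, and the helper names are ours:

def extract_patches(img, N=7):
    # one flattened grey-level vector per patch centre, with a one-pixel shift
    h, w = img.shape
    r = N // 2
    return {(i, j): img[i - r:i + r + 1, j - r:j + r + 1].ravel()
            for i in range(r, h - r) for j in range(r, w - r)}

def neighbourhood(patches, i, j, P=2):
    # all patches p_{k,l} with |i - k| <= P and |j - l| <= P (25 patches when P = 2)
    return {(k, l): v for (k, l), v in patches.items()
            if abs(i - k) <= P and abs(j - l) <= P}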

2.2 Features

Two types of features are used in the experiments: grey levels and Principal
Component Analysis.

– Grey Levels
In the global case, the feature vector simply consists of the grey level values
of the pixels within the area of the image containing the face. In the local
case, one feature vector is formed by the grey level values of the pixels within
each patch.
– Principal Component Analysis (PCA)
In the global case, a PCA basis is calculated from the grey level value vectors
of all the training images. Then, this transformation is applied to all the
vectors extracted from the face images of both sets, training and test. In
the local case, a PCA basis is calculated over the features extracted from
each one of the neighbourhoods in the training images. Afterwards, the grey
level value vector of each patch is transformed using the PCA transformation
associated to their corresponding neighbourhoods.
In our experiments, the PCA transformation applied retains those eigen-
vectors accounting for 95% of the variance of the data.
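As an illustration of the 95%-variance criterion, assuming scikit-learn is available (the paper does not state which implementation was used):

import numpy as np
from sklearn.decomposition import PCA

def fit_pca_95(train_vectors):
    pca = PCA(n_components=0.95)     # keep the eigenvectors explaining 95% of the variance
    return pca.fit(np.asarray(train_vectors))

# Global case: one PCA fitted on whole-face vectors. Local case: one such PCA is
# fitted per neighbourhood and applied to the patches belonging to it.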

2.3 Classifiers

Three classifiers are tested in the experiments: 1-NN, PCA+LDA and SVM. All
of them are well-known classification methods which have been extensively used
in automated facial analysis.

– 1-NN
In the global case, the classic 1-NN is used. In the local case, a 1-NN clas-
sifier per patch’s neighbourhood is defined and each of these local classifiers
provides a gender estimation for the corresponding patch of a given test face
image. Finally, the predicted gender is obtained by majority voting of the lo-
cal predictions (a sketch of this local voting scheme is given after this list). In
both approaches, the metric used is the squared Euclidean distance.
– PCA+LDA
Linear Discriminant Analysis (LDA) searches for a linear combination of the
features that best discriminates between classes. In the face analysis field,
this classifier is most commonly applied over a transformed space, usually
Principal Components Analysis (PCA) [3].
In the global case, the standard PCA+LDA is used. In the local case,
a local PCA+LDA classifier per patch’s neighbourhood is defined which is
trained using only the patches that belong to the corresponding neighbour-
hood. The final predicted class label is obtained by majority voting of the
local decisions.
– SVM
Support Vector Machine (SVM) is a recognised classifier for its good results
in automated face analysis tasks. It is also known that it requires a large
amount of time for training purposes. We conducted an experimental study
which concluded that the use of local SVMs was not computationally af-
fordable. Therefore, SVM only follows a global approach in the experiments.
The SVM implementation with a third degree polynomial kernel provided
with LIBSVM 3.0 is employed.
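The local voting scheme shared by the 1-NN and PCA+LDA classifiers can be sketched as follows. This is an illustration only, simplified to one local classifier per patch position rather than per neighbourhood, and make_classifier is a placeholder for any scikit-learn-style estimator (e.g. a 1-NN on squared Euclidean distance, or a PCA+LDA pipeline):

from collections import Counter

def train_local(train_faces, make_classifier):
    # train_faces: list of (patches_by_position, gender) pairs
    by_pos = {}
    for patches, gender in train_faces:
        for pos, vec in patches.items():
            X, y = by_pos.setdefault(pos, ([], []))
            X.append(vec)
            y.append(gender)
    return {pos: make_classifier(X, y) for pos, (X, y) in by_pos.items()}

def predict_local(test_patches, local_classifiers):
    # each local classifier votes for the gender of its patch; the majority wins
    votes = [clf.predict([test_patches[pos]])[0]
             for pos, clf in local_classifiers.items() if pos in test_patches]
    return Counter(votes).most_common(1)[0][0]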

3 Experimental Setup

A number of experiments have been designed to assess how robust are global
and local classification models when training and test faces are acquired under
different conditions. In order to compare the performance of both approaches,
both types of features and the three classifiers previously detailed, an experiment
was performed involving each of the possible combinations of those three factors
(from now on each combination is referred as classification model ). These classi-
fication models are tested with non-occluded frontal faces from three databases:

– FERET (Facial Recognition Technology Database) [8] contains 12,922 colour


images of 512 × 768 pixels corresponding to 994 people's faces ranging from
ages 10 to 70 and from different races. There are faces from Asians, African-
Americans, Hispanics, Caucasians and other races. Our experiments use
2,015 frontal face images. Specifically, 1,173 male and 842 female faces cor-
responding to 787 different subjects (427 males and 360 females).
– PAL (Productive Aging Lab Face) [9] contains 575 colour images of size
640×480 pixels corresponding to 575 individuals (there is only one image per
individual) with ages ranging from 18 to 93. There are 89 African-American
faces, 434 Caucasian faces and 52 from other races. All the faces images from
this database are used in the experiments.
– AR [10] contains around 4,000 colour images of 768 × 576 pixels correspond-
ing to 130 people’s faces. Images feature frontal view faces with different fa-
cial expressions, illumination conditions, and occlusions. Information about
the age and race of the subjects is not provided, although after majority
sampling the database it can be said that all the individuals are young Cau-
casian adults. Our experiments use 130 occlusion-free frontal face images
with neutral expressions (74 males and 56 females).

In order to simulate realistic conditions, each classification model is evaluated


using all possible combinations of these databases for training and testing. Con-
sequently, 72 experiments (8 classification models × 9 data sets) are performed.
When the same database is used for training and testing, 5 repetitions of a
5-fold cross validation technique are implemented. The partition of the database
is made by subjects, not by images. Therefore, one subject can only be in the
training or test set, but never in both. In cross-database experiments, only one
simulation is executed, training with one data set and testing with the other.
For replicating the experiments, it should be noted that after detecting the
face in the image, the face area is reduced to 45×36 pixels. In the local approach,
the patches covering the image are 7 × 7 pixels, and the value that defines the
neighbourhood size is P = 2, resulting in 25 patches per neighbourhood.
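
As a rough sketch of this local setup (the tiling stride and the clipping at the image border are assumptions; the paper only states the patch size, the face size and P = 2), the following Python fragment builds a patch grid and gathers the 25-patch neighbourhood of a given patch.

```python
import numpy as np

def patch_grid(face, size=7, stride=7):
    """Cut the face into size x size patches on a regular grid (the stride is an
    assumption; the paper only states that 7 x 7 patches cover the 45 x 36 face)."""
    H, W = face.shape
    patches, coords = [], []
    for r in range(0, H - size + 1, stride):
        for c in range(0, W - size + 1, stride):
            patches.append(face[r:r + size, c:c + size].ravel())
            coords.append((r // stride, c // stride))
    return np.array(patches), coords

def neighbourhood(coords, idx, P=2):
    """Indices of the patches whose grid position lies within +/-P of patch idx,
    i.e. at most (2P+1)**2 = 25 patches per neighbourhood."""
    gi, gj = coords[idx]
    return [k for k, (i, j) in enumerate(coords)
            if abs(i - gi) <= P and abs(j - gj) <= P]

face = np.zeros((45, 36))                  # placeholder grey-level face
patches, coords = patch_grid(face)
print(len(neighbourhood(coords, idx=12)))  # -> 25 for an interior patch
```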

3.1 Statistical Tests

Due to the large number of experiments, a detailed comparison of the perfor-


mances is difficult to provide. In order to ease the comparison task, several tests
have been applied to show whether statistical differences exist among the per-
formances of the classifiers. All of these statistical tests are based on a null
hypothesis which is assumed to be true and which the tests attempt to reject.
Firstly, Iman-Davenport test [11] detects differences among the performances
of a set of classification models. This statistic’s null hypothesis is that all classifi-
cation models perform equally, with no significant differences. To reject the null
hypothesis, the statistic is obtained from the equation $F_F = \frac{(n-1)\chi_F^2}{n(k-1)-\chi_F^2}$, which
follows an F-distribution with $k-1$ and $(k-1)(n-1)$ degrees of freedom. If the
FF statistic is higher than the corresponding value of the F -distribution, the null
hypothesis is rejected. Therefore, significant differences among the classification
model performances exist.
Secondly, Holm’s method [12] is applied to identify statistical differences be-
tween the most significant classification model and the remaining models. Holm’s
null hypothesis assumes that the most significant classification model is statisti-
cally superior to the other models. Several hypotheses are checked sequentially,
one per each of the models except for the most significant one. For a given
significance level $\alpha$, Holm's method checks whether $P_{(i)} < \frac{\alpha}{k-i}$, where $P_{(i)}$ is the
P-value of the $i$-th hypothesis and $k$ is the number of classification models. If the condition
is met, the corresponding null hypothesis is rejected.
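
A small Python sketch of this step-down check follows (the P-values are placeholders; k = 8 classification models as in the experiments):

```python
def holm(p_values, alpha=0.05, k=8):
    """Check the hypotheses in ascending order of P-value and reject while
    P(i) < alpha / (k - i); stop at the first non-rejection (step-down rule)."""
    rejected = []
    for i, p in enumerate(sorted(p_values), start=1):
        if p < alpha / (k - i):
            rejected.append((i, p))
        else:
            break
    return rejected

# Placeholder P-values for the seven models compared against the most significant one.
print(holm([0.001, 0.004, 0.03, 0.2, 0.4, 0.6, 0.7]))
```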
Thirdly, Wilcoxon’s Signed Rank test [13] provides pairwise comparisons, so
statistical differences between each pair of classification models can be found.
This test proceeds by ranking the differences in performance of two models. Let
di be the difference between the performances of two classification models on
the i-th training-test dataset. Then, di ∀i are ranked according to their absolute
Table 1. Correct gender classification rates (%) obtained in all experiments

                          Global                                         Local
Training     Test         1-NN     1-NN    PCA+    SVM      SVM          1-NN    1-NN    PCA+
Data Set     Data Set     Grey     PCA     LDA     Grey     PCA          Grey    PCA     LDA
FERET        FERET        85.31    85.57   91.86   93.66    92.83        92.35   91.29   85.07
FERET        PAL          66.03    64.98   71.25   66.72    62.55        66.03   62.19   60.80
FERET        AR Neutral   79.17    82.31   77.69   81.54    84.62        86.15   86.92   83.08
PAL          FERET        66.53    65.56   75.22   72.99    70.66        63.16   62.07   77.11
PAL          PAL          77.42    77.35   82.72   85.23    85.61        83.73   83.52   73.69
PAL          AR Neutral   81.25    82.31   89.23   92.31    91.54        90.00   90.00   87.69
AR Neutral   FERET        76.02    76.86   80.09   80.83    77.21        78.90   78.90   78.20
AR Neutral   PAL          73.35    72.30   71.43   75.09    70.38        74.39   73.17   65.51
AR Neutral   AR Neutral   83.99    82.46   87.54   90.42    98.15        88.92   89.08   86.31

values. Let R+ be the sum of the ranks where the 1st model outperforms the
2nd and R− be the sum of the ranks not included in R+ . In cases where di = 0,
its rank is split evenly between R+ and R− . If di = 0 occurs an odd number
of times, one of those ranks is ignored. Letting Z = min(R+, R−), if Z is less than
or equal to the critical value of the Wilcoxon distribution for n degrees of freedom, then the null
hypothesis stating that both classification models perform equally is rejected.
These statistical tests were conducted using KEEL data mining software [14].
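
For illustration, a minimal Python sketch of the Wilcoxon computation just described, applied to two models' accuracies over the nine training/test data sets (the values are placeholders, not results from Table 1, and ties in |d_i| are not handled):

```python
import numpy as np

acc_a = np.array([85.3, 66.0, 79.2, 66.5, 77.4, 81.3, 76.0, 73.4, 84.0])
acc_b = np.array([93.7, 66.7, 81.5, 73.0, 85.2, 92.3, 80.8, 75.1, 90.4])

d = acc_a - acc_b
ranks = np.argsort(np.argsort(np.abs(d))) + 1.0     # rank |d_i| in ascending order
r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()
r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()
z = min(r_plus, r_minus)
print(r_plus, r_minus, z)   # compare z with the Wilcoxon critical value for n = 9
```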

4 Results and Discussion

Two different analyses of the results are presented in this section: the first one
includes all the experiments, whereas the second includes only the cross-database experiments.
Looking at the numerical results of all conducted experiments (shown in Ta-
ble 1), the first impression is that the classification models using a global SVM
or a local classifier obtain higher accuracies than the rest. In order to check
whether these performance differences are statistically relevant or not, we ap-
plied the three statistical tests previously described, whose results are shown
in Figure 1(a). In Figure 1(a), the table on the left-hand side shows the value
of the Iman-Davenport’s statistic (FF ) and the corresponding value of the F-
distribution; the table in the centre shows the results of the Holm’s method with
a 95% confidence level where all models above the double line performed signif-
icantly worse than the most significant model (marked in bold at the bottom of
the table); and the table on the right-hand side shows a summary of Wilcoson’s
test where the symbol “•” indicates that the classification model in the row sig-
nificantly outperforms the model in the column, and the symbol “◦” indicates
that the model in the column significantly surpasses the model in the row (above
the main diagonal with a 90% confidence level, and below it with a 95%).
Iman-Davenport’s statistic finds significant differences among the perform-
ances of all classification models, which is corroborated by the results of the
other two tests. Specifically, Holm’s method results indicate that the models
[Figure 1(a): Iman-Davenport's statistic FF = 12.18 with critical value F(7,35)_0.95 = 2.29; Holm's method P-values for the classification models compared against the most significant model (SVM-grey-G); Wilcoxon's pairwise test matrix.]

(a) Statistical analysis of the accuracies of all experiments

[Figure 1(b): Iman-Davenport's statistic FF = 1.53 with critical value F(7,35)_0.95 = 2.29; Holm's method P-values for the classification models compared against the most significant model; Wilcoxon's pairwise test matrix.]

(b) Statistical analysis of the accuracies of only cross-database experiments

Fig. 1. Statistical analyses performed. Holm’s results with a 95% significance level
(models above the double line performed significantly worse than the most significant
model, marked in bold at the bottom). Wilcoxon’s summary above the main diagonal
with a 90% significance level, and below it with a 95% (“•”: model in row outperforms
model in column, “◦”: model in column outperforms model in row).

using global SVMs (both with grey levels and PCA features), global PCA+LDA
and local 1-NN with grey levels are statistically superior to the rest. In a
pairwise comparison, Wilcoxon’s test reveals that a global SVM model using
grey levels outperforms all classification models except for global PCA+LDA
and global SVM with PCA features. In view of the results of this first analysis,
a straightforward conclusion would be that global methods are more suitable for
dealing with a gender classification problem than local models.
The results of a second analysis of only the cross-database experiments, that
is, omitting three experiments that were carried out using the same database for
training and testing, are shown in Figure 1(b). In this case, Iman-Davenport’s
statistic does not find significant differences among classification models. Holm’s
method only rejects global 1-NN with PCA, indicating that the rest of the models
perform statistically equal. The pairwise comparison provided by Wilcoxon’s test
supports these results, since only a couple of statistical differences are found
where global SVM with grey levels outperforms both global 1-NN models.
After these two statistical studies on the performances of all experiments and
the cross-database experiments, an interesting fact was discovered: differences
among the classification accuracies of the implemented models only exist when
single-database experiments are taken into account. In more realistic scenarios,
where training and testing images share neither the same acquisition conditions
nor the same demography of subjects (i.e., simulated with cross-database experiments),
no significant differences are found in the performances of the models.
5 Conclusion
This paper has provided a comprehensive statistical study of how suitable global
and local approaches are for gender classification under realistic conditions.
These circumstances have been simulated by cross-database experiments involv-
ing three face image collections with a wide range of ages and races and different
acquisition conditions. The comparison has included three classifiers using two
different types of features.
The main conclusion drawn from the results is that when addressing gen-
der classification problems from neutral non-occluded faces, global and local ap-
proaches achieve statistically equal accuracies. However, if we can ensure similar
acquisition conditions (i.e., similar to the experiments using the same database
for training and testing), global features should be used. As regards the classifiers
and features, when the training and test images share the same characteristics,
a global SVM using grey levels is more likely to obtain the highest classification
accuracies. In other cases, no significant differences were found among the three
classifiers studied or the two types of features considered.
Acknowledgements. This work has been partially funded by Universitat
Jaume I through grant FPI PREDOC/2009/20 and projects P1-1B2012-22,
and TIN2009-14205-C04-04 from the Spanish Ministerio de Economı́a y
Competitividad.
References
1. Zhao, W., Chellappa, R.: Face Processing: Advanced Modeling and Methods. Aca-
demic Press (2006)
2. Shan, C.: Learning local binary patterns for gender classification on real-world face
images. Pattern Recognition Letters 33(4), 431–437 (2012)
3. Bekios-Calfa, J., Buenaposada, J.M., Baumela, L.: Revisiting linear discriminant
techniques in gender recognition. IEEE PAMI 33(4), 858–864 (2011)
4. Makinen, E., Raisamo, R.: Evaluation of gender classification methods with auto-
matically detected and aligned faces. IEEE PAMI 30(3), 541–547 (2008)
5. Viola, P., Jones, M.: Robust real-time face detection. Int. J. of Computer Vision 57,
137–154 (2004)
6. Bradski, G.R., Kaehler, A.: Learning OpenCV. O’Reilly (2008)
7. Turkowski, K.: Filters for common resampling tasks. In: Graphics Gems I,
pp. 147–165. Academic Press (1990)
8. Phillips, P.J., Rizvi, S.A., Rauss, P.J.: The FERET evaluation methodology for
face-recognition algorithms. IEEE PAMI 22(10), 1090–1104 (2000)
9. Minear, M., Park, D.: A lifespan database of adult facial stimuli. Behavior Research
Methods, Instruments, and Computers 36, 630–633 (2004)
10. Martinez, A., Benavente, R.: The AR face database. Technical report, CVC (1998)
11. Iman, R., Davenport, J.: Approximations of the critical region of the friedman
statistic. Communications in Statistics 9, 571–595 (1980)
12. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics 6, 65–70 (1979)
13. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bul-
letin 1(6), 80–83 (1945)
14. Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcı́a, S., Sánchez, L.,
Herrera, F.: KEEL. J. Multiple-Valued Logic and Computing 17(2-3), 255–287 (2011)
BRDF Estimation for Faces from a Sparse
Dataset Using a Neural Network

Mark F. Hansen, Gary A. Atkinson, and Melvyn L. Smith

Centre for Machine Vision, Bristol Robotics Laboratory,


University of the West of England & University of Bristol

Abstract. We present a novel five source near-infrared photometric


stereo 3D face capture device. The accuracy of the system is demon-
strated by a comparison with ground truth from a commercial 3D scan-
ner. We also use the data from the five captured images to model the
Bi-directional Reflectance Distribution Function (BRDF) in order to
synthesise images from novel lighting directions. A comparison of these
synthetic images created from modelling the BRDF using a three layer
neural network, a linear interpolation method and the Lambertian model
is given, which shows that the neural network proves to be the most
photo-realistic.

1 Introduction

The Bi-directional Reflectance Distribution Function (BRDF) describes the re-


lationship between observed intensity at a point on a surface as a function of the
incident and reflected angles between the light source and the observer (Fig. 1).
BRDFs are commonly used in Computer Generated Imagery (CGI) to provide
photo-realistic rendering as well as being used for solving inverse problems as-
sociated with shape recovery. A BRDF completely describes the reflectance be-
haviour of an object under every possible illumination and observation direction
assuming no subsurface light transport exists. A discrete representation of the
BRDF therefore involves sampling the space in four dimensions (the two angles
each to describe incidence and reflectance). As such, this leads to difficulties
both in the practicalities of obtaining and using the complete dataset. A BRDF
is traditionally measured by using gonio-reflectometers, custom built devices
which are expensive [1] and suffer from practical limitations such as angular
precision and measurement noise. Some of these limitations can be overcome
by employing reflectance models such as Lambertian [2], Phong [3], Torrance-
Sparrow [4], Oren-Nayar [5] and more recently a tensor-spline based approach
[6], and indeed, these have been used with great success. However, modelling
is no substitute for the use of an accurate, image-based BRDF to capture the
subtleties of the reflectance properties of an object. In this paper, we present a
device which photometrically captures a set of images to create a sparse BRDF,
which we then show can realistically model unsampled regions through use of
an Artificial Neural Network (ANN).

This paper verifies the capture accuracy of our system and presents progress
in obtaining BRDF data from a sparse dataset and using this model to simulate
unseen lighting angles photo-realistically via an ANN. The motivation for this
work is to show that an accurate BRDF can be modelled from the sparse dataset,
and that high speed Near-InfraRed (NIR) capture is suitable for the photometric
reconstruction of faces. We make no claim that the modelled BRDF is state-of-
the-art for skin reflectance modelling (for examples of such work please refer to
[7], [8]), but we are offering a particularly rapid means to acquire sufficient data
for photo-realistic rendering.

Fig. 1. The four dimensions (zenith incident and reflected angles: θi , θr and azimuth
incident and reflected angles: αi , αr ) upon which the observed intensity at V (viewer)
depends. L is the light source vector, and N is normal to the plane of the reflecting
surface. In order to reduce the dimensionality, Δα, the difference between incident and
reflective azimuths is used.

The contributions of this paper are i) to prove the practicality of using NIR
for high speed 2.5D data capture in terms of speed and accuracy, compared
with a commercial projected pattern scanner and ii) to demonstrate accurate
modelling of the BRDF from only five lighting directions via an ANN, which is
used to generate photo-realistic images from novel lighting angles.

2 Capture Device
2.1 Hardware
This section details the acquisition device hardware which is based upon the
Photoface device presented in [9]. The device, shown in Fig. 2, is designed for
practical 3D face geometry capture and recognition. The presence of an individ-
ual approaching the device is detected by an ultrasound proximity sensor placed
before the archway. This can be seen in Fig. 2(6) towards the left-hand side of
the photograph. The sensor triggers a sequence of high speed synchronised frame
grabbing and light source switching operations.





 

Fig. 2. The NIR geometry capture device (left) and an enlarged image of one of the
LED clusters (right). A camera can be seen on the rear panel, above which is located a
NIR light source for retro-reflective capture (5). Four other light sources are arranged
at evenly spaced angles around the camera (1-4). An ultrasound trigger is located on
the left vertical beam of the archway (6).

The aim is to capture six images at a high frame rate: one control image with
only ambient illumination and five images each illuminated by one of the NIR
light sources in sequence. A captured face is typically 700 × 850 pixels. Note
that the ambient lighting is uncontrolled (for the experiments presented in this
paper, overhead fluorescent lights are present). The five NIR lamps are made
from a cluster of seven high power NIR LEDs arranged in an ‘H’-formation
to minimize the emitting area (as can be seen in the right hand side image
of Fig 2). The LEDs emit light at ≈850nm. The light sources and camera are
located approximately 1.2m from the head of the subject with four of the light
sources arranged at evenly spaced angles, and one placed as close as possible to
the camera to capture retro-reflection.
It was found experimentally that for people walking through the device, a
minimum frame rate of approximately 150fps was necessary to avoid significant
movement between frames. The device currently operates at 210fps, and it should
be noted that it is only operating for the period required to capture the six
images. That is, the device is left idle until it is triggered. A monitor is included
on the back panel to show the reconstructed face or to display other information.

2.2 Photometric Stereo


The face detection method of Lienhart and Maydt [10] is used to extract the
face from the background of the five images. The five intensity images are pro-
cessed using a MATLAB implementation of a standard Photometric Stereo (PS)
method [11, §5.4].
The general equation for PS using five sources for pixel i is


\begin{bmatrix} I_{1,i} \\ I_{2,i} \\ I_{3,i} \\ I_{4,i} \\ I_{5,i} \end{bmatrix} = \rho_i \begin{bmatrix} \mathbf{L}_1^T \\ \mathbf{L}_2^T \\ \mathbf{L}_3^T \\ \mathbf{L}_4^T \\ \mathbf{L}_5^T \end{bmatrix} \mathbf{n}_i    (1)

where ρi is the reflectance albedo. The intensity values (I) and light source (L)
positions are known, and from these the albedo and surface normal components
(n) can be calculated by solving (1) using a linear least-squares method.
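
For illustration, a minimal NumPy sketch of this per-pixel least-squares solution; the light directions and intensities below are placeholders, not the actual Photoface geometry.

```python
import numpy as np

# Five placeholder unit light directions (rows of the matrix of L vectors in Eq. (1)).
L = np.array([[ 0.3,  0.3, 0.9],
              [-0.3,  0.3, 0.9],
              [-0.3, -0.3, 0.9],
              [ 0.3, -0.3, 0.9],
              [ 0.0,  0.0, 1.0]])
L = L / np.linalg.norm(L, axis=1, keepdims=True)
I = np.random.rand(5, 1000)                     # five intensity measurements per pixel

# Solve L g = I per pixel; g = albedo * normal.
G, *_ = np.linalg.lstsq(L, I, rcond=None)
albedo = np.linalg.norm(G, axis=0)
normals = G / np.maximum(albedo, 1e-12)         # unit surface normals n_i
print(albedo.shape, normals.shape)
```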

3 BRDF Modelling

Traditionally the generation of a BRDF involves illuminating an object from a


large number of directions. Traditional PS illuminates an object using three light
source directions, and we extend this to five. However, this is still a very sparse
amount of information from which to generate accurate reflectance information.
We therefore explore the use of a traditional linear interpolation of the data, an
ANN and the Lambertian reflectance model (which PS assumes) to model the
reflectance information from novel lighting angles in order to see how well they
can approximate the actual BRDF.
In order to minimise the number of dimensions, we assume that the surface is
isotropic. This allows the use of Δα rather than the individual αi, αr values, i.e.,
it is the difference between the azimuth angles that affects the reflectance, rather
than the orientation of that difference. While this assumption may not be perfect for
human skin, the trade-off between accuracy and complexity makes it appealing.

3.1 Linear Interpolation of BRDF Data

A traditional triangle-based linear interpolation method is used to model the


regions between measured points. This method can be expected to work well
when the distances between points are not too large and the surface being mod-
elled is relatively uniform and predictable. As the sampled data does not fit this
description well, we might expect the results to be poor. Delaunay tessellation
is used to fit simplices to the sampled data and these are used to interpolate
intensity values for the novel data points given the zenith angle of incidence and
reflection (θi , θr respectively) and the difference in azimuth angle between the
incidence and reflection angles (Δα).
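
A rough Python sketch of this interpolation using SciPy's griddata, which performs such a Delaunay-based linear interpolation; the sampled (θi, θr, Δα) → intensity tuples below are placeholders standing in for the values observed at one pixel under the five light sources.

```python
import numpy as np
from scipy.interpolate import griddata

samples = np.array([[ 0.0,  0.0,   0.0],     # (theta_i, theta_r, delta_alpha) in degrees
                    [80.0,  0.0,   0.0],
                    [ 0.0, 80.0,   0.0],
                    [ 0.0,  0.0, 180.0],
                    [80.0, 80.0, 180.0]])
intensities = np.array([230.0, 60.0, 55.0, 40.0, 25.0])

query = np.array([[20.0, 20.0, 45.0]])       # a novel lighting configuration
print(griddata(samples, intensities, query, method="linear"))
```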

3.2 Neural Network Architecture for BRDF Generation

Gargan & Neelamkavil [12] showed that using an ANN provides excellent approx-
imation performance for a dense BRDF generated using a gonio-reflectometer.
Experimenting with different numbers of layers (which affects the ability of the
network to either generalise or overfit), they concluded that a three-tier feed-


forward backpropagation architecture offers the best performance. The same
architecture is used in these experiments, but the novelty is that it is trained
with a very sparse dataset instead of the dense BRDF used previously, in order to test
whether similarly good approximation performance can be achieved. Additionally, Gargan
& Neelamkavil use an XY parameter space or projected hemispherical space
for inputs whereas we use a lower dimensional co-ordinate space which assumes
isotropism.

[Figure 3: feed-forward network with 5 input nodes (θi, θr, Δα, x, y), three hidden layers of 10 nodes each, and a single output node.]

Fig. 3. The architecture of the ANN used to model the BRDF. The inputs θi and θr
are in the range of 0-90 degrees, and Δα is in the range 0-180 degrees. x and y give
the pixel coordinate and so are in the range of 1 to either the width (W) or height (H).
The output is in the range 0-255 for each pixel, so that when all pixels are estimated,
a full reflectance image will have been rendered.

The network architecture can be seen in Fig. 3 and was trained using the
Levenberg-Marquardt optimisation backpropagation algorithm, taking in the re-
gion of 200 epochs to obtain a Mean Square Error (MSE) of 11.58 gray levels
and an R value of 0.9598. Using fewer hidden layers generated higher MSEs and
lower R-values although training times were faster, while using four layers led to
very slightly improved results but at a much higher computational cost. 100,000
image locations (approximately 20% of all data) were chosen at random to pro-
vide a representative sample of the whole face surface, and for each location,
θi , θr and Δα as well as the x, y co-ordinates were used as inputs.
The reason for including the pixel coordinates is an attempt to allow for the
different types of reflectance around the face to be captured (i.e. the reflectance
of the skin at the nose tip is different to that of the cheeks). In doing so it is
possible to correctly model the behaviour of different skin types when the same
θi , θr and Δα are provided without having to label different regions of the face
as having different skin types. This provides a means of unsupervised learning
that will assist in improving the realism of rendered images.
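
A minimal Python sketch of such a network is given below; scikit-learn has no Levenberg-Marquardt solver, so L-BFGS is used here as a stand-in, and the training tuples are random placeholders for the sampled image locations.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 90, 2000),     # theta_i
                     rng.uniform(0, 90, 2000),     # theta_r
                     rng.uniform(0, 180, 2000),    # delta_alpha
                     rng.uniform(1, 700, 2000),    # x pixel coordinate
                     rng.uniform(1, 850, 2000)])   # y pixel coordinate
y = rng.uniform(0, 255, 2000)                      # observed grey level

# Three hidden layers of 10 nodes each, as in Fig. 3 (5 inputs -> 10-10-10 -> 1 output).
net = MLPRegressor(hidden_layer_sizes=(10, 10, 10), solver="lbfgs", max_iter=500)
net.fit(X, y)
print(net.predict(X[:3]))
```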
4 Results
This section first presents results showing the reconstruction accuracy under
NIR using a commercially available system as ground truth. Then, to assess the
potential of the interpolation and ANN BRDF models, we use re-rendered images
from the estimated surface normals obtained by PS. We use the BRDF models
to generate images from novel lighting angles to see how well the models can
generalise. We compare these images with those generated using the Lambertian
reflectance model and show that the ANN produces the most photo-realistic
images for unseen lighting angles.

4.1 Surface Reconstruction Using Near-Infrared Photometric


Stereo
Fig. 4 shows a reconstruction using NIR light sources for PS, a reconstruction
using a commercially available 3D scanner (3dMD [13]) and a map of ℓ2-distances
and angular errors between surface normals at each pixel location. They have
been aligned using an Iterative Closest Point (ICP) algorithm1 . It can be seen
that PS offers a very similar level of reconstruction to the commercial scanner
– the largest differences occur around regions that are hard to integrate e.g. the
lateral edges of the nostril. Median ℓ2-distance is 0.19 and median angular error
(calculated by taking the dot product between corresponding 3dMD and PS
vectors) is 11 degrees. These errors appear high, but looking at Fig. 4 (e) (ℓ2-
distance) and (f) (angular error) it is possible to see that overall errors are low,
but that discrete areas around difficult to integrate regions where cast shadowing
occurs (around the nose and lips) as well as the specularities caused by the eyes
have extremely high errors.

4.2 Modelling the BRDF Using Linear Interpolation and a Neural


Network, and a Comparison with Lambertian Reflectance
Fig. 5 shows the results of using novel light source directions (i.e. different to
the light source directions used by Photoface that have been used to model the
reflectance). The first thing to note is that the images produced using the ANN
(top row) show a high degree of realism, whereas the interpolated images shown
in the second row are noisy and contain many artefacts, presumably due to
the sparseness of the BRDF data. The images produced by assuming a Lam-
bertian reflectance clearly show the lighting directions but again lack any real
photo-realism.

5 Discussion
The results demonstrated the practicality of using the custom built NIR lamps
for PS acquisition. The capture process itself is unobtrusive (most other PS
1
Written by Ajmal Saeed Mian, Computer Science, The University of Western
Australia


Fig. 4. Reconstructions from the Photoface device (a and c) and 3dMD (b and d) and
a map of ℓ2-distance (e) and angular error (f)

techniques require a sequence of pulsed visible lights), takes only 30ms, and the
results generated are accurate and of high resolution.
In terms of BRDF modelling, the results show that photo-realistic images
can be synthesised by using an ANN to model the BRDF from a sparse dataset
resulting from practical PS acquisition. It offers more realistic results for novel
lighting angles than either a linear interpolation based method or Lambertian
model. The ANN offers a compact representation of the BRDF and a fast method
of synthesising observed intensities from novel lighting directions.
There are some limitations of using a BRDF for modelling skin reflectance,
especially under NIR. The BRDF describes the relationship between incident,
reflected angles and observed intensity. However, there will be a certain amount
of sub-surface scattering (and this will be increased under NIR which penetrates
deeper into the skin) which the BRDF is not designed to capture. Also, the
BRDF may deviate from actual values as we have used surface normals estimated
by PS, but for purposes such as CGI this is not as important as the perceived
realism (e.g. avatar generation). We have shown that photo-realistic results are
achievable and future work will aim to overcome the Lambertian assumption by
incorporating the BRDF model into normal estimates by iteratively enhancing
the accuracy of the surface normal representations, which in turn can then be
used to generate a more accurate BRDF until convergence is reached. This in
turn will reduce distortion in the 3D reconstruction of the surface relief.

6 Conclusion

We have presented a five source NIR, high speed and high resolution 2.5D PS
face capture device, which can be used to generate accurate 3D models of human
faces. In addition, the five light sources are used to train an ANN to model
Fig. 5. Synthesised intensity images using estimated surface normals from PS and
synthesised light angles (azimuth angles are indicated by arrows. The zenith an-
gle is 15 degrees which is representative of Photoface light sources). Top row:
ANN using Photoface surface normals, second row: interpolated Photoface surface
normals, bottom row: images generated using the Lambertian reflectance model.
A video of the rendering created by the ANN BRDF can be downloaded from
www.cems.uwe.ac.uk/~mf-hansen/CAIP13/rerender75.avi

the individual’s BRDF. Using this modelled BRDF, photo-realistic results are
attained from novel light source directions. Future work will look at the use of the
BRDF to improve the 2.5D estimates by replacing the Lambertian assumption
in PS, as well as using it as an additional biometric.

References

1. Marschner, S.R., Westin, S.H., Lafortune, E.P.F., Torrance, K.E., Greenberg, D.P.:
Image-based BRDF measurement including human skin. In: Proceedings of the
10th Eurographics Workshop on Rendering, pp. 139–152 (1999)
2. Lambert, J.-H.: Photometria, sive de Mensura et gradibus luminis, colorum et
umbrae. sumptibus viduae E. Klett (1760)
3. Phong, B.T.: Illumination for computer generated pictures. Communications of the
ACM 18(6), 311–317 (1975)
4. Torrance, K.E., Sparrow, E.M.: Theory for off-specular reflection from roughened
surfaces. Journal of the Optical Society of America A 57(9), 1105–1112 (1967)
5. Oren, M., Nayar, S.K.: Generalization of the lambertian model and implications for
machine vision. International Journal of Computer Vision 14(3), 227–251 (1995)
6. Kumar, R., Barmpoutis, A., Banerjee, A., Vemuri, B.C.: Non-lambertian re-
flectance modeling and shape recovery of faces using tensor splines. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 33(3), 533–567 (2011)
7. Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Ac-
quiring the reflectance field of a human face. In: Proceedings of the 27th Annual
Conference on Computer Graphics and Interactive Techniques, pp. 145–156 (2000)
8. Ghosh, A., Hawkins, T., Peers, P., Frederiksen, S., Debevec, P.: Practical modeling
and acquisition of layered facial reflectance. ACM Transactions on Graphics 27(5),
1–10 (2008)
9. Hansen, M.F., Atkinson, G.A., Smith, L.N., Smith, M.L.: 3D face reconstructions
from photometric stereo using near infrared and visible light. Computer Vision and
Image Understanding 114(8), 942–951 (2010)
10. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid ob-
ject detection. In: IEEE International Conference on Image Processing, vol. 1,
pp. 900–903 (2002)
11. Forsyth, D.A., Ponce, J.: Computer Vision: A modern approach. Prentice Hall
Professional Technical Reference (2002)
12. Gargan, D., Neelamkavil, F.: Approximating reflectance functions using neural net-
works. In: Rendering Techniques 1998: Proceedings of the Eurographics Workshop
in Vienna, Austria, June 29-July 1, p. 23 (1998)
13. 3dMDface system, http://www.3dmd.com/3dmdface.html (accessed: December
2011)
Comparison of Leaf Recognition by Moments
and Fourier Descriptors

Tomáš Suk1 , Jan Flusser1 , and Petr Novotný2


1
Institute of Information Theory and Automation of the ASCR,
Department of Image Processing,
Pod Vodárenskou věžı́ 4, 182 08 Praha 8, Czech Republic
{suk,flusser}@utia.cas.cz
2
Charles University in Prague, Faculty of Education,
Biology & Environmental Studies Department
M.D. Rettigové 4, 116 39 Praha 1, Czech Republic
[email protected]

Abstract. We test various features for recognition of leaves of wooden


species. We compare Fourier descriptors, Zernike moments, Legendre mo-
ments and Chebyshev moments. All the features are computed from the
leaf boundary only. Experimental evaluation on real data indicates that
Fourier descriptors slightly outperform the other tested features.

Keywords: leaf recognition, Zernike moments, Fourier descriptors,


Legendre polynomials, Chebyshev polynomials.

1 Introduction

Recognition of plant species by their leaves is an important task in botany. Its


automation is at the same time a challenging problem which can be resolved
by visual pattern recognition methods. The plant leaves have high intraclass
variability and sometimes the leaves of different plants are very similar, which
makes this task difficult even for botanists.
Various approaches can be found in the literature. Kumar et al. [1] use a
histogram of curvatures. The curvature is computed as a part of a disk with
center on a leaf boundary covered by the leaf. The disks of several radii are
used. Chen et al. [2] use another type of curvature. Kadir et al. [3] use polar
Fourier transformation supplemented by a few color and vein features.
Nanni et al. [4] use the combination of inner distance shape context, shape
context and height functions. Wu et al. [5] use simple geometric features such as
diameter, length, width, area, perimeter, smooth factor, aspect ratio, form factor,
rectangularity, narrow factor, convex area ratio, ratio of diameter to perimeter,
ratio of perimeter to length plus width and four vein features. The features are
evaluated by principal component analysis and neural network. Söderkvist [6]
uses similar features as a supplement to geometric moments with support vector
machine as a classifier. In [7], Zernike moments are used.


Since the most discriminative information is carried by the leaf boundary (see
Fig. 2c), all above-cited papers employ boundary-based features. We decided to
objectively compare the most popular ones – Fourier descriptors, Zernike mo-
ments, Legendre moments, Chebyshev moments, and a direct use of the bound-
ary coordinates – on a large database of tree leaves.

2 Data Set

In the experiments, we used our own data set named Middle European Woody
Plants (MEW 2012 – Fig. 1, [8]). It contains all native and frequently cultivated
trees and shrubs of the Central Europe Region. It has 151 botanical species (153
recognizable classes), at least 50 samples per species and a total of 9745 samples
(leaves). In the case of compound leaves (Fig. 2b), we considered the individual
leaflets separately.

Fig. 1. Samples of our data set (different scale – MEW 2012 scans cleaned for this
printed presentation): 1st row – Acer pseudoplatanus, Ailanthus altissima (leaflet of
pinnately compound leaf), Berberis vulgaris, Catalpa bignonioides, Cornus alba, 2nd
row – Deutzia scabra, Fraxinus excelsior (leaflet of pinnately compound leaf), Juglans
regia, Maclura pomifera (male), Morus alba, 3rd row – Populus tremula, Quercus pe-
traea, Salix caprea, Tilia cordata and Vaccinium vitis-idaea.

Fig. 2. (a) A simple leaf (Rhamnus cathartica). (b) pinnately compound leaf (Clematis
vitalba). (c) the boundary of the leaf (Fagus sylvatica).

3 Method
3.1 Preprocessing
The preprocessing consists mainly of leaf segmentation and boundary detection.
The scanned green leaves on a white background are first segmented by
simple thresholding. The leaves are converted from color to grayscale and
then Otsu's threshold [9] is computed. The contours in the binary image are
then traced. Only the longest outer boundary of the image is used, the other
boundaries (if any) and holes are ignored.
Then we compute the following features: Cartesian coordinates of the bound-
ary points (CB), polar coordinates of the boundary (PB), Fourier descriptors
(FD), Zernike moments computed of the boundary image (ZMB), Legendre mo-
ments (LM), Chebyshev moments of the first kind (CM1) and that of the second
kind (CM2).
All the features need to be normalized to translation and rotation. The nor-
malization to the translation is provided by a subtraction of the centroid coordi-
nates m10 /m00 and m01 /m00 , where mpq is a geometric moment. The rotation
normalization in the case of the direct coordinates, Legendre and Chebyshev mo-
ments is provided so the principal axis coincides with the x-axis and the complex
moment c21 would have non-negative real part
 
\theta = \frac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}\right); \quad \text{if } (\mu_{30}+\mu_{12})\cos\theta - (\mu_{21}+\mu_{03})\sin\theta < 0 \text{ then } \theta := \theta + \pi,    (1)
where μpq is a central geometric moment. All the boundary coordinates are
multiplied by a rotation matrix corresponding to the angle −θ. The starting
point of the coordinate sequence is the one with the minimum x-coordinate. If
there are several such points, the one that minimizes the y-coordinate is chosen.
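
For illustration, a small Python sketch of this normalization (a toy closed boundary is used; the central moments are computed here from the boundary samples, and arctan2 replaces arctan for numerical robustness):

```python
import numpy as np

def normalize_boundary(x, y):
    """Centroid subtraction, rotation of the principal axis onto the x-axis with the
    sign fixed as in Eq. (1), and starting point with minimum x (ties: minimum y)."""
    x = x - x.mean()
    y = y - y.mean()
    mu = lambda p, q: np.sum(x**p * y**q)          # central moments of the boundary samples
    theta = 0.5 * np.arctan2(2 * mu(1, 1), mu(2, 0) - mu(0, 2))
    if (mu(3, 0) + mu(1, 2)) * np.cos(theta) - (mu(2, 1) + mu(0, 3)) * np.sin(theta) < 0:
        theta += np.pi
    c, s = np.cos(-theta), np.sin(-theta)
    xr, yr = c * x - s * y, s * x + c * y          # rotate all points by -theta
    start = np.lexsort((yr, xr))[0]                # minimum x, ties broken by minimum y
    return np.roll(xr, -start), np.roll(yr, -start)

t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
bx = 300 * np.cos(t)
by = 150 * np.sin(t) + 20 * np.sin(3 * t)          # toy leaf-like boundary
print(normalize_boundary(bx, by)[0][:3])
```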

3.2 Direct Coordinates


The simplest method is the direct use of the boundary coordinates as the
features. To normalize the features with respect to scaling, we resampled the
boundaries of all leaves to the constant number of samples nd . Nearest neighbor
interpolation was found slightly better than the linear interpolation for this pur-
pose. Then we used the Cartesian coordinates a2j−1 = xj /no and a2j = yj /no
as the features, so we have 2nd features together. no is the original number of
the boundary points, xj and yj are the resampled coordinates, j =/1, 2, . . . nd .
As an alternative, we tried to use the polar coordinates aj = x2j + yj2 /no
and ϕj = arctan(yj /xj ), where the angle is used with a lowered weight.

3.3 Fourier Descriptors


The Fourier Descriptors [10] are defined as Fourier transformation of the
boundary

F(u) = \sum_{k=1}^{n_o} (x_k + i y_k)\, e^{-2\pi i k u / n_o},    (2)
where xk and yk are the original boundary point coordinates, no is their num-
ber, u is the relative frequency (harmonic). We use the descriptors in the range
u = −nh , −nh + 1, . . . nh , where nh is an empiric value common for all leaves,
F (−u) = F (no − u). After the translation normalization F (0) = 0 and it is not
further considered.
The scaling invariance can be easily provided by normalization of the ampli-
tudes by the squared boundary length. Another problem is that the magnitude
of the amplitude falls quickly with the frequency and we need the appropri-
ate weight of the features in the classifier, therefore we use the normalization
a_u = 10(|u|+1)|F(u)|/n_o^2. The phase must be normalized to the rotation of coor-
dinates and to the change of the starting point: ϕu = angle(F (u))−ϑ−uρ, where
ϑ = (angle(F (1)) + angle(F (−1)))/2 and ρ = (angle(F (1)) − angle(F (−1)))/2.
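
A minimal Python sketch of these descriptors and their normalization follows; note that NumPy's FFT indexes the boundary from k = 0, which differs from the k = 1 convention of Eq. (2) by a constant phase factor per harmonic, and the boundary is a toy curve.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 400, endpoint=False)
xk = 300 * np.cos(t)
yk = 160 * np.sin(t) + 25 * np.sin(4 * t)          # toy closed boundary
z = (xk - xk.mean()) + 1j * (yk - yk.mean())       # translation normalization -> F(0) ~ 0

n_o = len(z)
F = np.fft.fft(z)                                   # discrete version of Eq. (2); F(-u) = F(n_o - u)
n_h = 20
u = np.arange(-n_h, n_h + 1)
u = u[u != 0]                                       # F(0) is not used
Fu = F[u % n_o]

a = 10 * (np.abs(u) + 1) * np.abs(Fu) / n_o**2      # amplitude features a_u
theta = (np.angle(F[1]) + np.angle(F[-1])) / 2      # rotation of coordinates
rho = (np.angle(F[1]) - np.angle(F[-1])) / 2        # change of starting point
phi = np.angle(Fu) - theta - u * rho                # phase features
print(a[:3], phi[:3])
```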

3.4 Zernike Moments


The Zernike moments are frequently-used visual features, see e.g. [11], defined
as
A_{n\ell} = \frac{n+1}{\pi} \sum_{k=1}^{n_o} R_{n\ell}(r_k)\, e^{-i\ell\varphi_k},    (3)
where r_k and ϕ_k are the polar coordinates of the boundary and the radial func-
tion R_{nℓ}(x) is a polynomial of the n-th degree

R_{n\ell}(x) = \sum_{s=0}^{(n-|\ell|)/2} \frac{(-1)^s\,(n-s)!}{s!\,\left(\frac{n+|\ell|}{2}-s\right)!\,\left(\frac{n-|\ell|}{2}-s\right)!}\; x^{n-2s}.    (4)
The parameter n is called the order and ℓ is called the repetition. Since ZMs were
designed for 2D images, we treat the leaf boundary (which is actually 1D infor-
mation) as a 2D binary image.
This explicit formula becomes numerically unstable for high orders, therefore
three recurrence formulas were developed. They are known as Prata method,
Kintner method and Chong method, we used the Kintner method [12]. The
scaling invariance is provided by a suitable mapping of the image onto the unit
disk. The points in the distance κno from the centroid are mapped onto the
boundary of the unit disk, where κ is a constant found by optimization of the
discriminability on the given dataset. The value κ = 0.3 was determined for
MEW2012. The parts of the leaf mapped outside the unit disk are not included
into the computation. The moment amplitudes are also normalized both to a
sampling density and to a contrast: a_{nℓ} = |A_{nℓ}|/A_{00}; the phases are normalized
to the rotation as ϕ_{nℓ} = angle(A_{nℓ}) − ℓ · angle(A_{31}).
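
For illustration, a small Python sketch of the boundary-sum Zernike moments using the explicit radial polynomial of Eq. (4) (the Kintner recurrence used in the paper is numerically safer for high orders; the boundary below is a toy curve already mapped into the unit disk):

```python
import numpy as np
from math import factorial

def radial_poly(n, l, x):
    """Zernike radial polynomial R_{n,l}(x) of Eq. (4), valid for n - |l| even."""
    l = abs(l)
    return sum((-1) ** s * factorial(n - s)
               / (factorial(s) * factorial((n + l) // 2 - s) * factorial((n - l) // 2 - s))
               * x ** (n - 2 * s)
               for s in range((n - l) // 2 + 1))

def zernike_moment(r, phi, n, l):
    """Boundary-sum Zernike moment A_{n,l} of Eq. (3)."""
    return (n + 1) / np.pi * np.sum(radial_poly(n, l, r) * np.exp(-1j * l * phi))

t = np.linspace(0, 2 * np.pi, 300, endpoint=False)
r = 0.6 + 0.1 * np.cos(3 * t)                      # toy boundary inside the unit disk
A31 = zernike_moment(r, t, 3, 1)
A00 = zernike_moment(r, t, 0, 0)
print(abs(A31) / abs(A00))                         # amplitude feature a_{3,1}
```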

3.5 Legendre and Chebyshev Moments


The one-dimensional moments can be computed by


P_n = \sum_{k=1}^{n_o} (x_k + i y_k)\, K_n\!\left(2\,\frac{k-1}{n_o-1} - 1\right),    (5)

where xk , yk are the boundary coordinates normalized to rotation and start-


ing point by (1). Kn (x) is a Legendre or Chebyshev polynomial. They can be
computed by the recurrence formula

K0 (x) = 1, K1 (x) = α0 x, Kn (x) = α1 xKn−1 (x) − α2 Kn−2 (x), (6)

where α_0 = 2 for the Chebyshev polynomials of the second kind, otherwise α_0 = 1;
α_1 = 2 − 1/n and α_2 = 1 − 1/n for the Legendre polynomials, while α_1 = 2 and
α_2 = 1 for the Chebyshev polynomials.
The amplitude features are used as a_n = |P_n|/n_o^2 and the phase features as
ϕ_n = angle(P_n). The coefficient 1/n_o^2 is included because of the scaling
normalization.
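
A minimal Python sketch of these one-dimensional moments, using the recurrence of Eq. (6) and the scaling of Eq. (5); the boundary is again a toy curve.

```python
import numpy as np

def poly_values(n_max, x, kind="legendre"):
    """Values of K_0, ..., K_{n_max} at points x via the recurrence of Eq. (6);
    kind selects Legendre, Chebyshev of the 1st kind, or Chebyshev of the 2nd kind."""
    K = [np.ones_like(x)]
    a0 = 2.0 if kind == "chebyshev2" else 1.0
    K.append(a0 * x)
    for n in range(2, n_max + 1):
        if kind == "legendre":
            a1, a2 = 2.0 - 1.0 / n, 1.0 - 1.0 / n
        else:                                    # both Chebyshev kinds
            a1, a2 = 2.0, 1.0
        K.append(a1 * x * K[n - 1] - a2 * K[n - 2])
    return K

def boundary_moments(xb, yb, n_max, kind="legendre"):
    """One-dimensional boundary moments P_n of Eq. (5) with the 1/n_o^2 scaling."""
    n_o = len(xb)
    k = np.arange(1, n_o + 1)
    arg = 2.0 * (k - 1) / (n_o - 1) - 1.0        # map the boundary index into [-1, 1]
    K = poly_values(n_max, arg, kind)
    z = xb + 1j * yb
    return np.array([np.sum(z * K[n]) for n in range(n_max + 1)]) / n_o**2

t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
P = boundary_moments(300 * np.cos(t), 150 * np.sin(t), n_max=10)
print(np.abs(P[:4]))
```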

3.6 Leaf Size


The leaf size has big intraclass variability – the largest leaf is approximately
twice as large as the smallest one. Nevertheless, the size carries some useful
information, so we use it, but only with a suitable weight (see the choice of w_s in
the next section). When comparing the sizes of two leaves, we must compensate
the next section). When comparing the sizes of two leaves, we must compensate
(a)
for the resolution of the images if they are different. Then we find diameters dm ,
(b)
dm of both leaves and define the distance between the leaves as
2


(d(a) (b)
m −dm )

δs (a, b) = 1 − e
(a) (b)
2dm dm
. (7)

4 Classifier
We use a simple nearest neighbor classifier with optimized weights of individual
features. While we can use just the L2 norm for comparison of the amplitude features,
the phase features are angles in principle and we have to use the special distance

δϕ (α, β) = min(|α − β|, 2π − |α − β|). (8)

The distance of two leaves in the feature space is then evaluated as

d_f(\ell, q) = w_s\, \delta_s\!\left(d_m^{(q)}, d_m^{(\ell)}\right) + \left(\sum_{k \in S_A} \left(a_k^{(q)} - a_k^{(\ell)}\right)^2\right)^{\!\frac{1}{2}} + w_f \sum_{k \in S_P} w_c(k)\, \delta_\varphi\!\left(\varphi_k^{(q)}, \varphi_k^{(\ell)}\right),    (9)

where S_A is the set of all indices k for which a_k is an amplitude feature. Similarly,
S_P is the set of all indices for which ϕ_k is a phase feature. The weight w_f is
constant for a given type of features, while wc (k) depends on the order of the
feature. We use wc (k) = 1/|uk | for FD and wc (k) = 1/nk for all the moments,
where uk is the current harmonic and nk is the current moment order. In the
case of CB and PB, wc (k) has no meaning. The parameters and weights of all
features were optimized for MEW2012.
In the training phase, the features of all leaves in the data set are computed.
In the classification phase, the features of the query leaf are computed; they are
labeled by index (q) in Eq. (9), while the features labeled (ℓ) successively run over
the whole data set. We only consider one nearest neighbor from each species.
Where the information whether the leaf is simple or compound is available, only
the corresponding species are considered.
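
For illustration, a small Python sketch of the distance of Eqs. (7)–(9) between one query leaf and one training leaf; the feature vectors, diameters and weights below are placeholders, and only Fourier-descriptor-style weights w_c(k) = 1/|u_k| are shown.

```python
import numpy as np

def phase_distance(alpha, beta):
    """Angular distance of Eq. (8)."""
    d = np.abs(alpha - beta)
    return np.minimum(d, 2 * np.pi - d)

def leaf_distance(a_q, phi_q, d_q, a_l, phi_l, d_l, wc, ws=1.0, wf=1.0):
    """Weighted distance of Eq. (9): size term + L2 over amplitudes + weighted phase term."""
    size_term = ws * (1.0 - np.exp(-((d_q - d_l) ** 2) / (2.0 * d_q * d_l)))
    amp_term = np.sqrt(np.sum((a_q - a_l) ** 2))
    phase_term = wf * np.sum(wc * phase_distance(phi_q, phi_l))
    return size_term + amp_term + phase_term

rng = np.random.default_rng(1)
a_q, a_l = rng.random(40), rng.random(40)                       # placeholder amplitudes
phi_q, phi_l = rng.uniform(-np.pi, np.pi, 40), rng.uniform(-np.pi, np.pi, 40)
u = np.arange(-20, 21)
u = u[u != 0]
wc = 1.0 / np.abs(u)                                            # 1/|u_k| for FD features
print(leaf_distance(a_q, phi_q, 500.0, a_l, phi_l, 430.0, wc))
```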

5 Results
In the experiments, we randomly divided the leaves of each species in the data
set into two halves. One of them was used as a training set and the other half was
tested against it. The results are in Tab. 1. The Fourier descriptors slightly out-
perform the other tested features. The reason of their superiority to moments in
this task lies in numerical properties of the features. Since the leaves are similar

Table 1. The success rates (f – boundary features only, s – the leaf size, c – information
whether the leaf is simple or compound)

test CB PB FD ZMB LM CM1 CM2


f 64.55% 63.16% 79.88% 69.03% 66.69% 69.13% 74.98%
f&c 67.42% 66.67% 81.84% 72.31% 69.01% 71.92% 77.47%
f&s 74.01% 73.12% 85.43% 78.10% 75.04% 77.38% 77.19%
f&s&c 76.47% 76.16% 86.86% 80.70% 77.31% 80.14% 79.69%
to one another, we need to use high-order features to distinguish them. How-


ever, when calculating the high-order moments, floating-point overflow and/or
underflow may occur for the orders higher than 60 (even for orthogonal moments
calculated by recurrent relations), which leads to a loss of precision. Fourier de-
scriptors are not so prone to overflow/underflow. Although they may also suffer
with numerical errors when calculating high-frequency coefficients, the influence
of these errors appears to be less significant. Another reason could lie in the shape
of the basis functions, which in case of Fourier descriptors can better characterize
the shape of most leaves. The direct use of the boundary coordinates, without
computing any sophisticated features, produces slightly worse results than both
Fourier descriptors and moments. It is also interesting that the leaf size is more
important than the information whether the leaf is simple or compound.
Finally, we compared the performance of the automatic method with the
performance of humans. We asked 12 students of computer science to classify
the leaves visually. The experiment setup was such that they could see the query
leaf and could simultaneously browse the database and compare the query with
the training leaves. Unlike the algorithm, they worked with full color images, not
with the boundaries only. Each test person classified 30 leaves. The mean success
rate was 63% which is far less than the success rate of the algorithm regardless
of the particular features used. Hence, the public web-version of our method [13]
could be a good leaf recognition tool for non-specialists, which provides them
with better performance and higher speed than their sight.

6 Conclusion
We have tested several types of features in a specific task - recognition of wooden
species based on their leaves. We concluded that Fourier descriptors are the most
appropriate features which can, when combined with the leaf size, achieve the
recognition rate above 85%. A crucial factor influencing the success rate is of
course the quality of the input image.
In this study, the leaves were scanned in the laboratory. The system is not
primarily designed to work with photographs of the leaves taken directly on
the tree. In such a case, the background segmentation and elimination of the
perspective would have to be incorporated. We encourage the readers to take
their own pictures and to try our public web-based application [13].

Acknowledgement. This work was supported by the grant No. P103/11/1552


of the Czech Science Foundation.

References
1. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C.,
Soares, J.V.B.: Leafsnap: A computer vision system for automatic plant species
identification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C.
(eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 502–516. Springer, Heidelberg
(2012)

2. Chen, Y., Lin, P., He, Y.: Velocity representation method for description of contour
shape and the classification of weed leaf images. Biosystems Engineering 109(3),
186–195 (2011)
3. Kadir, A., Nugroho, L.E., Susanto, A., Santosa, P.I.: Foliage plant retrieval using
polar fourier transform, color moments and vein features. Signal & Image Process-
ing: An International Journal 2(3), 1–13 (2011)
4. Nanni, L., Brahnam, S., Lumini, A.: Local phase quantization descriptor for
improving shape retrieval/classification. Pattern Recognition Letters 33(16),
2254–2260 (2012)
5. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.X., Chang, Y.F., Xiang, Q.L.: A leaf
recognition algorithm for plant classification using probabilistic neural network.
In: 7th International Symposium on Signal Processing and Information Technology
ISSPIT 2007, p. 6. IEEE (2007)
6. Söderkvist, O.J.O.: Computer vision classification of leaves from Swedish trees.
Master’s thesis, Linköping University (September 2001)
7. Kadir, A., Nugroho, L.E., Susanto, A., Santosa, P.I.: Experiments of Zernike mo-
ments for leaf identification. Journal of Theoretical and Applied Information Tech-
nology 41(1), 113–124 (2012)
8. MEW2012: Download middle european woods (2012),
http://zoi.utia.cas.cz/node/662
9. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans-
actions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
10. Lin, C.C., Chellapa, R.: Classification of partial 2-D shapes using Fourier de-
scriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 9(5),
686–690 (1987)
11. Flusser, J., Suk, T., Zitová, B.: Moments and Moment Invariants in Pattern Recog-
nition. Wiley, Chichester (2009)
12. Kintner, E.C.: On the mathematical properties of Zernike polynomials. Journal of
Modern Optics 23(8), 679–680 (1976)
13. MEWProjectSite: Recognition of woods by shape of the leaf (2012),
http://leaves.utia.cas.cz/index?lang=en
Dense Correspondence of Skull Models by
Automatic Detection of Anatomical Landmarks

Kun Zhang, Yuan Cheng, and Wee Kheng Leow

Department of Computer Science, National University of Singapore


Computing 1, 13 Computing Drive, Singapore 117417
{zhangkun,cyuan,leowwk}@comp.nus.edu.sg

Abstract. Determining dense correspondence between 3D skull mod-


els is a very important but difficult task due to the complexity of the
skulls. Non-rigid registration is at present the predominant approach for
dense correspondence. It registers a reference model to a target model
and then resamples the target according to the reference. Methods that
use manually marked corresponding landmarks are accurate, but manual
marking is tedious and potentially error prone. On the other hand, meth-
ods that automatically detect correspondence based on local geometric
features are sensitive to noise and outliers, which can adversely affect
their accuracy. This paper presents an automatic dense correspondence
method for skull models that combines the strengths of both approaches.
First, anatomical landmarks are automatically and accurately detected
to serve as hard constraints for non-rigid registration. They ensure that
the correspondence is anatomically consistent and accurate. Second, con-
trol points are sampled on the skull surfaces to serve as soft constraints
for non-rigid registration. They provide additional local shape constraints
for a closer match between the reference and the target. Test results show
that, by combining both approaches, our algorithm can achieve more ac-
curate automatic dense correspondence.

Keywords: Dense correspondence, anatomical landmarks, skull models.

1 Introduction

Determining dense correspondence between 3D mesh models is a very important


task in many applications such as remeshing, shape morphing, and construction
of active shape models. Among existing approaches for dense correspondence,
non-rigid registration is at present the predominant approach due to its flexibil-
ity. Non-rigid registration methods deform a reference mesh to match the target
mesh and resample the target by mapping reference mesh vertices to the tar-
get surface. They are typically preceded by rigid registration to globally align
the sizes, positions, and orientations of the meshes. Various deformable methods
have been used including energy minimization [11, 12], mass-spring model [15],
local affine transformations [1], trilinear transformation [2], graph and manifold
matching [20], octree-splines [6], and thin-plate spline (TPS) [4, 5, 7–10, 14, 18].


Most of these methods are demonstrated on models with simple surfaces such
as faces [8, 12, 20], human bodies [1, 15], knee ligaments [6], and lower jaws [2].
TPS is particularly effective for mesh models with highly complex surfaces such
as brain sulci [4], lumbar vertebrae [10], and skulls [5, 7, 9, 14, 18]. Skull models
are particularly complex because they have holes, missing teeth and bones, and
interior as well as exterior surfaces.
Like all non-rigid registration methods, TPS registration of skull models re-
quires known correspondence on the reference and the target, which can be
manually marked or automatically detected. The first approach manually marks
anatomical landmarks on the reference and the target [5, 9, 14], and uses the
landmarks as hard constraints in TPS registration. This approach is accurate,
but manual marking is tedious and potentially error prone. The second approach
automatically detects surface points on the reference mesh, which are mapped
to the target surface. These surface points can be randomly sampled points [7]
or distinctive feature points such as local curvature maximals [18], and they
serve as soft constraints in TPS registration. This approach is sensitive to noise,
outliers, and false correspondences. Turner et al. [18] apply multi-stage coarse-
to-fine method to reduce outliers, and forward (reference-to-target) and back-
ward (target-to-reference) registrations to reduce false correspondences. How-
ever, there is no guarantee that the correspondences detected anatomically are
consistent and accurate, despite the complexity of the method.
This paper presents an automatic dense correspondence algorithm for skull
models that combines the strengths of both approaches. First, anatomical land-
marks are automatically and accurately detected to serve as hard constraints
in TPS registration. They ensure anatomically consistent correspondence.
The number of such landmarks is expected to be small because automatic detec-
tion of anatomical landmarks is a very difficult task (Section 2). Second, control
points are sampled on skull surfaces to serve as soft constraints in TPS regis-
tration. They provide additional local shape constraints for a close matching
of reference and target surfaces. Compared to [18], our method also uses multi-
stage coarse-to-fine approach, except that our landmark detection algorithm is
based on anatomical definitions of landmarks, which ensures the correctness and
accuracy of the detected landmarks.
Quantitative evaluation of point correspondence is a challenging task. Most
works reported only qualitative results. The quantitative errors measured in [2,
8, 19] are non-rigid registration error instead of point correspondence error. This
paper proposes a method for measuring point correspondence error, and shows
that registration error is not necessarily correlated to correspondence error.

2 Automatic Craniometric Landmark Detection

In anatomy [16] and forensics [17], craniometric landmarks are feature points on
a skull that are used to define and measure skull shapes. Automatic detection
of craniometric landmarks is very difficult and challenging due to a form of
cyclic definition. Many craniometric landmarks are defined according to the three
Fig. 1. Skull models and craniometric landmarks. (1) Reference model. (1a) Frankfurt
plane (FP) is the horizontal (red) plane and mid-sagittal plane (MSP) is the vertical
(green) plane. (1b–1d) Blue dots denote landmarks used for registration and yellow dots
denote landmarks used for evaluation. (2) Detected registration landmarks (blue) and
50 control points (red) on two sample test targets.

anatomical orientations of the skull (Fig. 1(a)): lateral (left-right), anterior-


posterior (front-back), and superior-inferior (up-down). These orientations are
defined by the Frankfurt plane (FP) and the mid-sagittal plane (MSP), which
are in turn defined as the planes that pass through specific landmarks.
Our automatic landmark detection algorithm is an adaptation of our previous
work on automatic identification of FP and MSP [3]. It overcomes the cyclic
definition of craniometric landmarks by first mapping known landmarks on a
reference model to a target model, and then iteratively refining FP, MSP and
their landmarks on the target model. It can be summarized as follows:
Craniometric Landmark Detection Algorithm
1. Register a reference model with known landmarks to the target model.
2. Locate the landmarks on the target based on the registered reference and fit
FP and MSP to their landmarks on the target.
3. Repeat until convergence:
(a) Refine the locations of the FP landmarks on the target, and fit FP to
the refined FP landmarks.
(b) Refine the locations of the MSP landmarks on the target, and fit MSP
to the refined MSP landmarks, keeping it orthogonal to FP.
Step 1 registers the reference to the target using Fractional Iterative Closest
Point (FICP) [13], a variant of ICP robust to noise, outliers, and missing bones.

Like ICP, FICP iteratively computes the best similarity transformation (scaling,
rotation, and translation) that registers the reference to the target. The difference
is that in each iteration, FICP computes the transformation using only a subset
of reference points whose distances to the target model are the smallest.
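For illustration, a minimal Python sketch of such an FICP-style loop is given below. It is not the implementation of [13]: the retained fraction, the fixed number of iterations and the closed-form (Umeyama) similarity estimate are our own simplifying assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def similarity_transform(src, dst):
        # Closed-form (Umeyama) estimate of scale s, rotation R and translation t
        # minimising ||s * R * src + t - dst||^2 for paired 3-D points.
        mu_s, mu_d = src.mean(0), dst.mean(0)
        A, B = src - mu_s, dst - mu_d
        U, S, Vt = np.linalg.svd(B.T @ A / len(src))
        D = np.eye(3)
        if np.linalg.det(U @ Vt) < 0:
            D[2, 2] = -1.0
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / A.var(0).sum()
        t = mu_d - s * R @ mu_s
        return s, R, t

    def ficp(reference, target, fraction=0.5, n_iter=50):
        # Fractional ICP: every iteration uses only the fraction of reference
        # points closest to the target, which gives robustness to noise,
        # outliers and missing bone.
        tree = cKDTree(target)
        pts = reference.copy()
        for _ in range(n_iter):
            dist, idx = tree.query(pts)
            keep = np.argsort(dist)[: int(fraction * len(pts))]
            s, R, t = similarity_transform(pts[keep], target[idx[keep]])
            pts = s * pts @ R.T + t
        return pts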
After registration, Step 2 maps the landmarks on the reference to the target.
First, closest points on the target surface to the reference landmarks are identi-
fied. These closest points are the initial estimates of the landmarks on the target,
which may not be accurate due to shape variations among the skulls. Next, FP
and MSP are fitted to the initial estimates using PCA.
In Step 3, an elliptical landmark region R is identified around each initial
estimate. The orientation and size of R are empirically predefined. R varies for
different landmarks according to the shape of the skull around the landmark.
These regions should be large enough to include the landmarks on the target
model. Accurate landmark locations are searched within the regions according
to their anatomical definitions. For example, the left and right porions (Pl, Pr
in Fig. 1) are the most lateral points of the roofs of the ear canals [16, 17]. After
refining FP landmarks in Step 3(a), FP is fitted to the refined FP landmarks.
Next, MSP landmarks are refined in Step 3(b) in a similar manner, and MSP is
fitted to the refined MSP landmarks, keeping it orthogonal to FP.
As Step 3 is iterated, the locations and orientations of FP and MSP are refined
by fitting to the landmarks, and the landmarks’ locations are refined according
to the refined FP and MSP. After the algorithm converges, accurate craniometric
landmarks are detected on the target model.
In addition to the landmarks on FP and MSP, other landmarks are also de-
tected (Fig. 1). These include points of extremum along the anatomical orienta-
tions defined by FP and MSP. These landmarks are detected in a similar manner
as the FP and MSP landmarks, first by mapping known landmark regions on the
reference to the target, and then searching within the regions for the landmarks
according to their anatomical definitions. Test results show that the average
landmark detection error is 3.54 mm, which is very small compared to the size
of human skulls.

3 Dense Correspondence Algorithm


Our dense correspondence algorithm consists of the following stages:
1. Apply craniometric landmark detection algorithm on the target model.
2. Apply TPS to register the reference to the target with craniometric land-
marks as hard constraints.
3. Sample control points on reference surface and map them to target surface.
4. Apply TPS with craniometric landmarks as hard constraints and control
points as soft constraints.
5. Resample target surface by mapping reference mesh vertices to the target.
Stage 1 automatically detects craniometric landmarks on the target model. Af-
ter applying the landmark detection algorithm, the reference model is already
rigidly registered to the target model. Stage 2 applies TPS to perform coarse

registration with the accurately detected landmarks as hard constraints, which ensure anatomically consistent correspondence. Stage 3 randomly selects m reference mesh vertices with large registration errors as the control points.
For each control point, a nearest point on the target surface within a fixed dis-
tance and with a sufficiently similar surface normal is selected as the correspond-
ing point. If a corresponding point that satisfies these criteria cannot be found,
then the control point is discarded. This approach renders the algorithm robust
to missing parts in the target skulls. Stage 4 performs another TPS registra-
tion with craniometric landmarks as hard constraints and control points as soft
constraints. These constraints ensure close matching of reference and target
surfaces while maintaining anatomically consistent correspondence. After TPS
registration, Stage 5 maps the reference mesh vertices to the target surface in
the same manner as mapping of control points in Stage 3.
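To make the control-point selection of Stage 3 concrete, a hedged sketch is given below; the thresholds max_dist and min_cos, the unit-length normals and the use of scipy are assumptions of ours, not values taken from the paper.

    import numpy as np
    from scipy.spatial import cKDTree

    def select_control_points(ref_pts, ref_normals, tgt_pts, tgt_normals,
                              m=150, max_dist=5.0, min_cos=0.8):
        # Residual of each registered reference vertex w.r.t. the target surface.
        tree = cKDTree(tgt_pts)
        dist, idx = tree.query(ref_pts)
        # Candidates: the m reference vertices with the largest residuals.
        candidates = np.argsort(dist)[-m:]
        pairs = []
        for i in candidates:
            j = idx[i]
            close_enough = dist[i] <= max_dist
            similar_normal = ref_normals[i] @ tgt_normals[j] >= min_cos
            if close_enough and similar_normal:
                pairs.append((i, j))   # otherwise discard (robust to missing parts)
        return pairs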

4 Accuracy of Registration and Correspondence


Registration error ER measures the difference between the registered reference
surface and the target surface. It can be computed as the mean distance between
the reference mesh vertices v_i^r and the nearest surface points v_i^t on the target:

    E_R = ( (1/n) Σ_{i=1}^{n} || v_i^r − v_i^t ||^2 )^{1/2}        (1)

where n is the number of vertices. This is essentially the error measured in


[2, 8, 19], although the actual formulations that they used differ slightly.
Correspondence error, on the other hand, should measure the error in com-
puting point correspondence. One possible formulation of correspondence error
is to measure the mean distance between the desired and actual corresponding
target points of reference mesh vertices. The desired corresponding point D(v_i^r) is the ground-truth marked by a human expert, whereas the actual corresponding point C(v_i^r) is the one computed by the dense correspondence algorithm. With this formulation, the correspondence error E_C can be computed as

    E_C = ( (1/n) Σ_{i=1}^{n} || D(v_i^r) − C(v_i^r) ||^2 )^{1/2}        (2)

In practice, it is impossible to manually mark the desired corresponding points


of reference mesh vertices accurately on the target mesh surface. An alternative
formulation is to measure the mean distance between the desired and actual
corresponding target landmarks of reference landmarks M_i^r:

    E_C = ( (1/l) Σ_{i=1}^{l} || D(M_i^r) − C(M_i^r) ||^2 )^{1/2}        (3)

where l is the number of evaluation landmarks. The desired target landmarks


are manually marked whereas the actual target landmarks are computed by

the dense correspondence algorithm. Given enough landmarks adequately dis-


tributed over the entire reference surface, Eq. 3 is a good approximation of Eq. 2.
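For concreteness, the two measures can be computed as in the short sketch below; the array layout of vertices and landmarks and the nearest-neighbour search are our own assumptions about the data.

    import numpy as np
    from scipy.spatial import cKDTree

    def registration_error(ref_vertices, tgt_vertices):
        # E_R (Eq. 1): RMS distance from registered reference vertices to their
        # nearest points on the target surface.
        d, _ = cKDTree(tgt_vertices).query(ref_vertices)
        return np.sqrt(np.mean(d ** 2))

    def correspondence_error(desired_pts, computed_pts):
        # E_C (Eqs. 2/3): RMS distance between desired (ground-truth) and
        # computed corresponding points, e.g. for the evaluation landmarks.
        d = np.linalg.norm(desired_pts - computed_pts, axis=1)
        return np.sqrt(np.mean(d ** 2))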

5 Experiments and Discussions


11 skull models reconstructed from CT images were used in the tests. One of
them served as the reference model and the others were target models. For perfor-
mance comparison, the following methods were tested for dense correspondence:
1. ICP: ICP rigid registration with mesh vertices as corresponding points.
2. FICP: FICP rigid registration with mesh vertices as corresponding points.
3. CP-S: TPS registration with automatically detected control points as soft
constraints. This approach was adopted by [7].
4. LM-H: TPS registration with automatically detected craniometric landmarks
as hard constraints.
5. LM-S/CP-S: TPS registration with automatically detected craniometric
landmarks and control points as soft constraints. This approach is simi-
lar to the method of [18], except [18] adopted a more elaborate multi-stage,
coarse-to-fine, and forward-backward registration scheme.
6. LM-H/CP-S: TPS registration with automatically detected craniometric
landmarks as hard constraints and control points as soft constraints. This is
our proposed algorithm.
7. MLM-H: TPS registration with manually marked craniometric landmarks as
hard constraints. This approach was adopted by [5, 9, 14].
These test cases were equivalent to our algorithm (Case 6) with different stages
and constraints omitted. All the TPS registrations were preceded by FICP. The
stiffness parameter for TPS soft constraints was set to 0.8 where the algorithms
generally performed well. 15 landmarks and 150 control points were used for
registration for Cases 3–6, and 30 landmarks for Case 7. More landmarks could
be used for Case 7 because they included landmarks that could be accurately
marked manually but not detected automatically. 28 other landmarks were used
for evaluation. Both registration error and correspondence error were measured.
Test results (Figure 2(a)) show that FICP is more robust than ICP in rigid
registration. The registration error of CP-S is smaller than those of LM-S/CP-S
and LM-H/CP-S, but its correspondence error is larger. This shows that low
registration error does not necessarily imply low correspondence error.
CP-S and LM-S/CP-S use only soft constraints, which are inadequate for en-
suring anatomically consistent correspondence. So, their correspondence errors
are larger than those of LM-H/CP-S, which also uses registration landmarks
as hard constraints. On the other hand, LM-H uses only landmarks, which are
insufficient for ensuring close matching of reference and target surfaces, though
consistent correspondence is somewhat achieved. So, its correspondence error
for registration landmarks E_CR is very small, but its correspondence error for
evaluation landmarks E_CE is large. LM-S/CP-S uses landmarks as soft con-
straints, which weakens the anatomical consistency of correspondence, though
close matching of reference and target surfaces is achieved. Using landmarks as

Algorithm     E_R    E_CR   E_CE
ICP           2.22   7.09   7.42
FICP          1.97   5.55   6.35
CP-S          1.64   4.15   5.81
LM-H          2.69   3.51   5.94
LM-S/CP-S     1.76   3.68   5.73
LM-H/CP-S     1.76   3.58   5.56
MLM-H         2.42   0.00   4.66
(a)
(b) [Plot of E_CE against the number of control points c (0 to 800), with one curve for control points sampled at low curvature and one for control points sampled at large E_R.]

Fig. 2. Quantitative evaluation. (a) E_R: registration error. E_CR, E_CE: correspondence errors for registration landmarks and evaluation landmarks, respectively. Units are in mm. (b) Plots of E_CE vs. c, the number of control points.

hard constraints, our algorithm LM-H/CP-S ensures strong anatomically consis-


tent correspondence. Together with control points as soft constraints, it achieves
very low registration error and the lowest correspondence error for evaluation
landmarks E_CE among the automatic methods (Cases 1–6).
MLM-H uses manually marked landmarks as hard constraints. So, it is not
surprising that it has the smallest correspondence errors. Interestingly, its regis-
tration error is quite large compared to the other methods. This is because some
parts of the skulls lack distinctive surface features for locating both registration
and evaluation landmarks (Fig. 1), where most of the registration errors occur.
To investigate the stability of our algorithm LM-H/CP-S, we tested it with
varying numbers of control points and two different sampling schemes that are
used by existing methods: low curvature [18] and large registration error [5].
Figure 2(b) shows that control points with large registration errors are more
effective than those with low curvatures in reducing correspondence error. Com-
pared to the accuracy of LM-H, which uses landmarks only, a small number of
control points can already improve correspondence accuracy significantly. After
sampling enough control points that cover various parts of the skulls, adding
more control points does not reduce correspondence error significantly. This is due
to the diminished quality of the additional control points.

6 Conclusions
This paper presents a multi-stage, coarse-to-fine automatic dense correspon-
dence algorithm for mesh models of skulls that combines two key features. First,
anatomical landmarks are automatically and accurately detected to serve as
hard constraints for non-rigid registration. They ensure anatomically consis-
tent correspondence. Second, control points are sampled on the skull surfaces
to serve as soft constraints for non-rigid registration. They provide additional
local shape constraints to ensure close matching of reference and target sur-
faces. Test results show that, by combining both approaches, our algorithm can

achieve more accurate automatic dense correspondence than other automatic


algorithms. Our test results also show that low registration error does not al-
ways imply low correspondence error. So, both error measures should be used in
conjunction to evaluate the accuracy of dense correspondence algorithms.
References
1. Allen, B., Curless, B., Popović, Z.: The space of human body shapes: reconstruction
and parameterization from range scans. In: Proc. SIGGRAPH (2003)
2. Berar, M., Desvignes, M., Bailly, G., Payan, Y.: 3D meshes registration: Appli-
cation to statistical skull model. In: Campilho, A.C., Kamel, M.S. (eds.) ICIAR
2004. LNCS, vol. 3212, pp. 100–107. Springer, Heidelberg (2004)
3. Cheng, Y., Leow, W.K., Lim, T.C.: Automatic identification of frankfurt plane and
mid-sagittal plane of skull. In: Proc. WACV (2012)
4. Chui, H., Rangarajan, A.: A new algorithm for non-rigid point matching. In: Proc.
CVPR (2000)
5. Deng, Q., Zhou, M., Shui, W., Wu, Z., Ji, Y., Bai, R.: A novel skull registration
based on global and local deformations for craniofacial reconstruction. Forensic
Science International 208, 95–102 (2011)
6. Fleute, M., Lavallée, S.: Building a complete surface model from sparse data us-
ing statistical shape models: Application to computer assisted knee surgery. In:
Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496,
pp. 879–887. Springer, Heidelberg (1998)
7. Hu, Y., Duan, F., Zhou, M., Sun, Y., Yin, B.: Craniofacial reconstruction based on
a hierarchical dense deformable model. EURASIP Journal on Advances in Signal
Processing 217, 1–14 (2012)
8. Hutton, T.J., Buxton, B.F., Hammond, P.: Automated registration of 3D faces
using dense surface models. In: Proc. BMVC (2003)
9. Lapeer, R.J.A., Prager, R.W.: 3D shape recovery of a newborn skull using thin-
plate splines. Computerized Medical Imaging & Graphics 24(3), 193–204 (2000)
10. Lorenz, C., Krahnstöver, N.: Generation of point-based 3D shape models for
anatomical objects. Computer Vision and Image Understanding 77, 175–191 (2000)
11. Lüthi, M., Albrecht, T., Vetter, T.: Building shape models from lousy data. In:
Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009,
Part II. LNCS, vol. 5762, pp. 1–8. Springer, Heidelberg (2009)
12. Pan, G., Han, S., Wu, Z., Zhang, Y.: Removal of 3D facial expressions: A learning-
based approach. In: Proc. CVPR (2010)
13. Phillips, J.M., Liu, R., Tomasi, C.: Outlier robust icp for minimizing fractional
RMSD. In: Proc. 3D Digital Imaging and Modeling (2007)
14. Rosas, A., Bastir, M.: Thin-plate spline analysis of allometry and sexual dimor-
phism in the human craniofacial complex. American J. Physical Anthropology 117,
236–245 (2002)
15. Seo, H., Magnenat-Thalmann, N.: An automatic modeling of human bodies from
sizing parameters. In: Proc. ACM SIGGRAPH (2003)
16. Siwek, D.F., Hoyt, R.J.: Anatomy of the human skull, http://skullanatomy.info
17. Taylor, K.T.: Forensic Art and Illustration. CRC (2001)
18. Turner, W.D., Brown, R.E., Kelliher, T.P., Tu, P.H., Taister, M.A., Miller, K.W.:
A novel method of automated skull registration for forensic facial approximation.
Forensic Science International 154, 149–158 (2005)
19. Wang, Y., Peterson, B.S., Staib, L.H.: Shape-based 3D surface correspondence
using geodesics and local geometry. In: Proc. CVPR (2000)
20. Zeng, Y., Wang, C., Wang, Y.: Automated registration of 3D faces using dense
surface models. In: Proc. CVPR (2010)
Detection of Visual Defects in Citrus Fruits:
Multivariate Image Analysis vs Graph Image
Segmentation

Fernando López-Garcı́a, Gabriela Andreu-Garcı́a,


José-Miguel Valiente-Gonzalez, and Vicente Atienza-Vanacloig

Instituto de Automática e Informática Industrial. Universidad Politécnica de Valencia


Camino de Vera (s/n), 46022 Valencia, Spain
{flopez,gandreu,jvalient,vatienza}@disca.upv.es
http://www.ai2.upv.es/

Abstract. This paper presents an application of visual quality control


in orange post-harvesting comparing two different approaches. These ap-
proaches correspond to two very different methodologies released in the
area of Computer Vision. The first approach is based on Multivariate
Image Analysis (MIA) and was originally developed for the detection of
defects in random color textures. It uses Principal Component Analysis
and the T2 statistic to map the defective areas. The second approach
is based on Graph Image Segmentation (GIS). It is an efficient segmen-
tation algorithm that uses a graph-based representation of the image
and a predicate to measure the evidence of boundaries between adja-
cent regions. While the MIA approach performs novelty detection on
defects using a trained model of sound color textures, the GIS approach
is strictly an unsupervised method with no training required on sound
or defective areas. Both methods are compared through experimental
work performed on a ground truth of 120 samples of citrus coming from
four different cultivars. Although the GIS approach is faster and achieves
better results in defect detection, the MIA method produces fewer false detections and does not need the hypothesis that the bigger area in a sample always corresponds to the non-damaged area.

Keywords: Fruit Inspection, Automatic Quality Control, Multivariate


Image Analysis, Principal Component Analysis, Unsupervised Methods.

1 Introduction

Quality control in the agro-industry is becoming of paramount importance in


order to decrease production costs and increase quality standards. In the packing
lines, where external quality attributes are currently inspected visually, machine
vision is providing a way to perform this task automatically. The detection of
blemishes is one of the most important factors in the commercial quality of fruit.
Blemishes in citrus can be due to several causes; medfly egg deposition, green
mould by Penicillium digitatum, oleocellosis (rind oil spot), scale, scarring, thrips


Fig. 1. Some blemishes in citrus. From left to right; scale, thrips scarring, sooty mould
and green mould.

scarring, chilling injury, stem injury, sooty mould, anthracnose and phytotoxicity.
Figure 1 shows four different types of defects (blemishes) in citrus.
The automatic detection of visual defects in orange post-harvest, performed to
classify the fruit depending on their appearance, is a major problem. Species and
cultivars of citrus present great unpredictability in colors and textures in both,
sound and defective areas. Thus, the inspection system will need frequent train-
ing to adapt to the visual features of new cultivars and even different batches
within the same cultivar [1]. In addition, as the training process will be performed
by non-specialized operators at the inspection lines, we need to select an unsu-
pervised methodology (no labeling process required) that leads to an easy-to-
train inspection system. Real-time compliance is also an important issue so that
the overall production can be inspected at on-line rates. Thus, approaches with
low computational costs are valuable. In the present paper, we study and compare two methods that offer these features: they are unsupervised, easy to train, and have low computational costs compared with similar-in-purpose methods in the literature.
The first method [2] is based on a Multivariate Image Analysis (MIA) strategy
developed in the area of applied statistics [3,4,5]. This strategy differs from tradi-
tional image analysis, where the image is considered a single sample from which
a vector of features is extracted and then used for classification or comparison
purposes. In MIA, the image is considered a sample of size equal to the number
of pixels that compose the image. Principal Component Analysis (PCA) is ap-
plied to the raw data of pixels and then statistic measures are used to perform
the image analysis. The method was originally developed as a general approach
for the detection of defects in random color textures, which is a Computer Vision
issue where several works have been released recently in literature. We chose this
kind of method because it fits the needs for the detection of blemishes (visual
defects) in citrus, where sound peel areas and damaged areas are in fact random
color textures. With regard to the other literature methods for the detection of
defects in random color textures, this method presents the following advantages: it uses one of the simplest approaches, providing low computational costs, and also it is unsupervised and needs only a few samples to train the system [2]. In order to better compile defects and parts of defects of different sizes, we introduce a multiresolution scheme which minimizes the computational effort. The method
is applied at different scales gathering the results in one map of defects. In the

paper, we call this method MIA-DDRCT (MIA Defect Detection on Random Color Textures).
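As a rough, single-scale illustration of the idea (not the exact feature construction or multiresolution scheme of [2]), a T2 defect map could be sketched in Python as follows; the window-based pixel features, scikit-learn's PCA and the percentile threshold on the training statistic are our own simplifications.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.image import extract_patches_2d

    def pixel_features(image, window=3):
        # Describe every pixel by its flattened window x window RGB neighbourhood,
        # one row per pixel (a simple way to build the multivariate pixel matrix).
        patches = extract_patches_2d(image, (window, window))
        return patches.reshape(len(patches), -1)

    def t2_defect_map(sound_patches, test_image, n_components=11,
                      percentile=95, window=3):
        # sound_patches: list of small defect-free RGB images used for training.
        train = np.vstack([pixel_features(p, window) for p in sound_patches])
        pca = PCA(n_components=n_components).fit(train)

        def t2(feats):
            # T2 statistic: squared PCA scores scaled by the component variances.
            scores = pca.transform(feats)
            return np.sum(scores ** 2 / pca.explained_variance_, axis=1)

        threshold = np.percentile(t2(train), percentile)
        h, w = test_image.shape[:2]
        mask = t2(pixel_features(test_image, window)) > threshold
        # extract_patches_2d drops a (window - 1)-pixel border around the image.
        return mask.reshape(h - window + 1, w - window + 1)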
The second method we study [6] is a Graph Image Segmentation (GIS) ap-
proach which belongs to the set of methods that use a graph representation of
the image and a given criteria to segment the image into regions (e.g. [7,8]). It
is an efficient segmentation algorithm based on a predicate which is defined to
measure the evidence of a boundary between two adjacent regions. This predi-
cate measures inter-regions differences in the neighborhood of boundaries as well
as intra-region differences. This way, local and non-local criteria are introduced.
We chose this method because it is a recent work on the Computer Vision topic
of image segmentation which improves results of previous methods [6]. The GIS
method is highly efficient and achieves a running time nearly linear with the
number of pixels in the image. Also, it is strictly unsupervised because it does
not need to learn about sound or defective areas. If we set the hypothesis that
the bigger part in samples correspond to the sound non-damaged area, then the
rest of regions will correspond to defects. In this case, we only need to adjust
two parameters in the method: sigma, which is used to smooth the image before
being segmented, and the k value of a threshold function where larger values of
k result in larger regions. The hypothesis of the bigger area in samples being the
sound area is reasonable and has been used before [1]. In the paper, we call this
method EGIS (Efficient Graph Image Segmentation).
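Because the segmentation algorithm of [6] is available in common libraries, the EGIS defect-extraction step under this hypothesis could be sketched as below; scikit-image's felzenszwalb implements the algorithm of [6] (its scale parameter plays the role of k), while the fruit/background mask and the min_size value are assumptions we introduce for illustration.

    import numpy as np
    from skimage.segmentation import felzenszwalb

    def egis_defects(image, fruit_mask, sigma=0.5, k=350, min_size=20):
        # Graph-based segmentation of Felzenszwalb and Huttenlocher [6];
        # 'scale' corresponds to the k threshold parameter.
        labels = felzenszwalb(image, scale=k, sigma=sigma, min_size=min_size)
        labels = np.where(fruit_mask, labels, -1)        # ignore the background
        regions, counts = np.unique(labels[labels >= 0], return_counts=True)
        sound_region = regions[np.argmax(counts)]
        # Hypothesis: the largest region on the fruit is sound peel; every other
        # region on the fruit is reported as a defect.
        return (labels >= 0) & (labels != sound_region)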
Next section shows the experimental work performed to evaluate and compare
the approaches. Conclusions are reported in final section.

2 Experimental Work

2.1 Ground Truth

The set of fruit used to carry out the experiments consisted of a total of 120 or-
anges and mandarins coming from four different cultivars: Clemenules, Marisol,
Fortune, and Valencia (30 samples per cultivar). The fruit was randomly col-
lected from a citrus packing house. Five fruits of each cultivar belonged to the
extra category, thus, they were fully free of defects. The other 25 fruits of each
cultivar fitted secondary commercial categories and had several skin defects, try-
ing to represent the cause of most important losses during post-harvesting (see
Section 1).

2.2 MIA-DDRCT Approach

The first step in the experimental work for this approach was to select a set of
defect-free samples for each cultivar, in order to build the corresponding model
of sound color textures. A total of 64 different sound patches were collected for
each cultivar (see Figure 2). We used patches instead of complete samples in
order to introduce in the model more different types of sound peels and collect
as much as possible the variability of colors and textures.

Fig. 2. Several sound patches of Clemenules cultivar

Then, to tune the parameters, we designed a set of experiments that involved applying the method to the ground truth of each cultivar and extracting the corresponding defect maps, while varying in each experiment: the number of principal
eigenvectors chosen to build the reference eigenspace, the percentile used to
set the T2 threshold, and the combination of scales used in the multiresolution
scheme. The number of principal eigenvectors were varied in [1, 3, 5, 7, 9, 11,
13, 15, 17, 19, 21, 23, 25, 27], the percentile in [90, 95, 99], and the set of scales
in [(0.25,0.12), (0.50, 0.12), (0.50, 0.25), (1.00, 0.12), (1.00, 0.25), (1.00, 0.50),
(0.50, 0.25, 0.12), (1.00, 0.25, 0.12), (1.00, 0.50, 0.12), (1.00, 0.50, 0.25), (1.0,
0.50, 0.25, 0.12)]. Thus, a total number of 462 experiments were carried out for
each cultivar. To tune the parameters, that is, to select the values that maximize
the quality of defect maps, we marked manually the defective areas in the sam-
ples and then compared with the achieved defect maps using three measures;
Precision, Recall and F-Score.
    Precision = tp / (tp + fp),        Recall = tp / (tp + fn)        (1)

    F-Score = 2 * (Precision * Recall) / (Precision + Recall)        (2)
where tp (true positives) is the number of pixels marked and correctly detected,
fp (false positives) is the number of pixels not marked but detected, and fn
(false negatives) is the number of pixels marked but not detected. Precision is
a measure of exactness (fidelity), Recall is a measure of completeness, and the
F-score combines both through their harmonic mean. Once the set of experi-
ments was carried out for each cultivar, mean values of previous measures were
computed. Then, we selected the most balanced result for each cultivar. Table
1 shows the best combination of factors for each cultivar and the corresponding
mean values of Precision, Recall and F-Score.
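For reference, the three measures can be obtained directly from the manually marked mask and a method's defect map, e.g. as in this small sketch (boolean masks of equal shape are assumed).

    import numpy as np

    def precision_recall_fscore(marked, detected):
        tp = np.sum(marked & detected)      # pixels marked and correctly detected
        fp = np.sum(~marked & detected)     # pixels detected but not marked
        fn = np.sum(marked & ~detected)     # pixels marked but not detected
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        fscore = 2 * precision * recall / (precision + recall)
        return precision, recall, fscore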
Once the parameters were tuned, from the marked defects and the achieved
defect maps we counted the actual defects, the correctly detected defects and the
false detections for each cultivar. These results are shown in Table 2 (percentage
of false detections is provided with regards to the number of detected defects
plus the false detections, that is, the total number of defects extracted by the
method).

Table 1. Best combinations of factors (MIA-DDRCT)

Cultivar #EigenVectors Percentile Scales Precision Recall F-Score


Clemenules 11 95 (0.50, 0.25) 0.60 0.61 0.54
Fortune 17 90 (0.50, 0.25) 0.54 0.69 0.56
Marisol 23 90 (0.50, 0.25) 0.62 0.58 0.53
Valencia 27 95 (0.50, 0.12) 0.64 0.67 0.62
T.Mean 0.60 0.64 0.56

Table 2. Detection results on individual defects (MIA-DDRCT)

Cultivar Defects Detected False Detections


Clemenules 238 211 (88.7%) 04 (1.7%)
Fortune 172 159 (92.4%) 10 (5.5%)
Marisol 195 185 (94.9%) 07 (3.5%)
Valencia 138 125 (90.6%) 06 (4.2%)
Total 743 680 (91.5%) 27 (3.8%)

2.3 EGIS Approach

In this approach there is no training stage and also no model of sound color
textures is built. Instead, the method tries to segment the sample (the image)
into regions in such a way that adjacent regions have a different visual appearance
but it remains similar within them. Thus, in order to extract the defects it is
necessary to set the hypothesis that bigger regions in samples always correspond
to the sound area (the background is not considered).
Since no training is performed, we went directly to tune the parameters of
the method for each cultivar. Parameters are sigma, which is used to smooth
the image before being segmented, and the k value of the threshold function.
In [6] the recommended values for sigma and k are respectively 0.5 and 500,
then, we varied the parameters around these central values. For each cultivar a
set of experiments was performed varying sigma in [0.25, 0.30, 0.35, 0.40, 0.45,
0.50, 0.55, 0.60, 0.65, 0.70, 0.75], and k in [200, 250, 300, 350, 400, 450, 500, 550,
600, 650, 700, 750], which led to 132 different experiments. As in the previous
approach, parameters were tuned by comparing the manually marked defects
with regard to those achieved by the method. This comparison was performed
again through the measures of Precision, Recall and F-Score. Tables 3 and 4
correspond to Tables 1 and 2 of previous approach. These tables show that the
EGIS approach is better in fitting the marked defects and also in defect detection,
although it produces more false detections.
A major difference between both approaches arises when we study their timing
costs. Using a standard PC, we measured for both methods the mean timing cost of 20 executions performed on the same sample of the Clemenules cultivar. While the MIA-DDRCT method achieved a mean time of 588.5 ms, the EGIS method
achieved 162.5 ms. Nevertheless and despite the difference, both methods can
meet the real-time requirements at production lines (5 pieces per second) since

Table 3. Best combinations of factors (EGIS)

Cultivar sigma k Precision Recall F-Score


Clemenules 0.50 350 0.75 0.75 0.71
Fortune 0.45 350 0.72 0.73 0.66
Marisol 0.60 250 0.63 0.65 0.58
Valencia 0.65 450 0.77 0.74 0.72
T.Mean 0.72 0.72 0.67

Table 4. Detection results on individual defects (EGIS)

Cultivar Defects Detected False Detections


Clemenules 238 220 (92.4%) 09 (3.6%)
Fortune 172 164 (95.4%) 12 (6.5%)
Marisol 195 182 (93.3%) 17 (8.2%)
Valencia 138 129 (93.5%) 08 (5.5%)
Total 743 695 (93.5%) 46 (6.2%)

their timing costs can be drastically reduced by using simple and cheap paral-
lelization techniques based on computer clustering. Figure 3 shows the results
achieved by both approaches on two different samples.

3 Conclusions

In this paper, we have presented an application of visual quality control in orange


post-harvesting comparing two different approaches of Computer Vision. A gen-
eral approach based on a Multivariate Image Analysis strategy for the detection
of defects in random color textures (MIA-DDRCT), and a generic, graph-based
and efficient approach to image segmentation (EGIS). Both methods have been
compared through an experimental work performed on a ground truth composed
by 120 samples of citrus coming from four different varieties.
First, a set of experiments was designed to tune the parameters of both methods. For each cultivar, the parameters of the corresponding method were varied over a wide range. This led to a large number of experiments: 462 for the MIA-DDRCT method and 132 for the EGIS method. Then, the parameters
were tuned using Precision, Recall and F-Score, three measures that compare
the difference among the defects manually marked and the defects extracted by
the methods. Since higher values of these measures were achieved by the EGIS
method, we can conclude that this approach fits better the marked defects.
Then, for the best combinations of parameters for each cultivar in both meth-
ods, we collected the defect detection results. We counted the actual defects,
the correctly detected defects and the false detections. In this case, the EGIS
method achieved better performance in the correct detection ratio (93.5% ver-
sus 91.5%), while MIA-DDRCT was better at providing fewer false detections (3.8% versus 6.2%). With regard to timing costs, the EGIS method runs 3.6 times

Fig. 3. MIA-DDRCT versus EGIS. From top to bottom; original, manually marked
defects, MIA-DDRCT and EGIS results.

faster than MIA-DDRCT, although both methods can easily achieve real-time
compliance by introducing simple parallelization techniques. Finally, the MIA-DDRCT approach has the advantage that it does not need the hypothesis that the bigger area in a sample corresponds to the sound area, unlike the EGIS method.

References
1. Blasco, J., Aleixos, N., Moltó, E.: Computer vision detection of peel defects in citrus
by means of a region oriented segmentation algorithm. Journal of Food Engineer-
ing 81(3), 535–543 (2007)
2. López, F., Prats, J.M., Ferrer, A., Valiente, J.M.: Defect Detection in Random
Colour Textures Using the MIA T2 Defect Maps. In: Campilho, A., Kamel, M.S.
(eds.) ICIAR 2006. LNCS, vol. 4142, pp. 752–763. Springer, Heidelberg (2006)
3. Bharati, M.H., MacGregor, J.F.: Texture analysis of images using Principal Com-
ponent Analysis. In: SPIE/Photonics Conference on Process Imaging for Automatic
Control, pp. 27–37 (2000)
4. Geladi, P., Granh, H.: Multivariate Image Analysis. Wiley, Chichester (1996)
5. Prats-Montalbán, J.M., Ferrer, A.: Integration of colour and textural information
in multivariate image analysis: defect detection and classification issues. Journal of
Chemometrics 21(1-2), 10–23 (2007)
6. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient Graph-Based Image Segmentation.
International Journal of Computer Vision 59(2), 167–181 (2004)
7. Urquhart, R.: Graph theoretical clustering based on limited neighborhood sets. Pat-
tern Recognition 15(3), 173–187 (1982)
8. Zahn, C.T.: Graph-theoretic methods for detecting and describing gestalt clusters.
IEEE Transactions on Computing 20(1), 68–86 (1971)
Domain Adaptation Based on Eigen-Analysis
and Clustering, for Object Categorization

Suranjana Samanta and Sukhendu Das

V.P. Lab., Dept. of CSE, IIT Madras, India


[email protected], [email protected]

Abstract. Domain adaptation (DA) is a method used to obtain better


classification accuracy, when the training and testing datasets have differ-
ent distributions. This paper describes an algorithm for DA to transform
data from source domain to match the distribution of the target domain.
We use eigen-analysis of data on both the domains, to estimate the trans-
formation along each dimension separately. In order to parameterize the
distributions in both the domains, we perform clustering separately along
every dimension, prior to the transformation. The proposed algorithm of
DA when applied to the task of object categorization, gives better results
than a few state of the art methods.

Keywords: Domain Adaptation (DA), non-parametric clustering, eigen-


based transformation, object categorization.

1 Introduction
Domain adaptation [1], [2] is a well-known problem in the field of machine learn-
ing, with recent applications in many Computer Vision tasks. The basic as-
sumption for most classification and regression techniques is that the training
and the testing samples are drawn from the same distribution. For many real-
world datasets, the distributions between the training and the testing data are
dissimilar, which leads to a poor classification performance. This happens in
situations where, the test samples are drawn only from the target domain and
typically a large number of training samples are available in the source domain.
In many situations, only a few labeled samples (images) are available for a
classification task in the target domain, though plenty of samples are available
from the source (or auxiliary) domain. When a small number of labeled training
samples is used for learning, it generally results in an ill-fitted model.
This is known as the small sample size (SSS) problem, where the parameters ob-
tained during the training phase are not generalized for the testing data, leading
to highly erroneous results during the testing phase.
Domain adaptation (DA) is the process where one can use the training sam-
ples available from source domain to aid a classification task. Typically, a large
number of samples (instances) from source domain and a few from target domain
are available for training in supervised DA. The job of classification is done using


a separate set of test samples obtained only from the target domain. Broadly
speaking, there are two types of DA techniques available in the literature: (a) su-
pervised - where we have a very small number of labeled training samples from
the target domain and (b) unsupervised - where we have plenty of unlabeled
training samples from target domain. Using training samples from both the do-
mains, we expect to build a statistical model which gives better performance on
the testing data available from the target domain.
In the recent past, the problem of transfer learning or DA, has been attempted
on applications of various computer vision tasks [1], [2], [3], [4]. Jiang et. al.
[5] and Yang et. al. [6] have proposed methods of modifying the existing SVM
trained on video samples available from source domain by introducing a bias term
between source and target domains during optimization in training the SVM. A
transformation matrix has been proposed in [1], which transforms instances from
one domain to other. In [7], a domain dependent regularizer has been proposed,
which enforces the target classifier to give results similar to the relevant base
classifiers on unlabeled instances from target domain. There have been works
on unsupervised DA which measures the geodesic distance between source and
target domains in Grassmannian manifold [2], [3]. Application of DA on face
detection recognition [8] has also been exploited.
In this paper, we use the structural information of clusters present in the
dataset for DA. Transformation of data to a different domain becomes simplified
if the distribution of the two domains are known or estimated properly. Hence, we
group the dataset into Gaussian clusters in both the domains, and propose trans-
formation of clusters from source to target domain using an inter-domain cluster
mapping function. Results shown on real-world object categorization datasets
reveal the efficiency of the system.
The rest of the paper is organized as follows. Section 2 gives a concise descrip-
tion of the proposed method of clustering and domain transformation. Section
3 presents and discusses the performance of the proposed methodology on real-
world datasets. Section 4 concludes the paper.

2 Proposed Method of Domain Adaptation


This paper presents a method of DA, where instances from the source domain are
transformed to match the distribution of the target domain. The eigen-analysis
of a data is robust to the presence of less number of samples. This has been
the major motivation of our work. The proposed method of DA, consists of
three main stages - (a) clustering data in both the domains dimension-wise, to
represent the data in both the domains using one or multiple number of Gaus-
sian function(s), (b) cross-domain mapping of clusters which helps to determine
which cluster in target domain has a similar distribution with a cluster in source
domain, such that the particular source domain cluster has a distribution similar
to the mapped target domain cluster after transformation and (c) Eigen-Domain
Transformation (EDT) - to transform a source domain cluster such that its dis-
tribution is similar to its mapped target domain cluster. We consider that atleast
one training sample per class is present from both source and target domains.

Let X ∈ ℝ^{n_s×D} and Y ∈ ℝ^{n_t×D} denote the source and the target data having n_s and n_t samples respectively. Let K_s^d and K_t^d be the number of clusters in X_d and Y_d respectively. Let δ_s^d = {1, . . . , K_s^d} and δ_t^d = {1, . . . , K_t^d} be the sets of cluster labels of the clusters formed in X_d and Y_d, where X_d and Y_d represent the d-th feature of X and Y respectively (d = 1, 2, . . . , D). Let X_d^i and Y_d^j denote the i-th and j-th clusters of X_d and Y_d respectively (i ∈ δ_s^d, j ∈ δ_t^d). The entire process is explained in the following sub-sections.

2.1 Clustering Using Non-parametric Density Estimation

This step computes the density distribution of data and obtains clusters in both
the domains along each dimension separately. To cluster a data along each di-
mension, we estimate the density of the data using Parzen window estimator.
The size of the window is set to n^{-1/2}, where n is the number of instances.
Next, we detect the peaks and the valleys present in the probability density dis-
tribution. All the instances whose probability density falls between two adjacent
valley points are clustered together. Since we are clustering the data along each
dimension separately, a dataset may have different number of clusters along each
dimension. This process is repeated for both source and target domains. A pro-
cess of smoothing the density distribution may be necessary prior to detection
of peaks and valleys.
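A minimal sketch of this dimension-wise clustering is given below; the Gaussian kernel, the grid resolution and the valley detection with scipy's argrelmin are implementation choices of ours, with the window size n^{-1/2} taken from the text.

    import numpy as np
    from scipy.signal import argrelmin

    def cluster_one_dimension(x, grid_size=512):
        # x: values of one (range-normalised) feature for all instances.
        n = len(x)
        h = n ** -0.5                              # Parzen window size
        grid = np.linspace(x.min(), x.max(), grid_size)
        # Gaussian Parzen estimate of the density evaluated on the grid.
        density = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
        density /= n * h * np.sqrt(2.0 * np.pi)
        # Valleys (local minima) of the density split the axis into clusters;
        # samples falling between two adjacent valleys share one label.
        valleys = grid[argrelmin(density)[0]]
        labels = np.digitize(x, valleys)
        return labels, grid, density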
Initially, we normalize the range of the data in both domains. The problem
of small sample size will not affect the method of non-parametric clustering, as
the process is performed repeatedly (done for all dimensions separately) using
only one dimension at a time. Distribution of dataset can be parameterized by
fitting a Gaussian mixture model (GMM) in general. However, due to the presence of very few samples in the target domain, fitting a GMM often produces inaccurate results and is thus avoided.

2.2 Cross-Domain Mapping of Clusters


We define a mapping F_{S→T}^d : δ_s^d → δ_t^d, which maps each cluster in the source domain to a cluster in the target domain, using the KL-divergence measure. If F_{S→T}^d(i) = j, i.e. X_d^i is mapped to Y_d^j, then Y_d^j is termed the ’image cluster’ of X_d^i and X_d^i is the ’pre-image cluster’ of Y_d^j.
Since n_s ≠ n_t, we calculate the divergence between two cross-domain clusters
using the KL-Divergence measure (assuming a Gaussian distribution), which re-
quires the mean and covariance of the two Gaussian distributions. Hence, we
assume a univariate Gaussian function over each cluster along a feature dimen-
sion. The position of a peak present between the two valleys is considered as
the mean of the Gaussian function. The average distance between a pair of val-
leys on either side of a peak is considered as an approximation of the standard
deviation. KL-Divergence (kldiv) between two Gaussian distribution is then es-
timated using the formulation as described in [9]. The cross-domain mapping
function, FdS→T , is calculated as follows:

a. Calculate a dissimilarity matrix Λ_{S→T}^d ∈ ℝ^{K_s^d × K_t^d}, using the KL divergence [9] between all pairs of clusters, such that Λ_{S→T}^d(i, j) = kldiv(X_d^i, Y_d^j).
b. Calculate the average similarity of each of the clusters in Y_d with all the source domain clusters as: η^d(j) = mean_i Λ_{S→T}^d(i, j), ∀j ∈ δ_t^d.
c. Using the criterion that if Y_d^j is most similar to X_d^i then F_{S→T}^d(i) = j, calculate F_{S→T}^d as: F_{S→T}^d(i) = arg min_j Λ_{S→T}^d(i, j), ∀i ∈ δ_s^d. Here, F_{S→T}^{-1}(j) denotes the inverse mapping, which gives the set of ’pre-image clusters’ of Y_d^j in the source domain.
d. Now, if for any Y_d^j the ’pre-image cluster’ set is NULL, then identify an X_d^i satisfying the following conditions and re-assign the mapping as:

       F_{S→T}^d(i) = arg min_j Λ_{S→T}^d(i, j),
           ∀j such that F_{S→T}^{-1}(j) is NULL,
           ∀i such that |F_{S→T}^{-1}(F_{S→T}^d(i))| > 1 and Λ_{S→T}^d(i, j) ≤ η^d(j)

If there remains some cluster in the target domain which is outside the range of the function F_{S→T}^d but has a distribution similar to a cluster in the source domain, we re-assign the mapping function based on the above equation. Let X_d^i be a cluster in X_d and its corresponding ’image cluster’ in Y_d be Y_d^j. Let there exist another cluster Y_d^k in Y_d for which the ’pre-image cluster’ set is empty (thus, kldiv(X_d^i, Y_d^j) ≤ kldiv(X_d^i, Y_d^k)). If X_d^i is quite similar, based on the condition kldiv(X_d^i, Y_d^k) ≤ η^d(j), and the current number of elements in the ’pre-image’ set of Y_d^j is greater than one (there exists at least one more cluster X_d^k such that k ≠ i), then according to this step, update the ’image cluster’ for X_d^i to Y_d^k.
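A compact sketch of the mapping (steps a–c, omitting the re-assignment of step d) is given below; each cluster is summarised by the univariate Gaussian mean/standard deviation described above, and the closed-form KL divergence between two Gaussians is used as the dissimilarity.

    import numpy as np

    def kl_gauss(mu0, s0, mu1, s1):
        # KL divergence between N(mu0, s0^2) and N(mu1, s1^2), closed form.
        return np.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

    def map_clusters(src_params, tgt_params):
        # src_params, tgt_params: lists of (mean, std) pairs, one per cluster,
        # for a single feature dimension d.
        Ks, Kt = len(src_params), len(tgt_params)
        Lam = np.empty((Ks, Kt))                    # dissimilarity matrix Lambda
        for i, (m_s, s_s) in enumerate(src_params):
            for j, (m_t, s_t) in enumerate(tgt_params):
                Lam[i, j] = kl_gauss(m_s, s_s, m_t, s_t)
        F = Lam.argmin(axis=1)    # each source cluster -> most similar target cluster
        eta = Lam.mean(axis=0)    # average dissimilarity of each target cluster
        return F, Lam, eta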

2.3 Eigen Domain Transformation (EDT)

This stage describes the process of transferring instances from source to target
domain using the well-formed clusters obtained in the earlier stage. During transformation, we preserve the relative distances between instances of the source domain along each of the principal component directions. The relation between the eigen-analysis of data following a Gaussian distribution and the distribution parameters is given in [10] (p. 29). The proposed method of EDT exploits this relation after clustering has grouped the data into Gaussian-distributed clusters. We
match the distribution of the source and target clusters in the eigen-space and
project it back to get the transformed source cluster.
Let (λ_i = [λ_i^1, . . . , λ_i^D], Φ_i = [Φ_i^1 . . . Φ_i^D]) and (γ_j = [γ_j^1 . . . γ_j^D], Ψ_j = [Ψ_j^1 . . . Ψ_j^D]) be the corresponding pairs of eigenvalues/eigenvectors obtained by cluster-wise eigen-analysis of X_i and Y_j respectively. Each cluster X_d^i, i ∈ δ_s^d, of the source domain is transformed to match the distribution of its corresponding ’image cluster’ F_{S→T}^d(i), as determined in the previous stage. If X^τ denotes the transformed source domain data, then the eigen-analysis of X^τ should be identical to that of Y. Let us consider one cluster, X_d^i, whose ’image cluster’ is Y_d^j in the target domain (F_{S→T}^d(i) = j). The steps to obtain X^τ are as follows:

1. For each dimension d, do:
   (a) Consider the projection of X_d^i onto Φ_i^d, denoted by X̂_i^d, which can be obtained by considering the d-th dimension of X_i Φ_i. Similarly, calculate Ŷ_j^d using Ψ_j for the target domain. Normalize the range of X̂_i^d.
   (b) Adjust the mean and variance of X̂_i^d to match those of Ŷ_j^d, as: X̃_i^d = γ_j^d X̂_i^d + μ_j^d, where μ_j^d is the mean of Ŷ_j^d. Since this is a linear operation, the relative distances between instances remain preserved in the eigen domain.
   (c) Form X̃^d = [X̃_1^d; . . . ; X̃_{K_s^d}^d]^T, i.e., the d-th dimension of X̃.
2. The transformed source domain data X^τ is estimated using the inverse eigen-transformation as: X^τ = X̃ Ψ^{-1}, where X̃ = [X̃^1 X̃^2 . . . X̃^D].
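An illustrative sketch of this transformation for one source cluster and its image cluster is given below; it is our reading of steps 1–2 (including the literal scaling by the target eigenvalues in step 1(b)), with numpy used for the eigen-analysis, and should not be taken as the authors' implementation.

    import numpy as np

    def eigen_analysis(Z):
        # Eigenvalues / eigenvectors of the covariance matrix of a cluster Z (n x D).
        values, vectors = np.linalg.eigh(np.cov(Z, rowvar=False))
        return values, vectors

    def edt_cluster(Xi, Yj):
        # Project the source cluster, normalise the range, scale and shift using
        # the target cluster's eigenvalues and projected means (step 1), then
        # apply the inverse eigen-transformation (step 2).
        lam, Phi = eigen_analysis(Xi)
        gam, Psi = eigen_analysis(Yj)
        Xp = Xi @ Phi                        # projections of Xi on its eigenvectors
        Yp = Yj @ Psi                        # projections of Yj on its eigenvectors
        Xn = (Xp - Xp.min(0)) / (Xp.max(0) - Xp.min(0) + 1e-12)  # range normalisation
        Xt = gam * Xn + Yp.mean(0)           # X_tilde = gamma * X_hat + mu
        return Xt @ np.linalg.inv(Psi)       # X_tau = X_tilde * Psi^(-1)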

2.4 Result on a Synthetic (toy) Data

To explain the steps of the proposed algorithm, we consider a simple example of


data distribution in source and target domains in ℝ^2. In Fig. 1 (a) the green and
the blue points denote the instances from source and target domains respectively.
The result of dimension-wise clustering for first and second dimensions are shown
in Figs. 1 (b) and (c) respectively. The curves shown on top in green and the
blue, denote the density obtained by the Parzen window estimator for source and
target domains. The brown and the magenta curves in Figs. 1 (b) & (c) denote
the Gaussian distributions modeling the density functions for each cluster in
source and target domains respectively. Fig. 1 (d) shows the transformed source
domain data in red points obtained using the proposed method, as explained in


Fig. 1. Data from source and target domains (in ℝ^2) are marked in green and blue
points in (a) and (d); while transformed source domain data is marked in red in (d).
Intermediate results of clustering are shown in (b) and (c) for first and second dimen-
sions respectively. Green and blue (dash) curves denote the density of data in source
and target domains. Brown and magenta (dash) curves indicate the Gaussian functions
modeling the clusters distributions formed in source and target domains respectively.

previous sub-sections, whose distribution is similar to that of the target domain


data. The KL divergence measures with respect to the target domain, for the source and the transformed source domains, are 9.0179 and 0.5371 respectively.

3 Experimental Results
We evaluate the performance of the proposed method on real world datasets
for object categorization. The original Office dataset [1] contains 3 domains:
Amazon (A), Dslr (D) and Webcam (W), each having 31 classes of objects.
In [2], Office dataset has been merged with Caltech-256 dataset to create four
domains, Amazon (A), Caltech (C), Dslr (D) and Webcam (W), with ten classes
of objects. A few sample images from the 4 domains are shown in Fig. 2. The
size of the image samples in the Amazon, DSLR, Webcam and Caltech datasets are: 300 × 300, 1000 × 1000, 500 × 500 (average size), and from 170 × 104 to 2304 × 1728.

AMAZON DSLR WEBCAM CALTECH

Fig. 2. Sample images of two classes of objects taken from four domains

SURF features [11] are extracted and a bag of words (BOW) feature set is
calculated with a codebook of size 800, as done in [1], [2]. Two methods of EDT
are used for experimentation: (i) class-wise EDT: done for every class separately,
and (ii) Unsupervised EDT: done on the entire dataset by considering data from
all classes together. In the following, we describe the two sets of experimentation
done to exhibit the efficiency of the proposed method of DA.
In the first set of experiments, we consider the Office dataset [1] with
31 classes. Number of training samples taken are: for target domain 3 samples
per class for Amazon/Dslr/Webcam, for source domain 8 samples per class for
Webcam/Dslr and 20 samples per class for Amazon. Results obtained using a
K-nearest neighbor (k=1) classifier, are averaged over a 10-fold cross validation,
which are compared with that reported in [1], [2]. Table 1 shows the classification
accuracy (in %-age) of object categorization using different techniques of DA.
The 2nd and 3rd columns show the results of metric learning [1] while the 4th
and 5th columns show the results of sampling geodesic flow (SGF) [2]. The 6th
and 7th columns show the result of our proposed class-wise and unsupervised
EDT methods respectively. The proposed EDT gives better performance than
the metric learning method given in [1], while the results of SDF [2] outperform
the proposed method only in one case.
In the second set of experiments, we consider the object dataset used
in [7], which is a mixture of the office dataset [1] and Caltech - 256 [12]. The

Table 1. Classification accuracy (in %-age) of Office dataset [1] using different tech-
niques of domain adaptation (DA). Best classification accuracy is highlighted in bold.

Domains Metric Learning [1] SGF [2] Proposed EDT


Source Target Asymm Symm Unsup. Semi-sup. Class-wise Unsup.
webcam dslr 25 27 19 37 41.13 32
dslr webcam 30 31 26 36 48.26 52.22
amazon webcam 48 44 39 57 50.01 49.00

[Fig. 3 comprises four panels of bar charts: classification accuracy (%, y-axis from 50 to 90) for the 12 source→target pairs C→A, D→A, W→A, A→C, D→C, W→C, A→D, C→D, W→D, A→W, C→W and D→W, with bars for Class-wise EDT, Unsup EDT, DAM and NC in each group.]

Fig. 3. Classification accuracy done using DA for 12 different cases of object catego-
rization, using the Office+Caltech dataset [2]. Results are grouped into four categories,
with an identical target domain and three source domains considered separately.

dataset has four different domains from which we get 12 different pairs of source
and target domains. We create different sets of training samples by considering
different fractions (0.2, 0.3, ... , 0.7) of the training samples from the target
domain. The average classification accuracy is reported with a 10-fold cross vali-
dation. In this case, we use SVM with histogram intersection kernel to obtain the
classification accuracy. We compare our method with Domain Adaptive Machine
(DAM) [7], a supervised technique of DA. Results are shown in Fig. 3.
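The histogram intersection kernel can be supplied to an SVM as a precomputed kernel; the scikit-learn sketch below is one way to do this (the library choice and the value of C are ours, not stated in the paper).

    import numpy as np
    from sklearn.svm import SVC

    def hist_intersection(A, B):
        # K(a, b) = sum_d min(a_d, b_d) for bag-of-words histograms.
        K = np.zeros((A.shape[0], B.shape[0]))
        for i, a in enumerate(A):
            K[i] = np.minimum(a, B).sum(axis=1)
        return K

    def train_and_test(X_train, y_train, X_test):
        clf = SVC(kernel='precomputed', C=1.0)
        clf.fit(hist_intersection(X_train, X_train), y_train)
        return clf.predict(hist_intersection(X_test, X_train))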
The mean accuracy over different sets of training samples is reported for all
the 12 scenarios of classification using DA. The red and the green bars in Fig. 3 show the performance of the proposed class-wise and unsupervised EDT methods respectively. The blue bar shows the mean accuracy when DAM [7] is used. We
also observe the performance of the classifier when samples from both source
and target domains are combined together for training. This method is termed
as Naive Combination (NC), for which the performance is given by the yellow
bar. Class-wise EDT technique gives the best result for 10 DA classification
tasks. For two cases, D→W and W→D, NC gives better results. This is due to

the fact that the two domains - Dslr and Webcam have similar distribution and
application of DA in this case leads to negative transfer.
Another interesting fact for these two tasks is that the unsupervised EDT
performs marginally better than the class-wise EDT, as the available number of
training samples are the least among the 12 different classification tasks. Hence,
the unsupervised EDT is expected to give better performance as the covariance
matrix is estimated more accurately with a larger number of training samples
(than in class-wise EDT), leading to less error during eigen-analysis. Hence,
the choice between two techniques of EDT (class-wise and unsupervised) for a
classification task, should be based on the number of available training samples.

4 Conclusion
We have proposed a new method of domain adaptation and applied it successfully
for the task of object categorization. Difference in distributions among the data
in source and target domains, is overcome by clustering and then modeling with
Gaussian functions. A cross-domain mapping function helps to transform data
from source to target domain, using a forward followed by an inverse eigen-
transformations. Results show that the proposed method of DA is better than a
few state of the art published in the recent past. The work can be extended for
handling multiple source domains.

Acknowledgment. This work is supported by TCS India.

References
1. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to
new domains. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part
IV. LNCS, vol. 6314, pp. 213–226. Springer, Heidelberg (2010)
2. Gopalan, R., Li, R., Chellappa, R.: Domain adaptation for object recognition:
An unsupervised approach. In: International Conference in Computer Vision,
pp. 999–1006 (2011)
3. Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: IEEE
Conference on Computer Vision and Pattern Recognition, pp. 2066–2073 (2012)
4. Marton, Z.-C., Balint-Benczedi, F., Seidel, F., Goron, L.C., Beetz, M.: Object cate-
gorization in clutter using additive features and hashing of part-graph descriptors.
In: Stachniss, C., Schill, K., Uttal, D. (eds.) Spatial Cognition 2012. LNCS (LNAI),
vol. 7463, pp. 17–33. Springer, Heidelberg (2012)
5. Jiang, W., Zavesky, E., Fu Chang, S., Loui, A.: Cross-domain learning methods
for high-level visual concept classification. In: International Conference on Image
Processing, pp. 161–164 (2008)
6. Yang, J., Yan, R., Hauptmann, A.G.: Cross-domain video concept detection using
adaptive svms. In: International Conference on Multimedia, pp. 188–197 (2007)
7. Duan, L., Xu, D., Tsang, I.W.H.: Domain adaptation from multiple sources: A
domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems 23(3), 504–518 (2012)

8. Qiu, Q., Patel, V.M., Turaga, P., Chellappa, R.: Domain adaptive dictionary learn-
ing. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV
2012, Part IV. LNCS, vol. 7575, pp. 631–645. Springer, Heidelberg (2012)
9. Penny, W.: Kl-divergences of normal, gamma, dirichlet and wishart densities. Tech-
nical report, Wellcome Department of Cognitive Neurology, University College Lon-
don (2001)
10. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press (1990)
11. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (surf).
Computer Vision Image Understanding 110(3), 346–359 (2008)
12. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical
Report 7694, California Institute of Technology (2007)
Estimating Clusters Centres Using Support
Vector Machine: An Improved Soft Subspace
Clustering Algorithm

Amel Boulemnadjel and Fella Hachouf

Laboratoire d’Automatique et de Robotique, Département d’Électronique
Faculté des sciences de l’ingénieur, Université Constantine 1
Route d’Ain el bey, 25000 Constantine, Algérie
[email protected], [email protected]

Abstract. In this paper, a new approach of soft subspace clustering is


proposed. It is based on the estimation of the clusters centres using a
multi-class support vector machine (SVM). This method is an extension
of the ESSC algorithm which is performed by optimizing an objective
function containing three terms: a weighting within cluster compactness,
entropy of weights and a weighting between clusters separations. First,
the SVM is used to compute initial centres and partition matrices. This
newly developed formulation of the centres is integrated in each iteration
to yield new centres and membership degrees. A comparative study has
been conducted on UCI datasets and different image types. The obtained
results show the effectiveness of the suggested method.

Keywords: soft, subspace, clustering, SVM, centre.

1 Introduction

Clustering problem concerns the discovery of homogeneous groups of data ac-


cording to a certain similarity measure. In high dimensional data sets, it is dif-
ficult to differentiate similar data points from dissimilar ones. Clusters
are embedded in subspaces of high dimensional data space, and different clus-
ters may exist in different subspaces of different dimensions. The difficulty that
conventional clustering algorithms encounter in dealing with high dimensional
data sets motivates the concept of subspaces clustering [1]. Subspace clustering
is the task of detecting all clusters in all subspaces. This means that a point
might be a member of multiple clusters, each existing in a different subspace.
This concept is better in handling multidimensional data than other methods.
The two main categories of subspace clustering algorithms are hard subspace
clustering and soft subspace clustering. Hard subspace clustering methods [2, 3]
have been extensively studied for clustering high dimensional data to identify
the exact clusters. Soft subspace clustering [4, 5, 6] has become an effective means
of dealing with high dimensional data. It assigns a weight to each dimension to
measure its contribution to build a particular cluster. Most clustering techniques

use a distance as a similarity or dissimilarity between objects as a measure to


yield clusters. A major weakness of soft subspace clustering algorithms is that
almost all of them are developed based on within-cluster information only or by
employing both within-cluster and between-cluster information. Also, most ex-
isting soft subspace clustering algorithms contain parameters which are difficult
for users to determine in real-world applications. Many subspace clustering
algorithms have been developed and applied to different areas [7, 8, 9]. Their per-
formance can be further enhanced. In the most recent methods, locally adaptive
metrics are developed to avoid the risk of information loss encountered in global
dimensionality reduction techniques. Different combinations of dimensions are
used via local weighting of features [10]. In [11], the authors propose two types of
weights in the clustering process. It is an extension of the K-means algorithm
obtained by adding two steps to automatically calculate the two types of subspace
weights. Another soft subspace clustering method, ESSC, is proposed in [6]; it
employs both within-cluster and between-cluster information. The optimized
objective function is developed by integrating the within-cluster compactness and
the between-cluster separation. This objective function contains three terms: the
weighted within-cluster compactness, the entropy of the weights and the weighted
between-cluster separation. This algorithm has a problem in locating the cluster
centres, in terms of estimating inter- and intra-cluster distances. Indeed, errors
in estimating distances between centres are amplified through the iterations,
which induces a low classification rate and a high processing time. In this paper,
we propose a new method for subspace clustering based on enhanced clusters
centres estimation. For this purpose a classification step using a multi-class sup-
port vector machine (SVM) is performed [12,13]. As an initialization step, the first
six parameters of the co-occurrence matrix and the edges are extracted from the
original image. An SVM is trained using these features to estimate the cluster
centres vi and the membership degree uj of each element. As a processing step,
the ESSC algorithm uses these values to compute new ones and produces a
new SVM training vector. The final classification is performed when the centres
given by the SVM at two consecutive iterations no longer change; otherwise the
v and u computed by the SVM are injected back into the ESSC, and so on.
Results in the different parameter subspaces are concatenated (cf. Fig. 1).
This paper is organized as follows: after an
introduction, Section 2 describes the new soft subspace clustering method. The
results are discussed in Section 3. Finally some concluding remarks are given in
the last section.

2 The Proposed Method


The SVM is applied using an arbitrary learning vector

T_{xy} = \{(x_1, y_1), \dots, (x_k, y_k)\}    (1)

where y_k is the cluster of the input x_k. The SVM is trained on the data T_{xy} with the kernel function k given by

k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{2\sigma^2}\right)    (2)
Fig. 1. Flowchart of the proposed method

where x_i, x_j are data inputs and σ is a free parameter.
The optimal value of σ is retained; it corresponds to the best classification rate.
The cluster centres v_{ik} obtained by the SVM can be written as

v_{ik} = \frac{\sum_{l=1}^{N_i} x_{lk}}{N_i} = \frac{\sum_{j=1}^{N} u_{ij}\, x_{jk}}{N_i}    (3)

where the membership degrees u are defined as

u_{ij} = \begin{cases} 1 & \text{if the cluster of pixel } j \text{ is } i \\ 0 & \text{otherwise} \end{cases}, \qquad i = 1,\dots,c; \; j = 1,\dots,N    (4)
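For illustration, a minimal sketch of this initialization in Python, assuming scikit-learn's SVC as the multi-class SVM and that cluster labels are coded 0,...,c-1; all variable names are illustrative only and not taken from the paper:

import numpy as np
from sklearn.svm import SVC

def svm_initialisation(X_train, y_train, X, c, sigma=1.0):
    # Train a multi-class SVM on the learning vector (Eq. 1).
    # Note: scikit-learn's RBF kernel is exp(-gamma*||xi - xj||^2), which only
    # approximates the kernel written in Eq. (2).
    svm = SVC(kernel='rbf', gamma=1.0 / (2.0 * sigma ** 2))
    svm.fit(X_train, y_train)

    labels = svm.predict(X)                      # predicted cluster of every point

    # Crisp membership degrees u_ij (Eq. 4): 1 if point j is assigned to cluster i.
    u = np.zeros((c, X.shape[0]))
    u[labels, np.arange(X.shape[0])] = 1.0

    # Initial cluster centres v_ik (Eq. 3): mean of the points assigned to cluster i.
    counts = np.maximum(u.sum(axis=1, keepdims=True), 1.0)   # N_i, guarded against empty clusters
    v = (u @ X) / counts
    return v, u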


The ESSC objective function to be minimized is

J_{ESSC}(v,u,w) = \sum_{i=1}^{c}\sum_{j=1}^{N}\sum_{k=1}^{D} u_{ij}^{m}\, w_{ik}\,(x_{jk}-v_{ik})^2 + \gamma\sum_{i=1}^{c}\sum_{k=1}^{D} w_{ik}\ln w_{ik} - \eta\sum_{i=1}^{c}\Big(\sum_{j=1}^{N} u_{ij}^{m}\Big)\sum_{k=1}^{D} w_{ik}\,(v_{ik}-v_{0k})^2    (5)
where c is the number of clusters, N the data size, D the number of features, v the cluster
centre matrix, w the weight matrix and u the fuzzy partition matrix.
Using Lagrange multipliers, u, v and w are deduced so as to minimize the objective
function of Eq. (5):
u_{ij} = \frac{\Big[\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^2 - \eta\sum_{k=1}^{D} w_{ik}(v_{ik}-v_{0k})^2\Big]^{-\frac{1}{m-1}}}{\sum_{i'=1}^{c}\Big[\sum_{k=1}^{D} w_{i'k}(x_{jk}-v_{i'k})^2 - \eta\sum_{k=1}^{D} w_{i'k}(v_{i'k}-v_{0k})^2\Big]^{-\frac{1}{m-1}}}    (6)

v_{ik} = \frac{\sum_{j=1}^{N} u_{ij}^{m}\,(x_{jk}-\eta v_{0k})}{(1-\eta)\sum_{j=1}^{N} u_{ij}^{m}}    (7)

w_{ik} = \frac{\exp\Big(-\frac{1}{\gamma}\Big[\sum_{j=1}^{N} u_{ij}^{m}(x_{jk}-v_{ik})^2 - \eta\sum_{j=1}^{N} u_{ij}^{m}(v_{ik}-v_{0k})^2\Big]\Big)}{\sum_{k'=1}^{D}\exp\Big(-\frac{1}{\gamma}\Big[\sum_{j=1}^{N} u_{ij}^{m}(x_{jk'}-v_{ik'})^2 - \eta\sum_{j=1}^{N} u_{ij}^{m}(v_{ik'}-v_{0k'})^2\Big]\Big)}    (8)

In each iteration, the obtained cluster centres v_{ik} and their near neighbourhoods,
detected using the Euclidean distance, constitute the new SVM training vector:

Txy = {(v1 , 1), ..., (vc , c), (xi , yi ), ..., (xk , yk )} (9)

where y_i = \max_{i=1,\dots,c}(u_{ij}), c is the number of clusters and k is the vector size.
The size of the learning vector depends on the data size.
The final equations of u and w are as follows:

The centre v_{ik} is replaced by its SVM-based estimate \tilde v_{ik} = \frac{\sum_{j'=1}^{N} u_{ij'}\, x_{j'k}}{N_i} (cf. Eq. (3)), giving

u_{ij} = \frac{\Big[\sum_{k=1}^{D} w_{ik}\,(x_{jk}-\tilde v_{ik})^2 - \eta\sum_{k=1}^{D} w_{ik}\,(\tilde v_{ik}-v_{0k})^2\Big]^{-\frac{1}{m-1}}}{\sum_{i'=1}^{c}\Big[\sum_{k=1}^{D} w_{i'k}\,(x_{jk}-\tilde v_{i'k})^2 - \eta\sum_{k=1}^{D} w_{i'k}\,(\tilde v_{i'k}-v_{0k})^2\Big]^{-\frac{1}{m-1}}}    (10)

where

v_{0k} = \frac{1}{c}\sum_{i=1}^{c}\frac{\sum_{j'=1}^{N} u_{ij'}\, x_{j'k}}{N_i}    (11)

w_{ik} = \frac{\exp\Big(-\frac{1}{\gamma}\Big[\sum_{j=1}^{N} u_{ij}^{m}(x_{jk}-\tilde v_{ik})^2 - \eta\sum_{j=1}^{N} u_{ij}^{m}(\tilde v_{ik}-v_{0k})^2\Big]\Big)}{\sum_{k'=1}^{D}\exp\Big(-\frac{1}{\gamma}\Big[\sum_{j=1}^{N} u_{ij}^{m}(x_{jk'}-\tilde v_{ik'})^2 - \eta\sum_{j=1}^{N} u_{ij}^{m}(\tilde v_{ik'}-v_{0k'})^2\Big]\Big)}    (12)
Algorithm.
Initialization step
– Input: number of clusters C, parameters m, μ and ε.
– Train the SVM on the input data
– Compute w by using (8)
Processing step
While ||v(t + 1) − v(t)|| ≥ ε do
– Compute the partition matrix u using (10)
– Compute the cluster centre matrix v using (7)
– Compute w using (12)
– Extract the learning vector using the centres previously computed
– Apply the SVM algorithm
– Compute the new centres matrix using equations (3) and (4)
endwhile
Clustering step
– Assign each pixel j to its potential cluster using the maximum of the membership degrees.
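A compact sketch of this alternation is given below, based on the reconstructed update formulas above; the random seed labelling, the parameter values and the omission of the per-iteration SVM re-training on co-occurrence features are simplifications made only for illustration:

import numpy as np
from sklearn.svm import SVC

def essc_svm_clustering(X, c, m=2.0, eta=0.1, gamma=1.0, eps=1e-3, max_iter=100):
    # Illustrative sketch only; parameter values are arbitrary placeholders.
    N, D = X.shape
    v0 = X.mean(axis=0)                               # global data centre v_0
    w = np.full((c, D), 1.0 / D)                      # feature weights
    # initialisation with an SVM trained on a small labelled seed (Eqs. 1-4)
    seed = np.random.choice(N, size=c, replace=False)
    svm = SVC(kernel='rbf', gamma='scale').fit(X[seed], np.arange(c))
    u = np.eye(c)[svm.predict(X)].T                   # crisp memberships, shape (c, N)
    v = (u @ X) / np.maximum(u.sum(1, keepdims=True), 1)

    for _ in range(max_iter):
        v_old = v.copy()
        # distance term of Eq. (6) for every (cluster, point) pair
        d = np.einsum('ik,ijk->ij', w, (X[None] - v[:, None]) ** 2) \
            - eta * (w * (v - v0) ** 2).sum(1, keepdims=True)
        d = np.maximum(d, 1e-12)                      # eta is assumed small enough to keep d > 0
        u = d ** (-1.0 / (m - 1.0))
        u /= u.sum(axis=0, keepdims=True)             # Eq. (6)
        um = u ** m
        v = (um @ (X - eta * v0)) / ((1.0 - eta) * um.sum(1, keepdims=True))   # Eq. (7)
        phi = np.einsum('ij,ijk->ik', um, (X[None] - v[:, None]) ** 2) \
              - eta * um.sum(1, keepdims=True) * (v - v0) ** 2
        phi -= phi.min(axis=1, keepdims=True)         # numerical stabilisation before exp
        w = np.exp(-phi / gamma)
        w /= w.sum(axis=1, keepdims=True)             # Eq. (8)
        if np.linalg.norm(v - v_old) < eps:
            break
    return u.argmax(axis=0), v, w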

3 Results and Discussions

The performance of the proposed algorithm has been studied on nine UCI
databases [14], given with their specific number of clusters, number of instances
and number of attributes. The normalized mutual information (NMI) metric
given by Eq. (13) is used to evaluate and compare the performance of the proposed
algorithm and ESSC [6]. The higher the NMI value, the better the clustering
result.
NMI = \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} N_{ij}\,\log\frac{N \cdot N_{ij}}{N_i\, N_j}}{\sqrt{\Big(\sum_{i=1}^{c} N_i\log\frac{N_i}{N}\Big)\Big(\sum_{j=1}^{c} N_j\log\frac{N_j}{N}\Big)}}    (13)

where N_{ij} is the number of agreements between cluster i and true cluster j,
N_i is the number of data points in cluster i, and N_j is the number of data points
in true cluster j.
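For reference, Eq. (13) can be computed directly from a contingency table, as in the following sketch (an illustration, not the authors' implementation):

import numpy as np

def nmi(labels_pred, labels_true):
    # Contingency counts N_ij between obtained clusters i and true clusters j.
    clusters = np.unique(labels_pred)
    classes = np.unique(labels_true)
    N = len(labels_true)
    Nij = np.array([[np.sum((labels_pred == ci) & (labels_true == cj))
                     for cj in classes] for ci in clusters], dtype=float)
    Ni = Nij.sum(axis=1)          # points in obtained cluster i
    Nj = Nij.sum(axis=0)          # points in true cluster j
    nz = Nij > 0
    num = (Nij[nz] * np.log(N * Nij[nz] / np.outer(Ni, Nj)[nz])).sum()
    den = np.sqrt((Ni * np.log(Ni / N)).sum() * (Nj * np.log(Nj / N)).sum())
    return num / den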
All the experiments were run on a 2.22 GHz CPU with 2 GB RAM. The
descriptions of the databases used are given in Table 1. The NMI values are tabulated
in Table 2.
It is clear from Table 2 that the NMI value is generally greater for the proposed
algorithm than for the ESSC algorithm, and the number of iterations decreases.
For both methods, the evolution of the norm of the cluster centres is plotted against
the number of iterations. The results are compared to the real norm of the cluster
centres given by the dataset (Fig. 2, green). The cluster centres obtained by
Table 1. UCI datasets

Database name   Number of clusters   Number of instances   Number of attributes
Abdolan                 3                   4177                    8
Australian              2                    690                   14
Balance                 3                    625                    4
Car                     4                    946                   18
Glass                   6                    214                    9
Heart 1                 5                    270                   13
Heart                   2                    270                   13
Iris                    3                    150                    4
Wine                    3                    178                   13

Table 2. Clustering results obtained for nine UCI datasets with NMI as metric

dataset      methods            NMI      iterations
Abdolan      ESSC               0.1639   13
             Proposed method    0.1811   7
Australian   ESSC               0.3617   12
             Proposed method    0.3884   2
Balance      ESSC               0.2278   4
             Proposed method    0.2878   3
Car          ESSC               0.1235   12
             Proposed method    0.2265   3
Glass        ESSC               0.3505   35
             Proposed method    0.5483   7
Heart        ESSC               0.3067   12
             Proposed method    0.2533   2
Heart1       ESSC               0.1114   16
             Proposed method    0.1933   7
Iris         ESSC               0.7419   22
             Proposed method    0.8642   4
Wine         ESSC               0.8629   31
             Proposed method    0.7774   4

the proposed method are closer to the real centres, with a minimum number
of iterations (Fig. 2). The proposed approach has been tested on different
types of images to evaluate its performance in image segmentation. The first six
co-occurrence features [15] and edge detection have been used, namely: contrast,
homogeneity, correlation, energy, angular second moment and entropy. The number
of clusters C for Fig. 3.A is 5 for the two methods, and C = 6 for Fig. 3.B. The
obtained images are given in Figure 3. It is noticed that in Figures 3.B1 and 3.B2
some clusters are merged. In Figure 3.B1 some edges of the triangle are missing
and some parts of the flower do not appear. The background of the image and the
triangle constitute the same cluster. Unlike Figures 3.B1 and 3.B2, Figures 3.C1
and 3.C2 show better shapes. The edge and the leaves of the flower are well defined.
Fig. 2. A: Centres evolution against the number of iterations (Iris dataset); B: Convergence


Fig. 3. A: original images; B: ESSC results, C: Proposed method results

4 Conclusion
In this paper a soft subspace clustering algorithm is enhanced. New formulations of the
membership degrees and cluster centres have been developed. The obtained results
have shown a significant improvement of the data clustering. Estimating the
centres and the membership degrees in the initialization step has reduced the
number of iterations, which made the algorithm convergence very fast. The SVM
used in the initialization step has improved the location of the centres. In future work,
we suggest using active learning for a better selection of the learning vector,
thus improving the clustering and minimizing the running time.

References
1. Agrawal, R., et al.: Automatic subspace clustering of high dimensional data for
data mining applications. In: SIGMOD Record ACM Special Interest Group on
Management of Data, pp. 94–105 (1998)
2. Parsons, L., Haque, E., Liu, H.: Evaluating subspace clustering algorithms. In:
Workshop on Clustering High Dimensional Data and its Applications, SIAM
Int. Conf. on Data Mining, pp. 48–56 (2004)
3. Yip, K.Y., Cheung, D.W., Ng, M.K.: A practical projected clustering algorithm.
IEEE Trans. Knowl. Data Eng. 16(11), 1387–1397 (2004)
4. Chang, H., Yeung, D.Y.: Locally linear metric adaptation with application to
semi-supervised clustering and image retrieval. Pattern Recognition 39, 1253–1264
(2006)
5. Liang, B., et al.: A novel attribute weighting algorithm for clustering high-
dimensional categorical data. Pattern Recognition 44, 2843–2861 (2011)
6. Deng, Z., et al.: Enhanced soft subspace clustering integrating within cluster and
between-cluster information. Pattern Recognition 43, 767–781 (2010)
7. Damodar, R., Janaa, P.K.: A prototype-based modified DBSCAN for gene cluster-
ing. Procedia Technology 6, 485–492 (2012)
8. Yang, A., et al.: Unsupervised segmentation of natural images via lossy data com-
pression. Comput. Vis. Image Understand 110, 212–225 (2008)
9. Vidal, R., Tron, R., Hartley, R.: Multiframe motion segmentation with missing
data using power factorization and GPCA. Int. J. Comput. Vis. 79, 85–105 (2008)
10. Domeniconi, C., et al.: Locally adaptive metrics for clustering high dimensional
data. Data Min. Knowl. Disc. 14, 63–67 (2007)
11. Xiaojun, C., et al.: A feature group weighting method for subspace clustering of
high dimensional data. Pattern Recognition 45, 434–446 (2012)
12. Sangeetha, R., et al.: Identifying Efficient Kernel Function in Multiclass Support
Vector Machines. International Journal of Computer Applications 28 (2011)
13. Vapnik, V.: An overview of statistical learning theory. IEEE Trans. on Neural
Networks (1999)
14. http://archive.ics.uci.edu/ml/
15. Haralick, R., et al.: Textural features for image classification. IEEE Transactions
on Systems, Man and Cybernetics 3(6), 610–621 (1973)
Fast Approximate Minimum Spanning Tree
Algorithm Based on K-Means

Caiming Zhong1,2,3 , Mikko Malinen2 , Duoqian Miao1 , and Pasi Fränti2


1
Department of Computer Science and Technology, Tongji University,
Shanghai 201804, PR China
2
Department of Computer Science, University of Eastern Finland, P.O. Box 111,
FIN-80101 Joensuu, Finland
3
College of Science and Technology, Ningbo University, Ningbo 315211, PR China

Abstract. We present a fast approximate minimum spanning tree (MST)
framework on the complete graph of a dataset with N points; any exact
MST algorithm can be incorporated into the framework and sped up. It
employs a divide-and-conquer scheme to produce an approximate MST with
theoretical time complexity of O(N^1.5), if the incorporated exact MST
algorithm has a running time of O(N^2). Experimental results show that
the proposed approximate MST algorithm is computationally efficient, and
its accuracy is close to that of the true MST.

Keywords: Minimum spanning tree, divide-and-conquer, K-means.

1 Introduction
Given an undirected and weighted graph, the MST problem is to find a span-
ning tree such that the sum of weights is minimized. Since an MST can roughly
estimate the intrinsic structure of a dataset, it has been broadly applied in im-
age segmentation [1], cluster analysis [9], classification [4] and manifold learning [8].
However, traditional MST algorithms such as Prim’s and Kruskal’s algorithms
have a running time of O(N^2) [3], and for a large dataset a fast MST algorithm
is needed.
Recent work on finding an approximate MST can be found in [6,7], and
both works apply MSTs to clustering. Wang et al. [7] employ a divide-and-conquer
scheme to detect the long edges of the MST at an early stage for clustering. Ini-
tially, data points are randomly stored in a list, and each data point is connected
to its predecessor (or successor), and a spanning tree is achieved. To optimize the
spanning tree, the dataset is divided into a collection of subsets with a divisive
hierarchical clustering algorithm. The distance between any pair of data points
within a subset can be computed by a brute force nearest neighbor search, and
with the distances, the spanning tree is updated.
Lai et al. [6] proposed an approximate MST algorithm based on Hilbert curve
for clustering. It is a two-phase algorithm: the first phase is to construct an
approximate MST of a given dataset with Hilbert curve, and the second phase is
to partition the dataset into subsets by measuring the densities of points along

Divide-and-conquer stage:
(a) Data set (b) Partitions by K-means (c) MSTs of the subsets (d) Connected MSTs

Refinement stage:

(e) Partitions on borders (f) MSTs of the subsets (g) Connected MSTs (h) Approximate MST

Fig. 1. The scheme of the proposed fast MST algorithm. (a) A given dataset. (b) The
dataset is partitioned into √N subsets by K-means. (c) An exact MST algorithm
is applied to each subset. (d) MSTs of the subsets are connected. (e) The dataset is
partitioned again so that neighboring data points in different subsets are placed
into identical partitions. (f) The exact MST algorithm is used again on the secondary
partition. (g) MSTs of the subsets are connected. (h) A more accurate approximate
MST is produced by merging the two approximate MSTs in (d) and (g).

the approximate MST with a specified density threshold. However, the accuracy
of MST depends on the order of Hilbert Curve and the number of neighbors of
a visited point in the linear list.

2 Proposed Method
2.1 Overview of the Proposed Framework
The key to improving the efficiency of constructing an MST is to reduce unnecessary
comparisons. For example, in Kruskal’s algorithm, it is not necessary to sort
all N(N − 1)/2 edges of a complete graph, but only to find the (1 + α)N edges with
least weights, where (N − 3)/2 ≥ α ≥ −1/N. We employ a divide-and-conquer
technique to achieve the improvement. The overview of the proposed method is
illustrated in Fig. 1.

2.2 Partition Dataset with K-Means


In general, a data point in an MST is connected to its nearest neighbors, which
implies that the connections have a locality property. In the divide step, it is
therefore expected that the subsets preserve this locality. Since K-means can
partition local neighboring data points into the same group, we employ K-means
to partition the dataset.

The Number of Clusters K. In our method, the number of clusters K
is set to √N. There are two reasons for this determination. One is that the
maximum number of clusters in some clustering algorithms is often set to √N
as a rule of thumb [2]. That means if a dataset is partitioned into √N subsets,
each subset will consist of data points coming from an identical genuine cluster,
which satisfies the requirement of the locality property when constructing an
MST. The other reason is that the overall time complexity of the proposed
approximate MST algorithm is minimized if K is set to √N, assuming that the
data points are equally divided into the clusters.


Divide and Conquer Algorithm. After the dataset is divided into √N sub-
sets by K-means, the MSTs of the subsets are constructed with an exact MST
algorithm, such as Prim’s or Kruskal’s algorithm. The K-means based divide
and conquer algorithm is described as follows:

Divide and Conquer Using K-Means (DAC)


Input: Dataset X;
Output: MSTs of the subsets partitioned from X

1. Set the number of subsets K = √N.
2. Apply K-means to X to achieve K subsets S = {S1 , . . . , SK }.
3. Apply an exact MST algorithm to Si , and its MST M ST (Si ) is obtained.
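A minimal sketch of the DAC step above, assuming scikit-learn's KMeans for the partition and SciPy's exact minimum_spanning_tree routine for each subset; the edge-list representation is an illustrative choice, not the authors' implementation:

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def dac(X):
    N = X.shape[0]
    K = int(np.sqrt(N))                                   # K = sqrt(N) subsets
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    msts, subsets = [], []
    for i in range(K):
        idx = np.where(km.labels_ == i)[0]
        subsets.append(idx)
        if len(idx) < 2:
            msts.append([])
            continue
        # exact MST on the complete graph of the subset
        W = squareform(pdist(X[idx]))
        T = minimum_spanning_tree(W).tocoo()
        # store the edges with the original point indices
        msts.append([(idx[r], idx[c], w) for r, c, w in zip(T.row, T.col, T.data)])
    return msts, subsets, km.cluster_centers_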

2.3 Combine MSTs of the K Subsets


An intuitive solution to combine MSTs is brute force: for the MST of a cluster,
the shortest edge between it and MSTs of other clusters is computed. But this
solution is time consuming, and therefore a fast MST-based effective combination
is presented.

MST-Based Combination. The neighboring subsets are determined first be-
cause the MSTs of those far away from each other will not be connected. This
can be achieved by MST of the centers of the subsets, see Fig. 2. To connect
a pair of neighboring subsets efficiently, the nearest point of one subset to the
center of the other is selected. For example, a and b are the nearest points to
opposite centers respectively, and they are connected.
Consequently, the algorithm of combining MSTs of subsets is summarized as
follows:
Combine Algorithm (CA)
Input: MSTs of the subsets partitioned from X: M ST (S1 ), · · · , M ST (SK ).
Output: Approximate MST of X: M ST1 , and MST of the cluster centers: M STcen ;
(a) Centroids of subsets (b) MST of centroids (c) Connected subsets


Fig. 2. The combine step of MSTs of the proposed algorithm. In (a), centers of the
partitions (c1, ..., c8) are calculated. In (b), a MST of the centers, M STcen , is con-
structed with an exact MST algorithm. In (c), each pair of subsets whose centers are
neighbors with respect to M STcen in (b) is connected.

1. Compute the center ci of subset Si , 1 ≤ i ≤ K.


2. Construct an MST, M STcen , of c1 , · · · , cK by an exact MST algorithm.
3. For each pair of subsets (Si , Sj ) whose centers are connected by an edge of
M STcen , discover the edge by DCE that connects M ST (Si ) and M ST (Sj ).
4. Combine discovered edges with M ST (S1 ), · · · , M ST (SK ) to achieve M ST1 .

Detect the Connecting Edge (DCE)


Input: A pair of subsets to be connected, (Si , Sj );
Output: The edge connecting M ST (Si ) and M ST (Sj );

1. Find data point a ∈ Si so that the distance between a and cj is minimized.


2. Find data point b ∈ Sj so that the distance between b and ci is minimized.
3. Select edge e(a, b) as the connecting edge.
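Under the same illustrative representation as the DAC sketch above, CA and DCE can be sketched as follows:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def detect_connecting_edge(X, idx_i, idx_j, c_i, c_j):
    # point of S_i closest to the centre of S_j, and vice versa (DCE steps 1-2)
    a = idx_i[np.argmin(np.linalg.norm(X[idx_i] - c_j, axis=1))]
    b = idx_j[np.argmin(np.linalg.norm(X[idx_j] - c_i, axis=1))]
    return a, b, np.linalg.norm(X[a] - X[b])

def combine(X, msts, subsets, centres):
    # exact MST over the subset centres (CA step 2)
    Wc = squareform(pdist(centres))
    T_cen = minimum_spanning_tree(Wc).tocoo()
    edges = [e for mst in msts for e in mst]
    # one connecting edge per pair of centre-neighbouring subsets (CA step 3)
    for i, j in zip(T_cen.row, T_cen.col):
        edges.append(detect_connecting_edge(X, subsets[i], subsets[j],
                                            centres[i], centres[j]))
    return edges, T_cen                                   # MST_1 edge list and MST_cen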

2.4 Refine the MST Focusing on Boundaries

However, the accuracy of the approximate MST achieved so far is not enough.
The reason is that, when the MST of a subset is built, the data points that lie in
the boundary of the subset are considered only within the subset but not across
the boundaries. Based on this observation, the refinement stage is designed.

Partition Dataset Focusing on Boundaries. In this step, another comple-
mentary partition is constructed so that the clusters are located at the bound-
ary areas of the previous K-means partition. We first calculate the midpoints of
each edge of M STcen . In most cases, these midpoints lie near the boundaries,
and are therefore employed as the initial cluster centers. The dataset is then par-
titioned by K-means, in which only one iteration is performed for the purpose
of focusing on the boundaries. The process is illustrated in Fig. 3.
(a) Midpoints between centers (b) Partitions on borders

Fig. 3. Boundary-based partition. In (a), the black solid points, m1 , · · · , m7 , are the
midpoints of the edges of M STcen . In (b), each data point is assigned to its nearest
midpoint, and the dataset is partitioned by the midpoints. The corresponding Voronoi
graph is with respect to the midpoints.

Build Secondary Approximate MST. After the dataset has been re-
partitioned, conquer and combine steps similar to those in the first stage are
used to produce the secondary approximate MST. The algorithm is summarized
as follows:
Secondary Approximate MST (SAM)
Input: MST of the subset centers M STcen , dataset X;
Output: Approximate MST of X, M ST2 ;
1. Compute the midpoint mi of each edge ei ∈ MSTcen, where 1 ≤ i ≤ K − 1.
2. Partition dataset X into K − 1 subsets, S′1, · · · , S′K−1, by assigning each
point to its nearest point from m1, · · · , mK−1.
3. Build MSTs, MST(S′1), · · · , MST(S′K−1), with an exact MST algorithm.
4. Combine the K − 1 MSTs with CA to produce an approximate MST, MST2.

2.5 Combine Two Rounds of Approximate MSTs


So far we have two approximate MSTs on dataset X, M ST1 and M ST2 . To
produce the final approximate MST, we first merge the two approximate MSTs
to produce a graph, which has no more than 2(N − 1) edges, and then apply an
exact MST algorithm on this graph to achieve the final approximate MST of X.
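A sketch of this final merge, again assuming the illustrative edge-list representation used above:

from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def merge_approximate_msts(edges1, edges2, N):
    # union of the candidate edges from MST_1 and MST_2 (duplicates kept once)
    uniq = {}
    for a, b, w in list(edges1) + list(edges2):
        uniq[(min(a, b), max(a, b))] = w
    rows = [k[0] for k in uniq]
    cols = [k[1] for k in uniq]
    data = list(uniq.values())
    G = coo_matrix((data, (rows, cols)), shape=(N, N))
    # exact MST on the sparse candidate graph, which has at most 2(N-1) edges
    return minimum_spanning_tree(G)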

3 Complexity and Accuracy Analysis


3.1 Complexity Analysis
The overall time complexity of the proposed algorithm FMST, TF MST , can be
evaluated as:
TF MST = TDAC + TCA + TSAM + TCOM (1)
where TDAC , TCA and TSAM are the time complexities of the algorithms DAC,
CA and SAM respectively, TCOM is the running time of an exact MST algo-
rithm on the combination of M ST1 and M ST2 .
DAC consists of two operations: partitioning the dataset X into K subsets
and constructing the MSTs of the subsets with an exact MST algorithm. Since
K = √N, we have TDAC = O(N^1.5). In CA, computing the mean points of the
subsets and constructing the MST of the K mean points take only O(N) time. For
each connected subset pair, determining the connecting edge requires O(2N ×
(K − 1)/K). The total computational cost of CA is therefore O(N).
In SAM, computing the K − 1 midpoints and partitioning the dataset take O(N ×
(K − 1)) time. The running time of Steps 3 and 4 is O((K − 1) × N^2/(K − 1)^2) =
O(N^2/(K − 1)) and O(N), respectively. Therefore, the time complexity of SAM
is O(N^1.5). The number of edges in the graph that is formed by combining MST1
and MST2 is at most 2(N − 1). The time complexity of applying an exact MST
algorithm to this graph is only O(2(N − 1) log N). Thus, TCOM = O(N log N).
To sum up, the time complexity of the proposed fast algorithm is O(N^1.5).

4 Experiments
In this section, experimental results are presented to illustrate the efficiency and
the accuracy of the proposed fast approximate MST algorithm. The accuracy of
FMST is tested with four datasets: t4.8k [5], MNIST [10], ConfLongDemo [11]
and MiniBooNE [11]. Experiments were conducted on a PC with an Intel Core2
2.4GHz CPU and 4GB memory running Windows 7.

4.1 Running Time


From each dataset, subsets with different size are randomly selected to test the
running time as a function of data size. The subset sizes of the first two datasets
gradually increase with step 20, the third with step 100 and the last with step
1000.
The running time of FMST and Prim’s algorithm on the four datasets is
illustrated in the first row of Fig. 4. From the results, we can see that FMST
is computationally more efficient than Prim’s algorithm, especially for the large
datasets ConfLongDemo and MiniBooNE. The efficiency for MiniBooNE shown
in the rightmost of the second and third row in Fig. 4, however, deteriorates
because of the high dimensionality. Although the complexity analysis indicates
that the time complexity of the proposed FMST is O(N^1.5), the actual running
time may differ because K-means cannot produce clusters of equal size. We
analyze the actual processing time by fitting a power function T = aN^b, where
T is the running time and N is the number of data points. The results are shown
in Table 1.
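The exponent b can be recovered by a least-squares fit of log T against log N, for example:

import numpy as np

def fit_exponent(sizes, times):
    # fit T = a * N**b  <=>  log T = log a + b * log N
    b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
    return np.exp(log_a), b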

4.2 Accuracy
Suppose E_appr is the set of the correct edges in an approximate MST; the edge
error rate ER_edge is defined as ER_edge = (N − 1 − |E_appr|)/(N − 1). The second measure is
Fig. 4. The results of the test on the four datasets t4.8k (d=2), ConfLongDemo (d=3),
MNIST (d=784) and MiniBooNE (d=50): running time (seconds) of FMST and Prim’s
algorithm, edge error rate (%) and weight error rate (%), each plotted against the data size N
Table 1. The exponent b obtained by fitting T = aN^b

              t4.8k   MNIST   ConfLongDemo   MiniBooNE
FMST          1.57    1.62    1.54           1.44
Prim’s Alg.   1.88    2.01    1.99           2.00

defined as the relative difference between the sum of the weights of FMST and that of
the exact MST, which is called the weight error rate: ER_weight = (W_appr − W_exact)/W_exact,
where W_exact and W_appr are the sums of weights of the exact MST and FMST, respectively.
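Both measures can be computed directly from the edge sets, e.g. with edges stored as (vertex, vertex, weight) triples; this sketch is for illustration only:

def edge_error_rate(exact_edges, approx_edges, N):
    exact = {frozenset(e[:2]) for e in exact_edges}
    correct = sum(1 for e in approx_edges if frozenset(e[:2]) in exact)
    return (N - 1 - correct) / (N - 1)

def weight_error_rate(exact_edges, approx_edges):
    w_exact = sum(e[2] for e in exact_edges)
    w_appr = sum(e[2] for e in approx_edges)
    return (w_appr - w_exact) / w_exact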
The edge error rates and weight error rates of the four datasets are shown
in the second and third rows of Fig. 4. We can see that both the edge error rate and the
weight error rate decrease as the data size increases. For datasets with
high dimensionality, the edge error rates are larger; for example, the maximum edge
error rate of MNIST is approximately 18.5%, while those of t4.8k and Con-
fLongDemo are less than 3.2%. In contrast, the weight error rates decrease when
the dimensionality increases. This is one aspect of the curse of dimensionality,
distance concentration, which means that Euclidean distances between all pairs
of points in high dimensional data tend to be similar.
5 Conclusion

In this paper, we have proposed a fast approximate MST algorithm with a


divide and conquer scheme. The time complexity of the proposed algorithm is
theoretically O(N 1.5 ). Furthermore, any MST algorithm can be incorporated
into to the proposed framework to make it more efficient.

References
1. An, L., Xiang, Q.S., Chavez, S.: A fast implementation of the minimum spanning
tree method for phase unwrapping. IEEE Trans. Medical Imaging 19, 805–808
(2000)
2. Bezdek, J.C., Pal, N.R.: Some new indexes of cluster validity. IEEE Trans. Systems,
Man and Cybernetics, Part B 28, 301–315 (1998)
3. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms,
2nd edn. The MIT Press (2001)
4. Juszczak, P., Tax, D.M.J., Pȩkalska, E., Duin, R.P.W.: Minimum spanning tree
based one-class classifier. Neurocomputing 72, 1859–1869 (2009)
5. Karypis, G., Han, E.H., Kumar, V.: CHAMELEON: A hierarchical clustering al-
gorithm using dynamic modeling. IEEE Trans. Comput. 32, 68–75 (1999)
6. Lai, C., Rafa, T., Nelson, D.E.: Approximate minimum spanning tree clustering in
high-dimensional space. Intelligent Data Analysis 13, 575–597 (2009)
7. Wang, X., Wang, X., Wilkes, D.M.: A divide-and-conquer approach for minimum
spanning tree-based clustering. IEEE Trans., Knowledge and Data Engineering 21,
945–958 (2009)
8. Yang, L.: Building k edge-disjoint spanning trees of minimum total length for iso-
metric data embedding. IEEE Trans. Pattern Analysis and Machine Intelligence 27,
1680–1683 (2005)
9. Zhong, C., Miao, D., Wang, R.: A graph-theoretical clustering method based on
two rounds of minimum spanning trees. Pattern Recognition 43, 752–766 (2010)
10. http://yann.lecun.com/exdb/mnist
11. http://archive.ics.uci.edu/ml/
Fast EM Principal Component Analysis Image
Registration Using Neighbourhood Pixel
Connectivity

Parminder Singh Reel1 , Laurence S. Dooley1 , K.C.P. Wong1 , and Anko Börner2
1
Department of Communication and Systems, The Open University, Milton Keynes,
United Kingdom
{p.s.reel,laurence.dooley,k.c.p.wong}@open.ac.uk
2
Optical Sensor Systems, German Aerospace Center (DLR), Berlin, Germany
[email protected]

Abstract. Image registration (IR) is the systematic process of aligning


two images of the same or different modalities. The registration of mono and
multimodal images, e.g., magnetic resonance images, poses a particu-
lar challenge due to intensity non-uniformities (INU) and noise artefacts.
Recent similarity measures including regional mutual information (RMI)
and expectation maximisation for principal component analysis with MI
(EMPCA-MI) have sought to address this problem. EMPCA-MI incorpo-
rates neighbourhood region information to iteratively compute principal
components giving superior IR performance compared with RMI, though
it is not always effective in the presence of high INU. This paper presents
a modified EMPCA-MI (mEMPCA-MI) similarity measure which intro-
duces a novel pre-processing step to exploit local spatial information
using 4- and 8-pixel neighbourhood connectivity. Experimental results
using diverse image datasets, conclusively demonstrate the improved IR
robustness of mEMPCA-MI when adopting second-order neighbourhood
representations. Furthermore, mEMPCA-MI with 4-pixel connectivity is
notably more computationally efficient than EMPCA-MI.

Keywords: Image registration, mutual information, principal compo-


nent analysis, expectation maximisation algorithms.

1 Introduction
Image Registration (IR) is a vital processing task in numerous applications where
the final information is obtained by combining different data sources, as for ex-
ample in computer vision, remote sensing and medical imaging [1]. The process
of IR involves the geometric transformation of a source image in order to attain
the best physical alignment with a reference target image. It applies an opti-
mization method to maximize some predefined similarity measure with known
transformations between the source and reference image set.
Similarity measures which have been proposed [1] for both mono and multi-
modal IR can be broadly categorized according to whether they are based on

cross correlation, phase correlation, Fourier techniques or mutual information


(MI) [2], with MI being well-established in the medical imaging domain [3]. MI
is computationally efficient and seeks to form a statistical relationship between
the source and reference images [4]. It is however, sensitive to interpolation arte-
facts and its performance can be severely compromised when the overlap region
between the images is small.
Normalized MI (NMI) [5] was specifically designed to facilitate the successful
IR of partially overlapping images, though it along with MI is unable to con-
sistently and accurately register images containing intensity non-uniformities
(INU) [6] which is an omnipresent feature in magnetic resonance images (MRI)
for instance. In contrast, regional MI (RMI) [7] and its variant [8] incorporate
neighbourhood features within MI by segmenting an image into several regions
for feature extraction to lessen the influence of INU on the resulting IR qual-
ity. In computing the associated entropies, these MI-based approaches employ a
covariance matrix instead of high-dimensional histograms to reduce data com-
plexity, though as the size of a neighbourhood region grows, so the computation
overheads commensurately increase [7].
The expectation maximisation for principal component analysis with MI (EM
PCA-MI) algorithm [9] is a recently proposed IR similarity measure, which sig-
nificantly reduces the computational cost without loss of IR performance for
different mono and multimodalities of the human anatomy [10],[11]. Its per-
formance however, can be compromised in the presence of high INU and noise
levels [9]. EMPCA-MI achieves dimensionality reduction by iteratively determin-
ing the principal component without recourse to solving the complete covariance
matrix as in conventional principal component analysis (PCA) techniques. As a
pre-processing step, EMPCA-MI rearranges the neighbourhood region grayscale
data values into vector form so preserving both the spatial and intensity infor-
mation of the images.
This paper presents a modified EMPCA-MI (mEMPCA-MI) similarity mea-
sure which uses the difference in grayscale values for direct (4-pixel ) and indirect
(8-pixel ) neighbourhood relations, instead of rearranging the pixels in the pre-
processing stage. This provides the dual advantages of more accurate feature
representation for EMPCA and MI computation, and significantly lower compu-
tational cost with, as will be evidenced in Section 4, minimal impact upon the
corresponding IR performance compared with EMPCA-MI. Quantitative results
verify the new pre-processing step adopted in mEMPCA-MI provides superior
IR performance from both a registration error and computational time perspec-
tive for various mono and multimodal test datasets. The remainder of the paper
is organized as follows: Section 2 briefly reviews the original EMPCA-MI simi-
larity measure before the proposed mEMPCA-MI pre-processing step exploiting
localised pixel relations is introduced. Section 3 describes the experimental test
setup used, while Section 4 presents an IR results analysis of the mEMPCA-MI
algorithm. Finally, Section 5 provides some concluding comments.
2 The mEMPCA-MI Model

2.1 EMPCA-MI Similarity Measure [9]

EMPCA-MI [9] is a recent similarity measure for IR, which efficiently incor-
porates spatial information together with MI without incurring high computa-
tional overheads. Fig. 1 illustrates the three core processing steps involved in the
EMPCA-MI algorithm, namely: input image data rearrangement (highlighted in
yellow ) followed by EMPCA and MI calculation. Both the reference (IR ) and
source images (IS ) are pre-processed (Step I ) into vector form for a given neigh-
bourhood radius r, so the spatial and intensity information is preserved (see Fig.
1(a) and 1(b)). The first P principal components XR and XS of the respective
reference and source images are then iteratively computed using EMPCA [12] in
Step II. Subsequently, the MI [3] is calculated between XR and XS in Step III,
with a higher MI value signifying the images are better aligned. In [9], only the
first principal component is considered, i.e., P=1 since this is the direction of
highest variance and represents the most dominant feature in any region.
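For illustration, the first principal component of the rearranged data can be obtained with Roweis' EM iterations [12] without forming the covariance matrix; a minimal sketch for P = 1, with illustrative names:

import numpy as np

def empca_first_component(Q, n_iter=50):
    # Q: d x n matrix of neighbourhood vectors, one column per pixel position
    Q = Q - Q.mean(axis=1, keepdims=True)         # centre the data
    d = Q.shape[0]
    c = np.random.rand(d, 1)                      # initial guess of the component
    for _ in range(n_iter):
        x = c.T @ Q / (c.T @ c)                   # E-step: project the data onto c
        c = Q @ x.T / (x @ x.T)                   # M-step: re-estimate c
    c /= np.linalg.norm(c)
    return c, (c.T @ Q).ravel()                   # direction and 1-D projections (X_R or X_S)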

2.2 New Pre-processing Step

As evidenced in Fig. 1(a), Step I of the EMPCA-MI algorithm reorganises the


image grayscale values within each neighbourhood region in order to incorporate
spatial information. This provides noteworthy IR results [9] when there is neither
INU nor noise present, however when there are high levels of INU and noise, the
corresponding registration performance can degrade because only a first-order
region representation is used which considers each pixel independently without
cognisance of any neighbourhood relations. This is reflected in the repetitive pat-
terns in QR and QS in the neighbouring position of the sliding window (See Step
I (b) in Fig 1). The rationale behind the proposed mEMPCA-MI pre-processing
step is that spatial information within a neighbourhood region can be more
accurately characterised as second-order representation, where the relationship
between pixels can be exploited instead of just pixel values. To illustrate the
new pre-processing step for mEMPCA-MI, consider the example shown in Fig.
1, for a 3 x 3 pixel sliding window (r =1 ) neighbourhood region B (see Fig. 1(a))
which assumes either 4-pixel (direct neighbours) or 8-pixel (indirect neighbours)
connectivity i.e., c=4 and c=8 respectively. The resulting single column vector
B∗ will thus have length c+1 as shown in Fig. 1(b), and can be represented as:

B^*_i = \begin{cases} B_i - B_5 & i \in [1, c+1],\; i \neq 5 \\ B_5 & i = 5 \end{cases}    (1)
Each column vector B* now represents the differential values of the c connected
pixels with respect to the centre pixel B5. Here, the mEMPCA-MI pre-processing
no longer generates repetitive patterns as in EMPCA-MI, but instead provides
unique relative intensity values (see B* in Fig. 1) for the next computational
steps in the mEMPCA-MI.
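A sketch of the proposed pre-processing for r = 1 is given below; the offset lists are the usual 4- and 8-neighbourhoods, and the ordering of the entries in B* is illustrative (Eq. (1) keeps the centre pixel in its original position):

import numpy as np

OFFSETS = {4: [(-1, 0), (0, -1), (0, 1), (1, 0)],
           8: [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]}

def mempca_preprocess(img, connectivity=4):
    # returns a (c+1) x n matrix: differences to the centre pixel plus the centre itself
    img = img.astype(float)
    h, w = img.shape
    cols = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            centre = img[y, x]
            diffs = [img[y + dy, x + dx] - centre for dy, dx in OFFSETS[connectivity]]
            cols.append(diffs + [centre])          # vector B* of length c+1
    return np.array(cols).T

The resulting matrix can then be passed to the EMPCA step in place of the rearranged grayscale vectors.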
Fig. 1. Illustration of the EMPCA-MI algorithm [9], together with the proposed
mEMPCA-MI pre-processing step using neighbourhood 8-pixel and 4-pixel region con-
nectivity for an image pair size of 10 x 10 pixels

Once Step I has been completed, the remaining two processing steps of
mEMPCA-MI are as in [9]. Fig. 2 displays the mEMPCA-MI traces for both
4-pixel and 8-pixel connectivity together with EMPCA-MI, with respect to the
angular rotational transformation parameter θ, for the IR of the multimodal
MRI pair T1 and T2. Fig. 2(a) shows the IR case when there is neither INU nor
noise present, while Fig. 2(b) reflects the challenging registration with 40% INU
and Gaussian noise. It is evident that the mEMPCA-MI traces for both 4-pixel
and 8-pixel neighbourhood connectivity provide smoother and higher similarity
measure values at the best alignment compared with EMPCA-MI.
Fig. 2. Similarity measure value traces for EMPCA-MI and mEMPCA-MI (8-pixel and
4-pixel connectivity). (a) shows the angular rotation transformation for MRI T1 and
T2 multimodal registration (without INU and noise) and (b) with 40% INU and noise.

Interestingly 4-pixel neighbourhoods are better in both cases, since they ex-
ploit the neighbourhood relations with the strongest links leading to a corre-
spondingly higher overall MI value between IR and IS . Fig 2 also highlights the
smooth convergence of mEMPCA-MI compared with EMPCA-MI with less os-
cillatory behaviour particularly where 40% INU and noise is present. This is a
very useful feature for effective convergence of the ensuing optimization process
[1] and ultimately leads to lower IR errors.

3 Experiment Setup
To evaluate the performance of the mEMPCA-MI similarity measure, a series
of multimodal IR experiments were undertaken. Multimodal MRI T1 and T2
datasets from BrainWeb Database [13] were chosen due to their challenging char-
acteristics of varying INU and noise artefacts with the corresponding parameter
details being defined in Table 1. To simulate a range of applications and analyse
the robustness of mEMPCA-MI, both Lena and Baboon images have also been
used with a simulated INU function Z [14]. Finally, Gaussian noise has been
added to all the datasets. The IR experiments were classified into four separate
scenarios representing monomodal, multimodal and two generic registrations.

Table 1. Dataset Parameter Details

Dataset        Resolution (pixels)   INU                              Noise (β)
MRI T1 (T1)    [181 x 217 x 181]     α20 = 20% INU,                   Gaussian
MRI T2 (T2)    [181 x 217 x 181]     α40 = 40% INU                    (μ = 0.01, σ² = 0.01)
Lena (L)       [256 x 256]           Z(x, y) = (1/3.2)(x + y) [14]
Baboon (Bb)    [256 x 256]
Table 2. Registration Error Results for Different Scenarios

EMPCA-MI [10] mEMPCA-MI (r=1,P=1 )


Scenario IR IS (r=1,P=1)) 8-pixel 4-pixel
No. ΔX, ΔY, Δθ (%) ΔX, ΔY, Δθ (%) ΔX, ΔY, Δθ (%)
T1+α20 2.0, 1.3, 0.36 1.26, 0.98, 0.24 1.12, 0.93, 0.21
T1+α40 4.5, 4.0, 0.42 3.05, 3.21, 0.39 2.96, 3.04, 0.32
1 T1
T1+β 6.0, 7.0, 0.52 5.74, 6.58, 0.45 5.41, 6.25, 0.43
T1+α40 + β 8.0, 10.0, 0.58 7.84, 9.59, 0.48 7.45, 9.28, 0.46
T1+α20 2.6, 2.6, 0.42 2.05, 1.98, 0.36 1.93, 1.71, 0.32
T1+α40 4.8, 4.6, 0.62 4.12, 4.27, 0.48 3.98, 4.10, 0.43
2 T2
T1+β 6.2, 3.0, 0.46 5.82, 2.32, 0.37 5.68, 2.18, 0.33
T1+α40 + β 9.7, 4.3, 0.62 9.12, 3.28, 0.58 8.99, 3.11, 0.56
L+Z 0.2, 0.32, 0.21 0.18, 0.28, 0.19 0.16, 0.24, 0.17
3 L+β L 0.32, 0.50, 0.36 0.29, 0.47, 0.31 0.27, 0.45, 0.29
L+Z +β 2.0, 5.33, 0.21 1.95, 5.28, 0.20 1.93, 5.26, 0.19
Bb+Z 0.45, 0.70, 0.21 0.32, 0.63, 0.18 0.30, 0.60, 0.16
4 Bb+β Bb 0.8, 1.26, 0.21 0.71, 1.14, 0.19 0.69, 1.12, 0.18
Bb+Z +β 1.4, 1.50, 0.23 1.27, 1.24, 0.22 1.20, 1.22, 0.20

Each experiment involved an initial misregistration of predefined x and y axis


translations and rotation θ. The registration process involved partial volume
interpolation along with Powell optimization method [2] to iteratively estimate
the transformation parameters. The parameter values at which the mEMPCA-
MI similarity measure is a maximum then define the final transformation for
which the two images are best aligned. The registration error is defined as the
difference between the initial and final value for each parameter. All experiments
were performed on an Ubuntu 10.04 (lucid) system with a 2.93 GHz Intel Core
processor and 3 GB RAM, and the assorted algorithms were implemented in MATLAB.

4 Results Discussion
Table 2 shows the IR error results for all four Scenarios in terms of the percent-
age translation (ΔX, ΔY ) and angular rotational (Δθ) errors. To clarify the
nomenclature adopted in Table 2; T1+α20 for example, represents an MRI T1
image slice with 20% INU, while Bb+Z +β refers to the Baboon image with INU
and Gaussian noise artefacts. The results confirm the mEMPCA-MI algorithm
using both 8-pixel and 4-pixel neighbourhood connectivity consistently provides
better registration than the EMPCA-MI model for both mono and multimodal
MRI T1 and T2 images, both when there is and is not INU and noise present.
For example, in monomodal IR Scenario 1 with 40% INU and noise present
(T1+α40 +β), 8-pixel and 4-pixel connectivity mEMPCA-MI provide percentage
errors for the (ΔX, ΔY, Δθ) parameters of (7.84, 9.59, 0.48 ) and (7.45, 9.28,
0.46 ) respectively which are both lower than the corresponding EMPCA-MI
error (8.0, 10.0, 0.58 ). Similar performance improvements are also evident for
Lena and Baboon images.
Table 3. Average Runtimes (ART) Results (in ms) for Different Scenarios

Scenario No.   EMPCA-MI [10] (r=1, P=1)   mEMPCA-MI (r=1, P=1)
                                          8-pixel    4-pixel
1, 2                  152                   144         95
3, 4                  170                   156        109

This corroborates the fact that the mEMPCA-MI algorithm using both 8-
pixel and 4-pixel connectivity in the pre-processing step more accurately reflects
neighbourhood spatial information by considering a second-order representation
of region pixel values with respect to the centre pixel of the sliding window. The
results also reveal that the IR error performance of mEMPCA-MI with 4-pixel
neighbourhood connectivity is consistently lower than 8-pixel connectivity across
all four Scenarios. Particularly striking is the performance achieved for the chal-
lenging MRI T1 and T2 multimodal registration in Scenario 2, in the presence
of both INU and noise. This reflects that 4-pixel neighbourhood connectivity
exploits the direct pixel relations providing more relevant spatial information
about local neighbourhood for subsequent EMPCA and MI computation. In
contrast, 8-pixel connectivity also considers weaker indirect neighbours, which
marginally reduces the corresponding principal component values leading to a
lower MI between the image pair.
Table 3 displays the average runtimes (ART ) for both EMPCA-MI and
mEMPCA-MI. While ART is a resource dependent metric, it concomitantly pro-
vides an insightful time complexity comparator between similarity measures. As
illustrated in Fig. 1, since the data dimensionality of mEMPCA-MI with 4-pixel
connectivity is reduced to 5 from 9 for both 8-pixel connectivity and EMPCA-MI
[12], the corresponding ART values are considerably lower, i.e., 95ms compared
to 144ms for 8-pixel connectivity and 152ms for EMPCA-MI to determine only
the first principal component for Scenarios 1 and 2. A similar trend in the ART
values is observed in Scenarios 3 and 4, though these datasets have a different
spatial resolution compared to Scenarios 1 and 2. Overall, the ART results re-
veal a notable improvement in computational efficiency for mEMPCA-MI using
4-pixel neighbourhood connectivity, allied with superior IR robustness to both
INU and noise for both mono and multimodal image datasets.

5 Conclusion
This paper has presented a neighbourhood connectivity based modification to
the existing Expectation Maximisation for Principal Component Analysis with
MI (EMPCA-MI) similarity measure. Superior and more robust image reg-
istration performance in the presence of both INU and Gaussian noise has
been achieved by incorporating second-order neighbourhood region information,
compared with the grayscale value rearrangement in the original EMPCA-MI
paradigm. Additionally, the 4-pixel connectivity mEMPCA-MI similarity mea-


sure is computationally more efficient compared to both EMPCA-MI and using
8-pixel neighbourhood connectivity.

References
1. Zitová, B., Flusser, J.: Image registration methods: a survey. Image and Vision
Computing 21(11), 977–1000 (2003)
2. Pluim, J., Maintz, J., Viergever, M.: Mutual-information-based registration of med-
ical images: a survey. IEEE Transactions on Medical Imaging 22(8), 986–1004
(2003)
3. Collignon, Maes, F., Delaere, D., Vandermeulen, D., Suetens, P., Marchal, G.:
Automated multi-modality image registration based on information theory. Imag-
ing 3(1), 263–274 (1995)
4. Viola, P., Wells, W.M.: Alignment by maximization of mutual information. In:
Proceedings of the Fifth International Conference on Computer Vision, pp. 16–23.
IEEE (June 1995)
5. Studholme, Hill, D., Hawkes, D.J.: An overlap invariant entropy measure of 3D
medical image alignment. Pattern Recognition 32(1), 71–86 (1999)
6. Simmons, A., Tofts, P.S., Barker, G.J., Arridge, S.R.: Sources of intensity nonuni-
formity in spin echo images at 1.5 t. Magnetic Resonance in Medicine 32(1),
121–128 (1994)
7. Russakoff, D.B., Tomasi, C., Rohlfing, T., Maurer Jr., C.R.: Image similarity using
mutual information of regions. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004.
LNCS, vol. 3023, pp. 596–607. Springer, Heidelberg (2004)
8. Yang, C., Jiang, T., Wang, J., Zheng, L.: A neighborhood incorporated method in
image registration. In: Yang, G.Z., Jiang, T.-Z., Shen, D., Gu, L., Yang, J. (eds.)
MIAR 2006. LNCS, vol. 4091, pp. 244–251. Springer, Heidelberg (2006)
9. Reel, P.S., Dooley, L.S., Wong, K.C.P.: A new mutual information based similarity
measure for medical image registration. In: IET Conference on Image Processing
(IPR 2012), pp. 1–6 (July 2012)
10. Reel, P.S., Dooley, L.S., Wong, K.C.P.: Efficient image registration using fast prin-
cipal component analysis. In: 19th IEEE International Conference on Image Pro-
cessing (ICIP 2012), Lake Buena Vista, Orlando, Florida, USA, pp. 1661–1664.
IEEE (September 2012)
11. Reel, P., Dooley, L., Wong, P., Börner, A.: Robust retinal image registration us-
ing expectation maximisation with mutual information. In: 38th IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013),
Vancouver, Canada, pp. 1118–1122. IEEE (May 2013)
12. Roweis, S.: EM algorithms for PCA and SPCA. In: Proceedings of the 1997 Con-
ference on Advances in Neural Information Processing Systems 10, NIPS 1997, pp.
626–632. MIT Press, Cambridge (1998)
13. Collins, D.L., Zijdenbos, A.P., Kollokian, V., Sled, J.G., Kabani, N.J., Holmes,
C.J., Evans, A.C.: Design and construction of a realistic digital brain phantom.
IEEE Transactions on Medical Imaging 17(3), 463–468 (1998)
14. Garcia-Arteaga, J.D., Kybic, J.: Regional image similarity criteria based on the
kozachenko-leonenko entropy estimator. In: IEEE Computer Society Conference
on Computer Vision and Pattern Recognition Workshops, CVPRW 2008, pp. 1–8.
IEEE (June 2008)
Fast Unsupervised Segmentation Using Active
Contours and Belief Functions

Foued Derraz1 , Laurent Peyrodie2 , Abdelmalik Taleb-Ahmed3 ,


Miloud Boussahla4, and Gerard Forzy1
1
Université Nord de France
Faculté Libre de Médecine, Institut Catholique de Lille
46 rue du Port de Lille, France
[email protected]
2
Université Nord de France
Hautes Études d’Ingénieur, LAGIS UMR CNRS 8219
46 rue du Port de Lille, France
[email protected]
3
Université Nord de France, Lille
Université de Valenciennes, LAMIH UMR CNRS 8201, Valenciennes
[email protected]
4
Université Abou Bekr Belkaid, Tlemcen
Telecommunication laboratory

Abstract. In this paper, we study Active Contour (AC) based global
segmentation for vector-valued images using an evidential Kullback-Leibler
(KL) distance. We investigate the evidential framework to fuse multiple
features issued from vector-valued images. This formulation has two main
advantages: 1) the combination of foreground/background information issued
from the multiple channels in the same framework; 2) the incorporation of
heterogeneous knowledge and the reduction of the imprecision due to noise.
The statistical relation between the image channels is ensured by
the Dempster-Shafer rule. We illustrate the performance of our segmen-
tation algorithm using some challenging color and textured images.

Keywords: Active Contours, Characteristic function, Evidential


Kullback-Leibler distance, Belief Functions, Dempster-Shafer rule.

1 Introduction

Active Contours (AC) models have proven to be very powerful segmentation


tools in many computer vision and medical imaging applications [1]. Segmenta-
tion based AC models are limited by several challenges mainly related to image
noise, poor contrast, weak or missing boundaries between imaged objects, inho-
mogeneities, etc. One way to overcome these difficulties is to exploit the high
level knowledge about usual objects. This will ease the interpretation of low-level
cues extracted from images which may be highly beneficial in the segmentation
based AC. Statistical knowledge [1] and additional information such as texture

can improve the segmentation based AC models for vector-valued images [2]. An-
other reason for failed segmentations is the issue of local versus global minimizers
for AC models [3]. To overcome these difficulties, the evidential framework appears
to be a new way to improve segmentation based AC models for vector valued im-
ages [4,5,6]. The Dempster-Shafer (DS) framework [7] has been combined with
either a simple thresholding [4], a clustering algorithm [8], a region merging al-
gorithm [5] or with AC algorithms [6]. In this paper we propose to use the
evidential framework [7] to fuse several sources of statistical knowledge into a new
descriptor and to incorporate this new descriptor in the formulation of the AC
models. The fusion of information issued from different feature channels, e.g.,
color channels and texture, offers an alternative to the Bayesian framework [9].
Instead of fusing separate probability densities, the evidential framework handles
both inaccuracy and uncertainty. This concept is represented using Belief Functions
(BF s) [7,10,11,12] which are particularly well suited to represent information
from partial and unreliable knowledge. The use of BF s as an alternative to the
probability in segmentation process can be very helpful in reducing uncertainties
and imprecisions using conjunctive combination of neighboring pixels. First, it
allows us to reduce the noise and secondly, to highlight conflicting areas mainly
present at the transition between regions where the contours occur. In addition,
BF s have the advantage to manipulate not only singletons but also disjunctions.
This gives the ability to represent both uncertainties and imprecisions explicitly.
The disjunctive combination allows the transfer of both uncertain and im-
precise information on disjunctions [7,12]. Finally, the conjunctive combination
is applied to reduce uncertainties due to noise while maintaining representation
of imprecise information at the boundaries between areas on disjunctions. In
this paper, we highlight the advantage of evidential framework, to define a new
descriptor based on the BF s to incorporate it in the formulation of the AC
models. In Section 2, we review the Dempster-Shafer concept in order to define
our evidential descriptor. In section3 we proposed a fast algorithm based split
Bregman of our segmentation algorithm. In Section 4, we demonstrate the ad-
vantages of the proposed method by applying it to some challenging to some
challenging images.

2 Globally Active Contours in Evidential Framework


2.1 Dempster Shafer Rules
The Plausibility (P L) and Belief Functions (BF s) [7,11], which are both derived
from a Mass function (m) provide the evidential framework. For the frame of
discernment ΩII = {Ω1 , Ω2 , ..., Ωn }, composed of n single mutually exclusive
subsets Ωi , the mass function is defined by m : 2Ω → [0, 1].
m(\emptyset) = 0, \qquad \sum_{\Omega_i \subseteq \Omega_{II}} m(\Omega_i) = 1

BFs(\Omega) = \sum_{\Omega_i \subseteq \Omega} m(\Omega_i), \qquad
Pl(\Omega) = \sum_{\Omega_i \cap \Omega \neq \emptyset} m(\Omega_i)    (1)

When m(Ωi) > 0, Ωi is called a focal element [5,7]. The relation between the mass
function, BFs and Pl can be described as:

m (Ωi ) ≤ BF s (Ωi ) ≤ p (Ωi ) ≤ P l (Ωi ) (2)

The independent masses m_j are defined within the same frame of discernment as:

$$m\!\left(\Omega_{i=\{1,\dots,n\}}\right) = m_1\!\left(\Omega_{i=\{1,\dots,n\}}\right) \otimes m_2\!\left(\Omega_{i=\{1,\dots,n\}}\right) \otimes \dots \otimes m_M\!\left(\Omega_{i=\{1,\dots,n\}}\right) \qquad (3)$$

where ⊗ denotes the DS orthogonal sum rule.


The total belief assigned to a focal element Ω_i is equal to the belief strictly placed on the foreground region Ω_i. Then the Belief Function (BFs) can be expressed as:

$$BFs(\Omega_i) = m(\Omega_i) \qquad (4)$$

This relationship can be very helpful in the formulation of our AC model.
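As a small illustration of how these quantities interact, the following Python sketch works on the two-class frame {Ω_in, Ω_out}: two mass functions are combined with the orthogonal sum and BFs and Pl are read off for a singleton. The numerical mass values and helper names are illustrative choices, not taken from the paper.

```python
# Illustrative sketch only: masses live on the focal elements 'in', 'out'
# and 'Omega' (the whole frame); values below are made up for the example.
def dempster_combine(m1, m2):
    """Normalized orthogonal sum of two mass functions on {'in', 'out', 'Omega'}."""
    hyps = ['in', 'out', 'Omega']
    def inter(a, b):
        if b == 'Omega':
            return a
        if a == 'Omega':
            return b
        return a if a == b else None   # 'in' and 'out' are disjoint
    combined = {h: 0.0 for h in hyps}
    conflict = 0.0
    for a in hyps:
        for b in hyps:
            c = inter(a, b)
            if c is None:
                conflict += m1[a] * m2[b]
            else:
                combined[c] += m1[a] * m2[b]
    return {h: v / (1.0 - conflict) for h, v in combined.items()}

def bel(m, h):   # for a singleton, BFs equals its mass (Eq. 4)
    return m[h]

def pl(m, h):    # Pl adds the mass of every focal element intersecting h
    return m[h] + m['Omega']

m_color = {'in': 0.6, 'out': 0.1, 'Omega': 0.3}     # e.g. from one colour channel
m_texture = {'in': 0.5, 'out': 0.2, 'Omega': 0.3}   # e.g. from a texture channel
m = dempster_combine(m_color, m_texture)
print(bel(m, 'in'), pl(m, 'in'))
```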

2.2 Active Contours and Belief Functions


The segmentation based AC model for a vector-valued image I consists of finding one or more regions Ω from I. In the Bayesian framework, we search for the domain Ω, or the partition of the image P(Ω), that maximizes the a posteriori partition probability p(P(Ω)|I). The Maximum a Posteriori of P(Ω) can be obtained by minimizing the criterion as follows:

p ( P (Ω)| I) ∝ p ( I| P (Ω)) p (P (Ω)) (5)

Therefore, in the Bayesian framework, the partitioning can be expressed as:

$$\partial\hat{\Omega} = \arg\min\left\{ \underbrace{\log\frac{1}{p(P(\Omega))}}_{E_b(\partial\Omega)} \;+\; \lambda\,\underbrace{\log\frac{1}{p(I\,|\,P(\Omega))}}_{E_{data}(\Omega,I)} \right\} \qquad (6)$$

In equation (6), the first energy term corresponds to the geometric properties
of P (Ω) = {Ωin , Ωout }. Ωin and Ωout correspond respectively to the fore-
ground and background region to be extracted. The data-fidelity energy term

Edata(Ω, I) allows us to incorporate statistical properties of the vector-valued image data I = {I_1, ..., I_M}.
Our proposed method uses the evidential framework to fuse the knowledge
issued from the multiple channels. The expression in (5) can be revisited in
evidential framework using BF s with respect to (2):

p ( P (Ω)| I) ≥ p (P (Ω)) BF s (P (Ω)) (7)

In the case of two-phase segmentation, we can state that:

$$p(P(\Omega)\,|\,I) \propto p(P(\Omega))\,BFs(\{\Omega_{in}, \Omega_{out}\}) \qquad (8)$$

Equation (8) holds because P(Ω) = {Ω_in, Ω_out} is a focal element. When the foreground/background regions are disjoint (Ω_in ∩ Ω_out = ∅), then:

$$p(P(\Omega)\,|\,I) \propto p(P(\Omega))\,BFs(\Omega_{in})\,BFs(\Omega_{out}) \qquad (9)$$

Intuitively, the best ∂Ω can be obtained by maximizing the Kullback-Leibler distance between the BFs associated with the foreground/background regions, or by minimizing the criterion:

$$\partial\hat{\Omega} = \arg\min\left\{ \underbrace{\log\frac{1}{p(P(\Omega))}}_{E_b(\partial\Omega)} + \underbrace{BFs(\Omega_{in})\log\frac{BFs(\Omega_{in})}{BFs(\Omega_{out})}}_{E_{data}(\Omega_{in},I)} - \underbrace{BFs(\Omega_{out})\log\frac{BFs(\Omega_{out})}{BFs(\Omega_{in})}}_{E_{data}(\Omega_{out},I)} \right\} \qquad (10)$$

Similarly to [6], we used the definitions proposed by Appriou in [12] to define the mass functions for all image channels I_j, j = 1, ..., M, as:

$$m_{j}\left(\Omega_{in/out}\right) = p^{j}_{in/out}, \qquad m_{j}(\Omega) = 1 - \left(p^{j}_{in} + p^{j}_{out}\right), \qquad m_{j}(\emptyset) = 0 \qquad (11)$$

The pdfs p^j_in and p^j_out are estimated for all channels using the Parzen kernel [13]. Our proposed method uses the total belief committed to the foreground or background region. In the next section we propose a fast version of our segmentation algorithm.
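For concreteness, a minimal sketch of a Parzen estimate for one channel follows; the Gaussian window, the bandwidth value and the random sample values are illustrative assumptions, since the paper does not specify them.

```python
# Illustrative sketch only: 1-D Parzen (kernel) density estimate per channel.
import numpy as np

def parzen_pdf(samples, h=5.0):
    """Return a function x -> estimated density from 1-D channel samples."""
    samples = np.asarray(samples, dtype=float)
    norm = samples.size * h * np.sqrt(2.0 * np.pi)
    def pdf(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        d = (x[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * d ** 2).sum(axis=1) / norm
    return pdf

# densities of one colour channel inside and outside the current contour
p_in = parzen_pdf(np.random.randint(0, 256, 500))    # foreground samples
p_out = parzen_pdf(np.random.randint(0, 256, 500))   # background samples
print(p_in(128.0), p_out(128.0))
```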

3 Fast Algorithm Based on Split Bregman


A fast and accurate minimization algorithm for the TV problem was introduced in [3,14]. We propose to perform our segmentation in this framework and formulate the variational problem using the characteristic function χ:

  
$$\min_{\chi,d}\, E(\chi, d) = \int_{\Omega} |d(x)|\,dx + \lambda_{in}\int_{\Omega} V^{in}_{BFs}\,\chi(x)\,dx - \lambda_{out}\int_{\Omega} V^{out}_{BFs}\,\chi(x)\,dx \qquad (12)$$
where the velocity V^{in/out}_{BFs} is calculated using the Eulerian derivative of E_data in the direction ξ as follows:

$$\left\langle \frac{\partial E_{data}\left(\Omega_{in/out}(t), I\right)}{\partial t}, \xi \right\rangle = \int_{\partial\Omega} V^{in/out}_{BFs}\,\langle \xi(s), N(s)\rangle\, ds \qquad (13)$$

where N is the exterior unit normal vector to the boundary ∂Ω, ⟨ξ, N⟩ is the Euclidean scalar product, and s is the arc length parametrization. The vectorial function d enforces d = ∇χ using the efficient Bregman iteration approach [14]
defined as:
$$\begin{cases} \left(\chi^{k+1}, d^{k+1}\right) = \arg\min\left\{ \lambda_{in}\int_{\Omega} V^{in}_{BFs}\,\chi - \lambda_{out}\int_{\Omega} V^{out}_{BFs}\,\chi + \dfrac{\mu}{2}\int_{\Omega}\left\| d - \nabla\chi - b^{k}\right\|^{2} \right\} \\[2mm] b^{k+1} = b^{k} + \nabla\chi^{k} - d^{k+1} \end{cases} \qquad (14)$$
The minimizing solution χ^{k+1} is characterized by the optimality condition:

$$\mu\Delta\chi = \lambda_{in} V^{in}_{BFs} - \lambda_{out} V^{out}_{BFs} + \mu\,\mathrm{div}\left(b^{k} - d^{k}\right), \qquad \chi \in [0, 1] \qquad (15)$$

The minimizing solution d^{k+1} is given by soft-thresholding:

$$d^{k+1} = \mathrm{sign}\left(\nabla\chi^{k+1} + b^{k}\right)\,\max\left(\left|\nabla\chi^{k+1} + b^{k}\right| - \mu^{-1}, 0\right) \qquad (16)$$

Then, the final active contour is given by $\left\{x \in \Omega \mid \chi^{final}(x) \geq \tfrac{1}{2}\right\}$. The two iteration schemes are straightforward to implement. Finally, V^{in}_{BFs} and V^{out}_{BFs} are updated at each iteration using the belief functions given in (11) and (4).
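The iteration can be summarised in a short sketch. The following Python code assumes the velocities V_in and V_out are already available as 2-D arrays (in the paper they come from the belief functions of (4) and (11)); the weights, step size and iteration count are illustrative choices, and the χ-subproblem is handled by a simple explicit gradient step on the energy of (14) rather than by solving the optimality condition (15) exactly.

```python
# Illustrative sketch of the split Bregman loop; periodic boundaries assumed.
import numpy as np

def grad(u):
    # forward differences, shape (2, H, W)
    return np.stack((np.roll(u, -1, axis=0) - u, np.roll(u, -1, axis=1) - u))

def div(p):
    # discrete divergence, adjoint of -grad
    px, py = p
    return (px - np.roll(px, 1, axis=0)) + (py - np.roll(py, 1, axis=1))

def split_bregman_segmentation(V_in, V_out, lam_in=1.0, lam_out=1.0,
                               mu=0.5, step=0.2, n_iter=200):
    chi = np.full(V_in.shape, 0.5)
    d = np.zeros((2,) + V_in.shape)
    b = np.zeros_like(d)
    for _ in range(n_iter):
        # chi-update: explicit gradient step on Eq. (14), then clip chi to [0, 1]
        g = lam_in * V_in - lam_out * V_out - mu * div(grad(chi) - (d - b))
        chi = np.clip(chi - step * g, 0.0, 1.0)
        # d-update by component-wise soft-thresholding, Eq. (16)
        s = grad(chi) + b
        d = np.sign(s) * np.maximum(np.abs(s) - 1.0 / mu, 0.0)
        # Bregman variable update, Eq. (14)
        b = b + grad(chi) - d
    # final contour / foreground mask: {x | chi(x) >= 1/2}
    return chi >= 0.5
```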

4 Results
We introduced an AC model that incorporates BF s as statistical region knowl-
edge. To illustrate and demonstrate the accuracy of our segmentation method,
we present some results of our method and compare them to segmentation of
vector-valued images done by the traditional AC model and the model pro-
posed in [6]. The three methods are evaluated on 20 color images taken from
the Berkeley segmentation datasets [15] using the F-measure criterion. Traditional segmentation and the method in [6] are initialized by a contour curve around the object to be segmented, whereas our method is initialization-free. The segmentations produced by the three methods are presented for three challenging images (see Figure 1). The accuracy of the segmentation is represented in terms of Precision/Recall. The proposed method gives the best segmentation and its F-measure is better than that of the other methods (see Table 1).

Fig. 1. Images taken from the Berkeley Segmentation benchmark dataset [15]. In each row, from left to right: in red, the segmentation produced by our model; in blue, the segmentation produced by the model proposed in [6]; in yellow, the segmentation produced by the traditional vector-valued AC model based on the KL distance.

Table 1. Quantitative evaluation of the segmentation using F-measure


Image number Our method Method in [6] Method in [2]
Image:124084 0.93 0.90 0.71
Image:106024 0.94 0.90 0.73
Image:164074 0.88 0.81 0.66
Image:80099 0.92 0.87 0.69
Image:134008 0.88 0.71 0.61

5 Conclusion
We have investigated the use of the evidential framework for AC models using Dempster-Shafer (DS) theory. In particular, we have investigated how to calculate the mass functions using the Parzen kernel, which represents a difficult task. The results have shown that our proposed approach gives the best segmentation for color and textured images. The experimental results show that the segmentation performance is improved by using the three information sources to represent the same image, with respect to the use of one information source. However, there are some drawbacks of our proposed method. Our method of calculating mass functions is highly time consuming when the number of channels increases. Furthermore, the search for other optimal models to estimate mass functions in the DS theory, and the imprecision coming from different image channels, are important areas for future research.

References
1. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level
set segmentation: Integrating color, texture, motion and shape. Int. J. Comput.
Vision 72(2), 195–215 (2007)
2. Chan, T.F., Sandberg, B.Y., Vese, L.A.: Active contours without edges for vector-
valued images. J. of Vis. Communi. and Image Repres. 11, 130–141 (2000)
3. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J.P., Osher, S.: Fast global
minimization of the active contour/snake model. J. Math. Imaging Vis. 28(2),
151–167 (2007)
4. Rombaut, M., Zhu, Y.M.: Study of dempster–shafer theory for image segmentation
applications. Image and Vision Computing 20(1), 15–23 (2002)
5. Lelandais, B., Gardin, I., Mouchard, L., Vera, P., Ruan, S.: Using belief function
theory to deal with uncertainties and imprecisions in image processing. In: Denœux,
T., Masson, M.-H. (eds.) Belief Functions: Theory & Appl. AISC, vol. 164, pp.
197–204. Springer, Heidelberg (2012)
6. Scheuermann, B., Rosenhahn, B.: Feature quarrels: The dempster-shafer evi-
dence theory for image segmentation using a variational framework. In: Kimmel,
R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part II. LNCS, vol. 6493, pp.
426–439. Springer, Heidelberg (2011)
7. Dempster, A.P., Chiu, W.F.: Dempster-shafer models for object recognition and
classification. Int. J. Intell. Syst. 21(3), 283–297 (2006)
8. Masson, M.H., Denoeux, T.: ECM: An evidential version of the fuzzy c-means algorithm. Pattern Recognition 41(4), 1384–1397 (2008)

9. Vannoorenberghe, P., Colot, O., de Brucq, D.: Color image segmentation using
dempster-shafer’s theory. In: ICIP (4), pp. 300–303 (1999)
10. Cuzzolin, F.: A geometric approach to the theory of evidence. IEEE Trans. on
Syst., Man, and Cyber., Part C 38(4), 522–534 (2008)
11. Denoeux, T.: Maximum likelihood estimation from uncertain data in the belief
function framework. IEEE Trans. Knowl. Data Eng. 25(1), 119–130 (2013)
12. Appriou, A.: Generic approach of the uncertainty management in multisensor fu-
sion processes. Revue Traitement du Signal 22(2), 307–319 (2005)
13. Parzen, E.: On estimation of a probability density function and mode. The Annals
of Mathematical Statistics 33(3), 1065–1076 (1962)
14. Goldstein, T., Bresson, X., Osher, S.: Geometric applications of the split breg-
man method: Segmentation and surface reconstruction. J. Sci. Comput. 45(1-3),
272–293 (2010)
15. Martin, D.R., Fowlkes, C.C., Malik, J.: Learning to detect natural image bound-
aries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal.
Mach. Intell. 26(5), 530–549 (2004)
Flexible Hypersurface Fitting with RBF Kernels

Jun Fujiki1 and Shotaro Akaho2


1 Fukuoka University
2 National Institute of Advanced Industrial Science and Technology

Abstract. This paper gives a method of flexible hypersurface fitting with RBF kernel functions. In order to fit a hypersurface to a given set of points in a Euclidean space, we can apply the hyperplane fitting method to the points mapped to a high dimensional feature space. This fitting is equivalent to a one-dimensional reduction of the feature space by eliminating the linear space spanned by the eigenvector corresponding to the smallest eigenvalue of the variance-covariance matrix of the data points in the feature space. This dimension reduction is called minor component analysis (MCA), which solves the same eigenvalue problem as kernel principal component analysis and extracts the eigenvector corresponding to the least eigenvalue. In general, the feature space is set to a Euclidean space, which is a finite-dimensional Hilbert space. To consider an MCA for an infinite-dimensional Hilbert space, a kernel MCA (KMCA), which leads to an MCA in a reproducing kernel Hilbert space, should be constructed. However, the representer theorem does not hold for a KMCA, since an infinite number of zero eigenvalues would appear in an MCA for the infinite-dimensional Hilbert space. Then, the fitting solution is not determined uniquely in the infinite-dimensional Hilbert space, contrary to the unique solution in a finite-dimensional Hilbert space. This ambiguity of fitting seems disadvantageous because it introduces instability in fitting, but it can also realize flexible fitting. Based on this flexibility, this paper gives a hypersurface fitting method in the infinite-dimensional Hilbert space with RBF kernel functions to realize flexible hypersurface fitting. Although several eigenvectors of the matrix defined from the kernel function at each sample must be considered, the simulation results suggest a candidate for a reasonable solution under a specific situation. The simulations show that the flexibility of our method is effective.

Keywords: feature space, fitting, kernel PCA, RBF kernel, Hilbert space.

1 Introduction
In the fields of computer vision and machine learning, various nonlinear prob-
lems are reduced to linear problems in feature space with feature mappings. For
getting such feature mappings, RBF kernel functions are widely used because
RBF kernel functions have a great advantage called a kernel trick [4]. Since the
kernel trick changes a searching problem in feature space into the problem in the


space spanned by the sample data, the dimension of the search space also changes from that of the feature space to the number of sample data, and then the trick is effective when the dimension of the feature space is very large compared to the number of sample data. In other words, the kernel trick makes high-dimensional problems free from the curse of dimensionality. The justification of the kernel trick is guaranteed by the representer theorem [7]. The representer theorem ensures that the optimal estimator can be represented by a linear combination of kernel functions evaluated at sample points (functions). When the theorem holds, an estimation in the infinite dimensional feature space corresponding to RBF kernels changes to an estimation in the finite dimensional sample space. This is why RBF kernels are widely used to treat infinite dimensions.
One of the purposes of this paper is to apply the kernel trick to hypersurface fitting. Wahba [7] gives the representer theorem in the case of regression, which can be regarded as a kind of hypersurface fitting. However, hypersurface fitting based on regression does not have a geometrical property. For example, the fitting result is not invariant under rotation of the coordinates. The reason why the fitting is not geometrical is that the fitting has a special variable called the target variable, and then not all variables (coordinates) are treated equivalently. This makes the fitting result non-geometrical.
On the other hand, line fitting based on a principal component analysis (PCA) is geometrical, that is, the fitting result is invariant under rotation of the coordinates. PCA is kernelized as kernel PCA [3], which is widely utilized in pattern recognition, and kernel PCA satisfies the representer theorem. The line fitting method based on PCA is extended to nonlinear hypersurface fitting methods [1]. In the extended methods, the fitting hypersurface is represented in an inner product form a^⊤F(x) = 0, where F(x) is a function of the coordinates that represents a set of hypersurfaces. The parameter vector a is, in general, estimated so as to minimize the (weighted) sum of (a^⊤F(x))² over the observed data.
The extended method amounts to subtracting the one-dimensional space of the smallest principal component from the feature space. Such a subtraction of the smallest principal component is called a minor component analysis (MCA) [6]. From this point of view, to establish a geometrical hypersurface fitting method, the MCA should be given in a kernel formulation, that is, a kernel MCA (KMCA). However, the KMCA does not satisfy the representer theorem. This fact can be explained as follows. If the KMCA satisfied the representer theorem, it could be represented in a kernel formulation, and the KMCA corresponding to an infinite dimensional feature space should exist. But in an infinite dimensional feature space, the dimension of the eigenspace corresponding to the zero eigenvalue is also infinite. This means that the parameter vector which describes the fitting hypersurface cannot be determined uniquely, and there are infinitely many possibilities for the parameter vector. This paper gives a flexible hypersurface fitting method by utilizing this indeterminacy.

2 Linear Fitting on Feature Space and Its Kernelization


This section reviews linear fitting on the feature space and its kernelization. The fitting criterion is to minimize the sum of squared algebraic errors.
We have D observed points and let $\{x_{[d]}\}_{d=1}^{D}$ be their Euclidean coordinates. Let $a^{\top}F(x) = 0$ be the representation of the set of fitting curves. The result of linear fitting on the feature space, that is, the parameter vector a, minimizes $\sum_{d=1}^{D}\left(a^{\top}F_{[d]}\right)^{2}$ s.t. $\|a\| = 1$, where $F_{[d]} = F(x_{[d]})$. The parameter vector a is then the eigenvector corresponding to the minimum eigenvalue of the matrix $\sum_{d=1}^{D} F_{[d]}F_{[d]}^{\top}$.
Here, the kernel function can be represented by the inner product as $k(x, y) = F(x)^{\top}F(y)$. We only consider the case that a can be represented by a linear combination of the $F_{[d]}$, that is,

$$\underset{(n\times 1)}{a} = \sum_{d=1}^{D} \alpha_{[d]} F_{[d]} = \underset{(n\times D)}{F}\;\underset{(D\times 1)}{\alpha}.$$

If the representer theorem holds, it is guaranteed that the global minimum exists in this linear combination; if not, it is not guaranteed. As explained later, the representer theorem does not hold for hypersurface fitting.
Let K be a D × D matrix with $(K)_{ij} = k(x_{[i]}, x_{[j]})$, and let the columns of K be denoted $K = \left(K_{[1]} \cdots K_{[D]}\right)$. Since $K = F^{\top}F$, there holds $K_{[d]} = F^{\top}F_{[d]}$, and $\left(a^{\top}F_{[d]}\right)^{2} = \alpha^{\top}K_{[d]}K_{[d]}^{\top}\alpha$.
Then the coefficient vector α minimizes

$$\sum_{d=1}^{D} \alpha^{\top}K_{[d]}K_{[d]}^{\top}\alpha = \alpha^{\top}K\alpha. \qquad (1)$$

The coefficient vector α is then the eigenvector corresponding to the minimum eigenvalue of the matrix K.
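As a concrete illustration, the following sketch performs this fitting with the explicit second-order monomial feature map used later in the experiment of Section 2.2; the noise level and the use of NumPy's eigendecomposition are illustrative choices.

```python
# Illustrative sketch: MCA-style curve fitting with an explicit feature map.
import numpy as np

def features(x1, x2):
    # F(x1, x2) = (x1^2, x1*x2, x2^2, x1, x2, 1)
    return np.array([x1 ** 2, x1 * x2, x2 ** 2, x1, x2, np.ones_like(x1)])

# points on the parabola y = x^2, as in the experiment of Section 2.2
x = np.random.uniform(-3.5, 3.5, 50)
y = x ** 2 + 0.05 * np.random.randn(50)     # small additive noise
F = features(x, y)                          # 6 x 50 matrix of feature vectors

M = F @ F.T                                 # sum_d F_[d] F_[d]^T
eigval, eigvec = np.linalg.eigh(M)          # eigenvalues in ascending order
a = eigvec[:, 0]                            # parameter vector: smallest eigenvalue

# the fitted quadratic curve is the zero set a^T F(x) = 0; for these data a is
# roughly proportional to (1, 0, 0, 0, -1, 0), i.e. x^2 - y = 0
print(a)
```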

2.1 Representer Theorem Does Not Hold


We briefly explain the fact that the representer theorem does not hold in an MCA. The representer theorem ensures that the appropriate a can be represented by a linear combination of the $F_{[d]}$, with the coefficients of the linear combination given by the eigenvector corresponding to the minimum eigenvalue of the matrix K.
Let a be Fα, which is a linear combination of the $F_{[d]}$. The coefficient vector α satisfies $\lambda\alpha = K\alpha = F^{\top}F\alpha$ for the minimum eigenvalue λ. If λ ≠ 0, the vector a can be uniquely obtained as a linear combination of the $F_{[d]}$ as $a = F\alpha = \tfrac{1}{\lambda}FF^{\top}F\alpha$. However, if λ = 0, there holds Kα = 0, implying a = Fα = 0, and a would be estimated as a 'trivial' linear combination of the $F_{[d]}$; as a result, a is estimated improperly.


Fig. 1. Quadratic curve fitting for parabola: with (top) and without (bottom) noises

2.2 Experiments to Check the Representer Theorem


Here, we present an experiment to investigate the feature space for quadratic curve fitting. We generate 50 data points from y = x², whose x-coordinates are chosen uniformly from the interval x ∈ [−3.5, 3.5]. Data are mapped to the feature space by the feature mapping F : (x_1, x_2) ↦ (x_1², x_1x_2, x_2², x_1, x_2, 1) ∈ R⁶.
We fit a quadratic curve to the data using the two-dimensional polynomial kernel. The fitting result for the data with noise is shown in Fig. 1. In this experiment, there holds rank K ≤ 6, and hence the eigenvalues of K beyond the sixth should be zero. Let the unit eigenvectors corresponding to the largest six eigenvalues of K be a_i (i = 1, . . . , 6). Fig. 1 shows the quadratic curves a_i^⊤F(x) = 0. The results with noise are in the top row and those without noise are in the bottom. The difference between the top and the bottom rows is the figure corresponding to the sixth eigenvalue. When there is noise, the sixth eigenvalue of K is not so small (rank K = 6), and we can compute an appropriate a_6. In contrast, when there is no noise, the sixth eigenvalue of K is theoretically zero (rank K = 5), but it is computed as a small value due to round-off error; the unreliable eigenvector a_6 is then obtained. In this case, the region a_6^⊤F(x) = 0 corresponding to the sixth eigenvalue in the bottom row is drawn by the contour function provided by the statistical software R [2]. It is shown that the curve fitting fails when there is no noise.

Fig. 2 shows equidistant lines of the Euclidean distance to the linear space V_k = span{a_1, . . . , a_k} in the feature space. The results with noise are in the top row and those without noise are in the bottom. The sixth figures of the two rows are different. When there is noise, there holds V_6 = R⁶ and the Euclidean distance is zero for all points. This means that the linear space {x | a_6^⊤x = 0}, which is the orthogonal complement of a_6, is equivalent to V_5. In contrast, when there is no noise, there holds V_6 = V_5 (≠ R⁶) since a_6 = 0, and the fifth and the sixth figures are precisely the same. This means that the linear space {x | a_6^⊤x = 0}, which is the orthogonal complement of a_6, is the whole space R⁶.

From the experiment, it can be seen that the representer theorem does not hold when there is no noise, and the parameter vector a shrinks to the zero vector.


V1 V2 V3 V4 V5 V6
Fig. 2. Equidistant lines from points to Vk : with (top) and without (bottom) noises

3 Fitting via Hilbert Space


A method of construction of a Hilbert space with RBF kernel functions was proposed in [3], and is known as kernel PCA (KPCA). The resulting Hilbert space is used for classification in the so-called subspace method [5]. In this paper, the Hilbert space is used for hypersurface fitting.
To construct a Hilbert space, vectors in $\mathbb{R}^n$ are mapped to square-integrable functions. With RBF kernels, a vector $v \in \mathbb{R}^n$ is mapped to

$$v \mapsto |v\rangle = v(x) = C \exp\left(-\frac{\|x - v\|^{2}}{\sigma^{2}}\right), \qquad C = \left(\frac{2}{\pi\sigma^{2}}\right)^{n/4},$$

and the inner product in the Hilbert space is defined as

$$\langle v \mid w \rangle = \int v(x)\,w(x)\,dx = \exp\left(-\frac{\|v - w\|^{2}}{2\sigma^{2}}\right).$$
In this definition, the quantity σ is the width parameter of the RBF kernel function. In the Hilbert space, the representation of a hyperplane is $\langle a \mid v\rangle = 0$, which is fit to the sample points. The hypersurface fitting then amounts to finding the best square-integrable function a(x).
We have D observed data $\{v_{[d]}\}_{d=1}^{D}$ and let $|v_{[d]}\rangle$ be the image of $v_{[d]}$ in the Hilbert space with RBF kernels. For these data, the hyperplane $\langle a \mid v\rangle = 0$ is fit in the Hilbert space. The function $|a\rangle$ is estimated as a minimizer of

$$E(a(x)) = \sum_{d=1}^{D} \langle a \mid v_{[d]}\rangle\langle v_{[d]} \mid a\rangle. \qquad (2)$$

As already discussed, the representer theorem does not hold for the hypersurface fitting, but we seek an appropriate $|a\rangle$ in the linear space spanned by the square-integrable functions in the Hilbert space, as $|a\rangle = \sum_{d=1}^{D}\alpha_{d}|v_{[d]}\rangle = |V\rangle\alpha$, where $|V\rangle = \left(|v_{[1]}\rangle \cdots |v_{[D]}\rangle\right)$.
Let k(v, w) be the RBF kernel function, $k(v, w) = \langle v \mid w\rangle$, and let K be the D × D matrix whose ij-th components are defined as $(K)_{ij} = \langle v_{[i]} \mid v_{[j]}\rangle$.
 
There holds $K = \langle V \mid V\rangle$ for $\langle V| = \left(\langle v_{[1]}| \cdots \langle v_{[D]}|\right)$, and the d-th column of K is represented as $K_{[d]} = \langle V \mid v_{[d]}\rangle$. Since there holds

$$\langle a \mid v_{[d]}\rangle\langle v_{[d]} \mid a\rangle = \alpha^{\top}\langle V \mid v_{[d]}\rangle\langle v_{[d]} \mid V\rangle\alpha = \alpha^{\top}K_{[d]}K_{[d]}^{\top}\alpha,$$

Eq. (2) is rewritten as

$$\sum_{d=1}^{D} \alpha^{\top}K_{[d]}K_{[d]}^{\top}\alpha = \alpha^{\top}K\alpha$$

without utilizing the mapping to the Hilbert space. That is, the energy function has the same representation as in the Riemannian space.
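A minimal sketch of this computation is given below: it builds the Gram matrix K from the RBF kernel, takes its eigenvectors as candidate coefficient vectors α, and evaluates the corresponding fitted function ⟨a|v⟩ = Σ_d α_d k(v_[d], v) on a grid. The value σ = 5√2 is the one used in the experiments of Section 4; the grid and the selected eigenvector are illustrative choices.

```python
# Illustrative sketch of the kernelized fitting with an RBF kernel.
import numpy as np

def rbf(v, w, sigma):
    return np.exp(-np.sum((v - w) ** 2, axis=-1) / (2.0 * sigma ** 2))

x = np.random.uniform(-3.5, 3.5, 50)
pts = np.stack([x, x ** 2], axis=1)                   # 50 points on a parabola
sigma = 5.0 * np.sqrt(2.0)

K = rbf(pts[:, None, :], pts[None, :, :], sigma)      # 50 x 50 Gram matrix
eigval, eigvec = np.linalg.eigh(K)                    # ascending eigenvalues

def fitted_function(alpha, v):
    """<a|v> = sum_d alpha_d k(v_[d], v) at query points v (N x 2)."""
    return rbf(v[:, None, :], pts[None, :, :], sigma) @ alpha

alpha_15 = eigvec[:, -15]                             # 15th-largest eigenvalue
grid = np.stack(np.meshgrid(np.linspace(-4, 4, 200),
                            np.linspace(-2, 8, 200)), axis=-1).reshape(-1, 2)
values = fitted_function(alpha_15, grid)              # zero level set = fitted curve
```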

4 Experiments: Curve Fitting with RBF Kernel



The curves derived from the RBF kernel with σ = 5√2 are fit to the points on a parabola, in the same way as in the previous experiment. Since the number of data points is 50, K is a 50 × 50 matrix. Let the unit eigenvectors corresponding to the 50 eigenvalues of K be |a_i⟩ (i = 1, . . . , 50), where a small suffix corresponds to a large eigenvalue. Figure 3 shows ⟨a_i|v⟩ = 0 (i = 1, ..., 50) drawn by the contour function provided by R [2]. The curves sometimes appear disconnected due to round-off error in the computation.
From Fig. 3, the eigenvectors corresponding to large eigenvalues (small suffix) and to small eigenvalues (large suffix) do not correspond to appropriate curves. The curves corresponding to large eigenvalues do not pass through the data, and the curves corresponding to small eigenvalues are complicated, though they pass through most of the data, that is, they overfit. In this experiment, the 15th curve is the best. How to choose the best eigenvector is one of our future works, but the proposed method has the potential to realize good fitting.
Figure 4 shows other fitting results, in which the proposed method works very
well.

5 Discussion
This paper gives a flexible hypersurface fitting method for a set of samples from some hypersurface. It is shown by simulations that our method works well, but it is not easy to choose both the best eigenvector and the width parameter σ systematically. In order to choose both of them, some new criteria are needed. Nevertheless, our method is worth being used as a first approach, as we have shown in this paper.

Acknowledgment. This work was supported by JSPS KAKENHI Grant Num-


ber 25330276.

Fig. 3. RBF kernel fitting to points on a parabola: curves correspond to the 50 eigenvectors (the 15th curve is re-shown in the bottom right)

Fig. 4. Examples of flexible fitting: points (top) and results (middle and bottom)

References
1. Fujiki, J., Akaho, S.: Hypersurface fitting via Jacobian nonlinear PCA on Rieman-
nian space. In: Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A., Kropatsch,
W. (eds.) CAIP 2011, Part I. LNCS, vol. 6854, pp. 236–243. Springer, Heidelberg
(2011)
2. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2008) ISBN 3-900051-07-0, http://www.R-project.org
3. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
4. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization, and Beyond. MIT Press (2001)
5. Tsuda, K.: Subspace Classifier in the Hilbert Space. Pattern Recognition Letters 20,
513–519 (1999)
6. Xu, L., Oja, E., Suen, C.: Modified Hebbian learning for curve and surface fitting.
Neural Networks 5(3), 441–457 (1992)
7. Wahba, G.: Spline Models for Observational Data. SIAM (1990)
Gender Classification Using Facial Images
and Basis Pursuit

Rahman Khorsandi and Mohamed Abdel-Mottaleb

Department of Electrical and Computer Engineering, University of Miami



Abstract. In many social interactions, it is important to correctly recognize gender. Researchers have addressed this issue based on facial images, ear images and gait. In this paper, we present an approach for gender classification using facial images based upon sparse representation and Basis Pursuit. In sparse representation, the training data is used to develop a dictionary based on extracted features. Classification is achieved by representing the extracted features of the test data using the dictionary. For this purpose, basis pursuit is used to find the best representation by minimizing the l1 norm. In this work, Gabor filters are used for feature extraction. Experiments are conducted on the FERET data set and the obtained results are compared with other work in this area. The results show improvement in gender classification over existing methods.

Keywords: Gender Classification, Basis Pursuit, Sparse Representation, Facial


Images, Gabor Wavelets.

1 Introduction
Gender classification is an important task in social activities and communications. In
fact, automatically identifying gender is useful for many applications, e.g. security
surveillance [4] and statistics about customers in places such as movie theaters, build-
ing entrances and restaurants [3]. Automatic gender classification is performed based
on facial features [8], voice [10], body movement or gait [23].
Most of the published work in gender classification is based on facial images.
Moghaddam et al. [16] used Support Vector Machines (SVMs) for gender classification
from facial images. They used low resolution thumbnail face images (21 × 12 pixels).
Wu et al. [21] presented a real time gender classification system using a Look-Up-Table
Adaboost algorithm. They extracted demographic information from human faces. Golomb et al. [8] developed a neural network based gender identification system. They
used face images with resolution of 30x30 pixels from 45 males and 45 females to
train a fully connected two-layer neural network, SEXNET. Cottrell and Metcalfe [6]
used neural networks for face emotion and gender classification from facial images.
Gutta and Wechsler [9] used hybrid classifiers for gender identification from facial
images. The authors proposed a hybrid approach that consists of an ensemble of RBF
neural networks and inductive decision trees. Yu et al. [23] presented a study of gen-
der classification based on human gait. They used model-based gait features such as
height, frequency and angle between the thighs. Face-based gender classification is still
an attractive research area and there is room for developing novel algorithms that are


more robust, more accurate and faster. In this paper, we present a novel method
for gender classification based on sparse representation and basis pursuit.
Over the past few years, the theory of sparse representation has been used in various
practical applications in signal processing and pattern recognition [7]. A sparse repre-
sentation of a signal can be achieved by representing the signal as a linear combination
of a relatively few base elements in a basis or an overcomplete dictionary [2]. Sparse
representation has been used for compression [1], denoising [19], and audio and im-
age analysis [14]. However, its use in recognition and classification is relatively new.
Wright et al. [20] proposed a classification algorithm for face recognition based on a
sparse representation. The reported results for face recognition are encouraging enough
to extend this concept to other areas such as gender classification. In addition, Patel et
al. [17] proposed a face recognition algorithm based on dictionary learning and sparse
representation. A dictionary is learned for each class based on given training samples.
The test sample is projected onto the span of the training data in each learned dictionary.
The main idea of using sparse representation for recognition and classification is to represent the test data as a linear combination of the training data. The set of coefficients of the linear representation is called the weight vector. If we assume that there are many subjects in the database, the test data will only be related to one of the subjects. Therefore, in sparse representation, the weight vector should be sparse and it is important to find the sparsest solution. To find the weight vector, we use basis pursuit as described in the next section.
In this paper, we present a gender classification system based on 2-D facial images
and sparse representation. This paper is organized as follows: In Section 2, we present
a brief mathematical explanation of the sparse representation concept and the proposed
method based on basis pursuit to obtain the sparsest solution. Section 3 presents exper-
imental results that demonstrate the performance of the proposed method in terms of
recognition. Conclusions and future research directions are discussed in Section 4.

2 Classification Based on Sparse Representation


Underdetermined systems appear in different important areas such as signal processing, statistics, pattern recognition and image processing. Sparse representation is a relatively new approach to solving underdetermined systems. In this section, we briefly explain the concept of sparse representation. First, the approach for building a dictionary is introduced. Then, the proposed approach for finding the sparsest solution based on basis pursuit is described. Finally, Gabor wavelets, which we use for extracting the feature vectors, are discussed.

2.1 Building the Dictionary


In the proposed method, a dictionary is built from the training data. The dictionary is a matrix where each column is the feature vector of one of the training samples. Assume that there are n_i training data samples for the i-th class, where each data sample is represented by a vector of m elements. These vectors are then used to construct the columns of the matrix A_i:

$$A_i = [v_{i,1}, v_{i,2}, ..., v_{i,n_i}] \in \mathbb{R}^{m \times n_i} \qquad (1)$$

vi,j , where j = 1, ..., ni , is a column vector that represents the features extracted from
the training data sample j of subject i.
It is assumed that a test data from class i can be represented as a linear combination
of the training data from that class [20]:

y = αi,1 vi,1 + αi,2 vi,2 + ... + αi,ni vi,ni (2)

where y ∈ Rm is the feature vector of the test data and the α values are the coefficients
corresponding to the training data samples of subject i. Concatenating the matrices
Ai , i = 1, 2, ..., k yields:

A = [A1 , A2 , ..., Ak ] ∈ Rm×n (3)


where k is the number of subjects and $n = \sum_{i=1}^{k} n_i$. A linear representation for the feature vector of the test data, i.e., y, can then be given as:

y = Ax0 ∈ Rm (4)

where x0 is the coefficient vector. By solving this equation for x0 , the class of the test
data y can be identified (as described in the next section).

2.2 Sparse Representation


To solve equation y = Ax, the number of equations, m, and unknown parameters, n,
are important. If m = n the system of equations will be complete and the solution will
be unique. However, in recognition and classification, usually there are many subjects
or classes, where the test image belongs only to one of the classes and does not belong
to the other classes. In addition, the number of extracted features is much smaller than the number of training samples. Therefore, the number of equations is less than the number of
unknown parameters (m < n) and there is no unique solution for the system y = Ax.
In this formulation, the matrix A is the dictionary that contains the representations of
n samples or atoms, where each sample is represented by a feature vector of length m.
Since m < n, it is an overcomplete matrix. Since dictionary A contains redundancies, it
is possible to find x in an infinite number of ways. Therefore, it is important to introduce
a criterion in order to find the best representation (as mentioned later, this criterion is that the solution is the sparsest solution of the equation y = Ax). When the system y = Ax is underdetermined (i.e., m < n), usually the l2 norm is used and the estimate
is expressed as follows:

$$(l_2): \quad \hat{x}_2 = \arg\min \|x\|_2 \quad \text{subject to} \quad y = Ax \qquad (5)$$

where x̂2 is the solution, which can be obtained simply by computing the pseudo-inverse of A. However, this solution does not contain useful information for recognition. In recognition, the test data belongs to one of the classes represented in the dictionary. Therefore, the obtained answer should be sparse (i.e., only a few elements, those that correspond to the training samples of the correct class, are not zero). The sparsest solution of y = Ax can be obtained by minimizing the l0 norm as follows [20]:

$$(l_0): \quad \hat{x}_0 = \arg\min \|x\|_0 \quad \text{subject to} \quad y = Ax \qquad (6)$$

where ‖·‖_0 is the zero norm, which counts the number of non-zero elements of x. However, finding the minimum l0 norm is not an easy task and it becomes harder as the dimensionality increases, since we need to use a combinatorial search. Furthermore, noise affects the solution because the noise magnitude can significantly change the l0 norm of a vector. In this paper, Basis Pursuit (BP) is used to find the sparsest solution, i.e., the solution vector x that has the smallest number of non-zero elements.

2.3 Sparse Solution Based on Basis Pursuit


Basis pursuit was introduced in the 1970s, and then studied mathematically in the 1990s
by Chen and Donoho [5]. To solve the underdetermined system of equations y = Ax,
l2 minimization is easy to compute, but not useful in recognition. In fact, for recognition
purposes, minimizing the l0 norm provides the best solution since the test data is related
to only one of the subjects in the training set. In fact, most of the components of x
should be zero or close to zero. However, the l0 norm is not a continuous function. Since l0 norm minimization is not a convex optimization problem, it is not easy to obtain the solution. On the other hand, we can use l1 norm minimization, which is convex, to find x. In the l1 norm minimization, a cost is assigned to each atom that we use in our representation. Actually, there is no charge for the norm when it gives a zero coefficient. BP finds the best solution of x by minimizing the l1 norm of x as follows:

$$\hat{x}_1 = \arg\min \|x\|_1 \quad \text{subject to} \quad y = Ax \qquad (7)$$

where ‖x‖_1 is the l1 norm. Since the nonzero coefficients correspond to columns of the dictionary, it is possible to use the indices of the nonzero components of x̂1 to identify the columns of A that are necessary to represent the test image. The l1 norm assigns a cost to each atom that is used in the representation. For example, the norm is not penalized when it gives a zero coefficient, but it is charged proportionally for small and large coefficients.
Since ‖x‖_1 = |x_1| + ... + |x_n|, we can rewrite equation (7) as:

$$\text{minimize } \|x\|_1 = |x_1| + \dots + |x_n| \quad \text{subject to} \quad y = Ax \qquad (8)$$

Because |x_1| + ... + |x_n| is a nonlinear function, the optimization problem cannot be solved directly using linear programming methods. To make this function linear, the nonlinearities should be changed into constraints by adding new variables as follows:

$$\text{minimize } t_1 + t_2 + \dots + t_n \quad \text{subject to} \quad |x_1| \leq t_1, \dots, |x_n| \leq t_n \;\text{ and }\; y = Ax \qquad (9)$$

where t_1, ..., t_n are non-negative auxiliary variables. In this formulation, the objective function is linear and it is possible to solve this problem using linear programming.
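A minimal sketch of this linear program, solved with scipy.optimize.linprog, is shown below; the random dictionary A and test vector y are stand-ins for the Gabor/PCA features used later in the paper.

```python
# Illustrative sketch of basis pursuit as the LP of Eq. (9).
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    m, n = A.shape
    # decision variables z = [x; t]; minimise sum(t)
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    # |x_i| <= t_i  <=>  x_i - t_i <= 0  and  -x_i - t_i <= 0
    A_ub = np.block([[I, -I], [-I, -I]])
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])   # enforce y = A x
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]

A = np.random.randn(10, 40)                   # overcomplete dictionary, m < n
x_true = np.zeros(40); x_true[3], x_true[17] = 1.2, -0.7
y = A @ x_true
x_hat = basis_pursuit(A, y)                   # sparse coefficient vector
```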

2.4 Using Sparse Representation for Classification


For a test data point y belonging to the i-th class, it is assumed that the non-zero elements of x̂1 will correspond to the training data samples from the i-th class. However, due to noise

and representation errors, there will be extraneous non-zero elements corresponding to


training samples from other classes.
In [20], an approach was presented for the decision making step based upon the obtained x̂1 by computing the error between y, the original data, and ŷ_i, the approximation obtained through the sparse representation. For each class i and x ∈ R^n, the vector δ_i(x) ∈ R^n represents the coefficients that are associated with class i. Using this definition, the approximated test data ŷ_i is given as:

$$\hat{y}_i = A\,\delta_i(\hat{x}_1) \qquad (10)$$
Recognition was performed by assigning the test data to the class that minimizes the residual between y and ŷ_i as follows:

$$\min_{i}\; r_i(y) = \left\| y - A\,\delta_i(\hat{x}_1) \right\|_2 \qquad (11)$$

where r_i(y) is the residual distance for class i. This signifies that the classification is performed based on the best approximation and least error [20].
Here, we propose a new approach to perform classification using x̂1. In gender classification, there are only two classes and the dictionary contains training face images for males and females as representatives of these two classes. The obtained elements of x̂1 are the coefficients associated with each training face image, and we can divide x̂1 into two vectors, x_1 and x_2, where x_1 contains the coefficients associated with males and x_2 contains the coefficients associated with females. x̂1 can be written as follows:

$$\hat{x}_1 = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

The length of x̂1 is m, and the numbers of training samples for males and females are equal. Hence, the length of x_1 and the length of x_2 is m/2.
Let x_max be the maximum value of the x̂1 elements (x_max = max(x̂1)). Then, a threshold x_max/τ, where τ ≥ 1, is defined. The elements in x_1 and x_2 whose values are greater than the threshold are counted. The classification is performed based on the majority vote of the coefficients.
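The voting rule can be sketched as follows; the value of τ and the tie-breaking choice are illustrative assumptions, not specified in the paper.

```python
# Illustrative sketch of the majority-vote decision on the sparse coefficients.
import numpy as np

def classify_gender(x_hat, tau=2.0):
    half = len(x_hat) // 2
    x1, x2 = x_hat[:half], x_hat[half:]       # male / female coefficients
    thresh = x_hat.max() / tau                # threshold x_max / tau, tau >= 1
    votes_male = np.sum(x1 > thresh)
    votes_female = np.sum(x2 > thresh)
    return 'male' if votes_male >= votes_female else 'female'
```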

3 Gabor Wavelets
The Gabor filters (kernels) with orientation μ and scale ν are defined as [22]
$$\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^{2}}{\sigma^{2}}\; e^{\left(-\frac{\|k_{\mu,\nu}\|^{2}\|z\|^{2}}{2\sigma^{2}}\right)} \left[ e^{i\, k_{\mu,\nu}\cdot z} - e^{-\sigma^{2}/2} \right] \qquad (12)$$
where z = (x, y) is the pixel position, and the wave vector kμ,ν is defined as kμ,ν =
kν eiφμ with kν = kmax /f ν and Φμ = πμ/8. kmax is the maximum frequency, and f is
the spacing factor between kernels in the frequency domain. The ratio of the Gaussian
window width to wavelength is determined by σ. Considering Eq. 12, the Gabor kernels
can be generated from one wavelet, i.e., the mother wavelet, by scaling and rotation via
the wave vector k_{μ,ν} [13]. In this work, we used five scales, ν ∈ {0, ..., 4}, and eight orientations, μ ∈ {0, ..., 7}. We also used σ = 2π, k_max = π/2 and f = √2.
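A minimal sketch of generating this kernel bank with the parameters stated above follows; the spatial kernel size is an illustrative choice, since it is not specified here.

```python
# Illustrative sketch of the Gabor kernel bank of Eq. (12).
import numpy as np

def gabor_kernel(mu, nu, size=31, sigma=2 * np.pi, kmax=np.pi / 2, f=np.sqrt(2)):
    k = kmax / (f ** nu)                      # k_nu = k_max / f^nu
    phi = np.pi * mu / 8.0                    # phi_mu = pi * mu / 8
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    z2 = x ** 2 + y ** 2                      # ||z||^2
    k2 = kx ** 2 + ky ** 2                    # ||k_{mu,nu}||^2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)
    return envelope * carrier

bank = [gabor_kernel(mu, nu) for nu in range(5) for mu in range(8)]   # 40 kernels
```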

Fig. 1. Sample images for both males and females in FERET database

4 Experiments and Results

The FERET database [18] is used to validate the proposed method. Images are frontal
faces at a resolution of 256x384 with 256 gray levels. All the images are preprocessed
before applying the algorithm. First, the automatic eye-detection method is applied
based on the [11] and the distance, d, between the 2 eye corners is measured. Then,
the middle point between the 2 eye corners is found and the image is cropped by the
size of 2d × 2d. Then all images are resized to 128x128. A few sample face images
for both male and female subjects are shown in Fig. 1. In this database, there are 250
male subjects and 250 female subjects. As previously stated, in sparse classification,
the training samples are used to build a dictionary, which is used during the classifica-
tion to represent a test sample as a linear combination of the training samples. Since
we are using majority voting for making a decision between the two categories, the
number of training samples for males and females should be equal. In addition, to com-
pare our results with other methods, especially [12], four experiments are conducted
with different number of subjects used for training in each experiment, sizes of: 50,
100, 150 and 200 subjects were used for training. In each experiment, the remaining
subjects are used for testing. For instance, when using 200 subjects for training (100
male subjects and 100 female subjects), the other 300 subjects are used for testing. In
the feature extraction step, Gabor wavelets are extracted for 8 orientations and 5 spatial
frequencies. Finally, using PCA, the number of features used to represent each image
is reduced to 128. In the proposed approach, for each test data, the sparsest coefficient
vector x̂1 is obtained based on basis pursuit. Majority voting is then used to recog-
nize the gender of the test subject. We provide a comparison of the experimental results
with other gender classification systems applied to the same dataset. Table. 1 shows the
classification rates for 4 different training set sizes. The results of our proposed method
(PCA + BP) are compared with the results of the methods proposed by Jain et al. [12],
in which the authors evaluated their method on the FERET database. They used Inde-
pendent Component Analysis (ICA) to represent each image as feature vector in a low
dimensional subspace. In addition, they used different classifiers such as cosine classi-
fier (COS), linear discriminant classifier (LDA) and the support vector machine (SVM).
The best result reported in [12] is 95.67% accuracy using SVM with ICA. Furthermore,
the results for conventional sparse representation based classification (SRC)[20] are re-
ported in Table. 1 which show our modification was helpful in gender recognition. The
300 R. Khorsandi and M. Abdel-Mottaleb

Table 1. Performance comparison to other gender classification systems based on facial images
Training Set Size ICA + COS ICA + LDA ICA+ SVM SRC PCA+BP (Proposed Method)
50 60.67% 64.67% 68.30% 68.88% 68.88%
100 71.67% 73.67% 76.00% 76.00% 76.25%
150 80.33% 83.00% 86.67% 86.85% 88.57%
200 85.33% 93.33% 95.67% 96.33% 97.66%

The experimental results in this paper indicate that our proposed method using sparse representation and PCA obtains a higher correct classification rate on the same data set. To the best of our knowledge, better results for gender classification on the FERET database have not been reported since 2005. Moreover, in [15], the authors used 661 images from the FERET database, for 248 subjects. The best result obtained for gender classification in that paper is 90% for a feature dimension of 11,520. However, we obtained a classification rate of 97% for a feature dimension of 512 and for 500 subjects.

5 Conclusion
In this paper, we presented a method for gender classification from facial images using sparse representation. The basis pursuit method was used to formulate the problem in order to find the sparsest solution. The experiments were conducted on the FERET data set containing 500 subjects (250 male and 250 female subjects). Features were extracted using Gabor wavelets, and a dictionary was constructed based on the features extracted from a training set. The rest of the data set was used for testing. We compared the method proposed in this paper with previous methods that used the same data set; the performance of the presented method is better than that of the previously reported methods. The experiments are encouraging enough for future research on sparse representation for gender classification. In the future, we plan to apply our method to the fusion of facial and ear features.

References
1. Aharon, M., Elad, M., Bruckstein, A.: K-svd: An algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11),
4311–4322 (2006)
2. Baraniuk, R., Candes, E., Elad, M., Ma, Y.: Applications of sparse representation and com-
pressive sensing. Proceedings of the IEEE 98(6), 906–909 (2010)
3. Cao, L., Dikmen, M., Fu, Y., Huang, T.S.: Gender recognition from body. In: Proceedings
of the 16th ACM International Conference on Multimedia, MM 2008, New York, NY, USA,
pp. 725–728 (2008)
4. Chen, D.-Y., Lin, K.-Y.: Robust gender recognition for real-time surveillance system. In:
IEEE International Conference on Multimedia and Expo (ICME), pp. 191–196 (July 2010)
5. Chen, S.S., Donoho, D.L., Michael, Saunders, A.: Atomic decomposition by basis pursuit.
SIAM Journal on Scientific Computing 20, 33–61 (1998)
6. Cottrell, G.W., Metcalfe, J.: Empath: face, emotion, and gender recognition using holons. In:
Proceedings of the 1990 Conference on Advances in Neural Information Processing Systems
3, NIPS-3, San Francisco, CA, USA, pp. 564–571 (1990)
Gender Classification Using Facial Images and Basis Pursuit 301

7. Donoho, D.L.: Compressed sensing. IEEE Transaction on Information Theory 52(4) (2006)
8. Golomb, B.A., Lawrence, D.T., Sejnowski, T.J.: Sexnet: A neural network identifies sex from
human faces. In: Proceedings Conf. Advances in Neural Information Processing Systems 3,
pp. 572–577 (1990)
9. Gutta, S., Wechsler, H.: Gender and ethnic classification of human faces using hybrid clas-
sifiers. In: International Joint Conference on Neural Networks, IJCNN 1999, vol. 6, pp.
4084–4089 (1999)
10. Harb, H., Chen, L.: Gender identification using a general audio classifier. In: Proceedings of
the International Conference on Multimedia and Expo, ICME 2003, Washington, DC, USA,
pp. 733–736 (2003)
11. Hsu, R.-L., Abdel-Mottaleb, M., Jain, A.: Face detection in color images. IEEE Transactions
on Pattern Analysis and Machine Intelligence 24(5), 696–706 (2002)
12. Jain, A., Huang, J., Fang, S.: Gender identification using frontal facial images. In: IEEE
International Conference on Multimedia and Expo, ICME 2005, p. 4 (July 2005)
13. Liu, C., Wechsler, H.: Gabor feature based classification using the enhanced fisher linear
discriminant model for face recognition. IEEE Transactions on Image Processing 11(4),
467–476 (2002)
14. Llagostera Casanovas, A., Monaci, G., Vandergheynst, P., Gribonval, R.: Blind audiovisual
source separation based on sparse redundant representations. IEEE Transactions on Multi-
media 12(5), 358–371 (2010)
15. Lu, H., Huang, Y., Chen, Y., Yang, D.: Automatic gender recognition based on pixel-pattern-
based texture feature. Journal of Real-Time Image Processing 3, 109–116 (2008)
16. Moghaddam, B., Yang, M.-H.: Gender classification with support vector machines. In: Fourth
IEEE International Conference on Automatic Face and Gesture Recognition, pp. 306–311
(2000)
17. Patel, V.M., Wu, T., Biswas, S., Phillips, P.J., Chellappa, R.: Dictionary-based face recog-
nition under variable lighting and pose. IEEE Transactions on Information Forensics and
Security 7(3), 954–965 (2012)
18. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The feret evaluation methodology for face-
recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 22(10), 1090–1104 (2000)
19. Protter, M., Elad, M.: Image sequence denoising via sparse and redundant representations.
IEEE Transactions on Image Processing 18(1), 27–35 (2009)
20. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse repre-
sentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 210–227
(2009)
21. Wu, B., Ai, H., Huang, C.: Facial image retrieval based on demographic classification. In:
Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 3,
pp. 914–917 (2004)
22. Yang, M., Zhang, L.: Gabor feature based sparse representation for face recognition with
gabor occlusion dictionary. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part VI. LNCS, vol. 6316, pp. 448–461. Springer, Heidelberg (2010)
23. Yu, S., Tan, T., Huang, K., Jia, K., Wu, X.: A study on gait-based gender classification. IEEE
Transactions on Image Processing 18(8), 1905–1910 (2009)
Graph Clustering through Attribute Statistics
Based Embedding

Jaume Gibert1 , Ernest Valveny2 , Horst Bunke3 , and Luc Brun1


1 École Nationale Supérieure d'Ingénieurs de Caen, ENSICAEN
Université de Caen Basse-Normandie, 6 Boulevard Maréchal Juin
14050 Caen, France
{jaume.gibert,luc.brun}@ensicaen.fr
2 Computer Vision Center, Universitat Autònoma de Barcelona
Edifici O Campus UAB, 08193 Bellaterra, Spain
3 Institute for Computer Science and Applied Mathematics, University of Bern,
Neubrückstrasse 10, CH-3012 Bern, Switzerland

Abstract. This work tackles the problem of graph clustering by an ex-


plicit embedding of graphs into vector spaces. We use an embedding
methodology based on occurrence and co-occurrence statistics of repre-
sentative elements of the node attributes. This embedding methodology
has already been used for graph classification problems. In the current
paper we investigate its applicability to the problem of clustering color-
attributed graphs. The ICPR 2010 Graph Embedding Contest serves us
as an evaluation framework. Explicit and implicit embedding methods
are evaluated in terms of their ability to cluster object images repre-
sented as attributed graphs. We compare the attribute statistics based
embedding methodology to explicit and implicit embedding techniques
proposed by the contest participants and show improvements in some of
the datasets. We then demonstrate further improvements by means of
different vectorial metrics and kernel functions on the embedded graphs.

1 Introduction

Clustering, or unsupervised learning, is a key concept in pattern recognition.


While the clustering of vectorial pattern representations has reached some level of
maturity, the clustering of graphs is still in its infancy [5]. This is due to a number
of difficulties that arise especially from the fact that many operations needed in a
clustering algorithm, although readily available for vectorial representations, do
not exist for graphs (or are at least extremely difficult to accomplish). Examples
are the computation of the mean of a set of graphs, or the operation of making
two graphs more similar to each other, as needed in kMeans clustering and self-
organizing nets, respectively. In order to overcome these problems, a number of
approaches have been proposed that relate the graph domain to vector spaces
where such operations are easier to perform and plenty of learning machinery


is available. Graph embeddings and graph kernels are the main paradigms. The
former explicitly assign a feature vector to each graph while the latter implicitly
map each graph in a feature space and compute the corresponding scalar product.
The relation between graph embeddings and graph kernels is clear since, given
an explicit embedding, any kernel function on vectors also defines a graph kernel.
We have previously proposed an explicit embedding approach which is based
on extracting features describing the occurrence and co-occurrence of node label
representatives in a given graph [4]. Its efficiency and good performance, when
compared to state of the art methodologies for graph classification, have been
empirically demonstrated. In the current paper, we aim at an evaluation of this
embedding methodology for graph clustering. To that end, we make use of the
ICPR 2010 Graph Embedding Contest [3]. This contest was organized in order
to provide a framework for direct comparison between embedding methodologies
for the purpose of graph clustering. Three object image datasets were chosen and
converted into graphs, divided into a training and a test set. The participants
also received a code with which they could assess their own methodologies in
terms of a clustering measure. Object images were first segmented into different
regions and a region adjacency graph was constructed. Each node representing a
region was attributed with the corresponding relative size and the average RGB
color components, while edges remained unattributed.
For the contest, four algorithms were submitted, three explicit embedding
methods and an implicit one. Jouili and Tabbone build feature vectors whose
distance distribution respects as much as possible that of the corresponding
graphs. In particular, they assign a feature vector to every graph by considering
the eigenvectors of a positive semidefinite matrix regarding the dissimilarity of
graphs [6]. Riesen and Bunke map every graph to a feature vector whose com-
ponents are the edit distances to a predefined set of prototypes [8]. Their goal
is thus to characterize graphs as how they are located with respect to some
key graphs in the graph space. Luqman et al. search for particular subgraph
structures present in the original graphs. They encode relevant information by
quantizing node and edge attributes via the use of fuzzy intervals [7]. Finally, the
implicit methodology proposed by Osmanlıoğlu et al. maps each node of each
graph to a vector space by means of the caterpillar decomposition, and com-
putes a kernel value between two given graphs in terms of a point set matching
algorithm based on the Earth Mover’s distance [2].
The contribution of the work described in this paper is to evaluate the novel
embedding methodology of [4], which was not yet available at the time of the
ICPR contest, for the task of clustering and compare it to existing approaches.
Besides, the mentioned embedding methodology has been re-formulated in such
a way that it can handle color-based attributed graphs. We will show that, in
such a way, it constitutes an attractive addition to the set of graph cluster-
ing tools currently available. For the purpose of self-completeness, Section 2 of
the paper provides a brief introduction to graph embedding using node label
occurrence and co-occurrence statistics. Next, Section 3 describes in detail the

experimental evaluation and shows how to gain further improvements along this
line of research. Finally, Section 4 draws conclusions from this work.

2 Attribute Statistics Based Embedding

The main idea of the embedding methodology used in this work is based on
counting the frequency of appearance of the node labels in a given graph and also
on the co-occurrence of pairs of node labels in conjunction with edge linkings.
The fact that node labels might not be discrete (as is the case here) calls for
a discretization of the node labelling space and, thus, the selection of
a set of representatives. Under the proposed approach, the features are obtained
by computing statistics on these representatives in terms of those nodes that
have been assigned to each of them. Based on how this assignment from nodes
to representatives is made we have two formulations of the embedding approach.

2.1 Hard Assignment

Assume a set of graphs G = {g_1, ..., g_N} is given, each being a four-tuple
g_i = (V_i, E_i, μ_i, ν_i) consisting of a set of nodes V_i, a set of edges E_i ⊆ V_i × V_i, and
the corresponding labelling functions μ_i and ν_i. In this work, nodes are labelled
with RGB values, thus the labelling function codomain is always the set [0, 255]^3
(the relative size attribute is disregarded). Edges remain unlabelled.
From the set of all node labels of all graphs in G we select some representatives
W = {w1 , . . . , wn } (see Section 2.3). Given a graph g, each node v ∈ V is assigned
to the closest representative by

λ_h(v) = argmin_{w_i ∈ W} ‖μ(v) − w_i‖_2 .   (1)

Then we extract unary features as occurrences of representatives in the graph


by
Ui = #(wi , g) = | {v ∈ V | wi = λh (v)} |. (2)
Also, co-occurrence features between two representatives are defined as

Bij = #(wi ↔ wj , g)
= | {(u, v) ∈ E | wi = λh (u) ∧ wj = λh (v)} |. (3)

Both the unary features U_i and the binary ones B_ij are eventually arranged in a
feature vector.
In particular, note that what this formulation is proposing is to build a his-
togram of the presence of specific features in the graphs. In the present case,
we aim at evaluating the presence of each color in every graph, and also the
presence of the neighbouring relations of all colors in the graphs. In Section 2.4
we discuss the connections of this approach to other existing graph embedding
methodologies.
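As an illustration only (this is not the authors' implementation; numpy and all variable names are our own assumptions), the hard-assignment embedding of a single graph can be sketched in Python as follows:

import numpy as np

def hard_embedding(node_labels, edges, representatives):
    """Occurrence / co-occurrence embedding of one graph (hard assignment).

    node_labels     : (|V|, 3) array of RGB node attributes
    edges           : list of (u, v) node-index pairs (undirected, unlabelled)
    representatives : (n, 3) array of node-label representatives (e.g. kMeans centres)
    Returns the concatenation of the unary features U_i (Eq. 2) and the
    binary features B_ij (Eq. 3).
    """
    n = len(representatives)
    # Eq. (1): assign every node to its closest representative
    dists = np.linalg.norm(node_labels[:, None, :] - representatives[None, :, :], axis=2)
    assign = dists.argmin(axis=1)

    # Eq. (2): unary features = occurrences of each representative
    U = np.bincount(assign, minlength=n).astype(float)

    # Eq. (3): binary features = co-occurrences along edges
    B = np.zeros((n, n))
    for u, v in edges:
        i, j = assign[u], assign[v]
        B[i, j] += 1
        if i != j:
            B[j, i] += 1  # keep B symmetric for an undirected edge

    # arrange the unary and (upper-triangular) binary features in one vector
    iu = np.triu_indices(n)
    return np.concatenate([U, B[iu]])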

2.2 Soft Assignment

Assigning nodes to representatives in a hard fashion might lead to weak de-


scriptions because the graph extraction process is usually noisy. However, the
embedding methodology used in this work is adaptable for a fuzzy assignment
from nodes to representatives which might correct such situations. In particular,
each node is represented by a set of probabilities

λs (v) = (p1 (v), . . . , pn (v)), (4)

where pi (v) = P (v ∼ wi ) is the probability of node v being represented by wi .


The unary features are then the addition of all probabilities for all nodes in the
graph g:
U_i = #(w_i, g) = Σ_{v ∈ V} p_i(v).   (5)

The fuzzy version of the binary features needs to regard the transition probabil-
ities from one node to the other and thus is defined as

B_ij = #(w_i ↔ w_j, g) = Σ_{(u,v) ∈ E} ( p_i(u) p_j(v) + p_j(u) p_i(v) ).   (6)
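An analogous sketch for the soft variant (again purely illustrative; the membership probabilities are assumed to be given, e.g. by fuzzy kMeans or by the color-naming model of Section 2.3):

import numpy as np

def soft_embedding(node_probs, edges):
    """Soft attribute-statistics embedding of one graph.

    node_probs : (|V|, n) array; row v holds p_1(v), ..., p_n(v) of Eq. (4)
    edges      : list of (u, v) node-index pairs
    Returns the unary features of Eq. (5) and the binary features of Eq. (6).
    """
    n = node_probs.shape[1]

    # Eq. (5): U_i = sum over nodes of p_i(v)
    U = node_probs.sum(axis=0)

    # Eq. (6): B_ij = sum over edges of p_i(u) p_j(v) + p_j(u) p_i(v)
    B = np.zeros((n, n))
    for u, v in edges:
        outer = np.outer(node_probs[u], node_probs[v])
        B += outer + outer.T

    iu = np.triu_indices(n)
    return np.concatenate([U, B[iu]])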

2.3 Representative Set Selection

One of the key issues of the embedding methodology is the selection of the set
of representatives for the node labels. We can make use of generic clustering
approaches independent to the domain such as kMeans, or we can use domain-
specific approaches. In addition to using kMeans, we propose in this work a
color-based approach that tries to adapt the set of representatives to the inherent
RGB structure of the node labelling space.

Generic Approaches. In order to select representatives of the node labels for


the hard assignment version of the proposed embedding, we use the kMeans
clustering algorithm for different values of the parameter k. This representation
will be referred to as Hard kM. In Fig. 1(a) we show a sample of node labels
with their corresponding original color. Next to it, in Fig. 1(b), the distribution
of the k = 10 clusters after applying the kMeans algorithm is shown.
For the soft assignment version we use fuzzy kMeans to select the set of repre-
sentatives, and the probability of each node to belong to a certain representative
is defined to be inversely proportional to the Euclidean distance between the
considered node and the representative. We will refer to this representations as
Soft kM.
These two configurations not only depend on how the representative elements
are selected, but also on the number of them. This parameter needs to be vali-
dated using the training set.
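As a hedged sketch of this generic selection step (scikit-learn is an assumption; the paper does not specify the implementation), the representatives for the Hard kM variant could be obtained as follows:

import numpy as np
from sklearn.cluster import KMeans

def select_representatives(training_graphs, k):
    """Pool the RGB node labels of all training graphs and cluster them.

    training_graphs : iterable of (|V|, 3) arrays of node labels
    Returns the (k, 3) array of cluster centres used as representatives W.
    """
    all_labels = np.vstack(list(training_graphs))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_labels)
    return km.cluster_centers_

# the number k of representatives is then chosen as the value giving the best
# clustering measure on the training set, as described in the text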

(a) Original Color (b) kMeans Clusters (c) Color Naming

Fig. 1. Distributions of the graphs’ nodes in the RGB space (best seen in color). (a)
Original color of each node. (b) kMeans clusters for k = 10. (c) Color naming distri-
bution.

Color-Based Approaches. Due to the spherical arrangement of its clusters,


kMeans does not really account for the color distribution of the original node
values, grouping for instance node labels of different color into the same cluster.
This problem calls for a way of selecting representatives that can adapt in
a more accurate manner to the real RGB distribution of the whole set of node
attributes (Fig. 1(a)). To do so, we have adopted a color naming approach for
which each node label, i.e. each point in the RGB space, is assigned to one of
the eleven basic colors of the color naming theory.
In particular, we adopted the methodology proposed in [1], where each RGB
point is automatically assigned to every color in the color naming scheme with
a certain probability. Each node is thus represented with a set of probabilities
allowing the use of the soft assignment version of the embedding approach de-
scribed here. This version is referred to in the text as Soft Color. For the hard
assignment version, for every node in a given graph, we just pick the color that
produces the highest probability value and refer to it as Hard Color. Fig. 1(c)
shows the resulting 11 clusters after the assignment.

2.4 Relation to other Approaches

An interesting observation regarding the proposed methodology is


its connections to other graph characterization approaches. In particular, an ap-
pealing consideration is that of its similarity with fingerprint characterization of
molecular structures [9]. In this domain, molecules are represented as histograms
of the presence of particular subgraph structures, where such substructures are
selected based on prior chemical knowledge. The methodology used in this work
is related to this one in the sense that, after node attribute discretization, it looks
for particular substructures in the graph representations, such as node appearances
and node-edge-node walks.
On the other hand, but strictly related to the former, the explicit embedding of
graphs by the presented approach can be reduced to characterize graphs based on
walks of length 0 and walks of length 1 with respect to their labelling information.
In that sense, it might also be connected to the so-called family of random walk
kernels [10].

Table 1. C-index on the test sets under the Euclidean distance: lower index values
indicate better clustering results. Comparison with the contest participants. The best
results are shown bold face.

Embedding ALOI COIL ODBK Geometric Mean


Osmanlıoğlu et al. 0.088 0.067 0.105 0.085
Jouili and Tabbone 0.136 0.199 0.138 0.155
Riesen and Bunke 0.048 0.128 0.132 0.093
Luqman et al. 0.379 0.377 0.355 0.370
Hard kM 0.080 0.160 0.070 0.096
Soft kM 0.068 0.136 0.058 0.081
Hard Color 0.067 0.143 0.061 0.083
Soft Color 0.056 0.121 0.051 0.070

3 Experimental Evaluation
The three object image datasets that were used in the contest are the ALOI,
COIL and OBDK collections. Each of them is representing object images under
different angles of rotation and illumination changes. For more details on the
datasets, we refer to the contest report [3]. We recall here that a training and
a test set are available for each dataset. We use the training set to validate
the parameters (number of representative elements) that are eventually used for
processing the test set.
Every approach is assessed by computing the C-index clustering measure, and
approaches are ranked in terms of the geometric mean of the results on the three
datasets. When the embedding is explicit, the clustering index is computed based
on the Euclidean distances of the vectorial representations of graphs. When an
implicit formulation is given, distances are computed according to the following
formula
d_ij = √( k_ii + k_jj − 2 k_ij )   (7)

where kij is the kernel value between graphs gi and gj . Under a kernel func-
tion, graphs are implicitly mapped to a hidden feature space where the scalar
product is calculated. Formula (7) is the Euclidean distance between the corre-
sponding vectors in such a feature space. Results of the proposed methodologies
in comparison with the contest participants are shown in Table 1.
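For the implicit methods, the conversion of a kernel matrix into pairwise distances via Eq. (7) can be sketched as follows (illustrative only; the kernel matrix K is assumed to be given):

import numpy as np

def kernel_to_distances(K):
    """Distances in the implicit feature space induced by a kernel matrix K.

    K : (N, N) symmetric positive semidefinite kernel matrix, K[i, j] = k_ij
    Returns D with D[i, j] = sqrt(k_ii + k_jj - 2 k_ij), i.e. Eq. (7).
    """
    diag = np.diag(K)
    sq = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.clip(sq, 0.0, None))  # clip guards against tiny negative values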
As expected, the Soft approaches obtain better results than the hard ones, and
the color-based versions improve on the generic ones. Compared to the participants'
methods, the proposed embedding approach ranks second on two databases and
first on the third one. This leads to the best geometric mean among all tested
methods. Moreover, let us mention the high efficiency of our approach which
arises from the fact that we base our embedding method on very simple features
with a fast computation.
As already said before, the contest clustering measures are computed based
on the Euclidean distances of the embedding representations. In other works,
the proposed embedding methodology has been shown to perform better under
different vectorial metrics [4]. We thus refine our results by computing the C-
index for clustering validation under the L1 and χ2 distances. Results of these

Table 2. C-index under different distances and under the kχ2 kernel on the test sets.
The best results are shown bold face.

ALOI COIL
Distance / Kernel
Soft kM Soft Color Soft kM Soft Color
L2 0.073 0.056 0.136 0.121
L1 0.064 0.060 0.130 0.110
χ2 0.031 0.032 0.066 0.064
kχ2 0.088 0.083 3.10e-08 9.04e-07

ODBK Geometric Mean


Soft kM Soft Color Soft kM Soft Color
L2 0.056 0.051 0.083 0.070
L1 0.063 0.061 0.081 0.074
χ2 0.033 0.037 0.041 0.042
kχ2 8.67e-10 0.097 1.33e-06 1.94e-03

experiments for the Soft versions are shown on the first three rows of Table 2
(Hard versions are discarded since they do not show as good a performance as
the Soft ones). The χ2 distance is providing the best results, ranking best on all
datasets, even when compared to the contest participants (we, however, want
to point out that a direct comparison to the results obtained by the contest
participants would not be fair since we do not know how their algorithms would
perform under other metrics). Interestingly, the χ2 distance extracts the best
out of the Soft kM versions since it outperforms the Soft Color one in two of
the three datasets, which does not happen when using the two other metrics.
Finally, in order to relate our methodology to those that provide an implicit
embedding of graphs we compute kernel values between embedded graphs as
k_d(g_1, g_2) = exp( −(1/γ) d(φ(g_1), φ(g_2)) ),  γ > 0   (8)

where φ(gi ) is the vectorial representation of the graph gi under the described
embedding methodology, and d is the χ2 metric (L2 or L1 could also be used
but χ2 is the one providing the best results when clustering under metrics as
discussed above). Distance values for the C-index computation are calculated
using Eq. (7). The γ parameter is also validated using the training set. Results
for the Soft versions are shown on the last row of Table 2.
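The following illustrative helper combines one common form of the χ² distance with the kernel of Eq. (8); the small constant eps guarding against division by zero is our own addition, not part of the paper:

import numpy as np

def chi2_distance(x, y, eps=1e-10):
    """One common form of the chi-squared distance between non-negative vectors."""
    return float(np.sum((x - y) ** 2 / (x + y + eps)))

def chi2_kernel(x, y, gamma):
    """Eq. (8): k_d(g1, g2) = exp(-(1/gamma) * d(phi(g1), phi(g2))), gamma > 0."""
    return float(np.exp(-chi2_distance(x, y) / gamma))

# gamma is validated on the training set; distances for the C-index are then
# obtained from the kernel values through Eq. (7)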
Although the results for the ALOI database worsen when using the kernel
values for both versions of the embedding, the most significant point to highlight
from this table is that we obtain almost perfect separation indexes for the COIL
dataset under the two Soft versions and also for the ODBK under the Soft kM
one. This makes the geometric means decrease drastically and demonstrates that
the embedding methodology we propose in this work is a strong approach
for graph clustering.

4 Conclusions

In this work, we have evaluated the color-based explicit graph embedding


methodology that accounts for statistics on node label representatives in terms
of clustering performance. We have compared it to the approaches that were re-
ported in the ICPR 2010 Graph Embedding Contest and shown that it performs
very favorably. Additional improvements are gained by evaluating the embed-
ding method under different metrics and also by the use of kernel functions on
the resulting vectors, leading to almost perfect separation results in two of the
three contest datasets. Future work on this research line should be directed to
assess whether object-wise color quantization provides a better characterization
of the color space where the set of node labels can be found.
As a final remark, the authors find it of paramount importance that further
works appear along the same line as the present one, where different pattern recognition
methodologies are brought together and compared to one another within a
unified and clear framework. In this sense, we want to acknowledge the ICPR
Graph Embedding Contest organizers for their valuable work.

References
1. Benavente, R., Vanrell, M., Baldrich, R.: Parametric fuzzy sets for automatic color
naming. J. Optical Society of America A 25(10), 2582–2593 (2008)
2. Demirci, M.F., Osmanlıoğlu, Y., Shokoufandeh, A., Dickinson, S.: Efficient many-
to-many feature matching under the l1 norm. Computer Vision and Image Under-
standing 115(7), 976–983 (2011)
3. Foggia, P., Vento, M.: Graph Embedding for Pattern Recognition. In: Ünay, D.,
Çataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 75–82. Springer,
Heidelberg (2010)
4. Gibert, J., Valveny, E., Bunke, H.: Graph embedding in vector spaces by node
attribute statistics. Pattern Recognition 45(9), 3072–3083 (2012)
5. Jain, A.K.: Data Clustering: 50 years beyond K-means. Pattern Recognition Let-
ters 31(8), 651–666 (2010)
6. Jouili, S., Tabbone, S.: Graph Embedding Using Constant Shift Embedding. In:
Ünay, D., Çataltepe, Z., Aksoy, S. (eds.) ICPR 2010. LNCS, vol. 6388, pp. 83–92.
Springer, Heidelberg (2010)
7. Luqman, M.M., Lladós, J., Ramel, J.-Y., Brouard, T.: A Fuzzy-Interval Based
Approach for Explicit Graph Embedding. In: Ünay, D., Çataltepe, Z., Aksoy, S.
(eds.) ICPR 2010. LNCS, vol. 6388, pp. 93–98. Springer, Heidelberg (2010)
8. Riesen, K., Bunke, H.: Graph Classification and Clustering Based on Vector Space
Embedding. World Scientific (2010)
9. Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., Vert, J.-P.: Graph Kernels for Molec-
ular Structure-Activity Relationship Analysis with Support Vector Machines. Jour-
nal of Chemical Information and Modelling, 939–951 (2005)
10. Gärtner, T., Flach, P.A., Wrobel, S.: On graph kernels: Hardness results and
efficient alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003.
LNCS (LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
Graph-Based Regularization of Binary
Classifiers for Texture Segmentation

Cyrille Faucheux1,3, , Julien Olivier2,1 , and Romuald Boné2,1


1
Université François Rabelais de Tours, Laboratoire d’Informatique
64, Avenue Jean Portalis, 37200 Tours, France
2
École Nationale d’Ingénieurs du Val de Loire
3, Rue de la Chocolaterie, BP 3410, 41034 Blois CEDEX, France
3
Cosm’O Laboratory
100, Rue de Suède, 37100 Tours, France
[email protected],
{julien.olivier,romuald.bone}@univ-tours.fr

Abstract. In this paper, we propose to improve a recent texture-based


graph regularization model used to perform image segmentation by in-
cluding a binary classifier in the process. Built upon two non-local image
processing techniques, the addition of a classifier brings to our model
the ability to weight texture features according to their relevance. The
graph regularization process is then applied on the initial segmentation
provided by the classifier in order to clear it from most imperfections. Re-
sults are presented on artificial and medical images, and compared to a segmentation
algorithm based on active contours driven by binary classifiers, highlighting
the increased generality and accuracy of our model.

Keywords: image segmentation, graph regularization, Haralick texture


features, neural networks.

1 Introduction

Developed to overcome some traditional algorithm limitations, non-local image


processing approaches, especially graph-based ones [1,2] have recently gained a
lot of interest due to the increased flexibility of the data structures involved.
In a previous paper [3], an image segmentation technique that combines two
non-local approaches to image processing was proposed. The first one takes ad-
vantage of windowed Haralick texture features [4] in order to work with pixel
characteristics with a higher level of abstraction than the pixel itself, and there-
fore more meaningful than raw gray-level intensities. The second approach actu-
ally carries out the segmentation task using a graph regularization process that
relies on a criteria inspired by the work of Chan and Vese [5] to separate the two
textures and partition the image.

This work was partially supported under a research grant of the ANR (1241/2009).


One drawback of this segmentation technique comes from applying the Chan
and Vese criteria on texture features. Depending on the type of images (acquisi-
tion method, content. . . ), some features may be irrelevant, or more relevant than
others. This fact cannot be automatically handled by the Chan and Vese crite-
ria, and features must be manually selected in order to obtain the best results.
Moreover, unsupervised weighting of the features is impossible.
The aim of this work is to overcome this limitation while keeping the two
non-local approaches. To do so, we propose to combine a supervised binary
classifier with the graph regularization process used previously. Implemented as
a neural network, the classifier takes care of the texture feature relevancy issue by
providing an initial classification of the pixels. The graph regularization process
from [3], which is designed to handle any multivariate feature, is this time applied
on an univariate one, the output of the classifier, in order to correct classification
errors and produce a smoothed segmentation.

2 Methodology
2.1 Haralick Texture Features
In order to deal with complex vision problems, the use of pixel features with
a higher level of abstraction than raw gray level intensities has now become
almost mandatory. Among all texture characterization techniques available in
the literature, we have decided to use Haralick features [4]. Several works have
shown their efficiency, especially when applied to medical images [6], and can
be easily extended to 3D images [7], two of our final goals. Only 10 of the 14
features proposed by Haralick have been used, correlation based ones having
been left out due to numerical instability.
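As a hedged illustration of windowed, co-occurrence-based texture description (this sketch relies on scikit-image's GLCM utilities with the spelling of recent versions, and it uses only a handful of properties rather than the exact set of 10 Haralick features of the paper):

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def window_texture_features(window, levels=32):
    """Compute a few GLCM-based texture descriptors for one image window.

    window : 2D uint8 array; gray levels are re-quantized to `levels` bins
             to keep the co-occurrence matrix small
    """
    q = (window.astype(np.float64) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    props = ['contrast', 'dissimilarity', 'homogeneity', 'energy', 'ASM']
    # average each property over the chosen distances and angles
    return np.array([graycoprops(glcm, p).mean() for p in props])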

2.2 Texture Classification Using Neural Network


The two most studied supervised classifiers are the support vector machine
(SVM) [8] and the multi-layer perceptron (MLP) [9]. While the SVM are com-
monly recognized to be more accurate than the MLP [10], their original design
limits them to binary classification problems. Although this is not an issue for the
current work, turning this method into a multiclass segmentation one is planned
by the authors. The neural network based approach was therefore selected.
The use of a supervised classifier requires the definition of a training set
composed of classified texture samples. Depending on the task to be performed,
two training set definition schemes are possible.

Interactive Scheme. It is mainly intended to be used for single image seg-


mentation tasks. The user is asked to provide a training set by manually tagging
several regions of the image with the expected classes. Given those data, the
neural network can be trained and the whole segmentation process conducted.
If the result is not as accurate as expected, the user can consolidate the training
set by tagging new pixels.

Ground Truth Scheme. If the algorithm is used to regularly process images


from the same type, for example produced with the same acquisition technique,
a pre-configured MLP can be built upstream and stored in order to be used re-
peatedly, saving thus some computation time. The training set is then composed
from ground truths for one or several images provided by experts.
Once the training set X is defined, it is divided into two subsets Xt and Xe ,
with |Xt | = 2|Xe |. Xt will be used by the training algorithm to compute the
weights of the neural network, while Xe is used to evaluate its generalization
accuracy. The one with the highest classification rate is selected at the end of
the training process.
The MLP used in this paper are composed of 10 inputs – the 10 selected
features – and a single output. Each neuron uses a sigmoid activation function
and is connected to a bias neuron. The two training subsets can also be used
to select the best MLP among several configurations (number of hidden layers,
number of hidden neurons per layer, learning-rate. . . ). During our tests, we found
that a MLP with a single hidden layer composed of 3 neurons was enough to
perform the binary segmentation based on the 10 texture features.
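A minimal sketch of such a classifier (scikit-learn is used here instead of the authors' own implementation; apart from the 2:1 split and the 10-3-1 sigmoid architecture stated above, all settings are assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_texture_classifier(X, y, seed=0):
    """Train an MLP with 10 texture-feature inputs, one hidden layer of 3
    sigmoid neurons and a single probabilistic output."""
    # |Xt| = 2 |Xe|: two thirds for training, one third for evaluation
    Xt, Xe, yt, ye = train_test_split(X, y, test_size=1/3, random_state=seed)
    mlp = MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
                        max_iter=2000, random_state=seed).fit(Xt, yt)
    print('generalization accuracy:', mlp.score(Xe, ye))
    return mlp

# no decision rule is applied afterwards: the classifier output used as f0 is
# the class-1 probability, e.g. mlp.predict_proba(features)[:, 1]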

2.3 Regularization of a Binary Graph Function


The classifier is not used to produce an absolute classification: no decision rule is
applied on its output. Therefore, the information it returns for an input pattern
only represents the likelihood of belonging to one of the two classes. Depending
on the content of an image and on the texture features used, the overall result
on an image can be quite rough.
The purpose of a regularization process is to perform a bi-criteria optimiza-
tion: produce a smooth version of a function while keeping it close-enough to
its original version. By applying such an algorithm to the classifier's output, we
obtain a revised version of it, in which imperfections are corrected.
One key element of this work is to take advantage of non-local image process-
ing techniques. Beyond the use of texture features, we want to be able to process
pixels that might not be neighbors on the image. It is done by using a graph
structure, which allows to efficiently express relations between any pair of pixel
in the image.
The graph regularization method we use has been proposed in [3]. It is de-
signed to perform image segmentation using texture features, but is subject to
the limitations exposed in the introduction. For this work, we will apply it on
an image composed of a single numeric feature: the output of the classifier.
A weighted graph G = (V, E, w) is composed of a set V = {V1 . . . VN } of N
nodes and a set E ∈ V × V of edges. Two nodes u and v that are connected by
an edge (u, v) ∈ E are said to be neighbors. This relation is noted u ∼ v. The
graphs used in our approach are simple (undirected, without loop, with at most
one edge between any two nodes). w : E → R is a function that associates a real
value w(e) (also noted wu,v ) to each edge e = (u, v) ∈ E.
The type of graph used by the regularization algorithm is called a similarity
graph. This is a weighted graph where the weighting function is a similarity

measure of each pairs of nodes. Multiple ways to build such graph exist, and
correspond to three successive steps: choosing the node set, the edge set, and
the similarity measure.
For the choice of the node set, the most obvious approach is to build a pixel
adjacency graph (each pixel of the image is represented by a node), but more
advanced methods use clustering or segmentation algorithms to already group
similar pixels together, building a region adjacency graph [2,11]. While the lat-
ter approach can greatly reduce the size of the node set, obtaining such pre-
segmentation when working with texture-features is a vast subject, and will
therefore not be explored in this paper.
From the point of view of the heuristic used to build the edge set, the best
known types of graphs are: the fully connected graph, the ε-neighborhood graph and
the k-nearest neighbors graph (see [12] for further details). In this paper, it has been
decided to work with the ε-neighborhood graph, where two nodes are connected
together if the distance between them is below a defined threshold ε. The distance
involved in this process is application dependent, and any combination of charac-
teristics (gray level intensity, coordinates, ...) can be used. The one used here is
the Manhattan distance applied on the pixel coordinates with ε = 1, which
yields 4-neighbor graphs.
The similarity measure used by the weighing function is not necessarily linked
to the distance used in the previous step, but is also application dependent. We
chose to use a constant value of wu,v = 1 associated to each edge (u, v) ∈ E,
which is enough to allow the regularization process to take place.
The regularization process is actually carried out by solving an optimization
problem, which consists in finding a function f : V → [0; 1] that minimizes the
following energy:

E(f, f0, λ) = Rw(f) + λ Σ_{u ∈ V} g(f0, u) f(u) ,   (1)

Rw (f ) being the regularization term, whose purpose is to keep the f function


as smooth as possible, while g(f0 , u) is the fitting criteria (inspired by the work
of Chan and Vese [5]), which tries to keep the f function as close as possible to
f0, the output of the classifier:
Rw(f) = Σ_{u ∈ V} Σ_{v ∼ u} √(w_uv) |f(v) − f(u)| ,   (2)

g(f0 , u) = (c1 − f0 (u))2 − (c2 − f0 (u))2 , (3)


where c1 and c2 are the average value of f0 inside (f ≥ 0.5) and outside (f <
0.5) the segmented region. λ expresses the trade-off between regularity of the
segmentation and fidelity to the original one provided by the classifier. Due to
the simple nature of the "feature vector" actually involved (a single real value),
equation (3) presents a simplified version of the fitting criteria proposed in [3]
which was designed to work on texture-features.
The optimization problem is then approximated by an iterative Gauss-Jacobi
algorithm. The reader can refer to [3] for further details.
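To give an idea of the bi-criteria trade-off, the following sketch minimizes Eq. (1) with a schematic projected-subgradient iteration; it is not the Gauss-Jacobi scheme of [3], and the step size and iteration count are arbitrary assumptions:

import numpy as np

def regularize(f0, edges, weights, lam=1.0, tau=0.05, n_iter=500):
    """Schematic minimization of Eq. (1).

    f0      : 1D array in [0, 1], the classifier output per node
    edges   : list of (u, v) pairs; weights : per-edge w_uv
    """
    f = f0.copy()
    sw = np.sqrt(weights)
    for _ in range(n_iter):
        inside = f >= 0.5
        c1 = f0[inside].mean() if inside.any() else 0.0
        c2 = f0[~inside].mean() if (~inside).any() else 0.0
        # fitting term of Eq. (3)
        grad = lam * ((c1 - f0) ** 2 - (c2 - f0) ** 2)
        # subgradient of the regularization term of Eq. (2)
        for (u, v), w in zip(edges, sw):
            s = np.sign(f[u] - f[v])
            grad[u] += w * s
            grad[v] -= w * s
        f = np.clip(f - tau * grad, 0.0, 1.0)
    return f >= 0.5  # final binary segmentation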

3 Experimental Results
The results presented through this section correspond to a measure of the par-
tition distance [13], which is equivalent to the percentage of misclassified pixels
according to a ground truth. The ground truth we are referring to are either
absolute ones (for artificial images) or obtained by experts (for medical images).
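For the two-class case used here, the partition distance reduces to the fraction of disagreeing pixels, taken over the best of the two possible label matchings (an illustrative helper; see [13] for the general definition):

import numpy as np

def partition_distance(segmentation, ground_truth):
    """Percentage of misclassified pixels for a binary segmentation,
    taking the best of the two possible label assignments."""
    seg = np.asarray(segmentation, dtype=bool).ravel()
    gt = np.asarray(ground_truth, dtype=bool).ravel()
    err = np.mean(seg != gt)
    return 100.0 * min(err, 1.0 - err)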

3.1 Artificial Images


In order to validate our method, is has been first applied on artificial images,
with an absolute ground truth available.
A set of 16 images was generated, each containing a textured object with a
random shape drawn over a textured background. Two examples of such textures
are shown in figure 1(b) and 1(c), and the resulting textured object in figure 1(d).

(a) (b) (c) (d)


Fig. 1. (a): random binary object, (b) & (c): examples of textures, (d): resulting tex-
tured object

For each image, a classifier is trained using the ground truth (the binary mask
used to compose the textures, see figure 1(a)). Then, each image is corrupted by
adding white Gaussian noise of varying standard deviation, and processed by the
method for different combinations of parameters (λ and number of iterations).
In order to illustrate the benefit of applying a graph regularization algorithm
to the output of a binary classifier, the results of our method are compared
with the ones obtained by thresholding the raw output of the same classifier.
Our method is also compared to the texture-based graph regularization process
proposed in [3]. Table 1 presents the results obtained for the three methods for
noise level (standard deviation) 8.
Because of the use of a classifier that relies on eager learning, handling high
noise levels is impossible since such level of corruption has a high influence on
the texture features. Results from this test are therefore exposed considering
noise levels 2, 4, 6, 8, 10 and 12.
By comparing the values from the second and fourth column of table 1, we
can see that the application of the regularization process on the output of the
classifier greatly improves the segmentation quality. On the whole test set, it
correctly classifies up to 13.8% more pixels, with an average improvement
of 4.3%. Compared to the texture-based graph regularization method from [3]
(third column), our new approach shows an improvement of the segmentation

Table 1. Results (partition distance) for the artificial test set for noise level 8

Image Classifier Texture-based graph Our


only regularization method method
1 17.8% 10.2% 6.9%
2 15.1% 7.1% 10.0%
3 15.0% 9.5% 6.9%
4 8.0% 5.4% 4.1%
5 4.1% 5.3% 2.2%
6 3.9% 6.8% 3.0%
7 7.3% 8.2% 4.5%
8 11.2% 5.4% 4.6%
9 6.3% 4.8% 3.7%
10 1.3% 2.6% 0.9%
11 17.3% 14.4% 15.2%
12 10.2% 5.2% 4.0%
13 1.8% 3.7% 1.4%
14 2.7% 4.2% 2.1%
15 11.9% 6.9% 4.3%
16 1.2% 3.5% 1.2%
Average 8.4% 6.5% 4.7%

quality of 1.9%, and requires neither a manual selection of relevant texture
features nor the definition of an initialization. Moreover, such results are obtained
with significantly less regularization iterations: less than 500 for our new method
against more than 1000 for the one from [3].

3.2 Medical Images


Because medical imaging is an important source of computer vision challenges,
our method is tested against two medical imaging modalities: ultrasonography
and confocal microscopy.
Ultrasonography is a technique known to produce particularly noisy images.
Fortunately, this noise is also a source of textural information that can be used
by our method. Our test set is composed of three ultrasound images of the
skin where a lesion (a nevus) can be seen. The classifier is trained using the
ground truth of the first image, and the method applied on the two remaining
ones. Our method is compared to the texture-based graph regularization method
from [3] and to a binary classifier driven active contours technique [6] which uses
Haralick texture features, a MLP as a classifier, and the same training scheme.
The partition distance is computed for the three methods according to ground
truth. Table 2 presents the results for all three methods.
Confocal microscopy is very different to ultrasonography: it generates very
clear images. By processing such images, our goal is to illustrate the versatility
of our method. A confocal microscope produces a stack of images corresponding

Table 2. Results (partition distance) for the ultrasound test set

Image Our Texture-based graph Active


method regularization method contours
2 1.84% 2.20% 2.52%
3 2.0% 3.19% 4.12%

to different depth. The one used here is a view of the olfactory system of a bee,
the bright circular elements being the glomeruli (see figure 2(d)).
The classifier is trained using the ground truth of the first image of the stack,
then our method is applied on the following slices. Table 3 presents the partition
distance obtained on the test image for some of the slices.
Results for both modalities are illustrated in figure 2.

Table 3. Results (partition distance) for the confocal microscopy test set

Image Partition distance


2 7.07%
3 8.06%
4 9.05%
8 8.93%

(a) (b) (c)

(d) (e) (f)


Fig. 2. Top row: ultrasound image #3. Bottom row: confocal microscopy image #2.
(a) & (d): Original image. (b) & (e): Ground truth. (c) & (f): Segmentation produced
by our method.

4 Conclusion

In this paper, a recent texture-based graph regularization process has been im-
proved. A supervised binary classifier is included in the segmentation process in
order to take care of the selection of features. A learning set is first provided by

an expert in order to train the classifier, which is then used to provide a raw
segmentation of the image. A graph regularization process is finally applied on
this initial segmentation to produce the final one.
By including a supervised binary classifier in the segmentation process, we
enable it to automatically weight relevant texture features. Compared to the
initial algorithm, the benefits are multiple. First, irrelevant features do not need to
be manually deleted, which allows virtually any texture characterization
technique to be used without the need to worry about its usefulness or uselessness
regarding the type of image to be processed. The generic nature of the system has
also been improved: many more features can be added without the risk of minimizing
their contribution, since the training algorithm will sort them out. Finally,
no initialization has to be provided for each image. This fact might be argued
since a training set has to be provided, but by turning this algorithm into a
system configured for one or several pre-defined tasks, it can easily be rendered
parameter-less.
In order to increase the capacity of the process to perform any segmenta-
tion task, we intend to incorporate more texture descriptors (Haralick features
computed on different co-occurrence matrices, Gabor filters. . . ) in it.
During the design of this algorithm, we chose to implement the classifier as
an MLP because of its ability to handle multiclass problems. Research into trans-
forming this binary segmentation algorithm into a multiclass one is already in
progress.

References
1. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
2. Ta, V.-T., Lezoray, O., Elmoataz, A.: Graph Based Semi and Unsupervised Classi-
fication and Segmentation of Microscopic Images. In: IEEE International Sympo-
sium on Signal Processing and Information Technology, pp. 1160–1165 (December
2007)
3. Faucheux, C., Olivier, J., Bone, R., Makris, P.: Texture-based graph regularization
process for 2D and 3D ultrasound image segmentation. In: IEEE International
Conference on Image Processing (ICIP), pp. 2333–2336 (September 2012)
4. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Clas-
sification. IEEE Transactions on Systems, Man, and Cybernetics 3(6), 610–621
(1973)
5. Chan, T.F., Sandberg, B.Y., Vese, L.A.: Active Contours without Edges for Vector-
Valued Images. Journal of Visual Communication and Image Representation 11(2),
130–141 (2000)
6. Olivier, J., Boné, R., Rousselle, J.-J., Cardot, H.: Active Contours Driven by Su-
pervised Binary Classifiers for Texture Segmentation. In: Bebis, G., et al. (eds.)
ISVC 2008, Part I. LNCS, vol. 5358, pp. 288–297. Springer, Heidelberg (2008)
7. Tesar, L., Shimizu, A., Smutek, D., Kobatake, H., Nawano, S.: Medical image analy-
sis of 3D CT images based on extension of Haralick texture features. Computerized
Medical Imaging and Graphics: The Official Journal of the Computerized Medical
Imaging Society 32(6), 513–520 (2008)

8. Vapnik, V.: The nature of statistical learning theory. Springer (1999)


9. Rumelhart, D., Hinton, G., Williams, R.: Learning internal representations by error
propagation. In: Parallel Distributed Processing: Explorations in the Microstruc-
ture of Cognition. Foundations, vol. 1 (1985)
10. Dal Moro, F., Abate, A., Lanckriet, G.R.G., Arandjelovic, G., Gasparella, P.,
Bassi, P., Mancini, M., Pagano, F.: A novel approach for accurate prediction of
spontaneous passage of ureteral stones: support vector machines. Kidney Interna-
tional 69(1), 157–160 (2006)
11. Trémeau, A., Colantoni, P.: Regions adjacency graph applied to color image seg-
mentation. IEEE Transactions on Image Processing 9(4), 735–744 (2000)
12. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4),
395–416 (2007)
13. Cardoso, J.S., Corte-Real, L.: Toward a generic evaluation of image segmentation.
IEEE Transactions on Image Processing 14(11), 1773–1782 (2005)
Hierarchical Annealed Particle Swarm
Optimization for Articulated Object Tracking

Xuan Son Nguyen, Séverine Dubuisson, and Christophe Gonzales

Laboratoire d’Informatique de Paris 6 (LIP6/UPMC)


4 place Jussieu, 75005 Paris, France

Abstract. In this paper, we propose a novel algorithm for articulated


object tracking, based on a hierarchical search and particle swarm opti-
mization. Our approach aims to reduce the complexity induced by the
high dimensional state space in articulated object tracking by decom-
posing the search space into subspaces and then using particle swarms
to optimize over these subspaces hierarchically. Moreover, the intelligent
search strategy proposed in [20] is integrated into each optimization step
to provide a robust tracking algorithm under noisy observation condi-
tions. Our quantitative and qualitative analysis both on synthetic and
real video sequences show the efficiency of the proposed approach com-
pared to other existing competitive tracking methods.

Keywords: particle filter, articulated object tracking, PSO.

1 Introduction
Tracking articulated structures with accuracy and within a reasonable time is
challenging due to the high complexity of the problem to solve. For this purpose,
various approaches based on particle filtering have been proposed. Among them,
one class addresses the complexity issue by reducing the dimensionality of the
state space. For instance, some methods add constraints (e.g., physical) to the
mathematical models [4, 13], to the object priors [7] or to their interactions with
the environment [11]. Relying on the basic assumption that some body part
movements are mutually dependent, some learning-based approaches [16, 19]
reduce the number of degrees of freedom of these movements.
Alternatively, a second class of methods has been proposed in the literature [5,
9, 12, 14, 17, 18] whose key idea is to decompose the state space into a set of small
subspaces where particle filtering can be applied: by working on small subspaces,
sampling is more efficient and, therefore, fewer particles are needed to achieve a
good performance. Finally, in the class of the optimization-based methods, the
approach is to optimize an objective function corresponding to the matching
between the model and the observed image features [3, 6, 8]. Recently, Particle
Swarm Optimization (PSO) has been reported to perform well on articulated
human tracking [10, 20]. Its key idea is to apply evolutionary algorithms inspired
from social behaviors observed in wildlife to make the particles evolve following
their own experience and the experience of the global population.


In this paper, our approach consists in decomposing the search space into
subspaces of smaller dimensions and, then, in exploiting the approach proposed
in [20] to search within these subspaces in a hierarchical order. A hierarchical
particle swarm optimization has also been introduced in [10]. The main difference
between this approach and ours is that we incorporate the sampling covariance
and the annealing factor into the update equation of PSO at each optimization
step to tackle the problem of noisy observations and cluttered background.
The paper is organized as follows. In Section 2, we briefly recall PSO. Section 3
presents the proposed algorithm. Section 4 reports the results of our experimental
evaluation. Finally, Section 5 gives some conclusions and perspectives.

2 Particle Swarm Optimization (PSO)

Let X denote the state space: our goal is to search for state x ∈ X that maximizes
a cost function f : X → R, with a ≤ x ≤ b. A swarm consists of N particles, each
one representing a candidate state of the articulated object. Denote by x_{(m)}^{(i)} the
ith particle at the mth iteration. x_{(m)}^{(i)} is decomposed into K (object) parts, i.e.,
x_{(m)}^{(i)} = {x_{(m)}^{(i),1}, ..., x_{(m)}^{(i),K}} ∈ X. Unlike evolutionary algorithms, to each particle
in PSO is assigned a velocity v_{(m)}^{(i)} = {v_{(m)}^{(i),1}, ..., v_{(m)}^{(i),K}} ∈ X and each particle
has the ability to memorize its best state computed so far, s^{(i)} = {s^{(i),1}, ..., s^{(i),K}} ∈ X.
Let s^g be the current global best state, i.e., s^g = Argmax{f(s^{(i)})}_{i=1}^{N}. The
evolution of the particles in PSO is described by the following equations:
v_{(m)}^{(i)} = w v_{(m-1)}^{(i)} + β1 r1 (s^{(i)} − x_{(m-1)}^{(i)}) + β2 r2 (s^g − x_{(m-1)}^{(i)})   (1)
x_{(m)}^{(i)} = x_{(m-1)}^{(i)} + v_{(m)}^{(i)}   (2)

where β1, β2 are constants, r1, r2 ∼ U(0, 1) are random numbers drawn from a
uniform distribution, w is the inertia weight and w v_{(m-1)}^{(i)} is the inertial velocity.
PSO has the ability to balance between the local and global search strategies
of particles by setting the appropriate values for constants β1 , β2 and inertia
weight w. A large inertia weight results in an exploration of the search space
(global search) while a small inertia weight limits the search around the globally
best particle (local search). The value of the inertia weight can be fixed as a
constant or adaptively changed throughout the search.
In the next section, we introduce our approach, inspired from PSO, and ded-
icated to articulated object tracking in cluttered environments.
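A compact sketch of the standard update of Eqs. (1) and (2) for a whole swarm (illustrative only; array shapes, bounds handling and parameter values are assumptions):

import numpy as np

def pso_step(x, v, s_best, g_best, w=0.7, beta1=1.5, beta2=1.5, rng=None):
    """One PSO iteration for a swarm of N particles in a D-dimensional space.

    x, v    : (N, D) positions and velocities
    s_best  : (N, D) per-particle best states; g_best : (D,) global best state
    Implements Eqs. (1) and (2).
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.uniform(0.0, 1.0, size=x.shape)
    r2 = rng.uniform(0.0, 1.0, size=x.shape)
    v = w * v + beta1 * r1 * (s_best - x) + beta2 * r2 * (g_best - x)
    return x + v, v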

3 Proposed Approach

We propose to exploit the hierarchical nature of the kinematic structure of the


articulated object to improve tracking. First, the state space of the target ob-
ject is decomposed into lower dimensional subspaces. Then, optimal states are

searched for in these subspaces in the hierarchical order of the kinematic struc-
ture using Partitioned Sampling (PS) [12]. These optimal states are then used
to constrain the search in the next subspaces in the hierarchical order.
At time t, let x_t^{(i),k} (resp. s_t^{(i),k}) denote the kth substate of the ith particle
x_t^{(i)} (resp. of the ith particle's best state s_t^{(i)}) and let s_t^{g,k} be the kth substate of
the global best state. Then, at the mth iteration, x_{t,(m)}^{(i)} = {x_{t,(m)}^{(i),1}, ..., x_{t,(m)}^{(i),K}},
v_{t,(m)}^{(i)} = {v_{t,(m)}^{(i),1}, ..., v_{t,(m)}^{(i),K}} and s_{t,(m)}^{(i)} = {s_{t,(m)}^{(i),1}, ..., s_{t,(m)}^{(i),K}}. We follow the
approach proposed in [20], except that the state and velocity update equations for
each subpart k are written as follows:
v_{t,(m)}^{(i),k} = r0 P_{(m-1)} + β1 r1 (s_t^{(i),k} − x_{t,(m-1)}^{(i),k}) + β2 r2 (s_t^{g,k} − x_{t,(m-1)}^{(i),k})   (3)
x_{t,(m)}^{(i),k} = x_{t,(m-1)}^{(i),k} + v_{t,(m)}^{(i),k}   (4)

P(m−1) = α0 ∗ P(m−2) , m ≥ 2, is the sampling covariance, with α0 a constant,


and P(0) is a covariance matrix whose diagonal elements are fixed with respect
to the model configuration parameters. We propose to compute factors β1 and
β2 at each iteration m using the annealing principle so that:
β1 = β2 = β0 βmax (βmax / βmin)^(−m/M)   (5)
where β0 , βmax , βmin are constants, 0 < β0 ≤ 1, and M is the maximal number
of iterations.
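The scalar schedules involved in Eqs. (3) and (5) can be sketched as follows (illustrative; the default numerical values of β0, βmax, βmin and α0 are placeholders, not the ones used in the experiments):

import numpy as np

def annealed_betas(m, M, beta0=1.0, beta_max=2.0, beta_min=0.5):
    """Eq. (5): beta1 = beta2 = beta0 * beta_max * (beta_max / beta_min) ** (-m / M)."""
    beta = beta0 * beta_max * (beta_max / beta_min) ** (-m / M)
    return beta, beta

def sampling_covariances(P0, alpha0, M):
    """Covariance schedule used by Eq. (3): P(m) = alpha0 * P(m-1), starting from P(0)."""
    P = [np.asarray(P0, dtype=float)]
    for _ in range(M):
        P.append(alpha0 * P[-1])
    return P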
By combining PSO and hierarchical search, our approach aims to increase
the tracking accuracy and to reduce the computational cost of the tracking
algorithm by integrating the benefits of both methods. First, the search efficiency
is improved by performing PSO within lower dimensional subspaces, thereby
increasing tracking accuracy. Second, since the search is performed in the same
way as PS, the number of particles required and thus the computational cost of
the tracking algorithm is greatly reduced. Our proposed Hierarchical Annealed
based Particle Swarm Optimization Particle Filter (HAPSOPF) is described in
Algorithm 1, where x̄ is the estimated state of the object at time slice t, w(., y)
is the cost function to be optimized by PSO, and y is the current observation.

4 Experimental Results
We compare our approach with APF [6], PSAPF [2], APSOPF [20] and HPSO
(i),k
[10]. The cost function w(xt,(m) , y) to be optimized by PSO measures how well
(i),k
a state hypothesis xt,(m) matches the true state w.r.t. the observed image y,
and is constructed using histogram and foreground silhouette [6]. An articulated
object is described by a hierarchy (a tree) of parts, each part being linked to its
parent in the tree by an articulation point. For instance, in the top row of Fig. 1,
the blue polygonal parts are the root of the tree and the colored rectangles are
the other nodes of the tree. The root is described by its center (x, y) and its

Input: {s_{t-1}^{(i)}}_{i=1}^{N}, α0, β0, βmax, βmin, P_{(0)}, M (number of iterations)
Output: {s_t^{(i)}}_{i=1}^{N}
1   Set π_t^{(i)} = 1, i = 1, ..., N
2   for k = 1 to K do
3      Sample: x_{t,(0)}^{(i),k} ∼ N(s_{t-1}^{(i),k}, P_{(0)}), i = 1, ..., N
4      for m = 0 to M do
5         if m ≥ 1 then
6            Compute P_{(m)} and update β1, β2
7            Carry out the PSO iteration based on Eq. (3) and (4)
8         Evaluate: f(x_{t,(m)}^{(i),k}) = w(x_{t,(m)}^{(i),k}, y), i = 1, ..., N
9         Update {s_t^{(i),k}}_{i=1}^{N} and the kth part of the global best state s_t^{g,k}
10     Evaluate particle weights: π_t^{(i)} = π_t^{(i)} × w(s_t^{(i),k}, y), i = 1, ..., N
11  Normalize particle weights: π̄_t^{(i)} = π_t^{(i)} / Σ_{j=1}^{N} π_t^{(j)}, i = 1, ..., N
12  return {s_t^{(i)}}_{i=1}^{N}, x̄ = Σ_{i=1}^{N} π̄_t^{(i)} s_t^{(i)}

Algorithm 1. Our HAPSOPF algorithm

orientation θ whereas the other parts are only characterized by their angle θ.
For all algorithms, particles are propagated using a random walk with standard
deviations fixed to σx = 2, σy = 2 and σθ = 0.05. For APSOPF and HAPSOPF,
P(0) is a diagonal matrix with the values of σx , σy and σθ . Our comparisons are
based on two criteria: estimation errors and computation times.

4.1 Tests on Synthetic Sequences


Video Sequences. We have generated two sets of various synthetic video se-
quences composed of 200 frames of 640 × 480 pixels (with ground truth). The
video sequences in the first set contain no noise while, in the second set, cluttered
background was generated to demonstrate the robustness of the proposed ap-
proach. The clutter is made up of polygons and rectangles randomly positioned

(a)

(b)
La = 3, Na = 4 La = 4, Na = 5 La = 3 , Na = 6 La = 4 , Na = 7

Fig. 1. Synthetic video sequences used for quantitative evaluation (number of arms
Na , length of arms La ): (a) without clutter and (b) with clutter

in the image. An articulated object is defined by its number Na of arms, and


their length La : some examples are given in Fig. 1.

Quantitative Tracking Results. The tracking errors are given by the sum
of the Euclidean distances between each corner of the estimated parts and their
corresponding corner in the ground truth. We used M = 3 layers for PSAPF and
APF since it produces stable results for both algorithms, and M = 3 maximal
iterations for HAPSOPF, HPSO and APSOPF. Table 1 gives the performances
of the tested algorithms for sequences without or with noise (cluttered back-
ground). In our experiments, tracking in noisy sequences is challenging due to
the background. In such cases, the annealing factor helps the particle swarm to
follow its own searching strategy without being affected by any wrong guide of
the local or global best states. On the contrary, the annealing process of PSAPF
forces the particle set to represent one of the modes of the cost function, which
causes some parts of the object to get stuck in wrong locations. This problem
of annealing approaches was reported in [1]. Moreover, the use of the sampling
covariance instead of the inertial velocity of Eq. (1) leads to an efficient explo-
ration of the search space without losing the searching power of PSO. This is
validated by our experiments on sequences without cluttered background, where
our approach outperforms all the other ones. Fig. 2 gives comparative conver-
gence results (error depending on the number N of particles) and computation
times for a synthetic sequence (behaviors are similar for other sequences). Note
that our approach converges better and faster than the other methods.

4.2 Tests on Real Sequences


Dataset. We used sequences S1 Gesture and S2 Throwcatch of the HumanEva-I
dataset [15], which include ground truths, thus allowing us to evaluate our approach
quantitatively. For both sequences, the lower right hands of the subject move
quickly, which makes them difficult to track. Moreover, S2 Throwcatch contains
self-occlusions (hands and torso, left and right hands, left and right legs).
The searching order for PSAPF, HPSO, and HAPSOPF is: torso, head, left
thigh, right thigh, left upper arm, right upper arm, left leg, right leg, left forearm,

Table 1. Tracking errors in pixels (average over 30 runs) and standard deviations for
synthetic video sequences, N is the number of particles used per filter
                          Na = 4, La = 3     Na = 5, La = 4     Na = 6, La = 3      Na = 7, La = 4
N                          50       200       50       200       50        200       50         200
HAPSOPF  without noise    110(2)   106(1)    214(5)   195(2)    243(11)   211(9)    312(7)     271(4)
HAPSOPF  noise            204(39)  143(10)   227(56)  175(30)   322(67)   295(60)   553(194)   516(180)
PSAPF    without noise    120(2)   114(1)    238(6)   208(4)    251(7)    218(3)    319(8)     278(4)
PSAPF    noise            309(109) 221(94)   281(78)  219(48)   432(86)   388(75)   1008(232)  914(213)
HPSO     without noise    125(5)   119(2)    252(9)   227(5)    254(11)   213(6)    382(5)     315(3)
HPSO     noise            277(78)  194(65)   245(42)  201(26)   345(27)   295(10)   922(334)   731(259)
APSOPF   without noise    184(3)   169(2)    260(12)  241(10)   265(15)   257(12)   471(30)    439(21)
APSOPF   noise            254(16)  227(8)    308(33)  291(25)   490(68)   474(47)   817(223)   785(169)
APF      without noise    128(3)   109(2)    246(11)  221(9)    270(13)   236(11)   487(35)    412(24)
APF      noise            272(9)   258(5)    322(29)  309(18)   440(51)   429(40)   613(174)   592(156)

[Fig. 2 plots: (a) error in pixels vs. number of particles, (b) computation time (seconds) vs. number of particles; curves for APF, HPSO, PSAPF, APSOPF and HAPSOPF]

Fig. 2. Comparison tests for convergence and computation time when track-
ing the object Na = 4, La = 3: (a) convergence and (b) computation times (HPSO
and our approach give same curves) in seconds

right forearm. For a fair comparison, we fixed the number of evaluations of the
weighting function at each frame for all the algorithms to 2000, and tuned pa-
rameters {N,M} for each method so that they achieve the best performance while
satisfying the above constraint: {400, 5} for APF, {40, 5} for PSAPF, {200, 10}
for APSOPF and {20, 10} for HPSO and HAPSOPF.

Quantitative Tracking Results. We used the evaluation measure proposed


in [15], which is based on Euclidean distances between 15 virtual markers on the
human body. Table 2 provides tracking errors and computation times. As can
be observed, our approach has the same computation time as HPSO but reduces
the estimation error and it outperforms the other approaches on both criteria.
Fig. 3 provides qualitative tracking results. Our approach always outperforms
PSAPF and HPSO in cases of self-occlusions (frames 275, 523) or quick move-
ments (frames 160, 387), showing its robustness. Because our approach incor-
porates the annealing into each searching stage of the hierarchical search, the
problem of noisy observations is effectively alleviated. This makes our approach
more robust to self-occlusions. The sampling covariance also helps to improve
the searching effectiveness by shifting the particle swarm toward more promising
regions.

Table 2. Tracking errors for full body in pixels (average over 30 runs)

HAPSOPF PSAPF HPSO APSOPF APF


Error Time Error Time Error Time Error Time Error Time
S1 Gesture 95(6) 287 99(11) 293 101(9) 287 102(4) 1348 105(2) 1412
S2 Throwcatch 212(10) 557 227(19) 579 232(12) 557 235(7) 2070 240(5) 2184

Fig. 3. Tracking results for frames 123,160,275,387,488,523: HPSO (first row), PSAPF
(second row), HAPSOPF (third row). The tracking results for the other approaches as
well as those for the sequence S1 Gesture are not presented due to space constraint.

5 Conclusions and Future Work

In this paper, we have introduced a new algorithm for articulated object tracking
based on particle swarm optimization and hierarchical search. We addressed the
problem of articulated object tracking in high dimensional spaces by employ-
ing a hierarchical search to improve search efficiency. Furthermore, the problem
of noisy observation has been alleviated by incorporating the annealing factor
terms into the velocity updating equation of PSO. Our experiments on syn-
thetic and real video sequences demonstrate the efficiency and effectiveness of
our approach compared to other common approaches, both in terms of tracking
accuracy and computation time. Our future work will focus on evaluating the
proposed approach in multi-view environments.

References
[1] Balan, A.O., Sigal, L., Black, M.J.: A quantitative evaluation of video-based 3d
person tracking. In: PETS, pp. 349–356 (2005)

[2] Bandouch, J., Engstler, F., Beetz, M.: Evaluation of Hierarchical Sampling Strate-
gies in 3D Human Pose Estimation. In: BMVC, pp. 925–934 (2008)
[3] Bray, M., Kollermeier, E., Vangool, L.: Smart particle filtering for high-
dimensional tracking. Computer Vision and Image Understanding 106(1), 116–129
(2007)
[4] Brubaker, M., Fleet, D., Hertzmann, A.: Physics-based person tracking using
the anthropomorphic walker. International Journal of Computer Vision 87(1-2),
140–155 (2009)
[5] Chang, I.C., Lin, S.Y.: 3D human motion tracking based on a progressive particle
filter. Pattern Recognition 43(10), 3621–3635 (2010)
[6] Deutscher, J., Reid, I.: Articulated body motion capture by stochastic search.
International Journal of Computer Vision 61(2), 185–205 (2005)
[7] Hauberg, S., Pedersen, K.S.: Stick it! articulated tracking using spatial rigid object
priors. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III.
LNCS, vol. 6494, pp. 758–769. Springer, Heidelberg (2011)
[8] Hofmann, M., Gavrila, D.: 3D human model adaptation by frame selection and
shape-texture optimization. Computer Vision and Image Understanding 115(11),
1559–1570 (2011)
[9] Isard, M.: PAMPAS: real-valued graphical models for computer vision. In: CVPR,
pp. 613–620 (2003)
[10] John, V., Trucco, E., Ivekovic, S.: Markerless human articulated tracking using
hierarchical particle swarm optimization. Image and Vision Computing 28(11),
1530–1547 (2010)
[11] Kjellstrom, H., Kragic, D., Black, M.: Tracking people interacting with objects.
In: CVPR, pp. 747–754 (2010)
[12] MacCormick, J., Isard, M.: Partitioned sampling, articulated objects, and
interface-quality hand tracking. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843,
pp. 3–19. Springer, Heidelberg (2000)
[13] Oikonomidis, I., Kyriazis, N.: Full DOF tracking of a hand interacting with an
object by modeling occlusions and physical constraints. In: ICCV, pp. 2088–2095
(2011)
[14] Rose, C., Saboune, J., Charpillet, F.: Reducing particle filtering complexity for
3D motion capture using dynamic Bayesian networks. In: AAAI, pp. 1396–1401 (2008)
[15] Sigal, L., Balan, R.: Humaneva: Synchronized video and motion capture dataset
and baseline algorithm for evaluation of articulated human motion. Technical re-
port (2009)
[16] Urtasun, R., Fleet, D., Hertzmann, A., Fua, P.: Priors for people tracking from
small training sets. In: ICCV, pp. 403–410 (2005)
[17] Wu, Y., Hua, G., Yu, T.: Tracking articulated body by dynamic Markov network.
In: ICCV, pp. 1094–1101 (2003)
[18] Xinyu, X., Baoxin, L.: Learning Motion Correlation for Tracking Articulated Hu-
man Body with a Rao-Blackwellised Particle Filter. In: ICCV, pp. 1–8 (2007)
[19] Yao, A., Gall, J., Gool, L., Urtasun, R.: Learning probabilistic non-linear latent
variable models for tracking complex activities. In: Shawe-Taylor, J., Zemel, R.,
Bartlett, P., Pereira, F., Weinberger, K. (eds.) Advances in Neural Information
Processing Systems 24, pp. 1359–1367 (2011)
[20] Zhang, X., Hu, W., Wang, X., Kong, Y., Xie, N., Wang, H., Ling, H., Maybank,
S.: A swarm intelligence based searching strategy for articulated 3D human body
tracking. In: CVPRW, pp. 45–50 (2010)
High-Resolution Feature Evaluation Benchmark

Kai Cordes, Bodo Rosenhahn, and Jörn Ostermann

Institut für Informationsverarbeitung (TNT)


{cordes,rosenhahn,ostermann}@tnt.uni-hannover.de

Abstract. Benchmark data sets consisting of image pairs and ground truth ho-
mographies are used for evaluating fundamental computer vision challenges, such
as the detection of image features. The most widely used benchmark provides
only low resolution images. This paper presents an evaluation benchmark
consisting of high resolution images of up to 8 megapixels and highly accurate
homographies. State of the art feature detection approaches are evaluated using
the new benchmark data. It is shown that existing approaches perform differently
on the high resolution data compared to the same images with lower resolution.

1 Introduction

The detection of features is a fundamental step in many computer vision applications.


Standing at the beginning of a processing pipeline, the accuracy of such an application
is often determined by the accuracy of the detected features. Thus, the development and
the evaluation of feature detectors is of high interest in the computer vision community.
The evaluations of feature detectors and descriptors [1,2,3,4,5,6,7] are based on im-
age pairs showing planar scenes and corresponding homographies which determine the
mapping between an image pair. This data serves as ground truth for the accuracy eval-
uation. The most widely used reference data set is the one proposed by Mikolajczyk et al. [3]. In this
set, a sequence consists of 6 images showing the same scene undergoing different types
of distortion, such as scale or viewpoint change, illumination, or coding artefacts. The
evaluation criterion for feature detectors is the repeatability. The evaluation protocol
counts the number of correctly detected feature pairs. A correctly detected feature pair
is determined by using a threshold for the overlap error [3]. The threshold controls the
demanded accuracy of the evaluation.
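
For orientation, a strongly simplified, point-based version of this repeatability computation might look as follows. This is our own sketch and only an approximation: the actual protocol of [3] measures the overlap error of affine regions rather than point distances.

```python
import numpy as np

def point_repeatability(pts1, pts2, H, tol=1.5):
    """Simplified, point-based repeatability: fraction of features in image 1
    that have a feature in image 2 within 'tol' pixels after mapping with the
    ground-truth homography H. pts1: (N, 2) array, pts2: (M, 2) array."""
    homog = np.hstack([pts1, np.ones((len(pts1), 1))])
    mapped = (H @ homog.T).T
    mapped = mapped[:, :2] / mapped[:, 2:3]
    # distance of every mapped feature to every feature detected in image 2
    dists = np.linalg.norm(mapped[:, None, :] - pts2[None, :, :], axis=2)
    correspondences = int(np.sum(dists.min(axis=1) < tol))
    return correspondences / max(min(len(pts1), len(pts2)), 1)
```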
The evaluation benchmark [3] has some deficiencies regarding the images as well as
the homographies. The image resolution is only 0.5 megapixels. Many images of the
data set are not restricted to a plane, which violates the homography assumption,
as shown in Figure 1. For some images, scene content moves between the capturing
of the individual images (the leaves in the Trees sequence). It appears that radial distortion is not considered
in the benchmark generation, which is another violation of the mapping assumption.
For the computation of the ground truth homographies, features are used1 . This is not
desirable because the data is used for the evaluation of feature detectors. Finally, the
authors concede that the homographies are not perfect [8]. However, the data set is used
1 www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/DataREADME


(a) Graffiti image 3 (b) Mapped image 1 (c) Differences between (a) and (b); (d) Trees image 3 (e) Mapped image 1 (f) Differences between (d) and (e)

Fig. 1. Part of the mapped images 1 and image 3 of the Graffiti sequence (top row) and the Trees
sequence (bottom row). For the mapping of image 1, the ground truth homographies are used.
Large errors occur due to the car in the foreground (Graffiti) and the moving leaves because
of wind (Trees). The bottom part of the Graffiti wall indicates a violation of the homography
assumption. The error is shown in the images 1(c) and 1(f) (cf. equation (6)).

as ground truth for high-accuracy evaluations, sometimes using very small overlap error
thresholds [3,8,9]. Apart from feature evaluation there are applications [10] which use
a dense representation of the images. In this case, the mapping errors would spoil the
evaluation significantly. Hence, the data set is useless for applications with dense image
representations.
Nowadays, consumer cameras provide image resolutions of 8 megapixels or more.
The question arises whether feature detector evaluations based on data with 0.5 megapixels
are valid for high resolution images. In [3], the evaluated detectors provide scale invariant
properties. On the other hand, the localization accuracy of a scale invariant feature
may depend on the detected scale [11], because its position error in a certain pyramid
layer is mapped to the ground plane of the scale space pyramid. In high resolution
data, more features are expected to be detected at higher scales of the image pyramid.
Thus, a small localization error of a detector may become significant in high resolution
image data.
An improved homography benchmark is provided in [12] with image resolutions of
1.5 megapixels per image. In addition, the accuracy of the Mikolajczyk benchmark is
slightly increased using a dense image representation instead of image features.
We use the RAW camera data from the images of the data set [12]. The proposed
technique exploits the ground truth data from [12] for initializing an evolutionary opti-
mization for the computation of ground truth homographies between image pairs with

resolutions of up to 8 megapixels. This technique is called homography upscaling. The


data is validated using the evaluation protocol proposed in [3]. For the comparison be-
tween low-resolution and high-resolution benchmark data, the same detectors [3] are
evaluated: MSER [13], Hessian-Affine [1], Harris-Affine [8], intensity extrema-based
regions (IBR) [14], and edge-based regions (EBR) [15].
The main motivation of this paper is the question whether the well-known results for the
accuracy of feature detectors are still valid for high resolution data. Furthermore, the
newly generated high resolution ground truth data set will be provided to the computer
vision community for feature detector evaluation or for applications using a dense rep-
resentation of the images, such as [10].
In the following Section 2, the computation of the new high resolution benchmark
is explained. Section 3 shows the accuracy results of the benchmark compared to [12]
and the feature evaluation using the repeatability criterion. In Section 4, the paper is
concluded.

2 Homography Upscaling
We make use of the RAW image data from [12]. In [12], the benchmark is created
using subsampled images of size 1536 × 1024 (1.5 megapixels). We use the images
with the same scene content at higher resolution. The radial distortion is removed in a
preprocessing step. Our objective is to create ground truth homographies with image
resolutions of up to 3456 × 2304 (8 megapixels), which is the maximum resolution of
the utilized Canon EOS 350D camera.
Since the homography for the image pair at resolution R1 is approximately known,
it can be used as a reasonable initialization for the optimization at resolution R2, as
shown in Section 2.1. The optimization is based on a cost function which computes the
mapping error of the homography HR2 at resolution R2 . The minimization of the cost
function is explained in Section 2.2.

2.1 Upscaling a Homography Analytically


Let the homography between two images at resolution $R_1 = M_{R_1} \times N_{R_1}$ be given as
$H_{R_1}$. Then, a point $p_{R_1}$ of the first image can be identified in the second image with
coordinates $p'_{R_1}$ by
$$p'_{R_1} = H_{R_1} \cdot p_{R_1} \qquad (1)$$
The pixel coordinates of a corresponding image point pair $p_{R_1} \leftrightarrow p'_{R_1}$ in homogeneous
coordinates [16] are normalized to the resolution $R_0 = [-1;1] \times [-1;1]$. This
mapping in the left and right image is determined by:
$$p_{R_1} = A_{R_1} \cdot x_{R_0} \quad \text{and} \quad p'_{R_1} = A_{R_1} \cdot x'_{R_0} \qquad (2)$$
with the matrix
$$A_{R_1} = \begin{pmatrix} \frac{M_{R_1}-1}{2} & 0 & \frac{M_{R_1}-1}{2} \\ 0 & \frac{N_{R_1}-1}{2} & \frac{N_{R_1}-1}{2} \\ 0 & 0 & 1 \end{pmatrix}.$$

From equations (1) and (2), it follows:
$$A_{R_1} \cdot x'_{R_0} = H_{R_1} \cdot A_{R_1} \cdot x_{R_0} \qquad (3)$$
The desired homography at image resolution $R_2 = M_{R_2} \times N_{R_2}$ is $H_{R_2}$. If all image
positions from resolutions $R_1$ and $R_2$ are normalized to $R_0$, their coordinates $x_{R_0}$ are
identical (cf. equations (2)):
$$x_{R_0} = A_{R_2}^{-1} \cdot p_{R_2} \quad \text{and} \quad x'_{R_0} = A_{R_2}^{-1} \cdot p'_{R_2} \qquad (4)$$
By exchanging $x_{R_0}$ and $x'_{R_0}$ in equation (3) with equations (4), it follows:
$$p'_{R_2} = \underbrace{A_{R_2} \cdot A_{R_1}^{-1} \cdot H_{R_1} \cdot A_{R_1} \cdot A_{R_2}^{-1}}_{H_{R_2}} \cdot \, p_{R_2} \qquad (5)$$

Hence, the homography HR2 can be computed by a matrix multiplication consisting of


the known matrix HR1 and the resolutions MR1 × NR1 and MR2 × NR2 of the left and
right image, which build the matrices AR1 and AR2 .
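
Equation (5) translates directly into a few lines of code. The following is our own NumPy sketch (function names and the normalization of the homogeneous scale are our choices, not from the paper); it assumes that both images of a pair share the same resolution at each level.

```python
import numpy as np

def norm_matrix(M, N):
    """Matrix A_R of equation (2): maps normalized coordinates in
    [-1, 1] x [-1, 1] to pixel coordinates [0, M-1] x [0, N-1]."""
    return np.array([[(M - 1) / 2.0, 0.0, (M - 1) / 2.0],
                     [0.0, (N - 1) / 2.0, (N - 1) / 2.0],
                     [0.0, 0.0, 1.0]])

def upscale_homography(H_R1, size_R1, size_R2):
    """Analytic homography upscaling of equation (5):
    H_R2 = A_R2 * A_R1^-1 * H_R1 * A_R1 * A_R2^-1."""
    A1 = norm_matrix(*size_R1)   # size_R1 = (M_R1, N_R1)
    A2 = norm_matrix(*size_R2)   # size_R2 = (M_R2, N_R2)
    H_R2 = A2 @ np.linalg.inv(A1) @ H_R1 @ A1 @ np.linalg.inv(A2)
    return H_R2 / H_R2[2, 2]     # fix the homogeneous scale

# e.g. upscale_homography(H, (1536, 1024), (3456, 2304))
```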

2.2 Optimization Using Differential Evolution


The approximate homography at resolution R2 is computed from the homography at
resolution R1 as explained in Section 2.1. Due to inaccuracies in HR1 , the matrix HR2
has to be refined by minimizing a cost function. In the following, we denote the homog-
raphy in the desired resolution with $H := H_{R_2}$. Then, the cost function E(H) is [12]:
$$E(H) = \frac{1}{J} \sum_{j=1}^{J} d_{RGB}\left(H \cdot p_j,\, p_j\right), \qquad (6)$$
using the RGB values of the left and the right image I1, I2. The homography H maps a
pixel p_j, j ∈ [1; J], from the left image I1 to the corresponding pixel p'_j in the right image
I2. If the homography is accurate, the color distance d_RGB(·) is small. The color distance
d_RGB(·) is determined as:
$$d_{RGB}(p_i, p_j) = \frac{1}{3}\left(|r(p_i) - r(p_j)| + |g(p_i) - g(p_j)| + |b(p_i) - b(p_j)|\right) \qquad (7)$$
using the RGB values (r(pi ), g(pi ), b(pi )) of an image point pi . For the extraction of
the color values, a bilinear interpolation is used. If a mapped point pj falls outside the
image boundaries, it is neglected.
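
A minimal implementation of this cost, assuming floating-point RGB images and a set of sample positions in the left image, might look as follows. This is our own illustrative sketch, not the authors' code; the bilinear interpolation and border handling follow the description above.

```python
import numpy as np

def bilinear_rgb(img, x, y):
    """Bilinear interpolation of the RGB value at the float position (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1] +
            (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def mapping_cost(H, img1, img2, points):
    """E(H) of equation (6): mean color distance (7) between sample points of
    the left image and their mappings H*p in the right image. 'points' is a
    (J, 2) array of (x, y) positions in img1; points mapped outside img2 are
    neglected, as described in the text."""
    h, w = img2.shape[:2]
    total, count = 0.0, 0
    for x, y in points:
        px, py, pw = H @ np.array([x, y, 1.0])
        px, py = px / pw, py / pw
        if 0 <= px < w - 1 and 0 <= py < h - 1:
            diff = np.abs(bilinear_rgb(img2, px, py) - img1[int(y), int(x)])
            total += diff.mean()      # mean over R, G, B implements the 1/3 factor
            count += 1
    return total / max(count, 1)
```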
Due to lighting and perspective changes between the images, the cost function is
likely to have several local minima. Hence, a Differential Evolution (DE) optimizer is
used for the minimization of E(H) with respect to H in the cost function (6). Evolu-
tionary optimization methods have shown impressive performance for parameter estimation
problems, finding the global optimum in a parameter space with many local
optima. Nevertheless, limiting the parameter space with upper and lower boundaries
increases the performance of these optimization algorithms significantly. For setting
the search space boundaries, the approximately known solutions for the homographies
at lower resolution are used. With equation (5), the search space centers are computed.
Then, a Differential Evolution (DE) optimizer is performed using common parameter
settings [17].
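
As a rough illustration of how such a refinement can be prototyped, the sketch below uses SciPy's differential evolution. It is our own code, not the authors': the relative search margin, the iteration budget, and the parameterization with a fixed H[2,2] = 1 are assumptions, and the paper instead uses the common DE settings of [17].

```python
import numpy as np
from scipy.optimize import differential_evolution

def refine_homography(H_init, img1, img2, points, rel_margin=0.05):
    """Refine the eight free homography parameters around the analytically
    upscaled initialization of equation (5) by minimizing E(H) with DE."""
    h0 = (H_init / H_init[2, 2]).ravel()[:8]      # keep H[2,2] fixed to 1
    bounds = [(v - abs(v) * rel_margin - 1e-3,
               v + abs(v) * rel_margin + 1e-3) for v in h0]

    def cost(h):
        H = np.append(h, 1.0).reshape(3, 3)
        return mapping_cost(H, img1, img2, points)   # E(H) from the sketch above

    result = differential_evolution(cost, bounds, maxiter=200, tol=1e-8)
    return np.append(result.x, 1.0).reshape(3, 3)
```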

3 Experimental Results
For the benchmark generation, 5 sequences are used. Each of the sequences contains 6
images like in the reference benchmark [3]. In Section 3.1, the resulting cost function
values of different resolutions are compared. In Section 3.2 the evaluation protocol [3]
is performed on the new data.

(a) Colors (b) Grace (c) Posters (d) There (e) Underground
Fig. 2. First images of the input image sequences. The resolution is up to 3456 × 2304.

3.1 High-Resolution Benchmark Generation


The resulting cost function values E(H) for the resolutions R1 = 1536 × 1024 and
R2 = 3456 × 2304 are shown in Table 1. Two example sequences are selected, Grace
and Underground. Due to the high accuracy of the computed homographies at resolution
R2 , E(H) increases only slightly compared to resolution R1 . The generally larger error
for the Underground sequence is due to the higher amount of light reflection from the
surface of the wall. Nevertheless, the accuracies of the new homographies are high.

Table 1. Comparison of cost function values E(H) for the homographies for image resolutions
1536×1024 (cf. [12] for Grace) and the new data set with resolution 3456×2304. The resulting
cost function values for each image pair are approximately the same.

E(H)            Grace                               Underground
                1-2    1-3    1-4    1-5    1-6      1-2    1-3    1-4     1-5     1-6
1.5 megapixels  3.44   4.62   6.02   8.21   9.99     7.23   8.31   12.52   19.07   28.64
8.0 megapixels  3.93   5.20   6.60   8.73   10.46    7.46   8.63   12.67   19.20   28.73

3.2 Repeatability Comparison


To validate the usability of the new data set, the benchmark protocol provided in [3] is
used. As in Section 3.1, we compare results for resolution R1 = 1536 × 1024 with
R2 = 3456 × 2304 for the sequences Grace (Figure 4) and Underground (Figure 3).
As in the majority of evaluation papers, the overlap error threshold is set to 40 %. The
evaluated feature detectors are chosen from the reference paper [3].
Regarding the Underground sequence, the results for R2 are consistent with the results
obtained for the smaller resolution R1. MSER performs best, followed by Hessian-Affine
and IBR, very similar to the evaluation in [3] for the viewpoint change scenario.
However, each of the detectors loses between 1 % and 9 % in repeatability.

[Plots: (a) Repeatability (1.5 megapixels), (b) Repeatability (8.0 megapixels), (c) Correspondences (1.5 megapixels), (d) Correspondences (8.0 megapixels); each plot shows the repeatability in % or the number of correspondences over the viewpoint angle for Harris-Affine, Hessian-Affine, MSER, IBR and EBR.]

Fig. 3. Repeatability results (top row) and the number of correctly detected points (bottom row)
for the Underground sequence with different resolutions

For the Grace sequence, the results are different for each detector. While Harris-Affine
and Hessian-Affine perform as in the Underground sequence, MSER and IBR lose
repeatability significantly. The repeatability rate of IBR decreases by 12 % to 15 %,
and MSER loses up to 25 % for large viewpoint changes. Interestingly, EBR gains
about 4 % for small viewpoint changes, but loses about 5 % for large viewpoint changes.
Generally, none of the detectors can really improve its performance using high resolution images.

4 Conclusions
In this paper, high-resolution image data of up to 8 megapixels is presented together
with highly accurate homographies. This data can be used as a benchmark for computer
vision tasks, such as feature detection. In contrast to the most widely used benchmark, our
data provides high-resolution, fully planar scenes with removed radial distortion and
a feature-independent computation of the homographies. They are determined by the
global optimization of a cost function using a dense representation of the images. The
optimization is initialized with values inferred from the solution at lower resolution.
The evaluation shows that none of the standard feature detection approaches improves
in repeatability on the higher resolution images. On the contrary, their performance decreases.

[Plots: (a) Repeatability (1.5 megapixels), (b) Repeatability (8.0 megapixels), (c) Correspondences (1.5 megapixels), (d) Correspondences (8.0 megapixels); each plot shows the repeatability in % or the number of correspondences over the viewpoint angle for Harris-Affine, Hessian-Affine, MSER, IBR and EBR.]

Fig. 4. Repeatability results (top row) and the number of correctly detected points (bottom row)
for the Grace sequence with different resolutions

Depending on the approach, the repeatability drops by up to 25 % while gaining at most
4 %. It follows that feature detectors should be evaluated using high
resolution images. The presented benchmark provides the necessary data to do this.
The data set resulting from this work with all five sequences is available at:
http://www.tnt.uni-hannover.de/project/feature_evaluation/
The provided resolutions include versions with 1.5 megapixels, 3 megapixels,
6 megapixels, and 8 megapixels for each sequence.

References
1. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International
Journal of Computer Vision (IJCV) 60, 63–86 (2004)
2. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence (PAMI) 27, 1615–1630 (2005)
3. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F.,
Kadir, T., Gool, L.V.: A comparison of affine region detectors. International Journal of Com-
puter Vision (IJCV) 65, 43–72 (2005)
4. Schmid, C., Mohr, R., Bauckhage, C.: Comparing and evaluating interest points. In: IEEE
International Conference on Computer Vision (ICCV), pp. 230–235 (1998)

5. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. International
Journal of Computer Vision (IJCV) 37, 151–172 (2000)
6. Haja, A., Jähne, B., Abraham, S.: Localization accuracy of region detectors. In: IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008)
7. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. Foundations and
Trends in Computer Graphics and Vision, vol. 3 (2008)
8. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part I. LNCS, vol. 2350, pp.
128–142. Springer, Heidelberg (2002)
9. Förstner, W., Dickscheid, T., Schindler, F.: Detecting interpretable and accurate scale-
invariant keypoints. In: IEEE International Conference on Computer Vision (ICCV), Kyoto,
Japan, pp. 2256–2263 (2009)
10. Mobahi, H., Zitnick, C., Ma, Y.: Seeing through the blur. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1736–1743 (2012)
11. Brown, M., Lowe, D.G.: Invariant features from interest point groups. In: British Machine
Vision Conference (BMVC), pp. 656–665 (2002)
12. Cordes, K., Rosenhahn, B., Ostermann, J.: Increasing the accuracy of feature evaluation
benchmarks using differential evolution. In: IEEE Symposium Series on Computational In-
telligence (SSCI) - IEEE Symposium on Differential Evolution (SDE). IEEE Computer So-
ciety (2011)
13. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally
stable extremal regions. British Machine Vision Conference (BMVC) 1, 384–393 (2002)
14. Tuytelaars, T., Gool, L.V.: Wide baseline stereo matching based on local, affinely invariant
regions. In: British Machine Vision Conference (BMVC), pp. 412–425 (2000)
15. Tuytelaars, T., Van Gool, L.: Content-based image retrieval based on local affinely invariant
regions. In: Huijsmans, D.P., Smeulders, A.W.M. (eds.) VISUAL 1999. LNCS, vol. 1614,
pp. 493–500. Springer, Heidelberg (1999)
16. Hartley, R.I., Zisserman, A.: Multiple View Geometry, 2nd edn. Cambridge University Press
(2003)
17. Price, K.V., Storn, R., Lampinen, J.A.: Differential Evolution - A Practical Approach to
Global Optimization. Natural Computing Series. Springer, Berlin (2005)
Fully Automatic Segmentation of AP Pelvis
X-rays via Random Forest Regression
and Hierarchical Sparse Shape Composition

Cheng Chen and Guoyan Zheng

Institute for Surgical Technology and Biomechanics, University of Bern,


CH-3014, Bern, Switzerland
[email protected], [email protected]

Abstract. Knowledge of landmarks and contours in anteroposterior


(AP) pelvis X-rays is invaluable for computer aided diagnosis, hip surgery
planning and image-guided interventions. This paper presents a fully au-
tomatic and robust approach for landmarking and segmentation of both
pelvis and femur in a conventional AP X-ray. Our approach is based
on random forest regression and hierarchical sparse shape composition.
Experiments conducted on 436 clinical AP pelvis x-rays show that our
approach achieves an average point-to-curve error around 1.3 mm for
femur and 2.2 mm for pelvis, both with success rates around 98%. Com-
pared to existing methods, our approach exhibits better performance in
both the robustness and the accuracy.

Keywords: image segmentation, visual feature selection, shape model.

1 Introduction

Segmenting anatomical regions such as the femur and the pelvis is an impor-
tant task in the analysis of conventional 2D X-ray images, which benefits many
applications such as disease diagnosis [1,2], operation planning/intervention [3],
3D reconstruction [4,5], and so on. Traditionally, manual segmentation of X-ray
images is both time-consuming and error-prone. Therefore, automatic methods
are beneficial both in efficiency and accuracy. However, automatic segmentation
of X-ray images faces many challenges. The poor and non-uniform image con-
trast, along with noise, makes the segmentation very difficult. Occlusions
such as the overlap between bones make it difficult to identify local features of
bone contours. Furthermore, the existence of implants drastically interferes with
the appearance. Therefore, conventional segmentation techniques [1,3], which
mainly depend on local image features such as the edge information, cannot
provide satisfactory results, and model based segmentation techniques are often
adopted [5,6]. However, model based methods suffer from the requirement of
proper initialization, which is typically done manually, and the limited converg-
ing region, leading to unsatisfactory results.


To overcome these challenges, machine learning-based methods have gained
more and more interest in medical image segmentation. In [7], Zhou and Co-

more and more interests in medical image segmentation. In [7], Zhou and Co-
maniciu introduced the so-called shape regression machine to segment in real
time the left ventricle endocardium from an echocardiogram of an apical four
chamber view. In [8], Zheng et al. proposed marginal space learning to auto-
matically localize the heart chamber from 3D CT. More recently, random forest
regression has been used to automatically localize organs from 3D volumetric
data such as CT or MRI [9,10]. However, in comparison with 3D data, 2D X-
ray images pose more challenges because of the poor image quality caused by
projection overlap of surrounding soft and hard tissues. In [2], Lindner et al.
introduced a regression voting method in combination with a constrained local
model (CLM) framework for automatic segmentation of proximal femur from
conventional x-ray radiographs without occlusion.
In this paper, we propose a new fully automatic method for femur and pelvis
segmentation in anteroposterior (AP) X-ray images. The contributions of this
paper include: (a) a hierarchical landmark detection framework where a set of
globally detected landmarks is used for image normalization and another set
of locally detected landmarks is utilized for shape optimization; and (b) the
exploitation of the recently proposed sparse shape composition model.

2 The Proposed Approach

2.1 Landmarks and Shape Models

We define X-ray landmarks hierarchically in two different levels, as shown in


Fig. 1. The first level, global landmarks, contains one group of 22 landmarks
G = {L^G_1, ..., L^G_22} defined on anatomically important positions over the whole
image (Fig. 1(a)). The second level, local landmarks, consists of two different
groups: the left femur LF = {L^LF_1, ..., L^LF_59} with 59 local landmarks (Fig. 1(b)),
and the left pelvis LP = {L^LP_1, ..., L^LP_97} with 97 local landmarks (Fig. 1(c)).
Shapes are defined by the ordered landmarks in the same group. For example,
an instance of the "global shape" y^G is defined by y^G = [l^G_1, ..., l^G_22] ∈ R^44, where l^G_i
is the location of landmark L^G_i in the image.

Fig. 1. (a): Global landmarks. (b): Local landmarks defining the left femur shape. (c):
Local landmarks defining the left pelvis shape.

Fig. 2. Workflow on a test image. Rectangles represent different steps. Clouds represent
different pre-trained models.

For each landmark group, there is an associated statistical shape model. We denote M_G, M_LF and M_LP as the
shape models for G, LF and LP, respectively. Shape models specify the prior
distribution of landmark positions in the corresponding landmark group.

2.2 Workflow

Fig. 2 shows the workflow of our method given a test image I.


(1) Global landmark detection. First, we launch the landmark detectors for the
global landmarks, which produce the response images {R(I)^G_1, ..., R(I)^G_22}. The detailed
definition of the response image will be given in Section 2.3.
(2) Global shape optimization and global image alignment. The response
images {R(I)Gi }, together with the global shape model MG , are used for a shape
optimization process which finds the optimal global shape, as well as the similar-
ity transform (rigid+scale) with regard to the shape model. Then, the image and
global landmark positions are transformed by this similarity transform, which
compensates for the global translation, rotation and scaling. Note in Fig. 2 how
the erroneous landmark detections are corrected by the shape optimization.
(3) Local Shape initialization. The left femur shape y LF and the left pelvis
shape y LP (i.e. the corresponding landmark positions) are initialized based on
the positions of the 22 global landmarks derived in the previous step.
(4) Local landmark detection. Local landmarks are detected, which generates
the response images {R(I)^LF_1, ..., R(I)^LF_59} and {R(I)^LP_1, ..., R(I)^LP_97}.
(5) Local shape optimization. With the initialized shape y^LF from step (3),
the landmark response images {R(I)^LF_i} from step (4), and the shape model
M_LF, this final step searches for the optimal femur shape. The same process is
repeated for the pelvis.
In short, we use global shape to compensate for the global image pose as well as
to initialize the local shapes, and this (steps (1)-(2)) needs to be done only once
for an image. After the global alignment is done, the local landmark detection
in step (4) only has to be done in a limited image region without considering large
scale/rotation variance, which speeds up the detection.

Fig. 3. The landmark detection algorithm. (a): A patch sampled around the ground-
truth landmark. (b): Patches sampled for training. (c): For a new image, patches sam-
pled over the image. (d): Each patch produces a prediction of landmark position. (e):
The response image is calculated by combining individual predictions.

2.3 Landmark Detection


We have a separate detector for each landmark. During training, in each training
image, we sample a set of rectangular patches1 around the ground-truth land-
mark position which is known. Each patch is represented by its visual feature
f ∈ Rdf and the displacement vector d ∈ R2 from the patch center to the land-
mark (Fig 3(a)). Let us denote all the sampled patches in all training images as
{Pi = (fi , di )}i=1,...,N (Fig. 3(b)). The goal is then to learn a mapping function
φ : R^df → R^2 from the feature space to the displacement vector space. In principle,
any regression method can be used. In this paper, similar to [9,2], we
utilize the random forest regressor [11].
Once the regressor is trained, given a new image, as in Fig. 3(c), we randomly
sample another set of patches {P'_j = (f_j, c_j)}_{j=1,...,N'} all over the image (or
an ROI), where f_j and c_j are the visual feature and the center coordinate of the
patch P'_j, respectively. Through the trained mapping φ, we can calculate the
predicted displacement d(P'_j) = φ(f_j), and then d(P'_j) + c_j is the prediction of
the landmark location by the single patch P'_j (Fig. 3(d)). The individual predictions
are very noisy, but when combined, they approach an accurate prediction l:
$$l = \frac{1}{N'} \sum_{j} \left( \varphi(f_j) + c_j \right) \qquad (1)$$

In practice, the output of the random forest regressor d(P'_j) is not a single value,
but a distribution². Similarly to Eq. (1), we add up the predicted distributions,
obtaining a single distribution (as in Fig. 3(e)), which is called the response image.
In our method, we use the multi-level HoG (Histogram of Oriented Gradients)
[13] as the feature for image patches, and we use the feature selection algorithm
proposed in [14] to efficiently select only the most relevant feature components.

¹ We use 1000 patches per image for both training and testing in our implementation.
² The raw output of the random forest regressor is the set of displacement vectors stored at the
leaf where the test feature vector falls, from which we fit a Gaussian distribution.
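
The regression-voting idea behind the response image can be prototyped in a few lines. The sketch below is our own illustration, not the authors' code: it assumes a scikit-learn style regressor with a predict() method, and a fixed isotropic Gaussian stands in for summing the per-leaf Gaussian fits described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def landmark_response_image(patches, regressor, image_shape, sigma=3.0):
    """Regression voting: every test patch predicts a landmark position via
    the trained regressor; the votes are accumulated and smoothed to obtain
    the response image. 'patches' is a list of (feature_vector, (cx, cy))
    pairs sampled over the image or an ROI."""
    votes = np.zeros(image_shape[:2], dtype=float)
    for feature, (cx, cy) in patches:
        dx, dy = regressor.predict(feature.reshape(1, -1))[0]
        x, y = int(round(cx + dx)), int(round(cy + dy))
        if 0 <= x < votes.shape[1] and 0 <= y < votes.shape[0]:
            votes[y, x] += 1.0
    # spreading each vote with a Gaussian approximates summing distributions
    return gaussian_filter(votes, sigma)
```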

Table 1. The algorithm of shape optimization

Input: Initial shape y_0, landmark response images {R(I)_i}_{i=1,...,K}, shape model M
Output: Optimal shape y*, pose transform matrix T
Procedure:
1. Initialize y = y_0, T = the optimal similarity transform from y to the shape model M
2. Update the shape y by locally moving each landmark in the ascent direction
in the corresponding response image
3. Regularize the shape y by the shape model M
4. Update the pose T by the optimal similarity transform from y to the shape model M
5. Repeat steps 2 to 4 until convergence. Then y* = y

2.4 Shape Optimization


In this section we present our method which searches for the optimal shape in
steps (2) and (5) of Section 2.2. Prior to the shape optimization, we have the
response images {R(I)i }i=1,...,K for K landmarks, the initial shape y0 ∈ R2K ,
and a shape model M. The task is then to find the optimal shape y ∗ ∈ R2K ,
starting from the initial shape y0 , constrained both by the image cue encoded
in the response images, and the prior information encoded in the shape model.
The procedure is shown in Table 1. Basically, starting from the initial shape,
we update the shape iteratively. In each iteration, we perform three actions:
update the shape by moving each local landmark to a better position according
to response images, regularize the shape by the shape model, and update the
shape pose. These steps are straightforward except the shape regularization,
which regularizes the locally updated shape by the shape model to remove noise
(step 3 in Table 1). Traditionally, this can be done by the Active Shape Model
[15] based on PCA (Principal Component Analysis). In this paper, we instead
employ the recently proposed shape model based on sparse representation [12].
Here we briefly explain this method.
The shape model consists of a set of pre-aligned training shapes {yi }i=1,...,N .
For each new shape y' to be regularized, after a transformation T (which is
evaluated separately in step 4 of Table 1), it should be approximated by a linear
combination involving only a small subset of the training shapes, plus a sparse
error:
$$T(y') \approx Yx + e = \begin{bmatrix} Y & I \end{bmatrix}\begin{bmatrix} x \\ e \end{bmatrix} = Y'x' \qquad (2)$$
where Y' = [Y, I], with Y the matrix whose columns are the training shapes, and
x' = [x^T, e^T]^T. In Eq. (2), both the linear coefficients x
and the error e are sparse. Therefore, the composite coefficient x' is also sparse.
Our goal becomes to solve the following L1-regularized least squares problem:
$$x'_{opt} = \arg\min_{x'} \; \left\| T(y') - Y'x' \right\|_2^2 + \lambda \left\| x' \right\|_1 \qquad (3)$$
where λ is a parameter controlling the importance of the sparsity constraint.
There are a number of solvers for Eq. (3); we employ the truncated Newton
interior-point method.

The interpretation of Eq. (2) is clear: the shape y' should be approximated
(up to a transformation T) as a linear combination of only a small number of
"basis" elements, which can either be the training shapes or the standard basis of the R^2K
space. The contribution from the training shapes represents the "true" part of
shape y' that is consistent with the shape model, and the contribution from the
standard basis accommodates large but sparse errors (noise). Therefore, after we
obtain the optimal x'_opt from Eq. (3), we decompose it as x'_opt = [x_opt^T, e_opt^T]^T as
in Eq. (2), discard e_opt, which corresponds to the noise, and the regularized
shape is given by back-projecting the "true" part of the shape:
$$y_{regularized} = T^{-1}\left( Y x_{opt} \right) \qquad (4)$$

Thus we complete the shape regularization step in Table 1.
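
A compact way to prototype this regularization step is to reuse an off-the-shelf L1 solver. The sketch below is our own illustration: the paper employs a truncated Newton interior-point solver, whereas scikit-learn's Lasso scales the least-squares term differently, so the value of lam is only an assumed stand-in for λ; the pose transform T is also omitted here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def regularize_shape(y_aligned, Y_train, lam=0.01):
    """Sparse shape composition sketch: approximate the aligned shape by a
    sparse combination of training shapes plus a sparse error (Eq. (2)) and
    back-project only the 'true' part (Eq. (4), without the pose transform).
    y_aligned: (2K,) vector T(y'); Y_train: (2K, N) matrix of training shapes."""
    twoK, N = Y_train.shape
    Y_prime = np.hstack([Y_train, np.eye(twoK)])   # Y' = [Y, I]
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    lasso.fit(Y_prime, y_aligned)                  # approximate solver for Eq. (3)
    x_opt = lasso.coef_[:N]                        # discard e_opt (the noise part)
    return Y_train @ x_opt
```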

3 Experiments
3.1 Data
We conduct experiments using a collection of 436 AP radiographs from our
clinical partner. A considerable part of these images are post-operative x-ray
radiographs after trauma or joint replacement surgery, which significantly in-
creases the challenge due to large variation of femur/pelvis appearance and the
presence of implants. From these 436 images, we randomly select 100 images for
training, and the other 336 images are used for testing purposes. For the training
images, we manually annotate all the (global and local) landmarks.

3.2 Results on Femur/Pelvis Segmentation


We applied our segmentation algorithm to the 336 test images, and Fig.
4 shows examples obtained with our method. We can see that our method achieves
excellent performance despite challenges such as significant variation of appearance,
poor image contrast, or implants. Note that 115 of the 336 test images
(34%) contain different types of implants, which reflects the difficulty of our dataset.
For quantitative evaluation, from the 336 test images we randomly chose
192 images, on which we manually annotated the left femur and left pelvis contours.
The segmentation error is thus calculated as the average point-to-curve
distance between the points on the segmented shape and the annotated contour.
Since the images are stored in Dicom format, we know the pixel resolution of
each image and therefore the error is expressed in the physical unit of millimeter,
which is shown in Table 2. We can see that our method achieves an average error
of 1.3 mm for femur segmentation, and 2.2 mm for pelvis segmentation.
Evaluated on all the 192 annotated test images, our method succeeded in 189
for femur segmentation, with a success rate of 98.4%. A segmentation is classified
as successful if the average point-to-curve distance is smaller than 4 mm. For
pelvis segmentation, 19 images out of the 192 do not contain a complete pelvis
structure (as in Fig. 4(d)) and are naturally excluded from the evaluation. Among

Fig. 4. Segmentation examples obtained with our method. Yellow: pelvis; green: femur.

the 173 valid images for pelvis, our method succeeded in 169 with a success rate
of 97.7%.
Our method takes around 5 minutes to process one image with an unoptimized
Matlab implementation.

3.3 Evaluation of the Sparse Shape Composition Model

To evaluate the effectiveness of our shape model, we compare with the PCA based
shape model (as in [2]), for which the result is shown in Table 3. Comparing Table
3 with Table 2, we see that our shape model outperforms the PCA based one.
Note that the result reported here is not directly comparable with that of [2]
for several reasons. First, we perform both femur and pelvis segmentation,
while [2] only segments the femur. Second, we model femur contour details such
as the lesser trochanter, and these details are missing in [2], which uses a simplified
model. Third and most importantly, in a part of our test images the regions to be
segmented are occluded by implants (see Fig. 4).

Table 2. Quantitative result of our method (errors in mm)

Anatomy  Success rate  Median  Min.  Max.  Mean  Std.  97.5th percentile
Femur    98.4%         1.2     0.6   3.4   1.3   0.6   2.7
Pelvis   97.7%         2.1     1.0   3.7   2.2   0.5   3.4

Table 3. Quantitative evaluation using the PCA based shape model (errors in mm)

Anatomy  Success rate  Median  Min.  Max.  Mean  Std.  97.5th percentile
Femur    97.1%         1.3     0.6   3.6   1.4   0.6   3.0
Pelvis   95.4%         2.4     1.2   3.8   2.5   0.6   3.5

4 Conclusions
We have proposed a new fully-automatic method for left femur and pelvis seg-
mentation in conventional X-ray images. Our method features a hierarchical
segmentation framework and a shape model based on sparse representation. Ex-
periments show that our method achieves good results, and that the different
contributions (feature selection, shape model) indeed improve the performance.
Although we demonstrate our method using the left femur and left pelvis, our
method can be readily extended to the right side. In the future, we are also
interested in extending our method into 3D data.

Acknowledgements. This work was partially supported by the Swiss National


Science Foundation via Project 51NF40-144610.

References
1. Chen, Y., Ee, X., Leow, W.-K., Howe, T.S.: Automatic extraction of femur contours
from hip X-ray images. In: Liu, Y., Jiang, T.-Z., Zhang, C. (eds.) CVBIA 2005.
LNCS, vol. 3765, pp. 200–209. Springer, Heidelberg (2005)
2. Lindner, C., Thiagarajah, S., Wilkinson, J.M., Wallis, G.A., Cootes, T.F.: Accurate
fully automatic femur segmentation in pelvic radiographs using regression voting.
In: Ayache, N., Delingette, H., Golland, P., Mori, K. (eds.) MICCAI 2012, Part
III. LNCS, vol. 7512, pp. 353–360. Springer, Heidelberg (2012)
3. Gottschling, H., Roth, M., Schweikard, A., Burgkart, R.: Intraoperative,
fluoroscopy-based planning for complex osteotomies of the proximal femur. IJMR-
CAS 1(3), 33–38 (2005)
4. Baka, N., Kaptein, B.L., Bruijne, M., van Walsum, T., Giphart, J.E., Niessen, W.J.,
Lelieveldt, B.P.: 2D-3D shape reconstruction of the distal femur from stereo x-ray
imaging using statistical shape model. Med. Image Anal. 15(6), 840–850 (2001)
5. Dong, X., Zheng, G.: Automatic extraction of proximal femur contours from cali-
brated x-ray images using 3D statistical models: an in vitro study. IJMRCAS 5(2),
213–222 (2009)
6. Cristinacce, D., Cootes, T.: Automatic feature localization with constrained local
models. Pattern Recognition 41(19), 3054–3067 (2008)
7. Zhou, S.K., Comaniciu, D.: Shape regression machine. In: Karssemeijer, N.,
Lelieveldt, B. (eds.) IPMI 2007. LNCS, vol. 4584, pp. 13–25. Springer, Heidelberg
(2007)
8. Zheng, Y., Barbu, A., Georgescu, B., Scheuering, M., Comaniciu, D.: Four-chamber
heart modeling and automatic segmentation of 3-D cardiac CT volumes using
marginal space learning and steerable features. IEEE T. Med. Imaging 27(11),
1668–1681 (2008)
9. Pauly, O., Glocker, B., Criminisi, A., Mateus, D., Möller, A.M., Nekolla, S., Navab,
N.: Fast multiple organ detection and localization in whole-body MR Dixon se-
quences. In: Fichtinger, G., Martel, A., Peters, T. (eds.) MICCAI 2011, Part III.
LNCS, vol. 6893, pp. 239–247. Springer, Heidelberg (2011)
10. Criminisi, A., Shotton, J., Robertson, D., Konukoglu, E.: Regression forests for
efficient anatomy detection and localization in CT studies. In: MCV 2010, pp.
106–117 (2010)

11. Gall, J., Lempitsky, V.: Class-specific Hough forests for object detection. In: CVPR,
pp. 1022–1029 (2009)
12. Zhang, S., Zhan, Y., Dewan, M., Huang, J., Metaxas, D.N., Zhou, X.S.: Sparse
shape composition: a new framework for shape prior modeling. In: CVPR (2011)
13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR, vol. I, pp. 886–893 (2005)
14. Chen, C., Yang, Y., Nie, F., Odobez, J.M.: 3D human pose recovery from image
by efficient visual feature selection. CVIU 115(3), 290–299 (2011)
15. Cootes, T.F., Taylor, C.J.: Active shape models-‘smart snakes’. In: BMVC (1992)
Language Adaptive Methodology
for Handwritten Text Line Segmentation

Subhash Panwar1, Neeta Nain1 , Subhra Saxena2 , and P.C. Gupta3


1 Dept. of Computer Engineering, Malaviya National Institute of Technology Jaipur
2 School of Engineering and Technology, Jaipur National University, Jaipur
3 Dept. of Computer Science and Informatics, University of Kota, Kota

Abstract. Text line segmentation in handwritten documents is a very
challenging task because curved text lines appear frequently in handwritten
documents. In this paper, we have implemented a general line segmentation
approach for handwritten documents in various languages.
A novel connectivity strength parameter is used for deciding the groups
of components which belong to the same line. Over-segmentation is
also removed with the help of a depth-first search approach and the iterative
use of the CSF. We have implemented and tested this approach with English,
Hindi and Urdu text images taken from a benchmark database and
found that it is a language adaptive approach which provides encouraging
results. The average accuracy of the proposed technique is 97.30%.

Keywords: Handwritten text recognition, Line Segmentation, Con-


nected component, Connectivity strength function.

1 Introduction
Text line segmentation of handwritten documents is much more difficult than
that of printed documents. Unlike the printed documents which have approxi-
mately straight and parallel text lines, the lines in handwritten documents are
often non-uniformly skewed and curved. Moreover, the spaces between handwritten
text lines are often not obvious compared to the spaces between within-line
characters, and some text lines may interfere with each other. Therefore many
text line detection techniques, such as projection analysis [7] [5], Hough transform
[6] and K-nearest neighbour connected components (CCs) grouping [9],
are not able to segment handwritten text lines successfully, and a uniform
approach that handles all kinds of challenges is still not available. Figure 1 shows
an example of an unconstrained handwritten document. Text document image seg-
mentation can be roughly categorized into three classes: top-down, bottom-up,
and hybrid. Top-down methods partition the document image recursively into
text regions, text lines, and words/characters with the assumption of straight
lines. Bottom-up methods group small units of image (pixels, CCs, characters,
words, etc.) into text lines and then text regions. Bottom-up grouping can be
viewed as a clustering process, which aggregates image components according to
proximity and does not rely on the assumption of straight lines. Hybrid methods


Fig. 1. Example image of a general handwritten text paragraph from IAM dataset [4]

combine bottom-up grouping and top-down partitioning in different ways.


All three approaches have their advantages and disadvantages. Top-down
methods work well for typed text where the text lines are relatively horizontal,
but they do not perform well on curved and overlapping text lines. The
performance of bottom-up grouping relies on some heuristic rules or artificial
parameters, such as the between-component distance metric for clustering. On
the other hand, hybrid methods are complicated in computation, and the design
of a robust combination scheme is non-trivial.
In the graph representation of an image, each component is represented as a vertex and
the distance calculated between the CCs is represented as a weighted edge.
Then, we may find the minimum spanning tree of the given image, and the
segmentation is made by comparing with a pre-determined distance, which
may be an inter-word distance or an intra-word distance [8].
In [1] the authors use image meshing for local line detection in the presence of
multi-oriented lines. The Wigner-Ville distribution and a projection histogram
are used to determine the local orientation. This local orientation is then enlarged
to limit the orientation in the neighbourhood.
In [2] the text lines are segmented using Affinity Propagation. The authors first estimate
the local orientation at each primary component to build a sparse similarity
graph, then use a shortest path algorithm to compute similarities between non-neighbouring
components. Affinity propagation and breadth-first search are
used to obtain coarse text lines.
In [3], the line segmentation algorithm is based on locating the optimal succession
of text and gap areas within vertical zones by applying the Viterbi algorithm;
a text line separator drawing technique is then applied and finally the connected
components are assigned to text lines.
We propose an effective bottom-up grouping method for text line segmentation
of unconstrained handwritten text documents. Our approach is based on
the minimum spanning tree (MST), grouping of CCs, and the connectivity strength
function (CSF).

Fig. 2. M ST generated for the text paragraph shown in Figure 1, the green pixels
mark the centroids of every connected component, and the red lines depict the edges
of the M ST of the graph of the connected components of the same figure

2 The Proposed Line Segmentation Method


In this paper, we first extract the CCs from the binarized image. To construct a
line from these CCs, we calculate the centroid of every CC, as depicted in Figure 2
with green coloured pixels. Using these centroids of the CCs as vertices of a graph,
we calculate the cost matrix, where the cost of an edge is the
distance between two vertices. The MST is then calculated for the graph, as
shown in Figure 2. The MST in Figure 2 also has some mis-qualified
words linked to words of lines they do not belong to. For removing such connections
(edges) from the MST, we further use a connectivity strength function, explained
below, which is very useful in deciding the groups of components that belong
to the same line. Thus the mis-aligned edges are removed from the MST and
we generate the correct forest of the CCs.
The Connectivity Strength Function (CSF) is derived as follows. Let there be two
connected components C1 and C2 with centroids (x1, y1) and (x2, y2), respectively.
The minimum distance d between the two components is
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
and the vertical distance y_d is
$$y_d = (y_2 - y_1),$$
then the CSF is defined as
$$CSF = \frac{|d - y_d|}{y_d}.$$
For each pair of connected components in the MST we compute the value of the CSF.
The decision for grouping the components depends on the boundary values of the CSF:
$$CSF = \begin{cases} 0 & \text{the components belong to different lines,} \\ \infty & \text{the components belong to the same line,} \end{cases}$$

Fig. 3. Illustration of boundary values of CSF

Fig. 4. Forest remaining after removing the weak edges from Figure 1 using the CSF

where CSF = 0 only when d = y_d, which means the two components have the
minimum connectivity strength: they are orthogonal and hence belong to
different lines, which are almost parallel. CSF = ∞ only when y_d = 0,
which means the connectivity between the two components is the strongest, as
they both belong to the same line; the angle between the two components is
zero, aligning them on the same line as shown in Figure 3.
Thus, after applying the CSF rules to Figure 1, we remove the mis-aligned
components from the text lines and generate the forest of the given document image
as shown in Figure 4, where a forest is defined as a group of trees and
every tree is a text line. Text lines may be over-segmented in a single iteration
of the CSF. To overcome this, we apply the same process to the forest, treating
each tree of the forest as a single node, finding the connected graph and applying
the CSF approach again. Finally we obtain a forest having a single tree for every text
line, as shown in Figure 4.
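
A direct implementation of the CSF takes only a few lines. The following is our own sketch (not the authors' code); we use |y2 − y1| so that the value does not depend on the order of the two components:

```python
import numpy as np

def csf(c1, c2):
    """Connectivity strength between two component centroids c1 = (x1, y1)
    and c2 = (x2, y2). Large values indicate components on the same text
    line; values near zero indicate components on different lines."""
    d = np.hypot(c2[0] - c1[0], c2[1] - c1[1])   # Euclidean distance
    yd = abs(c2[1] - c1[1])                      # vertical distance
    return np.inf if yd == 0 else abs(d - yd) / yd
```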

Fig. 5. (a) Example image of a Hindi handwritten document. (b) Forest remaining after
applying the CSF.

Figure 5(a) shows an example of a Hindi handwritten document, and Figure 5(b) shows
the result using the proposed approach. The complete process is enumerated in Algorithm 1.

3 Experimental Results

The experiments are done on a wide variety of handwritten document images
in different languages, namely Hindi, English and Urdu. To cover all the cases,
such as skewed lines, curved lines and touching lines, some images are randomly
selected from the large IAM [4] database of handwritten documents and some are
collected from different writers writing in different languages. The various cases are
enumerated below.

1. Curved lines: Figure 6(A) shows an example of curved handwritten lines.
Projection profile techniques [7] fail completely in segmenting such curved lines.
Figure 6(B) also shows that the generated MST, when used by clustering
techniques [9] with an inter-word distance for line segmentation, would give
erroneous results. This shortcoming of existing techniques is overcome by the
application of the CSF, which gives the correct result. Figure 7(a) shows an
example of an Urdu handwritten document with curved lines, and Figure 7(b)
shows the result using the proposed approach.
2. Skewed lines: An example image of a handwritten document with skewed lines
is shown in Figure 8(A); after finding the MST of the given example image,
we apply the CSF and find the exact forest of text lines, which is shown
in Figure 8(B). Here the traditional methods would also fail.
Again it is ascertained that the CSF improves the accuracy of line segmentation
in the presence of skewed lines in the documents.

Algorithm 1. Text Line Segmentation.

Require: I – binarized text document image with background as 0.
Ensure: F – forest of text lines.
1. Compute connected components CC_i using 8-connectivity.
2. Compute the centroid of every CC_i: (c_x, c_y) = ( (1/M) Σ_j x_j , (1/M) Σ_j y_j ), where (x_j, y_j) ∈ CC_i and M is the number of pixels in the i-th CC.
3. Compute the cost matrix d_{m,n} of the CCs: d_{i,j} = sqrt( (x_i − x_j)^2 + (y_i − y_j)^2 ).
4. Scan the CCs with a DFS of the graph G built from the cost matrix d_{m,n}.
5. For every vertex V ∈ G with degree(V) ≥ 2 in the DFS sequence, apply the CSF to every CC connected with V: y_d = (y_2 − y_1), CSF = |d − y_d| / y_d.
6. Calculate Th_CSF (the minimum vertical distance between two CC_i).
7. Remove the weak edges wherever CSF ≤ Th_CSF.
8. Compute centroids and the cost matrix for the forest F, where every tree T_i ∈ F is treated as a vertex of the graph, and remove the weak edges wherever CSF ≈ 0.
9. Return F – a forest having one tree for every single text line.
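
The pipeline of Algorithm 1 can be prototyped compactly with SciPy. The following is a simplified sketch of our own (a fixed threshold th_csf replaces the adaptive Th_CSF, and the second grouping pass over trees is omitted), intended only to illustrate the MST construction and the CSF-based pruning:

```python
import numpy as np
from scipy.ndimage import label, center_of_mass
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def segment_lines(binary_img, th_csf=1.0):
    """MST + CSF grouping sketch: returns the number of detected lines and a
    per-component line label for a binarized image (foreground > 0)."""
    lbl, n = label(binary_img, structure=np.ones((3, 3)))            # 8-connectivity
    cents = np.array(center_of_mass(binary_img, lbl, range(1, n + 1)))  # (row, col)
    cost = squareform(pdist(cents))                                   # Euclidean cost matrix
    mst = minimum_spanning_tree(cost).toarray()

    keep = np.zeros_like(mst, dtype=bool)
    for i, j in zip(*np.nonzero(mst)):
        d = mst[i, j]
        yd = abs(cents[j, 0] - cents[i, 0])          # vertical distance (row axis)
        csf_val = np.inf if yd == 0 else abs(d - yd) / yd
        keep[i, j] = csf_val > th_csf                # keep only strong edges

    n_lines, line_of = connected_components(csr_matrix(keep), directed=False)
    return n_lines, line_of
```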

Fig. 6. (A) Example image of curved lines in a handwritten text document. (B) Forest
remaining after applying the CSF.

Fig. 7. (a) Example image of an Urdu handwritten document. (b) Forest remaining after
applying the CSF.

Fig. 8. (A) Example image of skewed lines in a handwritten text document. (B) Forest
remaining after applying the CSF.

The experimental results of the proposed line segmentation approach show that
the proposed CSF improves line segmentation accuracy significantly in all cases.
The proposed method was also compared with other state-of-the-art methods in
experiments on the large IAM [4] handwritten document data set,
and its superiority was demonstrated. The accuracy rate of the proposed text
line segmentation method is summarized in Table 1.

Table 1. Accuracy rate of proposed text line segmentation using CSF

Line types     Total no. of lines  Accurately detected lines  Accuracy rate (%)
Printed lines  320                 320                        100
Skewed lines   2600                2520                       96.92
Curved lines   1750                1670                       95.42

4 Conclusions
In this paper, a language adaptive approach for handwritten text line segmentation
with the CSF has been presented and applied to the IAM dataset and to documents
collected from different writers in different languages, namely Hindi, English
and Urdu. The proposed text line segmentation approach with the novel use of
the CSF has the advantage of language adaptivity with highly curved and skewed
text lines. From the experiments, an average accuracy of 97.30% was observed.
The result obtained by this segmentation is a forest of lines. It shows
that the proposed system is capable of accurately locating the text lines in images
and documents. Future work mainly concerns the sequential arrangement
of all lines of the forest according to their appearance in the paragraph, so that the
sequential strokes are sent to the next step of the recognition system.

References
1. Ouwayed, N., Belaid, A.: A general approach for multi-oriented text line extraction
of handwritten documents. IJDAR 15(4), 297–314 (2012)
2. Kumar, J., et al.: Handwritten Arabic text line segmentation using affinity propa-
gation. In: DAS 2010, pp. 135–142 (2010)
3. Papavassiliou, V., et al.: Handwritten document image segmentation into text lines
and words. Pattern Recognition 43, 369–377 (2010)
4. Marti, U., Bunke, H.: The IAM-database: An English Sentence Database for Off-
line Handwriting Recognition. Int. Journal on Document Analysis and Recognition,
IJDAR 5, 39–46 (2002)
5. Likforman-Sulem, L., Zahour, A., Taconet, B.: Text line segmentation of historical
documents: A survey. IJDAR 9, 123–138 (2007)
6. Likforman-Sulem, L., Hanimyan, A., Faure, C.: A Hough based algorithm for ex-
tracting text lines in handwritten documents. In: Proc. 3rd Int. Conf. on Document
Analysis and Recognition, pp. 774–777 (1995)
7. Zamora-Martinez, F., Castro-Bleda, M.J., España-Boquera, S., Gorbe-Moya, J.: Un-
constrained offline handwriting recognition using connectionist character N-grams.
In: The 2010 International Joint Conference on Neural Networks (IJCNN), July
18-23, pp. 1–7 (2010)
8. Yin, F., Liu, C.-L.: Handwritten Chinese text line segmentation by clustering with
distance metric learning. Pattern Recognition (Elsevier) 42(12), 3146–3157 (2009)
9. Kumar, M., Jindal, M.K., Sharma, R.K.: K-nearest neighbour Based offline Hand-
written Gurumukhi Character Recognition. In: International IEEE Conference on
Image Information Processing (ICHP 2011), vol. 1, pp. 7–11 (2011)
Learning Geometry-Aware Kernels
in a Regularization Framework

Binbin Pan1 and Wen-Sheng Chen2


1
Shenzhen University
[email protected]
2
Shenzhen University
[email protected]

Abstract. In this paper, we propose a regularization framework for learn-


ing geometry-aware kernels. Some existing geometry-aware kernels can be
viewed as instances in our framework. Moreover, the proposed framework
can be used as a general platform for developing new geometry-aware ker-
nels. We show how multiple sources of information can be integrated in
our framework, allowing us to develop more flexible kernels. We present
some new kernels based on our framework. The performance of the kernels
is evaluated on classification and clustering tasks. The empirical results
show that our kernels significantly improve the performance.

1 Introduction
There has recently been a surge of interest in learning algorithms that are aware
of the geometric structure of the data. These algorithms have been successfully
applied to pattern recognition, image analysis, data mining, etc. A kernel function,
which defines the similarity between data points in a Reproducing Kernel Hilbert
Space (RKHS), can capture the structure of the data. Thus, the use of kernels for
learning the geometric structure of the data has received a significant amount
of attention. Such kernels are called geometry-aware kernels. Algorithms for
learning geometry-aware kernels can be roughly classified into two categories.
Algorithms in the first category only explore the geometric structure of the
data, ignoring other sources of information. Kondor and Lafferty propose Diffusion
Kernels, which originate from the heat equation on a geometric manifold
and are aware of the data geometry [1]. Smola and Kondor show that the spectrum
of the graph Laplacian can be passed through various filter functions, leading
to a family of geometry-aware kernels [2]. Some examples of kernels are given,
including the Regularized Laplacian Kernel, Diffusion Kernel, Random Walk Kernel
and Inverse Cosine Kernel. Some well-known algorithms for dimensionality reduction
on manifolds can be unified in a kernel perspective [3]; these algorithms can
be interpreted as kernel PCA with specifically constructed Gram matrices. Also,
researchers focus on learning geometry-aware kernels for nonlinear dimensionality
reduction [4,5]. These methods are unsupervised and well-suited to the task of
dimensionality reduction. However, they may not give satisfactory performance
on supervised tasks.


The second category learns kernels from multiple sources of information, which
include the data geometry, side information and so on. Sindhwani et al. show how
standard kernels can be adapted to incorporate the data geometry while retaining
the out-of-sample extension [6]. Song et al. show a variant of Maximum
Variance Unfolding that is aware of the data geometry and side information [7].
Learning geometry-aware kernels from nonparametric transforms of the graph
Laplacian is discussed in the semi-supervised learning scenario [8]. Some studies
focus on learning nonparametric geometry-aware kernels with the help of manifold
regularization [9]. Compared with the algorithms in the first category, these
methods are more suitable for supervised tasks since the task-related information
is incorporated into them.
In this paper, we present a general framework for learning geometry-aware
kernels. Our framework involves an optimization problem which minimizes a
divergence between the learnt kernel matrix and a given prior matrix, along with
a regularization term. Some existing geometry-aware kernels can be unified in
our framework. Furthermore, new geometry-aware kernels can be developed. We
will show how to integrate multiple sources of information within the framework,
leading to a family of algorithms obtained by choosing different divergences, prior
matrices and regularization terms. Empirical results indicate that our algorithms
significantly improve the performance.

2 The Regularization Framework


2.1 Problem Formulation
Given a prior kernel matrix K0, we investigate how to learn a geometry-aware
kernel from the prior kernel matrix and the geometric structure of the data. We
formulate the learning problem as follows:

    min_K   D(K, K0) + γ Ω(K)
    s.t.    K ⪰ 0,                                                        (1)

where D(·, ·) is the divergence between two matrices, Ω(·) is a regularization
term, γ > 0 is the regularization trade-off, and K ⪰ 0 means that K is a positive
semi-definite matrix. The regularizer Ω(K) should measure the complexity of
preserving the geometric structure.
If D(K, K0) and Ω(K) are convex functionals with respect to K, then (1)
is a convex optimization problem with a global minimum. However, the positive
semi-definiteness constraint makes the problem hard to solve directly. Some papers
reformulate problems with such a constraint as semi-definite programs, leading to
algorithms involving expensive computation [4]. In this paper, we adopt the
Bregman divergence to measure the discrepancy between the matrices. As we will
see later, this divergence is well-suited to positive semi-definite matrices.
354 B. Pan and W.-S. Chen

2.2 Choosing D(·, ·)


We choose D(·, ·) as the Bregman divergence. Let F : Δ → R be a continuously-
differentiable, real-valued and strictly convex function defined on a closed convex
set Δ. The Bregman divergence associated with F for K, K0 ∈ Δ is:

    D_F(K, K0) = F(K) − F(K0) − tr((K − K0) ∇F(K0)^T).                    (2)

The Bregman divergence represents a class of distances and divergences. For
instance, if F(K) = ‖K‖_F², then the resulting Bregman divergence is the squared
Frobenius norm ‖K − K0‖_F². When choosing F(K) = tr(K log K − K), then
∇F(K) = log K and the corresponding Bregman divergence becomes the von
Neumann divergence:

    D_F(K, K0) = tr(K log K − K log K0 − K + K0).                         (3)

If we choose F(K) = −log det K, then the gradient is ∇F(K) = −K^{-1}. The
corresponding Bregman divergence is the LogDet divergence:

    D_F(K, K0) = tr(K K0^{-1}) − log det(K K0^{-1}) − n.                  (4)

2.3 Choosing Ω(·)


The regularization term Ω(·) is chosen as a manifold regularization [10], which
can exploit the geometric structure. Given a data set X = {x1, · · · , xn}, we can
build a weighted undirected graph with adjacency matrix W to describe the local
neighborhood relations between the data points. The entries W_ij > 0 if the ith
and jth data points are neighbors, otherwise W_ij = 0. The neighbor relations
can be defined in terms of symmetric nearest neighbors or an ε-ball distance
criterion. The non-zero weights in W can be chosen as W_ij = 1, or according to
a heat kernel W_ij = exp(−‖x_i − x_j‖²/t), where t > 0.
Suppose that the data were sampled from a smooth manifold which is approximated
by a graph G. We seek a nonlinear mapping φ which embeds the graph G into an
RKHS so that connected points stay as close together as possible. Mathematically,
we need to minimize

    (1/2) Σ_{i,j} ‖ φ(x_i)/√(D_ii) − φ(x_j)/√(D_jj) ‖² W_ij = tr(KL),     (5)

where K is the kernel matrix on X associated with φ, i.e., K_ij = φ(x_i)^T φ(x_j),
L = I − D^{-1/2} W D^{-1/2} is the normalized Laplacian matrix, and D is the
diagonal matrix with entries D_ii = Σ_j W_ji. Thus, the regularization term is
Ω(K) = tr(KL).
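As an illustration of this construction, the following is a minimal sketch (a hypothetical helper of our own, assuming NumPy, a symmetric k-nearest-neighbour graph and heat-kernel weights) that builds W, D and the normalized Laplacian L used in Equation (5).

import numpy as np

def normalized_laplacian(X, k=5, t=1.0):
    """Heat-kernel weighted k-NN graph and L = I - D^{-1/2} W D^{-1/2}."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-loops
    W = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbours of each point
    for i in range(n):
        W[i, nn[i]] = np.exp(-d2[i, nn[i]] / t)          # heat-kernel weights
    W = np.maximum(W, W.T)                               # symmetrise the neighbour relation
    D = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(D, 1e-12)))
    return np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt       # normalized Laplacian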

2.4 Algorithm
The kernel learning problem with manifold regularization is

    min_K   F(K) − F(K0) − tr((K − K0) ∇F(K0)^T) + γ tr(KL)
    s.t.    K ⪰ 0.                                                        (6)

Problem (6) is a convex program since the Bregman divergence is convex with
respect to its first argument. Ignoring the positive semi-definiteness constraint
and setting the derivative with respect to K to zero, we have

    ∇F(K) − ∇F(K0) + γL = 0.                                              (7)

The solution is obtained in closed form:

    K = (∇F)^{-1}(∇F(K0) − γL).                                           (8)

We will show how the positive semi-definiteness constraint can be satisfied
automatically by choosing various kinds of Bregman divergence. Specifically, we
present the following kernels:

    K = K0 − (γ/2) L              (squared Frobenius norm)                (9)
    K = exp(log K0 − γL)          (von Neumann divergence)                (10)
    K = (K0^{-1} + γL)^{-1}       (LogDet divergence)                     (11)

A small γ should be set to ensure the positive semi-definiteness of Equation (9);
note that the eigenvalues of L are no more than 2. Equation (10) is positive
definite since the exponent log K0 − γL is a symmetric matrix and the matrix
exponential maps a symmetric matrix to a positive definite matrix. Equation (11)
is positive definite because of the positive definiteness of K0 and the positive
semi-definiteness of L.
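A minimal sketch of the three closed-form solutions (9)–(11) follows (assuming NumPy, a precomputed prior kernel matrix K0 and normalized Laplacian L such as the one built above; the function names are ours, not the authors').

import numpy as np

def frobenius_kernel(K0, L, gamma):
    # Equation (9): requires a small gamma so that K stays positive semi-definite
    return K0 - 0.5 * gamma * L

def von_neumann_kernel(K0, L, gamma, eps=1e-10):
    # Equation (10): exp(log K0 - gamma*L), via symmetric eigendecompositions
    n = K0.shape[0]
    w, V = np.linalg.eigh(K0 + eps * np.eye(n))          # eps keeps log K0 well defined
    log_K0 = V @ np.diag(np.log(w)) @ V.T
    w2, V2 = np.linalg.eigh(log_K0 - gamma * L)
    return V2 @ np.diag(np.exp(w2)) @ V2.T

def logdet_kernel(K0, L, gamma, eps=1e-10):
    # Equation (11): (K0^{-1} + gamma*L)^{-1}
    n = K0.shape[0]
    return np.linalg.inv(np.linalg.inv(K0 + eps * np.eye(n)) + gamma * L)

Choosing K0 as the identity, as a Gaussian or linear Gram matrix, or as the ideal kernel y y^T then gives the variants discussed in the next section.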

3 Discussions

3.1 Relation to Related Work


We present that some geometry-aware kernels can be unified in our framework.

Diffusion Kernel. We choose K0 = I in Equation (10), which means that there is
no prior similarity information in K0. We have log I = O, where O is the zero
matrix, and then the solution is

    K = exp(−γL).                                                         (12)

Equation (12) is the diffusion kernel proposed in [1].

Regularized Laplacian Kernel. Setting K0 = I in Equation (11), we arrive at

    K = (I + γL)^{-1}.                                                    (13)

This is the Regularized Laplacian Kernel presented in [2].

1-Step Random Walk Kernel. Given K0 = I in Equation (9), we obtain the
following kernel:

    K = I − (γ/2) L.                                                      (14)

When choosing γ ≤ 1, K is positive semi-definite. It becomes the 1-Step Random
Walk Kernel proposed in [2].

Sindhwani’s Work. Note that Equation (11) is the Gram matrix of the kernel
proposed in [6]. Given new test data, we can compute the out-of-sample ex-
tension in Equation (11) by enlarging K0 with additional test data and setting
L̃ = diag(L, O). The resulting formulation is the same as the kernel function pre-
sented in [6]. Therefore, we reinterpret Sindhwani’s work in our regularization
framework.

3.2 Integrating Multiple Sources of Information


The data geometry is incorporated through the manifold regularization. Other
sources of information can be integrated by the specific choice of K0. If K0
is chosen as the identity matrix, we only use the geometric information, ignoring
other sources of information.
When choosing K0 as the Gram matrix generated by a Gaussian or linear kernel,
we integrate the information of the ambient space. When the manifold assumption
does not hold, or holds only to a lesser degree, the incorporation of the ambient
space may yield a better solution. Being able to trade off the ambient space and
the manifold space may be important in practice.
The supervised information can be integrated by defining K0 as the ideal
kernel:

    K0 = y y^T,                                                           (15)

where y ∈ R^n is a vector of {0, −1, +1} labels, and 0 means that the data point
is unlabeled. The ideal kernel indicates whether two given training points belong
to the same class or not. Since the manifold regularization brings the geometric
structure of the unlabeled data into the ideal kernel, the learnt geometry-aware
kernel generalizes to unlabeled data.

4 Experiments
We perform experiments on clustering and classification tasks. We use the
von Neumann divergence (Equation (10)) and choose various prior kernel matrices,
leading to three algorithms: Manifold Regularized Gaussian Kernel (MR-
Gaussian), Manifold Regularized Linear Kernel (MR-Linear) and Manifold Reg-
ularized Ideal Kernel (MR-Ideal). Comparisons are made with Diffusion Kernel
(Equation (12)), Gaussian kernel κG and linear kernel κL :

κG (x, y) = exp(− x − y 2 /t), (16)


κL (x, y) = x, y , (17)
where t > 0 is the width of Gaussian kernel.

4.1 Classification
We consider the transductive learning where the test data are available in ad-
vance before the classifier is learned. The classification experiments are designed
using USPS dataset which contains 16 × 16 grayscale images of handwritten dig-
its. We select challenging tasks where the digits are similar: 2 vs 3, and 5 vs 6.
The first 400 images for each digit were taken to form the dataset. Training data
are normalized to zero mean and unit variance. The same processing settings are
applied to the test data. The parameter of the Gaussian kernel is tuned via 2-fold
cross validation. The trade-off γ in (10) is fixed to 10. We compute the accuracy
using a 1-norm soft-margin SVM classifier where the regularization parameter C = 1.
The results are averaged over 20 runs.
The number of training data per digit is fixed to 10 and the total number of
images per digit is varied within the set {25, 50, 100, 200, 400}. We wish to see
how the accuracies vary with accurately and inaccurately recovered manifolds. The
averaged performance is tabulated in Tables 1 and 2. The accuracies of the Gaussian
and linear kernels change only slightly with the total number of data. Since these
two kernels do not consider the geometric structure of the data, the additional
data have little influence on the performance. The Diffusion Kernel performs
unstably. The performance of MR-Gaussian and MR-Linear is unsatisfactory
with 25 data points, but improves as more and more data become available, and so
does MR-Ideal. This is in accord with our conjecture: when the total number of data
is very limited, the recovered manifold is inaccurate, and the kernel using this
inaccurate information yields worse performance. But once the data are sufficient
to reconstruct the manifold, our algorithms give higher accuracies.

Table 1. Accuracy (%) of 2 vs 3 task (#training=10). The best results are highlighted.

# data Gaussian Linear Diffusion MR-Gaussian MR-Linear MR-Ideal


25 91.33±5.23 92.50±4.70 91.17±4.36 79.83±11.67 82.17±15.72 90.17±5.67
50 90.06±3.88 91.06±2.99 89.69±5.97 84.81±9.66 87.50±6.80 88.38±7.49
100 89.31±6.64 91.31±3.07 90.25±7.95 90.08±5.47 93.97±3.29 91.00±5.89
200 88.30±6.63 90.84±2.69 91.37±9.53 94.49±2.19 95.96±1.88 95.04±3.30
400 88.74±5.67 90.90±2.80 82.45±16.01 96.07±0.68 97.04±1.08 96.56±0.85

Table 2. Accuracy (%) of 5 vs 6 task (#training=10). The best results are highlighted.

# data Gaussian Linear Diffusion MR-Gaussian MR-Linear MR-Ideal


25 88.00±8.19 92.83±4.22 89.67±9.48 74.50±10.56 78.17±12.26 94.67±6.70
50 87.94±11.65 93.44±4.37 91.88±6.63 88.06±10.61 95.75±3.57 95.13±5.21
100 89.14±9.79 94.08±2.91 90.64±11.64 92.14±7.38 97.39±1.71 96.28±4.80
200 91.01±4.16 93.92±2.98 84.87±13.47 93.59±10.92 98.63±0.69 95.25±9.57
400 90.86±4.36 93.47±2.86 76.14±13.51 97.82±2.18 98.94±0.70 98.09±1.06

4.2 Clustering
The MNIST dataset is used for clustering. This dataset contains 28×28 grayscale
images of handwritten digits. We also take the first 400 images for each digit to
form the dataset. The data are normalized to zero mean and unit variance. The
parameter of the Gaussian kernel is fixed to 10⁴. The trade-off γ in (10) is also fixed
to 10. We use the kernel k-means algorithm and evaluate the performance by
computing the Normalized Mutual Information (NMI) and the clustering accuracy.
For two random variables A and B, the NMI is defined as:

    NMI = I(A, B) / √(H(A) H(B)),                                         (18)

where I(A, B) is the mutual information between the random variables A and B,
and H(A) is the Shannon entropy of A. A high NMI value indicates that the cluster
labels and the true labels match well. The clustering accuracy is defined as:

    Accuracy = ( Σ_{i=1}^{n} δ(y_i, map(c_i)) ) / n,                      (19)

where n is the number of data, y_i denotes the true label and c_i denotes the
corresponding cluster label, δ(y, c) is a function that equals 1 if y = c and 0
otherwise, and map(·) is a permutation function that maps each cluster label to a
true label. This optimal matching can be found with the Hungarian algorithm.
We run kernel k-means on the dataset 10 times with random initialization, then
average the NMI and Accuracy values.
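For reference, a short sketch follows of how Equations (18) and (19) can be evaluated (assuming SciPy and scikit-learn are available; the helper name is ours), with the Hungarian matching done by linear_sum_assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Equation (19): accuracy under the best one-to-one cluster-to-class mapping."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency[i, j] = number of points of true class i placed in cluster j
    contingency = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                            for c in classes])
    rows, cols = linear_sum_assignment(-contingency)   # maximise correctly matched points
    return contingency[rows, cols].sum() / len(y_true)

# Equation (18): 'geometric' averaging divides I(A,B) by sqrt(H(A)H(B))
# nmi = normalized_mutual_info_score(y_true, y_pred, average_method='geometric')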
The experiment is designed to investigate the impact of the number of manifolds
on the performance. Since each digit can be viewed as a submanifold of the input
space, we vary the number of digits within the set {2, 3, · · · , 10}. We first adopt
digits 1 and 2 for evaluation, and then add the other digits one by one. The results
are shown in Figure 1. As the number of digits increases, the performance of all
algorithms tends to degrade, because the mixture of manifolds complicates the
problem. Our MR-Gaussian outperforms the other two algorithms in almost all
cases, except for one situation. This demonstrates that the incorporation of the
manifold structure provides advantages in clustering.

Fig. 1. MNIST results: (a) accuracy values (%) and (b) NMI values versus the number of digits (2 to 10), for the MR-Gaussian, Gaussian and Diffusion kernels.

5 Conclusions

We have presented a regularization framework for learning geometry-aware kernels.
Our framework includes some existing geometry-aware kernels as special cases.
Furthermore, we develop new geometry-aware kernels by integrating other sources
of information. In future research, we will study how to incorporate more sources
of information into our framework by defining more flexible regularization terms.

Acknowledgement. This work is partially supported by Natural Science Foun-


dation of SZU (grant no. 00035693), NSF of China (61272252) and Science &
Technology Planning Project of Shenzhen City (JC201105130447A).

References
1. Kondor, R.I., Lafferty, J.: Diffusion kernels on graphs and other discrete input
spaces. In: Proceedings of the 19th Annual International Conference on Machine
Learning, pp. 315–322 (2002)
2. Smola, A.J., Kondor, R.: Kernels and regularization on graphs. In: Schölkopf, B.,
Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 144–158.
Springer, Heidelberg (2003)
3. Ham, J., Lee, D., Mika, S., Schölkopf, B.: A kernel view of the dimensionality
reduction of manifolds. In: Proceedings of the 21st Annual International Conference
on Machine Learning, pp. 47–54 (2004)
4. Weinberger, K., Sha, F., Saul, L.: Learning a kernel matrix for nonlinear dimen-
sionality reduction. In: Proceedings of the 21st Annual International Conference
on Machine Learning, pp. 839–846 (2004)
5. Lawrence, N.D.: A unifying probabilistic perspective for spectral dimensionality
reduction: Insights and new models. The Journal of Machine Learning Research 12,
1609–1638 (2012)
6. Sindhwani, V., Niyogi, P., Belkin, M.: Beyond the point cloud: from transductive
to semi-supervised learning. In: Proceedings of the 22nd International Conference
on Machine Learning, pp. 824–831 (2005)
7. Song, L., Smola, A., Borgwardt, K., Gretton, A.: Colored maximum variance un-
folding. Advances in Neural Information Processing Systems 20, 1385–1392 (2008)
8. Zhu, X., Kandola, J., Ghahramani, Z., Lafferty, J.: Nonparametric transforms of
graph kernels for semi-supervised learning. Advances in Neural Information Pro-
cessing Systems 17, 1641–1648 (2005)
9. Zhuang, J., Tsang, I., Hoi, S.: A family of simple non-parametric kernel learning
algorithms. The Journal of Machine Learning Research 12, 1313–1347 (2011)
10. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric frame-
work for learning from labeled and unlabeled examples. The Journal of Machine
Learning Research 7, 2399–2434 (2006)
Motion Trend Patterns for Action Modelling
and Recognition

Thanh Phuong Nguyen, Antoine Manzanera, and Matthieu Garrigues

ENSTA-ParisTech, 828 Boulevard des Maréchaux, 91762 Palaiseau CEDEX, France


{thanh-phuong.nguyen, antoine.manzanera, matthieu.garrigues}@ensta-paristech.fr

Abstract. A new method for action modelling is proposed, which combines the
trajectory beam obtained by semi-dense point tracking and a local binary trend
description inspired by the Local Binary Patterns (LBP). The semi-dense
trajectory approach represents a good trade-off between reliability and density
of the motion field, whereas the LBP component captures relevant elementary
motion elements along each trajectory, which are encoded into mixed descriptors
called Motion Trend Patterns (MTP). The combination of these two fast operators
allows real-time, on-line computation of the action descriptors, composed of
space-time blockwise histograms of MTP values, which are classified using a fast
SVM classifier. An encoding scheme is proposed and compared with the
state-of-the-art through an evaluation performed on two academic action video
datasets.

Keywords: Action Recognition, Semi dense Trajectory field, Local Bi-


nary Pattern, Bag of Features.

1 Introduction
Action recognition has become a very important topic in computer vision in
recent years, due to its potential applications in many domains, such as video
surveillance, human-computer interaction, and video indexing. In spite of many
proposed methods exhibiting good results on academic databases, action recognition
in real time and in real conditions is still a major challenge. Previous work cannot
be reviewed extensively here; we refer to [1] for a comprehensive survey. In the
following, we concentrate on the two classes of methods most related to our work:
trajectory based modelling, and dynamic texture methods.
An important approach for action representation is to extract features from
point trajectories of moving objects, which have long been considered an efficient
feature for representing actions. Johansson [2] showed that human
subjects can perceive a structured action such as walking from points of light
attached to the walker's body. Messing et al. [3], inspired by human psychovisual
performance, extracted features from the velocity histories of keypoints using a
KLT tracker. Sun et al. [4] used trajectories of SIFT points and encoded motion
in three levels of context information: point level, intra-trajectory context and

inter-trajectory context. Wu et al. [5] used a dense trajectory field obtained by


tracking densely sampled particles driven by optical flow. They decomposed the
trajectories into camera-induced and object-induced components using low rank
optimisation. Then a set of global measures coming from the theory of chaotic
dynamical systems are used to describe the action. Wang et al. [6] also used a
dense trajectory field. They encoded the action information using histograms
of the differential motion vectors computed along the boundary of the moving
objects. Those works have shown the benefits of using dense motion features with
respect to the sparse approaches, when using histogram based action descriptors.
On the other hand, the LBP representation [7] was introduced for texture clas-
sification. It captures local image structure thanks to a binary sequence obtained
by comparing values between neighbouring pixels. Due to its nice properties in
terms of contrast invariance and computation time, LBP is very attractive for
many applications, including action recognition. Zhao and Pietikäinen proposed
an extension (LBP-TOP) [8] to dynamic texture by computing LBP on Three
Orthogonal Planes (TOP), that was used by Kellokumpu et al. [9] in 3d space-
time to represent human movement. In another approach [10], they used classical
LBP on temporal templates (MEI and MHI, 2d images whose appearance is re-
lated to motion information). In these methods, the action is modelled using a
Hidden Markov Model to represent the dynamics of the LBPs. Recently, Yef-
fet and Wolf proposed LTP (Local Trinary Patterns) [11] that combines the
effective description of LBP with the flexibility and appearance invariance of
patch matching methods. They capture the motion effect on the local struc-
ture of self-similarities considering 3 neighbourhood circles at different instants.
Kliper-Gross et al. extended this idea to Motion Interchange Patterns [12], which
encodes local changes in different motion directions.
In this paper, we present a novel representation of human actions based on
elementary motion elements called Motion Trend Patterns (MTP), that capture
local trends along trajectories obtained by semi-dense point tracking. It com-
bines the effective properties of the two previously presented techniques. The
semi-dense point tracking yields a large number of trajectories at a much smaller
computational cost than fully dense tracking. We encode local direction changes
along each trajectory using an LBP-based representation. The combination of these
approaches allows real-time, on-line computation of action descriptors. The
remainder of the paper is organised as follows: Section 2
summarises the computation of semi-dense trajectories. Section 3 details our el-
ementary motion element, the MTP descriptor. Section 4 introduces the action
modelling and its application to recognition. The last sections are dedicated to
experiments, evaluation and discussion.

2 Semi Dense Beam of Trajectories

Trajectories are a compact and rich source of information for representing
activity in video. Generally, to obtain reliable trajectories, the spatial
information is dramatically reduced to a small number of keypoints, and it may
then be risky to compute statistics on the set of trajectories. In this work we
use the semi-dense point tracking method [13], which is a trade-off between
long-term tracking and dense optical flow, and allows the tracking of a high
number of weak keypoints in a video in real time, thanks to its high level of
parallelism. Using a GPU implementation, this method can handle 10 000 points
per frame at 55 frames/s on 640 × 480 videos. In addition, it is robust to sudden
camera motion changes thanks to a dominant acceleration estimation. Figure 1
shows several actions represented by their corresponding beams of trajectories.

(a: Hand waving) (b: Boxing) (c: Hand clapping)

(d: Jogging) (e: Running) (f: Walking)

Fig. 1. Actions from the KTH dataset represented as beams of trajectories. For actions
(d-f), only the most recent part of the trajectory is displayed.

3 Motion Trend Patterns


We describe hereafter our MTP descriptor for action modelling. The input of
the MTP is the previously described beam of trajectories, and no appearance
information is used. An MTP descriptor is produced for every frame and for
every point which belongs to a trajectory. It has two components: the motion
itself, represented by the quantised velocity vector, and the motion local trend,
represented by polarities of local direction changes.

3.1 Encoding of Motion


Let p⃗_i be the 2d displacement of the point between frames i − 1 and i. The first
part of the encoding is simply a dartboard quantisation of the vector p⃗_i (see
Fig. 2). In our implementation, we used intervals of π/6 for the angle and 2 pixels
for the norm (the last interval being [6, +∞[ ), resulting in 12 bins for the angle
and 4 bins for the norm.
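A minimal sketch of this dartboard quantisation follows (hypothetical helper of our own, using NumPy); the bin edges are those given above.

import numpy as np

def quantise_motion(p, n_angle_bins=12, norm_edges=(2.0, 4.0, 6.0)):
    """Dartboard quantisation of a 2d displacement p = (dx, dy):
    12 angle bins of width pi/6 and 4 norm bins delimited by 2, 4 and 6 pixels
    (the last bin being [6, +inf)). Returns (angle_bin, norm_bin)."""
    dx, dy = p
    angle = np.arctan2(dy, dx) % (2.0 * np.pi)                        # direction in [0, 2*pi)
    angle_bin = int(angle // (2.0 * np.pi / n_angle_bins)) % n_angle_bins
    norm_bin = int(np.searchsorted(norm_edges, np.hypot(dx, dy), side='right'))
    return angle_bin, norm_bin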

3.2 Encoding of Motion Changes


Inspired by LBP, we encode elementary motion changes by comparing the motion
vector p⃗_i with the preceding and following displacements along the trajectory,
{p⃗_{i−1}, p⃗_{i+1}}. Encoding the sign of the difference can be applied to the 2
components: (1) the norm, where it relates to tangential acceleration, and (2) the
direction, where it relates to concavity and inflexion. The two can be encoded in
a binary pattern. It turned out from our experiments that the use of the norm
did not improve the results with respect to using the direction only, and so we
only consider direction changes in the MTP proposed hereafter.
Motion Trend Patterns (MTP): The local direction trend is encoded by the
signs of the 2 differences between the direction ∠p⃗_i and the directions of the 2
preceding and following motion vectors. This encoding corresponds to the local
trend of the motion direction in terms of concavity and inflexion, as illustrated
by Fig. 3, which shows the 4 possible configurations of the MTP for a fixed value of
the quantised motion vector.


Fig. 2. Dartboard quantisation of the motion vector.

Fig. 3. The 4 possible configurations of MTPs (00, 01, 10, 11), shown around the current displacement p⃗_i.
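A sketch of the MTP encoding just described follows (hypothetical helper; the handling of a zero direction difference is our choice, as the paper does not specify it).

import numpy as np

def mtp_code(p_prev, p_cur, p_next):
    """2-bit Motion Trend Pattern from three consecutive displacements along a
    trajectory: signs of the direction differences (p_prev -> p_cur) and (p_cur -> p_next)."""
    def direction(p):
        return np.arctan2(p[1], p[0])
    def sign_bit(delta):
        delta = (delta + np.pi) % (2 * np.pi) - np.pi   # wrap into (-pi, pi]
        return 1 if delta > 0 else 0
    b_prev = sign_bit(direction(p_cur) - direction(p_prev))
    b_next = sign_bit(direction(p_next) - direction(p_cur))
    return (b_prev << 1) | b_next                        # one of the 4 configurations of Fig. 3

A motion word is then the pair (quantised motion, MTP code), i.e. one of 48 × 4 possible values.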

3.3 Properties of Our Descriptor


Several good properties of our action descriptor can be pointed out:
– Low computational cost. It is based on the semi-dense beam of trajectories,
whose computation is very fast [13]. Thanks to the low complexity of LBP-based
operators, the calculation of MTPs is also very fast, so our system is suitable
for real-time action recognition.
– Invariance to monotonic changes of direction. It inherits from LBPs their
invariance to monotonic changes, which in our case correspond to changes
in the curvature of concavities and inflexions.
– Robustness to appearance variation. By design of the weak keypoint detection,
which is contrast-normalised, the descriptor should be robust against
illumination changes and should not depend much on the appearance of the
moving objects.

4 Modelling and Recognition of Actions

Motion Words and Action Representation


The MTP descriptors represent elementary motion elements, or motion words.
The number of different motion words is 48 × 2² = 192. Following a hierarchical
bag-of-features approach [14], we model an action by the space-time distribution of
motion words: let B be the 3d space-time bounding box of the action. Histograms
of MTPs are calculated in B and in sub-volumes of B, using regular pyramidal
sub-grids, and the different histograms are concatenated as shown in Fig. 4. In
our experiments we used 3 different sub-grids: 1 × 1 × 1, 2 × 2 × 2 and 4 × 4 × 4,
resulting in 73 histograms.

Fig. 4. Action modelling by concatenation of MTP histograms
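A minimal sketch of this blockwise histogramming follows (hypothetical helper; the per-block normalisation and the 192 motion words of the previous paragraph are our assumptions about details not spelled out in the text).

import numpy as np

def pyramid_histogram(words, positions, n_words=192, grids=(1, 2, 4)):
    """words: (N,) array of motion-word indices in [0, n_words).
    positions: (N, 3) array of (x, y, t) coordinates normalised to [0, 1) inside B.
    Returns the concatenation of the 1 + 8 + 64 = 73 block histograms."""
    histograms = []
    for g in grids:
        cells = np.clip((positions * g).astype(int), 0, g - 1)       # per-axis cell index
        cell_id = (cells[:, 0] * g + cells[:, 1]) * g + cells[:, 2]  # flat cell index
        for c in range(g ** 3):
            h = np.bincount(words[cell_id == c], minlength=n_words).astype(float)
            if h.sum() > 0:
                h /= h.sum()                                          # normalise each block
            histograms.append(h)
    return np.concatenate(histograms)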

Classification
To perform action classification, we choose the SVM classifier of Vedaldi et al.
[15], which approximates large-scale support vector machines using an explicit
feature map for the additive class of kernels. It is generally much faster than
non-linear SVMs and can be used on large-scale problems.

5 Experiments

We evaluate our descriptor on two well-known datasets. The first one (KTH) [16]
is a classic dataset, used to evaluate many action recognition methods. The
second one (UCF Youtube) [17] is a more realistic and challenging dataset.
Extraction of Semi-Dense Trajectories. We have studied the influence of
the extraction of semi-dense trajectories on the performance of our model. We
changed the parameters of the semi-dense point tracker [13] to modify the number
of trajectories obtained on the video. We observe that, as long as the average
matching error does not increase significantly, the more trajectories we have, the
better the recognition rate; the improvement can reach 5-6 % on the KTH dataset.
Table 1 shows the recognition rate obtained on this dataset for different average
numbers of tracked trajectories. In our experiments, the average number is set to
5 000, which gives good results.

Table 1. Recognition rate on the KTH dataset in function of the number of trajectories

Mean number of trajectories per video 1 240 2 700 3 200 5 271 7 763
Recognition rate 87.5 88.33 90 92.5 90.83

Table 2. Confusion matrix on the KTH dataset

          Box.   Clap.   Wave   Jog.   Run.   Walk.
Boxing    97.5    2.5     0      0      0      0
Clapping   7.5   92.5     0      0      0      0
Waving     0      2.5    97.5    0      0      0
Jogging    0      0       0     95.0    2.5    2.5
Running    0      0       0     12.5   82.5    5.0
Walking    0      0       0     10.0    0     90.0

Table 3. Comparison with other methods on the KTH dataset

Method       Result    Method   Result
Our method   92.5      [11]     90.17
[18]         82.36     [12]     93.0
[19]         88.38     [9]      93.8
[10]         90.8      [17]     90.5
[6]          95.0

Experiments on KTH Dataset. The KTH dataset contains 25 people performing 6
actions (running, walking, jogging, boxing, hand clapping and hand waving) in 4
different scenarios (indoors, outdoors, outdoors with scale change and outdoors
with different clothes). It contains 599¹ videos, of which 399 are used for training
and the rest for testing. As designed by [16], the test set contains the actions of
9 people, and the training set corresponds to the 16 remaining persons.
Table 2 shows the confusion matrix obtained by our method on the KTH
dataset. The ground truth is read row by row. The average recognition rate is
92.5 %, which is comparable to the state-of-the-art, including LBP-based
approaches (see Table 3). We recall that, unlike [9, 10], which work on segmented
boxes, our results can be obtained on line on unsegmented videos, using a
pre-processing step to circumscribe the interesting motions in space-time bounding
boxes. The main error factor comes from confusion between jogging and running,
which is a typical problem in reported methods. Table 6 presents the recognition
rates obtained separately by the different components of our method: motion
only, MTP only and both components (motion words). Clearly, the quantised
motion provides more information than the MTP component, but combining these
complementary components improves the recognition rate by 2 %.
Experimentation on UCF Youtube Dataset. This dataset [17] contains
1 600 video sequences with 11 action categories. Each category is divided into 25
groups sharing common appearance properties (actors, background, or other).
Following the experimental protocol proposed by the authors [17], we used 9
groups out of the 25 as test and the 16 remaining groups as training data.
This dataset is much more challenging than KTH because of its large variability
in terms of viewpoints, backgrounds and camera motions. Table 4 shows the
confusion matrix obtained by our method; our mean recognition rate (65.63 %)
is comparable to recent methods (see Table 5).

¹ It should contain 600 videos but one is missing.

Table 4. Confusion matrix on UCF. Ground truth (by row) and predicted (by column)
labels are: basketball, biking, diving, golf swing, horse riding, soccer juggling, swing,
tennis swing, trampoline jumping, volleyball spiking, walking with dog.

48.98 0 2.05 0 0 8.16 14.28 10.20 0 16.33 0


0 70.21 0 0 8.51 0 17.02 0 2.13 0 2.12
4.92 0 90.16 0 0 1.64 0 0 0 0 3.28
0 0 3.70 83.33 0 7.41 3.71 0 0 1.85 0
1.64 4.92 0 0 73.77 4.92 0 1.64 0 4.92 8.20
0 0 5.26 8.77 1.75 64.91 7.02 1.75 0 1.75 8.77
1.96 5.88 0 0 0 7.84 56.86 1.96 11.76 5.88 7.84
1.64 4.92 0 1.64 1.64 1.64 0 78.69 3.28 1.64 4.92
0 0 0 0 0 9.10 13.64 15.91 56.82 0 4.54
11.36 0 6.82 4.54 6.82 2.27 0 4.54 0 59.10 4.54
4.348 15.22 0 4.38 8.69 2.17 17.39 4.35 2.17 2.17 39.13

Table 5. Comparison on the UCF Youtube dataset

Our method   [20]   [21]   [17]   [22]
65.63        64     64     71.2   56.8

Table 6. Experimentation on KTH using different components

Motion   Motion changes   Motion words
90.42    84.58            92.5

6 Conclusions
We have presented a new action model based on semi-dense trajectories and an
LBP-like encoding of motion trends. It allows on-line action recognition on
unsegmented videos at low computational complexity.
In the future, we are interested in using other variants of the LBP operator.
A temporal multi-scale approach for MTP encoding will also be considered.
Furthermore, we will address the effects of camera motion on the performance
of our model, in order to deal with uncontrolled realistic videos.

Acknowledgement. This work is part of an ITEA2 project, and is supported
by the French Ministry of Economy (DGCIS).

References
1. Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Comput. Surv.
43, 16:1–16:43 (2011)
2. Johansson, G.: Visual perception of biological motion and a model for its analysis.
Perception and Psychophysics 14, 201–211 (1973)
3. Messing, R., Pal, C., Kautz, H.A.: Activity recognition using the velocity histories
of tracked keypoints. In: ICCV 2009, pp. 104–111 (2009)
4. Sun, J., Wu, X., Yan, S., Cheong, L.F., Chua, T.S., Li, J.: Hierarchical spatio-
temporal context modeling for action recognition. In: CVPR, pp. 2004–2011 (2009)

5. Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving
camera using motion decomposition of lagrangian particle trajectories. In: ICCV,
pp. 1419–1426 (2011)
6. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajecto-
ries. In: CVPR, pp. 3169–3176 (2011)
7. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. PAMI 24, 971–987 (2002)
8. Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns
with an application to facial expressions. PAMI 29, 915–928 (2007)
9. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Human activity recognition using a
dynamic texture based method. In: BMVC (2008)
10. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Texture based description of move-
ments for activity analysis. In: VISAPP (2), 206–213 (2008)
11. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: ICCV,
pp. 492–497 (2009)
12. Kliper-Gross, O., Gurovich, Y., Hassner, T., Wolf, L.: Motion interchange patterns
for action recognition in unconstrained videos. In: Fitzgibbon, A., Lazebnik, S.,
Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VI. LNCS, vol. 7577, pp.
256–269. Springer, Heidelberg (2012)
13. Garrigues, M., Manzanera, A.: Real time semi-dense point tracking. In: Campilho,
A., Kamel, M. (eds.) ICIAR 2012, Part I. LNCS, vol. 7324, pp. 245–252. Springer,
Heidelberg (2012)
14. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: CVPR, pp. 2169–2178 (2006)
15. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps.
PAMI 34, 480–492 (2012)
16. Schuldt, C., Laptev, I., Caputo, B.: Recognizing Human Actions: A Local SVM
Approach. In: ICPR, pp. 32–36 (2004)
17. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from video “in the wild”.
In: CVPR, pp. 1996–2003 (2009)
18. Tabia, H., Gouiffès, M., Lacassagne, L.: Motion histogram quantification for human
action recognition. In: ICPR, pp. 2404–2407 (2012)
19. Mattivi, R., Shao, L.: Human action recognition using lbp-top as sparse spatio-
temporal feature descriptor. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS,
vol. 5702, pp. 740–747. Springer, Heidelberg (2009)
20. Lu, Z., Peng, Y., Ip, H.H.S.: Spectral learning of latent semantics for action recog-
nition. In: ICCV, pp. 1503–1510 (2011)
21. Bregonzio, M., Li, J., Gong, S., Xiang, T.: Discriminative topics modelling for
action feature selection and recognition. In: BMVC, pp. 1–11 (2010)
22. Wang, S., Yang, Y., Ma, Z., Li, X., Pang, C., Hauptmann, A.G.: Action recognition
by exploring data distribution and feature correlation. In: CVPR, pp. 1370–1377
(2012)
On Achieving Near-Optimal “Anti-Bayesian”
Order Statistics-Based Classification
for Asymmetric Exponential Distributions

Anu Thomas and B. John Oommen

School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6

Abstract. This paper considers the use of Order Statistics (OS) in the
theory of Pattern Recognition (PR). The pioneering work on using OS
for classification was presented in [1] for the Uniform distribution, where
it was shown that optimal PR can be achieved in a counter-intuitive
manner, diametrically opposed to the Bayesian paradigm, i.e., by com-
paring the testing sample to a few samples distant from the mean - which
is distinct from the optimal Bayesian paradigm. In [2], we showed that
the results could be extended for a few symmetric distributions within
the exponential family. In this paper, we attempt to extend these results
significantly by considering asymmetric distributions within the expo-
nential family, for some of which even the closed form expressions of
the cumulative distribution functions are not available. These distribu-
tions include the Rayleigh, Gamma and certain Beta distributions. As
in [1] and [2], the new scheme, referred to as Classification by Moments
of Order Statistics (CMOS), attains an accuracy very close to the opti-
mal Bayes’ bound, as has been shown both theoretically and by rigorous
experimental testing.

Keywords: Classification using Order Statistics (OS), Moments of OS.

1 Introduction
Class conditional distributions have numerous indicators such as their means,
variances etc., and these indices have, traditionally, played a prominent role in
achieving pattern classification, and in designing the corresponding training and
testing algorithms. It is also well known that a distribution has many other
characterizing indicators, for example, those related to its Order Statistics (OS).
The interesting point about these indicators is that some of them are quite
unrelated to the traditional moments themselves, and in spite of this, have not
been used in achieving PR. The amazing fact, demonstrated in [3], is that OS can
be used in PR, and that such classifiers operate in a completely "anti-Bayesian"
manner, i.e., by only considering certain outliers of the distribution.

Chancellor’s Professor ; Fellow: IEEE and Fellow: IAPR. This author is also an Ad-
junct Professor with the University of Agder in Grimstad, Norway. The work of this
author was partially supported by NSERC, the Natural Sciences and Engineering
Research Council of Canada.


Earlier, in [1] and [2], we showed that we could obtain optimal results by
an “anti-Bayesian” paradigm by using the OS. Interestingly enough, the novel
methodology that we propose, referred to as Classification by Moments of Order
Statistics (CMOS), is computationally not any more complex than working with
the Bayesian paradigm itself. This was done in [1] for the Uniform distribution
and in [2] for certain distributions within the exponential family. In this paper,
we attempt to extend these results significantly by considering asymmetric dis-
tributions within the exponential family, for some of which even the closed form
expressions of the cumulative distribution functions are not available. Examples
of these distributions are the Rayleigh, Gamma and certain Beta distributions.
Again, as in [1] and [2], we show the completely counter-intuitive result that
by working with a very few (sometimes as small as two) points distant from
the mean, one can obtain remarkable classification accuracies, and this has been
demonstrated both theoretically and by experimental verification.

2 Optimal OS-Based Classification: The Generic Classifier

Let us assume that we are dealing with the 2-class problem with classes ω1 and
ω2, where their class-conditional densities are f1(x) and f2(x) respectively (i.e.,
their corresponding distributions are F1(x) and F2(x) respectively)¹. Let ν1 and
ν2 be the corresponding medians of the distributions. Then, classification based
on ν1 and ν2 would be the strategy that classifies samples based on a single
OS. We can see that for all symmetric distributions, this classification accuracy
attains the Bayes’ accuracy.
This result is not too astonishing because the median is centrally located close
to (if not exactly) on the mean. The result for higher order OS is actually far
more intriguing because the higher order OS are not located centrally (close
to the means), but rather distant from the means. In [2], we have shown that
for a large number of distributions, mostly from the exponential family, the
classification based on these OS again attains the Bayes’ bound. These results
are now extended for asymmetric exponential distributions.

3 The Rayleigh Distribution

The pdf of the Rayleigh distribution, whose applications are found in [4], with
parameter σ > 0 is φ(x, σ) = (x/σ²) e^{−x²/2σ²}, x ≥ 0, and the cumulative
distribution function is Φ(x) = 1 − e^{−x²/2σ²}, x ≥ 0. The mean, the variance and
the median of the Rayleigh distribution are σ√(π/2), ((4 − π)/2) σ² and σ√(ln 4),
respectively.
Theoretical Analysis: Rayleigh Distribution - 2-OS. The typical PR problem
involving the Rayleigh distribution would consider two classes ω1 and ω2 where
the class ω2 is displaced by a quantity θ, and the values of σ are σ1 and σ2
respectively. We consider the scenario when σ1 = σ2 = σ. Consider the
distributions f(x, σ) = (x/σ²) e^{−x²/2σ²} and f(x − θ, σ) = ((x − θ)/σ²) e^{−(x−θ)²/2σ²}.
In order to do the classification based on CMOS, we shall first derive the moments
of the 2-OS for the Rayleigh distribution. The expected values of the first moments
of the two OS can be obtained by determining the points where the cumulative
distribution function attains the values of 1/3 and 2/3 respectively. Let u1 be the
point for the percentile 2/3 of the first distribution, and u2 be the point for the
percentile 1/3 of the second distribution. Then, ∫₀^{u1} (x/σ²) e^{−x²/2σ²} dx = 2/3
⟹ u1 = σ√(2 ln 3), and u2 = θ + σ√(2 ln(3/2)).

¹ Throughout this section, we will assume that the a priori probabilities are equal.
Theorem 1. For the 2-class problem in which the two class conditional
distributions are Rayleigh and identical, the accuracy obtained by CMOS deviates
from the optimal Bayes' bound as the solution of the transcendental equality
ln(x/(x − θ)) = (2θx − θ²)/(2σ²) deviates from θ/2 + (σ/√2)[√(ln 3) + √(ln(3/2))].

Proof. The proof of the theorem can be found in [4]. ⊓⊔
Remark: Another way of comparing the approaches is by obtaining the er-
ror difference created by the CMOS classifier when compared to the Bayesian
classifier. The details of this can be found in [4].
Theorem 2. For the 2-class problem in which the two class conditional
distributions are Rayleigh and identical, the accuracy obtained by using the 2-OS
CMOS deviates from the classifier which discriminates based on the distance from
the corresponding medians as θ/2 + σ√(ln 4) deviates from
θ/2 + (σ/√2)[√(ln 3) + √(ln(3/2))].

Proof. The proof is omitted here but can be seen in [4]. ⊓⊔
Experimental Results: Rayleigh Distribution - 2-OS. The CMOS clas-
sifier was rigorously tested for a number of experiments with various Rayleigh
distributions having the identical parameter σ. In every case, the 2-OS CMOS
gave almost the same classification as that of the Bayesian classifier. The method
was executed 50 times with the 10-fold cross validation scheme. The test results
are tabulated in Table 1. The results presented justify the claims of Theorems 1
and 2.

Table 1. A comparison of the accuracy of the Bayesian and the 2-OS CMOS classifier
for the Rayleigh Distribution

θ 3 2.5 2 1.5 1
Bayesian 99.1 97.35 94.45 87.75 78.80
CMOS 99.1 97.35 94.40 87.70 78.65
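The 2-OS CMOS rule used in these experiments admits a very short implementation; the following is a minimal sketch (our own hypothetical helper, assuming equal priors and known σ and θ, with NumPy), which simply compares the test sample to the two order-statistic points u1 and u2 derived above.

import numpy as np

def cmos_rayleigh_2os(x, sigma, theta):
    """2-OS CMOS for two identical Rayleigh densities, the second shifted by theta.
    u1 is the 2/3-percentile of class omega_1, u2 the 1/3-percentile of class omega_2."""
    u1 = sigma * np.sqrt(2.0 * np.log(3.0))
    u2 = theta + sigma * np.sqrt(2.0 * np.log(1.5))
    # assign x to the class whose order-statistic point is closer
    return 1 if abs(x - u1) <= abs(x - u2) else 2

# e.g. cmos_rayleigh_2os(2.3, sigma=2.0, theta=1.5)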

Theoretical Analysis: Rayleigh Distribution - k-OS. We have seen from
Theorem 1 that for the Rayleigh distribution, the moments of the 2-OS are
sufficient for a near-optimal classification. As in the case of the other
distributions, we shall now consider the scenario when we utilize other k-OS. Let
u1 be the point for the percentile (n+1−k)/(n+1) of the first distribution, and u2
be the point for the percentile k/(n+1) of the second distribution. Then,
∫₀^{u1} (x/σ²) e^{−x²/2σ²} dx = (n+1−k)/(n+1) ⟹ u1 = σ√(2 ln((n+1)/k)), and
u2 = θ + σ√(2 ln((n+1)/(n+1−k))).

Theorem 3. For the 2-class problem in which the two class conditional
distributions are Rayleigh and identical, a near-optimal Bayesian classification
can be achieved by using symmetric pairs of the n-OS, i.e., the (n − k) OS for ω1
and the k OS for ω2, if and only if
√(ln((n+1)/k)) − √(ln((n+1)/(n+1−k))) < θ/(σ√2). The classification obtained by
CMOS deviates from the optimal Bayes' bound as the solution of the transcendental
equality ln(x/(x − θ)) = (2θx − θ²)/(2σ²) deviates from
θ/2 + (σ/√2)[√(ln((n+1)/k)) + √(ln((n+1)/(n+1−k)))].

Proof. The proof of this theorem is omitted here, but is included in [4]. ⊓⊔

Experimental Results: Rayleigh Distribution - k-OS. The CMOS method
has been rigorously tested with different choices of the k-OS and for various
values of n, and the test results are given in Table 2. The Bayesian approach
provides an accuracy of 82.15 %, and from the table it is obvious that some
of the considered k-OS attain the optimal accuracy while the rest attain a
near-optimal accuracy. Also, we can see that the Dual CMOS has to be
invoked if the condition stated in Theorem 3 is not satisfied.

Table 2. A comparison of the accuracy of the Bayesian (i.e., 82.15 %) and the k-OS
CMOS classifier for the Rayleigh Distribution by using the symmetric pairs of the OS
for different values of n (where σ = 2 and θ = 1.5)

No.  Order (n)  Moments                        OS1            OS2                  CMOS   CMOS/Dual CMOS
1    Two        ⟨2/3, 1/3⟩                     σ√(2 ln 3)     θ + σ√(2 ln(3/2))    82.05  CMOS
2    Four       ⟨(5−i)/5, i/5⟩, 1 ≤ i ≤ n/2    σ√(2 ln(5/2))  θ + σ√(2 ln(5/3))    82.0   CMOS
3    Six        ⟨(7−i)/7, i/7⟩, 1 ≤ i ≤ n/2    σ√(2 ln 7)     θ + σ√(2 ln(7/6))    81.6   Dual CMOS
4    Eight      ⟨(9−i)/9, i/9⟩, 1 ≤ i ≤ n/2    σ√(2 ln(9/4))  θ + σ√(2 ln(9/5))    82.15  CMOS

Details of when the original OS-based criteria and when the Dual criteria are
used, are found in [4]. These are omitted here in the interest of space.

4 The Gamma Distribution


The Gamma distribution is a continuous probability distribution with two
parameters: a, a shape parameter, and b, a scale parameter. The pdf of the Gamma
distribution is f(x; a, b) = (1/(Γ(a) bᵃ)) x^{a−1} e^{−x/b}; a > 0, b > 0, with mean
ab and variance ab², where a and b are the parameters. Unfortunately, the
cumulative distribution function does not have a closed form expression [5,6,7].
Theoretical Analysis: Gamma Distribution. The typical PR problem invoking
the Gamma distribution would consider two classes ω1 and ω2 where the class ω2
is displaced by a quantity θ, and, in the case analogous to the ones we have
analyzed, the values of the scale and shape parameters are identical. We consider
the scenario when a1 = a2 = a and b1 = b2 = b. Thus, we consider the
distributions f(x, 2, 1) = x e^{−x} and f(x − θ, 2, 1) = (x − θ) e^{−(x−θ)}.
We first derive the moments of the 2-OS, which are the points of interest for
CMOS, for the Gamma distribution. Let u1 be the point for the percentile 2/3
of the first distribution, and u2 be the point for the percentile 1/3 of the second
distribution. Then, ∫₀^{u1} x e^{−x} dx = 2/3 ⟹ ln(u1) − 2u1 = ln(1/3), and
ln(u2 − θ) − 2(u2 − θ) = ln(1/3) − ln(θ). The following results hold for the Gamma
distribution.

Theorem 4. For the 2-class problem in which the two class conditional
distributions are Gamma and identical, the accuracy obtained by CMOS deviates
from the accuracy attained by the classifier with regard to the distance from the
corresponding medians as 1.7391 + θ/2 deviates from 1.6783 + θ/2.

Proof. The proof of this theorem can be found in [4]. ⊓⊔

Experimental Results: Gamma Distribution - 2-OS. The CMOS clas-


sifier was rigorously tested for a number of experiments with various Gamma
distributions having the identical shape and scale parameters a1 = a2 = 2, and
b1 = b2 = 1. In every case, the 2-OS CMOS gave almost the same classification
as that of the classifier based on the central moments, namely, the mean and
the median. The method was executed 50 times with the 10-fold cross validation
scheme. The test results are tabulated in Table 3.

Table 3. A comparison of the accuracy with respect to the median and the 2-OS
CMOS classifier for the Gamma Distribution

θ 4.5 4.0 3.5 3.0 2.5 2.0 1.5


Median 94.83 94.25 92.74 90.77 86.51 80.15 72.64
CMOS 95.01 94.49 92.92 90.43 85.99 79.54 72.34

Theorem 5. For the 2-class problem in which the two class conditional distri-
butions are Gamma and identical, a near-optimal Bayesian classification can be
achieved by using certain symmetric pairs of the n-OS, i.e., the (n − k)th OS for
ω1 (represented as u1 ) and the k th OS for ω2 (represented as u2 ) if and only if
u1 < u2 .

Proof. The proof of this theorem is included in [4]. ⊓⊔

Experimental Results: Gamma Distribution - k-OS. The CMOS method
has been rigorously tested for numerous symmetric pairs of the k-OS and for
various values of n, and a subset of the test results is given in Table 4.
Experiments have been performed for different values of θ, and we can see that
CMOS attains a near-optimal Bayes' bound. Also, we can see that the Dual
CMOS has to be invoked if the condition stated in Theorem 5 is not satisfied.

Table 4. A comparison of the k-OS CMOS classifier when compared to the Bayes'
classifier and the classifier with respect to the median and the mean for the Gamma
Distribution for different values of n. In each column, the value which is near-optimal
is rendered bold.

No.  Classifier  Moments       θ = 4.5   4.0     3.5     3.0     2.5     2.0
1    Bayes       -             97.06     95.085  93.145  90.68   86.93   81.53
2    Mean        -             96.165    94.875  92.52   88.335  83.105  77.035
3    Median      1/2           90.04     93.57   92.735  90.775  86.275  80.115
4    2-OS        ⟨2/3, 1/3⟩    95.285    93.865  92.87   90.61   86.085  79.48
5    4-OS        ⟨3/5, 2/5⟩    95.905    94.605  93.11   89.57   84.68   22.125
6    4-OS        ⟨4/5, 1/5⟩    95.185    93.675  92.82   90.855  86.02   80.32
7    6-OS        ⟨5/7, 2/7⟩    96.405    95.01   92.125  88.005  17.29   23.565
8    6-OS        ⟨4/7, 3/7⟩    95.47     94.11   93.135  90.16   85.495  79.55
9    6-OS        ⟨6/7, 1/7⟩    95.135    93.625  92.78   90.745  86.135  80.165
10   8-OS        ⟨7/9, 2/9⟩    96.815    94.895  91.555  13.095  19.41   24.06
11   8-OS        ⟨5/9, 4/9⟩    95.8      94.445  93.11   89.885  84.81   78.535
12   8-OS        ⟨8/9, 1/9⟩    95.135    93.625  92.735  90.7    86.085  80.045

5 The Beta Distribution

The Beta distribution is a family of continuous probability distributions defined
on (0, 1), parameterized by two shape parameters α and β. The distribution can
take different shapes based on the specific values of the parameters. If the
parameters are identical, the distribution is symmetric with respect to 1/2.
Further, if α = β = 1, B(1, 1) becomes U(0, 1). The pdf of the Beta distribution
is f(x; α, β) = (Γ(α+β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}. The mean and the
variance of the distribution are α/(α+β) and αβ/((α+β)²(α+β+1)) respectively.
We consider the case when α > 1 and β > 1. Earlier, in paper [3], when we first
introduced the concept of CMOS-based PR, we had analyzed the 2-OS and k-OS
CMOS for the Uniform distribution, and had provided the corresponding
theoretical analysis and the experimental results. We had concluded that, for the
2-class problem in which the two class conditional distributions are Uniform and
identical, CMOS can, indeed, attain the optimal Bayes' bound. So, in this paper,
to avoid repetition, we skip the analysis for the Beta distribution B(1,1), as this
case reduces to the analysis for the Uniform U(0,1). Thus, we reckon the first of
these cases (i.e., when α = 1 and β = 1) as being closed. We also discussed the
symmetric Beta distribution, when the values of the shape parameters α and β are
identical, in [4]. In this paper, we now move on to the unimodal Beta distribution
characterized by the shape parameters α > 1 and β > 1, α ≠ β.

Theoretical Analysis: Beta Distribution (α > 1, β > 1) - 2-OS. Consider
the two classes ω1 and ω2 where the class ω2 is displaced by a quantity θ. In
this section, we consider the case when the shape parameters take the values
α > 1 and β > 1, and, in the interest of preciseness², we consider the case
when α = 2 and β = 5. Then, the distributions are f(x, 2, 5) = 30x(1 − x)⁴ and
f(x − θ, 2, 5) = 30(x − θ)(1 − x + θ)⁴.
We first derive the moments of the 2-OS, namely o1 and o2, where o1 represents
the point for the percentile 2/3 of the first distribution, and o2 represents the
point for the percentile 1/3 of the second distribution. Then,
∫₀^{o1} 30x(1 − x)⁴ dx = 2/3 and ∫₀^{o2} 30(x − θ)(1 − x + θ)⁴ dx = 1/3.
These positions o1 and o2 can be obtained by making use of the built-in
functions available in standard software packages as o1 = 0.34249 and o2 =
θ + 0.1954. Thus, our aim is to show that the classification based on these points
can attain near-optimal accuracies when compared to the accuracy obtained
by the classifier with regard to the medians, the most central points of the
distributions.
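For instance, with SciPy (one possible choice of "standard software package"; not necessarily the one used by the authors), the two percentile points are obtained directly from the Beta quantile function:

from scipy.stats import beta

theta = 0.5                                    # example displacement (hypothetical value)
o1 = beta.ppf(2.0 / 3.0, a=2, b=5)             # 2/3-percentile of B(2, 5): about 0.34249
o2 = theta + beta.ppf(1.0 / 3.0, a=2, b=5)     # theta + 1/3-percentile: about theta + 0.1954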

Theorem 6. For the 2-class problem in which the two class conditional
distributions are Beta(α, β) (α > 1, β > 1) and identical with α = 2 and β = 5, the
accuracy obtained by CMOS deviates from the accuracy attained by the classifier
with regard to the distance from the corresponding medians as the areas under
the error curves deviate from the positions 0.26445 + θ/2 and 0.2689 + θ/2.

Proof. The proof of this theorem is omitted here, but can be found in [4]. ⊓⊔

Experimental Results: Beta Distribution (α > 1, β > 1) - 2-OS. The


CMOS has been rigorously tested for various Beta distributions with 2-OS. For
each of the experiments, we generated 1,000 points for the classes ω1 and ω2
characterized by B(x, 2, 5) and B(x − θ, 2, 5) respectively. We then performed
the classification based on the CMOS strategy and with regard to the medians of
the distributions. In every case, CMOS was compared with the accuracy obtained
with respect to the medians for different values of θ, as tabulated in Table 5.
The results were obtained by executing each algorithm 50 times using a 10-fold
cross-validation scheme. The quality of the classifier is obvious.

Table 5. A comparison of the accuracy of the 2-OS CMOS classifier with the classification with respect to the medians for the Beta Distribution for different values of θ

θ        0.4     0.45    0.5     0.55    0.6    0.65    0.7     0.75   0.8
Median   89.625  92.9    94.3    95.525  97.3   97.975  98.375  99.05  99.15
CMOS     89.475  92.775  94.525  95.75   97.3   98.05   98.375  99.2   99.225
² Any analysis will clearly have to involve specific values for α and β. The analyses for other values of α and β will follow the same arguments and are not included here.
Theoretical Analysis: Beta Distribution (α > 1, β > 1) - k-OS. We have
seen in Theorem 6 that the 2-OS CMOS can attain a near-optimal classification
when compared to the classification obtained with regard to the medians of the
distributions. We shall now prove that the k-OS CMOS can also attain almost
indistinguishable bounds for some symmetric pairs of the n-OS. The formal
theorem, proven in [4], follows.

Theorem 7. For the 2-class problem in which the two class conditional dis-
tributions are Beta(α, β) (α > 1, β > 1) and identical with α = 2, β = 5, a
near-optimal classification can be achieved by using certain symmetric pairs of
the n-OS, i.e., the (n − k)th OS for ω1 (represented as o1 ) and the k th OS for
ω2 (represented as o2 ) if and only if o1 < o2 . If o1 > o2 , the CMOS classifier
uses the Dual condition, i.e., the k OS for ω1 and the n − k OS for ω2 .
)
(
Experimental Results: Beta Distribution (α > 1, β > 1) - k-OS. The
CMOS method has been rigorously tested for certain symmetric pairs of the
k-OS and for various values of n, and the test results are given in Table 6. From
the table, we can see that CMOS attained a near-optimal Bayes’ accuracy when
o1 < o2 . Also, we can see that the Dual CMOS has to be invoked if o1 > o2 .
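For concreteness, the symmetric percentile pairs listed in Table 6 can be generated in the same way as the 2-OS points above. The sketch below is an illustrative reconstruction (not the authors' code) of how the k-OS CMOS points and the Dual condition can be obtained for the Beta(2, 5) case.

from scipy.stats import beta

def kos_cmos_points(n, k, theta, a=2, b=5):
    # Symmetric pair of percentiles ((n-k+1)/(n+1), k/(n+1)), as in Table 6
    o1 = beta.ppf((n - k + 1.0) / (n + 1.0), a, b)    # point for class w1
    o2 = theta + beta.ppf(k / (n + 1.0), a, b)        # point for the shifted class w2
    if o1 > o2:
        # Dual condition: swap the roles of the two order statistics
        o1 = beta.ppf(k / (n + 1.0), a, b)
        o2 = theta + beta.ppf((n - k + 1.0) / (n + 1.0), a, b)
    return o1, o2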

Table 6. A comparison of the k-OS CMOS classifier when compared to the classifier with respect to means and medians for the Beta Distribution for different values of n. The scenarios for the Dual condition are specified by “(D)”.

No.  Classifier  Moments       θ = 0.35  0.45    0.55    0.65    0.75    0.85
1    Mean        -             85.325    92.575  96.55   98.3    99.4    99.475
2    Median      ⟨1/2, 1/2⟩    86.675    92.775  95.525  97.975  99.05   99.275
3    2-OS        ⟨2/3, 1/3⟩    86.2      92.575  95.75   98.05   99.2    99.275
4    4-OS        ⟨3/5, 2/5⟩    85.375    92.525  96.225  98.225  99.325  99.475
5    4-OS        ⟨4/5, 1/5⟩    86.475    92.775  95.6    98.05   99.125  99.275
6    6-OS        ⟨5/7, 2/7⟩    85.2 (D)  92.425  96.475  98.35   99.45   99.625
7    6-OS        ⟨4/7, 3/7⟩    86.125    92.625  96.0    98.075  99.2    99.275
8    6-OS        ⟨6/7, 1/7⟩    86.55     92.775  95.525  97.975  99.125  99.75
6 Conclusions
In this paper, we have shown that optimal classification for symmetric distri-
butions and near-optimal bound for asymmetric distributions can be attained
by an “anti-Bayesian” approach, i.e., by working with very few (sometimes as
few as two) points distant from the mean. This scheme, referred to as CMOS,
Classification by Moments of Order Statistics, operates by using these points
determined by the Order Statistics of the distributions. In this paper, we have
proven the claim for some distributions within the exponential family, and the
theoretical results have been verified by rigorous experimental testing. Our re-
sults for classification using the OS are both pioneering and novel.
References
1. Thomas, A., Oommen, B.J.: Optimal “Anti-Bayesian” Parametric Pattern Clas-
sification Using Order Statistics Criteria. In: Alvarez, L., Mejail, M., Gomez, L.,
Jacobo, J. (eds.) CIARP 2012. LNCS, vol. 7441, pp. 1–13. Springer, Heidelberg
(2012)
2. Thomas, A., Oommen, B.J.: Optimal “Anti-Bayesian” Parametric Pattern Classi-
fication for the Exponential Family Using Order Statistics Criteria. In: Campilho,
A., Kamel, M. (eds.) ICIAR 2012, Part I. LNCS, vol. 7324, pp. 11–18. Springer,
Heidelberg (2012)
3. Thomas, A., Oommen, B.J.: The Fundamental Theory of Optimal “Anti-Bayesian”
Parametric Pattern Classification Using Order Statistics Criteria. Pattern Recogni-
tion 46, 376–388 (2013)
4. Oommen, B.J., Thomas, A.: Optimal Order Statistics-based “Anti-Bayesian”
Parametric Pattern Classification for the Exponential Family. Pattern Recognition
(accepted for publication, 2013)
5. Krishnaih, P.R., Rizvi, M.H.: A Note on Moments of Gamma Order Statistics.
Technometrics 9, 315–318 (1967)
6. Tadikamalla, P.R.: An Approximation to the Moments and the Percentiles of
Gamma Order Statistics. Sankhya: The Indian Journal of Statistics 39, 372–381
(1977)
7. Young, D.H.: Moment Relations for Order Statistics of the Standardized Gamma
Distribution and the Inverse Multinomial Distribution. Biometrika 58, 637–640
(1971)
Optimizing Feature Selection through Binary
Charged System Search

Douglas Rodrigues1, Luis A.M. Pereira1, Joao P. Papa1, Caio C.O. Ramos2, Andre N. Souza3, and Luciene P. Papa4
1 UNESP - Univ Estadual Paulista, Department of Computing, Bauru, Brazil
{markitovtr1,caioramos,lucienepapa}@gmail.com, [email protected]
2 UNESP - Univ Estadual Paulista, Depart. of Electrical Engineering, Bauru, Brazil
[email protected]
3 University of São Paulo, Polytechnic School, São Paulo, Brazil
4 Faculdade Sudoeste Paulista, Department of Health, Avaré, Brazil

Abstract. Feature selection aims to find the most important information from a given set of features. As this task can be seen as an optimization problem, the combinatorial growth of the possible solutions may make an exhaustive search infeasible. In this paper we propose a new nature-inspired feature selection technique based on the Charged System Search (CSS), which has never been applied to this context so far. The wrapper approach combines the power of exploration of CSS together with the speed of the Optimum-Path Forest classifier to find the set of features that maximizes the accuracy on a validating set. Experiments conducted on four public datasets have demonstrated that the proposed approach can outperform some well-known swarm-based techniques.
Keywords: Feature Selection, Charged System Search, Evolutionary Optimization.
1 Introduction
Feature Selection is a challenging task which aims at selecting a subset of features in a given dataset. Working only with relevant features can reduce the training time and improve the prediction performance of classifiers. A simple way to handle feature selection is to perform an exhaustive search, if the number of dimensions (features) is not too large. However, this problem is known to be NP-hard and the computational load may become intractable [1].
Recently, several works have employed meta-heuristic algorithms based on
biological behavior and physical systems to deal with feature selection as an
optimization problem. In such context, Kennedy and Eberhart [2] proposed a
binary version of the traditional Particle Swarm Optimization (PSO) algorithm
in order to handle binary optimization problems. Further, Firpi and Goodman [3]

The authors would like to thank FAPESP grants #2009/16206-1, #2011/14094-1
and #2012/14158-2, CAPES and also CNPq grant #303182/2011-3.

extended BPSO to the context of feature selection. Some years later, Rashedi et
al. [4] proposed a binary version of the Gravitational Search Algorithm (GSA)
called BGSA for feature selection, and Ramos et al. [5] presented their version of
the Harmony Search (HS) for the same purpose in the context of theft detection
in power distribution systems. Nakamura et al. [6] introduced their version of
Bat Algorithm (BA) for binary optimization problems, called BBA.
Kaveh and Talatahari [7] proposed an optimization algorithm called Charged
System Search (CSS), which is based on the interactions between electrically
charged particles. The idea is that an electrical field of one particle generates an
attracting or repelling force over other particles. This interaction is defined by
physical principles such as Coulomb, Gauss and Newtonian laws. The authors
have shown interesting results of CSS when compared with some well-known
approaches, such as PSO and Genetic Algorithms.
In this paper, we propose a binary version of the Charged System Search for
feature selection purposes called BCSS, in which the search space is modeled
as an m-cube, where m stands for the number of features. The main idea is
to associate with each charged particle a set of binary coordinates that denote
whether a feature will belong to the final set of features or not, and the function
to be maximized is the one given by a supervised classifier’s accuracy. As the
quality of the solution is related with the number of charged particles, we need
to evaluate each one of them by training a classifier with the selected features
encoded by the particles’ quality and also to classify an evaluating set. Thus,
we need a fast and robust classifier, since we have one instance of it for each
charged particle. As such, we opted to use the Optimum-Path Forest (OPF)
classifier [8,9], which has been demonstrated to be as effective as Support Vector
Machines, but faster to train.
The proposed algorithm has been compared with Binary Bat Algorithm, Bi-
nary Gravitational Search Algorithm, Binary Harmony Search and the Binary
Particle Swarm Optimization using several public datasets, one of them being
related to non-technical losses detection in power distribution systems. The re-
mainder of the paper is organized as follows. In Section 2 we revisit the Charged
System Search approach and we present the proposed methodology for binary
optimization using CSS. The methodology and the experimental results are dis-
cussed in Sections 3 and 4, respectively. Finally, conclusions are stated in Sec-
tion 5.

2 Charged System Search

Coulomb's law is the physical law used to describe the interactions between electrically charged particles. Let a charge be a solid sphere with radius r and a uniform volume density. The attraction force Fij between two spheres i and j with total charges qi and qj is defined by:
Fij = ke qi qj / d²ij ,   (1)
where ke is a constant called the Coulomb constant and dij is the distance
between the charges.
Based on aforementioned definition, Kaveh and Talatahari [7] have proposed
a new metaheuristic algorithm called Charged System Search (CSS). In this al-
gorithm, each Charged Particle (CP) on system is affected by the electrical fields
of the others, generating a resultant force over each CP, which is determined by
using the electrostatics laws. The CP interaction movement is determined using
Newtonian mechanics laws. Therefore, Kaveh and Talatahari [7] have summarized
CSS through the following definitions:

– Definition 1: The magnitude of the charge qi, with i = 1, 2, ..., n, is defined considering the quality of its solution, i.e. the objective function value fit(i):

  qi = (fit(i) − fitworst) / (fitbest − fitworst),   (2)

  where fitbest and fitworst denote, respectively, the best and the worst fitness of all particles so far. The distance dij between two CPs is given by the following equation:
  dij = ||xi − xj|| / ( ||(xi + xj)/2 − xbest|| + ε ),   (3)

  in which xi, xj and xbest denote the positions of the ith, the jth and the best current CP respectively, and ε is a small positive number to avoid singularities.
– Definition 2: The initial position xij(0) and velocity vij(0), for each jth variable of the ith CP, with j = 1, 2, . . . , m, are given by:

  xij(0) = xi,min + θ(xi,max − xi,min)   (4)

  and

  vij(0) = 0,   (5)

  where xi,max and xi,min represent the upper and lower bounds respectively, and θ ∼ U(0, 1).
– Definition 3: For a maximization problem, the probability that each CP moves toward other CPs is given as follows:

  pij = 1 if (fit(j) − fitworst)/(fit(i) − fit(j)) > θ ∨ fit(i) > fit(j), and pij = 0 otherwise.   (6)

– Definition 4: The value of the resultant force acting on a CP j is defined as:

  r = 0.1 · max(xi,max − xi,min)   (7)

  Fj = qj · Σ_{i, i≠j} ( (qi / r³) · dij · c1 + (qi / d²ij) · c2 ) · pij (xi − xj),   (8)

  where c1 = 1 and c2 = 0 if dij < r, otherwise c1 = 0 and c2 = 1.
– Definition 5: The new position and velocity of each CP are given by

  xj(t) = θj1 · ka · Fj + θj2 · kv · vj(t − 1) + xj(t − 1)   (9)

  and

  vj(t) = xj(t) − xj(t − 1),   (10)

  where ka = 0.5(1 + t/T) and kv = 0.5(1 − t/T) are the acceleration and the velocity coefficients respectively, t being the current iteration and T the maximum number of iterations.
– Definition 6: A number of the best solutions found so far is saved in a Charged Memory (CM). The worst solutions are excluded from the CM, and better new ones are included in it.
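Putting Definitions 1-5 together, one iteration of the (continuous) CSS can be sketched as follows. This is only an illustrative reconstruction written from the definitions above: the variable names, the ε guards and the per-use random draws θ ∼ U(0, 1) are assumptions, not the authors' implementation.

import numpy as np

def css_iteration(X, V, fit, t, T, x_min, x_max, eps=1e-9):
    # X: (n, m) positions of the n CPs; V: velocities; fit: fitness values (maximization)
    n, m = X.shape
    f_best, f_worst = fit.max(), fit.min()
    q = (fit - f_worst) / (f_best - f_worst + eps)          # Definition 1, Eq. (2)
    x_best = X[fit.argmax()]
    r = 0.1 * np.max(x_max - x_min)                          # Eq. (7)
    ka, kv = 0.5 * (1 + t / T), 0.5 * (1 - t / T)            # Definition 5
    X_new, V_new = X.copy(), V.copy()
    for j in range(n):
        F = np.zeros(m)
        for i in range(n):
            if i == j:
                continue
            d = np.linalg.norm(X[i] - X[j]) / (
                np.linalg.norm((X[i] + X[j]) / 2 - x_best) + eps)   # Eq. (3)
            # Definition 3 / Eq. (6), with theta drawn uniformly at random
            num, den = fit[j] - f_worst, fit[i] - fit[j]
            p = 1.0 if (den != 0 and num / den > np.random.rand()) or fit[i] > fit[j] else 0.0
            if p == 0.0:
                continue
            c1, c2 = (1.0, 0.0) if d < r else (0.0, 1.0)
            F += (q[i] / r**3 * d * c1 + q[i] / (d**2 + eps) * c2) * (X[i] - X[j])
        F *= q[j]                                                    # Eq. (8)
        X_new[j] = np.random.rand() * ka * F + np.random.rand() * kv * V[j] + X[j]  # Eq. (9)
        V_new[j] = X_new[j] - X[j]                                   # Eq. (10)
    return X_new, V_new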

2.1 BCSS: Binary Charged System Search
In this paper we propose the Binary Charged System Search (BCSS) for feature selection, in which each CP can change its position only to binary values. Therefore, we propose some modifications to the traditional CSS as follows:
– Equation 4 is replaced by a function which randomly initializes each CP vector with binary values (0 or 1), where 0 stands for the absence and 1 for the presence of a feature.
– In order to compute the distance between two binary CP vectors, we employ a Hamming distance function H(·, ·). Thus Equation 3 is changed to

  dij = H(xi, xj) / ( H((xi & xj), xbest) + ε ),   (11)
where & performs the logical AND between two vectors.
– In the traditional CSS, Kaveh et al. [7] proposed the use of an HS-based algorithm to correct the position of a CP which surpasses the upper or lower bounds. In the case of the BCSS for feature selection, as the new solution should always be 0 or 1, we employ a sigmoid function in order to restrict the new solutions to binary values only:

  S(xij) = 1 / (1 + e^(−xij)).   (12)

  xij = 1 if S(xij) > θ, and 0 otherwise.   (13)
Therefore, Equation 13 provides only binary values for the CP coordinates. In addition, the search space is modelled as an m-dimensional boolean lattice, where the CPs move across the corners of a hypercube.
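A minimal sketch of these binary modifications is given below; the function names and the per-coordinate random threshold θ ∼ U(0, 1) are illustrative assumptions.

import numpy as np

def hamming(a, b):
    # Number of differing binary coordinates
    return int(np.count_nonzero(a != b))

def binary_distance(x_i, x_j, x_best, eps=1e-9):
    # Eq. (11): Hamming distance normalised by the distance of the common part
    # of the two CPs to the best CP
    return hamming(x_i, x_j) / (hamming(x_i & x_j, x_best) + eps)

def binarize(x_real):
    # Eqs. (12)-(13): squash each real-valued coordinate with a sigmoid and
    # threshold it against a uniform random number
    s = 1.0 / (1.0 + np.exp(-x_real))
    theta = np.random.rand(*x_real.shape)
    return (s > theta).astype(np.uint8)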
3 Methodology
Suppose we have a fully labeled dataset Z = Z1 ∪ Z2 ∪ Z3 ∪ Z4 , in which Z1 , Z2 ,
Z3 and Z4 stand for training, learning, validating, and test sets, respectively.
The idea is to employ Z1 and Z2 to find the subset of features that maximizes
the accuracy over Z2, such accuracy being the fitness function. Therefore, each
agent (bat, CP, particle, etc.) is then initialized with random binary positions
and the original dataset is mapped to a new one which contains the features that
were selected in this first sampling. Further, the fitness function of each agent
is set as the recognition rate of a classifier over Z2 after training in Z1 . As soon
as the agent changes its position, a new training in Z1 followed by classification
in Z2 needs to be performed. As the reader can see, such a formulation requires fast training and classification steps. This is the reason why we have employed the Optimum-Path Forest (OPF) classifier [8,9], since it is a non-parametric and very robust classifier.
However, in order to allow a fair comparison, we have added a third set in the
experiments called validating set (Z3 ): the idea is to establish a threshold that
ranges from 10% to 90%, and for each value of this threshold we marked the features that were selected in at least a minimum percentage of the runs (10 runs) over a learning process in Z1 and Z2, as aforementioned. For instance, a threshold of 40% means we choose the features that were selected in at least 40% of the runs. For each threshold, we computed the fitness function over the
validation set Z3 to evaluate the generalization capability of the selected solution.
Thus, the final subset will be the one that maximizes the curve over the range
of values, i.e., the features that maximize the accuracy over Z3 . Further, these
selected features are then applied to assess the accuracy over Z4. Notice that the
fitness function employed in this paper is the accuracy measure proposed by
Papa et al. [8], which is capable of handling unbalanced classes. Notice we used
30% of the original dataset for Z1 , 20% for Z2 , 20% for Z3 , and 40% for Z4 .
These percentages have been empirically chosen.
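The wrapper evaluation described above can be sketched as follows. The OPF classifier is replaced here by any fast classifier exposing fit/predict, and plain accuracy stands in for the balanced accuracy measure of [8]; both are simplifying assumptions made only for illustration.

import numpy as np

def fitness(mask, clf, X_train, y_train, X_eval, y_eval):
    # Train on Z1 with the features selected by `mask` and score on Z2 (or Z3)
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    clf.fit(X_train[:, cols], y_train)
    return float((clf.predict(X_eval[:, cols]) == y_eval).mean())

def select_by_threshold(masks_from_runs, clf, X1, y1, X3, y3):
    # masks_from_runs: binary masks obtained in each of the 10 learning runs.
    # For every threshold from 10% to 90%, keep the features selected in at
    # least that fraction of the runs and evaluate the result on Z3.
    freq = np.mean(masks_from_runs, axis=0)
    best_mask, best_acc = None, -1.0
    for t in np.arange(0.1, 1.0, 0.1):
        mask = (freq >= t).astype(int)
        acc = fitness(mask, clf, X1, y1, X3, y3)
        if acc > best_acc:
            best_mask, best_acc = mask, acc
    return best_mask, best_acc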

4 Experimental Results
4.1 Dataset
In this work we have employed four datasets, three of which have been obtained from the LibSVM repository1, while NTL refers to a dataset obtained from a Brazilian electrical power company, frequently used to detect thefts in power distribution systems. Table 1 presents all datasets and their main characteristics.

4.2 Experiments
In this section we discuss the experiments conducted in order to assess the robustness of BCSS against BBA, BGSA, BHS (Binary Harmony Search) and
1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Table 1. Description of the datasets used in this work

Dataset     # samples  # features  # classes
Australian  690        14          2
Diabetes    768        8           2
NTL         3182       8           2
Vehicle     846        18          4
BPSO for feature selection. Table 2 presents the parameters employed for each
evolutionary-based technique. Notice for all techniques we employed 30 agents
with 100 iterations. These parameters have been empirically set.

Table 2. Parameters setting of metaheuristic algorithms

Technique  Parameters
BBA        α = 0.9, γ = 0.9
BGSA       G0 = 100
BHS        HMCR = 0.9
BPSO       c1 = 2.0, c2 = 2.0, w = 0.7
Figure 1a shows the OPF accuracy curve for the Australian dataset, in which BBA, BCSS and BGSA achieve the maximum value of the fitness function, equal to 87.20%, with thresholds of 40%, 80% and 70%, respectively. Figure 1b displays the results for the Diabetes dataset. We can see that BBA, BCSS, BGSA and BPSO have achieved the same effectiveness (around 61.5%), while BHS did not perform very well. Actually, although BHS has selected the same number of features, its accuracy over Z4 was about 2.32% lower than that of the other approaches, as we can see in Table 3, which displays the accuracy over Z4 and also the threshold over Z3 for all datasets.
Figures 2a and 2b display the curves over Z3 for Vehicle and NTL datasets,
respectively. From Figure 2a we can see that the maximum accuracy over Z3
has been obtained by BHS with a threshold of 40%, and with respect to NTL
dataset BBA and the proposed BCSS have achieved 95% with a threshold of
30%, while the remaining techniques needed a threshold of 50% to reach such
accuracy.
Table 4 displays the mean execution times for all techniques. The fastest
approach has been BHS, followed by BPSO and BGSA. Although the proposed
technique has required a considerable computational effort, it is not so different from BBA and BGSA. From Table 3, we can see that BCSS has been the most accurate approach, together with other techniques, for the Australian, Diabetes and NTL datasets, and has been the single most effective approach for the Vehicle dataset.
Fig. 1. OPF accuracy curve over Z3 for (a) Australian and (b) Diabetes datasets

Fig. 2. OPF accuracy curve over Z3 for (a) Vehicle and (b) NTL datasets

Table 3. Classification accuracy over Z4 with the best subset of features selected over
Z3

             BBA           BCSS          BGSA          BHS           BPSO
Dataset      Z3     Z4     Z3     Z4     Z3     Z4     Z3     Z4     Z3     Z4
Australian 40.0% 87.20% 80.0% 87.20% 70.0% 87.20% 30.0% 66.85% 70.0% 87.20%
Diabetes 10.0% 67.82% 10.0% 67.82% 10.0% 67.82% 10.0% 65.50% 10.0% 67.82%
NTL 30.0% 95.49% 30.0% 95.49% 50.0% 94.55% 50.0% 89.33% 40.0% 95.49%
Vehicle 90.0% 77.09% 70.0% 78.47% 50.0% 76.44% 40.0% 77.92% 20.0% 76.90%

5 Conclusions

We have proposed a binary version of the well-known continuous-valued Charged System Search, which was derived in order to restrict the charged particles to binary coordinates.
We conducted experiments against several metaheuristic algorithms to
show the robustness of the proposed technique and also its good generalization
Table 4. Mean computational load in seconds

Dataset     BBA     BCSS    BGSA    BHS    BPSO
Australian  115.9   119.3   115.1   5.13   98.40
Diabetes    137.03  139.7   135.0   5.95   136.6
NTL         2305.2  2337.4  2310.4  99.81  2223.7
Vehicle     184.0   186.9   181.0   7.98   182.0
capability. We have employed four datasets to accomplish this task, in which BCSS has been compared against BBA, BGSA, BPSO and BHS. The proposed algorithm has obtained the best results, together with other techniques, for three datasets, and has been the most effective approach for one dataset.

References
1. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
2. Kennedy, J., Eberhart, R.C.: A discrete binary version of the particle swarm algo-
rithm. In: IEEE International Conference on Systems, Man, and Cybernetics, vol. 5,
pp. 4104–4108 (1997)
3. Firpi, H.A., Goodman, E.: Swarmed feature selection. In: Proceedings of the 33rd
Applied Imagery Pattern Recognition Workshop, pp. 112–118. IEEE Computer
Society, Washington, DC (2004)
4. Rashedi, E., Nezamabadi-pour, H., Saryazdi, S.: BGSA: binary gravitational search
algorithm. Natural Computing 9, 727–745 (2010)
5. Ramos, C., Souza, A., Chiachia, G., Falcão, A., Papa, J.: A novel algorithm for
feature selection using harmony search and its application for non-technical losses
detection. Computers & Electrical Engineering 37(6), 886–894 (2011)
6. Nakamura, R.Y.M., Pereira, L.A.M., Costa, K.A., Rodrigues, D., Papa, J.P., Yang,
X.-S.: BBA: A binary bat algorithm for feature selection. In: Proceedings of the
XXV SIBGRAPI - Conference on Graphics, Patterns and Images (2012) (accepted
for publication)
7. Kaveh, A., Talatahari, S.: A novel heuristic optimization method: charged system
search. Acta Mechanica 213(3), 267–289 (2010)
8. Papa, J.P., Falcão, A.X., Suzuki, C.T.N.: Supervised pattern classification based
on optimum-path forest. International Journal of Imaging Systems and Technol-
ogy 19(2), 120–131 (2009)
9. Papa, J.P., Falcão, A.X., Albuquerque, V.H.C., Tavares, J.M.R.S.: Efficient
supervised optimum-path forest classification for large datasets. Pattern Recogni-
tion 45(1), 512–520 (2012)
Outlines of Objects Detection by Analogy

Asma Bellili1, Slimane Larabi1, and Neil M. Robertson2
1 University of Sciences and Technology Houari Boumediene, Computer Science Department, BP 32 El Alia, Algiers, Algeria
[email protected]
2 Edinburgh Research Partnership in Engineering and Mathematics, School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK
[email protected]

Abstract. In this paper we propose a new technique for the detection of object outlines. We exploit the set of contours computed using the image analogies principle. A set of artificial patterns is used to locate the contours of any query image, each one permitting the location of the contours corresponding to a specific intensity variation. We study these contours, and a theoretical foundation is proposed to explain the slow motion of these contours around region boundaries. Experiments are conducted and the obtained results are presented and discussed.

Keywords: Segmentation, Object outline, Analogy, Contour, Multi-Scale.

1 Introduction

Image segmentation is considered an important task in many computer vision applications. It consists of partitioning an image into meaningful regions, including objects. Although many image segmentation methods have been proposed in the literature [5], [16], this problem remains an active topic for two reasons: first, the results of the proposed techniques are still far from what a human can achieve; second, segmentation is a critical step for many applications.
Image analogies constitute a natural means of specifying filters and image transformations. Assuming that the transformation between two images A and A′ is “learned”, image analogies is defined as a method of creating an image filter which allows one to recover by analogy, from any given different image B, the image B′ in the same way as A′ is related to A [6], [9]. Rather than selecting from among myriad different filters and their settings, a user can simply supply an appropriate exemplar (along with a corresponding unfiltered source image) and say, in effect, “Make it look like this”. Ideally, image analogies should make it possible to learn very complex and non-linear image filters [9].
Image analogies have been largely used in many applications such as texture synthesis [2], curve synthesis [10], super-resolution [8], image colorization, image enhancement and artistic filters [14], [15]. This new technique has also been used

in supervised medical image segmentation [11], which consists in finding by analogy the same colored regions in medical images as those processed by the expert.
Recent work has been published concerning contour detection by image analogies, which attempts to locate contours as humans do [12]. A set of training images (artificial patterns) is proposed, producing several images of contours at varying intensity levels. Each one is obtained by applying the corresponding pattern (see figures in Table 1).
We present in this paper what can be achieved with these contours for the detection of object outlines. We note that the motion of these contours from one pattern to another is implicitly related to region boundaries, similar to those required for segmentation. A fast motion is present when the considered part of the image does not contain objects or regions. However, this motion is slow, and the contours are sometimes static, around region boundaries. We prove this property theoretically in this paper, and it serves as the basis for a new approach to the detection of object outlines. In Section 2 we present a review of contour detection by image analogy [12]. We propose in Section 3 a theoretical foundation of our method. Section 4 is devoted to the experiments conducted on different images.

Table 1. Illustrative contours located using a selection (four) of the 14 artificial pat-
terns. Note the increase of intensity around the located contours from left to right.

2 Contour Detection Using Image Analogies: A Review [12]
The problem addressed in [12] is how to automatically locate contours on the query image IB, giving the result SB, in the same way as this is done for (IA, SA), where IA is an initial image whose contours are manually located, giving the synthesized image noted SA (see figure 1).
Using the Image Analogy technique, each pixel q of IB is classified (contour
pixel or not) using the knowledge inferred from (IA , SA ). The best match p∗ of
q is searched for in IA using the neighbourhoods N(p), N(q) of p and q. For this, p∗ must attain the minimal value of the similarity measure Sm(q, p) between the N(p), N(q) pixels. A kernel K(m × m) is used in this measure in order to give a high weight to the pixels of the four main directions (horizontal, vertical and two diagonals).
The main result is that if the training pair of images (IA , SA ) and the query
one IB are taken from the same scene, the location of contour pixels of IB is
Fig. 1. Contour detection by analogy: the basic principle

done with success. However when IB is from a different scene, the location of
contour pixels cannot be done without the loss of many candidates. To locate
all contour pixels, a set of constraints must be verified in the neighbours N (p),
N (q). To avoid this, a set of pairs of artificial patterns (IA , SA ) are proposed
instead of hand drawn contours. The pattern IA is composed by a shape with
intensity FA (Foreground) and a background with intensity GA . The pattern SA
is the same as IA , in addition, the contour is highlighted. The values of (GA , FA )
are chosen so as for any query pixel q, the values of (GB , FB ) representing the
average of intensities of N (q) regions verify the required constraints. The set of
patterns P1 , P2 , ..., P14 (see figure 2) are characterised by the values of GA , FA
(background and foreground intensities):
(0, 32), (0, 64), (0, 96), (0, 128), (0, 160), (0, 192), (0, 224), (64, 192), (64, 224),
(96, 224), (128, 224), (160, 224), (192, 224), (208, 240).
For each pattern (IA, SA) and for a query image IB, only the set of contour pixels q will be localized such that the intensities of the neighbouring pixels in N(q) verify a defined constraint related to (IA, SA). We then obtain 14 images of contours corresponding to the 14 patterns (see figures in Table 1).
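As an illustration of the basic analogy step reviewed above, the sketch below labels each pixel of IB with the label of its best matching pixel in IA. The uniform kernel and the exhaustive search are deliberate simplifications (the original measure Sm weights the four main directions), so this should be read as an assumption-laden sketch rather than the exact method of [12].

import numpy as np

def neighbourhood(img, y, x, m):
    h = m // 2
    return img[y - h:y + h + 1, x - h:x + h + 1].astype(float)

def contours_by_analogy(IA, SA, IB, m=5):
    # IA: training pattern, SA: its contour image, IB: query image
    h = m // 2
    out = np.zeros(IB.shape, dtype=np.uint8)
    coords_A = [(y, x) for y in range(h, IA.shape[0] - h)
                        for x in range(h, IA.shape[1] - h)]
    patches_A = np.array([neighbourhood(IA, y, x, m) for (y, x) in coords_A])
    for y in range(h, IB.shape[0] - h):
        for x in range(h, IB.shape[1] - h):
            Nq = neighbourhood(IB, y, x, m)
            sm = ((patches_A - Nq) ** 2).sum(axis=(1, 2))   # simplified Sm(q, p)
            best = coords_A[int(np.argmin(sm))]
            out[y, x] = SA[best]    # q inherits the (contour / non-contour) label
    return out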

Fig. 2. Artificial patterns (IA , SA )

3 Outlines of Objects Detection
The use of artificial patterns allows locating the contours of any query image IB and provides the images SB,1, ..., SB,n, where n is the number of artificial patterns (n = 14). The computed contours differ from SB,i to SB,i+1. The figures of Table 2 illustrate the contours computed using the patterns P3 and P4. We note that inside regions, contours move quickly from one pattern to another, whereas around region boundaries they move slowly or are steady.
First, we introduce the property of object boundaries; we prove it theoretically next. Finally, we describe our method for the detection of object outlines based on this property.
Table 2. Contours located using the patterns P3 and P4

Property of Region's Boundary. Contours extracted by image analogy are more stable at region boundaries and are unstable in other parts of the image.
Proof.
We prove that if the contour is moving slowly, this implies that there is a boundary, defined as a gradual change of intensity between neighbouring pixels. Let q be a contour pixel detected by the pattern Pi but not detected by the next one Pi+1, and q′ a contour pixel detected by the pattern Pi+1 but not detected by the previous pattern Pi. Let GAi, FAi be the intensities of the two regions (background and foreground) of the pattern Pi.
If q is detected by the pattern Pi, then the values GB, FB associated to N(q) verify (see figure 3) [12]: FB ≥ GB^i + δl and GAi < GB ≤ GB^i, where GB^i = (FAi + GAi)/2 and δl is the minimum intensity difference between two different regions.
As Pi and Pi+1 are successive patterns, this means that either ((FAi = FAi+1) and (GAi+1 = GAi + 2δl)) or ((GAi = GAi+1) and (FAi+1 = FAi − 2δl)). We consider in this proof that (GAi = GAi+1); the same reasoning is also valid for the other cases. If q is not detected using the pattern Pi+1, then FB is necessarily lower than GB^{i+1} + δl, where GB^{i+1} = (GAi+1 + FAi+1)/2, otherwise it would be detected by the pattern Pi+1. The belonging interval of FB is then [GB^i + δl, GB^{i+1} + δl] (see figure 3). We assume that q is located as a contour pixel using the pattern Pi, and let q′ be the pixel neighbouring q, where G′B, F′B are the averages of intensities associated to N(q′). Now if we assume that q′ is located by the pattern Pi+1 and not detected by the pattern Pi, this means that the contour is steady (or moving slowly). We get from the previous result: GB^i + δl < FB < GB^i + 2δl and GB^{i+1} + δl < F′B. This is possible if F′B ≥ GB^i + 2δl and GB^{i+1} > G′B > GB^i.
Let dist = 1 be the distance between the two pixels q and q′ (see figure 4). Without loss of generality, we can write: G′B = (5 × FB + 5 × GB)/10 and F′B = (10 × FB + 5 × F″B)/15, where F″B is the average intensity of the pixels neighbouring N(q) and m = 5 is the size of the neighborhoods N(p), N(q).
As F′B ≥ GB^i + 2δl and G′B < GB^{i+1}, the analysis of these relations gives the condition: F″B > 3GB^i − 2FB + 6δl. However, as FB < GB^{i+1} + δl, N(q′) must then verify: F″B > GB^{i+1} + δl. Also, as G′B = (FB + GB)/2, and GB^{i+1} + δl ≥ FB ≥ GB^{i+1}, then GB^i ≤ G′B < GB^{i+1}. Then, to locate q′ as a contour pixel, a minimal difference of luminance intensity between F″B and FB in N(q′), equal to δl, must be verified.
Fig. 3. Possible values of FB in the case where q is detected by only one pattern

Fig. 4. Example of contour motion with dist = 1, 2, 3; N(q), N(q′) are illustrated in red and green colors
We note also the presence of a gradual change of luminance intensities between GB, G′B, FB and F′B. For the case dist = 2, applying the same reasoning (see figure 4), we obtain: G′B = FB and F′B = (5FB + 10F″B)/15. As F′B ≥ GB^i + 2δl and GB^{i+1} > G′B > GB^i, we get: (2F″B + FB)/3 ≥ GB^i + 2δl. This implies that 2F″B ≥ 3GB^i + 6δl − FB; we get F″B ≥ GB^{i+1} + δl and thus F″B ≥ GB + δl. When dist = 3 (see figure 4), we have G′B = FB and F′B = F″B. As F′B ≥ GB^i + 2δl, we get the same relation: F″B ≥ GB + δl. Otherwise, if q′ is not detected by the pattern Pi+1, this means that there is no intensity variation in the neighbourhood of q.
3.1 Outline of Objects Detection: The Algorithm

We define the energy of a contour as the number of times it is located by successive patterns with slow motion. We proved in the previous subsection that when a contour is moving slowly, and thus has high energy, it corresponds to an object outline (border).

4 Results

We present in this section the results obtained by applying our method to real images of BSD [4]. Firstly, we illustrate in Table 3 the evolution of the contour located using the artificial patterns. We can see that the contour located using P7, P8, P9 is steady around the object boundary, except in the central left part where the contour moves fast (3 pixels from one pattern to another). For the patterns P10, P11, P12, contours move fast from one pattern to the other due to the absence of an object boundary.
We applied our method using different values of the energy, defined as the number of times the contour is steady or moves slowly. Increasing the energy value produces the most significant contours, corresponding to a high
Algorithm 1. Object Outlines Detection

Extract contours C_j^i using all patterns P_i
for each pair of successive patterns P_i, P_i+1 do
  for each contour C_j^i do
    Find the contour C_k^{i+1} neighbouring C_j^i with (dist < 3); energy(C_k^{i+1})++
  end for
end for
Select contours of a given energy
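A possible realisation of the energy accumulation of Algorithm 1 is sketched below. The neighbourhood test between two contours (minimum pixel-to-pixel distance smaller than 3) is an illustrative choice for the dist < 3 condition, not a detail taken from the paper.

import numpy as np

def min_contour_distance(c1, c2):
    # c1, c2: arrays of (row, col) coordinates of the contour pixels
    d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)
    return d.min()

def accumulate_energy(contours_per_pattern, dist_max=3):
    # contours_per_pattern[i]: list of contours extracted with pattern P_i
    energy = {}
    for i in range(len(contours_per_pattern) - 1):
        for cj in contours_per_pattern[i]:
            nearest, dmin = None, dist_max
            for k, ck in enumerate(contours_per_pattern[i + 1]):
                d = min_contour_distance(cj, ck)
                if d < dmin:
                    nearest, dmin = k, d
            if nearest is not None:
                energy[(i + 1, nearest)] = energy.get((i + 1, nearest), 0) + 1
    return energy   # contours are then selected according to the requested energy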

Table 3. Contour’s evolution using the patterns P7 , P8 , P9 , P10 , P11

difference of intensities between the related regions. The figures of Table 4 illustrate the results obtained for energy equal to 3 and 4.
To measure the quality of the located outlines, we used the precision and recall ratios, computed using the numbers of pixels found in the automatic contours vs. the correct (hand-drawn) ones. For the dataset BSD500 [4], five hand-drawn contours are available as ground truth for each image.
Depending on the energy used, which is synonymous with a resolution level, the Precision and Recall take different values. The more the energy increases, the more the precision increases, because only the contour pixels corresponding to a high difference of intensity are located and the number of false candidates therefore decreases. However, the recall decreases because the number of located outline pixels decreases. Figure 5 illustrates the values of Precision/Recall for Energy=1. These results

Fig. 5. Precision-Recall values for the BSD dataset when Energy=1
Table 4. (Left to right): original image, located outlines with energy equal to 3 and 4

are similar to those of Arbelaez et al. [4]. For high Recall values, our Precision is better and the difference reaches 20%. However, for low Recall values, our Precision values are close to those of Arbelaez et al. [4]; the difference is around 3%.

5 Conclusion and Future Work

We proposed in this paper a new technique for Object Outlines Detection based on image analogy. In the first part, we presented a review of the contour detection by image analogy technique, and then we gave a theoretical explanation of the steady contour motion corresponding to an object boundary. The proposed algorithm has been applied to the Weizmann and BSD datasets and the obtained results are presented. These results are promising, knowing that only intensity is used in this approach. We plan to add new attributes, e.g. color, in the contour detection stage in order to locate the contours which may be missed using the current approach.

References
1. Alpert, S., Galun, M., Basri, R., Brandt, A.: Image Segmentation by Probabilistic
Bottom-Up Aggregation and Cue Integration. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (June 2007)
2. Ashikhmin, M.: Fast texture transfer. IEEE Computer Graphics and Applica-
tions 23(4), 38–43 (2003)
3. Alpert, S., Galun, M., Basri, R., Brandt, A.: Image Segmentation by Probabilistic Bottom-Up Aggregation and Cue Integration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007)
4. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierar-
chical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence 33(5), 898–916 (2011)
5. Cheng, H.D., Jiang, X.H., Sun, Y., Wang, J.L.: Color image segmentation: advances
and prospects. Pattern Recognition 34, 2259–2281 (2001)
6. Cheng, L., Vishwanathan, S.V.N., Zhang, X.: Consistent image analogies using
semi-supervised learning. In: IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2008 (2008)
7. De Winter, J., Wagemans, J.: Segmentation of object outlines into parts: A large-
scale integrative study. Cognition 99, 275–325 (2006)
8. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning Low-Level Vision. In-
ternational Journal of Computer Vision 40(1) (2000)
9. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Seitz, S.M.: Image analogies.
In: SIGGRAPH Conference Proceedings, pp. 327–340 (2001)
10. Hertzmann, A., Oliver, N., Curless, B., Seitz, S.M.: Curve analogies. In: Proc. 13th
Eurographics Workshop on Rendering, Pisa, Italy, pp. 233–245 (2002)
11. Lackey, J.B., Colagrosso, M.D.: Supervised segmentation of visible human data
with image analogies. In: Proceedings of the International Conference on Machine
Learning; Models, Technologies and Applications (2004)
12. Larabi, S., Robertson, N.M.: Contour detection by image analogies. In: Bebis, G.,
Boyle, R., Parvin, B., Koracin, D., Fowlkes, C., Wang, S., Choi, M.-H., Mantler, S.,
Schulze, J., Acevedo, D., Mueller, K., Papka, M. (eds.) ISVC 2012, Part II. LNCS,
vol. 7432, pp. 430–439. Springer, Heidelberg (2012)
13. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented
Natural Images and its Application to Evaluating Segmentation Algorithms and
Measuring Ecological Statistics. In: Proc. 8th Int’l Conf. Computer Vision (2001)
14. Sykora, D., Burianek, J., Zara, J.: Unsupervised colorization of black-and-white
cartoons. In: Proceedings of the 3rd Int. Symp. Non-photorealistic Animation and
Rendering, pp. 121–127 (2004)
15. Wang, G., Wong, T., Heng, P.: Deringing cartoons by image analogies. ACM Trans-
actions on Graphics 25(4), 1360–1379 (2006)
16. Zhanga, H., Frittsb, J.E., Goldmana, S.A.: Image segmentation evaluation: A sur-
vey of unsupervised methods. Computer Vision and Image Understanding 110(2),
260–280 (2008)
PaTHOS: Part-Based Tree Hierarchy for Object
Segmentation

Loreta Suta1, Mihaela Scuturici1, Vasile-Marian Scuturici2, and Serge Miguet1
1 Université de Lyon - LIRIS, Université Lumière Lyon 2, 5, Avenue Pierre Mendès-France, 69676 Bron Cedex, France
{Mihaela.Scuturici,Loreta.Suta,Serge.Miguet}@univ-lyon2.fr
2 Université de Lyon - LIRIS, INSA de Lyon, 7 Bd Jean Capelle, 69621 Villeurbanne Cedex, France
[email protected]

Abstract. The problem we address in this paper is segmentation and hierarchical grouping in digital images. In terms of image acquisition protocol, no constraints are imposed on the user. At first, a histogram thresholding provides numerous segments in which a homogeneity criterion is respected. Segments are merged together using similarity properties and aggregated into a hierarchy based on spatial inclusions. Shape and color features are extracted from the produced segments. Tests performed on Oxford Flowers 17 [8] show that our method outperforms a similar one and allows the selection of the relevant object from the hierarchy. In our case, this approach represents the first stage towards flower variety identification.

Keywords: segmentation, hierarchical grouping, plant recognition.

1 Introduction

The swift growth of innovative technologies in the field of smartphone applications has generated new possibilities for exploiting multimedia data. Users can take photos of objects and use them to search over the internet for complementary information (e.g. Google Goggles for landmarks, artwork, wine and logos; Leafsnap [16] and Folia [15] to identify leaf species, etc.). This implies the recognition of the photographed object. There are situations where, for recognition to be effective, we need a segmentation step that is as accurate as possible.
In the case of flower species recognition, [11] investigated a new segmentation algorithm and compared the results to those obtained by [3], claiming that recognition was improved by 4% due to better segmentation. To our knowledge, there are several applications attempting to recognize flower species; nevertheless, this remains an active research field. Moreover, retrieving the flower variety is even
more challenging. We are also aware that a single photo may be insufficient to identify the species, and even less the variety, of a flower. From a botanical point of view, plant recognition should furthermore take into account plant morphology, i.e., features based on the appearance or the external form of a plant. The study of the vegetative parts (roots, stems and leaves) as well as the reproductive parts (inflorescences, flowers, fruits and seeds) is crucial for plant variety identification.
The purpose of this paper is object segmentation at multiple levels of detail. A hierarchical aggregation describes an image in terms of its constituent objects. Thus, the analysis of plant morphology becomes accessible, in contrast with classical approaches. In particular, we are interested in plant recognition tasks based on images of flowers and/or inflorescences [14]; therefore, an accurate segmentation is an essential step.
The remainder of the paper is organized as follows: Section 2 presents related work. Section 3 describes our hierarchical approach for natural image segmentation, and relevant object selection is presented in Section 4. Experimental results are shown in Section 5, followed by conclusions and future work in Section 6.

2 Related Work

A wide variety of image segmentation methods have been recently proposed fo-
cusing on particular object types (cars, birds, horses, plants, etc.). In the field
of flower segmentation, authors explore background/foreground separation tech-
niques, [1] and [12], or combine them with geometrical models, [2] and [10], and
superpixel segmentation, [11], [3] and [13].
Bottom-up methods use uniformity conditions in order to form image seg-
ments which are merged together respecting homogeneity criteria such as simi-
lar color properties, spatial structure, texture, etc. The object results from the
aggregation of several components. For example, the components of a plant can
be the flower, the leaves and the stem. According to the level of detail we would
like to study, this approach offers multiple detail levels. In [5] the authors pro-
pose an image segmentation technique based on a high-performance contour
detector. Oriented watershed procedure creates regions from the oriented con-
tour signal. Image segmentation is achieved by agglomerative clustering with a
method that transforms contours into a hierarchy of regions. [9] uses hierarchical
grouping for object localization. Segments are generated using [4] while a greedy
algorithm merges similar regions based on size and appearance features. In [7]
and [6] a novel multiscale image segmentation method is introduced. The ramp transform detects ramp discontinuities and seeds for all regions, while a region growing technique creates the desired segmented parts, which can be organized as a tree representation. Segmentation is independent of object properties, parameters and initialization; on the other hand, the region growing technique may produce false associations.
We focused on hierarchical approaches since our purpose is plant morphology analysis (sepals, petals, stamens, etc.). It will be an intermediate step towards object semantics: a complex object having several regions of different colors will be represented as a more complex hierarchical structure than a uniform color region, which will be represented as a single node in the hierarchy.

3 Segmentation and Hierarchical Grouping

The target of our segmentation method is to group pixels using spatial criteria to build a tree-like representation of the input image, where each node describes an object. Figure 1 presents the main stages of our approach: input image (1), segmentation using a color uniformity criterion filter (2), segment merging and aggregation into a hierarchy (3), feature extraction (4) and relevant object selection (5).

Fig. 1. The diagram of the proposed method describing the main steps

Our segmentation method is presented in Algorithm 1. It takes an image as input and a number of bins. Each bin corresponds to a color range (or a color histogram bin), generating a segmentation of the input image. For every colour channel of the input image and for each possible bin, we build a new binary image binaryImg. This image sets to 1 all the pixels corresponding to the bin and to 0 the others (lines 5-11).
Pixels are grouped according to the color homogeneity criterion (line 12, the call to the function segmentation). Pseudo-objects - hereby denoted segments - are detected using an edge detection filter. A large number of segments are created, characterized by similar colors, and kept in the list O (line 12). The same segments can be found several times in the list (under slightly different forms) due to color transitions caused by illumination.
Starting from the list O we organize the segments in a hierarchical structure
using the following steps (build tree from line 15):
Algorithm 1. "PaTHOS"
Require: Image I; number of bins for each color channel BinCount.
Ensure: Object hierarchy H.
 1: O ← ∅
 2: for color = 0 → I.channels do
 3:   for i = 0 → BinCount − 1 do
 4:     binaryImg ← new Image
 5:     for all pixel ∈ I do
 6:       if (0 ≤ pixel[color] < (i + 1) ∗ 255/BinCount) then
 7:         binaryImg[pixel] = 1
 8:       else
 9:         binaryImg[pixel] = 0
10:       end if
11:     end for
12:     O ← O ∪ segmentation(binaryImg)
13:   end for
14: end for
15: H ← build tree(O)
16: return H

– Inclusion relationship: a segment o1 is included in the segment o2 if all the pixels from o1 are found in o2; we denote this relation as o1 ⊆ o2
– Equality relationship (for identical segments): two segments o1 and o2 are equal (identical) if o1 ⊆ o2 and o2 ⊆ o1
– Merging: two identical segments o1 and o2 are merged into a single one (o1)
– Child-parent relationship: the segment o1 ∈ O is the direct child of o2 ∈ O if ¬∃ o3 ∈ O | o1 ⊂ o3 ⊂ o2
The resulting tree H characterises the content of the image I. Segments from several partitions of I contribute to the construction of this tree.
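A compact sketch of Algorithm 1 is given below, using SciPy's connected-component labelling as a stand-in for the segmentation step (the paper relies on an edge detection filter, so this is an assumption) and a sorted inclusion test for build_tree.

import numpy as np
from scipy import ndimage

def pathos_segments(I, bin_count=8):
    # Lines 1-14 of Algorithm 1: one cumulative binary image per channel and bin,
    # whose connected components are collected as candidate segments
    segments = []
    channels = I if I.ndim == 3 else I[..., None]
    for c in range(channels.shape[2]):
        chan = channels[..., c]
        for i in range(bin_count):
            binary = chan < (i + 1) * 255 // bin_count
            labels, n = ndimage.label(binary)
            for lab in range(1, n + 1):
                segments.append(frozenset(zip(*np.nonzero(labels == lab))))
    return segments

def build_tree(segments):
    # Merge identical segments, then link each segment to a smallest strict
    # superset, which plays the role of its direct parent in the hierarchy H
    segments = sorted(set(segments), key=len)
    parent = {}
    for a, s in enumerate(segments):
        for b in range(a + 1, len(segments)):
            if s < segments[b]:
                parent[a] = b
                break
    return segments, parent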

4 Relevant Object Selection
After the segmentation process, we obtain a hierarchy of multiple segments.
Among them, there are one or several objects similar to the object of interest
that we would like to find automatically. In our case, the object of interest is the
flower. Other segments in the hierarchy represent either parts of the background
or insignificant sub-parts or super-parts of the object of interest. For example,
a single petal is a part of the flower that we obtain in the hierarchy due to the
homogeneity of its color. However, we consider it less significant in comparison
with the entire flower.
We noticed that many segments in the segmentation hierarchy are irrelevant objects. Thus, we supposed that they share common features which differ from those of the segments of interest (irregular edges, color, position, etc.). We perform supervised learning using a C4.5 decision tree in order to choose the relevant segment(s) (Figure 1 - (5)) from the hierarchy. Unlike other supervised learning
algorithms such as SVM (Support Vector Machine) or neural networks, a decision tree provides comprehensible decision rules involving the most discriminant features, which are reusable in subsequent tasks or embedded systems.
In order to perform supervised learning of correct and incorrect parts of the
segmentation hierarchy, the segments are labeled as correct/incorrect segmenta-
tions. We add a new attribute, GoodSegmentation which is set to ”yes” if the
segmentation is correct and to ”no” if the segmentation is incorrect. In order
to automatically label the segments as correct/incorrect we rely on the ground
truth and the accuracy defined in information retrieval. All the segments with
accuracy ≥ t are considered as correct (GoodSegmentation = ”Yes”) while the
segments where accuracy < t are considered incorrect (GoodSegmentation =
”No”). Here, t represents a threshold fixed at 0.85 (see section 5 for details).
For each segment, we also extract the following features: shape (perimeter,
area, Hu moments), color (normalized entropy and standard deviation), round-
ness, eccentricity, minimum bounding box, gravity center and diameter. We ap-
ply a C4.5 decision tree algorithm in order to predict the qualitative attribute GoodSegmentation. As we will see in Section 5, the decision tree provides several decision rules allowing us to choose the correct segments with a 77% success rate.
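The selection stage can be sketched as follows. Scikit-learn's CART tree with the entropy criterion is used here as a stand-in for C4.5, and choosing the largest predicted-correct segment is only one possible way of picking a single relevant object; both are assumptions, not details from the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def label_segments(accuracies, t=0.85):
    # GoodSegmentation = "Yes" when the accuracy against the ground truth is >= t
    return np.asarray(accuracies) >= t

def train_selector(features, accuracies, t=0.85):
    clf = DecisionTreeClassifier(criterion="entropy")
    return clf.fit(features, label_segments(accuracies, t))

def pick_relevant(clf, features, areas):
    # Among the segments of a hierarchy predicted as correct, return one of them
    # (here, the largest) as the relevant object
    good = clf.predict(features).astype(bool)
    if not good.any():
        return None
    idx = np.flatnonzero(good)
    return int(idx[np.argmax(np.asarray(areas)[idx])])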

5 Experimental Results

5.1 Segmentation

As a hierarchical implementation, our method returns one object from the several found in a digital image, see Figure 2. This is an advantage that most other approaches do not provide ([1], [3], etc.). It represents a relevant property as, in recognition tasks, methods take only one object in order to recognize it. The last column in Figure 2 shows three possible flowers in the original image (corresponding to foreground objects), but only one is selected as being relevant. This disfavors our approach in comparative tests.

Fig. 2. Flower segmentation results: top row - original images; bottom row - our seg-
mentations. Note that for images containing multiple objects, we obtain each object
separately unless a spatial correlation exists.
Tests were performed on Oxford Flowers 17 [8], with 848 images for which ground-truth segmentations are available. Applying the proposed segmentation method, we obtain 5958 segments (one or several segments organized in a hierarchy per image). Subsequently, the decision tree indicates 2089 correct segments (GoodSegmentation = "Yes"), corresponding to the relevant objects (hereby flowers).
In order to estimate the quality of our segmentation, we compare our results to the contour-based tree segmentation approach presented in [6]. As their method produces similar results without specifying correct segmentations, the user has to indicate the one corresponding to the object of interest. We chose their best result for each image according to the maximum value of the accuracy compared to the ground truth. Table 1 presents the average Hausdorff distance between the segmented objects and the ground truth. The best segmentation is achieved in the case of minimum distance. Table 1 shows a small advantage of our method compared to [6].
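For reference, the evaluation metric used in Table 1 can be computed as below; this is an illustrative sketch of a symmetric Hausdorff distance between two binary masks, using SciPy, and not necessarily the exact implementation used for the table.

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(mask_a, mask_b):
    # Compare the foreground pixel coordinates of the two masks
    pts_a = np.column_stack(np.nonzero(mask_a))
    pts_b = np.column_stack(np.nonzero(mask_b))
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])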

Table 1. Performance evaluation of the segmentation results on Oxford Flowers 17

Method              Hausdorff Distance
Tree segmentation   35.34
Our method          30.71

5.2 Relevant Object

In order to choose the relevant object of each segmented image, we performed supervised learning using a C4.5 decision tree in order to learn the GoodSegmentation attribute values ("Yes"/"No"). 5958 segments were labeled using a threshold t = 0.85 on the accuracy value. In order to choose the best value of the parameter t, we varied the threshold value with a 0.05 step. For an accuracy rate higher than 0.85, the correct segments can be identified with a success rate of 77%. The best segmentation results are achieved for accuracy rates above 0.94, but with much less available data (not enough correct segments available). The supervised learning of correct segments was performed in cross-validation: 10 validations with stratified sampling on the "GoodSegmentation" column, which gives a more realistic estimate of the error rate.
Decision rules based on the extracted features for correct/incorrect segmentation labeling are presented in Figure 3. The confusion matrix shows 4588 accurately classified segments out of a total of 5958 segments, resulting in an error rate of 22%. Figure 4 presents two segments with different accuracy values belonging to the same original image. Both images were initially labeled as correct. Due to the supervised learning, the bottom-right image is labeled as incorrect, which proves the necessity of such a validation.
Fig. 3. Decision rules

Fig. 4. Results relevant object choice (first row - original and groundtruth image;
second row - correct and incorrect segmentation after classification)

6 Conclusions and Future Work

In this paper we presented a hierarchical grouping segmentation model. At first, a histogram thresholding provides numerous segments in which a homogeneity criterion is respected. Segments are merged and aggregated into a hierarchy via spatial inclusions. Shape and color features are extracted from the produced segments.
Supervised learning applied to the features and the accuracy labels the segments as correct/incorrect segmentations and, therefore, allows the choice of relevant objects. Tests have been conducted on Oxford Flowers 17 in order to compare our results with a similar state-of-the-art method.
Future work includes the development of new features based on the hierarchy
that may complement classical ones in the process of object selection. Our target
is to employ the presented segmentation approach in plant recognition tasks
extending our research towards flower variety identification.

Acknowledgements. This work has been supported by the French National Agency for Research with the reference ANR-10-CORD-005 (REVES project).

References
1. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive Foreground Extrac-
tion using Iterated Graph Cuts. ACM Transactions on Graphics 23, 309–314 (2004)
2. Nilsback, M.-E., Zisserman, A.: Delving Deeper into the Whorl of Flower Segmen-
tation. Image and Vision Computing 28(6), 1049–1062 (2010)
3. Chai, Y., Lempitsky, V., Zisserman, A.: BiCoS: A Bi-level Co-Segmentation
Method for Image Classification. In: IEEE International Conference on Computer
Vision, pp. 2579–2586 (2011)
4. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient Graph-Based Image Segmenta-
tion. International Journal on Computer Vision 59(2), 167–181 (2004)
5. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierar-
chical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence 33(5), 898–916 (2011)
6. Akbas, E., Ahuja, N.: From Ramp Discontinuities to Segmentation Tree. In: Zha,
H., Taniguchi, R.-i., Maybank, S. (eds.) ACCV 2009, Part I. LNCS, vol. 5994, pp.
123–134. Springer, Heidelberg (2010)
7. Todorovic, S., Ahuja, N.: Unsupervised Category Modeling, Recognition, and Seg-
mentation in Images. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 30(12), 2158–2174 (2008)
8. Oxford Flowers 17 (2011), http://www.robots.ox.ac.uk/~vgg/data/bicos/
9. van de Sande, K., Uijlings, J., Gevers, T., Smeulders, A.: Segmentation as Selective
Search for Object Recognition. In: IEEE International Conference on Computer
Vision, pp. 1879–1886 (2011)
10. Cerutti, G., Tougne, L., Vacavant, A., Coquin, D.: A Parametric Active Polygon for
Leaf Segmentation and Shape Estimation. In: International Symposium on Visual
Computing, pp. 202–213 (2011)
11. Angelova, A., Zhu, S., Lin, Y.: Image segmentation for large-scale subcategory
flower recognition. In: IEEE Workshop on the Applications of Computer Vision,
pp. 39–45 (2013)
12. Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., John Kress, W., Lopez,
I.C., Soares, J.V.B.: Leafsnap: A computer vision system for automatic plant
species identification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid,
C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 502–516. Springer, Heidelberg
(2012)
13. Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L., Zisserman, A.: TriCoS: A
tri-level class-discriminative co-segmentation method for image classification. In:
Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012,
Part I. LNCS, vol. 7572, pp. 794–807. Springer, Heidelberg (2012)
14. Singh, G.: Plants Systematics: An Integrated Approach. Science Publishers (2004)
15. Folia (2011), http://liris.cnrs.fr/reves/index.php
16. Leafsnap (2011), http://leafsnap.com/
Tracking System with Re-identification
Using a Graph Kernels Approach

Amal Mahboubi1 , Luc Brun1 ,


Donatello Conte2 , Pasquale Foggia2, and Mario Vento2
1
GREYC UMR CNRS 6072, Equipe Image ENSICAEN
6, boulevard Maréchal Juin F-14050 Caen, France
[email protected] , [email protected]
2
Dipartimento di Ingegneria dell’Informazione, Ingegneria Elettrica e Matematica
Applicata
Università di Salerno, Via Ponte Don Melillo, 1 I-84084 Fisciano (SA), Italy
{dconte,pfoggia,mvento}@unisa.it

Abstract. This paper addresses people re-identification problem for vi-


sual surveillance applications. Our approach is based on a rich description
of each occurrence of a person thanks to a graph encoding of its salient
points. People appearance in a video is encoded by bags of graphs whose
similarities are encoded by a graph kernel. Such similarities combined
with a tracking system allow us to distinguish a new person from a re-
entering one into a video. The efficiency of our method is demonstrated
through experiments.

Keywords: Visual surveillance, Graph Kernel, Re-identification.

1 Introduction
Re-identification is a recent field of study in pattern recognition. The purpose of
re-identification is to identify object/person coming back onto the field view of
a camera. Such a framework may be extended to the tracking of object/persons
on a network of cameras.
Methods dealing with the re-identification problem can be divided into two
categories. A first group is based on building a unique signature for object.
Features used to describe signatures are different: regions, Haar-like features,
interest points [1], [2]. The second group of methods [3], [4] does not use a single
signature for the object, but the latter is represented by a set of signatures. Thus,
the comparison between objects takes place between two sets of signatures rather
than between two individual signatures.
The basic idea of our work starts from the consideration that there are few
works that exploit relationships between the visual features of an object. Fur-
thermore, our work combines both approaches by describing a person both with
a global descriptor over several frames and a set of representative frames. More
precisely, the principle of our approach is to represent each occurrence of a per-
son at time t by a graph representation called a t-prototype (Section 2). A kernel

between t-prototypes (Section 3) is proposed in order to encode the similarity


between two persons based on their appearance on a single frame.
The design of the proposed kernel in Section 3 is based on a previous kernel
[6] devoted to image indexation. However for people re-identification, tracking
problems have to be addressed in order to cover the re-identification investiga-
tions. Within our framework a person is not characterized by a single image but
by a sequence of images encoding its appearance along several frames. This new
proposal (as shown by the dotted box in Figure 1) is described in section 4 and
The global appearance of a person over a video is described by a bag of t-prototypes (Section 4) and global features of the bag computed on representative t-prototypes. The temporal window over which a bag of t-prototypes is built is
called the history tracking window (HTW). Kernels between bags of t-prototypes
are proposed in Section 4.1 in order to measure the similarity of two persons on
several frames. Such kernels are used within our tracking system (Section 5) in
order to determine if an entering person is a new person or a re-entering one. The
efficiency of the proposed approach is evaluated through experimental results in
section 6.

2 T-Prototype Construction
The first step of our method consists in separating subjects from the background. To that end, we use binary object masks [5] defined by a foreground detection with shadow removal. Each moving person within a frame is thus associated with a mask that we characterize using SIFT key point detectors. Such key points provide a fine local characterization of the image inside the mask which is robust against usual image transformations such as scaling and rotation. Each key point is represented by its x and y coordinates, scale, orientation and 128 numbers (the descriptors) per color channel. In order to contextualize the information encoded by SIFT points, we encode them by a mutual k nearest neighbor graph G = (V, E, w), where V corresponds to the set of SIFT points, E to the set of edges and w is a weight function defined over V as the scale of appearance of the corresponding vertex. The set of edges E is defined from the key point coordinates x and y: one edge (v, v′) belongs to E if v belongs to the k nearest neighbors of v′ while v′ belongs to the k nearest neighbors of v. The degree of each vertex is thus bounded by k. For a given vertex u, we take into account the local arrangement of its incident vertices by explicitly encoding the sequence of its neighbors encountered when turning counterclockwise around it. This neighborhood $N(u) = (u_1, \ldots, u_n)$ is thus defined as an ordered set of vertices. The first vertex of this sequence, $u_1$, is arbitrarily chosen as the upper right vertex. The set $\{N(u)\}_{u \in V}$ is called the bag of oriented neighborhoods (BON). The node u is called the central node.
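As an illustration of this construction, the sketch below builds a mutual k-nearest-neighbour graph from key-point coordinates and orders each neighbourhood counterclockwise. It is a minimal sketch under our own assumptions (plain NumPy, the "upper right" start direction taken as 45°), not the authors' implementation.

```python
import numpy as np

def mutual_knn_graph(points, k):
    """Mutual k-NN graph over 2D key-point coordinates (one row per SIFT point)."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours of every vertex
    edges = set()
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:                       # keep the edge only if it is mutual
                edges.add(frozenset((i, int(j))))
    return edges                                  # vertex degree is therefore bounded by k

def oriented_neighbourhood(u, points, edges):
    """N(u): neighbours of u ordered counterclockwise, starting near the upper right."""
    nbrs = [v for e in edges if u in e for v in e if v != u]
    rel = points[nbrs] - points[u]
    start = np.pi / 4                             # assumed "upper right" reference direction
    ang = (np.arctan2(rel[:, 1], rel[:, 0]) - start) % (2 * np.pi)
    return [nbrs[i] for i in np.argsort(ang)]

# The bag of oriented neighbourhoods (BON) is then {oriented_neighbourhood(u, ...)}_{u in V}.
```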

3 Kernel between T-Prototypes


Our kernel between t-prototypes (eq. 5) is based on a previous contribution [6]
within the image indexation framework. This kernel is based on the description
Tracking System with Re-identification Using a Graph Kernels Approach 403

of each graph by a finite bag of patterns. Such an approach consists in: i) defining the bag of patterns from each graph, ii) defining a minor kernel between patterns, iii) convolving minor kernels into a major one in order to encode the similarity between bags. SIFT points being local detectors, we consider that the most relevant information of a t-prototype corresponds to the local oriented neighborhoods of its vertices. We thus define the bag of patterns of a t-prototype as its BON (Section 2). The minor kernel between oriented neighborhoods is defined as follows:
\[
K_{seq}(u, v) =
\begin{cases}
0 & \text{if } |N(u)| \neq |N(v)| \\
\prod_{i=1}^{|N(u)|} K_g(u_i, v_i) & \text{otherwise,}
\end{cases}
\tag{1}
\]
where $K_g(u, v)$ is an RBF kernel between the features of the input vertices, defined by a tuning parameter $\sigma$ and the Euclidean distance $d(\cdot,\cdot)$ between feature values: $K_g(x, y) = e^{-\frac{d(\mu(x), \mu(y))}{\sigma}}$.
Eq. (1) corresponds to the same basic idea as the heuristic used to compute the graph edit distance between two nodes [7], where the similarity between two nodes is enforced by a comparison of their neighborhoods.
Note that $K_{seq}(\cdot,\cdot)$ corresponds to a tensor product kernel and is hence positive definite. However, due to acquisition noise or small changes between two images, some SIFT points may be added or removed within the neighborhood of some vertices. Such an alteration of the neighborhood's cardinality may drastically change the similarity between key points. Indeed, according to equation (1), two points with different neighborhood cardinalities have a similarity equal to 0. Equation (1) thus induces an important sensitivity to noise. In order to overcome this drawback, we introduce a rewriting rule on oriented neighborhoods. Given a vertex v, the rewriting of its oriented neighborhood, denoted $\kappa(v)$, is defined as $\kappa(v) = (v_1, \ldots, \hat{v}_i, \ldots, v_{l_v})$, where $\hat{v}_i = \arg\min_{j \in \{1, \ldots, l_v\}} w(v_j)$ is the neighbor of v with lowest weight.
This rewriting is iterated, leading to a sequence of oriented neighborhoods $(\kappa^i(v))_{i \in \{0, \ldots, D_v\}}$, where $D_v$ denotes the maximal number of rewritings. The cost of each rewriting is measured by the cumulative weight function CW defined by:
\[
CW(v) = 0, \qquad CW(\kappa^i(v)) = w(v_i) + CW(\kappa^{i-1}(v)), \tag{2}
\]
where $v_i$ is the vertex removed between $\kappa^{i-1}(v)$ and $\kappa^i(v)$.
Kernel between Oriented Neighborhoods: Our kernel between two oriented neighborhoods is defined as a convolution kernel between the sequences of rewritings of each neighborhood, each rewriting being weighted by its cumulative cost:
\[
K_{rewriting}(u, v) = \sum_{i=1}^{D_u} \sum_{j=1}^{D_v} K_W(\kappa^i(u), \kappa^j(v)) \cdot K_{seq}(\kappa^i(u), \kappa^j(v)), \tag{3}
\]
where the kernel $K_W$ penalizes costly rewritings corresponding to the removal of important key points. Such a kernel is defined as follows:
\[
K_W(\kappa^i(u), \kappa^j(v)) = e^{-\frac{CW(\kappa^i(u)) + CW(\kappa^j(v))}{\sigma'}}, \tag{4}
\]
where $\sigma'$ is a tuning variable.
The number of rewritings $D_v$ for each vertex v corresponds to a compromise between an over-simplification of its oriented neighborhood (large $D_v$) and the corruption of equation (3) by irrelevant vertices which may appear in only one of two similar oriented neighborhoods. This number has been empirically set to half the cardinality of v's neighborhood [6].
Graph Kernel: Taking into account central nodes, our final kernel between two vertices u and v is defined as follows: $K(u, v) = K_g(u, v)\, K_{rewriting}(u, v)$.
Our final kernel between two graphs is defined as a convolution kernel between both BONs:
\[
K_{graph}(G_1, G_2) = \sum_{u \in V_1} \sum_{v \in V_2} \varphi(u)\varphi(v) K(u, v). \tag{5}
\]
The weighting function $\varphi$ encodes the relevance of each vertex and is defined as an increasing function of the weight: $\varphi(u) = e^{-\frac{1}{\sigma(1 + w(u))}}$.
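The following sketch makes the chain Kg → Kseq → Krewriting → Kgraph concrete. It is only an illustrative reading of Eqs. (1)-(5), with assumed data structures (each t-prototype is a triple of per-vertex oriented neighbourhoods, feature vectors and weights); for simplicity it also includes the un-rewritten neighbourhoods (index 0) in the double sum of Eq. (3).

```python
import numpy as np

def k_g(fu, fv, sigma=1.0):
    """Minor RBF kernel between two vertex feature vectors."""
    return float(np.exp(-np.linalg.norm(fu - fv) / sigma))

def k_seq(su, sv, f1, f2, sigma=1.0):
    """Eq. (1): product of minor kernels over two equal-length oriented neighbourhoods."""
    if len(su) != len(sv):
        return 0.0
    return float(np.prod([k_g(f1[a], f2[b], sigma) for a, b in zip(su, sv)]))

def rewritings(seq, w):
    """Eq. (2): iteratively drop the lowest-weight neighbour, recording cumulative costs.
    The number of rewritings is half the neighbourhood size, as stated in the text."""
    out, cur, cost = [(list(seq), 0.0)], list(seq), 0.0
    for _ in range(len(seq) // 2):
        i = int(np.argmin([w[v] for v in cur]))
        cost += w[cur[i]]
        cur = cur[:i] + cur[i + 1:]
        out.append((list(cur), cost))
    return out

def k_vertex(u, v, g1, g2, sigma=1.0, sigma_p=1.0):
    """K(u, v) = Kg(u, v) * Krewriting(u, v), combining Eqs. (3) and (4)."""
    bon1, f1, w1 = g1
    bon2, f2, w2 = g2
    acc = 0.0
    for su, cu in rewritings(bon1[u], w1):
        for sv, cv in rewritings(bon2[v], w2):
            acc += np.exp(-(cu + cv) / sigma_p) * k_seq(su, sv, f1, f2, sigma)
    return k_g(f1[u], f2[v], sigma) * acc

def k_graph(g1, g2, sigma=1.0, sigma_p=1.0):
    """Eq. (5): weighted convolution of the vertex kernel over both BONs.
    g = (bon, feats, weights), each a dict indexed by vertex id."""
    phi = lambda wt: np.exp(-1.0 / (sigma * (1.0 + wt)))
    return sum(phi(g1[2][u]) * phi(g2[2][v]) * k_vertex(u, v, g1, g2, sigma, sigma_p)
               for u in g1[0] for v in g2[0])
```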

4 People Description
The identification of a person by a single t-prototype is subject to errors due to
slight changes of the pose or some errors on the location of SIFT points. Assum-
ing that the appearance of a person remains stable on a set of successive frames,
we describe a person at instant t by the set of its t-prototypes computed on its
HTW window. The description of a person, by a set of t-prototypes provides
an implicit definition of the mean appearance of this person over HTW. Let H
denotes the Hilbert space defined by Kgraph (equation 5). In order to get an
explicit representation of this mean appearance, we first use Kgraph to project
the mapping of all t-prototypes onto the unit-sphere of H. This operation is
performed by normalizing our kernel [8]. Following [8], we then apply a one class
ν-SVM on each set of t-prototypes describing a person. From a geometrical point of view, this operation is equivalent to modeling the set of projected t-prototypes by a spherical cap defined by a weight vector w and an offset ρ, both provided by the ν-SVM algorithm. These two parameters define the hyperplane whose intersection with the unit sphere defines the spherical cap. T-prototypes whose projections on the unit sphere lie outside the spherical cap are considered as outliers. Each person is thus encoded by a triplet (w, ρ, S), where S corresponds to the set of t-prototypes and (w, ρ) are defined from a one-class ν-SVM. The parameter w indicates the center of the spherical cap and may be intuitively understood as the vector encoding the mean appearance of a person over its HTW window. The parameter ρ influences the radius of the spherical cap and may be understood as the extent of the set of representative t-prototypes in S.
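A possible realisation of this step with scikit-learn's one-class ν-SVM on a precomputed kernel is sketched below; the returned dual coefficients define the expansion of w over the t-prototypes and the offset plays the role of ρ. This is our assumption about tooling, not the authors' code.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def person_model(K, nu=0.1):
    """Model one person from the kernel matrix K between the t-prototypes of its HTW
    window: normalise K (projection onto the unit sphere of H), fit a one-class nu-SVM,
    and return the dual expansion of w, the offset rho and the support set S."""
    d = np.sqrt(np.diag(K))
    Kn = K / np.outer(d, d)                       # normalised kernel: unit-norm mappings
    svm = OneClassSVM(kernel="precomputed", nu=nu).fit(Kn)
    alpha = np.zeros(K.shape[0])
    alpha[svm.support_] = svm.dual_coef_.ravel()  # w = sum_i alpha_i phi(t_i)
    rho = float(svm.offset_[0])
    return alpha, rho, svm.support_
```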

4.1 People’s Kernel


Let PA = (wA , ρA , SA ) and PB = (wB , ρB , SB ) denote two triplets encoding
two persons A and B. The distance between A and B is defined from the angle
between vectors wA and wB defined by [8] as follows:

Fig. 1. Algorithm steps

\[
d_{sphere}(w_A, w_B) = \arccos\left(\frac{w_A^{T} K_{A,B}\, w_B}{\|w_A\|\,\|w_B\|}\right),
\]
where $\|w_A\|$ and $\|w_B\|$ denote the norms of $w_A$ and $w_B$ in H, and $K_{A,B}$ is a $|S_A| \times |S_B|$ matrix defined by $K_{A,B} = (K_{norm}(t, t'))_{(t, t') \in S_A \times S_B}$, where $K_{norm}$ denotes our normalized kernel. Based on $d_{sphere}$, the kernel between A and B is defined as the following product of RBF kernels:
\[
K_{change}(P_A, P_B) = e^{-\frac{d_{sphere}^2(w_A, w_B)}{2\sigma_{moy}^2}}\; e^{-\frac{(\rho_A - \rho_B)^2}{2\sigma_{origin}^2}}, \tag{6}
\]
where $\sigma_{moy}$ and $\sigma_{origin}$ are tuning variables.
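Continuing the sketch above, d_sphere and Kchange can be evaluated directly from the dual coefficients and the kernel blocks between the two t-prototype sets; this uses the standard identity ‖w‖² = αᵀKα and is again only an illustrative reading of the formulas, with parameter names of our own.

```python
import numpy as np

def d_sphere(alpha_a, alpha_b, K_aa, K_bb, K_ab):
    """Angle between the cap centres w_A and w_B, expressed through dual coefficients."""
    num = alpha_a @ K_ab @ alpha_b
    na = np.sqrt(alpha_a @ K_aa @ alpha_a)        # ||w_A|| in H
    nb = np.sqrt(alpha_b @ K_bb @ alpha_b)        # ||w_B|| in H
    return float(np.arccos(np.clip(num / (na * nb), -1.0, 1.0)))

def k_change(person_a, person_b, K_aa, K_bb, K_ab, sigma_moy=1.0, sigma_origin=1.0):
    """Eq. (6): product of RBF kernels on the cap-centre angle and on the offsets."""
    alpha_a, rho_a = person_a
    alpha_b, rho_b = person_b
    d = d_sphere(alpha_a, alpha_b, K_aa, K_bb, K_ab)
    return (np.exp(-d ** 2 / (2.0 * sigma_moy ** 2))
            * np.exp(-(rho_a - rho_b) ** 2 / (2.0 * sigma_origin ** 2)))
```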

5 Tracking System

Our tracking algorithm uses four labels ‘new’, ‘get out’, ‘unknown’ and ‘get back’
with the following meaning: new refers to an object classified as new, get-out
represents an object leaving the scene, unknown describes a query object (an
object recently appeared, not yet classified) and get-back refers to an object
classified as an old one.
Unlike our previous work [5], where we used a training data set to model each object and the re-identification was triggered by a graph edit distance, in this paper we use online learning and the re-identification is performed using the similarity (eq. (6)) between each unknown person and all the get out
persons. The general architecture of our system is shown in Figure 1. All masks
detected in the first frame of a video are considered as new persons. Then a
mask detected in frame t + 1 is considered as matched if there is a sufficient
overlap between its bounding box and a single mask’s bounding box defined in
frame t. In this case, the mask is assigned to the same person as in frame t
and its graph of SIFT points is added to the sliding HTW window containing
the last graphs of this person. If one mask defined at frame t does not have
any successor in frame t + 1, the associated person is marked as get out and
its triplet P = (w, ρ, S) (Section 4) computed over the last |HT W | frames is
stored in an output object data base model noted DBS . In the case of a person
corresponding to an unmatched mask in frame t + 1, the unmatched person

is initially labeled as ‘get in’. When a ‘get in’ person is detected, if there are no ‘get out’ persons we classify this ‘get in’ person immediately as new. This
‘get in’ person is then tracked along the video using the previously described
protocol. On the other hand, if there is at least one ‘get out’ person we should
delay the identification of this ‘get in’ person which is thus labeled as ‘unknown’.
This ‘unknown’ person is then tracked on |HT W | frames in order to obtain its
description by a triplet (w, ρ, S). Using this description we compute the value of
kernel Kchange (equation 6) between this unknown person and all get out persons
contained in our database. Similarities between the unknown person and get out
ones are sorted in decreasing order so that the first get out person of this list
corresponds to the best candidate for a re-identification. Our criterion to map an unknown person to a get out one, and thus to classify it as get back, is based both on a threshold on the maximum similarity value maxker and a threshold on the standard deviation σker of the list of similarities. This criterion, called SC, is defined as maxker > th1 and σker > th2, where th1 and th2 are experimentally fixed thresholds. Note that SC is reduced to a fixed threshold on maxker when
the set of get out persons is reduced to two elements. An unknown person whose
SC criterion is false is labeled as a new person. Both new and get back persons
are tracked between frames until they get out from the video and reach the
get out state.
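The decision logic of this paragraph can be summarised as the small routine below; the names and the handling of the small-set case are our reading of the text, not a verbatim transcription of the authors' system.

```python
import numpy as np

def classify_get_in(similarities, th1, th2):
    """Apply the SC criterion to the Kchange values between an 'unknown' person and the
    stored 'get out' persons. Returns the index of the best 'get out' candidate for a
    'get back' decision, or None when the person should be labelled 'new'."""
    sims = np.asarray(similarities, dtype=float)
    if sims.size == 0:
        return None                               # no get-out person: immediately 'new'
    best = int(np.argmax(sims))                   # head of the list sorted in decreasing order
    max_ker = sims[best]
    sigma_ker = float(sims.std())
    small_set = sims.size <= 2                    # SC reduces to the max_ker threshold here
    if max_ker > th1 and (small_set or sigma_ker > th2):
        return best                               # 'get back' to the best candidate
    return None                                   # otherwise the person is 'new'
```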
Classically, any tracking algorithm has to deal with many difficulties such as
occlusions. The type of occlusions examined in this paper is limited to the case
where bounding boxes overlap. An occlusion is detected when the spatial overlap
between two bounding boxes is greater than an experimentally fixed threshold
while each individual box remains detected. If for a given object an occlusion
is detected, the description of this object is compromised. Thus a compromised
object is only tracked and its triplet (w, ρ, S) is neither updated nor stored
in DBS . At identification time, the model of the unknown person is matched
against each get-out person from DBS .

6 Experiments
The proposed algorithm has been tested on v01, v05, v04 and v06 video sequences
of the PETS’09 S2L1 [9] dataset. Each sequence contains multiple persons. To
compare our framework with previous work, we use the well-known metrics Sequence Frame Detection Accuracy (SFDA), Multiple Object Detection Accuracy (MODA) and Multiple Object Tracking Accuracy (MOTA) described in [11]. Note that such measures do not allow taking into account the fact that
the identification of a person may be delayed. Since our method identifies a person only after HTW frames, we decided not to take into account persons with an unknown status in the MODA and MOTA measures until these persons are identified as get back or new (Section 5).
In our first experiment we evaluated how different values of the length of HTW may affect the re-identification accuracy. The obtained results show that v01 and v05 perform at peak efficiency for HTW = 35, while v04 and v06 attain their optimum at HTW = 20.

Table 1. Evaluation results

View   MODA of [10]   MODA     MOTA     SFDA
v01    0.67           0.91     0.91     0.90
v05    0.72           0.75     0.75     0.80
v04    0.61           0.2799   0.2790   0.47
v06    0.75           0.506    0.505    0.64

Fig. 2. CMC curves (identification rate versus rank for views v01, v05, v04 and v06)

To validate our method of re-identification we used the Cumulative Matching


Characteristic (CMC) curves. The CMC curve represents the percentage of times
the correct identity match is found in the first n matches. Figure 2 shows the
CMC curves for the four views. We can see that the performance of v01 is much
better than that of v05, v06 and v04. We attribute this to the high detection
accuracy in v01. Figure 2 shows that if we focus on the first 5 matches, we find that for v04 and v06 scores of 54% and 65%, respectively, are obtained, while for v01 and v05 it attains 100%.
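For reference, a CMC curve can be computed from a query-gallery similarity matrix as sketched below; this is a generic implementation under our own conventions, not the evaluation code used here.

```python
import numpy as np

def cmc_curve(similarity, query_ids, gallery_ids, max_rank=20):
    """Cumulative Matching Characteristic: fraction of queries whose correct identity
    appears among the n most similar gallery entries, for n = 1..max_rank."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    cmc = np.zeros(max_rank)
    for q in range(similarity.shape[0]):
        order = np.argsort(-similarity[q])                    # most similar first
        hit = np.where(gallery_ids[order] == query_ids[q])[0][0]
        if hit < max_rank:
            cmc[hit:] += 1.0                                  # counted at its rank and above
    return cmc / similarity.shape[0]
```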
In order to compare our results to state-of-the-art methods we used the exhaustive comparison of 13 methods defined in [9], in which a quantitative performance evaluation of the results submitted by contributing authors of the two PETS workshops in 2009 on the PETS'09 S2.L1 dataset was performed. Using the metrics MODA, MOTA, MODP, MOTP, SODA and SFDA described in
[11], the submitted results of [10] outperform all other methods. We hence only
compare our results to this best method. The left column of Table 1 shows the
best results [10] obtained by methods described in [9] on each video. As shown by
the two left-most data columns of Table 1, our method obtains lower results than that of [10] for v04 and v06. This may be explained by the fact that v04 and v06 have persistent group cases. Indeed, the case where two or more existing objects at time t become too spatially close at time t + 1 and then merge together to become one detected object at time t + 1 is not considered here as an occlusion, but rather as a group. Since such cases are not addressed by this paper, the v04 and v06 results need to be interpreted with caution. Due to the frequent group cases
in v04 and v06 we missed a lot of persons in the scene. However, our method
obtains better result than that of [10] for v01 and v05. These results set forth the
relevance of the proposed re-identification algorithm since we have only occlusion
cases.

7 Conclusion
In this paper, we presented a new people re-identification approach based on
graph kernels. Our graph kernel between SIFT points includes rewriting rules
on oriented neighborhood in order to reduce the lack of stability of the key point
detection methods. Furthermore, each person in the video is defined by a set

of graphs with a similarity measure between sets which removes outliers. Our
tracking system is based on a simple matching criterion to follow one person
along a video. The person description and the kernel between these descriptions are used to remove ambiguities when one person reappears in the video. Such a system may be easily extended to follow one person over a network of cameras. People are prone to occlusions by others nearby. However, a re-identification algorithm for an individual person is not suitable for solving the group cases. A further
study with more focus on groups is therefore suggested.

References
1. Hamdoun, O., Moutarde, F., Stanciulescu, B., Steux, B.: Person re-identification
in multi-camera system by signature based on interest point descriptors collected
on short video sequences. In: ICDSC 2008, pp. 1–6 (2008)
2. Ijiri, Y., Lao, S., Han, T.X., Murase, H.: Human Re-identification through Distance
Metric Learning based on Jensen-Shannon Kernel. In: VISAPP 2012, pp. 603–612
(2012)
3. Truong Cong, D.-N., Khoudour, L., Achard, C., Meurie, C., Lezoray, O.: People
re-identification by spectral classification of silhouettes. International Journal of
Signal Processing 90, 2362–2374 (2010)
4. Zhao, S., Precioso, F., Cord, M.: Spatio-Temporal Tube data representation and
Kernel design for SVM-based video object retrieval system. Multimedia Tools
Appl. (55), 105–125 (2011)
5. Brun, L., Conte, D., Foggia, P., Vento, M.: People Re-identification by Graph
Kernels Methods. In: Jiang, X., Ferrer, M., Torsello, A. (eds.) GbRPR 2011. LNCS,
vol. 6658, pp. 285–294. Springer, Heidelberg (2011)
6. Mahboubi, A., Brun, L., Dupé, F.-X.: Object Classification Based on Graph Ker-
nels. In: HPCS-PAR, pp. 385–389 (2010)
7. Fankhauser, S., Riesen, K., Bunke, H.: Speeding up Graph Edit Distance Com-
putation through Fast Bipartite Matching. In: Jiang, X., Ferrer, M., Torsello, A.
(eds.) GbRPR 2011. LNCS, vol. 6658, pp. 102–111. Springer, Heidelberg (2011)
8. Desobry, F., Davy, M., Doncarli, C.: An Online Kernel Change Detection Algo-
rithm. IEEE Transaction on Signal Processing 53, 2961–2974 (2005)
9. Ellis, A., Shahrokni, A., Ferryman, J.: PETS 2009 and Winter PETS 2009 Results,
a Combined Evaluation. In: 12th IEEE Int. Work. on Performance Evaluation of
Tracking and Surveillance, pp. 1–8 (2009)
10. Berclaz, J., Shahrokni, A., Fleuret, F., Freyman, J.M., Fua, P.: Evaluation of prob-
abilistic occupancy map people detection for surveillance systems. In: 11th IEEE
Int. Work. on Performance Evaluation of Tracking and Surveillance, pp. 55–62
(2009)
11. Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers,
R., Boonstra, M., Korzhova, V., Zhang, J.: Framework for performance evaluation
of face, text, and vehicle detection and tracking in video: Data, metrics, and
protocol. IEEE Transaction on Pattern Analysis and Machine Intelligence 31(2),
319–336 (2009)
Recognizing Human-Object Interactions
Using Sparse Subspace Clustering

Ivan Bogun and Eraldo Ribeiro

Computer Vision Laboratory,


Florida Institute of Technology
Melbourne, Florida, U.S.A.
[email protected],
[email protected]

Abstract. In this paper, we approach the problem of recognizing human-


object interactions from video data. Using only motion trajectories as in-
put, we propose an unsupervised framework for clustering and classifying
videos of people interacting with objects. Our method is based on the
concept of sparse subspace clustering, which has been recently applied to
motion segmentation. Here, we show that human-object interactions can
be seen as trajectories lying on a low-dimensional subspace, and which
can in turn be recovered by subspace clustering. Experimental results,
performed on a publicly available dataset, show that our approach is
comparable to the state-of-the-art.

Keywords: human-object interaction, action classification, human mo-


tion, sparse subspace clustering, subspace decomposition.

1 Introduction
Recognizing human-object interactions from videos is a hard problem that has
been receiving renewed attention by the computer-vision community. The prob-
lem's complexity comes from the large degree of variation present in both the appearance of objects and the many ways people interact with them. Current solu-
tions differ mostly in terms of the type of input data used by algorithms, which
ranges from low-level features such as optical flow to human-centered features
such as spatio-temporal volumes.
The state-of-the-art is represented by the weakly supervised method described
by Prest et al. [1], which combines a part-based human detector, tracking by detection, and classification into a single framework. This method reports the best
results on most datasets. Other representative solutions include the work of
Gupta et al. [2], which uses a histogram-based model to account for appearance
information, and trajectories for representing motion. Their classification ap-
proach is based on a Bayesian network. They introduce interaction features such
as time of the object grasp, interaction start, and interaction stop, which are
learned from velocity profiles. Motion trajectories were also used by Filipovych
and Ribeiro [3] for recognizing interactions by matching trajectories of hand
motion using a robust sequence-alignment method.


Fig. 1. Examples of interactions and annotated trajectories in dataset from [2]

In this paper, we look at the problem of human-object interaction recognition


in light of recent developments in sparse subspace clustering [4]. Here, we show
that such interactions can be seen as trajectories lying on a low-dimensional
subspace. We propose an unsupervised framework for interaction recognition
that uses sparse subspace clustering (Section 2). We compare our method to the state-of-the-art (Section 3).

2 Our Method

2.1 Trajectory Extraction and Pre-processing

We commenced by annotating the videos from Gupta et al. [2]. Currently, video
datasets of human-object interactions are either not publicly available, such as
the Coffee and Cigarettes used by Laptev and Pérez [5] and by Prest et al.
[1], or are unannotated, as in the case of dataset provided by Gupta et al. [2].
Videos in Gupta et al. are short sequences (i.e., 3–10 seconds long) of a single
person performing interactions such as drinking from a cup, answering the phone,
making a phone call, spraying, pouring from a cup, and lighting a torch. Our
annotation is as follows: on every 3rd frame of the videos, we extract the position
(i.e., the centroid of the bounding box) of the left hand, torso, head, right hand,
and the object associated with the interaction. Figure 1 shows samples of these
trajectories superimposed on frames from the input videos.
In our interaction-classification method, each video is represented by five tra-
jectories that we extracted manually by linearly interpolating between the pre-
viously annotated keyframes. These trajectories are termed T h , T r , T l , T t ,
and T o , for head, left hand, right hand, torso, and object, respectively. While
automatic trajectory extraction is indeed desirable, it is not the focus of this
paper, and we decided that assuming trajectory availability would suffice. The

extracted trajectories were resampled to become both contiguous and of equal


length (i.e., from tmin = 3 to tmax = 100).
We further pre-processed each video to normalize trajectories with respect to
the head location, and to remove potential bias that may exist towards right-
handed subjects. Because the head motion is not relevant for the interactions in
the dataset, the mean location of the head trajectory was subtracted from the
other four trajectories. For example, let T h = {xh1 , . . . , xhN } be the trajectory of
length N frames that describes the head motion. Its mean is given by:

1  h
N
x̄h = x . (1)
N i=1 i

Here, x̄h = ( x̄h , ȳ h )T is a centroid location that we can use to flip trajectories
horizontally and thus account for left-right hand symmetry. For each trajectory
j in the video, the registered trajectory points are given by:
 T
x̂ji = |xji − x̄h |, yij − ȳ h , ∀i. (2)

This normalization reduces bias towards right-handed people as well as noise


associated with the object being placed at random locations on the table. Here-
after, we assume that all trajectories have been normalized using this procedure.

2.2 Video Representation

As we clarify in the next section, we want each video to be represented by a


single feature vector. However, simply stacking all trajectories to form a raw
feature vector would lead to incorrect results as trajectories associated with
the non-interacting hand do not provide valuable information, and we do not
want to use these trajectories for classification. To detect the hand that is more
likely to be interacting with the object, we calculate the correlation between
the object trajectory and the trajectory of each hand using p-values. We use the $\chi^2$ (chi-square) statistical test as a correlation measure. The trajectory having the smallest p-value is selected as the one performing the interaction, i.e.:
\[
T = \arg\min_{T' \in \{T^l, T^r\}} p\left(\chi^2(T^o, T')\right), \tag{3}
\]

where $p(\cdot)$ returns the p-value. Now that we have determined the trajectory corresponding to the interacting hand, we represent a video by its feature vector:
\[
f = \begin{bmatrix} T \\ T^o \end{bmatrix}. \tag{4}
\]

Finally, given a set $V = \{v_1, \ldots, v_M\}$ of M videos containing human-object interactions, we can stack their corresponding feature vectors to form a large matrix $Y = (f_1, \ldots, f_M) \in \mathbb{R}^{2N \times M}$, where N is the size of the normalized trajectories.
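How the χ² statistic is formed from two trajectories is not spelled out above; the sketch below assumes a chi-square independence test on jointly binned x coordinates (with simple Laplace smoothing) purely for illustration, and then assembles the matrix Y column by column.

```python
import numpy as np
from scipy.stats import chi2_contingency

def hand_p_value(obj, hand, bins=5):
    """p-value of a chi-square test between binned object and hand positions
    (an assumed way of applying the test to trajectories, not the paper's exact recipe)."""
    table, _, _ = np.histogram2d(obj[:, 0], hand[:, 0], bins=bins)
    table = table + 1.0               # Laplace smoothing keeps expected counts positive
    return chi2_contingency(table)[1]

def video_feature(trajs):
    """Eqs. (3)-(4): keep the hand most correlated with the object (smallest p-value)
    and stack it with the object trajectory into a single feature vector."""
    p_left = hand_p_value(trajs["object"], trajs["left"])
    p_right = hand_p_value(trajs["object"], trajs["right"])
    hand = trajs["left"] if p_left < p_right else trajs["right"]
    return np.concatenate([hand.ravel(), trajs["object"].ravel()])

# Stacking the per-video vectors column-wise gives the data matrix Y:
# Y = np.stack([video_feature(t) for t in videos], axis=1)
```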

2.3 Interaction Motion as Sparse Subspace Separation

Our method is based on the Sparse Subspace Clustering (SSC) algorithm proposed by Elhamifar and Vidal [4]. An earlier version of the SSC algorithm was applied to motion segmentation by casting the segmentation as a problem of subspace separation [6]. The motion-segmentation approach in Rao et al. [6] is based on two observations: the motion data lies in a union of low-dimensional subspaces, and the dataset satisfies the self-expressiveness property. According to
Elhamifar and Vidal [4], a dataset is self-expressive if each data point in a union
of subspaces can be reconstructed by a combination of other points in the dataset.
Let $c_i$ be such a representation for video i; then the sparsest representation of $y_i$ via the set $\{y_j \mid j \neq i\}$ is given as follows:
\begin{align}
&\min \ \|c_i\|_0 \tag{5}\\
&\text{s.t.}\ \ y_i = Y c_i, \tag{6}\\
&\qquad c_{ii} = 0, \tag{7}
\end{align}
where $\|x\|_0 = \#\{i \mid x_i \neq 0\}$. Equation (5) enforces sparsity in the coefficients while


Equation 6 takes care of self-expressiveness. Equation 7 restricts the represen-
tation such that it does not contain any part of the original vector itself. This
problem turns out to be non-convex and NP-hard [7]. Nevertheless, recent de-
velopments in the optimization field have provided heuristics that are able to
find relaxed solutions that are sparse [8]. More specifically, the relaxed version
of Equation 5 replaces the l0 -norm with the l1 -norm, which is known to prefer
sparse solutions [9]. With the norm replacement, the relaxation can be written
in matrix notation as follows:

\begin{align}
&\min \ \|C\|_1 \tag{8}\\
&\text{s.t.}\ \ Y = YC, \tag{9}\\
&\qquad \operatorname{diag}(C) = 0, \tag{10}
\end{align}

where diag(C) denotes diagonal entries of C. Relaxed versions of Equations 8 to


10 are able to find solutions only when the data lies perfectly on the union of
subspaces. We follow Rao et al. [6] and assume that the data can be decomposed
into a noise-free component and a noise component, i.e.:

Y = Yperfect + Z. (11)

Here, Z is the noise and Yperfect is the clean data, which lies in the union of the
subspaces. Thus, the problem becomes:

\begin{align}
&\min \ \|C\|_1 + \frac{\lambda_z}{2}\|Z\|_F^2 \tag{12}\\
&\text{s.t.}\ \ Y = YC + Z, \tag{13}\\
&\qquad \operatorname{diag}(C) = 0, \tag{14}
\end{align}
where $\|Z\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |z_{ij}|^2}$ and $\lambda_z$ is a regularization parameter. For the parameter settings, Elhamifar and Vidal [4] suggest setting $\lambda_z = \alpha_z/\mu_z$, where $\alpha_z > 1$ and $\mu_z$ is defined by $\mu_z = \min_i \max_{j \neq i} |y_i^T y_j|$.
After the coefficient matrix C is found, SSC clusters trajectories using spectral clustering. SSC gave state-of-the-art results on the Hopkins-155 dataset [10].
Combined with nice theoretical properties [11] and the momentum gained by
the success of sparse optimization problems [12], we believe that SSC can be
useful in classifying human actions and interactions.
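The experiments below use the CVX-based reference code of Elhamifar and Vidal (see the footnote in the experimental section); purely for illustration, the noisy programme of Eqs. (12)-(14) followed by spectral clustering of the affinity |C| + |C|ᵀ could be sketched with cvxpy and scikit-learn as follows (parameter choices are ours).

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(Y, alpha_z=20.0, n_clusters=6):
    """Sketch of the noisy SSC programme of Eqs. (12)-(14) and the spectral-clustering
    step. Y has one video feature vector per column. Illustrative only."""
    n = Y.shape[1]
    # mu_z = min_i max_{j != i} |y_i^T y_j|, as suggested for setting lambda_z
    mu_z = min(max(abs(float(Y[:, i] @ Y[:, j])) for j in range(n) if j != i)
               for i in range(n))
    lam = alpha_z / mu_z
    C = cp.Variable((n, n))
    objective = cp.Minimize(cp.sum(cp.abs(C)) + (lam / 2) * cp.sum_squares(Y - Y @ C))
    problem = cp.Problem(objective, [cp.diag(C) == 0])
    problem.solve()
    W = np.abs(C.value) + np.abs(C.value).T            # symmetric affinity matrix
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels, C.value
```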

2.4 Why Should We Care about Sparsity?

Let $Y \in \mathbb{R}^{m \times n}$ be the data set. The Sparse Subspace Clustering algorithm seeks the sparsest representation for every $y_i \in Y$, $i = 1, \ldots, n$:
\begin{align}
&\min \ \|c_i\|_1 \tag{15}\\
&\text{s.t.}\ \ y_i = Y c_i, \tag{16}\\
&\qquad c_{ii} = 0, \tag{17}
\end{align}

which can be seen as a limit λ → ∞ of the following:

min. ||ci ||1 + λ||yi − Y ci ||22

s.t. cii = 0.
Let c−i be ci without cii . Similarly, define Y−i as Y without its i-th row. Then,
the problem can be cast as:

min. ||c−i ||1 + λ||yi − Y−i c−i ||22 .

The latter is a problem in which we are trying to find the sparsest approximation of $y_i$ using $Y_{-i}$, in the form $y_{-i} = \sum_{j=1}^{m} Y_{-i,j}\, c_{-i_j}$. It can be shown that SVM can be
reformulated to solve it [13]. On the other hand, Girosi [14] has shown that, for
noiseless data, the solution given by sparse approximation corresponds exactly
to the solution given by SVM. Moreover, non-zero coefficients of c−i correspond
to support vectors.
A limited number of support vectors corresponds to a bounded VC dimension of the dataset, which is shown to define generalization ability; the result, due to Vapnik [13], shows the connection between the generalization error and the number of support vectors, i.e.:
\[
E[P(\mathrm{error})] \leq \frac{E[\#\ \text{of support vectors}]}{\#\ \text{of training examples}}. \tag{18}
\]

This leads to the connection between sparsity and the ability to extract the
most important samples from the dataset, which in turn leads to the proper
partitioning of the samples via SSC.

3 Experimental Evaluation
In this section, we experiment with the SSC algorithm in an unsupervised ap-
proach for actor-object interaction recognition based on the trajectory data. Our
experiments were designed with an emphasis on trying to answer the following
two questions: (i) Can human-object interactions be seen as body and object tra-
jectories that lie in a low-dimensional space? (ii) What is the role of interaction
localization (i.e., segmentation) in recognition?
Here, we run the SSC algorithm1 in two settings: (a) With complete trajec-
tories (i.e., from the first to the last frame of the input videos), and (b) With
trajectories corresponding only to interaction frames (i.e., frames where the in-
teraction starts and ends)2 . Confusion matrices resulting from these experiments
are given in Figures 2(a) and 2(b), and present classification rates of 74.1% and
81.48%, respectively. As a point of comparison, we note that Gupta et al. [2]
report accuracy of 93.34% while Prest et al. [1] report an average classification
of 93%. However, in addition to being completely supervised, these approaches
use additional data to train a HOG-based object detector, while our method
is unsupervised given trajectories and their parameters. As reported in Gupta
et al. [2], interactions such as lighting and pouring, or dialing and lighting, have similar trajectories and thus can hardly be distinguished by motion cues alone. This
behavior is also observed in our results. The test videos and demo code for our
method are available online.3
Our results suggest that segmentation of interaction trajectories can be seen
as a special case of motion segmentation, and consequently, the space of such
trajectories consists of a union of low-dimensional subspaces. Our results imply
that, if two interactions lie on different subspaces, motion information alone is
able to distinguish between them. However, if the interactions lie on the same
subspace then appearance information should be used for classification.
In our second experiment, the higher classification rates suggest that interac-
tion localization (i.e., segmentation of start and end of actions) is a goal worth
pursuing. The results agree with the intuition that, by removing unnecessary parts of trajectories, we can improve the value of the information necessary for recognition.

4 Conclusion and Future Work


We presented an unsupervised approach for the recognition of actions involving
people interacting with objects. Our method uses motion-trajectory data as in-
put. We demonstrated that the SSC algorithm is applicable to the problem. We
showed that motion data from human-object interactions can indeed be consid-
ered to lie in a low-dimensional space. Whenever this is not the case, additional
information such as appearance is needed to improve classification.
1 Code provided by Elhamifar and Vidal [15], which uses the CVX Convex Programming package [16].
2 These parameters were also manually annotated.
3 Test videos and code are available on https://github.com/ibogun/interaction

Fig. 2. Confusion matrices (target/predicted, over the classes Drinking, Lighting, Pouring, Spraying, Talking, Dialing). (a) full trajectories, (b) localized trajectories.

A main drawback of our work is the use of manual trajectory annotation. This
issue can be addressed by implementing a method for tracking and detection [1].
Another direction is to investigate how to use the classification labels provided
by our method in a fully supervised setting.

References
[1] Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions be-
tween humans and objects. TPAMI 34(3), 601–614 (2012)
[2] Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: Using
spatial and functional compatibility for recognition. TPAMI 31(10), 1775–1789
(2009)

[3] Filipovych, R., Ribeiro, E.: Robust sequence alignment for actor-object interaction
recognition: Discovering actor-object states. CVIU 115(2), 177–193 (2011)
[4] Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and ap-
plications. arXiv preprint arXiv:1203.1005 (2012)
[5] Laptev, I., Pérez, P.: Retrieving actions in movies. In: ICCV, pp. 1–8 (2007)
[6] Rao, S., Tron, R., Vidal, R., Ma, Y.: Motion segmentation in the presence of
outlying, incomplete, or corrupted trajectories. TPAMI 32(10), 1832–1845 (2010)
[7] Meka, R., Jain, P., Caramanis, C., Dhillon, I.S.: Rank minimization via online
learning. In: ICML, pp. 656–663 (2008)
[8] Ma, S., Goldfarb, D., Chen, L.: Fixed point and bregman iterative methods for
matrix rank minimization. Mathematical Programming 128(1-2), 321–353 (2011)
[9] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 267–288 (1996)
[10] Tron, R., Vidal, R.: A benchmark for the comparison of 3-d motion segmentation
algorithms. In: CVPR, pp. 1–8. IEEE (2007)
[11] Soltanolkotabi, M., Candes, E.J.: A geometric analysis of subspace clustering with
outliers. The Annals of Statistics 40(4), 2195–2238 (2012)
[12] Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S.: Sparse and low-
rank matrix decompositions. In: IEEE CCC, pp. 962–967 (2009)
[13] Vapnik, V.: The nature of statistical learning theory. Springer (1999)
[14] Girosi, F.: An equivalence between sparse approximation and support vector ma-
chines. Neural computation 10(6), 1455–1480 (1998)
[15] Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: CVPR, pp. 2790–2797.
IEEE (2009)
[16] CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 beta (September 2012), http://cvxr.com/cvx
Scale-Space Clustering on the Sphere

Yoshihiko Mochizuki1 , Atsushi Imiya2 , Kazuhiko Kawamoto3, Tomoya Sakai4 ,


and Akihiko Torii5
1
Faculty of Science and Engineering, Waseda University
3-4-1 Ohkubo, Shinjuku-ku, Tokyo, 169-8555, Japan
2
Institute of Management and Information Technologies, Chiba University
3
Academic Link Center, Chiba University
Yayoicho 1-33, Inage-ku, Chiba, 263-8522, Japan
4
Department of Computer and Information Sciences, Nagasaki University
Bunkyo-cho 1-14, Nagasaki 852-8521, Japan
5
Department of Control Engineering, Tokyo Institute of Technology
Ookayama, 1-12-1, Meguro-ku, Tokyo, 153-8550, Japan

Abstract. We present an algorithm for scale-space clustering of point


cloud on the sphere using the methodology for the estimation of the
density distribution of the points in the linear scale space. Our algorithm
regards the union of observed point sets as an image defined by the delta
functions located at the positions of the points on the sphere. A blurred
version of this image has a deterministic structure which qualitatively
represents the density distribution of the points in a point cloud on a
manifold.

1 Introduction

The linear scale-space theory [1] provides a dimension-independent observation


theory of input data [2,3]. As an extension of scale-space-based clustering of point
cloud on a plane and a curved manifold [6], we develop a framework to extract
the clusters in a point cloud on the sphere, and to evaluate their statistical
significance or cluster validity. Regarding the density function as a greyscale
image, we can estimate the density function in the scale space and identify the
point correspondences by the scale-space analysis of image structure.
There are two types of clustering methodologies: (i) supervised clustering and (ii) unsupervised clustering. Furthermore, there are metric-based and non-metric-based clustering methods. In this paper, we focus on unsupervised metric-based clustering using scale-space analysis of a point cloud. Although in typical metric-based clustering data are assumed to be points in a flat space, data sometimes lie on a curved manifold. The graph-Laplacian-based method is a powerful method to deal with data on a manifold, expressing the data as an undirected weighted graph using a metric in the data space. On the other hand, scale-space-based clustering [2,3,4] estimates the number of clusters from a hierarchical expression of the data derived by scale-space analysis of a point cloud.


2 Mathematical Preliminaries

A vector $x \in S^2$ is expressed as $x = x(\phi, \theta) = (\cos\phi \sin\theta, \sin\phi \sin\theta, \cos\theta)$ using spherical coordinates $(\phi, \theta)$, where $\phi \in [0, 2\pi)$, $\theta \in [0, \pi]$. The scale image of an image $f(x, \tau)$ on $S^2$ is the solution of the linear spherical heat equation
\[
\frac{\partial}{\partial\tau} f(x, \tau) = \Delta_{S^2} f(x, \tau) = \left[ \frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\left( \sin\theta\, \frac{\partial}{\partial\theta} \right) + \frac{1}{\sin^2\theta}\frac{\partial^2}{\partial\phi^2} \right] f(x, \tau), \tag{1}
\]
for $f(x, 0) = f(x)$. The scale space image $f(x, \tau)$ of scale $\tau$ is expressed as
\[
f(x, \tau) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} e^{-l(l+1)\tau} c_l^m Y_l^m(x), \qquad c_l^m = \int_{S^2} f(x) Y_l^m(x)\, d\sigma, \tag{2}
\]

for $d\sigma = \sin\theta\, d\theta\, d\phi$, where $Y_l^m$ is the spherical harmonic function of degree $l$ and order $m$. Equation (2) is re-expressed as
\[
f(x, \tau) = \int_{S^2} f(y) K(x, y, \tau)\, d\sigma = K_\tau \ast_{S^2} f(x), \qquad x, y \in S^2, \tag{3}
\]

using the spherical heat kernel
\[
K(x, y, \tau) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} e^{-l(l+1)\tau} Y_l^m(x) Y_l^m(y) = \frac{1}{4\pi}\sum_{l=0}^{\infty} (2l+1)\, e^{-l(l+1)\tau} P_l^0(\cos\Theta), \tag{4}
\]
for $\cos\Theta = x(\phi, \theta)^\top y(\phi', \theta') = \cos\phi\cos\phi' + \sin\phi\sin\phi'\cos(\theta - \theta')$, where $P_l^0(t)$ is the associated Legendre function of degree $l$ and order 0 [9,10].
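A truncated numerical evaluation of Eq. (4) through its Legendre expansion could look as follows; the 1/(4π) normalisation shown here follows the addition theorem for orthonormal spherical harmonics and is our assumption about the convention used.

```python
import numpy as np
from scipy.special import eval_legendre

def spherical_heat_kernel(cos_theta, tau, l_max=100):
    """Truncated Legendre-series evaluation of the spherical heat kernel of Eq. (4).
    cos_theta : cosine(s) of the angle between x and y;  tau : scale."""
    degrees = np.arange(l_max + 1)
    coeff = (2 * degrees + 1) * np.exp(-degrees * (degrees + 1) * tau) / (4.0 * np.pi)
    legendre = np.array([eval_legendre(l, cos_theta) for l in degrees])
    return np.tensordot(coeff, legendre, axes=(0, 0))
```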
Setting n = (0, 0, 1) to be the north pole, the impulse function on S2 can be
defined similarly to one in Euclidean space.
Definition 1. For an arbitrary test function $f$ on $S^2$, the impulse function $\delta_{S^2}(x)$ can be defined as the function which satisfies the relation
\[
\int_{S^2} f(x)\,\delta_{S^2}(x)\, d\sigma = f(x) \ast_{S^2} \delta_{S^2}(x) = \delta_{S^2}(x) \ast_{S^2} f(x) = f(n). \tag{5}
\]

For a function $f(x)$ and a matrix $R \in SO(3)$, the function $g(x) = f(R^\top x)$ represents a rotated version of $f$. The north pole $n = (0, 0, 1)$ is moved to $n' = Rn$ and the relationship $f(n) = g(n')$ is satisfied. When $f(x(\phi, \theta))$ is constant for any $\phi$, the rotated function of $f$ is identical to $f$ for any $R \in SO(3)$ which satisfies $n = Rn$. Using this property, we define the rotation of functions on the sphere.

Definition 2. For a function $f(x(\phi, \theta))$ which is constant for any $\phi$ and a rotation matrix $R \in SO(3)$ which moves the north pole $(0, 0, 1)$ to $p$, we write the function rotated by $R$ as $f(x \sim p) = f(R^\top x)$.
For a point set on the sphere, by substituting each point for an impulse function,
we can have the function associated with a point set.

Definition 3. For a set of points $P$ on the sphere, the spherical image of $P$ is
\[
f(x) = [P](x) = \sum_{p \in P} \delta_{S^2}(x \sim p). \tag{6}
\]
We call $f(x) = [P](x)$ a probability density function (PDF) on the sphere.

Definitions 1 and 2 imply $K_\tau \ast_{S^2} \delta_{S^2}(x \sim p) = K(x \sim p, n, \tau)$. Therefore, setting $G(x, \tau) = K(x, n, \tau)$, we have the relations
\[
f(x, \tau) = \sum_{p \in P} G(x \sim p, \tau), \qquad \nabla_{S^2} f(x, \tau) = \sum_{p \in P} \nabla_{S^2} G(x \sim p, \tau), \tag{7}
\]
where $\nabla_{S^2} = (\partial_\phi, \partial_\theta)^\top$, for $\partial_\phi = \frac{1}{\sin\theta}\frac{\partial}{\partial\phi}$ and $\partial_\theta = \frac{\partial}{\partial\theta}$. We call $f(x, \tau) = [P](x, \tau)$ the generalised PDF (GPDF) after the scale-space theory [8].
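Reusing the spherical_heat_kernel sketch above, the GPDF of Eq. (7) can be evaluated on a spherical grid as follows (again only an illustrative sketch under our own conventions):

```python
import numpy as np

def spherical_coords_to_xyz(phi, theta):
    """x(phi, theta) = (cos phi sin theta, sin phi sin theta, cos theta)."""
    return np.stack([np.cos(phi) * np.sin(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(theta)], axis=-1)

def gpdf(points, phi, theta, tau, l_max=100):
    """Evaluate f(x, tau) of Eq. (7): a sum of heat kernels G(x ~ p, tau), one per
    observed point.  points : (n, 3) array of unit vectors;  phi, theta : grid arrays."""
    X = spherical_coords_to_xyz(phi, theta)              # (..., 3) grid of unit directions
    f = np.zeros(X.shape[:-1])
    for p in points:
        cos_angle = np.clip(X @ p, -1.0, 1.0)             # cosine of the angle to the point
        f = f + spherical_heat_kernel(cos_angle, tau, l_max)
    return f
```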
For a PDF $f(x)$ on the sphere, the derivatives of $f(x)$ describe its differential geometric features. A primitive geometric feature of the GPDF is the extension of the stationary point $\{x \mid \nabla_{S^2} f(x) = 0\}$, where the spatial gradient vanishes. The stationary points can be classified into three types based on the combination of the signs of the eigenvalues of the Hessian matrix $H_f = \nabla_{S^2}\nabla_{S^2}^\top f$, where
\[
\nabla_{S^2}\nabla_{S^2}^\top =
\begin{pmatrix}
\partial_\phi \partial_\phi & \partial_\phi \partial_\theta \\
\partial_\theta \partial_\phi & \partial_\theta \partial_\theta
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\sin^2\theta}\frac{\partial^2}{\partial\phi^2} + \cot\theta\,\frac{\partial}{\partial\theta} &
\frac{1}{\sin\theta}\frac{\partial^2}{\partial\theta\,\partial\phi} - \frac{\cot\theta}{\sin\theta}\frac{\partial}{\partial\phi} \\
\frac{1}{\sin\theta}\frac{\partial^2}{\partial\theta\,\partial\phi} - \frac{\cot\theta}{\sin\theta}\frac{\partial}{\partial\phi} &
\frac{\partial^2}{\partial\theta^2}
\end{pmatrix}. \tag{8}
\]
We denote the signs of the eigenvalues as (±, ±). Sign (−, −) means that the
point on f is a local maximum. A local maximum of a PDF is called the mode
in probability theory and statistics.

3 Structural Simplification and Mode Tree Construction


The trajectory of the stationary point [8]
\[
S(\tau) = \left\{ x(\tau) = x(\phi(\tau), \theta(\tau)) \in S^2 \mid \nabla_{S^2} f(x, \tau) = 0 \right\} \tag{9}
\]
in the scale space is called the stationary curve in the scale-space theory. Since $\nabla_{S^2} f = 0$ and $f_\tau = \Delta_{S^2} f$, we have the relationship
\[
H_f\, \frac{d}{d\tau} x(\tau) = -\nabla_{S^2} \Delta_{S^2} f(x(\tau), \tau). \tag{10}
\]
If the scale is sufficiently small, the generalised PDF $f$ consists of $|P|$ small blobs of spherical heat kernels. As the scale increases, the blobs merge with each other into large ones, and the modes at their peaks disappear one after another [11,12]. Since $n = \arg\{\nabla_{S^2} G(x) = 0\}$, we have the relation $p = \arg\{\nabla_{S^2} G(x \sim p) = 0\}$. Moreover, since the function $f(x, \tau) = \sum_{p \in P} G(x \sim p, \tau)$ is asymptotically equivalent to $f(x, \tau) = G(x \sim q, \tau)$ for $q = \frac{\sum_{p \in P} p}{\left|\sum_{p \in P} p\right|}$, we have the relation $q = (\cos\phi^* \sin\theta^*, \sin\phi^* \sin\theta^*, \cos\theta^*)$ for $\lim_{\tau \to \infty} \theta(\tau) = \theta^*$ and $\lim_{\tau \to \infty} \phi(\tau) = \phi^*$.

Let P be a set of points, f = [P], and N = |P|. Let τi be selected scale values,
where τi < τi+1 for i = 0, 1, . . . , M with τ0 = 0. The mode tree M corresponding
to f is defined as follows.

– Each node in M has three values: the node ID i, a scale value τ and a location vector x, and is denoted by (τ, i, x).
– M has N leaf nodes. Each node has a unique ID in {0, ..., N − 1}. The scale values of all leaf nodes are 0 and each location is defined by P.
– The parent of a node whose scale is $\tau_i$ is a node whose scale is $\tau_{i+1}$, for i < M − 1.
– A node whose scale and location are τ and p, respectively, is one of the local maxima of the function f(x, τ) at x = p.

We denote the set of nodes whose scale is τ by $M_\tau$. Algorithm 1 constructs the mode tree. In this algorithm, the leaf nodes (at scale 0) correspond to the input points. All nodes at a scale are moved according to the scale image of the next scale. When some node points are concentrated at a single point, they are merged into a new node whose ID is inherited from the point which remains as an isolated point in the scale space. Figure 1 shows an example of the construction of a tree.


Fig. 1. Mode tree and node merging. (a) Trajectory of three modes in scale space.
Each plane represents the spherical space in spherical coordinates on the different scale.
(b) The mode tree. Each circle and the number in it represent a node in the tree and
its ID, respectively. (c) The linear separability for point set on the spherical space.

4 Deterministic Structure and Critical Scale


Since the random features of a point cloud are filtered out as the scale increases,
and deterministic features emerge, the deterministic structure of a point cloud is
established from coarse to fine. Therefore, this scale-space hierarchy of clusters
[1,11] provides us with the multi-resolution approach to scale selection for the
clustering. There presumably exists a critical lower bound of scale, above which
the structure is deterministic and the clusters are valid, and below which the
structure is stochastic and the clusters are invalid. Therefore, detection of the
critical scale is achieved by observing the decay of the number of clusters with
respect to the scale [5]. We define the lifetime as the scale at which a new cluster appears.
Definition 4. The lifetime of the cluster is defined as the scale at which the
cluster disappears with increasing scale.
Furthermore, we define the linear separability for point set on the spherical space
as shown in Fig. 1(c).

Algorithm 1. Mode tree construction

Input: A point set $P = \{p_i \in S^2\}_{i=0}^{N-1}$, $f = [P]$ and scales $t_0 = 0 < t_1 < \cdots < t_M$
Output: Mode tree
Let M be a graph with N nodes, $\{(t_0, i, p_i)\}_{i=0}^{N-1}$.
i ← 1
while i < M do
    // Update local maxima for the next scale
    for n = (t_i, j, p) ∈ M_{t_i} do
        Calculate the maximised point p′ of p on the function f(x, t_{i+1}).
        Add a node (t_{i+1}, j, p′) to M adjacent to node n.
    // Merge absorbed maxima
    for n = (t_{i+1}, j, p) ∈ M_{t_{i+1}} do
        N ← {(t_{i+1}, k, q) ∈ M_{t_{i+1}} | q = p}
        Find m ∈ N whose displacement is the smallest.
        All nodes N \ m are removed from M and their child nodes are connected to be adjacent to m.
    i ← i + 1
Output M as a tree whose leaf nodes are M_{t_0}.
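As a rough computational stand-in for the inner maximisation and merging steps of Algorithm 1 (the optimiser is not specified above, and the ID bookkeeping is omitted), one scale update could be sketched as follows, reusing the spherical_heat_kernel sketch of Section 2; the step size and tolerances are arbitrary choices of ours.

```python
import numpy as np

def update_modes(modes, points, tau_next, l_max=100, n_iter=50, step=0.05):
    """Move every current mode uphill on f(x, tau_next) by projected finite-difference
    gradient ascent on the sphere, then merge modes that converge to the same point."""
    def f(x):
        cos_angle = np.clip(points @ x, -1.0, 1.0)
        return float(spherical_heat_kernel(cos_angle, tau_next, l_max).sum())

    moved = []
    for x in modes:
        x = np.asarray(x, dtype=float)
        for _ in range(n_iter):
            g = np.zeros(3)
            for k in range(3):                          # numerical gradient in R^3
                e = np.zeros(3); e[k] = 1e-4
                g[k] = (f((x + e) / np.linalg.norm(x + e))
                        - f((x - e) / np.linalg.norm(x - e))) / 2e-4
            g -= (g @ x) * x                            # project onto the tangent plane
            x = x + step * g
            x /= np.linalg.norm(x)                      # stay on the unit sphere
        moved.append(x)

    merged = []                                         # merge numerically coincident modes
    for x in moved:
        if not any(np.allclose(x, y, atol=1e-3) for y in merged):
            merged.append(x)
    return merged
```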

Definition 5. Let P and Q be point sets on S2 . P and Q are linearly separable


if there exists a plane through the origin of the sphere, which separates the space
R3 into two subsets R3− and R3+ , such that P ⊂ R3± and Q ⊂ R3∓ .

5 Experimental Results
We show an example in Fig. 2. The set consists of three clusters with 3000
points, which can be separated by a curve with the appearance of a baseball
stitching. Figure 2(c) shows the graph of the number of maxima. The set can be
successfully separated into two clusters in the mode tree.
Fig. 3 (a) shows the spherical image generated by the dioptric image shown
in Fig. 3 (b). Our aim is to detect spatial lines captured in the spherical image.
Since spatial lines are projected to great circles in the spherical image [7], we use the spherical Hough transform for spherical images [13,14], which detects
the great circles from sample points on a spherical image. Since the voting space
of the spherical Hough transform is the unit sphere, the votes yield a point cloud
on the sphere [13]. Therefore, line detection is achieved by detecting mean points
in clusters of a point cloud on the spherical voting space.
To apply the spherical clustering, we extend the voting space to the sphere by duplicating all points x as the antipodal points −x. As a result of constructing the mode tree of the point cloud, Fig. 4 (d) shows the graph of the number of modes at each scale. From the symmetry of the extended points, the modes are also symmetric at any scale and we can divide the mode tree into two subtrees. From this geometric property of the mode tree, we use the modes in the northern hemisphere

Fig. 2. Clustering of a linearly non-separable point set on the sphere in three-


dimensional Euclidean space. (a) point cloud with clusters, each of which has randomly
generated 2000 points on the sphere. (b) Number of maxima at each scale. Using the
life time, we can estimate the number of clusters in a point cloud on the sphere. (c)
The mode tree of the point set. (d) The clustered point set at the scale of 0.1.

(a) (b)

Fig. 3. Spherical image captured by Fish-eye-lens camera system. (a) Image on unit
sphere transformed from (b). (b) Image acquired by fish-eye-lens camera system.

to estimate clusters of the point cloud generated by voting. In the graph, there are some stable intervals in which the number of points does not change. The first three of these stable intervals of scales are [0.00337, 0.00526], [0.00925, 0.0134] and [0.01587, 0.02907]. These intervals are shaded regions on the graph. The numbers of modes in these intervals are 36, 20 and 10, respectively. The clustering results
in these intervals are shown in Fig. 4 (d).
Figures 5 (a), (b) and (c) show the means of the clusters detected from the point cloud on the spherical voting space using spherical scale-space clustering. Furthermore, Figs. 5 (d), (e) and (f) illustrate the detected lines. The results show that the method can
detect parameters of lines from dioptric images.

6 Conclusions
We introduced an algorithm for scale-space-based clustering for point clouds on
the sphere regarding the union of given point sets as an image of a finite sum of
the delta functions located at the positions of the points on the sphere.
The principal advantage of the scale-space-based analysis for the point-set
analysis is that deterministic features of the point set can be observed at higher
scales even if the positions of the points are stochastic. This property can be

Fig. 4. Voting points obtained for the N -point randomized Hough Transform
(NPRHT) [13,14]. (a) and (b) Point set on the spherical accumulator space. (c) Mode tree. (d) Number of modes.


Fig. 5. Detected means and lines. (a) Means detected from 1587 clusters. (b) Means
detected from 925 clusters. (c) Means detected from 337 clusters. (d) Lines detected
from 1587 clusters. (e) Lines detected from 925 clusters. (f) Lines detected from 337
clusters.

qualitatively explained using an image of dots. Our method is capable of find-


ing the deterministic correspondences of points and clusters at suitable scales,
analysing the distribution of the points in scale space.

References
1. Witkin, A.P.: Scale space filtering. In: Proc. 8th IJCAI, pp. 1019–1022 (1983)
2. Griffin, L.D., Colchester, A.: Superficial and deep structure in linear diffusion scale
space: Isophotes, critical points and separatrices. Image and Vision Computing 13,
543–557 (1995)
3. Nakamura, E., Kehtarnavaz, N.: Determining number of clusters and prototype lo-
cations via multi-scale clustering. Pattern Recognition Letters 19, 1265–1283 (1998)
4. Loog, M., Duistermaat, J.J., Florack, L.M.J.: On the Behavior of Spatial Criti-
cal Points under Gaussian Blurring (A Folklore Theorem and Scale-Space Con-
straints). In: Kerckhove, M. (ed.) Scale-Space 2001. LNCS, vol. 2106, pp. 183–192.
Springer, Heidelberg (2001)
5. Sakai, T., Imiya, A., Komazaki, T., Hama, S.: Critical scale for unsupervised cluster
discovery. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 218–232.
Springer, Heidelberg (2007)
6. Sakai, T., Imiya, A.: Unsupervised cluster discovery using statistics in scale space.
Engineering Applications of Artificial Intelligence 22, 92–100 (2009)
7. Franz, M.O., Chahl, J.S., Krapp, H.G.: Insect-inspired estimation of egomotion.
Neural Computation 16, 2245–2260 (2004)
8. Zhao, N.-Y., Iijima, T.: A theory of feature extraction by the tree of stable view-
points. IEICE Japan, Trans. D J68-D, 1125–1135 (1985) (in Japanese)
9. Kim, G., Sato, M.: Scale space filtering on spherical pattern. In: Proc. ICPR 1992,
pp. 638–641 (1992)
10. Chung, M.K.: Heat kernel smoothing on unit sphere. In: Proc. 3rd IEEE ISBI:
Nano to Macro, pp. 992–995 (2006)
11. Kuijper, A., Florack, L.M.J., Viergever, M.A.: Scale space hierarchy. Journal of
Mathematical Imaging and Vision 18, 169–189 (2003)
12. Minnotte, M.C., Scott, D.W.: The mode tree: A tool for visualization of nonpara-
metric density features. Journal of Computational and Graphical Statistics 2, 51–68
(1993)
13. Torii, A., Imiya, A.: The randomized-Hough-transform method for great-circle de-
tection on sphere. Pattern Recognition Letters 28, 1186–1192 (2007)
14. Mochizuki, Y., Torii, A., Imiya, A.: N -Point Hough transform for line detection.
Journal of Visual Communication and Image Representation 20, 242–253 (2009)
The Importance of Long-Range Interactions to Texture
Similarity

Xinghui Dong and Mike J. Chantler

The Texture Lab


School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, UK
{xd25,m.j.chantler}@hw.ac.uk

Abstract. We have tested 51 sets of texture features for estimating the percep-
tual similarity between textures. Our results show that these computational fea-
tures only agree with human judgments at an average rate of 57.76%. In a
second experiment we show that the agreement rates, between humans and
computational features, increase when humans are not allowed to use long-
range interactions beyond 19×19 pixels. We believe that this experiment
provides evidence that humans exploit long-range interactions which are not
normally available to computational features.

Keywords: Texture Features, Texture Similarity, Perceptual Similarity, Long-


Range Interactions, Evaluation.

1 Introduction

Although computed texture similarity is widely used for texture classification and
retrieval, human “perceptual similarity” is difficult to acquire and estimate. Halley [1]
derived a perceptual similarity matrix for a large texture database of 334 textures, and
Clarke et al. [2] compared these data with similarity matrices obtained by 4 computa-
tional feature sets and found that these did not correlate well.
Traditionally, computational features are divided into: filtering-based [3], structural
[4], statistical [5], and model-based [6] features. According to Parseval’s theorem [7],
filtering operations in the spatial domain are equivalent to those in the frequency do-
main when the variances of filtered images are used. In this case, linear filtering based
features, with the exception of quadrature filters which are designed to capture local
phase, only use power spectrum information. However, phase is believed to encode
the “structure” information in images [8]. As a result, these approaches are unlikely to
be able to capture texture structure. Texton-based features are a form of vector quanti-
zation and normally work by clustering in pixel neighborhood space. Computational
cost and feature space sparsity both severely limit the size of the neighborhood. Gen-
erally, statistical features also extract only local statistics again largely for reasons of
computational cost. Similarly, model-based features utilize only a small neighborhood


although the recursive structures have the potential to encode long-range interactions.
However, the majority of published features either work in the power spectrum, or
only exploit higher-order information from relatively small (i.e. 19×19 or less) local
neighborhoods. Yet the more interesting aperiodic structures in textures are
represented by phase-spectrum data, and features with small spatial extent cannot encode
these long-range interactions.
In order to examine the abilities of computational features to estimate perceptual
similarity, we have benchmarked 51 sets of features but we have found that these do
not agree well with humans’ perceptual judgments, even if a multi-pyramid scheme is
employed. We believe this may be because the majority of texture features do not
encode long-range higher-order information, such as continuity as expressed by the
Gestalt law of “good continuation” [9]. Furthermore, Polat et al. [10] found that the
interactions might be attributed to grouping collinear line segments into smooth
curves after they studied the lateral interactions between spatial filters. As Spillmann
et al. stated, classical receptive-field models can only explain local perceptual effects
but are unable to explain some global perceptual phenomenon, such as the perception
of illusory contours [11], which is believed to result from long-range interactions.
The most direct hypothesis is that the features that we have investigated cannot ex-
ploit these long-range interactions and hence these produce estimates of similarity that
are not consistent with human judgments. Unfortunately, it is difficult to test this hy-
pothesis directly by “adding” long-range interactions to textures as such actions inva-
riably change local features or 1st- or 2nd-order statistics. However, it is relatively
simple to prevent humans from using long-range interactions. In this situation, hu-
mans are likely to make judgments that are more similar to the computational results
if our hypothesis holds true. Hence, we performed two additional pair-of-pairs expe-
riments and their results show that humans are more inclined to make judgments that
coincide with the feature data when long-range interactions are not available to the
observers. These results indicate that the features that we have examined do not ex-
ploit long-range interactions.
In the next section, a series of evaluation experiments is conducted that compare
human and computational similarity. The effect of removing long-range interactions
on perceptual similarity is investigated in section 3. Finally, in section 4, conclusions
are drawn.

2 Evaluating Texture Features for Estimating Perceptual


Similarity

In this section, we carry out a series of evaluation experiments for examining the abil-
ities of 51 sets of different computational features to estimate perceptual similarity as
obtained from free-grouping [1, 12] and pair-of-pairs [12] experiments respectively.
Multi-pyramid decomposition is used to increase the spatial extent of the computa-
tional features and 6 pyramid levels are used. The computational similarity is com-
pared with perceptual similarity and the agreement rate is used to measure the estima-
tion ability of each set of features.

2.1 Capturing Perceptual Similarity


Halley [1] obtained a perceptual similarity matrix for the Pertex texture database of
334 textures using free grouping. However, for more than 200 textures this sorting-
based method is very time-consuming and hence only 30 participants were used. In
order to augment these data, a new set of perceptual similarity judgments were ob-
tained using the pair-of-pairs method, in which two pairs of textures were displayed
simultaneously in each trial and the participant was required to decide which pair is
more similar [12].

2.2 Estimating Perceptual Similarity Using Computational Features


Spatial extent is an important impact factor of features that capture visual structure
and is normally limited by computational considerations. Multi-resolution analysis is
often used to enhance the performance of features because such techniques allow
larger spatial extent to be considered. In our study, a simple multi-pyramid perceptual
similarity estimation scheme is proposed. Firstly, each texture image is decomposed
into 5 Gaussian pyramid [13] sub-bands. Next, each sub-band is individually norma-
lized to have an average intensity of 0 and standard deviation of 1 in order to remove
the influence of 1st- and 2nd-order gray level properties. Feature extraction is per-
formed to obtain a feature vector from each sub-band independently. In addition, all 5
feature vectors are combined into an additional feature vector. Thus, in total 6 feature
vectors are generated for each texture. Finally, a distance matrix is computed from all
334 sub-band images on every pyramid level. Although the Euclidean distance (see
Equation (1)) is a simple but popular metric, it is not suitable for measuring the dis-
tance between two histograms. Therefore, the Chi-square statistic (see Equation (2))
and the Euclidean distance are used for histogram-based features and the rest respec-
tively. Each distance matrix is first normalized to the range of [0, 1] and is then con-
verted to a similarity matrix. As a result, six similarity matrices are obtained for each
method and are used as the estimation of perceptual similarity.

$$D_E(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \qquad (1)$$

$$D_{\chi^2}(x, y) = \sum_{i} \frac{(x_i - y_i)^2}{x_i + y_i} \qquad (2)$$

In our research, 51 sets of features were used to estimate perceptual similarity. Due to
space limitations, we list and reference these in the paper’s supplementary material.
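For illustration, the distance-to-similarity step described above can be sketched as follows in Python (an illustrative sketch only; it assumes the Chi-square form of Eq. (2) without any normalising constant and is not the implementation used in our experiments):

```python
import numpy as np

def similarity_matrix(features, histogram_based):
    """Distance matrix for one pyramid level, normalised to [0, 1] and converted to similarity.

    features        : (n_textures, n_dims) array, one feature vector per texture
    histogram_based : use the Chi-square statistic of Eq. (2) instead of the Euclidean distance of Eq. (1)
    """
    diff = features[:, None, :] - features[None, :, :]
    if histogram_based:
        sums = features[:, None, :] + features[None, :, :]
        dist = np.sum(diff ** 2 / np.maximum(sums, 1e-12), axis=2)   # Chi-square statistic
    else:
        dist = np.sqrt(np.sum(diff ** 2, axis=2))                    # Euclidean distance
    dist = (dist - dist.min()) / (dist.max() - dist.min())           # normalise to [0, 1]
    return 1.0 - dist                                                # similarity matrix
```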

2.3 Comparing Computational and Perceptual Similarity


We use agreement rate (%) to measure the consistency between the computational
features and human judgments. When perceptual judgments from pair-of-pairs expe-
riments [12] are used as the ground-truth, it is separately computed on each pyramid
level as shown below:

(a) Accumulate the choice (left or right) decisions made by all 20 participants for
1000 pair-of-pairs trials, and label these counts as $n_{L,i}$ and $n_{R,i}$ respectively. The
difference between these two figures is computed and normalized:

$$d^{P}_{i} = \frac{n_{L,i} - n_{R,i}}{n_{L,i} + n_{R,i}}\,, \quad i = 1, 2, \ldots, 1000; \qquad (3)$$

(b) For each trial, label the computational similarities of the left and right pairs as
$s_{L,i}$ and $s_{R,i}$ respectively, and compute their difference as

$$d^{C}_{i} = s_{L,i} - s_{R,i}\,, \quad i = 1, 2, \ldots, 1000; \qquad (4)$$

(c) Compute the criterion that decides whether the computational features and the human
decisions are consistent for each trial:

$$a_i = \big[(d^{P}_{i} > 0 \wedge d^{C}_{i} > 0) \vee (d^{P}_{i} < 0 \wedge d^{C}_{i} < 0)\big]\,, \quad i = 1, 2, \ldots, 1000; \qquad (5)$$

(d) Finally compute the percentage agreement rate:

$$A(\%) = \sum_{i=1}^{1000} a_i \times 100 / 1000\,. \qquad (6)$$

In total, six agreement rates are obtained for each approach.
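For clarity, steps (a)–(d) can be summarised by the following Python sketch (the array names are introduced here only for illustration; the sign test condenses Eq. (5) into a product):

```python
import numpy as np

def agreement_rate(n_left, n_right, s_left, s_right):
    """Percentage agreement between human pair-of-pairs choices and computational similarities.

    n_left, n_right : (1000,) counts of participants choosing the left / right pair per trial
    s_left, s_right : (1000,) computational similarities of the left / right pair per trial
    """
    d_p = (n_left - n_right) / (n_left + n_right)        # normalised human preference, Eq. (3)
    d_c = s_left - s_right                               # computational preference, Eq. (4)
    agree = (d_p * d_c) > 0                              # same sign of both preferences, cf. Eq. (5)
    return 100.0 * np.count_nonzero(agree) / len(agree)  # Eq. (6)
```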


Since only 1000 pairs of pairs were examined in the pair-of-pairs experiment, we
can only consider these when the perceptual similarity matrix [1, 12] is used as the
ground-truth, in order to obtain consistent evaluation results for these two ground-
truth datasets. As a result, step (a) above is replaced by step (a1) below.
(a1) For each pair-of-pairs trial, label the perceptual similarities of the left and
right pairs as $q_{L,i}$ and $q_{R,i}$ respectively, and compute their difference as

$$d^{P}_{i} = q_{L,i} - q_{R,i}\,, \quad i = 1, 2, \ldots, 1000. \qquad (7)$$

However, the other three steps are kept constant.

2.4 Results
The average agreement rates (%) of the humans’ perceptual pair-of-pairs judgments
against 51 sets of computational features computed at 6 resolutions are displayed in
Figure 1(a). In addition a second set of human data obtained by free grouping and
Isomap analysis (8D-ISO) is also shown for comparison purposes. The 8D-ISO data
provides the highest agreement rate at 73.9% providing validation of the pair-of-pairs
data and an indication of the variability of human performance. However, the perfor-
mance of the computational features is much lower (average agreement rates lie in the
range 48.58% to 63.38%). Figure 1(b) provides a similar plot in which the same com-
putational features are compared against the 8D-ISO human data. The highest and
lowest performances (58.55% and 48.85%) were provided by MRSAR and SVR,
respectively. It can be seen that two curves in Figure 1(a) and 1(b) are similar. In both
cases the performance of the 51 computational features is poor when compared
against the two sources of human data.

In order to investigate the failure of the computational features further, 80 pairs of


pairs of textures were selected in which the disagreement between computer and hu-
man judgments was greatest. These were used in the experiment described in the next
section.

Fig. 1. The average agreement rates (%) of computational features with perceptual pair-of-pairs
judgments (a), 8D-ISO data (b), and corresponding standard deviations (error bars) over 6
resolutions. In (a) and (b), the black solid lines illustrate the overall average agreement rates
(57.76% and 53.59%), over all 51 methods and 6 resolutions.

Table 1. Pairwise t-test (α = 0.05) results, where r ≥ 0.5 means that the strong effect is
obtained. POP , POP and POP : probabilities that the participants chose “Left” throughout
80 trials in original, non-randomized and randomized pair-of-pairs experiments respectively.

t-test t p r df Significant
POP vs. POP -1.12 0.27 0.21 28 No
POP vs. POP -4.73 0.00 0.67 28 Yes
POP vs. POP 2.74 0.02 0.67 9 Yes

Fig. 2. The 4 (out of 80) pairs of pairs (rows) in which computational features have the most
difficulty agreeing with human judgments: (a) most participants think that the right pairs are
more similar in the original (POP) and the non-randomized (POP) pair-of-pairs but they
change their minds in the randomized pair-of-pairs (POP), and (b) most participants always
think the left pairs are more similar throughout all three versions of pair-of-pairs

Fig. 3. Average agreement rates between each method and perceptual judgments in POP and
POP, along with standard deviations (error bars) over 6 resolutions. The black solid lines illustrate
the overall average rates: (a) 31.28% and (b) 54.67%, over 51 methods and 6 resolutions.

3 The Effect of Removing Long-Range Interactions

We hypothesize that one possible reason for the computational features producing
results that do not agree with human perceptual judgments is that the features com-
pute higher order spatial statistics only on small neighborhoods. Thus, these cannot
exploit the long-range interactions that humans have been shown to be capable of
using for other tasks. Unfortunately, it is difficult to introduce long-range interactions
to real texture images, or to generate synthetic textures with controlled long-range
interactions that do not affect local and/or 1st-/2nd-order statistics. As an alternative,
we designed an experiment in which the human observers were prevented from using
long-range interactions. We were inspired by Field et al. [9] who showed that humans
can recognize an object in an image by using long-range interactions even though a
grid has been imposed on top of it. Hence, given a “blocked” texture image, we can
remove the original long-range interactions by randomizing the position of the blocks.
Experiments were performed using “non-randomized blocked” and “randomized
blocked” textures respectively. The former was designed to provide a control to un-
derstand the effect of superimposing the grid onto images. Note that the grid was
provided in order to lessen the effect of local discontinuities at randomized blocked
edges. The size of each block was set at 19×19 which is the largest neighborhood
used by computational features (excluding filtering-based features). Two modified
pair-of-pairs (POP and POP ) experiments were designed using non-randomized
and randomized blocked images respectively. All other conditions were kept the same
as for the original pair-of-pairs (POP ) experiment.
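A minimal sketch of the block randomisation (19×19 blocks) is given below; the cropping strategy and the rendering of the superimposed grid are assumptions and may differ from the stimuli actually shown to the observers:

```python
import numpy as np

def randomize_blocks(img, block=19, rng=None):
    """Shuffle non-overlapping block x block tiles of a grey-level texture image."""
    rng = np.random.default_rng() if rng is None else rng
    h = img.shape[0] - img.shape[0] % block          # crop to a whole number of blocks
    w = img.shape[1] - img.shape[1] % block
    tiles = [img[r:r + block, c:c + block]
             for r in range(0, h, block) for c in range(0, w, block)]
    order = rng.permutation(len(tiles))              # randomise the block positions
    tiles = [tiles[i] for i in order]
    cols = w // block
    rows = [np.concatenate(tiles[i:i + cols], axis=1) for i in range(0, len(tiles), cols)]
    return np.concatenate(rows, axis=0)
```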
Three pairwise t-tests were conducted on the 3 sets of results (see Table 1). There
is no significant difference between the choices participants have made in the original
(POP ) and the blocked but non-randomized (POP ) experiments. However, the re-
sults of the randomized experiment (POP ) against both the original (POP ) and the
blocked but non-randomized (POP ) show significant changes. In both cases the ran-
domized blocked experiment provides increased agreements with the computational
features. That is, when humans are provided with images containing the original long-
range interactions (i.e. either the original, or the blocked but non-randomized images),
they disagree significantly more with the computational results compared with when
these long-range interactions have been removed. For example, in Figure 2(a) most
participants judge that the right pairs are more similar in POP and POP but
change their minds in POP . The average agreement rates between each computa-
tional approach and perceptual judgments in POP and POP over 6 resolutions, are
plotted in Figure 3(a) and 3(b). It can be seen that the participants have agreed more
with the features when they have not been able to use long-range interactions.

4 Conclusions

In this paper, we examined the abilities of 51 feature sets to estimate perceptual simi-
larity as estimated using free-grouping and pair-of-pairs experiments. Even though
five different pyramid resolutions were used in order to enhance the feature sets’

spatial extents, the results did not agree well with humans’ perceptual judgments. The
average agreement rates of 57.76% and 53.59% were obtained over all methods and
resolutions. Obviously, enhancing spatial extent alone is not enough for capturing the
complexity of human perception.
In a second experiment, 80 of the most difficult pairs of pairs of images from expe-
riment one, were selected for further investigation. These were “blocked” and then the
position of the blocks within an image was randomized in order to remove, or at least
reduce, the ability of observers to exploit long-range interactions in the textures. The
results of the second experiment showed that the block-randomized images produced
significantly different results from the original experiment, while the blocked but non-
randomized images did not. When human observers are allowed to use long-range
interactions in textures, they agree significantly less with the computational feature-
based results. Thus we hypothesize that long-range interactions are important when
humans judge the similarity of textures and that the 51 feature sets that we tested do
not use this important information.

Acknowledgements. We would like to acknowledge the support of the Life Sciences


Interface theme of Heriot-Watt University.

References
1. Halley, F.: Perceptually Relevant Browsing Environments for Large Texture Databases.
PhD Thesis, Heriot Watt University (2011)
2. Clarke, A.D.F., Halley, F., Newell, A.J., Griffin, L.D., Chantler, M.J.: Perceptual Similari-
ty: A Texture Challenge. In: BMVC 2011, pp. 120.1–120.10. BMVA Press (2011)
3. Randen, T., Husøy, J.H.: Filtering for Texture Classification: A Comparative Study. IEEE
Transactions on PAMI 21, 291–310 (1999)
4. Varma, M., Zisserman, A.: A Statistical Approach to Material Classification Using Image
Patch Exemplars. IEEE Transactions on PAMI 31, 2032–2047 (2009)
5. Unser, M.: Sum and Difference Histograms for Texture Classification. IEEE Transactions
on PAMI 8(1), 118–125 (1986)
6. Mao, J., Jain, A.K.: Texture classification and segmentation using multiresolution simulta-
neous autoregressive models. Pattern Recognition 25(2), 173–188 (1992)
7. Parseval’s Theorem, http://mathworld.wolfram.com/ParsevalsTheorem.
html
8. Oppenheim, A.V., Lim, J.S.: The Importance of Phase in Signals. Proceedings of the
IEEE 69(5), 529–541 (1981)
9. Field, D.J., Hayes, A., Hess, R.F.: Contour integration by the human visual system:
evidence for a local “association field”. Vision Research 33, 173–193 (1993)
10. Polat, U., Sagi, D.: The Architecture of Perceptual Spatial Interactions. Vision Re-
search 34, 73–78 (1994)
11. Spillmann, L., Werner, J.S.: Long-range interactions in visual perception. Trends in Neu-
rosciences 19, 428–434 (1996)
12. Clarke, A.D.F., Dong, X., Chantler, M.J.: Does Free-sorting Provide a Good Estimate of
Visual Similarity. In: Predicting Perceptions, pp. 17–20 (2012)
13. MatlabPyrTools-v1.4, http://www.cns.nyu.edu/~lcv/software.php
Unsupervised Dynamic Textures Segmentation

Michal Haindl and Stanislav Mikeš

Institute of Information Theory and Automation


of the ASCR, Prague, Czech Republic
{haindl,xaos}@utia.cz

Abstract. This paper presents an unsupervised dynamic colour tex-


ture segmentation method with unknown and variable number of texture
classes. Single regions with dynamic textures can furthermore change
their location as well as their shape. Individual dynamic multispectral
texture mosaic frames are locally represented by Markovian features de-
rived from four directional multispectral Markovian models recursively
evaluated for each pixel site. Estimated frame-based Markovian paramet-
ric spaces are segmented using an unsupervised segmenter derived from
the Gaussian mixture model data representation which exploits contex-
tual information from previous video frames segmentation history. The
segmentation algorithm for every frame starts with an over segmented
initial estimation which is adaptively modified until the optimal number
of homogeneous texture segments is reached. The presented method is
objectively numerically evaluated on the dynamic textural test set from
the Prague Segmentation Benchmark.

Keywords: dynamic texture segmentation, unsupervised segmentation.

1 Introduction

Many automated static or dynamic visual data analysis systems build on the
segmentation as the fundamental process which affects the overall performance
of any analysis. Visual scene regions, homogeneous with respect to some usu-
ally textural or colour measure, which result from a segmentation algorithm are
analysed in subsequent interpretation steps. Dynamic texture-based (DT) im-
age segmentation is an area of novel research activity in recent years and several
algorithms were published in consequence of all this effort. Different published
methods are difficult to compare because of incompatible assumptions (gray-
scale, fixed or known number of regions, segmentation or retrieval, constant
shape and/or location of texture regions, etc.), lack of a comprehensive analysis
together with accessible experimental data. Gray-scale dynamic texture segmentation
or retrieval was addressed in a few papers [1–5], while colour texture retrieval based
on VLBP [6], or DT segmentation [7] based on the geodesic active contour algorithm
and partial shape matching to obtain partial match costs between regions of subsequent
frames, were addressed to an even lesser extent. However, all available published
results indicate that the ill-defined dynamic texture


segmentation problem is far from being satisfactorily solved. Spatial interac-


tion models and especially Markov random fields-based models are increasingly
popular for texture representation [8, 9], etc. Several researchers dealt with the
difficult problem of unsupervised segmentation using these models see for exam-
ple [10–13] or [14] which is also generalized to dynamic textures and addressed
in this paper.
The contribution of the paper is a novel unsupervised dynamic multispec-
tral texture segmentation method with unknown and variable number of texture
classes, and regions (with dynamic texture) which can in addition change their
location as well as their shape. Thus the method relaxes most of the alterna-
tive approaches [1–5] limitations (gray-scale textures, fixed or known number of
regions, fixed regions shape and locations) which prevent their practical appli-
cations.
The outline of this paper is as follows. Section 2 presents our Markovian mul-
tispectral texture representation. Section 3 outlines the unsupervised segmenter,
followed by the experimental verification in the subsequent Section 4 and con-
cluding Section 5.

2 Dynamic Texture Representation


Dynamic multispectral textures would require a four dimensional (4D) model
or some of its lower dimensional approximation such as a set of spectrally fac-
torized 3D models. However, we model each dynamic texture frame
separately, and thus a 3D static smooth textural model is sufficient for its ad-
equate representation. We assume that single multispectral frame textures can
be locally modeled using a 3D simultaneous causal auto-regressive random field
model (AR3D). This model can be expressed as a stationary causal uncorrelated
noise driven 3D auto-regressive process [15]:

$$Y_r = \gamma X_r + e_r\,, \qquad (1)$$

where $X_r = [Y_{r-s}^T : \forall s \in I_r^c]^T$ is a vector of the contextual neighbours $Y_{r-s}$,
$I_r^c$ is a causal neighbourhood index set of the model with cardinality $\eta = \mathrm{card}(I_r^c)$,
$\gamma = [A_1, \ldots, A_\eta]$ is the $d \times d\eta$ parameter matrix containing parametric
sub-matrices $A_s$ for each contextual neighbour $Y_{r-s}$, $d$ is the number of
spectral bands, $e_r$ is a white Gaussian noise vector with zero mean and a
constant but unknown covariance, and $r, r-1, \ldots$ is a chosen direction of
movement on the image index lattice I. The selection of an appropriate model
support (Irc ) is important to obtain good texture representation for realistic
texture synthesis but less important for adequate texture segmentation which
works only with site specific parameters. Both, the optimal neighbourhood as
well as the Bayesian parameters estimation of the AR3D model can be found
analytically under few additional and acceptable assumptions using the Bayesian
approach (see details in [15]). The local model parameters can be advantageously
evaluated using the recursive Bayesian parameter estimator for every DT frame
as follows:

$$\hat\gamma_{r-1}^T = \hat\gamma_{r-2}^T + \frac{V_{x(r-2)}^{-1}\, X_{r-1}\, (Y_{r-1} - \hat\gamma_{r-2} X_{r-1})^T}{1 + X_{r-1}^T V_{x(r-2)}^{-1} X_{r-1}}\,, \qquad (2)$$

where the data accumulation matrix is

$$V_{x(r-1)} = \sum_{k=1}^{r-1} X_k X_k^T + V_{x(0)}\,. \qquad (3)$$

Thus the parameter matrix estimate can be easily upgraded after moving to
a new lattice location (r − 1 −→ r). The model is very fast, hence the local
texture for each pixel can be represented by four directional parametric vectors
corresponding to four distinct models. Each vector contains local estimations of
the AR3D model parameters. These models have identical contextual neighbour-
hood $I_r^c$ but they differ in their major movement direction (top-down, bottom-up,
rightward, leftward), i.e.,
$$\tilde\gamma_{r,o} = \{\hat\gamma_{r,o}^{t\,T},\, \hat\gamma_{r,o}^{b\,T},\, \hat\gamma_{r,o}^{r\,T},\, \hat\gamma_{r,o}^{l\,T}\}^T\,, \qquad (4)$$
where o = 1, . . . , n is the DT frame number.

3 Gaussian Mixture Segmenter


Multispectral texture segmentation is done by clustering in the AR3D parameter
space Θo defined on the lattice I for every frame o where
$$\Theta_{r,o} = \bar\gamma_{r,o}^T \qquad (5)$$
is the decorrelated parameter vector (4) computed for the lattice location r
(the frame index is further left out to simplify notation). We assume that this
parametric space can be represented using the Gaussian mixture model with
diagonal covariance matrices due to the previous CAR parametric space decor-
relation. The Gaussian mixture model for AR3D parametric representation is as
follows:

$$p(\Theta_r) = \sum_{i=1}^{K} p_i\, p(\Theta_r \mid \nu_i, \Sigma_i)\,, \qquad (6)$$

$$p(\Theta_r \mid \nu_i, \Sigma_i) = \frac{|\Sigma_i|^{-\frac{1}{2}}}{(2\pi)^{\frac{d}{2}}}\, e^{-\frac{1}{2}(\Theta_r - \nu_i)^T \Sigma_i^{-1} (\Theta_r - \nu_i)}\,. \qquad (7)$$

The mixture model equations (6),(7) are solved using a modified EM algorithm.
The algorithm is initialised, for the first DT frame, using νi , Σi statistics es-
timated from the corresponding rectangular subimages obtained by regular di-
vision of the input texture mosaic. An alternative initialisation can be random
choice of these statistics. For each possible couple of rectangles the Kullback
Leibler divergence
$$D\big(p(\Theta_r \mid \nu_i, \Sigma_i)\,\|\,p(\Theta_r \mid \nu_j, \Sigma_j)\big) = \int_{\Omega} p(\Theta_r \mid \nu_i, \Sigma_i)\, \log\frac{p(\Theta_r \mid \nu_i, \Sigma_i)}{p(\Theta_r \mid \nu_j, \Sigma_j)}\, d\Theta_r \qquad (8)$$

is evaluated and the most similar rectangles, i.e.,

$$\{i, j\} = \arg\min_{k,l}\; D\big(p(\Theta_r \mid \nu_l, \Sigma_l)\,\|\,p(\Theta_r \mid \nu_k, \Sigma_k)\big) \qquad (9)$$

are merged together in each step. This initialization results in Kini subimages
and recomputed statistics νi , Σi . Kini > K where K is the optimal number of
textured segments to be found by the algorithm. All the subsequent DT frames
are initialized either from the corrected statistics ν̂i,o−1 , Σ̂i,o−1 for i = 1, . . . , K
computed from the trimmed segmented regions in the previous frame o − 1 or
with random parameter values ν̂i,o−1 , Σ̂i,o−1 i = K + 1, . . . , Kini for possi-
bly newly (re)appearing regions. Two steps of the EM algorithm are repeating
after initialisation. The components with weights smaller than a fixed threshold
($p_j < \frac{0.1}{K_{ini}}$) are eliminated. For every pair of components we estimate their
Kullback Leibler divergence (8). From the most similar couple, the component
with the weight smaller than the threshold is merged to its stronger partner and
all statistics are actualised using the EM algorithm. The algorithm stops when
either the likelihood function has negligible increase (Lt − Lt−1 < 0.05) or the
maximum iteration number threshold is reached.
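Because the parametric space is decorrelated and the mixture components have diagonal covariance matrices, the Kullback–Leibler divergence (8) used for merging has a closed form. A sketch, assuming the covariances are stored as variance vectors:

```python
import numpy as np

def kl_divergence_diag(nu_i, var_i, nu_j, var_j):
    """Closed-form KL divergence between two Gaussians with diagonal covariances, cf. Eq. (8)."""
    return 0.5 * np.sum(np.log(var_j / var_i)
                        + (var_i + (nu_i - nu_j) ** 2) / var_j
                        - 1.0)
```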
The parametric vectors representing texture mosaic pixels are assigned to the
clusters according to the highest component probabilities, i.e., Yr is assigned to
the cluster ωj if

$$\pi_{r,j} = \max_j \sum_{s \in I_r} w_s\, p(\Theta_{r-s} \mid \nu_j, \Sigma_j)\,, \qquad (10)$$

where ws are fixed distance-based weights, Ir is a rectangular neighbourhood


and πr,j > πthre (otherwise the pixel is unclassified). The area of single cluster
blobs is evaluated in the post-processing thematic map filtration step. Regions
with similar statistics are merged. Thematic map blobs with area smaller than a
given threshold are attached to its neighbour with the highest similarity value.
Finally, the resulting region classes are remapped to ensure their between frame
consistency.
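A sketch of the assignment rule (10) and of the unclassified-pixel test is given below; it assumes a rectangular neighbourhood of fixed weights and uses SciPy routines for the component densities and the weighted sums, so it is an illustration rather than the benchmark implementation:

```python
import numpy as np
from scipy.ndimage import convolve
from scipy.stats import multivariate_normal

def assign_pixels(Theta, components, weights, pi_threshold):
    """Label each pixel by the component with the largest weighted neighbourhood probability.

    Theta        : (H, W, D) decorrelated parametric vectors
    components   : list of (nu, var) pairs, one diagonal Gaussian per texture class
    weights      : (h, w) distance-based neighbourhood weights w_s
    pi_threshold : pixels below this value remain unclassified (label -1)
    """
    H, W, D = Theta.shape
    flat = Theta.reshape(-1, D)
    probs = []
    for nu, var in components:
        p = multivariate_normal.pdf(flat, mean=nu, cov=np.diag(var)).reshape(H, W)
        probs.append(convolve(p, weights, mode='nearest'))   # weighted sum over the neighbourhood
    probs = np.stack(probs, axis=-1)
    labels = probs.argmax(axis=-1)
    labels[probs.max(axis=-1) <= pi_threshold] = -1          # unclassified pixels
    return labels
```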

4 Experimental Results

The algorithm was tested on the natural colour dynamic textural mosaics from
the Prague Texture Segmentation Data-Generator and Benchmark [16]. The
benchmark (http://mosaic.utia.cas.cz) test mosaics with varying layouts and
each cell texture membership are randomly generated and filled with dynamic
colour textures from the Dyntex database [17]. The benchmark ranks segmenta-
tion algorithms according to a chosen criterion. The benchmark has implemented
the majority of segmentation criteria used for both supervised or unsupervised
algorithms evaluation. Twenty seven evaluation criteria (see their definition in
[16]) are categorized into four groups: region-based (5+5), pixel-wise (12), consis-
tency measures (2), and clustering comparison criteria (3) and permit detailed
and objective study of any segmentation method properties. Tab.1 compares

Table 1. Dynamic A benchmark results for DTAR3D+EM (e+pp), DTAR3D+EM


(pp), DTAR3D+EM; (Benchmark criteria [16]: CS = correct segmentation; OS =
over-segmentation; US = under-segmentation; ME = missed error; NE =
noise error; O = omission error; C = commission error; CA = class accu-
racy; CO = recall - correct assignment; CC = precision - object accuracy;
I. = type I error; II. = type II error; EA = mean class accuracy estimate;
MS = mapping score; RM = root mean square proportion estimation error;
CI = comparison index; GCE = Global Consistency Error; LCE = Local
Consistency Error; dD = Van Dongen metric; dM = Mirkin metric; dVI =
variation of information). Arrow directions denote whether higher (↑) or lower (↓) values
are better; the criterion rank numbers are given in parentheses on the right, with the average
rank beside the method label (rank 1 marks the best criterion value).

Benchmark – Dynamic A
Criterion      DTAR3D+EM e+pp (1.33)   DTAR3D+EM pp (1.86)   DTAR3D+EM (2.62)
↑ CS           92.68 (1)               60.75 (2)             60.12 (3)
↓ OS           39.47 (3)               20.37 (2)             14.78 (1)
↓ US            0.00 (1)                0.00 (1)              0.00 (1)
↓ ME            0.00 (1)               35.77 (2)             35.77 (2)
↓ NE            0.00 (1)               36.52 (2)             36.76 (3)
↓ O             3.23 (1)               10.07 (2)             11.08 (3)
↓ C            13.25 (2)               12.45 (1)             14.56 (3)
↑ CA           87.03 (1)               81.26 (2)             80.10 (3)
↑ CO           92.68 (1)               84.42 (2)             83.69 (3)
↑ CC           94.01 (3)               95.85 (1)             95.18 (2)
↓ I.            7.32 (1)               15.58 (2)             16.31 (3)
↓ II.           1.32 (3)                0.76 (1)              0.89 (2)
↑ EA           92.80 (1)               89.01 (2)             88.30 (3)
↑ MS           89.07 (1)               82.41 (2)             81.35 (3)
↓ RM            2.56 (1)                5.54 (3)              5.21 (2)
↑ CI           93.07 (1)               89.56 (2)             88.86 (3)
↓ GCE          11.13 (1)               12.39 (2)             13.51 (3)
↓ LCE           7.02 (1)               11.03 (2)             12.21 (3)
↓ dD            7.27 (1)               11.68 (2)             12.37 (3)
↓ dM            4.95 (1)                6.36 (2)              6.80 (3)
↓ dVI          13.18 (1)               13.93 (2)             13.99 (3)

the overall (average over all DT frames) benchmark performance of the pro-
posed algorithm (DT AR3D + EM (e + pp)) with postprocessing (pp) and robust
trimmed initialization (e) with its alternative versions. The results demonstrate
very good performance on all criteria with the exception of over-segmentation
tendency and slightly worse variation of information criterion. We could not
compare our results with the few published alternative DT segmenters [1, 2, 4] be-
cause neither their code, nor their experimental segmentation data are publicly
available, however the static single-frame (AR3D+EM) version of the method
was extensively evaluated and compared with several alternative methods (22


Fig. 1. Selected experimental dynamic texture mosaic frames (0, 1, 2, 126, 249), ground
truth from the benchmark (middle row), and the corresponding segmentation results
(DTAR3D+EM e+pp - bottom)

other leading unsupervised segmenters) in the segmentation benchmark. The


static method proved its very good performance and outperformed most of these
alternatives (see details in http://mosaic.utia.cas.cz). For example, the impor-
tant correct region segmentation criterion (CS) is 25% better than for the HGS
method [18], under-segmentation is low as well as missed and noise errors [19].
Fig.1 shows five selected (three from the beginning, one from the middle and
one from the end of the sequence) 720 × 576 frames from the experimental
benchmark mosaics created from five Dyntex dynamic colour textures (47fa110 -
curtain, 54aa110 - curly hair, 54ac210 - straw, 54pd110 - escalator, and 571b110
- water). While the first frame suffers with over-segmentation the contextual in-
formation propagated from previous frames significantly improves the segmen-
tation consistency. Hard natural Dyntex textures were chosen for comparison
rather than synthesised (for example using the generative AR3D model or some
other Markov random field model) ones because they are expected to be more
difficult for the underlying segmentation model. Resulting segmentation results
are promising even if we could not compare them with alternative DT segmen-
tation methods. The time for an unoptimized parameter estimation is 170 s and
segmentation time is 10 s per frame. Our results can be further improved by an
appropriate more elaborate postprocessing or frame model initialization.

5 Conclusions

We proposed a novel method for fast unsupervised dynamic texture or video seg-
mentation with unknown variable number of classes based on the underlying

three dimensional Markovian local image representation and the Gaussian mix-
ture parametric space models. Single homogeneous texture regions can not only
dynamically change their location but simultaneously also their shape. Textu-
ral regions can also disappear temporarily or permanently and new regions can
appear at any time. Although the algorithm uses the random field type data
model it is very fast because it uses efficient recursive parameter estimation
of the model and therefore is much faster than the usual Markov chain Monte
Carlo estimation approach needed for Markovian models. Segmentation methods
typically suffer from a lot of application-dependent parameters that have to be
experimentally estimated. Our method requires only a contextual neighbourhood selection
and two additional thresholds all of them having an intuitive meaning. The al-
gorithm’s performance is demonstrated on the extensive benchmark objective
tests on natural dynamic texture mosaics. The static version of our method out-
performs several alternative unsupervised segmentation algorithms and it is also
faster than most of them. These dynamic texture unsupervised segmentation test
results are encouraging and we proceed with more elaborate post-processing and
some modification of the texture representation model.

References
1. Doretto, G., Cremers, D., Favaro, P., Soatto, S.: Dynamic texture segmentation.
In: Proceedings of the 9th IEEE International Conference on Computer Vision,
vol. 2, pp. 1236–1242 (2003)
2. Péteri, R., Chetverikov, D.: Dynamic texture recognition using normal flow and
texture regularity. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds.) IbPRIA
2005, Part II. LNCS, vol. 3523, pp. 223–230. Springer, Heidelberg (2005)
3. Chan, A.B., Vasconcelos, N.: Classifying video with kernel dynamic textures. In:
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2007), pp. 1–6. IEEE Computer Society (2007)
4. Chan, A.B., Vasconcelos, N.: Layered dynamic textures. IEEE Transactions on
Pattern Analalysis and Machine Intelligence 31(10), 1862–1879 (2009)
5. Chen, J., Zhao, G., Salo, M., Rahtu, E., Pietikäinen, M.: Automatic dynamic texture
segmentation using local descriptors and optical flow. IEEE Transactions on Image
Processing (2012)
6. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns
with an application to facial expressions. IEEE Transactions on Pattern Analysis
and Machine Intelligence 29(6), 915–928 (2007)
7. Donoser, M., Urschler, M., Riemenschneider, H., Bischof, H.: Highly consistent se-
quential segmentation. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688,
pp. 48–58. Springer, Heidelberg (2011)
8. Kashyap, R.L.: Image models. In: Young, T.Y., Fu, K.S. (eds.) Handbook of Pat-
tern Recognition and Image Processing. Academic Press, New York (1986)
9. Haindl, M.: Texture synthesis. CWI Quarterly 4(4), 305–331 (1991)
10. Panjwani, D.K., Healey, G.: Markov random field models for unsupervised seg-
mentation of textured color images. IEEE Transactions on Pattern Analysis and
Machine Intelligence 17(10), 939–954 (1995)
11. Manjunath, B.S., Chellapa, R.: Unsupervised texture segmentation using Markov
random field models. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 13, 478–482 (1991)

12. Haindl, M.: Texture segmentation using recursive markov random field parameter
estimation. In: Bjarne, K.E., Peter, J. (eds.) Proceedings of the 11th Scandinavian
Conference on Image Analysis, Lyngby, Denmark, pp. 771–776. Pattern Recogni-
tion Society of Denmark (June 1999)
13. Haindl, M., Mikeš, S., Pudil, P.: Unsupervised hierarchical weighted multi-
segmenter. In: Benediktsson, J.A., Kittler, J., Roli, F. (eds.) MCS 2009. LNCS,
vol. 5519, pp. 272–282. Springer, Heidelberg (2009)
14. Haindl, M., Mikeš, S.: Unsupervised texture segmentation using multispectral mod-
elling approach. In: Tang, Y.Y., Wang, S.P., Yeung, D.S., Yan, H., Lorette, G. (eds.)
Proceedings of the 18th International Conference on Pattern Recognition, ICPR
2006, vol. II, pp. 203–206. IEEE Computer Society, Los Alamitos (2006)
15. Haindl, M., Šimberová, S.: A multispectral image line reconstruction method. In:
Theory & Applications of Image Analysis, pp. 306–315. World Scientific Publishing
Co., Singapore (1992)
16. Haindl, M., Mikeš, S.: Texture segmentation benchmark. In: Lovell, B., Lauren-
deau, D., Duin, R. (eds.) Proceedings of the 19th International Conference on
Pattern Recognition, ICPR 2008, pp. 1–4. IEEE Computer Society (2008)
17. Péteri, R., Fazekas, S., Huiskes, M.J.: DynTex: A Comprehensive Database
of Dynamic Textures. Pattern Recognition Letters 31(12), 1627–1632 (2010),
http://projects.cwi.nl/dyntex/
18. Hoang, M.A., Geusebroek, J.M., Smeulders, A.W.M.: Color texture measurement
and segmentation. Signal Processing 85(2), 265–275 (2005)
19. Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, P.J., Bunke, H., Goldgof, D.B.,
Bowyer, K., Eggert, D.W., Fitzgibbon, A., Fisher, R.B.: An experimental com-
parison of range image segmentation algorithms. IEEE Transaction on Pattern
Analysis and Machine Intelligence 18(7), 673–689 (1996)
20. Kittler, J.V., Marik, R., Mirmehdi, M., Petrou, M., Song, J.: Detection of defects
in colour texture surfaces. In: IAPR Workshop on Machine Vision Application,
Tokyo, Japan, pp. 558–567 (1994)
21. Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature
similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24,
301–312 (2002)
22. Mirmehdi, M., Marik, R., Petrou, M., Kittler, J.: Iterative morphology for fault
detection in stochastic textures. Electronic Letters 32(5), 443–444 (1996)
Voting Clustering and Key Points Selection

Costas Panagiotakis1 and Paraskevi Fragopoulou2


1
Dept. of Commerce and Marketing, Technological Educational Institute (TEI)
of Crete, 72200 Ierapetra, Greece
[email protected]
2
Dept. of Applied Informatics and Multimedia, TEI of Crete, PO Box 140, Greece
[email protected]

Abstract. We propose a method for clustering and key points selection.


We have shown that the proposed clustering based on the voting max-
imization scheme has advantages concerning the cluster’s compactness,
working well for clusters of different densities and/or sizes. Experimental
results demonstrate the high performance of the proposed scheme and
its application to video summarization problem.

Keywords: Clustering, Grouping, K-means, Video summarization.

1 Introduction

Clustering is one of the most fundamental problems of pattern recognition with


many applications in different fields like computer vision, signal-image-video
analysis, multimedia, networks and biology. The clustering task involves group-
ing N given objects (points of d−dimensional space) into a set of K subgroups
(clusters) in such a manner that the similarity measure between the objects
within a subgroup is higher than the similarity measure between the objects
from other subgroups [1]. Clustering algorithms can be divided into two main
categories: hierarchical and partitional [2]. Hierarchical clustering algorithms re-
cursively find nested clusters either in agglomerative (bottom-up) mode or in
divisive (top-down) mode. According to partitional clustering algorithms, the
clusters are simultaneously computed as a partition of the data. The resulting
clusters can be disjoint and nonoverlapping (crisp clustering), where an object
belongs to one and only one cluster, or overlapping (fuzzy clustering), where an
object may belong to more than one cluster.
During the last decades, thousands of clustering algorithms [2] have been
published, so hereafter we briefly present some popular and widely used clus-
tering algorithms. An extensive survey of various clustering algorithms can be
found in [2]. The K-means clustering algorithm, is one of the simplest partitional
clustering algorithms that solves the clustering problem for a given number of
clusters. The goal of K-means is to minimize the sum of squared error (SSE)

Paraskevi Fragopoulou is also with the Foundation for Research and Technology-
Hellas, Institute of Computer Science, 70013 Heraklion, Crete, Greece.


over all clusters. In [3], a variant method (K-means++ algorithm) for centroid
initialization has been proposed that chooses centers at random from the data
points, but weights the data points according to their squared distance from
the closest center already chosen. K-means++ usually outperforms K-means in
terms of both accuracy and speed. A deterministic initialization scheme for K-
means is given by the KKZ algorithm [4]. According to KKZ method, the first
centroid is given as the data point with maximum norm, and the second cen-
troid is the point farthest from the first centroid, the third centroid is the point
farthest from its closest existing centroid and so on. An extension/variation of
K-means is the K-medoid or Partitioning Around Medoids (PAM) [5], where
the clusters are represented using the medoid of the data instead of the mean.
The medoid is the object of the cluster with minimum distance to all other objects
in the cluster. Most of the approaches from literature are heuristic or they try to
optimize a criterion that may not be appropriate for clustering or they require a
training set. On the contrary, in this paper, we have solved the crisp clustering
problem via a voting maximization scheme that ensures high similarity between
the points of the same cluster without any user defined parameter. In addition,
the proposed method has been applied to video summarization problem [6].

2 The Clustering Problem

In this section the clustering problem is analyzed. Let us assume a dataset of N


points, $x_i$, $i \in \{1, ..., N\}$, in the $d$-dimensional space ($x_i \in \mathbb{R}^d$), that are clustered
into K non empty clusters, pk , k ∈ {1, ..., K}, where pk denotes the k-cluster
indexes and |pk | denotes the number of points of cluster pk . According to crisp
clustering it holds that each point belongs to exactly one cluster.
One of the most widely used criteria for clustering and for other similar prob-
lems (e.g. see Microaggregation problem [7]) is the within-group squared error
(SSE) minimization. For cases of almost equal-sized clusters with almost the same
variation, the minimization of SSE yields what humans mean by “optimal clustering”.
However, the clustering that corresponds to the minimization of SSE is not always
appropriate, even for the simple case of two clusters. Under SSE minimization it is
difficult to keep large clusters with high variation connected, which means that if
there exists a large physical cluster with high variation, it may be divided into two
or more clusters.
In this research, we introduce a new validity measure, the Voting Measure
(VM) that can also work well for clusters with different densities and/or sizes.
VM is invariant to scaling and to the number of data points, and is bounded, $VM \in
[0, 1]$. In order to define VM, we first introduce the voting point problem. Ac-
cording to this problem, we have to define the function V (i, j) ∈ [0, 1] that
corresponds to the votes of point xi , i ∈ {1, ..., N } to point xj , j ∈ {1, ..., N }.
However, if we use a metric for points’ density like the Gaussian similarity func-
tion in spectral clustering, then high density clusters will be favored. In order to
overcome this problem, the voting function is defined so that it should satisfy
the following conditions:


(a) N j=1 V (i, j) = 1, (b) V (i, i) = 0, (c) V (i, j) ∼ d(xi ,xj ) , (d) V (i, j) ≤ 2
1 1

where d(xi , xj ) denotes the Euclidean distance between the points xi , xj . The
first two conditions ensure the point “equality” (each point has the same voting
“power”). The third condition ensures the scale/density-invariant property.
According to the first three conditions it holds that

$$V_3(i, j) = \frac{\frac{1}{d(x_i, x_j)}}{\sum_{k \in \{1, \ldots, N\} - \{i\}} \frac{1}{d(x_i, x_k)}}\,,$$

where $V_3(i, j)$ denotes the voting matrix that satisfies the first three conditions
(the sub-index shows the number of satisfied conditions). The last condition is added
in order to ensure that each point will vote for the rest of the points, avoiding the
special case of pairs of identical points that only vote for each other, resulting in
wrong voting descriptors (see the end of the Section). When all the conditions are
satisfied, then $V_4(i, j)$ is given by:
$$V_4(i, j) = \begin{cases} V_3(i, j), & \delta(i) \le 0 \\ \min\!\left(\frac{V_3(i, j)}{1 - \delta(i)},\, \frac{1}{2}\right), & \delta(i) > 0 \end{cases} \qquad (1)$$

where $\delta(i) = \max_{j \in \{1, \ldots, N\}} V_3(i, j) - \frac{1}{2}$. In our experimental results, the voting


matrix is computed based on the four aforementioned conditions. The voting
descriptor $VD(j) = \sum_{i=1}^{N} V(i, j)$ of point $x_j$, $j \in \{1, ..., N\}$, measures the votes
that point $x_j$ receives. For any dataset, it holds that the mean value of $VD$ is
one ($E(VD) = 1$). VM is defined by the average value of voting descriptors per
cluster taking into account only the intrinsic voting, dividing by the number of
clusters K:
$$VM = \frac{1}{K} \cdot \sum_{k=1}^{K} \frac{\sum_{i \in p_k} \sum_{j \in p_k} V(j, i)}{|p_k|} \qquad (2)$$
Fig. 1(a) depicts a dataset using a colormap according to voting descriptor (red
for high values and blue for low values). It holds that the voting descriptor
generally receives higher values on points that are closer to a cluster centroid,
while it receives lower values on boundary points. Lower values (e.g. close to
zero) are observed for outliers, since these points are quite far from the clusters
and thus it is difficult for them to receive votes.
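The voting construction and the Voting Measure can be sketched directly from Eqs. (1)–(2); the following Python fragment is illustrative only (the variable names and the small numerical guard are ours):

```python
import numpy as np

def voting_matrix(X):
    """Voting matrix V4 satisfying conditions (a)-(d) for a point set X of shape (N, d)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D = np.maximum(D, 1e-12)                           # guard against identical points
    np.fill_diagonal(D, np.inf)                        # enforces V(i, i) = 0
    inv = 1.0 / D
    V3 = inv / inv.sum(axis=1, keepdims=True)          # conditions (a)-(c)
    delta = V3.max(axis=1) - 0.5
    V4 = V3.copy()
    rows = delta > 0
    V4[rows] = np.minimum(V3[rows] / (1.0 - delta[rows, None]), 0.5)   # condition (d), Eq. (1)
    return V4

def voting_measure(V, labels):
    """Voting Measure VM of Eq. (2): average intrinsic votes per cluster, averaged over clusters."""
    clusters = np.unique(labels)
    vm = 0.0
    for k in clusters:
        idx = np.where(labels == k)[0]
        vm += V[np.ix_(idx, idx)].sum() / len(idx)
    return vm / len(clusters)
```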

3 The Proposed Algorithm


3.1 Voting-Based Clustering Algorithm
In this section, the proposed Clustering based on Voting Representativeness
algorithm (CVR) is presented. The method requires as input the voting array
$V$, the voting descriptor $VD$, and the number of clusters $K$. The output of the method
is the cluster indexes. The CVR method consists of two phases:
– In the first phase, K iterations are performed selecting the K key points.
In the k th -iteration of the method, we select a key point of the dataset to
be the representative of the k-cluster and we discard it from the dataset.
Therefore, at the end of the first phase, each cluster has been initialized
with one record.


Fig. 1. (a) The dataset using a colormap according to voting descriptor. Results of
clustering (b) K = 4, SSE = 27.68 and (c) K = 4, SSE = 23.89.

– Finally, in the second trivial phase the N − K remaining unlabeled points


are assigned to the cluster that corresponds to their “closest” representative
point according to the voting formulation.
Hereafter, we analyze the first phase of the proposed method in more detail, as it
has also been used in the video summarization problem (see Section 4). In the first
iteration of the first phase, we detect the most representative (key) point of the
dataset (p1 ), where V D is maximized. This key point will be assigned to the
first cluster and it will be discarded from the dataset (S).
For the selection of the next key points pk , k > 1, we have to taken into
account the already selected representative points. Therefore, the second key
point (pk , k = 2) is selected taking into account the first one. This point should
belong to a different cluster meaning that it should have low similarity with
p1 and vice versa. In order to satisfy this condition, we select the point with
V (i,p ) V (pk−1 ,i)
index i that minimizes the formula v = V D(pk−1 k−1 )
+ V D(k) . This is the sum of
percentages of votes that the point with index p1 receives from the point with
index i and vice versa.
We initialize a function F (i) = 0, i ∈ S. The next key points are selected
by repeating the same procedure using the function F . When a point (xpk ) is
selected as a key point, we add it to the appropriate cluster and we discard
it from the set S, where S denotes the domain of F . Finally, F is updated in
order to ensure that the next key points will have low similarity with the already
computed key points F (i) = max(F (i), v) as well as with xpk . The global minima
of F will give the next key points. The total computational cost of the proposed
CVR algorithm can be reduced from $O(N^2)$ to $O(N \cdot \log N + K \cdot N)$ when a
sparse matrix and R-tree-like data structure are used.
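The first phase can be summarised by the following sketch; the update order and the handling of already selected points follow a reading of the description above rather than a verbatim implementation:

```python
import numpy as np

def cvr_key_points(V, VD, K):
    """Greedy selection of the K representative (key) points of the CVR first phase.

    V : (N, N) voting matrix, VD : (N,) voting descriptor, K : requested number of clusters.
    """
    N = len(VD)
    keys = [int(np.argmax(VD))]                 # first key point: maximum voting descriptor
    F = np.zeros(N)
    F[keys[0]] = np.inf                         # a selected key point is never reselected
    for _ in range(1, K):
        p = keys[-1]
        v = V[:, p] / VD[p] + V[p, :] / VD      # mutual vote percentages with the last key point
        F = np.maximum(F, v)                    # similarity to the closest already-selected key point
        nxt = int(np.argmin(F))                 # next key point: globally least similar
        keys.append(nxt)
        F[nxt] = np.inf
    return keys
```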

3.2 Local Maximization of VM


This section presents an optional algorithm, inspired by the GSMS-T2 [7], that
possibly improves a given initial clustering based on the local maximization of
Voting Measure (VM). When we use as input the clustering of CVR method, the

resulting algorithm is called CVR-LMV. Let us assume that two nearby points
$x_i$, $x_j$, $i, j \in \{1, ..., N\}$, are misclassified by CVR into the same cluster,
so that $V(i, j) \ge V(i, k)$, $\forall k \in \{1, ..., N\}$ and $V(j, i) \ge V(j, k)$, $\forall k \in \{1, ..., N\}$.
Under this assumption, it is possible that if we separately check whether to reassign
point $i$ (or point $j$) to the true cluster, VM will be reduced, since point
$x_j$ (or point $x_i$) still belongs to a different cluster.
In order to solve this problem without increasing the computational cost of the
algorithm, we have introduced the median-based VM, $\widehat{VM}$, which estimates VM
based on the median value of the votes of the points, so that it is not affected by
nearby points:

$$\widehat{VM} = \frac{1}{K} \cdot \sum_{k=1}^{K} \sum_{i \in p_k} \operatorname{median}_{j \in p_k}\big(V(j, i)\big) \qquad (3)$$

Let $VM$ and $VM'$ denote the validity measure before and after the possible
reassignment, and let $\widehat{VM}$ and $\widehat{VM}'$ denote the median-based VM before
and after the possible reassignment. According to the proposed algorithm, we reassign
the point with index $i$ if $VM' > VM$, or if $\widehat{VM}' > \widehat{VM} \,\wedge\, (\widehat{VM}' - \widehat{VM}) + (VM' - VM) > 0$
is satisfied. The first condition ensures that $VM$ increases. If only the second
condition is true, this will cause an impermanent decrease of VM. Since the
increase of $\widehat{VM}$ is higher than the decrease of $VM$, the point with
index $i$ is closer to the examined cluster and we have to perform the reassignment.
In the next steps, we will also reassign the neighbours of the point with index $i$,
and $VM$ will increase.
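A sketch of the reassignment test follows; median_voting_measure is an illustrative reading of Eq. (3), and voting_measure (Eq. (2)) is passed in as a function (for example, the one sketched in Section 2):

```python
import numpy as np

def median_voting_measure(V, labels):
    """Median-based VM of Eq. (3) (sketch)."""
    clusters = np.unique(labels)
    total = 0.0
    for k in clusters:
        idx = np.where(labels == k)[0]
        total += np.median(V[np.ix_(idx, idx)], axis=0).sum()  # median over j of V(j, i), summed over i
    return total / len(clusters)

def try_reassign(i, new_cluster, labels, V, voting_measure):
    """Accept the reassignment of point i if VM increases, or if the median-based gain dominates."""
    vm0, vmh0 = voting_measure(V, labels), median_voting_measure(V, labels)
    candidate = labels.copy()
    candidate[i] = new_cluster
    vm1, vmh1 = voting_measure(V, candidate), median_voting_measure(V, candidate)
    if vm1 > vm0 or (vmh1 > vmh0 and (vmh1 - vmh0) + (vm1 - vm0) > 0):
        return candidate                # keep the reassignment
    return labels                       # revert
```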
Figs. 1(b) and 1(c) illustrate two different clustering results of the same
dataset using the CVR-LMV and K-means clustering, respectively. The SSE
of clustering depicted in Fig. 1(c) is 13.69% lower than the SSE of clustering de-
picted in Fig. 1(b). However the optimal solution of clustering is clearly depicted
in Fig. 1(b).

4 Experimental Results

In this section, the experimental results of our performance study are presented.
We have tested our methods (CVR and CVR-LMV) using the following six
real datasets [8], where the number of records, the number of clusters, the data
dimension, the cluster sizes and cluster densities are varied:

– the Iris (150 records in 4-dimensional space, K = 3).


– the Yeast (1484 records in 8-dimensional space, K = 10).
– the Segmentation (2100 records in 19-dimensional space, K = 7),
– the Wisconsin breast cancer (683 records in 30-dimensional space, K = 2).
– the Wine (178 records in 13-dimensional space, K = 3).
– the first 104 records of covtype (covtype10k) in 54-dimensional space, K = 7.

We have tested the proposed methods with 144 synthetic datasets generated by
c random cluster centroids that are uniformly distributed over the d-dimensional

hypercube (c ∈ {4, 8, 16}, d ∈ {4, 8}). The number of points ni in cluster i is ran-
domly selected from a uniform distribution between min n and max n (min n ∈
{16, 128}, max n − min n ∈ {0, 128}). The ni points in cluster i are randomly
selected around the cluster centroid from a d-dimensional multivariate Gaussian
distribution with covariance matrix Σi = σi2 Id and mean value equal to the
cluster centroid, where σi is randomly selected from a uniform distribution be-
tween min σ and max σ, (min σ ∈ {0.04, 0.08, 0.16}), (max σ −min σ ∈ {0, 0.08}).
The parameters c and min σ receive three different values and the rest of the
parameters receive two different values, yielding $3^2 \cdot 2^4 = 144$ datasets.
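One configuration of the synthetic data can be generated along the following lines (a sketch under the stated distributional assumptions; details such as the hypercube range are assumptions):

```python
import numpy as np

def synthetic_dataset(c, d, min_n, max_n, min_sigma, max_sigma, rng=None):
    """Generate one synthetic dataset: c isotropic Gaussian clusters in the d-dimensional hypercube."""
    rng = np.random.default_rng() if rng is None else rng
    centroids = rng.uniform(0.0, 1.0, size=(c, d))    # cluster centroids uniform over the hypercube
    points, labels = [], []
    for i, mu in enumerate(centroids):
        n_i = rng.integers(min_n, max_n + 1)           # cluster size in [min_n, max_n]
        sigma_i = rng.uniform(min_sigma, max_sigma)    # cluster spread in [min_sigma, max_sigma]
        points.append(rng.normal(mu, sigma_i, size=(n_i, d)))
        labels.append(np.full(n_i, i))
    return np.vstack(points), np.concatenate(labels)
```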
In order to evaluate the accuracy of the proposed scheme, we have compared
the proposed methods with seven other clustering methods: the K-means, the
K-means KKZ algorithms [4], the hierarchical agglomerative algorithm based
on the linkage metric of average link (HAC-AV) [9], spectral clustering using
Nystrom method without orthogonalization (SCN) and with orthogonalization
(SCN-O) [10], the K-means++ method [3] and the PAM algorithm [5]. For the
non deterministic algorithms, 20 trials have been performed under any given
dataset, getting the average value of the used performance metrics. We evaluate
the performance using the clustering accuracy (Acc) [10]. Acc ∈ [0, 1] is defined
as the percentage of the correctly classified points.

Table 1. The accuracy (first 6 lines) and the average Acc (last line) of several clustering
algorithms in 6 real and 144 synthetic datasets (144 S.D.), respectively
Dataset CVR-LMV CVR K-means K-means KKZ HAC-AV SCN SCN-O PAM K-means++
Iris 93.33% 81.33% 84.20% 89.33% 90.67% 89.10% 88.87% 77.43% 85.77%
Yeast 39.22% 42.39% 36.04% 37.80% 32.35% 37.54% 37.10% 32.37% 35.15%
Segmentation 52.14% 37.43% 51.87% 35.62% 14.62% 47.35% 46.55% 52.45% 50.86%
Wisconsin 91.04% 90.51% 85.41% 85.41% 66.26% 73.15% 85.14% 84.97% 85.41%
Wine 71.35% 71.35% 68.20% 56.74% 61.24% 66.04% 60.17% 67.44% 65.65%
covtype10k 37.10% 38.18% 36.41% 35.95% 35.63% 36.20% 36.49% 35.90% 36.97%
144 S.D. 98.71% 97.85% 79.51% 97.51% 97.01% 94.04% 97.08% 78.61% 86.21%

Table 1 depicts the clustering accuracy measure of CVR-LMV, CVR, K-means,
K-means KKZ, HAC-AV, SCN, SCN-O, PAM and K-means++ algorithms in the
real datasets (first six lines of the table) and the average clustering accuracy
measure over the 144 synthetic datasets (144 S.D.) (last line of the table).
According to these results, the proposed methods CVR-LMV and CVR yield the
highest performance, outperforming the other methods from the literature in five
out of six real datasets. The best results are achieved by CVR-LMV, which almost
always gives the highest or the second highest accuracy. In the experiments on
synthetic datasets, CVR-LMV again yields the highest performance, outperforming
the other algorithms, with CVR being the second best method. High performance
results are also obtained by the K-means KKZ, HAC-AV and SCN-O methods.
The probability that CVR-LMV reduces the clustering performance increases when
CVR fails to find the true classes; in this case, CVR-LMV may either reduce or
increase the clustering performance.

Fig. 2. Selected key frames of tennis ((a),(b),(c)) and foreman ((e),(f)) videos

Fig. 3. Selected key frames of hall monitor video

The proposed method can be used in several clustering-based applications
like video summarization using key frames [6], where the goal is to select
a subset of a video sequence (key frames) that can represent the video visual
content. Similarly to [6], we have used the Color Layout Descriptor (CLD) which
suffices to describe smoothly the changes in visual content. Then we apply the
CVR algorithm using as input the CLD vectors and the desired number of key
points K. The key points of the first phase of the CVR algorithm can be con-
sidered as the selected key frames, since they have the property to cover the
video content space belonging to different clusters according to the CVR algo-
rithm. An advantage of the proposed method is that ordering of the resulting
key frames corresponds to their significance. Moreover, the proposed method
does not assume that the video file has been segmented into shots as most of the
key frame extraction algorithms done. The proposed method has been tested in
several indoor and outdoor real life video sequences that have been used in [6]
describing well the video content. Hereafter, we present the results of the pro-
posed method on tennis, foreman and hall monitor videos1 (see Figs. 2, 3) using
three, two and five key frames, respectively. Under any case, it holds that the
selected key frames are close to the humans’ perception: In the tennis video, the
first two selected key frames (#274, #120) belong to the two different shots of
the video and the third one (#20) belongs on the first shot that has substantial
visual content changes. In the foreman video, the selected key frames belong on
the start and end of the sequence, describing well the two characteristics phases
of the sequence (the interview and the buildings). In the hall monitor video the
five selected key frames correspond to the five different “scenes” of the video
(empty hall, a human with a bag in hall and so on).

5 Conclusions
In this paper, we propose a deterministic point clustering method that
can also be used for the video summarization problem. According to the proposed
1 http://media.xiph.org/video/derf/

framework, the problem of clustering is reduced to the maximization of the sum
of votes between the points of the same cluster. In addition, we have proposed
the LMV algorithm that possibly improves a given initial clustering based on
the local maximization of the proposed robust voting measure (VM). The proposed
method can yield high performance results on clusters of different densities
and/or sizes, outperforming other methods from the literature. In addition, the
selected key frames describe well the visual content of the videos.

Acknowledgments. This research has been partially co-financed by the European
Union (European Social Fund - ESF) and Greek national funds through the
Operational Program “Education and Lifelong Learning” of the National Strategic
Reference Framework (NSRF) - Research Funding Program ARCHIMEDE
III-TEI-Crete-P2PCOORD.

References
1. Gupta, U., Ranganathan, N.: A game theoretic approach for simultaneous com-
paction and equipartitioning of spatial data sets. IEEE Transactions on Knowledge
and Data Engineering 22, 465–478 (2010)
2. Jain, A.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31,
651–666 (2010)
3. Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 1027–1035 (2007)
4. Katsavounidis, I., Kuo, C.C.J., Zhang, Z.: A new initialization technique for gen-
eralized lloyd iteration. IEEE Signal Processing Letters 1, 144–146 (1994)
5. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Elsevier (2006)
6. Panagiotakis, C., Doulamis, A., Tziritas, G.: Equivalent key frames selection based
on iso-content principles. IEEE Transactions on Circuits and Systems for Video
Technology 19, 447–451 (2009)
7. Panagiotakis, C., Tziritas, G.: Successive group selection for microaggregation.
IEEE Trans. on Knowledge and Data Engineering 99 (accepted, 2011)
8. Blake, C., Keough, E., Merz, C.J.: UCI Repository of Machine Learning Database
(1998), http://www.ics.uci.edu/~mlearn/MLrepository.html
9. Day, W., Edelsbrunner, H.: Efficient algorithms for agglomerative hierarchical clus-
tering methods. Journal of Classification 1, 7–24 (1984)
10. Chen, W.Y., Song, Y., Bai, H., Lin, C.J., Chang, E.Y.: Parallel spectral cluster-
ing in distributed systems. IEEE Transactions on Pattern Analysis and Machine
Intelligence 33, 568–586 (2011)
Motor Pump Fault Diagnosis with Feature
Selection and Levenberg-Marquardt Trained
Feedforward Neural Network

Thomas W. Rauber and Flávio M. Varejão

Departamento de Informática, Centro Tecnológico


Universidade Federal do Espı́rito Santo
29060-970 Vitória, Brazil
{thomas,fvarejao}@inf.ufes.br

Abstract. We present a system for automatic model-free fault detection


based on a feature set from vibrational patterns. The complexity of the
feature model is reduced by feature selection. We use a wrapper approach
for the selection criteria, incorporating the training of an artificial neural
network into the selection process. For fast convergence we train with
the Levenberg-Marquardt algorithm. Experiments are presented for eight
different fault classes.

Keywords: Fault diagnosis, feature selection, feedforward neural net-


work, Levenberg-Marquardt.

The diagnosis of faults in expensive industrial equipment under production con-


ditions is a valuable tool for improving the economic and security quality of its
operation [1]. In contrast to model-based diagnosis [2], where an analytical
model of the machine has to be provided, we use a model-free approach based
on supervised learning from labeled data [3]. The sequence of processing
steps to obtain a feature vector is raw data acquisition from several sensors,
pre-processing on the level of the signal (filtering) and feature extraction. An
additional information reduction step, which we consider fundamental for an
optimized performance of the fault diagnosis system, is feature selection.
Since we are dealing with rotating machinery, the use of vibration signals
is indicated: accelerometers positioned at appropriate locations on the motor and
the pump deliver time domain signals that can be converted into displacement,
velocity and acceleration signals in the frequency domain [4]. For bearing
faults, the envelope detection (or amplitude demodulation) [5] is an indicated
tool that will be employed here to extract features from the raw signals. In [6],
an overview of fault diagnosis of rotating machinery by empirical mode decom-
position is given. Raw features are extracted from the vibrational signals, which
are subjected to feature selection [7] and subsequently classified by a feedforward
net trained by the Levenberg-Marquardt weight optimization method [8,9]. The
equipment we are considering are motor pumps operating on offshore oil rigs.
We work with 2000 examples of vibration signals obtained from operating par-
tially faulty motor pumps, installed on 25 oil platforms off the Brazilian coast.


The signals were obtained during a period of five years. To generate the classified
training data, experts in maintenance engineering provided a label for every fault
present in each acquired example. Since several faults can simultaneously occur
in a machine, we construct an independent classification task for each type of
fault (one against all). Naturally the labelling process has been done by different
persons and therefore the ground truth of the class membership is subject to
model errors introduced a priori, i.e. it cannot be excluded that the provided
label is erroneous in some cases.

1 Condition Monitoring of Oil Rig Motor Pumps


In real-world processes the availability of an analytical model is often unrealis-
tic or inaccurate due to the complexity of the process. In this case model-free
techniques are an alternative approach. This paper is concerned with model-free
diagnosis of multiple faults in motor pumps, relying on supervised learning based
techniques.

1.1 Motor Pump Equipment


Rotating machinery covers a wide range of mechanical equipment and plays an
important role in industrial applications. In this work we focus on a specific
rotating machine model, namely the horizontal motor pump with extended cou-
pling between the electric motor and the pump. Accelerometers are placed at
strategic positions along the main directions to capture specific vibrations of the
main shaft which provides a multichannel time domain raw signal.

1.2 Considered Fault Categories


We build a predictor for individually detecting each of the following eight fault
categories in an input pattern: shaft misalignment, pump blade unbalance, me-
chanical looseness of the pump, mechanical looseness of the motor, structural
looseness of the pump, structural looseness of the motor, hydrodynamic fault
(due to blade pass and vane pass, cavitation or flow turbulence) and resonance.
Fig. 1 illustrates how the frequency spectrum is associated with two of the
faults, presenting the Fourier spectrum of the vibration signal (measured at the
pump immediately behind the coupling and in horizontal direction) of a faulty
motor pump with a misalignment fault and also an emerging hydrodynamic
fault.
Each example in the database of 2000 machine signal acquisitions presented
the occurrence of at least one fault. Examples presenting the occurrence of mul-
tiple faults are more common than examples in which just one fault is occurring.
For each of the faults we built a one-against-the-rest classifier, i.e. a specialist
for a certain kind of problem.

[Fig. 1 plot: velocity spectrum, velocity (mm/s) on the y-axis versus frequency (Hz) on the x-axis.]

Fig. 1. Misalignment fault and its manifestation in the frequency spectrum at the first
three harmonics of the shaft rotation frequency. The high energy in the fifth harmonic,
as well as the noise in low frequencies indicate that additionally a hydrodynamic fault
is emerging.

1.3 Extracted Features


Our general strategy is based on providing as much information as possible
in the initial feature extraction stage. The available preprocessed signals are
the frequency and the envelope spectrum of machine vibration signals. Hence
we work with well-established signal processing techniques, namely the Fourier
transform, envelope analysis based on the Hilbert transform [10] and median
filtering. In this context, the extracted features correspond to the vibrational
energy of predetermined frequency bands of the spectrum. We initially extract
the same feature categories for building the predictor of every considered fault.
Specifically, the initially extracted feature set is composed of a total of D = 81
features, with 68 of them from the Fourier spectrum and 13 of them from the
Envelope spectrum.

2 Feature Selection
The main idea of feature selection is to obtain data that has a reduced di-
mensionality and more relevant information that can increase the classification
performance. Feature selection is basically composed of a search algorithm and
a selection criterion [11,12,7,13]. Another important aspect of feature selection
is the explanation of the importance of each feature for the classification process.
Previous work has investigated the problem of feature selection in the context
of fault detection of rotating machines. In [14], features were ranked by their
appraisal, using the sensitivity [3] during the training of a feedforward net with
one hidden layer. Feature selection based on an individual threshold ranking in
the context of tool condition monitoring can be found in [15]. A heuristic based
on binary ant colony is used for feature selection in the context of a rotary kiln
in [16]. Mutual information is the selection criterion proposed in [17], which also
gives a considerable overview of feature extraction and selection methods; also cf.

[18] where the minimum redundancy maximum relevancy (mRMR) criterion is


proposed. The C4.5 decision tree machine learning algorithm presented in [19]
implicitly performs feature selection by positioning those features with highest
mutual information at the treetop.
We use a Sequential Forward Selection (SFS) [3] search algorithm which con-
stitutes a good compromise between search space coverage and speed. As the
selection criterion we use a wrapper approach with the estimated accuracy of a
Levenberg-Marquardt trained feedforward net. For an application in human gait
recognition that uses a multidimensional expansion of the mutual information
theory and SFS, cf. for instance [20]. That work is an example of a filter approach,
which does not use the final performance criterion as the selection criterion.
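A minimal sketch of the SFS wrapper loop is given below; evaluate(subset) stands for the wrapper criterion, i.e. the estimated accuracy of a classifier (here the LM-trained feedforward net) trained on the given feature subset, and is a placeholder rather than part of the original system.

import numpy as np

def sequential_forward_selection(n_features, evaluate, n_select):
    # evaluate(list_of_feature_indices) -> estimated accuracy (wrapper criterion).
    selected, scores = [], []
    remaining = set(range(n_features))
    for _ in range(n_select):
        best_f, best_score = None, -np.inf
        for f in remaining:
            score = evaluate(selected + [f])
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
        remaining.remove(best_f)
        scores.append(best_score)
    return selected, scores  # feature indices in order of inclusion, with their scores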

3 Feedforward Net and Weight Optimization

We follow the terminology of [8,9], to define the network calculus and the weight
optimization. We expect the reader to be familiar with the basic concepts of a
feedforward net and the principle of gradient descent.

3.1 Architecture
Consider as input to the net an R-dimensional pattern p from the Euclidean vector
space R^R. The net input with weights w_{i,j}^{m+1} from the j-th unit in layer m to the
i-th unit in layer (m+1) and the biases b_i^{m+1} is

    n_i^{m+1} = \sum_{j=1}^{S^m} w_{i,j}^{m+1} a_j^m + b_i^{m+1}.    (1)

Passing through an activation function f, the output of unit i becomes

    a_i^{m+1} = f^{m+1}(n_i^{m+1}).    (2)

The network has M layers and the output of layer m in matrix form can be
written as

    a^0 = p,   a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}),   m = 0, 1, ..., M−1,   a^M = y,    (3)

where y is the final output of the net. The S^{m+1} × S^m matrix W^{m+1} contains
the weights of layer (m+1). The activation function f : R → R is usually the
logistic sigmoid function f(n) = 1/(1 + exp(−n)), the hyperbolic tangent sigmoid
function f(n) = tanh n, or the identity function f(n) = n. We consider a network
with only one hidden layer, i.e. M = 2 and m = 0, 1, since empirically additional
layers do not increase discriminative power.
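A minimal NumPy sketch of the forward pass defined by Eqs. (1)-(3) for such a one-hidden-layer network is shown below; the logistic sigmoid in both layers and the layer sizes in the example are illustrative choices, not necessarily those used in the experiments.

import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def forward(p, W1, b1, W2, b2):
    # a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), with a^0 = p and a^M = y (M = 2).
    a1 = sigmoid(W1 @ p + b1)     # hidden layer with S^1 units
    return sigmoid(W2 @ a1 + b2)  # output layer with S^2 units

# Example with R = 4 inputs, S^1 = 9 hidden units and S^2 = 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((9, 4)), np.zeros(9)
W2, b2 = rng.standard_normal((1, 9)), np.zeros(1)
y = forward(rng.standard_normal(4), W1, b1, W2, b2)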

3.2 Levenberg-Marquardt Training


For data sets with a reasonable number of patterns, this network training method
is currently considered as state of the art due to its fast convergence, which
constitutes a strong motivation to use this technique for our purpose. The
parameter vector

    x = [ w_{1,1}^1  w_{1,2}^1  ...  w_{S^1,R}^1  b_1^1  ...  b_{S^1}^1  w_{1,1}^2  ...  b_{S^M}^M ]

of dimension n = \sum_{m=1}^{M} S^m (S^{m−1} + 1) is assembled from all weights and biases
and then subjected to the Levenberg-Marquardt (LM) optimization, which is a modified
second-order approximate Gauss-Newton method. A control parameter μ determines
how much steepest descent or how much Gauss-Newton the LM algorithm becomes.
For each of the q = 1, ..., Q patterns and each of the S^M units of the final layer,
each component of the individual error e_q = t_q − a_q^M with targets t_q and final
output vector a_q^M is

    e_{k,q},   k = 1, ..., S^M,   q = 1, ..., Q.    (4)

There are N = Q · S^M such errors. Their gradient with respect to each of the
weights and biases

    ∂e_{k,q}/∂w_{i,j}^m,   ∂e_{k,q}/∂b_i^m,   k = 1, ..., S^M,   q = 1, ..., Q,
    i = 0, ..., S^m,   j = 1, ..., S^{m+1},   m = 0, ..., M−1    (5)

is calculated and introduced into the N × n Jacobian matrix J(x).
The sensitivity [9] of the i-th unit of the m-th layer, s_i^m, of the conventional
backpropagation is replaced by the definition of the Marquardt sensitivity of unit
i in layer m

    s̃_{i,h}^m ≡ ∂e_{k,q}/∂n_{i,q}^m,    (6)

where the index h over all N pattern-output unit pairs is calculated as
h = (q − 1) S^M + k, q = 1, ..., Q, k = 1, ..., S^M. The elements of the Jacobian can
now be obtained as

    [J]_{h,ℓ} = ∂e_{k,q}/∂w_{i,j}^m = (∂e_{k,q}/∂n_{i,q}^m) · (∂n_{i,q}^m/∂w_{i,j}^m) = s̃_{i,h}^m a_{j,q}^{m−1},   [J]_{h,ℓ} = ∂e_{k,q}/∂b_i^m = s̃_{i,h}^m.    (7)

The index ℓ over all network connections from layer m to layer (m+1) is calculated
as ℓ = j · S^m + i, i = 1, ..., S^{m+1}, j = 0, ..., S^m, m = 0, ..., M−1.
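The following is a hedged sketch of a single LM parameter update built from the Jacobian and error vector described above; the factor-of-ten schedule for the control parameter μ is a common heuristic and not necessarily the exact rule used by the authors.

import numpy as np

def lm_step(x, jacobian_fn, error_fn, mu):
    # One Levenberg-Marquardt update: dx = -(J^T J + mu*I)^{-1} J^T e,
    # where jacobian_fn(x) returns the N x n Jacobian and error_fn(x) the N errors.
    J, e = jacobian_fn(x), error_fn(x)
    A = J.T @ J + mu * np.eye(J.shape[1])
    x_new = x + np.linalg.solve(A, -J.T @ e)
    # Accept the step only if the sum of squared errors decreases; otherwise
    # increase mu, which pushes the update towards steepest descent.
    if np.sum(error_fn(x_new) ** 2) < np.sum(e ** 2):
        return x_new, mu / 10.0
    return x, mu * 10.0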

4 Experimental Results
The main objective is to illustrate the advantages of feature selection. With only
a fraction of the original feature set it should be possible to obtain equivalent or
better performance, compared to the complete feature set. As mentioned before,
we use a combination of performance estimation and selection by taking the
performance score as the proper criterion (wrapper).

[Fig. 2 panels, each plotting estimated accuracy (y-axis) against the number of SFS accumulated features, 1-10 (x-axis): Misalignment [57.57%], Unbalance [72.95%], Mechanical pump looseness [85.36%], Mechanical motor looseness [67.82%], Structural pump looseness [88.81%], Structural motor looseness [85.06%], Hydrodynamic [51.57%], Resonance [91.86%].]

Fig. 2. Estimated classification accuracy [%] (y-axis) as a function of the cardinality


of the selected feature set (x-axis) during the Sequential Forward Selection process.
A blue dotted line shows the estimated accuracy when using all available 81 features.
Each data set shows a qualitatively similar performance evolution. After a relatively
small number of selected features the performance curve enters a saturation state with
degrading performance.

4.1 Experimental Setup


As the selection algorithm the Sequential Forward Selection described in sec-
tion 2 is used as a wrapper with the LM training of a feedforward net, cf.
section 3.2. The number of units in the hidden layer is set as twice the current
number of features plus one; for instance, when the actual number of features in
the SFS search is four, the number of neurons in the hidden layer is set to nine.
As the performance criterion we choose the estimated error rate. The data is
split randomly into 70% training data, 15% validation data and 15% test data.
Training is interrupted if six consecutive validations perform worse than the
corresponding training. The final score is taken as the mean of ten experiments.
From the total of 81 features we select ten features and plot the estimated
performance for each cardinality of selected features. For instance during the
SFS selection with the ’mechanical motor looseness’ data set the best score for
seven selected features was 87.96%. The results for all fault classes are plotted
in fig. 2. Together with the estimated error for each cardinality, the estimated
error for the total number of features is shown as a horizontal reference line.
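A sketch of this evaluation protocol, assuming a placeholder train_and_score function that trains the LM net with the early-stopping rule described above and returns the test accuracy, could look as follows.

import numpy as np

def estimate_accuracy(X, y, train_and_score, n_runs=10, rng=None):
    # Random 70/15/15 split into training, validation and test data;
    # the final score is the mean test accuracy over n_runs experiments.
    rng = np.random.default_rng() if rng is None else rng
    scores = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        n_tr, n_va = int(0.7 * len(X)), int(0.15 * len(X))
        tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
        # train_and_score is assumed to stop training after six consecutive
        # validations that perform worse than the corresponding training.
        scores.append(train_and_score(X, y, tr, va, te))
    return float(np.mean(scores))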

4.2 Discussion
It can be clearly observed from the evolution of the estimated error in fig. 2
that the feature selection is able to reduce the complexity of the subsequent
classification stage considerably. At the same time the performance is improved
since noise is filtered out. Except for the ’structural pump looseness’ fault, the
estimated performance with only a fraction of the features is higher compared
to the case when taking all available features for the training of the classifier.
This clearly justifies the use of this important step in pattern recognition, also
for this field of application.

5 Conclusion
We have presented a complete system for the diagnosis of faults in a real world
scenario of rotating machinery installed on offshore oil rigs. A feature pool of
frequency measurements is provided. From this pool, a subset is selected that
achieves better performance with less complexity. Future work will concentrate
on other sensors, feature models and performance estimation techniques.

Acknowledgments. This work was supported by Brazilian National Science


Foundation CNPq (project 552630/2011-0) and the State Science Foundation of
the State of Espı́rito Santo FAPES (project 48511579/2009).

References
1. Tavner, P.J., Ran, L., Penman, J., Sedding, H.: Condition Monitoring of Electrical
Machines. The Institution of Engineering and Technology, London (2008)

2. Isermann, R.: Fault-Diagnosis Systems: An Introduction from Fault Detection to


Fault Tolerance. Springer, Berlin (2006)
3. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Academic Press,
Inc., Orlando (2006)
4. Scheffer, C., Girdhar, P.: Practical Machinery Vibration Analysis and Predictive
Maintenance, 1st edn. Elsevier (2004)
5. McInerny, S.A., Dai, Y.: Basic vibration signal processing for bearing fault detec-
tion. IEEE Transactions on Education 46, 149–156 (2003)
6. Lei, Y., Lin, J., He, Z., Zuo, M.J.: A review on empirical mode decomposition
in fault diagnosis of rotating machinery. Mechanical Systems and Signal Process-
ing 35, 108–126 (2013)
7. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
8. Hagan, M., Menhaj, M.: Training feedforward networks with the marquardt algo-
rithm. IEEE Transactions on Neural Networks 5, 989–993 (1994)
9. Hagan, M., Demuth, H., Beale, M.: Neural Network Design. Vikas Publishing House
(2003)
10. Mendel, E., Rauber, T.W., Varejao, F.M.: Automatic bearing fault pattern recog-
nition using vibration signal analysis. In: Proc. of the IEEE Int. Symp. on Ind.
Electronics, ISIE 2008 (2008)
11. Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice/
Hall Int., London (1982)
12. Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern
classifiers. Pattern Recognition Letters 33, 25–41 (2000)
13. Liu, H., Motoda, H.: Computational Methods of Feature Selection (Chapman &
Hall/CRC Data Mining and Knowledge Discovery Series). Chapman & Hall/CRC
(2007)
14. Matsuura, T.: An application of neural network for selecting feature parameters in
machinery diagnosis. J. of Materials Processing Technol. 157–158, 203–207 (2004)
15. Jemielniak, K., Urbański, T., Kossakowska, J., Bombiński, S.: Tool condition mon-
itoring based on numerous signal features. The International Journal of Advanced
Manufacturing Technology 59, 73–81 (2012)
16. Kadri, O., Mouss, L.H., Mouss, M.D.: Fault diagnosis of rotary kiln using svm and
binary aco. Journal of Mechanical Science and Technology 26, 601–608 (2012)
17. Tang, J., Chai, T., Yu, W., Zhao, L.: Feature extraction and selection based on
vibration spectrum with application to estimating the load parameters of ball mill
in grinding process. Control Engineering Practice 20, 991–1004 (2012); 4th Sym-
posium on Advanced Control of Industrial Processes (ADCONIP)
18. Wang, J., Liu, S., Gao, R.X., Yan, R.: Current envelope analysis for defect identi-
fication and diagnosis in induction motors. Journal of Manufacturing Systems 31,
380–387 (2012); Selected Papers of 40th North American Manufacturing Research
Conference
19. Amarnath, M., Sugumaran, V., Kumar, H.: Exploiting sound signals for fault di-
agnosis of bearings using decision tree. Measurement 46, 1250–1256 (2013)
20. Guo, B., Nixon, M.: Gait feature subset selection by mutual information. IEEE
Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 39,
36–46 (2009)
Unobtrusive Fall Detection at Home
Using Kinect Sensor

Michal Kepski2 and Bogdan Kwolek1


1
AGH University of Science and Technology, 30 Mickiewicza Av.,
30-059 Krakow, Poland
[email protected]
2
University of Rzeszow, 16c Rejtana Av., 35-959 Rzeszów, Poland
[email protected]

Abstract. The existing CCD-camera based systems for fall detection


require time for installation and camera calibration. They do not pre-
serve the privacy adequately and are unable to operate in low lighting
conditions. In this paper we show how to achieve automatic fall detection
using only depth images. The point cloud corresponding to floor is delin-
eated automatically using v-disparity images and Hough transform. The
ground plane is extracted by the RANSAC algorithm. The detection of
the person takes place on the basis of the updated on-line depth reference
images. Fall detection is achieved using a classifier trained on features
representing the extracted person both in depth images and in point
clouds. All fall events were recognized correctly on an image set consist-
ing of 312 images of which 110 contained the human falls. The images
were acquired by two Kinect sensors placed at two different locations.

Keywords: Depth image and point cloud processing, fall detection.

1 Introduction
In almost all countries of the world the elderly population is continuously in-
creasing. Improving the quality of life of increasingly elderly population is one
of the most central challenges facing our society today. As humans become old,
their bodies weaken and the risk of accidental falls rises noticeably [12]. A fall
can lead to severe injuries such as broken bones, and a fallen person might need
assistance at getting up again. Falls lead to losing self-confidence, a loss of in-
dependence and a higher risk of morbidity and mortality. Thus, in recent years
a lot of research has been devoted to development of unobtrusive fall detection
methods [15]. However, despite many efforts undertaken to achieve reliable and
unobtrusive fall detection [16], the existing technology does not meet the seniors’
needs [18]. The main reason is that it does not preserve the privacy and unob-
trusiveness adequately. In particular, the current solutions generate too many
false alarms, which in turn lead to considerable frustration of the seniors.
Most of the currently available techniques for fall detection are based on
body-worn or built-in devices. They typically employ accelerometers or both ac-
celerometers and gyroscopes [16]. However, on the basis of such sensors it is not


easy to separate real falls from fall-like activities [2]. They typically trigger a
significant number of false alarms. Moreover, the detectors, which are typically worn
on a belt around the hip, are obtrusive and uncomfortable during sleep [7]. What is
more, their monitoring performance in critical phases like getting up from the bed
or the chair is relatively poor.
In recent years, a lot of research has been done on detecting falls using a wide
range of sensor types [16][18], including pressure pads [17], single CCD camera
[1], multiple cameras [6], specialized omni-directional ones [14] and stereo-pair
cameras [8]. Video cameras have several advantages over other sensors, including
the capability of recognizing a variety of activities. An additional benefit is
low intrusiveness and the possibility of a remote verification of fall events. However,
the solutions that are available at present require time for installation, camera
calibration and in general they are not cheap. Additionally, the lack of 3D in-
formation can lead to lots of false alarms. Moreover, in vast majority of such
systems the privacy is not preserved adequately.
Recently, the Kinect sensor was employed in fall detection systems [9][10][13].
It is the world’s first low-cost device that combines an RGB camera and a depth
sensor. Unlike 2D cameras, it allows tracking the body movements in 3D. Thus,
if only depth images are used it preserves the privacy. Since it is equipped with
an active light source it is independent of external light conditions. Owing to
using the infrared light it is capable of extracting depth images in dark rooms.
In this work we demonstrate an approach to fall detection using only depth
images. The person is detected on the basis of the depth reference image. We
demonstrate a method for updating the depth reference image with a low compu-
tational cost. The ground plane is extracted automatically using the v-disparity
images, Hough transform and the RANSAC algorithm. Fall detection is achieved
using a classifier trained on features representing the extracted person both in
depth images and in point clouds.

2 Person Detection in Depth Images

Depth is very useful cue to achieve reliable person detection because humans
may not have consistent color and texture but have to occupy an integrated
region in space. The depth images were acquired by the Kinect sensor using
OpenNI (Open Natural Interaction) library. The sensor has an infrared laser-
based IR emitter, an infrared camera and a RGB camera. The IR camera and
the IR projector form a stereo pair with a baseline of approximately 75 mm.
Kinect depth measurement is based on structured light, making a triangulation
between the dot pattern emitted and the pattern captured by the IR CMOS
sensor. The pixels in the depth images indicate calibrated depth in the scene.
Kinect’s angular field of view is 57◦ horizontally and 43◦ vertically. The sensor
has a practical ranging limit of about 0.6-5 m. It captures depth and color images
simultaneously at a frame rate of about 30 fps. The default RGB video stream
has size 640 × 480 and 8-bit for each channel. The depth stream is 640 × 480
resolution and with 11-bit depth, which provides 2048 levels of sensitivity.

Due to occlusions it is not easy to detect a person using only a single camera
and depth images. The software called NITE from PrimeSense offers skeleton
tracking on the basis of images acquired by the Kinect sensor. However, this
software is targeted for supporting the human-computer interaction, and not for
detecting the person fall. Thus, in many circumstances it can have difficulties in
extracting and tracking the person’s skeleton [10].
The person was detected on the basis of a scene reference image, which was
extracted in advance and then updated on-line. In the depth reference image each
pixel assumes the median value of several pixel values from the past images. In
the set-up stage we collect a number of the depth images, and for each pixel we
assemble a list of the pixel values from the former images, which is then sorted in
order to extract the median. Given the sorted lists of pixels the depth reference
image can be updated quickly by removing the oldest pixels and updating the
sorted lists with the pixels from the current depth image and then extracting
the median value. We found that for typical human motions, good results can be
obtained using 13 depth images. For the Kinect acquiring the images at 25 Hz
we take every fifteenth image.
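A minimal sketch of this sliding-median reference image and the person extraction by differencing is given below; it buffers 13 subsampled depth frames and, for clarity, recomputes the per-pixel median with NumPy instead of maintaining the explicitly sorted pixel lists described above. The difference threshold is an illustrative value.

import numpy as np
from collections import deque

class DepthReference:
    def __init__(self, buffer_size=13, subsample=15):
        self.frames = deque(maxlen=buffer_size)  # last N subsampled depth images
        self.subsample = subsample
        self.count = 0

    def update(self, depth):
        # Keep every 15th frame of the 25 Hz stream; the per-pixel median of the
        # buffered frames forms the scene reference image.
        if self.count % self.subsample == 0:
            self.frames.append(depth.astype(np.float32))
        self.count += 1

    def reference(self):
        return np.median(np.stack(self.frames), axis=0)

    def foreground_mask(self, depth, threshold=100.0):
        # Foreground extraction by differencing the current depth image from the
        # reference depth map and thresholding the difference.
        return np.abs(depth.astype(np.float32) - self.reference()) > threshold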
Figure 1 illustrates some example depth reference images, which were obtained
using the discussed technique. In the image #500 we can see an office with the
closed door, which was then opened to demonstrate how the algorithm updates
the reference image. In frames #650 and #800 we can see that the opened door
appears temporally in the binary image, and then it disappears in the frame
#1000. As we can observe, the updated reference image is clutter free and allows
us to extract the person’s silhouette in the depth images. In order to eliminate
small objects the depth connected components were extracted. Afterwards, small
artifacts were eliminated. Otherwise, the depth images can be cleaned using
morphological erosion. When the person does not move the reference image is
not updated.


Fig. 1. Person segmentation using depth reference image. RGB images (upper row),
depth (middle row) and binary images depicting the delineated person (bottom row).

In the detection mode the foreground objects are extracted through differenc-
ing the current image from such a reference depth map. Afterwards, the fore-
ground object is determined through extracting the largest connected component
in the thresholded difference map. Alternatively, the subject can be delineated
using a pre-trained person detector. However, having in mind the privacy, the
use of a person detector operating on depth images or point clouds leads to lower
detection ratio and a higher computational cost.

3 V-Disparity Based Ground Plane Extraction


In [11] a method based on v-disparity maps between two stereo images has been
proposed to achieve reliable obstacle detection. Given a depth map provided by
the Kinect sensor, the disparity d can be determined in the following manner:
    d = (b · f) / z    (1)
where z is the depth (in meters), b is the horizontal baseline between the cameras
(in meters), f is the (common) focal length of the cameras (in pixels). The IR
camera and the IR projector form a stereo pair with a baseline of approximately
b = 7.5 cm, whereas the focal length f is equal to 580 pixels.
Let H be a function of the disparities d such that H(d) = Id . The Id is the
v-disparity image and H accumulates the pixels with the same disparity from a
given line of the disparity image. Thus, in the v-disparity image each point in
the line i represents the number of points with the same disparity occurring in
the i-th line of the disparity image. Figure 2c illustrates the v-disparity image
that corresponds to the depth image depicted on Fig. 2b.
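A minimal sketch of this accumulation, assuming the baseline and focal length given above (Eq. 1) and an illustrative number of disparity bins, is:

import numpy as np

def v_disparity(depth_m, b=0.075, f=580.0, n_bins=256):
    # Convert depth (meters) to disparity d = b*f/z, then build a per-row
    # histogram: entry (i, d) counts pixels in row i that have disparity d.
    h, _ = depth_m.shape
    vmap = np.zeros((h, n_bins), dtype=np.int32)
    disparity = np.where(depth_m > 0, b * f / np.maximum(depth_m, 1e-6), 0.0)
    d_idx = np.clip(disparity.astype(int), 0, n_bins - 1)
    for row in range(h):
        vmap[row] = np.bincount(d_idx[row], minlength=n_bins)[:n_bins]
    return vmap  # invalid (zero) depth pixels fall into bin 0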


Fig. 2. V-disparity map calculated on depth images from Kinect: RGB image a), cor-
responding depth image b), v-disparity map c)

The line corresponding to the floor pixels in the v-disparity map was extracted
using the Hough transform. Assuming that the Kinect is placed at a height of about
1 m from the floor, the line representing the floor should begin in the disparities
ranging from 15 to 25 depending on the tilt angle of the sensor. On Fig. 3 we
can see some example lines extracted on the v-disparity images, which were
obtained on the basis of images acquired in typical rooms, like office, see Fig. 2c,
classroom, etc.

Fig. 3. Lines extracted by Hough transform on various v-disparity maps

The line corresponding to the floor was extracted using Hough transform
(HT) operating on v-disparity values and a predefined range of parameters. The
accumulator was incremented by v-disparity values, see Fig. 4a. It is worth not-
ing that ordinary HT operating on thresholded v-disparity images often gives
incorrect results, see Fig. 4b where the extremum is quite close to 0 deg.


Fig. 4. Accumulator of the Hough transform: operating on v-disparity values a), thresh-
olded v-disparity images b). The accumulator depicted on figure a) is divided by 100.

Given the extracted line in such a way, the pixels belonging to the floor ar-
eas were determined. Due to the measurement inaccuracies, pixels falling into
some disparity extent dt were also considered as belonging to the ground. As-
suming that dy is a disparity in the line y, which represents the pixels belong-
ing to the ground plane, we take into account the disparities from the range
d ∈ (dy − dt , dy + dt ) as a representation of the ground plane. Given the line
extracted by the Hough transform, the points on the v-disparity image with
the corresponding depth pixels were selected, and then transformed to the point
cloud [10]. After the transformation of the pixels representing the floor to the 3D
point cloud, the plane described by the equation ax + by + cz + d = 0 was recovered.
The parameters a, b, c and d were estimated using the RANSAC algorithm. The
distance to the ground plane from the 3D centroid of points cloud corresponding
to the segmented person was determined on the basis of the following equation:
    D = |a X_c + b Y_c + c Z_c + d| / √(a^2 + b^2 + c^2)    (2)
where Xc , Yc , Zc stand for the coordinates of the centroid.
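A hedged sketch of the RANSAC plane estimation on the floor points and of the point-to-plane distance of Eq. (2) follows; the iteration count and inlier threshold are illustrative values, not those of the original implementation.

import numpy as np

def ransac_plane(points, n_iter=200, inlier_thresh=0.02, rng=None):
    # points: (N, 3) candidate floor points; returns (a, b, c, d) of the plane
    # a*x + b*y + c*z + d = 0 supported by the largest number of inliers.
    rng = np.random.default_rng() if rng is None else rng
    best, best_inliers = None, -1
    for _ in range(n_iter):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p2 - p1, p3 - p1)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        a, b, c = normal / norm
        d = -np.dot(normal / norm, p1)
        dist = np.abs(points @ np.array([a, b, c]) + d)
        inliers = int(np.sum(dist < inlier_thresh))
        if inliers > best_inliers:
            best, best_inliers = (a, b, c, d), inliers
    return best

def distance_to_plane(centroid, plane):
    # Eq. (2): distance of the person's 3D centroid to the ground plane.
    a, b, c, d = plane
    xc, yc, zc = centroid
    return abs(a * xc + b * yc + c * zc + d) / np.sqrt(a**2 + b**2 + c**2)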

4 Experimental Results

A data-set consisting of normal activities like walking, sitting down, crouching


down and lying has been composed in order to train classifiers and to evaluate
the performance of the fall detection system. Thirty-five volunteers aged under
28 years participated in the preparation of the data-set. The image sequences
were recorded using two Kinect devices. The first Kinect was placed at a height
of about one meter to the floor, whereas the second one was placed at a ceiling
corner of the room. Figure 5 shows example depth images seen from two different
views.

Fig. 5. Person in depth images seen from two different views

In total 312 images representing typical human actions were selected and then
utilized to extract the following features:

– h/w - a ratio of width to height of the person bounding box, calculated in


the points cloud
– h/hmax - a ratio expressing the height of the person's bounding box in the
current frame to the height of the person
– dist - the distance of the person centroid to the floor, expressed in millimeters
– max(σx , σz ) - standard deviation from the centroid for the abscissa and the
depth, respectively.

Figure 6 depicts a scatterplot matrix for the employed attributes, in which a col-
lection of scatterplots is organized in a two-dimensional matrix simultaneously to
provide correlation information among the attributes. In a single scatterplot two
attributes are projected along the x-y axes of the Cartesian coordinates. As we
can observe, the overlaps in the attribute space are not too significant. We consid-
ered also another attributes, for instance, a filling ratio of the rectangles making
up the person bounding box. The worth of the features was evaluated on the basis
of the information gain [4], which measures the dependence between the feature
and the class label. In the evaluation we utilized the InfoGainAttributeEval
procedure from the Weka [5], which is a collection of machine learning algo-
rithms.
The classification accuracy was evaluated in 10-fold cross-validation using
Weka software. The falls were classified using KStar [3], AdaBoost, SVM, multi-
layer perceptron (MLP), Naı̈ve Bayes (NB) and k-NN classifiers. The KStar and

Fig. 6. Multivariate classification scatter plot

MLP classified all falls correctly, whereas the remaining algorithms incorrectly
classified 2 instances. The number of images with person fall was equal to 110.
The system was implemented in C/C++ and runs at 25 fps on 2.4 GHz I7
(4 cores, Hyper-Threading) notebook powered by Linux. The most computa-
tionally demanding operation is extraction of the depth reference image. For
images of size 640 × 480 the computation time needed for extraction of the
depth reference image is about 9 milliseconds. On the PandaBoard, which is a
low-power, low-cost single-board computer development platform, this operation
can be completed in 0.15 sec. We are planning to implement the whole system
on the PandaBoard.

5 Conclusions
In this work we demonstrated our approach to fall detection using Kinect. The
fall detection is done on the basis of the segmented person in the depth images.
The segmentation of the person takes place using an updated depth reference image
of the scene. For the person extracted in such a way, the corresponding point
cloud is then extracted. The ground plane is determined automatically using the
v-disparity images, Hough transform and the RANSAC algorithm. The fall is
detected using a classifier built on features extracted both from the depth images
as well as the points cloud corresponding to the extracted person. The system
achieves a high detection rate. On an image set consisting of 312 images, of which
110 contained human falls, all fall events were recognized correctly.

Acknowledgment. This work has been supported by the National Science


Centre (NCN) within the project N N516 483240.

References
1. Anderson, D., Keller, J., Skubic, M., Chen, X., He, Z.: Recognizing falls from sil-
houettes. In: Annual Int. Conf. of the Engineering in Medicine and Biology Society,
pp. 6388–6391 (2006)
2. Bourke, A., O’Brien, J., Lyons, G.: Evaluation of a threshold-based tri-axial ac-
celerometer fall detection algorithm. Gait & Posture 26(2), 194–199 (2007)
3. Cleary, J., Trigg, L.: An instance-based learner using an entropic distance measure.
In: Int. Conf. on Machine Learning, pp. 108–114 (1995)
4. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley (1992)
5. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and
Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
6. Cucchiara, R., Prati, A., Vezzani, R.: A multi-camera vision system for fall detec-
tion and alarm generation. Expert Systems 24(5), 334–345 (2007)
7. Degen, T., Jaeckel, H., Rufer, M., Wyss, S.: Speedy: A fall detector in a wrist
watch. In: Proc. of IEEE Int. Symp. on Wearable Computers, pp. 184–187 (2003)
8. Jansen, B., Deklerck, R.: Context aware inactivity recognition for visual fall detec-
tion. In: Proc. IEEE Pervasive Health Conference and Workshops, pp. 1–4 (2006)
9. Kepski, M., Kwolek, B., Austvoll, I.: Fuzzy inference-based reliable fall detection
using kinect and accelerometer. In: Rutkowski, L., Korytkowski, M., Scherer, R.,
Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012, Part I. LNCS,
vol. 7267, pp. 266–273. Springer, Heidelberg (2012)
10. Kepski, M., Kwolek, B.: Human fall detection using kinect sensor. In: Burduk, R.,
Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds.) CORES 2013.
AISC, vol. 226, pp. 743–752. Springer, Heidelberg (2013)
11. Labayrade, R., Aubert, D., Tarel, J.P.: Real time obstacle detection in stereovi-
sion on non flat road geometry through “v-disparity” representation. In: IEEE
Intelligent Vehicle Symposium, vol. 2, pp. 646–651 (2002)
12. Marshall, S.W., Runyan, C.W., Yang, J., Coyne-Beasley, T., Waller, A.E., Johnson,
R.M., Perkis, D.: Prevalence of selected risk and protective factors for falls in the
home. American J. of Preventive Medicine 8(1), 95–101 (2005)
13. Mastorakis, G., Makris, D.: Fall detection system using Kinect’s infrared sensor.
J. of Real-Time Image Processing, 1–12 (2012)
14. Miaou, S.G., Sung, P.H., Huang, C.Y.: A customized human fall detection system
using omni-camera images and personal information. Distributed Diagnosis and
Home Healthcare, 39–42 (2006)
15. Mubashir, M., Shao, L., Seed, L.: A survey on fall detection: Principles and ap-
proaches. Neurocomputing 100, 144–152 (2013), special issue: Behaviours in video
16. Noury, N., Fleury, A., Rumeau, P., Bourke, A., ÓLaighin, G., Rialle, V., Lundy,
J.: Fall detection - principles and methods. In: Annual Int. Conf. of the IEEE
Engineering in Medicine and Biology Society, pp. 1663–1666 (2007)
17. Tzeng, H.W., Chen, M.Y., Chen, J.Y.: Design of fall detection system with floor
pressure and infrared image. In: Int. Conf. on System Science and Engineering, pp.
131–135 (2010)
18. Yu, X.: Approaches and principles of fall detection for elderly and patient. In: 10th
Int. Conf. on e-Health Networking, Applications and Services, pp. 42–47 (2008)
“BAM!” Depth-Based Body Analysis in Critical Care

Manuel Martinez, Boris Schauerte, and Rainer Stiefelhagen

Institute for Anthropomatics, Karlsruhe Institute of Technology,


Adenauerring 2, 76131 Karlsruhe, Germany
{name.surname}@kit.edu

Abstract. We investigate computer vision methods to monitor Intensive Care


Units (ICU) and assist in sedation delivery and accident prevention. We pro-
pose the use of a Bed Aligned Map (BAM) to analyze the patient’s body. We
use a depth camera to localize the bed, estimate its surface and divide it into
10 cm × 10 cm cells. Here, the BAM represents the average cell height over the
mattress. This depth-based BAM is independent of illumination and bed position-
ing, improving the consistency between patients. This representation allow us to
develop metrics to estimate bed occupancy, body localization, body agitation and
sleeping position. Experiments with 23 subjects show an accuracy in 4-level ag-
itation tests of 88 % and 73 % in supine and fetal positions respectively, while
sleeping position was recognized with a 100 % accuracy in a 4-class test.

Keywords: depth camera, critical care, monitoring, agitation, sleeping position.

1 Introduction

Effective control of sedation in Intensive Care Units (ICUs) is performed in a closed-


loop to keep patients relaxed and mentally conscious [6]. Feedback is obtained from
medical equipment that monitors vital signs and notes from medical staff who regis-
ter the behavior of the patient. In contrast to vital signs, there is no objective method
to measure behavioral cues. Each hospital has a different methodology for behavior
monitoring making it difficult to translate the experiences from one hospital to another.
This results in wide disparities between different medical wards (reports of delirium
incidence range from 11 % to 80 % [14]).
Actigraphy, the measurement of physical activity, has been suggested as an objec-
tive indicator to be included in sedation scales [6]. Although actigraphy is extensively
used in sleep monitoring laboratories, the procedure to attach all the required sensors
to capture a meaningful actigraphic profile is costly. Additionally, some sensors are in-
trusive and therefore not used for critical care. Computer vision monitoring systems are
increasingly used [2, 5, 8, 13, 16] as they are easy to install, non-intrusive, and the versa-
tility of the sensor allows them to handle a wide variety of tasks [8, 11]. Agitation is the
most common actigraphic cue, however there is no “golden standard” to quantify it [4]
and most computer vision algorithms provide view-dependent custom measurements.
To the best of our knowledge, there is no system able to work completely unattended
in all lighting conditions (e.g., during the night). Current approaches require markers [2,
8], active management by medical staff [5, 16] and/or color [2, 8].


Fig. 1. Left: the Medical Recording Device developed within the VIPSAFE project monitors
the patient and the ICU environment. Right: The Bed Aligned Map (BAM) is a height based
representation aligned to the surface of the bed (best viewed in color).

We identified the three main problems for computer vision ICU monitoring:
Occlusion: As most of the body is occluded by a blanket, high-level approaches that
rely on the shape of the body, such as poselets [3] and bodypart detectors [12] are not
effective.
Lack of Datasets: Due to privacy concerns, there is no public dataset to train data
intensive models.
Night Monitoring: Night monitoring can be done under infrared illumination [10, 11],
but color information is lost.
Depth cameras have been used successfully to automatically estimate breathing rate
in clothing-occluded ICU patients [1,10]. Depth cameras allow us to overcome the night
monitoring problems as they are independent of the light conditions, and volumetric
information can be extracted even when the patient is covered by the bed clothing. Cap-
turing a meaningful depth field requires an active depth camera like Kinect, as stereo
cameras are unable to capture an accurate depth field due to the lack of texture in most
medical clothing.

Fig. 2. Bed localization even when a patient is sleeping on it (best viewed in color). From left to
right: Post-filtered tiles from a patient lying in fetal position with the outline of the estimated bed
position and his corresponding BAM. Same for a patient lying in supine position.

In this paper we go one step further and propose the Bed Aligned Map (BAM), a
robust representation model aligned to the bed surface. To this end we develop a novel
algorithm able to localize the bed even when somebody is sleeping on it.
Although some indicators (e.g., bed occupancy, body location with respect to the
bed) can be obtained directly from BAM, its main advantage is its capability to easily
combine multiple observations of several patients, simplifying the development of ma-
chine learning based classifiers. We prove this capability by training a sleeping position
classifier using data from only 23 subjects and achieving a 100% accuracy in a 4-class
test.

2 Experimental Setup

Within the framework of the VIPSAFE1 [11] project, we have developed a Medical
Recording Device (Fig. 1) with a large variety of sensors and cameras. This project
uses the depth camera (derived from Kinect) which provides a 640x480@30fps depth
map. We recorded 23 male and female subjects from different ethnicities and ages be-
tween 14 and 50; they were asked to perform a sequence of 45 actions divided into
5 scenarios. To capture a wider range of behaviors, subjects were given only minimal
guidance, relying on their own interpretation. To evaluate sleeping positions they were
asked to lie on their back, and then move to a lateral right position followed by lateral
left position. To evaluate agitation they were asked to be relaxed, then to show small dis-
tress, followed by increased distress and strong distress. This minimal guidance resulted
in strongly different interpretations of distress, which was our goal.

3 The Bed Aligned Map

Beds in critical care are wheeled, articulated and can be installed in a variety of configu-
rations. Commonly a wall with medical equipment is behind the head of the patient, but
having the wall along the side of the patient is not unheard of. Finally, in the most ver-
satile medical wards, most equipment is also attached to mobile stands around the bed
in order to accommodate the different requirements a patient may have. Therefore the
location of the bed must be determined to accurately select the Region of Interest (ROI).
In most studies the ROI is fixed or manually defined [1, 5, 16]. Kittipanya-Ngam [8]
suggests an automated algorithm which models the bed as a rigid rectangular surface,
using edges and the Hough transform for localization. However, the articulated beds used
in critical care are divided into several segments which can be adjusted at different
inclinations to better suit the needs of the patient (Fig. 1). The baseline approach used in
VIPSAFE [11] used region growing in the depth field to find a low curvature area, this
approach was successful on articulated beds, but required the bed to be empty.
We present here an approach that improves our previous work by enabling the detec-
tion of non-empty articulated beds. The algorithm performs the following steps:
1
VIPSAFE: Visual Monitoring for Improving Patient Safety
https://cvhci.anthropomatik.kit.edu/project/vipsafe

Prefiltering: Non-smooth pixels in the disparity image are discarded: pixels are consid-
ered smooth if the difference in disparity between itself and its neighbors is at most 1.
This removes noisy pixels and pixels adjacent to edges.
Tile Splitting: Pixels are grouped in tiles of 16×16; the center and normal vector of
each tile is estimated. The size is chosen to be small enough to offer good spatial reso-
lution, but large enough to determine the direction of the normal vector with precision.
Tile Filtering: Tiles below the minimum height of the bed are discarded. Blocks tilted
more than 45 degrees with respect to the ground are discarded (usually walls and medical
equipment). At this point most remaining blocks belong to the bed.
2D Estimation: Remaining tiles are projected to the ground plane. In the ground plane
we fit the smallest 2D bounding box containing 95 % of the remaining tiles. The long
side of the bounding box can have a varying size due to the bed articulation, but the
shortest side is assumed to be fixed. If it is not close to the measured width of the bed,
the estimation is discarded (Fig. 2).
3D Estimation: To compensate for the articulation of the bed, its height profile is es-
timated along the long side of the bounding box using the convex hull of the detected
points. We assume that the bed is wide enough to not be covered entirely by the pa-
tient. Thus each horizontal cut of the bed will provide at least one measure showing its
mattress height.
Normalization: The estimated 2D surface of the bed is divided into sections of 10 × 10 cm
and the average height above the mattress is calculated for each section (Fig. 2). Sec-
tions without height estimate (rare) are interpolated from the neighbors.
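A minimal sketch of the normalization step is given below. It assumes the bed surface has already been localized, so that each valid pixel is available as a position in bed coordinates together with its height above the mattress; the default bed dimensions are an assumption used only for the example.

import numpy as np

def bed_aligned_map(xy_bed_m, height_m, bed_size_m=(2.0, 0.9), cell_m=0.10):
    # xy_bed_m: (N, 2) positions in bed coordinates (meters along the long and
    # short bed axes); height_m: (N,) heights above the mattress.
    n_rows = int(np.ceil(bed_size_m[0] / cell_m))
    n_cols = int(np.ceil(bed_size_m[1] / cell_m))
    bam = np.zeros((n_rows, n_cols))
    counts = np.zeros((n_rows, n_cols))
    rows = np.clip((xy_bed_m[:, 0] / cell_m).astype(int), 0, n_rows - 1)
    cols = np.clip((xy_bed_m[:, 1] / cell_m).astype(int), 0, n_cols - 1)
    np.add.at(bam, (rows, cols), height_m)   # sum of heights per 10 cm x 10 cm cell
    np.add.at(counts, (rows, cols), 1.0)
    return np.divide(bam, counts, out=np.zeros_like(bam), where=counts > 0)

From such a map, simple indicators follow directly; for instance, the bed occupancy of Section 4.1 can be read off as the volume above the mattress, approximately bam.sum() * cell_m**2.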
The resulting representation estimates the average height of the patient with respect
to a planar bed mattress; we call it Bed Aligned Map (BAM). It is independent of
bed localization and lighting conditions and allows us to compare the behavior of the
patients in different medical institutions which was not possible until now.
We estimated the bed localization once every 10 seconds (a total of 2505 times).
We accepted the bed estimations if the detected bed width lies within 5cm of the ac-
tual value. In total 91.8 % of the times the bed estimation was accepted, and the mean
standard deviation measured was of 13.6 mm for width and 31.4 mm for length.

4 Body Analysis

4.1 Bed Occupancy and Body Localization


Bed occupancy is a common indicator extracted by visual monitoring systems. Since the
body is generally occluded by the blanket, bed occupancy is estimated by detecting the
patient’s head with skin color models [9] or markers [2]. Neither approach is practical
(color is not available by night). In contrast, bed occupancy can be trivially extracted
from the BAM by estimating the volume under the blanket (Fig. 4).
BAM can also be used to estimate the body localization (Fig. 3), which combined
with safety frameworks is used to predict if a patient is in danger of falling out of the
bed.

Fig. 3. BAM representations of subjects lying in supine position (top), on the left side (middle),
and on the right side (bottom). The estimated center of gravity of the body is displayed as a circle.
Best viewed in color.

[Fig. 4 plot: bed occupancy volume (l) on the y-axis over time (s) on the x-axis, with events A-D marked.]
Fig. 4. Bed Occupancy: subject enters the bed (A), changes two times of sleeping position (B,
C), and leaves the bed (D). Note how the volume never reaches zero as the pillow and the bed
clothing occupy a significant amount of space.

4.2 Agitation

Agitation is the main indicator recorded in several computer vision ICU monitoring
systems. Due to the difficulty of precisely localizing the body, it is usually quantified
by analyzing the changes between consecutive images. This approach is not robust to
changes in illumination [13], although light-invariant feature descriptors have been used
to compensate for global illumination changes [16].
We propose an agitation measure defined as the mean difference between maxi-
mum and minimum cell height of all BAMs captured within one second. The resulting
measure has volumetric units and is independent of the viewpoint used to capture it.
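A direct transcription of this measure might look as follows (sketch only; the names and the NumPy dependency are assumptions). Multiplying the mean height range by the cell area turns it into an explicit volume, which is how we read the statement that the measure has volumetric units.

    import numpy as np

    def agitation(bams_one_second, cell=0.10):
        # per-cell height range over the BAMs of one second, averaged over all cells
        stack = np.stack(bams_one_second)                  # shape: (frames, nx, ny)
        mean_range = np.mean(stack.max(axis=0) - stack.min(axis=0))
        return float(mean_range * cell * cell)             # scaled to volumetric units (assumption)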
Lacking a standard procedure to measure agitation, we asked the subjects to show
three different levels of distress in supine and fetal positions. To compare between

(Plots for Fig. 5: agitation values for (a) Supine and (b) Fetal positions at the rest, low, mild, and strong levels.)
Fig. 5. Mean and standard deviation of the agitation values of subjects when instructed to rest or
to show low, mild, and strong distress, respectively

Table 1. Agitation classification within a subject using the suggested metric


(a) Supine
            rest   low   mild   strong
  rest        23     0      0        0
  low          0    21      2        0
  mild         0     1     18        4
  strong       0     1      3       19

(b) Fetal
            rest   low   mild   strong
  rest        21     1      1        0
  low          1    16      4        2
  mild         1     3     14        5
  strong       0     3      4       16

agitation levels, we averaged the measurements over 5-second windows; the
BAM itself was extracted at 10 fps.
The only instructions our subjects received were to show low, mild and strong
distress; this results in wide disparities between subjects, but the average measured agi-
tation showed a consistent progression across intensity levels (Fig. 5). We then eval-
uated the effectiveness of this indicator within a subject; the resulting confusion
tables are shown in Table 1.

4.3 Sleep Position


To prevent pressure ulcers, it is recommended to change the sleeping position of ICU
patients every two hours. As most patients are too sedated to relocate on their own, the
ICU staff must frequently check the sleeping position of the patients and reposition them if
required. This is an unwieldy task, and adherence to the suggested guidelines is, in
general, low [15].
Using classical pose estimation methods is not possible because the patient's body is usually
covered by a blanket; however, the volumetric nature of the BAM simplifies this task. We
tested the BAM's ability to distinguish between an empty bed, a person lying on his back, a
person lying on the left side, and a person lying on the right side, from a single image. Using a
naive Nearest Neighbor approach with Euclidean distance, we obtain an accuracy of 85.9% (Ta-
ble 2) with leave-one-person-out cross-validation. These results were improved by us-
ing PCA to reduce the BAM to 32 dimensions and Large Margin Nearest Neighbor (LMNN) [7] as a classifier.

Table 2. Confusion matrix of sleep position classification using BAM. The high accuracy ob-
tained with the simple 1NN approach endorses the quality of the BAM as a robust representation,
while using LMNN and PCA achieves 100% accuracy.
(a) 1NN
            empty  supine  left  right
  empty        21       2     0      0
  supine        0      20     3      0
  left          1       3    18      1
  right         0       0     3     20

(b) PCA-LMNN
            empty  supine  left  right
  empty        23       0     0      0
  supine        0      23     0      0
  left          0       0    23      0
  right         0       0     0     23

LMNN uses semidefinite programming to learn a Mahalanobis distance
metric for kNN classification. This combination achieves 100% accuracy.
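A simplified version of this classifier can be sketched with scikit-learn (an assumption; the paper does not state its implementation). The LMNN metric-learning step is omitted here; only the PCA reduction to 32 dimensions followed by a 1-nearest-neighbour classifier is shown.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    def fit_sleep_position_classifier(train_bams, train_labels, n_components=32):
        X = np.stack([b.ravel() for b in train_bams])      # one flattened BAM per sample
        pca = PCA(n_components=n_components).fit(X)
        knn = KNeighborsClassifier(n_neighbors=1).fit(pca.transform(X), train_labels)
        return pca, knn

    def predict_sleep_position(pca, knn, bam):
        # labels: empty bed, supine, left side, right side
        return knn.predict(pca.transform(bam.ravel()[None, :]))[0]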

5 Conclusions
We address two principal challenges that computer vision approaches face in critical
care monitoring: First, bedridden patients in hospitals are often covered by a textureless
blanket. This makes it hard for computer vision algorithms to estimate body parameters
and articulation. However, it is possible to detect movements and rough shapes beneath the
blanket, especially when depth information is available. Second, intensive care units are
dynamic environments in which the location of the bed or sensor can be changed by the
hospital personnel at any time.
We address these two challenges and introduce the Bed Aligned Map (BAM), which
extracts and aligns the image patch that contains the bed. BAM is calculated from depth
information, is view and light independent and does not require markers. We show some
indicators that can be obtained directly from BAM (bed occupancy, body location with
respect to the bed) and present a robust metric to quantify body agitation. Furthermore,
the BAM facilitates the development of machine learning based classifiers, because the
alignment allows us to combine observations of several patients. We use this property
to develop a sleeping position classifier in which we discern between an empty bed, a
patient lying on his back, a patient lying on his left side, and a patient lying on his right
side. On this 4-class problem a naive nearest neighbor approach using BAM achieves an
85.9% accuracy while a combined LMNN and PCA approach achieves 100% accuracy on a 23-subject
experiment.

Acknowledgements. This work is supported by the German Federal Ministry of Edu-


cation and Research (BMBF) within the VIPSAFE project.

References
1. Aoki, H., Takemura, Y., Mimura, K., Nakajima, M.: Development of non-restrictive sens-
ing system for sleeping person using fiber grating vision sensor. In: Micromechatronics and
Human Science (2001)

2. Becouze, P., Hann, C., Chase, J., Shaw, G.: Measuring facial grimacing for quantifying pa-
tient agitation in critical care. In: Computer Methods and Programs in Biomedicine (2007)
3. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annota-
tions. In: ICCV (2009)
4. Chanques, G., Jaber, S., Barbotte, E., Violet, S., Sebbane, M., Perrigault, P.F., Mann, C.,
Lefrant, J.Y., Eledjam, J.J.: Impact of systematic evaluation of pain and agitation in an inten-
sive care unit* (2006)
5. Geoffrey Chase, J., Agogue, F., Starfinger, C., Lam, Z., Shaw, G.M., Rudge, A.D., Sirisena,
H.: Quantifying agitation in sedated icu patients using digital imaging. In: Computer Meth-
ods and Programs in Biomedicine (2004)
6. Grap, M.J., Hamilton, V.A., McNallen, A., Ketchum, J.M., Best, A.M., Isti Arief, N.Y.,
Wetzel, P.A.: Actigraphy: Analyzing patient movement. Heart & Lung: The Journal of Acute
and Critical Care (2011)
7. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neigh-
bor classification. In: NIPS (2006)
8. Kittipanya-Ngam, P., Guat, O., Lung, E.: Computer vision applications for patients monitor-
ing system. In: FUSION (2012)
9. Mansor, M., Yaacob, S., Nagarajan, R., Che, L., Hariharan, M., Ezanuddin, M.: Detection of
facial changes for ICU patients using knn classifier. In: ICIAS (2010)
10. Martinez, M., Stiefelhagen, R.: Breath rate monitoring during sleep using near-ir imagery
and pca. In: ICPR (2012)
11. Martinez, M., Stiefelhagen, R.: Automated multi-camera system for long term behavioral
monitoring in intensive care units. In: MVA (2013)
12. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic
assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS,
vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
13. Naufal Bin Mansor, M., Yaacob, S., Nagarajan, R., Hariharan, M.: Patient monitoring in ICU
under unstructured lighting condition. In: ISIEA (2010)
14. Ouimet, S., Kavanagh, B.P., Gottfried, S.B., Skrobik, Y.: Incidence, risk factors and conse-
quences of ICU delirium. Intensive Care Medicine (2007)
15. Paquay, L., Wouters, R., Defloor, T., Buntinx, F., Debaillie, R., Geys, L.: Adherence to pres-
sure ulcer prevention guidelines in home care: a survey of current practice. Journal of Clinical
Nursing (2008)
16. Reyes, M., Vitria, J., Radeva, P., Escalera, S.: Real-time activity monitoring of inpatients. In:
MICCAT (2010)
3-D Feature Point Matching for Object
Recognition Based on Estimation
of Local Shape Distinctiveness

Masanobu Nagase, Shuichi Akizuki, and Manabu Hashimoto

Graduate School of Information Science and Technology Chukyo University


101-2 Yagoto honmachi, Showa-ku, Nagoya, Aichi, 466-8666, Japan
{nagase,mana}@isl.sist.chukyo-u.ac.jp

Abstract. In this paper, we propose a reliable 3-D object recognition


method that can statistically minimize object mismatching. Our method
basically uses a 3-D object model that is represented as a set of feature
points with 3-D coordinates. Each feature point also has an attribute
value for the local shape around the point. The attribute value is rep-
resented as an orientation histogram of a normal vector calculated by
using several neighboring feature points around each point. The important
point is that this attribute value characterizes the local shape around it. By esti-
mating the relative similarity of two points of all possible combinations
in the model, we define the distinctiveness of each point. In the proposed
method, only a small number of distinctive feature points are selected
and used for matching with all feature points extracted from an acquired
range image. Finally, the position and pose of the target object can be
estimated from a number of correctly matched points. Experimental re-
sults using actual scenes have demonstrated that the recognition rate of
our method is 93.8%, which is 42.2% higher than that of the conventional
Spin Image method. Furthermore, its computing time is about nine times
faster than that of the Spin Image method.

Keywords: object recognition, 3-D feature point matching, robot


vision, point cloud data, 3-D descriptor, bin-picking.

1 Introduction
Bin-picking systems are an important means for developing automated cell man-
ufacturing systems. An important requirement for such systems is a reliable and
high-speed means for recognizing the pose of an object in scenes that consist of
many randomly stacked identical objects.
In the field of 3-D object recognition, a lot of model-based object recogni-
tion methods have been proposed. These methods estimate the pose parameters
of objects by matching an object model to an input range image. The Spin
Image method[1] is a typical model-based method. It uses pose-invariant fea-
tures created by calculating the direction of normal vectors in each point of
an object model. However, its computational cost is expensive because it is


necessary to calculate the feature values from all points of the object model.
Other methods[2][3] using edge information with depth value have been pro-
posed. These methods can achieve high-speed recognition because they use only
local information of the object model. For randomly stacked objects, however,
mismatches may frequently occur due to pseudo-edges caused by objects that
overlap other objects.
For other model-based approaches, several high-speed recognition methods
have been proposed. These methods use only feature points for the matching
process. For example, the DAI (Depth Aspect Image) matching method[4] and
the Local Surface Patch method[5] are typical methods that use this approach.
These methods use distinctive local shapes that have large curvature, so they are
effective in some cases. However, in cases where there are many local shapes
with large curvature, mismatches increase.
A recent study proposed a 3-D local descriptor called SHOT (Signature of
Histograms of OrienTations)[6][7]. This method uses only one corresponding
point with SHOT descriptors, so high-speed recognition is achieved. A problem
with it, however, is that pose parameter calculation becomes difficult when the
SHOT descriptor is disturbed by outliers due to multiple objects.
A more substantial problem is that no practical 3-D object recognition meth-
ods have yet been developed that achieve both high speed and reliability.
The purpose of our research, therefore, is to develop a new method that can
achieve both reliability and high speed. From the viewpoint of efficient process-
ing, our method can be categorized as a feature-point-based matching method
using “point cloud data”.
We assume that the object model in this study consists of point cloud data
with 3-D coordinates. Each point of the object model has an attribute value
that represents the local shape around the interest point. The attribute value is
represented as an orientation histogram of a normal vector, which is calculated
by using several neighboring feature points around the interest point. As men-
tioned above, the attribute value of a model point means its local shape. Before
matching the model points to acquired data, we determine the distinctiveness
of all points by calculating the relative similarity of two points of all possible
combinations in the object model. Rather than all of the feature points, a small
number of them with high distinctiveness are used in the matching process. Us-
ing this effective feature-point selection based on estimating distinctiveness, we
achieve both reliable and high-speed recognition.
In Section 2 we explain the key idea of and concrete algorithm for the proposed
method. In Section 3 we demonstrate, through experimental results acquired in
testing a lot of real range images, that our method has better performance than
conventional methods such as the Spin Image method.

2 Proposed Method
2.1 Basic Idea
In this study, we introduce two basic ideas.

The first is to calculate the distinctiveness of each feature point of an object


model. The use of feature points that have high distinctiveness for the matching
process reduces the risk of mismatching.
The second is to use only a small number of selected feature points for the
matching process. Reducing the number of feature points used for the matching
process achieves high-speed recognition.

2.2 Outline of Proposed Algorithm

Figure 1 shows a schematic block diagram of the proposed algorithm.

Fig. 1. Schematic block diagram of proposed algorithm

The proposed algorithm consists of two modules: an object model analysis


module and a recognition module. In this study, we assume that the object
model consists of feature points with 3-D coordinates.
In the object model analysis module, the attribute values of each feature point
are described by a histogram created by the normal vector of several neighboring
feature points. Next, the distinctiveness value of each point is calculated by
the similarity between normal distribution histograms. Finally, the distinctive
feature points used for matching are selected on the basis of high distinctiveness
value.
In the recognition module, the position and pose of the object are estimated
from a number of correctly corresponding points.

2.3 Normal Distribution Histogram

Figure 2 shows the method for creating a normal distribution histogram, which
is the local shape descriptor we propose in this study.

First, a sphere region of radius r is set around the interest point n. Next, the angle θ is
calculated between the normal vector N_n and each other normal vector N_mt contained
in the sphere region. This creates the normal distribution histogram of θ.
This histogram represents the local shape of the interest point using its neighboring
points. Even if the input data contain outliers, stable feature description is
possible because the proposed descriptor uses many neighboring points of the
interest point for feature description. This process is applied to all points of the
object model.
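The descriptor can be written down compactly as follows (a sketch under our own assumptions: NumPy, unit normals already estimated for every model point, and L1-normalised histograms so that they can later be compared with the Bhattacharyya coefficient; the experiments in Section 3 use 37 bins).

    import numpy as np

    def normal_distribution_histogram(points, normals, idx, r=2.0, n_bins=37):
        # angles between the normal at point `idx` and the normals of all other
        # points inside the sphere of radius r around it
        d = np.linalg.norm(points - points[idx], axis=1)
        neigh = (d < r) & (d > 0)
        cos_t = np.clip(normals[neigh] @ normals[idx], -1.0, 1.0)
        theta = np.arccos(cos_t)
        hist, _ = np.histogram(theta, bins=n_bins, range=(0.0, np.pi))
        hist = hist.astype(float)
        s = hist.sum()
        return hist / s if s > 0 else hist                 # L1-normalised (assumed)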

Fig. 2. Method for creating normal distribution histogram

2.4 Calculation of Distinctiveness Value of Feature Points

Next, we explain the method for calculating the distinctiveness value at each
point of the object model.
The dissimilarity value B is calculated from the Bhattacharyya coefficient be-
tween the normal distribution histograms of the interest point and of another
point, created as described in Subsection 2.3:


B(P, Q) = 1 − Σ_{u=1}^{U} √(P_u Q_u)    (1)

where P and Q are normal distribution histograms, U represents the number of


bins in the histogram, and u represents the interest bin. The dissimilarity value
will be nearest to 1 when the correlation between histograms is low.
Next, the distinctiveness value Sn is calculated by Equation (2):

S_n = (1/T) Σ_{t=1}^{T} B(p_n, q_t)    (2)

where p and q are normal distribution histograms, T represents the number of


points of the object model, and n and t are the interest point and another model
point. The distinctiveness value S_n lies between 0 and 1; the closer S_n is to 1,
the more distinctive point n is. In this study, points with a high distinctiveness
value are extracted as distinctive feature points.
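Equations (1) and (2) translate directly into a few lines (our own sketch; NumPy assumed, histograms assumed L1-normalised). The helper for selecting the most distinctive points simply sorts by S_n.

    import numpy as np

    def bhattacharyya_dissimilarity(p, q):
        # Eq. (1): 1 - sum_u sqrt(p_u * q_u)
        return 1.0 - float(np.sum(np.sqrt(p * q)))

    def distinctiveness(histograms):
        # Eq. (2): mean dissimilarity of each point to every model point
        H = np.asarray(histograms, dtype=float)
        bc = np.sqrt(H) @ np.sqrt(H).T          # pairwise Bhattacharyya coefficients
        return 1.0 - bc.mean(axis=1)

    def select_distinctive_points(histograms, n_points=30):
        s = distinctiveness(histograms)
        return np.argsort(s)[::-1][:n_points]   # indices of the most distinctive points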

2.5 Position and Pose Estimation by Matching Feature Points


In this subsection we explain our pose recognition method, which uses distinctive
feature points.
First, the Bhattacharyya coefficient is used to calculate the similarity between
two normal distribution histograms calculated from all of the points in the input
range image and the distinctive feature points extracted from the object model.
Corresponding points are determined from the input range image and the object
model. The pose parameters of the object model are calculated from the corre-
sponding points of the input range image and the object model, which satisfies
Equation (3):
(|d_{s1,s2} − d_{m1,m2}| < th_d) ∧ (|θ_{s1} − θ_{m1}| < th_t) ∧ (|θ_{s2} − θ_{m2}| < th_t)    (3)
where m1 ,m2 are distinctive feature points of the object model, s1 ,s2 are points
of an input range image, θm1 ,θm2 ,θs1 ,θs2 are the angles between these points and
the normal vector of the interest feature point, dm1 ,m2 , ds1 ,s2 are the Euclidean
distances between the two points of the model and those of the input range
image, and thd and tht are the distance and angle thresholds.
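The geometric test of Equation (3) amounts to a simple predicate over a pair of scene points and a pair of model feature points. The sketch below assumes the inter-point distances and point/normal angles have already been computed, and uses the thresholds th_d = 1.2 mm and th_t = 5 degrees reported in Section 3.

    def pair_is_consistent(d_s, d_m, theta_s1, theta_m1, theta_s2, theta_m2,
                           th_d=1.2, th_t=5.0):
        # Eq. (3): a scene point pair (s1, s2) supports a model pair (m1, m2) only if
        # the inter-point distances and both angles agree within the thresholds
        return (abs(d_s - d_m) < th_d
                and abs(theta_s1 - theta_m1) < th_t
                and abs(theta_s2 - theta_m2) < th_t)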
We associate with the interest feature point m1 the vector from that point to the
centroid of the object model. The pose parameters are applied to this centroid vector.
Second, the transformation parameters calculated from all matching results
are voted for in a voting space prepared on the input range image. An example
matching that uses distinctive feature points is illustrated in Figure 3.

Fig. 3. Overview of pose recognition scheme using distinctive feature points extracted
from object model

After the voting, the candidate parameters supported by many corresponding


points are determined to be reasonable transformation parameters for the object
model in an input range image. Finally, we carefully confirm the consistency
between object model and input range image for all hypotheses by using many
points other than the distinctive feature points. The parameter with the highest
consistency is determined to be the recognition result.
Here, the consistency between a transformed object model and an input range
image is calculated by Equation (4):

R = (1/T) Σ_{t=1}^{T} |m_t − f(i, j)|    (4)

where mt represents the t-th point in a transformed model object. The value
R represents the difference between a transformed object model and an input
range image. A low R value means high consistency.

3 Experiments and Discussion

3.1 Distribution of Extracted Distinctive Feature Points

In this section, we examine the distribution of the distinctive feature points
extracted by various methods. Experiments were conducted to compare the proposed
method with two conventional methods: the random method, in which distinc-
tive feature points are randomly extracted, and the curvature method, in which
distinctive feature points with large curvature are preferentially extracted.
In Figure 4, (a) is an overview of the model, (b) is the result obtained in
calculating the distinctiveness of each point of the object model, and (c), (d),
and (e) show the distinctive feature points extracted by the random, curvature,
and proposed methods. Each method extracted 30 distinctive feature points.

Fig. 4. Overview of object model and distinctiveness calculation result

The random method extracted feature points from smooth shaped parts,
which accounted for a large part of the object model. The curvature method
extracted feature points from the recessed part of a large curvature shape; how-
ever, we thought that in this case the feature points might be easily hidden
and thus correct matching could not be obtained for them if the model under-
went a pose change. The proposed method, in contrast, selects distinctive feature
points on the shapes that characterize the model, for example where a large-curvature
edge continues in a straight line or on the planar parts of the object model. Therefore,
the proposed method enables correct matching even if the object model undergoes a pose
change.

3.2 Performance for Complicated Scenes

To verify the effectiveness of the proposed method, we compared its performance


for randomly stacked objects with that of four conventional methods: (1) the

Spin Image method[1], (2) the Correspondence Grouping[8] method, which is


contained in the Point Cloud Library[9] as a recognition module, (3) the random
method, and (4) the curvature method. To estimate the versatility of our method,
we evaluated its recognition performance by using four kinds of objects. The
parameters of our approach are r=2.0mm, thd =1.2mm, tht =5◦ . The number of
bins in the normal distribution histogram is 37.
Table 1 shows the recognition success rate Pr , the processing time T , and
the number of distinctive feature points N of each method for four datasets
that collectively consisted of about 130 real range images. These range images
are captured by a laser range finder. In this study, we considered recognition
successful if the alignment error between the input range image and the object model
was within 1.5 mm. Figure 5 (a), (b), (c) and (d) show overviews of the input
scenes, while (e), (f), (g) and (h) show recognition results, i.e., transformed object
models superimposed on the input data. The experiments were performed
using an Intel Core i7 at 3.40 GHz with 8 GB of memory.

Table 1. Recognition success rate and processing time

                               Object A  Object B  Object C  Object D   Mean
Spin Image [1]      Pr [%]        49.6      50.8      35.2      70.6    51.6
                    T [sec]      24.95     55.34     31.21     20.68   33.05
Correspondence      Pr [%]        74.4      96.2      65.6      84.9    80.3
Grouping [8][9]     T [sec]      45.41     52.85     29.26     26.10   38.41
Random method       N [point]       70        50        30        30       -
                    Pr [%]        89.1      96.9      85.2      69.8    85.3
                    T [sec]        5.87      4.77      2.06      1.74    3.61
Curvature method    N [point]       70        50        30        30       -
                    Pr [%]        86.8      76.9      18.0      54.8    59.1
                    T [sec]        6.01      4.94      2.21      1.97    3.78
Proposed method     N [point]       70        50        30        30       -
                    Pr [%]        89.1      97.7      94.5      93.7    93.8
                    T [sec]        5.83      4.76      2.05      1.76    3.60

Fig. 5. Example recognition results



Experiments confirmed that the proposed method achieves a 93.8% recognition
success rate, compared to 51.6% for the Spin Image method. It is also about nine times
faster since it uses only a small number of distinctive feature points for the
matching process while the Spin Image method uses all points of the model.
Furthermore, since the Spin Image method creates spin images by selecting
points at random from range images, correct matching cannot be achieved if the
randomly selected points do not correspond to points in the object model. This is one
of the reasons for the method's low recognition rate.
recognition results were also superior to those for the Correspondence Grouping
method, because the range of the reference frame can easily allow the inclusion of
data from objects at a nearby object boundary in the input range data. Through
experiments we also confirmed the amount of time it takes for the methods to
create reference frames. The experiment results confirmed that the proposed
method not only achieves higher recognition rates than the conventional meth-
ods, but also that its processing time is equal to or better than that of the other
methods.

4 Conclusion

We proposed an object recognition system that achieves both reliable and high-
speed recognition by using a small number of distinctive feature points.
Experimental results using actual scenes demonstrated that our method
achieves 93.8% recognition rate, which is 42.2% higher than that of the con-
ventional Spin Image method, and that its computing time is also about nine
times faster. These results confirmed that the process our method uses to select
distinctive feature points is an effective approach to object recognition.
In future work, we intend to further improve the method’s processing time,
optimize various parameters, and build a bin-picking system that implements
the method.

Acknowledgment. This work was partially supported by Grant-in-Aid for


Scientific Research (C) 23560512.

References

1. Johnson, A.E., Hebert, M.: Using Spin Images for Efficient Object Recognition in
Cluttered 3D Scenes. IEEE Trans. Pattern Analysis and Machine Intelligence 21,
433–449 (1999)
2. Sumi, Y., Tomita, F.: 3D Object Recognition Using Segment-Based Stereo Vision.
In: Chin, R., Pong, T.-C. (eds.) ACCV 1998. LNCS, vol. 1352, pp. 249–256. Springer,
Heidelberg (1997)
3. Steder, B., Rusu, R.B., Konolige, K., Burgard, W.: Point Feature Extraction on
3D Range Scans Taking into Account Object Boundaries. In: IEEE International
Conference on Robotics and Automation, pp. 2601–2608 (2011)

4. Takeguchi, T., Kaneko, S.: Depth Aspect Images for Efficient Object Recognition.
In: Proc. SPIE Conference on Optomechatronic Systems IV, vol. 5264, pp. 54–65
(2003)
5. Chen, H., Bhanu, B.: 3D Free-form Object Recognition in Range Images Using Local
Surface Patches. Pattern Recognition Letters 28, 1252–1262 (2007)
6. Tombari, F., Salti, S., Di Stefano, L.: Unique Signatures of Histograms for Local
Surface Description. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part III. LNCS, vol. 6313, pp. 356–369. Springer, Heidelberg (2010)
7. Tombari, F., Salti, S., Stefano, L.D.: A Combined Texture-Shape Descriptor for
Enhanced 3D feature Matching. In: IEEE International Conference on Image Pro-
cessing, pp. 809–812 (2011)
8. Tombari, F., Stefano, L.D.: Object Recognition in 3D Scene with Occlusions and
Clutter by Hough Voting. In: IEEE Proc. on 4th Pacific-Rim Symposium on Image
and Video Technology, pp. 349–355 (2010)
9. Rusu, R.B., Cousins, S.: 3D is here: Point Cloud Library (PCL). In: IEEE Interna-
tional Conference on Robotics and Automation, pp. 1–4 (2011)
3D Human Tracking from Depth Cue
in a Buying Behavior Analysis Context

Cyrille Migniot and Fakhreddine Ababsa

IBISC laboratory - University of Evry val d’Essonne, France


{Cyrille.Migniot,Fakhr-Eddine.Ababsa}@ufrst.univ-evry.fr

Abstract. This paper presents a real-time approach to track the human
body pose in 3D space. For buying behavior analysis, the camera
is placed on top of the shelves, above the customers. In this top
view, markerless tracking is harder. Hence, we use the depth cue
provided by the Kinect, which gives discriminative features of the pose. We
introduce a new 3D model that is fitted to these data in a particle
filter framework. First, the head and shoulders position is tracked in the
2D space of the acquired images. Then the arm poses are tracked in
3D space. Finally, we demonstrate that an efficient implementation
provides a real-time system.

Keywords: Human tracking, kinect, particle filter, buying behavior.

1 Introduction

Behavior analysis based on artificial vision methods offers a wide range of appli-
cations that are currently little developed in the marketing area. In customer
behavior analysis, the camera is often placed on the ceiling of the market. Only
the top view of the person is therefore available. However, the great majority of the
methods in the literature use a model adapted to a front view of the person
because the shape of a person is much more discriminative in this orientation.
The aim of the project ANR-10-CORD0016 ORIGAMI2 that supports this work
is to develop real-time and non-intrusive tools designed to analyze shoppers'
buying-act decisions. The approach is first based on extracting
and tracking the shoppers' gaze and gesture positions with computer vision
algorithms. It is then based on statistically analyzing the extracted data: the
goal of this cognitive analysis is to measure the interaction between the shopper
and their environment. This technology will provide consumer goods producers
with unbiased and exhaustive information on shoppers' behaviors during their
buying acts.
To make the tracking possible, the depth cue is required. One of the more
popular devices used to provide it is the Kinect, which has sensors that capture both
RGB and depth data. In this paper we integrate the depth cue in a particle filter to
track the body parts. Gesture recognition and the behavior of the customer
could subsequently be analyzed using Moeslund's taxonomy.

R. Wilson et al. (Eds.): CAIP 2013, Part I, LNCS 8047, pp. 482–489, 2013.

c Springer-Verlag Berlin Heidelberg 2013

Pose estimation and 3D tracking have received a significant amount of atten-


tion in the computer vision research community in the past decade. To do this,
the observation is fitted to a model that embodies the possible states. For an
articulated target as a person, the particle filter [9] is mostly used. It estimates
the current pose from a sample of possible states weighted by a likelihood
function that represents the probability that a state of the model corresponds
to the observation. A skeleton defines the states of the model. It comprises a
set of appropriately assembled geometric primitives [4,7,8] or 3D gaussians [15]
to introduce the volume occupied by the body in the 3D space. The main vari-
ations on the framework come from the choice of the likelihood function. Skin
color [6] and contour (matched to the chamfer distance [16]) are the most useful
features. Kabayashi [11] inserts results of classifiers in the likelihood function.
Some poses of the skeleton can not be executed by a human body. The sampling
can be constrainted by a projection on the feasible configuration space [7] or by
stochastic Nelder-Mead simplex search [12].
In the buying behavior analysis context, offline post-processing is not an option and
real-time processing is therefore desirable. In particle filtering, the most expensive
operation is the evaluation of the likelihood function because it has to be done
once at every time step for every particle. Some adaptations are needed to obtain
real-time processing. Gonzalez [6] performs tracking for each sub-part of the
body so as to use only simple models. A hierarchical particle filter [17] simplifies
the likelihood function. The annealed particle filtering [4] reduces the required
number of particles. Finally Kjellström [10] considers interaction with objects in
the environment to constrain the pose of the body and remove degrees of freedom.
In this paper, we propose a new human pose tracking method based on particle filtering in a
top view. To obtain real-time processing, the model is broken up into a 2D model
representing the head and the shoulders and a 3D model representing the arms.
Using the depth cue provided by a Kinect drastically reduces the complexity of
the first one. For the second one, the pose is constrained by the position of the
shoulders.
Our main contributions are, first, considering buying-act conditions to opti-
mize the tracking, then taking advantage of recent data acquisition equipment,
and finally decomposing the model into two parts so as to reduce the filtering
complexity and use 2D and 3D models simultaneously.

2 Particle Filter Implementation

We use the Xtion Pro-live camera produced by Asus for the acquisition. All the
points for which the sensor is not able to measure depth are set to 0 in the output
array. We regard these as a kind of noise. Moreover, we only model the upper part
of the body. Thus, we threshold the image to only take into consideration the
pixels recognized as an element of the torso, the arms or the head. This gives a first
segmentation of the region of interest (ROI).
The Asus Xtion Pro-live provides the color and the depth cues simultaneously.
Nevertheless the color cue is often degraded in practice. Indeed, persons on the
supermarket shelves are over-lit. The tracking must be robust and the depth cue
is not disturbed by lighting. Thus we only take into consideration the depth cue.

2.1 The Particle Filter


Particle filtering has been a successful numerical approximation technique for
Bayesian sequential estimation with non-linear, non-Gaussian models. At time
k, let x_k be the state of the model and y_k be the observation. The particle
filter recursively approximates the posterior probability density p(x_k|y_k) of the
current state x_k by evaluating the observation likelihood on a weighted particle
sample set {x^i_k, ω^i_k}. Each of the N particles x^i_k corresponds to a random state
propagated by the dynamic model of the system and weighted by ω^i_k. There are
4 basic steps:
– resampling: N particles {x'^i_k, 1/N} ∼ p(x_k|y_k) are resampled from the sample
  {x^i_k, ω^i_k}. Particles are selected by their weight: large-weight particles are
  duplicated while low-weight particles are deleted.
– propagation: particles are propagated using the dynamic model of the system
  p(x_{k+1}|x_k) to obtain {x^i_{k+1}, 1/N} ∼ p(x_{k+1}|y_k).
– weighting: particles are weighted by a likelihood function related to the
  correspondence from the model to the new observation. The new weights ω^i_{k+1}
  are normalized so that Σ_{i=1}^{N} ω^i_{k+1} = 1. This provides the new sample
  {x^i_{k+1}, ω^i_{k+1}} ∼ p(x_{k+1}|y_{k+1}).
– estimation: the new pose is approximated by:

  x_{k+1} = Σ_{i=1}^{N} ω^i_{k+1} x^i_{k+1}    (1)
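These four steps map onto a generic sampling-importance-resampling loop such as the one sketched below (our own illustration; propagate_fn and likelihood_fn stand in for the dynamic model and for the likelihood functions defined in Sections 2.2 and 2.3).

    import numpy as np

    def particle_filter_step(particles, weights, propagate_fn, likelihood_fn, rng):
        # particles: (N, D) array of state vectors, weights: (N,) normalized weights
        n = len(particles)
        idx = rng.choice(n, size=n, p=weights)              # resampling by weight
        propagated = np.array([propagate_fn(x, rng) for x in particles[idx]])
        w = np.array([likelihood_fn(x) for x in propagated], dtype=float)
        w /= w.sum()                                        # weighting (normalized)
        estimate = np.sum(w[:, None] * propagated, axis=0)  # estimation, Eq. (1)
        return propagated, w, estimate

    # rng = np.random.default_rng(); repeated calls track the state over the sequence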

2.2 The 2D Head-Shoulders Model


The top of the head and the top of the shoulders produce a depth variation that is highly
descriptive of the human class [14]. Moreover, the shapes of these two parts in the
top view are almost constant and form ellipses. Canton-Ferrer [3] defines the volume
of the person by an ellipsoid. While he estimates the position of the person,
we estimate its pose. Our model is therefore made of two ellipses whose dimensions are
relative to the person's stoutness and are computed in the initialization step. The
state vector is defined by V^hs = {x^h, y^h, θ^h, x^s, y^s, θ^s}, where (x, y) gives the
position of the center of an ellipse and θ its orientation, for the head (h) and the
shoulders (s). We first threshold the depth image to separate the pixels that are
likely to correspond to the head and the pixels that are likely to correspond to
the shoulders. This map is our observation. To define the likelihood function, the
ellipses given by a state vector (a particle) are matched to the chamfer distance
map of the thresholded depth image. The interaction between the two ellipses at
the neck level is introduced by constraints in the propagation step: the position
of one part reduces the possible state space of the second.

2.3 The 3D Arms Model


For arm tracking, we need to perform the tracking in 3D space. A 3D
model of the whole body could be used (Figure 1(a)), but the shoulders are best
tracked in 2D space and a complete model is time consuming. A tracking
is done for each arm, hard-constrained by the 2D estimation of the shoulder
position from Section 2.2, as illustrated in Figure 1(c). In our model, the arm has 5
degrees of freedom: 3 for the shoulder and 2 for the elbow. The pose of the skele-
ton is defined by the 5 angles of the state vector V^a = {θ^sh_x, θ^sh_y, θ^sh_z, θ^el_x, θ^el_z}.
Geometrical primitives introduce the volume: arms and forearms are modeled
by truncated cylinders, the torso by an elliptic cylinder, and the hands by
rectangular planes.

Fig. 1. The 3D models: (a) the 3D model is made of a skeleton with geometrical
primitives, (b) the angles of the articulation defines the pose of the person, (c) in
the 3D-2D processing the head and the shoulders are tracked in the 2D space of the
recorded images whereas the arms are tracked in the 3D space.

The depth variation is well-descriptive of the arm. The pixels of the foreground of
the depth image are transposed into 3D space. Then the model state is fitted
to these 3D points.
Let Δ be the pixels of the foreground of the depth image excluding the head
and the shoulders detected previously and M be the 3D model state given by a
particle. The likelihood function related to particle i is defined by:

ω_i = average_{p∈Δ} ( d_3D(p, M_i) )    (2)

where d_3D is the Euclidean shortest distance from a point to the 3D model.
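In code this likelihood reads as below (a sketch; model_distance_fn is a placeholder for the point-to-primitive distance of the articulated arm model, and the exponential mapping from the average distance to an actual particle weight is our assumption, since Eq. (2) only defines the distance term).

    import numpy as np

    def arm_likelihood(foreground_points_3d, model_distance_fn, beta=1.0):
        # Eq. (2): average 3D distance from the foreground points to the model state
        d = float(np.mean([model_distance_fn(p) for p in foreground_points_3d]))
        return np.exp(-beta * d)   # smaller average distance -> larger weight (assumed)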

3 Performances Analysis
We now present some experimental results. So as to control the movement of
the person and to maximize the number of tested poses, we have simulated the
behavior of customers under experimental conditions. In fact, the most important
variation is the presence of shelves and goods. However, as the camera does not move,
an estimate of the background is computed and can be subtracted from the frames
of the sequence. Using experimental conditions is justified because the ROIs thus
obtained are similar to those observed in real conditions.
The Xtion Pro live camera produced by Asus is installed 2.9 m above the ground.
It provides 7 frames per second. The dimensions of a frame are 320 × 240 pixels. In
the first experiment, we recorded two sequences S1 and S2 made of 450
frames (>1 min) and 300 frames (≈43 s). The movements of the arms are varied
and representative of buying behaviors. The depth cue is extracted with the
OpenNI library. The operating range of the Xtion Pro live camera is between 0.8 m
and 3.5 m. Consequently it cannot be used for at-a-distance video surveillance
but is relevant for the buying behavior analysis context.

Fig. 2. The tracking provides the pose of the person: visualization of the estimated
model state on the recorded frames (left) and in the 3D space (right), with the
projection of the pixels of the depth image in white

We now estimate the quality of the arm tracking. We manually an-
notated the pixels of the arms on the frames of the 2 sequences to create a
ground truth. Then we compute the average distance ε from the projection into
3D space of each of these pixels to the model state estimated by our method.
It has to be minimized to optimize the tracking. The main parameter of the par-
ticle filter is the number of particles. If it is increased, the tracking is improved

but the computing time is increased too. The processing times reported here are obtained
with a non-optimized C++ implementation running on a 3.1 GHz processor. We
give in the following the average processing times per frame. We can see in
Figure 3 that there is no meaningful improvement beyond 50 particles (computed
in 25 ms). With this configuration the average distance between each pixel of the
observation and the estimated model state is less than 2.5 cm.
This processing is real time.
We compare our algorithm with the case where the 3D fitting presented in
Section 2.3 is applied to a complete 3D model (Figure 1(a)) with 17 degrees of
freedom. Figure 3 shows that the tracking is less efficient in this configu-
ration. Indeed, the required number of particles is higher because the number of
degrees of freedom is higher. Consequently the processing time increases. More-
over, as we perform a part-based treatment, each body part is tracked more
efficiently.

Fig. 3. Performance of the tracking with the various models on the 2 sequences: the
tracking is best with our 3D-2D method

In a second experiment, we evaluate the trajectories of the articulations of the
arms. Two ARTTRACK1 cameras provide the 3D positions of reflecting balls
with the DTRACK software. We placed markers on the shoulder, the elbow
and the wrist of the left arm of a person and recorded their positions
simultaneously with the Kinect acquisition in a sequence S3 (≈55 s). The markers
cannot be placed accurately at the centers of the articulations, so the recorded
positions are not a ground truth. However, they can be used to evaluate the
trajectories of the articulations that define the arm movement. We show in Figure
4 that our trajectories fit the ART ones well. Moreover, the difficult case
where the person bends down (the sharp peak on the z coordinate) is much
better estimated by our method. Finally, our tracking is more robust when the
movements are sharp (z coordinate of the wrist). This experiment validates the
estimation of the arm movement by our method.

Fig. 4. Trajectories of the 3D coordinates (x, y and z) of the shoulder, the elbow and the
wrist of the left arm in the sequence S3: our 3D-2D method follows the articulation
movements well

4 Conclusion
In this paper we have presented a 3D gesture tracking method that uses the well-
known particle filter. To be efficient in the buying behavior analysis con-
text, where the camera is placed above the customers, our treatment is adapted
to the top view of the person and uses the depth cue provided by the new Asus
camera. To do this, we have introduced a top-view model that simultaneously
uses 2D and 3D fitting. The process is accurate and real-time.
In the future, our method could be inserted into an action recognition pipeline
to analyze customer behavior. Moreover, a camera pose estimation [5,2,1]
could bring our work into an Augmented Reality context with a moving camera.
Finally, an additional camera placed at head level could refine the behavior
analysis with a gaze estimation [13].

References
1. Ababsa, F.: Robust Extended Kalman Filtering For Camera Pose Tracking Us-
ing 2D to 3D Lines Correspondences. In: IEEE/ASME Conference on Advanced
Intelligent Mechatronics, pp. 1834–1838 (2009)

2. Ababsa, F., Mallem, M.: A Robust Circular Fiducial Detection Technique and
Real-Time 3D Camera Tracking. International Journal of Multimedia 3, 34–41
(2008)
3. Canton-Ferrer, C., Salvador, J., Casas, J.R., Pardàs, M.: Multi-person Track-
ing Strategies Based on Voxel Analysis. In: Stiefelhagen, R., Bowers, R., Fiscus,
J.G. (eds.) CLEAR 2007 and RT 2007. LNCS, vol. 4625, pp. 91–103. Springer,
Heidelberg (2008)
4. Deutscher, J., Reid, I.: Articulated Body Motion Capture by Stochastic Search.
International Journal of Computer Vision 2, 185–205 (2005)
5. Didier, J.Y., Ababsa, F., Mallem, M.: Hybrid Camera Pose Estimation Combin-
ing Square Fiducials Localisation Technique and Orthogonal Iteration Algorithm.
International Journal of Image and Graphics 8, 169–188 (2008)
6. Gonzalez, M., Collet, C.: Robust Body Parts Tracking using Particle Filter and
Dynamic Template. In: IEEE International Conference on Image Processing, pp.
529–532 (2011)
7. Hauberg, S., Sommer, S., Pedersen, K.S.: Gaussian-like Spatial Priors for Artic-
ulated Tracking. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part I. LNCS, vol. 6311, pp. 425–437. Springer, Heidelberg (2010)
8. Horaud, R., Niskanen, M., Dewaele, G., Boyer, E.: Human Motion Tracking by
Registering an Articulated Surface to 3D Points and Normals. IEEE Transaction
on Pattern Analysis and Machine Intelligence 31, 158–163 (2009)
9. Isard, M., Blake, A.: CONDENSATION - Conditional Density Propagation for
Visual Tracking. International Journal of Computer Vision 29, 5–28 (1998)
10. Kjellström, H., Kragic, D., Black, M.J.: Tracking People Interacting with Objects.
In: IEEE Conference on Computer Vision and Pattern Recognition (2010)
11. Kobayashi, Y., Sugimura, D., Sato, Y., Hirasawa, K., Suzuki, N., Kage, H., Sug-
imoto, A.: 3D Head Tracking using the Particle Filter with Cascaded Classifiers.
In: British Machine Vision Conference, pp. 37–46 (2006)
12. Lin, J.Y., Wu, Y., Huang, T.S.: 3D Model-based Hand Tracking using Stochastic
Direct Search Method. In: IEEE International Conference on Automatic Face and
Gesture Recognition, pp. 693–698 (2004)
13. Funes-Mora, K.A., Odobez, J.: Gaze Estimation from Multimodal Kinect Data. In:
IEEE Conference on Computer Vision and Pattern Recognition, pp. 25–30 (2012)
14. Micilotta, A., Bowden, R.: View-Based Location and Tracking of Body Parts for
Visual Interaction. In: British Machine Vision Conference, pp. 849–858 (2004)
15. Stoll, C., Hasler, N., Gall, J., Seidel, H.P., Theobalt, C.: Fast Articulated Motion
Tracking using a Sums of Gaussians Body Model. In: International Conference on
Computer Vision, pp. 951–958 (2011)
16. Xia, L., Chen, C.C., Aggarwal, J.K.: Human Detection Using Depth Information
by Kinect. In: International Workshop on Human Activity Understanding from 3D
Data (2011)
17. Yang, C., Duraiswami, R., Davis, L.: Fast Multiple Object Tracking via a Hierarchi-
cal Particle Filter. In: International Conference on Computer Vision, pp. 212–219
(2005)
A New Bag of Words LBP (BoWL) Descriptor
for Scene Image Classification

Sugata Banerji, Atreyee Sinha, and Chengjun Liu

Department of Computer Science,


New Jersey Institute of Technology,
Newark, NJ 07102, USA
{sb256,as739,cliu}@njit.edu

Abstract. This paper explores a new Local Binary Patterns (LBP) based im-
age descriptor that makes use of the bag-of-words model to significantly im-
prove classification performance for scene images. Specifically, first, a novel
multi-neighborhood LBP is introduced for small image patches. Second, this
multi-neighborhood LBP is combined with frequency domain smoothing to ex-
tract features from an image. Third, the features extracted are used with spatial
pyramid matching (SPM) and bag-of-words representation to propose an innova-
tive Bag of Words LBP (BoWL) descriptor. Next, a comparative assessment is
done of the proposed BoWL descriptor and the conventional LBP descriptor for
scene image classification using a Support Vector Machine (SVM) classifier. Fur-
ther, the classification performance of the new BoWL descriptor is compared with
the performance achieved by other researchers in recent years using some popu-
lar methods. Experiments with three fairly challenging publicly available image
datasets show that the proposed BoWL descriptor not only yields significantly
higher classification performance than LBP, but also generates results better than
or at par with some other popular image descriptors.

Keywords: BoWL descriptor, Bag of Words, LBP, Scene Image Classification,


Spatial Pyramid.

1 Introduction

Content-based image classification, search and retrieval is a rapidly-expanding research


area. The large volume of digital images taken worldwide every year necessitates the
development of automated classification systems. Apart from classifying large volume
of uncategorized images, image recognition has a variety of uses such as weather fore-
casting, medical diagnostics and robot vision.
The Local Binary Patterns (LBP) descriptor, which captures the variation in intensity
between neighboring pixels, was originally introduced to encode the texture from im-
ages [1]. Due to its computational efficiency, the LBP feature has been used alone or in
conjunction with other features to develop new image descriptors suitable for content-
based classification tasks [2], [3], [4].


Fig. 1. (a) shows a grayscale image, its LBP image, and the illustration of the computation of the
LBP code for a center pixel with gray level 90. (b) shows the eight 4-neighborhood masks used
for computing the proposed BoWL descriptor.

Lately, part-based methods have been very popular among researchers due to their
accuracy in image classification tasks [5]. Here the image is considered as a collection
of sub-images or parts. After features are extracted from all the parts, similar parts are
clustered together to form a visual vocabulary and a histogram of the parts is used to
represent the image. This approach is known as a “bag-of-words model”, with features
from each part representing a “visual word” that describes one characteristic of the
complete image [6].
This paper explores a new bag-of-words based image descriptor that makes use of the
multi-neighborhood LBP concept from [7], but significantly improves the classification
accuracy.

2 An Innovative Bag of Words LBP (BoWL) Descriptor for Scene


Image Classification
In this section, we review the LBP descriptor, and then describe the process of comput-
ing the proposed Bag of Words LBP (BoWL) descriptor from an image.

2.1 Local Binary Patterns (LBP)


The Local Binary Patterns (LBP) method derives the texture description of a grayscale
i.e. intensity image by comparing a center pixel with its neighbors [1]. LBP tends to
achieve grayscale invariance because only the signs of the differences between the cen-
ter pixel and its neighbors are used to define the value of the LBP code. Figure 1(a)

Fig. 2. (a) A grayscale image is broken down into small image patches which are then quantized
into a number of visual words and the image is represented as a histogram of words. (b) The
spatial pyramid model for image representation. The image is successively tiled into different
regions and features are extracted from each region and concatenated.

shows a grayscale image on the top left and its LBP image on the bottom left. The two
3 × 3 matrices on the right illustrate how the LBP code is computed for the center pixel
whose gray level is 90.
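For reference, the basic 8-neighbour LBP code of a single pixel can be computed as follows (a sketch; the bit ordering and the tie rule for equal gray levels are conventions that vary between implementations).

    def lbp_code(img, r, c):
        # classic LBP: compare the 8 neighbours of (r, c) with the centre pixel
        center = img[r, c]
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        code = 0
        for bit, (dr, dc) in enumerate(offsets):
            if img[r + dr, c + dc] >= center:
                code |= 1 << bit
        return code                     # one of the 2^8 = 256 possible patterns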

2.2 Dense Sampling: Image to Bag of Features


The first step while computing the new BoWL descriptor is sampling. Some image de-
scriptors like SIFT [8] use multiscale keypoint detectors to select regions of interest
within the image, but dense or even random sampling often outperforms the keypoint-
based sampling methods [9]. In the method proposed here, the image is divided into a
large number of equal sized blocks using a uniform grid and each block is used as a sep-
arate region for feature extraction. To increase classification performance, overlapping
image blocks are used. This process is explained in Figure 2(a).

2.3 A Modified LBP for Small Image Patches


Different forms of the LBP descriptor have resulted from different styles of selecting
the neighborhood by different researchers [10], [7], [11]. Figure 1(b) shows the eight 4-
pixel neighborhoods used for generating the multi-neighborhood LBP descriptor used
here. The traditional LBP process assigns one out of 28 possible intensity values to
each pixel forming a 256 bin histogram. However, if this technique is applied to a small
image patch with ∼256 pixels the histogram becomes sparse. To solve this problem,
eight smaller neighborhoods of four pixels each are used. These neighborhoods produce
a more dense 16-bin histogram, and eight such histograms from different neighborhoods
are concatenated to generate the 128-dimensional feature vector describing each image
patch.
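A patch-level sketch of this descriptor is given below. The eight 4-pixel masks are the ones depicted in Figure 1(b), which we cannot reproduce exactly here, so the offsets listed are illustrative placeholders; everything else (sixteen bins per mask, eight masks, 128 dimensions) follows the text.

    import numpy as np

    # eight 4-pixel neighbourhood masks; illustrative offsets only (see Fig. 1(b))
    MASKS = [
        [(-1, 0), (0, 1), (1, 0), (0, -1)],
        [(-1, -1), (-1, 1), (1, 1), (1, -1)],
        [(-2, 0), (0, 2), (2, 0), (0, -2)],
        [(-2, -2), (-2, 2), (2, 2), (2, -2)],
        [(-1, -2), (-1, 2), (1, 2), (1, -2)],
        [(-2, -1), (-2, 1), (2, 1), (2, -1)],
        [(-2, 1), (-1, 2), (2, -1), (1, -2)],
        [(-2, -1), (-1, -2), (2, 1), (1, 2)],
    ]

    def bowl_patch_descriptor(patch):
        # one 16-bin histogram of 4-bit codes per mask, concatenated to 128 dimensions
        h, w = patch.shape
        feats = []
        for mask in MASKS:
            hist = np.zeros(16)
            m = max(max(abs(dr), abs(dc)) for dr, dc in mask)
            for r in range(m, h - m):
                for c in range(m, w - m):
                    code = 0
                    for bit, (dr, dc) in enumerate(mask):
                        if patch[r + dr, c + dc] >= patch[r, c]:
                            code |= 1 << bit
                    hist[code] += 1
            feats.append(hist)
        return np.concatenate(feats)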
The Discrete Cosine Transform (DCT) can be used to transform an image from the
spatial domain to the frequency domain. DCT is thus able to extract the features in
the frequency domain to encode different image details that are not directly accessible

in the spatial domain. In the proposed method, the original image is transformed to
the frequency domain and the highest 25%, 50% and 75% frequencies are eliminated,
respectively. The original image and the three images thus formed undergo the same
process of dense sampling and eight-mask LBP feature extraction.
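The frequency-domain smoothing can be sketched with a 2D DCT (SciPy assumed; the exact rule the authors use to decide which coefficients count as the highest frequencies is not spelled out, so the simple index-based cut-off below is our assumption).

    import numpy as np
    from scipy.fft import dctn, idctn

    def dct_lowpass(image, keep_fraction):
        # zero out the high-frequency DCT coefficients and transform back
        coeffs = dctn(image.astype(float), norm='ortho')
        h, w = coeffs.shape
        mask = np.zeros_like(coeffs)
        mask[:max(1, int(h * keep_fraction)), :max(1, int(w * keep_fraction))] = 1.0
        return idctn(coeffs * mask, norm='ortho')

    # the three smoothed images used alongside the original (75%, 50%, 25% of frequencies kept)
    # smoothed = [dct_lowpass(img, f) for f in (0.75, 0.50, 0.25)]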

2.4 Bag of Features to Histogram of Visual Words


As demonstrated in the lower part of Figure 2(a), the bag of features extracted from the
training images are quantized into a visual vocabulary with discrete visual words using
the popular k-means clustering method. The vocabulary size used by other researchers
varies from a few hundreds [12] to several thousands and tens of thousands [13]. For the
BoWL features, experiments were performed with vocabularies of varying sizes and a
1000-word vocabulary was found to be optimum. After the formation of the visual vo-
cabulary, each image patch from each training and test image is mapped to one specific
word in the vocabulary and the image, therefore, can be represented by a histogram of
visual words.
Using the image pyramid representation of [12], a descriptor is able to represent
local image features and their spatial layout. In this method, an image is tiled into
successively smaller blocks at each level and descriptors are computed for each block
and concatenated. This technique is explained in Figure 2(b). For this work, only the
second level of this pyramid has been used to keep the computational complexity low.
This creates a 4000 dimensional BoWL feature vector for each image.
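The quantisation and pyramid pooling can be sketched as follows (scikit-learn k-means assumed; the tiling uses the four quadrants of the 2×2 pyramid level, which is consistent with the 4 × 1000 = 4000 dimensions stated above, although the authors' exact pooling details may differ).

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(all_patch_descriptors, n_words=1000):
        # all_patch_descriptors: (M, 128) matrix of training-patch descriptors
        return KMeans(n_clusters=n_words, n_init=4).fit(all_patch_descriptors)

    def bowl_image_vector(vocab, descriptors, centers, image_shape, n_words=1000):
        # one word histogram per 2x2 tile, concatenated: 4 * n_words dimensions
        words = vocab.predict(descriptors)
        h, w = image_shape
        vec = np.zeros(4 * n_words)
        for word, (r, c) in zip(words, centers):
            tile = 2 * int(r >= h / 2) + int(c >= w / 2)
            vec[tile * n_words + word] += 1
        s = vec.sum()
        return vec / s if s > 0 else vec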
For classification, a Support Vector Machine (SVM) with a Hellinger kernel is trained
independently for each class (one-vs-all). The SVM implementation used here is the one
that is distributed with the VlFeat package [14].
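A convenient property of the Hellinger kernel, K(x, y) = Σ_i √(x_i y_i), is that it has an explicit feature map: a linear SVM trained on the element-wise square roots of L1-normalised histograms is equivalent to the kernel machine. The sketch below uses scikit-learn rather than VlFeat, purely as an illustration of that mapping.

    import numpy as np
    from sklearn.svm import LinearSVC

    def hellinger_map(X):
        # L1-normalise each histogram, then take element-wise square roots
        X = np.asarray(X, dtype=float)
        X = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
        return np.sqrt(X)

    # one-vs-rest linear SVM on the mapped features (illustrative, not the authors' setup)
    # clf = LinearSVC(C=1.0).fit(hellinger_map(train_vectors), train_labels)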

3 Experiments
This section first introduces the three scene image datasets used for testing the new
BoWL image descriptor and then does a comparative assessment of the classification
performances of the LBP, the BoWL and some other popular descriptors.

3.1 Datasets Used


Three publicly available and widely used image datasets are used in this work for as-
sessing the classification performance of the proposed descriptor.

The UIUC Sports Event Dataset. The UIUC Sports Event dataset [15] contains 1,574
images from eight sports event categories. These images contain both indoor and out-
door scenes where the foreground contains elements that define the category. The back-
ground is often cluttered and is similar across different categories. Some sample images
are displayed in Figure 3(a).

(a)

(b)

(c)

Fig. 3. Some sample images from (a) the UIUC Sports Event dataset, (b) the MIT Scene dataset,
and (c) the Fifteen Scene Categories dataset

The MIT Scene Dataset. The MIT Scene dataset (also known as OT Scenes) [16] has
2,688 images classified as eight categories. There is a large variation in light, content
and angles, along with a high intra-class variation [16]. Figure 3(b) shows a few sample
images from this dataset.

The Fifteen Scene Categories Dataset. The Fifteen Scene Categories dataset [12] is
composed of 15 scene categories with 200 to 400 images each: thirteen were provided by
[5], eight of which were originally collected by [16] as the MIT Scene dataset, and two
were collected by [12]. Figure 3(c) shows one image each from the newer seven classes
of this dataset.

3.2 Comparative Assessment of the LBP, the BoWL and other Popular
Descriptors on Scene Image Datasets
In this section, a comparative assessment of the LBP and the proposed BoWL descriptor
is made using the three datasets described earlier to evaluate classification performance.
To compute the BoWL and the LBP, first each training image, if color, is converted to
grayscale. For evaluating the relative classification performances of the LBP and the
BoWL descriptors, a Support Vector Machine (SVM) classifier with a Hellinger kernel
[17], [14] is used.
For the UIUC Sports Event dataset, 70 images are used from each class for training
and 60 from each class for testing the two descriptors. The results are obtained over
five random splits of the data. As shown in Figure 4, the BoWL outperforms the LBP

Fig. 4. The mean average classification performance of the LBP and the proposed BoWL descrip-
tors using an SVM classifier with a Hellinger kernel on the three datasets

Fig. 5. The comparative mean average classification performance of the LBP and the BoWL
descriptors on the 15 categories of the Fifteen Scene Categories dataset

by a big margin of over 15%. In fact, on this dataset the BoWL not only outperforms
the LBP, but also provides a decent classification performance on its own.
For both the MIT Scene dataset and the Fifteen Scene Categories dataset, five random
splits of 100 images per class are used for training, and the rest of the images
are used for testing. Again, the BoWL produces a respectable classification performance on
its own, in addition to beating the LBP by a fair margin. Figure 4 displays these results
on the MIT Scene dataset and the Fifteen Scene Categories dataset. The highest classification
rate for the MIT Scene dataset is 91.6% for the BoWL descriptor, and the
classification performance of the BoWL beats that of the LBP by a margin of over 17%.
On the Fifteen Scene Categories dataset, the overall success rate for the BoWL is 80.7%,
which is again over 14% higher than the LBP. This is also shown in Figure 4. In Figure 5,
the category-wise classification rates of the grayscale LBP and the grayscale BoWL
descriptors for all 15 categories of this dataset are shown. The BoWL is shown to
outperform the LBP in 12 of the 15 scene categories.

Table 1. Comparison of the Classification Performance (%) of the Proposed Grayscale BoWL
Descriptor with Other Popular Methods on the Three Image Datasets

Method               UIUC   MIT Scene   15 Scenes
SIFT+GGM [15]        73.4      -           -
OB [18]              76.3      -           -
KSPM [19]             -        -         76.7
KC [20]               -        -         76.7
CA-TM [21]           78.0      -           -
ScSPM [19]            -        -         80.3
SIFT+SC [22]         82.7      -           -
SE [16]               -      83.7          -
HMP [22]             85.7      -           -
C4CC [23]             -      86.7          -
BoWL+SVM (Proposed)  87.7    91.6        80.7

The classification performance of the proposed BoWL descriptor is also compared
with some popular image descriptors and classification techniques as reported by other
researchers. The detailed comparison is shown in Table 1.

4 Conclusion
In this paper, a variation of the LBP descriptor is used with a DCT and bag-of-words
based representation to form the novel Bag of Words-LBP (BoWL) image descriptor.
The contributions of this paper are manifold. First, a new multi-neighborhood LBP is
proposed for small image patches. Second, this multi-neighborhood LBP is coupled
with a DCT-based smoothing to extract features at different scales. Third, these fea-
tures are used with a spatial pyramid image representation and SVM classifier to show
that the BoWL descriptor significantly improves image classification performance over
LBP. Finally, experimental results on three popular scene image datasets show that the
BoWL descriptor also yields classification performance better than or comparable to
several recent methods used by other researchers.

References
1. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with clas-
sification based on feature distributions. Pattern Recognition 29(1), 51–59 (1996)
2. Banerji, S., Sinha, A., Liu, C.: New image descriptors based on color, texture, shape, and
wavelets for object and scene image classification. Neurocomputing (2013)
3. Banerji, S., Sinha, A., Liu, C.: Scene image classification: Some novel descriptors. In: IEEE
International Conference on Systems, Man, and Cybernetics, Seoul, Korea, October 14-17,
pp. 2294–2299 (2012)
4. Sinha, A., Banerji, S., Liu, C.: Novel color gabor-lbp-phog (glp) descriptors for object and
scene image classification. In: The Eighth Indian Conference on Vision, Graphics and Image
Processing, Mumbai, India, December 16-19, p. 58 (2012)

5. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories.
In: Conference on Computer Vision and Pattern Recognition, pp. 524–531 (2005)
6. Yang, J., Jiang, Y., Hauptmann, A., Ngo, C.: Evaluating bag-of-visual-words representations
in scene classification. In: Multimedia Information Retrieval, pp. 197–206 (2007)
7. Banerji, S., Verma, A., Liu, C.: Novel color LBP descriptors for scene and image texture
classification. In: 15th International Conference on Image Processing, Computer Vision, and
Pattern Recognition, Las Vegas, Nevada, July 18-21, pp. 537–543 (2011)
8. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of
Computer Vision 60(2), 91–110 (2004)
9. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification.
In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 490–503.
Springer, Heidelberg (2006)
10. Zhu, C., Bichot, C., Chen, L.: Multi-scale color local binary patterns for visual object classes
recognition. In: International Conference on Pattern Recognition, Istanbul, Turkey, August
23-26, pp. 3065–3068 (2010)
11. Gu, J., Liu, C.: Feature local binary patterns with application to eye detection. Neurocom-
puting (2013)
12. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern
Recognition, New York, NY, USA (2006)
13. Sivic, J., Zisserman, A.: Video google: a text retrieval approach to object matching in videos.
In: Ninth IEEE International Conference on Computer Vision, pp. 1470–1477 (2003)
14. Vedaldi, A., Fulkerson, B.: Vlfeat – an open and portable library of computer vision algo-
rithms. In: The 18th Annual ACM International Conference on Multimedia (2010)
15. Li, L.J., Fei-Fei, L.: What, where and who? classifying event by scene and object recognition.
In: IEEE International Conference in Computer Vision (2007)
16. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the
spatial envelope. International Journal of Computer Vision 42(3), 145–175 (2001)
17. Vapnik, Y.: The Nature of Statistical Learning Theory. Springer (1995)
18. Li, L.J., Su, H., Xing, E.P., Fei-Fei, L.: Object bank: A high-level image representation for
scene classification & semantic feature sparsification. In: Neural Information Processing Sys-
tems, Vancouver, Canada (December 2010)
19. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding
for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition,
Singapore, December 4-6, pp. 1794–1801 (2009)
20. Van Gemert, J., Veenman, C., Smeulders, A., Geusebroek, J.M.: Visual word ambiguity.
IEEE Transactions on Pattern Analysis and Machine Intelligence 32(7), 1271–1283 (2010)
21. Niu, Z., Hua, G., Gao, X., Tian, Q.: Context aware topic model for scene recognition. In:
IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June
16-21, pp. 2743–2750 (2012)
22. Bo, L., Ren, X., Fox, D.: Hierarchical matching pursuit for image classification: Architecture
and fast algorithms. In: Advances in Neural Information Processing Systems (December
2011)
23. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via pLSA. In: Leonardis,
A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 517–530. Springer,
Heidelberg (2006)
Accurate Scale Factor Estimation
in 3D Reconstruction

Manolis Lourakis and Xenophon Zabulis

Institute of Computer Science,


Foundation for Research and Technology - Hellas (FORTH)
Vassilika Vouton, P.O.Box 1385, GR 711 10, Heraklion, Crete, Greece

Abstract. A well-known ambiguity in monocular structure from motion
estimation is that 3D reconstruction is possible up to a similarity trans-
formation, i.e. an isometry composed with isotropic scaling. To resolve this
ambiguity, it is commonly suggested to manually measure an absolute
distance in the environment and then use it to scale a reconstruction
accordingly. In practice, however, it is often the case that such a mea-
surement cannot be performed with sufficient accuracy, compromising
certain uses of a 3D reconstruction that require the acquisition of true
Euclidean measurements. This paper studies three alternative techniques
for obtaining estimates of the scale pertaining to a reconstruction and
compares them experimentally with the aid of real and synthetic data.

Keywords: structure from motion, scale ambiguity, pose estimation.

1 Introduction

Structure from motion with a single camera aims at recovering both the 3D
structure of the world and the motion of the camera used to photograph it.
Without any external knowledge, this process is subject to the inherent scale
ambiguity [9,17,5], which consists in the fact that the recovered 3D structure
and the translational component of camera motion are defined up to an unknown
scale factor which cannot be determined from images alone. This is because if a
scene and a camera are scaled together, this change would not be discernible in
the captured images. However, in applications such as robotic manipulation or
augmented reality which need to interact with the environment using Euclidean
measurements, the scale of a reconstruction has to be known quite accurately.
Albeit important, scale estimation is an often overlooked step by structure
from motion algorithms. It is commonly suggested that scale should be estimated
by manually measuring a single absolute distance in the scene and then using
it to scale a reconstruction to its physical dimensions [5,12]. In practice, there
are two problems associated with such an approach. The first is that it favors
certain elements of the reconstruction, possibly biasing the estimated scale. The
second, and more important, is that the distance in question has to be measured

Work funded by the EC FP7 programme under grant no. 270138 DARWIN.


accurately in the world and then correctly associated with the corresponding dis-
tance in the 3D reconstruction. Such a task can be quite difficult to perform and
is better suited to large-scale reconstructions for which the measurement error
can be negligible compared to the distance being measured. However, measuring
distances for objects at the centimeter scale has to be performed with extreme
care and is therefore remarkably challenging. For example, [1] observes that a
modeling error of 1mm in the scale of a coke can gives rise to a depth estimation
error of up to 3cm at a distance of 1m from the camera, which is large enough
to cause problems to a robotic manipulator attempting to grasp the object.
This work investigates three techniques for obtaining reliable scale estimates
pertaining to a monocular 3D reconstruction and evaluates them experimen-
tally. These techniques differ in their required level of manual intervention, their
flexibility and accuracy. Section 2 briefly presents our approach for obtaining a
reconstruction whose scale is to be estimated. Scale estimation techniques are
detailed in Sections 3-5 and experimental results from their application to real
and synthetic datasets are reported in Sect. 6. The paper concludes in Sect. 7.

2 Obtaining a 3D Reconstruction
In this work, 3D reconstruction refers to the recovery of sparse sets of points
from an object’s surface. To obtain a complete and view independent represen-
tation, several images depicting an object from multiple unknown viewpoints are
acquired with a single camera. These images are used in a feature-based struc-
ture from motion pipeline to estimate the interimage camera motion and recover
a corresponding 3D point cloud [16]. This pipeline relies on the detection and
matching of SIFT keypoints across images which are then reconstructed in 3D.
The 3D coordinates are complemented by associating with each reconstructed
point a SIFT feature descriptor [11], which captures the local surface appearance
in the point’s vicinity. A SIFT descriptor is available from each image where a
particular 3D point is seen. Thus, we select as its most representative descriptor
the one originating from the image in which the imaged surface is most frontal
and close enough to the camera. This requires knowledge of the surface normal,
which is obtained by gathering the point’s 3D neighbours and robustly fitting to
them a plane. As will become clear in the following, SIFT descriptors permit the
establishment of putative correspondences between an image and an object’s 3D
geometry. Combined together, 3D points and SIFT descriptors of their image
projections constitute an object’s representation.

3 Scale Estimation from Known Object Motion


The simplest approach to estimate an object’s scale employs a single static cam-
era to acquire two views of the object in different poses with known relative
displacement. Then, the pose of the object in each view is determined. Since
the camera is static, the two poses estimated can be used to compute the ob-
ject’s displacement up to the unknown scale. The sought scale is simply the

ratio of known over recovered displacement. To ease the task of measuring 3D
displacements, the object is placed so that it is aligned with the checkers of
a checkerboard grid. Such a guided placement allows the distance between
the object's locations to be known through the actual size of each checker. An
advantage of the known motion approach is that it does not involve a special
camera setup. On the other hand, it suffers from two disadvantages. First, it
relies on careful object placement on the grid and is, therefore, susceptible to
human error. Second, it treats images separately and thus does not avail any
opportunities for combining them and in so doing increase the overall accuracy.
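A minimal sketch of this ratio computation, assuming the two object poses (R1, t1) and (R2, t2) have already been estimated in the camera frame and that the known displacement comes from the checkerboard-guided placement; all names are illustrative.

```python
# With a static camera, the object origin in camera coordinates is the
# translation of each estimated pose, so the recovered (unscaled) displacement
# between the two placements is |t2 - t1|.
import numpy as np

def scale_from_known_motion(t1, t2, known_displacement):
    recovered = np.linalg.norm(np.asarray(t2, float) - np.asarray(t1, float))
    return known_displacement / recovered

# e.g. if the object was moved by exactly two 32 mm checkers:
# scale = scale_from_known_motion(t1, t2, known_displacement=64.0)
```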
A key ingredient of the method outlined above is the estimation of the pose of a
known object in a single image, therefore more details regarding this computation
are provided next. Given an image of the object, SIFT keypoints are detected in
it and then matched against those contained in its reconstructed representation
(cf. Sect. 2). The invariance of SIFT permits the reliable identification of features
that have undergone large affine distortions in the image. The established cor-
respondences are used to associate the 2D image locations of detected features
with the 3D coordinates of their corresponding points on the object’s surface. The
procedure adopted for point matching is the F2P strategy from [8]. Compared to
the standard test defined by the ratio of the distances to the closest and second
closest neighbors [11], F2P was found to yield fewer erroneous matches. An impor-
tant detail concerns the quantification of distances among SIFT descriptors, which
are traditionally computed with the Euclidean (L2 ) norm. Considering that the
SIFT descriptor is a weighted histogram of gradient orientations, improvements
in matching are attained by substituting L2 with histogram norms such as the
Chi-squared (χ2 ) distance [15]. This is a histogram distance that takes into ac-
count the fact that in many natural histograms, the difference between large bins
is less important than the difference between small bins and should therefore be
reduced. Keypoint matching provides a set of 3D-2D correspondences from which
pose is estimated as explained below.
Pose estimation concerns determining the position and orientation of an object
with respect to a camera given the camera intrinsics and a set of n correspon-
dences between known 3D object points and their image projections. This prob-
lem, also known as the Perspective-n-Point (PnP) problem, is typically solved
using non-iterative approaches that involve small, fixed-size sets of correspon-
dences. For example, the basic case for triplets (n = 3, known as the P3P
problem), has been studied in [3] whereas other solutions were later proposed
in [2,7]. P3P is known to admit up to four different solutions, whereas in practice
it usually has just two. Our approach for pose estimation in a single image uses a
set of 2D-3D point correspondences to compute a preliminary pose estimate and
then refine it iteratively. This is achieved by embedding the P3P solver [3] into
a RANSAC [2] framework and computing an initial pose estimate along with
a classification of correspondences into inliers and outliers. The pose computed
by RANSAC is next refined to take into account all inlying correspondences by
minimizing a non-linear cost function corresponding to their total reprojection
error. The minimization is made more immune to noise caused by mislocalized

image points by substituting the squared reprojection error with a robust cost
function (i.e., M-estimator). Our pose estimation approach is detailed in [10].
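For illustration only, a similar RANSAC-based pose computation can be obtained with OpenCV's generic PnP solver; this is not the authors' implementation [10] and omits their robust M-estimator refinement.

```python
# Sketch: estimate the object pose from 3D-2D correspondences with a
# RANSAC-wrapped PnP solver; K is the camera matrix, dist the distortion
# coefficients (may be None).
import numpy as np
import cv2

def estimate_pose(object_pts, image_pts, K, dist=None):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(object_pts, np.float64),   # Nx3 points of the representation
        np.asarray(image_pts, np.float64),    # Nx2 detected keypoint locations
        K, dist, reprojectionError=3.0)
    return (rvec, tvec, inliers) if ok else None
```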

4 Scale Estimation from 3D Reconstruction and Absolute Orientation
Another way of approaching the scale estimation problem is to resort to stereo.
More specifically, a strongly calibrated stereo pair is assumed and two-view tri-
angulation is employed to estimate the 3D coordinates of points on the surface
of the object. These points are then matched to points from the object’s rep-
resentation. The scale factor is estimated by finding the similarity aligning the
triangulated 3D points with their counterparts from the representation. This is
achieved by solving the absolute orientation problem, which also accounts for
the unknown scale. To safeguard against possible outliers, the calculation is em-
bedded in a RANSAC robust estimation scheme that seeks the transformation
aligning together a fraction of the available 3D matches. More details regarding
the solution of the absolute orientation problem are given next.
Starting with a stereo image pair depicting the object whose scale is to be
estimated, sparse correspondences between the two images are established. This
is achieved by detecting SIFT features in each image and then matching them
through their descriptors. For each pair of corresponding points, stereo trian-
gulation is used to estimate the 3D coordinates of the imaged world point [4].
Knowledge of the extrinsic calibration of the stereo rig permits the triangulated
points to be expressed in their true scale. Further to their matching in the stereo
images, SIFT descriptors are also matched against the descriptors stored in the
representation. In other words, three-way correspondences are established be-
tween object points in the two images and the representation. In this manner,
the triangulated points are associated with 3D points from the object’s represen-
tation. The sought scale factor is then computed by determining the similarity
between the triangulated 3D points and their counterparts, as follows.
Let {Mi } be a set of n ≥ 3 reference points from the representation expressed
in an object-centered reference frame and {Ni } a set of corresponding camera-
space triangulated points. Assume also that the two sets of points are related by
a similarity transformation as Ni = λ R Mi + t, where λ is the sought scale
factor and R, t a rotation matrix and translation vector defining an isometry. As
shown by Horn [6], absolute orientation can be solved using at least three non-
collinear reference points and singular value decomposition (SVD). The solution
 
proceeds by defining the centroids $\bar{M}$ and $\bar{N}$ and the locations $\{M'_i\}$ and $\{N'_i\}$
of the 3D points relative to them:

$$\bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i, \qquad \bar{N} = \frac{1}{n}\sum_{i=1}^{n} N_i, \qquad M'_i = M_i - \bar{M}, \qquad N'_i = N_i - \bar{N}.$$

Forming the cross-covariance matrix $C = \sum_{i=1}^{n} N'_i\, {M'_i}^{\top}$, the rotational component
of the similarity is directly computed from $C$'s decomposition $C = U \Sigma V^{\top}$
as $R = V U^{\top}$. The scale factor is given by

$$\lambda = \sqrt{\sum_{i=1}^{n} \|N'_i\|^2 \Big/ \sum_{i=1}^{n} \|M'_i\|^2}, \qquad (1)$$

whereas the translation follows as $t = \bar{N} - \lambda R \bar{M}$.
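A compact numpy sketch of this closed-form alignment is given below; it follows the standard SVD (Kabsch/Horn) construction, so the matrix conventions may differ in transposition from the notation above, and the RANSAC wrapper mentioned in the text is omitted. M and N are assumed to be n × 3 arrays of corresponding representation and triangulated points.

```python
# Closed-form similarity (scale, rotation, translation) between two matched
# 3D point sets: centre both sets, align with an SVD, and take the scale as
# the ratio of the centred sums of squares (cf. Eq. (1)).
import numpy as np

def similarity_from_correspondences(M, N):
    M, N = np.asarray(M, float), np.asarray(N, float)
    M_bar, N_bar = M.mean(axis=0), N.mean(axis=0)
    Mp, Np = M - M_bar, N - N_bar                      # centred point sets
    U, _, Vt = np.linalg.svd(Mp.T @ Np)                # SVD of the correlation
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T            # rotation mapping M' to N'
    lam = np.sqrt((Np ** 2).sum() / (Mp ** 2).sum())   # scale factor
    t = N_bar - lam * (R @ M_bar)                      # translation
    return lam, R, t
```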


The primary advantages of this method over the one of Sect. 3 are that it does
not require a particular object positioning strategy nor the measurement of any
distances. Any object placement, provided that it is well imaged and avails suf-
ficient correspondences, is suitable for applying the method. On the other hand,
3D reconstruction of points based on binocular stereo is often error-prone [13]
and such inaccuracies can significantly affect the final estimation result.

5 Scale Estimation from Binocular Reprojection Error


Similarly to that in Sect. 4, this method also employs an extrinsically calibrated
stereo pair. Given an object’s 3D representation, its scale is determined by con-
sidering the reprojection error pertaining to the object’s projections in the two
images. Using the same coordinate system for both cameras, the reprojection er-
ror is expressed by an objective function which also includes scale in addition to
rotation and translation. Then, the object’s scale and pose are jointly estimated
by minimizing the total reprojection error in both images, as follows.
The method starts by detecting SIFT keypoints in both stereo images. Inde-
pendently for each image, the extracted keypoints are matched against the points
of the representation through their descriptors. For each image, monocular pose
estimation is carried out as described in Sect. 3 to determine the object’s pose in
it. Knowledge of the camera extrinsics allows us to express both of these poses in
the same coordinate system, for example that of the left camera. Indeed, if the
pose of the object in the left camera is defined by R and t, its pose in the right
camera equals Rs R and Rs t + ts , where Rs and ts correspond to the pose of the
right camera with respect to the left. Due to the stereo rig being rigid, Rs and
ts remain constant and can be estimated offline via extrinsic calibration. The
most plausible scale and left camera pose are determined via the minimization
of the cumulative reprojection error in both images. The binocular reprojection
error consists of two additive terms, one for each image. More specifically, de-
noting the intrinsics for the left and right images by KL and KR , the binocular
reprojection error for n points in the left image and m in the right is defined as:


$$\sum_{i=1}^{n} d\big(K_L \cdot [\lambda R(r) \mid t] \cdot M_i,\; m^L_i\big)^2 \;+\; \sum_{j=1}^{m} d\big(K_R \cdot [\lambda R_s R(r) \mid R_s t + t_s] \cdot M_j,\; m^R_j\big)^2, \qquad (2)$$

where λ, t and R(r) are respectively the sought scale factor, translation vector
and rotation matrix parameterized using the Rodrigues rotation vector r, KL ·

[λ R(r) | t] · Mi is the projection of homogeneous point Mi in the left image,


KR · [λ Rs R(r) | Rs t + ts ] · Mj is the projection of homogeneous point Mj in the
right image, m^L_i and m^R_j are respectively the 2D points corresponding to Mi and
Mj in the left and right images and d(x, y) denotes the reprojection error, i.e.
the Euclidean distance between the image points represented by vectors x and y.
The expression in (2) can be extended to an arbitrary number of cameras and is
minimized with respect to λ, r, t with the Levenberg-Marquardt non-linear least
squares algorithm, employing only the inliers of the two monocular estimations
to ensure resilience to outliers. Similarly to the monocular case, an M-estimate
of the reprojection error is minimized rather than the squared Euclidean norm.
One possible initialization is to start the minimization from the monocular pose
computed for the left camera. Still, this initialization does not treat images
symmetrically as it gives more importance to the left image. Therefore, if the
pose with respect to the left camera has been computed with less precision than
that in the right, there is a risk of the binocular refinement also converging to a
suboptimal solution. To remedy this, the refinement scheme is extended by also
using the right image as reference and refining pose in it using both cameras,
assuming a constant transformation from the left to the right camera. Then, the
pose yielding the smaller overall binocular reprojection error is selected.
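The sketch below shows one possible implementation of this joint refinement with scipy, standing in for the Levenberg–Marquardt/M-estimator code of the authors; KL, KR are the intrinsics, Rs, ts the fixed right-camera extrinsics, (ML, mL) and (MR, mR) the inlying 3D–2D matches of the left and right images, and a Huber loss plays the role of the robust cost.

```python
# Jointly refine scale, rotation and translation by minimising the binocular
# reprojection error of Eq. (2) over the parameter vector [lambda, r, t].
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(K, R, t, M):
    X = (R @ M.T).T + t               # 3D points in the camera frame
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]       # perspective division

def residuals(p, KL, KR, Rs, ts, ML, mL, MR, mR):
    lam, r, t = p[0], p[1:4], p[4:7]
    R = Rotation.from_rotvec(r).as_matrix()
    res_l = project(KL, lam * R, t, ML) - mL                   # left-image term
    res_r = project(KR, lam * (Rs @ R), Rs @ t + ts, MR) - mR  # right-image term
    return np.concatenate([res_l.ravel(), res_r.ravel()])

def refine(p0, KL, KR, Rs, ts, ML, mL, MR, mR):
    sol = least_squares(residuals, p0, args=(KL, KR, Rs, ts, ML, mL, MR, mR),
                        loss='huber', f_scale=1.0)
    return sol.x                      # [lambda, Rodrigues rotation, translation]
```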
This method has several attractive features: It does not require a particular
object placement strategy. There is no need for a short baseline as correspon-
dences are not established across the two views but, rather, between each indi-
vidual view and the reconstruction. Because no attempt is made to reconstruct
in 3D, the experimental setup is relieved from the constraints related to the
binocular matching of points and the inaccuracies associated with their recon-
struction. A direct consequence of this is that the two cameras may have very
different viewpoints. In fact, employing large baselines favours the method as it
better constrains the problem of scale factor estimation.

6 Experiments

Each of the three methods previously described provides a means for computing
a single estimate of the pursued scale factor through monocular or binocular
measurements. It is reasonable to expect that such estimates will be affected by
various errors, therefore basing scale estimation on a single pair of images should
be avoided. Instead, more accurate estimates can be obtained by employing
multiple images in which the object has been moved to different positions and
collecting the corresponding estimates. Then, the final scale estimate is obtained
by applying a robust location estimator such as their sample median [14]. In the
following, the methods of Sect. 3, 4 and 5 will be denoted as mono, absor and
reproj, respectively. Due to limited space, two sets of experiments are reported.
An experiment with synthetic images was conducted first, in which the base-
line of the stereo pair imaging the target object was varied. A set of images
was generated, utilizing a custom OpenGL renderer. A 1:1 model of a textured
rectangular cuboid (sized 45 × 45 × 90 mm3), represented by a 3D triangle mesh

(with 14433 vertices & 28687 faces), was rendered in 59 images. These images
correspond to a virtual camera (1280 × 960 pixels, 22.2◦ × 16.7◦ FOV) circling
the object in a full circle of radius 500 mm perpendicular to its major
symmetry axis. At all simulated camera locations, the optical axis was oriented
so that it pointed towards the object’s centroid. The experiment was conducted
in 30 conditions, each employing an increasingly larger baseline. In condition n,
the ith stereo pair comprised images i and i + n. Hence, the baseline increment
in each successive condition was ≈ 52mm. In Fig. 1(a) and (b), an image from
the experiments and the absolute error in the estimated scale factor are shown.
Notice that the plot for absor terminates early at a baseline of ≈ 209mm. This
is because as the baseline length increases, the reduction in overlap between the
two images of the stereo pair results in fewer correspondences. In conditions
of the experiment corresponding to larger baselines, some stereo pairs did not
provide enough correspondences to support a reliable estimate by absor. As a
result, the estimation error for these pairs was overly large.


Fig. 1. Experiments. Left to right: (a) sample image from the experiment with synthetic
stereo images and (b) scale factor estimation error (in milli scale), (c) sample image
from the experiment with real images and (d) translational pose estimation error.

The three methods are compared next with the aid of real images. Consider-
ing that the task of directly using the estimated scales to assess their accuracy is
cumbersome, it was chosen to compare scales indirectly through pose estimation.
More specifically, an arbitrarily scaled model of an object was re-scaled with the
estimates provided by mono, absor and reproj. Following this, these re-scaled
models were used for estimating poses of the object as explained in Sect. 3, which
were then compared with the true poses. In this manner, the accuracy of a scale
estimate is reflected on the accuracy of the translational components of the esti-
mated poses. To obtain ground truth for object poses, a checkerboard was used
to guide the placement of the object that was systematically moved at locations
aligned with the checkers. The camera pose with respect to the checkerboard was
estimated through conventional extrinsic calibration, from which the locations of
the object on the checkerboard were transformed to the camera reference frame.
The object and the experimental setup are shown in Fig. 1(c). Note that these
presumed locations include minute calibration inaccuracies as well as human er-
rors in object placement. The object was placed and aligned upon every checker
of the 8 × 12 checkerboard in the image. The checkerboard was at a distance
of approximately 1.5 m from the camera, with each checker being 32 × 32 mm2 .

Camera resolution was 1280 × 960 pixels, and its FOV was 16◦ × 21◦ . The mean
translational error in these 96 trials was 1.411 mm with a deviation of 0.522 mm
for mono, 1.342 mm with a deviation of 0.643 mm for absor and 0.863 mm with
a deviation of 0.344 mm for reproj. The mean translational errors of the pose
estimates are shown graphically in Fig. 1(d).

7 Conclusion

The paper has presented one monocular and two binocular methods for scale
factor estimation. Binocular methods are preferable due to their flexibility with
respect to object placement. Furthermore, the binocular method of Sect. 5 is
applicable regardless of the size of the baseline and was shown to be the most
accurate, hence it constitutes our recommended means for scale estimation.

References
1. Collet Romea, A., Srinivasa, S.: Efficient Multi-View Object Recognition and Full
Pose Estimation. In: Proc. of ICRA 2010 (May 2010)
2. Fischler, M., Bolles, R.: RanSaC: A Paradigm for Model Fitting with Applications
to Image Analysis and Automated Cartography. In: CACM, vol. 24, pp. 381–395
(1981)
3. Grunert, J.: Das pothenotische Problem in erweiterter Gestalt nebst über seine
Anwendungen in Geodäsie. Grunerts Archiv für Mathematik und Physik (1841)
4. Hartley, R., Sturm, P.: Triangulation. CVIU 68(2), 146–157 (1997)
5. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn.
Cambridge University Press (2004) ISBN: 0521540518
6. Horn, B.: Closed-form Solution of Absolute Orientation Using Unit Quaternions.
J. Optical Soc. Am. A 4(4), 629–642 (1987)
7. Kneip, L., Scaramuzza, D., Siegwart, R.: A Novel Parametrization of the
Perspective-three-Point Problem for a Direct Computation of Absolute Camera
Position and Orientation. In: Proc. of CVPR 2011, pp. 2969–2976 (2011)
8. Li, Y., Snavely, N., Huttenlocher, D.P.: Location Recognition Using Prioritized
Feature Matching. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010,
Part II. LNCS, vol. 6312, pp. 791–804. Springer, Heidelberg (2010)
9. Longuet-Higgins, H.: A Computer Algorithm for Reconstructing a Scene From Two
Projections. Nature 293(5828), 133–135 (1981)
10. Lourakis, M., Zabulis, X.: Model-Based Pose Estimation for Rigid Objects. In:
Chen, M., Leibe, B., Neumann, B. (eds.) ICVS 2013. LNCS, vol. 7963, pp. 83–92.
Springer, Heidelberg (2013)
11. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Com-
put. Vis. 60(2), 91–110 (2004)
12. Moons, T., Gool, L.V., Vergauwen, M.: 3D Reconstruction from Multiple Images
Part 1: Principles. Found. Trends. Comput. Graph. Vis. 4(4), 287–404 (2009)
13. Nistér, D., Naroditsky, O., Bergen, J.: Visual Odometry for Ground Vehicle Ap-
plications. J. Field Robot. 23, 3–20 (2006)
14. Rousseeuw, P.: Least Median of Squares Regression. J. Am. Stat. Assoc. 79,
871–880 (1984)

15. Rubner, Y., Puzicha, J., Tomasi, C., Buhmann, J.: Empirical Evaluation of Dis-
similarity Measures for Color and Texture. Comput. Vis. Image Und. 84(1), 25–43
(2001)
16. Snavely, N., Seitz, S., Szeliski, R.: Photo Tourism: Exploring Photo Collections in
3D. ACM Trans. Graph. 25(3), 835–846 (2006)
17. Szeliski, R., Kang, S.: Shape Ambiguities in Structure from Motion. In: Buxton,
B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 709–721. Springer,
Heidelberg (1996)
Affine Colour Optical Flow Computation

Ming-Ying Fan1 , Atsushi Imiya2 , Kazuhiko Kawamoto3, and Tomoya Sakai4


1
School of Advanced Integration Science, Chiba University
2
Institute of Management and Information Technologies, Chiba University
3
Academic Link Center, Chiba University
Yayoicho 1-33, Inage-ku, Chiba 263-8522, Japan
4
Department of Computer and Information Sciences, Nagasaki University
Bunkyo-cho 1-14, Nagasaki 852-8521, Japan

Abstract. The purpose of this paper is three-fold. First, we develop
an algorithm for the computation of a locally affine optical flow field from
multichannel images as an extension of the Lucas-Kanade (LK) method.
The classical LK method solves a system of linear equations assuming
that the flow field is locally constant. Our method solves a collection
of systems of linear equations assuming the flow field is locally affine.
For autonomous navigation in a real environment, the adaptation of the
motion and image analysis algorithm to illumination changes is a fun-
damental problem, because illumination changes in an image sequence
yield counterfeit obstacles. Second, we evaluate the colour channel se-
lection of colour optical flow computation. By selecting an appropriate
colour channel, it is possible to avoid these counterfeit obstacle regions
in the snapshot image in front of a vehicle. Finally, we introduce an
evaluation criterion for the computed optical flow field without ground
truth.

1 Introduction
The theoretical aim of this paper is to introduce the affine tracker [10] for multi-
channel temporal image sequences. Furthermore, we also introduce an evaluation
criterion for the computed optical flow field without ground truth.
Optical flow provides fundamental features for motion analysis and motion
understanding. In ref. [10], using local stationariness of visual motion, a linear
method for motion tracking was introduced. The colour optical flow method
computes optical flow from a multichannel image sequence, assuming the multi-
channel optical flow constraint that, in a short duration, the illumination of an
image in each channel is locally constant [4]. This assumption is an extension of
the classical optical flow constraint to the multichannel case. This colour optical
flow constraint derives a multichannel version of the KLT tracker [8].
The colour optical flow constraint yields an overdetermined or redundant sys-
tem of linear equations [4], although the usual optical flow constraint for a single
channel image yields a singular linear equation. Therefore, the colour optical
flow constraint provides a simple method to compute optical flow without ei-
ther regularisation [1] or multiresolution analysis [2]. The other way to use a
multichannel image is to unify the features of each channel.



Fig. 1. Order of planar vector fields. (a) A constant vector field u(x, y) = (1, 1)⊤. (b)
A linear vector field u(x, y) = (x, x)⊤.

Barron and Klette [7] experimentally examined combinations of channels for
the accurate computation of optical flow using the Golland and Bruckstein method
[4]. Barron and Klette concluded that the Y-channel image has advantages for
accurate and robust computation. Mileva et al. [9] showed that the H-channel
image has advantages for robust optical flow computation in an illumination-changing
environment. Andrews and Lovell [5] examined combinations of colour models
and optical-flow computation algorithms. In ref. [6], van de Weijer and Gevers
examined photometric invariants for optical flow computation. These references
were devoted to accurate and robust optical flow computation from multichannel
images. For the computation of photometric-invariant optical flow, the
performance of both single-channel and multichannel methods was compared
[7,6]. They experimentally showed that the effects of the brightness component and
the colour components on the detected optical flow vectors are different.
In ref. [10], using local stationarity of visual motion, a linear method for
motion tracking was introduced. As a sequel to refs. [10,6,7], we develop a locally
affine method of motion tracking for multichannel image sequences.
Figure 1 shows locally constant and linear displacement fields in a region. The
LK method derives an optical flow field assuming that the field is constant in a
windowed region, as shown in Fig. 1(a). Our method assumes that the optical
flow field linearly depends on the location of pixels, as shown in Fig. 1(b). This
assumption allows us to compute a globally smooth optical flow field.

2 Colour Optical Flow

A colour image is expressed as a triplet of gray-valued images such that $f^{\alpha} =
(f^{\alpha 1}, f^{\alpha 2}, f^{\alpha 3})^{\top}$, where $\alpha$ is the index to identify the colour space. The triplets
$f^{\alpha}$ and $f^{\beta}$, which are derived in different colour spaces, are combined by a
one-to-one transform as $f^{\alpha} = \Phi^{\alpha\beta}(f^{\beta})$, where $f^{\alpha i} = \varphi^{\alpha\beta}_i(f^{\beta 1}, f^{\beta 2}, f^{\beta 3})$ for
$i = 1, 2, 3$. For a temporal sequence of colour images, we have the relation

$$\frac{d}{dt} f^{\alpha} = \nabla\Phi^{\alpha\beta}\, \frac{d}{dt} f^{\beta}, \qquad (1)$$

where $\nabla\Phi^{\alpha\beta}$ is the Jacobian matrix of $\Phi^{\alpha\beta}$, since

$$\frac{d f^{\alpha i}}{dt} = \frac{\partial \varphi^{\alpha\beta}_i}{\partial f^{\beta 1}}\frac{d f^{\beta 1}}{dt} + \frac{\partial \varphi^{\alpha\beta}_i}{\partial f^{\beta 2}}\frac{d f^{\beta 2}}{dt} + \frac{\partial \varphi^{\alpha\beta}_i}{\partial f^{\beta 3}}\frac{d f^{\beta 3}}{dt}. \qquad (2)$$

Since the matrix ∇Φαβ is invertible, we have the following Lemma.


Lemma 1. $\frac{d}{dt} f^{\alpha} = 0$ iff $\frac{d}{dt} f^{\beta} = 0$, and $\frac{d}{dt} f^{\alpha} = 0$ iff $\frac{d}{dt} f^{\alpha i} = 0$ for all $i = 1, 2, 3$.

We call

$$\frac{d}{dt} f^{\alpha} = \left(\frac{d}{dt} f^{\alpha 1}, \frac{d}{dt} f^{\alpha 2}, \frac{d}{dt} f^{\alpha 3}\right)^{\top} = 0 \qquad (3)$$

the colour brightness consistency. Lemma 1 implies that colour brightness consistency
is satisfied for all colour spaces. Therefore, hereafter, we set the spatio-temporal
multichannel colour image as $f(x) = (f^1(x), f^2(x), f^3(x))^{\top}$. Then, equation (3) becomes

$$J u + f_t = 0 \qquad (4)$$

for $f_t = (f_t^1, f_t^2, f_t^3)^{\top}$ and the optical flow $u = (\frac{dx}{dt}, \frac{dy}{dt})^{\top} = (\dot{x}, \dot{y})^{\top} = \dot{x}$ of the
point $x$, where $J = (\nabla f^1, \nabla f^2, \nabla f^3)^{\top}$.

3 Colour Affine Method

Assuming that the optical flow vector u is constant in the neighbourhood Ω(x)
of point x, the optical flow vector is the minimiser of

$$E_0 = \frac{1}{2} \cdot \frac{1}{|\Omega(x)|}\int_{\Omega(x)} |J u + f_t|^2\, dx = \frac{1}{2} u^{\top}\bar{G}u + \bar{e}^{\top}u + \frac{1}{2}\bar{c}, \qquad (5)$$

where

$$\bar{G} = \frac{1}{|\Omega(x)|}\int_{\Omega(x)} J^{\top}J\, dx = \sum_{i=1}^{3} G_i, \qquad G_i = \frac{1}{|\Omega(x)|}\int_{\Omega(x)} \nabla f^i\, \nabla {f^i}^{\top}\, dx, \qquad (6)$$

$$\bar{e} = \frac{1}{|\Omega(x)|}\int_{\Omega(x)} J^{\top}f_t\, dx = \sum_{i=1}^{3} e_i, \qquad e_i = \frac{1}{|\Omega(x)|}\int_{\Omega(x)} f_t^i\, \nabla f^i\, dx, \qquad (7)$$

$$\bar{c} = \frac{1}{|\Omega(x)|}\int_{\Omega(x)} |f_t|^2\, dx = \sum_{i=1}^{3} c_i, \qquad c_i = \frac{1}{|\Omega(x)|}\int_{\Omega(x)} (f_t^i)^2\, dx. \qquad (8)$$

Equation (5) implies that the solution of the system of linear equations

$$\frac{\partial E_0}{\partial u} = \bar{G}u + \bar{e} = 0 \qquad (9)$$

is the optical flow vector $u$ of the point $x$.
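For illustration, a per-pixel sketch of this solve is given below (hypothetical variable names): fx, fy, ft hold the spatial and temporal derivative images of the three channels, (i, j) is the pixel and w the half-width of the window Ω(x); the 1/|Ω(x)| factor cancels when solving.

```python
# Accumulate G-bar and e-bar over the window from the three channel gradients
# (Eqs. (6)-(7)) and solve the 2x2 system of Eq. (9) for the constant flow.
import numpy as np

def constant_colour_flow(fx, fy, ft, i, j, w):
    G = np.zeros((2, 2))
    e = np.zeros(2)
    for c in range(3):                                            # colour channels
        gx = fx[c][i - w:i + w + 1, j - w:j + w + 1].ravel()
        gy = fy[c][i - w:i + w + 1, j - w:j + w + 1].ravel()
        gt = ft[c][i - w:i + w + 1, j - w:j + w + 1].ravel()
        J = np.stack([gx, gy], axis=1)                            # per-channel gradients
        G += J.T @ J
        e += J.T @ gt
    # u solves G u + e = 0 in the least-squares sense over the window.
    return -np.linalg.lstsq(G, e, rcond=None)[0]
```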

If the displacement is locally affine such that $u = Dx + d$, where $D$ and $d$
are a $2 \times 2$ matrix and a two-dimensional vector, respectively, we estimate $D$
and $d$ which minimise the criterion

$$E_1 = \frac{1}{2} \cdot \frac{1}{|\Omega(x)|}\int_{\Omega(x)} |J(Dx + d) + f_t|^2\, dy = \frac{1}{2} \cdot \frac{1}{|\Omega(x)|}\int_{\Omega(x)} \left| \big(J,\; x^{\top} \otimes J\big)\begin{pmatrix} d \\ \mathrm{vec}\,D \end{pmatrix} + f_t \right|^2 dy \qquad (10)$$

as an extension of eq. (5).¹


Solving the system of linear equations

$$\frac{\partial E_1}{\partial (d^{\top}, (\mathrm{vec}\,D)^{\top})^{\top}} = \begin{pmatrix} \bar{G} & x^{\top} \otimes \bar{G} \\ x \otimes \bar{G} & (x x^{\top}) \otimes \bar{G} \end{pmatrix}\begin{pmatrix} d \\ \mathrm{vec}\,D \end{pmatrix} + \begin{pmatrix} \bar{e} \\ x \otimes \bar{e} \end{pmatrix} = 0 \qquad (11)$$

for the point $x$, we have an affine optical flow field vector $u$ at the point $x$.
Since $\mathrm{rank}\,\bar{G} \le 2$ and $\mathrm{rank}(x x^{\top}) = 1$,

$$\begin{pmatrix} d \\ \mathrm{vec}\,D \end{pmatrix} = - \begin{pmatrix} \bar{G} & x^{\top} \otimes \bar{G} \\ x \otimes \bar{G} & (x x^{\top}) \otimes \bar{G} \end{pmatrix}^{\dagger}\begin{pmatrix} \bar{e} \\ x \otimes \bar{e} \end{pmatrix}. \qquad (12)$$
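A small sketch of this affine solve, under the same assumptions as the constant-flow sketch above: G and e are the per-window quantities of Eqs. (6)–(7) and x is the pixel position as a length-2 vector; the vec operation is taken column-major.

```python
# Assemble the 6x6 block system of Eq. (11) with Kronecker products and solve
# it with a pseudo-inverse as in Eq. (12), returning the flow u = D x + d.
import numpy as np

def affine_colour_flow(G, e, x):
    x = np.asarray(x, float).reshape(2, 1)
    A = np.block([[G,              np.kron(x.T, G)],
                  [np.kron(x, G),  np.kron(x @ x.T, G)]])
    b = np.concatenate([e, np.kron(x.ravel(), e)])
    sol = -np.linalg.pinv(A) @ b
    d, D = sol[:2], sol[2:].reshape(2, 2, order='F')   # vec D stacked column-wise
    return D @ x.ravel() + d
```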

Furthermore, for the accurate computation of D and d, we employ the pyramid-based
method shown in Algorithm 1.

Algorithm 1. Colour Affine TLK tracker with Gaussian Pyramid


Data: u_k^{L+1} := 0, L ≥ 0, l := L
Data: f_k^L, ..., f_k^0
Data: f_{k+1}^L, ..., f_{k+1}^0
Result: optical flow u_k^0

while l ≥ 0 do
    f_{k+1}^l := f_{k+1}^l( · + E(u_k^{l+1}), k + 1);
    compute D_k^l and d_k^l;
    u_k^l := D_k^l x^l + d_k^l;
    l := l − 1
end

In Algorithm 1, for the sampled vector function $f^k_{ij} = f(i, j, k)$, the pyramid
transform $R$ and its dual transform $E$ are expressed as

$$R f^k_{mn} = \sum_{i,j=-1}^{1} w_i w_j\, f^k_{2m-i,\,2n-j}, \qquad E f^k_{mn} = 4\sum_{i,j=-2}^{2} w_i w_j\, f^k_{\frac{m-i}{2},\,\frac{n-j}{2}}, \qquad (13)$$

where $w_{\pm 1} = \frac{1}{4}$ and $w_0 = \frac{1}{2}$, and the summation in $E$ is taken over those $i, j$ for which
$(m - i)/2$ and $(n - j)/2$ are integers.
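A small sketch of these two operators for a single-channel image (apply per channel for colour), with the simplest possible boundary handling; it is not tied to any particular implementation.

```python
# Reduce (R): smooth with the separable 3x3 kernel built from w = (1/4, 1/2, 1/4)
# and keep every second pixel. Expand (E): place samples on the even grid
# positions, smooth with the same kernel and rescale by 4, cf. Eq. (13).
import numpy as np
from scipy.ndimage import convolve

KERNEL = np.outer([0.25, 0.5, 0.25], [0.25, 0.5, 0.25])

def reduce_level(f):
    return convolve(f, KERNEL, mode='nearest')[::2, ::2]

def expand_level(f):
    up = np.zeros((2 * f.shape[0], 2 * f.shape[1]))
    up[::2, ::2] = f
    return 4.0 * convolve(up, KERNEL, mode='nearest')
```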
¹ The matrix equation AXB = C is equivalent to the linear system of equations
(B⊤ ⊗ A) vec X = vec C.

4 Numerical Experiments

Colour Space. There are several standard and nonstandard colour spaces for
the representation of colour images. Selection of the most relevant colour space
is a fundamental issue for colour optical flow computation. We use the following
spaces:
– Primary colour systems: RGB, CMY, XYZ.
– Luminance-chrominance colour systems: YUV, YIQ, YCbCr, HSV, HSL.
– Perceptual colour system: L*a*b*.
– Usual noncorrelated colour system: I1I2I3.
For the selection of the window size in the affine colour optical flow computation,
we evaluate the spatial angle error ψE = arccos(uc , ue ) between the ground
truth uc and the estimation ue for Middlebury sequences.
Figure 2 shows the errors of the computed colour optical flow vectors for various
window sizes. The left and right columns show average angle errors and least mean
square errors, respectively, for the hydra, grove3, dimetrodon, and urban3 sequences.
The results in Fig. 2 suggest that for accurate colour optical flow computation,
we are required to use a window larger than 5 × 5. In Fig. 3, we show the results
of colour optical flow computation using the 7 × 7 window for the Hydra and Grove3
sequences. In Fig. 3, (a) and (d) are the original images, (b) and (e) are
the ground truths of the optical flow fields, and (c) and (f) are the optical
flow fields computed with the 7 × 7 window.
For the flow vector $v(x, y, t) = (u, v)^{\top}$, setting $f'(x, y, t) = f(x - u, y - v, t + 1)$,
we define

$$\mathrm{RMS\ error} = \sqrt{\frac{1}{|\Omega|}\int_{\Omega}\big(f(x, y, t) - f'(x, y, t)\big)^2\, dx\, dy} \qquad (14)$$

for images in the region of interest $\Omega$ at time $t$. We use the sequential error

$$\epsilon(t) = \sqrt{\frac{1}{|\Omega|}\int_{\Omega}\big|u(x, t) - u(x', t + 1)\big|^2\, dx} \qquad (15)$$

for sequential computation without ground truth.
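One possible reading of this criterion is sketched below (whether the original definition includes the square root is not fully recoverable from the text): the flow at x in frame t is compared with the flow of frame t+1 sampled at the propagated position x' = x + u(x, t). u_t and u_next are assumed to be (H, W, 2) fields with the horizontal component first.

```python
# Sequential error: warp the next frame's flow back to the current positions
# with bilinear interpolation and take the RMS difference to the current flow.
import numpy as np
from scipy.ndimage import map_coordinates

def sequential_error(u_t, u_next):
    h, w = u_t.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs2, ys2 = xs + u_t[..., 0], ys + u_t[..., 1]      # propagated positions x'
    u1 = np.stack([map_coordinates(u_next[..., k], [ys2, xs2],
                                   order=1, mode='nearest') for k in range(2)],
                  axis=-1)
    return float(np.sqrt(np.mean(np.sum((u_t - u1) ** 2, axis=-1))))
```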


For the evaluation using the DIPLODOC sequence, we compute the optical
flow from the 120th to the 142nd frame. We work with a pyramid of level 2 or 3
and use 10 iterations at every level.
Although shading causes counterfeit obstacles in the RGB space between frame 130
and frame 132, the optical flow in the Lab space is detected accurately, as
shown in Fig. 4(a). Figure 4(b) again suggests that for accurate colour optical
flow computation, we are required to use a window larger than 5 × 5.
The above results imply that for the application of colour optical flow to
vehicle vision systems, adaptive selection of the colour space and unification of
the results computed using several colour spaces are essential.


Fig. 2. Computed optical flow. Errors for various window sizes. Left: average angle
error. Right: least mean square error. From top to bottom: results for the hydra, grove3,
dimetrodon and urban3 sequences.

Fig. 3. Computed optical flow: (a), (d) image; (b), (e) ground truth of the optical flow;
(c), (f) optical flow field computed with the 7 × 7 window.


Fig. 4. Result of DIPLODOC Sequence. (a) Image from the sequence. (b) Sequential
error eq. (15) of various colour channels for the window 7 × 7. (c) Sequential error eq.
(15) for various windows for the RGB channel. We used the pyramid of two layers.

5 Conclusions

For autonomous navigation in a real environment, the adaptation of the motion
and image analysis algorithms to illumination changes is a fundamental problem,
because illumination changes in an image sequence yield counterfeit obstacles.
In this paper, extending the KLT tracker, we developed the colour affine tracker for
motion analysis of long outdoor image sequences. The method computes a locally
affine optical flow field using a shift-variant linear equation.

We evaluated the performance of affine colour optical flow computation using
Middlebury colour sequences. The results show that our method for multichannel
images improves the average angle errors compared with optical flow computed from
monochrome and single-channel images. We also evaluated the temporal stability
of the optical flow field on sequences typical of those captured by
vehicle-mounted imaging systems. The result is a combination of the previous
results in refs. [7,9,6,8].
Numerical results show that for the computation of affine optical flow, we are
required to use windows larger than 7 × 7. Furthermore, the computation with
the 5 × 5 window yields stable results for the Lucas-Kanade method. Therefore, we
have the following conjecture.
Conjecture 1. For the computation of the locally n-th order flow field, we are
required to use windows larger than (5 + n) × (5 + n).

References
1. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17,
185–204 (1981)
2. Bouguet, J.-Y.: Pyramidal implementation of the Lucas Kanade feature tracker
description of the algorithm, In: Intel Corporation. Microprocessor Research Labs,
OpenCV Documents (1999)
3. Lucas, B.D., Kanade, T.: An iterative image registration technique with an appli-
cation to stereo vision. In: International Joint Conference on Artificial Intelligence,
pp. 674–679 (1981)
4. Golland, P., Bruckstein, A.M.: Motion from color. CVIU 68, 346–362 (1997)
5. Andrews, R.J., Lovell, B.C.: Color optical flow. In: Proc. Workshop on Digital
Image Computing, pp. 135–139 (2003)
6. van de Weijer, J., Gevers, T.: Robust optical flow from photometric invariants. In:
Proc. ICIP, pp. 1835–1838 (2004)
7. Barron, J.L., Klette, R.: Quantitative color optical flow. In: Proceedings of 16th
ICPR, vol. 4, pp. 251–255 (2002)
8. Heigl, B., Paulus, D., Niemann, H.: Tracking points in sequences of color images.
In: Proceedings 5th German-Russian Workshop on Pattern Analysis, pp. 70–77
(1998)
9. Mileva, Y., Bruhn, A., Weickert, J.: Illumination-robust variational optical flow
with photometric invariants. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.)
DAGM 2007. LNCS, vol. 4713, pp. 152–162. Springer, Heidelberg (2007)
10. Shi, J., Tomasi, C.: Good features to track. In: CVPR 1994, pp. 593–600 (1994)
Can Salient Interest Regions Resume Emotional
Impact of an Image?

Syntyche Gbèhounou1 , François Lecellier1 ,


Christine Fernandez-Maloigne1, and Vincent Courboulay2
1
Department SIC of XLIM Laboratory, UMR CNRS 7252 - University of Poitiers,
Bât. SP2MI, Téléport 2, Boulevard Marie et Pierre Curie, BP 30179
86962 Futuroscope Chasseneuil cedex, France
[email protected]
2
L3i - University of La Rochelle, Avenue M. Crépeau
17042 La Rochelle Cedex 01, France
[email protected]

Abstract. The salient regions of interest are supposed to contain the
interesting keypoints for analysis and understanding. In this paper we study
the impact of reducing an image to its regions of interest on emotion
recognition. We chose a bottom-up visual attention model because we
address emotions on a new low-semantic data set, SENSE (Studies of
Emotion on Natural image databaSE). We organized two experiments.
The first was conducted on the whole images and is called SENSE1, the
second on reduced images and is named SENSE2. The latter are obtained
with a visual attention model and their size varies from 3% to 100% of
the size of the original ones. The information collected during these
evaluations is the nature and the power of the emotions. For the nature,
observers choose between "Negative", "Neutral" and "Positive", and the
power varies from "Weak" to "Strong". In both experiments some images
receive an ambiguous categorization; the participants were not able to
decide on their emotional class (Negative, Neutral or Positive). The
evaluations on reduced images showed that on average 79% of the images
left uncategorized during SENSE1 are categorized during SENSE2 into one
of the two major classes. Reducing the size of the area to be observed thus
leads to a better evaluation, possibly because some semantic content is
attenuated.

Keywords: bottom-up saliency, regions of interest, emotions, psycho-visual tests.

1 Introduction
Many achievements have been made in computer vision in order to replicate the
most amazing capabilities of the human brain. However, some aspects of our behaviour
remain unsolved, for example the prediction of the emotions elicited by images and videos.
Emotion extraction has several applications, for example film classification
or road safety education, by choosing the images adequate to the situation.


Trying to extract the emotional impact of images is an ambitious task. In
fact, different kinds of information in an image (content, textures, colours, semantics, ...)
can be emotional vectors, and emotions are complex reactions. Many factors, like
cultural aspects, are more complex than content or overall colour and must be
considered in the emotional interpretation of an image.
Several papers explore the emotion extraction domain [4,5,7,8,12] and propose
different approaches to extract the emotion of images. The principal strategy is
based on face detection; an emotion is then associated with facial features (such
as eyebrows or lips). The other major approach is the detection of emotions from
the characteristics of the image [1,4,5,7,12]. Considering these two approaches,
it is interesting to compute the model on only a part of the image, for several reasons.
The first is to reduce the size of the computed features. Limiting the viewed
images to the salient regions can also reduce the semantic interpretation, which
allows easier classification by observers and thus good ratings of our
database. Saliency appears to be a good strategy to reduce the information while
conserving the most attractive parts. For example, in a face-characteristics-based
strategy, visual attention models can detect the different faces if they are
the salient information, and in other strategies the analysis can be based only
on the features of the salient regions.
In this paper, we consider a bottom-up visual attention model because we
address "primary emotions", which are felt within the first seconds. This requires
a short observation period. In fact, the viewing duration is important during the
evaluation of an image regardless of the task; this duration involves the processes
of bottom-up and top-down saliency. Our goal is to study the impact of
reducing the size of the observed image on the evaluation of emotions. This
reduction can allow the emotion extraction model to focus on the most relevant
regions.

2 The Bottom-Up Visual Attention Model


Recently, Perreira Da Silva et al. [10] proposed a new hybrid model which allows
modeling the temporal evolution of the visual focus of attention, together with its
validation. As shown in Figure 1, it is based on the classical algorithm proposed
by Itti [2], in which the first part of the architecture relies on the extraction of
three conspicuity maps based on low-level characteristics. These
three conspicuity maps are representative of the three main human perceptual
channels: color, intensity and orientation. In [9], Perreira Da Silva et al. propose
to substitute the second part of Itti's model with an optimal competitive approach:
a preys/predators system. They have demonstrated that it is an optimal way of
extracting information.
Besides, this optimal criteria, preys/predators equations are particularly well
adapted for such a task:
– preys / predators systems are dynamic, they include intrinsically time evolu-
tion of their activities. Thus, the visual focus of attention , seen as a predator,
can evolve dynamically;

– without any objective (top-down information or pregnancy), choosing a
method for conspicuity map fusion is hard. A solution consists in developing
a competition between conspicuity maps and waiting for a natural
balance in the preys/predators system, reflecting the competition between
emergence and inhibition of elements that engage our attention or not;
– discrete dynamic systems can have a chaotic behaviour. Despite the fact
that this property is not often interesting, it is an important one in this
case. Actually, it allows the emergence of original paths and the exploration of
the visual scene, even in non-salient areas, reflecting something like curiosity or
emotion.

Fig. 1. Architecture of the computational model of attention

Perreira Da Silva et al. [10] show that despite the non-deterministic behaviour of
preys/predators equations, the system exhibits interesting properties of stability,
reproducibility and reactiveness while allowing a fast and efficient exploration of
the scene. We applied the same optimal parameters used by Perreira Da Silva
to evaluate our approach.
The attention model presented in this section is computationally efficient and
plausible [9]. It provides many tuning possibilities (adjustment of curiosity, cen-
tral preferences, etc.) that can be exploited in order to adapt the behavior of the
system to a particular context.

3 Experimentations

3.1 Image Database

There are many image databases for emotion studies; the best known is the
International Affective Picture System (IAPS) [3] from the Center for Emotion and
Attention (CSEA) at the University of Florida. In general they are highly semantic,
and this aspect justifies our choice to create a new low-semantic database
for the study of emotions and for research purposes in general. In this paper,
"low-semantic" means that the images do not shock and do not force a strong
emotional response. We also chose low-semantic images to minimize the potential
interactions between emotions on successive images during subjective evaluations.
This aspect is important to ensure that the emotion indicated for an image is
really related to its content and not to the emotional impact of the previous one.
For these experiments the data set used in [1] has been expanded to
350 images and is now called SENSE (Studies of Emotion on Natural image
databaSE); it is free to use. It is a diversified set of images which contains
landscapes, animals, food and drink, and historic and touristic monuments.
This data set also has the advantage of being composed of natural images, except
for some non-natural transformations (rotations and colour balance modifications)
applied to a few images. These transformations were performed to measure their
impact on an emotion recognition system based on low-level image features [1].

3.2 Experimentations

Our goal during the psycho-visual evaluations is to assess the different images
according to the nature of their emotional impact during a short viewing duration.
For evaluations of the emotional impact of an image, the viewing duration is really
important. In fact, if the observation time is extended, observers access more of the
semantics and their ratings become semantic interpretations rather than "primary
emotions".
Usually two methodologies of emotion classification are found [4]:

– Discrete approach;
– Dimensional approach.

In the discrete modelling, the emotional process is explained with a set of basic or fundamental emotions, innate and common to all humans. There is no consensus about the nature and the number of these fundamental emotions [4]. It can be difficult to score our images with this approach. For example, scoring an image as "Happy" or "Sad" on a low-semantic database needs a real semantic interpretation, whereas we seek a "primary" emotion after a short observation time.
In the dimensional approach, emotions are the result of a fixed number of concepts represented in a dimensional space [4]. The dimensions can be pleasure, arousal and power; they vary depending on the needs of the application or the research.

The images of IAPS are scored according to the affective ratings: pleasure, arousal and dominance.
During our experimentations we asked participants to indicate the nature of the emotion ("Positive", "Negative" or "Neutral") and its power, varying from "Weak" to "Strong". In our view, it is easier to rate our images this way, especially with a short observation duration.

Fig. 2. (a)-(c) Some images assessed during SENSE2 with the percentage of the original image conserved (61%, 27% and 6%, respectively) and (d)-(f) their corresponding full images in SENSE1

Because our tests were applied on a low semantic database we do not need to
worry about the potential interactions between emotions on following images.
These interactions are really minimized.
We conducted two different tests:
– First evaluations on the full images; an example is shown in Fig. 2(d)-(f). 1741 participants, including 848 men (48.71%) and 893 women (51.29%), from around the world, scored the database. These evaluations are named SENSE1.
– Second tests on the regions of interest obtained with the visual attention model described in the previous section. 1166 participants, including 624 women (53.49%) and 542 men (46.51%), scored the 350 images. Their size varies from 3% to 100% of the size of the original ones; Fig. 2(a)-(c) are some examples. These experimentations are named SENSE2.
The two experimentations were successively accessible via the Internet; SENSE1 was conducted several months before SENSE2. Participants took part voluntarily in one or both evaluations and could stop whenever they wanted. Even if we cannot control the observation conditions (concentration, display, mood), this is not a problem for emotional impact evaluations, as these are the daily viewing conditions. Participants were asked not to take a long time to score the images. The average observation time is 6.6 seconds, so we considered the responses as "primary" emotions. Each observer evaluated at most 24 randomly selected images if he completed the full test.

4 Results and Discussions

Each image was assessed by an average of 104.81 observers during SENSE1 and
65.40 during SENSE2.

– During SENSE1, only 21 images (6% of the whole database) were scored by fewer than 100 persons. The least assessed image was evaluated by 86 different participants, both genders combined.
– During SENSE2, the least rated image was seen by 47 participants. Only 2 images were rated by fewer than 50 persons.

Despite these diversified evaluations, some images are not really categorized. We considered that an image is categorized (in some emotion class Negative, Neutral or Positive) if the difference between the percentages of observers for the two most selected emotions is greater than or equal to 10%.
The results from SENSE1 are considered as the reference for this study, and the impact of the size reduction is analysed through the rate of good categorization.
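As an illustration, a minimal Python sketch of this categorization rule is given below; the vote counts and the function name are hypothetical, and only the 10% margin comes from the protocol described above.

# Hypothetical sketch of the categorization rule: an image is assigned an
# emotion class only if the two most voted classes differ by at least 10%.
def categorize(votes):
    # votes: dict mapping 'Positive'/'Negative'/'Neutral' to observer counts
    total = float(sum(votes.values()))
    shares = sorted(((n / total * 100.0, label) for label, n in votes.items()),
                    reverse=True)
    (top_pct, top_label), (second_pct, _) = shares[0], shares[1]
    return top_label if top_pct - second_pct >= 10.0 else "Uncategorised"

# Example: 45% Positive, 40% Neutral, 15% Negative -> margin 5% -> Uncategorised
print(categorize({"Positive": 45, "Neutral": 40, "Negative": 15}))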

Fig. 3. Good categorization rates during SENSE2 for images categorized during SENSE1 (good categorization rate in %, for thumbnails covering P1: ]7%, 50%[, P2: [50%, 70%[ and P3: [70%, 100%] of the original image)

If we consider the results of SENSE2 according to the percentage of the original image represented by the visual region of interest, we notice that when this percentage is less than or equal to 7%, the images (18 of the 20 concerned) are "Neutral" or "Uncategorised". The two exceptions, including image 2(c), have the particularity of being summarized by their main colours, and their emotional impact can essentially be conveyed by these colours.
Fig. 3 represents the rate of images categorized during SENSE1 and SENSE2 in the same emotion class. Regarding the different results on SENSE1, reducing the size of the images according to bottom-up visual attention is a good solution for evaluating primary emotions. Indeed, these emotions must be as uncorrelated as possible with the semantics; they appear in the first seconds of viewing, before the top-down process. The main categorization errors during SENSE2 concern neutral images, which are ambiguous. We offered the "Neutral" option so as not to force participants to score an image as positive or negative when they are unsure about its emotional impact.

Fig. 4. Rate of uncategorised images during SENSE1 now categorized during SENSE2 (in %, for thumbnails covering P1: ]7%, 50%[, P2: [50%, 70%[ and P3: [70%, 100%] of the original image)

Sometimes images are rated neutral because they are found positive or negative, but not strongly enough.
Fig. 4 shows the images uncategorised1 during SENSE1 that are now classified during SENSE2 in one of the two major emotion classes found in SENSE1. In SENSE1, 61 images are "Uncategorised"; the main contribution of this paper concerns this kind of image. Figure 4 shows that a large part of them (79%) is now categorized, often in one of the two major classes of SENSE1. Reducing the viewing region has probably reduced the semantics and the analysis time.

Fig. 5. Rate of good categorization during SENSE2 according to the percentage of original image viewed

Fig. 5 represents the rate of images with good categorization during SENSE2 according to the percentage of observed thumbnails. It shows that, for the three classes of emotions, from 50% of the original image onwards, 77% of the images are correctly categorized. This observation supports our hypothesis that reducing the image with a bottom-up visual attention model can offer results similar to those obtained with the full images.
1
To be considered categorized, the major class of an image must have a percentage at least 10% higher than the other.

5 Conclusions and Future Work


In this paper we study the impact of the salient regions of an image on its emotion perception. We have shown that when the interest regions are too small they do not allow an adequate evaluation of the emotional impact. However, the results prove that a bottom-up visual attention model can be very helpful to reduce the evaluated image to pertinent regions that allow "primary emotion" recognition.
We plan to test other visual attention models and compare the different results with those of subjective experimentations with an eye-tracker. Another future work is the description of the salient regions by colour, texture, contour and position features, in order to determine the characteristics of these thumbnails which summarize the emotions.

References
1. Gbèhounou, S., Lecellier, F., Fernandez-Maloigne, C.: Extraction of emotional im-
pact in colour images. In: Proc. CGIV, vol. 6, pp. 314–319 (2012)
2. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11), 1254–1259 (1998)
3. Lang, P.J., Bradley, M.M., Cuthbert, B.N.: International affective picture system
(IAPS): Affective ratings of pictures and instruction manual. Technical report A-8,
University of Florida (2008)
4. Liu, N., Dellandréa, E., Chen, L.: Evaluation of features and combination ap-
proaches for the classification of emotional semantics in images. In: International
Conference on Computer Vision Theory and Applications (2011)
5. Lucassen, M., Gevers, T., Gijsenij, A.: Adding texture to color: quantitative anal-
ysis of color emotions. In: Proc. CGIV (2010)
6. Machajdik, J., Hanbury, A.: Affective image classification using features inspired
by psychology and art theory. In: Proc. International Conference on Multimedia,
pp. 83–92 (2010)
7. Ou, L., Luo, M.R., Woodcock, A., Wright, A.: A study of colour emotion and
colour preference. part i: Colour emotions for single colours. Color Research &
Application 29(3), 232–240 (2004)
8. Paleari, M., Huet, B.: Toward emotion indexing of multimedia excerpts. In: Proc.
Content-Based Multimedia Indexing, International Workshop, pp. 425–432 (2008)
9. Perreira Da Silva, M., Courboulay, V., Prigent, A., Estraillier, P.: Evaluation of
preys/predators systems for visual attention simulation. In: International Conference
on Computer Vision Theory and Applications, VISAPP 2010, pp. 275–282 (2010)
10. Perreira Da Silva, M., Courboulay, V.: Implementation and evaluation of a com-
putational model of attention for computer vision. In: Developing and Applying
Biologically-Inspired Vision Systems: Interdisciplinary Concepts, pp. 273–306 (2012)
11. Wang, W., Yu, Y.: Image emotional semantic query based on color semantic description. In: Proc. The Fourth International Conference on Machine Learning and Cybernetics, vol. 7, pp. 4571–4576 (2005)
12. Wei, K., He, B., Zhang, T., He, W.: Image Emotional Classification Based on Color
Semantic Description. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds.)
ADMA 2008. LNCS (LNAI), vol. 5139, pp. 485–491. Springer, Heidelberg (2008)
Contraharmonic Mean Based Bias Field
Correction in MR Images

Abhirup Banerjee and Pradipta Maji

Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India


{abhirup r,pmaji}@isical.ac.in

Abstract. One of the key problems in magnetic resonance (MR) image


analysis is to remove the intensity inhomogeneity artifact present in MR
images, which often degrades the performance of an automatic image
analysis technique. In this regard, the paper presents a novel approach
for bias field correction in MR images using the merit of contraharmonic
mean, which is used in low-pass averaging filter to estimate the near opti-
mum bias field in multiplicative model. A theoretical analysis is presented
to justify the use of contraharmonic mean for bias field estimation. The
performance of the proposed approach, along with a comparison with
other bias field correction algorithms, is demonstrated on a set of MR
images for different bias fields and noise levels.

Keywords: Magnetic resonance imaging, intensity inhomogeneity, bias


field, contraharmonic mean filter.

1 Introduction

Magnetic resonance (MR) images are often corrupted by a specific inhomogene-


ity artifact, called intensity inhomogeneity or bias field, which creates a shading
effect in the images [1]. Because of this slowly spatially varying artifact, the intensity values of a specific tissue class vary in different locations, which results in an increase of the overall variation of the tissue class. Although this inhomogeneity artifact is hardly visible to human eyes, it is enough to degrade the performance of
any automatic image analysis tool such as segmentation or registration. Hence,
a preprocessing step is often required to remove such inhomogeneity artifacts
from the MR images before applying these tools.
Several retrospective methods exist in the literature that try to remove this
artifact depending on the information of the acquired image. While some his-
togram based methods such as N3 [2] try to estimate the bias field by maximizing
high frequency information of the tissue intensity distribution, others try to re-
move it by simultaneously estimating the bias field and segmenting the image
into meaningful tissue classes [3,4]. Pham and Prince [5] and Ahmed et al. [6] proposed fuzzy-c-means based bias field correction approaches, while Ashburner
and Friston [7] developed a probabilistic framework for simultaneous image reg-
istration, tissue classification, and bias correction.


The simplest and computationally inexpensive method to remove intensity


inhomogeneity is filtering method, which depends only on the information of
the acquired image. Assuming that intensity inhomogeneity is a low-frequency
component in the high-frequency structure of the image, these methods try to
remove it by low-pass filtering the image. One of the popular filtering methods is homomorphic unsharp masking (HUM) [8], which is an improvement of classical homomorphic filtering. The HUM is generally implemented either after masking
out the background pixels from the image or by replacing the background pixels
with average intensity values. However, Zhou et al. [9] also removed the high-
intensity structures such as grease and cerebro-spinal fluid and replaced them
by the average intensity in their neighborhood. Some other methods also exist in
the literature that use median filter instead of mean filter to estimate the inten-
sity inhomogeneity component [10,11]. However, Brinkmann et al. [12] showed
experimentally that the mean filter outperforms median filter in estimating the
bias field from the MR images. In [12], they also tried to find the optimum
window size or the optimum range of window size for the low-pass filter.
In general, the arithmetic mean (AM) filter is used as the low-pass filter in the HUM. However, it only computes the simple arithmetic mean of the intensity values of the pixels in the neighborhood of a specific pixel; hence, all pixels in the neighborhood contribute equally to the local average. In effect, this causes a problem in calculating the bias field component of the pixels in the object-background edge area. In [12], Brinkmann et al. used a thresholding technique to distinguish background pixels from the object pixels.
In this regard, the paper presents a bias field estimation technique, using the
merit of contraharmonic mean (CHM), which is used in low-pass filtering to
estimate the bias field. A theoretical analysis is presented to justify the use of
contraharmonic mean for bias field estimation. The effectiveness of the proposed
algorithm, along with a comparison with the HUM and N3 algorithms, is demon-
strated on a set of benchmark MR images both qualitatively and quantitatively
for different bias fields and noise levels.

2 Basics of HUM
The HUM assumes that intensity inhomogeneity is a low-frequency component
in the high-frequency structure of the image. It is usually implemented with
a noise threshold to prevent background pixels from distorting the bias field
estimation.
The model of the HUM assumes that intensity inhomogeneity is multiplicative.
If the ith pixel of the inhomogeneity-free image is ui , and corresponding intensity
inhomogeneity field and noise are bi and ni , respectively, then the ith pixel vi of
the acquired image is obtained as follows:
vi = ui bi + ni . (1)
In general, the bias field can be estimated either from the noise-free image
or from the noisy image. However, Guillemaud and Brady [4] showed that

post-filtering is more preferable than pre-filtering. Also, intensity inhomogeneity


is a low-frequency component. Hence, the model of the HUM can be rewritten
as
u_i = \frac{v_i}{\hat{b}_i} = \frac{v_i \, C_N}{LPF(v_i)},   (2)
where LP F (.) is the low-pass filter and CN represents the normalizing constant
that depends on the low-pass filter. If the low-pass filter is an averaging filter,
then the constant CN is used to preserve the average intensity of the image.

3 Proposed Method
This section presents a new approach, using the merits of contraharmonic mean,
for estimating bias field present in the MR images.

3.1 Contraharmonic Mean for Bias Field Estimation


Generally, it is assumed that the bias field is multiplicative. So, if a fixed amount of bias field is applied to two different pixels, then the pixel with the higher intensity value will suffer the effect of the bias field much more than the pixel with the lower intensity value. Hence, the pixel with the higher intensity value should be given higher priority while estimating the bias field, and instead of the simple AM filter, a weighted AM filter should be used as the low-pass filter, where higher intensity values get higher weightage. To achieve this goal, the contraharmonic mean (CHM) filter with order p > 0 is a favourable choice as a low-pass filter, because it gives higher weightage to higher intensity values and lower weightage to lower intensity values while calculating the mean.
The CHM filter of order p, for any coordinate i of the filtered image \hat{f}, is defined as follows:
\hat{f}_i = \frac{\sum_{j \in N_i} v_j^{p+1}}{\sum_{j \in N_i} v_j^{p}}   (3)

where Ni denotes the set of pixels in a square window centered at coordinate i.


The CHM filter reduces to the AM filter for p = 0 and to the harmonic mean
filter in case of p = −1.
Hence, the model of the HUM using the CHM filter can be rewritten as
u_i = \frac{v_i}{\hat{b}_i}   (4)
where \hat{b}_i is the estimated bias field at coordinate i of the acquired image v and is given by
\hat{b}_i = \left\{ \frac{\sum_{j \in N_i} v_j^{p+1}}{\sum_{j \in N_i} v_j^{p}} \right\} \left\{ \frac{\sum_{j \in I} v_j^{p+1}}{\sum_{j \in I} v_j^{p}} \right\}^{-1}   (5)

where I denotes the set of all pixels in the image and the normalizing constant
CN is estimated by the global CHM of the intensity values of all the pixels in
the image.
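For concreteness, a minimal Python sketch of this estimator is given below; the window size, the default order p and the use of scipy.ndimage.uniform_filter for the local sums are implementation assumptions of this sketch, not part of the original formulation.

# Sketch of bias field estimation with the contraharmonic mean (CHM) filter,
# following Eqs. (3)-(5): local CHM over a square window, normalized by the
# global CHM of the whole image, then division as in Eq. (4).
import numpy as np
from scipy.ndimage import uniform_filter

def chm_bias_correction(v, p=1.0, window=121, eps=1e-12):
    v = v.astype(np.float64)
    # Ratios of local means equal ratios of local sums over the same window N_i.
    num = uniform_filter(v ** (p + 1), size=window)
    den = uniform_filter(v ** p, size=window) + eps
    local_chm = num / den
    # The global CHM of all pixel intensities acts as the normalizing term in Eq. (5).
    global_chm = (v ** (p + 1)).sum() / ((v ** p).sum() + eps)
    bias = local_chm / global_chm            # estimated bias field
    restored = v / (bias + eps)              # Eq. (4): u_i = v_i / b_i
    return restored, bias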

3.2 Importance of CHM


The following discussion establishes that the CHM of order p > 0 estimates the intensity inhomogeneity component more efficiently than the AM.
Let the original intensity and the reduced intensity of the ith pixel of an image be denoted by u_i and v_i, respectively, and let the intensity of that pixel restored by the HUM using the AM filter and the CHM filter of order p > 0 be denoted by u'_i and u''_i, respectively. The CHM filter will provide better restoration than the AM filter if the error in estimating the intensity value of the restored pixel is smaller, that is, if
(u_i - u''_i)^2 < (u_i - u'_i)^2.   (6)
So, better restoration can be achieved by the CHM filter of order p > 0 if
u'_i < u''_i \quad \text{and} \quad u''_i < 2u_i - u'_i.   (7)

Now,
u'_i < u''_i \;\Leftrightarrow\; \frac{v_i}{\hat{b}'_i} < \frac{v_i}{\hat{b}''_i} \;\Leftrightarrow\; \hat{b}''_i < \hat{b}'_i,
where \hat{b}'_i and \hat{b}''_i denote the bias fields estimated with the AM and the CHM filter, respectively. This is equivalent to
\left\{ \frac{\sum_{j \in N_i} v_j^{p+1}}{\sum_{j \in N_i} v_j^{p}} \right\} \left\{ \frac{\sum_{j \in I} v_j^{p}}{\sum_{j \in I} v_j^{p+1}} \right\} < \left\{ \frac{\sum_{j \in N_i} v_j}{|N_i|} \right\} \left\{ \frac{|I|}{\sum_{j \in I} v_j} \right\}
\Leftrightarrow\; \frac{|N_i| \sum_{j \in N_i} v_j^{p+1}}{\sum_{j \in N_i} v_j \; \sum_{j \in N_i} v_j^{p}} < \frac{|I| \sum_{j \in I} v_j^{p+1}}{\sum_{j \in I} v_j \; \sum_{j \in I} v_j^{p}}   (8)

Subtracting 1 from both sides of (8) and multiplying the numerator by 2, we get
\frac{\sum_{j \in N_i} \sum_{k \in N_i} (v_j - v_k)(v_j^{p} - v_k^{p})}{\sum_{j \in N_i} v_j \; \sum_{j \in N_i} v_j^{p}} < \frac{\sum_{j \in I} \sum_{k \in I} (v_j - v_k)(v_j^{p} - v_k^{p})}{\sum_{j \in I} v_j \; \sum_{j \in I} v_j^{p}}
\Leftrightarrow\; \eta_{N_i} < \eta_{I}, \quad \text{where} \quad \eta_{R} = \frac{\sum_{j \in R} \sum_{k \in R} (v_j - v_k)(v_j^{p} - v_k^{p})}{\sum_{j \in R} v_j \; \sum_{j \in R} v_j^{p}}

The numerator of ηR denotes a measure of dispersion and the denominator is


used to remove the effect of the sample from the expression and make it unit-
free. Hence, the quantity ηR denotes a measure of relative dispersion. In case of
p = 1, ηR is twice the square of coefficient of variation.
The analysis reported above establishes that better restoration can be achieved by the CHM filter of order p > 0 than by the AM filter if the measure of relative dispersion ηNi within the filtered area Ni is less than the measure of relative dispersion ηI in the whole image I. This is quite natural, as the filtered area Ni is much smaller than the whole image I; also, fewer tissue classes are present within a specified filtered area, which makes the value of ηNi smaller than that of ηI. Hence, as p increases, the HUM using the CHM filter provides better restoration than the AM filter as long as the condition u''_i < 2u_i - u'_i is satisfied. The CHM filter attains its best performance at p_optimum; after that, the performance decreases with increasing p, as the above condition is no longer satisfied for large values of p.

4 Experimental Results and Discussion

The performance of the proposed bias field estimation method is extensively


studied and compared with the AM based HUM [8,12] and N3 [2] algorithms.
In [12], Brinkmann et al. showed that the optimal window size of the low-pass
filter lies in the range 65 to 127. Hence, the optimal window size is fixed at 121
for all the experiments.
To analyze the performance of different algorithms, the experimentation is
done on some benchmark images obtained from “BrainWeb: Simulated Brain
Database” (http://www.bic.mni.mcgill.ca/brainweb/). The results are reported
for different noise levels and intensity inhomogeneity. The noise is calculated
relative to the brightest tissue. The performance of different methods is evaluated
using the RMSE value. A good bias field correction procedure should make the
value of RMSE as low as possible.
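A small sketch of the RMSE criterion is given below, assuming the ground-truth (inhomogeneity-free) image is available, as is the case for the BrainWeb phantoms; the optional mask argument is an assumption for restricting the error to brain voxels.

import numpy as np

def rmse(restored, reference, mask=None):
    # Root mean squared error between the restored image and the reference image.
    diff = (restored - reference).astype(np.float64)
    if mask is not None:
        diff = diff[mask]
    return np.sqrt(np.mean(diff ** 2))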

4.1 Performance of Different Algorithms

Fig. 1 presents the performance of the proposed, HUM and N3 bias field correc-
tion methods, in terms of RMSE value. From the results reported in Fig. 1, it
is observed that the proposed algorithm provides optimum restoration in the 7
cases out of total 12 cases, in terms of RMSE value, while optimum restoration
is achieved in the remaining 5 cases using the N3 algorithm. The second, third,
and fourth columns of Fig. 2 and 3 compare the reconstructed images produced
by the proposed, HUM, and N3 algorithms for different bias fields and noise lev-
els. All the results reported in Fig. 2 and 3 establish the fact that the proposed
method estimates the bias field more accurately than the existing HUM and N3
algorithms irrespective of the bias fields and noise levels.

Fig. 1. Performance of the proposed, HUM, and N3 algorithms for bias affected images: RMSE plotted against the noise-bias combination, for 20% (left) and 40% (right) bias fields

Fig. 2. Input image with 20% bias field and images restored by the proposed algorithm (using the CHM filter), the HUM algorithm of Brinkmann et al., and the N3 algorithm. RMSE values at 1% noise: 5.45, 7.15 and 8.62, respectively; at 7% noise: 11.56, 11.88 and 12.41

4.2 Unbiased Estimation

One of the caveats about the HUM algorithm is that it can alter an image even
when no inhomogeneity is present, while a perfect correction algorithm should
be expected to leave the image unchanged.
From the results reported in Fig. 4, it is observed that the proposed algorithm
provides better restoration in all of the 6 cases, in terms of lowest RMSE value.
The HUM algorithm of Brinkmann et al. and the N3 algorithm severely change
the input image in spite of absence of intensity inhomogeneity artifacts, whereas
the proposed algorithm leaves the input image more or less unchanged.

Fig. 3. Input image with 40% bias field and images restored by the proposed algorithm (using the CHM filter), the HUM algorithm of Brinkmann et al., and the N3 algorithm. RMSE values at 1% noise: 10.40, 11.97 and 7.62, respectively; at 7% noise: 12.78, 13.68 and 18.98

Fig. 4. Performance of the proposed, HUM, and N3 methods for bias-free images: RMSE plotted against the noise level

5 Conclusion
The contribution of the paper is twofold, namely, the development of a bias field correction algorithm using the merits of the contraharmonic mean filter; and
demonstrating the effectiveness of the proposed algorithm, along with a compar-
ison with other algorithms, on a set of MR images obtained from “BrainWeb:
Simulated Brain Database” for different bias fields and noise levels. A theoret-
ical analysis is presented to justify the use of contraharmonic mean for bias
field estimation. The algorithm using the contraharmonic mean filter instead of

arithmetic mean filter provides better restoration of the MR images than the conventional AM based HUM.

Acknowledgment. This work is partially supported by the Indian National


Science Academy, New Delhi (grant no. SP/YSP/68/2012).

References
1. Suetens, P.: Fundamentals of Medical Imaging. Cambridge University Press (2002)
2. Sled, J.G., Zijdenbos, A.P., Evans, A.C.: A Nonparametric Method for Automatic
Correction of Intensity Nonuniformity in MRI Data. IEEE Transactions on Medical
Imaging 17(1), 87–97 (1998)
3. Wells III, W.M., Grimson, W.E.L., Kikinis, R., Jolesz, F.A.: Adaptive Segmentation of MRI Data. IEEE Transactions on Medical Imaging 15(4), 429–442 (1996)
4. Guillemaud, R., Brady, M.: Estimating the Bias Field of MR Images. IEEE Trans-
actions on Medical Imaging 16(3), 238–251 (1997)
5. Pham, D.L., Prince, J.L.: Adaptive Fuzzy Segmentation of Magnetic Resonance
Images. IEEE Transactions on Medical Imaging 18(9), 737–752 (1999)
6. Ahmed, M.N., Yamany, S.M., Mohamed, N., Farag, A.A., Moriarty, T.: A Modified
Fuzzy C-Means Algorithm for Bias Field Estimation and Segmentation of MRI
Data. IEEE Transactions on Medical Imaging 21(3), 193–199 (2002)
7. Ashburner, J., Friston, K.J.: Unified Segmentation. NeuroImage 26(3), 839–851
(2005)
8. Axel, L., Costantini, J., Listerud, J.: Intensity Correction in Surface-Coil MR Imag-
ing. American Journal of Roentgenology 148, 418–420 (1987)
9. Zhou, L.Q., Zhu, Y.M., Bergot, C., Laval-Jeantet, A.M., Bousson, V., Laredo,
J.D., Laval-Jeantet, M.: A Method of Radio-Frequency Inhomogeneity Correction
for Brain Tissue Segmentation in MRI. Computerized Medical Imaging and Graph-
ics 25(5), 379–389 (2001)
10. Narayana, P.A., Borthakur, A.: Effect of Radio Frequency Inhomogeneity Cor-
rection on the Reproducibility of Intra-Cranial Volumes Using MR Image Data.
Magnetic Resonance in Medicine 33, 396–400 (1994)
11. Bedell, B.J., Narayana, P.A., Wolinsky, J.S.: A Dual Approach for Minimizing
False Lesion Classifications on Magnetic-Resonance Images. Magnetic Resonance
in Medicine 37(1), 94–102 (1997)
12. Brinkmann, B.H., Manduca, A., Robb, R.A.: Optimized Homomorphic Unsharp
Masking for MR Grayscale Inhomogeneity Correction. IEEE Transactions on Med-
ical Imaging 17(2), 161–171 (1998)
Correlation between Biopsy Confirmed Cases
and Radiologist’s Annotations in the Detection of Lung
Nodules by Expanding the Diagnostic Database
Using Content Based Image Retrieval

Preeti Aggarwal1, H.K. Sardana2, and Renu Vig1


1
UIET, Panjab University, Chandigarh, India
2
CSIO, Chandigarh, India
[email protected], [email protected],
[email protected]

Abstract. In lung cancer computer-aided diagnosis (CAD) systems, having an


accurate and available ground truth is critical and time consuming. In this study,
we have explored Lung Image Database Consortium (LIDC) database contain-
ing pulmonary computed tomography (CT) scans, and we have implemented
content-based image retrieval (CBIR) approach to exploit the limited amount of
diagnostically labeled data in order to annotate unlabeled images with diagnos-
es. By applying CBIR method iteratively and using pathologically confirmed
cases, we expand the set of diagnosed data available for CAD systems from 17
nodules to 121 nodules. We evaluated the method by implementing a CAD sys-
tem that uses various combinations of lung nodule sets as queries and retrieves
similar nodules from the diagnostically labeled dataset. In calculating the precision of this system, the diagnosed dataset and computer-predicted malignancy data are used as ground truth for the undiagnosed query nodules. Our results indicate that CBIR expansion is an effective method for labeling undiagnosed images in order to improve the performance of CAD systems. They also indicate that even limited knowledge of biopsy confirmed cases not only assists physicians as a second opinion in marking undiagnosed cases but also helps to avoid unnecessary biopsies.

Keywords: Chest CT scan, computer-aided diagnosis, LIDC, cancer detection


and diagnosis, biopsy.

1 Introduction

Lung cancer is the leading cause of cancer death in the United States. Early detection
and treatment of lung cancer is important in order to improve the five year survival
rate of cancer patients. Medical imaging plays an important role in the early detection
and treatment of cancer. It provides physicians with information essential for efficient
and effective diagnosis of various diseases. In order to improve lung nodule detection,
CAD is effective as a second opinion for radiologists in clinical settings [1]. To assess
the high-quality of the data, several researchers and physicians have to be involved in


the case selection process and the delineation of regions of interest (ROIs) to cope
with the inter- and intra-observer variability, the latter being particularly important in
radiology [2]. Efforts for building a resource for the lung imaging research communi-
ty are detailed in [3]. In almost all the CAD studies, most authors created their own
datasets with their own ground truth for evaluation. The use of different datasets makes the comparison of these CAD systems infeasible; therefore, there is an immediate need for reference datasets that can provide a common ground truth for the evaluation and validation of these systems.
The pulmonary CT scans used in this study were obtained from the LIDC [3], and
we refer to the nodules in this dataset as the LIDC Nodule Dataset. Recently, diagno-
sis data for some of the nodules were released by the LIDC; however, because the
diagnosis is available patient-wise not nodule-wise, only the diagnoses belonging to
patients with a single nodule could be reliably matched with the nodules in the LIDC
Nodule Dataset, resulting in 18 diagnosed nodules (eight benign, six malignant, three
metastases and one unknown). The 17 nodules with known diagnoses comprise the initial Diagnosed Subset, as the one case with an unknown diagnosis cannot be considered as ground truth. Since the diagnoses in the LIDC Diagnosis Dataset are the closest thing
to a ground truth available for the malignancy of the LIDC nodules, our goal is to
expand the Diagnosed Subset by adding nodules similar to those already in the subset.
To identify these similar nodules and to predict their diagnoses, CBIR with classi-
fication is employed. The radiologists' annotations along with the LIDC data are also considered as a semantic rating to prepare the ground truth from the LIDC data. Increasing the
number of nodules for which a diagnostic ground truth is available is important for
future CAD applications of the LIDC database. With the aid of similar images, radi-
ologists’ diagnoses of lung nodules in CT scans can be significantly improved [4].
Having diagnostic information for medical images is an important tool for datasets
used in clinical CBIR; however, any CAD system would benefit from a larger Diag-
nosed Subset as well as the semantic rating, since the increased variability in this set
would result in more accurately predicted diagnoses for new patients.

1.1 State of the Art

Only a limited number of CAD studies have used a pathologically confirmed diagnos-
tic ground truth, since there are few publicly available databases with pathological annotations [5]. Even with the LIDC data, where biopsy confirmed cases are available, the variability in the opinions of the four different radiologists makes the data more complex and redundant. In exploring the relationship between content-based
similarity and semantic-based similarity for LIDC images, Jabon et al. found that
there is a high correlation between image features and radiologists’ semantic ratings
[6]. In this study, the malignancy rating is also considered for patients having multiple nodules, by taking the mean of all four radiologists' ratings. McNitt-Gray et
al. [7] used nodule size, shape and co-occurrence texture features as nodule characte-
ristics to design a linear discriminant analysis (LDA) classification system for malig-
nant versus benign nodules. Armato et al. [8] used nodule appearance and shape to
build an LDA classification system to classify pulmonary nodules into malignant
Correlation between Biopsy Confirmed Cases and Radiologist’s Annotations 533

versus benign classes. Takashima et al. [9] used shape information to characterize
malignant versus benign lesions in the lung. Samuel et al. [10] developed a system for
lung nodule diagnosis using Fuzzy Logic. Although the work cited here provides convincing evidence that a combination of image features can indirectly encode radiologists' knowledge about indicators of malignancy, the precise mechanism by which this correspondence happens is unknown. To understand this mechanism, the correlation between all of these elements needs to be explored in order to prepare the ground truth of the LIDC data. Also, in all these systems the major concern was to distinguish benign nodules from malignant ones, whereas in the current study we have assigned a new class, metastasis, which indicates that the nodule is malignant but the primary cancer is not lung cancer; adding this new class will help physicians better understand the cause and the diagnosis for those patients. This third class, metastasis, has not previously been introduced in CBIR for medical imaging. In the current study, we adopted a semi-supervised approach for labeling undiagnosed nodules in the LIDC. CBIR is used to label nodules most similar to the query with respect to the Euclidean distance of image features.

2 Materials

2.1 Lung Image Database Consortium (LIDC) Dataset; A Benchmark

The NIH LIDC has created a dataset to serve as an international research resource for
development, training, and evaluation of CAD algorithms for detecting lung nodules
on CT scans. The LIDC database, released in 2009, contains 399 pulmonary CT
scans. Up to four radiologists analyzed each scan by identifying nodules and rating
the malignancy of each nodule on a scale of 1-5. The boundaries provided in the
XML files are already marked using manual as well as semi-automated methods [1]
[4]. Both cancerous and non-cancerous regions appear with little distinction on CT
scan images. The nine characteristics presented in [11] are the common terms physicians consider when deciding whether a nodule is benign or malignant. To the best of our knowledge, this is the first use of the LIDC dataset for validating and classifying lung nodules using the biopsy report as well as the attached semantics.

2.2 Lung Nodule Detection and Selection of Slices


Lung nodules are volumetric and usually appear in several slices of a patient's scan. A chest CT scan is the better method to analyze these nodules, for detection as well as for diagnosis. Because a CT scan contains multiple slices, the physician has to examine each and every slice to properly understand each nodule, if present; this task is time consuming and not deterministic in any way. We present a CAD system which considers a nodule as a qualifying nodule if and only if it is visible in three consecutive slices. This method can further decrease the time needed for a radiologist to examine the patient's scan. In this work, the radiologists' markings are considered for nodule detection and segmentation from the chest CT scan. For better results, as well as to prepare the ground

truth, the values of the annotations are averaged over all four radiologists. No automatic segmentation is considered in this study, as manual segmentation in medical imaging provides better results; see Fig. 1 [12].

Fig. 1. Radiologist segmentation of nodules

Fig. 1 presents the radiologists' segmentation of a nodule; manual segmentation is therefore considered the "gold standard". Each slice is read independently to identify the area marked by all four radiologists, and for each nodule only the slice whose marked area is maximal is kept in the database [13].

2.3 Final Extracted Nodule Dataset


CT scans of 80 biopsy confirmed patients with solitary pulmonary nodules, mostly smaller than 3 cm, have been taken from the LIDC. All the images are of size 512*512, each with 16-bit resolution, and all are in the DICOM (Digital Imaging and Communication in Medicine) format, the well known standard used in the medical field. A total of 1737 nodules are marked in the 80 patients, considering for each slice the largest area among those marked by the four different radiologists. Out of the 80 biopsy confirmed cases, only 18 cases were available with a single nodule. From these 18, only 17 cases were considered further to prepare the ground truth, as the diagnosis for one patient was unknown; this set will be referred to as the Diagnosed17. The classes assigned to these nodules were malignant, benign and metastasis, based on the available diagnosis report. The remaining 62 patients were assigned a class based on the mean of the malignancy ratings provided by the four radiologists, as no ground truth is available for these patients with multiple nodules; this set will be referred to as RadioMarked62 (see Section 3). It contains 1677 nodules from 62 patients. 83 well known image features
were extracted for each nodule based on texture, size, shape, and intensity [11]. The
four feature extraction methods used to obtain these 83 features from the LIDC im-
ages were Haralick co-occurrence, Grey level difference method (GLDM), Gabor
filters, and intensity [11]. The number of nodules was reduced to 210 by removing nodules smaller than five-by-five pixels (because features extracted from such small nodules are imprecise) and by keeping, among the multiple slices per nodule, the slice with the largest nodule area. Four different "undiagnosed" query sets containing subsets of the LIDC Nodule Dataset were used, since neither computer-predicted nor radiologist-predicted malignancy ratings can be considered ground truth due to the high variability between radiologists' ratings. Each of these query sets differed in diagnostic ground

truth. The first query set (Rad210) used the radiologist-predicted malignancy, the second set (Comp210) used the computer-predicted malignancy, the third set (Comp_Rad_biopsy57) used only those nodules for which the radiologist-predicted, computer-predicted and biopsy confirmed malignancies agreed, and the fourth set (Rad_Comp92) used only those nodules for which the radiologist- and computer-predicted malignancies agreed. The radiologist-predicted and computer-predicted query sets contained an equal number of nodules, i.e. 210, the radiologist-computer-biopsy-agreement query set contained 57, and Rad_Comp92 contained 92 nodules after all modifications.

3 Methods

3.1 Labeling of the Nodules


Nodules are labeled differently for patients with a single nodule and for patients with multiple nodules. The following sections give the details.

Patients with Single Nodule.


Out of 80 patients, only 18 cases had one nodule, whereas 62 patients had more than one nodule. The biopsy report for these patients has four classes, identified as 0, 1, 2 and 3. 17 out of the 18 biopsy confirmed cases had a diagnosis of 1, 2 or 3, whereas only one patient had the diagnosis 0, which means unknown or indeterminate. This case could decrease the classification results, so it was not considered in this study. Consequently, the 17 pathologically confirmed cases were assigned three classes: malignant (M), benign (B) and metastases (MT). There are eight benign (B) nodules, six malignant (M) nodules and three metastases (MT) nodules in the initial Diagnosed17 set.

Patients with Multiple Nodules.


62 out of the 80 biopsy confirmed cases, those with multiple nodules, are assigned classes on the basis of the radiologists' malignancy characteristic [11]. Out of the nine annotations, only the malignancy feature is used to assign the class to each nodule marked by the radiologists, as this is the most promising feature to determine the malignancy of a nodule. The method used to label each nodule is as follows: nodules with a malignancy rating >= 3 are assigned the class Malignant (M), whereas nodules with a malignancy rating < 3 are assigned the class Benign (B); a small sketch of this rule is given after the next paragraph.

In this way, the 1677 nodules from the 62 patients were assigned a malignancy class as above. These 1677 nodules contain multiple slices per nodule and form the RadioMarked62 set, which has further been reduced to 210 nodules forming QueryNoduleSet210. QueryNoduleSet210 is further divided into categories such as Rad210, Comp210 and Comp_Rad_biopsy57, as explained earlier.
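A minimal Python sketch of this labeling rule is shown below; the function and variable names are illustrative, and only the threshold of 3 on the mean of the four radiologists' ratings comes from the rule above.

import numpy as np

def label_nodule(ratings):
    # ratings: malignancy scores (scale 1-5) given by up to four radiologists
    return "M" if np.mean(ratings) >= 3 else "B"

print(label_nodule([4, 3, 5, 3]))   # -> M
print(label_nodule([2, 1, 3, 2]))   # -> B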

3.2 Summary of CBIR Method of Expanding the Diagnosed Subset17; CBIR


Expansion Occurs Iteratively

In the absence of diagnostic information, labels can be applied to unlabeled data using
semi-supervised learning (SSL) approaches. In SSL, unlabeled data is exploited to
improve learning when the dataset contains an insufficient amount of labeled data
[14]. Using available datasets and by evaluating the method with a CAD application,
we determined how to effectively expand the Diagnosed17 with CBIR and assist the
physicians in the final diagnosis. Each nodule in QueryNoduleSet210 was then used as a query to retrieve the ten most similar images from the nodules in the Diagnosed17 using CBIR with the Euclidean distance. The query nodule was assigned a predicted malignancy rating based on the retrieved nodules (e.g., if the majority of the retrieved nodules belong to the class malignant, then the query nodule was assigned the class M); see Fig. 2. The newly labeled nodules were considered candidates for addition to the Diagnosed17.
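A rough sketch of this retrieval-and-voting step is shown below; the array layout (one 83-dimensional feature vector per row) and the majority vote are assumptions consistent with the description above, not the exact implementation.

import numpy as np
from collections import Counter

def predict_label(query_feat, diag_feats, diag_labels, k=10):
    # Euclidean distances from the query to every diagnosed nodule.
    d = np.linalg.norm(diag_feats - query_feat, axis=1)
    nearest = np.argsort(d)[:k]                       # the k most similar nodules
    votes = Counter(diag_labels[i] for i in nearest)  # counts over B, M, MT
    return votes.most_common(1)[0][0]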

Fig. 2. Selection of candidate nodules using CBIR and the Diagnosed17 set (flow: Diagnosed17 with 17 nodules categorized in 3 classes → CBIR, with query RadioMarked62/QueryNoduleSet210, retrieval set Diagnosed17 and 10 nodules retrieved per query → new nodules selected as candidates for the Diagnosed17 set → new Diagnosed17 with candidate nodules)

3.3 Diagnosed Subset Evaluation

Nodules to be added to the Diagnosed17 were selected from the candidates described above. To verify the addition of a candidate nodule to the Diagnosed17, a reverse mechanism is adopted: Diagnosed17 nodules act as queries and the nodules to be retrieved come from QueryNoduleSet210, see Fig. 2. The first three similar nodules are assigned the same malignancy as the query nodule if they were previously assigned as candidate nodules (i.e. if the query nodule is benign, then the top three retrieved nodules are also assigned the class benign, provided they were previously candidate nodules). With this mechanism the Diagnosed17 is expanded to Diagnosed74 and then to

Diagnosed121. This process is repeated until no candidate nodules are added to the Diagnosed17 following an iteration. Since neither computer-predicted nor radiologist-predicted malignancy ratings can be considered ground truth due to the high variability between radiologists' ratings [5], this mechanism guarantees the preparation of the LIDC ground truth and the accuracy of the CBIR based diagnostic labeling. All the nodules can be classified in three classes: benign, malignant and metastasis.

4 Results and Discussion

Using the query and retrieval sets described above, the average precision after 3, 5, 10, and 15 images retrieved was calculated. A retrieved nodule was considered relevant if its diagnosis matched the malignancy rating (either radiologist-predicted, computer-predicted, or both) of the query nodule. Initial precision values were obtained by using the 17 nodules in the initial Diagnosed17 as the retrieval set. Then, nodules were added to this set as described in Sections 3.2 and 3.3. Precision was recalculated, and the nodule addition process was repeated iteratively using the new Diagnosed17.
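The precision measure can be sketched as follows; a retrieved nodule counts as relevant when its label matches the query's malignancy label, as described above.

def precision_at_k(query_label, retrieved_labels, k):
    # Fraction of the first k retrieved nodules whose label matches the query.
    top = retrieved_labels[:k]
    return sum(1 for lab in top if lab == query_label) / float(k)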

Fig. 3. Average precision (in %) after 10 images retrieved, for the different query sets (x-axis) and the three retrieval sets Diagnosed17, Diagnosed74 and Diagnosed121

Various experiments were set up for the validation of the examined nodules. Fig. 3 shows that, for the different query sets, the precision increases as the retrieval set grows from Diagnosed17 to Diagnosed74 and Diagnosed121. The nodules in Comp_Rad_biopsy57 provided the best precision, i.e. 98%, which, to the best of our knowledge, is the best precision reported for medical CBIR.
CBIR is an effective method for expanding the Diagnosed Subset by labeling nodules which do not have associated diagnoses. As the LIDC lacks ground truth, CBIR techniques work well to prepare it. This method outperforms control expansion, yielding higher precision values when tested with a potential CAD application [12] that requires a diagnostically accurate ground truth. By increasing the size of the Diagnosed Subset from 17 to 74 and finally to 121 nodules, CBIR expansion provides greater variability in the retrieval set, resulting in retrieved nodules that are more similar to undiagnosed queries.

References
1. Wormanns, D., Fiebich, M., Saidi, M., Diederich, S., Heindel, W.: Automatic detection of
pulmonary nodules at spiral CT: clinical application of a computer-aided diagnosis system.
European Radiology 12, 1052–1057 (2002)
2. Blum, A., Mitchell, T.: Combining Labelled and Unlabelled Data with Co-Training. In:
Proceedings of the 11th Annual Conference on Computational Learning Theory, COLT
1998, pp. 92–100 (1998)
3. McNitt-Gray, M.F., Armato, S.G., Meyer, C.R., Reeves, A.P., McLennan, G., Pais, R.C.,
et al.: The lung image database consortium (LIDC) data collection process for nodule de-
tection and annotation. Academic Radiology 14(12), 1464–1474 (2007)
4. Armato III, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves,
A.P., Zhao, B., Aberle, D.R., Henschke, C.I., Hoffman, E.A., Kazerooni, E.A., MacMa-
hon, H., van Beek, E.J.R., Yankelevitz, D., et al.: The Lung Image Database Consortium
(LIDC) and Image Database Resources Initiative (IDRI): A completed reference database
of lung nodules on CT scans. Medical Physics 38, 915–931 (2011)
5. Horsthemke, W.H., Raicu, D.S., Furst, J.D., Armato III, S.G.: Evaluation Challenges for
Computer-Aided Diagnostic Characterization: Shape Disagreements in the Lung Image
Database Consortium Pulmonary Nodule Dataset. In: Tan, J. (ed.) New Technologies for
Advancing Healthcare and Clinical Practices, pp. 18–43. IGI Global, Hershey PA (2011)
6. Jabon, S.A., Raicu, D.S., Furst, J.D.: Content-based versus semantic-based similarity re-
trieval: a LIDC case study. In: SPIE Medical Imaging Conference, Orlando (February 2009)
7. McNitt-Gray, M.F., Hart, E.M., Wyckoff, N., Sayre, J.W., Goldin, J.G., Aberle, D.R.: A
pattern classification approach to characterizing solitary pulmonary nodules imaged on
high resolution CT: Preliminary results. Med. Phys. 26, 880–888 (1999)
8. Armato III, S.G., Altman, M.B., Wilkie, J., Sone, S., Li, F., Doi, K., Roy, A.S.: Automated
lung nodule classification following automated nodule detection on CT: A serial approach.
Med. Phys. 30, 1188–1197 (2003)
9. Takashima, S., Sone, S., Li, F., Maruyama, Y., Hasegawa, M., Kadoya, M.: Indeterminate
solitary pulmonary nodules revealed at population-based CT screening of the lung: using
first follow-up diagnostic CT to differentiate benign and malignant lesions. Am. J. Roent-
genol. 180, 1255–1263 (2003)
10. Samuel, C.C., Saravanan, V., Vimala, D.M.R.: Lung nodule diagnosis from CT images us-
ing fuzzy logic. In: Proceedings of International Conference on Computational Intelligence
and Multimedia Applications, Sivakasi, Tamilnadu, India, December 13-15, pp. 159–163
(2007)
11. Raicu, D.S., Varutbangkul, E., Furst, J.D.: Modelling semantics from image data: oppor-
tunities from LIDC. International Journal of Biomedical Engineering and Technology
3(1-2), 83–113 (2009)
12. Giuca, A.-M., Seitz Jr., K.A., Furst, J., Raicu, D.: Expanding diagnostically labeled data-
sets using content-based image retrieval. In: IEEE International Conference on Image
Processing 2012, Lake Buena Vista, Florida, September 30-October 3 (2012)
13. Aggarwal, P., Vig, R., Sardana, H.K.: Largest Versus Smallest Nodules Marked by Differ-
ent Radiologists in Chest CT Scans for Lung Cancer Detection. In: International Confe-
rence on Image Engineering, ICIE 2013 Organized by IAENG at Hong Kong (in press,
2013)
14. Zhou, Z.-H.: Learning with Unlabeled Data and Its Application to Image Retrieval. In:
Yang, Q., Webb, G. (eds.) PRICAI 2006. LNCS (LNAI), vol. 4099, pp. 5–10. Springer,
Heidelberg (2006)
Enforcing Consistency of 3D Scenes
with Multiple Objects
Using Shape-from-Contours

Matthew Grum and Adrian G. Bors

Dept. of Computer Science, University of York, York YO10 5GH, UK

Abstract. In this paper we present a new approach for modelling scenes


with multiple 3D objects from images taken from various viewpoints.
Such images are segmented using either supervised or unsupervised al-
gorithms. We consider the mean-shift and support vector machines for
image segmentation using the colour and texture as features. Back-
projections of segmented contours are used to enforce the consistency of
the segmented contours with initial estimates of the 3D scene. A study
for detecting merged objects in 3D scenes is provided as well.

Keywords: 3D scene reconstruction, multi-view images, Shape-from-


contours, image segmentation.

1 Introduction

Single 3D object reconstruction from several images has attracted considerable


research interest using various approaches such as multi-view stereo, shape-from-
silhouettes, shape-from-shading, etc. Various 3D object representations are used
including voxels [1], radial basis functions (RBF) [2] and meshes [3]. Space carv-
ing is a form of multi-view stereo which assigns voxels to a 3D object or carves
them away from its volume according to their photoconsistency [1]. Implicit
RBFs have been shown to represent well surfaces of 3D objects in [2]. Object
reconstruction from its silhouettes was based on the principle of duality between
the tangent planes and their corresponding object space [4,5] or by using visual
hulls of the object [6]. In [7] disparities between projections of 3D patches from
various images, are used to correct the 3D scene modelled using RBFs.
The methodology described in this paper aims to robustly enforce the con-
sistency of scenes of multiple objects with their corresponding contours seg-
mented from images. We consider a certain approximate representation of the
3D scene as the initialization. The surface representation elements, which may
be voxels, RBF centers or vertices, are displaced according to their consistency
with segmented contours. We consider both supervized and unuspervized seg-
mentation for extracting object contours from images from various views. The
shape-from-contours approach for correcting 3D scenes of multiple objects by
using object contour consistency is described in Section 2. A study of identifying


wrongly merged objects in 3D scenes is provided in Section 3. Experimental re-


sults and the conclusions of this study are described in Section 4 and Section 5,
respectively.

2 3D Scene Correction Using Shape-from-Contours


Let us assume that we have a set of images I = {Ii |i = 1, . . . , n}, each taken from
a different viewing angle, which are characterized by their projection matrices
P = {Pi |i = 1, . . . , n}. In the following we assume that we have an initial
approximate representation of a scene S with multiple objects. Correcting such
scenes using image disparities in the projections of 3D patches was discussed in
[7]. That approach relies on good textures but does not provide good results in
areas of uniform color. Nevertheless, such areas can be easily segmented.
Let us assume that the scene contains at least two distinct objects {A, B} ∈ S
closely located to each other. We consider that each object outline from the 3D
scene is projected onto segmented contours, of their corresponding projections
from the input images, denoted as {ai , bi } ∈ Ii , i = 1, . . . , n where:
ai = Pi A ; bi = Pi B (1)
where Pi represents the projection matrix from the 3D scene to the ith image. In
this paper we consider image segmentation for defining the contours of objects,
such as A and B in 3D or {ai , bi } in the 2D images from I. By comparing
the projections of the segmented objects in 3D with their corresponding image
segmentations we can detect any inconsistencies and use these to correct S.
In the following we consider both unsupervized and supervized image segmen-
tation for extracting object contours. We consider a feature space for each pixel
represented by a vector z = [r, g, b, ζt]T , containing the three color components
and a texture feature t, weighted by ζ, provided by the output of the Harris
corner detector. For the unsupervized image segmentation we use the mean shift
algorithm which was employed for image segmentation in [8]. The local maxima
in the probability density function of the features are found by the mean-shift
algorithm. Pixels with features closest to each of these local maxima are as-
signed to the corresponding segmented regions. For the supervised segmentation
we use support vector machines (SVM) [9]. SVM requires a training set which
is created by sampling a set of pixels from one or more images from the set I.
Each of these pixels is labelled according to their class which may correspond to
an object or to the background. SVM using quadratic programming techniques
finds the boundary which optimally divides the training data set into classes,
each corresponding to an object.
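A rough sketch of the unsupervised path is given below: it builds the per-pixel feature z = [r, g, b, ζt] with the Harris response as the texture feature t and clusters it with mean shift. The bandwidth, the weight ζ and the use of OpenCV and scikit-learn are assumptions of this sketch, not choices made in the paper.

import cv2
import numpy as np
from sklearn.cluster import MeanShift

def segment_mean_shift(image_bgr, zeta=0.5, bandwidth=20.0):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    t = cv2.cornerHarris(gray, blockSize=3, ksize=3, k=0.04)   # texture feature
    t = cv2.normalize(t, None, 0, 255, cv2.NORM_MINMAX)
    feats = np.concatenate([image_bgr.reshape(-1, 3).astype(np.float32),
                            zeta * t.reshape(-1, 1)], axis=1)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    return labels.reshape(image_bgr.shape[:2])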
The goal of the segmentation is to extract a set of object boundaries from im-
ages. The initial estimates of the object boundaries are refined by using active
contours such as snakes, in order to join edge maps, representing boundaries of
segmented objects from images, into continuous contours of objects Ci . The consistency of the contours obtained by projecting the 3D object outlines CS is verified against the corresponding object contours segmented from the images. Any inconsistencies between the projected contours of the 3D scene Pi CS

and the contours Ci segmented from the image of the corresponding view are
detected and then used to correct the 3D scene. The visual hull, denoted by H, is
the outer bound of the scene based on its appearance in several images and was
used for modelling 3D scenes from images in shape-from-silhouettes [4,6,5,11].
If a point lies within the visual hull then its projection falls inside the scene
silhouette in each image. However, the visual hull will not be able to represent
certain regions in multi-object scenes, such as for example the region from the
middle of the scene, due to the fact that objects will invariably occlude each
other in several images in such scenes.
In shape-from-contours, the concept of the visual hull is applied to individual
objects from the scene. In this case, the visual hull of objects, denoted as H(ai )
and H(bi ) is provided by the object contours such as ai or bi from each image
i = 1, . . . , n, where these objects are visible. After comparing the sets of pixels
corresponding to 2D contours and those from the projected 3D contours we
identify the regions corresponding to the undesired difference sets as :
{c|Pi c = (S(Pi CS )\S(Ci )) ∪ (S(Ci )\S(Pi CS ))} (2)
where S(Ci ) represents the set of pixels located in the interior of contour Ci ,
i = 1, . . . , n and c ∈ S is a point from the 3D scene, whose projection Pi c lies
among the pixels from the area between the sets Pi CS and Ci from each image.
Such points are displaced to their nearest surface in 3D along its surface normal:
\hat{c} = c - \gamma \frac{1}{m} \sum_{i=1}^{m} P_i^{-1} \overrightarrow{\nabla_{z_j} C_i},   (3)
where m ≤ n represents all the images in which an inconsistency between the 3D scene projections and the actual object contours is identified, γ is a correction factor, and \overrightarrow{\nabla_{z_j} C_i} represents the correction vector, which is perpendicular to the object contour at the location zj . Eventually, such points would be
located in S(Pi CS )∩S(Ci ) ensuring the consistency of H(ai ) and H(bi ) with the
3D scene S. This methodology can be applied to various surface representations
such as voxels, parametric (including RBFs) and meshes, where surface self-
intersections would have to be avoided [3].
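To make the consistency test of Eq. (2) concrete, the following sketch computes the undesired difference set for one view from two filled binary masks; NumPy, the mask names and the toy data are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def inconsistency_mask(projected_region, segmented_region):
    """Pixels belonging to exactly one of the two filled regions (Eq. 2).

    projected_region : bool array, interior of the projected 3D contour Pi CS
    segmented_region : bool array, interior of the segmented contour Ci
    """
    projected_region = projected_region.astype(bool)
    segmented_region = segmented_region.astype(bool)
    # (S(Pi CS) \ S(Ci)) U (S(Ci) \ S(Pi CS)) is the symmetric difference
    return np.logical_xor(projected_region, segmented_region)

# toy example: a square region versus a slightly shifted square region
a = np.zeros((8, 8), bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), bool); b[3:7, 3:7] = True
print(inconsistency_mask(a, b).sum(), "inconsistent pixels")
```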

3 Analysis of Object Separability When Reconstructing Scenes with Multiple Objects

In the following we consider the case when two objects from the scene are wrongly
merged together in the initial stages of the 3D modelling of the scene. Such
situations may arise due to object occlusion, uncertainty in camera parameters,
illumination conditions, image noise, etc. [3]. We consider a simple artificial
scene consisting of two identical objects, either a pair of cylinders or a pair of
cuboids with a square base. A circular configuration of n cameras is used, with
the cameras evenly spaced on a circle located at a height corresponding to
θ = 0.85 radians.
(a) Two cuboids   (b) Two cylinders   (c) Hausdorff distance   (d) Area difference

Fig. 1. Detecting inconsistencies between the merged and separated objects

We assess the influence of the shape when attempting to detect wrongly


merged objects in 3D scenes, which are shown as separate in their image pro-
jections. The distance between the objects is set as equal to their width and we
consider the normalized area error, calculated from the image projections as:
        eA (F, G) = |(S(F ) ∪ S(G)) \ (S(F ) ∩ S(G))| / |S(G)|            (4)
where S(F ) and S(G) are the areas corresponding to the projections of the 3D
objects in the hypotheses of fused and separated objects. Figs. 1(a) and 1(b)
show the plots of the normalized area error eA (F, G) when varying both the
elevation θ and the azimuth φ angles of the camera while the distance between
the objects is kept constant. We also consider varying the distance between the
objects d by moving them away from each other. Plots of both the Hausdorff distance
[10] and the area error eA (F, G) are shown when varying φ and d in Figs. 1(c)
and 1(d) for the scene showing two cuboids. These plots clearly show the width
of the peak getting smaller as the intra-object gap is reduced.

Fig. 2. Surface error eA (F, G) plotted against the number of cameras n
Large numbers of input images do not necessarily improve 3D scene recon-
struction proportionally [11]. We estimate the necessary number of cameras for
detecting when two cuboids are merged. For each number of given cameras n,
we record the minimum and maximum area errors eA (F, G), measured between
the fused and separate case hypotheses when considering all possible offsets. The
results for the best and worst cases when detecting the separation for each camera
configuration are shown in Figure 2. It can be observed that when using
more than 10 cameras, the error eA (F, G) from (4) provides a good assessment
of the inconsistencies between the 3D scene and the objects shown in images.
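As an illustration of the error measure used in this study, the sketch below evaluates the normalized area error eA (F, G) of Eq. (4) on two binary projection masks; the mask layout and values are hypothetical and only serve to show the computation.

```python
import numpy as np

def area_error(f_mask, g_mask):
    """Normalised area error eA(F, G) of Eq. (4) for two filled binary masks:
    the symmetric difference of the two projected regions divided by the
    area of the second region."""
    f_mask = f_mask.astype(bool)
    g_mask = g_mask.astype(bool)
    sym_diff = np.logical_xor(f_mask, g_mask).sum()
    return sym_diff / g_mask.sum()

# toy example: fused-objects mask versus separated-objects mask
fused = np.zeros((10, 20), bool); fused[3:7, 2:18] = True
separated = fused.copy(); separated[:, 9:11] = False   # a gap splits the blob in two
print(area_error(fused, separated))
```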

4 Experimental Results
The proposed methodology of correcting 3D scenes of multiple objects using
the consistency with object contours was applied to various image sets. Four
images, from a set of n = 12 images of a scene with 5 main objects captured
from various viewpoints, are shown in Figs. 3a-e. We initialize the 3D scene using
space carving [1], represent its surface with implicit RBFs [2], and correct the
image disparities from projections of 3D patches as in [7]. The resulting 3D scene
is shown from two different angles in Figs. 4a and 7a. It can be observed that
two of the objects representing a knife-block and a kettle are merged together
as shown in the closer view from Fig. 4b. The results provided by shape-from-
silhouettes (SFS) [5,6] when applied to the original set of 12 images are shown in
Fig. 5a. In Fig. 5b we apply SFS to the 3D scene result from Fig. 4a.

Fig. 3. Five images from the image set showing multiple objects

In the following we apply the proposed shape-from-contours (SFC) methodology,
described in Section 2. The main difference between SFC and SFS occurs
in the regions where two or more objects are located close to or in contact with
each other. Individual 3D objects from the scene are separated by thresholding
the scene with a plane parallel with their horizontal base. Object contours are
extracted from the image set and compared with the contours resulting from
projecting the current 3D scene onto the image planes using identical camera
(a) Initial 3D scene (b) Segmented fused objects


Fig. 4. 3D scene representations using implicit RBFs

(a) Applied on the initial image set (b) Applied on the 3D initial estimate
Fig. 5. Shape-from-silhouettes results

parameters P with those of the original images. We consider a weight of ζ = 2
for the Harris corner output from the feature vector z. Both unsupervised and
supervised image segmentation are considered for extracting the object contours
from the images. Mean shift clustering is used for unsupervised segmentation
while SVM was employed for supervised segmentation as described in Section 2.
The training for SVM considers only a few pixels from the objects and back-
ground of a single image, shown in Fig. 3b.
The merged object segmentation in four different images is shown in blue in
Figs. 6a-d and 6e-h, when using unsupervised and supervised segmentation,
respectively, while the projection of the fused object surface from Fig. 4b, is
shown in red. We can observe a large discrepancy in Figs. 6c and 6g and a smaller
one in Figs. 6a, 6b, 6e and 6f. The regions between the two contours, displayed
using red and blue, are back-projected into the 3D scene and their corresponding
volumes corrected. For the RBF representation, the corresponding centers are
displaced according to (3) or their weights are switched from positive to negative.
The initial 3D scene is shown in Fig. 7a, while the 3D updated scene results
are provided in Figs. 7b and 7c after the correction by using unsupervised and
supervised segmented contours, respectively. There is a significant improvement
in the 3D reconstruction of the area between the kettle and knife block, with
both methods achieving a complete separation of the two objects. In the updated
results, regions from the middle of the scene are now visible through the gap
Fig. 6. Projections of the 3D scene onto image planes, shown with red, and object
contours resulting from image segmentation, shown with blue, using mean-shift in (a)-
(d) and SVM in (e)-(h). Some segmentation errors due to shading and specularities
can be observed in the segmented contours.

(a) Initial (b) Unsupervised (c) Supervised


Fig. 7. 3D scene correction results by using the discrepancy in the contours resulting
from the projection of the 3D scene and those segmented from the actual images

between the kettle and knife-block. These results are definitely better than those
provided by SFS from Figs. 5a and 5b.
Numerical errors are evaluated for assessing the improvement in the 3D scene
when using SFC with either unsupervised or supervised image segmentation.
We consider two error measures as in the study from Section 3: the Hausdorff
distance [10] and area error eA (F, G) from (4). These measures assess the differ-
ences between the projected contours of objects and their corresponding image
segmented contours. Numerical results are shown in Figs. 8a and 8b when vary-
ing the azimuth angle φ. An error peak, which corresponds to the region located
between the kettle and the knife-block, can be seen in all four curves from both
plots from Fig. 8.
(a) Hausdorff distance eH (F, G) (b) Area difference eA (F, G)

Fig. 8. Numerical accuracy of extracted contours for the entire image set

In the following, we evaluate the PSNR between the original image and the
corresponding projections from the corrected 3D scene, for the region shown inside
the black rectangle in Fig. 3e. For this region, the initial PSNR of
11.17 dB is improved to 13.91 dB and to 13.75 dB when using unsupervised
and supervised segmentations for SFC, respectively. Three different views of the entire 3D
scene reconstruction are shown in Fig. 9 after enforcing unsupervised shape-
from-contours consistency, considering the same projection parameters P as in
the original image set.
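For completeness, a small sketch of the region-restricted PSNR evaluation used above; the rectangle coordinates and the 8-bit peak value are assumptions for illustration, not the values used in the experiments.

```python
import numpy as np

def region_psnr(original, rendered, box, peak=255.0):
    """PSNR restricted to a rectangular region of interest.

    box = (top, bottom, left, right) in pixel coordinates (a hypothetical layout).
    """
    o = original[box[0]:box[1], box[2]:box[3]].astype(np.float64)
    r = rendered[box[0]:box[1], box[2]:box[3]].astype(np.float64)
    mse = np.mean((o - r) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# toy example with a random 8-bit image and a noisy rendering of it
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (120, 160))
noisy = np.clip(img + rng.normal(0, 20, img.shape), 0, 255)
print(region_psnr(img, noisy, (30, 90, 40, 120)))
```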

Fig. 9. Views of the 3D scene corrected using unsupervised shape-from-contours

5 Conclusions

This paper proposes to use shape-from-contours (SFC) for modelling scenes of
multiple objects. SFC uses back-projections of segmented contours of objects
from multi-view images in order to improve the 3D scene with multiple objects.
Both unsupervised and supervised segmentation, using the mean-shift and sup-
port vector machines (SVM), respectively, are used for segmenting the image set.
A study analyzing the detection of merged objects in reconstructed 3D scenes
is provided together with the analysis of the number of cameras required for
detecting such errors. The proposed methodology can be applied for correcting
various 3D scene representations including those using voxels, meshes or para-
metric models.

References
1. Broadhurst, A., Drummond, T.W., Cipolla, R.: A probabilistic framework for space
carving. In: Proc. Int. Conf. on Comp. Vision, vol. 1, pp. 388–393 (2001)
2. Dinh, H.Q., Turk, G., Slabaugh, G.: Reconstructing surfaces by volumetric regular-
ization using radial basis functions. IEEE Trans. on Pattern Analysis and Machine
Intelligence 24(10), 1358–1371 (2002)
3. Zaharescu, A., Boyer, E., Horaud, R.: Topology-Adaptive Mesh Deformation for
Surface Evolution, Morphing, and Multiview Reconstruction. IEEE Trans. on Pat-
tern Analysis and Machine Intelligence 33(4), 823–837 (2011)
4. Liang, C., Wong, K.-Y.: Robust recovery of shapes with unknown topology from
the dual space. IEEE Trans. on Pat. Anal. and Machine Intell. 29(12), 2205–2216
(2007)
5. Lazebnik, S., Furukawa, Y., Ponce, J.: Projective visual hulls. Int. Jour. of Comp.
Vision 74(2), 137–165 (2007)
6. Koenderink, J.: What does the occluding contour tell us about solid shape. Per-
ception 13(3), 321–330 (1984)
7. Grum, M., Bors, A.G.: Enforcing image consistency in multiple 3-D object mod-
elling. In: Proc. Int. Conf. on Pattern Recog., Tampa, FL, USA, pp. 3354–3357
(2008)
8. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space
analysis. IEEE Trans. on Pattern Analysis and Machine Intell. 24(5), 603–619
(2002)
9. Scholkopf, B., Sung, K.-K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., Vapnik,
V.: Comparing support vector machines with Gaussian kernels to Radial Basis
Function Classifiers. IEEE Trans. on Signal Processing 45(11), 2758–2765 (1997)
10. Huttenlocher, D., Klanderman, G., Rucklidge, W.: Comparing images using
the Hausdorff distance. IEEE Trans. on Pattern Analysis and Machine Intelli-
gence 15(9), 850–863 (1993)
11. Chaurasia, G., Sorkine, O., Drettakis, G.: Silhouette-Aware Warping for Image-
Based Rendering. Computer Graphics Forum 30(4), 1223–1232 (2011)
Expectation Conditional Maximization-Based
Deformable Shape Registration

Guoyan Zheng

Institute for Surgical Technology and Biomechanics, University of Bern, Switzerland


[email protected]

Abstract. This paper addresses the issue of matching statistical and non-rigid
shapes, and introduces an Expectation Conditional Maximization-based de-
formable shape registration (ECM-DSR) algorithm. Similar to previous works,
we cast the statistical and non-rigid shape registration problem into a missing
data framework and handle the unknown correspondences with Gaussian Mix-
ture Models (GMM). The registration problem is then solved by fitting the
GMM centroids to the data. But unlike previous works where equal isotropic
covariances are used, our new algorithm uses heteroscedastic covariances
whose values are iteratively estimated from the data. A previously introduced
virtual observation concept is adopted here to simplify the estimation of the reg-
istration parameters. Based on this concept, we derive closed-form solutions to
estimate parameters for statistical or non-rigid shape registrations in each itera-
tion. Our experiments conducted on synthesized and real data demonstrate that
the ECM-DSR algorithm has various advantages over existing algorithms.

Keywords: Expectation conditional maximization, deformable shape registration,


Gaussian mixture models, heteroscedastic covariances.

1 Introduction

Registration of a set of model points to observation data is a frequently encountered
problem in several fields, namely computer vision and pattern recognition [1-9],
as well as medical imaging [10-17]. The iterative closest point (ICP) algorithm [1] is
one of the most well-known algorithms. It works by alternating between correspon-
dence establishment and parameter estimation until convergence. Disadvantages of
the ICP algorithm include the requirement of a good initial guess, the convergence to
local minima and the sensitivity to outliers. This has motivated the introduction of
various robust methods based on Gaussian Mixture Models (GMM) fitting which is
then solved by the Expectation Maximization (EM) algorithm [18]. Common to all
these previous works, however, is that equal isotropic covariances are used. Horaud et
al. introduce the Expectation Conditional Maximization (ECM) for Point Registration
algorithm [9] which replaces the maximization step in the EM algorithm with Condi-
tional Maximization (CM) steps that consist of first estimating the registration
parameters by maximizing the expectation and then computing the covariance matric-
es conditioned by the newly estimated registration parameters. They have shown that

their algorithm allows the use of general covariance matrices for the mixture model
components and improves over the equal isotropic covariance case, but they only
applied their algorithm to solve rigid and articulated point registration problems. Re-
cently Xie et al. [14] used the ECM algorithm to solve the statistical shape registra-
tion problem but the shape coefficients were estimated asynchronously.
In this paper, we extend the ECM algorithm [9, 19] to solve the statistical and non-
rigid shape registration problems and introduce the ECM-based deformable shape
registration (ECM-DSR) algorithm. Unlike previous works where equal isotropic
covariances are used, our new algorithm uses heteroscedastic covariances
whose values are iteratively estimated. Furthermore, a previously introduced virtual
observation concept is adopted here to simplify the estimation of the registration
parameters. Based on this concept, we derive closed-form solutions to estimate para-
meters for statistical or non-rigid shape registration in each iteration. We conducted com-
prehensive experiments on synthesized and real data to demonstrate the advantages of
the ECM-DSR algorithm over existing algorithms.
Details about the ECM-DSR algorithm will be described in Section 2, followed by
experimental results in Section 3. We conclude the paper in Section 4.

2 The ECM-DSR Algorithm

2.1 Mathematical Notations

Throughout the paper, we use following notations. The superscript “T” means “trans-
pose”. is the dimension of the point sets; , are the number of points in the
two point sets: , ,…, is the data matrix for the data points and
, , …, is the data matrix for the GMM centroids;
, is the transformation applied to to get the new positions of the
GMM centroids (more specifically, , ) , where is the set of the trans-
formation parameters. || || is a regularization over the transformation. For a
statistical shape model, we also use to indicate the mean model. Giving a cu-
toff number , , 1,2, … , are the set of eigenvalues that are sorted in des-
cending order and , 1,2, … , are their corresponding normalized eigenvec-
tors of the statistical shape model, where each eigenvector is
1 ,…, . Furthermore, dot product between two vectors
and is written as either · or , depending on the context. is an identity
matrix and means a diagonal matrix constructed from the vector ;
means to compute the trace of a matrix .

2.2 ECM for Shape Registration


The shape registration problem is cast into a missing data framework and the un-
known correspondences are handled with GMM. We consider the points in as the
GMM centroids and the points in as the data points generated by the GMM. The
problem is solved by minimizing the negative log-likelihood function
,Σ ,…,Σ ∑ ∑ | , (1)
where Σ , 1,2, … , is the covariance matrices and | ,


| , ,Σ is the likelihood of an observation given its assignment to
GMM centroid , which is drawn from a Gaussian distribution with mean ,
and covariance Σ . Although the original ECM algorithm [9] allows the use of general
covariance matrices, we choose to use heteroscedastic covariance for each GMM cen-
troid. Thus, we have Σ . Our unknowns now are , ,…, that can
be found by the ECM algorithm [19]. The expectation step computes the posterior prob-
ability of the GMM components:
|| , ||

| || , ||
(2)

where as suggested by Horaud et al. [9] corresponds to the outlier component.


The unknowns are estimated by minimizing the following objective function:

∑ ∑ | , | log || || (3)

where is the parameters controlling the contribution of the regularization and the
problem is solved by two conditional minimization steps, using given by (2):
A. Estimating the registration parameters by minimizing

argmin ∑ ∑ | , | || || (4)

Details about how to estimate the registration parameters will be given in section 2.3.
B. For all 1,2, … , , estimate covariances using following closed-form solution
∑ || , ||

(5)
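Since the covariance update above is given only in closed form, the following sketch illustrates one ECM iteration in standard Gaussian-mixture notation: posteriors with one isotropic variance per centroid plus a constant outlier term, followed by the conditional variance update. The symbol names (alpha, sigma2, TY) and the uniform outlier constant are our assumptions and do not reproduce the paper's exact expressions.

```python
import numpy as np

def ecm_e_step(X, TY, sigma2, outlier=1e-3):
    """E-step: posteriors alpha[n, m] of data point x_n belonging to transformed
    centroid TY[m], with one isotropic variance sigma2[m] per centroid and a
    constant outlier term (standard GMM notation, not the paper's symbols)."""
    D = X.shape[1]
    d2 = ((X[:, None, :] - TY[None, :, :]) ** 2).sum(-1)      # N x M squared distances
    lik = np.exp(-0.5 * d2 / sigma2) / (2 * np.pi * sigma2) ** (D / 2.0)
    denom = lik.sum(axis=1, keepdims=True) + outlier
    return lik / denom

def cm_variance_step(X, TY, alpha, eps=1e-9):
    """CM step for the heteroscedastic variances, conditioned on the newly
    estimated transformation: a weighted mean of squared residuals per centroid."""
    D = X.shape[1]
    d2 = ((X[:, None, :] - TY[None, :, :]) ** 2).sum(-1)
    return (alpha * d2).sum(axis=0) / (D * alpha.sum(axis=0) + eps)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                 # data points
TY = rng.normal(size=(20, 2))                 # transformed GMM centroids
sigma2 = np.full(20, 0.5)
alpha = ecm_e_step(X, TY, sigma2)
sigma2 = cm_variance_step(X, TY, alpha)
```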

2.3 Virtual Observation Concept for Estimating the Registration Parameters


Virtual observation concept has been implicitly or explicitly used by several authors
such as in [3] and [9] to simplify the point matching problem. We also adopted this
concept to simplify the solution to our problem. The so-called virtual observation
and its weight are defined as ∑ , and ∑ , re-
spectively. With these two definitions, now Eq. (4) can be simplified as

argmin ∑ || , || || || (6)

Eq. (6) and Eq. (4) have exactly the same solution. This can be proved by expanding
the first term on the right side of Eq. (4) and neglecting the constant coefficient as:

∑ ∑ 2∑ , ∑ , , (7)
Since the first term of Eq. (7) does not depend on the registration parameters ,
replacing it with will not change the original optimization problem of
Eq. (4). The second term of Eq. (7) is 2 , , and the third term is
, , . Combining all three terms we have it as ∑ ||
, || . This proves that Eq. (6) and Eq. (4) have the same solution.
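A small numerical sketch of the virtual observations used above: the weight of each centroid is the sum of its posteriors, and the virtual observation is the posterior-weighted mean of the data points. The names Xbar and w are illustrative, and the posteriors here are random stand-ins for those produced by an E-step.

```python
import numpy as np

def virtual_observations(X, alpha, eps=1e-9):
    """Per-centroid virtual observation (posterior-weighted mean of the data)
    and its weight (sum of posteriors over the data points)."""
    w = alpha.sum(axis=0)                        # weight of each virtual observation
    Xbar = (alpha.T @ X) / (w[:, None] + eps)    # virtual observation for each centroid
    return Xbar, w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
alpha = rng.random((200, 20))
alpha /= alpha.sum(axis=1, keepdims=True)        # stand-in posteriors
Xbar, w = virtual_observations(X, alpha)
```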
With the virtual observation concept, we can now discuss how to solve Eq. (6).
The solution depends on how we parameterize the transformation .

• When is a rigid or a scaled rigid transformation. In this case, || ||


0. Previous works such as [9], [14] and [15] have discussed its solution. Thus, we
will not address it here. We invite the interested reader to refer to these works.
• When is a statistical shape model instantiation. In this case, can be parame-
terized as , ∑ and || || ∑ is the Mahalanobis
distance [20]. All the shape coefficients can be solved with a closed-form solution:
(8)

where ,…, are the shape coefficients vector to be determined; is


∑ · ∑ · ∑ ·

∑ · ∑ · ∑ ·

∑ · ∑ · ∑ ·

and is computed as
∑ ·

∑ ·

∑ ·

• When is a non-rigid transformation. There are different ways to parameterize a


non-rigid transformation . One example is to use thin-plate splines [3, 6]. Here we
choose to use the coherent point drift (CPD) that was introduced by Myronenko and
Song [7] to parameterize . In CPD, the non-rigid transformation is modeled as a
displacement field defined over each GMM centroid that can be expressed as:

, (9)

where is a matrix of coefficients to be determined, is a symmetric


|| ||
kernel matrix with elements , exp . With this parameterization,
the regularization over the transformation is: || || . If we
represent all the virtual observation weights as a vector ,…, and compute
the partial derivatives of Eq. (6) with respect to , we can get a closed-form solution:
(10)

where , ,…, is the data matrix of all virtual observations.
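For the non-rigid case, a hedged sketch of how Eqs. (9)-(10) can be realised: the Gaussian kernel matrix of the CPD parameterization and a weighted, regularised linear solve for the coefficient matrix driven by the virtual observations. The assembled linear system is our reading of a standard weighted CPD-style update, not necessarily the exact closed form derived in the paper; beta and lam are assumed kernel width and regularisation weights.

```python
import numpy as np

def gaussian_kernel(Y, beta):
    """Kernel matrix G with G[i, j] = exp(-||y_i - y_j||^2 / (2 beta^2))."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * beta ** 2))

def solve_coefficients(Y, Xbar, w, beta, lam):
    """Coefficient matrix W of the displacement field v(Y) = G W, obtained from a
    weighted, regularised least-squares system (a sketch of a CPD-style update
    with per-centroid weights, not the paper's exact closed form)."""
    G = gaussian_kernel(Y, beta)
    M = Y.shape[0]
    A = np.diag(w) @ G + lam * np.eye(M)
    b = np.diag(w) @ (Xbar - Y)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(2)
Y = rng.normal(size=(20, 2))                   # GMM centroids
Xbar = Y + 0.1 * rng.normal(size=Y.shape)      # virtual observations near the centroids
w = np.full(20, 1.0)
W = solve_coefficients(Y, Xbar, w, beta=2.0, lam=1.0)
Y_new = Y + gaussian_kernel(Y, 2.0) @ W        # updated centroid positions
```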

3 Experimental Results
Qualitative and quantitative experiments are conducted to evaluate the performance of
the present approach.

3.1 Qualitative Experiments


We first conducted a qualitative experiment on 2D synthesized data with outliers by
Chui and Rangarajan [3] to evaluate the performance and the efficacy of our non-rigid
registration algorithm taking the CPD algorithm as the reference method1. Fig. 1
shows qualitatively how these two algorithms perform when tested on a synthesized
data with outliers. In this figure, we also depict the final covariances that are esti-
mated automatically by these two algorithms. From this depiction, one can observe
that our algorithm uses heteroscedastic covariances which are helpful in effectively
handling outliers while the CPD algorithm uses equal covariances, leading to poor
results in handling outliers. Furthermore, it took 42 steps for our algorithm to con-
verge while this number changed to 87 when the CPD algorithm was used.
To better illustrate how the heteroscedastic covariances that are automatically es-
timated by the ECM-DSR algorithm can tell us more information about the uncertain-
ty of the results, we conducted a second study on registering a pair of corpus callosum
shapes (Fig. 2). For this study, it took the ECM-DSR algorithm 21 iterations to con-
verge and the CPD algorithm converged after 31 iterations. Both algorithms achieved
reasonably good results but we can identify three regions where the uncertainty of the
registration is high by looking into the heteroscedastic covariances estimated by the
ECM-DSR algorithm. Such information cannot be obtained from the CPD algorithm.

Fig. 1. Performance of CPD (top) and ECM-DSR (bottom) in handling outliers. From left to
right: inputs, results, overlap the results with ground truth, and estimated covariances depicted
as black circles around GMM centroids whose radii equal to the corresponding covariances.

1
From http://www.umiacs.umd.edu/~zhengyf/PointMatchDemo/DataChui.
zip, we got the synthesized data, and from https://sites.google.com/site/
myronenko/research/cpd, we got the reference implementation of the CPD algorithm
and the bunny model data shown in Fig. 3.
Fig. 2. Equal covariances estimated by the CPD algorithm (top) and the heteroscedastic cova-
riances estimated by the ECM-DSR algorithm (bottom) during iterations. The rightmost col-
umn shows the estimated covariances after convergence. Three regions of high uncertainty
(depicted with red ellipses) can be identified for the ECM-DSR but not for the CPD.

The third qualitative experiment was conducted on non-rigid registration of two 3D


bunny models. Fig. 3. shows the results achieved by our approach.

Fig. 3. Qualitative experiment conducted on non-rigid registration of two 3D bunny models.


Left: before registration; right: after registration.

Fig. 4. Boxplots of the comparison of the ECM-DSR with those of the CPD on the Chui-
Rangarajan synthesized data sets. Top: the Chinese Character shape, bottom: the fish shape.

3.2 Quantitative Experiments on Chui-Rangarajan Synthesized Data


With synthesized data, we know the ground truth. Thus, it can be used to test an algo-
rithm. More details about the synthesized data are explained in [3], which contains
three sets of data designed on two shape templates (Chinese Character shape and fish
shape) to measure the robustness of an algorithm under different degrees of deforma-
tion, noise levels and outliers. Fig. 4 demonstrates the quantitative comparison results
between our non-rigid registration algorithm and the CPD algorithm.

3.3 Scapula Data


In this experiment, we applied the ECM-DSR algorithm to a challenging task, that is,
to instantiate a surface model of the scapula from a statistical shape model (SSM) and
a sparse point set.
To this end, we construct a statistical shape model of the scapula from 24 seg-
mented CT models. Additionally, we have 9 segmented CT models of complete sca-
pula. Each time about 500 to 1000 points are randomly generated from each seg-
mented CT model which will be used as the input to instantiate a surface model of the
scapula. We conducted quantitative study on the 9 complete scapula data to evaluate
the reconstruction accuracy of the ECM-DSR algorithm. The reconstruction accuracy
was estimated by computing the distances between the reconstructed surface models
and the ground truth surface models segmented from CT data.
Fig. 5 shows the quantitative reconstruction results by our algorithm. A mean re-
construction error of 0.66 mm was found.

Fig. 5. Quantitative reconstruction results by the ECM-DSR algorithm

4 Conclusions

In this paper we presented a robust point matching algorithm for statistical and non-
rigid shape registration based on the Expectation Conditional Maximization algo-
rithm. Our experiments conducted on synthesized and real data demonstrate that
ECM-DSR has various advantages over existing algorithms, including less iteration
steps required for convergence, higher accuracy, more robust to outliers and providing
more information about the uncertainty of the registration results.

Acknowledgements. This work was partially supported by the Swiss National


Science Foundation (SNSF) via Project 205321_138009.
References
1. Besl, P., McKay, N.: A method for registration of 3-d shapes. IEEE Trans. Pat. Anal. and
Machine Intel. 14, 239–256 (1992)
2. Granger, S., Pennec, X.: Multi-scale EM-ICP: A fast and robust approach for surface reg-
istration. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV.
LNCS, vol. 2353, pp. 418–432. Springer, Heidelberg (2002)
3. Chui, H., Rangarajan, A.: A new point matching algorithm for nonrigid registration. Com-
put. Vis. Image Understand. 89(2-3), 114–141 (2003)
4. Luo, B., Hancock, E.: A unified framework for alignment and correspondence. Comp. Vis.
and Image Underst. 92(1), 26–55 (2003)
5. Tsin, Y., Kanade, T.: A correlation-based approach to robust point set registration. In: Paj-
dla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023, pp. 558–569. Springer, Heidel-
berg (2004)
6. Zheng, Y., Doermann, D.S.: Robust Point Matching for Nonrigid Shapes by Preserving Local
Neighborhood Structures. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 643–649 (2006)
7. Myronenko, A., Song, X.: Point set registration: coherent point drift. IEEE Trans. Pattern
Anal. Mach. Intell. 32(12), 2262–2275 (2010)
8. Jian, B., Vemuri, B.C.: Robust Point Set Registration Using Gaussian Mixture Models.
IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1633–1645 (2011)
9. Horaud, R., Forbes, F., Yguel, M., Dewaele, G., Zhang, J.: Rigid and Articulated Point
Registration with Expectation Conditional Maximization. IEEE Trans. Pattern Anal. Mach.
Intell. 33(3), 587–602 (2011)
10. Chui, H., Rangarajan, A., Zhang, J., Leonard, C.M.: Unsupervised learning of an atlas
from unlabeled point-sets. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 160–172 (2004)
11. Hufnagel, H., Pennec, X., Ehrhardt, J., Ayache, N., Handels, H.: Generation of a statistical
shape model with probabilistic point correspondences and the expectation maximization-
iterative closest point algorithm. Int. J. CARS 2(5), 265–273 (2008)
12. Abi-Nahed, J., Jolly, M., Yang, G.Z.: Robust active shape models: A robust, generic and
simple automatic segmentation tool. In: Larsen, R., Nielsen, M., Sporring, J. (eds.)
MICCAI 2006. LNCS, vol. 4191, pp. 1–8. Springer, Heidelberg (2006)
13. Shen, K., Bourgeat, P., Fripp, J., Meriaudeau, F., Salvado, O.: Consistent estimation of
shape parameters in statistical shape model by symmetric EM algorithm. In: SPIE Medical
Imaging 2012: Image Processing, vol. 8134, p. 83140R (2012)
14. Xie, W., Schumann, S., Franke, J., Grützner, P.A., Nolte, L.P., Zheng, G.: Finding De-
formable Shapes by Correspondence-Free Instantiation and Registration of Statistical
Shape Models. In: Wang, F., Shen, D., Yan, P., Suzuki, K. (eds.) MLMI 2012. LNCS,
vol. 7588, pp. 258–265. Springer, Heidelberg (2012)
15. Kang, X., Taylor, R.H., Armand, M., Otake, Y., Yau, W.P., Cheung, P.Y.S., Hu, Y.: Cor-
respondenceless 3D-2D registration based on expectation conditional maximization. In:
Proc. SPIE, vol. 7964, p. 79642Z (2011)
16. Chen, T., Vemuri, B.C., Rangarajan, A., Eisenschenk, S.J.: Groupwise point-set registra-
tion using a novel CDF-based Havrda-Charvtat divergence. IJCV 86(1), 111–124 (2010)
17. Rasoulian, A., Rohling, R., Abolmaesumi, P.: Group-wise registration of point sets for sta-
tistical shape models. IEEE Trans. Med. Imaging 31(11), 2025–2034 (2012)
18. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood estimation from incomplete
data via the EM algorithm (with discussion). J. Royal Statistical Soc. (B) 39, 1–38 (1977)
19. Meng, X.-L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: A
general framework. Biometrika 80(2), 267–278 (1993)
Facial Expression Recognition with Regional
Features Using Local Binary Patterns

Anima Majumder, Laxmidhar Behera, and Venkatesh K. Subramanian

Department of Electrical Engineering,


Indian Institute of Technology Kanpur, India
{animam,lbehera,venkats}@iitk.ac.in

Abstract. This paper presents a simple yet efficient and completely au-
tomatic approach to recognize six fundamental facial expressions using
Local Binary Patterns (LBPs) texture features. A system is proposed
that can automatically locate four important facial regions from which
the uniform LBPs features are extracted and concatenated to form a
236 dimensional enhanced feature vector to be used for six fundamen-
tal expressions recognition. The features are trained using three widely
used classifiers: Naive bayes, Radial Basis Function Network (RBFN)
and three layered Multi-layer Perceptron (MLP3). The notable feature
of the proposed method is the use of few preferred regions of the face
to extract the LBPs features as opposed to the use of entire face. The
experimental results obtained from MMI database show proficiency of
the proposed features extraction method.

Keywords: Facial expression recognition, local binary pattern, facial


features extraction, radial basis function network, multilayer perceptron,
naive bayes.

1 Introduction

Communication plays a very important role in our day to day life. Facial expres-
sions those come under the category of nonverbal communication are considered
to be one of the most powerful and immediate means of recognizing one’s emo-
tion, intentions and opinion about each other. A study of Mehrabian [1] has
found that while communicating feelings and attitudes, a person convey 55% of
the message through facial expression alone, 38% via vocal cues and the remain-
ing 7% is through verbal cues. The goal of facial expression recognition system
is to have an automatic system that can recognition expressions like Happiness,
Sadness, Disgust, Anger, Surprise and Fear regardless of the person’s identity.
For the last few decades [2,3,4,5], researchers have been working on facial expres-
sion recognition and a lot of advances have been made in recent years.
However, recognizing facial expressions with high accuracy is still a challenging area
because of the subtlety, complexity and variability of expressions.
Generally, techniques to represent facial features needed for expressions recog-
nition are broadly categorized into two types: Geometric based method and

Appearance based method. Geometric features such as eyes, mouth and nose
contain the location and shape information of those features, whereas appearance
features examine appearance changes of the face such as wrinkles, bulges and
furrows [6]. In this work, we use both facial geometry and appearance-based
information to represent facial features for the six fundamental
expressions (Happiness, Sadness, Disgust, Anger, Surprise, Fear). Facial geom-
etry is used for the automatic localization of 4 important facial regions: the two eye
regions, the nose region and the lips region. These regions are extracted so that
they also cover nearby areas containing important information needed for
expression recognition. LBPs have already been shown to be a successful
texture descriptor in many computer vision applications [7,8,3]. LBPs are used
over Gabor filters because of its simplicity and much lower dimensionality. Many
researchers have applied LBPs for facial expression recognition but, mostly, they ei-
ther applied them over the whole face image or after dividing the face region into M × N
sub-blocks [6,3]. A holistic representation of features loses information related to
the location of the features. Moreover, dividing the whole facial region into dif-
ferent sub-blocks and then taking LBPs from each sub-block needs more com-
putation time, which is unnecessary. Not all the blocks within the face region
contain useful information needed for expression recognition. We introduce a
new method that reduces such unnecessary computation cost by localizing re-
quired facial zones needed for expression recognition. Uniform LBPs obtained
from each of the 4 facial regions are concatenated to form a feature vector of
dimension 236. We apply three widely used classifiers: Naive Bayes, Radial Basis
Function Network, and three layered MLP [9,10].
Rest of the paper is organized as follows. Section 2 demonstrates the different
facial regions extraction techniques. Section 3 gives a brief overview about the
local binary pattern and proposed method. Section 4 shows the experimental
results obtained after applying 3 different classifiers: Naive bayes, RBFN and
MLP3. Finally, in section 5 conclusions are drawn.

2 Different Facial Region Extraction Techniques

Automatic facial expression recognition techniques require two important
aspects:

– Features representation
– Modeling of appropriate classifier [3].

Extraction of features that can represent facial expressions effectively is the key
to an accurate facial expression recognition system. In this paper, we
introduce a new mechanism for the automatic extraction of appearance features from
different facial regions using local binary patterns. Fig. 1 shows the 4 important
localized regions from which LBP features are extracted.
The flow diagram shown in Fig. 2 demonstrates the steps involved in auto-
matic facial regions extraction techniques.
[Fig. 1 geometry: eye centres (x1 , y1 ) and (x2 , y2 ); nose height hnose = (1/3) × face height; nose width = x2 − x1 ; lips region height hl = 1.5 × hnose ]
Fig. 1. Pictorial description of the 4 key facial regions extraction technique

2.1 Facial Regions Extraction

The most important prerequisite for better facial expression recognition results
is automatic and accurate face and facial region detection.
Accurate extraction of facial features is the key to successful classification
results. Most facial expression recognition techniques using local binary
patterns apply the LBP over each sub-block covering the whole face image
[3,6,11]. Shan et al. [3] divided the whole face image into small sub-block
regions, extracted the LBPs from each region and then assigned weights to each
sub-region based on the importance of that region. However, not all the facial regions con-
tain useful information for expression recognition. Some regions contain almost
no information for any expression. Thus, calculating LBPs over the whole facial
region leads to unnecessary computational cost. In this work we demonstrate
a completely automatic facial regions extraction method. The four important
regions where most of the informations available are: Two eyes region enclos-
ing eyebrow regions, nose region, lips region enclosing chin region. Fig. 1 shows
a pictorial explanation for the calculation of estimated facial regions based on
actual face height and width. The steps involved in this process is given in the
block diagram shown in Fig. 2. Face detection is followed by eyes detection. We
apply Paul Viola and Michael Jones’ face detection algorithm [12] to detect the
face region from the image. They use simple rectangular (Haar-like) features
which are equivalent to intensity difference readings and are quite easy to com-
pute. Face detection using Viola and Jones' method is about 15 times quicker than
earlier techniques. It gives 95% detection accuracy at around 17 fps.
The next important step after face detection is detection of two eyes. The eyes’
centers play a vital role in face alignment, scaling and location estimation of other
[Fig. 2 flow: Image → Detect face → Normalize face → Detect eyes → Locate eye regions, lips region and nose region]
Fig. 2. Basic block diagram to extract important facial regions

facial features, like lips, eyebrows, nose, etc. Thus, accurate detection of the eyes is
very much desirable. We estimate the expected regions of the eyes using basic facial
geometry [13]: for frontal face images, the eyes are located in the upper facial
region. We remove the upper (1/5)th of the facial region and extract a (1/3)rd vertical part
as the expected eyes region. We apply Haar-like cascaded features and Viola-
Jones’ object detection algorithm to detect two eyes within the expected eyes
region. Nose lies below the two eyes’ region and above lips region in frontal face
images. Fig. 1 shows a pictorial description about the nose region. We calculate
the two eyes' centers as (x1 , y1 ) and (x2 , y2 ) (centroids of the eye rectangular
regions), respectively. The expected nose region starts from the left eye's center
with width (x2 − x1 ) and height (1/3)rd of the face height. Similarly, we calculate
the expected lip region that also covers the chin region. The lip region starts after
the nose region. We take the lip region with the same x coordinate as the eye
centers and width equal to the distance between the two eye centers. To cover the extra
region near the lips, which usually contains some wrinkles during certain expressions,
we add (1/4)th of the lips width on both the left and right sides. The lips region height is
taken as the last (1/3)rd of the face, which also includes the chin region. The eye regions are also
extended using facial geometry to cover the regions near the eyebrows and the eyes' crow's feet.
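A compact sketch of the region localization described above, computing nose and lip boxes from the detected face height and the two eye centres; the exact margins follow the stated proportions, but the function signature and box convention are our own.

```python
def facial_regions(face_h, eye1, eye2):
    """Region boxes (x, y, w, h) derived from the face height and the two eye
    centres (x1, y1), (x2, y2); a sketch of the stated geometry, not the
    authors' exact code."""
    (x1, y1), (x2, y2) = eye1, eye2
    eye_dist = x2 - x1
    nose_h = face_h // 3                      # nose height = 1/3 of face height
    lips_h = face_h // 3                      # lips + chin occupy the last third
    nose = (x1, y1, eye_dist, nose_h)         # starts at the left eye centre
    # lips: aligned with the eye centres, widened by 1/4 of its width on each side
    lips = (x1 - eye_dist // 4, y1 + nose_h, eye_dist + eye_dist // 2, lips_h)
    return nose, lips

print(facial_regions(face_h=180, eye1=(60, 70), eye2=(120, 70)))
```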

3 Local Binary Patterns


We perform person independent facial expression recognition using Local Binary
Patterns (LBPs). The LBP was introduced by Ojala et al. [14] and has
proven to be a very powerful means of texture description. Given a monochrome
image I(x, y), let gc be the center pixel within a 3 × 3 neighborhood. The pixels
within the neighborhood are labeled by thresholding each pixel with the value of
the center pixel gc . For a gray value gp in an evenly spaced circular neighborhood
with maximum of P pixels and radius R around point (x, y), coordinates of the
point can be found as

xp = x + R cos(2πp/P ), yp = y − R sin(2πp/P ). (1)

The binary operator S(gp − gc ) can be defined as

        S(gp − gc ) = 1 if gp ≥ gc , and 0 otherwise.
The LBPP,R operator is computed by applying a binomial factor to each of the
S(gp − gc ). The method can be stated as

        LBPP,R (xc , yc ) = ∑_{p=0}^{P −1} S(gp − gc ) 2^p .            (2)

[Fig. 3 example: the 3 × 3 neighbourhood (5, 12, 13; 11, 10, 3; 15, 9, 7) thresholded against the centre value 10 gives the binary code 01100011, i.e. decimal 99]
Fig. 3. An example of basic LBP operation. Neighborhood pixels are thresholded with
center pixel and followed by generation of decimal value for the binary coded data.

Fig. 3 shows an example of a basic LBPs operator. Each pixel around the
center pixel is thresholded. A binary pattern is extracted and the corresponding
decimal equivalent is calculated. The prominent properties of LBP features are
robustness to changes in illumination and computational simplicity. Texture
primitives like spots, line ends, edges and corners are detected by the operator. Fig. 4
shows examples of texture primitives that can be detected by applying LBPs. The
notation LBPP,R indicates P sampling points, each at equal distance R from the
center pixel.
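A minimal sketch of the basic LBP(8,1) operator of Eq. (2) on a grayscale image; the neighbour ordering (and hence the exact decimal codes) is one of several valid conventions, so the printed value need not match the 99 of the example above.

```python
import numpy as np

def lbp_8_1(image):
    """Basic LBP(8,1) code for every interior pixel: threshold the 8 neighbours
    against the centre value and read the bits as a binary number (Eq. 2)."""
    img = image.astype(np.int32)
    c = img[1:-1, 1:-1]
    # neighbour offsets in a fixed circular order (one of several valid conventions)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for p, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += (neighbour >= c).astype(np.int32) << p
    return code

patch = np.array([[5, 12, 13],
                  [11, 10, 3],
                  [15, 9, 7]])
print(lbp_8_1(patch))   # one LBP code; the decimal value depends on the bit ordering
```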
Ojala et al. [14] verified that only a subset of the 2^P patterns is sufficient
to describe most of the texture information within the image. The patterns
for LBP (8, 1) with not more than 2 bit-wise transitions, called uniform patterns,
contain more than 90% of the texture information. The uniform patterns
comprise 58 different binary patterns in total for LBP (8, 1) (56 rotational patterns
and 2 non-rotational patterns). The patterns with more than 2 transitions, U (x) > 2,
are called non-uniform patterns and are all assigned to a single label, giving 59
labels in total for LBP_{8,1}^{u2}. Fig. 5 shows an example of the uniform local binary
code and the corresponding histogram for an extracted lips region.

Spot   Spot / flat   Line end   Edge   Corner

Fig. 4. Examples of texture primitives that can be detected by LBP. White circles
show ones and the black circles show zeros.


Fig. 5. Images showing lips region, corresponding uniform local binary code and the
feature histogram for the lips region
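The 59-bin uniform-pattern histogram and the concatenation of the four regional histograms into the 236-dimensional feature vector can be sketched as follows; the random arrays merely stand in for the extracted eye, nose and lip crops.

```python
import numpy as np

def uniform_lbp_histogram(codes):
    """59-bin histogram of 8-bit LBP codes: one bin per uniform pattern
    (at most two 0/1 transitions around the circle) and one shared bin for
    all non-uniform patterns, as described above."""
    def transitions(v):
        bits = [(v >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

    uniform = [v for v in range(256) if transitions(v) <= 2]   # 58 uniform patterns
    lut = np.full(256, len(uniform), dtype=np.int32)           # non-uniform -> last bin
    for bin_id, v in enumerate(uniform):
        lut[v] = bin_id
    hist = np.bincount(lut[codes.ravel()], minlength=len(uniform) + 1)
    return hist / hist.sum()

# concatenate the four regional histograms into a 4 x 59 = 236-dimensional feature
rng = np.random.default_rng(3)
regions = [rng.integers(0, 256, (40, 60)) for _ in range(4)]   # stand-ins for the crops
feature = np.concatenate([uniform_lbp_histogram(r) for r in regions])
print(feature.shape)   # (236,)
```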

4 Experimental Results
Experiments are conducted on the publicly available MMI facial expression database
[15]. The results present the classification accuracy of six basic facial expressions
(happiness (H), sadness (Sa), disgust (D), anger (A), surprise (Sur), fear (F)).
In our experiment, we use 81 different video clips taken from the MMI facial ex-
pression database. The video clips comprise 12 different characters and each
character shows all the 6 basic expressions separately. The uniform LBP fea-
tures of dimension 59 obtained from each of the 4 facial zones are concatenated

Table 1. Confusion matrix of emotions detection for the 236 dimensional LBP features
data using Naive Bayes. The emotion classified with maximum percentage is shown to
be the detected emotion.

H Sa D A Sur F
H 91.9 5.12 0 2.56 0 1.28
Sa 0.9 45.9 4.5 1.8 29.7 17.1
D 3.84 3.84 75.0 11.53 3.84 1.92
A 1.61 0 11.29 61.3 20.96 4.83
Sur 0 1.36 0 0 90.4 8.21
F 1.47 5.88 1.47 2.94 35.29 52.9
Table 2. Confusion matrix of emotions detection for the 236 dimensional LBP features
data using Radial Basis Function. The emotion classified with maximum percentage is
shown to be the detected emotion.

H Sa D A Sur F
H 87.2 3.84 5.12 3.84 0 0
Sa 0 90.1 0 1.8 5.4 2.7
D 5.7 5.7 80.8 5.7 1.9 0
A 0 0 3.2 96.8 0 0
Sur 0 5.48 0 2.74 91.8 0
F 1.47 5.88 0 5.88 0 86.8

Table 3. Confusion matrix of emotions detection for the 236 dimensional LBP features
data using MLP3. The emotion classified with maximum percentage is shown to be
the detected emotion.

H Sa D A Sur F
H 89.7 3.8 5.13 0 0 1.28
Sa 2.7 88.3 0 0 0 9.0
D 1.92 5.77 84.6 0 0 7.7
A 0 4.84 11.29 74.2 0 9.7
Sur 0 2.73 2.73 0 94.52 0
F 1.47 8.82 1.47 0 0 88.2

together to form a feature vector of dimension 236. A neighborhood size of 3 × 3
is considered to extract the LBP features for each of the four regions. We apply
3 widely used classifiers, Naive Bayes, RBFN and MLP3, for facial expression
recognition. The experimental results are shown in Tables 1, 2 and 3, respec-
tively. The average recognition accuracy of the Naive Bayes classifier is 69.41% with
a training time of 0.7 seconds, and for MLP3 it is 86.05% with a much higher
training time (3155 seconds). It is observed that RBFN gives the best performance
among the three classifiers, with an average classification accuracy of 88.91% and
a training time of 12.8 seconds.
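A hedged sketch of the classification stage, assuming the 236-dimensional features are stacked row-wise with one emotion label per clip; scikit-learn's GaussianNB and MLPClassifier stand in for the Naive Bayes and MLP3 classifiers (the library provides no RBFN), so this illustrates the setup rather than reproducing the reported one.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# stand-in data: rows are 236-dimensional LBP feature vectors, labels the six emotions
rng = np.random.default_rng(4)
X = rng.random((120, 236))
y = rng.integers(0, 6, 120)

for clf in (GaussianNB(), MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean())
```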

5 Conclusions

The paper presents an empirical study of a facial expression recognition system
based on local binary pattern features. We propose an automatic method of
extracting 4 important facial regions, which include the two eyes, nose and mouth regions
and the nearby regions of these features, like the chin region, crow's-feet region, cheek
region and some portion of the eyebrow regions. The method thus avoids the unnecessary
computation cost of applying LBPs over the whole face image and yet preserves lo-
cation information. Moreover, the feature dimension is much smaller (only 4
LBP histograms, one from each region) as opposed to the method in which the whole face image is
divided into M × N sub-blocks. The experiments are performed over MMI fa-
cial expression database. Three well-known classifiers: Naive Bayes, RBFN and
MLP-3 are used to classify the LBPs features into six basic facial expressions.
The recognition results show the accuracy of our proposed facial expression
recognition system. It is observed that RBFN outperforms the other two classi-
fiers. As future work, we can conduct experiments using different neighborhood
sizes. Also, a comparative analysis can be done by applying LBPs over the whole
face image.

References
1. Mehrabian, A.: Nonverbal communication. Aldine (2007)
2. Sun, Y., Yin, L.: Facial expression recognition based on 3D dynamic range model
sequences. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II.
LNCS, vol. 5303, pp. 58–71. Springer, Heidelberg (2008)
3. Shan, C., Gong, S., McOwan, P.: Facial expression recognition based on local binary
patterns: A comprehensive study. Image and Vision Computing 27, 803–816 (2009)
4. Tsalakanidou, F., Malassiotis, S.: Real-time 2d+ 3d facial action and expression
recognition. Pattern Recognition 43, 1763–1775 (2010)
5. Moridis, C., Economides, A.: Affective learning: Empathetic agents with emotional
facial and tone of voice expressions. IEEE Transactions on Affective Computing 3,
260–272 (2012)
6. Moore, S., Bowden, R.: Local binary patterns for multi-view facial expression recog-
nition. Computer Vision and Image Understanding 115, 541–558 (2011)
7. Chan, C.-H., Kittler, J., Messer, K.: Multi-scale local binary pattern histograms
for face recognition. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007. LNCS, vol. 4642, pp.
809–818. Springer, Heidelberg (2007)
8. Ahonen, T., Hadid, A., Pietikäinen, M.: Face recognition with local binary patterns.
In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 469–481.
Springer, Heidelberg (2004)
9. Zhang, Z., Zhang, Z.: Feature-based facial expression recognition: Sensitivity analy-
sis and experiments with a multilayer perceptron. International Journal of Pattern
Recognition and Artificial Intelligence 13, 893–911 (1999)
10. Rosenblum, M., Yacoob, Y., Davis, L.: Human expression recognition from motion
using a radial basis function network architecture. IEEE Transactions on Neural
Networks 7, 1121–1138 (1996)
11. Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns
with an application to facial expressions. IEEE Transactions on Pattern Analysis
and Machine Intelligence 29, 915–928 (2007)
12. Viola, P., Jones, M.: Robust real-time object detection. International Journal of
Computer Vision 57, 137–154 (2002)
13. Majumder, A., Behera, L., Venkatesh, K.S.: Automatic and Robust Detection of
Facial Features in Frontal Face Images. In: Proceedings of the 13th International
Conference on Modelling and Simulation. IEEE (2011)
14. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures
with classification based on featured distributions. Pattern Recognition 29, 51–59
(1996)
15. Pantic, M., Valstar, M.F., Rademaker, R., Maat, L.: Web-based database for facial
expression analysis. In: Proceedings of IEEE Int’l Conf. Multimedia and Expo,
ICME 2005, Amsterdam, The Netherlands, pp. 317–321 (2005)
Global Image Registration Using Random
Projection and Local Linear Method

Hayato Itoh1 , Tomoya Sakai2 , Kazuhiko Kawamoto3, and Atsushi Imiya4


1
Graduate School of Advanced Integration Science, Chiba University,
Yayoicho 1-33, Inage-ku, Chiba 263-8522, Japan
2
Graduate School of Engineering, Nagasaki University,
Bunkyo-cho 1-14, Nagasaki 852-8521, Japan
3
Academic Link Center, Chiba University,
Yayoicho 1-33, Inage-ku, Chiba 263-8522, Japan
4
Institute of Management and Information Technologies, Chiba University,
Yayoicho 1-33, Inage-ku, Chiba 263-8522, Japan

Abstract. The purpose of this paper is twofold. First, we introduce


fast global image registration using random projection. By generating
many transformed images as entries in a dictionary from a reference im-
age, nearest-neighbour-search (NNS)-based image registration computes
the transformation that establishes the best match among the generated
transformations. For the reduction in the computational cost for NNS
without a significant loss of accuracy, we use random projection. Fur-
thermore, for the reduction in the computational complexity of random
projection, we use the spectrum-spreading technique and circular convo-
lution. Second, for the reduction in the space complexity of the dictio-
nary, we introduce an interpolation technique into the dictionary using
the linear subspace method and a local linear property of the pattern
space.

1 Introduction
Image registration overlays two or more template images of the same scene, observed
at different times, from different viewpoints, or by different sensors, on a reference
image. Image registration is a process of estimating geometric transformations
that transform all or most points on template images to corresponding points
on a reference image.
Setting Π to be an appropriate parameter space for image generation, we
assume that images are expressed as f (x, θ) for ∃θ ∈ Π, x ∈ Rl . We call
the set of generated images f (x, θi ) and the parameter θi a dictionary. Here,
i = 1, . . . , N . Image registration methods are generally classified into local image
registration and global image registration. For the global alignment of images,
the linear transformation x′ = Ax + t that minimises the criterion

        R(f, g) = ∫_Ω |f (x′ ) − g(x)|² dx            (1)

R. Wilson et al. (Eds.): CAIP 2013, Part I, LNCS 8047, pp. 564–571, 2013.

c Springer-Verlag Berlin Heidelberg 2013
Global Image Registration Using Random Projection 565

for the reference image f (x) and the template image g(x) is used to relate two
images. In image registration, we assume that the parameter θ in Π generates the
affine coefficients A and t. Solving the nearest-neighbour search (NNS) problem
using the dictionary, we can estimate the transformation A, t as θi .
The simplest solution to the NNS problem is to compute the distance from
the query point to every other point in the database, keeping track of the
“best so far”. This algorithm, sometimes referred to as the naive approach, has
a computational cost of O(N d). Here, N and d are the cardinality of a set of
points in a metric space and the dimensionality of the metric space, respectively.
The NNS-based image registration requires the storage of reference images
in the dictionary, since the method finds the best matched image from the dic-
tionary. Because robust image registration requires a large number of reference images
to be stored in the dictionary, the space complexity of the dictionary becomes large. Fast global
image registration algorithms using random projection have been developed
[1–3]. In these methods, using random projection, we can reduce d in NNS[4].
In addition to random projection, using a local linear property of the pattern
space, we interpolate entries in a sparse dictionary by generating an image and
estimating the parameter. Using such an interpolation, we can reduce N in NNS.
Generally, the pattern space generated by the affine motion of the image data
is a curved manifold in the higher-dimensional Euclidean space. However, if the
motion is relatively small and can be approximated by linear perturbation to
a sampled image, it is possible to approximate transformed images as a linear
combination of sampled images in the neighbourhood of the reference image.
Using this approximation property, we generate new reference images that are
similar to the target image using images in the sparse dictionary. Furthermore,
using the local linear property, we can estimate the parameter of the generated
image. This strategy reduces the space complexity of the dictionary.
In this paper, using an efficient random projection[3] and the local linear
property of the images, we introduce a method of reducing the time and space
computational costs of this naive NNS-based image registration.
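As a plain illustration of the dimensionality reduction step, the sketch below applies a Gaussian random projection to vectorised dictionary images before the naive NNS; the spectrum-spreading and circular-convolution speed-up used in this paper is not reproduced here, and the sizes are arbitrary.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project d-dimensional vectorised images to k dimensions with a Gaussian
    random matrix; pairwise distances are approximately preserved
    (Johnson-Lindenstrauss), which is what makes the reduced-dimension NNS valid."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    return X @ R

# dictionary of vectorised reference images and one vectorised template
rng = np.random.default_rng(5)
dictionary = rng.random((500, 64 * 64))
template = rng.random((1, 64 * 64))
proj = random_projection(np.vstack([dictionary, template]), k=128)
dists = np.linalg.norm(proj[:-1] - proj[-1], axis=1)
print("best match:", int(dists.argmin()))
```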

2 Local Linear Property


2.1 Linear Subspace of Pattern
Setting the Hilbert space H to be the space of patterns, we assume that in H
the inner product (f, g) is defined. Furthermore, we define the Schatten product
⟨f, g⟩, which is an operator from H to H. Let f ∈ H and P be a pattern
and an operator for a class, respectively; thus, we define the class C = {f | P f =
f, P ∗ P = I}. For recognition, we construct P for f ∈ C while minimising E[ ∥f −
P f ∥² ] with respect to P ∗ P = I, where f ∈ C is the pattern for a class, I is the
identity operator, and E is the expectation in H. This methodology is well known
as the subspace method [5–7].
For the practical calculation of P , we adopt the Karhunen-Loeve expan-
sion. The Karhunen-Loeve expansion approximates the subspace of data in H.
We set {ϕj }_{j=1}^{n} to be the eigenfunctions of M = E[⟨f, f ⟩], normalised so
that ∥ϕj ∥² = 1, for eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λj ≥ · · · ≥ λn .
Therefore, P is defined as Pn′ = ∑_{j=1}^{n′} ⟨ϕj , ϕj ⟩ for n′ ≤ n.

2.2 Geometric Perturbation in Local Linear Space


If the image f (x) defined in the two-dimensional Euclidean plane x = (x, y) ∈
R2 is geometrically perturbed, we can accept the relation

        f (x + δ) = f (x) + δ ⊤ ∇f (x, y),            (2)

where δ is a perturbation vector. Since


        ∫_{R2} f (x) ∂x f (x) dx = 0,    ∫_{R2} f (x) ∂y f (x) dx = 0,    ∫_{R2} ∂x f (x) ∂y f (x) dx = 0,            (3)

images g(x) = f (Rx + t) for a small angle rotation R and a small translation
vector t, we can assume the relation

g(x) = a0 f (x) + a1 ∂x f (x) + a2 ∂y f (x). (4)

Equation (4) implies that the number of independent images in the collected
images,
        L(f ) = {fij | fij (x) = λf (Ri x + tj )}_{i,j=1}^{p,q} ,            (5)
is 3. Setting f ⊗ g to be the linear operation such that (f ⊗ g)h = (h, g)f , where
(·, ·) is the inner product of the image space, the covariance of L(f ) is defined
as Lf = Eij,uv [fij ⊗ fuv ], where Ei [fi ] denotes the expectation of {fi }_{i=1}^{n} . We can use
the first 3 principal vectors of Lf as the local bases for image expression.
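A small sketch of extracting the local bases: the first three principal vectors of a set of slightly transformed copies of a reference image; the one-pixel shifts used here merely stand in for the small rotations and translations of L(f).

```python
import numpy as np

def local_basis(images, k=3):
    """First k principal vectors of a set of vectorised, slightly transformed
    copies of a reference image; under the local linear assumption these span
    the images generated by small rotations and translations."""
    A = np.stack([im.ravel().astype(np.float64) for im in images])
    # principal vectors of the (uncentred) sample matrix via SVD
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[:k]

rng = np.random.default_rng(6)
f = rng.random((32, 32))
# stand-ins for small geometric perturbations: shifts by one pixel
perturbed = [f, np.roll(f, 1, axis=0), np.roll(f, 1, axis=1), np.roll(f, -1, axis=0)]
basis = local_basis(perturbed, k=3)
print(basis.shape)   # (3, 1024)
```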

3 Local Linear Method


For a two-dimensional image, we introduce a method of reducing the number N
of an image dictionary. Using the local linear property of images in the image
space, we first generate an image in a sparse dictionary[8, 9]. For the registration
of a template g, using the generated image, we next estimate the parameters in
the dictionary. From the generated image and estimated parameters, the local
linear method (LLM) can generate new entries in the dictionary as an inter-
polation among entries. Figures 1(a)-(c) show the generation of an image, the
estimation of parameters and the interpolation of elements in a dictionary, re-
spectively.

3.1 Generation of Image in Dictionary


For image generation, we use the k-nearest neighbours (k-NNs) of g in the dic-
tionary. Let f r ∈ L, r = 1, 2, · · · k, be the rth neighbour of g. The random
projection preserves the pairwise distances of vectorised images. Therefore, f r
Global Image Registration Using Random Projection 567


Fig. 1. (a) Nearest neighbours of g searched for by the k-NNS on the manifold. (b) Generation
of an image in the dictionary: the input image g is projected onto the subspace spanned by
its three nearest neighbours. (c) Interpolation of the dictionary: for the template g, we first
generate the image g*. Next, we estimate the parameter θ* of g*. Here, dim θ = 1.

is searched for in the randomly projected space. Using the local linear property, we
can approximate the space spanned by {u_i}_{i=1}^{3} by the one spanned by {g} ∪ {f^r}_{r=1}^{3} if the
data space L is not extremely sparse. Using the Gram-Schmidt orthonormalisation
for f^1, f^2 and f^3, we obtain the bases {u_i}_{i=1}^{3}. Projecting the template onto
the space spanned by {u_i}_{i=1}^{3}, we obtain a new image,

g* = Σ_{i=1}^{3} a_i u_i,    (6)

from a triplet of pre-prepared entries in the dictionary. Here, a_i represents the
coefficient of a linear combination. We assume that a small perturbation of the
parameter causes a small geometrical transformation of the image pattern, that
is, we accept the relation f(x + δ, θ) = f(x, θ + ψ). Therefore, the linear
approximation of f(x, θ + ψ) is

f(x, θ + ψ) = f(x, θ) + ψ^⊤ ∇_Π f(x, θ),    (7)

where ∇_Π is the gradient operation in the parameter space Π. Equations (2)
and (7) derive the relation

ψ^⊤ ∇_Π f(x, θ) = δ^⊤ ∇f(x, θ).    (8)
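A minimal sketch of the generation step above, under the assumption that the three nearest dictionary entries have already been found in the randomly projected space: Gram-Schmidt orthonormalisation is realised with a QR decomposition, and the template is projected onto the resulting subspace as in Eq. (6). Function and variable names are illustrative.

```python
import numpy as np

def generate_reference(g, f1, f2, f3):
    """Orthonormalise the three nearest dictionary entries (Gram-Schmidt via
    QR) and project the template g onto the spanned subspace, giving the
    generated image g* of Eq. (6) and its coefficients a_i."""
    F = np.stack([f1.ravel(), f2.ravel(), f3.ravel()], axis=1)
    U, _ = np.linalg.qr(F)               # bases u_1, u_2, u_3
    a = U.T @ g.ravel()                  # coefficients a_i
    g_star = (U @ a).reshape(g.shape)
    return g_star, a
```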

3.2 Parameter Estimation


For a rotation transform with angle α in the counterclockwise direction and a
translation a = (a, b)^⊤, setting θ_i = (α_i, a_i, b_i)^⊤ and {θ_i}_{i=1}^{N} = Π_N ∈ Π, we express
the images f_i = f(x, θ_i) in a dictionary. Furthermore, let f^1 = f(x, θ^1) be the
nearest neighbour of the template f(x, θ) in the dictionary.
Setting θ_α^1 = (α_2, a_1, b_1)^⊤, θ_a^1 = (α_1, a_2, b_1)^⊤ and θ_b^1 = (α_1, a_1, b_2)^⊤, we
obtain the partial differential

∇_Π f(x, θ) = ( (f(x, θ^1) − f(x, θ_α^1))/(α_2 − α_1),
                (f(x, θ^1) − f(x, θ_a^1))/(a_2 − a_1),
                (f(x, θ^1) − f(x, θ_b^1))/(b_2 − b_1) )^⊤ = (h_1, h_2, h_3)^⊤    (9)


Fig. 2. Parameter estimation for phantom image. From top to bottom, original im-
ages, rotated images, estimated parameters and the differential curves of the estimated
parameters are shown for σ = 1, 2, 4 and 8. In Figs. 2(p)-2(t), the solid and dashed
lines represent the first- and second-order differentials, respectively.

as the difference among images. Here, f(x, θ_α^1), f(x, θ_a^1) and f(x, θ_b^1) are the
2nd, 3rd and 4th nearest neighbours in the dictionary. From Eqs. (8) and (9),
we obtain the relation

ψ = − ((f(x, θ^1) − f(x, θ)) / |∇_Π f(x, θ)|²) ∇_Π f(x, θ)    (10)

at each point¹. Here, ψ = (ψ_1, ψ_2, ψ_3)^⊤. Next, by computing the square of the
average of both sides of Eq. (10), we obtain the relation

E[ψ_i] = (1/M) ∫_Ω |(f(x, θ) − f(x, θ^1)) h_i|² / |∇_Π f(x, θ)|⁴ dx    (11)

for the image. Therefore, we obtain the equation θ* = θ^1 + E[ψ] for parameter
estimation, where E[ψ] = (E[ψ_1], E[ψ_2], E[ψ_3])^⊤.
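A simplified NumPy sketch of this estimation step is given below. It forms the finite-difference gradient of Eq. (9) from the 2nd-4th nearest dictionary entries and then averages the per-pixel correction of Eq. (10) over the image domain; this is a plain average rather than the squared-average form of Eq. (11), and all argument names are illustrative.

```python
import numpy as np

def estimate_parameters(g, f1, theta1, f_alpha, f_a, f_b,
                        theta_alpha, theta_a, theta_b):
    """Estimate theta* = theta^1 + E[psi].  Images are flattened arrays,
    parameter triplets are (alpha, a, b) arrays of shape (3,)."""
    h = np.stack([
        (f1 - f_alpha) / (theta_alpha[0] - theta1[0]),   # h_1, Eq. (9)
        (f1 - f_a)     / (theta_a[1]     - theta1[1]),   # h_2
        (f1 - f_b)     / (theta_b[2]     - theta1[2]),   # h_3
    ])
    grad_sq = np.sum(h * h, axis=0) + 1e-12               # |grad_Pi f|^2 per pixel
    diff = f1 - g                                         # f(x, theta^1) - f(x, theta)
    psi = -(diff / grad_sq) * h                           # Eq. (10), per pixel
    return theta1 + psi.mean(axis=1)                      # averaged correction
```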

4 Numerical Examples
For the rotation transform, we show registration results obtained using the LLM.
We evaluated our method on both phantom and real images. Phantom images are
¹ The least-mean-squares solution of x^⊤ a + c = 0 is x = −c a/|a|².


Fig. 3. Parameter estimation for MRI slice image of human brain. From top to bottom,
original images, rotated images, estimated parameters and the differential curves of the
estimated parameters are shown for σ = 1, 2, 4 and 8. In Figs. 3(p)-3(t), the solid and
dashed lines represent the first- and second-order differentials, respectively.

generated from a two-dimensional Gaussian function. Real images are MRI slice
images of a simulated human brain. Figures 2(a) and 3(a) show the phantom
image P 0 and the slice image A0, respectively. For the generation of entries in
the dictionary, we rotated the original image by angles (π/12) i, i = 0, 1, 2, . . . , 23.
We use the original image as reference and the rotated images as targets. Figures
2(f) and 3(f) show the sample images of the dictionary, respectively. For phan-
tom images, we generate rotated phantom images with angles of −6, −5, . . . , 6
degrees as template images. For real images, our template images are selected
from several slice images different from the original slice image. Figures 4(a),
(b), (c) and (d) show the MRI volume data and the images of slice A0, slice B0
and slice C0, respectively. As template images, we generate the rotated images
A0, B0 and C0 with angles of −6, −5, . . . , 6 degrees. For stable computation,
we first select images using the NNS, and then apply Gaussian smoothing to the
selected images. We set the standard deviation of the Gaussian function to
be σ = 1, 2, 4 and 8. Figures 2(b)-(e), 2(g)-(j), 3(b)-(e) and 3(g)-(j) show the
smoothed images. In all registrations, the dimensions of vectorised images are
reduced to 1024 dimensions by the efficient random projection.
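The sketch below shows a plain Gaussian random projection to 1024 dimensions, which approximately preserves pairwise distances by the Johnson-Lindenstrauss lemma; the efficient random projection of [3] used in the paper avoids forming such a dense matrix explicitly, so this is only an illustration, and the names are not from the original implementation.

```python
import numpy as np

def random_projection_matrix(d_in, d_out=1024, seed=0):
    """Gaussian random projection; rows scaled so that pairwise distances
    are approximately preserved."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d_out, d_in)) / np.sqrt(d_out)

def project_dictionary(images, R):
    """Project vectorised images (N x d_in) into the reduced space where the
    k-nearest-neighbour search of Sec. 3.1 is carried out."""
    return images @ R.T
```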
Selecting a template image from the generated template set, we apply our
registration algorithm to reference images in the dictionary. For phantom im-
ages, Figs. 2(k)-(o) show the estimated parameters for each σ. Furthermore,


(a) Voxel image (b) A0 (c) B0 (d) C0

Fig. 4. Extracted slice images from three-dimensional volume data. The volume data
is the MRI simulation data of a human brain[10]. The size of the volume data is
181×217×181 pixels. The slice images A0, B0 and C0 are extracted from the z = 50,
z = 48 and z = 52 planes, respectively. The size of A0, B0, and C0 is 543×543 pixels.

Figs. 3(k)-(o) show the estimated parameters for real images. For phantom im-
ages, in Figs. 2(p)-(t), the solid and dashed lines show the first- and second-order
differentials for the curves of the estimated parameters, respectively. For real im-
ages, in Figs. 3(p)-(t), the solid and dashed lines show the first- and second-order
differentials for the curves of the estimated parameters, respectively.
Next, selecting a template image from the template set generated from several
slice images, we apply our registration algorithm to reference images in the
dictionary, smoothing the images with σ = 0, 1 and 2. The results of image generation
and parameter estimation are shown in Tables 1 and 2, respectively. In Table 1,
we evaluate the generation errors using the distance between the
template and the generated images.
In Figs. 2(q)-2(t), the second-order differentials of the estimation curves are
almost flat, that is, the rotation angle is linearly estimated for a potentially
smooth image. In Figs. 3(q)-3(t), for σ ≥ 4, the second-order differentials of the
estimation curves are almost flat. These results indicate that, for |θ − θi | < 6,
our method accurately estimates the parameters. From Tab. 1, the registration
errors of the LLM are smaller than those of the NNS method. Table 2 shows
that, for σ = 1, the estimated rotation angle satisfies G.T −0.3 ≤ θ∗ ≤ G.T +0.5.
Tables 1 and 2 show that the LLM generates images in the dictionary using the
local linear property of images in the pattern space.
These experiments show that our algorithm achieves medical image registration
with a sparse dictionary. The errors of our algorithm are less than 1
degree, while the maximum error of the NNS is 6 degrees. Using the random
projection, we can compute the global image registration using the LLM with only
8.3% of the memory storage size of the NNS method.

5 Conclusions
We introduced the interpolation of entries in a dictionary to reduce the compu-
tational cost of preprocessing and the size of the dictionary used in the nearest-
neighbour search. Using the random projection and interpolation techniques for
the dictionary, we developed an algorithm that efficiently establishes a global
image registration. Numerical examples show that our method performs the
registration with high accuracy using a dictionary with a much smaller memory
footprint than that of the naive nearest-neighbour-search-based method.

Table 1. Registration errors [×10³]

Slice  θ [degree]  NNS   LLM (3 bases)  LLM (4 bases)
A0     1           1.37  1.30           1.29
A0     2           2.41  2.22           2.21
A0     3           3.26  2.87           2.86
A0     4           3.95  3.30           3.30
A0     5           4.54  3.55           3.55
A0     6           5.03  3.63           3.63
B0     1           3.40  3.27           3.27
B0     2           3.78  3.52           3.52
B0     3           4.23  3.79           3.80
B0     4           4.69  4.03           4.03
B0     5           5.13  4.18           4.17
B0     6           5.35  4.23           4.23
C0     1           3.29  3.12           3.08
C0     2           3.63  3.35           3.33
C0     3           4.05  3.60           3.59
C0     4           4.47  3.81           3.80
C0     5           4.87  3.93           3.92
C0     6           5.22  3.96           3.95

Table 2. Estimated angles

Slice  G.T. θ [degree]  θ* (σ=0)  θ* (σ=1)  θ* (σ=2)
A0     1                0.77      1.06      2.06
A0     2                1.68      1.76      2.38
A0     3                2.72      2.73      3.10
A0     4                3.82      3.83      4.07
A0     5                4.97      4.98      5.17
A0     6                6.07      6.07      5.58
B0     1                1.69      1.69      2.22
B0     2                2.41      2.39      2.76
B0     3                3.28      3.26      3.52
B0     4                4.22      4.22      4.43
B0     5                5.22      5.24      5.43
B0     6                6.23      6.21      6.03
C0     1                1.99      1.93      2.30
C0     2                2.52      2.46      2.71
C0     3                3.27      3.21      3.38
C0     4                4.13      4.01      4.22
C0     5                5.04      5.02      5.14
C0     6                5.96      5.98      6.10

This research was supported by the “Computational anatomy for computer-


aided diagnosis and therapy: Frontiers of medical image sciences” project funded
by a Grant-in-Aid for Scientific Research on Innovative Areas from MEXT,
Japan, by Grants-in-Aid for Scientific Research funded by the Japan Society
for the Promotion of Science, and by a Grant-in-Aid for Young Scientists (A)
from MEXT.

References
1. Healy, D.M., Rohde, G.K.: Fast global image registration using random projections.
In: Proc. Biomedical Imaging: From Nano to Macro, pp. 476–479 (2007)
2. Itoh, H., Lu, S., Sakai, T., Imiya, A.: Global image registration by fast random
projection. In: Bebis, G. (ed.) ISVC 2011, Part I. LNCS, vol. 6938, pp. 23–32.
Springer, Heidelberg (2011)
3. Sakai, T., Imiya, A.: Practical algorithms of spectral clustering: Toward large-
scale vision-based motion analysis. In: Machine Learning for Vision-Based Motion
Analysis, pp. 3–26. Springer (2011)
4. Vempala, S.S.: The Random Projection Method, vol. 65. American Mathematical
Society (2004)
5. Iijima, T.: Theory of pattern recognition. Electronics and Communications in
Japan, 123–134 (1963)
6. Watanabe, S., Lambert, P.F., Kulikowski, C.A., Buxton, J.L., Walker, R.: Evaluation
and selection of variables in pattern recognition. In: Computer and Information
Science II, pp. 91–122 (1967)
7. Oja, E.: Subspace methods of pattern recognition. Research Studies Press (1983)
8. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision.
Cambridge University Press (2004)
9. Mahajan, D., Huang, F.C., Matusik, W., Ramamoorthi, R., Belhumeur, P.: Moving
gradients: A path-based method for plausible image interpolation. ACM Transac-
tions on Graphics 28, 42:1–42:11 (2009)
10. Cocosco, C., Kollokian, V., Kwan, R.S., Evans, A.: Brainweb. online interface to
a 3D MRI simulated brain database. NeuroImage 5, 425 (1997)
Image Segmentation
by Oriented Image Foresting Transform
with Geodesic Star Convexity

Lucy A.C. Mansilla and Paulo A.V. Miranda

Department of Computer Science, University of São Paulo (USP),


05508-090, São Paulo, SP, Brazil
{lucyacm,pmiranda}@vision.ime.usp.br

Abstract. Anatomical structures and tissues are often hard to segment
in medical images due to their poorly defined boundaries, i.e., low
contrast in relation to other nearby false boundaries. The specification
of the boundary polarity and the usage of shape constraints can help
to alleviate part of this problem. Recently, an Oriented Image Forest-
ing Transform (OIFT) has been proposed. In this work, we discuss how
to incorporate Gulshan’s geodesic star convexity prior in the OIFT ap-
proach for interactive image segmentation, in order to simultaneously
handle boundary polarity and shape constraints. This convexity con-
straint eliminates undesirable intricate shapes, improving the segmenta-
tion of objects with more regular shape. We include a theoretical proof of
the optimality of the new algorithm in terms of a global maximum of an
oriented energy function subject to the shape constraints, and show the
obtained gains in accuracy using medical images of thoracic CT studies.

Keywords: graph search algorithms, image foresting transform, graph-


cut segmentation, geodesic star convexity.

1 Introduction
Image segmentation, such as to extract an object from a background, is very
useful for medical and biological image analysis. However, in order to guarantee
reliable and accurate results, user supervision is still required in several seg-
mentation tasks, such as the extraction of poorly defined structures in medical
imaging, due to their intensity non-standardness among images, field inhomo-
geneity, noise, partial volume effects, and their interplay [1]. The high-level,
application-domain-specific knowledge of the user is also often required in the
digital matting of natural scenes, because of their heterogeneous nature [2]. These
problems motivated the development of several methods for semi-automatic seg-
mentation [3,4,5,6], aiming to minimize the user involvement and time required
without compromising accuracy and precision.
One important class of interactive image segmentation comprises seed-based
methods, which have been developed based on different theories, supposedly
not related, leading to different frameworks, such as watershed [6], random


walks [7], fuzzy connectedness [8], graph cuts [4], distance cut [2], image forest-
ing transform [9], and grow cut [10]. The study of the relations among different
frameworks, including theoretical and empirical comparisons, has a vast litera-
ture [11,12,13,14]. These methods can also be adapted to automatic segmentation
whenever the seeds can be automatically found [15].
In this paper, we pursue our previous work on Oriented Image Foresting
Transform (OIFT) [16], which extends popular methods [9,8], by incorporat-
ing the boundary orientation (boundary polarity) to resolve between very sim-
ilar nearby boundary segments by exploring directed weighted graphs. OIFT
presents an excellent trade-off between time efficiency and accuracy, and is ex-
tensible to multidimensional images. In this work, we discuss how to incorpo-
rate Gulshan’s geodesic star convexity (GSC) prior in the OIFT approach. This
convexity constraint eliminates undesirable intricate shapes, improving the seg-
mentation of objects with more regular shape. We include a theoretical proof of
the optimality of the new algorithm in terms of a global maximum of an energy
function subject to the shape constraints. The proposed method GSC-OIFT can
simultaneously handle boundary polarity and shape constraints with improved
accuracy for targeted image segmentation [17].
The next sections give a summary of the relevant previous work of the Image
Foresting Transform [9] and OIFT [16]. The proposed extensions are presented
in Section 5. In Section 6, we evaluate the methods, and state our conclusions.

2 Image Foresting Transform

An image can be interpreted as a weighted graph G = (I, A, w) whose nodes


are the image pixels in its image domain I ⊂ Z n , and whose arcs are the pixel
pairs (s, t) in A (e.g., 4-neighborhood, or 8-neighborhood, in case of 2D images).
The adjacency relation A is a binary relation on I. We use t ∈ A(s) and (s, t) ∈
A to indicate that t is adjacent to s. Each arc (s, t) ∈ A has a fixed weight
w(s, t) ≥ 0. In this work, higher arc weights across the object’s boundary should
be considered, such as a dissimilarity measure between pixels s and t (e.g.,
w(s, t) = |I(t) − I(s)| for a single channel image with values given by I(t)).
The graph is undirected weighted if w(s, t) = w(t, s) for all (s, t) ∈ A, otherwise
we have a directed weighted graph.
For a given image graph G = (I, A, w), a path π_t = ⟨t_1, t_2, . . . , t⟩ is a sequence
of adjacent pixels with terminus at a pixel t. A path is trivial when π_t = ⟨t⟩.
A path π_t = π_s · ⟨s, t⟩ indicates the extension of a path π_s by an arc (s, t). A
predecessor map is a function P that assigns to each pixel t in I either some
other adjacent pixel in I, or a distinctive marker nil not in I — in which case
t is said to be a root of the map. A spanning forest is a predecessor map which
contains no cycles — i.e., one which takes every pixel to nil in a finite number of
iterations. For any pixel t ∈ I, a spanning forest P defines a path π_t recursively
as ⟨t⟩ if P(t) = nil, and π_s · ⟨s, t⟩ if P(t) = s ≠ nil.
A connectivity function computes a value f (πt ) for any path πt , usually
based on arc weights. A path πt is optimum if f (πt ) ≤ f (τt ) for any other

path τt in G. By taking to each pixel t ∈ I one optimum path with termi-


nus t, we obtain the optimum-path value V (t), which is uniquely defined by
V (t) = min∀πt in G {f (πt )}. The image foresting transform (IFT) [9] takes an
image graph G = (I, A, w), and a path-value function f ; and assigns one op-
timum path πt to every pixel t ∈ I such that an optimum-path forest P is
obtained — i.e., a spanning forest where all paths are optimum. However, f
must be smooth [9], otherwise, the paths may not be optimum.
The cost of a trivial path πt = t is usually given by a handicap value H(t),
while the connectivity functions for non-trivial paths follow a path-extension
rule. For example:
f_max(π_s · ⟨s, t⟩) = max{f_max(π_s), w(s, t)}    (1)
f_sum(π_s · ⟨s, t⟩) = f_sum(π_s) + δ(s, t)    (2)
f_euc(π_s · ⟨s, t⟩) = ‖t − R(π_s)‖²    (3)
f_w(π_s · ⟨s, t⟩) = w(s, t)    (4)
where w(s, t) ≥ 0 is a fixed arc weight, δ(s, t) ≥ 0 is a dissimilarity measure,
R(πt ) is the origin/root of a path πt , and fw is a non-smooth function, which
has important relations with the fmax smooth function [18,12].
We consider image segmentation from two seed sets, So and Sb (So ∩ Sb = ∅),
containing pixels selected inside and outside the object, respectively. The search
for optimum paths (usually considering fmax ) is constrained to start in S =
So ∪ Sb (i.e., H(t) = −1 for all t ∈ S, and H(t) = +∞ otherwise). The image is
partitioned into two optimum-path forests — one rooted at the internal seeds,
defining the object, and the other rooted at the external seeds, representing the
background [18]. A label, L(t) = 1 for all t ∈ So and L(t) = 0 for all t ∈ Sb , is
propagated to all unlabeled pixels during the computation [9].
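A minimal sketch of this seeded IFT with the f_max path cost on a 4-neighbour grid is given below (undirected weights w(s, t) = |I(s) − I(t)|, as above). It is a Dijkstra-like propagation in which seeds start with handicap −1 and labels are copied along optimum paths; the function name and the priority-queue realisation are illustrative rather than the authors' implementation.

```python
import heapq
import numpy as np

def ift_fmax(img, seeds_obj, seeds_bkg):
    """Seeded IFT with the f_max path cost: seeds get handicap -1, all other
    pixels +inf, and labels (1 = object, 0 = background) are propagated
    along optimum paths on a 4-neighbour grid."""
    img = img.astype(float)
    rows, cols = img.shape
    cost = np.full((rows, cols), np.inf)
    label = np.full((rows, cols), -1, dtype=int)
    heap = []
    for (r, c), lab in [(p, 1) for p in seeds_obj] + [(p, 0) for p in seeds_bkg]:
        cost[r, c], label[r, c] = -1.0, lab
        heapq.heappush(heap, (-1.0, r, c))
    while heap:
        c_s, r, c = heapq.heappop(heap)
        if c_s > cost[r, c]:
            continue                                   # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rn, cn = r + dr, c + dc
            if 0 <= rn < rows and 0 <= cn < cols:
                w = abs(img[r, c] - img[rn, cn])       # undirected arc weight
                c_new = max(cost[r, c], w)             # f_max extension rule
                if c_new < cost[rn, cn]:
                    cost[rn, cn] = c_new
                    label[rn, cn] = label[r, c]
                    heapq.heappush(heap, (c_new, rn, cn))
    return label
```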
In the case of undirected weighted graphs, the connectivity functions fmax
(under the conditions stated in [18]) and fw give a global optimum segmentation
according to an energy function of the cut boundary [13,14]. They maximize the
graph-cut measure E defined by Equation 5 among all possible segmentation
results satisfying the hard constraints (seeds).
E(L, G = (I, A, w)) = min_{(s,t)∈A | L(s)≠L(t)} w(s, t)    (5)

3 Oriented Image Foresting Transform (OIFT)


In the case of directed graphs, an important thing to note is that there are two
different types of cut for each object boundary: an inner-cut boundary composed
by edges that point toward object pixels Ci (L) = {∀(s, t) ∈ A| L(s) = 0, L(t) =
1}, and an outer-cut boundary with edges from object to background pixels
Co (L) = {∀(s, t) ∈ A| L(s) = 1, L(t) = 0}. Consequently, we have two different
kinds of energy, Ei (L, G) and Eo (L, G):
E_i(L, G = (I, A, w)) = min_{(s,t)∈C_i(L)} w(s, t)    (6)

E_o(L, G = (I, A, w)) = min_{(s,t)∈C_o(L)} w(s, t)    (7)

As demonstrated in [16], the following non-smooth connectivity functions f^bkg_{i,max}
and f^bkg_{o,max} in the IFT algorithm (which we denote as OIFT) lead to optimum
cuts that maximize Eq. 6 and Eq. 7, respectively. The handicap values of f^bkg_{i,max}
and f^bkg_{o,max} for trivial paths are defined as before (i.e., H(t) = −1 for all t ∈ S,
and H(t) = +∞ otherwise). The undirected weights w(s, t) are converted to
directed arcs by multiplying them by an orientation factor (1 + α) if I(s) > I(t),
and by (1 − α) otherwise (e.g., α = 0.5).

f^bkg_{i,max}(π_s · ⟨s, t⟩) = max{f^bkg_{i,max}(π_s), 2 × w(t, s) + 1}  if R(π_s) ∈ S_o,
                              max{f^bkg_{i,max}(π_s), 2 × w(s, t)}      if R(π_s) ∈ S_b    (8)

f^bkg_{o,max}(π_s · ⟨s, t⟩) = max{f^bkg_{o,max}(π_s), 2 × w(s, t) + 1}  if R(π_s) ∈ S_o,
                              max{f^bkg_{o,max}(π_s), 2 × w(t, s)}      if R(π_s) ∈ S_b    (9)

4 Geodesic Star Convexity (GSC)


A point p is said to be visible to c via a set O if the line segment joining p to c lies
in the set O. An object O is star-convex with respect to center c, if every point
p ∈ O is visible to c via O [19]. It is also possible to define a discrete version of
this constraint directly in the image domain, by considering a shortest path in
the image graph, returned by the IFT (e.g., using feuc , 8-connected adjacency,
H(c) = 0, and H(t) = +∞ for all t ≠ c), as the line segment.
In the case of multiple stars, a computationally tractable definition, was pro-
posed in [20]. The previous notion of the line segment (shortest path) joining
the single star center c to p, is extended to a line segment joining the set of star
centers C = {c1 , c2 , . . . , cn } to p, which is taken as the shortest path between the
point p and set C. In interactive segmentation, the set of star centers is usually
taken to coincide with the internal seeds (i.e., C = So ), and, in the discrete
version, the line segments form a spanning forest rooted at the internal seeds,
where each line segment corresponds to a path in the graph.
In [20], the authors proposed changing the notion of the line segment from the
Euclidean shortest path to a geodesic one (f_sum). We use H(t) = −1 for all
t ∈ S_o (H(t) = +∞ otherwise), and δ(s, t) = [w(s, t) + 1]^β − 1 + ‖t − s‖ in the
path-extension rule for f_sum, where ‖t − s‖ is the Euclidean distance between
pixels s and t, and β controls the forest topology in the returned predecessor map,
which we will denote by P_sum. For lower values of β (β ≈ 0.0), δ(s, t) approaches
‖t − s‖, and it imposes more star regularization on the object's boundary. For
higher values, [w(s, t) + 1]^β dominates the expression, allowing a better fit to the
curved protrusions and indentations of the boundary.

5 OIFT with Geodesic Star Convexity (GSC-OIFT)


An object O is geodesic star convex (GSC) with respect to a set of centers C,
if every point p ∈ O is visible to C via O (i.e., the shortest path joining p to

C in Psum lies in the set O). In this work, we want to constrain the search for
optimum results, that maximize the graph-cut measures Ei (L, G) (Eq. 6) and
Eo (L, G) (Eq. 7), only to segmentations that satisfy the geodesic star convexity
constraint.
First, we compute the optimum forest Psum for fsum by the regular IFT
algorithm, using only So as seeds, for the given directed graph G = (I, A, w).
Let's consider the following two sets of arcs: ξ^i_{Psum} = {(s, t) ∈ A | s = P_sum(t)}
and ξ^o_{Psum} = {(s, t) ∈ A | t = P_sum(s)}. We have the following Lemma 1:
Lemma 1. For a given segmentation L, we have C_o(L) ∩ ξ^o_{Psum} ≠ ∅ if and
only if there is a violation of the geodesic star convexity constraint. Likewise,
C_i(L) ∩ ξ^i_{Psum} ≠ ∅ if and only if there is a violation of the geodesic star convexity
constraint.
Proof. We will demonstrate it for C_o(L) ∩ ξ^o_{Psum} ≠ ∅, but the demonstration for
C_i(L) ∩ ξ^i_{Psum} ≠ ∅ is essentially identical. By definition, a violation of the geodesic
star convexity constraint with respect to a set of centers C = S_o occurs
if there exists a point p ∈ O = {t | L(t) = 1} that is not visible to C via O (i.e.,
there is a pixel r in the shortest path joining p to C in P_sum with r ∉ O).
By the definitions of ξ^o_{Psum} and C_o(L), we have C_o(L) ∩ ξ^o_{Psum} = {(s, t) ∈
A | L(s) = 1, L(t) = 0 and t = P_sum(s)}. For any arc (s, t) ∈ C_o(L) ∩ ξ^o_{Psum} we have
t = P_sum(s), which means that there exists a shortest path π_s = π_t · ⟨t, s⟩ in
P_sum rooted at the internal seeds S_o (i.e., a line segment between s and S_o). But
(s, t) ∈ C_o(L) implies that L(t) = 0 (i.e., t ∉ O), and hence s is not visible to S_o
through π_s = π_t · ⟨t, s⟩ in P_sum. Thus, C_o(L) ∩ ξ^o_{Psum} ≠ ∅ implies a violation of
the geodesic star convexity constraint.
On the other hand, if we have a violation of the geodesic star convexity
constraint, it means that there exists s ∈ O (i.e., L(s) = 1) which is not visible to S_o
via the shortest path π_s in P_sum, so that there is a pixel p_i ∉ O in π_s =
⟨p_1, . . . , p_i, . . . , p_n = s⟩, with P_sum(p_{i+1}) = p_i and p_{i+1} ∈ O. Hence, (p_{i+1}, p_i) ∈
C_o(L) ∩ ξ^o_{Psum}, which implies that C_o(L) ∩ ξ^o_{Psum} ≠ ∅.
Therefore, we have C_o(L) ∩ ξ^o_{Psum} ≠ ∅ if and only if there is a violation of the
geodesic star convexity constraint.
Theorem 1 (Inner/outer-cut boundary optimality). For a given image
graph G = (I, A, w), consider a modified weighted graph G′ = (I, A, w′), with
weights w′(s, t) = −∞ for all (s, t) ∈ ξ^o_{Psum}, and w′(s, t) = w(s, t) otherwise.
For two given sets of seeds S_o and S_b, the segmentation computed over G′ by the
IFT algorithm for function f^bkg_{o,max} defines an optimum cut in the original graph
G, that maximizes E_o(L, G) among all possible segmentation results satisfying
the shape constraints by the geodesic star convexity, and the seed constraints.
Similarly, the segmentation computed by the IFT algorithm for function f^bkg_{i,max},
over a modified graph G′ = (I, A, w′), with weights w′(s, t) = −∞ for all
(s, t) ∈ ξ^i_{Psum}, and w′(s, t) = w(s, t) otherwise, defines an optimum cut in the
original graph G, that maximizes E_i(L, G) among all possible segmentation
results satisfying the shape constraints by the geodesic star convexity, and the seed
constraints.

Proof. We will prove the theorem in the case of function f^bkg_{o,max}; the other case
has an essentially identical proof. Since we assign the worst weight to all arcs
(s, t) ∈ ξ^o_{Psum} in G′ (i.e., w′(s, t) = −∞), any segmentation L̃ with C_o(L̃) ∩
ξ^o_{Psum} ≠ ∅ will receive the worst energy value (E_o(L̃, G′) = −∞)¹. From the
theorem in [16], we know that the IFT with f^bkg_{o,max} over G′ maximizes the energy
E_o(L, G′) in the graph G′; consequently, it will naturally avoid in its outer-cut
boundary any arc from ξ^o_{Psum}. Since there is always a solution that does not
violate the GSC constraint (e.g., we could take O = S_o), and from Lemma 1, we
have that the computed solution cannot violate the GSC constraint.
Since w(s, t) ≥ 0 for all (s, t) ∈ A, and from Lemma 1, we have that any candidate
segmentation L̈ satisfying the GSC constraint must have E_o(L̈, G′) ≥ 0.
Moreover, since the weights of the arcs in C_o(L̈) were not changed in G′, we also have
that E_o(L̈, G′) = E_o(L̈, G). Hence, all results satisfying the GSC constraint were
considered in the optimization, and therefore Theorem 1 holds, as we wanted to
prove.
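In practice, Theorem 1 suggests a simple implementation: compute the geodesic forest P_sum, set the weight of every arc in ξ^o_{Psum} (or ξ^i_{Psum}) to −∞, and run OIFT on the modified graph. A small sketch of the weight modification is given below, assuming the forest is available as a predecessor map and the directed weights as a dictionary keyed by arcs; these data structures are illustrative, not the authors' implementation.

```python
import numpy as np

def forbid_outer_cut_arcs(pred, weights):
    """Given the geodesic forest P_sum as a predecessor map `pred`
    (pred[s] = t, or -1 for roots) and directed arc weights {(s, t): w},
    set w'(s, t) = -inf on every arc of xi^o_Psum = {(s, t) : t = P_sum(s)},
    so that OIFT cannot place such an arc in its outer-cut boundary."""
    w_prime = dict(weights)
    for s, t in weights:
        if pred[s] == t:               # (s, t) belongs to xi^o_Psum
            w_prime[(s, t)] = -np.inf
    return w_prime
```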

6 Experiments and Conclusions


We conducted quantitative experiments, using a total of 40 image slices of 10
thoracic CT studies to segment the liver. All methods, including the power wa-
tershed algorithm (PWq=2 ) [14], were assessed for accuracy employing the mean
performance curve (Dice coefficient) and ground truth data obtained from an
expert of the radiology department at the University of Pennsylvania.
Figure 1a shows the mean accuracy curves for all the images assuming different
seed sets obtained by eroding and dilating the ground truth. The undirected arc
weights were computed as w(s, t) = |I(t)−I(s)|. For the directed weighted graphs
we considered α = 0.5, and we used β = 0.0. For higher values of β, GSC-OIFT
imposes less shape constraints, so that the accuracy tends to decrease (Fig. 1b-
d). Figure 2 shows some results in the case of user-selected markers for the liver,
and Figure 3 shows one example in 3D.

[Fig. 1 plots: Dice coefficient versus erosion radius (pixels) for IRFC, PW(q=2), OIFT, GSC-IFT and GSC-OIFT.]

Fig. 1. The mean accuracy curves of all methods for the liver segmentation for
various values of β: (a) β = 0.0, (b) β = 0.2, (c) β = 0.5, and (d) β = 0.7

¹ The GSC restrictions are embedded directly into the graph G′.

Fig. 2. Results for user-selected markers: (a) IRFC (IFT with f_max), (b) OIFT (f^bkg_{o,max}
with α = 0.5), (c) GSC-IFT (β = 0.7, α = 0.0), and (d) GSC-OIFT (β = 0.7, α = 0.5)


Fig. 3. Example of 3D skull stripping in MRI: (a) IRFC (IFT with fmax ), (b) GSC-IFT
(β = 0.3, α = 0.0), and (c) GSC-OIFT (β = 0.3, α = 0.5), for the same user-selected
markers

In conclusion, we developed extensions to the OIFT algorithm [16], by incorpo-


rating the geodesic star convexity constraint in its formulation. The results were
proved to be optimum according to an energy functional of the cut boundary,
and were shown to improve the accuracy in practice. GSC-OIFT only requires
twice the computational time of a conventional IFT. As future work, we intend
to combine it with statistical models for automatic segmentation.

Acknowledgment. The authors thank FAPESP (2012/06911-2), CNPq


(305381/2012-1), and CAPES for the financial support, and Dr. J. K. Udupa
(MIPG-UPENN) for the images.

References
1. Madabhushi, A., Udupa, J.: Interplay between intensity standardization and in-
homogeneity correction in MR image processing. IEEE Transactions on Medical
Imaging 24(5), 561–576 (2005)
2. Bai, X., Sapiro, G.: Distance cut: Interactive segmentation and matting of images
and videos. In: Proc. of the IEEE Intl. Conf. on Image Processing, vol. 2, pp.
II-249–II-252 (2007)
3. Falcão, A., Udupa, J., Samarasekera, S., Sharma, S., Hirsch, B., Lotufo, R.: User-
steered image segmentation paradigms: Live-wire and live-lane. Graphical Models
and Image Processing 60(4), 233–260 (1998)

4. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. Intl.
Journal of Computer Vision 70(2), 109–131 (2006)
5. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. Intl. Journal
of Computer Vision 1, 321–331 (1987)
6. Cousty, J., Bertrand, G., Najman, L., Couprie, M.: Watershed cuts: Thinnings,
shortest path forests, and topological watersheds. Trans. on Pattern Analysis and
Machine Intelligence 32, 925–939 (2010)
7. Grady, L.: Random walks for image segmentation. IEEE Trans. Pattern Anaysis
and Machine Intelligence 28(11), 1768–1783 (2006)
8. Ciesielski, K., Udupa, J., Saha, P., Zhuge, Y.: Iterative relative fuzzy connectedness
for multiple objects with multiple seeds. Computer Vision and Image Understand-
ing 107(3), 160–182 (2007)
9. Falcão, A., Stolfi, J., Lotufo, R.: The image foresting transform: Theory, algo-
rithms, and applications. IEEE Transactions on Pattern Analysis and Machine
Intelligence 26(1), 19–29 (2004)
10. Vezhnevets, V., Konouchine, V.: “growcut” - interactive multi-label N-D image
segmentation by cellular automata. In: Proc. Graphicon., pp. 150–156 (2005)
11. Sinop, A., Grady, L.: A seeded image segmentation framework unifying graph cuts
and random walker which yields a new algorithm. In: Proc. of the 11th International
Conference on Computer Vision, ICCV, pp. 1–8. IEEE (2007)
12. Miranda, P., Falcão, A.: Elucidating the relations among seeded image segmen-
tation methods and their possible extensions. In: XXIV Conference on Graphics,
Patterns and Images, Maceió, AL (August 2011)
13. Ciesielski, K., Udupa, J., Falcão, A., Miranda, P.: Fuzzy connectedness image seg-
mentation in graph cut formulation: A linear-time algorithm and a comparative
analysis. Journal of Mathematical Imaging and Vision (2012)
14. Couprie, C., Grady, L., Najman, L., Talbot, H.: Power watersheds: A unifying
graph-based optimization framework. Trans. on Pattern Anal. and Machine Intel-
ligence 99 (2010)
15. Miranda, P., Falcão, A., Udupa, J.: Cloud bank: A multiple clouds model and
its use in MR brain image segmentation. In: Proc. of the IEEE Intl. Symp. on
Biomedical Imaging, Boston, MA, pp. 506–509 (2009)
16. Miranda, P., Mansilla, L.: Oriented image foresting transform segmentation by seed
competition. IEEE Transactions on Image Processing (accepted, to appear, 2013)
17. Lézoray, O., Grady, L.: Image Processing and Analysis with Graphs: Theory and
Practice. CRC Press, California (2012)
18. Miranda, P., Falcão, A.: Links between image segmentation based on optimum-
path forest and minimum cut in graph. Journal of Mathematical Imaging and
Vision 35(2), 128–142 (2009)
19. Veksler, O.: Star shape prior for graph-cut image segmentation. In: Forsyth, D.,
Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 454–467.
Springer, Heidelberg (2008)
20. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic star
convexity for interactive image segmentation. In: Proc. of Computer Vision and
Pattern Recognition, pp. 3129–3136 (2010)
Multi-run 3D Streetside Reconstruction
from a Vehicle

Yi Zeng and Reinhard Klette

The .enpeda.. Project, Department of Computer Science


The University of Auckland, New Zealand

Abstract. Accurate 3D modellers of real-world scenes are important


tools for visualizing or understanding outside environments. The paper
considers a camera-based 3D reconstruction system where stereo cam-
eras are mounted on a mobile platform, recording images while moving
through the scene. Due to the limited viewing angle of the cameras,
resulting reconstructions often result in missing (e.g. while occluded)
components of the scene. In this paper, we propose a stereo-based 3D
reconstruction framework for merging multiple runs of reconstructions
when driving in different directions through a real-world scene.

1 Introduction
Current large-scale camera-based reconstruction techniques can be subdivided
into aerial reconstruction or ground-level reconstruction techniques. Although
a large amount of user interaction is needed, the resulting model is often of
high quality and visually compelling. There are various commercial products
available in the market, demonstrating high quality, such as 3D RealityMaps [1],
for example. However, reconstruction methods using aerial images only cannot
produce models with photo-realistic details at ground level. There is an extensive
literature on ground-level reconstruction; see, for example, [5,9,17]. In both aerial
and ground-level reconstructions, cameras capture input images as they travel
through the scene. Standard cameras only have limited viewing angles. Thus, a
large number of blind spots of the scene exist, resulting in incomplete 3D models,
and this is inevitable for a single run reconstruction (i.e. when moving cameras
on a “nearly straight” path, without any significant variations in the path). A
single run has a defined direction, being the vector from start and end point of
the run.
In this paper, we propose a stereo-based reconstruction framework for auto-
matically merging reconstruction results from multiple single runs in different
directions. For each single run, we perform binocular stereo analysis on pairs of
left and right images. We use the left image and the generated disparity map for
a bundle-adjustment-based visual odometry algorithm. Then, applying the esti-
mated changes in camera poses, a 3D point cloud of the scene is accumulated
frame by frame. Finally, we triangulate the 3D point cloud using an α-shape
algorithm to generate a surface model. Up to this stage we apply basically exist-
ing techniques. The novelty of this paper is mainly in the merging step, and we


Fig. 1. From top to bottom: original image of a used stereo frame sequence, and colour-
coded disparity maps using OpenCV (May 2013) block matching or iSGM

detail the case where two surface models are merged generated from single runs
in opposite directions. Input data are recorded stereo sequences from a mobile
platform. In this paper we discuss greyscale sequences recorded at Tamaki cam-
pus, The University of Auckland, at a resolution of 960 × 320 at 25 Hz, with 10
bit per pixel. Each recorded sequence consists of about 1,800 stereo frames. For
an example of an input image, see the top of Fig. 1.
The quality of the used stereo matcher has crucial impact on the accuracy of
our 3D reconstruction. We decided for iterative semi-global matching (iSGM),
see [8], mainly due to its performance at ECCV 2012 [7]. A comparison with the
block-matching stereo procedure in OpenCV (see Fig. 1, middle) illustrates the
achieved improvement by using iSGM.
The rest of the paper is structured as follows. In Section 2, we estimate the
ego-motion of the vehicle using some kind of bundle adjustment. Section 3 dis-
cusses alpha-shape, as used for the surface reconstruction algorithm applied in
the system. Finally, the merging step is discussed in Section 4, also showing
experimental results. Section 5 concludes the paper.

2 Visual Odometry
Visual Odometry [13], the estimation of position and direction of the camera, is
achieved by analysing consecutive images in the recorded sequence. The quality
of our reconstructed 3D scene is directly related to the result of visual odome-
try. Drift in visual odometry [10] often leads to a twist in the 3D model. The
basic algorithm is usually: (1) Detect feature points in the image. (2) Track the
features across consecutive frames. (3) Calculate the camera’s motion based on
the tracked features. In this paper, since we focus on quality, an algorithm [15]
based on Bundle Adjustment (BA) is used for visual odometry.
We tested a basic algorithm for comparison. 2-dimensional (2D) feature points
are detected and tracked across the left sequence only. The speeded-up robust
feature detector (SURF), see [2], is used to extract feature points in the first
frame. We chose SURF over the Harris corner detector [6] (which is a common

choice in visual odometry) because corner points may not be evenly distributed
depending on the geometry of the scene. The Lucas-Kanade [12] algorithm is
used to track these detected features in the subsequent frame. Tracked feature
points serve then as input, and are again tracked in the following frame, and so
on. Since the same set of feature points is tracked, the total number of features
decays over frames. When the total number of features drops below a threshold
τ then a new set of features is detected using again the SURF detector. After
calculating a relative transformation between Frames t − 1 and t, the global pose
of the cameras at time t is obtained by keeping a global accumulator, assuming
that the pose of the camera at time 1 is a 4 × 4 identity matrix for initialization.
However, in our experiments, when applying this basic algorithm, the estimation
of camera pose transformations was inaccurate, and became less stable as errors
accumulate along the sequence. In order to improve the accuracy, we apply a
sliding-window bundle adjustment.
Bundle adjustment [16] is the problem of refining the 3D structure as well
as the camera parameters. Mathematically, assume that n 3D points bi are seen
from m cameras with parameters aj , and Xij is the projection of the ith point on
camera j. Bundle adjustment is the task to minimize the reprojection error with
respect to 3D points bi and cameras’ parameters aj . In formal representation,
determine the minimum

min_{a_j, b_i} Σ_{i=1}^{n} Σ_{j=1}^{m} d(Q(a_j, b_i), X_ij)²

where Q(aj , bi ) is the function projecting point i on camera j, and d is the


Euclidean distance between points in the image plane. Bundle adjustment is a
non-linear minimization problem which can be solved by using iterative methods
such as Levenberg-Marquardt.
Ideally, the best result can be obtained by applying bundle adjustment to all
the recorded frames. But, considering its complexity and the limited comput-
ing power we have, we use a sliding window bundle adjustment (similar to the
method used in [15]), i.e. only optimizing the camera poses within a window of k
frames, and moving this window across the whole sequence (only the left images
are used for bundle adjustment).
Starting from frame F1 , a window of k images is constructed and the estimated
camera poses are used as initial estimates. Then, bundle adjustment is applied
for the window using the tracked features. In the next iteration, the window
advances by one frame, i.e. we estimate now the camera pose for frame Fk+1 , as
described in the previous subsection. The estimated pose for Fk+1 plus bundle-
adjusted poses for F2 to Fk , serve then as initial estimates for the camera pose
for frame Fk+2 , and so on.

3 Surface Reconstruction
In this section, we build a 3D model of the scene using results of visual odometry.
The final surface representation is polygonal, but in order to build it we construct

a point cloud model first. Once we calculate the pose for cameras for all frames,
building a 3D point cloud model can be as easy as projecting all 3D points derived
from pixels with valid disparities into a global coordinate system. However, we
did not accumulate pixels for all the frames, because the number of points grows
exponentially, and a large percentage of points is actually redundant information.
(The vehicle was driving at 10 km/h only, and recall that images were captured
at 25 Hz.) For each frame, only pixels within a specified disparity range are used,
due to the non-linear property of the Z-function. See Fig. 2 for an example.
Point-cloud data usually contain large portion of noise and outliers, and the
density of points varies across the 3D space. Two additional steps are to be
carried out to refine the quality of the point cloud.
Down-Sampling. A voxel grid filter is applied to simplify cloud data, thus im-
proving the efficiency of subsequent processing. The filter creates a 3D voxel
grid spanning over the cloud data. Then, for each voxel, all the points within
are replaced by their centroid.
Outlier Removal. Errors in stereo matching and visual odometry lead to sparse
outliers which corrupt the cloud data. Some of these errors can be eliminated by

Fig. 2. A generated point cloud model. Yellow cubes indicate detected camera poses.

Fig. 3. A created surface model. Yellow cubes indicate camera poses.



applying a statistical filter on the point set, i.e. for each point, we compute the
mean distance from it to all of its neighbours. If this mean distance of the point
is outside a predefined interval, then the point can be treated as an outlier and
is removed from the set. The order of these steps affects the overall performance
of the process. The down-sampling process is significantly faster than outlier
removal. Thus we decided to perform these two processes in the listed order.
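A compact sketch of these two refinement steps, written against the Open3D library (an assumption made here for illustration; the original system may use a different point-cloud library such as PCL), is given below. Parameter values are illustrative.

```python
import open3d as o3d

def refine_point_cloud(points, voxel_size=0.05, nb_neighbors=20, std_ratio=2.0):
    """Voxel-grid down-sampling first, then statistical outlier removal based
    on each point's mean distance to its neighbours (points: Nx3 array)."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd = pcd.voxel_down_sample(voxel_size=voxel_size)
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=nb_neighbors,
                                            std_ratio=std_ratio)
    return pcd
```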
Given a set S of points in 3D, the α-shape [4] was designed for answering
questions such as “What is the shape formed by these points?” Edelsbrunner
and Mücke mention in [4] an intuitive description of 3D α-shape: Imagine that
a huge ice-cream fills space R3 and contains all points of S as “hard” chocolate
pieces. Using a sphere-formed spoon, we carve out all possible parts of the ice-
cream block without touching any of the chocolate pieces, even carving out holes
inside the block. The object we end up with is the α-shape of S, and the value
α is the squared radius of the carving spoon.
To formally define the α-shape, we first define an α-complex. An α-complex
of a set S of points is a subcomplex of the 3D Delaunay triangulation of S, which
is a tetrahedrization such that no point in S is inside the circumsphere of any
of the created tetrahedra. Given a value of α, the α-complex contains all the
simplexes in the Delaunay triangulation which have an empty circumscribing
sphere with squared radius equal to, or smaller than α. The α-shape is the
topological frontier of the α-complex.
In our reconstruction pipeline, after obtaining and refining a point-cloud
model, the α-shape is calculated and defines a 3D surface model of the scene.
See Fig. 3 for an example. Compared to Fig. 2, the reader might agree with our
general observation that the surface model looks in general “better” than the
point-cloud visualization.

4 Merging Models from Opposite Runs

Now we are ready to discuss our proposed merger of point-cloud or surface data
obtained from multiple runs through a 3D scene.
The 3D model reconstructed from a single run (i.e. driving through the scene
in one direction) contains a large number of “blind spots” (e.g. due to occlusions,
e.g. the “other side of the wall”, or the limited viewing angle of the cameras, but
also due to missing depth data, if disparities were rated “invalid”). By combining
the results from opposite runs, we aim at producing a more accurate and more
complete model of the scene.
The task of consistently aligning models from different views is known as registration.
Fully automatic pairwise registration methods exist for laser-scanner
data; the main steps are listed below:

1. Identify a set of interest points (e.g. SIFT [11]) that best represent both 3D
point sets.
2. Compute a feature descriptor at each interest point, using methods such as
fast point feature histograms (FPFH); see [14].

Fig. 4. Bird’s-eye view of an initial alignment of two opposite runs. Results of each
run are shown in different colours.

3. Estimate the correspondence between two sets of feature points based on


their similarities. The simplest method is brute-force matching.
4. Assuming that the data is noisy, invalid correspondences are rejected to
improve the registration.
5. Compute the pose transformation from the remaining correspondences.
6. Use the resulting estimation as an initial alignment; then apply an iterative
closest points technique (ICP) to further align two point sets; see [3].

However, compared to laser-scanner data, stereo data is more inaccurate and


contains a significant amount of noise, especially around the edge areas of scene
objects. Therefore, the method stated above is not applicable for our system
in this form. Considering the complexity of the scene (i.e. objects may look
completely different from opposite directions) and the inaccuracy of stereo data,
we propose the following semi-automatic method to align the two stereo point
clouds.
Initial Alignment. We let the user manually select a set of corresponding points
from both models. Then, a rough estimate of the alignment is calculated by
applying the least-squares method. See Fig. 4 for an example.
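This rough estimate can be computed with the standard SVD-based (Kabsch/Umeyama, without scale) least-squares construction; a minimal sketch over the user-selected correspondences is given below, with illustrative names.

```python
import numpy as np

def rigid_alignment(P, Q):
    """Least-squares rigid transform (R, t) aligning corresponding points
    P -> Q (both Nx3), so that Q ~= R @ P + t."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t
```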
Adjustment. Due to (not fully avoidable) errors in the visual odometry process
and the considerable dimension (length) of the recorded scene, both point-cloud

Fig. 5. Bird’s-eye views of individual and merged surface models



Fig. 6. Street views illustrating the benefit of merging 3D data

models cannot be perfectly aligned as a whole. (Both models are twisted to a


certain degree in 3D space.) Therefore, we break the point cloud models into
a few segments along the Z direction (the main driving direction). Then we
loop through each segment, apply feature matching across the two point-cloud
models using 3D feature detectors, such as SIFT. A more precise alignment for
this segment is calculated by matching the two feature sets. If the new alignment
does not differ from the initial alignment more than a threshold τ , the new
alignment is applied to the cloud segment.
Post Processing. Since we merged two (very extensive) point clouds, the point
density is not uniform any more. We need to down-sample the merged point cloud
again (as described in the previous section), for the convenience of subsequent
processing. After the merged point cloud is simplified, a surface model can be
created using the α-shape algorithm. See Fig. 5 for surface models of two separate
runs, and for the merged point cloud.
The street views in Fig. 6 show clearly the benefit of merging: many of the
missing parts in one run are filled-in by reconstruction results of the second
run. The facades of buildings and other details of the scene are getting more
complete, with an accuracy as defined by stereo matching and visual odometry.
We will not further illustrate the obvious positive effects, but would like to point
out two issues detected when merging. Figure 7 reveals that occlusion walls from
opposite directions intersect each other. Due to inaccurate disparities around
the edge areas, a wall structure can be formed along the viewing direction on the
Fig. 7. Occlusion walls from opposite directions intersect each other



edge. When merging models from opposite runs, the occlusion walls from the
two models intersect each other.

5 Conclusions and Future Work

In this paper we described a stereo-based 3D reconstruction pipeline for mod-


elling street scenes. We proposed a semi-automatic method for aligning mod-
els reconstructed from opposite directions, to fill-in missing components. Our
proposed system is certainly useful for improving the completeness of ground-
level 3D reconstruction. It might also be useful for combining results of aerial
and ground-level large-scale 3D reconstruction. For future improvements we see
needs to increase the accuracy of the visual odometry process, and to enhance
the quality of the point cloud model. Evaluation on the quality and performance
of the reconstruction system also needs to be done.

Acknowledgment. The authors thank Simon Hermann for the provision of


iSGM for stereo matching.

References

1. 3D Reality Maps, www.realitymaps.de/en/ (last visited in April 2013)


2. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951,
pp. 404–417. Springer, Heidelberg (2006)
3. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans.
Pattern Analysis Machine Intelligence 14, 239–256 (1992)
4. Edelsbrunner, H., Mücke, E.P.: Three-dimensional alpha shapes. ACM Trans.
Graphics 13, 43–72 (1994)
5. Geiger, A., Ziegler, J., Stiller, C.: StereoScan: Dense 3d reconstruction in real-time.
In: Proc. IEEE IV, pp. 963–968 (2011)
6. Harris, C., Stephens, M.J.: A combined corner and edge detector. In: Proc. Alvey
Vision Conf., pp. 147–151 (1988)
7. Heidelberg Robust Vision Challenge at ECCV 2012 (2012),
http://hci.iwr.uni-heidelberg.de/Static/challenge2012/
8. Hermann, S., Klette, R.: Iterative semi-global matching for robust driver assistance
systems. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part
III. LNCS, vol. 7726, pp. 465–478. Springer, Heidelberg (2013)
9. Huang, F., Klette, R.: City-scale modeling towards street navigation applications.
J. Information Convergence Communication Engineering 10 (2012)
10. Jiang, R., Klette, R., Wang, S.: Statistical modeling of long-range drift in visual
odometry. In: Koch, R., Huang, F. (eds.) ACCV 2010 Workshops, Part II. LNCS,
vol. 6469, pp. 214–224. Springer, Heidelberg (2011)
11. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Com-
puter Vision 60, 91–110 (2004)
12. Lucas, B.D., Kanade, T.: An iterative image registration technique with an appli-
cation to stereo vision. Proc. IJCAI 2, 674–679 (1981)

13. Nister, D., Naroditsky, O., Bergen, J.: Visual odometry. In: Proc. CVPR, vol. 1,
pp. 652–659 (2004)
14. Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3D
registration. In: Proc. IEEE ICRA, pp. 3212–3217 (2009)
15. Sünderhauf, N., Konolige, K., Lacroix, S., Protzel, P.: Visual odometry using sparse
bundle adjustment on an autonomous outdoor vehicle. In: Proc. Autonome Mobile
Systems, pp. 157–163 (2005)
16. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment
– A modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) ICCV-WS
1999. LNCS, vol. 1883, pp. 298–375. Springer, Heidelberg (2000)
17. Xiao, J., Fang, T., Zhao, P., Lhuilier, M., Quan, L.: Image-based street-side city
modeling. In: Proc. SIGGRAPH, pp. 114:1–114:12 (2009)
Interactive Image Segmentation via Graph
Clustering and Synthetic Coordinates Modeling

Costas Panagiotakis1, Harris Papadakis2, Elias Grinias3 , Nikos Komodakis4 ,


Paraskevi Fragopoulou2 , and Georgios Tziritas5
1
Dept. of Commerce and Marketing, Technological Educational Institute (TEI)
of Crete, 72200 Ierapetra, Greece
[email protected]
2
Dept. of Applied Informatics and Multimedia, TEI of Crete, PO Box 140, Greece
[email protected], [email protected]
3
Dept. of Geoinformatics and Surveying, TEI of Serres, 62124 Serres, Greece
[email protected]
4
Dept. of Computer Science and Applied Math, Ecole des Ponts ParisTech, France
[email protected]
5
Dept. of Computer Science, University of Crete, P.O. Box 2208, Greece
[email protected]

Abstract. We propose a method for interactive image segmentation.


We construct a weighted graph that represents the superpixels and the
connections between them. An efficient algorithm for graph clustering
based on synthetic coordinates is used yielding an initial map of clas-
sified pixels. The proposed method minimizes a min-max Bayesian cri-
terion that has been successfully used on image segmentation problem
taking into account visual information as well as the given markers. Ex-
perimental results and comparisons with other methods demonstrate the
high performance of the proposed scheme.

Keywords: image segmentation, network coordinates, graph clustering.

1 Introduction
Image segmentation is a key step in many image-video analysis and multimedia
applications. According to interactive image segmentation, which is a special
case of image segmentation, unambiguous solutions, or segmentations satisfying
subjective criteria, could be obtained, since the user gives some markers on the
regions of interest and on the background. Fig. 1 illustrates an example of an
original image, two types of markers and the segmentation ground truth.
During the last decade, a large number of interactive image segmentation algo-
rithms have been proposed in the literature. In [1], a new shape constraint based
method for interactive image segmentation has been proposed using Geodesic
paths. The authors introduce Geodesic Forests, which exploit the structure of
shortest paths in implementing extended constraints. In [2], discriminative learning
methods have been used to train conditional models for both region and
boundary based on interactive scribbles. In the region model, the authors use
two types of local histograms with different window sizes to characterize local
image statistics around a specific pixel. In the boundary model, the authors use
12-bin boundary features by applying gradient filters to each color component.
In [3], a two-step segmentation algorithm has been proposed that first obtains a
binary segmentation and then applies matting on the border regions to obtain
a smooth alpha channel. The proposed segmentation algorithm is based on the
minimization of the Geodesic Active Contour energy.
According to the interactive segmentation algorithm proposed in [4], first,
all the labeled seeds are independently propagated for obtaining homogeneous
connected components for each of them. Then, the image is divided into blocks
which are classified according to their probabilistic distance from the classified
regions, and a topographic surface for each class is obtained. Finally, two algo-
rithms for regularized classification based on the topographic surface have been
proposed.
The proposed method can be divided into several steps. In the first step,
we partition the image into superpixels using the oversegmentation algorithm
proposed in [5]. Then, we construct a weighted graph that represents the super-
pixels and the connections between them, taking into account the given markers
and visual information (see Section 2). Next, we use the Vivaldi algorithm [6]
that generates the superpixels’ synthetic coordinates (see Section 3). An initial
map of classified pixels is provided by an efficient algorithm for graph clustering
based on synthetic coordinates (see Section 4). Thus, we solve the graph clus-
tering problem using the synthetic network coordinates that are automatically
estimated by a distributed algorithm based on interactions between neighboring
nodes. Finally, the image segmentation is provided by a Markov Random Field
(MRF) model or a flooding algorithm minimizing a min-max Bayesian criterion
(see Section 5). Hereafter we present the proposed methodology; a more detailed
analysis of the proposed method is given in [7].

2 Graph Generation

Initially, we partition the image into superpixels using the oversegmentation al-
gorithm proposed in [5]. In this work, the description of visual content consists of
Lab color components for color distribution and textureness for texture content.
This approach has also been used in [8]. The visual distance dv (si , sj ) between
two superpixels si and sj is given by the Mallows distance [9] of the three color
components in Lab color space and for the textureness measure of the corre-
sponding superpixels. Let G' be the weighted graph of superpixels, so that two
superpixels si and sj are connected with an edge of weight dv (si , sj ) if and only
if they are neighbors, meaning that they share a common boundary. Then, the
proximity distance dp (si , sj ) between superpixels si and sj is given by the length
of the shortest path from si to sj in graph G'. The proposed distance between
superpixels si and sj that efficiently combines the visual and proximity distances
is given by Equation 1:
d(s_i, s_j) = \sqrt{d_p(s_i, s_j)} \cdot d_v(s_i, s_j)    (1)

The use of the square root on the proximity distance is explained by the fact that
the visual distance is more important than the proximity distance. The graph
G' is used in order to compute the graph G that is defined hereafter.
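
To make Eq. (1) concrete, the following sketch computes the combined distance for a pair of superpixels. It is an illustrative implementation, not the authors' code: it assumes the neighbourhood graph G' is stored as a networkx graph whose edge weights are the visual distances of adjacent superpixels, that each superpixel's features are given as per-channel sample arrays (L, a, b and textureness), and that the per-channel Mallows (1-D Wasserstein) distances are simply summed; whether the proximity distance is the weighted path length or the hop count should follow the detailed description in [7].

import numpy as np
import networkx as nx
from scipy.stats import wasserstein_distance

def visual_distance(feat_i, feat_j):
    # Mallows (1-D Wasserstein) distance per channel (L, a, b, textureness),
    # summed over channels; the channel combination is an assumption here.
    return sum(wasserstein_distance(feat_i[c], feat_j[c]) for c in range(4))

def combined_distance(G_prime, features, i, j):
    # Eq. (1): d(s_i, s_j) = sqrt(d_p(s_i, s_j)) * d_v(s_i, s_j), where d_p is
    # the shortest-path length between s_i and s_j in G' (edge weights are the
    # visual distances of neighbouring superpixels).
    d_p = nx.shortest_path_length(G_prime, i, j, weight="weight")
    d_v = visual_distance(features[i], features[j])
    return np.sqrt(d_p) * d_v
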
In the next step, we construct a graph G that represents the superpixels
and the connections between them, taking into account the given markers and
visual information. According to the given markers, two superpixels can either
be connected, meaning that they belong to the same class or be disconnected,
meaning that they belong to different classes. Thus, the nodes (superpixels) in
this graph are connected with edges of two types:
– the EC edges that connect two superpixels belonging to the same class,
– the ED edges that connect two superpixels belonging to different classes,
taking into account the two types of relations between superpixels. In the second
step of the algorithm, the visual distance and the superpixels’ proximity are used
to create the set of edges EC until G becomes a connected graph.
Hereafter, we present the procedure that computes the two sets of edges, EC
and ED for graph G. EC and ED are initialized to the corresponding edges ac-
cording to the given markers. Then, the N(N − 1)/2 pairs of distances d(·, ·) are
sorted and stored in vector v, where N denotes the number of superpixels. We
add the sorted edges of v to the EC set until G becomes a connected graph in
order to be able to execute the Vivaldi algorithm [6] that generates the super-
pixels’ synthetic coordinates (see Section 3). In addition, we keep the graph
balanced (almost equal degree per node) using an upper limit on node degree
(MaxConn = 10).
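
A compact sketch of this edge-building step is given below. It is a simplification under stated assumptions: the combined distances d(·, ·) of Eq. (1) are precomputed for all pairs, EC and ED are initialized from the markers, connectivity is rechecked after every insertion, and the balancing strategy is reduced to a plain degree cap; the authors' exact procedure may differ.

import itertools
import networkx as nx

def build_graph_G(n, d, marker_EC, marker_ED, max_conn=10):
    # n: number of superpixels; d: dict {(i, j): combined distance of Eq. (1)}
    # with i < j; marker_EC / marker_ED: same-class / different-class pairs
    # derived from the user markers.
    G = nx.Graph()
    G.add_nodes_from(range(n))
    EC, ED = set(marker_EC), set(marker_ED)
    G.add_edges_from(EC)

    # Candidate pairs sorted by distance, added as EC edges until G is connected.
    for i, j in sorted(itertools.combinations(range(n), 2), key=lambda p: d[p]):
        if nx.is_connected(G):
            break
        if (i, j) in ED or (j, i) in ED:
            continue                      # never connect different classes
        if G.degree(i) >= max_conn or G.degree(j) >= max_conn:
            continue                      # keep the graph balanced
        EC.add((i, j))
        G.add_edge(i, j)
    return G, EC, ED
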

3 Synthetic Coordinates
In this work, we have used Vivaldi [6] to position the superpixels in a virtual
space (the n-dimensional Euclidean space R^n, e.g., n = 20). Vivaldi [6] is a
fully decentralized, light-weight, adaptive network coordinate algorithm that
predicts Internet latencies with low error. Recently, we have successfully applied
Vivaldi on the problem of locating communities on real and synthetic dataset
graphs [10, 11]. In the current work, the input to the Vivaldi algorithm is a
weighted graph, where the weights correspond to the node distances in R^n. We
have used the weights 0.0 and 1000.0 for EC and ED , respectively. These weights
correspond to target Euclidean distances between the virtual positions of the superpixels,
which the Vivaldi algorithm uses to position the superpixels in R^n, generat-
ing synthetic coordinates so that the Euclidean distance of any two superpixel
positions approximates the actual distance (edge weight) between those super-
pixels. This means that superpixels of the same class will be placed closer in
space than superpixels of different classes, forming natural clusters in space.
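
The essential update behind such an embedding can be sketched as follows. This is a minimal, centralized caricature of a Vivaldi-style spring relaxation, not the adaptive, decentralized algorithm of [6]: the fixed step size, iteration count and random initialization are assumptions made only for illustration.

import numpy as np

def synthetic_coordinates(n_nodes, edges, n_dim=20, step=0.05, n_iter=500, seed=0):
    # edges: list of (i, j, target) with target 0.0 for EC and 1000.0 for ED edges.
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(n_nodes, n_dim))
    for _ in range(n_iter):
        for i, j, target in edges:
            delta = pos[i] - pos[j]
            dist = np.linalg.norm(delta)
            if dist < 1e-12:
                delta = rng.normal(size=n_dim)   # separate coincident points
                dist = np.linalg.norm(delta)
            error = dist - target                # >0: too far apart, <0: too close
            unit = delta / dist
            pos[i] -= step * error * unit        # pull i towards j if too far apart
            pos[j] += step * error * unit        # move j symmetrically
    return pos
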

4 Graph Clustering

Having estimated a synthetic coordinate (position) p(si) ∈ R^n for each super-
pixel si , i ∈ {1, ..., N } of the graph, we can use a clustering algorithm in order to
cluster a subset of superpixels into foreground and background classes, providing
this way an initial map of classified pixels Map. The proposed algorithm creates
the initial map by merging the superpixels that have been placed in proximity
in R^n, meaning that they should belong to the same class. In this research, we
have used a hierarchical clustering algorithm that recursively finds clusters in
an agglomerative (bottom-up) mode. We successively merge a cluster c1 with
“UNKNOWN” label with the closest labeled cluster c2 , if the distance between
c1 and c2 is lower than a predefined threshold T according to their synthetic
coordinates. If no such pair of clusters exists, the algorithm terminates. T is
automatically computed by the histogram of distances between all points’ pairs.
T is given by the value that better discriminates the two distributions (distances
between points of the same class and between points of different classes) from
the histogram of distances.
Usually, it holds that the image borders and especially the pixels close to the
four image corners belong to the background class, so we have used the following
simple rule: a pixel is classified to the background class if it belongs to an
unclassified superpixel and its distance

– from the closest image border is less than 1% of the image diagonal
– from the closest image corner is less than 7% of the image diagonal.

Using the criterion of “unclassified superpixel”, the proposed heuristic works
well in cases where an object intersects the image boundary. Finally, we perform
erosion on the classified superpixels using a disk of 2 pixels radius, in order to
be able to correct some boundary errors of the oversegmentation algorithm.

5 Image Segmentation

Hereafter, we briefly describe the image segmentation method. The proposed
criterion has been proposed in [4, 8]. Let S = \bigcup_{l=1}^{N_L} S_l be the set of those initially
classified pixels estimated by the graph clustering algorithm. For any unclassified
pixel s we can consider all the paths linking it to a classified set or region. A
path Cl (s) is a sequence of adjacent pixels {s0 , ..., sn−1 , sn = s}. It holds that all
pixels of the sequence are unlabeled, except s0 which has label l. The cost of a
particular path is defined as the maximum cost of a pixel classification according
to the Bayesian rule and along the path

Cost(C_l(s)) = \max_{i=1...n} d_l^B(s_i)    (2)

Finally, the classification problem becomes equivalent to a search for the shortest
path given the above cost. Two algorithms based on the principle of the min-max
Bayesian criterion for classification have been used. These algorithms have
been proposed in [4, 8].
– According to the Independent Label Flooding MRF-based minimization Al-
gorithm (ILFMA), we use the primal-dual method proposed in [12], which
casts the MRF optimization problem as an integer program and then makes
use of the duality theory of linear programming in order to derive solutions
that have been proved to be almost optimal.
– The Priority Multi-Class Flooding Algorithm (PMCFA), which is analytically
described in [8], imposes strong topology constraints. All the contours of
initially classified regions are propagated towards the space of unclassified
image pixels, according to similarity criteria, which are based on the class
label and the segmentation features.
In what follows, the proposed methods using the MRF model and the flooding
algorithm are denoted as SGC-ILFMA and SGC-PMCFA, respectively.
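
For intuition, the min-max criterion of Eq. (2) can be propagated with a Dijkstra-like priority queue, as sketched below. This only illustrates the shared principle behind the two algorithms and is not the exact ILFMA or PMCFA procedure of [4, 8]; in particular, cost[l][p] is assumed to hold the Bayesian classification cost d_l^B(p), and the topology constraints of PMCFA are ignored.

import heapq

def minmax_flooding(seeds, cost, neighbors):
    # seeds: dict pixel -> label for the initially classified pixels.
    # cost[l][p]: Bayesian cost of assigning label l to pixel p.
    # neighbors(p): yields the pixels adjacent to p.
    label = dict(seeds)
    best = {p: 0.0 for p in seeds}            # worst cost along the best path so far
    heap = [(0.0, p, l) for p, l in seeds.items()]
    heapq.heapify(heap)
    while heap:
        c, p, l = heapq.heappop(heap)
        if c > best.get(p, float("inf")):
            continue                          # stale queue entry
        for q in neighbors(p):
            if q in seeds:
                continue                      # seed pixels keep their label
            cq = max(c, cost[l][q])           # path cost = maximum pixel cost, Eq. (2)
            if cq < best.get(q, float("inf")):
                best[q] = cq
                label[q] = l
                heapq.heappush(heap, (cq, q, l))
    return label
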

6 Experimental Results
SGC-ILFMA and SGC-PMCFA have been compared with algorithms from the
literature according to the reported results of [2] and [13], using the following
two datasets:
– The LHI interactive segmentation benchmark [14]. This benchmark consists
of 21 natural images with ground-truths and three types of users’ scribbles
for each image.
– The Zhao interactive segmentation benchmark [13]. This benchmark consists
of 50 natural images with ground-truths and four types of users’ scribbles
(levels) for each image. The higher the level, the more markers are added.
In order to measure the algorithms’ performance, we use the Region precision
criterion (RP ) [2]. RP measures an overlap rate between a result foreground
and the corresponding ground truth foreground. A higher RP indicates a better
segmentation result.
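
A possible computation of RP is sketched below. It assumes RP is the intersection-over-union of the binary foreground masks, which is only one reading of "overlap rate"; the exact definition follows [2].

import numpy as np

def region_precision(result_fg, gt_fg):
    # result_fg, gt_fg: binary masks of the result and ground-truth foreground.
    result_fg = result_fg.astype(bool)
    gt_fg = gt_fg.astype(bool)
    inter = np.logical_and(result_fg, gt_fg).sum()
    union = np.logical_or(result_fg, gt_fg).sum()
    return inter / union if union > 0 else 1.0
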
Fig. 2 depicts intermediate (initial map) and final segmentation results of the
proposed methods (SGC-ILFMA and SGC-PMCFA) on an image of the LHI dataset
(see Fig. 2(a)) and three images of the Zhao dataset. We graphically depict the given
markers on the original images using red color for foreground and green color for
background, respectively. The red, blue and white coloring of intermediate re-
sults corresponds to foreground, background and unclassified pixels, respectively.
In all cases, the results of the proposed algorithms are almost the same,
yielding high performance results. In Figs. 2(b), 2(f) and 2(j) the initial marker
information suffices for the segmentation. Although in Fig. 2(n) the low number
of given markers does not suffice to discriminate the foreground and background
classes, the proposed methods give good performance results. A demonstration
of the proposed method with experimental results is given online.1
Using the LHI dataset, we have compared the proposed methods with two
other algorithms from the literature, CO3 [2] and Unger et al. [3], based on the
reported results of [2]. The proposed methods SGC-ILFMA and SGC-PMCFA
yield RP 85.4% and 85.2%, respectively, outperforming the other methods. The
third and the fourth highest performance results are given by the CO3 [2] method
with RP = 79% and Unger et al. with RP = 73%.
In addition, we have compared the proposed methods with three other algo-
rithms from the literature, Couprie et al. [15], Grady [16] and Noma et al. [17],
using the reported results of [13] on the Zhao dataset. Table 1 depicts the mean
region precision (RP ) on four different simulation levels of SGC-ILFMA, SGC-
PMCFA, Bai et al., Couprie et al., Grady and Noma et al. algorithms. The
highest performance results are clearly obtained by the SGC-ILFMA and SGC-
PMCFA algorithms, while the third highest performance results are given by the
Couprie et al. [15] method, which gives similar results to the Grady and Noma
et al. methods.

Table 1. The region precision (RP ) over the Zhao dataset


Level  SGC-ILFMA  SGC-PMCFA  Couprie et al.  Grady  Noma et al.
  1      66.7%      68.4%        50%          46%      49%
  2      84.1%      84.5%        72%          71%      69%
  3      85.3%      85.7%        84%          84%      82%
  4      88.1%      88.6%        88%          88%      87%

7 Conclusion
In this paper, a two-step algorithm is proposed for interactive image segmenta-
tion taking into account image visual information, proximity distances as well as
the given markers. In the first step, we construct a weighted graph of super-
pixels and cluster this graph based on a synthetic coordinates algorithm.
In the second step, we use an MRF or a flooding algorithm to obtain the
final image segmentation. The proposed method yields high performance results
under different types of images and shapes of the initial markers.

1 http://www.csd.uoc.gr/~cpanag/DEMOS/intImageSegmentation.htm
Fig. 2. (a), (e), (i), (m) Original images with markers from the LHI and Zhao
datasets. (b), (f), (j), (n) Initial map of classified pixels. (c), (g), (k), (o) Final
segmentation results of the SGC-ILFMA method. (d), (h), (l), (p) Final segmenta-
tion results of the SGC-PMCFA method.

Acknowledgments. This research has been partially co-financed by the
European Union (European Social Fund - ESF) and Greek national funds
through the Operational Program “Education and Lifelong Learning” of the
National Strategic Reference Framework (NSRF) - Research Funding Programs:
ARCHIMEDE III-TEI-Crete-P2PCOORD, THALIS-NTUA-UrbanMonitor and
THALIS-UOA-ERASITECHNIS.

References

1. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic star
convexity for interactive image segmentation. In: IEEE Conference on Computer
Vision and Pattern Recognition, CVPR, pp. 3129–3136 (2010)
2. Zhao, Y., Zhu, S., Luo, S.: Co3 for ultra-fast and accurate interactive segmentation.
In: Proceedings of the International Conference on Multimedia, pp. 93–102. ACM
(2010)
3. Unger, M., Pock, T., Trobin, W., Cremers, D., Bischof, H.: Tvseg-interactive total
variation based image segmentation. In: British Machine Vision Conference, BMVC
(2008)
4. Grinias, I., Komodakis, N., Tziritas, G.: Flooding and MRF-based algorithms for
interactive segmentation. In: International Conference on Pattern Recognition,
ICPR, pp. 3943–3946 (2010)
5. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. In-
ternational Journal of Computer Vision 59, 167–181 (2004)
6. Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A decentralized network co-
ordinate system. In: Proceedings of the ACM SIGCOMM 2004 Conference, vol. 34,
pp. 15–26 (2004)
7. Panagiotakis, C., Papadakis, H., Grinias, I., Komodakis, N., Fragopoulou, P.,
Tziritas, G.: Interactive image segmentation based on synthetic graph coordinates.
Pattern Recognition (accepted, 2013)
8. Panagiotakis, C., Grinias, I., Tziritas, G.: Natural image segmentation based on
tree equipartition, bayesian flooding and region merging. IEEE Transactions on
Image Processing 20, 2276–2287 (2011)
9. Mallows, C.: A note on asymptotic joint normality. The Annals of Mathematical
Statistics 43, 508–515 (1972)
10. Papadakis, H., Panagiotakis, C., Fragopoulou, P.: Local community finding using
synthetic coordinates. In: Park, J.J., Yang, L.T., Lee, C. (eds.) FutureTech 2011,
Part II. CCIS, vol. 185, pp. 9–15. Springer, Heidelberg (2011)
11. Papadakis, H., Panagiotakis, C., Fragopoulou, P.: Locating communities on real
dataset graphs using synthetic coordinates. Parallel Processing Letters, PPL 20
(2012)
12. Komodakis, N., Tziritas, G.: Approximate labeling via graph cuts based on linear
programming. IEEE Transactions on Pattern Analysis and Machine Intelligence 29,
1436–1453 (2007)
13. Zhao, Y., Nie, X., Duan, Y., Huang, Y., Luo, S.: A benchmark for interactive
image segmentation algorithms. In: IEEE Workshop on Person-Oriented Vision,
POV, pp. 33–38 (2011)
14. Yao, B., Yang, X., Zhu, S.-C.: Introduction to a large-scale general purpose ground
truth database: Methodology, annotation tool and benchmarks. In: Yuille, A.L.,
Zhu, S.-C., Cremers, D., Wang, Y. (eds.) EMMCVPR 2007. LNCS, vol. 4679, pp.
169–183. Springer, Heidelberg (2007)
15. Couprie, C., Grady, L., Najman, L., Talbot, H.: Power watersheds: A new image
segmentation framework extending graph cuts, random walker and optimal span-
ning forest. In: International Conference on Computer Vision, pp. 731–738 (2009)
16. Grady, L.: Random walks for image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 28, 1768–1783 (2006)
17. Noma, A., Graciano, A., Consularo, L., Cesar Jr., R., Bloch, I.: A new algorithm for
interactive structural image segmentation. arXiv preprint arXiv:0805.1854 (2008)
Author Index

Ababsa, Fakhreddine I-482 Börner, Anko I-270


Abdel-Mottaleb, Mohamed I-294, II-440 Bors, Adrian G. I-539, II-241, II-298
Abuzaina, Anas II-290 Borsdorf, Anja I-86
Aggarwal, Preeti I-531 Borstell, Hagen II-546
Akaho, Shotaro I-286, II-250 Boshtayeva, Madina II-67
Akizuki, Shuichi I-473 Bothe, Wolfgang II-117
Al Ghamdi, Manal I-70 Boudissa, Ahmed II-332
Al Harbi, Nouf I-78 Boulekchour, Mohammed II-185
Ali, Sadiq I-154 Boulemnadjel, Amel I-254
Al Ismaeil, Kassem II-100 Boussahla, Miloud I-278
Alizadeh, Fattah II-126 Boyle, Roger II-378
Allano, Lorène II-354 Bradbury, Gwyneth II-209
Alotaibi, Sarah II-402 Bradley, Andrew P. II-449
Altinay, Doreen II-449 Brauer, Jürgen I-145
Andreu Cabedo, Yasmina I-204 Brun, Luc I-302, I-401
Andreu-Garcı́a, Gabriela I-237 Bunke, Horst I-302
Anwar, Hafeez II-17 Busch, Christoph II-177
Aouada, Djamila II-100
Aouf, Nabil II-185 Cappabianco, Fabio II-233
Arens, Michael I-145 Carter, John N. II-290
Artières, Thierry I-45 Chantler, Mike J. I-425, II-169
Atanbori, John II-370 Chen, Cheng I-335
Atienza-Vanacloig, Vicente I-237 Chen, Liming II-160
Atkinson, Gary A. I-212 Chen, Wen-Sheng I-352
Aupetit, Michaël II-354 Chen, Zhili II-346
Aziz, Furqan I-128 Cheng, Yuan I-54, I-229
Azzopardi, George II-9 Chun, Jinhee II-1
Colombo, Carlo II-515
Baba, Takayuki I-37 Colston, Belinda II-370
Backes, André R. II-258, II-416 Conte, Donatello I-401
Bai, Lu I-102 Cordes, Kai I-327
Banerjee, Abhirup I-523 Cortés, Xavier II-457
Banerji, Sugata I-490 Cortez, Paulo César II-258, II-416
Bchir, Ouiem II-402 Courboulay, Vincent I-515
Behera, Laxmidhar I-556 Cowling, Peter II-370
Bellavia, Fabio II-515
Bellili, Asma I-385 Daisy, Maxime II-523
Ben Amar, Chokri II-307, II-563, II-571 Das, Sukhendu I-245
Ben Ammar, Anis II-571 Dawood, Mohammad I-1
Ben Ismail, Mohamed Maher II-402 De Backer, Steve II-539
Biasotti, Silvia I-120 de M. Sá Junior, Jarbas J. II-258, II-416
Bogun, Ivan I-409 Denton, Erika II-346
Boné, Romuald I-310 Denzler, Joachim I-163, II-117
Borgi, Mohamed Anouar II-307 Derraz, Foued I-278

Dickinson, Patrick II-370 Haase, Daniel II-117


Dong, Xinghui I-425 Hachouf, Fella I-254
Dooley, Laurence S. I-270 Hafner, David II-67
Dubuisson, Séverine I-319 Haghighat, Mohammad II-440
Ducato, Antonino II-362 Haindl, Michal I-433, II-338
Hales, Ian II-378
Eady, Paul II-370 Halley, Fraser II-169
Eglin, Véronique II-408 Han, Lin I-102
Eitzinger, Christian II-75 Han, Zhenjun II-282
El’Arbi, Maher II-307 Hancock, Edwin R. I-62, I-102, I-120,
Escolano, Francisco I-120 I-128, II-424
Essa, Ehab II-466 Hansen, Mark F. I-212
Hashimoto, Manabu I-473
Falcão, Alexandre II-233 Hassan, Bassem II-539
Fan, Ming-Ying I-507 Hauta-Kasari, Markku II-274
Fanfani, Marco II-515 Havlı́ček, Michal II-338
He, Wenda II-386
Faucheux, Cyrille I-310
Heidemann, Gunther II-266
Feki, Ghada II-571
Heigl, Benno I-86
Felsberg, Michael I-154
Heikkinen, Ville II-274
Feng, Weiguo I-172
Hino, Hideitsu II-250
Feng, Wenya II-394
Hogg, David II-378
Fernandez-Maloigne, Christine I-515
Hornegger, Joachim I-86
Florea, Corneliu II-225
Hou, Yonggan II-394
Florea, Laura II-225
Hu, Xiaoming II-507
Flusser, Jan I-221
Hübner, Wolfgang I-145
Foggia, Pasquale I-401
Hughes, Dave II-370
Foo, Lewis I-54
Hunter, Gordon II-531
Forzy, Gerard I-278
Fragopoulou, Paraskevi I-441, I-589
Imiya, Atsushi I-417, I-507, I-564
Fränti, Pasi I-262 Intawong, Kannikar I-188
Fratini, Livan II-362 Ishikawa, Seiji II-332
Fujiki, Jun I-286, II-250 Itoh, Hayato I-564

Garcia, Christophe II-408 Jain, Brijnesh J. I-110


Garcı́a-Sevilla, Pedro I-204 Jehan-Besson, Stéphanie II-315
Garrigues, Matthieu I-360 Jiang, Xiaoyan II-117
Gbèhounou, Syntyche I-515 Jiang, Xiaoyi I-1, II-144, II-152, II-432
Gibert, Jaume I-302 Jiao, Jianbin II-282, II-499
Gigengack, Fabian I-1 Jones, Graeme A. II-531
Gonzales, Christophe I-319 Jones, Jonathan-Lee II-466
Gotoh, Yoshihiko I-70, I-78 Jones, Simon I-20
Goumeidane, Aicha Baya II-491
Grinias, Elias I-589 Kampel, Martin II-17
Grum, Matthew I-539 Kaothanthong, Natsuda II-1
Gudivada, Sravan II-241 Kawamoto, Kazuhiko I-417, I-507, I-564
Güngör, Tunga I-180 Ke, Wei II-499
Guo, Dong I-137 Kepski, Michal I-457
Guo, Yilin II-394 Khamadja, Mohammed II-491
Gupta, P.C. I-344 Khan, Fahad Shahbaz I-154

Khorsandi, Rahman I-294 Mejdoub, Mahmoud II-563


Kim, Hyoungseop II-332 Miao, Duoqian I-262
Kim, Okhee II-394 Michael, Elena I-94
Klette, Reinhard I-580, II-50, II-91 Migniot, Cyrille I-482
Koko, Jonas II-315 Miguet, Serge I-188, I-393
Komodakis, Nikos I-589 Mikeš, Stanislav I-433
Körner, Marco I-163, II-117 Mikolajczyk, Krystian II-83, II-332
Kotera, Jan II-59 Milanfar, Peyman II-59
Ksibi, Amel II-571 Miranda, Paulo A.V. I-572
Kwolek, Bogdan I-457 Mirbach, Bruno II-100
Kyrgyzova, Khrystyna II-354 Mitchell, Kenny II-209
Mochizuki, Yoshihiko I-417, II-250
La Cascia, Marco II-362 Moehrmann, Julia II-266
Langner, Tobias II-34 Mollineda Cárdenas, Ramón A. I-204
Larabi, Slimane I-385 Moreno, Carlos II-457
Largeron, Christine II-408 Murata, Noboru II-250
Lecellier, François I-515, II-315 Murray, Iain II-324
Lee, Jinseon II-201 Murray, John II-370
Leow, Wee Kheng I-54, I-137, I-229
Levada, Alexandre II-233 Nacereddine, Nafaa II-491
Lézoray, Olivier II-523 Nagase, Masanobu I-473
Li, Ce II-499 Nain, Neeta I-344
Li, Fei I-37 Najati, Hossein I-137
Li, Huibin II-160 Nakamura, Rodrigo II-233
Liu, Chengjun I-490 Nath, Tanmay II-539
Liu, Guangda II-539 Nejati, Hossein II-26
Liu, Long II-394 Ng, Kia II-378
Liu, Meiqing II-507 Nguyen, Thanh Phuong I-360
Liu, Rui I-172 Nguyen, Xuan Son I-319
Liu, Rujie I-37 Niitsuma, Masahiro II-555
Liu, Yonghuai II-108 Nixon, Ian II-370
Loosli, Cédric II-315 Nixon, Mark S. II-290
López-Garcı́a, Fernando I-237 Novotný, Petr I-221
Lourakis, Manolis I-498
Luo, Ming II-298 Oh, Il-Seok II-201
Olivier, Julien I-310
Mahboubi, Amal I-401 Onkarappa, Naveen II-483
Maji, Pradipta I-523 Oommen, B. John I-196, I-368
Majumder, Aditi II-201 Oosten, Jean-Paul van II-555
Majumder, Anima I-556 Osaku, Daniel II-233
Malassiotis, Sotiris II-193 Ostermann, Jörn I-327
Malinen, Mikko I-262 Ottersten, Björn II-100
Mansilla, Lucy A.C. I-572 Ovsepian, Nelly I-94
Manzanera, Antoine I-360
Mariolis, Ioannis II-193 Padilla, Stefano II-169
Martinez, Manuel I-465 Palfinger, Werner II-75
Masnou, Simon II-160 Pan, Binbin I-352
Matuszewski, Bogdan J. I-28 Panagiotakis, Costas I-94, I-441, I-589
Mazzola, Giuseppe II-362 Panwar, Subhash I-344
McKenna, Antony II-408 Papa, Joao II-233

Papa, Joao P. I-377 Sinha, Atreyee I-490


Papa, Luciene P. I-377 Slimene, Alya II-475
Papadakis, Harris I-589 Smith, Dave II-466
Paquet, Eric II-135 Smith, Melvyn L. I-212
Parkkinen, Jussi II-274 Soffner, Michael II-546
Pathan, Saira Saleem II-546 Song, Zijiang II-91
Pazzaglia, Fabio II-515 Souza, Andre N. I-377
Pereira, Luis A.M. I-377 Spangenberg, Robert II-34
Petkov, Nicolai II-9 Šroubek, Filip II-59
Peyrodie, Laurent I-278 Stiefelhagen, Rainer I-465
Pistarelli, Marcelo D. II-217 Stöger, Matthias II-75
Subramanian, Venkatesh K. I-556
Qasim, Sahar II-402 Suk, Tomáš I-221
Sun, Huiping II-394
Ramaekers, Ariane II-539 Suta, Loreta I-393
Ramos, Caio C.O. I-377 Sutherland, Alistair II-126
Ranca, Razvan II-324
Rathgeb, Christian II-177 Taleb-Ahmed, Abdelmalik I-278
Rauber, Thomas W. I-449 Tan, Chew Lim II-42
Reel, Parminder Singh I-270 Tan, JooKooi II-332
Ribeiro, Eraldo I-409 Tao, Junli II-50
Richter, Klaus II-546 Tao, Lili I-28
Riess, Christian I-86 Tenbrinck, Daniel I-1, II-144, II-152
Risse, Benjamin I-1 Thomas, Anu I-196, I-368
Robb, David A. II-169 Thumfart, Stefan II-75
Robertson, Neil M. I-385 Tokuyama, Takeshi II-1
Rodrigues, Douglas I-377 Toledo, Ricardo II-217
Rojas, Raúl II-34 Tomita, Yo II-555
Rosenhahn, Bodo I-327 Torii, Akihiko I-417
Rossi, Luca I-62 Torsello, Andrea I-62
Tschumperlé, David II-523
Sakai, Tomoya I-417, I-507, I-564 Tziritas, Georgios I-589
Samanta, Suranjana I-245
Sappa, Angel D. II-217, II-483 Valiente-Gonzalez, José-Miguel I-237
Sardana, H.K. I-531 Valveny, Ernest I-302
Saxena, Subhra I-344 van de Weijer, Joost I-154
Schäfers, Klaus I-1, II-432 Varejão, Flávio M. I-449
Schauerte, Boris I-465 Vento, Mario I-401
Scheunders, Paul II-539 Vertan, Constantin II-225
Schmid, Sönke I-1, II-432 Vig, Renu I-531
Schomaker, Lambert II-555 Viktor, Herna Lydia II-135
Schubert, Falk II-83 Vinel, Antoine I-45
Scuturici, Mihaela I-188, I-393 Vrânceanu, Ruxandra II-225
Scuturici, Vasile-Marian I-393
Sekma, Manel II-563 Wang, Jian I-86
Serratosa, Francesc II-457 Wang, Liping II-346
Shao, Ling I-20 Wang, Peng II-408
Shin, Bok-Suk II-50 Wang, Xiaofang II-160
Shinomiya, Takashi II-332 Weickert, Joachim II-67
Sim, Terence I-54, I-137, II-26 Weyn, Barbara II-539

Weyrich, Tim II-209 Zambanini, Sebastian II-17


Wilson, Richard C. I-128, II-424 Zeng, Yi I-580
Wong, K.C.P. I-270 Zeng, Ziming II-108
Wu, Zhaoguo II-507 Zhang, Kun I-229
Wu, Zhengzhe II-274 Zhang, Li I-54, I-137, II-26
Zhang, Liguo II-499
Xie, Xianghua II-466 Zhang, Xi II-42
Zhao, Yitian II-108
Yan, Dayuan II-507 Zheng, Guoyan I-335, I-548
Yang, Wei II-499 Zhong, Caiming I-262
Ye, Cheng II-424 Zhou, Ya II-507
Ye, Qixiang II-282 Zhu, Ming I-172
Zhu, Shenggao I-137
Zabulis, Xenophon I-498 Zonouz, Saman II-440
Zagrouba, Ezzeddine II-475 Zwiggelaar, Reyer II-346, II-386
