
Handbook of Research on

Machine Learning
Applications and Trends:
Algorithms, Methods, and
Techniques

Emilio Soria Olivas


University of Valencia, Spain

José David Martín Guerrero


University of Valencia, Spain

Marcelino Martinez Sober


University of Valencia, Spain

Jose Rafael Magdalena Benedito


University of Valencia, Spain

Antonio José Serrano López


University of Valencia, Spain

Volume I

Information Science Reference


Hershey • New York
Director of Editorial Content: Kristin Klinger
Senior Managing Editor: Jamie Snavely
Assistant Managing Editor: Michael Brehm
Publishing Assistant: Sean Woznicki
Typesetter: Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by


Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com/reference

Copyright © 2010 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Handbook of research on machine learning applications and trends : algorithms,


methods and techniques / Emilio Soria Olivas ... [et al.].
p. cm.
Summary: "This book investiges machine learning (ML), one of the most
fruitful fields of current research, both in the proposal of new techniques
and theoretic algorithms and in their application to real-life problems"--
Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-60566-766-9 (hardcover) -- ISBN 978-1-60566-767-6 (ebook) 1.
Machine learning--Congresses. 2. Machine learning--Industrial applications--
Congresses. I. Soria Olivas, Emilio, 1969-
Q325.5.H36 2010
006.3'1--dc22
2009007738

British Cataloguing in Publication Data


A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.

Chapter 2
Principal Graphs and Manifolds
Alexander N. Gorban
University of Leicester, UK

Andrei Y. Zinovyev
Institut Curie, Paris, France

Abstract
In many physical, statistical, biological and other investigations it is desirable to approximate a system
of points by objects of lower dimension and/or complexity. For this purpose, Karl Pearson invented
principal component analysis in 1901 and found ‘lines and planes of closest fit to system of points’. The
famous k-means algorithm solves the approximation problem too, but by finite sets instead of lines and
planes. This chapter gives a brief practical introduction into the methods of construction of general
principal objects (i.e., objects embedded in the ‘middle’ of the multidimensional data set). As a basis,
the unifying framework of mean squared distance approximation of finite datasets is selected. Principal
graphs and manifolds are constructed as generalisations of principal components and k-means principal
points. For this purpose, the family of expectation/maximisation algorithms with nearest generalisa-
tions is presented. Construction of principal graphs with controlled complexity is based on the graph
grammar approach.

Introduction

In many fields of science, one encounters multivariate (multidimensional) distributions of vectors representing observations. These distributions are often difficult to analyse and make sense of, because the human brain can visually manipulate only objects of dimension three or less.
This raises the problem of approximating multidimensional vector distributions by objects of lower dimension and/or complexity, while retaining the most important information and structures contained in the initial full and complex data point cloud.

DOI: 10.4018/978-1-60566-766-9.ch002


The most trivial and coarse approximation is collapsing the whole set of vectors into its mean point.
The mean point represents the ‘most typical’ properties of the system, completely forgetting variability
of observations.
The notion of the mean point can be generalized for approximating data by more complex types of
objects. In 1901 Pearson proposed to approximate multivariate distributions by lines and planes (Pearson,
1901). In this way the Principal Component Analysis (PCA) was invented, nowadays a basic statistical
tool. Principal lines and planes go through the ‘middle’ of multivariate data distribution and correspond
to the first few modes of the multivariate Gaussian distribution approximating the data.
Starting from the 1950s (Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967), it was proposed to approximate a complex multidimensional dataset by several 'mean' points. Thus the k-means algorithm was suggested; nowadays it is one of the most widely used clustering methods in machine learning (see the review presented by Xu & Wunsch, 2008).
Both these directions (PCA and k-means) were further developed during the last decades along two major lines: 1) linear manifolds were generalised to non-linear ones (in simple words, the initial lines and planes were bent and twisted), and 2) some links between the 'mean' points were introduced. This led to the appearance of several large families of new statistical methods; the most famous of them are Principal Curves, Principal Manifolds and Self-Organising Maps (SOM). It was quickly realized that the objects constructed by these methods are tightly connected theoretically. This observation now allows one to develop a common framework called "Construction of Principal Objects". The geometrical nature of these objects can be very different, but all of them serve as data approximators of controllable complexity. This allows using them in the tasks of dimension and complexity reduction. In Machine Learning this direction is connected with the terms 'Unsupervised Learning' and 'Manifold Learning.'
In this chapter we will overview the major directions in the field of principal objects construction.
We will formulate the problem and the classical approaches such as PCA and k-means in a unifying
framework, and show how it is naturally generalised for the Principal Graphs and Manifolds and the
most general types of principal objects, Principal Cubic Complexes. We will systematically introduce
the most used ideas and algorithms developed in this field.

Approximations of Finite Datasets

Definition. A dataset is a finite set X of objects representing N multivariate (multidimensional) observations. These objects xi∈X, i = 1…N, are embedded in Rm and, in the case of complete data, are vectors xi∈Rm. We will also refer to the individual components of xi as $x^i_k$, so that $x^i = (x^i_1, x^i_2, ..., x^i_m)$; we can also represent the dataset as a data matrix $X = \{x^i_j\}$.

Definition. A distance function dist(x,y) is defined for any pair of objects x, y from X such that the three usual axioms are satisfied: dist(x,x) = 0, dist(x,y) = dist(y,x), dist(x,z) ≤ dist(x,y) + dist(y,z).

Definition. The mean point MF(X) for X is a vector MF(X)∈Rm such that $MF(X) = \arg\min_{y\in R^m}\sum_{i=1..N}\left(dist(y, x^i)\right)^2$.

In this form the definition of the mean point goes back to Fréchet (1948). Notice that in this definition the mean point by Fréchet can be non-unique. However, this definition allows multiple useful generalisations, including its use in abstract metric spaces. It is easy to show that in the case of complete data and the Euclidean distance function $dist(x,y) = \sqrt{\sum_{i=1}^{m}(x_i - y_i)^2}$, or, more generally, in the case of any quadratic distance function (for example, the Mahalanobis distance), the mean point is the standard expectation $MF(X) = \frac{1}{N}\sum_{i=1}^{N} x^i = E(X)$.
Definition. The orthogonal projection P(x,Y) (generalised) is defined for an object x and a set (not necessarily finite) of vectors Y as a vector in Y such that $P(x,Y) = \arg\min_{y\in Y} dist(x, y)$. Notice that, in principle, one can have non-unique and even infinitely many projections of x on Y.

Definition. The mean squared distance MSD(X,Y) between a dataset X and a set of vectors Y is defined as $MSD(X,Y) = \frac{1}{N}\sum_{i=1}^{N} dist^2(x^i, P(x^i,Y))$. We will also consider a simple generalisation of MSD: the weighted mean squared distance $MSD_W(X,Y) = \frac{1}{\sum_{i=1}^{N} w_i}\sum_{i=1}^{N} w_i\, dist^2(x^i, P(x^i,Y))$, where wi > 0 is a weight for the object xi.
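To make these definitions concrete, here is a minimal NumPy sketch (the function names and the toy data are ours, not the chapter's) that computes the mean point, the projection of data points onto a finite set of vectors Y, and the (weighted) mean squared distance.

```python
import numpy as np

def mean_point(X):
    """Mean point MF(X): the minimiser of the sum of squared Euclidean distances."""
    return X.mean(axis=0)

def project_to_set(X, Y):
    """For each x in X, index of and squared distance to the closest vector in Y."""
    # squared Euclidean distances between all pairs (x_i, y_j)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)

def msd(X, Y, w=None):
    """Mean squared distance MSD(X, Y); weighted version MSD_W if weights w are given."""
    _, d2 = project_to_set(X, Y)
    if w is None:
        return d2.mean()
    w = np.asarray(w, dtype=float)
    return (w * d2).sum() / w.sum()

# Toy usage
X = np.random.randn(100, 3)
Y = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
print(mean_point(X), msd(X, Y))
```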

Our objective in the rest of the chapter is to briefly describe the methods for constructing various
approximations (principal objects) for a dataset X. In almost all cases the principal objects will be
represented as a finite or infinite set of vectors Y⊂Rm such that 1) it approximates the finite dataset X
in the sense of minimisation of MSD(X,Y), and 2) it satisfies some regularity conditions that will be
discussed below.

Probabilistic Interpretation of Statistics and Notion of Self-Consistency

In his original works, Pearson followed the principle that the only reality in data analysis is the dataset,
embedded in a multidimensional metric space. This approach can be called geometrical. During the 20th
century, a probabilistic interpretation of statistics was actively developed. According to this interpretation, a dataset X is a particular i.i.d. sample from a multidimensional probability distribution F(x), which defines the probability of a sample appearing at the point x∈Rm.
The probability distribution, if it can be estimated, provides a very useful auxiliary object that allows one to define many notions in the theory of statistical data analysis. In particular, it allows us to define principal manifolds as self-consistent objects.
The notion of self-consistency in this context was first introduced by Efron (1967) and developed in the works of Flury (Tarpey & Flury, 1996), where it is claimed to be one of the most fundamental notions in statistical theory.

Definition. Given a probability distribution F(x) and a set of vectors Y, we say that Y is self-consistent with respect to F(x) if $y = E_F(x \mid P(x,Y) = y)$ for every vector y∈Y. In words, it means that any vector y∈Y is the conditional expectation of a point x under the condition that x is orthogonally projected onto y.
The disadvantage of this definition for finite datasets is that it is not always possible to calculate the conditional mean, since typically only one or zero points of X are projected onto a given y∈Y. This means that for finite datasets we should develop a coarse-grained notion of self-consistency. Usually this means that for every point y∈Y one defines some kind of neighbourhood and introduces a modified self-consistency with respect to this neighbourhood instead of y itself. Concrete implementations of this idea are described further in this chapter. In all cases, the effective size of the neighbourhood is a fundamental parameter in controlling the complexity of the resulting approximator Y.

Four Approaches to Classical PCA

We can define linear principal manifolds as mean squared distance data approximators, constructed
from linear manifolds embedded in Rm. In fact, this corresponds to the original definition of principal
lines and planes by Pearson (Pearson, 1901). However, the PCA method was re-invented in other fields and obtained different names: Karhunen-Loève (KL) decomposition (Karhunen, 1946; Loève, 1955), the Hotelling transform (Hotelling, 1933), Proper Orthogonal Decomposition (Lumley, 1967), and others. Here we formulate four equivalent ways to define principal components that the user can meet in different applications.
Let us consider a linear manifold Lk of dimension k in the parametric form Lk = {a0 + β1a1 + … + βkak | βi∈R}, where a0∈Rm and {a1,…, ak} is a set of orthonormal vectors in Rm.

Definition of PCA problem #1 (data approximation by lines and planes): The PCA problem consists in finding such a sequence Lk (k=1,2,…,m-1) that the sum of squared distances from the data points to their orthogonal projections on Lk is minimal over all linear manifolds of dimension k embedded in Rm: $MSD(X, L_k) \to \min$ (k=1,2,…,m-1).

Definition of PCA problem #2 (variance maximisation): For a set of vectors X and for a given ai, let us construct a one-dimensional distribution $B_i = \{\beta : \beta = (x, a_i), x \in X\}$, where (·,·) denotes the scalar vector product. Then let us define the empirical variance of X along ai as Var(Bi), where Var() is the standard empirical variance. The PCA problem consists in finding such an Lk that the sum of empirical variances of X along a1,…, ak is maximal over all linear manifolds of dimension k embedded in Rm: $\sum_{i=1..k} Var(B_i) \to \max$.

Let us also consider an orthogonal complement {ak+1, …, am} of the basis {a1, …, ak}. Then an equivalent definition (minimisation of the residual variance) is

$\sum_{i=k+1}^{m} Var(B_i) \to \min$.

Definition of PCA problem #3 (mean point-to-point squared distance maximisation): The PCA problem consists in finding such a sequence Lk that the mean point-to-point squared distance between the orthogonal projections of the data points on Lk is maximal over all linear manifolds of dimension k embedded in Rm: $\frac{1}{N}\sum_{i,j=1}^{N} dist^2\!\left(P(x^i, L_k), P(x^j, L_k)\right) \to \max$. Having in mind that all orthogonal projections onto a lower-dimensional space lead to contraction of all point-to-point distances (except for some that do not change), this is equivalent to minimisation of the mean squared distance distortion:

$\sum_{i,j=1}^{N}\left[dist^2(x^i, x^j) - dist^2\!\left(P(x^i, L_k), P(x^j, L_k)\right)\right] \to \min$


In the three above mentioned definitions, the basis vectors are defined up to an arbitrary rotation that
does not change the manifold. To make the choice less ambiguous, in the PCA method the following
principle is applied: given {a0, a1,…,ak}, any ‘embedded’ linear manifold of smaller dimension s in the
form Ls = {a0 + β1a1 + …+ βsas| βi∈R,s < k}, must be itself a linear principal manifold of dimension s
for X (a flag of principal subspaces).

Definition of PCA problem #4 (correlation cancellation): Find such an orthonormal basis (a1,…, as) in
which the covariance matrix for x is diagonal. Evidently, in this basis the distributions (ai,x) and (aj,x),
for i ≠ j, have zero correlation.
Definitions 1-3 were given for finite datasets, while definition 4 makes sense both for finite datasets and for a random vector x. For finite datasets the empirical correlation should be cancelled. The empirical principal components which annul the empirical correlations can be considered as an approximation to the principal components of the random vector.
The equivalence of the above-mentioned definitions in the case of complete data and Euclidean space follows from the Pythagorean theorem and elementary algebra. However, in practice one or another definition can be more useful for computations or for generalisations of the PCA approach. Thus, only definitions #1 and #3 are suitable for working with incomplete data, since they are defined using only the distance function, which can easily be calculated for 'gapped' data vectors (see further). Definition #1 can be generalized by weighting data points (Cochran & Horne, 1977), while definition #3 can be generalized by weighting pairs of data points (Gabriel & Zamir, 1979). More details about PCA and its generalisations can be found in the fundamental book by Jolliffe (2002).

Basic Expectation/Maximisation Iterative Algorithm for Finding Principal Objects

Most of the algorithms for finding principal objects for a given dataset X are constructed according to the classical expectation/maximisation (EM) splitting scheme, first formulated as a generic method by Dempster et al. (1977):

Generic Expectation-Maximisation algorithm for estimating principal objects


1) Initialisation step. Some initial configuration of the principal object Y is generated;
2) Expectation (projection) step. Given the configuration of Y, calculate the orthogonal projections P(x,Y) for all x∈X;
3) Maximisation step. Given the calculated projections, find a more optimal configuration of Y with respect to X;
4) (Optional) adaptation step. Using some strategy, change the properties of Y (typically, add points to or remove points from Y);
5) Repeat steps 2-4 until some convergence criterion is satisfied.

For example, for the principal line, we have the following implementation of the above-mentioned bi-iteration scheme (Bauer, 1957; for generalisations see the works of Roweis (1998) and Gorban & Rossiev (1999)).


Iterative algorithm for calculating the first principal component

1) Set a0 = MF(X) (i.e., the zero-order principal component is the mean point of X);
2) Choose a1 randomly;
3) Calculate $b_i = \frac{(x^i - a_0,\ a_1)}{\|a_1\|^2}$, i = 1…N;
4) Given the bi, find a new a1 such that $\sum_{i=1}^{N}(x^i - a_0 - a_1 b_i)^2 \to \min$, i.e. $a_1 = \frac{\sum_{i=1..N} x^i b_i - a_0\sum_{i=1..N} b_i}{\sum_{i=1..N} b_i^2}$;
5) Re-normalise $a_1 := a_1 / \|a_1\|$;
6) Repeat steps 3-5 until the direction of a1 changes by no more than some small angle ε.
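A direct NumPy transcription of this iteration might look as follows (a sketch under our own naming; the convergence test compares the angle between successive directions with ε).

```python
import numpy as np

def first_principal_component(X, eps=1e-6, max_iter=1000):
    """EM-style iteration for the first principal component (steps 1-6 above)."""
    a0 = X.mean(axis=0)                      # zero-order component: the mean point
    Xc = X - a0                              # centred data
    rng = np.random.default_rng(0)
    a1 = rng.standard_normal(X.shape[1])
    a1 /= np.linalg.norm(a1)
    for _ in range(max_iter):
        b = Xc @ a1                          # b_i = (x^i - a0, a1), with ||a1|| = 1
        a1_new = (Xc * b[:, None]).sum(axis=0) / (b ** 2).sum()
        a1_new /= np.linalg.norm(a1_new)     # re-normalise
        # stop when the direction changes by less than a small angle eps
        if np.arccos(np.clip(abs(a1 @ a1_new), -1.0, 1.0)) < eps:
            a1 = a1_new
            break
        a1 = a1_new
    return a0, a1
```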

Remark. To calculate all other principal components, the deflation approach is applied: after finding a1, one replaces every data point by $x^{(1)} = x - a_0 - a_1(x - a_0, a_1)$, obtaining the new dataset X(1), and the procedure is repeated for X(1).
Remark. The basic EM procedure has good convergence properties only if the first eigenvalues of
the empirical covariance matrix XTX are sufficiently well separated. If this is not the case, more sophis-
ticated approaches are needed (Bau & Trefethen, 1997).
The PCA method can be treated as the spectral decomposition of the symmetric and positive (semi-)definite empirical covariance matrix (defined in the case of complete data) $C = \frac{1}{N-1}X^TX$, or $C_{ij} = \frac{1}{N-1}\sum_{k=1}^{N} x^k_i x^k_j$, where without loss of generality we suppose that the data are centered.

Definition. We call σ > 0 a singular value of the data matrix X iff there exist two vectors of unit length $a_\sigma$ and $b_\sigma$ such that $Xa_\sigma = \sigma b_\sigma^T$ and $b_\sigma X = \sigma a_\sigma^T$. Then the vectors $a_\sigma = \{a_1^{(\sigma)}, \dots, a_m^{(\sigma)}\}$ and $b_\sigma = \{b_1^{(\sigma)}, \dots, b_N^{(\sigma)}\}$ are called the left and right singular vectors for the singular value σ.
If we know all p singular values of X, where p = rank(X) ≤ min(N, m), then we can represent X as $X = \sum_{l=1}^{p}\sigma_l\, b^{(l)} a^{(l)}$, or $x^k_i = \sum_{l=1}^{p}\sigma_l\, b^{(l)}_k a^{(l)}_i$. This is called the singular value decomposition (SVD) of X. It is easy to check that the vectors a(l) correspond to the principal vectors of X and the eigenvectors of the empirical covariance matrix C, whereas the b(l) contain the projections of the N points onto the corresponding principal vector. The eigenvalues λl of C and the singular values σl of X are connected by $\lambda_l = \frac{1}{N-1}\sigma_l^2$.
The mathematical basis for SVD was introduced by Sylvester (1889) and it represents a solid math-
ematical foundation for PCA (Strang, 1993). Although formally the problems of spectral decomposition
of X and eigen decomposition of C are equivalent, the algorithms for performing singular decomposition
directly (without explicit calculation of C) can be more efficient and robust (Bau III & Trefethen, 1997).
Thus, the iterative EM algorithm for calculating the first principal component described in the previous section indeed performs singular value decomposition (for centered data we simply put a0 = 0) and finds the right singular (principal) and left singular vectors one by one.
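The connection between the SVD of the centred data matrix and the eigendecomposition of C can be checked numerically; the following short NumPy sketch (ours, for illustration only) verifies that λl = σl²/(N−1) and that the b(l) hold the projections onto a(l).

```python
import numpy as np

X = np.random.randn(200, 5)
Xc = X - X.mean(axis=0)                      # centred data matrix
N = Xc.shape[0]

# SVD of the data matrix: Xc = U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigendecomposition of the empirical covariance matrix C = Xc^T Xc / (N - 1)
C = Xc.T @ Xc / (N - 1)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals = eigvals[::-1]                      # sort in decreasing order

# lambda_l = sigma_l^2 / (N - 1); the rows of Vt are the principal vectors a^(l)
print(np.allclose(eigvals, sigma ** 2 / (N - 1)))
# projections onto the first principal vector: Xc @ Vt[0] == sigma_1 * U[:, 0]
print(np.allclose(Xc @ Vt[0], sigma[0] * U[:, 0]))
```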

k-Means and Principal Points

K-means clustering goes back to the 1950s (Steinhaus, 1956; Lloyd, 1957; MacQueen, 1967). It is another extreme case, in terms of simplicity, of finding a principal object: here the principal object is simply an unstructured finite set of vectors (centroids), usually much smaller than the number of points N in the dataset X. One can say that the solution searched for by the k-means algorithm is a set of k principal points (Flury, 1990).

Definition. A set of k points Y={y1,..,yk}, yi∈Rm, is called a set of principal points for the dataset X if it approximates X with minimal mean squared distance error (distortion) over all sets of k points in Rm: $\sum_{x\in X} dist^2(x, P(x,Y)) \to \min$, where P(x,Y) is the point of Y closest to x. Note that the set of principal points may be non-unique.


The simplest implementation of the k-means procedure follows the classical EM scheme:

Basic k-means algorithm


1) Choose the initial positions of y1,..,yk randomly from the xi∈X (with equal probabilities);
2) Partition X into subsets Ki, i=1..k, of data points by their proximity to yi: $K_i = \{x : y_i = \arg\min_{y_j\in Y} dist(x, y_j)\}$;
3) Re-estimate $y_i = \frac{1}{|K_i|}\sum_{x\in K_i} x$, i = 1..k;
4) Repeat steps 2-3 until complete convergence.

The method is sensitive to the initial choice of y1,..,yk. Arthur & Vassilvitskii (2007) demonstrated that a special construction of the initialisation probabilities, instead of the uniform distribution, gives serious advantages. The first centre, y1, is selected uniformly at random from X. Suppose the centres y1,..,yj have already been chosen (j < k), and let D(x) be the squared shortest distance from a data point x to the closest centre chosen so far. Then the next centre, yj+1, is selected from the xi∈X with probability

$p(x^i) = D(x^i) \Big/ \sum_{x\in X} D(x)$.
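The following sketch (our own compact implementation, not production code) combines the basic k-means loop with the Arthur-Vassilvitskii seeding described above; in practice, library implementations such as scikit-learn's KMeans already provide k-means++ initialisation.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first centre uniform, next ones with probability ~ D(x)."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.array(centres)[None, :, :]) ** 2).sum(axis=2), axis=1)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

def kmeans(X, k, n_iter=100, seed=0):
    """Basic EM-style k-means: partition by proximity, then re-estimate centroids."""
    rng = np.random.default_rng(seed)
    Y = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        Y_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else Y[i]
                          for i in range(k)])
        if np.allclose(Y_new, Y):
            break
        Y = Y_new
    return Y, labels
```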

Evidently, any solution of the k-means procedure converges to a self-consistent set of points Y={y1,..,yk} (because Y = E[P(X,Y)]), but this solution may correspond to a local minimum of the distortion and is not necessarily the set of principal points (which is the globally optimal approximator among all possible k-means solutions).
Multiple generalisations of the k-means scheme have been developed (see, for example, the book by Mirkin (2005) based on the idea of 'data recovering'). The most computationally expensive step of the algorithm, partitioning the dataset by proximity to the centroids, can be significantly accelerated using a kd-tree data structure (Pelleg & Moore, 1999). An analysis of the effectiveness of the EM algorithm for the k-means problem was given by Ostrovsky et al. (2006).
Notice that the case of principal points is the only one in this chapter where self-consistency and coarse-grained self-consistency coincide: the centroid yi is the conditional mean point of the data points belonging to the Voronoi region associated with yi.


Local PCA

The term ‘Local PCA’ was first used by Braverman (1970) and Fukunaga & Olsen (1971) to denote the
simplest cluster-wise PCA approach, which consists in 1) applying k-means or another type of clustering to a dataset and 2) calculating the principal components for each cluster separately. However, this simple idea performs rather poorly in applications, and a more interesting approach consists in generalizing k-means by introducing principal hyperplane segments, as proposed by Diday (1979) and called 'k-segments', or local subspace analysis in a more advanced version (Liu, 2003). The algorithm for their estimation follows the classical EM scheme.
Further development of the local PCA idea went in two main directions. First, Verbeek (2002) pro-
posed a variant of the ‘k-segment’ approach for one-dimensional segments accompanied by a strategy
to assemble disconnected line segments into the global piecewise linear principal curve. Einbeck et al
(2008) proposed an iterative cluster splitting and joining approach (recursive local PCA) which helps
to select the optimal number and configuration of disjoined segments.
The second direction is associated with a different understanding of 'locality'. It consists in calculating local mean points and local principal directions and following them, starting from (possibly multiple) seed points. Locality is introduced using kernel functions defining the effective radius of a neighborhood
in the data space. Thus, Delicado (2001) introduced principal oriented points (POP) based on the vari-
ance maximisation-based definition of PCA (#2 in our chapter). POPs are different from the principal
points introduced above because they are defined independently one from another, while the principal
points are defined globally, as a set. POPs can be assembled into the principal curves of oriented points
(PCOP). Einbeck (2005) proposed a simpler approach based on local tracing of principal curves by
calculating local centers of mass and the local first principal components.

SOM Approach for Principal Manifold Approximation and its Generalisations

Kohonen in his seminal paper (Kohonen, 1982) proposed to modify the k-means approach by introduc-
ing connections between centroids such that a change in the position of one centroid would also change
the configuration of some neighboring centroids. Thus Self-Organizing Maps (SOM) algorithm was
developed.
With the SOM algorithm (Kohonen, 1982) we take a finite metric space V with metric ρ and try
to map it into Rm with combinations of two criteria: (1) the best preservation of initial structure in the
image of V and (2) the best approximation of the dataset X. In this way, SOMs give the most popular
approximations for principal manifolds: we can take for V a fragment of a regular s-dimensional grid
and consider the resulting SOM as the approximation to the s-dimensional principal manifold (Mulier
& Cherkassky, 1995; ℜitter et al, 1992; Yin H. 2008).
The SOM algorithm has several setup variables to regulate the compromise between these goals. In the original formulation by Kohonen, we start from some initial approximation of the map, φ1: V → Rm. Usually this approximation lies on the s-dimensional linear principal manifold. On each k-th step of the algorithm we have a chosen data point x∈X and a current approximation φk: V → Rm. For these x and φk we define an 'owner' of x in V: $v_x = \arg\min_{v\in V}\|x - \varphi_k(v)\|$. The next approximation, φk+1, is $\varphi_{k+1}(v) = \varphi_k(v) + h_k\, w(\rho(v, v_x))\,(x - \varphi_k(v))$. Here hk is a step size and 0 ≤ w(ρ(v,vx)) ≤ 1 is a monotonically decreasing neighborhood function. This process proceeds in several epochs, with the neighborhood radius decreasing in each successive epoch.
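As an illustration, the online SOM update in this notation can be sketched as follows (our own minimal one-dimensional version: the grid metric ρ is the absolute difference of node indices, w is a Gaussian neighbourhood function, the step size and radius decay linearly over epochs, and a crude line initialisation stands in for the usual initialisation on the linear principal manifold).

```python
import numpy as np

def som_1d(X, n_nodes=20, n_epochs=10, h0=0.5, radius0=5.0, seed=0):
    rng = np.random.default_rng(seed)
    # crude initial map phi_1: nodes on a random line through the data mean
    # (in practice one would initialise on the first principal component)
    direction = rng.standard_normal(X.shape[1])
    direction /= np.linalg.norm(direction)
    phi = X.mean(axis=0) + np.linspace(-1.0, 1.0, n_nodes)[:, None] * direction
    v = np.arange(n_nodes)                     # internal coordinates of the nodes
    for epoch in range(n_epochs):
        h = h0 * (1.0 - epoch / n_epochs)      # step size decreases with epochs
        radius = radius0 * (1.0 - epoch / n_epochs) + 1e-3
        for x in rng.permutation(X):
            vx = np.argmin(((phi - x) ** 2).sum(axis=1))   # 'owner' of x in V
            w = np.exp(-0.5 * ((v - vx) / radius) ** 2)    # neighbourhood function w(rho(v, vx))
            phi += h * w[:, None] * (x - phi)              # phi_{k+1} = phi_k + h w (x - phi_k)
    return phi
```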


The idea of SOM is flexible; it has been applied in many domains of science and has led to multiple generalizations (see the review paper by Yin (2008)). Some of the algorithms for constructing SOMs are of the EM type described above, such as the Batch SOM algorithm (Kohonen, 1997): it includes a projection step exactly the same as in k-means and a maximization step at which all φk(v) are modified simultaneously.
One source of theoretical dissatisfaction with SOM is that it is not possible to define an optimal-
ity criterion (Erwin et al, 1992): SOM is a result of the algorithm at work and there does not exist any
objective function that is minimized by the training process.
In an attempt to resolve this issue, Bishop et al. (1998) developed the optimization-based Generative
Topographic Mapping (GTM) method. In this setting, it is supposed that the observed data is i.i.d.
sample from a mixture of Gaussian distributions with the centers aligned along a two-dimensional grid,
embedded in the data space. Parameters of this mixture are determined by EM-based maximization of
the likelihood function (probability of observing X within this data model).

Principal Manifolds by Hastie and Stuetzle

Principal curves and principal two-dimensional surfaces for a probability distribution F(x) were introduced in the PhD thesis of Trevor Hastie (1984) as self-consistent (non-linear) one- and two-dimensional globally parametrisable smooth manifolds without self-intersections.

Definition. Let G be the class of differentiable 1-dimensional curves in Rm, parameterized by λ∈R1 and
without self-intersections. The Principal Curve of the probability distribution F(x) is such a Y(λ)∈G
that is self-consistent.
Remark. Usually, a compact subset of Rm and a compact interval of parameters λ∈R1 are considered.
To discuss unbounded regions, it is necessary to add a condition that Y(λ) has finite length inside any
bounded subset of Rm (Kégl, 1999).

Definition. Let G2 be the class of differentiable 2-dimensional surfaces in Rm, parameterized by λ∈R2 and
without self-intersections. The Principal Surface of the probability distribution F(x) is such a Y(λ)∈G2
that is self-consistent. (Again, for unbounded regions it is necessary to assume that for any bounded set
B from Rm the set of parameters λ for which Y(λ)∈B is also bounded.)
First, Hastie and Stuetzle proposed an algorithm for finding the principal curves and principal surfaces of a probability distribution F(x), using the classical EM splitting. We do not reproduce this algorithm here because it cannot be applied directly to a finite dataset X: onto a typical point of Y(λ) only zero or one data point is projected, hence one cannot calculate the conditional expectation. As mentioned above, in this case we should use some kind of coarse-grained self-consistency. In the original approach by Hastie (1984), this is done by introducing smoothers. This gives the practical formulation of the HS algorithm for estimating principal manifolds from a finite dataset X:

Hastie-Stuetzle algorithm for finding the principal curve for a finite dataset
1) Initialize Y(λ) = a0+λa1, where a0 is a mean point and a1 is the first principal component;


2) Project every data point xi onto Y(λ): i.e., for each xi find λi such that $Y(\lambda_i) = \arg\inf_{\lambda}\|Y(\lambda) - x^i\|^2$. In practice this requires an interpolation procedure, because Y(λ) is determined at a finite number of points {λ1,...,λN}. The simplest is the piecewise interpolation procedure, but more sophisticated procedures can be proposed (Hastie, 1984);
3) Calculate the new Y′(λ) at the finite number of internal coordinates {λ1,...,λN} (found at the previous step) as the local average of the point xi and some other points whose projections onto Y are close to λi. To do this, 1) a span [w×N] is defined ([.] here is the integer part), where 0 < w << 1 is a parameter of the method (the coarse-grained self-consistency neighbourhood radius); 2) for the [w×N] internal coordinates $\{\lambda_{i_1},...,\lambda_{i_{[w\times N]}}\}$ closest to λi and the corresponding $\{x^{i_1},...,x^{i_{[w\times N]}}\}$, calculate the weighted least squares linear regression y(λ) = a(i)λ + b(i); 3) define Y′(λi) as the value of this linear regression at λi: Y′(λi) = a(i)λi + b(i);
4) Reassign Y(λ) ← Y′(λ);
5) Repeat steps 2)-4) until Y does not change (approximately).

Remark. For the weights in the regression at step 3), Hastie proposed to use some symmetric kernel function that vanishes on the borders of the neighbourhood. For example, for xi let us denote by $\lambda_{i_{[w\times N]}}$ the most distant of the [w×N] internal coordinate values closest to λi. Then we can define the weight for the pair $(\lambda_{i_j}, x^{i_j})$ as

$w^i_j = \begin{cases} \left(1 - \left(|\lambda_{i_j} - \lambda_i| \big/ |\lambda_{i_{[w\times N]}} - \lambda_i|\right)^3\right)^{1/3}, & \text{if } |\lambda_{i_j} - \lambda_i| \le |\lambda_{i_{[w\times N]}} - \lambda_i|,\\ 0, & \text{otherwise.}\end{cases}$

Remark. At step 3) an alternative approach was also proposed, using cubic splines to approximate the smooth function Y′(λ) from all pairs (λi, xi), i = 1..N.
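To make the loop concrete, here is a rough NumPy sketch of the project-then-smooth iteration (our own simplified implementation: the curve is stored as a polyline ordered by λ, and the smoother uses tricube-style weights as one common choice, not necessarily the exact kernel of the Remark above).

```python
import numpy as np

def project_to_polyline(X, curve):
    """Arc-length coordinate lambda_i of the projection of each data point onto the polyline."""
    seg = curve[1:] - curve[:-1]
    seg_len = np.linalg.norm(seg, axis=1)
    arc0 = np.concatenate([[0.0], np.cumsum(seg_len)])
    lam = np.empty(len(X))
    for i, x in enumerate(X):
        t = np.clip(((x - curve[:-1]) * seg).sum(axis=1) / (seg_len ** 2 + 1e-12), 0.0, 1.0)
        proj = curve[:-1] + t[:, None] * seg
        j = ((proj - x) ** 2).sum(axis=1).argmin()
        lam[i] = arc0[j] + t[j] * seg_len[j]
    return lam

def hs_principal_curve(X, w=0.2, n_iter=10):
    """Sketch of the HS loop with a local weighted-least-squares smoother."""
    a0 = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - a0, full_matrices=False)
    lam = (X - a0) @ Vt[0]                                   # step 1: first principal line
    curve = a0 + np.sort(lam)[:, None] * Vt[0]
    span = max(2, int(w * len(X)))
    for _ in range(n_iter):
        lam = project_to_polyline(X, curve)                  # step 2: projection
        new_points = np.empty_like(X)
        for i in range(len(X)):                              # step 3: local averaging
            idx = np.argsort(np.abs(lam - lam[i]))[:span]    # the [wN] closest internal coordinates
            d = np.abs(lam[idx] - lam[i])
            wts = (1.0 - (d / (d.max() + 1e-12)) ** 3) ** 3  # tricube-style weights (one common choice)
            A = np.vstack([lam[idx], np.ones(span)]).T * np.sqrt(wts)[:, None]
            B = X[idx] * np.sqrt(wts)[:, None]
            coef, *_ = np.linalg.lstsq(A, B, rcond=None)     # weighted linear regression y(lambda)
            new_points[i] = coef[0] * lam[i] + coef[1]       # Y'(lambda_i)
        curve = new_points[np.argsort(lam)]                  # step 4: reassign, ordered by lambda
    return lam, curve
```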
Non-linear principal manifolds constructed by this algorithm are usually called Hastie-Stuetzle (HS) principal manifolds. However, the global optimality of HS principal manifolds is not guaranteed (only self-consistency in the case of a distribution, or coarse-grained self-consistency in the case of a dataset, is guaranteed by construction). For example, the second principal component of a sample X from a normal distribution is self-consistent and will be a correct HS principal curve, but of course not the optimal one.
We should also underline that our view of what object is constructed by the HS algorithm for a dataset X depends on 1) the probabilistic interpretation of the nature of X, and 2) the chosen heuristic approach to coarse-grained self-consistency. If we do not suppose that the dataset is generated by i.i.d. sampling from F(x), then the definition of the HS principal manifold is purely operational: the HS principal manifold for X is the result of applying the HS algorithm for finite datasets. An analogous remark applies to all principal manifold approximators constructed for finite datasets and described further in this chapter.
In his PhD thesis Hastie noticed that the HS principal curve does not coincide with the generating
curve in a very simple additive data generation model

X = f(λ)+ε, (1)


where f(λ) is some curve embedded in the data space and ε is a noise distribution independent of λ. Because f(λ), if it is not a straight line, is not self-consistent, HS principal curves were claimed to be 'biased'. This inspired Tibshirani (1992) to introduce an alternative definition of the principal curve, based directly on the continuous mixture model (1) and on maximising a regularized likelihood.

kégl-kryzhak Improvement

Kégl, in his PhD thesis supervised by Krzyzak (Kégl, 1999), revised the existing methods for estimating principal curves. In particular, this led to the definition of principal curves with limited length.

Definition. A principal curve YL(λ) of length L is such a curve that the mean squared distance from the dataset X to the curve YL(λ) is minimal over all curves of length less than or equal to L:

$\sum_{i=1}^{N} dist^2\!\left(x^i, P(x^i, Y_L)\right) \to \min$.

Theorem. Assume that X has finite second moments, i.e. $\sum_{i=1}^{N}(x^i)^T x^i < \infty$. Then for any L > 0 there exists a principal curve of length L.

Principal curves of length L, as defined by Kégl, are globally optimal approximators, as opposed to HS principal curves, which are only self-consistent. However, attempts to construct a practical algorithm for finding globally optimal principal curves of length L were not successful. Instead, Kégl developed an efficient heuristic, the Polygonal line algorithm, for constructing piecewise linear principal curves.
Let us consider a piecewise linear curve Y composed of vertices located at points {y1,…,yk+1} and k segments connecting pairs of vertices {yj, yj+1}, j=1..k. Kégl's algorithm searches for a (local) optimum of the penalised mean squared distance error function:

$U(X,Y) = MSD(X,Y) + \frac{\lambda}{k+1}\sum_{i=1}^{k+1} CP(i)$,  (2)

where CP(i) is a curvature penalty function for a vertex i chosen as

$CP(i) = \begin{cases} \|y^1 - y^2\|^2 & \text{if } i = 1,\\ r^2\,(1 + \cos\gamma(i)) & \text{if } 1 < i < k+1,\\ \|y^k - y^{k+1}\|^2 & \text{if } i = k+1, \end{cases}$

where $\cos\gamma(i) = \frac{(y^{i-1} - y^i,\ y^{i+1} - y^i)}{\|y^{i-1} - y^i\|\,\|y^{i+1} - y^i\|}$ is the cosine of the angle between the two neighbouring segments at the vertex i, $r = \max_{x\in X} dist(x, MF(X))$ is the 'radius' of the dataset X, and λ is a parameter controlling the global smoothness of the curve.
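For illustration, the penalised functional (2) can be evaluated directly; the sketch below (our own function, assuming a polygonal curve with at least two vertices, stored as an array of vertex positions) computes MSD(X,Y) by projecting onto the segments and adds the curvature penalty.

```python
import numpy as np

def polygonal_curve_penalty(X, Y, lam):
    """Penalised error U(X,Y) of Eq. (2) for a polygonal curve with vertices Y ((k+1) x m)."""
    # mean squared distance from X to the polyline (projection onto segments)
    seg = Y[1:] - Y[:-1]
    seg_len2 = (seg ** 2).sum(axis=1) + 1e-12
    t = np.clip(((X[:, None, :] - Y[None, :-1, :]) * seg[None, :, :]).sum(axis=2) / seg_len2, 0, 1)
    proj = Y[None, :-1, :] + t[:, :, None] * seg[None, :, :]
    msd = ((X[:, None, :] - proj) ** 2).sum(axis=2).min(axis=1).mean()

    r = np.linalg.norm(X - X.mean(axis=0), axis=1).max()      # 'radius' of the dataset
    k1 = len(Y)                                               # k + 1 vertices
    cp = np.empty(k1)
    cp[0] = ((Y[0] - Y[1]) ** 2).sum()                        # end-vertex penalties
    cp[-1] = ((Y[-1] - Y[-2]) ** 2).sum()
    for i in range(1, k1 - 1):
        u, v = Y[i - 1] - Y[i], Y[i + 1] - Y[i]
        cos_g = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
        cp[i] = r ** 2 * (1 + cos_g)                          # angle penalty at inner vertices
    return msd + lam / k1 * cp.sum()
```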


The Polygonal line algorithm (Kégl, 1999) follows the standard EM splitting scheme:


Polygonal line algorithm for estimating piece-wise linear principal curve


1) The initial approximation is constructed as a segment of the principal line. The length of the segment is the difference between the maximal and the minimal projection value of X onto the first principal component. The segment is positioned such that it contains all of the projected data points. Thus in the initial approximation one has two vertices {y1, y2} and one segment between them (k = 1).
2) Projection step. The dataset X is partitioned into 2k+1 subsets $K_z = \{x : z = \arg\min_{z' \in \text{vertices} \cup \text{segments}} dist(x, z')\}$ constructed by their proximity to the k+1 vertices and k segments. If a segment i and a vertex j are equally distant from x, then x is placed into Kj only.
3) Optimisation step. Given the partitioning obtained at step 2, the functional U(X,Y) is optimised by use of a gradient technique. Fixing the partitioning into Ki is needed to calculate the gradient of U(X,Y), because otherwise it is not a differentiable function with respect to the positions of the vertices {yi}.
4) Adaptation step. Choose the segment with the largest number of points projected onto it. If more than one such segment exists, then the longest one is chosen. A new vertex is inserted at the midpoint of this segment; all other segments are renumbered accordingly.
5) Stopping criterion. The algorithm stops when the number of segments exceeds $\beta \times N^{1/3} \times \frac{r}{\sqrt{MSD(X,Y)}}$.

Heuristically, the following default parameters of the method have been proposed: β = 0.3, $\lambda = \lambda' \times \frac{k}{N^{1/3}} \times \frac{\sqrt{MSD(X,Y)}}{r}$, λ′ = 0.13. The details of implementation together with convergence
and computational complexity study are provided elsewhere (Kégl, 1999).
Smola et al. (2001) proposed a regularized principal manifolds framework, based on the minimization of a quantization error functional with a large class of admissible regularizers and a universal EM-type algorithm. For this algorithm the convergence rates were analyzed, and it was shown that for some regularizing terms the convergence rates improve upon those of Kégl's polygonal line algorithm.

Elastic Maps Approach

In a series of works (Gorban & ℜossiev, 1999;Gorban et al., 2001, 2003; Gorban & Zinovyev, 2005,
2008a; Gorban et al., 2007, 2008), the authors of this chapter used metaphor of elastic membrane and plate
to construct one-, two- and three-dimensional principal manifold approximations of various topologies.
Mean squared distance approximation error combined with the elastic energy of the membrane serves
as a functional to be optimised. The elastic map algorithm is extremely fast at the optimisation step due
to the simplest form of the smoothness penalty. It is implemented in several programming languages as
software libraries or front-end user graphical interfaces freely available from the web-site http://bioinfo.
curie.fr/projects/vidaexpert. The software found applications in microarray data analysis, visualization
of genetic texts, visualization of economical and sociological data and other fields (Gorban et al, 2001,
2003; Gorban & Zinovyev 2005, 2008a; Gorban et al, 2007, 2008).
Let G be a simple undirected graph with set of vertices V and set of edges E.

Definition. A k-star in a graph G is a subgraph with k + 1 vertices v0, v1, ..., vk ∈ V and k edges {(v0, vi) | i = 1, ..., k} ⊆ E. A rib is by definition a 2-star.


Definition. Suppose that for each k ≥ 2, a family Sk of k-stars in G has been selected. Then we define an
elastic graph as a graph with selected families of k-stars Sk and for which for all E(i) ∈ E and Sk( j ) ∈ Sk,
the corresponding elasticity moduli λi > 0 and μkj > 0 are defined.

Definition. Primitive elastic graph is an elastic graph in which every non-terminal node (with the
number of neighbours more than one) is associated with a k-star formed by all neighbours of the node.
All k-stars in the primitive elastic graph are selected, i.e. the Sk sets are completely determined by the
graph structure.

Definition. Let E(i)(0), E(i)(1) denote the two vertices of a graph edge E(i), and let $S_k^{(j)}(0), ..., S_k^{(j)}(k)$ denote the vertices of a k-star $S_k^{(j)}$ (where $S_k^{(j)}(0)$ is the central vertex, to which all other vertices are connected). Let us consider a map φ:V→Rm which describes an embedding of the graph into a multidimensional space. The elastic energy of the graph embedding in the Euclidean space is defined as

$U^{\varphi}(G) := U^{\varphi}_E(G) + U^{\varphi}_R(G)$,  (3)

$U^{\varphi}_E(G) := \sum_{E^{(i)}} \lambda_i \left\|\varphi(E^{(i)}(0)) - \varphi(E^{(i)}(1))\right\|^2$,  (4)

$U^{\varphi}_R(G) := \sum_{S_k^{(j)}} \mu_{kj} \left\|\varphi(S_k^{(j)}(0)) - \frac{1}{k}\sum_{i=1}^{k}\varphi(S_k^{(j)}(i))\right\|^2$.  (5)
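The energy (3)-(5) is straightforward to compute for a given embedding; below is a small NumPy sketch (the graph representation, with explicit edge and star lists, is our own choice).

```python
import numpy as np

def elastic_energy(phi, edges, stars, lambdas, mus):
    """Elastic energy U^phi(G) = U_E + U_R of Eqs. (3)-(5).

    phi     : (|V|, m) array, phi[v] is the position of vertex v in R^m
    edges   : list of (v0, v1) vertex index pairs
    stars   : list of (centre, [neighbour indices]) pairs, one per selected k-star
    lambdas : stretching moduli, one per edge
    mus     : bending moduli, one per star
    """
    u_e = sum(l * ((phi[a] - phi[b]) ** 2).sum()
              for (a, b), l in zip(edges, lambdas))
    u_r = sum(mu * ((phi[c] - phi[nbrs].mean(axis=0)) ** 2).sum()
              for (c, nbrs), mu in zip(stars, mus))
    return u_e + u_r

# Toy example: a 3-star (centre 0, leaves 1, 2, 3) embedded in R^2
phi = np.array([[0.0, 0.1], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
edges = [(0, 1), (0, 2), (0, 3)]
stars = [(0, [1, 2, 3])]
print(elastic_energy(phi, edges, stars, lambdas=[0.01] * 3, mus=[0.1]))
```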

Definition. An elastic net is a particular case of elastic graph which (1) contains only ribs (2-stars) (the families Sk are empty for all k>2); and (2) has vertices that form a regular small-dimensional grid (Figure 1).
The elastic net is characterised by its internal dimension dim(G). Every node vi in the elastic net is indexed by the discrete values of the internal coordinates $\{\lambda_1^i, ..., \lambda_{\dim(G)}^i\}$ in such a way that nodes close on the graph have similar internal coordinates.
The purpose of the elastic net is to introduce point approximations to manifolds. Historically, it was the first of these objects to be explored and used in applications. To avoid confusion, one should notice that the term elastic net was independently introduced by several groups: for solving the traveling salesman problem (Durbin & Willshaw, 1987), in the context of principal manifolds (Gorban et al., 2001), and recently in the context of the regularized regression problem (Zou & Hastie, 2005). These three notions are completely independent and denote different things.
Definition. An elastic map is a continuous manifold Y⊂Rm constructed from the elastic net as its grid approximation using some between-node interpolation procedure. This interpolation procedure constructs a continuous mapping φc: {λ1,…, λdim(G)} → Rm from the discrete map φ:V→Rm, used to embed the graph in Rm, and the discrete values of the node indices $\{\lambda_1^i, ..., \lambda_{\dim(G)}^i\}$, i = 1...|V|. For example, the simplest piecewise linear elastic map is built by a piecewise linear map φc.

Definition.Elastic principal manifold of dimension s for a dataset X is an elastic map, constructed from
an elastic net Y of dimension s embedded in Rm using such a map φopt:Y →Rm that corresponds to the
minimal value of the functional


Figure 1. Elastic nets used in practice

$U^{\varphi}(X,Y) = MSD_W(X,Y) + U^{\varphi}(G)$,  (6)

where the weighted mean squared distance from the dataset X to the elastic net Y is calculated as the
distance to the finite set of vertices {y1=φ(v1),..., yk=φ(vk)}.
In the Euclidean space one can apply an EM algorithm for estimating the elastic principal manifold
for a finite dataset. It is based in turn on the general algorithm for estimating the locally optimal embed-
ding map φ for an arbitrary elastic graph G, described below.

Optimisation of the elastic graph algorithm:


1) Choose some initial position of the nodes of the elastic graph {y1=φ(v1),..., yk=φ(vk)}, where k is the number of graph nodes, k = |V|;
2) Calculate two matrices eij and sij, using the following sub-algorithm:
   i. Initialize the sij matrix to zero;
   ii. For each k-star $S_k^{(i)}$ with elasticity module μki, outer nodes $v_{N_1}, ..., v_{N_k}$ and central node $v_{N_0}$, the sij matrix is updated as follows (1 ≤ l,m ≤ k):
      $s_{N_0 N_0} \leftarrow s_{N_0 N_0} + \mu_{ki}$,  $s_{N_l N_m} \leftarrow s_{N_l N_m} + \mu_{ki}/k^2$,
      $s_{N_0 N_l} \leftarrow s_{N_0 N_l} - \mu_{ki}/k$,  $s_{N_l N_0} \leftarrow s_{N_l N_0} - \mu_{ki}/k$;
   iii. Initialize the eij matrix to zero;
   iv. For each edge E(i) with weight λi, one vertex $v_{k_1}$ and the other vertex $v_{k_2}$, the ejk matrix is updated as follows:
      $e_{k_1 k_1} \leftarrow e_{k_1 k_1} + \lambda_i$,  $e_{k_2 k_2} \leftarrow e_{k_2 k_2} + \lambda_i$,
      $e_{k_1 k_2} \leftarrow e_{k_1 k_2} - \lambda_i$,  $e_{k_2 k_1} \leftarrow e_{k_2 k_1} - \lambda_i$;
3) Partition X into subsets Ki, i=1..k, of data points by their proximity to yi: $K_i = \{x : y_i = \arg\min_{y_j\in Y} dist(x, y_j)\}$;
4) Given Ki, calculate the matrix $a_{js} = \frac{n_j \delta_{js}}{\sum_{i=1}^{N} w_i} + e_{js} + s_{js}$, where $n_j = \sum_{x^i\in K_j} w_i$ and δjs is the Kronecker symbol;
5) Find the new positions of {y1,..., yk} by solving the system of linear equations
   $\sum_{s=1}^{k} a_{js}\, y_s = \frac{1}{\sum_{i=1}^{N} w_i}\sum_{x^i\in K_j} w_i x^i$;
6) Repeat steps 3-5 until complete or approximate convergence of the node positions {y1,..., yk}.
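A condensed NumPy sketch of steps 1)-6) is given below (data structures and names are ours; it assumes a connected graph with strictly positive moduli and unit data weights by default, and simply re-solves the full k×k linear system at every iteration rather than updating it incrementally).

```python
import numpy as np

def optimise_elastic_graph(X, phi, edges, stars, lambdas, mus, w=None, n_iter=20):
    """EM iteration of steps 1-6: build e and s once, then alternate partitioning
    and solving the linear system for the node positions."""
    k, N = len(phi), len(X)
    w = np.ones(N) if w is None else np.asarray(w, float)
    e = np.zeros((k, k))
    s = np.zeros((k, k))
    for (a, b), lam in zip(edges, lambdas):                    # step 2.iv
        e[a, a] += lam; e[b, b] += lam
        e[a, b] -= lam; e[b, a] -= lam
    for (c, nbrs), mu in zip(stars, mus):                      # step 2.ii
        kk = len(nbrs)
        s[c, c] += mu
        for l in nbrs:
            s[c, l] -= mu / kk; s[l, c] -= mu / kk
            for m in nbrs:
                s[l, m] += mu / kk ** 2
    Y = phi.copy()
    for _ in range(n_iter):
        # step 3: partition the data by proximity to the nodes
        labels = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # step 4: matrix a_js
        nj = np.array([w[labels == j].sum() for j in range(k)])
        A = np.diag(nj / w.sum()) + e + s
        # step 5: right-hand side and solution for the new node positions
        B = np.array([(w[labels == j][:, None] * X[labels == j]).sum(axis=0)
                      for j in range(k)]) / w.sum()
        Y = np.linalg.solve(A, B)
    return Y, labels
```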

As usual, the EM algorithm described above gives only a locally optimal solution. One can expect that the number of local minima of the energy function U grows as the 'softness' of the elastic graph increases (i.e., as the μkj parameters decrease). Because of this, in order to obtain a solution closer to the global optimum, a softening strategy has been proposed; it is used in the algorithm for estimating the elastic principal manifold.

Algorithm for estimating the elastic principal manifold


1) Define a decreasing set of numbers {m1,…,mp}, mp=1 (for example, {10³, 10², 10, 1}), defining p epochs of softening;
2) Define the base values of the elastic moduli $\lambda_i^{(base)}$ and $\mu_i^{(base)}$;
3) Initialize the positions of the elastic net nodes {y1,..., yk} on the linear principal manifold spanned by the first dim(G) principal components;
4) Set epoch_counter = 1;
5) Set the elastic moduli $\lambda_i = m_{\mathrm{epoch\_counter}}\,\lambda_i^{(base)}$ and $\mu_i = m_{\mathrm{epoch\_counter}}\,\mu_i^{(base)}$;
6) Modify the elastic net using the algorithm for optimisation of the elastic graph;
7) Repeat steps 5-6 for all values of epoch_counter = 2, …, p.

Remark. The values λi and μj are the coefficients of stretching elasticity of every edge E(i) and of bending elasticity of every rib $S_2^{(j)}$. In the simplest case λ1 = λ2 = ... = λs = λ(s), μ1 = μ2 = ... = μr = μ(r), where s and r are the numbers of edges and ribs respectively. The approximate dependence on the graph 'resolution' is given by Gorban & Zinovyev (2007): $\lambda(s) = \lambda_0 \cdot s^{\frac{2-\dim(G)}{\dim(G)}}$, $\mu(r) = \mu_0 \cdot r^{\frac{2-\dim(G)}{\dim(G)}}$. This formula is applicable, of course, only for elastic nets. In the general case λi and μi are often made variable in different parts of the graph according to some adaptation strategy (Gorban & Zinovyev, 2005).
Remark. $U^{\varphi}_E(G)$ penalizes the total length (or, indirectly, 'square', 'volume', etc.) of the constructed manifold and provides regularization of the distances between node positions at the initial steps of the softening. At the final stage of the softening, the λi can be set to zero with little effect on the manifold configuration.
Elastic map post-processing such as map extrapolation can be applied to increase its usability and
avoid the ‘border effect’, for details see (Gorban & Zinovyev, 2008a).


Pluriharmonic Graphs as Ideal Approximators

Approximating datasets by one-dimensional principal curves is not satisfactory in the case of datasets that can be intuitively characterized as branched. A principal object which naturally passes through the 'middle' of such a data distribution should also have branching points, which are missing in the simple structure of principal curves. Introducing such branching points converts principal curves into principal graphs.
Principal graphs were introduced by Kégl & Krzyzak (2002) as a natural extension of one-dimensional principal curves in the context of skeletonisation of hand-written symbols. The most important part of this definition is the form of the penalty imposed on the deviation of the embedded configuration of the branching points from their 'ideal' configurations (end, line, corner, T-, Y- and X-configurations). Assigning types to all vertices serves to define the penalty on the total deviation from the 'ideal' graph configuration (Kégl, 1999). Other types of vertices were not considered, and outside the field of symbol skeletonization the applicability of such a definition of the principal graph remains limited.
Gorban & Zinovyev (2005), Gorban et al. (2007), and Gorban et al. (2008) proposed to use a universal
form of non-linearity penalty for the branching points. The form of this penalty is defined in the previ-
ous chapter for the elastic energy of graph embedment. It naturally generalizes the simplest three-point
second derivative approximation squared:

1
for a 2-star (or rib) the penalty equals || j(S 2( j ) (0)) - (j(S 2( j ) (1)) + j(S 2( j ) (2))) ||2 ,
2
1
for a 3-star it is || j(S 3( j ) (0)) - (j(S 3( j ) (1)) + j(S 3( j ) (2)) + j(S 3( j ) (3))) ||2 , etc.
3
For a k-star this penalty equals zero iff the position of the central node coincides with the mean point of its neighbors. An embedding φ(G) is 'ideal' if all such penalties equal zero. For a primitive elastic graph this means that the embedding is a harmonic function on the graph: its value at each non-terminal vertex is the mean of its values at the closest neighbors of this vertex.
For non-primitive graphs we can consider stars which include not all neighbors of their centers. For example, for a square lattice we create an elastic graph (elastic net) using 2-stars (ribs): all vertical 2-stars and all horizontal 2-stars. For such an elastic net, each non-boundary vertex belongs to two stars. For a general elastic graph G with sets of k-stars Sk we introduce the following notion of a pluriharmonic function.

Definition. A map φ:V→Rm defined on the vertices of G is pluriharmonic iff for any k-star $S_k^{(j)} \in S_k$ with central vertex $S_k^{(j)}(0)$ and neighbouring vertices $S_k^{(j)}(i)$, i = 1...k, the following equality holds:

$\varphi(S_k^{(j)}(0)) = \frac{1}{k}\sum_{i=1}^{k}\varphi(S_k^{(j)}(i))$.  (7)
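Condition (7) is easy to check programmatically; for example (a sketch with our own graph representation):

```python
import numpy as np

def is_pluriharmonic(phi, stars, tol=1e-8):
    """Check condition (7): each star centre sits at the mean of its star's leaves.

    phi   : (|V|, m) array of vertex positions
    stars : list of (centre, [neighbour indices]) pairs
    """
    return all(np.allclose(phi[c], phi[nbrs].mean(axis=0), atol=tol) for c, nbrs in stars)

# A 2-star whose centre is exactly the midpoint of its two leaves is pluriharmonic
phi = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
print(is_pluriharmonic(phi, [(0, [1, 2])]))   # True
```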

Pluriharmonic maps generalize the notion of linear map and of harmonic map, simultaneously. For
example:


1) 1D harmonic functions are linear;


2) If we consider an nD cubic lattice as a primitive graph (with 2n-stars for all non-boundary vertices), then the corresponding pluriharmonic functions are just harmonic ones;
3) If we create from an nD cubic lattice a standard nD elastic net with 2-stars (each non-boundary vertex is the center of n 2-stars, one 2-star for each coordinate direction), then the pluriharmonic functions are linear.

Pluriharmonic functions have many attractive properties, for example, they satisfy the following
maximum principle. A vertex v of an elastic graph is called a corner point or an extreme point of G iff
v is not a centre of any k-star from Sk for all k>0.

Theorem. Let φ:V→Rm be a pluriharmonic map, F be a convex function on Rm, and a = maxx∈VF(φ(x)).
Then there is a corner point v of G such that F(φ(v))=a.
Convex functions achieve their maxima in corner points. Even a particular case of this theorem with
linear functions F is quite useful. Linear functions achieve their maxima and minima in corner points.
In the theory of principal curves and manifolds the penalty functions were introduced to penalise
deviation from linear manifolds (straight lines or planes). We proposed to use pluriharmonic embed-
dings (‘pluriharmonic graphs’) as ‘ideal objects’ instead of manifolds and to introduce penalty (5) for
deviation from this ideal form.

Graph Grammars and Three Types of Complexity for Principal Graphs

Principal graphs can be called data approximators of controllable complexity. By complexity of the
principal objects we mean the following three notions:

1) Geometric complexity: how far a principal object deviates from its ideal configuration; for the
elastic principal graphs we explicitly measure deviation from the ‘ideal’ pluriharmonic graph by
the elastic energy Uφ(G) (3) (this complexity may be considered as a measure of non-linearity);
2) Structural complexity measure: some non-decreasing function of the number of vertices, edges and k-stars of different orders, SC(G)=SC(|V|,|E|,|S2|,…,|Sm|); this function penalises the number of structural elements;
3) Construction complexity: defined with respect to a graph grammar as the number of applications of elementary transformations necessary to construct a given G from the simplest graph (one vertex, zero edges).

The construction complexity is defined with respect to a grammar of elementary transformation. The
graph grammars (Löwe, 1993; Nagl, 1976) provide a well-developed formalism for the description of
elementary transformations. An elastic graph grammar is presented as a set of production (or substitu-
tion) rules. Each rule has a form A → B, where A and B are elastic graphs. When this rule is applied to an
elastic graph, a copy of A is removed from the graph together with all its incident edges and is replaced
with a copy of B with edges that connect B to the graph. For a full description of this language we need
the notion of a labeled graph. Labels are necessary to provide the proper connection between B and the
graph (Nagl, 1976). An approach based on graph grammars to constructing effective approximations
of an elastic principal graph has been proposed recently (Gorban et al, 2007).


Let us define a graph grammar O as a set of graph grammar operations O={o1,..,os}. All possible applications of a graph grammar operation oi to a graph G give a set of transformations of the initial graph oi(G) = {G1, G2, …, Gp}, where p is the number of all possible applications of oi to G. Let us also define a sequence of r different graph grammars $\{O^{(1)} = \{o_1^{(1)},..., o_{s_1}^{(1)}\}, \dots, O^{(r)} = \{o_1^{(r)},..., o_{s_r}^{(r)}\}\}$.
Let us choose a grammar of elementary transformations, predefined boundaries of structural complexity SCmax and construction complexity CCmax, and elasticity coefficients λi and μkj.

Definition. An elastic principal graph for a dataset X is an elastic graph G embedded in the Euclidean space by a map φ:V→Rm such that SC(G) ≤ SCmax, CC(G) ≤ CCmax, and Uφ(G) → min over all possible embeddings of elastic graphs G in Rm.

Algorithm for estimating the elastic principal graph


1) Initialize the elastic graph G by 2 vertices v1 and v2 connected by an edge. The initial map φ is chosen in such a way that φ(v1) and φ(v2) belong to the first principal line and all the data points are projected onto the principal line segment defined by φ(v1), φ(v2);
2) For all j=1…r repeat steps 3-6:
3) Apply all grammar operations from O(j) to G in all possible ways; this gives a collection of candidate graph transformations {G1, G2, …};
4) Separate {G1, G2, …} into permissible and forbidden transformations; a permissible transformation Gk is such that SC(Gk) ≤ SCmax, where SCmax is some predefined structural complexity ceiling;
5) Optimize the embedding φ and calculate the elastic energy Uφ(G) of the graph embedding for every permissible candidate transformation, and choose the graph Gopt that gives the minimal value of the elastic functional: $G_{opt} = \arg\min_{G_k\in \text{permissible set}} U^{\varphi}(G_k)$;
6) Substitute G → Gopt;
7) Repeat steps 2-6 until the set of permissible transformations is empty or the number of operations exceeds a predefined number, the construction complexity.

Principal Trees and Metro Maps

Let us construct the simplest non-trivial type of the principal graphs, called principal trees. For this
purpose let us introduce a simple ‘Add a node, bisect an edge’ graph grammar (see Fig. 2) applied for
the class of primitive elastic graphs.

Definition. A principal tree is an acyclic primitive elastic principal graph.

Definition.‘Remove a leaf, remove an edge’ graph grammar O(shrink) applicable for the class of primitive
elastic graphs consists of two operations: 1) The transformation ‘remove a leaf’ can be applied to any
vertex v of G with connectivity degree equal to 1: remove v and remove the edge (v,v’) connecting v
to the tree; 2) The transformation ‘remove an edge’ is applicable to any pair of graph vertices v, v’ con-
nected by an edge (v, v’): delete edge (v, v’), delete vertex v’, merge the k-stars for which v and v’ are
the central nodes and make a new k-star for which v is the central node with a set of neighbors which is
the union of the neighbors from the k-stars of v and v’.


Figure 2. Illustration of the simple "add a node to a node" or "bisect an edge" graph grammar. a) We start with a simple 2-star, from which one can generate the three distinct graphs shown. The "Op1" operation is adding a node to a node; operations "Op2" and "Op3" are edge bisections (here they are topologically equivalent to adding a node to a terminal node of the initial 2-star). For illustration let us suppose that the "Op2" operation gives the biggest elastic energy decrement, thus it is the "optimal" operation. b) From the graph obtained one can generate 5 distinct graphs and choose the optimal one. c) The process is continued until a definite number of nodes has been inserted.

Definition.‘Add a node, bisect an edge’ graph grammar O(grow) applicable for the class of primitive elastic
graphs consists of two operations: 1) The transformation “add a node” can be applied to any vertex v
of G: add a new node z and a new edge (v, z); 2) The transformation “bisect an edge” is applicable to
any pair of graph vertices v, v’ connected by an edge (v, v’): delete edge (v, v’), add a vertex z and two
edges, (v, z) and (z, v’). The transformation of the elastic structure (change in the star list) is induced by
the change of topology, because the elastic graph is primitive. Consecutive application of the operations
from this grammar generates trees, i.e. graphs without cycles.
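
As a concrete illustration of how such grammar operations generate candidate graphs, here is a minimal Python sketch using our own representation (a tree stored as a set of edges over integer vertex labels); the function names are ours and not part of any published implementation.

def add_node_candidates(edges, n_vertices):
    # "add a node": attach a new vertex z to every existing vertex v
    z = n_vertices
    return [(edges | {frozenset((v, z))}, n_vertices + 1) for v in range(n_vertices)]

def bisect_edge_candidates(edges, n_vertices):
    # "bisect an edge": replace (v, v') by the pair (v, z), (z, v')
    z = n_vertices
    out = []
    for e in edges:
        v, w = tuple(e)
        out.append(((edges - {e}) | {frozenset((v, z)), frozenset((z, w))},
                    n_vertices + 1))
    return out

# One application of O(grow) to the 2-star 0-1, 1-2 yields 3 "add a node"
# plus 2 "bisect an edge" raw candidates (some of them are topologically
# equivalent, cf. Figure 2).
edges = {frozenset((0, 1)), frozenset((1, 2))}
candidates = add_node_candidates(edges, 3) + bisect_edge_candidates(edges, 3)
print(len(candidates))   # 5
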
We should also define the structural complexity measure SC(G) = SC(|V|,|E|,|S2|,…,|Sm|). Its concrete
form depends on the application field. Here are some simple examples:

1) SC(G) = |V|: i.e., the graph is considered more complex if it has more vertices;
2) SC(G) = |S3|, if |S3| ≤ bmax and Σk=4..m |Sk| = 0, and SC(G) = ∞ otherwise;
i.e., only bmax simple branches (3-stars) are allowed in the principal tree.
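
A small Python sketch of the second measure, using the same edge-set representation as in the previous sketch; it assumes a primitive elastic graph, so that a vertex of degree k ≥ 3 is taken as the centre of a k-star.

from collections import Counter
from math import inf

def structural_complexity(edges, b_max=3):
    # |S3| = number of degree-3 vertices; any vertex of degree >= 4 is forbidden
    degrees = Counter(v for e in edges for v in e)
    stars = Counter(d for d in degrees.values() if d >= 3)
    if any(k >= 4 for k in stars):          # a star of order 4 or higher exists
        return inf
    return stars[3] if stars[3] <= b_max else inf

# The 3-star with centre 1 has SC = 1: one simple branch, no larger stars.
print(structural_complexity({frozenset((1, 0)), frozenset((1, 2)), frozenset((1, 3))}))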

Using the sequence {O(grow), O(grow), O(shrink)} in the above-described algorithm for estimating the elastic
principal graph gives an approximation to the principal trees. Introducing the ‘tree trimming’ grammar
O(shrink) allows one to produce principal trees closer to the global optimum, trimming excessive tree branching
and fusing k-stars separated by small ‘bridges’.
Principal trees can have applications in data visualization. A principal tree is embedded into a multidi-
mensional data space. It approximates the data so that one can project points from the multidimensional
space into the closest node of the tree. The tree by its construction is a one-dimensional object, so this
projection performs dimension reduction of the multidimensional data. The question is how to produce a
planar tree layout. Of course, there are many ways to lay out a tree on a plane without edge intersections,
but it would be useful if both local tree properties and global distance relations were represented
by the layout. We can require that

1) In the two-dimensional layout, all k-stars should be represented equiangularly; this is the smallest-
penalty configuration;
2) The edge lengths should be proportional to their lengths in the multidimensional embedding; thus
one can represent between-node distances.

This defines a tree layout up to global rotation and scaling and also up to changing the order of
leaves in every k-star. We can change this order to eliminate edge intersections, but the result cannot be
guaranteed. In order to represent the global distance structure, it was found (Gorban et al., 2008) that
a good approximation for the order of k-star leaves can be taken from the projection of every k-star on
the linear principal plane calculated for all data points, or on the local principal plane in the vicinity of
the k-star, calculated only for the points close to this star. The resulting layout can be further optimized
using some greedy optimization methods.
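
The following Python sketch (our own illustration, with hypothetical function names) produces such a layout: it draws every k-star equiangularly and gives each planar edge the length of the corresponding edge in the multidimensional embedding; as discussed above, it makes no attempt to remove edge intersections or to optimize the order of leaves.

import numpy as np

def metro_map_layout(edges, positions, root=0):
    # 'positions' maps every tree node to its point in the data space;
    # every k-star is drawn equiangularly and each 2D edge keeps the length
    # of the corresponding edge in the multidimensional embedding.
    adj = {}
    for e in edges:
        v, w = tuple(e)
        adj.setdefault(v, []).append(w)
        adj.setdefault(w, []).append(v)
    layout = {root: np.zeros(2)}

    def place(node, parent, incoming_angle):
        k = len(adj[node])                        # size of the k-star at this node
        children = [n for n in adj[node] if n != parent]
        for i, n in enumerate(children, start=1):
            angle = incoming_angle + 2 * np.pi * i / k
            length = np.linalg.norm(positions[n] - positions[node])
            layout[n] = layout[node] + length * np.array([np.cos(angle), np.sin(angle)])
            place(n, node, angle + np.pi)         # the edge enters n from the opposite side

    place(root, None, 0.0)
    return layout

# Example: a 3-star embedded in R^3 is drawn with three 120-degree sectors.
edges = {frozenset((0, 1)), frozenset((0, 2)), frozenset((0, 3))}
pos = {0: np.zeros(3), 1: np.array([1.0, 0, 0]), 2: np.array([0, 2.0, 0]), 3: np.array([0, 0, 1.0])}
print(metro_map_layout(edges, pos))
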
The point projections are then represented as pie diagrams, where the size of the diagram reflects
the number of points projected into the corresponding tree node. The sectors of the diagram allow us to
show the proportions of points of different classes projected into the node (see the example in Figure 3).
This data display was called a “metro map” since it is a schematic and “idealized” representation of
the tree and the data distribution with inevitable distortions made to produce a 2D layout. However, us-
ing this map one can still estimate the distance from one point (tree node) to another, passing through other
points. This map is inherently unrooted (like a real metro map). It is useful to compare this metaphor with
trees produced by hierarchical clustering where the metaphor is closer to a “genealogy tree”.

Principal Cubic Complexes

The elastic nets introduced above are characterized by their internal dimension dim(G). The way to general-
ize this characteristic to other elastic graphs is to utilize the notion of a cubic complex (Gorban et al.,
2007).

Definition. An elastic cubic complex K of internal dimension r is a Cartesian product G1×…×Gr of elastic
graphs G1, …, Gr. It has the vertex set V1×…×Vr. Let 1 ≤ i ≤ r and vj ∈ Vj (j ≠ i). For this set of ver-
tices, {vj}j≠i, a copy of Gi in G1×…×Gr is defined with vertices (v1, …, vi−1, v, vi+1, …, vr) (v ∈ Vi), edges
((v1, …, vi−1, v, vi+1, …, vr), (v1, …, vi−1, v’, vi+1, …, vr)), (v, v’) ∈ Ei, and, similarly, k-stars of the form (v1,
…, vi−1, Sk, vi+1, …, vr), where Sk is a k-star in Gi. For any Gi there are ∏j≠i |Vj| copies of Gi in G. The sets of
edges and k-stars of the Cartesian product are the unions of these sets over all copies of all factors. A map φ:
V1×…×Vr → Rm maps all the copies of the factors into Rm too.
Remark. By construction, the energy of the elastic graph product is the sum of the energies of all factor
copies. It is, of course, a quadratic functional of φ.

Figure 3. Principal manifold and principal tree for the Iris dataset. a) View of the principal manifold
projected on the first two principal components; the data points are shown projected into the closest
vertex of the elastic net; b) visualization of the data points in the internal coordinates, where the classes
are represented in the form of Hinton diagrams: the size of a diagram is proportional to the number of
points projected, and its shape denotes one of the three point classes; c) same as a), but the data points
are shown projected into the closest point of the piecewise linearly interpolated elastic map; d) same as
b), but based on the projection shown in c); e)-g) the first 50 iterations of the principal tree algorithm, with
the tree shown projected onto the principal plane; h) metro map representation of the Iris dataset.

If we approximate multidimensional data by an r-dimensional object, the number of points (or, more
generally, elements) in this object grows exponentially with r. This is an obstacle for grammar-based
algorithms even for modest r, because to analyse the applications of a rule A → B we should investigate
all isomorphic copies of A in G. Introduction of a cubic complex is a useful factorization of the principal
object which allows us to avoid this problem.
The only difference between the construction of general elastic graphs and factorized graphs is in the
application of the transformations. For factorized graphs, we apply them to the factors. This approach signifi-
cantly reduces the number of trials in the selection of the optimal application. The simple grammar with two
rules, “add a node to a node, or bisect an edge,” is also powerful here; it produces products of primitive
elastic trees. For such a product, the elastic structure is defined by the topology of the factors.
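
A minimal Python sketch of the Cartesian product construction, using the edge-set representation from the earlier sketches (the induced k-stars of the product are not enumerated here):

from itertools import product

def graph_product(vertices1, edges1, vertices2, edges2):
    # vertices of G1 x G2 are pairs; an edge changes exactly one coordinate
    # along an edge of the corresponding factor
    vertices = set(product(vertices1, vertices2))
    edges = set()
    for (v, w) in edges1:                 # a copy of G1 for every vertex of G2
        for u in vertices2:
            edges.add(frozenset(((v, u), (w, u))))
    for (v, w) in edges2:                 # a copy of G2 for every vertex of G1
        for u in vertices1:
            edges.add(frozenset(((u, v), (u, w))))
    return vertices, edges

# The product of two 3-chains is a 3x3 grid, i.e. an elastic net of internal
# dimension 2: 9 vertices and 12 edges.
V, E = [0, 1, 2], [(0, 1), (1, 2)]
grid_vertices, grid_edges = graph_product(V, E, V, E)
print(len(grid_vertices), len(grid_edges))   # 9 12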

Incomplete Data

Some of the methods described above allow us to use incomplete data in a natural way. Let us represent
an incomplete observation by x = (x1, ..., @, ..., @, ..., xm), where the ‘@’ symbol denotes a missing
value.

Definition. The scalar product between two incomplete observations x and y is (x, y) = Σi≠@ xi yi, where the
sum runs over the coordinates that are not missing in either observation. Then the Euclidean distance is
(x − y)² = Σi≠@ (xi − yi)².

Remark. This definition has a very natural geometrical interpretation: an incomplete observation
with k missing values is represented by a k–dimensional linear manifold Lk, parallel to k coordinate axes
corresponding to the missing vector components.
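
A short Python sketch of this definition, with np.nan playing the role of the ‘@’ symbol:

import numpy as np

def scalar_product_incomplete(x, y):
    # sum only over coordinates present in both observations
    mask = ~np.isnan(x) & ~np.isnan(y)
    return float(np.sum(x[mask] * y[mask]))

def squared_distance_incomplete(x, y):
    mask = ~np.isnan(x) & ~np.isnan(y)
    return float(np.sum((x[mask] - y[mask]) ** 2))

x = np.array([1.0, np.nan, 2.0, 0.5])
y = np.array([2.0, 1.0, np.nan, 1.0])
print(scalar_product_incomplete(x, y))     # 1*2 + 0.5*1 = 2.5
print(squared_distance_incomplete(x, y))   # (1-2)^2 + (0.5-1)^2 = 1.25
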
Thus, any method which uses only scalar products and/or Euclidean distances can be applied to
incomplete data with some minimal modifications, provided the missing values are randomly and not too
densely distributed in X. For example, an iterative SVD method for incomplete data matrices was devel-
oped (Roweis, 1998; Gorban & Rossiev, 1999).
There are, of course, other approaches to incomplete data in unsupervised learning (for example,
those presented by Little & Rubin (1987)).

Implicit Methods

Most of the principal objects introduced in this chapter are constructed as explicit geometrical objects
embedded in Rm, to which we can calculate the distance from any object in X. In this way, they generalize
the “data approximation”-based (#1) and the “variation-maximization”-based (#2) definitions of linear
PCA. There also exists a whole family of methods, which we only briefly mention here, that general-
ize the “distance distortion minimization” definition of PCA (#3).
First, some methods take as input a pairwise distance (or, more generally, dissimilarity) matrix D
and construct such a configuration of points in a low-dimensional Euclidean space that the distance
matrix D’ in this space reproduces D with maximal precision. The most fundamental in this series is
metric multidimensional scaling (Kruskal, 1964). The next is the Kernel PCA approach (Schölkopf et
al., 1997), which takes advantage of the fact that for the linear PCA algorithm one needs only the matrix
of pairwise scalar products (the Gram matrix) but not the explicit values of the coordinates of X. It allows
one to apply the kernel trick (Aizerman et al., 1964) and to substitute the Gram matrix by the scalar products
calculated with the use of some kernel function. The Kernel PCA method is tightly related to classical
multidimensional scaling (Williams, 2002).
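
The following compact numpy sketch builds Kernel PCA from the double-centred Gram matrix only; the Gaussian kernel and its parameter are chosen purely for illustration:

import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    # Gram matrix of kernel values (Gaussian kernel as an example)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq)
    # double-centre the Gram matrix (centering in the feature space)
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J
    # projections of the data onto the leading kernel principal components
    eigval, eigvec = np.linalg.eigh(Kc)
    order = np.argsort(eigval)[::-1][:n_components]
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

X = np.random.RandomState(0).normal(size=(100, 5))
print(kernel_pca(X).shape)   # (100, 2)
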
Locally Linear Embedding, or LLE (Roweis & Saul, 2000), searches for an N×N matrix A that ap-
proximates each given xi by a linear combination of n vector-neighbours of xi:
Σi=1..N || xi − Σk=1..N Aki xk ||² → min,
where Aki ≠ 0 only if xk is one of the n vectors closest to xi. Afterwards, one constructs such a configuration
of points in Rs, s << m, that yi = Σk=1..N Aki yk, yi ∈ Rs, for all i = 1…N. The coordinates of such an embedding
are given by the eigenvectors of the matrix (1−A)T(1−A).
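
If a ready-made implementation is preferred, LLE is available, for instance, in scikit-learn (assuming the library is installed); the parameter names below are scikit-learn's, not the notation of this chapter:

import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.RandomState(0).normal(size=(200, 10))
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)   # N x 2 embedding of the data
print(Y.shape)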


ISOMAP (Tenenbaum et al., 2000) and Laplacian eigenmap (Belkin & Niyogi, 2003; Nadler et al.,
2008) methods start with the construction of the neighbourhood graph, i.e. the graph in which data points
that are close in some sense are connected by (weighted) edges. This weighted graph can be represented
in the form of a weighted adjacency matrix W = {Wij}. From this graph, ISOMAP constructs a new distance
matrix D(ISOMAP), based on the path lengths between two points in the neighbourhood graph, and the mul-
tidimensional scaling is applied to D(ISOMAP). The Laplacian eigenmap solves the eigenproblem Lfλ = λSfλ,
where S = diag{Σj W0j, …, Σj WNj} and L = S − W is the Laplacian matrix. The trivial constant solu-
tion corresponding to the smallest eigenvalue λ0 = 0 is discarded, while the elements of the eigenvec-
tors fλ1, fλ2, …, fλs, where λ1 < λ2 < ... < λs, give the s-dimensional projection of xi, i.e.
P(xi) = {fλ1(i), fλ2(i), …, fλs(i)}.
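
A self-contained numpy/scipy sketch of the Laplacian eigenmap under simple illustrative choices (a k-nearest-neighbour graph with heat-kernel weights); it solves the generalized eigenproblem written above and discards the constant mode:

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_neighbors=10, n_components=2, t=1.0):
    N = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((N, N))
    idx = np.argsort(sq, axis=1)[:, 1:n_neighbors + 1]   # skip the point itself
    for i in range(N):
        W[i, idx[i]] = np.exp(-sq[i, idx[i]] / t)        # heat-kernel weights
    W = np.maximum(W, W.T)                               # symmetrize the graph
    S = np.diag(W.sum(axis=1))
    L = S - W                                            # graph Laplacian
    eigval, eigvec = eigh(L, S)                          # L f = lambda S f
    return eigvec[:, 1:n_components + 1]                 # discard the constant mode

X = np.random.RandomState(0).normal(size=(150, 5))
print(laplacian_eigenmap(X).shape)   # (150, 2)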

Finally, one can implicitly construct projections into lower-dimensional spaces by training auto-
associative neural networks with a narrow hidden layer. An overview of the existing Neural PCA methods
can be found in the recent collection of review papers (Gorban et al., 2008).

Example: Principal Objects for the Iris Dataset

In Figure 3 we show the application of the elastic principal manifold and principal tree algorithms to the
standard Iris dataset (Fisher, 1936). As expected, the two-dimensional approximation by the principal mani-
fold is in this case close to the linear principal plane. One can also see that the principal tree illustrates
well the almost complete separation of the classes in the data space.

Example: Principal Objects for Molecular Surfaces

A molecular surface defines the effective region of space which is occupied by a molecule. For example,
the Van-der-Waals molecular surface is formed by surrounding every atom in the molecule by a sphere
of radius equal to the characteristic radius of the Van-der-Waals force. After all the interior points are
eliminated, this forms a complicated non-smooth surface in 3D. In practice, this surface is sampled by
a finite number of points.
Using the principal manifold methodology, we constructed a smooth approximation of such a molecular
surface for a small piece of a DNA molecule (several nucleotides long). First, we made an
approximation of this dataset by a 1D principal curve. Interestingly, this curve followed the backbone of
the molecule, forming a helix (see Figure 4). Second, we approximated the molecular surface by a 2D
manifold. The topology of the surface is expected to be spherical, so we applied spherical topology of
the elastic net for optimisation.
We should notice that since it is impossible to make the lengths of all edges equal for the spherical
grid, corrections were made to the edge elasticities during the grid initialization (shorter edges
are given larger λi). Third, we applied the method for constructing principal cubic complexes, namely, the
graph product of principal trees, which produced a somewhat trivial construction (because no branching
was energetically optimal): a product of two short elastic principal curves, forming a double helix.

Example: Principal Objects Decipher Genome

A dataset X can be constructed for a string sequence using a short word frequency dictionary approach
in the following way: 1) the notion of a word is defined; 2) the set of all possible short words is defined;
let us say that we have m of them; 3) a number N of text fragments of a certain width is sampled from
the text; 4) in each fragment the frequency of occurrence of all possible short words is calculated and,
thus, each fragment is represented as a vector in the multidimensional space Rm. The whole text is then
represented as a dataset of N vectors embedded in Rm.
We systematically applied this approach to available bacterial genomic sequences (Gorban & Zinovyev,
2008b). In our case we defined: 1) a word is a sequence of three letters from the {A,C,G,T} alphabet
(triplet); 2) evidently, there are 64 possible triplets in the {A,C,G,T} alphabet; 3) we sampled 5000-10000
fragments of width 300 from a genomic sequence; 4) we calculated the frequencies of non-overlapping
triplets for every fragment.
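
The following Python sketch illustrates the construction just described; the random sequence is only a stand-in for a real genomic sequence, and all names are ours:

from itertools import product
import random

triplets = [''.join(p) for p in product('ACGT', repeat=3)]   # the 64 possible words
index = {w: i for i, w in enumerate(triplets)}

def triplet_frequencies(fragment):
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):                 # non-overlapping triplets
        counts[index[fragment[i:i + 3]]] += 1
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)
genome = ''.join(rng.choice('ACGT') for _ in range(100000))  # stand-in for a real genome
starts = (rng.randrange(len(genome) - 300) for _ in range(5000))
X = [triplet_frequencies(genome[p:p + 300]) for p in starts] # 5000 x 64 dataset
print(len(X), len(X[0]))
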
The constructed datasets are interesting objects for data mining, because 1) they have a non-trivial
cluster structure which usually contains various configurations of 7 clusters (see Figure 5); 2) class labels
can be assigned to points according to the available genome annotations; in our case we used information
about the presence (in one of six possible frameshifts) or absence of coding information in the current
position of a genome; 3) using data mining techniques here has immediate applications in the field of
automatic gene recognition and in others; see, for example, (Carbone et al., 2003). In Figure 5 we show the
application of both classical PCA and the metro map methods for several bacterial genomes. See the
http://www.ihes.fr/~zinovyev/7clusters website for further information.

Figure 4. Principal objects approximating the molecular surface of a short stretch of DNA molecule. a)
stick-and-balls model of the DNA stretch and the initial molecular surface (black points); b) one- and
two-dimensional (spherical) principal manifolds for the molecular surface; c) simple principal cubic
complex (product of principal trees) which does not have any branching in this case.

Figure 5. Seven cluster structures presented for 4 selected genomes. A genome is represented as a col-
lection of points (text fragments represented by their triplet frequencies) in the 64-dimensional
space. Color codes denote point classes corresponding to 6 possible frameshifts when a random frag-
ment overlaps with a coding gene (3 in the forward and 3 in the backward direction of the gene), and
the black color corresponds to non-coding regions. For every genome a principal tree (“metro map”
layout) is shown together with the 2D PCA projection of the data distribution. Note that the clusters that
appear to be mixed on the PCA plot for Escherichia coli (they remain mixed in 3D PCA as well) are
well separated on the “metro map”. This proves that they are well separated in R64.

Example: Non-Linear Principal Manifolds for Microarray Data Visualization

DNA microarray data is a rich source of information for molecular biology (an expository overview is
provided by Leung & Cavalieri (2003)). This technology has found numerous applications in understand-
ing various biological processes including cancer. It allows one to screen simultaneously the expression of
all genes in a cell exposed to some specific conditions (for example, stress, cancer, treatment, normal
conditions). Obtaining a sufficient number of observations (chips), one can construct a table of “samples
vs genes”, containing the logarithms of the expression levels of typically several thousands (n) of genes in
typically several tens (m) of samples.
In Figure 6 we provide a comparison of data visualization scatters after projection of the breast
cancer dataset, provided by Wang et al. (2005), onto the linear and the non-linear two-dimensional
principal manifolds. The latter is constructed by the elastic maps approach. Each point here represents
a patient treated for cancer. Before dimension reduction it is represented as a vector in Rn, containing
the expression values of all n genes in the tumor sample. Linear and non-linear 2D principal manifolds
provide mappings Rn → R2, drastically reducing the vector dimension and allowing data visualization.
The form, the shape and the size of the points in Fig. 6 represent various clinical data (class labels)
extracted from the patients' disease records.

Figure 6. Visualization of the breast cancer microarray dataset using elastic maps. Ab initio classifications
are shown using point size (ER, estrogen receptor status), shape (Group A – patients with aggressive
cancer, Group B – patients with non-aggressive cancer) and color (TYPE, molecular type of breast can-
cer). a) Configuration of nodes projected into the three-dimensional principal linear manifold. One clear
feature is that the dataset is curved, such that it cannot be mapped adequately onto a two-dimensional
principal plane. b) The distribution of points in the internal non-linear manifold coordinates is shown
together with an estimation of the two-dimensional density of points. c) The same as b) but for the linear
two-dimensional manifold. One can notice that the “basal” breast cancer subtype is much better sepa-
rated on the non-linear mapping and some features of the distribution become better resolved.

Practical experience from bioinformatics studies shows that two-dimensional data visualization
using non-linear projections allows one to catch more signals from the data (in the form of clusters or specific
regions of higher point density) than linear projections; see Figure 6 and a good example by Ivakhno
& Armstrong (2007).
In addition to that, Gorban & Zinovyev (2008a) performed a systematic comparison of the performance
of low-dimensional linear and non-linear principal manifolds for microarray data visualization, using the
following four criteria: 1) mean-square distance error; 2) distortions in mapping large distances between
points; 3) local point neighbourhood preservation; 4) compactness of point class labels after projection.
It was demonstrated that non-linear two-dimensional principal manifolds provide systematically better
results according to all these criteria, achieving the performance of three- and four-dimensional linear
principal manifolds (principal components).
The interactive ViMiDa (Visualization of Microarray Data) and ViDaExpert software tools, allowing microar-
ray data visualization with the use of non-linear principal manifolds, are available on the website of the
Institut Curie (Paris): http://bioinfo.curie.fr/projects/vidaexpert and http://bioinfo.curie.fr/projects/vimida.

Conclusion

In this chapter we gave a brief practical introduction to the methods of construction of principal objects,
i.e. objects embedded in the ‘middle’ of a multidimensional data set. As a basis, we took the unifying
framework of mean squared distance approximation of finite datasets, which allowed us to look at
principal graphs and manifolds as generalizations of the notion of the mean point.

References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical foundations of the potential function
method in pattern recognition learning. Automation and Remote Control, 25, 821–837.
Aluja-Banet, T., & Nonell-Torrent, R. (1991). Local principal component analysis. Qüestiió, 3, 267-278.
Arthur, D., & Vassilvitskii, S. (2007). K-means++ the advantages of careful seeding. In Proceedings of
the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, LA,
USA (pp. 1027-1035).
Bau, D., III, & Trefethen, L. N. (1997). Numerical linear algebra. Philadelphia, PA: SIAM Press.
Bauer, F. L. (1957). Das Verfahren der Treppeniteration und verwandte Verfahren zur Lösung algebraischer
Eigenwertprobleme. Z. Angew. Math. Phys., 8, 214–235.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representa-
tion. Neural Computation, 15(6), 1373–1396. doi:10.1162/089976603321780317
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping.
Neural Computation, 10(1), 215–234. doi:10.1162/089976698300017953

Braverman, E. M. (1970). Methods of extremal grouping of parameters and problem of apportionment
of essential factors. Automation and Remote Control, (1), 108-116.
Cochran, R. N., & Horne, F. H. (1977). Statistically weighted principal component analysis of rapid scan-
ning wavelength kinetics experiments. Analytical Chemistry, 49, 846–853. doi:10.1021/ac50014a045
Delicado, P. (2001). Another look at principal curves and surfaces. Journal of Multivariate Analysis, 77,
84–116. doi:10.1006/jmva.2000.1917
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society. Series B. Methodological, 39(1), 1–38.
Diday, E. (1979). Optimisation en classification automatique, Tome 1,2 (in French). Rocquencourt,
France: INRIA.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an
elastic net method. Nature, 326, 689–691. doi:10.1038/326689a0
Efron, B. (1967). The two sample problem with censored data. In Proc. of the Fifth Berkeley Symp. Math.
Statist. Probab. 4 (pp. 831-853). Berkeley, CA: Univ.California Press.
Einbeck, J., Evers, L., & Bailer-Jones, C. (2008). Representing complex data using localized principal
components with application to astronomical data. In A. Gorban, B. Kégl, D. Wunsch, & A. Zinovyev
(Eds.), Principal manifolds for data visualization and dimension reduction (pp. 178-201). Berlin, Ger-
many: Springer.
Einbeck, J., Tutz, G., & Evers, L. (2005). Local principal curves. Statistics and Computing, 15, 301–313.
doi:10.1007/s11222-005-4073-8
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence proper-
ties and energy functions. Biological Cybernetics, 67, 47–55. doi:10.1007/BF00201801
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics,
7, 179–188.
Flury, B. (1990). Principal points. Biometrika, 77, 33–41. doi:10.1093/biomet/77.1.33
Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst.
H. Poincaré, 10, 215–310.
Fukunaga, K., & Olsen, D. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE
Transactions on Computers, 20(2), 176–183. doi:10.1109/T-C.1971.223208
Gabriel, K. R., & Zamir, S. (1979). Lower rank approximation of matrices by least squares with any
choices of weights. Technometrics, 21(4), 489–498. doi:10.2307/1268288
Gorban, A., Kégl, B., Wunsch, D., & Zinovyev, A. (Eds.). (2008). Principal manifolds for data visual-
ization and dimension reduction. Berlin, Germany: Springer.
Gorban, A., Sumner, N., & Zinovyev, A. (2007). Topological grammars for data approximation. Applied
Mathematics Letters, 20(4), 382–386. doi:10.1016/j.aml.2006.04.022

Gorban, A., Sumner, N. R., & Zinovyev, A. (2008). Beyond the concept of manifolds: Principal trees,
metro maps, and elastic cubic complexes. In A. Gorban, B. Kégl, D. Wunsch, & A. Zinovyev (Eds.),
Principal manifolds for data visualization and dimension reduction (pp. 219-237). Berlin, Germany:
Springer.
Gorban, A., & Zinovyev, A. (2005). Elastic principal graphs and manifolds and their practical applica-
tions. Computing, 75, 359–379. doi:10.1007/s00607-005-0122-6
Gorban, A., & Zinovyev, A. (2008a). Elastic maps and nets for approximating principal manifolds and
their application to microarray data visualization. In A. Gorban, B. Kégl, D. Wunsch, & A. Zinovyev
(Eds.), Principal manifolds for data visualization and dimension reduction (pp. 96-130). Berlin, Ger-
many: Springer.
Gorban, A., & Zinovyev, A. (2008b). PCA and K-means decipher genome. In A. Gorban, B. Kégl, D.
Wunsch, & A. Zinovyev (Eds.), Principal manifolds for data visualization and dimension reduction (pp.
309-323). Berlin, Germany: Springer.
Gorban, A. N., Pitenko, A. A., Zinov’ev, A. Y., & Wunsch, D. C. (2001). Vizualization of any data using
elastic map method. Smart Engineering System Design, 11, 363–368.
Gorban, A. N., & Rossiev, A. A. (1999). Neural network iterative method of principal curves for data
with gaps. Journal of Computer and Systems Sciences International, 38(5), 825–830.
Gorban, A. N., Zinovyev, A. Y., & Wunsch, D. C. (2003). Application of the method of elastic maps
in analysis of genetic texts. In Proceedings of the International Joint Conference on Neural Networks,
IJCNN, Portland, Oregon.
Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley.
Hastie, T. (1984). Principal curves and surfaces. Unpublished doctoral dissertation, Stanford University,
CA.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal
of Educational Psychology, 24, 417–441. doi:10.1037/h0071325
Ivakhno, S., & Armstrong, J. D. (2007). Non-linear dimensionality reduction of signalling networks.
BMC Systems Biology, 1(27).
Jolliffe, I. T. (2002). Principal component analysis, series: Springer series in statistics, 2nd ed., XXIX.
New York: Springer.
Kambhatla, N., & Leen, T. K. (1997). Dimension reduction by local PCA. Neural Computation, 9,
1493–1516. doi:10.1162/neco.1997.9.7.1493
Karhunen, K. (1946). Zur Spektraltheorie stochastischer Prozesse. Suomalainen Tiedeakatemia Toimi-
tuksia. Sar. A.4: Biologica, 37.
Kégl, B. (1999). Principal curves: Learning, design, and applications. Unpublished doctoral disserta-
tion, Concordia University, Canada.

Kégl, B., & Krzyzak, A. (2002). Piecewise linear skeletonization using principal curves. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 24(1), 59–74. doi:10.1109/34.982884
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cyber-
netics, 43, 59–69. doi:10.1007/BF00337288
Kohonen, T. (1997). Self-organizing maps. Berlin, Germany: Springer.
Koren, Y., & Carmel, L. (2004). Robust linear dimensionality reduction. IEEE Transactions on Visual-
ization and Computer Graphics, 10(4), 459–470. doi:10.1109/TVCG.2004.17
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis.
Psychometrika, 29, 1–27. doi:10.1007/BF02289565
Leung, Y. F., & Cavalieri, D. (2003). Fundamentals of cDNA microarray data analysis. Trends in Genet-
ics, 19(11), 649–659. doi:10.1016/j.tig.2003.09.015
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley.
Liu, Z.-Y., Chiu, K.-C., & Xu, L. (2003). Improved system for object detection and star/galaxy classifica-
tion via local subspace analysis. Neural Networks, 16, 437–451. doi:10.1016/S0893-6080(03)00015-7
Lloyd, S. (1957). Least squares quantization in PCM (Bell Telephone Laboratories Paper).
Loève, M. M. (1955). Probability theory. Princeton, NJ: VanNostrand.
Löwe, M. (1993). Algebraic approach to single–pushout graph transformation. Theoretical Computer
Science, 109, 181–224. doi:10.1016/0304-3975(93)90068-5
Lumley, J. L. (1967). The structure of inhomogeneous turbulent flows. In A. M. Yaglom & V. I. Tatarski
(Eds.), Atmospheric turbulence and radio propagation (pp. 166-178). Moscow, Russia: Nauka.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In
Proc. of the 5th Berkeley Symp. on Math. Statistics and Probability (pp. 281-297).
Mirkin, B. (2005). Clustering for data mining: A data recovery approach. Boca Raton, FL: Chapman
and Hall.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural
Computation, 7, 1165–1177. doi:10.1162/neco.1995.7.6.1165
Nadler, B., Lafon, S., Coifman, R., & Kevrekidis, I. G. (2008). Diffusion maps - a probabilistic inter-
pretation for spectral embedding and clustering algorithms. In A. Gorban, B. Kégl, D. Wunsch, & A.
Zinovyev (Eds.), Principal manifolds for data visualization and dimension reduction (pp. 238-260).
Berlin, Germany: Springer.
Nagl, M. (1976). Formal languages of labelled graphs. Computing, 16, 113–137. doi:10.1007/
BF02241984
Ostrovsky, R., Rabani, Y., Schulman, L. J., & Swamy, C. (2006). The effectiveness of Lloyd-type meth-
ods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium on Foundations of
Computer Science (pp. 165-176). Washington, DC: IEEE Computer Society.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Maga-
zine, 6(2), 559–572.
Pelleg, D., & Moore, A. (1999). Accelerating exact k-means algorithms with geometric reasoning. In
Proceedings of the Fifth International Conference on Knowledge Discovery in Databases (pp. 277-281).
Menlo Park, CA: AAAI Press.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An
introduction. Reading, MA: Addison-Wesley.
Roweis, S. (1998). EM algorithms for PCA and SPCA. In Advances in neural information processing
systems (pp. 626-632). Cambridge, MA: MIT Press.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science,
290, 2323–2326. doi:10.1126/science.290.5500.2323
Schölkopf, B., Smola, A. J., & Müller, K.-R. (1997). Kernel principal component analysis. In Proceed-
ings of the ICANN (pp. 583-588).
Smola, A. J., Mika, S., Schölkopf, B., & Williamson, R. C. (2001). Regularized principal manifolds.
Journal of Machine Learning Research, 1, 179–209. doi:10.1162/15324430152748227
Steinhaus, H. (1956). Sur la division des corps matériels en parties. In Bull. Acad. Polon. Sci., Cl. III,
Vol. IV (pp. 801-804).
Strang, G. (1993). The fundamental theorem of linear algebra. The American Mathematical Monthly,
100(9), 848–855. doi:10.2307/2324660
Sylvester, J. J. (1889). On the reduction of a bilinear quantic of the nth order to the form of a sum of n
products by a double orthogonal substitution. Messenger of Mathematics, 19, 42–46.
Tarpey, T., & Flury, B. (1996). Self-consistency: A fundamental concept in statistics. Statistical Science,
11(3), 229–243. doi:10.1214/ss/1032280215
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear
dimensionality reduction. Science, 290, 2319–2323. doi:10.1126/science.290.5500.2319
Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing, 2, 183–190. doi:10.1007/
BF01889678
Verbeek, J. J., Vlassis, N., & Kröse, B. (2002). A k-segments algorithm for finding principal curves.
Pattern Recognition Letters, 23(8), 1009–1017. doi:10.1016/S0167-8655(02)00032-6
Wang, Y., Klijn, J. G., & Zhang, Y. (2005). Gene expression profiles to predict distant metastasis of
lymph-node-negative primary breast cancer. Lancet, 365, 671–679.
Williams, C. K. I. (2002). On a connection between kernel PCA and metric multidimensional scaling.
Machine Learning, 46, 11–19. doi:10.1023/A:1012485807823
Xu, R., & Wunsch, D. (2008). Clustering. New York: IEEE Press / John Wiley & Sons.

Yin, H. (2008). Learning nonlinear principal manifolds by self-organising maps. In A. Gorban, B. Kégl,
D. Wunsch, & A. Zinovyev (Eds.), Principal manifolds for data visualization and dimension reduction
(pp. 68-95). Berlin, Germany: Springer.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society, 67(2), 301–320. doi:10.1111/j.1467-9868.2005.00503.x

Key Terms and Definitions

Principal Components: An orthonormal basis in which the covariance matrix of the data is diagonal.
Principal Manifold: Intuitively, a smooth manifold going through the middle of a data cloud; for-
mally, there exist several definitions for the case of data distributions: 1) Hastie and Stuetzle's principal
manifolds are self-consistent curves and surfaces; 2) Kégl's principal curves provide the minimal mean
squared error given the limited curve length; 3) Tibshirani's principal curves maximize the likelihood
of the additive noise data model; 4) Gorban and Zinovyev's elastic principal manifolds minimize a mean
squared error functional regularized by the addition of the energy of manifold stretching and bending; 5) Smola's
regularized principal manifolds minimize some form of a regularized quantization error functional; and
there are some other definitions.
Principal Graph: A graph embedded in the multidimensional data space, providing the minimal mean
squared distance to the dataset combined with deviation from an “ideal” configuration (for example,
from a pluriharmonic graph) and not exceeding some limits on complexity (in terms of the number of
structural elements and the number of graph grammar transformations needed for obtaining the principal
graph from some minimal graph).
Self-Consistent Approximation: Approximation of a dataset by a set of vectors such that every point
y in the vector set is the conditional mean of all points from the dataset that are projected into y.
Expectation/Maximisation Algorithm: Generic splitting algorithmic scheme with the use of which
almost all algorithms for estimating principal objects are constructed; it consists of two basic steps: 1)
the projection step, at which the data are projected onto the approximator, and 2) the maximization step, at
which the approximator is optimized given the projections obtained at the previous step.
